Get started with Secoda
See why hundreds of industry leaders trust Secoda to unlock their data's full potential.
See why hundreds of industry leaders trust Secoda to unlock their data's full potential.
The REGEXP_SUBSTR function in Snowflake is a robust tool designed for extracting substrings from a string that match a specified regular expression pattern. This function is particularly useful for tasks involving data parsing, cleaning, or transformation, allowing users to manage complex string manipulations effortlessly. If no match is found, the function returns NULL.
The syntax for the REGEXP_SUBSTR function is structured as follows:
REGEXP_SUBSTR ( <subject> , <pattern> [ , <position> [ , <occurrence> [ , <regex_parameters> [ , <group_num> ] ] ] ] )
Here’s a breakdown of the arguments:
.
wildcard also matches newline.e
option is implied.The function returns a VARCHAR value containing the matching substring.
Usage notes:
e
parameter is specified, REGEXP_SUBSTR returns the part of the subject that matches the first group in the pattern.Here are some practical examples demonstrating the usage of REGEXP_SUBSTR:
SELECT id, regexp_substr(string1, '#\\w+') AS "RESULT" FROM social_media_posts ORDER BY id;
This query extracts the first hashtag from the column string1
.
ID RESULT 1 #MondayMotivation 2 #Snowflake 3 NULL
SELECT id, regexp_substr(email_list, '\\b\\w+@\\w+\\.\\w+\\b', 1, 2) AS "RESULT" FROM contact_info ORDER BY id;
This query extracts the second email address from the email_list
column.
ID RESULT 1 jane.doe@example.com 2 john.smith@mydomain.org 3 NULL
SELECT id, regexp_substr(phone_number, '\\((\\d{3})\\)', 1, 1, 'e', 1) AS "RESULT" FROM customer_contacts ORDER BY id;
This query extracts the area code (first group) from the phone numbers in the phone_number
column.
ID RESULT 1 123 2 456 3 789
SELECT id,
regexp_substr(date_string, '(\\d{4})-(\\d{2})-(\\d{2})', 1, 1, 'e', 1) AS "YEAR",
regexp_substr(date_string, '(\\d{4})-(\\d{2})-(\\d{2})', 1, 1, 'e', 2) AS "MONTH",
regexp_substr(date_string, '(\\d{4})-(\\d{2})-(\\d{2})', 1, 1, 'e', 3) AS "DAY"
FROM events;
This query extracts the year, month, and day from a date string in the date_string
column.
ID YEAR MONTH DAY 1 2024 08 06 2 1999 12 31 3 2000 01 01
SELECT date_string,
REGEXP_SUBSTR(date_string, '\\b\\d{4}\\b') AS year
FROM events;
This query extracts the year from a date string in the date_string
column in the events
table.
DATE_STRING YEAR Event on 2023-08-06 2023 Scheduled for 1999/12/31 1999 Happened in 2000-01-01 2000 Planned for 2024.07.04 2024
SELECT email,
REGEXP_SUBSTR(email, '@\\w+\\.\\w+') AS domain
FROM users;
This query extracts the domain part of an email address from the email
column in the users
table.
EMAIL DOMAIN user1@example.com @example.com john.doe@mydomain.org @mydomain.org jane_smith@corporate.net @corporate.net another.user@university.edu @university.edu
The REGEXP_SUBSTR function offers several key features and benefits:
e
parameter and group numbers, REGEXP_SUBSTR can extract sub-matches within a pattern. This is useful for breaking down complex patterns into more manageable parts.Here is a comparison of REGEXP_SUBSTR with other similar functions:
Feature REGEXP_SUBSTR SUBSTR Pattern Matching Supports complex regular expressions Does not support regex Group Extraction Can extract specific groups within a pattern Cannot extract groups Case Sensitivity Supports case-sensitive and case-insensitive matching Case-sensitive only Multi-line Mode Supports multi-line mode Not applicable Starting Position Can specify starting position Can specify start pos. Occurrence Can specify which occurrence to match Not applicable
Feature REGEXP_SUBSTR LIKE Pattern Matching Supports complex regular expressions Simple pattern matching Group Extraction Can extract specific groups within a pattern Cannot extract groups Case Sensitivity Supports case-sensitive and case-insensitive matching Case-sensitive only Multi-line Mode Supports multi-line mode Not applicable Starting Position Can specify starting position Not applicable Occurrence Can specify which occurrence to match Not applicable
The REGEXP_SUBSTR function can be applied in various practical scenarios:
To begin with, let's extract a simple pattern from a string using REGEXP_SUBSTR.
SELECT regexp_substr('The quick brown fox', 'quick') AS result;
This query will extract the substring 'quick' from the given string.
Next, we will specify the starting position and the occurrence of the pattern to match.
SELECT regexp_substr('The quick brown fox jumps over the lazy dog', 'the', 1, 2) AS result;
This query will extract the second occurrence of the pattern 'the' from the string.
We can also use regular expression parameters to modify the matching behavior.
SELECT regexp_substr('The quick brown fox', 'the', 1, 1, 'i') AS result;
This query will perform a case-insensitive match to extract 'The' from the string.
To extract specific groups within a pattern, we can use the e
parameter and group numbers.
SELECT regexp_substr('The quick brown fox', '(quick) (brown)', 1, 1, 'e', 1) AS result1,
regexp_substr('The quick brown fox', '(quick) (brown)', 1, 1, 'e', 2) AS result2;
This query will extract the groups 'quick' and 'brown' from the string.
Finally, let's perform advanced pattern matching to extract complex patterns.
SELECT regexp_substr('The quick brown fox jumps over the lazy dog', '\\b\\w{5}\\b') AS result;
This query will extract words with exactly five letters from the string.
While using REGEXP_SUBSTR, you might encounter some common challenges. Here are a few solutions:
Let's summarize the key takeaways from this tutorial on the REGEXP_SUBSTR function in Snowflake:
By mastering the REGEXP_SUBSTR function, you can efficiently handle complex string manipulations and enhance your data processing tasks in Snowflake.