Regex Subexpressions
Lesson
Sometimes we want to split a regex into parts. We can do this with subexpressions, also referred to as groups. Subexpressions allow us to pull out specific sections of text (for example, just the domain name from a website URL) or look for repetitions of a pattern.
We can specify a group to match with parentheses: (). Whatever is in the parentheses is our subexpression to compare against.
(foobar)
Capture group with a subexpression of 'foobar'.
To match a certain number of repetitions of a group, append a quantifier:
(foo){n}
Match n repetitions of 'foo'.
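As a minimal sketch of a quantified group, here is how this could look with Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# (foo){2} requires two consecutive repetitions of the group 'foo'.
pattern = re.compile(r"(foo){2}")

match = pattern.search("xfoofoofoo")
full_match = match.group(0)  # the whole text matched by the pattern
last_group = match.group(1)  # a repeated group keeps only its last repetition

# Only one 'foo' here, so the pattern does not match at all.
single = pattern.search("foo bar")

print(full_match, last_group, single)
```

Note that the full match and the captured group are different things: the group only remembers the text from its final repetition.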
Example
We want to match IP (version 4) addresses in a hosts file. We won't worry about checking that the octets fall between 0 and 255 though.
Text to Search In
127.0.0.1 localhost
192.168.0.1 desktop
192.168.0.2 server
Regular Expression
(\d{1,3}\.){3}\d{1,3}
(\d{1,3}\.){3} Look for 1 to 3 digits followed by a literal dot, repeated 3 times.
\d{1,3} Look for a final number made up of 1 to 3 digits.
Output
127.0.0.1 localhost
192.168.0.1 desktop
192.168.0.2 server
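The hosts file example can be tried out with a short Python sketch. The variable names are illustrative, and the text is the sample from this example.

```python
import re

hosts = """127.0.0.1 localhost
192.168.0.1 desktop
192.168.0.2 server"""

# Three "digits plus dot" groups, then a final run of digits.
ipv4 = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# finditer with group(0) returns the whole match. re.findall would
# return only the contents of the capture group instead.
addresses = [m.group(0) for m in ipv4.finditer(hosts)]

print(addresses)
```

As in the lesson, this does not check that each octet falls between 0 and 255.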
Alternation
We can use alternation within a regular expression (or subexpression) to say “match on this OR that”. We use the vertical bar, or pipe symbol (|), to separate the two parts of our OR statement. We can also use it to match one subexpression or another.
x|y
Match x or y.
Example
We want to match IP (version 4) addresses which match with 10.x.x.x or 192.168.x.x.
Text to Search In
127.0.0.1 localhost
192.168.0.1 desktop
10.10.0.1 server
8.8.8.8 google
Regular Expression
(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})
Output
127.0.0.1 localhost (no match)
192.168.0.1 desktop (match: 192.168.0.1)
10.10.0.1 server (match: 10.10.0.1)
8.8.8.8 google (no match)
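To check which addresses the alternation picks out, we could run the expression in Python. This is a sketch using the sample text above, with illustrative variable names.

```python
import re

hosts = """127.0.0.1 localhost
192.168.0.1 desktop
10.10.0.1 server
8.8.8.8 google"""

# Left side matches 10.x.x.x, right side matches 192.168.x.x.
private = re.compile(
    r"(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})"
)

matches = [m.group(0) for m in private.finditer(hosts)]

print(matches)
```

Because the pattern is not anchored, it would also match inside a longer number (for example the 10.1.2.3 inside 110.1.2.3). Adding a word boundary (\b) at the start is one way to guard against that.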
Backreferences
We can use backreferences in regular expressions to refer back to a capture group. Backreferences look for a repetition of the actual text that matched the group, not the pattern inside it. For example, we could use one in a spell checker to make sure that an author doesn't accidentally repeat the same word. Capture groups are numbered in the order of their opening parentheses (the first group is 1, the second is 2, and so on).
To use a backreference, we use a backslash followed by the group number, e.g. \1 would refer to the first capture group.
\n
Backreference to group n.
Example
In Markdown syntax, enclosing a phrase in double asterisks or double underscores applies the 'strong' HTML tag. We want a regular expression to highlight instances where this occurs. We can use a backreference to make sure that we don't match cases which start with asterisks and end with underscores or vice versa.
Text to Search In
**Apply Strong** don't apply strong __apply strong __normal text __not correct strong syntax**
Regular expression
(\*\*|\_\_).+?\1
(\*\*|\_\_) Capture group one: match a double asterisk or double underscore.
.+? Match one or more of any character, lazily (?).
\1 Look for another instance of whatever matched against capture group 1.
Result
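**Apply Strong** don't apply strong __apply strong __normal text __not correct strong syntax**
Two matches: '**Apply Strong**' and '__apply strong __'. The trailing '__not correct strong syntax**' is not matched because its opening and closing markers differ.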
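The backreference example can be sketched in Python as follows. The variable names are illustrative, and the text is the sample from this example.

```python
import re

text = ("**Apply Strong** don't apply strong __apply strong __"
        "normal text __not correct strong syntax**")

# \1 must repeat whatever text group 1 captured: ** closes **, __ closes __.
strong = re.compile(r"(\*\*|\_\_).+?\1")

matches = [m.group(0) for m in strong.finditer(text)]

print(matches)
```

The final phrase opens with underscores and closes with asterisks, so the backreference fails there and nothing is returned for it.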
Lookaround
We can more precisely locate the text we are looking for by using 'lookaround'. It lets us look ahead or look behind for a match without actually returning that match. Here we will focus on positive lookahead and positive lookbehind, but there is also a negative syntax which can be used. Check your implementation for what lookaround behaviour is supported.
If we want to match something only when it is followed by some other text, we use a lookahead: a group that starts with a question mark (?) and an equals sign (=), followed by the subexpression to look ahead for, as in (?=xyz). Despite the parentheses, a lookaround group does not capture anything.
foo(?=bar)
Match on 'foo', where it is followed by 'bar'
If we want to match on something that follows a pattern, then we use a lookbehind: we check the characters that come before the text we want to match and return. We specify this with a group that starts with a question mark (?), a less-than sign (<) and an equals sign (=), followed by the subexpression to look behind for, as in (?<=xyz).
(?<=foo)bar
Match on 'bar', where it is preceded by 'foo'.
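A short Python sketch shows that the lookaround text is checked but not returned. The input strings and variable names here are made up for illustration.

```python
import re

# Lookahead: match 'foo' only when 'bar' follows, without consuming 'bar'.
ahead = re.search(r"foo(?=bar)", "foobar")
ahead_text = ahead.group(0)
ahead_span = ahead.span()

# Lookbehind: match 'bar' only when 'foo' comes immediately before it.
behind = re.search(r"(?<=foo)bar", "foobar")
behind_text = behind.group(0)
behind_span = behind.span()

# 'foo' followed by something else does not match.
no_match = re.search(r"foo(?=bar)", "foobaz")

print(ahead_text, ahead_span, behind_text, behind_span, no_match)
```

The spans confirm that each match covers only 'foo' or only 'bar', even though the neighbouring text had to be present.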
Example
We want to return the comments from a C program. They have been written using the double forward-slash (//) notation. We don't want to include the slashes at the start in our result though.
Text to Search In
this is the start of the code
// This is a comment
this is some code
this is some more code
Regular Expression
(?<=// ).+
Note that in some implementations forward slash has special meaning and will need to be escaped with a backslash.
Output
this is the start of the code (no match)
// This is a comment (match: This is a comment)
this is some code (no match)
this is some more code (no match)
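The comment example could be run in Python like this. It is a sketch with illustrative variable names. Python's re module only accepts fixed-width lookbehinds, which '// ' satisfies.

```python
import re

source = """this is the start of the code
// This is a comment
this is some code
this is some more code"""

# Match the rest of the line, but only where '// ' comes just before it.
comments = re.findall(r"(?<=// ).+", source)

print(comments)
```

The slashes and the space are required for the match but are left out of the returned text.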
Non-Capturing Groups
Often the data that we are interested in will be a subset of what we want to match on. We can include groups which we want to use as part of the pattern to match but which we don't want to be returned as their own group. We do this by starting the group with ?: followed by the subexpression to look for.
(?:foo)
Match on 'foo' but don't return it as a group.
Example
We have a list of domains which include 'www.' at the start, and we just want to return the domain without the 'www.'.
Text to Search In
www.website.com
Regular Expression
(?:www\.)(.+)
Output
Group 1: website.com
Full Match: www.website.com
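In Python, the difference between the full match and the numbered groups can be seen with a sketch like the following. The variable names are illustrative.

```python
import re

# (?:www\.) takes part in the match but is not numbered as a group,
# so (.+) becomes group 1.
pattern = re.compile(r"(?:www\.)(.+)")

m = pattern.search("www.website.com")
full_match = m.group(0)
domain = m.group(1)
all_groups = m.groups()

print(full_match, domain, all_groups)
```

Only one group is reported, because the non-capturing group is skipped when groups are numbered.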