Regex Subexpressions

Lesson

Sometimes we want to split a regex into parts. We can do this with subexpressions, also referred to as groups. Subexpressions allow us to pull out specific sections of text (for example, just the domain name from a website URL) or to look for repetitions of a pattern.

We can specify a group to match with parentheses – (). Whatever is in the parentheses is our subexpression to compare against.

(foobar) Capture group with a subexpression of 'foobar'.

To match a certain number of repetitions of a group, append a quantifier:

(foo){n} Match n repetitions of 'foo'.
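As a minimal sketch of both ideas, here is how they might look in Python. The choice of Python's re module and the sample strings are our own assumptions for illustration; other engines behave similarly for this syntax.

```python
import re

# A capture group: the parentheses make 'foobar' available as group 1.
m = re.search(r"(foobar)", "xx foobar yy")
captured = m.group(1)

# A quantified group: (foo){3} needs exactly three consecutive 'foo's.
three = re.search(r"(foo){3}", "foofoofoo")
two_only = re.search(r"(foo){3}", "foofoo")

print(captured)        # foobar
print(three.group(0))  # foofoofoo
print(two_only)        # None
```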

Example

We want to match IP (version 4) addresses in a hosts file. We won't worry about checking that the octets fall between 0 and 255 though.

Text to Search In

127.0.0.1          localhost

192.168.0.1        desktop

192.168.0.2        server

Regular Expression

(\d{1,3}\.){3}\d{1,3}

(\d{1,3}\.){3} Look for 1 to 3 digits followed by a literal dot, repeated 3 times.

\d{1,3} Look for a final number made up of 1 to 3 digits.

Output

127.0.0.1 localhost

192.168.0.1 desktop

192.168.0.2 server
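To see which part of each line the pattern actually matches, we can run it in Python. This is only a sketch: the variable names are ours, and we assume Python's re module as the engine.

```python
import re

hosts = """127.0.0.1          localhost
192.168.0.1        desktop
192.168.0.2        server"""

ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# finditer + group(0) gives the full match. findall would return only the
# text captured by the repeated group, which is not what we want here.
addresses = [m.group(0) for m in ip_pattern.finditer(hosts)]
print(addresses)  # ['127.0.0.1', '192.168.0.1', '192.168.0.2']
```

Only the address on each line is matched; the host names are left alone.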

Alternation

We can use alternation within a regular expression (or subexpression) to say “match on this OR that”. We use the vertical bar, or pipe symbol, (|) to delineate the two parts of our OR statement. We can also use it to match one subexpression or another.

x|y Match x or y.

Example

We want to match IP (version 4) addresses of the form 10.x.x.x or 192.168.x.x.

Text to Search In

127.0.0.1          localhost
192.168.0.1        desktop
10.10.0.1          server
8.8.8.8            google

Regular Expression

(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})

Output

127.0.0.1 localhost

192.168.0.1 desktop

10.10.0.1 server

8.8.8.8 google
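A short Python sketch (variable names are ours, and Python's re module is an assumed choice of engine) shows which of the four addresses the alternation picks out:

```python
import re

hosts = """127.0.0.1          localhost
192.168.0.1        desktop
10.10.0.1          server
8.8.8.8            google"""

private_pattern = re.compile(
    r"(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})"
)

# group(0) is the whole match, whichever side of the | produced it.
matches = [m.group(0) for m in private_pattern.finditer(hosts)]
print(matches)  # ['192.168.0.1', '10.10.0.1']
```

The 127.0.0.1 and 8.8.8.8 lines contain nothing that satisfies either alternative, so they produce no match.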

Backreferences

We can use backreferences in regular expressions to refer back to a capture group. A backreference looks for a repetition of the actual text that matched the group. For example, we could use one in a spell checker to make sure that an author doesn't accidentally repeat the same word. Capture groups are numbered in order (the first group is 1, the second is 2, and so on).

To use a backreference, we write a backslash followed by the group number. For example, \1 refers to the first capture group.

\n Backreference to group n.
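The repeated-word check mentioned above could be sketched in Python as follows. The pattern and the sample sentence are our own illustration, not part of the lesson, and real spell checkers are likely more involved.

```python
import re

text = "This sentence has has a repeated word."

# \b(\w+)  a whole word, captured as group 1
# \s+      the whitespace between the two words
# \1\b     the same text that group 1 matched, ending on a word boundary
repeated = re.search(r"\b(\w+)\s+\1\b", text)

print(repeated.group(0))  # has has
print(repeated.group(1))  # has
```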

Example

In Markdown syntax, enclosing a phrase in double asterisks or double underscores applies the 'strong' HTML tag. We want a regular expression to highlight instances where this occurs. We can use a backreference to make sure that we don't match cases which start with asterisks and end with underscores or vice versa.

Text to Search In

**Apply Strong** don't apply strong __apply strong __normal text __not correct strong syntax**

Regular Expression

(\*\*|\_\_).+?\1

(\*\*|\_\_) Capture group one: match a double asterisk or a double underscore. The asterisks are escaped because * is otherwise a quantifier.

.+? Match one or more of any character, lazily (?).

\1 Look for another instance of whatever matched against capture group 1.

Result

**Apply Strong** don't apply strong __apply strong __ normal text __not correct strong syntax.**
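The sketch below tries the same idea in Python on a small sample of our own, with one phrase per line (the sample text is an assumption made for illustration). The underscores are left unescaped, since they have no special meaning in a regex.

```python
import re

sample = """**Apply Strong**
don't apply strong
__apply strong__
normal text
__not correct strong syntax**"""

strong_pattern = re.compile(r"(\*\*|__).+?\1")

# The backreference \1 forces the closing marker to equal the opening one.
strong = [m.group(0) for m in strong_pattern.finditer(sample)]
print(strong)  # ['**Apply Strong**', '__apply strong__']
```

The last line opens with underscores and closes with asterisks, so the backreference fails and it is not matched.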

Lookaround

We can more precisely locate the text we are looking for by using 'lookaround'. It lets us look ahead or look behind for a match without actually returning that match. Here we will focus on positive lookahead and positive lookbehind, but there is also a negative syntax which can be used. Check your implementation for what lookaround behaviour is supported.

If we want to match some text only where it is followed by a certain pattern, we use a lookahead: a group that starts with a question mark (?) and an equals sign (=), followed by the subexpression to look ahead for, as in (?=xyz).

foo(?=bar) Match on 'foo', where it is followed by 'bar'.

If we want to match something only where it follows a certain pattern, we use a lookbehind: we check the characters that come before the text we want to match and return. We write this as a group that starts with a question mark (?), a less-than sign (<) and an equals sign (=), followed by the subexpression to look behind for, as in (?<=xyz).

(?<=foo)bar Match on 'bar', where it is preceded by 'foo'.
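A small Python sketch makes the difference visible by recording where each match starts. The sample string is made up, and Python's re module is an assumed choice of engine.

```python
import re

text = "foobar foobaz bazbar"

# Lookahead: 'foo' only where 'bar' follows; 'bar' is not part of the match.
ahead = [(m.start(), m.group(0)) for m in re.finditer(r"foo(?=bar)", text)]

# Lookbehind: 'bar' only where 'foo' comes immediately before it.
behind = [(m.start(), m.group(0)) for m in re.finditer(r"(?<=foo)bar", text)]

print(ahead)   # [(0, 'foo')]
print(behind)  # [(3, 'bar')]
```

In both cases the text inside the lookaround is checked but never returned.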

Example

We want to return the comments from a C program. They have been written using the double forward-slash (//) notation. We don't want to include the slashes at the start in our result though.

Text to Search In

this is the start of the code
// This is a comment
this is some code
this is some more code

Regular Expression

(?<=// ).+

Note that in some implementations forward slash has special meaning and will need to be escaped with a backslash.

Output

this is the start of the code

// This is a comment

this is some code

this is some more code
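Run in Python (an assumed choice of engine, where the forward slash needs no escaping), the pattern returns only the comment text:

```python
import re

source = """this is the start of the code
// This is a comment
this is some code
this is some more code"""

# The lookbehind requires '// ' before the match but leaves it out of the result.
comments = re.findall(r"(?<=// ).+", source)
print(comments)  # ['This is a comment']
```

Note that Python's re module only accepts lookbehind patterns of a fixed width, which '// ' satisfies.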

Non-Capturing Groups

Often the data that we are interested in will be a subset of what we want to match on. We can include groups that form part of the pattern to match but are not returned as their own group. We do this by starting the group with ?: followed by the subexpression to look for.

(?:foo) Match on 'foo' but don't return it as a group.

Example

We have a list of domains which include 'www.' at the start, and we just want to return the domain without the 'www.'.

Text to Search In

www.website.com

Regular Expression

(?:www\.)(.+)

Output

Group 1: website.com

Full Match: www.website.com
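The same result can be reproduced with a short Python sketch (Python's re module is our assumed engine here):

```python
import re

m = re.search(r"(?:www\.)(.+)", "www.website.com")

full_match = m.group(0)  # everything the pattern matched
domain = m.group(1)      # the only capture group; (?:...) is not numbered

print(full_match)  # www.website.com
print(domain)      # website.com
print(m.groups())  # ('website.com',)
```

Because the 'www.' group is non-capturing, the domain is group 1 rather than group 2.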

