Sometimes we want to split our regex up. We can do this with subexpressions, also referred to as groups. This might be so that we can pull out specific sections of text (for example, just the domain name from a website URL) or because we’re looking for repetitions of a certain subexpression. We specify a group with parentheses: (). Whatever is inside the parentheses is the subexpression to match on.

(foobar) Capture group with a subexpression of ‘foobar’.

To match a certain number of repetitions of a group, simply append a quantifier:

(foo){n} Match n repetitions of ‘foo’
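
As a rough illustration of both uses of a group, here is a minimal Python sketch using the standard re module. The URL and the pattern used to pull out the domain are made up for this example and are not a robust URL parser.

```python
import re

# Hypothetical URL, used only for illustration.
url = "https://www.example.com/index.html"

# The parentheses capture everything between "://" and the next slash.
match = re.search(r"https?://([^/]+)", url)
domain = match.group(1)  # text matched by the first capture group
print(domain)

# A quantifier after a group repeats the whole subexpression.
repeated = re.fullmatch(r"(foo){3}", "foofoofoo")  # three repetitions: matches
too_few = re.fullmatch(r"(foo){3}", "foofoo")      # two repetitions: no match
print(repeated is not None, too_few is None)
```

Here group(1) returns only the captured part, while group(0) would return the whole match including the scheme.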

Example

We want to match IP (version 4) addresses in a hosts file. We won’t worry about checking that the octets fall between 0 and 255 though.

Text to Search In

127.0.0.1          localhost

192.168.0.1        desktop

192.168.0.2        server

Regular Expression

(\d{1,3}\.){3}\d{1,3}

(\d{1,3}\.){3} Look for 1 to 3 digits followed by a literal dot, repeated 3 times.

\d{1,3} Look for a final number made up of 1 to 3 digits.

Output

Only the addresses are matched, not the host names:

127.0.0.1

192.168.0.1

192.168.0.2
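
To try this out, here is a small Python sketch that runs the same expression over the sample hosts text. The variable names are our own, and Python is just one possible implementation to test with.

```python
import re

hosts = """127.0.0.1          localhost
192.168.0.1        desktop
192.168.0.2        server"""

ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# group(0) is the full match. We use finditer here because findall would
# return only the text of the capture group (its last repetition).
addresses = [m.group(0) for m in ip_pattern.finditer(hosts)]
print(addresses)
```

As noted above, the pattern does not check the range of each octet, so something like 999.999.999.999 would also match.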

Alternation

We can use alternation within a regular expression (or subexpression) to say “match on this OR that”. We use the vertical bar, or pipe symbol, (|) to delineate the two parts of our OR statement. We can also use it to match one subexpression or another.

x|y Match x or y.

Example

We want to match IP (version 4) addresses of the form 10.x.x.x or 192.168.x.x.

Text to Search In

127.0.0.1          localhost
192.168.0.1        desktop
10.10.0.1          server
8.8.8.8            google

Regular Expression

(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})

Output

Only the two addresses in the requested ranges are matched:

192.168.0.1

10.10.0.1
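
The alternation can be checked with a short Python sketch like the one below. The variable names are assumptions for this example.

```python
import re

hosts = """127.0.0.1          localhost
192.168.0.1        desktop
10.10.0.1          server
8.8.8.8            google"""

# Two alternatives separated by the pipe: 10.x.x.x OR 192.168.x.x
private_pattern = re.compile(
    r"(10\.(\d{1,3}\.){2}\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3})"
)

matches = [m.group(0) for m in private_pattern.finditer(hosts)]
print(matches)
```

The pattern is not anchored, so it could also match inside a longer number such as 110.1.2.3. Adding a word boundary (\b) at each end is one way to tighten it.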

Backreferences

We can use backreferences in regular expressions to refer back to a capture group. This lets us look for a repetition of the actual text that matched the group. For example, we could use it in a spell checker to make sure that an author doesn’t accidentally repeat the same word. Capture groups are numbered in order (the first group is 1, the second is 2, and so on).

To use a backreference we use a backslash followed by the group number, e.g. \1 refers to the first capture group.

\n Backreference to group n.

Example

In markdown, enclosing a phrase in double asterisks or double underscores applies the ‘strong’ HTML tag. We want a regular expression to highlight instances where this occurs. We can use a backreference to make sure that we don’t match instances which start with asterisks and end with underscores or vice versa.

Text to Search In

**Apply Strong** don’t apply strong __apply strong __normal text __not correct strong syntax**

Regular expression

(\*\*|\_\_).+?\1

(\*\*|\_\_) Capture group one: match a double asterisk or a double underscore. The asterisks are escaped because * is otherwise a quantifier.

.+? Match one or more of any character, lazily (?).

\1 Look for another instance of whatever matched against capture group 1.

Result

The expression matches ‘**Apply Strong**’ and ‘__apply strong __’. The final phrase, which opens with double underscores and closes with double asterisks, is not matched.
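
Here is a minimal Python sketch of the same idea. The sample string is a simplified one made up for this example, and the underscore is left unescaped since it has no special meaning in Python’s re module.

```python
import re

# Made-up sample: two well-formed strong phrases and one mismatched pair.
text = "**bold** plain __also bold__ plain __mismatched**"

# \1 must repeat whatever group 1 matched: ** closes **, __ closes __.
strong_pattern = re.compile(r"(\*\*|__).+?\1")

strong_phrases = [m.group(0) for m in strong_pattern.finditer(text)]
print(strong_phrases)
```

The last phrase is skipped because the backreference demands the same delimiter that opened the phrase.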

Lookaround

We can use regex lookaround functionality to more precisely locate the text we are looking for. It lets us look ahead or look behind for a match without actually returning that match. Here we will focus on positive lookahead and positive lookbehind, but there is also a negative syntax which can be used. Check your implementation for what lookaround behaviour is supported.

If we want to match something only where it is followed by some other text, we use a lookahead: a group that starts with a question mark (?) and equals (=), followed by the subexpression to look ahead for, e.g. (?=xyz). Unlike a normal group, it does not capture anything.

foo(?=bar) Match on ‘foo’ where it is followed by ‘bar’

If we want to match something only where it follows a pattern, we use a lookbehind: we check the characters that come before the text we want to match and return. We specify this with a group that starts with a question mark (?), less than (<) and equals (=), followed by the subexpression to look behind for, e.g. (?<=xyz).

(?<=foo)bar Match on bar where it is preceded by foo.
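
The two patterns above can be tried with a short Python sketch like this one. The sample text is made up, and we record the start position of each match to show that the looked-for text is not part of the result.

```python
import re

text = "foobar foobaz bazbar"

# Lookahead: match 'foo' only where 'bar' follows; 'bar' is not returned.
ahead = [(m.start(), m.group(0)) for m in re.finditer(r"foo(?=bar)", text)]

# Lookbehind: match 'bar' only where 'foo' comes immediately before it.
behind = [(m.start(), m.group(0)) for m in re.finditer(r"(?<=foo)bar", text)]

print(ahead)
print(behind)
```

The ‘foo’ in ‘foobaz’ and the ‘bar’ in ‘bazbar’ are both skipped because the lookaround condition fails.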

Example

We want to return the comments from a C program which have been included using the double forward slash (//) notation but without including the slashes at the start.

Text to Search In

// This is a comment

this is some code

Regular Expression

(?<=// ).+

Note that in some implementations the forward slash has special meaning (for example, as the pattern delimiter) and will need to be escaped with a backslash.

Output

This is a comment
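
In Python this could look like the sketch below. The forward slash needs no escaping there, and the lookbehind is acceptable to Python’s re module because ‘// ’ has a fixed width.

```python
import re

source = """// This is a comment
this is some code"""

# The lookbehind requires '// ' before the match but leaves it out of the result.
comments = re.findall(r"(?<=// ).+", source)
print(comments)
```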

Non-Capturing Groups

Often the data that we are interested in will just be a subset of what we want to match on. We can include groups which we want to use as part of the pattern to match but which we don’t want returned as their own group. We do this by using a group which starts with ?: and is followed by the subexpression to look for.

(?:foo) Match on ‘foo’ but don’t return it as a group.

Example

We have a list of domains which include www. at the start and we just want the second part.

Text to Search In

www.website.com

Regular Expression

(?:www\.)(.+)

Output

Group 1: website.com

Full Match: www.website.com
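
A minimal Python sketch of this example follows. The variable names are our own.

```python
import re

match = re.search(r"(?:www\.)(.+)", "www.website.com")

full_match = match.group(0)  # everything the pattern matched
group_one = match.group(1)   # the only capture group; (?:...) is not numbered
all_groups = match.groups()

print(full_match, group_one, all_groups)
```

Because the first group is non-capturing, the domain is available as group 1 rather than group 2.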