Sometimes we want to be more specific about where in a word or string we would like to match with our regular expression. We can use anchors to do this by defining the position at which to match based on a word or string boundary.

Regex Word Boundaries

We can look for matches relative to a word boundary using \b. A word boundary is a zero-width position with a word character (a letter, digit or underscore) on one side and either a non-word character or the start or end of the string on the other. We can use this to locate a match relative to the start or end of a word, or to match only when our text is the entire word.

\b Word boundary.

We can also specify a positional match which is explicitly not on a word boundary using \B.

\B Not a word boundary.
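As a rough illustration of the difference between \b and \B, here is a minimal Python sketch using the standard re module. The word list is made up for this example and is not part of the tutorial's text.

```python
import re

# Hypothetical sample words, chosen only to illustrate \b versus \B
words = ["red", "occurred", "redo", "credit"]

# \bred\b: 'red' with a word boundary on both sides (a whole word)
whole_word = [w for w in words if re.search(r"\bred\b", w)]

# \Bred\B: 'red' with a word character on both sides (strictly inside a word)
inside_word = [w for w in words if re.search(r"\Bred\B", w)]

print(whole_word)   # ['red']
print(inside_word)  # ['credit']
```

Neither pattern matches 'occurred' or 'redo', because in those words 'red' touches a boundary on one side only.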

Example

We want to match occurrences of the word ‘red’ but we don’t want to match when the letters ‘red’ appear as part of another word (e.g. occurred).

Text to Search In

Alice wondered whether she should paint the room red. Then it occurred to her that she didn’t have any red paint.

Regular Expression

\bred\b

Without including the word boundaries, we would also match ‘red’ at the end of ‘wondered’ and ‘occurred’.

Output

Alice wondered whether she should paint the room [red]. Then it occurred to her that she didn’t have any [red] paint.

(The two matches are shown in square brackets.)
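The same search can be tried out in code. The following is a small sketch in Python (an assumption, since the tutorial does not tie itself to one language) using the standard re module.

```python
import re

text = ("Alice wondered whether she should paint the room red. "
        "Then it occurred to her that she didn't have any red paint.")

# With word boundaries: only the standalone word 'red'
with_boundaries = re.findall(r"\bred\b", text)

# Without word boundaries: also 'red' inside 'wondered' and 'occurred'
without_boundaries = re.findall(r"red", text)

print(len(with_boundaries))     # 2
print(len(without_boundaries))  # 4
```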

Match at the Start or End of a String

As well as using word boundaries, we can look for a match relative to the start or end of the string we’re searching. This lets us constrain what we match, which is useful in large datasets or where a character pattern is repeated many times but we’re only interested in the occurrence at the start or end of the text, for example when looking for just the first tag in an HTML document. We use the caret (^) to match at the start of a string and the dollar ($) to match at the end.

^ Start of the String

$ End of the String
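To make the two anchors concrete, here is a short Python sketch using the standard re module. The HTML snippet is a made-up sample, not taken from the tutorial.

```python
import re

# Hypothetical one-line HTML document used only for illustration
html = "<html><body><p>first</p><p>second</p></body></html>"

# Without anchors we find every opening tag
all_opening = re.findall(r"<\w+>", html)

# ^ restricts the match to the very start of the string
first_tag = re.search(r"^<\w+>", html).group()

# $ restricts the match to the very end of the string
last_tag = re.search(r"</\w+>$", html).group()

print(all_opening)  # ['<html>', '<body>', '<p>', '<p>']
print(first_tag)    # <html>
print(last_tag)     # </html>
```

One engine-specific detail: in Python, $ also matches just before a single newline at the very end of the string.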

Multiline Mode

Many regex engines allow us to use multiline mode, often activated with an ‘m’ flag. In this mode each line is treated as a separate string for anchoring purposes: ^ and $ match at the beginning and end of every line, not just at the beginning and end of the whole string.

Example

We want to find file names that start with the year 2010 and end in ‘.log’. We have enabled multiline mode.

Text to Search In

20180402.log

20050301.log

20101211.log

20101211.txt

19920101.log

20101211.log.old

Regular Expression

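^2010\w+\.log$

Breakdown

^2010

Look for 2010 at the beginning of the string (line in multiline mode).

\w+

Then at least one word character (a letter, digit or underscore).

\.log$

Expect a literal ‘.log’ at the end of the string (line in multiline mode). The dot is escaped because an unescaped . matches any character, not just a period.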

Output

Only the third file name matches (shown in square brackets):

20180402.log

20050301.log

[20101211.log]

20101211.txt

19920101.log

20101211.log.old
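The log file example can be reproduced with a short Python sketch. It assumes the file names are held in a single string with one name per line, uses the standard re module's MULTILINE flag as the ‘m’ flag, and escapes the dot so that it matches a literal period.

```python
import re

# The file names from the example, one per line
listing = "\n".join([
    "20180402.log",
    "20050301.log",
    "20101211.log",
    "20101211.txt",
    "19920101.log",
    "20101211.log.old",
])

pattern = r"^2010\w+\.log$"

# Multiline mode: ^ and $ anchor at the start and end of every line
multiline_matches = re.findall(pattern, listing, flags=re.MULTILINE)

# Default mode: ^ and $ anchor only at the start and end of the whole string
default_matches = re.findall(pattern, listing)

print(multiline_matches)  # ['20101211.log']
print(default_matches)    # []
```

Without the flag nothing matches, because the string as a whole starts with 2018 and ends in ‘.old’.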