Regular expressions, or how to find many needles in a haystack

January 15, 2022

One of the main advantages of storing documents in digital form is the ease of searching them for particular words and phrases. If you had the paper version of the book A Game of Thrones, and you wanted to find the first time the character name “Daenerys” appeared, it would be like looking for the proverbial needle in a haystack. But in the digital version? Simply use the Find function. Replacing text is just as easy: using a global Replace command, you could, for example, change all instances of “Daenerys” to the alternative spelling “Denerys”.

However, the Find and Replace functions do not enable you to seek out simple patterns that a human would easily spot with the naked eye. A practised reader need only glance at a page of text to identify all instances of dates, postal addresses or e-mail addresses. But how to do this using the Find function? Unfortunately, it is by no means easy!

The task of seeking or replacing various text fragments that have a regular structure can be carried out using a mechanism known as regular expressions.

What is a regular expression?

Put simply, a regular expression is a pattern representing the set of all text strings that match the pattern. For a given regular expression, a suitably constructed computer mechanism is able to search a text and locate all strings in it that match the expression.

A regular expression may be any character, which (if it is not a special character – see below) represents only itself. For example, if a regular expression consists simply of the letter “a”, the expression handling mechanism will search the text for the first instance of that letter (for example: Many years ago…).
A regular expression may be a sequence of characters, which again represents only itself. If the expression is the sequence “ear”, the mechanism will search the text for that exact sequence (for example: Many years ago…).
The special character * in a regular expression represents zero or more occurrences of the character preceding it: that character may be repeated any number of times, appear just once, or be absent altogether. For example, the regular expression “hur*ah*” is matched by hua, huah, hura, hurah, hurrah, hurrahh and so on.
Parts of regular expressions can be grouped using parentheses. For example, the expression “hu(rah)*” is matched by hu, hurah, hurahrah and so on.
The special character | in a regular expression represents alternatives. For example, the regular expression “Kowalsk(i|a)” is matched by both Kowalski and Kowalska.
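
The rules above can be tried out directly. Below is a minimal sketch in Python, assuming the standard re module plays the role of the "expression handling mechanism"; the helper name matches is made up for this illustration and checks whether a whole word matches a pattern.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# A literal sequence is found at its first occurrence in the text.
first = re.search("ear", "Many years ago")
print(first.start(), first.group())  # 6 ear

# * allows zero or more repetitions of the preceding character.
star_results = [matches("hur*ah*", w) for w in ["hua", "huah", "hura", "hurrah", "hu"]]
print(star_results)  # [True, True, True, True, False]

# Parentheses make * apply to the whole group "rah".
group_results = [matches("hu(rah)*", w) for w in ["hu", "hurah", "hurahrah", "hura"]]
print(group_results)  # [True, True, True, False]

# | offers alternatives inside the group.
alt_results = [matches("Kowalsk(i|a)", w) for w in ["Kowalski", "Kowalska", "Kowalsky"]]
print(alt_results)  # [True, True, False]
```

Note that re.search looks for the pattern anywhere in the text, whereas re.fullmatch requires the entire string to match, which is the more convenient choice when testing single words.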

Errors

In practice, the program seeking strings that match a given regular expression often returns not quite the results that were expected. How is this possible? No doubt the creator of the regular expression made a mistake – and the most common reason for errors is misunderstanding of the priorities of operations.

Priorities of operations

The order of priority of operations in regular expressions is as follows:

grouping with parentheses: ()
the special character *
concatenation of characters (making a sequence)
the special character |

The regular expression “(ha)*” is matched by ha, haha, hahaha, etc., because we first group the characters h and a by means of parentheses, and afterwards apply repetition by means of the asterisk. The expression “ha*”, however, is matched by h, ha, haa, haaa, etc., because the operation of repetition is first applied to the character a, and afterwards the result is combined with the remainder of the character sequence.

The regular expression “Kowalski|a” is matched by Kowalski and by a (but not by Kowalska), since concatenation of characters has a higher priority than the specification of an alternative.
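
The effect of these priorities can be checked with a short Python sketch (again assuming the standard re module; the helper full is a hypothetical name used only here):

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# Grouping first, then repetition: the pair "ha" is repeated.
grouped = full("(ha)*", "hahaha")
# Repetition binds tighter than concatenation: only "a" is repeated.
ungrouped_ok = full("ha*", "haaa")
ungrouped_fail = full("ha*", "haha")
# Concatenation binds tighter than |: the alternatives are "Kowalski" and "a".
alt_kowalska = full("Kowalski|a", "Kowalska")
alt_a = full("Kowalski|a", "a")

print(grouped, ungrouped_ok, ungrouped_fail)  # True True False
print(alt_kowalska, alt_a)  # False True
```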

Equivalence of regular expressions

Different regular expressions may represent the same set of strings. For example, the expression “cat|dog” is equivalent to “dog|cat”, and “(cat)*” is equivalent to “((cat)*)*”. This property of regular expressions would appear to be highly advantageous – it means that a search for a particular text pattern can be expressed in different ways (just as the correct answer to a mathematical exercise can be reached in a variety of ways). The problem is that the creator of a regular expression may believe that two forms are equivalent when in fact they are not. For example, the expression “Kowalski|a” is not equivalent to “Kowalski|Kowalska”.
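
One rough way to catch a mistaken belief in equivalence is to run both expressions over a list of sample strings and see where they disagree. The sketch below assumes Python's re module; the function name differing_samples and the sample list are invented for this illustration. Agreement on a handful of samples does not prove equivalence, but a single disagreement disproves it.

```python
import re


def differing_samples(pattern_a: str, pattern_b: str, samples: list[str]) -> list[str]:
    """Return the samples that exactly one of the two patterns matches in full."""
    return [
        s for s in samples
        if bool(re.fullmatch(pattern_a, s)) != bool(re.fullmatch(pattern_b, s))
    ]


samples = ["", "cat", "dog", "catcat", "Kowalski", "Kowalska", "a"]

same_1 = differing_samples("cat|dog", "dog|cat", samples)
same_2 = differing_samples("(cat)*", "((cat)*)*", samples)
different = differing_samples("Kowalski|a", "Kowalski|Kowalska", samples)

print(same_1)     # []
print(same_2)     # []
print(different)  # ['Kowalska', 'a']
```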

Helpful functions and extensions

The operations listed above make up the basic (mathematical) definition of regular expressions. However, computer scientists have gone a bit further than mathematicians, introducing a whole series of additional operations. The most popular of these are character classes, the dot character, quantifiers, and anchors.

Character classes

Character classes, written in square brackets, are another way of representing alternatives. For example, the expression “[AEO]la” is matched by Ala, Ela and Ola. A hyphen inside the brackets denotes a range: the expression “[A-Z]la” is matched by any upper-case letter from A to Z followed by la, so also by Bla, Cla, Dla, etc.

It is also possible to define negative classes. The expression “[^ds]” (with the caret symbol after the opening bracket) matches any single character except d and s. Thus, the regular expression “[^ds]ay”, given the text:

They say today is a day away.

will match only the string way, appearing as part of the word away.

Character classes can also be written using certain abbreviations; for example, “\d” denotes any digit, “\w” denotes any word character (a letter, a digit or the underscore), and “\s” denotes any whitespace character (such as a space, tab or newline).
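
The character class examples can be reproduced in a few lines of Python. This is a sketch assuming the standard re module; the sample strings other than the sentence quoted above are made up. Note that in Python “\w” matches digits and the underscore as well as letters.

```python
import re

text = "They say today is a day away."

# Negative class: any character except d or s, followed by "ay".
negated = re.findall(r"[^ds]ay", text)
print(negated)  # ['way']

# A class listing three allowed first letters.
names = re.findall(r"[AEO]la", "Ala, Ela, Ola, Ula")
print(names)  # ['Ala', 'Ela', 'Ola']

# Abbreviated classes.
digits = re.findall(r"\d", "Room 42")
word_chars = re.findall(r"\w", "a_1 !")
print(digits)      # ['4', '2']
print(word_chars)  # ['a', '_', '1']
```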

The dot character

The dot (period, full stop) is a special character. When used in a regular expression it will be matched by any character at all. So the expression “c.t” will be matched by cat, cot and cut, but also by c5t, for instance.

If you want a regular expression to represent a character that would normally be a special character (such as a full stop), you must place the symbol \ before that character. For example, the regular expression “ai\.POLENG\.pl” may be used to find the e-mail address of an employee of our company.
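
The difference between a plain dot and an escaped dot is easy to see in a small Python sketch (assuming the standard re module; the test strings are invented). The function re.escape adds the backslashes for you when a literal string is to be searched for.

```python
import re

words = ["cat", "cot", "cut", "c5t", "ct", "coat"]

# The dot stands for exactly one character of any kind.
dot_matches = [w for w in words if re.fullmatch(r"c.t", w)]
print(dot_matches)  # ['cat', 'cot', 'cut', 'c5t']

# Unescaped dots accept any character in their place...
loose = bool(re.fullmatch(r"ai.POLENG.pl", "aiXPOLENGXpl"))
# ...whereas escaped dots accept only a full stop.
strict_wrong = bool(re.fullmatch(r"ai\.POLENG\.pl", "aiXPOLENGXpl"))
strict_right = bool(re.fullmatch(r"ai\.POLENG\.pl", "ai.POLENG.pl"))
print(loose, strict_wrong, strict_right)  # True False True

# re.escape builds the escaped pattern from a literal string.
escaped = re.escape("ai.POLENG.pl")
print(escaped)  # ai\.POLENG\.pl
```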

Quantifiers

Quantifiers are used to search for repetitions. One of them is the asterisk special character that was mentioned earlier.

The special character + stands for at least one repetition. For example, “hur+ah+” is matched by hurah, hurrah, hurahh, etc., but not by hua, hura or huah.

The special character ? denotes zero or one instance. The expression “Mr\.?” matches both “Mr.” with a period, and “Mr” without one.

Quantifiers can also specify the number of repetitions. The regular expression “\d{4}” matches exactly four digits, while “\w{5,10}” is matched by a string of five to ten word characters (letters, digits or underscores).
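
Here is a short Python sketch of the quantifiers at work, assuming the standard re module; the sample sentences are made up for the purpose.

```python
import re

# + requires at least one repetition.
plus_results = [bool(re.fullmatch(r"hur+ah+", w)) for w in ["hurah", "hurrah", "hura", "huah"]]
print(plus_results)  # [True, True, False, False]

# ? makes the preceding character optional.
titles = re.findall(r"Mr\.?", "Mr Smith met Mr. Jones")
print(titles)  # ['Mr', 'Mr.']

# {4} demands exactly four repetitions.
years = re.findall(r"\d{4}", "Founded in 1998, moved in 2022")
print(years)  # ['1998', '2022']

# {5,10} demands between five and ten repetitions.
length_results = [bool(re.fullmatch(r"\w{5,10}", w)) for w in ["needle", "hay"]]
print(length_results)  # [True, False]
```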

Anchors

Anchors specify the place in the text where the string is to occur.

The special character ^ indicates that the string must appear at the very start of the text.

The special character $ indicates that the matched string must appear at the end of the text.

The regular expression “^\d+$” thus matches only texts consisting entirely of digits.

The anchor \b denotes a word boundary. The expression “\bcat\b” will not be matched to any string in the text We have two cats, because there is no word boundary immediately following the string cat.
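
The anchors can be tested as follows. This is a sketch assuming Python's re module, with sample strings of my own choosing.

```python
import re

# ^ and $ together force the whole text to consist of digits.
all_digits = bool(re.search(r"^\d+$", "20220115"))
not_all_digits = bool(re.search(r"^\d+$", "2022-01-15"))
print(all_digits, not_all_digits)  # True False

# \b requires a word boundary on both sides of "cat".
in_cats = re.search(r"\bcat\b", "We have two cats")
in_cat = re.search(r"\bcat\b", "We have one cat.")
print(in_cats)         # None
print(in_cat.group())  # cat
```

One detail specific to Python: by default $ also matches just before a newline that ends the text, so “^\d+$” accepts a string of digits followed by a single trailing newline.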

Flags

The possibilities of text searching using regular expressions can be expanded with the use of flags. For example, the flag i means that upper and lower case letters are to be treated as identical when matching strings to the pattern, and the flag g means that the expression handling mechanism is to find all strings matching the pattern, not only the first of them.

A flag is not part of a regular expression. The information that a flag is to be applied is supplied separately by the user of the mechanism (in a manner that depends on the particular solution being used).
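
How flags are supplied does indeed vary from tool to tool. As one illustration, Python's re module takes the case-insensitivity flag as a separate argument (re.IGNORECASE) and has no g flag at all: you choose between re.search, which returns the first match, and re.findall, which returns all of them. A minimal sketch with an invented sample text:

```python
import re

text = "Cat, cat, CAT"

# Without flags: only the lower-case spelling matches.
first_only = re.search(r"cat", text).group()
case_sensitive = re.findall(r"cat", text)
print(first_only)      # cat
print(case_sensitive)  # ['cat']

# With the case-insensitive flag, findall returns every spelling.
case_insensitive = re.findall(r"cat", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['Cat', 'cat', 'CAT']
```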

Examples of popular regular expressions

([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))

The above regular expression matches dates written in the common year-month-day format (for example 2022-01-15), for years from 1000 to 2999. It requires that the date begin with the digit 1 or 2 ([12]), followed by any three digits (\d{3}). The next character is a hyphen (-), and after that comes the month: either a two-digit string beginning with zero (but not 00), or one of 10, 11, 12 (0[1-9]|1[0-2]). After another hyphen (-) comes the day: this consists of two digits, of which either the first is zero and the second is from 1 to 9 (0[1-9]), or the first is 1 or 2 and the second may be any digit ([12]\d), or else the first is 3 and the second is 0 or 1 (3[01]).
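
The date expression can be tried out in Python as sketched below (standard re module assumed; the sample sentence is invented). The example also shows a limitation: the pattern checks only the shape of the date, so an impossible date such as 31 February is still accepted.

```python
import re

DATE = re.compile(r"([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))")

text = "Published 2022-01-15, revised 2022-13-01 and 2022-02-31."

# finditer is used because the pattern contains groups;
# group(1) is the outermost group, i.e. the whole date.
found = [m.group(1) for m in DATE.finditer(text)]
print(found)  # ['2022-01-15', '2022-02-31']
```

The string 2022-13-01 is rejected because 13 is not a valid month, while 2022-02-31 passes because the pattern allows day 31 in any month.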

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b

The above expression, when used with the flag i, matches most e-mail addresses. It requires the address to begin at a word boundary with at least one letter, digit or other permitted character (\b[A-Z0-9._%+-]+), after which comes the @ character, and then at least one string ending in a dot ([A-Z0-9.-]+\.), followed by a domain code consisting of at least two letters ([A-Z]{2,}).
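
A quick Python sketch shows why the flag i matters for this expression. It assumes the standard re module, where the flag is spelled re.IGNORECASE; the addresses are made-up examples.

```python
import re

PATTERN = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

text = "Write to jan.kowalski@example.com or ANNA@EXAMPLE.ORG today."

# With the case-insensitive flag, both addresses are found.
with_flag = re.findall(PATTERN, text, flags=re.IGNORECASE)
print(with_flag)  # ['jan.kowalski@example.com', 'ANNA@EXAMPLE.ORG']

# Without it, the pattern accepts upper-case letters only.
without_flag = re.findall(PATTERN, text)
print(without_flag)  # ['ANNA@EXAMPLE.ORG']
```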

How to test regular expressions

There are many online tools you can use to check whether a regular expression you have constructed “works” exactly as it should. One of them is available at regex101.com.

Write your regular expression in the edit window, and in the text field insert the text that is to be searched. The tool will highlight the first occurrence in that text of a string that matches the regular expression being tested.

Notice that for the e-mail expression above the search mechanism must apply the flag i (which is enabled by clicking at the right-hand side of the edit window). If that flag is disabled, an e-mail address written in lower case will not be found, because the expression itself lists only upper-case letters.

Summary

One of the main reasons why more and more texts are stored digitally is the ease of searching for information. Although combing the haystack for a single needle (a specific string) is a relatively simple task, and is thus available in virtually any text editor, finding multiple needles (all strings of a specified type) is far less trivial.

Regular expressions are currently the most popular mechanism for searching texts for strings of a specified type. It is therefore well worth becoming familiar with them, or at least learning the basics, as you have done now that you have read this post.
