Regular Expressions in Python


We will now use python.org’s shell to start getting familiar with regular expressions in Python.

The re module of Python supports Perl-style regular expressions, whose syntax is very close to PCRE. You can import this module with import re.

The compile function of the re module compiles the pattern string passed to it into a regular expression object. You can already assemble many pattern strings using your PCRE knowledge.
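As a minimal sketch of compiling a pattern, consider the snippet below. The pattern and the name number_pattern are assumptions chosen for illustration, not necessarily the ones used originally.

```python
import re

# Hypothetical pattern for illustration: one or more decimal digits.
number_pattern = re.compile(r'[0-9]+')

# The compiled object remembers the pattern string it was built from.
print(number_pattern.pattern)  # [0-9]+
```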

The only surprising thing is the r in front of the string. In Python, this prefix denotes a raw string, in which a backslash is kept as a literal character instead of starting an escape sequence. Although we could omit the r in the above example, raw strings generally save us a lot of character escaping.
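The effect of the raw prefix can be sketched as follows. The word-boundary pattern is an illustrative assumption; any pattern containing backslashes would behave the same way.

```python
import re

# Without the raw prefix, every backslash has to be doubled.
escaped = '\\bcat\\b'
raw = r'\bcat\b'

# Both literals denote the same pattern string.
print(escaped == raw)  # True

# In a normal string literal, '\b' is a single backspace character,
# so the pattern below does not contain any word boundary.
print(re.search('\bcat\b', 'a cat sat'))  # None
print(re.search(raw, 'a cat sat').group())  # cat
```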

The search method of a compiled regular expression searches its string argument for a match:
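A minimal sketch of search, assuming a hypothetical digit pattern and made-up input strings:

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
number_pattern = re.compile(r'[0-9]+')

# search scans the whole string and stops at the first match.
result = number_pattern.search('Order 66 shipped in 3 days')
print(result)  # a match object describing the match '66'

# When nothing matches, search returns None.
no_result = number_pattern.search('no digits here')
print(no_result)  # None
```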

We can get more information on the result with some methods:

Called without arguments, the group method returns the full text of the match. The start method returns the index of the first character of the match, the end method returns the index of the first character after the match, and the span method returns both indices as a tuple.
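These methods can be tried on a small example. The pattern and the input string below are assumptions made for illustration.

```python
import re

number_pattern = re.compile(r'[0-9]+')
text = 'Order 66 shipped'
result = number_pattern.search(text)

print(result.group())  # 66
print(result.start())  # 6
print(result.end())    # 8
print(result.span())   # (6, 8)

# start and end can be used directly as slice boundaries.
print(text[result.start():result.end()])  # 66
```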

The name of the match method of compiled regular expressions is misleading, because it only looks for a match at the start of the string.

Technically, match is redundant in the default mode, because we could call search with the ^ character to anchor the pattern to the start of the string:
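The difference between match and search, and the equivalent anchored search, can be sketched like this. The patterns and strings are illustrative assumptions.

```python
import re

text = 'abc 123'
digits = re.compile(r'[0-9]+')
anchored_digits = re.compile(r'^[0-9]+')

# search finds the digits anywhere in the string.
found_by_search = digits.search(text)
print(found_by_search.group())  # 123

# match only tries the start of the string, where 'a' is not a digit.
found_by_match = digits.match(text)
print(found_by_match)  # None

# search with a ^ anchor behaves like match here.
found_by_anchor = anchored_digits.search(text)
print(found_by_anchor)  # None

# When the string starts with digits, match succeeds.
print(digits.match('123 abc').group())  # 123
```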

We covered the match method mainly so that you do not confuse it with search.

Compiling regular expressions is optional, because match and search are also available as functions of the re module. These functions accept a pattern string and the string to examine:
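A short sketch of the module-level functions, again with an assumed pattern and assumed inputs:

```python
import re

# The pattern string comes first, the string to examine second.
searched = re.search(r'[0-9]+', 'abc 123')
print(searched.group())  # 123

# re.match only succeeds at the start of the string.
matched_late = re.match(r'[0-9]+', 'abc 123')
print(matched_late)  # None

matched_early = re.match(r'[0-9]+', '123 abc')
print(matched_early.group())  # 123
```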

It is also possible to enumerate all matches belonging to a search expression. The methods findall and finditer do the trick:
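Both methods can be sketched on one input. The pattern and the text are assumptions for illustration; the first part uses findall, the second part uses finditer.

```python
import re

number_pattern = re.compile(r'[0-9]+')
text = '3 apples, 12 pears, 7 plums'

# First example: findall collects every match in a list of strings.
all_numbers = number_pattern.findall(text)
print(all_numbers)  # ['3', '12', '7']

# Second example: finditer returns an iterator of match objects,
# so positions are available in addition to the matched text.
for match in number_pattern.finditer(text):
    print(match.group(), match.span())
```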

In the first example, we collected all matches in a list. In the second example, we created an iterator, and iterated over it to print the results one by one.

Now you know enough to understand the example code generated by regex101.com:

Note that I formatted the long lines in the code. As the width of this book is limited, and it is not a good idea to write very long lines of code anyway, this change makes sense.

The code sequence prints out

The inner loop of groups was not executed in this example. Therefore, let’s change the regular expression and the matched string to include extraction of substrings using capture groups.
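Why an inner loop over groups does nothing without capture groups can be shown with a hand-written sketch. This is not the code generated by regex101.com; the pattern, the text, and all variable names are assumptions.

```python
import re

# Hypothetical pattern without any capture group.
pattern = r'[0-9]+'
text = 'abc 123'

for match_number, match in enumerate(re.finditer(pattern, text), start=1):
    print(match_number, match.group(), match.span())
    # groups() is an empty tuple here, so this loop body never runs.
    for group_number, group in enumerate(match.groups(), start=1):
        print(group_number, group)

# With parentheses in the pattern, groups() is no longer empty.
print(re.search(r'([0-9])([0-9]+)', text).groups())  # ('1', '23')
```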

Suppose we would like to retrieve the currency, the numeric price value, and the full price with currency from a string of the following format:

The regular expression matching this string is:

We have to escape the dollar sign when it stands for the currency, because $ is a metasyntax character that anchors the end of the string or the end of a line. We also have to escape the dot, because it is a metasyntax character that matches one arbitrary character.
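The effect of both escapes can be checked with a small sketch. The sample string 'Price: $25.99' and the patterns are assumptions, since the exact price format is not reproduced here.

```python
import re

# Assumed sample input; the real price format may differ.
price_text = 'Price: $25.99'

# Escaped dollar sign and dot: both are matched literally.
escaped = re.search(r'\$[0-9]+\.[0-9]+', price_text)
print(escaped.group())  # $25.99

# Unescaped, $ is an end anchor, and no digits can follow the end.
unescaped_dollar = re.search(r'$[0-9]+\.[0-9]+', price_text)
print(unescaped_dollar)  # None

# Unescaped, the dot accepts any character, so '$25x99' matches too.
loose_dot = re.search(r'\$[0-9]+.[0-9]+', '$25x99')
print(loose_dot.group())  # $25x99
```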

Let’s add some parentheses for the substrings we have to capture:

The added capture groups contain the following data:

Capture group number    Data
1                       full price
2                       currency symbol
3                       numeric price

Let’s explore the retrieval of the three capture groups.
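A minimal sketch of the retrieval, assuming a sample string such as 'Price: $25.99' and a pattern whose groups follow the numbering above. Both the string and the pattern are assumptions, not the original ones.

```python
import re

# Assumed sample input and pattern.
# Group 1: full price, group 2: currency symbol, group 3: numeric price.
price_text = 'Price: $25.99'
price_pattern = re.compile(r'((\$)([0-9]+\.[0-9]+))')

match = price_pattern.search(price_text)

print(match.group(1))  # $25.99
print(match.group(2))  # $
print(match.group(3))  # 25.99

# groups() returns all capture groups at once as a tuple.
print(match.groups())  # ('$25.99', '$', '25.99')
```

Group 0 is always the whole match, which is why the numbering of the capture groups starts at 1.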

Once we paste this code in the Python shell, the result becomes visible:

As you can see, the capture groups are accessible via the match object.

For more information, consult the Python regex documentation.