Regex Lookaround

Regular expressions are a great tool to reach for when the right situation arises. One case that is tricky to handle with simple matching alone is extracting something based on what comes before or after it. It is trickier still when we want to extract something only when it is not preceded or followed by a particular pattern.

As an example, we have been given a list of ingredients for some chocolate muffins. Using this list, we would like to calculate the total cost of the recipe.

  • 500g plain flour – 20p @ £0.40/kg

  • 100g cocoa powder – £1.20 @ £12/kg

  • 4 tsp baking powder – 8p @ £3.50/kg

  • 4 large eggs – £1.60 @ £0.40/each

  • 240g caster sugar – 40p @ £1.60/kg

  • 8 tbsp vegetable oil – 12p @ £1/kg

  • 1pint whole milk – 50p @ £0.50/pint

  • 200g chocolate chips (optional) – £1.60 @ £8/kg

  • 400g icing sugar – £1.20 @ £3/kg

To do this, we first need to extract the total cost of each ingredient while ignoring things like weights, quantities, and unit prices. We must also handle some costs being given in pounds, for example £1.60 for 4 large eggs, and some being given in pence, for example 50p for milk. Finally, we can add the extracted costs together.

Extracting numbers

Let’s start by first extracting all numbers from our ingredients list using a simple regular expression.

import re
 
ingredients = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
4 tsp baking powder - 8p @ £3.50/kg
4 large egg - £1.60 @ £0.40/each
240g caster sugar - 40p @ £1.60/kg
8 tbsp vegetable oil - 12p @ £1/kg
1pint whole milk - 50p @ £0.50/pint
200g chocolate chips (optional)  - £1.60 @ £8/kg
400g icing sugar - £1.20 @ £3/kg
"""
 
numbers = re.findall(r"([0-9]+(?:\.[0-9]{2})?)", ingredients)
print(numbers)
['500', '20', '0.40', '100', '1.20', '12', '4', '8', '3.50', '4', '1.60', '0.40', '240', '40', '1.60', '8', '12', '1', '1', '50', '0.50', '200', '1.60', '8', '400', '1.20', '3']

Let’s take a closer look at the regex that let us extract all of these numbers.

regex section              what it means
([0-9]+(?:\.[0-9]{2})?)    the full regular expression
(_____________________)    start a capture group whose contents are returned
_[0-9]+________________    match one or more digits
_______(____________)?_    start a new optional group
________?:_____________    make this group non-capturing, so it is not returned separately
__________\.[0-9]{2}___    match a decimal point followed by exactly two digits
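To see why the ?: matters, here is a small sketch comparing the pattern with and without it. The sample string is made up for illustration. When a pattern has more than one capturing group, re.findall returns a tuple per match rather than a single string.

```python
import re

sample = "20p @ £0.40/kg"  # illustrative line, not the full ingredients list

# Inner group is capturing: findall returns one tuple per match
with_capture = re.findall(r"([0-9]+(\.[0-9]{2})?)", sample)
print(with_capture)  # [('20', ''), ('0.40', '.40')]

# Inner group is non-capturing (?:...): findall returns plain strings
without_capture = re.findall(r"([0-9]+(?:\.[0-9]{2})?)", sample)
print(without_capture)  # ['20', '0.40']
```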

Now that we have all of the numbers in the text extracted, we can think about how we filter this down to only the numbers which are related to costs so that we can sum them.

Positive lookbehind

We could identify which numbers are pounds by introducing the pound symbol to the start of our regular expression. Let’s take a quick look at what would happen if we did:

pounds = re.findall(r"(£[0-9]+(?:\.[0-9]{2})?)", ingredients)
print(pounds)
['£0.40', '£1.20', '£12', '£3.50', '£1.60', '£0.40', '£1.60', '£1', '£0.50', '£1.60', '£8', '£1.20', '£3']

Clearly this is not what we want, since each number now has a £ symbol in it and can no longer be converted straight to a float! To overcome this problem, we can use a feature of regular expressions called a positive lookbehind.

A positive lookbehind is a type of group that is checked against the text but is not included in the match. It sits before our captured group and asserts that the text immediately before the match is a given prefix, without consuming any characters. For example:

pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)", ingredients)
print(pounds)
['0.40', '1.20', '12', '3.50', '1.60', '0.40', '1.60', '1', '0.50', '1.60', '8', '1.20', '3']

This is exactly what we need, giving us all numbers which are prefixed by a £ symbol. Let’s take a closer look at what is happening.

regex section                    what it means
(?<=£)([0-9]+(?:\.[0-9]{2})?)    the full regular expression
______([0-9]+(?:\.[0-9]{2})?)    original number regex is unchanged
(____)_______________________    a new group is introduced before our existing regex group
_?___________________________    this group is an assertion: it is checked but is not part of the output
__<__________________________    look behind: the contents must appear immediately before the next group (our original regex)
___=_________________________    positive: the text before must match the contents of this group
____£________________________    what to look for, in this case the £ symbol

Positive lookbehind is a really clean way of asserting that something has a prefix, whilst leaving that prefix out of what is returned.
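As a quick check that the prefix really is left out of the match, the sketch below inspects a match object on a made-up string. It also shows one limitation worth knowing about: Python's re module only accepts lookbehinds that match a fixed number of characters.

```python
import re

match = re.search(r"(?<=£)[0-9]+", "cost: £12")
print(match.group(0))  # 12
print(match.start())   # 7, the index of the "1", not of the "£"

# A lookbehind of variable width, such as "one or more £", fails to compile
try:
    re.compile(r"(?<=£+)[0-9]+")
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```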

Negative lookahead

At this point our list of numbers contains every amount given in pounds. If we look closely at our ingredients list, though, we will see that some of these numbers are a cost, e.g. £1.20, and some of them are a unit price, e.g. £12/kg. Our task is to get a total cost, so we need a way to exclude all numbers which are a unit price.

Inside a character class, regular expressions allow us to use a negation character, ^, to describe characters that should not match. Let’s try excluding forward slashes by adding [^/] to the end of our regex and see what happens:

pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)[^/]", ingredients)
print(pounds)
['0', '1.20', '1', '3', '1.60', '0', '1', '0', '1.60', '1.20']

Looking back at our ingredients, something has gone wrong. We are expecting ['1.20', '1.60', '1.60', '1.20'] but somehow have extra numbers in our output. This is happening because [^/] must consume one character that is not a forward slash, and that character can be another digit or a decimal point! When the full number is followed by a slash, the regex backtracks to a shorter number, so £12/kg yields 1 with the 2 consumed by [^/].
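We can watch this happen by comparing the whole match with the captured group. The strings below are made up for illustration.

```python
import re

pattern = r"(?<=£)([0-9]+(?:\.[0-9]{2})?)[^/]"

# [^/] has to consume a character, so the engine gives back the "2"
match = re.search(pattern, "£12/kg")
print(match.group(0))  # 12 (the whole match includes the consumed "2")
print(match.group(1))  # 1  (the captured number has been cut short)

# At the very end of a string there is nothing left for [^/] to consume,
# so the number is cut short here too
end_of_string = re.findall(pattern, "total £1.20")
print(end_of_string)  # ['1']
```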

Thankfully, regular expressions offer an alternative approach to excluding things which will help us overcome this problem – negative lookahead. In this technique, we look ahead of our match for a suffix, and then only consider groups where the lookahead is negative. That is, where the lookahead does not exist.

pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9])", ingredients)
print(pounds)
['1.20', '1.60', '1.60', '1.20']

Let’s take a closer look at the lookahead group we have created:

regex section                               what it means
(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9])    the full regular expression
(?<=£)([0-9]+(?:\.[0-9]{2})?)___________    existing regex is unchanged
_____________________________(_________)    a new group is introduced after our existing regex group
______________________________?_________    this group is an assertion: it is checked but is not part of the output
_______________________________!________    negative: the previous group (the existing regex) must not be followed by the contents of this group
________________________________[/.0-9]_    what to look for, in this case any character which is a forward slash, a decimal point, or a digit

One difference from the previous section is that we didn’t use any symbol to describe where to look. Recall that with a lookbehind we define our group as (?<= ), where the < denotes that this group describes a prefix. Rather than switching the < to a >, regular expressions treat the group as a lookahead whenever the < is missing.
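It is worth seeing why the lookahead lists a decimal point and digits as well as the forward slash. In the sketch below (on a made-up sample string), a lookahead for the slash alone still lets the regex backtrack to a shorter number, just as [^/] did.

```python
import re

sample = "£0.40/kg £12/kg £1.20 @"  # illustrative

# Only the slash is excluded: the engine backtracks to "0" and "1"
slash_only = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?!/)", sample)
print(slash_only)  # ['0', '1', '1.20']

# Excluding "." and digits too leaves no shorter number to fall back on
full_class = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9])", sample)
print(full_class)  # ['1.20']
```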

Positive lookahead

Some costs in our list of ingredients are given in pence, not as pounds. For these, we need to define a different regular expression. Since pennies do not have a decimal point in them we can use a slightly simpler regular expression to find all numbers: ([0-9]+)

To filter the list of numbers down to pennies, we can use the positive lookahead (?=p) to keep only numbers that are followed by a “p”.

pennies = re.findall(r"([0-9]+)(?=p)", ingredients)
print(pennies)
['20', '8', '40', '12', '1', '50']

This list is pretty close to what we want, but contains one extra number in it that we don’t want, the singular “1”. This has been extracted because our ingredients list contains “1pint”, which matches our expression for a number followed by a “p” character.

Negative lookbehind

To filter our matches for pennies we can use the final type of lookaround that regular expressions allow, a negative lookbehind. Similar to a negative lookahead, this technique asserts that our pattern is not prefixed by another pattern. For the ingredients list we have, we can check that our pattern is not prefixed by a newline, for example, using the negative lookbehind (?<!\n):

pennies = re.findall(r"(?<!\n)([0-9]+)(?=p)", ingredients)
print(pennies)
['20', '8', '40', '12', '50']

This is now the correct list of pennies that we wanted to extract from our ingredients list!

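Note: checking for a prefix newline worked for this specific example, but would not work if the text started directly with “1pint” with no newline before it, or if a quantity at the start of a line had more than one digit (for “12pint”, the “2” would still match). How would you modify the pattern to handle those cases? What if prices were given in p/pint?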
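One possible answer, offered as a sketch rather than the only approach, is to describe what a penny cost looks like instead of where it sits on the line: a whole number that is followed by a “p” which is itself not followed by a letter or a slash. The sample line and the stricter pattern below are assumptions for illustration and are not part of the recipe above.

```python
import re

# Hypothetical line: a two-digit quantity at the very start of the text,
# and a unit price given in pence
sample = "12pint whole milk - 50p @ 25p/pint"

# The newline lookbehind no longer helps here
newline_only = re.findall(r"(?<!\n)([0-9]+)(?=p)", sample)
print(newline_only)  # ['12', '50', '25']

# Not preceded by a digit or ".", and followed by a "p" that is not
# the start of a word or of a unit price
stricter = re.findall(r"(?<![0-9.])([0-9]+)(?=p(?![A-Za-z/]))", sample)
print(stricter)  # ['50']
```

Note that a lookahead can contain another lookahead, which is what lets us check the character after the “p” without consuming either of them.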

Putting it all together

Now that we have defined a way of extracting all pounds and penny costs from our ingredients list, we can put it all together to calculate the final cost of the recipe:

import re
 
ingredients = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
4 tsp baking powder - 8p @ £3.50/kg
4 large egg - £1.60 @ £0.40/each
240g caster sugar - 40p @ £1.60/kg
8 tbsp vegetable oil - 12p @ £1/kg
1pint whole milk - 50p @ £0.50/pint
200g chocolate chips (optional)  - £1.60 @ £8/kg
400g icing sugar - £1.20 @ £3/kg
"""
 
pounds = re.findall(r"(?<=£)"  # positive lookbehind for £ symbol 
                    r"([0-9]+\.[0-9]{2})"  # currency regex
                    r"(?![/.0-9])",  # negative lookahead for / . and numbers
                    ingredients)
 
# convert to float
pounds = [float(pound) for pound in pounds]
 
pennies = re.findall(r"(?<!\n)"  # negative lookbehind for newline
                     r"([0-9]+)"  # number regex
                     r"(?=p)",  # positive lookahead for p symbol
                     ingredients)
 
# convert to float and adjust to be in pounds
pennies = [float(penny) / 100.0 for penny in pennies]
 
cost = sum(pounds) + sum(pennies)
 
print(f"Recipe cost is £{cost:.2f}")
Recipe cost is £6.90

The final cost we get is £6.90 – not bad for 24 chocolate muffins, working out at roughly 29p each!
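As an alternative way of laying out the same patterns, Python's re.VERBOSE flag lets us put whitespace and comments inside a single pattern string. The sketch below also sums in whole pence rather than floats, which avoids rounding surprises; this relies on the assumption, true for the pattern used above, that every pound cost is written with exactly two decimal places. The names POUNDS, PENNIES, and total_pence are ours, chosen for illustration.

```python
import re

ingredients = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
4 tsp baking powder - 8p @ £3.50/kg
4 large egg - £1.60 @ £0.40/each
240g caster sugar - 40p @ £1.60/kg
8 tbsp vegetable oil - 12p @ £1/kg
1pint whole milk - 50p @ £0.50/pint
200g chocolate chips (optional)  - £1.60 @ £8/kg
400g icing sugar - £1.20 @ £3/kg
"""

POUNDS = re.compile(r"""
    (?<=£)               # positive lookbehind for the £ symbol
    ([0-9]+\.[0-9]{2})   # pounds with exactly two decimal places
    (?![/.0-9])          # negative lookahead for / . and digits
""", re.VERBOSE)

PENNIES = re.compile(r"""
    (?<!\n)     # negative lookbehind for a newline
    ([0-9]+)    # whole number of pence
    (?=p)       # positive lookahead for the p symbol
""", re.VERBOSE)

# Work in whole pence: "1.20" becomes 120
pence_from_pounds = [int(p.replace(".", "")) for p in POUNDS.findall(ingredients)]
pence = [int(p) for p in PENNIES.findall(ingredients)]

total_pence = sum(pence_from_pounds) + sum(pence)
print(f"Recipe cost is £{total_pence // 100}.{total_pence % 100:02d}")
# Recipe cost is £6.90
```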

Summary

Lookarounds in regular expressions are great when we want to capture the presence or absence of either a prefix or suffix, but do not want to keep the prefix or suffix in our match. The following table gives a quick go-to reference of each type of lookaround:

syntax     name                   description
a(?=b)     positive lookahead     match “a” with suffix “b”
a(?!b)     negative lookahead     match “a” without suffix “b”
(?<=b)a    positive lookbehind    match “a” with prefix “b”
(?<!b)a    negative lookbehind    match “a” without prefix “b”

In all examples “a” is part of the returned match while “b” is not returned.
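To try the four forms side by side, the sketch below runs each one over a short made-up string and prints where every “a” was matched. The helper starts is ours, written for illustration.

```python
import re

text = "ab ac ba ca"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r"a(?=b)"))   # [0]         "a" followed by "b"
print(starts(r"a(?!b)"))   # [3, 7, 10]  "a" not followed by "b"
print(starts(r"(?<=b)a"))  # [7]         "a" preceded by "b"
print(starts(r"(?<!b)a"))  # [0, 3, 10]  "a" not preceded by "b"
```

The negative forms also match at the very start or end of the text, where there is no neighbouring character at all.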