Regex Lookaround
Regular expressions are a great tool to reach for when the right situation arises. One common case that is awkward with simple matching alone is extracting something based on what comes before or after it. It is harder still when we want to extract something only when it is not preceded or followed by a particular pattern.
As an example, we have been given a list of ingredients for some chocolate muffins. Using this list, we would like to calculate the total cost of the recipe.
500g plain flour – 20p @ £0.40/kg
100g cocoa powder – £1.20 @ £12/kg
4 tsp baking powder – 8p @ £3.50/kg
4 large egg – £1.60 @ £0.40/each
240g caster sugar – 40p @ £1.60/kg
8 tbsp vegetable oil – 12p @ £1/kg
1pint whole milk – 50p @ £0.50/pint
200g chocolate chips (optional) – £1.60 @ £8/kg
400g icing sugar – £1.20 @ £3/kg
To do this, we first need to extract the total cost of each ingredient while ignoring things like weights, quantities, and unit prices. We must also handle some costs being given in pounds, for example £1.60 for 4 large eggs, and some being given in pence, for example 50p for milk. Finally, we can add the extracted costs together.
Extracting numbers¶
Let’s start by extracting all numbers from our ingredients list using a simple regular expression.
import re
ingredients = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
4 tsp baking powder - 8p @ £3.50/kg
4 large egg - £1.60 @ £0.40/each
240g caster sugar - 40p @ £1.60/kg
8 tbsp vegetable oil - 12p @ £1/kg
1pint whole milk - 50p @ £0.50/pint
200g chocolate chips (optional) - £1.60 @ £8/kg
400g icing sugar - £1.20 @ £3/kg
"""
numbers = re.findall(r"([0-9]+(?:\.[0-9]{2})?)", ingredients)
print(numbers)
['500', '20', '0.40', '100', '1.20', '12', '4', '8', '3.50', '4', '1.60', '0.40', '240', '40', '1.60', '8', '12', '1', '1', '50', '0.50', '200', '1.60', '8', '400', '1.20', '3']
Let’s take a closer look at the regex that let us extract all of these numbers.
regex section | what it means |
---|---|
([0-9]+(?:\.[0-9]{2})?) | the full regular expression |
( ... ) | start a new match group that is returned |
[0-9]+ | match one or more digits |
( ... )? | start a new optional group |
?: | don’t create a separate return value for this group |
\.[0-9]{2} | match a decimal point followed by two digits |
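To see what the ?: marker changes in practice, here is a small sketch comparing the same pattern with and without it. The sample string and variable names are made up for illustration and are not part of the ingredients list.

```python
import re

sample = "£3.50 and 12"  # illustrative text only

# No groups at all: findall returns each whole match.
whole = re.findall(r"[0-9]+(?:\.[0-9]{2})?", sample)

# Two capturing groups: findall returns one tuple per match,
# with an empty string where the optional inner group did not take part.
nested = re.findall(r"([0-9]+(\.[0-9]{2})?)", sample)

# Outer capturing group, inner non-capturing group (the form used above).
flat = re.findall(r"([0-9]+(?:\.[0-9]{2})?)", sample)

print(whole)   # ['3.50', '12']
print(nested)  # [('3.50', '.50'), ('12', '')]
print(flat)    # ['3.50', '12']
```

Without ?: we would have to pick the first element out of every tuple before we could work with the numbers.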
Now that we have all of the numbers in the text extracted, we can think about how we filter this down to only the numbers which are related to costs so that we can sum them.
Positive lookbehind¶
We could identify which numbers are pounds by introducing the pound symbol to the start of our regular expression. Let’s take a quick look at what would happen if we did:
pounds = re.findall(r"(£[0-9]+(?:\.[0-9]{2})?)", ingredients)
print(pounds)
['£0.40', '£1.20', '£12', '£3.50', '£1.60', '£0.40', '£1.60', '£1', '£0.50', '£1.60', '£8', '£1.20', '£3']
This is not what we need, since every match now includes the £ symbol and cannot be converted to a number directly. To overcome this problem, we can use a feature of regular expressions called a positive lookbehind.
A positive lookbehind is a type of group in a regular expression that is checked but not included in the match. It sits before our captured group and asserts that the match has a given prefix. For example:
pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)", ingredients)
print(pounds)
['0.40', '1.20', '12', '3.50', '1.60', '0.40', '1.60', '1', '0.50', '1.60', '8', '1.20', '3']
This is exactly what we need, giving us all numbers which are prefixed by a £ symbol. Let’s take a closer look at what is happening.
regex section | what it means |
---|---|
(?<=£)([0-9]+(?:\.[0-9]{2})?) | the full regular expression |
([0-9]+(?:\.[0-9]{2})?) | original number regex is unchanged |
(?<=£) | a new group is introduced before our existing regex group |
? | this group should be matched but not included in the output |
< | contents must occur before the next group in the regex (our original regex) |
= | the text there must equal the contents of this group |
£ | what to match, in this case the £ symbol |
Positive lookbehind is a really clean way of asserting that something has a prefix, whilst also ignoring the prefix from what is returned.
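The sketch below illustrates the point on a made-up string: the lookbehind checks for the £ but the match itself starts after it. It also shows one restriction worth knowing about in Python's re module, which only accepts lookbehinds of a fixed width. The sample text and variable names are illustrative assumptions.

```python
import re

text = "cost £12"  # illustrative text only

with_symbol = re.search(r"£[0-9]+", text)
with_lookbehind = re.search(r"(?<=£)[0-9]+", text)

print(with_symbol.group(), with_symbol.span())          # £12 (5, 8)
print(with_lookbehind.group(), with_lookbehind.span())  # 12 (6, 8)

# Python's re module requires a fixed-width lookbehind, so a
# variable-length prefix such as "£+" is rejected at compile time.
try:
    re.compile(r"(?<=£+)[0-9]+")
    fixed_width_only = False
except re.error:
    fixed_width_only = True

print(fixed_width_only)  # True
```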
Negative lookahead¶
At this point our list contains every number that follows a £ symbol. If we look closely at our ingredients list, though, we will see that some of these numbers are a cost, e.g. £1.20, and some of them are a unit price, e.g. £12/kg. Our task is to get a total cost, so we need a way to exclude all numbers which are a unit price.
Regular expressions allow us to use a negation character, ^, inside a character class to describe characters that should not match. Let’s try excluding forward slashes by adding [^/] to the end of our regex and see what happens:
pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)[^/]", ingredients)
print(pounds)
['0', '1.20', '1', '3', '1.60', '0', '1', '0', '1.60', '1.20']
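Looking back at our ingredients, something has gone wrong. We are expecting ['1.20', '1.60', '1.60', '1.20'] but somehow have extra numbers in our output. This is happening because [^/] must consume one character that is not a forward slash, and that character can be another digit or a decimal point! When the full number is followed by a slash, the regex engine backtracks to a shorter number so that [^/] can match the next digit or decimal point instead.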
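One way to see this for ourselves is to print the whole match next to the captured group. The short sample string below is taken from one line of the ingredients, and the variable names are just for illustration.

```python
import re

sample = "£1.20 @ £12/kg"

pattern = r"(?<=£)([0-9]+(?:\.[0-9]{2})?)[^/]"

# group(0) is everything the pattern consumed, group(1) is the captured number.
consumed = [(m.group(0), m.group(1)) for m in re.finditer(pattern, sample)]
print(consumed)  # [('1.20 ', '1.20'), ('12', '1')]
```

For £1.20 the extra character is the trailing space, which is harmless. For £12/kg the number shrinks to "1" so that [^/] can consume the "2".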
Thankfully, regular expressions offer an alternative approach to excluding things which will help us overcome this problem: negative lookahead. In this technique, we look ahead of our match for a suffix, and only keep matches where that suffix is not present.
pounds = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9])", ingredients)
print(pounds)
['1.20', '1.60', '1.60', '1.20']
Let’s take a closer look at the lookahead group we have created:
regex section | what it means |
---|---|
(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9]) | the full regular expression |
(?<=£)([0-9]+(?:\.[0-9]{2})?) | existing regex is unchanged |
(?![/.0-9]) | a new group is introduced after our existing regex group |
? | this group should be matched but not included in the output |
! | previous group (the existing regex) must not be followed by the contents of this group |
[/.0-9] | what to match, in this case any character which is a forward slash, a decimal point, or another digit |
One difference from the previous section is that we didn’t use any symbol to describe where to look. Recall that with a lookbehind we define our group as (?< ) where the < denotes that this group describes a prefix. Rather than switching the < to a >, regular expressions implicitly assume that when the < is missing the group must be a lookahead.
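It is worth checking why the lookahead lists the decimal point and digits as well as the forward slash. The sketch below, run on a shortened sample made from two of the ingredient lines, compares a lookahead for the slash alone with the version used above. The variable names are illustrative.

```python
import re

sample = "£1.20 @ £12/kg, £1.60 @ £0.40/each"

# Lookahead rejects only a slash: the engine can still backtrack to a
# shorter number whose next character is a digit or a decimal point.
slash_only = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?!/)", sample)

# Lookahead also rejects digits and the decimal point, so a shortened
# number is never accepted.
full_class = re.findall(r"(?<=£)([0-9]+(?:\.[0-9]{2})?)(?![/.0-9])", sample)

print(slash_only)  # ['1.20', '1', '1.60', '0']
print(full_class)  # ['1.20', '1.60']
```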
Positive lookahead¶
Some costs in our list of ingredients are given in pence, not as pounds. For these, we need to define a different regular expression. Since pennies do not have a decimal point in them we can use a slightly simpler regular expression to find all numbers: ([0-9]+)
To filter the list of numbers down to pennies we can use the positive lookahead (?=p) to keep only numbers that are followed by a “p”.
pennies = re.findall(r"([0-9]+)(?=p)", ingredients)
print(pennies)
['20', '8', '40', '12', '1', '50']
This list is pretty close to what we want, but contains one extra number in it that we don’t want, the singular “1”. This has been extracted because our ingredients list contains “1pint”, which matches our expression for a number followed by a “p” character.
Negative lookbehind¶
To filter our matches for pennies we can use the final type of lookaround that regular expressions allow, a negative lookbehind. Similar to a negative lookahead, this technique asserts that our pattern is not prefixed by another pattern. For the ingredients list we have, we can check that our pattern is not prefixed by a newline, for example, using the negative lookbehind (?<!\n):
pennies = re.findall(r"(?<!\n)([0-9]+)(?=p)", ingredients)
print(pennies)
['20', '8', '40', '12', '50']
This is now the correct list of pennies that we wanted to extract from our ingredients list!
Note: checking for a prefix newline worked for this specific example, but would not work if “1pint” was on the first line. How would you modify the pattern to handle that case? What if prices were given in p/pint?
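One possible answer, offered as a sketch rather than the only solution, is to stop relying on the newline and instead describe what a pence amount looks like: a whole number followed by a “p” that does not continue into a word or a unit price. The sample string and pattern name below are hypothetical and only cover the two cases raised in the note.

```python
import re

# Hypothetical variation of the list: "1pint" is on the very first line
# (no leading newline) and the milk price is quoted in p/pint.
tricky = "1pint whole milk - 50p @ 50p/pint\n500g plain flour - 20p @ £0.40/kg"

pence_pattern = re.compile(
    r"(?<![0-9])"      # negative lookbehind: do not start part-way through a number
    r"([0-9]+)"        # the pence amount
    r"(?=p(?![\w/]))"  # positive lookahead: "p" not followed by a word character or slash
)

pennies = pence_pattern.findall(tricky)
print(pennies)  # ['50', '20']
```

Here a negative lookahead is nested inside the positive lookahead, which rules out both “1pint” and “50p/pint” without caring which line they appear on.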
Putting it all together¶
Now that we have defined a way of extracting all pounds and penny costs from our ingredients list, we can put it all together to calculate the final cost of the recipe:
import re
ingredients = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
4 tsp baking powder - 8p @ £3.50/kg
4 large egg - £1.60 @ £0.40/each
240g caster sugar - 40p @ £1.60/kg
8 tbsp vegetable oil - 12p @ £1/kg
1pint whole milk - 50p @ £0.50/pint
200g chocolate chips (optional) - £1.60 @ £8/kg
400g icing sugar - £1.20 @ £3/kg
"""
pounds = re.findall(r"(?<=£)"              # positive lookbehind for £ symbol
                    r"([0-9]+\.[0-9]{2})"  # currency regex
                    r"(?![/.0-9])",        # negative lookahead for / . and digits
                    ingredients)
# convert to float
pounds = [float(pound) for pound in pounds]
pennies = re.findall(r"(?<!\n)"    # negative lookbehind for newline
                     r"([0-9]+)"   # number regex
                     r"(?=p)",     # positive lookahead for p symbol
                     ingredients)
# convert to float and adjust to be in pounds
pennies = [float(penny) / 100.0 for penny in pennies]
cost = sum(pounds) + sum(pennies)
print(f"Recipe cost is £{cost:.2f}")
Recipe cost is £6.90
The final cost we get is £6.90 – not bad for 24 chocolate muffins, working out at roughly 29p each!
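Floats work fine for this recipe, but binary floating point cannot represent every decimal amount exactly. If that matters, one alternative is to keep the extracted strings as Decimal values. The sketch below uses the same two patterns on a shortened list, and the names sample, pound_costs, and penny_costs are made up for the illustration.

```python
import re
from decimal import Decimal

# Shortened ingredient list, used only for this illustration.
sample = """
500g plain flour - 20p @ £0.40/kg
100g cocoa powder - £1.20 @ £12/kg
1pint whole milk - 50p @ £0.50/pint
"""

pound_costs = re.findall(r"(?<=£)([0-9]+\.[0-9]{2})(?![/.0-9])", sample)
penny_costs = re.findall(r"(?<!\n)([0-9]+)(?=p)", sample)

# Decimal parses the matched strings exactly, so no binary rounding is involved.
total = (sum(Decimal(p) for p in pound_costs)
         + sum(Decimal(p) / 100 for p in penny_costs))

print(f"Sample cost is £{total:.2f}")  # Sample cost is £1.90
```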
Summary¶
Lookarounds in regular expressions are great when we want to capture the presence or absence of either a prefix or suffix, but do not want to keep the prefix or suffix in our match. The following table gives a quick go-to reference of each type of lookaround:
syntax | name | description |
---|---|---|
a(?=b) | positive lookahead | match “a” with suffix “b” |
a(?!b) | negative lookahead | match “a” without suffix “b” |
(?<=b)a | positive lookbehind | match “a” with prefix “b” |
(?<!b)a | negative lookbehind | match “a” without prefix “b” |
In all examples “a” is part of the returned match while “b” is not returned.
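As a quick check of the table, the sketch below runs all four patterns over a small made-up string and reports where each “a” was matched. The helper name starts and the test string are illustrative only.

```python
import re

text = "ab ac ba ca"  # the letter "a" sits at indexes 0, 3, 7 and 10


def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


print(starts(r"a(?=b)"))   # [0]         "a" with suffix "b"
print(starts(r"a(?!b)"))   # [3, 7, 10]  "a" without suffix "b"
print(starts(r"(?<=b)a"))  # [7]         "a" with prefix "b"
print(starts(r"(?<!b)a"))  # [0, 3, 10]  "a" without prefix "b"
```

Note that the negative forms also succeed at the very start or end of the text, where there is no neighbouring character at all.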