re.error: invalid group reference 2 at position 1

Example based guide to mastering Python regular expressions

Understanding Python re(gex)?

Groupings and backreferences

This chapter will show how to reuse portions matched by capture groups via backreferences. These can be used within the RE definition as well as the replacement section. You’ll also learn some of the special grouping features for cases where plain capture groups aren’t enough.

Backreference

Backreferences are like variables in a programming language. You have already seen how to use the re.Match object to refer to the text captured by groups. Backreferences provide the same functionality, with the advantage that they can be used directly in the RE definition as well as the replacement section without having to invoke re.Match objects. Another advantage is that you can apply quantifiers to backreferences.

The syntax is \N or \g<N>, where N is the capture group you want. The below syntax variations are applicable in the replacement section, assuming they are used within raw strings.

  • \1, \2 up to \99 to refer to the corresponding capture group
    • provided there are no digit characters after
    • \0 and \NNN will be interpreted as octal values
  • \g<1>, \g<2> etc (not limited to 99) to refer to the corresponding capture group
    • this also helps to avoid ambiguity between backreferences and digits that follow
  • \g<0> to refer to the entire matched portion, similar to index 0 of re.Match objects
    • \0 cannot be used because numbers starting with 0 are treated as octal values

Here are some examples with the \N syntax.

# remove square brackets that surround digit characters
# note the use of raw strings for the replacement string
>>> re.sub(r'\[(\d+)\]', r'\1', '[52] apples [and] [31] mangoes')
'52 apples [and] 31 mangoes'

# replace __ with _ and delete _ if it is alone
>>> re.sub(r'(_)?_', r'\1', '_apple_ __123__ _banana_')
'apple _123_ banana'

# swap words that are separated by a comma
>>> re.sub(r'(\w+),(\w+)', r'\2,\1', 'good,bad 42,24')
'bad,good 24,42'

And here are some examples with the \g<N> syntax.

# ambiguity between \N and digit characters that are part of the replacement string
>>> re.sub(r'\[(\d+)\]', r'(\15)', '[52] apples and [31] mangoes')
re.error: invalid group reference 15 at position 2
# \g<N> is helpful in such cases
>>> re.sub(r'\[(\d+)\]', r'(\g<1>5)', '[52] apples and [31] mangoes')
'(525) apples and (315) mangoes'
# or, you can use an octal escape for the digit that follows
>>> re.sub(r'\[(\d+)\]', r'(\1\065)', '[52] apples and [31] mangoes')
'(525) apples and (315) mangoes'

# add something around the matched strings using \g<0>
>>> re.sub(r'[a-z]+', r'{\g<0>}', '[52] apples and [31] mangoes')
'[52] {apples} {and} [31] {mangoes}'

# note the use of '+' instead of '*' quantifier to avoid empty matches
>>> re.sub(r'.+', r'Hi. \g<0>. Have a nice day', 'Hello world')
'Hi. Hello world. Have a nice day'

# capture the first field and duplicate it as the last field
>>> re.sub(r'\A([^,]+),.+', r'\g<0>,\1', 'fork,42,nice,3.14')
'fork,42,nice,3.14,fork'

Here are some examples of using backreferences within the RE definition. Only the \N syntax is available for use.

# words that have at least one consecutive repeated character
>>> words = ['effort', 'FLEE', 'facade', 'oddball', 'rat', 'tool', 'a22b']
>>> [w for w in words if re.search(r'(\w)\1', w)]
['effort', 'FLEE', 'oddball', 'tool', 'a22b']

# remove any number of consecutive duplicate words separated by space
# note the use of quantifier on backreferences
# use \W+ instead of space to cover cases like 'a;a<-;a'
>>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
'aa a 42 f_1 f_13.14'

info Since the \g<N> syntax is not available in the RE definition, use formats like hexadecimal escapes to avoid ambiguity between normal digit characters and backreferences.

>>> s = 'abcdefghijklmna1d'

# even though there's only one capture group, \11 will give an error
>>> re.sub(r'(.).*\11', 'X', s)
re.error: invalid group reference 11 at position 6
# use escapes for the digit portion to distinguish it from the backreference
>>> re.sub(r'(.).*\1\x31', 'X', s)
'Xd'

# there are 12 capture groups here, so no error
# but the requirement is \1 as a backreference and 1 as a normal digit
>>> re.sub(r'(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\11', 'X', s)
'abcdefghijklmna1d'
# use escapes again
>>> re.sub(r'(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\1\x31', 'X', s)
'Xd'

warning It may be obvious, but it should be noted that a backreference provides the string that was matched, not the RE that was inside the capture group. For example, if (\d[a-f]) matches 3b, then backreferencing will give 3b and not any other valid match of the RE like 8f, 0a, etc. This is akin to how variables behave in programming: only the result of an expression is stored after variable assignment, not the expression itself. The regex module supports Subexpression calls to refer to the RE itself.
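For example (an illustrative snippet, not part of the original examples):

>>> re.search(r'(\d[a-f])\1', '3b3b')[0]
'3b3b'
>>> print(re.search(r'(\d[a-f])\1', '3b8f'))
None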

Non-capturing groups

Grouping has many uses like applying quantifiers on a RE portion, creating terse RE by factoring common portions and so on. It also affects the behavior of functions like re.findall() and re.split() as seen in the Working with matched portions chapter.

When backreferencing is not required, you can use a non-capturing group to avoid undesired behavior. It also helps to avoid keeping a track of capture group numbers when that particular group is not needed for backreferencing. The syntax is (?:pat) to define a non-capturing group. You’ll see many more of such special groups starting with (? syntax later on.

# normal capture group will hinder ability to get whole match
# non-capturing group to the rescue
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
['cost', 'akin', 'east', 'against']

# capturing wasn't needed here, only common grouping and quantifier
>>> re.split(r'hand(?:y|ful)?', '123hand42handy777handful500')
['123', '42', '777', '500']

# with normal grouping, need to keep track of all the groups
>>> re.sub(r'\A(([^,]+,){3})([^,]+)', r'\1(\3)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'
# using non-capturing groups, only relevant groups have to be tracked
>>> re.sub(r'\A((?:[^,]+,){3})([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'

Referring to the text matched by a capture group with a quantifier will give only the last match, not the entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate to be specified as non-capturing.

>>> s = 'hi 123123123 bye 456123456'
>>> re.findall(r'(123)+', s)
['123', '123']
>>> re.findall(r'(?:123)+', s)
['123123123', '123']
# note that this issue doesn't affect substitutions
>>> re.sub(r'(123)+', 'X', s)
'hi X bye 456X456'

>>> row = 'one,2,3.14,42,five'
# surround only the fourth column with double quotes
# note the loss of columns in the first case
>>> re.sub(r'\A([^,]+,){3}([^,]+)', r'\1"\2"', row)
'3.14,"42",five'
>>> re.sub(r'\A((?:[^,]+,){3})([^,]+)', r'\1"\2"', row)
'one,2,3.14,"42",five'

However, there are situations where capture groups cannot be avoided. In such cases, you’ll have to manually work with re.Match objects to get the desired results.

>>> s = 'effort flee facade oddball rat tool a22b'
# whole words containing at least one consecutive repeated character
>>> repeat_char = re.compile(r'\b\w*(\w)\1\w*\b')

# () in findall will only return the text matched by capture groups
>>> repeat_char.findall(s)
['f', 'e', 'l', 'o', '2']
# finditer to the rescue
>>> m_iter = repeat_char.finditer(s)
>>> [m[0] for m in m_iter]
['effort', 'flee', 'oddball', 'tool', 'a22b']

Named capture groups

RE can get cryptic and difficult to maintain, even for seasoned programmers. There are a few constructs to help add clarity. One such is naming the capture groups and using that name for backreferencing instead of plain numbers. The syntax is (?P<name>pat) for naming the capture groups. The name used should be a valid Python identifier. Use m['name'] or m.group('name') on re.Match objects, \g<name> in the replacement section and (?P=name) for backreferencing in the RE definition. These will still behave as normal capture groups, so \N or \g<N> numbering can be used as well.

# giving names to first and second captured words
>>> re.sub(r'(?P<fw>\w+),(?P<sw>\w+)', r'\g<sw>,\g<fw>', 'good,bad 42,24')
'bad,good 24,42'

>>> s = 'aa a a a 42 f_1 f_1 f_13.14'
>>> re.sub(r'\b(?P<dup>\w+)( (?P=dup))+\b', r'\g<dup>', s)
'aa a 42 f_1 f_13.14'

>>> sentence = 'I bought an apple'
>>> m = re.search(r'(?P<fruit>\w+)\Z', sentence)
>>> m[1]
'apple'
>>> m['fruit']
'apple'
>>> m.group('fruit')
'apple'

You can use the groupdict() method on the re.Match object to extract the portions matched by named capture groups as a dict object. The capture group name will be the key and the portion matched by the group will be the value.

# single match
>>> details = '2018-10-25,car,2346'
>>> re.search(r'(?P<date>[^,]+),(?P<product>[^,]+)', details).groupdict()
{'date': '2018-10-25', 'product': 'car'}

# normal groups won't be part of the output
>>> re.search(r'(?P<date>[^,]+),([^,]+)', details).groupdict()
{'date': '2018-10-25'}

# multiple matches
>>> s = 'good,bad 42,24'
>>> [m.groupdict() for m in re.finditer(r'(?P<fw>\w+),(?P<sw>\w+)', s)]
[{'fw': 'good', 'sw': 'bad'}, {'fw': '42', 'sw': '24'}]

Atomic grouping

(?>pat) is an atomic group (supported since Python 3.11), where pat is the pattern you want to safeguard from further backtracking. You can think of it as a special group that is isolated from the other parts of the regular expression.

Here’s an example with greedy quantifier:

>>> numbers = '42 314 001 12 00984'

# 0* is greedy and the (?>) grouping prevents backtracking
# same as: re.findall(r'0*+\d{3,}', numbers)
>>> re.findall(r'(?>0*)\d{3,}', numbers)
['314', '00984']

Here’s an example with non-greedy quantifier:

>>> ip = 'fig::mango::pineapple::guava::apples::orange'

# this matches from the first '::' to the first occurrence of '::apple'
>>> re.search(r'::.*?::apple', ip)[0]
'::mango::pineapple::guava::apple'

# '(?>::.*?::)' will match only from '::' to the very next '::'
# '::mango::' fails because 'apple' isn't found afterwards
# similarly '::pineapple::' fails
# '::guava::' succeeds because it is followed by 'apple'
>>> re.search(r'(?>::.*?::)apple', ip)[0]
'::guava::apple'

info The regex module has a feature to match from right-to-left making it better suited than atomic grouping for certain cases. See the regex.REVERSE flag section for some examples.

Conditional groups

This special grouping allows you to add a condition that depends on whether a capture group succeeded in matching. You can also add an optional else condition. The syntax as per the docs is shown below.

(?(id/name)yes-pattern|no-pattern)

Here id means the N used to backreference a capture group and name refers to the identifier used for a named capture group. Here’s an example with yes-pattern alone being used. The task is to match elements containing word characters only or if it additionally starts with a double quote, it must end with a double quote.

>>> words = ['"hi"', 'bye', 'bad"', '"good"', '42', '"3']
>>> pat = re.compile(r'(")?\w+(?(1)")')
>>> [w for w in words if pat.fullmatch(w)]
['"hi"', 'bye', '"good"', '42']

# for this simple case, you can also expand it manually
# but for complex patterns, it is better to use conditional groups
# as it will avoid repeating the complex pattern
>>> [w for w in words if re.fullmatch(r'"\w+"|\w+', w)]
['"hi"', 'bye', '"good"', '42']

# cannot simply use ? quantifier as they are independent, not constrained
>>> [w for w in words if re.fullmatch(r'"?\w+"?', w)]
['"hi"', 'bye', 'bad"', '"good"', '42', '"3']
# also, fullmatch plays a big role in avoiding partial matches
>>> [w for w in words if pat.search(w)]
['"hi"', 'bye', 'bad"', '"good"', '42', '"3']

Here’s an example with no-pattern as well.

# filter elements containing word characters surrounded by ()
# or, containing word characters separated by a hyphen
>>> words = ['(hi)', 'good-bye', 'bad', '(42)', '-oh', 'i-j', '(-)', '(oh-no)']

# same as: r'\(\w+\)|\w+-\w+'
>>> pat = re.compile(r'(\()?\w+(?(1)\)|-\w+)')
>>> [w for w in words if pat.fullmatch(w)]
['(hi)', 'good-bye', '(42)', 'i-j']

Conditional groups have a very specific use case, but they are quite handy when it applies. The main advantage is that they prevent pattern duplication, although that can also be achieved using Subexpression calls if you use the regex module.

Another issue with the duplicate-and-alternate approach is that you'll have to deal with different backreference numbers if the common pattern uses capture groups, as the sketch below illustrates.
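Here's a rough sketch (these examples aren't from the original text) using whole words made up of a repeated word, optionally surrounded by double quotes:

>>> words = ['"hi hi"', 'bye bye', 'oops no']

# conditional group version: the repeated word is always group 2
>>> pat_cond = re.compile(r'(")?(\w+) \2(?(1)")')
>>> [w for w in words if pat_cond.fullmatch(w)]
['"hi hi"', 'bye bye']

# duplicated alternation: the word is \1 in one alternative but \2 in the other
>>> pat_dup = re.compile(r'"(\w+) \1"|(\w+) \2')
>>> [w for w in words if pat_dup.fullmatch(w)]
['"hi hi"', 'bye bye']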

Match.expand()

The expand() method on re.Match objects accepts syntax similar to the replacement section of the re.sub() function. The difference is that the expand() method returns only the string after backreference expansion, instead of the entire input string with the modified content.

# re.sub vs Match.expand
>>> re.sub(r'w(.*)m', r'[\1]', 'awesome')
'a[eso]e'
>>> re.search(r'w(.*)m', 'awesome').expand(r'[\1]')
'[eso]'

# example with re.finditer
>>> dates = '2023/04/25,1986/03/02,77/12/31'
>>> m_iter = re.finditer(r'([^/]+)/([^/]+)/[^,]+,?', dates)
# same as: [f'Month:{m[2]}, Year:{m[1]}' for m in m_iter]
>>> [m.expand(r'Month:\2, Year:\1') for m in m_iter]
['Month:04, Year:2023', 'Month:03, Year:1986', 'Month:12, Year:77']

Cheatsheet and Summary

Note Description
\N backreference, gives the matched portion of the Nth capture group
applies to both the RE definition and the replacement section
possible values: \1, \2 up to \99, provided there are no more digits after
\0 and \NNN will be treated as octal escapes
\g<N> backreference, gives the matched portion of the Nth capture group
applies only to the replacement section
use escapes to prevent ambiguity in the RE definition
possible values: \g<0>, \g<1>, etc (not limited to 99)
\g<0> refers to the entire matched portion
(?:pat) non-capturing group
useful wherever grouping is required, but not backreferencing
(?P<name>pat) named capture group
refer to it as m['name'] on a re.Match object
refer to it as (?P=name) in the RE definition
refer to it as \g<name> in the replacement section
can also use the \N and \g<N> formats if needed
groupdict() method applied on a re.Match object
gives named capture group portions as a dict
(?>pat) atomic grouping, protects a pattern from backtracking
(?(id/name)yes|no) conditional group
match yes-pattern if the backreferenced group succeeded
else, match no-pattern, which is optional
expand() method applied on a re.Match object
accepts syntax like the replacement section of re.sub()
gives back only the string after backreference expansion

This chapter showed how to use backreferences to refer to the portion matched by capture groups in both the RE definition and the replacement section. When capture groups lead to unwanted behavior changes (for example, with re.findall() and re.split()), you can use non-capturing groups instead. Named capture groups add clarity to patterns and you can use the groupdict() method on a re.Match object to get a dict of matched portions. Atomic groups help you isolate a pattern from backtracking effects. Conditional groups allow you to take an action based on another capture group succeeding or failing to match. There are more special groups to be discussed in the coming chapters.

Exercises

a) Replace the space character that occurs after a word ending with a or r with a newline character.

>>> ip = 'area not a _a2_ roar took 22'

>>> print(re.sub())      ##### add your solution here
area
not a
_a2_ roar
took 22

b) Add [] around words starting with s and containing e and t in any order.

>>> ip = 'sequoia subtle exhibit asset sets2 tests si_te'

##### add your solution here
'sequoia [subtle] exhibit asset [sets2] tests [si_te]'

c) Replace all whole words with X that start and end with the same word character (irrespective of case). Single character word should get replaced with X too, as it satisfies the stated condition.

>>> ip = 'oreo not a _a2_ Roar took 22'

##### add your solution here
'X not X X X took X'

d) Convert the given markdown headers to corresponding anchor tags. Consider the input to start with one or more # characters followed by space and word characters. The name attribute is constructed by converting the header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?

>>> header1 = '# Regular Expressions'
>>> header2 = '## Compiling regular expressions'

##### add your solution here for header1
'# <a name="regular-expressions"></a>Regular Expressions'
##### add your solution here for header2
'## <a name="compiling-regular-expressions"></a>Compiling regular expressions'

e) Convert the given markdown anchors to corresponding hyperlinks.

>>> anchor1 = '# <a name="regular-expressions"></a>Regular Expressions'
>>> anchor2 = '## <a name="subexpression-calls"></a>Subexpression calls'

##### add your solution here for anchor1
'[Regular Expressions](#regular-expressions)'
##### add your solution here for anchor2
'[Subexpression calls](#subexpression-calls)'

f) Count the number of whole words that have at least two occurrences of consecutive repeated alphabets. For example, words like stillness and Committee should be counted but not words like root or readable or rotational.

>>> ip = '''oppressed abandon accommodation bloodless
... carelessness committed apparition innkeeper
... occasionally afforded embarrassment foolishness
... depended successfully succeeded
... possession cleanliness suppress'''

##### add your solution here
13

g) For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, 232323 should be changed to 23 and 897897 should be changed to 897. If there are no repeats (for example 1234) or if the repeats end prematurely (for example 12121), it should not be changed.

>>> ip = '1234 2323 453545354535 9339 11 60260260'

##### add your solution here
'1234 23 4535 9339 1 60260260'

h) Replace sequences made up of words separated by : or . by the first word of the sequence. Such sequences will end when : or . is not followed by a word character.

>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'

##### add your solution here
'wow hi-2 bye kite'

i) Replace sequences made up of words separated by : or . by the last word of the sequence. Such sequences will end when : or . is not followed by a word character.

>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'

##### add your solution here
'five hi-2 bye water'

j) Split the given input string on one or more repeated sequence of cat.

>>> ip = 'firecatlioncatcatcatbearcatcatparrot'

##### add your solution here
['fire', 'lion', 'bear', 'parrot']

k) For the given input string, find all occurrences of digit sequences with at least one repeating sequence. For example, 232323 and 897897. If the repeats end prematurely, for example 12121, it should not be matched.

>>> ip = '1234 2323 453545354535 9339 11 60260260'

>>> pat = re.compile()      ##### add your solution here

# entire sequences in the output
##### add your solution here
['2323', '453545354535', '11']

# only the unique sequence in the output
##### add your solution here
['23', '4535', '1']

l) Convert the comma separated strings to corresponding dict objects as shown below. The keys are name, maths and phy for the three fields in the input strings.

>>> row1 = 'rohan,75,89'
>>> row2 = 'rose,88,92'

>>> pat = re.compile()      ##### add your solution here

##### add your solution here for row1
{'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
{'name': 'rose', 'maths': '88', 'phy': '92'}

m) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete them. Can you do it with just a single substitution?

>>> ip = 'tiger imp goat eagle ant important'

##### add your solution here
'(tiger) () (goat) (eagle) () (important)'

n) Filter all elements that contain a sequence of lowercase alphabets followed by - followed by digits. They can be optionally surrounded by {{ and }}. Any partial match shouldn’t be part of the output.

>>> ip = ['{{apple-150}}', '{{mango2-100}}', '{{cherry-200', 'grape-87']

##### add your solution here
['{{apple-150}}', 'grape-87']

o) The given input string has sequences made up of words separated by : or . and such sequences will end when : or . is not followed by a word character. For all such sequences, display only the last word followed by - followed by the first word.

>>> ip = 'wow:Good:2_two.five: hi-2 bye kite.777:water.'

##### add your solution here
['five-wow', 'water-kite']

p) Modify the given regular expression such that it gives the expected result.

>>> ip = '( S:12 E:5 S:4 and E:123 ok S:100 & E:10 S:1 - E:2 S:42 E:43 )'

# wrong output
>>> re.findall(r'S:\d+.*?E:\d{2,}', ip)
['S:12 E:5 S:4 and E:123', 'S:100 & E:10', 'S:1 - E:2 S:42 E:43']

# expected output
##### add your solution here
['S:4 and E:123', 'S:100 & E:10', 'S:42 E:43']

Webtech was working fine for me for a while, but now I’m getting the following error.

Any ideas?

root@kali2019:/pentest/mobile/testing# pip3 install webtech
Requirement already satisfied: webtech in /usr/local/lib/python3.7/dist-packages (1.2.7)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from webtech) (2.18.4)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->webtech) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->webtech) (2018.8.24)
Requirement already satisfied: idna<2.7,>=2.5 in /usr/lib/python3/dist-packages (from requests->webtech) (2.6)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/lib/python3/dist-packages (from requests->webtech) (3.0.4)
root@kali2019:~# webtech -u https://xerosecurity.com
Traceback (most recent call last):
  File "/usr/local/bin/webtech", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/webtech/__main__.py", line 54, in main
    wt.start()
  File "/usr/local/lib/python3.7/dist-packages/webtech/webtech.py", line 132, in start
    temp_output = self.start_from_url(url)
  File "/usr/local/lib/python3.7/dist-packages/webtech/webtech.py", line 172, in start_from_url
    return self.perform(target)
  File "/usr/local/lib/python3.7/dist-packages/webtech/webtech.py", line 222, in perform
    target.check_html(tech, html)
  File "/usr/local/lib/python3.7/dist-packages/webtech/target.py", line 221, in check_html
    matches = re.search(source, self.data['html'], re.IGNORECASE)
  File "/usr/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "/usr/lib/python3.7/sre_parse.py", line 399, in _escape
    raise source.error("invalid group reference %d" % group, len(escape) - 1)
re.error: invalid group reference 1 at position 130

In the previous tutorial in this series, you covered a lot of ground. You saw how to use re.search() to perform pattern matching with regexes in Python and learned about the many regex metacharacters and parsing flags that you can use to fine-tune your pattern-matching capabilities.

But as great as all that is, the re module has much more to offer.

In this tutorial, you’ll:

  • Explore more functions, beyond re.search(), that the re module provides
  • Learn when and how to precompile a regex in Python into a regular expression object
  • Discover useful things that you can do with the match object returned by the functions in the re module

Ready? Let’s dig in!

re Module Functions

In addition to re.search(), the re module contains several other functions to help you perform regex-related tasks.

The available regex functions in the Python re module fall into the following three categories:

  1. Searching functions
  2. Substitution functions
  3. Utility functions

The following sections explain these functions in more detail.

Searching Functions

Searching functions scan a search string for one or more matches of the specified regex:

Function Description
re.search() Scans a string for a regex match
re.match() Looks for a regex match at the beginning of a string
re.fullmatch() Looks for a regex match on an entire string
re.findall() Returns a list of all regex matches in a string
re.finditer() Returns an iterator that yields regex matches from a string

As you can see from the table, these functions are similar to one another. But each one tweaks the searching functionality in its own way.

re.search(<regex>, <string>, flags=0)

Scans a string for a regex match.

If you worked through the previous tutorial in this series, then you should be well familiar with this function by now. re.search(<regex>, <string>) looks for any location in <string> where <regex> matches:

>>>

>>> re.search(r'(\d+)', 'foo123bar')
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> re.search(r'[a-z]+', '123FOO456', flags=re.IGNORECASE)
<_sre.SRE_Match object; span=(3, 6), match='FOO'>

>>> print(re.search(r'\d+', 'foo.bar'))
None

The function returns a match object if it finds a match and None otherwise.

re.match(<regex>, <string>, flags=0)

Looks for a regex match at the beginning of a string.

This is identical to re.search(), except that re.search() returns a match if <regex> matches anywhere in <string>, whereas re.match() returns a match only if <regex> matches at the beginning of <string>:

>>>

>>> re.search(r'\d+', '123foobar')
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> re.search(r'\d+', 'foo123bar')
<_sre.SRE_Match object; span=(3, 6), match='123'>

>>> re.match(r'\d+', '123foobar')
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> print(re.match(r'\d+', 'foo123bar'))
None

In the above example, re.search() matches when the digits are both at the beginning of the string and in the middle, but re.match() matches only when the digits are at the beginning.

Remember from the previous tutorial in this series that if <string> contains embedded newlines, then the MULTILINE flag causes re.search() to match the caret (^) anchor metacharacter either at the beginning of <string> or at the beginning of any line contained within <string>:

>>>

 1>>> s = 'foo\nbar\nbaz'
 2
 3>>> re.search('^foo', s)
 4<_sre.SRE_Match object; span=(0, 3), match='foo'>
 5>>> re.search('^bar', s, re.MULTILINE)
 6<_sre.SRE_Match object; span=(4, 7), match='bar'>

The MULTILINE flag does not affect re.match() in this way:

>>>

 1>>> s = 'foo\nbar\nbaz'
 2
 3>>> re.match('^foo', s)
 4<_sre.SRE_Match object; span=(0, 3), match='foo'>
 5>>> print(re.match('^bar', s, re.MULTILINE))
 6None

Even with the MULTILINE flag set, re.match() will match the caret (^) anchor only at the beginning of <string>, not at the beginning of lines contained within <string>.

Note that, although it illustrates the point, the caret (^) anchor on line 3 in the above example is redundant. With re.match(), matches are essentially always anchored at the beginning of the string.

re.fullmatch(<regex>, <string>, flags=0)

Looks for a regex match on an entire string.

This is similar to re.search() and re.match(), but re.fullmatch() returns a match only if <regex> matches <string> in its entirety:

>>>

 1>>> print(re.fullmatch(r'\d+', '123foo'))
 2None
 3>>> print(re.fullmatch(r'\d+', 'foo123'))
 4None
 5>>> print(re.fullmatch(r'\d+', 'foo123bar'))
 6None
 7>>> re.fullmatch(r'\d+', '123')
 8<_sre.SRE_Match object; span=(0, 3), match='123'>
 9
10>>> re.search(r'^\d+$', '123')
11<_sre.SRE_Match object; span=(0, 3), match='123'>

In the call on line 7, the search string '123' consists entirely of digits from beginning to end. So that is the only case in which re.fullmatch() returns a match.

The re.search() call on line 10, in which the \d+ regex is explicitly anchored at the start and end of the search string, is functionally equivalent.

re.findall(<regex>, <string>, flags=0)

Returns a list of all matches of a regex in a string.

re.findall(<regex>, <string>) returns a list of all non-overlapping matches of <regex> in <string>. It scans the search string from left to right and returns all matches in the order found:

>>>

>>> re.findall(r'\w+', '...foo,,,,bar:%$baz//|')
['foo', 'bar', 'baz']

If <regex> contains a capturing group, then the return list contains only contents of the group, not the entire match:

>>>

>>> re.findall(r'#(\w+)#', '#foo#.#bar#.#baz#')
['foo', 'bar', 'baz']

In this case, the specified regex is #(\w+)#. The matching strings are '#foo#', '#bar#', and '#baz#'. But the hash (#) characters don’t appear in the return list because they’re outside the grouping parentheses.

If <regex> contains more than one capturing group, then re.findall() returns a list of tuples containing the captured groups. The length of each tuple is equal to the number of groups specified:

>>>

 1>>> re.findall(r'(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')
 2[('foo', 'bar'), ('baz', 'qux'), ('quux', 'corge')]
 3
 4>>> re.findall(r'(\w+),(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')
 5[('foo', 'bar', 'baz'), ('qux', 'quux', 'corge')]

In the above example, the regex on line 1 contains two capturing groups, so re.findall() returns a list of three two-tuples, each containing two captured matches. Line 4 contains three groups, so the return value is a list of two three-tuples.

re.finditer(<regex>, <string>, flags=0)

Returns an iterator that yields regex matches.

re.finditer(<regex>, <string>) scans <string> for non-overlapping matches of <regex> and returns an iterator that yields the match objects from any it finds. It scans the search string from left to right and returns matches in the order it finds them:

>>>

>>> it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')
>>> next(it)
<_sre.SRE_Match object; span=(3, 6), match='foo'>
>>> next(it)
<_sre.SRE_Match object; span=(10, 13), match='bar'>
>>> next(it)
<_sre.SRE_Match object; span=(16, 19), match='baz'>
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

>>> for i in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
...     print(i)
...
<_sre.SRE_Match object; span=(3, 6), match='foo'>
<_sre.SRE_Match object; span=(10, 13), match='bar'>
<_sre.SRE_Match object; span=(16, 19), match='baz'>

re.findall() and re.finditer() are very similar, but they differ in two respects:

  1. re.findall() returns a list, whereas re.finditer() returns an iterator.

  2. The items in the list that re.findall() returns are the actual matching strings, whereas the items yielded by the iterator that re.finditer() returns are match objects.

Any task that you could accomplish with one, you could probably also manage with the other. Which one you choose will depend on the circumstances. As you’ll see later in this tutorial, a lot of useful information can be obtained from a match object. If you need that information, then re.finditer() will probably be the better choice.
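For example, each match object that re.finditer() yields carries extra details, such as the location of the match (a quick illustration):

>>> for m in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
...     print(m.group(), m.span())
...
foo (3, 6)
bar (10, 13)
baz (16, 19)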

Substitution Functions

Substitution functions replace portions of a search string that match a specified regex:

Function Description
re.sub() Scans a string for regex matches, replaces the matching portions of the string with the specified replacement string, and returns the result
re.subn() Behaves just like re.sub() but also returns information regarding the number of substitutions made

Both re.sub() and re.subn() create a new string with the specified substitutions and return it. The original string remains unchanged. (Remember that strings are immutable in Python, so it wouldn’t be possible for these functions to modify the original string.)

re.sub(<regex>, <repl>, <string>, count=0, flags=0)

Returns a new string that results from performing replacements on a search string.

re.sub(<regex>, <repl>, <string>) finds the leftmost non-overlapping occurrences of <regex> in <string>, replaces each match as indicated by <repl>, and returns the result. <string> remains unchanged.

<repl> can be either a string or a function, as explained below.

Substitution by String

If <repl> is a string, then re.sub() inserts it into <string> in place of any sequences that match <regex>:

>>>

 1>>> s = 'foo.123.bar.789.baz'
 2
 3>>> re.sub(r'\d+', '#', s)
 4'foo.#.bar.#.baz'
 5>>> re.sub('[a-z]+', '(*)', s)
 6'(*).123.(*).789.(*)'

On line 3, the string '#' replaces sequences of digits in s. On line 5, the string '(*)' replaces sequences of lowercase letters. In both cases, re.sub() returns the modified string as it always does.

re.sub() replaces numbered backreferences (\<n>) in <repl> with the text of the corresponding captured group:

>>>

>>> re.sub(r'(\w+),bar,baz,(\w+)',
...        r'\2,bar,baz,\1',
...        'foo,bar,baz,qux')
'qux,bar,baz,foo'

Here, captured groups 1 and 2 contain 'foo' and 'qux'. In the replacement string r'\2,bar,baz,\1', 'foo' replaces \1 and 'qux' replaces \2.

You can also refer to named backreferences created with (?P<name><regex>) in the replacement string using the metacharacter sequence \g<name>:

>>>

>>> re.sub(r'foo,(?P<w1>\w+),(?P<w2>\w+),qux',
...        r'foo,\g<w2>,\g<w1>,qux',
...        'foo,bar,baz,qux')
'foo,baz,bar,qux'

In fact, you can also refer to numbered backreferences this way by specifying the group number inside the angled brackets:

>>>

>>> re.sub(r'foo,(\w+),(\w+),qux',
...        r'foo,\g<2>,\g<1>,qux',
...        'foo,bar,baz,qux')
'foo,baz,bar,qux'

You may need to use this technique to avoid ambiguity in cases where a numbered backreference is immediately followed by a literal digit character. For example, suppose you have a string like 'foo 123 bar' and want to add a '0' at the end of the digit sequence. You might try this:

>>>

>>> re.sub(r'(\d+)', r'\10', 'foo 123 bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.6/re.py", line 326, in _subx
    template = _compile_repl(template, pattern)
  File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/usr/lib/python3.6/sre_parse.py", line 943, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
  File "/usr/lib/python3.6/sre_parse.py", line 887, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 10 at position 1

Alas, the regex parser in Python interprets \10 as a backreference to the tenth captured group, which doesn’t exist in this case. Instead, you can use \g<1> to refer to the group:

>>>

>>> re.sub(r'(\d+)', r'\g<1>0', 'foo 123 bar')
'foo 1230 bar'

The backreference \g<0> refers to the text of the entire match. This is valid even when there are no grouping parentheses in <regex>:

>>>

>>> re.sub(r'\d+', r'/\g<0>/', 'foo 123 bar')
'foo /123/ bar'

If <regex> specifies a zero-length match, then re.sub() will substitute <repl> into every character position in the string:

>>>

>>> re.sub('x*', '-', 'foo')
'-f-o-o-'

In the example above, the regex x* matches any zero-length sequence, so re.sub() inserts the replacement string at every character position in the string—before the first character, between each pair of characters, and after the last character.

If re.sub() doesn’t find any matches, then it always returns <string> unchanged.
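For example (a quick check):

>>> re.sub(r'\d+', '#', 'foo.bar.baz')
'foo.bar.baz'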

Substitution by Function

If you specify <repl> as a function, then re.sub() calls that function for each match found. It passes each corresponding match object as an argument to the function to provide information about the match. The function return value then becomes the replacement string:

>>>

>>> def f(match_obj):
...     s = match_obj.group(0)  # The matching string
...
...     # s.isdigit() returns True if all characters in s are digits
...     if s.isdigit():
...         return str(int(s) * 10)
...     else:
...         return s.upper()
...
>>> re.sub(r'\w+', f, 'foo.10.bar.20.baz.30')
'FOO.100.BAR.200.BAZ.300'

In this example, f() gets called for each match. As a result, re.sub() converts each alphanumeric portion of <string> to all uppercase and multiplies each numeric portion by 10.

Limiting the Number of Replacements

If you specify a positive integer for the optional count parameter, then re.sub() performs at most that many replacements:

>>>

>>> re.sub(r'\w+', 'xxx', 'foo.bar.baz.qux')
'xxx.xxx.xxx.xxx'
>>> re.sub(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2)
'xxx.xxx.baz.qux'

As with most re module functions, re.sub() accepts an optional <flags> argument as well.
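For example (a quick illustration):

>>> re.sub(r'ba[rz]', 'xxx', 'FOOBARBAZ', flags=re.IGNORECASE)
'FOOxxxxxx'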

re.subn(<regex>, <repl>, <string>, count=0, flags=0)

Returns a new string that results from performing replacements on a search string and also returns the number of substitutions made.

re.subn() is identical to re.sub(), except that re.subn() returns a two-tuple consisting of the modified string and the number of substitutions made:

>>>

>>> re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux')
('xxx.xxx.xxx.xxx', 4)
>>> re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2)
('xxx.xxx.baz.qux', 2)

>>> def f(match_obj):
...     m = match_obj.group(0)
...     if m.isdigit():
...         return str(int(m) * 10)
...     else:
...         return m.upper()
...
>>> re.subn(r'\w+', f, 'foo.10.bar.20.baz.30')
('FOO.100.BAR.200.BAZ.300', 6)

In all other respects, re.subn() behaves just like re.sub().

Utility Functions

There are two remaining regex functions in the Python re module that you’ve yet to cover:

Function Description
re.split() Splits a string into substrings using a regex as a delimiter
re.escape() Escapes characters in a regex

These are functions that involve regex matching but don’t clearly fall into either of the categories described above.

re.split(<regex>, <string>, maxsplit=0, flags=0)

Splits a string into substrings.

re.split(<regex>, <string>) splits <string> into substrings using <regex> as the delimiter and returns the substrings as a list.

The following example splits the specified string into substrings delimited by a comma (,), semicolon (;), or slash (/) character, surrounded by any amount of whitespace:

>>>

>>> re.split(r'\s*[,;/]\s*', 'foo,bar  ;  baz / qux')
['foo', 'bar', 'baz', 'qux']

If <regex> contains capturing groups, then the return list includes the matching delimiter strings as well:

>>>

>>> re.split(r'(\s*[,;/]\s*)', 'foo,bar  ;  baz / qux')
['foo', ',', 'bar', '  ;  ', 'baz', ' / ', 'qux']

This time, the return list contains not only the substrings 'foo', 'bar', 'baz', and 'qux' but also several delimiter strings:

  • ','
  • '  ;  '
  • ' / '

This can be useful if you want to split <string> apart into delimited tokens, process the tokens in some way, then piece the string back together using the same delimiters that originally separated them:

>>>

>>> string = 'foo,bar  ;  baz / qux'
>>> regex = r'(\s*[,;/]\s*)'
>>> a = re.split(regex, string)

>>> # List of tokens and delimiters
>>> a
['foo', ',', 'bar', '  ;  ', 'baz', ' / ', 'qux']

>>> # Enclose each token in <>'s
>>> for i, s in enumerate(a):
...
...     # This will be True for the tokens but not the delimiters
...     if not re.fullmatch(regex, s):
...         a[i] = f'<{s}>'
...

>>> # Put the tokens back together using the same delimiters
>>> ''.join(a)
'<foo>,<bar>  ;  <baz> / <qux>'

If you need to use groups but don’t want the delimiters included in the return list, then you can use noncapturing groups:

>>>

>>> string = 'foo,bar  ;  baz / qux'
>>> regex = r'(?:\s*[,;/]\s*)'
>>> re.split(regex, string)
['foo', 'bar', 'baz', 'qux']

If the optional maxsplit argument is present and greater than zero, then re.split() performs at most that many splits. The final element in the return list is the remainder of <string> after all the splits have occurred:

>>>

>>> s = 'foo, bar, baz, qux, quux, corge'

>>> re.split(r',\s*', s)
['foo', 'bar', 'baz', 'qux', 'quux', 'corge']
>>> re.split(r',\s*', s, maxsplit=3)
['foo', 'bar', 'baz', 'qux, quux, corge']

Explicitly specifying maxsplit=0 is equivalent to omitting it entirely. If maxsplit is negative, then re.split() performs no splits and returns <string> unchanged as the only element of the list (in case you were looking for a rather elaborate way of doing nothing at all).
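For example, reusing s from above:

>>> re.split(r',\s*', s, maxsplit=-1)
['foo, bar, baz, qux, quux, corge']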

If <regex> contains capturing groups so that the return list includes delimiters, and <regex> matches the start of <string>, then re.split() places an empty string as the first element in the return list. Similarly, the last item in the return list is an empty string if <regex> matches the end of <string>:

>>>

>>> re.split('(/)', '/foo/bar/baz/')
['', '/', 'foo', '/', 'bar', '/', 'baz', '/', '']

In this case, the <regex> delimiter is a single slash (/) character. In a sense, then, there’s an empty string to the left of the first delimiter and to the right of the last one. So it makes sense that re.split() places empty strings as the first and last elements of the return list.

re.escape(<regex>)

Escapes characters in a regex.

re.escape(<regex>) returns a copy of <regex> with each nonword character (anything other than a letter, digit, or underscore) preceded by a backslash.

This is useful if you’re calling one of the re module functions, and the <regex> you’re passing in has a lot of special characters that you want the parser to take literally instead of as metacharacters. It saves you the trouble of putting in all the backslash characters manually:

>>>

 1>>> print(re.match('foo^bar(baz)|qux', 'foo^bar(baz)|qux'))
 2None
 3>>> re.match(r'foo\^bar\(baz\)\|qux', 'foo^bar(baz)|qux')
 4<_sre.SRE_Match object; span=(0, 16), match='foo^bar(baz)|qux'>
 5
 6>>> re.escape('foo^bar(baz)|qux') == r'foo\^bar\(baz\)\|qux'
 7True
 8>>> re.match(re.escape('foo^bar(baz)|qux'), 'foo^bar(baz)|qux')
 9<_sre.SRE_Match object; span=(0, 16), match='foo^bar(baz)|qux'>

In this example, there isn’t a match on line 1 because the regex 'foo^bar(baz)|qux' contains special characters that behave as metacharacters. On line 3, they’re explicitly escaped with backslashes, so a match occurs. Lines 6 and 8 demonstrate that you can achieve the same effect using re.escape().

Compiled Regex Objects in Python

The re module supports the capability to precompile a regex in Python into a regular expression object that can be repeatedly used later.

re.compile(<regex>, flags=0)

Compiles a regex into a regular expression object.

re.compile(<regex>) compiles <regex> and returns the corresponding regular expression object. If you include a <flags> value, then the corresponding flags apply to any searches performed with the object.

There are two ways to use a compiled regular expression object. You can specify it as the first argument to the re module functions in place of <regex>:

re_obj = re.compile(<regex>, <flags>)
result = re.search(re_obj, <string>)

You can also invoke a method directly from a regular expression object:

re_obj = re.compile(<regex>, <flags>)
result = re_obj.search(<string>)

Both of the examples above are equivalent to this:

result = re.search(<regex>, <string>, <flags>)

Here’s one of the examples you saw previously, recast using a compiled regular expression object:

>>>

>>> re.search(r'(\d+)', 'foo123bar')
<_sre.SRE_Match object; span=(3, 6), match='123'>

>>> re_obj = re.compile(r'(\d+)')
>>> re.search(re_obj, 'foo123bar')
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> re_obj.search('foo123bar')
<_sre.SRE_Match object; span=(3, 6), match='123'>

Here’s another, which also uses the IGNORECASE flag:

>>>

 1>>> r1 = re.search('ba[rz]', 'FOOBARBAZ', flags=re.I)
 2
 3>>> re_obj = re.compile('ba[rz]', flags=re.I)
 4>>> r2 = re.search(re_obj, 'FOOBARBAZ')
 5>>> r3 = re_obj.search('FOOBARBAZ')
 6
 7>>> r1
 8<_sre.SRE_Match object; span=(3, 6), match='BAR'>
 9>>> r2
10<_sre.SRE_Match object; span=(3, 6), match='BAR'>
11>>> r3
12<_sre.SRE_Match object; span=(3, 6), match='BAR'>

In this example, the statement on line 1 specifies regex ba[rz] directly to re.search() as the first argument. On line 4, the first argument to re.search() is the compiled regular expression object re_obj. On line 5, search() is invoked directly on re_obj. All three cases produce the same match.

Why Bother Compiling a Regex?

What good is precompiling? There are a couple of possible advantages.

If you use a particular regex in your Python code frequently, then precompiling allows you to separate out the regex definition from its uses. This enhances modularity. Consider this example:

>>>

>>> s1, s2, s3, s4 = 'foo.bar', 'foo123bar', 'baz99', 'qux & grault'

>>> import re
>>> re.search(r'\d+', s1)
>>> re.search(r'\d+', s2)
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> re.search(r'\d+', s3)
<_sre.SRE_Match object; span=(3, 5), match='99'>
>>> re.search(r'\d+', s4)

Here, the regex \d+ appears several times. If, in the course of maintaining this code, you decide you need a different regex, then you’ll need to change it in each location. That’s not so bad in this small example because the uses are close to one another. But in a larger application, they might be widely scattered and difficult to track down.

The following is more modular and more maintainable:

>>>

>>> s1, s2, s3, s4 = 'foo.bar', 'foo123bar', 'baz99', 'qux & grault'
>>> re_obj = re.compile(r'\d+')

>>> re_obj.search(s1)
>>> re_obj.search(s2)
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> re_obj.search(s3)
<_sre.SRE_Match object; span=(3, 5), match='99'>
>>> re_obj.search(s4)

Then again, you can achieve similar modularity without precompiling by using variable assignment:

>>>

>>> s1, s2, s3, s4 = 'foo.bar', 'foo123bar', 'baz99', 'qux & grault'
>>> regex = r'\d+'

>>> re.search(regex, s1)
>>> re.search(regex, s2)
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> re.search(regex, s3)
<_sre.SRE_Match object; span=(3, 5), match='99'>
>>> re.search(regex, s4)

In theory, you might expect precompilation to result in faster execution time as well. Suppose you call re.search() many thousands of times on the same regex. It might seem like compiling the regex once ahead of time would be more efficient than recompiling it each of the thousands of times it’s used.

In practice, though, that isn’t the case. The truth is that the re module compiles and caches a regex when it’s used in a function call. If the same regex is used subsequently in the same Python code, then it isn’t recompiled. The compiled value is fetched from cache instead. So the performance advantage is minimal.
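If you want to see this for yourself, a rough timing sketch along these lines (the variable names here are arbitrary and the numbers will vary by machine) compares the two approaches:

import re
import timeit

re_obj = re.compile(r'\d+')

# re.search() fetches the already-compiled pattern from the module's
# internal cache on each call, so it skips recompilation; any difference
# between these two timings is mostly the cache-lookup overhead
t_precompiled = timeit.timeit(lambda: re_obj.search('foo123bar'), number=100_000)
t_module_level = timeit.timeit(lambda: re.search(r'\d+', 'foo123bar'), number=100_000)
print(t_precompiled, t_module_level)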

All in all, there isn’t any immensely compelling reason to compile a regex in Python. Like much of Python, it’s just one more tool in your toolkit that you can use if you feel it will improve the readability or structure of your code.

Regular Expression Object Methods

A compiled regular expression object re_obj supports the following methods:

  • re_obj.search(<string>[, <pos>[, <endpos>]])
  • re_obj.match(<string>[, <pos>[, <endpos>]])
  • re_obj.fullmatch(<string>[, <pos>[, <endpos>]])
  • re_obj.findall(<string>[, <pos>[, <endpos>]])
  • re_obj.finditer(<string>[, <pos>[, <endpos>]])

These all behave the same way as the corresponding re functions that you’ve already encountered, with the exception that they also support the optional <pos> and <endpos> parameters. If these are present, then the search only applies to the portion of <string> indicated by <pos> and <endpos>, which act the same way as indices in slice notation:

>>>

 1>>> re_obj = re.compile(r'\d+')
 2>>> s = 'foo123barbaz'
 3
 4>>> re_obj.search(s)
 5<_sre.SRE_Match object; span=(3, 6), match='123'>
 6
 7>>> s[6:9]
 8'bar'
 9>>> print(re_obj.search(s, 6, 9))
10None

In the above example, the regex is \d+, a sequence of digit characters. The .search() call on line 4 searches all of s, so there’s a match. On line 9, the <pos> and <endpos> parameters effectively restrict the search to the substring starting with character 6 and going up to but not including character 9 (the substring 'bar'), which doesn’t contain any digits.

If you specify <pos> but omit <endpos>, then the search applies to the substring from <pos> to the end of the string.
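For example, reusing re_obj and s from the snippet above (a quick illustration):

>>> print(re_obj.search(s, 6))
None
>>> re_obj.search(s, 3)
<_sre.SRE_Match object; span=(3, 6), match='123'>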

Note that anchors such as caret (^) and dollar sign ($) still refer to the start and end of the entire string, not the substring determined by <pos> and <endpos>:

>>>

>>> re_obj = re.compile('^bar')
>>> s = 'foobarbaz'

>>> s[3:]
'barbaz'

>>> print(re_obj.search(s, 3))
None

Here, even though 'bar' does occur at the start of the substring beginning at character 3, it isn’t at the start of the entire string, so the caret (^) anchor fails to match.

The following methods are available for a compiled regular expression object re_obj as well:

  • re_obj.split(<string>, maxsplit=0)
  • re_obj.sub(<repl>, <string>, count=0)
  • re_obj.subn(<repl>, <string>, count=0)

These also behave analogously to the corresponding re functions, but they don’t support the <pos> and <endpos> parameters.
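For example (a brief sketch mirroring the earlier re.sub() and re.subn() examples):

>>> re_obj = re.compile(r'\w+')
>>> re_obj.sub('xxx', 'foo.bar.baz.qux', count=2)
'xxx.xxx.baz.qux'
>>> re_obj.subn('xxx', 'foo.bar.baz.qux')
('xxx.xxx.xxx.xxx', 4)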

Regular Expression Object Attributes

The re module defines several useful attributes for a compiled regular expression object:

Attribute Meaning
re_obj.flags Any <flags> that are in effect for the regex
re_obj.groups The number of capturing groups in the regex
re_obj.groupindex A dictionary mapping each symbolic group name defined by the (?P<name>) construct (if any) to the corresponding group number
re_obj.pattern The <regex> pattern that produced this object

The code below demonstrates some uses of these attributes:

>>>

 1>>> re_obj = re.compile(r'(?m)(\w+),(\w+)', re.I)
 2>>> re_obj.flags
 342
 4>>> re.I|re.M|re.UNICODE
 5<RegexFlag.UNICODE|MULTILINE|IGNORECASE: 42>
 6>>> re_obj.groups
 72
 8>>> re_obj.pattern
 9'(?m)(\w+),(\w+)'
10
11>>> re_obj = re.compile(r'(?P<w1>),(?P<w2>)')
12>>> re_obj.groupindex
13mappingproxy({'w1': 1, 'w2': 2})
14>>> re_obj.groupindex['w1']
151
16>>> re_obj.groupindex['w2']
172

Note that .flags includes any flags specified as arguments to re.compile(), any specified within the regex with the (?flags) metacharacter sequence, and any that are in effect by default. In the regular expression object defined on line 1, there are three flags defined:

  1. re.I: Specified as a <flags> value in the re.compile() call
  2. re.M: Specified as (?m) within the regex
  3. re.UNICODE: Enabled by default

You can see on line 4 that the value of re_obj.flags is the logical OR of these three values, which equals 42.

The value of the .groupindex attribute for the regular expression object defined on line 11 is technically an object of type mappingproxy. For practical purposes, it functions like a dictionary.
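One practical difference worth knowing (a quick check): a mappingproxy is read-only, so you can copy it into a regular dict but not modify it in place:

>>> dict(re_obj.groupindex)
{'w1': 1, 'w2': 2}
>>> re_obj.groupindex['w1'] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'mappingproxy' object does not support item assignment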

Match Object Methods and Attributes

As you’ve seen, most functions and methods in the re module return a match object when there’s a successful match. Because a match object is truthy, you can use it in a conditional:

>>>

>>> m = re.search('bar', 'foo.bar.baz')
>>> m
<_sre.SRE_Match object; span=(4, 7), match='bar'>
>>> bool(m)
True

>>> if re.search('bar', 'foo.bar.baz'):
...     print('Found a match')
...
Found a match

But match objects also contain quite a bit of handy information about the match. You’ve already seen some of it—the span= and match= data that the interpreter shows when it displays a match object. You can obtain much more from a match object using its methods and attributes.

Match Object Methods

The table below summarizes the methods that are available for a match object match:

Method Returns
match.group() The specified captured group or groups from match
match.__getitem__() A captured group from match
match.groups() All the captured groups from match
match.groupdict() A dictionary of named captured groups from match
match.expand() The result of performing backreference substitutions from match
match.start() The starting index of match
match.end() The ending index of match
match.span() Both the starting and ending indices of match as a tuple

The following sections describe these methods in more detail.

match.group([<group1>, ...])

Returns the specified captured group(s) from a match.

For numbered groups, match.group(n) returns the nth group:

>>>

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.group(1)
'foo'
>>> m.group(3)
'baz'

If you capture groups using (?P<name><regex>), then match.group(<name>) returns the corresponding named group:

>>>

>>> m = re.match(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'quux,corge,grault')
>>> m.group('w1')
'quux'
>>> m.group('w3')
'grault'

With more than one argument, .group() returns a tuple of all the groups specified. A given group can appear multiple times, and you can specify any captured groups in any order:

>>>

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.group(1, 3)
('foo', 'baz')
>>> m.group(3, 3, 1, 1, 2, 2)
('baz', 'baz', 'foo', 'foo', 'bar', 'bar')

>>> m = re.match(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'quux,corge,grault')
>>> m.group('w3', 'w1', 'w1', 'w2')
('grault', 'quux', 'quux', 'corge')

If you specify a group that’s out of range or nonexistent, then .group() raises an IndexError exception:

>>>

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.group(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

>>> m = re.match(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'quux,corge,grault')
>>> m.group('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

It’s possible for a regex in Python to match as a whole but to contain a group that doesn’t participate in the match. In that case, .group() returns None for the nonparticipating group. Consider this example:

>>>

>>> m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
>>> m
<_sre.SRE_Match object; span=(0, 8), match='foo,bar,'>
>>> m.group(1, 2)
('foo', 'bar')

This regex matches, as you can see from the match object. The first two captured groups contain 'foo' and 'bar', respectively.

A question mark (?) quantifier metacharacter follows the third group, though, so that group is optional. A match will occur if there’s a third sequence of word characters following the second comma (,) but also if there isn’t.

In this case, there isn’t. So there’s a match overall, but the third group doesn’t participate in it. As a result, m.group(3) is still defined and is a valid reference, but it returns None:

>>>

>>> print(m.group(3))
None

It can also happen that a group participates in the overall match multiple times. If you call .group() for that group number, then it returns only the part of the search string that matched the last time. The earlier matches aren’t accessible:

>>>

>>> m = re.match(r'(\w{3},)+', 'foo,bar,baz,qux')
>>> m
<_sre.SRE_Match object; span=(0, 12), match='foo,bar,baz,'>
>>> m.group(1)
'baz,'

In this example, the full match is 'foo,bar,baz,', as shown by the displayed match object. Each of 'foo,', 'bar,', and 'baz,' matches what’s inside the group, but m.group(1) returns only the last match, 'baz,'.

If you call .group() with an argument of 0 or no argument at all, then it returns the entire match:

>>>

 1>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
 2>>> m
 3<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
 4
 5>>> m.group(0)
 6'foo,bar,baz'
 7>>> m.group()
 8'foo,bar,baz'

This is the same data the interpreter shows following match= when it displays the match object, as you can see on line 3 above.

match.__getitem__(<grp>)

Returns a captured group from a match.

match.__getitem__(<grp>) is identical to match.group(<grp>) and returns the single group specified by <grp>:

>>>

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.group(2)
'bar'
>>> m.__getitem__(2)
'bar'

If .__getitem__() simply replicates the functionality of .group(), then why would you use it? You probably wouldn’t directly, but you might indirectly. Read on to see why.

A Brief Introduction to Magic Methods

.__getitem__() is one of a collection of methods in Python called magic methods. These are special methods that the interpreter calls when a Python statement contains specific corresponding syntactical elements.

The particular syntax that .__getitem__() corresponds to is indexing with square brackets. For any object obj, whenever you use the expression obj[n], behind the scenes Python quietly translates it to a call to .__getitem__(). The following expressions are effectively equivalent:

obj[n]
obj.__getitem__(n)

The syntax obj[n] is only meaningful if a .__getitem__() method exists for the class or type to which obj belongs. Exactly how Python interprets obj[n] will then depend on the implementation of .__getitem__() for that class.
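
As an illustration (not part of the re module), here is a minimal sketch of a hypothetical class that defines .__getitem__(), so you can see square-bracket indexing being routed through that method:

class Squares:
    """Hypothetical example: obj[n] is translated to obj.__getitem__(n)."""
    def __getitem__(self, n):
        return n * n

sq = Squares()
print(sq[5])              # prints: 25 (Python calls sq.__getitem__(5) behind the scenes)
print(sq.__getitem__(5))  # prints: 25 (the explicit, equivalent call)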

Back to Match Objects

As of Python version 3.6, the re module does implement .__getitem__() for match objects. The implementation is such that match.__getitem__(n) is the same as match.group(n).

The result of all this is that, instead of calling .group() directly, you can access captured groups from a match object using square-bracket indexing syntax:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.group(2)
'bar'
>>> m.__getitem__(2)
'bar'
>>> m[2]
'bar'

This works with named captured groups as well:

>>> m = re.match(
...     r'foo,(?P<w1>\w+),(?P<w2>\w+),qux',
...     'foo,bar,baz,qux')
>>> m.group('w2')
'baz'
>>> m['w2']
'baz'

This is something you could achieve by just calling .group() explicitly, but it’s a convenient shorthand nonetheless.

When a programming language provides alternate syntax that isn’t strictly necessary but allows for the expression of something in a cleaner, easier-to-read way, it’s called syntactic sugar. For a match object, match[n] is syntactic sugar for match.group(n).

match.groups(default=None)

Returns all captured groups from a match.

match.groups() returns a tuple of all captured groups:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.groups()
('foo', 'bar', 'baz')

As you saw previously, when a group in a regex in Python doesn’t participate in the overall match, .group() returns None for that group. By default, .groups() does likewise.

If you want .groups() to return something else in this situation, then you can use the default keyword argument:

 1>>> m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
 2>>> m
 3<_sre.SRE_Match object; span=(0, 8), match='foo,bar,'>
 4>>> print(m.group(3))
 5None
 6
 7>>> m.groups()
 8('foo', 'bar', None)
 9>>> m.groups(default='---')
10('foo', 'bar', '---')

Here, the third (w+) group doesn’t participate in the match because the question mark (?) metacharacter makes it optional, and the string 'foo,bar,' doesn’t contain a third sequence of word characters. By default, m.groups() returns None for the third group, as shown on line 8. On line 10, you can see that specifying default='---' causes it to return the string '---' instead.

There isn’t any corresponding default keyword for .group(). It always returns None for nonparticipating groups.
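
If you do need a fallback value from .group(), one workaround (just a sketch, not a feature of the re module) is to apply the default yourself after the call:

import re

m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')

# .group() has no default parameter, so supply the fallback manually
third = m.group(3)
third = '---' if third is None else third
print(third)  # prints: ---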

match.groupdict(default=None)

Returns a dictionary of named captured groups.

match.groupdict() returns a dictionary of all named groups captured with the (?P<name><regex>) metacharacter sequence. The dictionary keys are the group names and the dictionary values are the corresponding group values:

>>> m = re.match(
...     r'foo,(?P<w1>\w+),(?P<w2>\w+),qux',
...     'foo,bar,baz,qux')
>>> m.groupdict()
{'w1': 'bar', 'w2': 'baz'}
>>> m.groupdict()['w2']
'baz'

As with .groups(), for .groupdict() the default argument determines the return value for nonparticipating groups:

>>> m = re.match(
...     r'foo,(?P<w1>\w+),(?P<w2>\w+)?,qux',
...     'foo,bar,,qux')
>>> m.groupdict()
{'w1': 'bar', 'w2': None}
>>> m.groupdict(default='---')
{'w1': 'bar', 'w2': '---'}

Again, the final group (?P<w2>w+) doesn’t participate in the overall match because of the question mark (?) metacharacter. By default, m.groupdict() returns None for this group, but you can change it with the default argument.

match.expand(<template>)

Performs backreference substitutions from a match.

match.expand(<template>) returns the string that results from performing backreference substitution on <template> exactly as re.sub() would do:

 1>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
 2>>> m
 3<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
 4>>> m.groups()
 5('foo', 'bar', 'baz')
 6
 7>>> m.expand(r'\2')
 8'bar'
 9>>> m.expand(r'[\3] -> [\1]')
10'[baz] -> [foo]'
11
12>>> m = re.search(r'(?P<num>\d+)', 'foo123qux')
13>>> m
14<_sre.SRE_Match object; span=(3, 6), match='123'>
15>>> m.group(1)
16'123'
17
18>>> m.expand(r'--- \g<num> ---')
19'--- 123 ---'

This works for numeric backreferences, as on lines 7 and 9 above, and also for named backreferences, as on line 18.
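
To see the equivalence with re.sub(), here is a small sketch that applies the same template both ways; the names pattern and template are just local variables chosen for this example:

import re

pattern = r'(\w+),(\w+),(\w+)'
template = r'[\3] -> [\1]'
s = 'foo,bar,baz'

m = re.search(pattern, s)
print(m.expand(template))            # prints: [baz] -> [foo]
print(re.sub(pattern, template, s))  # prints: [baz] -> [foo], same substitution logic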

match.start([<grp>])

match.end([<grp>])

Return the starting and ending indices of the match.

match.start() returns the index in the search string where the match begins, and match.end() returns the index immediately after where the match ends:

 1>>> s = 'foo123bar456baz'
 2>>> m = re.search(r'\d+', s)
 3>>> m
 4<_sre.SRE_Match object; span=(3, 6), match='123'>
 5>>> m.start()
 63
 7>>> m.end()
 86

When Python displays a match object, these are the values listed with the span= keyword, as shown on line 4 above. They behave like string-slicing values, so if you use them to slice the original search string, then you should get the matching substring:

>>> m
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> s[m.start():m.end()]
'123'

match.start(<grp>) and match.end(<grp>) return the starting and ending indices of the substring matched by <grp>, which may be a numbered or named group:

>>> s = 'foo123bar456baz'
>>> m = re.search(r'(\d+)\D*(?P<num>\d+)', s)

>>> m.group(1)
'123'
>>> m.start(1), m.end(1)
(3, 6)
>>> s[m.start(1):m.end(1)]
'123'

>>> m.group('num')
'456'
>>> m.start('num'), m.end('num')
(9, 12)
>>> s[m.start('num'):m.end('num')]
'456'

If the specified group matches a null string, then .start() and .end() are equal:

>>> m = re.search(r'foo(\d*)bar', 'foobar')
>>> m[1]
''
>>> m.start(1), m.end(1)
(3, 3)

This makes sense if you remember that .start() and .end() act like slicing indices. Any string slice where the beginning and ending indices are equal will always be an empty string.

A special case occurs when the regex contains a group that doesn’t participate in the match:

>>> m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
>>> print(m.group(3))
None
>>> m.start(3), m.end(3)
(-1, -1)

As you’ve seen previously, in this case the third group doesn’t participate. m.start(3) and m.end(3) aren’t really meaningful here, so they return -1.
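
You can use this behavior to test whether a group participated at all. Here is a minimal sketch; participated() is a hypothetical helper written for this example:

import re

m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')

def participated(match, grp):
    # A nonparticipating group reports a span of (-1, -1)
    return match.span(grp) != (-1, -1)

print(participated(m, 1))  # prints: True
print(participated(m, 3))  # prints: False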

match.span([<grp>])

Returns both the starting and ending indices of the match.

match.span() returns both the starting and ending indices of the match as a tuple. If you specified <grp>, then the return tuple applies to the given group:

>>> s = 'foo123bar456baz'
>>> m = re.search(r'(\d+)\D*(?P<num>\d+)', s)
>>> m
<_sre.SRE_Match object; span=(3, 12), match='123bar456'>

>>> m[0]
'123bar456'
>>> m.span()
(3, 12)

>>> m[1]
'123'
>>> m.span(1)
(3, 6)

>>> m['num']
'456'
>>> m.span('num')
(9, 12)

The following are effectively equivalent:

  • match.span(<grp>)
  • (match.start(<grp>), match.end(<grp>))

match.span() just provides a convenient way to obtain both match.start() and match.end() in one method call.
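
Since .span() returns the two indices as a tuple, you can unpack it straight into a slice of the search string. A short sketch:

import re

s = 'foo123bar456baz'
m = re.search(r'(\d+)\D*(?P<num>\d+)', s)

# Unpack the span of a group directly into slice indices
start, end = m.span('num')
print(s[start:end])          # prints: 456

# slice(*m.span(...)) does the same thing in a single expression
print(s[slice(*m.span(1))])  # prints: 123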

Match Object Attributes

Like a compiled regular expression object, a match object also has several useful attributes available:

Attribute        Meaning
match.pos,
match.endpos     The effective values of the <pos> and <endpos> arguments for the match
match.lastindex  The index of the last captured group
match.lastgroup  The name of the last captured group
match.re         The compiled regular expression object for the match
match.string     The search string for the match

The following sections provide more detail on these match object attributes.

match.pos

match.endpos

Contain the effective values of <pos> and <endpos> for the search.

Remember that some methods, when invoked on a compiled regex, accept optional <pos> and <endpos> arguments that limit the search to a portion of the specified search string. These values are accessible from the match object with the .pos and .endpos attributes:

>>> re_obj = re.compile(r'\d+')
>>> m = re_obj.search('foo123bar', 2, 7)
>>> m
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> m.pos, m.endpos
(2, 7)

If the <pos> and <endpos> arguments aren’t included in the call, either because they were omitted or because the function in question doesn’t accept them, then the .pos and .endpos attributes effectively indicate the start and end of the string:

 1>>> re_obj = re.compile(r'\d+')
 2>>> m = re_obj.search('foo123bar')
 3>>> m
 4<_sre.SRE_Match object; span=(3, 6), match='123'>
 5>>> m.pos, m.endpos
 6(0, 9)
 7
 8>>> m = re.search(r'\d+', 'foo123bar')
 9>>> m
10<_sre.SRE_Match object; span=(3, 6), match='123'>
11>>> m.pos, m.endpos
12(0, 9)

The re_obj.search() call above on line 2 could take <pos> and <endpos> arguments, but they aren’t specified. The re.search() call on line 8 can’t take them at all. In either case, m.pos and m.endpos are 0 and 9, the starting and ending indices of the search string 'foo123bar'.

match.lastindex

Contains the index of the last captured group.

match.lastindex is equal to the integer index of the last captured group:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.lastindex
3
>>> m[m.lastindex]
'baz'

In cases where the regex contains groups that might not participate, this lets you determine the last group that actually did participate in the match:

>>> m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,baz')
>>> m.groups()
('foo', 'bar', 'baz')
>>> m.lastindex, m[m.lastindex]
(3, 'baz')

>>> m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
>>> m.groups()
('foo', 'bar', None)
>>> m.lastindex, m[m.lastindex]
(2, 'bar')

In the first example, the third group, which is optional because of the question mark (?) metacharacter, does participate in the match. But in the second example it doesn’t. You can tell because m.lastindex is 3 in the first case and 2 in the second.

There’s a subtle point to be aware of regarding .lastindex. It isn’t always the case that the last group to match is also the last group encountered syntactically. The Python documentation gives this example:

>>> m = re.match('((a)(b))', 'ab')
>>> m.groups()
('ab', 'a', 'b')
>>> m.lastindex
1
>>> m[m.lastindex]
'ab'

The outermost group is ((a)(b)), which matches 'ab'. This is the first group the parser encounters, so it becomes group 1. But it’s also the last group to match, which is why m.lastindex is 1.

The second and third groups the parser recognizes are (a) and (b). These are groups 2 and 3, but they match before group 1 does.

match.lastgroup

Contains the name of the last captured group.

If the last captured group originates from the (?P<name><regex>) metacharacter sequence, then match.lastgroup returns the name of that group:

>>> s = 'foo123bar456baz'
>>> m = re.search(r'(?P<n1>\d+)\D*(?P<n2>\d+)', s)
>>> m.lastgroup
'n2'

match.lastgroup returns None if the last captured group isn’t a named group:

>>> s = 'foo123bar456baz'

>>> m = re.search(r'(\d+)\D*(\d+)', s)
>>> m.groups()
('123', '456')
>>> print(m.lastgroup)
None

>>> m = re.search(r'\d+\D*\d+', s)
>>> m.groups()
()
>>> print(m.lastgroup)
None

As shown above, this can be either because the last captured group isn’t a named group or because there were no captured groups at all.
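
One situation where .lastgroup shines is a regex built from several named alternatives, where you want to know which alternative matched. The sketch below follows the same idea as the tokenizer example in the official documentation; the token names are hypothetical:

import re

# Hypothetical token pattern: each alternative is a named group
token_re = re.compile(r'(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[,;])')

# .lastgroup names the alternative that actually matched for each match object
for m in token_re.finditer('foo,42;bar'):
    print(m.lastgroup, m.group())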

match.re

Contains the regular expression object for the match.

match.re contains the regular expression object that produced the match. This is the same object you’d get if you passed the regex to re.compile():

 1>>> regex = r'(\w+),(\w+),(\w+)'
 2
 3>>> m1 = re.search(regex, 'foo,bar,baz')
 4>>> m1
 5<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
 6>>> m1.re
 7re.compile('(\\w+),(\\w+),(\\w+)')
 8
 9>>> re_obj = re.compile(regex)
10>>> re_obj
11re.compile('(\\w+),(\\w+),(\\w+)')
12>>> re_obj is m1.re
13True
14
15>>> m2 = re_obj.search('qux,quux,corge')
16>>> m2
17<_sre.SRE_Match object; span=(0, 14), match='qux,quux,corge'>
18>>> m2.re
19re.compile('(\\w+),(\\w+),(\\w+)')
20>>> m2.re is re_obj is m1.re
21True

Remember from earlier that the re module caches regular expressions after it compiles them, so they don’t need to be recompiled if used again. For that reason, as the identity comparisons on lines 12 and 20 show, all the various regular expression objects in the above example are the exact same object.

Once you have access to the regular expression object for the match, all of that object’s attributes are available as well:

>>> m1.re.groups
3
>>> m1.re.pattern
'(\\w+),(\\w+),(\\w+)'
>>> m1.re.pattern == regex
True
>>> m1.re.flags
32

You can also invoke any of the methods defined for a compiled regular expression object on it:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.re
re.compile('(\\w+),(\\w+),(\\w+)')

>>> m.re.match('quux,corge,grault')
<_sre.SRE_Match object; span=(0, 17), match='quux,corge,grault'>

Here, .match() is invoked on m.re to perform another search using the same regex but on a different search string.

match.string

Contains the search string for a match.

match.string contains the search string that is the target of the match:

>>> m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
>>> m.string
'foo,bar,baz'

>>> re_obj = re.compile(r'(\w+),(\w+),(\w+)')
>>> m = re_obj.search('foo,bar,baz')
>>> m.string
'foo,bar,baz'

As you can see from the example, the .string attribute is available when the match object derives from a compiled regular expression object as well.

Conclusion

That concludes your tour of Python’s re module!

This introductory series contains two tutorials on regular expression processing in Python. If you’ve worked through both the previous tutorial and this one, then you should now know how to:

  • Make full use of all the functions that the re module provides
  • Precompile a regex in Python
  • Extract information from match objects

Regular expressions are extremely versatile and powerful—literally a language in their own right. You’ll find them invaluable in your Python coding.

Next up in this series, you’ll explore how Python avoids conflict between identifiers in different areas of code. As you’ve already seen, each function in Python has its own namespace, distinct from those of other functions. In the next tutorial, you’ll learn how namespaces are implemented in Python and how they define variable scope.

This must be a duplicate, but I couldn’t find it …

I’m using a group to match a repeated substring. However, I don’t want the group to be captured. This seems like a contradiction.

To be precise, suppose I want to find any character that follows 3 exact copies of an all-uppercase substring. For

s = 'hjgABABABfgfBBdqCCCugDDD'
              |         |

it should return

['f', 'u']

I can find the repeated strings and the following character just fine:

import re
print(re.findall(r'([A-Z]+)\1{2}(.)', s))

which gives

[('AB', 'f'), ('C', 'u')]

I can easily post-process the resulting list and keep only the second items. But is there a regex-only way to get just the second elements in the first place? If I try

print(re.findall(r'(?:[A-Z]+)\1{2}(.)', s))

I get

    raise source.error("invalid group reference", len(escape))
sre_constants.error: invalid group reference at position 10

I’d appreciate a brief confirmation that the problem really is the conflict between wanting a non-capturing group and the capture that is needed to detect the repetition, and then a clever idea for how to achieve the goal cleanly.

2 answers

Best answer

The reason this doesn’t work is that when you write \1, you are essentially saying “the contents of the first group”, which is of course undefined if the group doesn’t capture.


2

Dotan
27 Apr 2017 at 10:49

re.findall will always return a list. If you define more than one capture group in the pattern, you cannot use a “regex-only” approach.

Use re.finditer to get all the match objects and take only the contents of group 2 from each match:

print([x.group(2) for x in re.finditer(r'([A-Z]+)\1{2}(.)', s)])

See the Python demo.


2

Wiktor Stribiżew
27 Apr 2017 at 09:37
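
For reference, here is a self-contained sketch of the suggested approach:

import re

s = 'hjgABABABfgfBBdqCCCugDDD'

# Keep the capture group (the backreference \1 needs it),
# then read only group 2 from each match object
result = [m.group(2) for m in re.finditer(r'([A-Z]+)\1{2}(.)', s)]
print(result)  # prints: ['f', 'u']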
