Regular Expressions

September 12, 2023 | python

re CS #

Look-arounds #

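  • Lookahead and lookbehind are collectively called “lookarounds”.
  • Lookarounds are zero-width assertions: they test for a pattern without consuming any characters.
  • A lookaround checks whether a regex matches to the right (lookahead) or to the left (lookbehind) of the current position. A positive lookaround succeeds when that regex matches, and a negative one succeeds when it does not. Either way, the text it examined is not included in the overall match.
  • “Ahead” and “behind” are relative to the left-to-right direction in which the string is scanned.
  • (? and ): always present; they open and close every lookaround group
    • !: negative
    • =: positive
    • <: lookbehind (direction); without it the group is a lookahead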
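
A minimal sketch of what "zero width" means in practice. The sample string is made up for illustration: the lookahead has to succeed for the match to happen, but the text it inspects stays out of the match.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "price: 100USD"

# \d+ must be followed by "USD", but "USD" itself is not consumed.
m = re.search(r"\d+(?=USD)", text)

matched = m.group()  # only the digits
span = m.span()      # the span covers the digits, not "USD"
print(matched, span)
```

The text after the match still starts with "USD", which shows the lookahead left it untouched.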

Look-ahead #

  • (?=): positive. A(?=B) finds expression A where expression B follows.

  • (?!): negative. A(?!B) finds expression A where expression B does not follow.
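
A short sketch of both lookahead forms on a made-up string of CSS-like values (the string and variable names are assumptions for illustration only):

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "10px 20em 30px"

# Positive lookahead: numbers that are followed by "px".
px_values = re.findall(r"\d+(?=px)", text)

# Negative lookahead: numbers that are not followed by "px".
# The extra \d alternative stops the engine from backtracking to a
# shorter number such as the "1" in "10px".
other_values = re.findall(r"\d+(?!px|\d)", text)

print(px_values, other_values)
```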

Look-behind #

  • (?<=): positive. (?<=B)A finds expression A where expression B precedes.

  • (?<!): negative. (?<!B)A finds expression A where expression B does not precede.
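
A sketch of both lookbehind forms, again on an invented string. It also shows one restriction worth knowing: Python's `re` module only accepts lookbehind patterns of fixed width.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "cost $40, weight 12kg, refund $5"

# Positive lookbehind: digits preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits not preceded by "$" or by another digit.
other_numbers = re.findall(r"(?<![\$\d])\d+", text)

# A variable-width lookbehind such as (?<=\$+) fails to compile in re.
try:
    re.compile(r"(?<=\$+)\d+")
    fixed_width_required = False
except re.error:
    fixed_width_required = True

print(dollar_amounts, other_numbers, fixed_width_required)
```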

Example #

bar(?=bar)     finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar)     finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar    finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar    finds the 2nd bar ("bar" which does not have "foo" before it)
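
The four patterns can be checked with a few lines of Python. The notes do not say which string is being searched, so the sketch below assumes "foobarbarfoo", which is consistent with the "1st bar" and "2nd bar" descriptions.

```python
import re

# Assumed sample string; the notes above do not say which string is searched.
text = "foobarbarfoo"

patterns = [
    r"bar(?=bar)",   # "bar" followed by "bar"
    r"bar(?!bar)",   # "bar" not followed by "bar"
    r"(?<=foo)bar",  # "bar" preceded by "foo"
    r"(?<!foo)bar",  # "bar" not preceded by "foo"
]

# Map each pattern to the (start, end) span of its first match.
spans = {p: re.search(p, text).span() for p in patterns}

for pattern, span in spans.items():
    print(f"{pattern:<14} {span}")
```

In this string the first "bar" occupies indices 3 to 6 and the second occupies 6 to 9.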

Refs #

Image #

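\w (word character) matches any single letter, digit, or underscore. With the re.ASCII flag, or in a bytes pattern, this is the same as [a-zA-Z0-9_]; for str patterns Python 3 also matches Unicode letters and digits by default.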
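
A small sketch of how `\w` behaves on a str pattern with and without `re.ASCII`. The sample text is invented for illustration.

```python
import re

# Hypothetical sample text containing one non-ASCII letter (i with diaeresis).
text = "na\u00efve_user42 ok!"

# Default for str patterns: \w also matches Unicode letters and digits.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the word is split
# at the non-ASCII letter.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words, ascii_words)
```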

Python #

Match object #

  • A Match object contains information about the search and its result.
  • It is returned by `re.search()`, `re.match()`, and `re.fullmatch()` when they succeed (they return `None` otherwise), and it is yielded by `re.finditer()`. `re.findall()` returns a list of strings or tuples, not Match objects.
  • It has properties and methods:
    1. `.span()` returns a tuple containing the start and end positions of the match.

    2. `.string` returns the string passed into the function.

    3. `.group()` returns the part of the string where there was a match.

      import re
      txt = "My name is Javeed Ali Khan"
      result = re.search(r"(\bJ\w+) (\bA[li]{2})", txt)
      print(result.span())
      print(result.group()) # all match
      print(result.group(0)) # all match
      print(result.group(1)) # 1st group
      print(result.group(2)) # 2nd group
      print(result.string)
      
      (11, 21)
      Javeed Ali
      Javeed Ali
      Javeed
      Ali
      My name is Javeed Ali Khan
      

re methods 1 #

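| Function | Description |
| --- | --- |
| `re.compile(pattern, flags)` | Compile a regular expression pattern with the given flags |
| `re.match(pattern, string)` | Match pattern only at the beginning of string |
| `re.search(pattern, string)` | Match pattern anywhere in string |
| `re.split(pattern, string)` | Split string by occurrences of pattern |
| `re.sub(pttrn_2_repl, repl_with, in_string)` | Replace every non-overlapping occurrence of `pttrn_2_repl` in `in_string` with `repl_with` |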
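
A brief sketch that exercises each of these functions once. The input string and variable names are assumptions made up for the example.

```python
import re

# Hypothetical input, chosen only for illustration.
text = "2023-09-12, 2021-05-11"

# compile: build a reusable pattern object.
date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# match: succeeds only at the start of the string.
at_start = re.match(r"\d{4}", text).group()
no_match = re.match(r"\d{2},", text)  # "12," is not at the start

# search: finds the first occurrence anywhere.
anywhere = re.search(r"\d{2},", text).group()

# split: cut the string at every comma plus optional whitespace.
parts = re.split(r",\s*", text)

# sub: rewrite each date as day/month/year using backreferences.
swapped = date_re.sub(r"\3/\2/\1", text)

print(at_start, no_match, anywhere, parts, swapped)
```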

re method Objects 1 #

| Method | Description |
| --- | --- |
| `match.group("name")` | Return subgroup "name" of the match |
| `match.groups()` | Return a tuple containing all subgroups of the match |
| `match.groupdict()` | Return a dict containing all named subgroups of the match |
| `match.start(group)` | Return the start index of the substring matched by group |
| `match.end(group)` | Return the end index of the substring matched by group |
| `match.span(group)` | Return a 2-tuple of the start and end indices of group in the match |
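
A short sketch of these Match methods on a pattern with two named groups. The sample string and the group names `first` and `last` are invented for illustration.

```python
import re

# Hypothetical input and group names, chosen only for illustration.
m = re.search(r"(?P<first>\w+) (?P<last>\w+)", "name: Ada Lovelace")

first = m.group("first")       # subgroup by name
all_groups = m.groups()        # tuple of all subgroups
by_name = m.groupdict()        # dict of named subgroups
last_start = m.start("last")   # index where the "last" group begins
last_end = m.end("last")       # index just past the "last" group
first_span = m.span("first")   # (start, end) of the "first" group

print(first, all_groups, by_name, last_start, last_end, first_span)
```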

re methods 2 #

| Function | Description |
| --- | --- |
| `re.fullmatch(pattern, string)` | Match only if the whole string matches the regular expression |
| `re.findall(pattern, string)` | Return all non-overlapping matches of pattern in string, as a list of strings |
| `re.finditer(pattern, string)` | Return an iterator yielding Match objects over non-overlapping matches of pattern in string |
| `re.subn(pattern, str2, string)` | Replace the leftmost non-overlapping occurrences of pattern in string with str2, and return a tuple of (new string, number of substitutions made) |
| `re.purge()` | Clear the regular expression cache |
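
A minimal sketch that calls each of these functions once on a made-up string:

```python
import re

# Hypothetical input, chosen only for illustration.
text = "a1 b22 c333"

# fullmatch: the entire string has to match the pattern.
whole = re.fullmatch(r"[a-z]\d+", "b22") is not None
partial = re.fullmatch(r"[a-z]\d+", text) is not None

# findall: list of matched strings.
numbers = re.findall(r"\d+", text)

# finditer: Match objects, which also expose positions.
spans = [m.span() for m in re.finditer(r"\d+", text)]

# subn: like sub, but also reports how many replacements were made.
new_text, count = re.subn(r"\d+", "#", text)

# purge: clear the module's cache of compiled patterns.
re.purge()

print(whole, partial, numbers, spans, new_text, count)
```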

Difference between match , search and findall #

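| match | search | findall |
| --- | --- | --- |
| 1. Returns the first occurrence, and only if it is at the start of the string | 1. Returns the first occurrence | Returns all occurrences |
| 2. Returns None if the match is found only on a later line | 2. Checks all lines, unlike match | Returns a list of strings |
| 3. Returns a Match object | 3. Returns a Match object | or a list of tuples of strings when the pattern has more than one group; never Match objects |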
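
The differences can be seen side by side in a small sketch. The two-line string is an assumption made up for the example.

```python
import re

# Hypothetical two-line input, chosen only for illustration.
text = "first line\nsecond line"

# match: only looks at the very start of the string ...
m_plain = re.match(r"second", text)
# ... even with re.MULTILINE, which does not change where match begins.
m_multiline = re.match(r"^second", text, re.MULTILINE)

# search: scans the whole string, including later lines.
s = re.search(r"second", text)

# findall: list of strings when the pattern has no groups,
# list of tuples when it has more than one group.
whole_matches = re.findall(r"\w+ line", text)
group_tuples = re.findall(r"(\w+) (line)", text)

print(m_plain, m_multiline, s.span(), whole_matches, group_tuples)
```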

findall example

```
>>> import re
>>> cc_list = '''Ezra Koenig <ekoenig@vpwk.com>,
... Rostam Batmanglij <rostam@vpwk.com>,
... Chris Tomson <ctomson@vpwk.com>,
... Bobbi Baio <bbaio@vpwk.com>'''
>>> # No groups in the pattern: findall returns a list of strings.
>>> matched = re.findall(r'\w+\@\w+\.\w+', cc_list)
>>> matched
['ekoenig@vpwk.com', 'rostam@vpwk.com', 'ctomson@vpwk.com', 'bbaio@vpwk.com']
>>> # Three groups in the pattern: findall returns a list of tuples.
>>> matched = re.findall(r'(\w+)\@(\w+)\.(\w+)', cc_list)
>>> matched
[('ekoenig', 'vpwk', 'com'), ('rostam', 'vpwk', 'com'), ('ctomson', 'vpwk', 'com'), ('bbaio', 'vpwk', 'com')]
>>> names = [x[0] for x in matched]
>>> names
['ekoenig', 'rostam', 'ctomson', 'bbaio']
```

Examples #

Named group #

```(?P<name>regex)```

print("start")
import re
txt = """Javeed Ali Khan Mohammed 2284440597 lisai Taaina Immune by first dose Last update: Tuesday 8 June, 09:42 PM Immune by first dose until 22/10/2021 New Services Display All > COVID-19 Vaccine Certify Mobile Organ Health Donation Passport Number D00 000 of Javeed Ali Khan Mohammed 2284440597 lisai Taaina Immune by first dose """
txt = """Immune by first dose Last update Mon, 14 Jun 12:00 PM O Current Permits Immune by first dose Last update Mon, 14 Jun 12:00 PM O Current Permits"""

# pattern = r"""(?P<day_of_week>(mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?)(?<day_of_month>\s*\d+)(?<month>\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?),\s*)(?p<time>1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm])"""

pattern = r"""(?P<day_of_week>(mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?,?)(?P<day_of_month>\s*\d+)(?P<month>\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?),?\s*)(?P<time>(1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm]))"""
# pattern = r"""((mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?)(\s*\d+)(\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?),\s*)(1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm])"""

result = re.search(pattern, txt, re.IGNORECASE)
print("result is")
print(result)
if result:
    print(result.group("day_of_week"))
    print(result.group("day_of_month"))
    print(result.group("month"))
    print(result.group("time"))

start
result is
<re.Match object; span=(33, 53), match='Mon, 14 Jun 12:00 PM'>
Mon,
 14
 Jun
12:00 PM
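
A simplified sketch of the same idea, written with `re.VERBOSE` so each named group sits on its own line, and using `groupdict()` to collect all fields at once. The pattern here is a looser, hypothetical variant of the one above (it accepts any alphabetic day and month name), not a drop-in replacement.

```python
import re

# Simplified, hypothetical variant of the timestamp pattern above.
timestamp_re = re.compile(
    r"""
    (?P<day_of_week>[A-Za-z]+),?\s+   # day name, optional comma
    (?P<day_of_month>\d{1,2})\s+      # one or two digits
    (?P<month>[A-Za-z]+)\s+           # month name
    (?P<time>\d{1,2}:\d{2}\s?[AP]M)   # 12-hour clock time
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = timestamp_re.search("Last update Mon, 14 Jun 12:00 PM")
fields = m.groupdict()  # all named groups as a dict
print(fields)
```

Because the whitespace and separators sit outside the named groups here, the captured values come back without the leading spaces and trailing comma seen in the output above.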

Emacs #

Native regular expression support #

“visual-regexp-steroids” package #

Hosted on GitHub; depends on the visual-regexp package.

Example #

To replace every “sample.html” with “sample”: ([a-z0-9]*)\.html -> \1 # “\1” refers to the text captured by the group “([a-z0-9]*)”
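
For comparison with the Python sections above, the same replacement can be sketched with `re.sub`, where `\1` in the replacement string also refers to the first captured group. The input string is invented for illustration.

```python
import re

# Hypothetical input, chosen only for illustration.
text = "see sample.html and index2.html"

# Keep the captured file name and drop the ".html" suffix.
result = re.sub(r"([a-z0-9]*)\.html", r"\1", text)
print(result)
```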

