Python Core Regular Expressions
What would (abc|xyz)\1 match? "abc" or "xyz", followed by the same thing "abc" or "xyz", then a "1" "abc", then "xyz"
"abc" or "xyz", followed by the same thing
Which pattern would NOT match "123!456!" with re.match? (\d*\W)+ [1-6!] (\D+\s?)+
(\D+\s?)+
The re.search function returns a match object with several methods that give details about the match. These include group, which returns the matched string; start and end, which return the start and end positions of the first match; and span, which returns those two positions as a tuple. Example:
import re

pattern = r"pam"

match = re.search(pattern, "eggspamsausage")
if match:
    print(match.group())
    print(match.start())
    print(match.end())
    print(match.span())
The metacharacter ? means "zero or one repetitions".
import re

pattern = r"ice(-)?cream"

if re.match(pattern, "ice-cream"):
    print("Match 1")
if re.match(pattern, "icecream"):
    print("Match 2")
if re.match(pattern, "sausages"):
    print("Match 3")
if re.match(pattern, "ice--ice"):
    print("Match 4")
What would be matched by "(4{5,6})\1"? 456 10 or 12 fours 5 or 6 fours
10 or 12 fours
Which string would be matched by "[01]+0$"? 10101111001010 0101 011101
10101111001010
What would group(3) be of a match of 1(23)(4(56)78)9(0)? 45678 23 56 0
56
Groups
A group can be created by surrounding part of a regular expression with parentheses. This means that a group can be given as an argument to metacharacters such as * and ?. Example:
import re

pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
    print("Match 1")
if re.match(pattern, "eggspamspamspamegg"):
    print("Match 2")
if re.match(pattern, "spam"):
    print("Match 3")
(spam) represents a group in the example pattern shown above.
What would '([^aeiou][aeiou][^aeiou])+' match? A number of repetitions of "aeiou" that is a multiple of three Nothing One or more repetitions of a non-vowel, a vowel and a non-vowel
One or more repetitions of a non-vowel, a vowel and a non-vowel
What would '....' match? Any string with only dots Strings with four digits only Any four character string with no newlines
Any four character string with no newlines
What would [abc][def] match? Either "abc" or "def" Any letter out of "abc", then any out of "def" Any three letters out of "abc", then any three from "def"
Any letter out of "abc", then any out of "def"
What would [1-5][0-9] match? 10, 19, 50, or 90 Any two-digit number from 10 to 59 Any two-digit number
Any two-digit number from 10 to 59
Character Classes
Character classes provide a way to match only one of a specific set of characters. A character class is created by putting the characters it matches inside square brackets. Example:
import re

pattern = r"[aeiou]"

if re.search(pattern, "grey"):
    print("Match 1")
if re.search(pattern, "qwertyuiop"):
    print("Match 2")
if re.search(pattern, "rhythm myths"):
    print("Match 3")
The pattern [aeiou] in the search function matches any string that contains at least one of the listed characters.
Which of these must be done with regular expressions, rather than string methods? Checking to see if a string contains a date Splitting a string Checking whether a particular character is in a string
Checking to see if a string contains a date
Which of the following tasks CANNOT be performed using regular expressions? Changing the URL part of an email address Checking whether an email address is of the correct format Checking whether an email address is real
Checking whether an email address is real
Curly Braces
Curly braces can be used to represent the number of repetitions between two numbers. The regex {x,y} means "between x and y repetitions of something". Hence {0,1} is the same thing as ?. If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity. Example:
import re

pattern = r"9{1,3}$"

if re.match(pattern, "9"):
    print("Match 1")
if re.match(pattern, "999"):
    print("Match 2")
if re.match(pattern, "9999"):
    print("Match 3")
"9{1,3}$" matches strings that consist of 1 to 3 nines.
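The open-ended forms described above can be sketched as follows. This is a small illustration only; the variable names are made up for the example.

```python
import re

# {,3} leaves out the first number, so it means zero to three repetitions.
up_to_three = re.compile(r"^9{,3}$")
# {2,} leaves out the second number, so it means two or more repetitions.
two_or_more = re.compile(r"^9{2,}$")

results = {
    "empty_up_to_three": bool(up_to_three.match("")),
    "four_up_to_three": bool(up_to_three.match("9999")),
    "one_two_or_more": bool(two_or_more.match("9")),
    "five_two_or_more": bool(two_or_more.match("99999")),
}
print(results)
```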
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
str = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, str)
if match:
    print(match.group())
In case the string contains multiple email addresses, we could use the re.findall method instead of re.search, to extract all email addresses. The regex in this example is for demonstration purposes only. A much more complex regex is required to fully validate an email address.
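A minimal sketch of the re.findall approach, assuming a made-up sample text with two addresses. The groups are dropped from the pattern here because re.findall returns tuples of groups, rather than whole matches, when a pattern contains more than one group.

```python
import re

# Same demonstration regex as the email example, but without capturing groups,
# so that findall returns each full address as a string.
pattern = r"[\w\.-]+@[\w\.-]+\.[\w\.]+"

# Hypothetical sample text for illustration.
text = "Write to info@example.com or sales@example.org today"

emails = re.findall(pattern, text)
print(emails)
```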
Other functions to match patterns are re.search and re.findall. The function re.search finds a match of a pattern anywhere in the string. The function re.findall returns a list of all substrings that match a pattern.
import re

pattern = r"spam"

if re.match(pattern, "eggspamsausagespam"):
    print("Match")
else:
    print("No match")

if re.search(pattern, "eggspamsausagespam"):
    print("Match")
else:
    print("No match")

print(re.findall(pattern, "eggspamsausagespam"))
In the example above, the match function did not match the pattern, as it looks at the beginning of the string. The search function found a match in the string. The function re.finditer does the same thing as re.findall, except it returns an iterator, rather than a list.
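As a small sketch of re.finditer: each item it yields is a match object, so positions are available as well as the matched text. The list name positions is chosen for this example.

```python
import re

pattern = r"spam"

# finditer yields one match object per non-overlapping match.
positions = [
    (m.start(), m.end())
    for m in re.finditer(pattern, "eggspamsausagespam")
]
print(positions)
```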
There are several kinds of special groups. Two useful ones are named groups and non-capturing groups. Named groups have the format (?P<name>...), where name is the name of the group, and ... is the content. They behave exactly the same as normal groups, except that they can be accessed by group(name) in addition to their number. Non-capturing groups have the format (?:...). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering. Example:
import re

pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
    print(match.group("first"))
    print(match.groups())
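The numbering behaviour can be checked with a short sketch. It reuses the pattern from the example and also calls groupdict, a match-object method that returns only the named groups.

```python
import re

match = re.match(r"(?P<first>abc)(?:def)(ghi)", "abcdefghi")

# The named group is reachable by name or by number.
by_name = match.group("first")
by_number = match.group(1)

# The non-capturing group is skipped, so (ghi) is group 2.
second = match.group(2)

# groupdict() maps group names to their matched text.
named = match.groupdict()

print(by_name, by_number, second, named)
```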
Metacharacters
Metacharacters are what make regular expressions more powerful than normal string methods. They allow you to create regular expressions to represent concepts like "one or more repetitions of a vowel". The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that matches a literal metacharacter, such as "$". You can do this by escaping the metacharacters by putting a backslash in front of them. However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting three or four backslashes in a row to do all the escaping. To avoid this, you can use a raw string, which is a normal string with an "r" in front of it. We saw usage of raw strings in the previous lesson.
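A brief sketch of the escaping problem, using a made-up price string. It compares a raw string with the equivalent normal string, and also shows re.escape, a standard-library helper that adds the backslashes for you.

```python
import re

price = "Total: $5"

# Raw string: one backslash is enough to escape the $ metacharacter.
raw_pattern = r"\$5"
# Normal string: the backslash itself has to be escaped for Python first.
plain_pattern = "\\$5"

same_pattern = raw_pattern == plain_pattern
found = re.search(raw_pattern, price).group()

# re.escape builds an escaped pattern from literal text.
escaped = re.escape("$5")

print(same_pattern, found, escaped)
```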
Special Sequences
More useful special sequences are \d, \s, and \w. These match digits, whitespace, and word characters respectively. In ASCII mode they are equivalent to [0-9], [ \t\n\r\f\v], and [a-zA-Z0-9_]. In Unicode mode they match certain other characters, as well. For instance, \w matches letters with accents. Versions of these special sequences with upper case letters - \D, \S, and \W - mean the opposite to the lower-case versions. For instance, \D matches anything that isn't a digit. Example:
import re

pattern = r"(\D+\d)"

match = re.match(pattern, "Hi 999!")
if match:
    print("Match 1")

match = re.match(pattern, "1, 23, 456!")
if match:
    print("Match 2")

match = re.match(pattern, " $?")
if match:
    print("Match 3")
(\D+\d) matches one or more non-digits followed by a digit.
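The difference between Unicode mode and ASCII mode can be sketched with the re.ASCII flag. The sample text is made up for the example; the accented letter is written as an escape so the code does not depend on file encoding.

```python
import re

text = "caf\u00e9 42"  # "café 42"

# Unicode mode (the default for str patterns): \w includes accented letters.
unicode_words = re.findall(r"\w+", text)

# ASCII mode: \w is limited to [a-zA-Z0-9_], so the accented letter is excluded.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# \D is the opposite of \d: runs of anything that is not a digit.
non_digits = re.findall(r"\D+", text)

print(unicode_words, ascii_words, non_digits)
```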
Search & Replace
One of the most important re methods that use regular expressions is sub. Syntax:
re.sub(pattern, repl, string, count=0)
This method replaces all occurrences of the pattern in string with repl, unless count is provided, in which case at most count occurrences are replaced. It returns the modified string. Example:
import re

str = "My name is David. Hi David."
pattern = r"David"
newstr = re.sub(pattern, "Amy", str)
print(newstr)
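The effect of count can be sketched by running the same substitution with and without it. The variable names are chosen for this example.

```python
import re

text = "My name is David. Hi David."

# Default count=0 replaces every occurrence.
all_replaced = re.sub(r"David", "Amy", text)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"David", "Amy", text, count=1)

print(all_replaced)
print(first_only)
```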
Regular Expressions
Regular expressions are a powerful tool for various kinds of string manipulation. They are a domain specific language (DSL) that is present as a library in most modern programming languages, not just Python. They are useful for two main tasks:
- verifying that strings match a pattern (for instance, that a string has the format of an email address),
- performing substitutions in a string (such as changing all American spellings to British ones).
Domain specific languages are highly specialized mini programming languages. Regular expressions are a popular example, and SQL (for database manipulation) is another. Private domain-specific languages are often used for specific industrial purposes.
Special Sequences
There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character. One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the expression of the group of that number. Example:
import re

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
    print("Match 1")

match = re.match(pattern, "?! ?!")
if match:
    print("Match 2")

match = re.match(pattern, "abc cde")
if match:
    print("Match 3")
Note that "(.+) \1" is not the same as "(.+) (.+)", because \1 refers to the text that the first group actually matched, not to the group's regex pattern.
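One common use of a backreference is spotting accidentally doubled words. The following is a sketch with a made-up sentence; the \b word-boundary sequence keeps the pattern from matching inside longer words.

```python
import re

# Hypothetical sample sentence containing two doubled words.
text = "this is is a test test case"

# (\w+) captures a word; \1 requires the same text again after one space.
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)
```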
Email Extraction
To demonstrate a sample usage of regular expressions, let's create a program to extract email addresses from a string. Suppose we have a text that contains an email address:
str = "Please contact info@sololearn.com for assistance"
Our goal is to extract the substring "info@sololearn.com". A basic email address consists of a word and may include dots or dashes. This is followed by the @ sign and the domain name (the name, a dot, and the domain name suffix). This is the basis for building our regular expression.
pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
[\w\.-]+ matches one or more word characters, dots or dashes. The regex above says that the string should contain a word (with dots and dashes allowed), followed by the @ sign, then another similar word, then a dot and another word. Our regex contains three groups:
1 - first part of the email address.
2 - domain name without the suffix.
3 - the domain suffix.
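The three groups can be inspected directly on the match object. This is a minimal sketch using the demonstration regex and sample address from this set; the variable names are chosen for the example.

```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
text = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, text)

# group() is the whole address; groups() splits it into the three parts.
address = match.group()
local_part, domain, suffix = match.groups()

print(address, local_part, domain, suffix)
```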
What would [a^]* match? Zero or more repetitions of any character Zero or more repetitions of any characters other than "a" Zero or more repetitions of "a" or "^"
Zero or more repetitions of "a" or "^"
What regex is not equivalent to the others? [1-6] (1|2|3|4|5) [12345]
[1-6]
Which of these metacharacters isn't to do with repetition? \ + *
\
Which pattern would match 'SPAM!' in a search? SP\AM!\Z \AS...\b.\Z \ASPAM\Z
\AS...\b.\Z
Which regex would match "[email protected]"? [0-9]@domain\.com \[email protected] email\@(domain\w)+
Some more metacharacters are * + ? { and }. These specify numbers of repetitions. The metacharacter * means "zero or more repetitions of the previous thing". It tries to match as many repetitions as possible. The "previous thing" can be a single character, a class, or a group of characters in parentheses. Example:
import re

pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
    print("Match 1")
if re.match(pattern, "eggspamspamegg"):
    print("Match 2")
if re.match(pattern, "spam"):
    print("Match 3")
The example above matches strings that start with "egg" and follow with zero or more "spam"s.
Which of these is not a function in the re module? search findall findlist
findlist
The next two metacharacters are ^ and $. These match the start and end of a string, respectively.
import re

pattern = r"^gr.y$"

if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "stingray"):
    print("Match 3")
The pattern "^gr.y$" means that the string should start with gr, then follow with any character, except a newline, and end with y.
Character classes can also match ranges of characters. Some examples: The class [a-z] matches any lowercase alphabetic character. The class [G-P] matches any uppercase character from G to P. The class [0-9] matches any digit. Multiple ranges can be included in one class. For example, [A-Za-z] matches a letter of any case.
import re

pattern = r"[A-Z][A-Z][0-9]"

if re.search(pattern, "LS8"):
    print("Match 1")
if re.search(pattern, "E3"):
    print("Match 2")
if re.search(pattern, "1ab"):
    print("Match 3")
The pattern in the example above matches strings that contain two alphabetic uppercase letters followed by a digit.
Place a ^ at the start of a character class to invert it. This causes it to match any character other than the ones included. Other metacharacters such as $ and ., have no meaning within character classes. The metacharacter ^ has no meaning unless it is the first character in a class. Example:
import re

pattern = r"[^A-Z]"

if re.search(pattern, "this is all quiet"):
    print("Match 1")
if re.search(pattern, "AbCdEfG123"):
    print("Match 2")
if re.search(pattern, "THISISALLSHOUTING"):
    print("Match 3")
The pattern [^A-Z] matches any character that is not an uppercase letter, so a string made up only of uppercase letters does not match. Note that the ^ should be inside the brackets to invert the character class.
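The position rule for ^, and the fact that $ and . are literal inside a class, can be sketched with re.findall. The sample strings are made up for the example.

```python
import re

# ^ is not first, so it is an ordinary character: matches "a" or "^".
literal_caret = re.findall(r"[a^]", "a^b^")

# ^ is first, so the class is inverted: matches anything except "a".
inverted = re.findall(r"[^a]", "a^b^")

# Inside a class, $ and . lose their special meaning.
literal_meta = re.findall(r"[$.]", "a.b$c")

print(literal_caret, inverted, literal_meta)
```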
The metacharacter + is very similar to *, except it means "one or more repetitions", as opposed to "zero or more repetitions". Example:
import re

pattern = r"g+"

if re.match(pattern, "g"):
    print("Match 1")
if re.match(pattern, "gggggggggggggg"):
    print("Match 2")
if re.match(pattern, "abc"):
    print("Match 3")
To summarize:
* matches 0 or more occurrences of the preceding expression.
+ matches 1 or more occurrences of the preceding expression.
Regular expressions in Python can be accessed using the re module, which is part of the standard library. After you've defined a regular expression, the re.match function can be used to determine whether it matches at the beginning of a string. If it does, match returns an object representing the match; if not, it returns None. To avoid any confusion while working with regular expressions, we will write patterns as raw strings, as in r"expression". Raw strings don't escape anything, which makes use of regular expressions easier.
import re

pattern = r"spam"

if re.match(pattern, "spamspamspam"):
    print("Match")
else:
    print("No match")
The above example checks if the pattern "spam" matches the string and prints "Match" if it does. Here the pattern is a simple word, but there are various characters that have a special meaning when they are used in a regular expression.
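The two possible return values of re.match can be sketched as follows. The second string is made up to show the failing case.

```python
import re

pattern = r"spam"

# The pattern occurs at the start, so a match object is returned.
hit = re.match(pattern, "spamspamspam")
# The pattern occurs later in the string only, so re.match returns None.
miss = re.match(pattern, "eggspam")

hit_text = hit.group() if hit else None
hit_span = hit.span() if hit else None

print(hit_text, hit_span, miss)
```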
The first metacharacter we will look at is . (dot). This matches any character, other than a new line. Example:
import re

pattern = r"gr.y"

if re.match(pattern, "grey"):
    print("Match 1")
if re.match(pattern, "gray"):
    print("Match 2")
if re.match(pattern, "blue"):
    print("Match 3")
Another important metacharacter is |. This means "or", so red|blue matches either "red" or "blue". Example:
import re

pattern = r"gr(a|e)y"

match = re.match(pattern, "gray")
if match:
    print("Match 1")

match = re.match(pattern, "grey")
if match:
    print("Match 2")

match = re.match(pattern, "griy")
if match:
    print("Match 3")
Additional special sequences are \A, \Z, and \b. The sequences \A and \Z match the beginning and end of a string, respectively. The sequence \b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words. The sequence \B matches the empty string anywhere else. Example:
import re

pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
    print("Match 1")

match = re.search(pattern, "We s>cat<tered?")
if match:
    print("Match 2")

match = re.search(pattern, "We scattered.")
if match:
    print("Match 3")
"\b(cat)\b" basically matches the word "cat" surrounded by word boundaries.
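A short sketch contrasting \b with \B, and showing \A and \Z anchoring a whole string. The test strings are chosen for illustration.

```python
import re

# \b: "cat" must stand alone as a word.
word_hit = bool(re.search(r"\bcat\b", "The cat sat!"))
word_miss = bool(re.search(r"\bcat\b", "We scattered."))

# \B: "cat" must sit inside a longer word on both sides.
inside_hit = bool(re.search(r"\Bcat\B", "We scattered."))
inside_miss = bool(re.search(r"\Bcat\B", "The cat sat!"))

# \A and \Z: the pattern must span the entire string.
whole_hit = bool(re.search(r"\ASPAM\Z", "SPAM"))
whole_miss = bool(re.search(r"\ASPAM\Z", "SPAM!"))

print(word_hit, word_miss, inside_hit, inside_miss, whole_hit, whole_miss)
```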
The content of groups in a match can be accessed using the group function. A call of group(0) or group() returns the whole match. A call of group(n), where n is greater than 0, returns the nth group from the left. The method groups() returns all groups from 1 upward as a tuple. Example:
import re

pattern = r"a(bc)(de)(f(g)h)i"

match = re.match(pattern, "abcdefghijklmnop")
if match:
    print(match.group())
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.groups())
As you can see from the example above, groups can be nested.
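For nested groups, numbering follows the order of the opening parentheses. The sketch below reuses the pattern from the example and looks at the outer group 3 and the group 4 nested inside it; span(n) is a match-object method giving the positions of group n.

```python
import re

match = re.match(r"a(bc)(de)(f(g)h)i", "abcdefghijklmnop")

whole = match.group()    # the full match
outer = match.group(3)   # (f(g)h) opens third
inner = match.group(4)   # (g) opens fourth, inside group 3

outer_span = match.span(3)
inner_span = match.span(4)

print(whole, outer, inner, outer_span, inner_span, match.groups())
```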
Which of these patterns would not re.match the string "spamspamspam"? sp pamspam spamspam
pamspam
In our example, why is the dot character preceded by a backslash? it is mandatory because it is inside square brackets to treat it as a character
to treat it as a character
Which of these is the same as the metacharacter '+'? {0, 1} {0,} {1,}
{1,}