Metacharacters in regular expressions


In Python regular expressions (regex), metacharacters are characters that carry a special meaning in pattern matching instead of matching themselves. They are used to construct the patterns that search for or manipulate strings. Here are some common metacharacters and special sequences in Python regex:

1. `.` (dot): Matches any single character except newline characters.

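2. `^`: Matches the start of the string (and, in multiline mode, the start of each line).

3. `$`: Matches the end of the string or just before a newline at the end of the string (and, in multiline mode, the end of each line).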

4. `*`: Matches zero or more occurrences of the preceding element.

5. `+`: Matches one or more occurrences of the preceding element.

6. `?`: Matches zero or one occurrence of the preceding element.

7. `\`: Escapes special characters, allowing you to match them as literals.

8. `[]`: Defines a character class, allowing you to match any single character within the brackets.

9. `|`: Acts as an OR operator, allowing you to match either the expression before or after the pipe.

10. `()`: Defines a group for capturing or specifying precedence.

11. `{}`: Specifies the number of occurrences of the preceding element.

12. `^` (within `[]`): When used as the first character within square brackets, it negates the character class.

13. `\b`: Matches a word boundary, so the pattern is only matched at the beginning or end of a word.

14. `\B`: Matches a non-word boundary, that is, any position that is not at the beginning or end of a word (for example, inside a word).

15. `\d`: Matches any decimal digit (equivalent to `[0-9]` for ASCII text; with `str` patterns it also matches other Unicode decimal digits unless the `re.ASCII` flag is set).

16. `\D`: Matches any character that is not a decimal digit (the opposite of `\d`; equivalent to `[^0-9]` in ASCII mode).

17. `\s`: Matches any whitespace character (equivalent to `[ \t\n\r\f\v]` in ASCII mode).

18. `\S`: Matches any character that is not a whitespace character.

19. `\w`: Matches any word character, meaning letters, digits and the underscore (equivalent to `[a-zA-Z0-9_]` in ASCII mode).

20. `\W`: Matches any character that is not a word character.

21. `\A`: Matches only at the start of the string (similar to `^`, but it is not affected by multiline mode).

22. `\Z`: Matches only at the very end of the string (unlike `$`, it does not match before a trailing newline and is not affected by multiline mode).

23. `\G`: Matches the point where the last match finished. This anchor exists in some other regex flavours, but Python's built-in `re` module does not support it.

24. `\number`: Matches the contents of a previously captured group (where `number` is the group’s index, for example `\1`).

25. `(?i)`: Case-insensitive matching mode.

26. `(?s)`: Dotall mode, where the dot (`.`) matches any character, including newline.

27. `(?x)`: Verbose mode, allowing you to write regular expressions more legibly by ignoring whitespace and comments.

These metacharacters provide powerful tools for constructing complex search patterns in Python using regular expressions.
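
Several entries in the list above (word boundaries, backreferences and the inline case-insensitive flag) have no example later in the lesson, so here is a minimal sketch of how they behave. The sample sentence and the variable names are made up for illustration.

```python
import re

text = "The cat scattered the catalog. Hello hello world."

# \b anchors the match at word boundaries, so only the standalone word matches.
whole_words = re.findall(r"\bcat\b", text)

# \B requires that the position is NOT a word boundary, so "cat" must be
# preceded by another word character, as in "scattered".
inside_words = re.findall(r"\Bcat", text)

# \1 matches again whatever group 1 captured; (?i) ignores case.
repeated = re.search(r"(?i)\b(\w+) \1\b", text)

print(whole_words)       # ['cat']
print(inside_words)      # ['cat']
print(repeated.group())  # Hello hello
```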

Code examples:

1. [] A set of characters

In regular expressions, `[]` denotes a character class, also known as a character set. It allows you to specify a set of characters from which you want to match a single character.

Here’s how `[]` works:

– `[ ]`: Matches any single character within the brackets.

For example:

– `[abc]`: Matches either ‘a’, ‘b’, or ‘c’.
– `[a-z]`: Matches any lowercase letter from ‘a’ to ‘z’.
– `[0-9]`: Matches any digit from ‘0’ to ‘9’.
– `[aeiou]`: Matches any vowel.
– `[^abc]`: Matches any character except ‘a’, ‘b’, or ‘c’ (the `^` at the beginning negates the character class).

Example:
```python
import re

pattern = r'[aeiou]'  # Matches any lowercase vowel
test_string = "Hello World!"

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['e', 'o', 'o']
```

In this example, the pattern `[aeiou]` matches any vowel characters (‘a’, ‘e’, ‘i’, ‘o’, ‘u’) in the test string “Hello World!”. The `findall()` function returns a list containing all matches found.
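
Ranges and negated classes from the list above work the same way. The following is a small sketch; the sample string and variable names are invented for the illustration.

```python
import re

sample = "Room B4, level 2!"

# Two classes in a row: one uppercase letter followed by one digit.
codes = re.findall(r"[A-Z][0-9]", sample)

# Negated class: any character that is not a letter, a digit, or a space.
punctuation = re.findall(r"[^A-Za-z0-9 ]", sample)

print(codes)        # ['B4']
print(punctuation)  # [',', '!']
```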

2. \ Signals a special sequence (can also be used to escape special characters)

In regular expressions, the backslash `\` serves two main purposes: it signals a special sequence, and it escapes special characters so that they match literally. Here’s how it works:

1. Signaling a Special Sequence:
– `\d`: Matches any digit (equivalent to `[0-9]` for ASCII text).
– `\w`: Matches any word character (alphanumeric and underscore).
– `\s`: Matches any whitespace character.
– `\b`: Matches a word boundary.
– `\A`: Matches the start of the string.
– `\Z`: Matches only at the very end of the string.

2. Escaping Special Characters:
– `\.`: Matches a literal period (dot).
– `\[`: Matches a literal opening square bracket.
– `\]`: Matches a literal closing square bracket.
– `\\`: Matches a literal backslash.

Example:
```python
import re

pattern = r'\d+'  # Matches one or more digits
test_string = "I have 5 apples and 3 bananas."

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['5', '3']
```

In this example, the pattern `\d+` matches one or more digits in the test string “I have 5 apples and 3 bananas.” The backslash `\` signals the special sequence `\d`, which matches any digit character. The `+` quantifier matches one or more occurrences of the preceding character or sequence, in this case, digits. The `findall()` function returns a list containing all matches found.
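
The second use of the backslash, escaping, can be sketched in the same way. The strings below are invented for illustration; `re.escape()` is a standard-library helper that escapes every metacharacter in a literal string for you.

```python
import re

candidates = "3.14 3x14"

# An unescaped dot matches any character; an escaped dot matches only '.'.
loose = re.findall(r"3.14", candidates)
strict = re.findall(r"3\.14", candidates)

# re.escape() builds a pattern that matches the given text literally.
text = "Version 3.14 costs $5 (approx.)"
found = re.search(re.escape("$5 (approx.)"), text)

print(loose)          # ['3.14', '3x14']
print(strict)         # ['3.14']
print(found.group())  # $5 (approx.)
```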

3. . Any character (except newline character)

In regular expressions, the dot `.` (period) represents a wildcard that matches any single character except for the newline character (`\n`). It serves as a placeholder for any character in the string.

Here’s how it works:

– `.`: Matches any single character (except newline `\n`).

Example:
```python
import re

pattern = r'c.t'  # Matches 'cat', 'cbt', 'cct', etc.
test_string = "The cat sat on the mat."

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['cat'] ('sat' and 'mat' do not start with 'c')
```

In this example, the pattern `c.t` matches any three-character sequence where the first and third characters are ‘c’ and ‘t’ respectively, and the second character can be any single character. The `findall()` function returns a list containing all matches found in the test string “The cat sat on the mat.”
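
The newline exception is easy to check directly. This is a minimal sketch with a made-up two-line string; it also shows the dotall mode (`re.DOTALL`, or the inline `(?s)` flag) that lifts the exception.

```python
import re

text = "a\nb"  # 'a', a newline, then 'b'

# By default the dot refuses to match the newline between 'a' and 'b'.
default = re.findall(r"a.b", text)

# In dotall mode the dot matches the newline as well.
dotall = re.findall(r"a.b", text, re.DOTALL)
inline = re.findall(r"(?s)a.b", text)

print(default)  # []
print(dotall)   # ['a\nb']
print(inline)   # ['a\nb']
```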

4. ^ Starts with

In regular expressions, the caret `^` is a metacharacter that indicates the start of a string or the start of a line, depending on the context in which it is used.

Here’s how it works:

– `^`: Matches the start of the string or the start of a line.

Example:
```python
import re

pattern = r'^hello'  # Matches 'hello' only at the start of the string (or of a line with re.MULTILINE)
test_string = "hello world!"

match = re.search(pattern, test_string)
if match:
    print("Match found:", match.group())  # Prints: Match found: hello
else:
    print("No match found.")
```

In this example, the pattern `^hello` matches “hello” only if it occurs at the start of the string. Since “hello” is indeed at the beginning of the string “hello world!”, a match is found, and the matched substring “hello” is printed.

If you were to use the same pattern with the `re.MULTILINE` flag, it would match “hello” at the start of any line in a multiline string, as opposed to just the start of the string itself.
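
A short sketch of that difference, using an invented three-line string:

```python
import re

text = "hello world\nhello again\nsay hello"

# Without flags, ^ matches only at the very start of the string.
start_only = re.findall(r"^hello", text)

# With re.MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r"^hello", text, re.MULTILINE)

print(start_only)  # ['hello']
print(each_line)   # ['hello', 'hello']
```

The third line starts with “say”, so its “hello” is not matched in either mode.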

5. $ Ends with

In regular expressions, the dollar sign `$` is a metacharacter that indicates the end of a string or the end of a line, depending on the context in which it is used.

Here’s how it works:

– `$`: Matches the end of the string or the end of a line.

Example:
```python
import re

pattern = r'world$'  # Matches 'world' only at the end of the string (or of a line with re.MULTILINE)
test_string = "hello world"

match = re.search(pattern, test_string)
if match:
    print("Match found:", match.group())  # Prints: Match found: world
else:
    print("No match found.")
```

In this example, the pattern `world$` matches “world” only if it occurs at the end of the string. Since “world” is indeed at the end of the string “hello world”, a match is found, and the matched substring “world” is printed.

If you were to use the same pattern with the `re.MULTILINE` flag, it would match “world” at the end of any line in a multiline string, as opposed to just the end of the string itself.
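
The same comparison for `$`, again as a sketch with invented sample strings. The last lines also show that `$` matches just before a single trailing newline, which is easy to overlook when reading lines from a file.

```python
import re

text = "hello world\nbig world\nworld tour"

# Without flags, $ only looks at the end of the whole string ("...tour").
end_only = re.findall(r"world$", text)

# With re.MULTILINE, $ also matches just before each newline.
each_line = re.findall(r"world$", text, re.MULTILINE)

# A single trailing newline does not prevent $ from matching.
trailing = re.search(r"world$", "hello world\n")

print(end_only)          # []
print(each_line)         # ['world', 'world']
print(trailing.group())  # world
```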

6. * Zero or more occurrences

The asterisk `*` in regular expressions signifies zero or more occurrences of the preceding character or group. It’s a quantifier that indicates that the preceding element can occur zero or more times.

Here’s how it works:

– `*`: Matches zero or more occurrences of the preceding character or group.

Example:
```python
import re

pattern = r'go*gle'  # Matches 'ggle', 'gogle', 'google', 'gooogle', etc.
test_string = "google gooogle ggle"

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['google', 'gooogle', 'ggle']
```

In this example, the pattern `go*gle` matches “ggle”, “gogle”, “google”, “gooogle”, etc., where the character ‘o’ can occur zero or more times. The `findall()` function returns a list containing all matches found in the test string “google gooogle ggle”.

7. + One or more occurrences

The plus sign `+` in regular expressions signifies one or more occurrences of the preceding character or group. It’s a quantifier that indicates that the preceding element must occur at least once, but can occur multiple times.

Here’s how it works:

– `+`: Matches one or more occurrences of the preceding character or group.

Example:
```python
import re

pattern = r'go+gle'  # Matches 'gogle', 'google', 'gooogle', etc.
test_string = "google gooogle ggle"

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['google', 'gooogle']
```

In this example, the pattern `go+gle` matches “gogle”, “google”, “gooogle”, etc., where the character ‘o’ must occur at least once. The `findall()` function returns a list containing all matches found in the test string “google gooogle ggle”.

8. ? Zero or one occurrences

The question mark `?` in regular expressions signifies zero or one occurrence of the preceding character or group. It’s a quantifier that indicates that the preceding element is optional and can occur either zero times or once.

Here’s how it works:

– `?`: Matches zero or one occurrence of the preceding character or group.

Example:
```python
import re

pattern = r'colou?r'  # Matches both 'color' and 'colour'
test_string = "The color of the sky is blue. The colour of the sea is blue as well."

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['color', 'colour']
```

In this example, the pattern `colou?r` matches both “color” and “colour”, where the letter ‘u’ is optional. The `findall()` function returns a list containing all matches found in the test string “The color of the sky is blue. The colour of the sea is blue as well.”.

9. {} Exactly the specified number of occurrences

The curly braces `{}` in regular expressions specify how many times the preceding character or group must occur, either as an exact count or as a range. They allow you to define precise repetition constraints on the preceding element.

Here’s how it works:

– `{m}`: Matches exactly m occurrences of the preceding character or group.
– `{m,n}`: Matches at least m and at most n occurrences of the preceding character or group.
– `{m,}`: Matches at least m occurrences of the preceding character or group.

Example:
```python
import re

pattern = r'\d{3}'  # Matches three consecutive digits
test_string = "The number is 123456789."

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['123', '456', '789']
```

In this example, the pattern `\d{3}` matches exactly three consecutive digits in the test string “The number is 123456789.” Because `findall()` scans from left to right and returns non-overlapping matches, the nine digits are split into “123”, “456”, and “789”.
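
The range forms `{m,n}` and `{m,}` can be sketched as follows. The sample string is invented, and the `\b` word boundaries are added here so that each run of digits is treated as a whole rather than being split into pieces.

```python
import re

text = "1 22 333 4444 55555"

# {2,3}: runs of at least two and at most three digits.
two_to_three = re.findall(r"\b\d{2,3}\b", text)

# {4,}: runs of four or more digits, with no upper limit.
four_or_more = re.findall(r"\b\d{4,}\b", text)

print(two_to_three)  # ['22', '333']
print(four_or_more)  # ['4444', '55555']
```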

10. | Either or

The pipe symbol `|` in regular expressions signifies an alternation, allowing you to specify alternatives within a pattern. It’s used to match either one expression or another.

Here’s how it works:

– `|`: Matches either the expression before or after the alternation operator.

Example:
```python
import re

pattern = r'cat|dog'  # Matches either 'cat' or 'dog'
test_string = "I have a cat and a dog as pets."

matches = re.findall(pattern, test_string)
print(matches)  # Output: ['cat', 'dog']
```

In this example, the pattern `cat|dog` matches either “cat” or “dog” in the test string “I have a cat and a dog as pets.”. The `findall()` function returns a list containing all matches found, which are “cat” and “dog”.
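
One detail worth trying out is how far the alternation reaches: `|` splits the whole pattern unless parentheses limit it. The sketch below uses an invented sample string, and `finditer()` is used for the grouped pattern so that the full matches are collected.

```python
import re

text = "gray grey grab"

# Without parentheses the alternatives are the whole left side and the
# whole right side: 'gra' OR 'ey'.
loose = re.findall(r"gra|ey", text)

# Parentheses limit the choice to a single letter: 'gray' OR 'grey'.
grouped = [m.group() for m in re.finditer(r"gr(a|e)y", text)]

print(loose)    # ['gra', 'ey', 'gra']
print(grouped)  # ['gray', 'grey']
```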

11. () Capture and group

In regular expressions, parentheses `()` are used for capturing and grouping parts of a pattern. They serve two main purposes:

1. Capturing: Parentheses are used to capture the matched substring enclosed within them. This captured substring can then be referenced or extracted later.

2. Grouping: Parentheses are used to group parts of a pattern together, allowing you to apply quantifiers or other operators to the entire group.

Here’s how it works:

– `()`: Captures and groups the enclosed part of the pattern.

Example (Capturing):
```python
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'  # Matches a phone number pattern: ###-###-####
test_string = "My phone number is 123-456-7890."

match = re.search(pattern, test_string)
if match:
    print("Full match:", match.group(0))   # Prints: Full match: 123-456-7890
    print("Area code:", match.group(1))    # Prints: Area code: 123
    print("Prefix:", match.group(2))       # Prints: Prefix: 456
    print("Line number:", match.group(3))  # Prints: Line number: 7890
```

In this example, the pattern `(\d{3})-(\d{3})-(\d{4})` captures and groups three parts of a phone number separated by hyphens. The `search()` function finds the first match in the test string “My phone number is 123-456-7890.”, and the `group()` method is used to access each captured group separately.

Grouping allows you to apply quantifiers or other operators to the entire group. For example, `(ab)+` matches “ab”, “abab”, “ababab”, and so on.
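
A quick way to check this is `re.fullmatch()`, which succeeds only if the entire string fits the pattern. The candidate strings below are chosen for illustration.

```python
import re

# (ab)+ repeats the whole group "ab"; ab+ repeats only the "b".
results = {
    candidate: bool(re.fullmatch(r"(ab)+", candidate))
    for candidate in ["ab", "abab", "aba", "abbb"]
}
ungrouped = bool(re.fullmatch(r"ab+", "abbb"))

print(results)    # {'ab': True, 'abab': True, 'aba': False, 'abbb': False}
print(ungrouped)  # True
```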
