9.10.  Regular Expressions

Regular expressions are used to match patterns against strings.

Within a pattern, all characters except ., |, (, ), [, {, +, \, ^, $, *, and ? match themselves. If you want to match one of these special characters literally, precede it with a backslash.
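As a rough illustration of escaping, the sketch below uses Python's re module. This is an assumption made for the example: the engine described in this section may differ in details. The sample strings are hypothetical.

```python
import re

# Sample text chosen for illustration only.
text = "1.5 1x5"

# An unescaped "." is special: here it matches any single character.
unescaped = re.findall(r"1.5", text)

# Preceded by a backslash, "." matches only a literal dot.
escaped = re.findall(r"1\.5", text)

print(unescaped)  # ['1.5', '1x5']
print(escaped)    # ['1.5']
```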

Patterns for matching single characters:

To match a character from a set of characters, the following character classes are supported. A character class is a set of characters between brackets, and it matches any single character from that set. Inside the brackets, the special regular expression characters ., |, (, ), [, {, +, ^, $, *, and ? lose their special meaning. However, normal string substitution still occurs, so (for example) \b represents a backspace character and \n a newline. To include the literal characters ] and - within a character class, they must appear at the start.
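The behavior of character classes can be sketched with Python's re module, assuming it is close enough to the engine described here. One difference to note: with Python raw strings, the re module itself (not string substitution) interprets \b inside brackets as a backspace. The sample strings are hypothetical.

```python
import re

# Inside brackets, characters such as . | ( ) + * ? $ are ordinary literals.
metachars = re.findall(r"[.|()+*?$]", "a.b|c(d)")

# A "]" placed first in the class (and a "-" next to it) is taken literally.
bracket_and_hyphen = re.findall(r"[]-]", "x]y-z")

# Inside a class, \b stands for the backspace character (0x08).
backspace = re.search(r"[\b]", "a\x08b")

print(metachars)            # ['.', '|', '(', ')']
print(bracket_and_hyphen)   # [']', '-']
print(backspace.group() == "\x08")  # True
```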

Predefined character classes:

POSIX character classes (US-ASCII):

Classes for Unicode blocks and categories:

Character sequences are matched by stringing the characters together.

The following constructs are used to easily match character sequences containing special characters.

Repetition modifiers allow you to match multiple occurrences of a pattern.

These patterns are greedy, i.e., they match as much of a string as they can. To make them match as little as possible instead, add a question mark suffix to the repetition modifier.
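The difference between greedy and minimal matching can be seen in a short sketch, again assuming Python's re module as a stand-in for the engine described here. The sample text is hypothetical.

```python
import re

# Sample text chosen for illustration only.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as it can, so the match runs from the
# first "<" to the last ">".
greedy = re.findall(r"<.+>", text)

# Minimal: ".+?" stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```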

An unescaped vertical bar "|" matches either the regular expression that precedes it or the regular expression that follows it.
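A small sketch of alternation, assuming Python's re module. Note that the bar splits the whole expression on either side, not just the neighboring characters. The sample strings are hypothetical.

```python
import re

# Either alternative may match at each position.
animals = re.findall(r"cat|dog", "cat, dog, bird")

# The alternatives here are "gra" and "ey", not "a" and "e".
split_whole = re.findall(r"gra|ey", "gray grey")

print(animals)      # ['cat', 'dog']
print(split_whole)  # ['gra', 'ey']
```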

Parentheses are used to group terms within a regular expression. Everything within the group is treated as a single regular expression.
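The effect of grouping can be sketched as follows, assuming Python's re module. Without parentheses, a repetition modifier or vertical bar applies to a different part of the pattern. The sample strings are hypothetical.

```python
import re

# With a group, "+" repeats the whole sequence "ab".
grouped = re.fullmatch(r"(ab)+", "ababab")

# Without the group, "+" repeats only the "b".
ungrouped = re.fullmatch(r"ab+", "ababab")

# A group also limits the scope of "|" to the terms inside it.
words = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey")]

print(grouped is not None)  # True
print(ungrouped is None)    # True
print(words)                # ['gray', 'grey']
```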

The following boundaries can be specified.

Back references let you reuse part of the current match later in that same match, which makes it possible to look for various forms of repetition.
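As a hedged illustration, the sketch below uses Python's re module, where \1 refers to the text matched by the first parenthesized group. The back-reference syntax of the engine described here may differ. The sample strings are hypothetical.

```python
import re

# (\w+) captures a word; \1 then requires the same text to appear again.
doubled_word = re.search(r"(\w+) \1", "say hello hello world")

# (\w)\1 finds any character immediately followed by itself.
# findall returns the captured group for each match.
doubled_letters = re.findall(r"(\w)\1", "bookkeeper")

print(doubled_word.group(0))  # hello hello
print(doubled_word.group(1))  # hello
print(doubled_letters)        # ['o', 'k', 'e']
```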