9.10. Regular Expressions


Regular expressions are used to match patterns against strings.

9.10.1. Characters

Within a pattern, all characters except ., |, (, ), [, {, +, \, ^, $, *, and ? match themselves. If you want to match one of these special characters literally, precede it with a backslash.
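The escaping rule can be sketched in Java as follows. This assumes the patterns follow java.util.regex semantics, which the constructs in this section resemble but which the section does not name; the class EscapeDemo and its matches helper are illustrative names only. The backslash is doubled solely because it appears inside a Java string literal.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // An unescaped dot matches any character, so "1x5" is accepted.
        System.out.println(matches("1.5", "1x5"));   // true
        // An escaped dot matches only a literal dot.
        System.out.println(matches("1\\.5", "1x5")); // false
        System.out.println(matches("1\\.5", "1.5")); // true
    }
}
```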

Patterns for matching single characters:

The following character classes match one character from a set of characters. A character class is a set of characters between brackets. Inside the brackets, the special meaning of the regular expression characters ., |, (, ), [, {, +, ^, $, *, and ? is turned off. However, normal string substitution still occurs, so (for example) \b represents a backspace character and \n a newline. To include the literal characters ] and - within a character class, place them at the start.

[abc]: Matches the characters a, b, or c.
[^abc]: Matches any character except a, b, or c (negation).
[a-zA-Z]: Matches the characters a through z or A through Z, inclusive (range).
[a-d[m-p]]: Matches the characters a through d, or m through p: [a-dm-p] (union).
[a-z&&[def]]: Matches the characters d, e, or f (intersection).
[a-z&&[^bc]]: Matches the characters a through z, except for b and c: [ad-z] (subtraction).
[a-z&&[^m-p]]: Matches the characters a through z, and not m through p: [a-lq-z] (subtraction).
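The union, intersection, and subtraction forms listed above can be tried out with a short Java sketch. It assumes java.util.regex semantics; CharClassDemo and matches are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Negation: anything except a, b, or c.
        System.out.println(matches("[^abc]", "d"));        // true
        // Union: a through d, or m through p.
        System.out.println(matches("[a-d[m-p]]", "n"));    // true
        System.out.println(matches("[a-d[m-p]]", "e"));    // false
        // Intersection: only d, e, or f.
        System.out.println(matches("[a-z&&[def]]", "e"));  // true
        System.out.println(matches("[a-z&&[def]]", "a"));  // false
        // Subtraction: a through z without b and c, or without m through p.
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true
    }
}
```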

Predefined character classes:

.: Matches any character.
\d: Matches a digit: [0-9].
\D: Matches a non-digit: [^0-9].
\s: Matches a whitespace character: [ \t\n\x0B\f\r].
\S: Matches a non-whitespace character: [^\s].
\w: Matches a word character: [a-zA-Z_0-9].
\W: Matches a non-word character: [^\w].
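As an illustration of the predefined classes, the Java sketch below uses \s, \D, and \w for a few common clean-up tasks. It assumes java.util.regex semantics; the class and method names are hypothetical.

```java
public class PredefinedClassDemo {
    // Replaces every run of whitespace with a single space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Removes every non-digit character.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // True if the input consists only of word characters [a-zA-Z_0-9].
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a \t\n b"));      // a b
        System.out.println(digitsOnly("+1 (555) 010-9999"));     // 15550109999
        System.out.println(isWord("foo_bar1"));                  // true
        System.out.println(isWord("foo-bar"));                   // false
    }
}
```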

POSIX character classes (US-ASCII):

Classes for Unicode blocks and categories:

\p{InGreek}: Matches a character in the Greek block (simple block).
\p{Lu}: Matches an uppercase letter (simple category).
\p{Sc}: Matches a currency symbol.
\P{InGreek}: Matches any character except one in the Greek block (negation).
[\p{L}&&[^\p{Lu}]]: Matches any letter except an uppercase letter (subtraction).
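The Unicode classes above can be checked with a small Java sketch, again assuming java.util.regex semantics. UnicodeClassDemo and matches are illustrative names; non-ASCII characters are written as \u escapes so the source stays encoding-independent.

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // U+03B1 (Greek small letter alpha) lies in the Greek block.
        System.out.println(matches("\\p{InGreek}", "\u03B1"));       // true
        System.out.println(matches("\\P{InGreek}", "a"));            // true
        // Uppercase-letter category.
        System.out.println(matches("\\p{Lu}", "A"));                 // true
        System.out.println(matches("\\p{Lu}", "a"));                 // false
        // Currency symbols: dollar sign and U+20AC (euro sign).
        System.out.println(matches("\\p{Sc}", "$"));                 // true
        System.out.println(matches("\\p{Sc}", "\u20AC"));            // true
        // Any letter except an uppercase letter.
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));    // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));    // false
    }
}
```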

9.10.2. Character Sequences

Character sequences are matched by stringing the characters together.

XY: Matches X followed by Y.

The following constructs make it easier to match character sequences that contain special characters.

\Q: Quotes all characters until \E.
\E: Ends quoting started by \Q.
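A brief Java sketch of quoting with \Q and \E, assuming java.util.regex semantics; QuoteDemo and matches are hypothetical names. Without quoting, the + in the pattern acts as a repetition modifier and the literal text no longer matches.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unquoted: "1+" means one or more 1s, so the literal text is rejected.
        System.out.println(matches("1+1=2", "1+1=2"));          // false
        // Quoted: every character between \Q and \E is taken literally.
        System.out.println(matches("\\Q1+1=2\\E", "1+1=2"));    // true
        // Quoting can be limited to part of a pattern.
        System.out.println(matches("\\Q(x)\\E\\d+", "(x)42"));  // true
    }
}
```

In Java, Pattern.quote(String) offers a similar way to turn an arbitrary string into a literal pattern.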

9.10.3. Repetition

Repetition modifiers allow you to match multiple occurrences of a pattern.

X?: Matches X once or not at all.
X*: Matches X zero or more times.
X+: Matches X one or more times.
X{n}: Matches X exactly n times.
X{n,}: Matches X at least n times.
X{n,m}: Matches X at least n but not more than m times.

These modifiers are greedy, i.e. they match as much of the string as they can. Appending a question mark to the repetition modifier (for example X*? or X+?) makes it match as little as possible instead.
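The difference between the greedy and the minimal form can be sketched in Java as follows, assuming java.util.regex semantics. GreedyDemo and firstMatch are illustrative names, not part of any documented API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns the first substring of input matched by regex, or null if none.
    // Assumption: java.util.regex semantics.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .+ runs to the last '>' in the input.
        System.out.println(firstMatch("<.+>", "<a><b>"));    // <a><b>
        // Minimal: .+? stops at the first '>'.
        System.out.println(firstMatch("<.+?>", "<a><b>"));   // <a>
        // The same applies to counted repetition.
        System.out.println(firstMatch("a{2,3}", "aaaa"));    // aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa"));   // aa
    }
}
```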

9.10.4. Alternation

An unescaped vertical bar "|" matches either the regular expression that precedes it or the regular expression that follows it.

X|Y: Matches either X or Y.
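A minimal Java sketch of alternation, assuming java.util.regex semantics (AlternationDemo and matches are hypothetical names). It also shows that the bar applies to the whole expression on each side, not just to the adjacent characters.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "ab|cd" means "ab" or "cd" ...
        System.out.println(matches("ab|cd", "ab"));  // true
        System.out.println(matches("ab|cd", "cd"));  // true
        // ... not "a", then "b" or "c", then "d".
        System.out.println(matches("ab|cd", "acd")); // false
        System.out.println(matches("ab|cd", "abd")); // false
    }
}
```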

9.10.5. Grouping

Parentheses are used to group terms within a regular expression. Everything within the group is treated as a single regular expression.

(X): Matches X.
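The effect of grouping can be sketched in Java as follows, assuming java.util.regex semantics; GroupDemo and its helpers are illustrative names. A group lets a repetition modifier or an alternation apply to several characters at once, and in java.util.regex the text matched by each group can also be read back by number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns what group n matched when the whole input matches, else null.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        // Without a group, + applies to "b" only.
        System.out.println(matches("ab+", "ababab"));        // false
        // With a group, + applies to the whole sequence "ab".
        System.out.println(matches("(ab)+", "ababab"));      // true
        // A group limits the scope of an alternation.
        System.out.println(matches("gr(a|e)y", "grey"));     // true
        // Groups are numbered from 1 by their opening parenthesis.
        System.out.println(group("(\\d+)-(\\d+)", "10-20", 2)); // 20
    }
}
```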

9.10.6. Boundaries

The following boundaries can be specified.

^: Matches the beginning of a line.
$: Matches the end of a line.
\b: Matches a word boundary.
\B: Matches a non-word boundary.
\A: Matches the beginning of the string.
\G: Matches the end of the previous match.
\Z: Matches the end of the string but for the final terminator (e.g. a newline), if any.
\z: Matches the end of the string.
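Some of these boundaries can be illustrated with a Java sketch, assuming java.util.regex semantics; BoundaryDemo and its helpers are hypothetical names. One caveat under that assumption: in java.util.regex, ^ and $ match at every line only when the MULTILINE flag is set, and otherwise match only at the start and end of the input. Whether multi-line matching is enabled in a given context is not stated in this section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Returns the start index of the first match, or -1 if there is none.
    // Assumption: java.util.regex semantics.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Counts the non-overlapping matches of a compiled pattern in input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \b skips the "cat" inside "concat" and finds the separate word.
        System.out.println(firstIndex("cat", "concat cat"));         // 3
        System.out.println(firstIndex("\\bcat\\b", "concat cat"));   // 7
        // \Z tolerates a final line terminator, \z does not.
        System.out.println(firstIndex("abc\\Z", "abc\n"));           // 0
        System.out.println(firstIndex("abc\\z", "abc\n"));           // -1
        // ^ matches at each line start only with MULTILINE.
        String text = "one\ntwo\nthree";
        System.out.println(countMatches(Pattern.compile("^\\w+"), text)); // 1
        System.out.println(countMatches(
                Pattern.compile("^\\w+", Pattern.MULTILINE), text));      // 3
    }
}
```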

9.10.7. Back References

Back references allow you to reuse part of the current match later in the same pattern, for example to look for various forms of repetition.

\n: Whatever the n-th group matched.
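As a sketch of a back reference in use, the Java example below looks for an accidentally doubled word. It assumes java.util.regex semantics; BackrefDemo and its helpers are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    // Returns true if the whole input matches the pattern.
    // Assumption: java.util.regex semantics.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the first doubled word (e.g. "is is") in the text, or null.
    static String firstDoubledWord(String text) {
        // \1 must repeat exactly what group 1 (\w+) matched.
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \1 repeats the text the group matched, not the group's pattern.
        System.out.println(matches("(a|b)\\1", "aa"));             // true
        System.out.println(matches("(a|b)\\1", "ab"));             // false
        System.out.println(firstDoubledWord("this is is a test")); // is is
        System.out.println(firstDoubledWord("this is a test"));    // null
    }
}
```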
