
Bluecoat SGS Regular Expression Syntax

May 9, 2022

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like ‘A’, ‘a’, or ‘3’, are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so ‘last’ matches the characters ‘last’. (In the rest of this section, regular expressions are written in a courier font, usually without quotes, and strings to be matched are ‘in single quotes’.)

Some characters, like | or (, are special. Special characters, called metacharacters, either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
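To make the distinction between ordinary characters and metacharacters concrete, here is a minimal sketch using Python's `re` module, whose syntax closely resembles the one described here. It is only a stand-in for experimenting; it is not the SGS engine, and behavior on the appliance may differ in details. The helper name `fully_matches` is made up for this illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True if the whole of text matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Ordinary characters match themselves, so 'last' matches 'last'.
print(fully_matches(r"last", "last"))         # True

# '|' is a metacharacter: the pattern means "either 'cat' or 'dog'".
print(fully_matches(r"cat|dog", "dog"))       # True

# Escaped with a backslash, '|' is treated as an ordinary character.
print(fully_matches(r"cat\|dog", "cat|dog"))  # True
print(fully_matches(r"cat\|dog", "dog"))      # False
```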

The metacharacters are described in the following table.

Metacharacters used in regular expressions
Metacharacter Description

• (?i) Evaluate the expression following this metacharacter in a case-insensitive manner.

• . (Dot) In the default mode, this matches any character except a newline. (Note that newlines should not be detected when using regular expressions in CPL.)

• ^ (Circumflex or caret) Matches the start of the string.

• $ Matches the end of the string.

• * Causes the resulting RE to match zero (0) or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

• + Causes the resulting RE to match one (1) or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

• ? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

• *?, +?, ?? The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired. If the RE /page1/.*/ is matched against /page1/heading/images/, it will match the entire string, and not just /page1/heading/. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion, matching as few characters as possible. Using .*? in the previous expression will match only /page1/heading/.

• {m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 ‘a’ characters.

• {m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will only match 3 characters.

• \ Either escapes special characters (permitting you to match characters like ‘*?+&$’), or signals a special sequence; special sequences are discussed below.

• [] Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a ‘-’. Special characters are not active inside sets. For example, [akm$] will match any of the characters ‘a’, ‘k’, ‘m’, or ‘$’; [a-z] will match any lowercase letter and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w or \S (defined below) are also acceptable inside a set. If you want to include a ] or a - inside a set, precede it with a backslash. To match any character that is not in the set, include a ^ as the first character of the set; ^ elsewhere will simply match the ‘^’ character.

• | A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. This can be used inside groups (see below) as well. To match a literal ‘|’, use \|, or enclose it inside a character class, like [|].

• (…) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals ‘(’ or ‘)’, use \( or \), or enclose them inside a character class: [(] [)].
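The entries above can be tried out with a short sketch. It uses Python's `re` module as a stand-in, on the assumption that its syntax matches the table for these constructs; it does not run the SGS engine, so treat the results as illustrative rather than authoritative for the appliance. The helper `first_match` and the sample strings (other than those taken from the table) are invented for this example.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

url = "/page1/heading/images/"

# Greedy .* runs to the last '/'; non-greedy .*? stops at the first one it can.
print(first_match(r"/page1/.*/", url))    # /page1/heading/images/
print(first_match(r"/page1/.*?/", url))   # /page1/heading/

# Bounded repetition on the 6-character string 'aaaaaa'.
print(first_match(r"a{3,5}", "aaaaaa"))   # aaaaa
print(first_match(r"a{3,5}?", "aaaaaa"))  # aaa

# (?i) makes the expression that follows case-insensitive.
print(first_match(r"(?i)bluecoat", "BlueCoat SG"))  # BlueCoat

# Sets: '$' is literal inside [], and a leading ^ negates the set.
print(first_match(r"[akm$]", "x$y"))      # $
print(first_match(r"[^a-z]", "abc7"))     # 7

# Alternation, then an escaped '|' and escaped parentheses matched literally.
print(first_match(r"cat|dog", "hotdog"))  # dog
print(first_match(r"a\|b", "a|b"))        # a|b
print(first_match(r"\(x\)", "f(x)"))      # (x)

# A group's contents can be retrieved after the match, or re-matched with \1.
m = re.search(r"(ab)-\1", "ab-ab")
print(m.group(1))                         # ab
```

Comparing the two /page1/ results side by side is usually the quickest way to check whether a pattern needs the non-greedy form.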