Arctera™ Insight Classification Help
Regular expression syntax
The Arctera Insight Classification supports a regular expression syntax that is based on the syntax in the Perl programming language. In Perl regular expressions, all characters match themselves except for the following special characters:
.[]{}()\*+?|^$
For more information on the Perl syntax, see the following webpage:
https://perldoc.perl.org/perlre.html
You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Arctera Insight Classification.
Note:
Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Arctera Insight Classification first evaluates the word or phrase condition and only then looks for a regular expression match.
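The idea behind this note can be sketched in a few lines of Python. This is only an illustration, not how the Arctera Insight Classification is implemented: the helper name match_near_keyword, the distance measured in characters, and the sample pattern are all assumptions made for the example.

```python
import re

def match_near_keyword(text, keyword, pattern, distance):
    """Return regex matches found within `distance` characters of `keyword`.

    Hypothetical helper: the cheap keyword search runs first, and the
    regular expression is applied only to a window around each hit.
    """
    regex = re.compile(pattern)
    found = []
    start = text.find(keyword)
    while start != -1:
        lo = max(0, start - distance)
        hi = min(len(text), start + len(keyword) + distance)
        # Search only the window [lo, hi) around this keyword hit.
        for m in regex.finditer(text, lo, hi):
            if m.group() not in found:
                found.append(m.group())
        start = text.find(keyword, start + 1)
    return found

sample = "Account 1234-5678 was closed. " + "filler " * 10 + "Ref 9999-0000"
hits = match_near_keyword(sample, "Account", r"\d{4}-\d{4}", 20)
```

Here only the number next to the keyword is reported. The second number is never examined by the regular expression because no keyword occurs near it.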
The . (period) character matches any single character, when it is used outside of a character set.
The ^ (caret) character matches the start of a line. The $ (dollar) character matches the end of a line.
A section that is surrounded with the characters ( and ) acts as a marked subexpression. The matching algorithm captures whatever matches the subexpression. Marked subexpressions can be repeated, or a back-reference can refer to them.
A marked subexpression is useful for lexically grouping part of a regular expression, but it has the side-effect of additional overhead. As an alternative, you can lexically group part of a regular expression without generating a marked subexpression by using (?: and ). For example, (?:ab)+ repeats ab without splitting out any separate subexpressions.
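The difference between the two kinds of group can be seen in a short sketch. It uses Python's re module as a stand-in engine, which is an assumption. The product's own engine may differ in other respects, but both group forms shown here are common to Perl-style syntaxes.

```python
import re

# Marked subexpression: the engine records what the group matched.
marked = re.fullmatch(r"(ab)+c", "ababc")

# Non-marked group: same grouping for the + operator, nothing recorded.
unmarked = re.fullmatch(r"(?:ab)+c", "ababc")

marked_groups = marked.groups()      # one captured subexpression
unmarked_groups = unmarked.groups()  # no captured subexpressions
```

Both expressions accept the same strings. Only the first one keeps a copy of the text matched by the group.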
You can repeat any atom (single character, marked or non-marked subexpression, or character class) with the operators *, +, ?, and {}.
Table: Repeat operators
Operator | Description |
|---|---|
* | Matches the preceding atom zero or more times. For example, a*b matches any of the following: b ab aaaaaaaab |
+ | Matches the preceding atom one or more times. For example, a+b matches either of the following: ab aaaaaaaab However, it does not match b. |
? | Matches the preceding atom zero or one times. For example, ca?b matches either of the following: cb cab However, it does not match caab. |
{} | Repeats the preceding atom with a bounded repeat. a{n} matches a repeated exactly n times. a{n,} matches a repeated n or more times. a{n,m} matches a repeated between n and m times inclusive. For example, ^a{2,3}$ matches either of the following: aa aaa However, it does not match a or aaaa. |
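The examples in the table can be checked with a small script. The sketch below assumes Python's re module, whose repeat operators behave the same way for these patterns. The helper name matches is introduced here for illustration only.

```python
import re

def matches(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

results = {
    "a*b": [matches(r"a*b", s) for s in ("b", "ab", "aaaaaaaab")],
    "a+b": [matches(r"a+b", s) for s in ("ab", "aaaaaaaab", "b")],
    "ca?b": [matches(r"ca?b", s) for s in ("cb", "cab", "caab")],
    "^a{2,3}$": [matches(r"^a{2,3}$", s) for s in ("aa", "aaa", "a", "aaaa")],
}
```

Each list holds one result per test string, in the order the strings appear in the table row.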
These operators are "greedy"; they consume as much input as possible. However, non-greedy versions are available that consume as little input as possible while still producing a match. To make a repeat non-greedy, follow the repeat operator *, +, ?, or {} with the ? character.
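A brief sketch of the difference, again assuming Python's re module as the test engine and using a made-up sample string:

```python
import re

text = "<b>bold</b>"

# Greedy: .+ takes as much as it can, so the match runs to the last >.
greedy = re.search(r"<.+>", text).group()

# Non-greedy: .+? takes as little as it can, so the match stops at the first >.
lazy = re.search(r"<.+?>", text).group()
```

The greedy form returns the whole string, whereas the non-greedy form returns only the opening tag.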
By default, when a repeated pattern does not match, the Arctera Insight Classification backtracks until it finds a match. This behavior can sometimes be undesirable for matching or performance reasons, so there are also "possessive" repeats. These match as much as possible and do not then allow backtracking if the rest of the expression fails to match.
An escape character that is followed by a digit n, where n is in the range 1 through 9, matches the same string that the subexpression n matched. For example, consider the following expression:
^(a{2,3}).*\1$
This matches aaabbaaa, but it does not match aaabba.
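The back-reference example can be tried out as follows. Python's re module is assumed as the engine; it treats \1 in the same way for this expression.

```python
import re

pattern = re.compile(r"^(a{2,3}).*\1$")

# The first subexpression captures aaa, and \1 requires aaa again at the end.
hit = pattern.search("aaabbaaa")
captured = hit.group(1)

# Neither aaa nor aa appears again at the end of this string.
miss = pattern.search("aaabba")
```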
The | operator matches either of its arguments. For example, abc|def matches both abc and def.
You can use parentheses to group alternations. For example, ab(?:d|ef) matches both abd and abef.
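Both alternation examples can be confirmed with the sketch below, which assumes Python's re module. The extra test strings abcdef and abdef are additions for illustration.

```python
import re

def matches(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

# abc|def accepts either complete alternative, but not both run together.
plain = [matches(r"abc|def", s) for s in ("abc", "def", "abcdef")]

# ab(?:d|ef) applies the alternation to the tail only.
grouped = [matches(r"ab(?:d|ef)", s) for s in ("abd", "abef", "abdef")]
```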
A character set is a bracket expression that is enclosed within the characters [ and ]. It defines a set of characters, and matches any single character that is a member of the set.
A bracket expression can contain any combination of the following:
Single characters. For example, [abc] matches any of the characters a, b, or c.
Character ranges. For example, [a-c] matches any single character in the range a through c. By default, for Perl regular expressions, a character x is within the range y to z if the code point of the character lies within the code points of the endpoints of the range.
Negation. If the bracket expression begins with the ^ character, it matches the complement of the characters that it contains. For example, [^a-c] matches any character that is not in the range a through c.
Character classes. An expression of the form [[:name:]] matches the named character class name. For example, [[:lower:]] matches any lowercase character. The supported character classes are as follows:
Name | Description |
|---|---|
alnum | Any alphanumeric character. |
alpha | Any alphabetic character. |
blank | Any whitespace character that is not a line separator. |
cntrl | Any control character. |
d | Any decimal digit. |
digit | Any decimal digit. |
graph | Any graphical character. |
l | Any lowercase character. |
lower | Any lowercase character. |
print | Any printable character. |
punct | Any punctuation character. |
s | Any whitespace character. |
space | Any whitespace character. |
unicode | Any extended character whose code point is above 255 in value. |
u | Any uppercase character. |
upper | Any uppercase character. |
w | Any word character (alphanumeric characters plus the underscore). |
word | Any word character (alphanumeric characters plus the underscore). |
xdigit | Any hexadecimal digit character. |
Escaped characters. All the escape sequences that match a single character or character class are permitted within a character class definition. For example, [\[\]] matches either [ or ], whereas [\W\d] matches any character that is either a digit or not a word character.
Combinations. You can combine one or more of the above in a character set declaration. For example, [a-cmnx-y\d].
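The bracket expressions described above can be explored with a short script. The sketch assumes Python's re module, which does not support the [[:name:]] form, so named classes are left out. The helper name chars_matching and the candidate string are made up for the example.

```python
import re

def chars_matching(char_set, candidates):
    """Return the characters in `candidates` that the set accepts."""
    return [c for c in candidates if re.fullmatch(char_set, c)]

candidates = "abdmxz5_[ "

single = chars_matching(r"[abc]", candidates)          # listed characters
in_range = chars_matching(r"[a-c]", candidates)        # range a through c
negated = chars_matching(r"[^a-c]", candidates)        # everything else
escaped = chars_matching(r"[\W\d]", candidates)        # digit or non-word
brackets = chars_matching(r"[\[\]]", candidates)       # literal [ or ]
combined = chars_matching(r"[a-cmnx-y\d]", candidates) # mixed forms
```

Note that the underscore is a word character, so [\W\d] does not accept it.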
Any special character that is preceded by an escape matches itself.
Table: Escape sequences that are synonyms for single characters
Escape | Character |
|---|---|
\a | \a |
\e | 0x1B |
\f | \f |
\n | \n |
\r | \r |
\t | \t |
\v | \v |
\b | \b (but only inside a character class declaration). |
\cX | An ASCII escape sequence: the character whose code point is X % 32. |
\xXX | A hexadecimal escape sequence: matches the single character whose code point is 0xXX. |
\x{XXXX} | A hexadecimal escape sequence: matches the single character whose code point is 0xXXXX. |
\0ddd | An octal escape sequence: matches the single character whose code point is 0ddd. |
\N{name} | Matches the single character that has the symbolic name name. For example, \N{newline} matches the single character \n. |
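The arithmetic behind the \cX row, and the code-point form of the hexadecimal escapes, can be worked through as below. This is a sketch in Python, which is an assumption. Python's re module has no \cX or \e escape, so the control characters are computed directly rather than matched. The function name control_code_point is hypothetical.

```python
import re

def control_code_point(letter):
    # Code point selected by a control escape, using the X % 32 rule.
    return ord(letter) % 32

ctrl_m = control_code_point("M")  # same code point as carriage return
ctrl_i = control_code_point("I")  # same code point as horizontal tab

# A two-digit hexadecimal escape names a character by its code point.
hex_matches_a = re.fullmatch(r"\x41", "A") is not None

# 0x1B is the code point that the table lists for the \e escape.
hex_matches_esc = re.fullmatch(r"\x1B", chr(0x1B)) is not None
```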
When x is the name of a character class, the escaped character x (for example, \d) matches any character that is a member of the class. Conversely, the escaped uppercase character X (for example, \D) matches any character that is not a member of the x class.
Table: Escape sequences for "single character" character classes
Escape | Equivalent to | Escape | Equivalent to |
|---|---|---|---|
\d | [[:digit:]] | \D | [^[:digit:]] |
\l | [[:lower:]] | \L | [^[:lower:]] |
\s | [[:space:]] | \S | [^[:space:]] |
\u | [[:upper:]] | \U | [^[:upper:]] |
\w | [[:word:]] | \W | [^[:word:]] |
\h | Horizontal whitespace | \H | Not horizontal whitespace |
\v | Vertical whitespace | \V | Not vertical whitespace |
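A few of these class escapes in use are shown below. The sketch assumes Python's re module, which supports \d, \w, \s and their uppercase complements but not \l, \u, or \h, so only the shared escapes appear. The sample text is invented.

```python
import re

text = "Order 66, item_7\tshipped"

digits = re.findall(r"\d+", text)     # runs of decimal digits
words = re.findall(r"\w+", text)      # runs of word characters
non_space = re.findall(r"\S+", text)  # runs of non-whitespace characters
digits_only = re.sub(r"\D", "", text) # drop everything that is not a digit
```

The comma stays attached to 66 in the \S+ result because it is not whitespace, but it is excluded from the \w+ result because it is not a word character.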
The following escape sequences match boundaries of words.
Table: Escape sequences for word boundaries
Escape | Description |
|---|---|
\< | Matches the start of a word. |
\> | Matches the end of a word. |
\b | Matches a word boundary (the start or end of a word). |
\B | Matches only when not at a word boundary. |
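The effect of \b and \B can be seen on a sample sentence. Python's re module is assumed here. It supports \b and \B but not \< and \>, so those two are not shown.

```python
import re

text = "cat concat catalog cat."

# \b on both sides: only the standalone word.
whole_words = re.findall(r"\bcat\b", text)

# \b at the start only: any word that begins with cat.
word_starts = re.findall(r"\bcat\w*", text)

# \B at the start: cat must continue an existing word.
word_ends = re.findall(r"\Bcat\b", text)
```

The last expression finds only the cat at the end of concat.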
The following escape sequences match line endings.
Table: Escape sequences for line endings
Escape | Description |
|---|---|
\n | Newline. |
\r | CR. |
\R | Any line-ending character sequence. This is identical to the following expression: (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}]) |
Except for the following characters, any escape sequence matches the character that is escaped:
' ` A C E G K Q X z Z
For example, \@ matches a literal @.
All Perl-specific extensions to the regular expression syntax start with (?.
You can create a named subexpression as follows:
(?<NAME>expression)
You can then refer to the subexpression by the name NAME. Alternatively, you can delimit the name, as in the following:
(?'NAME'expression)
You can then refer to the subexpression in a backreference using either \g{NAME} or \k<NAME>.
(?# ... ) is treated as a comment. Its contents are ignored.
(?imsx-imsx ... ) alters which of the Perl modifiers are in effect within the pattern. Changes take effect from the point that the block is first seen and extend to any enclosing ). Letters before a '-' turn this Perl modifier on, and those after the '-' turn it off.
(?imsx-imsx:pattern) applies the specified modifiers to pattern only.
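The modifier syntax can be tried with the following sketch, which assumes Python's re module. Python accepts the scoped (?i:pattern) form anywhere, but accepts the unscoped form only at the very start of an expression, so the second example places (?m) first. The sample strings are invented.

```python
import re

# Scoped modifier: only "invoice" is matched without regard to case.
scoped = re.compile(r"(?i:invoice) No")
upper_word = scoped.fullmatch("INVOICE No") is not None
upper_tail = scoped.fullmatch("invoice NO") is not None

# The m modifier lets ^ and $ match at the start and end of every line.
single_word_lines = re.findall(r"(?m)^\w+$", "one\ntwo words\nthree")
```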
(?:pattern) lexically groups pattern, without generating an additional subexpression.
(?=pattern) consumes zero characters, but only if pattern matches.
(?!pattern) consumes zero characters, but only if pattern does not match.
You typically use lookahead to create the logical AND of two regular expressions. For example, if a password must contain a lowercase letter, uppercase letter, and punctuation symbol, and it must be at least six characters long, then you can use the following expression to validate the password:
(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}
(?<=pattern) consumes zero characters, but only if pattern can be matched against the characters that precede the current position (pattern must be of fixed length).
(?<!pattern) consumes zero characters, but only if pattern cannot be matched against the characters that precede the current position (pattern must be of fixed length).
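The lookahead and lookbehind forms can be exercised as follows. The sketch assumes Python's re module, which lacks the [[:name:]] classes, so the password expression substitutes [a-z], [A-Z], and [^\w\s] as rough equivalents. That substitution is an approximation, not the product's definition of those classes. The helper name is_valid and the sample strings are hypothetical.

```python
import re

# Logical AND of three lookaheads, then the length requirement.
password = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[^\w\s]).{6,}")

def is_valid(candidate):
    """True if the candidate satisfies every lookahead and the length rule."""
    return password.fullmatch(candidate) is not None

checks = [is_valid(s) for s in ("Passw0rd!", "password!", "Ab!", "Abcdef")]

text = "cost $45, qty 12"
# Lookbehind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: whole numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)
```

Because the lookarounds consume no characters, the dollar sign itself is not part of either result.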
(?>pattern) matches pattern independently of the surrounding patterns. The expression never backtracks into pattern.
(?(condition)yes-pattern|no-pattern) tries to match yes-pattern if the condition is true, and otherwise tries to match no-pattern.
(?(condition)yes-pattern) tries to match yes-pattern if the condition is true, and otherwise matches the NULL string.
condition may be one of the following:
A forward lookahead assert.
The index of a marked subexpression (the condition becomes true if the subexpression has been matched).
Here is a summary of the possible predicates:
Predicate | Description |
|---|---|
(?(?=assert)yes-pattern|no-pattern) | Executes yes-pattern if the forward look-ahead assert matches, and otherwise executes no-pattern. |
(?(?!assert)yes-pattern|no-pattern) | Executes yes-pattern if the forward look-ahead assert does not match, and otherwise executes no-pattern. |
(?(N)yes-pattern|no-pattern) | Executes yes-pattern if subexpression N has been matched, and otherwise executes no-pattern. |
(?(<name>)yes-pattern|no-pattern) | Executes yes-pattern if named subexpression name has been matched, and otherwise executes no-pattern. |
(?('name')yes-pattern|no-pattern) | Executes yes-pattern if named subexpression name has been matched, and otherwise executes no-pattern. |
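The (?(N)yes-pattern) form can be illustrated with an expression that requires a closing angle bracket only when an opening one was matched. The sketch assumes Python's re module, which supports conditions on a subexpression index or name but not the lookahead conditions. The expression and test strings are made up for the example.

```python
import re

# Subexpression 1 is the optional <. If it matched, > is required at the end.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

samples = ("<abc>", "abc", "<abc", "abc>")
results = {s: bracketed.fullmatch(s) is not None for s in samples}
```

The two balanced forms are accepted, and the two unbalanced forms are rejected.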
The order of precedence for the operators, from highest to lowest, is as follows:
Escaped characters \
Character set (bracket expression) []
Grouping ()
Single-character-ERE duplication * + ? {m,n}
Concatenation
Anchoring ^$
Alternation |
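The practical effect of this ordering shows up when expressions are written without grouping. The sketch below assumes Python's re module, which follows the same precedence for these operators. The test strings are invented.

```python
import re

def matches(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

samples = ("ab", "cd", "abd", "acd")
# Concatenation binds more tightly than alternation: ab|cd means (ab) or (cd).
ungrouped_alt = [matches(r"ab|cd", s) for s in samples]
# Grouping overrides that: a, then b or c, then d.
grouped_alt = [matches(r"a(?:b|c)d", s) for s in samples]

repeats = ("a", "abbb", "abab")
# Duplication binds more tightly than concatenation: ab* means a then b*.
ungrouped_rep = [matches(r"ab*", s) for s in repeats]
grouped_rep = [matches(r"(?:ab)*", s) for s in repeats]

# Anchoring binds more tightly than alternation: ^ab|cd$ means (^ab) or (cd$).
loose_anchor = re.search(r"^ab|cd$", "abxx") is not None
strict_anchor = re.search(r"^(?:ab|cd)$", "abxx") is not None
```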