Arctera™ Insight Classification Help

Product(s): Arctera Insight Classification (Version Not Specified)

Regular expression syntax

About regular expressions

The Arctera Insight Classification supports a regular expression syntax that is based on the syntax in the Perl programming language. In Perl regular expressions, all characters match themselves except for the following special characters:

.[]{}()\*+?|^$

For more information on the Perl syntax, see the following webpage:

https://perldoc.perl.org/perlre.html

You may find it helpful to build and test your regular expressions using the free online tool at https://regex101.com. This tool displays an explanation of your regular expression as you type it, and also lists all matches between the regular expression and a test string of your choice. The default regular expression flavor, pcre (php), is compatible with the Arctera Insight Classification.

Note:

Looking for regular expression matches is considerably slower than looking for matches for specific words or phrases. You can greatly improve performance and accuracy by looking for instances where both types of matches occur in proximity to each other. To do this, set up an All of condition group that contains both a regular expression condition and a contains text condition for finding specific words and phrases, and specify the required distance within which matches must occur. The Arctera Insight Classification first evaluates the contains text condition and only then looks for a regular expression match.
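The idea behind this note can be sketched in a few lines of Python. This is only an illustration of the "cheap text check first, regular expression second" approach, not how the product implements it: the function name matches_near_keyword, the sample text, and the use of a character count as the distance unit are all assumptions made for the example.

```python
import re

def matches_near_keyword(text, keyword, pattern, distance):
    """Return regex matches that start within `distance` characters of `keyword`."""
    # Cheap literal scan first: collect every position of the keyword.
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + 1)
    if not positions:
        return []  # keyword absent, so the slower regex search is skipped
    hits = []
    for match in re.finditer(pattern, text):
        # Keep a regex match only if some keyword occurrence is close enough.
        if any(abs(match.start() - pos) <= distance for pos in positions):
            hits.append(match.group())
    return hits

text = "Invoice 1001. SSN: 123-45-6789. Unrelated code 987-65-4321 appears much later."
near = matches_near_keyword(text, "SSN", r"\d{3}-\d{2}-\d{4}", 10)
print(near)  # ['123-45-6789']
```

The second number has the same shape but is too far from the keyword, so it is not reported. This is the accuracy gain that the note describes.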

Wildcards

The . (period) character matches any single character, when it is used outside of a character set.

Anchors

The ^ (caret) character matches the start of a line. The $ (dollar) character matches the end of a line.
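As a quick illustration of the wildcard and the anchors together, here is a small sketch that uses Python's re module as a stand-in engine. Python is an assumption of this example; in Python, ^ and $ match at line boundaries only when the re.MULTILINE flag is set.

```python
import re

text = "cat\ncot\ncart"
# '.' stands for exactly one character, so c.t matches cat and cot but not cart.
# '^' and '$' pin the match to the start and the end of each line.
line_matches = re.findall(r"^c.t$", text, flags=re.MULTILINE)
print(line_matches)  # ['cat', 'cot']
```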

Marked subexpressions

A section that is surrounded with the characters ( and ) acts as a marked subexpression. The matching algorithm captures whatever matches the subexpression. Marked subexpressions can be repeated, or a back-reference can refer to them.

Non-marking groupings

A marked subexpression is useful for lexically grouping part of a regular expression, but it has the side-effect of additional overhead. As an alternative, you can lexically group part of a regular expression without generating a marked subexpression by using (?: and ). For example, (?:ab)+ repeats ab without splitting out any separate subexpressions.
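The difference between the two kinds of grouping is easy to see in a short sketch. Python's re module is used here as a stand-in engine, which is an assumption of the example; both patterns accept the same text, but only the marked form records a captured subexpression.

```python
import re

marked = re.fullmatch(r"(ab)+", "ababab")
unmarked = re.fullmatch(r"(?:ab)+", "ababab")

# Both patterns match the whole string.
print(marked.group(0), unmarked.group(0))  # ababab ababab
# Only the marked subexpression is captured (the last repetition is kept).
print(marked.groups())    # ('ab',)
print(unmarked.groups())  # ()
```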

Repeats

You can repeat any atom (single character, marked or non-marked subexpression, or character class) with the operators *, +, ?, and {}.

Table: Repeat operators

Operator

Description

*

Matches the preceding atom zero or more times. For example, a*b matches any of the following:

b
ab
aaaaaaaab

+

Matches the preceding atom one or more times. For example, a+b matches either of the following:

ab
aaaaaaaab

However, it does not match b.

?

Matches the preceding atom zero or one times. For example, ca?b matches either of the following:

cb
cab

However, it does not match caab.

{}

Repeats the preceding atom with a bounded repeat.

a{n} matches a repeated exactly n times.

a{n,} matches a repeated n or more times.

a{n,m} matches a repeated between n and m times inclusive.

For example, ^a{2,3}$ matches either of the following:

aa
aaa

However, it does not match a or aaaa.

These operators are "greedy"; they consume as much input as possible. However, non-greedy versions are available that consume as little input as possible while still producing a match. To make a repeat non-greedy, follow the repeat operator with the ? character, as in *?, +?, ??, and {n,m}?.
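The repeat operators and the greedy versus non-greedy distinction can be checked with a short sketch. Python's re module is used as a stand-in engine, and the helper name accepted and the sample strings are assumptions of this example.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = accepted(r"a*b", ["b", "ab", "aaaaaaaab"])            # zero or more
plus = accepted(r"a+b", ["b", "ab", "aaaaaaaab"])            # one or more
optional = accepted(r"ca?b", ["cb", "cab", "caab"])          # zero or one
bounded = accepted(r"^a{2,3}$", ["a", "aa", "aaa", "aaaa"])  # two or three
print(star, plus, optional, bounded)

html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)   # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the first </b>
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```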

By default, when a repeated pattern does not match, the Arctera Insight Classification backtracks until it finds a match. This behavior can sometimes be undesirable for matching or performance reasons, so there are also "possessive" repeats. In Perl syntax, you make a repeat possessive by following the repeat operator with the + character, as in a*+ or a++. Possessive repeats match as much as possible and do not then allow backtracking if the rest of the expression fails to match.

Back references

An escape character that is followed by a digit n, where n is in the range 1 through 9, matches the same string that the subexpression n matched. For example, consider the following expression:

^(a{2,3}).*\1$

This matches aaabbaaa, but it does not match aaabba.
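The back-reference example can be verified with a short sketch. Python's re module is used as a stand-in engine, which is an assumption of this example; the \1 syntax is the same.

```python
import re

pattern = re.compile(r"^(a{2,3}).*\1$")

first = pattern.search("aaabbaaa")
second = pattern.search("aaabba")

# The string must end with the same run of a characters that group 1 captured.
print(first.group(1))  # aaa
print(second)          # None: neither "aaa" nor "aa" appears again at the end
```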

Alternation

The | operator matches either of its arguments. For example, abc|def matches either abc or def.

You can use parentheses to group alternations. For example, ab(?:d|ef) matches either abd or abef.
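A brief sketch shows why the grouping matters. Python's re module is used as a stand-in engine, which is an assumption of this example; without the parentheses, the alternation splits the whole expression rather than just the tail.

```python
import re

grouped = [s for s in ["abd", "abef", "ef"] if re.fullmatch(r"ab(?:d|ef)", s)]
ungrouped = [s for s in ["abd", "abef", "ef"] if re.fullmatch(r"abd|ef", s)]

print(grouped)    # ['abd', 'abef']
print(ungrouped)  # ['abd', 'ef'], because abd|ef means "abd" or "ef"
```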

Character sets

A character set is a bracket expression that is enclosed within the characters [ and ]. It defines a set of characters, and matches any single character that is a member of the set.

A bracket expression can contain any combination of the following:

  • Single characters. For example, [abc] matches any of the characters a, b, or c.

  • Character ranges. For example, [a-c] matches any single character in the range a through c. By default, for Perl regular expressions, a character x is within the range y to z if its code point lies between the code points of the endpoints, inclusive.

  • Negation. If the bracket expression begins with the ^ character, it matches the complement of the characters that it contains. For example, [^a-c] matches any character that is not in the range a through c.

  • Character classes. An expression of the form [[:name:]] matches the named character class name. For example, [[:lower:]] matches any lowercase character. The supported character classes are as follows:

    alnum     Any alphanumeric character.
    punct     Any punctuation character.
    alpha     Any alphabetic character.
    s         Any whitespace character.
    blank     Any whitespace character that is not a line separator.
    space     Any whitespace character.
    cntrl     Any control character.
    unicode   Any extended character whose code point is above 255 in value.
    d         Any decimal digit.
    u         Any uppercase character.
    digit     Any decimal digit.
    upper     Any uppercase character.
    graph     Any graphical character.
    w         Any word character (alphanumeric characters plus the underscore).
    l         Any lowercase character.
    word      Any word character (alphanumeric characters plus the underscore).
    lower     Any lowercase character.
    xdigit    Any hexadecimal digit character.
    print     Any printable character.

      
  • Escaped characters. All the escape sequences that match a single character or character class are permitted within a character class definition. For example, [\[\]] matches either [ or ], whereas [\W\d] matches any character that is either a digit or not a word character.

  • Combinations. You can combine one or more of the above in a character set declaration. For example, [a-cmnx-y\d].
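The kinds of bracket expression listed above can be compared side by side in a short sketch. Python's re module is used as a stand-in engine, which is an assumption of this example; note that Python does not support the [[:name:]] class form, so that form is left out. The helper name members and the sample characters are also assumptions.

```python
import re

samples = ["a", "d", "m", "y", "7", "[", "_", "-"]

def members(char_set):
    """Return the sample characters that the character set matches."""
    return [c for c in samples if re.fullmatch(char_set, c)]

print(members(r"[abc]"))         # ['a']
print(members(r"[a-c]"))         # ['a']
print(members(r"[^a-c]"))        # everything except 'a'
print(members(r"[a-cmnx-y\d]"))  # ['a', 'm', 'y', '7']
print(members(r"[\[\]]"))        # ['[']
print(members(r"[\W\d]"))        # ['7', '[', '-'], since '_' is a word character
```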

Escapes

Any special character that is preceded by an escape matches itself.

Table: Escape sequences that are synonyms for single characters

Escape      Character
\a          \a
\e          0x1B
\f          \f
\n          \n
\r          \r
\t          \t
\v          \v
\b          \b (but only inside a character class declaration).
\cX         An ASCII escape sequence: the character whose code point is X % 32.
\xXX        A hexadecimal escape sequence: matches the single character whose code point is 0xXX.
\x{XXXX}    A hexadecimal escape sequence: matches the single character whose code point is 0xXXXX.
\0ddd       An octal escape sequence: matches the single character whose code point is 0ddd.
\N{name}    Matches the single character that has the symbolic name name. For example, \N{newline} matches the single character \n.
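The X % 32 rule for \cX is easier to follow with a worked calculation. The sketch below uses plain Python arithmetic; the helper name control_char is an assumption of the example. Python's re module does not accept \cX itself, so only the arithmetic is shown for that escape, followed by two escapes that Python does share with the Perl syntax.

```python
import re

def control_char(letter):
    """Character denoted by a control escape, using code point = X % 32."""
    return chr(ord(letter) % 32)

# ord('M') is 77 and 77 % 32 is 13, the carriage return.
print(repr(control_char("M")))  # '\r'
# ord('J') is 74 and 74 % 32 is 10, the newline.
print(repr(control_char("J")))  # '\n'
# ord('I') is 73 and 73 % 32 is 9, the tab.
print(repr(control_char("I")))  # '\t'

# A hexadecimal escape and an escaped special character.
hex_match = re.fullmatch(r"\x41", "A")    # 0x41 is the letter A
plus_match = re.fullmatch(r"1\+1", "1+1")  # \+ matches a literal plus sign
print(hex_match is not None, plus_match is not None)  # True True
```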

"Single character" character classes

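When x is the name of a character class, the escape sequence \x matches any character that is a member of the class. Conversely, the uppercase form \X matches any character that is not a member of the x class. For example, \d matches any decimal digit and \D matches any character that is not a decimal digit.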

Table: Escape sequences for "single character" character classes

Escape   Equivalent to            Escape   Equivalent to
\d       [[:digit:]]              \D       [^[:digit:]]
\l       [[:lower:]]              \L       [^[:lower:]]
\s       [[:space:]]              \S       [^[:space:]]
\u       [[:upper:]]              \U       [^[:upper:]]
\w       [[:word:]]               \W       [^[:word:]]
\h       Horizontal whitespace    \H       Not horizontal whitespace
\v       Vertical whitespace      \V       Not vertical whitespace
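A few of these escapes can be tried out in a short sketch. Python's re module is used as a stand-in engine, which is an assumption of this example; Python supports \d, \D, \s, \S, \w, and \W, but not \l, \u, \h, or \H, so only the shared escapes appear here.

```python
import re

text = "Order 66, aisle B-12"

digits = re.findall(r"\d+", text)      # runs of decimal digits
non_digits = re.findall(r"\D+", text)  # runs of anything else
words = re.findall(r"\w+", text)       # runs of word characters
chunks = re.findall(r"\S+", text)      # runs of non-whitespace

print(digits)      # ['66', '12']
print(non_digits)  # ['Order ', ', aisle B-']
print(words)       # ['Order', '66', 'aisle', 'B', '12']
print(chunks)      # ['Order', '66,', 'aisle', 'B-12']
```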

Word boundaries

The following escape sequences match boundaries of words.

Table: Escape sequences for word boundaries

Escape   Description
\<       Matches the start of a word.
\>       Matches the end of a word.
\b       Matches a word boundary (the start or end of a word).
\B       Matches only when not at a word boundary.
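The effect of \b and \B can be seen by recording where each pattern matches. Python's re module is used as a stand-in engine, which is an assumption of this example; Python does not support \< and \>, so they are not shown. The helper name starts is also an assumption.

```python
import re

text = "cat concat cats cat."

def starts(pattern):
    """Return the start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

whole_word = starts(r"\bcat\b")  # cat as a complete word
word_end = starts(r"\Bcat\b")    # cat at the end of a longer word (concat)
word_start = starts(r"\bcat\B")  # cat at the start of a longer word (cats)

print(whole_word)  # [0, 16]
print(word_end)    # [7]
print(word_start)  # [11]
```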

Line endings

The following escape sequences match line endings.

Table: Escape sequences for line endings

Escape   Description
\n       Newline.
\r       CR.
\R       Any line-ending character sequence. This is identical to the following expression:

(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
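To show what the \R expansion covers, the sketch below splits a string on every kind of line ending. It uses Python's re module as a stand-in engine, which is an assumption of this example. Python has no \R escape, so the expansion is rewritten with a non-marking group and \uXXXX escapes; this rewritten form is an approximation, because the original uses an independent subexpression that never backtracks.

```python
import re

# Approximate Python spelling of the \R expansion: CR optionally followed by LF,
# or any one of LF, VT, FF, NEL (0x85), LINE SEPARATOR, PARAGRAPH SEPARATOR.
LINE_ENDING = r"(?:\r\n?|[\n-\x0c\x85\u2028\u2029])"

text = "one\r\ntwo\nthree\rfour\u2028five"
lines = re.split(LINE_ENDING, text)
print(lines)  # ['one', 'two', 'three', 'four', 'five']
```

The CR LF pair is consumed as a single line ending, so no empty string appears between one and two.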

Other escapes

Except for the following characters, any escape sequence matches the character that is escaped:

' ` A C E G K Q X z Z

For example, \@ matches a literal @.

Perl-specific extensions

All Perl-specific extensions to the regular expression syntax start with (?.

Named subexpressions

You can create a named subexpression as follows:

(?<NAME>expression)

You can then refer to the subexpression by the name NAME. Alternatively, you can delimit the name, as in the following:

(?'NAME'expression)

You can then refer to the subexpression in a backreference using either \g{NAME} or \k<NAME>.
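A named back-reference can be used, for example, to find a word that is immediately repeated. The sketch below uses Python's re module as a stand-in engine, and the spelling differs there: Python writes the group as (?P<NAME>expression) and the back-reference as (?P=NAME), not the (?<NAME>expression) and \k<NAME> forms described in this topic. The group name word and the sample text are assumptions of the example.

```python
import re

# Python spelling; the Perl-style equivalent would be \b(?<word>\w+) \k<word>\b
repeated = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

text = "it is is what it it is"
doubled = [m.group("word") for m in repeated.finditer(text)]
print(doubled)  # ['is', 'it']
```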

Comments

(?# ... ) is treated as a comment. Its contents are ignored.

Modifiers

(?imsx-imsx ... ) alters which of the Perl modifiers are in effect within the pattern. Changes take effect from the point that the block is first seen and extend to any enclosing ). Letters before the '-' turn the corresponding modifiers on, and letters after the '-' turn them off.

(?imsx-imsx:pattern) applies the specified modifiers to pattern only.
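The scoped form is the easier one to reason about, because the modifier stops applying at the closing parenthesis. The sketch below uses Python's re module as a stand-in engine, which is an assumption of this example; it turns on case-insensitive matching (the i modifier) for the first word only.

```python
import re

scoped = re.compile(r"(?i:hello) world")

first = scoped.fullmatch("HELLO world")   # i applies to hello only
second = scoped.fullmatch("HELLO WORLD")  # world is still case-sensitive

print(first is not None)   # True
print(second is not None)  # False
```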

Non-marking groups

(?:pattern) lexically groups pattern, without generating an additional subexpression.

Lookahead

(?=pattern) consumes zero characters, but only if pattern matches.

(?!pattern) consumes zero characters, but only if pattern does not match.

You typically use lookahead to create the logical AND of two regular expressions. For example, if a password must contain a lowercase letter, uppercase letter, and punctuation symbol, and it must be at least six characters long, then you can use the following expression to validate the password:

(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}
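The password check can be tried out with a short sketch. Python's re module is used as a stand-in engine, which is an assumption of this example. Because Python does not support the [[:name:]] classes, the sketch substitutes [a-z], [A-Z], and a set built from string.punctuation; these substitutes cover ASCII only and are not exact equivalents. The function name is_valid is also an assumption.

```python
import re
import string

# ASCII punctuation, escaped so that it is safe inside a character set.
PUNCT = re.escape(string.punctuation)

# Three lookaheads act as a logical AND; .{6,} then enforces the length.
PASSWORD = re.compile(rf"(?=.*[a-z])(?=.*[A-Z])(?=.*[{PUNCT}]).{{6,}}")

def is_valid(candidate):
    """Return True if the candidate satisfies all four requirements."""
    return PASSWORD.fullmatch(candidate) is not None

print(is_valid("Abc!de"))  # True
print(is_valid("abc!de"))  # False: no uppercase letter
print(is_valid("Abcdef"))  # False: no punctuation symbol
print(is_valid("Ab!c"))    # False: shorter than six characters
```

Each lookahead starts from the same position and consumes nothing, so the order of the three conditions does not affect the result.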

Lookbehind

(?<=pattern) consumes zero characters, but only if pattern can be matched against the characters that precede the current position (pattern must be of fixed length).

(?<!pattern) consumes zero characters, but only if pattern cannot be matched against the characters that precede the current position (pattern must be of fixed length).
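A typical use of lookbehind is to pick out values by what precedes them, without including that prefix in the match. The sketch below uses Python's re module as a stand-in engine, which is an assumption of this example, as is the sample text.

```python
import re

text = "cost $45 or 30 units"

# Digits that are immediately preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)
# Whole numbers that are not immediately preceded by a dollar sign.
quantities = re.findall(r"(?<!\$)\b\d+", text)

print(prices)      # ['45']
print(quantities)  # ['30']
```

In both cases the dollar sign is tested but is not part of the matched text.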

Independent subexpressions

(?>pattern) matches pattern independently of the surrounding patterns. The expression never backtracks into pattern.

Conditional expressions

(?(condition)yes-pattern|no-pattern) tries to match yes-pattern if the condition is true, and otherwise tries to match no-pattern.

(?(condition)yes-pattern) tries to match yes-pattern if the condition is true, and otherwise matches the empty string.

condition may be one of the following:

  • A forward lookahead assertion.

  • The index of a marked subexpression (the condition becomes true if the subexpression has been matched).

Here is a summary of the possible predicates:

(?(?=assert)yes-pattern|no-pattern)

Executes yes-pattern if the forward lookahead assertion matches, and otherwise executes no-pattern.

(?(?!assert)yes-pattern|no-pattern)

Executes yes-pattern if the forward lookahead assertion does not match, and otherwise executes no-pattern.

(?(N)yes-pattern|no-pattern)

Executes yes-pattern if subexpression N has been matched, and otherwise executes no-pattern.

(?(<name>)yes-pattern|no-pattern)

Executes yes-pattern if named subexpression name has been matched, and otherwise executes no-pattern.

(?('name')yes-pattern|no-pattern)

Executes yes-pattern if named subexpression name has been matched, and otherwise executes no-pattern.
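The subexpression-index form of the condition can be illustrated with a pattern that accepts a number either bare or in parentheses, but never with only one parenthesis. The sketch below uses Python's re module as a stand-in engine, which is an assumption of this example; Python supports the (?(N)...) form but not the lookahead forms of the condition. The helper name balanced is also an assumption.

```python
import re

# Group 1 optionally matches "(". If it matched, a closing ")" is required;
# if it did not match, the conditional matches the empty string.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: balanced.search(s) is not None for s in ["123", "(123)", "(123", "123)"]}
print(results)  # {'123': True, '(123)': True, '(123': False, '123)': False}
```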

Operator precedence

The order of precedence for the operators, from highest to lowest, is as follows:

  • Escaped characters \

  • Character set (bracket expression) []

  • Grouping ()

  • Single-character-ERE duplication * + ? {m,n}

  • Concatenation

  • Anchoring ^$

  • Alternation |
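Two consequences of this ordering are worth checking by hand: a repeat binds more tightly than concatenation, and alternation binds more loosely than anchoring. The sketch below uses Python's re module as a stand-in engine, which is an assumption of this example.

```python
import re

# Repeat binds tighter than concatenation: ab* means a followed by b*, not (ab)*.
repeat_tail = re.fullmatch(r"ab*", "abbb") is not None       # True
repeat_pair = re.fullmatch(r"ab*", "abab") is not None       # False
grouped_pair = re.fullmatch(r"(?:ab)*", "abab") is not None  # True

# Alternation binds loosest: ^ab|cd$ means (^ab) or (cd$), not ^(ab|cd)$.
loose_start = re.search(r"^ab|cd$", "abxx") is not None      # True
loose_end = re.search(r"^ab|cd$", "xxcd") is not None        # True
strict = re.search(r"^(?:ab|cd)$", "abxx") is not None       # False

print(repeat_tail, repeat_pair, grouped_pair)
print(loose_start, loose_end, strict)
```

When an anchor should apply to every alternative, wrap the alternation in a group, as in the last pattern.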