
Arctera™ Insight Classification Help

Product(s): Arctera Insight Classification (Version Not Specified)

Creating or editing patterns

You cannot edit the built-in patterns, but you can edit any custom patterns that you have created.

To create or edit a pattern

  1. At the left of the Arctera Insight Classification, click Patterns.
  2. Do one of the following:

    • To create a pattern, click New.

    • To edit an existing pattern, select it and then click Edit.

    The following diagram shows the New Pattern dialog with the pattern type as Regular expression.

    Using Special Standalone Characters in Keyword-Based Patterns

    When creating keyword-based conditions in Patterns, it is important to understand how the application handles special standalone characters like @, #, or $.

    • Special standalone characters are not allowed by default in keyword conditions.

    • If such characters are entered, the system will display a warning message.

    • The pattern can still be saved, but any standalone special characters will be automatically removed in the back-end, unless the String Match option is explicitly selected.

    To Retain Special Characters

    If you want to retain standalone special characters in your keyword conditions, you must select the String Match option. This setting ensures that the characters are preserved exactly as entered and bypasses the automatic cleanup process.

    Scenarios and Behavior

    Scenario: Pattern with keyword condition containing special characters
    Behavior: Warning message is shown. Special characters will be removed unless String Match is selected.

    Scenario: Pattern with keyword condition and String Match selected
    Behavior: Warning message is not shown. Special characters are retained.

  3. Set the fields as follows:

    Name

    Specifies the pattern name. The name must be unique, and it can contain up to 100 alphanumeric, space, and special characters.

    Description

    (Optional.) Provides a short description of the pattern for display in the Arctera Insight Classification.

    Type

    Specifies the pattern type.

    For a Text or Regular expression pattern, you must specify the value for which to look. The same guidelines that you must observe when you enter these values in a policy condition apply when you enter them as a pattern value.

    See About policy conditions.

    Choose Similar document to find items that resemble a supplied template. For example, you can find completed forms by submitting the blank form as a template. Unlike Text and Regular expression patterns, you can set the required confidence levels for Similar document patterns when you incorporate them in a policy condition.

    The document similarity feature can find instances where users have created variants of the template document by adding, removing, or reordering paragraphs, sentences, or words. It can also find instances where users have changed individual words. However, the more extensive these word changes are, the less likely the Arctera Insight Classification is to find a match.

    You must choose the required similarity mode: Full or Section. In Full mode, the Arctera Insight Classification compares the template document in its entirety with other documents in their entirety. This mode is useful when looking for instances where users have altered the template document in places without greatly affecting its overall size. In Section mode, the Arctera Insight Classification looks for instances where the content of the template document appears as one section within a larger document.

    To submit the template document, click Browse and then select the required document.

    Choose Exact Data Match to find matches of one or more specific values in an item. Exact Data Match (EDM) gives precise control over the data classification process by letting you set more granular data match conditions, and it produces fewer false positives.

    With EDM you can create patterns using database records.

    See “To create an Exact Data Match based pattern”.

  4. Test the pattern by clicking Browse and then choosing a document that ought to match it.

    Select the Perform sentiment analysis checkbox to run sentiment analysis on the selected item and determine whether the sentiment associated with the item is positive, negative, or neutral.

    Note:

    If sentiment analysis fails, the classification process will continue without evaluating the sentiment analysis conditions within the policies. As a result, hits and matches based on the sentiment condition will be affected.

    Select the Include text in images checkbox to extract text from images by using optical character recognition (OCR) and include it in classification.

    Note:

    The Include text in images checkbox appears only when the Tesseract software is installed on the system where Arctera Insight Classification is running.

    After a few moments, the Arctera Insight Classification indicates whether it has found a match. When this is the case, you can click Show details to see the matching text and confidence levels.

  5. Click Save.
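
The keyword cleanup described in step 2 can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name clean_keyword is hypothetical, and the sketch assumes that a "standalone" special character is a whitespace-separated token that contains no letter or digit.

```python
from typing import Tuple


def clean_keyword(keyword: str, string_match: bool = False) -> Tuple[str, bool]:
    """Return (stored_keyword, warning_shown) for a keyword condition."""
    if string_match:
        # String Match keeps the keyword exactly as entered, with no warning.
        return keyword, False
    tokens = keyword.split()
    # Assumption: a token is "standalone special" if it has no letter or digit.
    kept = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
    warning_shown = len(kept) != len(tokens)
    # Rejoining with single spaces also normalizes whitespace in this sketch.
    return " ".join(kept), warning_shown


print(clean_keyword("invoice # 42"))                     # ('invoice 42', True)
print(clean_keyword("invoice # 42", string_match=True))  # ('invoice # 42', False)
```

Under this assumption, a character such as @ inside a longer token (for example, user@example.com) is not standalone and is left untouched.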

To create an Exact Data Match based pattern

  1. Follow the initial steps for creating a pattern as described earlier.
  2. In the Type box, click to select Exact Data Match.
  3. Specify the following configuration options:

    First row contains column headers

    Select Yes if the first row in the source document contains the names of each field. If selected, the content of the first row of the source document is not considered for rule generation.

    Select No if the first row in the source document does not contain the names of each field.

    Column delimiter

    This is an optional field. It specifies the delimiter character that separates each column/field in the data file.

    Note:

    • The delimiter must be a single special character, for example, a comma (,), a pipe (|), or a space.

    • If the source document contains only a single column/field, you can set any delimiter character that is not present in the file.

    Perform hashing to secure data fields

    Select Yes if the generated rule that is used to create the EDM pattern needs to be hashed to protect the data. The data fields are hashed with the SHA-256 algorithm when they are stored in the generated classification rule.

    Note:

    Classification performance drops if hashing is used when creating an Exact Data Match pattern.

    Use case-sensitive matching

    Select Yes if the match needs to be case sensitive.

    Proximity for matches

    Specifies the maximum distance, in number of characters, between two columns or fields for a match to be considered valid. Valid values are greater than 0.

    Note:

    • If the source document contains only a single column/field, set the proximity value to 1.

    • The generateRulePack API that generates the classification rule uses the "From the first condition" proximity option. The "Sliding Window" proximity option is not supported for Exact Data Match.

    Example:

    With proximity = 20, if the CSV source document content is as follows,

    Goodbye, Hello

    and test document content is,

    … You say Goodbye and I say Hello …

    Here, the distance between the two words "Goodbye" and "Hello" is 19 characters, which is within the specified proximity value of 20 characters. Therefore, Arctera Insight Classification shows a match.

    Minimum columns to match

    Specifies the minimum number of columns that must match to trigger a result. Note that the first column must always match, regardless of the Minimum columns value specified for the EDM pattern.

    Note:

    The Minimum columns field is ignored if the All columns checkbox is selected.

    All columns

    Select this checkbox if all columns/fields in the source document need to match to trigger a result.

  4. Under the Source Document section, browse to select the EDM source file based on which you want to create the classification rule.

    Note:

    • The EDM source document must be of type CSV or TXT (plain text only).

    • The maximum document size is configurable. The recommended size is 5 MB.

    • CSV documents with quoted fields are not supported.

  5. Click Save.

    The created EDM pattern shows the user-configured exact data matching options. The source document name is retained for the pattern, but its location or a direct link to it is not provided. See the following image.

    You can use the EDM pattern created to:

    • Enhance an existing policy

    • Create a new policy

For more information, see About policy conditions.
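
The EDM options described in this procedure can be illustrated with a small sketch. It is not the generateRulePack implementation; the names build_rows and edm_match are hypothetical. The sketch assumes that each field is a single word, that a match may lie on either side of the first column, and that distance is counted inclusively from the first character of the first-column match to the first character of the other match, which is one way of counting that reproduces the 19 characters in the "Goodbye, Hello" example.

```python
import hashlib
import re
from typing import List, Tuple


def _key(value: str, case_sensitive: bool, hashed: bool) -> str:
    """Normalize one field or token; optionally reduce it to a SHA-256 digest."""
    text = value.strip()
    if not case_sensitive:
        text = text.casefold()
    if hashed:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text


def build_rows(source: str, delimiter: str = ",", has_header: bool = True,
               case_sensitive: bool = False, hashed: bool = False) -> List[Tuple[str, ...]]:
    """Split the source text into rows of comparable keys (no quoted fields)."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]  # the header row is not used for rule generation
    return [tuple(_key(f, case_sensitive, hashed) for f in ln.split(delimiter))
            for ln in lines]


def edm_match(rows: List[Tuple[str, ...]], document: str, proximity: int,
              min_columns: int = 1, all_columns: bool = False,
              case_sensitive: bool = False, hashed: bool = False) -> bool:
    """Return True if any row matches the document under the given options."""
    # Assumption: single-word values, so the document is indexed word by word.
    tokens = [(_key(m.group(), case_sensitive, hashed), m.start())
              for m in re.finditer(r"\w+", document)]
    for row in rows:
        required = len(row) if all_columns else max(1, min_columns)
        for key, start in tokens:
            if key != row[0]:
                continue  # the first column must always match
            matched = 1
            for col in row[1:]:
                # Distance is measured from the first-column match (assumed inclusive).
                if any(k == col and abs(s - start) + 1 <= proximity for k, s in tokens):
                    matched += 1
            if matched >= required:
                return True
    return False


rows = build_rows("Goodbye, Hello", has_header=False)
doc = "You say Goodbye and I say Hello"
print(edm_match(rows, doc, proximity=20, min_columns=2))  # True
print(edm_match(rows, doc, proximity=10, min_columns=2))  # False
```

When hashing is enabled in this sketch, the same hashed and case_sensitive settings must be passed to both functions, because document tokens are hashed before comparison. Hashing every token is also extra work per document, which is consistent with the performance note above.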

Known issue while editing EDM patterns

While editing EDM patterns, updating the pattern name or description may fail due to an internal system error. If you experience this issue, contact your system administrator or Arctera support.

Variable

Variable Support allows you to insert dynamic values into text conditions across policies and patterns using simple placeholders such as {{variable_name}}. Instead of repeatedly typing the same keywords, you can define them once and reuse them anywhere in your text conditions. A variable is a type of pattern that must be created first and then used in the text conditions of policies or patterns. You can insert predefined variables directly in text-based conditions by typing {{ in the text box.

For example, to classify content that contains any reference to Royal Bank of Canada, you can create a variable named BANK_NAME and add values such as RBC, Bank of Canada, and Royal Bank of Canada. You can then use this variable in a text condition by inserting the placeholder {{BANK_NAME}}. During classification, the system automatically evaluates the content against all defined values of the variable. As a result, any content that contains RBC, Bank of Canada, or Royal Bank of Canada is classified, without the need to create separate conditions for each variation.

To create a basic variable

  1. Follow the initial steps for creating a pattern as described earlier.
  2. In the Type box, click to select Variable.
  3. Define the values as required. For example, to define a new pattern that identifies Royal Bank of Canada as a bank name, enter BANK_NAME in the Name field and add a relevant description. In the Value field, add values such as RBC, Bank of Canada, and Royal Bank of Canada, or any other values that you want to match.
  4. Click Save to add this variable.

After saving the pattern, the system checks the content during classification and looks for any occurrence of the values you added to the pattern. If the content contains one or more of these values, it is identified as a match and classified accordingly. You can reuse this approach to create additional patterns, such as Department Names, Designations, Agency Names, and more.
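
The matching behavior described above can be sketched in a few lines. This is an illustration only, not the product's evaluation logic: the function condition_to_regex and the VARIABLES dictionary are hypothetical, and the sketch assumes case-insensitive substring matching.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical variable store: variable name -> list of values.
VARIABLES: Dict[str, List[str]] = {
    "BANK_NAME": ["RBC", "Bank of Canada", "Royal Bank of Canada"],
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def condition_to_regex(condition: str, variables: Dict[str, List[str]]) -> Pattern[str]:
    """Expand each {{NAME}} placeholder into an alternation of its values."""
    parts: List[str] = []
    pos = 0
    for m in PLACEHOLDER.finditer(condition):
        parts.append(re.escape(condition[pos:m.start()]))  # literal text
        # Longest values first, so the fullest form is reported as the match.
        # An undefined variable name raises KeyError here.
        values = sorted(variables[m.group(1)], key=len, reverse=True)
        parts.append("(?:" + "|".join(re.escape(v) for v in values) + ")")
        pos = m.end()
    parts.append(re.escape(condition[pos:]))
    # Assumption: matching ignores case.
    return re.compile("".join(parts), re.IGNORECASE)


rx = condition_to_regex("{{BANK_NAME}}", VARIABLES)
print(bool(rx.search("Transfer received from RBC yesterday")))  # True
print(bool(rx.search("Bank of Montreal statement")))            # False
```

A single condition containing {{BANK_NAME}} therefore covers every listed value, which is the reuse that the variable is meant to provide.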

To use the created variable in a pattern

  1. Follow the initial steps for creating a pattern as described earlier.
  2. In the Value box, type {{. A list of all available variables appears on the page. All the values that you entered when you created the variable are matched when content is classified.
  3. Click Save to add this pattern.
  4. In View mode, variables in a text condition appear as links. You can click these links to navigate to the corresponding pattern.

You can use multiple variables in a condition; however, using a large number of variables may impact classification performance. It is recommended to use only the minimum number of variables required. This flexibility still allows you to define and reuse dynamic values across different text conditions.
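
To keep the number of variables in a condition small, it can help to review a condition before you enter it. The following sketch is a hypothetical helper, not part of the product. It lists the placeholders in a condition, flags any that have not been created yet, and warns when the count exceeds a threshold. The threshold of 3 is an arbitrary assumption, not a documented limit.

```python
import re
from typing import Dict, List, Set, Union

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def audit_condition(condition: str, defined: Set[str],
                    max_variables: int = 3) -> Dict[str, Union[List[str], bool]]:
    """Summarize the variables that a text condition refers to."""
    unique = sorted(set(PLACEHOLDER.findall(condition)))
    return {
        "variables": unique,
        # Variables must be created as patterns before they can be used.
        "undefined": [name for name in unique if name not in defined],
        # max_variables is an illustrative threshold only.
        "too_many": len(unique) > max_variables,
    }


report = audit_condition("{{BANK_NAME}} statement for {{DEPARTMENT}}", {"BANK_NAME"})
print(report["variables"])  # ['BANK_NAME', 'DEPARTMENT']
print(report["undefined"])  # ['DEPARTMENT']
```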