Important Update: Cohesity Products Documentation
All Cohesity product documentation is now managed via the Cohesity Docs Portal: https://docs.cohesity.com/HomePage/Content/home.htm. Some documentation available here may not reflect the latest information or may no longer be accessible.
Arctera Insight Information Governance Administrator's Guide
- Section I. Getting started
- Introduction to Arctera Insight Information Governance administration
- Configuring Information Governance global settings
- About Information Governance licensing
- SQLite WAL mode
- Configuring SMTP server settings
- About scanning and event monitoring
- Monitoring Indexer Node Storage Utilization
- About filtering certain accounts, IP addresses, and paths
- About archiving data
- About Information Governance integration with Data Loss Prevention (DLP)
- Importing sensitive files information through CSV
- Configuring advanced analytics
- About open shares
- About user risk score
- Configuring file groups
- Configuring Workspace data owner policy
- Configuring Management Console settings
- About bulk assignment of custodians
- Configuring Watchlist settings
- Configuring Metadata Framework
- Proof of concept
- Section II. Configuring Information Governance
- Configuring Information Governance product users
- Configuring Information Governance product servers
- About Information Governance product servers
- Adding a new Information Governance server
- Managing Information Governance product servers
- Viewing Information Governance server details
- About node templates
- Adding Portal role to an Information Governance server
- Adding Classification Server role to an Information Governance server
- Assigning Classification Server to a Collector
- Associating a Classification Server pool to a Collector
- Viewing in-progress scans
- Configuring Information Governance services
- Configuring advanced settings
- Monitoring Information Governance jobs
- Rotating the encryption keys
- Viewing Information Governance server statistics
- About automated alerts for patches and upgrades
- Deploying upgrades and patches remotely
- Using the Upload Manager utility
- About migrating storage devices across Indexers
- Viewing the status of a remote installation
- Configuring saved credentials
- Configuring directory service domains
- About directory domain scans
- Adding a directory service domain to Information Governance
- Managing directory service domains
- Fetching users and groups data from NIS+ scanner
- Configuring attributes for advanced analytics
- Deleting directory service domains
- Scheduling scans
- Configuring business unit mappings
- Importing additional attributes for users and user groups
- Configuring containers
- Server Pools
- Section III. Configuring native file systems in Information Governance
- Configuring clustered NetApp file server monitoring
- About configuring a clustered NetApp file server
- About configuring FPolicy in Cluster-Mode
- Pre-requisites for configuring clustered NetApp file servers
- Credentials required for configuring a clustered NetApp file server
- Preparing a non-administrator local user on the clustered NetApp filer
- Preparing a non-administrator domain user on a NetApp cluster for Information Governance
- Persistent Store
- Preparing Information Governance for FPolicy in NetApp Cluster-Mode
- Preparing the ONTAP cluster for FPolicy
- About configuring secure communication between Information Governance and cluster-mode NetApp devices
- Enabling export of NFS shares on a NetApp Cluster-Mode file server
- Enabling SSL support for Cluster Mode NetApp auditing
- Configuring EMC Celerra or VNX monitoring
- Configuring EMC Isilon monitoring
- About configuring EMC Isilon filers
- Prerequisites for configuration of Isilon or Unity VSA file server monitoring
- Credentials required for configuring an EMC Isilon cluster
- Configuring audit settings on EMC Isilon cluster using OneFS GUI console
- Configuring audit settings on EMC Isilon cluster using the OneFS CLI
- Configuring Isilon audit settings for performance improvement
- Preparing Arctera Insight Information Governance to receive event notifications from an EMC Isilon or Unity VSA cluster
- Creating a non-administrator user for an EMC Isilon cluster
- Utilizing access zone's SmartConnect Zone/Alias mappings
- Purging the audit logs in an Isilon filer
- Configuring EMC Unity VSA file servers
- Configuring Hitachi NAS file server monitoring
- Configuring Windows File Server monitoring
- Configuring Arctera File System (VxFS) file server monitoring
- Configuring monitoring of a generic device
- Managing file servers
- About configuring filers
- Viewing configured filers
- Adding filers
- Add/Edit NetApp cluster file server options
- Add/Edit EMC Celerra filer options
- Add/Edit EMC Isilon file server options
- Add/Edit EMC Unity VSA file server options
- Add/Edit Windows File Server options
- Add/Edit Arctera File System server options
- Add/Edit a generic storage device options
- Add/Edit Hitachi NAS file server options
- Custom schedule options
- Editing filer configuration
- Deleting filers
- Viewing performance statistics for file servers
- About disabled shares
- Adding shares
- Managing shares
- Editing share configuration
- Deleting shares
- About configuring a DFS target
- Adding a configuration attribute for devices
- Configuring a DFS target
- About the DFS utility
- Running the DFS utility
- Importing DFS mapping
- Renaming storage devices
- Section IV. Configuring SharePoint data sources
- Configuring monitoring of SharePoint web applications
- About SharePoint server monitoring
- Credentials required for configuring SharePoint servers
- Configuring a web application policy
- About the Information Governance web service for SharePoint
- Viewing configured SharePoint data sources
- Adding web applications
- Editing web applications
- Deleting web applications
- Adding site collections
- Managing site collections
- Removing a configured web application
- Configuring monitoring of SharePoint Online accounts
- About SharePoint Online account monitoring
- Configuring user with minimum privileges in Microsoft 365
- Creating an application in the Microsoft Azure portal
- Configuring application without user impersonation for Microsoft 365
- Adding SharePoint Online accounts
- Managing a SharePoint Online account
- Adding site collections to SharePoint Online accounts
- Managing site collections
- Section V. Configuring cloud data sources
- Configuring monitoring of Box accounts
- Configuring OneDrive account monitoring
- Configuring Azure Netapp Files Device
- Managing cloud sources
- Section VI. Configuring Object Storage Sources
- Section VII. Health and monitoring
- Section VIII. Alerts and policies
- Configuring policies
- About Information Governance policies
- Managing policies
- Create Data Activity Trigger policy options
- Create User Activity Deviation policy options
- Create Real-time Data Activity Policy options
- Create Real-time Permitted User-based Activity Policy options
- Create Real-time Restricted User-based Activity Policy options
- Create Real-time Sensitive Data Activity policy options
- Managing alerts
- Section IX. Remediation
- Configuring remediation settings
- Section X. Reference
- Appendix A. Information Governance best practices
- Appendix B. Migrating Information Governance components
- Appendix C. Backing up and restoring data
- Appendix D. Arctera Information Governance health checks
- About Information Governance health checks
- Services checks
- Deployment details checks
- Generic checks
- Information Governance Management Server checks
- Information Governance Indexer checks
- Information Governance Collector checks
- Information Governance Windows File Server checks
- Information Governance SharePoint checks
- Classification server health checks
- Information Governance self service portal server health checks
- Appendix E. Command File Reference
- Appendix F. Arctera Information Governance jobs
- Appendix G. Troubleshooting
- About general troubleshooting procedures
- About the Health Audit report
- Location of Information Governance logs
- Downloading Information Governance logs
- Migrating the data directory to a new location
- Troubleshooting FPolicy issues on NetApp devices
- Troubleshooting EMC Celerra or VNX configuration issues
- Troubleshooting EMC Isilon configuration issues
- Troubleshooting SharePoint configuration issues
- Troubleshooting Hitachi NAS configuration issues
- Troubleshooting installation of Tesseract software
- Troubleshooting RHEL 9 upgrade issue
- Troubleshooting CyberArk Password Manager Configuration Issues
Understanding Information Governance best practices
To optimize the productivity and efficiency of Information Governance, follow the guidelines below:
Do not place the Data directory on the system disk. Use a separate disk instead.
Set up event notifications to ensure that errors and warnings are reported. Create a separate email distribution list including storage administrators, product administrators and other stakeholders.
Do not schedule scans during peak hours, as this can degrade the user experience. Schedule scans during off-peak hours to minimize user impact.
Audit Exclusions - Service Account exclusion
Exclude service accounts and application accounts from auditing.
If a third-party application residing on a volume generates a large number of events, exclude that volume from auditing.
Scan Exclusions
Exclude folders and files such as snapshot~ and other temporary files from scanning. This reduces the amount of data consumed and improves overall performance.
Use high performance disks like SSD for indexers.
General guidelines around calculating index memory:
Computation speed for all reports, including Dashboard computation, can be improved by increasing the number of threads. Base the thread count on available resources, such as CPU usage on the Indexer and the Management Server.
To see CPU usage and the overall performance of Information Governance servers, navigate to either of the following:
- Settings >> Health and Monitoring >> Performance
- Settings >> Inventory >> Servers >> Select Node >> Statistics >> Performance
Retention Policy
You can define retention policies to ensure that database and log files are maintained over time. Retention policies affect sizing guidelines and disk space requirements. Retaining product logs is especially important for future troubleshooting, and it has direct implications for disk space requirements.
For any Windows or third-party upgrade
Before the upgrade, ensure that:
- all Information Governance services are gracefully stopped.
- no classification request is running.
- no report or Index-Writer job is running.
After the upgrade, ensure that:
- the event logs show no errors.
- all services are up and running.
If possible, perform the activity during a maintenance window when users are comparatively less active, and check the Events log to verify that nothing is broken.
If you are using anti-virus, ensure that the AV scanner has exclusions for the Information Governance install folder, the Data folder and the OS Temp folder on the Management Server and Indexers.
Create containers to logically group related objects together for administration and reporting purposes.
Use latest available version of Information Governance to ensure that the most recent security and defect fixes are applied.
General
Use recommended system configurations for better throughput.
Use a classification server pool of multiple nodes to achieve higher throughput for large classification tasks.
Disable smart classification if it is not required. In Information Governance 6.3, the option will be disabled by default.
Smart classification requires significant resources on Indexer and Management Server nodes to automatically generate the list of files to classify.
Update the default disk safeguard thresholds to higher values, especially for PDF files, where uncompressed files can consume up to 40 GB of disk space (assuming 16 threads and file sizes around 2.5 GB). The values below safeguard against disk usage reaching its maximum limit:
Reset at 50 GB (or higher)
Stop at 45 GB (or higher)
As a part of classification, Information Governance does text extraction and uses the data directory for storing temporary files.
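The 40 GB figure above is simply the thread count multiplied by the largest expected uncompressed file size. A minimal back-of-the-envelope sketch (the function name and defaults are illustrative, not part of the product):

```python
def temp_space_estimate_gb(threads=16, max_uncompressed_gb=2.5):
    """Worst-case temporary disk usage during text extraction:
    every classification thread may hold one fully uncompressed
    file in the data directory at the same time."""
    return threads * max_uncompressed_gb

print(temp_space_estimate_gb())  # prints 40.0
```

Set the Reset and Stop thresholds comfortably above this estimate, as recommended above.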
Maximum file size supported
Information Governance has a default maximum file size of 50 MB. You can change this limit on the Classification Configuration settings page.
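If you pre-select candidate files with a script, you can skip files over the configured limit before submitting them. A sketch in Python (the helper and the constant name are illustrative; the actual limit is set on the Classification Configuration settings page):

```python
import os

DEFAULT_LIMIT_MB = 50  # Information Governance default maximum file size

def within_size_limit(path, limit_mb=DEFAULT_LIMIT_MB):
    """Return True if the file is small enough to submit for classification."""
    return os.path.getsize(path) <= limit_mb * 1024 * 1024
```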
Text extraction during classification is bounded by the uncompressed size of a file and this uncompressed size dictates whether files can be successfully classified. All Microsoft Office documents since Office 2007 use Office Open XML format (.docx, .pptx etc) which introduced compression.
Most Office docs therefore have a degree of compression ranging from 20%-70% depending on the mix of text and images, with pure text compressing to around 80%.
Files with a lot of images will compress less as images such as JPEG and PNG are already compressed.
PDFs are not compressed by default unless the 'Optimize PDF' option in Adobe Acrobat or similar PDF authoring applications has been used.
It has been observed that 16 concurrent files of 400MB uncompressed docx files can be classified without any memory exhaustion.
This means that 16 concurrent requests for docx files with logical sizes in the 100 MB-250 MB range would probably work fine, given the average compression ratio.
Note that the compression ratio is impossible to predict unless you analyse each file or have some indication of the type of content within the corpus.
These figures do not relate to volume/disk level compression, but the compression that Microsoft Office applies to the content. A .docx file is simply a ZIP container that can be opened in a tool such as 7-Zip to assess the uncompressed size.
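Because a .docx file is a ZIP container, you can estimate the uncompressed size and compression ratio of a sample of the corpus before planning classification jobs. A small Python sketch (helper names are illustrative):

```python
import os
import zipfile

def docx_sizes(path):
    """Return (size_on_disk, total_uncompressed_size) for an Office
    Open XML file (.docx, .pptx, .xlsx). The uncompressed size is the
    sum of the ZIP members' uncompressed sizes."""
    with zipfile.ZipFile(path) as zf:
        uncompressed = sum(info.file_size for info in zf.infolist())
    return os.path.getsize(path), uncompressed

def compression_ratio(size_on_disk, uncompressed):
    """Fraction of the logical content removed by compression:
    around 0.8 for pure text, lower for image-heavy documents."""
    return 1 - size_on_disk / uncompressed if uncompressed else 0.0
```

Running docx_sizes over a representative sample gives the "indication of the type of content within the corpus" mentioned above.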
The table below shows the file types and sizes tested with the recommended Classification Server specification:
Recommended maximum file sizes for classification without OCR enabled
| Document type | Extensions | Maximum compressed file size tested | Maximum uncompressed file size tested |
| --- | --- | --- | --- |
| Microsoft Word | doc, docx, docm, dotm, dotx | 200 MB | 450 MB |
| Microsoft PowerPoint | ppt, pptx, pps, potm, potx, ppsm, ppsx | 200 MB | 450 MB |
| Office Tabular | xls, xlsx, xlt, xltx, xlsb, xlam | 50 MB | 100 MB |
| Adobe PDF | pdf | 1 GB | Compressed PDFs are not yet tested. However, the maximum uncompressed size would mirror the compressed size of 1 GB. |
Server specification used (the recommended Information Governance Classification Server specification):
- 16 cores, 32 GB RAM
- 16 classification threads running in parallel
Using Optical Character Recognition (OCR)
OCR usually results in higher memory consumption, which in turn degrades classification performance.
Larger File support
It is possible that files larger than those tested could be successfully classified, but this depends on the sizes of the other files being classified at the same time. For example, if a 300 MB DOCX file is 1 GB uncompressed, it could still be classified successfully if all other 15 files running in parallel are relatively small, since the total memory used by the classification process would stay within limits.
As there is no way to ensure that a mix of small and large files is classified at the same time, we recommend that any DQL reports used to select files for classification are not ordered or segregated by file size. This ensures that files are submitted to VIC as 'randomly' as possible.
For example, do not classify all 'small DOCX' files first and leave the largest until later. Classifying the very largest files together in one classification job increases the risk that the total uncompressed size of 16 large files leads to VIC memory exhaustion. Submitting a mix of file sizes together provides the best chance that files with large uncompressed sizes are classified successfully.
If you use DQL to generate a report of files to classify, do not order the report output by size, as that would lead VIC to process the largest files together, whether they are sorted to the start or the end of the report.
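One simple way to avoid size-ordered batches is to shuffle the candidate list before creating classification jobs. A sketch, assuming you have already extracted the file list from a DQL report (names are hypothetical, not a product API):

```python
import random

def randomize_batch_order(files, seed=None):
    """Return a shuffled copy of the candidate list so that consecutive
    classification batches mix small and large files instead of grouping
    the largest files together."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```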
Recommendations for creating classification Jobs
Use DQL reports to filter files according to the recommendations above, and then trigger classification requests accordingly.
Enable only required policies in VIC configuration.
As the number of enabled policies and policy complexity increases (such as using complex regular expressions or hundreds of keywords), the throughput tends to decrease.
Disable OCR if not required.
Configure the content fetch pause window to reduce the potential impact on the source devices.
The content fetch job copies files from the source devices to classify them.
By default, the job is paused from 7am to 7pm which matches normal working hours.
We recommend assessing the load on the source devices during the content fetch; many customers have found that the load does not disrupt normal activities. If the job can run 24 hours a day, the classification process has a constant feed of files to classify, and throughput can increase.
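The default pause window reduces to a simple hour check; a sketch if you script around the schedule (illustrative only, not a product API):

```python
from datetime import datetime

def in_pause_window(now=None, start_hour=7, end_hour=19):
    """True while the content fetch job is paused; the default window
    of 07:00-19:00 matches normal working hours."""
    hour = (now or datetime.now()).hour
    return start_hour <= hour < end_hour
```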