How Data Insight works, and the ports that it uses for communication with devices

Article: 100039166
Last Published: 2023-10-11
Product(s): Data Insight

Description


Introduction

This document helps customers locate and review the ports that must be opened bidirectionally in their environment for Data Insight to function properly.  It provides a review of the network configuration of Data Insight for most environments, and assumes that the reader has a basic understanding of the main use cases of Data Insight and of the hardware platforms that Data Insight supports.
 

Data Insight Basics               

Data Insight is a tool for scanning and analyzing data in unstructured data environments.  Scanning and auditing should be viewed as two separate processes with two separate purposes.  Scanning is a single traversal of the file system that reads metadata from folder and file objects, either as a directed scan of only the objects that have been modified or as a complete traversal of every object.  Auditing, by contrast, establishes an audit trail by connecting to the device's application programming interface (API) and continually monitoring all client I/O to the device; the captured I/O is compared against the monitored shares after receipt.

Unstructured data usually refers to loose user data, which is typically managed by individual users.  In most environments, infrastructure administrators and IT decision makers have little understanding of the data on these systems or of how it is used.  Data Insight provides answers to many of the typical questions needed to effectively manage storage:
  • Data Ownership
  • Data Access Patterns / Forensics
  • Data Protection / Permissions
  • Data Access Governance
 
Data Insight focuses primarily on the data ownership, entitlement reporting, and data access patterns required by administrators for effective unstructured data charge-back, storage tier optimization, and ILM reporting.  It also enables security use cases involving data forensics, entitlement management, and risk analysis.

 

Architecture Basics

Data Insight is a self-contained product that installs on the Windows platform (Indexer nodes can also run on Linux). The product is tested and validated on virtual machines, which allow a great deal of flexibility when requesting resources from a customer for the install.  The only consideration for a virtual machine is that the host virtualization server should meet the same minimum requirements as a physical server.  It is also advisable to locate this server as close to the target file servers as possible.
              
Data Insight is made up of three main component tiers. These components allow for flexible architecture decisions when planning a worldwide production roll-out.  A single-tier installation houses the Management Server (MS), Collector, and Indexer on a single machine.  A two-tier installation separates the MS from a combined single-node Collector and Indexer.  A three-tier installation separates the Collector and Indexer roles onto their own worker nodes.  For the Self-Service Portal or classification, a separate server is needed in any tier.
 
  • Management Console: This is the web engine that provides the GUI.  This server also acts as a query engine for other products that integrate with Data Insight, such as DLP.  This server also performs the Active Directory scanning.  The main user database used to resolve SIDs is also located on this server.
  • Indexer: This is the main Data Insight data store.  It’s a highly parallel, independent data store that can be scaled out horizontally as needed to grow to meet the size of the largest environments.  The static metadata is maintained in small SQLite databases on a share level.  The audit data is maintained in a proprietary data store developed internally by Veritas and implemented as virtual tables in each share’s SQLite database.  This is all included in the main Data Insight install and requires no software pre-loads, or customer supplied licenses.
  • Collector: The collector tier is the set of components that communicate directly with the hardware that is being monitored.  The collector is a set of API explorers and scanning engines that collect the user access data and file metadata. 
              
For the purposes of device access, the most important consideration is likely the placement of the collector.  The collector generates the most network traffic, as it is responsible for collecting audit data as well as metadata scans.  It is highly recommended to install the collector as close to the filer/device being monitored as possible, to reduce the likelihood of latency issues created by crossing multiple IP network segments.  When connecting to the devices, it is most important that this node has open access across firewalls to the required ports on the devices.
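
Before or during deployment, a quick connectivity check from the intended collector node to each device can confirm that the required ports are reachable through any firewalls and give a rough sense of connection latency.  The following Python sketch illustrates such a check; the host names and port list are placeholders only and should be replaced with the actual devices and the port requirements from the Software Compatibility List for your environment.

# Minimal sketch: verify TCP reachability (and rough connect latency) from a
# collector node to the devices it will monitor. Hosts and ports below are
# placeholders; substitute the devices and port numbers required for your
# environment (see the Software Compatibility List).
import socket
import time

TARGETS = [
    ("filer1.example.com", 445, "SMB share scanning"),      # hypothetical filer
    ("isilon1.example.com", 8080, "OneFS platform API"),    # hypothetical Isilon node
    ("ceehost1.example.com", 12228, "CEE/CEPA events"),     # hypothetical CEE host
]

def check_port(host, port, timeout=5.0):
    """Return (reachable, elapsed_seconds) for a single TCP connect attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start

if __name__ == "__main__":
    for host, port, purpose in TARGETS:
        ok, elapsed = check_port(host, port)
        status = "open" if ok else "unreachable"
        print(f"{host}:{port}  {status}  {elapsed * 1000:.1f} ms  ({purpose})")
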
Self-Service Portal – The self-service portal server provides an interface for data custodians to perform a workflow action or remediation. There are 4 types of workflow tasks supported today:
  1. Ownership confirmation for custodians assigned within Data Insight
  2. Entitlement reviews / Access certifications
  3. DLP Incident remediation
  4. Records Classification
 

Pre-Planning

 
The installation and configuration work detailed in the manuals for the devices compatible with Data Insight must be completed to ensure that the customer environment is ready for the install. This gives the customer insight into the types of interfaces Data Insight uses to collect data, and identifies the logins or software required for API access prior to install.  Review the administration and installation manuals (6.1.6 versions linked) to ensure that valid logins with the correct privileges for each device will be available at install time, and that any change control actions needed to open ports have been taken.

It’s a good idea to review the Release Notes prior to the install date.

 

 

Device Requirements

Review the Software Compatibility list

The Software Compatibility List (SCL) contains the network port requirements.

There are mechanisms for altering the ports, such as application settings or configuration files, if you need to move off the defaults.

All requirements and directions are listed in the Installation and Administration guides; below is a summary by device.
 

NetApp Filer Targets

  • ONTAP 7.3.5 or higher (ONTAP 8.0.x OR 8.1.x must be configured in ONTAP 8 “7 mode”. ONTAP 8.2.x can be configured in cluster-mode or 7-mode.)
  • Domain User account that can scan shares (member of Backup Operators group locally on filer recommended)
  • Domain User account that can discover shares, configure FPolicy, and connect to FPolicy (either a member of the Administrators local group, or in a group with a role that has login-* and api-* rights is recommended)
  • Click on the DataInsightFpolicyCMod service to configure it.
    • Provide the logon credential for FPolicy. By default, the FPolicy name is matpol and the FPolicy port is 8787.
    • Specify the IP address of the collector that can be accessed from the filer.
    • In cluster-mode, no credential needs to be specified. Make sure the correct IP is selected in the FPolicy Collector IP field; in some cases where multiple NICs are present, private IPs may populate the field, causing TCP communication to fail. Then, press Configure.
  • Once successfully configured, the FPolicy cluster-mode Collector service will be up and running on the server.
  • Ports configuration and review: see the documentation topic "Configuring your storage system in a cluster environment".
 
Note: If your version is not listed, check the corresponding documentation for your version.

EMC Celerra / VNX

  • Common Event Enabler framework version 8.2 or higher
    • Referred to as CEE
  • Domain User account with sufficient privileges to scan shares
  • User account that can connect via VNX / Celerra Event Publishing Agent (CEPA) for user access collection.



 

EMC Isilon

 
  • Version 7.1
  • Common Event Enabler framework version 8.2 or higher
    • Requires .NET 3.5
    • The default port 12228 is required to reach the CEE host from the cluster
    • Port 135 is used for the initial call to the CEE/CEPA server; the response arrives within the dynamic port range between 49152 and 65535
  • Domain User account with sufficient privileges to scan shares
  • User account for share discovery
  • New APIs for discovery (see the isilonutil share discovery APIs listed later in this article)
  • Access to the web interface for connection of the device
    • IPv4 - https://<yourNodeIPaddress>:8080
    • IPv6 - https://[<yourNodeIPaddress>]:8080
 

Windows File Servers


With agent install (the agent includes a filter driver that enables auditing and monitoring)
  • Ports list
    • SMB: 445 (TCP) // File Sharing
    • RDP: 3389 (TCP) // Remote Desktop for Administration
    • If NetBT is enabled, NetBT uses ports:
      • 137 (UDP)
      • 138 (UDP)
      • 139 (TCP)
 

Without agent (No Auditing available)

  • Windows 2003 or 2003 R2, 32 or 64 bit
  • Windows 2008 or 2008 R2, 64bit
  • Windows 2012 or 2012 R2, 64bit
  • Other targets such as Windows 2000 have not been officially tested but have been observed to work in the field, although there is no known extended support from Microsoft.
 

Veritas File System (VxFS) Server

  • 6.0.1 or higher
  • Configured standalone or using Veritas Cluster Server (Clustered File System (CFS) is not supported)
 

Microsoft SharePoint Servers

  • Microsoft SharePoint 2019
  • Microsoft SharePoint 2016 (under latest version with some caveats and patches) (ports list)
  • Microsoft SharePoint 2013 (ports list)
  • Microsoft SharePoint 2010
  • A Site Collection administrator account in the SharePoint server domain with full control permissions on the added web applications


Incremental Scan-Related Jobs

CollectorJob runs once every 2 hours (configurable) and performs the following steps (a sketch of this flow follows the list):

  1. The CollectorJob moves all raw audit log files (e.g. fpolicy*.sqlite, winnas*.sqlite) from the collector folder to a folder called collector/staging.
  2. For each filer for which this node is a collector, it determines the scan extract ID for that filer.
  3. It then executes collector.exe -e ext_ids -d <staging> -s <staging> -p <collector/changelog>
  4. Once collector.exe returns, it has created a per-share adt file in the staging folder and has populated logs into a per-filer changelog file (<devid>_changelog.pathdb) in the collector/changelog folder.
  5. All adt files are then moved to the outbox.
  6. For filers where incremental scanning is disabled, it purges all records from the changelog of the device to avoid buildup. Incremental scanning is disabled when IScannerJob is disabled for the effective collector OR scanning is disabled for the filer.
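
The following is a minimal sketch of the CollectorJob flow described above, for illustration only; it is not the actual job implementation. The data directory, file layout, and extract IDs are placeholders, and only the collector.exe command line is taken from the documented steps.

# Minimal sketch of the documented CollectorJob flow (not the actual job):
# move raw audit logs to collector/staging, then invoke collector.exe with
# the documented options. DATADIR and the extract IDs are placeholders.
import glob
import os
import shutil
import subprocess

DATADIR = r"C:\DataInsight\data"            # hypothetical data directory
COLLECTOR = os.path.join(DATADIR, "collector")
STAGING = os.path.join(COLLECTOR, "staging")
CHANGELOG = os.path.join(COLLECTOR, "changelog")

# 1. Move raw audit log files (fpolicy*.sqlite, winnas*.sqlite) into staging.
os.makedirs(STAGING, exist_ok=True)
for pattern in ("fpolicy*.sqlite", "winnas*.sqlite"):
    for path in glob.glob(os.path.join(COLLECTOR, pattern)):
        shutil.move(path, STAGING)

# 2. Hypothetical scan extract IDs for the filers this node collects for.
ext_ids = "101,102"

# 3. Run collector.exe with the documented arguments; it writes per-share adt
#    files into staging and per-filer changelog files into collector/changelog.
subprocess.run(
    ["collector.exe", "-e", ext_ids, "-d", STAGING, "-s", STAGING, "-p", CHANGELOG],
    check=True,
)
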
IScannerJob runs once every day at 7 p.m. (configurable):
 

On Collector

  1. This job will first create a list of shares that need to be scanned. This includes all shares of filers for which this node is the collector, except shares of filers that are monitored using a remote winnas agent.
  2. For all such shares, it will issue mergedb -x -m msu_ids -s <collector/changelog> -d collector/changelog/staging
  3. Once mergedb returns, it has created per-msu pathdb files for all shares that had activity since the last incremental scan.
  4. All per-msu pathdb files are moved to the inbox.
  5. Then, the inbox is scanned to figure out which msus have pathdbs in the inbox.
  6. If multiple pathdb files exist for the same msu, they are first merged and the originals removed.
  7. Then, based on which msus have pathdbs available, those shares are added to the incremental scan queue.
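
As a rough illustration of steps 5 through 7, the sketch below scans a hypothetical inbox folder, groups pathdb files by msu ID (assuming the <msuid>_<timestamp>.pathdb naming shown in the incremental scan job table later in this article), flags msus whose multiple pathdb files would need merging, and builds the list of shares to queue for an incremental scan. It is illustrative only and not the actual IScannerJob code.

# Rough illustration of steps 5-7 above (not the actual IScannerJob code):
# group pathdb files in the inbox by msu id and decide which shares to queue
# for an incremental scan. Assumes files are named <msuid>_<timestamp>.pathdb.
import os
from collections import defaultdict

INBOX = r"C:\DataInsight\data\inbox"        # hypothetical inbox location

def build_incremental_scan_queue(inbox_dir):
    pathdbs_by_msu = defaultdict(list)
    for name in os.listdir(inbox_dir):
        if not name.endswith(".pathdb"):
            continue
        msu_id = name.split("_", 1)[0]      # msu id precedes the timestamp
        pathdbs_by_msu[msu_id].append(name)

    scan_queue = []
    for msu_id, files in sorted(pathdbs_by_msu.items()):
        if len(files) > 1:
            # In the real flow these would first be merged into one pathdb
            # and the originals removed.
            print(f"msu {msu_id}: {len(files)} pathdb files need merging")
        scan_queue.append(msu_id)
    return scan_queue

if __name__ == "__main__":
    print("Shares queued for incremental scan:", build_incremental_scan_queue(INBOX))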
 

On Winnas agent (clustered and non-clustered)

  1. This job will first create a list of shares that need to be scanned. This includes all shares of the winnas filer for which this node is the agent such that the share is available on the local node. It checks this by accessing the physical path of the share. That way, if the share is currently imported on a different cluster node, it does not result in a failed scan.
  2. For all such shares, it will make a request to the Collector node to send it pathdb files.
  3. When the collector receives this request, it will issue the following command: mergedb -x -m msu_ids -s <collector/changelog> -d <collector/changelog/staging>.
  4. When mergedb returns, it would have created a per msu pathdb file in the staging folder.
  5. Collector will zip all these files and send them to the agent.
  6. The agent will unzip these files into inbox folder.
  7. Then, inbox is scanned to figure out which msus have pathdbs in the inbox.
  8. If multiple pathdb files exist for the same msu, they are first merged and the originals removed.

 

Sequence of jobs for scan/audit in DI


Sequence of Jobs for Scan for Shares

  • ScannerJob (Collector): Invokes scanner.exe for the shares and site collections monitored by the collector node. scanner.exe creates scan_cifs*.sqlite or scan_nfs*.sqlite files in <datadir>\outbox. Default schedule: 7 p.m. on the last Friday of every month.
  • FileTransferJob (Collector): Transfers scan*.sqlite files from <datadir>\outbox to <datadir>\inbox on the Indexer node. Default schedule: Every minute.
  • IndexWriterJob (Indexer): Invokes idxwriter.exe, which consumes scan data and updates the index db. Default schedule: Every 4 hours.
 
Sequence of Jobs for Audit for CIFS and NFS Shares and Site Collections

  • CollectorJob (Collector): Invokes collector.exe to process raw audit files present in the <datadir>\collector folder and generates audit*.sqlite files in the <datadir>\outbox folder. Default schedule: Every 2 hours.
  • FileTransferJob (Collector): Transfers audit*.sqlite files from <datadir>\outbox to <datadir>\inbox on the Indexer node. Default schedule: Every minute.
  • IndexWriterJob (Indexer): Invokes idxwriter.exe, which consumes audit data and updates the index db. Default schedule: Every 4 hours.
 
Sequence of Jobs for Audit for Site Collections

  • SPAuditJob (Collector): Fetches audit data for site collections configured in Data Insight from the SQL Server database and stores it in audit_sharepoint*.sqlite in <datadir>\collector. (It used to move the audit_sharepoint*.sqlite files directly to the outbox until DI 5.1, but after 5.1 you must run CollectorJob to move the files to the <datadir>\outbox folder. This change was made to support real-time alerts.) Default schedule: Every 2 hours.
  • CollectorJob (Collector): Invokes collector.exe to process raw audit files present in the <datadir>\collector folder and generates audit*.sqlite files in the <datadir>\outbox folder. Default schedule: Every 2 hours.
  • FileTransferJob (Collector): Transfers audit*.sqlite files from <datadir>\outbox to <datadir>\inbox on the Indexer node. Default schedule: Every minute.
  • IndexWriterJob (Indexer): Invokes idxwriter.exe, which consumes audit data and updates the index db. Default schedule: Every 4 hours.
 
Sequence of Jobs for Incremental Scans

Incremental scanning takes place for a share when there are CREATE/WRITE/SECURITY events on the share.

  • CollectorJob (Collector): Moves all raw audit log files (e.g. fpolicy*.sqlite, winnas*.sqlite) from the collector folder to a folder called collector/staging. In the case of multiple collector threads, the folder is called collector/staging/<threadid>, and each thread works on separate filers. Then, for each filer for which it is a collector, it determines the scan extract ID for that filer. It then executes "collector.exe -e ext_ids -d <staging> -s <staging> -p <collector/changelog>", which (a) generates per-share audit*.sqlite files in the <datadir>\outbox folder and (b) populates changelogs into an intermediate per-filer changelog file (<device>_<timestamp>_changelog.pathdb) in the <datadir>\collector\changelog\inbox folder. For filers where incremental scanning is disabled, it purges all records from the changelog of the device to avoid buildup. Incremental scanning is disabled when IScannerJob is disabled for the effective collector OR scanning is disabled for the filer. Default schedule: Every 2 hours.
  • ChangeLogJob (Collector): Merges changelog files from <datadir>\collector\changelog\inbox into a main device-specific changelog file (collector\changelog\<device_id>_changelog.db), which is used by IScannerJob. Default schedule: Every hour.
  • IScannerJob (Collector / Windows File Server agent): Runs "mergedb.exe -x -m msu_ids -s <collector/changelog> -d collector/changelog/staging", which creates per-msu pathdb files for all shares that had activity since the last incremental scan. All per-msu pathdb files are moved to the inbox, and based on which msus have pathdbs available, those shares are added to the incremental scan queue. scanner.exe is then executed as follows:
    scanner.exe --extractid <extractid> --msuid <msuid> --pathdb C:\DataInsight\data\inbox\<msuid>_<timestamp>.pathdb --aclsfor only_dir --ownrfor dir_and_file --max_err 5000
    This generates ISQLITE files in C:\DataInsight\data\outbox containing all the changes that have been made on the share. In the case of NFS monitoring, isqlite files are generated under the <datadir>/unix/scanner folder and are transferred to the outbox after NFSUserMappingJob runs. Default schedule: 7 p.m. daily.
  • FileTransferJob (Collector): Transfers *.isqlite files from <datadir>\outbox to <datadir>\inbox on the Indexer node. Default schedule: Every minute.
  • IndexWriterJob (Indexer): Invokes idxwriter.exe, which consumes incremental scan data and updates the index db. Default schedule: Every 4 hours.
 
APIs used by isilonutil for share discovery:
Get all zone names: https://10.10.10.10:8080/platform/1/zones
Get all shares in a zone: https://10.10.10.10:8080/platform/1/protocols/smb/shares?zone=System
 
1. Discover the OneFS API version:
  a. https://10.10.10.10:8080/platform/latest
  b. SmartConnect APIs are supported for API version 3 and above; older API versions do not support this API.
  c. If this API returns a version of 3 or higher, Data Insight can look for SmartConnect; otherwise it falls back to the old flow.
 
2. Discover all access zones:
  a. https://10.10.10.10:8080/platform/1/zones
 
3. Discover all shares for each access zone:
  a. https://10.10.10.10:8080/platform/1/protocols/smb/shares?zone=System
  b. Returns the SmartConnect name along with the share information.
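
The same discovery sequence can be exercised manually to verify connectivity, credentials, and API rights before adding the filer. The sketch below is a minimal example, assuming an Isilon node reachable at the placeholder address 10.10.10.10:8080, a discovery user with the rights described above, and a self-signed certificate; it is not part of Data Insight itself.

# Minimal sketch of the share-discovery sequence above, using the documented
# OneFS platform API endpoints. The node address and credentials are
# placeholders; this is a connectivity/permissions check, not Data Insight code.
import requests
from requests.auth import HTTPBasicAuth

BASE = "https://10.10.10.10:8080"                      # placeholder node address
AUTH = HTTPBasicAuth("discovery_user", "password")     # placeholder credentials
VERIFY = False   # many clusters use self-signed certificates; adjust as needed

# 1. Discover the OneFS API version; SmartConnect lookups need version 3+.
latest = requests.get(f"{BASE}/platform/latest", auth=AUTH, verify=VERIFY).json()
print("Latest platform API version:", latest)

# 2. Discover all access zones.
zones = requests.get(f"{BASE}/platform/1/zones", auth=AUTH, verify=VERIFY).json()
zone_names = [z["name"] for z in zones.get("zones", [])]
print("Access zones:", zone_names)

# 3. Discover all shares in each access zone.
for zone in zone_names:
    resp = requests.get(
        f"{BASE}/platform/1/protocols/smb/shares",
        params={"zone": zone},
        auth=AUTH,
        verify=VERIFY,
    ).json()
    for share in resp.get("shares", []):
        print(f"{zone}: {share.get('name')} -> {share.get('path')}")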



Example of node table with ports

(L)ocal/(R)emote  ID  Name               IP                 Queryd_port  Commd_port  isConsole  isIndexer  isCollector  Ctr
L                 1   DIServer1.local    DIServer1.local    8282         8383        1          1          1            0
R                 2   DIPortal1.local    DIPortal1.local    8282         8383        0          0          0            0
R                 3   Indexer1.local     Indexer1.local     8282         8383        0          1          1            0
R                 4   Collector1.local   Collector1.local   8282         8383        0          0          1            0
R                 5   FileServer1.local  FileServer1.local  8282         8383        0          0          1            0
R                 6   FileServer2.local  FileServer2.local  8282         8383        0          0          1            0

 
Reference

List of Jobs Available in Data Insight (veritas.com)

 

 
