Problem
In order to improve backup performance, Enterprise Vault provides a Collector function which moves numerous archived objects into central collection files. This document discusses how the Enterprise Vault NTFS Collector functions.
Solution
The Enterprise Vault (EV) Collector, a component of the Storage File Watch process, is file management software that is responsible for:
- The collection of multiple saveset files (.DVS, .DVSSP and .DVSCC) into collection files (.CAB)
- The recall of saveset files from collection files when archived items are retrieved
- The deletion of temporarily recalled saveset files extracted from collection CABs
- Managing collection files in response to deletion of items
b. Once Collections are enabled for an NTFS Partition, this feature cannot be disabled.
c. Collections will decrease the total number of files on a partition, which will increase backup performance.
d. Native archived data on a partition are in a compressed format and moving the data into collection (CAB) files will not decrease the space used.
- Start at: Local time at which collecting will start (Default = 10:00 AM).
- End at: Local time at which collecting will finish (Default = 4:00 PM). EV will stop all collection threads at this time or when no more files are available to collect, whichever occurs first.
- Maximum collection file size: Maximum size of the CAB file generated (Default = 10 megabytes)
- Age at which files will be collected: Age of files on storage (Default = 10 Days)
EV vault store partitions hosted on an NTFS storage location organize the storage of saveset files by the sent/received date of the item archived by Enterprise Vault. These saveset files are spread across a hierarchy of folders rather than stored in one single folder.
The EV Collector parses the MONTH-DAY directory and utilizes the configurable rules to package the files within a directory into collection files (*.CAB files).
- The created date of each DAY directory is scanned and compared to the current date on the local system to determine whether that directory may contain files old enough for collection.
- If the difference is greater than the Age at which files will be collected setting for the partition, this indicates to the collector that the files under the day directory may be eligible for collection.
- The collector will begin to parse the files for the modified date attribute to determine eligibility for collection.
- During the eligibility check, the modified date of each file is compared to the current date on the local system to determine whether the file meets the criteria for collection.
- If the difference is greater than the ‘Collect files older than’ setting for the partition, this indicates to the collector that the files are eligible for collection.
- The collector will compress the eligible files into a collection (CAB) file, which is placed in the day directory. The collection of the saveset files is based on the following rules:
- During each daily run, the collector will add eligible saveset files to new collection (CAB) files.
- If there are fewer than 15 files that are eligible for collection, the collector does not perform a collection of these files.
- Per partition, there is a configurable maximum size for each collection (CAB) file. The default size is 10 MB, with a maximum of 99 MB.
- When creating a collection, if one of these rules is met, the files to be collected will roll over into another collection (CAB) file. This process will result in the generation of multiple collection (CAB) files within a single day directory.
- When a Collection file is created, the file name format is ‘Collection#.CAB’, where # is an incremented numerical identifier for each collection file within a partition.
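The eligibility checks and packaging rules above can be sketched in Python. This is an illustrative model only, not Enterprise Vault code: the names `SavesetFile`, `eligible_files` and `plan_collections` are hypothetical, the sketch assumes that the maximum collection file size is the rule that triggers a rollover, and it measures file sizes before compression.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_FILES_PER_RUN = 15  # from the article: fewer eligible files means no collection


@dataclass
class SavesetFile:  # hypothetical stand-in for a saveset file on disk
    name: str
    size_bytes: int
    modified: datetime


def eligible_files(day_dir_created, files, now, min_age_days=10):
    """Return files old enough to collect, or [] if the day directory is too new."""
    min_age = timedelta(days=min_age_days)
    if now - day_dir_created <= min_age:
        return []  # directory too recent, so its files are not scanned
    return [f for f in files if now - f.modified > min_age]


def plan_collections(files, next_id, max_cab_bytes=10 * 1024 * 1024):
    """Group eligible files into CAB plans, rolling over at the size limit.

    Assumption: sizes are summed uncompressed; the real collector may differ.
    """
    if len(files) < MIN_FILES_PER_RUN:
        return {}
    plans = {}
    current, current_size = [], 0
    for f in files:
        # Start a new CAB when adding this file would exceed the limit.
        if current and current_size + f.size_bytes > max_cab_bytes:
            plans[f"Collection{next_id}.CAB"] = current
            next_id += 1
            current, current_size = [], 0
        current.append(f.name)
        current_size += f.size_bytes
    if current:
        plans[f"Collection{next_id}.CAB"] = current
    return plans
```

For example, twenty eligible files of 1 MB each with the default 10 MB limit would be planned as two collection files with consecutive identifiers.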
Within SQL:
- Archived items are referenced within the Vault Store Database, within the dbo.Saveset Table. When an archived item is placed within a Collection, the numerical identifier assigned to the CAB file is assigned to the individual item within the Saveset Table as the item's CollectionIdentity.
- The CollectionIdentity value is then added to the Vault Store Database, within the dbo.Collections Table. This table is then populated with the properties of the Collection file, including the location and name (RelativeFileName value).
- The SQL components are referenced when a retrieval request is performed in order to identify the location of the file(s) which are required by the retrieval/restore process.
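The relationship between the two tables can be illustrated with a small in-memory model. The sketch below uses Python's `sqlite3` module rather than SQL Server, and keeps only the CollectionIdentity and RelativeFileName columns named above; the `SavesetId` key column, the sample path and the function names are assumptions made for the example and do not reflect the real Vault Store schema.

```python
import sqlite3


def build_demo_db():
    """Create a simplified, hypothetical model of the Saveset and Collections tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Collections (
            CollectionIdentity INTEGER PRIMARY KEY,
            RelativeFileName   TEXT NOT NULL
        );
        CREATE TABLE Saveset (
            SavesetId          TEXT PRIMARY KEY,  -- hypothetical key column
            CollectionIdentity INTEGER            -- NULL until the item is collected
        );
        """
    )
    # Sample rows; the relative path is invented for illustration.
    conn.execute(
        "INSERT INTO Collections VALUES (?, ?)",
        (7, r"2024\01-15\Collection7.CAB"),
    )
    conn.execute("INSERT INTO Saveset VALUES (?, ?)", ("item-a", 7))
    conn.execute("INSERT INTO Saveset VALUES (?, ?)", ("item-b", None))
    return conn


def find_cab(conn, saveset_id):
    """Return the CAB's RelativeFileName for a collected item, else None."""
    row = conn.execute(
        """
        SELECT c.RelativeFileName
        FROM Saveset AS s
        JOIN Collections AS c ON c.CollectionIdentity = s.CollectionIdentity
        WHERE s.SavesetId = ?
        """,
        (saveset_id,),
    ).fetchone()
    return row[0] if row else None
```

An item that has not been collected has no CollectionIdentity, so the join returns no row and the lookup yields `None`.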
Collected item retrievals
When an item is retrieved from NTFS storage, the following will be performed:
a. A check is performed within the Vault Store Saveset Table to identify the Partition where the item is located (PartitionEntryIdentity).
b. If a CollectionIdentity is recorded, a subsequent check against the Collections table is performed to identify the CAB location.
c. Storage will browse to the physical Partition location to attain the file(s) associated with the archived item in the following order:
i. A check of the original storage location (prior to being collected) of the items in .DVS format.
ii. A check of the original storage location (prior to being collected) of the items in .ARCHDVS format.
iii. A check for the CAB location is performed, based on the RelativeFileName SQL entry for the associated CollectionIdentity.
iv. When the requested files are located within the collection file, the files are temporarily extracted to the file's original location as an .ARCHDVS file.
Note: When performing an export of an archive, this process can temporarily reduce the free space on the partition, because the files will exist in two locations on the partition.
d. If the Migrator option is used to move older collections to an alternate location, an additional check is performed for an *.ARCHCAB file in the original collection file location. If neither the *.CAB nor the *.ARCHCAB file is present, a request to retrieve the collection from secondary storage is performed. This process places the collection file in the original file location with an *.ARCHCAB extension.
See the following articles for additional information on utilizing secondary storage Migration:
How to configure collection and migration of files
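The lookup order described in steps c and d can be summarized as a short Python sketch. It only reports where an item would be read from and does not extract or recall anything; the function name `locate_saveset` and the step labels it returns are hypothetical, and the sketch assumes a single saveset file with a one-part extension such as .DVS.

```python
from pathlib import Path


def locate_saveset(saveset_path, cab_path=None):
    """Return (step, path) describing where the item would be read from.

    cab_path is the collection file location from the database lookup,
    or None if the item has no CollectionIdentity.
    """
    saveset_path = Path(saveset_path)

    # i. Original storage location, for example item.DVS
    if saveset_path.exists():
        return ("original", saveset_path)

    # ii. Temporarily recalled copy, for example item.ARCHDVS
    arch = saveset_path.with_suffix(".ARCH" + saveset_path.suffix.lstrip("."))
    if arch.exists():
        return ("temporary", arch)

    if cab_path is None:
        return ("missing", None)

    # iii. Collection file at its recorded location
    cab_path = Path(cab_path)
    if cab_path.exists():
        return ("extract-from-cab", cab_path)

    # d. Collection previously recalled from secondary storage
    arch_cab = cab_path.with_suffix(".ARCHCAB")
    if arch_cab.exists():
        return ("extract-from-cab", arch_cab)

    # Neither file is present: the collection must be recalled, and it
    # would be placed at the .ARCHCAB path.
    return ("recall-from-secondary", arch_cab)
```

Checking the cheapest locations first means a CAB extraction or a secondary-storage recall is only considered when no usable copy already exists on the partition.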
Deletion of temporarily retrieved files
Items retrieved from collection files are extracted to the original file location with an *.ARCH* prefix added to the extension (for example, *.ARCHDVS, *.ARCHDVSSP, *.ARCHDVSCC and *.ARCHCAB). When the Collection process scans each partition, the Last Accessed Date attribute of each *.ARCH* file on the partition is checked. If the Last Accessed Date is more than 24 hours before the current time, the Collector, using the StorageFileWatch process, will delete the 'expired' file(s).
Note: If too many ARCH* files exist in a partition due to a large number of recalls, they can be removed manually, as they are temporary files. EV will recall items from secondary storage again if needed.
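To review which temporary files would qualify as expired under the 24-hour rule, a read-only scan along the following lines could be used. This is a minimal sketch rather than the Collector's own logic: the function names are hypothetical, it deletes nothing, and it assumes that last-access time updates are enabled on the NTFS volume (they can be disabled, in which case the access times are not reliable).

```python
import time
from pathlib import Path

EXPIRY_SECONDS = 24 * 60 * 60  # 24 hours, as described in the article


def is_temp_recall(path):
    """True for extensions such as .ARCHDVS, .ARCHDVSSP, .ARCHDVSCC, .ARCHCAB."""
    return path.suffix.upper().startswith(".ARCH")


def expired_recalls(partition_root, now=None):
    """List *.ARCH* files whose last access is more than 24 hours old."""
    now = time.time() if now is None else now
    expired = []
    for path in Path(partition_root).rglob("*"):
        if path.is_file() and is_temp_recall(path):
            if now - path.stat().st_atime > EXPIRY_SECONDS:
                expired.append(path)
    return sorted(expired)
```

Calling `expired_recalls` with the partition root returns the candidate paths, which can then be reviewed before any manual removal.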
Managing collection files due to deletions
The EV Collector process is responsible for permanently removing archived data from collection files after expiry or user deletions are performed. Initially, when a collected archived item is deleted, the physical data is still retained within the physical collection file until the collection is analyzed by the Collector process. This function is called the Sparse Collections process. See the following article for details on the Sparse Collections process:
How do Sparse Collections and the SparseCollectionPercentage setting work?