iseg dataset format

This dataset contains segments (see Data segments). Each segment has an index stored next to it: a SQLite database in a file with a .index extension.

It can detect duplicates by enforcing uniqueness on the set of metadata selected with the unique configuration keyword.

Duplicate detection relies on the invariant that segment naming produces segments that do not overlap. A datum can therefore only be found in one well-defined segment, and duplicate detection can be performed on a per-segment basis.

iseg datasets do not have one global index. The segments in the dataset are enumerated by generating all the possible segments that the dataset step would produce within the time range defined by the query and the range of data present in the dataset. This means that iseg datasets will ignore all segments that do not fit in the naming scheme defined by the dataset step.
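
As a rough illustration of this enumeration, the following Python sketch generates candidate segment names for a daily step. The YYYY/MM-DD.format naming pattern and the function names are assumptions made for the example, not taken from the arkimet sources.

```python
from datetime import date, timedelta


def clamp(query_begin, query_end, data_begin, data_end):
    """Intersect the query range with the range of data in the dataset.

    Returns (begin, end), or None if the two ranges do not overlap.
    """
    begin = max(query_begin, data_begin)
    end = min(query_end, data_end)
    return (begin, end) if begin <= end else None


def daily_segments(begin, end, fmt="grib"):
    """Yield one candidate segment path per day in [begin, end].

    Assumed naming scheme: <year>/<month>-<day>.<format>
    """
    day = begin
    while day <= end:
        yield f"{day:%Y}/{day:%m-%d}.{fmt}"
        day += timedelta(days=1)
```

A file that does not match any generated name is never visited, which is why segments outside the naming scheme are ignored.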

It can optimize queries that use metadata selected with the index configuration keyword.

Because the iseg dataset can only have one index per step, and one segment per index, it cannot store data encoded in multiple formats (like GRIB and BUFR together), as that would require multiple segments for the same step. For this reason, the format keyword is mandatory in a type=iseg dataset configuration (see Dataset configuration).

Example configuration

[name]
type = iseg
format = grib
step = daily
filter = origin: GRIB1,200
unique = origin, reftime, area
index = origin, reftime
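
The sketch below reads a configuration like the one above with Python's configparser and rejects a type=iseg section that lacks the format keyword. It only illustrates the rule; the function name and the returned dictionary are hypothetical, and this is not meant to mirror how arkimet reads its configuration.

```python
import configparser

SAMPLE = """
[name]
type = iseg
format = grib
step = daily
filter = origin: GRIB1,200
unique = origin, reftime, area
index = origin, reftime
"""


def split_list(value):
    """Split a comma-separated configuration value into clean names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_iseg_section(text, section):
    """Parse one dataset section, enforcing that iseg datasets set format."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    cfg = parser[section]
    if cfg.get("type") == "iseg" and not cfg.get("format"):
        raise ValueError(f"[{section}]: format is mandatory for type=iseg")
    return {
        "format": cfg.get("format"),
        "step": cfg.get("step"),
        "unique": split_list(cfg.get("unique", "")),
        "index": split_list(cfg.get("index", "")),
    }
```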

Dataset layout

Each .index file contains the metadata for all data in its segment.

It also contains additional indices for the metadata listed in the unique configuration value, to do quick duplicate detection, and extra indices for the metadata listed in the index configuration value.
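
To make the role of these indices concrete, here is a minimal sketch using Python's sqlite3 module. The table name md, its columns, and the index names are assumptions chosen for illustration and may not match the real .index schema. A UNIQUE index over the unique metadata turns a duplicate insert into an error, and plain indices cover the index metadata.

```python
import sqlite3


def create_segment_index(path=":memory:"):
    """Create a hypothetical per-segment index database."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE md (
            id INTEGER PRIMARY KEY,
            offset INTEGER NOT NULL,
            size INTEGER NOT NULL,
            reftime TEXT NOT NULL,
            origin TEXT NOT NULL,
            area TEXT NOT NULL
        )
        """
    )
    # unique = origin, reftime, area
    db.execute("CREATE UNIQUE INDEX md_unique ON md (origin, reftime, area)")
    # index = origin, reftime
    db.execute("CREATE INDEX md_origin ON md (origin)")
    db.execute("CREATE INDEX md_reftime ON md (reftime)")
    return db


def try_insert(db, offset, size, reftime, origin, area):
    """Insert one datum; return False if it duplicates an existing one."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO md (offset, size, reftime, origin, area)"
                " VALUES (?, ?, ?, ?, ?)",
                (offset, size, reftime, origin, area),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The columns are declared NOT NULL in this sketch because SQLite treats NULL values as distinct from each other in a UNIQUE index, which would let duplicates through.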

Check and repack on concat segments

During check

  • the segment must be a file

  • the segment must exist [missing]

  • an empty segment not known by the index must be considered deleted [deleted]

  • data files not known by a valid index are considered data files whose entire content has been removed [deleted]

  • segments that contain some data that has been removed are identified as to be repacked [dirty]

  • segments that only contain data that has been removed are identified as fully deleted [deleted]

  • all data known by the index for this segment must be present on disk [corrupted]

  • no pair of (offset, size) data spans from the index can overlap [corrupted]

  • data must start at the beginning of the segment [dirty]

  • there must be no gaps between data in the segment [dirty]

  • data must end at the end of the segment [dirty]

  • find segments that can only contain data older than archive age days [archive_age]

  • find segments that can only contain data older than delete age days [delete_age]

  • the span of reference times in each segment must fit inside the interval implied by the segment file name (FIXME: should this be disabled for archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]

  • data on disk must match the order of data used by queries [dirty]

  • segments not known by the index are marked for reindexing instead of deletion when the index is missing, older than the file, or marked as needing checking [unaligned]

  • format-specific consistency checks on the content of each file must pass [corrupted]
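
The (offset, size) checks in the list above can be sketched as a single pass over the spans recorded in the index. This is only an illustration under simplifying assumptions: spans are given in the order used by queries, the function name is hypothetical, and the real checker may combine or order the checks differently.

```python
def check_spans(spans, segment_size):
    """Return the set of state flags implied by the indexed (offset, size) spans.

    spans: list of (offset, size) tuples, in query order.
    segment_size: size in bytes of the segment file on disk.
    """
    if not spans:
        # Nothing in the segment is known by the index
        return {"deleted"}

    state = set()
    ordered = sorted(spans)
    if ordered != list(spans):
        # Data on disk does not follow the order used by queries
        state.add("dirty")

    pos = 0  # end of the data seen so far
    for offset, size in ordered:
        if offset < pos:
            state.add("corrupted")  # two spans overlap
        elif offset > pos:
            state.add("dirty")  # gap, or data not starting at offset 0
        pos = max(pos, offset + size)

    if pos > segment_size:
        state.add("corrupted")  # indexed data is missing from disk
    elif pos < segment_size:
        state.add("dirty")  # trailing bytes not known by the index
    return state
```

For example, check_spans([(0, 10), (12, 3)], 15) reports a gap and returns {"dirty"}, while check_spans([(0, 10), (8, 5)], 13) reports an overlap and returns {"corrupted"}.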

During --accurate check

During fix

  • [deleted] segments are left untouched

  • [dirty] segments are not touched

  • [unaligned] segments are imported in-place

  • [missing] segments are removed from the index

  • [corrupted] segments can only be fixed by manual intervention. They are reported and left untouched

  • [archive_age] segments are not touched

  • [delete_age] segments are not touched

During repack

  • [deleted] segments are removed from disk

  • [dirty] segments are rewritten to be without holes and have data in the right order. In concat segments, this is done to guarantee linear disk access when data are queried in the default sorting order. In dir segments, this is done to avoid sequence numbers growing indefinitely for datasets with frequent appends and removes.

  • [missing] segments are removed from the index

  • [corrupted] segments are not touched

  • [archive_age] segments are repacked if needed, then moved to .archive/last

  • [delete_age] segments are deleted

  • [delete_age] [dirty] a segment that needs to be both repacked and deleted is deleted without repacking

  • [archive_age] [dirty] a segment that needs to be both repacked and archived is repacked before archiving

  • [unaligned] segments are not touched, to prevent deleting data that should be reindexed instead
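
The repack rules above can be summarised as a small decision function. This is a sketch, not the actual implementation: the action names are hypothetical, and giving [corrupted] and [unaligned] precedence over every other flag is an assumption, since the list only describes some flag combinations.

```python
def repack_actions(state):
    """Map a set of segment state flags to an ordered list of repack actions."""
    flags = set(state)
    # Assumption: "do not touch" flags win over everything else
    if flags & {"corrupted", "unaligned"}:
        return []
    if "missing" in flags:
        return ["remove_from_index"]
    # A segment due for deletion is deleted without repacking it first
    if flags & {"deleted", "delete_age"}:
        return ["delete"]
    actions = []
    if "dirty" in flags:
        actions.append("repack")
    # Archiving happens after any needed repack
    if "archive_age" in flags:
        actions.append("archive")
    return actions
```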

Check and repack on dir segments

During check

  • the segment must be a directory [unaligned]

  • the size of each data file must match the data size exactly [corrupted]

  • the modification time of a directory segment can vary unpredictably, so it is ignored. The modification time of the sequence file is used instead.

  • if arkimet is interrupted during rollback of an append operation on a dir dataset, some files may be left with a sequence number higher than the value stored in the sequence file. This will break further appends, and needs to be detected and fixed. [unaligned]

  • the segment must exist [missing]

  • an empty segment not known by the index must be considered deleted [deleted]

  • data files not known by a valid index are considered data files whose entire content has been removed [deleted]

  • segments that contain some data that has been removed are identified as to be repacked [dirty]

  • segments that only contain data that has been removed are identified as fully deleted [deleted]

  • all data known by the index for this segment must be present on disk [corrupted]

  • data must start at the beginning of the segment [dirty]

  • there must be no gaps between data in the segment [dirty]

  • data must end at the end of the segment [dirty]

  • find segments that can only contain data older than archive age days [archive_age]

  • find segments that can only contain data older than delete age days [delete_age]

  • the span of reference times in each segment must fit inside the interval implied by the segment file name (FIXME: should this be disabled for archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]

  • data on disk must match the order of data used by queries [dirty]

  • segments not known by the index are marked for reindexing instead of deletion when the index is missing, older than the file, or marked as needing checking [unaligned]

  • format-specific consistency checks on the content of each file must pass [corrupted]
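
The two checks that are specific to dir segments, exact file sizes and sequence numbers ahead of the sequence file, can be sketched as follows. The function name and the dictionaries keyed by sequence number are assumptions made for this example.

```python
def check_dir_segment(indexed_sizes, disk_sizes, sequence_value):
    """Return state flags for the dir-specific checks.

    indexed_sizes: {sequence number: data size} as recorded in the index.
    disk_sizes: {sequence number: file size} as found in the directory.
    sequence_value: number currently stored in the sequence file.
    """
    state = set()
    for seq, size in indexed_sizes.items():
        # Each indexed datum must exist on disk with exactly the indexed size
        if disk_sizes.get(seq) != size:
            state.add("corrupted")
    # A file numbered past the sequence file would break further appends
    if disk_sizes and max(disk_sizes) > sequence_value:
        state.add("unaligned")
    return state
```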

During --accurate check

During fix

  • [unaligned] a sequence file value that is too low is fixed by setting it to the highest sequence number found.

  • [unaligned] the same fix is applied when one of the files is truncated or partly written.

  • [deleted] segments are left untouched

  • [dirty] segments are not touched

  • [unaligned] segments are imported in-place

  • [missing] segments are removed from the index

  • [corrupted] segments can only be fixed by manual intervention. They are reported and left untouched

  • [archive_age] segments are not touched

  • [delete_age] segments are not touched
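
The sequence file fix described at the top of this list can be sketched as below. The on-disk details are assumptions made for the example: data files are taken to be named <number>.<format>, and the sequence file is taken to be a file called .sequence holding the number as decimal text. The real format may differ.

```python
from pathlib import Path


def fix_sequence_file(segment_dir, fmt="grib"):
    """Raise the sequence file value to the highest sequence number on disk.

    Returns the value stored in the sequence file after the fix.
    """
    segment = Path(segment_dir)
    # Collect the sequence numbers of all data files, whatever their size:
    # a truncated or partly written file still occupies its number
    numbers = [
        int(path.stem)
        for path in segment.glob(f"*.{fmt}")
        if path.stem.isdigit()
    ]
    highest = max(numbers, default=0)

    seqfile = segment / ".sequence"  # assumed name and format
    current = int(seqfile.read_text()) if seqfile.exists() else 0
    if highest > current:
        seqfile.write_text(str(highest))
    return max(highest, current)
```

Calling the function a second time on the same directory changes nothing, since the stored value already matches the highest number found.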

During repack

  • [deleted] segments are removed from disk

  • [dirty] segments are rewritten to be without holes and have data in the right order. In concat segments, this is done to guarantee linear disk access when data are queried in the default sorting order. In dir segments, this is done to avoid sequence numbers growing indefinitely for datasets with frequent appends and removes.

  • [missing] segments are removed from the index

  • [corrupted] segments are not touched

  • [archive_age] segments are repacked if needed, then moved to .archive/last

  • [delete_age] segments are deleted

  • [delete_age] [dirty] a segment that needs to be both repacked and deleted is deleted without repacking

  • [archive_age] [dirty] a segment that needs to be both repacked and archived is repacked before archiving

  • [unaligned] segments are not touched, to prevent deleting data that should be reindexed instead