iseg dataset format¶
This dataset contains segments (see Data segments), each with an index next to it: a SQLite database with a .index extension.
It can detect duplicates, enforcing uniqueness on the set of metadata selected with the unique configuration keyword.
Duplicate detection relies on the invariant that segment naming produces segments that do not overlap, so a datum can only be found in one well-defined segment, and duplicate detection can be performed on a per-segment basis.
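As an illustrative sketch (not arkimet's actual implementation), the invariant means that a step maps each reference time to exactly one segment, so checking for a duplicate only requires consulting that one segment's index. The YYYY/MM-DD layout below is an assumption for illustration:

```python
from datetime import datetime

def segment_for(reftime: datetime, fmt: str = "grib") -> str:
    # Hypothetical step=daily layout: YYYY/MM-DD.<format>.
    # Every datum with the same day lands in the same segment,
    # so duplicate detection can stay per-segment.
    return reftime.strftime(f"%Y/%m-%d.{fmt}")

a = segment_for(datetime(2024, 5, 17, 0))
b = segment_for(datetime(2024, 5, 17, 12))
assert a == b == "2024/05-17.grib"
```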
iseg datasets do not have one global index: enumerating the segments in the dataset is done by enumerating all the possible segments that would be generated by the dataset step, within the time range defined by the query and the range of data present in the dataset. This means that iseg datasets will ignore all segments that do not fit the naming scheme defined by the dataset step.
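The enumeration above can be sketched as generating candidate names rather than scanning the directory; the daily naming scheme here is a hypothetical example:

```python
from datetime import date, timedelta

def enumerate_segments(begin: date, end: date, fmt: str = "grib"):
    """Sketch of segment enumeration for a daily step: generate every
    segment name the step could produce within the interval. Files on
    disk that do not match this scheme are never considered."""
    day = begin
    while day <= end:
        yield f"{day.year:04d}/{day.month:02d}-{day.day:02d}.{fmt}"
        day += timedelta(days=1)

names = list(enumerate_segments(date(2024, 1, 30), date(2024, 2, 1)))
assert names == ["2024/01-30.grib", "2024/01-31.grib", "2024/02-01.grib"]
```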
It can optimize queries that use metadata selected with the index configuration keyword.
Because the iseg dataset can only have one index per step, and one segment per index, it cannot store data encoded in multiple formats (like GRIB and BUFR together), as that would require multiple segments for the same step. For this reason, the format keyword is mandatory in a type=iseg dataset configuration (see Dataset configuration).
Example configuration¶
[name]
type = iseg
format = grib
step = daily
filter = origin: GRIB1,200
unique = origin, reftime, area
index = origin, reftime
Dataset layout¶
The .index files contain the metadata for all data in the segment.
Each .index file also contains unique indices on the metadata listed in the unique configuration value, to do quick duplicate detection, and extra indices on the metadata listed in the index configuration value, to speed up queries.
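A minimal sketch of what such a per-segment index could look like; the table and column names here are hypothetical (the real schema is internal to arkimet), but it shows how a UNIQUE index rejects duplicates while plain indices only speed up lookups:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE md (            -- hypothetical per-segment metadata table
    offset  INTEGER NOT NULL,  -- where the datum starts in the segment
    size    INTEGER NOT NULL,  -- datum size in bytes
    origin  TEXT, reftime TEXT, area TEXT
);
-- unique = origin, reftime, area  ->  UNIQUE index for duplicate detection
CREATE UNIQUE INDEX md_unique ON md (origin, reftime, area);
-- index = origin, reftime  ->  plain indices to optimize queries
CREATE INDEX md_origin  ON md (origin);
CREATE INDEX md_reftime ON md (reftime);
""")
db.execute("INSERT INTO md VALUES (0, 100, 'GRIB1,200', '2024-05-17T00:00', 'a1')")
try:
    # Same unique metadata set: the index rejects the duplicate
    db.execute("INSERT INTO md VALUES (100, 100, 'GRIB1,200', '2024-05-17T00:00', 'a1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
assert duplicate_rejected
```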
Check and repack on concat segments¶
During check¶
- the segment must be a file
- the segment must exist [missing]
- an empty segment not known by the index must be considered deleted [deleted]
- data files not known by a valid index are considered data files whose entire content has been removed [deleted]
- segments that contain some data that has been removed are identified as to be repacked [dirty]
- segments that only contain data that has been removed are identified as fully deleted [deleted]
- all data known by the index for this segment must be present on disk [corrupted]
- no pair of (offset, size) data spans from the index can overlap [corrupted]
- data must start at the beginning of the segment [dirty]
- there must be no gaps between data in the segment [dirty]
- data must end at the end of the segment [dirty]
- find segments that can only contain data older than archive age days [archive_age]
- find segments that can only contain data older than delete age days [delete_age]
- the span of reference times in each segment must fit inside the interval implied by the segment file name (FIXME: should this be disabled for archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]
- data on disk must match the order of data used by queries [dirty]
- segments not known by the index, when the index is missing, older than the segment, or marked as needing checking, are marked for reindexing instead of deletion [unaligned]
- format-specific consistency checks on the content of each file must pass [corrupted]
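The offset/size checks in the list above can be sketched as a small classifier. The flag names mirror the list; the function itself is a hypothetical simplification, not arkimet code:

```python
def check_spans(spans, segment_size):
    """spans: (offset, size) pairs from the index; segment_size: bytes
    on disk. Overlapping spans -> corrupted; a hole, data not starting
    at offset 0, or the segment not ending with the data -> dirty."""
    pos = 0
    for offset, size in sorted(spans):
        if offset < pos:
            return "corrupted"   # two data spans overlap
        if offset > pos:
            return "dirty"       # gap before this datum
        pos = offset + size
    if pos != segment_size:
        return "dirty"           # data does not end at the end of the segment
    return "ok"

assert check_spans([(0, 10), (10, 5)], 15) == "ok"
assert check_spans([(0, 10), (8, 5)], 15) == "corrupted"
assert check_spans([(0, 10), (12, 3)], 15) == "dirty"
```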
During --accurate check¶
During fix¶
- [deleted] segments are left untouched
- [dirty] segments are not touched
- [unaligned] segments are imported in-place
- [missing] segments are removed from the index
- [corrupted] segments can only be fixed by manual intervention: they are reported and left untouched
- [archive age] segments are not touched
- [delete age] segments are not touched
During repack¶
- [deleted] segments are removed from disk
- [dirty] segments are rewritten to be without holes and to have data in the right order. In concat segments, this guarantees linear disk access when data are queried in the default sorting order. In dir segments, this avoids sequence numbers growing indefinitely for datasets with frequent appends and removes.
- [missing] segments are removed from the index
- [corrupted] segments are not touched
- [archive age] segments are repacked if needed, then moved to .archive/last
- [delete age] segments are deleted
- [delete age] [dirty] a segment that needs to be both repacked and deleted gets deleted without repacking
- [archive age] [dirty] a segment that needs to be both repacked and archived gets repacked before archiving
- [unaligned] segments are not touched, to prevent deleting data that should be reindexed instead
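The precedence rules above can be summarized as a sketch; the flag set and action names are hypothetical labels for the behaviours described in the list:

```python
def repack_action(flags: set) -> str:
    """Illustrative precedence of repack decisions, following the
    rules above (not arkimet's actual code)."""
    if "unaligned" in flags or "corrupted" in flags:
        return "leave"        # never delete data that should be reindexed or inspected
    if "delete_age" in flags:
        return "delete"       # even if also dirty: deleted without repacking first
    if "archive_age" in flags:
        # repacked if needed, then moved to .archive/last
        return "repack+archive" if "dirty" in flags else "archive"
    if "deleted" in flags:
        return "remove"
    if "dirty" in flags:
        return "repack"
    return "none"

assert repack_action({"delete_age", "dirty"}) == "delete"
assert repack_action({"archive_age", "dirty"}) == "repack+archive"
assert repack_action({"unaligned"}) == "leave"
```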
Check and repack on dir segments¶
During check¶
- the segment must be a directory [unaligned]
- the size of each data file must match the data size exactly [corrupted]
- the modification time of a directory segment can vary unpredictably, so it is ignored; the modification time of the sequence file is used instead
- if arkimet is interrupted during rollback of an append operation on a dir dataset, there are files whose name has a higher sequence number than the sequence file. This would break further appends, and needs to be detected and fixed [unaligned]
- the segment must exist [missing]
- an empty segment not known by the index must be considered deleted [deleted]
- data files not known by a valid index are considered data files whose entire content has been removed [deleted]
- segments that contain some data that has been removed are identified as to be repacked [dirty]
- segments that only contain data that has been removed are identified as fully deleted [deleted]
- all data known by the index for this segment must be present on disk [corrupted]
- data must start at the beginning of the segment [dirty]
- there must be no gaps between data in the segment [dirty]
- data must end at the end of the segment [dirty]
- find segments that can only contain data older than archive age days [archive_age]
- find segments that can only contain data older than delete age days [delete_age]
- the span of reference times in each segment must fit inside the interval implied by the segment file name (FIXME: should this be disabled for archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]
- data on disk must match the order of data used by queries [dirty]
- segments not known by the index, when the index is missing, older than the segment, or marked as needing checking, are marked for reindexing instead of deletion [unaligned]
- format-specific consistency checks on the content of each file must pass [corrupted]
During --accurate check¶
During fix¶
- [unaligned] a sequence file with a value lower than the highest sequence number found is fixed by setting it to that highest number
- [unaligned] the same fix applies when one data file is truncated or only partly written
- [deleted] segments are left untouched
- [dirty] segments are not touched
- [unaligned] segments are imported in-place
- [missing] segments are removed from the index
- [corrupted] segments can only be fixed by manual intervention: they are reported and left untouched
- [archive age] segments are not touched
- [delete age] segments are not touched
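The sequence-file fix above can be sketched as follows. The numbered file layout and the `.sequence` file name are assumptions for illustration, not arkimet's actual on-disk format:

```python
import os
import re
import tempfile

def fix_sequence_file(segment_dir: str, fmt: str = "grib") -> int:
    """Sketch of the unaligned fix: if data files exist with a sequence
    number higher than the stored sequence value, bump the stored value
    to the highest number found so further appends do not collide."""
    pat = re.compile(r"^(\d+)\." + re.escape(fmt) + r"$")
    numbers = [int(m.group(1)) for name in os.listdir(segment_dir)
               if (m := pat.match(name))]
    highest = max(numbers, default=0)
    with open(os.path.join(segment_dir, ".sequence"), "w") as fd:
        fd.write(str(highest))
    return highest

with tempfile.TemporaryDirectory() as d:
    # File 5 was left behind by an interrupted append rollback
    for n in (0, 1, 5):
        open(os.path.join(d, f"{n:06d}.grib"), "w").close()
    assert fix_sequence_file(d) == 5
```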
During repack¶
- [deleted] segments are removed from disk
- [dirty] segments are rewritten to be without holes and to have data in the right order. In concat segments, this guarantees linear disk access when data are queried in the default sorting order. In dir segments, this avoids sequence numbers growing indefinitely for datasets with frequent appends and removes.
- [missing] segments are removed from the index
- [corrupted] segments are not touched
- [archive age] segments are repacked if needed, then moved to .archive/last
- [delete age] segments are deleted
- [delete age] [dirty] a segment that needs to be both repacked and deleted gets deleted without repacking
- [archive age] [dirty] a segment that needs to be both repacked and archived gets repacked before archiving
- [unaligned] segments are not touched, to prevent deleting data that should be reindexed instead