.. _datasets-simple:

``simple`` dataset format
=========================

This dataset contains segments (see :ref:`segments`) with minimal indexing.

It is useful for long-term storage of data, due to the small metadata overhead
and simple structure.

The only query optimization it supports applies to queries restricted to a
date-time range.

Because it lacks a detailed index, a ``simple`` dataset cannot efficiently
detect duplicate data on import, so that feature is not implemented. For
the same reason, ``index`` and ``unique`` are not supported in the
configuration.


Example configuration
---------------------

::

  [name]
  type = simple
  step = daily
  filter = origin: GRIB1,200
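
As an illustration of the restriction on ``index`` and ``unique``, here is a
minimal sketch that flags those keys in a configuration like the one above. It
assumes the configuration can be read with Python's ``configparser``;
``check_simple_config`` is a hypothetical helper, not part of arkimet.

.. code-block:: python

   import configparser

   # Keys that the text above lists as unsupported by ``simple`` datasets.
   UNSUPPORTED_KEYS = {"index", "unique"}


   def check_simple_config(text):
       """Return {section name: sorted unsupported keys} for simple datasets."""
       parser = configparser.ConfigParser(interpolation=None)
       parser.read_string(text)
       problems = {}
       for name in parser.sections():
           section = parser[name]
           if section.get("type") != "simple":
               continue  # other dataset types are outside this sketch
           bad = sorted(UNSUPPORTED_KEYS.intersection(section.keys()))
           if bad:
               problems[name] = bad
       return problems

An empty result means that no ``simple`` section uses an unsupported key.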


Dataset layout
--------------

At the root of the dataset there is a ``MANIFEST`` file that lists the segments
known to the dataset, together with their reference time spans. The ``MANIFEST``
file can be encoded in plain text or in a ``.sqlite`` database.
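
The date-time optimization can be pictured as an interval-overlap test against
the spans recorded in ``MANIFEST``. The following is a minimal sketch, not
arkimet's implementation: it assumes the manifest has already been loaded into
a hypothetical dictionary mapping segment names to ``(first, last)`` reference
times, and the segment names used with it are illustrative only.

.. code-block:: python

   from datetime import datetime


   def overlapping_segments(manifest, begin, end):
       """Return names of segments whose time span intersects [begin, end]."""
       return [
           name
           for name, (first, last) in sorted(manifest.items())
           # Two closed intervals intersect when each starts before the
           # other one ends.
           if first <= end and last >= begin
       ]

Segments outside the queried range are never opened, which is what keeps a
date-restricted query cheap even without a detailed index.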

For each segment there is an associated ``.metadata`` file that contains metadata
for all data in the segment. This makes it possible to select data according to
a query, without needing to rescan it every time.

For each segment there is also an associated ``.summary`` file that contains a
summary of the data within the segment. It is used to quickly filter out
segments during a query without scanning the whole contents of their
``.metadata`` files, and to answer summary queries by merging existing
summaries instead of recomputing them for all the data queried.
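
Merging summaries can be sketched as follows. This is an illustration under
assumptions, not the actual ``.summary`` format: a summary is modelled as a
dictionary from a metadata key to a hypothetical ``Stats`` record holding a
count, a total size and a reference time span.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import datetime


   @dataclass
   class Stats:
       count: int
       size: int
       begin: datetime
       end: datetime

       def merged(self, other):
           """Combine two records that describe the same metadata key."""
           return Stats(
               self.count + other.count,
               self.size + other.size,
               min(self.begin, other.begin),
               max(self.end, other.end),
           )


   def merge_summaries(summaries):
       """Merge per-segment summaries into one, without touching the data."""
       total = {}
       for summary in summaries:
           for key, stats in summary.items():
               total[key] = total[key].merged(stats) if key in total else stats
       return total

The cost of the merge depends on the number of distinct metadata keys, not on
the amount of data stored in the segments.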


General check and repack notes
------------------------------

Since the dataset is intended for long-term archival, repack will never delete
a data segment. All data segments found without a ``.metadata`` file or next to
an empty ``.metadata`` file will always be rescanned.
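
The rescan rule can be expressed as a small predicate. The sketch below assumes
that the ``.metadata`` file name is obtained by appending ``.metadata`` to the
segment path; ``needs_rescan`` is a hypothetical name used for illustration.

.. code-block:: python

   import os


   def needs_rescan(segment_path):
       """True if the segment has no ``.metadata`` file, or an empty one."""
       metadata_path = segment_path + ".metadata"
       try:
           return os.path.getsize(metadata_path) == 0
       except FileNotFoundError:
           # No metadata at all: the segment has to be rescanned.
           return True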


Check and repack on concat segments
-----------------------------------

During check
^^^^^^^^^^^^

- the segment must be a file
- the segment must exist [missing]
- an empty segment not known by the index must be considered deleted [deleted]
- all data known by the index for this segment must be present on disk [corrupted]
- no pair of (offset, size) data spans from the index can overlap [corrupted]
- data must start at the beginning of the segment [dirty]
- there must be no gaps between data in the segment [dirty]
- data must end at the end of the segment [dirty]
- find segments that can only contain data older than ``archive age`` days [archive_age]
- find segments that can only contain data older than ``delete age`` days [delete_age]
- the span of reference times in each segment must fit inside the interval
  implied by the segment file name (FIXME: should this be disabled for
  archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]
- the segment name must represent an interval matching the dataset step
  (FIXME: should this be disabled for archives, to deal with datasets that had
  a change of step in their lifetime?) [corrupted]
- data on disk must match the order of data used by queries [dirty]
- segments not known by the index are marked for reindexing instead of
  deletion when the index is missing, older than the segment file, or
  marked as needing checking [unaligned]
- format-specific consistency checks on the content of each file must pass [corrupted]
- the ``.metadata`` file must not be empty [unaligned]
- the ``.metadata`` file must not be older than the data [unaligned]
- the ``.summary`` file must not be older than the ``.metadata`` file [unaligned]
- the ``MANIFEST`` file must not be older than the ``.metadata`` file [unaligned]
- if the index has been deleted, accessing the dataset recreates it
  empty, and a check will rebuild it. Until it is rebuilt, segments
  not present in the index are not considered when querying the
  dataset
- metadata in the ``.metadata`` file must contain reference time elements [corrupted]
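
The checks on ``(offset, size)`` spans in the list above can be sketched as a
single pass over the spans sorted by offset. This is an illustrative
simplification, not arkimet's code: ``classify_spans`` is a hypothetical
helper, it covers only the overlap, gap and boundary checks, and it reports
``"ok"`` when none of them fails.

.. code-block:: python

   def classify_spans(spans, segment_size):
       """Classify a segment from the (offset, size) spans known to the index."""
       dirty = False
       pos = 0  # end of the data seen so far
       for offset, size in sorted(spans):
           if offset < pos:
               return "corrupted"  # two spans overlap
           if offset > pos:
               dirty = True  # hole before this datum, or data not at the start
           pos = offset + size
       if pos > segment_size:
           return "corrupted"  # indexed data is not all present on disk
       if pos < segment_size:
           dirty = True  # trailing bytes not covered by any datum
       return "dirty" if dirty else "ok"

A ``dirty`` result only means that the segment would benefit from a repack,
while ``corrupted`` means that the index and the file disagree.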

During ``--accurate`` check
^^^^^^^^^^^^^^^^^^^^^^^^^^^


During fix
^^^^^^^^^^

- [dirty] segments are not touched
- [unaligned] segments are imported in-place
- [missing] segments are removed from the index
- [corrupted] segments can only be fixed by manual intervention. They
  are reported and left untouched
- [archive_age] segments are not touched
- [delete_age] segments are not touched

During repack
^^^^^^^^^^^^^

- [dirty] segments are rewritten to be without holes and have data in the right order.
  In concat segments, this is done to guarantee linear disk access when
  data are queried in the default sorting order. In dir segments, this
  is done to avoid sequence numbers growing indefinitely for datasets
  with frequent appends and removes.
- [missing] segments are removed from the index
- [corrupted] segments are not touched
- [archive_age] segments are repacked if needed, then moved to ``.archive/last``
- [delete_age] segments are deleted
- [delete_age] [dirty] a segment that needs to be both repacked and
  deleted gets deleted without repacking
- [archive_age] [dirty] a segment that needs to be both repacked and
  archived gets repacked before archiving
- [unaligned] segments are not touched, to prevent deleting data that
  should be reindexed instead
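
The repack rules above combine when a segment is in more than one state. The
following sketch shows one way to read them. ``repack_actions`` and the action
names are hypothetical, and giving precedence to the states that are "not
touched" is an assumption that the list does not spell out.

.. code-block:: python

   def repack_actions(states):
       """Return the ordered actions repack takes for a set of segment states."""
       states = set(states)
       if "corrupted" in states or "unaligned" in states:
           return []  # never touched by repack (precedence is an assumption)
       if "missing" in states:
           return ["remove from index"]
       if "delete_age" in states:
           return ["delete"]  # no point repacking what is about to be deleted
       actions = []
       if "dirty" in states:
           actions.append("repack")
       if "archive_age" in states:
           actions.append("archive")  # always after any repack
       return actions

For example, a segment that is both ``dirty`` and ``archive_age`` yields
``["repack", "archive"]``, while one that is both ``dirty`` and ``delete_age``
yields only ``["delete"]``.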


Check and repack on dir segments
--------------------------------

During check
^^^^^^^^^^^^

- the segment must be a directory [unaligned]
- the size of each data file must match the data size exactly [corrupted]
- the modification time of a directory segment can vary unpredictably,
  so it is ignored. The modification time of the sequence file is used
  instead.
- if arkimet is interrupted during the rollback of an append operation on a
  dir dataset, some files may be left with a sequence number in their name
  that is higher than the one in the sequence file. This breaks further
  appends, so it needs to be detected and fixed. [unaligned]
- the segment must exist [missing]
- an empty segment not known by the index must be considered deleted [deleted]
- all data known by the index for this segment must be present on disk [corrupted]
- no pair of (offset, size) data spans from the index can overlap [corrupted]
- data must start at the beginning of the segment [dirty]
- there must be no gaps between data in the segment [dirty]
- data must end at the end of the segment [dirty]
- find segments that can only contain data older than ``archive age`` days [archive_age]
- find segments that can only contain data older than ``delete age`` days [delete_age]
- the span of reference times in each segment must fit inside the interval
  implied by the segment file name (FIXME: should this be disabled for
  archives, to deal with datasets that had a change of step in their lifetime?) [corrupted]
- the segment name must represent an interval matching the dataset step
  (FIXME: should this be disabled for archives, to deal with datasets that had
  a change of step in their lifetime?) [corrupted]
- data on disk must match the order of data used by queries [dirty]
- segments not known by the index are marked for reindexing instead of
  deletion when the index is missing, older than the segment file, or
  marked as needing checking [unaligned]
- format-specific consistency checks on the content of each file must pass [corrupted]
- the ``.metadata`` file must not be empty [unaligned]
- the ``.metadata`` file must not be older than the data [unaligned]
- the ``.summary`` file must not be older than the ``.metadata`` file [unaligned]
- the ``MANIFEST`` file must not be older than the ``.metadata`` file [unaligned]
- if the index has been deleted, accessing the dataset recreates it
  empty, and a check will rebuild it. Until it is rebuilt, segments
  not present in the index are not considered when querying the
  dataset
- metadata in the ``.metadata`` file must contain reference time elements [corrupted]
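
The timestamp checks in the list above compare modification times pairwise. A
minimal sketch follows, with the times passed in as plain numbers so that, for
a dir segment, the caller can supply the modification time of the sequence file
as the data time. ``stale_files`` is a hypothetical helper.

.. code-block:: python

   def stale_files(data_mtime, metadata_mtime, summary_mtime, manifest_mtime):
       """List the freshness rules violated; any violation means [unaligned]."""
       problems = []
       if metadata_mtime < data_mtime:
           problems.append(".metadata older than data")
       if summary_mtime < metadata_mtime:
           problems.append(".summary older than .metadata")
       if manifest_mtime < metadata_mtime:
           problems.append("MANIFEST older than .metadata")
       return problems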

During ``--accurate`` check
^^^^^^^^^^^^^^^^^^^^^^^^^^^


During fix
^^^^^^^^^^

- [unaligned] a low sequence file value is fixed by setting it to the highest
  sequence number found.
- [unaligned] a low sequence file value is fixed in the same way when one of
  the data files is truncated or partly written.
- [dirty] segments are not touched
- [unaligned] segments are imported in-place
- [missing] segments are removed from the index
- [corrupted] segments can only be fixed by manual intervention. They
  are reported and left untouched
- [archive_age] segments are not touched
- [delete_age] segments are not touched
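
The sequence file fix above can be sketched as follows. The sketch assumes that
data files in a dir segment are named with a decimal sequence number followed
by a format extension (for example ``000002.grib``); that naming, and the
helper names, are assumptions made for illustration, and reading or writing the
sequence file itself is left out.

.. code-block:: python

   import re


   def highest_sequence(file_names):
       """Return the highest sequence number among data file names, or None."""
       numbers = [
           int(match.group(1))
           for name in file_names
           if (match := re.fullmatch(r"(\d+)\.\w+", name))
       ]
       return max(numbers, default=None)


   def fixed_sequence_value(current, file_names):
       """Return the value the sequence file should hold after a fix."""
       highest = highest_sequence(file_names)
       if highest is None or highest <= current:
           return current  # the sequence file is already high enough
       return highest  # raise it to the highest number found on disk

Raising the stored value means that later appends cannot reuse a sequence
number that already exists on disk.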

During repack
^^^^^^^^^^^^^

- [dirty] segments are rewritten to be without holes and have data in the right order.
  In concat segments, this is done to guarantee linear disk access when
  data are queried in the default sorting order. In dir segments, this
  is done to avoid sequence numbers growing indefinitely for datasets
  with frequent appends and removes.
- [missing] segments are removed from the index
- [corrupted] segments are not touched
- [archive_age] segments are repacked if needed, then moved to ``.archive/last``
- [delete_age] segments are deleted
- [delete_age] [dirty] a segment that needs to be both repacked and
  deleted gets deleted without repacking
- [archive_age] [dirty] a segment that needs to be both repacked and
  archived gets repacked before archiving
- [unaligned] segments are not touched, to prevent deleting data that
  should be reindexed instead


.. toctree::
   :maxdepth: 2
   :caption: Contents: