.. _datasets:

Datasets
========

Arkimet supports different dataset formats, which offer different performance
and indexing features to match various ways in which data is stored and
queried.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   datasets/iseg
   datasets/simple
   datasets/ondisk2
   datasets/archive
   datasets/error
   datasets/duplicates
   datasets/remote
   datasets/outbound
   datasets/discard


.. _dataset_config:

Dataset configuration
---------------------

Datasets are configured with simple ``key = value`` configuration options.

These are all the supported options:

* ``archive age``: data older than this number of days will be moved to the
  dataset archive during maintenance.
* ``delete age``: data older than this number of days will be deleted during
  maintenance.
* ``eatmydata``: disable `fsync`/`fdatasync` operations while writing data to
  dataset, and disable sqlite' journaling and other data integrity features.
  This makes acquiring data very fast, but an interrupted import or a
  concurrent import may cause data corruption.
* ``format``: format of data in the dataset (one of ``grib``, ``bufr``, ``odimh5``, ``vm2``)
* ``index``: comma-separated list of names of metadata to index for faster queries
* ``path``: path to the dataset, or URL for ``remote`` datasets.
* ``replace``: when ``yes``, importing duplicate data will replace the existing
  version . When ``no``, importing duplicate data will be rejected. When
  ``usn``, importing duplicate BUFR data will replace the existing version only
  if the BUFR Update Sequence Number is greater than the one currently in the
  dataset. A replace leaves the old data in the segment and appends the new
  data at the end, updating the index to refer to the new data. As with deleted
  data, disk space is only reclaimed when running ``arki-check --repack``
* ``restrict``: comma-separated list of names that have access to the dataset.
  This allows filtering with the ``--restrict`` option on command line.
* ``smallfiles``: ``yes`` or ``no``. When ``yes``, the file contents are also
  saved in the index, to speed up extraction of data with tiny payloads like
  ``vm2``.
* ``step``: segmentation step for the dataset (one of ``daily``, ``weekly``,
  ``biweekly``, ``monthly``, and ``yearly``).
* ``type``: dataset type (one of ``iseg``, ``simple``, ``error``,
  ``duplicates``, ``remote``, ``outbound``, ``discard``, ``file``).
* ``unique``: comma-separated list of names of metadata that, taken together,
  make it unique in the dataset