
dataset

The documentation is organized around the command line options and as a series of “how to” style examples.

Command line program documentation

Internal project concepts

dataset Operations

The basic operations supported by dataset are listed below, organized by collection level and JSON document level.

A word about keys

dataset is based around the concept of key/value pairs, where the key is the unique identifier for an object (i.e. the value) stored in the collection. Each storage option supported by dataset has its own constraints on what keys can be called. Keys should be lower case alphanumeric or underscore only. For example, the pairtree storage relies on the file system to store the JSON objects. Some file systems are not case sensitive; others face challenges with non-alphanumeric filenames.
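The key rule above can be checked before an object is stored. The following is a minimal sketch in POSIX shell, not part of dataset itself: the function name `is_valid_key` and the sample keys are hypothetical, and the check only covers the stated convention (lower case letters, digits, underscore), not any engine-specific limits such as key length.

```shell
#!/bin/sh
# Hypothetical helper (not a dataset command): succeeds only when the key
# is non-empty and made up of lower case letters, digits, or underscores.
# The characters are listed explicitly so the check does not depend on locale.
is_valid_key() {
  case "$1" in
    '') return 1 ;;                                            # empty key
    *[!abcdefghijklmnopqrstuvwxyz0123456789_]*) return 1 ;;    # disallowed character
    *) return 0 ;;
  esac
}

# Example use: screen candidate keys before handing them to dataset.
for key in record_001 Record-001; do
  if is_valid_key "$key"; then
    echo "ok: $key"
  else
    echo "rejected: $key"
  fi
done
```

Screening keys this way keeps a collection portable across storage engines, since a key that passes is safe on case-insensitive file systems as well.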

Collection Level

JSON Document level

JSON Document Attachments

datasetd as a web service

New as of version v2 is a web service providing access to dataset collections. This is described in the datasetd documentation page.

datasetd supports the following end points.

Storage engines

In v2, dataset is starting to support storing your JSON documents in a SQL database. Currently three SQL databases can be used to store the JSON documents: SQLite 3 (default engine, used in dataset’s test suites), MySQL 8 (minimally tested), and Postgres >= 12 (well tested). See storage engines for more details.

Compatibility

Migrating dataset collections between major versions, or simply between different collections, can be done using the “dump” and “load” features. This replaces the old process in early v2 that required you to run a “repair” operation to convert a collection to the current version of dataset.

Example: migrating the v2 collection “data_v2.ds” to a v3 collection named “data_v3.ds”.

dataset3 init data_v3.ds
dataset dump data_v2.ds | dataset3 load data_v3.ds