The Dataset Project provides tools that make it easy to work with collections of JSON documents. It uses a simple key and object pair to organize JSON documents into a collection. It supports SQL querying of the objects stored in a collection.
It is suitable for temporary storage of JSON objects in data processing pipelines as well as a persistent storage mechanism for collections of JSON objects.
The Dataset Project provides a command line program and a web service for working with JSON objects as a collection or individual objects. As such it is well suited for data science projects as well as building web applications that work with metadata.
dataset is a command line tool for working with collections of JSON documents. Collections can be stored on the file system in a pairtree or stored in a SQL database that supports JSON columns like SQLite3, PostgreSQL or MySQL.
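To make the key and object pairing concrete, here is a minimal sketch of keeping JSON documents in a single SQLite3 column and querying them with SQL, using only Python's standard library. The table name (objects), the column names (_key, src) and the helper functions are assumptions made for illustration; they are not taken from dataset's actual schema or API. The sketch also assumes the local SQLite build provides the json_extract function.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite database holding one hypothetical key/object table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "_key TEXT PRIMARY KEY, "  # the document key
        "src TEXT NOT NULL)"       # the JSON document, stored as text
    )
    return conn


def put(conn, key, obj):
    """Create or replace the JSON document stored under key."""
    conn.execute(
        "INSERT OR REPLACE INTO objects (_key, src) VALUES (?, ?)",
        (key, json.dumps(obj)),
    )


def get(conn, key):
    """Return the decoded JSON document for key, or None if it is absent."""
    row = conn.execute(
        "SELECT src FROM objects WHERE _key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def keys_by_field(conn, json_path, value):
    """Return the sorted keys whose document has value at json_path."""
    rows = conn.execute(
        "SELECT _key FROM objects "
        "WHERE json_extract(src, ?) = ? ORDER BY _key",
        (json_path, value),
    )
    return [r[0] for r in rows]
```

For example, after put(conn, "wilson1930", {"title": "A title", "year": 1930}), calling keys_by_field(conn, "$.year", 1930) selects that key using an ordinary SQL WHERE clause.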
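The dataset command line tool supports common data management operations such as initializing a collection, dumping and loading JSON lines documents, and creating, reading, updating and deleting objects.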
See Getting started with dataset for a tour and tutorial.
datasetd is a JSON REST web service and static file host. It provides a JSON API supporting the main operations found in the dataset command line program. This allows dataset collections to be integrated safely into web applications or be used concurrently by multiple processes.
The Dataset Web Service can host multiple collections, each with its own custom query API defined in a simple YAML configuration file.
dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way. The dataset web service allows multi-process or multi-user access to a dataset collection via HTTP.
dataset is guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called ‘mycollection.ds’).
- collection names typically end in a .ds extension for easy identification
- dataset collection storage options
  - SQL store stores JSON documents in a JSON column
    - SQLite3 (default), PostgreSQL >= 12 and MySQL 8 are the currently supported SQL databases
    - A “DSN URI” is used to identify and gain access to the SQL database
    - The DSN URI may be passed through the environment
  - pairtree (deprecated, will be removed in v3)
    - the pairtree path is always lowercase
    - non-JSON attachments can be associated with a JSON document and are found in directories organized by semver (semantic version number)
    - versioned JSON documents are created alongside the current JSON document but are named using both their key and semver
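The pairtree idea can be sketched as follows: a key is lowercased and split into two-character segments, and each segment becomes one directory level. This is a simplified illustration of the general pairtree layout, not dataset's exact implementation. The function name, the default root directory name, and the omission of the character escaping that a full pairtree encoder applies are all assumptions.

```python
def pairtree_path(key, root="pairtree"):
    """Map a key to a lowercase, two-characters-per-level directory path.

    Simplified sketch: a real pairtree encoder also escapes characters
    that are unsafe in file names, which is not handled here.
    """
    key = key.lower()
    # Split the key into consecutive two-character segments; the last
    # segment holds a single character when the key length is odd.
    segments = [key[i:i + 2] for i in range(0, len(key), 2)]
    return "/".join([root] + segments)
```

With this sketch, the key Wilson1930 maps to pairtree/wi/ls/on/19/30, which shows why keys that differ only by letter case would end up in the same directory.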
- datasetd is a web service
  - it is intended as a back end web service run on localhost
  - it runs on localhost and a designated port (port 8485 is the default)
  - it supports multiple collections, each of which can have its own configuration for global object permissions and supported SQL queries
The choice of plain UTF-8 is intended to help future-proof the reading of dataset collections. Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language that supports file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.
dataset supports

- Collection level
  - Initialize a new dataset collection
  - Codemeta file support for describing the collection contents
  - Dump a collection to a JSON lines document
  - Load a collection from a JSON lines document
  - Listing Keys in a collection
- Object level actions
  - create
  - read
  - update
  - delete
  - keys
  - has-key
- Documents as attachments
  - attachments (list)
  - attach (create/update)
  - retrieve (read)
  - prune (delete)
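Dumping to and loading from a JSON lines document can be sketched in a few lines of Python. The sketch models a collection as a plain dictionary mapping keys to objects and writes one JSON object per line. The per-line layout, with "key" and "object" fields, and the function names are assumptions for illustration and may not match the format the dataset tool actually emits.

```python
import io
import json


def dump_jsonl(collection, out):
    """Write each key/object pair as one JSON object per line."""
    for key in sorted(collection):
        record = {"key": key, "object": collection[key]}
        out.write(json.dumps(record) + "\n")


def load_jsonl(src):
    """Rebuild a key -> object dictionary from JSON lines text."""
    collection = {}
    for line in src:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        collection[record["key"]] = record["object"]
    return collection


if __name__ == "__main__":
    original = {"b": {"n": 2}, "a": {"n": 1}}
    buffer = io.StringIO()
    dump_jsonl(original, buffer)
    buffer.seek(0)
    print(load_jsonl(buffer) == original)
```

Because every line is a complete JSON object, a dump in this style can be streamed and processed one document at a time, which suits the data processing pipelines mentioned earlier.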
datasetd supports
Both dataset and datasetd may be useful for general data science applications needing JSON object management or in implementing repository systems in research libraries and archives.
dataset has many limitations. Some are listed below.
datasetd is a simple web service intended to run on “localhost:8485”.
Compiled versions are provided for Linux (x86, aarch64), Mac OS X (x86 and M1), Windows 11 (x86, aarch64) and Raspberry Pi OS (ARM7).
github.com/caltechlibrary/dataset/releases
You can use dataset from Python via the py_dataset package.
You can use dataset from Deno+TypeScript by running datasetd and accessing it with ts_dataset.