dataset

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

dataset is a command line tool, a Go package, and an experimental C shared library for working with JSON objects as collections. Collections are stored on your local disk. JSON objects are stored in collections as plain UTF-8 text files, so they can be accessed with common Unix text processing tools as well as most programming languages. dataset is also available as a Python package; see py_dataset.

The dataset command line tool supports common data management operations such as initializing collections; creating, reading, updating and deleting documents; and listing the keys of JSON objects in a collection.
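To make the operations above concrete, here is a minimal sketch in Python using only the standard library. It does not use the dataset tool or the py_dataset API. `ToyCollection` and its method names are hypothetical, and the one-file-per-key layout is an assumption made for illustration, not a description of how dataset arranges a collection on disk.

```python
import json
import tempfile
from pathlib import Path


class ToyCollection:
    """Hypothetical stand-in for a collection: one UTF-8 JSON file per key.

    This is NOT the dataset on-disk layout or API, only an illustration.
    """

    def __init__(self, root):
        # "Initialize" the collection by creating its directory.
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.json"

    def create(self, key, obj):
        path = self._path(key)
        if path.exists():
            raise KeyError(f"{key} already exists")
        path.write_text(json.dumps(obj), encoding="utf-8")

    def read(self, key):
        return json.loads(self._path(key).read_text(encoding="utf-8"))

    def update(self, key, obj):
        path = self._path(key)
        if not path.exists():
            raise KeyError(f"{key} not found")
        path.write_text(json.dumps(obj), encoding="utf-8")

    def delete(self, key):
        self._path(key).unlink()

    def keys(self):
        # Keys are the file names without the ".json" suffix.
        return sorted(p.stem for p in self.root.glob("*.json"))


def demo():
    # Work in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        c = ToyCollection(Path(tmp) / "mycollection.ds")
        c.create("freda", {"name": "Freda", "count": 1})
        c.create("mojo", {"name": "Mojo", "count": 2})
        c.update("freda", {"name": "Freda", "count": 3})
        c.delete("mojo")
        return c.keys(), c.read("freda")


if __name__ == "__main__":
    print(demo())
```

Because each document in this sketch is an ordinary text file, it can also be inspected with tools such as `cat` or `grep`, which is the property the plain UTF-8 design aims for.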

dataset’s enhanced features include

See Getting started with dataset for a tour and tutorial. It includes both command line examples and Python 3 examples using py_dataset.

Design choices

dataset isn’t a database or a replacement for a repository system. It is a tool to manage JSON documents in a predictable and structured way. dataset is guided by the idea that you should be able to work with JSON documents as easily as you can with any plain text document on Unix. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called ‘mycollection.ds’). It is built around the following abstractions

The choice of plain UTF-8 is intended to help future-proof reading dataset collections. Care has been taken to keep dataset simple and lightweight enough to run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language that supports file input and output, common string operations, and JSON encoding and decoding.

Example Workflow

A typical processing pattern is to write a “harvester” which stores its results in a dataset collection. This is often followed by another program that transforms or aggregates the harvested material before rendering a prepared output, e.g. web pages or data files. At Caltech Library the harvesters are typically written in Python or Bash and store their results in a dataset collection. Depending on performance needs, the transform and aggregate stages are written in either Python or Go, and our final rendering stages are typically written in Python or as simple Bash scripts.
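The harvest, aggregate and render stages described above can be sketched as three small Python functions. This is an illustration under stated assumptions, not Caltech Library's code: it uses only the standard library rather than dataset or py_dataset, a plain directory of JSON files stands in for the collection, and the function names and sample records are made up.

```python
import json
import tempfile
from pathlib import Path


def harvest(collection_dir, records):
    """Harvester stage: store each harvested record as a JSON document."""
    collection_dir.mkdir(parents=True, exist_ok=True)
    for record in records:
        path = collection_dir / f"{record['id']}.json"
        path.write_text(json.dumps(record), encoding="utf-8")


def aggregate(collection_dir):
    """Transform/aggregate stage: count stored records per year."""
    counts = {}
    for path in sorted(collection_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        counts[record["year"]] = counts.get(record["year"], 0) + 1
    return counts


def render(counts):
    """Render stage: produce a small plain text report."""
    return "\n".join(f"{year}: {n}" for year, n in sorted(counts.items()))


def run_pipeline():
    # Made-up sample data standing in for a real harvest source.
    records = [
        {"id": "a1", "title": "First", "year": 2020},
        {"id": "a2", "title": "Second", "year": 2021},
        {"id": "a3", "title": "Third", "year": 2021},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        collection_dir = Path(tmp) / "harvest.ds"
        harvest(collection_dir, records)
        return render(aggregate(collection_dir))


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping the stages separate means each one can be rewritten on its own, for example moving only the aggregate step to Go when performance matters, since the stages communicate through the stored JSON documents.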

Features

dataset supports

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.8 using the py_dataset Python package. dataset is useful for general data science applications that need intermediate JSON object management but not a full-blown database.

Limitations of dataset

dataset has many limitations; some are listed below

Explore dataset through A Shell Example, Getting Started with Dataset, How To guides, topics and Documentation.

Releases

Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 10 (x86) and Raspberry Pi OS (ARM7). See https://github.com/caltechlibrary/dataset/releases.

You can use dataset from Python via the py_dataset package.