
Project Status: Active – The project has reached a stable, usable state and is being actively developed.


dataset is a command line tool, Go package, and an experimental C shared library for working with JSON objects as collections. Collections can be stored on local disk. JSON objects are stored in collections as plain UTF-8 text. This means the objects can be accessed with common Unix text processing tools as well as most programming languages. dataset is also available as a Python package; see py_dataset.

The dataset command line tool supports common data management operations such as initializing collections and creating, reading, updating and deleting JSON objects in a collection. Its enhanced features include generating data frames and importing, exporting and synchronizing JSON objects to and from CSV files.
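To make the CSV export idea concrete, here is a minimal sketch in plain Python. It does not use dataset or py_dataset; the function name `objects_to_csv`, the `_key` column and the sample records are all assumptions made up for illustration, and dataset's own export may behave differently.

```python
import csv
import io
import json


def objects_to_csv(objects, columns):
    """Render a mapping of key -> JSON object as CSV text.

    `columns` lists the object fields to export; a field missing
    from an object becomes an empty cell.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The "_key" header is a hypothetical name for the object key column.
    writer.writerow(["_key"] + columns)
    for key in sorted(objects):
        obj = objects[key]
        writer.writerow([key] + [obj.get(col, "") for col in columns])
    return buffer.getvalue()


# Hypothetical records, parsed from plain UTF-8 JSON text.
records = {
    "jane": json.loads('{"name": "Jane Doe", "year": 2020}'),
    "ali": json.loads('{"name": "Ali Khan"}'),
}
csv_text = objects_to_csv(records, ["name", "year"])
```

Each object becomes one row, keyed by its collection key, which is the general shape a CSV round trip needs in order to match rows back to objects.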

dataset is written in the Go programming language. It can be used as a Go package by other Go-based software. Go supports generating C shared libraries, so by compiling the Go source you can create a libdataset C shared library. The C shared library is currently being used by the Digital Library Development Group at Caltech Library from Python 3.8 (see py_dataset). This approach looks promising if you need support from other programming languages (e.g. Julia can call shared libraries easily with a ccall function).
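The mechanism by which Python calls into a C shared library can be sketched with the standard `ctypes` module. This section does not list the functions libdataset exports, so the sketch below loads the system C library and calls its `strlen` as a stand-in; the helper names `load_c_library` and `c_strlen` are hypothetical, and the example assumes a POSIX system such as Linux or macOS.

```python
import ctypes
import ctypes.util


def load_c_library(name="c"):
    """Locate and load a C shared library by its short name."""
    path = ctypes.util.find_library(name)
    # If no path is found, CDLL(None) falls back to the symbols already
    # loaded into the current process (POSIX only).
    return ctypes.CDLL(path)


libc = load_c_library()

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the byte length of `text` as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))
```

A wrapper for a Go-built shared library would follow the same pattern: load the library, declare argument and return types, then wrap each exported function in a small Python function.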

A tour and tutorial are available. Included are both command line examples and examples in Python using py_dataset.

Design choices

dataset isn’t a database or a replacement for repository systems. It is guided by the idea that you should be able to work with text files, the JSON object documents, using standard Unix text utilities. It is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called ‘mycollection.ds’). It is built around a few abstractions: dataset stores JSON objects in collections, and a collection is one or more folders containing a pairtree of JSON object documents and any attachments, plus a collections.json file describing the mapping of keys to folder locations. dataset takes minimal system resources and keeps all content, except JSON object attachments, in plain UTF-8 text.
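The pairtree idea can be illustrated with a short Python sketch: a key is split into two-character segments that become nested folder names. This is a simplified assumption for illustration only. It skips the character escaping a full pairtree implementation performs, and the `pairtree` subfolder name, the `.json` file naming and the function names are hypothetical rather than dataset's actual on-disk layout.

```python
from pathlib import PurePosixPath


def pairtree_path(key):
    """Split a key into two-character folder segments."""
    segments = [key[i:i + 2] for i in range(0, len(key), 2)]
    return PurePosixPath(*segments)


def document_path(collection, key):
    """Build a hypothetical path to the JSON document for `key`."""
    return (PurePosixPath(collection) / "pairtree"
            / pairtree_path(key) / (key + ".json"))


def key_map(keys):
    """Map each key to its folder location, as a collection-level
    index file might record it."""
    return {key: str(pairtree_path(key)) for key in keys}


example = document_path("mycollection.ds", "abcde")
```

Because the folder location is derived from the key, any tool that can split a string and read a text file can locate an object without a database.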

The choice of plain UTF-8 is intended to help future-proof reading dataset collections.
Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi while being equally comfortable in a more resource-rich server or desktop environment. It should be easy to do alternative implementations in any language having a good string library, JSON support and memory management.


A typical library processing pattern is to write a “harvester” which stores its results in a dataset collection, then write something that transforms or aggregates the harvested objects, and finally write a rendering program to prepare the data for the web. The harvesters are typically written in Python or as simple Bash scripts storing the results in a dataset collection. Depending on performance needs, the transform and aggregate stages are written in either Python or Go, and our final rendering stages are typically written in Python or as simple Bash scripts.
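The harvest, transform and render stages can be sketched in plain Python. In this sketch a temporary folder of UTF-8 JSON files stands in for a dataset collection; it does not call dataset or py_dataset, and the function names, the `year` field and the sample records are all assumptions made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def harvest(collection_dir, records):
    """Harvest stage: store each record as a UTF-8 JSON document."""
    for key, obj in records.items():
        path = collection_dir / f"{key}.json"
        path.write_text(json.dumps(obj), encoding="utf-8")


def aggregate(collection_dir):
    """Transform stage: count harvested objects per year."""
    counts = {}
    for path in sorted(collection_dir.glob("*.json")):
        obj = json.loads(path.read_text(encoding="utf-8"))
        counts[obj["year"]] = counts.get(obj["year"], 0) + 1
    return counts


def render(counts):
    """Render stage: produce a small HTML list for the web."""
    items = "".join(
        f"<li>{year}: {n}</li>" for year, n in sorted(counts.items())
    )
    return f"<ul>{items}</ul>"


def run_pipeline(records):
    """Run all three stages against a temporary stand-in collection."""
    with tempfile.TemporaryDirectory() as tmp:
        collection_dir = Path(tmp)
        harvest(collection_dir, records)
        return render(aggregate(collection_dir))


# Hypothetical harvested records.
sample = {
    "a1": {"title": "Thesis A", "year": 2019},
    "b2": {"title": "Thesis B", "year": 2020},
    "c3": {"title": "Thesis C", "year": 2019},
}
html = run_pipeline(sample)
```

Keeping the stages separate means each one can be rewritten independently, for example moving only the aggregate stage to Go when performance requires it.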


dataset supports

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.8 using the py_dataset Python package. dataset is useful for general data science applications which need intermediate JSON object management but not a full-blown database.

Limitations of dataset

dataset has many limitations, some are listed below

Explore dataset through A Shell Example, Getting Started with Dataset, How To guides, topics and Documentation.


Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 10 (x86) and Raspbian (ARM7). See

You can use dataset from Python via the py_dataset package.