dataset is a command line tool, Go package, and an experimental C shared library for working with JSON objects as collections. Collections can be stored on disk, on AWS S3, or on Google Cloud Storage. JSON objects are stored in collections as plain UTF-8 text, so they can be accessed with common Unix text processing tools as well as with most programming languages that support text processing.

The dataset command line tool supports common data management operations such as initializing collections and creating, reading, updating and deleting JSON objects in a collection. Its enhanced features include generating data grids and frames, and importing and exporting JSON objects to and from CSV files and Google Sheets.

dataset is written in the Go programming language. It can be used as a Go package by other Go-based software. Go supports generating C shared libraries, so by compiling the Go source you can create a libdataset C shared library. The C shared library is currently being used experimentally from Python by the DLD Group at Caltech Library. This approach looks promising for supporting other languages (e.g. Julia can easily use dataset via its ccall function, while R, Octave and NodeJS would probably need some C++ wrapping code).

See getting-started-with-datataset.md for a tour and tutorial.

Design choices

dataset isn’t a database or repository system. It is guided by the idea that you should be able to work with text files (e.g. the JSON object documents) with standard Unix text utilities. It is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds would create a new collection called ‘mycollection.ds’). It is built around a few abstractions: dataset stores JSON objects in collections, a collection is one or more folders containing JSON object documents and any attachments, and a collections.json file describes the mapping of keys to folder locations. It takes minimal system resources and keeps all content, except JSON object attachments, in plain UTF-8 text.
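
Because documents are plain text files, a stored object can be handed directly to a tool such as grep. The following is a minimal sketch rather than part of dataset itself: grep_doc is a hypothetical helper name, and the sketch assumes the collection lives on local disk and that the path command (used in the example later in this README) prints a single file path on standard output.

```shell
# Hypothetical helper: print the lines of one stored JSON document that match
# a pattern by running grep on the document file itself.
# Assumption: `dataset COLLECTION path KEY` prints one file path on stdout.
grep_doc() {
    collection="$1"
    key="$2"
    pattern="$3"
    # Ask dataset where the document lives; stop if the lookup fails.
    doc_path="$(dataset "$collection" path "$key")" || return 1
    # Treat the document as ordinary text.
    grep -- "$pattern" "$doc_path"
}
```

For instance, grep_doc friends.ds frieda email would print the matching lines of the document stored under the key frieda, assuming that key exists in the collection.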

A typical library processing pattern is to write a “harvester” which then stores its results in a dataset collection. The harvesters we use are written as simple shell scripts, Python programs or Go programs. Once you have your JSON objects in a dataset collection it is easy to iterate over them and augment them further.
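
The harvest-then-iterate pattern described above might be sketched in shell as follows. This is an illustration under stated assumptions, not one of the harvesters used at Caltech Library: the function names harvest_tsv and dump_collection and the tab-separated "key, then JSON" input format are hypothetical, and the sketch assumes that the keys command prints one key per line. Only the create, keys and read commands shown in the example later in this README are used.

```shell
# Hypothetical harvester: read "KEY<TAB>JSON" lines from standard input and
# store each JSON object in an existing collection (made with `dataset init`).
harvest_tsv() {
    collection="$1"
    # With IFS set to a tab, the first field is the key and the rest is the JSON.
    # The `|| [ -n "$key" ]` clause also handles a final line with no newline.
    while IFS="$(printf '\t')" read -r key json || [ -n "$key" ]; do
        # Skip blank lines.
        [ -n "$key" ] || continue
        # Redirect stdin so the command cannot consume the harvested input.
        dataset "$collection" create "$key" "$json" </dev/null || return 1
    done
}

# Hypothetical iteration helper: print every JSON document in a collection.
# Assumption: `dataset COLLECTION keys` prints one key per line.
dump_collection() {
    collection="$1"
    dataset "$collection" keys | while IFS= read -r key; do
        dataset "$collection" read "$key" </dev/null
    done
}
```

A harvester built this way could be run as some-source-command | harvest_tsv friends.ds, where some-source-command stands for whatever produces the tab-separated lines. The output of dump_collection friends.ds can then be piped into further text processing.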

Care has been taken to keep dataset simple and lightweight enough that it will run on a machine as small as a Raspberry Pi while being equally comfortable on a more resource-rich server or desktop environment.

Features

dataset supports

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.6 using a Python package. dataset is useful for general data science applications that need intermediate JSON object management but not a full-blown database.

Limitations of dataset

dataset has many limitations, some are listed below

Example

Below is a simple example of shell-based interaction with a dataset collection using the dataset command line tool.

    # Create a collection "friends.ds"; the ".ds" lets the bin/dataset command know that's the collection to use.
    dataset init friends.ds
    # If successful then you should see an OK otherwise an error message

    # Create a JSON document 
    dataset friends.ds create frieda '{"name":"frieda","email":"frieda@inverness.example.org"}'
    # If successful then you should see an OK otherwise an error message

    # Read a JSON document
    dataset friends.ds read frieda
    
    # Path to JSON document
    dataset friends.ds path frieda

    # Update a JSON document
    dataset friends.ds update frieda '{"name":"frieda","email":"frieda@zbs.example.org", "count": 2}'
    # If successful then you should see an OK or an error message

    # List the keys in the collection
    dataset friends.ds keys

    # Get keys filtered for the name "frieda"
    dataset friends.ds keys '(eq .name "frieda")'

    # Join frieda-profile.json with "frieda" adding unique key/value pairs
    dataset friends.ds join append frieda frieda-profile.json

    # Join frieda-profile.json with "frieda", overwriting key/value pairs they have in common
    # and adding unique key/value pairs from frieda-profile.json
    dataset friends.ds join overwrite frieda frieda-profile.json

    # Delete a JSON document
    dataset friends.ds delete frieda

    # Import data from a CSV file using column 1 as key
    dataset -quiet -nl=false friends.ds import-csv my-data.csv 1

    # To remove the collection just use the Unix shell command
    rm -fR friends.ds

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.