dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections. Collections can be stored on disk, in S3 or in Google Cloud Storage. JSON objects are stored in collections as plain UTF-8 text. This means the objects can be accessed with common Unix text processing tools as well as most programming languages with text processing support.
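Because the objects are plain UTF-8 text, they can be read without dataset itself. The following Python sketch illustrates the idea for a collection stored on local disk. It assumes each object is a `.json` file somewhere under the collection folder and that the metadata file is named `collections.json`; the helper name `iter_json_objects` and the demo folder layout are illustrative assumptions, not dataset's documented layout.

```python
import json
import tempfile
from pathlib import Path


def iter_json_objects(collection_dir, skip_names=("collections.json",)):
    """Yield (path, object) for every JSON file under collection_dir.

    skip_names lists metadata files to ignore; the default is an assumption.
    """
    for path in sorted(Path(collection_dir).rglob("*.json")):
        if path.name in skip_names:
            continue
        with path.open(encoding="utf-8") as fp:
            yield path, json.load(fp)


def demo():
    # Build a throwaway folder that mimics the assumed layout.
    with tempfile.TemporaryDirectory() as tmp:
        collection = Path(tmp) / "friends.ds"
        bucket = collection / "aa"  # hypothetical bucket name
        bucket.mkdir(parents=True)
        (collection / "collections.json").write_text("{}", encoding="utf-8")
        (bucket / "frieda.json").write_text(
            json.dumps({"name": "frieda", "email": "frieda@inverness.example.org"}),
            encoding="utf-8",
        )
        # Only the stored object is returned; the metadata file is skipped.
        return [obj["name"] for _, obj in iter_json_objects(collection)]
```

Calling `demo()` walks the temporary collection and collects the `name` field of each stored object.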

The dataset command line tool supports common data management operations such as initializing collections and creating, reading, updating and deleting JSON objects in a collection. Its enhanced features include the ability to generate data grids and frames and the ability to import and export JSON objects to and from CSV files and Google Sheets. It even includes an experimental search feature that integrates the Bleve (blevesearch) indexing and search engine library, originally developed at Couchbase.

In addition to the command line tool, dataset includes a C shared library called libdataset, which is used for integration in a Python module of the same name. dataset itself is written as a Go package, which can also be used in other Go-based projects. libdataset could be used as a basis for integration with other languages that support a C API (e.g. Julia).

See getting-started-with-dataset.md for a tour and tutorial.

Origin story

The inspiration for creating dataset was the desire to process metadata as JSON object collections using simple Unix shell utilities and data pipelines. The core use case evolved at Caltech Library while working with the APIs of various repository systems (e.g. EPrints and Invenio). It has allowed the library to build an aggregated view of heterogeneous content (see https://feeds.library.caltech.edu) as well as facilitate ad-hoc analysis and data enhancement for a number of internal library projects.

Design choices

dataset isn’t a database or repository system. It is intended to be simple and easy to use with minimal setup (e.g. dataset init mycollection.ds would create a new collection called ‘mycollection.ds’). It is built around a few abstractions: dataset stores JSON objects in collections; a collection is a folder containing a JSON file called collections.json and one or more buckets; buckets contain the JSON objects and any attachments; and the collections.json file describes the mapping of keys to buckets. It takes minimal system resources and keeps all content, except JSON object attachments, in plain UTF-8 text (attachments are kept in tar files).
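The key-to-bucket mapping described above can be pictured with a short Python sketch. The JSON shape used here (`{"keymap": {key: bucket}}`), the bucket name `aa` and the helper `path_for_key` are hypothetical and only mirror the description; the real file format may differ.

```python
import json
import tempfile
from pathlib import Path


def path_for_key(collection_dir, key):
    """Return the path of the JSON document for key, or None if unknown.

    Assumes collections.json holds {"keymap": {key: bucket_name}}; this
    shape is hypothetical and only mirrors the prose description.
    """
    collection_dir = Path(collection_dir)
    with (collection_dir / "collections.json").open(encoding="utf-8") as fp:
        keymap = json.load(fp).get("keymap", {})
    bucket = keymap.get(key)
    if bucket is None:
        return None
    return collection_dir / bucket / f"{key}.json"


def demo():
    # Build a throwaway collection folder with one bucket and one object.
    with tempfile.TemporaryDirectory() as tmp:
        c = Path(tmp) / "mycollection.ds"
        (c / "aa").mkdir(parents=True)
        (c / "collections.json").write_text(
            json.dumps({"keymap": {"frieda": "aa"}}), encoding="utf-8"
        )
        (c / "aa" / "frieda.json").write_text(
            json.dumps({"name": "frieda"}), encoding="utf-8"
        )
        found = path_for_key(c, "frieda")
        missing = path_for_key(c, "mojo")
        return (
            found.relative_to(c).as_posix(),
            json.loads(found.read_text(encoding="utf-8")),
            missing,
        )
```

The point of the sketch is the lookup: the metadata file is consulted first, then the object is read as an ordinary text file from the bucket it names.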

A typical library processing pattern is to write a “harvester” which stores its results in a dataset collection, then use either a shell or Python script to transform the collection’s content, and finally redeploy the augmented results.
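The harvest, transform and redeploy steps might be sketched in Python as follows. This is a minimal illustration that uses a plain folder of JSON files as a stand-in for a dataset collection; the function names `harvest`, `transform` and `redeploy` and the sample records are hypothetical and not part of dataset.

```python
import json
import tempfile
from pathlib import Path


def harvest(records, store_dir):
    """Step 1: write each harvested record to its own JSON file."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    for key, record in records.items():
        (store_dir / f"{key}.json").write_text(json.dumps(record), encoding="utf-8")


def transform(store_dir, enhance):
    """Step 2: rewrite every stored object with enhance(obj)."""
    for path in sorted(Path(store_dir).glob("*.json")):
        obj = json.loads(path.read_text(encoding="utf-8"))
        path.write_text(json.dumps(enhance(obj)), encoding="utf-8")


def redeploy(store_dir):
    """Step 3: gather the augmented objects into one list for publishing."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(Path(store_dir).glob("*.json"))
    ]


def demo():
    # Stand-in for records fetched from a repository API.
    records = {"frieda": {"name": "frieda"}, "mojo": {"name": "mojo"}}
    with tempfile.TemporaryDirectory() as tmp:
        harvest(records, tmp)
        transform(tmp, lambda obj: {**obj, "display_name": obj["name"].title()})
        return redeploy(tmp)
```

Keeping the three steps separate means the stored objects can be inspected or re-transformed without harvesting again.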

Care has been taken to keep dataset simple and lightweight enough to run on a machine as small as a Raspberry Pi while being equally comfortable on a more resource-rich server or desktop environment.

Features

dataset supports

You can work with dataset collections via the command line tool, via Go using the dataset package, or in Python 3.6 using a Python package. dataset is useful for general data science applications which need intermediate JSON object storage but not a full-blown database.

Limitations of dataset

dataset has many limitations, some are listed below

Example

Below is a simple example of shell-based interaction with dataset collections using the dataset command line tool.

    # Create a collection "friends.ds"; the ".ds" suffix lets the bin/dataset command know which collection to use.
    bin/dataset friends.ds init
    # If successful then you should see an OK otherwise an error message

    # Create a JSON document 
    bin/dataset friends.ds create frieda '{"name":"frieda","email":"frieda@inverness.example.org"}'
    # If successful then you should see an OK otherwise an error message

    # Read a JSON document
    bin/dataset friends.ds read frieda
    
    # Path to JSON document
    bin/dataset friends.ds path frieda

    # Update a JSON document
    bin/dataset friends.ds update frieda '{"name":"frieda","email":"frieda@zbs.example.org", "count": 2}'
    # If successful then you should see an OK otherwise an error message

    # List the keys in the collection
    bin/dataset friends.ds keys

    # Get keys filtered for the name "frieda"
    bin/dataset friends.ds keys '(eq .name "frieda")'

    # Join frieda-profile.json with "frieda" adding unique key/value pairs
    bin/dataset friends.ds join append frieda frieda-profile.json

    # Join frieda-profile.json with "frieda", overwriting the key/value pairs they have in common
    # and adding the unique key/value pairs from frieda-profile.json
    bin/dataset friends.ds join overwrite frieda frieda-profile.json

    # Delete a JSON document
    bin/dataset friends.ds delete frieda

    # Import data from a CSV file using column 1 as key
    bin/dataset -quiet -nl=false friends.ds import-csv my-data.csv 1

    # To remove the collection just use the Unix shell command
    rm -fR friends.ds
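To picture what importing a CSV file "using column 1 as key" means conceptually, the Python sketch below turns each data row into a JSON-style object keyed by the value in the chosen column. It assumes the first row is a header that supplies the field names; that assumption, the helper `rows_to_objects` and the sample data are illustrative and do not describe how the `import-csv` command is implemented.

```python
import csv
import io


def rows_to_objects(csv_text, key_column=1):
    """Map each data row to an object, keyed by the value in key_column.

    key_column is 1-based to mirror the trailing `1` in the shell example.
    The header row is assumed to supply the field names.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    objects = {}
    for row in reader:
        key = row[key_column - 1]
        objects[key] = dict(zip(header, row))
    return objects


# Hypothetical sample data standing in for my-data.csv.
SAMPLE = (
    "id,name,email\n"
    "frieda,Frieda,frieda@inverness.example.org\n"
    "mojo,Mojo,mojo@example.org\n"
)
```

In this sketch every value stays a string, since CSV carries no type information; any conversion to numbers or booleans would be a separate step.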

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.