Replicating an EPrints repository

This article is about using two Python modules that are part of eprinttools to replicate the public facing content of an EPrints repository into an S3 bucket hosing a static website version of the content.

Taking this approach the EPrints software becomes primarily a curration platform while the public view can function indenpent of the EPrints collection (e.g. it can be up when you have EPrints down for maintenance).

This approach combines several Caltech Library tools projects

Setting up a staging service

There are several steps to setup a staging service to populate your S3 bucket (or compatible object store).

Make sure you have access to the EPrints REST API by a service account
Setup a S3 bucket (or other compatible object store) to be your website
Install the following command line utilities dataset, eprinttools (cli) and mkpage into your a directory in your PATH (e.g. /usr/local/bin)
Install py_dataset (e.g. pip install py_dataset)
copy eprints3x and eprintviews modules to the staging location
copy harvester_full.py, harvester_recent.py, genviews.py, indexer.py, mk_website.py and publisher.py to the staging location
- these use eprints3x and eprintviews modules.
Create, update the necessary configuration files
Follow the instruction to run the build and deploy process
Initialize the dataset collection if needed

Configuration and initialization

JSON configuration file

A JSON file is used to run the example Python3 programs for replicating your public EPrints content. Here’s an example which I’ve named config.json in the examples below.

    {
      "eprint_url": "https://USER:SECERT@eprints.example.edu",
      "dataset": "DATASET_COLLECTION",
      "number_of_days": -7,
      "control_item": "https://eprints.example.edu/cgi/users/home?screen=EPrint::View&eprintid=",
      "users": "users.json",
      "subjects": "subjects.txt",
      "views": "views.json",
      "organization": "Example Library IR",
      "site_title": "A EPrints Repository",
      "site_welcome": "Welcome to an EPrints Repository",
      "distribution_id": "",
      "bucket": "",
      "htdocs": "htdocs",
      "templates": "templates",
      "static": "static",
      "base_url": "http://localhost:8000"
    }

Field explanations

eprint_url: (required) is a URL with login credentials authorized to access the EPrints REST API
dataset: (required) name of your dataset collection you’ve initialized with the dataset command
number_of_days: Is an integer used to calculate the “recent” number of days to harvest
control_item: (used in templates) This is a link (minus the EPrints ID) to use to access the EPrint page needed to edit/manage the item
users: (required) Points at the JSON export of the users in the EPrint repository, it is needed to get the names of depositors
subjects: (required) Points at a copy of a plain text file found in your EPrint under /archives/REPO_ID/cfg/subject, it is used to map subject views to paths and names
views: (required) Points at a JSON file that shows a path part (e.g. ids) and label to use for that view (e.g. “Eprint ID”).
organization: (used in templates) The name of your organization (used by the Pandoc templates)
site_title: (used in templates) The website title (used by the Pandoc templates)
site_welcome: (used in templates) The website welcome statement (used by the Pandoc templates)
htdocs: (has default value) This is the directory to host the website (or replicate from), it functions as your document root
templates: (has default value) This is the directory that holds your website templates
static: (has default value) This is the directory that holds your static files and assets (e.g. css, favicon, non-content images like logos)
bucket: (optional) This is a URI To the S3 (or S3 like) bucket to host the public static website if you are using publisher.py and invalidate_cloudfront.py programs
distribution_id: (optional) This is the ID number used by Cloudfront for invalidating CDN cache. It is only used by includate_cloudfront.py. It is only used by includate_cloudfront.py
base_url: If this is set to a non-empty string then this will be passed to the templates used by mk_website.py to build HTML pages. It is also used in generating an Elasticsearch JSON document for ingest and linking back to the targetted resource.
elastic_documents: This is the filename to use when creating JSON documents for ingest into Elasticsearch, if not set then no JSON document is created. E.g. elastic-documents.json
elastic_documents_max_no: This is the maximum number of records to include in the Elasticsearch ingest document file(s). If there are more records to ingest than this number the filename indicated by elastic_documents will contain a numeric index, e.g. elastic-documents-1.json, elastic-documents-2.json. This defaults to 2500 dobjects in the elastic search document array.
elastic_base_endpoint: Used to configure the Elasticsearch elastic-app-search Client
elastic_api_key: Used to configure the Elasticsearch elastic-app-search Client
elastic_use_https: used to configure the Elasticsearch elastic-app-search Client

views JSON file content

This file controls what views get built built by the genviews.py command. It is a key value of path and label. The path corresponds the to view aggregation supported in eprintviews/aggregator.py.

    {
        "ids": "Eprint ID", 
        "year": "Year",
        "person-az": "Person", 
        "event": "Conference", 
        "collection": "Collection", 
        "latest": "Latest Additions", 
        "publication": "Publication", 
        "issn": "ISSN", 
        "person": "Person", 
        "types": "Type", 
        "subjects": "Subjects"
    }

subjects text file

This is a copy of the subjects file used in your EPrints 3.3.x installation. It is used to gather the table values that map to the display names of the subjects.

This file is found by looking in the repository’s configuration (e.g. if the repository is called “instrepo” the path is probably archives/instrepo/cfg/subjects).

users JSON file

The users JSON file is created by exporting your current users via the Admin user search found in your EPrints repository. The search for select all users then export as JSON.

IMPORTANT: This file should not be in a publicly readable location

Build Process

“config.json” is the name of the configuraiton file used for example purposes. It can be have any name, the JSON content is what is important.

harvest EPrints into a dataset collection
1. harvester_full.py config.json
2. harvester_recent.py config.json
generate the views and landing pages directory and metadata
1. genviews.py config.json
index the metadata for search
1. indexer.py config.json
make the website pages
1. mk_website.py config.json
publisher your htdocs to your S3 bucket
1. publisher.py config.json
2. (if needed) invalidate_cloudfront.py config.json