Appendix

Record metadata

A record in InvenioRDM is serialized to and from JSON. A complete record has the following top-level fields, but to create a new record, a client only needs to provide the data for the metadata field; the others are added by the InvenioRDM server.

{
   "access" : { ... },
   "created": "...",
   "files" : { ... },
   "id": "...",
   "links" : { ... },
   "metadata" : { ... },
   "parent": { ... },
   "pids" : { ... },
   "revision_id": N,
   "status": "...",
   "updated": "..."
}

The purpose of IGA is to construct the metadata field according to the InvenioRDM metadata definition. This part of the record has the following structure:

"metadata": {
    "additional_descriptions": [ ... ],
    "additional_titles": [ ... ],
    "contributors": [ ... ],
    "creators": [ ... ],
    "dates": [ ... ],
    "description": "...",
    "funding": [ ... ],
    "identifiers": [ ... ],
    "languages": [ ... ],
    "publication_date": "...",
    "publisher": "...",
    "references": [ ... ],
    "related_identifiers": [ ... ],
    "resource_type": { ... },
    "rights": [ ... ],
    "subjects": [ ... ],
    "title": "...",
    "version": "..."
}
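
As a rough illustration of the point that a client only needs to supply the metadata field, the following Python sketch builds a small payload and serializes it to JSON. All values are hypothetical, and the nested shapes used for creators and resource_type are assumptions based on the InvenioRDM metadata definition, not something spelled out in this appendix.

```python
import json

# Hypothetical example values; a real record would carry whatever IGA extracted.
metadata = {
    "title": "Example Tool – v1.0.0",
    "publication_date": "2023-06-01",
    "publisher": "Example Repository",
    "version": "1.0.0",
    # Assumed shape: each creator wraps a person_or_org object.
    "creators": [
        {
            "person_or_org": {
                "type": "personal",
                "given_name": "Ada",
                "family_name": "Example",
            }
        }
    ],
    # Assumed shape: resource_type is an object holding a CV identifier.
    "resource_type": {"id": "software"},
}

# Only the metadata field is provided; the server adds the other fields.
payload = {"metadata": metadata}
body = json.dumps(payload, indent=2, ensure_ascii=False)
```

The resulting string in body is what a client would send as the request body when creating a draft record.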

The algorithms implemented in IGA are designed to extract as much metadata as possible automatically. Because there are four possible sources of metadata (codemeta.json, CITATION.cff, the GitHub release, and the GitHub repository), and some of them overlap in what they store, the result is a complex set of possibilities. The table below summarizes how IGA fills in each field. (A more detailed version is available as a spreadsheet online.) Note that some fields contain a single value, while others contain a list of multiple values.

Metadata

Method

additional_descriptions

Add separate items as follows:
   • If the CodeMeta releaseNotes is set and it’s not a URL and we didn’t use it as the value of the main description, add it with the InvenioRDM CV value "other".
   • If the CodeMeta description is set and we didn’t use it as the value of the main description, add it with the InvenioRDM CV value "other".
   • If the CFF abstract is set and we didn’t use it as the value of the main description, add it with the InvenioRDM CV value "other".
   • If the GitHub repo description is set and we didn’t use it as the value of the main description, add it with the InvenioRDM CV value "other".
   • If the CodeMeta readme is set and it’s not a URL, add it with the InvenioRDM CV value "technical-info". (If the value is a URL, create a string of the form “Additional information is available at {URL}” and add that instead.)

Deduplicate the resulting list of descriptions to avoid duplicate values.
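
The rules above can be sketched in Python as follows. This is a simplified illustration, not IGA's actual code: the function names, the argument layout, the crude URL test, and the shape of the returned objects are all assumptions made for the example.

```python
from urllib.parse import urlparse


def is_url(text):
    """Return True if text looks like an http(s) URL (a simplification)."""
    parts = urlparse(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def additional_descriptions(release_notes, other_texts, readme, main_description):
    """Collect secondary descriptions as (CV type, text) pairs, then deduplicate.

    other_texts holds the CodeMeta description, CFF abstract, and GitHub repo
    description, in that order; any of them may be None.
    """
    items = []
    if release_notes and not is_url(release_notes) and release_notes != main_description:
        items.append(("other", release_notes))
    for text in other_texts:
        if text and text != main_description:
            items.append(("other", text))
    if readme:
        if is_url(readme):
            readme = f"Additional information is available at {readme}"
        items.append(("technical-info", readme))

    # Deduplicate while preserving the order in which items were added.
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    # Assumed output shape for InvenioRDM description objects.
    return [{"description": text, "type": {"id": kind}} for kind, text in unique]
```

Tracking seen items in a set while appending to a list removes duplicates without disturbing the order of the remaining entries.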

additional_titles

Add separate items as follows:
   • If the CodeMeta name is set, add it with InvenioRDM CV type “alternate-title”.
   • If the CFF title is set, add it with InvenioRDM CV type “alternate-title”.

Deduplicate the resulting list of titles to avoid duplicate values.

contributors

Add separate items as follows:
   • If the CFF contact is set, add the (single) identity with an InvenioRDM role CV value of “contactperson”.
   • If the CodeMeta sponsor is set, add each identity in the list with an InvenioRDM role CV value of “sponsor”.
   • If the CodeMeta producer is set, add each identity in the list with an InvenioRDM role CV value of “producer”.
   • If the CodeMeta editor is set, add each identity in the list with an InvenioRDM role CV value of “editor”.
   • If the CodeMeta copyrightHolder is set, add each identity in the list with an InvenioRDM role CV value of “rightsholder”.
   • If the CodeMeta maintainer is set, add each identity in the list with an InvenioRDM role CV value of “other”.
   • If the CodeMeta contributor is set, add the identities with role “other”; else, if CodeMeta contributor is not set, use the GitHub repo contributors field to create a list of contributors, using the GitHub API to look up people’s names, and add them with an InvenioRDM role CV value of “other”.

Remove identities that have a role of “other” and are also listed in the creators field.
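
The final filtering step might look like the following Python sketch. It assumes, purely for illustration, that each identity is a dictionary with "name" and "role" keys and that identities are matched by exact name; IGA's real data structures and matching rules may differ.

```python
def drop_redundant_contributors(contributors, creators):
    """Remove contributors with role "other" who already appear as creators.

    Both arguments are lists of hypothetical dicts such as
    {"name": "Example, Ada", "role": "other"}.
    """
    creator_names = {person["name"] for person in creators}
    return [
        person
        for person in contributors
        if not (person["role"] == "other" and person["name"] in creator_names)
    ]
```

Contributors with a more specific role, such as "editor", are kept even when they are also creators, because only the "other" role is treated as redundant.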

creators

Add separate items for each identity in the list of values from CodeMeta author or CFF author (but not both) if any are present; else, use the (single) GitHub release author if present; else, use the (single) GitHub repo owner. The method uses ORCID to look up names if only ORCID iDs are given, and uses multiple NLP methods to split names into given and family name parts if names are given as single strings.

dates

Add separate items as follows:
   • An item with InvenioRDM date CV type “created” using the value of CodeMeta dateCreated (if set) or the GitHub repo created_at.
   • An item with InvenioRDM date CV type “updated” using the value of CodeMeta dateModified (if set) or the GitHub repo updated_at.
   • An item with InvenioRDM date CV type “available” using the value of the GitHub release published_at.
   • If CodeMeta copyrightYear is set, an item with InvenioRDM date CV type “copyrighted” using the value of CodeMeta copyrightYear.

description

If the GitHub release body is not empty, use that; else, if the CodeMeta releaseNotes is not empty and not a URL, use that; else, try CodeMeta description, CFF abstract, and the GitHub repo’s description field, in that order.
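
This kind of fallback chain can be expressed compactly. The Python sketch below is an illustration only; the function names and the simple URL test are assumptions, and the remaining candidate values are passed in as an ordered list so that the caller decides which sources come next.

```python
def is_url(text):
    """Crude URL check; an assumption made for this sketch."""
    return text.strip().lower().startswith(("http://", "https://"))


def choose_description(release_body, release_notes, fallbacks):
    """Return the first usable description, or None if nothing is set.

    fallbacks is the ordered list of remaining candidate values.
    """
    candidates = [release_body]
    # Release notes only count when they are text rather than a link.
    if release_notes and not is_url(release_notes):
        candidates.append(release_notes)
    candidates.extend(fallbacks)
    for value in candidates:
        if value and value.strip():
            return value.strip()
    return None
```

For example, with an empty release body and release notes that are a URL, the function moves on to the first non-empty value in fallbacks.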

formats

If the GitHub release has a value for tarball_url, add “application/x-tar-gz”. If the GitHub release has a value for zipball_url, add “application/zip”. If there are values in the GitHub release assets list, infer additional MIME types based on file extensions.
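
A minimal Python sketch of this logic, using the standard mimetypes module for the extension-based inference, is shown below. It assumes the release is a dictionary shaped like the GitHub API's release object, with each asset carrying a "name" key; whether IGA uses mimetypes internally is not stated here, so treat that choice as an assumption.

```python
import mimetypes


def release_formats(release):
    """Return a list of MIME types for a GitHub release dictionary."""
    formats = []
    if release.get("tarball_url"):
        formats.append("application/x-tar-gz")
    if release.get("zipball_url"):
        formats.append("application/zip")
    for asset in release.get("assets", []):
        # guess_type() returns (type, encoding); type is None if unknown.
        mime_type, _encoding = mimetypes.guess_type(asset["name"])
        if mime_type and mime_type not in formats:
            formats.append(mime_type)
    return formats
```

Assets whose extension is not recognized are skipped rather than given a guessed type.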

funding

Use CodeMeta funding and funder values, intelligently constructing InvenioRDM funding objects with names of funders (looking up ROR identifiers in ROR.org if necessary).

identifiers

For every item in CodeMeta identifier and CFF identifiers, detect recognizable identifiers of type ARXIV, DOI, GND, ISBN, ISNI, ORCID, PMCID, PMID, ROR, and SWH, and add InvenioRDM objects with scheme based on InvenioRDM identifier-types CV terms.

languages

Hardwired to the value representing English.

locations

Hardwired to an empty list.

publication_date

Use CodeMeta datePublished, CFF date-released, or the GitHub release published_at, tried in that order.

publisher

Use the name of the InvenioRDM server. The name is determined by downloading an existing record from the server and using the value of the publisher field in that record.

references

Look at each item in CodeMeta referencePublication and CFF preferred-citation and references and collect identifiers of type DOI, ARXIV, ISBN, PMCID, and PMID. Use a combination of Crossref and Python’s isbnlib module to get the corresponding reference metadata, then generate plain-text references in APA format, and finally add each item to the InvenioRDM references field.

related_identifiers

Add separate items as follows:
   • The GitHub release html_url field value with InvenioRDM relation CV term “isidenticalto” and scheme “url”.
   • The value of one of the fields CodeMeta codeRepository, CFF repository-code, or the GitHub repo html_url (whichever has a value first) with InvenioRDM relation CV term “isderivedfrom” and scheme “url”.
   • If the CodeMeta releaseNotes is a URL, add it with the InvenioRDM relation CV term “isdescribedby”.
   • The value of one of the fields CodeMeta url, CFF url, or the GitHub repo homepage field (whichever has a value first) with InvenioRDM relation CV term “isdescribedby” and scheme “url”.
   • The value of CodeMeta sameAs with InvenioRDM relation CV term “isversionof” and scheme “url”.
   • The value of CodeMeta downloadUrl or CFF repository-artifact (whichever has a value first) with InvenioRDM relation CV term “isvariantformof” and scheme “url”.
   • The value of CodeMeta installUrl with InvenioRDM relation CV term “isvariantformof” and scheme “url”.
   • If CodeMeta softwareHelp is set, or if the GitHub repo has an associated GitHub Pages URL, add one of them with InvenioRDM relation CV term “isdocumentedby” and scheme “url”.
   • If the CodeMeta issueTracker is set, add it with the InvenioRDM relation CV term “issupplementedby”; else if the GitHub repo issues_url is set, add it instead.
   • The value(s) of CodeMeta relatedLink with InvenioRDM relation CV term “references” and scheme “url”.
   • For each value in the CodeMeta referencePublication and CFF preferred-citation and references that has not already been added as a related identifier (see above), add the identifier with InvenioRDM relation CV term “isreferencedby” and scheme according to the identifier type.

resource_type

If the CFF field type is set to “dataset”, use InvenioRDM CV value “dataset”; in all other cases, use “software”.

rights

Look for CodeMeta license, CFF license, and CFF license-url in that order; if none are available, look for the GitHub repo license field value; if that is not set, look in the GitHub repository’s files for a file named “LICENSE”, “License”, “COPYING”, or similar. If the info found includes a name or a URL, match it against known SPDX licenses and use the identifier (e.g., “bsd-1-clause”) as the value of the rights object’s “id” field, with the title of the license as the “title” value and the URL of the license as the “link” value. If only a license file is found in the repo, create a value of the form {"title": {"en": "License"}, "link": URL}.

sizes

Not currently set, as the InvenioRDM server does not make use of it.

subjects

Create a union of all terms found in the repo topics field, CodeMeta keywords, CFF keywords, CodeMeta programmingLanguage, and the GitHub repo languages_url.
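
Forming the union can be sketched in Python as follows. Comparing terms case-insensitively and emitting free-text {"subject": ...} objects are assumptions made for this example; the function name is hypothetical.

```python
def merge_subjects(*term_lists):
    """Merge several lists of keywords into one list without repeats.

    Terms are compared case-insensitively (an assumption), and the first
    spelling encountered is the one that is kept.
    """
    seen = set()
    subjects = []
    for terms in term_lists:
        for term in terms or []:
            cleaned = term.strip()
            key = cleaned.lower()
            if cleaned and key not in seen:
                seen.add(key)
                subjects.append({"subject": cleaned})
    return subjects
```

Each argument corresponds to one source, such as the repo topics or the CodeMeta keywords, and a source with no value can be passed as None.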

title

Construct a string of the form “title_part – version_part”, using an en-dash instead of a colon to separate the parts in order to avoid accidentally introducing two colons into the string.
   • For title_part, use the CodeMeta name; if that’s not set, use the CFF title; and if that’s not set, use the GitHub repository full_name.
   • For version_part, use the GitHub release name, or if that’s not set, the GitHub release tag_name.
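
The title rule above amounts to two short fallback chains joined by an en-dash. A minimal Python sketch, with a hypothetical function name and argument order, is:

```python
def make_title(codemeta_name, cff_title, repo_full_name, release_name, tag_name):
    """Build a record title of the form "title_part – version_part"."""
    # "or" skips values that are None or empty strings.
    title_part = codemeta_name or cff_title or repo_full_name
    version_part = release_name or tag_name
    return f"{title_part} – {version_part}"
```

For instance, a repository with no CodeMeta name, a CFF title of "My Tool", and an unnamed release tagged "v1.2.0" would get the title "My Tool – v1.2.0".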

version

Use the GitHub release tag_name, first removing any leading text of “v” or “version” if it appears as part of the tag name.
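
One way to strip such a prefix is with a regular expression, as in the Python sketch below. The exact pattern is an assumption: it only removes “v” or “version” (in any letter case, optionally followed by a separator) when a digit follows, so that tag names which merely begin with the letter “v” are left alone.

```python
import re

# Leading "v" or "version", optional separators, and then a digit (not consumed).
_PREFIX = re.compile(r"^(?:version|v)[\s._-]*(?=\d)", re.IGNORECASE)


def clean_version(tag_name):
    """Return the tag name without a leading "v" or "version" marker."""
    return _PREFIX.sub("", tag_name.strip())
```

With this pattern, "v1.2.3" becomes "1.2.3" and "version-2.0" becomes "2.0", while a tag such as "vanilla" is returned unchanged.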

Known limitations

The following are known issues and limitations in IGA.

  • As of mid-2023, InvenioRDM requires names of record creators and other contributors to be split into given (first) and family (surname) names. This is problematic for multiple reasons. The first is that mononyms are common in many countries: a person’s name may legitimately be only a single word which is not conceptually a “given” or “family” name. To compound the difficulty for IGA, names are stored as single fields in GitHub account metadata, so unless a repository has a codemeta.json or CITATION.cff file (which allow authors more control over how they want their names represented), IGA is forced to try to split the single GitHub name string into two parts. A foolproof algorithm for doing this does not exist, so IGA will sometimes get it wrong. (That said, IGA goes to extraordinary lengths to try to do a good job.)

  • InvenioRDM requires that identities (creators, contributors, etc.) be labeled as personal or organizational. The nature of identities is usually made clear in codemeta.json and CITATION.cff files. GitHub also provides a flag that is meant to be used to label organizational accounts, but sometimes people don’t set the GitHub account information correctly. Consequently, if IGA has to use GitHub data to get (e.g.) the list of contributors on a project, it may mislabel identities in the InvenioRDM record it produces.

  • Some accounts on GitHub are software automation or “bot” accounts, but are not labeled as such. These accounts are generally indistinguishable from human accounts on GitHub, so if they’re not labeled as bot or organizational accounts in GitHub, IGA can’t recognize that they are not humans. If such an account is the creator of a release in GitHub, and IGA tries to use its name-splitting algorithm on the name of the account, it may produce a nonsensical result. For example, it might turn “Travis CI” into an entry with a first name of “Travis” and last name of “CI”.

  • Funder and funding information can only be specified in codemeta.json files; neither GitHub nor CITATION.cff has provisions to store this kind of metadata. The CodeMeta specification defines two fields for this purpose: funder and funding. Unfortunately, these map imperfectly to the requirements of InvenioRDM’s metadata format. In addition, people don’t always follow the CodeMeta guidelines, and sometimes they write funding information as text strings (instead of structured objects), the interpretation of which would require software that can recognize grant and funding agency information from free-text descriptions. This combination of factors means IGA often can’t fill in the funding metadata in InvenioRDM records even if there is some funding information in the codemeta.json file.
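
To make the name-splitting limitations in the list above concrete, here is a deliberately naive splitter in Python. It is not IGA's algorithm, which is considerably more elaborate; the function name and the choice to treat the last word as the family name are assumptions made only to show why single-string names are hard to split.

```python
def naive_split(name):
    """Split a single name string into given and family parts (naively).

    The last whitespace-separated word is treated as the family name and
    everything before it as the given name.
    """
    parts = name.split()
    if not parts:
        return {"given_name": "", "family_name": ""}
    if len(parts) == 1:
        # A mononym has no meaningful given/family distinction.
        return {"given_name": "", "family_name": parts[0]}
    return {"given_name": " ".join(parts[:-1]), "family_name": parts[-1]}
```

Applied to the bot account name “Travis CI”, this yields a given name of “Travis” and a family name of “CI”, which is the kind of nonsensical result described above. A mononym such as “Sukarno” is forced into the family-name slot even though it is neither a given nor a family name.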