Appendix
Record metadata
A record in InvenioRDM is serialized to and from JSON. A complete record has the following top-level fields, but to create a new record, a client only needs to provide the data for the metadata field; the InvenioRDM server adds the others.
{
"access" : { ... },
"created": "...",
"files" : { ... },
"id": "...",
"links" : { ... },
"metadata" : { ... },
"parent": { ... },
"pids" : { ... },
"revision_id": N,
"status": "...",
"updated": "...",
}
The purpose of IGA is to construct the metadata field according to the InvenioRDM metadata definition. This part of the record has the following structure:
"metadata": {
"additional_descriptions": [ ... ],
"additional_titles": [ ... ],
"contributors": [ ... ],
"creators": [ ... ],
"dates": [ ... ],
"description": "...",
"funding": [ ... ],
"identifiers": [ ... ],
"languages": [ ... ],
"publication_date": "...",
"publisher": "...",
"references": [ ... ],
"related_identifiers": [ ... ],
"resource_type": { ... },
"rights": [ ... ],
"subjects": [ ... ],
"title": "...",
"version": "...",
}
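As an illustration of a client supplying only the metadata field, the following is a minimal sketch in Python. It is not IGA's code. The helper name minimal_record and all values are hypothetical, and the nested shapes used for creators and resource_type (including the "software" resource type id) are assumptions based on the InvenioRDM metadata definition that should be checked against the target server.

```python
import json


def minimal_record(title, given_name, family_name, publication_date):
    """Return a record payload that contains only the metadata field.

    Hypothetical helper for illustration; the server is expected to add
    the other top-level fields (id, created, links, and so on).
    """
    return {
        "metadata": {
            "title": title,
            "creators": [
                {
                    # Assumed shape for a personal creator entry.
                    "person_or_org": {
                        "type": "personal",
                        "given_name": given_name,
                        "family_name": family_name,
                    }
                }
            ],
            "publication_date": publication_date,
            # Assumed vocabulary id; servers may define their own.
            "resource_type": {"id": "software"},
        }
    }


# Placeholder values, not taken from any real record.
payload = minimal_record("Example Tool", "Ada", "Lovelace", "2023-06-01")
print(json.dumps(payload, indent=2))
```

The remaining metadata fields listed above are optional in this sketch and would be added to the same dictionary when a source provides values for them.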
The algorithms implemented in IGA are designed to extract as much metadata as possible automatically. Because there are four possible sources of metadata (codemeta.json, CITATION.cff, the GitHub release, and the GitHub repository), and some of them overlap in what they store, the result is a complex set of possibilities. The table below summarizes how IGA fills in each field. (A more detailed version is also available online as a spreadsheet.) Note that some fields contain a single value, while others contain a list of values.
Metadata |
Method |
---|---|
|
Add separate items as follows: |
|
Add separate items as follows: |
|
Add separate items as follows: |
|
Add separate items for each identity in the list of values from CodeMeta |
|
Add separate items as follows: |
|
If the GitHub release |
|
If the GitHub release has a value for |
|
Use CodeMeta |
|
For every item in CodeMeta |
|
Hardwired to the value representing English. |
|
Hardwired to an empty list. |
|
Use CodeMeta |
|
Use the name of the InvenioRDM server. The name is determined by downloading an existing record from the server and using the value of the |
|
Look at each item in CodeMeta |
|
Add separate items as follows: |
|
If the CFF field |
|
Look for CodeMeta |
|
Not currently set, as the InvenioRDM server does not make use of it. |
|
Create a union of all terms found in the repo |
|
Construct a string of the form “title_part – version_part”, using an en-dash instead of a colon to separate the parts in order to avoid accidentally introducing two colons into the string. |
|
Use the GitHub release |
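Two ideas from the table can be sketched in a few lines of Python: choosing a value from overlapping sources, and building the title string with an en-dash separator. This is only a sketch under stated assumptions, not IGA's implementation. The function names first_available and make_title are hypothetical, and the order in which sources are consulted here is an assumption, since the actual per-field rules are those given in the table.

```python
EN_DASH = "\u2013"  # en-dash, used instead of a colon between the parts


def first_available(*candidates):
    """Return the first candidate that is neither None nor empty."""
    for value in candidates:
        if value:
            return value
    return None


def make_title(title_part, version_part):
    """Build a string of the form 'title_part <en-dash> version_part'."""
    return f"{title_part.strip()} {EN_DASH} {version_part.strip()}"


# Hypothetical data standing in for parsed codemeta.json and CITATION.cff.
codemeta = {"name": "Example Tool"}
cff = {"title": "Example Tool (from CFF)"}

# Assumed priority for this example only: CodeMeta first, then CFF.
title_part = first_available(codemeta.get("name"), cff.get("title"))
print(make_title(title_part, "v1.2.0"))
```

A title part that already contains a colon, such as "Tool: a utility", stays at one colon under this scheme, which is the reason the table gives for preferring the en-dash.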
Known limitations
The following are known issues and limitations in IGA.
As of mid-2023, InvenioRDM requires names of record creators and other contributors to be split into given (first) and family (surname) parts. This is problematic for multiple reasons. The first is that mononyms are common in many countries: a person’s name may legitimately be a single word that is not conceptually a “given” or “family” name. To compound the difficulty for IGA, names are stored as single fields in GitHub account metadata, so unless a repository has a codemeta.json or CITATION.cff file (which allow authors more control over how their names are represented), IGA is forced to try to split the single GitHub name string into two parts. A foolproof algorithm for doing this does not exist, so IGA will sometimes get it wrong. (That said, IGA goes to extraordinary lengths to try to do a good job.)
InvenioRDM requires identities (creators, contributors, etc.) to be labeled as personal or organizational. The nature of identities is usually made clear in codemeta.json and CITATION.cff files. GitHub also provides a flag that is meant to be used to label organizational accounts, but sometimes people don’t set the GitHub account information correctly. Consequently, if IGA has to use GitHub data to get (e.g.) the list of contributors on a project, it may mislabel identities in the InvenioRDM record it produces.
Some accounts on GitHub are software automation or “bot” accounts, but are not labeled as such. These accounts are generally indistinguishable from human accounts on GitHub, so if they’re not labeled as bot or organizational accounts in GitHub, IGA can’t recognize that they’re not humans. If such an account is the creator of a release in GitHub, and IGA tries to use its name-splitting algorithm on the name of the account, it may produce a nonsensical result. For example, it might turn “Travis CI” into an entry with a first name of “Travis” and last name of “CI”.
Funder and funding information can only be specified in codemeta.json files; neither GitHub nor CITATION.cff have provisions to store this kind of metadata. The CodeMeta specification defines two fields for this purpose: funder and funding. Unfortunately, these map imperfectly to the requirements of InvenioRDM’s metadata format. In addition, people don’t always follow the CodeMeta guidelines, and sometimes they write funding information as text strings (instead of structured objects), the interpretation of which would require software that can recognize grant and funding agency information from free-text descriptions. This combination of factors means IGA often can’t fill in the funding metadata in InvenioRDM records even if there is some funding information in the codemeta.json file.
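To make the name-splitting limitation described above concrete, here is a deliberately naive splitter in Python. It is not IGA's algorithm, which is considerably more elaborate. The function name naive_split_name and its rule (last word is the family name, everything before it is the given name) are assumptions chosen only to show why a single name string cannot be split reliably.

```python
def naive_split_name(full_name):
    """Split a single name string into given and family parts.

    Illustrative only: treats the last whitespace-separated word as the
    family name and everything before it as the given name.
    """
    parts = full_name.split()
    if not parts:
        return {"given_name": "", "family_name": ""}
    if len(parts) == 1:
        # A mononym has no "given" or "family" part; putting it in
        # family_name is an arbitrary choice made for this sketch.
        return {"given_name": "", "family_name": parts[0]}
    return {"given_name": " ".join(parts[:-1]), "family_name": parts[-1]}


# "Travis CI" is the bot-account example from the text; the other
# names are placeholders.
for name in ["Ada Lovelace", "Sukarno", "Travis CI"]:
    print(name, "->", naive_split_name(name))
```

With this rule, “Travis CI” yields a given name of “Travis” and a family name of “CI”, the nonsensical result mentioned above, and the mononym ends up with an empty given name. Neither failure can be detected from the string alone.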