NRP Invenio repository python client

A python library for accessing invenio-based repositories.

Installation

pip install nrp-invenio-client

Imports

Will use the following imports in the examples:

from pprint import pprint
from nrp_invenio_client import NRPInvenioClient

Instantiating the client

Using config file

The client configuration file can be used to store parameters for repositories. The file is in YAML format and is located in ~/.nrp/config.yaml by default. The file can contain multiple repositories, each with its own alias. The alias is used to select the repository to use. One of the repositories is a default repository, which is used if no alias is specified.

To use the config file, you need to create it first. The easiest way is to use the nrp command line tool:

nrp add alias https://repo.du.cesnet.cz repo --default

This will create a config file with a repository with alias repo and default repository set to it. It will also try to open your browser, let you log in and create and store an access token.

Then, in your python code you instantiate the client like this:

client = NRPInvenioClient.from_config()
 
## or
client = NRPInvenioClient.from_config(alias="repo")

You can pass the location of the config file as well via the config_file argument.

Note: If NRP_INVENIO_CLIENT_ABC environent variable is set, it will be used instead of the value from the config file. The ABC is uppercase version of the config key, e.g. NRP_INVENIO_CLIENT_SERVER_URL will be used instead of server_url from the config file.

Using parameters

To bypass the config file completely, you can instantiate the client with parameters:

client = NRPInvenioClient(
    server_url     = "https://repo.du.cesnet.cz",
    token          = "api token",
    verify         = "/path/to/cert.pem",   # True by default, might be False to disable SSL verification
    retry_count    = 10,                    # idempotent requests will be retried 10 times
    retry_interval = 10,                    # retry interval in seconds if server is busy.
                                            # Exponential backoff is used from retry_interval - 10 * retry_interval
)

Data structures

The responses from the server are represented as python objects, not as plain dictionaries. To get the plain dictionary, call the to_dict method on the object.

Getting repository information

To get information about the invenio repository, we will use the info endpoint (/.well-known/info). Note that this endpoint is NRP extension, it is not present in vanilla Invenio.

General information

The general repository information is available at /.well-known/info/repository endpoint. Use the following code to get it:

pprint(client.info.repository.to_dict())
 
{
  "name": "NRP Invenio repository",
  "description": "NRP Invenio repository",
  "version": "1.0.0",
  "invenio_version": "12.0.38",
  "links": {
    "self": "https://repo.du.cesnet.cz/api/info/repository",
    "models": "https://repo.du.cesnet.cz/api/info/models",
  },
  "features": [
    "drafts",
    "requests",
    "files"
  ],
  "transfers": [
    "local-file",
    "direct-s3",
    "url-fetch"
  ]
}

Properties on the NRPRepositoryInfo objects are: name, description, version, invenio_version, links, features, transfers and contain exactly the same payload as in the example above.

Models

The repository can enumerate the models it supports, at the /.well-known/info/models endpoint. Use the following code to get it:

pprint(client.info.models.to_dict())
 
{
   "datasets": {
     "name": "Datasets",
     "description": "This model contains datasets.",
     "schemas": ["local://documents-1.0.0", "local://documents-2.0.0"],
     "version": "1.0.0",
     "features": [
       "drafts",
       "requests",
       "files",
       "custom_fields"
     ],
     "links": {
      "api": "https://myserver.cesnet.cz/api/datasets",
      "html": "https://myserver.cesnet.cz/datasets",
      "user": "https://myserver.cesnet.cz/api/user/datasets",
      "published":  "https://myserver.cesnet.cz/api/datasets",
      "drafts": "https://myserver.cesnet.cz/api/drafts/datasets",
      "schema": "https://myserver.cesnet.cz/info/schemas/datasets-v1.0.0.json",
      "model": "https://myserver.cesnet.cz/info/models/datasets-v1.0.0.json",
      "openapi": "https://myserver.cesnet.cz/info/openapi/datasets-v1.0.0.json"
     }
   }
 }

Properties on the NRPModelInfo objects are: name, description, ``schemas, version, features, linksand contain exactly the same payload as in the example above. Additionally, theurlproperty is available, which containslinks.apianduser_urlcontaining thelinks.user` endpoint (the exact values of the links are just examples, user should not rely on their exact structure).

Searching within repository

Searching vs. scanning

The client supports two modes of searching within the repository: searching and scanning.

Searching is used when you want to have the results sorted by relevance (or any other criteria). You will get only a limited number of records, at most 10000. This limit can not be changed.

Scanning, on the other hand, gives you all records that match the query. Scanning is more resource intensive than searching and server requires you to be authenticated to use it.

API principles

To initiate a search, you need to create a search request via client.search method:

client = NRPInvenioClient.from_config()
request = client.search("datasets")

On the request, you can set various parameters, such as query, page, size, etc.

## pagination - not used when scanning
request.page(2)
 
## pagination - not used when scanning
request.size(10)
 
## ordering - not used when scanning
request.order_by("+title", ...)
 
## select only specific records
request.published()
request.drafts()
 
## only return selected fields
request.fields("metadata.title", "metadata.description", ...)
 
## filter by query (either SOLR string or opensearch json)
request.query("metadata.title:foo")

Using the search endpoint

Finally, execute the request:

response = request.execute()

The response will contain the results of the search:

print(response.total)
print(response.total_error)
for record in response:
    print(record.to_dict())

The records returned from the iterator are instances of NRPRecord class.

Using the scan endpoint

For scanning, you need to use the scan method instead of request:

with request.scan() as response:
    for record in response:
        print(record)

Record operations

Getting a record from the repository

To get a record from the repository, you need to know its:

model-aware id
model and id within the model
doi
API url
HTML url

Getting a record by model-aware id

client = NRPInvenioClient.from_config()
rec: NRPRecord = client.records.get("datasets/1234", include_files=True, include_requests=True)

Getting a record by model and id within the model

client = NRPInvenioClient.from_config()
model_id = "datasets"
record_id = "1234"
rec: NRPRecord = client.records.get_by_id(f"{model_id}/{record_id}", include_files=True, include_requests=True)

Getting a record by DOI or urls

from nrp_invenio_client.config import NRPConfig
from nrp_invenio_client.records import record_getter
 
config = NRPConfig()
rec = record_getter(config, record_doi_or_url, include_files, include_requests)

Note: The record_getter function is a helper function that can be used to get a record by any identifier, including mid, doi, api url and html url. For the mid, you can pass a client object that will be used instead of the config's default repository client.

Creating a record

To create a record, call the create method on the records object:

client = NRPInvenioClient.from_config()
rec = client.records.create("datasets", {"title": "My dataset"})

The first argument is the name of the model to create the record in, the second argument is the metadata of the record. The method returns the created record (an instance of NRPRecord class).

Note: If the model supports draft records, this call will create a draft record and the mid on the record class will be in the form of draft/model/id.

Updating a record

To update a record, you might call the save method on the record object:

client = NRPInvenioClient.from_config()
rec = client.records.get("draft/datasets/1234")
rec.metadata["title"] = "My dataset"
rec.save()

This will ensure that optimistic locking is used when updating the record.

Note: For draft-enabled models you can update only draft records.

Deleting a record

To delete a record, call the delete method on the record object:

client = NRPInvenioClient.from_config()
rec = client.records.get("draft/datasets/1234")
rec.delete()

Note: For draft-enabled models you can delete only draft records.

Publish and edit published records

Publishing a record

To publish a record, call the publish method on the record object:

client = NRPInvenioClient.from_config()
rec = client.records.get("draft/datasets/1234")
published_record = rec.publish(version="v1")

You can optionally specify the "human-readable" version of the record to publish.

Editing a published record

To edit a published record, you need to create a new draft record at first. To do so, call the edit method on the published record object:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234")
draft = rec.edit()

The draft object is a new draft record, which is a copy of the published record. After you make the changes, you can save the draft record and publish it.

Files

Downloading record files

To download files from a record, you need at first get the record with files included:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_files=True)

With this option, there is a files property on the record object, which is a collection (class NRPRecordFiles) of NRPFile objects. To download the file, call the download method on the file object:

for fo in rec.files:
    fo.download("/path/to/downloaded/file")

Note: The current implementation of the download method uses a single thread download, which might be slow. If you need to download multiple files, you might want to use a tool for a parallel download, such as aria2c, axel or hget. To use it, get the fo.content url and download the file using a tool of your choice.

If you do not want to save the file but just get the content, you can call the open method:

for fo in rec.files:
    with fo.open() as f:
        print(f.read()) # do whatever here

Note: The open method returns a file-like object, which is a requests.Response object.

Uploading record files

To upload a file to a record, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_files=True)

Then, you can upload the file:

with open("filename.txt", "rb") as stream:
    rec.files.create("filename.txt", {"description": "My file"}, stream)

The first argument is the name of the file, the second argument is the metadata of the file and the third argument is the file stream that needs to be open in binary mode.

Note: The implementation may use chunked upload if the server supports it. A prerequisite for a chunked upload is that the opened stream supports the seek method and is able to tell its length.

Updating file metadata for record files

To update the metadata of a file, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_files=True)

Then, you can update the file metadata:

file = rec.files.get("filename.txt")
file.metadata["description"] = "My updated file"
file.save()

Replacing file content

To replace the content of a file, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_files=True)

Then, you can replace the file content:

file = rec.files.get("filename.txt")
with open("filename.txt", "rb") as stream:
    file.replace(stream)

Deleting record files

To delete a file from a record, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_files=True)

Then, you can delete the file:

file = rec.files.get("filename.txt")
file.delete()

Requests

Requests are used to ask for certain operations to be performed on a record. Each repository can provide a custom set of operations, examples of such operations are "publish", "unpublish", "delete", "edit", "access private files" etc.

Getting requests for a record

To get requests for a record, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)

Then, you can get the requests (instances of NRPRecordRequests) from the record:

requests: NRPRecordRequests = rec.requests
 
for req_type in requests:
    print(f"Request type {req_type.type_id}")
    print("Open requests of this type:")
    for req in req_type.submitted_requests:
        print(req.to_dict())

You can iterate all requests of this type using for req in req_type or you can get subsets of requests:

req_type.submitted_requests - open requests
req_type.cancelled_requests - expired requests (closed)
req_type.accepted_requests - accepted requests (closed)
req_type.declined_requests - declined requests (closed)
req_type.expired_requests - expired requests (closed)

Creating a request for a record

To create a request for a record, you need to get the record first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)

Then, you can create a request:

req = rec.requests.create("publish", {"version": "v1"}, submit=True)

If you pass submit=True, the request will be submitted immediately. Otherwise, you need to call the submit method on the request object.

Updating request payload

To update the metadata of a request, you need to get the request first:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)
req = rec.requests.get("publish", "1234") # or use listing of open requests and select yours
req.payload["version"] = "v2"
req.save()

Checking the status of a request

To check the status of a request, you need to get the request first and then get its status:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)
req = rec.requests.get("publish", "1234") # or use listing of open requests and select yours
print(req.status)
 
# after a pause
req.refresh()
print(req.status)

Cancelling the request

To cancel a request, you need to get the request first and then call the cancel method on it:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)
req = rec.requests.get("publish", "1234") # or use listing of open requests and select yours
req.cancel()

You might provide a reason for the cancellation (json object) as an argument to the cancel method.

Accepting the request

To accept a request, you need to get the request first and then call the accept method on it:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)
req = rec.requests.get("publish", "1234") # or use listing of open requests and select yours
req.accept()

You might provide a reason for the acceptance (json object) as an argument to the accept method.

Declining the request

To decline a request, you need to get the request first and then call the decline method on it:

client = NRPInvenioClient.from_config()
rec = client.records.get("datasets/1234", include_requests=True)
req = rec.requests.get("publish", "1234") # or use listing of open requests and select yours
req.decline()

You might provide a reason for the decline (json object) as an argument to the decline method.

API reference External authorities