Versioning - Technical Overview
Versioning - Technical Overview
This is technical, developer oriented overview of the next gen versioning project implemented in 2020.
Storage: We offer an approach to data storage that leverages cloud blob storage directly (i.e.: without having to upload and serve all files via the CKAN web server), unlocking the performance characteristics of the storage backend directly. More info: http://tech.datopian.com/blob-storage/#direct-to-cloud-upload
Versioning: We offer an approach to data versioning that is integrated with CKAN, but does not implement large amounts of custom logic in order to achieve versioning, and instead, leverages git, the world’s most popular software for versioning, for this purpose.
Implementation
Working areas
- Data Versioning (Git Server + Git LFS Server)
- Cloud Storage (Direct to Cloud Upload)
- CKAN Integration (CKAN extensions to call APIs)
Main Architecture
CKAN Extensions
The solution has the following components:
ckanext-external-storage: This extension will integrate the javascript sdk with CKAN.
- Update resources form to upload files to the cloudstorage provider using the new Javascript SDK.
- Update all the download points in CKAN to download the resources from the cloudstorage
- Integrate datapub for data publishing flows
ckanext-authz-service: This extension uses CKAN's built-in authentication and authorization capabilities to:
- Generate JWT tokens and provide them via CKAN's Web API to clients.
- Validate JWT tokens
This extension requires a RSA keypair for RS* signing & encryption.
ckanext-versioning: This CKAN extension adds a full data versioning capability to CKAN including:
- Metadata and data is revisioned so that all updates create new revision and old versions of the metadata and data are accessible
- Create and manage "revision tags" (named labels plus a description for a specific revision of a dataset e.g. "v1.0")
- Diffs, reverting, etc
Giftless
giftless: Giftless is just a middleware service to get the location (URLs) of the files in the storage server so clients can interact with them through HTTP requests.
- Given the sha256 and the filesize of a data file, the call to giftless it will return the signed URLs so the client (ckanext-external-storage) can do a dummy HTTP PUT request to upload and to verify the content
- Validate the JWT token given by ckanext-authz-service
Note: Giftless will require in it's configuration access to the public key created for ckanext-authz-service.
Main libraries
Javascript SDK: This javascript SDK will contain the logic to be used in the browser to:
- calculate file hashes,
- get the authorization JWT from ckanext-authz-service
- get signed urls from git lfs server (giftless)
- upload the file to the cloud storage provider. (Basic HTTP PUT request to the signed url)
- Verify the file. (HTTP GET request to the verify URL)
metastore-lib: Library for storing dataset metadata, with versioning support and pluggable backends including GitHub.
metastore-lib is used by ckanext-versioning and requires environment variables related to the Github Repository where the data is gonna be stored and the access token.
frictionless-ckan-mapper (python): A library for mapping CKAN metadata <=> Frictionless metadata.
Main interaction between main components
This is a workflow example to understand how each component interacts with each other.
Note: I keep this mermaid code for versioning purposes
Main Configurations
CKAN configs
CKAN extensions configs are described as usual in the production.ini
file and the template to create it in our current deployment pipeline can be found in the values.yaml.template file
The specific configs for the versioning stack feature are:
# ckanext-external-storage
ckanext.external_storage.storage_service_url=https://{{CKAN_INSTANCE_NAME}}.gatesfoundation.org/giftless
# ckanext-authz-service
ckanext.authz_service.jwt_algorithm=RS256
ckanext.authz_service.jwt_private_key_file=/etc/ssh/jwt-rs256.key
ckanext.authz_service.jwt_public_key_file=/etc/ssh/jwt-rs256.key.pub
ckanext.authz_service.jwt_include_user_email=true
#ckanext-versioning
ckanext.versioning.backend_type=github
ckanext.versioning.backend_config={"github_options": {"login_or_token": "{{GITHUB_PERSONAL_TOKEN}}"}, "default_owner": "{{GITHUB_DEFAULT_OWNER}}", "lfs_server_url": "https://{{CKAN_INSTANCE_NAME}}.gatesfoundation.org/giftless", "private": True}
ckan-cloud-helm
The CKAN architecture is deployed using K8s and Helm charts: https://github.com/datopian/ckan-cloud-helm
Giftless needs a server of it's own so it is managed by K8s, the configs can be found in the ckan-cloud-helm repository and it is currently configured for Azure in ckan-cloud-helm repository.
Example of the config map we are using at the moment:
AUTH_PROVIDERS:
- factory: giftless.auth.jwt:factory
options:
algorithm: RS256
public_key_file: /etc/ssh/jwt-rs256.key.pub
TRANSFER_ADAPTERS:
basic:
factory: giftless.transfer.basic_external:factory
options:
storage_class: giftless.storage.azure:AzureBlobsStorage
storage_options:
connection_string: {{ .Values.excternalStorageConnectionString }}
container_name: {{ .Values.excternalStorageContainerName }}
path_prefix: {{ .Values.giftlessPathPrefix | default "giftless" | quote }}
PRE_AUTHORIZED_ACTION_PROVIDER:
options:
private_key: {{ randAlphaNum 16 | b64enc | quote }}
MIDDLEWARE:
- class: werkzeug.middleware.proxy_fix:ProxyFix
kwargs:
x_host: 1
x_port: 1
x_prefix: 1