Fundamental to CanDIG is national-scale analysis over locally controlled data. Our platform is fully distributed, with no central infrastructure to maintain or secure. On top of that, researchers need to be able to readily discover, access, and analyze this information, possibly jointly across sites, while the data stewards remain able to ensure the security and privacy of their data.
We do this by building on established or in-progress projects elsewhere, such as OpenID Connect and Keycloak for authentication, and the GA4GH (Global Alliance for Genomics and Health) APIs and schemas for genomic data and genomic data exchange.
In the CanDIG platform, all data access, even local, is API based; that is, there are no processes which are let loose on directories of data files. This gives us several advantages:
We are making use of the GA4GH APIs for data (and metadata) access, with a thin CanDIG layer on top, which we will use for
select ... WHERE ...;
The API accesses against any particular dataset can be simple queries (“please tell me how many individuals have this particular variant in this data set”) or longer-lived tasks, which must be scheduled and which require a particular executable. For the latter we are making use of the GA4GH Task Execution Schemas and implementations such as Funnel.
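The simple-query case can be illustrated with a minimal sketch. It assumes each site answers the count query locally and returns only the count, so that a caller can combine counts across sites. The `SiteDataset` class, the `Variant` tuple, and `federated_count` are hypothetical names used for illustration; they are not part of the CanDIG or GA4GH APIs, and a real deployment would make authenticated API calls rather than in-process method calls.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant identifier: (chromosome, position, reference, alternate).
Variant = Tuple[str, int, str, str]


@dataclass
class SiteDataset:
    """Stand-in for one site's locally controlled dataset behind its API."""
    name: str
    calls: Dict[str, Set[Variant]]  # individual ID -> variants observed

    def count_individuals_with(self, variant: Variant) -> int:
        """Answer the query locally; only the count leaves the site."""
        return sum(1 for observed in self.calls.values() if variant in observed)


def federated_count(
    sites: Iterable[SiteDataset], variant: Variant
) -> Tuple[Dict[str, int], int]:
    """Ask every site the same question and combine the per-site counts."""
    per_site = {site.name: site.count_individuals_with(variant) for site in sites}
    return per_site, sum(per_site.values())


# Made-up example data for two sites.
v = ("1", 12345, "A", "G")
site_a = SiteDataset("site_a", {"p1": {v}, "p2": set()})
site_b = SiteDataset("site_b", {"p3": {v}, "p4": {v}, "p5": {("2", 99, "C", "T")}})
per_site_counts, total = federated_count([site_a, site_b], v)
```

Because only aggregate counts cross site boundaries in this sketch, the individual-level records never leave the site that holds them.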
Doing this requires the bundling and distribution of CanDIG-blessed images for fundamental bioinformatics tasks. We have examined various container and VM approaches for executing these discrete tasks; while we are proceeding with Docker containers in the short term, in the medium term we will be moving to Singularity or rkt, which give us what we need (application bundling, no system-wide root daemons) without what we don’t need in our context of unprivileged users and sandboxes (container-level isolation).
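As a rough illustration of how a longer-lived task and its container image fit together, the sketch below builds a task description using field names from the GA4GH Task Execution Service schema (`inputs`, `outputs`, `executors`, `resources`). The image name, command, URLs, and paths are made-up placeholders, and `build_tes_task` is a hypothetical helper rather than anything shipped by CanDIG or Funnel.

```python
import json
from typing import Any, Dict, List


def build_tes_task(
    name: str,
    image: str,
    command: List[str],
    input_url: str,
    input_path: str,
    output_url: str,
    output_path: str,
) -> Dict[str, Any]:
    """Assemble a TES-style task document as a plain dictionary."""
    return {
        "name": name,
        # Files staged into the container before the executor runs.
        "inputs": [{"url": input_url, "path": input_path}],
        # Files copied back out after the executor finishes.
        "outputs": [{"url": output_url, "path": output_path}],
        # One executor: the bundled image plus the command to run inside it.
        "executors": [{"image": image, "command": command}],
        # Modest resource request; the values are illustrative only.
        "resources": {"cpu_cores": 1, "ram_gb": 2.0},
    }


# All names below are placeholders, not real CanDIG images or storage locations.
example_task = build_tes_task(
    name="count-variants",
    image="example.org/candig/variant-tools:0.1",
    command=["count-variants", "/data/input.vcf", "-o", "/data/counts.txt"],
    input_url="file:///site-storage/input.vcf",
    input_path="/data/input.vcf",
    output_url="file:///site-storage/counts.txt",
    output_path="/data/counts.txt",
)
payload = json.dumps(example_task, indent=2)
```

A document like `payload` is what a client would submit to a TES-compatible server's task-creation endpoint; the server, not the researcher, then decides where and how the container runs.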
For authentication, we are using best practices for RESTful API authentication, such as OpenID Connect, using tools such as Keycloak. As the project evolves, we anticipate using UMA (User Managed Access) for federated, role-based authorization.
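From the client's side, token-based API authentication of this kind usually means presenting an access token, obtained from the identity provider, as a bearer token on each request. The sketch below only constructs such a request and does not send it. The URL and token are placeholders, and `authorized_request` is a hypothetical helper, not part of any CanDIG client.

```python
import urllib.request


def authorized_request(url: str, access_token: str) -> urllib.request.Request:
    """Build a GET request that carries an OAuth 2.0 bearer token."""
    request = urllib.request.Request(url, method="GET")
    # The access token would come from the OpenID Connect provider
    # (for example, a Keycloak realm) after the user has logged in.
    request.add_header("Authorization", f"Bearer {access_token}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder endpoint and token, for illustration only.
example_request = authorized_request(
    "https://candig.example.org/datasets", "example-access-token"
)
```

The API server is then responsible for validating the token and deciding, per request, what the caller is authorized to see.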