What is DataSHIELD?
DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other health care professionals to individual-level data. Although initially developed for work in the biomedical and social sciences, DataSHIELD can be used in any setting where microdata (data on individual subjects) must be analysed but cannot physically be shared with the research users.
DataSHIELD is a flexible, modular, free, open-source solution ideally placed to grow a broad user and development community.
Research in modern biomedicine and social science is increasingly dependent on the analysis and interpretation of microdata (data on individual subjects) or on the co-analysis of such data from several studies simultaneously. Making individual-level data available so that it may be queried by researchers – or other professional users – raises important ethico-legal questions and can be controversial.
DataSHIELD facilitates important research in settings where
- a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prevent the release or sharing of some of the required data, and/or render data access unacceptably slow
- equivalent governance concerns prevent or hinder access to a single data set
- a research group wishes to actively share the information held in its data with others but does not wish to cede control of the governance of those data and/or the intellectual property they represent by physically handing over the data themselves
- a data set which is to be remotely analysed – or included in a multi-study co-analysis – contains data objects (e.g. images) which are too large to be physically transferred to the site of analysis.
How does DataSHIELD work?
Analysis requests are sent from a central analysis machine to several data-holding machines storing the harmonised data to be co-analysed. The data sets are analysed simultaneously and in parallel, linked by non-disclosive summary statistics. Analysis is taken to the data, not the data to the analysis.
DataSHIELD is implemented entirely via free, open-source software: at heart, a modified R statistical environment linked to an Opal database deployed behind the firewall at each data-holding organisation. Analysis is initiated in a standard R environment at the analysis machine, with communication between the analysis and data-holding machines controlled via secure web services. The same infrastructure and approach may also be used with just one data source. This is then referred to as “single site DataSHIELD”, and it provides a freeware-based approach to creating a secure data enclave.
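The principle of taking the analysis to the data can be illustrated with a minimal sketch. This is not DataSHIELD code (DataSHIELD itself runs in R against Opal servers); it is a self-contained Python illustration in which the names `DataServer`, `pooled_mean` and the threshold `MIN_CELL_COUNT` are all hypothetical. Each simulated site releases only a count and a sum, and the client combines those summaries into a pooled mean without ever seeing an individual record.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed disclosure threshold, chosen for illustration only.
MIN_CELL_COUNT = 5


@dataclass
class DataServer:
    """Stands in for one data-holding site; its rows are never returned."""
    name: str
    values: List[float]

    def summary(self) -> Dict[str, float]:
        """Return aggregate statistics only, refusing if the sample is too small."""
        n = len(self.values)
        if n < MIN_CELL_COUNT:
            raise PermissionError(f"{self.name}: too few records to release a summary")
        return {"n": n, "sum": sum(self.values)}


def pooled_mean(servers: List[DataServer]) -> float:
    """Client side: combine per-site summaries into one pooled estimate."""
    summaries = [server.summary() for server in servers]
    total_n = sum(s["n"] for s in summaries)
    total_sum = sum(s["sum"] for s in summaries)
    return total_sum / total_n


if __name__ == "__main__":
    sites = [
        DataServer("study_a", [4.0, 5.0, 6.0, 5.0, 5.0]),
        DataServer("study_b", [7.0, 8.0, 9.0, 8.0, 8.0, 8.0]),
    ]
    print(pooled_mean(sites))
```

The pooled result is identical to the mean that would be obtained if all records were stacked in one place, yet only two numbers per site cross the network. The refusal for very small samples is a simplified stand-in for the kind of disclosure control a real deployment would enforce on the server side.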
Examples of DataSHIELD infrastructure
Multi-site DataSHIELD infrastructure is appropriate for the co-analysis of harmonised individual-level data held at multiple locations. Each data location installs the server-side DataSHIELD infrastructure that holds a snapshot of the harmonised data to be co-analysed. One of the locations also installs and manages the DataSHIELD client portal, the mechanism by which users are authenticated to send analysis commands within the DataSHIELD infrastructure.
Multi-site DataSHIELD with reference to a resource
In this configuration, one or more servers include a "reference to a resource". A resource can be a dataset (e.g. a file stored in the local file system or in a file store server, a database table, etc.) or a server with some computation capabilities. This allows flexible delegation of the data connection to each R server, which can then reach data in a wide range of formats, whether held locally on the server itself or accessed via a URL. This means that DataSHIELD can now be applied to high-volume ‘omics data held in standard formats such as VCF.
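As a rough illustration of what a "reference to a resource" might carry, the sketch below models a resource as a small descriptor holding a name, a URL and a format, which the server resolves into a connection route. It is a conceptual Python sketch only: `ResourceRef`, `describe_connection`, the supported URL schemes and the example URLs are hypothetical, not the actual DataSHIELD resource API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ResourceRef:
    """Hypothetical descriptor: where the data lives and what format it is in."""
    name: str
    url: str
    fmt: str


def describe_connection(ref: ResourceRef) -> str:
    """Decide, on the server side, how the referenced data would be reached."""
    scheme = urlparse(ref.url).scheme
    if scheme == "file":
        location = "local file system"
    elif scheme in ("http", "https"):
        location = "remote file store"
    elif scheme in ("postgresql", "mysql"):
        location = "database table"
    else:
        raise ValueError(f"unsupported resource scheme: {scheme!r}")
    return f"{ref.name}: {ref.fmt} data via {location}"


if __name__ == "__main__":
    # Example references; the paths and hosts are placeholders.
    refs = [
        ResourceRef("genotypes", "file:///srv/data/genotypes.vcf", "vcf"),
        ResourceRef("phenotypes", "postgresql://db.example.org/cohort", "sql"),
    ]
    for ref in refs:
        print(describe_connection(ref))
```

The point of the design is that the client only ever names the resource. The decision about how to open it, and the data itself, stay with the server that holds the reference.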
Single-site DataSHIELD infrastructure is used to enable analysis of individual-level data held at one location. In this case, the server-side DataSHIELD infrastructure is installed in addition to the DataSHIELD client portal.
Who uses DataSHIELD?
Current DataSHIELD Implementations
- EUCAN-CONNECT: developing a federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health. Collaborating with 173 European population-based cohort studies with ~2.5M participants in total.
This project aims to coordinate DataSHIELD implementations across LifeCycle, RECAP, InterConnect, Reach, LongITools and Athlete (and future projects that emerge).
- LifeCycle: developing new strategies for optimizing early life that will help to maximize the human developmental potential for current and future European generations. Includes 40 European cohort studies.
- RECAP preterm: Research to improve the health, development and quality of life of babies born preterm. Includes 20 population-based cohort studies from Europe.
- InterConnect (MRC Epidemiology Unit, Cambridge): InterConnect is developing a global collaborative network for diabetes and obesity research, piloting DataSHIELD to facilitate a new approach to data sharing that is secure, scalable and sustainable. This includes data from 43 studies.
- ATHLETE: Developing advanced tools for Human Early Lifecourse Exposome Research and establishing a prospective exposome cohort, including a FAIR data infrastructure, by building on Europe’s most comprehensive exposome cohorts covering the first 18 years of life.
- MIRACUM (Medical Informatics in Research and Care in University Medicine): A national German network of 10 University Hospitals to improve healthcare and strengthen Biomedical Informatics in Research and Education. Grant number: BMBF FKZ 01ZZ1801B.
- ENPADASI (German Institute of Human Nutrition, Max Delbrück Center for Molecular Medicine in the Helmholtz Association): The European Nutritional Phenotype Assessment and Data Sharing Initiative aimed at delivering an open access research infrastructure containing data from a wide variety of nutritional studies, ranging from mechanistic/interventions to epidemiological studies including a multitude of phenotypic outcomes, facilitating combined analyses.
- INTIMIC (Max Delbrück Center for Molecular Medicine in the Helmholtz Association): The Intestinal Microbiomics Knowledge Platform (INTIMIC) has the main objective of fostering studies on the microbiota, nutrition and health by assembling available knowledge of the microbiota and the other aspects (e.g. food science and metabolomics) that are relevant in the context of microbiome research in a FAIRyfied (findable, accessible, interoperable and reusable) fashion to the scientific community, and to share information with the various stakeholders.
- the BioSHaRE-EU Healthy Obese Project for the federated analysis of 10 European studies including data from the National Child Development Study.
- the BioSHaRE-EU Environmental Core Project for the federated analysis of data from 6 European studies including UK Biobank.
Consortia Setting up DataSHIELD Pilots
- International 100,000+ cohorts consortium: Large cohort studies involving hundreds of thousands of participants have been established or launched in several regions worldwide. Cohorts provide great value for studying diverse populations and key demographic subgroups, rare genotypes and exposures, and gene-environment interactions. Each cohort is constrained, however, by its size, ancestral origins, and geographical boundaries, which limit the subgroups, exposures, outcomes, and interactions it can examine. Linking data across large cohorts provides a vast digital resource of diverse data to address questions that none of these cohorts can answer alone, enhancing the value of each cohort and leveraging the enormous investments made in them to date.
- LITMUS (Liver Investigation: Testing Marker Utility in Steatohepatitis): funded by the European Innovative Medicines Initiative 2 Joint Undertaking, LITMUS brings together clinicians and scientists from prominent academic centres across Europe with companies from the European Federation of Pharmaceutical Industries and Associations (EFPIA). Their common goals are developing, validating and qualifying better biomarkers for testing NAFLD.
This project aims to use DataSHIELD to provide non-disclosive/controlled access to the LITMUS Project's genetic information, using the dsOmics module developed in collaboration with EUCAN-Connect and ATHLETE. Researchers requiring access to the LITMUS genetic information are able to perform remote analysis operations without needing to directly access highly confidential data, speeding advances in diagnosis and treatment of liver disease.
- LongITools: a European research project studying the interactions between the environment, lifestyle and health in determining the risks of chronic cardiovascular and metabolic diseases. LongITools is bringing together 25 European cohorts and studies.
- NFDI4Health: the National Research Data Infrastructure for Personal Health Data aims at enabling findability, accessibility, interoperability, and reusability of data generated in clinical trials, epidemiological, and public health studies in Germany to enhance collaboration among research communities while complying with privacy regulations and ethical requirements.
DataSHIELD Integration and Scoping Projects
- vantage6: priVAcy preserviNg federaTed leArninG infrastructurE for Secure Insight eXchange.
- BRISSKit (University of Leicester) integrated DataSHIELD with i2b2 data warehouse for biomedical data.
- ADRC England (University of Southampton) and ADRC Wales (Swansea University).
- AMASED project, in collaboration with F1000Research, scoped secure post-publication data analysis.
- AMASED project, in collaboration with the British Library, scoped secure text analysis in the digital humanities.