1. What is DataSHIELD, and why use it?
(1) A full-likelihood-based individual person data (IPD) methodology that generates the same results as if the data from all sources were physically transferred to a central warehouse and analysed jointly. This may be called “virtual IPD” because the data are effectively analysed at the individual person level without being physically moved (even transiently) from their usual trusted repository.
(2) Centrally commanded study level meta-analysis (SLMA), sometimes called federated meta-analysis. This is equivalent to undertaking the required analysis in each study separately and then combining the resultant estimates and standard errors using conventional study level meta-analysis methods, based on either fixed or random effects.
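The combination step in approach (2) can be illustrated with a minimal sketch. It is not DataSHIELD code: the function names and the example numbers are hypothetical. It assumes inverse-variance weighting for the fixed-effect case and the DerSimonian-Laird estimator for the random-effects case, which are two conventional choices among several.

```python
import math


def fixed_effect(estimates, std_errors):
    """Inverse-variance weighted (fixed-effect) pooled estimate and its SE."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def random_effects_dl(estimates, std_errors):
    """DerSimonian-Laird random-effects pooled estimate, SE and tau^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled_fe, _ = fixed_effect(estimates, std_errors)

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (b - pooled_fe) ** 2 for w, b in zip(weights, estimates))
    k = len(estimates)
    c = total - sum(w ** 2 for w in weights) / total

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Re-weight each study by the inverse of (within + between) variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    total_star = sum(w_star)
    pooled = sum(w * b for w, b in zip(w_star, estimates)) / total_star
    return pooled, math.sqrt(1.0 / total_star), tau2


# Hypothetical per-study coefficients and standard errors, standing in for
# the non-disclosive summaries that each study would return to the analyst.
study_estimates = [0.30, 0.45, 0.25]
study_std_errors = [0.10, 0.15, 0.12]

print(fixed_effect(study_estimates, study_std_errors))
print(random_effects_dl(study_estimates, study_std_errors))
```

Only one estimate and one standard error per study enter the calculation, which is why this approach needs no individual-level data to leave any repository.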
2. Funding to date, and the development of functionality
3. Data preparation and data governance: non-DataSHIELD pre-requisites for an analysis under DataSHIELD
Although DataSHIELD currently provides a unique approach to the analysis of sensitive data, it is necessarily subject to the same pre-requisites as any valid approach to the analysis or co-analysis of research data, particularly sensitive health or social research data. Thus, any data to be processed via any method, including DataSHIELD, must satisfy two fundamental pre-requisites: (1) the data must be appropriately prepared from a scientific perspective (e.g. quality assured, cleaned and, if necessary, harmonized to ensure that corresponding data from different sources are inferentially equivalent); (2) all uses of the data under the proposed analysis must meet the relevant criteria specified by whatever data governance jurisdiction applies. Although these issues are in principle independent of the decision to use DataSHIELD, they are so important that we have included this section to address them. As touched on below, we anticipate that functionality to address key aspects of both issues will in future be built into DataSHIELD, but this is currently a work in progress.
The extent to which data must be prepared from a scientific perspective before commencing any analysis is entirely context-specific.
4. Project governance and sustainability
As the DataSHIELD project has progressed, the underlying concept and the software it offers have proven increasingly attractive to a wide range of current and prospective users. Over the last three years in particular, interest has grown very rapidly, forcing DataSHIELD to transition from a small self-governing research software project with limited ambitions to a much larger enterprise with an active world-wide community of adopters, users, contributors and committers (as defined by our current governance policy page). Although we are delighted by this success, it has introduced growing pains that require urgent attention. These reflect fundamental challenges relating to governance and sustainability that can only be addressed by restructuring the project as a whole. Presently the Principal Investigator sits as a ‘benevolent dictator’ as per the current governance model (Figure 2).
For the future scientific and technological wellbeing of the project, it became crucial that project governance transitioned from the ‘benevolent dictator’ model to a community-driven meritocratic model overseen by some form of consortium Steering Committee. The DataSHIELD Advisory Board was therefore established in 2020. This change was essential in order to continue to achieve scalable and sustainable engagement and strategic input across the increasingly global DataSHIELD community.
At present, we need to ensure sustainability of resource from the perspective of on-going funding. To date we have necessarily relied on a series of traditional research grants. But as the project has matured, it has become increasingly difficult to persuade traditional research funders that DataSHIELD development is a “research” activity. For some time we have been targeting infrastructural development funding, and we continue this quest in collaboration with the broader DataSHIELD and OBiBa communities. In parallel, however, we have also recognised that we need to think commercially.
Figure 2: DataSHIELD Project Governance Model
In the longer term, if DataSHIELD is to remain a viable and innovative product, its development costs will have to be covered by fees (in some sense) raised from those who wish to use it. Because DataSHIELD, Opal, R and Java are all open-source products, we see the primary route to user-derived resourcing as the provision of training, consultancy and support for implementation, use and data governance, coupled with a capacity to provide targeted extension of functionality to projects with particularly urgent needs. This could all be wrapped up in a package of different service contracts, with pricing determined by the level of service provided and the nature of the user (e.g. bona fide academic users vs health service users vs fully commercial pharmaceutical or biotechnology users). At the same time, it may be possible to develop some specialized add-ons to DataSHIELD as fully commercial products under licences that are more commercially permissive than our current licence (GNU GPLv3). For example, it has been proposed that this may be one route to resourcing the development of an easy-to-use graphical user interface.
In the shorter term, we now believe that we should consider a move towards a more commercial philosophy relatively quickly. Particularly in light of the infrastructural flexibility that has just been introduced through the “resources” capability, we believe that now is a perfect time to start exploring potential commercial interest in developing and supporting both a freeware-based “Community Edition” and a comprehensively supported “Professional Edition”. This would allow us to invest properly in the development of new large-scale applications for high-throughput ‘omics and for health and social care, including the pharmaceutical industry.