E.9 Data Management

Similar to other large-scale genome projects, the ICGC will require a Data Coordination Center (DCC) that is well integrated with ICGC operations at the participating centers and with the ICGC Governance and Scientific Coordination bodies. This will require a comprehensive data management system that is designed to:
  • provide secure and reliable mechanisms for the sequencing centers, biorepositories, histopathology groups, and other ICGC participants to upload their data;
  • track data sets as they are uploaded and processed, and perform basic integrity checks on those sets;
  • allow regular audit of the project in order to provide high-level snapshots of the consortium's status;
  • perform more sophisticated quality control checks of the data itself, such as checks that the expected sequencing coverage was achieved, or that when a somatic mutation is reported in a tumor, the sequence at the reported position differs in the matched normal tissue (a minimal sketch of such a check appears after this list);
  • enable the distribution of the data to long-lived public repositories of genome-scale data, including sequence trace repositories, microarray repositories and genome browsers;
  • provide each public repository with the essential metadata needed to make the data interpretable;
  • facilitate the integration of the data with other public resources, by using widely-accepted ontologies, file formats and data models;
  • manage an ICGC data portal that provides researchers with access to the contents of all franchise databases and provides project-wide search and retrieval services.
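As an illustration of the automated quality control described in the list above, the following is a minimal sketch in Python of the somatic-mutation check: a mutation reported in a tumor should not be supported by the matched normal, and both samples should meet a coverage threshold. The record layout, thresholds and the function name check_somatic_call are assumptions made for illustration, not part of any ICGC specification.

```python
# Minimal sketch of a QC check: a somatic mutation reported in the tumor
# should not be supported by the matched normal at the same position, and
# both samples should meet a minimum coverage threshold.
# Record layout and thresholds below are illustrative assumptions.

MIN_COVERAGE = 20               # assumed minimum read depth at the site
MAX_NORMAL_ALT_FRACTION = 0.02  # assumed tolerance for sequencing noise

def check_somatic_call(call, normal_base_counts):
    """Return a list of QC failure messages (empty list = pass).

    call               -- dict with 'chrom', 'pos', 'ref', 'alt', 'tumor_depth'
    normal_base_counts -- dict mapping base -> read count in the matched normal
    """
    failures = []
    normal_depth = sum(normal_base_counts.values())

    if call["tumor_depth"] < MIN_COVERAGE:
        failures.append("tumor coverage below threshold")
    if normal_depth < MIN_COVERAGE:
        failures.append("normal coverage below threshold")

    # The reported somatic allele should be absent (or within noise) in the normal.
    alt_in_normal = normal_base_counts.get(call["alt"], 0)
    if normal_depth and alt_in_normal / normal_depth > MAX_NORMAL_ALT_FRACTION:
        failures.append("reported somatic allele also present in matched normal")

    return failures


if __name__ == "__main__":
    call = {"chrom": "17", "pos": 7578406, "ref": "C", "alt": "T", "tumor_depth": 55}
    normal = {"C": 48, "T": 0, "A": 0, "G": 1}
    print(check_somatic_call(call, normal) or "PASS")
```

In practice such checks would be packaged with the franchise database software and run automatically as data are loaded, but the underlying test is no more complicated than the comparison shown here.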
The ICGC data management system will be required to provide the following support to experimental biologists, computational biologists, and other researchers:
  • support for hypothesis-driven research: The system should support small-scale queries that involve a single gene at a time, a short list of genes, a single specimen, or a short list of specimens. The system must provide researchers with interactive tools for identifying specimens of interest, finding what data sets are available for those specimens, selecting data slices across those specimens (e.g., counts of the number of somatic mutations observed in a region within the UTR of a gene of interest; a sketch of such a slice appears after this list), and running basic analytic tests on those data slices;
  • support for computational biologists: The system should allow large subsets, or even the entire ICGC dataset, to be downloaded;
  • enforce ICGC and legislative policies for protecting the confidentiality of tissue donors by restricting access to protected data to duly authorized users.
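To make the "data slice" idea in the hypothesis-driven-research bullet concrete, here is a minimal sketch of such a query against a toy in-memory table of somatic mutation calls. The field names, specimen identifiers and genomic coordinates are invented for the example; a production portal would run an equivalent query against the coordination backend described below.

```python
# Illustrative "data slice": count somatic mutations per specimen within a
# genomic region (e.g., the UTR of a gene of interest).  All records and
# coordinates here are invented for the example.

from collections import Counter

mutations = [
    {"specimen": "DO-0001", "chrom": "12", "pos": 25398284, "gene": "KRAS"},
    {"specimen": "DO-0001", "chrom": "12", "pos": 25403865, "gene": "KRAS"},
    {"specimen": "DO-0002", "chrom": "12", "pos": 25362350, "gene": "KRAS"},
]

def count_mutations_in_region(records, chrom, start, end, specimens=None):
    """Count mutations per specimen falling inside [start, end] on chrom."""
    counts = Counter()
    for rec in records:
        if specimens is not None and rec["specimen"] not in specimens:
            continue
        if rec["chrom"] == chrom and start <= rec["pos"] <= end:
            counts[rec["specimen"]] += 1
    return counts

# Count hits in an (assumed) UTR interval for two specimens of interest.
print(count_mutations_in_region(mutations, "12", 25362100, 25362850,
                                specimens={"DO-0001", "DO-0002"}))
```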
Each data producer will manage its own workflow and be responsible for primary QC, data integrity and protection of confidential information. A common core of ICGC data intended for integration and redistribution will be shared with the research community via local “franchise databases” which share a common data model and structure. The franchise database software (schema, integrity-checking utilities, load and dump utilities) will be written by the DCC and managed by the data producers. Under this architecture, ICGC participants can develop their own project-specific data models, workflows, and databases. At regular intervals, a subset of the information contained in the project-specific databases will be exported into a local ICGC franchise database, which will implement a uniform simplified data model that captures the essential data elements needed to implement ICGC-wide policies on data release, quality control and milestones. The franchise will also include a set of standardized validation and quality control tools, developed and deployed by the ICGC coordinating body, that are used to verify that the information placed in the franchise database is complete and internally consistent.
In order to provide the research community with a single portal into the entire ICGC data set, a coordination backend database will act as the union of all the franchise databases. The coordination backend will use the same data model as the individual franchise databases, but will appear to users as though it contains all the ICGC data in one place. This effect can be achieved either via a physical mirroring process, in which the coordination backend pulls in copies of each of the franchise databases at regular intervals, or via a passthrough system, in which queries directed at the coordination backend are multiplexed among the individual franchise databases.
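The sketch below illustrates, in simplified Python, the passthrough variant of the coordination backend: a query submitted to the backend is fanned out to the individual franchise databases and the partial results are merged, so users see the union of all franchise data without it being physically centralized. The class and function names, and the representation of each franchise as a callable, are assumptions made purely for illustration.

```python
# Simplified passthrough coordination backend: multiplex a query across the
# franchise databases and merge the partial results, preserving provenance.

class CoordinationBackend:
    def __init__(self, franchises):
        # franchises: mapping of center name -> callable(query) -> list of rows
        self.franchises = franchises

    def query(self, query):
        merged = []
        for center, run_query in self.franchises.items():
            for row in run_query(query):
                merged.append(dict(row, source_center=center))  # keep provenance
        return merged


# Toy per-center query functions standing in for real franchise databases.
def franchise_a(query):
    return [{"gene": query["gene"], "specimen": "A-01", "mutations": 3}]

def franchise_b(query):
    return [{"gene": query["gene"], "specimen": "B-07", "mutations": 1}]

backend = CoordinationBackend({"center_a": franchise_a, "center_b": franchise_b})
print(backend.query({"gene": "TP53"}))
```

Under the alternative mirroring approach, the same interface would simply be backed by periodically refreshed local copies of each franchise database rather than by live fan-out.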
Figure 2: ICGC data coordination as a franchise system
The community will obtain access to the ICGC data via one or more front ends (e.g., websites) that will provide an interface to the coordination backend. In addition, all project data will be submitted to the appropriate public repositories. The exact path that the data will take from the group that generates it to the public repository will be flexible. For some data types it would be appropriate for the ICGC participant to submit the information directly from its internal workflow system. In other cases, it might be appropriate for the information to be submitted from the franchise database or from the coordination database itself. This architecture provides the flexibility to allow certain specialized ICGC data types -- microarray CEL files, raw short sequence reads, details of tumor-specific treatment regimens, histology slide images -- to be submitted directly to the appropriate archive without bottlenecking through a central coordinating center or generic data model. Nevertheless, whichever path the detailed data takes to the repository, the tracking information needed to connect the sample data to that detailed information will be captured by the franchise database and made available to researchers via the coordination backend. Box 8 includes additional recommendations with respect to requirements for data storage, analysis, distribution and protection.
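The tracking information mentioned above, which links a specimen in the franchise database to detailed data deposited directly in an external archive, could be as simple as a record like the one sketched below. The field names, accession string and repository name are illustrative assumptions rather than an agreed ICGC schema.

```python
# Illustrative tracking record connecting an ICGC specimen to detailed data
# (e.g., raw reads or slide images) deposited directly in an external archive.
# Field names and accession values are invented for the example.

from dataclasses import dataclass

@dataclass
class ArchiveLink:
    specimen_id: str   # ICGC specimen identifier
    data_type: str     # e.g., "raw short reads", "histology slide image"
    repository: str    # external archive holding the detailed data
    accession: str     # accession assigned by that archive
    submitted_by: str  # path the data took (producer workflow, franchise, DCC)

link = ArchiveLink(
    specimen_id="DO-0001",
    data_type="raw short reads",
    repository="short read archive",
    accession="ERA000123",
    submitted_by="data producer workflow",
)
print(link)
```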
Box 8. Additional guidelines for ICGC data management and security
  • Quality standards: Periodic quality assurance exercises, such as round-robin validation experiments, should be coordinated and interpreted by the DCC. The results of these validation exercises will be made available via the ICGC data portal.
  • Public and protected tiers: A binary system shall apply to portions of the data such that a datum is either public, meaning that all end-users can gain access to it, or protected, meaning that access is only available to authorized researchers who have agreed to protect patient confidentiality.
  • Multilateral authorization: The ICGC should have multiple bodies that can authorize a researcher to gain access to protected data as per IDAC Policies. Once a researcher is authorized by any of these bodies, he or she should be granted access to all protected ICGC data, regardless of which collaborator generated it or which country the data resides in.
  • Other portals: The ICGC should encourage the redistribution, integration and visualization of the data by community bioinformatics portals. However, portals that provide access to protected data sets must agree to respect and to implement ICGC's authentication and authorization standards for protection of patient confidentiality.
  • Submission to archival repositories: The unprotected portion of the data should be submitted to public data repositories as rapidly as possible after passing QC and other verification tests.
  • Use of community standards: Whenever possible, the ICGC coordinating center and participating data acquisition groups should represent data sets using existing community file formats, ontologies and other standards.
  • Analysis services: Analysis and data aggregation services, which may be deployed against the ICGC data sets, will sometimes need to be co-located with the primary data in order to provide acceptable performance. In the event that a primary data set resides in a public archive, such as the short read archive, this will require coordination between the ICGC and the archive managers.
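A minimal sketch of how the binary public/protected tiers and multilateral authorization described in Box 8 might be enforced at a portal is shown below. The tier labels, the notion of an authorization registry, and the function names are assumptions for illustration only; the actual authorization mechanism would be defined by ICGC and IDAC policies.

```python
# Illustrative access check for the two-tier (public/protected) model:
# public data is open to everyone; protected data requires that the user has
# been authorized by at least one recognized data access body.
# Names and the registry structure are assumptions for this sketch.

AUTHORIZING_BODIES = {"dac_country_a", "dac_country_b"}  # assumed bodies

# Assumed registry: user -> set of bodies that have authorized that user.
authorizations = {
    "alice": {"dac_country_a"},
    "bob": set(),
}

def can_access(user, data_tier):
    """Return True if the user may access a datum of the given tier."""
    if data_tier == "public":
        return True
    if data_tier == "protected":
        # Authorization by any recognized body grants access to all protected data.
        return bool(authorizations.get(user, set()) & AUTHORIZING_BODIES)
    raise ValueError("unknown data tier: %r" % data_tier)

print(can_access("alice", "protected"))  # True
print(can_access("bob", "protected"))    # False
print(can_access("bob", "public"))       # True
```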