Similar to other large-scale genome projects, the ICGC will require a Data Coordination Center (DCC) that is well integrated with ICGC operations at participating centers, and ICGC Governance and Scientific Coordination bodies. This requires a comprehensive management system that is designed to:
The ICGC data management system will be required to provide the following support to experimental biologists, computational biologists, and other researchers:
Each data producer will manage its own workflow and be responsible for primary QC, data integrity and protection of confidential information. A common core of ICGC data intended for integration and redistribution will be shared with the research community via local “franchise databases” that share a common data model and structure. The franchise database software (schema, integrity-checking utilities, load and dump utilities) will be written by the DCC and managed by the data producers. Under this architecture, ICGC participants can develop their own project-specific data models, workflows, and databases. At regular intervals, a subset of the information contained in the project-specific databases will be exported into a local ICGC franchise database, which will implement a uniform, simplified data model that captures the essential data elements needed to implement ICGC-wide policies on data release, quality control and milestones. The franchise will also include a set of standardized validation and quality control tools, developed and deployed by the ICGC coordinating body, that are used to verify that the information placed in the franchise database is complete and internally consistent.
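The completeness and internal-consistency checks described above can be illustrated with a minimal Python sketch. The record types (Donor, Sample), their fields, and the function validate_franchise are hypothetical names chosen for illustration only; the actual franchise data model and validation tools are not specified in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Donor:
    # Hypothetical minimal donor record in a franchise database.
    donor_id: str
    project: str


@dataclass(frozen=True)
class Sample:
    # Hypothetical minimal sample record; donor_id must point at a Donor.
    sample_id: str
    donor_id: str
    tumour_type: str


def validate_franchise(donors: Iterable[Donor], samples: Iterable[Sample]) -> List[str]:
    """Return a list of problems; an empty list means the export passed."""
    problems: List[str] = []

    # Completeness and uniqueness checks on donors.
    known_donors = set()
    for donor in donors:
        if not donor.donor_id or not donor.project:
            problems.append(f"donor record incomplete: {donor!r}")
        elif donor.donor_id in known_donors:
            problems.append(f"duplicate donor_id: {donor.donor_id}")
        else:
            known_donors.add(donor.donor_id)

    # Completeness, uniqueness and referential-consistency checks on samples.
    seen_samples = set()
    for sample in samples:
        if not sample.sample_id or not sample.tumour_type:
            problems.append(f"sample record incomplete: {sample!r}")
        if sample.sample_id in seen_samples:
            problems.append(f"duplicate sample_id: {sample.sample_id}")
        seen_samples.add(sample.sample_id)
        if sample.donor_id not in known_donors:
            problems.append(
                f"sample {sample.sample_id} references unknown donor {sample.donor_id}"
            )
    return problems
```

In a scheme like this, a data producer would run the check after each periodic export and load the franchise database only when the returned list is empty.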
In order to provide the research community with a single portal into the entire ICGC data set, a coordination backend database will act as the union of all the franchise databases. The coordination backend will use the same data model as the individual franchise databases, but will appear to users as though it contains all the ICGC data in one place. This effect can be achieved either via a physical mirroring process in which the coordination backend pulls in copies of each of the franchise databases at regular intervals, or via a passthrough system in which queries directed at the coordination backend are multiplexed among the individual franchise databases.
Figure 2: ICGC data coordination as a franchise system
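The two options for the coordination backend, physical mirroring and query passthrough, can be contrasted in a short Python sketch. All class and method names here (FranchiseDB, MirroringBackend, PassthroughBackend, refresh, query) are assumptions made for illustration; the franchise databases are modelled as in-memory lists rather than real database servers.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


class FranchiseDB:
    """Stand-in for one data producer's franchise database."""

    def __init__(self, name: str, records: Iterable[Record]) -> None:
        self.name = name
        self._records: List[Record] = list(records)

    def add(self, record: Record) -> None:
        self._records.append(record)

    def query(self, predicate: Predicate) -> List[Record]:
        return [r for r in self._records if predicate(r)]

    def dump(self) -> List[Record]:
        # Return copies so a mirror cannot alter the franchise's own data.
        return [dict(r) for r in self._records]


class MirroringBackend:
    """Pulls full copies of every franchise at intervals; queries hit the copy."""

    def __init__(self, franchises: Iterable[FranchiseDB]) -> None:
        self._franchises = list(franchises)
        self._copy: List[Record] = []

    def refresh(self) -> None:
        self._copy = [r for f in self._franchises for r in f.dump()]

    def query(self, predicate: Predicate) -> List[Record]:
        return [r for r in self._copy if predicate(r)]


class PassthroughBackend:
    """Holds no data; fans each query out to every franchise and merges results."""

    def __init__(self, franchises: Iterable[FranchiseDB]) -> None:
        self._franchises = list(franchises)

    def query(self, predicate: Predicate) -> List[Record]:
        results: List[Record] = []
        for franchise in self._franchises:
            results.extend(franchise.query(predicate))
        return results
```

The sketch makes the trade-off visible: the mirror answers from local data but is only as current as its last refresh, whereas the passthrough backend is always current but depends on every franchise database being reachable at query time.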
The community will obtain access to the ICGC data via one or more front ends (e.g., websites) that will provide an interface to the coordination backend. In addition, all project data will be submitted to the appropriate public repositories. The exact path that the data will take from the group that generates it to the public repository will be flexible. For some data types it would be appropriate for the ICGC participant to submit the information directly from their internal workflow system. In other cases, it might be appropriate for the information to be submitted from the franchise database or from the coordination database itself. This architecture provides the flexibility to allow certain specialized ICGC data types (microarray CEL files, raw short sequence reads, details of tumor-specific treatment regimens, histology slide images) to be submitted directly to the appropriate archive without bottlenecking through a central coordinating center or generic data model. Nevertheless, whichever path the detailed data takes to the repository, the tracking information needed to connect the sample data to that detailed information will be captured by the franchise database and made available to researchers via the coordination backend.
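One way to picture the tracking information mentioned above is as a small link record kept in the franchise database for every archive submission, independent of the path the submission took. The following Python sketch is illustrative only: ArchiveLink, SubmissionPath, links_for_sample and all field names are hypothetical, and repository names and accessions would be whatever the receiving archive actually assigns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class SubmissionPath(Enum):
    # The three submission routes described in the text.
    PRODUCER_WORKFLOW = "producer_workflow"
    FRANCHISE_DB = "franchise_db"
    COORDINATION_DB = "coordination_db"


@dataclass(frozen=True)
class ArchiveLink:
    # Hypothetical tracking record connecting a sample to archived detail data.
    sample_id: str
    data_type: str
    repository: str
    accession: str
    submitted_via: SubmissionPath


def links_for_sample(links: Iterable[ArchiveLink], sample_id: str) -> List[ArchiveLink]:
    """Return every archive link recorded for one sample, in input order."""
    return [link for link in links if link.sample_id == sample_id]
```

Because the record stores only identifiers and the route taken, the bulky detail data never has to pass through the franchise or coordination databases for a researcher to locate it.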
Box 8 includes additional recommendations with respect to requirements for data storage, analysis, distribution and protection.
Quality standards: Periodic quality assurance exercises, such as round-robin validation experiments, should be coordinated and interpreted by the DCC. The results of these validation exercises will be made available via the ICGC data portal.
Public and protected tiers: A binary system shall apply to portions of the data such that a datum is either public, meaning that all end-users can gain access to it, or protected, meaning that access is only available to authorized researchers who have agreed to protect patient confidentiality.
Multilateral authorization: The ICGC should have multiple bodies that can authorize a researcher to gain access to protected data as per IDAC Policies. Once a researcher is authorized by any of these bodies, he or she should be granted access to all protected ICGC data, regardless of which collaborator generated it or which country the data resides in.
Other portals: The ICGC should encourage the redistribution, integration and visualization of the data by community bioinformatics portals. However, portals that provide access to protected data sets must agree to respect and to implement ICGC's authentication and authorization standards for protection of patient confidentiality.
Submission to archival repositories: The unprotected portion of the data should be submitted to public data repositories as rapidly as possible after passing QC and other verification tests.
Use of community standards: Whenever possible, the ICGC coordinating center and participating data acquisition groups should represent data sets using existing community file formats, ontologies and other standards.
Analysis services: Analysis and data aggregation services, which may be deployed against the ICGC data sets, will sometimes need to be co-located with the primary data in order to provide acceptable performance. When a primary data set resides in a public archive, such as the Short Read Archive, this co-location will require coordination between the ICGC and the archive managers.
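The "Public and protected tiers" and "Multilateral authorization" recommendations above combine into a simple access rule, sketched here in Python. The names Tier, AuthorizationRegistry and can_access are hypothetical, as are the identifiers used for authorizing bodies; the sketch assumes only what the recommendations state, namely that each datum is either public or protected and that approval by any one recognized body grants access to all protected data.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Tier(Enum):
    PUBLIC = "public"
    PROTECTED = "protected"


class AuthorizationRegistry:
    """Shared record of which researchers have been authorized, and by whom."""

    def __init__(self, authorizing_bodies: Iterable[str]) -> None:
        self._bodies = frozenset(authorizing_bodies)
        self._authorized: Dict[str, str] = {}  # researcher -> authorizing body

    def authorize(self, researcher: str, body: str) -> None:
        if body not in self._bodies:
            raise ValueError(f"{body!r} is not a recognized authorizing body")
        # The first authorization is kept; any single body is sufficient.
        self._authorized.setdefault(researcher, body)

    def is_authorized(self, researcher: str) -> bool:
        return researcher in self._authorized


def can_access(tier: Tier, researcher: Optional[str], registry: AuthorizationRegistry) -> bool:
    """Public data is open to everyone; protected data needs any one authorization."""
    if tier is Tier.PUBLIC:
        return True
    return researcher is not None and registry.is_authorized(researcher)
```

Note that the access decision never looks at which collaborator generated the data or where it is hosted, which is the property the multilateral authorization recommendation asks for. A community portal that redistributes protected data would need to consult the same registry.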