New ways to bridge the sea of big data science

Led by researchers at the Harvard Stem Cell Institute (HSCI) and the University of Oxford, UK, more than 50 collaborators at over 30 scientific organizations around the globe have agreed on a common standard that will make it possible to describe consistently the enormous and radically different databases compiled in fields ranging from genetics to stem cell science to environmental studies.

The new standard gives scientists in widely disparate fields a way to coordinate their findings by allowing the reams of data produced by modern, technology-driven science to be combined behind the scenes.

Biomedical research is generating increasingly complex sets of data that are growing at exponential rates. Modern biology is turning into a highly data-intensive field as scientists need to work across the traditional boundaries of formal disciplines and laboratory silos. That growth, however, needs to be matched by the capacity to store, analyze, integrate, and share the experimental results so that data and information can become useful knowledge. An additional challenge facing stem cell research derives from inconsistencies in experimental procedures and the resultant discoveries, which often make it difficult or impossible to compare the results of one experiment with those of another.

“We are now working together to provide the means to manage enormous quantities of otherwise incompatible data, ranging from the biomedical to the environmental,” says Susanna-Assunta Sansone, PhD, Team Leader of the project at the University of Oxford’s Oxford e-Research Centre.

This standard-compliant data-sharing effort and the establishment of its online presence, the ISA Commons (www.isacommons.org), were described in a Commentary published in the journal Nature Genetics.

It was necessary to establish common data standards, said the Commentary’s authors, because of the tsunami of data and technologies washing over the sciences. “There are hundreds of new technologies coming along but also many ways to describe the information produced,” Sansone said, noting that “we can take a jigsaw puzzle of different sciences and now fit the many pieces together to form a complete picture.”

“An example of how this works at the Harvard Stem Cell Institute is that we can now find a relationship between experiments involving, for example, normal blood stem cells in fish and cancers in children,” says Winston Hide, PhD, Director of HSCI’s new Center for Stem Cell Bioinformatics, and an Associate Professor of Bioinformatics at the Harvard School of Public Health.

A comprehensive data analysis platform is especially needed at a multi-organizational institute such as HSCI, where the data sets generated by the many projects are produced in different places and by different researchers. The plan is for the Center to provide a data repository, a tightly linked analytical engine, and a services and consulting operation that together will address the current and future needs of the Institute. The system will build on the successful prototypes from HSCI’s Blood and Cancer program projects and will be released across the Institute.

More broadly, the infrastructure and public HSCI data will subsequently be provided to the scientific community as a shared data resource for knowledge exchange. The system will make it possible to compare any experiment with others on the basis of shared reagents, descriptions, scientific domains, cell phenotypes, molecular phenotypes, genomic locations, genes, pathways, and networks. Data analysis from these new perspectives and at this scale has the potential to make current projects more efficient and to open the door to new projects by revealing new avenues of research and discovery.