Yesterday, we described the awesome power of the new petascale supercomputers, which are capable of performing more than one quadrillion calculations per second. But building these machines is just the beginning — it’s how they’re applied to the great scientific problems of today and the future that will define their legacy. Immense computational power is best used for an immense challenge, such as complex scientific simulations or enormous datasets that would cripple an everyday laptop. Traditionally, astronomy and physics have provided the majority of this kind of work, flush as they are with data collected by telescopes and particle colliders. But as the other three speakers at our Petascale Day event described, the disciplines of medicine, chemistry, and even business are entering a data-driven phase where they, too, can take advantage of petascale power.
In biology and medicine, the rise of genomics has been the primary driver of a newfound need for high-performance computing. Soon, the cost of sequencing an individual’s DNA will be a mere $1,000, opening up new horizons for the study of disease and individualized treatments. But all those DNA sequences are, in their purest sense, data, and not an inconsequential amount either, said Robert Grossman, professor of medicine at the University of Chicago Medical Center and Computation Institute Senior Fellow. Each individual’s genome will comprise roughly 1 terabyte of data, and to truly understand the nature of human genetic variation and its relevance to disease, some experts have predicted we will need to collect one million such genomes. That’s a total of 1,000 petabytes of data, a number that pushes into a new numerical prefix: an exabyte.
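The storage estimate above is easy to verify with back-of-the-envelope arithmetic; a minimal sketch, using only the two figures from the talk (roughly 1 terabyte per genome, one million genomes):

```python
# Back-of-the-envelope check of the Million Genome storage estimate.
# Both inputs come from Grossman's talk: ~1 TB per genome, 1 million genomes.
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18

genome_size_bytes = 1 * BYTES_PER_TB   # ~1 terabyte per individual genome
num_genomes = 1_000_000                # predicted collection size

total_bytes = genome_size_bytes * num_genomes
total_petabytes = total_bytes / BYTES_PER_PB
total_exabytes = total_bytes / BYTES_PER_EB

print(f"{total_petabytes:,.0f} PB, i.e. {total_exabytes:.0f} EB")
```

One million terabytes works out to 1,000 petabytes, which is exactly one exabyte — the next SI prefix up from peta.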
“We hope that that’s enough variation so that we can begin to understand how things work,” Grossman said. “In the end, we want to get to precision diagnosis, where we take your genome or the genome of your tumor and we make a diagnosis based on your genome, or precision treatment, where we look at your genome to find the pathology, and try to treat based on that.”
But in order to study all this genomic data, a new kind of infrastructure needs to be developed that can both handle the heavy load and protect the privacy of individual medical information. Grossman described two efforts he’s involved with: the Open Science Data Cloud, which currently works with projects such as earth science data collected by NASA satellites, and the BSD Center for Research Informatics at UChicago. He also issued an invitation to the computer scientists in the room to help him and other biomedical scientists find new ways of doing research within this rapidly growing field.
While biology looks to petascale computing for help with this flood of data, computational chemists are interested in the enhanced power to run incredibly complex molecular simulations. Jeff Hammond, an assistant computational scientist at Argonne National Laboratory, described the current state of models for molecular dynamics, where computers simulate the interactions between the individual atoms within a molecule over time. Using the current class of supercomputers, it takes an entire day for even the “simplest” of these programs to simulate the motion of thousands of atoms for one to ten nanoseconds (billionths of a second). The more complicated models of quantum behavior that Hammond uses in his research are even slower, so much so that he simply described the timescale as “LOL”. New programs that exploit the faster speeds and parallel architecture of petascale supercomputers are needed, Hammond said, to run these and even more ambitious chemical models.
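To see why even a few nanoseconds of simulated time costs a full day of computing, note that molecular dynamics advances in tiny integration timesteps. A rough sketch — the ~1 femtosecond timestep here is a typical textbook value, assumed for illustration, not a figure from Hammond’s talk:

```python
# Rough count of timesteps needed for a short molecular dynamics run.
# The ~1 fs integration timestep is a common textbook choice, assumed here;
# the talk gave only the overall one-day-per-few-nanoseconds figure.
FEMTOSECONDS_PER_NS = 10**6

timestep_fs = 1.0      # assumed integration timestep, in femtoseconds
trajectory_ns = 10     # target simulated time, in nanoseconds

n_steps = int(trajectory_ns * FEMTOSECONDS_PER_NS / timestep_fs)
print(f"{n_steps:,} timesteps to reach {trajectory_ns} ns")  # 10,000,000
```

Each of those ten million steps requires recomputing the forces among all the atoms before the next step can begin, which is why parallel hardware helps within a step but cannot simply skip ahead in time.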
Those are the types of computational problems that data-driven business researchers only dream of having at this point, said Svetlozar Nestorov, Computation Institute Fellow and the event’s final speaker. While other fields start to dip their toes in the petascale, business research is just now entering the terascale, he said, but with the expectation that higher-level data demands are not too far in the future. To illustrate the current situation, Nestorov presented the Nielsen dataset administered by the Kilts Center for Marketing. Collected by the private Nielsen Company, perhaps best known for their television ratings service, the four-terabyte dataset contains input from retail outlets and consumers about purchases and pricing over six years. Researchers interested in studying retail patterns or mashing the Nielsen data together with other datasets can apply to access the formerly private data, Nestorov said, but as with medical data there are certain restrictions and privacy considerations.
“With genomic data, you want to find the gene or few genes that cause a disease and tell people what they are,” Nestorov said. “Unfortunately here, the findings you come up with shouldn’t single out individual panelists or a product — you shouldn’t say a particular brand causes obesity.”
Those warnings were a welcome reminder that one size does not fit all when it comes to computational research, and that even petascale machines can be restrained by the rules of a much slower reality. But even these hurdles present an opportunity for increasingly multi-disciplinary collaboration, where experts from the worlds of biology, chemistry, business, or other fields can work with computer scientists to sculpt the petascale’s broad potential into practical, customized forms that push the science forward on all sides.
“This Million Genome Challenge is a fundamental challenge that I think is going to change the way we do big data,” Grossman said. “We don’t have the computer architectures, don’t have the analysis, we don’t have the information theory…we don’t have any of the fundamental scientists we need for big data. If you’re looking to have an impact, then big data and the Million Genome Challenge is a really interesting place to do that, because it’s wide open.”