Physicists studying the results of tests at the Large Hadron Collider (LHC) at CERN, the European particle physics laboratory outside Geneva, have a lot of data to ponder. In fact, they have nearly 50% more than originally estimated. Initially, the LHC was expected to generate 15 petabytes of usable data each year. Recent reports have raised that number to more than 22 petabytes annually.
However, CERN says that its tests produce vastly more data than ever gets studied. Some experiments can create up to one petabyte of data per second. Luckily for the data storage managers, not to mention the tired-eyed physicists, on average all but 200 Mbytes per second of that petabyte is deemed “uninteresting data” and discarded by the system.
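To put those figures in perspective, here is a quick back-of-the-envelope sketch. It assumes decimal (SI) units, so a petabyte is 10^15 bytes and a megabyte is 10^6 bytes; the variable names are mine and purely illustrative, and the inputs are simply the round numbers quoted above.

```python
# Back-of-the-envelope check of the LHC figures quoted above.
# Assumption: decimal (SI) units; the article does not say which convention CERN uses.
PB = 10**15  # bytes in one petabyte
MB = 10**6   # bytes in one megabyte

# Annual usable data: original estimate vs. revised figure, in petabytes.
original_estimate_pb = 15
revised_estimate_pb = 22
increase_pct = (revised_estimate_pb - original_estimate_pb) / original_estimate_pb * 100

# Peak raw output vs. what is kept after filtering, in bytes per second.
raw_rate = 1 * PB
kept_rate = 200 * MB
reduction_factor = raw_rate // kept_rate  # how many bytes are produced per byte kept
kept_fraction = kept_rate / raw_rate      # share of the raw stream that survives

print(f"Increase over original estimate: {increase_pct:.1f}%")
print(f"Reduction factor: {reduction_factor:,} to 1")
print(f"Fraction kept: {kept_fraction:.7%}")
```

In other words, under these assumptions roughly one byte in five million survives the filter, and the revised annual total is about 47% above the original estimate.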
Big Data quantities like those at the LHC helped prompt the U.S. government late last month to announce $200 million in research and development funds specifically for scientists confronting the data deluge. According to a statement from the White House, the funds are necessary to develop the technologies capable of “managing, analyzing, visualizing, and extracting useful information from large and diverse data sets.” It is hoped that this investment in Big Data management “will accelerate scientific discovery and lead to new fields of inquiry that would otherwise not be possible.”
I am optimistic that those of us on the technology side will be up to the task of handling the data needs of science. But even I was daunted by recent news of the Square Kilometre Array (SKA). Headquartered in Manchester, UK, with a target completion date in 2024, the proposed 20-nation radio telescope research project dwarfs all other Big Data initiatives I’ve seen so far. This single project is currently estimated to produce one exabyte of data every day, or the equivalent of six weeks’ worth of the total volume of data traversing the Internet in 2011.
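For a sense of what an exabyte a day means as a sustained rate, here is a rough sketch. It again assumes decimal (SI) units and takes the article's "six weeks" comparison at face value as 42 days; the names are illustrative, and the results are no more precise than the estimates that feed them.

```python
# Rough conversion of the SKA estimate quoted above into a per-second rate.
# Assumption: decimal (SI) units, and "six weeks" read as exactly 42 days.
EB = 10**18  # bytes in one exabyte
PB = 10**15  # bytes in one petabyte
TB = 10**12  # bytes in one terabyte
SECONDS_PER_DAY = 24 * 60 * 60

# One exabyte per day expressed as terabytes per second.
ska_bytes_per_second = EB / SECONDS_PER_DAY
ska_tb_per_second = ska_bytes_per_second / TB

# If one exabyte equals six weeks of 2011 Internet traffic,
# the implied average daily Internet volume in 2011 is:
implied_internet_pb_per_day = (EB / 42) / PB

print(f"SKA sustained rate: {ska_tb_per_second:.1f} TB per second")
print(f"Implied 2011 Internet traffic: {implied_internet_pb_per_day:.1f} PB per day")
```

Taken at face value, that is more than 11 terabytes arriving every second, around the clock.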
Large-scale, multinational science collaborations such as the LHC and SKA are relatively few in number, and they confront Big Data problems on a scale few of us can imagine. But they bear watching, because almost all of us can learn from their Big Data solutions.