The Big Data Problem in Bioinformatics: Can We Re-Think?
The field of bioinformatics has seen a tremendous increase in emerging research, and it has been overwhelmed with new challenges in managing and computing the data. When we observe the results of the computing revolution in bioinformatics, we can appreciate that many critical deductions about health and its causative agents have come to the fore. Research in this domain has been very aggressive and has created a sea of findings that have been genuinely helpful to the healthcare industry, but at the cost of repetitive computing solutions, a multitude of architectures and, last but not least, a vast array of flat file formats.
The field of Biomedical Informatics can be defined "as the scientific field that deals with biomedical information, data, and knowledge – their storage, retrieval, and optimal use for problem-solving and decision making" (Shortliffe et al., 2001). The challenges in the field stem predominantly from the vast array of raw data and the evolving knowledge that results from the study of the genome and its manifestation.
In this field, most of the important issues relate to the following:
- Data Transfer
- Security and Privacy related Issues
- Storage: cloud data stores(?)
- And, most importantly, Deriving Valuable Information from the data
Now the last point redefines the field as a big data problem, and we can see that the process of deriving value will have to deal with the data at every stage.
When we investigate the developments in the field, we see a variety of cloud implementations with a great deal of repetition among the solutions; likewise, the GRID computing community produced several solutions, but none of them emerged as a gold standard.
Then we arrive at the computing challenge, which can predominantly be categorized as follows:
- Data Processing
- Data Management
- Data Infrastructure
The biggest question will be how to bring order to this chaos, choosing between:
- A Data-Centric Model
- A Computing-Centric Model
There is a need to change the way data is moved in this field. This need has already driven a shift from shipping data on external hard drives to using cloud infrastructure, but further innovations are required to handle a data transfer of, say, 15 petabytes.
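To see why a 15-petabyte transfer demands innovation rather than just bandwidth, a back-of-envelope calculation helps. The sketch below is illustrative only; the link speeds are assumptions, not measurements of any real infrastructure.

```python
# Back-of-envelope: how long would it take to move 15 PB over a network link?
# Link speeds here are hypothetical examples, not benchmarks.

PETABYTE = 10**15  # bytes (decimal petabyte)

def transfer_days(size_bytes: float, link_gbps: float) -> float:
    """Days needed to move size_bytes over an ideal link of link_gbps gigabits/s."""
    bytes_per_second = link_gbps * 10**9 / 8
    return size_bytes / bytes_per_second / 86400

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps link: {transfer_days(15 * PETABYTE, gbps):.0f} days")
```

Even a sustained, uncontended 10 Gbps link would need months for a transfer of this size, which is why the field keeps revisiting how, and whether, the data should move at all.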
The same call for innovation applies to cloud computing, where the following inhibitors need to be dealt with:
- Lack of Standards
- Data Integrity
- Data Recovery
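Data integrity in particular has a simple, widely used mitigation: checksum every file before and after it moves. The sketch below is a minimal illustration using Python's standard `hashlib`; the function names are my own, and the streaming chunk size is an arbitrary choice.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in 1 MiB chunks so large genomic files never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(local_path: str, expected_hex: str) -> bool:
    """Compare the locally computed digest with the one published by the sender."""
    return sha256_of(local_path) == expected_hex
```

A mismatch flags silent corruption during transfer or storage, which is exactly the integrity gap that cloud providers and data submitters both need to close.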
Let’s take the example of a de novo assembly algorithm for DNA data: it finds reads whose sequences overlap and records these overlaps as an assembly graph. For a large genome, the output of this process can consume terabytes of RAM, and completing the genome sequence can require weeks of computation on a supercomputer.
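To make the overlap-graph idea concrete, here is a toy sketch of the core step: finding suffix-to-prefix overlaps between reads. This is a naive pairwise version for illustration only; real assemblers use indexed data structures precisely because this brute-force approach cannot scale to whole genomes.

```python
from itertools import permutations

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a matching a prefix of b (>= min_len, else 0)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # candidate suffix start in a
        if start == -1:
            return 0
        if b.startswith(a[start:]):          # full suffix of a matches prefix of b
            return len(a) - start
        start += 1

def overlap_graph(reads, min_len=3):
    """Edges (a, b) -> overlap length for every ordered pair of reads that overlaps."""
    return {
        (a, b): olen
        for a, b in permutations(reads, 2)
        if (olen := overlap(a, b, min_len)) > 0
    }
```

Every read pair is compared in both directions, so the work grows quadratically with the number of reads while the graph itself grows with the number of overlaps, which is where the terabytes of RAM come from on a large genome.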
There are more than 2,000 sequencing instruments around the world, together generating more than 15 petabytes of genetic data a year. Managing patient-sensitive data at this scale, alongside the computing, becomes the biggest challenge for any cloud, especially since the trend shows that computing, not sequencing, has become the slower and costlier part of genomics research.
It has also become important to address the storing and querying of bioinformatics data, as this data is often heterogeneous in both structure and content. The issue is that the many query engines out there each try to enhance the experience for human interaction, but in doing so they introduce yet more new data formats.
If we look at the representation of any biological element, we observe that it is based on strings of characters, and therefore the data models will be based on string structures. The challenge, if we consider introducing a database here, would be to unify all the formats, be it the raw format or the reporting data. The bioinformatics pipeline results in several flat files that are storage-intensive, so there is a need to manage all this data intelligently to facilitate future usage.
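The flat-file problem is easy to illustrate with FASTA, one of the most common string-based formats in these pipelines. The minimal parser below shows how such a file reduces to (header, sequence) records, which is the shape any unifying database would have to ingest; it is a sketch, not a replacement for a production parser.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):          # new record: flush the previous one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:                         # sequence data may span many lines
            chunks.append(line)
    if header is not None:                 # flush the final record
        yield header, "".join(chunks)
```

Each format in the pipeline (alignments, variant calls, reports) needs its own such reduction to records before the data can be queried uniformly, which is exactly the unification burden described above.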
This calls for redefining and improving data processing across hundreds of thousands of servers and dealing with a new data infrastructure. To meet these challenges, we can adopt Google’s MapReduce programming model for distributed computing, with a major concentration on the fault-tolerance aspect. Another aspect is to expand the ecosystem with open-source software, distributed systems, and commodity hardware.
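The MapReduce model is easiest to see in miniature. The sketch below runs the map, shuffle and reduce phases in a single process to count k-mers in reads; in a real deployment each phase would be distributed across machines with the framework handling fault tolerance, and the k-mer-counting task is just an assumed example workload.

```python
from collections import defaultdict

def map_phase(read, k=3):
    """Map: emit (k-mer, 1) for every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each k-mer."""
    return {key: sum(values) for key, values in groups.items()}

def kmer_counts(reads, k=3):
    pairs = (pair for read in reads for pair in map_phase(read, k))
    return reduce_phase(shuffle(pairs))
```

Because each map call and each reduce call is independent, a failed worker can simply be re-executed on its input split, which is the fault-tolerance property that makes the model attractive at the scale of hundreds of thousands of servers.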
If we consider the database aspect, we can conclude that non-relational (NoSQL) databases are the alternative data store for this area, and that relaxing consistency toward eventual consistency can be a way forward. Query performance also depends heavily on the data model, so a design that supports many concurrent short queries, with automatic configuration, query planning and data organization, is called for. Data compression, distributed SQL and distributed storage (a shared-nothing approach, required for the privacy of data) are the minimum-level requirements for this field.

The innovation does not end there, as usability, user experience and process-related issues also need to be taken care of. The field therefore requires solutions that amalgamate the techniques of design thinking with critical thinking, and a serious rethinking of infrastructure in terms of privacy, security and accessibility is required.
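The shared-nothing idea mentioned above can be sketched in a few lines: each node owns its shard of the keyspace exclusively, so a record is stored and served by exactly one node. The class below is a toy in-process illustration under assumed names, not a real distributed store; it shows only the routing discipline.

```python
import hashlib

class ShardedStore:
    """Toy shared-nothing key-value store: each node owns its shard exclusively,
    so a given record lives on exactly one node and never crosses boundaries."""

    def __init__(self, n_nodes: int):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key: str) -> dict:
        # Stable hash so the same key always routes to the same node.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key: str, value):
        self._node_for(key)[key] = value

    def get(self, key: str):
        return self._node_for(key).get(key)
```

Because no node can see another node's shard, access control and privacy auditing can be enforced per node, which is one reason the shared-nothing approach suits patient-sensitive genomic data.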