July 1, 2014 in Blog by tahera
By Tahera Zabuawala, PhD
In the previous blog I discussed the explosive growth of DNA sequencing technology; in this article I will shed light on the downstream bottlenecks. To take the car analogy again, an engine with immense horsepower alone cannot move the car forward unless an efficient transmission delivers the engine’s power to the wheels. Similarly, although we can now sequence DNA deeper, faster and cheaper than ever before, unless we intelligently interpret the sequenced DNA into accurate, easily digestible and actionable results, we cannot use it to develop a roadmap for enhancing and extending human life.
Several steps are involved in bringing clinical value to the millions of ‘ATGC’s sequenced. The first step is to align the sequence to a reference genome and annotate variants. Advanced alignment tools are available that accurately align a sample genome to a reference genome, and sophisticated algorithms differentiate and annotate variant types to deliver variant calls representative of the true nature of the sample. Accuracy in alignment really matters so that real biology can be discerned from noise in the data. The current public reference (hg19) contains the minor allele at over a million positions, a significant fraction of which are associated with disease (e.g. Factor V Leiden, rs6025). This can cause false positives (the individual carries the major allele, which is absent from the reference, so a variant is called even though the individual has the common allele) and false negatives (the individual carries a minor allele that is present in the reference, so no variant is called). Ethnic biases in the reference genome should also be accounted for during variant calling.
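To make the reference-minor-allele pitfall concrete, here is a minimal Python sketch that re-checks variant calls against a table of positions where the reference carries the minor allele. The data structures, field names and coordinates are illustrative assumptions, not a real curated resource or production pipeline.

```python
# Minimal sketch: flagging sites where the reference genome carries the
# minor allele, so "no variant called" cannot be read as "major allele".
# Coordinates and alleles below are illustrative placeholders.

# Positions (chrom, pos) where the hg19 reference base is the minor allele,
# keyed to the clinically relevant major allele (rs6025 as an example).
REFERENCE_MINOR_SITES = {
    ("chr1", 169519049): {"rsid": "rs6025", "major_allele": "C"},
}

def review_calls(variant_calls, sites=REFERENCE_MINOR_SITES):
    """variant_calls: dict mapping (chrom, pos) -> observed alternate allele,
    i.e. positions where the sample differs from the reference.
    Returns human-readable flags for manual review."""
    flags = []
    for locus, info in sites.items():
        if locus not in variant_calls:
            # Sample matches the reference here, but the reference is the
            # minor allele: a potential false negative for the risk allele.
            flags.append(f"{info['rsid']}: matches reference minor allele "
                         f"-- possible false negative")
        elif variant_calls[locus] == info["major_allele"]:
            # A 'variant' was called, but the sample simply carries the
            # common major allele: a potential false positive.
            flags.append(f"{info['rsid']}: call is the major allele "
                         f"-- likely benign / false positive")
    return flags

# Example: a sample with no call at the site (possible false negative).
print(review_calls({}))
```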
The next step is to analyze and interpret the variants by incorporating prior scientific and clinical knowledge, and to deliver the results in a concise, intuitive and actionable report. This step is complicated, nuanced and multifaceted. Primary literature, manual curation and expert opinion are used to compare an individual’s variants against a repository of biomedically important variants, filter out noise and derive clinically tangible interpretations.
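As a simplified illustration of this filtering step, the sketch below matches called variants against a tiny, hand-curated knowledge base and emits report entries only for variants with known clinical significance. The knowledge-base contents and field names are assumptions chosen for illustration, not a real curated database.

```python
# Sketch: reducing a raw variant list to a short, clinically oriented report
# by joining against a curated knowledge base. Entries are illustrative only.

CURATED_KNOWLEDGE_BASE = {
    "rs6025": {
        "gene": "F5",
        "significance": "pathogenic",
        "interpretation": "Factor V Leiden; increased venous thrombosis risk",
    },
    # ... in practice, thousands of additional expert-curated entries ...
}

def build_report(called_variants, kb=CURATED_KNOWLEDGE_BASE):
    """called_variants: iterable of rsIDs observed in the sample.
    Returns report lines for variants with curated clinical evidence;
    everything else is treated as noise or of unknown significance."""
    report = []
    for rsid in called_variants:
        entry = kb.get(rsid)
        if entry and entry["significance"] == "pathogenic":
            report.append(f"{entry['gene']} ({rsid}): {entry['interpretation']}")
    return report

print(build_report(["rs6025", "rs0000000"]))  # unknown IDs are filtered out
```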
As one can sense, turning the genomics vision into reality in the clinic involves an exponential data explosion – extracting, processing, analyzing and storing large volumes of data. An economically viable way to carry out such an operation arrived with the advent of Big Data technology.

The Big Data phenomenon came into existence with the explosion of social and digital media in our lives. Companies like Google, Facebook and Twitter revolutionized the way we interact and communicate. The internet became the information highway, and before we realized it, everyone and everything was connected to it. Companies started to realize that the data overload was in fact a treasure trove of valuable information that could be monetized, and Hadoop came into existence. The idea was quite simple: instead of building one monster machine (like the mainframes of the past), why not tie a bunch of computers together and make them work like one big machine, with each one sharing the workload, performing its tasks and finally providing one unified output? First developed as a project at Yahoo, Hadoop has now become the go-to platform for Big Data use cases. Big Data does not just refer to large volumes of data, usually in petabytes; it is also defined by the type of data and the frequency at which it is generated, commonly referred to as the 3 Vs (Volume, Variety, Velocity). The same 3 Vs are very relevant in personalized medicine, and hence Hadoop offered the much awaited, economically viable solution for bringing biological meaning to genomic data sets.
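To give a flavor of the divide-and-combine idea behind Hadoop’s MapReduce model, here is a toy Python sketch that spreads a counting task across worker processes and merges the partial results into one output. It is a conceptual, single-machine illustration under assumed toy inputs, not Hadoop itself.

```python
# Toy illustration of the MapReduce idea: split the work, process chunks in
# parallel, then combine the partial results into one unified output.
from collections import Counter
from multiprocessing import Pool

def map_chunk(reads):
    """'Map' step: each worker counts base frequencies in its chunk of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts

def reduce_counts(partial_counts):
    """'Reduce' step: merge the per-worker counts into a single result."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    # Illustrative stand-in for sequencing reads; real inputs would be
    # distributed across many machines rather than held in one list.
    reads = ["ATGCGT", "TTAGCA", "GGCCTA", "ATATAT"]
    chunks = [reads[i::2] for i in range(2)]          # split the workload
    with Pool(processes=2) as pool:
        partials = pool.map(map_chunk, chunks)        # map in parallel
    print(reduce_counts(partials))                    # one unified output
```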
Clinical Workflow for Tumor Genome Analysis and Interpretation