Channel: Intel Developer Zone Articles

GENOMICSDB SOLUTION WHITE PAPER

By Hao Li, Danny Zhang, Carl Li, Hui Lv, Jianlei Gu, Welles Du, Haitong Wang
 

Abstract

In genomics and life science research, the volume of whole-genome data and the demands of life science algorithms keep growing, with data sets now measured in terabytes (TB), petabytes (PB), or even exabytes (EB). The key problem is how to store and analyze this data efficiently. This paper demonstrates how Intel big data technology and architecture help to facilitate and accelerate genomics research in data storage and utilization. Intel defines GenomicsDB, a high-performance engine for variant call data queries. Building on this technology, Intel defines a genomics knowledge sharing and exchange architecture, which has been deployed and validated at Shanghai Children’s Hospital and Shanghai Jiao Tong University with very positive feedback. These big data technologies can be scaled to many more genomics and life science partners around the world.

Keywords

GenomicsDB; TileDB; Big Data; Life Science

 

1. INTRODUCTION

In genomics and life science research, data volumes keep growing and are now measured in TB, PB, or even EB. For example, genomic sequencing generates more than 1 TB of data per patient. In 2015, 1.65 million new patients in the US generated more than 4 EB of data. CNGB (China National Gene Bank) currently has 500 PB of data deployed, and that volume is estimated to grow by 5-10 PB per year. SCH (Shanghai Children’s Hospital) and the SJTU (Shanghai Jiao Tong University) Super Computing Center run hundreds of nodes with a total of 30 PB of storage deployed. The key priorities are how to store and analyze this data efficiently, and how to exchange and share it between institutions.

Intel defines and deploys a scalable genomics knowledge sharing and exchange architecture, which has been validated at Shanghai Children’s Hospital and Shanghai Jiao Tong University with very positive feedback. The solution architecture provides a customized GenomicsDB engine that searches genomics variant call data by position at very high speed. Because genomic positions are discrete, GenomicsDB also optimizes sparse array storage by saving only the “useful” (non-empty) data. Intel also defines the process and architecture for sharing genomics knowledge data, so that genomics knowledge can be consolidated and used more efficiently.

2. INTEL BIG DATA ARCHITECTURE FOR LIFE SCIENCE

In real scenarios, genomics research must operate at big data scale. This solution provides a big data architecture (Figure 1) that is separated into two layers. The first is the application framework, which serves as the interface to the end user. It supports the Genomics Knowledge App, the GenomicsDB UI, and the Genomics Work Flow, among others. The second is the core framework and Linux kernel, which provides services that handle requests from the application framework. The core framework includes big-data-level components such as TileDB and the GenomicsDB engine for large genomics variant data.

Figure 1. Big Data Architecture for Life Science

2.1. GENOMICS KNOWLEDGE SHARE MODEL

Figure 2. Genomics Knowledge Share Model

Figure 3. Genomics Knowledge Share Architecture

Figure 2 and Figure 3 show the key usage model and architecture of genomics knowledge sharing. A Central Function is defined as the data sharing agent. Its purpose is secure, consolidated data sharing and exchange. In real scenarios, genomics data is privacy-sensitive, so hospitals cannot share much of it publicly. In this solution, hospitals keep the raw data in their local private data centers, while the Central Function provides a shared database of statistical and summarized genomics knowledge for query. For example, hospitals A and B store raw genomics data produced by their genomics work flows and create private GenomicsDB instances in their private data centers. At the same time, hospitals A and B can contribute statistical and representative variant call data to the Central Function, where a data management system creates a consolidated GenomicsDB knowledge center for sharing. End users can then query the consolidated GenomicsDB knowledge from both hospital A and hospital B. What is shared is not the raw genomics data but statistical data and GenomicsDB knowledge. Unlike the hospitals’ private data centers, the Central Function can be deployed on a public cloud with secured access control for data sharing.
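The sharing model above can be illustrated with a minimal Python sketch. It is not the deployed implementation: the function names `summarize` and `consolidate`, the variant keys, and the sample genotypes are all hypothetical, and diploid samples are assumed. Each hospital reduces its per-sample genotypes to per-variant allele counts, and only those counts are merged by the Central Function.

```python
from collections import defaultdict


def summarize(genotypes):
    """Hospital side: reduce per-sample genotypes to per-variant allele counts.

    genotypes maps a variant key (chrom, pos, ref, alt) to a list with one
    alt-allele count (0, 1 or 2) per sample. Diploid samples are assumed.
    Only the returned summary leaves the hospital, never the per-sample list.
    """
    return {variant: (sum(calls), 2 * len(calls))
            for variant, calls in genotypes.items()}


def consolidate(summaries):
    """Central Function side: merge the summaries contributed by all hospitals."""
    merged = defaultdict(lambda: [0, 0])
    for summary in summaries:
        for variant, (alt_count, allele_total) in summary.items():
            merged[variant][0] += alt_count
            merged[variant][1] += allele_total
    return {
        variant: {
            "alt_count": alt,
            "allele_total": total,
            "frequency": alt / total if total else 0.0,
        }
        for variant, (alt, total) in merged.items()
    }


# Hypothetical private data held by two hospitals.
hospital_a = {("chr1", 10177, "A", "AC"): [0, 1, 1, 2],
              ("chr1", 10352, "T", "TA"): [0, 0, 1, 0]}
hospital_b = {("chr1", 10177, "A", "AC"): [1, 0],
              ("chr2", 5000, "G", "C"): [2, 2]}

knowledge = consolidate([summarize(hospital_a), summarize(hospital_b)])
```

In this sketch an end user querying `knowledge` sees allele counts and frequencies aggregated across both hospitals, but cannot recover which sample carried which genotype.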

2.2. Visualized Genomics Work Flow

Figure 4. Visualized Genomics Work Flow

Figure 4 shows an example of a visualized genomics work flow, such as converting data from FASTQ to BAM and then creating the final VCF (Variant Call Format) data. In life science, some bio researchers do not have much IT background. The visualized UI helps these researchers considerably when customizing genomics data analysis and conversion. The raw data should be kept in a secure environment, such as a hospital private data center, and is used to create GenomicsDB knowledge.
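As a rough illustration of what a visual work flow editor has to check behind the scenes, the following Python sketch models a work flow as an ordered list of stages and verifies that each stage consumes the format produced by the previous one. The `Stage` class, the stage names, and `validate_workflow` are hypothetical and are not taken from the actual work flow tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    input_format: str
    output_format: str


def validate_workflow(stages):
    """Return the formats the data passes through, or raise if stages do not chain."""
    if not stages:
        raise ValueError("workflow has no stages")
    formats = [stages[0].input_format]
    for stage in stages:
        if stage.input_format != formats[-1]:
            raise ValueError(
                f"stage {stage.name!r} expects {stage.input_format}, "
                f"but the previous stage produces {formats[-1]}"
            )
        formats.append(stage.output_format)
    return formats


# The conversion chain described in the text: FASTQ -> BAM -> VCF.
workflow = [
    Stage("alignment", "fastq", "bam"),
    Stage("variant_calling", "bam", "vcf"),
]
formats = validate_workflow(workflow)
```

A check of this kind lets the UI reject a mis-ordered work flow before any compute time is spent on it.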

2.3. GenomicsDB and TileDB

Figure 5. TileDB and GenomicsDB

Figure 5 shows the TileDB work model. TileDB is a system for efficiently storing, querying, and accessing sparse array data. It is optimized for sparse data and supports high-performance linear algebra. For example, when storing data and querying cells, TileDB skips empty cells, which saves considerable storage and query time. GenomicsDB is an instance of TileDB that stores variant data in a 2D TileDB array. Each row corresponds to a sample in a VCF and each column corresponds to a genomic position. Figure 5 also shows an example of discrete genomic position data.
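The row/column layout and the idea of skipping empty cells can be sketched in a few lines of Python. This is a toy model only: it does not use the TileDB or GenomicsDB APIs and does not reflect TileDB's actual on-disk format. The class name `SparseVariantArray` and the sample data are hypothetical, and row indices are assumed to be non-negative integers.

```python
from bisect import bisect_left, bisect_right


class SparseVariantArray:
    """Toy 2D sparse array: rows are samples, columns are genomic positions."""

    def __init__(self):
        self._cells = {}      # (row, column) -> value; empty cells are never stored
        self._index = []      # (column, row) pairs, kept sorted for range lookups
        self._sorted = True

    def write(self, row, column, value):
        if (row, column) not in self._cells:
            self._index.append((column, row))
            self._sorted = False
        self._cells[(row, column)] = value

    def query(self, col_start, col_end):
        """Return the non-empty cells with col_start <= column <= col_end."""
        if not self._sorted:
            self._index.sort()
            self._sorted = True
        lo = bisect_left(self._index, (col_start, -1))
        hi = bisect_right(self._index, (col_end, float("inf")))
        return [(row, col, self._cells[(row, col)])
                for col, row in self._index[lo:hi]]

    def stored_cells(self):
        return len(self._cells)


array = SparseVariantArray()
array.write(0, 10177, "A>AC")
array.write(1, 10177, "A>AC")
array.write(0, 10352, "T>TA")
array.write(2, 2_000_000, "G>C")

hits = array.query(10_000, 11_000)
```

Here three samples spanning two million positions occupy four stored cells rather than six million, and a range query only touches the cells whose columns fall inside the requested range.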

Figure 6. GenomicsDB Performance Report

Figure 6 shows a real GenomicsDB testing report from Shanghai Children’s Hospital with an 11 GB sample VCF. Queries respond to the user within seconds, and performance is better when queries run in parallel and when scaling to column ranges of millions of variant data columns.
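One common way to parallelize a query over a large column range is to split the range into contiguous sub-ranges and query them concurrently. The Python sketch below shows only that partitioning pattern. It is an assumption about how such a query could be organized, not the method used in the reported tests, and `POSITIONS`, `split_range`, and `parallel_query` are hypothetical names operating on an in-memory stand-in rather than on GenomicsDB.

```python
from bisect import bisect_left, bisect_right
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a variant store: sorted column positions that hold data.
POSITIONS = list(range(0, 1_000_000, 37))


def query_range(bounds):
    """Return the stored positions inside the inclusive range (start, end)."""
    start, end = bounds
    lo = bisect_left(POSITIONS, start)
    hi = bisect_right(POSITIONS, end)
    return POSITIONS[lo:hi]


def split_range(start, end, parts):
    """Split the inclusive range [start, end] into up to `parts` contiguous sub-ranges."""
    size = -(-(end - start + 1) // parts)  # ceiling division
    return [(s, min(s + size - 1, end)) for s in range(start, end + 1, size)]


def parallel_query(start, end, parts=4):
    """Query each sub-range in its own worker and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = list(pool.map(query_range, split_range(start, end, parts)))
    return [position for chunk in chunks for position in chunk]


result = parallel_query(0, 999_999)
```

Because the sub-ranges are disjoint and `map` preserves their order, the concatenated result matches a single query over the whole range. Any actual speedup depends on the storage engine and hardware, which this sketch does not model.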

Figure 7. GenomicsDB UI

Figure 8. GenomicsDB UI Testing Report

Because some bio researchers do not have much IT background, this solution provides a friendly web UI for end-user queries. Figure 7 shows the UI that has been deployed at Shanghai Children’s Hospital. Figure 8 shows the GenomicsDB UI testing report. Although UI parsing, rendering, and the annotation process add time, the UI still responds to the user within seconds and performs well when scaling to larger data with larger column ranges.

2.4. Customized Genomics Annotation

Figure 9. Genomics Annotation

Figure 9 shows genomics annotation that can be customized together with GenomicsDB results. Beyond the INFO and FORMAT data in the VCF content, additional annotations related to the variant call data are very useful to bio researchers. With its scalable design and interface, GenomicsDB can seamlessly integrate more annotations into the standard VCF INFO column, which contributes more valuable information during genomics research.
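To make the idea of integrating annotations into the INFO column concrete, here is a minimal Python sketch that appends key=value annotations to the INFO field (the eighth tab-separated column) of a VCF data line. The function `annotate_vcf_line`, the annotation keys `GENE` and `SRC`, and the sample record are hypothetical, and this is not the GenomicsDB annotation interface.

```python
def annotate_vcf_line(line, annotations):
    """Append key=value annotations to the INFO column (8th field) of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    # "." marks an empty INFO column in VCF.
    info = [] if fields[7] in (".", "") else fields[7].split(";")
    present = {entry.split("=", 1)[0] for entry in info}
    for key, value in annotations.items():
        if key not in present:  # keep existing INFO entries untouched
            info.append(f"{key}={value}")
    fields[7] = ";".join(info) if info else "."
    return "\t".join(fields)


# Hypothetical record and annotation values, for illustration only.
record = "chr1\t10177\t.\tA\tAC\t100\tPASS\tAC=5;AN=12"
annotated = annotate_vcf_line(record, {"GENE": "EXAMPLE1", "SRC": "hospitalA"})
```

Existing INFO entries are preserved and new keys are appended, so downstream tools that only read the original fields continue to work on the annotated record.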

3. Conclusions

As the big data era arrives in the genomics and life science industry, traditional approaches can no longer keep up with data growth. With Intel Big Data Architecture and GenomicsDB, genomics data can be stored and used efficiently. The Central Function with genomics knowledge sharing is expected to scale into a future commercial standard. It can facilitate life science research and accelerate genomics precision medicine.

Acknowledgements

Thanks to Carl Li, Hao Li, Guangjun Yu, Hui Lv, Jianlei Gu, Hong Sun, Ketan Paranjape, Paolo Narvaez, Karthik Gururaj, Kushal Datta, Danny Zhang, Julia Liang, Chang Yu, Ying Liu, Jian Li, Hua Ding, Hong Zhu, and Ansheng Yang for their great help and support during solution design and development.

References

[1]  Karthik Gururaj & Kushal Datta, GenomicsDB.
https://github.com/Intel-HLS/GenomicsDB

[2]  Brian Cremeans & Kushal Datta & Karthik Gururaj & Samuel Madden & Timothy Mattson & Mishali Naik & Paolo Narvaez & Stavros Papadopoulos & Jagannath Premkumar, TileDB. 
http://istc-bigdata.org/tiledb/tutorials/index.html

[3]  Robert Read, Lustre file system for cloud and Hadoop, pp.1-25. https://www.openfabrics.org/images/eventpresos/workshops2015/UGWorkshop/Thursday/thursday_13.pdf

[4]  Intel Stands at Center of Huge Health Data Exchange Project,
https://www.premisehealth.com/intel-data-exchange-project/

[5]  Tim Mattson Intel labs, Polystore, Julia, and productivity in a Big Data world, pp. 1-38. http://www.clsac.org/uploads/5/0/6/3/50633811/mattson.pdf

[6]  Lustre Software Release 2.x Operations Manual, Part III. Administering Lustre, pp. 172-178. http://doc.lustre.org/lustre_manual.pdf

[7] Jennie Duggan Northwestern U. & Aaron Elmore U of Chicago & Tim Kraska Brown U. & Sam Madden M.I.T. & Tim Mattson Intel Corp. & Michael Stonebraker M.I.T., The BigDawg Architecture and Reference Implementation, pp. 1-2.
http://users.eecs.northwestern.edu/~jennie/research/BigDawgShort.pdf

[8]  A. Elmore Univ. of Chicago & J. Duggan Northwestern & M. Stonebraker MIT & M. Balazinska Univ. of Wash. & U. Cetintemel Brown & V. Gadepally MIT-LL & J. Heer Univ. of Wash. & B. Howe Univ. of Wash. & J. Kepner MIT-LL & T. Kraska Brown & S. Madden MIT & D. Maier Portland St U. & T. Mattson Intel & S.Papadopoulos Intel / MIT & J. Parkhurst Intel & N. Tatbul Intel / MIT & M. Vartak MIT & S. Zdonik Brown, A Demonstration of the BigDAWG Polystore System, pp. 1-4. 
http://livinglab.mit.edu/wp-content/uploads/2016/01/bigdawg-polystore-system.pdf

[9] Usegalaxy,
https://wiki.galaxyproject.org/Admin/GetGalaxy

