Next-generation sequencing (NGS) technologies generate vast amounts of variant data, the analysis of which poses a substantial computational challenge. Many ongoing research efforts, such as population genetics and association studies, require computing various statistics and performing statistical tests on genome sequencing data. To facilitate such analyses, Intel has developed a specialized analytics platform, referred to as the Intel Reference Architecture. This platform provides a comprehensive set of solutions for conveniently storing, manipulating, and analyzing genome sequencing data. The intuitive representation of variant data in a table format and the SQL-like interactive query interface make the Intel Reference Architecture an attractive alternative to existing NGS analytics tools.
In this study, we present a set of example queries that perform commonly used operations, such as calculating allele and genotype frequencies, testing for Hardy-Weinberg equilibrium, and testing for association between SNPs and a given condition. To illustrate these queries, we used the 1000 Genomes data and applied the operations to a set of 12 SNPs known to be associated with type 2 diabetes.
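As a rough illustration of the kind of statistics mentioned above, the following Python sketch computes allele frequencies and a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium from genotype counts at a single biallelic SNP. It is not the query interface of the Intel Reference Architecture; the function names and the input counts are hypothetical, and a plain chi-square test is assumed rather than any specific method used in the study.

```python
import math


def allele_frequencies(n_aa, n_ab, n_bb):
    """Return (p, q): frequencies of alleles A and B from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped sample is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # each AA carries two A alleles, each AB one
    return p, 1.0 - p


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Returns (chi2, p_value). Expected counts are n*p^2, 2*n*p*q and n*q^2.
    """
    n = n_aa + n_ab + n_bb
    p, q = allele_frequencies(n_aa, n_ab, n_bb)
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expected count (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts for one SNP, for illustration only.
    chi2, p_value = hwe_chi_square(30, 40, 30)
    print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```

The p-value is obtained from the complementary error function, which avoids a dependency on an external statistics library. For small samples or rare alleles an exact test would usually be preferred over the chi-square approximation.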