QnGene GENETIC FORMAT FOR HPC
Health data analysis is typically done with a pipeline of Python scripts that run SQL queries against a relational database. This approach is fast becoming obsolete because it cannot handle the Big Data volumes found in genetic data. The AMPLab at UC Berkeley (see their technical paper) has published the Adam format, which can take advantage of recent open source Big Data technologies. Having access to genetic data on a Big Data platform is valuable, but precision medicine research also needs the clinical and demographic data collected during clinical trials. This research aims to extend the Adam schema without impacting its current APIs. A first advantage is that all of this data is colocated in an efficient binary format for access by Hadoop/Spark processing clusters (see recent publication here) (recent video instructions here).
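One way to extend a record schema without breaking existing readers is to add only nullable fields with a null default, which is how Avro schema evolution stays compatible. The sketch below illustrates that idea with plain Python dictionaries standing in for Avro schema JSON. It is not the actual Adam schema: the record name, every field name, and the helper functions extend_schema and read_with_schema are hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical, heavily simplified stand-in for a genotype record schema.
# The real Adam schemas are much larger; these field names are assumptions.
BASE_SCHEMA = {
    "type": "record",
    "name": "Genotype",
    "fields": [
        {"name": "sampleId", "type": ["null", "string"], "default": None},
        {"name": "contigName", "type": ["null", "string"], "default": None},
        {"name": "start", "type": ["null", "long"], "default": None},
    ],
}

# Hypothetical clinical and demographic fields to add.
CLINICAL_FIELDS = [
    {"name": "ageAtEnrollment", "type": ["null", "int"], "default": None},
    {"name": "sex", "type": ["null", "string"], "default": None},
    {"name": "trialArm", "type": ["null", "string"], "default": None},
]


def extend_schema(base, extra_fields):
    """Return a copy of `base` with `extra_fields` appended.

    Every new field must be a union whose first branch is "null" and whose
    default is null, so records written with the old schema can still be
    read with the extended one.
    """
    existing = {f["name"] for f in base["fields"]}
    extended = copy.deepcopy(base)
    for field in extra_fields:
        if field["name"] in existing:
            raise ValueError(f"field already exists: {field['name']}")
        nullable = isinstance(field["type"], list) and field["type"][0] == "null"
        has_null_default = "default" in field and field["default"] is None
        if not (nullable and has_null_default):
            raise ValueError(f"field must be nullable with null default: {field['name']}")
        extended["fields"].append(copy.deepcopy(field))
    return extended


def read_with_schema(record, schema):
    """Mimic schema resolution: missing fields take the schema default."""
    return {f["name"]: record.get(f["name"], f["default"]) for f in schema["fields"]}


if __name__ == "__main__":
    extended = extend_schema(BASE_SCHEMA, CLINICAL_FIELDS)
    old_record = {"sampleId": "S1", "contigName": "1", "start": 100}
    print(read_with_schema(old_record, extended))
```

Because the original fields are left untouched and the new ones are optional, code written against the original schema keeps working, while new code can read the clinical fields when they are present.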
STUDENTS INVOLVED Fodil Belgait, Michel Hénault-Éthier, Béatriz Kanzki
TECHNOLOGIES Python, Adam, Hbase, Parquet, Avro, H2o, genomeBrowser, IGV, VarSeq
First, we studied the genetic query processes and tools (i.e. LocusZoom, IGV, UCSC Genome Browser, COSMIC) used at the CRCHUM. Next, we looked at the requirements of Dr. Sinnett at the research center of Hôpital Sainte-Justine. A first throwaway proof-of-concept prototype, named GOAT v1, was developed by Beatriz Kanzki (git@github.com:jokerbea/GOAT-Genetic-Output-Analysis-Tool.git). From this throwaway proof of concept, a number of reengineering projects were initiated (see the Cédric v2 and Victor v3 reports, in French); the v4 refactoring was also done in French. The major changes were replacing Bokeh-Server with AmCharts for visualisation in the front end and replacing the SQL database with the UC Berkeley Adam format on Spark. Previously, in v3, loading a large .vcf file took 6 hours; it now takes only 30 minutes for researchers to load their genetic data. The biggest improvement is that a query against the reference data loaded in Adam format now takes only 5 seconds (previously it took 3 minutes). Finally, running on AWS makes this version of the prototype available from anywhere. We are now planning v5, which should include more than the 1000 Genomes reference.
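To make the kind of query discussed above concrete, here is a minimal sketch of a region lookup over variant records. It is a plain-Python stand-in, not the prototype's Spark/Adam code: the Variant type, the helper names parse_vcf_lines and query_region, and the sample records are all hypothetical. The only thing taken from the VCF format itself is the column order of the data lines (CHROM, POS, ID, REF, ALT, ...).

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per VCF data line, skipping headers and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # VCF data columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
        yield Variant(cols[0], int(cols[1]), cols[3], cols[4])


def query_region(variants: Iterable[Variant], chrom: str, start: int, end: int) -> List[Variant]:
    """Return variants on `chrom` whose position lies in [start, end]."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


# Made-up records for illustration only.
SAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10150\t.\tC\tT\t50\tPASS\t.",
    "1\t20500\t.\tG\tA\t60\tPASS\t.",
    "2\t10150\t.\tA\tG\t70\tPASS\t.",
]

if __name__ == "__main__":
    for hit in query_region(parse_vcf_lines(SAMPLE_VCF), "1", 10000, 15000):
        print(hit)
```

This version scans every record. On a columnar store such as Parquet, the same predicate on chromosome and position can be evaluated while reading only the columns it needs, which is one reason such a filter can run much faster at scale.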
Creating a Web-based system that includes a dashboard to track the progress of quality initiatives and measures against the national excellence standards used by the health industry (Accreditation Canada, Planetree, BOMA BESt and the Quebec Network of Health Institutions) and to monitor conformance.
STUDENTS INVOLVED M.Y. Tariq, U. Ghomsi, N. Brousseau, R. Chebli, G. Gbelai and A. Elmoul
OPEN SOURCE TECHNOLOGIES .Net 4, IIS7, SQL Server 2008 R2, SQL, MDX, XML, SSIS, SSAS, SSRS
STUDENTS INVOLVED D. Lauzon, C. Vallières, P. Herrera, A. Boussif, A. Zakharov, D. Olano, M-A Tardif, P-E Viau, M. Ouellet, P-A St-Jean
ALL OPEN SOURCE TECHNOLOGIES Highcharts JS, WebSockets, Socket.io, Node.js, VirtualBox, Ubuntu Server LTS, LXDE