- Genome sequences contain information with immense possibilities for research and personalized medical care, but their size, complexity and diversity make comparing sequences error-prone and slow.
- ISB researchers have created a method for summarizing a personal genome as a “fingerprint.” Creating and comparing these fingerprints is significantly faster than traditional methods – computation of genome fingerprints typically requires between 15 and 45 seconds per genome, and comparing genome fingerprints takes a fraction of a second.
- This method takes privacy into account. Genome fingerprints contain enough information about the larger genome sequence to enable comparison, but they do not reveal information that can identify genetic predisposition to disease or other traits that could be used to affect an individual’s ability to obtain or maintain employment, insurance or financial services, or otherwise inadvertently lead to social stigma or negative effects.
By Gustavo Glusman, PhD
Personal genome sequences contain the information required for assessing genetic risks, matching genetic backgrounds between cases and controls in medical research, and detecting duplicate individuals or close relatives for medical, legal or historical reasons.
However, the size, complexity and diversity of representations of personal genomes make comparison error-prone and slow, and therefore challenging to scale from pairs to the hundreds, thousands or millions of individuals we will soon wish to compare in order to provide improved, personalized medical care.
We created a method for summarizing a personal genome as a “fingerprint.” Commonly used representations of personal genomes consist of lists of variants relative to a reference, including their location and reference and alternative alleles, sorted by position. Genome fingerprints capture the unique patterns generated by pairs of consecutive single-nucleotide variants; similar genomes will include more such patterns in common. Fingerprints are small matrices of numbers that can be quickly compared by computing the Spearman correlation. The correlation between two fingerprints reflects the degree of relatedness between the two genomes and can be trivially integrated into higher-level analyses and pipelines.
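To make the idea concrete, here is a minimal sketch of how such a fingerprint could be built and compared. It is an illustration under stated assumptions, not the published implementation: the function names, the size parameter `L`, the choice to reduce the distance between consecutive variants modulo `L`, and the flat table layout are all hypothetical, and any filtering or normalization the real method applies is omitted.

```python
from itertools import product

BASES = "ACGT"
# The 12 possible single-nucleotide substitutions (ref != alt).
SUBS = [(r, a) for r, a in product(BASES, BASES) if r != a]
# Each ordered pair of substitutions maps to one of 144 columns.
PAIR_INDEX = {pair: i for i, pair in enumerate(product(SUBS, SUBS))}


def fingerprint(snvs, L=20):
    """Count consecutive-SNV pairs into a flat L x 144 table.

    snvs: list of (position, ref, alt) on one chromosome, sorted by position.
    L: hypothetical size parameter (an assumption of this sketch); the
       distance between the two variants, modulo L, selects the row.
    """
    n_cols = len(PAIR_INDEX)
    table = [0] * (L * n_cols)
    for (p1, r1, a1), (p2, r2, a2) in zip(snvs, snvs[1:]):
        row = (p2 - p1) % L
        col = PAIR_INDEX[((r1, a1), (r2, a2))]
        table[row * n_cols + col] += 1
    return table


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy input: four made-up SNVs, not real genome data.
genome_a = [(100, "A", "G"), (130, "C", "T"), (175, "G", "A"), (300, "T", "C")]
fp_a = fingerprint(genome_a)
similarity = spearman(fp_a, fp_a)  # a fingerprint compared with itself
```

Because the table has a fixed size regardless of how many variants a genome carries, comparing two fingerprints costs the same for any pair of genomes, which is what makes large all-against-all comparisons practical.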
Computation of genome fingerprints is very fast, typically requiring between 15 and 45 seconds per genome. Thanks to the small size of the fingerprint matrices, fingerprint comparisons are extremely fast. For example, we performed all-against-all comparisons in the set of 2,504 genomes from the Thousand Genomes Project. The 3.1 million comparisons required 67 seconds (21.3 microseconds per comparison) at high resolution, and 11 seconds (3.53 microseconds per comparison) at lower resolution. Fingerprint comparisons are also independent and trivially parallelizable.
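The comparison count follows from the number of unordered pairs among 2,504 genomes. The short check below recomputes it and derives approximate per-comparison times from the rounded totals quoted above; because those totals are rounded to whole seconds, the results only approximate the per-comparison figures in the text. Variable names are illustrative.

```python
from math import comb

n_genomes = 2504
pairs = comb(n_genomes, 2)  # unordered pairs: n * (n - 1) / 2

# Rounded wall-clock totals quoted in the text, in seconds.
high_res_seconds = 67
low_res_seconds = 11

per_high_us = high_res_seconds / pairs * 1e6  # microseconds per comparison
per_low_us = low_res_seconds / pairs * 1e6

print(pairs)  # 3133756, i.e. about 3.1 million comparisons
print(round(per_high_us, 1), round(per_low_us, 1))
```

Since each pair is compared independently of every other pair, this workload can be split across processors without any coordination between them.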
Genome fingerprints are robust to reference versions, input format differences, complex post-processing of genome data, technology differences and missing data. They have many possible uses, including the identification of family relationships, population assignment, reconstruction of population structure and improved study design by rational selection of matched controls.
Public sharing of genome data has been limited by multiple personal privacy and confidentiality considerations. A central risk is the possibility of identifying genetic predispositions to certain diseases or other traits that could affect the individual’s ability to obtain or maintain employment, insurance or financial services, carry social stigma, or lead to other negative effects. Importantly, the original genome representation cannot be reconstructed from its fingerprint. Using genome fingerprints, one can therefore share enough information about a genome to enable the comparison tasks without revealing the information needed for predicting phenotype. As such, our method decouples genome comparison from interpretation. This property has important implications for privacy-preserving genome analytics.
Genome fingerprints can be applied to many tasks and can be developed in many possible directions. They can also be useful for studying genomes of non-human species. If you work with large data sets that are difficult to handle and with developers interested in contributing to improving the methodology and usability of such a system, we’d like to hear from you.
Title: Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints
Journal: Frontiers in Genetics
Authors: Gustavo Glusman, Denise E. Mauldin, Leroy E. Hood, Max Robinson
Pub. Date: Sept. 26, 2017