An Extraordinarily Fast Data Representation for Pangenomics

Pangenomics is an exciting new approach to analyzing genetic data. In “ordinary” genomics, you sequence each organism’s genome by aligning it to a reference genome that somebody previously assembled at great cost. In a sense, the traditional view models all of us as variations on a Platonic ideal of Homo sapiens. Pangenomicists instead try to directly model the variation among an entire population. Instead of aligning reads from N individuals against one common reference, the goal is to align all N against one another. This all-to-all comparison, they tell us, is the key to understanding a population’s diversity and revealing subtleties that are undetectable with the traditional approach. But it also multiplies the cost of every step in a genomic analysis pipeline, and those steps aren’t exactly cheap even in the traditional, reference-based mode.

We have started building a new library, written in Rust, that can process pangenomic datasets extremely efficiently. We don’t know for sure, but we think it might be the fastest extant open-source representation for pangenomic data. In any case, it’s pretty darn fast. It owes its efficiency to a basic technique that might be called data structure flattening, which eliminates all pointers and stores data in dense on-disk buffers.
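To make the idea of flattening concrete, here is a minimal sketch in Rust. It is not the library's actual code: the names `Span`, `Segment`, and `FlatStore` are hypothetical and chosen only for illustration. The sketch assumes a graph whose segments each carry a DNA sequence, and it shows the core move: instead of each segment owning a separately allocated string, every segment records integer offsets into one shared byte buffer.

```rust
/// A half-open byte range [start, end) in the shared sequence pool.
/// (Hypothetical type for illustration.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

/// One graph segment. It holds only integers: no pointers, no owned strings.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Segment {
    name: u32,
    seq: Span,
}

/// All segments and all sequence bytes live in two dense arrays.
#[derive(Default)]
struct FlatStore {
    segs: Vec<Segment>,
    seq_data: Vec<u8>,
}

impl FlatStore {
    /// Append a segment and return its index, which plays the role
    /// that a pointer would play in a conventional representation.
    fn add_segment(&mut self, name: u32, bases: &[u8]) -> u32 {
        let start = self.seq_data.len() as u32;
        self.seq_data.extend_from_slice(bases);
        let end = self.seq_data.len() as u32;
        self.segs.push(Segment { name, seq: Span { start, end } });
        (self.segs.len() - 1) as u32
    }

    /// Look up a segment's bases by slicing the shared pool.
    fn bases(&self, id: u32) -> &[u8] {
        let span = self.segs[id as usize].seq;
        &self.seq_data[span.start as usize..span.end as usize]
    }
}
```

With this layout, adding the segments `GATTACA` and `CCG` yields indices 0 and 1, and the second segment's span covers bytes 7 through 10 of the pool. Because every entry is a fixed-size record of plain integers, arrays like these could in principle be written to a file as-is and read back without rebuilding any pointers.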

The Project

We are seeking 1–4 MEng students this year to extend and optimize this library. Some specific things we want you to focus on are:

Qualifications

Notably, you do not need to know anything about computational biology. This is a systems engineering project, and you will pick up all the biological context you need along the way.

However, we are seeking MEng students who either already have experience programming in Rust or are motivated to learn. If you’re in the latter category, you must have experience with another low-level systems programming language: C, C++, Zig, maybe Swift or Go. You also must be willing to endure the occasional frustration and confusion that is an inherent part of learning Rust. To reiterate, learning Rust can be hard, and you must be excited to get over that hump.

Applying to the Project

If you are interested in this project, please start by carefully reading these blog posts:

  1. On data structure flattening.
  2. On pangenomics and GFA files.
  3. On flattening the GFA data structure.

When you apply, please answer these questions with one short paragraph each:

  1. What makes the GFA file format inefficient?
  2. What makes the FlatGFA approach efficient?
  3. Do you have experience programming in Rust? If not, what is your experience with systems programming, and why do you want to learn Rust?

Please email your answers to asampson@cs.cornell.edu.