Pangenomics is an exciting new approach to analyzing genetic data. In “ordinary” genomics, you sequence each organism’s genome by aligning its reads to a reference genome that somebody previously assembled at great cost. In a sense, the traditional view models all of us as variations on a Platonic ideal of Homo sapiens. Pangenomicists instead try to directly model the variation among an entire population. Instead of aligning reads from N individuals against one common reference, the goal is to align all N against one another. This all-to-all comparison, they tell us, is the key to understanding a population’s diversity and revealing subtleties that are undetectable with the traditional approach. But it also raises the cost of every computational step in a genomic analysis pipeline, and those steps aren’t exactly cheap even in the traditional, reference-based mode.
We have started building a new library, written in Rust, that can process pangenomic datasets extremely efficiently. We don’t know for sure, but we think it might be the fastest extant open-source representation for pangenomic data. In any case, it’s pretty darn fast. It owes its efficiency to a basic technique that might be called data structure flattening, which eliminates all pointers and stores data in dense on-disk buffers.
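To make the flattening idea concrete, here is a minimal sketch in Rust. It is not the library's actual API: `FlatGraph`, `SegId`, `Span`, and every method name are hypothetical, and the toy "graph" (DNA segments plus paths that walk over them) is just an assumption about what such data might look like. The point is only the technique: every would-be pointer becomes an integer index or a range into a shared pool, so the whole structure is a handful of dense arrays.

```rust
/// Index of a segment in `FlatGraph::segs`. A plain integer stands in
/// for what would otherwise be a pointer or reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SegId(u32);

/// A half-open range `[start, end)` into one of the shared pools.
#[derive(Clone, Copy, Debug)]
struct Span {
    start: u32,
    end: u32,
}

/// A segment owns no heap data; it only records where its bases live.
struct Segment {
    seq: Span, // range into `FlatGraph::seq_pool`
}

/// A path is a walk over segments, stored as a range of steps.
struct Path {
    steps: Span, // range into `FlatGraph::step_pool`
}

/// Hypothetical flattened store: four dense arrays and no pointers.
#[derive(Default)]
struct FlatGraph {
    seq_pool: Vec<u8>,     // all bases of all segments, back to back
    segs: Vec<Segment>,    // one fixed-size record per segment
    step_pool: Vec<SegId>, // all steps of all paths, back to back
    paths: Vec<Path>,      // one fixed-size record per path
}

impl FlatGraph {
    /// Appends the bases to the shared pool and returns the new segment's index.
    fn add_segment(&mut self, seq: &[u8]) -> SegId {
        let start = self.seq_pool.len() as u32;
        self.seq_pool.extend_from_slice(seq);
        let end = self.seq_pool.len() as u32;
        self.segs.push(Segment { seq: Span { start, end } });
        SegId((self.segs.len() - 1) as u32)
    }

    /// Appends the steps to the shared pool and returns the new path's index.
    fn add_path(&mut self, steps: &[SegId]) -> usize {
        let start = self.step_pool.len() as u32;
        self.step_pool.extend_from_slice(steps);
        let end = self.step_pool.len() as u32;
        self.paths.push(Path { steps: Span { start, end } });
        self.paths.len() - 1
    }

    /// Looks up a segment's bases by slicing the pool; nothing is copied.
    fn seq(&self, id: SegId) -> &[u8] {
        let s = self.segs[id.0 as usize].seq;
        &self.seq_pool[s.start as usize..s.end as usize]
    }

    /// Concatenates the bases of every segment that a path visits.
    fn path_sequence(&self, path: usize) -> Vec<u8> {
        let s = self.paths[path].steps;
        let mut out = Vec::new();
        for id in &self.step_pool[s.start as usize..s.end as usize] {
            out.extend_from_slice(self.seq(*id));
        }
        out
    }
}

/// Builds a tiny example: three segments and two paths that share them.
fn demo() -> FlatGraph {
    let mut g = FlatGraph::default();
    let a = g.add_segment(b"ACG");
    let b = g.add_segment(b"T");
    let c = g.add_segment(b"GGA");
    g.add_path(&[a, b, c]);
    g.add_path(&[a, c]);
    g
}
```

Because the pools hold only bytes and integers, their contents mean the same thing no matter where they sit in memory. That property is what would let a design like this be written to a file, or mapped from one, without a separate serialization step. The sketch stops short of the on-disk part and keeps everything in ordinary `Vec`s.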
We are seeking 1–4 MEng students this year to extend and optimize this library. Some specific things we want you to focus on are:
Notably, you do not need to know anything about computational biology. This is a systems engineering project, and you will pick up all the biological context you need along the way.
However, we are seeking MEng students who either already have experience programming in Rust or are motivated to learn. If you’re in the latter category, you must have experience with another low-level systems programming language: C, C++, Zig, maybe Swift or Go. You also must be willing to endure the occasional frustration and confusion that is an inherent part of learning Rust. To reiterate, learning Rust can be hard, and you must be excited to get over that hump.
If you are interested in this project, please start by carefully reading these blog posts:
When you apply, please answer these questions with one short paragraph each:
Please email your answers to asampson@cs.cornell.edu.