An Extraordinarily Fast Data Representation for Pangenomics

Pangenomics is an exciting new approach to analyzing genetic data. In “ordinary” genomics, you sequence each organism’s genome by aligning it to a reference genome that somebody previously assembled at great cost. In a sense, the traditional view models all of us as variations on a Platonic ideal of Homo sapiens. Pangenomicists instead try to directly model the variation among an entire population. Instead of aligning reads from N individuals against one common reference, the goal is to align all N against each other. This all-to-all comparison, they tell us, is the key to understanding a population’s diversity and revealing subtleties that are undetectable with the traditional approach. But it also scales up the cost of every computational step in a genomic analysis pipeline, and those steps aren’t exactly cheap even in the traditional, reference-based mode.

We have started building a new library, written in Rust, that can process pangenomic datasets extremely efficiently. We don’t know for sure, but we think it might be the fastest extant open-source representation for pangenomic data. In any case, it’s pretty darn fast. It owes its efficiency to a basic technique that might be called data structure flattening, which eliminates all pointers and stores data in dense on-disk buffers.
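To make the flattening idea concrete, here is a minimal sketch in Rust. It is not the library's actual layout: the names `FlatGraph`, `Segment`, `Link`, and `Span` and the choice of `u32` indices are all assumptions made for illustration. The point is only that every reference is an integer offset into a contiguous buffer rather than a pointer or a separately allocated string.

```rust
/// A half-open range of offsets into a shared buffer.
/// It stands in for a pointer or an owned `Vec` (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Span {
    pub start: u32,
    pub end: u32,
}

/// A graph node. Its nucleotide sequence is not stored inline;
/// `seq` points into `FlatGraph::seq_data` (hypothetical layout).
#[derive(Clone, Copy, Debug)]
pub struct Segment {
    pub name: u32,
    pub seq: Span,
}

/// An edge between two segments, identified by index, not by pointer.
#[derive(Clone, Copy, Debug)]
pub struct Link {
    pub from: u32,
    pub to: u32,
}

/// The whole graph is three dense arrays of fixed-size records.
#[derive(Default)]
pub struct FlatGraph {
    pub seq_data: Vec<u8>,
    pub segments: Vec<Segment>,
    pub links: Vec<Link>,
}

impl FlatGraph {
    /// Append a segment's bases to the shared buffer and return its index.
    pub fn add_segment(&mut self, name: u32, seq: &[u8]) -> u32 {
        let start = self.seq_data.len() as u32;
        self.seq_data.extend_from_slice(seq);
        let end = self.seq_data.len() as u32;
        self.segments.push(Segment { name, seq: Span { start, end } });
        (self.segments.len() - 1) as u32
    }

    /// Record an edge between two segment indices.
    pub fn add_link(&mut self, from: u32, to: u32) {
        self.links.push(Link { from, to });
    }

    /// Look up a segment's bases by slicing the shared buffer.
    pub fn sequence(&self, id: u32) -> &[u8] {
        let span = self.segments[id as usize].seq;
        &self.seq_data[span.start as usize..span.end as usize]
    }
}
```

Because each record in this sketch is a fixed-size value that contains no pointers, the three arrays could in principle be written to a file byte for byte and read back without rebuilding anything. That is the property that makes dense on-disk buffers attractive.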

The Project

We are seeking 1–4 MEng students this year to extend and optimize this library. Some specific things we want you to focus on are:

Extending the available set of operators. The library currently supports a few pangenomic graph analyses; we want to support more. The focus here would be on reading a bunch of very gnarly C++ implementations that already exist and porting them to our new Rust library.
Implementing further data-structure optimizations. Our library is already pretty fast, but we have some ideas for ways to make it even faster. Your job would be to implement these optimizations and understand their impact on performance.
Measuring performance systematically. To guide the optimization work, we need a way to periodically measure how fast our library is on a variety of large-scale pangenomic datasets.
Improving the Python bindings. We have implemented a Python library to make it convenient to use our extremely fast representation. It is currently pretty basic and could use serious expansion to make it practical for biologists to do their work.

Qualifications

Notably, you do not need to know anything about computational biology. This is a systems engineering project, and you will pick up all the biological context you need along the way.

However, we are seeking MEng students who either already have experience programming in Rust or are motivated to learn. If you’re in the latter category, you must have experience with another low-level systems programming language: C, C++, Zig, maybe Swift or Go. You also must be willing to endure the occasional frustration and confusion that is an inherent part of learning Rust. To reiterate, learning Rust can be hard, and you must be excited to get over that hump.

Applying to the Project

If you are interested in this project, please start by carefully reading these blog posts:

When you apply, please answer these questions with one short paragraph each:

What makes the GFA file format inefficient?
What makes the FlatGFA approach efficient?
Do you have experience programming in Rust? If not, what is your experience with systems programming, and why do you want to learn Rust?

Please email your answers to asampson@cs.cornell.edu.
