Bioinformatics Zen

We're wasting money by only supporting gzip for raw DNA files.

2023-01-03T00:00:00Z

The increasing throughput of Illumina DNA sequencing means institutions and companies are spending tens of thousands of dollars to store terabytes of raw DNA sequence (FASTQ). This data is stored using gzip, a 30-year-old compression algorithm.
Common bioinformatics tools should support more recent compression algorithms such as zstd for FASTQ data. Zstd has wide industry support and comparable run times, and would likely reduce storage costs by 50% over gzip.

Gzip is outperformed by other algorithms

The original implementation of gzip (Gailly/Madler) has been surpassed in performance by other gzip implementations. For example, cloudflare-zlib outperforms the original gzip in compression speeds and should be used instead.

The gzip compression format is still ubiquitous for raw FASTQ DNA sequence because it is the only compression format for FASTQ that bioinformatics tools support. In the thirty years since gzip was created, alternatives with superior compression ratios have appeared. Only supporting gzip for FASTQ translates into millions of dollars in storage fees on services like Amazon’s S3 and EFS compared with algorithms with better compression ratios. Companies like Meta, Amazon, and Uber are reported to be switching to zstd over gzip. If the most common bioinformatics tools can move to support ingesting zstd-compressed FASTQ, this could save everyone time and money with minimal impact on compression times.

A toy benchmark

As an example, a zstd-compressed FASTQ file (SRR7589561) is almost 50% the size of the same gzipped file. For the figure below I downloaded ~1.5Gb of FASTQ data and compressed it with either pigz or zstd. Pigz is a parallel implementation of the original gzip.

FASTQ file size by compression algorithm.

FASTQ files do, however, take longer to compress with zstd. The zstd -15 command takes ~70s, which is 100% longer than pigz -9 at ~35s. However, it’s worth noting that raw FASTQ files from a sequencer are compressed once and then stored for years. This additional CPU time cost is more than offset by savings in storage costs. The same does not apply to intermediate files in a pipeline, such as trimmed or filtered FASTQ, which tend to be ephemeral. These would require a further examination of trade-offs.

Total compression time in seconds by algorithm.
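To make the "more than offset" argument concrete, here is a rough break-even sketch. Only the 35 extra seconds (~70s minus ~35s) comes from the benchmark above. The function name and every price and size passed to it are hypothetical placeholders, so substitute your own provider's numbers. It also assumes wall clock seconds are a fair proxy for billed compute time.

```python
def breakeven_months(extra_cpu_seconds, cpu_price_per_hour,
                     gb_saved, storage_price_per_gb_month):
    """Months of storage after which the smaller file has paid for the extra CPU time."""
    # One-off cost of the slower compression run.
    extra_cpu_cost = extra_cpu_seconds / 3600 * cpu_price_per_hour
    # Recurring saving from storing fewer gigabytes each month.
    monthly_saving = gb_saved * storage_price_per_gb_month
    return extra_cpu_cost / monthly_saving


if __name__ == "__main__":
    months = breakeven_months(
        extra_cpu_seconds=70 - 35,        # from the benchmark: zstd -15 vs pigz -9
        cpu_price_per_hour=0.10,          # hypothetical compute price
        gb_saved=0.1,                     # hypothetical size difference for one file
        storage_price_per_gb_month=0.02,  # hypothetical storage price
    )
    print(f"Extra compression time is paid back after {months:.2f} months of storage")
```

Because the compression cost is paid once and the storage saving recurs every month, any file kept longer than the break-even point comes out ahead.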

The next figure shows that changes in decompression time for the same file are relatively small, 3.5s vs 2.2s. Therefore decompression would be minimally impacted.

Total decompression time in seconds by algorithm.
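The three measurements above (compressed size, compression time, decompression time) could be collected with something like the sketch below. It is not the script used for the figures. It assumes pigz and zstd are installed on the PATH, uses the -9 and -15 levels mentioned above, and the function and dictionary names are my own for illustration.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

# Levels taken from the post; "-c" streams to stdout so the input file is left in place.
TOOLS = {
    "pigz": {"suffix": ".gz", "compress": ["pigz", "-9", "-c"],
             "decompress": ["pigz", "-d", "-c"]},
    "zstd": {"suffix": ".zst", "compress": ["zstd", "-15", "-c"],
             "decompress": ["zstd", "-d", "-c"]},
}


def timed_run(command, input_path, output_path):
    """Run command on input_path, write its stdout to output_path, return wall clock seconds."""
    with open(output_path, "wb") as out:
        start = time.perf_counter()
        subprocess.run([*command, str(input_path)], stdout=out, check=True)
        return time.perf_counter() - start


def benchmark(fastq_path, tools=TOOLS):
    """Return {tool: {"bytes", "compress_s", "decompress_s"}} for each installed tool."""
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        for name, spec in tools.items():
            if shutil.which(spec["compress"][0]) is None:
                continue  # tool not installed, skip it
            packed = Path(workdir) / f"reads.fastq{spec['suffix']}"
            unpacked = Path(workdir) / f"reads.{name}.fastq"
            compress_s = timed_run(spec["compress"], fastq_path, packed)
            decompress_s = timed_run(spec["decompress"], packed, unpacked)
            results[name] = {
                "bytes": packed.stat().st_size,
                "compress_s": compress_s,
                "decompress_s": decompress_s,
            }
    return results
```

Calling benchmark() with the path to a downloaded FASTQ file returns one entry per tool found. A single run like this is noisy, so repeat it a few times before drawing conclusions.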

Detailed comparison of flags

This figure compares the compressed output file size for all the different available gzip implementations with zstd for different compression flags on the same SRR7589561 FASTQ file. This shows that zstd outperforms gzip at the highest compression levels, with the output file sizes being ~60% the size of the highest gzip compression levels.

Output compressed file size ratios by command line flag for each compression tool. Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single datum because it only compresses to the max ratio.

This next plot compares the trade-off between file size and the wall clock run time taken to compress a FASTQ file, with the compression process running single-threaded. It shows that zstd can achieve much better compression ratios, ~10% of the original file size, but with increasing run time. That run time is still not nearly as long as that of zopfli, a gzip implementation that gives the best compression ratio of any gzip implementation but at the expense of ~2 orders of magnitude in compression time.

Compression ratio versus compression time. Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single point because it only compresses to the max ratio.

Takeaway

The gzip implementation is superseded by other compression algorithms such as zstd. By continuing to only support gzip for FASTQ, the bioinformatics industry spends money unnecessarily on additional storage. Bioinformatics tools should widely support zstd as a compression format for FASTQ.
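Supporting both formats in a tool does not have to be complicated. As a minimal sketch, a reader can look at the first few bytes of a file: gzip streams start with the bytes 1f 8b and zstd frames start with 28 b5 2f fd. The function names here are hypothetical, and the zstd branch assumes the zstd command line tool is installed, since it pipes the file through zstd -d -c rather than using a Python binding.

```python
import gzip
import subprocess

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic number 0xFD2FB528, little-endian


def sniff_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(ZSTD_MAGIC):
        return "zstd"
    return "plain"


def read_fastq_lines(path):
    """Yield raw FASTQ lines (bytes) from a plain, gzip, or zstd compressed file."""
    with open(path, "rb") as handle:
        kind = sniff_compression(handle.read(4))
    if kind == "gzip":
        with gzip.open(path, "rb") as handle:
            yield from handle
    elif kind == "zstd":
        # Assumption: the zstd CLI is on the PATH; "-d -c" decompresses to stdout.
        with subprocess.Popen(["zstd", "-d", "-c", str(path)],
                              stdout=subprocess.PIPE) as proc:
            yield from proc.stdout
    else:
        with open(path, "rb") as handle:
            yield from handle
```

Detecting the format from the content rather than the file extension means existing gzipped inputs keep working unchanged while zstd files are accepted alongside them.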

Fear and self-loathing: getting old in tech

2022-05-09T00:00:00Z

When I was young and starting my career, I was hungry to learn. I would spend ten hours a day reading, note-taking, and playing with code. Introducing myself to a world where DNA could be manipulated with computers was exhilarating. Until then, I had worked in odd jobs doing butchery or on a checkout at Sainsbury's. After manual work, learning about bioinformatics and programming was like finding a new room in a house after living in it for 25 years. I was so energised that I read the O'Reilly Java book cover to cover. I'd sit there each evening in my tiny Newcastle rental, on my secondhand IKEA desk, and make notes about new concepts: magic words like hash tables, linked lists, classes, abstract classes, abstract factories, abstract factory factories, abstract factory impls, and other exciting Java concepts.

As I'm getting older, I often reflect on how I've changed compared to that person. An optimistic younger me now feels distant, like looking at your reflection in an old, cloudy mirror. Someone who feels like I'd have little to talk about if I met them now. That younger optimism came from excitement to be part of the tech industry, back when Google's motto was "Don't be evil." and people still believed it. When Gmail was revolutionary, and all your favourite blogs were in Google Reader. When working in tech felt like a chance to contribute something meaningful to the world.

Compared to that younger person, screen time makes me tired physically in a way it didn't before. I can no longer drink four cups of coffee and execute 8-, 10-, or 12-hour coding sessions. More urgently, my earlier hunger to improve myself has faded. My desire to consume coding books has morphed into a hunger to consume Trader Joe's Cambozola cheese on crackers. The excitement for new programming books has gone. I can't remember the last time I spent more than five minutes in the programming section at Barnes and Noble. I'll still occasionally listen to tech books on Audible. That way I can instead be a passive participant in learning about new technology.

For more than ten years I lived in the Bay Area. Living there now it feels ironic that the tech industry is located in San Francisco, a city that spawned a counterculture. I read about the beat generation rejecting materialism and authority. But my generation only knows it as the capital of an industry that spends billions of dollars to capture your attention. Where your monetary value can be reliably measured to the hundredth of a cent as the likelihood you will click on an ad. Where an Uber drops you off at a bar after work, so you can skip the human-shit dotted streets. But that same driver has to get by without employer health insurance or sick leave. Maybe if Ginsberg had been born in the late eighties he would have instead said he saw the greatest minds of his generation paid six figures to create the apps that enslave their own attention.

Author note: the end of that paragraph is way too melodramatic.

It seems like a lifetime, or at least a Main Era—the kind of peak that never comes again. San Francisco in the middle sixties was a very special time and place to be a part of. Maybe it meant something. Maybe not, in the long run… but no explanation, no mix of words or music or memories can touch that sense of knowing that you were there and alive in that corner of time and the world. Whatever it meant...

There was madness in any direction, at any hour. If not across the Bay, then up the Golden Gate or down 101 to Los Altos or La Honda.… You could strike sparks anywhere. There was a fantastic universal sense that whatever we were doing was right, that we were winning.…

-- Hunter S. Thompson, Fear and Loathing in Las Vegas.

I had that feeling living in the Bay Area, for a few years at least. A feeling that being part of "tech" was worthwhile. That averaged over all startups we were making the world better. That more access to information would improve our lives. That freer and broader discussion would let us show each other how we are being disenfranchised. That it would make us less lonely and isolated. That technology could solve our problems and set us free. Free from the bullshit of menial, dead-end work. That it'd give us a reason to get out of bed each morning. Now instead, when I wake up I just reach for my phone to see what's changed on Reddit. Usually not much in the 6 hours since I fell asleep looking at it.

The US government is investigating whether Facebook/Meta knew its software harmed children, weakened democracy, and provoked violence in developing countries. For a while, Tesla made more money trading Bitcoin than selling electric cars. There's a feeling of cynicism now. Most of us working in tech don't give a second thought to making the world better. Instead, it seems like what we care most about is the best time to jump to the next startup to get in early on the options. Which of the five companies actively recruiting us is likely to have the highest payout.

We're more likely to lose sleep because a coworker makes $375k a year while we only make $362k. Not because our employer helped Russia influence a national election. Or because our employer used tips to pay the salary of delivery drivers. Or because our employer lobbied to pass a law denying its own drivers sick leave and health insurance.

What would continuing to improve myself even mean in this environment? Would getting better at my job lead to anything meaningful? Or does it just end up making some founders a little bit richer? But if I'm not actively improving myself, then am I just coasting along? Is this all there is?

[Individuals] after months of experience typically obtain an acceptable level of proficiency. With long experience, often years, they can work as independent professionals. At that time, most professionals reach a stable average level of performance. Then they maintain this pedestrian level of performance for the rest of their careers.

-- Ericsson, KA "The influence of experience and deliberate practice on the development of superior expert performance" The Cambridge handbook of expertise and expert performance, pp. 683

This quote starts a self-loathing that has me gnawing on my fingernails. Staring blankly at the globs of dried peanut butter tucked amongst the tiny highways of my work keyboard. Is the peanut butter stuck in the trough between two Cherry MX mechanical switch keys a metaphor for where my career has gotten after fifteen years?

Would tech companies be happy if employees did nothing to manage their own performance? Unlikely. Do they expect their employees to train at the intensity of an Olympic athlete for software engineering? Probably not. In between these two extremes, are most software engineers constantly improving themselves in return for their astronomical salaries? The recurrent theme on sites like Hacker News and Blind is that if I want a job at Google or Netflix, I should be grinding out coding practice exercises every day. Being able to churn out the solution to inverting a red-black tree is the bare minimum any software engineer should be able to do. At least that's what many hiring processes have institutionalised.

I can't even consider hiring you if you can't implement a garbage collector on a whiteboard. But while you're here, you should sign up for my crypto newsletter.

-- an interviewer in the Bay Area somewhere

I'm reasonably confident that my employer is happy with my performance. Most employers understand that getting a bunch of zeros and ones in the correct order from a 200lb fleshy bag of water, half-digested food, and coffee is a tricky process, especially so during a pandemic. My assumption is that tech companies prefer their engineers to keep up with technology trends. But nothing on par with a gymnast getting up at 6am before work to engage in deliberate practice for 3 hours. So as I get older, where does this ever-present background fear about coasting along come from?

Part of this fear is avoiding becoming stagnant: I turn up to work each day knowing only enough to get the job done. Then in two or five years, I realise that technology has passed me by. I'll admit I'm now useless, and everyone will soon find out too.

I want to continue to grow and get better. There's a meaning in being good at what you do. Software engineers often talk about craftsmanship, beyond being paid to do a job, but doing it well for its own sake. If I'm spending 40+ hours a week doing something, this demands mindful intention. The reward is pointing at a finished project and knowing I could do it because I was competent.

But beyond the fear of obsolescence, all of us only really have a limited amount of "good years" in our careers. Sam Harris puts this in a way that feels meaningful:

"The reality is that you don't know how much time you have. Do that most important thing now. Express your love now. Relinquish those hangups now. [...] Live fully now, for one day you will die." Each of us only has one life. There is no dress rehearsal. I want to find a meaning in this life that is the antidote to just waking up and immediately browsing social media.

-- Sam Harris

At some point, the good years in my life will end. COVID has shown that, for many of us, this could happen much sooner than we would like. I might be asking myself if I lived a good life when I'm 77. But I may also have an unexpected reason to be asking myself this when I'm 50, next year, or possibly even next week. It's the most urgent question to ask myself: what would I do with my time if I knew I only had two or three good years left? Urgent because this is what we spend most of our waking time doing, and ultimately because we just don't have as much time as we think. Do I want to use that time earning >$400k a year so that hundreds of thousands of people spend an extra few seconds browsing ads?

Hopeful?

For the most part I've enjoyed my career so far. I'm grateful for that. After having spent more than a year answering phones in a call centre in my early twenties, getting paid as much as I do to write code feels unreal. A background in biology means I get paid to work on interesting problems too. The destructive effects of the tech industry on housing, democracy, small businesses, and our attention are apparent to everyone. I wrote this essay after having noticed in myself a high-water mark in cynicism around our industry. But at least in some companies it feels like there's still a chance to be part of a team creating a better future. And working at a startup focused on identifying infectious diseases, there's a chance to help people who need a better future right now. And that's a future in the tech industry I still want to be a part of creating.