<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Bioinformatics Zen</title>
  <subtitle>A blog about bioinformatics and mindfulness by Michael Barton.</subtitle>
  <link href="http://www.bioinformaticszen.com/feed.xml" rel="self"/>
  <link href="http://www.bioinformaticszen.com/"/>
  <updated>2023-01-03T00:00:00Z</updated>
  <id>http://www.bioinformaticszen.com/</id>
  <author>
    <name>Michael Barton</name>
    <email>mail@michaelbarton.me.uk</email>
  </author>
  
  <entry>
    <title>We&#39;re wasting money by only supporting gzip for raw DNA files.</title>
    <link href="http://www.bioinformaticszen.com/post/use-zstd-for-raw-fastq/"/>
    <updated>2023-01-03T00:00:00Z</updated>
    <id>http://www.bioinformaticszen.com/post/use-zstd-for-raw-fastq/</id>
    <content type="html">&lt;div class=&quot;centred banner_image&quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/image_card.jpg&quot; alt=&quot;Stylised graphics of money with DNA helix in the foreground.&quot; width=&quot;640px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;div class=&quot;lede&quot;&gt;
&lt;ul&gt;
&lt;li&gt;The increasing throughput of Illumina DNA sequencing means institutions and
companies are spending tens of thousands of dollars to store terabytes of raw
DNA sequence (FASTQ). This data is stored using gzip, a 30-year-old
compression algorithm.&lt;/li&gt;
&lt;li&gt;Common bioinformatics tools should support more recent compression algorithms
such as zstd for FASTQ data. Zstd has wide industry support, with comparable
run times and would likely reduce storage costs by 50% over gzip.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2&gt;Gzip is outperformed by other algorithms&lt;/h2&gt;
&lt;p&gt;The original implementation of gzip (Gailly/Madler) has been surpassed in
performance by other gzip implementations. For example,
&lt;a href=&quot;https://github.com/cloudflare/zlib&quot;&gt;cloudflare-zlib&lt;/a&gt; outperforms the original
gzip in compression speeds and should be used instead.&lt;/p&gt;
&lt;p&gt;The use of the gzip compression format is still ubiquitous for raw FASTQ DNA
sequence. This is due to it being the only supported compression format for
FASTQ in bioinformatics tools. In the thirty years since gzip was created there
are now alternatives with superior compression ratios. Only supporting gzip for
FASTQ translates into millions of dollars in storage fees on services like
Amazon’s S3 and EFS compared with algorithms with better compression ratios.
Companies like
&lt;a href=&quot;https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/&quot;&gt;Meta&lt;/a&gt;,
&lt;a href=&quot;https://www.infoq.com/news/2022/09/amazon-gzip-zstd/&quot;&gt;Amazon&lt;/a&gt;, and
&lt;a href=&quot;https://www.uber.com/en-GB/blog/cost-efficiency-big-data/&quot;&gt;Uber&lt;/a&gt; are reported
to be switching to zstd over gzip. If the most common bioinformatics tools can
move to support ingesting zstd-compressed FASTQ format this could save everyone
time and money with minimal impact on compression times.&lt;/p&gt;
&lt;h2&gt;A toy benchmark&lt;/h2&gt;
&lt;p&gt;As an example, a zstd-compressed FASTQ file
(&lt;a href=&quot;https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&amp;amp;acc=SRR7589561&amp;amp;display=metadata&quot;&gt;SRR7589561&lt;/a&gt;)
is almost 50% the size of the same gzipped file. In the figure below I
downloaded ~1.5Gb of FASTQ data and compressed it with either &lt;code&gt;pigz&lt;/code&gt; or &lt;code&gt;zstd&lt;/code&gt;.
Pigz is a parallel implementation of the original gzip.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/file-size-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;FASTQ file size by compression algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;FASTQ files do, however, take longer to compress with zstd. The &lt;code&gt;zstd -15&lt;/code&gt; command
takes ~70s, which is 100% longer than &lt;code&gt;pigz -9&lt;/code&gt; at ~35s. However, it’s worth
noting that when storing raw FASTQ from a sequencer, these files are compressed once,
and then stored for years. This additional CPU time cost is more than offset by
savings in storage costs. The same does not apply to intermediate files such as
trimmed or filtered FASTQ in a pipeline that tend to be ephemeral. These would
require a further examination of trade offs.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/compress-time-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Total compression time in seconds by algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;The next figure shows that changes in decompression time for the same file are
relatively small, 3.5s vs 2.2s. Therefore decompression would be minimally
impacted.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/decompress-time-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Total decompression time in seconds by algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
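&lt;p&gt;A minimal sketch of how a comparison like the one above could be measured in Python. This is not the script behind the figures: the helper names &lt;code&gt;synthetic_fastq&lt;/code&gt; and &lt;code&gt;benchmark&lt;/code&gt; are hypothetical, the input is randomly generated FASTQ-like text rather than SRR7589561, and the standard library &lt;code&gt;gzip&lt;/code&gt; module stands in for &lt;code&gt;pigz&lt;/code&gt;. The zstd entry is only added if the third-party &lt;code&gt;zstandard&lt;/code&gt; package happens to be installed, so the numbers it prints will not match the plots.&lt;/p&gt;

```python
import gzip
import random
import time


def synthetic_fastq(n_reads, read_len=100, seed=0):
    """Build FASTQ-formatted bytes from random bases (illustrative data only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data, compressors):
    """Return {name: (compressed size / original size, wall-clock seconds)}."""
    results = {}
    for name, compress in compressors.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed) / len(data), elapsed)
    return results


# gzip from the standard library, used here as a stand-in for pigz -9.
compressors = {"gzip -9": lambda d: gzip.compress(d, compresslevel=9)}

try:
    # Assumption: the third-party zstandard package is available.
    import zstandard

    compressors["zstd -15"] = lambda d: zstandard.ZstdCompressor(level=15).compress(d)
except ImportError:
    pass  # Without it, only the gzip result is reported.

data = synthetic_fastq(2000)
results = benchmark(data, compressors)
for name, (ratio, seconds) in results.items():
    print(f"{name}: {ratio:.1%} of original size in {seconds:.3f}s")
```

&lt;p&gt;Random bases compress very differently from real reads, so a sketch like this is only useful for checking the measurement approach, not for drawing conclusions about either format.&lt;/p&gt;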
&lt;h2&gt;Detailed comparison of flags&lt;/h2&gt;
&lt;p&gt;This figure compares the compressed output file size for all the different
available gzip implementations with zstd for different compression flags on the
same SRR7589561 FASTQ file. This shows that zstd outperforms gzip at the highest
compression levels, with the output file sizes being ~60% the size of the
highest gzip compression levels.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/plot-file-size-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Output compressed file size ratios by command line flag for each compression tool.&lt;/strong&gt; Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single datum because it only compresses to the max ratio.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;This next plot compares the trade-offs for file size versus the wall clock run
time taken to compress a FASTQ file. This is for the compression process running
single-threaded. This shows that zstd can result in much better compression
ratios, ~10% of the original file size, but with increasing run time. That run
time is still not nearly as long as that of zopfli, a gzip implementation that
gives the best compression ratio of any gzip implementation but at the expense
of ~2 orders of magnitude in compression time.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/plot-compression-time-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Compression ratio versus compression time.&lt;/strong&gt; Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single point because it only compresses to the max ratio.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;The gzip implementation is superseded by other compression algorithms such as
zstd. By continuing to only support gzip for FASTQ, the bioinformatics industry
spends money unnecessarily on additional storage. Bioinformatics tools should
widely support zstd as a compression format for FASTQ.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Fear and self-loathing: getting old in tech</title>
    <link href="http://www.bioinformaticszen.com/post/fear-and-self-loathing-growing-old-in-tech/"/>
    <updated>2022-05-09T00:00:00Z</updated>
    <id>http://www.bioinformaticszen.com/post/fear-and-self-loathing-growing-old-in-tech/</id>
    <content type="html">&lt;div class=&quot;centred &quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202204181910-fear-loathing-getting-old-in-tech/cynical-profile-picture.png&quot; alt=&quot;Picture of my face with &amp;quot;cynical&amp;quot; written over the top in bright letters.&quot; width=&quot;720px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;p&gt;When I was young and starting my career, I was hungry to learn. I would spend
ten hours a day reading, note-taking, and playing with code. Introducing myself
to a world where DNA could be manipulated with computers was exhilarating. Until
then, I had worked in odd jobs doing butchery or on a checkout at Sainsbury&#39;s.
After manual work, learning about bioinformatics and programming was like
finding a new room in a house after living in it for 25 years. I was so
energised that I read the O&#39;Reilly Java book cover to cover. I&#39;d sit there each
evening in my tiny Newcastle rental, on my secondhand IKEA desk, and make notes
about new concepts: magic words like hash tables, linked lists, classes,
abstract classes, abstract factories, abstract factory factories, abstract
factory impls, and other exciting Java concepts.&lt;/p&gt;
&lt;p&gt;As I&#39;m getting older, I often reflect on how I&#39;ve changed compared to that
person. An optimistic younger me now feels distant, like looking at your
reflection in an old, cloudy mirror. Someone who feels like I&#39;d have little to
talk about if I met them now. That younger optimism came from excitement to be
part of the tech industry, back when Google&#39;s motto was &amp;quot;Don&#39;t be evil.&amp;quot; and
people still believed it. When Gmail was revolutionary, and all your favourite
blogs were in Google Reader. When working in tech felt like a chance to contribute
something meaningful to the world.&lt;/p&gt;
&lt;p&gt;Compared to that younger person, screen time makes me tired physically in a way
it didn&#39;t before. I can no longer drink four cups of coffee and execute 8-, 10-,
or 12-hour coding sessions. More urgently, my earlier hunger to improve myself
has faded. My desire to consume coding books has morphed into a hunger to
consume Trader Joe&#39;s Cambozola cheese on crackers. The excitement for new
programming books has gone. I can&#39;t remember the last time I spent more than
five minutes in the programming section at Barnes and Noble. I&#39;ll still
occasionally listen to tech books on Audible. That way I can instead be a
passive participant in learning about new technology.&lt;/p&gt;
&lt;p&gt;For more than ten years I lived in the Bay Area. Living there now it feels
ironic that the tech industry is located in San Francisco, a city that spawned a
counterculture. I read about the beat generation rejecting materialism and
authority. But my generation only knows it as the capital of an industry that
spends billions of dollars to capture your attention. Where your monetary value
can be reliably measured to the hundredth of a cent as the likelihood you will
click on an ad. Where an Uber drops you off at a bar after work, so you can skip
the human-shit dotted streets. But that same driver has to get by without
employer health insurance or sick leave. Maybe if Ginsberg had been born in the
late eighties he would have instead said he saw the greatest minds of his
generation paid six figures to create the apps that enslave their own attention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Author note&lt;/strong&gt;: &lt;em&gt;the end of that paragraph is way too melodramatic.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It seems like a lifetime, or at least a Main Era—the kind of peak that never
comes again. San Francisco in the middle sixties was a very special time and
place to be a part of. Maybe it meant something. Maybe not, in the long run…
but no explanation, no mix of words or music or memories can touch that sense
of knowing that you were there and alive in that corner of time and the world.
Whatever it meant...&lt;/p&gt;
&lt;p&gt;There was madness in any direction, at any hour. If not across the Bay, then
up the Golden Gate or down 101 to Los Altos or La Honda.… You could strike
sparks anywhere. There was a fantastic universal sense that whatever we were
doing was right, that we were winning.…&lt;/p&gt;
&lt;p&gt;-- Hunter S. Thompson, Fear and Loathing in Las Vegas.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had that feeling living in the Bay Area, for a few years at least. A feeling
that being part of &amp;quot;tech&amp;quot; was worthwhile. That averaged over all startups we
were making the world better. That more access to information would improve our
lives. That freer and broader discussion would let us show each other how we are
being disenfranchised. That it would make us less lonely and isolated. That
technology could solve our problems and set us free. Free from the bullshit of
menial, dead-end work. That it&#39;d give us a reason to get out of bed each
morning. Now instead, when I wake up I just reach for my phone to see what&#39;s
changed on Reddit. Usually not much in the 6 hours since I fell asleep looking
at it.&lt;/p&gt;
&lt;p&gt;The US government is investigating whether Facebook/Meta knew its software
harmed children, weakened democracy, and provoked violence in developing
countries. For a while, Tesla made more money trading Bitcoin than selling
electric cars. There&#39;s a feeling of cynicism now. Most of us working in tech
don&#39;t give a second thought to making the world better. Instead it seems like what we
care most about is when the best time is to jump to the next startup to get in
early on the options. Which of the five companies actively recruiting us is
likely to have the highest payout.&lt;/p&gt;
&lt;p&gt;We&#39;re more likely to lose sleep because a coworker makes $375k a year while we
only make $362k. Not because our employer helped Russia influence a national
election. Or because our employer used tips to pay the salary of delivery
drivers. Or because our employer lobbied to pass a law denying its own drivers
sick leave and health insurance.&lt;/p&gt;
&lt;p&gt;What would continuing to improve myself even mean in this environment? Would
getting better at my job lead to anything meaningful? Or does it just end up
making some founders a little bit richer? But if I&#39;m not actively improving myself,
then am I just coasting along? Is this all there is?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Individuals] after months of experience typically obtain an acceptable level
of proficiency. With long experience, often years, they can work as
independent professionals. At that time, most professionals reach a stable
average level of performance. Then they maintain this pedestrian level of
performance for the rest of their careers.&lt;/p&gt;
&lt;p&gt;-- Ericsson, KA &amp;quot;The influence of experience and deliberate practice on the
development of superior expert performance&amp;quot; The Cambridge handbook of
expertise and expert performance, pp. 683&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This quote starts a self-loathing that has me gnawing on my fingernails. Staring
blankly at the globs of dried peanut butter tucked amongst the tiny highways of
my work keyboard. Is the peanut butter stuck in the trough between two Cherry MX
mechanical switch keys a metaphor for where my career has gotten after fifteen
years?&lt;/p&gt;
&lt;p&gt;Would tech companies be happy if employees did nothing to manage their own
performance? Unlikely. Do they expect their employees to train at the intensity
of an Olympic athlete for software engineering? Probably not. In between
these two extremes, are most software engineers constantly improving themselves
in return for their astronomical salaries? The recurrent theme on sites like
Hacker News and Blind is that if I want a job at Google or Netflix, I should
constantly be grinding out coding practice exercises every day. Being able to
churn out the solution to inverting a red-black tree is the bare minimum any
software engineer should be able to do. At least that&#39;s what many hiring
processes have institutionalised.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I can&#39;t even consider hiring you if you can&#39;t implement a garbage collector on
a whiteboard. But while you&#39;re here, you should sign up for my crypto
newsletter.&lt;/p&gt;
&lt;p&gt;-- an interviewer in the Bay Area somewhere&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&#39;m reasonably confident that my employer is happy with my performance. Most
employers understand that getting a bunch of zeros and ones in the correct order
from a 200lb fleshy bag of water, half-digested food, and coffee is a tricky
process, especially so during a pandemic. My assumption is that tech companies
prefer their engineers to keep up with technology trends. But nothing on par
with a gymnast getting up at 6am before work to engage in deliberate practice
for 3 hours. So as I get older, where does this ever-present background fear
about coasting along come from?&lt;/p&gt;
&lt;p&gt;Part of this fear is avoiding becoming stagnant: I turn up to work each day
knowing only enough to get the job done. Then in two or five years, I realise
that technology has passed me by. I&#39;ll admit I&#39;m now useless, and everyone will
soon find out too.&lt;/p&gt;
&lt;p&gt;I want to continue to grow and get better. There&#39;s a meaning in being good at
what you do. Software engineers often talk about craftsmanship, beyond being
paid to do a job, but doing it well for its own sake. If I&#39;m spending 40+ hours
a week doing something, this demands mindful intention. The reward is pointing
at a finished project and knowing I could do it because I was competent.&lt;/p&gt;
&lt;p&gt;But beyond the fear of obsolescence, all of us only really have a limited amount
of &amp;quot;good years&amp;quot; in our careers. Sam Harris puts this in a way that feels
meaningful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The reality is that you don&#39;t know how much time you have. Do that most important
thing now. Express your love now. Relinquish those hangups now. [...] Live
fully now, for one day you will die.&amp;quot; Each of us only has one life. There is
no dress rehearsal. I want to find a meaning in this life that is the antidote
to just waking up and immediately browsing social media.&lt;/p&gt;
&lt;p&gt;-- Sam Harris&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At some point, the good years in my life will end. COVID has shown that, for
many of us, this could happen much sooner than we would like. I might be asking
myself if I lived a good life when I&#39;m 77. But I may also have an unexpected
reason to be asking myself this when I&#39;m 50, next year, or possibly even next
week. It&#39;s the most urgent question to ask myself, what would I do with my time
if I knew I only had two or three good years left? Urgent because this is what
we spend most of our waking time doing, and ultimately because we just don&#39;t
have as much time as we think. Do I want to use that time earning &amp;gt;$400k a year
so that hundreds of thousands of people spend an extra few seconds browsing
ads?&lt;/p&gt;
&lt;div class=&quot;centred &quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202204181910-fear-loathing-getting-old-in-tech/here-hopeful.png&quot; alt=&quot;The word &amp;quot;hopeful&amp;quot; inside HTML &amp;quot;here&amp;quot; tags.&quot; width=&quot;560px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;here&gt;
	hopeful?
&lt;/here&gt;
&lt;p&gt;For the most part I&#39;ve enjoyed my career so far. I&#39;m grateful for that. After
having spent more than a year answering phones in a call centre in my early
twenties, getting paid as much as I do to write code feels unreal. A background
in biology means I get paid to work on interesting problems too. The destructive
effects of the tech industry on housing, democracy, small businesses, and our
attention are apparent to everyone. I wrote this essay after having noticed in
myself a high-water mark in cynicism around our industry. But at least in some
companies it feels like there&#39;s still a chance to be part of a team creating a
better future. And working at a startup focused on identifying infectious
diseases, there&#39;s a chance to help people who need a better future right now.
And that&#39;s a future in the tech industry I still want to be a part of creating.&lt;/p&gt;
</content>
  </entry>
</feed>
