<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Bioinformatics Zen</title>
  <subtitle>A blog about bioinformatics and mindfulness by Michael Barton.</subtitle>
  <link href="http://www.bioinformaticszen.com/feed.xml" rel="self"/>
  <link href="http://www.bioinformaticszen.com/"/>
  <updated>2023-01-03T00:00:00Z</updated>
  <id>http://www.bioinformaticszen.com/</id>
  <author>
    <name>Michael Barton</name>
    <email>mail@michaelbarton.me.uk</email>
  </author>
  
  <entry>
    <title>We&#39;re wasting money by only supporting gzip for raw DNA files.</title>
    <link href="http://www.bioinformaticszen.com/post/use-zstd-for-raw-fastq/"/>
    <updated>2023-01-03T00:00:00Z</updated>
    <id>http://www.bioinformaticszen.com/post/use-zstd-for-raw-fastq/</id>
    <content type="html">&lt;div class=&quot;centred banner_image&quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/image_card.jpg&quot; alt=&quot;Stylised graphics of money with DNA helix in the foreground.&quot; width=&quot;640px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;div class=&quot;lede&quot;&gt;
&lt;ul&gt;
&lt;li&gt;The increasing throughput of Illumina DNA sequencing means
institutions and companies are spending tens of thousands of dollars
to store terabytes of raw DNA sequence (FASTQ). This data is stored
using gzip, a 30-year-old compression algorithm.&lt;/li&gt;
&lt;li&gt;Common bioinformatics tools should support more recent compression
algorithms such as zstd for FASTQ data. Zstd has wide industry
support and comparable run times, and would likely reduce storage
costs by 50% compared with gzip.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2&gt;Gzip is outperformed by other algorithms&lt;/h2&gt;
&lt;p&gt;The original implementation of gzip (Gailly/Madler) has been surpassed
in performance by other gzip implementations. For example,
&lt;a href=&quot;https://github.com/cloudflare/zlib&quot;&gt;cloudflare-zlib&lt;/a&gt; outperforms the
original gzip in compression speeds and should be used instead.&lt;/p&gt;
&lt;p&gt;The gzip compression format is still ubiquitous for raw FASTQ
DNA sequence because it is the only compression format supported for
FASTQ in bioinformatics tools. In the thirty years since gzip was
created, alternatives with superior compression ratios have emerged.
Only supporting gzip for FASTQ translates into millions of dollars in
additional storage fees on services like Amazon’s S3 and EFS compared with
algorithms with better compression ratios. Companies like
&lt;a href=&quot;https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/&quot;&gt;Meta&lt;/a&gt;,
&lt;a href=&quot;https://www.infoq.com/news/2022/09/amazon-gzip-zstd/&quot;&gt;Amazon&lt;/a&gt;, and
&lt;a href=&quot;https://www.uber.com/en-GB/blog/cost-efficiency-big-data/&quot;&gt;Uber&lt;/a&gt; are
reported to be switching to zstd over gzip. If the most common
bioinformatics tools moved to support ingesting zstd-compressed FASTQ,
this could save everyone time and money with minimal impact on
compression times.&lt;/p&gt;
&lt;h2&gt;A toy benchmark&lt;/h2&gt;
&lt;p&gt;As an example, a zstd-compressed FASTQ file
(&lt;a href=&quot;https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&amp;amp;acc=SRR7589561&amp;amp;display=metadata&quot;&gt;SRR7589561&lt;/a&gt;)
is almost 50% the size of the same gzipped file. For the figure below I
downloaded ~1.5Gb of FASTQ data and compressed it with either &lt;code&gt;pigz&lt;/code&gt; or
&lt;code&gt;zstd&lt;/code&gt;. Pigz is a parallel implementation of the original gzip.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/file-size-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;FASTQ file size by compression algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;FASTQ files do take longer to compress with zstd. The &lt;code&gt;zstd -15&lt;/code&gt;
command takes ~70s, which is 100% longer than &lt;code&gt;pigz -9&lt;/code&gt; at ~35s. However,
it’s worth noting that when storing raw FASTQ from a sequencer, these files
are compressed once and then stored for years. This additional CPU time
cost is more than offset by savings in storage costs. The same does not
apply to intermediate files such as trimmed or filtered FASTQ in a
pipeline, which tend to be ephemeral. These would require a further
examination of trade-offs.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/compress-time-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Total compression time in seconds by algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
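&lt;p&gt;The claim that extra CPU time is offset by storage savings can be checked with some quick arithmetic. The sketch below is only illustrative: the timings are the ones from the toy benchmark, but the CPU price, storage price, and gigabytes saved per file are hypothetical placeholders, and wall clock seconds are treated as CPU seconds for simplicity. The function name &lt;code&gt;breakeven_months&lt;/code&gt; is made up for this example.&lt;/p&gt;

```python
def breakeven_months(extra_cpu_seconds, cpu_dollars_per_hour,
                     gb_saved, storage_dollars_per_gb_month):
    """Months of storage needed before the size saving repays the extra CPU time."""
    extra_cpu_cost = extra_cpu_seconds / 3600 * cpu_dollars_per_hour
    monthly_saving = gb_saved * storage_dollars_per_gb_month
    return extra_cpu_cost / monthly_saving


# Timings from the toy benchmark: zstd -15 took ~70s, pigz -9 took ~35s.
extra_seconds = 70 - 35

# Everything below is an illustrative assumption, not a measured value.
cpu_price = 0.10        # hypothetical dollars per CPU-hour
storage_price = 0.023   # hypothetical dollars per GB-month
gb_saved = 0.1          # hypothetical GB saved per file by using zstd

months = breakeven_months(extra_seconds, cpu_price, gb_saved, storage_price)
print(f"Break-even after {months:.2f} months of storage")
```

&lt;p&gt;Swapping in real prices and measured file sizes gives the number of months after which a file kept in storage has paid back its slower compression.&lt;/p&gt;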
&lt;p&gt;The next figure shows that changes in decompression time for the same
file are relatively small, 3.5s vs 2.2s. Therefore decompression would
be minimally impacted.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/decompress-time-example-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Total decompression time in seconds by algorithm.&lt;/strong&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2&gt;Detailed comparison of flags&lt;/h2&gt;
&lt;p&gt;This figure compares the compressed output file size for all the
different available gzip implementations with zstd for different
compression flags on the same SRR7589561 FASTQ file. This shows that
zstd outperforms gzip at the highest compression levels, with the output
file sizes being ~60% the size of the highest gzip compression levels.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/plot-file-size-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Output compressed file size ratios by command line flag for each compression tool.&lt;/strong&gt; Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single datum because it only compresses to the max ratio.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;This next plot compares the trade-offs for file size versus the wall
clock run time taken to compress a FASTQ file, with the
compression process running single-threaded. It shows that zstd can
result in much better compression ratios, ~10% of the original file size,
but with increasing run time. This is still not nearly as long as the run time
for zopfli, a gzip implementation that gives the best compression ratio of
any gzip implementation but at the expense of ~2 orders of magnitude in
compression time.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202212120000-use-zstd-for-raw-fastq/plot-compression-time-1.png&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;strong&gt;Compression ratio versus compression time.&lt;/strong&gt; Each colour represents a different compression tool implementation. Each argument was benchmarked five times. Note that zopfli has a single point because it only compresses to the max ratio.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
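&lt;p&gt;For anyone wanting to try a similar size-versus-time comparison, here is a minimal sketch using only the Python standard library. It is not the code used to produce the figures above. It covers the gzip side only, using the &lt;code&gt;gzip&lt;/code&gt; module, and runs on randomly generated reads, which compress differently from real sequencing data. The helper names &lt;code&gt;synthetic_fastq&lt;/code&gt; and &lt;code&gt;benchmark_gzip&lt;/code&gt; are made up for this example.&lt;/p&gt;

```python
import gzip
import random
import time


def synthetic_fastq(n_reads, read_len=100, seed=0):
    """Build fake FASTQ text so the sketch runs without downloading data."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark_gzip(data, levels=(1, 6, 9)):
    """Return {level: (compressed/original size ratio, seconds)} for gzip."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = gzip.compress(data, compresslevel=level)
        elapsed = time.perf_counter() - start
        results[level] = (len(compressed) / len(data), elapsed)
    return results


data = synthetic_fastq(2000)
results = benchmark_gzip(data)
for level, (ratio, seconds) in results.items():
    print(f"gzip -{level}: ratio={ratio:.3f} time={seconds:.3f}s")
```

&lt;p&gt;To compare against zstd, the same loop could wrap whichever zstd binding or command line tool is installed, and a real FASTQ file should replace the synthetic reads.&lt;/p&gt;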
&lt;h2&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Gzip has been superseded by other compression algorithms
such as zstd. By continuing to only support gzip for FASTQ, the
bioinformatics industry spends money unnecessarily on additional
storage. Bioinformatics tools should widely support zstd as a
compression format for FASTQ.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Fear and self-loathing: getting old in tech</title>
    <link href="http://www.bioinformaticszen.com/post/fear-and-self-loathing-growing-old-in-tech/"/>
    <updated>2022-05-09T00:00:00Z</updated>
    <id>http://www.bioinformaticszen.com/post/fear-and-self-loathing-growing-old-in-tech/</id>
    <content type="html">&lt;div class=&quot;centred &quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202204181910-fear-loathing-getting-old-in-tech/cynical-profile-picture.png&quot; alt=&quot;Picture of my face with &amp;quot;cynical&amp;quot; written over the top in bright letters.&quot; width=&quot;720px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;p&gt;When I was young and starting my career, I was hungry to learn. I would spend
ten hours a day reading, note-taking, and playing with code. Introducing myself
to a world where DNA could be manipulated with computers was exhilarating.
Until then, I had worked in odd jobs doing butchery or on a checkout at
Sainsbury&#39;s. After manual work, learning about bioinformatics and programming
was like finding a new room in a house after living in it for 25 years. I was
so energised that I read the O&#39;Reilly Java book cover to cover. I&#39;d sit there
each evening in my tiny Newcastle rental, on my secondhand IKEA desk, and make
notes about new concepts: magic words like hash tables, linked lists, classes,
abstract classes, abstract factories, abstract factory factories, abstract
factory impls, and other exciting Java concepts.&lt;/p&gt;
&lt;p&gt;As I&#39;m getting older, I often reflect on how I&#39;ve changed compared to that
person. An optimistic younger me now feels distant, like looking at your
reflection in an old, cloudy mirror. Someone who feels like I&#39;d have little to
talk about if I met them now. That younger optimism came from excitement to be
part of the tech industry, back when Google&#39;s motto was &amp;quot;Don&#39;t be evil.&amp;quot; and
people still believed it. When Gmail was revolutionary, and all your favourite
blogs were in Google Reader. When working in tech felt like a chance to contribute
something meaningful to the world.&lt;/p&gt;
&lt;p&gt;Compared to that younger person, screen time makes me tired physically in a way
it didn&#39;t before. I can no longer drink four cups of coffee and execute 8-,
10-, or 12-hour coding sessions. More urgently, my earlier hunger to improve
myself has faded. My desire to consume coding books has morphed into a hunger
to consume Trader Joe&#39;s Cambozola cheese on crackers. The excitement for new
programming books has gone. I can&#39;t remember the last time I spent more than
five minutes in the programming section at Barnes and Noble. I&#39;ll still
occasionally listen to tech books on Audible. That way I can instead be a
passive participant in learning about new technology.&lt;/p&gt;
&lt;p&gt;For more than ten years I lived in the Bay Area. Living there now it feels
ironic that the tech industry is located in San Francisco, a city that spawned
a counterculture. I read about the beat generation rejecting materialism and
authority. But my generation only knows it as the capital of an industry that
spends billions of dollars to capture your attention. Where your monetary value
can be reliably measured to the hundredth of a cent as the likelihood you will
click on an ad. Where an Uber drops you off at a bar after work, so you can
skip the human-shit dotted streets. But that same driver has to get by without
employer health insurance or sick leave. Maybe if Ginsberg had been born in the
late eighties he would have instead said he saw the greatest minds of his
generation paid six figures to create the apps that enslave their own
attention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Author note&lt;/strong&gt;: &lt;em&gt;the end of that paragraph is way too melodramatic.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It seems like a lifetime, or at least a Main Era—the kind of peak that never
comes again. San Francisco in the middle sixties was a very special time and
place to be a part of. Maybe it meant something. Maybe not, in the long run…
but no explanation, no mix of words or music or memories can touch that sense
of knowing that you were there and alive in that corner of time and the
world. Whatever it meant...&lt;/p&gt;
&lt;p&gt;There was madness in any direction, at any hour. If not across the Bay, then
up the Golden Gate or down 101 to Los Altos or La Honda.… You could strike
sparks anywhere. There was a fantastic universal sense that whatever we were
doing was right, that we were winning.…&lt;/p&gt;
&lt;p&gt;-- Hunter S. Thompson, Fear and Loathing in Las Vegas.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had that feeling living in the Bay Area, for a few years at least. A feeling
that being part of &amp;quot;tech&amp;quot; was worthwhile. That averaged over all startups we
were making the world better. That more access to information would improve our
lives. That freer and broader discussion would let us show each other how we
are being disenfranchised. That it would make us less lonely and isolated. That
technology could solve our problems and set us free. Free from the bullshit of
menial, dead-end work. That it&#39;d give us a reason to get out of bed each
morning. Now instead, when I wake up I just reach for my phone to see what&#39;s
changed on Reddit. Usually not much in the 6 hours since I fell asleep looking
at it.&lt;/p&gt;
&lt;p&gt;The US government is investigating whether Facebook/Meta knew its software
harmed children, weakened democracy, and provoked violence in developing
countries. For a while, Tesla made more money trading Bitcoin than selling
electric cars. There&#39;s a feeling of cynicism now. Most of us working in tech
don&#39;t give a second thought to making the world better. Instead it seems like
what we care most about is the best time to jump to the next startup to get
in early on the options. Which of the five companies actively recruiting us is
likely to have the highest payout.&lt;/p&gt;
&lt;p&gt;We&#39;re more likely to lose sleep because a coworker makes $375k a year while we
only make $362k. Not because our employer helped Russia influence a national
election. Or because our employer used tips to pay the salary of delivery
drivers. Or because our employer lobbied to pass a law denying its own drivers
sick leave and health insurance.&lt;/p&gt;
&lt;p&gt;What would continuing to improve myself even mean in this environment? Would
getting better at my job lead to anything meaningful? Or does it just end up
making some founders a little bit richer? But if I&#39;m not actively improving myself,
then am I just coasting along? Is this all there is?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Individuals] after months of experience typically obtain an acceptable level
of proficiency. With long experience, often years, they can work as independent
professionals. At that time, most professionals reach a stable average level of
performance. Then they maintain this pedestrian level of performance for the
rest of their careers.&lt;/p&gt;
&lt;p&gt;-- Ericsson, KA &amp;quot;The influence of experience and deliberate practice on the
development of superior expert performance&amp;quot; The Cambridge handbook of
expertise and expert performance, pp. 683&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This quote starts a self-loathing that has me gnawing on my fingernails.
Staring blankly at the globs of dried peanut butter tucked amongst the tiny
highways of my work keyboard. Is the peanut butter stuck in the trough between
two Cherry MX mechanical switch keys a metaphor for where my career has gotten
after fifteen years?&lt;/p&gt;
&lt;p&gt;Would tech companies be happy if employees did nothing to manage their own
performance? Unlikely. Do they expect their employees to train at the intensity
of an Olympic athlete for software engineering? Probably not. In between
these two extremes, are most software engineers constantly improving themselves
in return for their astronomical salaries? The recurrent theme on sites like
Hacker News and Blind is that if I want a job at Google or Netflix, I should
be grinding out coding practice exercises every day. Being able to
churn out the solution to inverting a red-black tree is the bare minimum any
software engineer should be able to do. At least that&#39;s what many hiring
processes have institutionalised.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I can&#39;t even consider hiring you if you can&#39;t implement a garbage collector
on a whiteboard. But while you&#39;re here, you should sign up for my crypto
newsletter.&lt;/p&gt;
&lt;p&gt;-- an interviewer in the Bay Area somewhere&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&#39;m reasonably confident that my employer is happy with my performance. Most
employers understand that getting a bunch of zeros and ones in the correct
order from a 200lb fleshy bag of water, half-digested food, and coffee is a
tricky process, especially so during a pandemic. My assumption is that tech
companies prefer their engineers to keep up with technology trends. But nothing
on par with a gymnast getting up at 6am before work to engage in deliberate
practice for 3 hours. So as I get older, where does this ever-present
background fear about coasting along come from?&lt;/p&gt;
&lt;p&gt;Part of this fear is avoiding becoming stagnant: I turn up to work each day
knowing only enough to get the job done. Then in two or five years, I realise
that technology has passed me by. I&#39;ll admit I&#39;m now useless, and everyone will
soon find out too.&lt;/p&gt;
&lt;p&gt;I want to continue to grow and get better. There&#39;s a meaning in being good at
what you do. Software engineers often talk about craftsmanship, beyond being
paid to do a job, but doing it well for its own sake. If I&#39;m spending 40+ hours
a week doing something, this demands mindful intention. The reward is pointing
at a finished project and knowing I could do it because I was competent.&lt;/p&gt;
&lt;p&gt;But beyond the fear of obsolescence, all of us only really have a limited
amount of &amp;quot;good years&amp;quot; in our careers. Sam Harris puts this in a way that feels
meaningful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The reality is that you don&#39;t know how much time you have. Do that most important
thing now. Express your love now. Relinquish those hangups now. [...] Live
fully now, for one day you will die.&amp;quot; Each of us only has one life. There is
no dress rehearsal. I want to find a meaning in this life that is the
antidote to just waking up and immediately browsing social media.&lt;/p&gt;
&lt;p&gt;-- Sam Harris&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At some point, the good years in my life will end. COVID has shown that, for
many of us, this could happen much sooner than we would like. I might be asking
myself if I lived a good life when I&#39;m 77. But I may also have an unexpected
reason to be asking myself this when I&#39;m 50, next year, or possibly even next
week. It&#39;s the most urgent question to ask myself, what would I do with my time
if I knew I only had two or three good years left? Urgent because this is what
we spend most of our waking time doing, and ultimately because we just don&#39;t
have as much time as we think. Do I want to use that time earning &amp;gt;$400k a year
so that hundreds of thousands of people spend an extra few seconds
browsing ads?&lt;/p&gt;
&lt;div class=&quot;centred &quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/bioinformatics-zen/202204181910-fear-loathing-getting-old-in-tech/here-hopeful.png&quot; alt=&quot;The word &amp;quot;hopeful&amp;quot; inside HTML &amp;quot;here&amp;quot; tags.&quot; width=&quot;560px&quot; class=&quot;responsive-image&quot; /&gt;&lt;/div&gt;
&lt;here&gt;
	hopeful?
&lt;/here&gt;
&lt;p&gt;For the most part I&#39;ve enjoyed my career so far. I&#39;m grateful for that. After
having spent more than a year answering phones in a call centre in my early
twenties, getting paid as much as I do to write code feels unreal. A background
in biology means I get paid to work on interesting problems too. The
destructive effects of the tech industry on housing, democracy, small
businesses, and our attention are apparent to everyone. I wrote this essay
after having noticed in myself a high-water mark in cynicism around our
industry. But at least in some companies it feels like there&#39;s still a chance
to be part of a team creating a better future. And working at a startup focused
on identifying infectious diseases, there&#39;s a chance to help people who need a
better future right now. And that&#39;s a future in the tech industry I still want to
be a part of creating.&lt;/p&gt;
</content>
  </entry>
</feed>
