<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>Bioinformatics Zen</title>
  <link href='http://bioinformaticszen.com'>
  <id>http://bioinformaticszen.com</id>
  <updated>2013-05-13T09:15:00-04:00</updated>
  <author>
    <name>Michael Barton</name>
    <email>mail@michaelbarton.me.uk</email>
    <uri>http://www.michaelbarton.me.uk</uri>
  </author>
  <rights>Creative Commons Attribution 3.0 Unported</rights>
  <entry>
    <id>http://bioinformaticszen.com/post/trimming-454-data-from-the-5-prime-may-improve-fidelity/</id>
    <title>Trimming 454 reads at the 5' end may improve fidelity</title>
    <link href='http://bioinformaticszen.com/post/trimming-454-data-from-the-5-prime-may-improve-fidelity/'>
    <published>2013-05-13T09:15:00-04:00</published>
    <updated>2013-05-13T09:15:00-04:00</updated>
    <author>
      <name>Michael Barton</name>
    </author>
    <content type='html'>&lt;p&gt;A previous post described &lt;a href=&quot;/post/machine-learning-to-detect-bad-sequencing-reads/&quot;&gt;training machine learning classifiers&lt;/a&gt; to&amp;#x000A;filter inaccurate reads in 16S-based metagenomic studies. This analysis was&amp;#x000A;based on available mock community data - where an known set of 16S genes is&amp;#x000A;sequenced and the generated reads compared to the original sequences. The value&amp;#x000A;of using a mock community is that knowing the original sequences allows you to&amp;#x000A;identify the types of errors you are observing in the output data. The method I&amp;#x000A;previously used to find the accurate reads using the mock community data was as&amp;#x000A;follows:&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;ol&gt;&amp;#x000A;  &lt;li&gt;Remove barcodes and primers from reads and trim down to 200bp.&lt;/li&gt;&amp;#x000A;  &lt;li&gt;Select the unique representative reads and count how many times each&amp;#x000A;unique read is observed in the data.&lt;/li&gt;&amp;#x000A;  &lt;li&gt;Build an alignment from &lt;strong&gt;N&lt;/strong&gt;*2 of the most abundant unique reads. &lt;strong&gt;N&lt;/strong&gt;&amp;#x000A;is the number of sequences used in the mock community data.&lt;/li&gt;&amp;#x000A;  &lt;li&gt;Use &lt;a href=&quot;http://www.ncbi.nlm.nih.gov/pubmed/20236171&quot;&gt;single linkage clustering&lt;/a&gt; to group the aligned unique reads&amp;#x000A;to within 4bp differences of each other. This merges the unique read that&amp;#x000A;matches to with 4bp of another unique read into the more abundant read&amp;#x000A;group.&lt;/li&gt;&amp;#x000A;&lt;/ol&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The result of this method should be &lt;strong&gt;N&lt;/strong&gt; unique reads in the output data and&amp;#x000A;each unique read should then correspond to one of the mock community sequences.&amp;#x000A;This approach assumes that after removing up to 4 single base errors and&amp;#x000A;indels, the remaining most abundant unique read for each mock community&amp;#x000A;sequence should be the correct one. This assumption can be described another&amp;#x000A;way: any novel sequences generated from infidelity in the PCR or sequencing&amp;#x000A;steps should not produce more copies than that of the original correct&amp;#x000A;sequence.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I built this workflow with a combination of mothur and shell scripting. There&amp;#x000A;is a gist on github illustrating this &lt;a href=&quot;https://gist.github.com/michaelbarton/5490636&quot;&gt;method of identifying correct reads based&amp;#x000A;on abundance&lt;/a&gt;.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This approach bothered me because it does not actually compare the reads that I&amp;#x000A;specify as correct against the original mock community. Therefore before&amp;#x000A;pursing further machine learning approaches I aimed to first confirm the reads&amp;#x000A;identified as accurate exactly match the mock community sequences.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;read-fidelity-in-quince-et-al-2009-data&quot;&gt;Read fidelity in Quince et. al 2009 data&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I used a simple approach to identify accurate reads in the 454 data: trim each&amp;#x000A;mock community sequence to a 100bp substring and see how many reads contained&amp;#x000A;this substring. I started with the divergent mock community data from &lt;a href=&quot;http://www.ncbi.nlm.nih.gov/pubmed/19668203&quot;&gt;Quince&amp;#x000A;&lt;em&gt;et. al&lt;/em&gt; 2009&lt;/a&gt;. This is the same data I used for machine learning&amp;#x000A;approach to filtering inaccurate reads. I trimmed the community sequences to&amp;#x000A;100bp beginning at 5’ end and searched the generated read data for exact&amp;#x000A;matching substrings using &lt;code&gt;grep&lt;/code&gt;. The following code illustrates this approach&amp;#x000A;with the resulting number of matches.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;fastx_trimmer -l 100 -i SRX002564.community.tab -o trim100.fasta&amp;#x000A;&amp;#x000A;grep --invert-match '&amp;gt;' trim100.fasta \&amp;#x000A;  | sort | uniq \&amp;#x000A;  | parallel 'grep {} SRX002564.fasta | wc -l' \&amp;#x000A;  | sort -r -n \&amp;#x000A;  | tr &quot;\\n&quot; &quot; &quot;&amp;#x000A;&amp;#x000A;#=&amp;gt; 1609 1543 1024 14 5 5 4 4 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 &amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This shows a large variance in the number of reads matching a mock community&amp;#x000A;sequence substring. There are 3 with &amp;gt;1000 matching reads while the&amp;#x000A;remainder of match &amp;lt;20 reads. I double-checked this result by examining a&amp;#x000A;couple of the community sequence reads individually. This is illustrated below&amp;#x000A;where the community sequence is shown in the top row and generated reads after&amp;#x000A;removal of the primer are shown in the rows below.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;&amp;gt;8                      CTA-ACGATGA-TACTCGAT&amp;#x000A;&amp;gt;SRR013437.14   tccacacc...A.......A........&amp;#x000A;&amp;gt;SRR013437.627  tccacacc...A.......A........&amp;#x000A;&amp;gt;SRR013437.935  tccacgcc...G.......A........&amp;#x000A;&amp;gt;SRR013437.2322 tcctcgcc...A.......A........&amp;#x000A;&amp;gt;SRR013437.2423 cccacgcc...A.......A........&amp;#x000A;&amp;#x000A;&amp;gt;72                  GCCGTAA-CGATGAGAACTAGCC&amp;#x000A;&amp;gt;SRR013437.245  tccac.......A...............&amp;#x000A;&amp;gt;SRR013437.351  tccac.......A...............&amp;#x000A;&amp;gt;SRR013437.433  tccac.......A...............&amp;#x000A;&amp;gt;SRR013437.542  tccac.......A...............&amp;#x000A;&amp;gt;SRR013437.661  tccac.......A...............&amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This highlights that, for two of these mock community sequences, there appear&amp;#x000A;to be insertions in the 5’ ends of the most abundant reads. These differences&amp;#x000A;prevent exact substring matches to the mock community sequence. I considered if&amp;#x000A;this was the case for the for further downstream sequence by repeating this&amp;#x000A;process with a 100bp substring 50bp downstream from the 5’ end.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;fastx_trimmer -f 50 -l 150 \&amp;#x000A;  -i SRX002564.community.tab \&amp;#x000A;  -o trim50_150.fasta&amp;#x000A;&amp;#x000A;grep --invert-match '&amp;gt;' trim50_150.fasta \&amp;#x000A;  | sort | uniq \&amp;#x000A;  | parallel 'grep {} SRX002564.fasta | wc -l' \&amp;#x000A;  | sort -r -n \&amp;#x000A;  | tr &quot;\\n&quot; &quot; &quot;&amp;#x000A;&amp;#x000A;#=&amp;gt; 2273 2229 1943 1873 1802 1795 1774 1701 1692 1677 1670&amp;#x000A;#   1628 1596 1573 1545 1522 1446 1442 1399 1356 935 635 548 &amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This shows that using a substring 50bp further downstream finds a larger number&amp;#x000A;of exact matches in the reads. The following figure compares the difference in&amp;#x000A;the substring matches. The figure is on a logarithmic scale.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/SRX002564.fidelity.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/thumb.SRX002564.fidelity.png&quot; width=&quot;320&quot; alt=&quot;Number&amp;#x000A;of correct reads vs read length.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;read-fidelity-in-the-human-microbiome-pilot-study&quot;&gt;Read fidelity in the Human Microbiome Pilot Study&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This examined only a single 454 run and one which was generated in 2009. I&amp;#x000A;therefore considered further additional data determine if I observed this&amp;#x000A;result again. The human microbiome project (HMP) has &lt;a href=&quot;http://www.ncbi.nlm.nih.gov/bioproject/48341&quot;&gt;a pilot study&lt;/a&gt;&amp;#x000A;which compares a single mock community with the output from several different&amp;#x000A;sequencing centres. I repeated the above analysis with 9&lt;sup id=&quot;fnref:runs&quot;&gt;&lt;a href=&quot;#fn:runs&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; from 19 of&amp;#x000A;these sequencing experiments. I used a sample of 9 experiments to reduce&amp;#x000A;overall analysis time.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The major difference with analysing these data is that there is no specific set&amp;#x000A;of mock community sequences. The HMP instead used three pairs of primers to&amp;#x000A;create amplicons from 22 genomes&lt;sup id=&quot;fnref:genomes&quot;&gt;&lt;a href=&quot;#fn:genomes&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. I therefore created the 16S&amp;#x000A;mock community&lt;sup id=&quot;fnref:mock&quot;&gt;&lt;a href=&quot;#fn:mock&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; by simulating PCR &lt;em&gt;in silico&lt;/em&gt; using the primers and&amp;#x000A;corresponding genomes. I then tested the HMP generated reads against this&amp;#x000A;mock community using the same approach as before. The following figure compares&amp;#x000A;the distribution of matching reads for the different substrings of the&amp;#x000A;mock community.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/SRP002397.fidelity.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/thumb.SRP002397.fidelity.png&quot; width=&quot;320&quot; alt=&quot;Number&amp;#x000A;of correct reads vs read length.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This figure is harder to interpret because the two distributions overlap. There&amp;#x000A;is less of a clear difference as to whether 5’ trimming results in a greater&amp;#x000A;number of matches to the community substring. The following figure instead&amp;#x000A;compares the difference in total number of matching reads for 5’ trimming&amp;#x000A;versus no trimming. A larger value equates to a greater the number of&amp;#x000A;substring-matching reads for the 5’ trimmed sequence versus the untrimmed&amp;#x000A;sequence.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/SRP002397.difference.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/read-analysis/002-read-fidelity/thumb.SRP002397.difference.png&quot; width=&quot;320&quot; alt=&quot;Number&amp;#x000A;of correct reads vs read length.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This figure shows that for two experiments 5’ trimming resulted in less matches&amp;#x000A;to the community substring. In one of these cases this resulted in &amp;gt;3000&amp;#x000A;fewer matching reads. In seven of nine experiments however 5’ trimming of the&amp;#x000A;mock community sequence increased the number of matching reads. In two cases 5’&amp;#x000A;trimming produced an improvement of &amp;gt;5000 exact substring matches.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The use of 3’ trimming in marker-based metagenomics is common. I am not&amp;#x000A;familiar with any articles showing that 5’ trimming may improve read-fidelity&amp;#x000A;however. I would be interested if anyone can highlight similar approaches. I&amp;#x000A;think it may also be work considering if this 5’ trimming could be related to&amp;#x000A;chimerism.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Trimming the 5’ end should not however be used in every case without prior&amp;#x000A;consideration. Instead I believe a quality-control mock community should be&amp;#x000A;added to each sequencing run to assess whether this is necessary. This way both&amp;#x000A;5’ and 3’ trimming could be tuned to each specific sequencing run. A similar&amp;#x000A;point is argued in Schloss &lt;em&gt;et. al&lt;/em&gt; 2011:&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;blockquote&gt;&amp;#x000A;  &lt;p&gt;We have shown that even when sequencing centers follow the same procedures&amp;#x000A;there is variation between and more importantly, within centers. The&amp;#x000A;inclusion of a mock community sample on each sequencing run can be used to&amp;#x000A;calculate the rate of chimerism, sequencing error rate, and drift in the&amp;#x000A;representation of a community structure.&lt;/p&gt;&amp;#x000A;&amp;#x000A;  &lt;p&gt;— &lt;strong&gt;Patrick D. Schloss mail, Dirk Gevers, Sarah L. Westcott&lt;/strong&gt;. Reducing the&amp;#x000A;Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based&amp;#x000A;Studies. &lt;a href=&quot;http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310&quot;&gt;PLoS One&lt;/a&gt;.&lt;/p&gt;&amp;#x000A;&lt;/blockquote&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;If machine learning approaches are to be used to filter incorrect reads then&amp;#x000A;the training data must be as accurate as possible. This means that every effort&amp;#x000A;should be made to correctly label the accurate reads in the training data.&amp;#x000A;Therefore I believe a constructing a machine learning filter may require the&amp;#x000A;optimisation of two factors: how much of each read needs to be trimmed and how&amp;#x000A;accurately the remaining reads can be accurately filtered for inaccuracy. This&amp;#x000A;would balance read length for phylogenetic analysis against how may errors in&amp;#x000A;the generated reads can be tolerated.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;footnotes&quot;&gt;&amp;#x000A;  &lt;ol&gt;&amp;#x000A;    &lt;li id=&quot;fn:runs&quot;&gt;&amp;#x000A;      &lt;p&gt;I used the HMP experiments: SRX020136 SRX020134 SRX020131 SRX020130 SRX020128 SRX019987 SRX019986 SRX019985 SRX019984. Each of these experiments contained 9 individual runs each making for a total of 81 454 GS FLX runs.&lt;a href=&quot;#fnref:runs&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;&amp;#x000A;    &lt;/li&gt;&amp;#x000A;    &lt;li id=&quot;fn:genomes&quot;&gt;&amp;#x000A;      &lt;p&gt;The accessions for the genomes used in the HMP pilot study are: AE017194 DS264586 NC&lt;em&gt;000913 NC&lt;/em&gt;000915 NC&lt;em&gt;001263 NC&lt;/em&gt;002516 NC&lt;em&gt;003028 NC&lt;/em&gt;003112 NC&lt;em&gt;003210 NC&lt;/em&gt;004116 NC&lt;em&gt;004350 NC&lt;/em&gt;004461 NC&lt;em&gt;004668 NC&lt;/em&gt;006085 NC&lt;em&gt;007493 NC&lt;/em&gt;007494 NC&lt;em&gt;007793 NC&lt;/em&gt;008530 NC&lt;em&gt;009085 NC&lt;/em&gt;009515 NC&lt;em&gt;009614 NC&lt;/em&gt;009617.&lt;a href=&quot;#fnref:genomes&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;&amp;#x000A;    &lt;/li&gt;&amp;#x000A;    &lt;li id=&quot;fn:mock&quot;&gt;&amp;#x000A;      &lt;p&gt;The HMP 16S mock community was created by simulating PCR amplification on the genome sequences using &lt;a href=&quot;http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/primersearch.html&quot;&gt;primer search&lt;/a&gt; and allowing up to a 20% mismatch against the primer sequence. Amplicons greater than 2000bp were discarded. The primer sequences were trimmed from the simulated amplicons except those with more than 2bp differences which were instead discarded.&lt;a href=&quot;#fnref:mock&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;&amp;#x000A;    &lt;/li&gt;&amp;#x000A;  &lt;/ol&gt;&amp;#x000A;&lt;/div&gt;</content>
  </entry>
  <entry>
    <id>http://bioinformaticszen.com/post/machine-learning-to-detect-bad-sequencing-reads/</id>
    <title>Filtering inaccurate NGS reads using machine learning</title>
    <link href='http://bioinformaticszen.com/post/machine-learning-to-detect-bad-sequencing-reads/'>
    <published>2013-04-17T13:00:00-04:00</published>
    <updated>2013-04-17T13:00:00-04:00</updated>
    <author>
      <name>Michael Barton</name>
    </author>
    <content type='html'>&lt;p&gt;One of the earliest things I learned about next generation sequencing is to&amp;#x000A;expect errors in the generated data. These errors include incorrect base&amp;#x000A;substitution or incorrect estimation of regions of identical bases&amp;#x000A;(homopolymers). When doing &lt;em&gt;de novo&lt;/em&gt; genome assembly or genome resequencing&amp;#x000A;these may not pose a problem as a majority consensus at each position can&amp;#x000A;overrule these errors.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;There are cases where inaccurate reads error do however cause a problem such as&amp;#x000A;tag or marker sequencing. This is the study of a single gene from many&amp;#x000A;different organisms with the goal of estimating polymorphisms or phylogeny. An&amp;#x000A;example may be sequencing the same gene (e.g. 16S rRNA) from numerous microbes&amp;#x000A;in the same environment. Another example is sequencing variations in viral&amp;#x000A;genes to identify the the source of an epidemic. In both of these cases the&amp;#x000A;single gene sequenced multiple times is the ‘marker’ of interest.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The advantage of this marker approach is that it generates a very large number&amp;#x000A;of reads to allow surveying a population deeply. The problem is that sequencing&amp;#x000A;errors can give false positives for diversity and variation. An extra base or a&amp;#x000A;substitution generates a novel sequence and therefore a new genotypic&amp;#x000A;signature. Furthermore when amplifying large quantities DNA even PCR errors&amp;#x000A;can become relevant. This can be ‘chimeras’, where sequences mis-prime to each&amp;#x000A;other generating a read composed of two origins. An additional rarer problem is&amp;#x000A;PCR infidelity where base changes are introduced during the amplification&amp;#x000A;process.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;As the use of marker-based approaches increases, many articles are highlighting&amp;#x000A;the need to identify and remove these poor quality reads. Not doing so can&amp;#x000A;produce results biased by a large number of false positives leading to&amp;#x000A;artefactually inflated diversity. If you are interested in the literature on&amp;#x000A;this topic &lt;a href=&quot;http://www.citeulike.org/user/michaelbarton/tag/16s&quot;&gt;my citeulike page has some articles bookmarked&lt;/a&gt;. &lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I’m going to highlight one particular article: &lt;a href=&quot;http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310&quot;&gt;Reducing the Effects of PCR&amp;#x000A;Amplification and Sequencing Artefacts on 16S rRNA-Based Studies&lt;/a&gt; by&amp;#x000A;Patrick D. Schloss, Dirk Gevers and Sarah L. Westcott. Part of this article&amp;#x000A;focuses on identifying inaccurate reads based on features of the reads. These&amp;#x000A;features include the quality of the read, length of the longest homopolymer and&amp;#x000A;so forth. I highly recommend this paper and it was one of the favourite&amp;#x000A;articles I read last year. This paper caught my attention because the authors&amp;#x000A;considered a wide range of features to identify inaccurate reads. This lead me&amp;#x000A;to think that these features could be applied as a machine learning&amp;#x000A;classification problem: given a set of labels (1 for an inaccurate ‘bad’ read,&amp;#x000A;0 for an accurate ‘good’ read) and the set of features (homopolymer length,&amp;#x000A;read quality, etc) can I predict whether the read is bad and should be&amp;#x000A;discarded?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h1 id=&quot;the-data&quot;&gt;The Data&lt;/h1&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;For this analysis I’ll be using the mock community from the &lt;a href=&quot;http://www.ncbi.nlm.nih.gov/pubmed/19668203&quot;&gt;PyroNoise&lt;/a&gt;&amp;#x000A;paper. This is an artificial community of 23 sequences that was 454 sequenced.&amp;#x000A;I’m using this data because it is relatively simple to map the generated reads&amp;#x000A;back to the original sequence. This thereby makes it straight forward to&amp;#x000A;identify if a generated read is accurate. The short read archive identifier for&amp;#x000A;this data is SRR013437.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I preprocessed this data by throwing away reads any shorter than 225bp. This&amp;#x000A;value is 25bp for the primer and barcode and approximately 200bp of sequence&amp;#x000A;that can be used for phylogenetic analysis after the barcode and primer are&amp;#x000A;removed. The 200bp figure is approximate because there may be indels that are&amp;#x000A;the natural differences between the sequences rather than sequencing error. The&amp;#x000A;choice of 200bp is arbitrary and different cut-offs may yield different&amp;#x000A;results. I followed this step by removing any reads containing ambiguous bases&amp;#x000A;(‘Ns’) in the first 225bp. This preprocessing left 54370 reads to examine.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I generated two groups of data from the processed reads. The first was whether&amp;#x000A;each read is inaccurate or not. This was resolved by grouping the reads into&amp;#x000A;clusters based on identity. The top N clusters, where N is the number of&amp;#x000A;sequences in the original mock community, are those reads which are correct,&amp;#x000A;every other read was considered inaccurate.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The second data were the descriptive features from which to predict read&amp;#x000A;accuracy. These data were divided into two groups the first was the longest&amp;#x000A;homopolymer of each nucleotide type and the second was various quality metrics&amp;#x000A;for each sequence. I also included original read length too. The distribution&amp;#x000A;of these features for accurate and inaccurate reads looks as follows.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/SRR013437.dependencies.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/thumb.SRR013437.dependencies.png&quot; width=&quot;320&quot; alt=&quot;Distribution of&amp;#x000A;features according to read accuracy.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This data does not include all the features in Schloss &lt;em&gt;et. al&lt;/em&gt;, for instance&amp;#x000A;barcode and primer errors are not present. I selected only a subset of features&amp;#x000A;as a starting point for a machine learning algorithm. Further features can be&amp;#x000A;added in future. Given this feature set and accompanying accuracy labels, the&amp;#x000A;next step was to create classifiers.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h1 id=&quot;classifiers&quot;&gt;Classifiers&lt;/h1&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;heuristic-classifier&quot;&gt;Heuristic Classifier&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;In the Schloss &lt;em&gt;et. al&lt;/em&gt; article the authors compared several different&amp;#x000A;heuristics for filtering and removing reads to improve quality. Two of these&amp;#x000A;examined if the average quality score of a 50bp sliding window dropped below 35&amp;#x000A;and if there were homopolymers of any type greater than 8bp. In R this&amp;#x000A;rule-based classifier might look as follows:&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;with(data,{&amp;#x000A;  prediction &amp;lt;- rep(0,length(bad.read))&amp;#x000A;  prediction[homop_ATGC &amp;gt; 8]     &amp;lt;- 1&amp;#x000A;  prediction[qual_min_50bp &amp;lt; 35] &amp;lt;- 1&amp;#x000A;&amp;#x000A;  print.score(&quot;heuristic&quot;,prediction,bad.read)&amp;#x000A;})&amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The result of the &lt;code&gt;print.score&lt;/code&gt; function are the following metrics:&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;table&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Classifier&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Positive)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Negative)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Matthews Correlation&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Remaining Inaccuracy&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Heuristic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.660&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.663&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.321&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.337&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;&lt;/table&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The second and third columns of this table show the probability of when a read&amp;#x000A;is identified as inaccurate (positive) or accurate (negative) given whether the&amp;#x000A;read is actually inaccurate or accurate. These are the positive predictive and&amp;#x000A;negative predictive values. The table shows that approximately 2/3 of reads are&amp;#x000A;correctly identified in both cases. The Matthews Correlation Coefficient&amp;#x000A;(&lt;a href=&quot;http://en.wikipedia.org/wiki/Matthews_correlation_coefficient&quot;&gt;Wikipedia&lt;/a&gt;) shows the overall performance of the classifier with a value&amp;#x000A;of 1.0 being the correct classification in every case, 0.0 is equivalent to&amp;#x000A;random classification, and -1.0 is the wrong classification in every case. The&amp;#x000A;remaining inaccuracy shows how many of the reads are inaccurate after filtering&amp;#x000A;with the given classifier.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;regularised-logistic-classifier&quot;&gt;Regularised Logistic Classifier&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The first classifier I trained was a generalised logistic regression&amp;#x000A;classifier. This uses a link function to map a linear regression to a value&amp;#x000A;between 0 and 1. This regression can therefore be trained to classify whether&amp;#x000A;reads are inaccurate or not. As a further step I used the glmnet package to fit&amp;#x000A;an elastic net generalised logistic regression. This combines both L1 and L2&amp;#x000A;normalisation to penalise model coefficients with the aim of preventing model&amp;#x000A;overfitting.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;require('glmnet')&amp;#x000A;&amp;#x000A;x &amp;lt;- as.matrix(data[,!(names(data) %in% c('read','bad.read'))])&amp;#x000A;y &amp;lt;- data$bad.read&amp;#x000A;&amp;#x000A;model   &amp;lt;- glmnet(x,y,family=c(&quot;binomial&quot;))&amp;#x000A;lambda  &amp;lt;- cv.glmnet(x,y,family=c(&quot;binomial&quot;))$lambda.min&amp;#x000A;predict &amp;lt;- predict.logit(predict(model, newx=x, s=lambda))&amp;#x000A;&amp;#x000A;print.score(&quot;logistic&quot;,predict,y)&amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;table&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Classifier&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Positive)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Negative)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Matthews Correlation&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Remaining Inaccuracy&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Logistic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.754&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.656&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.374&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.354&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Heuristic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.660&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.663&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.321&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.337&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;&lt;/table&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The logistic classifier performs better in correctly identifying inaccurate&amp;#x000A;reads, almost 10pp (percentage points) better. At the same time there is a&amp;#x000A;slight decrease in the number of reads incorrectly labelled as accurate: 0.7pp.&amp;#x000A;This results in a greater remaining error rate in the filtered data: 1.7pp.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;One useful consequence of a regression classifier is that the coefficients are&amp;#x000A;relatively straight forward to interpret. The following figure shows the&amp;#x000A;regression coefficients of this classifier for decreasing values of the&amp;#x000A;regularisation coefficient lambda. The graphs are separated into two groups:&amp;#x000A;the homopolymer coefficients and the quality coefficients. The coefficient for&amp;#x000A;read length is not shown.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/SRR013437.regress.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/thumb.SRR013437.regress.png&quot; width=&quot;320&quot; alt=&quot;Value of logistic&amp;#x000A;regression classifier coefficients.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Not too much should be given to the interpretation of this figure as it is the&amp;#x000A;result of training on only a single data set. This is apparent in the relatively&amp;#x000A;low value of lambda. The figure does show that the minimum from a 50bp sliding&amp;#x000A;window is the best predictor based on quality score. This is consistent with&amp;#x000A;the results seen in Schloss &lt;em&gt;et. al&lt;/em&gt;, the 10th percentile for minimum quality&amp;#x000A;score over the length of the read also appears play a role in classification.&amp;#x000A;Regarding homopolymer length the most important overall features associated&amp;#x000A;with error appear to be homopolymers of G and C bases.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;random-forest-classifier&quot;&gt;Random Forest Classifier&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Random forest classifiers are one best performing algorithms in the field of&amp;#x000A;machine learning. A random forest contains many individual decision trees each&amp;#x000A;trained with a subset of the data and features. The predicted output is the&amp;#x000A;modal average of all the forest classifiers.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;require('randomForest')&amp;#x000A;&amp;#x000A;rf &amp;lt;- randomForest(y=factor(data$bad.read),&amp;#x000A;                   x=data[,!(names(data) %in% c('bad.read','read'))],&amp;#x000A;                   ntree=500)&amp;#x000A;predict &amp;lt;- as.numeric(predict(rf)) - 1&amp;#x000A;&amp;#x000A;print.score(&quot;random_forest&quot;,predict,data$bad.read)&amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;table&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Classifier&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Positive)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Negative)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Matthews Correlation&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Remaining Inaccuracy&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Random forest&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.833&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.677&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.472&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.323&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Logistic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.754&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.656&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.374&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.354&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Heuristic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.660&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.663&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.321&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.337&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;&lt;/table&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The random forest performs well in correctly identifying inaccurate reads with&amp;#x000A;a probability of 0.833 of a read identified as inaccurate actually containing&amp;#x000A;errors. The probability of correctly identifying accurate reads remains&amp;#x000A;relatively unchanged. The improvement in the Matthews correlation coefficient&amp;#x000A;is therefore mainly due to the improvement in accuracy of which bad reads are&amp;#x000A;identified - fewer correct reads are thrown away.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;ensemble-classifier&quot;&gt;Ensemble Classifier&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Combinations of machine learning algorithms often perform better than the sum&amp;#x000A;of the parts. The ‘ensemble’ approaches may show improved performance when each&amp;#x000A;of the classifiers models the problem domain with a different structure. For&amp;#x000A;instance a regression classifier models the problem parametrically and linearly&amp;#x000A;while a random forest is non-parametric and can find complex interactions.&amp;#x000A;Combining two different algorithms of this type can, in some cases, combine the&amp;#x000A;best of both worlds. I therefore created an ensemble classifier based on the&amp;#x000A;random forest and logistic regression classifiers above. The results of this&amp;#x000A;are as follows:&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;pre class=&quot;prettyprint&quot;&gt;&amp;#x000A;require('randomForest')&amp;#x000A;require('glmnet')&amp;#x000A;&amp;#x000A;set.seed(1)&amp;#x000A;&amp;#x000A;x &amp;lt;- as.matrix(data[,!(names(data) %in% c('read','bad.read'))])&amp;#x000A;y &amp;lt;- data$bad.read&amp;#x000A;&amp;#x000A;rf &amp;lt;- randomForest(y=factor(y),&amp;#x000A;                   x=x,&amp;#x000A;                   ntree=500)&amp;#x000A;rf.pred &amp;lt;- predict(rf,type='prob')[,2]&amp;#x000A;&amp;#x000A;&amp;#x000A;glm.model &amp;lt;- glmnet(x,y,family=c(&quot;binomial&quot;))&amp;#x000A;lambda &amp;lt;- cv.glmnet(x,y,family=c(&quot;binomial&quot;))$lambda.min&amp;#x000A;glm.pred &amp;lt;- invlogit(predict(glm.model, newx=x, s=lambda))&amp;#x000A;&amp;#x000A;weighted.predictions &amp;lt;- function(w,prob.a,prob.b){&amp;#x000A;  ifelse(((1-w)*prob.a) + (w*prob.b) &amp;gt; 0.5, 1, 0)&amp;#x000A;}&amp;#x000A;&amp;#x000A;weight &amp;lt;- seq(0,1,0.01)&amp;#x000A;scores &amp;lt;- sapply(weight,function(w){&amp;#x000A;  score(weighted.predictions(w,rf.pred,glm.pred),y)&amp;#x000A;})&amp;#x000A;&amp;#x000A;opt.weight &amp;lt;- weight[which(scores == max(scores))]&amp;#x000A;prediction &amp;lt;- weighted.predictions(opt.weight,rf.pred,glm.pred)&amp;#x000A;&amp;#x000A;print.score(&quot;ensemble&quot;,prediction,data$bad.read)&amp;#x000A;&lt;/pre&gt;&amp;#x000A;&amp;#x000A;&lt;table&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Classifier&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Positive)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Pr(Correct Negative)&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Matthews Correlation&lt;/td&gt;&amp;#x000A;    &lt;td&gt;Remaining Inaccuracy&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Random forest&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.833&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.677&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.472&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.323&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Ensemble&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.834&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.676&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.472&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.324&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Logistic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.754&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.656&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.374&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.354&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;  &lt;tr&gt;&amp;#x000A;    &lt;td&gt;Heuristic&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.660&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.663&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.321&lt;/td&gt;&amp;#x000A;    &lt;td&gt;0.337&lt;/td&gt;&amp;#x000A;  &lt;/tr&gt;&amp;#x000A;&lt;/table&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The results of this are approximately the same as that of the random forest&amp;#x000A;classifier. I can plot the performance of different weights between the random&amp;#x000A;forest and the logistic regression for comparison. This figure shows clearly&amp;#x000A;that the logistic classifier adds very little and that weighting mostly towards&amp;#x000A;the random forest provides the optimum performance. I think this is likely the&amp;#x000A;result of the random forest overfitting the data and cross validation across&amp;#x000A;data sets may give more weight to the broader view provided by the logistic&amp;#x000A;regression.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/SRR013437.ensemble.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://s3.amazonaws.com/bioinformatics-zen/machine-learning-ngs/thumb.SRR013437.ensemble.png&quot; width=&quot;320&quot; alt=&quot;Performance on&amp;#x000A;ensemble learning model.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;These current classifiers have not been tested for predictive power only the&amp;#x000A;degree to which can explain the difference in between good and bad reads in a&amp;#x000A;single data set. The predictive power should be tested by cross validation&amp;#x000A;against multiple data from different experiments and sequencing centres. I&amp;#x000A;would expect the Matthews correlation coefficients for the ML classifiers would&amp;#x000A;be less than the values shown above because these classifiers have most likely&amp;#x000A;overfit to this single data set.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;As I wrote earlier, only a subset of features have been examined. For instance&amp;#x000A;errors in the barcode and primer sequence have not been included. The Schloss&amp;#x000A;&lt;em&gt;et. al&lt;/em&gt; paper showed that these are also useful for identifying inaccurate&amp;#x000A;reads. Therefore I think additional work should include features such as these.&amp;#x000A;One machine learning approach throws as many features as possible at the&amp;#x000A;problem and then uses normalisation to identify those with the most predictive&amp;#x000A;power.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Given all of these caveats, there does appear to be a hard ceiling on the&amp;#x000A;probability of identifying correct reads. Regardless of which classifier is&amp;#x000A;used the probability of correctly identifying accurate reads is around 0.66. I&amp;#x000A;hypothesise that this might be a problem of identifying chimeras - those&amp;#x000A;sequences that form during the PCR step prior to sequencing. As chimeras are&amp;#x000A;formed prior to sequencing, it could be considered extremely hard to identify&amp;#x000A;them based on features associated with sequencing. Tools such as Uchime and&amp;#x000A;Perseus should be combined in a pipeline to identify chimeras.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The main results of this small analysis appears to be to be the possibility of&amp;#x000A;an increase in accuracy when identifying inaccurate reads for discarding. This,&amp;#x000A;very initial, work suggests that using machine learning may be useful for&amp;#x000A;increasing precision and therefore translate into fewer accurate reads being&amp;#x000A;thrown away.&lt;/p&gt;</content>
  </entry>
  <entry>
    <id>http://bioinformaticszen.com/post/whats-the-point-of-publishing/</id>
    <title>What's the point of publishing?</title>
    <link href='http://bioinformaticszen.com/post/whats-the-point-of-publishing/'>
    <published>2013-02-27T08:00:00-05:00</published>
    <updated>2013-02-27T08:00:00-05:00</updated>
    <author>
      <name>Michael Barton</name>
    </author>
    <content type='html'>&lt;p&gt;What is the point of publishing? I think the answer most would give to this&amp;#x000A;question would be ‘because I have to.’ You have to publish because a CV or&amp;#x000A;resume full of publications allows to keep your job, apply for a new job,&amp;#x000A;become a professor, and get tenure. I feel this myself and this need has become&amp;#x000A;conflated with the desire to publish my research because I want to share it&amp;#x000A;with the world. Pragmatically though I must regularly publish if one day I hope&amp;#x000A;to see myself as a tenured professor.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Another answer a more career-secure researcher would give for publishing is ‘to&amp;#x000A;share research among scientists.’ This is the reason journals were created so&amp;#x000A;that large paper books could be distributed to universities. I am old enough to&amp;#x000A;remember going to the library to physically photocopy the journal articles I&amp;#x000A;wanted. A more specific question is then, given we have the Internet and many&amp;#x000A;journals have become online only, ‘What is the point of publishing now?’&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;If a group of scientists were to come together to create a solution for sharing&amp;#x000A;research would the current situation be close? That we spend weeks formatting&amp;#x000A;research into Microsoft Word, or LaTeX if you’re lucky, for third-party&amp;#x000A;publishers that will, after months or even years, eventually decide to&amp;#x000A;disseminate to everyone. If our research is declined we then have to repeat&amp;#x000A;this process for a different third-party publisher. Once this research is&amp;#x000A;finally accepted many publishers will then charge scientists to finally get&amp;#x000A;access to it.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;This pay-for-papers situation means that many poorer institutions cannot access&amp;#x000A;current research because they cannot afford the monthly subscription for the&amp;#x000A;articles. You might think that perhaps these poorer institutions are mostly in&amp;#x000A;the third-world. However if you have ever had to ask a colleague for a PDF then&amp;#x000A;you are working at one of these poorer institutions. I think the list of&amp;#x000A;non-poor institutions that have access to every journal subscription a&amp;#x000A;researcher could need is very small or perhaps even 0.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;You could argue then that open access publishers are fixing this situation.&amp;#x000A;Open access journals charge researchers for publishing an article but then make&amp;#x000A;the article free to access for everyone. There are still, however, limitations&amp;#x000A;to open-access publication. For instance can you perform large scale analysis on&amp;#x000A;the huge amounts of scientific data these open-access journals contain? Can you&amp;#x000A;access structured citation data for these open-access journals?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;A good question might be “who needs access to this large quantity of data?”&amp;#x000A;Which scientist needs the citation information for all published articles? The&amp;#x000A;answer is almost no one. The reason is the same for why most people don’t need&amp;#x000A;to download the entire web to their computer. You only need to download the web&amp;#x000A;pages you’re interested in. One group that does however need to download the&amp;#x000A;entire web is Google. Their algorithm searches every web page to see which&amp;#x000A;pages are linking to each other. This algorithm then identifies the top search&amp;#x000A;results because they have many incoming links from other pages. Unfortunately&amp;#x000A;there is no equivalent of Google for scientific publications. So while I can&amp;#x000A;search the web an optimised search algorithm I am stuck in pre-Google days for&amp;#x000A;searching the world’s research. All because the research is siloed across many&amp;#x000A;individual journals.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;A solution to the slow publication time of journals is to use pre-print&amp;#x000A;servers. This means uploading your research to a different third-party that&amp;#x000A;will make your research available to the community until it is more formally&amp;#x000A;published in a journal. This situation means we have acknowledged that the&amp;#x000A;journals so slow at publishing that we must an additional third-party to take&amp;#x000A;up the slack. If we are becoming more happy using pre-print servers to share&amp;#x000A;research doesn’t this mean that publishers are a becoming a vestige from a&amp;#x000A;pre-Internet period?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;You could argue that publishers provide an important a role in screening&amp;#x000A;research through peer-review and journal exclusivity. However both of these&amp;#x000A;functions are performed by academics rather than journals. So if academics are&amp;#x000A;performing the necessary peer-review why do still need the journals? Is the&amp;#x000A;lack of will to organise ourselves a good enough reason to put up with all the&amp;#x000A;inefficiency, effort and costs that publishers bring?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Do we expect that the current status quo for sharing scientific research will&amp;#x000A;still be the same 20, 50 or 100 years from now? If the answer to this is no,&amp;#x000A;then why do we settle for it now?&lt;/p&gt;</content>
  </entry>
  <entry>
    <id>http://bioinformaticszen.com/post/whats-the-point-of-writing-good-scientific-software/</id>
    <title>What's the point of writing good scientific software?</title>
    <link href='http://bioinformaticszen.com/post/whats-the-point-of-writing-good-scientific-software/'>
    <published>2013-02-08T12:15:00-05:00</published>
    <updated>2013-02-08T12:15:00-05:00</updated>
    <author>
      <name>Michael Barton</name>
    </author>
    <content type='html'>&lt;p&gt;Software written by academics has a reputation of being poorer quality than&amp;#x000A;that of software written by professional software developers. This poorer&amp;#x000A;quality can be quantified by a lack of documentation, a non-intuitive&amp;#x000A;interface, or bug-ridden with a tendency to crash.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I assume, since you’re reading a bioinformatics blog, that you’ve experienced&amp;#x000A;trying to use code that you can’t download, won’t compile, won’t run on your&amp;#x000A;system, isn’t documented, won’t work on your data, or just crashes without&amp;#x000A;explanation. Speaking from personal experience this can be frustrating in the&amp;#x000A;extreme. Consequently there are efforts to increase the quality of academic&amp;#x000A;software including &lt;a href=&quot;http://www.openresearchcomputation.com/&quot;&gt;journals focused on publishing software with minimum&amp;#x000A;standards&lt;/a&gt; and &lt;a href=&quot;http://software-carpentry.org/&quot;&gt;workshops to help academics write better&amp;#x000A;software&lt;/a&gt;.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;What then does good quality software look like? I think this includes both unit&amp;#x000A;and integration tests to ensure the software works as expected. Revision&amp;#x000A;control to ensure that bugs can be removed by reverting to the last working&amp;#x000A;version. Continuous integration to make sure that software always passes tests&amp;#x000A;when new changes are committed. Finally documentation of available options with&amp;#x000A;examples helps users understand how to use the software. A question I have been&amp;#x000A;recently asking myself however is what’s the point of spending the extra time&amp;#x000A;to create better software?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Pubmed shows the number of articles published by two journals, ‘BMC&amp;#x000A;Bioinformatics’ and ‘Bioinformatics,’ containing the word ‘software’ in the&amp;#x000A;title or abstract is ~350 each year. I think this may an underestimate though&amp;#x000A;since other journals also publish software articles and the terms ‘pipeline’,&amp;#x000A;‘package’ and ‘system’ may also be used instead of ‘software.’&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Considering around 1700 bioinformatics software articles published in the last&amp;#x000A;five years, this represents a significant corpus. This leads me to the question&amp;#x000A;what is the point of writing good quality software if the chances are small&amp;#x000A;that anyone will read about, or download it? For every highly cited tool, such&amp;#x000A;as SAM Tools or PAML, there are likely many other tools which are never&amp;#x000A;downloaded or used. Does it make sense to invest the extra time in the&amp;#x000A;documentation and interface if there is a good chance that no one else will&amp;#x000A;ever use it?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Let me illustrate with a concrete example. I’m a post-doc working in microbial&amp;#x000A;genomics. I found that I had to write various scripts to help with the&amp;#x000A;finishing of my genome and the subsequent upload to GenBank. &lt;a href=&quot;/post/reuse,-contribute,-create/&quot;&gt;I have previously&amp;#x000A;believed&lt;/a&gt; that converting any code you’ve created into a open-source&amp;#x000A;library benefits the community and prevents reinvention of the wheel. Therefore&amp;#x000A;I converted my scripts first into a tool called &lt;a href=&quot;http://next.gs/&quot;&gt;scaffolder&lt;/a&gt; and then a&amp;#x000A;subsequent tool called &lt;a href=&quot;https://github.com/michaelbarton/genomer&quot;&gt;genomer&lt;/a&gt;. Both of these are tested with integration&amp;#x000A;and unit tests, and I create documentation and &lt;a href=&quot;http://www.youtube.com/watch?v=HfsdJOELFjs&quot;&gt;screencasts&lt;/a&gt;.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I have however started to realise that perhaps something I thought would be&amp;#x000A;very useful may be of little interest to anyone else. Furthermore the effort I&amp;#x000A;have put into testing and documentation may not have been the best use of my&amp;#x000A;time if no one but I will use it. As my time as a post doc is limited, the&amp;#x000A;extra time effort spent on improving these tools could have instead have been&amp;#x000A;spent elsewhere. Would it have been better if I have just completed the minimum&amp;#x000A;to get them published then moved on to another project?&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I have begun to think now that the most important thing when writing software&amp;#x000A;is to write the usable minimum. If then the tool becomes popular and other&amp;#x000A;people begin to use it, then I should I work on the documentation and&amp;#x000A;interface. This will then ensure then I make the best use of my time without&amp;#x000A;sinking effort onto something that could have little interest. Given the choice&amp;#x000A;I like writing robust, well documented software but this takes extra time and&amp;#x000A;in a competitive academic job market I feel like this is a luxury.&lt;/p&gt;</content>
  </entry>
  <entry>
    <id>http://bioinformaticszen.com/post/genotype-from-phenotype/</id>
    <title>Poster: Predicting genotype from phenotype</title>
    <link href='http://bioinformaticszen.com/post/genotype-from-phenotype/'>
    <published>2012-07-15T20:00:00-04:00</published>
    <updated>2012-07-15T20:00:00-04:00</updated>
    <author>
      <name>Michael Barton</name>
    </author>
    <content type='html'>&lt;p&gt;I presented the below poster at the 2012 American Society of Microbiology&amp;#x000A;Meeting. This research demonstrates how we have been comparing and testing&amp;#x000A;&lt;em&gt;Pseudomonas&lt;/em&gt; strains for twitching motility to identify the genes responsible&amp;#x000A;for phenotypic differences. Click on the image below to see a larger version.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://uk-me-michaelbarton.s3.amazonaws.com/posters/2012-asm/michael_barton_poster.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://uk-me-michaelbarton.s3.amazonaws.com/posters/2012-asm/thumb.png&quot; width=&quot;320&quot; alt=&quot;My ASM 2012 Poster.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;As with &lt;a href=&quot;/post/preseting-software-on-a-poster/&quot;&gt;my previous ASM poster&lt;/a&gt;, I think less text and a more visual&amp;#x000A;overview is more likely to attract interest and start a discussion. ASM&amp;#x000A;meetings usually have posters numbering in the thousands so I think a&amp;#x000A;text-heavy poster can be a lottery for getting people to talk to you. A lighter&amp;#x000A;poster, I hope, gives the impression that you’re not going to trap someone who&amp;#x000A;expresses an interest with ten minutes of intense conversation.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;inspiration&quot;&gt;Inspiration&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I usually get started by on a new poster by searching the posters section on&amp;#x000A;&lt;a href=&quot;http://www.deviantart.com/&quot;&gt;Deviant Art&lt;/a&gt; around a theme I am interested in. In&amp;#x000A;this case I felt like going for &lt;a href=&quot;http://browse.deviantart.com/?qh=&amp;amp;section=&amp;amp;q=minimalism+grunge&quot;&gt;“Grunge Minimalism.”&lt;/a&gt; There’s plenty&amp;#x000A;of inspiration to be found on deviant art when searching for different themes&amp;#x000A;and I was particularly inspired by the &lt;a href=&quot;http://daveforyou.deviantart.com/gallery/#/d3j2iy5&quot;&gt;Alice in Wonderland poster by&amp;#x000A;daveforyou&lt;/a&gt;. You can see similarities in the lines and colours between&amp;#x000A;this poster and how I ended up designing mine.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;    &lt;a href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/alice_poster_by_daveforyou-d3j2iy5.jpeg&quot;&gt;&amp;#x000A;      &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/alice_poster_by_daveforyou-d3j2iy5.jpeg&quot; width=&quot;320&quot; alt=&quot;Minimalist grundge style poster for alice in Wonderland&quot; /&gt;&amp;#x000A;    &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;planning&quot;&gt;Planning&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I spend a large portion of my time planning a poster before opening Inkscape. I&amp;#x000A;think time spent planning tends to pay off many-fold compared with trying to&amp;#x000A;plan and create at the same time. My planning is rather straight forward: write&amp;#x000A;down the key points I want to include and how I will lay them out on the&amp;#x000A;poster. The images below show the papers I generated while planning the poster.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/planning_1.jpg&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/planning_1_thumb.jpg&quot; width=&quot;320&quot; alt=&quot;Paper drawings from planning my poster.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/planning_2.jpg&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/planning_2_thumb.jpg&quot; width=&quot;320&quot; alt=&quot;Paper drawings from planning my poster.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;images&quot;&gt;Images&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;I hand drew all the images using a black pen on squared paper. I draw them on&amp;#x000A;squared paper to ensure that all image have the same proportions. Squared paper&amp;#x000A;is also pretty handy for making accurate representations where it matters, for&amp;#x000A;instance the genes in the upper right of poster are to-scale representations of&amp;#x000A;the nucleotide length. I usually go through several iterations of each drawing&amp;#x000A;until I get something I’m really happy with. Below are the final drawings I&amp;#x000A;used for the poster.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/drawings.jpg&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/drawings_thumb.jpg&quot; width=&quot;320&quot; alt=&quot;Papers of drawings from poster&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;After I’m happy with the images I scan them in using a standard office scanner&amp;#x000A;to get a pdf. If you’re interested you can &lt;a href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/originals.pdf&quot;&gt;download the pdf of these scanned&amp;#x000A;images.&lt;/a&gt; After scanning I import the pdf into Inkscape and &lt;a href=&quot;http://inkscape.org/doc/tracing/tutorial-tracing.html&quot;&gt;convert&amp;#x000A;each image to a path.&lt;/a&gt; Once in this state it is easy to simplify the&amp;#x000A;object path and remove roughness or scanner artefacts. Below is the poster with&amp;#x000A;the converted images.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/layout.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/layout_thumb.png&quot; width=&quot;320&quot; alt=&quot;Basic poster layout.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;colours&quot;&gt;Colours&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;The colours I used (blue and orange) are on opposing sides of the colour wheel&amp;#x000A;and therefore complementary. Using &lt;a href=&quot;http://www.colormatters.com/color-and-design/basic-color-theory&quot;&gt;a little colour theory&lt;/a&gt; can usually&amp;#x000A;make a poster more visually appealing. I also experimented with how to apply&amp;#x000A;these colours to the poster. I created seven different examples and asked my lab&amp;#x000A;mates their opinions. The solid use of colours (upper left) was the most popular&amp;#x000A;so this is the one I went for.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/colors.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/colors_thumb.png&quot; width=&quot;320&quot; alt=&quot;Comparison of different color schemes.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;font&quot;&gt;Font&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;Choosing a font is perhaps an underestimated but important point in designing a&amp;#x000A;poster. A font should match the context around the text: use a serif thick font&amp;#x000A;for loud and heavy impact or a lighter san-serif font for a more minimalist&amp;#x000A;message. I used the &lt;a href=&quot;http://www.yanone.de/typedesign/kaffeesatz/&quot;&gt;Yanone Kaffeesatz&lt;/a&gt; after seeing it on &lt;a href=&quot;http://zachholman.com/posts/slide-design-for-developers/&quot;&gt;Zack Holman’s&amp;#x000A;slide design for developers&lt;/a&gt; blog post. I recommend reading his post too&amp;#x000A;as it has many good points.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;    &lt;a href=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/font.png&quot;&gt;&amp;#x000A;      &lt;img src=&quot;http://bioinformatics-zen.s3.amazonaws.com/genotype-poster/font.png&quot; width=&quot;400&quot; alt=&quot;Highlighting the font used in the poster&quot; /&gt;&amp;#x000A;    &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;&amp;#x000A;&amp;#x000A;&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;&amp;#x000A;&amp;#x000A;&lt;p&gt;These are the main points on how I created my poster. The rest was just adding&amp;#x000A;text, boxes and lines in Inkscape which is quite straight forward. Even if you&amp;#x000A;do struggle with Inkscape usually a quick google search for the effect you’re&amp;#x000A;after with find a tutorial. Overall, if I could offer a single suggestion it&amp;#x000A;would be to spend a little time beforehand looking for inspiration and planning&amp;#x000A;how you want your poster to look before you begin creating it.&lt;/p&gt;&amp;#x000A;&amp;#x000A;&lt;div class=&quot;centred&quot;&gt;&amp;#x000A;      &lt;a id=&quot;lightbox_image&quot; href=&quot;http://uk-me-michaelbarton.s3.amazonaws.com/posters/2012-asm/michael_barton_poster.png&quot;&gt;&amp;#x000A;        &lt;img src=&quot;http://uk-me-michaelbarton.s3.amazonaws.com/posters/2012-asm/thumb.png&quot; width=&quot;320&quot; alt=&quot;My ASM 2012 Poster.&quot; /&gt;&amp;#x000A;      &lt;/a&gt;&amp;#x000A;  &lt;/div&gt;</content>
  </entry>
</feed>
