<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
<channel>
	<title>Campagne Laboratory</title>
	
	<link>http://campagnelab.org</link>
	<description>News about research from the Campagne laboratory.</description>
	<lastBuildDate>Mon, 30 Jan 2012 18:10:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/campagnelab" /><feedburner:info uri="campagnelab" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><geo:lat>40.76842</geo:lat><geo:long>-73.96045</geo:long><item>
		<title>Goby 1.9.8.2</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/pjt1rmlCJmg/</link>
		<comments>http://campagnelab.org/goby-1-9-8-2/#comments</comments>
		<pubDate>Sat, 28 Jan 2012 18:24:33 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3373</guid>
		<description><![CDATA[
We have released Goby 1.9.8.2. This version offers the vcf-subset and vcf-compare replacement tools I mentioned in my earlier VCF post. The release also adds an option to call indels with Goby. We use the method of Krawitz et al. (Bioinformatics 2010) to find equivalent indel regions (EIR). This approach can reconcile distinct indel observations into canonical indel [...]]]></description>
			<content:encoded><![CDATA[
<p>We have released Goby 1.9.8.2. This version offers the vcf-subset and vcf-compare replacement tools I mentioned in my earlier VCF post.</p>
<p>The release also adds an option to call indels with Goby. We use the method of Krawitz et al. (Bioinformatics 2010) to find equivalent indel regions (EIRs). This approach reconciles distinct indel observations into canonical indel boundaries (an EIR). The genotype and compare-groups formats of the discover-sequence-variants mode report each EIR with a frequency summed over all the indel variations observed at the site that the EIR can explain. There is more to the Goby indel calling approach than the Krawitz method. For instance, the approach is integrated with the fast algorithm for local realignment around indels, so that indels that open when the ends of reads are realigned contribute to the frequency of an EIR.</p>
<p>Programmers will find that Goby represents observed indels at a site in much the same way as base genotypes. Reading a base or indel frequency at a position in a sample is done with the same API (see the <a href="http://icbtools.med.cornell.edu/javadocs/goby/">SampleCountInfo</a> class). This makes it easy to support indels in different output formats.</p>
<p>The vcf-compare replacement (new in this release) can keep random samples of the positions that differ between input files, one sample for each category of difference it tallies (e.g., missed one allele, RA vs RR; missed two alleles, AA vs RR; genotypes differ, C/T vs A/T where R=G). This is useful when inspecting positions in a genome viewer to understand differences between calls made by two approaches.</p>
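<p>To make the EIR idea concrete, here is a minimal Python sketch. It is an illustration only, not Goby's code nor the exact algorithm of Krawitz et al.: it handles deletions only, and the function names and the 0-based, end-exclusive coordinates are assumptions made for the example. A deletion is slid left and right along the reference for as long as the resulting sequence stays the same, and the read counts of all observations that fall in the same region are summed.</p>

```python
from collections import Counter


def equivalent_deletion_region(reference, start, length):
    """Return (first, end) such that deleting `length` bases anywhere
    inside reference[first:end] yields the same sequence as deleting
    reference[start:start + length]. Coordinates are 0-based, end-exclusive."""
    left = start
    # Moving the deletion one base to the left changes nothing when the base
    # that enters the deleted window equals the base that leaves it.
    while left > 0 and reference[left - 1] == reference[left + length - 1]:
        left -= 1
    right = start
    # Same test when moving the deletion one base to the right.
    while len(reference) > right + length and reference[right] == reference[right + length]:
        right += 1
    return left, right + length


def eir_frequencies(reference, observed_deletions):
    """Sum the counts of (start, length, count) observations per region."""
    totals = Counter()
    for start, length, count in observed_deletions:
        first, end = equivalent_deletion_region(reference, start, length)
        totals[(first, end, length)] += count
    return totals


reference = "GATTTTC"
# The same one-base deletion, reported at two different offsets in the T run.
totals = eir_frequencies(reference, [(2, 1, 3), (5, 1, 4)])
```

<p>In this toy example both observations map to the region covering the run of T bases, so their counts are added together under a single key.</p>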
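<p>One standard way to keep a bounded random sample per category while streaming through large files is reservoir sampling. The Python sketch below shows that technique under assumptions: the post does not say how Goby selects its samples, and the class name, category labels and positions are hypothetical.</p>

```python
import random


class DifferenceSampler:
    """Keep up to `sample_size` random positions per difference category."""

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.seen = {}     # category -> number of positions observed so far
        self.samples = {}  # category -> positions currently kept

    def observe(self, category, position):
        count = self.seen.get(category, 0) + 1
        self.seen[category] = count
        kept = self.samples.setdefault(category, [])
        if self.sample_size >= count:
            kept.append(position)  # reservoir not full yet
            return
        # Keep the new position with probability sample_size / count.
        slot = self.rng.randrange(count)
        if self.sample_size > slot:
            kept[slot] = position


sampler = DifferenceSampler(sample_size=5, seed=1)
for pos in range(1000):
    sampler.observe("missed one allele", ("chr1", pos))
sampler.observe("genotypes differ", ("chr1", 42))
```

<p>Memory use depends only on the sample size and the number of categories, not on how many differing positions the input files contain.</p>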
<p>More details about this release are in the <a href="http://campagnelab.org/software/goby/change-log/">ChangeLog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/goby-1-9-8-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/goby-1-9-8-2/</feedburner:origLink></item>
		<item>
		<title>Stumbled on PLINK/SEQ while looking for a tool to estimate Ti/Tv from VCF files. At the moment, P…</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/kwiv_RRa_kI/</link>
		<comments>http://campagnelab.org/stumbled-on-plinkseq-while-looking-for-a-tool-to-estimate-titv-from-vcf-files-at-the-moment-p/#comments</comments>
		<pubDate>Thu, 12 Jan 2012 00:09:58 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/stumbled-on-plinkseq-while-looking-for-a-tool-to-estimate-titv-from-vcf-files-at-the-moment-p/</guid>
		<description><![CDATA[
Stumbled on PLINK/SEQ while looking for a tool to estimate Ti/Tv from VCF files. At the moment, PLINK/SEQ seems limited to an older version of VCF, so it does not quite work with the files Goby generates (4.1), but I am interested in the mention of a binary file alternative to VCF, which could speed [...]]]></description>
			<content:encoded><![CDATA[
<p>Stumbled on PLINK/SEQ while looking for a tool to estimate Ti/Tv from VCF files. At the moment, PLINK/SEQ seems limited to an older version of VCF, so it does not quite work with the files Goby generates (4.1), but I am interested in the mention of a binary file alternative to VCF, which could speed up some of the work we do (see the section about project creation).</p>
<div class="g-crossposting-att">
<div class="g-crossposting-att-title"><a href="http://atgu.mgh.harvard.edu/plinkseq/tutorial.shtml" target="_blank">PLINK/SEQ genetics library</a></div>
<div class="g-crossposting-att-img" style="float:left"><a href="http://atgu.mgh.harvard.edu/plinkseq/tutorial.shtml" target="_blank"><img src="http://images0-focus-opensocial.googleusercontent.com/gadgets/proxy?container=focus&amp;gadget=a&amp;resize_h=100&amp;url=http%3A%2F%2Fatgu.mgh.harvard.edu%2Fplinkseq%2Fimg%2Fsing-by-dp.png" /></a></div>
<div class="g-crossposting-att-txt">Tutorial: working with 1000 Genomes Pilot 3 VCFs. By way of introducing some of the features and approaches of PLINK/Seq, this page provides a tutorial that uses PSEQ and the R interface to PLINK/Seq &#8230;</div>
</div>
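<p>For reference, Ti/Tv is the number of transitions (A/G and C/T changes) divided by the number of transversions (all other single-base changes). The Python sketch below estimates it per site from the REF and ALT columns of a VCF file. It is a simplified illustration and not PLINK/SEQ or Goby code: the function name is hypothetical, and it ignores genotypes, filters and indels.</p>

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv(vcf_lines):
    """Count transitions and transversions over single-base REF/ALT pairs."""
    ti = tv = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        columns = line.rstrip("\n").split("\t")
        ref = columns[3].upper()
        for alt in columns[4].upper().split(","):
            if ref not in BASES or alt not in BASES or ref == alt:
                continue  # skip indels, missing and symbolic alleles
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    ratio = ti / tv if tv else None
    return ti, tv, ratio


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT,A\t50\tPASS\t.",
    "1\t300\t.\tAT\tA\t50\tPASS\t.",
    "1\t400\t.\tG\tA\t50\tPASS\t.",
]
counts = ti_tv(example)
```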
<div class="g-crossposting-backlink"><a href="https://plus.google.com/116874816214311977726/posts/a1rFXT4q58U" target="_blank">This was posted on Google+&hellip;</a></div>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/stumbled-on-plinkseq-while-looking-for-a-tool-to-estimate-titv-from-vcf-files-at-the-moment-p/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/stumbled-on-plinkseq-while-looking-for-a-tool-to-estimate-titv-from-vcf-files-at-the-moment-p/</feedburner:origLink></item>
		<item>
		<title>Here’s a nice blog-review about studies that discovered variations causing diseases with next-gen…</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/tRpcDFz7mdk/</link>
		<comments>http://campagnelab.org/heres-a-nice-blog-review-about-studies-that-discovered-variations-causing-diseases-with-next-gen/#comments</comments>
		<pubDate>Wed, 11 Jan 2012 16:21:07 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/heres-a-nice-blog-review-about-studies-that-discovered-variations-causing-diseases-with-next-gen</guid>
		<description><![CDATA[
Here&#39;s a nice blog-review about studies that discovered variations causing diseases with next-generation sequencing. In addition to the content, I like the comment of the author that there is no time to write a review paper given how fast the field is developing. Obviously, what he means (since he has written the material already) is [...]]]></description>
			<content:encoded><![CDATA[
<p>Here&#39;s a nice blog-review about studies that discovered variations causing diseases with next-generation sequencing. </p>
<p>In addition to the content, I like the comment of the author that there is no time to write a review paper given how fast the field is developing. Obviously, what he means (since he has written the material already) is that publishing in a journal carries a high overhead for the authors, which simply slows down the communication of information. In this case, I would argue that the comments have served a role similar to peer review, prompting the author to add links to the papers, add information, or temper claims of novelty (see the comment about RET).</p>
<div class="g-crossposting-att">
<div class="g-crossposting-att-title"><a href="http://www.massgenomics.org/2011/12/disease-causing-mutations-discovered-by-ngs-in-2011.html" target="_blank">Disease-causing Mutations Discovered by NGS | MassGenomics</a></div>
<div class="g-crossposting-att-txt">The number of human genetic diseases unraveled by next-generation sequencing skyrocketed this year. Several factors contributed to this growth, two of which were the ever-increasing throughput of sequ&#8230;</div>
</div>
<div class="g-crossposting-backlink"><a href="https://plus.google.com/116874816214311977726/posts/UUidexCw1U3" target="_blank">This was posted on Google+&hellip;</a></div>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/heres-a-nice-blog-review-about-studies-that-discovered-variations-causing-diseases-with-next-gen/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/heres-a-nice-blog-review-about-studies-that-discovered-variations-causing-diseases-with-next-gen/</feedburner:origLink></item>
		<item>
		<title>Evaluating Goby against the 1000 genome genotype calls and why is VCF so inefficient?</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/pjkPwrDV564/</link>
		<comments>http://campagnelab.org/evaluating-goby-against-the-1000-genome-genotype-calls-and-why-is-vcf-so-inefficient/#comments</comments>
		<pubDate>Sat, 07 Jan 2012 19:22:08 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3328</guid>
		<description><![CDATA[
We have recently started a large-scale evaluation of the genotype calling features of Goby and GobyWeb. To this end, we decided to obtain exome data from the 1000 Genomes Project, and compare the genotypes called by Goby when all processing is done with GobyWeb (alignment and genotype calls). Since the 1000g project has far more [...]]]></description>
			<content:encoded><![CDATA[
<p>We have recently started a large-scale evaluation of the genotype calling features of Goby and GobyWeb. To this end, we decided to obtain exome data from the 1000 Genomes Project and compare the genotype calls distributed by the project with the genotypes called by Goby when all processing (alignment and genotype calls) is done with GobyWeb. Since the 1000g project has far more data than we need for this evaluation, we picked two exome samples semi-randomly. Both are paired-end; one has 76 bp reads and the other 90 bp reads.</p>
<h3>Exome data realignments</h3>
<p>Realigning the reads to the 1000g reference was no trouble: we simply converted the BAM files distributed by the 1000g project to the compact-reads format and uploaded them to GobyWeb. The rest is pretty much automated and was done in a matter of hours.</p>
<h3>Extracting a few samples from the 1000g VCF files</h3>
<p>The 1000 Genomes Project distributes many versions of the genotype calls in the VCF format. Locating the version that was produced against the 1000g reference (based on hg19) that we have installed in GobyWeb was a bit tricky, since there is no summary of all the versions. Thanks to Juan Rodriguez-Flores (in Jason Mezey&#8217;s lab at the ICB) for recommending this version:</p>
<p><a href="ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20111111_old_phase1_release_files/">ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20111111_old_phase1_release_files/</a></p>
<p>(We had initially gone directly to the very latest release, being wary of the &#8220;old&#8221; keyword in this version, but as far as we could tell the most recent version had been aligned against an ancestral reference reconstructed from primates, and would not work for this validation.)</p>
<p>As one can see from this directory, genotype calls are given in the VCF format, split into one file per chromosome (excluding the X and Y chromosomes and MT). The files are in the tens of GB even though they are compressed. They are large because they contain genotypes and annotations for the hundreds of samples studied in the 1000 Genomes Project.</p>
<p>Trying vcf-compare against just one of these files convinced us the comparison would be too slow against the complete files. We decided to extract just the two samples we selected for validation to yield smaller files that could be compared more efficiently.</p>
<p>Fortunately, or so we initially thought, VCFtools provides a program called vcf-subset. The plan was simply to run this program on each chromosome file with the names of the two samples we needed to extract, then concatenate the results. It turns out that vcf-subset is incredibly slow for the work it needs to perform. To be more specific, on a fast server, after a day of processing, we had not finished extracting the two samples from the chromosome 1 file. Upon closer inspection, the perl process was running at 100% CPU, but did not appear to make much progress through the file (as judged from the speed at which results were added to the output). At this point, rather than throwing more CPUs at the problem and going on vacation (it was Christmas time), we decided to go green (consider that inefficient programs are just as detrimental to our environment as other wasteful ways to burn oil).</p>
<p>Since Goby provides an efficient VCF parser, we reasoned we could write a more efficient way to extract the data we needed without too much trouble. To this end, we added a vcf-subset mode in Goby (an early implementation of this mode made it to 1.9.8.1, but we suggest getting the source code directly from our <a href="http://campagnelab.org/software/goby/download-goby/">subversion server</a> until we push 1.9.8.2 since the mode has improved a lot since that first release). Re-implementing the mode indeed provided a performance boost, but also offered an opportunity to add new options. One new feature that we added quickly was to process a number of files in parallel. We are now able to subset the 1000g VCF files in a few hours on a multi-threaded server.</p>
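<p>For readers unfamiliar with the task, the core of sample subsetting can be sketched in a few lines of Python. This is an illustration under assumptions, not Goby's vcf-subset mode: the function name and the sample names S1 to S3 are hypothetical, and a real tool would typically do more (for instance, update allele counts in the INFO column).</p>

```python
def subset_samples(vcf_lines, wanted_samples):
    """Yield VCF lines that keep only the columns of `wanted_samples`."""
    keep = None
    for line in vcf_lines:
        if line.startswith("##"):
            yield line  # meta-information lines are copied unchanged
            continue
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            names = columns[9:]
            # Nine fixed columns, then the requested samples in the order given.
            keep = list(range(9)) + [9 + names.index(s) for s in wanted_samples]
        yield "\t".join(columns[i] for i in keep) + "\n"


vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\n",
]
subset = list(subset_samples(vcf, ["S3", "S1"]))
```

<p>Because each chromosome file is independent, a function like this can be applied to several files at once, which is the kind of parallelism described above.</p>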
<h3>Why is vcf-subset so slow?</h3>
<p>This is obviously much better, but one has to wonder what takes so long. After all, the input files are only a few Gigabytes, and we don&#8217;t need to do anything complicated, just extract a subset of information. It turns out that the design of the VCF format makes the task very computationally demanding (much more so than it needs to be). First of all, VCF is a text-based format, which is inherently slower to parse than a binary format. To complicate matters, the format can vary from line to line (look at the specification of the FORMAT and sample columns). In my opinion, this is a very poor design decision. Ask yourself whether you have ever encountered a VCF file that uses this feature (e.g., one that has different fields in the FORMAT column on different lines). The need is not common, yet every program must be written to support this &#8220;flexibility&#8221;, and there is a clear computing cost for supporting the feature. This is a typical red flag that should have told the designers that the added flexibility was not worth the compute cost.</p>
<p>Another complication introduced by VCF is that the type of delimiter varies by field (the INFO fields are delimited by ;, FORMAT fields use :, while other fields are delimited by tabs). All these &#8220;features&#8221; may seem to provide flexibility, but they combine to create significant inefficiencies. The format looks as if it was designed so that humans can read it, yet it is now used to store gigabytes of data. All this suggests to me that the committee/members of the mailing list who have been responsible for the design of VCF could have paid more attention to the actual uses of the format and should have weighed the impact of adding &#8220;nice to have, but not that frequent&#8221; features against their computational cost. Then again, most people don&#8217;t care if a program or format is inefficient, at least until the computational cost makes some needed tasks impractical. It seems that VCF may be ripe for a redesign: it clearly meets requirements, but it is so complicated and inefficient that it would make sense to replace it with a leaner, more efficient alternative.</p>
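<p>The per-line FORMAT issue is easy to see in code. In the Python sketch below (the function name and the two example records are made up for illustration), the FORMAT column has to be split again for every record before any sample value can be interpreted, because nothing guarantees that the next line uses the same keys.</p>

```python
def sample_fields(vcf_line, sample_index):
    """Return the FORMAT key/value pairs of one sample on one VCF line."""
    columns = vcf_line.rstrip("\n").split("\t")
    keys = columns[8].split(":")  # FORMAT column: may differ on every line
    values = columns[9 + sample_index].split(":")
    return dict(zip(keys, values))


line_a = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:14\t0/0:9"
line_b = "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:GQ:DP\t1/1:99:20\t0/1:45:11"

second_sample_a = sample_fields(line_a, 1)
first_sample_b = sample_fields(line_b, 0)
```

<p>A format with a fixed per-file layout would let a parser compute the position of each value once, rather than once per line.</p>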
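<p>As an illustration of the nested delimiters, here is a minimal Python sketch that decodes the INFO column. The function name and the example values are assumptions made for the example. A record is first split on tabs, then INFO is split on semicolons, each entry on an equals sign, and each value on commas.</p>

```python
def parse_info(info_column):
    """Decode an INFO column into a dict of lists (flags map to True)."""
    info = {}
    if info_column == ".":
        return info  # "." means that no INFO is given
    for entry in info_column.split(";"):
        key, has_value, value = entry.partition("=")
        info[key] = value.split(",") if has_value else True
    return info


record = "1\t200\t.\tC\tT,A\t50\tPASS\tDP=14;AF=0.5,0.25;DB"
columns = record.split("\t")
info = parse_info(columns[7])
```

<p>Even in this small example, reading one allele frequency takes three levels of splitting, all on text.</p>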
<p>I would not be surprised if the techniques we used in Goby formats could yield a more compressed VCF format alternative that could be subset in minutes rather than hours. Whether the community is feeling enough pain to consider adopting an alternative is a different question. What do you think?</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/evaluating-goby-against-the-1000-genome-genotype-calls-and-why-is-vcf-so-inefficient/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/evaluating-goby-against-the-1000-genome-genotype-calls-and-why-is-vcf-so-inefficient/</feedburner:origLink></item>
		<item>
		<title>Goby 1.9.8.1</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/-40vv57axpo/</link>
		<comments>http://campagnelab.org/goby-1-9-8-1/#comments</comments>
		<pubDate>Sat, 17 Dec 2011 15:30:34 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3319</guid>
		<description><![CDATA[
We have released critical performance enhancements and bug fixes in Goby 1.9.8.1. A detailed list of changes can be found in the Change Log. All users are encouraged to install this latest distribution. Highlights include much better performance when merging alignments with large (&#62;100MB) .tmh files (as required when calling variations across many samples) and several [...]]]></description>
			<content:encoded><![CDATA[
<p>We have released critical performance enhancements and bug fixes in Goby 1.9.8.1. A detailed list of changes can be found in the <a href="http://campagnelab.org/software/goby/change-log/">Change Log</a>. All users are encouraged to install this <a href="http://campagnelab.org/software/goby/download-goby/">latest distribution</a>. Highlights include much better performance when merging alignments with large (&gt;100MB) .tmh files (as required when calling variations across many samples) and several fixes for subtle bugs that extensive testing has recently uncovered.</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/goby-1-9-8-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/goby-1-9-8-1/</feedburner:origLink></item>
		<item>
		<title>GobyWeb 1.6.1</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/L7eJUmWPcyQ/</link>
		<comments>http://campagnelab.org/gobyweb-1-6-1/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 03:57:02 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3298</guid>
		<description><![CDATA[
We have just released a binary distribution of GobyWeb version 1.6.1. This is the first public release of GobyWeb. Detailed installation instructions are available on the download page. Please let us know if you are planning a local installation and have questions not covered in the instructions. See the change log for details about this version.]]></description>
			<content:encoded><![CDATA[
<table border="0">
<tbody>
<tr>
<td><a href="http://campagnelab.org/wp-content/uploads/2010/03/gobyweb_logo.png"><img class="alignleft size-full wp-image-1861" title="gobyweb_logo" src="http://campagnelab.org/wp-content/uploads/2010/03/gobyweb_logo.png" alt="" width="181" height="116" /></a></td>
<td>We have just released a binary distribution of <a href="http://gobyweb.campagnelab.org">GobyWeb</a> version 1.6.1. This is the first public release of GobyWeb. Detailed installation instructions are available on the <a href="http://campagnelab.org/software/gobyweb/license-binary-distribution-and-installation-instructions/">download page</a>. Please let us know if you are planning a local installation and have questions not covered in the instructions. See the <a href="http://campagnelab.org/software/gobyweb/change-log/">change log</a> for details about this version.</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/gobyweb-1-6-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/gobyweb-1-6-1/</feedburner:origLink></item>
		<item>
		<title>Goby 1.9.8</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/SvU0iGpZi8E/</link>
		<comments>http://campagnelab.org/goby-1-9-8/#comments</comments>
		<pubDate>Wed, 09 Nov 2011 12:31:00 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3229</guid>
		<description><![CDATA[
We have released Goby 1.9.8. This version includes the enhancements and bug fixes that are needed in the version of GobyWeb that we are preparing for release. See detailed change information in the project Change Log.]]></description>
			<content:encoded><![CDATA[
<div>We have released <a href="http://campagnelab.org/software/goby/download-goby/">Goby 1.9.8</a>. This version includes the enhancements and bug fixes that are needed in the version of GobyWeb that we are preparing for release. See detailed change information in the project <a href="http://campagnelab.org/software/goby/change-log/">Change Log</a>.</div>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/goby-1-9-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/goby-1-9-8/</feedburner:origLink></item>
		<item>
		<title>GobyWeb training videos</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/Tlf9gSGxrzg/</link>
		<comments>http://campagnelab.org/gobyweb-training-videos/#comments</comments>
		<pubDate>Thu, 03 Nov 2011 14:54:09 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=3194</guid>
		<description><![CDATA[
We have started to post GobyWeb training videos. The videos demonstrate key features of GobyWeb and are designed to help users who cannot attend our regular training sessions at the Weill Cornell Medical College. Additional videos will be posted as we finalize the first GobyWeb release.]]></description>
			<content:encoded><![CDATA[
<p>We have started to post <a href="http://campagnelab.org/software/gobyweb/gobyweb-video-training-series/">GobyWeb training videos</a>. The videos demonstrate key features of GobyWeb and are designed to help users who cannot attend our regular training sessions at the Weill Cornell Medical College. Additional videos will be posted as we finalize the first GobyWeb release.</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/gobyweb-training-videos/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/gobyweb-training-videos/</feedburner:origLink></item>
		<item>
		<title>Storing paired end reads in BAM files, not so useful for reanalysis</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/0AmH0mWPAVc/</link>
		<comments>http://campagnelab.org/storing-paired-end-reads-in-bam-files-not-so-useful-for-reanalysis/#comments</comments>
		<pubDate>Fri, 02 Sep 2011 17:46:16 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=2861</guid>
		<description><![CDATA[
I have recently been interested in reanalyzing datasets from the study of Ajay et al, Genome Research (&#8216;Accurate and comprehensive sequencing of personal genomes&#8217;). The data were deposited in the ENA, and the accession code is given in the paper (ERP000765), so all seemed good. I prefer to start with FASTQ files, but these were not [...]]]></description>
			<content:encoded><![CDATA[
<p>I have recently been interested in reanalyzing datasets from the study of <a href="http://genome.cshlp.org/content/early/2011/07/18/gr.123638.111?top=1">Ajay et al</a>, Genome Research (&#8216;Accurate and comprehensive sequencing of personal genomes&#8217;). The data were deposited in the ENA, and the accession code is given in the paper (<a href="http://www.ebi.ac.uk/ena/data/view/ERP000765">ERP000765</a>), so all seemed good.</p>
<p>I prefer to start with FASTQ files, but these were not available for this study. I located BAM files at ENA under the Submitted files tab.</p>
<p>Goby provides a tool to extract reads from BAM files, so I did not expect major difficulties. For single-end reads, I would typically do:</p>
<pre>goby 1g <a href="http://campagnelab.org/software/goby/reference-documentation/modes/sam-extract-reads/">sam-extract-reads</a> input.bam -o output.compact-reads</pre>
<p>and proceed to analyze the compact-reads files with GobyWeb. The first problem I experienced was that the study used a paired-end design. This is a great idea when you are sequencing a complete genome, but we did not have an option in sam-extract-reads to load two BAM files and put each read and its mate together into one compact-reads file. We do have this option in the tool fasta-to-compact, but since the files are tens of Gigabytes each, I did not feel like creating several huge intermediate files, or waiting for the conversions to run ((from BAM to compact-reads, then to FASTQ) x 2, then to compact-reads again).</p>
<p>Using the options of fasta-to-compact as a template, I added the same capability to sam-extract-reads (this option will be available in our next release).</p>
<p>Here&#8217;s what you can now do:</p>
<pre>goby 1g sam-extract-reads SRA_GAIIX_RUN_1.bam SRA_GAIIX_RUN_2.bam --paired-end -o output.compact-reads</pre>
<p>This causes the two input files to be loaded, and each read to be combined with its mate and written to output.compact-reads (Goby stores the primary read and its mate in the same file so that we can be sure these sequences never get out of order).</p>
<p>In implementing these options, I was careful to check that the id of the reads matched for the primary read and its paired read. As soon as I started this new tool on the datasets I obtained from ENA, I got an error indicating that the id of the first read in <span style="font-family: Consolas, Monaco, monospace; font-size: 12px; line-height: 18px; white-space: pre;">SRA_GAIIX_RUN_1.bam </span>did not match the id of first read in <span style="font-family: Consolas, Monaco, monospace; font-size: 12px; line-height: 18px; white-space: pre;">SRA_GAIIX_RUN_2.bam</span>.</p>
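<p>The id check amounts to walking the two inputs in lockstep and refusing to pair reads whose ids differ. Here is a minimal Python sketch of that idea. It is not the Goby implementation: the function name and the representation of reads as (id, sequence) tuples are assumptions, and it expects both mates to carry exactly the same id.</p>

```python
from itertools import zip_longest


def pair_reads(first_reads, second_reads):
    """Yield (read_id, sequence, mate_sequence), checking ids as we go."""
    missing = (None, None)  # stands in for a read absent from the shorter input
    for index, (first, second) in enumerate(
            zip_longest(first_reads, second_reads, fillvalue=missing)):
        if first[0] != second[0]:
            raise ValueError(
                "read %d: id %r does not match mate id %r"
                % (index, first[0], second[0]))
        yield first[0], first[1], second[1]


pairs = list(pair_reads([("r1", "ACGT"), ("r2", "GGCA")],
                        [("r1", "TTAG"), ("r2", "CATG")]))
```

<p>Two inputs sorted by genomic position fail this check almost immediately, since mates rarely occupy the same rank in both files.</p>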
<p>I guess I should have expected these BAM files to be sorted in chromosomal order. Fortunately, I saw that samtools sort has an option to re-sort a BAM file by read id. My plan was to re-sort each file by id, then use <span style="font-family: Consolas, Monaco, monospace; font-size: 12px; line-height: 18px; white-space: pre;">sam-extract-reads</span>.</p>
<p>Here&#8217;s the samtools command to re-sort by id:</p>
<pre>samtools sort -n SRA_GAIIX_RUN_1.bam SRA_GAIIX_RUN_namesorted_1.bam</pre>
<p>Since the GAIIX file was 23GB, I started the sort overnight. Here&#8217;s the message I got this morning:</p>
<pre>[bam_sort_core] merging from 316 files...</pre>
<pre>open: Too many open files</pre>
<pre>[bam_merge_core] fail to open file SRA_GAIIX_RUN-namesort_1.bam.0252.bam</pre>
<p>This does not look good. Let&#8217;s look at what happened here. The program samtools sort handles inputs that are too large to fit in memory by splitting the input into pieces, each small enough to fit in memory, sorting each piece, and writing it to disk. The last step of the process is to merge all the sorted pieces into the final sorted output. The problem I encountered was that the default size for each piece resulted in so many files that the merge step failed (most operating systems limit the number of files a process can have open at any one time).</p>
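<p>The split, sort and merge steps described above can be sketched with standard coreutils on a plain text file (for example, a list of read ids). This illustrates the general external-sort technique only, not the samtools implementation; the function name external_sort and its arguments are made up for this example.</p>
<pre>
# external_sort INPUT OUTPUT LINES_PER_CHUNK   (hypothetical helper)
# Sorts the lines of INPUT into OUTPUT without loading the whole file at once.
external_sort() {
    input=$1
    output=$2
    lines_per_chunk=$3
    workdir=$(mktemp -d) || return 1

    # Step 1: split the input into pieces small enough to sort in memory.
    split -l "$lines_per_chunk" "$input" "$workdir/chunk." || return 1

    # Step 2: sort each piece independently and write it back to disk.
    for chunk in "$workdir"/chunk.*; do
        sort -o "$chunk" "$chunk" || return 1
    done

    # Step 3: merge the already-sorted pieces into one sorted output.
    # A merge that holds every piece open at once needs one file handle
    # per piece.
    sort -m -o "$output" "$workdir"/chunk.* || return 1

    rm -r "$workdir"
}
</pre>
<p>A call such as external_sort ids.txt ids.sorted.txt 1000000 would use pieces of one million lines. The smaller the pieces, the more files the merge step has to read. GNU sort merges in batches when given many inputs (see its --batch-size option), so this sketch shows the structure of the algorithm rather than reproducing the failure.</p>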
<p>The work-around is to increase the -m parameter from its default to a few GB and run on a machine with sufficient memory. This reduces the number of files that need to be kept open in the merge step. However, it means that the whole process is now limited by the amount of memory one can get on a server. Consider that server memory is not growing nearly as fast as sequencing capacity, and one can envision a point when a dataset is too large to process on any server. This is unlikely to happen tomorrow, however, and the biggest problem is that one has to go through several steps (which do not run in parallel and hence are quite slow on such large files) when trying to reanalyze large paired-end datasets provided in BAM files.</p>
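<p>As a rough sketch of the work-around, one can compare the open-file limit with an estimate of how many pieces a larger -m would produce. The helper estimate_chunks is made up for this example, and two things here are assumptions rather than facts from the run: that the number of pieces scales inversely with -m, and that the starting value of -m was 500000000 bytes.</p>
<pre>
# estimate_chunks OBSERVED_CHUNKS OLD_MEM_BYTES NEW_MEM_BYTES
# Hypothetical helper, not part of samtools: scales an observed number of
# pieces by the ratio of the two memory settings, rounding up.
estimate_chunks() {
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Per-process limit on open files in the current shell.
ulimit -n

# 316 pieces were reported in the failed run; 500000000 is an assumed
# starting value for -m, and 4000000000 is the proposed new value.
estimate_chunks 316 500000000 4000000000

# The sort could then be re-run along these lines, assuming -m takes a
# byte count in the samtools version at hand:
# samtools sort -n -m 4000000000 SRA_GAIIX_RUN_1.bam SRA_GAIIX_RUN_namesorted_1.bam
</pre>
<p>If the estimate is comfortably below the limit printed by ulimit -n, the merge step should be able to open all its pieces, provided the machine really has that much memory to spare.</p>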
<p>It would have been so much simpler if the reads had been deposited in a format that keeps read pairs together. Goby compact reads does this nicely by keeping everything in one file; FASTQ does the trick with two files. Given this experience, BAM files sorted by genomic location should probably not be the first choice for distributing full-genome paired-end NGS read data.</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/storing-paired-end-reads-in-bam-files-not-so-useful-for-reanalysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/storing-paired-end-reads-in-bam-files-not-so-useful-for-reanalysis/</feedburner:origLink></item>
		<item>
		<title>Goby 1.9.7.3</title>
		<link>http://feedproxy.google.com/~r/campagnelab/~3/0N-v0tDAgOg/</link>
		<comments>http://campagnelab.org/goby-1-9-7-3/#comments</comments>
		<pubDate>Mon, 29 Aug 2011 15:50:33 +0000</pubDate>
		<dc:creator>Fabien Campagne</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://campagnelab.org/?p=2856</guid>
		<description><![CDATA[
We have released Goby 1.9.7.3 [changelog]. New features include the ability to enter a short-cut for a mode name. You can call goby 1g ftc, where ftc is a short-cut for fasta-to-compact. See goby 1g --help for the list of all short-cuts. Another useful feature is the ability to convert sorted BAM files to sorted [...]]]></description>
			<content:encoded><![CDATA[
<p>We have released Goby 1.9.7.3 [<a href="http://campagnelab.org/software/goby/change-log/">changelog</a>]. New features include the ability to enter a short-cut for a mode name. You can call goby 1g ftc, where ftc is a short-cut for fasta-to-compact. See goby 1g --help for the list of all short-cuts. Another useful feature is the ability to convert sorted BAM files to sorted Goby alignments with sam-to-compact (short-cut: stc). This feature was suggested by Bradford Powell. A few bugs have been fixed, especially in VCF output. Finally, a bug in tally-reads that caused the tool to fail when processing reads of different lengths was fixed (thanks to Adrian Platt for the bug report).</p>
]]></content:encoded>
			<wfw:commentRss>http://campagnelab.org/goby-1-9-7-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://campagnelab.org/goby-1-9-7-3/</feedburner:origLink></item>
	</channel>
</rss>

