<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" gd:etag="W/&quot;CkEHSXs8cSp7ImA9WhVTEUw.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590</id><updated>2012-02-24T14:10:38.579-05:00</updated><category term="linux" /><category term="xml" /><category term="java" /><category term="n50" /><category term="bash history" /><category term="perl" /><category term="solaris 10" /><category term="os x" /><category term="velvet" /><category term="Samtools" /><category term="bioinformatics" /><category term="BioPerl" /><category term="grails" /><category term="economics" /><category term="histfile" /><category term="firefox 3" /><category term="groovy" /><category term="cycling" /><category term="scp" /><category term="BSgenome" /><category term="rstudio" /><category term="BioConductor" /><category term="hardware" /><category term="R" /><category term="NGS" /><category term="humor" /><title>Jermdemo Raised to the Law</title><subtitle type="html">Mostly bioinformatics, NGS, and cat litter box reviews</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;v=2" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>50</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/JermdemoRaisedToTheLaw" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="jermdemoraisedtothelaw" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry gd:etag="W/&quot;DEQASHczfyp7ImA9WhVTEEQ.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-7206707885663816984</id><published>2012-02-20T21:44:00.000-05:00</published><updated>2012-02-24T10:12:29.987-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-02-24T10:12:29.987-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="NGS" /><category scheme="http://www.blogger.com/atom/ns#" term="humor" /><title>AGBT: digesting diposable MinIONs in diaspora</title><content type="html">Despite my current&amp;nbsp;&lt;a href="http://biostar.stackexchange.com/users"&gt;ranking of 15th in Biostar&lt;/a&gt;, myriad page views of my &lt;a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html"&gt;BAS&lt;/a&gt;™ post (albeit mostly misdirected perverts), and positive response for my celebrated campaign against more microarray papers, for some reason I was not "comped" an all-expenses paid trip as honorary blog journalist to this year's &lt;a href="http://agbt.org/"&gt;Advances in Biology and Genome Technology&lt;/a&gt;, which is kind of like CES for&amp;nbsp;sequencing&amp;nbsp;people, except AGBT is still worth attending. Normally the oversight would not bother me, as bioinformatics itself is not the focus of this meeting, but the flood of &lt;a href="https://twitter.com/#%21/search/realtime/%23AGBT"&gt;#AGBT&lt;/a&gt; tweets would not let me forget this fact and I was forced to stew and blog in envy.
&lt;br /&gt;&lt;br /&gt;
&lt;h4&gt;
The first game changing disruptive revolutionary thing from England since 1964&lt;/h4&gt;
Even from my distant perch it was obvious all the scientific presentations at AGBT were overshadowed by a 17-minute showstopping demo from Clive Brown of Oxford Nanopore, a company that by all appearances would either die, focus on some minor stuff, or bring it. They chose the third option, and in so doing boosted the "&lt;a href="http://www.google.com/trends/?q=clive"&gt;Clive index&lt;/a&gt;" to unprecedented levels. OxN's recent decision to enlist famed geneticist and serial startup advisor George Church struck me as a huge gamble, as the string of Route 128 flameouts touting his name lead me to assume long ago that Church had stowed away some cursed Tiki idol in his luggage like Bobby in that episode of the Brady Bunch. However, after reading up on OxN, I had to admit I was just bitter about Dr. Church's refusal to invest in my chain of &lt;a href="http://www.polonator.org/"&gt;Polonator&lt;/a&gt;-based paternity testing clinics, Yo Po'lonatizz!™&lt;br /&gt;&lt;br /&gt;
Two new sequencer platforms were announced:&lt;br /&gt;
&lt;table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-oEj8DmCb5VE/Tz7VdWRyOAI/AAAAAAAABqk/2lg9BulMCnM/s1600/MinION_in_laptop.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="213" src="http://3.bp.blogspot.com/-oEj8DmCb5VE/Tz7VdWRyOAI/AAAAAAAABqk/2lg9BulMCnM/s320/MinION_in_laptop.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;A MinION. Forget to hit eject before removing this&lt;br /&gt;
and you will instantly lose $900.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;ul&gt;
&lt;li&gt;The MinION, a $900 "disposable" USB drive which detects minute changes in voltage incurred by the passage of DNA through a 
robust and delicious lipid bilayer. Finally a device capable of sequencing filthy rabbit blood right on the spot!&lt;/li&gt;
&lt;li&gt;The GridION system, a scalable rackmounted sequencer, which despite some lack of pricing clarity, should produce an actual $1000 15-minute human genome by 2013.&lt;/li&gt;
&lt;/ul&gt;
These exotic machines must be truly game-changing because they made properly expanding Albert Vilella's &lt;a href="https://docs.google.com/spreadsheet/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc"&gt;NGS sequencer spreadsheet&lt;/a&gt; quite difficult. The MinION, in particular, could be viewed as a free device with $900 of consumables. This effectively lowers the bar to getting high-throughput sequence in the doctor's office to a 100% unamortized billable transaction. These things also claim &lt;i&gt;fucking unlimited read lengths.&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Expression microarrays, SAGE, 454, ABI SOLiD, and now Pacific Biosciences have all left bad tastes of uncertainty and dissatisfaction in the mouths of scientists. It is easy to disappoint people on a grand scale with a $700,000 machine, but $900 worth of chemicals in a USB drive is a different animal, and it seems likely this invention will find a following if it even delivers on a fraction of what it promises.&lt;br /&gt;&lt;br /&gt;
&lt;table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: left;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-I01Cd82VSL8/Tz7ilRBeQFI/AAAAAAAABqw/BOc8cXqEEhg/s1600/GridION_scaling.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="170" src="http://2.bp.blogspot.com/-I01Cd82VSL8/Tz7ilRBeQFI/AAAAAAAABqw/BOc8cXqEEhg/s320/GridION_scaling.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The GridION - put it in a rack or right on the floor.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
Good information on this sequencer-on-a-stick is to be found at&amp;nbsp;&lt;a href="http://pathogenomics.bham.ac.uk/blog/2012/02/oxford-nanopore-megaton-announcement-why-do-you-need-a-machine-exclusive-interview-for-this-blog/"&gt;Nick Loman's blog&lt;/a&gt;,&amp;nbsp;&lt;a href="http://www.genomesunzipped.org/2012/02/making-sequencing-simpler-with-nanopores.php"&gt;Genomes Unzipped&lt;/a&gt;, &lt;a href="http://www.nanoporetech.com/news/press-releases/view/39"&gt;and official press releases&lt;/a&gt;. An excellent discussion of the nanopores themselves can be found at &lt;a href="http://www.omespeak.com/blog/?p=507"&gt;Omically Speaking&lt;/a&gt;.
&lt;br /&gt;&lt;br /&gt;
&lt;h4&gt;



More cringeworthy marketing from the West coast &lt;/h4&gt;
The Oxford Nanopore machines are so jaw-dropping, in fact, that&amp;nbsp;&lt;a href="http://www.forbes.com/sites/matthewherper/2012/02/18/who-doubts-the-usb-thumb-drive-sequencer-a-rival/"&gt;Jonathan Rothberg is already crying vaporware&lt;/a&gt;. His complaints do seem warranted, given disappointments from past year's announcements and the lack of publicly available sequence from these devices.
&lt;br /&gt;&lt;br /&gt;
Unfortunately Ion Torrent has spent all of its goodwill on an&amp;nbsp;inane and hamfisted advertising war against Illumina's MiSeq, an intentionally crippled opponent. Seemingly orchestrated by castoffs from the Celebrity Apprentice, this assault began with &lt;a href="http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=GUr17pHezUo"&gt;cringe-inducing derivations&lt;/a&gt; of Apple commercials, and has expanded to include a sort of "feature combover." Through some&lt;a href="http://www.youtube.com/watch?v=jv-JHafk4UA&amp;amp;feature=youtu.be"&gt;&amp;nbsp;convoluted logic&lt;/a&gt; involving consensus, a professional whiteboard artist attempts to convince the public how the homopolymer error rate is actually lower using Ion Torrent PGM than MiSeq. This is the sequencing equivalent of having your mom try to convince you two apples is better than one&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Drake%27s_Devil_Dogs"&gt;devil dog&lt;/a&gt;, or some such utter nonsense.
&lt;br /&gt;&lt;br /&gt;
My response was predictably measured and cerebral:
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-y529xWo7K94/Tz6oBG7vDwI/AAAAAAAABqU/gqVyaFHCP1M/s1600/Screen+shot+2012-02-17+at+1.40.23+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="59" src="http://2.bp.blogspot.com/-y529xWo7K94/Tz6oBG7vDwI/AAAAAAAABqU/gqVyaFHCP1M/s640/Screen+shot+2012-02-17+at+1.40.23+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;
This is not the&amp;nbsp;&lt;a href="https://picasaweb.google.com/lh/photo/oyux1nepVh9nxAqgbMnA1Px8lHi6ePCYZLKzcyR0WMQ?feat=directlink"&gt;first time&lt;/a&gt;&amp;nbsp;I have tweet-confronted Ion Torrent over its&amp;nbsp;odious approach.&amp;nbsp;All this is rather unnecessary because overall, and despite the homopolymer issues, the utility of the PGM has been more or less within expectations. The MiSeq is also exactly within expectations, since it is basically a transparent, measly 1/50th slice of a HiSeq.&amp;nbsp;The same cannot really be said for the RS, whose error rate is clearly far above what was expected at the outset. So if anyone requires an aggressive smokescreen-type marketing campaign (or a new machine) it is Pacific Biosciences.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-7206707885663816984?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/7206707885663816984/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2012/02/agbt-digesting-diposable-minions-in.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/7206707885663816984?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/7206707885663816984?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2012/02/agbt-digesting-diposable-minions-in.html" title="AGBT: digesting diposable MinIONs in diaspora" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-oEj8DmCb5VE/Tz7VdWRyOAI/AAAAAAAABqk/2lg9BulMCnM/s72-c/MinION_in_laptop.jpg" height="72" width="72" /><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;DEMGR3Y8eCp7ImA9WhRVGUk.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-8185076191620100624</id><published>2012-01-18T21:15:00.000-05:00</published><updated>2012-01-18T23:13:46.870-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-01-18T23:13:46.870-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="NGS" /><category scheme="http://www.blogger.com/atom/ns#" term="humor" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>When can we expect the last damn microarray paper?</title><content type="html">&lt;h3&gt;With bonus R code&lt;/h3&gt;
It came as a shock to learn from PubMed that almost &lt;a href="http://www.ncbi.nlm.nih.gov/pubmed?term=microarray%5Btitle%5D%20and%202011%5Bpdat%5D"&gt;900 papers&lt;/a&gt; were published with the word "microarray" in their titles last year alone, just 12 shy of the 2010 count. More alarming, many of these papers were not of the innocuous "Microarray study of gene expression in dog scrotal tissue" variety, but dry rehashings along the lines of "Statistical approaches to normalizing microarrays to the reference brightness of Ursa Minor".
&lt;p&gt;
It's an ugly truth we must face: people aren't just using microarrays, &lt;i&gt;they're still writing about them&lt;/i&gt;.
&lt;/p&gt;
&lt;p&gt;
See for yourself:
&lt;/p&gt;
&lt;pre class="brush: shell"&gt;getCount&amp;lt;-function(term){function(year){
  nihUrl&amp;lt;-concat("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&amp;amp;term=",term,"+",year,"[pdat]")
  #cleanurl&amp;lt;-gsub('\\]','%5D',gsub('\\[','%5B',x=url))
  #http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&amp;amp;term=microarray%5btitle%5d+2003%5bpdat%5d
  xml&amp;lt;-xmlTreeParse(URLencode(nihUrl),isURL=TRUE)
  #Data Mashups in R, pg17
  as.numeric(xmlValue(xml$doc$children$eSearchResult$children$Count$children$text))
}}

years&amp;lt;-1995:2011
df&amp;lt;-data.frame(type="obs",year=years,
    mic=sapply(years,function(x){do.call(getCount('microarray[title]'),list(x))}),
    ngs=sapply(years,function(x){do.call(getCount('"next generation sequencing"[title] OR "high-throughput sequencing"[title]'),list(x))})
)
#papers with "microarray" in title
&amp;gt; df[,c("year","mic")]
   year  mic
1  1995    2
2  1996    4
3  1997    0
4  1998    7
5  1999   28
6  2000  108
7  2001  273
8  2002  553
9  2003  770
10 2004 1032
11 2005 1135
12 2006 1216
13 2007 1107
14 2008 1055
15 2009  981
16 2010  909
17 2011  897
&lt;/pre&gt;
Reading another treatise on microarray normalization in 2012 would be just tragic. Who still reads these? Who still writes these papers? Can we stop them? If not, when can we expect NGS to wipe them off the map?

&lt;br /&gt;
&lt;pre class="brush: shell"&gt;
#97 is a fair start
df&lt;-subset(df,year&gt;=1997)
mdf&lt;-melt(df,id.vars=c("type","year"),variable_name="citation")

c&lt;-ggplot(mdf,aes(x=year))
p&lt;-c+geom_point(aes(y=value,color=citation)) +
  ylab("papers") +
  stat_smooth(aes(y=value,color=citation),data=subset(mdf,citation=="mic"),method="loess") +
  scale_x_continuous(breaks=seq(from=1997,to=2011,by=2))
print(p)
&lt;/pre&gt;
Here I plot both microarray and next-generation sequencing papers (in title). We see kurtosis is working in our favor, and LOESS seems to agree!

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-MYehPyrrOEw/Txd6iNBzWwI/AAAAAAAABpU/k2ACY1xHGHY/s1600/obs.png" imageanchor="1" style=""&gt;&lt;img border="0"  src="http://1.bp.blogspot.com/-MYehPyrrOEw/Txd6iNBzWwI/AAAAAAAABpU/k2ACY1xHGHY/s800/obs.png" /&gt;&lt;/a&gt;&lt;/div&gt;


But when will the pain end? Let us extrapolate, wildly.
&lt;pre class="brush: shell"&gt;
#Return 0 for negative elements
# noNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0 0 0 2
noNeg&lt;-function(v){sapply(v,function(x){max(x,0)})}

#Return up to the first negative/zero element inclusive
# toZeroNoNeg(c(3,2,1,0,-1,-2,2))
# [1] 3 2 1 0
toZeroNoNeg&lt;-function(v){noNeg(v)[1:firstZero(noNeg(v))]}

#return index of first zero
firstZero&lt;-function(v){which(noNeg(v)==0)[1]}

#let's peer into the future
df.lo.mic&lt;-loess(mic ~ year,df,control=loess.control(surface="direct"))

#when will it stop?
mic_predict&lt;-as.integer(predict(df.lo.mic,data.frame(year=2012:2020),se=FALSE))
zero_year&lt;-2011+firstZero(mic_predict)
cat(concat("LOESS projects ",sum(toZeroNoNeg(mic_predict))," more microarray papers."))
cat(concat("The last damn microarray paper is projected to be in ",(zero_year-1),"."))

#predict ngs growth
df.lo.ngs&lt;-loess(ngs ~ year,df,control=loess.control(surface="direct"))
ngs_predict&lt;-as.integer(predict(df.lo.ngs,data.frame(year=2012:zero_year),se=FALSE))

pred_df&lt;-data.frame(type="pred",year=c(2012:zero_year),mic=toZeroNoNeg(mic_predict),ngs=ngs_predict)
df2&lt;-rbind(df,pred_df)

mdf2&lt;-melt(df2,id.vars=c("type","year"),variable_name="citation")

c2&lt;-ggplot(mdf2,aes(x=year))
p2&lt;-c2+geom_point(aes(y=value,color=citation,shape=type),size=3) +
    ylab("papers") +
    scale_y_continuous(breaks=seq(from=0,to=1600,by=200))+
    scale_x_continuous(breaks=seq(from=1997,to=zero_year,by=2))
print(p2)
&lt;/pre&gt;


&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-JZCL6HQlAus/Txd6ryRLKHI/AAAAAAAABpg/vQ4HNBiZOL8/s1600/pred.png" imageanchor="1" style=""&gt;&lt;img border="0"  src="http://4.bp.blogspot.com/-JZCL6HQlAus/Txd6ryRLKHI/AAAAAAAABpg/vQ4HNBiZOL8/s800/pred.png" /&gt;&lt;/a&gt;&lt;/div&gt;



&lt;p&gt;LOESS projects 2038 more microarray papers.&lt;br/&gt;
The last damn microarray paper is projected to be published in 2016.
&lt;/p&gt;
&lt;p&gt;
Yeah, right...
&lt;/p&gt;

Full R code here: &lt;a href="https://gist.github.com/1637248"&gt;https://gist.github.com/1637248&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-8185076191620100624?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/8185076191620100624/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2012/01/when-can-we-expect-last-damn-microarray.html#comment-form" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8185076191620100624?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8185076191620100624?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2012/01/when-can-we-expect-last-damn-microarray.html" title="When can we expect the last damn microarray paper?" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-MYehPyrrOEw/Txd6iNBzWwI/AAAAAAAABpU/k2ACY1xHGHY/s72-c/obs.png" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;CUIARn88fSp7ImA9WhRTEEQ.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-1682411291794250329</id><published>2011-10-31T13:17:00.003-04:00</published><updated>2011-10-31T16:39:07.175-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-10-31T16:39:07.175-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Making R's paste act more like CONCAT</title><content type="html">&lt;p&gt;While vector-friendly, R's &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html"&gt;paste&lt;/a&gt; function has a few behaviors I don't particularly like.&lt;/p&gt;

&lt;p&gt;One is using a space as the default separator:&lt;br/&gt;
&lt;pre class="brush: shell"&gt;
&gt; adjectives&lt;-c("lean","fast","strong")
&gt; paste(adjectives,"er")
&gt; paste(adjectives,"er")
[1] "lean er"   "fast er"   "strong er"  #d'oh
&gt; paste(adjectives,"er",sep="")
[1] "leaner"   "faster"   "stronger"
&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt;Empty vectors get an undeserved first class treatment:
&lt;pre class="brush: shell"&gt;
&gt; paste(indelPositions,"i",sep="")
[1] "i"
&gt; indelPositions&lt;-c(5,6,7)
&gt; paste(indelPositions,"i",sep="")
[1] "5i" "6i" "7i" #good

&gt; indelPositions&lt;-c()
&gt; paste(indelPositions,"i",sep="")
[1] "i"  #not so good
&lt;/pre&gt;&lt;/p&gt;

&lt;p&gt;And perhaps worst of all, NA values get replaced with a string called "NA":
&lt;pre class="brush: shell"&gt;
&gt; placing&lt;-"1"
&gt; paste(placing,"st",sep="")
[1] "1st" #awesome

&gt; placing&lt;-NA_integer_
&gt; paste(placing,"st",sep="")
[1] "NAst" #ugh
&lt;/pre&gt;&lt;/p&gt;
&lt;p&gt;This is inconvenient in situations where I don't know a priori if I will get a value, a vector of length 0, or an NA.
&lt;/p&gt;
&lt;p&gt;Working from Hadley Wickham's str_c function in the &lt;a href="https://github.com/hadley/stringr"&gt;stringr&lt;/a&gt; package, I decided to write a paste function that behaves more like CONCAT in SQL:
&lt;br/&gt;
&lt;pre class="brush: shell"&gt;
library(stringr)
concat&lt;-CONCAT&lt;-function(...,sep="",collapse=NULL){
  strings&lt;-list(...)
  #catch NULLs, NAs
  if(
    all(unlist(llply(strings,length))&gt;0)
    &amp;&amp;
    all(!is.na(unlist(strings)))
    ){
    do.call("paste", c(strings, list(sep = sep, collapse = collapse)))
  }else{
    NULL
  }
}
&lt;/pre&gt;&lt;/p&gt;

&lt;p&gt;This function has the behaviors I expect:
&lt;pre class="brush: shell"&gt;
&gt; concat(adjectives,"er")
[1] "leaner"   "faster"   "stronger"

&gt; concat(indelPositions,"i")
NULL

&gt; concat(placing,"st")
NULL
&lt;/pre&gt;
&lt;/p&gt;
&lt;p&gt;
That's more like it!
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-1682411291794250329?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/1682411291794250329/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2011/10/making-rs-paste-act-more-like-concat.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/1682411291794250329?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/1682411291794250329?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2011/10/making-rs-paste-act-more-like-concat.html" title="Making R's paste act more like CONCAT" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;CEQCQH45fCp7ImA9WhdUGU8.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6859966095244928444</id><published>2011-10-06T13:10:00.000-04:00</published><updated>2011-10-06T13:32:41.024-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-10-06T13:32:41.024-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="linux" /><title>SELinux for enhanced headaches</title><content type="html">&lt;br /&gt;
&lt;a href="http://en.wikipedia.org/wiki/Security-Enhanced_Linux"&gt;Security Enhanced Linux (SELinux)&lt;/a&gt; is a new extra hidden layer of permissions that makes configuring things more difficult, without ever identifying itself as the culprit - kind of like &lt;a href="http://en.wikipedia.org/wiki/Access_control_list"&gt;ACLs&lt;/a&gt; but more cryptic. Though it may be more secure, it is not an enhancing experience to deal with, and probably not worth it for the average user.&lt;br /&gt;
&lt;br /&gt;
For example to have Apache serve personal websites (i.e. http://server/~leipzig) it is no longer enough to alter httpd.conf, because you will get mysterious 403 errors until you do this (as &lt;a href="http://serverfault.com/questions/307167/apache-403-forbidden/307209#307209"&gt;others&lt;/a&gt; have experienced):
&lt;br /&gt;
&lt;pre&gt;chcon -R -t httpd_sys_content_t /home/leipzig&lt;/pre&gt;
&lt;br /&gt;
You forget about this change until xauth starts complaining about stuff for no apparent reason:
&lt;br /&gt;
&lt;pre&gt;/usr/bin/xauth:&amp;nbsp; timeout in locking authority file /home/leipzig/.Xauthority&lt;/pre&gt;
&lt;br /&gt;
so of course you need to do this (thanks &lt;a href="http://www.linkedin.com/in/madhavdiwan"&gt;Madhav Diwan&lt;/a&gt; for &lt;a href="http://forums.fedoraforum.org/showpost.php?p=1336336&amp;amp;postcount=9"&gt;this&lt;/a&gt; post):&lt;br /&gt;
&lt;pre&gt;chcon unconfined_u:object_r:user_home_dir_t:s0 /home/leipzig&lt;/pre&gt;
&lt;br /&gt;
I have no idea what these things actually mean, nor any real interest in learning. I'm sure this stuff is great for sysadmin cocktail chat but at least for private servers it is just another the brake on the wheel of getting things done. For the time being I have set the level to "permissive", which means it displays warnings but does not interfere, but am leaning toward "disabled" or maybe something else:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=excoriated
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
&lt;/pre&gt;
&lt;br /&gt;

More on the pros and cons:&lt;br /&gt;
&lt;a href="http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set"&gt;http://unix.stackexchange.com/questions/9163/does-selinux-provide-enough-extra-security-to-be-worth-the-hassle-of-learning-set&lt;/a&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6859966095244928444?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6859966095244928444/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2011/10/selinux-for-enhanced-headaches.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6859966095244928444?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6859966095244928444?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2011/10/selinux-for-enhanced-headaches.html" title="SELinux for enhanced headaches" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CkINRnc5fSp7ImA9WhdQFkU.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-2510745151990076348</id><published>2011-08-18T10:03:00.008-04:00</published><updated>2011-08-18T11:16:37.925-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-08-18T11:16:37.925-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="rstudio" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Installing RStudio Server on Scientific Linux 6: My bash notebook</title><content type="html">&lt;a href="http://2.bp.blogspot.com/-99i2Opn_Adk/Tk0amQ0oHGI/AAAAAAAABco/auMgs9qczVc/s1600/Screen+shot+2011-08-18+at+9.58.10+AM.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="220" src="http://2.bp.blogspot.com/-99i2Opn_Adk/Tk0amQ0oHGI/AAAAAAAABco/auMgs9qczVc/s320/Screen+shot+2011-08-18+at+9.58.10+AM.png" width="320" /&gt;&lt;/a&gt;
Granted, not a brilliant sysadmin mind at work here, but this might help someone someday.&lt;br /&gt;
Scientific Linux (SL) is built from Red Hat Enterprise Linux&lt;br /&gt;
&lt;br /&gt;
See installation instructions here:&lt;br /&gt;
&lt;a href="http://rstudio.org/download/server"&gt;http://rstudio.org/download/server&lt;/a&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: left;"&gt;
&lt;/div&gt;
&lt;pre class="brush: bash"&gt;
[leipzig@localhost ~]$ sudo rpm -Uvh
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
[sudo] password for leipzig: 
Retrieving
http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
warning: /var/tmp/rpm-tmp.S2RQAH: Header V3 RSA/SHA256 Signature, key ID
0608b895: NOKEY
Preparing...                ########################################### [100%]
   1:epel-release           ########################################### [100%]
[leipzig@localhost ~]$ rpm -qa | grep epel
epel-release-6-5.noarch
[leipzig@localhost ~]$ which R
/usr/local/bin/R
[leipzig@localhost ~]$ R

R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

&amp;amp;gt; q()
Save workspace image? [y/n/c]: n
[leipzig@localhost ~]$ wget
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
--2011-08-17 13:06:36-- 
https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-x86_64.rpm
Resolving s3.amazonaws.com... 72.21.211.170
Connecting to s3.amazonaws.com|72.21.211.170|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11373769 (11M) [application/x-redhat-package-manager]
Saving to: “rstudio-server-0.94.92-x86_64.rpm”

100%[===========================================================================
=============================================&amp;amp;gt;] 11,373,769  7.89M/s   in 1.4s   


2011-08-17 13:06:37 (7.89 MB/s) - “rstudio-server-0.94.92-x86_64.rpm” saved
[11373769/11373769]

[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
	libR.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libRblas.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libRlapack.so()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install R  
epel/metalink                                                                   
                                                          |  14 kB     00:00    

epel                                                                            
                                                          | 4.3 kB     00:00    

epel/primary_db                                                                 
                                                          | 4.0 MB     00:00    

sl                                                                              
                                                          | 3.2 kB     00:00    

sl-security                                                                     
                                                          | 1.9 kB     00:00    

Setting up Install Process
Resolving Dependencies
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package R.x86_64 0:2.13.1-1.el6 set to be updated
--&amp;amp;gt; Processing Dependency: libRmath-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: R-devel = 2.13.1-1.el6 for package:
R-2.13.1-1.el6.x86_64
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package R-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&amp;amp;gt; Processing Dependency: R-core = 2.13.1-1.el6 for package:
R-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: bzip2-devel for package: R-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: gcc-gfortran for package: R-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: tk-devel for package: R-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: pcre-devel for package: R-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: tcl-devel for package: R-devel-2.13.1-1.el6.x86_64
---&amp;amp;gt; Package libRmath-devel.x86_64 0:2.13.1-1.el6 set to be updated
--&amp;amp;gt; Processing Dependency: libRmath = 2.13.1-1.el6 for package:
libRmath-devel-2.13.1-1.el6.x86_64
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package R-core.x86_64 0:2.13.1-1.el6 set to be updated
--&amp;amp;gt; Processing Dependency: cups for package: R-core-2.13.1-1.el6.x86_64
--&amp;amp;gt; Processing Dependency: libtk8.5.so()(64bit) for package:
R-core-2.13.1-1.el6.x86_64
---&amp;amp;gt; Package bzip2-devel.x86_64 0:1.0.5-7.el6_0 set to be updated
---&amp;amp;gt; Package gcc-gfortran.x86_64 0:4.4.4-13.el6 set to be updated
---&amp;amp;gt; Package libRmath.x86_64 0:2.13.1-1.el6 set to be updated
---&amp;amp;gt; Package pcre-devel.x86_64 0:7.8-3.1.el6 set to be updated
---&amp;amp;gt; Package tcl-devel.x86_64 1:8.5.7-6.el6 set to be updated
---&amp;amp;gt; Package tk-devel.x86_64 1:8.5.7-5.el6 set to be updated
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package cups.x86_64 1:1.4.2-35.el6_0.1 set to be updated
--&amp;amp;gt; Processing Dependency: portreserve for package:
1:cups-1.4.2-35.el6_0.1.x86_64
--&amp;amp;gt; Processing Dependency: poppler-utils for package:
1:cups-1.4.2-35.el6_0.1.x86_64
---&amp;amp;gt; Package tk.x86_64 1:8.5.7-5.el6 set to be updated
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package poppler-utils.x86_64 0:0.12.4-3.el6_0.1 set to be updated
---&amp;amp;gt; Package portreserve.x86_64 0:0.0.4-4.el6 set to be updated
--&amp;amp;gt; Finished Dependency Resolution

Dependencies Resolved

================================================================================
================================================================================
==
 Package                                 Arch                            Version
                                      Repository                            Size
================================================================================
================================================================================
==
Installing:
 R                                       x86_64                         
2.13.1-1.el6                                  epel                              
   17 k
Installing for dependencies:
 R-core                                  x86_64                         
2.13.1-1.el6                                  epel                              
   33 M
 R-devel                                 x86_64                         
2.13.1-1.el6                                  epel                              
   88 k
 bzip2-devel                             x86_64                         
1.0.5-7.el6_0                                 sl-security                       
  249 k
 cups                                    x86_64                         
1:1.4.2-35.el6_0.1                            sl-security                       
  2.3 M
 gcc-gfortran                            x86_64                         
4.4.4-13.el6                                  sl                                
  4.7 M
 libRmath                                x86_64                         
2.13.1-1.el6                                  epel                              
  111 k
 libRmath-devel                          x86_64                         
2.13.1-1.el6                                  epel                              
   21 k
 pcre-devel                              x86_64                         
7.8-3.1.el6                                   sl                                
  317 k
 poppler-utils                           x86_64                         
0.12.4-3.el6_0.1                              sl-security                       
   72 k
 portreserve                             x86_64                         
0.0.4-4.el6                                   sl                                
   21 k
 tcl-devel                               x86_64                         
1:8.5.7-6.el6                                 sl                                
  161 k
 tk                                      x86_64                         
1:8.5.7-5.el6                                 sl                                
  1.4 M
 tk-devel                                x86_64                         
1:8.5.7-5.el6                                 sl                                
  495 k

Transaction Summary
================================================================================
================================================================================
==
Install      14 Package(s)
Upgrade       0 Package(s)

Total download size: 43 M
Installed size: 89 M
Is this ok [y/N]: y
Downloading Packages:
(1/14): R-2.13.1-1.el6.x86_64.rpm                                               
                                                          |  17 kB     00:00    

(2/14): R-core-2.13.1-1.el6.x86_64.rpm                                          
                                                          |  33 MB     00:05    

(3/14): R-devel-2.13.1-1.el6.x86_64.rpm                                         
                                                          |  88 kB     00:00    

(4/14): bzip2-devel-1.0.5-7.el6_0.x86_64.rpm                                    
                                                          | 249 kB     00:00    

(5/14): cups-1.4.2-35.el6_0.1.x86_64.rpm                                        
                                                          | 2.3 MB     00:01    

(6/14): gcc-gfortran-4.4.4-13.el6.x86_64.rpm                                    
                                                          | 4.7 MB     00:02    

(7/14): libRmath-2.13.1-1.el6.x86_64.rpm                                        
                                                          | 111 kB     00:00    

(8/14): libRmath-devel-2.13.1-1.el6.x86_64.rpm                                  
                                                          |  21 kB     00:00    

(9/14): pcre-devel-7.8-3.1.el6.x86_64.rpm                                       
                                                          | 317 kB     00:00    

(10/14): poppler-utils-0.12.4-3.el6_0.1.x86_64.rpm                              
                                                          |  72 kB     00:00    

(11/14): portreserve-0.0.4-4.el6.x86_64.rpm                                     
                                                          |  21 kB     00:00    

(12/14): tcl-devel-8.5.7-6.el6.x86_64.rpm                                       
                                                          | 161 kB     00:00    

(13/14): tk-8.5.7-5.el6.x86_64.rpm                                              
                                                          | 1.4 MB     00:00    

(14/14): tk-devel-8.5.7-5.el6.x86_64.rpm                                        
                                                          | 495 kB     00:00    

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total                                                                           
                                                 3.1 MB/s |  43 MB     00:13    

warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID 0608b895:
NOKEY
epel/gpgkey                                                                     
                                                          | 3.2 kB     00:00 ...

Importing GPG key 0x0608B895 "EPEL (6) epel@fedoraproject.org" from
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Is this ok [y/N]: y
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
  Installing     : 1:tk-8.5.7-5.el6.x86_64                                      
                                                                            1/14

  Installing     : portreserve-0.0.4-4.el6.x86_64                               
                                                                            2/14

  Installing     : poppler-utils-0.12.4-3.el6_0.1.x86_64                        
                                                                            3/14

  Installing     : 1:cups-1.4.2-35.el6_0.1.x86_64                               
                                                                            4/14

  Installing     : R-core-2.13.1-1.el6.x86_64                                   
                                                                            5/14

  Installing     : gcc-gfortran-4.4.4-13.el6.x86_64                             
                                                                            6/14

  Installing     : libRmath-2.13.1-1.el6.x86_64                                 
                                                                            7/14

  Installing     : 1:tcl-devel-8.5.7-6.el6.x86_64                               
                                                                            8/14

  Installing     : 1:tk-devel-8.5.7-5.el6.x86_64                                
                                                                            9/14

  Installing     : libRmath-devel-2.13.1-1.el6.x86_64                           
                                                                           10/14

  Installing     : bzip2-devel-1.0.5-7.el6_0.x86_64                             
                                                                           11/14

  Installing     : pcre-devel-7.8-3.1.el6.x86_64                                
                                                                           12/14

  Installing     : R-devel-2.13.1-1.el6.x86_64                                  
                                                                           13/14

  Installing     : R-2.13.1-1.el6.x86_64                                        
                                                                           14/14


Installed:
  R.x86_64 0:2.13.1-1.el6                                                       
                                                                                


Dependency Installed:
  R-core.x86_64 0:2.13.1-1.el6                R-devel.x86_64 0:2.13.1-1.el6     
  bzip2-devel.x86_64 0:1.0.5-7.el6_0       cups.x86_64 1:1.4.2-35.el6_0.1     
  gcc-gfortran.x86_64 0:4.4.4-13.el6          libRmath.x86_64 0:2.13.1-1.el6    
  libRmath-devel.x86_64 0:2.13.1-1.el6     pcre-devel.x86_64 0:7.8-3.1.el6    
  poppler-utils.x86_64 0:0.12.4-3.el6_0.1     portreserve.x86_64 0:0.0.4-4.el6  
  tcl-devel.x86_64 1:8.5.7-6.el6           tk.x86_64 1:8.5.7-5.el6            
  tk-devel.x86_64 1:8.5.7-5.el6              

Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
	libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Resolving Dependencies
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package openssl098e.i686 0:0.9.8e-17.el6 set to be updated
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.3.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libkrb5.so.3(krb5_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libcom_err.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libk5crypto.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libk5crypto.so.3(k5crypto_3_MIT) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libdl.so.2(GLIBC_2.0) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.7) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libkrb5.so.3 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.4) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libgssapi_krb5.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libdl.so.2(GLIBC_2.1) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.1.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libresolv.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libz.so.1 for package: openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6 for package: openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libdl.so.2 for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Processing Dependency: libc.so.6(GLIBC_2.3) for package:
openssl098e-0.9.8e-17.el6.i686
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package glibc.i686 0:2.12-1.7.el6_0.5 set to be updated
--&amp;amp;gt; Processing Dependency: libfreebl3.so for package:
glibc-2.12-1.7.el6_0.5.i686
--&amp;amp;gt; Processing Dependency: libfreebl3.so(NSSRAWHASH_3.12.3) for package:
glibc-2.12-1.7.el6_0.5.i686
---&amp;amp;gt; Package krb5-libs.i686 0:1.9-9.el6_1.1 set to be updated
--&amp;amp;gt; Processing Dependency: libkeyutils.so.1(KEYUTILS_0.3) for package:
krb5-libs-1.9-9.el6_1.1.i686
--&amp;amp;gt; Processing Dependency: libkeyutils.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
--&amp;amp;gt; Processing Dependency: libselinux.so.1 for package:
krb5-libs-1.9-9.el6_1.1.i686
---&amp;amp;gt; Package libcom_err.i686 0:1.41.12-3.el6 set to be updated
---&amp;amp;gt; Package zlib.i686 0:1.2.3-25.el6 set to be updated
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package keyutils-libs.i686 0:1.4-1.el6 set to be updated
---&amp;amp;gt; Package libselinux.i686 0:2.0.94-2.el6 set to be updated
---&amp;amp;gt; Package nss-softokn-freebl.i686 0:3.12.8-1.el6_0 set to be updated
--&amp;amp;gt; Finished Dependency Resolution

Dependencies Resolved

================================================================================
================================================================================
==
 Package                                     Arch                         
Version                                     Repository                          
 Size
================================================================================
================================================================================
==
Installing:
 openssl098e                                 i686                         
0.9.8e-17.el6                               sl                                  
772 k
Installing for dependencies:
 glibc                                       i686                         
2.12-1.7.el6_0.5                            sl-security                         
4.3 M
 keyutils-libs                               i686                         
1.4-1.el6                                   sl                                  
 19 k
 krb5-libs                                   i686                         
1.9-9.el6_1.1                               sl-security                         
711 k
 libcom_err                                  i686                         
1.41.12-3.el6                               sl                                  
 33 k
 libselinux                                  i686                         
2.0.94-2.el6                                sl                                  
106 k
 nss-softokn-freebl                          i686                         
3.12.8-1.el6_0                              sl-security                         
108 k
 zlib                                        i686                         
1.2.3-25.el6                                sl                                  
 71 k

Transaction Summary
================================================================================
================================================================================
==
Install       8 Package(s)
Upgrade       0 Package(s)

Total download size: 6.0 M
Installed size: 18 M
Is this ok [y/N]: y
Downloading Packages:
(1/8): glibc-2.12-1.7.el6_0.5.i686.rpm                                          
                                                          | 4.3 MB     00:02    

(2/8): keyutils-libs-1.4-1.el6.i686.rpm                                         
                                                          |  19 kB     00:00    

(3/8): krb5-libs-1.9-9.el6_1.1.i686.rpm                                         
                                                          | 711 kB     00:00    

(4/8): libcom_err-1.41.12-3.el6.i686.rpm                                        
                                                          |  33 kB     00:00    

(5/8): libselinux-2.0.94-2.el6.i686.rpm                                         
                                                          | 106 kB     00:00    

(6/8): nss-softokn-freebl-3.12.8-1.el6_0.i686.rpm                               
                                                          | 108 kB     00:00    

(7/8): openssl098e-0.9.8e-17.el6.i686.rpm                                       
                                                          | 772 kB     00:00    

(8/8): zlib-1.2.3-25.el6.i686.rpm                                               
                                                          |  71 kB     00:00    

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--
Total                                                                           
                                                 1.2 MB/s | 6.0 MB     00:04    

Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : nss-softokn-freebl-3.12.8-1.el6_0.i686                       
                                                                             1/8

  Installing     : glibc-2.12-1.7.el6_0.5.i686                                  
                                                                             2/8

  Installing     : libcom_err-1.41.12-3.el6.i686                                
                                                                             3/8

  Installing     : zlib-1.2.3-25.el6.i686                                       
                                                                             4/8

  Installing     : libselinux-2.0.94-2.el6.i686                                 
                                                                             5/8

  Installing     : keyutils-libs-1.4-1.el6.i686                                 
                                                                             6/8

  Installing     : krb5-libs-1.9-9.el6_1.1.i686                                 
                                                                             7/8

  Installing     : openssl098e-0.9.8e-17.el6.i686                               
                                                                             8/8


Installed:
  openssl098e.i686 0:0.9.8e-17.el6                                              
                                                                                


Dependency Installed:
  glibc.i686 0:2.12-1.7.el6_0.5        keyutils-libs.i686 0:1.4-1.el6           
     krb5-libs.i686 0:1.9-9.el6_1.1       libcom_err.i686 0:1.41.12-3.el6      
  libselinux.i686 0:2.0.94-2.el6       nss-softokn-freebl.i686 0:3.12.8-1.el6_0 
     zlib.i686 0:1.2.3-25.el6            

Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
	libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libcrypto.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Resolving Dependencies
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package compat-libgfortran-41.i686 0:4.1.2-39.el6 set to be updated
--&amp;amp;gt; Finished Dependency Resolution

Dependencies Resolved

================================================================================
================================================================================
==
 Package                                           Arch                         
   Version                                  Repository                      Size
================================================================================
================================================================================
==
Installing:
 compat-libgfortran-41                             i686                         
   4.1.2-39.el6                             sl                              99 k

Transaction Summary
================================================================================
================================================================================
==
Install       1 Package(s)
Upgrade       0 Package(s)

Total download size: 99 k
Installed size: 488 k
Is this ok [y/N]: y
Downloading Packages:
compat-libgfortran-41-4.1.2-39.el6.i686.rpm                                     
                                                          |  99 kB     00:00    

Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Installing     : compat-libgfortran-41-4.1.2-39.el6.i686                      
                                                                             1/1


Installed:
  compat-libgfortran-41.i686 0:4.1.2-39.el6                                     
                                                                                


Complete!
[leipzig@localhost ~]$ sudo rpm -Uvh rstudio-server-0.94.92-x86_64.rpm
error: Failed dependencies:
	libcrypto.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libgfortran.so.1()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
	libssl.so.6()(64bit) is needed by rstudio-server-0.94.92-1.x86_64
[leipzig@localhost ~]$ sudo yum install libssl.so.6
Setting up Install Process
Package openssl098e-0.9.8e-17.el6.i686 already installed and latest version
Nothing to do

[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing...                ########################################### [100%]
   1:rstudio-server         ########################################### [100%]
rsession: no process killed
Starting rstudio-server: /usr/lib/rstudio-server/bin/rserver: error while
loading shared libraries: libssl.so.6: cannot open shared object file: No such
file or directory
[FAILED]

#trying some stuff recommended here
#http://support.rstudio.org/help/discussions/problems/839-installing-rstudio-
from-source-after-installing-r-from-source

[leipzig@localhost ~]$ sudo yum install openssl098e-0.9.8e
Setting up Install Process
Resolving Dependencies
--&amp;amp;gt; Running transaction check
---&amp;amp;gt; Package openssl098e.x86_64 0:0.9.8e-17.el6 set to be updated
--&amp;amp;gt; Finished Dependency Resolution

Dependencies Resolved

================================================================================
================================================================================
==
 Package                                  Arch                               
Version                                       Repository                      
Size
================================================================================
================================================================================
==
Installing:
 openssl098e                              x86_64                             
0.9.8e-17.el6                                 sl                             
762 k

Transaction Summary
================================================================================
================================================================================
==
Install       1 Package(s)
Upgrade       0 Package(s)

Total download size: 762 k
Installed size: 2.2 M
Is this ok [y/N]: y
Downloading Packages:
openssl098e-0.9.8e-17.el6.x86_64.rpm                                            
                                                          | 762 kB     00:00    

Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
rstudio-server-0.94.92-1.x86_64 has missing requires of libcrypto.so.6()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of
libgfortran.so.1()(64bit)
rstudio-server-0.94.92-1.x86_64 has missing requires of libssl.so.6()(64bit)
  Installing     : openssl098e-0.9.8e-17.el6.x86_64                             
                                                                             1/1


Installed:
  openssl098e.x86_64 0:0.9.8e-17.el6                                            
                                                                                


Complete!
[leipzig@localhost ~]$ sudo yum install gcc41-libgfortran-4.1.2
Setting up Install Process
No package gcc41-libgfortran-4.1.2 available.
Error: Nothing to do
[leipzig@localhost ~]$ sudo yum install pango-1.28.1
Setting up Install Process
Package pango-1.28.1-3.el6_0.5.x86_64 already installed and latest version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh --nodeps rstudio-server-0.94.92-x86_64.rpm
Preparing...                ########################################### [100%]
	package rstudio-server-0.94.92-1.x86_64 is already installed

[leipzig@localhost ~]$ sudo rstudio-server start
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server:                                   [  OK  ]
/usr/lib/rstudio-server/bin/rsession: error while loading shared libraries:
libgfortran.so.1: wrong ELF class: ELFCLASS32
Starting rstudio-server:                                   [  OK  ]
[leipzig@localhost ~]$ sudo yum install libgfortran.so.1
Setting up Install Process
Package compat-libgfortran-41-4.1.2-39.el6.i686 already installed and latest
version
Nothing to do
[leipzig@localhost ~]$ sudo rpm -Uvh
ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm
error: open of
ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm failed: No such file or directory
[leipzig@localhost ~]$ wget
ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm
--2011-08-18 04:39:39-- 
http://ftp.scientificlinux.org/linux/scientific/6.0/x86_64/os/Packages/compat-
libgfortran-41-4.1.2-39.el6.x86_64.rpm
Resolving ftp.scientificlinux.org... 131.225.110.147
Connecting to ftp.scientificlinux.org|131.225.110.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128080 (125K) [application/x-rpm]
Saving to: “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”

100%[===========================================================================
=============================================&amp;amp;gt;] 128,080      488K/s   in 0.3s   


2011-08-18 04:39:39 (488 KB/s) - “compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm”
saved [128080/128080]

[leipzig@localhost ~]$ sudo rpm -Uvh
compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm 
Preparing...                ########################################### [100%]
   1:compat-libgfortran-41  ########################################### [100%]
[leipzig@localhost ~]$ sudo rstudio-server verify-installation
Stopping rstudio-server:                                   [  OK  ]
Starting rstudio-server:                                   [  OK  ]
&lt;/pre&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-2510745151990076348?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/2510745151990076348/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2011/08/installing-rstudio-server-on-scientific.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/2510745151990076348?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/2510745151990076348?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2011/08/installing-rstudio-server-on-scientific.html" title="Installing RStudio Server on Scientific Linux 6: My bash notebook" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-99i2Opn_Adk/Tk0amQ0oHGI/AAAAAAAABco/auMgs9qczVc/s72-c/Screen+shot+2011-08-18+at+9.58.10+AM.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DkMDRH4-fSp7ImA9WhZbGU8.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-8951008304523784002</id><published>2011-06-23T17:14:00.000-04:00</published><updated>2011-06-24T10:07:55.055-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-06-24T10:07:55.055-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="NGS" /><category scheme="http://www.blogger.com/atom/ns#" term="hardware" /><title>Big-Ass Servers™ and the myths of clusters in bioinformatics</title><content type="html">&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;Spending $55k for a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell for a bioinformatics researcher to pitch to a department head.&lt;br /&gt;
&lt;br /&gt;
&lt;table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://i.dell.com/images/global/products/pedge/pedge_highlights/server-poweredge-r900-overview2.jpg" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Dell PowerEdge r900, available in orange and lemon-lime&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;Speaking as someone who keeps his copy of &lt;a href="http://en.wikipedia.org/wiki/Introduction_to_Algorithms"&gt;CLR&lt;/a&gt; safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.&lt;br /&gt;
&lt;br /&gt;
Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.&lt;br /&gt;
&lt;br /&gt;
From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;The development of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tends to be more &lt;a href="http://en.wikipedia.org/wiki/Memory_bound_function"&gt;memory-bound&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/I/O_bound"&gt;IO-bound&lt;/a&gt; rather than &lt;a href="http://en.wikipedia.org/wiki/CPU_bound"&gt;CPU-bound&lt;/a&gt;, so relying on a cluster of smaller machines can quickly overwhelm a network.&lt;/li&gt;
&lt;li&gt;NGS has forced the number of high-performance applications from BLAST and protein structure prediction to doing dozens of different little analyses, with tools that change on a monthly basis, or are homegrown to deal with special circumstances. There isn't time or ability to write each of these for parallel architectures.&lt;/li&gt;
&lt;/ul&gt;If those don't sound very convincing, here is my layman's guide to dealing with the myths you might encounter concerning NGS and clusters:&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: Google uses server farms. We should too.&lt;/h3&gt;&lt;br /&gt;
Google has to focus on doing one thing very well: search.&lt;br /&gt;
&lt;br /&gt;
Bioinformatics programmers have to explore a number of different questions for any given experiment. There is not time to develop a parallel solution to many of these questions as they will lead to dead ends.&lt;br /&gt;
&lt;br /&gt;
Many bioinformatic problems, de-novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed with messaging. You can imagine trying to divide a jigsaw puzzle among friends sitting several tables, you would spend more time talking about the pieces than fitting them together.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: Our development setup should mimic our production setup&lt;/h3&gt;&lt;br /&gt;
An experimental computing structure with a BAS™ allows for researchers to freely explore big data without having to think about how to divide it efficiently. If an experiment is successful and there is the need to scale-up to a clinical or industrial platform, that can happen later.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows&lt;/h3&gt;&lt;br /&gt;
There are tools for queueing jobs, but those are often quite helpless to assist in managing workflows that are written as parallel and serial steps - for example, waiting for steps to finish before merging results.&lt;br /&gt;
&lt;br /&gt;
Various programming languages have features to take advantage of clusters. For example, R has &lt;a href="http://cran.r-project.org/web/packages/snow/index.html"&gt;SNOW&lt;/a&gt;. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant that but also means we don't need a BAS™&lt;/h3&gt;&lt;br /&gt;
The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring GBs data to the cloud.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: Crossbow and Myrna are based on Hadoop, we can develop similar tools&lt;/h3&gt;&lt;br /&gt;
Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Crossbow is a Hadoop-based implementation of Bowtie. &lt;/li&gt;
&lt;li&gt;Myrna is an RNA-Seq pipeline. &lt;/li&gt;
&lt;li&gt;Contrail is a de novo short read assembler. &lt;/li&gt;
&lt;/ul&gt;These are difficult programs to develop, and these examples are also somewhat limited experimental proofs of concept or are married to components that may be undesirable for certain analyses. The Bowtie stack (Bowtie, Tophat, Cufflinks), while revolutionary in its implementation of Burroughs-Wheeler algorithm, is itself is built around the limitations of computers in the year 2008. For many it lacks the sensitivity to deal with, for example, 1000 Genomes data.&lt;br /&gt;
&lt;br /&gt;
The dynamic scripting languages used most bioinformatics programmers are not as well suited to Hadoop as Java. To imply we can all develop similar tools of this sophistication is unrealistic. Many bioinformatics programs are not even &lt;i&gt;threaded&lt;/i&gt;, much less designed to work amongst several machines.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Myth: &lt;a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel"&gt;embarrassingly parallel&lt;/a&gt; problems imply a cluster is needed&lt;/h3&gt;&lt;h3&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;&lt;h3&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;&lt;h3&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;Here is a particularly telling &lt;a href="http://biostar.stackexchange.com/questions/2657/scalemp-vsmp-or-physical-ram"&gt;quote&lt;/a&gt; from Biohaskell developer Ketil Malde on Biostar:&lt;br /&gt;
&lt;blockquote&gt;In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.&lt;br /&gt;
&lt;br /&gt;
In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford. &lt;/blockquote&gt;&lt;br /&gt;
&lt;h3&gt;Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST&lt;/h3&gt;&lt;br /&gt;
&lt;table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="215" src="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;IBM System x3850 X5 expandable to 1536GB, mouse not included&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;Even  the BLAST setup we think of as being the essence of parallel (a  segmented genome index - every node gets a part of the genome) is often  not the one that many institutions have settled on. Many rely on farming  out queries to a cluster in which every node has the full genome index  in memory.&lt;br /&gt;
&lt;br /&gt;
Secondly, the mpiBLAST appears to be more  suited to dividing an index among older machines than today's, which typically have &amp;gt;32GB RAM. Here is a telling FAQ entry: &lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!&lt;br /&gt;
&lt;br /&gt;
mpiBLAST  only yields super-linear speedup when the database being searched is  significantly larger than the core memory on an individual node. The  super-linear speedup results published in the ClusterWorld 2003 paper  describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB  (compressed) database on a cluster where each node has 640MB of RAM. A  single node search results in heavy disk I/O and a long search time.&lt;br /&gt;
&lt;a href="http://www.mpiblast.org/Docs/FAQ#super-linear"&gt;http://www.mpiblast.org/Docs/FAQ#super-linear&lt;/a&gt;&lt;/blockquote&gt;&lt;br /&gt;
Your comments on this topic are welcome!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-8951008304523784002?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/8951008304523784002/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html#comment-form" title="8 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8951008304523784002?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8951008304523784002?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html" title="Big-Ass Servers™ and the myths of clusters in bioinformatics" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://farm3.static.flickr.com/2758/4400143436_50bcd3843a_t.jpg" height="72" width="72" /><thr:total>8</thr:total></entry><entry gd:etag="W/&quot;DEMBRHk_eyp7ImA9WhdQFko.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-5124241475783381776</id><published>2011-03-15T13:17:00.001-04:00</published><updated>2011-08-18T10:07:35.743-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-08-18T10:07:35.743-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="rstudio" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>RStudio: My thoughts</title><content type="html">&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://lh6.googleusercontent.com/-EJR0YXYNFs8/TX-ZZG0OEuI/AAAAAAAAA9U/BBGxAC8KMcU/s1600/Screen+shot+2011-03-15+at+12.52.22+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="416" src="https://lh6.googleusercontent.com/-EJR0YXYNFs8/TX-ZZG0OEuI/AAAAAAAAA9U/BBGxAC8KMcU/s640/Screen+shot+2011-03-15+at+12.52.22+PM.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Let me get this out of the way: I just love &lt;a href="http://www.rstudio.org/"&gt;RStudio&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Created by a team lead by &lt;a href="http://en.wikipedia.org/wiki/Joseph_J._Allaire"&gt;JJ Allaire&lt;/a&gt;, a name that should ring a bell if you were involved in web development during the Clinton administration, RStudio is an R IDE that is actually designed for R from the ground up. RStudio works on Linux, Mac, and Windows platforms, and can even run over the web.&lt;br /&gt;
&lt;br /&gt;
While borrowing many of the best features from ESS, the Mac R-GUI, and maybe Anup Parikh's &lt;a href="http://www.red-r.org/"&gt;Red-R,&lt;/a&gt; RStudio provides solutions to several long-standing barriers that have hampered R code development. For instance, to do Sweave-&amp;amp;gt;tex-&amp;amp;gt;pdf (then view the pdf) in ESS was a frustrating, arthritic &lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;(M-n s M-n P)&lt;/span&gt;process that flummoxed even the &lt;a href="http://comments.gmane.org/gmane.emacs.ess.general/4873"&gt;greatest minds of our generation&lt;/a&gt;. RStudio has a handy button (Compile PDF) that brings you all the way from .Rnw to Acrobat. Although this command appears to run in its own session, leading to some &lt;a href="http://support.rstudio.org/help/discussions/suggestions/9-sweave-and-the-r-session-state"&gt;unexpected behavior&lt;/a&gt; compared to running Sweave from the command line, the fact that this IDE is already geared for Sweave bodes well for future development.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-NAyU9tRwqX8/TX-ctV_SuXI/AAAAAAAAA9Y/6__PXsxDe9g/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="https://lh3.googleusercontent.com/-NAyU9tRwqX8/TX-ctV_SuXI/AAAAAAAAA9Y/6__PXsxDe9g/s1600/Screen+shot+2011-03-15+at+12.20.03+PM.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;This is fucking genius&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
The movement of commands back and forth from console to editor is another task that other editors made unnecessarily difficult - the old Mac R GUI console would not let you copy-and-paste a subset of the history, ESS was always geared to having users write code in the editor then executing lines but never writing code in the console then committing to the script. RStudio provides means of easily going in either direction. Control over multiple plots (solving both the overwritten X-Window and the annoying type=Cairo PNG problem on OS X) is a welcome relief.&lt;br /&gt;
&lt;br /&gt;
RStudio offers very good autocompletion for such a relatively weird language - in addition to package methods it is aware of data frame columns and user-defined functions, for instance. &lt;br /&gt;
&lt;br /&gt;
RStudio has already garnered a good number of &lt;a href="http://support.rstudio.org/help/discussions/suggestions"&gt;suggestions&lt;/a&gt;. Here's personally what I would like to see:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;More support for LaTeX markup, including menu driven formatting options so users don't have to memorize stuff like \textbf{}&lt;/li&gt;
&lt;li&gt; More built-in aesthetic support for &lt;a href="http://had.co.nz/ggplot2/"&gt;ggplot2&lt;/a&gt;, something where users are given a WYSIWYG manipulating an existing plot similar to Jeroen Ooms' &lt;a href="http://yeroon.net/ggplot2/"&gt;ggplot2 web application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A non-sudo Linux binary and a method of specifying different R and TeX installations kicking around a server without re-installing from source.&lt;/li&gt;
&lt;li&gt;Better control over the working directory (&lt;a href="http://flowingdata.com/2011/03/02/rstudio-a-new-ide-for-r-that-makes-coding-easier/"&gt;already reported and a likely future feature&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A means of quickly seeing where source files are actually located without mouseover&lt;/li&gt;
&lt;li&gt;Integration with version control.&lt;/li&gt;
&lt;li&gt;Code cleanup and indenting &lt;/li&gt;
&lt;/ol&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-5124241475783381776?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/5124241475783381776/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2011/03/rstudio-my-thoughts.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/5124241475783381776?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/5124241475783381776?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2011/03/rstudio-my-thoughts.html" title="RStudio: My thoughts" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://lh6.googleusercontent.com/-EJR0YXYNFs8/TX-ZZG0OEuI/AAAAAAAAA9U/BBGxAC8KMcU/s72-c/Screen+shot+2011-03-15+at+12.52.22+PM.png" height="72" width="72" /><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;DU8BQ3Y-cCp7ImA9Wx9QFkg.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-4715969279352296289</id><published>2010-12-23T15:42:00.000-05:00</published><updated>2010-12-29T16:24:12.858-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-12-29T16:24:12.858-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="BSgenome" /><category scheme="http://www.blogger.com/atom/ns#" term="BioConductor" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Chromosome bias in R, my notebook</title><content type="html">My goal is to develop a means of detecting chromosome bias from a human BAM file.&lt;br /&gt;
&lt;br /&gt;
Because I've been working with proprietary and novel plant genomes for the last three years, I haven't had the chance to use any of the awesome UCSC-based annotational features that have been introduced and refined in Bioconductor until now. I've returned to biomedical research and I have some catching up to do.&lt;br /&gt;
&lt;br /&gt;
BSgenome might sound like horsecrap, but each &lt;b&gt;B&lt;/b&gt;io&lt;b&gt;s&lt;/b&gt;trings-based &lt;b&gt;genome&lt;/b&gt; data package is actually a huge digested version of a UCSC/NCBI genome freeze and basic sequence annotation compiled into R objects.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/BSgenome.html"&gt;BSgenome at Bioconductor&lt;/a&gt;&lt;br /&gt;
&lt;blockquote&gt;Be careful with googling bioconductor help - often the results point to older versions. Make sure your link has "release" in the url.&lt;/blockquote&gt;&lt;br /&gt;
Here are the BSgenomes available today:&lt;br /&gt;
&lt;pre class="brush: shell"&gt;&amp;gt; available.genomes(type=getOption("pkgType"))
BioC_mirror = http://www.bioconductor.org
Change using chooseBioCmirror().
 [1] "BSgenome.Amellifera.BeeBase.assembly4" "BSgenome.Amellifera.UCSC.apiMel2"     
 [3] "BSgenome.Athaliana.TAIR.01222004"      "BSgenome.Athaliana.TAIR.04232008"     
 [5] "BSgenome.Btaurus.UCSC.bosTau3"         "BSgenome.Btaurus.UCSC.bosTau4"        
 [7] "BSgenome.Celegans.UCSC.ce2"            "BSgenome.Celegans.UCSC.ce6"           
 [9] "BSgenome.Cfamiliaris.UCSC.canFam2"     "BSgenome.Dmelanogaster.UCSC.dm2"      
[11] "BSgenome.Dmelanogaster.UCSC.dm3"       "BSgenome.Drerio.UCSC.danRer5"         
[13] "BSgenome.Drerio.UCSC.danRer6"          "BSgenome.Ecoli.NCBI.20080805"         
[15] "BSgenome.Ggallus.UCSC.galGal3"         "BSgenome.Hsapiens.UCSC.hg17"          
[17] "BSgenome.Hsapiens.UCSC.hg18"           "BSgenome.Hsapiens.UCSC.hg19"          
[19] "BSgenome.Mmusculus.UCSC.mm8"           "BSgenome.Mmusculus.UCSC.mm9"          
[21] "BSgenome.Ptroglodytes.UCSC.panTro2"    "BSgenome.Rnorvegicus.UCSC.rn4"        
[23] "BSgenome.Scerevisiae.UCSC.sacCer1"     "BSgenome.Scerevisiae.UCSC.sacCer2"    
&lt;/pre&gt;Select and load hg19&lt;br /&gt;
&lt;pre class="brush: shell"&gt;biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")
&lt;/pre&gt;&lt;br /&gt;
When we get an alignment file one of the first things we want to do is look for red flags that might indicate something went awry in the lab or downstream. An example is chromosome bias - are we seeing more reads aligned to certain chromosomes than would be expected on size alone? A sticky question, since any experiment will introduce confounds based on the inherent uneven distribution of interesting genomic features, not to mention mapability. And yet I think this is still a worthwhile exercise and should be part of any ngs sequencing pipeline.&lt;br /&gt;
&lt;br /&gt;
What we don't want to do is ignore that 7.6% of the GRCh37 freeze is sequence that looks like "NNNNNNN" - gaps representing unsequencable regions such as centromeres, scaffold gap delinations, and the like. We especially don't want to ignore gaps because they are not evenly distributed across the chromosomes (chrY is 56% gaps).&lt;br /&gt;
&lt;br /&gt;
Raw chromosome length can be obtained from the BAM file header, but for this chromosome bias analysis I need the "non-gappy" length, the portion eligible for alignment. This is one of the "masks" turned on by default for BSgenomes in order to allow various functions to work properly (see MaskCollection in the &lt;a href="http://www.bioconductor.org/help/bioc-views/release/bioc/html/IRanges.html"&gt;IRanges&lt;/a&gt; package for more information).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush: shell"&gt;&amp;gt; masks(Hsapiens)
Error in function (classes, fdef, mtable)  : 
  unable to find an inherited method for function masks, for signature "BSgenome"
#oops I see masks are a member of MaskedDNAString objects (i.e. chromosomes) not BSgenome objects
&amp;gt; masks(Hsapiens$chrY)
MaskCollection of length 4 and width 59373566
masks:
  maskedwidth maskedratio active names                               desc
1    33720000 0.567929506   TRUE AGAPS                      assembly gaps
2           0 0.000000000   TRUE   AMB   intra-contig ambiguities (empty)
3    16024357 0.269890426  FALSE    RM                       RepeatMasker
4      587815 0.009900281  FALSE   TRF Tandem Repeats Finder [period&amp;lt;=12]
all masks together:
  maskedwidth maskedratio
     49783032   0.8384713
all active masks together:
  maskedwidth maskedratio
     33720000   0.5679295
#I think the maskedwidth should reveal sum of actively masked nucleotides
&amp;gt; maskedwidth(Hsapiens$chrY)
[1] 33720000
#can we mess with the masks?
&amp;gt; active(masks(Hsapiens$chrY))["RM"]&amp;lt;-TRUE
Error in `$&amp;lt;-`(`*tmp*`, "chrY", value = &amp;lt; S4 object of class "MaskedDNAString"&amp;gt;) : 
  no method for assigning subsets of this S4 class
#oops I can't manipulate a BSgenome this way - it is behaving like a class instead of an instance of a class
&amp;gt; chrY&amp;lt;-Hsapiens$chrY
&amp;gt; active(masks(chrY))["RM"]&amp;lt;-TRUE
&amp;gt; maskedwidth(chrY)
[1] 49744357
# ok maskedwidth is working as I figured, but i need unmasked width
&amp;gt; unmaskedWidth&amp;lt;-function(chr){length(chr)-maskedwidth(chr)}
&amp;gt; unmaskedWidth(Hsapiens$chrY)
[1] 25653566
#how can I iterate over something with a $ operator? let's try [[]]
&amp;gt; unmaskedWidth(Hsapiens[["chrY"]])
[1] 25653566
&lt;/pre&gt;Now I want to create a data frame of with sequence names and unmaskedWidths to go with some read counts from a BAM file. Whenever I want to go from a list, through a function, to a data frame I think &lt;a href="http://had.co.nz/plyr/"&gt;plyr&lt;/a&gt;, specifically ldply (&lt;b&gt;l&lt;/b&gt;ist to &lt;b&gt;d&lt;/b&gt;ata frame).&lt;br /&gt;
&lt;pre class="brush: shell"&gt;# let's take chr 1-22,X,Y, skipping the unscaffolded sequences and mitochondrial chr
&amp;gt; maskedSizes&amp;lt;-ldply(.data=seqnames(Hsapiens)[1:24],
  .fun=function(x){
    data.frame(chr=x,seqlength=length(Hsapiens[[x]]),
    unmaskedWidth=unmaskedWidth(Hsapiens[[x]]))},
  .progress="text",
  .parallel=TRUE)
&amp;gt; maskedSizes
                     chr seqlength unmaskedWidth
1                   chr1 249250621     225280621
2                   chr2 243199373     238204518
3                   chr3 198022430     194797135
4                   chr4 191154276     187661676
5                   chr5 180915260     177695260
6                   chr6 171115067     167395066
7                   chr7 159138663     155353663
8                   chr8 146364022     142888922
9                   chr9 141213431     120143431
10                 chr10 135534747     131314738
11                 chr11 135006516     131129516
12                 chr12 133851895     130481393
13                 chr13 115169878      95589878
14                 chr14 107349540      88289540
15                 chr15 102531392      81694766
16                 chr16  90354753      78884753
17                 chr17  81195210      77795210
18                 chr18  78077248      74657229
19                 chr19  59128983      55808983
20                 chr20  63025520      59505520
21                 chr21  48129895      35106642
22                 chr22  51304566      34894545
23                  chrX 155270560     151100560
24                  chrY  59373566      25653566
&lt;/pre&gt;&lt;br /&gt;
Load the BAM file and get read counts in a data frame.&lt;br /&gt;
&lt;pre class="brush: shell"&gt;#other methods include scanBam and readAligned
bamFile&amp;lt;-readBamGappedAlignments("myIndexedSortedBamFile.bam")
&amp;gt; levels(rname(bamFile))
 [1] "1"          "2"          "3"          "4"          "5"         
 [6] "6"          "7"          "8"          "9"          "10"        
[11] "11"         "12"         "13"         "14"         "15"        
[16] "16"         "17"         "18"         "19"         "20"        
[21] "21"         "22"         "X"          "Y"          "MT"        
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
#the deflines in my reference do not match the BSgenome names, must fix at least the chromosomes of interest
levels(rname(bamFile))[1:25]&amp;lt;-paste('chr',c(1:22,'X','Y','M'),sep='')

#run length encoded read counts per chromosome
readRle&amp;lt;-rname(bamFile)

#get a data frame with chromosome and read counts
allReadsDf&amp;lt;-ldply(runValue(readRle),function(x){data.frame(chr=levels(runValue(readRle))[x],reads=runLength(readRle)[x])})
&amp;gt; head(allReadsDf)
   chr   reads
1 chr1 3616909
2 chr2 3642052
3 chr3 2843019
4 chr4 2636141
5 chr5 2590352
6 chr6 2497123
&lt;/pre&gt;&lt;br /&gt;
Merge the read counts with unmasked chromosome lengths and plot their relationship.&lt;br /&gt;
&lt;pre class="brush: shell"&gt;chrSizesReads&amp;lt;-merge(maskedSizes,readCounts,sort=FALSE)
library(ggplot2)
p&amp;lt;-ggplot(data=chrSizesReads, aes(x=unmaskedWidth, y=reads, label=chr)) + 
  geom_point() +
  geom_text(vjust=2,size=3) +
  stat_smooth(method="lm", se=TRUE,level=0.95) +
  ylab("Reads aligned") +
  xlab("Unmasked chromosome size") +
  opts(title = "Reads vs Chromosome Size")
print(p)
&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_AWkLjPyyn0g/TRNu72gZjoI/AAAAAAAAA2s/8J1UZ_MmuIg/s1600/readChr.png" imageanchor="1"&gt;&lt;img border="0" height="305" src="http://2.bp.blogspot.com/_AWkLjPyyn0g/TRNu72gZjoI/AAAAAAAAA2s/8J1UZ_MmuIg/s320/readChr.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;There should be a strong linear relationship between read count and chromosome size. We can test this using a linear regression model, the null hypothesis being the number of reads aligned to a chromosome is independent of its size.  &lt;br /&gt;
&lt;pre class="brush: shell"&gt;&amp;gt; mylm&amp;lt;-lm(reads~unmaskedWidth,data=chrSizesReads)
&amp;gt; mysummary&amp;lt;-summary(mylm)
&amp;gt; mysummary

Call:
lm(formula = reads ~ unmaskedWidth, data = chrSizesReads)

Residuals:
    Min      1Q  Median      3Q     Max 
-271816 -108122  -43984   42826  676284 

Coefficients:
               Estimate Std. Error t value Pr(&amp;gt;|t|)    
(Intercept)   1.774e+05  9.505e+04   1.866   0.0754 .  
unmaskedWidth 1.455e-02  7.145e-04  20.365 9.12e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 206600 on 22 degrees of freedom
Multiple R-squared: 0.9496, Adjusted R-squared: 0.9473 
F-statistic: 414.8 on 1 and 22 DF,  p-value: 9.123e-16 
&lt;/pre&gt;The low p-value (that chr size has no influence) and R-squared (predictive value of the linear model) suggest this model is sound.&lt;br /&gt;
&lt;br /&gt;
The following plot is obtained from the standardized residuals (the standardized difference between data observed and values expected) of the linear model described earlier.&lt;br /&gt;
&lt;br /&gt;
Chromosome bias refers to uneven read alignment distribution across various chromosomes. We can expect some chromosome bias in treatment sets because of the inherient nature any experimental conditions - recovered fragments will not be evenly distributed among chromosomes because regions of affect are not evenly distributed. Other possible factors of chromosome bias include heterochromatin, uneven repeat content, and the potential for aligning the against an incorrect set of sex chromsomes. Aligners will typically randomly, evenly, assign discrete positions to reads which map ambiguously to multiple locations. &lt;br /&gt;
&lt;pre class="brush: shell"&gt;&amp;gt; p&amp;lt;-qplot(chrSizesReads$chr,rstandard(mylm))+
   aes(label=chrSizesReads$chr)+
   geom_text(vjust=2,size=3)+
   xlab("Chromosome")+
   ylab("Std Residual from lm (reads)")+
   geom_abline(slope=0,intercept=0)+
   opts(axis.text.x = theme_text(angle=45,hjust=1))+
   opts(title = "Linear Regression Residuals")
&amp;gt; print(p)
&lt;/pre&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_AWkLjPyyn0g/TRO5Mw9_52I/AAAAAAAAA28/AmoSfWK6A0Y/s1600/resid.png" imageanchor="1" style=""&gt;&lt;img border="0" height="320" width="320" src="http://4.bp.blogspot.com/_AWkLjPyyn0g/TRO5Mw9_52I/AAAAAAAAA28/AmoSfWK6A0Y/s320/resid.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
Fortunately, there is no clear pattern to these residual values, which would indicate some model problems, but with a Z-score of 3.36, chrX appears to be an outlier. With 46M total alignments this is certainly not due to sampling error, but we can still test our observation with a Lund statistic. &lt;br /&gt;
&lt;pre class="brush: shell"&gt;#http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code
&amp;gt;lundcrit&amp;lt;-function(a, n, q) {
 F&amp;lt;-qf(c(1-(a/n)),df1=1,df2=n-q-1,lower.tail=TRUE)
 crit&amp;lt;-((n-q)*F/(n-q-1+F))^0.5
 crit
}
&amp;gt; n&amp;lt;-nrow(chrSizesReads)
&amp;gt; q&amp;lt;-length(mylm$coefficients)
&amp;gt; crit&amp;lt;-lundcrit(0.05,n,q)
&amp;gt; chrSizesReads[which(rstandard(mylm)&amp;gt;crit),"chr"]
[1] chrX
&lt;/pre&gt;&lt;br /&gt;
Happy holidays!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-4715969279352296289?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/4715969279352296289/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/12/chromosome-bias-in-r-my-notebook.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4715969279352296289?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4715969279352296289?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/12/chromosome-bias-in-r-my-notebook.html" title="Chromosome bias in R, my notebook" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_AWkLjPyyn0g/TRNu72gZjoI/AAAAAAAAA2s/8J1UZ_MmuIg/s72-c/readChr.png" height="72" width="72" /><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;AkYNRn07eSp7ImA9WhRUEUw.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-3210463591938969866</id><published>2010-12-09T17:41:00.000-05:00</published><updated>2012-01-20T22:56:37.301-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-01-20T22:56:37.301-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="bash history" /><category scheme="http://www.blogger.com/atom/ns#" term="histfile" /><title>Directory-based bash histories</title><content type="html">Using a directory-based bash history allows for a record of shell actions on a directory basis, so a group of developers have some record of what was done while in a directory, when, and by whom. This can be helpful when trying to reconstruct history with limited documentation.&lt;br /&gt;
&lt;br /&gt;
I know this setup will be of some benefit to my successor at my previous job because he has access to everything I ever did in any project directory.&lt;br /&gt;
&lt;br /&gt;
Place this code in your ~/.bash_profile or ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
(type source ~/.bash_profile (or .bashrc) to load this for your current session)&lt;br /&gt;
&lt;br /&gt;
&lt;script src="https://gist.github.com/1651133.js"&gt; &lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-3210463591938969866?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/3210463591938969866/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/12/directory-based-bash-histories.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/3210463591938969866?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/3210463591938969866?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/12/directory-based-bash-histories.html" title="Directory-based bash histories" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;DUcEQnY8eip7ImA9Wx5QEko.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-938423623754390009</id><published>2010-08-30T14:12:00.000-04:00</published><updated>2010-08-31T13:50:03.872-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-08-31T13:50:03.872-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="NGS" /><category scheme="http://www.blogger.com/atom/ns#" term="Samtools" /><title>NGS viewers reviewed</title><content type="html">I gathered up some of the recent &lt;i&gt;free&lt;/i&gt; next generation sequence viewers that were capabale of viewing BAM files - and put each through the motions with a few BAM files and reference sequences of various sizes. While there are some great ideas and several choices to be found along the feature spectrum, I think we are still in the dark ages with this stuff. No viewer has really been able to entirely combine usability with performance and analysis capabilities, let alone extensibility and web connectivity.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/oeGoMtCA47G7fFzw89a96Q?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img src="http://lh6.ggpht.com/_AWkLjPyyn0g/THgnza2cQOI/AAAAAAAAAws/zfMefRXdjcI/s288/tview.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://samtools.sourceforge.net/"&gt;tview&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; tview is the barebones, text-rendered viewer that is included with Samtools. People who favor this as their BAM viewer probably think Vim is too polished. Even the very limited navigation is remarkably unituitive (goto a coordinate requires chr:position even if you have just one chromosome, no errors are displayed if you forget chr).&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Oregon Trail&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; command line access&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; text display&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/WexFGlirQLnEUmeBpTZ4CQ?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img src="http://lh4.ggpht.com/_AWkLjPyyn0g/THgnX24E9vI/AAAAAAAAAwU/9XRVWMci3D0/s288/bamview.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://bamview.sourceforge.net/"&gt;BamView&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; Bamview is a wicked fast simple BAM file viewer. It doesn't have much in the way of features, but for cursory examination of BAM files it is more palatable than tview.&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Burgertime&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; strand split screen&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; drag selecting a region turns it red, and umm... that's all it does&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/gey5Qp6K7mT-ZErxkbN3IA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img src="http://lh5.ggpht.com/_AWkLjPyyn0g/THgnYNL8vDI/AAAAAAAAAwY/9LuXp6LTh78/s288/genoviewer-2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://www.genoviewer.com/"&gt;GenoViewer&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; The only NGS viewer endorsed by &lt;a href="http://www.youtube.com/watch?v=--Vaz9jW054"&gt;Speak&lt;/a&gt; the Hungarian rapper, unfortunately this recent entrant leaves a lot to be desired in terms of performance with large files. GenoViewer is very hard on the eyes - the indiscriminate use of primary colors looks like a kid somehow vomited up the ball pit at Chuck E. Cheese.&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Centipede&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; promises that "You will not get lost in the details, and can easily figure out the true meaning behind the data. Guaranteed."&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; graphics&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/TqnwOx1tMnM8vcYS7_lHSg?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img src="http://lh6.ggpht.com/_AWkLjPyyn0g/THgnY4U662I/AAAAAAAAAwk/dv0IKUlYoWs/s288/magicviewer.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://bioinformatics.zj.cn/magicviewer/"&gt;MagicViewer&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; MagicViewer has come a long way since its initial release last December. With its interesting pie-chart icon renderings of SNP purity, decent treatment of annotation tracks, and improved performance, MagicViewer might soon be a contender to Tablet in the midweight category. The navigation is workable but takes some getting used to - it's unclear when to use scroll bars vs. arrows&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Moon Patrol&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; primer design tool&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; some regions are simply not visible - no reference, no reads&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/yPwsdAPqe8I5vM4gtHq_bA?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img src="http://lh6.ggpht.com/_AWkLjPyyn0g/THgnYPmNy5I/AAAAAAAAAwc/RdiMvbqIjAg/s288/tablet.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://bioinf.scri.ac.uk/tablet/"&gt;Tablet&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; Hands down the most attractive of the viewers, Tablet is aimed at fostering a delicate balance between performance, features, and aesthetics. Tablet comes with a suite of read views - Packed, Stacked, and Classic - to suit both young children and elderly scientists alike. GFF feature files can be loaded but they appear to merely serve as position indices.&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;SimCity&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; interface&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; read insertions not displayed correctly&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/Scp_X0E2C49rvYzqX8_CFA?feat=embedwebsite" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img src="http://lh3.ggpht.com/_AWkLjPyyn0g/THgnYvVoHnI/AAAAAAAAAwg/yf2A5Y6Xhy8/s280/igv.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://www.broadinstitute.org/igv/"&gt;IGV&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; The Integrative Genomics Viewer is a serious tool for exploring and analyzing large datasets. In addition to viewing, IGV is designed to allow users to extract the kind of hard publishable data that has typically been the domain of Bio* scripts. Like the true product of an Ivy League education, IGV can appear aloof and arrogant to newcomers. The viewer will let you load BAM files and other annotation tracks that have nothing to do with the reference without comment or guidance, then require an extra unitutive click to actually generate a view. While not the easiest or even the best in performance, in terms of generating real queries on your data from the viewer there is nothing comparable.&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Dig Dug&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; analysis tools&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; throws a hissy fit if it cannot connect to home server&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://picasaweb.google.com/lh/photo/8pD4iTemnoCaiTx1EpeAug?feat=embedwebsite" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img src="http://lh3.ggpht.com/_AWkLjPyyn0g/THvuSlXB-qI/AAAAAAAAAxw/Vhk4ipIHkJM/s288/gb2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span style="font-size: large;"&gt;&lt;a href="http://gmod.org/wiki/GBrowse"&gt;gbrowse2&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;
&lt;b&gt;My take:&lt;/b&gt; gbrowse2 is the AJAXified protege to the venerable generic genome browser. Samtools integration is a recent addition to this highly extensible platform, which has been used for years to display everything from large genomes to small sequencing projects. Of all the viewers here, GB2 provides the best set of visualization tools, such that virtually any biological information that can be rendered linearly has been done so as gbrowse tracks. The lone true web application, gbrowse genomic positions can be hyperlinked or even snapshot-embedded on the web. The web provides the best platform to share visual genomic data among several users. However, a gbrowse2 installation with BAM tracks can be a massive pain to install, configure, and debug ("landmark chrI not found" is the most popular google search in all of bioinformatics). Novices can expect a minefield of historical gotchas and arcane conventions ("Name=" not "name=" field in gff3, bp_seqfeature_load instead of bp_load_gff for gff3, no validator for conf files), and even experienced users are often baffled by cryptic errors that pop up in server logs.&lt;br /&gt;
&lt;b&gt;Most resembles which video game: &lt;/b&gt;Sissyfight 2000&lt;br /&gt;
&lt;b&gt;Good standout feature:&lt;/b&gt; hyperlinks&lt;br /&gt;
&lt;b&gt;Bad feature:&lt;/b&gt; setup&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-938423623754390009?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/938423623754390009/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/08/ngs-viewers-reviewed.html#comment-form" title="9 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/938423623754390009?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/938423623754390009?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/08/ngs-viewers-reviewed.html" title="NGS viewers reviewed" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh6.ggpht.com/_AWkLjPyyn0g/THgnza2cQOI/AAAAAAAAAws/zfMefRXdjcI/s72-c/tview.png" height="72" width="72" /><thr:total>9</thr:total></entry><entry gd:etag="W/&quot;CkYGSHw6eip7ImA9Wx5REk8.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-918029139414078057</id><published>2010-08-18T12:42:00.000-04:00</published><updated>2010-08-19T08:15:29.212-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-08-19T08:15:29.212-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="NGS" /><title>My thoughts on the acquisition of Ion Torrent by Life Technologies</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.iontorrent.com/lib/images/products/pgm_images.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 333px; height: 198px;" src="http://www.iontorrent.com/lib/images/products/pgm_images.jpg" border="0" alt="" /&gt;&lt;/a&gt;Yesterday it was announced that Ion Torrent, makers of the Ion Personal Genome Machine (PGM) sequencer, would be acquired by &lt;strike&gt;Invitrogen&lt;/strike&gt; &lt;strike&gt;ABI&lt;/strike&gt; Life Technologies for $375M in cash and stock, with the possibility of another $350M if various milestones are met.&lt;br /&gt;
&lt;br /&gt;
Gregory Lucier, chairman and CEO of Life Technologies, said, "We believe Ion Torrent's technology will represent a profound change for the life sciences industry, as fundamental as the one we saw with the introduction of qPCR." That analogy might not sound earth-shattering, but I suspect he is using qPCR as an example because it is one of the few molecular biology (as opposed to biochemical) techniques regularly used in clinical settings.&lt;br /&gt;
&lt;br /&gt;
The PGM is a unique machine because it is the first second generation sequencer (tentative &lt;a href="https://spreadsheets.google.com/ccc?key=0AvaxS3m5rl-9dHdtUGRtaGlsZWNFNWJleDRXaUhQTHc&amp;hl=en#gid=0"&gt;specs&lt;/a&gt;: ~3 million 200bp reads at $500 1hr run, $50k machine) that could be conceivably leased by a small clinical testing facility, like the ones fed by those LabCorp boxes you see scattered all over strip malls. Bear in mind even a capillary-based sanger sequencer like the ABI3730xl costs a whopping &lt;a href="http://webcache.googleusercontent.com/search?q=cache:Zr-H90V_YUcJ:www.genomeweb.com/sequencing/used-abi-3730xls-are-%E2%80%98flooding-market%E2%80%99-centers-shift-next-gen-sequencers+3730xl+price&amp;cd=7&amp;hl=en&amp;ct=clnk&amp;gl=us&amp;client=firefox-a"&gt;$375,000&lt;/a&gt;, which comes as a shock to those of us who mostly work with next-generation sequencing data. You can see not only why the PGM is really a game changer, but how it might fit into Life Technologies offerings.&lt;br /&gt;
&lt;br /&gt;
Exactly which clinical diagnostics are, or will be, suited for this machine are unclear. With the right foolproof software this could replace a lot of PCR and microarray-based tests. Read lengths and throughput are bound to increase just as they did with Solexa. Future applications like tumor sequencing and 16S rRNA microbiome sequencing have not even entered into medical practice yet.&lt;br /&gt;
&lt;br /&gt;
The key to Life Technology succeeding in these areas will likely come down to non-technical challenges:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Getting FDA approval for the PGM as a medical device and for various protocols based on the PGM as approved clinical tests. Life Technology has experience with this process. Ion Torrent clearly does not.&lt;/li&gt;
&lt;li&gt;Convincing insurance companies and HMOs that these tests are cost-effective diagnostics. When you consider the exorbitant cost of other tests - e.g. several MRIs over a course of chemotherapy - this might not be such hard sell.&lt;/li&gt;
&lt;li&gt;Developing a business model that will allow small clinics to lease the machine for little or even no cost provided they agree to purchase a minimum amount of consumables. Life Technologies will likely market the PGM to smaller labs who themselves will inevitably face competition from larger dedicated sequencing centers trying to centralize this type of work using bigger machines from Illumina or PacBio (or even LT's SOLiD).&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
Although these obstacles seem daunting, I have sat through academic departmental meetings where very knowledgeable sequencer salespeople were invited back multiple times only to be jerked around as the faculty hemmed and hawed over whether this was the right $400k machine to buy at the time. That was undoubtedly an expensive and frustrating sales process for Solexa and 454. With the PGM, Life Technologies can pitch this much cheaper machine to an albeit different group of skeptics, clinical lab managers, but one with a clear bottom line in mind.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-918029139414078057?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/918029139414078057/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/08/my-thoughts-on-acquisition-of-ion.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/918029139414078057?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/918029139414078057?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/08/my-thoughts-on-acquisition-of-ion.html" title="My thoughts on the acquisition of Ion Torrent by Life Technologies" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;DEMBR3k5fCp7ImA9WxBbEkk.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6388825566068777403</id><published>2010-03-10T13:21:00.000-05:00</published><updated>2010-03-10T13:40:56.724-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-03-10T13:40:56.724-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="perl" /><title>Use local instead of my in Perl when using $$ (double dollar sign) inside an enclosed block</title><content type="html">I thought I would be all clever and initialize several hash indices at once using the $$ notation in Perl - evaluating strings as variables. Unfortunately using "my" screws this up bigtime.&lt;br /&gt;&lt;br /&gt;This must be another circumstance spelled out in the incomprehensible "my" treatise:&lt;br /&gt;&lt;a href="http://perldoc.perl.org/perlsub.html#Private-Variables-via-my%28%29"&gt;http://perldoc.perl.org/perlsub.html#Private-Variables-via-my()&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: perl"&gt;&lt;br /&gt;my %foo;&lt;br /&gt;$foo{'bar'}=1;&lt;br /&gt;foreach $fooDex(qw(foo)){&lt;br /&gt;    $$fooDex{'bar'}=2;&lt;br /&gt;}&lt;br /&gt;print $foo{'bar'}."\n"; #prints 1&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: perl"&gt;&lt;br /&gt;local %foo;&lt;br /&gt;$foo{'bar'}=1;&lt;br /&gt;foreach $fooDex(qw(foo)){&lt;br /&gt;    $$fooDex{'bar'}=2;&lt;br /&gt;}&lt;br /&gt;print $foo{'bar'}."\n"; #prints 2&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This is one of those solutions that gives you a feeling of dread instead of accomplishment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6388825566068777403?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6388825566068777403/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/03/use-local-instead-of-my-in-perl-when.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6388825566068777403?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6388825566068777403?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/03/use-local-instead-of-my-in-perl-when.html" title="Use local instead of my in Perl when using $$ (double dollar sign) inside an enclosed block" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;CEQASHo8fSp7ImA9WxFSE0k.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6723366348517987937</id><published>2010-03-09T16:57:00.000-05:00</published><updated>2010-04-15T10:39:09.475-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-15T10:39:09.475-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="BioConductor" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Getting the basics from readAligned</title><content type="html">The &lt;a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq"&gt;UCR&lt;/a&gt; guide is a little sparse with regard to getting basic information from readAligned.&lt;br /&gt;&lt;br /&gt;I'd like to add to the general cookbook. If some bioc people out there can contribute some alignment recipes can fill me in on some more basics please comment:&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: shell"&gt;&lt;br /&gt;alignedReads &lt;- readAligned("./", pattern="output.bowtie", type="Bowtie")&lt;br /&gt;&lt;br /&gt;#how many reads did I attempt to align&lt;br /&gt;#i don't think you can't get this from alignedReads&lt;br /&gt;&lt;br /&gt;#how many reads aligned (one or more times)&lt;br /&gt;length(unique(id(alignedReads)))&lt;br /&gt;&lt;br /&gt;#how many hits were there?&lt;br /&gt;length(alignedReads)&lt;br /&gt;&lt;br /&gt;#how many reads produced multiple hits&lt;br /&gt;length(unique(id(alignedReads[srduplicated(id(alignedReads))])))&lt;br /&gt;&lt;br /&gt;#how many reads produced multiple hits at the best strata?&lt;br /&gt;#please fill me in on this one&lt;br /&gt;&lt;br /&gt;#how many reads aligned uniquely (with exactly one hit)&lt;br /&gt;length(unique(id(alignedReads)))-length(unique(id(alignedReads[srduplicated(id(alignedReads))])))&lt;br /&gt;&lt;br /&gt;#how many reads aligned uniquely at the best strata (the other hits were not as good)&lt;br /&gt;#please fill me in on this one&lt;br /&gt;&lt;br /&gt;#how many unique positions were hit? what if I ignore strand?&lt;br /&gt;#please fill me in on this one&lt;br /&gt;&lt;br /&gt;#how many converging hits were there (two query sequences aligned to the same genomic position)&lt;br /&gt;#please fill me in on this one&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6723366348517987937?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6723366348517987937/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/03/getting-basics-from-readaligned.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6723366348517987937?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6723366348517987937?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/03/getting-basics-from-readaligned.html" title="Getting the basics from readAligned" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;Ak4BR3o7eCp7ImA9Wx5TGU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-674636573435045694</id><published>2010-03-03T16:48:00.000-05:00</published><updated>2010-08-04T12:22:36.400-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-08-04T12:22:36.400-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="bioinformatics" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Quality trimming in R using ShortRead and Biostrings</title><content type="html">I wrote an R function to do soft-trimming, right clipping FastQ reads based on quality.&lt;br /&gt;
&lt;br /&gt;
This function has the option of leaving out sequences trimmed to extinction and will do left-side fixed trimming as well.&lt;br /&gt;
&lt;pre class="brush: shell"&gt;#softTrim
#trim first position lower than minQuality and all subsequent positions
#omit sequences that after trimming are shorter than minLength
#left trim to firstBase, (1 implies no left trim)
#input: ShortReadQ reads
#       integer minQuality
#       integer firstBase
#       integer minLength
#output: ShortReadQ trimmed reads
library("ShortRead")
softTrim&lt;-function(reads,minQuality,firstBase=1,minLength=5){
qualMat&lt;-as(FastqQuality(quality(quality(reads))),'matrix')
qualList&lt;-split(qualMat,row(qualMat))
ends&lt;-as.integer(lapply(qualList,
function(x){which(x &lt; minQuality)[1]-1}))
#length=end-start+1, so set start to no more than length+1 to avoid negative-length
starts&lt;-as.integer(lapply(ends,function(x){min(x+1,firstBase)}))
#use whatever QualityScore subclass is sent
newQ&lt;-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends),
quality=new(Class=class(quality(reads)),
quality=subseq(quality(quality(reads)),
start=starts,end=ends)),
id=id(reads))

#apply minLength using srFilter
lengthCutoff &lt;- srFilter(function(x) {
width(x)&gt;=minLength
},name="length cutoff")
newQ[lengthCutoff(newQ)]
}  
&lt;/pre&gt;&lt;br /&gt;
&lt;br /&gt;
To use:&lt;br /&gt;
&lt;pre class="brush: shell"&gt;library("ShortRead")
source("softTrimFunction.R") #or whatever you want to name this
reads&lt;-readFastq("myreads.fq") trimmedReads&lt;-softTrim(reads=reads,minQuality=5,firstBase=4,minLength=3) writeFastq(trimmedReads,file="trimmed.fq") &lt;/pre&gt;


I strongly recommend reading the excellent UC Riverside HT-Sequencing Wiki cookbook and tutorial if you wish to venture into using R for NGS handling. Among other things, it will explain how to perform casting if you have Solexa scaled (base 64) fastq files. The function should respect that.
&lt;a href="http://manuals.bioinformatics.ucr.edu/home/ht-seq"&gt;http://manuals.bioinformatics.ucr.edu/home/ht-seq&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-674636573435045694?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/674636573435045694/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2010/03/soft-trimming-in-r-using-shortread-and.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/674636573435045694?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/674636573435045694?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2010/03/soft-trimming-in-r-using-shortread-and.html" title="Quality trimming in R using ShortRead and Biostrings" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;CkAHR3k7eyp7ImA9Wx9SF0g.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6849014121716243135</id><published>2009-11-19T13:56:00.000-05:00</published><updated>2010-12-07T14:38:56.703-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-12-07T14:38:56.703-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="velvet" /><title>Using Vmatch to combine assemblies</title><content type="html">&lt;h3&gt;In praise of Vmatch&lt;/h3&gt;&lt;br /&gt;
If I could take only one bioinformatics tool with me to a desert island it would be &lt;a href="http://www.vmatch.de/"&gt;Vmatch&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
In addition to being a versatile alignment tool, Vmatch has many lesser-known features which also leverage suffix trees. The dbcluster option allows Vmatch to cluster similar sequences using Vmatch index files. This method is about 1000x faster than attempting to align all sequences against each other, no matter how clever the algorithm.&lt;br /&gt;
&lt;br /&gt;
Originally presented as means to cluster EST sequences, this option is useful for processing the output from de novo short read assemblers like Velvet and ABYSS. Vmatch -dbcluster allows us to easily create a non-redundant set of contig sequences originating from several assemblies.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Why would we want to combine assemblies? &lt;/h3&gt;&lt;br /&gt;
&lt;h4&gt;Every kmer is sacred&lt;/h4&gt;&lt;br /&gt;
&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 313px;" src="http://4.bp.blogspot.com/_AWkLjPyyn0g/SwR-nUjQhdI/AAAAAAAAAXA/oOVI8Rtvk3s/s640/svarP1.png" alt="" id="BLOGGER_PHOTO_ID_5405584666748028370" border="0" /&gt;&lt;p&gt;Why not just take the one with the highest N50? At the recent Genome Informatics meeting at CSHL, Inanc Birol (ABYSS) said "every kmer is sacred". Our experience with the de novo transcriptome assembly of plants has been that &lt;span style="font-weight: bold;"&gt;the best set of contigs is spread out all over the parameter space&lt;/span&gt;. Longer kmer and cvCut settings can produce a longer set of elite contigs at the expense of omitting lowly expressed (or spurious?) contigs that appear at less stringent settings.&lt;/p&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class="clear"&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;Reads held hostage&lt;/h4&gt;&lt;br /&gt;
&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 320px; height: 316px;" src="http://2.bp.blogspot.com/_AWkLjPyyn0g/SwR_tHG3sKI/AAAAAAAAAXQ/Wz59JpkAtiQ/s640/svarP2.png" alt="" id="BLOGGER_PHOTO_ID_5405585865730142370" border="0" /&gt;&lt;p&gt;To the point above, I would add "every cvCut is also sacred". That doesn't exactly roll off the tongue, but we have seen some instances in which an assembly will have a higher read usage at a cvCut of 10 than at 5. This paradox suggests there is a &lt;span style="font-weight: bold;"&gt;"contig read hostage"&lt;/span&gt; situation, in which reads critical to the formation of longer contigs can be held captive in low coverage short contigs. Raising the cvCut threshold frees up these critical reads to extend more plausible contigs and allowing additional reads from the pool to be recruited into the assembly.&lt;br /&gt;
&lt;/p&gt;&lt;br /&gt;
&lt;div class="clear"&gt;&lt;/div&gt;&lt;br /&gt;
&lt;h3&gt;What is a non-redundant set?&lt;/h3&gt;&lt;br /&gt;
The following sequences have some redundancies. NODE2 and NODE3 do not add any information.&lt;br /&gt;
The non-redundant set will have only NODE1 and NODE4.&lt;br /&gt;
&lt;pre&gt;&gt;NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGTTAAAGGGGTCCATACCTCTATTCTAGTC
CCGAAGGTTGGTAGCTAGACTCGGTGACCTAAAATGAGAACGGAAGAAACGGAGGTGACATCCATGGGGCTTGTCGTATATCCAATCATACCTTTGGAGAAGGAAGTTAAAG
GTCAAAACTTTAAAAACCATGAGGACCATTTTATCCTCGTCACTGTCGACTATGGTGAAGTGACCTAGCAGGTTTGTGTAGTTACTGTTTTGTAAGTGTAAAGTGCGTTGCT
GGCTATAGGAGTTTCGCATGAAACATGCTCGCTCTTGTGACCCATCGGTTACCAAGTTTACAGACTAGGTGAGGACTAGGTCACCTCTGTTTGGTAACCAAAAAGTAGAGAA
AGAATATCAACAAGGTATATACAACAACTTAGAGTAGCAAACCTTAAAAGAGTTGTCGTCAGTCACAAAGGGAGAGTTGTACTAAAGGGAACTTTTGTTCACAGAGCTAAAA
CTCATTAAGACCATTTTAATGTGGTCTACCAAGTCTGTCGAAAAGAAGTGGATGCTACGGCAGA
&gt;NODE2 a substring of NODE1
AATTCAGTTGAAGTAATAGAGGCAGCTGCTGTTAGAACTTCGCTACGACTAGCAACGCTATGTCGAGTTGTACCTTCCCACCTCGTATACAAGGAGCATGAAGTCATCAGCC
CTTTCTACAGAATCTAGGTCCGAAATGATGAAATTAAGAGAAAGACAACTGAAGGTTTTAGGAGGCAAGGTTTACAGGGTAATGAACACGGAAAAACCCACAAGCTAGGAAC
AGTGTGTCTTGGAGTTTAAACTGATTTGGTAGTAGTTCGAAAACAAATTGGAAGGGACATTTAAAGTCCGAGTTGACGTTATCTGAGACAACTTTGTCTTTAACCGACAGGG
AGTTGAGGTAGGAGAGAGTGTCCACATATTTAACGTTGTTCAGATATGGGATGTAGCAGTTGTAACCGAAGCATGTAGGAAGGT
&gt;NODE3 a reverse complement substring of NODE1
CGACAGACTTGGTAGACCACATTAAAATGGTCTTAATGAGTTTTAGCTCTGTGAACAAAAGTTCCCTTTAGTACAACTCTCCCTTTGTGACTGACGACAACTCTTTTAAGGT
TTGCTACTCTAAGTTGTTGTATATACCTTGTTGATATTCTTTCTCTACTTTTTGGTTACCAAACAGAGGTGACCTAGTCCTCACCTAGTCTGTAAACTTGGTAACCGATGGG
TCACAAGAGCGAGCATGTTTCATGCGAAACTCCTATAGCCAGCAACGCACTT
&gt;NODE4 a new sequence
TTACGAACGATAGCATCGATCGAAAACGCTACGCGCATCCGCTAAGCACTAGCATAATGCATCGATCGATCGACTACGCCTACGATCGACTAGCTAGCATCGAGCATCGATC
AGCATGCATCGATCGATCGAT
&lt;/pre&gt;&lt;br /&gt;
&lt;strike&gt;&lt;h3&gt;How do we create a non-redundant set?&lt;/h3&gt;&lt;br /&gt;
&lt;h4&gt;Do not use -nonredundant&lt;/h4&gt;&lt;br /&gt;
No, really. There is an option called -nonredundant which should presumably do what we want, but unfortunately that writes a "representative" member of each cluster to a file, which may or may not be the longest contig. I'm not sure what makes a sequence representative of a cluster, but for this application we want the &lt;span style="font-style: italic;"&gt;longest&lt;/span&gt; member of each cluster.&lt;/strike&gt;&lt;br /&gt;
&lt;em&gt;&lt;br /&gt;
On April 29th 2010, Vmatch 2.1.3 was released. The most important change is that the option -nonredundant now delivers the longest sequence from the corresponding cluster (instead&lt;br /&gt;
of an unspecified representative). This should make the longestSeq.pl approach unnecessary.&lt;/em&gt;&lt;br /&gt;
&lt;br /&gt;
To create a non-redundant set we will produce cluster files and then extract the longest sequence from each file. Use the following commands to produce your cluster files from an index:&lt;br /&gt;
&lt;pre class="brush: shell"&gt;mkvtree -allout -pl -db contigs1.fa contigs2.fa -dna -indexname myIndex
#mkvtree will accept multiple fasta or gzipped fasta files

#if using Vmatch 2.1.3 or later:
vmatch -d -p -l 25 -dbcluster 100 0 -v -nonredundant nonredundantset.fa myIndex &gt; mySeqs.rpt
# thx to Hamid Ashrafi for debugging this syntax


#if using older Vmatch
mkdir sequenceMatches 
#this is where vmatch will put each cluster

vmatch -d -p -l 25 -dbcluster 100 0 mySeqs "(1,0)" -v -s myIndex &gt; mySeqs.rpt
mv mySeqs*.match mySeqs*.fna sequenceMatches&lt;/pre&gt;&lt;br /&gt;
&lt;h3&gt;What do these options do?&lt;/h3&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;-d direct matches (forward strand)&lt;/li&gt;
&lt;li&gt;-p palindromic matches (reverse strand)&lt;/li&gt;
&lt;li&gt;-l search length (set this below your shortest sequence)&lt;/li&gt;
&lt;li&gt;-dbcluster queryPerc targetPerc - runs an internal alignment of the index created by mkvtree.&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;The two numeric arguments specify what percentage of your query sequence (the smaller) is involved in an alignment to the cluster sentinel/target superstring. For our purposes we require 100% of our query sequence substring to match the target. We don't care what percentage of the target is aligned, so set the second parameter to 0.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The third argument to dbcluster is the index prefix name that vmatch will give to the THOUSANDS of cluster .fna and .match files it will create.&lt;/li&gt;
&lt;li&gt;The fourth argument "(1,0)" specifies that we want to keep singletons (in a file called mySeqs.single.fna) and that there is no limit to the number of sequences in an acceptable cluster.&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;-v verbose report (redirected to a the .rpt)&lt;/li&gt;
&lt;li&gt;-s create the individual cluster fasta files and match reports&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
&lt;br /&gt;
Consult the vmatch manual for fuzzy matches and more examples:&lt;br /&gt;
&lt;a href="http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf"&gt;http://www.zbh.uni-hamburg.de/vmatch/virtman.pdf&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
The standard output details the clusters and singlets.&lt;br /&gt;
&lt;pre&gt;0:761437: NODE_83_length_31_cov_22.451612857642: 
NODE_106_length_31_cov_22.451612621409: NODE_152_length_29_cov_27.758621749981: 
NODE_185_length_29_cov_27.7586211:761861: NODE_531_length_424_cov_26.851416805590: 
NODE_1413_length_320_cov_28.187500837480: NODE_1236_length_320_cov_28.187500858407: 
NODE_937_length_320_cov_28.187500765510: NODE_1542_length_108_cov_34.870369621979: 
NODE_786_length_425_cov_32.602352750915: NODE_1290_length_321_cov_34.392525
&lt;/pre&gt;&lt;br /&gt;
&lt;h3&gt;Extract the longest sequence from your cluster files&lt;/h3&gt;&lt;br /&gt;
&lt;em&gt;If using Vmatch 2.1.3 or later this is unnecessary&lt;/em&gt;&lt;br /&gt;
Here is a perl script, which we will call longestSeq.pl, to do that&lt;br /&gt;
&lt;pre class="brush: perl"&gt;#!/usr/bin/perl
#print the longest sequence in a fasta file
use strict;my %seqs;my $defline;
while (&lt;&gt;) {
  chomp;
  if ( /^#/ || /^\n/ ) {
    #comment line or empty line do nothing
  }elsif (/&gt;/) {
    s/&gt;//g;$defline = $_;
    $seqs{$defline} = "";
  }elsif ($defline) {
    $seqs{$defline} .= $_;
  }else{
    die('invalid FASTA file');
  }
}
my $max = $defline;
foreach my $def ( keys %seqs ) {
  $max = ( length( $seqs{$def} ) &gt; length( $seqs{$max} ) ) ? $def : $max;
}
print "&gt;" . $max . "\n" . $seqs{$max} . "\n";&lt;/pre&gt;&lt;br /&gt;
We want the longest member of each cluster and all sequences in the singletons file:&lt;br /&gt;
&lt;pre class="brush: shell"&gt;for f in sequenceMatches/*fna;
do
if [ "$f" = "sequenceMatches/mySeqs.single.fna" ];
then 
cat $f &gt;&gt; mySeqs_longest_seqs.fa;
else perl longestSeq.pl $f &gt;&gt; mySeqs_longest_seqs.fa;
fi;
done
&lt;/pre&gt;&lt;br /&gt;
That's it - now you have a comprehensive and non-redundant set of the longest contigs from a number of assemblies.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6849014121716243135?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6849014121716243135/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6849014121716243135?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6849014121716243135?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html" title="Using Vmatch to combine assemblies" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_AWkLjPyyn0g/SwR-nUjQhdI/AAAAAAAAAXA/oOVI8Rtvk3s/s72-c/svarP1.png" height="72" width="72" /><thr:total>3</thr:total></entry><entry gd:etag="W/&quot;DkEMSX48eSp7ImA9WxFSEko.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-8223755843047671963</id><published>2009-11-04T12:54:00.000-05:00</published><updated>2010-04-14T15:51:28.071-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-14T15:51:28.071-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Samtools" /><category scheme="http://www.blogger.com/atom/ns#" term="BioPerl" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>R's xtabs for total weighted read coverage</title><content type="html">Samtools and its BioPerl wrapper Bio::DB:Sam prefer to give read coverage on a depth per base pair basis. This is typically an array of depths, one for every position that has at least one read aligned. OK, works for me. But how can we quickly see which targets (in my case transcripts) have the greatest total weighted read coverage (i.e. sum every base pair of every read that aligned)?&lt;br /&gt;&lt;br /&gt;My solution is to take that target-pos-depth information and import a table into R with at least the following columns:&lt;br /&gt;targetName&lt;br /&gt;depth&lt;br /&gt;&lt;br /&gt;I added the pos column here to emphasize the base-pair granularity&lt;br /&gt;&lt;pre&gt;         tx pos depth&lt;br /&gt;1   tx500090 227     1&lt;br /&gt;2   tx500090 228     1&lt;br /&gt;3   tx500090 229     1&lt;br /&gt;4   tx500090 230     1&lt;br /&gt;5   tx500090 231     1&lt;br /&gt;...&lt;br /&gt;66  tx500123 184     1&lt;br /&gt;67  tx500123 185     1&lt;br /&gt;68  tx500123 186     1&lt;br /&gt;69  tx500123 187     2&lt;br /&gt;70  tx500123 188     2&lt;br /&gt;71  tx500123 189     2&lt;/pre&gt;&lt;br /&gt;In R&lt;br /&gt;&lt;pre&gt;myCoverage&lt;-read.table("myCoverage.txt",header=TRUE)&lt;br /&gt;myxTab&lt;-xtabs(depth ~ tx,data=myCoverage)&lt;/pre&gt;&lt;br /&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html"&gt;xtabs&lt;/a&gt; will sum up depth-weighted positions by default (i suppose this is what tabulated contigency really means) and return an unsorted list of transcripts and their weighted coverage (total base pair read coverage)&lt;br /&gt;&lt;pre&gt;&gt; myxTab[1:100]&lt;br /&gt;tx&lt;br /&gt;tx500090 tx500123 tx500134 tx500155 tx500170 tx500178 tx500203 tx500207&lt;br /&gt;      38       92      610       46      176       46       92      130&lt;br /&gt;tx500273 tx500441 tx500481 tx500482 tx500501 tx500507 tx500667 tx500684&lt;br /&gt;      76     2390      114    71228      762      222      542      442&lt;br /&gt;tx500945 tx500955 tx501016 tx501120 tx501127 tx501169 tx501190 tx501192&lt;br /&gt;    1378     3604       46       46      420      854      130      352&lt;br /&gt;tx501206 tx501226 tx501229 tx501245 tx501270 tx501297 tx501390 tx501405&lt;br /&gt;     244     1204      206    15926      214       46      168       46&lt;br /&gt;tx501406 tx501438 tx501504 tx501694 tx501702 tx501877 tx501902 tx502238&lt;br /&gt;      38     2572     7768     3274      314      298       84      198&lt;br /&gt;tx502320 tx502364 tx502403 tx502414 tx502462 tx502515 tx502517 tx502519&lt;br /&gt;     122       38      588       46       46       38       38      466&lt;br /&gt;tx502610 tx502624 tx502680 tx502841 tx502882 tx503090 tx503192 tx503204&lt;br /&gt;     206       38      168     3750       38      122       76       92&lt;br /&gt;tx503416 tx503468 tx503523 tx503536 tx503571 tx503578 tx503623 tx503700&lt;br /&gt;     260       38      168       38       46       46       84       38&lt;br /&gt;tx503720 tx503721 tx503722 tx503788 tx503872 tx503892 tx503930 tx503970&lt;br /&gt;   97112       38       38     4708       38       38     1290       84&lt;br /&gt;tx503995 tx504107 tx504115 tx504346 tx504353 tx504355 tx504357 tx504398&lt;br /&gt;      46      152      206       46     3416     1402      122      290&lt;br /&gt;tx504434 tx504483 tx504523 tx504589 tx504612 tx504711 tx504751 tx504827&lt;br /&gt;     290     8728      176       46       46       76     5644     1308&lt;br /&gt;tx504828 tx504834 tx504882 tx504931 tx504952 tx505017 tx505029 tx505078&lt;br /&gt;    2336      328       46    34138     1000     1838       46      474&lt;br /&gt;tx505123 tx505146 tx505159 tx505184&lt;br /&gt;      38   123344      160      588&lt;/pre&gt;&lt;br /&gt;this is approximately 10000x faster than using a formula like:&lt;br /&gt;&lt;pre&gt;by(myCoverage,myCoverage$tx,function(x){sum(x$depth)}&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-8223755843047671963?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/8223755843047671963/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/11/rs-xtabs-for-total-weighted-read.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8223755843047671963?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8223755843047671963?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/11/rs-xtabs-for-total-weighted-read.html" title="R's xtabs for total weighted read coverage" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;Ck8DQXo6eyp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6985002185757397444</id><published>2009-10-23T15:22:00.000-04:00</published><updated>2010-04-17T15:01:10.413-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:01:10.413-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Samtools" /><title>Installing Bio::DB::Sam from CPAN</title><content type="html">Bio::DB::Sam is Lincoln Stein's BioPerl API to the SamTools package.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Installing via CPAN might skip a necessary question that will cause it to fail.&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;cpan&gt; install Bio::DB::Sam&lt;br /&gt;...&lt;br /&gt;...&lt;br /&gt;DIED. FAILED tests 1-93&lt;br /&gt;  Failed 93/93 tests, 0.00% okay&lt;br /&gt;Failed Test Stat Wstat Total Fail  Failed  List of Failed&lt;br /&gt;-------------------------------------------------------------------------------&lt;br /&gt;t/01sam.t      2   512    93  186 200.00%  1-93&lt;br /&gt;Failed 1/1 test scripts, 0.00% okay. 93/93 subtests failed, 0.00% okay.&lt;br /&gt;make: *** [test_dynamic] Error 2&lt;br /&gt;/usr/bin/make test -- NOT OK&lt;br /&gt;Running make install&lt;br /&gt;make test had returned bad status, won't install without force&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Navigate to where CPAN has downloaded the gz ffile&lt;br /&gt;A closer examination reveals that Build.PL file wants to know where the SamTools header files are located.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Please enter the location of the bam.h and compiled libbam.a files:&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;I have no idea how to pass these arguments using CPAN. I would just avoid this method of installation.&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;Do the local build instead.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6985002185757397444?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6985002185757397444/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/10/installing-biodbsam-from-cpan.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6985002185757397444?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6985002185757397444?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/10/installing-biodbsam-from-cpan.html" title="Installing Bio::DB::Sam from CPAN" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;Ck8BRnoyfCp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-4030752442244195162</id><published>2009-10-12T16:46:00.000-04:00</published><updated>2010-04-17T15:00:57.494-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:00:57.494-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="os x" /><title>Disable file locking in Eclipse for OS X</title><content type="html">Eclipse will refuse to use a workspace on an automounted OS X Server home directory.&lt;div&gt;&lt;pre&gt;Workspace in use or cannot be created&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To remedy this problem do the following:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Right click the Eclipse application and select "Show Package Contents"&lt;/li&gt;&lt;li&gt;Contents-&gt;MacOS&lt;/li&gt;&lt;li&gt;Edit the eclipse.ini file in a text editor&lt;/li&gt;&lt;li&gt;Add -Dosgi.locking=none to the line below -vmargs&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;img src="http://2.bp.blogspot.com/_AWkLjPyyn0g/StOXN9PFXbI/AAAAAAAAAWY/5wV_uMRYC2Q/s320/Picture+17.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391819444924538290" /&gt;&lt;img style="cursor:pointer; cursor:hand;width: 320px; height: 170px;" src="http://2.bp.blogspot.com/_AWkLjPyyn0g/StOXwpZ2mDI/AAAAAAAAAWg/EKOIOTJlF6A/s320/Picture+18.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5391820040896419890" /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-4030752442244195162?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/4030752442244195162/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/10/disable-file-locking-in-eclipse-for-os.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4030752442244195162?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4030752442244195162?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/10/disable-file-locking-in-eclipse-for-os.html" title="Disable file locking in Eclipse for OS X" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_AWkLjPyyn0g/StOXN9PFXbI/AAAAAAAAAWY/5wV_uMRYC2Q/s72-c/Picture+17.png" height="72" width="72" /><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;Ck8HRn4_cCp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-6337173291502138678</id><published>2009-09-23T20:18:00.000-04:00</published><updated>2010-04-17T15:00:37.048-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:00:37.048-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="cycling" /><title>My 2009 Bridge-to-Bridge Experience</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_AWkLjPyyn0g/SrrAG0RkA1I/AAAAAAAAAUw/0gcspwE2zDE/s1600-h/DSC_0060.JPG"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 266px;" src="http://4.bp.blogspot.com/_AWkLjPyyn0g/SrrAG0RkA1I/AAAAAAAAAUw/0gcspwE2zDE/s400/DSC_0060.JPG" alt="" id="BLOGGER_PHOTO_ID_5384827527818904402" border="0" /&gt;&lt;/a&gt;Bridge-to-Bridge is a 105-mile ride up to the top of &lt;a href="http://en.wikipedia.org/wiki/Grandfather_Mountain"&gt;Grandfather Mountain&lt;/a&gt;, one of the highest peaks in the Blue Ridge mountains.&lt;br /&gt;&lt;br /&gt;Although B2B is not an officially sanctioned race, the organizers conduct it just as professionally (with the exception of neutral support). Cops manage the major crossings, volunteers provide hand-offs at the dozen feed stations, and the event is officially timed with the aid of some magnetic shoe things.&lt;br /&gt;&lt;br /&gt;I last did this ride in 1999. Now 10 years older but about 10 pounds lighter I had somehow forgotten how much suffering was involved and figured this was a good time to tackle the challenge, despite falling ill a couple of weeks beforehand.&lt;br /&gt;&lt;br /&gt;Due to constant rain and very heavy fog, this year was utter torture for the 299 finishers and 371 &lt;span style="font-style: italic;"&gt;non-finishers&lt;/span&gt; who braved the elements. I believe there may have been another 130 &lt;span style="font-style: italic;"&gt;non-starters&lt;/span&gt; who stayed in their hotel rooms enjoying the Golden Girls marathon on tv. While I'm sure I would have felt some sense of accomplishment doing that, I was obligated to finish this ride as we had already driven down from Philadelphia (en-route to a wedding in Nashville the following weekend).&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_AWkLjPyyn0g/SrrGyLM_ptI/AAAAAAAAAU4/xOLKbKvrsuk/s1600-h/DSC_0011.JPG"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 320px; height: 213px;" src="http://4.bp.blogspot.com/_AWkLjPyyn0g/SrrGyLM_ptI/AAAAAAAAAU4/xOLKbKvrsuk/s320/DSC_0011.JPG" alt="" id="BLOGGER_PHOTO_ID_5384834869777901266" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The Grandfather Mountain staff said riders  could not enter the park before 3pm. To accommodate both the riders and this odd rule two start times were offered - 10 and 11 am, with slower riders encouraged to start first. To me this was a welcome change from the ungodly pre-dawn start times of most big rides. Riders were advised to start at the later time if they estimated they would be pushing the 3pm threshold.&lt;br /&gt;&lt;br /&gt;I was on the fence about which group to join. I saw several very fit looking riders and expensive bikes in the 10am pack. I was still riding the same Litespeed I used in 1999. The word going around was that more rain was on the way (this turned out to be true for everyone). In the end I felt the risk of having an inexperienced rider fall in front of me to be the deciding factor to go with the second group. I knew I would have to pass several of that first group anyway but it would be on the later climbs instead of the early rollers. Ironically I almost got clipped by some idiot drifting carelessly in our pack. At the end there was considerable overlap in finishing times between the two groups.&lt;br /&gt;&lt;br /&gt;The two times I did this ride ('98 and '99) I got dropped by the leaders on the 13 mile climb up NC181, then spent the rest of the ride either riding alone or with a couple other guys. Despite my best efforts this year turned out roughly the same except I stuck with that front group for about half the climb instead of just the first couple miles. Remind me to buy a compact crank.&lt;br /&gt;&lt;br /&gt;This climb is very difficult psychologically - a relentless slog up a roughly-paved 4 lane highway. I was not very familiar with the profile and prematurely thought I had crested three times - each time putting in a kick over the "top". The fog would clear and I would see yet another rise.&lt;br /&gt;&lt;br /&gt;Feeling dispirited and exhausted, I nearly froze to death on the descents of 181 and the Blue Ridge Parkway and was eventually caught by a small group that stuck together until the ascent of Grandfather. I was amazed by how few words were exchanged in that group during the hour or so we traded pulls through the fog and rain, which only got worse as we neared the finish. One guy did say something that stuck with me, "It's like we're riding through a horror movie."&lt;br /&gt;&lt;br /&gt;After crawling to the finish in a 39x27, I was very fortunate that Mary Ellen had the foresight to drive up to the summit well in advance to meet me with a warm car. I thanked her by singing the Golden Girls theme song all the way to the hotel.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;B2B '09 Results:&lt;a href="http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt"&gt;&lt;br /&gt;http://www.caldwellcochamber.org/support/pagepics/09Bridge.txt&lt;/a&gt;&lt;/li&gt;&lt;li&gt;An account by Bruce Humphries (1st place)&lt;a href="http://dieseldiaries.com/hom/?p=232"&gt;&lt;br /&gt;http://dieseldiaries.com/hom/?p=232&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Some more blog posts on the '09 ride:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/"&gt;http://fraught.wordpress.com/2009/09/22/theres-blue-sky-ahead/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html"&gt;http://twentystone.blogspot.com/2009/09/2009-bridge-to-bridge-ride-report.html&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://khobama.blogspot.com/2009/09/and-saga-continues.html"&gt;http://khobama.blogspot.com/2009/09/and-saga-continues.html&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html"&gt;http://grimesjoseph.blogspot.com/2009/09/shenandoah-mountain-100-2009-bridge-to.html&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-6337173291502138678?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/6337173291502138678/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/09/my-2009-bridge-to-bridge-experience.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6337173291502138678?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/6337173291502138678?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/09/my-2009-bridge-to-bridge-experience.html" title="My 2009 Bridge-to-Bridge Experience" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_AWkLjPyyn0g/SrrAG0RkA1I/AAAAAAAAAUw/0gcspwE2zDE/s72-c/DSC_0060.JPG" height="72" width="72" /><thr:total>2</thr:total></entry><entry gd:etag="W/&quot;AkUDRX0_eCp7ImA9WxNQE00.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-4375008100777516094</id><published>2009-09-16T12:20:00.000-04:00</published><updated>2009-09-18T17:11:14.340-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-09-18T17:11:14.340-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="bash history" /><category scheme="http://www.blogger.com/atom/ns#" term="histfile" /><title>Printing a specific line from bash_history</title><content type="html">Often I want to archive specific commands out of my immediate bash history to a batch script that I can run later.&lt;br /&gt;Unfortunately I could find no way of redirecting !# (where # is the bash history line I wish to save, e.g. !58 executes line 58) to a file. There is the "colon p" option - where !#:p will print the command instead executing it, but I could not redirect or pipe that output either.&lt;br /&gt;&lt;br /&gt;So I added this one-liner to a bin directory in my path:&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: shell"&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;cat $HISTFILE | sed -n "$1p"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I call it &lt;span style="font-family: courier new;"&gt;getHistLine&lt;/span&gt;. So now to save line 58 to a batch script I can just type:&lt;br /&gt;getHistLine 58 &gt;&gt; myBatchScript.sh&lt;br /&gt;&lt;br /&gt;To save a range of lines I can type&lt;br /&gt;getHistLine 50,58 &gt;&gt; myBatchScript.sh&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-4375008100777516094?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/4375008100777516094/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/09/printing-specific-line-from-bashhistory.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4375008100777516094?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4375008100777516094?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/09/printing-specific-line-from-bashhistory.html" title="Printing a specific line from bash_history" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0UEQncycCp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-7883458160331554300</id><published>2009-08-26T23:27:00.000-04:00</published><updated>2010-04-17T15:06:43.998-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:06:43.998-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="velvet" /><category scheme="http://www.blogger.com/atom/ns#" term="humor" /><title>Charles Dickens de Bruijn Graph</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lh3.ggpht.com/_AWkLjPyyn0g/SpX8-RJAYnI/AAAAAAAAAUQ/Ul0X_jVQQiw/s800/dicken-debruijn.png"&gt;&lt;img style="cursor: pointer; width: 800px; height: 314px;" src="http://lh3.ggpht.com/_AWkLjPyyn0g/SpX8-RJAYnI/AAAAAAAAAUQ/Ul0X_jVQQiw/s800/dicken-debruijn.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-7883458160331554300?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/7883458160331554300/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/08/charles-dickens-debruijn-graph.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/7883458160331554300?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/7883458160331554300?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/08/charles-dickens-debruijn-graph.html" title="Charles Dickens de Bruijn Graph" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh3.ggpht.com/_AWkLjPyyn0g/SpX8-RJAYnI/AAAAAAAAAUQ/Ul0X_jVQQiw/s72-c/dicken-debruijn.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;Ck4ASH84cCp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-8971060436358730206</id><published>2009-08-25T10:17:00.001-04:00</published><updated>2010-04-17T15:02:29.138-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:02:29.138-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="velvet" /><category scheme="http://www.blogger.com/atom/ns#" term="R" /><title>Standardized Velvet Assembly Report</title><content type="html">&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lh3.ggpht.com/_AWkLjPyyn0g/SpMal_Ykm6I/AAAAAAAAATw/hiHn39DcuNc/s400/example.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 391px;" src="http://lh3.ggpht.com/_AWkLjPyyn0g/SpMal_Ykm6I/AAAAAAAAATw/hiHn39DcuNc/s400/example.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/standardized-velvet-assembly-report/"&gt;&lt;br /&gt;http://code.google.com/p/standardized-velvet-assembly-report/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I finally got my Velvet Assembler report script up on google code. This "program" consists of some short scripts and a Sweave report designed to help Velvet users identify the optimal kmer and cvCut parameters.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-8971060436358730206?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/8971060436358730206/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/08/standardized-velvet-assembly-report.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8971060436358730206?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/8971060436358730206?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/08/standardized-velvet-assembly-report.html" title="Standardized Velvet Assembly Report" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh3.ggpht.com/_AWkLjPyyn0g/SpMal_Ykm6I/AAAAAAAAATw/hiHn39DcuNc/s72-c/example.png" height="72" width="72" /><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;C0cDRX89eCp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-4299835741578621128</id><published>2009-08-13T15:33:00.000-04:00</published><updated>2010-04-17T15:04:34.160-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:04:34.160-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="grails" /><category scheme="http://www.blogger.com/atom/ns#" term="groovy" /><title>GregorianCalendar foolishness</title><content type="html">I wanted to add queries to my Grails application to find tasks that needed to be completed today, or were delinquent (due date &lt; last midnight).&lt;br /&gt;My application kept thinking things were delinquent the afternoon of the due date.&lt;br /&gt;&lt;br /&gt;The problem was that I neglected to read how HOUR_OF_DAY differed from HOUR and absentmindedly mixed the two.&lt;br /&gt;&lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html"&gt;http://java.sun.com/j2se/1.4.2/docs/api/java/util/Calendar.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;You can try to set HOUR to 0 but it will default to 12. If it is the afternoon when you request the Calendar object then it will assume you mean 12 noon.&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;&lt;br /&gt;Calendar lastMidnight = Calendar.getInstance();&lt;br /&gt;//DO NOT DO THIS!!&lt;br /&gt;  lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));&lt;br /&gt;  ...snip...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;  //this is ok&lt;br /&gt;  lastMidnight.set( Calendar.HOUR_OF_DAY, lastMidnight.getMinimum(Calendar.HOUR_OF_DAY ));&lt;br /&gt;  ...snip...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;  //this is also ok&lt;br /&gt;  lastMidnight.set( Calendar.AM_PM, Calendar.AM );&lt;br /&gt;  lastMidnight.set( Calendar.HOUR, lastMidnight.getMinimum(Calendar.HOUR ));&lt;br /&gt;  ...snip...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The other settings are pretty self-explanatory&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;  lastMidnight.set( Calendar.MINUTE, lastMidnight.getMinimum(Calendar.MINUTE));&lt;br /&gt;  lastMidnight.set( Calendar.SECOND, lastMidnight.getMinimum(Calendar.SECOND));&lt;br /&gt;  lastMidnight.set( Calendar.MILLISECOND,lastMidnight.getMinimum(Calendar.MILLISECOND));&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-4299835741578621128?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/4299835741578621128/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/08/gregorian-calendar-craziness.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4299835741578621128?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4299835741578621128?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/08/gregorian-calendar-craziness.html" title="GregorianCalendar foolishness" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>0</thr:total></entry><entry gd:etag="W/&quot;Ck8FQ3o5eSp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-3257411054600710787</id><published>2009-07-28T14:47:00.000-04:00</published><updated>2010-04-17T15:00:12.421-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:00:12.421-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="groovy" /><title>Beware! Groovy split and tokenize don't treat empty elements the same</title><content type="html">Groovy's &lt;span style="font-style:italic;"&gt;tokenize&lt;/span&gt;, which returns a List, will ignore empty elements (when a delimiter appears twice in succession). &lt;span style="font-style:italic;"&gt;Split&lt;/span&gt; keeps such elements and returns an Array. If you want to use List functions but you don't want to lose your empty elements, then just use &lt;span style="font-style:italic;"&gt;split&lt;/span&gt; and convert your Array into a List in a separate step.&lt;br /&gt;&lt;br /&gt;This might be important if you are parsing CSV files with empty cells.&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;import groovy.util.GroovyTestCase&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class StringTests extends GroovyTestCase {&lt;br /&gt;&lt;br /&gt;  protected void setUp() {&lt;br /&gt;    super.setUp()&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  protected void tearDown() {&lt;br /&gt;    super.tearDown()&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  void testSplitAndTokenize() {&lt;br /&gt;    assertEquals("This,,should,have,five,items".tokenize(',').size(),5)&lt;br /&gt;    assertEquals("This, ,should,have,six,items".tokenize(',').size(),6)&lt;br /&gt;&lt;br /&gt;    assertEquals("This, ,should,have,six,items".split(',').size(),6)&lt;br /&gt;    assertEquals("This,,should,have,six,items".split(',').size(),6)&lt;br /&gt;&lt;br /&gt;    //convert array to List and re-evaluate&lt;br /&gt;     def fieldArray = "This,,should,have,six,items".split(',')&lt;br /&gt;     def fields=fieldArray.collect{it}&lt;br /&gt;     assert fields instanceof java.util.List&lt;br /&gt;     assertEquals(fields.size(),6)&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-3257411054600710787?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/3257411054600710787/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/07/beware-groovy-split-and-tokenize-dont.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/3257411054600710787?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/3257411054600710787?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/07/beware-groovy-split-and-tokenize-dont.html" title="Beware! Groovy split and tokenize don't treat empty elements the same" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry><entry gd:etag="W/&quot;C0cBRH07fSp7ImA9WxFSFU4.&quot;"><id>tag:blogger.com,1999:blog-8532065960756482590.post-4822122645904617458</id><published>2009-05-05T12:45:00.000-04:00</published><updated>2010-04-17T15:04:15.305-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-04-17T15:04:15.305-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="grails" /><title>Loading files in grails bootstrap</title><content type="html">I've had to do this for two separate projects.&lt;br /&gt;&lt;br /&gt;A web application always has weird ideas of where it is in the path. I tend to look for examples that work without assumptions. This works for the bootstrap scripts I use to start up grails apps. Locating stuff in /web-app in the deployment context is a different story.&lt;br /&gt;&lt;br /&gt;I think the best way is as follows:&lt;br /&gt;create a directory: grails-app/conf/resources&lt;br /&gt;&lt;br /&gt;For a tab delimited-file I used the following routine to load and create domain objects&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;class BootStrap {&lt;br /&gt;&lt;br /&gt;  def init = {servletContext -&gt;&lt;br /&gt;    //AC204211.3    c0189C22    ACTIVEFIN    UNKNOWN    153071100    153252400    ctg708    9800    191100&lt;br /&gt;&lt;br /&gt;    def filePath = "resources/fpc_report.txt"&lt;br /&gt;&lt;br /&gt;    def text = ApplicationHolder.application.parentContext.getResource("classpath:$filePath").inputStream.text&lt;br /&gt;    text.eachLine {&lt;br /&gt;      println it&lt;br /&gt;      def BacFields = it.split("\t")&lt;br /&gt;      new Bacs(accession: BacFields[0],&lt;br /&gt;              cloneName: BacFields[1],&lt;br /&gt;              status: BacFields[2],&lt;br /&gt;              chr: BacFields[3],&lt;br /&gt;              chrStart: BacFields[4],&lt;br /&gt;              chrEnd: BacFields[5],&lt;br /&gt;              contig: BacFields[6],&lt;br /&gt;              contigStart: BacFields[7],&lt;br /&gt;              contigEnd: BacFields[8]).save()&lt;br /&gt;    }&lt;br /&gt;     }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;For a CSV file I used the opencsv library&lt;br /&gt;http://opencsv.sourceforge.net/&lt;br /&gt;I added to my maven pom the following dependency&lt;br /&gt;&lt;pre class="brush: xml"&gt;&lt;br /&gt; &lt;dependency&gt;&lt;br /&gt;    &lt;groupId&gt;net.sf.opencsv&lt;/groupId&gt;&lt;br /&gt;    &lt;artifactId&gt;opencsv&lt;/artifactId&gt;&lt;br /&gt;    &lt;version&gt;1.8&lt;/version&gt;&lt;br /&gt;&lt;/dependency&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;and used something similar to get the grails appHolder object to give up a Java File&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: groovy"&gt;&lt;br /&gt;import org.codehaus.groovy.grails.commons.ApplicationHolder&lt;br /&gt;&lt;br /&gt;class BootStrap {&lt;br /&gt;&lt;br /&gt;  def init = { servletContext -&gt;&lt;br /&gt;&lt;br /&gt;    def filePath = "resources/E357Lims.CSV"&lt;br /&gt;    def appHolder=ApplicationHolder.application.parentContext.getResource("classpath:$filePath")&lt;br /&gt;&lt;br /&gt;    def reader = new CSVReader(new FileReader(appHolder.getFile()))&lt;br /&gt;    def header = true&lt;br /&gt;    reader.readAll().each{ csvrow -&gt;&lt;br /&gt;      if(!header){&lt;br /&gt;        new FlatReport(name:csvrow[0].trim()).save()&lt;br /&gt;      }&lt;br /&gt;      header = false&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8532065960756482590-4822122645904617458?l=jermdemo.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel="replies" type="application/atom+xml" href="http://jermdemo.blogspot.com/feeds/4822122645904617458/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://jermdemo.blogspot.com/2009/05/loading-files-in-grails-bootstrap.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4822122645904617458?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8532065960756482590/posts/default/4822122645904617458?v=2" /><link rel="alternate" type="text/html" href="http://jermdemo.blogspot.com/2009/05/loading-files-in-grails-bootstrap.html" title="Loading files in grails bootstrap" /><author><name>Jeremy Leipzig</name><uri>https://profiles.google.com/116405544829727492615</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh4.googleusercontent.com/-qRfLJ55Faus/AAAAAAAAAAI/AAAAAAAABTU/dBIftdLbXqw/s512-c/photo.jpg" /></author><thr:total>1</thr:total></entry></feed>

