<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US">
  <id>tag:depth-first.com,2008:/articles</id>
  <link rel="alternate" type="text/html" href="http://depth-first.com/" />
  
  <title>Depth-First</title>
  <updated>2010-02-09T07:22:41-08:00</updated>
  <generator uri="http://github.com/rapodaca/hubbub">Hubbub</generator>
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/Depth-first" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="depth-first" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">Depth-first</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://feedburner.google.com</feedburner:feedburnerHostname><entry>
    <id>tag:depth-first.com,2008:Article/609</id>
    <published>2010-02-09T07:20:50-08:00</published>
    <updated>2010-02-09T07:22:41-08:00</updated>
    <link rel="alternate" type="text/html" href="http://depth-first.com/articles/2010/02/09/big-data-in-chemistry-mirroring-pubchem-the-easy-way-part-2" />
    <title>Big Data in Chemistry: Mirroring PubChem the Easy Way Part 2</title>
    <content type="html">&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.s3.amazonaws.com/20100208/pubchem.gif" align="right" class="anchor"&gt;&lt;/img&gt;&lt;/a&gt;One of the useful (and unnerving) things about running a blog is that you're forced to face what you don't know (usually very publicly). I've been looking for the simplest way to maintain an up-to-date local copy of PubChem. I previously posted an article describing one way to &lt;a href="http://depth-first.com/articles/2010/02/08/big-data-in-chemistry-mirroring-pubchem-the-easy-way"&gt;mirror PubChem&lt;/a&gt; through the use of rsync and curlftpfs.&lt;/p&gt;

&lt;p&gt;Although this method works, it turns out that there's an even simpler way to do this with wget. For Compounds:&lt;/p&gt;

&lt;pre class="console"&gt;
$ wget --mirror --accept "*.sdf.gz" ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
&lt;/pre&gt;


&lt;p&gt;For Substances:&lt;/p&gt;

&lt;pre class="console"&gt;
$ wget --mirror --accept "*.sdf.gz" ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/SDF/
&lt;/pre&gt;


&lt;p&gt;The --mirror option is shorthand for a bundle of options (currently -r -N -l inf --no-remove-listing) that turns on recursive retrieval and &lt;a href="http://sunsite.ualberta.ca/Documentation/Gnu/wget-1.7/html_chapter/wget_5.html"&gt;timestamping&lt;/a&gt;. Timestamping is what lets wget download only the files that have changed in the PubChem archive directory, rather than downloading the entire PubChem archive every time.&lt;/p&gt;

&lt;p&gt;Without timestamping, wget doesn't compare local and remote timestamps, so a recursive download fetches every file again on each run.&lt;/p&gt;

&lt;p&gt;The --accept option says we only want to download gzipped SDF files (leaving out, for example, text files).&lt;/p&gt;

&lt;p&gt;Whenever you're ready to update your local PubChem archive, simply run the two commands above and you're done. You'll have a copy of the PubChem dataset that matches - to within one day - the dataset being used by NCBI itself.&lt;/p&gt;

&lt;p&gt;Pretty simple, huh?&lt;/p&gt;

&lt;p&gt;Now if I could just figure out how to use a single wget command to mirror both the Compound and Substance directories...&lt;/p&gt;
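
&lt;p&gt;For what it's worth, wget accepts more than one URL on its command line, so a single invocation may well be enough. The sketch below is untested against the live FTP site: the helper names sdf_url and mirror_pubchem and the DRY_RUN switch are made up for illustration, and by default the script only prints the command it would run.&lt;/p&gt;

&lt;pre class="console"&gt;
```shell
#!/bin/sh
# Base URL taken from the two wget commands above.
BASE="ftp://ftp.ncbi.nlm.nih.gov/pubchem"

# Print the SDF directory URL for one archive (Compound or Substance).
sdf_url() {
  printf '%s/%s/CURRENT-Full/SDF/' "$BASE" "$1"
}

# Mirror both archives with one wget invocation.
# DRY_RUN is a hypothetical switch: unset or 1 prints the command
# (without its shell quoting), 0 actually runs it.
mirror_pubchem() {
  set -- wget --mirror --accept "*.sdf.gz" "$(sdf_url Compound)" "$(sdf_url Substance)"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

mirror_pubchem
```
&lt;/pre&gt;

&lt;p&gt;If this works as expected, both archives should land under the same local ftp.ncbi.nlm.nih.gov/pubchem/ directory tree, one subdirectory per archive.&lt;/p&gt;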
</content>
  </entry>
  <entry>
    <id>tag:depth-first.com,2008:Article/608</id>
    <published>2010-02-08T06:59:18-08:00</published>
    <updated>2010-02-08T06:59:18-08:00</updated>
    <link rel="alternate" type="text/html" href="http://depth-first.com/articles/2010/02/08/big-data-in-chemistry-mirroring-pubchem-the-easy-way" />
    <title>Big Data in Chemistry: Mirroring PubChem the Easy Way</title>
    <content type="html">&lt;p&gt;&lt;a href="http://pubchem.ncbi.nlm.nih.gov/"&gt;&lt;img src="http://depth-first.s3.amazonaws.com/20100208/pubchem.gif" align="right" class="anchor"&gt;&lt;/img&gt;&lt;/a&gt;PubChem's massive size presents special challenges when working with this chemical dataset. Synchronization in particular requires special care. Although it's very easy to use a tool such as &lt;a href="http://www.gnu.org/software/wget/"&gt;wget&lt;/a&gt; to perform a complete, one-time download of PubChem's archive files, this approach scales poorly if our goal is to maintain a copy that's always up-to-date. The PubChem dataset's substantial size makes it impractical to download frequently, and especially problematic when an up-to-date local copy is needed quickly.&lt;/p&gt;

&lt;p&gt;This article describes a simple way to create a low-maintenance, low-bandwidth, up-to-date local mirror of PubChem using two Unix tools.&lt;/p&gt;

&lt;h4&gt;What It Does&lt;/h4&gt;

&lt;p&gt;The method described here will create two directories on your filesystem that will exactly mirror the contents of the PubChem Compound and Substance archives, respectively. A simple command, which can either be run as a nightly cron job or on demand, will efficiently bring these local files up-to-date with PubChem whenever it's run.&lt;/p&gt;

&lt;h4&gt;Step #1: Create A Workspace and Mount PubChem FTP Site&lt;/h4&gt;

&lt;p&gt;We're going to need a workspace. In this workspace, we'll first create a mountpoint for the PubChem FTP site archives, then we'll mount the archives:&lt;/p&gt;

&lt;pre class="console"&gt;
$ mkdir workspace
$ cd workspace
$ mkdir -p ftp.ncbi.nlm.nih.gov/pubchem
$ curlftpfs ftp.ncbi.nlm.nih.gov/pubchem/ ftp.ncbi.nlm.nih.gov/pubchem/
&lt;/pre&gt;


&lt;p&gt;My Linux distribution (Ubuntu Karmic) gives me the error message:&lt;/p&gt;

&lt;pre class="console"&gt;
fusermount: failed to open /etc/fuse.conf: Permission denied
&lt;/pre&gt;


&lt;p&gt;which doesn't seem to matter. The FTP site is mounted, as I can see by listing the top-level entries:&lt;/p&gt;

&lt;pre class="console"&gt;
$ ls ftp.ncbi.nlm.nih.gov/pubchem
Bioassay  Compound     data_spec     README      Substance
CACTVS    Compound_3D  publications  specifications
&lt;/pre&gt;


&lt;p&gt;We can unmount the PubChem FTP site with fusermount:&lt;/p&gt;

&lt;pre class="console"&gt;
$ fusermount -u ftp.ncbi.nlm.nih.gov/pubchem/
&lt;/pre&gt;


&lt;h4&gt;Step #2: Create Synchronization Directories and Transfer Files&lt;/h4&gt;

&lt;p&gt;Next, let's create two directories to hold the PubChem files - one for Compounds and one for Substances:&lt;/p&gt;

&lt;pre class="console"&gt;
$ mkdir substances
$ mkdir compounds
&lt;/pre&gt;


&lt;p&gt;Now comes the magic. We'll use &lt;a href="http://samba.anu.edu.au/rsync/"&gt;rsync&lt;/a&gt; to copy the contents of the mounted FTP archive into each of our local directories. First, we synchronize the Compounds:&lt;/p&gt;

&lt;pre class="console"&gt;
$ rsync -r -t -v --progress --bwlimit=500  --include='*/' --include='*.sdf.gz' --exclude='*' ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/ compounds
&lt;/pre&gt;


&lt;p&gt;This is going to take nearly 24 hours.&lt;/p&gt;

&lt;p&gt;The option &lt;em&gt;--bwlimit&lt;/em&gt; sets the maximum transfer rate (in kilobytes per second). The &lt;em&gt;--include&lt;/em&gt; and &lt;em&gt;--exclude&lt;/em&gt; options say that we're only interested in gzipped SD files.&lt;/p&gt;

&lt;p&gt;We synchronize Substance records analogously:&lt;/p&gt;

&lt;pre class="console"&gt;
$ rsync -r -t -v --progress --bwlimit=500  --include='*/' --include='*.sdf.gz' --exclude='*' ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/SDF/ substances
&lt;/pre&gt;


&lt;p&gt;This command will take even longer to run.&lt;/p&gt;

&lt;h4&gt;Step #3: There Is No Step #3&lt;/h4&gt;

&lt;p&gt;That's really all there is to it. Every time we run the rsync command, we'll synchronize our local copy of the PubChem archive with the one on the PubChem FTP server. PubChem ensures that these archives are always current, so every time we synchronize, we'll have up-to-date files.&lt;/p&gt;
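
&lt;p&gt;To run this as a nightly cron job, the two rsync commands from Step #2 could be wrapped in a small script. What follows is only a sketch under some assumptions: the script name, the sync_archive helper, the DRY_RUN switch, and the crontab path are all hypothetical, the FTP site is assumed to be mounted already, and I've dropped -v and --progress because nobody watches cron output. By default it prints the commands rather than running them.&lt;/p&gt;

&lt;pre class="console"&gt;
```shell
#!/bin/sh
# sync-pubchem.sh (hypothetical name): wraps the two rsync commands from Step #2.
# Assumes it is started in the workspace directory with the FTP site mounted.
#
# Example crontab entry (the path is a placeholder):
#   30 2 * * * cd /path/to/workspace; DRY_RUN=0 sh sync-pubchem.sh
MOUNT="ftp.ncbi.nlm.nih.gov/pubchem"

# $1 = PubChem archive directory, $2 = local target directory.
# DRY_RUN unset or 1 prints the command (without its shell quoting);
# DRY_RUN=0 runs it.
sync_archive() {
  set -- rsync -r -t --bwlimit=500 --include='*/' --include='*.sdf.gz' --exclude='*' "$MOUNT/$1/CURRENT-Full/SDF/" "$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

sync_archive Compound compounds
sync_archive Substance substances
```
&lt;/pre&gt;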

&lt;h4&gt;Why RSync?&lt;/h4&gt;

&lt;p&gt;RSync ensures that our synchronizations will be as efficient as possible by only downloading the archive files that change. From time to time, old records are updated in PubChem, and these changes appear as a new archive file that replaces an old archive file. The new file gets an updated timestamp. If you check out the &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"&gt;Compounds FTP directory&lt;/a&gt; you'll notice several different timestamps reflecting the various updates of existing records. New records appear as new archive files.&lt;/p&gt;

&lt;p&gt;The genius of RSync is that it performs an incremental backup; files that haven't changed since our last update are never downloaded.&lt;/p&gt;

&lt;p&gt;We can even take this incremental backup idea one step further. Although I don't yet know if PubChem supports it, it's possible to &lt;a href="http://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/"&gt;create GZip archives optimized for rsync&lt;/a&gt;. This uses a variant of the GZip compression algorithm that makes it possible to transmit only the section of a gzip file that's actually changed, keeping network traffic to an absolute minimum.&lt;/p&gt;

&lt;p&gt;This rsyncable archive capability is built into most gzip binary distributions.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;Creating and maintaining your own up-to-date, verbatim copy of PubChem is both simple and inexpensive. The trick is to first mount the FTP archive using curlftpfs and then use rsync to perform an incremental backup of the mounted archive. The method described here works equally well as a cron job or an ad hoc command.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Credits: &lt;a href="http://www.wikihow.com/Mirror-an-FTP-Directory-With-Rsync-and-Curlftpfs"&gt;Mirror an FTP Directory with RSync and Curlftpfs&lt;/a&gt;; &lt;a href="http://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/"&gt;Rsyncable gzip&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</content>
  </entry>
  <entry>
    <id>tag:depth-first.com,2008:Article/607</id>
    <published>2010-02-06T08:25:20-08:00</published>
    <updated>2010-02-06T08:25:20-08:00</updated>
    <link rel="alternate" type="text/html" href="http://depth-first.com/articles/2010/02/06/i-dare-you-ask-your-toughest-experimental-chemistry-question-on-chempedia-lab" />
    <title>I Dare You: Ask Your Toughest Experimental Chemistry Question on Chempedia Lab</title>
    <content type="html">&lt;p&gt;&lt;a href="http://lab.chempedia.com"&gt;&lt;img src="http://products-blog.s3.amazonaws.com/assets/20091117/chempedia_lab_logo.png" align="right" class="anchor"&gt;&lt;/img&gt;&lt;/a&gt;&lt;a href="http://www.nature.com/news/index.html"&gt;Nature News&lt;/a&gt; is running a story on &lt;a href="http://intermolecular.wordpress.com/"&gt;Matthew Todd&lt;/a&gt; and his &lt;a href="http://www.nature.com/news/2010/100204/full/news.2010.50.html"&gt;initiative to develop a more practical treatment for Schistosomiasis by thinking different&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;"My funded project is intended to be the kernel, to which anyone can add," Todd says. He hopes that the project will become a successful example of open-source science, and open-source 'wet lab' chemistry in particular, a concept that has been slow to take off.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Call me an optimist, but the problems with getting something like this to work will have less to do with a scarcity of &lt;a href="http://depth-first.com/articles/2006/08/19/history-of-abstracting-at-chemical-abstracts-service"&gt;volunteer-minded chemists&lt;/a&gt; and more to do with finding them around the world and connecting them to each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://lab.chempedia.com"&gt;Chempedia Lab&lt;/a&gt; is a service that might have a role to play. It's a question and answer site dedicated to experimental chemistry. Ask a question and get a peer-reviewed answer. No inflated bureaucracy, no lengthy review process, no unaffordable subscriptions, no conflicts of interest, no nagging questions about re-use, no counterproductive rewards system. Just you, your peers, and the information - the way science is supposed to work.&lt;/p&gt;

&lt;p&gt;Maybe you're thinking that something like this can't possibly work. If so, I'll leave you with a simple challenge - do the experiment yourself. Ask the toughest question you can think of and see how long it takes to get either exactly the answer you were looking for, or an answer that puts you on the right track. Then ask yourself how you would have answered the same question without Chempedia Lab.&lt;/p&gt;

&lt;p&gt;Although Chempedia Lab may not be the best platform, one thing is clear - open science has no chance in the context of traditional scientific communication. That system is simply &lt;a href="http://cameronneylon.net/blog/peer-review-what-is-it-good-for/"&gt;too cost-ineffective&lt;/a&gt;, both in terms of money and time.&lt;/p&gt;

&lt;p&gt;My guess is that for every Matthew Todd there are at least a hundred others who would like to start the same kind of initiative, but who feel they lack the funding, the lab space, the staff, or some other critical resource. Thinking different about everything in the way we do chemistry - from who does the research, to where it gets done, to the medium of collaboration - is the key.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <id>tag:depth-first.com,2008:Article/606</id>
    <published>2010-02-04T15:14:27-08:00</published>
    <updated>2010-02-04T15:19:40-08:00</updated>
    <link rel="alternate" type="text/html" href="http://depth-first.com/articles/2010/02/04/pubcouch-create-your-own-custom-pubchem-subset" />
    <title>PubCouch: Create Your Own Custom PubChem Subset</title>
    <content type="html">&lt;p&gt;&lt;a href="http://couchdb.apache.org/"&gt;&lt;img src="http://depth-first.s3.amazonaws.com/20100119/couchdb.png" align="right" class="anchor"&gt;&lt;/img&gt;&lt;/a&gt;If you've ever worked with the PubChem dataset, you may have found yourself wanting to create a custom subset that filters out certain records. This article, the fourth in a &lt;a href="http://depth-first.com/articles/2010/01/20/pubcouch-a-couchdb-interface-to-pubchem"&gt;continuing series&lt;/a&gt;, shows a very simple way to create a custom PubChem dataset using &lt;a href="http://github.com/metamolecular/pubcouch"&gt;PubCouch&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;The Problem&lt;/h4&gt;

&lt;p&gt;I really like PubChem. It's the world's largest collection of freely-downloadable chemical structures and an excellent use of taxpayer dollars.&lt;/p&gt;

&lt;p&gt;But PubChem has faced some tough tradeoffs over the years, one of the foremost being how inclusive it should be. In other words, when to say 'no' to a substance depositor. I won't rehash the details here, but suffice it to say that the technologies on which PubChem is based are limited in important ways (for example: &lt;a href="http://depth-first.com/articles/2006/12/12/the-problem-with-ferrocene"&gt;organometallics&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;As part of ongoing work to expand &lt;a href="http://chempedia.com"&gt;Chempedia&lt;/a&gt;, the free chemical substance registry, I became interested in the possibility of building a subset of the PubChem Compound registry that only contained structures that could be safely encoded by the MDL Molfile specification. Call it "PubChem: The Good Parts."&lt;/p&gt;

&lt;p&gt;This database was likely to be huge and pretty non-relational. It looked like a perfect job for PubCouch.&lt;/p&gt;

&lt;h4&gt;A Solution&lt;/h4&gt;

&lt;p&gt;The software to solve this problem has been built into PubCouch. There are a couple of ways to run it, but I find one of the simplest is to use JRuby:&lt;/p&gt;

&lt;pre class="console"&gt;
$ git clone git@github.com:metamolecular/pubcouch.git
$ cd pubcouch
$ ant jar
$ jruby -S rake compounds:snapshot
&lt;/pre&gt;


&lt;p&gt;To get that last part working, you'll need to &lt;a href="http://jruby.org/getting-started"&gt;install JRuby&lt;/a&gt;. This is optional; you could also create an Ant task or use some other script. The point is that we're running a pre-packaged PubCouch task called "Compounds".&lt;/p&gt;

&lt;p&gt;There's one more thing - you'll obviously need CouchDB installed, and you'll need an empty database called "compounds". The database name can be changed to fit your preferences.&lt;/p&gt;

&lt;p&gt;Finally, the way this works is likely to change in the future. To be sure you'll be able to access the code described here, please use &lt;a href="http://github.com/metamolecular/pubcouch/commit/8208a82f42815c15b8a5207db91725bb97e01242"&gt;this commit&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Filtering&lt;/h4&gt;

&lt;p&gt;After running the snapshot task, you'll see some output indicating Compound IDs being checked and written.&lt;/p&gt;

&lt;p&gt;Not every compound is being written. Only those passing a specific set of requirements will end up in CouchDB:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No bond annotations other than 'aromatic'.&lt;/li&gt;
&lt;li&gt;No multicomponent (disconnected) Compounds.&lt;/li&gt;
&lt;li&gt;No undefined stereochemistry.&lt;/li&gt;
&lt;li&gt;No charged species.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;These happen to be my requirements - yours will probably differ somewhat. To change the applied filter, simply change the method Compounds.StrictFilter.pass. It's that simple.&lt;/p&gt;

&lt;h4&gt;Fine-Tuning&lt;/h4&gt;

&lt;p&gt;This is all pretty rough at this point. There are many opportunities to refine the code for flexibility and performance. For example, I initially experimented with CouchDB's &lt;a href="http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API"&gt;bulk update&lt;/a&gt; capability, which compresses multiple writes into a single HTTP request. But this actually resulted in more memory/processor usage. My guess is that this was probably less due to CouchDB than it was to the JSON overhead in the &lt;a href="http://code.google.com/p/jcouchdb/"&gt;JCouchDB&lt;/a&gt; library I'm using to talk to CouchDB. Your results may vary.&lt;/p&gt;

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;PubChem is an excellent free resource for raw chemical structures, if filtered correctly. This article showed how to create your own personal subset of PubChem using PubCouch.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <id>tag:depth-first.com,2008:Article/605</id>
    <published>2010-01-29T10:05:26-08:00</published>
    <updated>2010-01-29T10:05:26-08:00</updated>
    <link rel="alternate" type="text/html" href="http://depth-first.com/articles/2010/01/29/pubcouch-streams-arent-just-for-pipeline-pilot" />
    <title>PubCouch: Streams Aren't Just for Pipeline Pilot</title>
    <content type="html">&lt;p&gt;&lt;a href="http://couchdb.apache.org/"&gt;&lt;img src="http://depth-first.s3.amazonaws.com/20100119/couchdb.png" align="right" class="anchor"&gt;&lt;/img&gt;&lt;/a&gt;If you've been following along with the development of &lt;a href="http://github.com/metamolecular/pubcouch"&gt;PubCouch&lt;/a&gt;, the CouchDB interface for PubChem, you've probably noticed that only a fraction of the code relates to CouchDB itself. What's the rest of it doing?&lt;/p&gt;

&lt;p&gt;This article, the third in a series on &lt;a href="http://depth-first.com/articles/2010/01/20/pubcouch-a-couchdb-interface-to-pubchem"&gt;using CouchDB for PubChem data&lt;/a&gt;, describes how PubCouch transforms PubChem's collection of gzipped archive files into a stream of structure-data records that can be processed as if it were one big SD File.&lt;/p&gt;

&lt;h4&gt;The Problem&lt;/h4&gt;

&lt;p&gt;If you want to work with the PubChem dataset, one of the first problems you'll face is how to import the data into your database management system (DBMS). The PubChem FTP server contains a rather large collection of archive "bundles", which are simply &lt;a href="http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp"&gt;gzipped SD Files of records within a certain ID range&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In most cases, importing the PubChem database will consist of sequentially reading every Compound and Substance record, applying the appropriate intermediate processing, and storing the result.&lt;/p&gt;

&lt;p&gt;So, we have a mismatch between the way PubChem stores its data (multiple gzipped archives) and the way we want to process it (as one big SD File). And by the way, how about not storing a bunch of temporary files, but rather transferring data directly from the archive to our database?&lt;/p&gt;

&lt;h4&gt;A Solution&lt;/h4&gt;

&lt;p&gt;One of the reasons Java was chosen as PubCouch's development language is its built-in support for high-performance IO operations. &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStream.html"&gt;InputStream&lt;/a&gt;, the foundation of this support, turns out to be a very versatile class enabling a variety of filtering and reprocessing operations on raw data streams.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="http://commons.apache.org/net/"&gt;FTP Client&lt;/a&gt; can return at most one raw byte stream from each archive file. By applying a set of filters on this stream, we can get pretty close to where we need to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wrap each InputStream in a &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/GZIPInputStream.html"&gt;GZIPInputStream&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;wrap the result in an &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStreamReader.html"&gt;InputStreamReader&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;wrap the result in a &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/BufferedReader.html"&gt;BufferedReader&lt;/a&gt; to read line-by-line&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These filters alone won't do the job - remember, we want to treat the entire FTP archive as one big SD File.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/SequenceInputStream.html"&gt;SequenceInputStream&lt;/a&gt; is just what we need. This nifty little class can make a series of InputStreams (i.e., the individual PubChem files) appear as one big InputStream.&lt;/p&gt;

&lt;p&gt;Putting this all together, we end up with a chained series of inputs:&lt;/p&gt;

&lt;pre&gt;
[InputStream -&amp;gt; GZIPInputStream] -&amp;gt;
SequenceInputStream -&amp;gt;
InputStreamReader -&amp;gt;
BufferedReader
&lt;/pre&gt;


&lt;p&gt;We now have a BufferedReader that will for all intents and purposes look like we've just opened a massive SD File. Handing this Reader to an SD File processor will let us capture all Substance or Compound records using a simple conceptual model.&lt;/p&gt;
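
&lt;p&gt;The chain above is Java, and I haven't reproduced PubCouch's code here. As a rough shell analogy (a swapped-in technique, not the SequenceInputStream implementation), gzip -dc given several files writes their decompressed contents back to back, which is the same "one big SD File" idea. The two bundles below are made-up stand-ins for PubChem archives, and sd_stream and count_records are hypothetical helper names.&lt;/p&gt;

&lt;pre class="console"&gt;
```shell
#!/bin/sh
# Build two tiny stand-in "bundles" in a temporary directory (made-up records).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'record 1\n$$$$\nrecord 2\n$$$$\n' | gzip -c > "$tmp/bundle_1.sdf.gz"
printf 'record 3\n$$$$\n' | gzip -c > "$tmp/bundle_2.sdf.gz"

# Decompress every bundle, in order, into a single stream.
sd_stream() {
  gzip -dc "$@"
}

# Count SD File records by their "$$$$" terminator lines.
count_records() {
  grep -c '^\$\$\$\$$'
}

# Prints the number of records found across both bundles.
sd_stream "$tmp"/bundle_*.sdf.gz | count_records
```
&lt;/pre&gt;

&lt;p&gt;Nothing is written to disk beyond the compressed bundles themselves; the records flow straight through the pipe to whatever reads them.&lt;/p&gt;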

&lt;h4&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;By using Java's support for stream chaining and transformation, PubCouch makes it possible to work with the PubChem FTP archive as if it were one big SD File. This turns out to be useful regardless of how you decide to ultimately represent and store the resulting records. There are still some rough edges in the implementation and possibilities for extending the concept (e.g., random access), but the idea can be used on many other data sources, and in many other contexts.&lt;/p&gt;
</content>
  </entry>
</feed>
