<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>iPhylo</title><link>http://iphylo.blogspot.com/</link><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/iphylo" /><description>Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics.  For more ranty and less considered opinions, see my &lt;a&gt;Twitter feed&lt;/a&gt;.&lt;br&gt;View this blog in &lt;a href="http://iphylo.blogspot.com/view/magazine"&gt;Magazine View&lt;/a&gt;.</description><language>en</language><managingEditor>noreply@blogger.com (Roderic D. M. Page)</managingEditor><lastBuildDate>Wed, 30 May 2012 03:50:00 PDT</lastBuildDate><generator>Blogger http://www.blogger.com</generator><openSearch:totalResults xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/">494</openSearch:totalResults><openSearch:startIndex xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/">1</openSearch:startIndex><openSearch:itemsPerPage xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/">25</openSearch:itemsPerPage><feedburner:info uri="iphylo" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:browserFriendly>This is an XML content feed. 
It is intended to be viewed in a newsreader or syndicated to another site, subject to copyright and fair use.</feedburner:browserFriendly><item><title>The GBIF classification is broken — how do we fix it?</title><link>http://iphylo.blogspot.com/2012/05/gbif-classification-is-broken-how-do-we.html</link><category>BioStor</category><category>data cleaning</category><category>wiki</category><category>error</category><category>worm</category><category>mollusc</category><category>Hipponix</category><category>classification</category><category>Hipponyx</category><category>GBIF</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 30 May 2012 03:50:00 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-6835468730006383094</guid><description>This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help &lt;a href="http://iphylo.blogspot.com/2012/02/linking-gbif-and-biodiversity-heritage.html"&gt;link to GBIF occurrence records&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://ecat-dev.gbif.org/about/nub/"&gt;GBIF backbone classification&lt;/a&gt; ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.&lt;br /&gt;&lt;br /&gt;On the page for the article "Further additions to the known marine Molluscan fauna of St. 
Helena" (&lt;a href="http://biostor.org/reference/88554"&gt;http://biostor.org/reference/88554&lt;/a&gt;, see also &lt;a href="http://dx.doi.org/10.1080/00222939208677383"&gt;http://dx.doi.org/10.1080/00222939208677383&lt;/a&gt;) part of the classification looks like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;└Animalia&lt;br /&gt; └Annelida&lt;br /&gt;  └Polychaeta&lt;br /&gt;   └Sabellida&lt;br /&gt;    └Serpulidae&lt;br /&gt;     └Hipponyx&lt;/pre&gt;Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification appears in the annelid worms.&lt;br /&gt;&lt;br /&gt;Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of &lt;em&gt;Hipponyx&lt;/em&gt; and &lt;em&gt;Hipponix&lt;/em&gt; under the genus &lt;em&gt;Hipponix&lt;/em&gt;, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, &lt;a href="http://iphylo.org/~rpage/nz/index.php?mode=genus&amp;q=Hipponix"&gt;&lt;em&gt;Hipponix&lt;/em&gt;&lt;/a&gt; credited to DeFrance 1819 Journ. de Physique, 88, 217, and &lt;a href="http://iphylo.org/~rpage/nz/index.php?mode=genus&amp;q=Hipponyx"&gt;&lt;em&gt;Hipponyx&lt;/em&gt;&lt;/a&gt; credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (&lt;a href="http://biostor.org/reference/103838"&gt;Mémoire sur un nouveau genre de mollusque&lt;/a&gt;) and Bull. Sci. Soc. philom. Paris, 8. (&lt;a href="http://biostor.org/reference/103837"&gt;Sur un nouveau genre de coquilles (Hipponix)&lt;/a&gt;). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling &lt;em&gt;Hipponix&lt;/em&gt; (no 'y'). 
I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).&lt;br /&gt;&lt;br /&gt;OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem in use (see Google NGrams chart below).&lt;br /&gt;&lt;br /&gt;&lt;div style='padding-bottom: 2px; line-height: 0px'&gt;&lt;a href='http://pinterest.com/pin/119204721357461822/' target='_blank'&gt;&lt;img src='http://media-cache7.pinterest.com/upload/119204721357461822_HW1qzpSk_c.jpg' border='0' width='500' height ='183'/&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style='float: left; padding-top: 0px; padding-bottom: 0px;'&gt;&lt;p style='font-size: 10px; color: #76838b;'&gt;Source: &lt;a style='text-decoration: underline; font-size: 10px; color: #76838b;' href='http://books.google.com/ngrams/graph?content=Hipponyx%2CHipponix&amp;amp;year_start=1800&amp;amp;year_end=2000&amp;amp;corpus=0&amp;amp;smoothing=3'&gt;books.google.com&lt;/a&gt; via &lt;a style='text-decoration: underline; font-size: 10px; color: #76838b;' href='http://pinterest.com/rdmpage/' target='_blank'&gt;Roderic&lt;/a&gt; on &lt;a style='text-decoration: underline; color: #76838b;' href='http://pinterest.com' target='_blank'&gt;Pinterest&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;So, bit of a mess, but this still doesn't deal with &lt;em&gt;Hipponyx&lt;/em&gt; being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a &lt;a href="http://collections.tepapa.govt.nz/objectdetails.aspx?oid=737819"&gt;specimen record from Te Papa&lt;/a&gt; labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:&lt;br /&gt;&lt;br /&gt;Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. 
&lt;a href="http://dx.doi.org/10.1080/00288306.1971.10426332"&gt;doi:10.1080/00288306.1971.10426332&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;with the following abstract:&lt;br /&gt;&lt;blockquote&gt;An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.&lt;br /&gt;&lt;br /&gt;The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.&lt;br /&gt;&lt;br /&gt;The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). &lt;strong&gt;The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay&lt;/strong&gt;. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.&lt;/blockquote&gt;&lt;br /&gt;The author of the paper (&lt;a href="http://en.wikipedia.org/wiki/Charles_Fleming_(ornithologist)"&gt;Charles Fleming&lt;/a&gt;) argues that &lt;em&gt;Hipponyx inexpectata&lt;/em&gt;, regarded as a mollusc by its describer (Marjorie K. Mestayer, see &lt;a href="http://rsnz.natlib.govt.nz/volume/rsnz_60/rsnz_60_02_004690.html"&gt;Notes on New Zealand Mollusca. No. 
4.&lt;/a&gt;) is actually a worm, and he moves it to the genus &lt;em&gt;Temporaria&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;So it seems that &lt;em&gt;Hipponyx&lt;/em&gt; has ended up being a worm in the GBIF classification because of this synonymy.&lt;br /&gt;&lt;br /&gt;Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, but some may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so we fix it (and so that people don't think &lt;em&gt;Hipponyx&lt;/em&gt; is a worm). As has been said several times before, most eloquently by &lt;a href="http://ispiders.blogspot.co.uk/2007/10/taxonomic-consensus-as-software.html"&gt;David Shorthouse&lt;/a&gt;, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-6835468730006383094?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>EOL challenge draft proposal</title><link>http://iphylo.blogspot.com/2012/05/eol-challenge-draft-proposal.html</link><category>EOL</category><category>taxonomy</category><category>Challenge</category><category>data integration</category><author>noreply@blogger.com (Roderic D. M. 
Page)</author><pubDate>Tue, 15 May 2012 02:45:31 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-5341130646028083439</guid><description>In the spirit of the &lt;a href="http://iphylo.blogspot.com/2011/06/would-you-give-me-grant-experiment-in.html"&gt;Would you give me a grant experiment?&lt;/a&gt; [1] here's the draft of a proposal I'm working on for the &lt;a href="http://eol.org/info/323"&gt;Computable Data Challenge&lt;/a&gt;. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to &lt;a href="http://iphylo.blogspot.com/2011/10/linking-taxonomic-names-to-literature.html"&gt;link names to the primary literature&lt;/a&gt;. Comments welcome (the proposal deadline is next week). &lt;br /&gt;&lt;br /&gt;The proposal is embedded below using Google's PDF viewer, if you can't see it try logging into your Google account, or &lt;a href="http://dl.dropbox.com/u/639486/EOLChallenge/EOLComputableDataChallenge.pdf"&gt;click here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;iframe src="http://docs.google.com/viewer?url=http://dl.dropbox.com/u/639486/EOLChallenge/EOLComputableDataChallenge.pdf&amp;amp;embedded=true&amp;amp;output=embed" width="500" height="600" style="border:none;"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;1. The answer from &lt;a href="http://www.nerc.ac.uk"&gt;NERC&lt;/a&gt; was a resounding "no". 
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-5341130646028083439?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)</title><link>http://iphylo.blogspot.com/2012/04/dark-taxa-even-darker-ncbi-pulls-dna.html</link><category>NCBI</category><category>dark taxa</category><category>DNA barcoding</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Tue, 24 Apr 2012 09:13:54 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-1585271051507658278</guid><description>&lt;a href="http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html"&gt;Dark taxa&lt;/a&gt; have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon &lt;a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=818059"&gt;Cyclopoida sp. BOLD:AAG9771&lt;/a&gt; (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as &lt;a href="http://www.ncbi.nlm.nih.gov/nuccore/GU679674"&gt;GU679674&lt;/a&gt;. But if you go to that sequence you get this:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/-gzyEUIOdQHE/T5ZV6r8tdWI/AAAAAAAABMY/ueWxb18xC-E/obsolete.png?imgmax=800" alt="Obsolete" title="obsolete.png" border="0" width="556" height="80" /&gt;&lt;br /&gt;&lt;br /&gt;So the sequence is hidden. 
You can retrieve it by clicking on the &lt;a href="http://www.ncbi.nlm.nih.gov/nuccore/GU679674.1?report=genbank"&gt;obsolete version link&lt;/a&gt;, but by default it is hidden.&lt;br /&gt;&lt;br /&gt;It's an extraordinary state of affairs that a huge slice of fundamental biodiversity data has been effectively "pulled" from view.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update&lt;/b&gt; &lt;a href="http://connect.barcodeoflife.net/profile/SujeevanRatnasingham"&gt;Sujeevan Ratnasingham&lt;/a&gt; from iBOL has pointed out that the sequence I'd used above (GU679674) was not one of the ones hidden by NCBI, rather it was suppressed at the request of the investigator (which I'd have realised if I'd paid more attention to the screenshot). &lt;a href="http://www.ncbi.nlm.nih.gov/nuccore/HQ918317"&gt;HQ918317&lt;/a&gt; is an example of a BOLD record that was suppressed:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh3.ggpht.com/-sPgNOgS3dd0/T5bMcslmg_I/AAAAAAAABMk/WWQHiZ_COrE/hq.png?imgmax=800" alt="Hq" title="hq.png" border="0" width="568" height="94" /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-1585271051507658278?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh6.ggpht.com/-gzyEUIOdQHE/T5ZV6r8tdWI/AAAAAAAABMY/ueWxb18xC-E/s72-c/obsolete.png?imgmax=800" height="72" width="72" /></item><item><title>Quick thoughts on specimen identifiers</title><link>http://iphylo.blogspot.com/2012/04/quick-thoughts-on-specimen-identifiers.html</link><category>specimens</category><category>identifiers</category><category>DOI</category><category>CrossRef</category><category>DataCite</category><category>specimen codes</category><author>noreply@blogger.com (Roderic D. M. 
Page)</author><pubDate>Fri, 20 Apr 2012 03:49:19 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-8927176770671134045</guid><description>Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: &lt;b&gt;cheap&lt;/b&gt;, &lt;b&gt;resolvable&lt;/b&gt;, and &lt;b&gt;persistent&lt;/b&gt;. We get to pick two. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Cheap and resolvable&lt;/b&gt; means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Cheap and persistent&lt;/b&gt; means things like &lt;a href="http://iphylo.blogspot.com/2011/12/dna-barcoding-darwin-core-triplet-and.html"&gt;Darwin Core Triplet&lt;/a&gt; or URNs. You can write things on paper and they will persist (&lt;a href="http://iphylo.blogspot.com/2012/01/extracting-museum-specimen-codes-from.html"&gt;the Biodiversity Heritage Library&lt;/a&gt; shows us that), but how in the digital era do we do anything with this? If it's not resolvable what, exactly, is the point? We tried URNs, even ones that were resolvable (LSIDs), and &lt;a href="http://iphylo.blogspot.com/2012/02/why-lsids-suck.html"&gt;that was a disaster&lt;/a&gt; (we learnt a lot, but what a mess).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Resolvable and persistent&lt;/b&gt;. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? 
We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by &lt;a href="http://www.crossref.org"&gt;CrossRef&lt;/a&gt; and &lt;a href="http://datacite.org"&gt;DataCite&lt;/a&gt;, and will also play ball with linked data folks.&lt;br /&gt;&lt;br /&gt;In practical terms, what if we had a convention that each collection gets its own DOI prefix "10.nnnnn", after which it appends whatever specimen identifier makes sense (and is unique within that collection)? &lt;br /&gt;&lt;br /&gt;The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in &lt;a href="http://iphylo.blogspot.co.uk/2012/03/bhl-and-gbif-as-biomedical-databases.html"&gt;BHL and GBIF as biomedical databases&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;If we wrote this as a DOI of the form &amp;lt;doi prefix&amp;gt;/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:&lt;br /&gt;&lt;br /&gt;10.nnnnn/MAL/ANSP332467&lt;br /&gt;&lt;br /&gt;where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding (so if the specimen were moved from its current collection to a new institution, people wouldn't get too upset by the presence of "ANSP"). It would also help make the links between specimen codes and DOIs. 
We couldn't &lt;em&gt;rely&lt;/em&gt; on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences's malacological collection, but it would be a good place to start looking.&lt;br /&gt;&lt;br /&gt;As I've &lt;a href="http://iphylo.blogspot.co.uk/2009/04/gbif-and-handles-admitting-that.html"&gt;argued before&lt;/a&gt;, we could centralise the minting of these identifiers using GBIF, but do it in such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to do this). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged. &lt;br /&gt;&lt;br /&gt;I'm hoping we'll make some progress on this in the coming months...&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-8927176770671134045?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>EOL Computable Data Challenge community</title><link>http://iphylo.blogspot.com/2012/04/eol-computable-data-challenge-community.html</link><category>data</category><category>EOL</category><category>Challenge</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 05 Apr 2012 00:39:57 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-2759158532315136923</guid><description>&lt;img src="http://lh6.ggpht.com/-wUQJVGQirPk/T31MR89dCiI/AAAAAAAABMI/vnnYTNTm6pQ/17823_130_130.jpg?imgmax=800" alt="17823 130 130" title="17823_130_130.jpg" border="0" width="100" height="100" style="float:right;border:1px solid rgb(228,228,228);" /&gt;Now we are awash in challenges! 
EOL has announced its &lt;a href="http://eol.org/info/323"&gt;Computable Data Challenge&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;We invite ideas for scientific research projects that use EOL, including the &lt;a href="http://biodiversitylibrary.org"&gt;Biodiversity Heritage Library&lt;/a&gt; (BHL), to answer questions in biology. The specific field of biological interest for the challenge is open; projects in ecology, evolution, behavior, conservation biology, developmental biology, or systematics may be most appropriate. Projects advancing informatics alone may be less competitive. EOL may be used as a source of biological information, to establish a sampling strategy, to assist the retrieval of computable data by mapping identifiers across sources (e.g. to accomplish name resolution), and/or in other innovative ways. Projects involving data or text or image mining of EOL or BHL content are encouraged. Current EOL data and API shall be used; suggestions for modification of content or the API could be a deliverable of the project.  We encourage the use of data not yet in EOL for analyses. In all cases projects must honor terms of use and licensing as appropriate.&lt;/blockquote&gt;&lt;br /&gt;Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted. So, it's essentially a grant competition (with a pleasingly minimal amount of administrivia). There is also a &lt;a href="http://eol.org/communities/125/newsfeed"&gt;Computable Data Challenge community&lt;/a&gt; to discuss the challenge.&lt;br /&gt;&lt;br /&gt;It's great to see EOL trying different strategies to engage with developers. Of the different challenges EOL is running this one is perhaps the most appealing to me, because one of my biggest complaints about EOL is that it's hard to envisage "doing science" with it. 
For example, we can download GenBank and cluster sequences into gene families, or grab data from GBIF and model species distributions, but what could we do with EOL? This challenge will be a chance to explore the extent to which EOL can support science, which I would argue will be a key part of its long term future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-2759158532315136923?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh6.ggpht.com/-wUQJVGQirPk/T31MR89dCiI/AAAAAAAABMI/vnnYTNTm6pQ/s72-c/17823_130_130.jpg?imgmax=800" height="72" width="72" /></item><item><title>BHL and GBIF as biomedical databases</title><link>http://iphylo.blogspot.com/2012/03/bhl-and-gbif-as-biomedical-databases.html</link><category>Mekong River Schistosomiasis</category><category>BHL</category><category>biomedical</category><category>linking</category><category>PubMed</category><category>GBIF</category><category>schistosomiasis</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Tue, 27 Mar 2012 06:57:54 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-4827990715051832710</guid><description>When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or &lt;a href="http://blog.biodiversitylibrary.org/2012/03/bhl-funded-by-neh-to-reveal-art-of-life.html"&gt;nice artwork&lt;/a&gt;, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research. 
Again, great stuff, but aren't museums simply full of dead stuff that people have collected and forgotten about?&lt;br /&gt;&lt;br /&gt;But BHL has a lot more post-1923 content than I suspect most people realise (several museum or society journals have 21st century issues in BHL's archives, for example). Continuing the theme of &lt;a href="http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html"&gt;linking BHL and GBIF&lt;/a&gt; content, as part of a forthcoming project on taxonomic names (to be made available "real soon now") I stumbled across this 1976 paper in BHL (now in BioStor):&lt;br /&gt;&lt;br /&gt;&lt;a href="http://biostor.org/reference/102054"&gt;Monograph on "Lithoglyphopsis" aperta, the snail host of Mekong River Schistosomiasis&lt;/a&gt; by Davis &lt;i&gt;et al.&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-jZvJmXGivkg/T3HHVwu4ZkI/AAAAAAAABLs/AAgc7soVuWo/malacologia157576inst_0263.jpg?imgmax=800" alt="Malacologia157576inst 0263" title="malacologia157576inst_0263.jpg" border="0" width="383" height="600" /&gt;&lt;br /&gt;&lt;br /&gt;This paper has been indexed in PubMed (&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/948206"&gt;PMID:948206&lt;/a&gt;), but as far as I'm aware, BHL (and BioStor) has the only digital copy of this paper. 
(As a side note, wouldn't it be great if PubMed could link to BHL content?).&lt;br /&gt;&lt;br /&gt;The article page in BioStor shows a map derived from the OCR text, showing two localities:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh3.ggpht.com/-lVM1hQsTTrI/T3HHXZPMHoI/AAAAAAAABL0/2kAchrGffdE/mekong.png?imgmax=800" alt="Mekong" title="mekong.png" border="0" width="400" height="207" /&gt;&lt;br /&gt;&lt;br /&gt;Below the map are the specimen codes I've automatically extracted from the OCR text, linked to the corresponding records in GBIF, which are georeferenced (e.g., &lt;a href="http://data.gbif.org/occurrences/215921962/"&gt;ANSP Malacology 330925&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;If we joined these things up just a little more, we could do some useful things. For example, what if a researcher searching in PubMed for schistosomiasis in South East Asia could find the Davis &lt;i&gt;et al.&lt;/i&gt; paper, and then go to BHL or BioStor to read it? What if a researcher looking at gastropod distributions in the Mekong River in the GBIF portal could see that BHL had publications on diseases associated with these organisms (as well as their taxonomy and biology)? We could also traverse the link from GBIF to BHL to PubMed and provide a direct route from distribution maps to biomedical literature. 
&lt;br /&gt;&lt;br /&gt;It seems there's scope for trying to connect BHL, GBIF, and PubMed, and that BHL and GBIF may have important roles to play in providing access to basic information about organisms that have a serious impact on human populations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-4827990715051832710?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh5.ggpht.com/-jZvJmXGivkg/T3HHVwu4ZkI/AAAAAAAABLs/AAgc7soVuWo/s72-c/malacologia157576inst_0263.jpg?imgmax=800" height="72" width="72" /></item><item><title>iEvoBio 2012 Challenge: Synthesizing phylogenies</title><link>http://iphylo.blogspot.com/2012/03/ievobio-2012-challenge-synthesizing.html</link><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 21 Mar 2012 08:46:42 PDT</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-3912970041764165945</guid><description>&lt;img src="http://lh3.ggpht.com/-gpAAwNj2OVs/T2n33V5wYvI/AAAAAAAABLY/YXjkX2IGLmg/0150.jpg?imgmax=800" alt="0150" title="0150.jpg" border="0" width="128" height="102" style="float:right;" /&gt;The &lt;a href="http://ievobio.org/challenge.html"&gt;iEvoBio 2012 Challenge has been announced&lt;/a&gt;, and the topic is synthesizing phylogenies. The task:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Somewhere, buried in large sets of trees, lies a stunning new revelation, a baffling discovery, the answer to a longstanding controversy, or simply something not obvious to the naked eye. The mission of the 2012 iEvoBio challenge is to find those revelations, discoveries and answers within your own data and/or within one of the datasets provided by the challenge.  
What new scientifically interesting results can you pull from these trees, using any combination of techniques at your disposal?&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;The rules of this challenge are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The set of trees you use must have at least 10,000 leaves in total. Acceptable entries could be a set comprising 2,500 distinct trees covering the same four taxa, a single tree with 10,000+ leaves, or anything in between.&lt;/li&gt;&lt;li&gt;Your results must be scientifically new.&lt;/li&gt;&lt;li&gt;The data, or at least a description of the data, must be publicly available. If working with your own dataset, you must at least provide a summary of the data you used (see below for the minimum description that must be provided).&lt;/li&gt;&lt;li&gt;The source code of any tool and/or method developed as part of your challenge submission must be publicly downloadable under an OSI-approved open-source license (or dedicated to the public domain) at the latest by the time of the conference.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;For more details see the &lt;a href="http://ievobio.org/challenge.html"&gt;challenge site&lt;/a&gt;. 
Deadline for submission is &lt;b&gt;June 25, 2012&lt;/b&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-3912970041764165945?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh3.ggpht.com/-gpAAwNj2OVs/T2n33V5wYvI/AAAAAAAABLY/YXjkX2IGLmg/s72-c/0150.jpg?imgmax=800" height="72" width="72" /></item><item><title>Yet more reasons to have specimen identifiers: annotating GenBank sequences</title><link>http://iphylo.blogspot.com/2012/03/yet-more-reasons-to-have-specimen.html</link><category>Genbank</category><category>identifiers</category><category>wiki</category><category>error</category><category>annotation</category><category>GBIF</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 01 Mar 2012 03:47:06 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-3547740594598027859</guid><description>One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated. For example, every GenBank sequence has a unique accession number, so if you and I are discussing sequence &lt;a href="http://www.ncbi.nlm.nih.gov/nucleotide/DQ055738"&gt;DQ055738&lt;/a&gt;, you and I can be sure we are talking about the same thing.&lt;br /&gt;&lt;br /&gt;Sequence DQ055738 is interesting because Hua et al. 
&lt;b&gt;A Revised Phylogeny of Holarctic Treefrogs (Genus Hyla) Based on Nuclear and Mitochondrial DNA Sequences&lt;/b&gt; (&lt;a href="http://dx.doi.org/10.1655/08-058R1.1"&gt;http://dx.doi.org/10.1655/08-058R1.1&lt;/a&gt; - note the nice identifier we have for this article) have suggested this sequence (published in &lt;a href="http://dx.doi.org/10.1554/05-284.1"&gt;http://dx.doi.org/10.1554/05-284.1&lt;/a&gt;, another nice identifier) is misidentified. Given these identifiers we could construct various statements, such as:&lt;br /&gt;&lt;br /&gt;&lt;pre style="background:rgb(228,228,228)"&gt;&lt;br /&gt;DQ055738 -&gt; published in -&gt; doi:10.1554/05-284.1&lt;br /&gt;DQ055738 -&gt; annotated by -&gt; doi:10.1655/08-058R1.1&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(I've omitted the http:// stuff to keep things legible). Hua et al. state the following:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;However, the tissue number of this specimen (LSUMZ H-19067) is similar to that of a specimen of &lt;i&gt;H. versicolor&lt;/i&gt; (LSUMZ H-19077), which appears to have been processed at the same time (C. Austin, personal communication). Therefore, we hypothesize that the sequence data for H. gratiosa used by Smith et al. (2005) were actually from &lt;i&gt;H. versicolor&lt;/i&gt;.&lt;/blockquote&gt;&lt;br /&gt;It would be nice if we had unique, resolvable identifiers for LSUMZ H-19067 and LSUMZ H-19077 so that we could construct statements linking the sequence, the publications, and the specimens. But we don't. Nor is it obvious how to find out anything more about LSUMZ H-19067 and LSUMZ H-19077. By contrast, for the DOI or the sequence accession I know how to get more information, in either human- or machine-readable form.&lt;br /&gt;&lt;br /&gt;The acronym LSUMZ in this case refers to the Louisiana State University Museum of Natural Science Herpetology collection (&lt;a href="http://biocol.org/urn:lsid:biocol.org:col:34806"&gt;http://biocol.org/urn:lsid:biocol.org:col:34806&lt;/a&gt;). 
Just to confuse matters, LSUMZ specimens in GBIF use LSU as the acronym for Louisiana State University Museum of Natural Science. Given that GBIF's data comes from LSU itself, it's odd (but not surprising) that there's a muddle about which acronym to use (it would be nice to clear this up, but then anybody building identifiers based on those acronyms is in for some heartbreak).&lt;br /&gt;&lt;br /&gt;If I look at GBIF LSUMZ records there aren't specimens with the catalogue numbers H-19067 or H-19077. However, after a bit of poking around, and a helpful file from GBIF's Tim Robertson, I discovered that the LSUMZ herpetology tissue numbers (which is what the H-* codes actually are) are stored in GBIF, so I've found the corresponding specimens are &lt;a href="http://data.gbif.org/occurrences/45716232"&gt;http://data.gbif.org/occurrences/45716232&lt;/a&gt; (LSU Herp 84850, LSUMZ HerpNet Tissue 19067) and &lt;a href="http://data.gbif.org/occurrences/45710033"&gt;http://data.gbif.org/occurrences/45710033&lt;/a&gt; (LSU Herp 84862, LSUMZ HerpNet Tissue 19077). (Note that Hua et al. tell the reader that LSU 84850 = LSUMZ H-19067, but don't give the specimen code for LSUMZ H-19077).&lt;br /&gt;&lt;br /&gt;Now I have some resolvable identifiers, so I could construct statements like:&lt;br /&gt;&lt;br /&gt;&lt;pre style="background:rgb(228,228,228)"&gt;&lt;br /&gt;DQ055738 -&gt; voucher -&gt; occurrences/45716232&lt;br /&gt;DQ055738 -&gt; voucher -&gt; occurrences/45710033 &lt;br /&gt;                       |&lt;br /&gt;                       +-&gt; according to -&gt; doi:10.1655/08-058R1.1&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Let's skip over whether this is actually the best way to record the annotation; the point is we can now start to construct statements that can be linked to the wider world. If someone else has made statements about these specimens, and they used the GBIF URL, then we could aggregate those and learn more about these specimens and their associated sequences. 
Without globally unique, stable, resolvable identifiers we are left to flounder around in the bowels of various databases searching for something that may or may not be the object being discussed. Isn't it time we did something about this?&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-3547740594598027859?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Making biodiversity data sticky: it's all about links</title><link>http://iphylo.blogspot.com/2012/02/making-biodiversity-data-sticky-it-all.html</link><category>velcro</category><category>integration</category><category>links</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 29 Feb 2012 05:16:34 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-521245854135043072</guid><description>&lt;div style="text-align:center;"&gt;&lt;a href="http://www.flickr.com/photos/wild-wood/5280610237/" title="Who invented velcro? by A-dep, on Flickr"&gt;&lt;img src="http://farm6.staticflickr.com/5086/5280610237_a4a0c6c6d9.jpg" width="450" alt="Who invented velcro?"&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is I want data to be "sticky", like the burrs shown in the photo above (&lt;a href="http://www.flickr.com/photos/wild-wood/5280610237/"&gt;Who invented velcro?&lt;/a&gt; by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, we could rapidly assemble a "ball" of interconnected data. 
The diagram below, which I published as part of my Elsevier Challenge entry (&lt;a href="http://hdl.handle.net/10101/npre.2009.3173.1"&gt;preprint&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1016/j.websem.2010.03.004"&gt;published version&lt;/a&gt;), summarises some of the links between diverse kinds of biological data:&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-EHn2Isf6SvQ/T04lKO-fYSI/AAAAAAAABK0/Nwy7DbTrC90/model.png?imgmax=800" alt="Model" border="0" width="450" height="376" /&gt;&lt;br /&gt;While in principle many of these links should be trivial to create, in practice they aren't. One major obstacle is the lack of globally unique identifiers, or if such identifiers exist they aren't being used. As a result, our data is anything but sticky. In the absence of identifiers, creating links between different data sets can be a significant undertaking. One way to tackle this is to focus on just one kind of link at a time and create a database of those links. The diagram below shows some of the links I've been working on:&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-nFQpD6vmtqw/T04lLmEkviI/AAAAAAAABK8/NosL9AK1Rzs/links.png?imgmax=800" alt="Links" border="0" width="325" height="600" /&gt;&lt;br /&gt;For example, the &lt;a href="http://iphylo.org/linkout/"&gt;iPhylo Linkout&lt;/a&gt; project creates links between taxon concepts in NCBI and Wikipedia. The &lt;a href="http://iphylo.org/~rpage/itaxon/"&gt;iTaxon&lt;/a&gt; project is a mapping between taxonomic names and publications. I've briefly explored &lt;a href="http://iphylo.blogspot.com/2011/03/visualising-symbiome-hosts-parasites.html"&gt;mapping host-parasite relationships using GenBank&lt;/a&gt;, and I'm currently exploring the &lt;a href="http://iphylo.blogspot.com/2012/02/linking-gbif-and-biodiversity-heritage.html"&gt;links between publications and specimens&lt;/a&gt;. 
This list certainly doesn't exhaust the set of possible links, but it's a start. The challenge is to create sufficient links for biodiversity data to finally coalesce and for us to be able to ask questions that span multiple sources and types of data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-521245854135043072?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh6.ggpht.com/-EHn2Isf6SvQ/T04lKO-fYSI/AAAAAAAABK0/Nwy7DbTrC90/s72-c/model.png?imgmax=800" height="72" width="72" /></item><item><title>GBIF specimens in BioStor: who are the top ten museums with citable specimens?</title><link>http://iphylo.blogspot.com/2012/02/gbif-specimens-in-biostor-who-are-top.html</link><category>BioStor</category><category>digitisation</category><category>host</category><category>parasite</category><category>museums</category><category>lice</category><category>GBIF</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Tue, 28 Feb 2012 06:31:45 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-86303531852772402</guid><description>&lt;img src="http://lh4.ggpht.com/-xUNNB8qt0bo/T0zlTrh5XaI/AAAAAAAABKo/gkNPg4kwKXU/gbif.gif?imgmax=800" alt="Gbif" title="gbif.gif" border="0" width="128" height="124" style="float:right;" /&gt;Brief update on &lt;a href="http://iphylo.blogspot.com/2012/02/linking-gbif-and-biodiversity-heritage.html"&gt;yesterday's post&lt;/a&gt; about finding specimens in &lt;a href="http://biostor.org"&gt;BioStor&lt;/a&gt;. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text. 
Of these 143,000 occurrences, 81,000 have been matched to an occurrence in GBIF.&lt;br /&gt;&lt;br /&gt;The top ten collections with specimens in BioStor are:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;th&gt;Dataset&lt;/th&gt;&lt;th&gt;Number of specimens&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;NMNH Vertebrate Zoology Herpetology Collections (National Museum of Natural History)&lt;/td&gt;&lt;td&gt;11194&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Herpetology Collection (University of Kansas Biodiversity Research Center)&lt;/td&gt;&lt;td&gt;9619&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Herpetology Collection (University of Kansas Biodiversity Research Center)&lt;/td&gt;&lt;td&gt;9328&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;NMNH Invertebrate Zoology Collections (National Museum of Natural History)&lt;/td&gt;&lt;td&gt;9061&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CAS Herpetology Collection Catalog (California Academy of Sciences)&lt;/td&gt;&lt;td&gt;6720&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MCZ Herpetology Collection (Museum of Comparative Zoology, Harvard University)&lt;/td&gt;&lt;td&gt;5818&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;NMNH Vertebrate Zoology Fishes Collections (National Museum of Natural History)&lt;/td&gt;&lt;td&gt;4642&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MCZ Herpetology Collection - Reptile Database (Museum of Comparative Zoology, Harvard University)&lt;/td&gt;&lt;td&gt;4380&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FMNH Herpetology Collections (Field Museum)&lt;/td&gt;&lt;td&gt;2110&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FMNH Fishes Collections (Field Museum)&lt;/td&gt;&lt;td&gt;2061&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;This is pretty much what I expected. Virtually complete runs of publications from The Field Museum at Chicago, the University of Kansas, and the Biological Society of Washington are available in BHL, and many of these have been added to BioStor. 
These journals have extensive taxonomic treatments of vertebrate taxa, particularly frogs, hence herpetology collections dominate the rankings.&lt;br /&gt;&lt;br /&gt;There will inevitably be errors in the mapping between specimen codes and GBIF occurrences. I've tried to minimise these by mapping codes within taxonomic groups, but it's clear that there are duplicate codes even within some collections. There is also all manner of variation in the way people cite museum specimens, and these are often different from the codes that appear in GBIF. There will also be issues with extracting specimen codes, and I'm also discovering a few *cough* duplicates of articles in BioStor, so the numbers I present above are liable to change as I clean things up.&lt;br /&gt;&lt;br /&gt;But one could imagine a "league table" of museum collections, where we can measure both the extent to which those collections have been digitised, and the extent to which material from those collections has been cited. We could use this to compute measures of the impact of a collection.&lt;br /&gt;&lt;br /&gt;But for now I'm browsing the results trying to get a sense of how successful the mapping has been. There are some interesting examples. The specimen codes extracted from the article &lt;a href="http://biostor.org/reference/81065"&gt;Review Of The Chewing Louse Genus Abrocomophaga (Phthiraptera : Amblycera), With Description Of Two New Species&lt;/a&gt; are those for the mammalian hosts of the lice. Hence someone viewing the records for these specimens and following the link to this paper would discover that these mammals had parasitic lice. 
If we add other sorts of links to the mix, such as between specimens and DNA sequences, then we can start to build a rich network of connections between the basic data of biodiversity.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-86303531852772402?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh4.ggpht.com/-xUNNB8qt0bo/T0zlTrh5XaI/AAAAAAAABKo/gkNPg4kwKXU/s72-c/gbif.gif?imgmax=800" height="72" width="72" /></item><item><title>Linking GBIF and the Biodiversity Heritage Library</title><link>http://iphylo.blogspot.com/2012/02/linking-gbif-and-biodiversity-heritage.html</link><category>BioStor</category><category>identifiers</category><category>BHL</category><category>linking</category><category>GBIF</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Mon, 27 Feb 2012 08:04:31 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-2718987431173780690</guid><description>Following on from exploring &lt;a href="http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html"&gt;links between GBIF and GenBank&lt;/a&gt; here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the &lt;a href="http://www.biodiversitylibrary.org"&gt;Biodiversity Heritage Library&lt;/a&gt; (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my &lt;a href="http://biostor.org"&gt;BioStor&lt;/a&gt; project I've extracted article-level metadata and geographic co-ordinates. 
Given that many articles in BioStor list museum specimens I wrote some code to extract these (see &lt;a href="http://iphylo.blogspot.com/2012/01/extracting-museum-specimen-codes-from.html"&gt;Extracting museum specimen codes from text&lt;/a&gt;) and applied this to the OCR text for those articles.&lt;br /&gt;&lt;br /&gt;Having a list of specimens is nice, but in this digital age I want to be able to find out more about these specimens. An obvious solution is to try to match these specimen codes to the specimen records held by &lt;a href="http://data.gbif.org"&gt;GBIF&lt;/a&gt;. Linking to GBIF is complicated by the fact that museum codes are not unique. For example, "FMNH 147942" could refer to a &lt;a href="http://data.gbif.org/occurrences/236968599"&gt;bird&lt;/a&gt;, an &lt;a href="http://data.gbif.org/occurrences/100432597"&gt;amphibian&lt;/a&gt;, or a &lt;a href="http://data.gbif.org/occurrences/61846037"&gt;mammal&lt;/a&gt;. To tackle the non-uniqueness I use the taxonomic names extracted from each page by BHL to work out what taxon an article is mainly "about". To do this I use the Catalogue of Life classification to get "paths" for each name (i.e., the lineage of each taxon down to the root of the classification) and then find the majority-rule path. You can see these paths in the "Taxonomic classification" displayed on a page for a BioStor article. If there are multiple GBIF specimens for the same code I test whether the taxon of rank "class" in the GBIF record is in the majority-rule path for the article. If so, I accept that specimen as the match to the code. &lt;br /&gt;&lt;br /&gt;There are also issues where the specimen codes in GBIF have been modified during input (e.g., USNM 730715 has become &lt;a href="http://data.gbif.org/occurrences/137322490/"&gt;USNM 730715.457409&lt;/a&gt;). There are also the inevitable OCR errors that may cause museum codes to be missed or otherwise corrupted. 
Bearing all this in mind, BioStor now has specimen pages (these are still being generated as I write this). For example, the page for &lt;a href="http://biostor.org/specimen/FMNH%20147942"&gt;FMNH 147942&lt;/a&gt; lists the three articles in BioStor that cite this specimen code:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh3.ggpht.com/-rHOMhlnCMLo/T0upivhNFBI/AAAAAAAABKc/yGXlUbSH6XI/fmnh147942.png?imgmax=800" alt="Fmnh147942" border="0" width="450" height="145" /&gt;&lt;br /&gt;&lt;br /&gt;All three citations have been mapped on to GBIF occurrence &lt;a href="http://data.gbif.org/occurrences/61846037/"&gt;http://data.gbif.org/occurrences/61846037/&lt;/a&gt;. When BioStor displays the articles it now lists the specimen codes that have been extracted from the article, together with the GBIF logo if the specimen has been matched to a GBIF record. For example, here is a screenshot from &lt;a href="http://biostor.org/reference/15"&gt;Deep-water octopods (Mollusca: Cephalopoda) of the northeastern Pacific&lt;/a&gt;:&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-uQyT2vvxklY/T0upfpujJkI/AAAAAAAABKU/CcBZelBV-sY/deepwater.png?imgmax=800" alt="Deepwater" border="0" width="450" height="244" /&gt;&lt;br /&gt;&lt;br /&gt;The map has been extracted from the OCR text (an obvious next step would be to add localities associated with the specimen records). Below the map are the specimen codes. The lack of some USNM specimens is probably due to misinterpreted specimen codes, whereas the CAS specimens don't seem to be online (the California Academy of Sciences has some of its collections in GBIF, but not its molluscs).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Where next?&lt;/b&gt;&lt;br /&gt;Once these links between BioStor (and hence, BHL) and GBIF are created then we can do some interesting things. 
If you visit BioStor and want to learn more about a specimen you can click on the link and view the record in GBIF. We could also envisage doing the reverse. GBIF could augment the information it displays about a specimen by displaying a link to the content in BioStor (e.g., "this specimen is cited by these articles"). Those articles may contain further information about that specimen (for example, the habitat it was collected from, how secure its identification is, and so on).&lt;br /&gt;&lt;br /&gt;We could also start to compute the "impact" of different museum collections based on the number of citations of specimens from their collections (this idea is explored further in this paper: &lt;a href="http://dx.doi.org/10.1093/bib/bbn022"&gt;http://dx.doi.org/10.1093/bib/bbn022&lt;/a&gt;, free preprint available here: &lt;a href="http://hdl.handle.net/10101/npre.2008.1760.1"&gt;hdl:10101/npre.2008.1760.1&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;All of this works because we are linking objects (in this case articles and specimens) via their identifiers. Consequently, the links are as stable as their identifiers, which is why I've been pursuing the issue of specimen identifiers recently (see &lt;a href="http://iphylo.blogspot.com/2011/12/dna-barcoding-darwin-core-triplet-and.html"&gt;here&lt;/a&gt;, &lt;a href="http://iphylo.blogspot.com/2012/01/yet-another-reason-why-we-need-specimen.html"&gt;here&lt;/a&gt;, and &lt;a href="http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html"&gt;here&lt;/a&gt;). If GBIF maintains the URLs for the specimens I've linked to, then links I've created could persist. If these URLs are likely to change (e.g., because the metadata from the host institution has changed) then the links (and any associated value we get from them) disappear. 
This is why I want globally unique, resolvable, persistent identifiers for specimens.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-2718987431173780690?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh3.ggpht.com/-rHOMhlnCMLo/T0upivhNFBI/AAAAAAAABKc/yGXlUbSH6XI/s72-c/fmnh147942.png?imgmax=800" height="72" width="72" /></item><item><title>How many specimens does GBIF really have?</title><link>http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html</link><category>duplicates</category><category>identifiers</category><category>specimen codes</category><category>GBIF</category><category>Darwin Core riplet</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 23 Feb 2012 01:35:30 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-1873183687969963903</guid><description>&lt;img src="http://lh3.ggpht.com/-6U5zj3GwbPs/T0YIXk_W4CI/AAAAAAAABKA/6PmbNXJ54fU/gbif.gif?imgmax=800" alt="Gbif" title="gbif.gif" border="0" width="128" height="124" style="float:right;" /&gt;Duplicate records are the bane of any project that aggregates data from multiple sources. &lt;a href="http://www.mendeley.com"&gt;Mendeley&lt;/a&gt;, for example, has numerous copies of the same article, as documented by Duncan Hull (&lt;a href="http://duncan.hull.name/2010/09/01/mendeley/"&gt;How many unique papers are there in Mendeley?&lt;/a&gt;). In their defence, Mendeley is aggregating data from lots of personal reference libraries and hence they will often encounter the same article with slightly differing metadata (we all have our own quirks when we store bibliographic details of papers). 
It's a challenging problem to identify and merge records which are not identical, but which are clearly the same thing.&lt;br /&gt;&lt;br /&gt;What I'm finding rather more alarming is that &lt;a href="http://data.gbif.org"&gt;GBIF&lt;/a&gt; has duplicate records for &lt;b&gt;the same&lt;/b&gt; specimen from the &lt;b&gt;same data provider&lt;/b&gt;. For example, the specimen &lt;b&gt;USNM 547844&lt;/b&gt; is present twice:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://data.gbif.org/occurrences/157337271/"&gt;http://data.gbif.org/occurrences/157337271/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://data.gbif.org/occurrences/244120356/"&gt;http://data.gbif.org/occurrences/244120356/&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;As far as I can tell this is the same specimen, but the catalogue numbers differ (547844 versus 547844.6544573). Apart from this the only difference is when the two records were indexed. The source for 547844 was last indexed August 9, 2009, the source for 547844.6544573 was first indexed August 22, 2010. So it would appear that some time between these two dates the US National Museum of Natural History (NMNH) changed the catalogue codes (by appending another number), so GBIF has treated them as two distinct specimens. Browsing other GBIF records from the NMNH shows the same pattern. I've not quantified the extent of this problem, but it's probably a safe bet that every NMNH herp specimen occurs twice in GBIF.&lt;br /&gt;&lt;br /&gt;Then there are the records from Harvard's Museum of Comparative Zoology that are duplicates, such as &lt;a href="http://data.gbif.org/occurrences/33400333/"&gt;http://data.gbif.org/occurrences/33400333/&lt;/a&gt; and &lt;a href="http://data.gbif.org/occurrences/328478233/"&gt;http://data.gbif.org/occurrences/328478233/&lt;/a&gt; (both for specimen MCZ A-4092, in this case the collectionCode is either "Herp" or "HERPAMPH"). 
These are records that have been loaded at different times, and because the metadata has changed GBIF hasn't recognised that these are the same thing.&lt;br /&gt;&lt;br /&gt;At the root of this problem is the lack of globally unique identifiers for specimens, or even identifiers that are unique and stable within a dataset. The &lt;a href="http://code.google.com/p/darwincore/wiki/Occurrence"&gt;Darwin Core wiki&lt;/a&gt; lists a field for &lt;b&gt;occurrenceID&lt;/b&gt; for which it states:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;The occurrenceID is supposed to (globally) uniquely identify an occurrence record, whether it is a specimen-based occurrence, a one-time observation of a species at a location, or one of many occurrences of an individual who is being tracked, monitored, or recaptured. Making it globally unique is quite a trick, one for which we don't really have good solutions in place yet, but one which ontologists insist is essential.&lt;/blockquote&gt;&lt;br /&gt;Well, now we see the side effect of not tackling this problem - our flagship aggregator of biodiversity data has duplicate records. Note that this has nothing to do with "ontologists" (whatever they are), it's simple data management. Assign a unique id (a primary key in a database will do fine) that can be used to track the identity of an object even as its metadata changes. Otherwise you are reduced to matching based on metadata, and if that is changeable then you have a problem.&lt;br /&gt;&lt;br /&gt;Now, just imagine the potential chaos if we start changing institution and collection codes to conform to the &lt;a href="http://iphylo.blogspot.com/2011/12/dna-barcoding-darwin-core-triplet-and.html"&gt;Darwin Core triplet&lt;/a&gt;. 
In the absence of unique identifiers (again, these can be local to the data set) GBIF is going to be faced with a massive data reconciliation task to try and match old and new specimen records.&lt;br /&gt;&lt;br /&gt;The other problem, of course, is that my plan to use GBIF occurrence URLs as globally unique identifiers for specimens is looking pretty shaky because they are not unique (the same specimen can have more than one) and if GBIF cleans up the duplicates a number of these URLs will disappear. Bugger.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-1873183687969963903?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh3.ggpht.com/-6U5zj3GwbPs/T0YIXk_W4CI/AAAAAAAABKA/6PmbNXJ54fU/s72-c/gbif.gif?imgmax=800" height="72" width="72" /></item><item><title>Clustering strings</title><link>http://iphylo.blogspot.com/2012/02/clustering-strings.html</link><category>data cleaning</category><category>taxonomy</category><category>clustering</category><category>Graphviz</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 22 Feb 2012 07:15:53 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-5892993905869406165</guid><description>Revisiting an old idea (&lt;a href="http://iphylo.blogspot.com/2009/02/clustering-taxonomic-names.html"&gt;Clustering taxonomic names&lt;/a&gt;) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site.&lt;br /&gt;&lt;br /&gt;This service (available at &lt;a href="http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php"&gt;http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php&lt;/a&gt;) takes a list of strings, one per line, and returns a list of clusters. 
For example, given the names&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Ferrusac 1821&lt;br /&gt;Bonavita 1965&lt;br /&gt;Ferussa 1821&lt;br /&gt;Fer.&lt;br /&gt;Lamarck 1812&lt;br /&gt;Ferussac 1821&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;the service finds three clusters, displayed here using Google images:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://chart.googleapis.com/chart?cht=gv&amp;chl=graph+%7Bnode+0+%5Blabel%3D%22ferrusac+1821%22%5D%3Bnode+1+%5Blabel%3D%22bonavita+1965%22%5D%3Bnode+2+%5Blabel%3D%22ferussa+1821%22%5D%3Bnode+3+%5Blabel%3D%22fer%22%5D%3Bnode+4+%5Blabel%3D%22lamarck+1812%22%5D%3Bnode+5+%5Blabel%3D%22ferussac+1821%22%5D%3B0+--+3+%5Blabel%3D%223%22%5D%3B0+--+5+%5Blabel%3D%228%22%5D%3B2+--+3+%5Blabel%3D%223%22%5D%3B2+--+5+%5Blabel%3D%227%22%5D%3B3+--+5+%5Blabel%3D%223%22%5D%3B%7D" width="450" /&gt;&lt;br /&gt;&lt;br /&gt;(Note to self, investigate &lt;a href="http://code.google.com/p/canviz/"&gt;canviz&lt;/a&gt; as an alternative for displaying graphviz graphs.)&lt;br /&gt;&lt;br /&gt;If you are curious, these strings are taxonomic authorities associated with the name &lt;a href="http://iphylo.org/~rpage/itaxon/?search=Helicella"&gt;Helicella&lt;/a&gt;, and based on this clustering there are three taxonomic names, one of which has three different variations of the author's name.&lt;br /&gt; &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-5892993905869406165?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Why LSIDs suck</title><link>http://iphylo.blogspot.com/2012/02/why-lsids-suck.html</link><category>rant</category><category>LSID</category><author>noreply@blogger.com (Roderic D. M. 
Page)</author><pubDate>Wed, 22 Feb 2012 01:39:28 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-2530158453393390941</guid><description>I'll keep this short: LSIDs suck because they are so hard to set up that many LSIDs don't actually work. Because of this there seems to be no shame in publishing "fake" LSIDs (LSIDs that look like LSIDs but which don't resolve using the LSID protocol). Hey, it's hard work, so let's just stick them on a web page but not actually make them resolvable. Hence we have an identifier that people don't recognise (most people have no idea what an LSID is) and which we have no expectation will actually work. This devalues the identifier to the point where it becomes effectively worthless. &lt;br /&gt;&lt;br /&gt;Now consider URLs. If you publish a URL I expect it to work (i.e., I paste it into a web browser and I get something). If it doesn't work then I can conclude that the URL is wrong, or that you are a numpty and can't run a web site (or don't care enough about your content to keep the URL working). At no point am I going to say "gee, it's OK that this URL doesn't resolve because these things are hard work."&lt;br /&gt;&lt;br /&gt;Now you might argue that whether your LSID resolves is an even better way for me to assess your technical ability (because it's hard work to do it right). Fair enough, but the fact that even major resources (such as Catalogue of Life) can't get them to work reliably reduces the value of this test (it's a poor predictor of the quality of the resource). Or, perhaps the LSID is a signal that you get this "globally unique identifier thing" and maybe one day will make the LSIDs work. No, it's a signal you don't care enough about identifiers to make them actually work today.&lt;br /&gt;&lt;br /&gt;As soon as people decided it's OK to publish LSIDs that don't work, LSIDs were doomed. 
The most immediate way for me to determine whether you are providing useful information (resolving the identifier) is gone. And with that goes any sense that I can trust LSIDs.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-2530158453393390941?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Linking GBIF and Genbank</title><link>http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html</link><category>KML</category><category>Genbank</category><category>geophylogeny</category><category>NCBI</category><category>linking</category><category>TreeBASE</category><category>frogs</category><category>Pristimantis</category><category>GBIF</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Tue, 21 Feb 2012 02:44:00 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-8724627697733546951</guid><description>As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique, are written in all sorts of ways, and there are multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or with field or collector numbers).&lt;br /&gt;&lt;br /&gt;So why undertake what is fast looking like a hopeless task? 
There are several reasons:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.&lt;/li&gt;&lt;li&gt;Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. 
In the simplest terms this could be the number of sequences and publications that use data from the specimen; more sophisticated approaches could use PageRank-like measures (see &lt;a href="http://hdl.handle.net/10101/npre.2008.1760.1"&gt;hdl:10101/npre.2008.1760.1&lt;/a&gt;).&lt;/li&gt;&lt;li&gt;Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't, we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind &lt;a href="http://hdl.handle.net/10101/npre.2009.3173.1"&gt;hdl:10101/npre.2009.3173.1&lt;/a&gt;).&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;As an example, below is the &lt;a href="http://data.gbif.org/occurrences/taxon/celldensity/taxon-celldensity-54040866.kml"&gt;GBIF 1° density map&lt;/a&gt; for the frog &lt;a href="http://data.gbif.org/species/2425583/"&gt;&lt;i&gt;Pristimantis ridens&lt;/i&gt;&lt;/a&gt;, with the phylogeny from Wang &lt;i&gt;et al.&lt;/i&gt; &lt;b&gt;Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central America&lt;/b&gt; &lt;a href="http://dx.doi.org/10.1016/j.ympev.2008.02.021"&gt;http://dx.doi.org/10.1016/j.ympev.2008.02.021&lt;/a&gt; layered over it. I created the KML tree from the corresponding tree in TreeBASE using the &lt;a href="http://iphylo.blogspot.com/2012/02/automating-creation-of-geophylogenies.html"&gt;tool I described earlier&lt;/a&gt;. 
You can grab the &lt;a href="http://dl.dropbox.com/u/639486/kml/Tr5096.kml"&gt;KML for the tree here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh4.ggpht.com/-lWaePnxJpZo/T0NyWoKWQaI/AAAAAAAABJg/R-Q5YOG6--E/density.png?imgmax=800" alt="Density" title="density.png" border="0" width="400" height="377" /&gt;&lt;br /&gt;&lt;br /&gt;As we'd expect, there is a lot of overlap between the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the &lt;a href="http://data.gbif.org/occurrences/taxon/placemarks/taxon-placemarks-54040866.kml"&gt;GBIF KML file with individual placemarks&lt;/a&gt; we see that in the northern part of the range there are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-QSfss1Qi6nE/T0NyX9Ib75I/AAAAAAAABJo/a4aZdWcVVBM/gbif.png?imgmax=800" alt="Gbif" title="gbif.png" border="0" width="400" height="379" /&gt;&lt;br /&gt;&lt;br /&gt;One of these 15 GBIF records (&lt;a href="http://data.gbif.org/occurrences/244335848"&gt;http://data.gbif.org/occurrences/244335848&lt;/a&gt;) is for specimen USNM 514547, which is the voucher specimen for &lt;a href="http://www.ncbi.nlm.nih.gov/nucleotide/EU443175"&gt;EU443175&lt;/a&gt;. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen &lt;a href="http://data.gbif.org/occurrences/244335848"&gt;http://data.gbif.org/occurrences/244335848&lt;/a&gt; instead of the unresolvable and potentially ambiguous USNM 514547.&lt;br /&gt;&lt;br /&gt;If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF. 
&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-RMa6W21ShVo/T0NyZdAJLxI/AAAAAAAABJw/Rfj6ruIGOMk/nogbif.png?imgmax=800" alt="Nogbif" title="nogbif.png" border="0" width="400" height="378" /&gt;&lt;br /&gt;&lt;br /&gt;Close inspection reveals that some of the specimens listed in the Wang &lt;i&gt;et al.&lt;/i&gt; paper are actually in GBIF, but lack geographic coordinates. For example the OTU "&lt;i&gt;Pristimantis ridens&lt;/i&gt; Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as &lt;a href="http://data.gbif.org/occurrences/57919777/"&gt;http://data.gbif.org/occurrences/57919777/&lt;/a&gt;, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang &lt;i&gt;et al.&lt;/i&gt; paper and the GenBank record for the sequence from this specimen &lt;a href="http://www.ncbi.nlm.nih.gov/nucleotide/EU443164"&gt;EU443164&lt;/a&gt; give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.&lt;br /&gt;&lt;br /&gt;Part of GBIF's success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or using spatial coordinates (which immediately enables integration with environmental data). But if we want to integrate at deeper levels then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material examined). 
This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-8724627697733546951?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh4.ggpht.com/-lWaePnxJpZo/T0NyWoKWQaI/AAAAAAAABJg/R-Q5YOG6--E/s72-c/density.png?imgmax=800" height="72" width="72" /></item><item><title>EOL Phylogenetic Tree Challenge</title><link>http://iphylo.blogspot.com/2012/02/eol-phylogenetic-tree-challenge.html</link><category>Tree of Life</category><category>EOL</category><category>Challenge</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 16 Feb 2012 10:51:04 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-7540929111877610916</guid><description>&lt;img src="http://lh6.ggpht.com/-9ByM1xvgQdo/Tz1QEANkMLI/AAAAAAAABJE/XLFtQrzLAlU/34106_130_130.jpg?imgmax=800" alt="34106 130 130" title="34106_130_130.jpg" border="0" width="130" height="130" style="float:right;" /&gt;The Encyclopedia of Life have announced the &lt;a href="http://eol.org/info/tree_challenge"&gt;EOL Phylogenetic Tree Challenge&lt;/a&gt;. 
The contest has two purposes:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;It provides a testbed for the Evolutionary Informatics community to develop robust methods for producing, serving, and evaluating large, biologically meaningful trees that will be useful both to the research community and to broader audiences.&lt;br /&gt;&lt;br /&gt;It enables the Encyclopedia of Life to organise the information it aggregates according to phylogenetic relationships; in other words, it provides a direct pipeline from research results to practical use.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;First prize is a trip to &lt;a href="http://ievobio.org/"&gt;iEvoBio 2012&lt;/a&gt;, this year in Ottawa, Canada. For more details visit the &lt;a href="http://eol.org/info/tree_challenge"&gt;challenge website&lt;/a&gt;. There is also an &lt;a href="http://eol.org/communities/98/newsfeed"&gt;EOL community devoted to this challenge&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Challenges are great things, especially ones with worthwhile tasks and decent prizes. EOL badly needs a phylogenetic perspective, so this is a welcome development.&lt;br /&gt;&lt;br /&gt;But (there's always a but), I can't help feeling that we need something a little more radical. The tree of life isn't a tree. At deep levels it's a forest, and even at shallow levels things are a complicated tangle of gene trees. Sometimes the tree is clear, sometimes not, and some of this is real and some reflects our ignorance.&lt;br /&gt;&lt;br /&gt;If you want a simple tree to navigate, then I'd argue that the NCBI tree is a pretty good start, and EOL already has this. What would be really cool is to have a way to navigate that makes it clear that phylogenetic knowledge has a degree of uncertainty, and that the "tree of life" might be better depicted as a set of overlapping trees. 
The mental image I have is of a collage of trees from different data sets, superimposed over each other, with perhaps an underlying consensus to help navigate. This visualisation could be zoomable, because in some ways the tree of life is fractal. Trees don't stop at species, as the wealth of barcoding and phylogeographic studies shows. Given computational constraints (not to mention visualisation issues), I wonder whether there is an effective limit to the size of any one tree in terms of number of taxa. What varies is the taxonomic scope. So we could imagine a backbone tree based on slowly evolving genes; as we zoom in more trees appear, but at lower levels, and finally we hit populations and individuals, with trees that may have hundreds of samples but a very narrow scope.&lt;br /&gt;&lt;br /&gt;This is all rather poorly articulated, but I can't help wondering whether a phylogenetic classification will end up distorting the very thing we're trying to depict. It also loses connection with the underlying data (and trees), which for me is a huge drawback of existing classifications. There's no sense of why they are the way they are. 
There's a chance here to bring together ideas that have been kicking around in the phylogenetic community for a couple of decades and rethink how we navigate the "tree of life".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-7540929111877610916?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh6.ggpht.com/-9ByM1xvgQdo/Tz1QEANkMLI/AAAAAAAABJE/XLFtQrzLAlU/s72-c/34106_130_130.jpg?imgmax=800" height="72" width="72" /></item><item><title>BLAST a sequence and get a tree and a map</title><link>http://iphylo.blogspot.com/2012/02/blast-sequence-and-get-tree-and-map.html</link><category>dark taxa</category><category>BLAST</category><category>phyloinformatics</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Fri, 10 Feb 2012 04:28:10 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-5125598095865821941</guid><description>I've updated the &lt;a href="http://iphylo.org/~rpage/phyloinformatics/blast/"&gt;BLAST a sequence and get a tree tool&lt;/a&gt; described in a &lt;a href="http://iphylo.blogspot.com/2012/01/blast-sequence-and-get-tree.html"&gt;previous post&lt;/a&gt; to output additional details, such as a list of the sequences used to build the tree and some basic metadata (such as the taxon name, name of any associated host, publication, and geographic coordinates). If the sequences are geotagged, then you will also see a little map showing the localities. 
As ever, all this relies on SVG, so if your browser doesn't support that you won't see much.&lt;br /&gt;&lt;br /&gt;The example below is for the sequence &lt;a href="http://www.ncbi.nlm.nih.gov/nuccore/EU399074"&gt;EU399074&lt;/a&gt;, which falls in a cluster of &lt;a href="http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html"&gt;"dark taxa"&lt;/a&gt;; in this case, DNA barcode sequences that haven't been properly labelled.&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-Lr2pXxmODc4/TzUNWFxnRNI/AAAAAAAABI4/BYGxFfSBe9M/blastmap.png?imgmax=800" alt="Blastmap" border="0" width="415" height="311" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-5125598095865821941?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh5.ggpht.com/-Lr2pXxmODc4/TzUNWFxnRNI/AAAAAAAABI4/BYGxFfSBe9M/s72-c/blastmap.png?imgmax=800" height="72" width="72" /></item><item><title>Automating the creation of geophylogenies: NEXUS + delimited text = KML</title><link>http://iphylo.blogspot.com/2012/02/automating-creation-of-geophylogenies.html</link><category>KML</category><category>Google Earth</category><category>geophylogeny</category><category>matching</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 08 Feb 2012 09:57:32 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-9177505911839922335</guid><description>One thing which has always frustrated me about geophylogenies is how tedious they are to create. In theory, they should be pretty straightforward to generate. We take a tree, get point localities for each leaf in the tree, and generate the KML to display on Google Earth. 
The tedious part is getting the latitude and longitude data in the right format, and linking the leaves in the tree to the locality data.&lt;br /&gt;&lt;br /&gt;To help reduce the tedium I've created a tool that tries to automate this as much as possible. The goal is to be able to paste in a NEXUS tree, and a table of localities, and get back a KML tree. Some publishers are making it easier to extract data from articles. For example, if you go to a paper such as &lt;a href="http://dx.doi.org/10.1016/j.ympev.2009.07.011"&gt;http://dx.doi.org/10.1016/j.ympev.2009.07.011&lt;/a&gt; you will see a widget on the right labelled &lt;b&gt;Table download&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh4.ggpht.com/-Rhpf_y8yjcg/TzK16l1HxgI/AAAAAAAABIk/iYupBseqflQ/elsevier.png?imgmax=800" alt="Elsevier" border="0" width="400" height="171" /&gt;&lt;br /&gt;&lt;br /&gt;If you click on the &lt;b&gt;Find tables&lt;/b&gt; button you can download the tables in CSV format. In this case, Table 1 has latitude and longitude data for all the taxa in the tree in TreeBASE study &lt;a href="http://purl.org/phylo/treebase/phylows/study/TB2:S10103?format=html"&gt;S10103&lt;/a&gt;. With some regular expressions we can figure out which column has the latitude and longitude data, and parse values like &lt;code&gt;(10°12′N, 84°09′W)&lt;/code&gt; to extract the numerical values for latitude and longitude.&lt;br /&gt;&lt;br /&gt;It is also pretty straightforward to read a tree in NEXUS format and extract the taxon names. At this point we have two sets of names (those from the tree and those from the table) which might not be the same (in this case they aren't: we have "Craugastor cf. podiciferus FMNH 257672" and "FMNH 257672"). 
Matching these names up by hand would be tedious, but as described in &lt;a href="http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html"&gt;Matching names in phylogeny data files&lt;/a&gt; we can use maximum weighted bipartite matching to compute an optimal matching between the two sets of labels.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Create KML tree&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;You can try the Create KML tree tool at &lt;a href="http://iphylo.org/~rpage/phyloinformatics/kml/"&gt;http://iphylo.org/~rpage/phyloinformatics/kml/&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;To get started, try it with the data below. In step 1 paste in the NEXUS tree, in step 2 paste in the table from the original paper. If all goes as it should, you will see a table displaying the matching, and the KML which you can save and open in Google Earth. If you have the &lt;a href="http://www.google.com/earth/explore/products/plugin.html"&gt;Google Earth Plug-in&lt;/a&gt; installed, then you should see the KML displayed on Google Earth in your web browser.&lt;br /&gt;&lt;br /&gt;&lt;iframe src="http://player.vimeo.com/video/36426437?title=0&amp;amp;byline=0&amp;amp;portrait=0" width="398" height="318" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;I've tested the tool on only a few examples, so there will be cases where it fails. It also assumes that every taxon in the tree has latitude and longitude values, and that the first column in the table is the taxon name (you'll need to edit the file if this is not the case).&lt;br /&gt;&lt;br /&gt;Here is the tree used in the example...&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-size:10px;"&gt;&lt;br /&gt;#NEXUS&lt;br /&gt;BEGIN TREES;&lt;br /&gt;	TRANSLATE&lt;br /&gt;		Tl254954 'Craugastor cf. podiciferus FMNH 257672',&lt;br /&gt;		Tl254956 'Craugastor cf. podiciferus FMNH 257653',&lt;br /&gt;		Tl254965 'Craugastor cf. 
podiciferus UCR 16356',&lt;br /&gt;		Tl254960 'Craugastor sp. A USNM 563039',&lt;br /&gt;		Tl254938 'Craugastor sp. A USNM 563040',&lt;br /&gt;		Tl254945 'Craugastor cf. podiciferus UCR 16360',&lt;br /&gt;		Tl254928 'Craugastor cf. podiciferus UCR 17439',&lt;br /&gt;		Tl254959 'Craugastor cf. podiciferus UCR 17462',&lt;br /&gt;		Tl254951 'Craugastor cf. podiciferus FMNH 257596',&lt;br /&gt;		Tl254967 'Craugastor sp. A FMNH 257689',&lt;br /&gt;		Tl254934 'Craugastor cf. podiciferus UCR 16355',&lt;br /&gt;		Tl254964 'Craugastor cf. podiciferus FMNH 257671',&lt;br /&gt;		Tl254963 'Craugastor cf. podiciferus UCR 16358',&lt;br /&gt;		Tl254952 'Craugastor cf. podiciferus UCR 18062',&lt;br /&gt;		Tl254926 'Craugastor cf. podiciferus UCR 17442',&lt;br /&gt;		Tl254968 'Craugastor sp. A FMNH 257562',&lt;br /&gt;		Tl254939 'Craugastor cf. podiciferus UCR 17441',&lt;br /&gt;		Tl254946 'Craugastor cf. podiciferus FMNH 257757',&lt;br /&gt;		Tl254942 'Craugastor cf. podiciferus MVZ 149813',&lt;br /&gt;		Tl254961 'Craugastor cf. podiciferus FMNH 257595',&lt;br /&gt;		Tl254969 'Craugastor cf. podiciferus UCR 17469',&lt;br /&gt;		Tl254932 'Craugastor cf. podiciferus MVZ 164825',&lt;br /&gt;		Tl254970 'Craugastor sp. A AJC 0891',&lt;br /&gt;		Tl254943 'Craugastor cf. podiciferus UCR 16357',&lt;br /&gt;		Tl254929 'Craugastor cf. podiciferus FMNH 257673',&lt;br /&gt;		Tl254950 'Craugastor cf. podiciferus FMNH 257756',&lt;br /&gt;		Tl254944 'Craugastor cf. podiciferus FMNH 257652',&lt;br /&gt;		Tl254953 'Craugastor cf. podiciferus UCR 16359',&lt;br /&gt;		Tl254931 'Craugastor cf. podiciferus UCR 17443',&lt;br /&gt;		Tl254940 'Craugastor stejnegerianus UCR 16332',&lt;br /&gt;		Tl254935 'Craugastor underwoodi UCR 16315',&lt;br /&gt;		Tl254958 'Craugastor cf. podiciferus UCR 16354',&lt;br /&gt;		Tl254966 'Craugastor sp. A AJC 0890',&lt;br /&gt;		Tl254949 'Craugastor cf. podiciferus FMNH 257758',&lt;br /&gt;		Tl254933 'Craugastor cf. 
podiciferus UCR 16361',&lt;br /&gt;		Tl254962 'Craugastor cf. podiciferus FMNH 257651',&lt;br /&gt;		Tl254948 'Craugastor cf. podiciferus FMNH 257670',&lt;br /&gt;		Tl254971 'Craugastor cf. podiciferus FMNH 257669',&lt;br /&gt;		Tl254936 'Craugastor cf. podiciferus FMNH 257550',&lt;br /&gt;		Tl254957 'Craugastor underwoodi USNM 561403',&lt;br /&gt;		Tl254947 'Craugastor cf. podiciferus FMNH 257755',&lt;br /&gt;		Tl254927 'Craugastor cf. podiciferus UCR 16353',&lt;br /&gt;		Tl254925 'Craugastor bransfordii MVUP 1875',&lt;br /&gt;		Tl254930 'Craugastor cf. podiciferus UTA A 52449',&lt;br /&gt;		Tl254955 'Craugastor tabasarae MVUP 1720',&lt;br /&gt;		Tl254941 'Craugastor cf. longirostris FMNH 257678',&lt;br /&gt;		Tl254937 'Craugastor cf. longirostris FMNH 257561'		;&lt;br /&gt;	TREE 'Fig. 2' = ((Tl254955,(Tl254941,Tl254937)),(((((Tl254954,Tl254942,Tl254933,Tl254948,Tl254971),((Tl254934,Tl254958,Tl254927),((Tl254964,Tl254929),Tl254930))),(((Tl254965,(Tl254963,Tl254943)),(Tl254959,Tl254969),(Tl254951,Tl254961)),((Tl254928,Tl254926,Tl254939,Tl254931),(Tl254952,Tl254932)))),((((Tl254956,Tl254936),Tl254946,Tl254950,(Tl254944,Tl254962),Tl254947),Tl254949),(Tl254945,Tl254953))),((((Tl254960,Tl254938),(Tl254970,Tl254966)),(Tl254967,Tl254968)),((Tl254940,Tl254925),(Tl254935,Tl254957)))));&lt;br /&gt;END;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;...and here is the table:&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-size:10px;"&gt;&lt;br /&gt;Taxon and institutional vouchera,Locality ID,Collection localityb,Geographic coordinates/approximate location,Elevation (m),GenBank accession number12S,16S,COI,c-myc&lt;br /&gt;1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417&lt;br /&gt;2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430&lt;br /&gt;3. FMNH 257669,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562320,EF562372,EF562380,EF562432&lt;br /&gt;4. 
FMNH 257670,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562317,EF562336,EF562376,EF562421&lt;br /&gt;5. FMNH 257671,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562314,EF562374,EF562409,None&lt;br /&gt;6. FMNH 257672,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562318,None,EF562382,None&lt;br /&gt;7. FMNH 257673,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562311,EF562343,EF562392,None&lt;br /&gt;8. UCR 16361,3,"Alejuela, CR","(10°13′ N, 84°22′W)",1930,EF562321,EF562371,EF562375,EF562431&lt;br /&gt;9. UCR 16353,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562313,EF562349,None,EF562420&lt;br /&gt;10. UCR 16354,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562315,EF562363,None,EF562418&lt;br /&gt;11. UCR 16355,4,"Heredia, CR","(10°12′N, 84°09′W)",1500,EF562316,EF562366,None,EF562419&lt;br /&gt;12. UCR 18062,6,"Heredia, CR","(10°10′N, 84°06′W)",1900,EF562302,EF562342,EF562395,None&lt;br /&gt;13. UCR 17439,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562298,EF562341,EF562387,EF562427&lt;br /&gt;14. UCR 17441,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562299,EF562345,EF562388,EF562429&lt;br /&gt;15. UCR 17442,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562300,EF562337,EF562385,EF562422&lt;br /&gt;16. UCR 17443,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562301,EF562340,EF562384,EF562428&lt;br /&gt;17. UCR 17462,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562309,EF562355,EF562406,EF562440&lt;br /&gt;18. UCR 17469,5,"Heredia, CR","(10°09′N, 84°09′W)",2000,EF562310,EF562334,EF562405,EF562414&lt;br /&gt;19. MVZ 164825,7,"Heredia, CR","(10° 05′N, 84° 04′W)",2100,EF562303,EF562346,EF562381,EF562423&lt;br /&gt;20. UCR 16357,8,"San José, CR","(10°02′N, 83°57′W)",1600,EF562306,EF562339,EF562400,EF562433&lt;br /&gt;21. UCR 16358,8,"San José, CR","(10°02′N, 83°57′W)",1600,EF562307,EF562370,EF562412,EF562415&lt;br /&gt;22. UCR 16356,8,"San José, CR","(10°01′N, 83°56′W)",1940,EF562308,EF562329,None,None&lt;br /&gt;23. 
UCR 16359,10,"San José, CR","(9°26′N, 83°41′W)",1313,EF562297,EF562369,EF562396,None&lt;br /&gt;24. UCR 16360,10,"San José, CR","(9°26′N, 83°41′W)",1313,EF562296,EF562368,None,EF562434&lt;br /&gt;25. FMNH 257595,9,"Cartago, CR","(9°44′N, 83°46′W)",1600,EF562304,EF562338,EF562408,None&lt;br /&gt;26. FMNH 257596,9,"Cartago, CR","(9°44′N, 83°46′W)",1600,EF562305,EF562335,None,EF562416&lt;br /&gt;27. FMNH 257550,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562294,EF562330,EF562393,EF562443&lt;br /&gt;28. FMNH 257651,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562291,EF562367,EF562402,EF562435&lt;br /&gt;29. FMNH 257652,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562288,EF562364,EF562390,None&lt;br /&gt;30. FMNH 257653,11,"Puntarenas, CR","(8°47′N, 82°59′W)",1350,EF562292,EF562354,EF562392,EF562438&lt;br /&gt;31. FMNH 257755,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562289,EF562344,EF562379,None&lt;br /&gt;32. FMNH 257756,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562290,EF562347,EF562377,EF562413&lt;br /&gt;33. FMNH 257757,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562293,EF562352,EF562383,EF562437&lt;br /&gt;34. FMNH 257758,11,"Puntarenas, CR","(8°46′N, 82°59′W)",1410,EF562295,EF562348,EF562397,EF562436&lt;br /&gt;35. USNM 563039,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562284,EF562356,EF562389,EF562445&lt;br /&gt;36. USNM 563040,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562285,EF562350,EF562391,EF562439&lt;br /&gt;37. AJC 0890,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562282,EF562351,EF562398,EF562444&lt;br /&gt;38. MVUP 1880,12,"Chiriquí, PA","(8°48′N, 82°24′W)",1663,EF562283,EF562358,EF562399,EF562442&lt;br /&gt;39. FMNH 257689,12,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562287,EF562353,EF562407,EF562446&lt;br /&gt;40. FMNH 257562,12,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562286,EF562357,EF562410,EF562441&lt;br /&gt;41. USNM 561403,N/A,"Heredia, CR","(10°24′N, 84°03′W)",800,EF562323,EF562361,EF562378,None&lt;br /&gt;42. 
UCR 16315,N/A,"Alejuela, CR","(10°13′N, 84°35′W)",960,EF562322,EF562362,EF562394,None&lt;br /&gt;43. UCR 16332,N/A,"San José, CR","(9°18′N, 83°46′W)",900,EF562325,EF562360,EF562411,AY211320&lt;br /&gt;44. MVUP 1875 fitzingeri group,N/A,"BDT, PA","(9°24′N, 82°17′W)",50,EF562324,EF562359,None,AY211304&lt;br /&gt;45. MVUP 1720,N/A,"Coclé, PA","(8°40′N, 80°35′W)",800,EF562326,EF562332,EF562401,EF562424&lt;br /&gt;46. FMNH 257561,N/A,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562327,EF562331,None,EF562426&lt;br /&gt;47. FMNH 257678,N/A,"Chiriquí, PA","(8°45′N, 82°13′W)",1100,EF562328,EF562333,EF562404,EF562425&lt;/pre&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-9177505911839922335?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh4.ggpht.com/-Rhpf_y8yjcg/TzK16l1HxgI/AAAAAAAABIk/iYupBseqflQ/s72-c/elsevier.png?imgmax=800" height="72" width="72" /></item><item><title>Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data</title><link>http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html</link><category>data cleaning</category><category>Google Refine</category><category>taxonomic name</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Wed, 08 Feb 2012 13:42:25 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-6104449132512887912</guid><description>&lt;img src="http://lh4.ggpht.com/-HsS8f5sKWWc/Ty_fgRiXiII/AAAAAAAABHg/In4r9NXMkoc/refine.png?imgmax=800" alt="Refine" border="0" width="128" height="128" style="float:right;" /&gt;&lt;a href="http://code.google.com/p/google-refine/"&gt;Google Refine&lt;/a&gt; is an elegant tool for data cleaning. 
One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use &lt;a href="http://www.freebase.com/"&gt;Freebase&lt;/a&gt; reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.&lt;br /&gt;&lt;br /&gt;The services I've implemented so far are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.eol.org"&gt;EOL&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/Taxonomy/"&gt;NCBI taxonomy&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.ubio.org/tools/recognize.php"&gt;uBio FindIT&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.marinespecies.org/"&gt;WORMS&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.gbif.org"&gt;GBIF&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php&lt;/li&gt;&lt;li&gt;&lt;a href="http://gni.globalnames.org/"&gt;Global Names Index&lt;/a&gt; http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS services do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".&lt;br /&gt;&lt;br /&gt;&lt;b&gt;How to use reconciliation services&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Start a Google Refine session. 
Save the names below to a text file and open it as a new project.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Names&lt;br /&gt;Achatina fulica (giant African snail)&lt;br /&gt;Acromyrmex octospinosus ST040116-01&lt;br /&gt;Alepocephalus bairdii (Baird's smooth-head)&lt;br /&gt;Alaska Sea otter (Enhydra lutris kenyoni)&lt;br /&gt;Toxoplasma gondii&lt;br /&gt;Leucoagaricus gongylophorus&lt;br /&gt;Pinnotheres&lt;br /&gt;Themisto gaudichaudii&lt;br /&gt;Hyperiidae&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;You should see something like this:&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-PqIRaUwZy40/Ty_fhvgf87I/AAAAAAAABHo/ZbMluFovHYc/refine1.png?imgmax=800" alt="Refine1" border="0" width="359" height="311" /&gt;&lt;br /&gt;&lt;br /&gt;Click on the column header &lt;b&gt;Names&lt;/b&gt; and choose &lt;b&gt;Reconcile&lt;/b&gt; → &lt;b&gt;Start reconciling&lt;/b&gt;. &lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-rDs2S-YG-V8/Ty_fiitowGI/AAAAAAAABHw/rJ3Uf6RcVQ8/refine2.png?imgmax=800" alt="Refine2" border="0" width="415" height="436" /&gt;&lt;br /&gt;&lt;br /&gt;A dialog will pop up asking you to select a service. &lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh4.ggpht.com/-iMMUwS6tmc8/Ty_fjxRZnRI/AAAAAAAABH4/rLVPLLyDVKs/refine3.png?imgmax=800" alt="Refine3" border="0" width="415" height="285" /&gt;&lt;br /&gt;&lt;br /&gt;If you've already added a service it will be in the list on the left. If not, click the &lt;b&gt;Add Standard Services...&lt;/b&gt; button at the bottom left and paste in the URL (in this case &lt;code&gt;http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php&lt;/code&gt;).&lt;br /&gt;&lt;br /&gt;Once the service has loaded click on &lt;b&gt;Start Reconciling&lt;/b&gt;. 
Once it has finished you should see most of the names linked to uBio (click on a name to check this):&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-hG9-5E4YaDY/Ty_fkxZnckI/AAAAAAAABIA/IaAbrXWzPv4/refine4.png?imgmax=800" alt="Refine4" border="0" width="343" height="438" /&gt;&lt;br /&gt;&lt;br /&gt;Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the &lt;b&gt;Names&lt;/b&gt; column header and choose &lt;b&gt;Edit column&lt;/b&gt; → &lt;b&gt;Add column based on this column...&lt;/b&gt;. A dialog box will be displayed:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-T9cbG-OY8KA/Ty_fmKBujbI/AAAAAAAABII/R-rkj8Lj3w4/refine6.png?imgmax=800" alt="Refine6" border="0" width="415" height="307" /&gt;&lt;br /&gt;&lt;br /&gt;In the box labelled &lt;b&gt;Expression&lt;/b&gt; enter &lt;code&gt;cell.recon.match.id&lt;/code&gt; and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh6.ggpht.com/-5gd07QAHzWY/Ty_fnobAUtI/AAAAAAAABIQ/fx6W_iTy-B8/refine7.png?imgmax=800" alt="Refine7" border="0" width="378" height="439" /&gt;&lt;br /&gt;&lt;br /&gt;You could also get the names uBio extracted by creating a column based on the values of &lt;code&gt;cell.recon.match.name&lt;/code&gt;. To compare this with the original values, click on the &lt;b&gt;Names&lt;/b&gt; column header and choose &lt;b&gt;Reconcile&lt;/b&gt; → &lt;b&gt;Actions&lt;/b&gt; → &lt;b&gt;Clear reconciliation data&lt;/b&gt;. 
Now you can see the original input names, and the string uBio extracted from each name:&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh5.ggpht.com/-WfFkd_MxqXc/Ty_foWnMx2I/AAAAAAAABIY/7NYNqO0qx7o/refine8.png?imgmax=800" alt="Refine8" border="0" width="415" height="217" /&gt;&lt;br /&gt;&lt;br /&gt;These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would be to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Development notes&lt;/b&gt;&lt;br /&gt;The code for these services is in &lt;a href="https://github.com/rdmpage/phyloinformatics"&gt;Github&lt;/a&gt; (undocumented as yet, that's on the to do list). I had a few hiccups getting these services to work. There is detailed documentation at &lt;a href="http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi"&gt;http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi&lt;/a&gt;, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. 
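&lt;br /&gt;&lt;br /&gt;To make "Multiple Query Mode" concrete, here is a minimal sketch in Python (not the PHP code in the repository). It assumes, following the API documentation linked above, that the POST body carries a &lt;code&gt;queries&lt;/code&gt; field holding a JSON object of keyed queries, and that the reply is a JSON object with the same keys. The lookup table, its ids, the type label and the scores are all made up for illustration; a real service would call something like uBio instead.&lt;br /&gt;
&lt;pre&gt;
```python
import json

# Hypothetical stand-in for a real name service such as uBio; the ids are made up.
TOY_NAMES = {
    "Toxoplasma gondii": "example-1",
    "Pinnotheres": "example-2",
}


def reconcile_one(query_text):
    """Return a list of candidate matches for a single query string."""
    candidates = []
    for name, identifier in TOY_NAMES.items():
        if name.lower() in query_text.lower():
            exact = name.lower() == query_text.strip().lower()
            candidates.append({
                "id": identifier,
                "name": name,
                "type": ["name"],              # hypothetical type label
                "score": 1.0 if exact else 0.5,  # arbitrary illustrative scores
                "match": exact,
            })
    return candidates


def handle_post(form):
    """Answer a multiple-query request; `form` is the decoded POST body."""
    queries = json.loads(form["queries"])
    # One entry in the reply for every key in the request.
    return {key: {"result": reconcile_one(q["query"])} for key, q in queries.items()}


form = {"queries": json.dumps({
    "q0": {"query": "Toxoplasma gondii"},
    "q1": {"query": "Hyperiidae"},
})}
response = handle_post(form)
print(json.dumps(response, indent=2))
```
&lt;/pre&gt;
The &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; fields of a matched candidate are what expressions such as &lt;code&gt;cell.recon.match.id&lt;/code&gt; pick up on the Refine side.&lt;br /&gt;&lt;br /&gt;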
Once these issues were sorted out (turning on the Java console as per &lt;a href="https://groups.google.com/forum/?fromgroups#!topic/google-refine/mdUMgaf7ntY"&gt;David Huynh's tip&lt;/a&gt; helped) things work pretty well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-6104449132512887912?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh4.ggpht.com/-HsS8f5sKWWc/Ty_fgRiXiII/AAAAAAAABHg/In4r9NXMkoc/s72-c/refine.png?imgmax=800" height="72" width="72" /></item><item><title>Browsing TreeBASE using a genome browser-like interface</title><link>http://iphylo.blogspot.com/2012/02/browsing-treebase-using-genome-browser.html</link><category>visualisation</category><category>TreeBASE</category><category>phyloinformatics</category><category>browser</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 02 Feb 2012 10:14:08 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-9161861248434242304</guid><description>One of the things I find frustrating about &lt;a href="http://www.treebase.org"&gt;TreeBASE&lt;/a&gt; is that there's no easy way to get an overview of what it contains. What is its taxonomic coverage like? Is it dominated by plants and fungi, or are there lots of animal trees as well? Are there obvious gaps in our phylogenetic knowledge, or do the phylogenies it contains pretty much span the tree of life?&lt;br /&gt;&lt;br /&gt;As part of my &lt;a href="http://iphylo.org/~rpage/phyloinformatics/course/phylogeny/index.html"&gt;phyloinformatics course&lt;/a&gt; I've put together a simple browser to navigate through TreeBASE. 
The inspiration comes from genome browsers (e.g., the &lt;a href="http://genome.ucsc.edu/cgi-bin/hgTracks?org=human"&gt;UCSC Genome Browser&lt;/a&gt;) where the genome is treated as a linear set of co-ordinates, and features of the genome are displayed as "tracks".&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh4.ggpht.com/-7vBkwbFCQvA/TyrSajKHiMI/AAAAAAAABHM/AUfompnFq-k/hgt_genome_596a_ac7fe0.png?imgmax=800" alt="Hgt genome 596a ac7fe0" border="0" width="400" height="192" /&gt;&lt;br /&gt;&lt;br /&gt;For my browser, I've used the order in which nodes appear in the NCBI tree as you go from left to right as the set of co-ordinates (actually, from top to bottom as my browser displays the co-ordinate axis vertically).&lt;br /&gt;&lt;br /&gt;&lt;img style="display:block; margin-left:auto; margin-right:auto;" src="http://lh4.ggpht.com/-EdUXOn9HNdY/TyrSbdAgIlI/AAAAAAAABHU/vMRUj78p8MI/browser.png?imgmax=800" alt="Browser" border="0" width="450" height="312" /&gt;&lt;br /&gt;&lt;br /&gt;I then place each TreeBASE tree within this classification by taking the TreeBASE → NCBI mapping provided by TreeBASE and finding the "majority rule" taxon for each tree (in a sense, the taxon that summarises what the tree is about). Each tree is represented by a vertical line depicting the span of the corresponding NCBI taxon (corresponding to a "track" in a genome browser). Taking the majority-rule taxon rather than, say, the span of the tree, makes it possible to pack the vertical lines tightly together so that they take up less space (the ordering from left to right is determined by the NCBI taxonomy). &lt;br /&gt;&lt;br /&gt;If you mouse-over a vertical bar you can see the title of the study that published the tree. If you click on the vertical bar you'll see the tree displayed on the right (if your web browser understands SVG, that is). If you click on the background you will drill down a level in the NCBI classification. 
To go back up the classification, click on the arrow at the top left of the browser.&lt;br /&gt;&lt;br /&gt;This is all very preliminary, but you can take it for a spin at &lt;a href="http://iphylo.org/~rpage/phyloinformatics/treebase/"&gt;http://iphylo.org/~rpage/phyloinformatics/treebase/&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Below is a short video walking you through some examples.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align:center"&gt;&lt;iframe src="http://player.vimeo.com/video/36093738?title=0&amp;amp;byline=0&amp;amp;portrait=0" width="398" height="333" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-9161861248434242304?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://lh4.ggpht.com/-7vBkwbFCQvA/TyrSajKHiMI/AAAAAAAABHM/AUfompnFq-k/s72-c/hgt_genome_596a_ac7fe0.png?imgmax=800" height="72" width="72" /></item><item><title>BLAST a sequence and get a tree</title><link>http://iphylo.blogspot.com/2012/01/blast-sequence-and-get-tree.html</link><category>phylogeny</category><category>SVG</category><category>github</category><category>ajax</category><category>BLAST</category><category>phyloinformatics</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Mon, 30 Jan 2012 09:19:52 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-5901535091373784967</guid><description>For this week's sessions of my &lt;a href="http://iphylo.org/~rpage/phyloinformatics/"&gt;phyloinformatics course&lt;/a&gt; I'm developing some phylogeny tools. The first is a simple AJAX-based BLAST tool. 
I've always wanted a quick way to see a GenBank sequence in its phylogenetic context, so I've built a simple tool that takes a GenBank accession number or GI number, submits a BLAST job, retrieves the sequences, aligns them using CLUSTALW, builds a quick and dirty neighbour-joining tree using PAUP*, then displays the tree using SVG (if your browser doesn't support this you won't see the tree). One use for this is to quickly get a sense of whether an unnamed ("dark") taxon is related to sequences that have been identified.&lt;br /&gt;&lt;br /&gt;Nothing fancy, but it was a chance to display the whole process in the browser without opening new windows or refreshing the page. Here's an example for the GenBank sequence &lt;a href="http://www.ncbi.nlm.nih.gov/nucleotide/FJ559186"&gt;FJ559186&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;iframe src="http://player.vimeo.com/video/35895870?title=0&amp;amp;byline=0&amp;amp;portrait=0" width="398" height="244" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;For the technically-minded, the calls to BLAST and the alignment and tree construction tools all use AJAX, and there's a simple Javascript timer to count down the seconds that the NCBI BLAST web service estimates the BLAST job will take, before we poll NCBI to see if the job has in fact finished. 
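&lt;br /&gt;&lt;br /&gt;The "wait for the estimate, then poll" pattern can be sketched roughly as follows. This is Python rather than the Javascript used in the tool, and it is only an illustration: the function name, the polling interval, the poll limit and the fake status check are all assumptions, and no real NCBI call is made.&lt;br /&gt;
&lt;pre&gt;
```python
import time


def wait_for_job(job_id, estimated_seconds, check_status,
                 poll_interval=5, max_polls=20, sleep=time.sleep):
    """Wait out the estimate, then poll until the job is ready.

    check_status is a caller-supplied function (hypothetical here) that
    takes the job id and returns True once results can be fetched.
    """
    sleep(estimated_seconds)          # the "countdown" shown in the browser
    for attempt in range(max_polls):
        if check_status(job_id):
            return attempt + 1        # number of polls it took
        sleep(poll_interval)          # not finished yet, so wait and retry
    raise TimeoutError("job %s not finished after %d polls" % (job_id, max_polls))


# Fake status check standing in for a real web service call:
# it reports "not ready" twice, then "ready".
answers = iter([False, False, True])
waits = []  # records the requested sleeps instead of actually sleeping
polls = wait_for_job("JOB123", 12, lambda job_id: next(answers), sleep=waits.append)
print(polls, waits)
```
&lt;/pre&gt;
Passing the status check and the sleep function in as arguments keeps the sketch testable without touching the network.&lt;br /&gt;&lt;br /&gt;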
The code is in &lt;a href="https://github.com/rdmpage/phyloinformatics"&gt;GitHub&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-5901535091373784967?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Extracting museum specimen codes from text</title><link>http://iphylo.blogspot.com/2012/01/extracting-museum-specimen-codes-from.html</link><category>data mining</category><category>specimen codes</category><category>museum</category><category>Darwin Core riplet</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Thu, 26 Jan 2012 04:43:43 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-2340342747346374591</guid><description>Quick note about a tool I've cobbled together as part of the &lt;a href="http://iphylo.org/~rpage/phyloinformatics/"&gt;phyloinformatics course&lt;/a&gt;, which addresses a long standing need I and others have to extract specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it, until now. It is very crude (basically a bunch of regular expressions), and there's a lot which could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say &lt;i&gt;Zootaxa&lt;/i&gt; and &lt;a href="http://biostor.org"&gt;BioStor&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;You can try the tool at &lt;a href="http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php"&gt;http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php&lt;/a&gt;. Paste in some text and it will try and extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09), and some of the more common specimen numbering schemes.&lt;br /&gt;&lt;br /&gt;Comments welcome. 
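&lt;br /&gt;&lt;br /&gt;To give a flavour of the regular expression approach, here is a toy sketch in Python (the tool itself is a larger collection of patterns, and this is not its code). It assumes a code is a short upper-case acronym followed by a number, and that a range such as "MHNSM 1808-09" abbreviates the end number; the pattern, the function name and the limit of 100 specimens per range are all made up for illustration.&lt;br /&gt;
&lt;pre&gt;
```python
import re

# Assumed pattern: an upper-case acronym, a space, a number, and an optional
# "-NN" suffix giving the end of a range (as in "MHNSM 1808-09").
SPECIMEN = re.compile(r"\b([A-Z]{2,6}) (\d+)(?:-(\d+))?\b")


def extract_specimen_codes(text):
    """Return specimen codes found in text, with ranges expanded."""
    codes = []
    for acronym, start, end in SPECIMEN.findall(text):
        first = last = int(start)
        if end:
            # Abbreviated end: the digits of "09" replace the tail of "1808".
            last = int(start[:-len(end)] + end)
        if last - first not in range(0, 100):
            last = first  # reversed or implausibly long range: keep the start only
        for number in range(first, last + 1):
            # zfill keeps any leading zeros used in the original number.
            codes.append("%s %s" % (acronym, str(number).zfill(len(start))))
    return codes


found = extract_specimen_codes("Specimens examined: MHNSM 1808-09, KU 12345.")
print(found)
```
&lt;/pre&gt;
A single pattern like this will miss many real numbering schemes, which is why a complete list of museum acronyms would help.&lt;br /&gt;&lt;br /&gt;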
If you are looking for a source of text, papers in &lt;i&gt;Zookeys&lt;/i&gt; or &lt;i&gt;Zootaxa&lt;/i&gt; are a good place to start (especially papers on vertebrates where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor click on the "Text" link to get the OCR text for an article and paste that into the form. For example, the text for &lt;a href="http://biostor.org/reference/97426"&gt;Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica&lt;/a&gt; is available at &lt;a href="http://biostor.org/reference/97426.text"&gt;http://biostor.org/reference/97426.text&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The extraction tool can also be called as a web service using POST to get back the results in JSON.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-2340342747346374591?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Open course on phyloinformatics</title><link>http://iphylo.blogspot.com/2012/01/open-course-on-phyloinformatics.html</link><category>teaching</category><category>github</category><category>phyloinformatics</category><author>noreply@blogger.com (Roderic D. M. Page)</author><pubDate>Mon, 23 Jan 2012 04:41:13 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-4245769225469318038</guid><description>As part of a postgraduate course here at the &lt;a href="http://www.gla.ac.uk/"&gt;University of Glasgow&lt;/a&gt; I'm teaching five sessions on "phyloinformatics", which I've decided to define broadly enough to encompass most of biodiversity informatics.&lt;br /&gt;&lt;br /&gt;Given that this module is being developed on the fly, and will make use of lots of little "toys" I've developed and discussed on this blog, I've decided to put the course notes online, along with the interactive demos and the source code. 
So, if you want to follow along for the next couple of weeks, here are the links:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://iphylo.org/~rpage/phyloinformatics/"&gt;Course home page&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://iphylo.org/~rpage/phyloinformatics/course/"&gt;Course notes and exercises&lt;/a&gt; (currently just the introductory session)&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/rdmpage/phyloinformatics"&gt;Source code on GitHub&lt;/a&gt; (including code for my &lt;a href="http://iphylo.blogspot.com/2012/01/eol-ipad-web-app-using-jquerymobile.html"&gt;EOL iPad webapp&lt;/a&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Each course page supports comments (see the bottom of the page), so feel free to add comments, or suggestions. The notes are at a crude stage, and will be developed over the duration of the course (2 weeks). I'm also endeavouring to get all the source code for the demonstration apps into GitHub. None of these demos is polished, but they will hopefully provide some ideas for taking them further. There will be iSpecies-like mashups, iPad webapps, classification visualisations, TreeBASE search tools, geophylogenies and other phylogeny viewers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-4245769225469318038?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>EOL iPad web app using jQueryMobile</title><link>http://iphylo.blogspot.com/2012/01/eol-ipad-web-app-using-jquerymobile.html</link><category>jQueryMobile</category><category>iPad</category><category>API</category><category>EOL</category><author>noreply@blogger.com (Roderic D. M. 
Page)</author><pubDate>Thu, 19 Jan 2012 08:35:17 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-1754743869325390809</guid><description>As part of a course on "phyloinformatics" that I'm about to teach I've been making some visualisations of classifications. Here's one I've put together using &lt;a href="http://jquerymobile.com/"&gt;jQuery Mobile&lt;/a&gt; and the Encyclopedia of Life &lt;a href="http://eol.org/api"&gt;API&lt;/a&gt;. It's pretty limited, but is a simple way to explore EOL using three different classifications. You can view this live at &lt;a href="http://iphylo.org/~rpage/phyloinformatics/eoliphone/"&gt;http://iphylo.org/~rpage/phyloinformatics/eoliphone/&lt;/a&gt; (looks best on an iPad or iPhone). Once I've tidied it up I'll put the code online. Meantime here's a quick demo:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align:center"&gt;&lt;iframe src="http://player.vimeo.com/video/35321521?title=0&amp;amp;byline=0&amp;amp;portrait=0&amp;amp;autoplay=0" width="398" height="587" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-1754743869325390809?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item><item><title>Yet another reason why we need specimen identifiers, now!</title><link>http://iphylo.blogspot.com/2012/01/yet-another-reason-why-we-need-specimen.html</link><category>TAXACOM</category><category>specimens</category><category>identifiers</category><category>collections</category><category>citation</category><author>noreply@blogger.com (Roderic D. M. 
Page)</author><pubDate>Wed, 18 Jan 2012 05:22:03 PST</pubDate><guid isPermaLink="false">tag:blogger.com,1999:blog-16081779.post-5628455436846736390</guid><description>This &lt;a href="http://markmail.org/message/opv2we7fkmro2nen"&gt;message&lt;/a&gt; appeared on the TAXACOM mailing list:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;It is getting more and more necessary for taxonomists to demonstrate&lt;br /&gt;that they are useful and used. This does not only apply to the&lt;br /&gt;individual scientists, but also to institutions with taxonomic&lt;br /&gt;collections, such as museums and herbaria. &lt;br /&gt;&lt;br /&gt;In an attempt to live up to that increasing demand for documentation,&lt;br /&gt;the leadership of the Natural History Museum of Denmark has issued an&lt;br /&gt;order to its curatorial staff - The staff members are requested to&lt;br /&gt;document which publications from 2011, written entirely by external&lt;br /&gt;scientists, that in one way or another are based on material in the&lt;br /&gt;collections of the Museum. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;Given that most specimens lack resolvable digital identifiers (a theme I've harped on about before, most recently in the context of &lt;a href="http://iphylo.blogspot.com/2011/12/dna-barcoding-darwin-core-triplet-and.html"&gt;DNA barcoding&lt;/a&gt;), answering this kind of query ends up being a case of searching publications for text strings that contain the acronym of the collection. The sender of the message, &lt;a href="http://www.nathimus.ku.dk/bot/vip/friis.htm"&gt;Ib Friis&lt;/a&gt;, is alarmed at this prospect:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;In publications, material from our herbarium at "C" is normally referred&lt;br /&gt;to in text strings of one of the following forms: "(C)", "(C, ", ", C,"&lt;br /&gt;or " C)". 
But a search in for example Google Scholar or other search&lt;br /&gt;engines  result in overflow of thousands and thousands of hits, even&lt;br /&gt;when these text strings are combined with other relevant words such as&lt;br /&gt;"botany", "plants", etc.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;In an earlier paper "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (&lt;a href="http://dx.doi.org/10.1093/bib/bbn022"&gt;http://dx.doi.org/10.1093/bib/bbn022&lt;/a&gt;) (free preprint available here: &lt;a href="http://hdl.handle.net/10101/npre.2008.1760.1"&gt;hdl:10101/npre.2008.1760.1&lt;/a&gt;) I argued that having resolvable identifiers for specimens could enable measures of "citation" to be computed for specimens (and data derived from those specimens). Just as we have citation counts for articles and impact factors for journals, we could have equivalent measures for specimens and collections. These measures may keep administrators happy, but for scientists I think the real benefits will be the ability to trace the provenance of some data, and the fate of data they themselves have collected or published.&lt;br /&gt;&lt;br /&gt;For things such as publications it is trivial to track their usage. For example, to find the number of times the article "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" has been cited, I simply enter the DOI into Google Scholar, e.g. &lt;a href="http://scholar.google.co.uk/scholar?q=10.1093/bib/bbn022"&gt;http://scholar.google.co.uk/scholar?q=10.1093/bib/bbn022&lt;/a&gt;. Imagine being able to do the same for specimens.&lt;br /&gt;&lt;br /&gt;For this to happen, museum specimens need digital identifiers. 
If museums are serious about quantifying the impact of their collections, they should make assigning digital identifiers a priority.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/16081779-5628455436846736390?l=iphylo.blogspot.com' alt='' /&gt;&lt;/div&gt;</description></item></channel></rss>

