<?xml version="1.0"?>
<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007" xmlns:atom="http://www.w3.org/2005/Atom">
   <channel>
      <title>GBIF News Feed</title>
      <description>A joint feed covering all individual GBIF work area feeds</description>
      <link>http://pipes.yahoo.com/pipes/pipe.info?_id=bf26e2e30cfdbb4d1b6c11482083cc3f</link>
      <atom:link rel="next" href="http://pipes.yahoo.com/pipes/pipe.run?_id=bf26e2e30cfdbb4d1b6c11482083cc3f&amp;_render=rss&amp;page=2"/>
      <pubDate>Thu, 01 Oct 2015 08:18:45 +0000</pubDate>
      <generator>http://pipes.yahoo.com/pipes/</generator>
      <item>
         <title>Simplified Downloads</title>
         <link>http://gbif.blogspot.com/2015/06/simplified-downloads.html</link>
         <description>&lt;div style=&quot;text-align:justify;&quot;&gt;
Since its re-launch in 2013&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/&quot;&gt;gbif.org&lt;/a&gt;&amp;nbsp;has supported the downloading of occurrence data using an arbitrary query with the download file provided as a&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/&quot;&gt;Darwin Core Archive&lt;/a&gt; file whose internal content is described &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/faq/datause&quot;&gt;here&lt;/a&gt;. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, in cases where people only need the occurrence data in its simplest form the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/&quot;&gt;DwC-A&lt;/a&gt; format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format:&amp;nbsp;a zip file that only contains a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format:&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s1600/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s320/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.&lt;/div&gt;
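&lt;div style=&quot;text-align:justify;&quot;&gt;
As a rough illustration of why the simple format is easy to consume, the sketch below reads a zip that holds a single tab-delimited file using only the Python standard library. The file name, the sample rows and the column names are assumptions made for this example; they are not the confirmed contents of a real download.&lt;/div&gt;
&lt;pre&gt;
```python
import csv
import io
import zipfile

# Hypothetical sample standing in for a real simple download; the column
# names are illustrative Darwin Core terms, not the confirmed column set.
SAMPLE_ROWS = "gbifID\tscientificName\tcountryCode\n1\tPuma concolor\tMX\n2\tZea mays\tUS\n"


def build_sample_zip():
    """Create an in-memory zip holding one tab-delimited file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrences.csv", SAMPLE_ROWS)
    buffer.seek(0)
    return buffer


def read_simple_download(zip_file):
    """Return the rows of the single file in the zip as dictionaries."""
    with zipfile.ZipFile(zip_file) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError("expected exactly one file in the archive")
        with archive.open(names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


rows = read_simple_download(build_sample_zip())
print(rows[1]["scientificName"])
```
&lt;/pre&gt;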
&lt;h2&gt;
Technical Architecture&lt;/h2&gt;
The simplified download format was implemented under the technical requirement that new formats can be added in the near future with minimal impact on the formats already supported. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records: a threshold of 200,000 records defines whether a download is small (&amp;lt; 200K) or big (&amp;gt; 200K), and history shows that the vast majority of downloads are “small”. The following chart summarizes the key technologies that enable occurrence downloads:&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s1600/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s320/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;h2&gt;
Download workflow&lt;/h2&gt;
Occurrence downloads are automated using a workflow engine called &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://oozie.apache.org/&quot;&gt;Oozie&lt;/a&gt;, which coordinates the required steps to produce a single download file. In summary, the workflow proceeds as follows: &lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Initially, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://lucene.apache.org/solr/&quot;&gt;Apache Solr&lt;/a&gt; is contacted to determine the number of records that the download file will contain.&lt;/li&gt;
&lt;li&gt;Big or small?&lt;/li&gt;
&lt;ol&gt;
&lt;li&gt;&amp;nbsp;If the number of records is less than 200,000 (a small download), &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://lucene.apache.org/solr/&quot;&gt;Apache Solr&lt;/a&gt; is queried to iterate over the results; the details of each occurrence record are fetched from &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://akka.io/&quot;&gt;Akka&lt;/a&gt; framework; the Apache &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt; and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://curator.apache.org/&quot;&gt;Curator&lt;/a&gt; frameworks are used to limit the number of threads that can run at the same time (this avoids a thread explosion on the machines that run the download workflow).&lt;/li&gt;
&lt;li&gt;If the number of records is greater than 200,000 (a big download), &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://hive.apache.org/&quot;&gt;Apache Hive&lt;/a&gt; is used to retrieve the occurrence data from an&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html&quot;&gt;HDFS&lt;/a&gt;&amp;nbsp;table. To avoid overloading &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;,&amp;nbsp;we create that HDFS table as a daily snapshot&amp;nbsp;of the occurrence data stored in &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;li&gt;Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).&lt;/li&gt;
&lt;/ol&gt;
Note: the details of how this is implemented can be consulted in the GitHub project: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/occurrence/tree/master/occurrence-download&quot;&gt;https://github.com/gbif/occurrence/tree/master/occurrence-download&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
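&lt;div&gt;
The size-based routing in the workflow can be sketched as follows. This is only an illustration of the decision, not the code from the project: the function names, the pipeline labels and the format labels are hypothetical, and treating a count of exactly 200,000 as small is an assumption, since the workflow description only covers counts below and above the threshold.&lt;/div&gt;
&lt;pre&gt;
```python
SMALL_DOWNLOAD_LIMIT = 200_000  # threshold stated in the post


def choose_pipeline(record_count):
    """Pick the download path from the record count reported by Solr.

    Assumption: a count of exactly 200,000 is treated as small; the post
    only describes counts below and above the threshold.
    """
    if record_count > SMALL_DOWNLOAD_LIMIT:
        return "hive"  # big download: query the daily HDFS snapshot
    return "solr+hbase"  # small download: iterate Solr, fetch from HBase


def plan_download(record_count, output_format):
    """Describe the work for one download request (illustrative only)."""
    if output_format not in ("DWCA", "SIMPLE"):
        raise ValueError("unknown format: " + output_format)
    return {"pipeline": choose_pipeline(record_count), "format": output_format}


print(plan_download(1_500, "SIMPLE"))
print(plan_download(3_000_000, "DWCA"))
```
&lt;/pre&gt;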
&lt;h2&gt;
Conclusion&lt;/h2&gt;
&lt;div&gt;
Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Fede Méndez</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-909142432631639745</guid>
         <pubDate>Thu, 11 Jun 2015 17:06:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s72-c/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Don't fill your HDFS disks (upgrading to CDH 5.4.2)</title>
         <link>http://gbif.blogspot.com/2015/05/dont-fill-your-hdfs-disks-upgrading-to.html</link>
         <description>Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of &quot;bad things will happen&quot;. We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;
The Machine Configuration&lt;/h4&gt;
&lt;div&gt;
The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different than our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes where each one has 13 disks dedicated to HDFS storage: 12 are 1TB and one is 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4&gt;
Pre Upgrade Status&lt;/h4&gt;
&lt;div&gt;
We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;dfs.datanode.fsdataset.volume.choosing.policy&lt;/span&gt;) with 10GB reserved for non-HDFS use (&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;dfs.datanode.du.reserved&lt;/span&gt;), which are the defaults in CDH manager. Each of the 1TB disks had around 700GB used (of 932GB usable), and the 512GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-HDFS use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning&amp;nbsp;this upgrade would bring everything down. And because our cluster is getting full (&amp;gt; 80% usage is another rumoured &quot;bad things&quot; threshold) we have reduced one class of data (users' &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence/search&quot;&gt;occurrence downloads&lt;/a&gt;) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4&gt;
Upgrade Time&lt;/h4&gt;
&lt;div&gt;
We followed the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_earlier_cdh5_upgrade.html&quot;&gt;recommended procedure&lt;/a&gt;&amp;nbsp;and&amp;nbsp;did the Oozie, Hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, at which point the symptom was that it was taking a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the NameNode logs we saw that it was performing a &quot;Block Pool Upgrade&quot;, which took between 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;


&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;font-size:11px;&quot;&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;mailto:27117@c4n1.gbif.org&quot;&gt;27117@c4n1.gbif.org&lt;/a&gt;&lt;br /&gt;2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535&lt;br /&gt;2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535&lt;br /&gt;2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.&lt;br /&gt;&amp;nbsp;&amp;nbsp; old LV = -56; old CTime = 1416737045694.&lt;br /&gt;&amp;nbsp;&amp;nbsp; new LV = -56; new CTime = 1432405112136&lt;br /&gt;2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files.&amp;nbsp; Also physically copied 0 other files.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-size:11px;&quot;&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672-130.226.238.178-1367832131535 at /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535 is complete&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
That upgrade happens sequentially for each disk, so even though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, we then finally got to disk_m, our nearly full 512GB disk:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;


&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:xx-small;&quot;&gt;&lt;span style=&quot;font-stretch:normal;&quot;&gt;2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/&lt;/span&gt;&lt;span style=&quot;font-stretch:normal;&quot;&gt;dfs/dn/in_use.lock acquired by nodename &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;mailto:12424@c4n1.gbif.org&quot;&gt;12424@c4n1.gbif.org&lt;/a&gt;&lt;br /&gt;2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535&lt;br /&gt;2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/&lt;/span&gt;&lt;span style=&quot;font-stretch:normal;&quot;&gt;dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535&lt;br /&gt;2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/&lt;/span&gt;&lt;span style=&quot;font-stretch:normal;&quot;&gt;dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.&lt;br /&gt;&amp;nbsp;&amp;nbsp; old LV = -56; old CTime = 1416737045694.&lt;br /&gt;&amp;nbsp;&amp;nbsp; new LV = -56; new CTime = 1432405112136&lt;br /&gt;2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672-130.226.238.178-1367832131535&lt;br /&gt;java.io.IOException: Cannot create directory /mnt/disk_m/&lt;/span&gt;dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1259)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(DataStorage.java:1023)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(BlockPoolSliceStorage.java:647)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(BlockPoolSliceStorage.java:456)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:390)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:171)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:214)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1397)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1362)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:227)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:839)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; at java.lang.Thread.run(Thread.java:745)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:xx-small;&quot;&gt;2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672-130.226.238.178-1367832131535 : Cannot create directory /mnt/disk_m/&lt;span style=&quot;font-stretch:normal;&quot;&gt;dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The somewhat misleading &quot;Cannot create directory&quot; is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10GB that was available to &quot;non-HDFS&quot; (which we've concluded means &quot;not HDFS storage files, but everything else is fair game&quot;). Because &lt;i&gt;some&lt;/i&gt; space is available to start the upgrade, it begins, but when it exhausts the disk it fails, and &lt;b&gt;This Kills The DataNode&lt;/b&gt;. It does clean up after itself, but it prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the colocated yarn directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down and unfortunately they were in different racks, so that meant we had a big pile of our replication factor 2 files missing blocks (the default HDFS block replication policy places the second and subsequent copies on a different rack from the first copy).&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:xx-small;&quot;&gt;WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.FileNotFoundException: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened, but we presume that after one of our failures it didn't clean up its temp files on disk_m, and then on subsequent restarts it found the disk completely full, (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2 that by bad luck had both copies of some of their blocks on the failed disk_m volumes (one from rack1, one from rack2).&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4&gt;
On inodes and hardlinks&lt;/h4&gt;
&lt;div class=&quot;p1&quot;&gt;
The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/&amp;lt;blockpool name&amp;gt;/current/finalized and within finalized are a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made called previous.tmp. It's not a normal copy however - it uses &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Hard_link&quot;&gt;hardlinks&lt;/a&gt;&amp;nbsp;in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive (&quot;drive not formatted&quot;). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.aosabook.org/en/hdfs.html&quot;&gt;decidedly un-Yahoo&lt;/a&gt;). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes on /rack2. We then unmounted the USB drive, wiped both /disk_m's and then let replication balance out again. Final result: no lost blocks.&lt;/div&gt;
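&lt;div class=&quot;p1&quot;&gt;
To make the hardlink idea concrete, here is a small sketch using only the Python standard library. It links one file from a stand-in 'finalized' directory into a stand-in 'previous.tmp' directory and checks that both names point at the same inode. The block file name is hypothetical, and this is not how HDFS itself performs the upgrade; it only shows why linking avoids copying block data while new directories still have to be created on the volume.&lt;/div&gt;
&lt;pre&gt;
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    finalized = os.path.join(workdir, "finalized")
    previous = os.path.join(workdir, "previous.tmp")
    os.mkdir(finalized)
    os.mkdir(previous)

    block = os.path.join(finalized, "blk_0001")  # hypothetical block file name
    with open(block, "wb") as handle:
        handle.write(b"x" * 1024)

    # A hard link adds a second directory entry for the same inode,
    # so no block data is copied.
    linked = os.path.join(previous, "blk_0001")
    os.link(block, linked)

    same_inode = os.stat(block).st_ino == os.stat(linked).st_ino
    link_count = os.stat(block).st_nlink
    print(same_inode, link_count)
```
&lt;/pre&gt;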
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h4&gt;
Mitigation&lt;/h4&gt;
&lt;div class=&quot;p1&quot;&gt;
With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;dfs.datanode.du.reserved:25GB&lt;/span&gt;&amp;nbsp;this guarantees 25GB free on each volume&amp;nbsp;(up from 10GB)&amp;nbsp;and&amp;nbsp;should be enough to allow a future upgrade to happen&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 &lt;/span&gt;together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone&lt;/li&gt;
&lt;/ul&gt;
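&lt;div class=&quot;p1&quot;&gt;
The effect of the larger reservation on the small disk can be checked with a little arithmetic. The sketch below assumes the reserved value is expressed in bytes per volume and that the sizes quoted in this post are binary gigabytes; the helper names are made up for the example.&lt;/div&gt;
&lt;pre&gt;
```python
GIB = 1024 ** 3  # assumption: the sizes quoted in the post are binary gigabytes


def reserved_bytes(gib):
    """Convert a per-volume reservation in GB to a byte count."""
    return gib * GIB


def hdfs_capacity_gib(usable_gib, reserved_gib):
    """Space HDFS may fill on one volume once the reservation is honoured."""
    return max(usable_gib - reserved_gib, 0)


small_disk_usable = 466  # GB usable on the 512GB disk, from the post
before = hdfs_capacity_gib(small_disk_usable, 10)
after = hdfs_capacity_gib(small_disk_usable, 25)
print(reserved_bytes(25), before, after)
```
&lt;/pre&gt;
&lt;div class=&quot;p1&quot;&gt;
With 10GB reserved the small disk could hold 456GB of blocks, which matches the usage we saw before the upgrade; with 25GB reserved that ceiling drops to 441GB.&lt;/div&gt;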
&lt;h4&gt;
Conclusion&lt;/h4&gt;
&lt;div&gt;
This was one small taste of what can go wrong with filling heterogeneous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's any way you can help it.&lt;/div&gt;
&lt;br /&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-6898810307158210457</guid>
         <pubDate>Fri, 29 May 2015 16:34:00 +0000</pubDate>
      </item>
      <item>
         <title>Improving the GBIF Backbone matching</title>
         <link>http://gbif.blogspot.com/2015/03/improving-gbif-backbone-matching.html</link>
         <description>In GBIF &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence&quot;&gt;occurrence records&lt;/a&gt; are matched to a taxon in a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&quot;&gt;backbone taxonomy&lt;/a&gt;&amp;nbsp;using the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/developer/species#searching&quot;&gt;species match API&lt;/a&gt;. This is important to reduce spelling variations and create consistent metrics and searches according to a single classification and synonymy.&lt;br /&gt;
&lt;br /&gt;
Over the past years we have been alerted to &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://dev.gbif.org/issues/issues?jql=labels%20%3D%20speciesmatch&quot;&gt;various bad matches&lt;/a&gt;. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.&lt;br /&gt;
&lt;br /&gt;
In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. &amp;nbsp;The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. &amp;nbsp;Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.&lt;br /&gt;
&lt;h2 id=&quot;1name-parsing-of-undetermined-species&quot;&gt;
1. Name parsing of undetermined species&lt;/h2&gt;
In occurrences we see many partly undetermined names such as &lt;em&gt;Lucanus spec.&lt;/em&gt; These rank markers were erroneously treated as real species epithets, which, combined with fuzzy matching, led to poor results. &lt;br /&gt;
&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;
&lt;strong&gt;Examples&lt;/strong&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence/164267402/verbatim&quot;&gt;&lt;em&gt;Xysticus&lt;/em&gt; sp.&lt;/a&gt; used to wrongly match &lt;em&gt;Xysticus spiethi&lt;/em&gt; while it now just matches the genus &lt;em&gt;Xysticus&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence/1061576151/verbatim&quot;&gt;&lt;em&gt;Triodia&lt;/em&gt; sp.&lt;/a&gt; used to match the family Poaceae while it now matches the genus &lt;em&gt;Triodia&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
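&lt;div&gt;
The idea behind this fix can be sketched in a few lines of Python. The pattern below is a simplified assumption made for illustration: it only recognises the markers sp, spec and spp after a single genus name, whereas the real name parser may handle many more cases. The function name is hypothetical.&lt;/div&gt;
&lt;pre&gt;
```python
import re

# Hypothetical list of markers for an undetermined species; the real parser
# may recognise more variants.
INDET_MARKER = re.compile(r"^([A-Z][a-z]+)\s+(?:sp|spec|spp)\.?$")


def strip_indet_marker(name):
    """Return only the genus when the epithet is just a rank marker."""
    cleaned = name.strip()
    match = INDET_MARKER.match(cleaned)
    if match:
        return match.group(1)
    return cleaned


for verbatim in ("Xysticus sp.", "Lucanus spec.", "Xysticus spiethi"):
    print(verbatim, "->", strip_indet_marker(verbatim))
```
&lt;/pre&gt;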
&lt;h2 id=&quot;2-dameraulevenshtein-distance-algorithm&quot;&gt;
2. Damerau–Levenshtein distance algorithm&lt;/h2&gt;
For scoring fuzzy matches we have so far applied the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance&quot;&gt;Jaro–Winkler distance&lt;/a&gt;, which is often used for matching person names. It tends to allow for rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too large, so we decided to revert to the classical and more predictable &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance&quot;&gt;Damerau–Levenshtein distance&lt;/a&gt;. This reduces false positive fuzzy matches considerably, even though we lose a few good matches at the same time.&lt;br /&gt;
&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;
&lt;strong&gt;Examples&lt;/strong&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence/1037140379/verbatim&quot;&gt;&lt;em&gt;Xyris kralii&lt;/em&gt; Wand.&lt;/a&gt; used to match to &lt;em&gt;Xyris harleyi&lt;/em&gt; but now just matches to the genus &lt;em&gt;Xyris L.&lt;/em&gt; as the species is missing from our backbone.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/occurrence/144904719/verbatim&quot;&gt;&lt;em&gt;Zea mays&lt;/em&gt; subsp. &lt;em&gt;parviglumis&lt;/em&gt; var. &lt;em&gt;huehuet&lt;/em&gt; Iltis &amp;amp; Doebley&lt;/a&gt; used to match &lt;em&gt;Zea mays&lt;/em&gt; var. &lt;em&gt;hirta&lt;/em&gt; while it now just hits the species &lt;em&gt;Zea mays&lt;/em&gt; L.&lt;/li&gt;
&lt;/ul&gt;
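&lt;div&gt;
For readers unfamiliar with the metric, here is a minimal sketch of the restricted Damerau–Levenshtein distance (also known as optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. It is written for illustration only and may differ from the variant and the scoring thresholds used in the actual matching service.&lt;/div&gt;
&lt;pre&gt;
```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Xyris kralii", "Xyris harleyi"))
```
&lt;/pre&gt;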
&lt;h3 id=&quot;matching-results&quot;&gt;
Matching results&lt;/h3&gt;
&lt;div class=&quot;p1&quot;&gt;
The distinct, verbatim classifications of 528 million records were passed through the original and the new fuzzy matching algorithms; this included 10.5 million distinct classifications in total. &amp;nbsp;The results show that 428 thousand classifications (4%), representing&amp;nbsp;5,323,758 occurrence records, produce a&amp;nbsp;different match. So far we have taken a random subsample of the records that change and manually inspected the results; we could hardly spot any regressions or wrong matches.&lt;/div&gt;
&lt;div class=&quot;p2&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
We have published the complete matching comparison as well as the subset of changed records on &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://zenodo.org/record/16491&quot;&gt;Zenodo&lt;/a&gt; as tab-delimited files:&lt;/div&gt;
&lt;div class=&quot;p2&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Dataset 1:&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://zenodo.org/deposit/26044/file/?file_id=23bc2f5e-f883-410e-ae2d-bd718ccb2b40&quot;&gt;All classification matches (10.5 million)&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Dataset 2:&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://zenodo.org/deposit/26044/file/?file_id=bbed9d39-ecb5-44cc-949e-a9a6068dc166&quot;&gt;Changed matches (428 thousand)&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p2&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
The schema of the files has three column families, each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every match record: verbatim values are prefixed with v_, the old matching results carry an _old suffix, and the new matching results use plain terms (e.g. v_scientificName, scientificName_old, scientificName).&lt;/div&gt;
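&lt;div class=&quot;p1&quot;&gt;
As a rough illustration of how these column families can be used, the Python sketch below reads a tab-delimited file and keeps the rows whose new taxonKey differs from the old one. It assumes the file starts with a header row carrying the column names, which may not hold for the published exports; the helper changed_matches and the sample rows with their keys are made up for illustration.&lt;/div&gt;
&lt;pre class=&quot;prettyprint&quot;&gt;
```python
import csv
import io


def changed_matches(tsv_file):
    """Yield rows whose new taxonKey differs from taxonKey_old.

    Assumes a header row with the column names described above.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["taxonKey"] != row["taxonKey_old"]:
            yield row


# Hypothetical sample data: the taxon keys are invented placeholders.
SAMPLE = (
    "v_scientificName\ttaxonKey_old\tscientificName_old\ttaxonKey\tscientificName\n"
    "Xyris kralii Wand.\t111\tXyris harleyi\t222\tXyris L.\n"
    "Zea mays L.\t333\tZea mays L.\t333\tZea mays L.\n"
)

for row in changed_matches(io.StringIO(SAMPLE)):
    print(row["v_scientificName"], "|", row["scientificName_old"], "|", row["scientificName"])
```
&lt;/pre&gt;
&lt;div class=&quot;p1&quot;&gt;
For a real file, the same function could be given an open file handle instead of the in-memory sample.&lt;/div&gt;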
&lt;div class=&quot;p2&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;mailto:mdoering@gbif.org&quot;&gt;&lt;span class=&quot;s1&quot;&gt;mdoering@gbif.org&lt;/span&gt;&lt;/a&gt;.&lt;/div&gt;
&lt;h3 id=&quot;appendix&quot;&gt;
Appendix&lt;/h3&gt;
&lt;h2 id=&quot;create-distinct-occurrence-names-table&quot;&gt;
Create distinct occurrence names table&lt;/h2&gt;
&lt;pre class=&quot;prettyprint&quot;&gt;&lt;code class=&quot; hljs sql&quot;&gt;&lt;span class=&quot;hljs-operator&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; markus.&lt;span class=&quot;hljs-keyword&quot;&gt;names&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-aggregate&quot;&gt;count&lt;/span&gt;(*) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; numocc, &lt;span class=&quot;hljs-aggregate&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;distinct&lt;/span&gt; datasetKey) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; prod_b.occurrence_hdfs 
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; v_scientificName, numocc &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;lookup-taxonkey-with-both-old-new-lookup&quot;&gt;
Lookup taxonkey with both old &amp;amp; new lookup&lt;/h2&gt;
&lt;pre class=&quot;prettyprint&quot;&gt;&lt;code class=&quot; hljs sql&quot;&gt;&lt;span class=&quot;hljs-operator&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; markus.name_matches &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
  n.numocc, 
  n.numdatasets, 
  n.v_scientificName, 
  n.v_kingdom, 
  n.v_phylum, 
  n.v_class, 
  n.v_order, 
  n.v_family, 
  n.v_genus, 
  n.v_subgenus, 
  n.v_specificEpithet, 
  n.v_infraspecificEpithet, 
  n.v_scientificNameAuthorship, 
  n.v_taxonrank, 
  n.v_higherClassification, 

  prod.taxonKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; taxonKey_old,
  prod.scientificName &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; scientificName_old,
  prod.rank &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; rank_old,
  prod.status &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; status_old,
  prod.matchType &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; matchType_old,
  prod.confidence &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; confidence_old,
  prod.kingdomKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; kingdomKey_old,
  prod.phylumKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; phylumKey_old,
  prod.classKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; classKey_old,
  prod.orderKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; orderKey_old,
  prod.familyKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; familyKey_old,
  prod.genusKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; genusKey_old,
  prod.speciesKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; speciesKey_old,
  prod.kingdom &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; kingdom_old,
  prod.phylum &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; phylum_old,
  prod.class_ &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; class_old,
  prod.order_ &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; order_old,
  prod.family &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; family_old,
  prod.genus &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; genus_old,
  prod.species &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; species_old,

  uat.taxonKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; taxonKey,
  uat.scientificName &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; scientificName,
  uat.rank &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; rank,
  uat.status &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; status,
  uat.matchType &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; matchType,
  uat.confidence &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; confidence,
  uat.kingdomKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; kingdomKey,
  uat.phylumKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; phylumKey,
  uat.classKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; classKey,
  uat.orderKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; orderKey,
  uat.familyKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; familyKey,
  uat.genusKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; genusKey,
  uat.speciesKey &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; speciesKey,
  uat.kingdom &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; kingdom,
  uat.phylum &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; phylum,
  uat.class_ &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; class_,
  uat.order_ &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; order_,
  uat.family &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; family,
  uat.genus &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; genus,
  uat.species &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; species

&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    numocc, 
    numdatasets, 
    v_scientificName, 
    v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus, 
    v_specificEpithet, 
    v_infraspecificEpithet, 
    v_scientificNameAuthorship, 
    v_taxonrank, 
    v_higherClassification, 
    &lt;span class=&quot;hljs-keyword&quot;&gt;match&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;'PROD'&lt;/span&gt;, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod, 
    &lt;span class=&quot;hljs-keyword&quot;&gt;match&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;'UAT'&lt;/span&gt;, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; markus.&lt;span class=&quot;hljs-keyword&quot;&gt;names&lt;/span&gt;
) n;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;hive-exports&quot;&gt;
Hive exports&lt;/h2&gt;
&lt;pre class=&quot;prettyprint&quot;&gt;&lt;code class=&quot; hljs sql&quot;&gt;&lt;span class=&quot;hljs-operator&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; markus.matches_changed 
&lt;span class=&quot;hljs-keyword&quot;&gt;ROW&lt;/span&gt; FORMAT DELIMITED FIELDS TERMINATED &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;'&amp;#92;t'&lt;/span&gt; LINES TERMINATED &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;'&amp;#92;n'&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NULL&lt;/span&gt; DEFINED &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;''&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; markus.name_matches 
&lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt; taxonKey != taxonKey_old;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;prettyprint&quot;&gt;&lt;code class=&quot; hljs sql&quot;&gt;&lt;span class=&quot;hljs-operator&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; markus.matches_all 
&lt;span class=&quot;hljs-keyword&quot;&gt;ROW&lt;/span&gt; FORMAT DELIMITED FIELDS TERMINATED &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;'&amp;#92;t'&lt;/span&gt; LINES TERMINATED &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;'&amp;#92;n'&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NULL&lt;/span&gt; DEFINED &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;''&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; markus.name_matches;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas and should not be considered GBIF or institutional policy.&lt;/div&gt;</description>
         <author>Markus Döring</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-7907178205262565696</guid>
         <pubDate>Mon, 30 Mar 2015 22:30:00 +0000</pubDate>
      </item>
      <item>
         <title>IPT v2.2 – Making data citable through DataCite</title>
         <link>http://gbif.blogspot.com/2015/03/ipt-v22.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;GBIF is pleased to release&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot;&gt;&lt;span class=&quot;s1&quot;&gt;IPT 2.2&lt;/span&gt;&lt;/a&gt;, now capable of automatically connecting with either&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://www.datacite.org/&quot;&gt;&lt;span class=&quot;s1&quot;&gt;DataCite&lt;/span&gt;&lt;/a&gt;&amp;nbsp;or&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ezid.cdlib.org/&quot;&gt;EZID&lt;/a&gt; to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;DataCite integration explained&lt;/span&gt;&lt;/h3&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals&lt;span style=&quot;font-size:xx-small;&quot;&gt;(1)&lt;/span&gt;:&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;/a&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://images-blogger-opensocial.googleusercontent.com/gadgets/proxy?url=http%3A%2F%2F3.bp.blogspot.com%2F-TpjTdrwdPzw%2FVRG20469uPI%2FAAAAAAAAOgw%2F9e_MQulhE0I%2Fs1600%2Fdatacite-logo-web.png&amp;amp;container=blogger&amp;amp;gadget=a&amp;amp;rewriteMime=image%2F*&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png&quot;/&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/a&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/div&gt;
&lt;ol class=&quot;ol1&quot;&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Establish easier access to research data on the Internet&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Increase acceptance of research data as citable contributions to the scholarly record&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Support research data archiving to permit results to be verified and re-purposed for future study&lt;a rel=&quot;nofollow&quot; name='more'&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;EZID is hosted by the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cdlib.org/&quot;&gt;California Digital Library&lt;/a&gt;&amp;nbsp;(a founding member of DataCite)&amp;nbsp;and adds &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cdlib.org/uc3/ezid/&quot;&gt;services&lt;/a&gt; on top of the DataCite DOI infrastructure such as their own easy-to-use &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ezid.cdlib.org/doc/apidoc.html&quot;&gt;programming interface&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;To integrate with DataCite and further these three goals for biodiversity data, IPT version 2.2 introduces the following new features:&lt;/span&gt;&lt;/div&gt;
&lt;ul class=&quot;ul1&quot;&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;DOIs can be assigned to datasets thereby making them persistently resolvable&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;A new DOI can be assigned to a dataset each time it undergoes scientifically significant changes, which is recommended best practice&lt;span style=&quot;font-size:xx-small;&quot;&gt;(1)&lt;/span&gt;&amp;nbsp;and part of the IPT's new&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Versioning&quot;&gt;&lt;span class=&quot;s1&quot;&gt;versioning policy&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Citations can be automatically generated for datasets in a&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Citation&quot;&gt;&lt;span class=&quot;s1&quot;&gt;standard format&lt;/span&gt;&lt;/a&gt;&amp;nbsp;which includes the DOI and dataset version number&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;A&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Version_history&quot;&gt;&lt;span class=&quot;s1&quot;&gt;version history&lt;/span&gt;&lt;/a&gt;&amp;nbsp;is kept for each dataset, allowing researchers to easily track changes and access/download all previous versions&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;To take advantage of these optional new features, there are two basic requirements:&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;ol class=&quot;ol1&quot; style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;The IPT must be configured with either a DataCite or EZID account. GBIF participants interested in a DataCite account should contact the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;mailto:helpdesk@gbif.org&quot;&gt;GBIF Helpdesk&lt;/a&gt;&amp;nbsp;directly. General information about getting a DataCite account can be found&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://www.datacite.org/join-datacite&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;; information about getting an EZID account can be found&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ezid.cdlib.org/home/pricing&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;.&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;The IPT should always be online and accessible to ensure that assigned DOIs continue to be resolvable.&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Once publishers make their data citable through DataCite they can expect the following benefits:&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Their datasets will be globally discoverable through the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://search.datacite.org/ui&quot;&gt;&lt;span class=&quot;s1&quot;&gt;DataCite Metadata Search tool&lt;/span&gt;&lt;/a&gt;&amp;nbsp;and the Thomson&amp;nbsp;Reuters&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://wokinfo.com/products_tools/multidisciplinary/dci/&quot;&gt;&lt;span class=&quot;s1&quot;&gt;Data Citation Index&lt;/span&gt;&lt;/a&gt;&amp;nbsp;(part of the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://thomsonreuters.com/en/products-services/scholarly-scientific-research/scholarly-search-and-discovery/web-of-science.html&quot;&gt;Web of Science&lt;/a&gt;) thanks to a&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://thomsonreuters.com/en/press-releases/2014/thomson-reuters-collaborates-with-datacite-to-expand-discovery-of-research-data.html&quot;&gt;&lt;span class=&quot;s1&quot;&gt;collaboration&lt;/span&gt;&lt;/a&gt;&amp;nbsp;with DataCite formalised in 2014&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;They can find out exactly who cited their dataset via the&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://wokinfo.com/products_tools/multidisciplinary/dci/&quot;&gt;Data Citation Index&lt;/a&gt;&lt;/span&gt;, and better understand&amp;nbsp;the impact their dataset has had within the scholarly research and policy making communities&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;/span&gt;&lt;/div&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:right;margin-left:1em;text-align:right;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png&quot; style=&quot;clear:right;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png&quot; height=&quot;204&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;Sample basic metadata page, IPT 2.2&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/span&gt;&lt;br /&gt;
&lt;h3&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;Other new features&lt;/span&gt;&lt;/h3&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;The IPT 2.2 also introduces a simple way of licensing datasets&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;under one of three machine readable waivers or licences: &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://creativecommons.org/publicdomain/zero/1.0/legalcode&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;CC0 v1.0&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;,&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://creativecommons.org/licenses/by/4.0/legalcode&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;CC-BY v4.0&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;, or&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://creativecommons.org/licenses/by-nc/4.0/legalcode&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;CC-BY-NC v4.0&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;. 
&amp;nbsp;These waivers or CC licences are &quot;something that the creators of works can understand, their users can understand, and even the Web itself can understand.&quot;&lt;span style=&quot;font-size:xx-small;&quot;&gt;(2) &amp;nbsp;&lt;/span&gt;You can read more about GBIF's new licensing policy&amp;nbsp;&lt;/span&gt;&lt;span class=&quot;s1&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/terms/licences&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;here&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-left:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png&quot; height=&quot;316&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;Sample resource overview page, IPT 2.2&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;Whether an IPT is DOI-turbocharged or not, there are a number of other new benefits in this release:&lt;/span&gt;&lt;br /&gt;
&lt;ul class=&quot;ul1&quot;&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;b&gt;basisOfRecord validation&lt;/b&gt;&amp;nbsp;for occurrence datasets&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;The ability to&amp;nbsp;&lt;b&gt;preview source mappings&lt;/b&gt;&amp;nbsp;prior to publication&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;The ability to&amp;nbsp;&lt;b&gt;preview resource metadata&lt;/b&gt;&amp;nbsp;prior to publication&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;A suite of new metadata fields such as&amp;nbsp;&lt;b&gt;ORCIDs&lt;/b&gt;&amp;nbsp;for contacts&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;An enhanced user interface including a new and&amp;nbsp;&lt;b&gt;improved resource homepage&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;b&gt;Additional context help&lt;/b&gt;&amp;nbsp;to guide users, especially first-time users&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Acknowledgements&lt;/span&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/h3&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Thanks to the hard work and dedication of the team of contributors, version 2.2 has been fully translated into French, Japanese, Portuguese, and Spanish. Since so many new features have gone into this new version, the text requiring translation was enormous. The following translators deserve a huge thanks, merci, arigato,&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;obrigado, and&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;gracias:&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Sophie Pamerlon, Marie-Elise Lecoq (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.fr/&quot;&gt;&lt;span class=&quot;s1&quot;&gt;GBIF France&lt;/span&gt;&lt;/a&gt;) - Updating French translation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Yukiko Yamazaki (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.jp/&quot;&gt;GBIF Japan (JBIF)&lt;/a&gt;) - Updating Japanese translation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Allan Koch Veiga, Etienne Americo Cartolano,&amp;nbsp;Daniel Lins, and&amp;nbsp;Antonio Mauro Saraiva (&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.biocomp.org.br/&quot;&gt;Universidade de São Paulo, Research Center on Biodiversity and Computing&amp;nbsp;- BioComp&lt;/a&gt;&lt;/span&gt;)&amp;nbsp;- Updating Portuguese translation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Dairo Escobar,&amp;nbsp;Nestor Beltran, and Daniel Amariles (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.sibcolombia.net/web/sib/home&quot;&gt;&lt;span class=&quot;s1&quot;&gt;Colombian Biodiversity Information System (SiB Colombia)&lt;/span&gt;&lt;/a&gt;) - Updating Spanish translation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;Lastly, a special thanks must go out to David Shorthouse from&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net/&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;Canadensys&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&amp;nbsp;for his guidance and help. Canadensys has been assigning DOIs to datasets it serves via its IPT since 2012, as described&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net/2012/link-love-dois-for-darwin-core-archives&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;span class=&quot;s1&quot;&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;, and has provided invaluable assistance throughout development.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;On behalf of the GBIF development team, I really hope you enjoy using this new version, and hope that you will be able to take advantage of all its exciting new features.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align:left;&quot;&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;Footnotes&lt;/span&gt;&lt;/h3&gt;
&lt;div&gt;
&lt;ol style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;http://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;https://creativecommons.org/licenses/&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-3946716376176094471</guid>
         <pubDate>Fri, 27 Mar 2015 13:55:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s72-c/datacite-logo-web.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Upgrading our cluster from CDH4 to CDH5</title>
         <link>http://gbif.blogspot.com/2014/11/upgrading-our-cluster-from-cdh4-to-cdh5.html</link>
         <description>A little over a year ago we wrote about&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2013/05/migrating-our-hadoop-cluster-from-cdh3.html&quot;&gt;upgrading from CDH3 to CDH4&lt;/a&gt;&amp;nbsp;and now the time had come to upgrade from CDH4 to &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html&quot;&gt;CDH5&lt;/a&gt;. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
The Cluster&lt;/h3&gt;
&lt;div&gt;
Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h2&gt;
Upgrade CDH Manager&lt;/h2&gt;
&lt;div&gt;
The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ag_upgrade_cm4_to_cm5.html&quot;&gt;Cloudera documentation&lt;/a&gt;&amp;nbsp;is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (minimum &quot;happy&quot; config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a fourth, older machine to the &quot;masters&quot; group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources.&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h2&gt;
Upgrade Cluster Members&lt;/h2&gt;
&lt;div&gt;
Again the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_upgrade_tocdh5_using_parcels.html&quot;&gt;Cloudera documentation&lt;/a&gt;&amp;nbsp;is excellent but I'll just add a bit. The upgrade process will now ask if a Java JDK should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 RPMs that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the &quot;Memory Overcommit Validation Threshold&quot; to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/&quot;&gt;figuring out memory settings for YARN&lt;/a&gt;. Additionally Hue's configuration required some attention because after the upgrade it had forgotten where Zookeeper and the HBase Thrift server were. All in all quite painless.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
The Gotchas&lt;/h3&gt;
&lt;div&gt;
Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies or to changes in how a user classpath is given priority over that of YARN/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;b&gt;Guava&lt;/b&gt;&lt;br /&gt;
&lt;div&gt;
We're not alone in our suffering at the hands of mismatched Guava versions (e.g.&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HADOOP-10101&quot;&gt;HADOOP-10101&lt;/a&gt;,&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HDFS-7040&quot;&gt;HDFS-7040&lt;/a&gt;), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and, more importantly, HBase, and excluding any newer Guava versions from our dependencies. This meant downgrading some actual code that was using Guava 15, but it was the easiest path to getting a working system.&lt;/div&gt;
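&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
As a rough sketch of what pinning and excluding Guava can look like in a Maven pom: the &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;org.example:some-library&lt;/span&gt; coordinates are hypothetical and stand in for any dependency that drags in a newer Guava.&lt;/div&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;dependencies&amp;gt;
   &amp;lt;!-- Pin the Guava version that Hadoop and HBase expect --&amp;gt;
   &amp;lt;dependency&amp;gt;
     &amp;lt;groupId&amp;gt;com.google.guava&amp;lt;/groupId&amp;gt;
     &amp;lt;artifactId&amp;gt;guava&amp;lt;/artifactId&amp;gt;
     &amp;lt;version&amp;gt;14.0.1&amp;lt;/version&amp;gt;
   &amp;lt;/dependency&amp;gt;
   &amp;lt;!-- Hypothetical library that transitively pulls in a newer Guava --&amp;gt;
   &amp;lt;dependency&amp;gt;
     &amp;lt;groupId&amp;gt;org.example&amp;lt;/groupId&amp;gt;
     &amp;lt;artifactId&amp;gt;some-library&amp;lt;/artifactId&amp;gt;
     &amp;lt;version&amp;gt;1.0&amp;lt;/version&amp;gt;
     &amp;lt;exclusions&amp;gt;
       &amp;lt;exclusion&amp;gt;
         &amp;lt;groupId&amp;gt;com.google.guava&amp;lt;/groupId&amp;gt;
         &amp;lt;artifactId&amp;gt;guava&amp;lt;/artifactId&amp;gt;
       &amp;lt;/exclusion&amp;gt;
     &amp;lt;/exclusions&amp;gt;
   &amp;lt;/dependency&amp;gt;
 &amp;lt;/dependencies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;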
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;b&gt;Jackson&lt;/b&gt;&lt;br /&gt;
&lt;div&gt;
We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;b&gt;Logging&lt;/b&gt;&lt;br /&gt;
&lt;div&gt;
Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's no surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Apache Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. &amp;lt;montage&amp;gt;Angry men smash things angrily over the course of days&amp;lt;/montage&amp;gt; Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/YARN-2875&quot;&gt;YARN-2875&lt;/a&gt;. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.&lt;/div&gt;
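&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
A minimal sketch of that setup in a pom, for illustration only. It assumes a &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hadoop.version&lt;/span&gt; property is defined elsewhere, and 1.7.7 is simply one release that satisfies the 1.7.6+ requirement. The other two bridges (jcl-over-slf4j and jul-to-slf4j) would be declared in the same way.&lt;/div&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;dependencies&amp;gt;
   &amp;lt;!-- Keep Hadoop's own log4j binding off the classpath --&amp;gt;
   &amp;lt;dependency&amp;gt;
     &amp;lt;groupId&amp;gt;org.apache.hadoop&amp;lt;/groupId&amp;gt;
     &amp;lt;artifactId&amp;gt;hadoop-client&amp;lt;/artifactId&amp;gt;
     &amp;lt;version&amp;gt;${hadoop.version}&amp;lt;/version&amp;gt;
     &amp;lt;exclusions&amp;gt;
       &amp;lt;exclusion&amp;gt;
         &amp;lt;groupId&amp;gt;log4j&amp;lt;/groupId&amp;gt;
         &amp;lt;artifactId&amp;gt;log4j&amp;lt;/artifactId&amp;gt;
       &amp;lt;/exclusion&amp;gt;
       &amp;lt;exclusion&amp;gt;
         &amp;lt;groupId&amp;gt;org.slf4j&amp;lt;/groupId&amp;gt;
         &amp;lt;artifactId&amp;gt;slf4j-log4j12&amp;lt;/artifactId&amp;gt;
       &amp;lt;/exclusion&amp;gt;
     &amp;lt;/exclusions&amp;gt;
   &amp;lt;/dependency&amp;gt;
   &amp;lt;!-- Route log4j calls to SLF4J, using a bridge version newer than 1.7.5 --&amp;gt;
   &amp;lt;dependency&amp;gt;
     &amp;lt;groupId&amp;gt;org.slf4j&amp;lt;/groupId&amp;gt;
     &amp;lt;artifactId&amp;gt;log4j-over-slf4j&amp;lt;/artifactId&amp;gt;
     &amp;lt;version&amp;gt;1.7.7&amp;lt;/version&amp;gt;
   &amp;lt;/dependency&amp;gt;
 &amp;lt;/dependencies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;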
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;YARN&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
The crux of our problems was having our classpath loaded after the Hadoop classpath had been loaded, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is &quot;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;b&gt;mapreduce.job.user.classpath.first&lt;/b&gt;&lt;/span&gt;&quot;. YARN also quizzically claims that the parameter is deprecated, but.. works for me.&lt;/div&gt;
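&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
For illustration, and assuming the job picks up its settings from a Hadoop-style XML configuration file, the parameter is an ordinary boolean property. The same key can also be set programmatically on the job's configuration object.&lt;/div&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;configuration&amp;gt;
   &amp;lt;!-- Load the user's jars ahead of the Hadoop-provided ones --&amp;gt;
   &amp;lt;property&amp;gt;
     &amp;lt;name&amp;gt;mapreduce.job.user.classpath.first&amp;lt;/name&amp;gt;
     &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
   &amp;lt;/property&amp;gt;
 &amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;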
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Oozie&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is &quot;&lt;b&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;oozie.launcher.mapreduce.job.user.classpath.first&lt;/span&gt;&lt;/b&gt;&quot;. We had been loading the old parameter &quot;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;b&gt;mapreduce.task.classpath.user.precedence&lt;/b&gt;&lt;/span&gt;&quot; in each action in the workflow using the &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&amp;lt;job-xml&amp;gt;&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt; tag to load the configs from a file called&lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hive-default.xml&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt;. We then encountered two problems:&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt;Note the name - we used &lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hive-default.xml&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt; instead of &lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hive-site.xml&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt; because of a bug in Oozie (discussed &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/RW5WmSTzbLo&quot;&gt;here&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/y66j12jb1ig/tODJGmJ2BawJ&quot;&gt;here&lt;/a&gt;).&amp;nbsp;That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called &lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hive-site.xml&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt;and contains our specific configs and is again being picked up. BUT:&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Adding&amp;nbsp;&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-weight:bold;&quot;&gt;oozie.launcher.mapreduce.job.user.classpath.first&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt; to&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hive-site.xml&lt;/span&gt;&amp;nbsp;doesn't&amp;nbsp;work! As we wrote up in Oozie bug &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/OOZIE-2066&quot;&gt;OOZIE-2066&lt;/a&gt; this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;pre style=&quot;background-image:url(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif);background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;action name=&quot;run-test&quot;&amp;gt;  
  &amp;lt;java&amp;gt;  
   &amp;lt;job-tracker&amp;gt;c1n2.gbif.org:8032&amp;lt;/job-tracker&amp;gt;  
   &amp;lt;name-node&amp;gt;hdfs://c1n1.gbif.org:8020&amp;lt;/name-node&amp;gt;  
   &amp;lt;configuration&amp;gt;  
    &amp;lt;property&amp;gt;  
     &amp;lt;name&amp;gt;oozie.launcher.mapreduce.task.classpath.user.precedence&amp;lt;/name&amp;gt;  
     &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;  
    &amp;lt;/property&amp;gt;  
   &amp;lt;/configuration&amp;gt;  
   &amp;lt;main-class&amp;gt;test.CPTest&amp;lt;/main-class&amp;gt;  
  &amp;lt;/java&amp;gt;  
  &amp;lt;ok to=&quot;end&quot; /&amp;gt;  
  &amp;lt;error to=&quot;kill&quot; /&amp;gt;  
 &amp;lt;/action&amp;gt;  
&lt;/code&gt;&lt;/pre&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
&lt;b&gt;New Packaging Woes&lt;/b&gt;&lt;/h3&gt;
&lt;br /&gt;
We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Example exclusions&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
hbase-server (needed to run MapReduce over HBase): &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/datacube/blob/master/pom.xml#L268&quot;&gt;https://github.com/gbif/datacube/blob/master/pom.xml#L268&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
hbase-testing-util (needed to run mini clusters): &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/datacube/blob/master/pom.xml#L302&quot;&gt;https://github.com/gbif/datacube/blob/master/pom.xml#L302&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
hbase-client: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/metrics/blob/master/pom.xml#L226&quot;&gt;https://github.com/gbif/metrics/blob/master/pom.xml#L226&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
hadoop-client (removing logging): &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/metrics/blob/master/pom.xml#L327&quot;&gt;https://github.com/gbif/metrics/blob/master/pom.xml#L327&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Beyond just sorting conflicting dependencies, we also encountered a problem that presented as &quot;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;No FileSystem for scheme: file&quot;&lt;/span&gt;. It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, each of which ships its own META-INF/services file for the FileSystem, and so we were getting only one of them in the final jar. Thus we could not use the FileSystem to read both local files (like jars for the class path) and files from HDFS. The fix was to include the &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;org.apache.hadoop.fs.FileSystem&lt;/span&gt; service file in our project explicitly: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem&quot;&gt;https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem&lt;/a&gt;&lt;br /&gt;
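&lt;br /&gt;
A different technique from the explicit service file above, and one we did not use, is to let the Maven shade plugin merge same-named META-INF/services files with its ServicesResourceTransformer. A minimal sketch, assuming the jar is built with maven-shade-plugin in the package phase:&lt;br /&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;plugin&amp;gt;
   &amp;lt;groupId&amp;gt;org.apache.maven.plugins&amp;lt;/groupId&amp;gt;
   &amp;lt;artifactId&amp;gt;maven-shade-plugin&amp;lt;/artifactId&amp;gt;
   &amp;lt;executions&amp;gt;
     &amp;lt;execution&amp;gt;
       &amp;lt;phase&amp;gt;package&amp;lt;/phase&amp;gt;
       &amp;lt;goals&amp;gt;
         &amp;lt;goal&amp;gt;shade&amp;lt;/goal&amp;gt;
       &amp;lt;/goals&amp;gt;
       &amp;lt;configuration&amp;gt;
         &amp;lt;transformers&amp;gt;
           &amp;lt;!-- Concatenate same-named META-INF/services files instead of keeping only one --&amp;gt;
           &amp;lt;transformer implementation=&quot;org.apache.maven.plugins.shade.resource.ServicesResourceTransformer&quot;/&amp;gt;
         &amp;lt;/transformers&amp;gt;
       &amp;lt;/configuration&amp;gt;
     &amp;lt;/execution&amp;gt;
   &amp;lt;/executions&amp;gt;
 &amp;lt;/plugin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;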
&lt;br /&gt;
Finally we had to stop the TableMapReduceUtil from bringing in its own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;initTableMapperJob&lt;/span&gt;:&lt;br /&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37&quot;&gt;https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Conclusion&lt;/h3&gt;
&lt;/div&gt;
&lt;div&gt;
As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no, major breaking change by these projects, yet &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HBASE-9117&quot;&gt;discussions about removing HTablePool&lt;/a&gt;&amp;nbsp;are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4761299747301736556</guid>
         <pubDate>Wed, 26 Nov 2014 11:41:00 +0000</pubDate>
      </item>
      <item>
         <title>Multimedia in GBIF</title>
         <link>http://gbif.blogspot.com/2014/05/multimedia-in-gbif.html</link>
         <description>We are happy to announce another long awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/search?MEDIA_TYPE=Sound&amp;amp;MEDIA_TYPE=StillImage&amp;amp;MEDIA_TYPE=MovingImage&quot;&gt;700 thousand occurrences with multimedia&lt;/a&gt; indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As has been requested by many people the media type is available as a new filter in the occurrence search and subsequently in downloads. For example you can now easily &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/search?TAXON_KEY=212&amp;amp;MEDIA_TYPE=Sound&quot;&gt;find all audio recordings of birds&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png&quot; style=&quot;clear:left;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png&quot; height=&quot;320&quot; width=&quot;297&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;UAM:Mamm:11470 - Eumetopias jubatus - skull&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
If you follow the link to the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/779863593#media&quot;&gt;details page&lt;/a&gt;&amp;nbsp;of any of those records you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. This is different from images, which are shown in a dedicated gallery that you might already have seen on species pages. On the left you can see an example of a&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/784732286&quot;&gt;skull specimen with multiple images&lt;/a&gt;.&lt;br /&gt;
&lt;span style=&quot;text-align:center;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;text-align:center;&quot;&gt;When requested for the first time, GBIF transiently caches the original images and processes them into various standard sizes and formats suitable for use in the portal.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Publishing multimedia metadata&lt;/h3&gt;
GBIF indexes multimedia metadata published in different ways within the GBIF network. The options range from a simple URL given as an additional field in Darwin Core, through multiple items expressed as ABCD XML, to a dedicated multimedia extension in Darwin Core archives. The main difference between them is how expressive the metadata can be.&lt;br /&gt;
&lt;h4&gt;
Simple Darwin Core&lt;/h4&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:right;margin-left:1em;text-align:right;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png&quot; style=&quot;clear:right;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png&quot; height=&quot;243&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;Melocactus intortus record in iNaturalist&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
Whenever we spot the term &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/index.htm#associatedMedia&quot;&gt;dwc:associatedMedia&lt;/a&gt; in XML or Darwin Core archives as part of a simple, flat occurrence record we try to extract URLs to media items. As the term officially allows for concatenated lists of URLs we try common delimiters such as the comma, semicolon or pipe symbol. An example of multiple, concatenated image URLs can be found in&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/891030819#images&quot;&gt;iNaturalist&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
As you can see on the right every extracted link is regarded as a separate media item as there is no standard way to detect that two links refer to the same item. In the example above every image has a link to the actual image file and another one to the respective HTML page where its metadata is presented. There is also no way to specify additional metadata about a link. As a consequence all images based on dwc:associatedMedia do not have a title, license or any further information. The verbatim data for that record before we extract image links can be seen here:&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/891030819/verbatim&quot;&gt;http://www.gbif-uat.org/occurrence/891030819/verbatim&lt;/a&gt;&lt;br /&gt;
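&lt;br /&gt;
To make this concrete, here is a hypothetical Simple Darwin Core XML record with two pipe-delimited links in dwc:associatedMedia. The identifier and the example.org URLs are invented for illustration. Following the behaviour described above, the two links would become two separate media items, and only the first has a file suffix from which a media type could be inferred.&lt;br /&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;dwr:SimpleDarwinRecordSet
     xmlns:dwr=&quot;http://rs.tdwg.org/dwc/xsd/simpledarwincore/&quot;
     xmlns:dwc=&quot;http://rs.tdwg.org/dwc/terms/&quot;&amp;gt;
   &amp;lt;dwr:SimpleDarwinRecord&amp;gt;
     &amp;lt;!-- Hypothetical identifier and URLs --&amp;gt;
     &amp;lt;dwc:occurrenceID&amp;gt;urn:example:occurrence:1&amp;lt;/dwc:occurrenceID&amp;gt;
     &amp;lt;dwc:scientificName&amp;gt;Melocactus intortus&amp;lt;/dwc:scientificName&amp;gt;
     &amp;lt;!-- First link: the image file; second link: the HTML page showing it --&amp;gt;
     &amp;lt;dwc:associatedMedia&amp;gt;http://example.org/photos/1.jpg | http://example.org/photos/1&amp;lt;/dwc:associatedMedia&amp;gt;
   &amp;lt;/dwr:SimpleDarwinRecord&amp;gt;
 &amp;lt;/dwr:SimpleDarwinRecordSet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;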
&lt;h4&gt;
Darwin Core archive multimedia extension&lt;/h4&gt;
With a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.gbif.org/extension/gbif/1.0/multimedia.xml&quot;&gt;dedicated extension&lt;/a&gt; for media items, many media items per core occurrence record can be published in a structured way. This is the GBIF recommended way to publish multimedia as it gives you the most control over your metadata. Note that the same extension can also be used to publish multimedia for species in &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/dataset/search?type=CHECKLIST&quot;&gt;checklist datasets&lt;/a&gt;. This extension, based entirely on existing Dublin Core terms, allows you to specify the following information about a media item, all of which will make it into the GBIF portal if provided:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:type&lt;/b&gt;, the kind of media item based on the DCMI Type Vocabulary: &amp;nbsp;StillImage, MovingImage or Sound&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:format&lt;/b&gt;, MIME type of the multimedia object's format&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:identifier&lt;/b&gt;, the public URL that identifies and locates the media file directly, not the HTML page it might be shown on&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:references&lt;/b&gt;, the URL of an HTML web page that shows the media item or its metadata. It is recommended to provide this URL even if a media file exists as it will be used for linking out&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:title&lt;/b&gt;, the media item's title&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:description&lt;/b&gt;, a&amp;nbsp;textual description of the content of the media item&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:created&lt;/b&gt;, the date and time this media item was taken&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:creator&lt;/b&gt;, the person that took the image, recorded the video or sound&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:contributor&lt;/b&gt;, any contributor in addition to the creator that helped in recording the media item&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:publisher&lt;/b&gt;, the name of an entity responsible for making the image available&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:audience&lt;/b&gt;, a&amp;nbsp;class or description for whom the image is intended or useful&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:source&lt;/b&gt;, a reference to the source the media item was derived or taken from. For example a book from which an image was scanned or the original provider of a photo/graphic, such as photography agencies&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:license&lt;/b&gt;, license for this media object. If possible declare it as CC0 to ensure greatest use&lt;/li&gt;
&lt;li&gt;&amp;nbsp;&lt;b&gt;dc:rightsHolder&lt;/b&gt;, the person or organization owning or managing rights over the media item&lt;/li&gt;
&lt;/ul&gt;
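As a sketch of how this looks inside an archive, the meta.xml below maps a hypothetical multimedia.txt file to the extension next to an occurrence core. The file names, column order and choice of fields are assumptions for illustration, and the row type and term URIs should be checked against the extension definition linked above.&lt;br /&gt;
&lt;pre style=&quot;background:#f0f0f0;border:1px dashed #CCCCCC;color:black;font-family:arial;font-size:12px;height:auto;line-height:20px;overflow:auto;padding:0px;text-align:left;width:99%;&quot;&gt;&lt;code style=&quot;color:black;word-wrap:normal;&quot;&gt; &amp;lt;archive xmlns=&quot;http://rs.tdwg.org/dwc/text/&quot;&amp;gt;
   &amp;lt;core encoding=&quot;UTF-8&quot; fieldsTerminatedBy=&quot;\t&quot; linesTerminatedBy=&quot;\n&quot; ignoreHeaderLines=&quot;1&quot; rowType=&quot;http://rs.tdwg.org/dwc/terms/Occurrence&quot;&amp;gt;
     &amp;lt;files&amp;gt;&amp;lt;location&amp;gt;occurrence.txt&amp;lt;/location&amp;gt;&amp;lt;/files&amp;gt;
     &amp;lt;id index=&quot;0&quot;/&amp;gt;
     &amp;lt;field index=&quot;0&quot; term=&quot;http://rs.tdwg.org/dwc/terms/occurrenceID&quot;/&amp;gt;
     &amp;lt;field index=&quot;1&quot; term=&quot;http://rs.tdwg.org/dwc/terms/scientificName&quot;/&amp;gt;
   &amp;lt;/core&amp;gt;
   &amp;lt;!-- One row per media item; column 0 points back to the core record --&amp;gt;
   &amp;lt;extension encoding=&quot;UTF-8&quot; fieldsTerminatedBy=&quot;\t&quot; linesTerminatedBy=&quot;\n&quot; ignoreHeaderLines=&quot;1&quot; rowType=&quot;http://rs.gbif.org/terms/1.0/Multimedia&quot;&amp;gt;
     &amp;lt;files&amp;gt;&amp;lt;location&amp;gt;multimedia.txt&amp;lt;/location&amp;gt;&amp;lt;/files&amp;gt;
     &amp;lt;coreid index=&quot;0&quot;/&amp;gt;
     &amp;lt;field index=&quot;1&quot; term=&quot;http://purl.org/dc/terms/type&quot;/&amp;gt;
     &amp;lt;field index=&quot;2&quot; term=&quot;http://purl.org/dc/terms/format&quot;/&amp;gt;
     &amp;lt;field index=&quot;3&quot; term=&quot;http://purl.org/dc/terms/identifier&quot;/&amp;gt;
     &amp;lt;field index=&quot;4&quot; term=&quot;http://purl.org/dc/terms/references&quot;/&amp;gt;
     &amp;lt;field index=&quot;5&quot; term=&quot;http://purl.org/dc/terms/title&quot;/&amp;gt;
     &amp;lt;field index=&quot;6&quot; term=&quot;http://purl.org/dc/terms/license&quot;/&amp;gt;
   &amp;lt;/extension&amp;gt;
 &amp;lt;/archive&amp;gt;
&lt;/code&gt;&lt;/pre&gt;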
&lt;h4&gt;
Access to Biological Collections Data&lt;/h4&gt;
As usual we also provide a binding from the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.tdwg.org/activities/abcd/&quot;&gt;TDWG ABCD standard&lt;/a&gt; (versions 1.2 and 2.06) mostly used with the BioCASE software.&lt;br /&gt;
&lt;br /&gt;
From &lt;i&gt;ABCD 1.2&lt;/i&gt; we extract media information based on the UnitDigitalImage subelements, in particular the file URL (ImageURI), the description (Comment) and the license (TermsOfUse).&lt;br /&gt;
&lt;br /&gt;
In &lt;i&gt;ABCD 2.06&lt;/i&gt; we use the unit MultiMediaObject subelements instead. Here there are distinct file and webpage URLs (FileURI, ProductURI), the description (Comment), the license (License/Text, TermsOfUseStatements) and also an indication of the MIME type (Format). The &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/779863593&quot;&gt;bird sound example&lt;/a&gt; from above comes in as ABCD 2.06 via the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/dataset/b7ec1bf8-819b-11e2-bad2-00145eb45e9a&quot;&gt;Animal Sound Archive dataset&lt;/a&gt;. You can see the original details of that ABCD record in its&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/779863593/fragment&quot;&gt;raw XML fragment&lt;/a&gt;. There are also &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif-uat.org/occurrence/773646053#images&quot;&gt;fossil images&lt;/a&gt; available through ABCD.&lt;br /&gt;
&lt;br /&gt;
Missing from both ABCD versions are media title, creator and created elements.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Media type interpretation&lt;/h3&gt;
We derive the media type from either an explicitly given dc:type, the MIME type found in dc:format or the media file suffix. In the case of dwc:associatedMedia found in simple Darwin Core we can only rely on the file URL to interpret the kind of media item. If that URL points to an HTML page instead of an actual static media file with a well-known suffix, the media type remains unknown.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Production deployment&lt;/h3&gt;
We hope you like this new feature and we are eager to get this out into production within the next few weeks. This is the first iteration of this work, and like all GBIF developments we welcome any feedback.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Markus Döring</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-1582482421307767408</guid>
         <pubDate>Tue, 06 May 2014 12:06:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s72-c/Screen+Shot+2014-05-06+at+11.17.21.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>IPT v2.1 – Promoting the use of stable occurrenceIDs</title>
         <link>http://gbif.blogspot.com/2014/04/ipt-v21.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
GBIF is pleased to announce the release of the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot;&gt;IPT 2.1&lt;/a&gt; with the following key changes:&lt;br /&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide&lt;/li&gt;
&lt;li&gt;Ability to support Microsoft Excel spreadsheets natively&lt;/li&gt;
&lt;li&gt;Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan&lt;/li&gt;
&lt;/ul&gt;
With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/resources/2970&quot;&gt;GBIF Work Programme for 2014-16&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.&lt;br /&gt;
&lt;br /&gt;
This new feature will support data publishers who use the Darwin Core term &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/#occurrenceID&quot;&gt;occurrenceID&lt;/a&gt; to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross-reference databases and potentially helping to track use.&lt;br /&gt;
&lt;br /&gt;
Previously, GBIF asked publishers to use the three Darwin Core terms: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/#institutionCode&quot;&gt;institutionCode&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/#collectionCode&quot;&gt;collectionCode&lt;/a&gt;, and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/#catalogNumber&quot;&gt;catalogNumber&lt;/a&gt; to uniquely identify their occurrence records.&amp;nbsp;This triplet-style identifier will continue to be accepted; however, it is notoriously&amp;nbsp;unstable&amp;nbsp;since the codes are prone to change and in many cases&amp;nbsp;are meaningless for datasets originating from outside of the museum collections community.&amp;nbsp;For this reason, GBIF is adopting the recommendations coming from the IPT user community and recommending the use of occurrenceID instead. &lt;br /&gt;
&lt;br /&gt;
Best practices for occurrenceIDs are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.&lt;br /&gt;
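&lt;br /&gt;
As a rough illustration of what such a check involves, the Python sketch below scans a list of records and reports missing and duplicate occurrenceID values. It is not the IPT's actual implementation; the find_identifier_problems function, the dict-based record layout and the sample identifiers are assumptions made for this example.&lt;br /&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
from collections import Counter


def find_identifier_problems(records, id_field="occurrenceID"):
    """Return (missing_rows, duplicate_ids) for a list of dict records.

    Hypothetical helper for illustration only, not part of the IPT.
    """
    missing_rows = []
    counts = Counter()
    for row_number, record in enumerate(records, start=1):
        # Treat absent, empty, or whitespace-only values as missing.
        value = (record.get(id_field) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing_rows.append(row_number)
    # An identifier is a duplicate when it occurs more than once.
    duplicate_ids = sorted(i for i, n in counts.items() if n > 1)
    return missing_rows, duplicate_ids


# Made-up sample records; the identifiers are placeholders.
records = [
    {"occurrenceID": "urn:catalog:EX:Birds:1"},
    {"occurrenceID": "urn:catalog:EX:Birds:1"},
    {"occurrenceID": ""},
    {"occurrenceID": "urn:catalog:EX:Birds:2"},
]
missing, duplicates = find_identifier_problems(records)
print(missing, duplicates)
```
&lt;/pre&gt;
A check of this kind only covers condition (a); stability over time and global uniqueness cannot be verified from a single dataset.&lt;br /&gt;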
&lt;br /&gt;
Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:&lt;br /&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.&lt;/li&gt;
&lt;li&gt;GBIF’s own occurrence identifiers will become inherently more stable as well.&lt;/li&gt;
&lt;li&gt;GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).&lt;/li&gt;
&lt;li&gt;Record-level citation can be made possible, enhancing attribution and the ability to track data usage.&lt;/li&gt;
&lt;li&gt;It will be possible to consider tracking annotations and changes to a record over time.&lt;/li&gt;
&lt;/ul&gt;
If you’re a new or existing publisher, GBIF hopes you’ll agree these goals are worth working towards, and that you’ll start using occurrenceIDs. &lt;br /&gt;
&lt;br /&gt;
The &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot;&gt;IPT 2.1&lt;/a&gt; also includes support for uploading Excel files as data sources.&lt;br /&gt;
&lt;br /&gt;
Another enhancement is that the interface has been translated into Japanese. GBIF offers its sincere thanks to Dr. Yukiko Yamazaki from the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.nig.ac.jp/english/index.html&quot;&gt;National Institute of Genetics (NIG)&lt;/a&gt; in Japan for this extraordinary effort.&lt;br /&gt;
&lt;br /&gt;
In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?&lt;br /&gt;
&lt;br /&gt;
If you like the IPT’s auto publishing feature, you will be happy to know the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto publish, but fail to be published for whatever reason, are now easily identifiable within the resource tables as shown:&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to truncate unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.huh.harvard.edu/&quot;&gt;Harvard University Herbaria&lt;/a&gt;) bad rows now get skipped and reported to the user without skipping subsequent rows of data.&lt;br /&gt;
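&lt;br /&gt;
The general idea of skipping bad rows while still reporting them can be sketched in Python as follows. This is not the patch itself; read_rows, parse_row and the two-column row format are hypothetical names and data chosen for illustration.&lt;br /&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
def read_rows(raw_rows, parse_row):
    """Parse every row, skipping bad ones instead of stopping at the first.

    Hypothetical helper for illustration; not the IPT's actual code.
    """
    good_rows, skipped = [], []
    for row_number, raw in enumerate(raw_rows, start=1):
        try:
            good_rows.append(parse_row(raw))
        except ValueError as error:
            # Record the problem and carry on with the next row.
            skipped.append((row_number, str(error)))
    return good_rows, skipped


def parse_row(raw):
    # Made-up row format: "catalogNumber,individualCount".
    catalog_number, count = raw.split(",")
    return catalog_number, int(count)


rows, skipped = read_rows(["A1,3", "A2,many", "A3,7"], parse_row)
print(rows)
print(skipped)
```
&lt;/pre&gt;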
&lt;br /&gt;
As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:&lt;br /&gt;
&lt;div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;Marie-Elise Lecoq and Gallien Labeyrie (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.fr/&quot;&gt;GBIF France&lt;/a&gt;) - Updating French translation&lt;/li&gt;
&lt;li&gt;Yu-Huang Wang (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://taibif.tw/&quot;&gt;TaiBIF&lt;/a&gt;, Taiwan) - Updating Traditional Chinese translation&lt;/li&gt;
&lt;li&gt;Nestor Beltran (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.sibcolombia.net/web/sib/home&quot;&gt;Colombian Biodiversity Information System (SiB)&lt;/a&gt;) - Updating Spanish translation&lt;/li&gt;
&lt;li&gt;Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.biocomp.org.br/&quot;&gt;Universidade de São Paulo, Research Center on Biodiversity and Computing&lt;/a&gt;) - Updating Portuguese translation&lt;/li&gt;
&lt;li&gt;Carlos Cubillos (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.sibcolombia.net/web/sib/home&quot;&gt;Colombian Biodiversity Information System (SiB)&lt;/a&gt;) - Contributing style improvements&lt;/li&gt;
&lt;/ul&gt;
On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.&lt;br /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4390755691463921673</guid>
         <pubDate>Wed, 23 Apr 2014 12:22:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s72-c/Screen+Shot+2014-04-16+at+9.45.11+PM.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Lots of columns with Hive and HBase</title>
         <link>http://gbif.blogspot.com/2014/03/lots-of-columns-with-hive-and-hbase.html</link>
         <description>We're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/&quot;&gt;Darwin Core&lt;/a&gt;. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!&lt;br /&gt;
&lt;br /&gt;
Or so we thought.&lt;br /&gt;
&lt;br /&gt;
Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads run as an Oozie workflow that executes a Hive query against an HDFS table (more details in this &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/&quot;&gt;Cloudera blog&lt;/a&gt;). We use an HDFS table because it scans much faster: running the same query against an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase.&lt;br /&gt;
&lt;br /&gt;
Here's an example of how to write a Hive table definition for an HBase-backed table:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;CREATE EXTERNAL TABLE tiny_hive_example (&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&amp;nbsp; key INT,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&amp;nbsp; kingdom STRING,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&amp;nbsp; kingdomkey INT&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;WITH SERDEPROPERTIES (&quot;hbase.columns.mapping&quot; = &quot;:key#b,o:kingdom#s,o:kingdomKey#b&quot;)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;TBLPROPERTIES(&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&amp;nbsp; &quot;hbase.table.name&quot; = &quot;tiny_hbase_table&quot;,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&amp;nbsp; &quot;hbase.table.default.storage.type&quot; = &quot;binary&quot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;);&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord&quot;&gt;basis of record&lt;/a&gt;&amp;nbsp;term's column name is basisOfRecord) we have a very long &quot;SERDEPROPERTIES&quot; string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the&amp;nbsp;SERDE_PARAMS table it gives the&amp;nbsp;PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.&lt;br /&gt;
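&lt;br /&gt;
To get a feel for the numbers, the Python sketch below builds a mapping string in the same format as the example above and compares its length with the 4000 character limit. The build_columns_mapping helper and the placeholder column names are made up for illustration; the real columns are named after Darwin Core terms.&lt;br /&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
# Metastore limit described in the post: PARAM_VALUE is VARCHAR(4000).
PARAM_VALUE_LIMIT = 4000


def build_columns_mapping(columns, family="o"):
    """Build an hbase.columns.mapping value such as ':key#b,o:kingdom#s'.

    `columns` is a list of (hbase_column_name, storage_type) pairs.
    Hypothetical helper, written only for this illustration.
    """
    parts = [":key#b"]
    parts.extend(f"{family}:{name}#{kind}" for name, kind in columns)
    return ",".join(parts)


# The mapping from the small example table.
tiny = build_columns_mapping([("kingdom", "s"), ("kingdomKey", "b")])
print(tiny)

# Placeholder names standing in for roughly 600 real Darwin Core terms.
wide = build_columns_mapping([(f"term{i:03d}", "s") for i in range(600)])
print(len(wide), len(wide) > PARAM_VALUE_LIMIT)
```
&lt;/pre&gt;
Even with these short placeholder names the mapping string is well past the limit.&lt;br /&gt;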
&lt;br /&gt;
The solution:&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;alter table &quot;SERDE_PARAMS&quot; alter column &quot;PARAM_VALUE&quot; type text;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;font-size:x-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (value is turned into a String: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://svn.apache.org/repos/asf/hive/trunk/metastore/src/model/package.jdo&quot;&gt;the source&lt;/a&gt;). Our new super-wide downloads table works, and will be in production soon!&lt;br /&gt;
&lt;br /&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4190207595771627043</guid>
         <pubDate>Tue, 04 Mar 2014 11:20:00 +0000</pubDate>
      </item>
      <item>
         <title>The new (real-time) GBIF Registry has gone live</title>
         <link>http://gbif.blogspot.com/2013/10/the-new-real-time-gbif-registry-has.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;For the last 4 years, GBIF has operated the GBRDS registry with its own web application on&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot; style=&quot;background-color:white;color:#1155cc;&quot;&gt;http://gbrds.gbif.org&lt;/a&gt;. &amp;nbsp;Previously, when a dataset was registered in the GBRDS registry (for example using an &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot;&gt;IPT&lt;/a&gt;), it did not become visible in the portal until after the next rollover took place, which could be several weeks later.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, Times New Roman, serif;&quot;&gt;In October, GBIF launched its new portal on&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/&quot; style=&quot;color:#1155cc;&quot;&gt;www.gbif.org&lt;/a&gt;. &amp;nbsp;During the launch we indicated that the real-time data management would be starting up in November. &amp;nbsp;We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry. &amp;nbsp;&lt;/span&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;What does this mean for you?&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;any dataset registered through GBIF (using an &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;IPT&lt;/a&gt;&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated&lt;/span&gt;&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;the GBRDS web application&amp;nbsp;(&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot; style=&quot;color:#1155cc;font-family:Times, 'Times New Roman', serif;&quot;&gt;http://gbrds.gbif.org&lt;/a&gt;&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;)&amp;nbsp;is no longer visible&lt;/span&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;,&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;since the new portal displays all the appropriate information&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;old links to the GBRDS will automatically redirect to their corresponding entry in the new portal. As an example, try&amp;nbsp;&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e&lt;/a&gt;&lt;span style=&quot;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;color:#222222;font-family:Times, Times New Roman, serif;&quot;&gt;the GBRDS sandbox registry web application (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrdsdev.gbif.org/&quot;&gt;http://gbrdsdev.gbif.org&lt;/a&gt;&lt;/span&gt;&lt;span style=&quot;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;) is no longer visible, but a new registry sandbox has been set up to serve &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot; style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;IPT&lt;/a&gt;&lt;span style=&quot;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt; installations running in test mode&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;text-align:left;word-wrap:break-word;&quot;&gt;
&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;Please note that the new&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/developer/registry&quot;&gt;registry API&lt;/a&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;font-family:Times, 'Times New Roman', serif;&quot;&gt;supports the same web service API that the GBRDS previously did&lt;/span&gt;&lt;span style=&quot;background-color:white;color:#222222;font-family:Times, 'Times New Roman', serif;&quot;&gt;, so existing tools and services built on the GBRDS API (such as the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/ipt&quot;&gt;IPT&lt;/a&gt;) will continue to work uninterrupted.&lt;/span&gt;&lt;span style=&quot;font-family:Times, 'Times New Roman', serif;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;real-time data management&lt;/span&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;.&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;color:#222222;&quot;&gt;We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwar&lt;/span&gt;&lt;span style=&quot;background-color:white;color:#222222;&quot;&gt;ds.&amp;nbsp;&lt;/span&gt;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;blockquote style=&quot;text-align:left;&quot; type=&quot;cite&quot;&gt;
&lt;div style=&quot;word-wrap:break-word;&quot;&gt;
&lt;div&gt;
&lt;span style=&quot;text-align:justify;&quot;&gt;&lt;span style=&quot;font-family:Times, Times New Roman, serif;&quot;&gt;On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are.&lt;/span&gt;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-8197534906787634642</guid>
         <pubDate>Mon, 28 Oct 2013 12:04:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF Backbone in GitHub</title>
         <link>http://gbif.blogspot.com/2013/10/gbif-backbone-in-github.html</link>
         <description>&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;For a long time I wanted to experiment with using &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone&quot;&gt;GitHub&lt;/a&gt; as a tool to browse and manage the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&quot;&gt;GBIF backbone taxonomy&lt;/a&gt;. Encouraged by similar sentiments from&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://iphylo.blogspot.co.uk/2013/04/time-to-put-taxonomy-into-github.html&quot;&gt;Rod Page&lt;/a&gt;,&amp;nbsp;it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;great GitHub Treeslider to browse the taxonomy, so why not give it a try?&lt;/span&gt;&lt;br /&gt;
&lt;h3&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;A GitHub filesystem taxonomy&lt;/span&gt;&lt;/h3&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;b&gt;README.md,&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt; a simple Markdown file that GitHub renders and that shows the basic attributes of a taxon&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;b&gt;data.json,&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&amp;nbsp;a complete JSON representation of the taxon as it is exposed via the new &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/developer/species&quot;&gt;GBIF species API&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20arctica&quot;&gt;Amanita arctica&lt;/a&gt;&amp;nbsp;is represented as:&lt;/span&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s1600/Screen+Shot+2013-10-24+at+14.32.41.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;343&quot; src=&quot;http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s400/Screen+Shot+2013-10-24+at+14.32.41.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.&lt;/span&gt;&lt;br /&gt;
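&lt;br /&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;A minimal Python sketch of this folder layout, assuming a taxon is available as a small dictionary. It is not the exporter class discussed in this post; the export_taxon function and the field names canonicalName and rank are assumptions made for illustration. The classification path follows the Amanita arctica example above.&lt;/span&gt;&lt;br /&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
import json
import tempfile
from pathlib import Path


def export_taxon(root, classification, taxon):
    """Write one taxon as a nested folder holding README.md and data.json.

    `classification` lists the parent names from the root down; the dict
    layout of `taxon` is an assumption made for this sketch.
    """
    folder = Path(root).joinpath(*classification, taxon["canonicalName"])
    folder.mkdir(parents=True, exist_ok=True)
    readme = f"# {taxon['canonicalName']}\n\nRank: {taxon['rank']}\n"
    (folder / "README.md").write_text(readme, encoding="utf-8")
    (folder / "data.json").write_text(json.dumps(taxon, indent=2), encoding="utf-8")
    return folder


classification = ["life", "Fungi", "Basidiomycota", "Agaricomycetes",
                  "Agaricales", "Amanitaceae", "Amanita"]
# Illustrative record; the real JSON from the species API has more fields.
taxon = {"canonicalName": "Amanita arctica", "rank": "SPECIES"}

with tempfile.TemporaryDirectory() as root:
    folder = export_taxon(root, classification, taxon)
    relative_path = folder.relative_to(root).as_posix()
    file_names = sorted(p.name for p in folder.iterdir())

print(relative_path)
print(file_names)
```
&lt;/pre&gt;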
&lt;h3&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;Getting data into GitHub&lt;/span&gt;&lt;/h3&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;It didn't take much to write a small&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-ecat/source/browse/checklistbank/trunk/checklistbank-nub/src/main/java/org/gbif/nub/export/NubGitExporter.java&quot;&gt;NubGitExporter.java&lt;/a&gt;&amp;nbsp;class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its currently 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;Not bad, I thought, but then I tried to add the generated files to git, and that's when I started to have doubts. After waiting half a day for git to add the files to my local index, I decided to kill the process and start by adding only the smaller kingdoms, excluding animals and plants. That left about 335,000 folders and&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;670,000&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&amp;nbsp;files to be added to git. Adding these to my local git still took several hours; committing and finally pushing them to the GitHub server took yet another 2 hours.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;pre class=&quot;brush: bash&quot;&gt;Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git
&lt;/pre&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;After those files were added to the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear GitHub, and even &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://stackoverflow.com/questions/984707/what-are-the-file-limits-in-git-number-and-size&quot;&gt;git&lt;/a&gt;&amp;nbsp;itself, are not made to be a repository of millions of files and folders.&lt;/span&gt;&lt;br /&gt;
&lt;h3&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;First GitHub impressions&lt;/span&gt;&lt;/h3&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;Browsing taxa in GitHub is surprisingly responsive. The fungal genus &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita&quot;&gt;Amanita&lt;/a&gt;&amp;nbsp;contains 746 species, but it loads very quickly. In that regard GitHub is much nicer to use than browsing on the new &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/species/2526057&quot;&gt;GBIF species pages&lt;/a&gt;, which of course show much more information. The placement of the rendered &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md&quot;&gt;readme&lt;/a&gt; file is not ideal, as it's at the very bottom of the page, but showing&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;information to&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;humans that way is nice. Markdown could also be parsed by machines quite easily if we adopt a simple format: for example, create a heading for every property, named after that property, and put the content into the following paragraph(s).&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
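&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;To illustrate the heading-per-property idea, here is a small Python sketch. The parse_properties function and the sample readme text are hypothetical and do not reflect the README.md files currently in the repository.&lt;/span&gt;&lt;/div&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
def parse_properties(markdown_text):
    """Map each heading to the text of the paragraph(s) that follow it."""
    properties = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            # A heading starts a new property; strip the leading '#' marks.
            current = line.lstrip("#").strip()
            properties[current] = []
        elif current is not None and line.strip():
            properties[current].append(line.strip())
    # Join the lines collected under each heading into one string.
    return {name: " ".join(lines) for name, lines in properties.items()}


# Made-up readme content following the proposed convention.
readme = """\
## Scientific name
Amanita arctica

## Rank
species

## Remarks
First line of a remark.

Second paragraph of the same remark.
"""

properties = parse_properties(readme)
print(properties["Rank"])
print(properties["Remarks"])
```
&lt;/pre&gt;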
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md&quot;&gt;Amanita readme&lt;/a&gt; contains the synonym information) and also with infraspecific taxa. For example &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20muscaria&quot;&gt;Amanita muscaria&lt;/a&gt;&amp;nbsp;contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;The GitHub browser sorts all files alphabetically. When mixing ranks&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;(we skip intermediate unknown ranks in the backbone),&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&amp;nbsp;for example see the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mdoering/backbone/tree/master/life/Fungi&quot;&gt;Fungus kingdom&lt;/a&gt;,&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;sorting by the rank first is desirable. We could enable this by naming the taxon folders accordingly, prefixing with an alphabetically correctly ordered rank.&lt;/span&gt;&lt;/div&gt;
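&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;One way such prefixes could work is sketched below in Python. The rank list, the two-digit prefix and the folder_name helper are assumptions for illustration, not the naming scheme used in the repository, and the grouping of the sample taxa under one parent is contrived.&lt;/span&gt;&lt;/div&gt;
&lt;pre class=&quot;brush: python&quot;&gt;
```python
# Assumed rank order; the two-digit prefix makes names sort by rank first.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def folder_name(rank, canonical_name):
    """Prefix the folder name with a zero-padded index of its rank."""
    return f"{RANK_ORDER.index(rank):02d}-{rank}-{canonical_name}"


# Mixed-rank children of one parent, as happens when intermediate ranks are skipped.
children = [("genus", "Amanita"), ("phylum", "Ascomycota"),
            ("class", "Agaricomycetes"), ("phylum", "Basidiomycota")]
names = sorted(folder_name(rank, name) for rank, name in children)
print(names)
```
&lt;/pre&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;The trade-off is that folder names, and therefore URLs, no longer match the plain canonical name.&lt;/span&gt;&lt;/div&gt;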
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Verdana, sans-serif;&quot;&gt;I have not had the time to try to version branches of the tree and see how usable that is. I suspect the git performance to be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push &amp;amp; pull.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Markus Döring</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-6823547091157123078</guid>
         <pubDate>Thu, 24 Oct 2013 14:39:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s72-c/Screen+Shot+2013-10-24+at+14.32.41.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Validating scientific names with the forthcoming GBIF Portal web service API</title>
         <link>http://gbif.blogspot.com/2013/07/validating-scientific-names-with.html</link>
         <description>&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;text-align:left;&quot;&gt;
&lt;i&gt;This guest post was written by Gaurav Vaidya, &lt;/i&gt;&lt;i&gt;Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.&lt;/i&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:right;margin-left:1em;text-align:right;width:200px;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s1600/1024px-Mother_and_baby_sperm_whale.jpg&quot; style=&quot;clear:right;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img alt=&quot;&quot; border=&quot;0&quot; height=&quot;179&quot; src=&quot;http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s320/1024px-Mother_and_baby_sperm_whale.jpg&quot; title=&quot;&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;A whale named &lt;i&gt;&lt;strike&gt;Physeter macrocephalus&lt;/strike&gt; &lt;strike&gt;Physeter catodon&lt;/strike&gt; Physeter macrocephalus&lt;/i&gt; (photograph by Gabriel Barathieu, reused under CC-BY-SA from &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://commons.wikimedia.org/wiki/File:Mother_and_baby_sperm_whale.jpg&quot;&gt;the Wikimedia Commons&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Porphyrio martini&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;cus&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, not &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Porphyrio martini&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;ca&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;). 
Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.repository.naturalis.nl/record/318605&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;the dispute&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; over whether the sperm whale is really &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Physeter catodon&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; Linnaeus, 1758, or &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Physeter macrocephalus&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; Linnaeus, 1758.&lt;/span&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.&lt;/span&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b2-1506-462c-670a7a7a817b&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://uat.gbif.org/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;GBIF Portal&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, currently in development and testing. 
This collection includes large, global checklists, such as the &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://uat.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Catalogue of Life&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; and the &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://uat.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;International Plant Names Index&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, alongside smaller, more focussed checklists, such as &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://uat.gbif.org/dataset/d7f2602e-9f79-45e8-8399-08d0c5e43f5d&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;a checklist of 383 species of seed plants&lt;/span&gt;&lt;/a&gt;&lt;span 
style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; found in the &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Singalila_National_Park&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Singalila National Park in India&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; and &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://uat.gbif.org/dataset/db93cee5-60d1-4e16-a69e-83dd7080a55e&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;the 87 species of moss bug&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; recorded in the &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://coleorrhyncha.speciesfile.org/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Coleorrhyncha Species File&lt;/span&gt;&lt;/a&gt;&lt;span 
style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;. Many of these checklists can be downloaded as &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Darwin Core Archive&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; files, an important format for working with and exchanging biodiversity data.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;So how can we match names against these databases? &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.openrefine.org/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;OpenRefine&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. 
&lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://about.me/jotegui&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Javier Otegui&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; has written a tutorial on &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://bit.ly/BITW13_OpenRefine&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;cleaning biodiversity data in OpenRefine&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, and last year &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://iphylo.blogspot.com/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Rod Page&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; provided tools and a &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; 
target=&quot;_blank&quot; href=&quot;http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;step-by-step guide to reconciling scientific names&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:right;height:278px;margin-left:1em;text-align:right;width:267px;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s1600/Felis+Tigris+in+Syst+Nat+10th+ed.png&quot; style=&quot;clear:left;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s200/Felis+Tigris+in+Syst+Nat+10th+ed.png&quot; width=&quot;190&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;Linnaeus' original description of &lt;i&gt;Felis Tigris&lt;/i&gt;. From an 1894 republication of Linnaeus' &lt;i&gt;Systema Naturae, 10th edition&lt;/i&gt;, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://biodiversitylibrary.org/page/25033833&quot;&gt;digitized by the Biodiversity Heritage Library&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;line-height:1.15;text-decoration:none;vertical-align:baseline;&quot;&gt;We extended Rod's work by building a reconciliation service against &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://dev.gbif.org/wiki/display/POR/Webservice+API&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;the forthcoming GBIF web services API&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;.
  We wanted to see if we could use one of the GBIF Portal's biggest  
strengths -- the large number of checklists it has indexed -- to  
identify names recognized in similar ways by different checklists.  
Searching through multiple checklists containing possible synonyms and  
accepted names increases the odds of finding an obscure or recently  
created name; and if the same name is recognized by a number of  
checklists, this may signify a well-known synonymy -- for example, two  
of the Portal checklists recognize that the species Linnaeus named &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Felis tigris&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; is the same one that is known as &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Panthera tigris &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;today.&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b4-9eeb-5e5f-e3e700cbe6c9&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;To do this, we wrote a new &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://refine.taxonomics.org/gbifchecklists/code&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;OpenRefine reconciliation service&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:&lt;/span&gt;&lt;/div&gt;
&lt;ul style=&quot;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;scientific name (for example, &quot;&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Felis tigris&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&quot;),&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;authority (&quot;Linnaeus, 1758&quot;),&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;accepted name (&quot;&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Panthera tigris&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&quot;), and&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;kingdom (&quot;Animalia&quot;).&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
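&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;As a rough illustration of this clustering idea, the Python sketch below groups toy records by the four criteria and counts each group. The records, the field names and the cluster_matches helper are hypothetical; this is not the code of the reconciliation service, and the records do not reflect the actual GBIF API response format.&lt;/span&gt;&lt;/div&gt;
&lt;pre&gt;
```python
from collections import Counter

# Hypothetical records, one per checklist match. The field names are
# illustrative only and are not the GBIF API's actual response format.
matches = [
    {"name": "Felis tigris", "authority": "Linnaeus, 1758",
     "accepted": "Panthera tigris", "kingdom": "Animalia"},
    {"name": "Felis tigris", "authority": "Linnaeus, 1758",
     "accepted": "Panthera tigris", "kingdom": "Animalia"},
    {"name": "Felis tigris", "authority": "",
     "accepted": "", "kingdom": "Animalia"},
    {"name": "Felis tigris", "authority": "",
     "accepted": "", "kingdom": "Metazoa"},
    {"name": "Felis tigris", "authority": "Linnaeus, 1758",
     "accepted": "", "kingdom": "Animalia"},
]


def cluster_matches(records):
    """Group records that agree on all four criteria and count each group."""
    counts = Counter(
        (r["name"], r["authority"], r["accepted"], r["kingdom"])
        for r in records
    )
    # most_common() returns (key, count) pairs, largest count first
    return counts.most_common()


clusters = cluster_matches(matches)
for key, n in clusters:
    print(n, key)
```
&lt;/pre&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;In this toy data, the interpretation shared by two records comes first, and each of the other three records forms a cluster of its own.&lt;/span&gt;&lt;/div&gt;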
&lt;br /&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Once you do a reconciliation through our new service, your results will look like this:&lt;/span&gt;&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s1600/Felis+tigris+reconciliation.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;97&quot; src=&quot;http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s320/Felis+tigris+reconciliation.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b5-545e-228f-32c0c2f8d033&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name &quot;Felis tigris&quot;. Of these,&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;ol style=&quot;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:decimal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Two checklists consider &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Felis tigris &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Linnaeus, 1758 to be a junior synonym of &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Panthera tigris&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:decimal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;The remaining checklists all consider &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Felis tigris&lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does &quot;Ficus&quot; refer to the genus &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Ficus_%28gastropod%29&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Ficus &lt;/span&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Röding, 1798&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; or the genus &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Ficus&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Ficus&lt;/span&gt;&lt;span 
style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt; L.&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.&lt;/span&gt;&lt;/div&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most well recognized names, and ending with EOL's fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://vertnet.org/about/BITW.php&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Biodiversity Informatics Training Workshop&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, we wrote two tutorials that walk you through our workflow:&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;ul style=&quot;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://bit.ly/bitw2013-taxon-validation-tutorial&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Name validation in OpenRefine&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, using both the new GBIF API reconciliation service as well as Rod Page's reconciliation service for EOL, and&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li dir=&quot;ltr&quot; style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;list-style-type:disc;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://bit.ly/bitw2013-higher-taxonomy-tutorial&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;Higher taxonomy in OpenRefine&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;, using the web service APIs provided by GBIF and EOL, as well as OpenRefine's ability to parse JSON.&lt;/span&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:&lt;/span&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://refine.taxonomics.org/gbifchecklists/reconcile&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Consolas;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;http://refine.taxonomics.org/gbifchecklists/reconcile&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Consolas;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;Give it a try, and let us know if it helps you reconcile names faster!&lt;/span&gt;&lt;/div&gt;
&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:normal;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;text-align:justify;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.mappinglife.org/&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;The Map of Life project&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt; is continuing to work on improving OpenRefine for taxonomic use in a project we call &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gaurav/taxrefine&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;TaxRefine&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;. If you have suggestions for features you'd like to see, please let us know! 
You can leave a comment on this blog post, or &lt;/span&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/gaurav/taxrefine/issues&quot; style=&quot;text-decoration:none;&quot;&gt;&lt;span style=&quot;background-color:transparent;color:#1155cc;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:underline;vertical-align:baseline;&quot;&gt;add an issue to our issue tracker on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style=&quot;background-color:transparent;color:black;font-family:Arial;font-size:15px;font-style:italic;font-variant:normal;font-weight:normal;text-decoration:none;vertical-align:baseline;&quot;&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; id=&quot;docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa&quot; style=&quot;line-height:1.15;margin-bottom:0pt;margin-top:0pt;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Gaurav Vaidya</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-5384116699362931392</guid>
         <pubDate>Mon, 22 Jul 2013 21:16:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s72-c/1024px-Mother_and_baby_sperm_whale.jpg" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>IPT v2.0.5 Released - The best version yet!</title>
         <link>http://gbif.blogspot.com/2013/05/ipt-v205-released-melhor-versao-ate-o.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br class=&quot;Apple-interchange-newline&quot;/&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/downloads/list&quot;&gt;&lt;span class=&quot;s1&quot;&gt;here&lt;/span&gt;&lt;/a&gt;.&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
As with every release, it's your chance to take advantage of the most requested feature enhancements and&amp;nbsp;bug fixes.&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
The most notable feature enhancements include:&lt;/div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li&gt;&lt;span style=&quot;text-align:justify;&quot;&gt;A resource can now be configured to publish automatically on an interval&amp;nbsp;&lt;/span&gt;&lt;i style=&quot;text-align:justify;&quot;&gt;(See &quot;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Published_Release&quot;&gt;Automated Publishing&lt;/a&gt;&quot; section in&amp;nbsp;User Manual)&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;i style=&quot;text-align:justify;&quot;&gt;&lt;span style=&quot;font-style:normal;&quot;&gt;The interface has been translated into Portuguese,&amp;nbsp;&lt;/span&gt;&lt;/i&gt;making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;An IPT can be configured to back up each DwC-Archive version published &lt;i&gt;(See &quot;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Configure_IPT_settings&quot;&gt;Archival Mode&lt;/a&gt;&quot; in&amp;nbsp;User Manual)&lt;/i&gt;&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;Each resource version now has a resolvable URL &lt;i&gt;(See &quot;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Versioned_page&quot;&gt;Versioned Page&lt;/a&gt;&quot; section in&amp;nbsp;User Manual)&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:right;margin-left:1em;text-align:right;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s1600/Screen+Shot+2013-05-22+at+11.11.47+AM.png&quot; style=&quot;clear:right;margin-bottom:1em;margin-left:auto;margin-right:auto;text-align:justify;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;220&quot; src=&quot;http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s400/Screen+Shot+2013-05-22+at+11.11.47+AM.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;Filterable, pageable, and sortable resource overview table in v2.0.5&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;The order of columns in published DwC-Archives is always the same between versions&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;Style (CSS) customizations are easier than ever - check out this new guide entitled &quot;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Customization&quot;&gt;How to Style Your IPT&lt;/a&gt;&quot; for more information&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-style:normal;&quot;&gt;Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable&amp;nbsp;&lt;i&gt;(See &quot;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Public_Resources_Table&quot;&gt;Public Resource Table&lt;/a&gt;&quot; section in User Manual)&lt;/i&gt;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
The most important bug fixes are:&lt;/div&gt;
&lt;div&gt;
&lt;ul style=&quot;text-align:left;&quot;&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;Garbled encoding on registration updates has been fixed&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;The problem uploading DwC-Archives in .gzip format has been fixed&lt;/li&gt;
&lt;li style=&quot;text-align:justify;&quot;&gt;The problem uploading a resource logo has been fixed&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s1600/Screen+Shot+2013-05-22+at+11.12.32+AM.png&quot; style=&quot;clear:left;margin-bottom:1em;margin-left:auto;margin-right:auto;text-align:justify;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;244&quot; src=&quot;http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s320/Screen+Shot+2013-05-22+at+11.12.32+AM.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;The new look in v2.0.5&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-providertoolkit/issues/list?can=1&amp;amp;q=milestone%3DRelease2.0.5&quot;&gt;issue tracking system&lt;/a&gt;&lt;/span&gt;.&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
It is great to see so much feedback from the community in the form of issues, especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project and anyone can contribute patches, translations, or have their say simply by adding or voting on issues.&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.biocomp.org.br/&quot;&gt;Universidade de São Paulo, Research Center on Biodiversity and Computing&lt;/a&gt;: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva.&amp;nbsp;With &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/communications/news-and-events/showsingle/article/brazil-joins-global-initiative-for-biodiversity-data-access&quot;&gt;Brazil recently joining the GBIF network&lt;/a&gt;, we hope the Portuguese interface for the IPT will help in publication of the wealth of biodiversity data available from Brazilian institutions. &amp;nbsp;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
We’d also like to give special thanks to the other volunteers below:&lt;/div&gt;
&lt;/div&gt;
&lt;ul class=&quot;ul1&quot;&gt;
&lt;li class=&quot;li1&quot; style=&quot;text-align:justify;&quot;&gt;Marie-Elise Lecoq (GBIF France&lt;span class=&quot;s2&quot;&gt;)&lt;/span&gt;&amp;nbsp;- Updating French translation&lt;/li&gt;
&lt;li class=&quot;li1&quot; style=&quot;text-align:justify;&quot;&gt;Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation&lt;/li&gt;
&lt;li class=&quot;li3&quot; style=&quot;text-align:justify;&quot;&gt;Dairo Escobar, and Daniel Amariles (Colombian Biodiversity Information System (SiB))&amp;nbsp;- Updating&amp;nbsp;&lt;span class=&quot;s3&quot;&gt;Spanish translation&lt;/span&gt;&lt;/li&gt;
&lt;li class=&quot;li3&quot; style=&quot;text-align:justify;&quot;&gt;Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements&lt;/li&gt;
&lt;li class=&quot;li3&quot; style=&quot;text-align:justify;&quot;&gt;Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;div style=&quot;text-align:justify;&quot;&gt;
On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT.&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4698111276555265080</guid>
         <pubDate>Wed, 22 May 2013 15:37:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s72-c/Screen+Shot+2013-05-22+at+11.11.47+AM.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Migrating our hadoop cluster from CDH3 to CDH4</title>
         <link>http://gbif.blogspot.com/2013/05/migrating-our-hadoop-cluster-from-cdh3.html</link>
         <description>We've written a number of times on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2011/01/setting-up-hadoop-cluster-part-1-manual.html&quot;&gt;initial setup&lt;/a&gt;, eventual &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/06/faster-hbase-hardware-matters.html&quot;&gt;upgrade&lt;/a&gt; and continued &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html&quot;&gt;tuning&lt;/a&gt; of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://blog.cloudera.com/blog/2012/02/introducing-cdh4/&quot;&gt;CDH4.2.1&lt;/a&gt;. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;general performance improvements in the entire Hadoop/HBase stack&lt;/li&gt;
&lt;li&gt;continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with an older cluster is invariably asked to upgrade before people are interested in tackling the problem)&lt;/li&gt;
&lt;li&gt;multi-threaded compactions (the need for which we concluded &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html&quot;&gt;in this post&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;table-based region balancing (rather than just cluster-wide)&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Bus_factor&quot;&gt;bus factor&lt;/a&gt;, and get the server management back into the hands of our ops team, we've moved to &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads&quot;&gt;CDH Manager&lt;/a&gt; to control our CDH installation. That's been going pretty well so far, but we're getting ahead of ourselves...&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
The Process&lt;/h3&gt;
&lt;div&gt;
Our 6 slave nodes have a lot of disk capacity: we spec'd them with a goal of lots of spindles, which meant we got lots of space &quot;for free&quot;. Rather than upgrading in place, we decided to start fresh with new master &amp;amp; zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:&lt;/div&gt;
&lt;h4&gt;
Reduce the replication factor&lt;/h4&gt;
&lt;div&gt;
We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;hadoop fs -setrep -R 2 /&lt;/span&gt;&lt;/div&gt;
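&lt;div&gt;
The space calculation behind this step can be sketched as a small shell function. This is only an illustration of the arithmetic, not the script we used: the function name and every number in it are made up, and it ignores non-HDFS disk use and any headroom you would want to keep free.&lt;/div&gt;
&lt;pre&gt;
```shell
#!/bin/sh
# Hypothetical helper: does the data still fit once nodes are removed?
# All arguments are illustrative integers (TB); none come from the post.
#   $1 logical data size   $2 replication factor
#   $3 nodes that remain   $4 usable disk per node
fits_on_remaining_nodes() {
  logical_tb=$1
  replication=$2
  remaining_nodes=$3
  per_node_tb=$4
  # Raw disk needed = logical size times the number of replicas.
  needed_tb=$((logical_tb * replication))
  available_tb=$((remaining_nodes * per_node_tb))
  if [ "$needed_tb" -le "$available_tb" ]; then
    echo "fits: need ${needed_tb} TB of ${available_tb} TB"
  else
    echo "too big: need ${needed_tb} TB of ${available_tb} TB"
  fi
}

# 20 TB of data, 3 remaining nodes with 16 TB each (made-up figures):
fits_on_remaining_nodes 20 3 3 16   # replication 3 needs 60 TB
fits_on_remaining_nodes 20 2 3 16   # replication 2 needs 40 TB
```
&lt;/pre&gt;
&lt;div&gt;
With these made-up figures the data only fits on half the nodes after the replication factor drops from 3 to 2, which is the reason for reducing it before decommissioning.&lt;/div&gt;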
&lt;h4&gt;
Decommission the 3 nodes to move&lt;/h4&gt;
&lt;div&gt;
&quot;Decommissioning&quot; is the civilized and safe way to remove nodes from a cluster when there's a risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all their blocks have finished replicating out). To do it, add the names of the target machines, one per line, to an &quot;excludes&quot; file that your hdfs config references, and then refresh hdfs.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The block in hdfs-site.xml:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;lt;name&amp;gt;dfs.hosts.exclude&amp;lt;/name&amp;gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;lt;value&amp;gt;/etc/hadoop/conf/excluded_hosts&amp;lt;/value&amp;gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;then run:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;bin/hadoop dfsadmin -refreshNodes&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;and wait for the &quot;under replicated blocks&quot; count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state &quot;Decommissioned&quot;.&lt;/span&gt;&lt;/div&gt;
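&lt;div&gt;
If you would rather not watch the admin page, the wait can be scripted. The sketch below is an assumption-laden illustration, not something we ran: it assumes the text printed by "hadoop dfsadmin -report" contains a line of the form "Under replicated blocks: N", and the function names are made up. It only looks at the block count and does not check that each node has reached state "Decommissioned".&lt;/div&gt;
&lt;pre&gt;
```shell
#!/bin/sh
# Reads the text of "hadoop dfsadmin -report" on stdin and prints the
# cluster-wide under-replicated block count.
# Assumption: the report has a line like "Under replicated blocks: 12".
under_replicated_blocks() {
  awk -F': *' '/^Under replicated blocks/ { print $2; exit }'
}

# Succeeds only when the count read from stdin is exactly zero.
# An unparseable report yields an empty count, which counts as "not safe".
safe_to_remove() {
  count=$(under_replicated_blocks)
  [ "$count" = "0" ]
}

# Canned sample text so the sketch runs without a cluster.
sample_report='Configured Capacity: 1000 (1000 B)
Under replicated blocks: 12
Missing blocks: 0'

printf '%s\n' "$sample_report" | under_replicated_blocks
```
&lt;/pre&gt;
&lt;div&gt;
On a real cluster you would pipe the live report into safe_to_remove inside a loop with a sleep between checks, and only move on once it succeeds.&lt;/div&gt;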
&lt;h4&gt;
Don't forget HBase&lt;/h4&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;The hdfs datanodes are tidied up now, but don't forget to cleanly shut down the HBase regionservers - run:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;./bin/graceful_stop.sh HOSTNAME&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shut down once that's done (more details &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hbase.apache.org/book/node.management.html&quot;&gt;here&lt;/a&gt;).&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;Now you can shut down the tasktracker and datanode, and then the machine is ready to be wiped.&lt;/span&gt;&lt;/div&gt;
&lt;h4&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;Build the new cluster&lt;/span&gt;&lt;/h4&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our Linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.&lt;/span&gt;&lt;/div&gt;
&lt;h4&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;Enter CDH Manager&lt;/span&gt;&lt;/h4&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;The folks at &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/cloudera/en/home.html&quot;&gt;Cloudera&lt;/a&gt; have made a free version of their &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads&quot;&gt;CDH Manager app&lt;/a&gt; available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.cloudera.com/content/support/en/documentation/manager-free/cloudera-manager-free-v4-latest.html&quot;&gt;manager instructions&lt;/a&gt; are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;style type=&quot;text/css&quot;&gt;.nobrtable br {display:none;}tr {text-align:center;}tr.alt td {background-color:#eeeeee;color:black;}tr {text-align:center;}caption {caption-side:bottom;}&lt;/style&gt;

&lt;br /&gt;
&lt;center&gt;
&lt;div class=&quot;nobrtable&quot;&gt;
&lt;table border=&quot;2&quot; cellpadding=&quot;10&quot; cellspacing=&quot;0&quot; style=&quot;background-color:#dddddd;border-collapse:collapse;&quot;&gt;
&lt;caption&gt;Machine and Role assignments&lt;/caption&gt;
&lt;tbody&gt;
&lt;tr style=&quot;background-color:#dddddd;color:black;padding-bottom:4px;padding-top:5px;&quot;&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Roles&lt;/th&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;master1&lt;/td&gt;
&lt;td&gt;HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;master2&lt;/td&gt;
&lt;td&gt;HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;master3&lt;/td&gt;
&lt;td&gt;Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;slave1&lt;/td&gt;
&lt;td&gt;HDFS DataNode, Hadoop TaskTracker, HBase Regionserver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;slave2&lt;/td&gt;
&lt;td&gt;HDFS DataNode, Hadoop TaskTracker, HBase Regionserver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;
&lt;td&gt;slave3&lt;/td&gt;
&lt;td&gt;HDFS DataNode, Hadoop TaskTracker, HBase Regionserver&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/center&gt;
&lt;div&gt;
&lt;h4&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;Copy the data&lt;/span&gt;&lt;/h4&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The built-in facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS built-in command &lt;/span&gt;&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;distcp&lt;/span&gt;&lt;span style=&quot;font-family:inherit;&quot;&gt;. Because we could handle shutting down HBase on the old cluster for the duration of the copy, this should in theory work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc.). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up but it couldn't find .META. and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;On the old cluster we ran:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;distcp&lt;/span&gt; command:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
which takes advantage of the built-in, read-only HTTP interface (hftp) that HDFS provides, which makes the copy process version-agnostic.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Finally, on the new cluster, we imported the copied sequence files into the new HBase:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;bin/hadoop jar&amp;nbsp;hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
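&lt;div&gt;
With more than a handful of tables it can help to generate the three commands per table rather than typing them. The following is only a sketch in Python: the table names are hypothetical, the jar names, hosts, ports and paths are the ones quoted above, and the script prints the commands without running anything.&lt;/div&gt;
&lt;pre&gt;
```python
# Sketch: build the export / distcp / import commands for each table.
# Table names are hypothetical; jar names, hosts, ports and paths are the
# ones quoted in the post. Nothing is executed here.

OLD_JAR = "hbase-0.90.4-cdh3u3.jar"
NEW_JAR = "hbase-0.94.2-cdh4.2.1-security.jar"
OLD_NAMENODE = "hftp://old-cluster-namenode:50070"
NEW_NAMENODE = "hdfs://master1:8020"


def migration_commands(table):
    """Return the three commands for one table, in the order they are run."""
    export_dir = "/exports/" + table
    import_dir = "/imports/" + table
    return [
        # 1. old cluster: dump the table to sequence files in HDFS
        "bin/hadoop jar {} export {} {}".format(OLD_JAR, table, export_dir),
        # 2. copy the sequence files to the new cluster over hftp
        "bin/hadoop distcp {}{} {}{}".format(
            OLD_NAMENODE, export_dir, NEW_NAMENODE, import_dir),
        # 3. new cluster: load the files into the (already created) table
        "bin/hadoop jar {} import {} {}".format(NEW_JAR, table, import_dir),
    ]


if __name__ == "__main__":
    for table in ["occurrence", "dataset"]:  # hypothetical table names
        for command in migration_commands(table):
            print(command)
```
&lt;/pre&gt;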
&lt;div&gt;
Make sure the table exists before you import. Because the import is a MapReduce job that does Puts, it is also wise to pre-split any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also, one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HBASE-7868&quot;&gt;HBASE-7868&lt;/a&gt;), which you can work around by adding the following to your table definition:&lt;/div&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;DATA_BLOCK_ENCODING =&amp;gt; 'FAST_DIFF'&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
e.g. &lt;span style=&quot;font-family:Courier New, Courier, monospace;&quot;&gt;create 'test_table', {NAME=&amp;gt;'cf', COMPRESSION=&amp;gt;'SNAPPY', VERSIONS=&amp;gt;1, DATA_BLOCK_ENCODING =&amp;gt; 'FAST_DIFF'}&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;font-family:inherit;&quot;&gt;As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.&lt;/div&gt;
&lt;h4&gt;
All systems go&lt;/h4&gt;
&lt;div&gt;
After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shut down the remaining machines in the CDH3u3 cluster, wiped them and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data locality, and we were back in business!&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-7064030536051141534</guid>
         <pubDate>Tue, 14 May 2013 15:34:00 +0000</pubDate>
      </item>
      <item>
         <title>Data cleaning: Using MySQL to identify XML breaking characters</title>
         <link>http://gbif.blogspot.com/2013/02/data-cleaning-using-mysql-to-identify.html</link>
         <description>Sometimes publishers have problems with data resources that contain control characters, which will break the XML response if they are included. Identifying these characters and removing them can be a daunting task, especially if the dataset contains thousands of records.&lt;br /&gt;
&lt;br /&gt;
Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, so it may enter the database table without having been through a verification or cleaning process. The locality string can be copied and pasted from a file into the locality column, or the data can be mass-loaded from an infile, or it can be bulk inserted – each of these methods carries a risk that unintended characters enter the table.&lt;br /&gt;
&lt;br /&gt;
Even if you have time and are meticulous, you could miss certain control characters because they are invisible to the naked eye. So what are publishers – some with limited resources – going to do to ferret out these XML-breaking characters? Assuming that you have access to the MySQL database itself, you can identify these pesky control characters in a few basic steps: create a small table, insert some hexadecimal values into it (sounds much harder than it is), and finally run a query that picks out these ‘illegal’ characters from the table that you specify.&lt;br /&gt;
&lt;br /&gt;
We start out with creating a table to hold the values for the problematic characters so that we can use them in a query:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;CREATE TABLE control_char (&lt;br /&gt;
id int(4) NOT NULL AUTO_INCREMENT,&lt;br /&gt;
hex_val CHAR(2),&lt;br /&gt;
PRIMARY KEY(id) &lt;br /&gt;
) DEFAULT CHARACTER SET = utf8;&lt;/blockquote&gt;&lt;br /&gt;
The DEFAULT CHARACTER SET declaration sets the table's default character set to UTF-8, which the query used later requires.&lt;br /&gt;
We then populate the table with these hex values that represent control characters:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;INSERT INTO control_char (hex_val)&lt;br /&gt;
VALUES&lt;br /&gt;
('00'),('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('0a'),('0b'),('0c'),('0d'),('0e'),('0f'),&lt;br /&gt;
('10'),('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('1a'),('1b'),('1c'),('1d'),('1e'),('1f')&lt;br /&gt;
;&lt;/blockquote&gt;&lt;br /&gt;
You can read more about these values here: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/C0_and_C1_control_codes&quot;&gt;http://en.wikipedia.org/wiki/C0_and_C1_control_codes&lt;/a&gt; &lt;br /&gt;
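&lt;br /&gt;
One caveat that may be worth checking against your own data: the XML 1.0 specification allows three of these 32 values in character data, namely tab (09), line feed (0a) and carriage return (0d). Rows that contain only those characters would be flagged by the table above even though they would not, by themselves, make an XML 1.0 response ill-formed. The sketch below is in Python rather than MySQL and is an illustration, not part of the original workflow; it lists the 29 values that remain if you choose to leave those three out.&lt;br /&gt;
&lt;pre&gt;
```python
# Sketch: list the C0 control codes that XML 1.0 does not allow in
# character data. XML 1.0 permits only tab, line feed and carriage return
# from this range.

XML10_ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def xml_breaking_hex_values():
    """Return the two-digit hex strings for the disallowed C0 codes."""
    return [
        "{:02x}".format(code)
        for code in range(0x20)
        if code not in XML10_ALLOWED_C0
    ]


if __name__ == "__main__":
    values = xml_breaking_hex_values()
    # Print them in the shape of the VALUES list of the INSERT statement.
    print(",".join("('{}')".format(v) for v in values))
```
&lt;/pre&gt;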
&lt;br /&gt;
At this point you may ask why the control_char table is not a temporary table, as you might not want it to be a permanent feature in the database. The reason, sadly, is that MySQL has a long-standing bug that prevents a temporary table from being referenced more than once in the same query (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html&quot;&gt;http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html&lt;/a&gt;), and we have to reference it more than once, as you will see later.&lt;br /&gt;
&lt;br /&gt;
Now on to the main query – this statement tests the table and column that you specify against the control_char table:&lt;br /&gt;
&lt;blockquote&gt;SELECT t1.* FROM scinames_harvested t1, control_char&lt;br /&gt;
WHERE LOCATE(control_char.hex_val , HEX(t1.scientific_name)) MOD 2 != 0;&lt;/blockquote&gt;&lt;br /&gt;
The query references two tables; one is a table of roughly 5000 records containing a record primary key, scientific_name and some other columns. Some of the scientific name strings are polluted with characters that we want to get rid of. The second table contains the control characters.&lt;br /&gt;
We make sure that LOCATE only counts matches on whole value pairs by using the modulo operator MOD. Remember that we search the scientific_name string after it has been converted to hexadecimal (HEX), where every byte becomes a pair of hex digits. LOCATE returns a 1-based position, so a match that starts on a byte boundary sits at an odd position, and MOD 2 != 0 keeps only those matches. We don’t want to match across value pairs!&lt;br /&gt;
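&lt;br /&gt;
The pair-by-pair idea can also be sketched outside the database. The Python below is an illustration under its own assumptions, not the query above: it hex-encodes the UTF-8 bytes of a string and inspects every aligned pair. One difference to be aware of is that LOCATE reports only the first match in a string, whereas this sketch checks every pair, so it also reports a control character that comes after an earlier cross-pair match.&lt;br /&gt;
&lt;pre&gt;
```python
# Sketch: find control bytes by walking the hex form of a string two
# digits at a time, so a pair is never read across a byte boundary.

CONTROL_BYTES = set(range(0x20))  # the same 32 values as in control_char


def control_char_positions(text):
    """Return (byte_offset, hex_pair) for every control byte in text."""
    # Two hex digits per byte, like MySQL's HEX() but in lower case.
    hex_string = text.encode("utf-8").hex()
    hits = []
    for i in range(0, len(hex_string), 2):
        pair = hex_string[i:i + 2]
        if int(pair, 16) in CONTROL_BYTES:
            hits.append((i // 2, pair))
    return hits


if __name__ == "__main__":
    # Same test value as the UPDATE template in this post.
    print(control_char_positions("Adelotus brevis\x0b"))
    # "P1" is 50 31 in hex; the digits "03" appear only across the pair
    # boundary, so nothing is reported.
    print(control_char_positions("P1"))
```
&lt;/pre&gt;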
&lt;br /&gt;
Running the query, in this instance, gives me five records with characters that are not kosher:&lt;br /&gt;
&lt;br /&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s1600/control_char.png&quot; style=&quot;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;151&quot; width=&quot;281&quot; src=&quot;http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s400/control_char.png&quot;/&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
This is pretty neat if the alternative is eyeballing each and every record.&lt;br /&gt;
Note that I cannot guarantee that this will properly process every character from the UTF-8 Latin-1 supplement &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement&quot;&gt;http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement&lt;/a&gt; &lt;br /&gt;
&lt;br /&gt;
If you want to create a test table and try out the queries above, this UPDATE query template will change the string into something containing control characters:&lt;br /&gt;
&lt;blockquote&gt;UPDATE your_table SET your_column = CONCAT('Adelotus brevis', X'0B') WHERE id = 12345;&lt;/blockquote&gt;The second argument to CONCAT looks funny, but the X in front of '0B' tells MySQL that a hex value is coming. In this case it is a line-tabulation character: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.fileformat.info/info/unicode/char/000b/index.htm&quot;&gt;www.fileformat.info/info/unicode/char/000b/index.htm&lt;/a&gt;. You can change this value to test other characters. Naturally, the CONCAT function can take any number of strings for concatenation. &lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Jan K. Legind</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4950111590788724640</guid>
         <pubDate>Fri, 08 Feb 2013 12:45:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s72-c/control_char.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>&quot;I noticed that the GBIF data portal has fewer records than it used to – what happened?&quot;</title>
         <link>http://gbif.blogspot.com/2012/12/i-noticed-that-gbif-data-portal-has.html</link>
         <description>&lt;br /&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
If you are a regular user of the GBIF data portal at &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/&quot;&gt;http://data.gbif.org&lt;/a&gt;, or keep an eye on the numbers given at &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/&quot;&gt;http://www.gbif.org&lt;/a&gt;, you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be?&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span style=&quot;&quot;&gt;The main reason for this is that software and processing upgrades have made it easier to spot duplicates and old, no longer published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;&quot;&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot; style=&quot;margin-left:36.0pt;&quot;&gt;
&lt;span style=&quot;&quot;&gt;&lt;span style=&quot;&quot;&gt;-&lt;span style=&quot;font:7.0pt;&quot;&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;&quot;&gt;Several publishers migrated their datasets from other publishing tools to the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/10/ipt-v204-released.html&quot;&gt;Integrated Publishing Toolkit (IPT)&lt;/a&gt; and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that are no longer contained in the source file: a data transfer is reliably all-or-nothing, so that any record that is not touched during indexing can automatically be deleted. This is less easy in the dialog-driven data transfer protocols (DiGIR, BioCASe and TAPIR), where data transfer might fail at any point in between for a number of reasons, requiring human supervision of deletions.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot; style=&quot;margin-left:36.0pt;&quot;&gt;
&lt;span style=&quot;&quot;&gt;&lt;span style=&quot;&quot;&gt;-&lt;span style=&quot;font:7.0pt;&quot;&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;&quot;&gt;The now &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/10/the-gbif-registry-is-now-dataset-aware.html&quot;&gt;dataset-aware registry&lt;/a&gt; and changed metadata updating workflow make it much easier to spot data resources that are no longer published at source and therefore need to be removed from the data portal as well. Previously, such checks were manual and required regular screening. More often than not, datasets are not really withdrawn, but instead published under a new identifier, combined with other data, or moved to a new location, all with the old version still hanging around until spotted or pointed out. The new registry workflows will significantly speed up the process of detecting and handling such cases.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span style=&quot;&quot;&gt;In summary, the current drop in numbers is the result of data cleaning and removal of duplicates, and reflects continuing efforts by publishers, nodes and the Secretariat to improve the quality of data accessible through the GBIF network. While they happen regularly, the effects of such cleaning activities often get masked by increased record numbers of existing resources and new datasets in the global index. This time, the reduction happens to be more prominent than the additions.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Andrea Hahn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-5095306560221805946</guid>
         <pubDate>Mon, 10 Dec 2012 11:51:00 +0000</pubDate>
      </item>
      <item>
         <title>The GBIF Registry is now dataset-aware!</title>
         <link>http://gbif.blogspot.com/2012/10/the-gbif-registry-is-now-dataset-aware.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;br /&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
This post continues the series of posts that highlight the latest updates
on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;.&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span lang=&quot;EN-GB&quot;&gt;To recap, in April 2011 &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.blogger.com/profile/00591450269169657407&quot;&gt;Jose Cuadra&lt;/a&gt;
wrote &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2011/04/evolution-of-gbif-registry.html&quot;&gt;The
evolution of the GBIF Registry&lt;/a&gt;, a post that&amp;nbsp;provided a&amp;nbsp;background to the GBIF
Network, explained how Network entities are now stored in a database instead of a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www2.sys-con.com/itsg/virtualcd/webservices/archives/0401/barbash/index.html&quot;&gt;UDDI
system&lt;/a&gt;, and how it has a new&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;web
application&lt;/a&gt; and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-registry/wiki/TableOfContents&quot;&gt;API&lt;/a&gt;. &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span lang=&quot;EN-GB&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span lang=&quot;EN-GB&quot;&gt;Then a month later, Jose wrote another post entitled &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2011/05/2011-registry-refactoring.html&quot;&gt;2011 GBIF Registry Refactoring&lt;/a&gt;&amp;nbsp;that was&amp;nbsp;more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;span lang=&quot;EN-GB&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Now even if you have been keeping an eye on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware!&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Being dataset-aware means that the Registry now knows about all the datasets that exist behind &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://digir.sourceforge.net/&quot;&gt;DiGIR&lt;/a&gt;&amp;nbsp;and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.biocase.org/products/provider_software/index.shtml&quot;&gt;BioCASe&lt;/a&gt; endpoints.&amp;nbsp;In case the reader isn't familiar with them, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://digir.sourceforge.net/&quot;&gt;DiGIR&lt;/a&gt; and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.biocase.org/products/provider_software/index.shtml&quot;&gt;BioCASe&lt;/a&gt; are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can potentially be thousands of datasets behind a single endpoint.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Traditionally, the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; knew about the endpoint but not about its datasets. It was then the job of &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-indexingtoolkit/&quot;&gt;GBIF's Harvesting and Indexing Toolkit (HIT)&lt;/a&gt; to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/&quot;&gt;GBIF Data Portal&lt;/a&gt;.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Therefore if you ever visited the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/&quot;&gt;GBIF Data Portal&lt;/a&gt; and viewed the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/datasets/provider/4/&quot;&gt;Portal page for the&amp;nbsp;Academy of Natural Sciences&lt;/a&gt;, you would find that it has 3 datasets.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-KfHuUj0SMQI/UIPxiI_IRNI/AAAAAAAAF8s/tqyC5nF58Hc/s1600/Picture+7.png&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-KfHuUj0SMQI/UIPxiI_IRNI/AAAAAAAAF8s/tqyC5nF58Hc/s1600/Picture+7.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Clicking on each one reveals that they are all exposed via the same DiGIR endpoint (see &quot;Access point URL&quot; below):&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-SGPVs2qQAZM/UIPyAl2ohII/AAAAAAAAF80/aOXs0GCI9B4/s1600/Picture+8.png&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-SGPVs2qQAZM/UIPyAl2ohII/AAAAAAAAF80/aOXs0GCI9B4/s1600/Picture+8.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-BWAqqiw6NjE/UIP0g1U7clI/AAAAAAAAF9E/D6DzW7c7nZM/s1600/Picture+11.png&quot; style=&quot;clear:left;display:inline;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-BWAqqiw6NjE/UIP0g1U7clI/AAAAAAAAF9E/D6DzW7c7nZM/s1600/Picture+11.png&quot;/&gt;&lt;/a&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-RosZJx1ENlE/UIP0MFH_5RI/AAAAAAAAF88/gxMLcl8N78E/s1600/Picture+10.png&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;220&quot; src=&quot;http://3.bp.blogspot.com/-RosZJx1ENlE/UIP0MFH_5RI/AAAAAAAAF88/gxMLcl8N78E/s640/Picture+10.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
But if you had visited the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;&amp;nbsp;and done the same search for the Academy of Natural Sciences&amp;nbsp;&lt;i&gt;prior to the Registry being dataset-aware&lt;/i&gt;, you would have seen that it has a DiGIR endpoint, but you would not have found any datasets!&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-0igfQzKc5Eo/UIP1uTct1xI/AAAAAAAAF9M/eSFyBYRxvZU/s1600/Picture+12.png&quot; style=&quot;clear:left;display:inline;float:left;margin-bottom:1em;margin-right:1em;text-align:center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://4.bp.blogspot.com/-0igfQzKc5Eo/UIP1uTct1xI/AAAAAAAAF9M/eSFyBYRxvZU/s1600/Picture+12.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Now that the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;&amp;nbsp;is dataset-aware, however, the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/browse/agent?uuid=f9b67ad0-9c9b-11d9-b9db-b8a03c50a862&quot;&gt;Registry page for the Academy of Natural Sciences&lt;/a&gt;&amp;nbsp;shows that the organization owns 3 datasets, and has a (DiGIR) Technical Installation.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-Ij8WKdRchD0/UIP5x2sMJII/AAAAAAAAF9o/q3HYFeb6AVQ/s1600/Picture+14.png&quot; style=&quot;clear:left;float:left;margin-bottom:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-Ij8WKdRchD0/UIP5x2sMJII/AAAAAAAAF9o/q3HYFeb6AVQ/s1600/Picture+14.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
So that's fantastic: the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; now knows about thousands of datasets that only the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/&quot;&gt;GBIF Data Portal&lt;/a&gt; knew about before. But how was dataset-awareness achieved?&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
First, the Registry&amp;nbsp;now does the job of dataset discovery that the HIT used to do. A project called the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-registry/source/browse/#svn%2Fregistry%2Ftrunk%2Fregistry-metadata-sync&quot;&gt;registry-metadata-sync&lt;/a&gt; was created to do this.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database, into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;, and can be uniquely identified by their GBIF Registry UUID!&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Third, the HIT was branched, creating a revised version of the tool that was able to understand the new dataset-aware Registry. The HIT also had to be modified to allow its operators to still trigger dataset discovery by technical installation. Life just got easier for the HIT though, since it could use each dataset's &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; UUID to uniquely identify each dataset during indexation.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-phhNvHOQlN4/UIQGwtM3AcI/AAAAAAAAF94/e4z_JPnw0RI/s1600/Picture+15.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;269&quot; src=&quot;http://4.bp.blogspot.com/-phhNvHOQlN4/UIQGwtM3AcI/AAAAAAAAF94/e4z_JPnw0RI/s640/Picture+15.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
Indeed, the dataset-aware Registry allocates a UUID to each dataset. This is the biggest advantage that the dataset-aware Registry brings. Now that GBIF has succeeded in uniquely identifying each Dataset in its Registry, it is working to assign each Dataset a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Globally_unique_identifier&quot;&gt;Globally Unique Identifier (GUID)&lt;/a&gt; in the form of a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.doi.org/&quot;&gt;Digital Object Identifier (DOI)&lt;/a&gt;. The DOI for a dataset will be resolvable back to the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;, and could be referenced when citing a Dataset, thereby enabling better tracking of Dataset usage in scientific publications.&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;MsoNormal&quot;&gt;
GBIF is really excited about being able to provide publishers a DOI for each of their datasets. Keep an eye on our Registry in the coming months for their grand appearance.&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-2804284550271658726</guid>
         <pubDate>Mon, 29 Oct 2012 12:55:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://3.bp.blogspot.com/-KfHuUj0SMQI/UIPxiI_IRNI/AAAAAAAAF8s/tqyC5nF58Hc/s72-c/Picture+7.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>IPT v2.0.4 released</title>
         <link>http://gbif.blogspot.com/2012/10/ipt-v204-released.html</link>
         <description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align:left;&quot;&gt;
&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
Today the GBIF Secretariat has announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/downloads/list&quot;&gt;&lt;span class=&quot;s1&quot;&gt;here&lt;/span&gt;&lt;/a&gt;.&lt;/div&gt;
&lt;div class=&quot;p2&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing some patches.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Add all the issue activity to that, and things have been busy. In total, 108 issues were addressed in this version: 38 Defects, 35 Enhancements, 7 Other, 5 Patches, 18 Won't fix, 4 Duplicates, and 1 that was considered Invalid. These are detailed in the&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/issues/list?can=1&amp;amp;q=milestone=Release2.0.4&quot;&gt;issue tracking system&lt;/a&gt;&lt;/span&gt;.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
So what exactly has changed and why? Here's a quick rundown.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
One thing that kept coming up in version 2.0.3 was that users were unwittingly installing the IPT in test mode, thinking that they were running in production. After registering a resource, these users expected to see it show up in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; and ultimately be indexed by GBIF. Frustrated emails were then sent to the GBIF Helpdesk when nothing happened. Sadly, the reply from the GBIF Helpdesk was always filled with the same disappointing news:&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&quot;Your resource is actually in the &lt;i&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrdsdev.gbif.org/&quot;&gt;Test Registry&lt;/a&gt;&lt;/i&gt; therefore it will never be indexed by GBIF. Oh, and you will have to reinstall your IPT using production mode next time and do your resource configuration over again!&quot;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
So to tackle this problem, the&amp;nbsp;&lt;i&gt;&lt;b&gt;setup pages have been improved&lt;/b&gt;&lt;/i&gt; to make it crystal clear what it means to choose one mode or the other.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-4-KMHziDwJ4/UH6oK1N5jaI/AAAAAAAAF7s/UyZPFy2Outo/s1600/Screen+Shot+2012-10-17+at+2.43.22+PM.png&quot; style=&quot;clear:left;margin-bottom:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;281&quot; src=&quot;http://3.bp.blogspot.com/-4-KMHziDwJ4/UH6oK1N5jaI/AAAAAAAAF7s/UyZPFy2Outo/s640/Screen+Shot+2012-10-17+at+2.43.22+PM.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
The &lt;b&gt;&lt;i&gt;UI has also been branded&lt;/i&gt;&lt;/b&gt; when running in test mode to make it even more obvious what mode the IPT is running in. &amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/--Lfka2cJ9vE/UH6L9jcouXI/AAAAAAAAF7U/26XepsqVUyo/s1600/Screen+Shot+2012-10-17+at+12.43.24+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;96&quot; src=&quot;http://4.bp.blogspot.com/--Lfka2cJ9vE/UH6L9jcouXI/AAAAAAAAF7U/26XepsqVUyo/s640/Screen+Shot+2012-10-17+at+12.43.24+PM.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&amp;nbsp;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Whether or not test mode was chosen accidentally, it can be used to train administrators to configure an instance and to train users to publish resources. What was always missing was a way to transfer configured resources from an IPT in test mode to one in production.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
I'm happy to say that in 2.0.4, &lt;b&gt;&lt;i&gt;a resource can now be easily transferred between 2 IPTs&amp;nbsp;including all its source files and mappings&lt;/i&gt;&lt;/b&gt;. Users will be happy to know that they never have to waste time reconfiguring the same resource from scratch. How is this done? Well in short, resource transfer is achieved by uploading an archived IPT resource folder during resource creation - see user manual for full&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Upload_a_zipped_(.zip)_IPT_resource_configuration_folder&quot;&gt;instructions&lt;/a&gt;.&lt;/span&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;span class=&quot;s1&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Moving on...&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
With so many publishers opting for the convenience of publishing via the IPT, the GBIF helpdesk has been receiving dozens of requests to replace an existing DiGIR, BioCASE, or TAPIR resource in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; with one coming from their IPT. To facilitate resource migration, another new feature was added in 2.0.4 that allows the IPT to &lt;b&gt;&lt;i&gt;update an existing resource in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt; during registration&lt;/i&gt;&lt;/b&gt;. The change is welcomed most of all by the&amp;nbsp;GBIF helpdesk who bore the brunt of carrying out resource migrations in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbrds.gbif.org/&quot;&gt;GBIF Registry&lt;/a&gt;. See User Manual for&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Migrate_a_Resource&quot;&gt;instructions&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Thanks to the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; class=&quot;l&quot; target=&quot;_blank&quot; href=&quot;http://www.taibif.org.tw/&quot; style=&quot;color:#1122cc;cursor:pointer;font-family:arial, sans-serif;white-space:nowrap;&quot;&gt;Taiwan Biodiversity Information Facility (TaiBIF)&lt;/a&gt;&amp;nbsp;&amp;nbsp;&lt;b&gt;&lt;i&gt;the IPT interface is now available in Traditional Chinese&lt;/i&gt;&lt;/b&gt;. That makes the IPT available in a total of 4 languages now including French, Spanish and of course English.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-5y6O6zq2_fM/UH6mrRNJN9I/AAAAAAAAF7k/qXz7B87a0OI/s1600/Screen+Shot+2012-10-17+at+2.35.10+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;433&quot; src=&quot;http://3.bp.blogspot.com/-5y6O6zq2_fM/UH6mrRNJN9I/AAAAAAAAF7k/qXz7B87a0OI/s640/Screen+Shot+2012-10-17+at+2.35.10+PM.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
What else?&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Thanks to a patch from Peter Desmet, &lt;b&gt;&lt;i&gt;download metrics for the Archive, EML, and RTF files can now be tracked&lt;/i&gt;&lt;/b&gt; via Google Analytics.&amp;nbsp;For IPT admins who aren't already tracking analytics, there are simple&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Configure_IPT_settings&quot;&gt;instructions&lt;/a&gt;&amp;nbsp;in the User Manual.&amp;nbsp;Here's a screenshot showing some metrics from http://ipt-rc.gbif.org. For your reference, the &quot;Event Label&quot; is the resource short name in the IPT.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-iZucs0KH9QM/UH6szUEcdxI/AAAAAAAAF78/g9Y96VYSISc/s1600/Screen+Shot+2012-10-17+at+2.59.36+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;579&quot; src=&quot;http://4.bp.blogspot.com/-iZucs0KH9QM/UH6szUEcdxI/AAAAAAAAF78/g9Y96VYSISc/s640/Screen+Shot+2012-10-17+at+2.59.36+PM.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
Last but not least, it should be highlighted that the IPT's RSS feed is now updated every time a resource is published. The version number is displayed right beside the resource name, so subscribers can stay on top of the latest changes. Here's a screenshot from my RSS reader pulling from &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ipt.gbif.org/rss.do&quot;&gt;http://ipt.gbif.org/rss.do&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-2hdVaIUT1FE/UH6w2JLhjfI/AAAAAAAAF8M/4sdyBX6Y110/s1600/Screen+Shot+2012-10-17+at+3.16.30+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;255&quot; src=&quot;http://2.bp.blogspot.com/-2hdVaIUT1FE/UH6w2JLhjfI/AAAAAAAAF8M/4sdyBX6Y110/s400/Screen+Shot+2012-10-17+at+3.16.30+PM.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
And that about wraps up the most important changes in this version.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p2&quot;&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;p1&quot;&gt;
As always, we’d like to give special thanks to the volunteer&amp;nbsp;translators for their time and efforts:&amp;nbsp;&lt;/div&gt;
&lt;ul class=&quot;ul1&quot;&gt;
&lt;li class=&quot;li1&quot;&gt;Nicolas Noé (Belgian Biodiversity Platform&lt;span class=&quot;s2&quot;&gt;, Belgium)&lt;/span&gt;&amp;nbsp;-&amp;nbsp;French&amp;nbsp;&lt;/li&gt;
&lt;li class=&quot;li1&quot;&gt;TaiBIF, Taiwan -&amp;nbsp;Traditional Chinese&lt;/li&gt;
&lt;li class=&quot;li3&quot;&gt;Laura Roldan Gomez,&amp;nbsp;Dairo Escobar, and Daniel Amariles, (Colombian Biodiversity Information System (SiB))&amp;nbsp;-&amp;nbsp;&lt;span class=&quot;s3&quot;&gt;Spanish&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;p1&quot;&gt;
Plus another couple of special mentions are owed to Peter Desmet and Laura Russell who provided an exceptional amount of feedback and suggestions.&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
On behalf of the GBIF development team, I hope you enjoy using the latest version.&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Kyle Braak</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-7596676567882813275</guid>
         <pubDate>Wed, 17 Oct 2012 16:58:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://3.bp.blogspot.com/-4-KMHziDwJ4/UH6oK1N5jaI/AAAAAAAAF7s/UyZPFy2Outo/s72-c/Screen+Shot+2012-10-17+at+2.43.22+PM.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Getting started with DataCube on HBase</title>
         <link>http://gbif.blogspot.com/2012/07/getting-started-with-datacube-on-hbase.html</link>
         <description>This tutorial blog provides a quick introduction to using &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/urbanairship/datacube&quot;&gt;DataCube&lt;/a&gt;, a Java based &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/OLAP_cube&quot;&gt;OLAP cube&lt;/a&gt; library with a pluggable storage engine open sourced by &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://urbanairship.com/&quot;&gt;Urban Airship&lt;/a&gt;.  In this tutorial, we make use of the inbuilt HBase storage engine.&lt;br /&gt;
&lt;br /&gt;
In a small database, calculating summary metrics such as record counts per country would be trivial using aggregate functions (SUM(), COUNT() etc).  As the volume grows, one often precalculates these metrics, which brings its own set of consistency challenges. As one outgrows a database, as GBIF has, new mechanisms are needed to manage these metrics.  The features of DataCube that make it attractive to us are:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;A manageable process to modify the cube structure&lt;/li&gt;
&lt;li&gt;A higher level API to develop against&lt;/li&gt;
&lt;li&gt;Ability to rebuild the cube with a single pass over the source data&lt;/li&gt;
&lt;/ul&gt;
For this tutorial we will consider the source data as classical &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/&quot;&gt;DarwinCore&lt;/a&gt; occurrence records, where each record represents the metadata associated with a species observation event, e.g.:

&lt;br /&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;ID, Kingdom, ScientificName, Country, IsoCountryCode, BasisOfRecord, CellId, Year
1, Animalia, Puma concolor, Peru, PE, Observation, 13245, 1967
2, Plantae, Abies alba, Spain, ES, Observation, 3637, 2010
3, Plantae, Abies alba, Spain, ES, Observation, 3638, 2010
&lt;/pre&gt;
Suppose the following metrics are required, each of which is termed a rollup in OLAP:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Number of records per country&lt;/li&gt;
&lt;li&gt;Number of records per kingdom&lt;/li&gt;
&lt;li&gt;Number of records georeferenced / not georeferenced&lt;/li&gt;
&lt;li&gt;Number of records per kingdom per country&lt;/li&gt;
&lt;li&gt;Number of records georeferenced / not georeferenced per country&lt;/li&gt;
&lt;li&gt;Number of records georeferenced / not georeferenced per kingdom&lt;/li&gt;
&lt;li&gt;Number of records georeferenced / not georeferenced per kingdom per country&lt;/li&gt;
&lt;/ol&gt;
Given the requirements above, this can be translated into a cube definition with 3 dimensions, and 7 rollups as follows:

&lt;br /&gt;
&lt;pre class=&quot;brush:java&quot;&gt;/**
 * The cube definition (package access only).
 * Dimensions are Country, Kingdom and Georeferenced with counts available for:
 * &lt;ol&gt;
 * &lt;li&gt;Country (e.g. number of records in DK)&lt;/li&gt;
 * &lt;li&gt;Kingdom (e.g. number of animal records)&lt;/li&gt;
 * &lt;li&gt;Georeferenced (e.g. number of records with coordinates)&lt;/li&gt;
 * &lt;li&gt;Country and kingdom (e.g. number of plant records in the US)&lt;/li&gt;
 * &lt;li&gt;Country and georeferenced (e.g. number of records with coordinates in the UK)&lt;/li&gt;
 * &lt;li&gt;Kingdom and georeferenced (e.g. number of animal records with coordinates)&lt;/li&gt;
 * &lt;li&gt;Country and kingdom and georeferenced (e.g. number of bacteria records with coordinates in Spain)&lt;/li&gt;
 * &lt;/ol&gt;
 * TODO: write public utility exposing a simple API enabling validated read/write access to cube.
 */
class Cube {

  // no id substitution
  static final Dimension&amp;lt;String&amp;gt; COUNTRY = new Dimension&amp;lt;String&amp;gt;(&quot;dwc:country&quot;, new StringToBytesBucketer(), false, 2);
  // id substitution applies
  static final Dimension&amp;lt;String&amp;gt; KINGDOM = new Dimension&amp;lt;String&amp;gt;(&quot;dwc:kingdom&quot;, new StringToBytesBucketer(), true, 7);
  // no id substitution
  static final Dimension&amp;lt;Boolean&amp;gt; GEOREFERENCED = new Dimension&amp;lt;Boolean&amp;gt;(&quot;gbif:georeferenced&quot;, new BooleanBucketer(), false, 1);

  // Singleton instance of the cube
  static final DataCube&amp;lt;LongOp&amp;gt; INSTANCE = newInstance();

  // Not for instantiation
  private Cube() {
  }

  /**
   * Creates the cube.
   */
  private static DataCube&amp;lt;LongOp&amp;gt; newInstance() {
    // The dimensions of the cube
    List&amp;lt;Dimension&amp;lt;?&amp;gt;&amp;gt; dimensions = ImmutableList.&amp;lt;Dimension&amp;lt;?&amp;gt;&amp;gt;of(COUNTRY, KINGDOM, GEOREFERENCED);

    // The way the dimensions are &quot;rolled up&quot; for summary counting
    List&amp;lt;Rollup&amp;gt; rollups =
      ImmutableList.of(new Rollup(COUNTRY),
        new Rollup(KINGDOM),
        new Rollup(GEOREFERENCED),
        new Rollup(COUNTRY, KINGDOM),
        new Rollup(COUNTRY, GEOREFERENCED),
        new Rollup(KINGDOM, GEOREFERENCED),
        // more than 2 requires special syntax
        new Rollup(ImmutableSet.of(new DimensionAndBucketType(COUNTRY), new DimensionAndBucketType(KINGDOM),
          new DimensionAndBucketType(GEOREFERENCED))));

    return new DataCube&amp;lt;LongOp&amp;gt;(dimensions, rollups);
  }
}
&lt;/pre&gt;

In this code, we are making use of ID substitution for the kingdom.  ID substitution is an inbuilt feature of DataCube whereby an auto-generated ID is used to substitute verbose coordinates (a value for a dimension).  This is an important feature to help improve performance as coordinates are used to construct the cube lookup keys, which translate into the key used for the HBase table.  The substitution is achieved by using a simple table holding a running counter and a mapping table holding the field-to-id mapping. When inserting data into the cube, the counter is incremented (with custom locking to support concurrency within the cluster), the mapping is stored, and the counter value used as the coordinate.  When reading, the mapping table is used to construct the lookup key.
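
To make the link between coordinates and lookup keys concrete, here is a minimal sketch in plain Java. It is not DataCube's actual key format: the LookupKeySketch class, its lookupKey method and the simple concatenation layout are all hypothetical, and it assumes that the last constructor argument of each Dimension in the cube definition (2, 7 and 1) is the width of the coordinate in bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; DataCube builds its real keys internally.
class LookupKeySketch {

  // Assumed coordinate widths in bytes, mirroring the cube definition
  static final int COUNTRY_BYTES = 2;
  static final int KINGDOM_BYTES = 7;
  static final int GEOREFERENCED_BYTES = 1;

  // Concatenates the fixed-width coordinates into a single key.
  // kingdomId stands in for the identifier that ID substitution assigned to a kingdom name.
  static byte[] lookupKey(String isoCountryCode, long kingdomId, boolean georeferenced) {
    byte[] country = isoCountryCode.getBytes(StandardCharsets.US_ASCII);
    if (country.length != COUNTRY_BYTES) {
      throw new IllegalArgumentException("Expected a 2 letter country code: " + isoCountryCode);
    }
    // Big-endian encoding of the identifier, of which only the low 7 bytes are kept
    byte[] id = ByteBuffer.allocate(Long.BYTES).putLong(kingdomId).array();

    ByteBuffer key = ByteBuffer.allocate(COUNTRY_BYTES + KINGDOM_BYTES + GEOREFERENCED_BYTES);
    key.put(country);
    key.put(id, Long.BYTES - KINGDOM_BYTES, KINGDOM_BYTES);
    key.put((byte) (georeferenced ? 1 : 0));
    return key.array();
  }

  public static void main(String[] args) {
    byte[] key = lookupKey("ES", 1L, true);
    System.out.println("Key length in bytes: " + key.length);
  }
}
```

However long the kingdom name is, the key stays at 10 bytes in this sketch, which is the point of substituting identifiers for verbose values.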

With the cube defined, we are ready to populate it.  One could simply iterate over the source data and populate the cube with the likes of the following:

&lt;br /&gt;
&lt;pre class=&quot;brush:java&quot;&gt;DataCubeIo dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
dataCubeIo.writeSync(new LongOp(1), 
  new WriteBuilder(Cube.INSTANCE)
    .at(Cube.COUNTRY, &quot;ES&quot;) // ISO country code, for example
    .at(Cube.KINGDOM, &quot;Animalia&quot;)
    .at(Cube.GEOREFERENCED, true)
);
&lt;/pre&gt;
However, one should consider how to handle the following inevitable scenarios:

&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;A new dimension or rollup is to be added to the running cube&lt;/li&gt;
&lt;li&gt;Changes to the source data have occurred without the cube being notified (e.g. through a batch load, or missing notifications due to messaging failures)&lt;/li&gt;
&lt;li&gt;Some disaster recovery requiring a cube rebuild&lt;/li&gt;
&lt;/ol&gt;
To handle this when using HBase as the cube storage engine, we make use of the inbuilt backfill functionality.  Backfilling is a multistage process:

&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;A snapshot of the live cube is taken and stored in a snapshot table&lt;/li&gt;
&lt;li&gt;An offline cube is calculated from the source data and stored in a backfill table&lt;/li&gt;
&lt;li&gt;The snapshot and live cube are compared to determine changes that were accepted in the live cube during the rebuilding process (step 2).  These changes are then applied to the backfill table&lt;/li&gt;
&lt;li&gt;The backfill is hot swapped to become the live cube&lt;/li&gt;
&lt;/ol&gt;
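
Stage 3 is the least obvious of these steps. For a cube of simple counters it can be pictured as the arithmetic below, a minimal sketch in plain Java. The BackfillMergeSketch class and its merge method are hypothetical, and the sketch assumes additive counters where the change is the plain difference between the live and snapshot values. It is not a description of DataCube's internal merge code.

```java
// Hypothetical illustration of stage 3 for a single counter cell.
class BackfillMergeSketch {

  // Returns the value the backfill cell should hold before the hot swap.
  static long merge(long snapshot, long live, long backfill) {
    // Writes accepted by the live cube after the snapshot was taken
    long acceptedDuringRebuild = live - snapshot;
    return backfill + acceptedDuringRebuild;
  }

  public static void main(String[] args) {
    // The snapshot held 100, the live cube has since grown to 103,
    // and the rebuild from the source data counted 98
    System.out.println(merge(100L, 103L, 98L));
  }
}
```

In this example the three records that arrived during the rebuild are carried over, so the swapped-in cube reports 101 rather than 98.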
This is all handled within DataCube with the exception of stage 2, where we are required to provide a BackfillCallback, the logic responsible for populating the new cube from the source data.  The following example illustrates a BackfillCallback using a simple MapReduce job to scan an HBase table for the source data.

&lt;br /&gt;
&lt;pre class=&quot;brush:java&quot;&gt;/**
 * The callback used from the backfill process to spawn the job to write the new data in the cube.
 */
public class BackfillCallback implements HBaseBackfillCallback {

  // Property keys passed in on the job conf to the Mapper
  static final String TARGET_TABLE_KEY = &quot;gbif:cubewriter:targetTable&quot;;
  static final String TARGET_CF_KEY = &quot;gbif:cubewriter:targetCF&quot;;
  // Controls the scanner caching size for the source data scan (100-5000 is reasonable)
  private static final int SCAN_CACHE = 200;
  // The source data table
  private static final String SOURCE_TABLE = &quot;dc_occurrence&quot;;

  @Override
  public void backfillInto(Configuration conf, byte[] table, byte[] cf, long snapshotFinishMs) throws IOException {
    conf = HBaseConfiguration.create();
    conf.set(TARGET_TABLE_KEY, Bytes.toString(table));
    conf.set(TARGET_CF_KEY, Bytes.toString(cf));
    Job job = new Job(conf, &quot;CubeWriterMapper&quot;);

    job.setJarByClass(CubeWriterMapper.class);
    Scan scan = new Scan();
    scan.setCaching(SCAN_CACHE);
    scan.setCacheBlocks(false);

    // we do not want to get bad counts in the cube!
    job.getConfiguration().set(&quot;mapred.map.tasks.speculative.execution&quot;, &quot;false&quot;);
    job.getConfiguration().set(&quot;mapred.reduce.tasks.speculative.execution&quot;, &quot;false&quot;);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.initTableMapperJob(SOURCE_TABLE, scan, CubeWriterMapper.class, null, null, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    try {
      boolean b = job.waitForCompletion(true);
      if (!b) {
        throw new IOException(&quot;Unknown error with job.  Check the logs.&quot;);
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
&lt;/pre&gt;
&lt;pre class=&quot;brush:java&quot;&gt;/**
 * The Mapper used to read the source data and write into the target cube.
 * Counters are written to simplify the spotting of issues, so look to the Job counters on completion.
 */
public class CubeWriterMapper extends TableMapper {

  // TODO: These should come from a common schema utility in the future
  // The source HBase table fields
  private static final byte[] CF = Bytes.toBytes(&quot;o&quot;);
  private static final byte[] COUNTRY = Bytes.toBytes(&quot;icc&quot;);
  private static final byte[] KINGDOM = Bytes.toBytes(&quot;ik&quot;);
  private static final byte[] CELL = Bytes.toBytes(&quot;icell&quot;);

  // Names for counters used in the Hadoop Job
  private static final String STATS = &quot;Stats&quot;;
  private static final String STAT_COUNTRY = &quot;Country present&quot;;
  private static final String STAT_KINGDOM = &quot;Kingdom present&quot;;
  private static final String STAT_GEOREFENCED = &quot;Georeferenced&quot;;
  private static final String STAT_SKIPPED = &quot;Skipped record&quot;;
  private static final String KINGDOMS = &quot;Kingdoms&quot;;

  // The batch size to use when writing the cube
  private static final int CUBE_WRITE_BATCH_SIZE = 1000;

  static final byte[] EMPTY_BYTE_ARRAY = new byte[0];

  private DataCubeIo dataCubeIo;

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    super.cleanup(context);
    // ensure we're all flushed since batch mode
    dataCubeIo.flush();
    dataCubeIo = null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private Integer getValueAsInt(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null &amp;amp;&amp;amp; v.length &amp;gt; 0) {
      return Bytes.toInt(v);
    }
    return null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private String getValueAsString(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null &amp;amp;&amp;amp; v.length &amp;gt; 0) {
      return Bytes.toString(v);
    }
    return null;
  }


  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context) throws IOException, InterruptedException {
    String country = getValueAsString(row, CF, COUNTRY);
    String kingdom = getValueAsString(row, CF, KINGDOM);
    Integer cell = getValueAsInt(row, CF, CELL);

    WriteBuilder b = new WriteBuilder(Cube.INSTANCE);
    if (country != null) {
      b.at(Cube.COUNTRY, country);
      context.getCounter(STATS, STAT_COUNTRY).increment(1);
    }
    if (kingdom != null) {
      b.at(Cube.KINGDOM, kingdom);
      context.getCounter(STATS, STAT_KINGDOM).increment(1);
      context.getCounter(KINGDOMS, kingdom).increment(1);
    }
    if (cell != null) {
      b.at(Cube.GEOREFERENCED, true);
      context.getCounter(STATS, STAT_GEOREFENCED).increment(1);
    }
    if (b.getBuckets() != null &amp;amp;&amp;amp; !b.getBuckets().isEmpty()) {
      dataCubeIo.writeSync(new LongOp(1), b);
    } else {
      context.getCounter(STATS, STAT_SKIPPED).increment(1);
    }
  }

  // Sets up the DataCubeIO with IdService etc.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);

    IdService idService = new HBaseIdService(conf, Backfill.LOOKUP_TABLE, Backfill.COUNTER_TABLE, Backfill.CF, EMPTY_BYTE_ARRAY);

    byte[] table = Bytes.toBytes(conf.get(BackfillCallback.TARGET_TABLE_KEY));
    byte[] cf = Bytes.toBytes(conf.get(BackfillCallback.TARGET_CF_KEY));


    DbHarness hbaseDbHarness =
      new HBaseDbHarness (pool, EMPTY_BYTE_ARRAY, table, cf, LongOp.DESERIALIZER, idService, CommitType.INCREMENT);

    dataCubeIo = new DataCubeIo (Cube.INSTANCE, hbaseDbHarness, CUBE_WRITE_BATCH_SIZE, Long.MAX_VALUE, SyncLevel.BATCH_SYNC);

  }
}
&lt;/pre&gt; 
With the callback written, all that is left to populate the cube is to run the backfill.  Note that this process can also be used to bootstrap the live cube for the first time:

&lt;br /&gt;
&lt;pre class=&quot;brush:java&quot;&gt;// The live cube table 
final byte[] CUBE_TABLE = &quot;dc_cube&quot;.getBytes();
// Snapshot of the live table used during backfill
final byte[] SNAPSHOT_TABLE = &quot;dc_snapshot&quot;.getBytes();
// Backfill table built from the source
final byte[] BACKFILL_TABLE = &quot;dc_backfill&quot;.getBytes();
// Utility table to provide a running count for the identifier service
final byte[] COUNTER_TABLE = &quot;dc_counter&quot;.getBytes();
// Utility table to provide a mapping from source values to assigned identifiers
final byte[] LOOKUP_TABLE = &quot;dc_lookup&quot;.getBytes();
// All DataCube tables use a single column family
final byte[] CF = &quot;c&quot;.getBytes();

HBaseBackfill backfill =
  new HBaseBackfill(
    conf, 
    new BackfillCallback(), // our implementation
    CUBE_TABLE, 
    SNAPSHOT_TABLE, 
    BACKFILL_TABLE, 
    CF, 
    LongOp.LongOpDeserializer.class);
backfill.runWithCheckedExceptions();
&lt;/pre&gt;
While HBase provides the storage for the cube, a backfill could be implemented against any source data, such as from a database over JDBC or from text files stored on a Hadoop filesystem.
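
As a sketch of the text file case, the following plain Java class pulls the three cube coordinates out of one line of the sample data shown at the start of this tutorial. The OccurrenceLineSketch class is hypothetical and is not part of DataCube or the GBIF code. Like the Mapper, it assumes a record counts as georeferenced when it has a cell id, and it assumes that field values contain no commas.

```java
// Hypothetical parser for one line of the comma separated sample data.
// Columns: ID, Kingdom, ScientificName, Country, IsoCountryCode, BasisOfRecord, CellId, Year
class OccurrenceLineSketch {

  final String kingdom;
  final String isoCountryCode;
  final boolean georeferenced;

  OccurrenceLineSketch(String line) {
    // A limit of -1 keeps trailing empty columns
    String[] columns = line.split(",", -1);
    if (columns.length != 8) {
      throw new IllegalArgumentException("Expected 8 columns: " + line);
    }
    kingdom = columns[1].trim();
    isoCountryCode = columns[4].trim();
    // Treated as georeferenced when a cell id is present
    georeferenced = !columns[6].trim().isEmpty();
  }

  public static void main(String[] args) {
    OccurrenceLineSketch parsed =
      new OccurrenceLineSketch("1, Animalia, Puma concolor, Peru, PE, Observation, 13245, 1967");
    System.out.println(parsed.isoCountryCode + " " + parsed.kingdom + " " + parsed.georeferenced);
  }
}
```

Each parsed line could then be written to the cube with a WriteBuilder, in the same way as the Mapper does for HBase rows.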

Finally we want to be able to read our cube:
&lt;br /&gt;
&lt;pre class=&quot;brush:java&quot;&gt;DataCubeIo&amp;lt;LongOp&amp;gt; dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
Optional&amp;lt;LongOp&amp;gt; result = 
  dataCubeIo.get(
    new ReadBuilder(Cube.INSTANCE)
     .at(Cube.COUNTRY, &quot;DK&quot;)
     .at(Cube.KINGDOM, &quot;Animalia&quot;));
// need to check if this coordinate combination hit anything in the cube
if (result.isPresent()) {
  LOG.info(&quot;Animal records in Denmark: &quot; + result.get().getLong());
}
&lt;/pre&gt; 
All the source code for the above is available in the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-labs/source/browse/#svn%2Foccurrence-cube%2Ftags%2Ftutorial-blog&quot;&gt;GBIF labs svn&lt;/a&gt;. &lt;br /&gt;
&lt;br /&gt;
Many thanks to Dave Revell at UrbanAirship for his guidance.&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Tim Robertson</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-3896303116028324458</guid>
         <pubDate>Fri, 13 Jul 2012 17:35:00 +0000</pubDate>
      </item>
      <item>
         <title>Optimizing Writes in HBase</title>
         <link>http://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html</link>
         <description>I've written a few times about our work to improve the scanning performance of our cluster (parts&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/02/performance-evaluation-of-hbase.html&quot;&gt;1&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/03/hbase-performance-evaluation-continued.html&quot;&gt;2&lt;/a&gt;, and&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/06/faster-hbase-hardware-matters.html&quot;&gt;3&lt;/a&gt;)&amp;nbsp;since our highest priority for HBase is being able to serve requests for downloads of occurrence records (which require a full table scan). But now that the scanning is working nicely we need to start writing new records into our occurrence table as well as&amp;nbsp;&lt;span style=&quot;background-color:white;&quot;&gt;cleaning raw data and&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;interpreting it into something more useful for the users of our &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.gbif.org/&quot;&gt;data portal&lt;/a&gt;. That processing is built as Hive queries that read from and write back to the same HBase table. And while it was working fine on small test datasets, it all blew up once I moved the process to the full dataset. Here's what happened and how we fixed it. Note that we're using CDH3u3, with the addition of Hive 0.9.0, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/05/hive-09-with-hbase-090.html&quot;&gt;which we patched&lt;/a&gt; to support HBase 0.90.4.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;The problem&lt;/h3&gt;
&lt;div&gt;
Our processing is Hive queries which run as Hadoop MapReduce jobs. When the mappers were running they would eventually fail (repeatedly, ultimately killing the job) with an error that looks like this:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;pre style=&quot;font-size:12px;&quot;&gt;java.io.IOException: org.apache.hadoop.hbase.client.ScannerTimeoutException: 63882ms passed since the last invocation, timeout is currently set to 60000&lt;/pre&gt;
&lt;br /&gt;
We did some digging and found that this exception happens when the scanner's &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;next()&lt;/span&gt; method hasn't been called within the timeout limit. We simplified our test case by reproducing this same error when doing a simple CopyTable operation (the one that ships with HBase), and again using Hive to do &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;insert overwrite table A select * from B&lt;/span&gt;. In both cases mappers are assigned a split to scan based on TableInputFormat (just like our Hive jobs), and as they scan they simply put the record out to the new table. Something is holding up the loop as it tries to put, preventing it from calling&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;next()&lt;/span&gt;, and thereby triggering the timeout exception.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;The logs&lt;/h3&gt;
&lt;/div&gt;
&lt;div&gt;
First stop - the logs. The regionservers are littered with lines like the following:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;pre style=&quot;font-size:12px;&quot;&gt;WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region occurrence,&amp;#92;x17&amp;#92;xF1o&amp;#92;x9C,1340981109494.ecb85155563c6614e5448c7d700b909e. has too many store files; delaying flush up to 90000ms&lt;/pre&gt;
&lt;pre style=&quot;font-size:12px;&quot;&gt;INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 7 on 60020' on region occurrence,&amp;#92;x17&amp;#92;xF1o&amp;#92;x9C,1340981109494.ecb85155563c6614e5448c7d700b909e.: memstore size 128.2m is &amp;gt;= than blocking 128.0m size&lt;/pre&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Well, blocking updates sure sounds like the kind of thing that would prevent a loop from writing more puts and dutifully calling&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;next()&lt;/span&gt;. After a little more digging (and testing with a variety of hbase.client.scanner.caching values, including 1) we concluded that yes, this was the problem, but why was it happening?&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Why it blocks&lt;/h3&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
It blocks because the memstore has hit what I'll call the &quot;memstore blocking limit&quot; which is controlled by the setting &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hregion.memstore.flush.size&lt;/span&gt; * &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hregion.memstore.block.multiplier&lt;/span&gt;, which by default are 64MB and 2 respectively. Normally the memstore should flush when it reaches the &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;flush.size&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;, but in this case it reaches 128MB because it's not allowed to flush due to too many store files (the first log line). The definition of &quot;too many storefiles&quot; is in turn a setting, namely&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hstore.blockingStoreFiles&lt;/span&gt; (default 7). A new store file is created every time the memstore flushes, and their number is reduced by compacting them into fewer, bigger storefiles during minor and major compactions. B&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;y default, compactions will only start if there are at least &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hstore.compactionThreshold&lt;/span&gt; (default 3) store files, and won't compact more than &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hstore.compaction.max&lt;/span&gt; (default 7) in a single compaction. And&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;regardless of what you set &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;flush.size&lt;/span&gt; to, the memstore will always flush if all memstores in the regionserver combined are using too much heap. 
&amp;nbsp;The acceptable levels of heap usage are defined by &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.regionserver.global.memstore.lowerLimit&lt;/span&gt; (default 0.35) and &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.regionserver.global.memstore.upperLimit&lt;/span&gt; (default 0.4). There is a thread dedicated to flushing that wakes up regularly and checks these limits:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;&quot;&gt;If the flush thread wakes up and the memstores together exceed the lower limit, it starts flushing (beginning with the currently biggest memstore) until usage drops below that limit.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;&quot;&gt;If the flush thread wakes up and the memstores together exceed the upper limit, it blocks updates and flushes until usage drops below the lower limit, at which point it unblocks updates.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
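&lt;div&gt;
The two per-region checks described above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not HBase's own code: the class and method names are hypothetical, and the exact comparison operators are an assumption.&lt;/div&gt;

```java
// Illustrative sketch only: hypothetical names, not HBase classes.
final class MemstoreBlockingSketch {
    static final long MB = 1024L * 1024L;

    // The "memstore blocking limit": flush.size multiplied by block.multiplier.
    static long blockingLimitBytes(long flushSizeBytes, int blockMultiplier) {
        return flushSizeBytes * blockMultiplier;
    }

    // A flush is delayed while the store holds more files than blockingStoreFiles
    // (assumption: a strictly-greater-than comparison).
    static boolean flushDelayed(int storeFileCount, int blockingStoreFiles) {
        return storeFileCount > blockingStoreFiles;
    }

    // Updates are blocked once the memstore has grown to the blocking limit.
    static boolean updatesBlocked(long memstoreBytes, long flushSizeBytes, int blockMultiplier) {
        return memstoreBytes >= blockingLimitBytes(flushSizeBytes, blockMultiplier);
    }

    public static void main(String[] args) {
        long flushSize = 64 * MB;            // default flush.size quoted in the post
        int multiplier = 2;                  // default block.multiplier quoted in the post
        long memstore = (long) (128.2 * MB); // memstore size reported in the log line

        System.out.println("blocking limit in MB: "
            + blockingLimitBytes(flushSize, multiplier) / MB);
        System.out.println("flush delayed with 8 store files: " + flushDelayed(8, 7));
        System.out.println("updates blocked: "
            + updatesBlocked(memstore, flushSize, multiplier));
    }
}
```

&lt;div&gt;
The point of the sketch is that the two checks feed each other: while the flush is delayed the memstore keeps growing, and once it reaches the blocking limit every put to that region has to wait.&lt;/div&gt;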
&lt;div&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;Fortunately we didn't see blocking because of the upper heap limit - only the &quot;memstore blocking limit&quot; described earlier. But at this point we only knew the different dials we could turn.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;Stopping the blocking&lt;/h3&gt;
&lt;div&gt;
&lt;div&gt;
Our goal is to stop the blocking so that our mappers don't time out, while at the same time not running out of memory on the regionserver. The most obvious problem is that we have too many storefiles, which appears to be a combination of producing too many of them and not compacting them fast enough. Note that we have a 6GB heap dedicated to HBase, but can't afford to take any more away from the co-resident Hadoop mappers and reducers.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
We started by&amp;nbsp;&lt;span style=&quot;background-color:white;&quot;&gt;upping the memstore flush size - this will produce fewer but bigger storefiles on each flush:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;&quot;&gt;First we tried 128MB with the block multiplier still at 2. This still produced too many storefiles and caused the same blocking (maybe a little better than at 64MB).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;background-color:white;&quot;&gt;Then we tried 256MB with a multiplier of 4. &amp;nbsp;The logs and ganglia showed that the flushes were happening well before 256MB (still around 130MB) &quot;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;due to global heap pressure&lt;/span&gt;&quot; - a sign that total memstores were consuming too much heap. This meant we were still&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;generating too many files and got the same blocking problem, but with the &quot;memstore blocking limit&quot; set to 1GB the memstore blocking happened much less often, and later in the process (it still killed the mappers though).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
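&lt;div&gt;
A rough calculation suggests why the flushes kicked in so early. The sketch below assumes that the whole 6GB heap counts towards the default global limits of 0.35 and 0.4 mentioned earlier; the class and method names are hypothetical and only illustrate the arithmetic.&lt;/div&gt;

```java
// Illustrative arithmetic only: hypothetical names, not HBase classes.
final class GlobalMemstorePressureSketch {
    static final long MB = 1024L * 1024L;

    // Heap share that all memstores together may use at the given fraction.
    static long limitBytes(long heapBytes, double fraction) {
        return (long) (heapBytes * fraction);
    }

    // How many memstores can fill up to flush.size before the lower limit is crossed.
    static long fullMemstoresBeforePressure(long heapBytes, double lowerLimit, long flushSizeBytes) {
        return limitBytes(heapBytes, lowerLimit) / flushSizeBytes;
    }

    public static void main(String[] args) {
        long heap = 6 * 1024 * MB; // the 6GB heap from the post (assumed to be the full JVM heap)

        System.out.println("lower limit in MB: " + limitBytes(heap, 0.35) / MB);
        System.out.println("upper limit in MB: " + limitBytes(heap, 0.4) / MB);
        System.out.println("full memstores at 64MB: "
            + fullMemstoresBeforePressure(heap, 0.35, 64 * MB));
        System.out.println("full memstores at 256MB: "
            + fullMemstoresBeforePressure(heap, 0.35, 256 * MB));
    }
}
```

&lt;div&gt;
Under these assumptions the lower limit works out to roughly 2.1GB, so only about eight memstores can reach 256MB before it is crossed. A regionserver writing to more regions than that would be expected to flush because of global heap pressure before any single memstore gets to its flush size.&lt;/div&gt;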
&lt;div&gt;
We were now producing fewer storefiles, but they were still accumulating too quickly. From ganglia we also saw that the compaction queue and storefile counts were growing unbounded, which meant we'd hit the blocking limit again eventually. Next we tried compacting more files per compaction by raising&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;compaction.max&lt;/span&gt; to 20, but this made little difference.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So how to reduce the number of storefiles? I&lt;span style=&quot;background-color:white;&quot;&gt;f we had fewer stores, we'd be creating fewer files and using up less heap for memstore, so next we increased the region size. &amp;nbsp;This meant increasing the setting &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hregion.max.filesize&lt;/span&gt; from its default of 256MB to 1.5GB, and then rebuilding our table with fewer pre-split regions. &amp;nbsp;That resulted in about 75% fewer regions.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;It was starting to look good - the number of &quot;Blocking updates&quot; log messages dropped to a handful per run, but it was still enough to affect one or two jobs to the point of them getting killed. &amp;nbsp;We tried upping&amp;nbsp;&lt;/span&gt;the&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;memstore.lowerLimit&lt;/span&gt;&amp;nbsp;and&amp;nbsp;&lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;upperLimit&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;&amp;nbsp;to 0.45 and 0.5 respectively, but again no joy.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;h3&gt;&lt;span style=&quot;background-color:white;&quot;&gt;Now what?&lt;/span&gt;&lt;/h3&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;Things looked kind of grim. After endless poring over ganglia charts, we kept coming back to one unexplained blip that seemed to coincide with the start of the storefile explosion that eventually killed the jobs.&lt;/span&gt;&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-u_kOdtOiPYE/T_w814ozOXI/AAAAAAAAAEE/0gZBUGWfVBA/s1600/hbase_write_flush.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;196&quot; src=&quot;http://3.bp.blogspot.com/-u_kOdtOiPYE/T_w814ozOXI/AAAAAAAAAEE/0gZBUGWfVBA/s400/hbase_write_flush.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;Figure 1: Average memstore flush size over time&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;At about the halfway point of the jobs the size of memstore flushes would spike and then gradually increase until the job died. Keep in mind that the chart shows averages, and it only took a few of those flushes to wait for storefiles long enough to fill to 1GB and then start the blocking that was our undoing. Back to the logs.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;From the start of Figure 1 we can see that things appear to be going smoothly - the memstores are flushing at or just above 256MB, which means they have enough heap and are doing their jobs. From the logs we see the flushes happening fine, but there are regular lines like the following:&lt;/span&gt;&lt;br /&gt;
&lt;pre&gt;INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap pressure: Region uat_occurrence,&amp;#92;x06&amp;#92;x0E&amp;#92;xAC&amp;#92;x0F,1341574060728.ab7fed6ea92842941f97cb9384ec3d4b. has too many store files, but is 625.1m vs best flushable region's 278.2m. Choosing the bigger.&lt;/pre&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;This isn't quite as bad as the &quot;delaying flush&quot; line, but it shows that we're on the limit of what our heap can handle. Then starting from around 12:20 we see more and more like the following:&lt;/span&gt;&lt;br /&gt;
&lt;pre&gt;WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region uat_occurrence,&quot;&amp;#92;x98=&amp;#92;x1C,1341567129447.a3a6557c609ad7fc38815fdcedca6c26. has too many store files; delaying flush up to 90000ms
&lt;/pre&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;and then to top it off:&lt;/span&gt;&lt;br /&gt;
&lt;pre&gt;INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): ab7fed6ea92842941f97cb9384ec3d4b&lt;span style=&quot;font-family:Times;&quot;&gt;&lt;span style=&quot;white-space:normal;&quot;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;
&lt;pre&gt;INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water.&lt;/pre&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;So what we have is memstores initially being forced to flush because of minor heap pressure (which adds storefiles faster than we can compact them). Then we have memstores delaying flushes because of too many storefiles (memstores start getting bigger - our graph spike). Then the write-ahead log (WAL) complains about too many of its logs, which forces a memstore flush (so that the WAL HLog can be safely discarded - this again adds storefiles). &amp;nbsp;And for good measure the flushing thread now wakes up, finds it's out of heap, and starts attempting flushes, which just aggravates the problem (adding more storefiles to the pile). The failure doesn't happen immediately but we're past the point of no return - by about 13:00 the memstores are getting to the &quot;memstore blocking limit&quot; and our mappers die.&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;h3&gt;&lt;span style=&quot;background-color:white;&quot;&gt;What's left?&lt;/span&gt;&lt;/h3&gt;
&lt;br /&gt;
Knowing what's going on in the memstore flush size analysis is comforting, but just reinforces what we already knew - the problem is too many storefiles. So what's left? &amp;nbsp;Raising the max number of storefiles, that's what! Every storefile in a store consumes resources (file handles, xceivers, heap for holding metadata), which is why it's limited to a maximum of a very modest 7 files by default. But for us this level of write load is rare - in normal operations we won't hit anything like this, and having the capacity to write a whole bunch of storefiles over short-ish, infrequent bursts is relatively safe, since we know our nightly major compaction will clean them up again (and thereby free up all those extra resources). Crank that maximum up to 200 and, finally, the process works!&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;/div&gt;
&lt;div&gt;
Our problem was that our compactions couldn't keep up with all the storefiles we were creating. We tried turning all the different HBase dials to get the storefile/compaction process to work &quot;like they're supposed to&quot;, but in the end the key for us was the&amp;nbsp;&lt;span style=&quot;background-color:white;font-family:'Courier New', Courier, monospace;&quot;&gt;hbase.hstore.blockingStoreFiles&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;background-color:white;&quot;&gt;parameter which we set to 200, which is probably double what we actually needed but gives us buffer for our infrequent, larger write jobs. We additionally settled on larger (and therefore fewer) regions, and a somewhat bigger than default memstore. Here are the relevant pieces of our hbase-site.xml after all our testing:&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;pre&gt;  &amp;lt;!-- default is 256MB 268435456, this is 1.5GB --&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;hbase.hregion.max.filesize&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;1610612736&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  
  &amp;lt;!-- default is 2 --&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;hbase.hregion.memstore.block.multiplier&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;4&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  
  &amp;lt;!-- default is 64MB 67108864, this is 128MB --&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;hbase.hregion.memstore.flush.size&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;134217728&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  
  &amp;lt;!-- default is 7, should be at least 2x compactionThreshold --&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;hbase.hstore.blockingStoreFiles&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;200&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
&lt;/pre&gt;
&lt;div&gt;
&lt;span style=&quot;background-color:white;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
And finally, if our compactions were faster and/or more frequent, we might be able to keep up with our storefile creation. That doesn't look possible without multi-threaded compactions, but naturally those exist in newer versions of HBase (starting with 0.92 - &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HBASE-1476&quot;&gt;HBASE-1476&lt;/a&gt;) so if you're having these problems, an upgrade might be in order. Indeed this is prompting us to consider an upgrade to CDH4.&lt;br /&gt;
&lt;br /&gt;
Many thanks to&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.larsgeorge.com/&quot;&gt;Lars George&lt;/a&gt;, who helped us get through the &quot;Now What?&quot; phase by digging deep into the logs and the source to help us work out what was going on.&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-3017085516315849334</guid>
         <pubDate>Wed, 11 Jul 2012 10:46:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://3.bp.blogspot.com/-u_kOdtOiPYE/T_w814ozOXI/AAAAAAAAAEE/0gZBUGWfVBA/s72-c/hbase_write_flush.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Launch of the Canadensys explorer</title>
         <link>http://gbif.blogspot.com/2012/06/at-canadensys-we-already-adopted-and.html</link>
         <description>&lt;p&gt;&lt;em&gt;At &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net&quot;&gt;Canadensys&lt;/a&gt; we already adopted and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.com/2011/07/customizing-ipt.html&quot;&gt;customized the IPT&lt;/a&gt; as &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.canadensys.net/ipt&quot;&gt;our data repository&lt;/a&gt;. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://speakerdeck.com/u/peterdesmet/p/canadensys-how-we-built-a-national-biodiversity-data-network&quot;&gt;this presentation&lt;/a&gt;. The post below originally appeared on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net/2012/canadensys-explorer-launch&quot;&gt;Canadensys blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;We are very pleased to announce the beta version of the &lt;strong&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.canadensys.net/explorer/&quot;&gt;Canadensys explorer&lt;/a&gt;&lt;/strong&gt;. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network.&lt;/p&gt;&lt;p&gt;The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are available on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.canadensys.net/ipt/&quot;&gt;Canadensys repository&lt;/a&gt; and via the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.gbif.org/&quot;&gt;Global Biodiversity Information Facility (GBIF)&lt;/a&gt; as well. 
The main functionalities of the explorer are listed below, but we encourage you to &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.canadensys.net/explorer&quot;&gt;discover them for yourself instead&lt;/a&gt;. We hope it is intuitive. For the best user experience, please use an up-to-date version of your browser.&lt;/p&gt;&lt;p&gt;Happy exploring: &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://data.canadensys.net/explorer&quot;&gt;http://data.canadensys.net/explorer&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net/wp-content/uploads/canadensys-explorer-launch-default.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;305&quot; src=&quot;http://www.canadensys.net/wp-content/uploads/canadensys-explorer-launch-default.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/p&gt;&lt;h2&gt;Functionalities&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;The explorer is a one page tool, limiting unnecessary navigation.&lt;/li&gt;
&lt;li&gt;The default view shows all the data, allowing users to get an overview and explore immediately.&lt;/li&gt;
&lt;li&gt;Data can be queried by using and combining filters.&lt;/li&gt;
&lt;li&gt;Filters use smart suggestions, informing the user how often their search term occurs even before they search.&lt;/li&gt;
&lt;li&gt;The exact number of results is displayed, including the number of georeferenced records.&lt;/li&gt;
&lt;li&gt;The map view displays all georeferenced records for the current query, has different base layer options and can be zoomed in to any level.&lt;/li&gt;
&lt;li&gt;Points on the map can be clicked for more information.&lt;/li&gt;
&lt;li&gt;The table view displays a sortable preview of the records in the current query.&lt;/li&gt;
&lt;li&gt;Records in the table can be clicked for more information in the same way as on the map.&lt;/li&gt;
&lt;li&gt;The number of columns in the table responds to the screen width and can be controlled by the user in the display panel.&lt;/li&gt;
&lt;li&gt;Data for the current query can be downloaded as a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.canadensys.net/darwin-core&quot;&gt;Darwin Core archive&lt;/a&gt;. There is no limit on the number of records that can be downloaded.&lt;/li&gt;
&lt;li&gt;Users can download the data by providing their email address. Once the download package is generated, the user receives an email with a link to the data, information regarding the usage norms and a suggested citation.&lt;/li&gt;
&lt;li&gt;The interface and emails are available in French and English.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;As this is a beta version, you may encounter &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/canadensys/issues/list?q=label:OccurrencePortal&quot;&gt;issues&lt;/a&gt;. Please report them by clicking the feedback button on the right, which will open a report form.&lt;/p&gt;&lt;h2&gt;Technical details&lt;/h2&gt;&lt;p&gt;The Canadensys explorer was developed using the following open source tools:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.springsource.org/&quot;&gt;Spring&lt;/a&gt;, a Java framework used for the backend.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/ThreeTen/threeten&quot;&gt;ThreeTen&lt;/a&gt;, a date/time Java library, used for cleaning the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/index.htm#eventDate&quot;&gt;eventDate&lt;/a&gt; field.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://code.google.com/p/gbif-common-resources/source/browse/gbif-parsers&quot;&gt;GBIF parsers&lt;/a&gt;, a Java library used for cleaning the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/index.htm#country&quot;&gt;country&lt;/a&gt; field.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/&quot;&gt;ECAT common&lt;/a&gt;, a Java library used for parsing the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://rs.tdwg.org/dwc/terms/index.htm#scientificName&quot;&gt;scientificName&lt;/a&gt; field.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/darwincore/wiki/DarwinCoreArchiveReader&quot;&gt;Darwin Core archive reader&lt;/a&gt;, a Java library used to import the data in the backend and to create the user generated downloads.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.postgresql.org/&quot;&gt;PostgreSQL&lt;/a&gt;, the database.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://postgis.refractions.net/&quot;&gt;PostGIS&lt;/a&gt;, a geospatial extension to the database.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/mapnik&quot;&gt;Mapnik&lt;/a&gt;, a tool to generate the maps and provide the map interactivity.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/Vizzuality/Windshaft&quot;&gt;Windshaft&lt;/a&gt;, a tile server.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://freemarker.sourceforge.net/&quot;&gt;Freemarker&lt;/a&gt;, a Java template engine, used to structure the frontend.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://backbonejs.org/&quot;&gt;Backbone.js&lt;/a&gt;, an MVC structure for web applications, used for handling the filters.&lt;/li&gt;
&lt;/ul&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Peter Desmet</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4724679218435538031</guid>
         <pubDate>Mon, 18 Jun 2012 21:03:00 +0000</pubDate>
      </item>
      <item>
         <title>Taxonomic Trees in PostgreSQL</title>
         <link>http://gbif.blogspot.com/2012/06/taxonomic-trees-in-postgresql.html</link>
         <description>&lt;style type=&quot;text/css&quot;&gt;
h1, h2, h3, h4, h5, p{
font-family:'Helvetica Neue', Arial, Helvetica, sans-serif;}
h4, h5{
margin-bottom:0em;}
h4{
text-transform:uppercase;margin-top:1.5em;color:#C60;}
p{
margin-top:0em;}
.code{
font-family:'Courier New', Courier, monospace;
}
&lt;/style&gt;

&lt;p&gt;
 Pro parte synonyms aside, taxonomic data follows a classic hierarchical tree structure. 
 In relational databases such a tree is commonly represented by one of three models, known as 
 the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Adjacency_list_model&quot;&gt;adjacency list&lt;/a&gt;, 
 the materialized path and the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Nested_set_model&quot;&gt;nested set&lt;/a&gt; model. 
 There are many comparisons out there listing pros and cons, 
 for example on &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database&quot;&gt;Stack Overflow&lt;/a&gt;, 
 the slides by &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.slideshare.net/quipo/trees-in-the-database-advanced-data-structures&quot;&gt;Lorenzo Alberton&lt;/a&gt; 
 or &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.slideshare.net/billkarwin/models-for-hierarchical-data&quot;&gt;Bill Karwin&lt;/a&gt; 
 or a &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/&quot;&gt;postgres specific performance comparison&lt;/a&gt; between the adjacency model and a nested set.
&lt;/p&gt;
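&lt;p&gt;
 To make the difference between the three models concrete, here is a minimal in-memory sketch in Java. It is not the Checklist Bank schema: the class, the field names and the four sample taxa are made up for illustration, and each row simply carries all three encodings side by side so that the same descendant test can be written once per model.
&lt;/p&gt;

```java
// Illustrative sketch: a tiny in-memory tree, not the Checklist Bank schema.
final class TaxonTreeSketch {

    static final class Taxon {
        final int id;
        final Integer parentId; // adjacency list: null for a root
        final int[] path;       // materialized path: ancestor ids, root first
        final int lft;          // nested set: left bound
        final int rgt;          // nested set: right bound
        final String name;

        Taxon(int id, Integer parentId, int[] path, int lft, int rgt, String name) {
            this.id = id;
            this.parentId = parentId;
            this.path = path;
            this.lft = lft;
            this.rgt = rgt;
            this.name = name;
        }
    }

    // Hypothetical sample data: Animalia with children Chordata (child Aves) and Arthropoda.
    static Taxon[] sample() {
        return new Taxon[] {
            new Taxon(1, null, new int[] {}, 1, 8, "Animalia"),
            new Taxon(2, 1, new int[] {1}, 2, 5, "Chordata"),
            new Taxon(3, 2, new int[] {1, 2}, 3, 4, "Aves"),
            new Taxon(4, 1, new int[] {1}, 6, 7, "Arthropoda")
        };
    }

    static Taxon findById(Taxon[] rows, int id) {
        for (Taxon t : rows) {
            if (t.id == id) {
                return t;
            }
        }
        return null;
    }

    // Adjacency list: follow parent pointers upwards, one lookup per level.
    static boolean descendantByAdjacency(Taxon[] rows, Taxon node, Taxon ancestor) {
        Taxon current = node;
        while (current != null) {
            if (current.parentId == null) {
                return false; // reached a root without meeting the ancestor
            }
            if (current.parentId == ancestor.id) {
                return true;
            }
            current = findById(rows, current.parentId);
        }
        return false;
    }

    // Materialized path: the ancestor's id must appear in the node's stored path.
    static boolean descendantByPath(Taxon node, Taxon ancestor) {
        for (int ancestorId : node.path) {
            if (ancestorId == ancestor.id) {
                return true;
            }
        }
        return false;
    }

    // Nested set: the node's interval must lie strictly inside the ancestor's.
    static boolean descendantByNestedSet(Taxon node, Taxon ancestor) {
        if (node.lft > ancestor.lft) {
            return ancestor.rgt > node.rgt;
        }
        return false;
    }

    public static void main(String[] args) {
        Taxon[] rows = sample();
        Taxon animalia = rows[0];
        Taxon aves = rows[2];
        System.out.println("adjacency:  " + descendantByAdjacency(rows, aves, animalia));
        System.out.println("path:       " + descendantByPath(aves, animalia));
        System.out.println("nested set: " + descendantByNestedSet(aves, animalia));
    }
}
```

&lt;p&gt;
 The adjacency version needs one lookup per tree level, which is what a recursive query does in a database, while the other two answer the question from the row itself at the cost of extra columns that have to be maintained on every change to the tree.
&lt;/p&gt;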

&lt;h3&gt;Checklist Bank&lt;/h3&gt;
&lt;p&gt;
 At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ecat-dev.gbif.org/about/intro&quot;&gt;Checklist Bank&lt;/a&gt;. 
 At the core there is a single table &lt;span class=&quot;code&quot;&gt;name_usage&lt;/span&gt; which contains records each representing a single taxon in the tree 
 [note: in this post I am using the term taxon broadly covering both accepted taxa and synonyms]. 
 It primarily uses the adjacency model with a single foreign key &lt;span class=&quot;code&quot;&gt;parent_fk&lt;/span&gt; which is null for the root elements of the tree. 
 The simplified diagram of the main tables looks like this (some 20 extra fixed-width columns of name_usage are left out here for simplicity):
&lt;/p&gt;
&lt;p&gt;
 &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-SgnsrVVAhxI/T9HzfmiPgCI/AAAAAAAAD88/uxdE6n4agjE/s1600/NameUsage.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;
  &lt;img border=&quot;0&quot; height=&quot;146&quot; src=&quot;http://1.bp.blogspot.com/-SgnsrVVAhxI/T9HzfmiPgCI/AAAAAAAAD88/uxdE6n4agjE/s320/NameUsage.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
 For certain searches though an additional index is required. In particular listing all descendants of a taxon, 
 i.e. all members of a subtree, is a common operation that would otherwise involve a recursive function 
 or &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.postgresql.org/docs/9.1/static/queries-with.html&quot;&gt;Common Table Expression&lt;/a&gt;. 
 So far we have been using nested sets, but after running into some badly performing queries lately I decided to do a quick evaluation of different options for Postgres:
&lt;/p&gt;
&lt;ol&gt;
 &lt;li&gt;the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.postgresql.org/docs/current/static/ltree.html&quot;&gt;ltree&lt;/a&gt; extension which implements a materialized path and provides many powerful operations.&lt;/li&gt;
 &lt;li&gt;the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.postgresql.org/docs/9.1/static/intarray.html&quot;&gt;intarray&lt;/a&gt; extension to manually manage a materialized path as an array of non null integers - the primary keys of the parent taxa.&lt;/li&gt;
 &lt;li&gt;a simple varchar field holding the same materialized path&lt;/li&gt;
 &lt;li&gt;the current nested set index using lft and rgt integer columns, which are only unique within a single taxonomy and therefore have to be combined with the checklist_fk.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Test Environment&lt;/h3&gt;
&lt;p&gt;
 For the tests I am using the current Checklist Bank which contains 14,907,828 records in total spread across 107 checklists of varying size between 7 and 4.25 million records. 
 I am querying the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ecat-dev.gbif.org/checklist/1&quot;&gt;GBIF Backbone Taxonomy&lt;/a&gt;
 which is the largest checklist containing 4,251,163 records with 9 root taxa and a maximum depth of 11 levels. 
 For the queries I have picked 2 specific taxa at different positions in the taxonomic tree:
 &lt;ol&gt;
  &lt;li&gt;44 &lt;i&gt;&lt;strong&gt;Vertebrata&lt;/strong&gt;&lt;/i&gt; the class covering all vertebrates, with approximately 84,100 species in this nub.&lt;/li&gt;
  &lt;li&gt;2684876 &lt;i&gt;&lt;strong&gt;Abies&lt;/strong&gt;&lt;/i&gt; the fir genus covering 167 species in this nub and exactly 609 descendants.&lt;/li&gt;
 &lt;/ol&gt;
 For most of our real queries we provide paging for larger results. I will therefore use a limit of 100 and offset of 0 for all queries below.
&lt;/p&gt;
&lt;p&gt;
 All tests are executed on a 2 GHz i7 MacBook Pro with 8 GB of memory running Postgres 9.1.1. 
 All queries were executed 3 times before &lt;span class=&quot;code&quot;&gt;EXPLAIN ANALYZE&lt;/span&gt; was used to get a rough idea of how differently the various indices behave.
&lt;/p&gt;

&lt;h3&gt;Creating the indices&lt;/h3&gt;
&lt;p&gt;
In order to use the extensions, they need to be compiled at installation time and enabled in the database. In Postgres 9.1 you can do the latter by executing in psql:
&lt;/p&gt;

&lt;pre class=&quot;brush:sql&quot;&gt;
  CREATE EXTENSION ltree;
  CREATE EXTENSION intarray;
&lt;/pre&gt; 
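
&lt;p&gt;
 The index definitions below assume that the tree columns already exist on name_usage. 
 A minimal sketch of adding them; the column types are assumptions inferred from the index definitions and queries in this post, not copied from the real Checklist Bank schema:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- assumed column types, for illustration only
 ALTER TABLE name_usage
   ADD COLUMN path   ltree,      -- ltree materialized path (needs the ltree extension)
   ADD COLUMN mpath  integer[],  -- materialized path as an array of ids, root first, record itself last
   ADD COLUMN mspath varchar,    -- the same path as a string, ids joined by '.'
   ADD COLUMN lft    integer,    -- nested set left boundary
   ADD COLUMN rgt    integer;    -- nested set right boundary
&lt;/pre&gt;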

&lt;p&gt;
 The individual data types require different indices. We set up the following indices:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- general indices
 CREATE INDEX nu_chkl_idx ON name_usage (checklist_fk);
 CREATE INDEX nu_parent_idx ON name_usage (parent_fk);
 -- tree specifics
 CREATE INDEX nu_path_idx ON name_usage USING GIST (path);
 CREATE INDEX nu_mpath_idx ON name_usage USING GIN (mpath gin__int_ops);
 CREATE INDEX nu_mspath_idx ON name_usage (mspath);
 CREATE INDEX nu_ckl_lft_rgt_idx ON name_usage (checklist_fk, lft, rgt);
&lt;/pre&gt;

&lt;p&gt;
 The &lt;i&gt;ltree&lt;/i&gt; GIST and &lt;i&gt;intarray&lt;/i&gt; GIN indices are rather expensive to create/maintain, but they are selected for best read performance.
&lt;/p&gt;


&lt;h3&gt;Populating the indices&lt;/h3&gt;
&lt;p&gt;
 The data being imported into Checklist Bank comes with a parent-child relation, so parent_fk is populated already.
 The columns behind all other indices have to be populated first.
&lt;/p&gt;

&lt;h4&gt;ltree, intarray, string path&lt;/h4&gt;
&lt;p&gt;
 For all materialized paths I simply run the following SQL, repeating the second statement until no more records are updated:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
-- update root paths once
UPDATE name_usage u SET 
 path = text2ltree(cast(u.id as text)), 
 mspath=u.id, 
 mpath = array[u.id] 
WHERE u.parent_fk is null;
-- update until no more records are updated
UPDATE name_usage u SET 
 path = p.path || text2ltree(cast(u.id as text)), 
 mspath=p.mspath || '.' || u.id, 
 mpath = p.mpath || array[u.id] 
FROM name_usage p 
WHERE u.parent_fk=p.id AND p.mspath is not null AND u.mspath is null;
&lt;/pre&gt;
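
&lt;p&gt;
 Instead of re-running the second statement by hand, the repetition can be wrapped in an anonymous block. 
 This is only a sketch, assuming Postgres 9.0 or later (for DO) and that the root paths have already been set; each pass fills in one more level of the tree:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
DO $$
DECLARE
  n integer;  -- number of records updated in the last pass
BEGIN
  LOOP
    UPDATE name_usage u SET 
     path = p.path || text2ltree(cast(u.id as text)), 
     mspath=p.mspath || '.' || u.id, 
     mpath = p.mpath || array[u.id] 
    FROM name_usage p 
    WHERE u.parent_fk=p.id AND p.mspath is not null AND u.mspath is null;
    GET DIAGNOSTICS n = ROW_COUNT;
    -- stop once a pass no longer finds children without a path
    EXIT WHEN n = 0;
  END LOOP;
END
$$;
&lt;/pre&gt;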


&lt;h4&gt;nested sets&lt;/h4&gt;
&lt;p&gt;
 I've created 2 functions and one sequence to populate the lft/rgt indices for every checklist:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
-- NESTED SET UTILITY FUNCTIONS
CREATE SEQUENCE name_usage_nestidx START 1;

-- numbers one record and, depth first, all of its descendants
CREATE FUNCTION update_nested_set_usage(integer) RETURNS BOOLEAN AS $$
  BEGIN
    UPDATE name_usage SET lft = nextval('name_usage_nestidx')-1 WHERE id=$1;
    PERFORM update_nested_set_usage(id) FROM name_usage WHERE parent_fk=$1 ORDER BY rank_fk, name_fk;
    UPDATE name_usage SET rgt = nextval('name_usage_nestidx')-1 WHERE id=$1;
    RETURN true;
  END
$$ LANGUAGE plpgsql;

-- resets the sequence and numbers all root records of one checklist
CREATE FUNCTION build_nested_set_indices(integer) RETURNS BOOLEAN AS $$
  BEGIN
    PERFORM setval('name_usage_nestidx', 1);
    PERFORM update_nested_set_usage(id) FROM name_usage WHERE parent_fk is null and checklist_fk=$1 ORDER BY rank_fk, name_fk;
    RETURN true;
  END
$$ LANGUAGE plpgsql;
&lt;/pre&gt;
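
&lt;p&gt;
 A usage sketch, assuming the backbone is the checklist with checklist_fk 1 as in the hardcoded nested set query further down. 
 The second statement is a hypothetical sanity check, not part of the original setup; it should return 0 as long as every record of the checklist hangs off one of its root records:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- rebuild lft/rgt for checklist 1 (assumed to be the GBIF Backbone)
 SELECT build_nested_set_indices(1);
 -- count records the traversal missed or numbered inconsistently
 SELECT count(*)
  FROM name_usage
  WHERE checklist_fk=1 AND (lft is null OR rgt is null OR rgt &amp;lt;= lft);
&lt;/pre&gt;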



&lt;h3&gt;Query for Descendants&lt;/h3&gt;
&lt;p&gt;The queries return the first page of 100 descendant records.
&lt;/p&gt;

&lt;h4&gt;adjacency&lt;/h4&gt;
&lt;p&gt;
A recursive postgres CTE query does the trick:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 WITH RECURSIVE d AS (
   SELECT id
    FROM name_usage
    WHERE id = 44
  UNION ALL
   SELECT c.id
    FROM d JOIN name_usage c ON c.parent_fk = d.id
 )
 SELECT * FROM d
  ORDER BY id
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=7.306..7.337 rows=100 loops=1)
   CTE d
     -&amp;gt;  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.041..6.017 rows=609 loops=1)
           -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.038..0.040 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           -&amp;gt;  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.240..1.393 rows=152 loops=4)
                 -&amp;gt;  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.041 rows=152 loops=4)
                 -&amp;gt;  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.006..0.008 rows=1 loops=609)
                       Index Cond: (parent_fk = d.id)
   -&amp;gt;  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=7.304..7.316 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.046..6.678 rows=609 loops=1)
 Total runtime: 7.559 ms
&lt;/pre&gt;
&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=2065.053..2065.080 rows=100 loops=1)
   CTE d
     -&amp;gt;  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.105..1797.898 rows=264325 loops=1)
           -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.101..0.103 rows=1 loops=1)
                 Index Cond: (id = 44)
           -&amp;gt;  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.915..182.695 rows=29369 loops=9)
                 -&amp;gt;  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.007..8.056 rows=29369 loops=9)
                 -&amp;gt;  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.004..0.005 rows=1 loops=264325)
                       Index Cond: (parent_fk = d.id)
   -&amp;gt;  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=2065.050..2065.062 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.109..1986.606 rows=264325 loops=1)
 Total runtime: 2080.248 ms
&lt;/pre&gt;
&lt;p&gt;
 As expected the recursive query does a pretty good job if the subtree is small (about 7.6 ms for Abies). 
 But for the large vertebrate subtree it's rather slow (about 2 seconds) because we first get all 264,325 rows of the subtree, sort them and only then apply the limit.
&lt;/p&gt;
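
&lt;p&gt;
 If the order of the page does not matter, dropping the ORDER BY should let the limit cut the recursion short, 
 since Postgres evaluates only as many rows of a WITH query as the parent query actually fetches. 
 A sketch that was not benchmarked here; which 100 records come back is then not guaranteed to be stable, so it is a poor fit for paging:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 WITH RECURSIVE d AS (
   SELECT id
    FROM name_usage
    WHERE id = 44
  UNION ALL
   SELECT c.id
    FROM d JOIN name_usage c ON c.parent_fk = d.id
 )
 -- no ORDER BY: the recursion can stop once 100 rows have been produced
 SELECT * FROM d
  LIMIT 100;
&lt;/pre&gt;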



&lt;h4&gt;ltree&lt;/h4&gt;
&lt;p&gt;
 With ltree you have various options to query for a subtree. 
 You can use a path lquery with ~, a full text ltxtquery via @ or the ltree &amp;lt;@ isDescendant operator.
&lt;/p&gt;
&lt;p&gt;
 Unanchored lqueries for any path containing a given node turned out to be very, very slow. 
 This is somewhat expected because such a query can't use the index properly. 
 Surprisingly, even the anchored queries and the full text query were far too slow to be useful at all.
 The fastest option definitely was the native descendants operator:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 WITH p AS (
   SELECT path FROM name_usage WHERE id=44
 )
 SELECT u.id 
  FROM name_usage u, p 
  WHERE u.path &amp;lt;@ p.path
  ORDER BY u.id
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;
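
&lt;p&gt;
 For comparison, the slower variants mentioned above would look roughly like this. 
 These are sketches that assume the Vertebrata path is 1.44 (the ids that also show up in the string path further down); the exact queries that were timed are not shown in this post:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- unanchored lquery: any path containing the label 44
 SELECT id FROM name_usage WHERE path ~ '*.44.*'::lquery;
 -- anchored lquery: paths starting with 1.44
 SELECT id FROM name_usage WHERE path ~ '1.44.*'::lquery;
 -- full text ltxtquery: paths containing the label 44
 SELECT id FROM name_usage WHERE path @ '44'::ltxtquery;
&lt;/pre&gt;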

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=2.926..2.963 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.030..0.034 rows=1 loops=1)
           Index Cond: (id = 2684876)
   -&amp;gt;  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=2.923..2.942 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=1.094..2.230 rows=609 loops=1)
               -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.035..0.040 rows=1 loops=1)
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=1.051..1.952 rows=609 loops=1)
                     Recheck Cond: (path &amp;lt;@ p.path)
                     -&amp;gt;  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=1.023..1.023 rows=609 loops=1)
                           Index Cond: (path &amp;lt;@ p.path)
 Total runtime: 3.068 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=512.420..512.445 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.091..0.094 rows=1 loops=1)
           Index Cond: (id = 44)
   -&amp;gt;  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=512.417..512.432 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=115.119..428.632 rows=264325 loops=1)
               -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.096..0.100 rows=1 loops=1)
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=115.016..372.070 rows=264325 loops=1)
                     Recheck Cond: (path &amp;lt;@ p.path)
                     -&amp;gt;  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=109.791..109.791 rows=264325 loops=1)
                           Index Cond: (path &amp;lt;@ p.path)
 Total runtime: 512.723 ms
&lt;/pre&gt;





&lt;h4&gt;intarray&lt;/h4&gt;
&lt;p&gt;
 As a node in the tree appears only once, we can query for all usages that have the node id in their array but are not that very record.
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT u.id 
  FROM name_usage u
  WHERE u.mpath @@ '44' and u.id != 44
  ORDER BY u.id
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=1.925..1.966 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=1.923..1.942 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=0.682..1.219 rows=608 loops=1)
               Recheck Cond: (mpath @@ '2684876'::query_int)
               Filter: (id &amp;lt;&amp;gt; 2684876)
               -&amp;gt;  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=0.646..0.646 rows=609 loops=1)
                     Index Cond: (mpath @@ '2684876'::query_int)
 Total runtime: 2.052 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=377.851..377.877 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=377.849..377.861 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=115.634..298.193 rows=264324 loops=1)
               Recheck Cond: (mpath @@ '44'::query_int)
               Filter: (id &amp;lt;&amp;gt; 44)
               -&amp;gt;  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=110.776..110.776 rows=264325 loops=1)
                     Index Cond: (mpath @@ '44'::query_int)
 Total runtime: 378.131 ms
&lt;/pre&gt;
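
&lt;p&gt;
 The intarray GIN operator class also supports the array containment operator, so the same subtree can be expressed without a query_int. 
 This is a sketch only and was not part of the timings above:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- all records whose path array contains 44, except 44 itself
 SELECT u.id 
  FROM name_usage u
  WHERE u.mpath @&amp;gt; array[44] and u.id != 44
  ORDER BY u.id
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;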



&lt;h4&gt;string path&lt;/h4&gt;
&lt;p&gt;
 Just for completeness and to compare relative performance I am trying an anchored pattern match against a varchar-based materialized path.
 In order to let postgres use the mspath index we must also order by that value, not id:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 WITH p AS (
  SELECT mspath FROM name_usage where id=44
 )
 SELECT u.id 
  FROM name_usage u, p
  WHERE u.mspath LIKE p.mspath || '.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;
&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=10.68..80535.72 rows=100 width=36)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=32)
           Index Cond: (id = 2684876)
   -&amp;gt;  Nested Loop  (cost=0.00..60022560.36 rows=74539 width=36)
         Join Filter: (u.mspath ~~ (p.mspath || '.%'::text))
         -&amp;gt;  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..59649864.65 rows=14907828 width=36)
         -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32)
&lt;/pre&gt;
&lt;p&gt;
 I don't know what is going on here, but I can't make postgres use the mspath btree index properly.
 The query therefore is very, very slow. If I use a hardcoded path instead of p.mspath the index is used.
 I'll show results here for the hardcoded path now - anyone having an idea on how to avoid the table scan in the above sql please let me know.
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT u.id 
  FROM name_usage u
  WHERE u.mspath LIKE '6.101.194.640.3925.2684876.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.105..0.549 rows=100 loops=1)
   -&amp;gt;  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.102..0.516 rows=100 loops=1)
         Index Cond: ((mspath &amp;gt;= '6.101.194.640.3925.2684876.'::text) AND (mspath &amp;lt; '6.101.194.640.3925.2684876/'::text))
         Filter: (mspath ~~ '6.101.194.640.3925.2684876.%'::text)
 Total runtime: 0.637 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.076..0.472 rows=100 loops=1)
   -&amp;gt;  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.074..0.435 rows=100 loops=1)
         Index Cond: ((mspath &amp;gt;= '1.44.'::text) AND (mspath &amp;lt; '1.44/'::text))
         Filter: (mspath ~~ '1.44.%'::text)
 Total runtime: 0.561 ms
&lt;/pre&gt;
&lt;p&gt;
 The hardcoded results are extremely fast, much faster than any of the specialised indices above.
 Retrying ltree, for example, with a hardcoded path doesn't speed up the ltree query. 
 The vast speed gain here is because postgres can use the b-tree index for a sorted output and therefore really benefit from the limit.
 GiST and GIN indices on the other hand are not suitable for sorting. 
&lt;/p&gt;
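
&lt;p&gt;
 One possible way to keep the lookup of the parent path in the same statement is to spell out the range that the planner derives from the hardcoded pattern, as seen in the Index Cond above. 
 This is an untested sketch: it assumes the same byte-wise string ordering in which '/' is the next character after '.', 
 and whether the planner then picks the ordered index scan would need to be checked with EXPLAIN ANALYZE:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT u.id 
  FROM name_usage p 
   -- explicit lower and upper bound instead of LIKE p.mspath || '.%'
   JOIN name_usage u ON u.mspath &amp;gt;= (p.mspath || '.') AND u.mspath &amp;lt; (p.mspath || '/')
  WHERE p.id=44
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;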



&lt;h4&gt;nested set&lt;/h4&gt;
&lt;p&gt;
 This is a very different query model from the ones above, which all use some sort of materialized path.
 We need to also add the checklist_fk condition because the nested set indices are only unique within each taxonomy, not across all.
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT u.id 
  FROM name_usage u JOIN name_usage p ON u.checklist_fk=p.checklist_fk  
  WHERE  p.id=44 and u.lft BETWEEN p.lft and p.rgt
  ORDER BY u.lft
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=2.030..2.076 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=2.029..2.052 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.062..1.482 rows=609 loops=1)
               -&amp;gt;  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.028..0.029 rows=1 loops=1)
                     Index Cond: (id = 2684876)
               -&amp;gt;  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.029..1.182 rows=609 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft &amp;gt;= p.lft) AND (lft &amp;lt;= p.rgt))
 Total runtime: 2.206 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=475.811..475.829 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=475.809..475.817 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.158..381.822 rows=264325 loops=1)
               -&amp;gt;  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.075..0.077 rows=1 loops=1)
                     Index Cond: (id = 44)
               -&amp;gt;  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.077..323.519 rows=264325 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft &amp;gt;= p.lft) AND (lft &amp;lt;= p.rgt))
 Total runtime: 475.951 ms
&lt;/pre&gt;

&lt;p&gt;
 Again the tricky part was making postgres use the lft/rgt index properly when using the order by.
 Removing the order by turns this query into a super fast one even for the vertebrate query (only 0.4 ms for either of the two).
 If we also use hardcoded values instead of a joined p table we can get even better performance than with the string path model.
 For example the vertebrate query would look like this:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
SELECT u.id 
 FROM name_usage u 
 WHERE  u.checklist_fk=1 and u.lft BETWEEN 517646 and 1046295
 ORDER BY u.lft
 LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=0.00..337.88 rows=100 width=8) (actual time=0.080..0.395 rows=100 loops=1)
   -&amp;gt;  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..1566342.54 rows=463573 width=8) (actual time=0.078..0.356 rows=100 loops=1)
         Index Cond: ((checklist_fk = 1) AND (lft &amp;gt;= 517646) AND (lft &amp;lt;= 1046295))
 Total runtime: 0.480 ms
&lt;/pre&gt;
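
&lt;p&gt;
 A side effect of the nested set numbering is that the size of a subtree follows from the boundaries alone: 
 every record consumes two consecutive sequence values, so a node has (rgt - lft - 1) / 2 descendants. 
 A small sketch, using the Vertebrata boundaries from the query above as the worked example:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- for lft=517646 and rgt=1046295: (1046295 - 517646 - 1) / 2 = 264324
 SELECT (rgt - lft - 1) / 2 AS descendants
  FROM name_usage
  WHERE id=44;
&lt;/pre&gt;
&lt;p&gt;
 That is the same 264324 rows the intarray plan reports once the record itself is filtered out.
&lt;/p&gt;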







&lt;h3&gt;Query for Ancestors&lt;/h3&gt;
&lt;p&gt;
 The ancestor query does not use any limit as it's only a few records and we always want all of them.
 All materialized paths already have the ancestors - they are the path.
&lt;/p&gt;
&lt;h4&gt;adjacency&lt;/h4&gt;
&lt;p&gt;
 A recursive CTE query:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 WITH RECURSIVE a AS (
  SELECT id, parent_fk, rank_fk
   FROM name_usage
   WHERE id = 44
 UNION ALL
  SELECT p.id, p.parent_fk, p.rank_fk
   FROM a JOIN name_usage p ON a.parent_fk = p.id
 )
 SELECT * FROM a
 WHERE id!=44
 ORDER BY rank_fk;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.155..0.156 rows=5 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     -&amp;gt;  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.030..0.117 rows=6 loops=1)
           -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.027..0.029 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           -&amp;gt;  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.011..0.012 rows=1 loops=6)
                 -&amp;gt;  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=6)
                 -&amp;gt;  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.007..0.008 rows=1 loops=6)
                       Index Cond: (id = a.parent_fk)
   -&amp;gt;  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.068..0.141 rows=5 loops=1)
         Filter: (id &amp;lt;&amp;gt; 2684876)
 Total runtime: 0.265 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.104..0.104 rows=1 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     -&amp;gt;  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.040..0.077 rows=2 loops=1)
           -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.037..0.040 rows=1 loops=1)
                 Index Cond: (id = 44)
           -&amp;gt;  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.013..0.015 rows=0 loops=2)
                 -&amp;gt;  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=2)
                 -&amp;gt;  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.008..0.009 rows=0 loops=2)
                       Index Cond: (id = a.parent_fk)
   -&amp;gt;  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.079..0.092 rows=1 loops=1)
         Filter: (id &amp;lt;&amp;gt; 44)
 Total runtime: 0.232 ms
&lt;/pre&gt;
&lt;p&gt;The trees are not very deep, so even the genus query only has to do 5 recursive calls to return 5 parents.&lt;/p&gt;



&lt;h4&gt;ltree&lt;/h4&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
  select path FROM name_usage WHERE id=44;
&lt;/pre&gt;
&lt;p&gt;The path contains all ancestor ids. Parsing it into separate ids within sql is not obvious at first glance though.&lt;/p&gt;
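
&lt;p&gt;
 One way to split the path inside SQL is to go through its text form. 
 A sketch using ltree2text and string_to_array; note that the last element is the record itself:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 -- ids as an integer array, e.g. {1,44} for a path of 1.44
 SELECT string_to_array(ltree2text(path), '.')::integer[] AS ids
  FROM name_usage WHERE id=44;
&lt;/pre&gt;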



&lt;h4&gt;intarray&lt;/h4&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT mpath FROM name_usage WHERE id=44;
&lt;/pre&gt;
&lt;p&gt;The path contains all ancestor ids. Iterating over each id entry is simple.&lt;/p&gt;
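
&lt;p&gt;
 For example, the ancestor records can be fetched by joining on the array. 
 A sketch that mirrors the adjacency query above by excluding the record itself and ordering by rank; it was not timed here:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT a.id, a.rank_fk
  FROM name_usage u JOIN name_usage a ON a.id = ANY (u.mpath)
  WHERE u.id=44 and a.id!=44
  ORDER BY a.rank_fk;
&lt;/pre&gt;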


 

&lt;h4&gt;nested set&lt;/h4&gt;
&lt;p&gt;
 To find out all ancestors of a given node, we just select all nodes that contain its LFT boundary (which in a properly built hierarchy implies containing the RGT boundary too):
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
SELECT p.id, p.rank_fk, p.lft
 FROM  name_usage u JOIN name_usage p ON u.lft BETWEEN p.lft and p.rgt AND u.checklist_fk=p.checklist_fk 
 WHERE u.id=44 ORDER BY 3;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=607.957..607.958 rows=6 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   -&amp;gt;  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=415.862..607.938 rows=6 loops=1)
         -&amp;gt;  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.046..0.047 rows=1 loops=1)
               Index Cond: (id = 2684876)
         -&amp;gt;  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=415.809..607.880 rows=6 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft &amp;gt;= lft) AND (u.lft &amp;lt;= rgt))
 Total runtime: 608.034 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=38.428..38.428 rows=2 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   -&amp;gt;  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=0.111..38.410 rows=2 loops=1)
         -&amp;gt;  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
               Index Cond: (id = 44)
         -&amp;gt;  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=0.084..38.378 rows=2 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft &amp;gt;= lft) AND (u.lft &amp;lt;= rgt))
 Total runtime: 38.504 ms
&lt;/pre&gt;
&lt;p&gt;
 Not very efficient: returning a handful of ancestors takes about 608 ms for Abies and 38 ms for Vertebrata.
&lt;/p&gt;






&lt;h3&gt;Query for Children&lt;/h3&gt;

&lt;h4&gt;adjacency&lt;/h4&gt;
&lt;p&gt;
 A perfect fit:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
 SELECT u.id 
  FROM name_usage  u 
  WHERE parent_fk=44 
  ORDER BY u.id
  LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=640.97..641.22 rows=100 width=4) (actual time=0.566..0.623 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=640.97..641.67 rows=279 width=4) (actual time=0.564..0.588 rows=100 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 32kB
         -&amp;gt;  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..630.31 rows=279 width=4) (actual time=0.046..0.312 rows=167 loops=1)
               Index Cond: (parent_fk = 2684876)
 Total runtime: 0.696 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=16863.12..16863.37 rows=100 width=4) (actual time=8.783..8.811 rows=100 loops=1)
   -&amp;gt;  Sort  (cost=16863.12..16881.75 rows=7454 width=4) (actual time=8.780..8.793 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         -&amp;gt;  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..16578.23 rows=7454 width=4) (actual time=0.060..5.169 rows=6083 loops=1)
               Index Cond: (parent_fk = 44)
 Total runtime: 8.867 ms
&lt;/pre&gt;



&lt;h4&gt;ltree&lt;/h4&gt;
&lt;p&gt;
 Search for all descendants that have one more node:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
WITH p AS (
 SELECT path FROM name_usage WHERE id=2684876
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE u.path &amp;lt;@ p.path and nlevel(u.path)=nlevel(p.path)+1
 ORDER BY u.path
 LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=4.118..4.161 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.029..0.031 rows=1 loops=1)
           Index Cond: (id = 2684876)
   -&amp;gt;  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=4.116..4.134 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: quicksort  Memory: 48kB
         -&amp;gt;  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=1.230..3.241 rows=167 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.037..0.040 rows=1 loops=1)
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=1.175..2.274 rows=609 loops=1)
                     Recheck Cond: (path &amp;lt;@ p.path)
                     -&amp;gt;  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=1.146..1.146 rows=609 loops=1)
                           Index Cond: (path &amp;lt;@ p.path)
 Total runtime: 4.321 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=600.793..600.816 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.048..0.051 rows=1 loops=1)
           Index Cond: (id = 44)
   -&amp;gt;  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=600.792..600.805 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: top-N heapsort  Memory: 32kB
         -&amp;gt;  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=112.999..595.738 rows=6083 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.053..0.057 rows=1 loops=1)
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=112.924..413.889 rows=264325 loops=1)
                     Recheck Cond: (path &amp;lt;@ p.path)
                     -&amp;gt;  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=107.524..107.524 rows=264325 loops=1)
                           Index Cond: (path &amp;lt;@ p.path)
 Total runtime: 600.980 ms
&lt;/pre&gt;



&lt;h4&gt;intarray&lt;/h4&gt;
&lt;p&gt;
 Search for arrays that contain the parent id and hold exactly one more element than the parent's array:
&lt;/p&gt;
&lt;pre class=&quot;brush:sql&quot;&gt;
WITH p AS (
 SELECT mpath FROM name_usage WHERE id=44
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE  u.mpath @@ '44' and #u.mpath= #p.mpath+1
 ORDER BY u.mpath
 LIMIT 100 OFFSET 0;
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Abies&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=2.026..2.056 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.018..0.019 rows=1 loops=1)
           Index Cond: (id = 2684876)
   -&amp;gt;  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=2.024..2.042 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: quicksort  Memory: 48kB
         -&amp;gt;  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=0.341..1.352 rows=167 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=0.292..0.874 rows=609 loops=1)
                     Recheck Cond: (mpath @@ '2684876'::query_int)
                     -&amp;gt;  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=0.279..0.279 rows=609 loops=1)
                           Index Cond: (mpath @@ '2684876'::query_int)
               -&amp;gt;  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.038..0.038 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)
 Total runtime: 2.144 ms
&lt;/pre&gt;

&lt;h5&gt;Query Plan: Vertebrata&lt;/h5&gt;
&lt;pre class=&quot;brush:plain&quot;&gt;
 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=501.304..501.327 rows=100 loops=1)
   CTE p
     -&amp;gt;  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.075..0.078 rows=1 loops=1)
           Index Cond: (id = 44)
   -&amp;gt;  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=501.303..501.316 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: top-N heapsort  Memory: 32kB
         -&amp;gt;  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=116.159..495.063 rows=6083 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               -&amp;gt;  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=116.030..391.099 rows=264325 loops=1)
                     Recheck Cond: (mpath @@ '44'::query_int)
                     -&amp;gt;  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=111.387..111.387 rows=264325 loops=1)
                           Index Cond: (mpath @@ '44'::query_int)
               -&amp;gt;  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.101..0.101 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     -&amp;gt;  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.082..0.086 rows=1 loops=1)
 Total runtime: 501.517 ms
&lt;/pre&gt;



&lt;h4&gt;nested set&lt;/h4&gt;
&lt;p&gt;
 Listing children is a difficult task with nested sets. I will leave this as a challenge for some later time.
&lt;/p&gt;



&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;
 No single model produces the best performing queries for every task. 
 The adjacency model is unbeatable for &lt;strong&gt;listing children&lt;/strong&gt;, and with recursive queries it also performs very well for retrieving all ancestors, provided the taxonomic tree is not too deep.
&lt;/p&gt;
&lt;p&gt;
 For listing &lt;strong&gt;descendants&lt;/strong&gt; the winners are queries that can use an ordered btree index. 
 I could only get this to work with hardcoded paths, so in this form it is not useful for dynamic, single-statement queries.
 If you can issue 2 separate queries in code, though, nested sets or the string materialized path are ideal candidates for retrieving descendants.
 Because paging requires an ordering, these queries can use the btree index efficiently, while all the others produce the full list of descendants first and only then sort it.
 Intarray is the quickest in that field, but nested sets and ltree perform rather similarly. 
 It remains to be seen whether an additional btree index could drastically improve the ordering of ltree and intarray.
&lt;/p&gt;&lt;div class=&quot;blogger-post-footer&quot;&gt;-------
All blog items represent the authors' own ideas, and should not be considered GBIF or Institutional policy.&lt;/div&gt;</description>
         <author>Markus Döring</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-6503827713930295808</guid>
         <pubDate>Tue, 12 Jun 2012 15:03:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://1.bp.blogspot.com/-SgnsrVVAhxI/T9HzfmiPgCI/AAAAAAAAD88/uxdE6n4agjE/s72-c/NameUsage.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Faster HBase - hardware matters</title>
         <link>http://gbif.blogspot.com/2012/06/faster-hbase-hardware-matters.html</link>
         <description>As I've &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/02/performance-evaluation-of-hbase.html&quot;&gt;written&lt;/a&gt; &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/03/hbase-performance-evaluation-continued.html&quot;&gt;earlier&lt;/a&gt;, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:&lt;br /&gt;
&lt;br /&gt;
Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper&lt;br /&gt;
Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster&lt;br /&gt;
&lt;br /&gt;
Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)&lt;br /&gt;
&lt;br /&gt;
Hardware:&lt;br /&gt;
&lt;strong&gt;c1n*&lt;/strong&gt;: 1x Intel Xeon X3363 @ 2.83GHz (quad), 8GB RAM, 2x500G SATA 5.4K&lt;br /&gt;
&lt;b&gt;c4n*&lt;/b&gt;: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x1TB SATA 7.2K&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-SqP6CwZzpAQ/T9B2zUTrliI/AAAAAAAAADc/3qX37EY4Mm4/s1600/hbase_scan_pe2.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;258&quot; src=&quot;http://2.bp.blogspot.com/-SqP6CwZzpAQ/T9B2zUTrliI/AAAAAAAAADc/3qX37EY4Mm4/s400/hbase_scan_pe2.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;Figure 1: Scan performance of new cluster (2x 1gig ethernet)&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
So, 1 million records/second? Yes, please! That's a performance increase of about 300% over the c2 cluster I tested in my &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/02/performance-evaluation-of-hbase.html&quot;&gt;previous posts&lt;/a&gt;. Given that we doubled the number of machines and quadrupled the number of drives, that improvement feels about right. But the decline in performance as the number of mappers increases is still a bit suspect - we'd hope that performance would go up with more workers. This is the same behaviour we saw when we were limited by the network in our original tests, and Ganglia again shows a similar pattern, though this time it looks like we're hitting a limit around the 2 gigabit ethernet bond:&lt;br /&gt;
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-8dS_Fu0dlkk/T9H2t9--zQI/AAAAAAAAADo/bH2qkVoPFqk/s1600/bytes_in.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;157&quot; src=&quot;http://2.bp.blogspot.com/-8dS_Fu0dlkk/T9H2t9--zQI/AAAAAAAAADo/bH2qkVoPFqk/s320/bytes_in.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i style=&quot;background-color:white;color:#333333;font-family:Verdana, sans-serif;font-size:medium;line-height:22px;&quot;&gt;&lt;span style=&quot;font-size:xx-small;&quot;&gt;Figure 2: bytes_in (MB/s) with dual gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-5u7Zu4dZHl4/T9H2ujadudI/AAAAAAAAADs/Fn5-TVsrxEY/s1600/bytes_out.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;157&quot; src=&quot;http://3.bp.blogspot.com/-5u7Zu4dZHl4/T9H2ujadudI/AAAAAAAAADs/Fn5-TVsrxEY/s320/bytes_out.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i style=&quot;background-color:white;color:#333333;font-family:Verdana, sans-serif;font-size:medium;line-height:22px;&quot;&gt;&lt;span style=&quot;font-size:xx-small;&quot;&gt;Figure 3: bytes_out (MB/s) with dual gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
Unfortunately our master switch is currently full, so we don't have the extra 6 ports needed to test a triple bond - but given our past experience I feel reasonably confident that it would change Figure 1 such that scan performance increases with number of mappers up to some disk I/O contention limit.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;Hang on, what about data locality?&lt;/h3&gt;
But there's a bigger question here - why are we using so much network bandwidth in the first place? Why stress about major compactions and data locality when it doesn't seem to get used? Therein lies the rub - PerformanceEvaluation can't take advantage of data locality. Tim wrote about the tremendous importance of TableInputFormat in ensuring &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/05/optimizing-hbase-mapreduce-scans-for.html&quot;&gt;maximum scan performance from MapReduce&lt;/a&gt;, and PerformanceEvaluation doesn't do that. It assigns a block of ids to scan to different mappers at random, meaning that at best one in six mappers (in our setup) will coincidentally have local data to read, and the rest will all transfer their data across the network. This isn't a bug in PerformanceEvaluation, per se, because it was written to try and emulate the tests that Google ran in their &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://research.google.com/archive/bigtable.html&quot;&gt;seminal white paper on BigTable&lt;/a&gt;, rather than act as a true benchmark for scanning performance. But if you're new to this stuff (as I was) it sure can be confusing. When we switched to scanning our real data using TableInputFormat our throughput jumped to 2M/sec from the 1M/sec we got using PerformanceEvaluation.&lt;br /&gt;
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;div&gt;
We learned a lot from using PerformanceEvaluation to test our clusters: it helped us uncover misconfigurations and taught us how to fine-tune lots of parameters. It is not, however, a good tool for benchmarking scan performance. As &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/05/optimizing-hbase-mapreduce-scans-for.html&quot;&gt;Tim wrote&lt;/a&gt;, scans across our real occurrence data (~370M records) using TableInputFormat are finishing in 3 minutes - for our needs that is excellent and means we're happy with our cluster upgrade. Stay tuned for news about the occurrence download service that Lars and I are writing to take advantage of all this speed :)&lt;/div&gt;
</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-583776861820832768</guid>
         <pubDate>Fri, 08 Jun 2012 15:05:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-SqP6CwZzpAQ/T9B2zUTrliI/AAAAAAAAADc/3qX37EY4Mm4/s72-c/hbase_scan_pe2.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Optimizing HBase MapReduce scans (for Hive)</title>
         <link>http://gbif.blogspot.com/2012/05/optimizing-hbase-mapreduce-scans-for.html</link>
         <description>&lt;blockquote class=&quot;tr_bq&quot;&gt;
&lt;br /&gt;
&lt;b&gt;By targeting data locality, full table scans of HBase using MapReduce across 373 million records are reduced from 19 minutes to 2.5 minutes.&amp;nbsp;&lt;/b&gt;&lt;/blockquote&gt;
&lt;br /&gt;
We've been posting &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html&quot;&gt;some blogs&lt;/a&gt; about HBase performance, all of which are based on the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation&quot;&gt;PerformanceEvaluation tools&lt;/a&gt; supplied with HBase. &amp;nbsp;This has helped us understand many characteristics of our system, but in some ways it has sidetracked our tuning - namely, we investigated &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Channel_bonding&quot;&gt;channel bonding&lt;/a&gt;&amp;nbsp;to increase inter-machine bandwidth, believing it was our primary limitation. &amp;nbsp;While that will help for many things (e.g. the copy between mappers and reducers), a key usage pattern involves full table scans of HBase (spawned by &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hive.apache.org/&quot;&gt;Hive&lt;/a&gt;), and in a well set up environment network traffic &lt;u&gt;should be minimal&lt;/u&gt; for this. &amp;nbsp;Here I describe how we approached this problem, and the results.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;The environment&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
We run &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://ganglia.sourceforge.net/&quot;&gt;Ganglia&lt;/a&gt; for cluster monitoring (and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://dev.gbif.org/ganglia/&quot;&gt;ours is public&lt;/a&gt;) and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://puppetlabs.com/&quot;&gt;Puppet&lt;/a&gt; to provision machines. &amp;nbsp;As an aside, without these tools or an equivalent I don't think you can sanely hope to run HBase in production. &amp;nbsp;Here we are using the 6 node cluster, where each of the 6 slaves runs a TaskTracker, DataNode and RegionServer, and each machine is a Dell R720xd with 64GB memory, 12 x 1TB SATA drives and dual 6-core hyper-threading CPUs. &amp;nbsp;Quoting the user list: &quot;HBase should be able to stretch its legs with this hardware&quot;.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Baseline&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
As a naive person getting access to the new cluster, I started by creating the HBase table, mounting a Hive table on it, and populating it with a select using data from a CSV file.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;INSERT OVERWRITE TABLE occurrence_hbase SELECT * FROM occurrence_hdfs;
&lt;/pre&gt;
&lt;br /&gt;
The first good news was that this ran in only 2.5hrs to load up 367 million records with no pre-splitting of the table and no failures (normally we use bulk loading tools but here I was feeling lazy). &amp;nbsp;I then crafted a super simple MR job based on TableMapReduceUtil that took an unfiltered Scan, and a Mapper that did nothing but increment a counter with the number of rows read. &amp;nbsp;This is a decent replica of what Hive would do when doing a range query (Hive does not do predicate push down to HBase with filters except for equality filters at the moment). &amp;nbsp;The first run took &lt;b&gt;19 minutes&lt;/b&gt;. &amp;nbsp;This test was across 200GB data (uncompressed) spread across 84 hard drives and 72 hyper threading cores reading it, so I knew it was too slow. &amp;nbsp;We started digging... below are some insights into how we approached the task.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Improvement #1: Host name versus IP address&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
On an early run, we saw the following in the Ganglia bytes_out graph:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-Q0SxKfz_ojk/T8IFpnN7ReI/AAAAAAAAAG4/q8xXyXADlEE/s1600/bytesOut.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;237&quot; src=&quot;http://4.bp.blogspot.com/-Q0SxKfz_ojk/T8IFpnN7ReI/AAAAAAAAAG4/q8xXyXADlEE/s400/bytesOut.png&quot; width=&quot;400&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Here we see what we have seen many times - a saturated network. &amp;nbsp;But why? &amp;nbsp;These are Mappers that should be hitting local RegionServers using local drives (12 of them). &amp;nbsp;Looking at the number of data-local mappers spawned, we see that very few map tasks are data-local:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-tVb509U20Rg/T8IFCJVFVMI/AAAAAAAAAGw/sxyxAjXaEtg/s1600/Screen+Shot+2012-05-27+at+12.41.53+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;58&quot; src=&quot;http://3.bp.blogspot.com/-tVb509U20Rg/T8IFCJVFVMI/AAAAAAAAAGw/sxyxAjXaEtg/s640/Screen+Shot+2012-05-27+at+12.41.53+PM.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
This is suspicious, and &lt;b&gt;if you see this, start investigating why&lt;/b&gt;. &amp;nbsp;Basically the MR job is spawning tasks to use data that reside on other machines. &amp;nbsp;For us, that means Mappers talking to region servers that are not local to them. &amp;nbsp;On investigation, and from reading&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HBASE-1672&quot;&gt;HBASE-1672&lt;/a&gt;, we observe that when looking at a task attempt in the MR console, the task attempt and input split locations look suspiciously different:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-5go29t3893A/T8ICnjVAwzI/AAAAAAAAAGg/vt-6XVzSPfI/s1600/Screen+Shot+2012-05-27+at+12.31.33+PM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://3.bp.blogspot.com/-5go29t3893A/T8ICnjVAwzI/AAAAAAAAAGg/vt-6XVzSPfI/s1600/Screen+Shot+2012-05-27+at+12.31.33+PM.png&quot;/&gt;&lt;/a&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-nQ05h_Kv1LE/T8ICYQAB3II/AAAAAAAAAGQ/Zvjzu2iD-xM/s1600/Screen+Shot+2012-05-27+at+9.41.58+AM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://2.bp.blogspot.com/-nQ05h_Kv1LE/T8ICYQAB3II/AAAAAAAAAGQ/Zvjzu2iD-xM/s1600/Screen+Shot+2012-05-27+at+9.41.58+AM.png&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align:left;&quot;&gt;
These actually refer to the same machine, but one is using the IP address, and the other the host name. &lt;br /&gt;
&lt;br /&gt;
HBase is using the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html&quot;&gt;TableInputFormat&lt;/a&gt;&amp;nbsp;which is returning the IP (thanks Stack for pointing this out). &amp;nbsp;Now, it is important to note that this code is &lt;b&gt;executed on the client calling machine&lt;/b&gt;, as it prepares a job for submission to the cluster. &amp;nbsp;In my setup I was running over VPN from my laptop which was providing IP addresses for the region locations. &amp;nbsp;The code in the TableInputFormatBase for our version of HBase was:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;String regionLocation = table.getRegionLocation(keys.getFirst()[i]).getServerAddress().getHostname();
&lt;/pre&gt;
&lt;br /&gt;&lt;/div&gt;
But on my laptop, a getHostname() on these machines always returned the IP address. &amp;nbsp;Moving my code onto the cluster and launching from there solved this issue.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Improvement #2: Ensure HBase is balanced&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
When starting these tests there were 2 tables on HBase. &amp;nbsp;HBase reported it was nicely balanced, major table compactions were done. &amp;nbsp;On running however, we still see a huge bottleneck again shown clearly with bytes_out, bytes_in and region server requests:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-zI-y3py-O1o/T8IGoHzrraI/AAAAAAAAAHE/JceAEHyPuGM/s1600/bytesOut.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;189&quot; src=&quot;http://3.bp.blogspot.com/-zI-y3py-O1o/T8IGoHzrraI/AAAAAAAAAHE/JceAEHyPuGM/s320/bytesOut.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-aNTIaJPjBtY/T8IGouzmBmI/AAAAAAAAAHM/h2pvTy4fwAw/s1600/rsRequests.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;190&quot; src=&quot;http://3.bp.blogspot.com/-aNTIaJPjBtY/T8IGouzmBmI/AAAAAAAAAHM/h2pvTy4fwAw/s320/rsRequests.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-QcDSX6pQoX4/T8IGnp3X4jI/AAAAAAAAAHA/nEhC_CHqRK8/s1600/bytesIn.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;190&quot; src=&quot;http://3.bp.blogspot.com/-QcDSX6pQoX4/T8IGnp3X4jI/AAAAAAAAAHA/nEhC_CHqRK8/s320/bytesIn.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Again the network saturation is clear, but 1 region server is getting hammered, and many machines are receiving data. &amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
Why?&amp;nbsp;Well, we run HBase 0.90 which does balancing across all tables and not on a per table basis. &amp;nbsp;Thus, while HBase saw its regions evenly distributed across the servers, one machine was hot spotted for one table. &amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HBASE-3373&quot;&gt;This is fixed in HBase 0.94&lt;/a&gt;, but we don't run that yet and Ganglia clearly shows us this is a limitation. &amp;nbsp;Again, MR is spawning tasks on other machines, all hitting one RS and saturating the network. &amp;nbsp;Fortunately I could delete the unused table and rebalance the lot.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;Miscellaneous improvements&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
Somewhere along the line I saw exceptions reporting timeouts and things like this:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;48 Lease exceptions
org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '7961909311940960915' does not exist
 at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1879)
 at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
&lt;/pre&gt;
&lt;br /&gt;
Basically the HBase scan client is not reporting back to the region server quickly enough, so the region server expires the scanner's lease, and thus the task attempt fails. &amp;nbsp;The result is that the JobTracker spawns another task attempt, and here we often observed it was not a data-local attempt - everything we know we need to avoid for performance. &amp;nbsp;The following were done to address this:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Set hbase.regionserver.lease.period to&amp;nbsp;600000 ms, up from&amp;nbsp;60000 ms. &amp;nbsp;This is then the same as the TaskTracker timeout, so HBase times out the scan client after the same duration at which the JobTracker would time out the task attempt anyway.&lt;/li&gt;
&lt;li&gt;Reduce the number of mappers from 44 to 36 on each machine. &amp;nbsp;Here we just ran a few runs, observed Ganglia load averages etc, repeated with different configuration and ultimately tuned the Mapper count to the point where no exceptions were thrown. &amp;nbsp;There was no magic recipe to this, other than &quot;rinse and repeat&quot;. &amp;nbsp;This is where Puppet is gold.&lt;/li&gt;
&lt;/ol&gt;
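&lt;br /&gt;
A minimal sketch of how these two settings could be written down, for illustration only. &amp;nbsp;The lease period name and value come from the list above; the file names, and the use of mapred.tasktracker.map.tasks.maximum to cap the number of concurrent mappers per TaskTracker, are assumptions based on a stock Hadoop 1.x / HBase 0.90 layout and may differ in your distribution:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&amp;lt;!-- hbase-site.xml (assumed file): scanner lease timeout in milliseconds --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hbase.regionserver.lease.period&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;600000&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;

&amp;lt;!-- mapred-site.xml (assumed file and property): map slots per TaskTracker --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;36&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&lt;/pre&gt;
&lt;br /&gt;
In that assumed layout both values are read by the daemons at startup, so the RegionServers and TaskTrackers would need a restart after each change.&lt;br /&gt;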
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;The final run&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
With a balanced HBase environment and the IP / host name issue resolved, the MR result and the Ganglia bytes_in, bytes_out and region server requests graphs are shown below:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-hnzOBNNqXxY/T8IIZqA0qwI/AAAAAAAAAH8/fG5PAQuijj0/s1600/Screen+Shot+2012-05-27+at+10.51.23+AM.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;281&quot; src=&quot;http://1.bp.blogspot.com/-hnzOBNNqXxY/T8IIZqA0qwI/AAAAAAAAAH8/fG5PAQuijj0/s320/Screen+Shot+2012-05-27+at+10.51.23+AM.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-Vhv18J2eDWI/T8IH8vI5-tI/AAAAAAAAAHc/N6jCgU3ELmQ/s1600/bytesIn.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;190&quot; src=&quot;http://2.bp.blogspot.com/-Vhv18J2eDWI/T8IH8vI5-tI/AAAAAAAAAHc/N6jCgU3ELmQ/s320/bytesIn.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-dt8DxwxN0EI/T8IH-wAsZ7I/AAAAAAAAAHo/8oKXtRRxkpo/s1600/bytesOut.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;190&quot; src=&quot;http://3.bp.blogspot.com/-dt8DxwxN0EI/T8IH-wAsZ7I/AAAAAAAAAHo/8oKXtRRxkpo/s320/bytesOut.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-uHZGaHLtdUg/T8IH_Rojx9I/AAAAAAAAAHs/nWRXZ4_rays/s1600/rsRequests.png&quot; style=&quot;margin-left:1em;margin-right:1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;190&quot; src=&quot;http://3.bp.blogspot.com/-uHZGaHLtdUg/T8IH_Rojx9I/AAAAAAAAAHs/nWRXZ4_rays/s320/rsRequests.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
We still see 5 mappers running across the network, which are due to the FairScheduler deciding it has waited long enough for a data local mapper and spawning another - we might investigate if we want to increase this wait time.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;u&gt;A final thought&lt;/u&gt;&lt;/b&gt;&lt;br /&gt;
All these tests used no Filter in the Scan, thus the entire data set was passed from the region server to the mapper. &amp;nbsp;Adding a filter such as the following reduces the time to around 90 secs. &lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;      scan.setFilter(new SingleColumnValueFilter(
        &quot;v&quot;.getBytes(), 
        &quot;scientific_name&quot;.getBytes(), 
        CompareOp.EQUAL,
        &quot;Abies alba&quot;.getBytes()));
&lt;/pre&gt;
&lt;br /&gt;
We are using &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.com/2012/05/hive-09-with-hbase-090.html&quot;&gt;Hive 0.9 on HBase 0.90&lt;/a&gt;&amp;nbsp;and I believe Hive will push down the &quot;equals&quot; predicates to HBase, which will benefit from this kind of filter. &amp;nbsp;However, I wanted to run the tests without them, as we will often do range scans and use custom &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://sites.gbif.org/occurrencestore/occurrence-store/apidocs/org/gbif/occurrencestore/hive/udf/package-summary.html&quot;&gt;UDF&lt;/a&gt;s to do things like point in polygon checking.&lt;br /&gt;
&lt;br /&gt;
Thanks to everyone on the mailing lists for all their support through this. &amp;nbsp;You all know who you are, but in particular thanks to Lars George and Stack. &amp;nbsp;All GBIF work is open source, and we are committed to open data - we always welcome collaborations.&lt;br /&gt;
</description>
         <author>Tim Robertson</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-1606601149605269801</guid>
         <pubDate>Mon, 28 May 2012 17:39:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://4.bp.blogspot.com/-Q0SxKfz_ojk/T8IFpnN7ReI/AAAAAAAAAG4/q8xXyXADlEE/s72-c/bytesOut.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Hive 0.9 with HBase 0.90</title>
         <link>http://gbif.blogspot.com/2012/05/hive-09-with-hbase-090.html</link>
         <description>Hive 0.9.0 was&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310843&amp;amp;version=12317742&quot;&gt;released&lt;/a&gt;&amp;nbsp;at the beginning of this month and it contains a lot of very nice improvements. Thanks to all involved!&lt;br /&gt;
&lt;br /&gt;
Unfortunately it drops compatibility with HBase 0.90.x due to two issues which introduced a dependency on HBase 0.92:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HIVE-2748&quot;&gt;https://issues.apache.org/jira/browse/HIVE-2748&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://issues.apache.org/jira/browse/HIVE-2764&quot;&gt;https://issues.apache.org/jira/browse/HIVE-2764&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
Fortunately these were relatively easy to revert, so that's what we did, because we wanted all the 0.9.0 goodness on our HBase 0.90.4 cluster (CDH3u3).&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I've &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/lfrancke/hive&quot;&gt;forked&lt;/a&gt; Hive on Github and reverted the parts of those two issues (&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/lfrancke/hive/commit/1167634ca44bd62c10afd4e9d38403ce38b9b250&quot;&gt;HIVE-2748&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;https://github.com/lfrancke/hive/commit/be495d9ff701aea18f91749a2d8ec2456c65272b&quot;&gt;HIVE-2764&lt;/a&gt;) that were causing problems.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
For all those &quot;stuck&quot; with HBase 0.90 (e.g. CDH3 users) we've also deployed this custom Hive HBase Handler to our own Maven repository and will maintain that for the&amp;nbsp;foreseeable&amp;nbsp;future. You can just download the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://repository.gbif.org/content/repositories/thirdparty/org/apache/hive/hive-hbase-handler/0.9.0/hive-hbase-handler-0.9.0-hbase-0.90-compat.jar&quot;&gt;jar&lt;/a&gt; file and use it in your projects or use our Maven repository:&lt;/div&gt;
&lt;div&gt;
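&lt;pre class=&quot;brush:xml&quot;&gt;
&amp;lt;repository&amp;gt;
  &amp;lt;id&amp;gt;gbif-thirdparty&amp;lt;/id&amp;gt;
  &amp;lt;url&amp;gt;http://repository.gbif.org/content/repositories/thirdparty&amp;lt;/url&amp;gt;
&amp;lt;/repository&amp;gt;
&lt;/pre&gt;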
&lt;/div&gt;
And then declare a dependency on this custom HBase Handler:&lt;br /&gt;
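&lt;pre class=&quot;brush:xml&quot;&gt;
&amp;lt;dependency&amp;gt;
  &amp;lt;groupId&amp;gt;org.apache.hive&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;hive-hbase-handler&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;0.9.0&amp;lt;/version&amp;gt;
  &amp;lt;classifier&amp;gt;hbase-0.90-compat&amp;lt;/classifier&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/pre&gt;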
Note for Maven experts: I'm not sure if this is a valid classifier. I couldn't find any naming rules.&lt;br /&gt;
&lt;br /&gt;
That's it! We might maintain this for future versions of Hive but we'd love to hear about any problems in any case.</description>
         <author>Lars Francke</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4573597901275495635</guid>
         <pubDate>Wed, 23 May 2012 16:53:00 +0000</pubDate>
      </item>
      <item>
         <title>HBase Performance Evaluation continued - The Smoking Gun</title>
         <link>http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html</link>
         <description>Update: See also &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/02/performance-evaluation-of-hbase.html&quot;&gt;part 1&lt;/a&gt; and &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.dk/2012/06/faster-hbase-hardware-matters.html&quot;&gt;part 3&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
In my&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html&quot;&gt;last post&lt;/a&gt;&amp;nbsp;I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class. &amp;nbsp;I wasn't happy with our conclusions, which could largely be summed up as &quot;we're not sure what's wrong, but it seems slow&quot;. &amp;nbsp;So in the grand tradition of people with itches that wouldn't go away, I kept scratching. &amp;nbsp;Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html&quot;&gt;last post&lt;/a&gt;) using a 300M row table built with &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;PerformanceEvaluation sequentialWrite 300&lt;/span&gt;, and tested with &lt;span style=&quot;font-family:'Courier New', Courier, monospace;&quot;&gt;PerformanceEvaluation scan 300&lt;/span&gt;. &amp;nbsp;I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts. &amp;nbsp;And to recap our hardware setup - we have 3 regionservers and a separate master.&lt;br /&gt;
&lt;br /&gt;
The first unsettling&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://dev.gbif.org/ganglia/?c=hadoop-2&amp;amp;m=load_one&amp;amp;r=hour&amp;amp;s=by%20name&amp;amp;hc=4&amp;amp;mc=2&quot;&gt;ganglia metric&lt;/a&gt;&amp;nbsp;that kept me digging was of ethernet bytes_in and bytes_out. &amp;nbsp;I'll recreate those here:&lt;br /&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://2.bp.blogspot.com/-ohKtMuZX-oc/T2iRguDWhPI/AAAAAAAAACc/5kGPc3J4bfs/s1600/ganglia_single_bytes_in.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://2.bp.blogspot.com/-ohKtMuZX-oc/T2iRguDWhPI/AAAAAAAAACc/5kGPc3J4bfs/s320/ganglia_single_bytes_in.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i style=&quot;font-size:medium;&quot;&gt;&lt;span style=&quot;font-size:x-small;&quot;&gt;Figure 1 - bytes_in (MB/s) with single gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-4JV9grnZOYY/T2iRhVEXKCI/AAAAAAAAACg/xaM5hkVlehk/s1600/ganglia_single_bytes_out.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://3.bp.blogspot.com/-4JV9grnZOYY/T2iRhVEXKCI/AAAAAAAAACg/xaM5hkVlehk/s320/ganglia_single_bytes_out.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i style=&quot;font-size:medium;&quot;&gt;&lt;span style=&quot;font-size:x-small;&quot;&gt;Figure 2 - bytes_out (MB/s) with single gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
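&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
As a point of reference for the two graphs above, here is a back-of-the-envelope sketch of what a gigabit link can carry. The class and method names are made up for this illustration, and the 1460/1538 payload share is an assumption (standard 1500-byte frames carrying TCP over IPv4 without options), not something measured on this cluster:&lt;/div&gt;
&lt;pre&gt;
```java
// Hypothetical helper, not part of HBase or ganglia: rough ceiling of an ethernet link.
final class LinkCeiling {

    // Raw line rate in decimal megabytes per second for a link speed given in gigabits per second.
    static double rawMegabytesPerSecond(int gigabitsPerSecond) {
        return gigabitsPerSecond * 1_000_000_000.0 / 8.0 / 1_000_000.0;
    }

    // Assumption: 1460 payload bytes travel in 1538 bytes of wire time
    // (1500-byte frame body plus ethernet header, checksum, preamble and inter-frame gap).
    static double tcpPayloadShare() {
        return 1460.0 / 1538.0;
    }

    public static void main(String[] args) {
        // One link, then two bonded links.
        for (int links : new int[] {1, 2}) {
            double raw = rawMegabytesPerSecond(links);
            System.out.printf("%d link(s): raw %.0f MB/s, TCP payload about %.1f MB/s%n",
                    links, raw, raw * tcpPayloadShare());
        }
    }
}
```
&lt;/pre&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
Under those assumptions a single link carries 125 MB/s raw and a little under 119 MB/s of payload, which is close to the 120 MB/s ceiling discussed next.&lt;/div&gt;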
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
This is a single gigabit ethernet link, and that is borne out by the Max values, which are near the theoretical max of 120 megabytes per second (MB/s), and as you can see they certainly aren't sitting at 120 constantly in the way one would expect if ethernet was our bottleneck. But that bytes_in graph sure looks like it's hitting some kind of limit - steep ramp up, relatively constant transfer rate, then steep ramp down. &amp;nbsp;So I decided to try&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Channel_bonding&quot;&gt;channel bonding&lt;/a&gt;&amp;nbsp;to effectively use 2 gigabit links on each server &quot;bonded&quot; together. &amp;nbsp;It turns out that's not quite as trivial as it may seem, but in the end I got it working following the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.google.dk/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=2&amp;amp;ved=0CDYQFjAB&amp;amp;url=http%3A%2F%2Fdell.cloudera.com%2Fwp-content%2Fuploads%2F2011%2F09%2FDell_Cloudera_solution_for_Apache_Hadoop_Reference_Architecture_v1.2.pdf&amp;amp;ei=s4doT5b2HsOg4gSuybirCQ&amp;amp;usg=AFQjCNGk-BySrOZg6lQ9bFhHJtWpihpVug&amp;amp;sig2=E_pBbopQO8_QfC0PkcrsCw&quot;&gt;Dell/Cloudera configuration&lt;/a&gt;&amp;nbsp;suggestion of mode 6 / balance-alb, which is described more in the&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://www.kernel.org/doc/Documentation/networking/bonding.txt&quot;&gt;Linux kernel bonding documentation&lt;/a&gt;. The results are shown in the following graphs:&lt;/div&gt;
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://3.bp.blogspot.com/-oOURShs1OGE/T2iRuTENXPI/AAAAAAAAACs/Gud19drr0oo/s1600/ganglia_bonded_bytes_in.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://3.bp.blogspot.com/-oOURShs1OGE/T2iRuTENXPI/AAAAAAAAACs/Gud19drr0oo/s320/ganglia_bonded_bytes_in.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-size:x-small;&quot;&gt;Figure 3 - bytes_in (MB/s) with dual (bonded) gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float:left;margin-right:1em;text-align:left;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-94cOMN63Q0I/T2iRwvGGoMI/AAAAAAAAAC0/q4EsMo5SJj4/s1600/ganglia_bonded_bytes_out.png&quot; style=&quot;clear:left;margin-bottom:1em;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://1.bp.blogspot.com/-94cOMN63Q0I/T2iRwvGGoMI/AAAAAAAAAC0/q4EsMo5SJj4/s320/ganglia_bonded_bytes_out.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-size:x-small;&quot;&gt;Figure 4 - bytes_out (MB/s) with dual (bonded) gigabit ethernet&lt;/span&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
Unfortunately the scale has changed (ganglia ain't perfect) but you can see we're now hovering a bit higher than the single ethernet case, and at some points swinging significantly above the single ethernet limit (a few gusts up to 150). &amp;nbsp;So it would seem that even though the single gigabit graphs didn't look like they were hitting theoretical maximum, they were definitely limited. &amp;nbsp;Not by much though - maybe 10-15%? &amp;nbsp;As they say, the proof is in the pudding, so what was the final throughput for the two tests?&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;center&gt;
&lt;table border=&quot;1px&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Ethernet&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Throughput (rows/s)&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Single&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;340k&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Double&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;357k&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
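&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
The relative gain between those two rows is simple arithmetic. A minimal sketch, with class and method names made up for this illustration and the two rates taken from the table above:&lt;/div&gt;
&lt;pre&gt;
```java
// Illustrative only: compares the two scan throughputs from the table.
final class ThroughputGain {

    // Percentage increase from the baseline rate to the improved rate.
    static double percentGain(double baselineRowsPerSec, double improvedRowsPerSec) {
        return (improvedRowsPerSec - baselineRowsPerSec) / baselineRowsPerSec * 100.0;
    }

    public static void main(String[] args) {
        double single = 340_000.0; // rows/s with one gigabit link
        double bonded = 357_000.0; // rows/s with two bonded links
        System.out.printf("gain: %.1f%%%n", percentGain(single, bonded));
    }
}
```
&lt;/pre&gt;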
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:left;&quot;&gt;
Looks like 5%. Certainly not the doubling I was hoping for! Something else must be the limiting factor then, but what? &amp;nbsp;Well, what does CPU say?&lt;/div&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-vb5pPetbgsg/T2iW4KP5dYI/AAAAAAAAAC8/ZeUmlOVtYsc/s1600/ganglia_bonded_cpu_idle.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://4.bp.blogspot.com/-vb5pPetbgsg/T2iW4KP5dYI/AAAAAAAAAC8/ZeUmlOVtYsc/s320/ganglia_bonded_cpu_idle.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;Figure 5 - cpu_idle with dual (bonded) gigabit ethernet&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
That CPU is saying, &quot;Feed me Seymour!&quot;. &amp;nbsp;If the CPU is hungry, what about the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://en.wikipedia.org/wiki/Load_average&quot;&gt;load average&lt;/a&gt;?&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear:both;text-align:center;&quot;&gt;
&lt;/div&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://1.bp.blogspot.com/-9nu6jReAEVU/T2iXp4gnCjI/AAAAAAAAADM/lQbedkWAtJU/s1600/ganglia_bonded_load_one.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://1.bp.blogspot.com/-9nu6jReAEVU/T2iXp4gnCjI/AAAAAAAAADM/lQbedkWAtJU/s320/ganglia_bonded_load_one.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;Figure 6 - one minute load average with dual (bonded) gigabit ethernet&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div&gt;
These machines have 8 hyper-threaded cores, so effectively 16 cpus. The load average looks like it's keeping the CPUs fed - hovering around 20. But hang on - the load average isn't all about CPU - it's about work that the whole machine needs to do - including all the io subsystems. So&amp;nbsp;if it ain't network, and it ain't CPU then it's gotta be disks. Let's look at how much time the CPU spent waiting on io (which in this case basically means &quot;waiting on disks&quot;):&lt;/div&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left:auto;margin-right:auto;text-align:center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align:center;&quot;&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://4.bp.blogspot.com/-HkKRJ4c-RlA/T2iYodbDnKI/AAAAAAAAADU/uuiAJ1v_Vks/s1600/ganglia_bonded_cpu_wio.png&quot; style=&quot;margin-left:auto;margin-right:auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;130&quot; src=&quot;http://4.bp.blogspot.com/-HkKRJ4c-RlA/T2iYodbDnKI/AAAAAAAAADU/uuiAJ1v_Vks/s320/ganglia_bonded_cpu_wio.png&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align:center;&quot;&gt;&lt;i&gt;Figure 7 - percentage cpu spent in wait_io state, with dual (bonded) gigabit ethernet&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
From a pure gut perspective, that seems kind of high. &amp;nbsp;But on closer inspection, that graph also looks familiar - it's basically the same as Figure 6 - the load average. So what can we deduce? &amp;nbsp;When ethernet is removed as a limiting factor the run queue is filled with processes that all cause an increase in wait_io - which means our processes would all finish faster if they didn't have to wait for io. &amp;nbsp;The limiting factor must, then, be the disks.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Recap (TL;DR)&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
Ethernet looked suspiciously capped, but performance (and ethernet usage) only increased slightly when the cap was lifted. Closer inspection of the resulting load average and CPU usage showed that the limiting factor was in fact the disks. &amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Solution? &amp;nbsp;Replacing the disks might help a little - they're 3 year old 2.5&quot; SATA, but they are 7.2k rpm, and a more modern 10k 2.5&quot; would be marginally faster. &amp;nbsp;The real benefit would probably be more disks - the current wisdom on the mailing lists and in the documentation is to maximize the ratio of disk spindles to cpu cores and I think that's borne out by these results. &amp;nbsp;In the coming weeks we're going to build a new cluster of 6 machines with 12 disks each and I very much look forward to testing my theories once they're up and running. &amp;nbsp;Stay tuned!&lt;/div&gt;
&lt;/div&gt;
</description>
         <author>Oliver Meyn</author>
         <guid isPermaLink="false">tag:blogger.com,1999:blog-2326624813533383062.post-4380606393153794278</guid>
         <pubDate>Tue, 20 Mar 2012 16:49:00 +0000</pubDate>
         <media:thumbnail height="72" url="http://2.bp.blogspot.com/-ohKtMuZX-oc/T2iRguDWhPI/AAAAAAAAACc/5kGPc3J4bfs/s72-c/ganglia_single_bytes_in.png" width="72" xmlns:media="http://search.yahoo.com/mrss/"/>
      </item>
      <item>
         <title>Text available for public review: The digitisation of biological nomenclatural and taxonomic information by Richard Pankhurst</title>
         <link>http://community.gbif.org/pg/forum/topic/18121/text-available-for-public-review-the-digitisation-of-biological-nomenclatural-and-taxonomic-information-by-richard-pankhurst/</link>
         <description>&lt;p&gt;Dear members of the GNA group,&lt;/p&gt;
&lt;p&gt;I am happy to announce that there's &lt;strong&gt;a new text available for public review&lt;/strong&gt;: '&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://community.gbif.org/pg/file/read/18107/&quot;&gt;&lt;em&gt;The digitisation of biological nomenclatural and taxonomic information&lt;/em&gt;&lt;/a&gt;' by Richard Pankhurst.&lt;/p&gt;
&lt;p&gt;The document guides the reader through the process of digitizing information about taxonomic names and relationships. It explains the nature of the taxonomic data with many examples and gives advice on how to proceed in the digitization process in each case, independently of the software used. After the digitization process, the data can be published on the internet through networks such as GBIF.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The public review period ends the 9th December 2011. &lt;/strong&gt;The author will consider all comments received and a final version will be released soon afterwards.&lt;/p&gt;
&lt;p&gt;Please send your comments through the community site or to training@gbif.org.&lt;/p&gt;
&lt;p&gt;Thanks for your participation!&lt;/p&gt;
&lt;p&gt;http://community.gbif.org/pg/file/read/18107/&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/forum/topic/18121/text-available-for-public-review-the-digitisation-of-biological-nomenclatural-and-taxonomic-information-by-richard-pankhurst/</guid>
         <pubDate>Fri, 11 Nov 2011 11:10:04 +0000</pubDate>
      </item>
      <item>
         <title>PUBLIC REVIEW: The digitisation of biological nomenclatural and taxonomic information by Richard Pankhurst</title>
         <link>http://community.gbif.org/pg/file/read/18107/public-review-the-digitisation-of-biological-nomenclatural-and-taxonomic-information-by-richard-pankhurst</link>
         <description>&lt;p&gt;This document guides the reader through the process of digitizing information about taxonomic names and relationships. It explains the nature of the taxonomic data with many examples and gives advice on how to proceed in the digitization process in each case, independently of the software used. After the digitization process, the data can be published on the internet through networks such as GBIF.&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color:#ff0000;&quot;&gt;&lt;strong&gt;THIS DOCUMENT IS AVAILABLE FOR PUBLIC REVIEW UNTIL THE 9TH DECEMBER 2011&lt;/strong&gt;&lt;/span&gt;. Please add your comments to this Community Site item, or send them to training@gbif.org.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;: &quot;&lt;em&gt;This text explains and advises how best to set up and populate a database for the taxonomy and biodiversity documentation of a group of organisms. Examples will be taken from experience gained while creating the Rosaceae (Rose family) global taxonomic database, and will therefore relate to flowering plants (Angiospermae), and in particular to the International Code of Botanical Nomenclature (ICBN, McNeill 2006) which states the rules for wild plant nomenclature. There is a separate and different code for the naming of cultivated plants, and again separate from the codes for zoology and for bacteria. For organisms covered by these other codes the approach will necessarily be somewhat different.&lt;/em&gt;&quot;&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/file/read/18107/public-review-the-digitisation-of-biological-nomenclatural-and-taxonomic-information-by-richard-pankhurst</guid>
         <pubDate>Fri, 11 Nov 2011 10:13:58 +0000</pubDate>
         <enclosure length="654336" type="application/msword" url="http://community.gbif.org/pg/photos/download/18107/"/>
      </item>
      <item>
         <title>GBIF Call for Proposals:  National Checklist Building Best Practices</title>
         <link>http://community.gbif.org/pg/blog/read/15539/gbif-call-for-proposals-national-checklist-building-best-practices</link>
         <description>&lt;p&gt;&lt;span&gt;The GBIF Secretariat invites proposals to develop a document entitled Best Practice Guidelines in the Development and Maintenance of National Species Checklists. &amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-size:12px;&quot;&gt;We seek to identify individuals or groups who have been directly involved in the&amp;nbsp;&lt;/span&gt;&lt;span&gt;compilation, maintenance and dissemination of a national species checklist to develop these guidelines in order to compile a set of practices that may serve as a guide for future efforts to build and maintain national species lists.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;See&amp;nbsp;http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-proposals-to-draft-best-practices-in-the-development-and-maintenance-of-national-species-c/ for more information.&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/15539/gbif-call-for-proposals-national-checklist-building-best-practices</guid>
         <pubDate>Wed, 07 Sep 2011 12:26:17 +0000</pubDate>
      </item>
      <item>
         <title>Google Docs Taxonomic Name Service</title>
         <link>http://community.gbif.org/pg/blog/read/12837/google-docs-taxonomic-name-service</link>
         <description>&lt;p&gt;Mike Giddens of SilverBiology has experimented with Google Docs scripting services to create a very nice name-mapping tool using Google Spreadsheets.&lt;/p&gt;
&lt;p&gt;A user can paste a list of names into a spreadsheet and have each name looked up in GBIF's name services; the classification, taxonomic status, and links to more information can then be embedded into the spreadsheet.&lt;/p&gt;
&lt;p&gt;To try out this service: &amp;nbsp;&lt;/p&gt;
&lt;p&gt;1. &amp;nbsp;Log in to your google account (you must be logged in)&amp;nbsp;&lt;/p&gt;
&lt;p&gt;2. Go to https://spreadsheets0.google.com/spreadsheet/pub?hl=en_US&amp;amp;hl=en_US&amp;amp;key=0AhdYOHWBUlw_dF83RVRtQjdiQ2dwZXJCRERlUWR0SkE&amp;amp;output=html&lt;/p&gt;
&lt;p&gt;3. Paste your own list of names into column A or if there are existing names, you can use them.&lt;/p&gt;
&lt;p&gt;4. &amp;nbsp;If there is already classification information in columns B-H you can select and delete these to clear them&lt;/p&gt;
&lt;p&gt;5. Select a series of names in column B.&lt;/p&gt;
&lt;p&gt;6. Go to the &quot;GBIF&quot; option in the menu. &amp;nbsp;Select one of the two options.&lt;/p&gt;
&lt;p&gt;You can add a link to this script in your own docs by calling&amp;nbsp;&lt;/p&gt;
&lt;p&gt;http://labs.silverbiology.com/gbifclblookup/code.js&lt;span&gt; &lt;/span&gt;&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/12837/google-docs-taxonomic-name-service</guid>
         <pubDate>Fri, 03 Jun 2011 09:35:59 +0000</pubDate>
      </item>
      <item>
         <title>Google Docs Taxonomic Name Service</title>
         <link>http://community.gbif.org/pg/blog/read/12824/google-docs-taxonomic-name-service</link>
         <description>&lt;p&gt;Mike Giddens of SilverBiology has experimented with Google Docs scripting services to create a very nice name-mapping tool using Google Spreadsheets.&lt;/p&gt;
&lt;p&gt;A user can paste a list of names into a spreadsheet and have each name looked up in GBIF's name services; the classification, taxonomic status, and links to more information can then be embedded into the spreadsheet.&lt;/p&gt;
&lt;p&gt;To try out this service: &amp;nbsp;&lt;/p&gt;
&lt;p&gt;1. &amp;nbsp;Log in to your google account (you must be logged in)&amp;nbsp;&lt;/p&gt;
&lt;p&gt;2. Go to https://spreadsheets0.google.com/spreadsheet/pub?hl=en_US&amp;amp;hl=en_US&amp;amp;key=0AhdYOHWBUlw_dF83RVRtQjdiQ2dwZXJCRERlUWR0SkE&amp;amp;output=html&lt;/p&gt;
&lt;p&gt;3. Paste your own list of names into column A or if there are existing names, you can use them.&lt;/p&gt;
&lt;p&gt;4. &amp;nbsp;If there is already classification information in columns B-H you can select and delete these to clear them&lt;/p&gt;
&lt;p&gt;5. Select a series of names in column B.&lt;/p&gt;
&lt;p&gt;6. Go to the &quot;GBIF&quot; option in the menu. &amp;nbsp;Select one of the two options.&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/12824/google-docs-taxonomic-name-service</guid>
         <pubDate>Fri, 03 Jun 2011 09:35:54 +0000</pubDate>
      </item>
      <item>
         <title>Publishing Species Checklists, A Step by Step Guide</title>
         <link>http://community.gbif.org/pg/blog/read/11997/publishing-species-checklists-a-step-by-step-guide</link>
         <description>&lt;p&gt;A new GBIF user guide has been released today. &amp;nbsp;&quot;Publishing Species Checklists, A Step by Step Guide&quot; provides a high-level guide to a suite of tools, reference guides, and best practices for publishing annotated species checklists through the GBIF network. &amp;nbsp; http://links.gbif.org/checklist_how_to&lt;/p&gt;
</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/11997/publishing-species-checklists-a-step-by-step-guide</guid>
         <pubDate>Fri, 15 Apr 2011 07:04:49 +0000</pubDate>
      </item>
      <item>
         <title>Screencast available for using the Darwin Core Archive Assistant</title>
         <link>http://community.gbif.org/pg/blog/read/11949/screencast-available-for-using-the-darwin-core-archive-assistant</link>
         <description>&lt;p&gt;Publishing biodiversity data using Darwin Core Archives (DwC-A) is simple enough that it can be done with no dedicated software installed on your servers. &amp;nbsp; There is one XML file, called the metafile, that is required in a DwC-A. &amp;nbsp;The Darwin Core Archive Assistant is a service that builds this file for you. &amp;nbsp;See a 10 minute screencast that demonstrates this service at&amp;nbsp;http://www.youtube.com/watch?v=_hF0sslw-B4&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/11949/screencast-available-for-using-the-darwin-core-archive-assistant</guid>
         <pubDate>Wed, 13 Apr 2011 07:28:43 +0000</pubDate>
      </item>
      <item>
         <title>GBIF becomes member of Species 2000</title>
         <link>http://community.gbif.org/pg/blog/read/11639/gbif-becomes-member-of-species-2000</link>
         <description>&lt;p&gt;On 14 March, GBIF joined the Species 2000 consortium as a new member.&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/11639/gbif-becomes-member-of-species-2000</guid>
         <pubDate>Tue, 05 Apr 2011 05:56:29 +0000</pubDate>
      </item>
      <item>
         <title>GBIF TaxonFinder service and the new TIKA document service</title>
         <link>http://community.gbif.org/pg/blog/read/10963/gbif-taxonfinder-service-and-the-new-tika-document-service</link>
         <description>&lt;p&gt;The GBIF TaxonFinder service, a web service for extracting scientific names from documents, can now extract names from PDF files, Word documents, and even PowerPoint files, using the new GBIF TIKA service. &amp;nbsp;The 3 steps for using this service to extract names from documents are as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Identify a source document URL. For example, this URL points to a PDF publication listing the biodiversity of earthworms in Taiwan: http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf&lt;/li&gt;
&lt;li&gt;Append the document URL to the url parameter of the GBIF TIKA service: http://ecat-dev.gbif.org/ws-tika/?url=http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf The output of this service is an XML document containing the full text of the PDF.&lt;/li&gt;
&lt;li&gt;Use the URL from step 2 as an input parameter to the GBIF TaxonFinder Service or any other name-finding service that supports the GNA Name-Finder API. For the GBIF TaxonFinder Service, start from the base service call and add the parameters:
&lt;ol&gt;
&lt;li&gt;http://tools.gbif.org/ws/taxonfinder?&lt;/li&gt;
&lt;li&gt;Set type=url and set input to the URL from step 2.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
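&lt;p&gt;The three steps above amount to nesting one URL inside another. The following is a minimal sketch in Python, assuming the two endpoints quoted in this post are still reachable and that they accept percent-encoded parameter values (the post itself shows the nested URLs unencoded). The function names tika_url and taxonfinder_url are hypothetical helpers for illustration, not part of any GBIF library:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Endpoints as quoted in the post; assumed, not verified to be live.
TIKA_BASE = "http://ecat-dev.gbif.org/ws-tika/"
TAXONFINDER_BASE = "http://tools.gbif.org/ws/taxonfinder"


def tika_url(document_url):
    """Step 2: wrap the document URL in a TIKA text-extraction call."""
    return TIKA_BASE + "?" + urlencode({"url": document_url})


def taxonfinder_url(document_url):
    """Step 3: pass the TIKA URL to TaxonFinder with type=url."""
    query = urlencode({"type": "url", "input": tika_url(document_url)})
    return TAXONFINDER_BASE + "?" + query


# Step 1: the example source document from the post.
pdf = "http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf"
print(taxonfinder_url(pdf))
```

&lt;p&gt;The sketch only builds the request URL and makes no network call. Fetching the printed URL with any HTTP client should return the XML output, provided the services respond as described.&lt;/p&gt;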
&lt;p&gt;See the &lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://tools.gbif.org/ws/taxonfinder?type=url&amp;amp;input=http://ecat-dev.gbif.org/ws-tika/?url=http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf&quot;&gt;XML output here&lt;/a&gt;.&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/blog/read/10963/gbif-taxonfinder-service-and-the-new-tika-document-service</guid>
         <pubDate>Mon, 20 Dec 2010 11:43:15 +0000</pubDate>
      </item>
      <item>
         <title>Taxonomic Name Processing</title>
         <link>http://community.gbif.org/pg/pages/view/10946/taxonomic-name-processing</link>
         <description>&lt;div&gt;Processing scientific names is often a necessary part of working with biodiversity information. Several focal areas have emerged and been the subject of work within GNA Nomina meetings. These include:&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://community.gbif.org/pg/pages/view/1128/&quot;&gt;Name recognition tools&lt;/a&gt;&amp;nbsp;that recognize known scientific names in free text.&lt;/li&gt;
&lt;li&gt;Name discovery algorithms that extend recognition to include novel taxon names.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://community.gbif.org/pg/pages/view/1129/&quot;&gt;Name reconciliation&lt;/a&gt;&amp;nbsp;or de-aliasing methods that group spelling variations of names and reconcile them to correctly spelled names.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://community.gbif.org/pg/pages/view/1124/&quot;&gt;Atomizers&lt;/a&gt;&amp;nbsp;and canonizers that break a name into its component parts (genus, species, authorship, etc.) and convert it to a normalized form.&lt;/li&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://community.gbif.org/pg/pages/view/1125/&quot;&gt;Services and applications&lt;/a&gt;&amp;nbsp;that use the tools described in this list.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This site provides forums for discussion, documentation, files and links to&amp;nbsp;&lt;a rel=&quot;nofollow&quot; target=&quot;_blank&quot; href=&quot;http://code.google.com/p/taxon-name-processing&quot;&gt;source repositories containing dictionaries, source code and development support&lt;/a&gt;.&lt;/p&gt;</description>
         <guid isPermaLink="false">http://community.gbif.org/pg/pages/view/10946/taxonomic-name-processing</guid>
         <pubDate>Fri, 17 Dec 2010 13:09:15 +0000</pubDate>
      </item>
      <item>
         <title>New framework to deliver biodiversity knowledge</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-framework-to-deliver-biodiversity-knowledge</link>
         <description>Global Biodiversity Informatics Outlook sets out key steps to harness IT and open data to inform better decisions</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 02 Oct 2013 09:24:00 +0000</pubDate>
      </item>
      <item>
         <title>Using ‘big data’ to help conserve life on Earth</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-gbif-web-portal-to-be-launched</link>
         <description>Save the date: Wednesday, 9 October</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Tue, 24 Sep 2013 13:51:00 +0000</pubDate>
      </item>
      <item>
         <title>Historic Brazilian plant collection available via GBIF</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/historic-brazilian-plant-collection-available-via-gbif</link>
         <description>Rio de Janeiro Botanic Gardens Research Institute is first data publisher from Brazil to enter global online network</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 16 Sep 2013 09:34:00 +0000</pubDate>
      </item>
      <item>
         <title>Data publishing network VertNet joins GBIF</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/data-publishing-network-vertnet-joins-gbif</link>
         <description>New Participant mobilizes global vertebrate occurrence data</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 13 Sep 2013 08:05:00 +0000</pubDate>
      </item>
      <item>
         <title>Filling data gaps to address biodiversity challenges</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/filling-data-gaps-to-address-biodiversity-challenges</link>
         <description>Task group recommendations on demand-driven data mobilization will be addressed in new GBIF Work Programme</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 15 Aug 2013 13:48:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF Participants collaborate in 2013 mentoring projects</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-participants-collaborate-in-2013-mentoring-projects</link>
         <description>Eight Participant nodes share expertise on data publishing, portal development, digitization and e-learning.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Tue, 16 Jul 2013 08:46:00 +0000</pubDate>
      </item>
      <item>
         <title>Registration for GBIF Governing Board meeting now open</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/registration-for-gbif-governing-board-meeting-now-open</link>
         <description>GB20 and associated events to be held in Berlin, Germany from 4-10 October</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 11 Jul 2013 14:51:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF Annual Report and Science Review published</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-annual-report-and-science-review-published</link>
         <description>New publication highlights more than 230 research papers citing use of GBIF as data source during 2012.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 24 Jun 2013 13:50:00 +0000</pubDate>
      </item>
      <item>
         <title>Think bigger, GBIF award winner urges biologists</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/think-bigger-gbif-award-winner-urges-biologists</link>
         <description>Winner of GBIF science prize will design ‘Ecotron’ to test predictions of climate impacts on species; other awards support young researchers’ work on bat ‘nectar corridors’ and identifying data gaps.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 17 Jun 2013 14:12:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF enables global study of climate impact on species</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-enables-global-study-of-climate-impact-on-species</link>
         <description>Research in Nature Climate Change uses data on 50,000 common plants and animals to predict worldwide range losses without urgent action to limit emissions</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 13 May 2013 05:18:00 +0000</pubDate>
      </item>
      <item>
         <title>Israel joins GBIF</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/israel-joins-gbif</link>
         <description>Decision aims to enhance collaboration on nature conservation and research</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 24 Apr 2013 11:40:00 +0000</pubDate>
      </item>
      <item>
         <title>US launches new biodiversity data system</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/us-launches-new-biodiversity-data-system</link>
         <description>BISON project from GBIF national node offers access to more than 100 million mapped species records</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 19 Apr 2013 11:21:00 +0000</pubDate>
      </item>
      <item>
         <title>Brazil surveys data holdings and informatics capacity</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/brazil-surveys-data-holdings-and-informatics-capacity</link>
         <description>Review of national institutions will help mobilize biodiversity data for new GBIF national node</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 14 Mar 2013 08:50:00 +0000</pubDate>
      </item>
      <item>
         <title>New portal on Colombia’s biodiversity</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-portal-on-colombias-biodiversity</link>
         <description>Information on Colombia’s biological diversity is now available at http://www.sibcolombia.net.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 06 Mar 2013 08:08:00 +0000</pubDate>
      </item>
      <item>
         <title>Support for projects to promote ‘data paper’ publication</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/support-for-projects-to-promote-data-paper-publication</link>
         <description>GBIF nodes in Spain, Colombia and India plan to enable more than 40 publications describing biodiversity datasets.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 18 Jan 2013 11:34:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF ready to support new science-policy platform</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-ready-to-support-new-science-policy-platform</link>
         <description>Briefing sets out GBIF role ahead of first IPBES plenary meeting</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 17 Jan 2013 08:10:00 +0000</pubDate>
      </item>
      <item>
         <title>Call for proposals for the 2013 Young Researchers Award</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-proposals-for-the-2013-young-researchers-award</link>
         <description>GBIF invites proposals from graduate students for the 2013 Young Researchers Award. This prize intends to foster innovative research and discovery in biodiversity informatics.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 14 Jan 2013 07:24:51 +0000</pubDate>
      </item>
      <item>
         <title>Call for nominations for the 2013 Ebbe Nielsen Prize</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-nominations-for-the-2013-ebbe-nielsen-prize</link>
         <description>GBIF invites nominations for the 2013 Ebbe Nielsen Prize, awarded annually to a person or team who demonstrates excellence in combining biodiversity informatics and biosystematics research.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 14 Jan 2013 11:21:00 +0000</pubDate>
      </item>
      <item>
         <title>Brazil joins global initiative for biodiversity data access</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/brazil-joins-global-initiative-for-biodiversity-data-access</link>
         <description>Entry of mega-diverse country into network ‘very exciting’, says GBIF chair</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 30 Nov 2012 13:10:30 +0000</pubDate>
      </item>
      <item>
         <title>Peer review option proposed for biodiversity data</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/peer-review-option-proposed-for-biodiversity-data</link>
         <description>Discussion paper suggests new quality control procedures to promote data publishing and use</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 30 Nov 2012 13:11:15 +0000</pubDate>
      </item>
      <item>
         <title>New guide for compiling national species checklists</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-guide-for-compiling-national-species-checklists</link>
         <description>Step-by-step advice on developing species inventories aims to help countries improve management of biodiversity.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 30 Nov 2012 13:06:19 +0000</pubDate>
      </item>
      <item>
         <title>National biodiversity information ‘grid’ proposed for India</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/national-biodiversity-information-grid-proposed-for-india</link>
         <description>Senior government advisers back plan involving Indian GBIF node</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 17 Oct 2012 13:04:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF signs invasive species information partnership</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-signs-invasive-species-information-partnership</link>
         <description>Collaboration aims to help countries reduce threat to biodiversity from alien species</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 10 Oct 2012 15:26:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF committed to provide CBD data foundations – Hobern</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-committed-to-provide-cbd-data-foundations-hobern</link>
         <description>Executive Secretary statement to biodiversity convention’s conference in India</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Tue, 09 Oct 2012 08:02:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF promotes shared data culture at India conference</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-promotes-shared-data-culture-at-india-conference</link>
         <description>Events at CBD Conference of Parties (COP11) outline data mobilization projects, better information on invasive species, and a shared vision for ‘biodiversity intelligence’</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 08 Oct 2012 09:58:00 +0000</pubDate>
      </item>
      <item>
         <title>Plant data helps climate models - GBIF award winner</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/plant-data-helps-climate-models-gbif-award-winner</link>
         <description>Availability of large volumes of data through GBIF helps refine predictions of climate impact, says Ebbe Nielsen Prize winner</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 19 Sep 2012 12:48:00 +0000</pubDate>
      </item>
      <item>
         <title>Symposium showcases use of open-access biodiversity data</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/symposium-to-showcase-use-of-open-access-biodiversity-data-in-science</link>
         <description>Presentations highlight research on climate change impacts, Arctic biodiversity</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Sun, 16 Sep 2012 22:00:00 +0000</pubDate>
      </item>
      <item>
         <title>New guide for developing marine species checklists</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-guide-for-developing-marine-species-checklists</link>
         <description>Document highlights resources specific to marine regions for eventual publishing through GBIF</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 17 Aug 2012 10:19:00 +0000</pubDate>
      </item>
      <item>
         <title>Sharing expertise to manage data for science and society</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/sharing-expertise-to-manage-data-for-science-and-society</link>
         <description>Mentoring among GBIF Participants will support network-building and publication of biodiversity data from Indonesia, Costa Rica and cities worldwide.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 01 Aug 2012 08:33:00 +0000</pubDate>
      </item>
      <item>
         <title>New GBIF reference guide for verifying zoological names</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-gbif-reference-guide-for-verifying-zoological-names</link>
         <description>Manual also gives recommendations on publishing new names.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 12 Jul 2012 12:42:00 +0000</pubDate>
      </item>
      <item>
         <title>Building global collaboration for biodiversity intelligence</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/building-global-collaboration-for-biodiversity-intelligence</link>
         <description>Public to play major role in mobilizing expanded range of data needed to preserve vital functions of life on Earth, conference concludes.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 06 Jul 2012 14:23:00 +0000</pubDate>
      </item>
      <item>
         <title>A new vision for harnessing data about life on Earth</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/a-new-vision-for-harnessing-data-about-life-on-earth</link>
         <description>Conference will develop strategy to prioritize investment in biodiversity informatics.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 21 Jun 2012 15:07:00 +0000</pubDate>
      </item>
      <item>
         <title>Registration for GB19 is now available online</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/registration-for-gb19-is-now-available-online</link>
         <description>The online registration site for the 19th GBIF Governing Board meeting (GB19) is now open. GB19 will be held in Lillehammer, Norway, on 18 and 20 September 2012 with associated events taking place in the days before and after.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 07 Jun 2012 16:34:00 +0000</pubDate>
      </item>
      <item>
         <title>Awards target novel research on species distributions</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/awards-target-novel-research-on-species-distributions</link>
         <description>GBIF Ebbe Nielsen prize and Young Researchers Award honour scientists exploring interactions between climate, ecology and evolution</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 04 Jun 2012 12:58:00 +0000</pubDate>
      </item>
      <item>
         <title>Managing biodiversity data from local government</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/managing-biodiversity-data-from-local-government</link>
         <description>Guidance for local authorities on publishing through the GBIF network, helping preserve knowledge about biodiversity.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 25 May 2012 09:20:00 +0000</pubDate>
      </item>
      <item>
         <title>New GBIF guide for citing biodiversity data</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-gbif-guide-for-citing-biodiversity-data</link>
         <description>Recommended practice will credit all involved in making datasets freely accessible</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 14 May 2012 14:16:00 +0000</pubDate>
      </item>
      <item>
         <title>GBIF Annual Report for 2011 published</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-annual-report-for-2011-published</link>
         <description>Details a decade of achievement in providing free and open access to biodiversity data</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 23 Apr 2012 13:44:00 +0000</pubDate>
      </item>
      <item>
         <title>New GBIF videos on improving biodiversity data quality</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-gbif-videos-on-improving-biodiversity-data-quality</link>
         <description>Training videos provide information on sources of error and fitness for use</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 19 Apr 2012 13:28:00 +0000</pubDate>
      </item>
      <item>
         <title>Conference to chart way ahead for biodiversity informatics</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/conference-to-chart-way-ahead-for-biodiversity-informatics</link>
         <description>Copenhagen event aims to unite diverse communities to address key policy and science challenges.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Tue, 20 Mar 2012 13:50:00 +0000</pubDate>
      </item>
      <item>
         <title>Genomic data in GBIF moves a step closer</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/genomic-data-in-gbif-moves-a-step-closer</link>
         <description>Aligning standards will help share information on biodiversity yet to be discovered.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 14 Mar 2012 13:51:00 +0000</pubDate>
      </item>
      <item>
         <title>Results of GBIF regional training call announced</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/results-of-gbif-regional-training-call-announced</link>
         <description>Events to be held in Taipei, Kampala, Kathmandu and Bogotá will benefit from the support.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 07 Mar 2012 12:48:00 +0000</pubDate>
      </item>
      <item>
         <title>New Executive Secretary: exciting time ahead for GBIF</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-executive-secretary-exciting-time-ahead-for-gbif</link>
         <description>Donald Hobern sees data mobilized by network increasingly relevant for policy, science</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Tue, 31 Jan 2012 13:16:00 +0000</pubDate>
      </item>
      <item>
         <title>Call for applications for regional training support</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-applications-for-regional-training-support</link>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 16 Jan 2012 13:18:00 +0000</pubDate>
      </item>
      <item>
         <title>Call for proposals for the 2012 Young Researchers Award</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-proposals-for-the-2012-young-researchers-award</link>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 09 Jan 2012 13:41:00 +0000</pubDate>
      </item>
      <item>
         <title>Call for nominations for the 2012 Ebbe Nielsen Prize</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-nominations-for-the-2012-ebbe-nielsen-prize</link>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 19 Dec 2011 15:35:00 +0000</pubDate>
      </item>
      <item>
         <title>German research team targets ‘at risk’ data on biodiversity</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/german-research-team-targets-at-risk-data-on-biodiversity</link>
         <description>Project example of data hosting infrastructure promoted by GBIF</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 19 Dec 2011 10:10:00 +0000</pubDate>
      </item>
      <item>
         <title>New biodiversity data publishing framework proposed</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/new-biodiversity-data-publishing-framework-proposed</link>
         <description>Recommendations target social, cultural, technical, policy, legal and economic components to promote data sharing.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Thu, 15 Dec 2011 12:00:00 +0000</pubDate>
      </item>
      <item>
         <title>Biodiversity knowledge vital to combat climate change - King</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/biodiversity-knowledge-vital-to-combat-climate-change-nick-king</link>
         <description>In an intervention at the Durban Climate Change Conference, GBIF Executive Secretary Nick King calls for wider participation.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Fri, 09 Dec 2011 09:25:00 +0000</pubDate>
      </item>
      <item>
         <title>Plans take shape to digitize camera trap data from India</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/plans-take-shape-to-digitize-camera-trap-data-from-indian-wildlife</link>
         <description>Indo-Norwegian pilot project for new science-policy platform targets conservation strategies for large mammals</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Wed, 07 Dec 2011 09:29:00 +0000</pubDate>
      </item>
      <item>
         <title>First database-derived ‘Data Paper’ published in journal</title>
         <link>http://www.gbif.org/communications/news-and-events/showsingle/article/first-database-derived-data-paper-published-in-journal</link>
         <description>Peer-reviewed description of Indian bird dataset published through GBIF sets precedent for data-sharing incentives.</description>
         <guid isPermaLink="false"></guid>
         <pubDate>Mon, 28 Nov 2011 17:00:00 +0000</pubDate>
      </item>
   </channel>
</rss>
