GBIF News Feed

Simplified Downloads

Fede Méndez — Thu, 11 Jun 2015 17:06:00 +0000

Since its re-launch in 2013 gbif.org has supported the downloading of occurrence data using an arbitrary query with the download file provided as a Darwin Core Archive file whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, in cases where people only need the occurrence data in its simplest form the DwC-A format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format: a zip file that only contains a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format:

From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.

Technical Architecture

The simplified download format was implemented following the technical requirement that new formats should be supported in the near future with minimal impact to the formats supported at a specific moment. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records; a threshold of 200,000 records is set to define when a download is small (< 200K) and big (>200K), where history shows a vast majority of “small” downloads. The following chart summarizes the key technologies that enables occurrence downloads:

Download workflow

Occurrence downloads are automated using a workflow engine called Oozie, it coordinates the required steps to produce a single download file. In summary the workflow proceeds as follows:

Initially, Apache Solr is contacted to determine the number of records that the download file will contain.
Big or small?

If the amount of records is less than 200,000 (it is small download), Apache Solr is queried to iterate over the results; the detail of each occurrence record is fetched from HBase since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the Akka framework; the Apache Zookeeper and Curator frameworks are used to limit the amount of threads that can be running at the same time (it avoids a thread explosion in the machines that run the download workflow).
If the amount of records is greater than 200,000 (it is a big download), Apache Hive is used to retrieve the occurrence data from an HDFS table. To avoid overloading of HBase we create that HDFS table as a daily snapshot of the occurrence data stored in HBase.

Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).

Note: the details of how this is implemented can be consulted in the Github project: https://github.com/gbif/occurrence/tree/master/occurrence-download.

Conclusion

Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.

Don't fill your HDFS disks (upgrading to CDH 5.4.2)

Oliver Meyn — Fri, 29 May 2015 16:34:00 +0000

Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.

The Machine Configuration

The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different than our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes where each one has 13 disks dedicated to HDFS storage: 12 are 1TB and one is 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.

Pre Upgrade Status

We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (dfs.datanode.fsdataset.volume.choosing.policy) with 10GB reserved for non-hdfs use (dfs.datanode.du.reserved), which are the defaults in CDH manager. Each of the 1TB disks was around 700GB used (of 932GB usable), and the 512 GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-hdfs use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning this upgrade would bringing everything down. And because our cluster is getting full (> 80% usage is another rumoured "bad things" threshold) we have reduced one class of data (user's occurrence downloads) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.

Upgrade Time

We followed the recommended procedure and did the oozie, hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, where the symptom was that it was taking a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the NameNode logs we see that it was performing a "Block Pool Upgrade", which took btw 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:

2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename 27117@c4n1.gbif.org
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.
old LV = -56; old CTime = 1416737045694.
new LV = -56; new CTime = 1432405112136
2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files. Also physically copied 0 other files.

2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672-130.226.238.178-1367832131535 at /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535 is complete

That upgrade time happens sequentially for each disk, so even the though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, then we finally get to disk_m, our nearly full 512G disk:

2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/dfs/dn/in_use.lock acquired by nodename 12424@c4n1.gbif.org
2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.
   old LV = -56; old CTime = 1416737045694.
   new LV = -56; new CTime = 1432405112136
2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672-130.226.238.178-1367832131535
java.io.IOException: Cannot create directory /mnt/disk_m/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1259)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(DataStorage.java:1023)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(BlockPoolSliceStorage.java:647)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(BlockPoolSliceStorage.java:456)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:390)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:171)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:214)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1397)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1362)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:227)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:839)
        at java.lang.Thread.run(Thread.java:745)

2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672-130.226.238.178-1367832131535 : Cannot create directory /mnt/disk_m/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168

The somewhat misleading "Cannot create directory" is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10G that was available to "non-HDFS" (which we've concluded means "not HDFS storage files, but everything else is fair game"). Because some space is available to start the upgrade, it begins, but then when it exhausts the disk it fails, and This Kills The DataNode. It does clean up after itself, but prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.

So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the colocated yarn directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down and unfortunately they were in different racks, so that meant we had a big pile of our replication factor 2 files missing blocks (the default HDFS block replication policy places the second and subsequent copies on a different rack from the first copy).

While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:

WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.FileNotFoundException: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)

After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened but presume that after one of our failures it didn't clean up it's temp files on disk_m and then on subsequent restarts found the disk completely full and (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2, and by bad luck had both copies of some of their blocks on the failed disk_m (one from rack1, one from rack2).

It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.

On inodes and hardlinks

The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/<blockpool name>/current/finalized and within finalized are a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made called previous.tmp. It's not a normal copy however - it uses hardlinks in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive ("drive not formatted"). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling decidedly un-Yahoo). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes on /rack2. We then unmounted the USB drive, wiped both /disk_m's and then let replication balance out again. Final result: no lost blocks.

Mitigation

With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:

dfs.datanode.du.reserved:25GB this guarantees 25GB free on each volume (up from 10GB) and should be enough to allow a future upgrade to happen
dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace
dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone

Conclusion

This was one small taste of what can go wrong with filling heterogenous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's anyway you can help it.

Improving the GBIF Backbone matching

Markus Döring — Mon, 30 Mar 2015 22:30:00 +0000

In GBIF occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and create consistent metrics and searches according to a single classification and synonymy.

Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.

In order to improve the taxonomic classification of occurrence records, we are undertaking 2 activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assembled the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.

1.Name parsing of undetermined species

In occurrences we see many names with a partly undetermined name such as Lucanus spec. Erroneously these rank markers have been treated as real species epithets and together with fuzzy matching resulted in poor results.

Examples

Xysticus sp. used to wrongly match Xysticus spiethi while it now just matches the genus Xysticus.
Triodia sp. used to match the family Poaceae while it now matches the genus

2. Damerau–Levenshtein distance algorithm

For scoring fuzzy matches we have so far applied the Jaro Winkler distance which is often used for matching person names. It tends to allow for rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too big and we decided to revert to the classical and more predictable Damerau–Levenshtein distance. This reduces false positive fuzzy matches considerably even though we lost a few good matches at the same time.

Examples

Xyris kralii Wand. used to match to Xyris harleyi but now just matches to the genus Xyris L. as the species is missing from our backbone.
Zea mays subsp. parviglumis var. huehuet Iltis & Doebley used to match Zea mays var. hirta while it now just hits the species Zea mays L.

Matching results

The distinct, verbatim classifications of 528 million records were passed through the original and the new fuzzy matching algorithms - this included 10.5 million distinct classifications in total. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records produce a different match. So far we have taken a random subsample of the records which change, and manually inspected the results - we can hardly spot any degression or wrong matches.

We have published the complete matching comparison as well as the subset of changed records at Zenodo as tab delimited files:

Dataset 1: All classification matches (10.5 million)

Dataset 2: Changed matches (428 thousand)

The schema of the files have 3 column families each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every match record (verbatim prefixed with v_ , old matching with an _old suffix and the new matching results with plain terms, e.g. v_scientificName, scientificName_old, scientificName).

We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, mdoering@gbif.org.

Appendix

Create distinct occurrence names table

CREATE TABLE markus.names AS 
SELECT count(*) as numocc, count(distinct datasetKey) as numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ as v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
FROM prod_b.occurrence_hdfs 
GROUP BY v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
ORDER BY v_scientificName, numocc DESC

Lookup taxonkey with both old & new lookup

CREATE TABLE markus.name_matches AS
SELECT 
  n.numocc, 
  n.numdatasets, 
  n.v_scientificName, 
  n.v_kingdom, 
  n.v_phylum, 
  n.v_class, 
  n.v_order, 
  n.v_family, 
  n.v_genus, 
  n.v_subgenus, 
  n.v_specificEpithet, 
  n.v_infraspecificEpithet, 
  n.v_scientificNameAuthorship, 
  n.v_taxonrank, 
  n.v_higherClassification, 

  prod.taxonKey as taxonKey_old,
  prod.scientificName as scientificName_old,
  prod.rank as rank_old,
  prod.status as status_old,
  prod.matchType as matchType_old,
  prod.confidence as confidence_old,
  prod.kingdomKey as kingdomKey_old,
  prod.phylumKey as phylumKey_old,
  prod.classKey as classKey_old,
  prod.orderKey as orderKey_old,
  prod.familyKey as familyKey_old,
  prod.genusKey as genusKey_old,
  prod.speciesKey as speciesKey_old,
  prod.kingdom as kingdom_old,
  prod.phylum as phylum_old,
  prod.class_ as class_old,
  prod.order_ as order_old,
  prod.family as family_old,
  prod.genus as genus_old,
  prod.species as species_old,

  uat.taxonKey as taxonKey,
  uat.scientificName as scientificName,
  uat.rank as rank,
  uat.status as status,
  uat.matchType as matchType,
  uat.confidence as confidence,
  uat.kingdomKey as kingdomKey,
  uat.phylumKey as phylumKey,
  uat.classKey as classKey,
  uat.orderKey as orderKey,
  uat.familyKey as familyKey,
  uat.genusKey as genusKey,
  uat.speciesKey as speciesKey,
  uat.kingdom as kingdom,
  uat.phylum as phylum,
  uat.class_ as class_,
  uat.order_ as order_,
  uat.family as family,
  uat.genus as genus,
  uat.species as species

FROM (
  SELECT 
    numocc, 
    numdatasets, 
    v_scientificName, 
    v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus, 
    v_specificEpithet, 
    v_infraspecificEpithet, 
    v_scientificNameAuthorship, 
    v_taxonrank, 
    v_higherClassification, 
    match('PROD', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod, 
    match('UAT', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  FROM markus.names
) n;

Hive exports

CREATE TABLE markus.matches_changed 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS '' AS 
SELECT * from markus.name_matches 
WHERE taxonKey!=taxonKey_old;

CREATE TABLE markus.matches_all 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS '' AS 
SELECT * from markus.name_matches;

IPT v2.2 – Making data citable through DataCite

Kyle Braak — Fri, 27 Mar 2015 13:55:00 +0000

GBIF is pleased to release IPT 2.2, now capable of automatically connecting with either DataCite or EZID to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.

DataCite integration explained

DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals(1):

Establish easier access to research data on the Internet
Increase acceptance of research data as citable contributions to the scholarly record
Support research data archiving to permit results to be verified and re-purposed for future study

EZID is hosted by the California Digital Library (a founding member of DataCite) and adds services on top of the DataCite DOI infrastructure such as their own easy-to-use programming interface.

To integrate with DataCite and further these three goals for biodiversity data, IPT version 2.2 introduces the following new features:

DOIs can be assigned to datasets thereby making them persistently resolvable
A new DOI can be assigned to a dataset each time it undergoes scientifically significant changes, which is recommended best practice(1) and part of the IPT's new versioning policy
Citations can be automatically generated for datasets in a standard format which includes the DOI and dataset version number

A version history is kept for each dataset, allowing researchers to easily track changes and access/download all previous versions

To take advantage of these optional new features, there are two basic requirements:

The IPT must be configured with either a DataCite or EZID account. GBIF participants interested in a DataCite account should contact the GBIF Helpdesk directly. General information about getting a DataCite account can be found here; information about getting an EZID account can be found here.
The IPT should be always on and accessible to ensure that assigned DOIs continue to be resolvable.

Once publishers make their data citable through DataCite they can expect the following benefits:

Their datasets will be globally discoverable through the DataCite Metadata Search tool and the Thomson Reuters Data Citation Index (part of the Web of Science) thanks to a collaboration with DataCite formalised in 2014
They can find out exactly who cited their dataset via the Data Citation Index, and better understand the impact their dataset has had within the scholarly research and policy making communities

Sample basic metadata page, IPT 2.2

Other new features

The IPT 2.2 also introduces a simple way of licensing datasets under one of three machine readable waivers or licences: CC0 v1.0, CC-BY v4.0, or CC-BY-NC v4.0. These waivers or CC licenses are "something that the creators of works can understand, their users can understand, and even the Web itself can understand."(2) You may read more about GBIF's new licensing policy here for more information.

Sample resource overview page, IPT 2.2

Whether an IPT is DOI-turbocharged or not, there are a number of other new benefits in this release:

basisOfRecord validation for occurrence datasets
The ability to preview source mappings prior to publication
The ability to preview resource metadata prior to publication
A suite of new metadata fields such as ORCIDs for contacts
An enhanced user interface including a new and improved resource homepage
Additional context help to guide users, especially first-time users

Acknowledgements

Thanks to the hard work and dedication of the team of contributors, version 2.2 has been fully translated into French, Japanese, Portuguese, and Spanish. Since so many new features have gone into this new version, the text requiring translation was enormous. The following translators deserve a huge thanks, merci, arigato, obrigado, and gracias:

Sophie Pamerlon, Marie-Elise Lecoq (GBIF France) - Updating French translation
Yukiko Yamazaki (GBIF Japan (JBIF)) - Updating Japanese translation
Allan Koch Veiga, Etienne Americo Cartolano, Daniel Lins, and Antonio Mauro Saraiva (Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp) - Updating Portuguese translation
Dairo Escobar, Nestor Beltran, and Daniel Amariles (Colombian Biodiversity Information System (SiB Colombia)) - Updating Spanish Translation

Lastly, a special thanks must go out to David Shorthouse from Canadensys for his guidance and help. Canadensys has been assigning DOIs to datasets it serves via its IPT since 2012, as described here, and has provided invaluable assistance throughout development.

On behalf of the GBIF development team, I really hope you enjoy using this new version, and hope that you will be able to take advantage of all its exciting new features.

Footnotes

http://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf
https://creativecommons.org/licenses/

Upgrading our cluster from CDH4 to CDH5

Oliver Meyn — Wed, 26 Nov 2014 11:41:00 +0000

A little over a year ago we wrote about upgrading from CDH3 to CDH4 and now the time had come to upgrade from CDH4 to CDH5. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.

The Cluster

Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.

Upgrade CDH Manager

The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The Cloudera documentation is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for it's monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stop gap we've added a 4th old machine to the "masters" group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources.

Upgrade Cluster Members

Again the Cloudera documentation is excellent but I'll just add a bit. The upgrade process will now ask if a JAVA jdk should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 rpms that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the "Memory Overcommit Validation Threshold" to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on figuring out memory settings for YARN. Additionally Hue's configuration required some attention because after the upgrade it had forgotten where Zookeeper and the HBase Thrift server were. All in all quite painless.

The Gotchas

Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies, or in changes to how a user classpath is specified as having priority over that of Yarn/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.

Guava

We're not alone in our suffering at the hands of mismatched Guava versions (e.g. HADOOP-10101, HDFS-7040), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and more importantly HBase, and exclude any higher version guavas from our dependencies. This meant downgrading some actual code that was using guava 15, but was the easiest path to getting a working system.

Jackson

We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.

Logging

Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's not surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Java Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. <montage>Angry men smash things angrily over the course of days</montage> Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at YARN-2875. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.

YARN

The crux of our problems was having our classpath loaded after the Hadoop classpath had been loaded, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is "mapreduce.job.user.classpath.first". YARN also quizzically claims that the parameter is deprecated, but.. works for me.

Oozie

Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is "oozie.launcher.mapreduce.job.user.classpath.first". We had been loading the old parameter "mapreduce.task.classpath.user.precedence" in each action in the workflow using the <job-xml> tag to load the configs from a file called hive-default.xml. We then encountered two problems:

Note the name - we used hive-default.xml instead of hive-site.xml because of a bug in Oozie (discussed here and here). That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called hive-site.xml and contains our specific configs and is again being picked up. BUT:
Adding oozie.launcher.mapreduce.job.user.classpath.first to hive-site.xml doesn't work! As we wrote up in Oozie bug OOZIE-2066 this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:

 <action name="run-test">  
  <java>  
   <job-tracker>c1n2.gbif.org:8032</job-tracker>  
   <name-node>hdfs://c1n1.gbif.org:8020</name-node>  
   <configuration>  
    <property>  
     <name>oozie.launcher.mapreduce.task.classpath.user.precedence</name>  
     <value>true</value>  
    </property>  
   </configuration>  
   <main-class>test.CPTest</main-class>  
  </java>  
  <ok to="end" />  
  <error to="kill" />  
 </action>

New Packaging Woes

We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.

Example exclusions

hbase-server (needed to run MapReduce over HBase): https://github.com/gbif/datacube/blob/master/pom.xml#L268

hbase-testing-util (needed to run mini clusters): https://github.com/gbif/datacube/blob/master/pom.xml#L302

hbase-client: https://github.com/gbif/metrics/blob/master/pom.xml#L226

hadoop-client (removing logging): https://github.com/gbif/metrics/blob/master/pom.xml#L327

Beyond just sorting conflicting dependencies, we also encountered a problem that presented as "No FileSystem for scheme: file". It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, and so we were getting only one of the META-INF/services files in the final jar. Thus we could not use the FileSystem to read local files (like jars for the class path) and also from HDFS. The fix was to include the org.apache.hadoop.fs.FileSystem in our project explicitly: https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem

Finally we had to stop the TableMapReduceUtil from bringing in it’s own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of initTableMapper:
https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37

Conclusion

As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no, major breaking change by these projects, yet discussions about removing HBaseTablePool are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.

Multimedia in GBIF

Markus Döring — Tue, 06 May 2014 12:06:00 +0000

We are happy to announce another long awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly 700 thousand occurrences with multimedia indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As has been requested by many people the media type is available as a new filter in the occurrence search and subsequently in downloads. For example you can now easily find all audio recordings of birds.

UAM:Mamm:11470 - Eumetopias jubatus - skull

If you follow to the details page of any of those records you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. This is different from images which are shown in a dedicated gallery you might have encountered for species pages before already. On the left you can see an example of a skull specimen with multiple images.

When requested for the first time, GBIF transiently caches the original images and processes them into various standard sizes and formats suitable for the use in the portal.

Publishing multimedia metadata

GBIF indexes multimedia metadata published in different ways within the GBIF network. From a simple URL given as an additional field in Darwin Core via multiple items expressed as ABCD XML or a dedicated multimedia extension in Darwin Core archives the difference usually is in metadata expressiveness.

Simple Darwin Core

Melocactus intortus record in iNaturalist

Whenever we spot the term dwc:associatedMedia in xml or Darwin Core archives as part of the a simple, flat occurrence record we try to extract URLs to media items. As the term officially allows for concatenated lists of URLs we try common delimiters such as comma, semicolon or the pipe symbols. An example of multiple, concatenated image URLs can be found in iNaturalist:

As you can see on the right every extracted link is regarded as a separate media item as there is no standard way to detect that 2 links refer to the same item. In the example above every image has a link to the actual image file and another one to the respective html page where it's metadata is presented. There is also no way to specify additional metadata about a link. As a consequence all images based on dwc:associatedMedia do not have a title, license or any further information. The verbatim data for that record before we extract image links can be seen here: http://www.gbif-uat.org/occurrence/891030819/verbatim

Darwin Core archive multimedia extension

By having a dedicated extension for media items many media items per core occurrence record can be published in a structured way. This is the GBIF recommended way to publish multimedia as it gives you most control over your metadata. Note that the same extension can also be used to publish multimedia for species in checklist datasets. This extension, based entirely on existing Dublin Core terms, allows you to specify the following information about a media item, all of which will make it into the GBIF portal if provided:

dc:type, the kind of media item based on the DCMI Type Vocabulary: StillImage, MovingImage or Sound
dc:format, MIME type of the multimedia object's format
dc:identifier, the public URL that identifies and locates the media file directly, not the html page it might be shown on
dc:references, the URL of an html webpage that shows the media item or its metadata. It is recommended to provide this url even if a media file exists as it will be used for linking out
dc:title, the media items title
dc:description, a textual description of the content of the media item
dc:created, the date and time this media item was taken
dc:creator, the person that took the image, recorded the video or sound
dc:contributor, any contributor in addition to the creator that helped in recording the media item
dc:publisher, the name of an entity responsible for making the image available
dc:audience, a class or description for whom the image is intended or useful
dc:source, a reference to the source the media item was derived or taken from. For example a book from which an image was scanned or the original provider of a photo/graphic, such as photography agencies
dc:license, license for this media object. If possible declare it as CC0 to ensure greatest use
dc:rightsHolder, the person or organization owning or managing rights over the media item

Access to Biological Collections Data

As usual we also provide a binding from the TDWG ABCD standard (versions 1.2 and 2.06) mostly used with the BioCASE software.

From ABCD 1.2 we extract media information based on the UnitDigitalImage subelements. In particular information about the file URL (ImageURI), the description (Comment) and the license (TermsOfUse).

In ABCD 2.06 we use the unit MultiMediaObject subelements instead. Here there are distinct file and webpage URLs (FileURI, ProductURI), the description (Comment), the license (License/Text, TermsOfUseStatements) and also an indication of the mime type (Format). The bird sound example from above comes in as ABCD 2.06 via the Animal Sound Archive dataset. You can see the original details of that ABCD record in it's raw XML fragment. There are also fossil images available through ABCD.

Missing from both ABCD versions is a media title, creator and created element.

Media type interpretation

We derive the media type from either an explicitly given dc:type, the mime type found in dc:format or the media file suffix. In the case of dwc:associatedMedia found in simple Darwin Core we can only rely on the file URL to interpret the kind of media item. If that URL is pointing to some html page instead of an actual static media file with a wellknown suffix the media type remains unknown.

Production deployment

We hope you like this new feature and we are eager to get this out into production within the next weeks. This is the first iteration of this work, and like all GBIF developments we welcome any feedback.

IPT v2.1 – Promoting the use of stable occurrenceIDs

Kyle Braak — Wed, 23 Apr 2014 12:22:00 +0000

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes:

Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide
Ability to support Microsoft Excel spreadsheets natively
Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan

With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16.

The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.

This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross reference databases and potentially help towards tracking use.

Previously, GBIF has asked publishers to use the three Darwin Core terms: institutionCode, collectionCode, and catalogNumber to uniquely identify their occurrence records. This triplet style identifier will continue to be accepted, however, it is notoriously unstable since the codes are prone to change and in many cases are meaningless for datasets originating from outside of the museum collections community. For this reason, GBIF is adopting the recommendations coming from the IPT user community and recommending the use of occurrenceID instead.

Best practices for creating an occurrenceID are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.

Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:

GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.
GBIF’s own occurrence identifiers will become inherently more stable as well.
GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).
Record-level citation can be made possible, enhancing attribution and the ability to track data usage.
It will be possible to consider tracking annotations and changes to a record over time.

If you’re a new or existing publisher, GBIF hope you’ll agree these goals are worth working towards, and start using occurrenceIDs.

The IPT 2.1 also includes support for uploading Excel files as data sources.

Another enhancement is that the interface has been translated into Japanese. GBIF offer their sincere thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan for this extraordinary effort.

In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?

If you like the IPT’s auto publishing feature, you will be happy to know the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto publish, but fail to be published for whatever reason, are now easily identifiable within the resource tables as shown:

If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to truncate unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (Harvard University Herbaria) bad rows now get skipped and reported to the user without skipping subsequent rows of data.

As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:

Marie-Elise Lecoq, and Gallien Labeyrie (GBIF France) - Updating French translation
Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation
Nestor Beltran (Colombian Biodiversity Information System (SiB)) - Updating Spanish translation
Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva (Universidade de São Paulo, Research Center on Biodiversity and Computing) - Updating Portuguese translation
Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements

On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.

Lots of columns with Hive and HBase

Oliver Meyn — Tue, 04 Mar 2014 11:20:00 +0000

We're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!

Or so we thought.

Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. The way our downloads work is as an Oozie workflow that executes a Hive query of an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase backed Hive table takes something like 4-5x as long. But to generated that HDFS table we need to start from a Hive table that _is_ backed by HBase.

Here's an example of how to write a Hive table definition for an HBase-backed table:

CREATE EXTERNAL TABLE tiny_hive_example (
key INT,
kingdom STRING,
kingdomkey INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table",
"hbase.table.default.storage.type" = "binary"
);

But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the basis of record term's column name is basisOfRecord) we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.

The solution:

alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;

We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (value is turned into a String: the source). Our new super-wide downloads table works, and will be in production soon!

The new (real-time) GBIF Registry has gone live

Kyle Braak — Mon, 28 Oct 2013 12:04:00 +0000

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on http://gbrds.gbif.org. Previously, when a dataset got registered in the GBRDS registry (for example using an IPT) it wasn't immediately visible in the portal for several weeks until after rollover took place.

In October, GBIF launched its new portal on www.gbif.org. During the launch we indicated that the real-time data management would be starting up in November. We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry.

What does this mean for you?

any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated

the GBRDS web application (http://gbrds.gbif.org) is no longer visible, since the new portal displays all the appropriate information

old links to the GBRDS will automatically redirect to their corresponding entry in the new portal. As an example, try http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e

the GBRDS sandbox registry web application (http://gbrdsdev.gbif.org) is no longer visible, but a new registry sandbox has been setup to provide for IPT installations installed in test mode

Please note that the new registry API supports the same web service API that the GBRDS previously did, so existing tools and services built on the GBRDS API (such as the IPT) will continue to work uninterrupted.

As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching real-time data management. We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwards.

On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are.

GBIF Backbone in GitHub

Markus Döring — Thu, 24 Oct 2013 14:39:00 +0000

For a long time I wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:

README.md, a simple markdown file that gets rendered by github and shows the basic attributes of a taxon
data.json, a complete json representation of the taxon as it is exposed via the new GBIF species API

The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species Amanita arctica is represented as:

This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.

Getting data into GitHub

It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with it's currently 4.4 million taxa incl synonyms, took about one hour on a MacBook Pro laptop.

Not bad I thought, but then I tried to add the generated files into git and that's when I started to doubt. After waiting for half a day for git to add the files to my local index I decided to kill the process and start by only adding the smaller kingdoms first, excluding animals and plants. That left about 335.000 folders and 670.000 files to be added to git. Adding these to my local git still took several hours, committing and finally pushing them onto the GitHub server took yet another 2 hours.

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git

After those files were added to the index committing a simple change to the main README file took 15 minutes to commit. Although I like the general idea and the pretty user interface I fear GitHub, and even git itself, are not made to be a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungi genus Amanita contains 746 species, but it loads very quickly. In that regard GitHub is much nicer to use than the one on the new GBIF species pages which of course shows much more information. The rendered readme file is not ideal as it's at the very bottom of the page, but showing information to humans that way is nice - and markdown could also be parsed by machines quite easily if we adopt a simple format, for example for every property create a heading with that name and put the content into the following paragraph(s).

The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the Amanita readme contains the synonym information) and also with infraspecific taxa. For example Amanita muscaria contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When mixing ranks (we skip intermediate unknown ranks in the backbone), for example see the Fungus kingdom, sorting by the rank first is desirable. We could enable this by naming the taxon folders accordingly, prefixing with an alphabetically correctly ordered rank.

I have not had the time to try to version branches of the tree and see how usable that is. I suspect the git performance to be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push & pull.

Validating scientific names with the forthcoming GBIF Portal web service API

Gaurav Vaidya — Mon, 22 Jul 2013 21:16:00 +0000

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

A whale named ~~Physeter macrocephalus~~ ~~Physeter catodon~~ Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries change, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations make up a species, subspecies or species complex, or even which of several published names correspond to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focussed checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.

So how can we match names against these databases? OpenRefine (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

Linnaeus' original description of Felis Tigris. From an 1894 republication of Linnaeus' Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.

We extended Rod's work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal's biggest strengths -- the large number of checklists it has indexed -- to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy -- for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:

scientific name (for example, "Felis tigris"),
authority ("Linnaeus, 1758"),
accepted name ("Panthera tigris"), and
kingdom ("Animalia").

Once you do a reconciliation through our new service, your results will look like this:

Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,

Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.
The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and an taxonomic authority. You can click on each name to find out more details.

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.

We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most well recognized names, and ending with EOL's fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:

Name validation in OpenRefine, using both the new GBIF API reconciliation service as well as Rod Page's reconciliation service for EOL, and
Higher taxonomy in OpenRefine, using the web service APIs provided by GBIF and EOL, as well as OpenRefine's ability to parse JSON.

If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:

http://refine.taxonomics.org/gbifchecklists/reconcile

Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is continuing to work on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.

IPT v2.0.5 Released - A melhor versão até o momento!

Kyle Braak — Wed, 22 May 2013 15:37:00 +0000

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here.

As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes.

The most notable feature enhancements include:

A resource can now be configured to publish automatically on an interval (See "Automated Publishing" section in User Manual)
The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.
An IPT can be configured to back up each DwC-Archive version published (See "Archival Mode" in User Manual)
Each resource version now has a resolvable URL (See "Versioned Page" section in User Manual)

Filterable, pageable, and sortable resource overview table in v2.0.5

The order of columns in published DwC-Archives is always the same between versions
Style (CSS) customizations are easier than ever - check out this new guide entitled "How to Style Your IPT" for more information
Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable (See "Public Resource Table" section in User Manual)

The most important bug fixes are:

Garbled encoding on registration updates has been fixed
The problems uploading DwC-Archives in .gzip format has been fixed
The problem uploading a resource logo has been fixed

The new look in v2.0.5

The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the issue tracking system.

It is great to see so much feedback from the community in the form of issues especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project and anyone can contribute patches, translations, or have their say simply by adding or voting on issues.

The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the Universidade de São Paulo, Research Center on Biodiversity and Computing: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva. With Brazil recently joining the GBIF network, we hope the Portuguese interface for the IPT will help in publication of the wealth of biodiversity data available from Brazilian institutions.

We’d also like to give special thanks to the other volunteers below:

Marie-Elise Lecoq (GBIF France) - Updating French translation
Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation
Dairo Escobar, and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Updating Spanish translation
Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements
Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements

On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT.

Migrating our hadoop cluster from CDH3 to CDH4

Oliver Meyn — Tue, 14 May 2013 15:34:00 +0000

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

general performance improvements in the entire Hadoop/HBase stack
continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters are invariably asked to update before people are interested in tackling the problem)
multi-threaded compactions (the need for which we concluded in this post)
table-based region balancing (rather than just cluster-wide)

We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into the hands of our ops team, we've moved to CDH Manager to control our CDH installation. That's been going pretty well so far, but, we're getting ahead of ourselves...

The Process

We have 6 slave nodes that have a lot of disk capacity since we spec'd with a goal of lots of spindles which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:

Reduce the replication factor

We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:

hadoop fs -setrep -R 2 /

Decommission the 3 nodes to move

"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.

The block in hdfs-site.xml:

<name>dfs.hosts.exclude</name>

<value>/etc/hadoop/conf/excluded_hosts</value>

</property>

then run:

bin/hadoop dfsadmin -refreshNodes

and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".

Don't forget HBase

The hdfs datanodes are tidied up now but don't forget to cleanly shutdown the HBase regionservers - run:

./bin/graceful_stop.sh HOSTNAME

from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shutdown when tidied up (more details here).

Now you can shutdown the tasktracker and datanode, and then the machine is ready to be wiped.

Build the new cluster

We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.

Enter CDH Manager

The folks at Cloudera have made a free version of their CDH Manager app available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The manager instructions are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).

Machine and Role assignments
Machine	Roles
master1	HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)
master2	HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)
master3	Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)
slave1	HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave2	HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave3	HDFS DataNode, Hadoop TaskTracker, HBase Regionserver

Copy the data

Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The builtin facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS builtin command distcp. Because we could handle shutting down HBase on the old cluster for the duration of the copy this, in theory, should work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up but it couldn't find .META. and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.

On the old cluster we ran:

bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name

for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's distcp command:

bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name

which takes advantage of the builtin http-like interface (hftp) that HDFS provides, which makes the copy process version agnostic.

Finally on the new cluster we can import the copied sequence files into the new HBase:

bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name

Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in HBASE-7868), which you can workaround by adding the following to your table definition:

DATA_BLOCK_ENCODING => 'FAST_DIFF'

e.g. create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}

As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.

And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.

All systems go

After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shutdown the remaining machines in the CDH3u3 cluster, wiped and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data-locality, and we were back in business!

Data cleaning: Using MySQL to identify XML breaking characters

Jan K. Legind — Fri, 08 Feb 2013 12:45:00 +0000

Sometimes publishers have problems with data resources that contain control characters that will break the xml response if they are included. Identifying these characters and removing them can be a daunting task, especially if the dataset contains thousands of records.

Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, thereby entering the database table possibly without having been through a verification or a cleaning process. The locality string can be copy/pasted from a file into the locality column, or the data itself can be mass loaded infile, or it can be bulk inserted – each of these methods contains a risk that unintended characters enter the table.

Even if you have time and are meticulous, you could miss certain control characters because they are invisible to the naked eye. So what are publishers - some with limited resources – going to do to ferret out these xml breaking characters? Assuming that you have access to the MySQL database itself you can identify these pesky control characters by performing a few basic steps that involves creating a small table, inserting some hexadecimal values into it (sounds much harder than it is), and finally you run the query that picks out these ‘illegal’ characters from the table that you specify.

We start out with creating a table to hold the values for the problematic characters so that we can use them in a query:

CREATE TABLE control_char (
id int(4) NOT NULL AUTO_INCREMENT,
hex_val CHAR(2),
PRIMARY KEY(id)
) DEFAULT CHARACTER SET = utf8;

The DEFAULT CHARACTER SET declaration forces UTF-8 compliance which the regular expressions used later requires.
We then populate the table with these hex values that represent control characters:

INSERT INTO control_char (hex_val)
VALUES
('00'),('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('0a'),('0b'),('0c'),('0d'),('0e'),('0f'),
('10'),('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('1a'),('1b'),('1c'),('1d'),('1e'),('1f')
;

You can read more about these values here: http://en.wikipedia.org/wiki/C0_and_C1_control_codes

At this point you may ask why the control_char table is not a temporary table as you might not want it to be a permanent feature in the database. The reason for this is sadly that MySQL has a long standing bug that prevents a temporary table from being referenced more than once; http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html and we have to reference it more than once as you will see later.

Now on to the main query – these declarations test the table and column that you specify against the control_char table:

SELECT t1.* FROM scinames_harvested t1, control_char
WHERE LOCATE(control_char.hex_val , HEX(t1.scientific_name)) MOD 2 != 0;

The query references two tables; one is a table of roughly 5000 records containing a record primary key, scientific_name and some other columns. Some of the scientific name strings are polluted with characters that we want to get rid of. The second table contains the control characters.
The way we ensure that the LOCATE function tests for value pairs two steps at the time is by using the modulus keyword MOD. Remember we want to look through the scientific_name char string after it has been converted to hexadecimal values (HEX) that consist of value pairs. We don’t want to test across value pairs!

Running the query, in this instance, gives me five records with characters that are not kosher:

This is pretty neat if the alternative is eyeballing each and every record.
Note that I cannot guarantee that this will properly process every character from the UTF-8 Latin-1 supplement http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement

If you want to create a test table and try out the queries above, this UPDATE query template will change the string into something containing control characters:

UPDATE your_table SET your_column = CONCAT('Adelotus brevis', X'0B') WHERE id = 12345;

In the CONCAT declaration the second part looks funny, but you have to remember that the X in front of '0B' tells MySQL that a hex value is coming. In this case it is a line-tabulation character: www.fileformat.info/info/unicode/char/000b/index.htm. This part can be edited to other values for test purposes. Naturally the CONCAT function can take n number of strings for concatenation.

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

Andrea Hahn — Mon, 10 Dec 2012 11:51:00 +0000

If you are a regular user of the GBIF data portal at http://data.gbif.org, or keep an eye on the numbers given at http://www.gbif.org, you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be?

The main reason for this is that software and processing upgrades have made it easier to spot duplicates and old, no longer published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:

- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that are no longer contained in the source file: a data transfer is reliably all-or-nothing, so that any record that is not touched during indexing can automatically be deleted. This is less easy in the dialog-driven data transfer protocols (DiGIR, BioCASe and TAPIR), where data transfer might fail at any point in between for a number of reasons, requiring human supervision of deletions.

- The now dataset-aware registry and changed metadata updating workflow make it possible to much easier spot data resources that are no longer published at source, and therefore need to be removed from the data portal as well. Previously, such checks were manual and required regular screening. More often than not, datasets are not really withdrawn, but instead published under a new identifier, combined with other data, or moved to a new location, all with the old version still hanging in until spotted or pointed out. The new registry workflows will significantly speed up the process of detecting and handling such cases.

In summary, the current drop in numbers is the result of data cleaning and removal of duplicates, and reflects continuing efforts by publishers, nodes and the Secretariat to improve the quality of data accessible through the GBIF network. While they happen regularly, the effects of such cleaning activities often get masked by increased record numbers of existing resources and new datasets in the global index. This time, the reduction happens to be more prominent than the additions.

The GBIF Registry is now dataset-aware!

Kyle Braak — Mon, 29 Oct 2012 12:55:00 +0000

This post continues the series of posts that highlight the latest updates on the GBIF Registry.

To recap, in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided a background to the GBIF Network, explained how Network entities are now stored in a database instead of UDDI system, and how it has a new web application and API.

Then a month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase.

Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware!

To be dataset-aware, means that the Registry is now aware of all the datasets that exist behind DiGIR and BioCASE endpoints. Just in case the reader isn't aware, DiGIR and BioCASE are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can potentially be thousands of datasets behind a single endpoint.

Traditionally, the GBIF Registry knew about the endpoint but not about its datasets. It was then the job of GBIF's Harvesting and Indexing Toolkit (HIT) to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the GBIF Data Portal.

Therefore if you ever visited the GBIF Data Portal and viewed the Portal page for the Academy of Natural Sciences, you would find that it has 3 datasets.

Clicking on each one, reveals that they are all exposed via the same DiGIR endpoint (see "Access point URL") - see below:

But, if you visited the GBIF Registry and did the same search for the Academy of Natural Sciences, prior to the Registry being dataset-aware, you would have seen it has a DiGIR endpoint, but not found it has any datasets!

Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets, and has a (DiGIR) Technical Installation.

So that's fantastic, now the GBIF Registry knows about 1000s of datasets that only the GBIF Data Portal knew about before. But how was dataset-awareness achieved?

First, the Registry now does the job of dataset discovery that the HIT used to do. A project called the registry-metadata-sync was created to do this.

Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database, into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the GBIF Registry, and can be uniquely identified by their GBIF Registry UUID!

Third, the HIT was branched, creating a revised version of the tool that was able to understand the new dataset-aware Registry. The HIT also had to be modified to allow its operators to still trigger dataset discovery by technical installation. Life just got easier for the HIT though, since it could use each dataset's GBIF Registry UUID to uniquely identify each dataset during indexation.

Indeed, the dataset-aware Registry allocates a UUID to each dataset. This is fundamentally the biggest advantage that the dataset-aware Registry brings. Now that GBIF has succeeded in uniquely identifying each Dataset in its Registry, it is now working to assign each Dataset a Globally Unique Identifier (GUID) in the form of a Digital Object Identifier (DOI). The DOI for a dataset will be resolvable back to the GBIF Registry, and could be referenced when citing a Dataset, thereby enabling better tracking of Dataset usage in scientific publications.

GBIF is really excited about being able to provide publishers a DOI for each of their dataset. Keep an eye on our Registry in the coming months for their grand appearance.

IPT v2.0.4 released

Kyle Braak — Wed, 17 Oct 2012 16:58:00 +0000

Today the GBIF Secretariat has announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website here.

Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing some patches.

Add to that all the issue activity, things have been busy. In total 108 issues were addressed in this version; 38 Defects, 35 Enhancements, 7 Other, 5 Patches, 18 Won't fix, 4 Duplicates, and 1 that was considered as Invalid. These are detailed in the issue tracking system.

So what exactly has changed and why? Here's a quick rundown.

One thing that kept coming up again and again in version 2.0.3, was that users were unwittingly installing the IPT in test mode, thinking that they were running in production. After registering a resource, these users expected to see it show up in the GBIF Registry and ultimately be indexed by GBIF. Frustrated emails were then sent to the GBIF Helpdesk when nothing happened. Sadly the reply from the GBIF Helpdesk was always filled with the same disappointing news:

"Your resource is actually in the Test Registry therefore it will never be indexed by GBIF. Oh, and you will have to reinstall your IPT using production mode next time and do your resource configuration over again!"

So to tackle this problem, the setup pages have been improved to make it crystal clear what it means to choose one mode or the other.

The UI has also been branded when running in test mode to make it even more obvious what mode the IPT is running in.

Now whether or not test mode was chosen accidentally, it can be used to help train administrators how to configure an instance, and to help train users how to publish resources. What was always missing, was a way to transfer configured resources from an IPT in test mode, to one in production.

I'm happy to say that in 2.0.4, a resource can now be easily transferred between 2 IPTs including all its source files and mappings. Users will be happy to know that they never have to waste time reconfiguring the same resource from scratch. How is this done? Well in short, resource transfer is achieved by uploading an archived IPT resource folder during resource creation - see user manual for full instructions.

Moving on..

With so many publishers opting for the convenience of publishing via the IPT, the GBIF helpdesk has been receiving dozens of requests to replace an existing DiGIR, BioCASE, or TAPIR resource in the GBIF Registry with one coming from their IPT. To facilitate resource migration, another new feature was added in 2.0.4 that allows the IPT to update an existing resource in the GBIF Registry during registration. The change is welcomed most of all by the GBIF helpdesk who bore the brunt of carrying out resource migrations in the GBIF Registry. See User Manual for instructions.

Thanks to the Taiwan Biodiversity Information Facility (TaiBIF) the IPT interface is now available in Traditional Chinese. That makes the IPT available in a total of 4 languages now including French, Spanish and of course English.

What else?

Thanks to a patch from Peter Desmet, download metrics for the Archive, EML, and RTF files can now be tracked via Google Analytics. For IPT admins who aren't already tracking analytics, there are simple instructions in the User Manual. Here's a screenshot showing some metrics from http://ipt-rc.gbif.org For your reference, the "Event Label" is the resource short name in the IPT.

Last but not least, it should be highlighted that the IPT's RSS feed is now updated every time a resource is published. The version number is displayed right beside the resource name, so subscribers can stay on top of the latest changes. Here's a screenshot from my RSS reader pulling from http://ipt.gbif.org/rss.do

And that about wraps up the most important changes in this version.

As always, we’d like to give special thanks to the volunteer translators for their time and efforts:

Nicolas Noé (Belgian Biodiversity Platform, Belgium) - French
TaiBIF, Taiwan - Traditional Chinese
Laura Roldan Gomez, Dairo Escobar, and Daniel Amariles, (Colombian Biodiversity Information System (SiB)) - Spanish

Plus another couple of special mentions are owed to Peter Desmet and Laura Russell who provided an exceptional amount of feedback and suggestions.

On behalf of the GBIF development team, I hope you enjoy using latest version.

Getting started with DataCube on HBase

Tim Robertson — Fri, 13 Jul 2012 17:35:00 +0000

This tutorial blog provides a quick introduction to using DataCube, a Java based OLAP cube library with a pluggable storage engine open sourced by Urban Airship. In this tutorial, we make use of the inbuilt HBase storage engine.

In a small database much of this would be trivial using aggregating functions (SUM(), COUNT() etc). As the volume grows, one often precalculates these metrics which brings it's own set of consistency challenges. As one outgrows a database, as GBIF are, we need to look for new mechanisms to manage these metrics. The features of DataCube that make this attractive to us are:

A managable process to modify the cube structure
A higher level API to develop against
Ability to rebuild the cube with a single pass over the source data

For this tutorial we will consider the source data as classical DarwinCore occurrence records, where each record represents the metadata associated with a species observation event, e.g.:

ID, Kingdom, ScientificName, Country, IsoCountryCode, BasisOfRecord, CellId, Year
1, Animalia, Puma concolor, Peru, PE, Observation, 13245, 1967
2, Plantae, Abies alba, Spain, ES, Observation, 3637, 2010
3, Plantae, Abies alba, Spain, ES, Observation, 3638, 2010

Suppose the following metrics are required, each of which is termed a rollup in OLAP:

Number of records per country
Number of records per kingdom
Number of records georeferenced / not georeferenced
Number of records per kingdom per country
Number of records georeferenced / not georeferenced per country
Number of records georeferenced / not georeferenced per kingdom
Number of records georeferenced / not georeferenced per kingdom per country

Given the requirements above, this can be translated into a cube definition with 3 dimensions, and 7 rollups as follows:

/**
 * The cube definition (package access only).
 * Dimensions are Country, Kingdom and Georeferenced with counts available for:
 * 
 * Country (e.g. number of record in DK)
 * Kingdom (e.g. number of animal records)
 * Georeferenced (e.g. number of records with coordinates)
 * Country and kingdom (e.g. number of plant records in the US)
 * Country and georeferenced (e.g. number of records with coordinates in the UK
 * Country and kingdom and georeferenced (e.g. number of bacteria records with coordinates in Spain)
 * 
 * TODO: write public utility exposing a simple API enabling validated read/write access to cube.
 */
class Cube {

  // no id substitution
  static final Dimension COUNTRY = new Dimension ("dwc:country", new StringToBytesBucketer(), false, 2);
  // id substitution applies
  static final Dimension KINGDOM = new Dimension ("dwc:kingdom", new StringToBytesBucketer(), true, 7);
  // no id substitution
  static final Dimension GEOREFERENCED = new Dimension ("gbif:georeferenced", new BooleanBucketer(), false, 1);

  // Singleton instance if accessed through the instance() method
  static final DataCube INSTANCE = newInstance();

  // Not for instantiation
  private Cube() {
  }

  /**
   * Creates the cube.
   */
  private static DataCube newInstance() {
    // The dimensions of the cube
    List > dimensions = ImmutableList. >of(COUNTRY, KINGDOM, GEOREFERENCED);

    // The way the dimensions are "rolled up" for summary counting
    List rollups =
      ImmutableList.of(new Rollup(COUNTRY),
        new Rollup(KINGDOM),
        new Rollup(GEOREFERENCED),
        new Rollup(COUNTRY, KINGDOM),
        new Rollup(COUNTRY, GEOREFERENCED),
        new Rollup(KINGDOM, GEOREFERENCED),
        // more than 2 requires special syntax
        new Rollup(ImmutableSet. of(new DimensionAndBucketType(COUNTRY), new DimensionAndBucketType(KINGDOM),
          new DimensionAndBucketType(GEOREFERENCED))));

    return new DataCube (dimensions, rollups);
  }
}

In this code, we are making use of ID substitution for the kingdom. ID substitution is an inbuilt feature of DataCube whereby an auto-generated ID is used to substitute verbose coordinates (a value for a dimension). This is an important feature to help improve performance as coordinates are used to construct the cube lookup keys, which translate into the key used for the HBase table. The substitution is achieved by using a simple table holding a running counter and a mapping table holding the field-to-id mapping. When inserting data into the cube, the counter is incremented (with custom locking to support concurrency within the cluster), the mapping is stored, and the counter value used as the coordinate. When reading, the mapping table is used to construct the lookup key. With the cube defined, we are ready to populate it. One could simply iterate over the source data and populate the cube with the likes of the following:

DataCubeIo dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
dataCubeIo.writeSync(new LongOp(1), 
  new WriteBuilder(Cube.INSTANCE)
    .at(Cube.COUNTRY, "Spain") // for example
    .at(Cube.KINGDOM, "Animalia")
    .at(Cube.GEOREFERENCED, true)
);

However, one should consider what to do when you have the following inevitable scenarios:

A new dimension or rollup is to be added to the running cube
Changes to the source data have occurred without the cube being notified (e.g. through a batch load, or missing notifications due to messaging failures)
Some disaster recovery requiring a cube rebuild

To handle this when using HBase as the cube storage engine, we make use of the inbuilt backfill functionality. Backfilling is a multistage process:

A snapshot of the live cube is taken and stored in a snapshot table
An offline cube is calculated from the source data and stored in a backfill table
The snapshot and live cube are compared to determine changes that were accepted in the live cube during the rebuilding process (step 2). These changes are then applied to the backfill table
The backfill is hot swapped to become the live cube

This is all handled within DataCube with the exception of stage 2, where we are required to provide a BackfillCallback, the logic responsible for populating the new cube from the source data. The following example illustrates a BackfillCallback using a simple MapReduce job to scan an HBase table for the source data.

/**
 * The callback used from the backfill process to spawn the job to write the new data in the cube.
 */
public class BackfillCallback implements HBaseBackfillCallback {

  // Property keys passed in on the job conf to the Mapper
  static final String TARGET_TABLE_KEY = "gbif:cubewriter:targetTable";
  static final String TARGET_CF_KEY = "gbif:cubewriter:targetCF";
  // Controls the scanner caching size for the source data scan (100-5000 is reasonable)
  private static final int SCAN_CACHE = 200;
  // The source data table
  private static final String SOURCE_TABLE = "dc_occurrence";

  @Override
  public void backfillInto(Configuration conf, byte[] table, byte[] cf, long snapshotFinishMs) throws IOException {
    conf = HBaseConfiguration.create();
    conf.set(TARGET_TABLE_KEY, Bytes.toString(table));
    conf.set(TARGET_CF_KEY, Bytes.toString(cf));
    Job job = new Job(conf, "CubeWriterMapper");

    job.setJarByClass(CubeWriterMapper.class);
    Scan scan = new Scan();
    scan.setCaching(SCAN_CACHE);
    scan.setCacheBlocks(false);

    // we do not want to get bad counts in the cube!
    job.getConfiguration().set("mapred.map.tasks.speculative.execution", "false");
    job.getConfiguration().set("mapred.reduce.tasks.speculative.execution", "false");
    job.setNumReduceTasks(0);
    TableMapReduceUtil.initTableMapperJob(SOURCE_TABLE, scan, CubeWriterMapper.class, null, null, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    try {
      boolean b = job.waitForCompletion(true);
      if (!b) {
        throw new IOException("Unknown error with job.  Check the logs.");
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}

/**
 * The Mapper used to read the source data and write into the target cube.
 * Counters are written to simplify the spotting of issues, so look to the Job counters on completion.
 */
public class CubeWriterMapper extends TableMapper {

  // TODO: These should come from a common schema utility in the future
  // The source HBase table fields
  private static final byte[] CF = Bytes.toBytes("o");
  private static final byte[] COUNTRY = Bytes.toBytes("icc");
  private static final byte[] KINGDOM = Bytes.toBytes("ik");
  private static final byte[] CELL = Bytes.toBytes("icell");

  // Names for counters used in the Hadoop Job
  private static final String STATS = "Stats";
  private static final String STAT_COUNTRY = "Country present";
  private static final String STAT_KINGDOM = "Kingdom present";
  private static final String STAT_GEOREFENCED = "Georeferenced";
  private static final String STAT_SKIPPED = "Skipped record";
  private static final String KINGDOMS = "Kingdoms";

  // The batch size to use when writing the cube
  private static final int CUBE_WRITE_BATCH_SIZE = 1000;

  static final byte[] EMPTY_BYTE_ARRAY = new byte[0];

  private DataCubeIo dataCubeIo;

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    super.cleanup(context);
    // ensure we're all flushed since batch mode
    dataCubeIo.flush();
    dataCubeIo = null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private Integer getValueAsInt(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toInt(v);
    }
    return null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private String getValueAsString(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toString(v);
    }
    return null;
  }


  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context) throws IOException, InterruptedException {
    String country = getValueAsString(row, CF, COUNTRY);
    String kingdom = getValueAsString(row, CF, KINGDOM);
    Integer cell = getValueAsInt(row, CF, CELL);

    WriteBuilder b = new WriteBuilder(Cube.INSTANCE);
    if (country != null) {
      b.at(Cube.COUNTRY, country);
      context.getCounter(STATS, STAT_COUNTRY).increment(1);
    }
    if (kingdom != null) {
      b.at(Cube.KINGDOM, kingdom);
      context.getCounter(STATS, STAT_KINGDOM).increment(1);
      context.getCounter(KINGDOMS, kingdom).increment(1);
    }
    if (cell != null) {
      b.at(Cube.GEOREFERENCED, true);
      context.getCounter(STATS, STAT_GEOREFENCED).increment(1);
    }
    if (b.getBuckets() != null && !b.getBuckets().isEmpty()) {
      dataCubeIo.writeSync(new LongOp(1), b);
    } else {
      context.getCounter(STATS, STAT_SKIPPED).increment(1);
    }
  }

  // Sets up the DataCubeIO with IdService etc.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);

    IdService idService = new HBaseIdService(conf, Backfill.LOOKUP_TABLE, Backfill.COUNTER_TABLE, Backfill.CF, EMPTY_BYTE_ARRAY);

    byte[] table = Bytes.toBytes(conf.get(BackfillCallback.TARGET_TABLE_KEY));
    byte[] cf = Bytes.toBytes(conf.get(BackfillCallback.TARGET_CF_KEY));


    DbHarness hbaseDbHarness =
      new HBaseDbHarness (pool, EMPTY_BYTE_ARRAY, table, cf, LongOp.DESERIALIZER, idService, CommitType.INCREMENT);

    dataCubeIo = new DataCubeIo (Cube.INSTANCE, hbaseDbHarness, CUBE_WRITE_BATCH_SIZE, Long.MAX_VALUE, SyncLevel.BATCH_SYNC);

  }
}

With the callback written, all that is left to populate the cube is to run the backfill. Note that this process can also be used to bootstrap the live cube for the first time:

// The live cube table 
final byte[] CUBE_TABLE = "dc_cube".getBytes();
// Snapshot of the live table used during backfill
final byte[] SNAPSHOT_TABLE = "dc_snapshot".getBytes();
// Backfill table built from the source
final byte[] BACKFILL_TABLE = "dc_backfill".getBytes();
// Utility table to provide a running count for the identifier service
final byte[] COUNTER_TABLE = "dc_counter".getBytes();
// Utility table to provide a mapping from source values to assigned identifiers
final byte[] LOOKUP_TABLE = "dc_lookup".getBytes();
// All DataCube tables use a single column family
final byte[] CF = "c".getBytes();

HBaseBackfill backfill =
  new HBaseBackfill(
    conf, 
    new BackfillCallback(), // our implementation
    CUBE_TABLE, 
    SNAPSHOT_TABLE, 
    BACKFILL_TABLE, 
    CF, 
    LongOp.LongOpDeserializer.class);
backfill.runWithCheckedExceptions();

While HBase provides the storage for the cube, a backfill could be implemented against any source data, such as from a database over JDBC or from text files stored on a Hadoop filesystem. Finally we want to be able to read our cube:

DataCubeIo dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
Optional result = 
  cubeIo.get(
    new ReadBuilder(cube)
     .at(Cube.COUNTRY, "DK")
     .at(Cube.KINGDOM, "Animalia"));
// need to check if this coordinate combination hit anything in the cube
if (result.isPresent()) {
  LOG.info("Animal records in Denmark: " + result.get().getLong());
)

All the source code for the above is available in the GBIF labs svn.

Many thanks to Dave Revell at UrbanAirship for his guidance.

Optimizing Writes in HBase

Oliver Meyn — Wed, 11 Jul 2012 10:46:00 +0000

I've written a few times about our work to improve the scanning performance of our cluster (parts 1, 2, and 3) since our highest priority for HBase is being able to serve requests for downloads of occurrence records (which require a full table scan). But now that the scanning is working nicely we need to start writing new records into our occurrence table as well as cleaning raw data and interpreting it into something more useful for the users of our data portal. That processing is built as Hive queries that read from and write back to the same HBase table. And while it was working fine on small test datasets, it all blew up once I moved the process to the full dataset. Here's what happened and how we fixed it. Note that we're using CDH3u3, with the addition of Hive 0.9.0, which we patched to support HBase 0.90.4.

The problem

Our processing is Hive queries which run as Hadoop MapReduce jobs. When the mappers were running they would eventually fail (repeatedly, ultimately killing the job) with an error that looks like this:

java.io.IOException: org.apache.hadoop.hbase.client.ScannerTimeoutException: 63882ms passed since the last invocation, timeout is currently set to 60000

We did some digging and found that this exception happens when the scanner's next() method hasn't been called within the timeout limit. We simplified our test case by reproducing this same error when doing a simple CopyTable operation (the one that ships with HBase), and again using Hive to do select overwrite table A select * from table B. In both cases mappers are assigned a split to scan based on TableInputFormat (just like our Hive jobs), and as they scan they simply put the record out to the new table. Something is holding up the loop as it tries to put, preventing it from calling next(), and thereby triggering the timeout exception.

The logs

First stop - the logs. The regionservers are littered with lines like the following:

WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e. has too many store files; delaying flush up to 90000ms

INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 7 on 60020' on region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e.: memstore size 128.2m is >= than blocking 128.0m size

Well, blocking updates sure sounds like the kind of thing that would prevent a loop from writing more puts and dutifully calling next(). After a little more digging (and testing with a variety of hbase.client.scanner.caching values, including 1) we concluded that yes, this was the problem, but why was it happening?

Why it blocks

It blocks because the memstore has hit what I'll call the "memstore blocking limit" which is controlled by the setting hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier, which by default are 64MB and 2 respectively. Normally the memstore should flush when it reaches the flush.size, but in this case it reaches 128MB because it's not allowed to flush due to too many store files (the first log line). The definition of "too many storefiles" is in turn a setting, namely hbase.hstore.blockingStoreFiles (default 7). A new store file is created every time the memstore flushes, and their number is reduced by compacting them into fewer, bigger storefiles during minor and major compactions. By default, compactions will only start if there are at least hbase.hstore.compactionThreshold (default 3) store files, and won't compact more than hbase.hstore.compaction.max (default 7) in a single compaction. And regardless of what you set flush.size to, the memstore will always flush if all memstores in the regionserver combined are using too much heap. The acceptable levels of heap usage are defined by hbase.regionserver.global.memstore.lowerLimit (default 0.35) and hbase.regionserver.global.memstore.upperLimit (default 0.4). There is a thread dedicated to flushing that wakes up regularly and checks these limits:

if the flush thread wakes up and memstores are greater than the lower limit it will start flushing (starting with current biggest memstore) until it gets below the limit.
if flush thread wakes up and memstores are greater than the upper limit it will block updates and start flushing until it gets under lower limit, when it unblocks updates.

Fortunately we didn't see blocking because of the upper heap limit - only the "memstore blocking limit" described earlier. But at this point we only knew the different dials we could turn.

Stopping the blocking

Our goal is to stop the blocking so that our mapper doesn't timeout, while at the same time not running out of memory on the regionserver. The most obvious problem is that we have too many storefiles, which appears to be a combination of producing too many of them and not compacting them fast enough. Note that we have a 6GB heap dedicated to HBase, but can't afford to take any more away from the co-resident Hadoop mappers and reducers.

We started by upping the memstore flush size - this will produce fewer but bigger storefiles on each flush:

first we tried 128MB with block multiplier still at 2. This still produced too many storefiles and caused the same blocking (maybe a little better than at 64MB)
then tried 256MB with multiplier of 4. The logs and ganglia showed that the flushes were happening well before 256MB (still around 130MB) "due to global heap pressure" - a sign that total memstores were consuming too much heap. This meant we were still generating too many files and got the same blocking problem, but with the "memstore blocking limit" set to 1GB the memstore blocking happened much less often, and later in the process (still killed the mappers though)

We were now producing fewer storefiles, but they were still accumulating too quickly. From ganglia we also saw that the compaction queue and storefile counts were growing unbounded, which meant we'd hit the blocking limit again eventually. Next came trying to compact more files per compaction, hence raised compaction.max to 20, and this made little difference.

So how to reduce the number of storefiles? If we had fewer stores, we'd be creating fewer files and using up less heap for memstore, so next we increased the region size. This meant increasing the setting hbase.hregion.max.filesize from its default of 256MB to 1.5G, and then rebuilding our table with fewer pre-split regions. That resulted in about 75% fewer regions.

It was starting to look good - the number of "Blocking updates" log messages dropped to a handful per run, but it was still enough to affect one or two jobs to the point of them getting killed. We tried upping the memstore.lowerLimit and upperLimit to 0.45 and 0.5 respectively, but again no joy.

Now what?

Things looked kind of grim. After endless poring over ganglia charts, we kept coming back to one unexplained blip that seemed to coincide with the start of the storefile explosion that eventually killed the jobs.

Figure 1: Average memstore flush size over time

At about the halfway point of the jobs the size of memstore flushes would spike and then gradually increase until the job died. Keep in mind that the chart shows averages, and it only took a few of those flushes to wait for storefiles long enough to fill to 1GB and then start the blocking that was our undoing. Back to the logs.

From the start of Figure 1 we can see that things appear to be going smoothly - the memstores are flushing at or just above 256MB, which means they have enough heap and are doing their jobs. From the logs we see the flushes happening fine, but there are regular lines like the following:

INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap pressure: Region uat_occurrence,\x06\x0E\xAC\x0F,1341574060728.ab7fed6ea92842941f97cb9384ec3d4b. has too many store files, but is 625.1m vs best flushable region's 278.2m. Choosing the bigger.

This isn't quite as bad as the "delaying flush" line, but it shows that we're on the limit of what our heap can handle. Then starting from around 12:20 we see more and more like the following:

WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region uat_occurrence,"\x98=\x1C,1341567129447.a3a6557c609ad7fc38815fdcedca6c26. has too many store files; delaying flush up to 90000ms

and then to top it off:

INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): ab7fed6ea92842941f97cb9384ec3d4b

INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water.

So what we have is memstores initially being forced to flush because of minor heap pressure (adds storefiles faster than we can compact). Then we have memstores delaying flushes because of too many storefiles (memstores start getting bigger - our graph spike). Then the write ahead log (WAL) complains about too many of its logs, which forces a memstore flush (so that the WAL HLog can be safely discarded - this again adds storefiles). And for good measure the flushing thread now wakes up, finds its out of heap, and starts attempting flushes, which just aggravates the problem (adding more storefiles to the pile). The failure doesn't happen immediately but we're past the point of no return - by about 13:00 the memstores are getting to the "memstore blocking limit" and our mappers die.

What's left?

Knowing what's going on in the memstore flush size analysis is comforting, but just reinforces what we already knew - the problem is too many storefiles. So what's left? Raising the max number of storefiles, that's what! Every storefile in a store consumes resources (file handles, xceivers, heap for holding metadata), which is why it's limited to a maximum of a very modest 7 files by default. But for us this level of write load is rare - in normal operations we won't hit anything like this, and having the capacity to write a whole bunch of storefiles over short-ish, infrequent bursts is relatively safe, since we know our nightly major compaction will clean them up again (and thereby free up all those extra resources). Crank that maximum up to 200 and, finally, the process works!

Conclusion

Our problem was that our compactions couldn't keep up with all the storefiles we were creating. We tried turning all the different HBase dials to get the storefile/compaction process to work "like they're supposed to", but in the end the key for us was the hbase.hstore.blockingStoreFiles parameter which we set to 200, which is probably double what we actually needed but gives us buffer for our infrequent, larger write jobs. We additionally settled on larger (and therefore fewer) regions, and a somewhat bigger than default memstore. Here are the relevant pieces of our hbase-site.xml after all our testing:

  <!-- default is 256MB 268435456, this is 1.5GB -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1610612736</value>
  </property>
  
  <!-- default is 2 -->
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>4</value>
  </property>
  
  <!-- default is 64MB 67108864 -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
  </property>
  
  <!-- default is 7, should be at least 2x compactionThreshold -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>200</value>
  </property>

And finally, if our compactions were faster and/or more frequent, we might be able to keep up with our storefile creation. That doesn't look possible without multi-threaded compactions, but naturally those exist in newer versions of HBase (starting with 0.92 - HBASE-1476) so if you're having these problems, an upgrade might be in order. Indeed this is prompting us to consider an upgrade to CDH4.

Many thanks to Lars George, who helped us get through the "Now What?" phase by digging deep into the logs and the source to help us work out what was going on.

Launch of the Canadensys explorer

Peter Desmet — Mon, 18 Jun 2012 21:03:00 +0000

At Canadensys we already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog.

We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network.

The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF) as well. The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself instead. We hope it is intuitive. For the best user experience, please use an up-to-date version of your browser.

Happy exploring: http://data.canadensys.net/explorer

Functionalities

The explorer is a one page tool, limiting unnecessary navigation.
The default view shows all the data, allowing users to get an overview and explore immediately.
Data can be queried by using and combining filters.
Filters use smart suggestions, informing the user how often their search term occurs even before they search.
The exact number of results is displayed, including the number of georeferenced records.
The map view displays all georeferenced records for the current query, has different base layer options and can be zoomed in to any level.
Points on the map can be clicked for more information.
The table view displays a sortable preview of the records in the current query.
Records in the table can be clicked for more information in the same way as on the map.
The number of columns in the table responds to the screen width and can be controlled by the user in the display panel.
Data for the current query can be downloaded as a Darwin Core archive. There is no limit on the number of records that can be downloaded.
Users can download the data by providing their email address. Once the download package is generated, the user receives an email with a link to the data, information regarding the usage norms and a suggested citation.
The interface and emails are available in French and English.

As this is a beta version, you may encounter issues. Please report them by clicking the feedback button on the right, which will open a report form.

Technical details

The Canadensys explorer was developed using the following open source tools:

Spring, a Java framework used for the backend.
ThreeTen, a date/time Java library, used for cleaning the eventDate field.
GBIF parsers, a Java library used for cleaning the country field.
ECAT common, a Java library used for parsing the scientificName field.
Darwin Core archive reader, a Java library used to import the data in the backend and to create the user generated downloads.
PostgreSQL, the database.
PostGIS, a geospatial extension to the database.
Mapnik, a tool to generate the maps and provide the map interactivity.
Windshaft, a tile server.
Freemarker, a Java template engine, used to structure to frontend.
Backbone.js, a MVC structure for web applications, used for handling the filters.

Taxonomic Trees in PostgreSQL

Markus Döring — Tue, 12 Jun 2012 15:03:00 +0000

Taken aside pro parte synonyms taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by 3 models known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example ON stackoverflow, the slides by Lorenzo Alberton or Bill Karwin or a postgres specific performance comparison between the adjacency model and a nested set.

Checklist Bank

At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At the core there is a single table name_usage which contains records each representing a single taxon in the tree [note: in this post I am using the term taxon broadly covering both accepted taxa and synonyms]. It primarily uses the adjacency model with a single foreign key parent_fk which is null for the root elements of the tree. The simplified diagram of the main tables looks like this (actually there are some 20 extra fix width columns left out from name_usage here for simplicity):

For certain searches though an additional index is required. In particular listing all descendants of a taxon, i.e. all members of a subtree, is a common operation that would otherwise involve a recursive function or Common Table Expression. So far we have been using nested sets, but experiencing some badly performing queries lately I decided to do a quick evaluation of different options for Postgres:

the ltree extension which implements a materialized path and provides many powerful operations.
the intarray extension to manually manage a materialized path as an array of non null integers - the primary keys of the parent taxa.
a simple varchar field holding the same materialized path
the current nested set index using a lft and rgt integer column which is unique within a single taxonomy and therefore has to be combined with a checklist_key.

Test Environment

For the tests I am using the current Checklist Bank which contains 14,907,828 records in total spread across 107 checklists of varying size between 7 and 4.25 million records. I am querying the GBIF Backbone Taxonomy which is the largest checklist containing 4,251,163 records with 9 root taxa and a maximum depth of 11 levels. For the queries I have picked specific 2 taxa with different position in the taxonomic tree:

44 Vertebrata the class covering all vertebrates with approximately 84.100 species in this nub.
2684876 Abies the fir genus covering 167 species in this nub and exactly 609 descendants.

For most of our real queries we provide paging for larger results. I will therefore use a limit of 100 and offset of 0 for all queries below.

All tests are executed ON a 2Ghz i7 MacBook Pro with 8GB of memory running Postgres 9.1.1. All queries have been executed 3 times before EXPLAIN ANALYZE was used to get a rough idea ON how differently the various indices behave.

Creating the indices

In order to use the extensions these need to be compiled at installation time and enabled in the database. In Postgres 9.1 you can do the later by executing in psql:

  CREATE EXTENSION ltree;
  CREATE EXTENSION intarray;

The individual data types require different indices. We set up the following indices:

 # general indices
 CREATE INDEX nu_chkl_idx ON name_usage (checklist_fk);
 CREATE INDEX nu_parent_idx ON name_usage (parent_fk);
 # tree specifics
 CREATE INDEX nu_path_idx ON name_usage USING GIST (path);
 CREATE INDEX nu_mpath_idx ON name_usage USING GIN (mpath gin__int_ops);
 CREATE INDEX nu_mspath_idx ON name_usage (mspath);
 CREATE INDEX nu_ckl_lft_rgt_idx ON name_usage (checklist_fk, lft, rgt);

The ltree GIST and intarray GIN indices are rather expensive to create/maintain, but they are selected for best read performance.

Populating the indices

The data being imported into Checklist Bank comes with a parent-child relation, so parent_fk is populated already. For all other indices we have to populate the indices first.

ltree, intarray, string path

For all materialized paths I simply run the following SQL until no new updates happened:

# update root paths once
UPDATE name_usage u SET 
 path = text2ltree(cast(u.id as text)), 
 mspath=u.id, 
 mpath = array[u.id] 
WHERE u.parent_fk is null;
# update until no more records are updated
UPDATE name_usage u SET 
 path = p.path || text2ltree(cast(u.id as text)), 
 mspath=p.mspath || '.' || u.id, 
 mpath = p.mpath || array[u.id] 
FROM name_usage p 
WHERE u.parent_fk=p.id AND p.mspath is not null AND u.mspath is null;

nested sets

I've created 2 functions and one sequence to populate the lft/rgt indices for every checklist:

-- NESTED SET UTILITY FUNCTIONS
CREATE SEQUENCE name_usage_nestidx START 1;

CREATE FUNCTION update_nested_set_usage(integer) RETURNS BOOLEAN as $$
  BEGIN
    UPDATE name_usage set lft = nextval('name_usage_nestidx')-1 WHERE id=$1;
    PERFORM update_nested_set_usage(id) FROM usage WHERE parent_fk=$1 ORDER BY rank_fk, name_fk;;
    UPDATE name_usage set rgt = nextval('name_usage_nestidx')-1 WHERE id=$1;
    RETURN true;
  END
$$ LANGUAGE plpgsql;

CREATE FUNCTION build_nested_set_indices(integer) RETURNS BOOLEAN AS $$
BEGIN
  PERFORM setval('name_usage_nestidx', 1);
  PERFORM update_nested_set_usage(id) FROM usage WHERE parent_fk is null and checklist_fk=$1 ORDER BY rank_fk, name_fk;;
  RETURN true;
  END
$$ LANGUAGE plpgsql;

Query for Descendants

The queries return the first page with 100 records descendants

adjacency

A recursive postgres CTE query does the trick:

 WITH RECURSIVE d AS (
   SELECT id
    FROM name_usage
    WHERE id = 44
  UNION ALL
   SELECT c.id
    FROM d JOIN name_usage c ON c.parent_fk = d.id
 )
 SELECT * FROM d
  ORDER BY id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=7.306..7.337 rows=100 loops=1)
   CTE d
     ->  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.041..6.017 rows=609 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.038..0.040 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           ->  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.240..1.393 rows=152 loops=4)
                 ->  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.041 rows=152 loops=4)
                 ->  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.006..0.008 rows=1 loops=609)
                       Index Cond: (parent_fk = d.id)
   ->  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=7.304..7.316 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.046..6.678 rows=609 loops=1)
 Total runtime: 7.559 ms

Query Plan: Vertebrata

 Limit  (cost=74360.25..74360.50 rows=100 width=4) (actual time=2065.053..2065.080 rows=100 loops=1)
   CTE d
     ->  Recursive Union  (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.105..1797.898 rows=264325 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=4) (actual time=0.101..0.103 rows=1 loops=1)
                 Index Cond: (id = 44)
           ->  Nested Loop  (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.915..182.695 rows=29369 loops=9)
                 ->  WorkTable Scan on d  (cost=0.00..0.20 rows=10 width=4) (actual time=0.007..8.056 rows=29369 loops=9)
                 ->  Index Scan using usage_parent_idx on name_usage c  (cost=0.00..715.75 rows=306 width=8) (actual time=0.004..0.005 rows=1 loops=264325)
                       Index Cond: (parent_fk = d.id)
   ->  Sort  (cost=1779.24..1855.64 rows=30561 width=4) (actual time=2065.050..2065.062 rows=100 loops=1)
         Sort Key: d.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  CTE Scan on d  (cost=0.00..611.22 rows=30561 width=4) (actual time=0.109..1986.606 rows=264325 loops=1)
 Total runtime: 2080.248 ms

As expected the recursive query does a pretty good job if the subtree is small. But for the large vertebrate subtree its rather slow because we first get all decendants and then apply a limit.

ltree

With ltree you have various options to query for a subtree. You can use a path lquery with ~, a full text ltxtquery via @ or the ltree <@ isDescendant operator.

Unanchored lqueries for any path containing the a node turned out to be very, very slow. This is expected somehow because it can't use the index properly. Surprisingly even the anchored queries and the full text query were far too slow from being useful at all. The fastest option definitely was the native descendants operator:

 WITH p AS (
   SELECT path FROM name_usage WHERE id=44
 )
 SELECT u.id 
  FROM name_usage u, p 
  WHERE u.path <@ p.path
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=2.926..2.963 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.030..0.034 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=2.923..2.942 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=1.094..2.230 rows=609 loops=1)
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.035..0.040 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=1.051..1.952 rows=609 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=1.023..1.023 rows=609 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 3.068 ms

Query Plan: Vertebrata

 Limit  (cost=54526.09..54526.34 rows=100 width=4) (actual time=512.420..512.445 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.091..0.094 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=54515.41..54551.25 rows=14336 width=4) (actual time=512.417..512.432 rows=100 loops=1)
         Sort Key: u.id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=2094.99..53967.50 rows=14336 width=4) (actual time=115.119..428.632 rows=264325 loops=1)
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.096..0.100 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2094.99..53788.28 rows=14336 width=124) (actual time=115.016..372.070 rows=264325 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2091.40 rows=14336 width=0) (actual time=109.791..109.791 rows=264325 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 512.723 ms

intarray

As a node in the tree appears only once, we can query for all usages that have the node id in their array but are not that very record.

 SELECT u.id 
  FROM name_usage u
  WHERE u.mpath @@ '44' and u.id != 44
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=1.925..1.966 rows=100 loops=1)
   ->  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=1.923..1.942 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=0.682..1.219 rows=608 loops=1)
               Recheck Cond: (mpath @@ '2684876'::query_int)
               Filter: (id <> 2684876)
               ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=0.646..0.646 rows=609 loops=1)
                     Index Cond: (mpath @@ '2684876'::query_int)
 Total runtime: 2.052 ms

Query Plan: Vertebrata

 Limit  (cost=52491.99..52492.24 rows=100 width=4) (actual time=377.851..377.877 rows=100 loops=1)
   ->  Sort  (cost=52491.99..52527.83 rows=14336 width=4) (actual time=377.849..377.861 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on name_usage u  (cost=179.10..51944.08 rows=14336 width=4) (actual time=115.634..298.193 rows=264324 loops=1)
               Recheck Cond: (mpath @@ '44'::query_int)
               Filter: (id <> 44)
               ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..175.52 rows=14336 width=0) (actual time=110.776..110.776 rows=264325 loops=1)
                     Index Cond: (mpath @@ '44'::query_int)
 Total runtime: 378.131 ms

string path

Just for completeness and to compare relative performances I am trying an anchored pattern match against a varchar based materialzed path. In order to let postgres use the mspath index we must also order by that value, no id:

 WITH p AS (
  SELECT mspath FROM name_usage where id=44
 )
 SELECT u.id 
  FROM name_usage u, p
  WHERE u.mspath LIKE p.mspath || '.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=10.68..80535.72 rows=100 width=36)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=32)
           Index Cond: (id = 2684876)
   ->  Nested Loop  (cost=0.00..60022560.36 rows=74539 width=36)
         Join Filter: (u.mspath ~~ (p.mspath || '.%'::text))
         ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..59649864.65 rows=14907828 width=36)
         ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32)

I don't know what is going on here, but I can't make postgres use the mspath btree index properly. The query therefore is very, very slow. If I use a hardcoded path instead of p.mspath the index is used. I'll show results here for the hardcoded path now - anyone having an idea on how to avoid the table scan in the above sql please let me know.

 SELECT u.id 
  FROM name_usage u
  WHERE u.mspath LIKE '6.101.194.640.3925.2684876.%'
  ORDER BY u.mspath
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.105..0.549 rows=100 loops=1)
   ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.102..0.516 rows=100 loops=1)
         Index Cond: ((mspath >= '6.101.194.640.3925.2684876.'::text) AND (mspath < '6.101.194.640.3925.2684876/'::text))
         Filter: (mspath ~~ '6.101.194.640.3925.2684876.%'::text)
 Total runtime: 0.637 ms

Query Plan: Vertebrata

 Limit  (cost=0.00..400.95 rows=100 width=36) (actual time=0.076..0.472 rows=100 loops=1)
   ->  Index Scan using nu_mspath_idx on name_usage u  (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.074..0.435 rows=100 loops=1)
         Index Cond: ((mspath >= '1.44.'::text) AND (mspath < '1.44/'::text))
         Filter: (mspath ~~ '1.44.%'::text)
 Total runtime: 0.561 ms

The hardcoded results are extremely fast. Much faster than any of the specialised indices above. Retrying ltree for example with a hardcoded path doesnt speed up the ltree query. The vast speed gain here is because postgres can use the b-tree index for a sorted output and therefore really benefit from the limit. GiST and GIN indices on the other hand are not suitable for sorting.

nested set

A very different query model from the above, which all use some sort of a materialzed path. We need to also add the checklist_fk condition because the nested set indices are only unique within each taxonomy, not across all.

 SELECT u.id 
  FROM name_usage u JOIN name_usage p ON u.checklist_fk=p.checklist_fk  
  WHERE  p.id=44 and u.lft BETWEEN p.lft and p.rgt
  ORDER BY u.lft
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=2.030..2.076 rows=100 loops=1)
   ->  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=2.029..2.052 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.062..1.482 rows=609 loops=1)
               ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.028..0.029 rows=1 loops=1)
                     Index Cond: (id = 2684876)
               ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.029..1.182 rows=609 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt))
 Total runtime: 2.206 ms

Query Plan: Vertebrata

 Limit  (cost=81687.08..81687.33 rows=100 width=8) (actual time=475.811..475.829 rows=100 loops=1)
   ->  Sort  (cost=81687.08..82326.50 rows=255769 width=8) (actual time=475.809..475.817 rows=100 loops=1)
         Sort Key: u.lft
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Nested Loop  (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.158..381.822 rows=264325 loops=1)
               ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.075..0.077 rows=1 loops=1)
                     Index Cond: (id = 44)
               ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.077..323.519 rows=264325 loops=1)
                     Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt))
 Total runtime: 475.951 ms

Again the tricky part was making postgres use the lft/rgt index properly when using the order by. Removing the order by turns this query into a super fast one for even the vertebrate query (only o.4ms for any of the two). If we also use hardcoded values instead of a joined p table we can get even better performances than with the string path model. For example the vertebrate query would look like this:

SELECT u.id 
 FROM name_usage u 
 WHERE  u.checklist_fk=1 and u.lft BETWEEN 517646 and 1046295
 ORDER BY u.lft
 LIMIT 100 OFFSET 0;

Query Plan: Vertebrata

 Limit  (cost=0.00..337.88 rows=100 width=8) (actual time=0.080..0.395 rows=100 loops=1)
   ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage u  (cost=0.00..1566342.54 rows=463573 width=8) (actual time=0.078..0.356 rows=100 loops=1)
         Index Cond: ((checklist_fk = 1) AND (lft >= 517646) AND (lft <= 1046295))
 Total runtime: 0.480 ms

Query for Ancestors

The ancestor query does not use any limit as its only a few records and we always want all. All materialized paths already have the ancestors - they are the path.

adjacency

A recursive CTE query:

 WITH RECURSIVE a AS (
  SELECT id, parent_fk, rank_fk
   FROM name_usage
   WHERE id = 44
 UNION ALL
  SELECT p.id, p.parent_fk, p.rank_fk
   FROM a JOIN name_usage p ON a.parent_fk = p.id
 )
 SELECT * FROM a
 WHERE id!=44
 ORDER BY rank_fk;

Query Plan: Abies

 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.155..0.156 rows=5 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     ->  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.030..0.117 rows=6 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.027..0.029 rows=1 loops=1)
                 Index Cond: (id = 2684876)
           ->  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.011..0.012 rows=1 loops=6)
                 ->  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=6)
                 ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.007..0.008 rows=1 loops=6)
                       Index Cond: (id = a.parent_fk)
   ->  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.068..0.141 rows=5 loops=1)
         Filter: (id <> 2684876)
 Total runtime: 0.265 ms

Query Plan: Vertebrata

 Sort  (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.104..0.104 rows=1 loops=1)
   Sort Key: a.rank_fk
   Sort Method: quicksort  Memory: 25kB
   CTE a
     ->  Recursive Union  (cost=0.00..1083.93 rows=101 width=12) (actual time=0.040..0.077 rows=2 loops=1)
           ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=12) (actual time=0.037..0.040 rows=1 loops=1)
                 Index Cond: (id = 44)
           ->  Nested Loop  (cost=0.00..107.12 rows=10 width=12) (actual time=0.013..0.015 rows=0 loops=2)
                 ->  WorkTable Scan on a  (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=2)
                 ->  Index Scan using usage_pkey on name_usage p  (cost=0.00..10.68 rows=1 width=12) (actual time=0.008..0.009 rows=0 loops=2)
                       Index Cond: (id = a.parent_fk)
   ->  CTE Scan on a  (cost=0.00..2.27 rows=100 width=12) (actual time=0.079..0.092 rows=1 loops=1)
         Filter: (id <> 44)
 Total runtime: 0.232 ms

The trees are not very deep, so even the genus query only has to do 5 recursive calls to return 5 parents.

ltree

  select path FROM name_usage WHERE id=44;

The path contains all ancestor ids. Parsing it into separate ids within sql is not obvious at first glance though.

intarray

 SELECT mpath FROM name_usage WHERE id=44;

The path contains all ancestor ids. Iterating over each id entry is simple.

nested set

To find out all ancestors of a given node, we just select all nodes that contain its LFT boundary (which in a properly built hierarchy implies containing the RGT boundary too):

SELECT p.id, p.rank_fk, p.lft
 FROM  name_usage u JOIN name_usage p ON u.lft BETWEEN p.lft and p.rgt AND u.checklist_fk=p.checklist_fk 
 WHERE u.id=44 ORDER BY 3;

Query Plan: Abies

 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=607.957..607.958 rows=6 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=415.862..607.938 rows=6 loops=1)
         ->  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.046..0.047 rows=1 loops=1)
               Index Cond: (id = 2684876)
         ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=415.809..607.880 rows=6 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt))
 Total runtime: 608.034 ms

Query Plan: Vertebrata

 Sort  (cost=100429.25..101068.67 rows=255769 width=12) (actual time=38.428..38.428 rows=2 loops=1)
   Sort Key: p.lft
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.00..73083.96 rows=255769 width=12) (actual time=0.111..38.410 rows=2 loops=1)
         ->  Index Scan using usage_pkey on name_usage u  (cost=0.00..10.68 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
               Index Cond: (id = 44)
         ->  Index Scan using nu_ckl_lft_rgt_idx on name_usage p  (cost=0.00..72706.36 rows=20967 width=20) (actual time=0.084..38.378 rows=2 loops=1)
               Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt))
 Total runtime: 38.504 ms

Not very efficient.

Query for Children

adjacency

A perfect fit:

 SELECT u.id 
  FROM name_usage  u 
  WHERE parent_fk=44 
  ORDER BY u.id
  LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=640.97..641.22 rows=100 width=4) (actual time=0.566..0.623 rows=100 loops=1)
   ->  Sort  (cost=640.97..641.67 rows=279 width=4) (actual time=0.564..0.588 rows=100 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 32kB
         ->  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..630.31 rows=279 width=4) (actual time=0.046..0.312 rows=167 loops=1)
               Index Cond: (parent_fk = 2684876)
 Total runtime: 0.696 ms

Query Plan: Vertebrata

 Limit  (cost=16863.12..16863.37 rows=100 width=4) (actual time=8.783..8.811 rows=100 loops=1)
   ->  Sort  (cost=16863.12..16881.75 rows=7454 width=4) (actual time=8.780..8.793 rows=100 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Index Scan using usage_parent_idx on name_usage u  (cost=0.00..16578.23 rows=7454 width=4) (actual time=0.060..5.169 rows=6083 loops=1)
               Index Cond: (parent_fk = 44)
 Total runtime: 8.867 ms

ltree

Search for all descendants that have one more node:

WITH p AS (
 SELECT path FROM name_usage WHERE id=2684876
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE u.path <@ p.path and nlevel(u.path)=nlevel(p.path)+1
 ORDER BY u.path
 LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=4.118..4.161 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.029..0.031 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=4.116..4.134 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: quicksort  Memory: 48kB
         ->  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=1.230..3.241 rows=167 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.037..0.040 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=1.175..2.274 rows=609 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=1.146..1.146 rows=609 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 4.321 ms

Query Plan: Vertebrata

 Limit  (cost=56078.37..56078.56 rows=75 width=124) (actual time=600.793..600.816 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=120) (actual time=0.048..0.051 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=56067.69..56067.88 rows=75 width=124) (actual time=600.792..600.805 rows=100 loops=1)
         Sort Key: u.path
         Sort Method: top-N heapsort  Memory: 32kB
         ->  Nested Loop  (cost=2099.42..56065.36 rows=75 width=124) (actual time=112.999..595.738 rows=6083 loops=1)
               Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1))
               ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.053..0.057 rows=1 loops=1)
               ->  Bitmap Heap Scan on name_usage u  (cost=2099.42..55729.91 rows=14908 width=124) (actual time=112.924..413.889 rows=264325 loops=1)
                     Recheck Cond: (path <@ p.path)
                     ->  Bitmap Index Scan on nu_path_idx  (cost=0.00..2095.69 rows=14908 width=0) (actual time=107.524..107.524 rows=264325 loops=1)
                           Index Cond: (path <@ p.path)
 Total runtime: 600.980 ms

intarray

Search for an array that contains the parent id, but has an array size one greater than its parent:

WITH p AS (
 SELECT mpath FROM name_usage WHERE id=44
)
SELECT u.id 
 FROM  name_usage u, p
 WHERE  u.mpath @@ '44' and #u.mpath= #p.mpath+1
 ORDER BY u.mpath
 LIMIT 100 OFFSET 0;

Query Plan: Abies

 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=2.026..2.056 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.018..0.019 rows=1 loops=1)
           Index Cond: (id = 2684876)
   ->  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=2.024..2.042 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: quicksort  Memory: 48kB
         ->  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=0.341..1.352 rows=167 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               ->  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=0.292..0.874 rows=609 loops=1)
                     Recheck Cond: (mpath @@ '2684876'::query_int)
                     ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=0.279..0.279 rows=609 loops=1)
                           Index Cond: (mpath @@ '2684876'::query_int)
               ->  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.038..0.038 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)
 Total runtime: 2.144 ms

Query Plan: Vertebrata

 Limit  (cost=53976.90..53977.09 rows=75 width=56) (actual time=501.304..501.327 rows=100 loops=1)
   CTE p
     ->  Index Scan using usage_pkey on name_usage  (cost=0.00..10.68 rows=1 width=52) (actual time=0.075..0.078 rows=1 loops=1)
           Index Cond: (id = 44)
   ->  Sort  (cost=53966.22..53966.41 rows=75 width=56) (actual time=501.303..501.316 rows=100 loops=1)
         Sort Key: u.mpath
         Sort Method: top-N heapsort  Memory: 32kB
         ->  Hash Join  (cost=183.57..53963.89 rows=75 width=56) (actual time=116.159..495.063 rows=6083 loops=1)
               Hash Cond: ((# u.mpath) = (# (p.mpath + 1)))
               ->  Bitmap Heap Scan on name_usage u  (cost=183.54..53851.29 rows=14908 width=56) (actual time=116.030..391.099 rows=264325 loops=1)
                     Recheck Cond: (mpath @@ '44'::query_int)
                     ->  Bitmap Index Scan on nu_mpath_idx  (cost=0.00..179.81 rows=14908 width=0) (actual time=111.387..111.387 rows=264325 loops=1)
                           Index Cond: (mpath @@ '44'::query_int)
               ->  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.101..0.101 rows=1 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     ->  CTE Scan on p  (cost=0.00..0.02 rows=1 width=32) (actual time=0.082..0.086 rows=1 loops=1)
 Total runtime: 501.517 ms

nested set

A difficult task. Will leave this as a challenge for some later time.

Summary

The best performing queries do not come from one model. The adjacency model is unbeatable for listing children and with recursive queries in not too deep taxonomic trees it performs also very well to get all ancestors.

For listing descendants the winner are queries that can use an ordered btree index. I could only get this to work with hardcoded paths, so it's not useful like this for dynamic, single statement queries. If you can issue 2 separate queries in code though nested sets or the string materlialized path are ideal candidates for retrieving descendants. Because of a required ordering for doing paging it can use the btree efficiently, while all others produce the full list of decendants first and then order. Intarray is the quickest in that field, but nested sets and ltree perform rather similar. It remains to see if an additional btree index could improve the ordering of ltree and intarray drastically.

Faster HBase - hardware matters

Oliver Meyn — Fri, 08 Jun 2012 15:05:00 +0000

As I've written earlier, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:

Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper
Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster

Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)

Hardware:
c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad), 8GB RAM, 2x500G SATA 5.4K
c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x1TB SATA 7.2K

Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:

Figure 1: Scan performance of new cluster (2x 1gig ethernet)

So, 1 million records/second? Yes, please! That's a performance increase of about 300% over the c2 cluster I tested in my previous posts. Given that we doubled the number of machines and quadrupled the number of drives, that improvement feels about right. But the decline in performance as number of mappers is increased is still a bit suspect - we'd hope that performance would go up with more workers. This is the same behaviour we saw when we were limited by network in our original tests, and ganglia again shows a similar pattern, though this time it looks like we're hitting a limit around the 2 gig ethernet bond:

Figure 2: bytes_in (MB/s) with dual gigabit ethernet

Figure 3: bytes_out (MB/s) with dual gigabit ethernet

Unfortunately our master switch is currently full, so we don't have the extra 6 ports needed to test a triple bond - but given our past experience I feel reasonably confident that it would change Figure 1 such that scan performance increases with number of mappers up to some disk I/O contention limit.

Hang-on, what about data locality?

But there's a bigger question here - why are we using so much network bandwidth in the first place? Why stress about major compactions and data locality when it doesn't seem to get used? Therein lies the rub - PerformanceEvaluation can't take advantage of data locality. Tim wrote about the tremendous importance of TableInputFormat in ensuring maximum scan performance from MapReduce, and PerformanceEvaluation doesn't do that. It assigns a block of ids to scan to different mappers at random, meaning that at best one in six mappers (in our setup) will coincidentally have local data to read, and the rest will all transfer their data across the network. This isn't a bug in PerformanceEvaluation, per se, because it was written to try and emulate the tests that Google ran in their seminal white paper on BigTable, rather than act as a true benchmark for scanning performance. But if you're new to this stuff (as I was) it sure can be confusing. When we switched to scanning our real data using TableInputFormat our throughput jumped to 2M/sec from the 1M/sec we got using PerformanceEvaluation.

Conclusions

While we learned a lot from using PerformanceEvaluation to test our clusters, and it helped to uncover any misconfigurations and taught us how to fine tune lots of parameters, it is not a good tool for benchmarking scan performance. As Tim wrote, scans across our real occurrence data (~370M records) using TableInputFormat are finishing in 3 minutes - for our needs that is excellent and means we're happy with our cluster upgrade. Stay tuned for news about the occurrence download service that Lars and I are writing to take advantage of all this speed :)

Optimizing HBase MapReduce scans (for Hive)

Tim Robertson — Mon, 28 May 2012 17:39:00 +0000

By targeting data locality, full table scans of HBase using MapReduce across 373 million records are reduced from 19 minutes to 2.5 minutes.

We've been posting some blogs about HBase Performance which are all based on the PerformanceEvaluation tools supplied with HBase. This has helped us understand many characteristics of our system, but in some ways has sidetracked our tuning - namely investigating channel bonding to help increase inter machine bandwidth believing it was our primary limitation. While that will help for many things (e.g. the copy between mappers and reducers), a key usage pattern involves full table scans of HBase (spawned by Hive) and in a well setup environment network traffic should be minimal for this. Here I describe how we approached this problem, and the results.

The environment
We run Ganglia for cluster monitoring (and ours is public) and Puppet to provision machines. As an aside, without these tools or an equivalent I don't think you can sanely hope to run HBase in production. Here we are using the 6 node cluster, where each of the 6 slaves run a TaskTracker, DataNode and RegionServer, and each machine is Dell R720xd, with 64GB memory, 12xSATA 1TB drives with dual 6 core hyper threading CPUs. Quoting the user list: "HBase should be able to stretch its legs with this hardware".

Baseline
As a naive person getting access to the new cluster I started by creating the HBase table, mounted a Hive table on it, and populated it with a select using data from a CSV file.

INSERT OVERWRITE occurrence_hbase SELECT * FROM occurrence_hdfs.

The first good news was that this ran in only 2.5hrs to load up 367 million records with no pre splitting of the table and no failures (normally we use bulk loading tools but here I was feeling lazy). I then crafted a super simple MR job based on TableMapReduceUtils that took an unfiltered Scan, and a Mapper that did nothing but increment a counter with the number of rows read. This is a decent replica of what Hive would do when doing a range query (Hive does not do predicate push down to HBase with filters except for equality filters at the moment). The first run took 19 minutes. This test was across 200GB data (uncompressed) spread across 84 hard drives and 72 hyper threading cores reading it, so I knew it was too slow. We started digging... here you find some insights as to how we approached the task.

Improvement #1: Host name versus IP address
On an early run, we see the following on the Ganglia bytes_out:

Here we see what we have seen many times - saturating our network. But why? These are Mappers that should be hitting local RegionServers using local drives (12 of them). Looking at the number of data-local mappers spawned, we see that we have very low data-local map tasks:

This is suspicious, and if you see this, start investigating why. Basically the MR job is spawning tasks to use data that reside on other machines. For us, that means Mappers that are talking to region servers that are not local to it. On investigation, and from reading HBASE-1672, we observe that when looking at a task attempt in the MR console, the task attempt and input split locations look suspiciously different:

These actually refer to the same machine, but one is using the IP address, and the other the host name.

HBase is using the TableInputFormat which is returning the IP (thanks Stack for pointing this out). Now, it is important to note that this code is executed on the client calling machine, as it prepares a job for submission to the cluster. In my setup I was running over VPN from my laptop which was providing IP addresses for the region locations. The code in the TableInputFormatBase for our version of HBase was:

String regionLocation = table.getRegionLocation(keys.getFirst()[i]).getServerAddress().getHostname();

But on my laptop, a getHostname() on these machines always returned the IP address. Moving my code onto the cluster and launching from there solved this issue.

Improvement #2: Ensure HBase is balanced
When starting these tests there were 2 tables on HBase. HBase reported it was nicely balanced, major table compactions were done. On running however, we still see a huge bottleneck again shown clearly with bytes_out, bytes_in and region server requests:

Again the network saturation is clear, but 1 region server is getting hammered, and many machines are receiving data.

Why? Well, we run HBase 0.90 which does balancing across all tables and not on a per table basis. Thus, while HBase saw it's regions evenly distributed across the servers, one machine was hot spotted for one table. This is fixed in HBase 0.94, but we don't run that yet and ganglia clearly shows us this is a limitation. Again, MR is spawning jobs on other machines all hitting one RS and saturating the network. Fortunately I could delete the unused table and rebalance the lot.

Miscellaneous improvements
Somewhere along the line I saw exceptions reporting timeouts and things like this:

48 Lease exceptions
org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '7961909311940960915' does not exist
 at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1879)
 at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

Basically the HBase client is not reporting to the master quickly enough, and the master kills the client, and thus the task attempt fails. The result is the JobTracker goes and spawns another task attempt, and here we often observed it was not a map-local attempt - everything we know we need to avoid for performance. The following were done to address this:

Set the hbase.regionserver.lease.period to 600000, up from 60000. This was then the same as the TaskTracker timeout, so HBase would timeout the scan client in the same duration that the JobTracker would timeout the Task attempt anyway
Reduce the number of mappers from 44 to 36 on each machine. Here we just ran a few runs, observed Ganglia load averages etc, repeated with different configuration and ultimately tuned the Mapper count to the point where no exceptions were thrown. There was no magic recipe to this, other than "rinse and repeat". This is where Puppet is gold.

The final run
With a balanced HBase environment, and resolving the IP / Host issue, the MR result, and ganglia bytes_in, bytes_out and region server requests are shown:

We still see 5 mappers running across the network, which are due to the FairScheduler deciding it has waited long enough for a data local mapper and spawning another - we might investigate if we want to increase this wait time.

A final thought
All these tests used no Filter in the Scan, thus the entire data was passed from the region server to the mapper. Adding a filter such as the following, reduces the time to around 90 secs.

      scan.setFilter(new SingleColumnValueFilter(
        "v".getBytes(), 
        "scientific_name".getBytes(), 
        CompareOp.EQUAL,
        "Abies alba".getBytes()));

We are using Hive 0.90 on HBase 0.90 and I believe Hive will push down the "equals" predicates to HBase, which will benefit from these kind of filters. However, I wanted to run the tests without them, as we will often do range scans, and custom UDFs to do things like point in polygon checking.

Thanks to everyone on the mailing lists for all their support through this. You all know who you are, but in particular thanks to Lars George and Stack. All GBIF work is open source, and we are committed to open data - we always welcome collaborations.

Hive 0.9 with HBase 0.90

Lars Francke — Wed, 23 May 2012 16:53:00 +0000

Hive 0.9.0 was released at the beginning of this month and it contains a lot of very nice improvements. Thanks to all involved!

Unfortunately it drops compatibility with HBase 0.90.x due to two issues which introduced a dependency on HBase 0.92:

Fortunately these were relatively easy to revert so that's what we did because we wanted to all the 0.9.0 goodness on our HBase 0.90.4 cluster (CDH3u3).

I've forked Hive on Github and reverted the parts of those two issues (HIVE-2748, HIVE-2764) that were causing problems.

For all those "stuck" with HBase 0.90 (e.g. CDH3 users) we've also deployed this custom Hive HBase Handler to our own Maven repository and will maintain that for the foreseeable future. You can just download the jar file and use it in your projects or use our Maven repository:

 
  gbif-thirdparty 
  http://repository.gbif.org/content/repositories/thirdparty

And then declare a dependency on this custom HBase Handler:

 
  org.apache.hive 
  hive-hbase-handler 
  0.9.0 
  hbase-0.90-compat

Note for Maven experts: I'm not sure if this is a valid classifier. I couldn't find any naming rules.

That's it! We might maintain this for future versions of Hive but we'd love to hear about any problems in any case.

HBase Performance Evaluation continued - The Smoking Gun

Oliver Meyn — Tue, 20 Mar 2012 16:49:00 +0000

Update: See also part 1 and part 3.

In my last post I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class. I wasn't happy with our conclusions, which could largely be summed up as "we're not sure what's wrong, but it seems slow". So in the grand tradition of people with itches that wouldn't go away, I kept scratching. Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the last post) using a 300M row table built with PerformanceEvaluation sequentialWrite 300, and tested with PerformanceEvaluation scan 300. I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts. And to recap our hardware setup - we have 3 regionservers and a separate master.

The first unsettling ganglia metric that kept me digging was of ethernet bytes_in and bytes_out. I'll recreate those here:

Figure 1 - bytes_in (MB/s) with single gigabit ethernet

Figure 2 - bytes_out (MB/s) with single gigabit ethernet

This is a single gigabit ethernet link, and that is borne out by the Max values, which are near the theoretical max of 120 megabytes per second (MB/s), and as you can see they certainly aren't sitting at 120 constantly in the way one would expect if ethernet was our bottleneck. But that bytes_in graph sure looks like it's hitting some kind of limit - steep ramp up, relatively constant transfer rate, then steep ramp down. So I decided to try channel bonding to effectively use 2 gigabit links on each server "bonded" together. It turns out that's not quite as trivial as it may seem, but in the end I got it working following the Dell/Cloudera configuration suggestion of mode 6 / balance-alb, which is described more in the Linux kernel bonding documentation. The results are shown in the following graphs:

Figure 3 - bytes_in (MB/s) with dual (bonded) gigabit ethernet

Figure 4 - bytes_out (MB/s) with dual (bonded) gigabit ethernet

Unfortunately the scale has changed (ganglia ain't perfect) but you can see we're now hovering a bit higher than the single ethernet case, and at some points swinging significantly above the single ethernet limit (a few gusts up to 150). So it would seem that even though the single gigabit graphs didn't look like they were hitting theoretical maximum, they were definitely limited. Not by much though - maybe 10-15%? As they say, the proof is in the pudding, so what was the final throughput for the two tests?

Ethernet	Throughput (rows/s)
Single	340k
Double	357k

Looks like 5%. Certainly not the doubling I was hoping for! Something else must be the limiting factor then, but what? Well, what does CPU say?

Figure 5 - cpu_idle with dual (bonded) gigabit ethernet

That CPU is saying, "Feed me Seymour!". If the CPU is hungry, what about the load average?

Figure 6 - one minute load average with dual (bonded) gigabit ethernet

These machines have 8 hyper-threaded cores, so effectively 16 cpus. The load average looks like it's keeping the CPUs fed - hovering around 20. But hang on - the load average isn't all about CPU - it's about work that the whole machine needs to do - including all the io subsystems. So if it ain't network, and it ain't CPU then it's gotta be disks. Let's look at how much time the CPU spent waiting on io (which in this case basically means "waiting on disks"):

Figure 7 - percentage cpu spent in wait_io state, with dual (bonded) gigabit ethernet

From a pure gut perspective, that seems kind of high. But on closer inspection, that graph also looks familiar - it's basically the same as Figure 6 - the load average. So what can we deduce? When ethernet is removed as a limiting factor the run queue is filled with processes that all cause an increase in wait_io - which means our processes would all finish faster if they didn't have to wait for io. The limiting factor must, then, be the disks.

Recap (TL;DR)

Ethernet looked suspiciously capped, but performance (and ethernet usage) only increased slightly when the cap was lifted. Closer inspection of the resulting load average and CPU usage showed that the limiting factor was in fact the disks.

Solution? Replacing the disks might help a little - they're 3 year old 2.5" SATA, but they are 7.2k rpm, and a more modern 10k 2.5" would be marginally faster. The real benefit would probably be more disks - the current wisdom on the mailing lists and in the documentation is to maximize the ratio of disk spindles to cpu cores and I think that's borne out by these results. In the coming weeks we're going to build a new cluster of 6 machines with 12 disks each and I very much look forward to testing my theories once they're up and running. Stay tuned!

Text available for public review: The digitisation of biological nomenclatural and taxonomic information by Richard Pankhurst

Fri, 11 Nov 2011 11:10:04 +0000

Dear members of the GNA group,

I am happy to announce that there's a new text available for public review: 'The digitisation of biological nomenclatural and taxonomic information' by Richard Pankhurst.

The document guides the reader through the process of digitazing information about taxonomic names and relationships. It explains the nature of the taxonomic data with many examples and gives advice on how to proceed in the digitization process in each case, independently of the software used. After the digitization process, the data can be published on the internet through networks such as GBIF.

The public review period ends the 9th December 2011. The author will consider all comments received and a final version will be released soon afterwards.

Please send your comments through the community site or to training@gbif.org.

Thanks for your participation!

http://community.gbif.org/pg/file/read/18107/

PUBLIC REVIEW: The digitisation of biological nomenclatural and taxonomic information by Richard Pankhurst

Fri, 11 Nov 2011 10:13:58 +0000

This document guides the reader through the process of digitazing information about taxonomic names and relationships. It explains the nature of the taxonomic data with many examples and gives advice on how to proceed in the digitization process in each case, independently of the software used. After the digitization process, the data can be published on the internet through networks such as GBIF.

THIS DOCUMENT IS AVAILABLE FOR PUBLIC TILL THE 9TH DECEMBER 2011. Please add your comments to this Community Site item, or send them to training@gbif.org.

Abstract: "This text explains and advises how best to set up and populate a database for the taxonomy and biodiversity documentation of a group of organisms. Examples will be taken from experience gained while creating the Rosaceae (Rose family) global taxonomic database, and will therefore relate to flowering plants (Angiospermae), and in particular to the International Code of Botanical Nomenclature (ICBN, McNeill 2006) which states the rules for wild plant nomenclature. There is a separate and different code for the naming of cultivated plants, and again separate from the codes for zoology and for bacteria. For organisms covered by these other codes the approach will necessarily be somewhat different."

GBIF Call for Proposals: National Checklist Building Best Practices

Wed, 07 Sep 2011 12:26:17 +0000

The GBIF Secretariat invites proposals to develop a document entitled Best Practice Guidelines in the Development and Maintenance of National Species Checklists. We seek to identify individuals or groups who have been directly involved in the compilation, maintenance and dissemination of a national species checklist to develop these guidelines in order to compile a set of practices that may serve as a guide for future efforts to build and maintain national species lists.

See http://www.gbif.org/communications/news-and-events/showsingle/article/call-for-proposals-to-draft-best-practices-in-the-development-and-maintenance-of-national-species-c/ for more information.

Google Docs Taxonomic Name Service

Fri, 03 Jun 2011 09:35:59 +0000

Mike Giddens of SilverBiology, has experimented with Google Docs scripting services to create a very nice name-mapping tool using Google Spreadsheets.

A user can paste a list of names into a spreadsheet, and then have each name looked up in GBIFs name services, and the classification, taxonomic status, and links to more information can be embedded into the spreadsheet.

To try out this service:

1. Log in to your google account (you must be logged in)

2. Go to https://spreadsheets0.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AhdYOHWBUlw_dF83RVRtQjdiQ2dwZXJCRERlUWR0SkE&output=html

3. Paste your own list of names into column A or if there are existing names, you can use them.

4. If there is already classification information in columns B-H you can select and delete these to clear them

5. Select a series of names in column B.

6. Go to the "GBIF" option in the menu. Select one of the two options.

You can add a link to this script in your own docs by calling

http://labs.silverbiology.com/gbifclblookup/code.js

Google Docs Taxonomic Name Service

Fri, 03 Jun 2011 09:35:54 +0000

Mike Giddens of SilverBiology, has experimented with Google Docs scripting services to create a very nice name-mapping tool using Google Spreadsheets.

To try out this service:

1. Log in to your google account (you must be logged in)

2. Go to https://spreadsheets0.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AhdYOHWBUlw_dF83RVRtQjdiQ2dwZXJCRERlUWR0SkE&output=html

3. Paste your own list of names into column A or if there are existing names, you can use them.

4. If there is already classification information in columns B-H you can select and delete these to clear them

5. Select a series of names in column B.

6. Go to the "GBIF" option in the menu. Select one of the two options.

Publishing Species Checklists, A Step by Step Guide

Fri, 15 Apr 2011 07:04:49 +0000

A new GBIF user guide has been released today. "Publishing Species Checklists, A Step by Step Guide" provides a high-level guide to a suite of tools, reference guide, and best practices for publishing annotated species checklists through the GBIF network. http://links.gbif.org/checklist_how_to

Screencast available for using the Darwin Core Archive Assistant

Wed, 13 Apr 2011 07:28:43 +0000

Publishing biodiversity data using Darwin Core Archives (DwC-A) is simple enough that it can be done with no dedicated software installed on your servers. There is one XML file, called the metafile, that is required in a DwC-A. The Darwin Core Archive Assistant is a service that builds this file for you. See a 10 minute screencast that demonstrates this service at http://www.youtube.com/watch?v=_hF0sslw-B4

GBIF becomes member of Species 2000

Tue, 05 Apr 2011 05:56:29 +0000

On 14 March, GBIF, joined the Species 2000 consortium as a new member.

GBIF TaxonFinder service and the new TIKA document service

Mon, 20 Dec 2010 11:43:15 +0000

The GBIF TaxonFinder service, a web service for extracting scientific names from documents, now has the means to extract names from PDF files, word documents, even powerpoint, using the new GBIF TIKA service. An illustration of the 3 steps to using this service to extract names from documents is as follows:

Identify a source document URL. For example, this URL provides a PDF publication listing the biodiversity of earthworms in Taiwan http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf.
Append the document URL to the url parameter of the GBIF TIKA service. http://ecat-dev.gbif.org/ws-tika/?url=http://tai2.ntu.edu.tw/taiwania/pdf/tai.2006.51.3.226.pdf The output of this service is an XML document containing the full text of the PDF.
Use the URL from 2 as an input parameter to the GBIF TaxonFinder Service or any other name-finding service that supports the GNA Name-Finder API. For the GBIF TaxonFinder Service use the base service call:
1. http://tools.gbif.org/ws/taxonfinder?
2. input a type=url and a input of the URL as in 2

See the XML output here.

Taxonomic Name Processing

Fri, 17 Dec 2010 13:09:15 +0000

Processing scientific names is often a necessary component of working with biodiversity information. A number of focal areas have emerged and been the subject of work within GNA Nomina meetings. These areas of focus include:

Name recognition tools that can recognize known scientific names in free-text
Name discovery algorithms extend recognition to include novel taxon names.
Name reconciliation or de-aliasing methods that group spelling-variations of names and reconcile them to correctly-spelled names.
Atomizers and canonizers that can break a name into identified component parts (Genus, Species, Authorship, etc.) and into a normalized form.
Services and applications that use the tools described in this list.

This site provides forums for discussion, documentation, files and links to source repositories containing dictionaries, source code and development support.

New framework to deliver biodiversity knowledge

Wed, 02 Oct 2013 09:24:00 +0000

Global Biodiversity Informatics Outlook sets out key steps to harness IT and open data to inform better decisions

Using ‘big data’ to help conserve life on Earth

Tue, 24 Sep 2013 13:51:00 +0000

Save the date: Wednesday, 9 October

Historic Brazilian plant collection available via GBIF

Mon, 16 Sep 2013 09:34:00 +0000

Rio de Janeiro Botanic Gardens Research Institute is first data publisher from Brazil to enter global online network

Data publishing network VertNet joins GBIF

Fri, 13 Sep 2013 08:05:00 +0000

New Participant mobilizes global vertebrate occurrence data

Filling data gaps to address biodiversity challenges

Thu, 15 Aug 2013 13:48:00 +0000

Task group recommendations on demand-driven data mobilization will be addressed in new GBIF Work Programme

GBIF Participants collaborate in 2013 mentoring projects

Tue, 16 Jul 2013 08:46:00 +0000

Eight Participant nodes share expertise on data publishing, portal development, digitization and e-learning.

Registration for GBIF Governing Board meeting now open

Thu, 11 Jul 2013 14:51:00 +0000

GB20 and associated events to be held in Berlin, Germany from 4-10 October

GBIF Annual Report and Science Review published

Mon, 24 Jun 2013 13:50:00 +0000

New publication highlights more than 230 research papers citing use of GBIF as data source during 2012.

Think bigger, GBIF award winner urges biologists

Mon, 17 Jun 2013 14:12:00 +0000

Winner of GBIF science prize will design ‘Ecotron’ to test predictions of climate impacts on species; other awards support young researchers’ work on bat ‘nectar corridors’ and identifying data gaps.

GBIF enables global study of climate impact on species

Mon, 13 May 2013 05:18:00 +0000

Research in Nature Climate Change uses data on 50,000 common plants and animals to predict worldwide range losses without urgent action to limit emissions

Israel joins GBIF

Wed, 24 Apr 2013 11:40:00 +0000

Decision aims to enhance collaboration on nature conservation and research

US launches new biodiversity data system

Fri, 19 Apr 2013 11:21:00 +0000

BISON project from GBIF national node offers access to more than 100 million mapped species records

Brazil surveys data holdings and informatics capacity

Thu, 14 Mar 2013 08:50:00 +0000

Review of national institutions will help mobilize biodiversity data for new GBIF national node

New portal on Colombia’s biodiversity

Wed, 06 Mar 2013 08:08:00 +0000

Information on Colombia’s biological diversity is now available at http://www.sibcolombia.net.

Support for projects to promote ‘data paper’ publication

Fri, 18 Jan 2013 11:34:00 +0000

GBIF nodes in Spain, Colombia and India plan to enable more than 40 publications describing biodiversity datasets.

GBIF ready to support new science-policy platform

Thu, 17 Jan 2013 08:10:00 +0000

Briefing sets out GBIF role ahead of first IPBES plenary meeting

Call for proposals for the 2013 Young Researchers Award

Mon, 14 Jan 2013 07:24:51 +0000

GBIF invites proposals from graduate students for the 2013 Young Researchers Award. This prize intends to foster innovative research and discovery in biodiversity informatics.

Call for nominations for the 2013 Ebbe Nielsen Prize

Mon, 14 Jan 2013 11:21:00 +0000

GBIF invites nominations for the 2013 Ebbe Nielsen Prize, awarded annually to a person or team who demonstrates excellence in combining biodiversity informatics and biosystematics research.

Brazil joins global initiative for biodiversity data access

Fri, 30 Nov 2012 13:10:30 +0000

Entry of mega-diverse country into network ’very exciting’, says GBIF chair

Peer review option proposed for biodiversity data

Fri, 30 Nov 2012 13:11:15 +0000

Discussion paper suggests new quality control procedures to promote data publishing and use

New guide for compiling national species checklists

Fri, 30 Nov 2012 13:06:19 +0000

Step-by-step advice on developing species inventories aims to help countries improve management of biodiversity.

National biodiversity information ‘grid’ proposed for India

Wed, 17 Oct 2012 13:04:00 +0000

Senior government advisers back plan involving Indian GBIF node

GBIF signs invasive species information partnership

Wed, 10 Oct 2012 15:26:00 +0000

Collaboration aims to help countries reduce threat to biodiversity from alien species

GBIF committed to provide CBD data foundations – Hobern

Tue, 09 Oct 2012 08:02:00 +0000

Executive Secretary statement to biodiversity convention’s conference in India

GBIF promotes shared data culture at India conference

Mon, 08 Oct 2012 09:58:00 +0000

Events at CBD Conference of Parties (COP11) outline data mobilization projects, better information on invasive species, and a shared vision for ‘biodiversity intelligence’

Plant data helps climate models - GBIF award winner

Wed, 19 Sep 2012 12:48:00 +0000

Availability of large volumes of data through GBIF helps refine predictions of climate impact, says Ebbe Nielsen Prize winner

Symposium showcases use of open-access biodiversity data

Sun, 16 Sep 2012 22:00:00 +0000

Presentations highlight research on climate change impacts, Arctic biodiversity

New guide for developing marine species checklists

Fri, 17 Aug 2012 10:19:00 +0000

Document highlights resources specific to marine regions for eventual publishing through GBIF

Sharing expertise to manage data for science and society

Wed, 01 Aug 2012 08:33:00 +0000

Mentoring among GBIF Participants will support network-building and publication of biodiversity data from Indonesia, Costa Rica and cities worldwide.

New GBIF reference guide for verifying zoological names

Thu, 12 Jul 2012 12:42:00 +0000

Manual also gives recommendations on publishing new names.

Building global collaboration for biodiversity intelligence

Fri, 06 Jul 2012 14:23:00 +0000

Public to play major role in mobilizing expanded range of data needed to preserve vital functions of life on Earth, conference concludes.

A new vision for harnessing data about life on Earth

Thu, 21 Jun 2012 15:07:00 +0000

Conference will develop strategy to prioritize investment in biodiversity informatics.

Registration for GB19 is now available online

Thu, 07 Jun 2012 16:34:00 +0000

The online registration site for the 19th GBIF Governing Board meeting (GB19) is now open. GB19 will be held in Lillehammer, Norway, on 18 and 20 September 2012 with associated events taking place in the days before and after.

Awards target novel research on species distributions

Mon, 04 Jun 2012 12:58:00 +0000

GBIF Ebbe Nielsen prize and Young Researchers Award honour scientists exploring interactions between climate, ecology and evolution

Managing biodiversity data from local government

Fri, 25 May 2012 09:20:00 +0000

Guidance for local authorities on publishing through the GBIF network, helping preserve knowledge about biodiversity.

New GBIF guide for citing biodiversity data

Mon, 14 May 2012 14:16:00 +0000

Recommended practice will credit all involved in making datasets freely accessible

GBIF Annual Report for 2011 published

Mon, 23 Apr 2012 13:44:00 +0000

Details a decade of achievement in providing free and open access to biodiversity data

New GBIF videos on improving biodiversity data quality

Thu, 19 Apr 2012 13:28:00 +0000

Training videos provide information on sources of error and fitness for use

Conference to chart way ahead for biodiversity informatics

Tue, 20 Mar 2012 13:50:00 +0000

Copenhagen event aims to unite diverse communities to address key policy and science challenges.

Genomic data in GBIF moves a step closer

Wed, 14 Mar 2012 13:51:00 +0000

Aligning standards will help share information on biodiversity yet to be discovered.

Results of GBIF regional training call announced

Wed, 07 Mar 2012 12:48:00 +0000

Events to be held in Taipei, Kampala, Kathmandu and Bogotá will benefit from the support.

New Executive Secretary: exciting time ahead for GBIF

Tue, 31 Jan 2012 13:16:00 +0000

Donald Hobern sees data mobilized by network increasingly relevant for policy, science

Call for applications for regional training support

Mon, 16 Jan 2012 13:18:00 +0000

Call for proposals for the 2012 Young Researchers Award

Mon, 09 Jan 2012 13:41:00 +0000

Call for nominations for the 2012 Ebbe Nielsen Prize

Mon, 19 Dec 2011 15:35:00 +0000

German research team targets ‘at risk’ data on biodiversity

Mon, 19 Dec 2011 10:10:00 +0000

Project example of data hosting infrastructure promoted by GBIF

New biodiversity data publishing framework proposed

Thu, 15 Dec 2011 12:00:00 +0000

Recommendations target social, cultural, technical, policy, legal, economic components to promote data sharing.

Biodiversity knowledge vital to combat climate change - King

Fri, 09 Dec 2011 09:25:00 +0000

During intervention in Durban Climate Change Conference, GBIF Executive Secretary Nick King calls for wider participation.

Plans take shape to digitize camera trap data from India

Wed, 07 Dec 2011 09:29:00 +0000

Indo-Norwegian pilot project for new science-policy platform targets conservation strategies for large mammals

First database-derived ‘Data Paper’ published in journal

Mon, 28 Nov 2011 17:00:00 +0000

Peer-reviewed description of Indian bird dataset published through GBIF sets precedent for data-sharing incentives.