Arto Bendiken - Drupal

RDF for Intrepid Unix Hackers: Grepping N-Triples

Arto — 2010-03-05T16:00:00Z

I originally wrote this tutorial for the Datagraph Blog. Subscribe to the Datagraph RSS feed to get subsequent installments.

The N-Triples format is the lowest common denominator for RDF serialization formats, and turns out to be a very good fit to the Unix paradigm of line-oriented, whitespace-separated data processing. In this tutorial we'll see how to process N-Triples data by pipelining standard Unix tools such as grep, wc, cut, awk, sort, uniq, head and tail.

To follow along, you will need access to a Unix box (Mac OS X, Linux, or BSD) with a Bash-compatible shell. We'll be using curl to fetch data over HTTP, but you can substitute wget or fetch if necessary. A couple of the examples require a modern AWK version such as gawk or mawk; on Linux distributions you should be okay by default, but on Mac OS X you will need to install gawk or mawk from MacPorts as follows:

$ sudo port install mawk
$ alias awk=mawk

Grokking N-Triples

Each N-Triples line encodes one RDF statement, also known as a triple. Each line consists of the subject (a URI or a blank node identifier), one or more characters of whitespace, the predicate (a URI), some more whitespace, and finally the object (a URI, blank node identifier, or literal) followed by a dot and a newline. For example, the following N-Triples statement asserts the title of my website:

<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .

This is an almost perfect format for Unix tooling; the only possible further improvement would have been to define the statement component separator to be a tab character, which would have simplified obtaining the object component of statements -- as we'll see in a bit.
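To make the field structure concrete, here is a minimal sketch that feeds the example statement above to awk and prints what whitespace splitting yields. The variable name `line` is just for illustration. Note how the literal is split at its internal space, which is the complication hinted at above.

```shell
# Split one N-Triples statement into whitespace-separated fields with awk.
# The statement is the example from the text above.
line='<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .'

printf '%s\n' "$line" | awk '{
  print "subject:   " $1
  print "predicate: " $2
  print "fields:    " NF   # 5, not 4: the literal was split at its space
}'
```

The subject and predicate never contain whitespace, so they are always the first and second fields; only the object needs special care.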

Getting N-Triples

Many RDF data dumps are made available as compressed N-Triples files. DBpedia, the RDFization of Wikipedia, is a prominent example. For purposes of this tutorial I've prepared an N-Triples dataset containing all Drupal-related RDF statements from DBpedia 3.4, which is the latest release at the moment and reflects Wikipedia as of late September 2009.

I prepared the sample dataset by downloading all English-language core datasets (20 N-Triples files totaling 2.1 GB when compressed) and crunching through them as follows:

$ bzgrep Drupal *.nt.bz2 > drupal.nt

To save you from gigabyte-sized downloads and an hour of data crunching, you can just grab a copy of the resulting drupal.nt file as follows:

$ curl http://blog.datagraph.org/2010/03/grepping-ntriples/drupal.nt > drupal.nt

The sample dataset totals 294 RDF statements and weighs in at 70 KB.

Counting N-Triples

The first thing we want to do is count the number of triples in an N-Triples dataset. This is straightforward to do, since each triple is represented by one line in an N-Triples input file and there are a number of Unix tools that can be used to count input lines. For example, we could use either of the following commands:

$ cat drupal.nt | wc -l
294

$ cat drupal.nt | awk 'END { print NR }'
294

Since we'll be using a lot more of AWK throughout this tutorial, let's stick with awk and define a handy shell alias for this operation:

$ alias rdf-count="awk 'END { print NR }'"

$ cat drupal.nt | rdf-count
294

Note that, for reasons of comprehensibility, the previous examples as well as most of the subsequent ones assume that we're dealing with "clean" N-Triples datasets that don't contain comment lines or other miscellanea. The DBpedia data dumps fit this bill very well. However, further on I will give "fortified" versions of these commands that can correctly deal with arbitrary N-Triples files.

Measuring N-Triples

We at Datagraph frequently use the N-Triples representation as the canonical lexical form of an RDF statement, and work with content-addressable storage systems for RDF data that in fact store statements using their N-Triples representation. In such cases, it is often useful to know some statistical characteristics of the data to be loaded in a mass import, so as to e.g. be able to fine-tune the underlying storage for optimum space efficiency.

A first useful statistic is to know the typical size of a datum, i.e. the line length of an N-Triples statement, in the dataset we're dealing with. AWK yields us N-Triples line lengths without much trouble:

$ alias rdf-lengths="awk '{ print length }'"

$ cat drupal.nt | rdf-lengths | head -n5
162
150
155
137
150

Note that N-Triples is an ASCII format, so the numbers above reflect both the byte size and the ASCII character count of each input line. All non-ASCII characters are escaped in N-Triples, and for present purposes we'll be talking in terms of ASCII characters only.

The above list of line lengths in and of itself won't do us much good; we want to obtain aggregate information for the whole dataset at hand, not for individual statements. It's too bad that Unix doesn't provide commands for simple numeric aggregate operations such as the minimum, maximum and average of a list of numbers, so let's see if we can remedy that.

One way to define such operations would be to pipe the above output to an RPN shell calculator such as dc and have it perform the needed calculations. The complexity of this would go somewhat beyond mere shell aliases, however. Thankfully, it turns out that AWK is well-suited to writing these aggregate operations as well. Here's how we can extend our earlier pipeline to boil the list of line lengths down to an average:

$ alias avg="awk '{ s += \$1 } END { print s / NR }'"

$ cat drupal.nt | rdf-lengths | avg
242.517

The above, incidentally, is an example of a simple map/reduce operation: a sequence of input values is mapped through a function, in this case length(line), to give a sequence of output values (the line lengths) that is then reduced to a single aggregate value (the average line length). Though I won't go further into this just now, it is worth mentioning in passing that N-Triples is an ideal format for massively parallel processing of RDF data using Hadoop and the like.

Now, we can still optimize and simplify the above somewhat by combining both steps of the operation into a single alias that outputs an average line length for the given input stream, like so:

$ alias rdf-length-avg="awk '\
  { s += length } \
  END { print s / NR }'"

Likewise, it doesn't take much more to define an alias for obtaining the maximum line length in the input dataset:

$ alias rdf-length-max="awk '\
  BEGIN { n = 0 } \
  { if (length > n) n = length } \
  END { print n }'"

Getting the minimum line length is only slightly more complicated. Instead of comparing against a zero baseline like above, we need to instead define a "roof" value to compare against. In the following, I've picked an arbitrarily large number, making the (at present) reasonable assumption that no N-Triples line will be longer than a billion ASCII characters, which would amount to somewhat less than a binary gigabyte:

$ alias rdf-length-min="awk '\
  BEGIN { n = 1e9 } \
  { if (length > 0 && length < n) n = length } \
  END { print (n < 1e9 ? n : 0) }'"

Now that we have some aggregate operations to crunch N-Triples data with, let's analyze our sample DBpedia dataset using the three aliases defined above:

$ cat drupal.nt | rdf-length-avg
242.517

$ cat drupal.nt | rdf-length-max
2179

$ cat drupal.nt | rdf-length-min
84

We can see from the output that N-Triples line lengths in this dataset vary considerably: from less than a hundred bytes to a little over two kilobytes, averaging around 240 bytes. This variability is to be expected for DBpedia data, given that many RDF statements in such a dataset contain a long textual description as their object literal whereas others contain merely a simple integer literal.

Many other statistics, such as the median line length or the standard deviation of the line lengths, could conceivably be obtained in a manner similar to what I've shown above. I'll leave those as exercises for the reader, however, as further stats regarding the raw N-Triples lines are unlikely to be all that generally interesting.
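For readers who would rather peek at one possible answer, here is a rough sketch of both statistics. The function names are hypothetical, and I've written them as shell functions rather than aliases so that they also work inside scripts. The standard deviation is the population form computed from running sums, which is simple but can lose precision on very large inputs.

```shell
# Hypothetical helper names; both read N-Triples lines on standard input.

# Population standard deviation of the line lengths, in a single pass.
rdf_length_stddev() {
  awk '{ n += 1; s += length($0); q += length($0) * length($0) }
       END { if (n > 0) { m = s / n; v = q / n - m * m; if (v < 0) v = 0; print sqrt(v) } }'
}

# Median line length: sort the lengths numerically, then pick the middle value
# (or the mean of the two middle values when the count is even).
rdf_length_median() {
  awk '{ print length($0) }' | sort -n | awk '
    { a[NR] = $1 }
    END { if (NR > 0) print (NR % 2 ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2) }'
}

# Four lines of lengths 2, 4, 6 and 8.
printf 'aa\nbbbb\ncccccc\ndddddddd\n' | rdf_length_stddev
printf 'aa\nbbbb\ncccccc\ndddddddd\n' | rdf_length_median
```

Unlike the other aggregates, the median needs the whole list of lengths in memory (and a sort), so it is the one statistic here that does not stream.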

Parsing N-Triples

It's time to move on to getting at the three components -- the subject, the predicate and the object -- that constitute RDF statements.

We have two straightforward choices for obtaining the subject and predicate: the cut command and good old awk. I'll show both aliases:

$ alias rdf-subjects="cut -d' ' -f 1 | uniq"
$ alias rdf-subjects="awk '{ print \$1 }' | uniq"

While cut might shave off some microseconds compared to awk here, AWK is still the better choice for the general case, as it allows us to expand the alias definition to ignore empty lines and comments, as we'll see later. On our sample data, though, either form works fine.

You may have noticed and wondered about the pipelined uniq after cut and awk. This is simply a low-cost, low-grade deduplication filter: it drops consecutive duplicate values. For an ordered dataset (where the input N-Triples lines are already sorted in lexical order), it will get rid of all duplicate subjects. In an unordered dataset, it won't do much good, but it won't do much harm either (what's a microsecond here or there?).

To fully deduplicate the list of subjects for a (potentially) unordered dataset, apply another uniq filter after a sort operation as follows:

$ cat drupal.nt | rdf-subjects | sort | uniq | head -n5
<http://dbpedia.org/resource/Acquia_Drupal>
<http://dbpedia.org/resource/Adland>
<http://dbpedia.org/resource/Advomatic>
<http://dbpedia.org/resource/Apadravya>
<http://dbpedia.org/resource/Application_programming_interface>

I've not made sort an integral part of the rdf-subjects alias because sorting the subjects is an expensive operation with resource usage proportional to the number of statements processed; when processing a billion-triple N-Triples stream, it is usually simply better to not care too much about ordering.
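If you want full deduplication without paying for a sort, one common AWK idiom is to remember every value already seen in an associative array. The sketch below is not one of the aliases from this tutorial; the function name is hypothetical. The trade-off is memory: it holds every distinct value, so it suits datasets where the number of distinct subjects is manageable.

```shell
# Hypothetical helper: print each distinct input line once, in first-seen order.
# Memory use grows with the number of distinct values, not the number of lines.
rdf_uniq_unsorted() {
  awk '!seen[$0]++'
}

# Interleaved duplicates: plain uniq keeps all three lines, the helper keeps two.
printf '<http://example.org/a>\n<http://example.org/b>\n<http://example.org/a>\n' | uniq
printf '<http://example.org/a>\n<http://example.org/b>\n<http://example.org/a>\n' | rdf_uniq_unsorted
```

Because the output keeps the order in which values first appear, it can be dropped into a pipeline in place of `sort | uniq` whenever the ordering itself doesn't matter.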

Getting the predicates from N-Triples data works exactly the same way as getting the subjects:

$ alias rdf-predicates="cut -d' ' -f 2 | uniq"
$ alias rdf-predicates="awk '{ print \$2 }' | uniq"

Again, you can apply sort in conjunction with uniq to get the list of unique predicate URIs in the dataset:

$ cat drupal.nt | rdf-predicates | sort | uniq | tail -n5
<http://www.w3.org/2000/01/rdf-schema#label>
<http://www.w3.org/2004/02/skos/core#subject>
<http://xmlns.com/foaf/0.1/depiction>
<http://xmlns.com/foaf/0.1/homepage>
<http://xmlns.com/foaf/0.1/page>
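The same sort and uniq combination can also tell you how often each predicate occurs, which is a quick way to get a feel for an unfamiliar dataset. A small sketch, assuming clean input and using a hypothetical function name and made-up example.org statements:

```shell
# Hypothetical helper: count how often each predicate occurs, most frequent first.
rdf_predicate_counts() {
  awk '{ print $2 }' | sort | uniq -c | sort -rn
}

# Three made-up statements: two use p1, one uses p2.
cat <<'EOF' | rdf_predicate_counts
<http://example.org/a> <http://example.org/p1> "one" .
<http://example.org/b> <http://example.org/p2> "two" .
<http://example.org/c> <http://example.org/p1> "three" .
EOF
```

The exact column padding of `uniq -c` differs between systems, so pipe the result through awk if you need to post-process the counts.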

Obtaining the object component of N-Triples statements, however, is somewhat more complicated than getting the subject or the predicate. This is due to the fact that object literals can contain whitespace that will throw off the whitespace-separated field handling of cut and awk that we've relied on so far. Not to worry, AWK can still get us the results we want, but I won't attempt to explain how the following alias works; just be happy that it does:

$ alias rdf-objects="awk '{ ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"

The output of rdf-objects is the N-Triples encoded object URI, blank node identifier or object literal. URIs are output in the same format as subjects and predicates, with enclosing angle brackets; language-tagged literals include the language tag, and datatyped literals include the datatype URI:

$ cat drupal.nt | rdf-objects | sort | uniq | head -n5
"09"^^<http://www.w3.org/2001/XMLSchema#integer>
"16"^^<http://www.w3.org/2001/XMLSchema#integer>
"2001-01"^^<http://www.w3.org/2001/XMLSchema#gYearMonth>
"2009"^^<http://www.w3.org/2001/XMLSchema#integer>
"6.14"^^<http://www.w3.org/2001/XMLSchema#decimal>
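For the curious, the rdf-objects alias works roughly like this: setting ORS to the empty string stops print from appending a newline, the loop then prints fields 3 through NF-1 (everything after the predicate and before the final dot), each followed by a space, and a newline is printed explicitly at the end. One side effect is that runs of whitespace inside a literal are collapsed to single spaces. The following is an alternative sketch, not the alias above: it strips the subject, predicate and terminating dot with sub() and leaves the object untouched. The function name is hypothetical, and clean input is assumed.

```shell
# Hypothetical helper: print the object of each statement verbatim,
# preserving any whitespace inside literals.
rdf_objects_verbatim() {
  awk '{
    sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "")   # drop subject and predicate
    sub(/[ \t]*\.[ \t]*$/, "")                      # drop the terminating dot
    print
  }'
}

# The literal below contains two consecutive spaces, which survive intact.
printf '%s\n' '<http://ar.to/> <http://purl.org/dc/terms/title> "Arto  Bendiken" .' | rdf_objects_verbatim
```

This relies on the subject and predicate never containing whitespace, which holds for URIs and blank node identifiers in N-Triples.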

Another very useful operation to have is getting the list of object literal datatypes used in an N-Triples dataset. This is also a somewhat involved alias definition, and requires a modern AWK version such as gawk or mawk:

$ alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"

$ cat drupal.nt | rdf-datatypes | sort | uniq
<http://www.w3.org/2001/XMLSchema#decimal>
<http://www.w3.org/2001/XMLSchema#gYearMonth>
<http://www.w3.org/2001/XMLSchema#integer>

As we can see, most object literals in this dataset are untyped strings, but there are some decimal and integer values as well as year + month literals.
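To put numbers on a claim like that, a rough tally of object kinds can be had by looking at how each line ends. This is only a sketch under the assumption of clean input (no comments), with a hypothetical function name; the patterns are heuristics on the line ending, not a real N-Triples parser.

```shell
# Hypothetical helper: tally statements by the kind of object they end with.
rdf_object_kinds() {
  awk '
    /"\^\^<[^>]*>[ \t]*\.[ \t]*$/     { typed += 1; next }    # "..."^^<datatype> .
    /"@[A-Za-z0-9-]+[ \t]*\.[ \t]*$/  { tagged += 1; next }   # "..."@lang .
    /"[ \t]*\.[ \t]*$/                { plain += 1; next }    # "..." .
    /[^ \t]/                          { other += 1 }          # URI or blank node
    END { printf "typed=%d language-tagged=%d plain=%d non-literal=%d\n", typed, tagged, plain, other }'
}

# Four made-up statements, one of each kind.
cat <<'EOF' | rdf_object_kinds
<http://example.org/s> <http://example.org/p> "2009"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example.org/s> <http://example.org/p> "Drupal"@en .
<http://example.org/s> <http://example.org/p> "Drupal" .
<http://example.org/s> <http://example.org/p> <http://example.org/o> .
EOF
```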

Aliasing N-Triples

As promised, here follow more robust versions of all the aforementioned Bash aliases. Just copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with N-Triples data on the command line.

# N-Triples aliases from http://blog.datagraph.org/2010/03/grepping-ntriples
alias rdf-count="awk '/^[ \t]*[^# \t]/ { n += 1 } END { print n }'"
alias rdf-lengths="awk '/^[ \t]*[^# \t]/ { print length }'"
alias rdf-length-avg="awk '/^[ \t]*[^# \t]/ { n += 1; s += length } END { print s/n }'"
alias rdf-length-max="awk 'BEGIN { n=0 } /^[ \t]*[^# \t]/ { if (length>n) n=length } END { print n }'"
alias rdf-length-min="awk 'BEGIN { n=1e9 } /^[ \t]*[^# \t]/ { if (length>0 && length<n) n=length } END { print (n<1e9 ? n : 0) }'"
alias rdf-subjects="awk '/^[ \t]*[^# \t]/ { print \$1 }' | uniq"
alias rdf-predicates="awk '/^[ \t]*[^# \t]/ { print \$2 }' | uniq"
alias rdf-objects="awk '/^[ \t]*[^# \t]/ { ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"
alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"
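A quick way to convince yourself that a comment filter behaves is to run it against a small hand-made file. The sketch below uses a hypothetical function name, and its pattern, `^[ \t]*[^# \t]`, is one portable way of writing such a filter: it accepts a line only if its first non-blank character is something other than `#`, so blank lines and indented comments are skipped too.

```shell
# Hypothetical helper: count statements, skipping blank lines and comments
# (including comments that are indented with spaces or tabs).
rdf_count_robust() {
  awk '/^[ \t]*[^# \t]/ { n += 1 } END { print n + 0 }'
}

# A small hand-made input with a comment, a blank line and an indented comment.
cat <<'EOF' | rdf_count_robust
# Drupal-related statements
<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .

  # an indented comment
<http://example.org/s> <http://example.org/p> <http://example.org/o> .
EOF
```

The `n + 0` in the END block makes the function print 0 rather than an empty line when the input contains no statements at all.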

I should also note that though I've spoken throughout only in terms of N-Triples, most of the above aliases will work fine also for input in N-Quads format.

In the next installments of RDF for Intrepid Unix Hackers, we'll attempt something a little more ambitious: building an rdf-query alias to perform subject-predicate-object queries on N-Triples input. We'll also see what to do if your RDF data isn't already in N-Triples format, learning how to install and use the Raptor RDF Parser Library to convert RDF data between the various popular RDF serialization formats. Stay tuned.

Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.

RDFizing Drupal: Upgrading the RSS Feeds

Arto — 2009-03-24T23:00:00Z

This is the first part in a series of articles on RDFizing Drupal, showing how you can make use of the RDF module for Drupal 6.x to set your data free and connect your Drupal site to the emerging Linked Data web.

If you've been wanting to use Drupal 6.x for creating RDF-enabled websites you've probably been annoyed at the fact that Drupal outputs feeds in RSS 2.0 format, which isn't based on RDF. This article will show you, step by step, how to upgrade^[1] all of Drupal's RSS feeds into clean, extensible and RDF-compatible RSS 1.0 format.

To get started you must first, of course, install the RDF module. Any version of the module since RDF 6.x-1.0-alpha6 ought to do fine. There are no dependencies that you need to care about^[2] other than ensuring your PHP version is sufficiently recent (PHP 5.2.0 or newer). Just follow the installation instructions in the accompanying INSTALL.txt file, and then enable the module at Administer » Site building » Modules:

After enabling the module, navigate to Administer » Site configuration » RDF settings » Feeds:

On this screen you'll see a listing of all the available RSS feeds output by Drupal's core modules. These are published, in Drupal 6.x, by the Node, Taxonomy, Blog and Aggregator modules. (If you've installed the Views module, this screen will also list any RDF feeds you've enabled for your views^[3].)

To upgrade any of the core feeds to RDF format, simply use the enable action. This will present you with the following choice:

Note that you can downgrade back to the default RSS 2.0 feeds at any time, so don't be afraid to experiment. To RDFize your feed, simply select the RSS 1.0 option and save the configuration. You will be returned to the same screen with a status message indicating that the feed was upgraded:

Once you've upgraded a feed, some additional configuration options will be made available to you. (Note that you don't need to change any of these settings if you don't want to, and everything will work just as before using the defaults; feel free to skip ahead several paragraphs if you don't care to tinker with this at the moment.)

In the Channel settings section, you will find settings that implement the RSS 1.0 syndication hints specification. This is a standard that specifies advisory metadata that you can include in your feed to tell feed readers how often your feed is updated. This allows aggregators to optimize how often they'll re-fetch your feed, and hence also affords you some potential control over your bandwidth usage:

Here you can also change the RSS feed's serialization format. RDF can be represented in a wide variety of serialization formats, and the RDF module provides support for some of the most popular ones (if you install the optional ARC RDF library, you will get support for yet more formats). However, only explicitly RDF-aware feed aggregators can handle anything other than the default RDF/XML serialization, so be advised that changing this setting is probably a rather bad idea for the time being.

Below the channel settings you will find a section for configuring how feed items (that is, Drupal nodes, taxonomy terms, and such) are output in the RSS feed. At the moment, you have two additional settings: you can configure how body fields get output (using the teaser only, or including the full text), and you can configure whether date/time information in the feed includes the time zone component (if applicable, such as for Date module fields) or whether all times will be output in UTC:

Once you're done with the feed settings, save the configuration and you'll be returned to the RDF feed management screen. Notice that the Operations column indicates which feeds have been upgraded to RDF, with the enable action changing to configure where applicable:

A special note on Drupal's front page feed, rss.xml: once you've RDFized this feed, it isn't ideal that it still has the all-too-generic URL extension .xml. You can certainly keep it that way if you wish (feed aggregators parse feeds based on the MIME content type, not the file extension), but Drupal makes it so trivially easy to rename the feed's URL that I'd recommend doing so. A more appropriate extension for RSS 1.0 feeds would be .rss or .rdf. You can rename the feed URL by navigating to Administer » Site building » URL aliases » Add alias and entering something like the following:

Based on the feeds listed at Planet RDF, index.rdf would seem to be the most popular URL for a front page feed, so that's a data point to take into consideration. (I've been contrarian on this, myself, and named my blog's feed simply blog.rss, intending it to only include blog posts. I'm using the .rss extension to differentiate my RSS feeds from other RDF data that I will publish here later using the usual .rdf extension.)

Now, in a similar way as you would rename Drupal's rss.xml, you can also define URL aliases for any of the other non-wildcard feeds listed on the RDF feeds management screen:

And if you'd like to rename any of the displayed wildcard feeds, such as the taxonomy feeds at paths of the form taxonomy/term/%/0/feed, that's easy enough to do by installing the excellent Pathauto module that will automatically create such URL aliases where needed. Here on my blog, for instance, all my tags have RDFized feeds with URLs of the form http://ar.to/tags/drupal.rss.

If you're a perfectionist, consider also installing the Global Redirect module to ensure that attempts to access a non-aliased feed URL will result in an HTTP redirect to the canonical aliased URL. For example, should you try to load up http://ar.to/rss.xml, you will be redirected to http://ar.to/blog.rss which is the alias I've defined for my front page feed. Among other benefits, this makes sure that search engines won't index both URLs.

Once you've RDFized your feeds, you may want to use W3C's RDF Validation Service to double-check that everything turned out a-okay and that your feeds are indeed valid RDF. My blog feed is clearly bursting at the seams with RDFness, as validating it yields the following reassuring message:

In case you are still learning RDF, the validation service is also a great way to view the underlying triples (RDF statements) that constitute RDF documents such as your RSS feed. You can get the triples listed both in table format and rendered as a graph in a variety of graphics formats; this can really be helpful in grokking how simple RDF actually is beneath all that XML verbiage.

Well, that's all for now. Go forth and RDFize all your feeds; you know you want to. I will add a link here to the first several people who upgrade their Drupal feeds per these instructions (just leave me a note with a link to your site). And should you run into any trouble with these instructions, please post an issue at drupal.org and we'll see if we can sort it out.

In the next couple of parts of this article series, I'll be talking about how you can include additional CCK fields in your RSS feeds, and how to enable RDFa (affectionately known as "microformats on steroids") on your Drupal site. Be sure to subscribe to the aforementioned feed to get these upcoming articles!

Update: Julia Kulla-Mader (RSS) and Kaido Toomingas (RSS) are the first pioneers to brave these waters and RDFize their feeds. Anyone else?

^[1] I won't delve here into the history of the RSS 2.0 controversy, but suffice it to say that RSS 2.0 ("Really Simple Syndication") represents a downgrade from RSS 1.0 ("RDF Site Summary") in terms of capabilities and potential. You've heard of "embrace and extend", right? Well, try "co-opt and cripple" on for size. (Update: I posted some more on this at the Reddit thread and at groups.drupal.org.)

^[2] Note that for the purposes described in this article, you don't have to install the optional ARC RDF library; the RDF module includes native support for RDF/XML output using PHP's XMLWriter extension. This extension is available by default since PHP 5.1.2, though FreeBSD users may need to explicitly install the php5-xmlwriter package.

^[3] Developers: see hook_rdf_feeds() in rdf.module for an example on how you can declare RDF-compatible feeds that will be listed on this screen.

The Universal Timeline Aggregator

Arto — 2006-12-29T17:37:02Z

For those who haven’t yet come across it somewhere on the web, may I recommend checking out the Timeline widget developed by David F. Huynh of the MIT Simile project. It’s a snazzy DHTML/JavaScript tool for visualizing chronological events on a scrollable, graphical timeline — sort of a Google Maps for temporal information.

I’ve been working quite a bit with the Simile widget lately, co-developing (with David Donohue) a module that integrates the widget into Drupal, allowing Drupal sites to display any CCK / Views content as graphical timelines.

Today, inspired by Alexandre Passant’s RSS2Timeline implementation, I sat down to code up a generic web service that can take any Atom or RSS feed and convert it into a JSON-based event source for the Timeline widget. My goal was to make it absolutely trivial to embed live Atom/RSS timelines into blogs and whatnot, so that anyone with basic HTML skills could use timelines without having to go through the relatively complex technical setup the widget requires.

I hereby present the Universal Timeline Aggregator, available at http://timeline.to/ (I’ve been snapping up Tonga’s dot-to domains since getting the ultimate vanity domain, ar.to, as a Christmas present; the “to” preposition works rather nicely for the present purpose, too.)

Here’s a screen capture (and live example, if you click on it) of the sort of timeline display you can create in a minute or two using the timeline.to service:

Timeline view of recent Ruby on Rails development (http://dev.rubyonrails.org/timeline).

Embedding a live, interactive Atom or RSS timeline into any site is now as easy as copying and pasting the following HTML snippet, with the appropriate modifications:

<iframe src="http://timeline.to/http://www.mysite.com/rss.xml"
  width="500" height="400"
  scrolling="no" frameborder="1"
  marginwidth="0" marginheight="0"></iframe>

Just replace http://www.mysite.com/rss.xml with the real URL of an Atom or RSS feed, and modify the width and height as you like. Autodiscovery of feeds is supported to a reasonable extent, so in most cases you won’t even need the exact URL of the feed; the website’s own URL will do.

Here’s another example as a screen capture, this time showing the popup box that opens when a timeline event is clicked:

Timeline view of Planet Scheme, showing a preview of a blog entry.

Starting out on the timeline.to implementation today, I had to actually pause a moment to contemplate which technology to use: Python, Ruby or PHP — it wasn’t quite as clear-cut a decision as usual.

Case in point, while I haven’t done much Python coding recently (since defecting to the Ruby camp), the language does have some excellent libraries and frameworks going for it — including arguably the best Atom/RSS parser in existence, the Universal Feed Parser written by Mark Pilgrim. Considering the staggering number of malformed and invalid feeds out there, a good parser is essential (I won’t get further, right this moment, into the delicious irony of having to parse XML formats using regular expressions).

On the other hand, Ruby also has (at least) two relatively comprehensive and decent libraries for feed parsing, FeedTools and Syndication. Unfortunately, in my experience neither library is quite up there with the Universal Feed Parser yet, and neither seems particularly active recently.

In the end, underdog PHP won out on this project on purely practical points: since my Timeline module for Drupal is written in PHP, it makes sense to try and reuse code both ways between the timeline.to service and Drupal.

Investigating the current best way to parse both Atom and RSS feeds with PHP, I learned of a new feed parser library for PHP called SimplePie, which has been gaining a lot of momentum lately (indeed, it seems to be on track for eventually surpassing MagpieRSS as the de facto RSS parser for PHP). The SimplePie developers are apparently in the process of porting the 3000+ unit tests from Pilgrim’s parser, which certainly seems a promising prospect for creating a truly robust parser.

SimplePie is also bundled with the Feedparser Drupal module, so again, it all just makes sense. The library turned out to be quite painless to work with, and has, so far, been able to parse all the feeds I’ve thrown at it. (I did have to disable SimplePie’s ad-removal feature, as that was eating up the entry descriptions on some Atom feeds.)

Feedback on the timeline.to service is welcome. If there’s sufficient interest, I will consider adding further functionality such as iCalendar support, and perhaps mashup features allowing multiple feeds and data sources to be combined into a single timeline display.