The Datagraph Blog

What's New in RDF.rb 0.3.0

Arto Bendiken — 2010-12-30T13:00:00Z

It has now been nine months since the initial public release of RDF.rb, our RDF library for Ruby, and today we're happy to announce the release of RDF.rb 0.3.0, a significant milestone.

As the changelog attests, this has been a long release cycle that incorporates 170 commits by 6 different authors. The major new features include transactions and basic graph pattern (BGP) queries, as well as the availability of robust and fast parser/serializer plugins for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing the formats that were already supported. In addition, many bugs have been fixed and many general improvements, including significant performance improvements, have been implemented.

RDF.rb 0.3.0 is immediately available via RubyGems, and can be installed or upgraded to as follows on any Unix box with Ruby and RubyGems:

$ [sudo] gem install rdf

In all the code examples that follow, we will assume that RDF.rb and the built-in N-Triples parser have already been loaded like so:

require 'rdf'
require 'rdf/ntriples'

# Enable facile references to standard vocabularies:
include RDF

RDFa, N3, Turtle, and RDF/XML Support

While RDF.rb 0.3.0 continues with our minimalist policy of only supporting the N-Triples serialization format in the core library itself, support for every widely-used RDF serialization format is now available in the form of plugins.

Thanks to the hard work of Gregg Kellogg, the author of RdfContext, there are now RDF.rb 0.3.0-compatible plugins available for the RDFa, Notation3, Turtle, and RDF/XML formats, complementing the previously available plugins for the RDF/JSON and TriX formats. See Gregg's blog post for more details on the particulars of these plugins.

We are also pleased to announce that Gregg has joined the RDF.rb core development team, which now consists of him, Ben Lavender, and myself. This merger between the RDF.rb and RdfContext efforts is a perfect match, given that Ben and I have been focused more on storing and querying RDF data while Gregg has been busy single-handedly solving all RDF serialization questions.

To facilitate typical Linked Data use cases, we now also provide a metadistribution of RDF.rb that includes a full set of parsing/serialization plugins; the following will install all of the rdf, rdf-isomorphic, rdf-json, rdf-n3, rdf-rdfa, rdf-rdfxml, and rdf-trix gems in one go:

$ [sudo] gem install linkeddata

Similarly, instead of loading up support for each RDF serialization format one at a time, you can simply use the following to load them all; this is helpful e.g. for the automatic selection of an appropriate parser plugin given a particular file name or extension:

require 'linkeddata'

For a tutorial introduction to RDF.rb's reader and writer APIs, please refer to my previous blog post Parsing and Serializing RDF Data with Ruby.

Query API: Basic Graph Patterns (BGPs)

The query API in RDF.rb 0.3.0 now includes basic graph pattern (BGP) support, which has been a much-requested feature. BGP queries will already be a familiar concept to anyone using SPARQL, and in RDF.rb they are constructed and executed like this:

# Load some RDF.rb project information into an in-memory graph:
graph = RDF::Graph.load("http://rdf.rubyforge.org/doap.nt")

# Construct a BGP query for obtaining developers' names and e-mails:
query = RDF::Query.new({
  :person => {
    RDF.type  => FOAF.Person,
    FOAF.name => :name,
    FOAF.mbox => :email,
  }
})

# Execute the query on our in-memory graph, printing out solutions:
query.execute(graph).each do |solution|
  puts "name=#{solution.name} email=#{solution.email}"
end
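For readers who have not met BGPs before, the matching idea itself can be sketched in a few lines of plain Ruby. This is only a toy illustration under stated assumptions, not how RDF.rb evaluates queries: triples are plain arrays of strings, symbols stand for variables, and the unify and match_bgp helpers as well as the sample data are invented for this example.

```ruby
# Toy basic graph pattern matcher (illustration only, not RDF.rb's implementation).
# Triples are [subject, predicate, object] arrays; Symbols in a pattern are variables.
triples = [
  ["ex:arto", "rdf:type",  "foaf:Person"],
  ["ex:arto", "foaf:name", "Arto"],
  ["ex:ben",  "rdf:type",  "foaf:Person"],
  ["ex:ben",  "foaf:name", "Ben"],
  ["ex:rdf",  "rdf:type",  "doap:Project"],
]

# Try to extend the bindings so that the pattern matches the triple.
# Returns the extended bindings hash, or nil if there is a conflict.
def unify(pattern, triple, bindings)
  result = bindings.dup
  pattern.zip(triple).each do |term, value|
    if term.is_a?(Symbol)
      return nil if result.key?(term) && result[term] != value
      result[term] = value
    elsif term != value
      return nil
    end
  end
  result
end

# Match every pattern in turn, carrying partial solutions along.
def match_bgp(patterns, triples)
  patterns.reduce([{}]) do |solutions, pattern|
    solutions.flat_map do |bindings|
      triples.map { |triple| unify(pattern, triple, bindings) }.compact
    end
  end
end

solutions = match_bgp([
  [:person, "rdf:type",  "foaf:Person"],
  [:person, "foaf:name", :name],
], triples)

solutions.each { |s| puts "person=#{s[:person]} name=#{s[:name]}" }
```

Each pattern narrows or extends the set of partial solutions, which is why a variable such as :person that is shared between patterns acts as a join.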

Executing a BGP query returns a solution sequence, encapsulated as an instance of the RDF::Query::Solutions class. Solution sequences provide a number of convenient methods for further narrowing down the returned solutions to what you're actually looking for:

# Filter solutions using a hash:
solutions.filter(:author  => RDF::URI("http://ar.to/#self"))
solutions.filter(:author  => "Arto Bendiken")
solutions.filter(:updated => RDF::Literal(Date.today))

# Filter solutions using a block:
solutions.filter { |solution| solution.author.literal? }
solutions.filter { |solution| solution.title =~ /^SPARQL/ }
solutions.filter { |solution| solution.price < 30.5 }
solutions.filter { |solution| solution.bound?(:date) }
solutions.filter { |solution| solution.age.datatype == XSD.integer }
solutions.filter { |solution| solution.name.language == :es }

# Reorder solutions based on a variable:
solutions.order_by(:updated)
solutions.order_by(:updated, :created)

# Select particular variables only:
solutions.select(:title)
solutions.select(:title, :description)

# Eliminate duplicate solutions:
solutions.distinct

# Limit the number of solutions:
solutions.offset(20).limit(10)

# Count the number of matching solutions:
solutions.count
solutions.count { |solution| solution.price < 30.5 }

BGP-capable storage adapters should override and implement the following RDF::Queryable method in order to provide storage-specific optimizations for BGP query evaluation:

class MyRepository < RDF::Repository
  def query_execute(query, &block)
    # ...
  end
end

Repository API: Transactions

The repository API in RDF.rb 0.3.0 now includes basic transaction support:

# Load some RDF.rb project information into an in-memory repository:
repository = RDF::Repository.load("http://rdf.rubyforge.org/doap.nt")

# Delete one statement and insert another, atomically:
repository.transaction do |tx|
  subject = RDF::URI('http://rubygems.org/gems/rdf')

  tx.delete [subject, DOAP.name, nil]
  tx.insert [subject, DOAP.name, "RDF.rb 0.3.0"]
end

As you would expect, if the transaction block raises an exception, the current transaction will be aborted and rolled back; otherwise, the transaction is automatically committed when the block returns.

Transaction-capable storage adapters should override and implement the following three RDF::Repository methods:

class MyRepository < RDF::Repository
  def begin_transaction(context)
    # ...
  end

  def rollback_transaction(tx)
    # ...
  end

  def commit_transaction(tx)
    # ...
  end
end

The RDF::Transaction objects passed to these methods consist of a sequence of RDF statements to delete from, and a sequence of RDF statements to insert into, a given graph. The default transaction implementation in RDF::Repository simply builds up a transaction object in memory, buffering all inserts/deletes until the transaction is committed, at which point the operations are then executed against the repository.
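The buffering behavior described above can be pictured with a small plain-Ruby model. This is a rough sketch under simplifying assumptions, not RDF.rb's code: ToyTransaction and ToyRepository are invented names, statements are plain arrays, and deletes match exact statements only (no wildcard patterns).

```ruby
require 'set'

# Toy model of a buffered transaction (illustration only; the class and
# method names here are invented and are not RDF.rb's API).
class ToyTransaction
  attr_reader :deletes, :inserts

  def initialize
    @deletes = []
    @inserts = []
  end

  def delete(statement)
    @deletes << statement
  end

  def insert(statement)
    @inserts << statement
  end
end

class ToyRepository
  attr_reader :statements

  def initialize(statements = [])
    @statements = Set.new(statements)
  end

  # Buffer all changes in memory; apply them only if the block returns
  # normally. If the block raises, the buffer is simply discarded.
  def transaction
    tx = ToyTransaction.new
    yield tx
    tx.deletes.each { |s| @statements.delete(s) }
    tx.inserts.each { |s| @statements.add(s) }
    self
  end
end

repo = ToyRepository.new([["ex:rdf", "doap:name", "RDF.rb 0.2.0"]])

repo.transaction do |tx|
  tx.delete ["ex:rdf", "doap:name", "RDF.rb 0.2.0"]
  tx.insert ["ex:rdf", "doap:name", "RDF.rb 0.3.0"]
end

begin
  repo.transaction do |tx|
    tx.insert ["ex:rdf", "doap:name", "never stored"]
    raise "something went wrong"
  end
rescue RuntimeError
  # The failed transaction left the repository untouched.
end

puts repo.statements.to_a.inspect
```

Note that the toy applies the buffered operations one by one on commit, so it gives rollback-on-exception semantics for the block but no atomicity guarantee for the commit step itself.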

Note that whether transactions are actually executed atomically depends on the particulars of the storage adapter you're using. For instance, the RDF::DataObjects plugin, which provides a storage adapter supporting SQLite, PostgreSQL, MySQL, and other RDBMS solutions, will certainly be able to offer ACID transaction support (although it has not yet been updated for that, or for other 0.3.x features).

On the other hand, not all NoSQL solutions support transactions, so storage adapters for such solutions may choose to omit explicit transaction support and have it supplied by RDF.rb's default implementation.

Performance & Scalability Improvements

In earlier RDF.rb releases, our focus was strongly centered on defining the core APIs that have enabled the thriving plugin ecosystem we can witness today. The focus was not so much, therefore, on the performance of the bundled default implementations of those APIs; in some cases, these implementations could have been described as being of only proof-of-concept quality.

In particular, the in-memory graph and repository implementations were suboptimal in RDF.rb 0.1.x, and only somewhat improved in 0.2.x. However, reflecting the increasing production-readiness of RDF.rb in general, matters have been much improved in RDF.rb 0.3.0.

Of course, performance improvements are an open-ended task, and I'm sure we'll see more work on this front in the future as need arises and time permits. But it's likely that RDF.rb 0.3.0 now offers a sufficient out-of-the-box performance level for many if not most common use cases.

Scalability has also been addressed by making use of enumerators throughout the APIs defined by RDF.rb. That means that all operations are generally performed in a streaming fashion, enabling you to build pipelines for hundreds of millions of RDF statements to flow through while still maintaining constant memory usage by ensuring that the statements are processed one by one.
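The general streaming idea can be tried out with nothing but Ruby's own enumerators. The following is a sketch that does not use RDF.rb at all: the statement source is a made-up, endless generator of placeholder triples, and the pipeline only ever pulls as many of them as the final step asks for.

```ruby
# Streaming sketch in plain Ruby (no RDF.rb required): statements are
# produced one at a time and flow through a lazy pipeline, so the full
# data set is never materialized in memory.
statements = Enumerator.new do |yielder|
  1.upto(Float::INFINITY) do |i|
    yielder << ["ex:item#{i}", "ex:price", i]
  end
end

# Keep only statements with an even price, then project out the subject.
even_priced = statements.lazy
  .select { |statement| statement[2].even? }
  .map    { |statement| statement[0] }

puts even_priced.first(3).inspect
```

Because the source never ends, an eager select here would never return; the lazy pipeline stops as soon as three results have been produced.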

RSpec 2.x Compatibility

Lastly, RDF.rb 0.3.0 has been upgraded to use and depend on RSpec 2.x instead of the previous 1.3.x branch. This requires minor changes to the spec/spec_helper.rb file in any project that relies on the RDF::Spec library. The most minimal spec_helper.rb contents are now as follows:

require 'rdf/spec'

RSpec.configure do |config|
  config.include RDF::Spec::Matchers
end

Kudos to Our Contributors

In tandem with RDF.rb approaching 10,000 downloads on RubyGems.org, a very positive sign of all the interest and ongoing work around RDF.rb is our growing contributor list. We thank everyone who has sent in bug reports, and in particular the following people who have contributed patches to RDF.rb and/or an RDF.rb plugin; in alphabetical order:

Călin Ardelean, Christoph Badura, John Fieber, Joey Geiger, James Hetherington, Gabriel Horner, Nicholas Humfrey, Fumihiro Kato, David Nielsen, Thamaraiselvan Poomalai, Keita Urashima, Pius Uzamere, and Hellekin O. Wolf.

(My apologies if I have inadvertently omitted anyone from the list above; please let me know if so.)

Looking Forward to Hearing From You

As always, if you have feedback regarding RDF.rb please contact us either privately or via the public-rdf-ruby@w3.org mailing list. Plain and simple bug reports, however, should preferably go directly to the issue queue on GitHub.

Be sure to follow @datagraph, @bendiken, @bhuga, and @gkellogg on Twitter for the latest updates on RDF.rb as they happen.

Spira: A Linked Data ORM for Ruby

Ben Lavender — 2010-05-21T01:00:00Z

I've just released Spira, a first draft of an RDF ORM, where the 'R' can mean RDF or Resource at your pleasure. It's an easy way to create Ruby objects out of RDF data. The name is from Latin, for 'breath of life'--it's time to give those resource URIs some character. It looks like this (feel free to copy-paste):

require 'spira'
require 'rdf/ntriples'

repo = "http://datagraph.org/jhacker/foaf.nt"
Spira.add_repository(:default, RDF::Repository.load(repo))

class Person
  include Spira::Resource

  property :name,  :predicate => FOAF.name
  property :nick,  :predicate => FOAF.nick
end

jhacker = RDF::URI("http://datagraph.org/jhacker/#self").as(Person)
jhacker.name #=> "J. Random Hacker"
jhacker.nick #=> "jhacker"
jhacker.name = "Some Other Hacker"
jhacker.save!

Why a new project?

I try not to start new projects lightly. There's plenty of good stuff out there. But there wasn't quite what I wanted.

First of all, I want to program in Ruby, so it needed to be Ruby. Spira, while different, has a lot of overlap with a traditional ORM, and I was on the fence for a while about starting Spira or trying to implement things in DataMapper. There's already an RDF.rb backend for DataMapper, which is cool, but using it really cuts you off from RDF as RDF. It's more about making RDF work how DataMapper likes it. DataMapper's storage adapter interface is an implicit data model, one that is not RDF's, and it is not quite what I wanted.

On the RDF-specific front, there's ActiveRDF. ActiveRDF is based on SPARQL directly, and thus, while not hiding RDF from you, only gives you access via Redland. The Redland Ruby bindings have problems, and do not represent the entire RDF ecosystem. I wanted to start on something that completely abstracted away the data model, so I could focus on the problem at hand, which means RDF.rb. The difference is in allowing me to focus on what I'm focusing on: there exists a perfectly good, working SPARQL client storage adapter for RDF.rb, but it's one of many pluggable backends instead of a requirement.

Lastly, while both of those projects would represent a workable starting point, this was something of a journey of exploration in terms of semantics. Spira was going to be ~~'open world'~~ 'open model' from the start; I specifically wanted something that could read foreign data. By 'open model' I mean that Spira does not expect that a class definition is the authoritative, exclusive, or complete definition of a model class. That turns out to make Spira have some important semantic differences from ORMs oriented around object or relational databases. Stumbling on them was part of the fun, and even if I could have twisted DataMapper around the problem, I'm not sure that starting from there would have had me focusing on the core semantics.

So I decided to start something new. To be fair, Spira would suck a lot more were it not for the projects that came before it. In particular, it owes an intellectual debt to DataMapper, which has a generally sane model, readable code, and had to cover a lot of ground that any object-whatever-mapper would. It takes some digging, but as an example, one can find IRC logs where the DataMapper team discusses the ups and downs of identity map implementations in Ruby. That stuff is amazing to have available without spending hundreds of hours fighting it yourself, and again, it saves me a lot of trial and error on ancillary considerations.

Making things simple

Spira's core use case is allowing programmers to create Ruby objects representing an aspect of an RDF resource. I'm still working on which terminology I like best, but I am leaning towards calling instances of Spira classes 'a projection of a given RDF resource as a Spira resource.' In the simplest of terms, Spira tries to let you create classes that easily get and set values for properties that correspond to RDF predicates. The README will explain it better than I want to in this post (now available in Github and Yardoc flavors).

The hopeful end result is a way to access the RDF data model in a way that agile web programmers have come to expect, without forcing them to get bogged down into a world of ontologies, rule languages, inference levels, and lord knows what-all else. RDF has taken off in the enterprise because of power user features, and we're approaching a critical mass of RDFa publishing, but it's not yet on anyone's radar as a data model for their next weekend project. I think that's a shame--RDF's schema-free model should be the easiest thing in the world to get started on. So in addition to hopefully being an open-model ORM, here's hoping Spira is a step in the adoption of RDF as a day-to-day data model.

So what's 'Open Model' mean?

Any useful abstraction layer is about applying constraints. Normal ORMs hide the power of relational databases to make them into proper object databases. Spira constrains you to a particular aspect of a resource. That means that in the aspect of 'Person', a resource's name is a given predicate, and they only have one. A person might also have a label, multiple names, a comment, function as a category or tag, have friends, have accounts, have tons of other stuff, but if all you want is their name and age, you just want to say person.name and person.age. The goal here is to let you use data (or at least, to have defined behavior for data) that you cannot say for sure meets any sort of criteria you set in Spira. Spira will have defined behavior for when data does not match a model class, and will still let you use that data easily, pretending it came from a closed system. That's good enough surprisingly often.

That open-model part is where tough semantics come in. As an example, I had intended to publish, with Spira, a reference implementation of SIOC. The SIOC core classes are in widespread use, so surely this would find some use, I figured. But it's not so simple to make a reference implementation unless you limit your possibilities. For example, a SIOC post can have topics (a sub-class of dcterms:subject). These topics are RDF resources which may be one (or, I suppose, both, or neither) of two classes defined in the SIOC types ontology, Category or Tag. These two classes have completely different semantics. Now, a Spira class could be created to deal with either of them, but to use that class usefully, you'd always be checking what it is, since the semantics are different. Spira will eventually have helpers to help you decide what to do here, but the point is that in RDF, a 'reference implementation' often doesn't make sense as a concept. However, this is at least in principle representable in Spira--I'm not sure it could be done in a traditional ORM, as it doesn't really match the single-table inheritance model.

Instead, I hope Spira classes are simple enough--throw away, even--that you can define them when you need them. Indeed, defining them programmatically is obvious with the framework in place, I just haven't done it yet.

Another example of differing semantics would be instance creation. An RDF resource does not 'exist or not'. It's either the subject of triples or not. So what would it mean to create an instance of a Spira resource and save it when it had no fields? Would one save a triple declaring the resource to be an RDF resource? How about saving the RDF type, should that happen if one has not saved fields? There are good arguments for several options. It's just not the same model as the 'find, create, find_or_create' trio of constructors that the world has grown used to, since the identifiers are global and always exist. Primary keys do not come into existence to allow reference to an object, the key is the object. I dodged the question and now do construction based on RDF::URIs.

Instantiation looks either like this:

RDF::URI("http://example.org/bob").as(Person)

or like this:

Person.for(RDF::URI("http://example.org/bob"))

There's no finding or creating. Resources just are. Creating a Spira object is creating the projection of that resource as a class. If you've told Spira about a repository where some information about that resource may or may not exist, great, but it's not required.
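To make the 'resources just are' idea a bit more tangible, here is a toy projection in plain Ruby. It is not how Spira is implemented: ToyPerson, its PROPERTIES table, and the array-based triples are all invented for this sketch. The point is only that building the projection never looks anything up or creates anything; unknown resources simply have no values.

```ruby
# Toy projection of a resource onto a class (illustration only; this is not
# Spira's implementation, and ToyPerson is an invented name).
class ToyPerson
  PROPERTIES = { name: "foaf:name", nick: "foaf:nick" }

  attr_reader :subject

  # Nothing is looked up or created here: the projection exists for any URI.
  def self.for(subject, triples)
    new(subject, triples)
  end

  def initialize(subject, triples)
    @subject = subject
    @triples = triples
  end

  # Define one reader per property; each returns the first matching
  # object value, or nil when the data says nothing about it.
  PROPERTIES.each do |property, predicate|
    define_method(property) do
      match = @triples.find { |s, p, _o| s == @subject && p == predicate }
      match && match[2]
    end
  end
end

triples = [
  ["http://example.org/bob", "foaf:name", "Bob"],
  ["http://example.org/bob", "foaf:mbox", "mailto:bob@example.org"],
]

bob      = ToyPerson.for("http://example.org/bob", triples)
stranger = ToyPerson.for("http://example.org/nobody", triples)

puts bob.name.inspect
puts bob.nick.inspect
puts stranger.name.inspect
```

The foaf:mbox triple is ignored by the projection without complaint, which is the 'open model' attitude in miniature: the class describes one aspect of the resource, not all of it.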

As another example, I see a lot of need for validations on creating an instance, not just saving one, as in traditional ORMs. RDF is not like the data fed to a traditional ORM, which is generally created by that ORM or by a known list of applications, managed by a set of hard constraints and schema. RDF data is often found, and used, in the wild.

There's still a ton left to do, but lots of stuff already works. The README has a good rundown of where things stand. I'd enumerate the to-do list, but I'd rather not feed that to Google, and it's long enough anyway that if certain deficiencies quickly become obvious, I'd attack them first.

Anyways, hope someone has fun with it. gem install spira are the magic words. If you want to spoil the magic, the code is on Github.

The original version of this post used the term 'Open World' instead of 'Open Model' willy-nilly throughout, but I was corrected from using the term outside its strict meaning in terms of inference. See the comments. If a term exists for what I'm describing at this level of abstraction, I'm all ears.

How RDF Databases Differ from Other NoSQL Solutions

Arto Bendiken — 2010-04-22T20:00:00Z

This started out as an answer at Semantic Overflow on how RDF database systems differ from other currently available NoSQL solutions. I've here expanded the answer somewhat and added some general-audience context.

RDF database systems are the only standardized NoSQL solutions available at the moment, being built on a simple, uniform data model and a powerful, declarative query language. These systems offer data portability and toolchain interoperability among the dozens of competing implementations that are available at present, avoiding any need to bet the farm on a particular product or vendor.

In case you're not familiar with the term, NoSQL ("Not only SQL") is a loosely-defined umbrella moniker for describing the new generation of non-relational database systems that have sprung up in the last several years. These systems tend to be inherently distributed, schema-less, and horizontally scalable. Present-day NoSQL solutions can be broadly categorized into four groups:

Key-value databases are familiar to anyone who has worked with the likes of the venerable Berkeley DB. These systems are about as simple as databases get, being in essence variations on the theme of a persistent hash table. Current examples include MemcacheDB, Tokyo Cabinet, Redis and SimpleDB.
Document databases are key-value stores that treat stored values as semi-structured data instead of as opaque blobs. Prominent examples at the moment include CouchDB, MongoDB and Riak.
Wide-column databases tend to draw inspiration from Google's BigTable model. Open-source examples include Cassandra, HBase and Hypertable.
Graph databases include generic solutions like Neo4j, InfoGrid and HyperGraphDB as well as all the numerous RDF-centric solutions out there: AllegroGraph, 4store, Virtuoso, and many, many others.

RDF database systems form the largest subset of this last NoSQL category. RDF data can be thought of in terms of a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values. Other equally valid ways to understand RDF data include the resource-centric approach (which maps well to object-oriented programming paradigms and to RESTful architectures) and the statement-centric view (the entity-attribute-value or EAV model).
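These views are easy to see side by side in code. The following plain-Ruby sketch uses a handful of made-up triples, with abbreviated placeholder strings such as "ex:alice" standing in for real URIs, and derives the graph view and the resource-centric view from the flat statement list.

```ruby
# Three views of the same toy data, in plain Ruby. The "ex:" and "foaf:"
# names are placeholder strings, not real resolved URIs.
triples = [
  ["ex:alice", "foaf:name",  "Alice"],
  ["ex:alice", "foaf:knows", "ex:bob"],
  ["ex:bob",   "foaf:name",  "Bob"],
]

# Statement-centric view: a flat list of subject/predicate/object rows.
triples.each { |s, p, o| puts "#{s} #{p} #{o}" }

# Graph view: each subject node maps to its labeled outgoing arcs.
arcs = Hash.new { |hash, node| hash[node] = [] }
triples.each { |s, p, o| arcs[s] << [p, o] }

# Resource-centric view: each resource becomes a property => values record.
resources = Hash.new do |hash, subject|
  hash[subject] = Hash.new { |props, predicate| props[predicate] = [] }
end
triples.each { |s, p, o| resources[s][p] << o }

puts arcs["ex:alice"].inspect
puts resources["ex:alice"]["foaf:name"].inspect
```

Nothing is lost or added in either transformation; the three structures are just different indexes over the same statements.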

Without just now extolling too much the virtues of RDF as a particular data model, the key differentiator here is that RDF database systems embrace and build upon W3C's Linked Data technology stack and are the only standardized NoSQL solutions available at the moment. This means that RDF-based solutions, when compared to run-of-the-mill NoSQL database systems, have benefits such as the following:

A simple and uniform standard data model. NoSQL databases typically have one-off, ad-hoc data models and capabilities designed specifically for each implementation in question. As a rule, these data models are neither interoperable nor standardized. Take e.g. Cassandra, which has a somewhat baroque data model that "can most easily be thought of as a four or five dimensional hash" and the specifics of which are described in a wiki page, blog posts here and there, and ultimately only nailed down in version-specific API documentation and the code base itself. Compare to RDF database systems that all share the same well-specified and W3C-standardized data model at their base.
A powerful standard query language. NoSQL databases typically do not provide any high-level declarative query language equivalent of SQL. Querying these databases is a programmatic, data-model-specific, language-specific and even application-specific affair. Where query languages do exist, they are entirely implementation-specific (think SimpleDB or GQL). SPARQL is a very big win for RDF databases here, providing a standardized and interoperable query language that even non-programmers can make use of, and one which meets or exceeds SQL in its capabilities and power while retaining much of the familiar syntax.
Standardized data interchange formats. RDBMSes have (somewhat implementation-specific) SQL dumps, and some NoSQL databases have import/export capability from/to implementation-specific structures expressed in an XML or JSON format. RDF databases, by contrast, all have import/export capability based on well-defined, standardized, entirely implementation-agnostic serialization formats such as N-Triples and N-Quads.

From the preceding points it follows that RDF-based NoSQL solutions enjoy some very concrete advantages such as:

Data portability. Should you need to switch between competing database systems in-house, to make use of multiple different solutions concurrently, or to share data with external parties, your data travels with you without needing to write and utilize any custom glue code for converting some ad-hoc export format and data structure into some other incompatible ad-hoc import format and data structure.
Toolchain interoperability. The RDBMS world has its various database abstraction layers, but the very concept is nonsensical for NoSQL solutions in general (see "ad-hoc data model"). RDF solutions, however, represent a special case: libraries and toolchains for RDF are typically only loosely coupled to any particular DBMS implementation. Learn to use and program with Jena or Sesame for Java and Scala, RDFLib for Python, or RDF.rb for Ruby, and it generally doesn't matter which particular RDF-based system you are accessing. Just as with RDBMS-based database abstraction layers, your RDF-based code does not need to change merely because you wish to do the equivalent of switching from MySQL to PostgreSQL.
No vendor or product lock-in. If the RDF database solution A was easy to get going with but eventually for some reason hits a brick wall, just switch to RDF database solution B or C or any other of the many available interoperable solutions. Unlike switching between two non-RDF solutions, this does not have to be a big deal. Needless to say there are also ecosystem benefits with regards to the available talent pool and the commercial support options.
Future proof. With RDF now emerging as the definitive standard for publishing Linked Data on the web, and being entirely built on top of indelibly-established lower-level standards like URIs, it's not an unreasonable bet that your RDF data will still be usable as-is by, say, 2038. It's not at all evident, however, that the same could be asserted for any of the other NoSQL solutions out there at the moment, many of which will inevitably prove to be rather short-lived in the big picture.

RDF-based systems also offer unique advantages such as support for globally-addressable row identifiers and property names, web-wide decentralized and dynamic schemas, data modeling standards and tooling for creating and publishing such schemas, metastandards for being able to declaratively specify that one piece of information entails another, and inference engines that implement such data transformation rules.

All these features are mainly due to the characteristics and capabilities of RDF's data model, though, and have already been amply described elsewhere, so I won't go further into them just here and now. If you wish to learn more about RDF in general, a great place to start would be the excellent RDF in Depth tutorial by Joshua Tauberer.

And should you be interested in the growing intersection between the NoSQL and Linked Data communities, you will be certain to enjoy the recording of Sandro Hawke's presentation Toward Standards for NoSQL (slides, blog post) at the NoSQL Live in Boston conference in March 2010.

Parsing and Serializing RDF Data with Ruby

Arto Bendiken — 2010-04-21T16:00:00Z

In this tutorial we'll learn how to parse and serialize RDF data using the RDF.rb library for Ruby. There exist a number of Linked Data serialization formats based on RDF, and you can use most of them with RDF.rb.

To follow along and try out the code examples in this tutorial, you need only a computer with Ruby and RubyGems installed. Any recent Ruby 1.8.x or 1.9.x version will do fine, as will JRuby 1.4.0 or newer.

Supported RDF formats

These are the RDF serialization formats that you can parse and serialize with RDF.rb at present:

Format      | Implementation        | RubyGems gem
------------|-----------------------|-------------
N-Triples   | RDF::NTriples         | rdf
Turtle      | RDF::Raptor::Turtle   | rdf-raptor
RDF/XML     | RDF::Raptor::RDFXML   | rdf-raptor
RDFa        | RDF::Raptor::RDFa     | rdf-raptor
RDF/JSON    | RDF::JSON             | rdf-json
TriX        | RDF::TriX             | rdf-trix

RDF.rb in and of itself is a relatively lightweight gem that includes built-in support only for the N-Triples format. Support for the other listed formats is available through add-on plugins such as RDF::Raptor, RDF::JSON and RDF::TriX, each one packaged as a separate gem. This approach keeps the core library fleet on its metaphorical feet and avoids introducing any XML or JSON parser dependencies for RDF.rb itself.

Installing support for all these formats in one go is easy enough:

$ sudo gem install rdf rdf-raptor rdf-json rdf-trix
Successfully installed rdf-0.1.9
Successfully installed rdf-raptor-0.2.1
Successfully installed rdf-json-0.1.0
Successfully installed rdf-trix-0.0.3
4 gems installed

Note that the RDF::Raptor gem requires that the Raptor RDF Parser library and command-line tools be available on the system where it is used. Here follow quick and easy Raptor installation instructions for the Mac and the most common Linux and BSD distributions:

$ sudo port install raptor             # Mac OS X with MacPorts
$ sudo fink install raptor-bin         # Mac OS X with Fink
$ sudo aptitude install raptor-utils   # Ubuntu / Debian
$ sudo yum install raptor              # Fedora / CentOS / RHEL
$ sudo zypper install raptor           # openSUSE
$ sudo emerge raptor                   # Gentoo Linux
$ sudo pkg_add -r raptor               # FreeBSD
$ sudo pkg_add raptor                  # OpenBSD / NetBSD

For more information on installing and using Raptor, see our previous tutorial RDF for Intrepid Unix Hackers: Transmuting N-Triples.

Consuming RDF data

If you're in a hurry and just want to get to consuming RDF data right away, the following is really the only thing you need to know:

require 'rdf'
require 'rdf/ntriples'

graph = RDF::Graph.load("http://datagraph.org/jhacker/foaf.nt")

In this example, we first load up RDF.rb as well as support for the N-Triples format. After that, we use a convenience method on the RDF::Graph class to fetch and parse RDF data directly from a web URL in one go. (The load method can take either a file name or a URL.)

All RDF.rb parser plugins declare which MIME content types and file extensions they are capable of handling, which is why in the above example RDF.rb knows how to instantiate an N-Triples parser to read the foaf.nt file at the given URL.

In the same way, RDF.rb will auto-detect any other RDF file formats as long as you've loaded up support for them using one or more of the following:

require 'rdf/ntriples' # Support for N-Triples (.nt)
require 'rdf/raptor'   # Support for RDF/XML (.rdf) and Turtle (.ttl)
require 'rdf/json'     # Support for RDF/JSON (.json)
require 'rdf/trix'     # Support for TriX (.xml)

Note that if you need to read RDF files containing multiple named graphs (in a serialization format that supports named graphs, such as TriX), you probably want to be using RDF::Repository instead of RDF::Graph:

repository = RDF::Repository.load("http://datagraph.org/jhacker/foaf.nt")

The difference between the two is that RDF statements in RDF::Repository instances can contain an optional context (i.e. they can be quads), whereas statements in an RDF::Graph instance always have the same context (i.e. they are triples). In other words, repositories contain one or more graphs, which you can access as follows:

repository.each_graph do |graph|
  puts graph.inspect
end
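If the triples-versus-quads distinction is new to you, this plain-Ruby sketch may help. It is a toy model only, not RDF.rb's internal representation: a quad is represented as a four-element array whose last element is the context, with nil standing for the default graph, and all names are made-up placeholders.

```ruby
# Toy illustration of triples versus quads in plain Ruby (not RDF.rb's
# internal representation). A quad is a triple plus a context (graph name);
# a nil context stands for the default graph.
quads = [
  ["ex:alice", "foaf:name", "Alice", "ex:graph1"],
  ["ex:bob",   "foaf:name", "Bob",   "ex:graph2"],
  ["ex:alice", "foaf:nick", "al",    "ex:graph1"],
  ["ex:carol", "foaf:name", "Carol", nil],
]

# Grouping by the fourth element recovers the graphs held by the "repository";
# within each graph, the statements are plain triples again.
graphs = quads.group_by { |quad| quad[3] }
              .transform_values { |qs| qs.map { |q| q.first(3) } }

graphs.each do |context, triples|
  puts "#{context.inspect}: #{triples.size} triple(s)"
end
```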

Introspecting RDF formats

RDF.rb's parsing and serialization APIs are based on the following three base classes:

RDF::Format is used to describe particular RDF serialization formats.
RDF::Reader is the base class for RDF parser implementations.
RDF::Writer is the base class for RDF serializer implementations.

If you know something about the file format you want to parse or serialize, you can obtain a format specifier class for it in any of the following ways:

require 'rdf/raptor'

RDF::Format.for(:rdfxml)       #=> RDF::Raptor::RDFXML::Format
RDF::Format.for("input.rdf")
RDF::Format.for(:file_name      => "input.rdf")
RDF::Format.for(:file_extension => "rdf")
RDF::Format.for(:content_type   => "application/rdf+xml")

Once you have such a format specifier class, you can then obtain the parser/serializer implementations for it as follows:

format = RDF::Format.for("input.nt")   #=> RDF::NTriples::Format
reader = format.reader                 #=> RDF::NTriples::Reader
writer = format.writer                 #=> RDF::NTriples::Writer

There also exist corresponding factory methods on RDF::Reader and RDF::Writer directly:

reader = RDF::Reader.for("input.nt")   #=> RDF::NTriples::Reader
writer = RDF::Writer.for("output.nt")  #=> RDF::NTriples::Writer

The above is what RDF.rb relies on internally to obtain the correct parser implementation when you pass in a URL or file name to RDF::Graph.load -- or indeed to any other method that needs to auto-detect a serialization format and to delegate responsibility for parsing/serialization to the appropriate implementation class.
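
If you're curious how this kind of lookup can work, here is a minimal plain-Ruby sketch of a format registry. It is not RDF.rb's actual implementation: TinyFormat, its register method, and the content types listed are assumptions made purely for illustration.

```ruby
# Hypothetical registry; names are illustrative, not RDF.rb's API.
class TinyFormat
  @registry = []

  class << self
    attr_reader :registry

    # Called once per "plugin" to declare what it can handle.
    def register(name, extensions, content_types)
      registry << { :name => name, :extensions => extensions, :content_types => content_types }
    end

    # Accepts a symbol, a file name, or an options hash.
    def for(hint)
      entry = case hint
        when Symbol then registry.find { |f| f[:name] == hint }
        when String then by_extension(File.extname(hint).delete('.'))
        when Hash
          if    hint[:file_name]      then by_extension(File.extname(hint[:file_name]).delete('.'))
          elsif hint[:file_extension] then by_extension(hint[:file_extension].to_s)
          elsif hint[:content_type]   then registry.find { |f| f[:content_types].include?(hint[:content_type]) }
          end
      end
      entry && entry[:name]
    end

    private

    def by_extension(ext)
      registry.find { |f| f[:extensions].include?(ext) }
    end
  end
end

TinyFormat.register(:ntriples, %w[nt],  %w[text/plain])
TinyFormat.register(:rdfxml,   %w[rdf], %w[application/rdf+xml])

puts TinyFormat.for("input.nt")                                 # prints "ntriples"
puts TinyFormat.for(:content_type => "application/rdf+xml")     # prints "rdfxml"
```

The point of the design is that the core never needs to know which formats exist; each plugin announces itself when it is required.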

Parsing RDF data

If you need to be more explicit about parsing RDF data, for instance because the dataset won't fit into memory and you wish to process it statement by statement, you'll need to use RDF::Reader directly.

Parsing RDF statements from a file

RDF parser implementations generally support a streaming-compatible subset of the RDF::Enumerable interface, all of which is based on the #each_statement method. Here's how to read in an RDF file enumerated statement by statement:

require 'rdf/raptor'

RDF::Reader.open("foaf.rdf") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end

Using RDF::Reader.open with a Ruby block ensures that the input file is automatically closed after you're done with it.
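
The block form follows the usual Ruby idiom of yielding a resource and releasing it in an ensure clause. A minimal sketch of that idiom, using a hypothetical open_lines helper rather than anything from RDF.rb, might look like this:

```ruby
require 'tempfile'

# Hypothetical helper mirroring the open-with-a-block idiom.
def open_lines(path)
  file = File.open(path)
  begin
    yield file
  ensure
    file.close # runs even if the block raises
  end
end

# Demonstrate with a throwaway file.
sample = Tempfile.new(['sample', '.nt'])
sample.write("<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n")
sample.flush

handle = nil
open_lines(sample.path) { |file| handle = file; puts file.read }
puts handle.closed? # prints "true"
```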

Parsing RDF statements from a URL

As before, you can generally use an http:// or https:// URL anywhere that you could use a file name:

require 'rdf/json'

RDF::Reader.open("http://datagraph.org/jhacker/foaf.json") do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end

Parsing RDF statements from a string

Sometimes you already have the serialized RDF contents in a memory buffer somewhere, for example as retrieved from a database. In such a case, you'll want to obtain the parser implementation class as shown before, and then use RDF::Reader.new directly:

require 'rdf/ntriples'
require 'open-uri'     # needed for open() to accept URLs

input = open('http://datagraph.org/jhacker/foaf.nt').read

RDF::Reader.for(:ntriples).new(input) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end

The RDF::Reader constructor uses duck typing and accepts any input (for example, IO or StringIO objects) that responds to the #readline method. If no input argument is given, input data will by default be read from the standard input.
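
Here is a rough sketch of what that duck typing amounts to. TinyReader is a toy class invented for this example, and its regular expression handles only the very simplest N-Triples lines, so treat it as an illustration of the #readline contract and not as a parser:

```ruby
require 'stringio'

# Hypothetical toy reader; handles only trivial N-Triples lines.
class TinyReader
  TRIPLE = /\A\s*(\S+)\s+(\S+)\s+(.+?)\s*\.\s*\z/

  def initialize(input = $stdin)
    # Wrap plain strings so everything downstream can rely on #readline.
    @input = input.respond_to?(:readline) ? input : StringIO.new(input.to_s)
    yield self if block_given?
  end

  def each_statement
    return enum_for(:each_statement) unless block_given?
    loop do
      match = TRIPLE.match(@input.readline)
      yield match.captures if match
    end
  rescue EOFError
    self # #readline signals end of input by raising EOFError
  end
end

data = %(<http://example.org/jhacker> <http://xmlns.com/foaf/0.1/name> "J. Random Hacker" .\n)

# A String and a StringIO are treated alike:
TinyReader.new(data).each_statement { |s, p, o| puts o }
TinyReader.new(StringIO.new(data)).each_statement { |s, p, o| puts o }
```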

Serializing RDF data

Serializing RDF data works much the same way as parsing: when serializing to a named output file, the correct serializer implementation is auto-detected based on the given file extension.

Serializing RDF statements into an output file

RDF serializer implementations generally support an append-only subset of the RDF::Mutable interface, primarily the #insert method and its alias #<<. Here's how to write out an RDF file statement by statement:

require 'rdf/ntriples'
require 'rdf/raptor'

data = RDF::Graph.load("http://datagraph.org/jhacker/foaf.nt")

RDF::Writer.open("output.rdf") do |writer|
  data.each_statement do |statement|
    writer << statement
  end
end

Once again, using RDF::Writer.open with a Ruby block ensures that the output file is automatically flushed and closed after you're done writing to it.

Serializing RDF statements into a string result

A common use case is serializing an RDF graph into a string buffer, for example when serving RDF data from a Rails application. RDF::Writer has a convenience buffer class method that builds up output in a StringIO under the covers and then returns a string when all is said and done:

require 'rdf/ntriples'

output = RDF::Writer.for(:ntriples).buffer do |writer|
  subject = RDF::Node.new
  writer << [subject, RDF.type, RDF::FOAF.Person]
  writer << [subject, RDF::FOAF.name, "J. Random Hacker"]
  writer << [subject, RDF::FOAF.mbox, RDF::URI("mailto:jhacker@example.org")]
  writer << [subject, RDF::FOAF.nick, "jhacker"]
end
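
The buffering trick itself is simple enough to sketch in a few lines of plain Ruby. TinyWriter below is a made-up class, not RDF.rb's writer, and it naively joins terms with spaces; it only illustrates how a buffer class method can wrap a StringIO and hand back the accumulated string:

```ruby
require 'stringio'

# Hypothetical toy writer that emits one space-separated line per statement.
class TinyWriter
  # Build output in memory and return it as a String.
  def self.buffer
    io = StringIO.new
    yield new(io)
    io.string
  end

  def initialize(output = $stdout)
    @output = output
  end

  def <<(statement)
    @output.puts(statement.join(' ') + ' .')
    self # allow chaining: writer << a << b
  end
end

output = TinyWriter.buffer do |writer|
  writer << ['_:jhacker', '<http://xmlns.com/foaf/0.1/nick>', '"jhacker"']
end

puts output
```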

Customizing the serializer output

If a particular serializer implementation supports options such as namespace prefix declarations or a base URI, you can pass in those options to RDF::Writer.open or RDF::Writer.new as keyword arguments:

RDF::Writer.open("output.ttl", :base_uri => "http://rdf.rubyforge.org/")
RDF::Writer.for(:rdfxml).new($stdout, :base_uri => "http://rdf.rubyforge.org/")

Support channels

That's all for now, folks. For more information on the APIs touched upon in this tutorial, please refer to the RDF.rb API documentation. If you have any questions, don't hesitate to ask for help on #swig or the public-rdf-ruby@w3.org mailing list.

How to Build an SQL Storage Adapter for RDF Data with Ruby

Ben Lavender — 2010-04-06T09:00:00Z

RDF.rb is approaching two thousand downloads on RubyGems, and while it has good documentation it could still use some more tutorials. I recently needed to get RDF.rb working with a PostgreSQL storage backend in order to work with RDF data in a Rails 3.0 application hosted on Heroku. I thought I'd keep track of what I did so that I could discuss the notable parts.

In this tutorial we'll be implementing an RDF.rb storage adapter called RDF::DataObjects::Repository, which is a simplified version of what I eventually ended up with. If you want the real thing, check it out on GitHub and read the docs. This tutorial will only cover the SQLite backend and won't concern itself with database indexes, performance tweaks, or any other distractions from the essential RDF.rb interfaces we'll focus on. There's a copy of the simplified code used in the tutorial at the tutorial's project page. And should you be inspired to build something similar of your own, I have set up an RDF.rb storage adapter skeleton at GitHub. Click fork, grep for lines containing a TODO comment, and dive right in.

I'll mention, briefly, that I chose DataObjects as the database abstraction layer, but I don't want to dwell on that -- this post is about RDF. DataObjects is just a way to use common methods to talk to different databases at the SQL level. It's a leaky abstraction, because we'll want to be using some SQL constraints to enforce statement uniqueness but those constraints need to be done differently for different databases. That means we still have to get down to the level of database-specific SQL, distasteful as that may be in this day and age. However, given that I wanted to be able to target PostgreSQL and SQLite both, DataObjects is still helpful.

Requirements

You just need a few gems for the example repository. This ought to get you going. Even if you have these, make sure you have the latest; RDF.rb gets updated frequently.

$ sudo gem install rdf rdf-spec rspec do_sqlite3

Testing First

So where do we start? Tests, of course. RDF.rb has factored out its mixin specs to the RDF::Spec gem, which provides the RSpec shared example groups that are also used by RDF.rb for its own tests. Thus, here is the complete spec file for the in-memory reference implementation of RDF::Repository:

require File.join(File.dirname(__FILE__), 'spec_helper')
require 'rdf/spec/repository'

describe RDF::Repository do
  before :each do
    @repository = RDF::Repository.new
  end

  # @see lib/rdf/spec/repository.rb
  it_should_behave_like RDF_Repository
end

If you haven't seen something like this before, that's an RSpec shared example group, and it's awesome. Anything can use the same specs as RDF.rb itself to verify that it conforms to the interfaces defined by RDF.rb, and that's exactly what we'll be doing in this tutorial. Let's implement that for our repository:

# spec/sqlite3.spec
$:.unshift File.dirname(__FILE__) + "/../lib/"

require 'rdf'
require 'rdf/do'
require 'rdf/spec/repository'
require 'do_sqlite3'

describe RDF::DataObjects::Repository do
  context "The SQLite adapter" do
    before :each do
      @repository = RDF::DataObjects::Repository.new "sqlite3::memory:"
    end

    after :each do
      # DataObjects pools connections, and only allows 8 at once.  We have
      # more than 60 tests.
      DataObjects::Sqlite3::Connection.__pools.clear
    end

    it_should_behave_like RDF_Repository
  end
end

If you're new to RSpec, run the tests with the spec command:

$ spec -cfn spec/sqlite3.spec

These fail miserably right now, of course, since we don't have an implementation. So let's make one.

Initial implementation

RDF.rb's interface for an RDF store is RDF::Repository. That interface is itself composed of a number of mixins: RDF::Enumerable, RDF::Queryable, RDF::Mutable, and RDF::Durable.

RDF::Queryable has a base implementation that works on anything which implements RDF::Enumerable. And RDF::Durable only provides boolean methods for clients to ask if it is durable? or not; the default is that a repository reports that it is indeed durable, so we don't need to do anything there.

The takeaway is that to create an RDF.rb storage adapter, we need to implement RDF::Enumerable and RDF::Mutable, and the rest will fall into place. Indeed, the reference implementation is little more than an array which implements these interfaces.

It turns out we can get away with just three methods to implement those two interfaces: RDF::Enumerable#each, RDF::Mutable#insert_statement, and RDF::Mutable#delete_statement. The default implementations will use these to build up any missing methods. That means we need to implement those first, so that we have a base to pass our tests. Then we can iterate further, replacing methods which iterate over every statement with methods more appropriate for our backend.
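
To see why three primitives are enough, consider this plain-Ruby sketch of the same pattern. TinyStore and ArrayStore are hypothetical names, and Ruby's own Enumerable stands in for RDF.rb's mixins; the actual defaults in RDF.rb are richer than this:

```ruby
# Hypothetical mixin standing in for the defaults RDF.rb's mixins provide.
module TinyStore
  include Enumerable # count, include?, select, ... all derive from #each

  def insert(*statements)
    statements.each { |s| insert_statement(s) }
    self
  end

  def delete(*statements)
    statements.each { |s| delete_statement(s) }
    self
  end

  def clear
    delete(*to_a) # snapshot first, so we don't mutate while iterating
  end
end

# A backend only has to supply the three primitives.
class ArrayStore
  include TinyStore

  def initialize
    @statements = []
  end

  def each(&block)
    return enum_for(:each) unless block
    @statements.each(&block)
  end

  def insert_statement(statement)
    @statements << statement unless @statements.include?(statement)
  end

  def delete_statement(statement)
    @statements.delete(statement)
  end
end

store = ArrayStore.new
store.insert([:s, :p, :o1], [:s, :p, :o2], [:s, :p, :o1])
puts store.count # prints 2, since the duplicate was ignored
store.clear
puts store.count # prints 0
```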

Here's a repository which doesn't implement much more than those three methods. We'll use it as a starting point.

# lib/rdf/do.rb

require 'rdf'
require 'rdf/ntriples'
require 'data_objects'
require 'do_sqlite3'
require 'enumerator'

module RDF
  module DataObjects
    class Repository < ::RDF::Repository

      def initialize(options)
        @db = ::DataObjects::Connection.new(options)
        exec('CREATE TABLE IF NOT EXISTS quads (
              `subject` varchar(255), 
              `predicate` varchar(255),
              `object` varchar(255), 
              `context` varchar(255), 
              UNIQUE (`subject`, `predicate`, `object`, `context`))')
      end

      # @see RDF::Enumerable#each.
      def each(&block)
        if block_given?
          reader = result('SELECT * FROM quads')
          while reader.next!
            block.call(RDF::Statement.new(
                  :subject   => unserialize(reader.values[0]),
                  :predicate => unserialize(reader.values[1]),
                  :object    => unserialize(reader.values[2]),
                  :context   => unserialize(reader.values[3])))

          end
        else
          enum_for(:each)  # portable across Ruby 1.8.7 and 1.9+
        end
      end

      # @see RDF::Mutable#insert_statement
      def insert_statement(statement)
        sql = 'REPLACE INTO `quads` (subject, predicate, object, context) VALUES (?, ?, ?, ?)'
        exec(sql,serialize(statement.subject),serialize(statement.predicate), 
                 serialize(statement.object), serialize(statement.context)) 
      end

      # @see RDF::Mutable#delete_statement
      def delete_statement(statement)
        sql = 'DELETE FROM `quads` where (subject = ? AND predicate = ? AND object = ? AND context = ?)'
        exec(sql,serialize(statement.subject),serialize(statement.predicate), 
                 serialize(statement.object), serialize(statement.context)) 
      end

      ## These are simple helpers to serialize and unserialize component
      # fields.  We use an explicit empty string for null values for clarity in
      # this example; we cannot use NULL, as SQLite considers NULLs as
      # distinct from each other when using the uniqueness constraint we
      # added when we created the table.  Using NULL would let us insert
      # duplicate statements with a NULL context.
      def serialize(value)
        RDF::NTriples::Writer.serialize(value) || ''
      end
      def unserialize(value)
        value == '' ? nil : RDF::NTriples::Reader.unserialize(value)
      end

      ## These are simple helpers for DataObjects
      def exec(sql, *args)
        @db.create_command(sql).execute_non_query(*args)
      end
      def result(sql, *args)
        @db.create_command(sql).execute_reader(*args)
      end

    end
  end
end

And we have a repository. Poof, done, that's it. You can get a copy of this intermediate repository at the tutorial page and run the specs for yourself. It's not very efficient for SQL yet, but this is all it takes, strictly speaking.

Since they are so important, the three main methods deserve a little more attention:

`each`

Each is the only thing we have to implement to get information out after we've put it in. RDF::Enumerable will provide us tons of things like each_subject, has_subject?, each_predicate, has_predicate?, etc. If you were watching the spec output, you'll notice we ran tests for RDF::Queryable. The default implementation will use RDF::Enumerable's methods to implement basic querying. This means we can already do things like:

# Note that #load actually comes from insert_statement, see below
repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.query(:subject => RDF::URI.new('http://datagraph.org/jhacker/foaf'))
#=> RDF::Enumerable of statements with given URI as subject

Note that if no block is given, #each is defined to return an enumerator (Enumerable::Enumerator on Ruby 1.8, plain Enumerator on Ruby 1.9).

RDF::Queryable, which defines #query, is probably the thing we can improve the most on with SQL as opposed to the reference implementation. We'll revisit it below.

`insert_statement`

insert_statement inserts an RDF::Statement into the repository. It's pretty straightforward. It gives us access to default implementations of things like RDF::Mutable#load, which will load a file by name or import a remote resource:

repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.count
#=> 10

`delete_statement`

delete_statement deletes an RDF::Statement. Again, it's straightforward, and it's used to implement things like RDF::Mutable#clear, which empties the repository:

repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.clear
repo.count
#=> 0

Iterate and Improve

Since we already have a nice test suite that we can pass, we can add functionality incrementally. For example, let's implement RDF::Enumerable#count in a fashion that does not require us to enumerate each statement, which is clearly not ideal for a SQL-based system:

# lib/rdf/do.rb

def count
  # Use a distinct local name so we don't shadow the #result helper.
  reader = result('SELECT COUNT(*) FROM quads')
  reader.next!
  reader.values.first
end

The tests still pass, so we can move on. Wash, rinse, repeat; probably every method in RDF::Enumerable and RDF::Mutable can be done more efficiently with SQL.

`RDF::Queryable`

RDF::Queryable is worth mentioning on its own, because the interface takes a lot of options. Specifically, it can take a Hash, a smashed Array, an RDF::Statement, or a Query object. Fortunately, we can call super to defer to the reference implementation if we get arguments we don't understand, so we can again be iterative here.

We can start by implementing the hash version, which is the most convenient for doing the actual SQL query later. The hash version takes a hash which may have keys for :subject, :predicate, :object, and :context, and returns an RDF::Enumerable which contains all statements matching those parameters:

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when Hash
            statements = []
            reader = query_hash(pattern)
            while reader.next!
              statements << RDF::Statement.new(
                      :subject   => unserialize(reader.values[0]),
                      :predicate => unserialize(reader.values[1]),
                      :object    => unserialize(reader.values[2]),
                      :context   => unserialize(reader.values[3]))
            end
            case block_given?
              when true
                statements.each(&block)
              else
                statements.extend(RDF::Enumerable, RDF::Queryable)
            end
          else
            super(pattern) 
        end
      end

      def query_hash(hash)
        conditions = []
        params = []
        [:subject, :predicate, :object, :context].each do |resource|
          unless hash[resource].nil?
            conditions << "#{resource.to_s} = ?"
            params     << serialize(hash[resource])
          end
        end
        where = conditions.empty? ? "" : "WHERE "
        where << conditions.join(' AND ')
        result('SELECT * FROM quads ' + where, *params)
      end
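
To see exactly what the condition-building step hands to the database, here is a standalone version that needs neither RDF.rb nor DataObjects. The build_where name is made up, and it uses to_s as a stand-in for the serialize helper, so the parameter values are only illustrative:

```ruby
# Standalone sketch of the condition-building step in query_hash.
# `to_s` is a stand-in for the serialize helper defined earlier.
def build_where(pattern)
  conditions = []
  params = []
  [:subject, :predicate, :object, :context].each do |field|
    next if pattern[field].nil?
    conditions << "#{field} = ?"
    params << pattern[field].to_s
  end
  sql = 'SELECT * FROM quads'
  sql += ' WHERE ' + conditions.join(' AND ') unless conditions.empty?
  [sql, params]
end

sql, params = build_where(:subject => '<http://example.org/s>', :object => '"o"')
puts sql            # prints "SELECT * FROM quads WHERE subject = ? AND object = ?"
puts params.inspect
```

Because values only ever travel as bound parameters, the SQL text itself never contains user-supplied data.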

Our specs still pass. Note this trick:

statements.extend(RDF::Enumerable, RDF::Queryable)

RDF::Queryable is defined to return something which implements RDF::Enumerable and RDF::Queryable. Since the only thing we need to implement RDF::Enumerable is #each, and Array already implements that, we can simply extend this Array instance with the mixins and return it.
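
If extending a single object is new to you, this tiny plain-Ruby example shows the mechanics. The Describable module is invented for the illustration and has nothing to do with RDF.rb; like RDF::Enumerable, it is written purely in terms of #each:

```ruby
# Hypothetical mixin: everything in it relies only on #each.
module Describable
  def subjects
    found = []
    each { |s, _p, _o| found << s unless found.include?(s) }
    found
  end
end

results = [[:alice, :knows, :bob], [:alice, :name, 'Alice'], [:bob, :name, 'Bob']]
results.extend(Describable) # only this one Array instance gains the method

puts results.subjects.inspect      # prints "[:alice, :bob]"
puts [].respond_to?(:subjects)     # prints "false"
```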

Note also that while we have taken care of the hard part, we're still calling the reference implementation if we don't know how to handle our arguments. Now we can start adding those other query arguments:

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when RDF::Statement
            query(pattern.to_hash, &block)   # pass the block along
          when Array
            query(RDF::Statement.new(*pattern), &block)
          when Hash
      .
      .
      .
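
An alternative way to think about the same dispatch is to reduce every pattern shape to a hash first and then run a single code path. The sketch below is plain Ruby with a made-up normalize_pattern helper; it is not how RDF.rb does it, just one way to picture the normalization:

```ruby
# Hypothetical helper: reduce each supported pattern shape to a Hash.
FIELDS = [:subject, :predicate, :object, :context]

def normalize_pattern(pattern)
  case pattern
  when Hash  then pattern
  when Array then FIELDS.zip(pattern).to_h.reject { |_, value| value.nil? }
  else raise ArgumentError, "unsupported pattern: #{pattern.inspect}"
  end
end

puts normalize_pattern([:s, nil, :o]).inspect
puts normalize_pattern(:subject => :s).inspect
```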

Our specs still pass! Moving on, there's a lot more we can implement. And once we have implemented it in a straightforward way, we can still implement things like multiple inserts, paging, and more, all transparent to the user. You can see the full list of methods to implement in the docs, but don't be afraid to dive into the code.

If you do, don't forget that RDF.rb is completely public domain, so if you want to copy-paste to bootstrap your implementation, feel free.

Any questions?

Hopefully this is enough to get you started. Remember, the code is at the tutorial page, and don't forget to check out the storage adapter skeleton. The RDF.rb documentation has a lot of information on the APIs you'll be using.

And last but not least, a good place to ask questions or leave a comment is on the W3C RDF-Ruby mailing list.

RDF for Intrepid Unix Hackers: Transmuting N-Triples

Arto Bendiken — 2010-04-05T09:00:00Z

This is the second part in an ongoing RDF for Intrepid Unix Hackers article series. In the previous part, we learned how to process RDF data in the line-oriented, whitespace-separated N-Triples serialization format by pipelining standard Unix tools such as grep and awk.

That was all well and good, but what to do if your RDF data isn't already in N-Triples format? Today we'll see how to install and use the excellent Raptor RDF Parser Library to convert RDF from one serialization format to another.

Installing the Raptor RDF Parser tools

The Raptor toolkit includes a handy command-line utility called rapper, which can be used to convert RDF data between most of the various popular RDF serialization formats.

Installing Raptor is straightforward on most development and deployment platforms; here's how to install Raptor on Mac OS X with MacPorts or Fink and on any of the most common Linux and BSD distributions:

$ [sudo] port install raptor             # Mac OS X with MacPorts
$ [sudo] fink install raptor-bin         # Mac OS X with Fink
$ [sudo] aptitude install raptor-utils   # Ubuntu / Debian
$ [sudo] yum install raptor              # Fedora / CentOS / RHEL
$ [sudo] zypper install raptor           # openSUSE
$ [sudo] emerge raptor                   # Gentoo Linux
$ [sudo] pacman -S raptor                # Arch Linux
$ [sudo] pkg_add -r raptor               # FreeBSD
$ [sudo] pkg_add raptor                  # OpenBSD / NetBSD

The subsequent examples all assume that you have successfully installed Raptor and thus have the rapper utility available in your $PATH. To make sure that rapper is indeed available, just ask it to output its version number as follows:

$ rapper --version
1.4.21

We'll be using version 1.4.21 for this tutorial, but any 1.4.x release from 1.4.5 onwards should do fine for present purposes -- so don't worry if your distribution provides a slightly older version.

Should you have any trouble getting rapper set up, you can ask for help on the #swig channel on IRC or on the Raptor mailing list.

Transmuting RDF/XML into N-Triples

RDF/XML is the standard RDF serialization specified by W3C back before the dot-com bust. Despite some newer, more human-friendly formats, a great deal of the RDF data out there in the wild is still made available in this format.

For example, every valid RSS 1.0-compatible feed is, in principle, also a valid RDF/XML document (but note that the same is not true for non-RDF formats like RSS 2.0 or Atom). So, let's grab the RSS feed for this blog and define a Bash shell alias for converting RDF/XML into N-Triples using rapper:

$ alias rdf2nt="rapper -i rdfxml -o ntriples"

$ curl http://blog.datagraph.org/index.rss > index.rdf

$ rdf2nt index.rdf > index.nt
rapper: Parsing URI file://index.rdf with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples

Pretty easy, huh? It gets even easier, because rapper actually supports fetching URLs directly. Typically Raptor is built with libcurl support, so it supports the same set of URL schemes as does the curl command itself. This means that e.g. any http://, https:// and ftp:// input arguments will work right out of the box, so that we can combine our previous two commands as follows:

$ rdf2nt http://blog.datagraph.org/index.rss > index.nt
rapper: Parsing URI http://blog.datagraph.org/index.rss with parser rdfxml
rapper: Serializing with serializer ntriples
rapper: Parsing returned 106 triples

Transmuting Turtle into N-Triples

After RDF/XML, Turtle is probably the most widespread RDF format out there. It is a subset of Notation3 and a superset of N-Triples, hitting a sweet spot for both expressiveness and conciseness. It is also much more pleasant to write by hand than XML, so personal FOAF files in particular tend to be authored in Turtle and then converted, e.g. using rapper, into a variety of formats when published on the Linked Data web.

For this next example, let's grab my FOAF file in Turtle format and convert it into N-Triples:

$ alias ttl2nt="rapper -i turtle -o ntriples"

$ ttl2nt http://datagraph.org/bendiken/foaf.ttl > foaf.nt
rapper: Parsing URI http://datagraph.org/bendiken/foaf.ttl with parser turtle
rapper: Serializing with serializer ntriples
rapper: Parsing returned 16 triples

Just as easy as with RDF/XML. And you'll notice that this time around we did the downloading and the conversion in a single step by letting rapper worry about fetching the data directly from the URL in question.

Transmuting N-Triples into other formats

Conversely, you can of course also use rapper to convert any N-Triples input data into other RDF serialization formats such as Turtle, RDF/XML and RDF/JSON. You need only swap the arguments to the -i and -o options and you're good to go.

So, let's define a couple more handy aliases:

$ alias nt2ttl="rapper -i ntriples -o turtle"
$ alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev"
$ alias nt2json="rapper -i ntriples -o json"

Now we can quickly and easily convert any N-Triples data into other RDF formats:

$ nt2ttl  index.nt > index.ttl
$ nt2rdf  index.nt > index.rdf
$ nt2json index.nt > index.json

We can define similar aliases for any input/output permutation provided by rapper. To find out the full list of input and output RDF serialization formats supported by your version of the program, run rapper --help:

$ rapper --help
...
Main options:
  -i FORMAT, --input FORMAT   Set the input format/parser to one of:
    rdfxml          RDF/XML (default)
    ntriples        N-Triples
    turtle          Turtle Terse RDF Triple Language
    trig            TriG - Turtle with Named Graphs
    rss-tag-soup    RSS Tag Soup
    grddl           Gleaning Resource Descriptions from Dialects of Languages
    guess           Pick the parser to use using content type and URI
    rdfa            RDF/A via librdfa
...
  -o FORMAT, --output FORMAT  Set the output format/serializer to one of:
    ntriples        N-Triples (default)
    turtle          Turtle
    rdfxml-xmp      RDF/XML (XMP Profile)
    rdfxml-abbrev   RDF/XML (Abbreviated)
    rdfxml          RDF/XML
    rss-1.0         RSS 1.0
    atom            Atom 1.0
    dot             GraphViz DOT format
    json-triples    RDF/JSON Triples
    json            RDF/JSON Resource-Centric
...

Defining more `rapper` aliases

Copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with RDF data on the command line:

# rapper aliases from http://blog.datagraph.org/2010/04/transmuting-ntriples
alias any2nt="rapper -i guess -o ntriples"         # Anything to N-Triples
alias any2ttl="rapper -i guess -o turtle"          # Anything to Turtle
alias any2rdf="rapper -i guess -o rdfxml-abbrev"   # Anything to RDF/XML
alias any2json="rapper -i guess -o json"           # Anything to RDF/JSON
alias nt2ttl="rapper -i ntriples -o turtle"        # N-Triples to Turtle
alias nt2rdf="rapper -i ntriples -o rdfxml-abbrev" # N-Triples to RDF/XML
alias nt2json="rapper -i ntriples -o json"         # N-Triples to RDF/JSON
alias ttl2nt="rapper -i turtle -o ntriples"        # Turtle to N-Triples
alias ttl2rdf="rapper -i turtle -o rdfxml-abbrev"  # Turtle to RDF/XML
alias ttl2json="rapper -i turtle -o json"          # Turtle to RDF/JSON
alias rdf2nt="rapper -i rdfxml -o ntriples"        # RDF/XML to N-Triples
alias rdf2ttl="rapper -i rdfxml -o turtle"         # RDF/XML to Turtle
alias rdf2json="rapper -i rdfxml -o json"          # RDF/XML to RDF/JSON
alias json2nt="rapper -i json -o ntriples"         # RDF/JSON to N-Triples
alias json2ttl="rapper -i json -o turtle"          # RDF/JSON to Turtle
alias json2rdf="rapper -i json -o rdfxml-abbrev"   # RDF/JSON to RDF/XML

Since each of these aliases is a mnemonic patterned after the file extensions for the input and output formats involved, remembering these is easy as pie. Note also that I've included four any2* aliases that specify guess as the input format to let rapper try and automatically detect the serialization format for the input stream.
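
Because the names follow such a regular pattern, the whole list can also be generated instead of typed. Here's a small Ruby sketch that does so; the INPUTS and OUTPUTS tables are my own summary of the mnemonics and rapper format names used in the aliases above, so adjust them to whatever your rapper --help reports:

```ruby
# Mnemonic => rapper format name, for parsers and serializers respectively.
INPUTS  = { 'any' => 'guess', 'nt' => 'ntriples', 'ttl' => 'turtle',
            'rdf' => 'rdfxml', 'json' => 'json' }
OUTPUTS = { 'nt' => 'ntriples', 'ttl' => 'turtle',
            'rdf' => 'rdfxml-abbrev', 'json' => 'json' }

# One alias per input/output pair, skipping same-format conversions.
aliases = INPUTS.flat_map do |from, parser|
  OUTPUTS.reject { |to, _| to == from }.map do |to, serializer|
    %(alias #{from}2#{to}="rapper -i #{parser} -o #{serializer}")
  end
end

puts aliases
```

Generating the list this way also guards against copy-and-paste slips, since each output format is derived from the alias name.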

A big thanks goes out to Dave Beckett for having developed Raptor and for giving us the superbly useful N-Triples and Turtle serialization formats. I personally use rapper and these aliases just about every single day, and I hope you find them as useful as I have.

Stay tuned for more upcoming installments of RDF for Intrepid Unix Hackers.

Lest there be any doubt, all the code in this tutorial is hereby released into the public domain using the Unlicense. You are free to copy, modify, publish, use, sell and distribute it in any way you please, with or without attribution.

The Curious Case of RDF Graph Isomorphism

Ben Lavender — 2010-03-12T20:00:00Z

The first time I ever sat down to write some real RDF code, I started, as one always should, with some tests. Most of them went fine, but then I had to write a test that compared the equality of two graphs; I think this was for a parser in Scala, sometime last year, but I've lost track of what exactly I was looking at. In any case, what a can of worms I opened.

It turns out that graph equality in RDF is hard. The combination of blank and non-blank nodes makes it a graph isomorphism problem that I have not found an exact equivalence for in straight-up graph theory. Graphs with named vertices and edges have easy solutions; graphs with unnamed vertices and edges have other, more difficult ones. The difference, depending on the type of graph, can be between O(n) and O(n!) in the number of nodes, so when selecting a possible solution, we'd like to avoid solutions that don't take naming into account.

The isomorphism problem is hard enough that many popular RDF implementations don't even include a solution for it. RDFLib for Python has an approximation with a to-do note, I don't see an appropriate function in Redland's model API, and Sesame has an implementation with the following comment:

// FIXME: this recursive implementation has a high risk of
// triggering a stack overflow

My Java is rusty and I have no intention of polishing it up for this blog post, but I believe Sesame's implementation has factorial complexity.

Now, don't get me wrong. Those are all free projects, and it's a tough problem to do right. We over at Datagraph just made do without an isomorphism function in either Scala or Ruby for several months rather than solve it. So this is not intended to be a cheap shot at those projects -- in fact, we use both Redland and Sesame, and quite happily. And if I'm wrong on the sparse nature of this landscape, someone please correct me.

However, we're developing a new RDF library for Ruby, so when it came time to really solve the problem, we wanted to solve it right. Like most problems in computer science, it's actually old news. Jeremy Carroll solved it and implemented it for Jena either before or after writing a great paper on the topic. What I'm about to describe is more or less his algorithm, and while I slightly adjusted the following to my style, I'm not about to say much that his paper doesn't. So just go read the paper if that's your preference.

The algorithm can be described as a refinement of a naive O(n!) graph isomorphism algorithm, in which each blank node is mapped onto each other blank node, followed by a consistency check. The magic stems from RDF having these nifty global identifiers for most vertices and all edges. If we're smart about it, we can eliminate substantially all of the possible mappings before we try even our first speculative mapping.

I haven't done the math, but it would seem that one could generate a pathological case graph which would be O(n!). On the other hand, since RDF does not allow blank node predicates, and because the algorithm terminates on the first match, I haven't yet figured out how to create such a pathological graph for this algorithm. Graphs tend to be either open enough to have a large number of solutions, one of which will be found quickly, or tight enough to have only one.

The algorithm works as follows:

1. Compare graph sizes and all statements without blank nodes. If they do not match, fail.
2. Repeat, for each graph:
   1. Repeat, for each blank node:
      1. Mark the node as grounded or not. A grounded node has only non-blank nodes or grounded nodes in statements in which it appears.
      2. Create a signature for the node. A signature consists of a canonical representation of all of the statements a node appears in.
   2. Terminate unless we marked a node as grounded on this run.
3. Map grounded blank nodes to the other graph's grounded blank nodes where signatures match.
4. If all nodes are mapped, we have a bijection, which we can return.
5. Select ungrounded nodes from each graph with identical signatures. Mark them as grounded, then recurse to step 2.
6. If no ungrounded nodes have the same signature, or we have tried all matching pairs, a bijection does not exist. Fail.

In something approaching day-to-day English, what's happening here is that after eliminating the simple possibilities, we're generating a hash of all of the elements that appear with a given node in a graph. We then create a node-to-hash mapping. As the hashes will be the same for blank nodes on both input graphs, we use that hash to eliminate possible matchings before we try them. Instead of trying every mapping, we try mappings only on nodes with the same signature. The end result is an algorithm that requires a fairly pathological case to recurse at all, let alone to recurse deeply. Nice.
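
For the curious, here is a deliberately simplified Ruby sketch of the signature idea. It is neither Carroll's full algorithm nor the RDF::Isomorphic implementation: it computes signatures in a single round, skips the iterative grounding step, and checks each complete candidate mapping by simply renaming and comparing. Graphs are assumed to be duplicate-free arrays of [subject, predicate, object] strings, with blank nodes written as "_:name"; all helper names are made up.

```ruby
require 'set'

BLANK = lambda { |term| term.is_a?(String) && term.start_with?('_:') }

# Canonical description of every statement a blank node appears in.
# The node itself becomes :self; any other blank node becomes :blank.
def signature(graph, node)
  graph.select { |stmt| stmt.include?(node) }.map { |stmt|
    stmt.map { |t| t == node ? :self : (BLANK[t] ? :blank : t) }.inspect
  }.sort
end

def blank_nodes(graph)
  graph.flatten.select(&BLANK).uniq
end

# Returns a Hash mapping blank nodes of `a` onto those of `b`, or nil.
def bijection(a, b)
  return nil unless a.size == b.size
  ground = lambda { |g| g.reject { |stmt| stmt.any?(&BLANK) }.to_set }
  return nil unless ground[a] == ground[b]
  return nil unless blank_nodes(a).size == blank_nodes(b).size

  # Only nodes with identical signatures are candidate matches.
  by_signature = blank_nodes(b).group_by { |n| signature(b, n) }
  candidates = blank_nodes(a).map { |n| [n, by_signature.fetch(signature(a, n), [])] }
  search(a, b.to_set, candidates, {})
end

# Backtracking over the (already heavily pruned) candidate lists.
def search(a, b_set, candidates, mapping)
  if candidates.empty?
    renamed = a.map { |stmt| stmt.map { |t| mapping.fetch(t, t) } }.to_set
    return renamed == b_set ? mapping : nil
  end
  (node, options), *rest = candidates
  (options - mapping.values).each do |target|
    found = search(a, b_set, rest, mapping.merge(node => target))
    return found if found
  end
  nil
end

g1 = [['_:x', 'knows', '_:y'], ['_:x', 'name', '"Alice"'], ['_:y', 'name', '"Bob"']]
g2 = [['_:1', 'name', '"Bob"'], ['_:2', 'knows', '_:1'], ['_:2', 'name', '"Alice"']]

puts bijection(g1, g2).inspect
```

In this example each blank node has exactly one candidate with a matching signature, so the search never has to backtrack, which is the behaviour the full algorithm aims for on typical graphs.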

At any rate, you can see the details, along with some test cases to play with, in RDF::Isomorphic for RDF.rb. This blog post coincides with release 0.1.0, which features a slightly improved signature algorithm, reducing the number of rounds required in some cases. The documentation is also greatly improved -- I spent more time on this problem than I ever intended to, so I hope this can be a readable summary of the algorithm for anyone coming across this in the future. Of course, RDF.rb's structure means almost anything using RDF.rb can be tested for isomorphism now, so hopefully it won't ever occur to you to read the code.

Of course, RDF::Isomorphic is in the public domain, so should you find my implementation worthy, feel free to copy the code as directly as your framework or programming language allows. And please feel free to do that without any obligation to provide attribution or any such silliness.

RDF.rb: A Public-Domain RDF Library for Ruby

Arto Bendiken — 2010-03-09T22:00:00Z

We have just released version 0.1.0 of RDF.rb, our RDF library for Ruby. This is the first generally useful release of the library, so I will here introduce the design philosophy and object model of the library as well as provide a tutorial on using its core classes.

RDF.rb has extensive API documentation with many inline code examples, enjoys comprehensive RSpec coverage, and is immediately available via RubyGems:

$ [sudo] gem install rdf

Once installed, to load up the library in your own Ruby projects you need only do:

require 'rdf'

The RDF.rb source code repository is hosted on GitHub. You can obtain a local working copy of the source code as follows:

$ git clone git://github.com/bendiken/rdf.git

The Design Philosophy

The design philosophy for RDF.rb differs somewhat from previous efforts at RDF libraries for Ruby. Instead of a feature-packed RDF library that attempts to include everything but the kitchen sink, we have rather aimed for something like a lowest common denominator with well-defined, finite requirements.

Thus, RDF.rb is perhaps quickest described in terms of what it isn't and what it hasn't:

RDF.rb does not have any dependencies other than the Addressable gem, which provides improved URI handling over Ruby's standard library. We also guarantee that RDF.rb will never add any hard dependencies that would compromise its use on popular alternative Ruby implementations such as JRuby.
RDF.rb does not provide any resource-centric, ORM-like abstractions to hide the essential statement-oriented nature of the API. Such abstractions may be useful, but they are beyond the scope of RDF.rb itself.
RDF.rb does not, and will not, include built-in support for any RDF serialization formats other than N-Triples and N-Quads. However, it does define a DSL and common API for adding support for other formats via third-party plugin gems. There presently exist RDF.rb-compatible RDF::JSON and RDF::TriX gems that add initial RDF/JSON and TriX support, respectively.
RDF.rb does not, and will not, include built-in support for any particular persistent RDF storage systems. However, it does define the interfaces that such storage adapters could be written to. Again, add-on gems are the way to go, and there already exists an in-the-works RDF.rb-compatible RDF::Sesame gem that enables using Sesame 2.0 HTTP endpoints with the repository interface defined by RDF.rb.
RDF.rb does not, and will not, include any built-in RDF Schema or OWL inference capabilities. There exists an in-the-works RDF.rb-compatible RDFS gem that is intended to provide a naive proof-of-concept implementation of a forward-chaining inference engine for the RDF Schema entailment rules.
RDF.rb does not include any built-in SPARQL functionality per se, though it will soon provide support for basic graph pattern (BGP) matching and could thus conceivably be used as the basis for a SPARQL engine written in Ruby.
RDF.rb does not come with a license statement, but rather with the stringent hope that you have a nice day. RDF.rb is 100% free and unencumbered public domain software. You can copy, modify, use, and hack on it without any restrictions whatsoever. This means that authors of other RDF libraries for Ruby are perfectly welcome to steal any of our code, with or without attribution. So, if some code snippet or file may be of use to you, feel free to copy it and relicense it under whatever license you have released your own library with -- no need to include any copyright notices from us (since there are none), or even to mention us in the credits (we won't mind).

So that's what RDF.rb is not, but perhaps more important is what we want it to be. There's no reason for simple RDF-based solutions to require enormous complex libraries, storage engines, significant IDE configuration or XML pushups. We're hoping to bring RDF to a world of agile programmers and startups, and to bring existing Linked Data enthusiasts to a platform that encourages rapid innovation and programmer happiness. And maybe everyone can have some fun along the way!

It is also our hope that the aforementioned minimalistic design approach and extremely liberal licensing can help lead to the emergence of a semi-standard Ruby object model for RDF, that is, a common core class hierarchy and API that could be largely interoperable between a number of RDF libraries for Ruby.

With that in mind, let's proceed to have a look at RDF.rb's core object model.

The Object Model

While RDF.rb is built to take full advantage of Ruby's duck typing and mixins, it does also define a class hierarchy of RDF objects. If nothing else, this inheritance tree is useful for case/when matching and also adheres to the principle of least surprise for developers hailing from less dynamic programming languages.

The RDF.rb core class hierarchy looks like the following, and will seem instantly familiar to anyone acquainted with Sesame's object model:

The five core RDF.rb classes, all of them ultimately inheriting from RDF::Value, are:

RDF::Literal represents plain, language-tagged or datatyped literals.
RDF::URI represents URI references (URLs and URNs).
RDF::Node represents anonymous nodes (also known as blank nodes).
RDF::Statement represents RDF statements (also known as triples).
RDF::Graph represents anonymous or named graphs containing zero or more statements.

In addition, the two core RDF.rb interfaces (known as mixins in Ruby parlance) are:

RDF::Enumerable provides RDF-specific iteration methods for any collection of RDF statements.
RDF::Queryable provides RDF-specific query methods for any collection of RDF statements.

Let's take a quick tour of each of these aforementioned core classes and mixins.

Working with RDF::URI

URI references (URLs and URNs) are represented in RDF.rb as instances of the RDF::URI class, which is based on the excellent Addressable::URI library.

Creating a URI reference

The RDF::URI constructor is overloaded to take either a URI string (anything that responds to #to_s, actually) or an options hash of URI components. This means that the following are two equivalent ways of constructing the same URI reference:

uri = RDF::URI.new("http://rdf.rubyforge.org/")

uri = RDF::URI.new({
  :scheme => 'http',
  :host   => 'rdf.rubyforge.org',
  :path   => '/',
})

The supported URI components are explained in the API documentation for Addressable::URI.new.

Getting the string representation of a URI

Turning a URI reference back into a string works as usual in Ruby:

uri.to_s        #=> "http://rdf.rubyforge.org/"

Navigating URI hierarchies

RDF::URI supports the same set of instance methods as does Addressable::URI. This means that the following methods, and many more, are available:

uri = RDF::URI.new("http://rubygems.org/gems/rdf")

uri.absolute?   #=> true
uri.relative?   #=> false
uri.scheme      #=> "http"
uri.authority   #=> "rubygems.org"
uri.host        #=> "rubygems.org"
uri.port        #=> nil
uri.path        #=> "/gems/rdf"
uri.basename    #=> "rdf"

In addition, RDF::URI supports several convenience methods that can help you navigate URI hierarchies without breaking a sweat:

uri = RDF::URI.new("http://rubygems.org/")
uri = uri.join("gems", "rdf")

uri.to_s        #=> "http://rubygems.org/gems/rdf"

uri.parent      #=> RDF::URI.new("http://rubygems.org/gems/")
uri.root        #=> RDF::URI.new("http://rubygems.org/")

Working with RDF::Node

Blank nodes are represented in RDF.rb as instances of the RDF::Node class.

Creating a blank node with an implicit identifier

The simplest way to create a new blank node is as follows:

bnode = RDF::Node.new

This will create a blank node with an identifier based on the internal Ruby object ID of the RDF::Node instance. This nicely serves us as a unique identifier for the duration of the Ruby process:

bnode.id   #=> "2158816220"
bnode.to_s #=> "_:2158816220"

Creating a blank node with a UUID identifier

You can also provide an explicit blank node identifier to the RDF::Node constructor. This is particularly useful when serializing or parsing RDF data, where you generally need to maintain a mapping of blank node identifiers to blank node instances.

The constructor argument can be any string or any object that responds to #to_s. For example, say that you wanted to create a blank node instance having a globally-unique UUID as its identifier. Here's how you would do this with the help of the UUID gem:

require 'uuid'

bnode = RDF::Node.new(UUID.generate)

The above is a fairly common use case, so RDF.rb actually provides a convenience class method for creating UUID-based blank nodes. The following will use either the UUID or the UUIDTools gem, whichever happens to be available:

bnode = RDF::Node.uuid
bnode.to_s #=> "_:504c0a30-0d11-012d-3f50-001b63cac539"

Working with RDF::Literal

All three types of RDF literals -- plain, language-tagged and datatyped -- are represented in RDF.rb as instances of the RDF::Literal class.

Creating a plain literal

Create plain literals by passing in a string to the RDF::Literal constructor:

hello = RDF::Literal.new("Hello, world!")

hello.plain?         #=> true
hello.has_language?  #=> false
hello.has_datatype?  #=> false

Note, however, that in most RDF.rb interfaces you will not in fact need to wrap language-agnostic, non-datatyped strings into RDF::Literal instances; this is done automatically when needed, allowing you the convenience of, say, passing in a plain old Ruby string as the object value when constructing an RDF::Statement instance.

Creating a language-tagged literal

To create language-tagged literals, pass in an additional ISO language code to the :language option of the RDF::Literal constructor:

hello = RDF::Literal.new("Hello!", :language => :en)

hello.has_language?  #=> true
hello.language       #=> :en

The language code can be given as either a symbol, a string, or indeed anything that responds to the #to_s method:

RDF::Literal.new("Hello!", :language => :en)
RDF::Literal.new("Wazup?", :language => :"en-US")
RDF::Literal.new("Hej!",   :language => "sv")
RDF::Literal.new("¡Hola!", :language => ["es"])

Creating an explicitly datatyped literal

Datatyped literals are created similarly, by passing in a datatype URI to the :datatype option of the RDF::Literal constructor:

date = RDF::Literal.new("2010-12-31", :datatype => RDF::XSD.date)

date.has_datatype?   #=> true
date.datatype        #=> RDF::XSD.date

The datatype URI can be given as any object that responds to either the #to_uri method or the #to_s method. In the example above, we've called the #date method on the RDF::XSD vocabulary class which represents the XML Schema datatypes vocabulary; this returns an RDF::URI instance representing the URI for the xsd:date datatype.

Creating implicitly datatyped literals

You'll be glad to hear that you don't always have to specify a datatype URI explicitly when creating a datatyped literal. RDF.rb supports a degree of automatic mapping between Ruby classes and XML Schema datatypes.

In most common cases, you can just pass in the Ruby value to the RDF::Literal constructor as-is, with the correct XML Schema datatype being automatically set by RDF.rb:

today = RDF::Literal.new(Date.today)

today.has_datatype?  #=> true
today.datatype       #=> RDF::XSD.date

The following implicit datatype mappings are presently supported by RDF.rb:

RDF::Literal.new(false).datatype               #=> RDF::XSD.boolean
RDF::Literal.new(true).datatype                #=> RDF::XSD.boolean
RDF::Literal.new(123).datatype                 #=> RDF::XSD.integer
RDF::Literal.new(9223372036854775807).datatype #=> RDF::XSD.integer
RDF::Literal.new(3.1415).datatype              #=> RDF::XSD.double
RDF::Literal.new(Date.new(2010)).datatype      #=> RDF::XSD.date
RDF::Literal.new(DateTime.new(2010)).datatype  #=> RDF::XSD.dateTime
RDF::Literal.new(Time.now).datatype            #=> RDF::XSD.dateTime
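
The mapping listed above can be pictured as a simple case/when dispatch on the Ruby class of the value. The following is only a plain-Ruby sketch of that idea: the xsd_datatype_for helper and the XSD string constant are hypothetical names introduced here, and RDF.rb's actual implementation may well differ.

```ruby
require 'date'

# The XML Schema datatypes namespace, as a plain string for this sketch.
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical helper: returns the datatype URI string for a Ruby value,
# or nil when no implicit mapping applies (for example, for a String).
def xsd_datatype_for(value)
  case value
  when true, false then XSD + "boolean"
  when Integer     then XSD + "integer"
  when Float       then XSD + "double"
  when DateTime    then XSD + "dateTime" # checked before Date: DateTime is a Date subclass
  when Time        then XSD + "dateTime"
  when Date        then XSD + "date"
  end
end

puts xsd_datatype_for(123)
puts xsd_datatype_for(Date.new(2010))
puts xsd_datatype_for(DateTime.new(2010))
```

Note the ordering of the branches: because DateTime inherits from Date, testing for Date first would classify every DateTime value as xsd:date.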

Working with RDF::Statement

RDF statements are represented in RDF.rb as instances of the RDF::Statement class. Statements can be triples -- constituted of a subject, a predicate, and an object -- or they can be quads that also have an additional context indicating the named graph that they are part of.

Creating an RDF statement

Creating a triple works exactly as you'd expect:

subject   = RDF::URI.new("http://rubygems.org/gems/rdf")
predicate = RDF::DC.creator
object    = RDF::URI.new("http://ar.to/#self")

RDF::Statement.new(subject, predicate, object)

The subject should be an RDF::Resource, the predicate an RDF::URI, and the object an RDF::Value. These constraints are not enforced, however, allowing you to use any duck-typed equivalents as components of statements.

Creating an RDF statement with a context

Pass in a URI reference in an extra :context option to the RDF::Statement constructor to create a quad:

context   = RDF::URI.new("http://rubygems.org/")
subject   = RDF::URI.new("http://rubygems.org/gems/rdf")
predicate = RDF::DC.creator
object    = RDF::URI.new("http://ar.to/#self")

RDF::Statement.new(subject, predicate, object, :context => context)

Creating an RDF statement from a hash

It's also worth mentioning that the RDF::Statement constructor is overloaded to enable instantiating statements from an options hash, as follows:

RDF::Statement.new({
  :subject   => RDF::URI.new("http://rubygems.org/gems/rdf"),
  :predicate => RDF::DC.creator,
  :object    => RDF::URI.new("http://ar.to/#self"),
})

The :context option can also be given, as before. Use whichever method of instantiating statements you happen to prefer.

Statement objects also support a #to_hash method that provides the inverse operation:

statement.to_hash   #=> { :subject   => ...,
                    #     :predicate => ..., 
                    #     :object    => ... }

Accessing RDF statement components

Access the RDF statement components -- the subject, the predicate, and the object -- as follows:

statement.subject   #=> an RDF::Resource
statement.predicate #=> an RDF::URI
statement.object    #=> an RDF::Value

Since statements can also have an optional context, the following will return either nil or else an RDF::Resource instance:

statement.context   #=> an RDF::Resource or nil

Working directly with triples and quads

Because RDF.rb is duck-typed, you can often directly use a three- or four-item Ruby array in place of an RDF::Statement instance. This can sometimes feel less cumbersome than instantiating a statement object, and it may also save some memory if you need to deal with a very large number of in-memory RDF statements. We'll see some examples of doing this later on.

Converting from statement objects to Ruby arrays is trivial:

statement.to_triple #=> [subject, predicate, object]
statement.to_quad   #=> [subject, predicate, object, context]

Likewise, instantiating a statement object from a triple represented as a Ruby array is straightforward enough:

RDF::Statement.new(*[subject, predicate, object])

Working with RDF::Graph

RDF graphs are represented in RDF.rb as instances of the RDF::Graph class. Note that most of the functionality in this class actually comes from the RDF::Enumerable and RDF::Queryable mixins, which we'll examine further below.

Creating an anonymous graph

Creating a new unnamed graph works just as you'd expect:

graph = RDF::Graph.new

graph.named? #=> false
graph.to_uri #=> nil

Creating a named graph

To create a named graph, just pass in a blank node or a URI reference to the RDF::Graph constructor:

graph = RDF::Graph.new("http://rubygems.org/")

graph.named? #=> true
graph.to_uri #=> RDF::URI.new("http://rubygems.org/")

Adding statements to a graph

To insert RDF statements into a graph, use the #<< operator or the #insert method:

graph << statement

graph.insert(*statements)

Let's add some RDF statements to an unnamed graph, taking advantage of the aforementioned duck-typing convenience that lets us represent triples directly using Ruby arrays, and plain literals directly using Ruby strings:

rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")
arto  = RDF::URI.new("http://ar.to/#self")

graph = RDF::Graph.new do
  self << [rdfrb, RDF::DC.title,   "RDF.rb"]
  self << [rdfrb, RDF::DC.creator, arto]
end

If you prefer, you can also be more explicit and use the equivalent #insert method form instead of the #<< operator:

graph.insert([rdfrb, RDF::DC.title,   "RDF.rb"])
graph.insert([rdfrb, RDF::DC.creator, arto])

Deleting statements from a graph

To delete RDF statements from a graph, use the #delete method:

graph.delete(*statements)

Deleting the statements we inserted in the previous example works like so:

graph.delete([rdfrb, RDF::DC.title,   "RDF.rb"])
graph.delete([rdfrb, RDF::DC.creator, arto])

Alternatively, we can use wildcard matching (where nil stands for a "match anything" wildcard) to simply delete every statement in the graph that has a particular subject:

graph.delete([rdfrb, nil, nil])

For even more convenience, since non-existent array subscripts in Ruby return nil, the following abbreviation is exactly equivalent to the previous example:

graph.delete([rdfrb])
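
Why the abbreviated form is equivalent can be shown without RDF.rb at all. The following plain-Ruby sketch uses a hypothetical matches? helper; it is not part of RDF.rb's API and makes no claim about how RDF.rb implements pattern matching internally. Plain strings stand in for URIs and literals.

```ruby
# Hypothetical helper for illustration only: a pattern component of nil
# matches anything; any other component must equal the triple's component.
def matches?(pattern, triple)
  (0..2).all? { |i| pattern[i].nil? || pattern[i] == triple[i] }
end

rdfrb = "http://rubygems.org/gems/rdf"
triples = [
  [rdfrb, "dc:title",   "RDF.rb"],
  [rdfrb, "dc:creator", "http://ar.to/#self"],
  ["http://rubygems.org/", "dc:title", "RubyGems"],
]

# Indexing past the end of a Ruby array returns nil, so the one-item
# pattern behaves exactly like the fully spelled-out one.
short = triples.select { |t| matches?([rdfrb], t) }
long  = triples.select { |t| matches?([rdfrb, nil, nil], t) }

puts short == long  #=> true
puts short.size     #=> 2
```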

Working with RDF::Enumerable

RDF::Enumerable is a mixin module that provides RDF-specific iteration methods for any object capable of yielding RDF statements.

In what follows we will consider some of the key RDF::Enumerable methods specifically as used in instances of the RDF::Graph class.

Checking whether any statements exist

Just as with most of Ruby's built-in collection classes, graphs support an #empty? predicate method that returns a boolean:

graph.empty?      #=> true or false

Checking how many statements exist

You can use #count -- or if you prefer, the equivalent alias #size -- to return the number of RDF statements in a graph:

graph.count

Checking whether a specific statement exists

If you need to check whether a specific RDF statement is included in the graph, use the following method:

graph.has_statement?(RDF::Statement.new(subject, predicate, object))

There also exists an otherwise equivalent convenience method that takes a Ruby array as its argument instead of an RDF::Statement instance:

graph.has_triple?([subject, predicate, object])

Checking whether a specific value exists

If you need to check whether a particular value is included in the graph as a component of one or more statements, use one of the following three methods:

graph.has_subject?(RDF::URI.new("http://rdf.rubyforge.org/"))

graph.has_predicate?(RDF::DC.creator)

graph.has_object?(RDF::Literal.new("Hello!", :language => :en))

Enumerating all statements

The following method yields every statement in the graph as an RDF::Statement instance:

graph.each_statement do |statement|
  puts statement.inspect
end

You can also use #each as a shorter alias for #each_statement, though we consider the more explicit form stylistically preferable.

If you don't require RDF::Statement instances and simply want to get directly at the triple components of statements, do the following instead:

graph.each_triple do |subject, predicate, object|
  puts [subject, predicate, object].inspect
end

Similarly, you can enumerate the graph using quads as well:

graph.each_quad do |subject, predicate, object, context|
  puts [subject, predicate, object, context].inspect
end

Note that for unnamed graphs, the yielded context will always be nil; for named graphs, it will always be the same RDF::Resource instance as would be returned by calling graph.context.

Obtaining all statements

If instead of enumerating statements one-by-one you wish to obtain all the data in a graph in one go as an array of statements, the following method does just that:

graph.statements  #=> [RDF::Statement(subject1, predicate1, object1), ...]

Naturally, there also exist the usual alternative methods that give you the statements in the form of raw triples or quads represented as Ruby arrays:

graph.triples     #=> [[subject1, predicate1, object1], ...]
graph.quads       #=> [[subject1, predicate1, object1, context1], ...]

Enumerating all values

A particularly useful set of methods is the following, which yield unique statement components from a graph:

graph.each_subject   { |value| puts value.inspect }
graph.each_predicate { |value| puts value.inspect }
graph.each_object    { |value| puts value.inspect }

For instance, #each_subject yields every unique statement subject in the graph, never yielding the same subject twice.

Obtaining all unique values

Again, instead of yielding unique values one-by-one, you can obtain them in one go with the following methods:

graph.subjects    #=> [subject1, subject2, subject3, ...]
graph.predicates  #=> [predicate1, predicate2, predicate3, ...]
graph.objects     #=> [object1, object2, object3, ...]

Here, #subjects returns an array containing all unique statement subjects in the graph, and #predicates and #objects do the same for statement predicates and objects respectively.

Working with RDF::Queryable

RDF::Queryable is a mixin that provides RDF-specific query methods for any object capable of yielding RDF statements. At present this means simple subject-predicate-object queries, but extended basic graph pattern matching will be available in a future release of RDF.rb.

In what follows we will consider RDF::Queryable methods specifically as used in instances of the RDF::Graph class.

Querying for specific statements

The simplest type of query is one that specifies all statement components, as in the following:

statements = graph.query([subject, predicate, object])

The result set here would contain at most one statement: the single matched statement if it exists in the graph, or no statements at all if the query didn't match.

The #query method can also take a block, in which case matching statements are yielded to the block one after another instead of returned as a result set:

graph.query([subject, predicate, object]) do |statement|
  puts statement.inspect
end

Querying with wildcard components

You can replace any of the query components with nil to perform a wildcard match. For example, in the following we query for all dc:title values for a given subject resource:

rdfrb = RDF::URI.new("http://rubygems.org/gems/rdf")

graph.query([rdfrb, RDF::DC.title, nil]) do |statement|
  puts "dc:title = #{statement.object.inspect}"
end

We can also query for any and all statements related to a given subject resource:

graph.query([rdfrb, nil, nil]) do |statement|
  puts "#{statement.predicate.inspect} = #{statement.object.inspect}"
end

The result sets returned by #query also implement RDF::Enumerable and RDF::Queryable, so it is possible to chain several queries to incrementally refine a result set:

graph.query([rdfrb]).query([nil, RDF::DC.title])

Likewise, it is of course possible to chain RDF::Queryable operations with methods from RDF::Enumerable:

graph.query([nil, RDF::DC.title]).each_subject do |subject|
  puts subject.inspect
end

The Mailing List

If you have feedback regarding RDF.rb, please contact us either privately or via the public-rdf-ruby@w3.org mailing list. Bug reports should go to the issue queue on GitHub.

Coming Up

In upcoming RDF.rb tutorials we will see how to work with existing RDF vocabularies, how to serialize and parse RDF data using RDF.rb, how to write an RDF.rb plugin, how to use RDF.rb with Ruby on Rails 3.0, and much more. Stay tuned!

RDF for Intrepid Unix Hackers: Grepping N-Triples

Arto Bendiken — 2010-03-04T16:00:00Z

The N-Triples format is the lowest common denominator for RDF serialization formats, and turns out to be a very good fit to the Unix paradigm of line-oriented, whitespace-separated data processing. In this tutorial we'll see how to process N-Triples data by pipelining standard Unix tools such as grep, wc, cut, awk, sort, uniq, head and tail.

To follow along, you will need access to a Unix box (Mac OS X, Linux, or BSD) with a Bash-compatible shell. We'll be using curl to fetch data over HTTP, but you can substitute wget or fetch if necessary. A couple of the examples require a modern AWK version such as gawk or mawk; on Linux distributions you should be okay by default, but on Mac OS X you will need to install gawk or mawk from MacPorts as follows:

$ sudo port install mawk
$ alias awk=mawk

Grokking N-Triples

Each N-Triples line encodes one RDF statement, also known as a triple. Each line consists of the subject (a URI or a blank node identifier), one or more characters of whitespace, the predicate (a URI), some more whitespace, and finally the object (a URI, blank node identifier, or literal) followed by a dot and a newline. For example, the following N-Triples statement asserts the title of my website:

<http://ar.to/> <http://purl.org/dc/terms/title> "Arto Bendiken" .

This is an almost perfect format for Unix tooling; the only possible further improvement would have been to define the statement component separator to be a tab character, which would have simplified obtaining the object component of statements -- as we'll see in a bit.

Getting N-Triples

Many RDF data dumps are made available as compressed N-Triples files. DBpedia, the RDFization of Wikipedia, is a prominent example. For purposes of this tutorial I've prepared an N-Triples dataset containing all Drupal-related RDF statements from DBpedia 3.4, which is the latest release at the moment and reflects Wikipedia as of late September 2009.

I prepared the sample dataset by downloading all English-language core datasets (20 N-Triples files totaling 2.1 GB when compressed) and crunching through them as follows:

$ bzgrep -h Drupal *.nt.bz2 > drupal.nt

To save you from gigabyte-sized downloads and an hour of data crunching, you can just grab a copy of the resulting drupal.nt file as follows:

$ curl http://blog.datagraph.org/2010/03/grepping-ntriples/drupal.nt > drupal.nt

The sample dataset totals 294 RDF statements and weighs in at 70 KB.

Counting N-Triples

The first thing we want to do is count the number of triples in an N-Triples dataset. This is straightforward to do, since each triple is represented by one line in an N-Triples input file and there are a number of Unix tools that can be used to count input lines. For example, we could use either of the following commands:

$ cat drupal.nt | wc -l
294

$ cat drupal.nt | awk 'END { print NR }'
294

Since we'll be using a lot more of AWK throughout this tutorial, let's stick with awk and define a handy shell alias for this operation:

$ alias rdf-count="awk 'END { print NR }'"

$ cat drupal.nt | rdf-count
294

Note that, for reasons of comprehensibility, the previous examples as well as most of the subsequent ones assume that we're dealing with "clean" N-Triples datasets that don't contain comment lines or other miscellanea. The DBpedia data dumps fit this bill very well. However, further on I will give "fortified" versions of these commands that can correctly deal with arbitrary N-Triples files.

Measuring N-Triples

We at Datagraph frequently use the N-Triples representation as the canonical lexical form of an RDF statement, and work with content-addressable storage systems for RDF data that in fact store statements using their N-Triples representation. In such cases, it is often useful to know some statistical characteristics of the data to be loaded in a mass import, so as to e.g. be able to fine-tune the underlying storage for optimum space efficiency.

A first useful statistic is to know the typical size of a datum, i.e. the line length of an N-Triples statement, in the dataset we're dealing with. AWK yields us N-Triples line lengths without much trouble:

$ alias rdf-lengths="awk '{ print length }'"

$ cat drupal.nt | rdf-lengths | head -n5
162
150
155
137
150

Note that N-Triples is an ASCII format, so the numbers above reflect both the byte sizes and the ASCII character counts of input lines. All non-ASCII characters are escaped in N-Triples, and for present purposes we'll be talking in terms of ASCII characters only.

The above list of line lengths in and of itself won't do us much good; we want to obtain aggregate information for the whole dataset at hand, not for individual statements. It's too bad that Unix doesn't provide commands for simple numeric aggregate operations such as the minimum, maximum and average of a list of numbers, so let's see if we can remedy that.

One way to define such operations would be to pipe the above output to an RPN shell calculator such as dc and have it perform the needed calculations. The complexity of this would go somewhat beyond mere shell aliases, however. Thankfully, it turns out that AWK is well-suited to writing these aggregate operations as well. Here's how we can extend our earlier pipeline to boil the list of line lengths down to an average:

$ alias avg="awk '{ s += \$1 } END { print s / NR }'"

$ cat drupal.nt | rdf-lengths | avg
242.517

The above, incidentally, is an example of a simple map/reduce operation: a sequence of input values is mapped through a function, in this case length(line), to give a sequence of output values (the line lengths) that is then reduced to a single aggregate value (the average line length). Though I won't go further into this just now, it is worth mentioning in passing that N-Triples is an ideal format for massively parallel processing of RDF data using Hadoop and the like.

Now, we can still optimize and simplify the above some by combining both steps of the operation into a single alias that outputs an average line length for the given input stream, like so:

$ alias rdf-length-avg="awk '\
  { s += length } \
  END { print s / NR }'"

Likewise, it doesn't take much more to define an alias for obtaining the maximum line length in the input dataset:

$ alias rdf-length-max="awk '\
  BEGIN { n = 0 } \
  { if (length > n) n = length } \
  END { print n }'"

Getting the minimum line length is only slightly more complicated. Instead of comparing against a zero baseline as above, we need to define a "roof" value to compare against. In the following, I've picked an arbitrarily large number, making the (at present) reasonable assumption that no N-Triples line will be longer than a billion ASCII characters, which would amount to somewhat less than a binary gigabyte:

$ alias rdf-length-min="awk '\
  BEGIN { n = 1e9 } \
  { if (length > 0 && length < n) n = length } \
  END { print (n < 1e9 ? n : 0) }'"

Now that we have some aggregate operations to crunch N-Triples data with, let's analyze our sample DBpedia dataset using the three aliases defined above:

$ cat drupal.nt | rdf-length-avg
242.517

$ cat drupal.nt | rdf-length-max
2179

$ cat drupal.nt | rdf-length-min
84

We can see from the output that N-Triples line lengths in this dataset vary considerably: from less than a hundred bytes to a little over two kilobytes, with an average of roughly 240 bytes. This variability is to be expected for DBpedia data, given that many RDF statements in such a dataset contain a long textual description as their object literal whereas others contain merely a simple integer literal.

Many other statistics, such as the median line length or the standard deviation of the line lengths, could conceivably be obtained in a manner similar to what I've shown above. I'll leave those as exercises for the reader, however, as further stats regarding the raw N-Triples lines are unlikely to be all that generally interesting.

Parsing N-Triples

It's time to move on to getting at the three components -- the subject, the predicate and the object -- that constitute RDF statements.

We have two straightforward choices for obtaining the subject and predicate: the cut command and good old awk. I'll show both aliases:

$ alias rdf-subjects="cut -d' ' -f 1 | uniq"
$ alias rdf-subjects="awk '{ print \$1 }' | uniq"

While cut might shave off some microseconds compared to awk here, AWK is still the better choice for the general case, as it allows us to expand the alias definition to ignore empty lines and comments, as we'll see later. On our sample data, though, either form works fine.

You may have noticed and wondered about the pipelined uniq after cut and awk. This is simply a low-cost, low-grade deduplication filter: it drops consecutive duplicate values. For an ordered dataset (where the input N-Triples lines are already sorted in lexical order), it will get rid of all duplicate subjects. In an unordered dataset, it won't do much good, but it won't do much harm either (what's a microsecond here or there?).

To fully deduplicate the list of subjects for a (potentially) unordered dataset, apply another uniq filter after a sort operation as follows:

$ cat drupal.nt | rdf-subjects | sort | uniq | head -n5
<http://dbpedia.org/resource/Acquia_Drupal>
<http://dbpedia.org/resource/Adland>
<http://dbpedia.org/resource/Advomatic>
<http://dbpedia.org/resource/Apadravya>
<http://dbpedia.org/resource/Application_programming_interface>

I've not made sort an integral part of the rdf-subjects alias because sorting the subjects is an expensive operation with resource usage proportional to the number of statements processed; when processing a billion-triple N-Triples stream, it is usually simply better to not care too much about ordering.

Getting the predicates from N-Triples data works exactly the same way as getting the subjects:

$ alias rdf-predicates="cut -d' ' -f 2 | uniq"
$ alias rdf-predicates="awk '{ print \$2 }' | uniq"

Again, you can apply sort in conjunction with uniq to get the list of unique predicate URIs in the dataset:

$ cat drupal.nt | rdf-predicates | sort | uniq | tail -n5
<http://www.w3.org/2000/01/rdf-schema#label>
<http://www.w3.org/2004/02/skos/core#subject>
<http://xmlns.com/foaf/0.1/depiction>
<http://xmlns.com/foaf/0.1/homepage>
<http://xmlns.com/foaf/0.1/page>

Obtaining the object component of N-Triples statements, however, is somewhat more complicated than getting the subject or the predicate. This is due to the fact that object literals can contain whitespace that will throw off the whitespace-separated field handling of cut and awk that we've relied on so far. Not to worry, AWK can still get us the results we want, but I won't attempt to explain how the following alias works; just be happy that it does:

$ alias rdf-objects="awk '{ ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"

The output of rdf-objects is the N-Triples encoded object URI, blank node identifier or object literal. URIs are output in the same format as subjects and predicates, with enclosing angle brackets; language-tagged literals include the language tag, and datatyped literals include the datatype URI:

$ cat drupal.nt | rdf-objects | sort | uniq | head -n5
"09"^^<http://www.w3.org/2001/XMLSchema#integer>
"16"^^<http://www.w3.org/2001/XMLSchema#integer>
"2001-01"^^<http://www.w3.org/2001/XMLSchema#gYearMonth>
"2009"^^<http://www.w3.org/2001/XMLSchema#integer>
"6.14"^^<http://www.w3.org/2001/XMLSchema#decimal>
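For the curious, the field handling inside the rdf-objects alias can be approximated in a few lines of Ruby. This is only a rough sketch, assuming every line is a well-formed N-Triples statement that ends in a space and a period. The helper name object_of is hypothetical, and unlike the alias it does not leave a trailing space after the object. Like the alias, it collapses runs of whitespace inside a literal into single spaces:

```ruby
# Rough Ruby approximation of the rdf-objects field handling.
# object_of is a hypothetical helper name, used here for illustration only.
def object_of(line)
  fields = line.split          # split on runs of whitespace, as awk does by default
  fields[2...-1].join(" ")     # fields 3 through NF-1: everything but the final "."
end

line = '<http://example.org/s> <http://example.org/p> "hello big world"@en .'
puts object_of(line)           # prints "hello big world"@en
```

The whitespace inside the literal is what defeats a plain cut -f 3: the object here spans three whitespace-separated fields, so they have to be stitched back together.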

Another very useful operation to have is getting the list of object literal datatypes used in an N-Triples dataset. This is also a somewhat involved alias definition, and requires a modern AWK version such as gawk or mawk:

$ alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"

$ cat drupal.nt | rdf-datatypes | sort | uniq
<http://www.w3.org/2001/XMLSchema#decimal>
<http://www.w3.org/2001/XMLSchema#gYearMonth>
<http://www.w3.org/2001/XMLSchema#integer>

As we can see, most object literals in this dataset are untyped strings, but there are some decimal and integer values as well as year + month literals.

Aliasing N-Triples

As promised, here follow more robust versions of all the aforementioned Bash aliases. Just copy and paste the following code snippet into your ~/.bash_aliases or ~/.bash_profile file, and you will always have these aliases available when working with N-Triples data on the command line.

# N-Triples aliases from http://blog.datagraph.org/2010/03/grepping-ntriples
alias rdf-count="awk '/^[ \t]*[^# \t]/ { n += 1 } END { print n }'"
alias rdf-lengths="awk '/^[ \t]*[^# \t]/ { print length }'"
alias rdf-length-avg="awk '/^[ \t]*[^# \t]/ { n += 1; s += length } END { print s/n }'"
alias rdf-length-max="awk 'BEGIN { n=0 } /^[ \t]*[^# \t]/ { if (length>n) n=length } END { print n }'"
alias rdf-length-min="awk 'BEGIN { n=1e9 } /^[ \t]*[^# \t]/ { if (length>0 && length<n) n=length } END { print (n<1e9 ? n : 0) }'"
alias rdf-subjects="awk '/^[ \t]*[^# \t]/ { print \$1 }' | uniq"
alias rdf-predicates="awk '/^[ \t]*[^# \t]/ { print \$2 }' | uniq"
alias rdf-objects="awk '/^[ \t]*[^# \t]/ { ORS=\"\"; for (i=3;i<=NF-1;i++) print \$i \" \"; print \"\n\" }' | uniq"
alias rdf-datatypes="awk -F'\x5E' '/\"\^\^</ { print substr(\$3, 1, length(\$3)-2) }' | uniq"

I should also note that though I've spoken throughout only in terms of N-Triples, most of the above aliases will work fine also for input in N-Quads format.

In the next installments of RDF for Intrepid Unix Hackers, we'll attempt something a little more ambitious: building an rdf-query alias to perform subject-predicate-object queries on N-Triples input. We'll also see what to do if your RDF data isn't already in N-Triples format, learning how to install and use the Raptor RDF Parser Library to convert RDF data between the various popular RDF serialization formats. Stay tuned.

Hacking on RDF in Ruby

Ben Lavender — 2010-02-28T07:20:17Z

RDF.rb is easily the most fun RDF library I've used. It uses Ruby's dynamic system of mixins to create a library that's very easy to use.

If you're new to Ruby, you might know about mixins from other languages--Scala traits, for example, are almost exactly functionally equivalent. They're distinctly more powerful than Java interfaces or abstract classes. A mixin is basically an interface and an abstract class rolled into one. Rather than extending an abstract class, you include a mixin in your own class. A mixin will usually require that the including class implement a particular method. Ruby's own Enumerable module, for example, requires that including classes implement #each. For that tiny bit of trouble, you get a ton of methods (listed here), including iterators, mapping, partitions, conversion to arrays, and more. (If you're new to Ruby, it might also help you to know that #method_name means 'an instance method named method_name'.)
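To make the Enumerable bargain concrete, here is a minimal sketch in plain Ruby, with no RDF involved. The LineBuffer class is hypothetical and exists only for illustration; the only method it defines beyond its constructor is #each, and everything else comes from the standard Enumerable module:

```ruby
# LineBuffer is a hypothetical class used only for illustration.
class LineBuffer
  include Enumerable

  def initialize(*lines)
    @lines = lines
  end

  # The one method Enumerable asks for: yield each element in turn.
  def each(&block)
    return enum_for(:each) unless block_given?
    @lines.each(&block)
    self
  end
end

buffer = LineBuffer.new("a", "bbb", "cc")

lengths     = buffer.map(&:length)                   # mapping
longest     = buffer.max_by(&:length)                # iteration helpers
long, short = buffer.partition { |l| l.length > 1 }  # partitions
as_array    = buffer.to_a                            # conversion to an array
```

None of #map, #max_by, #partition or #to_a is defined in LineBuffer itself; they all arrive with the include.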

RDF.rb uses the principle extensively. RDF::Repository is, in fact, little more than an in-memory reference implementation for 4 traits: RDF::Enumerable, RDF::Mutable, RDF::Queryable, and RDF::Durable. RDF::Sesame::Repository has the exact same interface as the in-memory representation, but is based entirely on a Sesame server. In order to work as a repository, RDF::Sesame::Repository only had to extend the reference implementation and implement #each, #insert_statement, and #delete_statement. Nice! Of course, implementing those took some doing, but it's still exceedingly easy.

RDF::Enumerable is the key here. In exchange for implementing an #each that yields RDF::Statement objects, one gains a ton of functionality: #each_subject, #each_predicate, #each_object, #each_context, #has_subject?, #has_triple?, and more. It's a key abstraction that provides huge amounts of functionality.
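The following toy sketch shows how methods of this kind can be layered on top of a single #each. It is not the RDF.rb implementation: Statement, StatementEnumerable and TinyStore are hypothetical stand-ins written in plain Ruby, and only two of the derived methods are imitated:

```ruby
# Statement, StatementEnumerable and TinyStore are hypothetical stand-ins,
# not the actual RDF.rb classes.
Statement = Struct.new(:subject, :predicate, :object)

module StatementEnumerable
  include Enumerable

  # Yield each distinct subject once, built purely on #each.
  def each_subject
    return enum_for(:each_subject) unless block_given?
    seen = {}
    each do |statement|
      next if seen.key?(statement.subject)
      seen[statement.subject] = true
      yield statement.subject
    end
  end

  # True if any statement has the given subject.
  def has_subject?(value)
    any? { |statement| statement.subject == value }
  end
end

class TinyStore
  include StatementEnumerable

  def initialize(statements)
    @statements = statements
  end

  # The only method the store itself has to supply.
  def each(&block)
    return enum_for(:each) unless block_given?
    @statements.each(&block)
  end
end

store = TinyStore.new([
  Statement.new(:alice, :knows, :bob),
  Statement.new(:alice, :name,  "Alice"),
  Statement.new(:bob,   :name,  "Bob"),
])

subjects = store.each_subject.to_a   # distinct subjects, in first-seen order
```

A store backed by a file or a remote server would only need a different #each; the derived methods would not change.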

But the module system goes the other way--not only is it easy to implement new RDF models, existing ones are easily extended. I recently wrote RDF::Isomorphic, which extends RDF::Enumerable with #bijection_to and #isomorphic_with? methods. The module-based system provided by RDF.rb means that my isomorphic methods are now available on RDF::Sesame::Repositories, and indeed anything which includes RDF::Enumerable. This is everything from repositories to graphs to query results! In fact, query results themselves implement RDF::Enumerable, and thus implement RDF::Queryable and can be checked for isomorphism, or whatever else you want to add. This is functionality that Sesame does not have natively, and which I wrote for a completely different purpose (testing parsers). Every RDF::Enumerable gets it for free because I wanted to compare 2 textual formats. Neat!

For example, here's what it takes to extend any RDF collection, from RDF::Isomorphic:

require 'rdf'
module RDF
  ##
  # Isomorphism for RDF::Enumerables
  module Isomorphic

    def isomorphic_with?(other)
      # code that uses #each, or any other method from RDF::Enumerable goes here
      ...
    end

    def bijection_to(other)
      # code that uses #each, or any other method from RDF::Enumerable goes here
      ...
    end
  end

  # re-open RDF::Enumerable and add the isomorphic methods
  module Enumerable
    include RDF::Isomorphic
  end
end

Of course, this just can't be done without monkey patching. Mixins and monkey patching together make for a powerful toolkit. To my knowledge, this is the first RDF library that takes advantage of these features.
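The re-opening pattern above can be tried out without RDF.rb at all. In this sketch every name (Triples, Summarizable, MemoryStore, predicate_counts) is hypothetical: Triples plays the role of RDF::Enumerable, Summarizable the role of an add-on such as RDF::Isomorphic, and triples are plain three-element arrays:

```ruby
# All names here (Triples, Summarizable, MemoryStore) are hypothetical.
module Triples            # plays the role of RDF::Enumerable
  include Enumerable
end

module Summarizable       # plays the role of an add-on such as RDF::Isomorphic
  # Uses nothing but Enumerable methods, which in turn rely on #each.
  def predicate_counts
    each_with_object(Hash.new(0)) do |(_, predicate, _), counts|
      counts[predicate] += 1
    end
  end
end

# Re-open the core mixin and pull in the add-on.
module Triples
  include Summarizable
end

# Defined after the include above so the example behaves the same on
# every Ruby version (before Ruby 3.0, an include into a module did not
# reach classes that had already included that module).
class MemoryStore
  include Triples

  def initialize(triples)
    @triples = triples
  end

  def each(&block)
    @triples.each(&block)
  end
end

store  = MemoryStore.new([[:a, :knows, :b], [:a, :name, "A"], [:b, :name, "B"]])
counts = store.predicate_counts
```

MemoryStore never mentions Summarizable, yet it responds to #predicate_counts because the method arrived through the mixin it already includes.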

It's possible to provide powerful features to a wide range of implementations with this. RDF.rb does not yet have an inference layer, but any such layer would instantly work for any store which implements RDF::Enumerable. Want to prototype some custom business logic that operates over existing RDF data? Copy it into a local repository and hack away. No need for the production RDF store to be the same at all, but you can still apply the same code.

As a counter-example, compare this to the Java RDF ecosystem. There are some excellent implementations (RDF::Isomorphic is heavily indebted to Jena), but they're all incompatible. Jena's check for isomorphism is not really translatable to Sesame, or anything else. RDF.rb, in addition to providing a reference implementation, acts as an abstraction layer for underlying RDF implementations. The difference is night and day--with RDF.rb, you only need to implement a feature once, at the API layer, to have it apply to any implementation. This is not a knock at the very talented people behind those Java implementations; making this happen is a lot of work in a language without monkey patching, and RDF.rb is only as good as it is because of the significant influence those projects have had on Arto's design.

The end result of the mixin-based approach is a system that is incredibly easy to extend, and just downright fun. It would be a fairly simple task to extend a Ruby class completely unrelated to RDF with an #each method that yields statements, allowing it to work in RDF::Enumerable. Voila, your existing classes now have an RDF representation. Along the same lines, if one is bothered by the statement-oriented nature of RDF.rb, building a system which took a resource-oriented view would not require one to 'break away' from the RDF.rb ecosystem. Just build your subject-oriented model objects and implement #each, and away you go--you can now run RDF queries and test isomorphism on your model. Build it to accept an RDF::Enumerable in the constructor and you can use any existing repository or query to initialize your model.
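As a rough illustration of that resource-oriented idea, here is a sketch in plain Ruby. Person is a hypothetical model class, the standard Enumerable module stands in for RDF::Enumerable, and triples are plain [subject, predicate, object] arrays rather than RDF::Statement objects:

```ruby
# Person is a hypothetical model class; triples are plain
# [subject, predicate, object] arrays rather than RDF::Statement objects.
class Person
  include Enumerable   # stand-in for RDF::Enumerable

  attr_reader :id, :name, :homepage

  # Build a model from anything whose #each yields triples.
  def self.from_triples(id, triples)
    attrs = {}
    triples.each do |subject, predicate, object|
      attrs[predicate] = object if subject == id
    end
    new(id, attrs[:name], attrs[:homepage])
  end

  def initialize(id, name, homepage)
    @id, @name, @homepage = id, name, homepage
  end

  # Expose the model's attributes as a stream of triples.
  def each
    return enum_for(:each) unless block_given?
    yield [id, :name, name]
    yield [id, :homepage, homepage]
  end
end

alice = Person.new(:alice, "Alice", "http://example.org/alice")
copy  = Person.from_triples(:alice, alice)   # round trip through triples
```

Because a Person is itself a source of triples, one model object can be used to initialize another, and the same factory would accept any other object that yields triples from #each.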

RDF.rb is not yet ready for production use, but it's under heavy development and already quite useful. Give it a shot. You can post any issues in the GitHub issue queue.

Using jQuery with Rails 3.0 Beta

Josh Huckabee — 2010-02-08T22:04:52Z

One of the most talked about features in Rails 3 is its plug & play architecture with various frameworks like DataMapper in place of ActiveRecord for the ORM or jQuery for JavaScript. However, I've yet to see much info on how to actually do this with the JavaScript framework.

Fortunately, it looks like a lot of the hard work has already been done. Rails now emits HTML that is compatible with the unobtrusive approach to JavaScript. That means that instead of seeing a delete link like this:

<a href="/users/1" onclick="if (confirm('Are you sure?')) { var f = document.createElement('form'); f.style.display = 'none'; this.parentNode.appendChild(f); f.method = 'POST'; f.action = this.href;var m = document.createElement('input'); m.setAttribute('type', 'hidden'); m.setAttribute('name', '_method'); m.setAttribute('value', 'delete'); f.appendChild(m);f.submit(); };return false;">Delete</a>

you'll now see it written as

<a rel="nofollow" data-method="delete" data-confirm="Are you sure?" class="delete" href="/users/1">Delete</a>

This makes it very easy for a JavaScript driver to come along, pick out and identify the relevant pieces, and attach the appropriate handlers.

So, enough blabbing. How do you get jQuery working with Rails 3? I'll try to make this short and sweet.

Grab the jQuery driver at http://github.com/rails/jquery-ujs and put it in your javascripts directory. The file is at src/rails.js.

Include jQuery (I just use the Google-hosted version) and the driver in your application layout or view. In HAML it would look something like this:

= javascript_include_tag "http://ajax.googleapis.com/ajax/libs/jquery/1.4.1/jquery.min.js"
= javascript_include_tag 'rails'

Rails requires an authenticity token to do form posts back to the server. This helps protect your site against CSRF attacks. In order to handle this requirement the driver looks for two meta tags that must be defined in your page's head. This would look like:

<meta name="csrf-token" content="<%= form_authenticity_token %>" />
<meta name="csrf-param" content="authenticity_token" />

In HAML this would be:

%meta{:name => 'csrf-token', :content => form_authenticity_token}
%meta{:name => 'csrf-param', :content => 'authenticity_token'}

Update: Jeremy Kemper points out that the above meta tags can be written out with a single call to "csrf_meta_tag".

That should be all you need. Remember, this is still a work in progress, so don't be surprised if there are a few bugs. Please also note this has been tested with Rails 3.0.0.beta.

Is W3C Going the Wrong Direction with SPARQL 1.1?

Ben Lavender — 2009-11-03T02:02:11Z

The W3C SPARQL working group (previously the Data Access Working Group) has recently released their first versions of the updated SPARQL standards, or SPARQL 1.1. The group's roadmap has these finalized a year from now, but they have asked for comments and I suppose these are mine.

I believe that these documents are a step further down a wrong path for SPARQL and, to a lesser degree, for RDF in general.

The latest round of updates makes a number of changes to SPARQL, including aggregate functions, subqueries, projection expressions, negations, updates and deletions, more specific HTTP protocol bindings, service discovery, entailment regimes, and a RESTful protocol for managing RDF graphs (the last one is not really just SPARQL, but it's in the updates).

So I'll start with my comments, which are mostly critical.

To start, an RDF-specific complaint, not really related to the rest of the post. Why would the one mandated format to be supported in the new RESTful RDF graph management interface be RDF/XML? What would it take for the semweb community to move on from this failed standard, which has had known issues for more than 5 years? (Those two issues were raised in 2001 and are currently marked 'postponed'.) Why should such an increasingly irrelevant standard as RDF/XML be chosen instead of the widely-supported and easy to implement N3, N-Triples, or Turtle?

As for SPARQL, the 1.1 standards continue to give named graphs first class citizen status, both in the web APIs and in more SPARQL syntax than they had before. It's not so much triples as quads these days. Other meta-metadata, such as time of assertion or validity time, are not covered. While named graphs are admittedly a particularly often-found case, why do they need to invade the syntax of SPARQL? Not every use case needs named graphs, but every SPARQL implementor must support them. The 1.1 standard now includes precedence rules for named graph and base URIs when they conflict in HTTP query options and inside the query itself, attempting to solve this self-created problem.

How about subqueries? What about variables during insertions? What about subqueries during insertions? Do we really need implementors to consider these kinds of things for every SPARQL endpoint on the web?

None of these things is really all that bad by itself, but one must consider the bigger picture. SPARQL 1.0 was released in January of 2008 (with some comment period before that) and there is still no implementation of a SPARQL engine in PHP or Ruby (exceptions apply, see [1]). One does not increase the participation of that ecosystem by adding a selection of entailment regimes to the standard.

While a SPARQL implementation exists for the excellent RDFLib in Python, it's only one of the current big 3 (with Ruby and PHP) in web development, and there's only one. The fact that no SPARQL engines exist for Ruby or PHP should be considered a failure of the standard. Why are we adding complexity when there is no SQLite for SPARQL? Why are there at least 3 monolithic Java implementations (Jena, Sesame, Boca), all financially sponsored to some degree or another, but so little 'in the wild'? How long can RDFLib herd 16 cats as committers on the project? While I don't have a lot of direct experience with RDFLib, I pity the project 'leads' (I cannot find evidence that the project is sponsored or that anyone is 'in charge') trying to look towards the future of implementing 6 working papers of new standards.

One of the biggest success stories for semweb in widespread use is the Drupal RDF module, which has found wide acceptance in the Drupal community and started an ecosystem of modules. Drupal 7 will output RDFa by default and Drupal 6 supports a ton of wonderful features, including reversing the RSS 1.0 to 2.0 downgrade back to RDF. But Drupal remains a producer of simple triples and a consumer of SPARQL queries generated by other endpoints. Data in those sites remains locked down. Why? Because implementing SPARQL in PHP is nontrivial, and in a chicken-egg problem, nobody's paying for it before someone has a need for SPARQL.

I could go on, but these are symptoms (well, not that RDF/XML thing, I don't think there's a good reason for that). I feel that the working group is attempting to solve the wrong problem. Namely, it is attempting to define a somewhat-human-readable query language, SPARQL, that works for almost all use cases. But why must the whole 'kitchen sink' be well-defined? Such a standards body should be attempting to define the easiest possible thing to implement and extend, not the last tool anyone would ever use.

The SPARQL 1.0 standard's grammar was well-defined as a context free grammar. It also had extension functions, which were uniquely defined by URIs. Why the distinction between CFG elements and extension functions? Why not make syntax elements like named graphs and aggregate functions as discoverable as extensions? Well, the reason is that it's hard to write a parser of a human-readable format and make those things optional and discoverable. (Here's a SPARQL parser implementation in Scala, a language with powerful pattern matching features for good parsing, and it's 500 lines of code. It compiles to S-expressions, the parsing of which is about 30 lines. Hmm.)

If the protocol had been defined as S-expressions, the distinction would not exist and the syntax could be as expandable as the current functions (the current syntax would just be more functions). The new 1.1 service discovery mechanism is excellent and extendible and would allow the standard to grow dynamically instead of becoming bogged down in features for particular use cases. New baseline implementations of SPARQL would be easy to implement and grow incrementally, and the current human-readable format can be implemented in terms of these expressions.

The web of ontologies has grown through ad-hoc definitions that people created to fill their own needs. Standards grow organically around the ones that are needed most; others languish. Why should SPARQL functions have this kind of flexibility, but not the syntax? The distinction makes implementation overly difficult and is slowing the expansion of the Semantic Web.

In fact, it turns out that Jena has been parsing to S-expressions for some time. If you're an implementor, why would you do it any other way, especially when the standard can change as much as it does in 1.1? Any implementation will have to come up with something equivalent to S-expressions if you are going to be able to upgrade your engine implementation to meet standards like this when they are finalized. If people are doing it anyway, why not just make it the standard?

The SPARQL Working Group should be working on a definition for a function list and discovery protocol for S-expressions, and not for what we currently call SPARQL. What we call SPARQL is something that should compile to a simpler standard if various vendors want to implement it. S-expressions allow maximally simple parsing, maximally simple serialization, and the ability to do feature discovery on core features of the language, not just portions which are blessed with the ability to be extended. S-expressions are easier for machines to generate for a wide variety of automated use cases, far wider, I would venture, than the set of use cases for the human-readable queries.
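To give a feel for how little machinery S-expression parsing needs, here is a minimal reader sketched in Ruby. It is an illustration only: the function names are hypothetical, and the query text in the example is an invented notation, not the syntax of any existing engine or of the SPARQL algebra:

```ruby
# A minimal S-expression reader. tokenize, read_sexp and parse_sexp are
# hypothetical names; the query notation below is invented for this example.

# Split the source into parentheses, double-quoted strings and bare atoms.
def tokenize(source)
  source.scan(/[()]|"(?:\\.|[^"\\])*"|[^\s()"]+/)
end

# Consume tokens from the front of the array and build nested arrays.
def read_sexp(tokens)
  raise ArgumentError, "unexpected end of input" if tokens.empty?
  token = tokens.shift
  case token
  when "("
    list = []
    list << read_sexp(tokens) until tokens.empty? || tokens.first == ")"
    raise ArgumentError, "missing closing parenthesis" if tokens.empty?
    tokens.shift   # discard the ")"
    list
  when ")"
    raise ArgumentError, "unexpected closing parenthesis"
  else
    token          # atoms are kept as plain strings
  end
end

def parse_sexp(source)
  tokens = tokenize(source)
  expression = read_sexp(tokens)
  raise ArgumentError, "trailing input" unless tokens.empty?
  expression
end

query = parse_sexp('(select (?s ?o) (bgp (triple ?s <http://example.org/p> ?o)))')
operator = query.first   # the head of the list names the operation
```

Since the head of every list is just a name, an engine could look that name up in a table of supported operations and reject anything it does not implement, which is the kind of discoverability argued for above.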

Please, please, please do not doom the world to write the SPARQL equivalent of SQLAlchemy and ActiveRecord for the next 20 years! We can define a standard that machines can use natively. Now's the time.

At any rate, that's my beef in a nutshell. The working group won't come up with a successful standard until it's easy enough to implement it that workable implementations appear in the languages that are defining the web today. And when people can use those languages to implement that standard without an army of VC-funded engineers.

The SPARQL 1.1 proposals make the standard better than before, but it's not the standard we need. The SPARQL algebra is what needed expansion and specification, not the syntax.

[1]: The PHP ARC project has an implementation, but it attempts to directly convert SPARQL to an SQL query on a particular table layout in MySQL, and is difficult to convert to general use. Despite SPARQL's complexity, ARC managed to implement this in just 6400 lines of code. The parser alone is 2000 lines and the engine another 4400. The serialization/parsing libraries, however, are fine, and were integrated successfully into the Drupal RDF module. The PHP RAP project has also done some good work and is perhaps more wrappable than ARC, but implements only a subset of SPARQL.