Working notes

Tweaking the Power Limit

2015-02-17T22:45:00.004-06:00

After posting the power-limit function earlier I found myself thinking about the inelegance of the expression. I like the approach, but the code seems much more verbose than should be required. In particular, the (partial apply not=) expression seemed too long for something so simple.

As a reminder, here is my last iteration on the function:

(defn power-limit [f a]
(let [s (iterate f a)]
  (ffirst
  (drop-while (partial apply not=)
  (map vector s (rest s))))))

The first thing that occurred to me here is the use of vector to pair up the values. It's this pairing that makes the predicate for drop-while more complex. Perhaps I should avoid the pair?

The reason for this pair is so that they can be compared for equality in the next step. So why not just compare them straight away? But that would leave me with a true/false value, when I need to get the values being compared when they are equal. What I really needed was a false/value result, and this reminded me of how a set lookup works. For instance (#{2} 1) returns nil, while (#{2} 2) returns 2. This is what I need, so why not try it?

(defn power-limit [f a]
(let [s (iterate f a)]
  (first
  (drop-while not
  (map #(#{%1} %2) s (rest s))))))

The anonymous function in the map looks a little like line noise, but it's not too bad.

Now that the result of map is a value and not a pair, we only need first and not ffirst. The drop-while predicate also gets simpler, just needing a not, but it's still verbose. However, the drop-while is just removing the leading nils from the results of the map, so let's move to keep instead:

(defn power-limit [f a]
(let [s (iterate f a)]
(first
(keep #(#{%1} %2) s (rest s)))))

So does this work?
=> (power-limit has-fixpoint 0)
ArityException Wrong number of args (3) passed to: core/keep clojure.lang.AFn.throwArity (AFn.java:429)

Doh! It turns out that keep only has a single arity and does not accept multiple collections like map does. So I'm stuck doing a (drop-while not ...) or (filter identity ...) or (remove nil? ...) to wrap the map operation.

The fact that I can remove all the leading nils, or all the nils (since all the nils are at the head here), made me take a step back again to think about what I'm doing. Once all the nils are out of the picture, I'm just looking for the first item in the resulting list. That's when I remembered some which tests all the items in a list according to its predicate, returning truthy/falsey, but the way it does this is to just return the first non-falsey value its predicate gives. So I still need to wrap the map operation, but I can drop the call to first.

(defn power-limit [f a]
(let [s (iterate f a)]
(some identity (map #(#{%1} %2) s (rest s)))))

Not as pretty as the very first implementation (which used loop/recur) but now it's not so verbose, so I'm feeling happier about it.

Life in Clojure

2015-02-17T15:08:00.000-06:00

A few months ago I was looking at the Life in APL video on YouTube and it occurred to me that other than the use of mathematical symbols for functions, the approach was entirely functional. This made me wonder if it would be easier (and perhaps clearer) to do it in Clojure. My goal was to show that Clojure was just as capable, and hopefully easier than APL.

I created Life in Clojure to do this, and it worked out pretty well. What I need to do now is to record a REPL session for YouTube, just like the original. However, there are a couple of things that were shown in the APL video that are missing in Clojure, and I needed to fill in the gaps. It's these gaps that I've been thinking about ever since, wondering if I could/should do better.

Since I don't have a video, Life in Clojure is implemented in a namespace that contains a REPL session. Just use "lein repl" and paste in that namespace file and it'll do everything in the same order as the APL video, with a couple of deviations. I've commented each operation with the appropriate dialog from the original APL video, so it should be easy to see the parallels. However, for anyone who wants to see it running, the final program is implemented in the life.core namespace.

Setup

The first change between these Clojure and APL implementations is that Clojure needs to explicitly import things into the environment, while the APL video seemed to demonstrate all operations being available by default. Given that most of these were matrix operations, and APL specializes in matrix operations, then it isn't surprising that these are built in. Fortunately, all the applicable operations are available to Clojure via core.matrix, so this was a simple require, along with a project dependency. I suppose I will need to show these at the start of any video I make (if I ever make it).

Output

The APL video showed the use of a display function that prints a matrix to the console. Clojure's pretty-print will also show an m*n matrix reasonably well, and I could have chosen to use that. However, the APL code is also printing a vector, and then a matrix, where each element is an m*n matrix, and pretty-print just messes that up.

The APL demo seemed to be using a library function to handle this printing, so it didn't feel like cheating to write my own. It's a bit clunky, as it's specialized to only handle 2, 3 or 4 dimensions, but it works.

The bigger issue was showing the current state of the Life playing field. APL was using a draw function to update the contents of a text file, and opened a window that dynamically showed the contents of that file as it was updated. Using a file proxy like this made the display easy in APL, but it's hack that I did not think was worth duplicating. Instead, I created a function to convert the playing field into an image, and sent the image to the display.

Creating an image from the matrix was an easy Swing-based function (just draw filled-in rectangles based on the contents of each matrix cell, and return the image). However, rendering it was a little harder.

Following the APL approach, I wanted a function that would "send" the image to the screen, and forget about it. However, Swing requires an event loop, and several listener objects to allow the window to interact with the environment. These sorts of objects do not fit cleanly into a purely function program, but using proxies and closures does a reasonably good job.

I was able to hide the default window parameters (width/height, etc), but had trouble with the fact that the APL program simply sends the output to the rendered file without defining it as a parameter. The main problem is that the open-window function should return the created resource (the window) and this value should be captured and "written to" each time the life algorithm needs to re-render the playing field. I could do that, but introducing this resource handling into the main algorithm would get in the way of the "clarity" that the demo was trying to express. In the end, I opted to create a global atom (called canvas) that gets updated with each call to draw. I would not usually write my code this way, but it seemed to be the best way to duplicate the APL demo.

Of course, not a lot of programs use Swing these days. The majority of programs I see with GUIs use a browser for the UI. So when a couple of my friends saw this app, they suggested reimplementing it in ClojureScript, and rendering in a web page. Funnily enough, the DOM ends up being an equivalent to the global canvas that I have in the Swing application, so the calling code would not need to change its approach. However, core.matrix does not yet support ClojureScript, so I haven't tried this yet.

Matrix Math/Logic Operations

One minor inconvenience in using the APL approach is that it uses the C-style idiom that the number 0 is the value for "false" and anything else is "true". The APL demo algorithm uses 1 to indicate that a cell is occupied, and then uses addition between cells to determine the values for subsequent generations. This creates an elegant shortcut for determining occupation for subsequent generations in Life, but it isn't as elegant in Clojure.

Of course, Clojure truthy values identify all numbers as "true", and only nil or boolean "False" have a falsey value. Without the zero=false equivalence, then the algorithm gets more awkward. Based on that, I created a function that takes a single argument boolean function and returns an new function that returns the same result mapping false->0 and true->1. This allows matrix addition to be used as the basis of the AND and OR operations, just like in the APL demo. It works fine, but I'm uncomfortable that it currently only handles the 2 argument case for the AND and OR matrix operations. Also, I've since learned a little more about core.matrix, and I think there may be better ways to define these functions. I'll have to look into that a little more.

There is also a convenience function I created which is used to map a matrix to boolean 1/0 based on each cell being equal to a given value. It's a simple one-liner that makes it easier to compare to the equivalent APL code but it always felt a little inelegant. However, this might just be due to the need to map booleans to numbers.

The other APL function that wasn't directly available to me was the "take" of a matrix. This operation allows a 2-D "slice" from a larger matrix, but also deals with the insertion of a small matrix into a larger "empty" matrix. After some examining of core.matrix I discovered that shape provided similar functionality, though it needs help from compute-matrix to handle expansions. I wrapped this in the function takeof to duplicate how the APL demo used it.

Powers and Power Limits

Up to this point, everything was either translating a library from APL (display, the graphics output, or some simple wrappers to smooth out matrix access, and handle the integer/boolean equivalence in APL). But right at the end of the video, the APL demonstration uses a couple of functions that had no equivalents in the Clojure. These are the power function and the power-limit function.

The behavior of the power function is to take a single argument function and apply it to an argument, take the result and apply the function to that, and keep repeating the application of the function as many time as requires. The specific operation of power is to return a function that applies the argument function iteratively like this for the defined number of times.

As an example, the 5th power of clojure.core/inc applied to 0 is:

=> (inc (inc (inc (inc (inc 0)))))
5

So (power-function inc 5) returns a function that applies inc 5 times:

=> ((power-function inc 5) 3)
8

The power-limit function is similar, in that it iteratively applies the provided function, however it does not stop iterating until it reaches a fixpoint value. That is, until the output of the function is equal to the input.

An example function that has a fixpoint is:

(defn has-fixpoint [x] (min (inc x) 3))

This returns either the increment or 3, whichever is smaller. If this is applied iteratively, then it will always reach a value of 3 and stay there.

My first approach to this naïvely executed the provided function n times:

(defn power [f n] (case n
0 identity
1 f
(fn [x]
(loop [r x, i n]
(if (zero? i)
r
(recur (f r) (dec i)))))))

This works, but one look at it was enough to tell me I'd gone down a blind alley. case statements always suggest that something is inelegant, and loop/recur often means that the Clojure's laziness and seq handling are being ignored, which implies that I may have missed an abstraction. Both are a code smell.

That's when I remembered clojure.core/iterate. This function applies a function iteratively, returning an infinite lazy sequence of these iterations. The required power is just the nth entry in that sequence, so the power function just becomes:

(defn power [f n] #(nth (iterate f %) n))

Much better.

Similarly, I'd implemented the power-limit function using a loop until the input equalled the output:

(defn power-limit [f a]
(loop [a a]
  (let [r (f a)]
  (if (= r a)
  r
(recur r)))))

Unlike the original version of the power function, I like this approach, since it clearly does exactly what it is defined to do (apply the function, compare input to output, and either return the output or do it again).

I already mentioned that loop/recur often suggests that I've missed an abstraction. This, along with neglecting to use the iterate function for power, had me wondering if there was a more idiomatic approach here. That's when I realized that I could pair up the iterated sequence of the function with its successor, and compare them. The result is:

(defn power-limit [f a]
(let [s (iterate f a)]
  (ffirst
  (drop-while (partial apply not=)
  (map vector s (rest s))))))

The iteration sequence is captured in the let block so that it can be re-used when pairing the sequences in the map operation. The vector is being used here to capture each iteration with its successor in vectors of pairs. Because iterate is lazy, the sequence of iterations is only calculated as needed, and only calculated once.

The drop-while drops off all pairs of values where the values are not equal. The not= operation can take multiple arguments, so rather than pull each of the two values out, I'm using partial/apply to provide the pair as all the arguments. The result of the drop-while/map is a sequence of pairs of argument/results (which are all equal). The final result is to take the first value from the first pair, hence the use of ffirst.

Conceptually, I like it more, and I think that it fits into Clojure better. However, it's less clear to read then the loop/recur solution. So I'm torn as to which one I like more.

DDD on Mavericks

2013-12-08T18:34:00.002-06:00

Despite everything else I'm trying to get done, I've been enjoying some of my time working on Gherkin. However, after merging in a new feature the other day, I caused a regression (dammit).

Until now, I've been using judiciously inserted calls to echo to trace what has been going on in Gherkin. That does work, but it can lead to bugs due to interfering with return values from functions, needs cleanup, needs the code to be modified each time something new needs to be inspected, and can result in a lot of unnecessary output before the required data shows up. Basically, once you get past a particular point, you really need to adopt a more scalable approach to debugging.

Luckily, Bash code like Gherkin can be debugged with bashdb. I'm not really familiar with bashdb, so I figured I should use DDD to drive it. Like most Gnu projects, I usually try to install them with Fink, and sure enough, DDD is available. However, installation failed, with an error due to an ambiguous overloaded + operator. It turns out that this is due to an update in the C++ compiler in OSX Mavericks. The code fix is trivial, though the patch hasn't been integrated yet.

Downloading DDD directly and running configure got stuck on finding the X11 libraries. I could have specified them manually, but I wasn't sure which ones fink likes to use, and the system has several available (some of them old). The correct one was /usr/X11R6/lib, but given the other dependencies of DDD I preferred to use Fink. However, until the Fink package gets updated then it won't compile and install on its own. So I had figured I should try to tweak Fink to apply the patch.

Increasing the verbosity level on Fink showed up a patch file that was already being applied from:

/sw/fink/dists/stable/main/finkinfo/devel/ddd.patch

It looks like Fink packages all rely on the basic package, with a patch applied in order to fit with Fink or OS X. So all it took was for me to update this file with the one-line patch that would make DDD compile. One caveat is that Fink uses package descriptions that include checksums for various files, including the patch file. My first attempt at using the new patch reported both the expected checksum and the one that was found, so that made it easy to update the .info file.

If you found yourself here while trying to get Fink to install DDD, then just use these 2 files to replace the ones in your system:

/sw/fink/dists/stable/main/finkinfo/devel/ddd.info

/sw/fink/dists/stable/main/finkinfo/devel/ddd.patch

If you have any sense, you'll check that the patch file does what I say it does. :)

Note that when you update your Fink repository, then these files should get replaced.

Clojurescript and Node.js

2013-03-03T19:02:00.001-06:00

For a little while now I've been interested in JavaScript. JS was a strong theme at Strangeloop 2011, where it was being touted as the next big Virtual Machine. I'd only had a smattering of exposure to JS at that point, so I decided to pick up Douglas Crockford's JavaScript: The Good Parts, which I quite enjoyed. I practiced what I learned on Robusta, which is both a SPARQL client library for JS, and simple UI for talking to SPARQL endpoints. It was a good learning exercise and it's proven to be useful for talking to Jena. Since then, I've thought that it might be interesting to write some projects in JS, but my interests are usually in data processing rather than on the UI front end. So the browser didn't feel like the right platform for me.

When ClojureScript came out, I was interested, but again without a specific UI project in mind I kept putting off learning about it.

At this point I'd heard about node.js, but had been misled into thinking that it was for server-side JS, which again didn't seem that interesting when the server offered so many interesting frameworks to work with (e.g. Rails).

This last Strangeloop had several talks about JS, and about the efficiency of the various engines, despite having to continue supporting some unfortunate decisions early in the language's design (or lack of design in some cases). I came away with a better understanding of how fast this system can be and started thinking about how I should try working with it.

But then came Clojure/conj last year and it all gelled.

When I first heard Chris Granger speaking I'll confess that I didn't immediately see what was going on. He was targeting Node.js for Light Table, but as I said, I didn't realize what Node really was. (Not to mention that the social aspects of the conj had me a little sleep deprived). So it wasn't until after his talk that I commented to a colleague (@gtrakGT) that it'd be great it there was a JS engine that wasn't attached to a browser, but still had a library that let you do real world things. In retrospect, I feel like an idiot. :-)

So I started on Node.

ClojureScript with Node.js

While JS has some nice features, in reality I'd much rather write Clojure, meaning that I finally had a reason to try ClojureScript. My first attempts were a bit of a bumpy ride, since I had to learn how to target ClojureScript to Node, how to access functions for getting program arguments, how to read/write with Node functions. Most of all, I had to learn that if I accidentally used a filename ending in .clj instead of .cljs then the compiler would silently fail and the program would print bizarre errors.

All the same, I was impressed with the speed of starting up a ClojureScript program. I found myself wondering about how fast various operations were running, in comparison to Clojure on the JVM. This is a question I still haven't got back to, but it did set me off in some interesting directions.

While driving home after the conj, the same colleague who'd told me how wrong I'd been about Node.js asked me about doing a simple-yet-expensive calculation like large factorials. It was trivial in Clojure, so I tried it with ClojureScript, and immediately discovered that ClojureScript uses JavaScript's numbers. These are actually floats, but for integers it means that they support 53 bits. Clojure automatically expands large number into the BigInteger class from Java, but JavaScript doesn't have the same thing to work with.

At that point I should have considered searching for BigIntegers in JavaScript out there on the web somewhere, but I was curious about how BigIntegers were implemented, so I started looking the code for java.math.BigInteger and seeing if it would be reproduced in Clojure (using protocols, so I can make objects that look a little like the original Java objects). The bit manipulations started getting tricky, and I put it aside for a while.

Then recently I had a need to pretty-print some JSON and EDN. I'd done it in Clojure before, but I used the same process on the command line, and starting the JVM is just too painful for words. So I tried writing it in JS for node, and was very happy with both the outcome and the speed of it. This led me back to trying the same thing in ClojureScript.

ClojureScript Hello World Process

The first thing I try doing with ClojureScript and Node is getting a Hello World program going. It had been months since I'd tried this, so I thought I should try again. Once again, the process did not work smoothly so I thought I'd write it down here.

Clojure can be tricky to use without Leiningen, so this is typically my first port of call for any Clojure-based project. I started out with "lein new hello" to start the new project. This creates a "hello" directory, along with various standard sub-directories to get going.

The project that has been created is a Clojure project, rather than ClojureScript, so the project file needs to be updated to work with ClojureScript instead. This means opening up project.clj and updating it to include lein-cljsbuild along with the configuration for building the ClojureScript application.

The lein-cljsbuild build system is added by adding a plugin, and a configuration to the project. I also like being able to run "lein compile" and this needs a "hook" to be added as well:

  :hooks [leiningen.cljsbuild]
  :plugins [[lein-cljsbuild "0.3.0"]]
  :cljsbuild {
    :builds [{
      :source-paths ["src/hello"]
      :compiler { :output-to "src/js/hello.js"
                  :target :nodejs
                  :optimizations :simple
                  :pretty-print true }}]}

Some things to note here:

The source path (a sequence of paths, but with just one element here) is where the ClojureScript code can be found.
The :output-to can be anywhere, but since it's source code and I wanted to look at it, I put it back into src, albeit in a different directory.
The :target has been set to :nodejs. Without this the system won't be able to refer to anything in Node.js.
The :optimizations are set to :simple. They can also be set to :advanced, but nothing else. (More on this later).

To start with, I ignored most of this and instead got it all running at the REPL. That meant setting up the dependencies in the project file to include cljs-noderepl:

  :dependencies [[org.clojure/clojure "1.5.0"]
                 [org.clojure/clojurescript "0.0-1586"]
                 [org.bodil/cljs-noderepl "0.1.7"]]

(Yes, Clojure 1.5.0 was released last Friday!)

Using the REPL requires that it be started with the appropriate classpath, which lein does for you when the dependencies are set up and you use the "lein repl" command. Once there, the ClojureScript compiler just needs to be "required" and then the compiler can be called directly, with similar options to those shown in the project file:

user=> (require 'cljs.closure)
nil
user=> (cljs.closure/build "src/hello" {:output-to "src/js/hello.js" :target :nodejs :optimizations :simple})
nil

Once you can do all of this, it's time to modify the source code:

  (ns hello.core)

  (defn -main []
    (println "Hello World"))


  (set! *main-cli-fn* -main)

The only trick here is setting *main-cli-fn* to the main function. This tells the compiler which function to run automatically.

The project starts with a core.clj file, which is what you'll end up editing. Compiling this appears to work fine, but when you run the resulting javascript you get an obscure error. The fix for this is to change the source filename to core.cljs. Once this has been changed, the command line compiler (called with "lein compile" will tell you which files were compiled (the compiler called from the REPL will continue to be silent). If the command line compiler does not mention which files were compiled by name, then they weren't compiled.

I wanted to get a better sense of what was being created by the compiler, so I initially tried using optimization options of :none and :whitespace, but then I got errors of undefined symbols when I tried to run the program. I reported this as a bug, but was told that this was known and an issue with Google's Closure tool (which ClojureScript uses). The :simple optimizations seem to create semi-readable code though, so I think I can live with it.

Interestingly, compiling created a directory called "out" that contained a lot of other code generated by the compiler. Inspection showed several files, including a core.cljs file that carries the Clojure core functions. I'm presuming that this gets fed into the clojurescript compiler along with the user program. The remaining files are mostly just glue for connecting things together. For instance, nodejscli.cljs contains:

(ns cljs.nodejscli
  (:require [cljs.nodejs :as nodejs]))

; Call the user's main function
(apply cljs.core/*main-cli-fn* (drop 2 (.-argv nodejs/process)))

This shows what ends up happening with the call to (set! *main-cli-fn* -main) that was required in the main program.

Where Now?

Since Node.js provides access to lots of system functions, I started to wonder just how far I could push this system. Browsing the Node.js API I found functions for I/O and networking, so there seems to be some decent scope in there. However, since performance is really important to V8 (which Node.js is built from) then how about fast I/O operations like memory mapping files?

I was disappointed to learn that there are in fact serious limits to Node.js. However, since the engine is native code, there is nothing to prevent extending it to do anything you want. Indeed, using "require" in JavaScript will do just that. It didn't take much to find a project that wraps mmap, though the author freely admits that it was an intellectual exercise and that users probably don't want to use it.

Using a library like mmap in ClojureScript was straight forward, but showed up another little bug. Calling "require" from JavaScript returns a value for the object containing the module's functions. Constant values in the module can be read using the Java-interop operation. So to see the numeric value of PROT_READ, you can say:

(defn -main []
  (let [m (js/require "mmap")]
    (println "value: " m/PROT_READ)))

However, a module import like this seems to map more naturally to Clojure's require, so I tried a more "global" approach and def'ed the value instead:

(def m (js/require "mmap"))
(defn -main []
  (println "value: " m/PROT_READ))

However, this leads to an error in the final JavaScript. Fortunately, the . operator will work in this case. This also leads to one of the differences with Clojure. Using the dot operator with fields requires that the field name be referenced with a dash leader:

(def m (js/require "mmap"))
(defn -main []
  (println "value: " (.-PROT_READ m)))

Finally, I was able to print the bytes out of a short test file with the following:

(ns pr.core
  (:use [cljs.nodejs :only [require]]))

(def u (require "util"))
(def fs (require "fs"))
(def mmap (js/require "mmap"))

(defn mapfile [filename]
  (let [fd (.openSync fs filename "r")
        sz (.-size (.fstatSync fs fd))]
    (.map mmap sz (.-PROT_READ mmap) (.-MAP_SHARED mmap) fd 0)))

(defn -main []
  (let [buffer (mapfile "data.bin")
        sz (alength buffer)]
   (println "size of file: " sz)
   (doseq [i (range sz)]
     (print " " (aget buffer i)))
   (println))) 

(set! *main-cli-fn* -main)

Since I've now seen that it's possible, it's inspired me to think about reimplementing Mulgara's MappedBlockFile and AVLNode classes for Clojure and ClojureScript. That would be a good opportunity to get a BTree implementation coded up as well.

I've been thinking of redoing these things for Clojure ever since starting on Datomic. Mulgara's phased trees are strikingly like Datomic's, with one important exception - in Mulgara we went to a lot of effort to reap old nodes for reuse. This is complex, and had a performance cost. It was important 13 years ago when we first started on it, but things have changed since then. More recent implementations of some of Mulgara have recognized that we don't need to be so careful with disk space any more, but Rich had a clearer insight: not only can we keep older transactions, but we should. As soon as I realized that, then I realized it would be easy to improve performance dramatically in Mulgara. However, to make it worthwhile we'd have to expose the internal phase information in the same way that Datomic does.

Unfortunately, Mulgara is less interesting to me at the moment, since it's all in Java, which is why I'm moving to re-implement so much RDF work in Clojure at the moment. A start to that can be found in crg, crg-turtle, and kiara. Mulgara isn't going away... but it will get modernized.

More in ClojureScript

Family is calling, so I need to wrap this post up. Funnily enough, I didn't get to write about the thing that made me think to blog in the first place. That's bit manipulations.

You'll recall that I mentioned the lack of a BigInteger in ClojureScript (since it's a Java class). As an exercise in doing more in ClojureScript, and learning about how BigInteger is implemented, I started trying to port this class in Clojure (and ultimately, ClojureScript). The plumbing is easy enough, as I can just use a record and a protocol, but some of the internals have been trickier. That's because BigInteger packs the numbers into an array of bits, that is in turn represented with an int array. More significantly, BigInteger does lots of bit manipulation on these ints.

Bit manipulation is supported in JavaScript and ClojureScript, but with some caveats. One problem is that the non-sign-extending right-shift operation (>>>) is not supported. I was surprised to learn that it isn't supported in Clojure either (this seems strange to me, since it would be trivial to implement). The bigger problem is that numbers are stored as floating point values. Fortunately, numbers of 32 bits or smaller can be treated as unsigned integers. However, integers can get larger than this limit, and if that happens, then some of the bit operations stop working. It's possible to do bit manipulation up to 53 bits, but signed values won't work as expected in that range, so it's basically off the table.

Anyway, I have a lot to learn yet, and a long way to go to even compare how a factorial in ClojureScript will compare to the same operation in Clojure, but in the meantime, it's fun. Searching for Node.js modules has shown a lot of early exploration is being performed by people from all over. Some of the recent threading modules are a good example of this.

I have a feeling that Node.js will get more important as time goes on, and with it, ClojureScript.

Tnetstrings

2012-08-28T12:24:00.000-05:00

A Forward Reference

Having had a couple of responses to the last post, I couldn't help but revisit the code. There's not much to report, but since I've had some feedback from a few people who are new to Clojure I thought that there were a couple of things that I could mention.

First, (and you can see this in the comments to the previous post) I wrote my original code in a REPL and pasted it into the blog. Unfortunately, this caused me to miss a forward reference to the parse-t function. On my first iteration of the code, I wasn't trying to parse all the data types, to the map of parsers didn't need to recurse into the parse-t function. However, when I updated the map, the parse-t function had been fully defined, so the references worked just fine.

Testing

That brings me to my second and third points: testing and Leiningen. As is often the case, I found the issues by writing and running tests. Setting up an environment for tests can be annoying for some systems, particularly for such a simple function. However, using Leiningen makes it very easy. The entire project was built using Leiningen, and was set up with the simple command:
lein new tnetstrings.
I'll get on to Leiningen in a moment, but for now I'll stick with the tests.

Clojure tests are easy to set up and use. They are based on a DSL built out of a set of macros that are defined in clojure.test. The two main macros are deftest and is. deftest is used to define a test in the same way that defn is used to define a function, sans a parameter definition. In fact, a test is a function, and can be called directly (it takes no parameters). This is very useful to run an individual test from a REPL.

The other main macro is called "is" and is simply used to assert the truth of something. This macro is used inside a test.

My tests for tnetstrings are very simple:

(ns tnetstrings.test.core
  (:use [tnetstrings.core :only [parse-t]]
        [clojure.test]))

(deftest test-single
  (is (= ["hello" ""] (parse-t "5:hello,")))
  (is (= [42 ""] (parse-t "2:42#")))
  (is (= [3.14 ""] (parse-t "4:3.14^")))
  (is (= [true ""] (parse-t "4:true!")))
  (is (= [nil ""] (parse-t "0:~"))))

(deftest test-compound
  (is (= [["hello" 42] ""] (parse-t "13:5:hello,2:42#]")))
  (is (= [{"hello" 42} ""] (parse-t "13:5:hello,2:42#}")))
  (is (= [{"pi" 3.14, "hello" 42} ""] (parse-t "25:5:hello,2:42#2:pi,4:3.14^}"))))

Note that I've brought in the tnetstrings.core namespace (my source code), and only referenced the parse-t function. I always try to list the specific functions I want in a use clause, though I'm not usually so particular when writing test code. You'll also see clojure.test. As mentioned, this is necessary for the deftest and is macros. It is worth pointing out that both of these use clauses were automatically generated for me by Leiningen, along with a the first deftest.

I could have created a convenience function that just extracted the first element out of the returned tuple, thereby making the tests more concise. However, I intentionally tested the entire tuple, to ensure that nothing was being left at the end. I ought to create a string with some garbage at the end as well, to see that being returned, but the array and map tests have this built in... and I was being lazy.

Something else that caught me out was that when I parse a floating point number, I did it with java.lang.Float/parseFloat. This worked fine, but by default Clojure uses double values instead, and all floating point literals are parsed this way. Consequently the tests around "4:3.14^" failed with messages like:

expected: (= [3.14 ""] (parse-t "4:3.14^"))
  actual: (not (= [3.14 ""] [3.14 ""]))

What isn't being shown here is that the two values of 3.14 have different types (float vs. double). Since Clojure prefers double, I changed the parser to use java.lang.Double/parseDouble and the problem was fixed.

Leiningen

For anyone unfamiliar with Leiningen, here is a brief rundown of what it does. By running the new command Leiningen sets up a directory structure and a number of stub files for a project. By default, two of these directories are src/ and test. Under src/ you'll find a stub source file (complete with namespace definition) for the main source code, and under test/ you'll find a stub test file, again with the namespace defined, and with clojure.test already brought in for you. In my case, these two files were:

src/tnetstrings/core.clj
test/tnetstrings/test/core.clj

To get running, all you have to do is put your code into the src/ file, and put your tests into the test/ file. Once this is done, you use the command:
lein test
to run the tests. Clojure gets compiled as it is run, so any problems in syntax and grammar can be found this way as well.

However, one of the biggest advantages to using this build environment, is the ease of bringing in libraries. Using Leiningen can be similar to using Maven, without much of the pain, and indeed, Leiningen even offers a pom command to generate a Maven POM file. It automatically downloads packages from both Clojars and Maven repositories, so this feature alone makes it valuable.

Leiningen is configured with a file called project.clj which is autogenerated when a project is created. This file is relatively easy to configure for simple things, so rather than delving into it here, I'll let anyone new to the system go the project page and sample file to learn more about it.

project.clj also works for some not-so-simple setups, but it gets more and more difficult the fancier it gets. It's relatively easy to update the source path, test path, etc, to mimic Maven directory structures, which can be useful, since the Maven structure allows different file types (e.g. Java sources, resources) to be stored in different directories. But since I always want this, it's annoying that I always have to manually configure it.

I'm also in the process of copying Alex Hall's setup for pre-compiling Antlr parser definitions so that I can do the same with Beaver. Again, it's great that I can do this with Leiningen, but it's annoying to do so. I shouldn't be too harsh though, as the way that extensions are done look more like they are derived from the flexibility of Clojure than Leiningen itself.

Clojure DC

2012-08-22T00:00:00.001-05:00

Tonight was the first night for Clojure DC, which is a meetup group for Clojure users. It's a bit of a hike for me to get up there, but I dread getting too isolated from other professionals, so I decided it was worth making the trip despite the distance and traffic. Luckily, I was not disappointed.

Although I was late (have you ever tried to cross the 14th Street Bridge at 6pm? Ugh) not much had happened beyond som pizza consumption. The organizers, Matt and Chris, had a great venue to work with, and did a good job of getting the ball rolling.

After introductions all around, Matt and Chris gave a description of what they're hoping to do with the group, prompted us with ideas for future meetings, and asked for feedback. They suggested perhaps doing some Clojure Koans, and in that spirit they provided new users with an introduction to writing Clojure code (not an intro to Clojure, but an intro to writing code), by embarking on a function to parse tnetstrings. I'd never heard of these, but they're a similar concept to JSON, only the encoding and parsing is even simpler.

This part of the presentation was fun, since Matt and Chris had a banter that was reminiscent of Daniel Friedman and William Byrd presenting miniKanren at Clojure/Conj last year. While writing the code they asked for feedback, and I was pleased to learn a few things from some of the more experienced developers who'd shown up (notably, Relevance employee Craig Andera, and ex-Clojure.core developer, and co-author of my favorite Clojure book, Michael Fogus). For instance, while I knew that maps operate as functions where they look up an argument in themselves, I did not know that they can optionally accept a "not-found" parameter like clojure.core/get does. I've always used "get" to handle this in the past, and it's nice to know I can skip it.

While watching what was going on, I decided that a regex would work nicely. So I ended up giving it a go myself. The organizers stopped after parsing a string and a number, but I ended up doing the lot, including maps and arrays. Interestingly, I decided I needed to return a tuple, and after I finished I perused the reference Python implementation and discovered that this returned the same tuple. Always nice to know when you're on the right track. :-)

Anyway, my attempt looked like:

(ns tnetstrings.core)

(def type-map {\, identity
               \# #(Integer/parseInt %)
               \^ #(Float/parseFloat %)
               \! #(Boolean/parseBoolean %)
               \~ (constantly nil)
               \} (fn [m] (loop [mp {} remainder m]
                            (if (empty? remainder)
                              mp
                              (let [[k r] (parse-t remainder)
                                    [v r] (parse-t r)]
                                (recur (assoc mp k v) r)))))
               \] (fn [m] (loop [array [] remainder m]
                            (if (empty? remainder)
                              array
                              (let [[a r] (parse-t remainder)]
                                (recur (conj array a) r)))))})

(defn parse-t [msg]
  (if-let [[header len] (re-find #"([0-9]+):" msg)]
    (let [head-length (count header)
          data-length (Integer/parseInt len)
          end (+ data-length head-length)
          parser (type-map (nth msg end) identity)]
      [(parser (.substring msg head-length end)) (.substring msg (inc end))])))

There are lots of long names in here, but I wasn't trying to play "golf". The main reason I liked this was because of the if-let I introduced. It isn't perfect, but if the data doesn't start out correctly, then the function just returns nil without blowing up.

While this worked, it was bothering me that both the array and the map forms looked so similar. I thought about this in the car on the way home, and I recalled the handy equivalence:

(= (assoc m k v) (conj m [k v]))

So with this in hand, I had another go when I got home:

(ns tnetstrings.core)

(defn embedded [s f]
  (fn [m] (loop [data s remainder m]
            (if (empty? remainder)
              data
              (let [[d r] (f remainder)]
                (recur (conj data d) r))))))

(def type-map {\, identity
               \# #(Integer/parseInt %)
               \^ #(Float/parseFloat %)
               \! #(Boolean/parseBoolean %)
               \~ (constantly nil)
               \} (embedded {} (fn [m] (let [[k r] (parse-t m)
                                             [v r] (parse-t r)]
                                         [[k v] r])))
               \] (embedded [] (fn [m] (let [[a r] (parse-t m)]
                                         [a r])))})

(defn parse-t [msg]
  (if-let [[header len] (re-find #"([0-9]+):" msg)]
    (let [head-length (count header)
          data-length (Integer/parseInt len)
          end (+ data-length head-length)
          parser (type-map (nth msg end) identity)]
      [(parser (.substring msg head-length end)) (.substring msg (inc end))])))

So now each of the embedded structures is based on a function returned from "embedded". This contains the general structure of:

Seeing if there is anything left to parse.
If not, then return the already parsed data.
If so, then parse it, and add the parsed data to the structure before repeating on the remaining string to be parsed.

In the case of the array, just one element is parsed by re-entering the main parsing function. The result is just the returned data. In the case of the map, the result is a key/value tuple, obtained by re-entering the parsing function twice. By wrapping the key/value like this we not only get to return it as a single "value", but it's also in the form required for the conj function that is used on the provided data structure (vector or map).

The result looks a little noisy (lots of brackets and parentheses), but I think it abstracts out the operations much better. Exercises like this are designed to help you think about problems the right way, so I think it was a great little exercise.

Other than this code, I also got the chance to chat with a few people, which was the whole point of the trip. It's getting late, so I won't go into those conversations now, but I was pleased to hear that many of them will be going to Clojure/Conj this year.

Clojure Lessons

2012-05-22T23:36:00.000-05:00

Recently I've been working with Java code in a Spring framework. I'm not a big fan of Spring, since the bean approach means that everything has a similar public face, which means that the data types don't document the system very well. The bean approach also means that most types can be plugged into most places (kind of like Lego), but just because something can be connected doesn't mean it will do anything meaningful. It can make for a confusing system. As a result, I'm not really have much fun at work.

To get myself motivated again, I thought I'd try something fun and render a Mandelbrot set. I know these are easy, but it's something I've never done for myself. I also thought it might be fun to do something with graphics on the JVM, since I'm always working on server-side code. Turned out that it was fun, keeping me up much later than I ought to have been. Being tired tonight I may end up rambling a bit. It may also be why I've decided to spell "colour" the way I grew up with, rather than the US way (except in code. After all, I have to use the Color class, and it's just too obtuse to have two different spellings in the same program).

To get my feet wet, I started with a simple Java application, with a plan to move it into Clojure. My approach gave me a class called Complex that can do the basic arithmetic (trivial to write, but surprising that it's not already there), and an abstract class called Drawing that does all of the Window management and just expects the implementing class to implement paint(Graphics). With that done it was easy to write a pair of functions:

coord2Math to convert a canvas coordinate into a complex number.
mandelbrotColor to calculate a colour for a given complex number (using a logarithmic scale, since linear shows too many discontinuities in colour).

Drawing this onto a graphical context is easy in Java:

for (int x = 0; x < gWidth; x++) {
  for (int y = 0; y < gHeight; y++) {
    g.setColor(mandelbrotColor(coord2Math(x, y)));
    plot(g, x, y);
  }
}

(plot(Graphics,int,int) is a simple function that draws one pixel at the given location).

A small image (300x200 pixels) on this MacBookPro takes ~360ms. A big one (1397x856) took ~11500ms. Room for improvement, but it'll do. So with a working Java implementation in hand, I turned to writing the same thing in Clojure.

Clojure Graphics

Initially I tried extending my Drawing class using proxy, with a plan of moving to an implementation completely in Clojure. However, after getting it working that way I realized that doing the entire thing in Clojure wasn't going to take much at all, so I did that straight away. The resulting code is reasonably simple and boilerplate:

(def window-name "Mandelbrot")
(def draw-fn)

(defn new-drawing-obj []
  (proxy [JPanel] []
    (paint [^Graphics graphics-context]
      (let [width (proxy-super getWidth)
            height (proxy-super getHeight)]
        (draw-fn graphics-context width height)))))

(defn show-window []
  (let [^JPanel drawing-obj (new-drawing-obj)
        frame (JFrame. window-name)]
    (.setPreferredSize drawing-obj (Dimension. default-width default-height))
    (.add (.getContentPane frame) drawing-obj)
    (doto frame
      (.setDefaultCloseOperation JFrame/EXIT_ON_CLOSE)
      (.pack)
      (.setBackground Color/WHITE)
      (.setVisible true))))

(defn start-window []
  (SwingUtilities/invokeLater #(show-window)))

Calling start-window sets off a thread that will run the event loop and then call the show-window function. That function uses new-drawing-obj to create a proxy object that handles the paint event. Then it sets the size of panel, puts it into a frame (the main window), and sets up the frame for display.

The only thing that seems worth noting from a Clojure perspective is the proxy object returned by new-drawing-obj. This is simple extension of java.swing.JPanel that implements the paint(Graphics) method of that class. Almost every part of the drawing can be done in an external function (draw-fn here), but the width and height are obtained by calling getWidth() and getHeight() on the JPanel object. That object isn't directly available to the draw-fn function, nor is it available through a name like "this". The object is returned from the proxy function, but that's out of scope for the paint method to access it. The only reasonable way to access methods that are inherited in the proxy is with the proxy-super function (I can think of some unreasonable ways as well, like setting a reference to the proxy, and using this reference in paint. But we won't talk about that kind of abuse).

While I haven't shown it here, I also wanted to close my window by pressing the "q" key. This takes just a couple of lines of code, whereby a proxy for KeyListener is created, and then added to the frame via (.addKeyListener the-key-listener-proxy). Compared to the equivalent code in Java, it's strikingly terse.

Rendering

The Java code for rendering used a pair of nested loops to generate coordinates, and then calculated the colour for each coordinate as it went. However, this imperative style of coding is something to explicitly avoid in any kind of functional programming. So the question for me at this point, was how should I think about the problem?

Each time the mandelbrotColor was to be called, it is mapping a coordinate to a colour. This gave me my first hint. I needed to map coordinates to colours. This implies calling map on a seq of coordinates, and ending up with a seq of colours. (Actually, not a seq, but rather a reducible collection). However, what order are the colours in? Row-by-row? That would work, but it would involve keeping a count of the offset while working over the seq, which seems onerous, particular when the required coordinates were available when the colour was calculated in the first place. So why not include the coordinates in the seq with the colour? Not only does that simplify processing, it makes the rendering of this map stateless, since any element of the seq could be rendered independently of any other.

Coordinates can be created as pairs of integers using a comprehension:

  (for [x (range width) y (range height)] [x y])

and the calculation can be done by mapping on a function that unpacks x and y and returns a triple of these two coordinates along with the calculated colour. I'll rename x and y to "a" and "b" in the mapping function to avoid ambiguity:

  (map (fn [[a b] [a b (mandelbrot-color (coord-2-math a b))])
       (for [x (range width) y (range height)] [x y]))

So now we have a sequence of coordinates and colours, but how do these get turned into an image? Again, the form of the problem provides the solution. We have a sequence (of tuples), and we want to reduce it into a single value (an image). Reductions like this are done using reduce. The first parameter for the reduction function will be the image, the second will be the next tuple to draw, and the result will be a new image with the tuple drawn in it. The reduce function isn't really supposed to mutate its first parameter, but we don't want to keep the original image without the pixel to be drawn, so it works for us here. The result is the following reduction function (type hint provided to avoid reflection on the graphical context):

  (defn plot [^Graphics g [x y c]]
    (.setColor g c)
    (.fillRect g x y 1 1)
    g)

Note that the original graphics context is returned, since this is the "new" value that plot has created (i.e. the image with the pixel added to it). Also, note that the second parameter is a 3 element tuple, which is just unpacked into x y and c.

So now the entire render process can be given as:

(reduce plot g
  (map (fn [[a b] [a b (mandelbrot-color (coord-2-math a b))])
       (for [x (range width) y (range height)] [x y])))

This works just fine, but there were performance issues, which was the part of this process that was most interesting. The full screen render (1397x856) took ~682 seconds (up from the 11.5 seconds it took Java). Obviously there were a few things to be fixed. There is still more to do, but I'll share what I came across so far.

Reflection

The first thing that @objcmdo suggested was to look for reflection. I planned on doing that, but thought I'd continue cleaning the program up first. The Complex class was still written in Java, so I embarked on rewriting that in Clojure.

The easiest way to do this was to implement a protocol that describes the actions (plus, minus, times, divide, absolute value), and to then define a record (of real/imaginary) that extends the protocol. It would have been nicer than the equivalent Java, but for one thing. Java allows method overloading based on parameter types, which means that a method like plus can be defined differently depending on whether it receives a double value, or another Complex number. My understanding is that Clojure only overloads functions based on the parameter count, meaning that different function names are required to redefine the same operation for different types. So for instance, the plus functions were written in Java as:

  public final Complex plus(Complex that) {
    return new Complex(real + that.real, imaginary + that.imaginary);
  }

  public final Complex plus(double that) {
    return new Complex(real + that, imaginary);
  }

But in Clojure I had to give them different names:

  (plus [this {that-real :real, that-imaginary :imaginary}]
        (Complex. (+ real that-real) (+ imaginary that-imaginary)))
  (plus-dbl [this that] (Complex. (+ real that) imaginary))

Not a big deal, but code like math manipulation looks prettier when function overloading is available.

It may be worth pointing out that I used the names of the operations (like "plus") instead of the symbolic operators ("+"). While the issue of function overloading would have made this awkward (+dbl is no clearer than plus-dbl) it has the bigger problem of clashing with functions of the same name in clojure.core. Some namespaces do this (the * character is a popular one to reuse), but I don't like it. You have to explicitly reject it from your current namespace, and then you need to refer to it by its full name if you do happen to need it. Given that Complex needs to manipulate internal numbers, then these original operators are needed.

So I created my protocol containing all the operators, defined a Complex record to implement it, and then I replaced all use of the original Java Complex class. Once I was finished I ran it again just to make sure that I hadn't broken anything.

To my great surprise, the full screen render went from 682 seconds down to 112 seconds. Protocols are an efficient mechanism, but they shouldn't be that good. At that point I realised that I hadn't used type hints around the Complex class, and that as a consequence the Clojure code had to perform reflection on the complex numbers. Just as @objcmdo had suggested.

Wondering what other reflection I may have missed, I tried enabling the *warn-on-reflection* flag in the repl, but no warnings were forthcoming. I suspect that this was being subverted by the fact that the code is all being run by a thread that belongs to the Swing runtime. I tried adding some other type hints, but nothing I added had any effect, meaning that the Clojure compiler was already able to figure out the types involved (or else it just wasn't in a critical piece of code).

Composable Abstractions

The next thing I wondered about was the map/reduce part of the algorithm. While it made for elegant programming, it was creating unnecessary tuples at every step of the way. Could these be having an impact?

Once you have a nice list comprehension, it's tough to break it out into an imperative-style loop. Aside from ruining the elegance of the original construct, once you've seen your way through to viewing a problem in such clear terms, it's difficult to reconceptualize it as a series of steps. Even when you do, how do you make Clojure work against itself?

Creating a loop without burning through resources can be done easily with tail recursion. Clojure doesn't do this automatically (since the JVM does not provide for it), but it can be emulated well with loop/recur. Since I want to loop between 0 (inclusive) and the width/height (exclusive), I decremented the upper limits for convenience. Also, the plot function is no longer constraint to just 2 arguments, so I changed the definition to accept all 4 arguments directly, thereby eliminating the need to construct that 3-tuple:

(let [dwidth (dec width)
                 dheight (dec height)]
  (loop [x 0 y 0]
    (let [[next-x next-y] (if (= x dwidth)
                              (if (= y dheight)
                                  [-1 -1]      ;; signal to terminate
                                  [0 (inc y)])
                              [(inc x) y])]
      (plot g x y (mandelbrotColor (coord-2-math x y)))
      (if (= -1 next-x)
        :end    ;; anything can be returned here
        (recur next-x next-y)))))

My word, that's ugly. The let that assigns next-x and next-y has a nasty nested if construct that increments x and resets it at the end of each row. It also returns a flag (could be any invalid number, such as the keyword :end) to indicate that the loop should be terminated. The loop itself terminates by testing for the termination value and returning a value that will be ignored.

But it all works as intended. Now instead of creating a tuple for every coordinate, it simply iterates through each coordinate and plots the point directly, just as the Java code did. So what's the performance difference here?

So far, the numbers I've provided are rounded to the nearest second. Repeated runs have usually taken a similar amount of time to the ones that I've reported here. However, there is always some jitter, sometimes by several seconds. Because of this, I was unable to see any difference whatsoever between using map/reduce on a for comprehension, versus using loop/recur.

That's an interesting result, since it shows that the Clojure compiler and JVM are indeed as clever as we're told, when we see that better abstractions are just as efficient as the direct approach. It's all well and good for a language to make it easy to write powerful constructs, but being able to perform more elegant code just as efficiently as more direct, imperative code that a language is really offering useful power.

Aside from the obvious clarity issues, the composability of the for/map/reduce makes an enormous difference. Because each element in the range being mapped is completely independent, we are free to use the pmap function instead of map. The documentation claims that this function is,

"Only useful for computationally intensive functions where the time of f dominates the coordination overhead."

Yup. That's us.

So how much does this change make for us? Using map on the current code, a full screen render takes 112 seconds. Changing map to pmap improves it to 75 seconds. That's a 33% improvement with no work, simply because the correct abstraction was applied. That's a very powerful abstraction.

Future Work

(Hmmm, that makes this sound like an academic paper. Should I be drawing charts?)

The final result is still a long way short of the 11.5 seconds the naïve Java code renders at. The single threaded version is particularly bad, taking about 10 times as long. I don't expect Clojure to be as fast as Java, but a factor of 10 suggests that there are some obvious things that I've missed, most likely related to reflection. If I can get it down to the same order of magnitude as the Java code, then using pmap could make the Clojure version faster due to being multi-threaded. Of course, Java can be multi-threaded as well, but the effort and infrastructure for doing this would be significant.

2011-09-04T12:24:00.003-05:00

SPARQL JSON

After commenting the other day that Fuseki was ignoring my request for a JSON response, I was asked to submit a bug report. It's easy to cast unsubstantiated stones in a blog, but a bug report is a different story, so I went back to confirm everything. In the process, I looked at the SPARQL 1.1 Query Results JSON Format spec (I can't find a public copy of it, so no link, sorry. UPDATE: Found it.) and was chagrined to discover that it has its own MIME type of "application/sparql-results+json". I had been using the JSON type of "application/json", and this indeed does not work, but the corrected type does. I don't think it's a good idea to ignore "application/json" so I'll report that, but strictly speaking it's correct (at least, I think so. As Andy said to me in a back channel, I don't really know what the + is supposed to mean in the subtype). So Fuseki got it right. Sorry.

When I finally get around to implementing this for Mulgara I'll try to handle both. Which reminds me... I'd better get some more SPARQL 1.1 implemented. My day job at Revelytix needs me to do some Mulgara work for a short while, so that may help get the ball rolling for me.

separate

I commented the other day that I wanted a Clojure function that takes a predicate and a seq, and returns two seqs: one that matches the predicate and the other that doesn't match. I was thinking I'd build it with a loop construct, to avoid recursing through the seq twice.

The next day, Gary (at work) suggested the separate function from Clojure Contrib. I know there's some good stuff in contrib, but unfortunately I've never taken the time to fully audit it.

The implementation of this function is obvious:

(defn separate [f s]
  [(filter f s) (filter (complement f) s)])

I was disappointed to learn that this function iterates twice, but Gary pointed out that there had been a discussion on exactly this point, and the counter argument is that this is the only way to build the results lazily. That's a reasonable point, and it is usually one of the top considerations for Clojure implementations. I don't actually have cause for complaint anyway, since the seqs I'm using are always small (being built out of query structures).

This code was also a good reminder that (complement x) offers a concise replacement for the alternative code:

  #(not x %)

By extension, it's a reminder to brush up on my idiomatic Clojure. I should finish the book The Joy of Clojure (which is a thoroughly enjoyable read, by the way).

2011-09-01T20:45:00.003-05:00

ASTs

After yesterday's post, I noticed that a number of references came to me via Twitter (using a t.co URL). After looking for it, I realized that I'm linked to from the Planet Clojure page. I'm not sure why, though I guess it's because I'm working for Revelytix, and we're one of the larger Clojure shops around. Only my post wasn't about Clojure - it was about JavaScript. So now I feel obligated to write something about Clojure. Besides, I'm out of practice with my writing, so it would do me good.

So the project I spend most of my time on (for the moment) is called Rex. It's a RIF based rules engine written entirely in Clojure. It does lots of things, but the basic workflow for running rules is:

Parse the rules out of the rule file and into a Concrete Syntax Tree. We can parse the RIF XML format, the RIF Presentation Syntax, and the RIF RDF format, and we plan on doing others.
Run a set of transformations on the CST to convert it into an appropriate Abstract Syntax Tree (AST). This can involve some tricky analysis, particularly for aggregates (which are a RIF extension).
Transformation of the AST into SPARQL 1.1 query fragments.
Execute the engine, by processing the rules to generate more data until the capacity to generate new data has been exhausted. (My... that's a lot of hand waving).

It's step 3 that I was interested in today.

The rest of this post is about how Rex processes a CST into an AST using Clojure, and about some subsequent refactoring that went on. You have been warned...

When I first wrote the CST to AST transformation step, it was to do a reasonably straight forward analysis of the CST. Most importantly, I needed to see the structure of the rule so that I could see what kind of data it depends on, thereby figuring out which other rules might need to be run once a given rule was executed. Since the AST is a tree structure, this made for relatively straight forward recursive functions.

Next, I had to start identifying some CST structures that needed to be changed in the AST. This is where it got more interesting. Again, I had to write recursive functions, but instead of simply analyzing the data, it had to be changed. It turns out that this is handled easily by having a different function for each type of node in the tree. In the normal case the function then recurses on all of its children, and constructs an identical node type using the new children. The leaf nodes then just return themselves. The "different function" for each type is actually accessed with the same name, but dispatches on the function type. In Java that would need a visitor pattern, or perhaps a map of types to functors, but in Clojure it's handled trivially with multimethods or protocols. Unfortunately, the online resources for describing the multi-dispatch aspects of Clojure protocols are not clear, but Luke VanderHart and Stuart Sierra's book Practical Clojure covers it nicely.

As an abstract example of what I mean, say I have an AST consisting of Conjunctions, Disjunctions and leaf nodes. Both Conjunctions and Disjunctions have a single field that contains a seq of the child nodes. These are declared with:

(defrecord Conjunction [children])

(defrecord Disjunction [children])

The transformation function can be called tx, and I'll define it with multiple dispatch on the node type using multimethods:

(defmulti tx  class)



(defmethod tx Disjunction [{c :children}]

    (Disjunction. (map tx c)))



(defmethod tx Conjunction [{c :children}]

    (Conjunction. (map tx c)))



(defmethod tx :default [n] n)

This will reconstruct an identical tree to the original, though with all new nodes except the leaves. Now, the duplication in the Disjunction and Conjunction methods should be ringing alarm bells, but in real code the functions have more specific jobs to do. For instance, the Conjunction may want to group terms that meet a certain condition (call the test "special-type?") into a new node type (call it "Foo"):

;; A different definition for tx on Conjunction

(defmethod tx Conjunction [{c :children}]

  (let [new-children (map tx c)

        special-nodes (filter special-type? new-children)

        other-nodes (filter (comp not special-type?) new-children)]

    (Conjunction. (conj other-nodes (Foo. special-nodes)))))

Hmmm... while writing that example I realized that I regularly run into the pattern of filtering out everything that meets a test, and everything that fails the test. Other than having to test everything twice, it seems too verbose. What I need is a function that will take a seq and a predicate and return a tuple containing a seq of everything that matches the predicate, and a second seq of everything that fails the predicate. I'm not seeing anything like that right now, so that may be a job for the morning.

I should note, that there is no need for a function to return the same type that came into it. There are several occasions where Rex returns a different type. For example, a conjunction between a Basic Graph Pattern (BGP) and the negation of another BGP becomes a MINUS operation between the two BGPs (a Basic Graph Pattern comes from SPARQL and is just a triple pattern for matching against the subject/predicate/object of a triple in an RDF store).

Overall, this approach works very well for transforming the CST into the full AST. As I've needed to incorporate more features and optimizations over time, I found that I had two choices. Either I could expand the complexity of the operation for every type in the tree processing code, or I could perform different types of analysis on the entire tree, one after another. The latter makes the process far easier to understand, making the design more robust and debugging easier, so that's how Rex has been written. It makes analysis slightly slower, but analysis is orders of magnitude faster than actually running the rules, so that is not a consideration.

Threading

The first time through, analysis was simple:

(defn analyse [rules]

  (process rules))

Adding a new processing step was similarly easy:

(defn analyse [rules]

  (process2 (process1 rules)))

But once a third, then a fourth step appeared, it became obvious that I needed to use the Clojure threading macro:

(defn analyse [rules]

  (-> rules

      process1

      process2

      process3

      process4))

So now it's starting to look nice. Each step in the analysis process is a single function name implemented for various types. These names are then provided in a list of things to be applied to the rules, via the threading macro. There's a little more complexity (one of the steps picks up references to parts of the tree, and since each stage changes the tree, then these references will be pointing to an old and unused version of the tree. So that step has to be last), but it paints the general picture.

Partial

Another thing that I've glossed over is that each rule is actually a record that contains the AST as one of its members. The rules themselves are a seq which is in turn a member of a record containing a "Rule Program". So each process step actually ended up being like this:

(defn process1 [rule-program]

  (letfn [(process-rule [rule] (assoc rule :body (tx (:body rule))))]

    (assoc rule-program :rules (map process-rule (:rules rule-prog)))))

Did you follow that?

The bottom line is replacing the :rules field with a new one. It's mapping process-rule onto the seq of rules, and storing the result in a new rules-program, which is what gets returned. The process-rule function is defined locally as associating the :body of a rule with the tx function applied to the existing body. This creates a new rule that has has the tx applied to it.

This all looked fine to start with. A new rule program is created by transforming all the rules in the old program. A transformed rule is created by transforming the AST (the :body) in the old rule. But after the third analysis it became obvious that there was duplication going on. In fact, it was all being duplicated except for the tx step. But that was buried deep in the function. What was the best way to pull it out?

To start with, the embedded process-rule function came out. After all, it was just inside to hide it, and not because it had to pick up a closure anywhere. This function then accepts the kind of transformation that it needs to do as a parameter:

(defn convert-rule

  [convert-fn rule]

  (assoc rule :body (convert-fn (:body rule))))

Next, we want a general function for converting all the rules, which can accept a conversion function to pass on to convert-rule. It does all the rules, so I just pluralized the name:

(defn convert-rules

  [conversion-fn rule-prog]

  (assoc rule-prog :rules (map #(convert-rule conversion-fn %) (:rules rule-prog)))))

That works, but now the function getting mapped is looking messy (and messy leads to mistakes). I could improve it by defining a new function, but I just factored a function out of this function. Fortunately, there is a simpler way to define this new function. It's a "partial" application of convert-rule. But I'll still move it into a let block for clarity:

(defn convert-rules

  [conversion-fn rule-prog]

  (let [conv-rule (partial convert-rule conversion-fn)]

    (assoc rule-prog :rules (map conv-rule (:rules rule-prog)))))

So now my original process1 definition becomes a simple:

(defn process1 [rule-program]

  (convert-rules tx rule-program))

That works, but the rule-program parameter is just sticking out like a sore thumb. Fortunately, we've already seen how to fix this:

(def process1 (partial convert-rules tx))

Indeed, all of the processing functions can be written this way:

(def process1 (partial convert-rules tx))

(def process2 (partial convert-rules tx2))

(def process3 (partial convert-rules tx3))

(def process4 (partial convert-rules tx4))

It may seem strange that a function is now being defined with a def instead of a defn, but it's really not an issue. It's worth remembering that defn is just a macro that uses def to attach a symbol to a call to (fn ...).

Documenting

Of course, my functions don't appear as sterile as what I've been typing here. I do indeed use documentation. That means that the process1 function would look more like:

(defn process1 [rule-program]

  "The documentation for process1"

  (convert-rules tx rule-program))

One of the nice features of the defn macro is the ease of writing documentation. This isn't as trivial with def since it's a special form, rather than a macro, but it's still not too hard to do. You just need to attach some metadata to the object, with a key of :doc. Unfortunately, I couldn't remember the exact syntax for this today, and rather then go trawling through books or existing code, Alex was kind enough to remind me:

(def ^{:doc "The documentation for process1"}

  process1 (partial convert-rules tx))

Framework

The upshot of all of this is a simpler framework for adding new steps to the existing analysis system. Adding a new analysis step just needs a new function thrown into the thread. I could put a call to (partial convert-rules ...) directly into the thread, but by using a def I get to name and document that step of the analysis. the only real work of the analysis is then done in the single multi-dispatch function, which is just as it should be.

So right now my evening "hobby" has been JavaScript, while my day job is Clojure. I have to tell you, the day job is much more fun.

2011-09-01T10:30:00.003-05:00

Robusta

I was just posting this into Google+ when I realized I was typing more than I'd intended. It read more like a short blog post. Which reminded me.... "Oh yeah. I have a blog out there somewhere! Maybe I should write this in there."

After spending a lot of recent nights on JavaScript I think I'm starting to get a feel for it. I started with Douglas Crockford's "JavaScript: The Good Parts", and am now plowing through David Flanagan's "JavaScript: The Definitive Guide". (I met David when he came to Brisbane to run a short course on Java programming back in 1996. Nice guy. He said that he'd preferred to have called his book "Java in a Demitasse", but O'Reilly wanted it in their "Nutshell" series).

After using languages like Ruby, Erlang, Scala and Clojure, I'm finding it a little frustrating, but it's hard to argue with the ubiquity of the platform. Fortunately it has closures and first class functions, though the variable scoping is bizarre. I've been enjoying the callback approach to asynchronous function calls, though the syntax tends to make the resulting code confusing to read. I'm mostly sticking to Crockford's subset of the language, and this does make things a little more sensible. Flanagan's book has been filling in the gaps for me, but it's especially useful for documenting libraries like HTML5 Canvas and the File API.

As with all first attempts at using a language, mine is a little messy and inconsistent. However, it can't hurt to put it out there. I've built a simple tool for working with SPARQL endpoints (specifically aimed at Jena/Fuseki, but it should mostly work on others too). The important piece is the SPARQL connection object that it comes with (found in sparql.js). I'm hoping that this will be a useful object for more general application. It can even convert XML responses into a SPARQL-JSON structure (I wrote this after discovering Fuseki was ignoring my Content-type settings on queries).

Unfortunately, it's missing one important piece, which is the ability to upload a file from a browser. In general, it's possible to upload a file using a form submission, but that encodes all the parameters into the request body, and the SPARQL HTTP protocol requires that the graph URI appear as a parameter in the URL of the request. In an attempt to get the graph parameter out of the body and into the URL, I even tried dynamically constructing the URL for the form submission, but the browser "cleverly" saw what I was doing and pushed the parameter back into the body. So I can't use form submission. Alternatively, JavaScript makes it easy to submit an HTTP POST operation with everything set the way you want it. However, the only way to read a local file is through the form submission process, which means I still can't do a file upload. In the end, I just used Fuseki's file upload servlet, but this has the problem of being non-standard, and it also doesn't like URIs that aren't http (yes Andy, that's why I asked you about this restriction - though I'd already run into it at work).

The resulting system needed a name, so I called it Robusta. Everyone seems to be enamored with Arabica beans, but no one ever talks about Robusta beans. Don't get me wrong.... if I had to choose between the two I'd definitely go for the Arabica. But by blending in a small portion of Robusta beans you add a richness to the flavor of your coffee (it's also used as a cheap "filler" and promotes crema in espresso, but I like the flavor aspect). At the time, I came up with the name because I really needed some caffeine, and Arabica was too obvious. But in retrospect, I like the name, since adding a bit of SPARQL to your scripts can really enhance a system (OK, that's tacky. I probably need another one of those coffees). It wasn't after I'd had some coffee that I thought to look for other projects with the same name, but by that point it was already up there.

Robusta is still a work in progress, and it's mostly a late night project that fits around everything else that I'm doing. But I'm using it at work, and it makes my life easier. I'd like to know if anyone has ideas for it, or can point out errors, inefficiencies, or potential improvements. It's posted at GitHub as a part of the Revelytix project group, at:
http://github.com/revelytix/robusta

2010-10-21T23:24:00.004-05:00

Clojure Experiences

I've been learning Clojure recently, as my new job uses it frequently. I'm happy with this (it has me attending (first clojure-conj) right now), but I'm still learning a lot about the language. I'm comfortable with functional programming, and of course, there's next to nothing to learn about syntax and grammar, but there is a steep learning curve to effectively use the libraries. Not only are there a lot of them, but they are not really well documented.

OO languages tell you which functions belong to a class, but since Clojure functions don't really belong to anything it can sometimes be hard to find everything that is relevant to the structures you are dealing with. I've had a few people tell me that there are good websites out there that provide the sort of documentation I'd like, so I'll go looking for them soon.

Meanwhile, I find myself looking at the libraries all the time. Other than the general principle of code reuse, the dominant paradigm in Clojure is to use the core libraries whenever possible. This is because those libraries are likely to be much more efficient than anything you can come up with on your own. They're also "lazy", meaning that the functions do very little work themselves, with all of the work being put off until you actually need it. Laziness is a good thing, and should show up in Clojure code as much as possible. Using the core libraries is the best way to accomplish that.

Clojure Maps

Part of my current task is to take a data structure that represents an RDF graph, and process it. The structure that I have now maps resource identifiers (sometimes URIs, but mostly internal blank node IDs) to structures representing all of the property/values associated with that resource. Unfortunately, the values are simple identifiers too, so to find their data I have to go back to the structure and get the resources for that object. This has to be repeated until everything found has no properties or is a literal (which has no properties). Also, the anonymous resources in RDF lists are showing up as data structures, which makes them difficult to work with as well.

Following this kind of structure is easy enough in an imperative language, though quite tedious. However, in Clojure it is simply too inelegant to work with. So I want some way to make this to a nice nested structure that represents the nested data from the RDF graph (fortunately, this kind of graph is guaranteed to not have any cycles). In particular, I'd this new structure to be lazy wherever I can get it.

The basic approach is simple: get the property/values for a given resource. Then convert the values (resources) that aren't literals to the actual data that they represent. This is done by getting the property/values of those resources, which is what I started with, so now I know I need recursion.

Now there are a few caveats. First off, if a resource is actually the head of a list, I don't want to just get it's properties. Instead, I want to traverse the list (lazily, of course) and get all the values attached to the nodes. That's easy enough, and is done with a function to identify resources that are lists, and another one to construct the list (a recursive function that uses lazy-seq). Second, there is one property/value that I want to preserve, and that is the :id property. This property shows the URI or identifier for the original node, so the value here needs to be kept, and not expanded on (it would recurse indefinitely otherwise).

After identifying these requirements, I figured I needed some way to convert all the key/value elements of a map into a new map. This sounds very much like converting all the elements in a sequence (a "seq") into a new sequence, a task which is done using the map function. Is there a similar operation to be applied to a map? (The noun and the verb get confusing here, sorry).

It turns out that Alex Miller had been dealing with exactly this problem just recently, and has built a new function called mapmap to do this operation. However, I wasn't aware of this at the time, so I had to pursue my own solution. I'm reasonably happy with my answer, and I learnt a bit in the process, so I'm glad I solved it myself, though it would have been nice to get that sleep back.

Morphing a Map

There have been a few times when I've needed to create a new map, and the first time I needed to I went to clojure.core to see what kind of map-creating functions were available to me. At that point I found hash-map, which takes a sequence containing an even number of items, and converts it into a map using the even-numbered offsets as the keys and the odd-numbered offsets as the values (with the first offset being "0"). So I could create a map of numbers to their labels using the following:

  (hash-map 1 "one" 2 "two" 3 "three")

So was there a way to create this list from my existing map? (Well, yes, but was it easy?)

I wanted to iterate over my existing map "entries" and create new entries. If you treat a map as a seq, then it appears as a sequence of 2-element vectors, with the first element being the key and the second element being the value. So if I want to morph the values in a map called "mymap" with a function called "morph", I could convert these pairs into new pairs with:

  (map (fn [[k v]] [k (morph v)]) mymap)

For anyone unfamiliar with the syntax, my little anonymous function in the middle there is taking as it's single parameter a 2 element vector, and destructuring it into the values k and v, then returning a new two element vector of k and (morph v). So let's apply this to a map of numbers to their names (incidentally, I don't code anything like this, but I do blog this way)....

  (def mymap {1 "one", 2 "two", 3 "three"})
  (def morph #(.toUpperCase %))
  (map (fn [[k v]] [k (morph v)]) mymap)

The result is:

([1 "ONE"] [2 "TWO"] [3 "THREE"])

Looks good, but there is one problem. This result is a sequence of pairs, while hash-map just takes a sequence. That's OK, we can just call flatten before passing it in:

  (apply hash-map (flatten (map (fn [[k v]] [k (morph v)]) mymap)))
{1 "ONE", 2 "TWO", 3 "THREE"}

The "apply" was needed because hash-map needed lots of arguments (6 in this case) while the result of flatten would have been just 1 argument (a list).

This is looking good. It's what I've done in the past, and it's hardly worthy of a blog post. But this time I had a new twist. My "values" were also structures, and often these structures were seqs as well. For instance, say I have a map of numbers to names in a couple of languages:

(def langmap {1 ["one" "uno"], 2 ["two" "due"], 3 ["three" "tre"]})

My morph method needs to change to accept a seq of strings instead of just a string, but that's trivial. So then the map step looks good, giving an output of:

([1 ("ONE" "UNO")] [2 ("TWO" "DUE")] [3 ("THREE" "TRE")])

But this fails to be loaded into a hash-map, with the error:

java.lang.IllegalArgumentException: No value supplied for key: TRE (NO_SOURCE_FILE:0)

This is exactly what bit me at 1am.

The problem is simple. flatten is recursive. It dives into the sub-sequences, and brings them up too. The result is trying to map 1 to "ONE", mapping "UNO" to 2, and so on, until it gets to "TRE" at which point it doesn't has a value to map it to, so it gives an error. Fortunately for me I had an odd number of items when everything was flattened, or else I would have been working with a corrupted map and may not have discovered it for some time.

My first thought was to write a version of flatten that isn't recursive, but that's when I took a step back and asked myself if there was another way to construct a map. Maybe I shouldn't be using hash-map. So I looked at Clojure's cheatsheet, and discovered zipmap.

zipmap

zipmap wasn't quite what I wanted, in that instead of taking a sequence of pairs, it wants a pair of sequences. It then constructs a map where the nth key/value comes from the nth entry in the first sequence (the key) and the nth entry from the second sequence (the value). So I just needed to rotate, or transpose my data. Fortunately, the day before one of my co-workers had asked me exactly the same question.

In this case, he had a sequence of MxN entries, looking something like:

[[a1 a2 a3] [b1 b2 b3] [c1 c2 c3]]

And the result that he wanted was:

[[a1 b1 c1] [a2 b2 c2] [a3 b3 c3]]

My first thought was some wildly inefficient loop-comprehension, when he asked me if the following would work (for some arg):

  (apply map list arg)

Obviously he didn't have a repl going, so I plugged it in, and sure enough it worked. But why?

I struggled to understand it until I looked up the documentation for map again, and remembered that it can take multiple sequences, not just one. I had totally forgotten this. When provided with n sequences, map requires a function that accepts n arguments. It then goes through each of the sequences in parallel, with the output being a sequence where the nth entry is the result of calling the function with the nth entry of every source sequence. In this case, the function was just list, meaning that it creates a list out of the items from each sequence. The apply just expanded the single argument out into the required lists that map needed. This one line is a lovely piece of elegance, and deserves its own name:

  (defn rotate [l] (apply map list l))

So the final composition for creating a new map with values modified from the original map is:

  (apply zipmap (rotate (map (fn [[k v]] [k (morph v)]) langmap)))
{3 ("THREE" "TRE"), 2 ("TWO" "DUE"), 1 ("ONE" "UNO")}

Now the actual function that I use inside map isn't nearly so simple. After all, I need to recursively traverse the RDF graph, plus I need to find RDF lists and turn them into sequences, etc. But this zipmap/rotate/map idiom certainly makes it easy.

Incidentally, I could do exactly the same thing with Alex's mapmap:

  (mapmap key (comp morph val) langmap)

Oh well. I'm still happy I discovered zipmap/rotate.

2010-05-13T12:18:00.003-05:00

Feedburner

Something reminded me that I had RSS going through Feedburner, so I tried to look at it. Turns out that after the Google move they want to update everyone's account, and I hadn't done it yet. So I gave them all my details, and the system told me that it didn't recognize me.

So then I said I'd forgotten my password, at which point it recognized my email address and asked a "secret question". I know the answer to the question, and it's not even that secret, since I'm sure that most of my family could figure it out, but Feedburner claims I'm wrong. Could someone have changed this? Maybe, but there are only a handful of people who would know that answer, and I trust them not to do something like that.

Not a problem, I'll just submit a report explaining my problem. Only it seems that Feedburner is exempt from that kind of thing. The closest you can get to help is an FAQ. Good one Google.

So now what? Well, I guess I just create a new Feedburner link and take it from there. Sorry, but if you've been using RSS to follow this blog, then would you mind changing it please? It's annoying, I know.

2010-05-13T07:03:00.004-05:00

Wrongful Indexing

Some years ago I commented on the number and type of indexes that can be used for tuples. At the time, I pointed out that indexing triples required 3 indexes, and there were 2 appropriate sets of indexes to use. Similarly, quads can be indexed with 6 indexes, and there are 4 such sets. In both cases (5-tuples get silly, requiring 10 indexes and there are 12 possible sets). In each case, I said that each set of indexes would work just as well as the others, and so I always selected the set that included the natural ordering of the tuples.

So for RDF triples, the two sets of indexes are ordered by:

  subject,predicate,object
  predicate,object,subject
  object,subject,predicate

and

  object,predicate,subject
  predicate,subject,object
  subject,object,predicate

For convenience I have always chosen the first set, as this includes the natural ordering of subject/predicate/object, but it looks like I was wrong.

In using these indexes I've always presumed random 3-tuples, but in reality the index is representing RDF. Whenever I thought about the data I was looking for, this seemed OK, but that's because I tended to think about properties on resources, and not other RDF structures. In particular, I was failing to consider lists.

Since first building RDF indexes (2001) and writing about them (2004) I've learnt a lot about functional programming. This, in turn, led to an appreciation of lists, particularly in algorithms. I'm still not enamored of them in on-disk structures, but I do appreciate their utility and elegance in many applications. So it was only natural that when I was representing RDF graphs with Scala and I needed to read lists, then I used some trivial recursive code to build a Scala list, and it all looked great. But then I decided to port the Graph class to Java to avoid including the Scala Jars for a really lightweight library.

I'd like to point out that I'm talking about a library function that can read a well-formed RDF list and return a list in whatever programming language the library is implemented in. The remainder of this post is going to presume that the lists are well formed, since any alternatives can never be returned as a list in an API anyway.

Reading a list usually involves the subject/predicate/object (SPO) index. You start by looking up the head of the list as a subject, then the predicates rdf:first for the data at that point in the list, and rdf:rest for the rest of the list. Rinse and repeat until rdf:rest yields a value of rdf:nil. So for each node in the list, there is a lookup by subject, followed by two lookups by predicate. This is perfect for the SPO index.

However, it's been bugging me that I have such a general approach, when the structure is predetermined. Why look up these two predicates so generally, when we know exactly what we want? What if we reduce the set we're looking in to just the predicates that we want and then go looking for the subjects? That would mean looking first by predicate, then subject, then object, leading to a PSO index. So what does that algorithm look like?

First, look up the rdf:rest predicate, leading to an index of subject/object containing all list structures. Next, look up the rdf:rest predicate, retrieving subject/objects containing all the list data. Now to iterate down the list no longer involves finding the subject followed by the predicate, in order to read the next list node, but rather it just requires finding the subject, and the list node is in the corresponding object. Similarly with the data stored in the node. We're still doing a fixed number of lookups in an index, which means that the overall complexity does not change at all. Tree indexes will still give O(log(N)) complexity, and hash indexes will still give O(1) complexity. However, each step can involve disk seeks, so it's worth seeing the difference.

To compare more directly, using an SPO index requires every node to:

Lookup across the entire graph by subject.
Lookup across the subject (2 or 3 predicates) for rdf:first.
Lookup across the subject (2 or 3 predicates) for rdf:rest.

For the PSO index we have some initial setup:

Lookup across the entire graph for the rdf:first predicate.
Lookup across the entire graph for the rdf:rest predicate.

Then for every node:

Lookup the rdf:first data for the value.
Lookup the rdf:rest data for the next node.

It's important to note a few things, particularly for tree indexes. Trees are the most likely structure used when using a disk, so I'm going to concentrate on them. The number of subjects in a graph tends to scale up with the size of the graph, while the number of predicates is bounded. This is because predicates are used to express a model, with each predicate indicating a certain relationship. Any system trying to deal with the model needs some idea of the concepts it is dealing with, so it's almost impossible to deal with completely arbitrary relationships. If we know what the relationships are ahead of time, then there must be a fixed number of them. In contrast, subjects represent individuals, and these can be completely unbounded. So if we look up an entire graph to find a particular subject, then we may have to dive down a very deep tree to find that subject. Looking across the entire graph for a given predicate will never have to go very deep, because there are so few of them.

So the first algorithm (using the SPO index) iteratively looks across every subject in the graph for each node in the list. The next two lookups are trivial, since nodes in a list will only have properties of rdf:first, rdf:rest and possibly rdf:type. The data associated with these properties will almost certainly be in the same block where the subject was found, meaning that there will be no more disk seeks.

The second algorithm (using the PSO index) does a pair of lookups across every predicate in the graph. The expected number of disk seeks to find the first predicate is significantly fewer than for any of the "subject" searches in the first algorithm. Given how few predicates are in the system, then finding the second predicate may barely involve any disk seeks at all, particularly since the first search will have populated the disk cache with a good portion of the tree, and the similarities in the URIs of the predicates is likely to make both predicates very close to each other. Of course, this presumes that the predicates are even in a tree. Several systems (including one I'm writing right now) treat predicates differently because of how few there are. Indeed, a lot of systems will cache them in a hashtable, regardless of the on-disk structure. So the initial lookup is very inexpensive.

The second algorithm then iterates down the list, just like the first one does. However, this time, instead of searching for the nodes out of every subject in the list, it will now be just searching for these nodes in the subjects that appear as list nodes. While lists are commonly used in some RDF structures, the subjects in all the lists typically form a very small minority out of all the subjects in a graph. Consequently, depending on the type and depth of trees being used, iterating through a list with the second algorithm, could result in a 2 or three (or more) fewer disk seeks for each node. That's a saving that can add up.

Solid State Disks

I've been talking about disk seeks, but this is an artificial restriction imposed by spinning disk drives. Solid State Disks (SSDs) don't have this limitation.

People have been promoting solid state drives (SSDs) for some years now, but I've yet to use them myself. In fact, most people I know are still using traditional spinning platters. The prices difference is still a big deal, and for really large data, disk drives are still the only viable option. But this will change one day, so am I right to be concerned about disk seeks?

Disk seeks are a function of data locality. When data has to be stored somewhere else on a disk, the drive head must physically seek across the surface to this new address. SSDs don't require anything to move, but there are still costs in addressing scattered data.

While it is possible to address every bit of memory in a device in one step, in practice this is never done. This is because the complexity of the circuit grows exponentially as you try to address more and more data in one step. Instead, the memory is broken up into "banks". A portion of the address can now be used to select a bank, allowing the remaining bits in the address to select the required memory in just that bank. This works well, but it does lead to some delays. Selecting a new bank requires "setup", "hold" and "settling" times, all leading to delays. These delays are an order of magnitude smaller than seek delays for a spinning disk, but they do represent a limit on the speed of the device. So while SSDs are much faster than disk drives, there are still limits to their speed, and improvements in data locality can still have a significant impact on performance.

2010-05-04T20:06:00.003-05:00

Web Services Solved

It's a been a long couple of days, and I really want to relax instead of write, but it's been a few days and I've been promising myself that I'd write, so I figured I need to get something written before I can open a beer.

First of all, the web services problem was trivial. I recently added a new feature that allowed ContextHandlers in Jetty to be configured. Currently the only configuration option I've put in there is the one that was requested, and that is the size of a form. Apparently this is 200k by default, but if you're going to load large files then that may not be enough. Anyway, the problem came about when my code tried to read the maximum form size from the configuration. I wasn't careful enough to check if the context was being configured in the first place, so an NPE was thrown if it was missing.

Fortunately, most people would never see the problem, since the default configuration file includes details for contexts, and this ends up in every build by default. The reason I was seeing it is because Topaz replaces the configuration with their own (since it describes their custom resolvers), and this custom configuration file doesn't have the new option in it. Of course, I could just add it to Topaz, but the correct solution is to make sure that a configuration can't throw an NPE – which is exactly what I told the Ehcache guys, so it's fitting that I have to do it myself. :-)

Hosting

Since I'm on the topic of Topaz, it looks like te OSU/OSL guys and I have both the Topaz and Mulgara servers configured. They wouldn't typically be hosting individual projects (well, they do occasionally), but in this case it's all going in under the umbrella of Duraspace. Of course this has taken some time, and in the case of Topaz I'm still testing that it's all correct, but I think it's there. I'll be changing the DNS for Topaz over soon, and Mulgara was changed last week. Mulgara's DNS has propagated now, so I'm in the process of cutting a long-overdue release.

One thing that changed in the hosting is that I no longer have a Linux host to build the distribution on. Theoretically, that would be OK, since I ought to be able to build on any platform. However, Mulgara is still distributed as a build for Java 1.5 (I've had complaints when I accidentally put out a release that was built for 1.6). This is easy to set up on Linux, since you just change the JAVA_HOME environment variable to make sure you're pointing to a 1.5 JDK. However, every computer I have here is a Mac. Once upon a time that didn't change anything, but now all JDKs point to JDK 1.6. That means I need to configure the compiler to output the correct version. It can be done, but Mulgara wasn't set up for it.

If you read the Ant documentation on compiling you'll see that you can set the target to any JDK version you like. However, that would require editing 58 files (I just had to run a quick command to see that. Wow... I didn't realize it was so bad). I'm sure I'd miss a <javac> somewhere. Fortunately, there is another option, even if the Ant documents discourage it. There's a system parameter called ant.build.java.target which will set the default value globally. I checked to make sure that nothing was going to be missed by this (ie. that nothing was manually setting the target) and when it all looked good I changed the build script to set this to "1.5". I didn't change the corresponding script on Windows, but personally I only want this for distributions. Anyone who needs to set it up on Windows probably has the very JDK they want to run Mulgara on anyway.

Well, that's my story, and I'm sticking to it.

Semantic Universe

What else? Oh yes. I wrote a post for Semantic Universe. It's much more technical that the other posts I've seen there, but I was told that would be OK. I'm curious to know how it will be received.

I was interested in how it was promoted on Twitter. I wrote something that mixes linked data and SPARQL to create a kind of federated query (something I find to be very useful, BTW, and I think more people should be aware of it). However, in the process I mentioned that this shouldn't be necessary, since SPARQL 1.1 will be including a note on federated querying. Despite SPARQL 1.1 only being mentioned a couple of times, the tweet said, that I discussed "how/why SPARQL 1.1 plans to be a bit more dazzling". Well, admittedly SPARQL 1.1 will be more dazzling, but my post didn't discuss that. Perhaps it was a hint to talk about that in a future post.

Miscellanea

Speaking of future posts, I realized that I've been indexing RDF backwards, at least for lists. It doesn't affect the maximum complexity of iterating a list, but it does affect the expected complexity. I won't talk about it tonight, but hopefully by mentioning it here I'll prompt myself to write about it soon.

This last weekend was the final weekend that my in-laws were visiting from the other side of the planet, so I didn't get much jSPARQLc done. I hope to fix that tomorrow night. I'm even wondering if the Graph API should be factored out into it's own sister project. It's turning out to be incredibly useful for reading and working with RDF data when you just want access to the structure and you don't need a full query engine. It would even plug directly into almost every query engine out there, so there's a lot of utility to it.

I'm also finally learning Hadoop, since I've had more pressure to consider a clustered RDF store, much as BBN have created. I've read the MapReduce, GFS and BigTable papers, so I went into it thinking I'd be approaching the problem one way, but the more I learn the more I think it would scale better if I went in other directions. So for the moment I'm trying to avoid getting too many preconceived notions of architecture until I've learnt some more and applied my ideas to some simple cases. Of course, Hive tries to do the same thing for relational data, so I think I need to look at the code in that project too. I have a steep learning curve ahead of me there, but I've been avoiding those recently, so it will do me some good.

Other than that, it's been interviews and immigration lawyers. These are horribly time consuming, and way too boring to talk about, so I won't. See you tomorrow.

2010-04-30T09:03:00.003-05:00

Topaz and Ehcache

Don't ask what I did 2 days ago, because I forget. It's one of the reasons I need to blog more. I also forgot because my brain got clouded due to yesterday's tasks.

I didn't get a lot done yesterday for the simple reason that I was filling in paperwork for an immigration lawyer. For anyone who has ever had to do this mind numbing task, they probably know that you end up filling in mostly the same things that you filled in 12 months ago, but just subtly different, so there's no possibility of using copy/paste. They will also know that getting all of the information together can take half a day. Strangely, the hardest piece of information was my mother's birthday (why do they want this? I have no idea). There is a bizarre story behind this, that I won't go into right now, but with my mother asleep in Australia there was no way to ask her. Fortunately, I had two brothers online at the time: one lives here in the USA, and the other is a student in Australia (who was up late drinking with friends, and decided to get online to say hi before going to bed). Unfortunately, neither of them new either (the well-oiled brother being completely unaware of why we didn't know).

But I finally got it done, cleared my immediate email queue (only 65 to go!) and got down to work.

My first task was to get the Topaz version of Mulgara up and running the same way it used to run 10 months ago. I had already tried going back through the subversion history for the project (ah, that's one of the things I did two days ago!), but with no success. However, I had been able to find out that others have had this error with Ehcache. No one had a fix, since upgrading the library normally made their problem go away. Well I tried upgrading it myself, but without luck. Evidently the problem was in usage, but I didn't know if it was a problem in the code talking to Ehcache, or the XML configuration file that is uses. Since everything used to work without error, I figured that the code was probably OK, and that it was the configuration at fault. The complexity of the configuration file only deepened my suspicion.

I didn't want to learn the ins-and-outs of an Ehcache configuration, so my first non-lawyer related task yesterday was to look at the code where the exception was coming from (thank goodness the Java compiler includes line numbers in class files by default). So it turned out that Terracotta (the company who provides Ehcache) have a nice navigable HTML versions of all their opensource code, which made this task much more pleasant than having to get it all from Subversion. This led me to the line that was throwing the exception, which looked like:

List localCachePeers = cacheManager.getCachePeerListener("RMI").getBoundCachePeers();

Great, a compound statement. OK, so I use them myself, but they're annoying when you debug. Was it cacheManager that was null or was it the return value from getCachePeerListener("RMI")?

At this point I jumped around in the code for a bit (I quite like those hyperlinks. I've seen them before too. I should figure out which project creates these pages), looking for what initialized cacheManager. I didn't find definitive proof that it was set, but it looked pretty good. So I looked at getCachePeerListener("RMI") and discovered that it was a lookup in a Hashmap. This is a prime candidate for returning null, and indeed the documentation for the method even states that it will return null if the scheme is not configured. Since the heartbeat code was making the presumption that it could perform an operation on the return value of this method, then the "RMI" scheme is evidently supposed to be configured in every configuration. The fact that it's possible for this method to return null (even if it's not supposed to) means that the calling code is not defensive enough (any kind of NullPointerException is unacceptable, even if you catch it and log it). Also, the fact that something is always supposed to be configured for "RMI" had me looking in the code to discover where listeners get registered. This turned out to come from some kind of configuration object, which looked like it had been built from an XML file.

So the problem appears to be the combination of something that's missing from the configuration file, and a presumption that it will be there (i.e. the code couldn't handle it if the item was missing). At this point I joined a forum and described the issue, both to point out that the code should be more defensive, and also to ask what is missing. In the meantime, I tried creating my own version of the library with a fix in it, and discovered that the issue did indeed go away. Then this morning I received a message explaining what I needed to configure, and also that the code now deals with the missing configuration. It still complains on every heartbeat (in 5 second intervals), but now it tells you what's wrong, and how to fix it:

WARNING: The RMICacheManagerPeerListener is missing. You need to configure
  a cacheManagerPeerListenerFactory with
  class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
  in ehcache.xml.

Kudos to "gluck" for the quick response. (Hey, I just realized – "gluck" is from Brisbane. My home town!)

Incidentally, creating my own version of Ehcache was problematic in itself. It's a Maven project, and when I tried to build "package" it attempted to run all the tests, which took well over an hour. Coincidentally, it also happened to be dinner time, so I came back later, only to discover that not all of the tests had passed, and that the JAR files had not been built. Admittedly, it was an older release, but it was a release, so I found this odd. In the end, I avoided the tests by removing the code, and running the "package" target again.

With all the errors out of the way I went back to the Topaz system again and run it. As I said earlier, it was no longer reporting errors. But then when I tried to use queries against it, it was completely unresponsive. A little probing found that it wasn't listening for HTTP at all, so I checked the log, and sure enough:

EmbeddedMulgaraServer> Unable to start web services due to: null [Continuing]

Argh.

Not only do I have to figure out what's going on here, it also appears that someone (possibly me) didn't code this configuration defensively enough! Sigh.

At that point it was after dinner, and I had technical reading to do for a job I might have. Well, I've received the offer, but it all depends on me not being kicked out of the country.

2010-04-26T20:52:00.004-05:00

Multitasking

At the moment I feel like I have too many things on the boil. I'm regularly answering OSU/OSL about porting data over from Topaz and Mulgara, I'm supposed to be getting work done on SPARQL Update 1.1 (which suffered last week while I looked around for a way to stay in the USA), I'm trying to track down some unnecessary sorting that is being done by Mulgara queries in a Topaz configuration, I'm trying to catch up on reading (refreshing my memory on some important Semantic Web topics so that I keep nice and current), I'm trying to find a someone who can help us not get kicked out of the country (long story), I'm responding to requests on Mulgara, and when I have some spare time (in my evenings and weekends) I'm trying to make jSPARQLc look more impressive.

So how's it all going?

OSU/OSL

Well, OSU/OSL are responding slowly, which is frustrating, but also allows me the time to look at other things, so it's a mixed blessing. They keep losing my tickets, and then respond some time later apologizing for not getting back to me. However, they're not entirely at fault, as I have sudo access on out server, and could do some of this work for myself. The thing is that I've been avoiding the learning curves of Mailman and Trac porting while I have other stuff to be doing. All the same, we've made some progress lately, and I'm really hoping to switch the DNS over to the new servers in the next couple of days. Once that happens I'll be cutting an overdue release to Mulgara.

SPARQL Update 1.1

I really should have done some of this work already, but my job (and impending lack thereof) have interfered. Fortunately another editor has stepped up to help here, so with his help we should have it under control for the next publication round.

The biggest issues are:

Writing possible responses for each operation. In some cases this will simply be success/failure, but for others it will mean describing partial success. For instance, a long-running LOAD operation may have loaded 100,000 triples before failing. Most systems want that data to stay in there, and not roll back the change, and we need some way to report what has happened.
Dealing with an equivalent for FROM and FROM NAMED in INSERT/DELETE operations. Using FROM in a DELETE operation looks like this is the graph that you want to remove data from, whereas we really want to describe the list of graphs (and/or named graphs) that affect the WHERE clause. The last I read, the suggestion to use USING and USING NAMED instead was winning out. The problem is that no one really likes it, though they don't like every other suggestion even more. :-)

I doubt I'll get much done before the next meeting, but at least I did a little today, and I've been able to bring the other editor up to speed.

Sorting

This is a hassle that's been plaguing me for a while. A long time back PLoS complained about queries that were taking too long (like, up to 10 minutes!). After looking at them, I found that a lot of sorting of a lot of data was going on, so I investigated why.

From the outset, Mulgara adopted "Set Semantics". This meant that everything appeared only once. It made things a little harder to code, but it also made the algebra easier to work with. In order to accomplish this cleanly, each step in a query resolution removed duplicates. I wasn't there, so I don't know why the decision wasn't made to just leave it to the end. Maybe there was a good reason. Of course, in order to remove these duplicates, it had to order the data.

When SPARQL came along, the pragmatists pointed out that not having duplicates was a cost, and for many applications it didn't matter anyway. So they made duplicates allowable by default, and introduced the DISTINCT keyword to remove them if necessary, just like SQL. Mulgara didn't have this feature (though the Sesame-Mulgara bridge hacked it to work by selecting all variables across the bridge and projecting out the ones that weren't needed), but given the cost of this sorting, it was obvious we needed it.

The sorting in question came about because the query was a UNION between a number of product terms (or a disjunction of conjunctions). In order to make the UNION in order, each of the product terms was sorted first. Of course, without the sorting, a UNION can be a trivial operation, but with it the system was very slow. Actually, the query in question was more like a UNION between multiple products, with some of the product terms being UNIONS themselves. The resulting nested sorting was painful. Unfortunately, the way things stood, it was necessary, since there was no way to do a conjunction (product) without having the terms sorted, and since some of the terms could be UNIONS, then the result of a UNION had to be sorted.

The first thing I did was to factor the query out into a big UNION between terms (a sum-of-products). Then I manually executed each one to find out how long it took. After I added up all the times, the total was about 3 seconds, and most of that time was spent waiting for Lucene to respond (something I have no control over), so this was looking pretty good.

To make this work in a real query I had to make the factoring occur automatically, I had to remove the need to sort the output of a UNION, and I had to add a query syntax to TQL to turn this behavior on and off.

The syntax was already done for SPARQL, but PLoS were using TQL through Topaz. I know that a number of people use TQL, so I wasn't prepared to break the semantics of that language, which in turn meant that I couldn't introduce a DISTINCT keyword. After asking a couple of people, I eventually went with a new keyword of NONDISTINCT. I hate it, but it also seemed to be the best fit.

Next I did the factorization. Fortunately, Andrae had introduced a framework for modifying a query to a fixpoint, so I was able to add to that for my algebraic manipulation. I also looked at other expressions, like differences (which was only in TQL, but is about to become a part of SPARQL) and Optional joins (which were part of SPARQL, and came late into TQL). It turns out that there is a lot that you can do to expand a query to a sum-of-products (or as close to as possible), and fortunately it was easy to accomplish (thanks Andrae).

Finally, I put the code in to only do this factorization if a query was not supposed to be DISTINCT (the default in SPARQL, and if the new keyword is present for TQL). Unexpectedly, this ended up being the trickiest part. Part of the reason was because some UNION operations still needed to have the output sorted if they were embedded in an expression that couldn't be expanded out (a rare, though possible situation, but only when mixing with differences and optional joins).

I needed lots of tests to be sure that I'd done things correctly. I mean, this was a huge change to the query engine. If I'd got it wrong, it would be a serious issue. As a consequence, this code didn't get checked in and used in the timeframe that it ought to have. But finally, I felt it was correct, and I ran my 10 minute queries against the PLoS data.

Now the queries were running at a little over a minute. Well, this was an order of magnitude improvement, but still 30 times slower than I expected. What had happened? I checked where it was spending it's time, and it was still in a sort() method. Sigh. At a guess, I missed something in the code that allows sorting when needed, and avoids it the rest of the time.

Unfortunately, the time taken to get to that point had led to other things becoming important, and I didn't pursue the issue. Also, the only way to take advantage of this change was to update Topaz to use SELECT NONDISTINCT but that keyword was going to fail unless being run on a new Mulgara server. This meant that I couldn't update Topaz until I knew they'd moved to a newer Mulgara, and that didn't happen for a long time. Consequently, PLoS didn't see a performance change, and I ended up trying to improve other things for them rather than tracking it down. In retrospect, I confess that this was a huge mistake. PLoS recently reminded me of their speed issues with certain queries, but now they're looking at other solutions to it. Well, it's my fault that I didn't get it all going for them but that doesn't mean I should never do it, so I'm back at it again.

The problem queries only look really slow when executed against a large amount of data, so I had to get back to the PLoS dataset. The queries also meant running the Topaz setup, since they make use of the Topaz Lucene resolvers. So I updated Topaz and built the system.

Since I was going to work on Topaz, I figured I ought to add in the use of NONDISTINCT. This was trickier than I expected, since it looked like the Topaz code was not only trying to generate TQL code, it was also trying to re-parse it to do transformations on it. The parser in question was Antlr which is one that I've limited experience with, so I spent quite a bit of time trying to figure out what instances of SELECT could have a NONDISTINCT appended to it. In the end, I decided that all of the parsing was really for their own OQL language (which looks a lot like TQL). I hope I was right!

After spending way to long on Topaz, I took the latest updates from SVN, and compiled the Topaz version of Mulgara. Then I ran it to test where it was spending time in the query.

Unfortunately, I immediately started getting regular INFO messages of the form:

MulticastKeepaliveHeartbeatSender> Unexpected throwable in run thread. Continuing...null
java.lang.NullPointerException
 at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:180)
 at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:137)

Now Mulgara doesn't make use of ehcache at all. That's purely a Topaz thing, and my opinion to date has been that it's more trouble than it's worth. This is another example of it. I really don't know what could be going on here, but luckily I kept open the window where I updated the source from SVN, and I can see that someone has modified the class:

  org.topazproject.mulgara.resolver.CacheInvalidator

I can't guarantee that this is the problem, but I've never seen it before, and no other changes look related.

But by this point I'd reached the end of my day, so I decided I should come back to it in the morning (errr, maybe that will be after the SPARQL Working Group meeting).

Jetty

Despite that describing a good portion of my day (at least, those parts not spent in correspondence), I also got a few things done over the weekend. The first of these was a request for a new feature in the Mulgara Jetty configuration.

One of our users has been making heavy use of the REST API (yay! That time wasn't wasted after all!) and had found that Jetty was truncating their POST methods. It turns out that Jetty restricts this to 200,000 characters by default, and it wasn't enough for them. I do have to wonder what they're sticking in their queries, but OK. Or maybe they're POSTing RDF files to the server? That might explain it.

Jetty normally lets you define a lot of configuration with system parameters from the command line, or with an XML configuration file, and I was asked if I could allow either of those methods. Unfortunately, our embedded use of Jetty doesn't allow for either of these, but since I was shown exactly what was wanted I was able to track it down. A bit of 'grepping' for the system parameter showed me the class that gets affected. Then some Javadoc surfing took me to the appropriate interface (Context), and then I was able to go grepping through Mulgara's code. I found where we had access to these Contexts, and fortunately the Jetty configuration was located nearby. Up until this point Jetty's Contexts had not been configurable, but now they are. I only added in the field that had been requested, but everything is set up to add more with just two lines of code each - plus the XSD to describe the configuration in the configuration file.

jSPARQLc

My other weekend task was to add CONSTRUCT support to jSPARQLc. Sure, no one is using it yet, but Java needs so much boilerplate to make SPARQL work, that I figure it will be of use to someone eventually – possibly me. I'm also finding it to be a good learning experience for why JDBC is the wrong paradigm for SPARQL. I'm not too worried about that though, as the boilerplate stuff is all there, and it would be could easy to clean it up to something that doesn't try to conform to SPARQL. But for the moment it's trying to make SPARQL look like JDBC, and besides there's already another library that isn't trying to look like JDBC. I'd better stick to my niche.

I've decided that I'm definitely going to go with StAX to make forward-only result sets. However, I'm not sure if there is supposed to be a standard configuration for JDBC to set the desired form of the result set, so I haven't started on that yet.

The result of a CONSTRUCT is a graph. By default we can expect a RDF/XML document, though other formats are certainly possible. I'm not yet doing content negotiation with jSPARQLc, though that may need to be configurable, so I wanted to keep an open mind about what can be returned. That means that standard SELECT queries could return SPARQL Query Result XML or JSON, and CONSTRUCT queries could result in RDF/XML, N3, or RDF-JSON (Mulgara supports all but the last, but maybe I should add that one in. I've already left space for it).

Without content negotiation, I'm keeping to the XML formats for the moment, with the framework looking for the other formats (though it will report that the format is not handled). Initially I thought I might have to parse the top of the file, until I cursed myself for an idiot and looked up the content type in the header. Once the parameters have been removed, I could use the content type to do a "look up" for a parser constructor. I like this approach, since it means that any new content types I want to handle just become new entries in the look-up table.

This did leave me wondering if every SPARQL endpoint was going to fill in the Content-Type header, but I presume they will. I can always try a survey of servers once I get more features into the code.

Parsing an RDF/XML graph is a complex process that I had no desire to attempt (it could take all week to get it right - if not longer). Luckily, Jena has the ARP parser to do the job for me. However, the ARP parser is part of the main Jena jar, which seemed excessive to me. Fortunately, Jena's license is BSD, so it was possible to bring the ARP code in locally. I just had to update the packages to make sure it wouldn't conflict if anyone happens to have their own Jena in the classpath.

Funnily enough, while editing the ARP files (I'm doing this project "oldschool" with VIM). I discovered copyright notices for Plugged In Software. For anyone who doesn't know, Plugged In Software was the company that created the Tucana Knowledge Store (later to be open sourced as Kowari, then renamed to Mulgara). The company changed its name later on to match the software, but this code predated that. Looking at it, I seem to recall that the code in question was just a few bugfixes that Simon made. But it was still funny to see.

Once I had ARP installed, I could parse a graph, but into what? I'm not trying to run a triplestore here, just an API. So I reused an interface I came up with when I built my SPARQL test framework when I needed to read a graph. The interface isn't fully indexed, but it lets you do a lot of useful things if you want to navigate around a graph. For instance, it lets you ask for the list of properties on a subject, or to find the value(s) of a particular subject's property, or to construct a list from an RDF collection (usually an iterative operation). Thinking that I might also want to ask questions about particular objects (or even predicates) I've added in the other two indexes this time, but I'm in two minds about whether they really need to be there.

The original code for my graph interface was in Scala, and I was tempted to bring it in like this. But one of the purposes of this project was to be lightweight (unfortunately, I lost that advantage when I discovered that ARP needs Xerces), so I thought I should try to avoid the Scala JARs. Also, I thought that the exercise of bringing the Scala code into Java would refresh the API for me, as well as refresh me on Scala (which I haven't used for a couple of months). It did all of this, as well as having the effect of reminding me why Scala is so superior to Java.

Anyway, the project is getting some meat to it now, and it's been fun to work on in my evenings, and while I've been stuck indoors on my weekends. If anyone has any suggestions for it, then please feel free to let me know.

2010-04-24T16:39:00.003-05:00

Program Modeling

The rise of Model Driven Architecture in development, has created a slew of modeling frameworks that can be used as tools for this design process. While some MDA tools involve a static design and a development process, others seek to have an interactive model that allows developers to work with the model at runtime. Since an RDF store allows models developed in RDFS and OWL to be read and updated at runtime, it therefore seems natural to use the modeling capabilities of RDFS and OWL to provide yet another framework for design and development. RDF also has the unusual characteristic of seamlessly mixing instance data with model data (the so-called "ABox" and "TBox"), giving the promise of a system that allows both dynamic modeling and persistent data all in a single integrated system. However, there appear to be some common pitfalls that developers fall prey to, which make this approach less useful than it might otherwise be.

For good or ill, two of the most common languages used on large scale projects today are Java and C#. Java in particular also has good penetration on the web, though for smaller projects more modern languages, such as Python or Ruby, are more often deployed. There are lots of reasons for Java and C# to be so popular on large projects: They are widely known and it can be easy to assemble a team around it; both the JVM and .Net engines have demonstrated substantial benefits in portability, memory management and optimization through JIT compiling; they have established a reputation of stability over their histories. Being a fan of functional programming and modern languages, I often find Java to be frustrating, but these strengths often bring me back to Java again, despite its shortcomings. Consequently, it is usually with Java or C# in mind that MDA projects start out trying to use OWL modeling.

On a related note, enterprise frameworks regularly make use of Hibernate to store and retrieve instance data using a relational database (RDBMS). Hibernate maps object definitions to an equivalent representation in a database table using an Object-Relational Mapping (ORM). While not a formal MDA modeling paradigm, a relational database schema forms a model, in the same way that UML or MOF does (only less expressive). While an ORM is not a tool for MDA, it nevertheless represents the code of a project in a form of model, with instance data that is interactive at runtime.

Unfortunately, the ORM approach offers a very weak form of modeling, and it has no capability to dynamically update at runtime. Several developers have looked at this problem and reasoned that perhaps these problems could be solved by modeling in RDF instead. After all, an RDF store allows a model to be updated as easily as the data, and the expressivity of OWL is far greater than that of the typical RDBMS schema. To this end, we have seen a few systems which have created an Object-Triples Mapping (OTM), with some approaches demonstrating more utility than others.

Static Languages

When OTM approaches are applied to Java and C#, we typically see an RDFS schema that describes the data to be stored and retrieved. This can be just as effective as an ORM on a relational database, and has the added benefit of allowing expansions of the class definitions to be implemented simply as extra data to be stored, rather than the whole change management demanded with the update of a RDBMS. An RDF store also has the benefit of being able to easily annotate the class instances being stored, though this is not reflected in the Java/C# classes and requires a new interface to interact with it. Unfortunately, the flow of modeling information through these systems is typically one-way, and does not make use of the real power being offered by RDF.

ORM in Hibernate embeds the database schema into the programming language by mapping table schemas to class descriptions in Java/C# and table entries to instances of those classes in runtime memory. Perhaps through this example set by ORM, we often see OTM systems mapping OWL classes to Java/C# classes, and RDF instances to Java/C# instances. This mapping seems intuitive, and it has its uses, but it is also fundamentally flawed.

The principle issue with OTM systems that attempt to embed themselves in the language, is that static languages (like the popular Java and C# languages) are unable to deal with arbitrary RDF. RDF and OWL work on an Open World Assumption, meaning that there may well be more of the model that the system is not yet aware of, and should be capable of taking into consideration. However, static languages are only able to update class definitions outside of runtime, meaning that they cannot accept new modeling information during runtime. It is possible to define a new class at runtime using a bytecode editing library, but then the class may only be accessed through meta-programming interfaces like reflection, defeating the purpose of the embedding in the first place. This is what is meant by the flow of modeling information being one-way: updates to the program code can be dynamically handled by a model in an RDF store, but updates to the model cannot be reflected by corresponding updates in the program.

But these programming languages are Turing Complete. We ought to be able to work with dynamic modeling in triples with static languages, so how do we approach it? The solution is to abandon the notion of embedding the model into the language. These classes are not dynamically reconfigurable, and therefore they cannot update with new model updates. Instead, object structures that can be updated can be used to represent the model. Unfortunately, this no longer means that the model is being used to model the programming code (as desired in MDA), but it does mean that the models are now accurate, and can represent the full functionality being expressed in the RDFS/OWL model.

As an example, it is relatively easy to express an instance of a class as a Java Map, with properties being the keys, and the "object" values being the values in the map. This is exactly the same as the way structures are expressed in Perl, so it should be a familiar approach to many developers. These instances should be constructed with a factory that takes a structure that contains the details of an OWL class (or, more likely, that subset of OWL that is relevant to the application). In this way it is possible to accurately represent any information found in an RDF store, regardless of foreknowledge. I can personally attest to the ease and utility of this approach, having written a version of it over two nights, and then providing it to a colleague who used it along with Rails to develop an impressive prototype ontology and data editor, complete with rules engine, all in a single day. I expect others can cite similar experiences.

Really Embedding

So we've seen that static languages like C# and Java can't dynamically embed "Open" models like RDF/OWL, but are there languages that can?

In the last decade we've seen a lot of "dynamic" languages gaining popularity, and to various extents, several of these offer that functionality. The most obvious example is Ruby, which has explicit support for opening up already defined classes in order to add new methods, or redefine existing ones. Smalltalk has Metaprogramming. Meta-programming isn't an explicit feature for many other languages, but so long as the language is dynamic there is often a way, such as these methods for Python and Perl.

Despite the opportunity to embed models into these languages, I'm unaware of any systems which do so. It seems that the only forms of OTM I can find are in the languages which already have an ORM, and are probably better used with that paradigm. There are numerous libraries for accessing RDF in each of the above languages, such as the Redland RDF Language bindings for Ruby and Python, the Rena, and ActiveRDF in Ruby, RDFLib and pyrple in Python, Perl RDF... the list goes and continues to grow. But none of the libraries I know perform any kind of language embedding in a dynamic language. My knowledge in this space is not exhaustive, but the lack of obvious candidates tells a story on its own.

Is this indicative of dynamic languages not needing the same kind of modeling storage that static languages seem to require? Java and C# often used Hibernate and similar systems in large scale commercial applications with large development teams, while dynamic languages are often used by individuals or small groups to quickly put together useful systems that aim at a very different target market. But as commercial acceptance of dynamic languages develops further, perhaps this kind of modeling would be useful in future. In fact a good modeling library like this could well show a Java team struggling with their own version of an OTM, just what they've been missing in their closed world.

Topaz

I wrote this essay partly out of frustration with a system I've worked with called Topaz. Topaz is a very clever piece of OTM code, written for Java, and built over my own project Mulgara. However, Topaz suffers from all of the closed world problems I outlined above, without any apparent attempt to mitigate the use of RDF by reading extra annotations, etc. It has been used by the Public Library of Science for their data persistence, but they have been unhappy with it, and it will soon be replaced.

While performance in Mulgara (something I'm working on), in Topaz's use of Mulgara, and in Topaz itself, has been an issue, I believe that a deeper problem lay in the use of a dynamic system to represent static data. My knowledge of Topaz has me wondering why the system didn't simply choose to use Hibernate. I'm sure the used of RDF and OWL provided some functionality that isn't easy accomplished by Hibernate, but I don't see the benefits being strong enough to make it worth the switch to a dynamic model.

For my money, I'd either adopt the standard static approach that so many systems already employ to great effect, or go the whole hog and design an OTM system that is truly dynamic and open.

2010-04-21T21:26:00.003-05:00

Work

I've had a number of administrative things to get done this week, since work will be taking a dramatic new turn soon. I've been missing working in a team, so that part will be good, but there are too many unknowns right now, including a visa nightmare that has been unceremoniously dumped in my lap. So, I'm stressed and have a lot to do. But that doesn't mean I'm not working.

Multi-Results

I'd recently been asked to allow the HTTP interfaces to return multiple results. One look at the SPARQL Query Results XML Format makes it clear that SPARQL isn't capable of it, but the TQL XML format has always allowed it - or at least, I think it did. The SPARQL structure is sort of flat, with a declaration of variables at the top, and bindings under it. The TQL structure is similar, but embeds it all in another element called a "query". That name seems odd (since it's a result, not a query), so I wonder if someone had intended to include the original query as an attribute of that tag. Anyway, the structure is available, so I figured I should add it.

This was a little trickier than I expected, since I'd tried to abstract out the streaming of answers. This means that I could select the output type simply by using a different streaming class. For now, the available classes are SPARQL-XML, SPARQL-JSON and TQL-XML, but there could easily be others. However, I now had to modify all of those classes to handle multiple answers. Of course, the SPARQL streaming classes had to ignore them, while the TQL class didn't, but that wasn't too hard. However, I came away feeling that it was somehow messier than it ought to have been. Even so, I thought it worked OK.

One bit of complexity was in handling the GET requests of TQL vs. SPARQL. In SPARQL we can only expect a single query in a GET, but TQL can have multiple queries, separated by semicolons. While I like to keep as much code as possible common to all the classes, in the end I decided that the complexity of doing this was more than it was worth, and I put special multi-query-handling code in the TQL servlet.

All of this was done a little while ago, but because I was waiting on responses on the mulgara.org move, I decided not to release just yet. This was probably fortunate, since I got an email the other day explaining that subqueries were not being embedded properly. They were starting with a new query element tag, but not closing with them. However, these tags should not have appeared at this level at all. The suggested patch would have worked, but it relied on checking the indentation used for pretty-printing in order to find out if the query element should be opened. This would work, but was covering the problem, rather than solving it. A bit of checking, and I realized that I had code to send a header for each answer, code to send the data for the answer, but no code for the "footer". The footer would have been the closing tag for the query element, and this was being handled in other code, meaning that it only came up at the top level, and not in the embedded sub-answers. This in turn meant that it wasn't always matching up to the header. So I introduced a footer method for answers (a no-op in SPARQL-XML and SPARQL-JSON) which cleaned up the process well enough that avoiding the header (and footer) on sub-answers was now easy to see and get right.

So was I done? No. The email also commented on warnings of transactions not being closed. So I went looking at this, and decided that all answers were being closed properly. In confusion, I looked at the email again, and this time realized that the bug report said that they were using POST methods. Since I was only dealing with queries (and not update commands) I had only gone to the GET method. So I looked at POST, and sure enough it was a dogs breakfast.

Part of the problem with a POST is that it can include updates as well as queries. Not having a standard response for an update, I had struggled a little with this in the past. In the end, I'd chosen to only output the final result of all operations, but this was causing all sorts of problems. For a start, if there was more than one query, then only the last would be shown (OK in SPARQL, not in TQL). Also, since I was ignoring so many things, it meant that I wasn't closing anything if it needed it. This was particularly galling to have wrong, since I'd finally added SPARQL support for POST queries.

I'd really have liked to use the same multi-result code that I had for GET requests, but that didn't look like it was going to mix well with the need to support commands in the middle. In the end I copied/pasted some of the GET code (shudder) and fixed it up to deal with the result lists that I'd already built through the course of processing the POST request. It doesn't look too bad, and I've commented on the redundancy and why I've allowed it, so I think it's all OK. Anyway, it's all looking good now. Given that I also have a major bugfix from a few weeks back, then I should get it out the door despite the mulgara.org shuffle not being done.

I didn't mention that major bug, did I? For anyone interested, some time early last year a race bug was avoided by putting a lock into the transaction code. Unfortunately, that lock was to broad, and it prevented any other thread from reading while a transaction was being committed. This locked the database up during large commit operations. It's not the sort of thing that you're likely to see with unit tests, but I was still deeply embarrassed. At least I found it (a big thanks to the guys at PLoS for reporting this bug, and helping me find where it was).

So before I get dragged into any admin stuff tomorrow morning (office admin or sysadmin), I should try to cut a release to clean up some of these problems.

Meanwhile, I'm going to relax with a bit of Hadoop reading. I once talked about putting a triplestore on top of this, and it's an idea that's way overdue. I know others have tried exactly this, but each approach has been different, and I want to see what I can make of it. But I think I need a stronger background in the subject matter before I try to design something in earnest.

2010-04-19T11:03:00.003-05:00

jSPARQLc

I spent Sunday and a little bit of this afternoon finishing up the code and testing for the SPARQL API. The whole thing had just a couple of typos in it, which surprised me no end, because I used VIM and not Eclipse. I must be getting more thorough as I age. :-)

Anyway, the whole thing works, though its limited in scope. To start with, the only data accessing methods I wrote were getObject(int) and getObject(String). Ultimately, I'd like to include many of the other get... methods, but these would require tedious mapping. For instance, getInt() would need to map xsd:int, xsd:integer, and all the other integral types to an integer. I've done this work in some of the internal APIs for Mulgara (particularly when dealing with SPARQL), so I know how tedious it is. I suppose I can copy/paste a good portion of it out of Mulgara (it's all Apache 2.0), but for the moment I wanted to get it up and running the way I said.

If I were to do all of this, then there are all sorts of mappings that can be done between data types. For instance, xsd:base64Binary would be good for Blobs. I could even introduce some custom data types to handle things like serialized Java objects, with a data type like: java:org.example.ClassName. Actually, that looks familiar. I should see if anyone has done it.

Anyway, as I progressed, I found that while it was straight forward enough to get basic functionality in, the JDBC interfaces are really inappropriate.

To start with, JDBC usually accesses a "cursor" at the server end, and this is accessed implicitly by ResultSet. It's not absolutely necessary, but moving backwards and forwards through a result that isn't entirely in memory really does need a server-side cursor. Since I'm doing everything in memory right now, then I was able to do an implementation that isn't TYPE_FORWARD_ONLY, but if I were to move over to using StAX (in the comments from my last entry) then I'd have to fall back to that.

The server-side cursor approach also makes it possible to write to a ResultSet, since SQL results are closely tied to the tables they represent. However, this doesn't really apply to RDF, since statements can never be updated, only added and removed. SPARQL Update is coming (I ought to know, as I'm the editor for the document), but there is no real way to convert the update operations on a ResultSet back into SPARQL-Update operations over the network. It might be theoretically, possible but it would need a lot of communication with the server, and it doesn't even make sense. After all, you'd be trying to map one operation paradigm to a completely different one. Even if it could be made to work, it would be confusing to use. Since my whole point in writing this API was to simplify things for people who are used to JDBC, then it would be self defeating.

So if this API were to allow write operations as well, then it would need a new approach, and I'm not sure what that should be. Passing SPARQL Update operations straight through might be the best bet, though it's not offering a lot of help (beyond doing all the HTTP work for you).

The other thing that I noticed was that a blind adherence to the JDBC approach created a few classes that I don't think are really needed. For instance, the ResultSetMetaData interface only contains two methods that make any sense from the perspective of SPARQL: getColumnCount() and getColumnName(). The data comes straight out of the ResultSet, so I would have put them there if the choice were mine. The real metadata is in the list of "link" elements in the result set, but this could encoded with anything (even text) so there was no way to make that metadata fit the JDBC API. Instead, I just let the user ask for the last of links directly (creating a new method to do so).

Another class that didn't make too much sense to me was Statement. It's a handy place to record some state about what you've doing on a Connection, but other than that, it just seems to proxy the Connection it's attached to. I see there some options for caching (that I've never used myself when I've been on JDBC), so I suspect that it does more than I give it credit for, but for the moment it just appears to be an inconvenience.

Anyway, I thought I'd put it up somewhere, and since I haven't tried Google's code repository before, I've put it up there. It's a Java SPARQL Connectivity library, so for lack of anything better I called it jSPARQLc (maybe JRC for Java RDF Connectivity would have been better, but there are lots of JRC things out there, but jSPARQLc didn't return any hits from Google, so I went with that). It's very raw and has very little configuration, but it passes it's tests. :-)

Speaking of tests, if you want to try it, then the connectivity tests won't pass until you've done the following:

Start a SPARQL endpoint at http://localhost:8080/sparql/ (the sourcecode in the test needs to change if your endpoint is elsewhere).
Create a graph with the URI <test:data>
Loaded the data in test.n3 up into it (this file is in the root directory)

I know I shouldn't have hardcoded some of this, but it was just a test on a 0.1 level project. If it seems useful, and/or you have ideas for it, then please let me know.

Mulgara

Other than this, I have some administration I need to do to get both Mulgara and Topaz onto another server, and this seems to slow everything down as I wait for the admins to get back to me. It's also why there hasn't been a Mulgara release recently, even though it's overdue. However, I just got a message from an admin today, so hopefully things have progressed. Even so, I think I'll just cut the release soon anyway.

One fortunate aspect of the delayed release was a message I got from David Smith about how some resources aren't being closed in the TQL REST interface (when subqueries are used). He's sent me a patch, but I need to spend some time figuring out why I got this wrong, else I could end up hiding the real problem. That's a job for the morning... right after the SPARQL Working Group meeting. Once all of that is resolved, I'll get a release out, and try to figure out what I can do to speed up the server migration.

Oh, and I need to update Topaz to take advantage of some major performance improvements in Mulgara, and then I need to find even more performance improvements. Hopefully I'll be onto some of that by the afternoon, but I don't want to promise the moon only to come back tomorrow night and confess I got stuck on the same thing all day.

2010-04-17T19:08:00.003-05:00

SPARQL API

Every time I try to use SPARQL with Java I keep running into all that minutiae that Java makes you deal with. The HttpComponents from Apache make things easier, but there's still a loto f code that has to be written. Then after you get your data back, you still have to process it, which means XML or JSON parsing. All of this means a lot of code, just to get a basic framework going.

I know there are a lot of APIs out there for working with RDF engines, but there aren't many for working directly over SPARQL. I eventually had a look and found SPARQL Engine for Java, but this seemed to have more client-side processing than I'd expect. I haven't looked too carefully at it, so this may be incorrect, but I thought it would be worthwhile to take all the boilerplate I've had to put together in the past, and see if I can glue it all together in some sensible fashion. Besides, today's Saturday, meaning I don't have to worry about my regular job, and I'm recovering from a procedure yesterday, so I couldn't do much more than sit at the computer anyway.

One of my inspirations was a conversation I had with Henry Story (hmmm, Henry's let that link get badly out of date) a couple of years ago about a standard API for RDF access, much like JDBC. At the time I didn't think that Sun could make something like that happen, but if there were a couple of decent attempts at it floating around, then some kind of pseudo standard could emerge. I never tried it before, but today I thought it might be fun to try.

The first thing I remembered was that when you write a library, you end up writing all sorts of tedious code while you consider the various ways that a user might want to use it. So I stuck to the basics, though I did add in various options as I dealt with individual configuration options. So it's possible to set the default-graph-uri as a single item as well as with a list (since a lot of the time you only want to set one graph URI). I was eschewing Eclipse today, so I ended up making use of VIM macros for some of my more tedious coding. The tediousness also reminded me again why I like Scala, but given that I wanted it to look vaguely JDBC-like, I figured that the Java approach was more appropriate.

I remember that TKS (the name of the first incarnation of the Mulgara codebase) had attempted to implement JDBC. Apparently, a good portion of the API, was implemented, but the there were some elements that just didn't fit. So from the outset I avoided trying to duplicate that mistake. Instead, I decided to cherry pick the most obvious features, abandon anything that doesn't make sense, and add in a couple of new features where it seems useful or necessary. So while some of it might look like JDBC, it won't have anything to do with it.

I found a piece of trivial JDBC code I'd used to test something once-upon-a-time, and tweaked it a little to look like something I might try to do with SPARQL. My goal was to write the library that would make this work, and then take it from there. This is the example:

    final String ENDPOINT = "http://localhost:8080/sparql/";
    Connection c = DriverManager.getConnection(ENDPOINT);

    Statement s = c.createStatement();
    s.setDefaultGraph("test:data");
    ResultSet rs = s.executeQuery("SELECT * WHERE { ?s ?p ?o }");
    rs.beforeFirst();
    while (rs.next()) {
      System.out.println(
              rs.getObject(1).toString() + ", " +
              rs.getObject(2) + ", " +
              rs.getObject(3));
    }
    rs.close();
    c.close();

My first thought was that this is not how I would design the API (the "Statement" seems a little superfluous), but that wasn't the point.

Anyway, I've nearly finished it, but I'm dopey from pain medication, so I thought I'd write down some thoughts about it, and pick it up again in the morning. So if anyone out there is reading this (which I doubt, given how little I write here) these notes are more for me than for you, so don't expect to find it interesting. :-)

Observations

The first big difference to JDBC is the configuration. A lot of JDBC is either specific to a particular driver, or to RDBM Systems in general. This goes for the structure of the API as well as the configuration. For instance, ResultSet seems to be heavily geared towards cursors, which SPARQL doesn't support. I was momentarily tempted to try emulating this functionality through LIMIT and OFFSET, but that would have involved a lot of network traffic, and could potentially interfere with the user trying to use these keywords themselves. Getting the row number (getRow) would have been really tricky if I'd gone that way too.

But ResultSet was one of the last things I worked on today, so I'll rewind.

The first step was making the HTTP call. I usually use GET, but I've recently added in the POST binding for SPARQL querying in Mulgara , so I made sure the client code can do both. For the moment I'm automatically choosing to do a POST query when the URL gets above 1024 characters (I believe that was the URL limit for some version of IE), but I should probably make the use of POST vs. GET configurable. Fortunately, building parameters was identical for both methods, though they get put into difference places.

Speaking of parameters, I need to check this out, but I believe that graph URIs in SPARQL do not get encoded. Now that's not going to work if they contain their own queries (any why wouldn't they), but most graphs don't do that, so it's never bitten me before. Fortunately, doing a URL-Decode on an unencoded graph URI is usually safe, so that's how I've been able to get away with it until now. But as a client that has to do the encoding I needed to think more carefully about it.

From what I can tell, the only part that will give me grief is the query portion of the URI. So I checked out the query, and if there wasn't one, I just sent the graph unencoded. If there was one, then I'd encode just the query, add it to the URI, and then see if decoding got me back to the original. If it does, then I send that. Otherwise, I just encode the whole graph URI and send that. As I write it down, it looks even more like a hack than ever, but so far it seems to work.

So now that I have all the HTTP stuff happening, what about the response? Since answers can be large, my first thought was SAX. Well, actually, my first thought was Scala, since I've already parsed SPARQL response documents with Scala's XML handling, and it was trivial. But I'm Java so that means SAX or DOM. SAX can handle large documents, but possibly more importantly, I've always found SAX easier to deal with than DOM, so that's the way I went.

Because SAX operates on a stream, I thought I could build a stream handler, but I think that was just the medication talking, since I quickly remembered that it's an event model. The only way I could do it as a stream would be if I buit up a queue with one thread writing at one end and the consuming thread taking data off at the other. That's possible, but it's hard to test if it scales, and if the consumer doesn't get drain the queue in a timely manner, then you can cause problems for the writing end as well. It's possible to slow up the writer by not returning from the even methods until the queue has some space, but that seems clunky. Also, when you consider that a ResultSet is supposed to be able to rewind and so forth, a streaming model just doesn't work.

In the end, it seemed that I would have to have my ResultSets in memory. This is certainly easier that any other option I could think of, and the size of RAM these days means that it's not really a big deal. But it's still in the back of my mind that maybe I'm missing an obvious idea.

The other thing that came to mind is to create an API to provides object events in the same way that SAX provides events for XML elements. This would work fine, but it's nothing like the API I'm trying to look like, so I didn't give that any serious thought.

So now I'm in the midst of a SAX parser. There's a lot of work in there that I don't need when working with other languages, but it does give you a comfortable feeling knowing that you have such fine-grained control over the process, Java enumerations have come in handy here, as I decided to go with a state-machine approach. I don't use this very often (outside of hardware design, where I've always liked it), but it's made the coding so straightforward it's been a breeze.

One question I have, is if the parser should create a ResultSet object, or if it should be the object. It's sort of easy to just create the object with the InputStream as the parameter for the constructor, but then the object you get back could be either a boolean result or a list of variable bindings, and you have to interrogate it to find out which one it is. The alternative is to use a factory that returns different types of result sets. I initially went with the former because both have to parse the header section, but now that I've written it out, I'm thinking that the latter is the better way to go. I'll change it in the morning.

I'm also thinking of having a parser to deal with JSON (I did some abstraction to make this easy), but for now I'll just take one step at a time.

One issue I haven't given a lot of time to yet is the CONSTRUCT query. These have to return a graph and not a result set. That brings a few questions to mind:

How do I tell the difference? I don't want to do it in the API, since that's something the user may not want to have to figure out. But short of having an entire parser, it could be difficult to see the form of the query before it's sent.
I can wait for the response, and figure it out there, but then my SAX parser needs to be able to deal with RDF/XML. I usually use Jena's parser for this, since I know it's a lot of work. Do I really want to go that way? Unfortunately, I don't know of any good way to move to a different parser once I've seen the opening elements. I could try a BufferedInputStream, so I could rewind it, but can that handle really large streams? I'll think on that.
How do I represent the graph at the client end?

Representing a graph goes way beyond ResultSet, and poses the question of just how far to go. A simple list of triples would probably suffice, but if I have a graph then I usually want to do interesting stuff with it.

I'm thinking of using my normal graph library, which isn't out in the wild yet, but I find it very useful. I currently have implementations of it in Java, Ruby and Scala. I keep re-implementing it whenever I'm in a new language, because it's just so useful (it's trivial to put under a Jena or Mulgara API too). However, it also goes beyond the JDBC goal that I was looking for, so I'm cautious about going that way.

Anyway, it's getting late on a Saturday night, and I'm due for some more pain medication, so I'll leave it there. I need to talk to people about work again, so having an active blog will be important once more (even if it makes me look ignorant occasionally). I'll see if I can keep it up.

2009-05-21T13:15:00.004-05:00

Federated Queries

A long time ago, TKS supported federated queries, though the approach was a little naive (bring all the matches of triple patterns in to a single place, and join them there). Then a few years ago I added this to Mulgaraas well. I've always wanted to make it more intelligent in order to reduce network bandwidth, but at the same time, I was always happy that it worked. Unfortunately, it was all accomplished through RMI, and was Mulgara specific. That used to be OK, since RDF servers didn't have standardized communications mechanisms, but that changed with SPARQL.

More recently, I've started running across distributed queries through another avenue. While working through the SPARQL protocol, I realized that the Mulgara approach of treating unknown HTTP URIs as data that can be retrieved can be mixed with SPARQL CONSTRUCT queries encoded into a URI. The result of an HTTP request on a SPARQL CONSTRUCT query is an RDF document, which is exactly what Mulgara is expecting when it does an HTTP GET on a graph URI. The resulting syntax is messy, but it works quite well. Also, while retrieving graph URIs is not standard in SPARQL, most systems implement this, making it a relatively portable idiom. I was quite amused at the exclamations of surprise and horror (especially the horror) when I passed this along on a mailing list a few weeks ago.

The ease at which this was achieved using SPARQL made me consider how federated querying might be done using a SPARQL-like protocol. Coincidentally, the SPARQL Working Group has Basic Federated Queries as a proposed feature, and now I'm starting to see a lot of people asking about it on mailing lists (was people always asking about this, or am I just noticing it now?). I'm starting to think this feature may be more important in SPARQL, and think that perhaps I should have made it a higher priority when I voted on it. As it is, it's in the list of things we'll get to if we have time.

Then, while I was thinking about this, one of the other Mulgara developers tells me that he absolutely has to have distributed queries (actually, he needs to run rules over distributed datasets) to meet requirements in his organization. Well, the existing mechanisms will sort of work for him, but to do it right it should be in SPARQL.

Requirements

So what would I want to see in federated SPARQL queries? Well, as an implementer I need to see a few things:

A syntactic mechanism for defining the URI of a SPARQL endpoint containing the graph(s) to be accessed.
A syntactic mechanism for defining the query to be made on that endpoint (a subquery syntax would be fine here).
A means of asking the size of a query result.
A mechanism for passing existing bindings along with a query.

The first item seemed trivial until I realized that SPARQL has no standard way of describing an endpoint. Systems like Mulgara simply use http://hostname/sparql/, which provides access to the entire store (everything can be referred to using HTTP parameters, such as default-graph-uri and query). On the other hand, Joseki can do the /sparql/ thing, but also provides an option to access datasets through the path, and Sesame can have several repositories, each of which is accessible by varying the path in the URL.

The base URL for issuing SPARQL queries against would be easy enough to specify, but it introduces a new concept into the query language, and that has deeper ramifications than should be broached in this context.

The query that can be issued against an endpoint should look like a standard query, and not just a CONSTRUCT, as this provides greater flexibility and also binds the columns to variable names that can appear in other parts of the query. This is basically identical to a subquery, which is exactly what we want.

Bandwidth Efficiency

The last 2 items are a matter of efficiency and not correctness. However, they can mean the difference between transferring a few bytes vs a few megabytes over a network.

(BTW, when did "bandwidth" get subverted to describe data rates? When I was a boy this referred to the range of frequencies that a signal used, and this had a mathematical formula relating it to the number of symbols-per-second that could be transmitted over that signal - which does indeed translate to a data rate. However, it now gets used in completely different contexts which have nothing to do with frequency range. Oh well.. back to the story).

If I want to ask for the identifiers of people named "Fred" (as opposed to something else I want to name with foaf:givenname), then I could use the query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
  ?person foaf:givenname "Fred" .
  ?person a foaf:Person
}

Now what if the "type" data and the "name" data appear on different servers? In that case we would use a distributed query.

Using the HTTP/GET idiom I mentioned at the top of this post, then I could send the query to the server containing the foaf:givenname statements, and change it now to say:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
  ?person foaf:givenname "Fred" .
  GRAPH <http://hostname/sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
CONSTRUCT+%7B%3Fperson+a+foaf%3APerson%7D+
WHERE+%7B%3Fperson+a+foaf%3APerson%7D> {
    ?person a foaf:Person
  }
}

So now the server will resolver all the entities with the name "Fred", then it will retrieve a graph and ask it for all the entities that are a foaf:Person. Then it will join these results to create the final result.

But what happens if there are only 3 things named "Fred", but 10,000 people in the data set? In that case the server will resolve the first pattern, getting 3 bindings for ?person, and then make a request across the network, getting back 10,000 statement which are then queried for those statements the describe a foaf:Person (they all will), and only then does the join happen. Ideally, we'd have gone the other way, and asked the server with 10,000 people to request data from the server that had 3 entities named Fred, but we may not have known ahead of time that this would be better, and a more complex query could require a more complex access pattern than simply "reversing" the resolution order.

First of all, we need a way to ask each server how large a set of results is likely to be. The COUNT function that is being discussed in the SPARQL WG at the moment could certainly be used to help here, though for the sake of efficiency it would also be nice to have a mechanism for asking for the upper-limit of the COUNT. That isn't appropriate for a query language (since it refers to database internals) but would be nice to have in the protocol, such as with an HTTP/OPTION request (though I really don't see something like that being ratified by the SPARQL WG). But even without an "upper limit" option, a normal COUNT would give us what we need to find out how to move the query around.

So once we realize that the server running the query has only a little data (call it "Server A"), and it needs to join it to a large amount of data on a different server (call this one "Server B", then of course we want Server A to send the small amount of data to Server B instead of retrieving the large amount from it. One way to do this might be to invert the query at this point, and send the whole thing to Server B. That server then asks Server A for the data, and sends its response. Unfortunately, that is both complex, and requires a lot more hops than we want. The final chain here would be:

Client sends query as a request to Server A
Server A reverses the query and sends the new query as a request to Server B
Server B resolves its local data, and sends the remainder of the query as a request to Server A
Server A responds to Server B with the result of entities with the name "Fred"
Server A responds to the client with the unmodified results it just received

Yuck.

Instead, when Server A detects a data size disparity like this, it needs a mechanism to package up its bindings for the ?person variable, and send these to Server B along with the request. Fortunately, we already have a format for this in the SPARQL result set format.

Normally, a query would be performed using an HTTP/GET, but including a content-body in a GET request has never been formally recognized (though it has not been made illegal), so I don't want to go that way. Instead, a POST would work just as well here. The HTTP request with content could look like this (I've added line breaks to the request):

POST /sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
SELECT+%3Fperson+WHERE+%7B%3Fperson+a+foaf%3APerson%7D HTTP/1.1
Host: www.example
User-agent: my-sparql-client/0.1
Content-Type: application/sparql-results+xml
Content-Length: xxx

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
 <head>
   <variable name="person"/>
 </head>
 <results distinct="false" ordered="false">
   <result>
     <binding name="person"><uri>http://www.example/FredFlintstone</uri></binding>
   </result>
   <result>
     <binding name="person"><uri>http://www.example/FredKruger</uri></binding>
   </result>
   <result>
     <binding name="person"><uri>http://www.example/FredTheDog</uri></binding>
   </result>
 </results>
</sparql>

I can't imagine that I could be successful in suggesting this as part of the underlying protocol for federated querying, but I'm thinking that I'll be incorporating it into Mulgara all the same.

2009-02-27T12:37:00.002-06:00

More Programmable Logic

In the last post I gave a basic description of how the Krule rules engine works. I left out a number of details, but it provides some details on the overall approach.

The most important detail I skipped was the interaction with resolvers in order to identify resources with particular properties. This includes among other things, finding URIs in a domain, inequality comparisons for numeric literals, regular expression matching on strings, the "type" of a resource (URI, blank node, or literal), and optimized transitivity of predicates. We also manage "subtractions" in our data, in the same way that Evren Sirin described NOT two weeks ago. I should point out that Mulgara does subtractions directly, and not with a combined operation with the OPTIONAL/FILTER/!BOUND pattern. This was introduced this in 2004 (or was it 2005?). I have to say that I'm a little surprised that SPARQL never included anything to express it directly, particularly since so many people ask for it (and hence, the popularization of OPTIONAL/FILTER/!BOUND as a pattern) and because the SPARQL algebra provides a definition of a function called "Diff" that is used internally.

Anyway, these extensions are not necessary to understand Krule or RLog, but I think it's useful to know that they're there.

So now that I've described Krule, I've set the scene for describing RLog.

Krule Configuration

When I first wrote Krule, I was ultimately aiming at OWL, but I had a short term goal of RDFS. I find that I have to take these things one step at a time, or else I never make progress. Since I knew that my rules were going to expand, I figured I should not hard code anything into Mulgara, but that I should instead interpret a data structure which described the rules. That also meant I would be able to run rules for lots of systems: RDFS, SKOS, OWL, or anything else. Of course, some things would need more features than RDFS needed (e.g. both OWL and SKOS need "lists"), but my plan was to work on that iteratively.

At the time, I designed an RDF schema to describe my rules, and built the Krule engine to initialize itself from this. This works well, since the whole system is built around RDF already. I also created a new TQL command for applying rules to data:

  apply <rule_graph_uri> to <data_graph_uri> [<output_graph_uri>]

By default all of the entailed data goes into the graph the rules are being applied to, but by including the optional output graph you can send the entailed data there instead.

This worked as planned, and I was able to build a Krule configuration graph for RDFS. Then life and work interfered and the rules engine was put on the back burner before I got to add some of the required features (like consistency checking).

Then about 18 months ago I thought I'd have a go at writing OWL entailment, at least for that part that the rules engine would support. So I set out to write a new Krule file. The complexity of the file was such that I started writing out the rules that I wanted using a kind of Prolog notation with second order programming, in a very similar way to how Raphael Volz represented the same constructs in some of his papers. This grammar uses binary predicates to represent genereal triples, and unary predicates to indicate "type" statements, ie. statements with a predicate of rdf:type. As an example, the owl:sameAs predicate indicates that if the subject of a statement is the owl:sameAs another resource, then that statement can be duplicated with the other resource as the subject. This was easily expressed this as:

  A(Y,Z) :- A(X,Z), owl:sameAs(X,Y).

I wrote out about 3 rules before I realized that converting these to Krule was going to be tedious and prone to error. In fact, I had unthinkingly demonstrated that I already had a language I wanted to use, and the process of translation was an easily automated task. Since the language was allowing me to describe RDF with logic, I decided to call it RLog (for RDF Logic).

Iterations

Andrae and I had been discussing how much we disliked SableCC for generating the TQL parser for Mulgara, and so I started looking around at other parsers. The power of LALR parsers appealed to me, and so I went with Beaver. Along with the JFlex lexer, this software is a pleasure to use. I had learned how to use them both, and created the RLog grammar in about an hour. I then converted the Krule configuration for RDFS into this new grammar, and convinced myself that I had it right. Then life got in the way again, and I put it away.

Last year while waiting for some tests to complete, I remembered this grammar, and spent some of my enforced downtime making it output some useful RDF in the Krule schema. For anyone who's looked at Krule, they may have noticed that triggers for rules (which rules cause which other rules to be run) are explicitly encoded into the configuration. I did this partly because I already had the list of trigger dependencies for RDFS rules, and partly because I thought it would offer more flexibility. However, I had realized some time before that these dependencies were easy to work out, and had been wanting to automate this. I decided that RLog was the perfect place to do it, partly because it meant not having to change much, but also because it still allowed me the flexibility of tweaking the configuration.

Once I'd finished writing a system that could output Krule, I tested it against my RDFS RLog file, and compared the generated Krule to the original configuration. Initially I was disappointed to see to many dependencies, but on closer inspection I realized that they were all valid. The original dependencies were a reduced set because they applied some of the semantics of the predicates and classes they were describing, which was not something that a grammar at the level of RLog could deal with. Semi-naïve evaluation was going to stop unnecessary rules from running anyway, so I decided that these extra triggers were fine. I ran it against the various test graphs that I had, and was pleased to see that it all worked perfectly.

But once again, work and life got in the way, and I put it aside again.

SKOS

A couple of months ago Brian asked me about running rules for generating SKOS entailments, as he was writing a paper on this topic. I pointed him towards RLog and knocked up a couple of useful rules for him. However, as I got into it, I realized that I could actually do most of SKOS quite easily, and before I knew it, I'd written an entire RLog program for it. The only thing I could not do was "S35", as this requires a predicate for list membership (also a requirement for OWL, and on my TODO list).

The really interesting thing about this document, is that almost everything is an axiom and not a rule. It only requires 2 RDFS rules and 5 OWL rules to make the whole thing work. This is quite important, as the complexity in running the rules is generally exponential in the number of rules.

This is (IMNSHO) the power of ontologies. By providing properties of classes and properties, they reduce the need for many rules. To demonstrate what I mean, I've seen a few systems (such as DLV) which define a predicate to be transitive in the following way:

  pred(A,C) :- pred(A,B), pred(B,C).

This works, but it creates a new rule to do it. Every new transitive predicate also gets its own rule. As I have already said, this has a significant detrimental effect on complexity.

Conversely, models such as OWL are able to declare properties as "transitive". Each such declaration then becomes a statement rather than a rule. Indeed, all the transitive statements get covered with a single second-order rule:

  P(A,C) :- P(A,B), P(B,C), owl:TransitivePredicate(P).

"Second-order" refers to the fact that variables can be used for the predicates (such as the variable P in the expression P(A,B)), and that predicates can appear as parameters for other predicates, such as owl:TransitivePredicate(...). The symmetry of Mulgara indexes for RDF statements allows such second order constructs to be evaluated trivially.

Using the OWL construct for transitivity, any number of predicates can be declared as transitive with no increase to the number of rules. The complexity of rules does have a component derived from the number of statements, but this is closer to linear or polynomial (depending on the specific structure of the rules), and is therefore far less significant for large systems. It is also worth noting that several OWL constructs do not need an exhaustive set of their own rules, as their properties can be described using other OWL constructs. For instance, owl:sameAs is declared as being owl:SymmetricProperty. This means that the entailment rule for owl:sameAs (shown above) need only be written once for owl:sameAs(A,B) and is not needed for symmetric case of owl:sameAs(B,A).

General Acceptance

Brian wasn't the only one to like RLog. I've had reports and feature requests from a few other people who like using it as well. The most commonly requested feature has been the generation of blank nodes. The main reason for this is to handle existential formula, which makes me wary, as this can lead to infinite loops if not carefully controlled. On the other hand, I can see the usefulness of it, so I expect to implement it eventually.

A related feature is to create multiple statements based on a single matched rule. This can usually be handled by introducing a new rule with the same body and a different head, but if a blank node has been generated by the rule, then there needs to be some way to re-use it in the same context.

A problem with general usage is that the domains understood by RLog have been preset with the domains that I've wanted, namely: RDF, RDFS, OWL, XSD, MULGARA, KRULE, FOAF, SKOS, and DC. The fix to this can be isolated in the parser, so I anticipate this being fixed by Monday. :-)

Despite it being limited, RLog was proving to be useful, allowing me to encode systems like SKOS very easily. However, being a separate program that translated an RLog file into Krule configuration files that then had to be loaded and applied to data, was a serious impediment to the usage.

Integration

The Mulgara "Content Handler" interface is a mechanism for loading any kind of data as triples, and optionally writing it back out. The two main ones are the RDF/XML handler and the N3 handler, but there are others in the default distribution as well. There is a MBox handler for representing Unix Mailbox files as RDF, and an MP3 handler which maps ID3 metadata from MP3 files. These handlers compliment the "Resolver" interface which represents external data sources as a dynamic graph.

Since RLog has a well-defined mapping into RDF (something the RLog program was already doing when it emitted RDF/XML) then reimplementing this system as a content handler would integrate it into Mulgara with minimal effort. I had been planning on this for some time, but there always seemed to be more pressing priorities. These other priorities are still there (and still pressing!) but a few people (e.g. David) have been pushing me for it recently, so I decided to bite the bullet and get it done.

The first problem was that the parser was in Beaver. This is yet another JAR file to include at a time when I'm trying to cut down on our plethora of libraries. It also seemed excessive, since we already have both JavaCC and SableCC in our system - the former for SPARQL, the latter for TQL, and I hope to redo TQL in JavaCC eventually anyway. So I decided to re-implement the grammar parser in JavaCC.

Unfortunately, it's been over a year since I looked at JavaCC, and I was very rusty. So my first few hours were spent relearning token lookahead, and various aspects of JavaCC grammar files. I actually think I know it better now than I did when I first did the SPARQL parser (that's a concern). There are a few parts of the grammar which are not LL(1) either, which forced me to think through the structure more carefully, and I think I benefited from the effort.

I was concerned that I would need to reimplement a lot of the AST for RLog, but fortunately this was not the case. Once I got a handle on the translation it all went pretty smoothly, and the JavaCC parser was working identically to the original Beaver parser by the end of the first day.

After the parser was under control I moved on to emitting triples. This was when I was reminded that writing RDF/XML can actually be a lot easier than writing raw triples. I ended up making slow progress, but I finally got it done last night.

Testing

Before running a program through the content handler for the first time, I wanted to see that the data looked as I expected it to. JUnit tests are going to take some time to write, and with so much framework around the rules configuration, it was going to be clear very quickly if things weren't just right. I considered running it all through the debugger, but that was going to drown me in a sea of RDF nodes. That was when I decided that I could make the resolver/content-handler interfaces work for me.

Mulgara can't usually tell the difference between data in its own storage, and data sourced from elsewhere. It always opts for internal storage first, but if a graph URI is not found, then it will ask the various handlers if they know what to do with the graph. By using a file: URL for the graphs, I could make Mulgara do all of it's reading and writing to files, using the content handlers to do the I/O. In this case, I decided to "export" my RLog graph to an N3 file, and compare the result to an original Krule RDF/XML file that I exported to another N3 file.

The TQL command for this was:

  export <file:/path/rdfs.rlog> to <file:/path/rlog-output.n3>

Similarly, for the RDF/XML file was transformed to N3 with:

  export <file:/path/rdfs-krule.rdf> to <file:/path/krule-output.n3>

I love it when I can glue arbitrary things together and it all "just works". (This may explain why I like the Semantic Web).

My first test run demonstrated that I was allowing an extra # into my URIs, and then I discovered that I'd fiddled with the literal token parsing, and was now including quotes in my strings (oops). These were trivial fixes. The third time through was the charm. I spent some time sorting my N3 files before deciding it looked practically identical, and so off I went to run an RLog program directly.

As I mentioned in my last post, applying a set of rules to data is done with the apply command. While I could have loaded the rules into an internal graph (pre-compiling them, so to speak) I was keen to "run" my program straight from the source:

  apply <file:/path/rdfs.rlog> to <test:data:uri>

...and whaddaya know? It worked. :-)

Now I have a long list of features to add, optimizations to make, bugs to fix, and all while trying to stay on top of the other unrelated parts of the system. Possibly even more importantly, I need to document how to write an RLog file! But for the moment I'm pretty happy about it, and I'm going to take it easy for the weekend. See you all on Monday!

2009-02-27T10:46:00.003-06:00

Programmable Logic

I had hoped to blog a lot this week, but I kept putting it off in order to get some actual work done. I still have a lot more to do, but I'm at a natural break, so I thought I'd write about it.

I have finally integrated RLog into Mulgara! In some senses this was not a big deal, so it took a surprising amount of work.

To explain what RLog is, I should describe Krule first.

Krule

The Mulgara rules engine, Krule, implements my design for executing rules on large data sets. For those familiar with rules engines, it is similar to RETE, but it runs over data as a batch, rather than iteratively. It was designed this way because our requirements were to perform inferencing on gigabytes of RDF. This was taking a long time to load, so we wanted to load it all up, and then do the inferencing.

RETE operates by setting up a network describing the rules, and then as statements are fed through it, each node in that network builds up a memory of the data it has seen and passed. As data comes in the first time, the nodes that care about it will remember this data and pass it on, but subsequent times through, the node will recognize the data and prevent it from going any further. In this way, every statement fed into the system will be processed as much as it needs to be, and no more. There is even a proof out there that says that RETE is the optimal rules algorithm. Note that the implementation of an algorithm can, and does, have numerous permutations which can allow for greater efficiency in some circumstances, so RETE is often treated as the basis for efficient engines.

One variation on algorithms of this sort is to trade time for space. This is a fancy way of saying that if we use more memory then we can use less processing time, or we can use more processing time to save on memory. Several variations on RETE do just this, and so does Krule.

When looking at the kinds of rules that can be run on RDF data, I noticed that the simple structure of RDF meant that each "node" in a RETE network corresponds to a constraint on a triple (or a BGP in SPARQL). Because Mulgara is indexed in "every direction", this means that every constraint can be found as a "slice" out of one of the indexes (while our current system usually takes O(log(n)) to find, upcoming systems can do some or all of these searches in O(1)). Consequently, instead of my rule network keeping a table in memory associated with every node, there is a section out of an index which exactly corresponds to this table.

There are several advantages to this. First, the existence of the data in the database is defined by it being in the indexes. This means that all the data gets indexed, rules engine or not. Second, when the rules engine is run, there is no need to use the data to iteratively populate the tables for each node, as the index slices (or constraint resolutions) are already fully populated, by definition. Finally, our query engine caches constraint resolutions, and they do not have to be re-resolved if no data has gone in that can affect them (well, some of the caching heuristics can be improved for better coverage, but the potential is there). This means that the "tables" associated with each node will be automatically updated for us as the index is updated, and the work needed to handle updates is minimal.

During the first run of Krule, none of the potential entailments have been made yet, so everything is potentially relevant. However, during subsequent iterations of the rules, Krule has no knowledge of which statements are new in the table on any given node. This means it will produce entailed statements that already exist, and are duplicates. Inserting these is unnecessary (and hence, suboptimal) and creates unwanted duplicates. We handle this in two ways.

The first and simpler mechanism is that Mulgara uses Set semantics, meaning that any duplicates are silently (and efficiently) ignored. Set semantics are important when dealing with RDF, and this is why I'm so frustrated at non-distinct nature of SPARQL queries... but that's a discussion for another time. :-)

The more important mechanism for duplicate inserts is based on RDF having a property of being monotonically increasing. This is because RDF lets you assert data, but not to "unassert" it. OWL 2 has introduced explicit denial of statements, but this is useful for preventing entailments and consistency checking... it does not remove previously existing statements. In non-monotonic systems a constraint resolution may keep the same size if some statements are deleted while an equal number of statements are inserted, but in a monotonic system like RDF, keeping the same size means that there has been no change. So a node knows to pass its data on if the size of its table increases, but otherwise it will do nothing. I stumbled across this technique as an obvious optimization, but I've since learned that it has a formal name: semi-naïve evaluation.

Krule Extensions

While this covers the "batch" use-case, what about the "iterative" use-case, where the user wants to perform inferences on streams of data as it is inserted into an existing database? In this case, the batch approach is too heavyweight, as it will infer statements that are almost entirely pre-existing. We might handle the non-insertion of these statements pretty well, but if you do the work again and again for every statement you try to insert, then it will add up. In this case, the iterative approach of standard RETE is more appropriate.

Unfortunately, RETE needs to build its tables up by iterating over the entire data set, but I've already indicated that this is expensive for the size of set that may be encountered. However, the Krule approach of using constraint resolutions as the tables is perfect for pre-populating these tables in a standard RETE engine. I mentioned this to Alex a few months ago, and he pointed out that he did exactly the same thing once before when implementing RETE in TKS.

I haven't actually done this extension, but I thought I'd let people know that we haven't forgotten it, and it's in the works. It will be based on Krule configurations, so a lot of existing work will be reused.

I don't want to overdo it in one post, so I'll write about RLog in the next one.

2009-02-19T00:34:00.002-06:00

Is it Me?

After struggling against nonsensical errors all day, I finally realized that either Eclipse was going mad, or I was. In frustration I finally updated a method to look like this:

  public void setContextOwner(ContextOwner owner) {
    contextOwner = owner;
    if (owner != contextOwner) throw new AssertionError("VM causing problems");
  }

This code throws an AssertionError, and yes, there is only 1 thread. If it were multithreaded at least I'd have a starting point. The problem only appears while debugging in Eclipse.

I'm not sure whether I'm happy to have isolated my problem down to a single point, or to be unhappy at something that is breaking everything I am trying to work on. I guess I'm unhappy, because it's kept me up much later than I'd hoped.

2009-02-17T00:48:00.003-06:00

Resting

I've had a couple of drinks this evening. Lets see if that makes me more or less intelligible.

Now that I've added a REST-like interface to Mulgara, I've found that I've been using it more and more. This is fine, but to modify data I've had to either upload a document (a very crude tool) or issue write commands on the TQL endpoint. Neither of these were very RESTful, and so I started wondering if it would make sense to do something more direct.

From my perspective (and I'm sure there will be some who disagree with me), the basic resources in an RDF database are graphs and statements. Sure, the URIs themselves are resources, but the perspective of RDF is that these resources are infinite. Graphs are a description of how a subset of the set of all resources are related. Of course, these relationships are described via statements.

Graphs and Statements

So I had to work out how to represent a graph in a RESTful way. Unfortunately, graphs are already their own URI, and this probably has nothing to do with the server it is on. However, REST requires a URL which identifies the host and service, and then the resource within it. So the graph URI has to be embedded in the URL, after the host. While REST URLs typically try to reflect structure in a path, encoding a URL makes this almost impossible. Instead I opted to encode the graph URI as a "graph" parameter.

Statements posed a similar though more complex challenge. I still needed the graph, so this had to stay. Similarly, the other resources also needed to be encoded as parameters, so I added this as well. This left me with 2 issues: blank nodes and literals.

Literals

Literals were reasonably easy... sort of. I simply decided that anything that didn't look like a URI would be a literal. Furthermore, if it was structured like a SPARQL literal, then this would be parsed, allowing a datatype or language to be included. However, nothing is never really easy (of course) and I found myself wondering about relative URIs. These had never been allowed in Mulgara before, but I've brought them in recently after several requests. Most people will ignore them, but for those people who have a use, they can be handy. That all seems OK, until you realize that the single quote character " is an unreserved character in URIs, and so the apparent literal "foo" is actually a valid relative URI. (Thank goodness for unit tests, or I would never have realized that). In the end, I decided to treat any valid URI as a URI and not a literal, unless it starts with a quote. If you really want a relative URI of "foo" then you'll have to choose another interface.

Blank Nodes

Blank nodes presented another problem. Initially, I decided that any missing parameter would be a blank node. That worked well, but then I started wondering about using the same blank node in more than one statement. I'm treating statements as resources, and you can't put more than one "resource" into a REST URL, so that would mean referring to the same "nameless" thing in two different method calls, which isn't possible. Also, adding statements with a blank node necessarily creates a new blank node every time, which breaks idempotency.

Then what about deletion? Does nothing match, or does the blank node match everything? But doing matches like that means I'm no longer matching a single statement, which was what I was trying to do to make this REST and not RPC for a query-like command.

Another option is to refer to blanks with a syntax like _:123. However, this has all of the same problems we've had with exactly this idea in the query language. For instance, these identifiers are not guaranteed to match between different copies of the same data. Also, introducing new data that includes the same ID will accidentally merge these nodes incorrectly. There are other reasons as well. Essentially, you are using a name for something that was supposed to be nameless, and because you're not using URIs (like named things are supposed to use) then you're going to encounter problems. URIs were created for a reason. If you need to refer to something in a persistent way, then use a name for it. (Alternatively, use a query that links a blank node through a functional/inverse-functional predicate to uniquely identify it, but that's another discussion).

So in the end I realized that I can't refer to blank nodes at all in this way. But I think that's OK. There are other interfaces available if you need to work with blank nodes, and some applications prohibit them anyway.

Reification

Something I wanted to come back to is this notion of representing a statement as 3 parameters in a URL (actually 4, since the graph is needed). The notion of representing a statement as a URI has already been addressed in reification, however I dismissed this as a solution here since reifying a statement does not imply that statement exists (indeed, the purpose of the reification may be to say that the statement is false). All the same, it's left me thinking that I should consider a way to use this interface to reify statements.

Methods

So the methods as they stand now are:

method/ resource	Graph	Statement	Other
GET	N/A	N/A	Used for queries.
POST	Upload graphs	N/A	Write commands (not SPARQL)
PUT	Creates graph	Creates statement	N/A
DELETE	Deletes graph	Deletes statement	N/A

I haven't done HEAD yet (I intend to indicate if a graph or statement exists), and I'm ignoring OPTION.

I've also considered what it might mean to GET a statement or a graph. When applied to a graph, I could treat this as a synonym for the query:

  construct {?s ?p ?o} where {?s ?p ?o}

Initially I didn't think it made much sense to GET a statement, but while writing this it occurs to me that I could return a reification URI, if one exists (this is also an option for HEAD, but I think existence is a better function there).

Is There a Point?

Everything I've discussed here may seem pointless, especially since there are alternatives, none of it is standard, and I'm sure there will be numerous criticisms on my choices. On the other hand, I wrote this because I found that uploading documents at a time to be too crude for real coding. I also find that constructing TQL command to modify data to be a little too convoluted in many circumstances, and that a simple PUT is much more appropriate.

So, I'm pretty happy with it, for the simple fact that I find it useful. If anyone has suggested modifications or features, than I'll be more than happy to take them on board.