<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3090914606822911489</id><updated>2026-03-24T11:25:23.930-07:00</updated><category term="repository"/><category term="fedora"/><category term="idea"/><category term="python"/><category term="rdf"/><category term="crigshow"/><category term="metadata"/><category term="identifier"/><category term="formats"/><category term="opinion"/><category term="uuid"/><category term="bloginfo"/><category term="development"/><category term="fun"/><category term="pairtree"/><category term="redis"/><category term="search"/><category term="statistics"/><category term="usage"/><category term="Intro"/><category term="Trackback"/><category term="archive"/><category term="bagit"/><category term="blog"/><category term="citation"/><category term="contenttype"/><category term="creativity"/><category term="curation"/><category term="dev8d python twitter tagging spreadsheet data google"/><category term="diagram"/><category term="encoding"/><category term="event"/><category term="fonts"/><category term="howto"/><category term="linux"/><category term="motivation"/><category term="oaiore"/><category term="oaipmh"/><category term="osdiii"/><category term="packaging"/><category term="play"/><category term="rss"/><category term="serialisation"/><category term="silo"/><category term="unicode"/><title type='text'>Less Talk, More Code</title><subtitle type='html'>Ideas and plans, information from research, and documentation about implementations I have tried to create. 
Not all of it will be limited to repositories and object stores, but that&#39;s what labels are for!</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>61</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-4239157151754782056</id><published>2010-03-26T03:36:00.000-07:00</published><updated>2010-03-26T03:44:15.375-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="identifier"/><category scheme="http://www.blogger.com/atom/ns#" term="metadata"/><category scheme="http://www.blogger.com/atom/ns#" term="python"/><category scheme="http://www.blogger.com/atom/ns#" term="redis"/><category scheme="http://www.blogger.com/atom/ns#" term="repository"/><category scheme="http://www.blogger.com/atom/ns#" term="statistics"/><category scheme="http://www.blogger.com/atom/ns#" term="usage"/><title type='text'>Usage Statistics parsing and querying with redis and 
python</title><content type='html'>This is an update of my previous dabblings with chomping through log files. To summarise where I am now:&lt;br /&gt;&lt;br /&gt;I have a distributable workflow, loosely coordinated using Redis and Supervisord - redis is used in two fashions: firstly using its lists as queues, buffering the communication between the workers, and secondly as a store, counting and associating the usage with the items and the metadata entities (people, subjects, etc) of those items.&lt;br /&gt;&lt;br /&gt;I have written a very small Python logger that pushes loglines directly onto a redis list, providing me with live updating abilities, as well as manual log file parsing. This is currently switched on for testing in the live repository.&lt;br /&gt;&lt;br /&gt;Current code base is here: &lt;a href=&quot;http://github.com/benosteen/UsageLogAnalysis&quot;&gt;http://github.com/benosteen/UsageLogAnalysis&lt;/a&gt; - it has a good number of things hardcoded to the peculiarities of my log files and repository. 
However, as part of the &lt;a href=&quot;http://www.cranfieldlibrary.cranfield.ac.uk/pirus2/tiki-index.php&quot;&gt;PIRUS 2 project&lt;/a&gt;, I am turning this into an easily reusable codebase, adding in the ability to push out OpenURLs to PIRUS statistics gatherers.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Overview&lt;/strong&gt;:&lt;br /&gt;&lt;br /&gt;Loglines  -- lpush&#39;d to &#39;q:loglines&#39;&lt;br /&gt;&lt;br /&gt;workers -  &#39;debot.py&#39; - pulls lines from this queue and parses them up, separating them into 4 categories:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt; &lt;li&gt;Any hit by a recognised Bot or spider&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Any view or download made by a real person on an item in the repository&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Any 404, etc&lt;/li&gt;&lt;br /&gt; &lt;li&gt;And anything else&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;and the lines are moved onto 4 (5) queues respectively, q:bothits, q:objectviews (and q:count simultaneously), q:fof, and q:other. I am using prefixes as a convention when working with Redis keys - &quot;q:&quot; will almost always be a queue of some sort. These four queues are consumed by loggers, who commit the logs to disc, segregated into their categories.&lt;br /&gt;&lt;br /&gt;The q:count queue is consumed by a further worker called - count.py. This does a number of jobs, and is the part that actually does the analysis.&lt;br /&gt;&lt;br /&gt;For each repository item logged event, it finds the ID of the item and also whether this was a download of an item&#39;s files. With my repository, both these facts are deducible from the URL itself.&lt;br /&gt;&lt;br /&gt;Given the ID, it checks redis to see if this item has had its metadata analysed before. 
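The debot.py routing step looks roughly like the following sketch - note that the bot patterns and the item-URL regex here are placeholders, since the real code hardcodes the particulars of my own log format and repository URLs:

```python
import re

# Placeholder patterns -- the real debot.py hardcodes its own bot list
# and the URL layout of my repository.
BOT_UA = re.compile(r"(googlebot|msnbot|slurp|spider|crawler)", re.I)
ITEM_URL = re.compile(r"GET /objects/(?P<id>[^/ ]+)")

def categorise(logline):
    """Return the q:* queue a raw logline belongs on."""
    if BOT_UA.search(logline):
        return "q:bothits"
    if '" 404 ' in logline:
        return "q:fof"
    if ITEM_URL.search(logline):
        return "q:objectviews"
    return "q:other"

def run_worker(r):
    """Blocking loop: pop lines from q:loglines and route them onward.
    `r` is a redis.Redis() connection."""
    while True:
        _, line = r.blpop("q:loglines")
        queue = categorise(line.decode("utf8"))
        r.lpush(queue, line)
        if queue == "q:objectviews":
            # count.py consumes these for the actual analysis
            r.lpush("q:count", line)
```

The worker just sits blocked on BLPOP until loglines arrive, which is what makes the live-logging setup cheap to run alongside manual log file parsing.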
If it hasn&#39;t, it grabs the metadata for the item from the repository&#39;s index (hosted by an instance of Apache Solr) and starts to add connections between metadata entity and ID to the redis index:&lt;br /&gt;&lt;br /&gt;eg say item &quot;pid:1&quot; has the simple metadata of author_name=&#39;Ben&#39; and subjects=&#39;foo, bar&#39;&lt;br /&gt;&lt;br /&gt;create unique IDs from the text by hashing the text and prefixing it with the type of the field it came from:&lt;br /&gt;&lt;br /&gt;Prefixes:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt; &lt;li&gt;name =&amp;gt; &quot;n:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;institution =&amp;gt; &quot;i:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;faculty =&amp;gt; &quot;f:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;subjects =&amp;gt; &quot;s:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;keyphrases =&amp;gt; &quot;k:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;content type =&amp;gt; &quot;type:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;collection =&amp;gt; &quot;col:&quot;&lt;/li&gt;&lt;br /&gt; &lt;li&gt;thesis type =&amp;gt; &quot;tt:&quot;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;eg&lt;br /&gt;&lt;br /&gt;&amp;gt;&amp;gt;&amp;gt; from hashlib import md5&lt;br /&gt;&lt;br /&gt;&amp;gt;&amp;gt;&amp;gt; md5(&quot;Ben&quot;).hexdigest()&lt;br /&gt;&lt;br /&gt;&#39;092f2ba9f39fbc2876e64d12cd662f72&#39;&lt;br /&gt;&lt;br /&gt;So, the hashkey of the &#39;name&#39; &#39;Ben&#39; is &#39;n:092f2ba9f39fbc2876e64d12cd662f72&#39;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Now to make the connections in Redis:&lt;/strong&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt; &lt;li&gt;Add ID to the set &#39;objectitems&#39; - to keep track of all the IDs (SADD objectitems {ID})&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Set &#39;n:092f2....&#39; to &#39;Ben&#39; (so we can keep a reverse mapping)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Add &#39;n:092f2...&#39; to &#39;names&#39; set (to make it clearer. 
KEYS n:* should return an equivalent set)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Add &#39;n:092f2...&#39; to &#39;e:{id}&#39; eg &quot;e:pid:1&quot; - (e -&amp;gt; prefix for collections of entities. e:{id} is a set of all entities that occur in id)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;Add &#39;e:pid:1&#39; to &#39;e:n:092f2....&#39; (gathers a list of item ids in which this entity &#39;Ben&#39; occurs)&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Repeat for any entity you wish to track.&lt;br /&gt;&lt;br /&gt;To make this more truth-manageable, you should include the id of the record with the text when you generate the hashkey. That way, &#39;Ben&#39; appearing in one record will have a different key than &#39;Ben&#39; occurring in another. The assertion that these two entities are the same can easily take place in a different set (I&#39;m using b: as the prefix for these bundles of asserted equivalence)&lt;br /&gt;&lt;br /&gt;Once you have made these assertions, you can set about counting :)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conventions for tracking hits:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;d[v|d|o]:{id} - set of the dates on which {id} was viewed (v), downloaded from (d) or any other page action (o)&lt;br /&gt;&lt;p style=&quot;padding-left: 30px;&quot;&gt;eg dv:pid:1 -&amp;gt; set of dates on which pid:1 had page views.&lt;/p&gt;&lt;br /&gt;YYYY-MM-DD:{id}:[v|d|o] - set of IP clients that accessed a particular item on a given day - v,d,o as above&lt;br /&gt;&lt;p style=&quot;padding-left: 30px;&quot;&gt;eg 2010-02-03:pid:1:d - set of IP clients that downloaded a file from pid:1 on 2010-02-03&lt;/p&gt;&lt;br /&gt;t:views:{hashkey}, t:dls:{hashkey}, t:other:{hashkey}&lt;br /&gt;&lt;p style=&quot;padding-left: 30px;&quot;&gt;Grand totals of views, downloads or other accesses on a given entity or id. 
Good for quick lookups.&lt;/p&gt;&lt;br /&gt;Let&#39;s walk through an example: consider that a client of IP 1.2.3.4 visits the record page for this &#39;pid:1&#39; on 2010-01-01:&lt;br /&gt;&lt;br /&gt;ID = pid:1&lt;br /&gt;&lt;br /&gt;Add the User Agent string (&quot;mozilla... etc&quot;) to the &#39;ua:{IP}&#39; set, to keep track of the fingerprints of the visitors.&lt;br /&gt;&lt;br /&gt;Try to add the IP address to the set - in this case &quot;2010-01-01:pid:1:v&quot;&lt;br /&gt;&lt;br /&gt;If the IP isn&#39;t already in this set (the client hasn&#39;t accessed this page already today) then:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt; &lt;li&gt;make sure that &quot;2010-01-01&quot; is a part of the &#39;dv:pid:1&#39; set&lt;/li&gt;&lt;br /&gt; &lt;li&gt;go through all the entities that are part of pid:1 (n:092... etc) and increment their totals by one.&lt;br /&gt;&lt;ul&gt;&lt;br /&gt; &lt;li&gt;INCR t:views:n:092...&lt;/li&gt;&lt;br /&gt; &lt;li&gt;INCR t:views:pid:1&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;strong&gt;Now, what about querying?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Say we wish to look up the activity on a given entity, say for &#39;Ben&#39;?&lt;br /&gt;&lt;br /&gt;First, find the equivalent hashkey(s) - either directly using the simple md5sum hash, or by checking which bundles are for this entity.&lt;br /&gt;&lt;br /&gt;You can get the grand totals by simply querying &quot;t:views:key&quot;, &quot;t:dls...&quot; for each key and summing them together.&lt;br /&gt;&lt;br /&gt;You can get more refined answers by getting the set of IDs that this entity is associated with, and querying that to gather all the daily IP sets for them, and summing the answer. 
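As a rough sketch of those per-hit conventions in code - using a tiny in-memory stand-in for Redis so the key handling is visible; a real worker would pass in a redis.Redis() connection instead:

```python
from hashlib import md5

def hashkey(prefix, text, record_id=None):
    """Entity key, e.g. 'n:092f2...'. Passing the record id in as well
    keys each occurrence separately, as suggested for truth-management."""
    payload = text if record_id is None else record_id + text
    return prefix + md5(payload.encode("utf8")).hexdigest()

class FakeRedis:
    """In-memory stand-in exposing just redis-style SADD and INCR."""
    def __init__(self):
        self.sets, self.counters = {}, {}
    def sadd(self, key, member):
        s = self.sets.setdefault(key, set())
        if member in s:
            return 0          # already present, like redis SADD
        s.add(member)
        return 1
    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

def record_view(r, item_id, entity_keys, day, ip):
    """Record one page view following the key conventions above."""
    # SADD doubles as the 'first visit by this IP today?' test
    if r.sadd("%s:%s:v" % (day, item_id), ip):
        r.sadd("dv:%s" % item_id, day)    # dates on which item was viewed
        r.incr("t:views:%s" % item_id)    # grand total for the item
        for key in entity_keys:           # grand totals per entity
            r.incr("t:views:%s" % key)
```

So two views by the same IP on the same day only count once, and every entity hanging off the item gets its total bumped alongside the item itself.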
This gives me a nice way to generate data suitable for a daily activity sparkline, like:&lt;br /&gt;&lt;br /&gt;&lt;img class=&quot;alignnone&quot; title=&quot;Usage sparkline&quot; src=&quot;http://chart.apis.google.com/chart?chs=400x125&amp;amp;cht=ls&amp;amp;chco=0077CC&amp;amp;chds=0,15&amp;amp;chxt=x&amp;amp;chxl=0:|2009-07-16|2009-09-16|2010-03-17&amp;amp;chd=e:AAAAAAAAAAAAAAAAAAAAAAAAYAAMAAAAAAAMAMAAAAAAYAAAAMMAkMAAAAAAAMAAAAAAYAAAYAAAAAMAAAAAMAAMAAYAAAAAAAAAAAYMAAAAAAAAAAAAMAAAAAAAAAMAAAAAAAAAAMMYAMAAAAAAYAAAAAMAMAMMAAAAMAAAAMAMAAAM8MMkAAAAAAAAAAAMAAAAMAAAAkAYAAMMAMMAAAAAAAAAAAAAAMYAAAAMAMAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMMAMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAYA&quot; alt=&quot;&quot; width=&quot;400&quot; height=&quot;125&quot; /&gt;&lt;br /&gt;&lt;br /&gt;I have added another set of keys to the store, of the form &#39;geocode:{IP}&#39; that record country code to IP address, which gives me a nice way to plot out graphs like the following also using the google chart API:&lt;br /&gt;&lt;br /&gt;&lt;img class=&quot;alignnone&quot; title=&quot;Usage distribution of an item from the repository&quot; src=&quot;http://chart.apis.google.com/chart?cht=t&amp;amp;chs=440x220&amp;amp;chd=s:_&amp;amp;chf=bg,s,EAF7FE&amp;amp;chtm=world&amp;amp;chco=FFFFFF,FF0000,FFFF00,00FF00&amp;amp;chld=AEGRHKTRDEJPUSKRKWGBUKINEUNLSGSD&amp;amp;chd=t:0,0,0,0,0,0,40,46,0,0,100,6,6,0,0,0&quot; alt=&quot;&quot; width=&quot;440&quot; height=&quot;220&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Python logging to Redis&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;This functionality is mainly in one file in the github repo: &lt;a href=&quot;http://github.com/benosteen/UsageLogAnalysis/blob/master/redislogger.py&quot;&gt;redislogger.py&lt;/a&gt;﻿&lt;br /&gt;&lt;br /&gt;As you can see, most of that file is taken up with a demonstration of how to invoke it! 
The file that holds the logging configuration which this demo uses is in &lt;a href=&quot;http://github.com/benosteen/UsageLogAnalysis/blob/master/logging.conf.example&quot;&gt;logging.conf.example&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;NB The usage analysis code and UI is very much a WIP&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;but, I just wanted to post quickly on the rough overview on how it is set up and working.&lt;br /&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzww4f1kdJkeeUY0iIC3B2V0qCwxr611RRgOGETI-m4k1bapYpJajNZwJI1P3BbW634Bnbq0cRtZh9Yi_447Gvk4OELlXzfMhDmbZiuR2wEXnqczfT11t63H0vXS1Sr2NNC8txvMKiD_8/s1600/repo_statistics.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 369px; height: 400px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzww4f1kdJkeeUY0iIC3B2V0qCwxr611RRgOGETI-m4k1bapYpJajNZwJI1P3BbW634Bnbq0cRtZh9Yi_447Gvk4OELlXzfMhDmbZiuR2wEXnqczfT11t63H0vXS1Sr2NNC8txvMKiD_8/s400/repo_statistics.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5452891365961008498&quot; /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMv2OYtnJ5BQTacvRNbkaLLW-Qu-5_uEbidu8G9wJ3kP227kUR2GGhfV2P8rNJWtKQXv9WTG2EIt7F56MHdbcA7jfEPlQ-l-D2A7QVLUyMg8I7siVgNEsQ-iRKZOg332w-QF8XgOfcU6M/s1600/repo_stats_for_item.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 358px; height: 400px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMv2OYtnJ5BQTacvRNbkaLLW-Qu-5_uEbidu8G9wJ3kP227kUR2GGhfV2P8rNJWtKQXv9WTG2EIt7F56MHdbcA7jfEPlQ-l-D2A7QVLUyMg8I7siVgNEsQ-iRKZOg332w-QF8XgOfcU6M/s400/repo_stats_for_item.png&quot; 
border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5452891357278070210&quot; /&gt;&lt;/a&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/4239157151754782056/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/4239157151754782056' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4239157151754782056'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4239157151754782056'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2010/03/usage-statistics-parsing-and-querying.html' title='Usage Statistics parsing and querying with redis and python'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzww4f1kdJkeeUY0iIC3B2V0qCwxr611RRgOGETI-m4k1bapYpJajNZwJI1P3BbW634Bnbq0cRtZh9Yi_447Gvk4OELlXzfMhDmbZiuR2wEXnqczfT11t63H0vXS1Sr2NNC8txvMKiD_8/s72-c/repo_statistics.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-6383826637269561669</id><published>2010-03-25T08:42:00.000-07:00</published><updated>2010-03-25T09:59:13.023-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="curation"/><category scheme="http://www.blogger.com/atom/ns#" term="pairtree"/><category scheme="http://www.blogger.com/atom/ns#" term="python"/><category scheme="http://www.blogger.com/atom/ns#" term="repository"/><category 
scheme="http://www.blogger.com/atom/ns#" term="silo"/><title type='text'>Curating content from one repository to put into another</title><content type='html'>First you need a little code that I&#39;ve written:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;sudo easy_install recordsilo oaipmhscraper &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(This should install all the dependencies for the following)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To harvest some OAI-PMH records from say... &lt;a href=&quot;http://eprints.soton.ac.uk/perl/oai2&quot;&gt;http://eprints.soton.ac.uk/perl/oai2&lt;/a&gt; :&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First, take a look at the Identify page for the OAI-PMH endpoint: &lt;a href=&quot;http://eprints.soton.ac.uk/perl/oai2?verb=Identify&quot;&gt;http://eprints.soton.ac.uk/perl/oai2?verb=Identify&lt;/a&gt; &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The example identifier indicates that the record identifiers start with: &quot;oai:eprints.soton.ac.uk:&quot; - we&#39;ll need this in a bit. Maybe not need, but it&#39;ll make the local storage more... 
elegant?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Go to a nice clean directory, with enough storage to handle whatever you want to harvest.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Start a python commandline:&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; from oaipmhscraper import OAIPMHScraper&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;---&gt; NB  OAIPMHScraper(storage_dir, base_oai_url, identifier_uri_prefix)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; oaipmh = OAIPMHScraper(&quot;myrepo&quot;,&lt;/div&gt;&lt;div&gt;                                                         &quot;http://eprints.soton.ac.uk/perl/oai2&quot;, &lt;/div&gt;&lt;div&gt;                                                         &quot;oai:eprints.soton.ac.uk:&quot;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s have a look at what could be found out about the OAI-PMH endpoint then:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; oaipmh.state&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;{&#39;lastidentified&#39;: &#39;2010-03-25T15:57:15.670552&#39;, &#39;identify&#39;: {&#39;deletedRecord&#39;: &#39;persistent&#39;, &#39;compression&#39;: [], &#39;granularity&#39;: &#39;YYYY-MM-DD&#39;, &#39;baseURL&#39;: &#39;http://eprints.soton.ac.uk/perl/oai2&#39;, &#39;adminEmails&#39;: [&#39;mailto:eprints@soton.ac.uk&#39;], &#39;descriptions&#39;: [&#39;........&#39;], &#39;protocolVersion&#39;: &#39;2.0&#39;, &#39;repositoryName&#39;: &#39;e-Prints Soton&#39;, &#39;earliestDatestamp&#39;: &#39;0001-01-01 00:00:00&#39;}}&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; oaipmh.getMetadataPrefixes()&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;{&#39;oai_dc&#39;: (&#39;http://www.openarchives.org/OAI/2.0/oai_dc.xsd&#39;, &#39;http://www.openarchives.org/OAI/2.0/oai_dc/&#39;), &#39;uketd_dc&#39;: 
(&#39;http://naca.central.cranfield.ac.uk/ethos-oai/2.0/uketd_dc.xsd&#39;, &#39;http://naca.central.cranfield.ac.uk/ethos-oai/2.0/&#39;)}&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s grab all the oai_dc from all the objects:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; oaipmh.getRecords(&#39;oai_dc&#39;)&lt;/div&gt;&lt;div&gt;...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Go make a cup of coffee or tea.... you&#39;ll get lots of stuff like:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1267 found with datestamp 2004-04-27T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-03-25 16:01:11,807 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-03-25 16:01:11,813 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-03-25 16:01:11,819 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1270 found with datestamp 
2004-04-07T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1270 found with datestamp 2004-04-07T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-03-25 16:01:11,824 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1271 found with datestamp 2004-04-14T00:00:00 - storing.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My advice is to hop to a different terminal window and start to poke around with the content you are getting. The underlying store is a take on the CDL&#39;s Pairtree microspec (pairtree being a minimalist specification for how to structure the access to object-orientated items on a hierarchical filesystem). This model on top of pairtree I&#39;ve called a Silo (in the RecordSilo library I&#39;ve written); it constitutes a basic object model, where each object has a persistent JSON state (r/w-able) and can store any file, or files in subdirectories. 
It has crude object-level versioning, rather than file-versioning, so you can clone one version, delete/alter/add to it to create a second, curated version for reuse elsewhere without affecting the original.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What makes pairtree attractive is that the files themselves are not altered in form, so normal posix tools can be used on the files without unwrapping, depacking, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s have a look around at what&#39;s been harvested so far into the &quot;myrepo&quot; silo:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt;&gt;&gt; from recordsilo import Silo&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; s = Silo(&quot;myrepo&quot;)&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; s.state&lt;/div&gt;&lt;div&gt;{&#39;storage_dir&#39;: &#39;myrepo&#39;, &#39;identifier_uri_prefix&#39;: &#39;oai:eprints.soton.ac.uk:&#39;, &#39;uri_base&#39;: &#39;oai:eprints.soton.ac.uk:&#39;, &#39;base_oai_url&#39;: &#39;http://eprints.soton.ac.uk/perl/oai2&#39;}&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; len(s)   # NB this can be a time-consuming operation&lt;/div&gt;&lt;div&gt;&lt;div&gt;1100&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; len(s)&lt;/div&gt;&lt;div&gt;1200&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now let&#39;s look at a record: I&#39;m sure I saw &#39;6102&#39; whizz past as it was harvesting...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj = s.get_item(&quot;oai:eprints.soton.ac.uk:6102&quot;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj&lt;/div&gt;&lt;div&gt;{&#39;files&#39;: {&#39;1&#39;: [&#39;oai_dc&#39;]}, &#39;subdir&#39;: {&#39;1&#39;: []}, &#39;versions&#39;: [&#39;1&#39;], &#39;date&#39;: &#39;2004-06-24T00:00:00&#39;, &#39;currentversion&#39;: &#39;1&#39;, &#39;metadata_files&#39;: {&#39;1&#39;: [&#39;oai_dc&#39;]}, 
&#39;item_id&#39;: &#39;oai:eprints.soton.ac.uk:6102&#39;, &#39;version_dates&#39;: {&#39;1&#39;: &#39;2004-06-24T00:00:00&#39;}, &#39;metadata&#39;: {&#39;identifier&#39;: &#39;oai:eprints.soton.ac.uk:6102&#39;, &#39;firstSeen&#39;: &#39;2004-06-24T00:00:00&#39;, &#39;setSpec&#39;: [&#39;7374617475733D707562&#39;, &#39;7375626A656374733D51:5148:5148333031&#39;, &#39;7375626A656374733D47:4743&#39;, &#39;74797065733D61727469636C65&#39;, &#39;67726F75703D756F732D686B&#39;]}}&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&gt;&gt;&gt; obj.files&lt;/div&gt;&lt;div&gt;[&#39;oai_dc&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.versions&lt;/div&gt;&lt;div&gt;[&#39;1&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.clone_version(&quot;1&quot;,&quot;workingcopy&quot;)&lt;/div&gt;&lt;div&gt;&#39;workingcopy&#39;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.versions&lt;/div&gt;&lt;div&gt;[&#39;1&#39;, &#39;workingcopy&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.currentversion&lt;/div&gt;&lt;div&gt;&#39;workingcopy&#39;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.set_version_cursor(&quot;1&quot;)&lt;/div&gt;&lt;div&gt;True&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.set_version_cursor(&quot;workingcopy&quot;)&lt;/div&gt;&lt;div&gt;True&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.files&lt;/div&gt;&lt;div&gt;[&#39;oai_dc&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; with obj.get_stream(&quot;oai_dc&quot;) as oai_dc_xml:&lt;/div&gt;&lt;div&gt;...   print oai_dc_xml.read()&lt;/div&gt;&lt;div&gt;... 
&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&amp;lt;metadata xmlns=&quot;http://www.openarchives.org/OAI/2.0/&quot; xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt;&lt;/div&gt;&lt;div&gt;      &amp;lt;oai_dc:dc xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot; xmlns:oai_dc=&quot;http://www.openarchives.org/OAI/2.0/oai_dc/&quot; xmlns:dc=&quot;http://purl.org/dc/elements/1.1/&quot; xsi:schemaLocation=&quot;http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd&quot;&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:title&gt;Population biology of Hirondellea sp. nov. (Amphipoda: Gammaridea: Lysianassoidea) from the Atacama Trench (south-east Pacific Ocean)&amp;lt;/dc:title&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:creator&gt;Perrone, F.M.&amp;lt;/dc:creator&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:creator&gt;Dell&#39;Anno, A.&amp;lt;/dc:creator&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:creator&gt;Danovaro, R.&amp;lt;/dc:creator&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:creator&gt;Groce, N.D.&amp;lt;/dc:creator&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:creator&gt;Thurston, M.H.&amp;lt;/dc:creator&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:subject&gt;QH301 Biology&amp;lt;/dc:subject&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:subject&gt;GC Oceanography&amp;lt;/dc:subject&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:description/&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:publisher/&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:date&gt;2002&amp;lt;/dc:date&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:type&gt;Article&amp;lt;/dc:type&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:type&gt;PeerReviewed&amp;lt;/dc:type&gt;&lt;/div&gt;&lt;div&gt;        &amp;lt;dc:identifier&gt;http://eprints.soton.ac.uk/6102/&amp;lt;/dc:identifier&gt;&amp;lt;/oai_dc:dc&gt;&amp;lt;/metadata&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You can add bytestreams as 
strings:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.put_stream(&quot;foo.txt&quot;, &quot;Some random text!&quot;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;or as file-like objects:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; with open(&quot;README&quot;, &quot;r&quot;) as readmefile:&lt;/div&gt;&lt;div&gt;...   obj.put_stream(&quot;README&quot;, readmefile)&lt;/div&gt;&lt;div&gt;... &lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.files&lt;/div&gt;&lt;div&gt;[&#39;oai_dc&#39;, &#39;foo.txt&#39;, &#39;README&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.set_version_cursor(&quot;1&quot;)&lt;/div&gt;&lt;div&gt;True&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; obj.files&lt;/div&gt;&lt;div&gt;[&#39;oai_dc&#39;]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;This isn&#39;t the easiest way to browse or poke around the files. It would be nice to see these through a web UI of some kind:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Grab the basic UI code from http://github.com/benosteen/siloserver&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(You&#39;ll need to install web.py and Mako:  sudo easy_install mako web.py)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then edit the silodirectory_conf.py file to point to the location of the Silo - if the directory structure looks like the following:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;myrepo&lt;/div&gt;&lt;div&gt;   |&lt;/div&gt;&lt;div&gt;   ---  Silo directory stuff...&lt;/div&gt;&lt;div&gt;SiloServer&lt;/div&gt;&lt;div&gt;   |&lt;/div&gt;&lt;div&gt;    - dropbox.py&lt;/div&gt;&lt;div&gt;    etc&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You need to change data_dir to &quot;../myrepo&quot; and then you can start the server by running &#39;python dropbox.py&#39;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Point a browser at http://localhost:8080/ and wait a while - 
that start page loads *every* object in the Silo.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfFUXy1QS09kEMPvteFTMZgcoyXxzUKHFNRx8mFaHyziizQEhfTXBG5E1SrNQq3nWcaj9aCF-fHb9CoVYa2Orz9-PTDd5zW-r6FcdvfXGMn7raHzSVcQLYmGZ-IknYcBVJxZtu0eHIWXM/s1600/dropbox_soton.png&quot;&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfFUXy1QS09kEMPvteFTMZgcoyXxzUKHFNRx8mFaHyziizQEhfTXBG5E1SrNQq3nWcaj9aCF-fHb9CoVYa2Orz9-PTDd5zW-r6FcdvfXGMn7raHzSVcQLYmGZ-IknYcBVJxZtu0eHIWXM/s320/dropbox_soton.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5452614612357133122&quot; style=&quot;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 320px; height: 214px; &quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;And let&#39;s revisit our altered record, at &lt;a href=&quot;http://localhost:8080/oai:eprints.soton.ac.uk:6102&quot;&gt;http://localhost:8080/oai:eprints.soton.ac.uk:6102&lt;/a&gt; &lt;/div&gt;&lt;div&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfmu0A_OEviiNGRN1TBd4WnubLuGmv-BCaqRcVAkk5l1DivI0IjtbQ0JHDQo1rYlpngINPs8gYSs80Nl5Mk7bLBKYFzZlbpP0UqL0A04Il1zimqxLR5VvlPphbS_8TPI_PIh5wKDnQF0I/s1600/dropbox_6102.png&quot;&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfmu0A_OEviiNGRN1TBd4WnubLuGmv-BCaqRcVAkk5l1DivI0IjtbQ0JHDQo1rYlpngINPs8gYSs80Nl5Mk7bLBKYFzZlbpP0UqL0A04Il1zimqxLR5VvlPphbS_8TPI_PIh5wKDnQF0I/s400/dropbox_6102.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5452615019452355138&quot; style=&quot;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 295px; &quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;So, from this point, I can curate the records as I wish, add files to each 
item - perhaps licences, PREMIS files, etc - and then push them onto another repository, such as Fedora.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/6383826637269561669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/6383826637269561669' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6383826637269561669'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6383826637269561669'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2010/03/curating-content-from-one-repository-to.html' title='Curating content from one repository to put into another'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfFUXy1QS09kEMPvteFTMZgcoyXxzUKHFNRx8mFaHyziizQEhfTXBG5E1SrNQq3nWcaj9aCF-fHb9CoVYa2Orz9-PTDd5zW-r6FcdvfXGMn7raHzSVcQLYmGZ-IknYcBVJxZtu0eHIWXM/s72-c/dropbox_soton.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-5117977572954400653</id><published>2010-02-11T03:45:00.000-08:00</published><updated>2010-02-11T06:36:02.562-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="oaipmh"/><category scheme="http://www.blogger.com/atom/ns#" term="pairtree"/><category scheme="http://www.blogger.com/atom/ns#" term="python"/><category scheme="http://www.blogger.com/atom/ns#" term="repository"/><title type='text'>My 
swiss army toolkit for distributed/multiprocessing systems</title><content type='html'>&lt;div&gt;My first confession - I avoid &#39;threading&#39; and shared memory. Avoid it like the plague, not because I cannot do it but because it can be a complete pain to build and maintain relative to the alternatives.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am very much pro multiprocessing versus multithreading - obviously, there are times when threading is by far the best choice, but I&#39;ve found multiprocessing, for the most part, to be quicker, simpler and far easier to log, manage and debug than multithreading.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, what do I mean by a &#39;multiprocessing&#39; system? (just to be clear)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;A multiprocessing system consists of many concurrently running processes on one or more machines, and contains some means to distribute messages and persist data between these processes.&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This does not mean that the individual processes cannot multithread themselves; it is just that each process handles a small, well-defined aspect of the system (paralleling the unix commandline tool idiom).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Tools for multiprocess management:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://code.google.com/p/redis&quot;&gt;Redis&lt;/a&gt; - a data structure server, providing atomic operations on integers, lists, sets, and sorted lists.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://www.rabbitmq.com/&quot;&gt;RabbitMQ&lt;/a&gt; - a messaging server, based on the AMQP spec. 
IMO much cleaner, easier to manage, more flexible and more reliable than all the JMS systems I&#39;ve used.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://supervisord.org/&quot;&gt;Supervisor&lt;/a&gt; - a battle-tested process manager that can be operated via XML-RPC or HTTP. Enables live control and status of your processes.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Redis has become my swiss army knife of data munging - a store that persists data and which has some very useful atomic operations, such as integer incrementing, list manipulations and very fast set operations. I&#39;ve also used it for some quick-n-dirty process orchestrations (which is how I&#39;ve used it in the example that ends this post).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I&#39;ve also used it for usage statistic parsing and characterisation of miscellaneous XML files too!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;RabbitMQ - a dependable, fast message server which I am primarily using as a buffer for asynchronous operations and task distribution. More boilerplate to use than, say, Redis, but far better suited to that sort of thing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Supervisord - I&#39;ve been told that the ruby project &#39;god&#39; is similar to this - I really have found it very useful, especially on those systems I run remotely. An HTML page to control processes and view logs and stats? What&#39;s not to like?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Now for a little illustration of a simple multiprocessing solution - in fact, this blog post far, far outweighs the code written and perhaps even overeggs the simple nature of the problem. 
I typically wouldn&#39;t use supervisor for a simple task like the following, but it makes a suitable example of how to work it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The ability to asynchronously deliver messages, updates and tasks between your processes is a real boon - it enables quick solutions to normally vexing or time-consuming problems. For example, let&#39;s look at a trivial problem of how to harvest the content from a repository with an OAI-PMH service:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A possible solution needs:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;a process to communicate with the OAI-PMH service to gain the list of identifiers for the items in the repository (with the ability to update itself at a later time), including the ability to find the serialised form of the full metadata for the item if it cannot be obtained from the OAI-PMH service (eg Eprints3 XML often isn&#39;t included in the OAI-PMH service, but can be retrieved from the Export function),&lt;/li&gt;&lt;li&gt;a process that simply downloads files to a point on the disc,&lt;/li&gt;&lt;li&gt;and a service that allows the first process to queue download jobs for the second - in this case Redis.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;I told you it would be trivial :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Installing the tools: (see &lt;a href=&quot;http://code.google.com/p/redis/wiki/QuickStart&quot;&gt;http://code.google.com/p/redis/wiki/QuickStart&lt;/a&gt; for instructions on building Redis itself)&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;sudo apt-get install build-essential python-dev python-setuptools [make sure you can build and use easy_install - here shown for debian/ubuntu/etc]&lt;/li&gt;&lt;li&gt;sudo easy_install supervisor&lt;/li&gt;&lt;li&gt;mkdir oaipmh_directory        # A directory to contain all the bits you need&lt;/li&gt;&lt;li&gt;cd oaipmh_directory&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Create a supervisor 
configuration for the task at hand and save it as supervisord.conf.&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;[program:oaipmhgrabber]&lt;/div&gt;&lt;div&gt;autorestart = false&lt;/div&gt;&lt;div&gt;numprocs = 1&lt;/div&gt;&lt;div&gt;autostart = false&lt;/div&gt;&lt;div&gt;redirect_stderr = True&lt;/div&gt;&lt;div&gt;stopwaitsecs = 10&lt;/div&gt;&lt;div&gt;startsecs = 10&lt;/div&gt;&lt;div&gt;priority = 10&lt;/div&gt;&lt;div&gt;command = python harvest.py&lt;/div&gt;&lt;div&gt;startretries = 3&lt;/div&gt;&lt;div&gt;stdout_logfile = workerlogs/harvest.log&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[program:downloader]&lt;/div&gt;&lt;div&gt;autorestart = true&lt;/div&gt;&lt;div&gt;numprocs = 1&lt;/div&gt;&lt;div&gt;autostart = false&lt;/div&gt;&lt;div&gt;redirect_stderr = True&lt;/div&gt;&lt;div&gt;stopwaitsecs = 10&lt;/div&gt;&lt;div&gt;startsecs = 10&lt;/div&gt;&lt;div&gt;priority = 999&lt;/div&gt;&lt;div&gt;command = oaipmh_file_downloader q:download_list&lt;/div&gt;&lt;div&gt;startretries = 3&lt;/div&gt;&lt;div&gt;stdout_logfile = workerlogs/download.log&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[program:redis]&lt;/div&gt;&lt;div&gt;autorestart = true&lt;/div&gt;&lt;div&gt;numprocs = 1&lt;/div&gt;&lt;div&gt;autostart = true&lt;/div&gt;&lt;div&gt;redirect_stderr = True&lt;/div&gt;&lt;div&gt;stopwaitsecs = 10&lt;/div&gt;&lt;div&gt;startsecs = 10&lt;/div&gt;&lt;div&gt;priority = 999&lt;/div&gt;&lt;div&gt;command = path/to/the/redis-server&lt;/div&gt;&lt;div&gt;startretries = 3&lt;/div&gt;&lt;div&gt;stdout_logfile = workerlogs/redis.log&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[unix_http_server]&lt;/div&gt;&lt;div&gt;file = /tmp/supervisor.sock&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[supervisord]&lt;/div&gt;&lt;div&gt;minfds = 1024&lt;/div&gt;&lt;div&gt;minprocs = 200&lt;/div&gt;&lt;div&gt;loglevel = info&lt;/div&gt;&lt;div&gt;logfile = 
/tmp/supervisord.log&lt;/div&gt;&lt;div&gt;logfile_maxbytes = 50MB&lt;/div&gt;&lt;div&gt;nodaemon = false&lt;/div&gt;&lt;div&gt;pidfile = /tmp/supervisord.pid&lt;/div&gt;&lt;div&gt;logfile_backups = 10&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[supervisorctl]&lt;/div&gt;&lt;div&gt;serverurl = unix:///tmp/supervisor.sock&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[rpcinterface:supervisor]&lt;/div&gt;&lt;div&gt;supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[inet_http_server]&lt;/div&gt;&lt;div&gt;username = guest&lt;/div&gt;&lt;div&gt;password = mypassword&lt;/div&gt;&lt;div&gt;port = 127.0.0.1:9001&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This has a lot of boilerplate in it, so let&#39;s go through it, section by section:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[program:redis] - this controls the redis program. You will need to change the path to the redis server to wherever it was built on your system - eg ~/redis-1.2.1/redis-server&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[program:oaipmhgrabber] and [program:downloader] - these set up the worker processes; look at the &#39;command&#39; key for the command that is run for each, eg downloader has &quot;oaipmh_file_downloader q:download_list&quot; - the OAIPMHScraper package installs that script, and &#39;q:download_list&#39; is the redis-based list that the download tasks appear on. 
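Since the configuration above enables [rpcinterface:supervisor] and [inet_http_server], supervisord can also be driven programmatically over XML-RPC rather than through the web UI. A minimal sketch - the credentials and port match the example supervisord.conf, and the helper function names are my own invention:

```python
# Sketch: controlling supervisord via its XML-RPC API instead of the web UI.
# Assumes the [inet_http_server] settings from the supervisord.conf above;
# the helper names here are made up for illustration.
try:
    from xmlrpc.client import ServerProxy   # Python 3
except ImportError:
    from xmlrpclib import ServerProxy       # Python 2

def supervisor_proxy(user="guest", password="mypassword", port=9001):
    # supervisord exposes its RPC interface at /RPC2 on the HTTP port
    return ServerProxy("http://%s:%s@127.0.0.1:%s/RPC2" % (user, password, port))

def process_states(proxy):
    # Map each program name to its state, eg {'redis': 'RUNNING'}
    return dict((p["name"], p["statename"])
                for p in proxy.supervisor.getAllProcessInfo())

def start_harvest(proxy):
    # The same action as clicking 'start' next to oaipmhgrabber in the UI
    return proxy.supervisor.startProcess("oaipmhgrabber")
```

With supervisord running, process_states(supervisor_proxy()) gives a quick health check of all three programs without opening a browser.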
NB we haven&#39;t written harvest.py yet - don&#39;t worry!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;NB it is very important that autorestart = false in [program:oaipmhgrabber] - if it were set to true, the harvest would restart and repeat, on and on, forever!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Supervisor boilerplate: [unix_http_server], [supervisord], [supervisorctl]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;RPC interface control - [rpcinterface:supervisor]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;HTTP interface control - [inet_http_server] - which, importantly, includes the username and password to log in to the control panel!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, create the log directory:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;mkdir workerlogs&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s now write &#39;harvest.py&#39;: PLEASE use a different OAI2 endpoint url!&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;#!/usr/bin/env python&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;from oaipmhscraper import Eprints3Harvester&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;o = Eprints3Harvester(&quot;repo&quot;, base_oai_url=&quot;http://eprints.maths.ox.ac.uk/cgi/oai2/&quot;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;o.getRecords(metadataPrefix=&quot;XML&quot;,&lt;/div&gt;&lt;div&gt;                        template=&quot;%(pid)s/%(prefix)s/mieprints-eprint-%(pid)s.xml&quot;)&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;[Note there is a base OAIPMHScraper class, but this simply goes and gets the metadata or identifiers for a given endpoint and stores whatever XML metadata it gets into a store. 
The Eprints3 harvester gets the files as well, or tries to.]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You may have to change the template for other eprints repositories - the above template would result in the following for item 774:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&quot;http://eprints.maths.ox.ac.uk/cgi/export/774/XML/mieprints-eprint-774.xml&quot;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;YMMV for other repositories of course, so you can rewrite this template accordingly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Your directory should look like this:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;--&gt;  harvest.py  supervisord.conf  workerlogs/&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s start the supervisor to make sure the configuration is correct:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[---] $ supervisord -c supervisord.conf&lt;/div&gt;&lt;div&gt;[---] $ &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now open &lt;a href=&quot;http://localhost:9001/&quot;&gt;http://localhost:9001/&lt;/a&gt; - it should look like the following:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot; style=&quot;color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; &quot;&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-PPtzwWxVMAHvn-zQcTEUMo08J_MvO3aQiwkxvOcJnWrMJQ_S0KjHJ7UvBbr06JlROAXmR2LdgBSnVXi1BNwjrlMTZfsirOUZa7Owt79g9hghFPIxvQGPh0BXilh8yn0LdOcSoYPkFIg/s320/supervisor.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5436977506699647938&quot; style=&quot;display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 320px; height: 153px; &quot; /&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Click on the &#39;redis&#39; name to see the logfile that this is 
generating - you&#39;ll want to see lines like:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;11 Feb 13:34:32 . 0 clients connected (0 slaves), 2517 bytes in use, 0 shared objects&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;Let&#39;s start the harvest :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Click on &#39;start&#39; for the oaipmh grabber process and wait - in the configuration file, we told it to wait for the process to stay up for 10 seconds before reporting that it was running, so it should take about that long for the page to refresh.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, let&#39;s see what it is putting onto the queue, before we start the download process (see, easy to debug!)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;python&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; from redis import Redis&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r = Redis()&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r.keys(&quot;*&quot;)&lt;/div&gt;&lt;div&gt;&lt;div&gt;[u&#39;q:download_list&#39;]&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r.llen(&quot;q:download_list&quot;)&lt;/div&gt;&lt;div&gt;351&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r.llen(&quot;q:download_list&quot;)&lt;/div&gt;&lt;div&gt;361&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r.llen(&quot;q:download_list&quot;)&lt;/div&gt;&lt;div&gt;370&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; # Still accruing things to download as we speak...&lt;/div&gt;&lt;div&gt;&gt;&gt;&gt; r.lrange(&quot;q:download_list&quot;, 0,0)&lt;/div&gt;&lt;div&gt;[u&#39;{&quot;url&quot;: &quot;http://eprints.maths.ox.ac.uk/cgi/export/774/XML/mieprints-eprint-774.xml&quot;, &quot;filename&quot;: &quot;XML&quot;, &quot;pid&quot;: &quot;oai:generic.eprints.org:774&quot;, &quot;silo&quot;: &quot;repo&quot;}&#39;]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, let&#39;s switch on the downloader and work on those messages - go back to http://localhost:9001 and start the downloader. 
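The messages on q:download_list are self-describing JSON, so a downloader worker needs very little logic of its own. The following is not the actual oaipmh_file_downloader code - just a hypothetical sketch of the BLPOP-and-fetch pattern it implements, using the message format shown above:

```python
# Hypothetical sketch of a queue-driven download worker; the real one
# ships with the OAIPMHScraper package as 'oaipmh_file_downloader'.
import json

def parse_task(raw):
    # Each queued message is JSON with url, filename, pid and silo keys
    task = json.loads(raw)
    return task["url"], task["filename"], task["pid"], task["silo"]

def worker_loop(redis_client, fetch, queue="q:download_list"):
    # BLPOP blocks until a message arrives, so the worker idles cheaply;
    # redis-py returns a (queue_name, value) pair
    while True:
        _queue, raw = redis_client.blpop(queue)
        url, filename, pid, silo = parse_task(raw)
        fetch(url, filename, pid, silo)
```

Here fetch stands in for whatever actually downloads the url and stores it as filename inside the object pid in the named silo.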
Click on the downloader name when the page refreshes to get a &#39;tail&#39; of its logfile in the browser.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You should get something like the following:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;&lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:CombineHarvester File downloader:Starting download of XML (from http://eprints.maths.ox.ac.uk/cgi/export/370/XML/mieprints-eprint-370.xml) to object oai:generic.eprints.org:370&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-02-11 13:43:51,284 - CombineHarvester File downloader - INFO - Download completed in 0 seconds&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:CombineHarvester File downloader:Download completed in 0 seconds&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-02-11 13:43:51,285 - CombineHarvester File downloader - INFO - Saving to Silo repo&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:CombineHarvester File downloader:Saving to Silo repo&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;2010-02-11 13:43:51,287 - CombineHarvester File downloader - INFO - Starting download of XML (from http://eprints.maths.ox.ac.uk/cgi/export/371/XML/mieprints-eprint-371.xml) to object oai:generic.eprints.org:371&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;INFO:CombineHarvester File downloader:Starting download of XML (from 
http://eprints.maths.ox.ac.uk/cgi/export/371/XML/mieprints-eprint-371.xml) to object oai:generic.eprints.org:371&lt;/span&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;span class=&quot;Apple-style-span&quot;  style=&quot;font-size:small;&quot;&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt; So, that will go off and download all the XML (Eprints3 XML) for each item it found in the repository. (I haven&#39;t put in much to stop dupe downloads etc. - exercise for the reader ;))&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How about we try to download the files for each item too? It just so happens I&#39;ve included a little Eprints3 XML parser and a method, &#39;reprocessRecords&#39;, for queuing up the files for download - let&#39;s use this to download the files now - save the following as download_files.py&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;#!/usr/bin/env python&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;from oaipmhscraper import Eprints3Harvester&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;o = Eprints3Harvester(&quot;repo&quot;, base_oai_url=&quot;http://eprints.maths.ox.ac.uk/cgi/oai2/&quot;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;o.reprocessRecords()&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Add this process to the top of the supervisord.conf file:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;[program:queuefilesfordownload]&lt;/div&gt;&lt;div&gt;autorestart = false&lt;/div&gt;&lt;div&gt;numprocs = 1&lt;/div&gt;&lt;div&gt;autostart = false&lt;/div&gt;&lt;div&gt;redirect_stderr = True&lt;/div&gt;&lt;div&gt;stopwaitsecs = 10&lt;/div&gt;&lt;div&gt;startsecs = 10&lt;/div&gt;&lt;div&gt;priority = 999&lt;/div&gt;&lt;div&gt;command = python download_files.py&lt;/div&gt;&lt;div&gt;startretries = 3&lt;/div&gt;&lt;div&gt;stdout_logfile = 
workerlogs/download_files.log&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, to demonstrate the commandline supervisor controller:&lt;/div&gt;&lt;div&gt;[--] $ supervisorctl&lt;/div&gt;&lt;div&gt;&lt;div&gt;downloader                       RUNNING    pid 20750, uptime 0:15:41&lt;/div&gt;&lt;div&gt;oaipmhgrabber                    STOPPED    Feb 11 01:58 PM&lt;/div&gt;&lt;div&gt;redis                            RUNNING    pid 16291, uptime 0:25:31&lt;/div&gt;&lt;div&gt;supervisor&gt; shutdown&lt;/div&gt;&lt;div&gt;Really shut the remote supervisord process down y/N? y&lt;/div&gt;&lt;div&gt;Shut down&lt;/div&gt;&lt;div&gt;supervisor&gt; &lt;/div&gt;&lt;div&gt;(Press Ctrl+D to leave this terminal)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now restart the supervisor:&lt;/div&gt;&lt;div&gt;[--] $ supervisord -c supervisord.conf&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And refresh http://localhost:9001/&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[NB in the following picture, I reran oaipmhgrabber, so you could see what the status of a normally exiting process looks like]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimc5BGixgP_RlW_UbVyCXsOPJiZ8xOLhnvKANFqaN960iEPOkj3p6eGwpYLP_kykJe_CzPnlBcF3HgNQYJJ67xkHOQy5XW7Gghc9JSV4pNtWOiOIBJqf6_QbsmNY9whFRIQIHT8DXkoAc/s1600-h/supervisor2.png&quot;&gt;&lt;img style=&quot;display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 154px;&quot; 
src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimc5BGixgP_RlW_UbVyCXsOPJiZ8xOLhnvKANFqaN960iEPOkj3p6eGwpYLP_kykJe_CzPnlBcF3HgNQYJJ67xkHOQy5XW7Gghc9JSV4pNtWOiOIBJqf6_QbsmNY9whFRIQIHT8DXkoAc/s320/supervisor2.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5436986877744042466&quot; /&gt;&lt;/a&gt;Now, switch on the reprocess record worker and tail -f the downloader if you want to watch it work :)&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;What&#39;s a RecordSilo&lt;/b&gt;? (aka How things are stored in the example)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This class is based on CDL&#39;s spec for &lt;a href=&quot;https://confluence.ucop.edu/display/Curation/PairTree&quot;&gt;Pairtree&lt;/a&gt; object storage - each object contains a JSON manifest and is made up of object-level versions. But, it is easier to understand if you have some kind of GUI to poke around with, so I quickly wrote the following dropbox.py server for that end:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Grab the dropbox code and templates from &lt;a href=&quot;http://github.com/benosteen/SiloServer&quot;&gt;http://github.com/benosteen/SiloServer&lt;/a&gt; - unpack it into the same directory as you are in now.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;so that:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[--] $ ls&lt;/div&gt;&lt;div&gt;download_files.py  dropbox.py  dump.rdb  harvest.py  repo  supervisord.conf  templates  workerlogs&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Edit dropbox.py and change the data_dir to equal your repo directory name - in this case, just &quot;repo&quot;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(Make sure you have mako and web.py installed too! 
sudo easy_install mako web.py)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;then:  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;$ python dropbox.py &lt;/div&gt;&lt;div&gt;http://0.0.0.0:8080/&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Go to http://localhost:8080/ to then see all your objects! This page opens them all, so could take a while :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1lSAv_fV0QYIeSR2vh4PfB-RGfXtiWx59llBofAQuAF0XN6k0LQNgDUzuWH6ItGCm6ILkIWJtGnf7USRY4bgue3cblCq5qijxpq90wHs73bXA0dcXYilKLBFVklFTROt2m2vQMw_5ZK4/s1600-h/dropbox2.png&quot;&gt;&lt;img style=&quot;cursor:pointer; cursor:hand;width: 320px; height: 245px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1lSAv_fV0QYIeSR2vh4PfB-RGfXtiWx59llBofAQuAF0XN6k0LQNgDUzuWH6ItGCm6ILkIWJtGnf7USRY4bgue3cblCq5qijxpq90wHs73bXA0dcXYilKLBFVklFTROt2m2vQMw_5ZK4/s320/dropbox2.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5436994496886495698&quot; /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7Jr_mEl7F1_yRQPQeCDrFCcXtYXtvo3Kw6rx2dngWZ-Vl22tvyntuA7l-Lv6E7zL4E30iwgFvSD7sR_BRHJuwluQ8x452JBOVvdKZxKBIhQxN_UbSGpgcorFcZCxKtmOBi8EtZR6ubMo/s1600-h/dropbox.png&quot;&gt;&lt;img style=&quot;cursor:pointer; cursor:hand;width: 320px; height: 137px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7Jr_mEl7F1_yRQPQeCDrFCcXtYXtvo3Kw6rx2dngWZ-Vl22tvyntuA7l-Lv6E7zL4E30iwgFvSD7sR_BRHJuwluQ8x452JBOVvdKZxKBIhQxN_UbSGpgcorFcZCxKtmOBi8EtZR6ubMo/s320/dropbox.png&quot; border=&quot;0&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5436994100879977378&quot; /&gt;&lt;/a&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(I did this on my work computer and may have not put in some dependencies, etc but it worked for me. Let me know if it doesn&#39;t in the comments)&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/5117977572954400653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/5117977572954400653' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/5117977572954400653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/5117977572954400653'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2010/02/my-swiss-army-toolkit-for.html' title='My swiss army toolkit for distributed/multiprocessing systems'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-PPtzwWxVMAHvn-zQcTEUMo08J_MvO3aQiwkxvOcJnWrMJQ_S0KjHJ7UvBbr06JlROAXmR2LdgBSnVXi1BNwjrlMTZfsirOUZa7Owt79g9hghFPIxvQGPh0BXilh8yn0LdOcSoYPkFIg/s72-c/supervisor.png" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-6053583822562222201</id><published>2010-01-18T09:19:00.000-08:00</published><updated>2010-01-18T09:36:09.125-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="redis"/><category scheme="http://www.blogger.com/atom/ns#" term="repository"/><category scheme="http://www.blogger.com/atom/ns#" 
term="statistics"/><category scheme="http://www.blogger.com/atom/ns#" term="usage"/><title type='text'>Usage stats and Redis</title><content type='html'>Redis has been such a massively useful tool to me. &lt;p&gt;Recently, it has let me cut through access logs munging like a hot knife through butter, all with multiprocessing goodness.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Key things:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;Using sets to manage botlists:&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&gt;&gt;&gt; from redis import Redis&lt;br /&gt;&gt;&gt;&gt; r = Redis()&lt;br /&gt;&gt;&gt;&gt; for bot in r.smembers(&quot;botlist&quot;):&lt;br /&gt;...   print bot&lt;br /&gt;...&lt;br /&gt;lycos.txt&lt;br /&gt;non_engines.txt&lt;br /&gt;inktomi.txt&lt;br /&gt;misc.txt&lt;br /&gt;askjeeves.txt&lt;br /&gt;oucs_bots&lt;br /&gt;wisenut.txt&lt;br /&gt;altavista.txt&lt;br /&gt;msn.txt&lt;br /&gt;googlebotlist.txt&lt;br /&gt;&gt;&gt;&gt; total = 0&lt;br /&gt;&gt;&gt;&gt; for bot in r.smembers(&quot;botlist&quot;):&lt;br /&gt;...   total = total + r.scard(bot)&lt;br /&gt;...&lt;br /&gt;&gt;&gt;&gt; total&lt;br /&gt;3882&lt;br /&gt;&lt;/p&gt;&lt;p&gt;So, I have 3882 different IP addresses that I have built up that I consider bots.&lt;/p&gt;&lt;p&gt;&lt;i&gt;Keeping counts and avoiding race-conditions&lt;/i&gt;&lt;/p&gt;&lt;p&gt;By using the Redis INCR command, it&#39;s easy to write little workers that run in their own process but which atomically increment counts of hits.&lt;/p&gt;&lt;p&gt;&lt;b&gt;What does the stat system look like?&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;I am treating each line of the Apache-style log as a message that I am passing through a number of workers. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Queues&lt;/b&gt;&lt;/p&gt;&lt;p&gt;All in the same AMQP exchange: (&quot;stats&quot;)&lt;/p&gt;&lt;p&gt;Queue &quot;&lt;b&gt;loglines&lt;/b&gt;&quot; - msg&#39;s = A single log line in the Apache format. 
Can be sourced from either local logs or from the live service.&lt;/p&gt;&lt;p&gt;&lt;b&gt;loglines&lt;/b&gt; is listened to by a &lt;b&gt;debot.py&lt;/b&gt; worker, just one at the moment. This worker feeds three queues:&lt;/p&gt;&lt;p&gt;Queue &quot;&lt;b&gt;bothits&lt;/b&gt;&quot; - log lines from a request that matches a bot IP&lt;/p&gt;&lt;p&gt;Queue &quot;&lt;b&gt;objectviews&lt;/b&gt;&quot; - log lines from a request that was a record page view or item download&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Queue &quot;&lt;b&gt;other&lt;/b&gt;&quot; - log lines that I am presently not so interested in.&lt;/p&gt;&lt;p&gt;[These three queues are consumed by three loggers and these maintain a copy of the logs, pre-separated. These are designed to be temporary parts of the workflow, to be discarded once we know what we want from the logs.]&lt;/p&gt;&lt;p&gt;&lt;b&gt;objectviews&lt;/b&gt; is subscribed to by a &lt;b&gt;count.py&lt;/b&gt; worker which does the heavy crunching as shown below.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Debot.py&lt;/b&gt;&lt;br /&gt;&lt;/p&gt; &lt;p&gt;The first worker is &#39;debot.py&#39; - this does the broad separation and checking of a logged event. In essence, it uses the Redis SISMEMBER command to see if the IP address is in the blacklists and if not, applies a few regexes to see if it is a record view and/or a download or something else.&lt;/p&gt; &lt;p&gt;&lt;b&gt;Broad Logging&lt;/b&gt;&lt;/p&gt; &lt;p&gt;There are three logger workers that debot.py feeds for &quot;bothits&quot;, &quot;objectviews&quot;, and &quot;other&quot; - these workers just sit and listen on the relevant queue for an apache log line and append it to the logfile each has open. 
Saves me having to open/close logger objects or pass anything around.&lt;/p&gt; &lt;p&gt;The logfiles are purely as a record of the processing and so I can skip redoing it if I want to do any further analysis, like tracking individuals, etc.&lt;/p&gt;&lt;p&gt;The loggers also INCR a key in Redis for each line they see - u:objectviews, u:bothits, and u:other as appropriate - these give me a rough idea of how the processing is going.&lt;/p&gt;&lt;p&gt;(And you can generate pretty charts from it too:)&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://chart.apis.google.com/chart?cht=p3&amp;amp;chds=0,9760660&amp;amp;chd=t:368744,9760660,1669552&amp;amp;chs=600x200&amp;amp;chl=Views%7CBots%7COther&quot;&gt;http://chart.apis.google.com/chart?cht=p3&amp;amp;chds=0,9760660&amp;amp;chd=t:368744,9760660,1669552&amp;amp;chs=600x200&amp;amp;chl=Views|Bots|Other&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;http://chart.apis.google.com/chart?cht=p3&amp;amp;chds=0,9760660&amp;amp;chd=t:368744,9760660,1669552&amp;amp;chs=600x200&amp;amp;chl=Views%7CBots%7COther&quot;&gt;&lt;img style=&quot;margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 600px; height: 200px;&quot; src=&quot;http://chart.apis.google.com/chart?cht=p3&amp;amp;chds=0,9760660&amp;amp;chd=t:368744,9760660,1669552&amp;amp;chs=600x200&amp;amp;chl=Views%7CBots%7COther&quot; alt=&quot;&quot; border=&quot;0&quot; /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;(data sourced at a point during the processing - 10million bot hits vs 360k object views/dls)&lt;br /&gt;&lt;/p&gt; &lt;p&gt;&lt;b&gt;Counting hits (metadata and time based)&lt;br /&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;Most of the heavy lifting is in count.py - this is fed from the object views/downloads stream coming from the debot.py worker. 
It does a number of procedural steps for the metadata:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;Get metadata from ORA&#39;s Solr endpoint (as JSON)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Specifically, get the &#39;authors&#39; (names), subjects/keyphrases, institutions, content types, and collections that items appear in.&lt;/li&gt;&lt;li&gt;These fields correspond to certain keys in Redis. Eg names = &#39;number:names&#39; = number of unique names, &#39;n:...&#39; = hits to a given name, etc&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;For each view/dl, if an entity is new:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;INCR &#39;ids:XXXXX&#39; where XXXXX is &#39;names&#39;, &#39;subjects&#39;, etc. It&#39;ll return the new value for this, eg 142&lt;/li&gt;&lt;li&gt;SET X:142 to be equal to the text for this new entity, where X is the prefix for the field.&lt;/li&gt;&lt;li&gt;SADD this id (eg X:142) to the relevant set for it, like &#39;names&#39;, &#39;subjects&#39;, etc - this is so we can have an accurate idea of the entities in use even after removing/merging them.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Reverse lookup:&lt;/i&gt; Hash the text for the entity (eg md5(&quot;John F. Smith&quot;)) and SET r:X:{hash} to be equal to &quot;X:142&quot;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;SET X:views:142 to be equal to 1 to get the ball rolling (or X:dl:142 for downloads)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;If the entity is not new:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Hash the text and look up r:X:{hash} to get the id (eg n:132)&lt;/li&gt;&lt;li&gt;INCR the item&#39;s counter (eg INCR n:views:132)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;i&gt;Time-based and other counts:&lt;/i&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;INCR t:{object id} (total hits on that repository object since logs began)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;INCR t:MMYY (total &#39;proper&#39; hits for that month)&lt;/li&gt;&lt;li&gt;INCR t:MMYY:{object id} (total &#39;proper&#39; hits for that repo item that month)&lt;/li&gt;&lt;li&gt;INCR t:MMYY:{entity id} (total hits for an entity, say &#39;n:132&#39;, that month)&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt; &lt;p&gt;A lot of pressure is put on Redis by count.py but it seems to be coping fine. A note for anyone else thinking about this: Redis keeps its datastore in RAM - running out of RAM is a Bad Thing(tm).&lt;/p&gt;&lt;p&gt;I know that I could also just use the md5 hashes as ids, rather than using a second id - I&#39;m still developing this section and this outline just describes it as it stands now!&lt;br /&gt;&lt;/p&gt; &lt;p&gt;Also, it&#39;s worth noting that if I needed to, I can put remote redis &#39;shards&#39; on other machines and they can just pull log lines from the main objectview queue to process. 
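The per-entity bookkeeping above can be sketched as one function (the key names follow this post; the function shape and the name `record_hit` are my own, and a real redis-py client would hand back bytes rather than strings):

```python
# Hypothetical sketch of count.py's entity counting. 'prefix' is the field
# prefix (eg 'n' for names), 'field' the set/counter name (eg 'names').
import hashlib

def record_hit(r, prefix, text, field, kind='views'):
    """Count one view/download against an entity such as a name or subject."""
    digest = hashlib.md5(text.encode('utf-8')).hexdigest()
    entity_id = r.get('r:%s:%s' % (prefix, digest))      # reverse lookup by text hash
    if entity_id is None:
        n = r.incr('ids:%s' % field)                     # allocate the next id, eg 142
        entity_id = '%s:%s' % (prefix, n)
        r.set(entity_id, text)                           # id -> entity text
        r.sadd(field, entity_id)                         # register id in the field's set
        r.set('r:%s:%s' % (prefix, digest), entity_id)   # text hash -> id
    pfx, num = entity_id.split(':')
    r.incr('%s:%s:%s' % (pfx, kind, num))                # eg INCR n:views:132
    # the t:MMYY / t:{object id} time-based keys would be INCRed alongside this
```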
(It&#39;ll still need to create the id &lt;-&gt; entity name mapping on the main store, or a slave of it, though.)&lt;/p&gt; &lt;p&gt;&lt;b&gt;But why did I do this?&lt;/b&gt;&lt;/p&gt;&lt;p&gt;I thought it would let me handle both legacy logs and live data, give me a framework I could put against other systems, and mean writing less code for a more reliable system.&lt;/p&gt;&lt;p&gt;So far, I still think this is the case. If people are interested, I&#39;ll abstract out a class or two (eg the metadata lookup function, etc) and stick it on google code. It&#39;s not really a lot of code so far - I think even this outline post is longer....&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/6053583822562222201/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/6053583822562222201' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6053583822562222201'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6053583822562222201'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2010/01/usage-stats-and-redis.html' title='Usage stats and Redis'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-835548650288731693</id><published>2009-10-15T02:30:00.001-07:00</published><updated>2009-10-15T03:15:56.433-07:00</updated><title type='text'>Python in a Pairtree</title><content 
type='html'>(Thanks to @anarchivist for the title - I&#39;ll let him take all the &#39;credit&#39;)&lt;br /&gt;&lt;br /&gt;&quot;&lt;a href=&quot;http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html&quot;&gt;Pairtree&lt;/a&gt;? huh, what&#39;s that?&quot; - in a nutshell it&#39;s &#39;just enough veneer on top of a conventional filesystem&#39; for it to be able to store objects sensibly; a way of storing objects by id on a normal hierarchical filesystem in a pragmatic fashion.  You could just have one directory that holds all the objects, but this would unbalance the filesystem and, due to how most are implemented, result in a less-than-efficient store. Filesystems just don&#39;t deal well with thousands or hundreds of thousands of directories at the same level.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html&quot;&gt;Pairtree&lt;/a&gt; provides enough convention and fanning out of hierarchical directories to spread the load of storing high numbers of objects, while retaining the ability to treat each object distinctly.&lt;br /&gt;&lt;br /&gt;The &lt;a href=&quot;http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html&quot;&gt;Pairtree specification&lt;/a&gt; is a compromise between fanning out too much and too little, and assumes that the ids used are opaque; that the ids have no meaning and are to all intents and purposes &#39;random&#39;. 
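The two-character fan-out itself is simple to sketch in plain Python (my own illustration of the naming scheme, not code from the spec or the library; real implementations also escape awkward characters in ids before splitting):

```python
# Illustration only: map an (already prefix-stripped, opaque) id to its
# pairtree directory path by splitting it into successive character pairs.
def pairtree_path(object_id, root='pairtree_root'):
    parts = [object_id[i:i + 2] for i in range(0, len(object_id), 2)]
    return '/'.join([root] + parts)
```

So the id `aacd` (ie `http://n2t.info/ark:/13030/xt2aacd` with the prefix stripped) lands under `pairtree_root/aa/cd/`, matching the tree below.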
If your ids are not opaque (if, for example, they are human-readable words) then you may have to tweak how the ids are split into directories to ensure good performance.&lt;br /&gt;&lt;br /&gt;[I&#39;ll copy&amp;amp;paste some examples from the &lt;a href=&quot;http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html&quot;&gt;specifications&lt;/a&gt; to illustrate what it does]&lt;br /&gt;&lt;br /&gt;For example, to store objects that have identifiers like the following URI - http://n2t.info/ark:/13030/xt2{some string}&lt;br /&gt;&lt;br /&gt;eg:&lt;br /&gt;&lt;br /&gt;http://n2t.info/ark:/13030/xt2aacd&lt;br /&gt;http://n2t.info/ark:/13030/xt2aaab&lt;br /&gt;http://n2t.info/ark:/13030/xt2aaac&lt;br /&gt;&lt;br /&gt;This works out to look like this on the filesystem:&lt;br /&gt;&lt;pre&gt;current_directory/&lt;br /&gt;|   pairtree_version0_1        [which version of pairtree]&lt;br /&gt;|    ( This directory conforms to Pairtree Version 0.1. Updated spec: )&lt;br /&gt;|    ( http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html  )&lt;br /&gt;|&lt;br /&gt;|   pairtree_prefix&lt;br /&gt;|    ( http://n2t.info/ark:/13030/xt2                                 )&lt;br /&gt;|&lt;br /&gt;\--- pairtree_root/&lt;br /&gt;|--- aa/&lt;br /&gt;|    |--- cd/&lt;br /&gt;|    |    |--- foo/&lt;br /&gt;|    |    |    |   README.txt&lt;br /&gt;|    |    |    |   thumbnail.gif&lt;br /&gt;|    |    ...&lt;br /&gt;|    |--- ab/ ...&lt;br /&gt;|    |--- af/ ...&lt;br /&gt;|    |--- ag/ ...&lt;br /&gt;|    ...&lt;br /&gt;|--- ab/ ...&lt;br /&gt;...&lt;br /&gt;\--- zz/ ...&lt;br /&gt;     | ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;With the object http://n2t.info/ark:/13030/xt2aacd containing a directory &#39;foo&#39;, which itself contains a README and a thumbnail gif.&lt;br /&gt;&lt;br /&gt;Creating this structure by hand is tedious, and luckily for you, you don&#39;t have to (if you use python that is)&lt;br /&gt;&lt;br /&gt;To get the pairtree library that I&#39;ve written, 
you can either install it from the Pypi site &lt;a href=&quot;http://pypi.python.org/pypi/Pairtree&quot;&gt;http://pypi.python.org/pypi/Pairtree&lt;/a&gt; or if python-setuptools/easy_install is on your system, you can just &lt;code&gt;sudo easy_install pairtree&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;You can find API documentation and a quick start &lt;a href=&quot;http://packages.python.org/Pairtree/&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The quick start should get you up and running in no time at all, but let&#39;s look at how we might store Fedora-like objects on disk using pairtree. (I don&#39;t mean how to replicate how Fedora stores objects on disk, I mean how to make an object store that gives us the basic framework of &#39;objects are bags of stuff&#39;)&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&gt;&gt;&gt; from pairtree import *&lt;br /&gt;&gt;&gt;&gt; f = PairtreeStorageFactory()&lt;br /&gt;&gt;&gt;&gt; fedora = f.get_store(store_dir=&quot;objects&quot;, uri_base=&quot;info:fedora/&quot;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Right, that&#39;s the basic framework done, let&#39;s add some content:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&gt;&gt;&gt; obj = fedora.create_object(&#39;changeme:1&#39;)&lt;br /&gt;&gt;&gt;&gt; with open(&#39;somefileofdublincore.xml&#39;, &#39;r&#39;) as dc:&lt;br /&gt;...     obj.add_bytestream(&#39;DC&#39;, dc)&lt;br /&gt;&gt;&gt;&gt; with open(&#39;somearticle.pdf&#39;, &#39;rb&#39;) as pdf:&lt;br /&gt;...     
obj.add_bytestream(&#39;PDF&#39;, pdf)&lt;br /&gt;&gt;&gt;&gt; obj.add_bytestream(&#39;RELS-EXT&#39;, &quot;&quot;&quot;&amp;lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot; &lt;br /&gt;                                               xmlns:rel=&quot;info:fedora/fedora-system:def/relations-external#&quot;&gt;&lt;br /&gt;  &amp;lt;rdf:Description rdf:about=&quot;info:fedora/changeme:1&quot;&gt;&lt;br /&gt;    &amp;lt;rel:isMemberOf rdf:resource=&quot;info:fedora/type:article&quot;/&gt;&lt;br /&gt;  &amp;lt;/rdf:Description&gt;&lt;br /&gt;&amp;lt;/rdf:RDF&gt;&quot;&quot;&quot;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;add_bytestream&lt;/code&gt; method is adaptive - if you pass it something that supports a &lt;code&gt;read()&lt;/code&gt; method, it will attempt to stream out the content in chunks to avoid reading the whole item into memory at once. If not, it will just write the content out as is.&lt;br /&gt;&lt;br /&gt;I hope this gives people some idea of what is possible with a conventional filesystem. After all, filesystem code is pretty well tested in the majority of cases, so why not make use of it?&lt;br /&gt;&lt;br /&gt;(NB the python &lt;code&gt;with&lt;/code&gt; statement is a nice way of dealing with file-like objects, made part of the core in python ~2.6 I think. 
It tries to make sure that the file is closed at the end of the block, equivalent to a &quot;temp = open(foo) - do stuff - temp.close()&quot;)</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/835548650288731693/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/835548650288731693' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/835548650288731693'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/835548650288731693'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/10/python-in-pairtree.html' title='Python in a Pairtree'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-453105319749927455</id><published>2009-06-19T09:28:00.000-07:00</published><updated>2009-06-19T09:38:40.528-07:00</updated><title type='text'>What is a book if you can print one in 5 minutes?</title><content type='html'>&lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;There exists technology now, available in bookshops and certain forward-thinking libraries, to print a book in 5 minutes from pressing Go to getting the book into your hands.&lt;br /&gt;&lt;br /&gt;This excites me a lot. Yes, that does imply I am a geek, but whatever.&lt;br /&gt;&lt;br /&gt;So, what would I want to do with one? Well, printing books that already exist is fun but doesn&#39;t grasp the full potential. 
If you can print a book in 5 minutes, for how long must that book have been in existence before you press print? Why can&#39;t we start talking about repurposing correctly licenced or public domain content?&lt;br /&gt;&lt;br /&gt;Well, what I need (and am keen to get going with) is the following:&lt;br /&gt;&lt;br /&gt;1) PDF generator -&amp;gt; pass it an RSS feed of items and it will do its best to generate page content from these.&lt;br /&gt;  - blogs/etc: grab the RSS/Atom feed and parse out the useful content&lt;br /&gt;       - Include option to use blog comments or to gather comments/backlinks/tweets from the internet&lt;br /&gt;  - PDFs - simply concatenate the PDF as is into the final PDF&lt;br /&gt;  - Books/other digital items with ORE -&amp;gt; interleave these&lt;br /&gt;     - offer similar comment/backlink option as above&lt;br /&gt;     - ie the book can be added &#39;normally&#39; with the internet-derived comments on the facing page to the book/excerpt they actually refer to, or the discussion can be mirrored with the comments in order and threaded, with the excerpts from the pages being attached to these. 
Or why not both?&lt;br /&gt;&lt;br /&gt;   - Automated indexes of URLs, dates and commenters can be generated without too much trouble on demand.&lt;br /&gt;   - Full-text indexes will be more demanding to generate, but I am sure that with a little money a crowd-sourced solution can be found.&lt;br /&gt;&lt;br /&gt;2) Ability to (onsite) print these PDFs into a single, (highly sexy) bound volume using a machine such as can be found in many Blackwell&#39;s bookshops today.&lt;br /&gt;&lt;br /&gt;3) A little capital to run competitions, targeting various levels in the university, asking the simple question &quot;If you could print anything you want as a bound book in 5 minutes, what&#39;s the most interesting thing you can think of to print out?&quot;&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt; Why?&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;People like books. They work, they don&#39;t need batteries, and people who can read can intuitively &lt;i&gt;&#39;work&#39; &lt;/i&gt;a book. But books are not very dynamic. You have to have editors, drafters, publishers, and so on and so forth, and the germination of a book has to be measured in years... right?&lt;br /&gt;&lt;br /&gt;Print on demand smashes that and breaks down conceptions of what a book is. Is it a sacred tome that needs to be safeguarded and lent only to the most worthy? Or is it a snapshot of an ongoing teaching/research process? Or can it simply be a way to print out a notebook with page numbers as you would like them? Can a book be a living, young collation of works, useful now, but maybe not as critical in a few years?&lt;br /&gt;&lt;br /&gt;Giving people the ability to make and generate their own books offers more potential - what books are they creating? Which generated books garner the most reuse, comments and excitement? Would the comments about the generated works be worth studying and printing in due course? Will people break through the pen-barrier, that taboo of taking pen to a page? 
Or will we just see people printing wikitravel guides and their flickr account?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Use-cases to give a taste of the possibilities:&lt;/b&gt;&lt;br /&gt;  - Print and share a discussion about an author, with excerpts ordered and surrounded by the chronologically ordered and threaded comments made by a research group, a teaching group or even just a book club.&lt;br /&gt;  - Library &#39;cafe&#39; - the library can subsidise the printing of existing works for use in the cafe, as long as the books stay in the cafe. Spillages and crumbs are not an issue for these facsimile books.&lt;br /&gt;  - Ability to record and store your term&#39;s or year&#39;s worth of notes in a single volume for posterity. At £5 a go, many students will want this.&lt;br /&gt;  - Test print a Thesis/Dissertation, without the expense of consulting a book binder.&lt;br /&gt;  - Archive in paper a snapshot of a digital labbook implemented on drupal or wordpress.&lt;br /&gt;  - Lecturer&#39;s notes from a given term, to avoid the looseleaf A4 overload spillage that often occurs.&lt;br /&gt;  - Printing of personalised or domain specific notebooks. (ie. 
a PDF with purposed fields, named columns and uniquely identified pages for recording data in the field - who says a printed book has to be full of info?)&lt;br /&gt;  - Maths sheets/tests/etc&lt;br /&gt;  - Past Papers&lt;br /&gt;&lt;br /&gt;I am humbled by the work done by Russell Davies, Ben Terrett and friends in this area, and I can trace the start of my thinking about these things to &lt;a href=&quot;http://bookcamp.pbworks.com/&quot;&gt;BookCamp&lt;/a&gt;, sponsored by Penguin UK and run by Jeremy Ettinghausen &lt;a href=&quot;http://thepenguinblog.typepad.com/&quot;&gt;(blog)&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Please, please see:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;a href=&quot;http://tinyurl.com/9qfoyt&quot;&gt;http://tinyurl.com/9qfoyt&lt;/a&gt;&lt;/b&gt; - Things Our Friends Have Written On The Internet 2008&lt;br /&gt;&lt;br /&gt;Russell Davies UnNotebook: &lt;a href=&quot;http://russelldavies.typepad.com/planning/2009/02/unnotebook.html&quot;&gt;http://russelldavies.typepad.com/planning/2009/02/unnotebook.html&lt;/a&gt;&lt;br /&gt;(&lt;b&gt;&lt;a href=&quot;http://tinyurl.com/cpdllw&quot;&gt;http://tinyurl.com/cpdllw&lt;/a&gt;&lt;/b&gt;)&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/453105319749927455/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/453105319749927455' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/453105319749927455'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/453105319749927455'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/06/what-is-book-if-you-can-print-one-in-5.html' 
title='What is a book if you can print one in 5 minutes?'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-7690589575666077031</id><published>2009-05-15T02:22:00.001-07:00</published><updated>2009-05-15T02:22:53.865-07:00</updated><title type='text'>RDF + UI + Fedora for object metadata (RDF) editing</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;Just a walkthrough of something I am trying to implement at the moment:&lt;br/&gt;&lt;br/&gt;&lt;big&gt;Requirements:&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;For the Web UI:&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Using jQuery and 3 plugins: jEditable, autocomplete and rdfquery.&lt;br/&gt;&lt;br/&gt;&lt;ul&gt;&lt;li&gt;jeditable: &lt;a href=&#39;http://www.appelsiini.net/projects/jeditable&#39;&gt;http://www.appelsiini.net/projects/jeditable&lt;/a&gt;&lt;/li&gt;&lt;li&gt;jeditable live demo: &lt;a href=&#39;http://www.appelsiini.net/projects/jeditable/default.html&#39;&gt;http://www.appelsiini.net/projects/jeditable/default.html&lt;/a&gt; &amp;lt;-- see this to understand what it gives.&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://jquery.bassistance.de/autocomplete/demo/&#39;&gt;http://jquery.bassistance.de/autocomplete/demo/&lt;/a&gt; &amp;lt;-- example autocomplete demo&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://code.google.com/p/rdfquery/&#39;&gt;http://code.google.com/p/rdfquery/&lt;/a&gt; from jeni tennison for reading RDFa information from the DOM of an HTML page using javascript.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Needed middleware controls from the Web App:&lt;/b&gt;&lt;br/&gt;&lt;ol&gt;&lt;li&gt;&lt;b&gt;create new session&lt;/b&gt; (specifically, a delta of the 
RDF expressed in iand&#39;s ChangeSet schema &lt;a href=&#39;http://vocab.org/changeset/schema&#39;&gt;http://vocab.org/changeset/schema&lt;/a&gt; ) &lt;b&gt;POST&lt;i&gt; /{object-id}/{RDF}/session/new&lt;/i&gt;&lt;/b&gt; -&amp;gt; HTTP 201 - session url (includes object id root)&lt;/li&gt;&lt;li&gt;&lt;b&gt;POST&lt;/b&gt; triples to &lt;b&gt;&lt;i&gt;/{session-url}/update&lt;/i&gt;&lt;/b&gt; to add to the &#39;add&#39; and/or &#39;delete&#39; portions&lt;/li&gt;&lt;li&gt;A &lt;b&gt;POST&lt;/b&gt; to /{session-url}/commit or just &lt;b&gt;DELETE&lt;/b&gt; /{session-url}&lt;br/&gt;&lt;/li&gt;&lt;/ol&gt;&lt;i&gt;And all objects typed by rdf:type (multitypes allowed)&lt;/i&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;Workflow:&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;ol&gt;&lt;li&gt;Template grabs RDF info from object, and then displays it in the typical manner (substituting labels for uris when relevant), but also encodes the values with RDFa. &lt;/li&gt;&lt;li&gt;If the user is auth&#39;d to edit, each of these values has a css class added so that the inline editing for jeditable can act on it.&lt;/li&gt;&lt;li&gt;It then reads, for the given type of object, the cardinality of the fields present (eg from an OWL markup for the class) and also the other predicates that can be applied to this object. For multivalued predicates, an &#39;add another&#39; type link is appended below. For unused predicates, it&#39;s up to the template to suggest these - currently, all the objects in the repo can have type-specific templates, but for this example, I am considering generics.&lt;/li&gt;&lt;li&gt;For predicates which have usefully typed ranges, ie foaf:knows in our system points to a URI, rather than a string - autocomplete is used to hook into our (or maybe another&#39;s) index of known labels for uris to suggest correct values. 
For example, if an author was going to indicate their affiliation to a department here at oxford (BRII project) it would be handy if a correct list of department labels was used. A choice from the list would display as the label, but represent the URI in the page.&lt;/li&gt;&lt;li&gt;When the user clicks on it to change the value, a session is created if none exists, stamped with the start time of the edit and the last-modified date of the RDF datastream, along with details of the editor, etc.&lt;/li&gt;&lt;li&gt;rdfquery is used to pull the triple from the RDFa in the edited field. When the user submits a change, the rdfa triple is posted to the session url as a &#39;delete&#39; triple and the new one is encoded as an &#39;add&#39; triple.&lt;/li&gt;&lt;li&gt;A simple addition would just post to the session with no &#39;delete&#39; parameter.&lt;/li&gt;&lt;li&gt;The UI should then reflect that the session is live and should be committed when the user is happy with the changes.&lt;/li&gt;&lt;/ol&gt;&lt;ul&gt;&lt;li&gt;On commit, the session would save the changeset to the object being edited, and update the RDF file in question (so we keep a record of the changes made). rdfquery would then update the RDFa in the page to the new values, upon a 200/204 reply.&lt;/li&gt;&lt;li&gt;On cancel, the values would be restored, and the session deleted.&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Commit Notes:&lt;/b&gt;&lt;br/&gt;If the last-modified date on the datastream is different from the one marked on the session (ie possible conflict), the page information is updated to the most recent and the session is reapplied in the browser, highlighting the conflicts, and a warning given to the user.&lt;br/&gt;&lt;br/&gt;I am thinking of increasing the feedback using a messaging system, while keeping the same optimistic edit model - you can see the status of an item, and that someone else has a session open on it. 
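From the client side, the session flow sketched above might look like this (the endpoint paths come from the outline; the payload field names and the helper are hypothetical, not an existing API):

```python
# Hypothetical sketch of the client side of the edit-session middleware.
# Endpoint paths are from the outline; 'add'/'delete' field names are invented.

def changeset_payload(delete_triple=None, add_triple=None):
    """Build the body for a POST to /{session-url}/update.

    A value edit sends both halves; a simple addition sends only 'add'."""
    payload = {}
    if delete_triple is not None:
        payload['delete'] = delete_triple
    if add_triple is not None:
        payload['add'] = add_triple
    return payload

# Sketch of the three calls with a requests-like client:
#   session_url = client.post('/%s/RDF/session/new' % object_id)   # -> HTTP 201
#   client.post(session_url + '/update',
#               data=changeset_payload(old_triple, new_triple))
#   client.post(session_url + '/commit')     # or client.delete(session_url) to cancel
```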
The degree of feedback is something I am still thinking about - should the UI highlight or even reflect the values that the other user(s) is editing in realtime? Is that useful?&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/7690589575666077031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/7690589575666077031' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7690589575666077031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7690589575666077031'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/05/rdf-ui-fedora-for-object-metadata-rdf.html' title='RDF + UI + Fedora for object metadata (RDF) editing'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-3083568477842956016</id><published>2009-03-30T02:43:00.000-07:00</published><updated>2009-03-30T02:50:45.849-07:00</updated><title type='text'>Early evaluation and serialisation of preservation policy decisions.</title><content type='html'>(Apologies - this had been sitting as a draft when I thought it was published. I have updated it to reflect changes that have been made since we started doing this.)&lt;br /&gt;&lt;br /&gt;It may be policy to make sure that the archive&#39;s materials are free of computer malware - one part of enacting this policy is running anti-virus and anti-spyware scans of the content. 
However, malware may be stored in the archive a number of months before it is widely recognised as such. So, the enactment of the &#39;no malware&#39; policy would mean that content is scanned on ingest, again 3 to 6 months after ingest, and one final time a year later.&lt;br /&gt;&lt;br /&gt;Given that it is possible to monitor when changes occur to the preservation archive, it is not necessary to run continual sweeps of the content held in the archive to assess whether a preservation action is needed or not. Most actions that result from a preservation policy choice can be pre-assigned to an item when it undergoes a change of state (creation, modification, deletion, extension).&lt;br /&gt;&lt;br /&gt;These decisions for actions (and also the auditable record of them occurring) are recorded inside the object itself and, in a bid to reuse a standard rather than reinvent one, this serialisation uses the iCal standard. iCal already has a proven capability to mark and schedule events, even handle recurring events, and to attach arbitrary information to these individual events.&lt;br /&gt;&lt;br /&gt;For the archive to self-describe and be preservable for the longer term, it is necessary for the actions taken to be archivable in some way too. A human-readable description of the action, alongside a best-effort attempt to describe it in machine-readable terms, should be archived and referenced by any event that is an instance of that action. (&#39;best-effort&#39; due to the underwhelming nature of the current semantics and schemas for describing these preservation processes)&lt;br /&gt;&lt;br /&gt;In the Oxford system, an iCal calendar implementation called Darwin Calendar Server was initially used to provide a queriable index of the preservation actions, along with a report of what events needed to be queued to be performed on a given day. These actions are queued in the short-term job queues (technically, being held in persistent AMQP queues) for later processing. 
However, the various iCal server implementations were neither lightweight nor stable enough to be easily reused, so from this point on, simple indexes are created as needed and retained from the serialised iCal to be used in its stead.&lt;br /&gt;&lt;br /&gt;Preservation actions such as scanning (viruses, file formats, etc) are not the only systems to benefit from monitoring the state of an item. Text- and data-mining and analysis, indexing for search indices, dissemination copy production, and so on are all actions that can be driven and kept in line with the content in this way. For example, it is likely that the indices will be altered or benefit from refreshing on a periodic basis, and the event of last-indexing can be included in this iCal file as a VJOURNAL event.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;NB&lt;/span&gt; no effort has been made to intertwine the iCal-serialised information with the packaging standard used, as this is expected both to take considerable time and effort, and to severely limit our ability to reuse or migrate content from this packaging standard to a later, newer format. 
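As a rough illustration of the serialisation idea (my own sketch, not the actual Oxford schema; the dates, UID and URL are invented), a scheduled re-scan could be recorded as a VEVENT like so:

```python
# Illustrative only: serialise a scheduled preservation action as an iCal
# VEVENT, with a pointer to the archived description of the action.
from datetime import date

def scan_event(uid, when, action_uri):
    lines = [
        'BEGIN:VEVENT',
        'UID:%s' % uid,
        'DTSTART;VALUE=DATE:%s' % when.strftime('%Y%m%d'),
        'SUMMARY:Anti-virus re-scan of object contents',
        'DESCRIPTION:Re-scan content 6 months after ingest',
        'URL:%s' % action_uri,               # machine-readable action description
        'END:VEVENT',
    ]
    return '\r\n'.join(lines)

calendar = '\r\n'.join([
    'BEGIN:VCALENDAR',
    'VERSION:2.0',
    'PRODID:-//example//preservation//EN',   # hypothetical PRODID
    scan_event('scan-0001@example.org', date(2009, 9, 30),
               'http://example.org/actions/av-scan'),
    'END:VCALENDAR',
])
```

A completed action would then be journalled alongside it (eg as a VJOURNAL entry) to form the auditable record.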
It is stored as a separate Fedora datastream within the same object it describes, and the internal RDF manifest registers it as an iCal file containing preservation event information.</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/3083568477842956016/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/3083568477842956016' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/3083568477842956016'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/3083568477842956016'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/03/early-evaluation-and-serialisation-of.html' title='Early evaluation and serialisation of preservation policy decisions.'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-462847752063249912</id><published>2009-03-19T07:46:00.000-07:00</published><updated>2009-04-03T11:05:15.285-07:00</updated><title type='text'>We need people!</title><content type='html'>(UPDATE - Grrr.... 
seems that the concept of persistent URLs is lost on the admin - link below has been removed - see google cached copy &lt;a href=&quot;http://209.85.229.132/search?q=cache:YsdhcWzKWksJ:www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml+http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml&amp;amp;cd=1&amp;amp;hl=en&amp;amp;ct=clnk&quot;&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml&quot;&gt;http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml&lt;/a&gt; - job description.&lt;br /&gt;&lt;br /&gt;Essentially, we need smart people who are willing to join us to do good, innovative stuff; work that isn&#39;t by-the-numbers with room for initiative and ideas.&lt;br /&gt;&lt;br /&gt;Help us turn our digital repository into a digital library, it&#39;ll be fun! Well, maybe not fun, but it will be very interesting at least!&lt;br /&gt;&lt;br /&gt;bulletpoints: python/ruby frameworks, REST, a little SemWeb, ajax, jQuery, AMQP, Atom, JSON, RDF+RDFa, Apache WSGI deployment, VMs, linux, NFS, storage, RAID, etc.</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/462847752063249912/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/462847752063249912' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/462847752063249912'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/462847752063249912'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/03/we-need-people.html' title='We need people!'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-7889113205346661214</id><published>2009-02-25T06:09:00.001-08:00</published><updated>2009-02-25T07:09:26.115-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="creativity"/><category scheme="http://www.blogger.com/atom/ns#" term="development"/><category scheme="http://www.blogger.com/atom/ns#" term="event"/><category scheme="http://www.blogger.com/atom/ns#" term="fun"/><category scheme="http://www.blogger.com/atom/ns#" term="idea"/><category scheme="http://www.blogger.com/atom/ns#" term="motivation"/><category scheme="http://www.blogger.com/atom/ns#" term="play"/><title type='text'>Developer Happiness days - why happyness is important</title><content type='html'>&lt;big&gt;&lt;big&gt;&lt;b&gt;Creativity and innovation&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;One of the defining qualities of a good innovative developer is creativity and a pragmatic attitude; someone with the &#39;&lt;i&gt;rough consensus, running code&lt;/i&gt;&#39; mentality that pervades good software innovation. This can be seen as the drive to experiment, to turn inspiration and ideas into real, running code or to pathfind by trying out different things. Innovation can often happen when talking about quite separate, seemingly unrelated things, even to the point that most of the time, the &#39;outcomes&#39; of an interaction are impossible to pin down.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Play, vagueness and communication&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Creativity, inspiration, innovation, ideas, fun, and curiosity&lt;/b&gt; are all useful and important when developing software. These words convey concepts that do not thrive in situations that are purely scheduled, didactic, and teacher-pupil focussed.
There needs to be an amount of &#39;&lt;b&gt;&lt;i&gt;play&lt;/i&gt;&lt;/b&gt;&#39; in the system (&lt;a href=&quot;http://en.wikipedia.org/wiki/Play_%28engineering%29&quot;&gt;see &#39;Play&#39;.&lt;/a&gt;) While this &#39;&lt;i&gt;&lt;b&gt;play&lt;/b&gt;&lt;/i&gt;&#39; is bad in a tightly regimented system, it is an essential part of a creative system, to allow for new things to develop, new ideas to happen and for &#39;random&#39; interactions to take place.&lt;br /&gt;&lt;br /&gt;Alongside this notion of &lt;i&gt;&lt;b&gt;play&lt;/b&gt;&lt;/i&gt; in an event, there also needs to be an amount of blank space, a &lt;i&gt;&lt;b&gt;vagueness&lt;/b&gt;&lt;/i&gt; to the event. I think that we can agree that much of the usefulness of normal conferences comes from the &#39;coffee breaks&#39; and &#39;lunch breaks&#39;, which are blank spaces of a sort. What is important is to recognise this and to factor it in deliberately.&lt;br /&gt;&lt;br /&gt;Note that if a single developer could guess at how things should best be developed in the academic space, they would have done so by now. &lt;i&gt;Pre-compartmentalisation of ideas into &#39;tracks&#39; can kill potential innovation stone-dead.&lt;/i&gt; The distinction between CMSs, repositories and VLE developers is purely semantic and it is detrimental for people involved in one space to not overhear the developments, needs, ideas and issues in another. It is especially counter-productive to further segregate by community, such as having simultaneous Fedora, DSpace and EPrints strands at an event.&lt;br /&gt;&lt;br /&gt;While the inherent and intended &lt;i&gt;&lt;b&gt;vagueness&lt;/b&gt;&lt;/i&gt; provides the potential for cross-fertilisation of ideas, and the room for &lt;i&gt;&lt;b&gt;play&lt;/b&gt;&lt;/i&gt; provides the space, the final ingredient is that of &lt;i&gt;&lt;b&gt;speech, or any communication that takes place with the same ease and at the same speed of speech&lt;/b&gt;&lt;/i&gt;.
While some may find the 140 character limit on twitter or identi.ca a strange constraint, this provides a target for people to really think about what they wish to convey and keeps the dialogue from becoming a series of monologues - much like the majority of emails on mailing lists - and keeps it as a dialogue between people.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Communication and Developers&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;One of the ironies of communication being so necessary to development is that developers can be shy, initially preferring the false anonymity of textual communication to spoken words between real people. There is a need to provide means for people to break the ice, and to strike up conversations with people that they can recognise as being of like minds. Asking that people&#39;s public online avatars be changed to pictures of themselves can help people at an event find those that they have been talking to online and to start talking, face to face.&lt;br /&gt;&lt;br /&gt;On a personal note, one of the most difficult things I have to do when meeting people out in real life is answer the question &#39;What do you do?&#39; - it is much easier when I already know that the person asking the question has a technical background.&lt;br /&gt;&lt;br /&gt;And again, going back to the concept of compartmentalisation - &lt;i&gt;developers who only deal with developers and their managers/peers will build systems that work best for their peers and their managers.&lt;/i&gt; If these people are not the only users then they need to widen their communications. It is important for the developers that do not use their own systems to engage with the people who actually do. They should do this directly, without the potential for garbled dialogue via layers of protocol. This part needs managing in whatever space, both to avoid dominance by loud, disgruntled users and to mitigate anti-social behaviour.
By and large, I am optimistic about this process; people tend to want to be thanked, and this simple &lt;i&gt;feedback loop&lt;/i&gt; can be used to help motivate. Making this feedback more disproportionate (a small &#39;thank you&#39; can lead to great effects) and adding in the notion of &lt;i&gt;highscore&lt;/i&gt; can lead to all sorts of interaction and outcomes, most notably being the rapid reinforcement of any behaviour that led to a positive outcome.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Disproportionate feedback loops and Highscores drive human behaviour&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;I&#39;ll just digress quickly to cover what I mean by a &lt;b&gt;&lt;i&gt;disproportionate feedback loop&lt;/i&gt;&lt;/b&gt;: A disproportionate feedback loop is something that encourages a certain behaviour; the input is something small and inexpensive in either time or effort, but the output can be large and very rewarding. This pattern can be seen in very many interactions: playing the lottery, [good] video game controls, twitter and facebook, musical instruments, the &#39;who wants to be a millionaire&#39; format, mashups, posting to a blog (&#39;free&#39; comments, auto rss updating, a google-able webpage for each post) etc.&lt;br /&gt;&lt;br /&gt;The &lt;i&gt;&lt;b&gt;natural drive for highscores&lt;/b&gt;&lt;/i&gt; is also worth pointing out. At first glance, is it as simple as considering its use in videogames? How about the concept of getting your &#39;5 fruit and veg a day&#39;? &lt;a href=&quot;http://www.5aday.nhs.uk/topTips/default.html&quot;&gt;http://www.5aday.nhs.uk/topTips/default.html&lt;/a&gt; Running in a marathon against other people? Inbox Zero (&lt;a href=&quot;http://www.slideshare.net/merlinmann/inbox-zero-actionbased-email&quot;&gt;http://www.slideshare.net/merlinmann/inbox-zero-actionbased-email&lt;/a&gt;), learning to play different musical scores? Your work being rated highly online?
An innovation of yours being commented on by 5 different people in quick succession? Highscores can be very good drivers for human behaviour, addictive to some personalities.&lt;br /&gt;&lt;br /&gt;Why not set up some software highscores? For example, in the world of repositories, how about &#39;Fastest UI for self-submission&#39; - encouraging automatic metadata/datamining, a monthly prize for &#39;Most issue tickets handled&#39; - to the satisfaction of those posting the tickets, and so on.&lt;br /&gt;&lt;br /&gt;It is very easy to over-metricise this - some will purposefully abstain from this and some metrics are truly misleading. In the 90s, there was a push to have lines of code added as a metric for productivity. The false assumption is that lines of code have much to do with productivity - code should be lean, but not so lean that it becomes hard to maintain.&lt;br /&gt;&lt;br /&gt;So be very careful when adding means to record highscores - they should be flexible, and be fun - if they are no fun for the developers and/or the users, they become a pointless metric, more of an obstacle than a motivation.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;The Dev8D event&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;People were free to roam and interact at the Dev8D event and there was no enforced schedule, but twitter and a loudhailer were used to make people aware of things that were going on. Talks and discussions were lined up prior to the event of course, but the event was organised on a wiki which all were free to edit. As experience has told us, the important and sometimes inspired ideas occur in relaxed and informal surroundings where people just talk and share information, such as in a typical social situation like having food and drink.&lt;br /&gt;&lt;br /&gt;As a specific example, look &lt;a href=&quot;http://oxfordrepo.blogspot.com/2009/02/tracking-conferences-at-dev8d-with.html&quot;&gt;at the role of twitter at the event&lt;/a&gt;.
Sam Easterby-Smith (&lt;a href=&quot;http://twitter.com/samscam&quot;&gt;http://twitter.com/samscam&lt;/a&gt;) created a means to track &#39;developer happiness&#39; and shared the tracking &#39;&lt;a href=&quot;http://samscam.co.uk/happier/&quot;&gt;Happyness-o-meter&lt;/a&gt;&#39; site with us all. This unplanned development inspired me to relay the information back to twitter and later led to me running an operating system/hardware survey in a similar fashion.&lt;br /&gt;&lt;br /&gt;To help break the ice and to encourage play, we instituted a number of ideas:&lt;br /&gt;&lt;br /&gt;A &lt;b&gt;wordcloud on each attendee&#39;s badge&lt;/b&gt;, consisting of whatever we could find of their work online, be it their blog or similar so that it might provide a talking point, or allow people to spot people who write about things they might be interested in learning more about.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The poker chip game&lt;/b&gt; - each attendee was given 5 poker chips at the start of the event, and attendees were encouraged to trade chips for help, advice or as a way to convey a thank you. The goal was that the top 4 people ranked by number of chips at the end of the third day would receive a Dell mini 9 computer. The balance to this was that each chip was also worth a drink at the bar on that day.&lt;br /&gt;&lt;br /&gt;We were well aware that we&#39;d left a lot of play in this particular system, allowing for lotteries to be set up, people pooling their chips, and so on. As the sole purpose of this was to encourage people to interact, to talk and bargain with each other, and to provide that feedback loop I mentioned earlier, it wasn&#39;t too important how people got the chips as long as it wasn&#39;t underhanded. It was the interaction and the &#39;fun&#39; that we were after.
Just as an aside, Dave Flanders deserves the credit for this particular scheme.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Developer Decathlon&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;The basic concept of the &lt;a href=&quot;http://code.google.com/p/developerhappinessdays/wiki/DeveloperDecathlon&quot;&gt;Developer Decathlon&lt;/a&gt; was also reusing these ideas of play and feedback: &quot;&lt;a name=&quot;What_is_the_Developer_Decathlon?&quot;&gt;The Developer Decathlon is a competition at dev8D that enables developers to come together face-to-face to do rapid prototyping of software ideas. [..] &lt;/a&gt; We help facilitate this at dev8D by providing both &#39;real users&#39; and &#39;expert advice&#39; on how to run these rapid prototyping sprints. [..] The &#39;Decathlon&#39; part of the competition represents the &#39;10 users&#39; who will be available on the day to present the biggest issues they have with the apps they use and in turn to help answer developer questions as the prototypes applications are being created.  The developers will have two days to work with the users in creating their prototype applications.&quot;&lt;br /&gt;&lt;br /&gt;The best two submissions will get cash prizes that go to the individual, not to the company or institution that they are affiliated with. The outcomes will be made public shortly, once the judging panel have done their work.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;Summary&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;To foster innovation and to allow for creativity in software development:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Having &lt;b&gt;play&lt;/b&gt; space is &lt;b&gt;important&lt;/b&gt;&lt;/li&gt;&lt;li&gt;Being &lt;b&gt;vague&lt;/b&gt; with aims and &lt;b&gt;flexible&lt;/b&gt; with outcomes is not a bad thing and is &lt;b&gt;vital&lt;/b&gt; for unexpected things to develop - &lt;i&gt;e.g.
A project&#39;s outcomes should be under continual re-negotiation as a general rule, not as the exception.&lt;/i&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Encouraging&lt;/b&gt; and &lt;b&gt;enabling&lt;/b&gt; free and easy communication is &lt;b&gt;crucial&lt;/b&gt;.&lt;/li&gt;&lt;li&gt;Be aware of what drives people to do what they do. Push all feedback to be &lt;b&gt;as disproportionate as possible&lt;/b&gt;, allowing both developers and users to benefit from putting in only a relatively trivial amount of input (this pattern affects web UIs, development cycles, team interaction, etc)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Choose useful highscores&lt;/b&gt; and be prepared to ditch them or change them if they are no longer &lt;b&gt;fun and motivational&lt;/b&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/7889113205346661214/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/7889113205346661214' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7889113205346661214'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7889113205346661214'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/02/developer-happiness-days-why-happyness.html' title='Developer Happiness days - why happyness is important'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16'
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-7013951292415915268</id><published>2009-02-22T18:28:00.001-08:00</published><updated>2009-02-22T18:28:02.027-08:00</updated><title type='text'>Handling Tabular data</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;&lt;b&gt;&quot;Storage&quot;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;I put the s-word in quotes because the storing of the item is actually a very straightforward process - we have been dealing with storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.&lt;br/&gt;&lt;br/&gt;One realisation to make with regard to preserving access to data coming from research is that there is a huge backlog of data in formats that we shall kindly call &#39;legacy&#39;. Not only is there this issue, but data is being made with tools and systems that effectively &#39;trap&#39; or lock-in a lot of this information - case in point being any research being recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is normally difficult or impossible to extract the implicit information; how tables interlink, how the Access Form adds information to the dataset, etc.&lt;br/&gt;&lt;br/&gt;It is this implicit knowledge that is the elephant in the room.
Very many serialisations, such as SQL &#39;dumps&#39;, csv, xls and so on, rely on implicit knowledge that is either related to the particulars of the application used to open it, or is actually highly domain specific.&lt;br/&gt;&lt;br/&gt;So, it is trivial and easy to specify a model for storing data, but without also encoding the implied information and without making allowances for the myriad of sources, the model is useless; it would be akin to defining the colour of storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded in as good a way as possible.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Characterisation&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team and help and guide them so that as much as possible of the unsaid knowledge is known by all parties.)&lt;br/&gt;&lt;br/&gt;Some serialisations do a good job of this themselves; *SQL-based serialisations include basic data type information inside the table declarations. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we&#39;ll consider the implicit information that also needs to be recorded alongside such a declaration.&lt;br/&gt;&lt;br/&gt;Some others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out or even deduce the type of data held in each column.
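To make that column-type deduction step concrete, here is a minimal Python sketch; the function names and the mapping onto SQL-ish type names are illustrative assumptions, not something taken from the post:

```python
import csv
import io

def deduce_type(values):
    """Guess the narrowest SQL-ish type that fits every value in a column."""
    for caster, sql_type in ((int, "INTEGER"), (float, "DOUBLE")):
        try:
            for v in values:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"

def characterise_csv(text):
    """Return (column name, deduced type) pairs for a CSV with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    return [(name, deduce_type(col)) for name, col in zip(header, columns)]

sample = "id,mass,label\n1,12.5,fly_a\n2,13.1,fly_b\n"
print(characterise_csv(sample))
# prints [('id', 'INTEGER'), ('mass', 'DOUBLE'), ('label', 'TEXT')]
```

The resulting (name, type) pairs could then be serialised as a SQL-style table declaration and stored alongside the unmodified dataset, as described above.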
Again, this data can be serialised in a SQL table declaration held alongside the original &lt;i&gt;unmodified&lt;/i&gt; dataset.&lt;br/&gt;&lt;br/&gt;(It is assumed that a basic data review will be carried out; does the csv have a consistent number of columns per row, is the version and operating system known for the MySQL that held the data, is there a PI or responsible party for the data, etc.)&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Implicit information&lt;br/&gt;&lt;/b&gt;&lt;br/&gt;Good teachers are right to point out this simple truth: &quot;don&#39;t forget to write down the obvious!&quot; It may seem obvious that all your data is latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, but the painful truth is that we can&#39;t guarantee that these aspects won&#39;t affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing [, at the time, detected] data corruption, or MySQL truncating fields when serialised in a way that is not anticipated or discovered until later.&lt;br/&gt;&lt;br/&gt;In characterising the legacy sets of data, it is important to realise that there will be loss, especially with the formats and applications that blend presentation with storage. For example, it will require a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset, a highly researched dictionary of Old English words and phrases, the final output of which is a Macromedia Authorware application, and the source files are held by an unknown party (that is if they still exist at all) - the Joy of hiring Contractors.
In fact, this warrants a slight digression:&lt;br/&gt;&lt;br/&gt;&lt;b&gt;The gap in IT support for research&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;If an academic researcher wishes to gain an external email account at their institution, there is an established protocol for this. Email is so commonplace that it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration, adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email (text or text with MIME attachments - NB the IM in MIME stands for Internet Mail).&lt;br/&gt;&lt;br/&gt;If a researcher has a need to store tables of data, where do they turn? They should turn to the same department, who will handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What the IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it, to configure it as they want. No wonder the resulting legacy systems are as free-form as they are.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Practical measures - Curation&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Back to specifics now, consider that a set of data has been found to be important, research has been based on it, and it&#39;s been recognised that this dataset needs to be looked after. [This will illustrate the technical measures.
Licensing, dialogue with the data owners, and other non-technical analysis and administration is left out, but assumed.]&lt;br/&gt;&lt;br/&gt;The first task is to store the incoming data, byte-for-byte, as much as is possible - storing the iso image of the media the data is stored on, storing the SQL dump of a database, etc.&lt;br/&gt;&lt;br/&gt;Analyse the tables of data - record the base types of each column (text, binary, float, decimal, etc) aping the syntax of a SQL table declaration, as well as trying to identify the key columns.&lt;br/&gt;&lt;br/&gt;Record the inter-table joins between primary and secondary keys, possibly by using a &lt;i&gt;&quot;table.column SAMEAS table.column;&quot;&lt;/i&gt; declaration after the table declarations.&lt;br/&gt;&lt;br/&gt;Likewise, attempt to add information concerning each column, information such as units or any other identifying material.&lt;br/&gt;&lt;br/&gt;Store this table description alongside the recorded tabular data source.&lt;br/&gt;&lt;br/&gt;Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are &#39;frozen&#39;, cells that are the results of embedded formulae should be calculated and added as fixed values. It is important to record the environment, library and platform that these calculations are made with.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Table description as RDF &lt;/b&gt;(strictly, referencing cols/rows via the URI)&lt;br/&gt;&lt;br/&gt;One syntax I am playing around with is the notion that by appending sensible suffixes to the base URI for a dataset, we can uniquely specify a row, a column, a region or even a single cell.
Simply put:&lt;br/&gt;&lt;br/&gt;http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column&lt;br/&gt;http://datasethost/datasets/{data-id}#table/{table-name}/row/{row-id} to reference a whole row, etc&lt;br/&gt;&lt;br/&gt;[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]&lt;br/&gt;&lt;br/&gt;The purpose for this is so that the tabular description can be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata without disrupting or altering how this is handled.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;A possible content type for tabular data&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:&lt;br/&gt;&lt;br/&gt;As a Fedora Object:&lt;br/&gt;&lt;br/&gt;&lt;ul&gt;&lt;li&gt;Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)&lt;/li&gt;&lt;li&gt;Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)&lt;/li&gt;&lt;li&gt;&#39;Manifest&#39; of the contents (dsid convention: RELS-INT)&lt;/li&gt;&lt;li&gt;Connections between this dataset and other objects, like articles, etc as well as the RDF description of this item (RELS-EXT)&lt;/li&gt;&lt;li&gt;Basic description of dataset for interoperability (Simple dublin core - DC)&lt;/li&gt;&lt;/ul&gt;&lt;br/&gt;As a BagIt+RDF:&lt;br/&gt;&lt;br/&gt;Zip archive - &lt;br/&gt;&lt;ul&gt;&lt;li&gt;/MANIFEST (list of files and checksums)&lt;/li&gt;&lt;li&gt;/RDFMANIFEST (RELS-INT and RELS-EXT from
above)&lt;/li&gt;&lt;li&gt;/data/* (original dataset files/disk images/etc)&lt;/li&gt;&lt;li&gt;/derived/* (normalised/re-rendered datasets in a well known format)&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Presentation - the important part&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;What is described above is the archival of the data. This is a form suited for discovery, but not a form suited for reuse. So, what are the possibilities?&lt;br/&gt;&lt;br/&gt;BigTable (Google) or HBase (Hadoop) provide a platform where tabular data can be put in a scalable manner. In fact, I would go on to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise, and to educate the academic users in a manner that is useful to them, not just to the archivist. Google spreadsheet is an extremely good example of how such a large, scalable database might be presented to the end-user.&lt;br/&gt;&lt;br/&gt;For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy to put unused datasets to &#39;sleep&#39; can provide a useful compromise, avoiding having all the tables live but still providing a useful service. &lt;br/&gt;&lt;br/&gt;It should also be noted that the adoption of popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, and protocols and methods for access alter with fashion.
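As a small Python sketch of the fragment-suffix convention described earlier (the helper name and the example identifiers are invented for illustration):

```python
def dataset_ref(base, table, part, part_id):
    """Build a fragment URI addressing part of a table: a 'column', 'row' or 'cell'."""
    assert part in ("column", "row", "cell")
    return "{0}#table/{1}/{2}/{3}".format(base, table, part, part_id)

base = "http://datasethost/datasets/dataset-42"
print(dataset_ref(base, "results", "column", "mass"))
# prints http://datasethost/datasets/dataset-42#table/results/column/mass
```

An intermediary layer would still need to dereference a GET on these forms, as noted above; the helper only standardises how the references are minted.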
Currently, Atom/RSS feeds of any part of a table of data (the google spreadsheet model) fit very well with the landscape of applications that can reuse this information.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Summary&lt;/b&gt;&lt;br/&gt;&lt;ul&gt;&lt;li&gt;Try to record as much information as can be found or derived - from host operating system to column types.&lt;/li&gt;&lt;li&gt;Keep the original dataset byte-for-byte as you received it.&lt;/li&gt;&lt;li&gt;Try to maintain a version of the data in a well-understood format&lt;/li&gt;&lt;li&gt;Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism&lt;/li&gt;&lt;li&gt;Be prepared to create services that the users want and need, not services that you think they should have.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/7013951292415915268/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/7013951292415915268' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7013951292415915268'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7013951292415915268'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/02/handling-tabular-data.html' title='Handling Tabular data'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16'
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-2171554919526538189</id><published>2009-02-20T03:40:00.000-08:00</published><updated>2009-02-20T04:07:24.997-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="archive"/><category scheme="http://www.blogger.com/atom/ns#" term="bagit"/><category scheme="http://www.blogger.com/atom/ns#" term="oaiore"/><category scheme="http://www.blogger.com/atom/ns#" term="packaging"/><category scheme="http://www.blogger.com/atom/ns#" term="rdf"/><category scheme="http://www.blogger.com/atom/ns#" term="serialisation"/><title type='text'>Pushing the BagIt manifest concept a little further</title><content type='html'>&lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;I really like the idea of &lt;a href=&quot;http://www.cdlib.org/inside/diglib/bagit/bagitspec.html&quot;&gt;BagIt&lt;/a&gt; - &lt;b&gt;just enough framework&lt;/b&gt; to transfer files in a way that errors can be detected.&lt;br /&gt;&lt;br /&gt;I really like the idea of &lt;a href=&quot;http://www.w3.org/RDF/&quot;&gt;RDF&lt;/a&gt; - &lt;b&gt;just enough framework&lt;/b&gt; to detail, characterise and interlink resources in an extremely flexible and extendable fashion.&lt;br /&gt;&lt;br /&gt;I really like the 4 rules of &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot;&gt;Linked Data&lt;/a&gt; - &lt;b&gt;just enough rules&lt;/b&gt; to act as guides; follow the rules and your information will be much more useful to you and the wider world.&lt;br /&gt;&lt;br /&gt;What I do not wish to go near is any format that requires a non-machine-readable profile to understand or a human to reverse-engineer - METS being a good example of a framework giving you enough rope to hang yourself with.&lt;br /&gt;&lt;br /&gt;So, what&#39;s my use-case?
First, I&#39;ll outline what digital objects I have, and why I handle and express them in the way I have.&lt;br /&gt;&lt;br /&gt;I deal with lists of resources on a day-to-day basis, and what these resources are and the way these resources link together is very important. The metadata associated with the list is also important, as this conveys the perspective of the agent that constructed this list; the &quot;who,what,where,when and why&quot; of the list.&lt;br /&gt;&lt;br /&gt;OAI-ORE is - at a basic level - a specification and a vocabulary, which can be used to depict a list of resources. This is a good thing. But here&#39;s the rub for me - I don&#39;t agree with how ORE semantically places this list. For me, the list is a subjective thing, a facet or perception of linkage between the resources listed. The list *always* implies a context through which the resources are to be viewed. This view leads me to the conclusion that any triples *asserted* by the list, such as those containing an ordering predicate like &#39;hasNext&#39; or &#39;hasLast&#39;, must not be in the same graph as the factual triples which would enter the &#39;global&#39; graph, such as: list A is called (dc:title) &quot;My photos&quot;, contains resources a,b,c,d and e, and was authored by Agent X.&lt;br /&gt;&lt;br /&gt;This is easier to illustrate with an example with everyone&#39;s friends, Alice and Bob:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhppwar1ysQp2p4vW2vfotOvBUiipVrB8-iAErC2HOHXB7Wf9KeEIAQtGkrLlwhAo5O4xQ0JTJ1A62jKYUrHhGfTu1enAiHJqvKtqVChbDkGZayEBvsYIuEsGAGGCXIjjfczgcBaDZBcKI/?imgmax=800&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;br /&gt;Now, while Alice and Bob may be &#39;aggregating&#39; some of the same images, this doesn&#39;t mean we can infer much at all.
Alice might be researching the development of a fruit fly&#39;s wings based on genetic degradation, and Bob might be researching the fruit fly&#39;s eye structure, looking for clear photos of the front of the fly. It could be even more unrelated in that Bob is actually looking for features on the electron microscope photos caused by dust or pollen.&lt;br /&gt;&lt;br /&gt;So, to cope with contextual assertions (A asserts that &amp;lt;B&amp;gt; &amp;lt;verb C&amp;gt; &amp;lt;D&amp;gt;) there are a few well-discussed tactics: Reification, &#39;Promotion&#39; (not sure of the correct term here) and Named Graphs.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Reification is a no-no&lt;/span&gt;. Very bad. Google will tell you all the reasons why this is bad.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;&#39;Promotion&#39; (I hope someone will post the real term for this in the comments.)&lt;/span&gt; &#39;Promotion&#39; is just where a Classed node is introduced to allow contextual information to be added, very useful for adding information about a predicate. For example, consider &amp;lt;Person A&amp;gt; &amp;lt;researches&amp;gt; &amp;lt;ProjectX&amp;gt;. This, I&#39;d argue, is a bad triple for any system that will last more than &amp;lt;ProjectX&amp;gt;&#39;s lifespan. We need to classify this triple with temporal information, and perhaps even location information too. So, one solution is to &#39;promote&#39; the &amp;lt;researches&amp;gt; predicate to be of the following form: &amp;lt;Person A&amp;gt; &amp;lt;has_role&amp;gt; &amp;lt;_: type=Researcher&amp;gt;; &amp;lt;_:&amp;gt; &amp;lt;dtstart&amp;gt; &amp;lt;etc&amp;gt;, &amp;lt;_:&amp;gt; &amp;lt;researches&amp;gt; &amp;lt;ProjectX&amp;gt; ...&lt;br /&gt;&lt;br /&gt;From the ORE camp, this promotion comes in the form of a Proxy for each aggregated resource that needs context.
So in this way, they have &#39;promoted&#39; the resource, as a kind of promotion to the act of aggregation. Tomayto, Tomarto. The way this works for ORE doesn&#39;t sit well with me though, and the convention for the URI schema here feels very awkward and heavy.&lt;br /&gt;&lt;br /&gt;The third way (and my strong preference) is the &lt;span style=&quot;font-weight: bold;&quot;&gt;Named Graph approach&lt;/span&gt;. Put all the triples that are &lt;b&gt;asserted&lt;/b&gt; by, say, Alice, into a resource at URI &amp;lt;Alices-NG&amp;gt; and say something like &amp;lt;Aggregation&amp;gt; &amp;lt;isProvidedContextBy&amp;gt; &amp;lt;Alices-NG&amp;gt;&lt;br /&gt;&lt;br /&gt;For ease of reuse though, I&#39;d suggest that the &lt;b&gt;facts&lt;/b&gt; are left in the global graph, in the aggregation serialisation itself. I am sure that the semantic arguments over what should go where could rage on for eons; my take is that information that is factual or generally useful should be left in the global graph. Like resource mime-type, standards compliance (&#39;conformsTo&#39;, etc), mirroring/alternate format information (&#39;sha1_checksum&#39;, &#39;hasFormat&#39; between PDF, txt and Word doc versions, etc)&lt;br /&gt;&lt;br /&gt;(There is the murky middle ground of course, like licensing. But I&#39;d suggest leaving it to the &#39;owning&#39; aggregation to put it in the global graph.)&lt;br /&gt;&lt;br /&gt;Enough of the digression on RDF!&lt;br /&gt;&lt;br /&gt;So,&lt;span style=&quot;font-weight: bold;&quot;&gt; how to extend BagIt&lt;/span&gt;, taking on board the things I have said above:&lt;br /&gt;&lt;br /&gt;Add alongside the MANIFEST of BagIt (a simple list of files and checksums) an RDF serialisation - RDFMANIFEST.{format} (which in my preference is in N3 or turtle notation, .n3 or .turtle accordingly)&lt;br /&gt;&lt;br /&gt;Copying the modelling of Aggregations from OAI-ORE, we will say that one BagIt archive is equivalent to one Aggregation.
(NB nothing wrong with a BagIt archive of BagIt archives!)&lt;br /&gt;&lt;br /&gt;Re-use the Agent and ore:aggregates concepts and conventions from the OAI-ORE spec to &#39;list&#39; the archive, and give some form of provenance. Add in a simple record for what this archive is meant to be as a whole (attached to the Aggregation class).&lt;br /&gt;&lt;br /&gt;Give each BagIt a URI - in this case, preferably a resolvable URI from which you can download it, but for bulk transfers using SneakerNet or CarFullOfDrivesNet, use an info:BagIt/{id} scheme of your choice.&lt;br /&gt;&lt;br /&gt;URIs for resources in transit are hierarchical, based on location in the archive: &amp;lt;info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff&amp;gt;&lt;br /&gt;&lt;br /&gt;Checksums, mimetypes and alternates should be added to the RDF Manifest:&lt;br /&gt;&lt;br /&gt;NB &amp;lt;page1&amp;gt; == &amp;lt;info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&amp;lt;page1&amp;gt; &amp;lt;sha1&amp;gt; &quot;9cbd4aeff71f5d7929b2451c28356c8823d09ab4&quot;;&lt;br /&gt;            &amp;lt;mimetype&amp;gt; &quot;image/tiff&quot;;&lt;br /&gt;            &amp;lt;hasFormat&amp;gt; &amp;lt;info:BagIt/{book-id}/thumbnail_pages/page1/bookid_page1_0.jpg&amp;gt;;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Any assertions, such as page ordering in this case, should be handled as necessary. Just *please* do not use &#39;hasNext&#39;! Use a named graph, use the built-in RDF list mechanism, add an RSS 1.0 feed as a resource, anything but hasNext!&lt;br /&gt;&lt;br /&gt;And that&#39;s about it for the format.
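As an aside, the checksums that feed both the plain MANIFEST and the RDF manifest can be generated with nothing more than Python's hashlib. This is a minimal sketch, not part of the BagIt spec itself; the helper names and file layout are my own examples:

```python
# A minimal sketch (not part of BagIt itself): compute the sha1 values
# that would go into both the plain MANIFEST and the RDF manifest.
import hashlib

def sha1_of(path, blocksize=65536):
    # Stream the file in chunks so large TIFFs never sit in memory whole.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest_lines(paths):
    # BagIt-style manifest lines: "checksum  relative/path"
    return ["%s  %s" % (sha1_of(p), p) for p in paths]
```

The same sha1 values can then be attached to each resource URI in the RDFMANIFEST, as in the example triples above.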
One last thing to say about using info URIs though - I&#39;d strongly suggest that they are used when the items do not have resolvable (http) URIs, and once transferred, I&#39;d suggest that the info URIs are replaced with the http ones, and the info variants can be kept in a graph for provenance.&lt;br /&gt;&lt;br /&gt;(Please note that I am biased in that this mirrors quite closely the way that the archives here are arranged and the way that digital items are held, but I think this works!)&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/2171554919526538189/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/2171554919526538189' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2171554919526538189'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2171554919526538189'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2009/02/pushing-bagit-manifest-concept-little.html' title='Pushing the BagIt manifest concept a little further'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" 
url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhppwar1ysQp2p4vW2vfotOvBUiipVrB8-iAErC2HOHXB7Wf9KeEIAQtGkrLlwhAo5O4xQ0JTJ1A62jKYUrHhGfTu1enAiHJqvKtqVChbDkGZayEBvsYIuEsGAGGCXIjjfczgcBaDZBcKI/s72-c?imgmax=800" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-3073595490420683868</id><published>2009-02-18T04:18:00.000-08:00</published><updated>2009-02-18T06:52:25.963-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="dev8d python twitter tagging spreadsheet data google"/><title type='text'>Tracking conferences (at Dev8D) with python, twitter and tags</title><content type='html'>There was so much going on at http://www.dev8d.org (#dev8d) that it might be foolish for me to attempt to write up what happened.&lt;br /&gt;&lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;&lt;br /&gt;So, I&#39;ll focus on a small, but to my mind, crucial aspect of it - tag tracking with a focus on &lt;a href=&quot;http://twitter.com/&quot;&gt;Twitter&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;The Importance of Tags&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;First, the tag (#)dev8d was cloudburst over a number of social sites - &lt;a href=&quot;http://www.flickr.com/photos/tags/dev8d/&quot;&gt;Flickr&lt;/a&gt;(dev8d tagged photos), &lt;a href=&quot;http://search.twitter.com/search?q=%23dev8d&quot;&gt;Twitter&lt;/a&gt;(dev8d feed), blogs such as the &lt;a href=&quot;http://dev8d.jiscinvolve.org/&quot;&gt;JISCInvolve Dev8D site&lt;/a&gt;, and so on. 
This was not just done for publicity, but as a means to track and re-assemble the various inputs to and outputs from the event.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.flickr.com/photos/tags/dev8d/&quot;&gt;Flickr&lt;/a&gt; has some really nice photos on it, shared by people like &lt;a href=&quot;http://www.flickr.com/photos/ianibbo/&quot;&gt;Ian Ibbotson&lt;/a&gt; (who caught &lt;a href=&quot;http://www.flickr.com/photos/ianibbo/3275945388/&quot;&gt;an urban fox&lt;/a&gt; on camera during the event!) While there was an &#39;official&#39; &lt;a href=&quot;http://www.flickr.com/photos/dev8d&quot;&gt;dev8d flickr user&lt;/a&gt;, I expect the most unexpected and most interesting photos to be shared by other people who kindly add on the dev8d tag so we can find them. For conference organisers, this means that there is a pool of images that we can choose from, each with their own provenance so we can contact the owner if we wanted to re-use, or re-publish. Of course, if the owner puts a &lt;a href=&quot;http://creativecommons.org/&quot;&gt;CC licence&lt;/a&gt; on them, it makes things easier :)&lt;br /&gt;&lt;br /&gt;So, asserting a tag or label for an event is a useful thing to do in any case. But, this twinned with using a messaging system like &lt;a href=&quot;http://twitter.com/&quot;&gt;Twitter&lt;/a&gt; or &lt;a href=&quot;http://identi.ca/&quot;&gt;Identi.ca&lt;/a&gt;, means that you can coordinate, share, and bring together an event. There was a projector in the Basecamp room, which was either the bar, or one of the large basement rooms at Birkbeck depending on the day. 
Initially, this was used to run through the basic flow of events, which was primarily organised through the use of a &lt;a href=&quot;http://code.google.com/p/developerhappinessdays/&quot;&gt;wiki&lt;/a&gt;, of which all of us and the attendees were members.&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;Projecting the bird&#39;s eye view of the event&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;I am not entirely sure whose idea it was initially to use the projector to follow the dev8d tag on twitter, auto-refreshing itself every minute, but it would be one or more of the following: Dave Flanders(&lt;a href=&quot;http://twitter.com/dfflanders&quot;&gt;@dfflanders&lt;/a&gt;), Andy McGregor(&lt;a href=&quot;http://twitter.com/andymcg&quot;&gt;@andymcg&lt;/a&gt;) and Dave Tarrant(&lt;a href=&quot;http://twitter.com/davetaz&quot;&gt;@davetaz&lt;/a&gt;) who is aka BitTarrant due to his network wizardry keeping the wifi going despite Birkbeck&#39;s network&#39;s best efforts at stopping any form of useful networking.&lt;br /&gt;&lt;br /&gt;The funny thing about the feed being there was that it felt perfectly natural from the start. Almost like a mix of notice board, event liveblog and facebook status updates, but the overall effect was like it was the &lt;i&gt;bird&#39;s eye view&lt;/i&gt; of the entire event, which you could dip into and out of at will, follow up on talks you weren&#39;t even attending, catch interesting links that people posted, and just follow the whole event while doing your own thing.
led &lt;a href=&quot;http://samscam.co.uk/&quot;&gt;Sam Easterby-Smith&lt;/a&gt; (&lt;a href=&quot;http://twitter.com/samscam&quot;&gt;@samscam&lt;/a&gt;) to create a script that dug through the dev8d tweets looking for &lt;i&gt;n/m&lt;/i&gt; (like 7/10) and used that as a mark of happyness, e.g.&lt;br /&gt;&lt;blockquote&gt;&lt;a href=&quot;http://twitter.com/samscam&quot;&gt;&quot; @samscam&lt;/a&gt; #dev8d I am seriously 9/10 happy &lt;a href=&quot;http://samscam.co.uk/happier&quot;&gt;http://samscam.co.uk/happier&lt;/a&gt; HOW HAPPY ARE YOU? &quot; &lt;a href=&quot;http://twitter.com/samscam/status/1197185415&quot;&gt; (Tue, 10 Feb 2009 11:17:15)&lt;/a&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzHxye9M5JltvWjkrSk1HYQgugmBZsMRbFAD6n06RWfWoJPm68leoKesHKGzeWnWPPgSLkEKu99G55smNW8AfmBavkyQXj_1EEJ2ZcMJI7vxqvmvbmRKlPyyT2cFIgy7lfSjH-8-R8omI/?imgmax=800&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;br /&gt;&lt;br /&gt;The script computed the average happyness and overall happyness of those who tweeted how they were doing!&lt;br /&gt;&lt;br /&gt;Of course, being friendly, constructive sorts, we knew the best way to help &#39;improve&#39; his happyometer was to try to break it by sending it bad input... *ahem*.&lt;br /&gt;&lt;blockquote&gt;&lt;a href=&quot;http://twitter.com/samscam&quot;&gt;&quot; @samscam&lt;/a&gt; #dev8d based on instant discovery of bugs in the Happier Pipe am now only 3/5 happy &quot; (&lt;a href=&quot;http://twitter.com/samscam/statuses/1197215138&quot;&gt;Tue, 10 Feb 2009 23:05:05&lt;/a&gt;)&lt;br /&gt;&lt;/blockquote&gt;BUT things got fixed, and the community got involved and interested.
It caused talk and debate, got people wondering how it was done, how they could do the same thing and how to take it further.&lt;br /&gt;&lt;br /&gt;At which point, I thought it might be fun to &#39;retweet&#39; the happyness ratings as they change, to keep a running track of things. And so, a purpose for &lt;a href=&quot;http://twitter.com/randomdev8d&quot;&gt;@randomdev8d&lt;/a&gt; was born:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI4qreMGP617Acq0GSOOzLdOYYoPKTS7i7m0I03kfW9_3AVaXJkwmFGg-nrmEZptARxnl2KyYdhMaHfQ7ZXHkT27fH7KIjLKWpB2ZeB9IQiE7TA98E2cYG8ejfNnehXJeGIn0tFP81k34/?imgmax=800&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;br /&gt;&lt;br /&gt;How I did this was fairly simple: I grabbed his page every minute or so, used BeautifulSoup to parse the HTML, got the happyness numbers out and compared them to the last ones the script had seen. If there was a change, it tweeted it and seconds later, the projected tweet feed updated to show the new values - a disproportionate feedback loop, the key to involvement in games; you do something small like press a button or add 4/10 to a message, and you can affect the stock-market ticker of happyness :)&lt;br /&gt;&lt;br /&gt;If I had been able to give my talk on the python code day, the code to do this would contain zero surprises, because I covered 99% of this - so here&#39;s my &lt;a href=&quot;http://developerhappinessdays.googlecode.com/files/dev8d-presentation.pdf&quot;&gt;&#39;slides&#39;&lt;/a&gt;[pdf] - basically a snapshot screencast.&lt;br /&gt;&lt;br /&gt;Here&#39;s the crufty code though that did this:&lt;br /&gt;&lt;blockquote&gt;import time&lt;br /&gt;import urllib&lt;br /&gt;import httplib2, BeautifulSoup&lt;br /&gt;&lt;br /&gt;h = httplib2.Http()&lt;br /&gt;h.add_credentials(&#39;randomdev8d&#39;, &#39;PASSWORD&#39;)&lt;br /&gt;happy = httplib2.Http()&lt;br /&gt;o = 130.9  # last overall happyness seen&lt;br /&gt;a = 7.7    # last average happyness seen&lt;br /&gt;&lt;br /&gt;while True:&lt;br /&gt;    print &quot;Checking happiness....&quot;&lt;br /&gt;    (resp, content) = happy.request(&#39;http://samscam.co.uk/happier/&#39;)&lt;br /&gt;    soup = BeautifulSoup.BeautifulSoup(content)&lt;br /&gt;    overallHappyness = soup.findAll(&#39;div&#39;)[2].contents&lt;br /&gt;    averageHappyness = soup.findAll(&#39;div&#39;)[4].contents&lt;br /&gt;    over = float(overallHappyness[0])&lt;br /&gt;    ave = float(averageHappyness[0])&lt;br /&gt;    print &quot;Overall %s - Average %s&quot; % (over, ave)&lt;br /&gt;    omess = &quot;DOWN&quot;&lt;br /&gt;    if over &amp;gt; o:&lt;br /&gt;        omess = &quot;UP!&quot;&lt;br /&gt;    amess = &quot;DOWN&quot;&lt;br /&gt;    if ave &amp;gt; a:&lt;br /&gt;        amess = &quot;UP!&quot;&lt;br /&gt;    if over == o:&lt;br /&gt;        omess = &quot;SAME&quot;&lt;br /&gt;    if ave == a:&lt;br /&gt;        amess = &quot;SAME&quot;&lt;br /&gt;    if not (o == over and a == ave):&lt;br /&gt;        print &quot;Change!&quot;&lt;br /&gt;        o = over&lt;br /&gt;        a = ave&lt;br /&gt;        tweet = &quot;Overall happiness is now %s(%s), with an average=%s(%s) #dev8d (from http://is.gd/j99q)&quot; % (overallHappyness[0], omess, averageHappyness[0], amess)&lt;br /&gt;        data = {&#39;status&#39;: tweet}&lt;br /&gt;        body = urllib.urlencode(data)&lt;br /&gt;        (rs, cont) = h.request(&#39;http://www.twitter.com/statuses/update.json&#39;, &quot;POST&quot;, body=body)&lt;br /&gt;    else:&lt;br /&gt;        print &quot;No change&quot;&lt;br /&gt;    time.sleep(120)&lt;br /&gt;&lt;/blockquote&gt;(Available from &lt;a href=&quot;http://pastebin.com/f3d42c348&quot;&gt;http://pastebin.com/f3d42c348&lt;/a&gt; with syntax highlighting - NB this was written beat-poet style, from A to B with little concern for form. The fact that it works is a miracle, so comment on the code if you must.)&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;The grand, official #Dev8D survey!&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;... which was anything but official, or grand.
The happyness-o-meter idea led BitTarrant and me to think &quot;Wouldn&#39;t it be cool to find out what computers people have brought here?&quot; Essentially, finding out what computer environment developers &lt;i&gt;choose&lt;/i&gt; to use is a very valuable thing - developers choose things which make our lives easier, by and large, so finding out which setups they use by preference to develop or work with could guide later choices, such as being able to actually target the majority of environments for wifi, software, or talks.&lt;br /&gt;&lt;br /&gt;So, on the Wednesday morning, Dave put out the call on @dev8d for people to post the operating systems on the hardware they brought to this event, in the form of OS/HW. I then busied myself with writing a script that hit the twitter search api directly and parsed the results itself. As this was a more deliberate script, I made sure that it kept track of things properly, pickling its per-person tallies. (You could post up multiple configurations in one or more tweets, and it kept track of them per person.) This script was a little bloated at 86 lines, so I won&#39;t post it inline - plus, it also showed that I should&#39;ve gone to the regexp lesson, as I got stuck trying to do it with regexp, gave up, and then used whitespace-tokenising...
but it worked fine ;)&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://pastebin.com/f2c04719b&quot;&gt;Survey code: http://pastebin.com/f2c04719b&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Survey results:&lt;/span&gt; &lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqToHzjCs2jaQ&quot;&gt;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqToHzjCs2jaQ&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;OS:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Linux was the majority at 42%&lt;/span&gt;, closely followed by Apple at 37%, with MS-based OSes at 18% and a stellar showing of one user of OpenSolaris (4%)!&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Hardware type:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;66% were laptops, with 25% of the machines there being classed as netbooks&lt;/span&gt;. 8% of the hardware there were iPhones too, and one person claimed to have brought Amazon EC2 with them ;)&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;The post hoc analysis&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;Now then, having gotten back to normal life, I&#39;ve spent a little time grabbing stuff from twitter and digging through them.
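The OS/HW counting itself is simple enough to sketch in a few lines. This is a sketch of the whitespace-tokenising approach, not the actual survey script (which is on pastebin); the function name and tweet data here are invented:

```python
# Sketch of the whitespace-tokenising tally: pull "OS/HW" tokens out of
# tweet text and keep a per-person count of each configuration posted.
def tally_configs(tweets):
    # tweets: iterable of (author, text); returns {author: {"os/hw": count}}
    counts = {}
    for author, text in tweets:
        for token in text.split():
            if token.count("/") != 1:
                continue  # skip plain words and URLs (which contain "//")
            os_name, hw = token.split("/")
            # skip happyness ratings like 7/10, and tokens with empty halves
            if not os_name or not hw or os_name.isdigit():
                continue
            per = counts.setdefault(author, {})
            per[token.lower()] = per.get(token.lower(), 0) + 1
    return counts
```

Pickling `counts` between runs gives the per-person tallies mentioned above.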
Here is the &lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJoJdIcm7mdpBg&quot;&gt;list of the 1300+ tweets with the #dev8d tag in them&lt;/a&gt; published via google docs, and here are some derived things posted by Tony Hirst(&lt;a href=&quot;http://twitter.com/psychemedia&quot;&gt;@psychemedia&lt;/a&gt;) and Chris Wilper(&lt;a href=&quot;http://twitter.com/cwilper&quot;&gt;@cwilper&lt;/a&gt;) seconds after I posted this:&lt;br /&gt;&lt;br /&gt;Tagcloud of twitterers:&lt;br /&gt;&lt;a href=&quot;http://www.wordle.net/gallery/wrdl/549364/dev8_twitterers&quot;&gt;http://www.wordle.net/gallery/wrdl/549364/dev8_twitterers&lt;/a&gt; [java needed]&lt;br /&gt;&lt;br /&gt;Tagcloud of tweeted words:&lt;br /&gt;&lt;a href=&quot;http://www.wordle.net/gallery/wrdl/549350/dev8d&quot;&gt;http://www.wordle.net/gallery/wrdl/549350/dev8d&lt;/a&gt; [java needed]&lt;br /&gt;&lt;br /&gt;And a column of all the tweeted links:&lt;br /&gt;&lt;a href=&quot;http://spreadsheets.google.com/pub?key=p1rHUqg4g423-wWQn8arcTg&quot;&gt;http://spreadsheets.google.com/pub?key=p1rHUqg4g423-wWQn8arcTg&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This led me to dig through them and republish the list of tweets, trying to unminimise the URLs and grab the &amp;lt;title&amp;gt; tag of the HTML page each goes to, which you can find here:&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJpwVmV4_4qOdg&quot;&gt;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJpwVmV4_4qOdg&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;(Which, incidentally, led me to spot that there was one link to &quot;YouTube - Rick Astley - Never Gonna Give You Up&quot; which means the hacking was all worthwhile :))&lt;br /&gt;&lt;br /&gt;&lt;big&gt;&lt;big&gt;Graphing Happyness&lt;/big&gt;&lt;/big&gt;&lt;br /&gt;&lt;br /&gt;For one, I&#39;ve re-analysed the happyness tweets and posted up the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A &lt;a
href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqHVP8Fb7euEA&quot;&gt;full log of happyness with timeline attached to it&lt;/a&gt;,&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJoxj8D7_EWscQ&quot;&gt;The running average, with accompanying timeline,&lt;/a&gt;&lt;/li&gt;&lt;li&gt;and the &lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJp6acAAn77SZQ&quot;&gt;average of the last 10 tweets&lt;/a&gt; in much the same way as before.&lt;/li&gt;&lt;/ul&gt;It is easier to understand the averages as graphs over time of course! You could also use Tony Hirst&#39;s excellent write up here about &lt;a href=&quot;http://ouseful.wordpress.com/2009/02/17/creating-your-own-results-charts-for-surveys-created-with-google-forms/&quot;&gt;creating graphs from google forms and spreadsheets.&lt;/a&gt; I&#39;m having issues embedding the google timeline widget here, so you&#39;ll have to make do with static graphs.&lt;br /&gt;&lt;br /&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinOtUr-e8hP9CgY6czBpQkiPa29cwLO4myVLb3qR_86XI9fS8S5H7E4sVPKgvhLDfuHH3kj6nfmR2bAbg-62pc2dBec1WOmr6mFb6fKflSODV4anHi3S5ApOYXuZRhyphenhyphenDTMbwdzzZ67o00/s800/dev8d_running_total_average.png&quot;&gt;&lt;img style=&quot;margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 800px; height: 468px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinOtUr-e8hP9CgY6czBpQkiPa29cwLO4myVLb3qR_86XI9fS8S5H7E4sVPKgvhLDfuHH3kj6nfmR2bAbg-62pc2dBec1WOmr6mFb6fKflSODV4anHi3S5ApOYXuZRhyphenhyphenDTMbwdzzZ67o00/s800/dev8d_running_total_average.png&quot; alt=&quot;&quot; border=&quot;0&quot; /&gt;&lt;/a&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Average happyness over the course of the event - all tweets counted towards the average.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur=&quot;try 
{parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm7CC7BW5mCI24UMG9gdX5wu004UsabPeB1Ntg4Tklg9Jhcf037eHhay7UAyxhG7i_l7hVPefGcH6rE6yEo0ZSmCOfV7BBLCWJiY8dlvQLplO-d0XeWvcjfBI-eItkaOnl9e8OWRFHvqg/s912/dev8d_last10HappynessTweetsCount.png&quot;&gt;&lt;img style=&quot;margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 912px; height: 510px;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm7CC7BW5mCI24UMG9gdX5wu004UsabPeB1Ntg4Tklg9Jhcf037eHhay7UAyxhG7i_l7hVPefGcH6rE6yEo0ZSmCOfV7BBLCWJiY8dlvQLplO-d0XeWvcjfBI-eItkaOnl9e8OWRFHvqg/s912/dev8d_last10HappynessTweetsCount.png&quot; alt=&quot;&quot; border=&quot;0&quot; /&gt;&lt;/a&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Average happyness, but only the previous 10 tweets counted towards the average making it more reflective of the happyness at that time.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you are wondering about the first dip, that was when we all tried to break Sam&#39;s tracker by sending it bad data, a lot of 0 happyness&#39;s were recorded therefore :) As for the second dip, well, you can see that from the &lt;a href=&quot;http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqHVP8Fb7euEA&quot;&gt;log of happyness&lt;/a&gt;, yourselves :)&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/3073595490420683868/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/3073595490420683868' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/3073595490420683868'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/3073595490420683868'/><link rel='alternate' type='text/html' 
href='http://oxfordrepo.blogspot.com/2009/02/tracking-conferences-at-dev8d-with.html' title='Tracking conferences (at Dev8D) with python, twitter and tags'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzHxye9M5JltvWjkrSk1HYQgugmBZsMRbFAD6n06RWfWoJPm68leoKesHKGzeWnWPPgSLkEKu99G55smNW8AfmBavkyQXj_1EEJ2ZcMJI7vxqvmvbmRKlPyyT2cFIgy7lfSjH-8-R8omI/s72-c?imgmax=800" height="72" width="72"/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-4516927983558625883</id><published>2008-12-02T08:12:00.000-08:00</published><updated>2008-12-02T08:19:35.910-08:00</updated><title type='text'>Archive file resiliences</title><content type='html'>&lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;&lt;b&gt;Tar format: &lt;a href=&quot;http://en.wikipedia.org/wiki/Tar_%28file_format%29#Format_details&quot;&gt;http://en.wikipedia.org/wiki/Tar_(file_format)#Format_details&lt;/a&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The tar file format seems to be quite robust as a container: the files are put in byte-for-byte as they are on disc, with a 512 byte header prefix, and the file length padded to fit into 512 byte blocks.&lt;br /&gt;&lt;br /&gt;(Also, a quick tip for working with archives on the commandline - the utility &#39;less&#39; can list the contents of many: less test.zip .tar etc)&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Substitution damage:&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;By hacking the headers (using a hex editor like ghex2), the inbuilt header checksum seems to detect corruption as intended.
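That self-check is cheap to reproduce yourself. A minimal sketch, assuming the standard ustar header layout (the 8-byte checksum field at offset 148 holds the octal sum of the 512-byte header, computed with the checksum field itself treated as spaces); the function name is my own:

```python
# Sketch: verify a tar header's own checksum. The 8-byte field at offset
# 148 stores, in octal ASCII, the byte sum of the 512-byte header with
# the checksum field itself counted as eight spaces.
def header_checksum_ok(header):
    if len(header) != 512:
        return False
    stored = header[148:156].split(b"\x00")[0].strip()
    if not stored:
        return False
    expected = sum(header[:148]) + 8 * ord(" ") + sum(header[156:])
    return int(stored, 8) == expected
```

Flipping any byte in the header (as the hex-editor test above does) makes the sum disagree with the stored value.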
The normal tar utility will (possibly by default) skip corrupted headers and therefore files, but will find the other files in the archive:&lt;br /&gt;&lt;br /&gt;ben@billpardy:~/tar_testground$ tar -tf test.tar&lt;br /&gt;New document 1.2007_03_07_14_59_58.0&lt;br /&gt;New document 1.2007_09_07_14_18_02.0&lt;br /&gt;New document 1.2007_11_16_12_21_20.0&lt;br /&gt;ben@billpardy:~/tar_testground$ echo $?&lt;br /&gt;0&lt;br /&gt;(tar reports success)&lt;br /&gt;ben@billpardy:~/tar_testground$ ghex2 test.tar&lt;br /&gt;(munge first header a little)&lt;br /&gt;ben@billpardy:~/tar_testground$ tar -tf test.tar&lt;br /&gt;tar: This does not look like a tar archive&lt;br /&gt;tar: Skipping to next header&lt;br /&gt;New document 1.2007_09_07_14_18_02.0&lt;br /&gt;New document 1.2007_11_16_12_21_20.0&lt;br /&gt;tar: Error exit delayed from previous errors&lt;br /&gt;ben@billpardy:~/tar_testground$ echo $?&lt;br /&gt;2&lt;br /&gt;&lt;br /&gt;Which is all well and good - at least you can detect errors which hit important parts of the tar file. But what&#39;s a couple of 512 byte targets in a 100Mb tar file?&lt;br /&gt;&lt;br /&gt;As the files are in the archive unaltered, any damage that isn&#39;t in tar header sections or padding is restricted to damaging the file itself and not the files around it. So a few bytes of damage will be contained to the area it occurs in. It is important to make sure that you checksum the files then!&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Additive damage:&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;The tar format is (due to legacy tape concerns) pretty fixed on the 512 byte block size. Any addition seems to cause the part of the file after the addition to be &#39;unreadable&#39; to the tar utility as it checks only on the 512 byte mark.
The exception is - of course - a 512 byte addition (or multiple thereof).&lt;br /&gt;&lt;blockquote&gt;&lt;b&gt;Summary:&lt;/b&gt; the tar format is reasonably robust, in part due to its uncompressed nature and also due to its inbuilt header checksum. Bitwise damage seems to be localised to the area/files it affects. Therefore, if used as, say, a BagIt container, it might be useful for the server to allow individual file re-download, to avoid pulling the entire container again.&lt;br /&gt;&lt;/blockquote&gt;&lt;a href=&quot;http://www.flickr.com/photos/oxfordrepo/3076904031/&quot;&gt;&lt;img src=&quot;http://farm4.static.flickr.com/3166/3076904031_c775cfd5ce.jpg?v=0&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ZIP format: &lt;a href=&quot;http://en.wikipedia.org/wiki/ZIP_%28file_format%29#The_format_in_detail&quot;&gt;http://en.wikipedia.org/wiki/ZIP_(file_format)#The_format_in_detail&lt;/a&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Zip has a similar layout to Tar, in that the files are stored sequentially, with the main file table or &#39;central directory&#39; being at the end of the file.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Substitution damage:&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Through similar hacking of the file as in the tar test above, a few important things come out:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is possible to corrupt individual files without the ability to unzip being affected. However, like tar, it will report an error (echo $? -&amp;gt; &#39;2&#39;) if the file crc32 doesn&#39;t match the checksum in the central directory.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;BUT, this corruption seemed to only affect the file that the damage was made to. The other files in the archive were unaffected.
Which is nice.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Losing the end of the file renders the archive hard to recover.&lt;/li&gt;&lt;li&gt;It does not checksum the central directory, so slight alterations here can cause all sorts of unintended changes: (NB altered a filename in the central directory [a &#39;3&#39;-&amp;gt;&#39;4&#39;])&lt;br /&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;ben@billpardy:~/tar_testground/zip_test/3$ unzip test.zip&lt;br /&gt;Archive:  test.zip&lt;br /&gt;New document 1.2007_0&lt;b&gt;4&lt;/b&gt;_07_14_59_58.0:  mismatching &quot;local&quot; filename (New document 1.2007_0&lt;b&gt;3&lt;/b&gt;_07_14_59_58.0),&lt;br /&gt;        continuing with &quot;central&quot; filename version&lt;br /&gt; inflating: New document 1.2007_04_07_14_59_58.0 &lt;br /&gt; inflating: New document 1.2007_09_07_14_18_02.0 &lt;br /&gt; inflating: New document 1.2007_11_16_12_21_20.0 &lt;br /&gt;ben@billpardy:~/tar_testground/zip_test/3$ echo $?&lt;br /&gt;1&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Note that the unzip utility did error out, but the filename of the file extracted was altered to the &#39;phony&#39; central directory one. It should be New document 1.2007_0&lt;b&gt;3&lt;/b&gt;_07_14_59_58.0&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;i&gt;Additive damage:&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Addition to file region: &lt;/i&gt;This test resulted in quite a surprise for me, it did a lot better than I had anticipated. 
The unzip utility not only worked out that I had added 3 bytes to the first file in the set, but managed to route around that damage so that I could retrieve the subsequent files in the archive:&lt;br /&gt;&lt;blockquote&gt;ben@billpardy:~/tar_testground/zip_test/4$ unzip test.zip&lt;br /&gt;Archive:  test.zip&lt;br /&gt;warning [test.zip]:  3 extra bytes at beginning or within zipfile&lt;br /&gt; (attempting to process anyway)&lt;br /&gt;file #1:  bad zipfile offset (local header sig):  3&lt;br /&gt; (attempting to re-compensate)&lt;br /&gt; inflating: New document 1.2007_03_07_14_59_58.0 &lt;br /&gt; error:  invalid compressed data to inflate&lt;br /&gt;file #2:  bad zipfile offset (local header sig):  1243&lt;br /&gt; (attempting to re-compensate)&lt;br /&gt; inflating: New document 1.2007_09_07_14_18_02.0 &lt;br /&gt; inflating: New document 1.2007_11_16_12_21_20.0 &lt;br /&gt;&lt;/blockquote&gt;&lt;i&gt;Addition to central directory region:&lt;/i&gt; This one was, as anticipated, more devastating. A similar addition of three bytes to the middle of the region gave this result:&lt;br /&gt;&lt;blockquote&gt;ben@billpardy:~/tar_testground/zip_test/5$ unzip test.zip&lt;br /&gt;Archive:  test.zip&lt;br /&gt;warning [test.zip]:  3 extra bytes at beginning or within zipfile&lt;br /&gt; (attempting to process anyway)&lt;br /&gt;error [test.zip]:  reported length of central directory is&lt;br /&gt; -3 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1&lt;br /&gt; zipfile?).  Compensating...&lt;br /&gt; inflating: New document 1.2007_03_07_14_59_58.0 &lt;br /&gt;file #2:  bad zipfile offset (lseek):  -612261888&lt;br /&gt;&lt;/blockquote&gt;It rendered just a single file readable in the archive, but this was still a good result. It might be possible to remove the problem addition, given thorough knowledge of the zip format.
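Checks like these can be scripted rather than eyeballed from the unzip output. A minimal sketch in Python using the stdlib zipfile module (the path is hypothetical); it reports the first member whose CRC32 no longer matches, which is exactly the silent-failure case to guard against:

```python
import zipfile

def first_corrupt_member(path):
    """Return the name of the first member whose CRC check fails,
    the path itself if the archive cannot be opened at all,
    or None when every member checks out."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip()  # reads every member and verifies its CRC32
    except zipfile.BadZipFile:
        # central directory missing or unreadable, e.g. a truncated download
        return path
```

Since damage stays localised to one member, a server offering BagIt-style individual file re-download could use the returned name to fetch just that file again.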
However, the most important part is that it errors out, rather than silently failing.&lt;br /&gt;&lt;blockquote&gt;&lt;b&gt;Summary:&lt;/b&gt; the zip format looks quite robust as well, but pay attention to the error codes that the commandline utility (and hopefully, native unzip libraries) emit. Bitwise errors to files do not propagate to the other files, but can do widespread damage to the file in question, due to its compressed nature. It survives additive damage far better than the tar format, able to compensate and retrieve unaffected files.&lt;br /&gt;&lt;/blockquote&gt;&lt;a href=&quot;http://www.flickr.com/photos/oxfordrepo/3076903859/&quot;&gt;&lt;img src=&quot;http://farm4.static.flickr.com/3203/3076903859_d9d954f782.jpg?v=0&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Tar.gz &#39;format&#39;:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This can be seen as a combination of the two archive formats above. On the surface, it looks like a zip-style archive holding a single file (the .tar file); in reality, the whole archive is one continuous compressed stream. It shouldn&#39;t take long to realise, then, that any change to the bulk of the tar.gz will cause havoc throughout the archive. In fact, a single-byte substitution I made to the file region in a test archive caused every byte sequence of &#39;it&#39; to turn into &#39;:/&#39;. This decimated the XML in the archive, as all files were affected in the same manner.&lt;br /&gt;&lt;blockquote&gt;&lt;b&gt;Summary:&lt;/b&gt; Combine the two archive types above but in a bad way.
Errors affect the archive as a whole: damage to any part of the bulk of the archive can corrupt content throughout, and damage to the end of the file can render it all unreadable.&lt;br /&gt;&lt;/blockquote&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/4516927983558625883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/4516927983558625883' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4516927983558625883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4516927983558625883'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/12/archive-file-resiliences.html' title='Archive file resiliences'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-280227876627147956</id><published>2008-11-14T08:50:00.000-08:00</published><updated>2008-11-14T08:53:24.716-08:00</updated><title type='text'>Beginning with RDF triplestores - a &amp;#39;survey&amp;#39;</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;Like last time, this was prompted by an email that eventually was passed to me.
It was a call for opinion - &quot;&lt;tt&gt;&lt;font color=&#39;#737373&#39;&gt;we thought we&#39;d check first to see what software&lt;/font&gt;&lt;/tt&gt;&lt;tt&gt;&lt;font color=&#39;#737373&#39;&gt; either of you recommend or use for an RDF database.&lt;/font&gt;&lt;/tt&gt;&quot;&lt;br/&gt;&lt;br/&gt;It&#39;s a good question.&lt;br/&gt;&lt;br/&gt;In fact, it&#39;s a really great question, as searching for similar advice online results in very few opinions on the subject.&lt;br/&gt;&lt;br/&gt;But which ones are best for novices? Which have the best learning curves? Which has the easiest install, or the shortest time between starting out and being able to query things?&lt;br/&gt;&lt;br/&gt;I&#39;ll try to pose as much as I can as a newcomer, which won&#39;t be too hard :) Some of the comments will be my own, and some will be comments from others, but I&#39;ll try to be as honest as I can be to reflect new-user expectation and experience and, most importantly, developer attention span. (See the end for some of my reasons for this approach.)&lt;br/&gt;&lt;br/&gt;&lt;em&gt;(Puts on newbie hat and enables PEBKAC mode.)&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Installable (local) triplestores&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Sesame&lt;/b&gt; - &lt;a href=&#39;http://www.openrdf.org/&#39;&gt;http://www.openrdf.org/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Simple menu on the left of the website, one called downloads. Great, I&#39;ll give that a whirl. &quot;Download &lt;a href=&#39;http://sourceforge.net/project/showfiles.php?group_id=46509&amp;amp;package_id=168413&#39;&gt;the latest Sesame 2.x release&lt;/a&gt;&quot; looks good to me. Hmm, 5 differently named files... I&#39;ll grab the &#39;onejar&#39; file and try to run it. &quot;Failed to load Main-Class manifest attribute from openrdf-sesame-2.2.1-onejar.jar&quot;, okay... so back to the site to find out how to install this thing.&lt;br/&gt;&lt;br/&gt;No links for installation guide...
on the &lt;a href=&#39;http://www.openrdf.org/documentation.jsp&#39;&gt;Documentation&lt;/a&gt; page, no link for installation instructions for the sesame 2.2.1 I downloaded, but there is &lt;a href=&#39;http://www.openrdf.org/doc/sesame2/users/&#39;&gt;Sesame 2 user documentation&lt;/a&gt; and &lt;a href=&#39;http://www.openrdf.org/doc/sesame2/system/&#39;&gt;Sesame 2 system documentation&lt;/a&gt;. Phew, after guessing that the user documentation might have the guide, I finally found the &lt;a href=&#39;http://www.openrdf.org/doc/sesame2/users/ch06.html&#39;&gt;installation guide&lt;/a&gt;  (system documentation was about the architecture, not how to administer the system as you might expect.)&lt;br/&gt;&lt;br/&gt;(Developer losing interest...)&lt;br/&gt;&lt;br/&gt;Ah, I see, I need the SDK. I wonder what that &#39;onejar&#39; was then... &quot;The deployment process is container-specific, please consult the&lt;br/&gt;			documentation for your container on how to deploy a web application. &quot; - right, okay... let&#39;s assume that I have a Java background and am not just a user wanting to hook into it from my language of choice, such as php, ruby, python, or dare I say it, javascript.&lt;br/&gt;&lt;br/&gt;(Only Java-friendly developers continue on)&lt;br/&gt;&lt;br/&gt;Right, got Tomcat, and put in the war file... right so, now I need to work out how to use a &lt;a href=&#39;http://www.openrdf.org/doc/sesame2/users/ch07.html#d0e354&#39;&gt;commandline&lt;/a&gt; console tool to set up a &#39;repository&#39;... does this use SVN or CVS then? Oh, it doesn&#39;t do anything unless I end the line with a period. I thought it had hung trying to connect!  &quot;Triple indexes [spoc,posc]&quot; Wha? Well, whatever that was, the test repository is created. Let&#39;s see what&#39;s at http://localhost:8080/openrdf-sesame then. &lt;br/&gt;&lt;br/&gt;&quot;You are currently accessing an OpenRDF Sesame server. 
This server is&lt;br/&gt;intended to be accessed by dedicated clients, using a specialized&lt;br/&gt;protocol. To access the information on this server through a browser,&lt;br/&gt;we recommend using the OpenRDF Workbench software.&quot;&lt;br/&gt;&lt;br/&gt;Bugger. Google for &quot;sesame clients&quot; then.&lt;br/&gt;&lt;ul&gt;&lt;li&gt;There is a Java client it seems, but it seems to need a lot to get going. Oh, and useful if my application is in Java or in a JVM (jRuby, jython)&lt;br/&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://jeenbroekstra.blogspot.com/2008/09/sesame-2-desktop-client.html&#39;&gt;http://jeenbroekstra.blogspot.com/2008/09/sesame-2-desktop-client.html&lt;/a&gt; .Net GUI... not so useful for programmatic stuff&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ul&gt;I&#39;ve pretty much given up at this point. If I knew I needed to use a triplestore then I might have persisted, but if I was just investigating it? I would&#39;ve probably given up earlier.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Mulgara&lt;/b&gt; - &lt;a href=&#39;http://www.mulgara.org/&#39;&gt;http://www.mulgara.org/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Nice, they&#39;ve given the frontpage some style, not too keen on orange, but the effort makes it look professional. &quot;&lt;em&gt;Mulgara&lt;/em&gt; is a scalable RDF database written entirely in &lt;strong&gt;&lt;a class=&#39;styleBlack&#39; href=&#39;http://java.com/&#39;&gt;Java&lt;/a&gt;&lt;/strong&gt;.&quot; -&amp;gt; Great, I found what I am looking for, and it warns me it needs Java. &quot;DOWNLOAD NOW&quot; - that&#39;s pretty clear. *click*&lt;br/&gt;&lt;br/&gt;Hmm, where&#39;s the style gone? Lots of download options, but thankfully one is marked by &quot;These released binaries are all that are required for most applications.&quot; so I&#39;ll grab &lt;a href=&#39;http://www.mulgara.org/files/v2.0.6/mulgara-2.0.6-bin.tar.gz&#39;&gt;those&lt;/a&gt;. 25Mb? Wow...&lt;br/&gt;&lt;br/&gt;Okay, it&#39;s downloaded and unpacked now. 
Let&#39;s see what we&#39;ve got - a &#39;dist/&#39; directory and two jars. Well, I guess I should try to run one (wonder what the licence is, where&#39;s the README?)&lt;br/&gt;&lt;blockquote&gt;&lt;em&gt;Mulgara Semantic Store Version 2.0.6 (Build 2.0.6.local) INFO [main] (EmbeddedMulgaraServer.java:715) - RMI Registry started automatically on port 10990 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - RMI Registry started automatically on port 1099 INFO [main] (EmbeddedMulgaraServer.java:738) - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy3 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy2008-11-14 14:06:39,899 INFO  Database - Host name aliases for this server are: [billpardy, localhost, 127.0.0.1]&lt;/em&gt;&lt;br/&gt;&lt;/blockquote&gt;Well, I guess something has started... back to the site, there is a documentation page and a wiki. A quick view of the official documentation has just confused me, is &lt;a href=&#39;http://docs.mulgara.org/&#39;&gt;this an external site&lt;/a&gt;? No easy link to something like &#39;getting started&#39; or tutorials. I&#39;ve heard of SPARQL, what&#39;s iTQL? nevermind, let&#39;s see if the &lt;a href=&#39;http://www.mulgara.org/trac/wiki&#39;&gt;wiki&lt;/a&gt; is more helpful.&lt;br/&gt;&lt;br/&gt;Let&#39;s try &#39;&lt;a href=&#39;http://www.mulgara.org/trac/wiki/Docs&#39;&gt;Documentation&lt;/a&gt;&#39; - sweet, first link looks like what I want - &lt;a href=&#39;http://www.mulgara.org/trac/wiki/WebUI&#39; class=&#39;wiki&#39;&gt;Web User Interface&lt;/a&gt;.&lt;br/&gt;&lt;blockquote&gt;A default configuration for a standalone Mulgara server runs a set of&lt;br/&gt;web services, including the Web User Interface. 
The standard&lt;br/&gt;configuration puts uses port 8080, so the web services can be seen by&lt;br/&gt;pointing a browser on the server running Mulgara to &lt;a href=&#39;http://localhost:8080/&#39; class=&#39;ext-link&#39;&gt;&lt;span class=&#39;icon&#39;&gt;http://localhost:8080/&lt;/span&gt;&lt;/a&gt;.&lt;br/&gt;&lt;/blockquote&gt;Ooo cool. *click* &lt;br/&gt;&lt;blockquote&gt;&lt;h2&gt;Available Services&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&#39;http://localhost:8080/sparql&#39;&gt;SPARQL HTTP Service&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://localhost:8080/webui&#39;&gt;User Interface&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://localhost:8080/webservices&#39;&gt;Web Services&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&#39;http://localhost:8080/tql&#39;&gt;TQL HTTP Service&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;br/&gt;SPARQL, I&#39;ve heard of that. *click* &lt;br/&gt;&lt;blockquote&gt;&lt;h2&gt;HTTP ERROR: 400&lt;/h2&gt;&lt;pre&gt;Query must be supplied&lt;/pre&gt;&lt;p&gt;RequestURI=/sparql/&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;small&gt;&lt;a href=&#39;http://jetty.mortbay.org/&#39;&gt;Powered by Jetty://&lt;/a&gt;&lt;/small&gt;&lt;/i&gt;&lt;/p&gt;&lt;/blockquote&gt;I guess that&#39;s the SPARQL api, good to know, but the frontpage could&#39;ve warned me a little. Ah, second link is to the User Interface.&lt;br/&gt;&lt;br/&gt;Good, I can use a drop down to look at lots of example queries, nice. Don&#39;t understand most of them at the moment, but it&#39;s definitely comforting to have examples. They look nothing like SPARQL though... wonder what it is? I&#39;m sure it does SPARQL... was I wrong?&lt;br/&gt;&lt;br/&gt;Quick poke at the HTML shows that it is just POSTing the query text to webui/ExecuteQuery. Looks straightforward to start hacking against too, but probably should password protect this somehow! I wonder how that is done... 
documentation mentions a &#39;&lt;tt&gt;java.security.policy&lt;/tt&gt;&#39; field:&lt;tt&gt;&lt;br/&gt;&lt;br/&gt;java.security.policy&lt;/tt&gt;&lt;i&gt;&lt;br/&gt;string: URL&lt;/i&gt;: The URL for the security policy file to use.&lt;br/&gt;Default: jar:file:/jar_path!/conf/mulgara-rmi.policy &lt;br/&gt;&lt;blockquote&gt;&lt;p/&gt;&lt;/blockquote&gt;Kinda stumped... will investigate that later, but at least there&#39;s hope. Just firing off the example queries shows me stuff, though, so I&#39;ve got something to work with at least.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Jena&lt;/b&gt; - &lt;a href=&#39;http://jena.sourceforge.net/&#39;&gt;http://jena.sourceforge.net/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Front page is pretty clear, even if I don&#39;t understand what all those acronyms are. &lt;a href=&#39;http://jena.sourceforge.net/downloads.html&#39;&gt;downloads&lt;/a&gt; link takes me to a page with an obvious download link, good. (Oh, and sourceforge, you suck. How many frikkin mirrors do I have to try to get this file?)&lt;br/&gt;&lt;br/&gt;Have to put Jena on pause while Sourceforge sorts its life out.&lt;br/&gt;&lt;br/&gt;&lt;b&gt;ARC2&lt;/b&gt; - http://arc.semsol.org/&lt;br/&gt;&lt;br/&gt;Frontpage: &quot;Easy RDF and SPARQL for LAMP systems&quot; Nice, I know of LAMP and I particularly like the word Easy. Let&#39;s see... &lt;a href=&#39;http://arc.semsol.org/download&#39;&gt;Download&lt;/a&gt; is easy to find, and tells me straight away I need PHP 4.3+ and MySQL 4.0.4+ *check* Right, now how do I enable PHP for apache again?... Ah, it helps if I install it first... Okay, done. Dropping the folder into my web space... Hmm, nothing does anything. From the documentation, it does look like it is geared to providing a PHP library framework for working with its triplestore and RDF. Hang on, &lt;a href=&#39;http://arc.semsol.org/docs/v2/endpoint&#39;&gt;SPARQL Endpoint Setup&lt;/a&gt; looks like what I want. It wants a database, okay...
done, bit of a hassle though.&lt;br/&gt;&lt;br/&gt;Hmm, all I get is &quot;&lt;b&gt;Fatal error&lt;/b&gt;:  Call to undefined function mysql_connect() in &lt;b&gt;/********/arc2/store/ARC2_Store.php&lt;/b&gt; on line &lt;b&gt;53&quot;&lt;br/&gt;&lt;/b&gt;&lt;br/&gt;Of course, install php libraries to access mysql (PEBKAC)... done and I also realise I need to set up the store, like the example in &quot;&lt;a href=&#39;http://arc.semsol.org/docs/v2/getting_started&#39;&gt;Getting Started&lt;/a&gt;&quot;... done (with &lt;a href=&#39;http://pastebin.com/f2ca379e7&#39;&gt;this&lt;/a&gt;) and what does the index page now look like?&lt;br/&gt;&lt;br/&gt;&lt;img src=&#39;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNeI5IAK7eEhZs56No-MtI6kUtQCOFPC0rH1EcIIskZNLPFAbdmZg9RWt37NQrOAslhOdSdiIZZs0uOpV4tOReke5C23YrmVtxVVHarDLtzMnOMe9ravP1h5GbxHDrlmOlXuqaEX-WpKs/?imgmax=800&#39; style=&#39;max-width: 800px;&#39;/&gt;&lt;br/&gt;&lt;br/&gt;Yay! there&#39;s like SPARQL and stuff... I guess &#39;load&#39; and &#39;insert&#39; will help me stick stuff in, and &#39;select&#39; looks familiar... Well, it seems to be working at least.&lt;br/&gt;&lt;br/&gt;Unfortunately, it looks like the Jena download from sourceforge is in a world of FAIL for now. 
Maybe I&#39;ll look at it next time?&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Triplestores in the cloud&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Talis Platform - &lt;a href=&#39;http://www.talis.com/platform/&#39;&gt;http://www.talis.com/platform/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;From the frontpage - &quot;&lt;i&gt;Developers using the Platform can spend more of their time building&lt;br /&gt;extraordinary applications and less of their time worrying about how&lt;br /&gt;they will scale their data storage.&lt;/i&gt;&quot; - pretty much what I wanted to hear, so how do I get to play with it?&lt;br/&gt;&lt;br/&gt;There is a &lt;a title=&#39;Get involved&#39; href=&#39;http://www.talis.com/platform/get_involved/index.shtml&#39;&gt;Get involved&lt;/a&gt; link on the left, which rapidly leads me to see the section: &quot;Develop, play and try out&quot; - &lt;a href=&#39;http://n2.talis.com/wiki/Main_Page&#39;&gt;n&lt;sup&gt;2&lt;/sup&gt; developer community &lt;/a&gt; seems to be where it wants me to go. &lt;br/&gt;&lt;br/&gt;Lots of links on the frontpage, takes a few seconds to spot: &quot;&lt;a title=&#39;Join&#39; href=&#39;http://n2.talis.com/wiki/Join&#39;&gt;Join&lt;/a&gt; - join the n² community to get free developer stores and online support&quot; - free, nice word that. So, I just have to email someone? Okay, I can live with that.&lt;br/&gt;&lt;br/&gt;Documentation seems good; lots of choices though, and a little hard to spot a single thread to follow to get up to speed, but &lt;a title=&#39;Guides and Tutorials&#39; href=&#39;http://n2.talis.com/wiki/Guides_and_Tutorials&#39;&gt;Guides and Tutorials&lt;/a&gt; looks right to get going with.
The &lt;a href=&#39;http://n2.talis.com/wiki/Kniblet_Tutorial&#39;&gt;Kniblet tutorial&lt;/a&gt; (whatever a kniblet is) looks the most beginnerish, and it&#39;s also very PHP focussed, which is either a good thing or a bad thing depending on the user :)&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Commercial triplestores&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;OpenLink Virtuoso&lt;/b&gt; - &lt;a href=&#39;http://virtuoso.openlinksw.com/&#39;&gt;http://virtuoso.openlinksw.com/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Okay, I tried the &lt;a href=&#39;http://download.openlinksw.com/download/&#39;&gt;Download&lt;/a&gt; link, but I am pretty confused by what I&#39;m greeted with: &lt;br/&gt;&lt;br/&gt;&lt;img src=&#39;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_D-wyVtVty9qsvu6fW0uisFn0iucbbJVDr1DBfO_J_IzwuVMM30tliATyf0XSzwfydxS7e_Cv_ZKt5TchOqyg97vdO9EmEQ3xTePstxOcxllATP6qqhS9t4Bg51mAt6UimQytcmFBXKQ/?imgmax=800&#39; style=&#39;max-width: 800px;&#39;/&gt;&lt;br/&gt;&lt;br/&gt;Not sure which one to pick just to try it out; it&#39;s late in the day, and my tolerance for all things installable has ended.&lt;br/&gt;&lt;br/&gt;-----------------------------------------&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Why take the http/web-centric, newbie approach to looking at these?&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;i&gt;Answer: &lt;/i&gt;In part, I am taking this approach because I have a deep belief that it&lt;br/&gt;was only after relational DBs became commoditised - &quot;You want fries&lt;br/&gt;with your MySQL database?&quot; - that the dynamic web kicked off. If we want&lt;br/&gt;the semantic web to kick off, we need to commoditise it or at least make&lt;br/&gt;it very easy for developers to get started. And I mean &lt;b&gt;&lt;i&gt;EASY&lt;/i&gt;&lt;/b&gt;. A query that I want answered is: &quot;Is there something that fits: &#39;apt-get install&lt;br/&gt;triplestore; r = store(&#39;localhost&#39;), r.add(rdf), r.query(blah)&#39;?
&quot; &lt;br/&gt;&lt;br/&gt;(I am particularly interested to see what happens when &lt;a href=&#39;http://tom.opiumfield.com/&#39;&gt;Tom Morris&lt;/a&gt;&#39;s work on &lt;a href=&#39;http://github.com/tommorris/reddy/tree/master&#39;&gt;Reddy&lt;/a&gt; collides with ActiveRecord or activerdf...)&lt;br/&gt;&lt;br/&gt;&lt;b&gt;NB &lt;/b&gt;I&#39;ve short circuited the discovery of software homepages - Imagine&lt;br/&gt;I&#39;ve seen projects stating that they use &quot;XXXXX as a triplestore&quot;. I know&lt;br/&gt;this will likely mean I&#39;ve compared apples to oranges, but as a newbie, how&lt;br/&gt;would I be expected to know this? &quot;Powered by the Talis Platform&quot; and&lt;br/&gt;&quot;Powered by Jena&quot; seem pretty similar on the surface.)&lt;br/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/280227876627147956/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/280227876627147956' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/280227876627147956'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/280227876627147956'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/11/beginning-with-rdf-triplestores.html' title='Beginning with RDF triplestores - a &amp;#39;survey&amp;#39;'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" 
url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNeI5IAK7eEhZs56No-MtI6kUtQCOFPC0rH1EcIIskZNLPFAbdmZg9RWt37NQrOAslhOdSdiIZZs0uOpV4tOReke5C23YrmVtxVVHarDLtzMnOMe9ravP1h5GbxHDrlmOlXuqaEX-WpKs/s72-c?imgmax=800" height="72" width="72"/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-7367532608884117842</id><published>2008-11-13T06:30:00.000-08:00</published><updated>2008-11-13T06:35:05.914-08:00</updated><title type='text'>A Fedora/Solr Digital Library for Oxford&amp;#39;s &amp;#39;Forced Migration Online&amp;#39;</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;(mods:subtitle - Slightly more technical follow-up to the &lt;a href=&#39;http://expertvoices.nsdl.org/hatcheck/2008/11/06/a-fedorasolr-digital-library-for-oxfords-forced-migration-online/&#39;&gt;Fedora Hatcheck piece&lt;/a&gt;.)&lt;br/&gt;&lt;br/&gt;As I have been prompted via email by Phil Cryer (of the &lt;a href=&#39;http://mobot.org/&#39;&gt;Missouri Botanical Garden&lt;/a&gt;) to talk more about how this technically works, I thought it would be best to make it a written post, rather than the more limited email response.&lt;br/&gt;&lt;br/&gt;&lt;big&gt;Background&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Forced Migration Online (FMO) had a proprietary system, supporting their document needs. It was originally designed for newspaper holdings and applied that model to encoding the mostly paginated documents that FMO held - such that each part was broken up into paragraphs of text, images and the location of all these parts on a page. It even encoded (in its own format) the location of the words on the page when it OCR&#39;d the documents, making per-word highlighting possible.
Which is nice.&lt;br/&gt;&lt;br/&gt;However, the backend that powered this was overpriced, and FMO wanted to move to a more open, sustainable platform.&lt;br/&gt;&lt;br/&gt;&lt;big&gt;Enter the DAMS&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;(DAMS = Digital Asset Management System)&lt;br/&gt;&lt;br/&gt;I have been doing work on trying to make a service out of a base of fedora-commons and additional &#39;plugin&#39; services, such as the wonderful Apache Solr and the useful eXist XML db. The end aim is for departments/users/whoever to requisition a &#39;store&#39; with a certain quality of service (Solr attached, 50Gb+ etc) but this is not yet an automated process.&lt;br/&gt;&lt;br/&gt;The focus for the store is a very clear separation between storage, management, indexing services and distribution: normal filesystems or Sun Honeycomb provide the storage, Fedora-commons provides the management + CRUD, Solr, eXist, Mulgara, Sesame and CouchDB can provide potential index and query services, and distribution is handled pragmatically, caching outgoing content and mirroring where necessary.&lt;br/&gt;&lt;br/&gt;&lt;big&gt;The FMO &#39;store&#39;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;From discussions with FMO, and examining the information they held and the way they wished to make use of it, a simple Fedora/Solr store seemed to fulfil what they wanted: a persistent store of items with attachments and the ability to search the metadata and retrieve results.&lt;br/&gt;&lt;br/&gt;&lt;big&gt;Bring in the consultants&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;FMO hired Aptivate to migrate their data from the proprietary system, in its custom format, to a Fedora/Solr store, trying as much as possible to retain the functionality they had.&lt;br/&gt;&lt;br/&gt;Some points that I think it is important to impress on people here:&lt;br/&gt;&lt;ul&gt;&lt;li&gt;In general, software engineer consultants don&#39;t understand METS or FOXML.&lt;/li&gt;&lt;li&gt;They *really* don&#39;t understand the point
of disseminators.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;Having to teach software engineer consultants to do METS/FOXML/bDef&#39;s etc is likely an arduous and costly task.&lt;/li&gt;&lt;li&gt;Consultants charge a lot of money to do things their team doesn&#39;t already have the experience to do.&lt;/li&gt;&lt;/ul&gt;So, my conclusion was not to make these things part of the development at all, to the extent that I might even have forgotten to mention them except in passing. I helped them install their own local store and helped them with the various interfaces and gotchas of the two software packages. By showing them how I use Fedora and Solr in ora.ouls.ox.ac.uk, they were able to hit the ground running.&lt;br/&gt;&lt;br/&gt;They began by using the REST interface to Fedora and the RESTful interface to Solr. Starting with the simple put/get REST interface to Fedora let them concentrate on getting used to the nature of Fedora as an object store. I think they moved to use the SOAP interface as it better suited their Java background, although I cannot be certain as it wasn&#39;t an issue that came up.&lt;br/&gt;&lt;br/&gt;Once they had developed the migration scripts to their satisfaction, they asked me to give them a store, which I did (but due to hardware and stupid support issues here I am afraid to say I held them up on this.) They fired off their scripts and moved all the content into Fedora with a straightforward layout per object (PDF, metadata, fulltext and thumbnail). The metadata is - from what I can see - the same XML metadata as before - very MARCXML in nature, with &#39;Application_Info&#39; elements having types like &#39;bl:DC.Title&#39;. If necessary, we will strip out the Dublin Core metadata and put what we can into the DC datastream, but that&#39;s not of particular interest to FMO right now. &lt;br/&gt;&lt;br/&gt;&lt;big&gt;Fedora/Solr notes&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;As for the link between Solr and Fedora?
This is very loosely coupled: they are running in the same Tomcat container for convenience, but aren&#39;t linked in a hard way. &lt;br/&gt;&lt;br/&gt;I&#39;ve looked at GSearch, which is great for a homogeneous collection of items, such that they can be acted on by the same XSLT to produce a suitable record for Solr, but as the metadata was a complete unknown for this project, it wasn&#39;t too suitable.&lt;br/&gt;&lt;br/&gt;Currently, they have one main route into the Fedora store, and so, it isn&#39;t hard to simply reindex an item after a change is made, especially for services such as Solr or eXist, which expect to have things change incrementally. I am looking at services such as ActiveMQ for scheduling these index tasks, but more and more I am starting to favour RabbitMQ, which seems to be more useful while retaining scalability and a very robust nature.&lt;br/&gt;&lt;br/&gt;Sending an update to Solr is as simple as an HTTP POST to its /update service, consisting of an XML or JSON packet like &quot;<add><field name='id'>changeme:1</field><field name='author'>John Smith</field>....</add>&quot; - it uses a transactional model, such that you can push all the changes and additions into the live index via a commit call, without taking the index offline. To query Solr, all manner of clients exist, and it is built to be very simple to interact with, handling facet queries, filtering and ordering, and it can deliver the results in XML, JSON, PHP or Python directly. It can even do an XSLT transform of the results on the way out, leading to a trivial way to support OpenSearch, Atom feeds and even HTML blocks for embedding in other sites.&lt;br/&gt;&lt;br/&gt;Likewise, changing a PDF in Fedora can be done with an HTTP POST as well.
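The Solr update call described above can be sketched in a few lines of Python. This is a hedged sketch, not a drop-in client: the endpoint URL and the bare-JSON-list document form are assumptions that vary by Solr version and setup.

```python
import json
from urllib import request

SOLR_UPDATE = 'http://localhost:8080/solr/update/json'  # hypothetical endpoint

def build_add(docs):
    # Each document is just a dict of field names to values; many Solr
    # versions accept a bare JSON list of such documents on /update.
    return json.dumps(docs).encode('utf-8')

def post_update(docs, commit=True):
    # POST the documents; committing pushes them into the live index
    # straight away, without taking the index offline.
    url = SOLR_UPDATE + ('?commit=true' if commit else '')
    req = request.Request(url, data=build_add(docs),
                          headers={'Content-Type': 'application/json'})
    return request.urlopen(req).read()
```

Calling post_update with a list like [{'id': 'changeme:1', 'author': 'John Smith'}] mirrors the XML packet shown above; the commit flag is what makes the change visible in the live index.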
Does it need to be more complicated?&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Last, but not least, a project to watch closely:&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The &lt;a href=&#39;http://ice.usq.edu.au/projects/fascinator/trac&#39;&gt;Fascinator project&lt;/a&gt;, funded by  &lt;a class=&#39;ext-link&#39; href=&#39;http://www.arrow.edu.au/&#39;&gt;&lt;span class=&#39;icon&#39;&gt;ARROW&lt;/span&gt;&lt;/a&gt;, as part of their mini project scheme, is an Apache &lt;a class=&#39;ext-link&#39; href=&#39;http://lucene.apache.org/solr/&#39;&gt;&lt;span class=&#39;icon&#39;&gt;Solr&lt;/span&gt;&lt;/a&gt; front end to the &lt;a class=&#39;ext-link&#39; href=&#39;http://www.fedora-commons.org/&#39;&gt;&lt;span class=&#39;icon&#39;&gt;Fedora commons&lt;/span&gt;&lt;/a&gt; repository. The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. Well worth a look, as it seeks to turn this Fedora/Solr pairing truly into an appliance, with a simple installer and handling the linkage between the two.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/7367532608884117842/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/7367532608884117842' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7367532608884117842'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7367532608884117842'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/11/fedorasolr-digital-library-for-oxford.html' title='A Fedora/Solr Digital Library for Oxford&amp;#39;s &amp;#39;Forced Migration Online&amp;#39;'/><author><name>Ben 
O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-4505259570218961302</id><published>2008-11-13T03:17:00.000-08:00</published><updated>2008-11-13T03:19:28.268-08:00</updated><title type='text'>OCLC - viral licence being added to WorldCat data</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;&lt;b&gt;Very short post on this, as I just wanted to highlight a fantastic piece written by &lt;a href=&#39;http://dynamicorange.com/2008/11/06/oclc-record-usage-copyright-contracts-and-the-law/&#39;&gt;Rob Styles about OCLC&#39;s policy changes to WorldCat&lt;/a&gt;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;In a nutshell, it seems that OCLC’s policy changes are intended to restrict the usage of the data in order to prevent competing services from appearing, services such as &lt;a href=&#39;http://www.librarything.com/&#39;&gt;LibraryThing&lt;/a&gt; and the &lt;a href=&#39;http://www.archive.org/&#39;&gt;Internet Archive&lt;/a&gt;&#39;s &lt;a href=&#39;http://openlibrary.org/&#39;&gt;Open Library&lt;/a&gt;. Unfortunately, it seems that the changes will also impinge on the rights of users to collate citation lists in software such as Zotero, EndNote and others. 
Read Rob&#39;s post for a well researched view on the changes.&lt;br/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/4505259570218961302/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/4505259570218961302' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4505259570218961302'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/4505259570218961302'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/11/oclc-viral-licence-being-added-to.html' title='OCLC - viral licence being added to WorldCat data'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-6970851823691270308</id><published>2008-11-12T07:40:00.000-08:00</published><updated>2008-11-12T07:42:40.838-08:00</updated><title type='text'>Useful, interesting, inspiring technology/software that is out there that you might not know about.</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;(I guess this is more like a filtered link list, but with added comments in case you don&#39;t feel like following the links to find out what it&#39;s all about.. 
A mix of old, but solid links and a load of tabs that I really should close ;))&lt;br/&gt;&lt;ol&gt;&lt;li&gt;Tahoe - &lt;a href=&#39;http://allmydata.org/%7Ewarner/pycon-tahoe.html&#39;&gt;http://allmydata.org/~warner/pycon-tahoe.html&lt;/a&gt;&lt;blockquote&gt;The &quot;Tahoe&quot; project is a &lt;b&gt;distributed filesystem&lt;/b&gt;, which &lt;b&gt;safely stores files on multiple machines to protect against hardware failures&lt;/b&gt;. Cryptographic tools are used to ensure integrity and confidentiality, and a decentralized architecture minimizes single points of failure. Files can be accessed through a web interface or native system calls (via FUSE). Fine-grained sharing allows individual files or directories to be delegated by passing short URI-like strings through email. Tahoe grids are easy to set up, and can be used by a handful of friends or by a large company for thousands of customers.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;CouchDB - &lt;a href=&#39;http://incubator.apache.org/couchdb/&#39;&gt;http://incubator.apache.org/couchdb/&lt;/a&gt;&lt;blockquote&gt;&lt;p&gt;Apache CouchDB is a &lt;b&gt;distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API&lt;/b&gt;. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.&lt;/p&gt;&lt;p&gt;CouchDB is written in &lt;a href=&#39;http://erlang.org/&#39;&gt;Erlang&lt;/a&gt;, but can be easily accessed from any environment that provides means to make HTTP requests. 
There are a multitude of third-party client libraries that make this even easier for a variety of programming languages and environments.&lt;/p&gt;&lt;/blockquote&gt;  &lt;/li&gt;&lt;li&gt;Yahoo Term Extractor - &lt;a href=&#39;http://developer.yahoo.com/search/content/V1/termExtraction.html&#39;&gt;http://developer.yahoo.com/search/content/V1/termExtraction.html&lt;/a&gt;&lt;blockquote&gt;The Term Extraction Web Service &lt;b&gt;provides a list of significant words or phrases extracted from a larger content.&lt;/b&gt;&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Kea term extractor (SKOS enabled) - &lt;a href=&#39;http://www.nzdl.org/Kea/&#39;&gt;http://www.nzdl.org/Kea/&lt;/a&gt;&lt;blockquote&gt;KEA is &lt;b&gt;an algorithm for extracting keyphrases from text documents.&lt;/b&gt;         It can be either used for &lt;b&gt;free indexing&lt;/b&gt; or for &lt;b&gt;indexing with         a controlled vocabulary&lt;/b&gt;.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Piwik - &lt;a href=&#39;http://piwik.org/&#39;&gt;http://piwik.org/&lt;/a&gt;&lt;blockquote&gt;&lt;p&gt;piwik is an open source (GPL license) &lt;b&gt;web analytics software&lt;/b&gt;. It gives interesting reports on your website visitors, your popular pages, the search engines keywords they used, the language they speak… and so much more. &lt;b&gt;piwik aims to be an open source alternative to &lt;a target=&#39;_blank&#39; href=&#39;http://www.google.com/analytics&#39; rel=&#39;nofollow&#39;&gt;Google Analytics&lt;/a&gt;.&lt;/b&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;RabbitMQ - &lt;a href=&#39;http://www.rabbitmq.com/&#39;&gt;http://www.rabbitmq.com/&lt;/a&gt;&lt;blockquote&gt;RabbitMQ is an Open (Mozilla licenced) implementation of &lt;a href=&#39;http://www.rabbitmq.com/faq.html#what-is-amqp&#39;&gt;AMQP&lt;/a&gt;, the emerging standard for &lt;b&gt;high performance enterprise messaging&lt;/b&gt;. Built on top of the Open Telecom Platform (OTP). 
OTP is used by multiple telecommunications companies 	      to manage switching exchanges for voice calls, VoIP and now video.  These systems are designed to never go down and to handle truly vast user loads.  And because the systems cannot be taken offline, they have to be very flexible, for instance it must be possible to &#39;hot deploy&#39; features and fixes on the fly whilst managing a consistent user SLA.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Talis Platform - &lt;a href=&#39;http://n2.talis.com/wiki/Main_Page&#39;&gt;http://n2.talis.com/wiki/Main_Page&lt;/a&gt;&lt;blockquote&gt;The Talis Platform provides &lt;b&gt;solid infrastructure for building Semantic Web applications&lt;/b&gt;. Delivered as Software as a Service (SaaS), it dramatically reduces the complexity and cost of storing, indexing, searching and augmenting data. It enables applications to be brought to market rapidly with a smaller initial investment. Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;The Fascinator - &lt;a href=&#39;http://ice.usq.edu.au/projects/fascinator/trac&#39;&gt;http://ice.usq.edu.au/projects/fascinator/trac&lt;/a&gt;&lt;blockquote&gt;The goal of the project is to create &lt;b&gt;a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security&lt;/b&gt;. This contrasts with solutions that use RDF for browsing by ‘collection’, XACML for security and a text indexer for fulltext search, and in some cases relational database tables as well. We wanted to see if taking out some of these layers makes for a fast application which is easy to configure. 
So far so good.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;RDFQuery - &lt;a href=&#39;http://code.google.com/p/rdfquery/&#39;&gt;http://code.google.com/p/rdfquery/&lt;/a&gt;&lt;blockquote&gt;&lt;p&gt;&lt;a name=&#39;Introduction&#39;&gt;This project is for plugins for &lt;/a&gt;&lt;a rel=&#39;nofollow&#39; href=&#39;http://jquery.com/&#39;&gt;jQuery&lt;/a&gt; that enable you to &lt;b&gt;generate and manipulate &lt;a rel=&#39;nofollow&#39; href=&#39;http://www.w3.org/TR/REC-rdf-syntax/&#39;&gt;RDF&lt;/a&gt; within web pages [in javascript]&lt;/b&gt;. There are two main aims of this project: 1) to &lt;b&gt;provide a way of querying and manipulating RDF triples within Javascript&lt;/b&gt; that is as intuitive as jQuery is for querying and manipulating a DOM, and 2) to &lt;b&gt;provide a way of gleaning RDF from elements within a web page&lt;/b&gt;, whether that RDF is represented as &lt;a rel=&#39;nofollow&#39; href=&#39;http://www.w3.org/TR/xhtml-rdfa-primer/&#39;&gt;RDFa&lt;/a&gt; or with a &lt;a rel=&#39;nofollow&#39; href=&#39;http://microformats.org/&#39;&gt;microformat&lt;/a&gt;. &lt;/p&gt;&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;eXist - &lt;a href=&#39;http://exist.sourceforge.net/&#39;&gt;http://exist.sourceforge.net/ &lt;br/&gt;&lt;/a&gt;&lt;blockquote&gt;eXist-db is an open source database management system entirely built on XML                technology. &lt;b&gt;It stores XML data according to the XML data model and features                efficient, &lt;a href=&#39;http://www.w3.org/TR/xquery/&#39;&gt;index-based XQuery&lt;/a&gt;&lt;/b&gt; processing. 
It supports: &lt;br/&gt;&lt;/blockquote&gt;&lt;blockquote&gt; &lt;ul&gt;&lt;li&gt; &lt;a href=&#39;http://www.w3.org/TR/xquery/&#39;&gt;XQuery 1.0&lt;/a&gt; / &lt;a href=&#39;http://www.w3.org/TR/xpath20/&#39;&gt;XPath 2.0&lt;/a&gt; &lt;/li&gt;&lt;li&gt; &lt;a href=&#39;http://www.w3.org/TR/xslt&#39;&gt;XSLT 1.0&lt;/a&gt; (using Apache Xalan) or &lt;a href=&#39;http://www.w3.org/TR/xslt20&#39;&gt;XSLT 2.0&lt;/a&gt; (optional using &lt;a href=&#39;http://saxon.sourceforge.net/&#39;&gt;Saxon&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;HTTP interfaces: &lt;a href=&#39;http://en.wikipedia.org/wiki/Representational_State_Transfer&#39;&gt;REST&lt;/a&gt;, &lt;a href=&#39;http://www.webdav.org/&#39;&gt;WebDAV&lt;/a&gt;, &lt;a href=&#39;http://www.w3.org/TR/soap/&#39;&gt;SOAP&lt;/a&gt;, &lt;a href=&#39;http://en.wikipedia.org/wiki/XML-RPC&#39;&gt;XMLRPC&lt;/a&gt;, &lt;a href=&#39;http://en.wikipedia.org/wiki/Atom_%28standard%29&#39;&gt;Atom Publishing                            Protocol&lt;/a&gt; &lt;/li&gt;&lt;li&gt;XML database specific: XMLDB, XQJ/&lt;a href=&#39;http://jcp.org/en/jsr/detail?id=225&#39;&gt;JSR-225&lt;/a&gt; (under development), &lt;a href=&#39;http://xmldb-org.sourceforge.net/xupdate/&#39;&gt;XUpdate&lt;/a&gt;, XQuery &lt;a href=&#39;http://exist.sourceforge.net/update_ext.html&#39;&gt;update extensions&lt;/a&gt; (to be aligned with the new &lt;a href=&#39;http://www.w3.org/TR/xquery-update-10/&#39;&gt;XQuery Update Facility 1.0&lt;/a&gt; &lt;/li&gt;&lt;/ul&gt; &lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Evergreen - &lt;a href=&#39;http://evergreen-ils.org/&#39;&gt;http://evergreen-ils.org/&lt;/a&gt;&lt;blockquote&gt;Evergreen is an enterprise-class [Open Source] &lt;strong&gt;library automation system&lt;/strong&gt;&lt;br /&gt;that helps library patrons find library materials, and helps libraries&lt;br /&gt;manage, catalog, and circulate those materials, no matter how large or&lt;br /&gt;complex the libraries.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Apache Solr - &lt;a 
href=&#39;http://lucene.apache.org/solr/&#39;&gt;http://lucene.apache.org/solr/ &lt;br/&gt;&lt;/a&gt;&lt;blockquote&gt;Solr is an &lt;b&gt;open source enterprise search server based on the&lt;br /&gt;        &lt;a href=&#39;http://lucene.apache.org/java/&#39;&gt;Lucene Java&lt;/a&gt; search library&lt;/b&gt;, with XML/HTTP and JSON APIs,&lt;br /&gt;        hit highlighting, faceted search, caching, replication, a web administration interface and many more features.&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;GATE - &lt;a href=&#39;http://gate.ac.uk/&#39;&gt;http://gate.ac.uk/&lt;/a&gt;&lt;blockquote&gt;GATE is a leading toolkit for text-mining. It bills itself as &quot;the &lt;a href=&#39;http://eclipse.org/&#39;&gt;Eclipse&lt;/a&gt; of Natural Language Engineering,&lt;br /&gt;the &lt;a href=&#39;http://jakarta.apache.org/lucene/&#39;&gt;Lucene&lt;/a&gt; of &lt;a href=&#39;http://gate.ac.uk/ie/index.html&#39;&gt;Information Extraction&quot;&lt;/a&gt; [NB I have yet to use this, but it has the kind of provenance and userbase that makes me feel okay about sharing this link]&lt;br/&gt;&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Ubuntu JeOS - &lt;a href=&#39;http://www.ubuntu.com/products/whatisubuntu/serveredition/jeos&#39;&gt;http://www.ubuntu.com/products/whatisubuntu/serveredition/jeos &lt;br/&gt;&lt;/a&gt;&lt;blockquote&gt;&lt;b&gt;Ubuntu Server Edition JeOS (pronounced &quot;Juice&quot;) is an efficient variant&lt;br /&gt;of our server operating system, configured specifically for virtual&lt;br /&gt;appliances&lt;/b&gt;. 
Currently available as a CD-Rom ISO for download, JeOS is a&lt;br /&gt;specialised installation of Ubuntu Server Edition with a tuned kernel&lt;br /&gt;that only contains the base elements needed to run within a virtualized&lt;br /&gt;environment.&lt;br /&gt;&lt;/blockquote&gt;&lt;/li&gt;&lt;li&gt;Synergy2 - &lt;a href=&#39;http://synergy2.sourceforge.net/&#39;&gt;http://synergy2.sourceforge.net/ &lt;br/&gt;&lt;/a&gt;&lt;blockquote&gt;Synergy lets you easily &lt;b&gt;share a single mouse and keyboard between&lt;br /&gt;multiple computers with different operating systems, each with its&lt;br /&gt;own display, without special hardware&lt;/b&gt;.  It&#39;s intended for users&lt;br /&gt;with multiple computers on their desk since each system uses its&lt;br /&gt;own monitor(s).&lt;/blockquote&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/6970851823691270308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/6970851823691270308' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6970851823691270308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/6970851823691270308'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/11/useful-interesting-inspiring.html' title='Useful, interesting, inspiring technology/software that is out there that you might not know about.'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-1100671889757085657</id><published>2008-10-24T08:09:00.000-07:00</published><updated>2008-10-24T08:12:45.179-07:00</updated><title type='text'>&amp;quot;Expressing Argumentative Discussions in Social Media sites&amp;quot; - and why I like it</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;At the moment, there are a lot of problems with the way information is cited or referenced in paper-based research articles, etc., and I am deeply concerned that this model is being applied to digitally held content. I am by no means the first to say the following, and people have put a lot more thought into this field, but I can&#39;t find a way out of the situation using the information that is captured. I find it easier to see the holes when I express it to myself in rough web metaphors:&lt;br/&gt;&lt;ul&gt;&lt;li&gt;A reference list is a big list of &#39;bookmarks&#39;, generally ordered by where they appear in the research paper.&lt;/li&gt;&lt;li&gt;Sometimes a bookmark is included because everyone else in the field includes the same one, a core text for example.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;There are no tags, no comments and no numbers on how often a given bookmark is used. &lt;br/&gt;&lt;/li&gt;&lt;li&gt;There is no intentional way to find out how often the bookmark appears in other people&#39;s lists. This has to be reverse-engineered.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;There is no way to see from the list what the intent of the reference is, whether the author agrees, disagrees, refutes, or relies on the thing in question. &lt;br/&gt;&lt;/li&gt;&lt;li&gt;There are no anchor tags in the referenced articles (normally), so there is little way to reliably refer to a single quote in a text. 
Articles are often referenced as a whole, rather than to the line, chart, or paragraph.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;The bookmark format varies from publisher to publisher, and from journal to journal.&lt;/li&gt;&lt;li&gt;Due to the coarse-grained citation, a single reference will sometimes be used when the author refers to multiple parts of a given piece of work.&lt;/li&gt;&lt;/ul&gt;Now, on a much more positive note, this issue is being tackled. At the VoCamp in Oxford, I talked with CaptSolo about the developments with SIOC and their idea to extend the vocabularies to deal with argumentative discourse. &lt;a href=&#39;http://sdow2008.semanticweb.org/#program&#39;&gt;Their paper is now online at the SDoW2008 site&lt;/a&gt; (or directly to the &lt;a href=&#39;http://ceur-ws.org/Vol-405/paper4.pdf&#39;&gt;pdf&lt;/a&gt;). The essence of this is an extension to the SIOC vocab, recording the intent of a statement, such as Idea, Issue or Elaboration, as well as recording an Argument.&lt;br/&gt;&lt;br/&gt;I (maybe naïvely) have felt that there is a direct parallel between social discourse and academic discourse, to the point where I used the sioc:has_reply property to connect links made in blogs to items held in the archive (using trackbacks and pingbacks, a system in hiatus until I get time to beef up the antispam/moderation facilities). So, to see an argumentation vocab developing makes me happier :) Hopefully, we can extend this vocab&#39;s intention with more academic-focussed terms.&lt;br/&gt;&lt;br/&gt;What about the citation vocabularies that exist? 
I think that those that I have looked at suffer from the same issue - they are built to represent what exists in the paper-world, rather than what could exist in the web-world.&lt;br/&gt;&lt;br/&gt;I also want to point out the work of the &lt;a href=&#39;http://imageweb.zoo.ox.ac.uk/wiki/index.php/Spider_Project&#39;&gt;Spider project&lt;/a&gt;, which aims to semantically mark up a single journal article, as they have taken significant steps towards showing what could be possible with enhanced citations. Take a look at their &lt;a href=&#39;http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest&#39;&gt;enhanced article&lt;/a&gt;; it has all sorts of very useful examples of what is possible. Pay special attention to how the &lt;a href=&#39;http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/#refs&#39;&gt;references&lt;/a&gt; are shown, how they can be reordered, typed and so on. Note that I am able to link to the references section in the first place! The part I really find useful is demonstrated by the two references in red in the 2nd paragraph of the &lt;a href=&#39;http://imageweb.zoo.ox.ac.uk/pub/2008/plospaper/latest/#s1&#39;&gt;introduction.&lt;/a&gt; Hover over them to find out what I mean. Note that even though the two references are the same in the reference list (due to this starting as a paper-version article) they have been enhanced to pop up the reasons and sections referred to in each case.&lt;br/&gt;&lt;br/&gt;In summary then, please think twice when compiling a sparse reference list! 
Quote the actual section of text if you can and Harvard format be damned ;)&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/1100671889757085657/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/1100671889757085657' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/1100671889757085657'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/1100671889757085657'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/10/argumentative-discussions-in-social.html' title='&amp;quot;Expressing Argumentative Discussions in Social Media sites&amp;quot; - and why I like it'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-2547800834242596365</id><published>2008-10-20T08:32:00.000-07:00</published><updated>2008-10-20T08:46:59.507-07:00</updated><title type='text'>Modelling and storing a phonetics database inside a store</title><content type='html'>&lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;Long story short, a researcher asked us to store and disseminate a DVD containing ~600+ audio files and associated analyses comprising a phonetics database, focussed on the beat of speech.&lt;br /&gt;&lt;br /&gt;This was the request that let me start on something I had planned for a while: a databank of long-tail data. 
This is data that is too small to fit into the plans of Big Data (which have IMO a vanishingly small userbase for reuse) and too large and complex to sit as archived lumps on the web. The system supporting the databank is a Fedora Commons install, with a basic Solr instance for indexing.&lt;br /&gt;&lt;br /&gt;As I haven&#39;t gotten an IP address for the databank machine, I cannot link to things yet, but I will walk through the modelling process. (I&#39;ll add in links to the raw objects once I have a networked VM for the databank.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Analysis: &quot;What have we got here?&quot;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The dataset has been given to us by a researcher called &lt;a href=&quot;http://kochanski.org/gpk/&quot;&gt;Dr. Greg Kochanski&lt;/a&gt;, the data having been burnt onto a DVD-R. He called it the &quot;2008 Oxford Tick1 corpus&quot;. A quick look at the contents showed that it was a collection of files, grouped by file folder into a hierarchy of some sort. First things first, though - this is a DVD-R and very much prone to degradation. As it was the files on the disc that are important, rather than the disc image itself, a quick &quot;tar -cvzf Tick1_backup.tar.gz /media/cdrom0/&quot; made a zipped archive of the files. 
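The same backup step can be sketched with Python's standard library; the checksum manifest is my addition (not part of the original one-liner), there so the copy can be integrity-checked later:

```python
import hashlib
import tarfile
from pathlib import Path


def backup_with_manifest(src, archive):
    """Tar+gzip every file under src, recording a SHA-256 per file."""
    manifest = {}
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                manifest[str(path)] = hashlib.sha256(
                    path.read_bytes()).hexdigest()
                tar.add(path, arcname=str(path.relative_to(src)))
    return manifest

# e.g. backup_with_manifest(Path("/media/cdrom0"), "Tick1_backup.tar.gz")
```

Rechecking the manifest against the extracted archive later is then a matter of re-hashing each file and comparing.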
Remember, burnt DVDs have an integrity half-life of around 1.5 to 2 years (according to a talk I heard at the last SUN-PAsig) and I myself have lost data to unreadable discs.&lt;br /&gt;&lt;br /&gt;Disc contents: &lt;a href=&quot;http://pastebin.com/f74aadacc&quot; target=&quot;_blank&quot;&gt;http://pastebin.com/f74aadacc&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Top-level directory listing:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;color:#000099;&quot;&gt;ch  ej  lp  ps  rr  sh  ta&lt;/span&gt; &lt;/b&gt; &lt;a href=&quot;http://pastebin.com/f54072d38&quot;&gt;DB.fiat&lt;/a&gt;      &lt;a href=&quot;http://pastebin.com/f4b438549&quot;&gt;DBsub.fiat&lt;/a&gt;   README.txt&lt;br /&gt;&lt;span style=&quot;color:#000099;&quot;&gt;&lt;b&gt;cw  jf  nh  rb  sb  sl  tw&lt;/b&gt;&lt;/span&gt;  &lt;a href=&quot;http://pastebin.com/f5264fbf8&quot;&gt;DBsent.fiat&lt;/a&gt;  LICENSE.txt&lt;br /&gt;&lt;br /&gt;Each one of the two-letter directories holds a number of files, each file seemingly having its own subdirectory containing ~6+ data files in a custom format.&lt;br /&gt;&lt;br /&gt;The .fiat top-level files (DB.fiat, etc.) are in a format roughly described &lt;a href=&quot;http://dls.physics.ucdavis.edu/fiat/fiat.html#form&quot;&gt;here&lt;/a&gt; - the documentation about the data held within each file is targeted at humans. In essence, it looks like a way to add comments and information to a csv file, but it doesn&#39;t seem to support/encourage the line syntax that csv enjoys, like quotes, likely due to it not using any standard csv library. 
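For illustration, a fiat-style file can be read with the standard csv module in a few lines. This is only a sketch: it assumes metadata lines are prefixed with '#' and that everything else is plain comma-separated data, which is a simplification of the format described at the link above, and the sample content is made up.

```python
import csv
import io


def parse_fiat(text):
    """Split a fiat-style file into comment/metadata lines and data rows.

    Assumes '#' marks metadata lines; the rest is treated as CSV.
    """
    comments, data_lines = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            comments.append(line.lstrip("#").strip())
        elif line.strip():
            data_lines.append(line)
    reader = csv.reader(io.StringIO("\n".join(data_lines)),
                        skipinitialspace=True)
    return comments, list(reader)


# A made-up two-line example in the assumed style:
sample = '# fiat 1.2\n# TTL filename text\nch/ch01.wav, "The cat sat."\n'
meta, rows = parse_fiat(sample)
```

Quoting and embedded commas then come for free from the csv module.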
For instance, the same data could be captured using standard csv libs without any real invention, but I digress.&lt;br /&gt;&lt;br /&gt;Ultimately, by parsing the fiat files, I can get the text spoken in each audio file, some metadata about each one, and some information about how some of the audio is interrelated (by text spoken, type, etc.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Modelling: &quot;How to turn the data into a representation of objects&quot;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;There are very many ways to approach this, but I shall outline the aims I keep in mind, and also the fact that this will always be a compromise between different aims, in part due to the origins of the data.&lt;br /&gt;&lt;br /&gt;I am a fan of sheer curation; I think that this is not only a great way to work, but also the only practical way to deal with this data in the first place. The researcher knows their intentions better than a post-hoc curator. Injecting data modelling/reuse considerations into the working lives of researchers is going to take a very long time. I have other projects focussed on just this, but I don&#39;t see it being the workflow through which the majority of data is piped any time soon.&lt;br /&gt;&lt;br /&gt;In the meantime, the way I believe is best for this type of data is to capture and to curate by addition. Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we&#39;ve put out there has a very narrow userbase, as the act of curation and preservation takes time. We need to start putting (cleared) data out there, in parallel to capturing the information necessary for understanding and then preserving the information. By putting the data out there quickly, the researcher feels that they are having an effect and are able to see that what we are doing for them is worth it. 
This can favourably aid dialogue that you need to have with the researcher (or other related researchers or analysts) to further characterise this data.&lt;br /&gt;&lt;br /&gt;So, step one for me is to describe a physical view of the filesystem in terms of objects and then to represent this description in terms of RDF and Fedora objects. Luckily, this is straightforward, as it&#39;s already pretty intuitive.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There are groupings:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;a top-level folder containing 10 group folders and a description of the files&lt;/li&gt;&lt;li&gt;each group folder contains a folder for each recording in its group&lt;/li&gt;&lt;li&gt;each recording folder contains:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;a .wav file for the recording&lt;/li&gt;&lt;li&gt;6 analysis .dat files (sound analyses like RMS, f0, etc.), dependent on and exclusive to each audio file&lt;/li&gt;&lt;li&gt;some optional files, dependent on and exclusive to each audio file&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;each audio file is symbolically linked into its containing group folder.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;So, from this, I have three types of object: the top-level &#39;dataset&#39; object, a grouping object of some kind, and a recording object, containing the files pertinent to a single recording. 
(In fact, the two grouping classes are preexisting classes in the archival system here, albeit informally.)&lt;br /&gt;&lt;br /&gt;We can get a crude but workable representation by using three &#39;different&#39; (marked different in the RELS-EXT ds) Fedora objects, and by using the &lt;a href=&quot;http://dublincore.org/documents/dcmi-terms/&quot;&gt;dcterms&lt;/a&gt; &#39;&lt;a href=&quot;http://dublincore.org/documents/dcmi-terms/#terms-isPartOf&quot;&gt;isPartOf&lt;/a&gt;&#39; property to indicate the groupings (recording --- dcterms:isPartOf --&amp;gt; grouping).&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://farm4.static.flickr.com/3242/2958763018_b0c029244f.jpg&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;Curation by addition&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The way I&#39;ve approached this with Fedora objects is to use a datastream with a reserved ID to capture the characteristics of the data being held. At the moment, I am using full RDF stored in a datastream called RELS-INT. (NB I fully expect someone to look at this dataset later in its life and say &#39;RDF? that&#39;s so passé&#39;; the curation of the dataset will be a long-term process that may not end entirely.) RELS-INT is to contain any RDF that cannot be contained by the RELS-EXT datastream. 
(Yes, having two sources for the RDF where one should do is not desirable, but it&#39;s a compromise between how Fedora works and how I would like it to work.)&lt;br /&gt;&lt;br /&gt;To indicate that the RELS-INT should also be considered when viewing the RELS-EXT, I add an assertion (which slightly abuses the intended range of the dcterms requires property, but reuse before reinvention):&lt;br /&gt;&lt;br /&gt;&amp;lt;info:fedora/{object-id}&amp;gt; &amp;lt;&lt;a href=&quot;http://dublincore.org/documents/dcmi-terms/#terms-requires&quot;&gt;http://dublincore.org/documents/dcmi-terms/#terms-requires&lt;/a&gt;&amp;gt; &amp;lt;info:fedora/{object-id}/RELS-INT&amp;gt; .&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;a name=&quot;terms-requires&quot;&gt;Term Name: requires&lt;/a&gt;&lt;br /&gt;URI: &lt;a href=&quot;http://purl.org/dc/terms/requires&quot;&gt;http://purl.org/dc/terms/requires&lt;/a&gt;&lt;br /&gt;Definition: A related resource that is required by the described resource to support its function, delivery, or coherence.&lt;/blockquote&gt;I am also using the convention of storing timestamped notes in iCal format within a datastream called EVENTS (I am doing this in a much more widespread fashion throughout). These notes are intended to document the curatorial/archivist whys behind changes to the objects, rather than the technical ones, which Fedora can keep track of. 
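One of these timestamped EVENTS notes could be assembled as a minimal iCal entry along these lines (a sketch: the choice of a VJOURNAL component and the wording are my assumptions, and the upload into the Fedora datastream is omitted):

```python
from datetime import datetime, timezone


def event_note(summary, description, when=None):
    """Assemble a minimal iCal VJOURNAL entry for an EVENTS datastream."""
    when = when or datetime.now(timezone.utc)
    return "\r\n".join([
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//databank//curation notes//EN",  # hypothetical product id
        "BEGIN:VJOURNAL",
        "DTSTAMP:" + when.strftime("%Y%m%dT%H%M%SZ"),
        "SUMMARY:" + summary,
        "DESCRIPTION:" + description,
        "END:VJOURNAL",
        "END:VCALENDAR",
    ])


note = event_note("Added RELS-INT datastream",
                  "Recorded file characteristics as RDF; see RELS-EXT.")
```

Keeping the notes in a plain, widely understood format like this is what makes them readable by a later curator without any special tooling.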
As the notes are available to be read, they are intended to describe how this dataset has evolved and why it has been changed or added to.&lt;br /&gt;&lt;br /&gt;An assertion to add, then, is that the EVENTS datastream contains information pertinent to the provenance of the whole object (into the RELS-INT in this case). I am not happy with the following method, but I am open to suggestions.&lt;br /&gt;&lt;br /&gt;&amp;lt;info:fedora/{object-id}&amp;gt; &amp;lt;&lt;a href=&quot;http://dublincore.org/documents/dcmi-terms/#terms-provenance&quot;&gt;http://dublincore.org/documents/dcmi-terms/#terms-provenance&lt;/a&gt;&amp;gt; &lt;br /&gt;                              [ a &lt;a href=&quot;http://purl.org/dc/terms/ProvenanceStatement&quot;&gt;http://purl.org/dc/terms/ProvenanceStatement&lt;/a&gt; ;&lt;br /&gt;                              dc:source &amp;lt;info:fedora/{object-id}/EVENTS&amp;gt; ] .&lt;br /&gt;&lt;blockquote&gt;Term Name: provenance&lt;br /&gt;URI: &lt;a href=&quot;http://purl.org/dc/terms/provenance&quot;&gt;http://purl.org/dc/terms/provenance&lt;/a&gt;&lt;br /&gt;Definition: A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation.&lt;br /&gt;Comment: The statement may include a description of any changes successive custodians made to the resource.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;From this point on, the characteristics of the files attached to each object can be recorded in a sensible and extendable manner. My next steps are to add simple Dublin Core metadata for what I can (both the objects and the individual files) and to indicate how the data is interrelated. 
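Taken together, the requires and provenance assertions described above can be sketched as N-Triples in plain Python (hand-rolled for illustration rather than via an RDF library; the blank node label _:prov is mine, and "changeme:1" is just a stand-in pid):

```python
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"


def curation_triples(pid):
    """Emit the requires and provenance assertions as N-Triples."""
    obj = "<info:fedora/%s>" % pid
    return [
        # RELS-INT must be considered alongside RELS-EXT
        "%s <%srequires> <info:fedora/%s/RELS-INT> ." % (obj, DCT, pid),
        # the provenance statement is a blank node sourced from EVENTS
        "%s <%sprovenance> _:prov ." % (obj, DCT),
        "_:prov <%s> <%sProvenanceStatement> ." % (RDF_TYPE, DCT),
        "_:prov <%ssource> <info:fedora/%s/EVENTS> ." % (DC, pid),
    ]


triples = curation_triples("changeme:1")
```

In a real object these triples would live in the RELS-EXT/RELS-INT datastreams rather than as loose strings; the point is only how few assertions are involved.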
I will also add (as an object in the databank) the basic description of the custom data format, which seems to be based loosely on the NASA FITS format, but not closely enough for FITS tools to work on it, or to validate the data.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data Abstracts: &quot;Binding data to traditional formats of research (articles, etc)&quot;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It should be possible to cite a single piece of data, as well as a grouping or even an arbitrary set of data and groupings of data. From a re-use point of view, this citation is a form of data currency that is passed around and re-cited, so it makes sense to make this citation as simple as possible; a data citation as a URL.&lt;br /&gt;&lt;br /&gt;I start from the point of view that a single, generic, perfectly modelled citation format for data will take more time and resources to develop than I have. What I believe I can do, though, is enable the more practically focussed case for re-use and the sharing of citations. A single URL (URI) can be created, one which serves as an anchor node to bind together resources and information. It should provide a means for the original citing author to indicate why they had selected this grouping of information, and what this grouping of information means at that time and for what reason. I can imagine modelling the simple facts of such a citation in RDF assertions (person, date of citation, etc) but it&#39;s beyond me to imagine a generic but useful way to indicate author intention and perception in the same way.
The best I can do is to adopt the model that researchers/academics are most comfortable with, and allow them to record a data &#39;abstract&#39; in the native language.&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://farm3.static.flickr.com/2135/2958763034_584738d5e4.jpg&quot; style=&quot;max-width: 800px;&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Hopefully, this will prove a useful hook for researchers to focus on and to link to or from related or derived data. Whilst groupings are typically there to make it easier to understand the underlying data as a whole, the data abstract is there to record an author&#39;s perception of a grouping, based on whatever reasoning they choose to express.&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/2547800834242596365/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/2547800834242596365' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2547800834242596365'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2547800834242596365'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html' title='Modelling and storing a phonetics database inside a store'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://farm4.static.flickr.com/3242/2958763018_b0c029244f_t.jpg" height="72" 
width="72"/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-2317123182133225178</id><published>2008-10-16T03:39:00.000-07:00</published><updated>2008-10-16T03:55:42.413-07:00</updated><title type='text'>News and updates Oct 2008</title><content type='html'>&lt;div xmlns=&#39;http://www.w3.org/1999/xhtml&#39;&gt;Right, I haven&#39;t forgotten about this blog, just getting all my ducks in a line as it were. Some updates:&lt;br/&gt;&lt;br/&gt;&lt;ul&gt;&lt;li&gt;The JISC bid for eAdministration was successful, titled &quot;Building the Research Information Infrastructure (BRII)&quot;. The project will categorise the research information structure, build vocabularies if necessary, and populate it with information. It will link research outputs (text and data), people, projects, groups, departments, grant/funding information and funding bodies together, using RDF and as many pre-existing vocabularies as is suitable. The first vocab gap we&#39;ve hit is one for funding, and I&#39;ve made a draft RDF schema for this which will be openly published once we&#39;ve worked out a way to make it persistent here at Oxford (trying to get a vocab.ox.ac.uk address)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;One of the final outputs will be a &#39;foafbook&#39; which will re-use data in the BRII store - it will act as a blue book of researchers. Think Cornell&#39;s Vivo, but with the idea of Linked Data firmly in mind.&lt;/li&gt;&lt;li&gt;We are just sorting out a home for this project, and I&#39;ll post up an update as soon as it is there.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Forced Migration Online (FMO) have completed their archived document migration from a crufty, proprietary store to an ORA-style store (Fedora/Solr) - you can see their preliminary frontend at &lt;a href=&#39;http://fmo.qeh.ox.ac.uk&#39;&gt;http://fmo.qeh.ox.ac.uk.&lt;/a&gt; Be aware that this is a work in progress.
We provide the store as a service to them, giving them a Fedora and a Solr to use. They contracted a company called Aptivate to migrate their content, and I believe also to create their frontend. This is a pilot project to show that repositories can be treated in a distributed way, given out like very smart, shared drive space.&lt;/li&gt;&lt;li&gt;We are working to archive and migrate a number of library and historical catalogs. A few projects have a similar aim to provide an architecture and software system to hold medieval catalog research - a record of what libraries existed, and what books and works they held. This is much more complex than a normal catalog, as each assertion is backed by a type of evidence, ranging from the solid (first-hand catalog evidence), to the more loose (handwriting on the front page looks like a certain person who worked at a certain library.) So modelling this informational structure is looking to be very exciting, and we will have to try a number of ways to represent this, starting with RDF due to the interlinked nature of the data. This is related to the kinds of evidence that genealogy uses, and so related ontologies may be of use.&lt;/li&gt;&lt;li&gt;The work on storing and presenting scanned imagery is gearing up. We are investigating storing the sequence of images and associated metadata/ocr text/etc as a single tar file as part of a Fedora object (i.e. a book object will have a catalog record, technical/provenance information, an attached tar file and a list of file-to-offset information.)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;This is due to us trying to hit the &#39;sweet spot&#39; for most file systems. A very large number of highly compressed images and little pieces of text does not fit well with most FS internals. We estimate that for a book there will be around [4+PDFs+2xPages] files, or 500+ typically.
Just counting up the various sources of scanned media we already have, we are pressing for about 1/2 million books from one source, 200,000 images from another, 54,000 from yet another... it&#39;s adding up real fast.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;We are starting to deal with archiving/curating the &#39;long-tail&#39; of data - small, bespoke datasets that are useful to many, but don&#39;t fall into the realm of Big Data, or Web data. I don&#39;t plan on touching Access/FoxPro databases any time soon though! I am building a Fedora/Solr/eXist box to hold and disseminate these, which should live at databank.ouls.ox.ac.uk very, very shortly. (Just waiting on a new VMware host to grow into, our current one is at capacity.)&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;To give a better idea of the structure, etc, I am writing it up in a second blog post to follow shortly - currently named &quot;Modelling and storing a phonetics database inside a store&quot;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;I am in the process of integrating the Google-analytics-style statistics package at http://piwik.org with the ORA interface, to give relatively live hit counts on a per-item basis and to build per-collection reports.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Right now, piwik is capturing the hits and downloads from ORA, but I have yet to add in the count display on each item page, so halfway there :)&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;We are just waiting on a number of departments here to upgrade the version of EPrints they are using for their internal, disciplinary repositories, so that we can begin archiving surrogate copies of the work they wish to put up for this service (using ORE descriptions of their items). By doing so, their content becomes exposed in ORA and mirror copies are made (we are working on a good way to maintain these as content evolves), but they retain content control; ORA will also act as a registry for their content.
It&#39;s only when their service drops that users get redirected to the mirror copies that ORA holds (think Google cache, but a 100% copy).&lt;/li&gt;&lt;li&gt;We are in the process of battle-testing the Fedora-Honeycomb connection but, as mentioned above, just waiting for a little more hardware before I set to it. Also, we are examining a number of other storage boxes that should plug in under Fedora, using the Honeycomb software, such as the new and shiny Thumper box, &quot;Thor&quot; Sun Fire Xsomething-or-other. Also, getting pretty interested in the idea of MAID storage - massive array of idle disks. Hopefully, this will act like tape, but have the sustained access speed of disk. Also, a little more green than a tower of spinning hardware.&lt;/li&gt;&lt;li&gt;Planning out the indexer service at the moment. It will use the Solr 1.3 multicore functionality, with a little parsing magic at the ingest side of things to make a generic indexer-as-a-service type system. One use-case is to be able to bring up VM machines with multicore Solr on to act as indexers/search engines as needed. An example aim? &quot;Economics want an index that facets on their JEL codes.&quot; POST a schema and ingest indexer to the nearest free indexer, and point the search interface at it once an XMPP message comes back that it is finished.&lt;/li&gt;&lt;li&gt;URI resolvers - still investigating what can be put in place for this, as I strongly wish to avoid coding this myself. Looking at OCLC&#39;s OpenURL and how I can hack it to feed it info:fedora URIs and link them to their disseminated location. Also, using a tinyurl-type library + simple interface might not be a bad idea for a quick PoC.&lt;br/&gt;&lt;/li&gt;&lt;li&gt;Just to let you all know that we are building up the digital team here, most recently held interviews for the futureArch project but we are looking for about 3 others to hire, due to all the projects we are doing.
We will be putting out job adverts as and when we feel up to battling with HR :)&lt;/li&gt;&lt;/ul&gt;That&#39;s most of the more interesting hot topics and projects I am doing at the moment.... phew :)&lt;br/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/2317123182133225178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/2317123182133225178' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2317123182133225178'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2317123182133225178'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/10/news-and-updates-oct-2008.html' title='News and updates Oct 2008'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-2621863987788037037</id><published>2008-08-18T05:32:00.000-07:00</published><updated>2008-08-18T05:48:25.790-07:00</updated><title type='text'>Cherry picking the Semantic Web (from Talis&#39;s Nodalities magazine)</title><content type='html'>Just to say that in the &lt;a href=&quot;http://www.talis.com/nodalities&quot;&gt;Talis Nodalities&lt;/a&gt; magazine, &lt;a href=&quot;http://www.talis.com/nodalities/pdf/nodalities_issue3.pdf&quot;&gt;Issue 3 [PDF] page 13&lt;/a&gt; they have published an article of mine about how treating everything - author, department, funder, etc - as a first-class object will have knock-on benefits to curation and cataloguing of
archived items.&lt;br /&gt;&lt;br /&gt;When I find a good, final version of the article that I haven&#39;t accidentally deleted, I&#39;ll post the text of it here ;) Until then, download the PDF version. Read all the articles actually, they are all good!&lt;br /&gt;&lt;br /&gt;(NB The magazine itself is licensed under the CC-by-sa, which I think is excellent!)</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/2621863987788037037/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/2621863987788037037' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2621863987788037037'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2621863987788037037'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/08/cherry-picking-semantic-web-from-taliss.html' title='Cherry picking the Semantic Web (from Talis&#39;s Nodalities magazine)'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-1371143081899015772</id><published>2008-08-18T05:16:00.000-07:00</published><updated>2008-08-19T09:30:50.684-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crigshow"/><category scheme="http://www.blogger.com/atom/ns#" term="idea"/><title type='text'>DSpace and Fedora *need* opinionated installers.</title><content type='html'>Just to say that both Fedora-Commons and DSpace really, really need opinionated installers that make choices for the user. 
Getting either installed is a real struggle - which we demonstrated during the Crigshow, so please don&#39;t write in the comments that it is easy, it just isn&#39;t.&lt;br /&gt;&lt;br /&gt;Something that is relatively straightforward to install is a Debian package.&lt;br /&gt;&lt;br /&gt;So, just a plea in the dark, can we set up a race? Who can make their repository software installable as a .deb first? Will it be DSpace or Fedora? To whom will I send a box of cookies and a thank-you note from the entire developer community?&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://wiki.eprints.org/w/Installing_EPrints_3_via_apt_%28Debian/Ubuntu%29&quot;&gt;(EPrints doesn&#39;t count in this race; they&#39;ve already done it)&lt;/a&gt;</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/1371143081899015772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/1371143081899015772' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/1371143081899015772'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/1371143081899015772'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/08/dspace-and-fedora-need-opinionated.html' title='DSpace and Fedora *need* opinionated installers.'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16'
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-7963164596904275017</id><published>2008-08-18T04:47:00.000-07:00</published><updated>2008-08-19T09:30:50.684-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crigshow"/><title type='text'>Re-using video compression code to aid document quality checking</title><content type='html'>&lt;a href=&quot;http://crigshow.blogspot.com/2008/07/prototype-motion-analysis-to-detect.html&quot;&gt;(Expanding on this video post from the Crigshow)&lt;/a&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;&lt;br /&gt;Problem:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The volume of pages from a large digitisation project can be overwhelming. Add to that the simple fact that all (UK) institutional projects are woefully underfunded and under-resourced, and it&#39;s surprising that we can cope with them at all.&lt;br /&gt;&lt;br /&gt;One issue that repeatedly comes up is the idea of quality assurance; how can we know that a given book has been scanned well? How can we spot images easily?
Can we detect if foreign bodies were present in the scan, such as thumbs, fingers or bookmarks?&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;A quick solution:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This was inspired by a talk at one of the conference strands at WorldComp, where the speaker described using a component of a commonly used video compression standard (MPEG-2) to detect degrees of motion and change in a video, without having to analyse the image sequences using a novel or smart algorithm.&lt;br /&gt;&lt;br /&gt;He described the motion vector stream as a good rough guide to the amount of change between frames of video.&lt;br /&gt;&lt;br /&gt;So, why did this inspire me?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;MPEG-2 compression is pretty much a solved problem; there are some very fast and scalable solutions out there today - direct benefit: &lt;span style=&quot;font-weight: bold;&quot;&gt;No new code needs to be written and maintained&lt;/span&gt;&lt;/li&gt;&lt;li&gt;The format is very well understood and stripping out the motion vector stream wouldn&#39;t be tricky. Code exists for this too.&lt;/li&gt;&lt;li&gt;Pages of text in printed documents tend towards being justified so that the two edges of the text columns are straight lines.
There is also (typically) a fixed number of lines on a page.&lt;/li&gt;&lt;li&gt;A (comparatively rapid) MPEG-2 compression of the scans of a book would have the following qualities:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;The motion vectors between pages of text would either show little overall change (as differing letters are actually quite similar) or a small, global shift if the page was printed on a slight offset.&lt;/li&gt;&lt;li&gt;The motion vectors between a page of text and a following page with an image embedded in the text, or a thumb on the edge, would show localised and distinct changes that stand out from the overall pattern.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;In fact, a really crude solution could be just to use the vector stream to create a bookmark list of all the suspect changes. This might bring the number of pages to check down to a level that a human mediator could handle.&lt;/li&gt;&lt;/ul&gt;How much needs to be checked?&lt;br /&gt;&lt;br /&gt;Via basic sample survey statistics: to be sure to 95% (±5%) that the scanned images of 300 million pages are okay, just 387 totally random pages need to be checked. However, to be sure that each individual book is okay to the same degree, a book being ~300 pages, 169 pages need to be checked &lt;span style=&quot;font-weight: bold;&quot;&gt;in each book&lt;/span&gt;.
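The figures above follow from the standard sample-size formula (95% confidence, ±5% margin, worst-case p = 0.5), with a finite-population correction applied for the per-book case. A quick sketch of the calculation; note that the textbook z = 1.96 gives ~385 for the large population, so the exact figure depends on how z is rounded:

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample-size formula with finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size, ~384.16
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# For 300 million pages the correction is negligible (~385 pages to check);
# for a single ~300-page book it bites hard (169 pages).
print(sample_size(300_000_000))
print(sample_size(300))
```

The per-book result matches the 169 quoted above: when the population is barely larger than the uncorrected sample, the correction roughly halves the work, but checking over half of every book is still why an automated pre-filter is attractive.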
I would suggest that the above technique would significantly lower this threshold, but it would be by an empirically found amount.&lt;br /&gt;&lt;br /&gt;Also note that the above figures carry the assumption that the scanning process doesn&#39;t change over time, which of course it does!</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/7963164596904275017/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/7963164596904275017' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7963164596904275017'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/7963164596904275017'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/08/re-using-video-compression-code-to-aid.html' title='Re-using video compression code to aid document quality checking'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3090914606822911489.post-2742105134406656447</id><published>2008-08-18T03:40:00.000-07:00</published><updated>2008-08-20T05:31:34.099-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crigshow"/><category scheme="http://www.blogger.com/atom/ns#" term="formats"/><category scheme="http://www.blogger.com/atom/ns#" term="identifier"/><category scheme="http://www.blogger.com/atom/ns#" term="repository"/><title type='text'>The four rules of the web and compound documents</title><content type='html'>A real quirk that truly interests me is the difference in 
aims between the way documents are typically published and the way that the information within them is reused.&lt;br /&gt;&lt;br /&gt;A published document is normally in a single &#39;format&#39; - a paginated layout, and this may comprise text, numerical charts, diagrams, tables of data and so on.&lt;br /&gt;&lt;br /&gt;My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; the full paper gives the context to the information, but it is much more likely that a small part of this paper contains the novel insight being referenced.&lt;br /&gt;&lt;br /&gt;In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn&#39;t solve the issue that the author hadn&#39;t considered that a chart, a table or a section of text might be reused.&lt;br /&gt;&lt;br /&gt;So, on the web, where multiple representations of the same information are becoming commonplace (mashups, rss, microblogs, etc), what can we do to help better fulfill both aims, to show a paginated final version of a document, and also to allow each of the components to exist as items in their own right, with their own URIs (or better, URLs containing some notion of the context e.g. if /store/article-id gets to the splash page of the article, /store/article-id/paragraph-id will resolve to the text for that paragraph in the article.)&lt;br /&gt;&lt;br /&gt;Note that the four rules of the web (well, of &lt;a href=&quot;http://linkeddata.org/&quot;&gt;Linked Data&lt;/a&gt;) are in essence:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;give everything a name,&lt;br /&gt;&lt;/li&gt;&lt;li&gt;make that name a URL ...&lt;/li&gt;&lt;li&gt;...which results in data about that thing,&lt;br /&gt;&lt;/li&gt;&lt;li&gt;and have it link to other related things.
&lt;/li&gt;&lt;/ul&gt;[From &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot;&gt;TimBL&#39;s originating article&lt;/a&gt;. Also, see this &lt;a href=&quot;http://virtuoso.openlinksw.com/presentations/Creating_Deploying_Exploiting_Linked_Data2/Creating_Deploying_Exploiting_Linked_Data2_TimBL_v3.html#%281%29&quot;&gt;presentation &lt;/a&gt;- a remix of presentations from TimBL and the speaker, &lt;span style=&quot;font-style: italic;font-size:100%;&quot; &gt;&lt;a href=&quot;http://myopenlink.net/dataspace/person/kidehen#this&quot;&gt;&lt;span style=&quot;color: rgb(0, 147, 182); text-decoration: underline;&quot; about=&quot;http://kidehen.idehen.net/dataspace/person/kidehen#this&quot; typeof=&quot;foaf:Person&quot; property=&quot;foaf:name&quot;&gt;Kingsley Idehen&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style=&quot;font-size:100%;&quot;&gt; - given at the recent Linked Data Planet conference&lt;/span&gt;]&lt;br /&gt;&lt;br /&gt;I strongly believe that applying this to the individual components of a document is a very good and useful thing.&lt;br /&gt;&lt;br /&gt;One thing first: we have to get over the legal issue of just storing and presenting a bitwise-perfect copy of what an author gives us. We need to let authors know that we may present alternate versions, based on a user&#39;s demands. This actually needs to be the case for preservation, and the repository needs to make it part of its submission policy to allow for format migrations, accessibility requirements and so on.&lt;br /&gt;&lt;br /&gt;The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.&lt;br /&gt;&lt;br /&gt;When a compound document is submitted to the archive, a second parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements.
One issue that has already come up in testing is that documents tend to clump multiple, separate diagrams together into a single physical image. It is likely that the only solution to breaking these up is going to be a human one: either author/publisher education (unlikely) or breaking them up by hand.&lt;br /&gt;&lt;br /&gt;I would suggest using a very lightweight, hierarchical structure to record the document&#39;s logical structure. I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight using HTML elements, which would have the double benefit of being able to be sent directly to a browser to roughly &#39;recreate&#39; the document.&lt;br /&gt;&lt;br /&gt;Summary:&lt;br /&gt;&lt;br /&gt;1) Break apart any compound document into its constituent elements (paragraph level is suggested for text)&lt;br /&gt;2) Make sure that each one of these parts is clearly expressed in the context it is in, using hierarchical URLs, /article/paragraph or even better, /article/page/chart&lt;br /&gt;3) On the article&#39;s splashpage, make a clear distinction between the real article and the broken-up version. I would suggest a scheme like Google search&#39;s &#39;View [PDF, PPT, etc] as HTML&#39;.
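The summary above can be sketched as a toy fragmenter and resolver: split a document into addressable parts and serve each under a hierarchical URL, with the bare article URL listing its components. The /store/... paths and para-N ids here are illustrative placeholders, not a real repository API:

```python
def fragment(article_id, text):
    """Split a document into paragraphs, each keyed by a hierarchical URL."""
    parts = {}
    for i, para in enumerate(p.strip() for p in text.split("\n\n") if p.strip()):
        parts[f"/store/{article_id}/para-{i}"] = para
    return parts

def resolve(parts, url, article_id):
    """Resolve a URL: the bare article URL lists components ('splash page'),
    a part URL returns that part's text, anything else returns None."""
    if url == f"/store/{article_id}":
        return sorted(parts)
    return parts.get(url)

doc = "First paragraph.\n\nSecond paragraph, with the novel insight."
parts = fragment("article-42", doc)
print(resolve(parts, "/store/article-42/para-1", "article-42"))
```

A real implementation would fragment diagrams and tables too and keep the parallel version linked to the original, but the addressing scheme itself is this simple: the part's URL carries its context.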
I would assert that many people intuitively understand that this view is not like the original and will look or act differently.&lt;br /&gt;&lt;br /&gt;Some related video blogs from the &lt;a href=&quot;http://crigshow.blogspot.com/&quot;&gt;Crigshow&lt;/a&gt; trip:&lt;br /&gt;&lt;a href=&quot;http://crigshow.blogspot.com/2008/07/prototype-extracting-and-finding.html&quot;&gt;Finding and reusing algorithms from published articles&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://crigshow.blogspot.com/2008/07/real-documents-are-complex-objects.html&quot;&gt;OCR&#39;ing documents; Real documents are always complex&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://crigshow.blogspot.com/2008/07/protoype-providing-overviews-of.html&quot;&gt;Providing a systematic overview of how a Research paper is written&lt;/a&gt; - giving each component, and each version of a component, its own URI would have major benefits here</content><link rel='replies' type='application/atom+xml' href='http://oxfordrepo.blogspot.com/feeds/2742105134406656447/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment/fullpage/post/3090914606822911489/2742105134406656447' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2742105134406656447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3090914606822911489/posts/default/2742105134406656447'/><link rel='alternate' type='text/html' href='http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html' title='The four rules of the web and compound documents'/><author><name>Ben O&#39;Steen</name><uri>http://www.blogger.com/profile/10330754112283510575</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry></feed>