JISC-CRIG Planet

Open Access policy - Federation for the Humanities and Social Sciences

Richard Akerman — Sun, 24 May 2015 08:25:20 +0000

The Federation for the Humanities and Social Sciences released its open access policy on April 1, 2015. http://www.ideas-idees.ca/issues/open-access-aspp The official title is "Open Access and the Awards to Scholarly Publications Program (ASPP)". The policy is about facilitating and promoting open access books (monographs): the Federation will embrace the roles of promoter and facilitator of Open Access publishing projects for monographs, with a particular view to engage those that could include ASPP-funded books. To encourage innovation and experimentation, the Federation will use its resources and networks to facilitate the participation of Canadian publishers, libraries and authors in promising, scalable projects that provide practical (i.e. financial or in-kind) support for Open Access monograph publishing. To support the ongoing efforts of some Canadian publishers, the Federation will promote existing and future ASPP-funded Open Access books. As I understand it, it is not a mandatory policy, it's more about doing projects to support open access, in particular projects that would: Make books Open Access immediately upon publication (no embargo) as DRM-free PDFs or full- text HTML, using the final published work as the version of record; Track and report on downloads and/or usage; Use Creative Commons licences; Host Open Access publications in at least one recognized repository that provides permanent links; Follow accepted protocols for metadata. above from full policy http://www.ideas-idees.ca/sites/default/files/oa-aspp-policy-position-en.pdf

instagr.am

Thu, 29 Jan 2015 05:00:24 +0000

www.slideshare.net

Thu, 29 Jan 2015 05:00:24 +0000

NRC jobs - Database, Repository and Application Specialist

Richard Akerman — Sun, 28 Dec 2014 16:56:46 +0000

Knowledge Management Ottawa - Ontario CS-2, English This is a 3-year term position from the date of reporting. Closing Date: January 5, 2015 http://www.nrc-cnrc.gc.ca/eng/careers/competitions/16_14_0340.html He/she is involved in the installation, design, development, testing, support, and maintenance of digital repositories and business applications required by Knowledge Management to deliver its services. He/she also provides basic database administration services for related database management systems. The incumbent of this position will be part of a small team where communication and teamwork are a must and interactions with clients and partners very frequent. We expect members of the team to be very resourceful, dynamic and self-sufficient, and to become experts in a wide array of technologies.

hangingtogether.org

Sun, 21 Dec 2014 05:12:04 +0000

A week of Open Data in Ottawa - May 2015

Richard Akerman — Wed, 03 Dec 2014 06:58:45 +0000

A big line-up of open data related events in Ottawa, Ontario, Canada. UPDATE 2015-04-23: As I live in Ottawa, I have provided some info about restaurants, and how to get around in the downtown area from the conference centre. ENDUPDATE UPDATE 2015-05-26: The IODC has released a programme of events throughout the week (PDF). ENDUPDATE Please note: most of these events require separate registration. May 24: #FlashHacks at Maker Space North - register on EventBrite - see @opencorporates @makerspacenorth May 25: Canadian Open Data Summit http://opendatasummit.ca/ see @opennorth and hashtag #CODS15 for more information. social: 5pm-7pm Networking Social at Heart and Crown (67 Clarence St, Ottawa) May 26: IODC Unconference http://opendatacon.org/unconference/ social: 6pm-8pm register for HUB Hosts: IODC Unconference Networking Social possible W2P meetup in evening May 27: a number of different events 2015 Open Data Research Symposium http://www.opendataresearch.org/project/2015/symposium Joined Up Data Workshop / Open Data Leaders Summit http://opendatacon.org/pre-conference-events/ Opening Parliaments https://www.eventbrite.com/e/iodc-pre-conference-event-open-parliaments-tickets-17006256170 2pm-5pm - for more info contact @DanSwislow and @jkeserue - also see @openparl CKANCon 2015 http://www.eventbrite.com/e/ckancon-2015-tickets-16681567016 hashtag #CKANCon Data Standards Day http://iodc-standards.webfoundation.org/ Global Open Data for Agriculture and Nutrition (GODAN) Meeting - register - 3pm-5pm Open Data for Humanitarian Emergencies - register - 10:30am-noon May 27-28: Possible #HackingConflict #Diplohack - see http://new.secdev-foundation.org/hackingconflict For more information, please contact: info@hackingconflict.org - parent organisation @diplohack http://www.diplohack.org/ May 28-29: International Open Data Conference see website http://opendatacon.org/ and hashtag #IODC15 for more information. May 28 social: 7:30pm Open Knowledge community / School of Data meet and greet at The Brig Pub, 23 York St. May 29: GoGeomatics Ottawa May Social Richard Burcher of Ottawa OpenStreetMap will speak. 7pm at James Street Pub. http://www.meetup.com/Ottawa-GoGeomatics-Canada-Monthly-Networking-Social/events/222139575/ Note: this is a small meetup group. May 30 - June 1: IATI Technical Advisory Group Meeting http://www.aidtransparency.net/technicaladvisorygroup/tag-meetings/tag-meeting-may-2015 registration https://www.surveymonkey.com/s/TAG2015Ottawa deadline April 13, 2015 follow @IATI_aid and hashtags #IATI and #TAG2015 for more information The events continue, although with less direct connection to just open data May 30-31: Canadian Association of Learned Journals Open Access and Open Data are major topics - see schedule (PDF) Hashtag #CALJACRS15 May 30 - June 5: Congress of the Humanities and Social Sciences http://congress2015.ca/ follow @ideas_idees and rather tricky hashtag #CongreSSH Note that Congress has many, many sub-conferences and events, including: Digital Humanities Summer Institute (a series of 2.5 hour workshops) May 30-31 a hackfest on June 1 June 3: Open Data Ottawa - Open Data Book Club - Artifact dataset - 7pm - Smoque Shack - see @opendataottawa (see bottom of post for June 5-7 Random Hacks of Kindness) June 3-5: Conference of the Canadian Association for Information Science (CAIS) http://congress2015.ca/program/events/conference-cais-68(part of Congress of the Humanities and Social Sciences) hashtag #CAISACSI15 also see http://www.cais-acsi.ca/ June 3-5: Canadian Library Association Conference http://www.claconference.ca/ follow @CLAOtt15 and hashtag #claott15 UPDATE 2015-06-04: The CLA conference website removed the per-day pages, and now has a single program page that doesn't provide per-session links http://www.claconference.ca/program ENDUPDATE Some sessions that may be of interest include: Session Title: Innovation in Canadian LibrariesSession Code: INNOVDay: Wednesday June 3, 2015 12:00 PM - 4:30 PM Session Title: Open Government – the Virtual Library, an opportunity to participate Session Code: 417-15104Day: Wednesday June 3, 2015 1:00 PM - 2:00 PM Session Title: Transparency through the Federal Lens: Open Government Initiatives at Library and Archives Canada Session Code: 417-15061Day: Thursday June 4, 2015 10:00 AM - 10:30 AM Session Title: Preservation and Access Through Trustworthy Digital Repositories Session Code: 417-15192Day: Thursday 10:00 AM - 10:30 AM Session Title: Libraries and Open Data. How Open are We ? Session Code: 417-15037Day: Thursday June 4, 2015 11:00 AM - 12:00 PM Session Title: Library and Archives Canada as a Trusted Digital Repository Session Code: 417-15560Day: Thursday June 4, 2015 11:00 AM - 12:00 PM Session Title: Open Government Speed Dating Session Code: 417-15031Day: Friday June 5, 2015 8:30 AM - 10:30 AM Session Title: Libraries Preparing for the Research Data Deluge ? Session Code: 417-15038Day: Friday June 5, 2015 8:30 AM - 9:30 AM Session Title: It’s midnight, do you know where their data is? A taxonomy of how academic researchers understand the cloud, privacy and their data Session Code: 417-15277Day: Friday June 5, 2015 11:30 AM - 12:00 PM June 5-7: Random Hacks of Kindness Ottawa More info at http://rhok.ca/ - register on EventBrite ($5) UPDATE: Most open data events are in Lanyrd http://lanyrd.com/2015/cods15/ http://lanyrd.com/2015/iodc15/ UPDATE 2015-04-13: I made a Lanyrd guide for all the open data events http://lanyrd.com/guides/open-data-week-in-ottawa-may-2015/ Related articles Enabling the Data Revolution: Announcing the 3rd International Open Data Conference 2015

GitHub to repository deposit

Tue, 09 Sep 2014 12:03:53 +0000

Over the past few months there have been positive shifts in the infrastructure available to archive software. To ‘archive software’ can mean many things to many people, but for the purposes of this blog post, I’ll take the view that this is to take (well managed) code out of an existing source code control system, make a point-in-time snapshot of the code, and deposit that into a long-term repository, along with some basic descriptive metadata.

To this end, both Figshare and Zenodo have recently developed and released integrations into GitHub. These both allow the depositor to easily take a copy of their code from GitHub, and deposit it into the respective repository. One of the key benefits of doing this is that the repository platforms are then able to assign a persistent DataCite DOI (Digital Object Identifier) to the software, which makes it easier to cite and track through scholarly literature.

As one of the developers of the open SWORD deposit protocol that facilitates the deposit of resources into repositories, I thought it would be good to try and re-create this functionality using SWORD. Below is the ‘recipe’ of how this works…

Step one (optional): Setup your browser with a bookmark
To make it easier to deposit code from GitHub, you can install a ‘bookmarklet‘ that automatically detects that GitHub repository, and lets the deposit system know where this is. This means that from any GitHub repository, you can click on the bookmark to deposit the code. To install it, visit http://easydeposit.swordapp.org/example/github/easydeposit/ and drag the bookmarklet at the bottom of the page to your browser’s bookmark bar:

Step two: Choose the GitHub repository to deposit

GitHub makes use of accounts and repositories. Each user of the service has an account, and each account can create multiple code repositories. URLs for GitHub are in the form of https://github.com/{account}/{repository}, for example the PHP programming language is stored in GitHub: https://github.com/php/php-src (php is the account name, and php-src is the code repository for the PHP language).

Choose the GitHub repository that you wish to deposit in the repository by opening the repository in your browser. In the example below, this is the DSpace repository platform’s code repository:

Step three: Click the bookmark!
If you click the ‘GitHub Deposit’ bookmark that you created earlier, this will redirect you to a SWORD deposit system. The bookmarklet contains javascript that passes the URL of the GitHub repository to the deposit client, and populates the form automatically. Alternatively you can just visit http://easydeposit.swordapp.org/example/github/easydeposit/ and enter the URL of the repository yourself:

Step four: Download the code

Clicking ‘Next >’ will initiate the download of the latest version of the code (‘master’ in git terminology). Depending on the size of the repository, this may take a few seconds. The code isn’t doing anything clever, and unlike the Zenodo and Figshare integrations, it doesn’t make use of the GitHub API. Instead, it downloads the master.zip file by constructing a URL such as https://codeload.github.com/DSpace/DSpace/zip/master. It then uses basic metadata such as the title of the repository (title), the account holder (author), the URL of the repository (link) and the latest check-in comment and revision hash (abstract). These are then presented back to you to confirm:

Step five: Perform the deposit
Upon clicking the deposit button, the code will then translate the metadata into a METS file, and zip that up alongside the downloaded code bundle. All this is then deposited into the demo DSpace server (http://demo.dspace.org/). Assuming the deposit works, you’ll be presented with the URL of the deposited code. In this case, it is a ‘handle’, but to all intents and purposes that is a DOI, and DSpace can be configured to issue DOIs.

Step six: View the code

To see the deposited code in the repository, just click on the handle link! For example, http://hdl.handle.net/10673/51. This will take you to the repository, where the metadata can be seen, and the code downloaded!

This isn’t a highly polished integration, and was thrown together in a couple of hours, by adding it as an optional ‘step’ in the configurable web-based deposit client ‘EasyDeposit‘. But it is a good demonstration that creating small tools that archive code into SWORD-compliant repositories (DSpace, EPrints, Fedora, etc) can be achieved quite quickly!site

on.mash.to

Fri, 25 Jul 2014 05:48:47 +0000

Islandora Scholar Institutional Repository Solution Pack

mleggott — Tue, 29 Apr 2014 19:57:24 +0000

Islandora Release 7.x-1.3 Series #1

Background

The Islandora Scholar Solution Pack is one of the most feature rich and long-awaited components of the Islandora ecosystem. Scholar is an IR (Institutional Repository) or general-use document repository, ideal for preserving and promoting the scholarly output of your institution or a special collection of PDF documents. While Scholar may seem like one of the newest solutions in the Islandora stack, it has been evolving for a long time. It is one of the first Islandora installations, and started with UPEI's original IslandScholar.ca site, which went live in December 2008, and continued with the 2nd generation IslandScholar.ca (launched in 2012) that reflects the version you see today. This effort has been greatly enhanced with the contributions from over a dozen installations (including a number of customized ones) with similar goals - the management of rich citation data and associated digital documents. discoverygarden inc. and its development team led this current effort, re-writing code, adding new features and pulling a host of different efforts into one coherent package. Nick Ruest, the Release Manager for this latest release also put a lot of effort into getting the new version ready. Scholar's genesis also provides a deep well of stories of the effort that has gone into it. One particular story that comes to mind was when I was on leave from UPEI setting up discoverygarden, and working with one of our first clients, the City of Hope Cancer Research Hospital in California. I was sitting on the deck of our 2nd story room at the Blomidon Inn in Wolfville Nova Scotia, where my wife Trina and I were attending the APLA Conference, having a glass of local red wine, and I was talking on the phone with Andrea Lynch, a librarian at the City of Hope. We were working on the details of the Citation Style Language integration and how it was "mangling" some of the citations, depending on the chosen style. I was explaining some of the challenges of integrating a number of 3rd-party open source applications and making sure the polish was sufficient to meet the requirements. Andrea and I were debating the relative merits of spending time to modify code to get it right now, or wait for the code maintainers themselves to fix it for us. We ended up taking a bit of a hybrid approach, fixing some of the key issues, reaching out to the maintainer to see if they would consider making some changes, and deciding to live with some of the issues. This one surfaces frequently because it illustrates one of the fundamental yin/yang characteristics of an open source framework: the push and pull (literally and figuratively) of the ability to change the way the system works when it doesn't do what you want. Sometimes this is critical to making it all work at the local level, but it can be at the expense of ease of maintenance and migration to future releases. Ahhh, choice...

Feature Set

The Scholar Solution Pack (SPs are Islandora's way of packaging standard functionality, metadata forms, sample data and data transformations) provides a substantial out-of-the-box feature set in a highly customizable framework. You can run the full stack as a local install on your own hardware or in a complete turnkey and fully supported cloud service like discoverygarden's Islandora OnDemand.

rich bibliographic citations with a customizable MODS form
auto-generation and updating of Dublin Core records from the MODS
powerful keyword or fielded searching
customizable Solr facets including a bar-chart range date facet display
customizable citation display and export that leverages the CSL (Citation Style Language) standard that underlies Zotero and Papers
add citations to "My Bookmarks" and export a CSL citation as TXT, PDF, RIS or RTF
MARCXML export
multiple ingest formats for batch records (MODS, DOI, PubMed ID, Endnote XML, RIS) which should accommodate pretty much any requirement
fillable permission statement form on ingest
ability to upload a PDF file with fulltext indexing
ability to set embargoes on complete objects or just PDFs (allowing a public citation with an embargoed PDF)
restrict Collections or individual Objects to specified IP ranges
enhancements to increase indexing in Google Scholar and Zotero
support for Coins
support for SHERPA/RoMEO
Collection management features, including the ability to share a citation with multiple collections, without duplicating the citation object
setting your IR as an OAI-compliant repository

And these are just some of the Scholar features - when you combine these with the rich feature-set of the latest Islandora framework you can do a lot of creative things with the stack with little to no customization.

link a digital asset in a separate Collection to a citation using the Compound Solution Pack
create a compound scholarly record consisting of a citation, video, audio, and image files
authenticate to your local Shibboleth system
create new metadata editing forms to accommodate the full richness of bib records, or to present a different edit form to different users

Contributors

Like many efforts in the Islandora community, contributions can be intellectual (wireframes and functional requirements, code, testing and review, sysadmin tricks, documentation, posts to lists, or just good conversation) and/or financial contributions. Some of the institutional contributors to Scholar:

Scholar has also benefited from the skill and passion of the larger open source community, especially one I like to call "biblioland":

And of course the Drupal and Fedora Commons efforts! The open source world, a beautiful example of the intersection between global diversity and small/local.

What Does the Future Hold

I am looking forward to future releases of the Islandora project - feel free to drop me a note regarding top features you would like to see, or submit an idea to our Islandora IdeaWall. Some of the features that exist now in local customized Islandora Scholar installations, or are under active discussion, and will hopefully make it to an upcoming release include:

supplementary data files (ZIP package)
integration with Vireo
integration with Altmetrics
integration or sync to ORCID and/or an external VIVO datastore
sync with a Zotero database, including the harvesting of PDFs
additional import formats, like RefWorks XML
harvest complete OAI-compliant datastores
mint a DOI, ARC or Handle for each citation object
create links between Scholar records and citation on ingest, based on the logged in LDAP user
schema.org integration

Like any rich open source effort Islandora benefits from your ideas and resources. If you would like to support the integration of these or any other feature to make Scholar even better, I would like to hear from you. Next up in the Islandora Release 7.x-1.3 Series, why an Islandora Tuque will keep you warm AND allow you to get the most from the Islandora landscape.

Some reflections on the Berlin 11 conference, Berlin November 2013.

Neil Jacobs — Fri, 22 Nov 2013 10:36:52 +0000

The following are some things I particularly noticed at the Berlin 11 open access conference, which I attended earlier this week, with apologies for any misunderstandings, misattributions or mis-categorisations.
Vision
The research and education enterprise is in need of urgent transformation (Manfred Laubichler, Arizona State University)
The epistemic web: a universal and traceable web of knowledge. (Ulrich Poschl, Max Planck)
There can’t be insiders and outsiders in scientific knowledge (David Willets, UK Government)
Aim for ZEN: Zero Embargo Now (Glyn Moody)
Students at the heart of the open access movement (Nick Shockey, Right to Research Coalition)
Disruption
Open access: disruption is a feature, not a bug. (Mike Taylor, University of Bristol)
We need disruption to ensure that we don’t carry unhelpful practices into a new scholarly communications arrangements (Cameron Neylon, PLOS)
While cultural change takes time, we need to watch that we don’t set in stone things that we know not to be right (Cameron Neylon, PLOS)
The status of scientific communication
Researchers should control the organisation of scientific knowledge, and bottom-up standardisation ensures acceptance (Robert Stogl, MPG)
Scientific knowledge is a public good, and top-down policies ensure action to ensure it remains so (Cameron Neylon, PLOS)
Clay Shirky says that publishing is not a job or an industry, it’s a button. The publishing services we need are analogous to Red Hat services for Linux software (Glyn Moody)
We need to recognise that public research and education is paid in advance, and the IPR created is in a distinct class of its own (John Willinsky, PKP and Stanford)
Scientific diversity
All disciplines can move to OA, in different ways (Gunter Stock, ALLEA)
Scientific knowledge is diverse; we need biblio-diversity, beyond the Web of Science monoculture. (Marin Dacos, OpenEdition)
The social sciences and humanities need their own subject repositories (Nicholas Canny, ERC)
Some of the concerns about OA from the social sciences and humanities are legitimate, and some are not (Nicholas Canny, ERC)
Policies
We are balancing cost-effectiveness, openness, fast access and effective quality review (Roger Genet, Directorate General for Research and Innovation, France)
Will the US OSTP Directive be codified into law and, if so, will it be stronger, weaker or about the same? (Heather Joseph, SPARC)
Incentives
Researchers don’t write to communicate, but to be seen and counted; if communication is important, then it needs to be incentivised. (Cameron Neylon, PLOS)
Most scientists don’t directly benefit from OA (Robert Schlogl, Max Planck)
Unless their publication is in the repository, then it doesn’t exist (Bernard Rentier, University of Liege)
Making the transition to OA
Subscription funds must be moved to pay for Gold OA (Peter Gruss, MPG)
Take international coordinated action to cut subscription budgets by 30% and reallocate the money for APCs (Ulrich Poschl, Max Plank)
Authors might refuse to submit papers to hybrid journals that are seen to be “double dipping”. (David Willets, UK Government)
We expect publishers to take action on transparent and competitive pricing to show that the UK was right to support the hybrid model (David Willets, UK Government)
A UK Minister of State cannot advise independent universities on their promotion practices, for example to encourage OA, but the Royal Society might (David Willets, UK Government)
As students, we use music, art, poetry and even free medical examinations to interest people in OA (Daniel Mutonga, Medical Student Association of Kenya)
The core competences in a digital age are navigation, authentication, integration and innovation (Manfred Laubichler, Arizona State University)
Trust us (David Carroll and Joseph McArthur, students and inventors of the OA button, see below)
Coordination
A new kind of library, the DPLA, has been launched by bringing together coalitions of libraries and of funders (Robert Darnton, Harvard University)
There needs to be greater coordination between research funders, and the French Academy has agreed to support a biannual funders meeting between Berlin conferences. (Peter Gruss, MPG)
The Berlin conferences will from now on be biannual (Peter Gruss, MPG)
We need every research organisation to have a committee reporting directly to the head of the organisation, to monitor progress, experiments, requirements and infrastructure (Robert Schlogl, Max Planck)
Standards and interoperability are key: we need a new standards body for open access and open data (Robert Schlogl, Max Planck)
Progress
Linux, arXiv and the Web all started in a single week in August 1991 (Glyn Moody)
With Scielo, tailored versions of OJS and DSpace, Brazil and Latin America are leading OA (Sely Costa, University of Brasilia)
China and India have heard concerns from the west that they are not opening their research as quickly as the west. Now, some 34% of Chinese papers are OA (Xiaolin Zhang, Chinese Academy of Sciences)
The lack of OA journals is limiting Gold OA growth (Ulrich Poschl, Max Planck)
Monitoring
We need to instrument the research process (Cameron Neylon, PLOS)
We need to agree which data are useful to us in monitoring progress toward OA (Robert Schlogl, Max Planck)
We need evidence on both the costs and wider benefits of shorter embargo periods (Heather Joseph, SPARC)
The open access button will make visible the occasions where people hit a paywall (David Carroll and Joseph McArthur, students)
Research, scholarship and the wider economy and society
Germany invests 2.9% GDP on R+D, and is revising copyright law to help innovation (Georg Schutte, Feneral Ministry of Education and Research, Germany)
Chinese spending on R+D is rising at 15% – 20% p.a. Citations and collaborations in Chinese papers are increasing (Xiaoling Zhang, Chinese Academy of Sciences)
Of the 1m unique users of PubMedCentral a day, two-thirds come from outside the academic domain (Heather Joseph, SPARC)
Who needs access outside research institutions? whoneedsaccess.org (Mike Taylor, University of Bristol)
The fragments (documents, photographs, etc) that are of little importance to someone, might be of immense value to someone else as a piece of their history, and together they create stories for the future (Haim Gertner, Yad Vashem)
The DPLA is a distributed and democratic model; we have “scanabego” vehicles going to local communities to digitise content that is important to them (Robert Danton, Harvard University)
Software
Why is software not included in the Berlin Declaration (Glyn Moody)
Researchers might see software as their core intellectual property, and might not want to share it openly. The Royal Society will consider this. (David Willetts, UK Government)
Data
Only two of the top 20 big data companies are European, public-private partnerships might improve this for Europe (Carl-Christian Buhl, EC)
The scholarly record is challenged by a separation between idea (publication) and evidence (data), and more concretely by link rot. (David Willetts, UK Government)
Research communities need to take the lead on data: the Royal Society Open Data Forum will consider the issues of standards and skills (David Willetts, UK Government)

Blogging the ebooks landscape

verena weigert — Tue, 01 Oct 2013 14:23:20 +0000

Over the next few weeks we plan to blog a series of posts covering some of the main topics surround the creation, curation and consumption of ebooks in teaching, learning and research.

There’s little doubt about the growing use and importance of ebooks within universities.

Statistics compiled by the University of York, for example, show that the number of ebooks provided by the library has increased massively: by 22,878 in 2010/11 to a total of 576,689 in 2011/12.

One of the reasons for this rapid and exponential growth is that they provide access to library collections 24/7, every day, off-campus, for students and researchers and from their preferred device.

However, research by Jisc, Jisc Collections and others has highlighted the barriers that pose serious challenges to institutions who wish to exploit the potential of ebooks and ebook technology.

A recent Jisc project “The challenges of e-books in academic institutions” by Ken Chad has produced a number of case studies to illustrate how ebooks are created and managed by institutions and analysed the ways in which ebooks are used.

In a series of blog posts we‘ll try to give an overview of the current ebook landscape based on the work of the project and by adding further relevant content.

Each post will describe a particular topic and highlight challenges, lessons learned and emerging trends. Some of the topics we’re thinking of covering are:

Patron Driven Acquisition
Campus based publishing
ebooks and the role of the library
Beyond the pdf
Licensing and legal issues
Preservation of ebooks
Impact – the usage of ebooks and the student experience

We’re also going to include examples and links to further resources for each topic.

Stay tuned for more on new purchasing models for ebooks later this week!

Research Data Sharing without barriers…get involved?

Rachel Bruce — Fri, 20 Sep 2013 22:12:52 +0000

Research Data Alliance (RDA) – second plenary meeting – Washington DC – 16th – 18th September.

Many readers of this blog will know about the Research Data Alliance already, but there will, I guess also be a lot of people that don’t. I am using this post as an introduction to the RDA – having this week been to Washington DC to attend the second plenary meeting of the organisation.

What with all of the interest and some urgency around research data publishing, management and re-use, at Government level, at university level, disciplinary level; and of course with an eye on research being global, there is a need to join the data up with shared practices, standards, policies and infrastructure. That’s where the RDA comes in.

Building on initiatives such as Data One in the US, the initiatives across Europe, such as the Jisc research data activity, that take place in many member states & have collectively informed the EC’s direction on research data infrastructure as part of the forthcoming Horizon 2020– and the Australian National Data Service, the RDA has been formed. It’s been formed to address the ‘joining-up’ challenges and to build a global community that can contribute to shared practice and ultimately a more sustainable way to build an infrastructure and the intersections required to support data-driven research and innovation.

The founding members from funding type agencies are the US National Science Foundation (working also with Chris Greer from NIST), the European Commission and the Australian National Data Service (ANDS) – over the past year these partners have carefully consulted and built a community that is global and encourages bottom up sharing and agreement. I have been to some prior gatherings, and had discussions with Ross Wilkinson from ANDS, Carlos Morais-Pires from the EC, Juan Bicarregui from STFC, and others; and witnessed their planning and progress. In Europe engagement is overseen by RDA Europe, Norman Wiseman from Jisc is on the Strategic Forum that oversees this on behalf of the Knowledge Exchange/KE (KE do alot of work on Research Data!). It’s a big ask – forming a structure that can collaboratively take on progressing the research data challenge. And I have to say the meeting this week in Washington demonstrated pretty impressive progress.

So in short over the past year a set of working groups and interest groups have been formed to collectively work on key issues, and Washington was really the first time that they were there face to face to develop their work – there was a first plenary meeting from the 18th -20th of March 2013 in Gothenburg, Sweden where the initiative was formally launched and groups started to form their case statement for work – but in Washington these groups were able to show early outcomes and to form firmer priorities and plans.

So what are they (we) working on ? it’s a long list [see here for the current list -https://rd-alliance.org/working-and-interest-groups.html]. Some of the areas that the groups are tackling: metadata & a metadata standards directory; legal interoperability; data citation; a community capability model; persistent identifiers; practical policy; data foundation & terminology; big data and analytics & more – including interest groups that cover some disciplinary areas – such as agriculture and history and ethnography.

This Alliance is forming – but from what I experienced in Washington it certainly has a lot of potential and should be an essential vehicle to research data interoperability. In Washington this week, following the group discussions there was a plenary update from all of them highlighting their priorities (given in the grand setting of the US National Academy of Sciences) and Mark Parsons, RDA/US Managing Director facilitated a discussion on the scope and ways of working. It was a really useful discussion; and one where I think there was consensus that RDA isn’t a standards body but more of a clearing house for best practice, standards and approaches. So if you’re interested join up? I think it is an important initiative that will help to address the organisational,social and technical infrastructure required for real research data sharing. Jisc, and the Digital Curation Centre (DCC) are engaged in the initiative and will continue to be so; and we will tie in UK activities as best we can so we can learn from others and also input the lessons and emerging practice from the UK so we get to that utopia …a global research data infrastructure (note:there are many UK participants already).

We will continue to give updates on progress to try and keep people in the loop. But if it is your bag – go ahead and join in the discussions. Currently there are 800 members from over 50 countries, and I can say from having been there this week it’s an impressive crowd…

Yes it is early days – but it’s important and thus far very positive. Looking forward to seeing more progress – I think there will be!

The value and impact of the British Atmospheric Data Centre (BADC)

verena weigert — Wed, 11 Sep 2013 21:09:12 +0000

Jisc[i] in partnership with NERC[ii] have commissioned work to examine the value of impact of the British Atmospheric Data Centre (BADC). Charles Beagrie Ltd, the Centre for Strategic Economic Studies Victoria University, and the British Atmospheric Data Centre are pleased to announce key findings for the forthcoming publication of the results of the study on the value and impact of the British Atmospheric Data Centre (BADC). The study will be available for download on 30^th September at: http://www.jisc.ac.uk/whatwedo/programmes/di_directions/strategicdirections/badc.aspx

Key findings:

The study shows the benefits of integrating qualitative approaches exploring user perceptions and non-economic dimensions of value with quantitative economic approaches to measuring the value and impacts of research data services.

The measurable economic benefits of BADC substantially exceed its operational costs. A very significant increase in research efficiency was reported by users as a result of their using BADC data and services, estimated to be worth at least £10 million per annum.

The value of the increase in return on investment in data resulting from the additional use facilitated by the BADC was estimated to be between £11 million and £34 million over thirty years (net present value) from one-year’s investment – effectively, a 4-fold to 12-fold return on investment in the BADC service.

The qualitative analysis also shows strong support for the BADC, with many users and depositors aware of the value of the services for them personally and for the wider user community.

For example, the user survey showed that 81% of the academic users who responded reported that BADC was very or extremely important for their academic research, and 53% of respondents reported that it would have a major or severe impact on their work if they could not access BADC data and services.

Surveyed depositors cited having the data preserved for the long-term and its dissemination being targeted to the academic community, as the most beneficial aspects of depositing data with the BADC, both rated as a high or very high benefit by around 76% of respondents.

The study engaged the expertise of Neil Beagrie of Charles Beagrie Ltd and Professor John Houghton of Victoria University, to examine indicators of the value of digital collections and services provided by the BADC.

The findings of this study are relevant to the community attending the conferences below hence the announcement.

13th EMS Annual Meeting & 11th European Conference on Applications of Meteorology (ECAM) | 09 – 13 September 2013 | Reading, United Kingdom
http://www.ems2013.net/

2013 European Space Agency Living Planet Symposium
http://www.livingplanet2013.org/

The British Atmospheric Data Centre (BADC)
The BADC, based at the STFC Rutherford Appleton Laboratory in the UK, is the Natural Environment Research Council’s (NERC) Designated Data Centre for the Atmospheric Sciences. Its role is to assist UK atmospheric researchers to locate, access, and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by NERC projects. There is also considerable interest from the international research community in BADC data holdings.

[i] http://www.jisc.ac.uk/
[ii] http://www.nerc.ac.uk/

The Benefits of Open Source for Libraries

Ben Showers — Tue, 10 Sep 2013 08:13:17 +0000

The following post appeared as a question and answer piece in the August edition of the Cilip Update magazine

What are the main benefits to the library of adopting open source?

There are some well known benefits that open source could bring to libraries, these include:

• Lower costs: Open source offers a lower total cost of ownership than traditional library systems. There are none of the traditional license costs associated with open source. Libraries are able take advantage of the reduced costs the cloud offers by reducing local support and hosting costs (if it is supported and hosted by a third party).

• No lock-in: Libraries are, in a sense, removed from the traditional lock-in associated with library systems. There is a greater opportunity to pick and choose components, and take advantage of what is, generally, better interoperability with open source solutions. Related to this is also the idea that open source is more sustainable: If a vendor goes out of business the software may disappear or be sold-on. With open it is always available, and there is usually a community involved in it to continue its development.

• Adaptation and Innovation: Connected to the above is the greater capacity that libraries have to innovate with open systems and software. There is no need to await the next update or release, instead in either isolation or collaboratively, can develop the functionality required. This enables much more agile services and systems, as well as ensuring user expectations are exceeded.

• A richer library systems ecosystem: A less direct impact of open source is a richer library systems ecosystem. This is both in terms of the library solutions available (a healthier marketplace with both proprietary and open solutions) and in terms of collaboration and engagement between libraries themselves. Libraries are able to collaborate and share code on the functionality and fixes they require. Indeed, there are open source systems such as Evergreen, which were developed as an open source library system for a consortial approach.

While these benefits are the headline grabbing ones, it might be argued there are more subtle, but none the less powerful benefits in the adoption of open source in libraries, especially within higher and further education. There are broader trends and themes emerging (and some fairly well entrenched) within the new information environment that make open source particularly timely for libraries. These developments include: open (linked) data; managing research data; open scholarship and science; Open content such as OERs; crowdsourcing, and, of course, open access. Open source solutions for the library fit very well into this broader open momentum affecting the academic world at present.

Away from the academic world it is difficult not to notice the close correlation between the open, learning, sharing and peer-production culture libraries embody and that of the open source culture.

So it may be that one of the greatest benefits of adopting open source is that it mirrors the very philosophy and values of the library itself.

Is it something all libraries should consider, or are there limitations to its usefulness as a solution (if so, what are the limitations)?

There are very few barriers to any library adopting an open source library system. The business models that surround open source library systems are currently based on third parties offering support and hosting services for libraries looking to implement a solution. Effectively, this means any library could take advantage of an open system.

There can sometimes be very pragmatic limitations to the systems themselves – the open source management system Koha, for example, doesn’t include an inter-library loan module (although they recognise this and have a wiki to collect the requirements for the module’s development).

For me, open source offers libraries an exciting opportunity: better understand the skills, roles and processes that are critical to the library’s community of users (whether academic, public or other). Open source can be about simply outsourcing your system and support to a third party; but it can also be about re-evaluating services, systems and understanding where the real value of the library lies. This may mean that support for the open source LMS is outsourced to a third party, so the local developers can work with librarians to ensure the services are innovative and meeting the needs of users.

Open source is an opportunity for the library to become more agile, and adopt a more ‘start-up’ like culture to the development and deployment of services.

What are the main barriers to a library adopting open source? (fear of the unknown, lack of technical ability etc)

It would be simple to blame the slow adoption of open source systems on fear – fear of the unknown, cost, security, perception, the list could go on. These are real concerns within the library community. But, it would miss the fact that libraries are using open source software. There are discovery interfaces that include Blacklight and VuFind. These open products themselves often run on top of the open search platform Apache Solr, for example.

Search and discovery are critical functions of the library, so these are not inconsequential adoptions.

Furthermore, there is a small, but growing recognition of the viability of open source for libraries. Halton Borough Council was the first to adopt open source for its public libraries, the University of Staffordshire was the first UK university to adopt an open source management system . These early adopters are helping raise the profile of open source and helping make it a visible alternative.

These developments point at potentially more entrenched barriers to adoption. One such barrier is the impact institutional and organisational procurement processes have on the decision making process (This, it might be argued, is a barrier to the development and adoption of proprietary systems as much as it is to the adoption of open source) . The procurement process for libraries (certainly in the academic sphere) has not been one that has traditionally explored innovative approaches – instead it has focused on relatively static and core specifications. This has had the effect of reinforcing the type of system, and the systems approach institutions and organisations adopt in their tender to suppliers.

For many organisations it might be summed up as simply as: who do you put a tender out to in the case of an open source solution?

However, many of the more superficial barriers are already largely redundant within the sector – the viability of open in general has been proved with the adoption of open source operating systems such as Linux in most sectors including business. Some of the more embedded organisational issues may take time to resolve, but already these are starting to dissolve as institutions seek to make effeciencies and adopt new approaches to procurement.

Are there issues over ongoing support? and do libraries need a decent IT dept to even consider open source?

As I mention above, IT support isn’t necessarily an issue for the library, this can be outsourced to a third party if necessary . But, having the right technical skills in the library is essential; it’s essential whether or not you’re choosing an open source solution.

However, the IT department does play an important role (whether they are in the library or wider organisation) as they are the people you’ll be talking to a lot about your decision. I think they key issue regarding the IT department is making sure they understand what you’re doing, and get them on your side!

There are also opportunities for libraries to engage in projects which share many of the characteristics of open source, but which have a slightly different approach. Examples include shared community activity such as Knowledge Base+ in the UK (a shared community knowledge base for electronic resources) which is a collaboration between HE libraries to improve the quality of e-resource metadata. Or the US ‘community source’ project KualiOLE (an Open Library Environment designed by libraries) where you pay to join the project to affect development, but the code for the system is open source. These examples build on the library’s tradition of openness and collaboration, and provide similar kinds of benefits to straightforward open source software.

Finally, it might just be that the greatest issue of open source facing libraries has already been overcome. David Parkes, Associate Director at University of Staffordshire, jokes that you should never be first. Of course, Staffordshire was the first HE institutions to implement an open source library system, so in many ways he’s removed the biggest hurdle to adoption there is!

Resources:

HELibTech wiki (open source library software page): http://lglibtech.wikispaces.com/Open+Source+Library+Systems+in+the+UK #
The business case for an open source library management system (Video of a presentation given by David Parkes, University of Staffordshire): http://www.esi.ac.uk/meetings/1114/videos/4807
LIS-OSS@JISCMail.ac.uk – discussion list about open source systems and software in libraries

Performance and Measurement in Libraries

Ben Showers — Fri, 26 Jul 2013 09:25:39 +0000

In his article in the New York Times, Robert Crease wrote:

We look away from what we are measuring, and why we are measuring, and fixate on the measuring itself.

For libraries, so used to collecting, managing and analysing various sets of data and metrics, this is a critical point.

It is also a sentiment that kicked off the 10th Northumbria conference on Performance Measurement in Libraries held in York earlier this week.

Elliot Shore from ARL (Association of Research Libraries) spoke about the need for libraries to take heed of this advice: To focus on the ‘fit’ of what we’re measuring.

This fit, as Shore calls it, has been evolving over the past 10 years as the role and presence of the library has changed. The digital environment and changing technologies and expectations of users means that what was once important to measure and capture may no longer have the same urgency.

This focus on what should be measured – and how it impacts on the role and shape of the library – was developed in a great talk by Margie Jantti at the University of Wollongong in Australia.

Margie talked about the constant flow of information and data that her staff (relationship managers) get from the researchers and academic staff, which is used to tailor services and focus resources on priority services. This has seen the library develop expertise in publication support for researchers by the library.

The large knowledgebase of data the library collects on its users enables it to punch far above it’s weight: helping develop a fast; agile and world-class library team.

Finally, one thing that emerged from a majority of the presentations during the conference was the increasing recognition that data and metrics from inside or about the library were no longer enough. The field from which the data and metrics is harvested is growing, and reaching further beyond the library. Into the teaching and learning space through to research, registry and student services and beyond.

The idea that library performance and measurement requires only data from the library – or within the immediate vicinity of the library – is no longer an option.

So, it was against this background that the Library Analytics and Metrics Project (LAMP) presented at the conference.

We provided some of the background to the project (where it has come from and the work that has led us to this point) and provided an overview of the work so far and how you can get involved and follow the progress of the project.

For me, what’s really interesting, is that LAMP has the potential to bring in data from across the institution (and beyond) to help inform decision making and how and where resources are allocated. It also takes away the burden of collecting the data and provides the space for libraries to act on the data, and to think strategically about what they want and should be measuring and analysing.

The conference was also useful in bringing to my attention LibQual, and the potential for LAMP to work with that data too (although this may be something for further down the development pipeline).

You can find a link to our presentation here. At the end are some ways that you and your library can get involved – so do feel free to get in touch.

Services to support UK repositories

Neil Jacobs — Wed, 24 Jul 2013 15:16:58 +0000

In the UK, the repository network is well established, and supports access to research papers and other digital assets created within universities. Increasingly, repositories are part of a research information management solution helping to track and report on research.

Over the past few years, Jisc has worked with a number of partners, including the University of Nottingham (Sherpa Services), EDINA, Mimas and the Open University (Knowledge Media Institute) to develop a range of services that benefit UK research by making institutional repositories more efficient and effective. Estimates put the annual net benefit, simply in terms of time saved by universities, at around £1.4m. Following a comprehensive review of the business case for these services, Jisc now intends to build on the RepositoryNet+ project led by EDINA, to put key services onto a more sustainable footing, including financial, organisational and technical aspects of their operation.

Sherpa Services run Sherpa-RoMEO and Sherpa-Juliet, the former providing trusted information about the rights of authors to deposit their papers into repositories, the latter providing a list of research funders’ open access policies, to which grant holders should comply. Together, these underpin the new Sherpa-FACT service, whose initial development has been supported by the Research Councils and the Wellcome Trust. Jisc proposes to work with these services over the next few months to identify a medium term strategy for each, and to support them thereafter, working in partnership where appropriate with others such as the Councils and Wellcome.

EDINA have developed the Repository Junction Broker, which promises to support mass deposit of papers from publishers and subject repositories into institutional repositories. Should the proposed HEFCE policy with respect to OA and the REF be confirmed, this will be a key service enabling institutional repositories to play their role in submissions to the next REF. Again, the proposed plan is to work with EDINA to develop a medium term strategy for RJB, and support it thereafter, while exploring a range of sustainability options with other stakeholders.

Mimas provide IRUS-UK, which enables institutional and other repositories to share usage data in a way that complies with international standards, so that usage reports can be reliably compared. Over 30 UK repositories are already part of the IRUS-UK network, with more joining all the time. Jisc intends to continue its support for IRUS-UK.

Other services have grown up alongside these, to varying levels of maturity. For example, EDINA have developed an organisational and repository identifier service, and explored a disaster recovery service for repositories. The Open University Knowledge Media Institute has developed CORE, a sophisticated aggregation and discovery service for repositories worldwide. Jisc looks forward to working with these services too, to ensure that UK repositories, their host organisations, and the people who use them benefit from greater efficiencies and a more responsive infrastructure.

As we move forward with this work, we will post updates here and elsewhere. If you have any questions or comments on this work, please contact Neil Jacobs (n.jacobs@jisc.ac.uk) or Balviar Notay (b.notay@jisc.ac.uk).

Library Systems Workshop

Ben Showers — Fri, 19 Jul 2013 12:23:23 +0000

On Monday this week the Library Systems Programme held a one-day workshop in London.

I’ll talk more about some of the things that cam out of the workshop in later posts – for now I just wanted to share some of the presentations which were given during the day.

You can also see what people were saying about the event on Twitter with this storify created by Helen Harrop from the LMS Change project:
[View the story “Jisc Library Systems Programme Event” on Storify]

The workshop was a chance for the projects that made up the programme to talk about the work they had done and the tools and resources they have created, and a chance for the community to discuss some of the issues and challenges that the sector currently faces.

The workshop was opened by Rachel Bruce of Jisc and Ann Rossiter of SCONUL and introduced some of the main themes of the day.

The workshop had three main strands that explored:

Collaborative Systems and Services;
Transforming workflows and practices, and;
Tools and Techniques for Systems Change

The workshop was then drawn to a close with a panel, chaired by Suzanne Enright of the university of Westminster, which explored what would be on your LMS wishlist.

The panel began with three short ‘provocations’ from Martin Myhill (Exeter), Andrew Preater (Senate House Libraries), and Owen Stephens (consultant).

Andre Preater at Senate House Libraries has also done a fantastic job of writing about the event, and you can find a copy of his presentation on his blog too. The provocations were rich in ideas and arguments, for example:

there was a call for a greater focus on (maybe even a commitment to) open source systems and the need for us to transcend the LMS,
the need for better exploitation of the data in our systems, and
the suggestion that the library sector may not understand, or have the right skills, to effectively inhabit an increasingly web-based environment.

In summing up the panel discussion and the day overall, Suzanne did a superb job of drawing out some of the main discussion themes and issues that had been surfaced during the day. An overview of these can be found on some slides she kindly put together:

A number of important themes emerge from Suzanne’s slides, and importantly there is a clear recognition that many of the challenges libraries face are not technological in nature. Rather they are about cultures and people.

So,what follows is a short overview of each session from the workshop and the presentations given (where available).

Collaborative Services and Systems

This session included presentations from projects exploring the potential to develop shared library systems and services. These were projects by SCURL in Scotland, WHELF in Wales and the Bloomsbury Consortium in London.

This project has contributed towards a new vision for library systems by investigating the following question: “How would a shared library management system improve services in Scotland?”

Building on the work of the earlier ‘WHELF: Sharing a Library Management System’ feasibility report the project has explored the potential benefits and pain points inherent in a move from distributed to centralised hosting and infrastructure model for a suite of library systems software, while building a possible overall business case for such a move by the HEIs within the WHELF consortium.

The Bloomsbury Library Management Consortium is building on the strengths of the Bloomsbury Colleges and Senate House Library and their track record for sharing and collaboration. The group undertook a study of the landscape of the 21st century Library Management Systems (LMSs) – and evaluating the options for building, commissioning or procuring a Bloomsbury Library Management System (BLMS) as a shared-service.

The presentation from the Bloomsbury consortium can be found here: 2013-07-15_JISC-Event-BLMS-for-circulation.

The group have made a decision in principle to go with KualiOLE open source /community library system.

Transforming workflows and processes

This session included a number of presentations exploring the impact of new systems and technologies on traditional library workflows and processes.

HIKE is exploring the integration of next generation library systems (specifically Intota) at the University of Huddersfield with Knowledgebase+ and the impact on traditional workflows and processes.

EBASS25 in a collaborative project, led by Royal Holloway, University of London, to develop shared models of ebook procurement using Patron-driven acquisition approaches.

[presentation to be added]

The Collaborative collections management project saw King’s College London and Senate House libraries collaborate on above campus initiatives around collection management for the benefit of students and researchers, and the use of the Copac collection management tool.

Tools and techniques for systems change

The LMS Change project took on the entire burden of this session themselves, showcasing the tools and approaches they have developed during the project and getting participants introduced to some of the tools. The LMS change presentation is below, and Ken Chad’s presentation on the business case for change can also be found here: Business_case_for_change_Jisc_LMSchange_wkshop_KenChad_July2013

ResourceSync and SWORD

Mon, 17 Jun 2013 19:46:01 +0000

This is the third post is a short series of blog posts about ResourceSync. Thanks to Jisc funding, a small number of us from the UK have been involved in the NISO / OAI ResourceSync Initiative. This has involved attending several meetings of the Technical Committee to help design the standard, working on documenting some of the different ResourceSync use cases, and working on some trial implementations. As mentioned in the previous blog posts, I’ve been creating a PHP API library that makes it easy to interact with ResourceSync-enabled services.

In order to really test the library, it is good to think of a real end-to-end use case and implement it. The use case I chose to do this was to mirror one repository to another, and to then keep it up to date. This first involves a baseline sync to gather all the content, followed by an incremental sync of changes made each day.

ResourceSync provides the mechanism by which to gather the resources from the remote repository. However another function is then required to take those resources and put them into the destination repository. The obvious choice for this is SWORD v2.

ResourceSync is designed to list all files (or changed files) on a server. These are then transferred using good old HTTP, but to get them into another repository requires a deposit protocol – in this case, SWORD. In other words, ResourceSync is used to harvest the resources onto my computer, and SWORD is then used to deposit them into a destination repository.

The challenge here is linking resources together. An ‘item’ in a repository is typically made up of a metadata resource, along with one or more associated file resources. Because these are separate resources, they are listed independently in the ResourceSync resource lists. However they contain attributes that link them together: ‘describes’ and ‘describedBy’. The metadata ‘describes’ the file, and the file is ‘describedBy’ the metadata. A good example of this is given in the CottageLabs description of how the OAI-PMH use case can be implemented using ResourceSync:

[xml]
xmlns:rs="http://www.openarchives.org/rs/terms/">

http://example.com/metadata-resource
2013-01-02T17:00:00Z

length="8876"
type="application/xml"/>

http://example.com/bitstream1
2013-01-02T17:00:00Z

length="14599"
type="application/pdf"/>

[/xml]

So here’s the recipe (and here’s the code) for syncing a resource list such as this, and then depositing it into a remote repository using SWORD. Both use PHP libraries, which makes the code quite short.

The recipe

[php]
include_once(‘../../ResyncResourcelist.php’);
$resourcelist = new ResyncResourcelist(‘http://93.93.131.168:8080/rs/resourcelist.xml’);
$resourcelist->registerCallback(function($file, $resyncurl) {
// Work out if this is a metadata object or a file
global $metadataitems, $objectitems;
$type = ‘metadata’;
$namespaces = $resyncurl->getXML()->getNameSpaces(true);
if (!isset($namespaces[‘sm’])) $sac_ns[‘sm’] = ‘http://www.sitemaps.org/schemas/sitemap/0.9’;
$lns = $resyncurl->getXML()->children($namespaces[‘rs’])->ln;
$key = ”;
$owner = ”;
foreach($lns as $ln) {
if (($ln->attributes()->rel == ‘describedby’) && ($ln->attributes()->href != ‘http://purl.org/dc/terms/’)) {
$type = ‘object’;
$key = $resyncurl->getLoc();
$owner = $ln->attributes()->href;
}
}

echo ‘ – New file saved: ‘ .$file . "\n";
echo ‘ – Type: ‘ . $type . "\n";

if ($type == ‘metadata’) {
$metadataitems[] = $resyncurl;
} else {
$objectitems[(string)$key] = $resyncurl;
$resyncurl->setOwner($owner);
}
});
[/php]

This piece of code is performing a baseline sync, and is using the callback registration option mentioned in the last blog. The callback is just doing one thing: sorting the metadata objects into one list, and the file objects into another. These will then be processed later.

Next, each metadata item is processed in order to deposit that metadata object into the destination repository using SWORD v2:

[php]
foreach ($metadataitems as $item) {
echo " – Item " . ++$counter . ‘ of ‘ . count($metadataitems) . "\n";
echo " – Metadata file: " . $item->getFileOnDisk() . "\n";
$namespaces = $xml->getNameSpaces(true);
if (!isset($namespaces[‘dc’])) $sac_ns[‘dc’] = ‘http://purl.org/dc/terms/’;
if (!isset($namespaces[‘dcterms’])) $sac_ns[‘dc’] = ‘http://purl.org/dc/elements/1.1/’;
$dc = $xml->children($namespaces[‘dc’]);
$dcterms = $xml->children($namespaces[‘dcterms’]);
$title = $dc->title[0];
$contributor = $dc->contributor[0];
$id = $dc->identifier[0];
$date = $dcterms->issued[0];
echo ‘   – Location: ‘ . $item->getLoc() . "\n";
echo ‘   – Author: ‘ . $contributor . "\n";
echo ‘   – Title: ‘ . $title . "\n";
echo ‘   – Identifier: ‘ . $id . "\n";
echo ‘   – Date: ‘ . $date . "\n";

// Create the atom entry
$test_dirin = ‘atom_multipart’;
$atom = new PackagerAtomTwoStep($resync_test_savedir, $sword_deposit_temp, ”, ”);
$atom->setTitle($title);
$atom->addMetadata(‘creator’, $contributor);
$atom->setIdentifier($id);
$atom->setUpdated($date);
$atom->create();

// Deposit the metadata record
$atomfilename = $resync_test_savedir . ‘/’ . $sword_deposit_temp . ‘/atom’;
echo ‘ – About to deposit metadata: ‘ . $atomfilename . "\n";
$deposit = $sword->depositAtomEntry($sac_deposit_location,
$sac_deposit_username,
$sac_deposit_password,
”,
$atomfilename,
true);
[/php]

This option being used here is to first create an atom entry that contains the metadata, and depositing that. The SWORD v2 ‘in-progress’ flag is being set to TRUE, which indicates that further activity will take place to the record.

The code then needs to look through the list of file resources, and find any that are ‘describedBy’ the metadata record in question. Any that are, are deposited to the same record using SWORD v2:

[php]
// Find related files for this metadata record
foreach($objectitems as $object) {
if ((string)$object->getOwner() == (string)$item->getLoc()) {
$finfo = finfo_open(FILEINFO_MIME_TYPE);
$mime = finfo_file($finfo, $object->getFileOnDisk());
echo ‘ – Related object: ‘ . $object->getLoc() . "\n";
echo ‘ – File: ‘ . $object->getFileOnDisk() . ‘ (‘ . $mime . ")\n";

// Deposit file
$deposit = $sword->addExtraFileToMediaResource($edit_media,
$sac_deposit_username,
$sac_deposit_password,
”,
$object->getFileOnDisk(),
$mime);
}
}
[/php]

Using the SWORD v2 API library is very easy: once you have the file and its MIME type, it is a single line of code to add that file to the record in the destination repository.

Once all the related files have been added, the final step is to set the ‘in-progress’ flag to FALSE to indicate that the object is complete, and that it can be formally archived into the repository. This is a simple as:

[php]
// Complete the deposit
$deposit = $sword->completeIncompleteDeposit($edit_iri,
$sac_deposit_username,
$sac_deposit_password,
”);
[/php]

The end to end process has now taken place – the items have been harvested using ResourceSync, and then deposited back using SWORD v2.

Limitations

The default DSpace implementation of the SWORD v2 protocol allows items to deposited, updated, and deleted. It does this by keeping items in the workflow, and when the ‘In-progress’ flag is set to false, the deposit is completed by moving it out of the workflow and into the main archive. Once the item is moved into the main archive, it can no longer be edited using SWORD.

This is a sensible approach for most situations. Once an item has been formally ingested, it is under the control of the archive manager, and the original depositor should probably not have the rights to make further changes.

However in the case of performing a synchronisation with ResrcoueSync, the master copy of the data is in a remote repository, and that should therefore be allowed to overwrite data that is formally archived in the repository. This is an implementation option though, and if an alternative WorkflowManager was written, this could be changed.

[Update: 20th June 2013. I have now edited the default WorkflowManager, to make one that permits updates to items that are in workflow or in the archive. This overcomes this limitation. I hope to add this as a configurable option to a future release of DSpace.]

Conclusion

ResourceSync and SWORD are two complementary interoperability protocols. ResourceSync can be used to harvest all content from one site, and SWORD used to deposit that content into another.

ResourceSync can differentiate between new, updated, and deleted content. SWORD v2 also allows these interactions, so can be used to reflect those changes as they happen.

The ResourceSync PHP Library

Sat, 15 Jun 2013 14:06:30 +0000

Over the past year, thanks to funding from the Jisc, I’ve been involved with the NISO / OAI ResourceSync initiative. The aim of ResourceSync is to provide mechanisms for large-scale synchronisations of web resources. There are lots of use cases for this, and many reasons why it is an interesting problem. For some background reading, I’d suggest:

Background to the initiative: http://www.niso.org/workrooms/resourcesync/
H. Van de Sompel, R. Sanderson, M. Klein, M. L. Nelson, B. Haslhofer, S. Warner, C. Lagoze. A Perspective on Resource Synchronization. D-Lib Magazine, Vol. 18, No. 9/10, 2012. http://dx.doi.org/10.1045/september2012-vandesompel
M. Klein, R. Sanderson, H. Van de Sompel, S. Warner, B. Haslhofer, C. Lagoze, M. L. Nelson. A Technical Framework for Resource Synchronization. D-Lib Magazine, Vol. 19, No. 1/2, 2013. http://dx.doi.org/10.1045/january2013-klein.
S. Lewis, R. Jones, S. Warner. Motivations for the Development of a Web Resource Synchronisation Framework. Ariadne, Issue 70, November 2012.

The specification itself can be read at http://www.openarchives.org/rs, and a quick read will highlight very quickly that the specification is based on sitemaps (http://www.sitemaps.org/) which is no surprise, given that they were developed for the easy and efficient listing of web resources for search engine crawlers to harvest – which in itself is a specialised form of resource synchronisation.

As with anything new, the proof is always in the pudding, which in this context means that reference implementations are required in order to both test that a standard can be implemented and fulfill the original use cases it was designed to do, but also to smooth off any rough edges that only appear once you use it in anger.

My role therefore has been to develop a PHP ResourceSync client library. The role of a client library is to allow other software systems to easily interact with a technology – in this case, web servers that support ResourceSync. The client library therefore provides the facility to connect to a web server and synchronise the contents, and then to stay up to date by loading lists of resources that have been created, updated, or deleted.

The PHP library can be downloaded from: https://github.com/stuartlewis/resync-php

The rest of this blog post will step through the different parts of ResourceSync, and shows how they can be access by the PHP client library:

The first step is to discover whether a site supports ResourceSync. The mechanism to do this is by using the well-known URI specification (see: RFC5785). Put simply, if a server supports ResourceSync, it places a file at http://www.example.com/.well-known/resourcesync which then points to where the capability list exists.

The first function of the PHP ResourceSync library is therefore to support this discovery:

[php]
include(‘ResyncDiscover.php’);
$resyncdiscover = new ResyncDiscover(‘http://example.com/’);
$capabilitylists = $resyncdiscover->getCapabilities();
echo ‘ – There were ‘ . count($capabilitylists) .
‘ capability lists found:’ . "\n";
foreach ($capabilitylists as $capabilties) {
echo ‘ – ‘ . $capabilties . "\n";
}
[/php]

Zero, one, or more capability list URIs are returned. If none are returned, then the site doesn’t support ResourceSync. If one is returned, the next step is to examine the capability list to see which parts of the ResourceSync protocol are supported:

[php]
include(‘ResyncCapabilities.php’);
$resynccapabilities = new ResyncCapabilities(‘http://example.com/capabilitylist.xml’);
$capabilities = $resynccapabilities->getCapabilities();
echo ‘Capabilities’ . "\n";
foreach($capabilities as $capability => $type) {
echo ‘ – ‘ . $capability . ‘ (capability type: ‘ . $type . ‘)’ . "\n";
}
[/php]

The output of this is that the specific ResourceSync capabilities supported by that server will be returned. Typically a resourcelist and a changelist will be shown.

The next step is often to perform a baseline sync (complete download of all resources). Again, the PHP library supports this:

[php]
include ‘ResyncResourcelist.php’;
$resourcelist = new ResyncResourcelist(‘http://example.com/resourcelist.xml’);
$resourcelist->enableDebug(); // Show progress
$resourcelist->baseline(‘/resync’);
[/php]

It is possible to ask the library how many files it has downloaded, and how large they were:

[php]
echo $resourcelist->getDownloadedFileCount() . ‘ files downloaded, and ‘ .
$resourcelist->getSkippedFileCount() . ‘ files skipped’ . "\n";
echo $resourcelist->getDownloadSize() . ‘Kb downloaded in ‘ .
$resourcelist->getDownloadDuration() . ‘ seconds (‘ .
($resourcelist->getDownloadSize() /
$resourcelist->getDownloadDuration()) . ‘ Kb/s)’ . "\n";
[/php]

It is possible to also restrict the files to be downloaded to those from a certain date. This can be useful if you only want to synchronise recently created files:

[php]
$from = new DateTime("2013-05-18 00:00:00.000000");
$resourcelist->baseline(‘/resync’, $from);
[/php]

Once a baseline sync has taken place, all of the files exposed via the ResourceSync interface will now exist on the local computer. The next step is to routinely keep this set of resources up to date. To do this, depending on the frequency at which the server produces change lists, these should be processed to download new or updated files, and to delete old files:

[php]
include ‘ResyncChangelist.php’;
$changelist = new ResyncChangelist(‘http://example.com/changelist.xml’);
$changelist->enableDebug(); // Show progress
$changelist->process(‘/resync’);
[/php]

Again, there are options to see what files have been processed:

[php]
echo ‘ – ‘ . $changelist->getCreatedCount() . ‘ files created’ . "\n";
echo ‘ – ‘ . $changelist->getUpdatedCount() . ‘ files updated’ . "\n";
echo ‘ – ‘ . $changelist-getDeletedCount() . ‘ files deleted’ . "\n";
echo $changelist->getDownloadedFileCount() . ‘ files downloaded, and ‘ .
$changelist->getSkippedFileCount() . ‘ files skipped’ . "\n";
echo $changelist->getDownloadSize() . ‘Kb downloaded in ‘ .
$changelist->getDownloadDuration() . ‘ seconds (‘ .
($changelist->getDownloadSize() /
$changelist->getDownloadDuration()) . ‘ Kb/s)’ . "\n";
[/php]

Also again, it is possible to only see changes since a particular date. This can be used to keep note of when the sync was last attempted, meaning only changes made since then are processed:

[php]
$from = new DateTime("2013-05-18 00:00:00.000000");
$changelist->process(‘/resync’, $from);
[/php]

The PHP library allows in a few steps, each consisting of a few lines, for the contents of a ResourceSync enabled server to be kept in sync with a local copy.

A further two blog posts will be published in this series. The next will show how to interact with the library so that more complex actions can be performed when resources are created, updated, or deleted. The final blog post will show this in action, with an application of the PHP ResourceSync library making use of the resources it processes.

Mobile discovery: don’t retro-fit; invent!

Ben Showers — Thu, 09 May 2013 12:11:22 +0000

The following is a version of an article that was printed in the March edition of Cilip Update.

When confronted with new technologies we often fail, early in their existence, to exploit the opportunities offered by the new medium. We retro-fit existing solutions rather than inventing new experiences.

The Canadian philosopher of communication and media, Marshall McLuhan, famously argued:

We see the world through a rear-view mirror. We march backwards into the future

In the early days of the web it was common for retailers to replicate paper brochures online, so called ‘brochureware’, missing the interactivity and format opportunities the web provides (and losing customers in the process too!). We continue to transpose our experiences of physical paper and books online, with little or no adaptation to the opportunities for interaction and multi-media.

While mobile technology has been available for decades, its current ubiquity and power (both socially and technologically) mean we find ourselves at the edge of a technological shift. As we move from a desk top to a mobile lifestyle we must be careful not to succumb to the rear-view mirror effect and replicate the desk top experience in the services and systems we design for the mobile user.

We find ourselves inhabiting a very different environment to a few years ago. Where once our computing power was located in one place, it now travels with us, capturing and distracting us no matter where we find ourselves. It connects us to people, places and things in ways not previously possible.

With this mobile lifestyle in mind I want to explore 4 challenges that mobile technologies present to libraries. In articulating these challenges I hope it will become increasingly clear what strategies and opportunities there are for libraries, and their services, systems and collections.

Simplicity

When you take a look at some of the best mobile experiences, whether apps or websites they usually have one thing in common: They do one thing extremely well. Everything extraneous is stripped away to leave only the most essential and relevant information.

Exemplars include Rise, an alarm clock app that incorporates visually simple interfaces, combined with gesture recognition and your music playlists. Or Clear, a ‘to do’ app, with intuitive gesture controls and the use of colour to denote urgency – nothing else.

Amazon’s stripped down app is a good example of a website that has adapted its presence to a mobile experience: Only the relevant information is included and all the complexity is hidden away from sight (although you can dig deeper if you wish).

The Amazon example is an interesting one. It invites comparisons with the library catalogue, and it certainly provides an effective template for mobile discovery. However, libraries have a physical infrastructure, processes and technologies that mean refining the mobile experience to a single thing can be hard. When we use a phrase like ‘discovery’ in a library or information-seeking context we often mean a set of interrelated actions, such as: search, select, find and use. Is it possible to break these down into their component parts and still deliver a positive experience for the user, both in terms of the mobile experience and of using the library?

The challenge the mobile devices present to libraries in this context is one of needs over solutions. The challenge is to think beyond the solutions already in place (the catalogue, discovery layer), to articulating the actual need. In the case of discovery maybe, ‘I need to answer a question’, or; ‘I need to find something’. Formulated in this way it is clear that a solution may be very different to the ones already available.

It forces us to consider the context we’re operating in; it invites us to invent, not retro-fit!

People and Place

Increasingly, the mobile device is a bridge between our online social connectivity and our localised real-world interactions. If you explore a map on your phone you don’t have to tell it where you are, the internal GPS has already told it. Similarly, it can tell you when a friend is near-by through apps like Facebook, FourSquare and so on.

There are a number of interesting examples where libraries and others have exploited these inherent benefits of mobile devices. Mendeley, the reference manager, is a good example of a service that is explicitly looking to build a social layer on top of the bibliographic data they have crowdsourced from the academic community in the form of bibliographies. You can follow academics with similar research interests, build groups and curate and build your own, personalised discovery network.
Increasingly, the discovery experience unfolds and is led by the content itself. What used to be the destination, the content or resource, is now the beginning of the journey.

For example, projects like Bomb Site, from the National Archives, have taken bomb site map data and made it available as a responsive website so that academics, researchers and members of the public can explore where bombs fell. This data is augmented over a map and includes images, descriptions and people’s memories.

Bomb Sight App

Similarly, the PhoneBooth project from the London School of Economics mobilised the Charles Booth poverty maps of London so that students and researchers could use and annotate the maps in context, i.e., on the streets of London as part of their learning experience.

PhoneBooth app

Increasingly the discovery process will find itself facilitating peer-to-peer and social recommendation experiences.

The traditional catalogue will itself begin to disappear from these interactions. Instead, the discovery experience will have an intimacy and personalisation associated with it that mirrors the intimately personal experience of the mobile device itself.

Personal

The web provides unparalleled opportunities for scale. The local bric-a-brac shop becomes eBay, the bookshop Amazon, the University becomes the massively open online course (MOOC) such as Cousera. Similarly the library begins to operate at ‘web-scale’ with its systems and services.

Yet, the mobile experience is an intimately personal one. It challenges libraries and information providers to find a balance between these two types of scale: the singular (the personal) and the ‘web-scale’. It is not enough simply to adopt web-scale systems and services: mobile challenges us to think about how that web-based interaction is transformed into real-world action.

One opportunity for libraries is in the data that circulates through their systems, both the management data and the user-generated interaction data. There are an increasing number of services and projects looking at exploiting this data for the personalisation of the user experience. These include commercial offerings, of which the best known is bX from Ex Libris.

There are also a number of academic libraries exploring the use of this data, including: SALT (surfacing the academic long tail) and RISE (Recommendations improve the search experience) which are exploring how different sets of data can be used to enhance and personalise the library experience.

The ability of libraries to exploit this data will grow increasingly important. The data provides a way for libraries to continue delivering services to hundreds and thousands of users, while providing a personalised experience that users expect from web-based services.

New models

If the mobile shift challenges libraries to invent new experiences, it also invites us to rethink how we develop and implement these.

As information becomes abundant and digital, the models for how libraries develop and implement new services and systems will radically change too. Libraries are no longer comparing themselves and their services to other libraries; instead they are being compared to the web, and the types of services and resources users can access there. Increasingly libraries will find themselves needing to adopt approaches that would normally be more associated with web start-ups.

This implies a greater focus on ideas (ideas from everywhere: librarians, users et al), rapid iteration and testing, and implementation of the idea (or quick relegation of ideas). This more entrepreneurial approach recognises that there is no simple crossing between how things are now and the future. There is not a simple roadmap from the complexities of the information environment as they are now, to some stable future; disruption is a feature, not a bug of the system.

While the change in a libraries approach to the user and the work it undertakes is significant, and not easy, there are some straightforward starting points. There are already great examples and case studies of mobile innovation in libraries. The M-Libraries community support blog, for example, includes a large amount of information, including case-studies, best practice guides and inspiration from other organisations on how they have transformed services with mobile technology.

Indeed, as many of the examples on the M-Libraries blog demonstrate, the financial overhead for this type of change should be low. Rethinking your approach to design of mobile services shouldn’t include significant barriers, either financial or technical. A good place to start is by borrowing ideas from other domains, like software development and design. The example of paper-prototyping, used in a recent mobile development workshop, provides a good place to start.

What many of these examples share is a renewed focus on the user. It moves us away from a focus on internal systems and processes, toward the behaviours and requirements of the user. The centre of gravity moves away from the technology and toward the user; the mobile-turn is one where the technology is overshadowed by the needs of the user.

The challenges mobile technologies present to libraries are ones drenched in paradox. The hardware (the phone, tablet, ereader) gradually fades from view, and it is the user, with their intricate behaviours and requirements that remain the focus of our attention.

Unlike so many other technologies, mobile enables the library to rethink its services, systems and processes to ensure that it is the user that remains at their heart. This does not mean business as usual, however. But it does mean that by understanding these challenges and their implications, libraries are in a position to design and deliver mobile experiences that users will want to engage with.

Shared Library Systems and Services, Part 1

Ben Showers — Mon, 22 Apr 2013 10:24:35 +0000

As part of the Library Systems Programme, two reports have been published exploring the potential for shared library systems across Universities in both Scotland and Wales.

In the first of two posts I wanted to briefly introduce you to the two recently published reports, and their main findings/recommendations. In the second post I want to highlight some of the other developments on the shared library systems landscape, and highlight some of the implications.

A Shared LMS for Wales (WHELF)

The Welsh Shared Service Library Management System Feasibility Report focussed on the most prevalent and practical issues for a shared all Wales HE library management system in broad terms:

A set of high-level agreed consortium requirements for a shared LMS.
A proposed governance model for the consortium.
High level recommendations on integration requirements for local systems; map communications standards which are applicable to the project against standards in use by suppliers.
A business case for a Wales-wide consortium LMS, including cost matrices for the different approaches presented.
Recommendations on the most cost-effective approach for software, hosting and ongoing management of the LMS.

The report makes the following recommendations:

The Project recommended setting up an All-Wales Consortium with formal governance. This requires the consortium to formally agree which processes, working practices and configurations will be adhered to by all members as a whole.

A cloud solution hosted by a vendor (or open source vendor) is the preferred option, because this will provide the most cost-effective resilient solution.

Further work will be required to develop a clear statement on the vision for shared LMS services in Wales, ensuring clarity of purpose and providing a compelling statement of intent for senior stakeholders and staff to achieve buy-in to the strategic direction proposed.

Next steps…

The report suggests a phased approach to implementation; anticipating that the first implementations will be no sooner than Summer 2014.

The report also suggests a task and finish group should be convened to quickly put together a high level plan, costs and cost allocation (i.e. funding) for the establishment of a project team.

The Benefits of Sharing (SCURL)

The Benefits of Sharing project has also just released a summary report of its work exploring a simple question:

How would a shared library management system improve services in Scotland?

While the question is simple, the answer is a little more complex. Indeed, the project began looking at the question with an initial workshop and subsequent report.

It then broke the problem into 3 parts:

The project also published a summary report which concludes with a number of recommendations, including the following:

From a systems perspective, sharing technical infrastructure and support structures would offer benefits of economies of scale, with more efficient use of staffing and greater expertise than any single library could offer. System options such as Open Source (OS) alternatives to ‘off the shelf’ commercial products could, therefore, become viable. It is recommended that at the tender and procurement phases of a shared LMS, all options, including OS systems, are reviewed and assessed.

————————————-

Both reports make very interesting reading – and also tell us a lot about the current library systems landscape. In particular there is a renewed vigour in the potential for sharing and collaborating around services and systems between libraries and institutions.

There is also a clear recognition that open source solutions are viable options for the community, and may represent a feature of this new library landscape.

In the second post on shared library services and systems I’ll explore some of the other developments within this landscape, and the implications they have for institutions, libraries and systems vendors.

Oh, the admin and the coder should be friends!

Tue, 03 Jul 2012 20:33:25 +0000

This is a tongue-in-cheek blog post a few days before the Open Repositories 2012 conference that is being held here in Edinburgh. I’ll give a bit of background first, a disclaimer, a video, then the main content of this post.

First the background: I have a slight love/hate relationship with the repository community and the Open Repositories conference related to how it makes a strong distinction between ‘Repository Managers’ and ‘Developers’. Its nice that we do this as it allows for innovate conference strands such as the ‘developers challenge‘ where developers can come and show-off their wares. However I also hate this segregation and the labeling of delegates into these categories. Personally I see myself as straddling the two, and I feel that we should be looking for our shared interests (developing open repository services) rather than highlighting differences between our roles.

However, I won’t rant or get on my soap-box, but instead I’ll butcher a song from the Rodgers and Hammerstein musical ‘Oklahoma!‘. One of the most famous songs is about how the farmers and the cowboys don’t get along and look for all the differences between themselves, rather than trying to work together to make the most of being settlers in a new territory. (See any similarities?!)

The disclaimer – the song makes the assumption that farmers and cowmen are all male, and that the females stay at home cooking, with the daughters waiting to get married. In my re-working of the lyrics I’ve been equally sexist and made the repository managers female and the developers male. This is not representative of my views or of reality, but it fits for the song! So please don’t hold this against me! This is just a light-hearted piece!

The song also fits with the OR2012 conference as it talks about the admins and coders (‘repository managers’ and ‘developers’ are too long for the song!) dancing together. The conference dinner will be ending with a ceilidh, where hopefully there will be much dancing and fun! If you’ve never seen Oklahoma you can watch a performance of this song below:

So sing along (preferably in your head if you work in a shared office!)

The admin and the coder should be friends.
Oh the admin and the coder should be friends.
One of them likes to bulk upload, the other likes to cut some code,
But that’s no reason why they can’t be friends!

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.
(repeat)

I’d like to say a word for the admin,
She come out west when repos were in beta,
She came out west and built a lot of services,
And uploads PDFs with metadata!

The admin is a good and thrifty citizen,
no matter what the coder says or thinks.
You sometimes see ’em drinkin’ in the tea room.
And always wants download stats when she rings.

But the admin and the coder should be friends.
Oh, the admin and the coder should be friends.
The coder writes a script with ease, the admin holds the OA keys,
But that’s no reason why they can’t be friends.

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.

I’d like to say a word for the coder,
the road he treads is difficult and stoney.
He codes for days on end with just a keyboard for a friend.
I sure do find he’s often tired and moany!

The coder should be sociable with the admin.
If he drops in looking like he needs bath water,
Don’t treat him like a louse make him welcome in your house.
But be sure that you lock up your wife and daughters!

I’d like to teach you all a little saying.
And learn the words by heart the way you should.
I don’t say I’m no better than anybody else,
But I’ll be damned if I ain’t just as good!

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.

Suggestions for better lyrics are most welcome!

[If you want to see the original lyrics, you can view them at: http://www.stlyrics.com/lyrics/oklahoma/thefarmerandthecowman.htm]

iPad App for Islandora Repo Now Available

mleggott — Fri, 01 Jun 2012 13:11:18 +0000

Our new iPad application, Telling Island Stories, is now available in the App Store (and yes it does support the retina display). TIS is a front-end to a number of Islandora-based digital collections, with a focus on story and geo-level presentation of digital assets. As a lightweight iPad app it will give you a good example of what can be done with a powerful Islandora-Fedora driven backend. Some highlights of this version:

- provision of "collection" views via Media (images, audio, books, maps), Stories (associated subsets of assets) and Communities (assets sharing a geographic range);
- a map-based view of all assets, including media-specific pins, pin aggregation for larger numbers of items in close proximity (including the ability to turn this feature on and off), search/display via the map;
- flexible display including ability to toggle metadata display on and off by item, filter items by type;
- delivery of high quality retina-display quality books in a page

Items available at launch include photographs from the University of PEI and community groups, herbarium specimins, digital books from the Lucy Maud Montgomery Institute, oral histories from the Dutch Thompson collection and maps from UPEI and partner organizations. As content is added to the Islandora system it automatically shows up in the app - no need to download a new app just for new content. A future version will allow the user to submit content to a central repository.

We are especially proud of the collaborative approach to making content from all parts of Prince Edward Island available to the world. If you are visiting PEI this summer you definitely want to download the free TIS app to your iPad!

TIS is a joint development venture between UPEI and DiscoveryGarden, an Islandora services company, along with generous funding from Innovation PEI. If you are interested in adapting the TIS app to your own Fedora/Islandora framework drop me a note.

UPEI Senate Approves Open Access Policy

mleggott — Tue, 29 May 2012 12:16:15 +0000

After a lengthy but very fruitful process UPEI's Senate approved a new open access policy. Some highlights of the policy:

immediate deposit of scholarly articles (including faculty, grad student and undergraduate works) in the UPEI repository is encouraged (a mandate was not considered a viable option - next time...);
encouragement for scholars to retain copyright, including a reference to the SPARC addendum;
deposit of research data into a UPEI Virtual Research Environment is also encouraged, along with links to the final scholarly work.

IslandScholar.ca is the UPEI open access repository, and it is currently undergoing a major re-architecture with a new launch this summer. The UPEI effort is also a key part of the new Islandora Institutional Repository Solution Pack, which will be released this summer as well.

Is the Repository Developer a dying breed?

Wed, 18 Apr 2012 11:59:22 +0000

Is the Repository Developer a dying breed, and should we care?

Cast your mind back, perhaps seven or eight years. It was the heyday of repository development. Projects such as DSpace and EPrints were taking off, and institutions around the world were watching the area closely and with excitement to see where this glorious new world would take us.

But back in those days repositories were similar to the early motor car – you needed a lot of money, several years, and your own mechanic/driver (developer) to make it work. Luckily, back then these resources were often available, money was perhaps a little easier to come by, and there were many funding opportunities from the likes of JISC to help out too.

As the repository developer worked with the repository software for a few years, they became intimately related with the software – they knew how it worked, how it was structured, what it could and couldn’t do, how to structure data within the repository, and often became key players in the development of the open source platforms by taking on roles such as DSpace Committership. Life was good, and I was lucky to be part of this, riding on the waves of e-theses, JISC projects (Repository Bridge, ROAD, RSP, Deposit Plait, SWORD) and the start of local institutional open access advocacy movements.

However… life moves on. The early repository developers have taken different career paths, and now find themselves in different situations.

Some have left the domain when the project funded projects slowed down, repositories could be implemented without a dedicated developer, and new areas of interest arose.
Some have progressed in their careers, and chosen to take a non-technical route up the tree.
Some have taken a commercial route, choosing to take their skills into the commercial sector and providing development services back to repository-using institutions.
Others have specialised as repository developers, but often find their emphasis has to be on compliance or marketing issues such as statistics, research assessment, or branding, rather than continuing to develop and apply core repository functionality.

It is rare to find a role these days where a developer can specialise in repositories, spending the majority of their time in that area.

I believe there is still a need for repository developers, as they bring many benefits such as:

An understanding of the technology that helps them to know when repository technology can and should be applied, and when it should not. Often repositories do not appear to be suitable choices for some of our system requirements as we’re used to confining them to electronic theses and journal papers, but they have great potential in new areas.
They know the underlying technology and data structures used by repositories, and how these can be mapped onto new domains. This can save institutions time and money, as they can re-use their existing repository infrastructure and expertise, rather than in investing in others.
Equally and opposite, they know the weaknesses of repositories, and where current or future functionality will not be suitable.
They provide technical credibility. Often repositories are run by libraries, but in environments where there are IT departments who may hold varying views on the technical development competence of the library.
They make the running of the current repository/ies more smooth, and can help manipulate the data they contain (import / export / update / delete) in ways that are not supported by the native interfaces.
Repositories are starting to become integration targets of enterprise systems such as CRIS systems. Having a repository develop around can make these integrations easier.

I think we’re seeing a downward spiral in the availability of repository developers. From time to time you see job adverts seeking experienced repository developers, and unfortunately they seem to be becoming a rare breed – a breed which I think we should protect, recognise, foster, and grow. I’m lucky to have worked in several large institutions lately where there have been small, effective, embedded, and valued repository development teams. However these types of teams are starting to become fewer and harder to find.

Opportunities for repository developers to network, learn, and share, are decreasing. There are exciting events such as Open Repositories with their Developers Challenge, and the Repository Fringe, but these are only annual events, and do not provide the opportunity for repository developers to show their skills and the potential of repository technologies to anyone outside of the repository community.

How will this affect the repository community? I think that there will be increasing problems in the repository world if repository developers become a very rare breed:

The large open source platforms that we have come to rely upon (installations of platforms such as DSpace, EPrints, and Fedora number in the thousands) will find it harder to continue to develop and keep pace with current requirements. The large amount of development effort that has gone into these systems over the past decade could be wasted, and we’ll fail to see some of the benefits that only come to fruition after this length of maturity.
There will be fewer exemplars of good practice of repository use to inspire and drive forward the innovative use of repositories.
Those who administer repositories will lose their local allies who are able to provide the tools and integrations to make repositories a local success.
The potential for repositories to be involved in new hot topics such as Research Data Management, the resurgence of interest in open access publishing, or the need for better digital preservation may be missed.
It will be even harder to recruit experienced and passionate repository developers, and without well-established teams of these, new developers thrust into the arena will find it harder to grow their skills and knowledge.

What should we, and can we do about it? It think that we need to value the role of the repository developer, and continue to recognise that despite there being less requirement for a repository developer in order to run a repository, the absence of development skills may inhibit a good repository service from becoming a great repository service. As we value multi-talented systems librarians, we should value the repository developer as a multi-skilled employee that allows us to correctly apply and integrate repository technologies.

Looking around at commercial companies that offer repository development services (for example Cottage Labs and atmire) we see the sort of innovative thinking that has so much potential in this area, and when I talk to staff involved with these companies there seems no shortage of people wanting their skills. And this is good, and shows that there is a demand. But equally I feel we need to keep growing these skills within institutions, and not let the local Repository Developer become a dying breed.

Repositories are in their teenage years, we nurtured them through birth, messy childhoods, promising early years, and now we’re starting to get a glimpse of how they can become powerful embedded tools. But without the continued availability of skilled parents to shepherd their development, they may never reach their full adulthood potential.

[This blog post was written on the way to a DevCSI event for managers of developers, where we shall be looking at how we can show the positive impact of having local development teams within universities, and from my perspective and passion, their particular value to libraries.]site

Thoughts on the Elevator

Mon, 26 Mar 2012 11:07:40 +0000

The JISC have been running an experimental funding system known as the JISC Elevator. The introduction on the site’s homepage describes the concept well:

JISC elevator is a new way to find and fund innovative ways to use technology to improve universities and colleges. Anyone employed in UK higher or further education can submit an idea. If your idea proves popular then JISC will consider it for funding. The elevator is for small, practical projects with up to £10,000 available for successful ideas. So if you have a brainwave, why not pitch it on the elevator?

A small team of us from the University of Edinburgh Digital Library submitted a proposal: The Open Access Index #oaindex. The video submission is shown below…

I’ve previously blogged about the experience of creating this submission. This post however contains a few observations about the Elevator concept, and the proposals that have been submitted.

First off – I’m a big fan of this system for a number of reasons:

It gave us an avenue to submit this type of proposal for a small amount of funding (only a few thousand pounds)
It provided us with a public platform and forum to socialise and discuss the idea
It adds a more open peer review stage to the process
It could encourage proposals from first-time bidders (although the public nature of it might put some people off?)

It will be interesting to see if or how the concept evolves overtime. Last week I got to chat about this with Andy Mcgregor the JISC Programme Manager in charge of this, and Owen Stephens. A few ideas that arose include:

Restrict the number of votes that any one person can place – make voters think harder about which ideas are most worthy of funding as there is only a limited number of projects that can be funded
Perhaps allocate each voter a set of votes or mock money or shares – they decide how they invest them across the proposals (all to one great idea, or spread across a few)
Be more transparent about the funding each project has requested and who has voted

I’ve been following the different ideas as they’ve been submitted, and a few trends have surprised me.

The first relates to the funding band that the proposal falls into. There are three funding bands: up to £2,500, up to £5,000, and up to £10,000. There is a total of £30,000 being made available to fund some of the submissions. I can’t tell what band the proposals that have already received enough votes for fall into, but for those that are still collecting votes, the breakdown is as follows:

I was surprised at the number that requested the full £10,000. Of course, it could be that those in the ‘unknown’ category (those which have already received the number of votes they require) are all in the lower bands, therefore require fewer votes, and are there now fully voted for. When pitching an idea, I always consider the amount of money available, and therefore the likelihood of receiving a given share of that money. In this case, due to the very limited funding, we chose to submit a proposal in the lowest band to (hopefully) increase our chances.

The second aspect of the submissions that struck me was the domain to which the submission relates. I’ve split these up into three very broad (and arguably very bad) categories: Learning (students / learning enhancement), IT (systems, development) and Library (materials, metrics).

The ‘education’ category received by far the most submissions, with IT and Library lagging far behind. Indeed our #oaindex proposal seems to be the only one in the library domain. Why is this? Perhaps the amounts available are much lower in the IT and Library domains than we are used to bidding for? Perhaps there are less opportunities for funding in the education domain? Are those in the education domain better at seizing these new and innovative funding opportunities than those in the library or IT domain? Discuss…!

When we created our video, we ensured that we mentioned who we were and which institution we worked for. However it didn’t cross our minds to include any sort of branding in our submission. I only thought of this when watching some of the others.

We were not alone, only a few included some. Did we miss an opportunity here, or is the brand somewhat irrelevant to the format of submitting elevator pitches: should voters be influenced by the idea more than by the host institution?

Our submission took the format of ideas being drawn on a whiteboard, with a voice-over in the background. I’ll openly admit that this was because none of us really wanted to stand in front of a camera for 3 minutes. Given how much we laughed during the simple voice recording, I think doing this in front of the camera would have taken even longer. Sorry – we didn’t keep the out-takes!!!

Voting for the proposals ends in a few days, and I’m looking forward to seeing which get funded, which don’t get enough votes, and whether or not the concept continues. But the scheme certainly gets my vote for the periodic allocation of small amounts of funding for great ideas!

[ The data that I’ve collected on the proposals can be seen at: https://docs.google.com/spreadsheet/ccc?key=0AgXAkDGxqBWYdHR1c0l6Uzk5aDFfQzlaM0ZtV04wcVE I’d be happy to receive updates or corrections.]racer games

A tale of two bids

Fri, 23 Mar 2012 17:36:03 +0000

This is a tale of two bids; two recent JISC bids to be precise. One submitted via the ‘traditional’ route, and one via the experimental ‘Elevator‘ route. This blog post is a brief reflection of my thoughts about these, and a comparison of the experience, in particular comparing the effort involved.

First, I should provide a brief explanation of the two routes:

The ‘traditional’ route: Traditionally JISC requires bid proposals to be submitted as text documents, usually in the range of 6 to 12 pages. These include cover sheets, budgets, benefits and risks, a bit about the people involved, and of course an explanation of the problem that will be investigated. On top of that, there are letters of support and FOI checklists. As part of the recent JISC Digital Infrastructure call, we submitted a couple of bids. What we bid for is somewhat irrelevant, but I will disclose that the two bids we submitted were requesting funding of approx £30,000 each. These proposals will now be marked by internal and external markers, followed by a panel decision.
The Elevator route: JISC are currently running an experimental funding stream, known as the JISC Elevator. The idea is that proposals should be lightweight, consisting of a brief video presentation, along with a few words. No budgets, no letters of support, no FOI statements – just an elevator pitch about the idea. This is the first difference. The second difference is that ‘the crowd’, which in this case consists of anyone with a .ac.uk email address, are allowed to vote on which projects should be considered for funding. Any that get enough votes will go forward for consideration by a panel. The number of votes required are proportional to the amount requested, with three bands being up to £2,500, up to £5,000, and up to £10,000. We pitched at the bottom end of this scale, meaning that we required 50 votes (which we received in less than 24 hours).

I’ll openly admit that the traditional route is often stressful. It takes around about 1 week of effort (full time), usually spread over 3 or 4 weeks. The final days tend to get quite frantic as everything is pulled together, we go through internal reviews and consents, seek letters of support, and pull the bid together for final submission.

In comparison, our pitch for the elevator took about half a day – an hour to refine the idea and seek approval, an hour to write a script, an hour to record the voices, an hour to make the video, and a few minutes to upload it.

The feeling at the end of the elevator process was markedly different to the end of the traditional process – and this felt good. However, when you look at the sums (adjusted slightly to make the numbers easier)…

Elevator: 1/2 day = £3,000 potential funding
Traditional: 5 days = £30,000 potential funding

So the actual potential return per hour invested in bid writing is the same!

However if I extrapolate this to other bids I have written in the past, some of which have been for higher amounts, the trend does not seem to continue in a linear fashion.

My personal experience (your mileage may vary – it would be good to compare notes!), is that bids in the range of thirty to perhaps two hundred thousand, take a similar amount of time: a week or so for a primary proposal author, and various time commitments from other parties. But bids above this amount then start taking longer again, as project complexities, often around collaboration and external involvement kick in.

What can I conclude from this? I’m not sure really! Feel free to draw your own conclusions and to make comparisons with your own experience.

What I can say, is that we enjoyed the process of making, submitting, and then publicising our elevator pitch. We felt that we had more freedom to be inventive with our interpretation of the submission requirements, and felt quite refreshed at the end of the process, rather than frazzled!

Now we await the outcomes of both…

Making Debian Changelogs from Github repositories

Dave Tarrant — Fri, 20 Jan 2012 08:27:00 +0000

One of the many things that irks me is the gap between good developers who put all their code on platforms such as GitHub, and those who then actually bother to put some effort into packaging up their code for easy platform installation.

I have come to the realisation that this is mainly due to the pedantic nature of packaging formats and platform lock in. One such example is the exacting format of the debian changelog...

GitHib2Changelog is a bit of code that I knocked together to help in this situation. It takes a GitHub repository URL and builds a debian changelog from the repository commits and tags.

By looking at the tags and commits it works out which commits are related to which tags (something GitHub APIv3 doesn't do) and then outputs this directly to you already formatted.

The service is built in php, and is web based with both a pretty front end and API access.

Ironically, since i've now committed the code to GitHub here I now need to use the service on itself and build the easy to install packages. More on that soon...

DepositMOre - The Prototype

Dave Tarrant — Thu, 19 Jan 2012 06:51:00 +0000

Building on the success of DepositMO and SWORDv2, I thought it would be a good idea to put a quick HTML5 client together to save myself some pain.

The basic premise of this web-based client is to automatically search for "your stuff" in a number of ways and then allow it all to be submitted to a repository in one click.

First target for me was www.easychair.org. This service is used as an online conference submission and review system. In a nut-shell if an author wants to get accepted into a conference, easychair is one system which they WILL have to battle with in order to submit their content. As a result there is a strong potential that easychair knows about many publications which should also be present in other systems.

From the main screen in easychair it is possible to navigate and find the many conference publications which you have submitted. Each publication is tied to a conference and it can take a substantial number of clicks to navigate between each publication.

DepositMOre

DepositMOre is a modular system which is intended to be a home for many services which locate your publications. The first module to be developed is for easychair.

By simply providing your login credentials to the DepositMOre system, it will not only list all your authored items from easychair but also check if these are present in your locally detected repository. If they are not deposited, and they should be then one click will do this for you.

A combination of HTML5 and SWORD2 make this process quick and seemless! Multiple items can be submitted at once and as each are submitted you can instantly click a link to your item and can view it in the repository.

The following video gives a demo of the prototype in action. We hope to continue development with the support of a funded project.

Technologies Used

HTML/Javascript/JQuery/PHP

SWORD2 PHP Library - Stuart Lewis - https://github.com/stuartlewis/swordappv2-php-library/

Preservation Tools - Moving Forward

Dave Tarrant — Mon, 14 Mar 2011 04:38:00 +0000

Over the last number of years, JISC and other bodies have funded a number of digital preservation projects which have resulted in some really valuable contributions to the area... now is the time to realise the benefits of this work and provide a digital preservation experience to everyday users.

To achieve this a not insignificant amount of work needs to be undertaken, namely to identify key applications and separate these from the complex systems into which they have been built. Alternatively many applications now need re-thinking and the best bits built into system which have super-ceded these applications.

File Format Identification Tools

File format identification now has a number of tools available, each with their own advantages and disadvantages, in no particular order they are:

DROID:

Started out as a tool to identify file types and versions of those types. :)
Each file version was assigned an identifier which could be referenced and re-used. :)
Identification of file was done via "signature", not extension matching. :)
Became complex as it was adjusted to suit workflows and provide much more complex information which few people understand or want :(
Added complexity increased the time required for each file classification, no longer a simple tool :(

FIDO:

A new cut down client which takes the DROID signature files and does the simple stuff again :)

FILE:

A built in Unix tool installed on every Unix based system in the world already! :)
Does not do version type identification :(
Does not provide a mime-type URI :(
Very quick to run :)
Has the capacity to add version type identification and there is a TODO in the code for it! :)

With the PRONOM registry now looking at providing URIs for file versions, why can't we stop coding new tools and change the FILE library. This way it could handle the version information and feed back the URIs if people want them. I've looked briefly into this and the PRONOM signatures should be easy to transport and use with the file tool.

If I get time I might well have a go at this and feed it back to the community.

Hot Topics in Scholarly Systems

Dave Tarrant — Mon, 22 Nov 2010 07:13:00 +0000

Since I last wrote a blog post the world has been going through some harsh times where cutbacks and simplifications have been essential. The phrase "Throw money at it" no longer applies to anything and all of a sudden organisations as well as people seem far more keen to share than before (although we are still not fully open and sharing, mostly it's organisations wanting stuff without sharing themselves, but we'll get there).

Anyway enough of that, what is actually happening?

Well I am very proud to be at the forefront of an international effort to hold a series of scholarly technology meetings focussed on solving institutional problems. These meetings, known as the Scholarly Information Technical Summit (SITS) meetings, are being held in alongside many international conferences over the next 2 years and are being backed by all the major international funding bodies. See http://bit.ly/Scholarly_Infrastructure_Technical_Summit for more info.

There have now been 2 meetings, although SITS only came about because the first one was so successful. Each meeting conforms to the Open Agenda (see wikipedia) principal and is chaired likewise. This leads to the agenda being very pertinent to the people in the room and often creates conversation critical to the forward momentum of some of the technologies discussed. In the next few paragraphs I'm going to try and summerise the hot topics from the first meeting:

SWORD - Put stuff in a repository

SWORD has undoubtedly been a huge success, it's simple and well supported by many publishers and publishing software (including most notably the Microsoft office suite via the author add-in tool http://research.microsoft.com/authoring). There are however some problems which the community wants to address without making it more complex:

Packaging Formats - What exactly do you submit in your SWORD bundle, how should it be formed. There was no clear consensus other than we feel endpoints should try to support a multitude of formats depending on their users.
Endpoints are hard to find, for both users and the software, this could do with being addressed either via negotiation or meta tags of some sort.
URIs in the returned package are not well specified to say what they mean or what they should mean.
Not a complete CRUD model
No levels of compliance any more
SWORD uses basic auth (too basic?)

The general call was that these points need addressing without making the SIMPLE (that's what the S stands for) too complex. CRUD looks interesting.

OUTCOME: A follow on SWORD project has been funded by JISC (UK) along with a number of complementary (but separate) projects including DepositMO (http://blogs.ecs.soton.ac.uk/depositmo) and SONEX (http://sonexworkgroup.blogspot.com/).

Personally i'm involved in DepositMO which intends to use SWORD (+CRUD) at it's core and extend this even further (outside of SWORD) to be fully interactive with the users. More can be found on the levels of conformance via the DepositMO blog (http://blogs.ecs.soton.ac.uk/depositmo).

Package guidelines are to be set out by the new SWORD project along with tight definitions on what URIs mean and what it means to CRUD those URIs.

Being written in to both projects I hope to bring not only technical knowledge to the table but also real world usages.

There was also a call to look into technologies like OAuth and it's usages in SWORD, however this was a minor part of what became a major conversation at the second meeting.

Inverse Sword

This conversation started on workflows and a discussion on the opportunities for common workflows and their impact. The problem is that workflows tend to be very specific and quiet heavy weight in their approach to a problem, often constrained by the domain. This is the advantage of SWORD, it doesn't specify one, just a technique for transferring stuff. So what about reverse SWORD where you request a URI and a packaging format you want.

This basically then re-inforced the conversation on what it meant to have SWORD endpoints supporting full CRUD using content negotiation to agree on packaging formats. Clearly something to take forward... as it was!

Storage for Digital Repositories

Question was (not from me): What is their beyond the Akubra (now DuraCloud) and my two projects (one of which has been finished)?

It is clear that there are now a whole range of storage options and technologies with infinite numbers of APIs, luckily many of the cloud providers use the S3 API (which is good!). So what rules languages are there for expressing where things should be stored?

I briefly explained the EPrints implementation (labelled as mine but it isn't, it's EPrints property) which uses lightweight plug-ins to communicate with each service. These plug-ins implement 4 API calls (Store, Retrieve, Delete and one other necessary I won't bother explaining here). There is then an XML/XSLT based policy file which dictates which plug-ins are used to store what. Each file is then stored and metadata adjusted to state where it is stored in case policy changes. Upon a policy change, the files can be re-arranged to their correct locations again. This can also handle changes in storage architecture and whole services being off-lined. Advantage with this approach, which the community likes, is that you can use any number of storage solutions simultaneously and store as many copies of files on different ones as you like. For more see http://eprints.ecs.soton.ac.uk/17084/.

The actions from this were that others were going to look at this implementation to see if this rule based language could apply on other repository platforms. Further it would be nice to have some good reference architectures available from vendors.

Services and Configuration Languages (was Common Platforms/Tools on the day)

This was an interesting conversation which started around the idea of being able to re-use technologies by re-using/calling code libraries directly. The problem is here (as I see it) the number of coding environments and versions of these environments available.

The solution is REST (not SOAP) APIs on the web and abstraction APIs in the code (e.g. SOLR) which enable you to call functions from (say) the command line, without having to understand the code.

David Flanders perhaps summed it up best, there are levels of interaction, some easier than others:

Core System (hard)
Exposing structured data
End user interfaces (including APIs)

XML for configuration is a bit of a sticking point with users, but you need a machine readable language to configure the machine. Perhaps the point is here only use XML if you need it otherwise simple config files with "=" signs in is fine.

There is no real answer to this question other than try and keep it simple... stupid.

Author IDs (URIs)

Yes it's our favourite topic raising its ugly head again!

It is clear that there are many efforts in this area, none of which have fully succeeded yet. There is still much interest in this area however and it is clear that we should be prepared to handle multiple IDs for a single author and be able to align them (if allowed) at a later stage.

Currently the project to watch is ORCID which is a continuation of a previous project by Thompson (which failed commercially in this project).

The consensus was however that we are not wrong to mint URIs for our authors in our repositories.

Conclusions

Identification/Authorisation is a problem, can technologies like OAuth not only help with authorisation but also with identification? This could be a very interesting area.

SWORD being taken forward is a very positive outcome of the first SITS meeting.

Simple services with simple APIs are so much more effective than "project centric" solutions and bloatware.

Simple services are usable by lots of people!

Usage Statistics parsing and querying with redis and python

Ben O'Steen — Fri, 26 Mar 2010 03:36:00 +0000

This is an update of my previous dabblings with chomping through log files. To summarise where I am now:

I have a distributable workflow, loosely coordinated using Redis and Supervisord - redis is used in two fashions: firstly using its lists as queues, buffering the communication between the workers, and secondly as a store, counting and associating the usage with the items and the metadata entities (people, subjects, etc) of those items.

I have written a very small python logger, that pushes loglines directly onto a redis list, providing me with live updating abilities, as well as manual log file parsing. This is currently switched on for testing in the live repository.

Current code base is here: http://github.com/benosteen/UsageLogAnalysis - it has a good number of things hardcoded to the perculiarities of my log files and repository. However, as part of the PIRUS 2 project, I am turning this into an easily reusable codebase, adding in the ability to push out OpenURLs to PIRUS statistics gatherers.

Overview:

Loglines -- lpush'd to 'q:loglines'

workers - 'debot.py' - pulls lines from this queue and parses them up, separating them into 4 categories:

Any hit by a recognised Bot or spider

Any view or download made by a real person on an item in the repository

Any 404, etc

And anything else

and the lines are moved onto 4 (5) queues respectively, q:bothits, q:objectviews (and q:count simultaneously), q:fof, and q:other. I am using prefixes as a convention when working with Redis keys - "q:" will almost always be a queue of some sort. These four queues are consumed by loggers, who commit the logs to disc, segregated into their categories.

The q:count queue is consumed by a further worker called - count.py. This does a number of jobs, and is the part that actually does the analysis.

For each repository item logged event, it finds the ID of the item and also whether this was a download of an item's files. With my repository, both these facts are deducible from the URL itself.

Given the ID, it checks redis to see if this item has had its metadata analysed before. If it hasn't, it grabs the metadata for the item from the repositories index (hosted by an instance of Apache Solr) and starts to add connections between metadata entity and ID to the redis index:

eg say item "pid:1" has the simple metadata of author_name='Ben' and subjects='foo, bar'

create unique IDs from the text by hashing the text and prefix it with the type of the field they came from:

Prefixes:

name => "n:"

institution => "i:"

faculty => "f:"

subjects => "s:"

keyphrases => "k:"

content type => "type:"

collection => "col:"

thesis type => "tt:"

eg

>>> from hashlib import md5

>>> md5("Ben").hexdigest()

'092f2ba9f39fbc2876e64d12cd662f72'

So, the hashkey of the 'name' 'Ben' is 'n:092f2ba9f39fbc2876e64d12cd662f72'

Now to make the connections in Redis:

Add ID to the set 'objectitems' - to keep track of all the IDs (SADD objectitems {ID})

Set 'n:092f2....' to 'Ben' (so we can keep a reverse mapping)

Add 'n:092f2...' to 'names' set (to make it clearer. KEYS n:* should return an equivalent set)

Add 'n:092f2...' to 'e:{id}' eg "e:pid:1" - (e -> prefix for collections of entities. e:{id} is a set of all entities that occur in id)

Add 'e:pid:1' to 'e:n:092f2....' (gathers a list of item ids in which this entity 'Ben' occurs in)

Repeat for any entity you wish to track.

To make this more truth-manageable, you should include the id of record with the text when you generate the hashkey. That way, 'Ben' appearing in one record will have a different key than 'Ben' occuring in another. The assertion that these two entities are the same can easily take place in a different set, (I'm using b: as the prefix for these bundles of asserted equivalence)

Once you have made these assertions, you can set about counting :)

Conventions for tracking hits:

d[v|d|o]:{id} - set of the dates on which {id} was viewed (v), downloaded from (d) or any other page action (o)

eg dv:pid:1 -> set of dates on which pid:1 had page views.

YYYY-MM-DD:{id}:[v|d|o] - set of IP clients that accessed a particular item on a given day - v,d,o as above

eg 2010-02-03:pid:1:d - set of IP clients that downloaded a file from pid:1 on 2010-02-03

t:views:{hashkey}, t:dls:{hashkey}, t:other:{hashkey}

Grand totals of views, downloads or other accesses on a given entity or id. Good for quick lookups.

Let's walk through an example: consider that a client of IP 1.2.3.4 visits the record page for this 'pid:1' on 2010-01-01:

ID = pid:1

Add the User Agent string ("mozilla... etc") to the 'ua:{IP}' set, to keep track of the fingerprints of the visitors.

Try to add the IP address to the set - in this case "2010-01-01:pid:1:v"

If the IP isn't already in this set (the client hasn't accessed this page already today) then:

make sure that "2010-01-01" is a part of the 'dv:pid:1' set

go through all the entities that are part of pid:1 (n:092... etc) and increment their totals by one.
- INCR t:views:n:092...
- INCR t:views:pid:1

Now, what about querying?

Say we wish to look up the activity on a given entity, say for 'Ben'?

First, find the hashkey(s) that exist that are equivalent - either directly using the simple md5sum hash, or by checking which bundles are for this entity.

You can get the grand totals by simply querying "t:views:key", "t:dls..." for each key and summing them together.

You can get more refined answers by getting the set of IDs that this entity is associated with, and querying that to gather all the daily IP sets for them, and summing the answer. This gives me a nice way to generate data suitable for a daily activity sparkline, like:

I have added another set of keys to the store, of the form 'geocode:{IP}' that record country code to IP address, which gives me a nice way to plot out graphs like the following also using the google chart API:

Python logging to Redis

This functionality is mainly in one file in the github repo: redislogger.py

As you can see, most of that file is taken up with a demonstration of how to invoke it! The file that holds the logging configuration which this demo uses is in logging.conf.example.

NB The usage analysis code and UI is very much a WIP

but, I just wanted to post quickly on the rough overview on how it is set up and working.

Curating content from one repository to put into another

Ben O'Steen — Thu, 25 Mar 2010 08:42:00 +0000

First you need a little code that I've written:

sudo easy_install recordsilo oaipmhscraper

(This should install all the dependencies for the following)

To harvest some OAI-PMH records from say... http://eprints.soton.ac.uk/perl/oai2 :

First, take a look at the Identify page for the OAI-PMH endpoint: http://eprints.soton.ac.uk/perl/oai2?verb=Identify

The example identifier indicates that the record identifiers start with: "oai:eprints.soton.ac.uk:" - we'll need this in a bit. Maybe not need, but it'll make the local storage more... elegant?

Go to a nice clean directory, with enough storage to handle whatever you want to harvest.

Start a python commandline:

>>> from oaipmhscraper import OAIPMHScraper

---> NB OAIPMHScraper(storage_dir, base_oai_url, identifier_uri_prefix)

>>> oaipmh = OAIPMHScraper("myrepo",

"http://eprints.soton.ac.uk/perl/oai2",

"oai:eprints.soton.ac.uk:")

Let's have a look at what could be found out about the OAI-PMH endpoint then:

>>> oaipmh.state

{'lastidentified': '2010-03-25T15:57:15.670552', 'identify': {'deletedRecord': 'persistent', 'compression': [], 'granularity': 'YYYY-MM-DD', 'baseURL': 'http://eprints.soton.ac.uk/perl/oai2', 'adminEmails': ['mailto:eprints@soton.ac.uk'], 'descriptions': ['........'], 'protocolVersion': '2.0', 'repositoryName': 'e-Prints Soton', 'earliestDatestamp': '0001-01-01 00:00:00'}}

>>> oaipmh.getMetadataPrefixes()

{'oai_dc': ('http://www.openarchives.org/OAI/2.0/oai_dc.xsd', 'http://www.openarchives.org/OAI/2.0/oai_dc/'), 'uketd_dc': ('http://naca.central.cranfield.ac.uk/ethos-oai/2.0/uketd_dc.xsd', 'http://naca.central.cranfield.ac.uk/ethos-oai/2.0/')}

Let's grab all the oai_dc from all the objects:

>>> oaipmh.getRecords('oai_dc')

...

Go make a cup of coffee or tea.... you'll get lots of stuff like:

INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1267 found with datestamp 2004-04-27T00:00:00 - storing.

2010-03-25 16:01:11,807 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.

INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.

2010-03-25 16:01:11,813 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.

INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.

2010-03-25 16:01:11,819 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1270 found with datestamp 2004-04-07T00:00:00 - storing.

INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1270 found with datestamp 2004-04-07T00:00:00 - storing.

2010-03-25 16:01:11,824 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1271 found with datestamp 2004-04-14T00:00:00 - storing.

...

My advice is to hop to a different terminal window and start to poke around with the content you are getting. The underlying store is a take on the CDL's Pairtree microspec (pairtree being a minimalist specification for how to structure the access to object-orientated items on a hierarchical filesystem) This model on top of pairtree I've called a Silo (in the RecordSilo library I've written) and constitutes a basic object model, where each object has a persistent JSON state (r/w-able) and can store any file or file in a subdirectory. It has crude object-level versioning, rather than file-versioning, so you can clone one version, delete/alter/add to it to create a second, curated version for reuse elsewhere without affecting the original.

What makes pairtree attractive is that the files themselves are not altered in form, so normal posix tools can be used on the files without unwrapping, depacking, etc.

Let's have a look around at what's been harvested so far into the "myrepo" silo:

>>> from recordsilo import Silo

>>> s = Silo("myrepo")

>>> s.state

{'storage_dir': 'myrepo', 'identifier_uri_prefix': 'oai:eprints.soton.ac.uk:', 'uri_base': 'oai:eprints.soton.ac.uk:', 'base_oai_url': 'http://eprints.soton.ac.uk/perl/oai2'}'}

>>> len(s) # NB this can be a time-consuming operation

1100

>>> len(s)

1200

Now let's look at a record: I'm sure I saw '6102' whizz past as it was harvesting...

>>> obj = s.get_item("oai:eprints.soton.ac.uk:6102")

>>> obj

{'files': {'1': ['oai_dc']}, 'subdir': {'1': []}, 'versions': ['1'], 'date': '2004-06-24T00:00:00', 'currentversion': '1', 'metadata_files': {'1': ['oai_dc']}, 'item_id': 'oai:eprints.soton.ac.uk:6102', 'version_dates': {'1': '2004-06-24T00:00:00'}, 'metadata': {'identifier': 'oai:eprints.soton.ac.uk:6102', 'firstSeen': '2004-06-24T00:00:00', 'setSpec': ['7374617475733D707562', '7375626A656374733D51:5148:5148333031', '7375626A656374733D47:4743', '74797065733D61727469636C65', '67726F75703D756F732D686B']}}

>>> obj.files

['oai_dc']

>>> obj.versions

['1']

>>> obj.clone_version("1","workingcopy")

'workingcopy'

>>> obj.versions

['1', 'workingcopy']

>>> obj.currentversion

'workingcopy'

>>> obj.set_version_cursor("1")

True

>>> obj.set_version_cursor("workingcopy")

True

>>> obj.files

['oai_dc']

>>> with obj.get_stream("oai_dc") as oai_dc_xml:

... print oai_dc_xml.read()

...

<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">

<dc:title>Population biology of Hirondellea sp. nov. (Amphipoda: Gammaridea: Lysianassoidea) from the Atacama Trench (south-east Pacific Ocean)</dc:title>

<dc:creator>Perrone, F.M.</dc:creator>

<dc:creator>Dell'Anno, A.</dc:creator>

<dc:creator>Danovaro, R.</dc:creator>

<dc:creator>Groce, N.D.</dc:creator>

<dc:creator>Thurston, M.H.</dc:creator>

<dc:subject>QH301 Biology</dc:subject>

<dc:subject>GC Oceanography</dc:subject>

<dc:description/>

<dc:publisher/>

<dc:date>2002</dc:date>

<dc:type>Article</dc:type>

<dc:type>PeerReviewed</dc:type>

<dc:identifier>http://eprints.soton.ac.uk/6102/</dc:identifier></oai_dc:dc></metadata>

You can add bytestreams as strings:

>>> obj.put_stream("foo.txt", "Some random text!")

or as file-like objects:

>>> with open("README", "r") as readmefile:

... obj.put_stream("README", readmefile)

...

>>> obj.files

['oai_dc', 'foo.txt', 'README']

>>> obj.set_version_cursor("1")

True

>>> obj.files

['oai_dc']

This isn't the easiest way to browse or poke around the files. It would be nice to see these through a web UI of some kind:

Grab the basic UI code from http://github.com/benosteen/siloserver

(You'll need to install web.py and Mako: sudo easy_install mako web.py)

Then edit the silodirectory_conf.py file to point to the location of the Silo - if the directory structure looks like the following:

myrepo

--- Silo directory stuff...

SiloServer

- dropbox.py

etc

You need to change data_dir to "../myrepo" and then you can start the server by running 'python dropbox.py'

Point a browser at http://localhost:8080/ and wait a while - that start page loads *every* object in the Silo.

And let's revisit our altered record, at http://localhost:8080/oai:eprints.soton.ac.uk:6102

So, from this point, I can curate the records as I wish, add files to each item - perhaps licences, PREMIS files, etc - and then push them onto another repository, such as Fedora.

An Analytical Anniversary

Richard — Mon, 22 Mar 2010 16:42:00 +0000

Today is my anniversary. I have been at Symplectic Ltd for one of your Earth "years". And a very busy one it has been, what with writing repository integration tools for our research management system to deposit content into DSpace, EPrints and Fedora, plus supporting the integration into a number of other platforms. I thought it would be fun to do a bit of a breakdown of the code that I've written from scratch in the last 12 months (which I'm counting as 233 working days). I'm going to do an analysis of the following areas of productivity:

lines of code
lines of inline code commentary
number of A4 pages of documentation (end user, administrator and technical)
number of version control commits

Lets start from the bottom and work upwards.

Number of version control commits

Total: 700

Per day: 3

I tend to commit units of work, so this might suggest that I do 3 bits of functionality every day. In reality I quite often also commit quick bug fixes (so that I can record in the commit log the fix details), or at the end of a day/week, when I want to know that my code is safe from hardware theft, nuclear disaster, etc.

Number of A4 pages of documentation

Total: 72

Per day: 0.31

Not everyone writes their documentation in A4 form any more, and it's true that some of my dox take the form of web pages, but as a commercial software house we tend to produce well formatted, nice end-user and administrator documentation. In addition, I rather enjoy at a geek level a nice printable document that's well laid out, so I do my technical dox that way too.

The amount of documentation is relatively small, but it doesn't take into account a lot of informal documentation. More importantly, though, at the back end of the first version of our Repository Tools software, the documentation is still in development. I expect the number of pages to probably triple or quadruple over the next few weeks.

Lines of Code and Lines of Commentary

I wrote a script which analysed my outputs. Ironically, it's written in Python, which isn't one of the languages that I use professionally, so it's not included in this analysis (and none of my personal programming projects are therefore included). This analysis covers all of my final code on my anniversary (23rd March), and does not take into account prototyping or refactoring of any kind. Note also that blank lines are not counted.

Line Counts:

XML (107 Files) :: Lines of Code: 17819; Lines of Inline Comments: 420

XML isn't really programming, but it was interesting to see how much I actually work with it. This figure is not used in any of the below statistics. Some of these are large metadata documents and some are configuration (maven build files, ant build files, web server config, etc).

XSLT (36 Files) :: Lines of Code: 8502; Lines of Inline Comments: 2762
JAVA (181 Files) :: Lines of Code: 22350; Lines of Inline Comments: 7565
JSP (16 Files) :: Lines of Code: 2847; Lines of Inline Comments: 1
PERL (58 Files) :: Lines of Code: 6506; Lines of Inline Comments: 1699
---------------
TOTAL (291 Files) :: Lines of Code: 40205; Lines of Inline Comments: 12027

I remember once being told that 30k lines of code a year was pretty reasonable for a developer. I feel quite chuffed!

Lines of code/comments per day:

XSLT :: Lines of Code: 36; Lines of Inline Comments: 12
JAVA :: Lines of Code: 96; Lines of Inline Comments: 32
JSP :: Lines of Code: 12; Lines of Inline Comments: 0
PERL :: Lines of Code: 28; Lines of Inline Comments: 7
---------------
TOTAL :: Lines of Code: 173; Lines of Inline Comments: 52

It looks much less impressive when you look at it on a daily basis. We just have to remember that this is 173 wonderful lines of code every day!

Comment to code ratio (comments/code):

XSLT :: 0.33
JAVA :: 0.34
JSP :: 0
PERL :: 0.26
---------------
TOTAL :: 0.30

It was interesting to see that my commenting ratio is fairly stable at about 30% of the overall codebase size. I didn't plan that or anything. This includes block comments for classes and methods, and inline programmer documentation. The reason for the shortfall in Perl is suggested below. Notice that I didn't write any comments in the JSPs because I only use this code for testing, and is less carefully curated code.

Some perl comments don't start with anything specific - they are block comments starting and ending with =xxx and =cut respectively, which is difficult to parse out for analysis easily. Therefore the Perl code line counts overestimate and the comment counts underestimate. More likely figures are, given a 0.33 comment to code ratio:

PERL (58 Files) :: Lines of Code: 5498; Lines of Inline Comments: 2707

Amount of testing code (testing/production):

9937 / 30268 = 0.33

This is the total amount of code that I wrote to test the other code that I wrote. So nearly 10k lines of code are there purely to demonstrate that the other 30k lines of code are working. I'm not going to suggest that this 33% is a linear relationship as the projects increase in size, but maybe we'll find out next year. Incidentally, the test code that I analysed was the third version of my test framework, so in reality I wrote quite a few more lines of code (perhaps 3 or 4k) before reaching the final version used above.

Note that I'm a big fan of Behaviour Driven Development, and this does tend to cause testing code to be fairly extensive in its own right.

Number of new files per day:

XSLT :: 0.15
JAVA :: 0.78
JSP :: 0.07
PERL :: 0.25
---------------
TOTAL :: 1.25

In reality, of course, I create lots and lots of new files over a short period of time, and then nothing for ages.

Average file length:

Excluding blank lines: 179
Including blank lines: 211
Spaciousness (including/excluding): 1.18

What is spaciousness? It's a measure of how I tend to space my code. Everyone, I have noticed, is fairly different in this regard - I wonder what other people's spaciousness is?

Source Code

Do you want to have a go at this yourself? Blogger doesn't make attaching files particularly easy, so you can get this from the nice folks at pastebin, who say this shouldn't ever time out: http://pastebin.com/GVkHd7tB.

My swiss army toolkit for distributed/multiprocessing systems

Ben O'Steen — Thu, 11 Feb 2010 03:45:00 +0000

My first confession - I avoid 'threading' and shared memory. Avoid it like the plague, not because I cannot do it but because it can be a complete pain to build and maintain relative to the alternatives.

I am very much pro multiprocessing versus multithreading - obviously, there are times when threading is by far the best choice, but I've found multiprocessing for the most part it to be quicker, easier and far easier to log, manage and debug than multithreading.

So, what do I mean by a 'multiprocessing' system? (just to be clear)

A multiprocessing system consists of many concurrently running processes running on one or more machines, and contains some means to distribute messages and persist data between these processes.

This does not mean that the individual processes cannot multithread themselves, it is just that each process handles a small, well-defined aspect of the system (paralleling the unix commandline tool idiom).

Tools for multiprocess management:

Redis - data structure server, providing atomic operations on integers, lists, sets, and sorted lists.
RabbitMQ - messaging server, based on the AMQP spec. IMO Much cleaner, easier to manage, more flexible and more reliable than all the JMS systems I've used.
Supervisor - a battle-tested, process manager that can be operated via XML-RPC or HTTP. Enables live control and status of your processes.

Redis has become my swiss army knife of data munging - a store that persists data and which has some very useful atomic operations, such as integer incrementing, list manipulations and very fast set operations. I've also used it for some quick-n-dirty process orchestrations (which is how I've used it in the example that ends this post.)

I've also used it for usage statistic parsing and characterisation of miscellaneous XML files too!

RabbitMQ - a dependable, fast message server which I am primarily using as a buffer for asynchronous operations and task distribution. More boilerplate to use than, say Redis, but by far more suited for that sort of thing.

Supervisord - I've been told that the ruby project 'god' is similar to this - I really have found it very useful, especially on those systems I run remotely. An HTML page to control processes and view logs and stats? what's not to like!

Now for a little illustration of a simple multiprocessing solution - in fact, this blog post far, far outweighs the code written and perhaps even overeggs the simple nature of the problem. I typically wouldn't use supervisor for a simple task like the following, but it seems a suitable example to show how to work it.

The ability to asynchronously deliver messages, updates and tasks between your processes is a real boon - it enables quick solutions to normally vexing or time-consuming problems. For example, let's look at a trivial problem of how to harvest the content from a repository with an OAI-PMH service:

A possible solution needs:

a process to communicate with the OAI-PMH service to gain the list of identifiers for the items in the repository (with the ability to update itself at a later time). Including the ability to find the serialised form of the full metadata for the item, if it cannot be gotten from the OAI-PMH service (eg Eprints3 XML isn't often included in the OAI-PMH service, but can be retrieved from the Export function.),
a process that simply downloads files to a point on the disc,
and a service that allows process one to queue jobs for process 2 to download - in this case Redis.

I told you it would be trivial :)

Installing Redis: (See http://code.google.com/p/redis/wiki/QuickStart for fuller instructions)

sudo apt-get install build-essential python-dev python-setuptools [make sure you can build and use easy_install - here shown for debian/ubuntu/etc]
sudo easy_install supervisor
mkdir oaipmh_directory # A directory to contain all the bits you need
cd oaipmh_directory

Create a supervisor configuration for the task at hand and save it as supervisord.conf.

[program:oaipmhgrabber]
autorestart = false
numprocs = 1
autostart = false
redirect_stderr = True
stopwaitsecs = 10
startsecs = 10
priority = 10
command = python harvest.py
startretries = 3
stdout_logfile = workerlogs/harvest.log

[program:downloader]
autorestart = true
numprocs = 1
autostart = false
redirect_stderr = True
stopwaitsecs = 10
startsecs = 10
priority = 999
command = oaipmh_file_downloader q:download_list
startretries = 3
stdout_logfile = workerlogs/download.log

[program:redis]
autorestart = true
numprocs = 1
autostart = true
redirect_stderr = True
stopwaitsecs = 10
startsecs = 10
priority = 999
command = path/to/the/redis-server
startretries = 3
stdout_logfile = workerlogs/redis.log

[unix_http_server]
file = /tmp/supervisor.sock

[supervisord]
minfds = 1024
minprocs = 200
loglevel = info
logfile = /tmp/supervisord.log
logfile_maxbytes = 50MB
nodaemon = false
pidfile = /tmp/supervisord.pid
logfile_backups = 10

[supervisorctl]
serverurl = unix:///tmp/supervisor.sock

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[inet_http_server]
username = guest
password = mypassword
port = 127.0.0.1:9001

This has a lot of boilerplate on it, so let's go through it, section by section:

[program:redis] - this controls the redis program. You will need to change the path to the redis server to wherever it was built on your system - eg ~/redis-1.2.1/redis-server

[program:oaipmhgrabber] and [program:downloader] - these set up the processes, look at the 'command' key for the command that is run for them eg downloader has "oaipmh_file_downloader q:download_list" - The OAIPMHScraper package adds in the script, 'q:download_list' is the redis-based list that the download tasks appear on. NB we haven't written harvest.py yet - don't worry!

NB very important that autorestart=false in [program:oaipmhgrabber] - if it didn't, it would eternally repeat - on and on and on - harvesting!

Supervisor boilerplate: [unix_http_server], [supervisord], [supervisorctl]

RPC interface control [rpcinterface:supervisor]

HTTP interface control - [inet_http_server] - which includes importantly the username and password to log in to the control panel!

Now, create the log directory:

mkdir workerlogs

Let's now write 'harvest.py': PLEASE use a different OAI2 endpoint url!

#!/usr/bin/env python

from oaipmhscraper import Eprints3Harvester

o = Eprints3Harvester("repo", base_oai_url="http://eprints.maths.ox.ac.uk/cgi/oai2/")

o.getRecords(metadataPrefix="XML",
template="%(pid)s/%(prefix)s/mieprints-eprint-%(pid)s.xml")

[Note there is a base OAIPMHScraper class, but this simply goes and gets the metadata or Identifiers for a given endpoint and stores whatever XML metadata it gets into a store. The Eprints3 harvester gets the files as well, or tries to.]

You may have to change the template for other eprints repositories - the above template would result in the following for item 774:

"http://eprints.maths.ox.ac.uk/cgi/export/774/XML/mieprints-eprint-774.xml"

YMMV for other repositories of course, so you can rewrite this template accordingly.

Your directory should look like this:

--> harvest.py supervisord.conf workerlogs/

Let's start the supervisor to make the configuration is correct:

[---] $ supervisord -c supervisord.conf

[---] $

Now open http://localhost:9001/ - it should look like the following:

Click on the 'redis' name to see the logfile that this is generating - you'll want to see lines like:

11 Feb 13:34:32 . 0 clients connected (0 slaves), 2517 bytes in use, 0 shared objects

Let's start the harvest :)

Click on 'start' for the oaipmh grabber process and wait - in the configuration file, we told it to wait for the process to stay up for 10 seconds before reporting that it was running, so it should take about that long for the page to refresh.

Now, let's see what it is putting onto the queue, before we start the download process (see, easy to debug!)

python

>>> from redis import Redis

>>> r = Redis()

>>> r.keys("*")

[u'q:download_list']

>>> r.llen("q:download_list")

351

>>> r.llen("q:download_list")

361

>>> r.llen("q:download_list")

370

>>> # Still accruing things to download as we speak...

>>> r.lrange("q:download_list", 0,0)

[u'{"url": "http://eprints.maths.ox.ac.uk/cgi/export/774/XML/mieprints-eprint-774.xml", "filename": "XML", "pid": "oai:generic.eprints.org:774", "silo": "repo"}']

Now, let's switch on the downloader and work on those messages - go back to http://localhost:9001 and start the downloader. Click on the downloader name when the page refreshes to get a 'tail' of it's logfile in the browser.

You should get something like the following:

INFO:CombineHarvester File downloader:Starting download of XML (from http://eprints.maths.ox.ac.uk/cgi/export/370/XML/mieprints-eprint-370.xml) to object oai:generic.eprints.org:370
2010-02-11 13:43:51,284 - CombineHarvester File downloader - INFO - Download completed in 0 seconds
INFO:CombineHarvester File downloader:Download completed in 0 seconds
2010-02-11 13:43:51,285 - CombineHarvester File downloader - INFO - Saving to Silo repo
INFO:CombineHarvester File downloader:Saving to Silo repo
2010-02-11 13:43:51,287 - CombineHarvester File downloader - INFO - Starting download of XML (from http://eprints.maths.ox.ac.uk/cgi/export/371/XML/mieprints-eprint-371.xml) to object oai:generic.eprints.org:371
INFO:CombineHarvester File downloader:Starting download of XML (from http://eprints.maths.ox.ac.uk/cgi/export/371/XML/mieprints-eprint-371.xml) to object oai:generic.eprints.org:371

So, that will go about and download all the XML (Eprints3 XML) for each item it found in the repository. (I haven't put in much to stop dupe downloads etc. - exercise for the reader ;))

How about we try to download the files for each item too? I just so happens I've included a little Eprints3 XML parser and method for queuing up the files for download 'reprocessRecords' - let's use this to download the files now - save as download_files.py

#!/usr/bin/env python

from oaipmhscraper import Eprints3Harvester

o = Eprints3Harvester("repo", base_oai_url="http://eprints.maths.ox.ac.uk/cgi/oai2/")

o.reprocessRecords()

Add this process to the top of the supervisord.conf file:

[program:queuefilesfordownload]
autorestart = false
numprocs = 1
autostart = false
redirect_stderr = True
stopwaitsecs = 10
startsecs = 10
priority = 999
command = python download_files.py
startretries = 3
stdout_logfile = workerlogs/download_files.log

Now, to demonstrate the commandline supervisor controller:

[--] $ supervisorctl

$ supervisorctl

downloader RUNNING pid 20750, uptime 0:15:41

oaipmhgrabber STOPPED Feb 11 01:58 PM

redis RUNNING pid 16291, uptime 0:25:31

supervisor> shutdown

Really shut the remote supervisord process down y/N? y

Shut down

supervisor>

(Press Ctrl+D to leave this terminal)

Now restart the supervisor:

[--] $ supervisord -c supervisord.conf

And refresh http://localhost:9001/

[NB in the following picture, I reran oaipmhgrabber, so you could see what the status of a normally exiting process looks like]

Now, switch on the reprocess record worker and tail -f the downloader if you want to watch it work :)

What's a RecordSilo? (aka How things are stored in the example)

This class is based on CDL's spec for Pairtree object storage - each object contains a JSON manifest and is made up of object-level versions. But, it is easier to understand if you have some kind of GUI to poke around with, so I quickly wrote the following dropbox.py server for that end:

Grab the dropbox code and templates from http://github.com/benosteen/SiloServer - unpack it into the same directory as you are in now.

so that:

[--] $ ls

download_files.py dropbox.py dump.rdb harvest.py repo supervisord.conf templates workerlogs

Edit dropbox.py and change the data_dir to equal your repo directory name - in this case, just "repo"

(Make sure you have mako and web.py installed too! sudo easy_install mako web.py)

then:

$ python dropbox.py

http://0.0.0.0:8080/

Go to http://localhost:8080/ to then see all your objects! This page opens them all, so could take a while :)

(I did this on my work computer and may have not put in some dependencies, etc but it worked for me. Let me know if it doesn't in the comments)

Usage stats and Redis

Ben O'Steen — Mon, 18 Jan 2010 09:19:00 +0000

Redis has been such a massively useful tool to me.

Recently, it has let me cut through access logs munging like a hot knife through butter, all with multiprocessing goodness.

Key things:

Using sets to manage botlists:

>>> from redis import Redis
>>> r = Redis()
>>> for bot in r.smembers("botlist"):
... print bot
...
lycos.txt
non_engines.txt
inktomi.txt
misc.txt
askjeeves.txt
oucs_bots
wisenut.txt
altavista.txt
msn.txt
googlebotlist.txt
>>> total = 0
>>> for bot in r.smembers("botlist"):
... total = total + r.scard(bot)
...
>>> total
3882

So, I have 3882 different IP addresses that I have built up that I consider bots.

Keeping counts and avoiding race-conditions

By using the Redis INCR command, it's easy to write little workers that run in their own process but which atomically increment counts of hits.

What does the stat system look like?

I am treating each line of the Apache-style log as a message that I am passing through a number of workers.

Queues

All in the same AMQP exchange: ("stats")

Queue "loglines" - msg's = A single log line in the Apache format. Can be sourced from either local logs or from the live service.

loglines is listened to by a debot.py worker, just one at the moment. This worker feeds three queues:

Queue "bothits" - log lines from a request that matches a bot IP

Queue "objectviews" - log lines from a request that was a record page view or item download

Queue "other" - log lines that I am presently not so interested in.

[These three queues are consumed by 3 loggers and these maintain a copy of the logs, pre-separated. These are designed to be temporary parts of the workflow, to be discarded once we know what we want from the logs.]

objectviews is subscribed to by a count.py worker which does the heavy crunching as shown below.

Debot.py

The first worker is 'debot.py' - this does the broad separation and checking of a logged event. In essence, it uses the Redis SISMEMBER command to see if the IP address is in the blacklists and if not, applies a few regex's to see if it is a record view and/or a download or something else.

Broad Logging

There are three logger workers that debot.py feeds for "bothits", "objectviews", and "other" - these workers just sit and listen on the relevant queue for an apache log line and appends it to the logfile it has open. Saves me having to open/close logger objects or pass anything around.

The logfiles are purely as a record of the processing and so I can skip redoing it if I want to do any further analysis, like tracking individuals, etc.

The loggers also INCR a key in Redis for each line they see - u:objectviews, u:bothits, and u:other as appropriate - these give me a rough idea of how the processing is going.

(And you can generate pretty charts from it too:)

http://chart.apis.google.com/chart?cht=p3&chds=0,9760660&chd=t:368744,9760660,1669552&chs=600x200&chl=Views|Bots|Other

(data sourced at a point during the processing - 10million bot hits vs 360k object views/dls)

Counting hits (metadata and time based)

Most of the heavy lifting is in count.py - this is fed from the object views/downloads stream coming from the debot.py worker. It does a number of procedural steps for the metadata:

Get metadata from ORA's Solr endpoint (as JSON)

Specifically, get the 'authors' (names), subjects/keyphrases, institutions, content types, and collections things appear in.
These fields correspond to certain keys in Redis. Eg names = 'number:names' = number of unique names, 'n:...' = hits to a given name, etc

For each view/dl:

INCR 'ids:XXXXX' where XXXXX is 'names', 'subjects', etc. It'll return the new value for this, eg 142
SET X:142 to be equal to the text for this new entity, where X is the prefix for the field.
SADD this id (eg X:142) to the relevant set for it, like 'names', 'subjects', etc - This is so we can have an accurate idea of the entities in use even after removing/merging them.
Reverse lookup: Hash the text for the entity (eg md5("John F. Smith")) and SET r:X:{hash} to be equal to "X:142"
SET X:views:142 to be equal to 1 to get the ball rolling (or X:dl:142 for downloads)

If the name is not new:

Hash the text and lookup r:{hash} to get the id (eg n:132)
INCR the item's counter (eg INCR n:views:132)

Time-based and other counts:

INCR t:{object id} (total hits on that repository object since logs began)
INCR t:MMYY (total 'proper' hits for that month)
INCR t:MMYY:{object id} (total 'proper' hits for that repo item that month)
INCR t:MMYY:{entity id} (Total hits for an entity, say 'n:132' that month)

A lot of pressure is put on Redis by count.py but it seems to be coping fine. A note for anyone else thinking about this: Redis keeps its datastore in RAM - running out of RAM is a Bad Thing(tm).

I know that I could also just use the md5 hashes as ids, rather than using a second id - I'm still developing this section and this outline just states it how it is now!

Also, it's worth noting that if I needed to, I can put remote redis 'shards' on other machines and they can just pull log lines from the main objectview queue to process. (It'll still need to create the id <-> entity name mapping on the main store though or a slave of the main store.)

But why did I do this?

I thought that it would mean I could handle both legacy logs and live data and have a framework I could put against other systems and in a way that would mean I would write less code and for the system to be more reliable.

So far, I still think this is the case. If people are interested, I'll abstract out a class or two (eg the metadata lookup function, etc) and stick it on google code. It's not really a lot of code so far, I think even this outline post is longer....

#iPres09: e-Infrastructure and digital preservation: challenges and outlook

Dave Tarrant — Mon, 05 Oct 2009 13:39:00 +0000

e-infrastructure: Starts by defining infrastructure (see wikipedia) and e-infrastructure specific to a collection of European digital repositories. So basically we are looking at opportunities to build and supply services which are applicable to these repositories.

Background: EU is supplying lots of support for this and in germany they are researching national approaches, identifying activities and assign tasks to "expert" institutions. By introducing the current fields of project he is outlining that there is still a significant mismatch between the scale of the problem and the amount of effort being expended. From this he outlines that there is a significant lack of common approaches to solving problems. [I don't think this will ever go away, unless there is a mandate, and even then not everyone will want to sign up].

[Lots of argument] Funding is focused on many individual projects and thus doubles up the the argument that there are no commons. This led leads to a slide about interoperability and standards and the lack of them. [Which again, i don't think will ever go away and I think that we should be appreciative that people tend to pick XML to encode their data in, this makes it interoperable right].

[This is a start of project presentation, I don't seem to see that much output. They have some simple models as diagrams, again though at this stage it is hard to see how they are not just another project which will come up with (another) set of standards which no one will then want to adopt.]

Giving a set of examples now where they are going to re-use and extend existing software/projects. The goals are good, in terms of concrete steps for global infrastructure for registries, data formats, software deposits and risk management. [Just not sure how achievable all this is based upon the fact it has been the aim of many projects already]

Thoughts on digitization, data deluge and linking

Dave Tarrant — Tue, 08 Sep 2009 07:01:00 +0000

It's been a while since I've put a post up and this is probably due to being busy and also trying to tidy up a lot of stuff before starting on new projects.

In this post then: Digitisation

I never really gathered how big the area of digitisation is and how many non repository people are actively involved in digitisation. There are a great many projects >50 who are digitising resources and these include national libraries. Items being digitised include everything from postcards and newspapers to full books and old journals.

So what's the problem here ... simple ... how many people are digitising the same things? Yes I know that there is so much out there that this is unlikely to be the case however it brings me nicely to the problem of information overload. There is already more valuable information on the internet than we can possibly handle effectively, so how do you ensure that any resources you digitize for open access usage on the web can be found and used?

I don't normally say this but perhaps we should look at physical libraries for the answer. Libraries are a very good central point where you can find publications related to all subject areas, and if your local library does not have a copy then it will try and find a copy somewhere else.

How then does this map onto the web? Web sites become the library and links become the references to additional items or items this site does not contain, simple right? Unfortunately with 50+ projects I can count already, this leads to 50+ different web sites all with differing information presented in different ways. Due to the presentation of each web site being totally different this means that in fact they are not a library - that pride themselves on the standard way to organise resources -
thus web sites become books. Thus to find resources we have to rely on search engines and federation. Thus we are back to where we started and we have a problem with information overload.

Unfotunately I don't have an answer to this problem, however I do know that links hold the key to the solution. Each website at the moment is simply an island of infromation, what is desperately required is the authors and community to establish links to these resources. If digitisation houses are curating refereed resources then the simplist way to link to these would be to put information about them on wikipedia.

This would be my final point then, wikipedia is actually a good thing, simply because of the the community aspect. However it also provides many other huge benefits:

External resources such as photoes have to have a licience

In annotating a page/item you create links and establish facts which are available by semantic wikipedia (dbpedia)

Wikipedia is an easy way to establish your presence on the link data web (linkeddata.org)

So if you are digitising books by an author, add this link to their wikipedia page. If you are digitising a collection of World War images, add links to some of these to wikipedia and flikr.

Establish links and help yourself to help everyone else.

Less talk, less code, more data - The Preserv2 Data Registry

Dave Tarrant — Wed, 08 Apr 2009 08:21:00 +0000

Yes, less talk more code (oxfordrepo.blogspot.com) is a good saying but i'm going to argue in this post that in fact we need more data! Having a ton of available services and a load of highly complex and well considered data models is all well and good but without data all of these services are useless; A repository is not a repository until it has something in it (Harnad).

If we look outside of the repository community for a minute we find the web community we are accumulating a whole ton of data, wikipedia being the main point of reference here. Yet in the repository community we are not harnessing this open linked data model to enhance our data.

I have been working in the area of digital preservation for a while now and the PRONOM file format registry (TNA UK) has been my friend for many years now and contains some valuable data. However I am concerned with the way I see it progressing. The main thing I use the PRONOM registry for is as a complement to DROID for file format information, and the data here is not even that complete. I am concerned however at the size of the new data model and the sheer effort which is going to be required to fill it with the data which it specifies.

Why not looked to the linked data web to see how to tie a series of smaller systems together to make a much more powerful and easier to maintain one!

This is where I have started with the preserv2 registry available at http://p2-registry.ecs.soton.ac.uk/.

The preserv2 registry is a semantic knowledge base (RDF triples based) with an SPARQL endpoint, RESTful services and a basic browser. Currently the data is focussed on file formats and is basically made up of the PRONOM database ported from a complex XML schema into simple RDF triples. On top of this i'm beginning to add data from dbpedia (wikipedia RDF'd) and making links between the PRONOM data and the dbpedia data!

Already this is helping is ascertain a greater knowledge base and the cost of gathering and compiling this data is very low. Other than that the registry took me less than a week to construct!

So "Go forth and make links" (Wendy Hall) is exactly what I'm now doing. With enough data you will be able to make complex OWL-S rules that can be used to deduce accurately facts such as formats which are at risk.

We need people!

Ben O'Steen — Thu, 19 Mar 2009 07:46:00 +0000

(UPDATE - Grrr.... seems that the concept of persistent URLs is lost on the admin - link below has been removed - see google cached copy here)

http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml - job description.

Essentially, we need smart people who are willing to join us to do good, innovative stuff; work that isn't by-the-numbers with room for initiative and ideas.

Help us turn our digital repository into a digital library, it'll be fun! Well, maybe not fun, but it will be very interesting at least!

bulletpoints: python/ruby frameworks, REST, a little SemWeb, ajax, jQuery, AMQP, Atom, JSON, RDF+RDFa, Apache WSGI deployment, VMs, linux, NFS, storage, RAID, etc.

Developer Happiness days - why happyness is important

Ben O'Steen — Wed, 25 Feb 2009 06:09:00 +0000

Creativity and innovation

One of the defining qualities of a good innovative developer is creativity and a pragmatic attitude; someone with the 'rough consensus, running code' mentality that pervades good software innovation. This can be seen as the drive to experiment, to turn inspiration and ideas into real, running code or to pathfind by trying out different things. Innovation can often happen when talking about quite separate, seemingly unrelated things, even to the point that most of the time, the 'outcomes' of an interaction are impossible to pin down.

Play, vagueness and communication

Creativity, inspiration, innovation, ideas, fun, and curiousity are all useful and important when developing software. These words convey concepts that do not thrive in situations that are purely scheduled, didactic, and teacher-pupil focussed. There needs to be an amount of 'play' in the system (see 'Play'.) While this 'play' is bad in a tightly regimented system, it is an essential part in a creative system, to allow for new things to develop, new ideas to happen and for 'random' interactions to take place.

Alongside this notion of play in an event, there also needs to be an amount of blank space, a vagueness to the event. I think that we can agree that much of the usefulness of normal conferences comes from the 'coffee breaks' and 'lunch breaks', which are blank spaces of a sort. It is the recognition of this that is important and to factor it in more.

Note that if a single developer could guess at how things should best be developed in the academic space, they would have done so by now. Pre-compartmentalisation of ideas into 'tracks' can kill potential innovation stone-dead. The distinction between CMSs, repositories and VLE developers is purely semantic and it is detrimental for people involved in one space to not overhear the developments, needs, ideas and issues in another. It is especially counter-productive to further segregate by community, such as having simultaneous Fedora, DSpace and EPrints strands at an event.

While the inherent and intended vagueness provides the potential for cross-fertilisation of ideas, and the room for play provides the space, the final ingredient is that of speech, or any communication that takes place with the same ease and at the same speed of speech. While some may find the 140 character limit on twitter or identi.ca a strange constraint, this provides a target for people to really think about what they wish to convey and keeps the dialogue from becoming a series of monologues - much like the majority of emails of mailing lists - and keeps it as a dialogue between people.

Communication and Developers

One of the dichotomies in the necessity of communication to development is that developers can be shy, initially preferring the false anonymity of textual communication to spoken words between real people. There is a need to provide means for people to break the ice, and to strike up conversations with people that they can recognise as being of like minds. Asking that people's public online avatars are changed to be pictures of them can help people at an event find those that they have been talking to online and to start talking, face to face.

On a personal note, one of the most difficult things I have to do when meeting people out in real life is answer the question 'What do you do?' - it is much easier when I already know that the person asking the question has a technical background.

And again, going back to the concept of compartmentalisation - developers who only deal with developers and their managers/peers will build systems that work best for their peers and their managers. If these people are not the only users then they need to widen their communications. It is important for the developers that do not use their own systems to engage with the people who actually do. They should do this directly, without the potential for garbled dialogue via layers of protocol. This part needs managing in whatever space, both to avoid dominance by loud, disgruntled users and to mitigate anti-social behaviour. By and large, I am optimistic of this process, people tend to want to be thanked, and this simple feedback loop can be used to help motivate. Making this feedback more disproportionate (a small 'thank you' can lead to great effects) and adding in the notion of highscore can lead to all sorts of interaction and outcomes, most notably being the rapid reinforcement of any behaviour that led to a positive outcome.

Disproportionate feedback loops and Highscores drive human behaviour

I'll just digress quickly to cover what I mean be a disproportionate feedback loop: A disproportionate feedback loop is something that encourages a certain behaviour; the input to which is something small and inexpensive, in either time or effort but the output can be large and very rewarding. This pattern can be seen in very many interactions: playing the lottery, [good] video game controls, twitter and facebook, musical instruments, the 'who wants to be a millionaire' format, mashups, posting to a blog ('free' comments, auto rss updating, a google-able webpage for each post) etc.

The natural drive for highscores is also worth pointing out. At first glance, is it as simple as considering its use in videogames? How about the concept of getting your '5 fruit and veg a day'? http://www.5aday.nhs.uk/topTips/default.html Running in a marathon against other people? Inbox Zero (http://www.slideshare.net/merlinmann/inbox-zero-actionbased-email), Learning to play different musical scores? Your work being rated highly online? An innovation of yours being commented on by 5 different people in quick succession? Highscores can be very good drivers for human behaviour, addictive to some personalities.

Why not set up some software highscores? For example, in the world of repositories, how about 'Fastest UI for self-submission' - encouraging automatic metadata/datamining, a monthly prize for 'Most issue tickets handled' - to the satisfaction of those posting the tickets, and so on.

It is very easy to over-metricise this - some will purposefully abstain from this and some metrics are truely misleading. In the 90s, there was a push to have lines of code added as a metric to productivity. The false assumption is that lines of code have anything to do with producitivity - code should be lean, but not too lean to maintain.

So be very careful when adding means to record highscores - they should be flexible, and be fun - if they are no fun for the developers and/or the users, they become a pointless metric, more of an obstacle than a motivation.

The Dev8D event

People were free to roam and interact at the Dev8D event and there was no enforced schedule, but twitter and a loudhailer were used to make people aware of things that were going on. Talks and discussions were lined up prior to the event of course, but the event was organised on a wiki which all were free to edit. As experience has told us, the important and sometimes inspired ideas occur in relaxed and informal surroundings where people just talk and share information, such as in a typical social situation like having food and drink.

As a specific example, look at the role of twitter at the event. Sam Easterby-Smith (http://twitter.com/samscam) created a means to track 'developer happiness' and shared the tracking 'Happyness-o-meter' site with us all. This unplanned development inspired me to relay the infomation back to twitter and similarly led to me running an operating system/hardware survey in a very similar fashion.

To help break the ice and to encourage play, we instituted a number of ideas:

A wordcloud on each attendees badge, consisting of whatever we could find of their work online, be it their blog or similar so that it might provide a talking point, or allow people to spot people who write about things they might be interested in learning more about.

The poker chip game - each attendee was given 5 poker chips at the start of the event, and it was encouraged that chips were to be traded for help, advice or as a way to convey a thank you. The goal was that the top 4 people ranked by amounts of chips at the end of the third day would receive a Dell mini 9 computer. The balance to this was that each chip was also worth a drink at the bar on that day too.

We were well aware that we'd left a lot of play in this particular system, allowing for lotteries to be set up, people pooling their chips, and so on. As the sole purpose of this was to encourage people to interact, to talk and bargain with each other, and to provide that feedback loop I mentioned earlier, it wasn't too important how people got the chips as long as it wasn't underhanded. It was the interaction and the 'fun' that we were after. Just as an aside, Dave Flanders deserves the credit for this particular scheme.

Developer Decathlon

The basic concept of the Developer Decathlon was also reusing these ideas of play and feedback: "The Developer Decathlon is a competition at dev8D that enables developers to come together face-to-face to do rapid prototyping of software ideas. [..] We help facilitate this at dev8D by providing both 'real users' and 'expert advice' on how to run these rapid prototyping sprints. [..] The 'Decathlon' part of the competition represents the '10 users' who will be available on the day to present the biggest issues they have with the apps they use and in turn to help answer developer questions as the prototypes applications are being created. The developers will have two days to work with the users in creating their prototype applications."

The best two submissions will get cash prizes that go to the individual, not to the company or institution that they are affiliated with. The outcomes of which will be made public shortly, once the judging panel have done their work.

Summary

To foster innovation and to allow for creativity in software development:

Having play space is important
Being vague with aims and flexible with outcomes is not a bad thing and is vital for unexpected things to develop - e.g. A project's outcomes should be under continual re-negotiation as a general rule, not as the exception.
Encouraging and enabling free and easy communication is crucial.
Be aware of what drives people to do what they do. Push all feedback to be as disproportionate as possible, allowing both developers and users to benefit, with only putting a relatively trivial amount of input in (this pattern affects web UIs, development cycles, team interaction, etc)
Choose useful highscores and be prepared to ditch them or change them if they are no longer fun and motivational.

Handling Tabular data

Ben O'Steen — Sun, 22 Feb 2009 18:28:00 +0000

"Storage"

I put the s-word in quotes because the storing of the item is actually a very straightforward process - we have been dealing with storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.

One realisation to make with regards to preserving access to data coming from research is that there is a huge backlog of data in formats that we shall kindly call 'legacy'. Not only is there this issue, but data is being made with tools and systems that effectively 'trap' or lock-in a lot of this information - case in point being any research being recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is normally difficult to impossible to extract the implicit information; how tables interlink, how the Access Form adds information to the dataset, etc.

It is this implicit knowledge that is the elephant in the room. Very many serialisations, such as SQL 'dumps', csv, xsl and so on, rely on implicit knowledge that is either related to the particulars of the application used to open it, or is actually highly domain specific.

So, it is trivial and easy to specify a model for storing data, but without also encoding the implied information and without making allowances for the myriad of sources, the model is useless; it would be akin to defining the colour of storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded in as good a way as possible.

Characterisation

The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team and help and guide them so that as much of the unsaid knowledge is known by all parties.)

Some serialisations so a good job of this themselves, *SQL-based serialisations include basic data type information inside the table declarations themselves. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we'll consider the implicit information that also needs to be recorded alongside such a declaration.

Some others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out or even deduce the type of data held in each column. Again, this data can be serialised in a SQL table declaration held alongside the original unmodified dataset.

(It is assumed that a basic data review will be carried out; does the csv have a consistent number of columns per row, is the version and operating system known for the MySQL that held the data, is there a PI or responsible party for the data, etc.

Implicit information

Good teachers are right to point out this simple truth: "don't forget to write down the obvious!" It may seem obvious that all your data is latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, the painful truth is that we can't guarantee that these aspects won't affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing [, at the time, detected] data corruption, or MySQL truncating fields when serialised in a way that is not anticipated or discovered until later.

In characterising the legacy sets of data, it is important to realise that there will be loss, especially with the formats and applications that blend presentation with storage. For example, it will require a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset, a highly researched dictionary of old english words and phrases, that the final output of which is a Macromedia Authorware application, and the source files are held by an unknown party (that is if they still exist at all) - the Joy of hiring Contractors. In fact, this warrants a slight digression:

The gap in IT support for research

If an academic researcher wishes to gain an external email account at their institution, there is an established protocol for this. Email is so commonplace, it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration, adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email (text or text with MIME attachments - NB the IM in MIME stands for Internet Mail)

If a researcher has a need to store tables of data, where do they turn? They should turn to the same department, who will handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What the IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it, to configure it as they want. No wonder the resulting legacy systems are as free-form as they are.

Practical measures - Curation

Back to specifics now, consider that a set of data has been found to be important, research has been based on it, and it's been recognised that this dataset needs to be looked after. [This will illustrate the technical measures. Licencing, dialogue with the data owners, and other non-technical analysis and administration is left out, but assumed.]

First task is to store the incoming data, byte-for-byte, as much as is possible - storing the iso image of the media the data is stored on, storing the SQL dump of a database, etc.

Analyse the tables of data - record the base types of each column (text, binary, float, decimal, etc) apeing the syntax of a SQL table declaration, as well as trying to identify the key columns.

Record the inter-table joins between primary and secondary keys, possibly by using a "table.column SAMEAS table.column;" declaration after the table declarations.

Likewise, attempt to add information concerning each column, information such as units or any other identifying material.

Store this table description alongside the recorded tabular data source.

Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are 'frozen', cells that are the results of embedded formula should be calculated and added as fixed values. It is important to record the environment, library and platform that these calculations are made with.

Table description as RDF (strictly, referencing cols/rows via the URI)

One syntax I am playing around with is the notion that by appending sensible suffixes to the base URI for a dataset, we can unique specify a row, a column, a region or even a single cell. Simply put:

http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column
http://datasethost/datasets/{data-id}#table/{table-name}/row/{column-id} to reference a whole row, etc

[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]

The purpose for this is so that the tabular description can be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata without disrupting or altering how this is handled.

A possible content type for tabular data

By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:

As a Fedora Object:

Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)
Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)
'Manifest' of the contents (dsid convention: RELS-INT)
Connections between this dataset and other objects, like articles, etc as well as the RDF description of this item (RELS-EXT)
Basic description of dataset for interoperability (Simple dublin core - DC)

As a BagIt+RDF:

Zip archive -

/MANIFEST (list of files and checksums)
/RDFMANIFEST (RELS-INT and RELS-EXT from above)
/data/* (original dataset files/disk images/etc)
/derived/* (normalised/re-rendered datasets in a well known format)

Presentation - the important part

What is described above is the archival of the data. This is a form suited for discovery, but is not in a form suited for reuse. So, what is the possibility?

BigTable (Google) or HBase (Hadoop) provides a platform where tabular data can be put in a scalable manner. In fact, I would go on to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise, and to educate the academic users in a manner that is useful to them, not just to the archivist. Google spreadsheet is an extremely good example of how such a large, scalable database might be presented to the end-user.

For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy to put to 'sleep' unused datasets can provide a useful comprimise, avoiding having all the tables live but still providing a useful service.

It should also be noted that the adoption of popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, and protocols and methods for access alter with fashion. Currently, Atom/RSS feeds of any part of a table of data (the google spreadsheet model) fits very well with the landscape of applications that can reuse this information.

Summary

Try to record as much information as can be found or derived - from host operating system to column types.
Keep the original dataset byte-for-byte as you recieved it.
Try to maintain a version of the data in a well-understood format
Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism
Be prepared to create services that the users want and need, not services that you think they should have.

Tracking conferences (at Dev8D) with python, twitter and tags

Ben O'Steen — Wed, 18 Feb 2009 04:18:00 +0000

There was so much going on at http://www.dev8d.org (#dev8d) that it might be foolish for me to attempt to write up what happened.

So, I'll focus on a small, but to my mind, crucial aspect of it - tag tracking with a focus on Twitter.

The Importance of Tags

First, the tag (#)dev8d was cloudburst over a number of social sites - Flickr(dev8d tagged photos), Twitter(dev8d feed), blogs such as the JISCInvolve Dev8D site, and so on. This was not just done for publicity, but as a means to track and re-assemble the various inputs to and outputs from the event.

Flickr has some really nice photos on it, shared by people like Ian Ibbotson (who caught an urban fox on camera during the event!) While there was an 'official' dev8d flickr user, I expect the most unexpected and most interesting photos to be shared by other people who kindly add on the dev8d tag so we can find them. For conference organisers, this means that there is a pool of images that we can choose from, each with their own provenance so we can contact the owner if we wanted to re-use, or re-publish. Of course, if the owner puts a CC licence on them, it makes things easier :)

So, asserting a tag or label for an event is a useful thing to do in any case. But, this twinned with using a messaging system like Twitter or Identi.ca, means that you can coordinate, share, and bring together an event. There was a projector in the Basecamp room, which was either the bar, or one of the large basement rooms at Birkbeck depending on the day. Initially, this was used to run through the basic flow of events, which was primarily organised through the use of a wiki, to which all of us and the attendees were members.

Projecting the bird's eye view of the event

I am not entirely sure whose idea it was initially to use the projector to follow the dev8d tag on twitter, auto-refreshing itself every minute, but it would be one or more of the following: Dave Flanders(@dfflanders), Andy McGregor(@andymcg) and Dave Tarrant(@davetaz) who is aka BitTarrant due to his network wizardry keeping the wifi going despite Birkbeck's network's best efforts at stopping any form of useful networking going.

The funny thing about the feed being there, was that it felt perfectly natural from the start. Almost like a mix of notice board, event liveblog and facebook status updates, but the overall effect was like it was the bird's eye view of the entire event, which you could dip into and out of at will, follow up on talks you weren't even attending, catch interesting links that people posted, and just follow the whole event while doing your own thing.

Then things got interesting.

From what I heard, a conversation in the bar about developer happiness (involving @rgardler?) lead to Sam Easterby-Smith (@samscam) to create a script that dug through the dev8d tweets looking for n/m (like 7/10) and to use that as a mark of happyness e.g.

" @samscam #dev8d I am seriously 9/10 happy http://samscam.co.uk/happier HOW HAPPY ARE YOU? " (Tue, 10 Feb 2009 11:17:15)

And computed the average happyness and overall happyness of those who tweeted how they were doing!

Of course, being friendly, constructive sorts, we knew the best way to help 'improve' his happyometer was to try to break it by sending it bad input... *ahem*.

" @samscam #dev8d based on instant discovery of bugs in the Happier Pipe am now only 3/5 happy " (Tue, 10 Feb 2009 23:05:05)

BUT things got fixed, and the community got involved and interested. It caused talk and debate, got people wondering how that it was done, how they could do the same thing and how to take it further.

At which point, I thought it might be fun to 'retweet' the happyness ratings as they change, to keep a running track of things. And so, a purpose for @randomdev8d was born:

How I did this was fairly simple: I grabbed his page every minute or so, used BeautifulSoup to parse the HTML, got the happyness numbers out and compared it to the last ones the script had seen. If there was a change, it tweeted it and seconds later, the projected tweet feed updated to show the new values - a disproportionate feedback loop, the key to involvement in games; you do something small like press a button or add 4/10 to a message, and you can affect the stock-market ticker of happyness :)

If I had been able to give my talk on the python code day, the code to do this would contain zero surprises, because I covered 99% of this - so here's my 'slides'[pdf] - basically a snapshot screencast.

Here's the crufty code though that did this:

import time
import simplejson, httplib2, BeautifulSoup
h = httplib2.Http()
h.add_credentials('randomdev8d','PASSWORD')
happy = httplib2.Http()
o = 130.9
a = 7.7
import urllib

while True:
print "Checking happiness...."
(resp, content) = happy.request('http://samscam.co.uk/happier/')
soup = BeautifulSoup.BeautifulSoup(content)
overallHappyness = soup.findAll('div')[2].contents
avergeHappyness = soup.findAll('div')[4].contents
over = float(overallHappyness[0])
ave = float(avergeHappyness[0])
print "Overall %s - Average %s" % (over, ave)
omess = "DOWN"
if over > o:
omess = "UP!"
amess = "DOWN"
if ave > a:
amess= "UP!"
if over == o:
omess = "SAME"
if ave == a:
amess = "SAME"
if not (o == over and a == ave):
print "Change!"
o = over
a = ave
tweet = "Overall happiness is now %s(%s), with an average=%s(%s) #dev8d (from http://is.gd/j99q)" % (overallHappyness[0], omess, avergeHappyness[0], amess)
data = {'status':tweet}
body = urllib.urlencode(data)
(rs,cont) = h.request('http://www.twitter.com/statuses/update.json', "POST", body=body)
else:
print "No change"
time.sleep(120)

(Available from http://pastebin.com/f3d42c348 with syntax highlighting - NB this was written beat-poet style, written from A to B with little concern for form. The fact that it works is a miracle, so comment on the code if you must.)

The grand, official #Dev8D survey!

... which was anything but official, or grand. The happyness-o-meter idea lead BitTarrant and I to think "Wouldn't it be cool to find out what computers people have brought here?" Essentially, finding out what computer environment developers choose to use is a very valuable thing - developers choose things which make our lives easier, by and large, so finding out which setups they use by preference to develop or work with could guide later choices, such as being able to actually target the majority of environments for wifi, software, or talks.

So, on the Wednesday morning, Dave put out the call on @dev8d for people to post the operating systems on the hardware they brought to this event, in the form of OS/HW. I then busied myself with writing a script that hit the twitter search api directly, and parsed it itself. As this was a more intended script, I made sure that it kept track of things properly, pickling its per-person tallys. (You could post up multiple configurations in one or more tweets, and it kept track of it per-person.) This script was a little bloated at 86 lines, so I won't post it inline - plus, it also showed that I should've gone to the regexp lesson, as I got stuck trying to do it with regexp, gave up, and then used whitespace-tokenising... but it worked fine ;)

Survey code: http://pastebin.com/f2c04719b

Survey results: http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqToHzjCs2jaQ

OS:
Linux was the majority at 42% closely followed by Apple at 37% with MS-based OS at 18% with a stellar showing of one user of OpenSolaris (4%)!

Hardware type:
66% were laptops, with 25% of the machines there being classed as netbooks. 8% of the hardware there were iPhones too, and one person claimed to have brought Amazon EC2 with them ;)

The post hoc analysis

Now then, having gotten back to normal life, I've spent a little time grabbing stuff from twitter and digging through them. Here is the list of the 1300+ tweets with the #dev8d tag in them published via google docs, and here is some derived things posted by Tony Hirst(@psychemedia) and Chris Wilper(@cwilper) seconds after I posted this:

Tagcloud of twitterer's:
http://www.wordle.net/gallery/wrdl/549364/dev8_twitterers [java needed]

Tagcloud of tweeted words:
http://www.wordle.net/gallery/wrdl/549350/dev8d [java needed]

And a column of all the tweeted links:
http://spreadsheets.google.com/pub?key=p1rHUqg4g423-wWQn8arcTg

This lead me to dig through them and republish the list of tweets, but try to unminimise the urls and try to grab the <title> tag of the html page it goes to, which you can find here:

http://spreadsheets.google.com/pub?key=pDKcyrBE6SJpwVmV4_4qOdg

(Which incidently, lead me to spot that there was one link to "YouTube - Rick Astley - Never Gonna Give You Up" which means the hacking was all worthwhile :))

Graphing Happyness

For one, I've re-analysed the happyness tweets and posted up the following:

A full log of happyness with timeline attached to it,
The running average, with accompanying timeline,
and the average of the last 10 tweets in much the same way as before.

It is easier to understand the averages as graphs over time of course! You could also use Tony Hirst's excellent write up here about creating graphs from google forms and spreadsheets. I'm having issues embedding the google timeline widget here, so you'll have to make do with static graphs.

Average happyness over the course of the event - all tweets counted towards the average.

Average happyness, but only the previous 10 tweets counted towards the average making it more reflective of the happyness at that time.

If you are wondering about the first dip, that was when we all tried to break Sam's tracker by sending it bad data, a lot of 0 happyness's were recorded therefore :) As for the second dip, well, you can see that from the log of happyness, yourselves :)

EPrints 3.2 - Amazon S3/Cloudfront Plug-in

Dave Tarrant — Wed, 21 Jan 2009 05:45:00 +0000

A quick post to say that we have just successfully tested an EPrints 3.2 (svn) install with the new Storage Controller plugged into Amazon S3!

This has quiet a lot of implications for both EPrints and other projects wanting to provide external services which operate on objects in a repository. We hope to bring people more news on this at the upcoming Open Repositories 2009 conference in Atlanta.

For more information on this all check out storage section on the Preserv2 website @ www.preserv.org.uk.

Beginning with RDF triplestores - a 'survey'

Ben O'Steen — Fri, 14 Nov 2008 08:50:00 +0000

Like last time, this was prompted by an email that eventually was passed to me. It was a call for opinion - "we thought we'd check first to see what software either of you recommend or use for an RDF database."

It's a good question.

In fact, it's a really great question, as searching for similar advice online results in very few opinions on the subject.

But which one's are the best for novices? Which have the best learning curves? which has the easiest install or the shortest time between starting out and being able to query things?

I'll try to pose as much as I can as a newcomer which won't be too hard :) Some of the comments will be my own, and some will be comments from others, but I'll try to be as honest as I can be to reflect new user expectation and experience and most importantly, developer-attention span. (See the end for some of my reasons for this approach.)

(Puts on newbie hat and enables PEBKAC mode.)

Installable (local) triplestores

Sesame - http://www.openrdf.org/

Simple menu on the left of the website, one called downloads. Great, I'll give that a whirl. "Download the latest Sesame 2.x release" looks good to me. Hmm 5 differently named files... I'll grab the 'onejar' file and try to run it. "Failed to load Main-Class manifest attribute from openrdf-sesame-2.2.1-onejar.jar", okay... so back to the site to find out how to install this thing.

No links for installation guide... on the Documentation page, no link for installation instructions for the sesame 2.2.1 I downloaded, but there is Sesame 2 user documentation and Sesame 2 system documentation. Phew, after guessing that the user documentation might have the guide, I finally found the installation guide (system documentation was about the architecture, not how to administer the system as you might expect.)

(Developer losing interest...)

Ah, I see, I need the SDK. I wonder what that 'onejar' was then... "The deployment process is container-specific, please consult the
documentation for your container on how to deploy a web application. " - right, okay... let's assume that I have a Java background and am not just a user wanting to hook into it from my language of choice, such as php, ruby, python, or dare I say it, javascript.

(Only Java-friendly developers continue on)

Right, got Tomcat, and put in the war file... right so, now I need to work out how to use a commandline console tool to set up a 'repository'... does this use SVN or CVS then? Oh, it doesn't do anything unless I end the line with a period. I thought it had hung trying to connect! "Triple indexes [spoc,posc]" Wha? Well, whatever that was, the test repository is created. Let's see what's at http://localhost:8080/openrdf-sesame then.

"You are currently accessing an OpenRDF Sesame server. This server is
intended to be accessed by dedicated clients, using a specialized
protocol. To access the information on this server through a browser,
we recommend using the OpenRDF Workbench software."

Bugger. Google for "sesame clients" then.

There is a Java client it seems, but it seems to need a lot to get going. Oh, and useful if my application is in Java or in a JVM (jRuby, jython)
http://jeenbroekstra.blogspot.com/2008/09/sesame-2-desktop-client.html .Net GUI... not so useful for programmatic stuff
...

I've pretty much given up at this point. If I knew I needed to use a triplestore then I might have persisted, but if I was just investigating it? I would've probably given up earlier.

Mulgara - http://www.mulgara.org/

Nice, they've given the frontpage some style, not too keen on orange, but the effort makes it look professional. "Mulgara is a scalable RDF database written entirely in Java." -> Great, I found what I am looking for, and it warns me it needs Java. "DOWNLOAD NOW" - that's pretty clear. *click*

Hmm, where's the style gone? Lots of download options, but thankfully one is marked by "These released binaries are all that are required for most applications." so I'll grab those. 25Mb? Wow...

Okay, it's downloaded and unpacked now. Let's see what we've got - a 'dist/' directory and two jars. Well, I guess I should try to run one (wonder what the licence is, where's the README?)

Mulgara Semantic Store Version 2.0.6 (Build 2.0.6.local) INFO [main] (EmbeddedMulgaraServer.java:715) - RMI Registry started automatically on port 10990 [main] INFO org.mulgara.server.EmbeddedMulgaraServer - RMI Registry started automatically on port 1099 INFO [main] (EmbeddedMulgaraServer.java:738) - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy3 [main] INFO org.mulgara.server.EmbeddedMulgaraServer - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy2008-11-14 14:06:39,899 INFO Database - Host name aliases for this server are: [billpardy, localhost, 127.0.0.1]

Well, I guess something has started... back to the site, there is a documentation page and a wiki. A quick view of the official documentation has just confused me, is this an external site? No easy link to something like 'getting started' or tutorials. I've heard of SPARQL, what's iTQL? nevermind, let's see if the wiki is more helpful.

Let's try 'Documentation' - sweet, first link looks like what I want - Web User Interface.

A default configuration for a standalone Mulgara server runs a set of
web services, including the Web User Interface. The standard
configuration puts uses port 8080, so the web services can be seen by
pointing a browser on the server running Mulgara to http://localhost:8080/.

Ooo cool. *click*

Available Services
SPARQL HTTP Service
User Interface
Web Services
TQL HTTP Service

SPARQL, I've heard of that. *click*

HTTP ERROR: 400
Query must be supplied
RequestURI=/sparql/
Powered by Jetty://

I guess that's the SPARQL api, good to know, but the frontpage could've warned me a little. Ah, second link is to the User Interface.

Good, I can use a drop down to look at lots of example queries, nice. Don't understand most of them at the moment, but it's definitely comforting to have examples. They look nothing like SPARQL though... wonder what it is? I'm sure it does SPARQL... was I wrong?

Quick poke at the HTML shows that it is just POSTing the query text to webui/ExecuteQuery. Looks straightforward to start hacking against too, but probably should password protect this somehow! I wonder how that is done... documentation mentions a 'java.security.policy' field:java.security.policy
string: URL: The URL for the security policy file to use.
Default: jar:file:/jar_path!/conf/mulgara-rmi.policy

Kinda stumped... will investigate that later, but at least there's hope. Just be firing off the example queries though shows me stuff, so I've got something to work with at least.

Jena - http://jena.sourceforge.net/

Front page is pretty clear, even if I don't understand what all those acronyms are. downloads link takes me to a page with an obvious download link, good. (Oh, and sourceforge, you suck. How many frikkin mirrors do I have to try to get this file?)

Have to put Jena on pause while Sourceforge sorts its life out.

ARC2 - http://arc.semsol.org/

Frontpage: "Easy RDF and SPARQL for LAMP systems" Nice, I know of LAMP and I particularly like the word Easy. Let's see... Download is easy to find, and tells me straight away I need PHP 4.3+ and MySQL 4.0.4+ *check* Right, now how do I enable PHP for apache again?... Ah, it helps if I install it first... Okay, done. Dropping the folder into my web space... Hmm nothing does anything. From the documentation, it does look like it is geared to providing a PHP library framework for working with its triplestore and RDF. Hang on, SPARQL Endpoint Setup looks like what I want. It wants a database, okay... done, bit of a hassle though.

Hmm, all I get is "Fatal error: Call to undefined function mysql_connect() in /********/arc2/store/ARC2_Store.php on line 53"

Of course, install php libraries to access mysql (PEBKAC)... done and I also realise I need to set up the store, like the example in "Getting Started"... done (with this) and what does the index page now look like?

Yay! there's like SPARQL and stuff... I guess 'load' and 'insert' will help me stick stuff in, and 'select' looks familiar... Well, it seems to be working at least.

Unfortunately, it looks like the Jena download from sourceforge is in a world of FAIL for now. Maybe I'll look at it next time?

Triplestores in the cloud

Talis Platform - http://www.talis.com/platform/

From the frontpage - "Developers using the Platform can spend more of their time building
extraordinary applications and less of their time worrying about how
they will scale their data storage." - pretty much want I wanted to hear, so how do I get to play with it?

There is a Get involved link on the left, which rapidly leads me to see the section: "Develop, play and try out" - n² developer community seems to be where it wants me to go.

Lots of links on the frontpage, takes a few seconds to spot: "Join - join the n² community to get free developer stores and online support" - free, nice word that. So, I just have to email someone? Okay, I can live with that.

Documentation seems good, lots of choices though, a little hard to spot a single thread to follow to get up to speed, but Guides and Tutorials looks right to get going with. The Kniblet tutorial (whatever a kniblet is) looks the most beginnerish, and it's also very PHP focussed, which is either a good thing or a bad thing depending on the user :)

Commercial triplestores

Openlink Virtuoso - http://virtuoso.openlinksw.com/

Okay, I tried the Download link, but I am pretty confused by what I'm greeted with:

Not sure what one to pick just to try it out, it's late in the day, and my tolerance for all things installable has ended.

-----------------------------------------

Why take the http/web-centric, newbie approach to looking at these?

Answer: In part, I am taking this approach because I have a deep belief that it
was only after relational DBs became commoditised - "You want fries
with you MySQL database?" - that the dynamic web kicked off. If we want
the semantic web to kick off, we need to commoditise it or at least, make
it very easy for developers to get started. And I mean EASY. A query that I want answered is: "Is there something that fits: 'apt-get install
triplestore; r = store('localhost'), r.add(rdf), r.query(blah)'? "

(I am particularly interested to see what happens when Tom Morris's work on Reddy collides with ActiveRecord or activerdf...)

NB I've short circuited the discovery of software homepages - Imagine
I've seen projects stating that they use "XXXXX as a triplestore". I know
this will likely mean I've compared apples to oranges, but as a newbie, how
would I be expected to know this? "Powered by the Talis Platform" and
"Powered by Jena" seem pretty similar on the surface.)

A Fedora/Solr Digital Library for Oxford's 'Forced Migration Online'

Ben O'Steen — Thu, 13 Nov 2008 06:30:00 +0000

(mods:subtitle - Slightly more technical follow-up to the Fedora Hatcheck piece.)

As I have been prompted via email by Phil Cryer (of the Missouri Botanical Garden) to talk more about how this technically works, I thought it would be best to make it a written post, rather than the more limited email response.

Background

Forced Migration Online (FMO) had a proprietary system, supporting their document needs. It was originally designed for newpaper holdings and applied that model to encoding the mostly paginated documents that FMO held - such that each part was broken up into paragraphs of text, images and the location of all these parts on a page. It even encoded (in its own format) the location of the words on the page when it OCR'd the documents, making per-word higlighting possible. Which is nice.

However, the backend that powered this was over-priced, and FMO wanted to move to a more open, sustainable platform.

Enter the DAMS

(DAMS = Digital Asset Management System)

I have been doing work on trying to make a service out of a base of fedora-commons and additional 'plugin' services, such as the wonderful Apache Solr and the useful eXist XML db. The end aim is for departments/users/whoever to requisition a 'store' with a certain quality of service (solr attached, 50Gb+ etc) but this is not yet an automated process.

The focus for the store is a very clear separation between storage, management, indexing services and distribution - Normal filesystems, or Sun Honeycomb are the storage, Fedora-commons provides the management + CRUD, solr, eXist, mulgara, sesame, and couchDB can provide potential index and query services, and distribution is handed pragmatically, caching outgoing and mirroring where necessary.

The FMO 'store'

From discussions with FMO, and examining the information they held and the way they wished to make use of it, a simple Fedora/Solr store seemed to fufill what they wanted: a persistant store of items with attachments and the ability to search the metadata and retrieve results.

Bring in the consultants

FMO hired Aptivate to do the migration of their data from the proprietary system, in its custom format, to a Fedora/Solr store and trying as much as possible to retain the functionality they had.

Some points that I think it is important to impress on people here:

In general, software engineer consultants don't understand METS or FOXML.
They *really* don't understand the point of disseminators.
Having to teach software engineer consultants to do METS/FOXML/bDef's etc is likely an arduous and costly task.
Consultants add lots of money to do things their team don't already have the experience to do.

So, my conclusion was to not make these things part of the development at all to the extent that I might even have forgotten to mention these things to them except in passing. I helped them install their own local store and helped them with the various interfaces and gotchas of the two software packages. By showing them how I use Fedora and Solr in ora.ouls.ox.ac.uk, they were able to hit the ground running.

They began by using the REST interface to Fedora and the RESTful interface to Solr. By having them begin by using the simple put/get REST interface to Fedora, they could concentrate on getting used to the nature of Fedora as an objectstore. I think they moved to use the SOAP interface as it better suited their Java background, although I cannot be certain as it wasn't an issue that came up.

Once they had developed the migration scripts to their satisfaction, they asked me to give them a store, which I did (but due to hardware and stupid support issues here I am afraid to say I held them up on this.) They fired off their scripts, moved all the content into the fedora with a straightforward layout per object (pdf, metadata, fulltext and thumbnail) The metadata is - from what I can see - the same XML metadata as before - very MARCXML in nature, with 'Application_Info' elements having types like 'bl:DC.Title'. If necessary, we will strip out the dublin core metadata and put what we can into the DC datastream, but that's not of particular interest to FMO right now.

Fedora/Solr notes

As for the link between Solr and Fedora? This is very loosely coupled, such that they are running in the same Tomcat container for convenience, but aren't linked in a hard way.

I've looked at GSearch, which is great for a homogenous collection of items, such that they can be acted on by the same XSLT to produce a suitable record for Solr, but as the metadata was a complete unknown for this project, it wasn't too suitable.

Currently, they have one main route into the fedora store, and so, it isn't hard to simply reindex an item after a change is made, especially for services such as Solr or eXist, which expect to have things change incrementally. I am looking at services such as ActiveMQ for scheduling these index tasks, but more and more I am starting to favour RabbitMQ which seems to be more useful, while retaining the scalability and very robust nature.

Sending an update to Solr is as simple as an HTTP POST to its /update service, consisting of a XML or JSON packet like " changeme:1 John Smith .... " - it uses a transactional model, such that you can push all the changes and additions into the live index via a commit call, without taking the index offline. To query Solr, all manner of clients exist, and it is built to be very simple to interact with, handling facet queries, filtering, ordering and can deliver the results in XML, JSON, PHP or Python directly. It can even do a XSLT transform of the results on the way out, leading to a trivial way to support OpenSearch, Atom feeds and even HTML blocks for embedding in other sites.

Likewise, to change a PDF in Fedora can be done by a HTTP POST as well. Does it need to be more complicated?

Last, but not least, a project to watch closely:

The Fascinator project, funded by ARROW, as part of their mini project scheme, is an Apache Solr front end to the Fedora commons repository. The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. Well worth a look, as it seeks to turn this Fedora/Solr pairing truly into an appliance, with a simple installer and handling the linkage between the two.

News and updates Oct 2008

Ben O'Steen — Thu, 16 Oct 2008 03:39:00 +0000

Right, I haven't forgotten about this blog, just getting all my ducks in a line as it were. Some updates:

The JISC bid for eAdministration was successful, titled "Building the Research Information Infrastructure (BRII)". The project will categorise the research information structure, build vocabularies if necessary, and populate it with information. It will link research outputs (text and data), people, projects, groups, departments, grant/funding information and funding bodys together, using RDF and as many pre-existing vocabularies as is suitable. The first vocab gap we've hit is one for funding, and I've made a draft RDF schema for this which will be openly published once we've worked out a way to make it persistent here at Oxford (trying to get a vocab.ox.ac.uk address)

One of the final outputs will be a 'foafbook' which will re-use data in the BRII store - it will act as a blue book of researchers. Think Cornell's Vivo, but with the idea of Linked Data firmly in mind.
We are just sorting out a home for this project, and I'll post up an update as soon as it is there.

Forced Migration Online (FMO) have completed their archived document migration from a crufty, proprietary store to a ORA-style store (Fedora/Solr) - you can see their preliminary frontend at http://fmo.qeh.ox.ac.uk. Be aware that this is a work in progress. We provide the store as a service to them, giving them a Fedora and a Solr to use. They contracted a company called Aptivate to migrate their content, and I believe also to create their frontend. This is a pilot project to show that repositories can be treated in a distributed way, given out like very smart, shared drive space.
We are working to archive and migrate a number of library and historical catalogs. A few projects have a similar aim to provide an architecture and software system to hold medieval catalog research - a record of what libraries existed, and what books and works they held. This is much more complex that a normal catalog, as each assertion is backed by a type of evidence, ranging from the solid (first-hand catalog evidence), to the more loose (handwriting on the front page looks like a certain person who worked at a certain library.) So modelling this informational structure is looking to be very exciting, and we will have to try a number of ways to represent this, starting with RDF due to the interlinked nature of the data. This is related to the kinds of evidence that genealogy uses, and so related ontologies may be of use.
The work on storing and presenting scanned imagery is gearing up. We are investigating storing the sequence of images and associated metadata/ocr text/etc as a single tar file as part of a Fedora object (i.e. a book object will have a catalog record, technical/provenance information and an attached tar file and and a list of file to offset information.)

This is due to us trying to hit the 'sweet spot' for most file systems. A very large number of highly compressed images and little pieces of text does not fit well with most FS internals. We estimate that for a book there will be around [4+PDFs+2xPages] files, or 500+ typically. Just counting up the various sources of scanned media we already have, we are pressing for about 1/2 million books from one source, 200,000 images from another, 54,000 from yet another... it's adding up real fast.

We are starting to deal with archiving/curating the 'long-tail' of data - small, bespoke datasets that are useful to many, but don't fall into the realm of Big Data, or Web data. I don't plan on touching Access/FoxPro databases any time soon though! I am building a Fedora/Solr/eXist box to hold and disseminate these, which should live at databank.ouls.ox.ac.uk very, very shortly. (Just waiting on a new VMware host to grow into, our current one is at capacity.)

To give a better idea of the structure, etc, I am writing it up in a second blog post to follow shortly - currently named "Modelling and storing a phonetics database inside a store"

I am in the process of integrating the Google-analytics-style statistics package at http://piwik.org with the ORA interface, to give relatively live hit counts on a per-item and to build per-collection reports.

Right now, piwik is capturing the hits and downloads from ORA, but I have yet to add in the count display on each item page, so halfway there :)

We are just waiting on a number of departments here to upgrade the version of EPrints they are using for their internal, disciplinary repositories, so that we can begin archiving surrogate copies of the work they wish to put up for this service. (Using ORE descriptions of their items) By doing so, their content becomes exposed in ORA, mirror copies are made (working on a good way to maintain these as content evolves), but they retain the content control, ORA will also act as a registry for their content. It's only when their service drops do the users get redirected to the mirror copies that ORA holds (think google cache, but a 100% copy).
In the process of battle-testing the Fedora-Honeycomb connection, but as mentioned above, just waiting for a little more hardware before I set to it. Also, we are examining a number of other storage boxes that should plug in under Fedora, using the Honeycomb software, such as the new and shiny Thumper box, "Thor" Sun Fire Xsomething-or-other. Also, getting pretty interested at the idea of MAID storage - massive array of idle disks. Hopefully, this will act like tape, but have a sustainable access speed of disk. Also, a little more green than a tower of spinning hardware.
Planning out the indexer service at the moment. It will use the Solr 1.3 multicore functionality, with a little parsing magic at the ingest side of things to make a generic indexer-as-a-service type system. One use-case is to be able to bring up VM machines with multicore solr on to act as indexers/search engines as needed. An example aim? "Economics want an index that facets on their JEL codes." POST a schema and ingest indexer to the nearest free indexer, and point the search interface at it once an XMPP message comes back that it is finished.
URI resolvers - still investigating what can be put in place for this, as I strongly wish to avoid coding this myself. Looking at OCLC's OpenURL and how I can hack it to feed it info:fedora uris and link them to their disseminated location. Also, using a tinyurl type library + simple interface might not be a bad idea for a quick PoC.
Just to let you all know that we are building up the digital team here, most recently held interviews for the futureArch project but we are looking for about 3 others to hire, due to all the projects we are doing. We will be putting out job adverts as and when we feel up to battling with HR :)

That's most of the more interesting hot topics and projects I am doing at the moment.... phew :)

Institutions hate repositories... one simple reason.

Dave Tarrant — Thu, 18 Sep 2008 02:14:00 +0000

Open access is not enough!

People want to give Open Access to some of their materials at their institution however the IR software is seen as a means to manage all Institutional content and not just that which is Open Access and part of the external image of the Institution.
The problem exists in the other direction as well where repository software is trying to solve these problems, thus people are not likely to use this software until it is included.

So what do we end up with...

Lots of Repository Islands which aren't interoperable with each other!

So if we solve the access and copyright issue will people use the software? errrr No. At this point the software is an all in solution and not a service which can be utilised by current institutional practise ... Give up...?

No!

Focus on providing a service, e.g. something which can manage your Digital Resources and enable this to plug to existing institutional services. Some softwares would argue they support this already. OK good, so don't try and solve the problem if it is just an integration issue.

To the repositories: Decouple! Build a set of services, build ways of plugging services together and allow the community to pic 'n' mix.

To the institution: You already have access control systems ask your Information/Computer Systems department. You probably already have a Content Management System for educational resources for students (Blackboard? - Integrates with an LDAP server), these use external services to manage access and authentication! Here's a few services for you... LDAP, Radius, Eduroam, Domain Controller.

Till next time!

Enterprise Abstraction: CISTI SRU API

Thu, 11 Sep 2008 09:18:53 +0000

DSpace and Fedora need opinionated installers.

Ben O'Steen — Mon, 18 Aug 2008 05:16:00 +0000

Just to say that both Fedora-Commons and DSpace really, really need opinionated installers that make choices for the user. Getting either installed is a real struggle - which we demonstrated during the Crigshow, so please don't write in the comments that it is easy, it just isn't.

Something that is relatively straightforward to install, is a debian package.

So, just a plea in the dark, can we set up a race? Who can make their repository software installable as a .deb first? will it be DSpace or Fedora? Who am I going to send a box of cookies to and a thank you note from the entire developer community?

(EPrints doesnt count in this race; they've already done it)

Re-using video compression code to aid document quality checking

Ben O'Steen — Mon, 18 Aug 2008 04:47:00 +0000

(Expanding on this video post from the Crigshow)

Problem:

The volume of pages from a large digitisation project can be overwhelming. Add into that the simple fact that all (UK) institutional projects are woefully underfunded and underresourced, it's surprising that we can cope with them really.

One issue that repeatedly comes up is the idea of quality assurance; How can we know that a given book has been scanned well? How can we spot images easily? Can we detect if foreign bodies were present in the scan, such as thumbs, fingers or bookmarks?

A quick solution:

Inspired by a talk at one of the conference strands at WorldComp, where the author talked about the use of a component of a commonly used video compression standard (MPEG2) to detect degrees of motion and change in a video, without having to analyse the image sequences using a novel, or smart algorithm.

He talked about using the motion vector stream to be a good rough guide to the amount of change between frames of video.

So, why did this inspire me?

MPEG-2 compression is a pretty much a solved problem; there are some very fast and scalable solutions out there today - direct benefit: No new code needs to be written and maintained
The format is very well understood and stripping out the motion vector stream wouldn't be tricky. Code exists for this too.
Pages of text in printed documents tend towards being justified so that the two edges of the text columns are straight lines. There is also (typically) a fixed number of lines on a page.
A (comparatively rapid) MPEG2 compression of the scans of a book would have the following qualities:

The motion vectors between pages of text would either shown little overall change (as differing letters are actually quite similar) or a small, global shift if the page was printed on a slight offset.
The motion vectors between a page of text and a page with an image embedded in text on the next, or a thumb on the edge, would show localised and distinct changes that differ greatly from the overall perspective.

In fact, a real crude solution could be, just using the vector stream to create a bookmark list for all the suspect changes. This might bring the number of pages to check down to a level that a human mediator could handle.

How much needs to be checked?

Via basic sample survey statistics: to be sure to 95% (±5%) that the scanned images of 300 million pages are okay, just 387 totally random pages need to be checked. However, to be sure that each individual book is okay to the same degree, a book being ~300 pages, 169 pages need to be checked in each book. I would suggest that the above technique would significantly lower this threshold, but it would be by an empirically found amount.

Also note that the above figures carry the assumption that the scanning process doesn't change over time, which of course it does!

The four rules of the web and compound documents

Ben O'Steen — Mon, 18 Aug 2008 03:40:00 +0000

A real quirk that truly interests me is the difference in aims between the way documents are typically published and the way that the information within them is reused.

A published document is normally in a single 'format' - a paginated layout, and this may comprise text, numerical charts, diagrams, tables of data and so on.

My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; The full paper gives the context to the information, but it is much more likely that a small part of this paper contains the novel insight being referenced.

In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn't solve this issue that the author hadn't put time to considering that a chart, a table or a section of text would be reused.

So, on the web, where multiple representations of the same information is getting to be commonplace (mashups, rss, microblogs, etc), what can we do to help better fulfill both aims, to show a paginated final version of a document, and also to allow each of the components to exist as items in their own right, with their own URIs (or better, URLs containing some notion of the context e.g. if /store/article-id gets to the splash page of the article, /store/article-id/paragraph-id will resolve to the text for that paragraph in the article.)

Note that the four rules of the web (well, of Linked Data) are in essence:

give everything a name,
make that name a URL ...
...which results in data about that thing,
and have it link to other related things.

[From TimBL's originating article. Also, see this presentation - a remix of presentations from TimBL and the speaker, Kingsley Idehen - given at the recent Linked Data Planet conference]

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

One thing first, we have to get over the legal issue of just storing and presenting a bitwise perfect copy of what an author gives us. We need to let author's know that we may present alternate versions, based on a user's demands. This actually needs to be the case for preservation and the repository needs to make it part of their submission policy to allow for format migrations, accessibility requirements and so on.

The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.

When a compound document is submitted to the archive, a second parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements. One issue that has already come up in testing, is that documents tend to clump multiple, separate diagrams together into a single physical image. It is likely that the only solution to breaking these up to this is going to be a human one, either author/publisher education(unlikely) or by breaking them up by hand.

I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to 'recreate' the document roughly.

Summary:

1) Break apart any compound document into its constituent elements (paragraph level is suggested for text)
2) Make sure that each one of these parts are clearly expressed in the context they are in, using hierarchical URLs, /article/paragraph or even better, /article/page/chart
3) On the article's splashpage, make a clear distinction between the real article and the broken up version. I would suggest a scheme like Google search's 'View [PDF, PPT, etc] as HTML'. I would assert that many people intuitively understand that this view is not like the original and will look or act differently.

Some related video blogs from the Crigshow trip
Finding and reusing algorithms from published articles
OCR'ing documents; Real documents are always complex
Providing a systematic overview of how a Research paper is written - giving each component and each version of a component would have major benefits here

#crigshow - Conference 2 - Worldcomp

Dave Tarrant — Wed, 16 Jul 2008 13:19:00 +0000

Agents and Web Services... Why no collaboration?

Out of all the presentations at worldcomp this one struck me as one of the most obvious but not covered areas for research in computer science. Probably the most well known agent system is that used by the travel industry where they have standard ways of interfacing with each other to find details of travel and hotels available on a global scale. This is no mean feat with the number of companies there are hooking into this network.

So why doesn't the same exist for web services or if there is such a system why isn't everyone in the open community using it?

Surely the point of web services is for people to discover and use them in their own scenarios just like the agents in the travel industry do. OK so maybe the problem lies in the fact that there are so many communities that there will never be a specific use case or framework and thus hosting a generic web service network becomes infinitely hard with the number of different APIs and Implementations.

OK so if you are going to use Agents in Web Services what issues do you need to consider? Also what do you gain through doing this?

One of the key ideas which came out of a talk at worldcomp is to use Agents to be the intelligent front to a web service. This enables an agent to track of a set of web services including information about a specific web service such as availability, versions, changing cost and and offline copy if the service allows this. So the agent becomes a Rendezvous Point for a series of web services.

So why aren't we seeing more collaboration between the Agent community and the Web Services community?

#crigshow - Conference 1 - Oscelot

Dave Tarrant — Mon, 14 Jul 2008 13:00:00 +0000

This open source day (#osdiii) hosted by Oscelot was an unconferene which soon became based heavily around the Blackboard platform. This was expected as the majority of people attending it were then going on to attend the BbWorld conference. With the title of the conference being Open Source and yet the main topic being that of a Closed Source product this gave an opening for the CRIG team to promote the wider Open Source community to those who are focused on Blackboard use cases.

The day was a success for the team as we promoted good practices in web development, standards, resource management and the fact that the people who manage an eLearning platform has a responsibility to the content they hold.

From our point of view, we discovered: If blackboard is the industry leader in learning management systems then the repository community is big problems when it comes to archiving these resources by the current methodologies each community practices.

More Collaboration and Awareness please!

OAI-PMH + OAI-ORE (Atom) + Pronom Droid = Pretty

Dave Tarrant — Fri, 27 Jun 2008 05:56:00 +0000

I've just finished writing a wrapper (very simple!) which takes a OAI-ORE Resource Map in Atom Format and classifies the objects which are listed in the Aggregation using the National Archives (UK) technical registry (Pronom).

The wrapper provides a simple front end to the DROID tool, it takes an OAI-PHM URI and requests the latest resource maps in atom format (ore-atom) and creates a list of the resources which are passed to DROID to classify directly.

The wrapper requires OAI-PMH as it requests all records which have been modified since it last did a parse of the repository. This way the wrapper can be scheduled to run once a day/week/month etc.

A single DROID xml file comes back as the output.

This is all working with EPrints repository software currently.

Next stage is to do something useful with the output xml in terms of providing useful data back to the repository manager.

Total lines of source code for the wrapper: 302 :)

ORE software libraries from Foresite

Richard — Mon, 09 Jun 2008 14:56:00 +0000

The Foresite [1] project is pleased to announce the initial code of two software libraries for constructing, parsing, manipulating and serialising OAI-ORE [2] Resource Maps. These libraries are being written in Java and Python, and can be used generically to provide advanced functionality to OAI-ORE aware applications, and are compliant with the latest release (0.9) of the specification. The software is open source, released under a BSD licence, and is available from a Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet, and are lacking good documentation for this early release, but we will be continuing to develop this software throughout the project and hope that it will be of use to the community immediately and beyond the end of the project.

Both libraries support parsing and serialising in: ATOM, RDF/XML, N3, N-Triples, Turtle and RDFa

Foresite is a JISC [3] funded project which aims to produce a demonstrator and test of the OAI-ORE standard by creating Resource Maps of journals and their contents held in JSTOR [4], and delivering them as ATOM documents via the SWORD [5] interface to DSpace [6]. DSpace will ingest these resource maps, and convert them into repository items which reference content which continues to reside in JSTOR. The Python library is being used to generate the resource maps from JSTOR and the Java library is being used to provide all the ingest, transformation and dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us have your feedback via the Google group:

foresite@googlegroups.com

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD):
http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/

Repository Software is Dead

Dave Tarrant — Sun, 08 Jun 2008 04:43:00 +0000

Repository Software for digital collections as we know it supplies the complete solution to the client, thus without the software you cannot access any of the data in your repository. This is a bad thing for object reuse and digital preservation!

Many people at conferences such as Open Repositories 2008 and from workgroups like CRIG have been talking for a long while about the importance of Interoperability. However, if you get rid of the need for the interoperability and use a standard specification for accessing simple data objects (pdfs and their metadata), then you don't need interoperability!

So this leads me to the fact that EPrints, Fedora and hopefully at some point DSpace are abstracting their database and storage layers to support use of any type of storage platform. Thanks goes SUN Microsystems preservation action group and open storage group for pushing this work from a commercial perspective. But we need to go further than this to get rid of the need for interoperability.

From Open Repositories 2008, myself and a college Ben O'Steen from Oxford University proved how OAI-ORE (OAI specification for Object Reuse and Exchange) can be used to enable high level repository interoperability. This work won us $5000 but more importantly got the community thinking about the true power of a specification like OAI-ORE. Ben and I are now hoping to push this work down to the low level storage such that the objects within an ORE map (documents and metadata) can be directly referenced without the need for the current repository layer. For this to happen all objects need to be stored in their simplest form - NO WRAPPER FORMATS ALLOWED at the lowest level.

From recent talks with Sandy Payette and Les Carr (Fedora and EPrints respectively) I am envisaging that the current repository software becomes classified as repository service software which is able to manage low level objects but is not specifically required to access these objects. So current services which plug into the repository software can act directly on the objects.

A couple of problems to solve, security and consistency of cached data. All especially applicable if you have more than one piece of repository service software modifying your objects.

CRIG / IEDemonstator After Thoughts

Dave Tarrant — Sun, 08 Jun 2008 04:35:00 +0000

IEDemonstrator is a really bad name for a project as it just says Microsoft to me but I'm fairly it isn't anything to do with that most stable of web browsers.

From the workshop it has become clear to me that discussing a specification for service interaction globally is going to be impossible. This could be due to the fact that SOAP did such a good job of it and no one wants to use anything else (enough sarcasm??). I think many people left the workshop with a much better idea at how HTTP error codes (which have been around years) already go most of the way to solving a web service model. We also realised quickly that any specification would have to be built specifically for pay services (e.g. make use of the 402 code), this would then encourage companies/institutions to supply reliable services which last more than 4 years (cough AHDS cough).

First Post - CRIG DRY Workshop

Dave Tarrant — Fri, 06 Jun 2008 04:43:00 +0000

Well there's a surprise!

CRIG DRY Workshop in Bath is where I am now. So what's happening:

People have been talking about services and proposed projects to provide authoritative and complete services to users/agents/repositories. A couple of themes have come out morning session for me:

SKOS: A lot of projects (incl. Library of Congress) are using this RDF language to describe subject and properties. Each provides access to this information in so many different ways it is hard to see how to interact in a constant manor.

Service Interaction (read on as the name is not that descriptive)

This moves us on from the Open Storage stuff i've been working on (again more later in another blog post) into how we facilitate the use of services and discover how to interact with these services. We are pushing for the use of http codes! CRIG it.

Tis it for now....

CRIG Flipchart Outputs

Richard — Thu, 24 Jan 2008 14:23:00 +0000

The JISC CRIG meeting which I previously live-blogged from has now had its output formulated into a series of slides with annotations on Flickr, which can be found here:

http://www.flickr.com/photos/wocrig/

The process by which this was achieved was through an intense round of brain-storming sessions culminating in a room full of topic spaced flip chart sheets. We then performed a Dotmocracy, and the results that you see on the Flickr page are the ideas which made it through the process as having some interest invested in them.

European ORE Roll-Out at Open Repositories 2008

Richard — Wed, 23 Jan 2008 17:11:00 +0000

The European leg of the ORE roll-out has been announced and will occur on the final day of the Open Repositories 2008 conference in Southampton, UK. This is to complement the meeting at Johns Hopkins University in Baltimore on March 3. From the email circular:

A meeting will be held on April 4, 2008 at the University of Southampton, in conjunction with Open Repositories 2008, to roll-out the beta release of the OAI-ORE specifications. This meeting is the European follow-on to a meeting that will be held in the USA on March 3, 2008 at Johns Hopkins University.

The OAI-ORE specifications describe a data model to identify and describe aggregations of web resources, and they introduce machine-readable formats to describe these aggregations based on ATOM and RDF/XML. The current, alpha version of the OAI-ORE specifications is at http://www.openarchives.org/ore/0.1/.

Additional details for the OAI-ORE European Open Meeting are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/EUKickoffPressrelease.pdf

- The registration site for the event:

http://regonline.com/eu-oai-ore

Note that registration is required and space is limited.

Fine Grained Repository Interoperability: can't package, won't package

Richard — Tue, 22 Jan 2008 18:56:00 +0000

Sadly (although some of you may not agree!), my paper proposed for this year's Open Repositories conference in Southampton has not made it through the Programme Committee. I include here, therefore, my submission so that it may live on, and you can get an idea of the sorts of things I was thinking about talking about.

The reasons given for not accepting it are probably valid; mostly concerning a lack of focus. Honestly, I thought it did a pretty good job of saying what I would talk about, but such is life.

What is the point of interoperability, what might it allow us to achieve, and why aren't we very good at it yet?

Interoperability is a loosely defined concept. It can allow systems to talk to each other about the information that they hold, about the information that they can disseminate, and to interchange that information. It can allow us to tie systems together to improve ingest and dissemination of repository holdings, and allows us to distribute repository functions across multiple systems. It ought even to allow us to offer repository services to systems which don't do so natively, improving the richness of the information space; repository interoperability is not just about repository to repository, it is also about cross-system communications. The maturing set of repositories such as DSpace, Fedora and EPrints and other information systems such as publications management tools and research information systems, as well as home-spun solutions are making the task of taking on the interoperability beast both tangible and urgent.

Traditional approaches to interoperability have often centred around moving packaged information between systems (often other repositories). The effect this has is to introduce a black-box problem concerning the content of the package itself. We are no longer transferring information, we are transferring data! It therefore becomes necessary to introduce package descriptors which allow the endpoint to re-interpret the package correctly, to turn it back into information. But this constrains us very tightly in the form of our packages, and introduces a great risk of data loss. Furthermore, it means that we cannot perform temporally and spatially disparate interoperability on an object level (that is, assemble an object's content over a period of time, and from a variety of sources). A more general approach to information interchange may be more powerful.

This paper brings together a number of sources. It discusses some of the work undertaken at Imperial College London to connect a distributed repository system (built on top of DSpace) to an existing information environment. This provides repository services to existing systems, and offers library administrators custom repository management tools in an integrated way. It also considers some of the thoughts arising from the JISC Common Repository Interfaces Group (CRIG) in this area, as well as some speculative proposals for future work and further ideas that may need to be explored.

Where do we start? The most basic way to address this problem is to break the idea of the package down into its most simple component parts in the context of a repository: the object metadata, the file content, and the use rights metadata. Using this approach, you can go a surprisingly long way down the interoperability route without adding further complexity. At the heart of the Imperial College Digital Repository is a set of web services which deal with exactly this fine structure of the package, because the content for the repository may be fed from a number of sources over a period of time, and thus there never is a definitive package.

These sorts of operations are not new, though, and there are a variety of approaches to it which have already been undertaken. For example, WebDAV offers extensions to HTTP to deal with objects using operations such as PUT, COPY or MOVE which could be used to achieve the effects that we desire. The real challenge, therefore, is not in the mechanics of the web services which we use to exchange details about this deconstructed package, but is in the additional complexities which we can introduce to enhance the interoperability of our systems and provide the value-added services which repositories wish to offer.

Consider some other features of interoperability which might be desirable

- fine grained or partial metadata records. We may wish to ingest partial records from a variety of sources to assemble into a single record, or disseminate only subsets of our stored metadata.
- file metadata, or any other sub-structure element of the object. This may include bibliographic, administrative or technical metadata.
object structural information, to allow complex hierarchies and relationships to be expressed and modified.
- versioning, and other inter-object relationships.
- workflow status, if performing deposit across multiple systems, it may be necessary to be aware of the status of the object in each system to calculate an overall state.
- state and provenance reporting, to offer feedback on the repository state to other information systems, administrators or users.
- statistics, to allow content delivery services to aggregate statistics globally.
- identifiers, to support multiple identification schemes.

Techniques such as application profiling for metadata allow us to frame entire metadata records in terms of their interpretation (e.g. the Scholarly Works Application Profile (SWAP)), but should also be used to frame individual metadata elements. Object structural data can be encoded using standards such as METS, which can also help us with attaching metadata to sub-structures of the object itself, such as its files. Versioning, and other inter-object relationships could be achieved using an RDF approach, and perhaps the OAI-ORE project will offer some guidance. But other operations such as workflow status, and state and provenance reporting do not have such clear approaches. Meanwhile, the Interoperable Repository Statistics (IRS) project has looked at the statistics problem, and the RIDIR project is looking into interoperable identifiers. In these latter cases, can we ever consider providing access to their outcomes or services through some general fine grained interface?

The Imperial College Digital Repository offers limited file metadata which is attached during upload and exposed as part of a METS record, detailing the entire digital object, as a descriptive metadata section. It can deal with the idea that some metadata comes from one source, while other metadata comes from another, allowing for a primitive partial metadata interchange process. Conversely, it will also deal with multiple metadata records for the same item. Also introduced are custom workflow metadata fields which allow some basic interaction between different systems to track deposit of objects both from the point of view of the administrator, the depositor and the systems themselves. In addition, there is an extensible notifications engine which is used to produce periodic reports to all depositors whose content has undergone some sort of modification or interesting event in a given time period. This notifications engine is behind a very generic web service which offers extreme flexibility within the context of the College's information environment.

Important work in the fields that will help achieve this interoperability include the SWORD deposit mechanism which currently deals with packages but may be extensible to include these much needed enhancements. Meanwhile, the OAI-ORE will be able to provide the semantics for complex objects which will no doubt assist in framing the problems that interoperability faces in a manor in which they can be solved.

Other examples of the spaces in which interoperability needs to work would include the EThOSnet project, the UK national e-theses effort, where it is conceivable that institutions may want to provide their own e-theses submission system with integration into the central hub to offer seamless distributed submission. Or in the relationship between Current Research Information Systems (CRIS) and open access repositories, to offer a full-stack information environment for researchers and administrators alike. The possibilities are extensive and the benefit to the research community would be truly great. HP Labs is actively researching in these and related areas with its continued work on the DSpace platform.

SWORD/ORE

Richard — Mon, 21 Jan 2008 13:05:00 +0000

Last week I was at the ORE meeting in Washington DC, and presented some thoughts regarding SWORD and its relationship to ORE. The slides I presented can be found here:

http://wiki.dspace.org/static_files/1/1d/Sword-ore.pdf

[Be warned that discussion on these slides ensued, and they therefore don't reflect the most recent thinking on the topic]

The overall approach of using SWORD as the infrastructure to do deposit for ORE seems sound. There are three main approaches identified:

- SWORD is used to deposit the URI of a Resource Map onto a repository
- SWORD is used to deposit the Resource Map as XML onto a repository
- SWORD is used to deposit a package containing the digital object and its Resource Map onto a repository

In terms of complications there are two primary ones which concern me the most:

- Mapping of the SWORD levels to the usage of ORE.

The principal issue is that level 1 implies level 0, and therefore level 2 implies level 1 and level 0. The inclusion of semantics to support ORE specifics could invoke a new level, and if this level is (for argument's sake) level 3, it implies all the levels beneath it, whatever they might require. Since the service, by this stage, is becoming complex in itself, such a linear relationship might not follow.

A brief option discussed at the meeting would be to modularise the SWORD support instead of implementing a level based approach. That is, the service document would describe the actual services offered by the server, such as ORE support, NoOp support, Verbose support and so forth, with no recourse to "bundles" of functionality labelled by linear levelling.

- Scalability of the service document

The mechanisms imposed by ORE allow for complex objects to be attached to other complex objects as aggregated resources (ORE term). This means that you could have a resource map which you wish to tell a repository describes a new part of an existing complex object. In order to do this, the service document will need to supply the appropriate deposit URI for a segment of an existing repository item. In DSpace semantics, for example, we may be adding a cluster of files to an existing item, and would therefore require the deposit URI of the item itself. To do otherwise would be to limit the applicability of ORE within SWORD and the repository model. Our current service document is a flat document describing what is pragmatically assumed (correctly, in virtually all cases) to be a small selection of deposit URIs. The same will not be true of item level deposit targets, which could be a very large number of possible deposit targets. Furthermore, in repositories which exploit the full descriptive capabilities of ORE, the number of deposit targets could be identical to the number of aggregations described (which can be more than one per resource map), which has the potential to be a very large number.

The consequences are in scalability of response time, which is a platform specific issue, and the scalability of the document itself and the usefulness of the consequences. It may be more useful to navigate hierarchically through the different levels of the service document in order to identify deposit nodes.

Any feedback on this topic is probably most useful in the ORE Google Group

BMC and the Free Open Repository Trial

Richard — Wed, 12 Dec 2007 09:36:00 +0000

Our good buddies at BioMedCentral's Open Repository team have released the latest upgrade to their service, and are offering 3 month trial repositories for evaluation. From the DSpace home page:

BioMed Central announced the latest upgrades to Open Repository, the open access publisher's hosted repository solution. Open Repository offers institutions a cost effective repository solution (setup, hosting and maintenance) which includes new DSpace features, customization options, improved user interface. Along with the annoucement of the upgrades, Open Repository is offereing a free 3-month pilot repository, so institutions can test the suitability of the service without obligation. See the full articles in Weekly News Digest and in Alpha Galieo.

CRIG Meeting Day 2 (2)

Richard — Fri, 07 Dec 2007 10:08:00 +0000

Topics for today:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference#Friday_December_7th

The ones that interest me the most are probably these:

- Death to Packages

Not really Death to Packages, but lets not forget that packaging sometimes isn't what we want to do or what we can do.

- Get What?

This harks to my ORE interest, as to what is available under the URLs, and what that means for something like content negotiation.

- One Put to Multiple Places

Really important to distributed information systems (e.g. ethosnet integration into local institutions). Also, this relates, for me, to the unpackaging question, because it introduces differences between what systems might all be expecting.

- Web 2.0 interfaces (ok, ok)

I'm interested in web services. Yes it's a bit trendy. But it is useful.

- Core Servies of a Repository

For repository core architecture, this is important. With my DSpace hat on I'd like to see what sorts of things an internal service architecture or api ought to be able to support

CRIG Meeting Day 2 (1)

Richard — Fri, 07 Dec 2007 08:42:00 +0000

It's first thing on day two. I'm late because I have to get all the way across town, which takes a surprisingly long time in London. I should have just stayed at a nearby hotel. Oh well.

The remainder of yesterday was interesting. Scope for live blogging is difficult, as the conference is extremely mobile. Today I will have to pick a point and hide in a corner to get you up to date.

In the afternoon we discussed the CRIG scenarios, and then implemented something called a Dotmocracy, which involves sticking dots (like house points at school) next to topics which appeared which we were interested in. When we start up today, the first order of business will be to see what topics made the cut. From what I saw at the end of the day, this will include Federated Searching, Google Search, and package deconstruction (my personal favourite this week).

As a brief aside, one running theme has been "no more standards". As it happens, I disagree with this. We're never going to get everything thinking the same and working the same. That's why there are so many standards, and why new ones get made all the time. It's the way of the world. At least, with a standard, though, when you have implemented one, you at least have a way of telling people what you did, over the home grown undocumented solutions which are the alternative.

Right, I suppose I'd better get my skates on.

CRIG Meeting Day 1 (2)

Richard — Thu, 06 Dec 2007 15:01:00 +0000

http://en.wikipedia.org/wiki/Unconference

See also Jim Downing's live blogging.

We've just done a round of preliminary unconferencing, where the CRIG Podcast topics were brainstormed onto flip charts. Not sure how useful that's going to be, but I'm going to approach the whole thing with an open mind. I've got my marker pen, my baloon, and my three dots.

wish me luck ...

CRIG Meeting Day 1 (1)

Richard — Thu, 06 Dec 2007 13:19:00 +0000

Some live blogging; may be slightly malformed, as this is happening inline, with no post-editing.

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference

Les Carr and Jim Downing have introduced us to the CRIG workshop first day. We're unconferencing which means that there's not a programme! We're going to try and stay at the abstract or high level discussion, not try to talk about technology.

David Flanders outlines the meeting philosophy. The outputs aimed for the meeting include: ideas (bluesky), standards and scenarios and how they can be linked together. The outputs will be taken to OR08. The best way for a group to produce good stuff is for everyone to think about themselves. Makes me think of an article I read recently:

http://www7.nationalgeographic.com/ngm/0707/feature5/index.html

We are not about creating new specs.

Julie then brings us some stuff about SWORD. See my previous post on this. We are going to have implementations for xrXiv, white rose research online and Jorum. A SPECTRa deposit client, and later an article in Ariadne and a presentation at OR08.

Break time ... tea and coffee!

CRIG Podcast

Richard — Fri, 30 Nov 2007 09:19:00 +0000

A couple of weeks ago the JISC CRIG (Common Repository Interfaces Group) organised a series of telephone debates on important areas for it. These have now been edited into short commentaries which might be of interest to you, and are aimed at priming and informing the upcoming "unconference" to be held 6/7 December in London:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Podcasts

The "unconference" will take place at Birkbeck College in Bloomsbury, London. Take a listen, and enjoy. Yours truly appears in the "Get and Put within Repositories" and the "Object Interoperability" discussions.

SWORD 1.0 Released

Richard — Thu, 08 Nov 2007 14:45:00 +0000

Just a quick heads up to say that the SWORD 1.0 release is now out and ready for download from SourceForge:

http://sourceforge.net/projects/sword-app/

Here you will find the common java library which supports repositories wanting to implement SWORD, plus implementations for DSpace and Fedora. There is also a client (with GUI and CLI versions) which you can use to deposit content into the repositories.

The DSpace implementation is designed only to work with the forthcoming DSpace 1.5 (which is currently in Alpha release). Your feedback and experiences with the code would be much appreciated. We expect to be making refinements to the DSpace implementation up unitl DSpace 1.5 is released as stable.

Scandinavian Dugnad

Richard — Wed, 31 Oct 2007 12:23:00 +0000

I was invited by the Scandinavian DSpace User Group meeting to join them in their first official meeting yesterday in Oslo. It was great to see so many people representing a small-ish geographical area and a reasonably small population all together from 4 nations (Norway, Sweden, Finland and Denmark) to talk about DSpace. Probably 35 people all-in, with plans to extend the group to be the Nordic DSpace User Group to include members from Iceland, and perhaps even the Faroe Islands, and Greenland (if DSpace instances appear there).

http://wiki.dspace.org/index.php/Scandinavia

In the grand traditions of Open Source and Open Access, I borrowed presentations given at the recent DSpace User Group Rome, and gave them an update on the state of the DSpace Foundation, DSpace 2.0, and then went on to produce some original slides telling folks how to get involved in DSpace developments. Hopefully all the content will be available on the web soon.

As your humble chronicaller struggled with his sub-par Norwegian, he picked up some interesting things. There is good user end development going on in Scandinavia which could be harnessed to bring improvements to the DSpace UI. There are also increasingly many requests for "Integration with ...", where the object of integration is one of a variety of library information systems. Statistics are high on the agenda here as they are everywhere else. They are also a base of experts in multi-language problems stemming from being polyglot nations with additional letters in their native alphabets.

It's clear where the future of repositories lie in Scandinavian nations where the national interest and the community feature prominently in society and culture. Bibsys, a major supplier of library systems and services in Norway (and organisers of the meeting), have 29 DSpace clients on their books already, and are looking at tighter integration between it and their other products, right down to the information model level. National research reporting systems are much desired repository data sources, and internal information systems at each institutions are starting to feed into their public repositories.

With such a big user group, and such a community focus, there is little doubt in my mind that the Nordic user group will be a great asset to the DSpace users in that region, and probably to the DSpace community as a whole.

PS Dugnad is a Norwegian word effectively referring to voluntary, communal work which benefits the community to some degree, but is also social and enjoyable for the participants. It also formed the basis of the 2006 DSpace User Group Meeting in Bergen

http://dsug2006.uib.no/

DSpace 1.5 Alpha with experimental binary distribution

Richard — Thu, 25 Oct 2007 09:48:00 +0000

The DSpace 1.5 Alpha has now been released and we encourage you to download this exciting new release of DSpace and try it out.

There are big changes in this code base, both in terms of functionality and organisation. First, we are now using Maven to manage our build process, and have carved the application into a set of core modules which can be used to assemble your desired DSpace instance. For example, the JSP UI and the Manakin UI are now available as separate UI modules, and you may build either or both of these. We are taking an important step down the road, here, to allowing for community developments to be more easily created, and also more easily shared. You should be able, with a little tinkering, to provide separate code packages which can be dropped in alongside the dspace core modules, and built along with them. There are many stages to go through before this process is complete or perfect, so we encourage you to try out this new mechanism, and to let us know how you get on, or what changes you would make. Oh, and please do share your modules with the community! Props to Mark Diggory and the MIT guys for this restructuring work.

The second big and most exciting thing is that Manakin is now part of our standard distribution, and we want to see it taking over from the JSP UI over the next few major releases. A big hand for Scott Phillips and the Texas A&M guys for getting this code into the distribution; they have worked really hard.

In addition to this, we have an Event System which should help us start to decouple tightly integrated parts of the repository, from Richard Rodgers and the guys at MIT. Browsing is now done with a heavily configurable system written initially by myself, but with significant assistance from Graham Triggs at BioMed Central. Tim Donohue's much desired Configurable Submission system is now integrated with both JSP and Manakin interfaces and is part of the release too.

Further to this we have a bunch of other functionality including: IP Authentication, better metadata and schema registry import, move items from one collection to another, metadata export, configurable multilingualism support, Google and html sitemap generator, Community and Sub-Communities as OAI Sets, and Item metadata in XHTML head <meta> elements.

All in all, a good looking release. There will be a testathon organised shortly which will be announced on the mailing lists, so that we can run this up to beta and then into final release as soon as possible. There's lots to test, so please lend a hand.

We are also experimenting with a binary release, which can be downloaded from the same page as the source release. We are interested in how people get on with this, so let us know on the mailing lists.

Come and get it:

http://sourceforge.net/project/showfiles.php?group_id=19984

my my where did the summer go

Richard — Wed, 24 Oct 2007 15:27:00 +0000

OK, ok, it's been a long long time since I updated. Did I say at the beginning that this was an experiment in seeing if I was capable of maintaining a blog? If I didn't I should have done.

But there's a good reason that I've not updated for a while. That is, that I've been working flat out on the Imperial College Digital Repository: Spir@l, and am pleased to finally announce in a quiet way that we are officially LIVE:

http://spiral.imperial.ac.uk/

On the outside it doesn't look too serious. A standard looking DSpace, I hear you say, with an Imperial College site template on it. And you'd be right. But only about the tip of the ice-berg.

Without wishing to blow my own trumpet (modesty is the third or fourth best thing about me), please do check out the article which I co-wrote with my good colleague Fereshteh Afshari:

http://hdl.handle.net/10044/1/493

And you may also be interested in my presentation at the recent DSpace User Group Meeting in Rome 2007 (more on that later, maybe):

http://www.aepic.it/conf/viewabstract.php?id=200&cf=11

I could probably be persueded to write a little here about how it works; maybe you'll even get snippets from the monolithic technical documentation that I'm in the middle of writing.

Oh, and there's more news, but now I've got your attention again you have to wait for the next installment.

EThOSnet Kick-Off

Richard — Thu, 10 May 2007 14:05:00 +0000

On Tuesday of this week the EThOSnet Project Board met for the first time to kick off this significant new project. For background, this project is the successor to the EThOS project, which in turn grew out of the Scottish projects: Theses Alive at Edinburgh, DAEDALUS at Glasgow, and Electronic Theses at the Robert Gordon University.

The aim of EThOSnet is to take the work done under EThOS and bring it up to a point where UK institutions can actually start to become early adopters, to start to digitise the back-catalogue of print theses in the UK, investigate technology for the current and the future incarnations of the system, and to basically kick-start a genuinely viable service for deposit and dissemination of UK theses.

At this stage, the project does not have a Project Manager, which is causing minor hold-ups initially, but Project Director, and Director of Library Services Clare Jenkins of Imperial College Library has stepped in to hold things together until one is appointed (we are expecting to hear very soon). In the interim, the Project Board has also been put in place to check that all the 7 Work Packages have the things they need to get going.

Of these 7 workpackages, the first and last are concerned with project management and exit strategy, and the meat of the project will take place in packages 2 - 6. Details of these work packages are available in the project proposal, which will hopefully be available on the JISC website soon.

A quick summary, then, of some of the changes and more concrete decisions that we made during the meeting:

We have set a pleasingly high target of 20,000 digitised theses and 3,000 born-digital theses by the end of the project. This will be sourced from the many institutions who have already expressed an interest in adopting the service, before the project is even going!

The first port of call for the technology is to smooth the process of the existing software tools for repository users. I would hope to have something which works well for DSpace available quickly, and general enough to be part of the main distribution. EPrints is already fully compliant, and Fedora has representitives from the University of Hull looking after it.

Communications will be done primarily through a soon-to-exist project wiki, and it is hoped that the existing E-Theses UK list will be used more heavily than it is already. Imperial College has agreed to host the existing ethos website, the wiki, and potentially the toolkit if necessary (currently hosted at RGU).

Toolkit development will be ongoing, with work being done on it within a wiki, but with the option to move to some XML format for the final product

This is a very big project, and I can't possibly represent everything that came out of Tuesday's meeting here. In the near future expect to see links to the project wiki appear and more information to come out.

JISC-CRIG Planet

Open Access policy - Federation for the Humanities and Social Sciences

instagr.am

www.slideshare.net

NRC jobs - Database, Repository and Application Specialist

hangingtogether.org

A week of Open Data in Ottawa - May 2015

GitHub to repository deposit

on.mash.to

Islandora Scholar Institutional Repository Solution Pack

Islandora Release 7.x-1.3 Series #1

Background

Feature Set

Contributors

What Does the Future Hold

Some reflections on the Berlin 11 conference, Berlin November 2013.

Blogging the ebooks landscape

Research Data Sharing without barriers…get involved?

The value and impact of the British Atmospheric Data Centre (BADC)

The Benefits of Open Source for Libraries

What are the main benefits to the library of adopting open source?

Is it something all libraries should consider, or are there limitations to its usefulness as a solution (if so, what are the limitations)?

What are the main barriers to a library adopting open source? (fear of the unknown, lack of technical ability etc)

Are there issues over ongoing support? and do libraries need a decent IT dept to even consider open source?

Resources:

Performance and Measurement in Libraries

Services to support UK repositories

Library Systems Workshop

Collaborative Services and Systems

Transforming workflows and processes

Tools and techniques for systems change

ResourceSync and SWORD

The ResourceSync PHP Library

Mobile discovery: don’t retro-fit; invent!

Simplicity

People and Place

Personal

New models

Shared Library Systems and Services, Part 1

A Shared LMS for Wales (WHELF)

The Benefits of Sharing (SCURL)

Oh, the admin and the coder should be friends!

iPad App for Islandora Repo Now Available

UPEI Senate Approves Open Access Policy

Is the Repository Developer a dying breed?

Thoughts on the Elevator

A tale of two bids

Making Debian Changelogs from Github repositories

DepositMOre - The Prototype

DepositMOre

Technologies Used

Preservation Tools - Moving Forward

Hot Topics in Scholarly Systems

Usage Statistics parsing and querying with redis and python

Curating content from one repository to put into another

An Analytical Anniversary

My swiss army toolkit for distributed/multiprocessing systems

Usage stats and Redis

#iPres09: e-Infrastructure and digital preservation: challenges and outlook

Thoughts on digitization, data deluge and linking

Less talk, less code, more data - The Preserv2 Data Registry

We need people!

Developer Happiness days - why happyness is important

Handling Tabular data

Tracking conferences (at Dev8D) with python, twitter and tags

EPrints 3.2 - Amazon S3/Cloudfront Plug-in

Beginning with RDF triplestores - a 'survey'

Available Services

HTTP ERROR: 400

A Fedora/Solr Digital Library for Oxford's 'Forced Migration Online'

News and updates Oct 2008

Institutions hate repositories... one simple reason.

Enterprise Abstraction: CISTI SRU API

DSpace and Fedora *need* opinionated installers.

Re-using video compression code to aid document quality checking

The four rules of the web and compound documents

#crigshow - Conference 2 - Worldcomp

#crigshow - Conference 1 - Oscelot

OAI-PMH + OAI-ORE (Atom) + Pronom Droid = Pretty

ORE software libraries from Foresite

DSpace and Fedora need opinionated installers.