<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>commonplace.net</title>
	<atom:link href="/feed/" rel="self" type="application/rss+xml" />
	<link>/</link>
	<description>Data. The final frontier.</description>
	<lastBuildDate>Tue, 05 Nov 2024 08:09:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.6.2</generator>

<image>
	<url>/wp-content/uploads/2016/10/favicon.png</url>
	<title>commonplace.net</title>
	<link>/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">3321439</site>	<item>
		<title>Infrastructure for heritage institutions &#8211; Open and Linked Data</title>
		<link>/3227/</link>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Tue, 01 Jun 2021 12:25:19 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[api's]]></category>
		<category><![CDATA[ARK]]></category>
		<category><![CDATA[authority files]]></category>
		<category><![CDATA[Catmandu]]></category>
		<category><![CDATA[collections]]></category>
		<category><![CDATA[content negotiation]]></category>
		<category><![CDATA[DC]]></category>
		<category><![CDATA[EDM]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[heritage]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[LAM]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[licensing]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[marc]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[OAI]]></category>
		<category><![CDATA[open data]]></category>
		<category><![CDATA[persistent identifiers]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[Triply]]></category>
		<guid isPermaLink="false">/?p=3227</guid>

					<description><![CDATA[In my June 2020 post in this series, &#8220;Infrastructure for heritage institutions – change of course&#8221;, I said: &#8220;The results of both Data Licences and the Data Quality projects (Object PID’s, Controlled Vocabularies, Metadata Set) will go into the new Data Publication project, which will be undertaken [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/3227"><em>https://purl.org/cpl/3227</em></a></p>



<hr class="wp-block-separator"/>



<p></p>



<p>In my June 2020 post in this series, &#8220;<a href="http://purl.org/cpl/3069" target="_blank" rel="noreferrer noopener">Infrastructure for heritage institutions – change of course</a>&#8221;, I said:</p>



<p></p>



<p>&#8220;<em>The results of both Data Licences and the Data Quality projects (Object PID’s, Controlled Vocabularies, Metadata Set) will go into the new Data Publication project, which will be undertaken in the second half of 2020. This project is aimed at publishing our collection data as open and linked data in various formats via various channels. A more detailed post will be published separately</em>.&#8221;</p>



<p>In November 2020 we implemented ARK Persistent Identifiers for the central catalogue of the Library of the University of Amsterdam (see <a href="http://purl.org/cpl/3110" target="_blank" rel="noreferrer noopener">Infrastructure for heritage institutions – ARK PID’s</a>). And now, in May 2021, we present our open and linked data portals:</p>



<ul class="wp-block-list">
<li>The <a href="https://uba.uva.nl/opendata" target="_blank" rel="noreferrer noopener">Open Data website</a> with information on datasets, licences, formats, publication channels and links to downloads and harvesting endpoints</li>



<li>A separate <a href="https://lod.uba.uva.nl/" target="_blank" rel="noreferrer noopener">Linked Data portal</a></li>
</ul>



<p>Here I will provide some background information on the choices made in the project, the workflow, the features, and the options made possible by publishing the data.</p>



<h2 class="wp-block-heading"><strong>General approach</strong></h2>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="975" height="538" src="/wp-content/uploads/2021/06/Datapublication-Flowchart.jpg" alt="" class="wp-image-3231" srcset="/wp-content/uploads/2021/06/Datapublication-Flowchart.jpg 975w, /wp-content/uploads/2021/06/Datapublication-Flowchart-300x166.jpg 300w, /wp-content/uploads/2021/06/Datapublication-Flowchart-768x424.jpg 768w, /wp-content/uploads/2021/06/Datapublication-Flowchart-700x386.jpg 700w, /wp-content/uploads/2021/06/Datapublication-Flowchart-520x287.jpg 520w, /wp-content/uploads/2021/06/Datapublication-Flowchart-360x199.jpg 360w, /wp-content/uploads/2021/06/Datapublication-Flowchart-250x138.jpg 250w, /wp-content/uploads/2021/06/Datapublication-Flowchart-100x55.jpg 100w" sizes="(max-width: 975px) 100vw, 975px" /></figure>



<p>The general approach for publishing collection data is: determine data sources, define datasets, select and apply data licences, determine publication channels, define the applicable data, record and syntax formats, apply transformations to obtain the desired formats, and publish. A mix of expertise is required: content, technical and communication. In the current project this general approach is implemented in a very pragmatic manner, which means that we haven&#8217;t always taken the ideal paths forward, but the best possible options at the time. There will be shortcomings, but we are aware of them and they will be addressed in due course. It is also a learning project.</p>



<h2 class="wp-block-heading"><strong>Data sources</strong></h2>



<p>The Library maintains a number of systems/databases containing collection data, although the intention is to eventually minimize the number of systems and data sources in the context of the Digital Infrastructure programme. The bulk of the collection data is managed in the central <a href="https://exlibrisgroup.com/products/alma-library-services-platform/?irl=libraryoffice" target="_blank" rel="noreferrer noopener">Ex Libris Alma</a> catalogue. Besides that there is an <a href="https://archives.uba.uva.nl/" target="_blank" rel="noreferrer noopener">ArchivesSpace installation</a>, as well as several Adlib systems and a KOHA installation originating from the adoption of collections of other organisations. Most of these databases will probably be incorporated into the central catalogue in the near future.</p>



<p>In this initial project we have focused on the Alma central catalogue of the University only (and not yet of our partner the Amsterdam University of Applied Sciences).</p>



<h2 class="wp-block-heading"><strong>Data licences</strong></h2>



<p>According to official policy, the Library strives to make its collections as open as possible, including the data and metadata required to access them. For this reason the standard default licence for collection data is the <a href="https://opendatacommons.org/licenses/pddl/" target="_blank" rel="noreferrer noopener">Open Data Commons Public Domain Dedication and License (PDDL)</a>, which applies to databases as a whole.</p>



<p>However, there is one important exception. A large part of the metadata records in the central catalogue originates from <a href="https://www.worldcat.org/" target="_blank" rel="noreferrer noopener">OCLC WorldCat</a>. This situation is inherited from the good old Dutch national PICA shared catalogue. Of course there is nothing wrong with shared cataloguing, but unfortunately OCLC requires attribution to the WorldCat community, using an <a href="https://opendatacommons.org/licenses/by/1-0/" target="_blank" rel="noreferrer noopener">Open Data Commons Attribution License (ODC-BY)</a>, according to the <a href="http://www.oclc.org/content/dam/oclc/membership/values_principles.pdf" target="_blank" rel="noreferrer noopener">WorldCat Community Norms</a>. In practice this means you have to be able to distinguish between metadata originating from WorldCat and metadata that does not. On the bright side: OCLC considers <a href="https://www.oclc.org/developer/develop/data-sets/attribution.en.html" target="_blank" rel="noreferrer noopener">referencing a WorldCat URI in the data sufficient attribution in itself</a>. In the open data from the central catalogue canonical WorldCat URI&#8217;s are present when applicable, so the required licence is implied.</p>



<p>But on the dark side: especially in the case of linked data (to which the OCLC ODC-BY licence explicitly applies), &#8220;the attribution requirement makes reuse and data integration more difficult in linked open data scenarios like WikiData&#8221;, as <a href="https://twitter.com/ookgezellig/status/1397485064146018305" target="_blank" rel="noreferrer noopener">Olaf Janssen</a> of the National Library of The Netherlands cited the <a href="https://blog.dblp.org/2019/11/24/licence-change-to-cc-0/" target="_blank" rel="noreferrer noopener">DBLP Data Licence Change comment</a> on Twitter. An attribution licence might make sense if the database is reused as a whole, or if, in the case of the implicit URI reference, full database records are reused. But especially in linked data contexts it is not uncommon to reuse and combine individual data elements or properties, leaving out the URI references. This makes an ODC-BY licence practically unworkable. It is time that OCLC reconsidered its licence policy and adapted it to the modern world.</p>



<h2 class="wp-block-heading"><strong>Datasets</strong></h2>



<p>The <a href="https://lib.uva.nl/" target="_blank" rel="noreferrer noopener">central catalogue</a> contains descriptions of over four million items, of which more than three million books. The rest consists of maps, images, audio, video, sheet music, archaeological objects, museum objects etc. For various practical reasons it is not feasible to make the full database available as one large dataset. That is why it was decided to split the database into smaller segments and publish datasets for physical objects by material type. A separate dataset was defined for digitised objects (images) published in the <a href="https://www.uvaerfgoed.nl/beeldbank/" target="_blank" rel="noreferrer noopener">Image Repository</a>. Because of the large amount of books and manuscripts, of these two material types only datasets of incunabula and letters are published. Other book datasets are available on demand.</p>



<p>In Alma these datasets are defined as &#8220;Logical Sets&#8221;, which are basically saved queries with dynamic result records. These Logical Sets serve as input for Alma Publishing Profiles, used for creating export files and harvesting endpoints (see below).</p>



<h2 class="wp-block-heading"><strong>Formats</strong></h2>



<p><em>Data format</em>: the published datasets only contain public metadata from Alma. Internal business and confidential data are filtered out before publishing. Creator/contributor and subject fields are enriched with URI&#8217;s, based on available identifiers from external authority files (<a rel="noreferrer noopener" href="https://id.loc.gov/authorities/names.html" target="_blank">Library of Congress Name Authority File</a> and <a rel="noreferrer noopener" href="https://www.oclc.org/en/fast.html" target="_blank">OCLC FAST</a> for more recent records, <a rel="noreferrer noopener" href="http://data.bibliotheken.nl/" target="_blank">Dutch National Authors and Subjects Thesaurus</a> for older records). Through these URI&#8217;s relations to other global authority files can be established, such as <a rel="noreferrer noopener" href="http://viaf.org/" target="_blank">VIAF</a>, <a rel="noreferrer noopener" href="https://www.wikidata.org/" target="_blank">Wikidata</a> and <a rel="noreferrer noopener" href="https://www.getty.edu/research/tools/vocabularies/aat/" target="_blank">AAT</a>. This is especially important for linked data (see below).</p>



<p>If these fields only contain textual descriptions without identifiers, enrichment is not applied. This lack of identifiers is input for the data quality improvement activities currently taking place. Available OCLC numbers are converted to canonical WorldCat URI&#8217;s, as mentioned in the Licences section. These data format transformations are performed using Alma Normalization Rules Sets, from within the Publishing Profiles.</p>
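


<p>To make this concrete: in production this conversion is done with Alma Normalization Rules, but expressed as a minimal Python sketch (assuming OCLC numbers appear with the usual &#8220;(OCoLC)&#8221; prefix, as in MARC 035 $a) it amounts to no more than this:</p>



<pre class="wp-block-code"><code>import re

def worldcat_uri(control_number):
    """Derive a canonical WorldCat URI from an OCLC control number,
    e.g. "(OCoLC)902724221"; return None for non-OCLC values."""
    match = re.match(r"\(OCoLC\)\D*(\d+)", control_number.strip())
    return f"http://www.worldcat.org/oclc/{match.group(1)}" if match else None

print(worldcat_uri("(OCoLC)902724221"))
# http://www.worldcat.org/oclc/902724221
</code></pre>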



<p><em>Record and syntax formats</em>: currently the datasets are made available in MARC-XML and Dublin Core Unqualified, two of the standard Alma export formats. For linked data formats, see below.</p>



<h2 class="wp-block-heading"><strong>Publication channels</strong></h2>



<h3 class="wp-block-heading"><strong>Downloadable files</strong>&nbsp;</h3>



<p>For each Alma Logical Set, two export files are generated once a month and written to a separate server. Two separate Alma Publishing Profiles are needed, one for each output format (MARC-XML and Dublin Core). The file names are generated using the syntax <em>[institution-data source-dataset-format]</em>, for instance &#8220;<em>uva_alma_maps_marc</em>&#8220;, &#8220;<em>uva_alma_maps_dc</em>&#8220;. Alma automatically adds &#8220;_new&#8221; and zips the files, so the results are for instance &#8220;<em>uva_alma_maps_marc_new.tar.gz</em>&#8221; and &#8220;<em>uva_alma_maps_dc_new.tar.gz</em>&#8220;. A shell script moves these export files to a publicly accessible directory on the same server, replacing the existing files in that directory. On the Library Open Data website the links to all (currently twenty) files are published on <a href="https://uba.uva.nl/en/support/open-data/data-sets-and-publication-channels/data-sets-and-publication-channels.html" target="_blank" rel="noreferrer noopener">a static webpage</a>.</p>
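


<p>The shell script itself is not shown in this post; as a rough equivalent, a minimal Python sketch of the monthly move step (the directory names are illustrative assumptions) could look like this:</p>



<pre class="wp-block-code"><code>import shutil
from pathlib import Path

# Illustrative paths; the real script and directories are not documented here.
EXPORT_DIR = Path("/data/alma-exports")
PUBLIC_DIR = Path("/var/www/opendata")

def publish_exports(institution="uva", source="alma",
                    datasets=("maps", "incunabula"), formats=("marc", "dc")):
    """Move the monthly Alma exports to the public download directory,
    replacing last month's files. Alma appends "_new" and zips each file,
    e.g. uva_alma_maps_marc_new.tar.gz."""
    for dataset in datasets:
        for fmt in formats:
            name = f"{institution}_{source}_{dataset}_{fmt}_new.tar.gz"
            source_file = EXPORT_DIR / name
            if source_file.exists():
                shutil.move(str(source_file), str(PUBLIC_DIR / name))

publish_exports()
</code></pre>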



<h3 class="wp-block-heading"><strong>OAI-PMH Harvesting</strong>&nbsp;</h3>



<p>OAI-PMH harvesting endpoints are created using the same Alma Publishing Profiles, one for each output format. The set_spec and set_name are <em>[dataset-format]</em> and <em>[dataset]</em> respectively. The set_spec is used in the Alma system OAI-PMH call, for instance:</p>



<ul class="wp-block-list">
<li><a href="https://uva.alma.exlibrisgroup.com/view/oai/31UKB_UAM1_INST/request?verb=ListRecords&amp;set=maps_marc&amp;metadataPrefix=marc21" target="_blank" rel="noreferrer noopener">https://uva.alma.exlibrisgroup.com/view/oai/31UKB_UAM1_INST/request?verb=ListRecords&amp;set=maps_marc&amp;metadataPrefix=marc21</a></li>



<li><a href="https://uva.alma.exlibrisgroup.com/view/oai/31UKB_UAM1_INST/request?verb=ListRecords&amp;set=maps_dc&amp;metadataPrefix=oai_dc" target="_blank" rel="noreferrer noopener">https://uva.alma.exlibrisgroup.com/view/oai/31UKB_UAM1_INST/request?verb=ListRecords&amp;set=maps_dc&amp;metadataPrefix=oai_dc</a></li>
</ul>



<p>The harvesting links for all datasets/formats are also published on the same static webpage.</p>
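


<p>Harvesting such an endpoint follows the standard OAI-PMH flow with resumption tokens. A minimal Python sketch, using the maps set shown above:</p>



<pre class="wp-block-code"><code>import urllib.request
import xml.etree.ElementTree as ET

OAI = "https://uva.alma.exlibrisgroup.com/view/oai/31UKB_UAM1_INST/request"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(set_spec="maps_dc", metadata_prefix="oai_dc"):
    """Yield all records of one set, following OAI-PMH resumption tokens."""
    url = f"{OAI}?verb=ListRecords&amp;set={set_spec}&amp;metadataPrefix={metadata_prefix}"
    while url:
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        yield from tree.iterfind(".//oai:record", NS)
        token = tree.find(".//oai:resumptionToken", NS)
        # Per the OAI-PMH protocol, follow-up requests carry only the token.
        if token is not None and token.text:
            url = f"{OAI}?verb=ListRecords&amp;resumptionToken={token.text}"
        else:
            url = None

for record in harvest():
    pass  # process each harvested record element here
</code></pre>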



<h3 class="wp-block-heading"><strong>API&#8217;s</strong>&nbsp;</h3>



<p>For the Alma data source Ex Libris provides a number of API&#8217;s, both for the <a href="https://developers.exlibrisgroup.com/alma/apis/" target="_blank" rel="noreferrer noopener">Alma backend</a> and for the <a href="https://developers.exlibrisgroup.com/primo/apis/" target="_blank" rel="noreferrer noopener">Alma Discovery/Primo frontend</a>. However, there are some serious limitations in using these. The Alma API&#8217;s can only be used for raw data and the full Alma database; neither Logical Sets nor data transformations using Normalization Rules can be applied. This means that data can&#8217;t be enriched with PID&#8217;s and URI&#8217;s, non-public data can&#8217;t be hidden, and no individual datasets can be addressed. For our business case this means that the Alma API&#8217;s are not useful. Alternatively, the Primo API&#8217;s could be used, where the display data is enriched with PID&#8217;s and URI&#8217;s. However, it is again not possible to publish only specific sets or to filter out private data, and the internal local field labels (&#8220;lds01&#8221;, &#8220;lds02&#8221;, etc.) can&#8217;t be replaced by more meaningful labels. Moreover, all API&#8217;s require API keys and are subject to call limits.</p>



<p>For our business case an alternative API approach is required, either developing and maintaining our own API&#8217;s, or using a separate data and/or API platform.</p>



<h3 class="wp-block-heading"><strong>Linked Data</strong>&nbsp;</h3>



<p>Just like the API&#8217;s, Ex Libris provides <a href="https://developers.exlibrisgroup.com/alma/integrations/linked_data/" target="_blank" rel="noreferrer noopener">linked data features for Alma</a> and Primo, which are not (yet) useful for implementing real linked data. Linked data is essentially a combination of specific formats (RDF) and publication channels (Sparql, content negotiation). Alma provides specific RDF formats (BIBFRAME, JSON-LD, RDA-RDF) with URI enrichment, but it is not possible to publish the RDF with your own PID-URI&#8217;s (in our case ARK&#8217;s and Handles); instead, internal system dependent URI&#8217;s are used. The Alma RDF formats can be used in the Alma Publishing Profiles to generate downloadable files, and in the Alma API&#8217;s. We have already seen that the Alma API&#8217;s have serious limitations. Moreover, Ex Libris currently does not support Sparql endpoints and content negotiation, although these features appear to be <a href="https://knowledge.exlibrisgroup.com/@api/deki/files/88364/IGeLU_2020_-_Ex_Libris_and_Li" target="_blank" rel="noreferrer noopener">on the roadmap</a>. It is a pity that I have not been able to implement the Ex Libris Alma and Primo linked data features that ultimately resulted from the first linked data session I helped organise at the <a href="https://igelu.org/products/lod/background" target="_blank" rel="noreferrer noopener">2011 IGeLU annual conference</a>, and from the subsequent establishment of the <a href="https://igelu.org/products/lod" target="_blank" rel="noreferrer noopener">IGeLU/ELUNA Linked Open Data Working Group</a> ten years ago.</p>



<p>Anyway, we ended up implementing a separate linked data platform that serves as an API platform at the same time: <a href="https://triply.cc/" target="_blank" rel="noreferrer noopener">Triply</a>. In order to publish the collection data on this new platform, another separate tool is required for transforming the collection&#8217;s MARC data to RDF. For this we currently use <a href="https://librecat.org/Catmandu/" target="_blank" rel="noreferrer noopener">Catmandu</a>. We have had previous experience with both tools during the <a href="https://adamlink.nl/" target="_blank" rel="noreferrer noopener">AdamLink</a> project some years ago.</p>



<h4 class="wp-block-heading"><strong>RDF transformation with Catmandu</strong></h4>



<p>Catmandu is a multipurpose data transformation toolset, maintained by an international open source community. It provides import and export modules for a large number of formats, not only RDF. Between import and export the data can be transformed using all kinds of &#8220;fix&#8221; commands. In our case we depend heavily on the <a href="https://metacpan.org/pod/Catmandu::MARC" target="_blank" rel="noreferrer noopener">Catmandu MARC modules library</a> and, as a starting point, the example <a href="https://github.com/LibreCat/MARC2RDF" target="_blank" rel="noreferrer noopener">fix file MARC2RDF</a> by <a href="https://biblio.ugent.be/person/801001101817" target="_blank" rel="noreferrer noopener">Patrick Hochstenbach</a>.</p>



<p>The full ETL process makes use of the MARC-XML dataset files exported by the Alma Publishing profiles. These MARC-XML files are transformed to RDF using Catmandu, and the resulting RDF files are then imported into the Triply data platform using the Triply API.</p>
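


<p>A compressed sketch of this ETL chain is shown below. Note that the fix file, input file and the Triply upload endpoint are placeholders, not the documented interfaces; consult the Catmandu and Triply documentation for the exact invocation and API paths:</p>



<pre class="wp-block-code"><code>import subprocess
import urllib.request

# Step 1: transform MARC-XML to RDF with Catmandu. The exact flags and the
# fix file are illustrative; see Catmandu::MARC and Catmandu::RDF.
with open("uva_alma_maps_marc.xml", "rb") as marc, open("maps.ttl", "wb") as rdf:
    subprocess.run(
        ["catmandu", "convert", "MARC", "--type", "XML",
         "to", "RDF", "--fix", "marc2edm.fix"],
        stdin=marc, stdout=rdf, check=True,
    )

# Step 2: upload the resulting RDF via the Triply HTTP API. The endpoint
# path and the token are placeholders only.
with open("maps.ttl", "rb") as rdf:
    request = urllib.request.Request(
        "https://lod.uba.uva.nl/api/datasets/UB-UVA/Catalogue/upload",
        data=rdf.read(),
        headers={"Authorization": "Bearer TOKEN",
                 "Content-Type": "text/turtle"},
        method="POST",
    )
    urllib.request.urlopen(request)
</code></pre>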



<h4 class="wp-block-heading"><strong>RDF model</strong></h4>



<p>The pragmatic approach resulted in the adoption of a simplified version of the <a href="https://pro.europeana.eu/page/edm-documentation" target="_blank" rel="noreferrer noopener">Europeana Data Model (EDM)</a> as the local RDF model for the library collection metadata. EDM largely fits the <a href="https://www.loc.gov/marc/bibliographic/" target="_blank" rel="noreferrer noopener">MARC21 record format</a> used in the catalogue for all material types. EDM is based on <a href="https://www.dublincore.org/specifications/dublin-core/dcmi-terms/" target="_blank" rel="noreferrer noopener">Qualified Dublin Core</a>. A MARC to DC mapping is used based on the official <a href="https://www.loc.gov/marc/marc2dc.html#qualifiedlist" target="_blank" rel="noreferrer noopener">Library of Congress MARC to DC mapping</a>, adapted to our own situation.</p>



<p>The three original EDM RDF core classes <em>Provided Cultural Heritage Object</em>, <em>Web Resource</em> and <em>Aggregation</em> have furthermore been merged into one, <em>Provided Cultural Heritage Object</em>, with additional subclasses for the individual material types. The Library RDF model description is <a href="https://uba.uva.nl/binaries/content/assets/subsites/bibliotheek/open-data/uva-edm-rdf-open-data.pdf" target="_blank" rel="noreferrer noopener">available from the Open Data website</a>.</p>
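


<p>To give an impression of the resulting model, here is a minimal sketch (in Python, using rdflib) of what a record for a single object might look like. The triples are illustrative only and not taken from the actual dataset:</p>



<pre class="wp-block-code"><code>from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("edm", EDM)
g.bind("dcterms", DCTERMS)

# Illustrative triples; see the Library RDF model description for the
# actual classes, material type subclasses and properties.
cho = URIRef("https://pid.uba.uva.nl/ark:/88238/b1990020797420205131")
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((cho, DCTERMS.title, Literal("Example title")))
# Placeholder authority URI (LCNAF-style) for an enriched creator field.
g.add((cho, DCTERMS.creator, URIRef("https://id.loc.gov/authorities/names/n0000000")))

print(g.serialize(format="turtle"))
</code></pre>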



<h4 class="wp-block-heading"><strong>Triply data platform</strong></h4>



<p>Triply is a platform for storing, publishing and using linked data sets. It offers various ways of using the data: the Browser for browsing through the data, Table for presenting triples in tabular form, Sparql Yasgui endpoints, ElasticSearch full text search, API&#8217;s (Triple Pattern Fragments, JavaScript Client) and Stories (articles incorporating interactive visualisations of the underlying data using Sparql Yasgui queries).</p>



<p>The Library of the University of Amsterdam Triply platform currently shows separate <a href="https://lod.uba.uva.nl/UB-UVA/-/datasets" target="_blank" rel="noreferrer noopener">datasets</a> for each of the ten datasets defined in Alma, as well as <a href="https://lod.uba.uva.nl/UB-UVA/Catalogue" target="_blank" rel="noreferrer noopener">one combined dataset</a> for the nine physical material type datasets. Sparql and ElasticSearch endpoints are defined for this Catalogue dataset and the <a href="https://lod.uba.uva.nl/UB-UVA/Beeldbank" target="_blank" rel="noreferrer noopener">Image Repository dataset</a> only.</p>



<h4 class="wp-block-heading"><strong>Content negotiation</strong></h4>



<p>Content negotiation is the differentiated resolution of a PID-URI to different targets based on requested response formats. This way one PID-URI for a specific item can lead to different representations of the item, for instance a record display for human consumption in a web interface, or a data representation for machine readable interfaces. The Triply API supports a number of response formats (such as Turtle, JSON, JSON-LD, N3 etc.), both in HTTP headers and as HTTP URL parameters.</p>



<p>We have implemented content negotiation for <a href="http://purl.org/cpl/3110" target="_blank" rel="noreferrer noopener">our ARK PID&#8217;s</a> as simple redirect rules to these Triply API response formats.</p>
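


<p>For example, requesting an RDF representation of an object via its ARK PID-URI is simply a matter of sending an appropriate Accept header. A minimal sketch, assuming the redirect rules map RDF media types to the Triply API:</p>



<pre class="wp-block-code"><code>import urllib.request

ark = "https://pid.uba.uva.nl/ark:/88238/b1990020797420205131"

# With an RDF media type in the Accept header, the redirect rules send the
# request to the Triply API instead of the human-readable record display.
request = urllib.request.Request(ark, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
</code></pre>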



<h2 class="wp-block-heading"><strong>Workflow</strong></h2>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="473" src="/wp-content/uploads/2021/06/image-1024x473.png" alt="" class="wp-image-3229" srcset="/wp-content/uploads/2021/06/image-1024x473.png 1024w, /wp-content/uploads/2021/06/image-300x138.png 300w, /wp-content/uploads/2021/06/image-768x355.png 768w, /wp-content/uploads/2021/06/image-1536x709.png 1536w, /wp-content/uploads/2021/06/image-700x323.png 700w, /wp-content/uploads/2021/06/image-520x240.png 520w, /wp-content/uploads/2021/06/image-360x166.png 360w, /wp-content/uploads/2021/06/image-250x115.png 250w, /wp-content/uploads/2021/06/image-100x46.png 100w, /wp-content/uploads/2021/06/image.png 1590w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Data publication workflows can differ greatly in composition and maintenance, depending on the underlying systems, metadata quality and infrastructure. The extent of dependency on specific systems is an important factor.</p>



<p>For the central catalogue certain required actions and transformations are performed using internal Alma system utilities: Logical Sets, Publishing Profiles, Normalization Rules, harvesting endpoints. This way basic transformations and publication channels are implemented with system specific utilities.</p>



<p>The more complicated transformations and publication channels (linked data, API&#8217;s, etc.) are implemented using generic external tools. In time, it might become possible to implement all data publication features with the Ex Libris toolset. When that time comes, the Library should have decided on its strategic position in this matter: implement open data as much as possible with the features of the source systems in use, or be as system independent as possible? Depending fully on system features means that overall maintenance is easier, but in the case of a system migration everything has to be developed from scratch again. Depending on generic external utilities means that you have to develop everything from scratch, but in the case of a system migration most of the externally implemented functionality will continue working.</p>



<h2 class="wp-block-heading"><strong>Follow up</strong></h2>



<p>After delivering these first open data utilities, the time has come for evaluation, improvement and expansion. Shortcomings, as mentioned above, can be identified based on user feedback and analysis of usage statistics. Datasets and formats can be added or updated, based on user requests and communication with target audiences. New data sources can be published, with appropriate workflows, datasets and formats. The current workflow can be evaluated internally and adapted if needed. The experiences with the data publication project will also have a positive influence on the internal digital infrastructure, data quality and data flows of the library. Last but not least, time will tell if, and in what expected and unexpected ways, the library&#8217;s open collection data will be used.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3227</post-id>	</item>
		<item>
		<title>Infrastructure for heritage institutions &#8211; ARK PID&#8217;s</title>
		<link>/3110/</link>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Tue, 03 Nov 2020 10:36:58 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[persistent identifiers]]></category>
		<guid isPermaLink="false">/?p=3110</guid>

					<description><![CDATA[In the Digital Infrastructure program at the Library of the University of Amsterdam we have reached a first milestone. In my previous post in the Infrastructure for heritage institutions series, &#8220;Change of course&#8220;, I mentioned the coming implementation of ARK persistent identifiers for our collection objects. Since November [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p></p>



<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/3110"><em>https://purl.org/cpl/3110</em></a></p>



<hr class="wp-block-separator"/>



<p></p>



<p>In the Digital Infrastructure program at the Library of the University of Amsterdam we have reached a first milestone. In my previous post in the <em>Infrastructure for heritage institutions</em> series, &#8220;<a rel="noreferrer noopener" href="https://purl.org/cpl/3069" target="_blank">Change of course</a>&#8220;, I mentioned the coming implementation of <a rel="noreferrer noopener" href="http://n2t.net/e/ark_ids.html" target="_blank">ARK</a> persistent identifiers for our collection objects. Since November 3, 2020, ARK PID&#8217;s are available for our university library Alma catalogue through the <a rel="noreferrer noopener" href="https://lib.uva.nl/" target="_blank">Primo user interface</a>. Implementation of ARK PID&#8217;s for the other collection description systems will follow in due course.</p>



<p>The new ARK system will coexist with the two other PID systems we already have in place: <a href="https://www.handle.net/" target="_blank" rel="noreferrer noopener">Handle </a>for the <a rel="noreferrer noopener" href="https://dare.uva.nl/" target="_blank">PURE/DARE institutional scholarly output repository</a> and the <a rel="noreferrer noopener" href="https://www.uvaerfgoed.nl/beeldbank/" target="_blank">Allard Pierson image repository</a>, and <a href="https://datacite.org/dois.html" target="_blank" rel="noreferrer noopener">Datacite DOI</a> for the <a rel="noreferrer noopener" href="https://uvaauas.figshare.com/" target="_blank">institutional figshare research datasets</a>.</p>



<p>First of all, it is good to remember the hybrid, dual function of persistent identifiers:</p>



<ol class="wp-block-list"><li>uniquely identify a specific object (the actual PID)</li><li>provide persistent availability information for that object on the web (the PID-URI)</li></ol>



<h2 class="wp-block-heading">&nbsp;<strong>Handle or ARK&nbsp;&nbsp;<p></p></strong></h2>



<p>Among the available PID systems (as described in my article &#8220;<a rel="noreferrer noopener" href="https://journal.code4lib.org/articles/14978" target="_blank">Persistent identifiers for heritage objects</a>&#8220;) the final choice was between Handle and ARK. It would seem logical to select Handle because we already maintain two Handle subsystems, but there are a few disadvantages to this option, depending on configuration choices, most importantly in the way the PID&#8217;s are constructed and managed.</p>



<p>A PID always consists of at least an indication of the PID system, a code identifying the assigning institution and the actual unique identifier within that context, optionally including a code for the &#8220;assignment stream&#8221; to differentiate between possible multiple collections, organizational units, sources, etc. For the actual identifier strings within the institutional PID namespace, two options are available:</p>



<ol class="wp-block-list"><li>minting independent unique identifier strings </li><li>using existing internal unique system identifiers</li></ol>



<p>The internal system identifiers are made globally unique by the PID system and institutional context, just like the minted identifiers. Because of the complicated and error-prone workflow of minting independent identifier strings for Alma, we decided to use the internal Alma identifiers. The same method is already in use for the institutional repository. Required conditions for this approach are that the internal system identifiers are stable and robust, and that the current system identifiers can be migrated to future new system environments where they can be directly accessed. This has already happened a couple of times with the institutional repository: in the current PURE environment there are three types of Handles, two based on system identifiers from two previous systems, and the new ones based on UUID&#8217;s generated in PURE.</p>



<p>In the case of PID&#8217;s based on existing identifiers there is no need for minting new PID&#8217;s and storing them in a mapping table for resolving, or storing them in the cataloguing system for publishing (although that would still be the preferred solution). PID-URI&#8217;s are simply formed based on a template consisting of a base URL, including the institutional ID and the prefix identifying the source or assignment stream, plus a placeholder for the internal identifier.</p>
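


<p>Expressed as a minimal Python sketch (the template values mirror the Alma example discussed later in this post), such a template-based PID-URI amounts to:</p>



<pre class="wp-block-code"><code># One URL template per assignment stream; the base URL embeds the PID
# system, the institutional ID and the stream prefix.
TEMPLATES = {
    "alma": "https://pid.uba.uva.nl/ark:/88238/b1{id}",
}

def pid_uri(stream, internal_id):
    """Form a PID-URI directly from an existing internal system identifier."""
    return TEMPLATES[stream].format(id=internal_id)

print(pid_uri("alma", "990020797420205131"))
# https://pid.uba.uva.nl/ark:/88238/b1990020797420205131
</code></pre>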



<p>If the PID&#8217;s are not stored in the cataloguing system, then the template must be implemented in the system&#8217;s user interface in order to publish and present the PID-URI&#8217;s. The same procedure must be implemented in all other data publication channels, like OAI, API&#8217;s and download options. In order to avoid maintenance errors, the idea is to minimize the number of places to implement this template procedure. If the PID is stored in the source system&#8217;s metadata, then the whole procedure can be omitted.</p>



<p>For resolving template based PID-URI&#8217;s, the PID redirection web server maps the incoming PID-URI to the target according to the template, replacing the URL with the target system&#8217;s URL syntax for retrieving a single record and inserting the internal system identifier, provided in the PID&#8217;s actual identifier part, in the correct location. If it turns out that in a future system migration the old system identifiers can&#8217;t be used after all, Plan B is to store the existing PID&#8217;s in a newly created mapping table and switch to the minted identifier method, resolving PID-URI&#8217;s by reading the mapping table.</p>



<p>Back to the Handle or ARK question. In the case of Handle, a separate full local Handle server installation with additional add-on software is always necessary, even for template handles used for PID&#8217;s based on internal system identifiers. The template handle configuration in itself is quite complex as well. For ARK, in this case a simple web server configuration with redirects for each combination of institution/assignment stream is sufficient. No dedicated ARK software is necessary.</p>



<p>Moreover, ARK registration is completely free, while Handle charges a fee for each institutional prefix.</p>



<p>In the end, the ease of implementation and maintenance in combination with the method chosen turned the scales in favour of ARK.</p>



<h2 class="wp-block-heading">&nbsp;&nbsp;<strong>ARK implementation</strong></h2>



<p>An ARK consists of the ARK label &#8220;ark:/&#8221;, a Name Assigning Authority Number (NAAN) for the institution assigning the ARK, a unique string within the ARK/NAAN namespace (&#8220;name&#8221;) and an optional qualifier, for specific versions or representations. The &#8220;name&#8221; can be prefixed with a &#8220;shoulder&#8221; indicating the &#8220;assignment stream&#8221;. This part implements the unique identifier function (PID). The ARK is prefixed with a base URL (NMA &#8211; Name Mapping Authority) to make it actionable on the web. This implements the web availability function (PID-URI).</p>



<p>Example:<em>&nbsp;<span class="has-inline-color has-black-color">http://example.org/</span><span class="has-inline-color has-vivid-purple-color">ark:/</span><span class="has-inline-color has-vivid-cyan-blue-color">12025</span>/<span class="has-inline-color has-vivid-red-color">654xz321</span><span class="has-inline-color has-black-color">/</span><span class="has-inline-color has-vivid-green-cyan-color">s3/f8.05v.tiff</span></em></p>



<p>Syntax: <em>NMA/<span class="has-inline-color has-vivid-purple-color">ark:/</span><span class="has-inline-color has-vivid-cyan-blue-color">NAAN</span>/<span class="has-inline-color has-vivid-red-color">(shoulder)name</span>/<span class="has-inline-color has-vivid-green-cyan-color">qualifier</span></em></p>
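


<p>As an illustration, a small Python sketch that unpacks the example above into its components (shoulder detection is simplified to the common convention of one or more letters ending in a digit, like &#8220;b1&#8221;; the example name has no shoulder):</p>



<pre class="wp-block-code"><code>import re

def parse_ark(uri):
    """Unpack an actionable ARK into NMA, NAAN, (shoulder +) name and
    qualifier. Shoulder detection is simplified to the "letters followed
    by a digit" convention; other shoulder schemes exist."""
    m = re.match(r"(?P&lt;nma&gt;https?://[^/]+/)ark:/(?P&lt;naan&gt;\d+)/(?P&lt;rest&gt;.+)", uri)
    name, _, qualifier = m.group("rest").partition("/")
    shoulder = re.match(r"[a-z]+\d", name)
    return {"nma": m.group("nma"), "naan": m.group("naan"),
            "shoulder": shoulder.group() if shoulder else "",
            "name": name, "qualifier": qualifier}

print(parse_ark("http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff"))
# {'nma': 'http://example.org/', 'naan': '12025', 'shoulder': '',
#  'name': '654xz321', 'qualifier': 's3/f8.05v.tiff'}
</code></pre>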



<p>In our case we installed an Apache web server on the hostname pid.uba.uva.nl, with simple template based redirect configurations for each combination of NAAN/shoulder. ARK PID&#8217;s can be resolved directly in the local web server environment using the local base URL/namespace, without an intermediate global redirecting server, as is usually the case with Handle (<a rel="noreferrer noopener" href="https://hdl.handle.net/" target="_blank">https://hdl.handle.net/</a>). However, this configuration is also still an option for ARK, by means of the global resolving/redirection server <a rel="noreferrer noopener" href="http://n2t.net/" target="_blank">http://n2t.net/</a>, if the local base URL is correctly entered in the registration for the NAAN in question. The n2t.net server also resolves a large number of other PID&#8217;s, such as Handle and DOI (using labels like hdl:/, doi:/, etc.). The base URL for the ARK PID-URI&#8217;s of the Library of the University of Amsterdam, https://pid.uba.uva.nl/, can be replaced with http://n2t.net/ at all times:</p>



<p>&#8211; <a href="https://pid.uba.uva.nl/ark:/88238/b1990020797420205131" target="_blank" rel="noreferrer noopener">https://pid.uba.uva.nl/ark:/88238/b1990020797420205131</a></p>



<p>&#8211; <a href="https://n2t.net/ark:/88238/b1990020797420205131" target="_blank" rel="noreferrer noopener">https://n2t.net/ark:/88238/b1990020797420205131</a></p>



<p>Here &#8220;88238&#8221; is the NAAN for the Library of the University of Amsterdam, and &#8220;b1&#8221; is the shoulder for the Alma assignment stream. The string &#8220;ark:/88238/b1990020797420205131&#8221; is the actual PID.</p>



<p>Both actionable PID-URI&#8217;s resolve to:</p>



<p><a rel="noreferrer noopener" href="https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&amp;docid=alma990020797420205131" target="_blank">https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&amp;docid=alma990020797420205131</a>.</p>



<p>In the web server, the part of the URL up to and including the &#8220;b1&#8221; shoulder is replaced by the Primo URL syntax &#8220;<em>https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&amp;docid=alma</em>&#8220;.</p>
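


<p>Expressed in Python rather than web server rewrite rules, the redirect performed by this Apache configuration boils down to the following sketch:</p>



<pre class="wp-block-code"><code># Everything up to and including the "b1" shoulder is swapped for the
# Primo deep-link syntax quoted above.
ARK_PREFIX = "https://pid.uba.uva.nl/ark:/88238/b1"
PRIMO_PREFIX = ("https://lib.uva.nl/discovery/fulldisplay"
                "?vid=31UKB_UAM1_INST:UVA&amp;docid=alma")

def resolve(pid_uri):
    """Map an incoming ARK PID-URI to the Primo full record display URL."""
    assert pid_uri.startswith(ARK_PREFIX)
    return PRIMO_PREFIX + pid_uri[len(ARK_PREFIX):]

print(resolve("https://pid.uba.uva.nl/ark:/88238/b1990020797420205131"))
# https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&amp;docid=alma990020797420205131
</code></pre>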



<p>On the Alma/Primo user interface side, the ARK PID-URI for the object described in the displayed record is generated on the fly using internal &#8220;Alma normalization rules for display&#8221;. If a handle is available in the metadata (for items from the institutional and image repositories), then this handle is displayed as &#8220;Persistent identifier&#8221;; in all other cases the ARK PID-URI is constructed by prefixing the internal Alma system identifier with &#8220;https://pid.uba.uva.nl/ark:/88238/b1&#8221;.</p>



<h2 class="wp-block-heading">&nbsp;&nbsp;<strong>Issues</strong></h2>



<p>Ideally institutions only assign persistent identifiers to their own unique (or semi-unique) collection objects. Usually, at first these PID&#8217;s are assigned retroactively in batch, using an automated tool, after which PID&#8217;s are assigned individually for each new object added and catalogued. If an institution uses separate cataloguing environments for &#8220;PID worthy&#8221; and &#8220;non PID worthy&#8221; objects, then there is no problem. However, if only one cataloguing environment is used for all types of collection objects, it is necessary to be able to differentiate between unique and non-unique items based on the available metadata, in order to be able to automatically assign PID&#8217;s to PID worthy objects only.</p>



<p>At the Library of the University of Amsterdam, the Alma system is used for all material types: books, journals, images, museum objects, archival items, etcetera. Unfortunately, because of long-standing work procedures and standard cataloguing profiles, it is not really possible to make a sharp distinction between all PID worthy and non PID worthy objects in the database. There is no indication of &#8220;uniqueness&#8221; or something similar in the metadata. Even a combination of specific selection criteria, such as material type + date + location, is not completely conclusive. A book from before 1850 might very well be a unique item, but it does not have to be. A poster located in the museum might be unique, but it probably is not. Assessing all objects in the catalogue individually would probably be a matter of decades.</p>



<p>We have opted for a pragmatic approach, publishing ARK PID&#8217;s for all objects described in our local Alma catalogue system. In this way our ARK PID&#8217;s function both as real unique identifiers for PID worthy objects, and as truly persistent web links for all types of objects, replacing the default, not persistent, system and instance dependent Primo permalinks.</p>



<p>Another issue brought about by existing workflows is the fact that newly created digital representations of existing physical collection objects have been and will be assigned handles (as mentioned above). The corresponding physical source objects are assigned the new ARK PID&#8217;s in Alma/Primo. Both the physical object and its digital representation have their own record in the Alma catalogue, one identified by an ARK, the other by a handle, without a direct relation available in the metadata. Fortunately, most of these pairs are displayed as a cluster in Primo, so the implicit relationship is visible. And in the corresponding record in the image repository the links to the original records are displayed in the form of the new ARK PID&#8217;s.</p>



<p>The practice of assigning separate PID&#8217;s to digital representations of an object is not incorrect as such, but it would be better for the usability of the data if the relation between them were made explicit. It is probably possible to fix this omission retroactively.</p>



<h2 class="wp-block-heading">&nbsp;&nbsp;<strong>Future work</strong></h2>



<p>There are still a couple of other collections catalogued in other systems that will have to get persistent identifiers. These will also be handled in a pragmatic way, depending on the options of the systems, the available metadata, etcetera. </p>



<p>At the moment we do not use ARK qualifiers. We will look into this matter when we investigate the options for content negotiation in the context of linked data in the very near future.</p>



<p>All in all, the implementation of persistent identifiers for all collection objects of the Library is a big step towards a more efficient and usable digital infrastructure, both internally and externally. One of the next steps will be the publication of linked open usable collection data, for which persistent identifiers are essential.</p>



<h3 class="wp-block-heading">Acknowledgements</h3>



<p>Without the discussions in the Digital Infrastructure Team of the Library of the University of Amsterdam/University of Applied Sciences Amsterdam (consisting of metadata specialists, IT staff, project coordinators and information specialists) and other colleagues, it would not have been possible to reach this milestone.</p>



<p>I would like to thank <a href="https://hvdsomp.info/" target="_blank" rel="noreferrer noopener">Herbert Van de Sompel</a> for the exchange of ideas leading to the theoretical, philosophical and practical foundations of this important infrastructural step, described in the article mentioned above.</p>



<div id="utf8">I would also like to thank my former colleague <a rel="noreferrer noopener" href="https://twitter.com/roxponinja" target="_blank">Roxana Maurer-Popista?u</a> of the National Library of Luxemburg for providing us with background information and the choices involved in their <a rel="noreferrer noopener" href="https://info.persist.lu/en/" target="_blank">ARK implementation</a>. Our resulting solutions are quite different, but that is only natural in the diverse LAM world with so many different complex environments. Fortunately ARK can accommodate them all.</div>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3110</post-id>	</item>
		<item>
		<title>Infrastructure for heritage institutions &#8211; change of course</title>
		<link>/3069/</link>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Tue, 23 Jun 2020 09:32:03 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Library]]></category>
		<guid isPermaLink="false">/?p=3069</guid>

					<description><![CDATA[In July 2019 I published the first post about our planning to realise a “coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. In February I reported on the first results. As frequently happens, since then the conditions have changed, and naturally we [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/3069"><em>https://purl.org/cpl/3069</em></a></p>



<hr class="wp-block-separator"/>



<p></p>



<p>In July 2019 I published <a href="https://purl.org/cpl/2795" target="_blank" rel="noreferrer noopener">the first post</a> about our planning to realise a “coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. In February <a href="https://purl.org/cpl/3017" target="_blank" rel="noreferrer noopener">I reported on the first results</a>. As frequently happens, since then the conditions have changed, and naturally we had to adapt the direction we are following to achieve our goals. In other words: a change of course, of course.</p>



<h2 class="wp-block-heading">&nbsp;<strong>Projects</strong>&nbsp;</h2>



<p>I will leave aside the ongoing activities that I mentioned, and focus on the thirteen short term projects, which were originally planned like this:</p>



<ul class="wp-block-list"><li>Licensing</li><li>Object PID’s</li><li>Controlled Vocabularies</li><li>Metadata Set</li><li>ETL Hub</li><li>Object Platform</li><li>Digital Objects/IIIF</li><li>Digitisation Workflow</li><li>Access/Reuse</li><li>Data Enrichment</li><li>Linked Data</li><li>Alma</li><li>Georef</li></ul>



<p>In my first results post these were already grouped together based on status and dependencies:</p>



<ul class="wp-block-list"><li>Object PID&#8217;s</li><li>Object Platform/Digital Objects/IIIF</li><li>Licensing</li><li>Metadata Set/Controlled Vocabularies</li><li>Data Enrichment/Georeference</li><li>Other projects (dependent on the results in the main projects):<ul><li>ETL Hub</li><li>Digitisation Workflow</li><li>Access/Reuse</li><li>Linked Data</li></ul></li></ul>



<p>Investigating the options of Alma as a separate project was abandoned, because it became very clear that Alma fulfils a central role in almost all other aspects of the digital infrastructure.</p>



<h2 class="wp-block-heading">&nbsp;<strong>Developments</strong>&nbsp;</h2>



<p>In the meantime the exploratory study into the options for a digital object platform has resulted in a recommendation to procure a long term digital preservation (DP) solution, in compliance with the <a href="http://www.oais.info/" target="_blank" rel="noreferrer noopener">OAIS</a> reference model, which takes descriptive metadata from Alma and other systems and also serves as the source for publication of digital objects through various channels (Digital Asset Management &#8211; DAM). Given the expected procurement and implementation time for such a system, a working digital object platform will not be available until the end of 2021 at the earliest. Since the digital object focused projects are all closely interlinked with the availability of a digital object platform, and also because of a number of experiences in the other projects, we have decided to restructure the original planning completely.</p>



<h2 class="wp-block-heading">&nbsp;<strong>Adapted planning</strong>&nbsp;</h2>



<p>Firstly we have defined two separate main project clusters, a data cluster and a digital object cluster. This involved joining and splitting some of the existing project ideas. Secondly, we have separated both clusters in time. We will implement the data cluster first, as far as possible in 2020, and after that the digital objects cluster starting in 2021.</p>



<p>Two projects have a bit of both; they have been grouped together and will be assessed separately. Finally, a new project was defined, focusing on streamlining the full digital infrastructure system and database landscape, with the objective of eliminating redundancies in both systems and data.</p>



<ul class="wp-block-list"><li>Data Cluster (2020)<ul><li>Data Licences</li><li>Data Quality sub cluster<ul><li>Object PID&#8217;s</li><li>Controlled Vocabularies</li><li>Metadata Set</li></ul></li><li>Data Publication sub cluster<ul><li>Data Access and Reuse</li><li>ETL</li><li>Linked Data</li></ul></li></ul></li><li>Digital Objects Cluster (2021-2022)<ul><li>Object Licenses</li><li>Digital Objects Platform</li><li>Digital Object Representations</li><li>Digitisation Workflow</li><li>Digital Objects Access and Reuse</li></ul></li><li>Data + Digital Objects (2020-2022?)<ul><li>Data Enrichment</li><li>Georeferencing</li></ul></li><li>Digital Infrastructure Streamlining (2020-2022)</li></ul>



<h2 class="wp-block-heading">&nbsp;<strong>Dependencies</strong>&nbsp;</h2>



<p>In the Data Cluster, results of the Data Licences and Data Quality projects must be available for implementing Data Publication options. Linked Data can only be implemented if there is already a data publication facility available, including ETL procedures.</p>



<p>In the Digital Objects Cluster the Digital Objects Platform (DP/DAM) must be available in order to implement a full blown Digitisation Workflow. Access and reuse of digital objects depend on the availability of the platform with relevant object representations and licenses.</p>



<p>The Data Enrichment and Georeferencing projects are both aimed at generating additional metadata for digitised maps based on the digital objects themselves. For a full and serious implementation, high quality digital object representations in relevant formats should be available on a fully functioning digital object platform, and this will not be available before the end of 2021. In the meantime a pilot could be executed with currently available offline digital maps. Planning this will be considered independently of the main project clusters.</p>



<p>Streamlining the digital infrastructure is obviously targeted at existing and future systems and data, and dependent on developments in the digital infrastructure program. The project will start as soon as possible nonetheless, with an exploratory and definition phase.</p>



<h2 class="wp-block-heading">&nbsp;<strong>Current status</strong>&nbsp;</h2>



<p>In the Data Cluster we are ready to start implementing persistent identifiers for collection objects in the broadest sense. This PID project will be the subject of another more detailed post. In brief: we will adopt a pragmatic approach and maintain a hybrid environment, keeping our existing handles and DOI&#8217;s and implementing <a href="http://n2t.net/e/ark_ids.html" target="_blank" rel="noreferrer noopener">ARK</a> as the new default PID system, using rule based PID assignment based on identifiers available in the target systems. This entails copying the identifiers used to new systems in case of future migrations in order to keep the identifiers persistent.</p>



<p>For Data Licences we are inclined to use a public domain ODC PDDL licence as the default licence for data. An exception will have to be made for data originating from OCLC WorldCat, which applies to the bulk of our data in Alma and derivatives thereof. For WorldCat data an ODC-BY licence must be used, <a href="https://www.oclc.org/developer/develop/data-sets/attribution.en.html" target="_blank" rel="noreferrer noopener">acknowledging the OCLC WorldCat origin</a>. It will be a bit of a challenge to use both licences simultaneously for our Alma instance, since part of the Alma data does not derive from WorldCat.</p>



<p>The results of both Data Licences and the Data Quality projects (Object PID&#8217;s, Controlled Vocabularies, Metadata Set) will go into the new Data Publication project, which will be undertaken in the second half of 2020. This project is aimed at publishing our collection data as open and linked data in various formats via various channels. A more detailed post will be published separately.</p>



<p>As mentioned before, the Digital Objects Platform and related projects will take some time. In the meantime an IIIF pilot has already been completed successfully, and IIIF is available for the current online image repository. Last but not least, the exploratory phase of the Infrastructure Streamlining project will start in the second half of 2020.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3069</post-id>	</item>
		<item>
		<title>Infrastructure for heritage institutions &#8211; first results</title>
		<link>/3017/</link>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Mon, 24 Feb 2020 14:58:24 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[licensing]]></category>
		<category><![CDATA[persistent identifiers]]></category>
		<guid isPermaLink="false">/?p=3017</guid>

					<description><![CDATA[In July 2019 I published the post&#160;Infrastructure for heritage institutions in which I described our planning to realise a&#160;“coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. Time to look back: how far have we come? And time to look forward: what&#8217;s in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/3017"><em>https://purl.org/cpl/3017</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<div class="zw-page" style="margin: 0px; column-count: 1; column-gap: 0px;">
<div class="zw-contentpane">
<p style="margin: 0px; border-top: 0px; padding-top: 0px; border-bottom: 0px; padding-bottom: 0px;">In July 2019 I published the post&nbsp;<a href="https://purl.org/cpl/2795" target="_blank" rel="noreferrer noopener"><span class="link" style="text-decoration: underline;">Infrastructure for heritage institutions</span></a><span id="undefined"></span> in which I described our planning to realise a&nbsp;“coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. Time to look back: how far have we come? And time to look forward: what&#8217;s in store for the near future?</p>
<p></p>
<h2 style="margin-top: 0px; border-top: 0px; padding-top: 0px; margin-bottom: 0px; border-bottom: 0px; padding-bottom: 0px;"><span id="undefined"></span>Ongoing activities</h2>
<p style="margin: 0px; border-top: 0px; padding-top: 0px; border-bottom: 0px; padding-bottom: 0px;">I mentioned three &#8220;currently ongoing activities&#8221;:&nbsp;</p>
<ul>
<li><span style="padding-left: 0px;">Monitoring and advising on infrastructural aspects of new projects</span></li>
<li><span style="padding-left: 0px;">Maintaining a structured dynamic overview of the current systems and dataflow environment</span></li>
<li><span style="padding-left: 0px;">Communication about the principles, objectives and status of the programme</span></li>
</ul>
<p style="margin: 0px; border-top: 0px; padding-top: 0px; border-bottom: 0px; padding-bottom: 0px;">So, what is the status of these ongoing activities?</p>
<p></p>
<h3 style="margin-top: 0px; border-top: 0px; padding-top: 0px; margin-bottom: 0px; border-bottom: 0px; padding-bottom: 0px;"><span id="undefined"></span>Monitoring and advising</h3>
<p style="margin: 0px; border-top: 0px; padding-top: 0px; border-bottom: 0px; padding-bottom: 0px;">We have established a small dedicated &#8220;governance&#8221; team that is charged with assessing, advising and monitoring large and small projects that impact the digital infrastructure, and with creating awareness among stakeholders about the role of the larger core infrastructure team. The person managing the institutional project portfolio has agreed to take on the role of&nbsp;governance team coordinator, which is a perfect combination of responsibilities.</p>
<p></p>
<h3>Dynamic overview</h3>
<p>Until now we have had a number of unrelated instruments for describing infrastructural components and relations, each with different objectives. The two main ones are a huge static diagram that tries to capture all internal and external systems and relationships without detailed specifications, and the dynamic DataMap repository describing all dataflows between systems and datastores. The latter uses a homemade extended version of the Data Flow Diagram (DFD) methodology, as described in an earlier post&nbsp;<a href="https://purl.org/cpl/2419" target="_blank" rel="noreferrer noopener">Analysing library data flows for efficient innovation</a> (see also my ELAG 2015 presentation&nbsp;<a href="https://doi.org/10.6084/m9.figshare.11791410" target="_blank" rel="noreferrer noopener">Datamazed</a>). In that post I already mentioned&nbsp;<a href="https://pubs.opengroup.org/architecture/archimate3-doc/toc.html" target="_blank" rel="noreferrer noopener">Archimate</a> as a possible way forward, and this is exactly what we are going to do now. DFD is fine for describing dataflows, but not for documenting the entire digital infrastructure including digital objects, protocols, etc. Archimate version 3.1 can be used for digital and physical structures as well as for data, application and business structures. We are currently deciding on the templates and patterns to use (Archimate is very flexible and can be used in many different ways). The plan is to collaborate with the central university architecture community and document our infrastructure in the tool they are already using.</p>
<p></p>
<h3>Communication</h3>
<p>This series of posts is one of the ways we communicate about the programme externally. For internal communication we have set up a section on the university library intranet.</p>
<h2>Projects</h2>
<p>I mentioned thirteen short term projects. How are they coming along? For all projects we are adopting a pragmatic approach: use what is already available, set realistic short term goals, and avoid overly complicated solutions.</p>
<p></p>
<h3>Object PID&#8217;s</h3>
<p>I did some research into persistent identifiers (PID&#8217;s) and documented my findings in an internal memo. It consists of a general theoretical description of PID&#8217;s (what they are, administration and use, characterization of existing PID systems, object types PID&#8217;s can be assigned to, and linked data requirements), and a practical part describing current practices, pros and cons of existing PID systems, a list of requirements, practical considerations and recommendations. A generic English version of this document is&nbsp;<a href="https://journal.code4lib.org/articles/14978" target="_blank" rel="noreferrer noopener">published in Code4Lib Journal issue 47</a> with the title&nbsp;&#8220;<em>Persistent identifiers for heritage objects</em>&#8220;.</p>
<p>In January 2020 we started testing the various possible scenarios for implementing PID&#8217;s.</p>
<p></p>
<h3>Object platform/Digital objects/IIIF</h3>
<p>The library is currently conducting an exploratory study into options for a digital object platform. There have been conversations with&nbsp;a number of institutions similar to the university library (university libraries, archives, museums) about their existing and future solutions. There will also be discussions with vendors, including Ex Libris, the supplier of our current central collection management platform Alma. This study will result in a recommendation in the first half of 2020, after which an implementation project will be started.</p>
<p>The Digital Objects and IIIF topics are part of this comprehensive project, and obviously Alma is considered as a candidate. The library has already developed an IIIF test environment as a separate pilot project.</p>
<p></p>
<h3>Licensing</h3>
<p>We are taking the first steps in setting up a dedicated team for deciding on default standard licences and regulations for collections, metadata and digital objects, per type where relevant. Furthermore, the team will assess dedicated licences and regulations in cases where the default ones do not apply. We are currently thinking along the lines of the Public Domain Mark or <a href="https://creativecommons.org/" target="_blank" rel="noopener noreferrer">Creative Commons</a> CC0 for content that is not subject to copyright, CC-BY or CC-BY-SA for copyrighted content, and <a href="https://rightsstatements.org/" target="_blank" rel="noopener noreferrer">rightsstatements.org</a> for content for which copyright is unclear.</p>
<p>For metadata the corresponding Open Data Commons licences are considered. For the part of the metadata in our central cataloguing system Alma that originates in Worldcat, <a href="https://www.oclc.org/content/dam/oclc/worldcat/documents/worldcat-data-licensing.pdf" target="_blank" rel="noopener noreferrer">OCLC recommends</a> applying an <a href="https://opendatacommons.org/licenses/by/index.html" target="_blank" rel="noopener noreferrer">ODC-BY licence</a> according to the <a href="https://www.oclc.org/en/worldcat/cooperative-quality/policy.html#3A" target="_blank" rel="noopener noreferrer">OCLC Worldcat Rights and Responsibilities</a>. For the remaining metadata we are considering a public domain mark or ODC-BY.</p>
<p>If feasible, the assigned licences and regulations for objects may be added to the metadata of the relevant digital objects in the collection management systems, both as text and in machine-readable form. In any case, the licences and regulations will be published in all online end user interfaces and in all machine/application interfaces.</p>
<p></p>
<h3>Metadata set/Controlled vocabularies</h3>
<p>Both defining the standard minimum required metadata for various use cases and selecting and implementing controlled vocabularies/authority files are aspects of data quality assurance. Both issues will be addressed simultaneously.</p>
<p>Defining the metadata sets required for the various use cases and target audiences is a long term process, which will have to be carried out in smaller batches focused on specific audiences and use cases. At the same time, because of the&nbsp;large number of catalogued objects it is practically impossible to extend and enrich the metadata for all objects manually. New items are catalogued using&nbsp;<a href="https://www.librarianshipstudies.com/2016/03/rda-core-elements.html" target="_blank" rel="noreferrer noopener">RDA Core Elements</a>, which define the minimum elements required for describing resources by type. There is also a huge base of legacy metadata records with many non-standard descriptions. Hopefully automated tools can be employed in the future for improving and extending metadata for specific use cases. This will be explored in the Data enrichment and Georeference projects.</p>
<p>For controlled vocabularies, by contrast, short term practical solutions are available. Libraries have been using authority files for cataloguing for a long time, especially for people and organisations (creators, contributors) and subjects. In most cases our cataloguing records contain not only the string values but also the identifiers of the terms in the authority files used. In the past we used <a href="https://www.kb.nl/bronnen-zoekwijzers/dataservices-en-apis/linked-data-van-de-kb#toc-2" target="_blank" rel="noopener noreferrer">national authority files for The Netherlands</a>; currently we use international authority files: the <a href="https://id.loc.gov/authorities/names.html" target="_blank" rel="noopener noreferrer">Library of Congress Name Authority File</a> and <a href="https://fast.oclc.org/" target="_blank" rel="noopener noreferrer">FAST</a>. Fortunately, all these authority files have been published on the web as open and linked data, with persistent URI&#8217;s for each term. This means that we can dynamically construct and publish these persistent URI&#8217;s through human and machine readable interfaces for all vocabulary terms that we have registered. We are currently testing the options.</p>
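<p>A sketch of what that dynamic construction amounts to, in Python. The URI patterns are the ones published by the Library of Congress and OCLC; the identifiers are merely examples:</p>
<pre class="wp-block-code"><code># Sketch: turning authority record identifiers stored in cataloguing
# records into the persistent URI's published by the vocabulary owners.
# URI patterns as published by the Library of Congress and OCLC;
# the identifiers below are just examples.
AUTHORITY_URI = {
    "lcnaf": "https://id.loc.gov/authorities/names/{id}",
    "fast": "http://id.worldcat.org/fast/{id}",
}

def term_uri(scheme, identifier):
    return AUTHORITY_URI[scheme].format(id=identifier)

print(term_uri("lcnaf", "n79021164"))  # a creator with an LCNAF identifier
print(term_uri("fast", "1151530"))     # a subject with a FAST identifier</code></pre>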
<p></p>
<h3>Data enrichment/Georeference</h3>
<p>The Data enrichment and Georeference projects are closely related to the Open Maps pilot, in which a set of digitised maps from a unique 19th century atlas serves as a practical test bed and implementation for the Digital Infrastructure programme. As such, these projects do not contribute to the improvement of the digital infrastructure in the narrow sense. However, they demonstrate the extended possibilities of such an improved digital infrastructure. Both projects are directly related to all other projects defined in the programme, and offer valuable input for them.</p>
<p>Essentially both projects are aimed at creating additional object metadata on top of the basic metadata set, targeted at specific audiences and derived from the objects themselves.</p>
<p>An initial draft action plan was created for both projects, to be executed simultaneously, in collaboration with a university digital humanities group and the central university ICT department. For the Data enrichment project the idea is to use OCR, text mining and named entity recognition methods to derive valuable metadata from the various types of text printed on maps. The Georeference project is targeted at obtaining georeferences for the maps themselves and for selected items on the maps. All new data should have resolvable identifiers/URI&#8217;s in order to be usable as linked data.</p>
<p></p>
<h3>Other projects</h3>
<p>The remaining projects (ETL Hub, Digitisation Workflow, Access/Reuse, Linked Data) are dependent on the other activities carried out in the programme.</p>
<p>An Extract-Transform-Load platform for&nbsp;streamlining data flows and data conversions can only be effectively implemented when a more or less complete overview of the system and dataflow environment is available, and the extent of the role of Alma as central data hub has become clear. Moreover, the standardisation of the basic metadata set, controlled vocabularies and persistent identifiers is required. In the end it could also turn out that an ETL Hub is not necessary at all.</p>
<p>The Digitisation Workflow can only be defined when a Digital Object Platform is up and running, and digital object formats are sorted out. It is also dependent on functioning PID and licence workflows and established metadata sets and controlled vocabularies.</p>
<p>Access and Reuse of metadata and digital objects depends on the availability of a Digital Object Platform, standardised metadata sets, controlled vocabularies, PID&#8217;s and licence policies.</p>
<p>Last but not least, linked data can only be published once PID&#8217;s as linked data URI&#8217;s, open licences, standardised metadata sets and controlled vocabularies with URI&#8217;s are implemented. For Linked Data an ETL Hub might be required.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3017</post-id>	</item>
		<item>
		<title>Infrastructure for heritage institutions</title>
		<link>/2795/</link>
					<comments>/2795/#comments</comments>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Thu, 11 Jul 2019 12:33:22 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[cultural heritage]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[FAIR]]></category>
		<category><![CDATA[IIIF]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[LIBER]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[persistent identifiers]]></category>
		<guid isPermaLink="false">/?p=2795</guid>

					<description><![CDATA[During my vacation I saw this tweet by LIBER about topics to address, as suggested by the participants of the LIBER 2019 conference in Dublin: It shows a word cloud (yes, a word cloud) containing a large number of terms. I list the ones I can read without [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2795"><em>https://purl.org/cpl/2795</em></a></p>



<hr class="wp-block-separator"/>



<p></p>



<p>During my vacation I saw this <a href="https://twitter.com/LIBEReurope/status/1144257982676504582">tweet</a> by LIBER about topics to address, as suggested by the participants of the <a href="https://liberconference.eu/dublin2019/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">LIBER 2019 conference</a> in Dublin: </p>



<figure class="wp-block-image is-resized"><img decoding="async" src="/wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter.png" alt="LIBER2019 tweet" class="wp-image-2796" width="655" height="520" srcset="/wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter.png 873w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-300x238.png 300w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-768x610.png 768w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-700x556.png 700w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-520x413.png 520w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-360x286.png 360w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-250x198.png 250w, /wp-content/uploads/2019/07/Screenshot_2019-07-11-LIBEReurope-on-Twitter-100x79.png 100w" sizes="(max-width: 655px) 100vw, 655px" /></figure>



<p> It shows a word cloud (yes, a word cloud) containing a large number of terms. I list the ones I can read without zooming in (so the most suggested ones, I guess), more or less grouped thematically:</p>



<p></p>



<figure class="wp-block-table aligncenter is-style-regular"><table class="has-subtle-light-gray-background-color has-background"><tbody><tr><td><em>Open science<br />Open data<br />Open access<br />Licensing<br />Copyrights<br />Linked open data<br />Open education<br />Citizen science   </em></td><td><em>Scholarly communication<br />Digital humanities/DH<br />Digital scholarship<br />Research assessment<br />Research data<br />New metrics </em></td><td><em>Digital preservation<br />Data curation<br />Data stewardship<br />Data management<br />Fair (probably meaning FAIR)<br />Collections </em></td></tr><tr><td><em>Digital skills<br />Skills training<br />Information literacy</em> </td><td><em>Collaboration<br />Library management<br />Management<br />Leadership<br />Ethics<br />Innovation<br />Strategy<br />Networking </em></td><td><em>IIIF</em><br />&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<em><br />AI</em><br />&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<em><br />Better food</em></td></tr></tbody></table></figure>



<p> The last topic (&#8220;<em>Better food</em>&#8220;) probably says something about the conference organisation, not about global health or climate change.  </p>



<p>Although on holiday, I could not resist reacting on Twitter with&nbsp;<em>&#8220;<a rel="noreferrer noopener" aria-label="And for all that you need interoperability and a sound digital infrastructure (opens in a new tab)" href="https://twitter.com/lukask/status/1144490027193229313" target="_blank">And for all that you need interoperability and a sound digital infrastructure</a>.</em>&#8220;. My tweet got a small number of likes, retweets and replies, mostly from the more library technology inclined.</p>



<p>The reason for my tweet was that LIBER2019 was another example of a library conference focusing on global ideals and objectives without paying any attention to the means that are needed&nbsp;to actually achieve those. Let&#8217;s have a look at the grouped suggested topics. We see &#8220;<em>Openness</em>&#8220;, &#8220;<em>Scholarship/research</em>&#8220;, &#8220;<em>Data/digital objects curation</em>&#8220;, &#8220;<em>Skills</em>&#8221; and more general &#8220;<em>Management</em>&#8221; buzzwords. Ignoring &#8220;<em>Better food</em>&#8221; we have two acronyms left, of which &#8220;<em>AI</em>&#8221; (Artificial Intelligence, I presume) could be grouped with Scholarship/research. That leaves only &#8220;<em>IIIF</em>&#8221; as a specific technical/infrastructural topic that could serve as a means to achieve the objectives outlined in&nbsp;the other topic groups.</p>



<p>Now, it is understandable that LIBER conference participants mention these topics, because LIBER is the Association of European Research Libraries. But what to my mind is not understandable is that these research library conference participants do not talk about the practical issues involved in achieving those goals. Even more so because the first line in <a href="https://libereurope.eu/about-us/mission-values/" target="_blank" rel="noreferrer noopener" aria-label="LIBER's mission (opens in a new tab)">LIBER&#8217;s mission</a> is &#8220;<em>Provide an information infrastructure to enable research in LIBER Institutions to be&nbsp;world class</em>&#8220;.</p>



<p>To be clear: by &#8220;digital infrastructure&#8221; I don&#8217;t mean the hardware layer underlying all digital communication (servers, workstations, cables, routers, etc.), but the layers on top of that (systems, databases, data and record formats, digital object formats, identifiers, communication protocols, data flows, API&#8217;s, export and import tools, etc.).</p>



<p>I have never attended a LIBER conference myself, so I can&#8217;t say anything about the nature of the event from personal experience, but people who have attended tell me that the conference has been mainly targeted at library management. Looking at the <a rel="noreferrer noopener" aria-label="LIBER2019 programme (opens in a new tab)" href="https://liberconference.eu/schedule/" target="_blank">LIBER2019 programme</a> however, there are a small number of presentations that look like they may have been of a more practical or technical nature.</p>



<p>Anyway, having been to many library and library related conferences and events over the years I think I can safely say that most &#8220;general&#8221; library conferences focus more on missions and objectives, ignoring the practical and technical conditions and requirements that are essential to achieve just those. And of course the more &#8220;technical&#8221; library conferences tend to do the opposite, ignoring organisational, social and financial conditions. We really need conferences that take into account both sides.</p>



<p>The fact remains however that a sound digital infrastructure, both internally within the individual institutions and externally between institutions, is&nbsp;essential. And I prefer going to the more practical events, because I&#8217;m a bit allergic to events where people say &#8220;<em>We should do [fill in whatever you think we should be doing as libraries]</em>&#8220;, &#8220;<em>We are so great</em>&#8220;, &#8220;<em>We are so inspired</em>&#8220;.</p>



<p>As a follow up to my original tweet, in a reply to&nbsp;<a rel="noreferrer noopener" aria-label="Christina&nbsp;Harlow (opens in a new tab)" href="https://twitter.com/cm_harlow" target="_blank">Christina&nbsp;Harlow</a> and <a rel="noreferrer noopener" aria-label="Rurik Greenall (opens in a new tab)" href="https://twitter.com/brinxmat" target="_blank">Rurik Greenall</a>, I said: &#8220;<em><a rel="noreferrer noopener" aria-label="Let's propose a presentation for liber2020 about teaching the librarians how infrastructure is essential for their word cloud ideas to get real. (opens in a new tab)" href="https://twitter.com/lukask/status/1144650869927006208" target="_blank">Let&#8217;s propose a presentation for liber2020 about teaching the librarians how infrastructure is essential for their word cloud ideas to get real.</a></em>&#8220;.  Christina replied &#8220;<em>I&#8217;m on board</em>&#8220;, to which <a rel="noreferrer noopener" aria-label="Saskia Scheltjens (opens in a new tab)" href="https://twitter.com/saschel" target="_blank">Saskia Scheltjens</a>, of Rijksmuseum Amsterdam, reacted &#8220;<em>I’ll hold you all to that, you know <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></em>&#8220;. But someone else (Timo Borst) warned me: &#8220;<em><a rel="noreferrer noopener" aria-label="LIBER is not the right place to address those challenges. It is rather a club of feelgood-librarians, reinforcing themselves in what they are doing and always have done. There are many skilled people engaged at&nbsp;#LIBER, but no common ground and agenda for tackling infra issues (opens in a new tab)" href="https://twitter.com/redsys/status/1144847657107623936" target="_blank">LIBER is not the right place to address those challenges. It is rather a club of feelgood-librarians, reinforcing themselves in what they are doing and always have done. There are many skilled people engaged at&nbsp;#LIBER, but no common ground and agenda for tackling infra issues</a>.</em>&#8220;, confirming more or less my experiences with library conferences.&nbsp;</p>



<p>I am not sure what will happen next year at LIBER2020, but let&#8217;s take this subject a bit further and move away from conferences to the actual institutions (libraries, archives, museums). After working in libraries for 16 years now (of which the last 13 years for the Library of the University of Amsterdam) it is my experience that &#8220;digital infrastructure&#8221; is not a topic that has been the subject of much attention from the people who decide about funding, resources and policies in libraries and other heritage institutions.&nbsp;</p>



<p>Since I started working for the Library of the University of Amsterdam in 2006, in the Digital Services/Systems department, I have been trying to get the library to focus more on the underlying infrastructure instead of only on end user services, individual dedicated systems and data formats, without success. Time and again decisions were made to either replace one proprietary system with another, solve a problem with a new system, or create a new database with metadata copied from another one, thereby expanding an already huge, unmanageable landscape of data formats, systems and user interfaces, leaving no room to actually innovate.&nbsp;Even in 2018, in a meeting about establishing&nbsp;the new strategic policy plan for the Library, one of the attending management team members said &#8220;<em>Infrastructure is a difficult word</em>&#8220;.&nbsp;</p>



<p>That only changed recently, when my colleague Saskia Woutersen-Windhouwer (who now works for Leiden University Library) and I managed to get a memo accepted about &#8220;Open Collections&#8221;, in which we argued that the Library should adopt &#8220;FAIR principles for collections&#8221;. An adapted English version of that memo is available online as an <a rel="noreferrer noopener" aria-label="article in Code4Lib Journal, Issue 40, 2018 (opens in a new tab)" href="http://journal.code4lib.org/articles/13427" target="_blank">article in Code4Lib Journal, Issue 40, 2018</a>. &#8220;FAIR&#8221; stands for <em>Findable, Accessible, Interoperable, Reusable</em>, and the original <a rel="noreferrer noopener" aria-label="FAIR principles (opens in a new tab)" href="https://www.force11.org/group/fairgroup/fairprinciples" target="_blank">FAIR principles</a> are targeted at scholarly output, in particular research data sets. We adapted these original principles to apply to heritage collections, distinguishing FAIR principles for Objects, Metadata and Metadata Records.</p>



<p>In the official advice following the original memo we also distinguished three additional aspects of FAIR principles, making clear that infrastructure is not only technical: Licensing, Infrastructure, Data Quality (LID). Obviously there is a certain overlap between these three aspects: for instance a licence must also be entered in the data and must be machine-readable. Besides that we also stressed the need for organisational change. The way that workflows are organised is part of the infrastructure. Departments have always been focused on traditional activities that were separated, such as metadata,&nbsp;systems, user services. A more integrated approach is needed.</p>



<p>To make a longer story short, in the Library&#8217;s new Strategy Plan for 2019-2022 a &#8220;<em>coherent and future proof digital infrastructure</em>&#8221; is presented as an essential precondition for all other strategic objectives (Open Collections, Open Science and Education, Open Campus, Open Knowledge). And from this year on I will be coordinating the planning and projects to realise this new streamlined digital infrastructure, together with a specially assembled core team of representative library employees&nbsp;with required expertise&nbsp;from various departments.</p>



<p>Given my earlier remarks about heritage institutions and infrastructure, I have the impression that the challenges we are facing are not unique for our situation. Maybe other institutions can benefit from the approach described here, while at the same time I hope we can benefit from other institutions&#8217; experiences.</p>



<p>In our planning we distinguish between ongoing, structural activities that can already be executed now, and short term projects that will implement clearly described goals and also lead to ongoing, structural workflows.</p>



<p>The currently ongoing activities are:&nbsp;</p>



<ul class="wp-block-list"><li>Monitoring and advising on infrastructural aspects of new projects</li><li>Maintaining a structured dynamic overview of the current systems and dataflow environment</li><li>Communication about the principles, objectives and status of the programme</li></ul>



<p>For the short term projects we determined dependencies, drew up a schedule and assigned core project teams that can be extended with internal and external experts as needed. We also chose a defined and limited use case as core pilot to focus on and use as a test bed before wider implementation of the results. This pilot consists of a set of over 300 old maps in a unique 19th century &#8220;collector&#8217;s atlas&#8221;&nbsp;in the possession of the Library. A set of high resolution digitised images of the maps is available; these are catalogued but not yet presented directly on a library website.</p>



<p>The project topics (with very brief descriptions) are:</p>



<ul class="wp-block-list"><li><strong>Licensing</strong><ul><li>Establish and implement default and dedicated licenses for objects and metadata</li></ul></li><li><strong>Object PID&#8217;s</strong><ul><li>Decide on and implement PID schemes to be used for physical and digital objects</li></ul></li><li><strong>Controlled Vocabularies</strong><ul><li>Decide on and implement authority schemes using PID&#8217;s for people, subjects etc.</li></ul></li><li><strong>Metadata Set</strong><ul><li>Decide on and implement the standard minimum required metadata for various use cases, based on data quality guidelines</li></ul></li><li><strong>ETL Hub</strong><ul><li>Implement a uniform central Extract Transform Load platform for streamlining data flows and data conversions</li></ul></li><li><strong>Object Platform</strong><ul><li>Decide on and implement a platform for storing, distributing and preserving digital objects</li></ul></li><li><strong>Digital Objects/IIIF</strong><ul><li>Decide on and implement formats, types, resolution for digital objects, focusing on IIIF</li></ul></li><li><strong>Digitisation Workflow</strong><ul><li>Implement workflow for digitising physical objects</li></ul></li><li><strong>Access/Reuse</strong><ul><li>Implement methods, protocols and platforms for accessing and reusing objects and metadata</li></ul></li><li><strong>Data Enrichment</strong><ul><li>Investigate and implement methods of enriching metadata through text and data mining etc.</li></ul></li><li><strong>Linked Data</strong><ul><li>Investigate and implement methods of publishing linked data</li></ul></li><li><strong>Alma</strong><ul><li>Investigate options of Alma (which is our new main backoffice platform) as central data and object hub</li></ul></li><li><strong>Georef</strong><ul><li>Especially for digital maps: investigate and implement georeferencing options and use cases</li></ul></li></ul>



<p>For some of these topics individual pilots and projects are already planned or have been carried out. The idea is to connect and integrate these existing plans and projects in order to avoid redundant work and conflicting results.</p>



<p>There is a natural dependency scheme between the project topics. For instance licensing, PID&#8217;s, protocols, controlled vocabularies and a good metadata set are required before you can actually publish your data for access and reuse. The same applies to Linked Open Data. To publish objects for reuse you need to have the formats, platform, protocols and licensing sorted out (a toy illustration of this dependency ordering is sketched below).</p>

<p>We can&#8217;t find out everything by ourselves, obviously. We will gladly use experiences from other institutions. We will contact you soon. And if you have any valuable advice to give, don&#8217;t hesitate to contact me.</p>
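<p><em>A postscript for the technically inclined:</em> the dependency scheme mentioned above is essentially a directed graph, and a feasible execution order falls out of a topological sort. Here is a toy sketch in Python; the dependency map is my own rough, partial reading of the scheme, not an official project plan.</p>
<pre class="wp-block-code"><code># Sketch: deriving a feasible execution order from project dependencies.
# The dependency map is a rough, partial reading of the dependency scheme
# described above, not an official project plan.
from graphlib import TopologicalSorter  # Python 3.9+

# project -> the projects it depends on
dependencies = {
    "Access/Reuse": {"Object Platform", "Metadata Set",
                     "Controlled Vocabularies", "Object PIDs", "Licensing"},
    "Linked Data": {"Object PIDs", "Licensing", "Metadata Set",
                    "Controlled Vocabularies"},
    "Digitisation Workflow": {"Object Platform", "Object PIDs", "Licensing"},
    "ETL Hub": {"Metadata Set", "Controlled Vocabularies", "Object PIDs"},
}

print(list(TopologicalSorter(dependencies).static_order()))</code></pre>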
]]></content:encoded>
					
					<wfw:commentRss>/2795/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2795</post-id>	</item>
		<item>
		<title>Ten years linked open data</title>
		<link>/2571/</link>
					<comments>/2571/#comments</comments>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Sat, 04 Jun 2016 12:04:54 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[heritage]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[linked open data]]></category>
		<guid isPermaLink="false">/?p=2571</guid>

					<description><![CDATA[This post is the English translation of my original article in Dutch, published in META (2016-3), the Flemish journal for information professionals. Ten years after the term “linked data” was introduced by Tim Berners-Lee it appears to be time to take stock of the impact of linked data [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2571"><em>https://purl.org/cpl/2571</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<p><span style="color: #999999;"><i>This post is the English translation of <a style="color: #999999;" href="https://dx.doi.org/10.6084/m9.figshare.3168169.v1" target="_blank" rel="noopener noreferrer">my original article in Dutch</a>, published in <a style="color: #999999;" href="http://www.vvbad.be/meta" target="_blank" rel="noopener noreferrer">META</a> (2016-3), the Flemish journal for information professionals.</i></span></p>
<p><img loading="lazy" decoding="async" class="wp-image-2592 alignleft" src="/wp-content/uploads/2016/06/rdf-938x1024.png" alt="rdf" width="90" height="99" srcset="/wp-content/uploads/2016/06/rdf-938x1024.png 938w, /wp-content/uploads/2016/06/rdf-275x300.png 275w, /wp-content/uploads/2016/06/rdf-768x838.png 768w, /wp-content/uploads/2016/06/rdf.png 1091w" sizes="(max-width: 90px) 100vw, 90px" /></p>
<p>Ten years after the term “linked data” was introduced by <a href="https://www.w3.org/DesignIssues/LinkedData" target="_blank" rel="noopener noreferrer">Tim Berners-Lee</a> it appears to be time to take stock of the impact of linked data for libraries and other heritage institutions in the past and in the future. I will do this from a personal historical perspective, as a library technology professional, systems and database designer, data infrastructure specialist, social scientist, internet citizen and information consumer.</p>
<p>Linked data is a set of universal methods for connecting information from multiple web sources in order to generate enriched or new information and prevent information redundancy and ambiguity. This is achieved by describing information as “triples” (relationships between two objects) in RDF (Resource Description Framework), in which both objects and relationships are represented as URI’s (Uniform Resource Identifiers) pointing to definitions of these on the web. The object’s type and attributes can also be represented as triples.</p>
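<p>A minimal illustration of that triple model, using the Python rdflib library: one statement, “this book has this person as creator”, with everything identified by URI’s. The book URI is a made-up example; the predicate comes from Dublin Core and the person URI is a real one from the Library of Congress name authority file.</p>
<pre class="wp-block-code"><code># Minimal illustration of the triple model with the rdflib library: one
# statement, "this book has this person as creator", with subject,
# predicate and object all identified by URI's. The book URI is made up;
# the predicate is Dublin Core's creator and the person URI is the LoC
# name authority record for Mark Twain.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.add((
    URIRef("https://example.org/book/huckleberry-finn"),        # subject
    DCTERMS.creator,                                            # predicate
    URIRef("https://id.loc.gov/authorities/names/n79021164"),   # object
))
print(g.serialize(format="nt"))</code></pre>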
<p>“Open data” means that the information concerned actually can and may be used.</p>
<p>In retrospect it is clear that the concept of “linked data” came too early for the library and heritage world in general. The majority of libraries, particularly public libraries, at that time simply did not possess the context and expertise to do something meaningful with it. Only larger institutions with sufficient expertise, technical staff and funding (national libraries, scientific institutes, library consortia and renowned heritage institutions) were capable of executing linked data pilot projects and implementing linked data services. Furthermore many institutions are dependent on external system, database and content providers. It is only in the last couple of years (roughly since 2014) that influential organisations in the international library and heritage world have seriously begun exploring linked data. These include large commercial system vendors like OCLC and Ex Libris, and national and regional umbrella organisations like national libraries and library consortia.</p>
<p>The first time I used the term “linked data” myself is documented on the web, in a blog post dated June 19, 2009, with the title ‘<a href="https://purl.org/cpl/607" target="_blank" rel="noopener noreferrer">Linked Data for Libraries</a>’, already in reference to libraries. The main assertion of my argument was “data is relationships”, which still holds in full. The gist of my story was rather optimistic, focusing on a couple of technical and modelling aspects (URI’s, RDF, ontologies, content negotiation, etc.) for which there simply seemed to be a number of solutions at hand. In practice however these technical and modelling aspects turned out to be the subject of much discussion among linked data theorists and evangelists. Because of theoretical discussions like these, however necessary, consensus on standards and best practices is usually not reached very swiftly, which in turn delays the development of universal and practical applications.</p>
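<p>One of those aspects, content negotiation, is easy to demonstrate: the same URI serves an HTML page to a browser and an RDF serialisation to a machine, depending on the Accept header. A sketch using the Python requests library against the Library of Congress id.loc.gov service, which supports this:</p>
<pre class="wp-block-code"><code># Content negotiation: ask the same URI for machine-readable data instead
# of an HTML page by setting the Accept header. id.loc.gov is a real
# service that supports this; the serialisations offered may vary.
import requests

uri = "https://id.loc.gov/authorities/names/n79021164"
resp = requests.get(uri, headers={"Accept": "application/ld+json"}, timeout=10)
print(resp.headers.get("Content-Type"))
print(resp.text[:300])  # the beginning of the JSON-LD description</code></pre>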
<p>At that time I already worked at the Library of the University of Amsterdam (UvA), in charge of a number of library systems. I had however already applied the concepts underlying linked data years before that, even before the term “linked data” existed, to be precise in the period 2000-2002 at the former NIWI (Netherlands Institute for Scientific Information Services), in collaboration with my colleague Niek van Baalen. Essentially we are dealing here with nothing more than very elementary and universal principles that can make life a lot easier for system and database designers. Our basic premise was that everything to be described was a thing or an object with a unique ID, to which a type or concept was assigned, such as ‘person’, ‘publication’, ‘organisation’ etc. Depending on the type, the object could have a number of attributes (such as ‘name’, ‘start date’, etc.) and relationships with other objects. The objects could be denoted with various textual labels in specific languages. All of this implemented in an independent relational database, with a fully decoupled web frontend based on object oriented software as a middle layer. This approach was a logical answer to the problem of integrating the various databases and information systems of the six former institutes of the KNAW (Dutch Royal Academy of Science) that constituted NIWI [See: <a href="https://purl.org/cpl/swiftbox" target="_blank" rel="noopener noreferrer">Concepts and Relations</a>  and <a href="http://www.niekvanbaalen.net/swiftbox/" target="_blank" rel="noopener noreferrer">http://www.niekvanbaalen.net/swiftbox/</a>].</p>
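<p>To make this less abstract, here is a miniature reconstruction of that model, in Python with an in-memory SQLite database. The table layout and the sample data are my illustration, not the original NIWI schema: everything is an object with an ID and a type, while attributes and relationships are rows of data rather than columns of dedicated tables. The resemblance to RDF triples (subject, predicate, object) is no coincidence.</p>
<pre class="wp-block-code"><code># Sketch of the "concepts and relations" model: every thing is an object
# with an ID and a type; attributes and relationships are data rows rather
# than table columns. A reconstruction for illustration, not the original
# NIWI schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE object   (id INTEGER PRIMARY KEY, type TEXT);
  CREATE TABLE attribute(object_id INTEGER, name TEXT, value TEXT);
  CREATE TABLE relation (subject_id INTEGER, predicate TEXT, object_id INTEGER);
""")
db.executemany("INSERT INTO object VALUES (?, ?)",
               [(1, "person"), (2, "publication")])
db.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
               [(1, "name", "A. Author"), (2, "title", "A Study of Things")])
db.execute("INSERT INTO relation VALUES (1, 'author_of', 2)")

# who wrote what?
for row in db.execute("""
    SELECT a.value, b.value FROM relation r
    JOIN attribute a ON a.object_id = r.subject_id
    JOIN attribute b ON b.object_id = r.object_id"""):
    print(row)</code></pre>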
<p>Unfortunately both our concept-relational approach and NIWI were premature. The ideas on system independent concepts and relationships did not fall on fertile ground, and the time was not right for an interdisciplinary scientific institute either. Out of the former NIWI the current Dutch data archiving institute <a href="http://www.dans.knaw.nl/" target="_blank" rel="noopener noreferrer">DANS</a> has risen, which continues the activities of the former Steinmetz Institute and the Dutch Historical Data Archive. One of the main areas of research for DANS nowadays is linked data.</p>
<p>Anyway, when I first learned about the concept of linked data in 2009, I was immediately converted. In 2010 I had the opportunity to carry out a linked data pilot in collaboration with Ad Aerts of the former Theatre Institute of the Netherlands (TIN) and my UvA colleague Roxana Popistasu, in which theatre texts in the UvA Aleph OPAC were enriched with related information about performances of the play in question from the TIN Adlib Theatre Productions Database. The objective of this pilot was to show the added value of enrichment of search results via linked data with relevant information from other databases, while at the same time exposing bottlenecks in the data used. In particular the lack of universally used identifiers for objects, people and subjects at that time appeared to be <a href="https://purl.org/cpl/1479" target="_blank" rel="noopener noreferrer">a barrier for successfully implementing linked data</a>.</p>
<figure id="attachment_2581" aria-describedby="caption-attachment-2581" style="width: 655px" class="wp-caption alignleft"><img loading="lazy" decoding="async" class=" wp-image-2581" src="/wp-content/uploads/2016/06/uvatin.png" alt="Example theatre linked data pilot: Waiting for Godot" width="655" height="492" srcset="/wp-content/uploads/2016/06/uvatin.png 960w, /wp-content/uploads/2016/06/uvatin-300x225.png 300w, /wp-content/uploads/2016/06/uvatin-768x576.png 768w" sizes="(max-width: 655px) 100vw, 655px" /><figcaption id="caption-attachment-2581" class="wp-caption-text">Example theatre linked data pilot: Waiting for Godot</figcaption></figure>
<p>2010 was also the year that I first attended the <a href="http://swib.org/" target="_blank" rel="noopener noreferrer">SWIB</a> conference (Semantic Web In Libraries). As it was only the second time the conference was organised, SWIB was still a largely German language meeting for a predominantly German audience. In the meantime SWIB has developed into one of the most important international linked open data conferences, held completely in English. Attending linked data conferences like SWIB often generates mixed feelings. On the one hand the discussions and the projects presented are a source of motivation, on the other hand they also give rise to frustration, because after returning to your own place of work it becomes clear once more that what large institutions can do in projects is not possible in everyday life. It is particularly the dependence on system providers that makes it difficult for libraries to implement linked data. In the theatre play pilot with the Ex Libris Aleph library system mentioned before, it was only possible to use JavaScript add-ons in the user interface HTML pages, but not to adjust the internal system architecture or the international bibliographic MARC standard.</p>
<p>This vendor dependence was the immediate motive for establishing the <a href="http://igelu.org/special-interests/lod" target="_blank" rel="noopener noreferrer">Linked Open Data Special Interest Working Group (LOD SIWG)</a> within IGeLU, the International Group of Ex Libris Users. This group’s objective was and is to convince the global library systems provider Ex Libris to implement linked data options in their systems. Some effort was needed to make Ex Libris appreciate the value of this, but after five years the company has officially initiated a “<a href="http://www.exlibrisgroup.com/default.asp?catid={916AFF5B-CA4A-48FD-AD54-9AD2ADADEB88}&amp;details_type=1&amp;itemid={32854FAF-B28B-4C4D-B87C-B673A844985E}" target="_blank" rel="noopener noreferrer">Linked Data Collaboration Program</a>”, in which the Library of the University of Amsterdam is a partner. Besides the LOD SIWG activities, parallel developments in the library world have of course contributed to this as well, such as the Library of Congress <a href="https://www.loc.gov/bibframe/" target="_blank" rel="noopener noreferrer">BIBFRAME </a>project and <a href="https://www.oclc.org/developer/develop/linked-data.en.html" target="_blank" rel="noopener noreferrer">the linked data activities of competitor OCLC</a>.</p>
<p>The BIBFRAME project is concerned with storing bibliographic data as linked data in RDF, replacing the international bibliographic MARC format. OCLC is primarily focused on publishing</p>
<figure id="attachment_2585" aria-describedby="caption-attachment-2585" style="width: 205px" class="wp-caption alignright"><img loading="lazy" decoding="async" class=" wp-image-2585" src="/wp-content/uploads/2016/06/bibframe.png" alt="BIBFRAME basic schema" width="205" height="284" srcset="/wp-content/uploads/2016/06/bibframe.png 452w, /wp-content/uploads/2016/06/bibframe-217x300.png 217w" sizes="(max-width: 205px) 100vw, 205px" /><figcaption id="caption-attachment-2585" class="wp-caption-text">BIBFRAME basic schema</figcaption></figure>
<p>WorldCat and authority information as linked data through URI’s and enhancing its findability in search engines like Google through <a href="http://schema.org/" target="_blank" rel="noopener noreferrer">schema.org</a>. When storing linked data, one should in principle utilize information already published as linked data elsewhere, especially authority files such as <a href="http://viaf.org/" target="_blank" rel="noopener noreferrer">VIAF</a> and the <a href="http://id.loc.gov/" target="_blank" rel="noopener noreferrer">LoC Vocabularies</a>.</p>
<p>Consuming data published elsewhere is of course the actual goal of implementing linked data, in particular for the purpose of presenting end users with additional relevant information about topics they are interested in, without the need to execute similar searches in other systems. Academic libraries for example are increasingly developing an interest in presenting research output not only in the form of scholarly publications, but also in the form of related information about research projects, research data, procedures, networks, etc.</p>
<p>In 2012-2013 I carried out a pilot in this context, linking scholarly publications, harvested from the UvA institutional repository and loaded into the UvA Primo discovery index, to related information in the Dutch national research information repository <a href="http://www.narcis.nl/" target="_blank" rel="noopener noreferrer">NARCIS</a>, which has for a number of years been managed by the previously mentioned DANS. In NARCIS a limited subset of “<a href="http://www.narcis.nl/search/coll/vpub/Language/en" target="_blank" rel="noopener noreferrer">Enhanced Publications</a>” is available, in which all available research information is connected. These publications can also be retrieved as linked data/RDF. Unfortunately the only workable result of this test was adding an external link to author information in NARCIS. Processing of URI’s and linked data was and is not yet available in Primo. But this is going to change now with the aforementioned Ex Libris Linked Data Collaboration Program.</p>
<figure id="attachment_2588" aria-describedby="caption-attachment-2588" style="width: 632px" class="wp-caption alignleft"><img loading="lazy" decoding="async" class=" wp-image-2588" src="/wp-content/uploads/2016/06/narcis.png" alt="Example of NARCIS Enhanced Publications" width="632" height="681" srcset="/wp-content/uploads/2016/06/narcis.png 701w, /wp-content/uploads/2016/06/narcis-279x300.png 279w" sizes="(max-width: 632px) 100vw, 632px" /><figcaption id="caption-attachment-2588" class="wp-caption-text">Example of NARCIS Enhanced Publications</figcaption></figure>
<p>However, even if one has access to software that is targeted at storing and processing linked data and RDF, that does not suffice to actually tie together information from multiple sources. This was the outcome of another UvA pilot in the area of linked data and research information, using the open source linked data research information tool <a href="http://www.vivoweb.org/" target="_blank" rel="noopener noreferrer">VIVO</a>. This pilot showed that the data available in the internal university research information system was not good or complete enough. The objective of registering research information had always been limited to monitoring and publishing research output in an optimal way, mainly in the form of scholarly publications.</p>
<p>In 2016 the odds appear to be steadily turning in favour of a broader application of linked data in libraries and other heritage institutions, in any case in my own experience. The Library of the University of Amsterdam is a partner in the Ex Libris Linked Data Collaboration Program Discovery Track. And the term “linked data” appears more and more in official library policy documents.</p>
<p>Looking back on ten years of linked data and libraries one can conclude that successful implementation depends on the state of affairs in the full heritage information processing ecosystem. In this respect five preconditions within individual organisations are of importance: business case, tools, data, workflow and lifecycle.</p>
<p><i>Business case</i>: an organisation always requires a business case for applying linked data. It is not a goal in its own right. For instance plans may exist for providing new services or improving efficiency in existing tasks for which linked data can be employed. For example presenting integrated research information, providing background information about the creation of works of art, or simply eliminating redundant information in multiple databases.</p>
<p><i>Tools</i>: the software used must be suited for linked data. Publishing RDF, maintaining a SPARQL endpoint, processing external linked data through URI’s, storing data in a triple store. Specialised expertise is required in the case of homegrown software. For third party software this must be provided by the vendors.</p>
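<p>To indicate the kind of functionality meant here: consuming a SPARQL endpoint, for instance, requires nothing more exotic than an HTTP request. A sketch in Python against the public Wikidata endpoint; a library triple store would be queried in the same way:</p>
<pre class="wp-block-code"><code># Sketch of consuming a SPARQL endpoint with plain HTTP, using the
# requests library against the public Wikidata endpoint. A library
# triple store would be queried in the same way.
import requests

query = """
SELECT ?workLabel WHERE {
  ?work wdt:P50 wd:Q7245 .   # works whose author (P50) is Mark Twain (Q7245)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 5
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"}, timeout=30)
for row in resp.json()["results"]["bindings"]:
    print(row["workLabel"]["value"])</code></pre>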
<p><i>Data</i>: internal and external data must be available and suitable for publishing and consuming as linked data. The local information architecture and interoperability require profound attention. Excessive focus on individual systems with closed databases prohibits this.</p>
<p><i>Workflow</i>: working procedures must be adapted to the processing of linked data. Existing working procedures are targeted at existing objectives, functionality and systems. Because all that changes with implementing linked data, procedures, jobs and the division of tasks will have to be adapted too. Particularly the use, continuity and reliability of internal and external linked data sources will have to be taken into account.</p>
<p><i>Lifecycle</i>: new tools, data infrastructures and workflows will have to be secured in the organisation for the long term. It is important to adhere to existing standards and best practices, and to participate in collaboratives like open source communities, library consortia and user groups, if possible.</p>
<p>For the coming years I expect a number of standards and initiatives in the realm of linked data to reach maturity, which will enable individual libraries, archives and museums to get involved when they have practical implementations in mind, such as the aforementioned new services or efficiency improvements.</p>]]></content:encoded>
					
					<wfw:commentRss>/2571/feed/</wfw:commentRss>
			<slash:comments>22</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2571</post-id>	</item>
		<item>
		<title>Maps, dictionaries and guidebooks</title>
		<link>/2529/</link>
					<comments>/2529/#comments</comments>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Mon, 03 Aug 2015 14:51:59 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data formats]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[dictionaries]]></category>
		<category><![CDATA[enterprise service bus]]></category>
		<category><![CDATA[esb]]></category>
		<category><![CDATA[esperanto]]></category>
		<category><![CDATA[guidebooks]]></category>
		<category><![CDATA[interoperability]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[lingua franca]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[linked open data]]></category>
		<category><![CDATA[maps]]></category>
		<category><![CDATA[open data]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[service oriented architecture]]></category>
		<category><![CDATA[soa]]></category>
		<category><![CDATA[translation]]></category>
		<guid isPermaLink="false">/?p=2529</guid>

					<description><![CDATA[Interoperability in heterogeneous library data landscapes Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2529"><em>https://purl.org/cpl/2529</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<p><em><strong>Interoperability in heterogeneous library data landscapes</strong></em></p>
<p><img loading="lazy" decoding="async" class=" wp-image-2539  aligncenter" src="/wp-content/uploads/2015/08/20150803_152545-300x225.jpg" alt="Maps, dictionaries, guidebooks" width="387" height="290" srcset="/wp-content/uploads/2015/08/20150803_152545-300x225.jpg 300w, /wp-content/uploads/2015/08/20150803_152545-1024x768.jpg 1024w" sizes="(max-width: 387px) 100vw, 387px" /></p>
<p>Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “<a href="https://doi.org/10.6084/m9.figshare.11791410.v1" target="_blank" rel="noopener noreferrer">data maze</a>”. The level and magnitude of this opacity and heterogeneity varies with the amount of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.</p>
<p>In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks they manage a lot of data. Data can be regarded as the <a href="https://en.wikipedia.org/wiki/Signal" target="_blank" rel="noopener noreferrer">signals</a> between collections and services.</p>
<p>These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these dedicated datastores are tailored to perform the dedicated services that these dedicated systems are designed for. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or as automated utilities. These transformation procedures function as translators of the signals in the form of data.</p>
<p>Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and</p>
<figure id="attachment_2546" aria-describedby="caption-attachment-2546" style="width: 147px" class="wp-caption alignright"><a href="https://www.flickr.com/photos/thomas-merton/2989653563"><img loading="lazy" decoding="async" class=" wp-image-2546" src="/wp-content/uploads/2015/08/getlosthere.jpg" alt="© Ron Zack" width="147" height="110" srcset="/wp-content/uploads/2015/08/getlosthere.jpg 640w, /wp-content/uploads/2015/08/getlosthere-300x225.jpg 300w" sizes="(max-width: 147px) 100vw, 147px" /></a><figcaption id="caption-attachment-2546" class="wp-caption-text">© Ron Zack</figcaption></figure>
<p>implicit data redundancies using a number of different data formats, some of which systems are talking to each other in some way. This is not only confusing for end users but also for library system staff. End users lack clarity about user interfaces to use, and are missing relevant results from other sources and possible related information. Libraries need licenses and expertise for ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments elsewhere.</p>
<p>To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for <a href="http://www.exlibrisgroup.com/category/PrimoOverview" target="_blank" rel="noopener noreferrer">Primo</a>). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.</p>
<p>Translation requires not only applying a dictionary, but also interpreting the context, syntax, local variations and transcriptions. Consequently, much is lost in translation.<a href="http://www.lost-in-translation.com/" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" class="  alignright wp-image-2548" src="/wp-content/uploads/2015/08/lostintranslation.bmp" alt="lostintranslation" width="114" height="170" /></a></p>
<p>The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed, if the source and target systems do not speak each other’s language or dialect: the source signals are translated into some common language, which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract Transform Load). Moreover, most translators only know a subset of the source and target language, depending on the data signals needed by the services provided. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals; it is essential to add the required selections and transformations as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.</p>
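<p>To illustrate why paths alone are not enough, consider what a single “mapping” entry would minimally have to record. The sketch below is hypothetical; the rule structure, fields and transformations are invented for the example:</p>
<pre><code># A mapping rule needs more than a source and a target path: it also
# needs a selection (which records qualify) and a transformation
# (what happens to the value en route). Illustrative names only.

mapping_rule = {
    "source_field": "245",                 # where the signal comes from
    "target_field": "discovery.title",     # where it should end up
    "select": lambda rec: rec.get("type") == "book",   # the selection
    "transform": lambda value: value.rstrip(" /"),     # strip ISBD punctuation
}

def apply_rule(rule, record):
    """Apply one mapping rule; return None if the record is deselected."""
    if not rule["select"](record):
        return None
    return {rule["target_field"]: rule["transform"](record[rule["source_field"]])}

record = {"type": "book", "245": "The data maze /"}
print(apply_rule(mapping_rule, record))  # {'discovery.title': 'The data maze'}
</code></pre>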
<p>To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, when systems’ borders are completely closed and no access whatsoever is possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.</p>
<p>Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes, there are.</p>
<p>First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.</p>
<p>Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.</p>
<p>And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.</p>
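<p>What could one entry in such a combined repository look like? A hedged sketch, with all system names, fields and rules invented for the example:</p>
<pre><code># One entry in a dataflow repository, combining the map (the route),
# the dictionary (the meaning of the signals) and the guidebook (what
# happens en route). Purely illustrative names and values.

dataflow_entry = {
    # the map: which road does the data take?
    "route": {"from": "ILS catalogue", "to": "discovery index", "via": "OAI-PMH"},
    # the dictionary: which signals travel along it, in which language?
    "elements": {"title": "MARC 245$a", "creator": "MARC 100$a"},
    "formats": {"source": "MARC21", "target": "PNX"},
    # the guidebook: which selections and transformations happen en route?
    "selection": "only records with holdings",
    "transformations": ["strip ISBD punctuation", "deduplicate records"],
}

for part, value in dataflow_entry.items():
    print(part, ":", value)
</code></pre>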
<p>You could leave it there and be satisfied with these guiding tools to help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (<a href="https://en.wikipedia.org/wiki/Service-oriented_architecture" target="_blank" rel="noopener noreferrer">SOA</a>) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology and vendor agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer to define SOA as “a data and utilities architecture focused on delivering optimal end user services no matter what”.</p>
<p>Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.</p>
<ol>
<li><em>Single Store/Single Format</em>: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.</li>
<li><em>Multiple Stores/Shared Format</em>: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.</li>
<li><em>Shared Store/Shared Format</em>: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “<a href="https://en.wikipedia.org/wiki/Interlingual_machine_translation" target="_blank" rel="noopener noreferrer">Interlingua</a>” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent (see the sketch following this list).</li>
<li><em>Multiple Stores/Single Translation Pool</em>: or what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point to point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or <a href="http://www.oracle.com/technetwork/articles/soa/ind-soa-esb-1967705.html" target="_blank" rel="noopener noreferrer">as one large IT vendor puts it</a>: “<em>Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.</em>”.</li>
</ol>
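<p>As a small illustration of route 3, the sketch below loads records arriving in two different “languages” into one shared intermediate RDF store, with Dublin Core terms acting as the interlingua. It uses the Python rdflib library; the records and identifiers are invented for the example:</p>
<pre><code># Hedged sketch: two source formats, one shared intermediate RDF store.
# Requires: pip install rdflib

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

shared_store = Graph()

def load_marc_like(record):
    """Translate a simplified MARC-like record into the shared format."""
    item = URIRef("http://example.org/item/" + record["id"])
    shared_store.add((item, DCTERMS.title, Literal(record["245"])))
    shared_store.add((item, DCTERMS.creator, Literal(record["100"])))

def load_dc_like(record):
    """A DC-like record is already close to the shared format."""
    item = URIRef("http://example.org/item/" + record["id"])
    shared_store.add((item, DCTERMS.title, Literal(record["title"])))

load_marc_like({"id": "1", "245": "The data maze", "100": "Koster, Lukas"})
load_dc_like({"id": "2", "title": "Lost in translation"})

print(shared_store.serialize(format="turtle"))
</code></pre>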
<p>Overlooking the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that, the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.</p>
					
					<wfw:commentRss>/2529/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2529</post-id>	</item>
		<item>
		<title>Standard deviations in data modeling, mapping and manipulation</title>
		<link>/2467/</link>
					<comments>/2467/#comments</comments>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Tue, 16 Jun 2015 12:26:46 +0000</pubDate>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[elag]]></category>
		<category><![CDATA[elag2015]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[standards]]></category>
		<guid isPermaLink="false">/?p=2467</guid>

					<description><![CDATA[Or: Anything goes. What are we thinking? An impression of ELAG 2015 This year’s ELAG conference in Stockholm was one of many questions. Not only the usual questions following each presentation (always elicited in the form of yet another question: “Any questions?”). But also philosophical ones (Why? What?). [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2467"><em>https://purl.org/cpl/2467</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<p><strong><i>Or: Anything goes. What are we thinking? An impression of ELAG 2015</i></strong></p>
<figure id="attachment_2481" aria-describedby="caption-attachment-2481" style="width: 557px" class="wp-caption aligncenter"><a href="https://www.youtube.com/watch?v=IINQLoXI7gc" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" class="wp-image-2481" src="/wp-content/uploads/2015/06/t-banestrc3a4ckning1-1024x751.png" alt="T-Bana" width="557" height="408" srcset="/wp-content/uploads/2015/06/t-banestrc3a4ckning1-1024x751.png 1024w, /wp-content/uploads/2015/06/t-banestrc3a4ckning1-300x220.png 300w, /wp-content/uploads/2015/06/t-banestrc3a4ckning1.png 1262w" sizes="(max-width: 557px) 100vw, 557px" /></a><figcaption id="caption-attachment-2481" class="wp-caption-text">Mapping pathways in Stockholm</figcaption></figure>
<p>This year’s <a href="http://elag2015.org/" target="_blank" rel="noopener noreferrer">ELAG conference in Stockholm</a> was one of many <a href="https://youtu.be/IC_wcFDpLmw?t=5m43s" target="_blank" rel="noopener noreferrer">questions</a>. Not only the usual questions following each presentation (always elicited in the form of yet another question: “<em>Any questions?</em>”). But also philosophical ones (<em>Why? What?</em>). And practical ones (<em>What time? Where? How? How much?</em>). And there were some answers too, fortunately. This is my rather personal impression of the event. For a detailed report on all sessions see <a href="http://twitter.com/cm_harlow" target="_blank" rel="noopener noreferrer">Christina Harlow</a>’s <a href="https://docs.google.com/document/d/1Cm3gFdiHJIV3ptiS6i4z874QebkqMJd_lPMIAcyE4eQ/edit#" target="_blank" rel="noopener noreferrer">conference notes</a>.</p>
<p>The theme of the ELAG 2015 conference was: “<strong>DATA</strong>”. This immediately leads to the first question: “<em>What is data?</em>”. Or rather: “<em>What do we mean with data?</em>”. And of course: “<em>Who is ‘we’?</em>”.</p>
<p>In the current information professional and library perception ‘we’ typically distinguish data created and used for describing stuff (usually referred to as ‘metadata’), data originating from institutions, processes and events (known as ‘usage data’, ‘big data’), and a special case of the latter: data resulting from scholarly research (indeed: ‘research data’). All three types were discussed at ELAG.</p>
<p>It is safe to say however, that the majority of the presentations, bootcamps and workshops focused on the ‘descriptive data’ type. I try to avoid the use of the term ‘metadata’, because it is confusing, and superfluous. Just use ‘data’, meaning ‘artificial elements of information about stuff’. To be perfectly clear, ‘metadata’ is NOT ‘data about data’ as many people argue. It’s information about virtual entities, physical objects, information contained in these objects (or ‘content’), events, concepts, people, etc. We could only rightfully speak of ‘data about data’ in the special case of data describing (research) datasets. For this case ‘we’ have invented the job of ‘data librarian’, which is a completely nonsensical term, because this job is concerned with the storage, discoverability and obtainability of only one single object or entity type: research datasets. Maybe we should start using the job title ‘dataset librarian’ for this activity. But this seems a bit odd, right? On the other hand, should we replace the term ‘metadata librarian’ with ‘data librarian’? Also a bit odd. Data is at this moment in time what libraries and other information and knowledge institutions use to make their content findable and usable to the public. Let’s leave it at that.</p>
<p>This brings us to the two fundamental questions of our library ecosystem: “<a href="https://en.wikipedia.org/wiki/Relativism" target="_blank" rel="noopener noreferrer"><em>What are we describing?</em></a>” and the mother of all data questions: “<em>Why are we describing?</em>”, which were at the core of what in my eyes was this year’s <a href="https://docs.google.com/presentation/d/19BudHqIHQ1AbQVS0fpDQsX17vwC7HuYGAQTj30ae35c/pub?start=false&amp;loop=false&amp;delayms=3000#slide=id.p" target="_blank" rel="noopener noreferrer">key presentation</a> (not <a href="http://elag2015.org/program/sometimes-i-feel-sorry-for-the-data/" target="_blank" rel="noopener noreferrer">keynote</a>!) by <a href="http://twitter.com/niklasl" target="_blank" rel="noopener noreferrer">Niklas Lindström</a> of the <a href="http://libris.kb.se/" target="_blank" rel="noopener noreferrer">Swedish Royal Library/LIBRIS</a>. I needed some time to digest the core assertions of Niklas’ philosophical talk, but I am convinced that ‘we’ should all be aware of the essential ‘truths’ of his exposition.</p>
<p>First of all: &#8220;<em>Why are we describing?</em>&#8221;. The objective of having a library in the first place is to provide access in any way possible to the objects in our collections, which in turn may provide access to information and knowledge. So in general we should be describing in order to enable our intended audience to obtain what they need in terms of the collection. Or should that be in terms of knowledge? In real life &#8216;we&#8217; are describing for a number of reasons: because we follow our profession, because we have always done this, because we are instructed to do so, because we need guidance in our workflows, because the library is indispensable, because of financial and political reasons. In any case we should be clear about what our purposes are, because the purpose influences what we’re describing and how we do that.</p>
<p>Secondly: “<em>What are we describing?</em>”. Physical objects? Semi-tangible objects, like digital publications? Only outputs of processes, or also the processes themselves? Entities? Concepts? Representations? Relationships? Abstractions? Events? Again, we should be clear about this.</p>
<p><img loading="lazy" decoding="async" class="alignleft  wp-image-2502" src="/wp-content/uploads/2015/06/monty-python-spanish-inquisition.jpg" alt="monty-python-spanish-inquisition" width="99" height="66" srcset="/wp-content/uploads/2015/06/monty-python-spanish-inquisition.jpg 449w, /wp-content/uploads/2015/06/monty-python-spanish-inquisition-300x200.jpg 300w" sizes="(max-width: 99px) 100vw, 99px" />Thirdly (<a href="https://www.youtube.com/watch?v=QqreRufrkxM" target="_blank" rel="noopener noreferrer">a Monty Python Spanish Inquisition moment</a> ;-): “<em>How are we describing?</em>”. We use models, standards, formats, syntax, vocabularies in order to make maps (simplified representations of real world things) for reconciling differences between perceptions, bridging gaps between abstractions and guiding people to obtain the stuff they need. In doing so, Niklas says, we must adhere to Postel&#8217;s law, or the <a href="https://en.wikipedia.org/wiki/Robustness_principle" target="_blank" rel="noopener noreferrer">Robustness Principle</a>, which states: “<em>Be liberal in what you accept; be conservative in what you send</em>”.</p>
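<p>In data terms, the Robustness Principle could look like the following minimal sketch (the formats and values are invented for the example): accept dates in whatever notation they arrive in, but always send them out in one strict form.</p>
<pre><code># Be liberal in what you accept: try several date notations.
# Be conservative in what you send: always emit ISO 8601.

from datetime import datetime

ACCEPTED_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%d %B %Y", "%Y"]

def normalise_date(value):
    """Parse a date string liberally, serialize it conservatively."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: flag for a human rather than guess

for raw in ["2015-06-16", "16-06-2015", "16 June 2015", "1984"]:
    print(raw, "becomes", normalise_date(raw))
</code></pre>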
<p>Back to the technology, and the day to day implementation of all this. ‘We’ use data to describe entities and relationships of whatever nature. We use systems to collect and store the data in domain, service and time dependent record formats in system dependent datastores. And we create flows and transformations of data between these systems in order to fulfill our goals. For all this there are multiple standards.</p>
<p>Basically, my own presentation “<a href="https://doi.org/10.6084/m9.figshare.11791410.v1" target="_blank" rel="noopener noreferrer">Datamazed &#8211; Analysing library dataflows, data manipulations and data redundancies</a>” targeted this fragmented data environment, describing the Library of the University of Amsterdam’s Dataflow Inventory <a href="https://www.youtube.com/watch?v=cvChjHcABPA" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" class="alignright wp-image-2496" src="/wp-content/uploads/2015/06/abbasos1.jpg" alt="abbasos" width="68" height="68" srcset="/wp-content/uploads/2015/06/abbasos1.jpg 220w, /wp-content/uploads/2015/06/abbasos1-150x150.jpg 150w" sizes="(max-width: 68px) 100vw, 68px" /></a> project, leading to a Dataflow Repository that effectively functions as a map of mappings. &#8220;System of Systems (SoS)&#8221; was also the topic of the workshop I participated in, “<a href="http://elag2015.org/program/what-is-metadata-management-in-net-centric-systems/" target="_blank" rel="noopener noreferrer">What is metadata management in net-centric systems?</a>” by Jing Wang.</p>
<p>Making sense of entities and relationships was the focus of a number of talks, especially the one by Thom Hickey about <a href="http://elag2015.org/program/managing-work-and-expression-entities-in-worldcat-the-limits-to-existing-authority-files-and-how-data-mining-can-extend-them/" target="_blank" rel="noopener noreferrer">extending work, expression and author entities by way of data mining</a> in Worldcat and VIAF, and the presentation by <a href="http://twitter.com/janestevenson" target="_blank" rel="noopener noreferrer">Jane Stevenson</a> on the <a href="http://www.slideshare.net/JaneStevenson/elag2015-ivory-tower" target="_blank" rel="noopener noreferrer">Jisc/Archives Hub project “Exploring British Design”</a>, which entailed shifting focus from documents to people and organizations as connected entities. Some interesting observations about the latter project: the project team started with identifying what the target audience actually wanted and how they would go about getting the desired information (“<em>Why are we describing?</em>”) in order to get to the entity based model (“<em>What are we describing?</em>”). This means that any entity identified in the system can become a focus, or starting point for pathways. A problem that became apparent is that the usual format standards for collection descriptions didn’t allow for events to be described.</p>
<p>Here we arrive at the <a href="https://www.youtube.com/watch?v=mOHCV-QO5HA" target="_blank" rel="noopener noreferrer">critique of standards</a> that was formulated by <a href="http://twitter.com/brinxmat" target="_blank" rel="noopener noreferrer">Rurik Greenall</a> in his <a href="https://github.com/brinxmat/presentations/blob/master/2015/ELAG2015.pdf" target="_blank" rel="noopener noreferrer">talk about the Oslo Public Library ILS migration project</a>, where they are migrating from a traditional ILS to RDF based cataloguing. Starting point here is: know what you need in order to support your actual users, not some idealised standard user, and work with a number of simple use cases (“<em>Why are we describing?</em>”). Use standards appropriate for your users and use cases. Don’t be rigid, and adapt. Use enough RDF to support use cases. Use just a part of the <a href="http://koha-community.org/" target="_blank" rel="noopener noreferrer">open source ILS Koha</a> to support specific use cases that it can do well (users and holdings). Users and holdings are a closed world, which can be dealt with using a part of an existing system. Bibliographic information is an open world which can be taken care of with RDF. The data model <a href="https://books.google.nl/books?id=8y-FVtrKeSYC&amp;printsec=frontcover&amp;dq=paul+feyerabend&amp;hl=en&amp;sa=X&amp;redir_esc=y#v=onepage&amp;q=anything%20goes&amp;f=false" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" class="alignright wp-image-2499" src="/wp-content/uploads/2015/06/zappa.jpg" alt="zappa" width="105" height="108" srcset="/wp-content/uploads/2015/06/zappa.jpg 389w, /wp-content/uploads/2015/06/zappa-292x300.jpg 292w" sizes="(max-width: 105px) 100vw, 105px" /></a>again corresponds to the use cases that are identified. It grows organically as needed. Standards are only needed for communicating with the outside world, but we must not let the standards infect our data model (here we see Postel’s Law again).</p>
<p>A striking parallel can be distinguished with the <a href="http://www.slideshare.net/Bibliotekarien_se/elag-integrating-open-source" target="_blank" rel="noopener noreferrer">Stockholm University Library project</a> for integration of the Open Source ILS Koha, the Swedish LIBRIS Union Catalogue and a locally developed logistics and ILL system. Again, only one part of Koha is used for specific functions, mainly because with commercial ILSes it is not possible to purchase only individual modules. Integrated library systems, which seemed a good idea in the 1980s, just cannot cope with fragmented open world data environments.</p>
<p>Dedicated systems, like ILSes, either commercial or open source, tend to force certain standards upon you. These standards not only apply to data storage (record formats etc.) but also to system structure (integrated systems, data silos), etc. This became quite clear in the presentation about the <a href="http://elag2015.org/program/exploring-the-boundaries-of-marc21-creating-a-metadata-schema-for-the-cern-open-data-portal/" target="_blank" rel="noopener noreferrer">CERN Open Data Portal</a>, where the standard digital library system <a href="http://invenio-software.org/" target="_blank" rel="noopener noreferrer">Invenio</a> imposed the MARC bibliographic format for describing research datasets for high energy physics, which turned out to be difficult if not impossible. Currently they are moving towards using JSON (yet another data standard) because the system apparently supports that too.</p>
<p>With Open Source systems it is easier to adapt the standards to your needs than with proprietary commercial vendor systems. An example of this was given by the <a href="http://elag2015.org/program/processing-and-presenting-multiple-research-data-files-with-eprints/" target="_blank" rel="noopener noreferrer">University of Groningen Library project</a> where the Open Source Publication Repository software EPrints was tweaked to enable storage and description of research datasets focused on archeological findings, which require very specific information.</p>
<p>As the two ILS migration projects already demonstrated, deviating from standards of any kind can often be implemented quite easily. This is obviously not always necessary. The locally developed <a href="http://elag2015.org/program/the-swedish-e-legal-deposit-law-massive-data-flows-and-new-challenges-for-a-national-library/" target="_blank" rel="noopener noreferrer">Swedish Royal Library system for the legal deposit of electronic documents</a> supports available suitable metadata standards like OAI, METS, MODS and PREMIS.</p>
<p>For the <a href="https://oerworldmap.org/" target="_blank" rel="noopener noreferrer">OER World Map project</a>, presented by <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;ved=0CB4QFjAAahUKEwirjZP9gpTGAhUJEiwKHc9IAAo&amp;url=https%3A%2F%2Ftwitter.com%2Fliterarymachine&amp;ei=1Px_VavTK4mksAHPkYFQ&amp;usg=AFQjCNHeYIcMb0nndMEVt1OD80I8DDw7tg&amp;sig2=COUUgUMc6GPOIeQa4MPFLg&amp;bvm=bv.96041959,d.bGg&amp;cad=rjt" target="_blank" rel="noopener noreferrer">Felix Ostrowski</a>, we can safely say that no standards were followed whatsoever, except the use of the discovery data format <a href="http://schema.org/" target="_blank" rel="noopener noreferrer">schema.org</a> for storing data, which is basically also an adaptation of a standard. Furthermore, the original objective of the project was organically extended by using the data hub for all kinds of end user services and visualisations beyond the original world map of the locations of Open Educational Resources.</p>
<p>It should be clear that every adaptation of a standard generates the need for additional mappings and transformations, on top of the ones a fragmented systems and data infrastructure already needs for moving data around to various places for different services. Mapping and transformation of data can be done in two ways: manually, in the case of explicit, known items, and by mining, in the case of implicit, unknown items.</p>
<p>Manual mapping and transformation is of course done by dedicated software. The manual part consists of people selecting specific source data elements to be transformed into target data elements. This procedure is known as ETL (Extract Transform Load), and implies the copying of data between systems and datastores, which always entails some form of data redundancy. Tuesday afternoon was dedicated to this practice with three presentations: <a href="http://elag2015.org/program/scaling-data-streams-with-catmandu-and-linked-data-fragments/" target="_blank" rel="noopener noreferrer">Catmandu and Linked Data Fragments</a> by <a href="http://twitter.com/rubenverborgh" target="_blank" rel="noopener noreferrer">Ruben Verborgh</a> and <a href="http://twitter.com/hochstenbach" target="_blank" rel="noopener noreferrer">Patrick Hochstenbach</a>; <a href="https://docs.google.com/presentation/d/1ERqEQ9HctmZuwdSJ5_xD89va5-7Ypyv473v1tpe-eyk/pub#slide=id.p" target="_blank" rel="noopener noreferrer">COMSODE</a> by <a href="http://twitter.com/jindrichmynarz" target="_blank" rel="noopener noreferrer">Jindrich Mynarz</a>; <a href="https://docs.google.com/presentation/d/1Aa_JM8YjSOF3A4LPm0ZuoFWzu5h-NN7VtygAE0WG97Q/present#slide=id.p" target="_blank" rel="noopener noreferrer">d:swarm</a> by <a href="http://twitter.com/smiysmiysmiy" target="_blank" rel="noopener noreferrer">Thomas Gängler</a>. Of these three the first one focused more on efficiently exposing data as Linked Open Data by using the Linked Data Fragments protocol. An important aspect of tools like this is that we can move our accumulated knowledge and investment in data transformation away from proprietary formats and systems to system and vendor independent platforms and formats.</p>
<p>Data mining and text mining are used in the case of non-explicit data about entities and relationships, where bodies of data and text are analyzed using dedicated algorithms in order to find implicit entities and relationships and make them explicit. This was already mentioned in Thom Hickey’s Worldcat Entity Extension project. It is also used in the <a href="http://www.slideshare.net/kbaierer/infolis-ii-elag2015" target="_blank" rel="noopener noreferrer">InFoLiS2 project</a>, where data and text mining is used to find relationships between research projects and scholarly publications.</p>
<p>Another <a href="http://www.slideshare.net/shenghuiwang/elag" target="_blank" rel="noopener noreferrer">example</a> was provided by Rob Koopman and Shenghui Wang of OCLC Research, who analyzed keywords, authors and journal titles in the ArticleFirst database to generate proximity visualizations for greater serendipity.</p>
<p>As long as ‘we’ don’t or can’t describe these types of relationships explicitly, we have to use techniques like mining to extract meaningful entities and relationships and generate data. Even if we do create and maintain explicit descriptions, we will remain a closed world if we don’t connect our data to the rest of the world. Connections can be made by targeted mapping, as in the case of the Finnish <a href="http://elag2015.org/program/library-metadata-know-how-as-a-service-for-the-whole-public-sector-national-library-of-finlands-finto-ontology-service/" target="_blank" rel="noopener noreferrer">FINTO Library Ontology service</a> for the public sector, or by adopting an open world strategy making use of semantic web and linked open data instruments.</p>
<p>Furthermore, ‘we’ as libraries should continuously ask ourselves the questions “<em>Why and what are we describing?</em>”, but also “<em>Why are we here</em>?”. Should we stick to managing descriptive data, or should we venture out into making sense of big data and research data, and provide new information services to our audience?</p>
<p><i>Finally, I thank the local organizers, the presenters and all other participants for making ELAG2015 a smooth, sociable and stimulating experience.</i></p>]]></content:encoded>
					
					<wfw:commentRss>/2467/feed/</wfw:commentRss>
			<slash:comments>16</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2467</post-id>	</item>
		<item>
		<title>Analysing library data flows for efficient innovation</title>
		<link>/2419/</link>
					<comments>/2419/#comments</comments>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Thu, 27 Nov 2014 12:24:34 +0000</pubDate>
				<category><![CDATA[Library]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[enterprise architecture]]></category>
		<category><![CDATA[information infrastructure]]></category>
		<category><![CDATA[libraries]]></category>
		<category><![CDATA[library systems]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[metadata]]></category>
		<guid isPermaLink="false">/?p=2419</guid>

					<description><![CDATA[In my work at the Library of the University of Amsterdam I am currently taking a step forward by actually taking a step back from a number of forefront activities in discovery, linked open data and integrated research information towards a more hidden, but also more fundamental enterprise [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2419"><em>https://purl.org/cpl/2419</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<p>In my work at the <a href="http://uba.uva.nl/" target="_blank" rel="noopener noreferrer">Library of the University of Amsterdam</a> I am currently taking a step forward by actually taking a step back from a number of forefront activities in discovery, linked open data and integrated research information towards a more hidden, but also more fundamental enterprise in the area of data infrastructure and information architecture. All for a good cause, for in the end a good data infrastructure is essential for delivering high quality services in discovery, linked open data and integrated research information.<br />In my role as library systems coordinator I have become more and more frustrated with the huge amounts of time and effort spent on moving data from one system to another and shoehorning one record format into the next, only to fulfill the necessary everyday services of the university library. Not only is it not possible to invest this time and effort productively in innovative developments, but this fragmented system and data infrastructure is also completely unsuitable for fundamental innovation. Moreover, information provided by current end user services is fragmented as well. Systems are holding data hostage. I have mentioned this problem before in a <a href="http://swib.org/" target="_blank" rel="noopener noreferrer">SWIB</a> <a href="https://doi.org/10.6084/m9.figshare.11791635.v1" target="_blank" rel="noopener noreferrer">presentation</a>. The issue was also recently touched upon in an OCLC Hanging Together blog post: “<a href="http://hangingtogether.org/?p=4524" target="_blank" rel="noopener noreferrer">Synchronizing metadata among different databases</a>” .</p>
<figure id="attachment_2433" aria-describedby="caption-attachment-2433" style="width: 646px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-2433" src="/wp-content/uploads/2014/11/Old-silos-new-silos-no-silos.png" alt="Fragmented data (SWIB12)" width="646" height="485" srcset="/wp-content/uploads/2014/11/Old-silos-new-silos-no-silos.png 960w, /wp-content/uploads/2014/11/Old-silos-new-silos-no-silos-300x225.png 300w" sizes="(max-width: 646px) 100vw, 646px" /><figcaption id="caption-attachment-2433" class="wp-caption-text">Fragmented data (SWIB12)</figcaption></figure>
<p>In order to avoid confusion in advance: when using the term “data” here, I am explicitly not referring to research data or any other specific type of data. I am using the term in a general sense, including what is known in the library world as “metadata”. In fact this is in line with the usage of the term “data” in information analysis and system design practice, where data modelling is one of the main activities. Research datasets as such are to be treated as content types like books, articles, audio and people.</p>
<p>It is my firm opinion that libraries have to focus on making their data infrastructure more efficient if they want to keep up with the ever changing needs of their audience and invest in sustainable service development. For a more detailed analysis of this opinion see my post &#8220;<a title="(Discover AND deliver) OR else" href="https://purl.org/cpl/1851" target="_blank" rel="noopener noreferrer">(Discover AND deliver) OR else &#8211; The future of the academic library as a data services hub</a>”. There are a number of different options to tackle this challenge, such as starting completely from scratch, which would require huge investments in resources for a long time, or implementing some kind of additional intermediary data warehouse layer while leaving the current data source systems and workflows in place. But for all options to be feasible and realistic, a thorough analysis of a library’s current information infrastructure is required. This is exactly what the new <em>Dataflow Inventory</em> project is about.</p>
<p>The project is being carried out within the context of the short term Action Plans of the Digital Services Division of the Library of the University of Amsterdam, and specifically the &#8220;<em>Development and improvement of information architecture and dataflows</em>&#8221; program. The goal of the project is to describe the nature and content of all internal and external datastores and dataflows between internal and external systems in terms of object types (such as books, articles, datasets, etc.) and data formats, thereby identifying overlap, redundancy and bottlenecks that stand in the way of efficient data and service management. We will be looking at dataflows in both front and back end services for all main areas of the University Library: bibliographic, heritage and research information. Results will be a logical map of the library data landscape and recommendations for possible follow up improvements. Ideally it will be the first step in the Cleaning-Reconciling-Enriching-Publishing data chain as described by <a href="http://twitter.com/sethvanhooland" target="_blank" rel="noopener noreferrer">Seth van Hooland</a> and <a href="http://twitter.com/RubenVerborgh" target="_blank" rel="noopener noreferrer">Ruben Verborgh</a> in their book “<a href="http://freeyourmetadata.org/" target="_blank" rel="noopener noreferrer">Linked Data for Libraries, Archives and Museums</a>”.</p>
<p>The first phase of this project is to decide how to describe and record the information infrastructure in such a form that the data map can be presented to various audiences in a number of ways, and at the same time can be reused in other contexts in the long run, for instance when designing new services. For this we need a methodology and a tool.</p>
<p>At the university library we do not have any thorough experience with describing an information infrastructure on an enterprise level, so in this case we had to start with a clean slate. I am not at all sure that we came up with the right approach in the end. I hope this post will trigger some useful feedback from institutions with relevant experience.</p>
<p>Since the initial and primary goal of this project is to describe the existing infrastructure instead of a desired new situation, the first methodological area to investigate appears to be <a href="http://en.wikipedia.org/wiki/Enterprise_architecture" target="_blank" rel="noopener noreferrer">Enterprise Architecture</a> (interesting to see that Wikipedia states &#8220;<em><span class="mbox-text-span">This article appears to <b>contain a large number of buzzwords</b></span></em>&#8220;). Because it is always better to learn from other people’s experiences than to reinvent all four wheels, we went looking for similar projects in the library, archive and museum universe. This proved to be rather problematic. There was only one project we could find that addresses a similar objective, and I happened to know one of the project team members. The Belgian “<em>Digital library system’s architecture study</em>” (English language report <a href="http://www.bibnet.be/portal/page/portal/327D58A757124721BE14530FA05407E4" target="_blank" rel="noopener noreferrer">here</a>) was carried out for the Flemish Public Library network <a href="http://www.bibnet.be/" target="_blank" rel="noopener noreferrer">Bibnet</a>, by <a href="http://twitter.com/rcallewaert" target="_blank" rel="noopener noreferrer">Rosemie Callewaert</a> among others. Rosemie was so kind as to talk to me and explain the project objectives, approaches, methods and tools used. For me, two outcomes of this talk stand out: the main methodology used in the project is <a href="http://www.opengroup.org/subjectareas/enterprise/archimate" target="_blank" rel="noopener noreferrer">Archimate</a>, which is an Enterprise Architecture methodology, and the approach is completely counter to our own: starting from the functional perspective, as opposed to our overview of the actually implemented infrastructure. This last point meant we were still looking at a predominantly clean slate.<br />Archimate also turned out to be the method of choice of the University of Amsterdam central enterprise architecture group, whom we also contacted. It became clear that in order to use Archimate efficiently, it is necessary to spend a considerable amount of time on mastering the methodology. We looked for some accessible introductory information to get started. However, the official Open Group Archimate website is not as accessible as desired, in more than one way. We managed to find some documentation anyway, for instance the direct link to the <a href="http://pubs.opengroup.org/architecture/archimate2-doc/" target="_blank" rel="noopener noreferrer">Archimate specification</a> and the free document “<a href="http://www.archimate.nl/content/bestanden/archimate_made_practical_2008-04-28.pdf" target="_blank" rel="noopener noreferrer">Archimate made practical</a>”. After studying this material we found that Archimate is a comprehensive methodology for describing business, application and technical infrastructure components, but we also came to the conclusion that for our current short term project presentation goals we needed something that could be implemented fairly soon. We will keep Archimate in mind for the intermediate future. If anybody is interested, there is a good free open source modelling tool available, <a href="http://www.archimatetool.com/" target="_blank" rel="noopener noreferrer">Archi</a>. Other Enterprise Architecture methodologies, like Business Process Modelling, focus more on workflows than on existing data infrastructures. Turning to system design methods like <a href="http://en.wikipedia.org/wiki/Unified_Modeling_Language" target="_blank" rel="noopener noreferrer">UML</a> (Unified Modelling Language), we see similar drawbacks.</p>
<p>An obvious alternative technique to consider is <a href="https://en.wikipedia.org/wiki/Data-flow_diagram" target="_blank" rel="noopener noreferrer">Dataflow Diagramming</a> (DFD) (what’s in a name?), part of <a href="https://en.wikipedia.org/wiki/Structured_analysis#Structured_Design" target="_blank" rel="noopener noreferrer">the Structured Design and Structured Analysis methodology</a>, which I had used in previous jobs as systems designer and developer. Although DFD’s are normally used for describing functional requirements on a conceptual level, with some tweaking they can also be used for describing actual system and data infrastructures, similar to the Archimate Application and Infrastructure layers. The advantage of the DFD technique is that it is quite simple. Four elements are used to describe the flow of information (dataflows) between external entities, processes and datastores. The content of dataflows and datastores can be specified in more detail using a data dictionary. The resulting diagrams are relatively easy to comprehend. We decided to start with using DFD’s in the project. All we had left to do was find a good and not too expensive tool for it.</p>
<figure id="attachment_2438" aria-describedby="caption-attachment-2438" style="width: 654px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-2438" src="/wp-content/uploads/2014/11/Example-DFD.png" alt="Basic DFD structure" width="654" height="321" srcset="/wp-content/uploads/2014/11/Example-DFD.png 621w, /wp-content/uploads/2014/11/Example-DFD-300x147.png 300w" sizes="(max-width: 654px) 100vw, 654px" /><figcaption id="caption-attachment-2438" class="wp-caption-text">Basic DFD structure</figcaption></figure>
<p>There are basically two types of tools for describing business processes and infrastructures: drawing tools, focusing on creating diagrams, and repository based modelling tools, focused on reusing the described elements. The best known drawing tool must be <a href="http://office.microsoft.com/en-001/visio/" target="_blank" rel="noopener noreferrer">Microsoft Visio</a>, because it is part of Microsoft’s widely used Office suite. There are a number of other commercial and free tools, among which the free Google Drive extension <a href="http://www.draw.io/" target="_blank" rel="noopener noreferrer">Draw.io</a>. Although most drawing tools cover a wide range of methods and techniques, they don’t usually support reuse of elements with consistent characteristics in other diagrams. Also, diagrams are just drawings: they can’t be used to generate data definition scripts or basic software modules, and they don’t support reverse engineering or flexible reporting. Repository based tools can do all these things. Reuse, reporting, generating, reverse engineering and import and export features are exactly the features we need. We also wanted a tool that supports a number of other methods and techniques for employing in other areas of modelling, design and development. There are some interesting free or open source tools, like <a href="http://www.modelsphere.org/" target="_blank" rel="noopener noreferrer">OpenModelSphere</a> (which supports UML, <a href="http://yourdon.com/strucanalysis/wiki/index.php/Chapter_12" target="_blank" rel="noopener noreferrer">ERD Data modelling</a> and DFD), and a range of commercial tools. To cut a long story short, we selected the commercial design and management tool <a href="http://www.visual-paradigm.com/" target="_blank" rel="noopener noreferrer">Visual-Paradigm</a> because it supports a large number of methodologies with an extensive feature set in a number of editions for reasonable fees. An additional advantage is the online shared teamwork repository.</p>
<p>After acquiring the tool we had to configure it the way we wanted to use it. We decided to try and align the available DFD model elements to the Archimate elements so it would in time be possible to move to Archimate if that would prove to be a better method for future goals. Archimate has Business Service and Business Process elements on the conceptual business level, and Application Component (a “system”), Application Function (a “module”) and Application Service (a “function”) elements on the implementation level.</p>
<figure id="attachment_2442" aria-describedby="caption-attachment-2442" style="width: 656px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-2442 size-full" src="/wp-content/uploads/2014/11/archimateexample.png" alt="Basic Archimate Structure" width="656" height="363" srcset="/wp-content/uploads/2014/11/archimateexample.png 656w, /wp-content/uploads/2014/11/archimateexample-300x166.png 300w" sizes="(max-width: 656px) 100vw, 656px" /><figcaption id="caption-attachment-2442" class="wp-caption-text">Basic Archimate Structure</figcaption></figure>
<p>In our project we will mainly focus on the application layer, but with relations to the business layer. Fortunately, the DFD method supports a hierarchical process structure by means of the <a href="http://yourdon.com/strucanalysis/wiki/index.php/Chapter_9#LEVELED_DFDs" target="_blank" rel="noopener noreferrer">decomposition</a> mechanism, so the two hierarchical structures <em>Business Service &#8211; Business Process</em> and <em>Application Component &#8211; Application Function &#8211; Application Service</em> can be modeled using DFD. There is an additional direct logical link between a Business Process and the Application Service that implements it. By adding the “stereotypes” feature from the UML toolset to the DFD method in Visual Paradigm, we can effectively distinguish between the five process types (for instance by colour and attributes) in the DFD.</p>
<figure id="attachment_2445" aria-describedby="caption-attachment-2445" style="width: 614px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-2445 size-full" src="/wp-content/uploads/2014/11/Simple-DFD-Archimate-Alignment.png" alt="Archimate DFD alignment" width="614" height="685" srcset="/wp-content/uploads/2014/11/Simple-DFD-Archimate-Alignment.png 614w, /wp-content/uploads/2014/11/Simple-DFD-Archimate-Alignment-268x300.png 268w" sizes="(max-width: 614px) 100vw, 614px" /><figcaption id="caption-attachment-2445" class="wp-caption-text">Archimate DFD alignment</figcaption></figure>
<p>So in our case, a DFD process with a “system” stereotype represents a top level Business Service (“Catalogue”, “Discover”, etc.) and a “process” process within “Catalogue” represents an activity like “Describe item”, “Remove item”, etc. On the application level a “system” DFD process (Application Component) represents an actual system, like Aleph or Primo, a “module” (Application Function) a subsystem like Aleph CAT or Primo Harvesting, and a “function” (Application Service) an actual software function like “Create item record”.<br />A DFD datastore is used to describe the physical permanent and temporary files or databases used for storing data. In Archimate terms this would probably correspond with a type of “Artifact” in the Technical Infrastructure layer, but that might be subject to interpretation.<br />Finally, an actual dataflow describes the data elements that are transferred between external entities and processes, between processes, and between processes and datastores, in both directions. In DFD, the data elements are defined in the data dictionary in the form of terms in a specific syntax that also supports optionality, selection and iteration, for instance:</p>
<ul>
<li><em>book = title + (subtitle) + {author} + publisher + date</em></li>
<li><em>author = name + birthdate + (death date)</em></li>
</ul>
<p>etc.<br />In Archimate, flows differ between the Business and Application layers. In the Business layer a flow can be specified by a Business Object, which indicates the object types that we want to describe, like “book”, “person”, “dataset”, “holding”, etc. The Business Object is realised as one or more Data Objects in the Application Layer, thereby describing the actual data records representing the objects transferred between Application Services and Artifacts. In DFD there is no difference between a business flow and a dataflow. In our project we particularly want to describe business objects in dataflows and datastores to be able to identify overlap and redundancies. But besides that, we are also interested in differences in the data structures used for similar business objects. So we do have to distinguish between business and data objects in the DFD model. In Visual-Paradigm this can be done in a number of ways. It is possible to add elements from other methodologies to a DFD, with links between dataflows or datastores and the added external elements. Data structures like this can also be described in Entity Relationship Diagrams, UML Class Diagrams or even RDF Ontologies.<br />We haven’t decided on this issue yet. For the time being we will employ the Visual Paradigm Glossary tool to implement business and data object specifications using Data Dictionary terms. A specific business object (“book”) will be linked to a number of different dataflows and datastores, but the actual data objects for that one business object can be different, both in content and in format, depending on the individual dataflows and datastores. For instance, a “book” Business Object can be represented in one datastore as an extensive MARC record, and in another as a simple Dublin Core record.</p>
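<p>A minimal sketch of that last point, with one “book” business object linked to two different data objects (the MARC subfield tags are real, but the record content and datastore names are invented):</p>
<pre><code># One business object, two data objects: the same book represented as an
# (abbreviated) MARC-like record in one datastore and as a simple Dublin
# Core record in another.

business_object = "book"

data_objects = {
    "ILS datastore (MARC)": {
        "245$a": "The data maze",                     # title
        "245$b": "interoperability in libraries",     # subtitle
        "100$a": "Koster, Lukas",                     # author
        "260$b": "Amsterdam University Press",        # publisher
    },
    "repository datastore (Dublin Core)": {
        "dc:title": "The data maze",
        "dc:creator": "Koster, Lukas",
    },
}

for store, record in data_objects.items():
    print(business_object, "in", store, "has", len(record), "fields")
</code></pre>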
<figure id="attachment_2459" aria-describedby="caption-attachment-2459" style="width: 652px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class=" wp-image-2459" src="/wp-content/uploads/2014/11/DFD-WorldCat-Aleph-Primo1.png" alt="Example bibliographic dataflows" width="652" height="358" srcset="/wp-content/uploads/2014/11/DFD-WorldCat-Aleph-Primo1.png 810w, /wp-content/uploads/2014/11/DFD-WorldCat-Aleph-Primo1-300x164.png 300w" sizes="(max-width: 652px) 100vw, 652px" /><figcaption id="caption-attachment-2459" class="wp-caption-text">Example bibliographic dataflows</figcaption></figure>
<p>After having determined method, tool and configuration, the next step is to start gathering information about all relevant systems, datastores and dataflows and describing this in Visual Paradigm. This will be done by invoking our own internal Digital Services Division expertise, reviewing applicable documentation, and most importantly interviewing internal and external domain experts and stakeholders.<br />Hopefully the resulting data map will provide so much insight that it will lead to real efficiency improvements and really innovative services.</p>]]></content:encoded>
					
					<wfw:commentRss>/2419/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2419</post-id>	</item>
		<item>
		<title>Looking for data tricks in Libraryland</title>
		<link>/2391/</link>
		
		<dc:creator><![CDATA[Lukas Koster]]></dc:creator>
		<pubDate>Fri, 05 Sep 2014 12:12:34 +0000</pubDate>
				<category><![CDATA[Library]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[ifla]]></category>
		<category><![CDATA[ifla2014]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[wlic2014]]></category>
		<guid isPermaLink="false">/?p=2391</guid>

					<description><![CDATA[IFLA 2014 Annual World Library and Information Congress Lyon &#8211; Libraries, Citizens, Societies: Confluence for Knowledge After attending the IFLA 2014 Library Linked Data Satellite Meeting in Paris I travelled to Lyon for the first three days (August 17-19) of the IFLA 2014 Annual World Library and Information [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="has-text-align-center"><strong><em>Permalink:</em></strong> <a href="https://purl.org/cpl/2391"><em>https://purl.org/cpl/2391</em></a></p>



<hr class="wp-block-separator"/>



<p></p>


<p><em><b>IFLA 2014 Annual World Library and Information Congress Lyon &#8211; Libraries, Citizens, Societies: Confluence for Knowledge</b></em></p>
<p><img loading="lazy" decoding="async" class="aligncenter  wp-image-2399" src="/wp-content/uploads/2014/09/DSC01405-1024x768.jpg" alt="DSC01405" width="655" height="492" srcset="/wp-content/uploads/2014/09/DSC01405-1024x768.jpg 1024w, /wp-content/uploads/2014/09/DSC01405-300x225.jpg 300w, /wp-content/uploads/2014/09/DSC01405.jpg 2048w" sizes="(max-width: 655px) 100vw, 655px" /></p>
<p>After attending the <a href="https://purl.org/cpl/2379" target="_blank" rel="noopener noreferrer">IFLA 2014 Library Linked Data Satellite Meeting in Paris</a> I travelled to Lyon for the first three days (August 17-19) of the <a href="http://conference.ifla.org/ifla80" target="_blank" rel="noopener noreferrer">IFLA 2014 Annual World Library and Information Congress</a>. This year’s theme “<i>Libraries, Citizens, Societies: Confluence for Knowledge</i>” was named after the confluence or convergence of the rivers Rhône and Saône where the city of Lyon was built.</p>
<p>This was the first time I attended an IFLA annual meeting and it was very much unlike all conferences I have ever attended. Most of them are small and focused. The IFLA annual meeting is very big (but not as big as ALA) and covers a lot of domains and interests. The main conference lasts a week, including all kinds of committee meetings, and has more than 4000 participants, a lot of parallel tracks and very specialized Special Interest Group sessions. Separate Satellite Meetings are organized before the actual conference in different locations. This year there were more than 20 of them. These Satellite Meetings actually resemble the smaller and more focused conferences that I am used to.</p>
<p>A conference like this requires a lot of preparation and organization. Many people are involved, but I especially want to mention the hundreds of volunteers, present not only in the conference centre but also at the airport, at the railway stations and on the way to the venue of the cultural evening. They were all very friendly and helpful.</p>
<p>Another feature of such a large global conference is that presentations are given in a number of official languages, not only English, with a team of interpreters available for simultaneous translation. I attended a couple of talks in French without a translation headset, but managed to understand most of what was presented, mainly because the presenters provided their slides in English.</p>
<p>It is clear that you have to prepare for the IFLA annual meeting and select in advance the sessions and tracks you want to attend; with a large <a href="http://conference.ifla.org/ifla80/programme-and-proceedings-full-printable" target="_blank" rel="noopener noreferrer">multi-track</a> conference like this it is not always possible to attend every interesting session. In light of a new data infrastructure project I recently started at the Library of the University of Amsterdam, I decided to focus on tracks and sessions related to data in libraries in the broadest sense: “<a href="http://conference.ifla.org/ifla80/node/289" target="_blank" rel="noopener noreferrer"><i>Cloud services for libraries &#8211; safety, security and flexibility</i></a>” on Sunday afternoon, the all-day track “<i><a href="http://conference.ifla.org/ifla80/node/303" target="_blank" rel="noopener noreferrer">Universal Bibliographic Control in the Digital Age: Golden Opportunity or Paradise Lost?</a></i>” on Monday, and “<a href="http://conference.ifla.org/ifla80/node/335" target="_blank" rel="noopener noreferrer"><i>Research in the big data era: legal, social and technical approaches to large text and data sets</i></a>” on Tuesday morning.</p>
<h3><b>Cloud Services for Libraries</b></h3>
<p>The term “cloud” is clearly very ambiguous, and consequently a rather fuzzy concept. Which is fitting, because clouds are elusive objects anyway.</p>
<p>In the <i>Cloud Services for Libraries</i> session there were five talks in total. <a href="http://library.ifla.org/971/" target="_blank" rel="noopener noreferrer">Kee Siang Lee</a> of the National Library Board of Singapore (NLB) described the cloud-based NLB IT infrastructure, consisting of three parts: a private, a public and a hybrid cloud. The private (restricted access) cloud is used for virtualization and an extensive service layer for discovery, content, personalization and “Analytics as a service”, which pushes and recommends related content from different sources and in various formats to end users. This “contextual discovery” is based on text analytics technologies applied across multiple sources, using a Hadoop cluster on virtual servers. The public cloud is used for the Web Archive Singapore project, which is aimed at archiving a large number of Singapore websites. The hybrid cloud is used for the Enquiry Management System (EMS), where “sensitive data is processed in-house while the non-sensitive data resides in the cloud”. It seems that in Singapore “cloud” is just another word for a group of real or virtual servers.</p>
<p>In the talk given by <a href="http://de.linkedin.com/pub/beate-rusch/b/733/220" target="_blank" rel="noopener noreferrer">Beate Rusch</a> of the German Library Network Service Centre for Berlin and Brandenburg (<a href="http://www.kobv.de/" target="_blank" rel="noopener noreferrer">KOBV</a>), the term “cloud” meant the shared management of data on servers located somewhere in Germany. KOBV is one of the German regional library networks involved in the <a href="http://www.projekt-cib.de/" target="_blank" rel="noopener noreferrer">CIB project</a>, which is aimed at developing a unified national library data infrastructure. This infrastructure may consist of a number of individual clouds. Beate Rusch described three possible outcomes: one cloud serving as a master for the others, a data roundabout linking the individual clouds, and a cross-cloud dataspace with an overlapping shared environment between the individual clouds. An interesting aspect of the CIB project is that cooperation with two large commercial library system vendors, OCLC and Ex Libris, is part of the official agreement. This makes it relevant for other countries, like the Netherlands, that have vested interests in these two companies.</p>
<h3><b>Universal Bibliographic Control in the Digital Age</b></h3>
<p>The Universal Bibliographic Control (UBC) session was an all-day track with twelve very diverse presentations. <a href="http://library.ifla.org/1034/" target="_blank" rel="noopener noreferrer">Ted Fons</a> of OCLC gave a good talk explaining the importance of the transition from describing records to modeling entities. My impression lately is that OCLC has, all in all, been doing a good job with linked data PR, explaining the importance and the inevitability of the semantic web for libraries to a librarian audience without resorting to technical jargon like URI, ontology, dereferencing and the like. <a href="http://twitter.com/rjw" target="_blank" rel="noopener noreferrer">Richard Wallis</a> of OCLC, who was present both at the IFLA 2014 Linked Data Satellite Meeting and in Lyon, is spreading the word all over the globe.</p>
<p>Of the remaining talks, the most interesting ones were given in the afternoon. Anila Angjeli of the National Library of France (BnF) and Andrew MacEwan of the British Library explained the importance, similarities and differences of <a href="http://library.ifla.org/985/" target="_blank" rel="noopener noreferrer">ISNI and VIAF</a>, both authority files with identifiers for people (real and virtual). Gildas Illien (also one of the organizers of the Linked Data Satellite Meeting in Paris) and Françoise Bourdon, both BnF, described <a href="http://library.ifla.org/956/" target="_blank" rel="noopener noreferrer">the future of Universal Bibliographic Control</a> in the web of data, a development closely related to the topic of the talks by Ted Fons, Anila Angjeli and Andrew MacEwan.</p>
<p>The <a href="http://library.ifla.org/819/" target="_blank" rel="noopener noreferrer">ONKI project</a>, presented by the National Library of Finland, is a very good example of how bibliographic control can be moved into the digital age. The project entails the transfer of the general national library thesaurus YSA to the new YSO ontology, from libraries to the whole public sector and from closed to open data. The new ontology is based on concepts (identified by URIs) instead of monolingual text strings, with multilingual labels and machine-readable relationships. Moreover, the management and development of the ontology is now a distributed process. On top of the ontology the new public online <a href="http://finto.fi/" target="_blank" rel="noopener noreferrer">Finto</a> service has been made available.</p>
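<p>To make this concrete, here is a minimal sketch (in Python, using rdflib) of what such a concept-based, multilingual thesaurus entry looks like in SKOS. The concept identifiers and labels below are made up for illustration; they are not actual YSO data.</p>
<pre><code># A hypothetical SKOS concept with multilingual labels, sketched with rdflib.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

YSO = Namespace("http://www.yso.fi/onto/yso/")  # YSO namespace; concept IDs below are invented

g = Graph()
concept = YSO.p12345  # hypothetical concept URI

g.add((concept, RDF.type, SKOS.Concept))
# One language-independent concept, several language-specific labels
g.add((concept, SKOS.prefLabel, Literal("kirjastot", lang="fi")))
g.add((concept, SKOS.prefLabel, Literal("bibliotek", lang="sv")))
g.add((concept, SKOS.prefLabel, Literal("libraries", lang="en")))
# Machine-readable relationship to a (hypothetical) broader concept
g.add((concept, SKOS.broader, YSO.p67890))

print(g.serialize(format="turtle"))
</code></pre>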
<p>The final talk of the day, “<a href="http://library.ifla.org/817/" target="_blank" rel="noopener noreferrer"><i>The local in the global: universal bibliographic control from the bottom up</i></a>” by Gordon Dunsire, applied the “Think globally, act locally” aphorism to Universal Bibliographic Control in the semantic web era. Universal top-down control should make way for local bottom-up control. There are so many old and new formats for describing information that we are facing a new biblical confusion of tongues: RDA, FRBR, MARC, BIBO, BIBFRAME, DC, ISBD, etc. What is needed is a set of translators between local and global data structures, on a logical level: Schema Translator, Term Translator, Statement Maker, Statement Breaker, Record Maker, Record Breaker. These black boxes are a challenge to developers. Indeed, the mapping and matching of data of various types, formats and origins are vital in the new web of information age.</p>
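<p>As an illustration of what one of these black boxes might do, here is a toy sketch in Python of a “Record Breaker” that decomposes a local record into individual statements, and a “Record Maker” that reassembles them. This is my own minimal interpretation, not code from the talk; the record fields and identifier are invented.</p>
<pre><code># Toy "Record Breaker" / "Record Maker": a record is just a bundle of
# statements about one subject, so it can be broken up and rebuilt.
RECORD = {
    "id": "http://example.org/work/42",   # invented identifier
    "title": "Les Misérables",
    "creator": "Hugo, Victor",
}

def break_record(record):
    """Record Breaker: one record -> (subject, predicate, object) statements."""
    subject = record["id"]
    return [(subject, field, value) for field, value in record.items() if field != "id"]

def make_record(statements):
    """Record Maker: statements sharing one subject -> one record again."""
    record = {"id": statements[0][0]}
    for _subject, predicate, obj in statements:
        record[predicate] = obj
    return record

statements = break_record(RECORD)
assert make_record(statements) == RECORD
</code></pre>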
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-2401" src="/wp-content/uploads/2014/09/DSC01385-1024x768.jpg" alt="DSC01385" width="646" height="484" srcset="/wp-content/uploads/2014/09/DSC01385-1024x768.jpg 1024w, /wp-content/uploads/2014/09/DSC01385-300x225.jpg 300w, /wp-content/uploads/2014/09/DSC01385.jpg 2048w" sizes="(max-width: 646px) 100vw, 646px" /></p>
<h3><b>Research in the big data era</b></h3>
<p>The Research in the big data era session had five presentations on essentially two topics: data and text mining (four talks) and research data management (one talk). <a href="http://library.ifla.org/930/" target="_blank" rel="noopener noreferrer">Peter Leonard</a> of Yale University Library opened with a very interesting presentation on how advanced text mining techniques can be used for digital humanities research. Using the digitized archive of Vogue magazine, he demonstrated how the long-term analysis of the statistical distribution of related terms, like “pants”, “skirts”, “frocks”, or “women”, “girls”, can help visualise social trends and identify research questions. A number of free tools are available for this, like Google Books <a href="https://books.google.com/ngrams" target="_blank" rel="noopener noreferrer">N-Gram</a> Search and <a href="http://bookworm.culturomics.org/" target="_blank" rel="noopener noreferrer">Bookworm</a>. To make this type of analysis possible, researchers need full access to all data and text. However, rights issues come into play here, as <a href="http://www.helmholtz.de/en/research/open_access/" target="_blank" rel="noopener noreferrer">Christoph Bruch of the Helmholtz Association</a>, Germany, explained. What is needed is “intelligent openness” as defined by the Royal Society: data must be accessible, assessable, intelligible and usable. Unfortunately, European copyright law stands in the way of the idea of fair use, and many European researchers are forced to perform their data analysis projects outside Europe, in the USA. The plea for openness was also supported by <a href="http://library.ifla.org/1007/" target="_blank" rel="noopener noreferrer">LIBER’s Susan Reilly</a>: data and text mining should be regarded as just another form of reading, one that does not need additional licenses.</p>
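<p>For those curious what this kind of term-trend analysis boils down to, here is a minimal sketch in Python, in the spirit of the N-Gram and Bookworm tools mentioned above. The tiny “corpus” of (year, text) pairs is obviously invented; the point is the relative frequency per year, which is the quantity an n-gram viewer plots.</p>
<pre><code># Relative frequency of a term per year over a tiny invented corpus.
from collections import defaultdict

corpus = [
    (1950, "frocks and skirts and frocks"),
    (1950, "pants and skirts"),
    (1980, "pants pants pants and skirts"),
]

def term_trend(corpus, term):
    counts, totals = defaultdict(int), defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        counts[year] += tokens.count(term)
    # relative frequency per year: occurrences of the term / all tokens
    return {year: counts[year] / totals[year] for year in sorted(totals)}

print(term_trend(corpus, "pants"))  # {1950: 0.125, 1980: 0.6}
</code></pre>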
<h3><b>IdeasBox</b></h3>
<figure id="attachment_2410" aria-describedby="caption-attachment-2410" style="width: 300px" class="wp-caption alignright"><img loading="lazy" decoding="async" class="wp-image-2410 size-medium" src="/wp-content/uploads/2014/09/DSC01406-300x225.jpg" alt="IdeasBox" width="300" height="225" srcset="/wp-content/uploads/2014/09/DSC01406-300x225.jpg 300w, /wp-content/uploads/2014/09/DSC01406-1024x768.jpg 1024w, /wp-content/uploads/2014/09/DSC01406.jpg 2048w" sizes="(max-width: 300px) 100vw, 300px" /><figcaption id="caption-attachment-2410" class="wp-caption-text">IdeasBox packed</figcaption></figure>
<p>A very impressive and sympathetic library project that deserves everybody’s support was not an official programme item, but a bunch of crates, seats, tables and cushions spread across the central conference venue square. The whole set of furniture and equipment, which comes on two industrial pallets, constitutes a self-supporting mobile library/information centre to be deployed in emergency areas, refugee camps, etc. It is called <a href="http://www.ideas-box.org/en/" target="_blank" rel="noopener noreferrer">IdeasBox</a>, provided by <a href="http://www.librarieswithoutborders.org/" target="_blank" rel="noopener noreferrer">Libraries without Borders</a>. It contains mobile internet, servers, power supplies, e-readers, laptops, board games, books, etc., tailored to the circumstances, culture and needs of the target users and regions. The first IdeasBoxes are now in use in Burundi, in camps for refugees from Congo; others will soon go to Lebanon for Syrian refugees. If librarians can make a difference, it’s here. You can support Libraries without Borders and IdeasBox in all kinds of ways: <a href="http://www.ideas-box.org/en/support-us.html" target="_blank" rel="noopener noreferrer">http://www.ideas-box.org/en/support-us.html</a>.</p>
<figure id="attachment_2406" aria-describedby="caption-attachment-2406" style="width: 640px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-2406" src="/wp-content/uploads/2014/09/DSC01407-1024x768.jpg" alt="IdeasBox" width="640" height="480" srcset="/wp-content/uploads/2014/09/DSC01407-1024x768.jpg 1024w, /wp-content/uploads/2014/09/DSC01407-300x225.jpg 300w, /wp-content/uploads/2014/09/DSC01407.jpg 2048w" sizes="(max-width: 640px) 100vw, 640px" /><figcaption id="caption-attachment-2406" class="wp-caption-text">IdeasBox unpacked</figcaption></figure>
<h3><b>Conclusion</b></h3>
<p>The questions about data management in libraries that I brought with me to the conference were only partly addressed, and actual practical answers and solutions were very rare. Managing and mapping heterogeneous and redundant data from all the types of sources and domains that libraries cover, in a flexible, efficient and system-independent way, is apparently not a mainstream topic yet; for things like that you have to attend Satellite Meetings. Legal issues, privacy, copyright, text and data mining, and cloud-based data sharing and management, on the other hand, were discussed. It turns out that attending an IFLA meeting is a good way to find out what is discussed, and more importantly what is NOT discussed, among librarians, library managers and vendors.</p>
<p>The quality and content of the talks varied a lot. As always, the value of informal contacts and meetings cannot be overestimated. All in all, looking back I can say that my first IFLA was a positive experience, not least because of the positive spirit and enthusiasm of all the organizers, volunteers and delegates.</p>
<p><i>(Special thanks to Beate Rusch for sharing IFLA experiences)</i></p>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2391</post-id>	</item>
	</channel>
</rss>
