-=( In Between )=- - Scholarly Online Publishing, Open Access and Library Related Technology

In between has moved

Mon, 16 Nov 2009 00:50:48 +0100

Because I will move again, and will get a new job at Tilburg University as Head of the unit Academic Services, this weblog will move too. I have reserved a new domain (who knows when I'll move again) for the weblog. For all further posts see:

http://henkellermann.nl/inbetween

Hope to see you back there.

What do we want to find?

Tue, 12 May 2009 15:06:47 +0200

With the recent news, if not hype, about WolframAlpha - the system that will answer questions - comes the question of whether the search "paradigm" that permeates the library world is valid.

Search, for us librarians, is more often than not about the retrieval of relevant documents. But search is a result of a question that needs an answer. If I want to know when Dewey was born, do I really want documents that contain this answer? First of all I just want the answer and only after that I may need one document for reference purposes.

Question answering might be a better paradigm. The sort of answer one gets may depend on the question and the circumstances. I could get a factoid answer, a list of answers, or something else (say, a list of documents that are somewhat relevant). So there is a lot to be done to identify questions, to analyze content, etc.

Search in the library field may have become too narrow a concept. Question answering might give a better perspective on what we need to do. And if we take that perspective, boy, we sure need to work differently. We need to know about language parsing, about formalizing relations between concepts (not simply terms), about the pragmatics of language, about ontologies and the semantic web. We need to learn a lot of things we know very little about, yet.

Open access or cheaper access?

Fri, 03 Apr 2009 16:25:33 +0200

Ewing fulminates against open access. Access has never been better, he says. The real problem with journals is their price, not their limited access. And open access will lead to worse articles, he claims, and not without reason. This article is surely worth a read.

One argument he makes is that in an open access (author pays) model it is the publishers and the authors that determine what is published. In the current model libraries (and their clients) determine which journals have "intellectual value". The latter prevents vanity publishing, while the open access model doesn't. As long as the author is willing to pay, he will get published.

I am a bit surprised that Ewing seems to trust only financial punishments to prevent bad science from getting published. However, the status of the journal is, for many years already, a very important element in any evaluation of the quality of the authors. Rankings dominate the field. In fact, I think, it is their rankings that determine the price of a journal, and not the other way around, as Ewing seems to suggest.

So, whenever the day is born on which all journals are open access, there will still be lots of factors that make vanity publishing a futile exercise. Those journals, whether freely available or not, that maintain strict quality norms, will increase an author's reputation more than those journals that "just publish".

Money does not need to enter the picture here. Reputation is still the name of the game: as it should be.

Ewing raises more issues, be sure to read him and reflect on what he says. But all in all I dare say that he misses the point, the point that not money, but reputation (and ranking) really dominates the field. And that will not change in an open access publishing mode.

Ewing wants us to focus on the economics of publishing (re journals). Nah, we should simply increase their access, which is not optimal at all at the moment. We should also focus on good evaluations of journals, but universities can be trusted to take up that challenge, simply because they already do.

New Dutch Magazine about Digital Libraries

Tue, 17 Mar 2009 12:59:08 +0100

There is a new magazine called "De digitale bibliotheek". They have a web presence too.

It's a magazine, with a forum, and stuff... A magazine you have to pay for. If you subscribe to enough magazines (that is, when you subscribe to more magazines), you will become a member of the Essentials Community and may expect to pay a little less when attending their workshops and masterclasses.

So, why this magazine? Well, the usual: to bring people together, to inspire and motivate them. The magazine wants conversation, because (quoting someone) "conversation is the platform today".

This is not a magazine with a concrete mission and it has no focus in terms of content (conversation is not a focus!), it just wants to give digital librarians a platform to write for and read in. What about? Well, about whatever is new, I guess. News we all can find on the web already, thanks to weblogs, mail, magazines that are open access (like dlib), twitter, etc.

Will it get us working on a national digital library? Will it give us in depth detail on interoperability? Will it share code? Will it focus on the architecture of digital library systems? Will it help us think on the uses and functions of digital libraries? I guess not.

My prediction for this magazine therefore: A glossy with lots of little blocks, lots of pictures of people who have said this or that on God knows what, as long as the term digital library is part of it, interviews, awards, CV's, opinions, workshop announcements and reports... Well, stuff like that. This is social networking implemented in an old fashioned technology: the magazine that costs you money.

No, we really don't need another general magazine. We may need more magazines with a strong focus, a bit like D-Lib and Ariadne, we may need more technical journals, more theoretical journals, more practical journals: journals by and for people working with and thinking about digital libraries and magazines with some form of quality control (peer reviews for instance).

I'll keep an eye on this magazine and I might change my opinion some time in the future, but, having said that, I am not going to spend too much of my time on reading it, and certainly no money.

Open Access Again

Mon, 23 Feb 2009 21:10:58 +0100


Evans and Reimer published a paper in Science on the effects of Open Access entitled Open Access and Participation in Science. They analyse citations to articles in journals indexed by Thomson Scientific’s Science, Social Science, and Humanities Citation Indexes (CI). These include articles and associated citations from the 8253 most highly cited journals (going back to 1945).

The most important feature of this article is, the authors claim, the use of more extensive citation data than previous research did. I think they are right.

Before I present a few of their main results, I would like to make some comments that, IMHO, limit the generality of their results.

First:

It is citations they analyse and so the phrase "participation in science" is operationalized as "citing a paper". Nothing wrong with that of course, but we should note that citations are not a perfect measure of the use of articles. Citations in journals, papers, proceedings, books and so on that are not indexed in CI (and many many scholarly and scientific outlets are not indexed in CI), are simply not counted. The real impact of an article is broader than citations in a limited number of journals, however important those journals are, but no dean will compliment me for this observation. I am still looking for a study that operationalizes impact in a more satisfactory manner.

Second:

The analysis is based on 8253 journals. According to UlrichsWeb there are about 29000 online journals out there, 17000 of which are peer reviewed. I can't figure out whether the journals analysed by Evans and Reimer are all peer reviewed, but I tend to assume they all are. In any case, the percentage of journals analysed in this paper is between 28 and 48 percent of all relevant online journals. The journals analysed are the most cited ones, I will grant that, but still, we do have to recognize the fact that many citations are simply not counted.
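The percentages above follow from simple division. A minimal sketch of the arithmetic, using only the journal counts quoted in this post (the function name `coverage` is just an illustrative choice):

```python
# Journal counts as quoted in the post (UlrichsWeb figures as reported there).
ANALYSED = 8253
ONLINE_JOURNALS = 29000
PEER_REVIEWED = 17000


def coverage(analysed: int, total: int) -> float:
    """Return the share of `total` covered by `analysed`, as a percentage."""
    return 100.0 * analysed / total


low = coverage(ANALYSED, ONLINE_JOURNALS)   # against all online journals
high = coverage(ANALYSED, PEER_REVIEWED)    # against peer-reviewed ones only
print(f"{low:.1f}% of all online journals, {high:.1f}% of peer-reviewed ones")
```

This gives about 28.5 and 48.5 percent, depending on which total is taken as the set of relevant journals.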

Nevertheless, accepting the limitation of the analysis to papers in the CI index that are cited in papers in the CI index, Evans and Reimer make a few important observations.

First, there is no OA effect (no rise in citation frequencies because of being available in Open Access) for three disciplines: chemistry, physical sciences and social sciences. This is hardly surprising because, as Evans and Reimer point out too, Open Access is more or less the norm here. Indeed, of most papers a version can be obtained for free. It would be reasonable to exclude these zero effects from a real OA effect in disciplines where it still matters (that is, where it is not the norm), thereby, probably, increasing the overall 8 percent OA effect reported. (It is a necessary consequence of the fight for Open Access that the OA effect will disappear, obviously).

Second, high OA effects are found for multidisciplinary journals (around 20 %). This accords with many previous findings. For a librarian, who wants to tailor his collection to the local interests, this is no surprise at all. With the growing importance of multidisciplinary research, this part of the overall OA effect should be taken as a very important argument for Open Access. Alas, an analysis of cross-discipline citations has not been performed by Evans and Reimer, and this I do consider an omission. My conjecture here is that the OA effect would be high for cross-discipline citations, well, let's be wild, and claim that it will be even higher than it is for multidisciplinary journals.

Third, absolutely marvellous is the attention Evans and Reimer give to the OA effect for authors in developing countries. For poor countries the OA effect may, in general, reach heights of 30 percent. They present a world map of OA effects and the results are almost painful: only Europe, North America and Australia show a relatively (sic!) low OA effect.

So, despite a few omissions and a few criticisms, I think this paper demonstrates the importance of OA quite forcefully. Multidisciplinary research benefits greatly from Open Access. For developing countries Open Access is nothing but a blessing. And for the rest it is very nice at the very least.

Given all this, it is saddening to see that a national Dutch newspaper, De Volkskrant, presents these results under the header: "Free journals only lead to a few extra citations" (from: Volkskrant science supplement, Sat. 21 Feb 2009: Gratis tijdschrift geeft weinig meer citaten).

Yuck.

One reason why libraries will become obsolete

Thu, 12 Feb 2009 14:45:25 +0100

There is a nice post by Will Sherman. 33 reasons why libraries and librarians will be important, even in the digital age. A large number of these reasons concern the relevance and quality of the materials hosted by a library and the expertise librarians have, for instance for describing books (etc.) adequately.

That is all fine. Will Sherman however seems to assume that all these different functions (re the management of information) are performed by libraries and librarians and that they will keep performing them. I tend to disagree.

No one doubts the importance of information management, but the tools and the workflows and the organisations that (should) exist to manage information require competences that are not easily found within library walls. Text mining, information retrieval, the management of peer review, publishing, the semantic web, ontologies, the architecture of the web, theories of classification, automatic term extraction, web archiving, the identification of authors, documents and institutes, protocols for information exchange, digital rights management, etc., etc., will be of growing importance and the skills needed for these tasks are at the moment not sufficiently available within libraries.

So the conclusion is:

Libraries "as we knew them" will become obsolete if they want to stay libraries. The same goes for librarians.

QED

Collections in the Digital Age?

Tue, 03 Feb 2009 07:38:26 +0100

Mary Frances Casserly is one of the authors who has thought about the meaning of a collection in the digital age (Casserly, M.F. (2002). Developing a Concept of Collection for the Digital Age. Libraries and the Academy 2.4 (2002) 577-587). The article is relatively old, but that's ok.

One of the problems one faces is finding a metaphor to describe a collection that for a large part consists of resources available on the internet. She mentions a few (citing others), like interface, logical gateway, information commons, gateway library or even information population.

The main idea, rather obviously, seems to be that there is a huge collection of information on the internet but that the collection (the one deemed relevant for... well whatever) is a subset that needs to be picked from the total set of available online resources.

I find it quite remarkable that the new collection is seen as the result of a process of picking elements, a process similar to finding shells on a beach. The delivery of new resources is, as a process, set apart from setting up a collection. It is the sea that brings us new shells, and the sea is a mystery.

What if we expand the notion of a collection in such a way that the sea becomes part of it? The main issue with any sensible collection is quality control. We don't want ugly things in our collections. But if documents, and this surely is the case in the digital age, become fluid, for instance when there are many versions of one document and when documents show up as movies, datasets and the like, and when it becomes hard to judge such a huge variety of documents with respect to their quality, it might be a good idea to refocus quality control: away from the documents, towards the people that add documents. Qualified people can add documents.

Then a collection is not a simple store of documents anymore, but a rather complex system of interrelated documents, controlled by a selected group of people.

Librarians "just" need to make the system searchable.

Well, I don't know, really...

Rankings and Repositories

Tue, 27 Jan 2009 21:25:46 +0100

Our dissertations repository (here or here) made a huge jump on the webometrics rankings. Our position on the list was near 300, yesterday. Today we are on position 14 or 23, depending on where you look (here and here).

All in all something to rejoice in.

Or not? Well yes, of course, but there are some issues too.

The reason we made the jump is that one of my colleagues, Wim Braakman, contacted webometrics to ask for details about their harvesting procedures. I am not going into the details here; let us just say that the webometrics procedures are a little strange, and that they cannot handle all regular harvest addresses. Redirects for instance are a problem. Anyway, Wim Braakman read the specifications, contacted them quite often (he was persistent), and in the end webometrics was provided with a URL they could handle, but which does not capture all our content.

The current ranking is only based on our dissertations repository. Without considerable re-engineering of the domain names in our repositories and aggregators all the other documents we have collected are not counted by webometrics. Indeed, not even half of our total number of documents are counted, so our real position should be considerably higher.

Yes, we are glad we have moved up in this ranking. No, we are not happy with the rules and regulations that webometrics uses. If we only had one repository, as most institutions seem to have, it would be no problem. But we work with a large number of specialized repositories, adapted to the needs of the users, with the sad consequence that we are not harvestable fully, at least not by webometrics. No, not even our aggregators will do the job. We need to build a special one for webometrics and that we flatly refuse.

In short: webometrics should re-engineer their harvesting procedures because this is not entirely fair, not even to us.

A Paradox?

Mon, 26 Jan 2009 22:42:15 +0100

Is this a paradox, or something I just don't understand, or something that simply isn't true?

The many digital libraries that have been developed by librarians can be characterized as closed systems. Making metadata and documents re-usable by third parties is often not even considered, and when it is considered, it is limited to a set of partner libraries, tightly controlled by contracts and lawyers. General re-use of software and data does not seem to be a primary concern. Yet, tons of standards have been developed to describe documents. In other words: the organization of knowledge (contained in documents) is heavily standardized (gazetteers, controlled vocabularies, metadata formats like MARC, MODS, perhaps even FRBR). There is some re-use of data, of course. Union catalogues are an example, but despite the immense efforts spent on standardization, the software built by institutional digital libraries seems to ignore re-use.

Yes, I am aware of OAI-PMH, of SRU/SRW, but these are relatively recent inventions, and I simply cannot find that much software developed within institutional libraries that freely offers such interfaces, nor data that can be transmitted through such interfaces.

Software developed in the context of non-institutional digital libraries (librarything comes to mind, but also bookmarking sites like del.icio.us and perhaps OAIster) does offer such interfaces, but often does not use the best of the standardized knowledge organisation schemes that library science has to offer.

I find this puzzling. Is it simply because institutional digital libraries are focussed on the members of their own libraries and ignore the rest of the world? Is it just a matter of ownership and money?

5S

Wed, 14 Jan 2009 16:13:12 +0100

A common complaint against people like me is that we just create our own localized digital libraries. A digital library, at least for those working within the confines (I did not say coffins) of a library, is an addendum to a normal library.

Rarely is software, however small, that was built by one group re-used by others. If, as the past few years have shown, it is almost impossible to cooperate on developing any piece of digital library software, we at least would need a framework that can guide software development to make it possible that the software is re-used. Very few attempts have been made to define such frameworks. There are a few however. One seems to be the DELOS framework, which I haven't studied yet, another is the 5S model, one that I am studying now.

The 5S model decomposes the problem of making a digital library into 5 components: Streams, Structures, Spaces, Scenarios, and Societies, all words starting with an S, in case you failed to notice.

Streams are basically information resources on the Internet. Text files, streaming video, you name it, they are all streams. Structures are structures within the streams. Perhaps the prime example is an XML document, where the XML explicates the structure of the stream. Other structuring principles are possible of course. Spaces are the operations one can perform on the (structured) streams. Operations can vary from indexing to defining an ontology. Scenarios are where the user enters the scene. A scenario is a set of operations (state-transitions) that a user performs, or can perform, while using a digital library. Societies, finally, are groups of users with differing information needs. Societies and scenarios can be seen as an explication of user centered design (I surmise).

What is really good about 5S is that it does not stop at inherently vague descriptions such as given in the previous paragraph. Streams, structures, spaces, scenarios and societies are defined in terms of set theory and relational algebra. All definitions are formal. The advantage of this is that such definitions can guide implementation. For example, descriptive metadata is defined in graph theoretical terms, clearly suggesting ways to formalize metadata. If these basic structures are defined formally, exchange of both data and implementations should become easier. They could even hint, well more than hint, at how to define protocols for information exchange.
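To give a feel for what "metadata in graph theoretical terms" could look like in code, here is a loose sketch. It is not taken from the 5S papers: the `Edge` type, the `values` helper and all identifiers and labels are my own assumptions, chosen only to show a labeled directed graph that describes a document.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One labeled edge of a metadata graph: source --label--> target."""
    source: str
    label: str
    target: str


def values(graph: set[Edge], source: str, label: str) -> set[str]:
    """All targets reachable from `source` over edges carrying `label`."""
    return {e.target for e in graph if e.source == source and e.label == label}


# Hypothetical record; identifiers and labels are made up for illustration.
graph = {
    Edge("doc:1", "title", "An example document"),
    Edge("doc:1", "creator", "person:a"),
    Edge("doc:1", "creator", "person:b"),
    Edge("person:a", "name", "A. Author"),
}

creators = values(graph, "doc:1", "creator")
print(sorted(creators))
```

Because the graph is just a set of edges, two systems that agree on the labels can exchange such records without knowing anything about each other's implementation, which is the kind of benefit the formal definitions seem to promise.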

I wonder if there are environments in which 5S is already used. I am still exploring the model. It seems hard to get information on the practicalities of the 5S model. Nevertheless, theoretically it seems not only sound, but very attractive too. I like pretty... uh... work.

PurpleSearch launched

Mon, 12 Jan 2009 21:47:07 +0100

Today we officially released the first beta version of PurpleSearch, software for federated search developed by people from my department, the digital library department of the University of Groningen. André Keyzer in particular is responsible for the design, Bart Alewijns has been the main programmer of the system. It is a beta version. Also, a number of features that were present in its predecessor livetrix were dropped, or are given a less prominent place, because the user interface had to be as simple as possible. PurpleSearch also offers a few webservices, in particular a recommendation function that returns an "educated" guess consisting of a number of databases that might be relevant for a query. PurpleSearch is a learning system in that it stores all queries ever entered by users and determines which databases return a significant number of hits given the query.

The following text, taken from the PurpleSearch helpfile, describes the system.

PurpleSearch enables simultaneous search in the most important scientific and scholarly databases. It is an interface that eases and enriches federated search.

It eases this method of searching by not requiring manual selection of the databases to search in. PurpleSearch learns, over time, what each database contains and will give good results for any given search query. PurpleSearch combines smart search techniques with local indexing, and uses that index for each new search. As such, presented results are those from a search in the best scoring databases for a query. It is also possible to do targeted searches within different databases.

Among other things it chooses databases that are likely to give results for any given query. As this does not always pick the most important databases for the intended subject area, you may use the subject guide to start searching in the most important databases, or choose the databases you want to search in manually.

It allows catalogue searching for books and other physical resources, and will lead researchers to electronic full-text articles when we have a relevant subscription.
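The idea of learning which databases to query can be sketched in a few lines. This is not the PurpleSearch implementation, which is not described here in any detail. It is a toy version under my own assumptions: hits are remembered per query term, a database counts as "significant" above a fixed threshold, and the class, method and database names are all hypothetical.

```python
from collections import defaultdict


class DatabaseRecommender:
    """Toy recommender: remembers hit counts per (term, database)."""

    def __init__(self, min_hits: int = 10):
        self.min_hits = min_hits  # hypothetical threshold for "significant"
        self._hits = defaultdict(lambda: defaultdict(int))

    def record(self, query: str, database: str, hits: int) -> None:
        """Store the number of hits `database` returned for `query`."""
        for term in query.lower().split():
            self._hits[term][database] += hits

    def recommend(self, query: str, limit: int = 3) -> list[str]:
        """Return up to `limit` databases, ranked by remembered hits."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for database, hits in self._hits.get(term, {}).items():
                scores[database] += hits
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [db for db, score in ranked if score >= self.min_hits][:limit]


rec = DatabaseRecommender()
rec.record("open access citations", "database_a", 120)
rec.record("open access", "database_b", 40)
rec.record("open access", "database_c", 2)
suggested = rec.recommend("open access policy")
print(suggested)
```

In this run `database_c` is dropped because its remembered hits stay below the threshold, and a query with only unseen terms yields an empty list, in which case a real system would presumably fall back on the subject guide or a manual choice.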

A number of festivities are organized to promote the use of PurpleSearch within the University of Groningen.

It is a nice day. :)

Documents of the Future

Wed, 07 Jan 2009 15:17:08 +0100

When we, librarians, deal with documents we add metadata. The metadata are used in a separately developed interface to give people a search and/or browsing interface to find and locate documents.

In this mode of thinking the documents on the Internet are passive objects. The metadata associated with those documents are passive too. Software is written to use the data and present it. All the action is in the extra software.

Another approach that has been suggested is to turn documents into code. Or, perhaps better, to encapsulate documents in code that, when it is sent a proper request, will return a view on the document. Documents become active. So instead of just sending a URL to a server that will return a PDF document, a number of different requests can be sent, and different results will be returned.

Examples of requests that could be sent are (this is just to give an idea):

getDefaultPresentation: returns the document in a default format (which can be PDF, PPT, Word, TeX, what have you).

getPresentation PDF: returns a PDF version of the document (if available).

getIndex: returns an Index of words used (many parameters are possible, for instance a stemming method can be selected).

getAuthor: returns the list of authors.

getKeywords: returns a list of keywords. The request can be parametrized, for instance when a controlled vocabulary is used.

getReferences: returns a list of citations used in the document.

getType: is it a book? an article? etc.

getRights: may return a creative commons license indicating if, and how, the document can be re-used.

getDC: returns metadata in DC format.

getMarc: returns metadata in Marc format.
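To make the idea a bit more tangible, here is a minimal sketch of a document wrapped in code that answers a few of the requests listed above. Everything about it is an assumption for illustration: the class name, the `request` method, and the choice to treat the first registered format as the default. It covers only a subset of the requests and says nothing about how they would travel over the network.

```python
class ActiveDocument:
    """A document wrapped in code that answers named requests."""

    def __init__(self, metadata: dict, presentations: dict):
        self._metadata = metadata            # e.g. authors, keywords, type
        self._presentations = presentations  # upper-case format name -> bytes
        self._handlers = {
            "getDefaultPresentation": self._default_presentation,
            "getPresentation": self._presentation,
            "getAuthor": lambda: list(self._metadata.get("authors", [])),
            "getKeywords": lambda: list(self._metadata.get("keywords", [])),
            "getType": lambda: self._metadata.get("type"),
            "getRights": lambda: self._metadata.get("rights"),
        }

    def _default_presentation(self):
        # Assumption: the first registered format is the default one.
        return next(iter(self._presentations.values()), None)

    def _presentation(self, fmt: str):
        # Returns None when the requested format is not available.
        return self._presentations.get(fmt.upper())

    def request(self, name: str, *args):
        """Dispatch a request by name; unknown requests raise ValueError."""
        handler = self._handlers.get(name)
        if handler is None:
            raise ValueError(f"unsupported request: {name}")
        return handler(*args)


doc = ActiveDocument(
    {"authors": ["A. Author"], "type": "article"},
    {"PDF": b"placeholder pdf bytes"},
)
print(doc.request("getAuthor"))
print(doc.request("getPresentation", "pdf"))
```

The dispatch table keeps the set of supported requests explicit, so a client could in principle ask a document which requests it understands before sending one.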

Problems may arise when information is added to the document after its first publication. Social tags for instance should also be associated with the document and should be retrievable. The code therefore should also accept requests that add data to it. The question of how this has to be implemented is not a trivial one. It might make sense to encapsulate the document in different code sets, not all of which have to be maintained (developed) by the original publisher. So company X will add, as an added value, social tagging information to a document published by company Y. Encapsulations can be layered.
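Layered encapsulation could look roughly like this. Again a sketch under assumptions: the two classes, the `request` method and the `addTag` and `getTags` requests are hypothetical names of mine, and the inner layer supports only `getAuthor` to keep the example short.

```python
class PublishedDocument:
    """Innermost layer: what the original publisher (company Y) exposes."""

    def __init__(self, authors):
        self._authors = list(authors)

    def request(self, name, *args):
        if name == "getAuthor":
            return list(self._authors)
        raise ValueError(f"unsupported request: {name}")


class TaggingLayer:
    """Outer layer run by a third party (company X); adds social tags."""

    def __init__(self, inner):
        self._inner = inner
        self._tags = set()

    def request(self, name, *args):
        if name == "addTag":
            self._tags.add(args[0])
            return None
        if name == "getTags":
            return sorted(self._tags)
        # Anything this layer does not understand goes to the layer below.
        return self._inner.request(name, *args)


tagged = TaggingLayer(PublishedDocument(["A. Author"]))
tagged.request("addTag", "open access")
tagged.request("addTag", "citations")
print(tagged.request("getTags"))
print(tagged.request("getAuthor"))
```

The outer layer never touches the publisher's code. It only forwards what it cannot answer itself, so further layers can be stacked on top in the same way.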

There are many problems to be solved when the latter approach is followed. The main ones are the definition of the interface to the document (in form of a protocol, or document API), and the findability of the different encapsulations.

But would it not be great if all documents became active documents?

2009: The Year of...?

Thu, 01 Jan 2009 11:38:50 +0100

2008 was probably the year of web 2.0/library 2.0. It has brought us lots of new tools and toys to play with, and lots of new evangelists (to toy with). It has also brought us a vision of the information ecology of the future. In this vision, open computation, open access and collective intelligence play an important role.

And 2009?

I hope I am wrong, but.....

..... 2009 will be the year when all the major university libraries will retreat from web 2.0 and will not give their employees the time to toy with new tools. The smell of "being fed up with 2.0 and their evangelists" permeates our buildings. There is a financial crisis, there will be budget cuts. So: there will be a massive generalized skepticism about all new information technologies, including open computation, including collective intelligence. If new technologies are introduced, they will be based on notions librarians are comfortable with, like the extremely silly developments around FRBR. A retreat therefore from the unknown to the safe and comfortable.

2009 will be the year for conservatives, for those who see the library as a service unit that needs a tight management. Technology should be outsourced, systems should be maintained, books should be bought, journals licensed, employees will be clocked; there will be less and less money. Lawyers will help the conservatives and invent new applications of copyright laws to hamper innovation.

2009 therefore will be a year in which libraries lose a decisive battle against companies that do have a vision on search and information-behavior and do have the talent and the time to develop new services, that do know about (text-driven) computing, about collective intelligence, about statistics. These companies will not include most of the current publishers and it will most likely not be Google either. The new companies will come, financial crisis or not - and they'll earn good money, for instance doing business with universities, money which could actually be earnt by the libraries.

It will be a tough year for us innovative workers within the library, and I am a bit tired already.

We really should unite, and not let library walls determine what is united. And unite does not mean: set up a new platform to "chat a little".

Projects

Thu, 30 Oct 2008 08:22:11 +0100

I have spent the last few days of my fine life on the horrible, horrible task of (co-)writing a project proposal to get some work financed which we'd like to do with OAI-ORE. I will talk about the contents of that project later. For now I'm just wondering whether there isn't a better way to get innovative work in the digital library funded. The proposals one has to write these days - yes, it has gone from bad to plain awful, the whole thing just shows organized distrust. Each activity has to be described meticulously, every hour spent has to be accounted for, as well as each person involved. I hope no one reads this before the proposal is judged, but this juggling with hours and names is somehow pure fiction. I know, people can't be trusted, and if public money is spent on work that my colleagues and I do, we definitely have to make explicit how it was spent, or will be spent. But is THIS really necessary?

I have no ready answer here, but please, please let some creative individual address this problem and save us from the mindless game of writing a project proposal, of filling in these dreadful forms. Why can't those who have the money simply watch what is going on, take note of measurable outputs a certain group has produced, both in terms of software developed or articles written, and just say: "hey, good work, here is some money, use it well"? This is asking too much, I know, but isn't there some middle ground between the bureaucracy of project-writing and the laissez faire attitude that I would love so much and that seems to be gone completely these days?

Big sigh... the work is not over yet, one more day to go...

The Boundary Problem

Sun, 26 Oct 2008 16:47:38 +0100

It took me some time to realize, but what OAI-ORE seems to be really about is the boundary problem: The fact that groups of objects cannot be properly identified in the basic web architecture.

OAI-ORE allows one to identify aggregates. On top of that it offers means to describe the relations between the aggregated objects. It allows one to define boundaries between groups of objects.

Of course, any web page can contain a number of links to other pages and documents, but those links are not typed, meaning that it is hard to distinguish, say, navigational links from links to objects. OAI-ORE may not only solve the grouping problem (enhanced publications) but may give web archiving a great boost too. Now relatively complex software like httrack or heritrix is used to heuristically define relevant groups, but with OAI-ORE's resource maps a good hint at what should be archived becomes possible.
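How a resource map could hint at what to archive can be sketched with plain triples. The two property URIs are terms from the OAI-ORE vocabulary. Everything else (the example.org addresses, the `navigatesTo` link type, the helper function) is made up for illustration, and a real resource map would be parsed from RDF/XML or Atom rather than typed in by hand.

```python
ORE = "http://www.openarchives.org/ore/terms/"
AGGREGATES = ORE + "aggregates"  # aggregation -> aggregated resource
DESCRIBES = ORE + "describes"    # resource map -> aggregation


def aggregated_resources(triples, resource_map_uri):
    """Return the URIs aggregated by the aggregation that the map describes."""
    aggregations = {o for s, p, o in triples
                    if s == resource_map_uri and p == DESCRIBES}
    return sorted(o for s, p, o in triples
                  if s in aggregations and p == AGGREGATES)


# Hypothetical resource map for an enhanced publication.
triples = [
    ("http://example.org/rem/1", DESCRIBES, "http://example.org/agg/1"),
    ("http://example.org/agg/1", AGGREGATES, "http://example.org/article.pdf"),
    ("http://example.org/agg/1", AGGREGATES, "http://example.org/dataset.csv"),
    # A navigational link: present on the page, but not part of the aggregate.
    ("http://example.org/agg/1", "http://example.org/terms/navigatesTo",
     "http://example.org/home"),
]

to_archive = aggregated_resources(triples, "http://example.org/rem/1")
print(to_archive)
```

Because the links are typed, the archiver can pick up the article and the dataset and ignore the navigational link without any heuristics.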

OAI-ORE also highlights an often underestimated problem. The transition from the normal library to a digital one needs to be based on descriptions of individual items (and not, as is common in the library field, on the expression or manifestation level) and these items need to be grouped.

Whether OAI-ORE solves all problems remains to be seen. One of the things that may need reworking is the flexibility of OAI-ORE resource maps. As far as I can see now, all the possible relations between documents in an aggregate need to be predefined. But I am not sure if this is flexible enough.