<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <title>HubLog</title>
  <id>tag:hublog.hubmed.org,2010://2</id>
  <updated>2012-02-10T09:22:44+00:00</updated>
  <author>
    <name>Alf Eaton</name>
  </author>
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/hublog" /><feedburner:info uri="hublog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:browserFriendly>This is an XML content feed. It is intended to be viewed in a newsreader or syndicated to another site, subject to copyright and fair use.</feedburner:browserFriendly><entry>
    <id>tag:hublog.hubmed.org,2010://2.1949</id>
    <title>ISSN(L)s And Serial Title Abbreviations</title>
    <updated>2012-02-09T19:16:28+00:00</updated>
    <published>2012-02-09T18:56:06+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001949.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>I'd like to build a non-copyrighted list of journals/serials (anything with an ISSN, basically) - including their ISSNs, full titles and abbreviated titles.</p>

<p>As a start, every item needs an identifier. The <a href="http://issn.org/">ISSN International Centre</a> assigns ISSNs to publications that request them, but <a href="http://www.issn.org/2-22659-ISSN-Data-file.php">the data file listing all the ISSN:serial title mappings</a> is copyrighted, expensive and not redistributable (it also uses sentence case for journal titles, which means some information gets lost).</p>

<p>As each publication can have multiple ISSNs - one for each medium in which it is distributed (for example, the online version of a publication can have a different ISSN to the print version) - a pan-ISSN identifier is required to link all the ISSNs together, so the <a href="http://www.issn.org/2-22637-What-is-an-ISSN-L.php">ISSN-L</a> was introduced.</p>

<p><a href="http://www.issn.org/2-24117-Download-the-ISSN-ISSN-L-table.php">The table that maps ISSNs to ISSNLs</a> can be downloaded from issn.org after filling in a form requesting access. In the latest table, there are 1,614,355 unique ISSNs, mapped to 1,552,542 unique ISSNLs.</p>
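
<p>As a rough sketch of how such a mapping table could be indexed once downloaded: the snippet below assumes (this is a guess, not the documented file layout) one tab-separated pair per line, ISSN first and ISSN-L second. <code>indexIssnTable</code> is a hypothetical helper name.</p>

<pre><code>// Sketch only: assumes one "ISSN[TAB]ISSN-L" pair per line (an assumed layout).
function indexIssnTable(text) {
  var byIssnL = {};   // ISSN-L -> list of the ISSNs that share it
  var issnCount = 0;  // number of rows read (each ISSN is assumed to appear once)
  text.split(/\r?\n/).forEach(function (line) {
    var parts = line.trim().split("\t");
    if (parts.length !== 2) return; // skip blank or malformed lines
    var issn = parts[0], issnL = parts[1];
    if (!byIssnL[issnL]) byIssnL[issnL] = [];
    byIssnL[issnL].push(issn);
    issnCount++;
  });
  return {
    byIssnL: byIssnL,
    issnCount: issnCount,
    issnLCount: Object.keys(byIssnL).length
  };
}
</code></pre>

<p>Comparing <code>issnCount</code> with <code>issnLCount</code> gives the same kind of figures as those quoted above, and <code>byIssnL</code> lists the print and online ISSNs that belong to each serial.</p>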

<p>This ISSNL:ISSN mapping table, like all the information published by the ISSN International Centre, is protected by sui generis database rights, which last, <a href="http://en.wikipedia.org/wiki/Database_Directive">according to Wikipedia</a>, for 15 years from the last substantial update.</p>

<blockquote cite="http://www.issn.org/2-22687-Legal-notices.php">The databases appearing on or accessible from the website "the ISSN International Centre" are the exclusive property of CIEPS and are protected under the provisions of the law of 1st July 1998 implementing in the Intellectual Property Code the European Directive of 11 March 1996 on the legal protection of databases. Any performance, whether total or partial, of this site by any company whatsoever, without the express authorization of the CIEPS is strictly forbidden and shall constitute an infringement sanctioned such as Intellectual Property Code.</blockquote>
  
  <cite><a href="http://www.issn.org/2-22687-Legal-notices.php">ISSN International Centre, Legal Notice, Section 3</a></cite>

<p>(note, incidentally, that I'm already in conflict with <a href="http://www.issn.org/2-22687-Legal-notices.php">section 3 of the legal notice</a>: "Users and visitors cannot place a hyperlink to this website without the CIEPS' express and prior authorization."&hellip;)</p>

<p>The full database itself may be copyrighted, but (I believe) the individual facts within it shouldn't be. In which case, the best non-copyrighted source for the journal title, ISSN and ISSN-L of each serial is probably the publishers themselves, and as many publishers make their metadata available through an OAI interface it may be possible to extract a fair amount of serials information from those sources. <a href="http://en.scientificcommons.org/repository/overview">ScientificCommons</a>, for example, harvests articles from OAI repositories, so may be able to aggregate useful title and ISSN information.</p>
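
<p>Harvesting from an OAI interface starts with a plain HTTP request. A minimal sketch of building an OAI-PMH <code>ListRecords</code> URL follows; <code>oaiListRecordsUrl</code> is a hypothetical helper, the base URL is whatever endpoint a publisher advertises, and where (or whether) an ISSN appears in the returned records varies between repositories.</p>

<pre><code>// Sketch: build an OAI-PMH "ListRecords" request URL for a repository.
function oaiListRecordsUrl(baseUrl, resumptionToken) {
  var params = new URLSearchParams();
  params.set("verb", "ListRecords");
  if (resumptionToken) {
    // when paging, the resumption token replaces the other arguments
    params.set("resumptionToken", resumptionToken);
  } else {
    params.set("metadataPrefix", "oai_dc"); // unqualified Dublin Core
  }
  return baseUrl + "?" + params.toString();
}
</code></pre>

<p>The first request is made without a token; each response that is incomplete carries a resumption token to pass to the next call.</p>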

<h3>Existing, non-free sources</h3>

<p>A commercial source of the information I'm looking to create is the <a href="http://journalseek.net/">JournalSeek</a> database (from Genamics, <a href="http://nj.oclc.org/journalseek/">licensed through OCLC</a>) which <a href="http://journalseek.net/publishers.htm">includes around 100,000 journals</a>.</p>

  <p><a href="http://www.sherpa.ac.uk/romeo/journalbrowse.php?fIDnum=|&mode=simple&la=en">SHERPA/RoMEO aggregates journal lists</a> from <a href="http://zetoc.mimas.ac.uk/jnllist.html">Zetoc</a>, <a href="http://www.doaj.org/doaj?func=loadTempl&templ=faq#metadata">DOAJ</a> (7500 journals) and <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=journals&Cmd=DetailsSearch&Term=currentlyindexed%5BAll%5D">Entrez</a> (<a href="http://www.ncbi.nlm.nih.gov/entrez/citmatch_help.html#JournalLists">40,000 journals</a>), but <a href="http://www.sherpa.ac.uk/romeoreuse.html">the data is only available for non-commercial use</a>.</p>

<h3>Other sources of Journal/ISSN information</h3>

<ul>
  <li><a href="http://www.serialssolutions.com/management/ulrichs/">300,000 serials in Ulrich's</a> (JSON interface somewhere?)</li>
    <li><a href="http://www.serialssolutions.com/resources/detail/summon-serials-titles">SerialsSolutions Summon</a> (title list as PDF)</li>
  <li><a href="http://www.ebscohost.com/titleLists/a9h-journals.htm">13,000 journals in EBSCO "Academic Search Complete"</a></li>
  <li><a href="http://www.oclc.org/worldcatlocal/overview/content/journals.htm">91,000 journals in WorldCat Local</a></li>
  <li><a href="http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER">17,000 journals in Thomson Reuters Master Journal List</a></li>
  <li><a href="http://www.crossref.org/titleList/">27,000 journals in CrossRef</a></li>
  <li><a href="http://www.portico.org/digital-preservation/who-participates-in-portico/participating-titles">12,500 journals archived by Portico</a></li>
  <li><a href="http://lockss.org/lockss/Publishers_and_Titles">9,000 journals participating in LOCKSS</a></li>
    <li><a href="http://www.sciencedirect.com/science/journals">3,330 journals in ScienceDirect</a></li>
  <li><a href="http://academic.research.microsoft.com/RankList?entitytype=4&topDomainID=6">Journal lists by subject in Bing Academic Search</a></li>
  <li><a href="http://cassi.cas.org/search.jsp">Search journals in the CAS Source Index</a> (copyrighted by the ACS)</li>
</ul>
      
 <p>It would be nice if Freebase could serve as a central repository for this information, but there are <a href="http://www.freebase.com/view/book/journal">only 4,321 journals in Freebase</a> so far.</p>

<h3>Abbreviations</h3>
      
      <p>Once we have the list of ISSNs and journal titles, we also need the corresponding journal title abbreviations, for use when generating bibliographies. There are <a href="http://www.library.uq.edu.au/faqs/endnote/journal_terms.html">lists of journal title abbreviations available for import into EndNote</a>. Sadly, there are several different abbreviation styles (ISO, MEDLINE, BIOSIS, CASSI, etc).</p>

<ul>
  <li>The ISSN International Centre maintains the <a href="http://www.issn.org/2-22660-LTWA.php">list of Title Word Abbreviations</a> which corresponds to the ISO 4 standard (<a href="http://www.iso.org/iso/catalogue_detail?csnumber=3569">available from the ISO store for ~£50</a>), which describes the rules for abbreviating title words and titles of publications. The <a href="http://www.issn.org/2-22661-LTWA-online.php">list of Title Word Abbreviations is available online as HTML</a>.</li>
  <li><a href="http://www.nlm.nih.gov/pubs/factsheets/constructitle.html">The NLM uses the list of Title Word Abbreviations to abbreviate periodical titles.</a></li>
  <li><a href="http://www.ncbi.nlm.nih.gov/books/NBK7251/">The NCBI provides a less-comprehensive list of abbreviations for commonly-used English words in journal titles.</a></li>
  <li><a href="http://www.compholio.com/latex/jabbrv/">The jabbrv LaTeX package abbreviates journal titles using the list of Title Word Abbreviations.</a></li>
  <li>The <a href="http://jabbr.mannlib.cornell.edu/">JAbbr service</a> built a list of abbreviated serial titles from the Cornell library catalog of MARC records, and provides abbreviation &rarr; full title mapping as a JSON and HTML web service (source code provided). <a href="http://journal.code4lib.org/articles/1758">Article in Code4Lib Journal.</a></li>
  <li><a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=journals&term=%22Nat%20Methods%22[Title%20Abbreviation]">The NLM Catalog is searchable using a journal title abbreviation</a>, but uses sentence case for full titles.</li>
  <li><a href="http://www.ncbi.nlm.nih.gov/nlmcatalog?term=%221548-7091%22%5BISSNL%5D">The NLM Catalog is also searchable by various ISSNs.</a></li>
    <li><a href="http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html">Web of Science has a list of journal title abbreviations.</a></li>
  <li><a href="http://www.abbreviations.com/jas.asp">A large list of Journal Abbreviation Sources.</a></li>
</ul>
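
<p>To illustrate what abbreviating with a word list involves, here is a minimal sketch. The two tables are tiny samples made up for the example, not the real list of Title Word Abbreviations, and the function ignores most of the ISO 4 rules (word stems, single-word titles, punctuation).</p>

<pre><code>// Sketch of word-by-word title abbreviation using small sample tables.
var SAMPLE_ABBREVIATIONS = { journal: "J.", international: "Int." };
var SAMPLE_SKIPPED_WORDS = ["of", "the", "and", "for", "in"];

function abbreviateTitle(title) {
  return title.trim().split(/\s+/).filter(function (word) {
    // drop articles, prepositions and conjunctions
    return SAMPLE_SKIPPED_WORDS.indexOf(word.toLowerCase()) === -1;
  }).map(function (word) {
    // replace a word with its abbreviation if the table has one
    var key = word.toLowerCase();
    var known = Object.prototype.hasOwnProperty.call(SAMPLE_ABBREVIATIONS, key);
    return known ? SAMPLE_ABBREVIATIONS[key] : word;
  }).join(" ");
}
</code></pre>

<p>Because the different styles (ISO, MEDLINE, etc.) disagree on details such as trailing full stops, a real implementation would probably need one table and one set of rules per style.</p>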
      
      <h3>More links</h3>
      <p><a href="http://pinboard.in/u:hubpin/t:journals/">http://pinboard.in/u:hubpin/t:journals/</a></p>

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1948</id>
    <title>Extracting Text From A PDF Using Only Javascript</title>
    <updated>2011-11-18T11:38:16+00:00</updated>
    <published>2011-11-18T10:55:04+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001948.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>Using an HTML page like <a href="https://gist.github.com/1376120">this</a>, which embeds a PDF-to-text extraction service I built using <a href="https://github.com/mozilla/pdf.js">pdf.js</a>, you can extract the text from a PDF using only client-side Javascript:</p>

<pre><code>
&lt;!-- edit this; the PDF file must be on the same domain as this page -->
&lt;iframe id="input" src="your-file.pdf">&lt;/iframe>

&lt;!-- embed the pdftotext service as an iframe -->
&lt;iframe id="processor" src="http://hubgit.github.com/2011/11/pdftotext/">&lt;/iframe>

&lt;!-- a container for the output -->
&lt;div id="output">&lt;/div>

&lt;script>
var input = document.getElementById("input");
var processor = document.getElementById("processor");
var output = document.getElementById("output");

// listen for messages from the processor
window.addEventListener("message", function(event){
  if (event.source != processor.contentWindow) return;

  switch (event.data){
    // "ready" = the processor is ready, so fetch the PDF file
    case "ready":
      var xhr = new XMLHttpRequest;
      xhr.open('GET', input.getAttribute("src"), true);
      xhr.responseType = "arraybuffer";
      xhr.onload = function(event) {
        processor.contentWindow.postMessage(this.response, "*");
      };
      xhr.send();
    break;

    // anything else = the processor has returned the text of the PDF
    default:
      output.textContent = event.data.replace(/\s+/g, " ");
    break;
  }
}, true);
&lt;/script>
</code></pre>

<p><a href="http://hubgit.github.com/2011/11/pdftotext/example/">See an example running as a live demonstration.</a></p>

<p>It'll only work in recent browsers, as it requires <a href="http://updates.html5rocks.com/2011/09/Workers-ArrayBuffer">sending binary data between windows as an ArrayBuffer using window.postMessage</a>, and <a href="http://www.html5rocks.com/en/tutorials/workers/basics/">Web Workers</a> in pdf.js.</p>

<p>Basically, this fetches a PDF as an ArrayBuffer using XMLHttpRequest, then posts it to the embedded window, which uses <a href="https://github.com/mozilla/pdf.js">pdf.js</a> to render the PDF to Canvas (invisibly; you can see the rendered images if you poke around a bit with a web inspector tool). As it does so, <a href="https://github.com/mozilla/pdf.js/pull/738">an HTML layer is constructed</a>, containing a block to match each row of the PDF - this would normally be overlaid on top of the rendered images to allow text to be selected, a technique used by many services that allow PDF text selection and highlighting, including <a href="http://crocodoc.com/">Crocodoc</a> and Google Docs' PDF viewer. By taking the text content of those blocks, the service can return the contents of the PDF as a single block of text.</p>
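
<p>The last step can be sketched roughly as follows. The <code>blocks</code> array of <code>{ top, left, text }</code> objects is a hypothetical structure standing in for the elements of the HTML layer; it is not a pdf.js data structure, and this is not the code the service runs.</p>

<pre><code>// Sketch: join the text of positioned blocks into a single string.
function blocksToText(blocks) {
  return blocks.slice().sort(function (a, b) {
    // reading order: top to bottom, then left to right
    return (a.top - b.top) || (a.left - b.left);
  }).map(function (block) {
    return block.text;
  }).join(" ").replace(/\s+/g, " ").trim();
}
</code></pre>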
  
<p>I expect that pdf.js will acquire a native function for retrieving the text content directly, to make documents searchable. It would be nice, next, to try to recreate paragraphs by looking at the spacing between the blocks, and to use the formatting and other heuristics to extract metadata like title, authors, etc.</p>]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1947</id>
    <title>Open Graph wins the Semantic Web</title>
    <updated>2011-09-29T23:57:10+00:00</updated>
    <published>2011-09-29T23:39:21+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001947.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>It took me a year - and the configuration step below - to realise that <a href="http://ogp.me/">Open Graph</a> has found a solution that works for referencing things on the web:</p>

<div><img src="/files/misc/2011-09-29-og-configuration.png"></div>

<p>We now have a standard way of providing metadata about any object, based on two principles:</p>

<ol>
  <li>Every object is represented by at least one HTML page on the web.</li>
  <li>Properties of that object are represented as &lt;meta&gt; elements in the &lt;head&gt; section of that HTML page.</li>
</ol>

<p>From that, we can make statements about any object using URIs, and fetch metadata about that object using HTTP. The Semantic Web!</p>

<h2>Statements</h2>

<p>This is an RDF statement:</p>

<table style="margin-top:0">
  <tr><td>[THING]</td><td>[LINK]</td><td>[THING]</td></tr>
  <tr><td>&lt;http://music.com/band/nirvana&gt;</td><td>&lt;http://example.com/member&gt;</td><td>&lt;http://music.com/person/kurt-cobain&gt;.</td></tr>
</table>

<p>That&#39;s two things connected by a link, all represented by URIs.</p>

<h2><span>Creating a Graph</span></h2>

<p>Several statements of this kind can be combined to make a graph:</p>

<div>&lt;http://music.com/band/nirvana&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/kurt-cobain&gt;.</div>
<div>&lt;http://music.com/band/nirvana&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/dave-grohl&gt;.</div>
<div>&lt;http://music.com/band/nirvana&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/krist-novoselic&gt;.</div>
<div>&lt;http://music.com/band/nirvana&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/recorded&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/track/on-a-plain&gt;.</div>

<p><img  src="/files/misc/2011-09-29-og-graph.png"></p>

<p>Or, to write those statements in shorthand, without repeating the first part of each one:</p>

<div>&lt;http://music.com/band/nirvana&gt;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/kurt-cobain&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/dave-grohl&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/krist-novoselic&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/recorded&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/track/on-a-plain&gt;.</div>

<p>And using prefixes to avoid having to write out the full URI each time:</p>

<div>PREFIX eg: &lt;http://example.com/&gt;</div>
<div>&lt;http://music.com/band/nirvana&gt;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;eg:member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/kurt-cobain&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;eg:member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/dave-grohl&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;eg:member&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/person/krist-novoselic&gt;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;eg:recorded&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://music.com/track/on-a-plain&gt;.</div>
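
<p>The same statements can also be held in a simple data structure. This is only a sketch: <code>buildGraph</code> is a hypothetical helper that stores each statement as subject, then link, then a list of the things linked to.</p>

<pre><code>// Sketch: collect [subject, predicate, object] statements into a graph,
// stored as subject -> predicate -> list of objects.
function buildGraph(statements) {
  var graph = {};
  statements.forEach(function (statement) {
    var subject = statement[0], predicate = statement[1], object = statement[2];
    if (!graph[subject]) graph[subject] = {};
    if (!graph[subject][predicate]) graph[subject][predicate] = [];
    graph[subject][predicate].push(object);
  });
  return graph;
}

var nirvana = "http://music.com/band/nirvana";
var graph = buildGraph([
  [nirvana, "http://example.com/member", "http://music.com/person/kurt-cobain"],
  [nirvana, "http://example.com/member", "http://music.com/person/dave-grohl"],
  [nirvana, "http://example.com/member", "http://music.com/person/krist-novoselic"],
  [nirvana, "http://example.com/recorded", "http://music.com/track/on-a-plain"]
]);
</code></pre>

<p>Grouping by subject first mirrors the shorthand notation, where the first part of each statement is written only once.</p>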

<h2>Fetching information</h2>

<p>The URI &lt;http://music.com/band/nirvana&gt; represents the band Nirvana. We could equally have used &lt;http://en.wikipedia.org/wiki/Nirvana_(band)&gt; or &lt;http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh&gt;*. As these are HTTP URLs, a representation of this Thing can be <a href="http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh">fetched using HTTP</a> - in this case, your web browser probably receives an HTML representation of the band.</p>

<p>How does the server decide in which format to return that information? There&#39;s a negotiation between whoever requests the information and the server that provides the information. The request contains a list of formats that it would be able to handle, and the server returns the first of those that it&#39;s able to provide. In fact, the information about a Thing might be available as JSON, or XML, or any other format, but <strong>Open Graph requires that every Thing identified by a URL must have an HTML web page that represents it</strong>.</p>
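
<p>A simplified sketch of that negotiation, as described above, might look like this. <code>negotiate</code> is a hypothetical function; real servers also weigh the q-values and wildcards that can appear in the Accept header, which this version ignores.</p>

<pre><code>// Sketch: return the first requested format that the server can provide.
function negotiate(acceptHeader, available) {
  var requested = acceptHeader.split(",").map(function (item) {
    // keep only the media type, dropping parameters such as ";q=0.9"
    return item.split(";")[0].trim().toLowerCase();
  });
  var match = requested.filter(function (type) {
    return available.indexOf(type) !== -1;
  })[0];
  return match || null; // null: nothing acceptable (a server might answer 406)
}
</code></pre>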

<p>In this way, we can make statements about any Thing, and fetch information about that Thing by dereferencing its URL to see what information it provides.</p>

<p>How should the information about the Thing be presented in that HTML page**? As the page represents the Thing***, this information can be added to the &lt;head&gt; section of the page; it ends up looking like this:</p>

<div>&lt;meta property=&quot;eg:member&quot; content=&quot;http://music.com/person/kurt-cobain&quot;&gt;</div>
<div>&lt;meta property=&quot;eg:member&quot; content=&quot;http://music.com/person/dave-grohl&quot;&gt;</div>
<div>&lt;meta property=&quot;eg:member&quot; content=&quot;http://music.com/person/krist-novoselic&quot;&gt;</div>
<div>&lt;meta property=&quot;eg:recorded&quot; content=&quot;http://music.com/track/on-a-plain&quot;&gt;</div>

<p>Which is exactly the same information as in the shorthand RDF statements above. It&#39;s RDF in HTML!</p>
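
<p>Reading those properties back out is straightforward. In this sketch, <code>collectProperties</code> is a hypothetical helper that takes one <code>{ property, content }</code> pair per &lt;meta&gt; element and folds them into a single object.</p>

<pre><code>// Sketch: fold a list of { property, content } pairs into one object;
// a property that appears more than once becomes an array of values.
function collectProperties(metas) {
  var properties = Object.create(null);
  metas.forEach(function (meta) {
    var existing = properties[meta.property];
    if (existing === undefined) {
      properties[meta.property] = meta.content;
    } else if (Array.isArray(existing)) {
      existing.push(meta.content);
    } else {
      properties[meta.property] = [existing, meta.content];
    }
  });
  return properties;
}

// In a browser, the input could be gathered like this:
//   var metas = Array.prototype.map.call(
//     document.querySelectorAll("head meta[property]"),
//     function (el) {
//       return { property: el.getAttribute("property"), content: el.getAttribute("content") };
//     });
</code></pre>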

<p>If someone says they like the album &quot;Nevermind&quot;, a statement is created:<br>
&lt;http://facebook.com/eaton.alf&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://example.com/emotions/likes&gt;&nbsp;&nbsp;&nbsp;&nbsp;&lt;http://open.spotify.com/album/6okv1avxEgYSdc2JYy6ZEi&gt;</p>

<p>When we fetch the HTML document from the URL referenced (&lt;http://open.spotify.com/album/6okv1avxEgYSdc2JYy6ZEi&gt;), it contains (amongst other things) this information:</p>

<div>&lt;http://open.spotify.com/album/6okv1avxEgYSdc2JYy6ZEi&gt;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;og:type&gt; &quot;music.album&quot;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;og:title&gt; &quot;Nevermind&quot;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;music:release_date&gt; &quot;1991-01-01&quot;;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;music:musician&gt; &lt;http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh&gt;.</div>

<p>And when we fetch the HTML document from the &quot;musician&quot; URL &lt;http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh&gt;, it contains (amongst other things) this information:</p>

<div>&lt;http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh&gt;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;og:title&gt; &quot;Nirvana&quot;.</div>

<p>When all that information is combined, we know that this person, who clicked the &quot;Like&quot; button while listening to an album in Spotify, liked the album &quot;Nevermind&quot; by the musician &quot;Nirvana&quot; - which is what gets displayed in their Facebook timeline.</p>

<h2>Referring to URLs</h2>

<p>This is all relevant to my <a href="http://hublog.hubmed.org/archives/001946.html">recent post about citing with URIs</a>. In that demonstration, the script dereferenced the URI to get information about the thing, but specifically asked for JSON. In the end, though, the JSON is basically just a list of properties about the thing being referenced, and there&#39;s no reason why that information can&#39;t be represented in &lt;meta&gt; elements in the &lt;head&gt; of an HTML page, which is exactly what most publishers do in order to <a href="http://scholar.google.com/intl/en/scholar/inclusion.html">get their documents indexed by Google Scholar</a>. They use several prefixes (&quot;dc.&quot;, &quot;prism.&quot;, &quot;citation_&quot;); they often use meta[name][content] instead of meta[property][content], but it&#39;s all basically the same thing. I&#39;ve now updated the script to parse &lt;meta&gt; elements from HTML, alongside JSON responses.</p>

<p>In summary: if someone wants to refer to a Thing, they should be able to use an HTTP URL. If someone wants to get information about that Thing, they should be able to dereference that URL, get an HTML document, look in the &lt;meta&gt; elements in the &lt;head&gt; section, and retrieve all the information about that thing (including further URLs to find out more information about any of those properties).</p>

<hr>

<p>* Asserting equivalence between URIs allows links from one URI to also apply to the other. For example:<br>
&lt;http://example.com/music/nirvana&gt; &lt;http://www.w3.org/2002/07/owl#sameAs&gt; &lt;http://open.spotify.com/artist/6olE6TJLqED3rqDCT0FyPh&gt;</p>

<p>** We don&#39;t have to worry about representing multiple items on a single page - each one will have a link to its own, individual page.</p>

<p>*** We don&#39;t have to worry about whether the URI represents the Thing or a document about the Thing: it&#39;s always the Thing. Most of the time, no-one cares who wrote the document about the Thing, or when that document was last updated. An exception might be Wikipedia, so I have a suggestion: the Thing is still represented by the web page; information about authors and update times can be attached to an appropriate property of the Thing, e.g. &lt;http://en.wikipedia.org/wiki/Nirvana_(band)#description&gt;.</p>

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1946</id>
    <title>Citing With URIs in Google Docs</title>
    <updated>2011-09-16T02:14:50+00:00</updated>
    <published>2011-09-16T00:58:35+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001946.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>I built a script that runs in Google Docs, turning inline citations into a formatted bibliography. It lets you cite using DOIs, Mendeley library IDs, or any URL that returns metadata as JSON. It's a first, basic attempt, but here's why I like it:</p>

<h3>Citing With URIs</h3>

<p>The most straightforward way of being able to cite something in a document is to insert an identifier. On the web we hyperlink using URLs, which provide a unique identifier for the item being referenced - with the added bonus of being able to follow that URL to retrieve the item. When writing a scholarly article, however, there's still an expectation that the metadata for a citation will be provided, so that the reference will still make sense even if the URL stops working.</p>

<p>To be able to successfully cite using identifiers, therefore, means being able to retrieve the metadata for each identifier, and the simplest way to do that is to convert that identifier to a URL - if it isn't already - and retrieve it using an HTTP request.</p>

<p>Once we have the metadata for each citation, all that's needed is to generate a bibliography (a list of endnotes) at the end of the document, and insert links to those references inline. As a complication, there are many different publishing systems, and they each have <a href="http://www.zotero.org/styles">their own special preferred formatting</a> for those inline citations and bibliographies, so the tool should ideally be able to cater for any of those formats.</p>

<p>There is a need for citation software that works with Google Docs, as it's basically the standard online writing tool (and is continually getting more awesome). I've managed to get the first steps of a citation processor working in Google Docs; it's not complete yet...</p>

<h3>Inserting and Processing Citations</h3>

  <p><a href="http://code.google.com/googleapps/appsscript/">Google Apps Script</a> provides a way to add menu items to Google Docs and call a function when a menu item is selected. It's server-side Javascript, with an online editor that functions well. You can currently only attach scripts to Google Spreadsheets, but that's ok in this case: we need somewhere to store a local copy of our references.</p>

<p>Here's how to use Google Apps Script to format citations in a Google Document:</p>

<ol>
  <li>Create a new Document in Google Docs and give it a unique title.</li>
  <li>Write your article, adding citations inline in the form {{cite:doi:10.1038/nchem.1108}}.</li>
  <li>Create a new Spreadsheet in Google Docs and give it a title which is the same as the document, but with " - References" at the end.</li>
  <li>Add <a href="https://github.com/hubgit/Exciting">my Exciting script</a> to the spreadsheet (Tools > Script Editor). Once it's installed, an "Exciting" menu should appear.</li>
  <li>From the "Exciting" menu, select "Generate Bibliography".</li>
</ol>

<p>The script will now create a copy of the document (which must be in the same folder as the spreadsheet, and have the same name minus the " - References" suffix). The original document will remain untouched. It will parse the document for {{cite}} strings, fetch the metadata for each one, and store the data in the current spreadsheet (if the script is run a second time, it will use this local data instead of fetching it again). It will then replace the inline citations with numbered references, add a formatted bibliography at the end of the document, email you a PDF of the final, formatted document, and move the formatted copy of the document to the trash.</p>
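
<p>The parsing and numbering step can be sketched in a few lines. This is an illustration, not the code from the Exciting script; <code>numberCitations</code> is a hypothetical name, and reusing the same number for a repeated citation is an assumption about the desired behaviour.</p>

<pre><code>// Sketch: replace {{cite:...}} markers with numbered references.
// A repeated identifier reuses the number it was first given.
function numberCitations(text) {
  var identifiers = [];
  var numbered = text.replace(/\{\{cite:(.+?)\}\}/g, function (match, identifier) {
    var index = identifiers.indexOf(identifier);
    if (index === -1) {
      identifiers.push(identifier);
      index = identifiers.length - 1;
    }
    return "[" + (index + 1) + "]";
  });
  // identifiers are in citation order, ready for building the bibliography
  return { text: numbered, identifiers: identifiers };
}
</code></pre>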
  
<p>[NB: this is a first attempt, written last weekend. The citation formatting is very, very basic.]</p>

<p>There are several ways to cite using this system, and this is where it gets most interesting:</p>
<ul>
  <li>You can use {{cite:doi:10.1038/nchem.1108}} to cite an item by DOI; in this case the data will be fetched from CrossRef.</li>
  <li>You can use {{cite:mendeley:123456}} to cite an item using its ID in your Mendeley library; in this case the data will be fetched from mendeley.com (this might not be working properly yet - I haven't tested it much. It uses OAuth authentication, so you need to register an application and get a key from the <a href="http://dev.mendeley.com/">Mendeley Developers Portal</a>. You also need to run the "authorizeMendeley" function from within the Script Editor, to authorize this application).</li>
  <li><strong>You can use any URL</strong>, as long as it returns JSON when specified in HTTP Accept headers. For example, {{cite:http://dx.doi.org/10.1038/nchem.1108}} works just as well as the DOI example above.</li>
</ul>
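
<p>One way to treat these forms uniformly is to turn every identifier into a URL before fetching it. The sketch below is an assumption about how that could be done rather than what the script does (it fetches DOIs from CrossRef); <code>identifierToUrl</code> is a hypothetical helper, and Mendeley IDs are left out because they need an authenticated API call.</p>

<pre><code>// Sketch: turn a citation identifier into a URL that can be requested.
// Only the "doi:" scheme and plain http(s) URLs are handled here.
function identifierToUrl(identifier) {
  if (/^https?:\/\//i.test(identifier)) return identifier; // already a URL
  var doi = /^doi:(.+)$/i.exec(identifier);
  if (doi) return "http://dx.doi.org/" + doi[1];
  return null; // unknown scheme: the caller decides what to do
}
</code></pre>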

<p>Theoretically, you can cite any URL, and the script will retrieve the metadata from that URL and make use of it. In practice, not nearly as many URLs as I'd like perform content negotiation and return JSON instead of HTML from the same URL, and even when they do there's no standard format for the reference metadata (which is where RDF comes in, but there's no RDF parser in Google Apps Script; RDF triples as JSON would be a good intermediate). The current script has custom functions to normalise the data returned from CrossRef and Mendeley into a single, standard format for local use; adding other sources would probably require a custom parser for their metadata as well.</p>

<p>I'm not able to enter the <a href="http://dev.mendeley.com/api-binary-battle">Mendeley/PLoS API Binary Battle</a>, but I'd be delighted if anyone who's interested were to take this code and make use of it. I see the next steps like this, possibly: 1) get citeproc-node running on a node.js server somewhere (Heroku or Joyent, maybe), and use that for formatting the references; 2) use the UI Services/GUI Builder in Google Apps Script to build an editing interface, for tidying up references once they've been retrieved; 3) add the ability to specify custom formatting for the inline citations, and to choose the citation format for the bibliography.</p>

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1945</id>
    <title>Client-Side PubMed Searching</title>
    <updated>2011-07-23T18:39:29+00:00</updated>
    <published>2011-07-23T17:35:14+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001945.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>The NCBI have added <a href="https://developer.mozilla.org/en/HTTP_access_control#Access-Control-Allow-Origin"><tt>Access-Control-Allow-Origin: *</tt></a> to the <a href="http://eutils.ncbi.nlm.nih.gov/">eUtils</a> response headers, to allow <a href="http://www.w3.org/TR/cors/">cross-origin resource sharing</a>.</p>

<p>This means that anyone can now make client-side PubMed search interfaces, like <a href="http://alf.hubmed.org/2011/07/pubmed/">this one</a>.</p>

<p>Only the eSearch and eSummary methods have the Access-Control-Allow-Origin header so far, so it's not possible to get abstracts or full citation data this way (using eFetch) yet.</p>
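
<p>As a minimal sketch of the first step of such an interface, this builds an eSearch request URL that a page could then fetch with XMLHttpRequest. <code>eSearchUrl</code> is a hypothetical helper, and the default of 20 results is an arbitrary choice for the example.</p>

<pre><code>// Sketch: build an eSearch request URL for a PubMed query.
function eSearchUrl(term, retmax) {
  var params = new URLSearchParams();
  params.set("db", "pubmed");
  params.set("term", term);
  params.set("retmax", String(retmax || 20)); // maximum number of IDs to return
  return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params.toString();
}
</code></pre>

<p>The IDs in the response can then be passed to eSummary in a second request to get the basic details of each article.</p>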

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1944</id>
    <title>Capturing a manipulated web page with PhantomJS</title>
    <updated>2011-03-25T10:50:30+00:00</updated>
    <published>2011-03-25T10:49:54+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001944.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p><a href="http://www.phantomjs.org/">PhantomJS</a> is easy to compile in Ubuntu, uses QtWebKit to render pages, is controlled by Javascript and can output PNG. Here's a simple script to create a screenshot of a web page, with a tiny bit of DOM manipulation:</p>

<pre><code>var url = 'http://www.guardian.co.uk/'
var output = 'snapshot.png'

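// PhantomJS runs this script once before any page is loaded, and runs it
// again inside the page each time phantom.open() finishes loading one
switch (phantom.state.length){
    // first run: no state has been set yet, so set one and load the page
    case 0:
        phantom.state = 'rasterize'
        phantom.viewportSize = { width: 1024, height: 768 }
        phantom.open(url)
    break;

    // later run: the page has loaded, so manipulate the DOM and capture it
    default:
        document.querySelectorAll("#guardian-logo img").item(0).setAttribute("src", "http://placekitten.com/344/52")
        // wait a second for the replacement image to load
        phantom.sleep(1000)
        phantom.render(output)
        phantom.exit()
    break;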
}</code></pre>]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1943</id>
    <title>This Weblog In (Some) URLs</title>
    <updated>2011-03-06T14:42:53+00:00</updated>
    <published>2011-03-06T14:27:18+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001943.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<h3>Index</h3>

<ol style="list-style-type:none">
  <li><a href="/">/</a></li>
  <li><a href="/?_limit=5">/?_limit=5</a></li>
  <li><a href="/?_limit=5&_start=20">/?_limit=5&_start=20</a></li>
  <li><a href="/?_tags=firefox&_limit=10">/?_tags=firefox&_limit=10</a></li>
  <li><a href="/?_tags=firefox&_limit=10&_format=json">/?_tags=firefox&_limit=10&_format=json</a></li>
  <li><a href="/?_format=atom">/?_format=atom</a></li>
</ol>

<h3>Item</h3>

<ol style="list-style-type:none">
  <li><a href="/archives/001943.html">/archives/001943.html</a></li>
  <li><a href="/archives/001943.json">/archives/001943.json</a></li>
</ol>

<p>(archive URL structure retained from the old version, for compatibility)</p>
    
<p>POST/GET/PUT/DELETE</p>
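
<p>The index URLs above all follow one pattern, which can be sketched as a small helper. <tt>indexUrl</tt> and its option names are hypothetical, and the meaning of each parameter (page size, offset, tag filter, output format) is my reading of the names rather than anything specified here.</p>

```javascript
// Sketch only: builds an index URL from optional filters.
// indexUrl and its option names are hypothetical illustrations.
function indexUrl(options) {
    var names = ['tags', 'limit', 'start', 'format'] // fixed order for stable URLs
    var parts = []
    names.forEach(function (name) {
        if (options[name] !== undefined) {
            // every query parameter is prefixed with an underscore
            parts.push('_' + name + '=' + encodeURIComponent(options[name]))
        }
    })
    return parts.length ? '/?' + parts.join('&') : '/'
}
```

<p>For example, <tt>indexUrl({ limit: 5, start: 20 })</tt> gives <tt>/?_limit=5&_start=20</tt>.</p>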

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1942</id>
    <title>A Modular System for Automatic Entity Extraction and Manual Annotation of Academic Papers</title>
    <updated>2011-02-04T09:31:46+00:00</updated>
    <published>2011-02-03T19:46:45+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001942.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<style>
  img.screenshot {
    -webkit-box-shadow: 0px 5px 20px #777;
    -moz-box-shadow: 0px 5px 20px #777;
    box-shadow: 0px 5px 20px #777;
    border-radius: 10px;
    display: block;
    margin: 1em auto;
    max-width: 100%;
  }
  figure {
    margin: 1em 0 1.5em;
  }
  figcaption {
    font-style: italic;
    text-align: center;
  }
</style>

<p>At  the recent "<a href="https://sites.google.com/site/beyondthepdf/">Beyond The PDF</a>" conference in San Diego (which was  pleasantly easy to attend remotely, because anyone could follow the  webcast and participate via Twitter) there were several sessions that  discussed entity extraction and manual annotation of academic papers.  This, therefore, seems like a good time to write about the annotation  system and user interface that I worked on at Nature Publishing Group  over the last couple of years.</p>

<p>I'd been <a href="http://hublog.hubmed.org/archives/001339.html">trying out annotation systems</a>, and others had been working on similar things, for a while (the RSC's <a href="http://www.rsc.org/Publishing/Journals/ProjectProspect/">Project Prospect</a> launched in 2007 and was being presented at conferences, while Phil Bourne presented several authoring tools at <a href="http://hublog.hubmed.org/archives/001375.html">Data Webs</a>, the first conference I attended after joining NPG). This project started out as a discussion and early prototype with Robert Hoffmann, who had recently launched <a href="http://www.wikigenes.org/">WikiGenes</a> and had a biological entity extractor in use on <a href="http://www.ihop-net.org/UniPub/iHOP/">iHOP</a>; the initial aim was to allow authors to annotate genes and proteins mentioned in their papers. The focus of the project switched quickly to chemistry, though, as the launch of <a href="http://www.nature.com/nchem/">Nature Chemistry</a> was getting close; from then on, this work was performed in close collaboration with the Nature Chemistry team, particularly technical editor <a href="http://twitter.com/#!/laurajcroft">Laura Croft</a>, who provided ideas, most of the feature requests and the user interface requirements for this system.</p>

<p>The  annotation workflow expanded last year to cover more journals and now  includes annotation of both chemical entities (compound names/molecular  formulae) and biological entities (gene/protein names).</p>

<figure>
  <img class="screenshot" src="/files/misc/2011-02-03/chemical-entities.png">
  <figcaption>The curation interface, showing an annotation highlighted for editing and search results across chemistry-related databases.</figcaption>
</figure>

<figure>
  <img class="screenshot" src="/files/misc/2011-02-03/gene-protein-search.png">
  <figcaption>Curating a set of gene/protein annotations, with search results across biology-related databases.</figcaption>
</figure>

<p>The system comprises five main parts:</p>

<ol>
  <li>
    <p>Input:  When an article XML file is uploaded, a specified list of elements  (title, abstract, body, tables and figures) is converted to HTML using  an XSL template. A configuration file specifies which of the original  elements are blocks (which become divs in the HTML) and which are inline  elements (which become spans). Some elements, such as links, get  special treatment, and all element names are carried over to the class  names of the HTML elements for styling. All named entities are converted  to UTF-8 characters, and no characters are added to or removed from  these elements, so the character positions in the HTML and the original  XML are identical.</p>
  </li>
  <li>
    <p>Entity  extraction: The content is passed through several automatic entity  extractors, each of which is specialised for a particular type of entity  (chemical names, gene names, place names, etc). As most entity  extractors prefer to parse plain text rather than HTML, the content is  converted to text and separator characters are added to prevent  annotation across the boundaries of block-level elements.</p>
    <p>When  the results of the entity extraction are returned from each extraction  web service, the annotations are converted to a standard format, which  is then stored in MongoDB. By accounting for the previously-added  separator characters, the positions of each annotation can be correctly  translated back to positions in the HTML/XML.</p>
  </li>
  <li>
    <p>Curation:  The HTML is displayed in a web browser as several identical overlaid  layers: one base layer containing no annotations, one layer for each set  of automatically-extracted annotations and one layer for each set of  manually-curated annotations. The display of each of these can be  toggled on or off by the curator, allowing several sets of annotations  to be displayed concurrently without breaking the DOM by overlapping  elements.</p>
    <p>Each set of annotations is loaded  on-demand, as JSON, so that the initial rendering of the page is fast.  The text is transparent on all layers except the base layer, so it  doesn't cause anti-aliasing artifacts, and the CSS <a href="https://developer.mozilla.org/en/CSS/pointer-events">pointer-events</a>  property is used to pass all clicks through to the base layer;  highlighting a passage of text thus creates an annotation only in the  base layer. Annotations in each layer are represented as inline spans:  these have visible text, colours to show their state, and can receive  clicks regardless of which layer they're in (the z-index of each layer  determines which layer's annotations receive clicks in preference to  other layers; the manual sets of annotations are in the foreground).</p>
    <p>Each  annotation can have a single entity attached to it: a data object with a  set of metadata properties appropriate for the type of  entity/annotation being curated. Once an entity is chosen from the  search results (see below) and attached, the annotation is copied out of  the set of automatic annotations and into one of the manual sets: these  are the annotations which are going to be published.</p>
  </li>
    <li>
      <p>Search:  Creating a new annotation, or clicking on an existing annotation (which  selects all annotations of the same text in the current document),  launches a search across several databases, chosen according to the type  of annotation being curated (chemistry, biology, etc). The curator can  choose which of the known properties of the currently attached entity to  search on: the default is to run a search on the "title" property using  the text of the annotation. The results from each search source are  converted to a standard format (currently HTML with pseudo-RDFa markup  rendered server-side, but could easily be JSON rendered client-side into  a template), and the search results are displayed. When one of the  search results is selected by the curator (from any of the search  sources), the entity represented by that search result is attached to  all of the annotations currently being edited, replacing any entity  already attached. A list in the sidebar keeps track of all the attached  entities in each annotation set.</p>
  </li>
  <li>
    <p>Export:  The positions of the curated annotations are spliced back into the  article XML, which then re-enters the publishing workflow. The  annotations themselves - including the entities attached to each  annotation - are stored in an XML database for retrieval when the  article is rendered as HTML, where they are matched back up to the  annotation positions inline in the article XML; this storage also allows  the annotations to be published independently via an OpenSearch/SRU  gateway.</p>
  </li>
</ol>
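
<p>The position bookkeeping in the entity extraction step can be sketched roughly as follows. This is not the production code: the function names, the newline separator and the sample text are all assumptions made for illustration.</p>

```javascript
// Sketch only: the separator and both helpers are hypothetical.
var SEPARATOR = '\n' // assumed; any character that stops an extractor matching across blocks

// Join the text of each block-level element, recording where separators were added.
function toPlainText(blocks) {
    var text = ''
    var separatorPositions = []
    blocks.forEach(function (block, i) {
        if (i > 0) {
            separatorPositions.push(text.length)
            text += SEPARATOR
        }
        text += block
    })
    return { text: text, separatorPositions: separatorPositions }
}

// Translate a position in the separated text back to the original text,
// by discounting every separator inserted before that position.
function toOriginalPosition(position, separatorPositions) {
    var added = separatorPositions.filter(function (p) {
        return p < position
    }).length
    return position - added * SEPARATOR.length
}
```

<p>An annotation's start and end reported by an extractor would each be passed through <tt>toOriginalPosition</tt> before being stored, so that they line up with the character positions in the HTML/XML.</p>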

<figure>
<img class="screenshot" src="/files/misc/2011-02-03/sidebar-popup.png">
  <figcaption>A pop-up information box showing information about an entity, which is attached to one or more annotations.</figcaption>
</figure>

<p>This  project is ongoing: there is much work to be done on streamlining the  user interface and adding more features for chemistry and biology  curation. </p>

<p>From  a technical point of view, there are several things which could be  improved, including using a system like <a href="http://documentcloud.github.com/backbone/">Backbone.js</a> to separate the data  model from the DOM (making it easier to synchronise changes between the  annotation data, the front-end display and the server-side storage). It  might, perhaps, turn out to be better to store annotation positions  relative to each node, and give each node a unique ID using the XPath  for that node (as <a href="http://www.plos.org/">PLoS</a>  use for their public annotations system), rather than the more fragile  system used here which counts the distance of each annotation node in  characters from the start of the document.</p>
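
<p>To illustrate the difference, here is a minimal sketch that converts a document-wide character offset into a node-relative position. It assumes each node is represented as an object with a <tt>path</tt> (its XPath) and its <tt>text</tt>; that shape and the function name are hypothetical, not taken from either system.</p>

```javascript
// Sketch only: nodes are assumed to be listed in document order,
// each as { path: <XPath string>, text: <text content> }.
function toNodePosition(offset, nodes) {
    var consumed = 0 // characters in the nodes already passed
    for (var i = 0; i < nodes.length; i++) {
        var length = nodes[i].text.length
        if (offset < consumed + length) {
            return { path: nodes[i].path, offset: offset - consumed }
        }
        consumed += length
    }
    return null // offset is beyond the end of the document
}
```

<p>With positions stored this way, an edit to one paragraph only invalidates the annotations inside that paragraph, rather than every annotation after it.</p>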

<p>The  key benefit of this system is that it's straightforward to plug in more  automated entity extractors as they become available: by standardising  the input and output formats, we can make use of as many entity  extractors and search sources as possible. The automated annotations are  mostly used as hints to the human curators, though, so being able to  store the corrections that the curators make and feed those back to the  entity extractors will be a big improvement - not many automatic  annotation services are set up to learn from manual feedback, yet.</p>
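
<p>One way to picture the standardised output format is a set of small adapters, one per extractor. The two response shapes below are invented for illustration (they don't correspond to any real service), as are the field names of the common format.</p>

```javascript
// Sketch only: each adapter converts one extractor's (hypothetical) response
// into a common shape: { start, end, text, type, source }
var adapters = {
    chemistry: function (response) {
        return response.matches.map(function (m) {
            return {
                start: m.offset,
                end: m.offset + m.name.length,
                text: m.name,
                type: 'chemical',
                source: 'chemistry'
            }
        })
    },
    genes: function (response) {
        return response.map(function (hit) {
            return { start: hit.from, end: hit.to, text: hit.symbol, type: 'gene', source: 'genes' }
        })
    }
}

// Plugging in a new extractor only requires registering another adapter.
function normalise(source, response) {
    if (!adapters.hasOwnProperty(source)) {
        throw new Error('No adapter registered for ' + source)
    }
    return adapters[source](response)
}
```

<p>Everything downstream (storage, display, curation) then only needs to understand the common shape.</p>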

<p>As more search sources are added to the system, the similarities between this and Paolo Ciccarese's <a href="http://vimeo.com/18510599">Semantic/SWAN Annotation Framework</a>  become more and more obvious (Paolo's work inspired the use of  annotation sets here, for example). In the SAF, each entity is selected  from a set of ontologies rather than from a set of databases, but  basically the process is quite similar. </p>

<p>We're storing the properties of each entity as XML using simple key/value pairs (using <a href="http://www.w3.org/TR/curie/">CURIEs</a> as the keys) in MarkLogic, but when publishing these annotations I hope that they can be published using both the <a href="http://code.google.com/p/annotation-ontology/">Annotation Ontology</a> and <a href="http://www.openannotation.org/">OpenAnnotation</a> ontologies, which have similar aims in standardising the representation and publication of annotations.</p>
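
<p>As a small illustration of the key/value idea: CURIE keys can be expanded to full URIs with a prefix map. The sketch below uses a plain object rather than XML, and the single Dublin Core prefix and the function names are my own choices for the example, not the prefixes actually in use.</p>

```javascript
// Sketch only: a prefix map for expanding CURIE keys into full URIs.
var prefixes = {
    dc: 'http://purl.org/dc/elements/1.1/' // Dublin Core; more prefixes would be added here
}

function expandCurie(curie) {
    var index = curie.indexOf(':')
    if (index === -1) {
        return curie // not a CURIE: leave the key alone
    }
    var prefix = curie.slice(0, index)
    if (!prefixes.hasOwnProperty(prefix)) {
        return curie // unknown prefix: leave the key alone
    }
    return prefixes[prefix] + curie.slice(index + 1)
}

// Expand every key of an entity's property set.
function expandProperties(properties) {
    var expanded = {}
    Object.keys(properties).forEach(function (key) {
        expanded[expandCurie(key)] = properties[key]
    })
    return expanded
}
```

<p>Keeping the stored keys compact and expanding them only at publication time would make it simpler to map the same data onto more than one ontology.</p>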

<p>While I'd like to be able to open-source the code for anyone to use, it's probably going to remain locked up. As alternatives, there's some excellent work on <a href="http://okfn.org/projects/annotator/">an annotation system at the Open Knowledge Foundation</a> (built for annotating in the Open Shakespeare project), the automatic markup of entities in PubMed Central UK (using the modular text-mining system <a href="http://www.ebi.ac.uk/webservices/whatizit/">Whatizit</a>, developed and maintained by Dietrich Rebholz's group at the EBI), the Semantic Annotation Framework mentioned above (being applied, in collaboration with Elsevier, to the same kind of task as our tool: curating the results of text-mining services), OntoText's <a href="http://linkedlifedata.com/">Linked Life Data</a> platform (not specifically about annotation, but lots of text mining and linked data) and many others.</p>

]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1941</id>
    <title>Getting and Sending Binary Files with XMLHttpRequest</title>
    <updated>2010-12-15T17:31:39+00:00</updated>
    <published>2010-12-15T17:24:20+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001941.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p>This seems to work in Firefox (which provides xhr.sendAsBinary) and Chrome 9 (which implements BlobBuilder):</p>
<p><script src="https://gist.github.com/742267.js"></script></p>
<p><a href="http://alf.hubmed.org/2010/12/sendasbinary/">Test page</a>.</p>]]></content>
  </entry>
  <entry>
    <id>tag:hublog.hubmed.org,2010://2.1940</id>
    <title>AOTY 2010</title>
    <updated>2011-01-23T16:47:31+00:00</updated>
    <published>2010-11-18T21:24:49+00:00</published>
    <link rel="alternate" type="html" href="http://hublog.hubmed.org/archives/001940.html" />
    <content type="html" xml:base="http://hublog.hubmed.org/" xml:lang="en"><![CDATA[<p><a href="http://aoty.hubmed.org/">Albums of the Year</a> has started up again, with this year's first round-up being <a href="http://www.roughtrade.com/site/content.lasso?page=AOY_2010_11-100_v2.html">Rough Trade's Top 100 Albums of the Year</a>.</p>
<p>View it <a href="http://aoty.hubmed.org/year/2010/rough-trade">on AOTY</a>, or <a href="http://open.spotify.com/user/hubspot/playlist/3M7kutazkXoE6q2xSySw0O">as a Spotify playlist</a>.</p>]]></content>
  </entry>
</feed>

