the visible archive

Generative Heritage - notes on Succession

2014-12-04T14:00:00.000+11:00

I spent three months this year on sabbatical at Culture Lab, Newcastle University (UK). It was a privilege to spend time in such a vibrant research lab, as well as to get to know the city of Newcastle. One of the projects to come out of my visit is Succession, an experiment in generative digital heritage that uses Newcastle and its history to think about industrialisation, global capital, our shared pasts and potential futures. Personally, it brings together two strands of my work that have been separate until now - on generative systems and digital cultural collections. Hence you'll also find this cross-posted over on The Teeming Void. Here, some notes and documentation on the work, some musings on generative and computational heritage.

Much of my recent work with digital cultural collections has worked to create rich representations of these ever-expanding datasets. A key thread has been an interest in the complexity of these collections; the multitudes they contain, their wealth of potential meaning as complex, interrelated wholes, rather than simply repositories of individual resources. Visualisation can provide a macroscopic view of this complexity, but it can be just as vivid when sampled at a micro scale. Tim Sherratt's Trove News Bot tweets digitised newspaper articles in response to the day's news headlines, creating little juxtapositions, timely sparks of meaning that can be pithy, funny, or provocative. Trove News Bot appropriates the Twitter bot - the joking-but-deadly-serious computational voice of our age - and adapts it to work with the digital archive. We could call this generative heritage; using computational processes to create new artefacts (and meanings) from historical material.

Succession applies this generative approach to the digital heritage of Newcastle Upon Tyne. Newcastle has a rich industrial heritage; it played a major role in the Industrial Revolution that began in Britain and went on to remake global civilisation. Today Newcastle is a post-industrial or de-industrialised city: coal, steel and shipbuilding have given way to service industries: education, retail, entertainment and tourism. As an outsider exploring the city I was struck by the mixture of pre-modern, industrial and post-industrial eras in the fabric of the city. Different (often inconsistent) patterns of life, work and economy are accreted in layers as the city continues the everyday process of adaptation, experimentation with the possible; working out what comes next.

The city, like the digital archive, is a multitude; an unthinkably complex matrix of people, things, systems, narratives. Newcastle - more than many other cities - also speaks to the expansive dynamics of industrialisation, globalisation, extractive industry, fossil fuels; the whole modern trajectory that has brought us to our current predicament. This seems to be both urgent and unthinkable - or perhaps, unsayable. How can we speak back to this complexity; how can we make in a way that responds to this tangled, expansive mess? Here generative techniques offer a way to synthesise complexity and create multitudes, formations that might portray the city as it was, or hint at what it could be. Automatic juxtaposition and remix create nonsense but also, occasionally, glimmers of a new sense, or at least a texture or sensation that emerges from a random constellation of images, sources and contexts. Succession requires us to piece together fragments of history; and this is a work of imagination, as Ross Gibson writes, framing his own work of generative heritage (with Kate Richards), Life After Wartime:

Our parlous states need imagination. We need to propose “what if” scenarios that help us account for what has happened in our habitat so that we can then better envisage what might happen. We need to apprehend the past. Otherwise, we won’t be able to align ourselves to historical momentum. Without doing this we won’t be able to divine the continuous tendencies that are making us as they persist out of the past into the present.

In practical terms, the work is based on a corpus of around two thousand images sourced from the Flickr Commons. Most come from the (wonderful) Tyne and Wear Archives and Museums collection; many more from the Internet Archive Books collection, with a smattering of others from UK and international institutions. Succession uses these ingredients to generate new digital "fossils"; composite images assembled in the browser using HTML Canvas. This generative process is extremely simple: pick five sources at random, and place them in the frame using some semi-random rules for positioning, compositing and repetition. Opacity is kept low, so that the sources blend and merge. The visual process often obscures the source images - they end up buried, cropped or indistinguishable, squashed like fossil strata. But at the same time the source items are preserved and presented in context, so each composite retains references to its sources and their attendant contexts. Composites can be saved, acquiring an ID and permalink; the images in this post show some of my favourites, but there are over a hundred to sift through already.
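The process described above can be sketched as a small planning function. This is only a rough illustration, not the code behind Succession: the names pickDistinct and planComposite, and the ranges chosen for scale, opacity and repetition, are all assumptions made for the example.

```javascript
// Hypothetical sketch: plan a composite as plain data, which a renderer
// could then draw to an HTML Canvas. All ranges are illustrative guesses.

// Pick `count` distinct items at random (partial Fisher-Yates shuffle).
function pickDistinct(items, count, rng = Math.random) {
  const pool = items.slice();
  const n = Math.min(count, pool.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Build one layer description per source: position, scale, opacity, repeats.
function planComposite(sources, width, height, rng = Math.random) {
  return pickDistinct(sources, 5, rng).map((source) => ({
    source,                             // keep the reference to the original item
    x: Math.floor(rng() * width),       // semi-random placement in the frame
    y: Math.floor(rng() * height),
    scale: 0.5 + rng() * 1.5,           // between 0.5x and 2x (assumed range)
    opacity: 0.1 + rng() * 0.3,         // kept low so the layers blend
    repeats: 1 + Math.floor(rng() * 3)  // draw the source one to three times
  }));
}
```

A renderer could then walk the plan, setting the canvas context's globalAlpha to each layer's opacity before calling drawImage. Because every layer keeps its source reference, the composite can still list and link to its source items.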

As a generative system this is, in formal terms, incredibly simple. It's essentially a combinatorial process, in that each composite consists of five elements from a set of around two thousand. Yet already this adds up to 2.5 x 10^14 unique combinations - it would take eight million years to see them all, at one per second. Compositing and layout parameters are random within constraints - so this simple machine can produce an immense variety of unique results; I'm still surprised and delighted by the fossils people discover (or generate). But this computational variety is also strongly shaped by the human creative choices involved in making the work. This is what Bill Seaman (combinatorial media artist par excellence) calls "authored space" - a domain of potential that is expansive but never arbitrary. The corpus reflects a handful of coherent themes, seasoned with generous sprinklings of the lateral and miscellaneous; the aim is, in Seaman's words, a kind of "resonant unfixity." Also the corpus and the compositing process work in tandem; for example the compositor treats the largely monochrome line-art and engravings of the Internet Archive material differently to other (largely photographic) sources. The generative machine is programmed in part by the textures and qualities of its material.
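The combinatorial arithmetic is easy to check. The sketch below assumes a corpus of exactly 2,000 sources and ignores layout variation; the function name combinations is just a label for this example. Under that assumption the count comes to a few hundred trillion, which is in line with the eight-million-year figure at one composite per second.

```javascript
// Number of ways to choose k items from n, ignoring order.
// BigInt keeps the intermediate products exact.
function combinations(n, k) {
  let result = 1n;
  for (let i = 1n; i <= BigInt(k); i++) {
    // After each step, result equals C(n - k + i, i), so the division is exact.
    result = (result * (BigInt(n) - BigInt(k) + i)) / i;
  }
  return result;
}

const total = combinations(2000, 5);           // assumes exactly 2000 sources
const secondsPerYear = 365.25 * 24 * 60 * 60;
const years = Number(total) / secondsPerYear;  // viewing one composite per second

console.log(total.toString() + " combinations");
console.log(Math.round(years / 1e6) + " million years, roughly");
```

A slightly smaller or larger corpus shifts the total, but not the order of magnitude.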

The Internet Archive book images are interesting on several fronts; for one, they are an amazing demonstration of the power of computational processes for generating and describing large collections (like 2.6 million items large). Given the right kind of source material, this computational leverage changes the logic of collections completely. When adding and describing items is expensive, it makes sense to be selective, and publish only what is most "significant". Automation makes it possible to simply publish everything - for who's to say (really) what is significant, or how it might one day be significant? In Succession the Internet Archive material plays a crucial role. The line art and diagrams - many from obscure publications like the Transactions of the North of England Institute of Mining and Mechanical Engineers - offer evocative fragments of the machinery of mid-nineteenth century industrialisation.

As for generative digital heritage, it's a fairly open-ended proposal. What happens when we turn algorithms loose on our digital culture with makerly, synthetic, speculative or poetic intention? There are some pretty solid precedents in the digital humanities for these approaches; Schnapp and Presner call for a "generative" DH in their 2009 manifesto. Before that Drucker and Nowviskie outlined a "speculative computing" with a strongly generative flavour. Gibson and Richards' Life After Wartime is an early exemplar of generative heritage in the digital arts. More recently we've seen the rise of massive online collections, web-scale computing, and a proliferation of cultural, critical and creative bots, not to mention projects like #NaNoGenMo. If there is such a thing as generative digital heritage, then now's the time.

AngularJS for Digital Cultural Collections

2014-07-19T01:07:00.001+10:00

In the previous post I introduced our Discover the Queenslander project for the SLQ, and mentioned that we used the AngularJS web framework. That process has got me thinking about some of the technical challenges in creating rich collection interfaces, and the different approaches in play, and I'll report on these in the next two posts. In this one I'll focus on AngularJS, and in the next, some broader questions on working with collection data on the client side.

AngularJS is a JavaScript-based framework that focuses on extending HTML to deal with dynamic content. Angular "binds" data to HTML elements; so change the data, and the HTML updates. Even better, the bindings are two-way: interacting with an HTML element can also change its bound data. Angular implements an MVC (Model View Controller) architecture, where the data structure is the Model, the HTML document is the View, and a JavaScript Controller links the two together.

Our previous web-based collections projects (TroveMosaic, Manly Images, Prints and Printmaking) were built in plain JS and jQuery. The general approach is pretty straightforward: load and manipulate some collection metadata (either from an API or a static JSON file), then build the HTML dynamically (adding and styling elements according to the data). jQuery makes handling interactions with the HTML pretty straightforward. It also (in my experience) makes for a verbose mess. Because all the HTML is built dynamically there's a lot of code devoted to creating elements, setting attributes, then stuffing them into the DOM. Code that loads and munges data gets tangled with code that builds the document and code handling interactions. Some elements get styled with static CSS, others are styled with hard-coded attributes. It all works fine - jQuery is very robust - but under the surface, it's bad code.

AngularJS tidies this process up quite a bit. Here's a quick example showing how straightforward it is to bind some collection data to some HTML. Say we have a JSON array items where each item looks something like:

{ "id":"702692-19340823-s002b",
 "title":"Illustrated advertisement from The Queenslander, 23 August, 1934",
 "description":["Caption: Practical garments","An Advertisement for women's clothing sewing patterns acquired through mail order from The Queenslander Pattern Service."],
 "subjects":["women's clothing & accessories","advertisements"],
 "thumbURL":"702692-19340823-s002.jpg",
 "year":"1934"
}

To create a HTML list where each item appears as a list element:

<ul>
 <li ng-repeat="i in items"> 
  <h1>{{i.title}}</h1>
  <img ng-src="{{i.thumbURL}}"/>
 </li> 
</ul>

Angular lets us iterate over a list of elements with the ng-repeat directive; it will simply generate a <li> for each element in the items array. Attributes of each item i are easily bound to the HTML using the {{moustache}} notation - so the item title will appear inside the h1. Apart from the compact, HTML-based rendering syntax, the killer feature here is that the HTML stays bound to the data: in order to change the display, we simply change the contents of items. No jQuery-style DOM manipulation; the data drives the document.

So rendering items in a list is trivially easy; but what about more complex displays? It's a matter of creating the data structures you need, then binding them to HTML in the same way. The Queenslander grid interface (above) includes a histogram showing items per year. In HTML this is simply another list, where each column is a list element. To create the data structure we sort the items into a new array where each element contains both the year, and a count of its items. Then as in the example, we run through the array with Angular building an element (this time a column) for each year. Angular's ng-style directive lets us create a custom height for each element, based on the number of items in the year list. With an array yearTable, where each year y has a totalCount:

<ul>
     <li ng-repeat="y in yearTable">
           <div ng-style="{height: y.totalCount+'px'}"
           ng-click="setYearFilter(y.year);"></div>
     </li>
</ul>

Here Angular is doing some rudimentary data vis, linking variables in the data to the dimensions of each HTML element. Note also that each column element has an ng-click directive, calling a function that filters the items displayed. The term clouds for subjects and creators work the same way.
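For completeness, here is one way the yearTable array might be built from the items. The property names year and totalCount come from the template above; the helper names buildYearTable and filterByYear are hypothetical, and the real setYearFilter function may well work differently.

```javascript
// Hypothetical sketch: group items by year and count them, producing the
// [{ year, totalCount }, ...] array that the ng-repeat template iterates over.
function buildYearTable(items) {
  const counts = new Map();
  for (const item of items) {
    counts.set(item.year, (counts.get(item.year) || 0) + 1);
  }
  return Array.from(counts, ([year, totalCount]) => ({ year, totalCount }))
    .sort((a, b) => Number(a.year) - Number(b.year)); // chronological columns
}

// The kind of filter a column click might apply (assumed behaviour).
function filterByYear(items, year) {
  return items.filter((item) => item.year === year);
}
```

In a controller, assigning the result of buildYearTable(items) to a scope property named yearTable would be enough for the template to render a column per year.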

Hopefully this gives a hint of how AngularJS can be applied to cultural collection interfaces. From a developer's perspective, there are a number of big advantages. Compared to our previous jQuery process, Angular simplifies the page-building process immensely; the templating approach encourages a separation of concerns and more organised, maintainable code. Angular's data-centric binding also provides some big wins. Data structures (models) become more important; Angular requires that you get your data organised before binding it to the DOM. Coming from the free-wheeling procedural world of jQuery, this data-centric approach was the biggest conceptual challenge. The bottom line is: manipulate the data, not the HTML. The payoff is that the work of keeping the HTML and the data coordinated just disappears. Angular's modular architecture and active developer community also bring benefits: in the Queenslander project for example we used ngStorage, a module that made the favourites incredibly easy to build.

Compared to standard web interfaces, the big difference here is that all the collection data (in this case some 1000 items' worth) is in the browser, on the client side. No server calls, pulling down a few items at a time - instead we load the whole set up front, and build the interface dynamically based on that data. The biggest payoff for this approach is responsiveness - filtering and exploration are lightning fast - but there are problems too; search engines can't index this dynamic content, and it requires modern browsers with fast JS engines. Some would argue that this approach is just plain wrong; abusing the client/server architecture of the web. I'm more of a pragmatist, but there are certainly some technical issues to consider, and in the next post I'll go a bit deeper into this notion of client-side data for digital cultural collections.

Discover the Queenslander

2014-07-15T23:28:00.000+10:00

Discover the Queenslander is our latest generous interface project, commissioned by the State Library of Queensland to showcase their collection of covers and pages from The Queenslander newspaper. Published 1866-1939, The Queenslander was the illustrated weekend supplement to the Brisbane Courier Mail. This collection includes around 1000 covers, advertisements and illustrations - a beautiful slice of Australian pre-WWII visual culture. Geoff Hinchcliffe and I developed a web-based interface that builds on our previous approaches - rich overview, browsing and visual exploration - and adds some new techniques. Here I'll provide a quick outline; in the following post I will focus on the web framework we used - AngularJS - which I think has some interesting applications for digital collections.

The Mosaic view provides a chronological overview of the collection - each tile represents items from a single year. Like the Manly Images mosaic, the tiles gradually reveal their contents - in this case they are also directly navigable. The Grid view is a more general-purpose explorer for browsing subjects, creators and years as well as colours. Both Grid and Mosaic interfaces link to a detailed item view. There's nothing radically new here - though there are a few new elements that extend on our generous interfaces repertoire.

Inspired by the qualities of the collection images and the related work happening at Cooper Hewitt, Geoff and I were keen to experiment with using colour to explore the collection. The process was (surprise!) more complex than we expected, but ultimately rewarding. Using some palette extraction code that Geoff developed, we first pre-built a palette for each item. These colours are stored in the collection metadata, and act much like any other metadata field. The interface then dynamically builds an "overview" palette revealing the colours in the current set of items, and both the item palette and the overview palette act in turn as filters; rinse and repeat for open-ended colour-browsing. Note also how the filters and facets in the grid view interact; selecting a colour will also reveal corresponding dates, creators and subjects (and vice versa).
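The palette-as-metadata idea can be sketched in a few lines. This is a guess at the general shape rather than the project's code: it assumes each item carries a palette array of colour strings that have already been snapped to a shared set of colours (so that the same colour matches across items), and the names overviewPalette and filterByColour are hypothetical.

```javascript
// Hypothetical sketch: build an "overview" palette for the current set of
// items, counting how many items contain each colour.
function overviewPalette(items) {
  const counts = new Map();
  for (const item of items) {
    for (const colour of new Set(item.palette)) { // count each colour once per item
      counts.set(colour, (counts.get(colour) || 0) + 1);
    }
  }
  return Array.from(counts, ([colour, count]) => ({ colour, count }))
    .sort((a, b) => b.count - a.count);           // most common colours first
}

// Selecting a colour narrows the set; the overview is then rebuilt from the result.
function filterByColour(items, colour) {
  return items.filter((item) => item.palette.includes(colour));
}
```

Filtering and then recomputing the overview on the filtered set gives the "rinse and repeat" loop: each selection produces a new set of items and a new palette to choose from.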

This project also introduces some simple personalisation, with the ability to curate and share a collection of favourite items. We opted for a lightweight, no-login approach using HTML5 web storage (essentially fancy cookies) to simply track item IDs. Sharing a collection is as simple as sharing a URL with a list of IDs baked in; and because collections operate within the standard grid view they get filters and facets too.
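As an illustration of the URL-sharing idea, the sketch below round-trips a list of item IDs through a query string. The parameter name favs and both function names are assumptions for the example; the actual URL format used by the site may differ.

```javascript
// Hypothetical sketch: bake a list of favourite item IDs into a query string.
function favouritesToQuery(ids) {
  const params = new URLSearchParams();
  params.set("favs", ids.join(","));  // "favs" is an assumed parameter name
  return params.toString();
}

// Recover the list of IDs from a shared query string.
function favouritesFromQuery(query) {
  const value = new URLSearchParams(query).get("favs");
  return value ? value.split(",") : [];
}
```

On the storage side, the same array of IDs could be kept between visits with the browser's localStorage, serialised as JSON.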

Finally a little feature that I am particularly fond of is the Trove link on the item page; a simple demonstration of how we might start to link up collections across institutional boundaries. In this case, the State Library of Queensland has high-res images of covers and illustrations, while the NLA's Trove publishes the full contents of The Queenslander (albeit with low-res scans). Using the Trove API we simply harvested the full list of issue dates and corresponding Trove IDs, then matched them against the SLQ items. So each Queenslander item also provides a link to its source issue, providing additional context as well as opening onto further exploration.
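The matching step might look something like the sketch below. It assumes that SLQ item IDs embed the issue date as YYYYMMDD (as in the sample record "702692-19340823-s002b" in the AngularJS post), and that the harvested Trove list is an array of objects with date and troveId fields. Those field names, the function names, and the ID values used in any test data are illustrative, not taken from the project.

```javascript
// Hypothetical sketch: pull an ISO-style date out of an SLQ item ID.
function issueDateFromId(id) {
  const match = /-(\d{4})(\d{2})(\d{2})-/.exec(id);
  return match ? `${match[1]}-${match[2]}-${match[3]}` : null;
}

// Attach a Trove issue ID to each item whose date matches a harvested issue.
function linkToTrove(items, issues) {
  const byDate = new Map(issues.map((issue) => [issue.date, issue.troveId]));
  return items.map((item) => ({
    ...item,
    troveId: byDate.get(issueDateFromId(item.id)) || null // null when no issue matches
  }));
}
```

Building the date lookup once keeps the match a single pass over the items, and items without a matching issue simply end up with no link.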

Manly Images - a Generous Prototype

2012-10-15T14:54:00.000+11:00

Over the past twelve months we have been developing some new approaches to the challenge of providing rich, revealing interfaces to cultural collections. The key idea here is the notion of generous interfaces - an argument that we can (and should) show more of these collections than the search box normally allows; and that there's a zone between conventional web design and interactive data visualisation, where generous interfaces might happen. There's more on this concept in my NDF 2011 presentation, or (in a more formal mode) in the paper I presented at the recent ICA conference.

Here I want to introduce an experimental "generous interface" prototype. Manly Images is an explorer for the Manly Local Studies Image Library - a collection hosted by the Manly Library. This is a collection of around 7000 images, documenting the history of the Manly region from the 1800s to the 1990s. The aim here was to develop a "generous," exploratory, non-search interface to the collection, delivered in HTML.

The original intention here was simply to adapt our CommonsExplorer work into HTML - CommonsExplorer uses a linked combination of thumbnails and title words to provide a dense overview of an image collection. But to "show everything" would mean 7000 elements, a stretch even for modern browsers; and I wanted to experiment with some new approaches to overview, which remains the key problem here - a really juicy one. Given 7000 images with titles and little else, how can we provide a compact but revealing representation of the whole collection?

Here, the strategy was to break the collection into smaller segments based on either terms in the title, or date, and to draw each segment as a simple HTML div, where the size of the box reflects the number of items in that segment. These segments also act as navigational elements, opening a "slider" type display for browsing through specific records, and finally a lightbox for larger images, with links to canonical URLs on both Trove and the Manly site.
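A minimal sketch of the date-based segmentation, under some assumptions: items have a numeric year, segments are decades, and box area (rather than width alone) is made proportional to the item count. The function names and the pixelsPerItem parameter are invented for the example; the prototype's actual sizing rules are in its source.

```javascript
// Hypothetical sketch: split a collection into decade segments with counts.
function segmentByDecade(items) {
  const counts = new Map();
  for (const item of items) {
    const decade = Math.floor(item.year / 10) * 10;
    counts.set(decade, (counts.get(decade) || 0) + 1);
  }
  return Array.from(counts, ([decade, count]) => ({ decade, count }))
    .sort((a, b) => a.decade - b.decade);
}

// Side length of a square box whose area is proportional to the item count.
function boxSide(count, pixelsPerItem) {
  return Math.round(Math.sqrt(count * pixelsPerItem));
}
```

Each segment can then become a div whose width and height are set from boxSide, so a decade with four times as many images gets a box with four times the area.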

As a visualisation, it's a bit like a treemap (without the hierarchy), or a reconfigured histogram. But a collection like this is more than a list of quantities; the texture and character of the images is crucial. So as well as showing quantity, the segments become windows revealing (fragments of) the images inside them in a rolling slideshow. We get a visual core-sample of each segment, revealing the character of that group; and across the collection as a whole, a shifting mosaic that reveals diversity (and consistency), and invites further exploration. An interesting side effect is that it becomes possible to surf through the whole collection without doing a thing; it will (eventually) just roll past. This might not be realistic in a traditional browser context, but that traditional, "sit-forward" user model is not what it used to be - as Marian Dörk argues, the leisurely drift of the information flaneur might be more apt.

So, a rich exploratory interface to 7000 images, without search, and delivered entirely in HTML; we have shown that it's possible, but is it any good? I'll write up my own evaluation with some technical documentation shortly; meantime, feedback on the prototype is very welcome - and if you are interested in building on it, or adapting it for other collections, the source is up on GitHub.

Finally some acknowledgements: this project was funded by the State Library of New South Wales and supported by Cameron Morley and Ellen Forsyth; thanks to John Taggart of Manly Library for permission to use the image collection. The collection data is harvested from the excellent Trove API, developed by the National Library of Australia.

Generous Interfaces at NDF 2011 (video)

2012-04-10T14:40:00.000+10:00

Generous Interfaces (NDF 2011 keynote)

2011-12-08T12:43:00.001+11:00

Generous Interfaces - rich websites for digital collections

I recently gave this presentation at the National Digital Forum 2011 in Wellington. It proposes a way to think about collection interfaces through the concept of generosity - "sharing abundantly". The presentation argues that collection interfaces dominated by search are stingy, or ungenerous: they don't provide adequate context, and they demand the user make the first move. By contrast, there seems to be a move towards more open, exploratory and generous ways of presenting collections, building on familiar web conventions and extending them. This presentation features "generous interfaces" by developers including Icelab, Tim Sherratt and Paul Hagon, and it includes a preview of some work I am currently doing with the National Gallery of Australia's Prints and Printmaking collection, in collaboration with Ben Ennis Butler.

Visualising Cultural Collections at TedXCanberra

2010-11-25T14:30:00.002+11:00

commonsExplorer

2010-03-16T16:48:00.018+11:00

Although the Visible Archive project wound up months ago, its visualisation techniques live on. In particular I've been developing and adapting the title-word-frequency interface of the A1 Explorer, and trying it out on a range of different datasets. One of these spinoff projects - the commonsExplorer - has finally launched. Here, some documentation, reflection and rationale.

My colleague Sam Hinton and I began work on this as a project for MashupAustralia late last year. Our initial focus was the Flickr set of the State Library of NSW, and our aim was a rich, dynamic, "show everything" interface, building on the A1 Explorer work, but with image-based content. Some months later, having totally missed our original deadline, the scope had broadened out to the whole (amazing) Flickr Commons.

The explorer consists of a three-pane interface. The term cloud shows the 150 most frequently occurring words in the titles (not tags) of the current set of images. This will look familiar to anyone who's played with the A1 Explorer. It uses the same co-occurrence visualisation, and the same blocking / focusing navigation, with a few UI refinements. After some strong user feedback, I added a "back" button to step the navigation back one state. It also uses left and right-clicks, rather than modifier keys, to block or focus words. Applying this title-word approach to different sets has shown up its strengths, and a few weaknesses.

Its strengths are that titles and co-occurrence are a reliably rich cue for content, and that for most collections, thanks to the wonder of Zipf's law, the top-level cloud of 150 words will "cover" (refer to) more than 75% of the images in the set - even in a collection numbering in the thousands. Often, in smaller collections, the coverage is more than 95%. One question I haven't answered yet is how to communicate this idea of coverage to the user, and how to make those images not in the top-level cloud more immediately discoverable. After all, sometimes it's the outliers or exceptions in a collection that we are interested in.
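Coverage in this sense is straightforward to measure. The sketch below is an approximation of the idea rather than the explorer's code: the tokeniser is deliberately crude (lowercase ASCII words only, no stop-word removal), and the names tokenize, topTerms and coverage are made up for the example.

```javascript
// Crude tokeniser: lowercase runs of letters only (an assumption for this sketch).
function tokenize(title) {
  return title.toLowerCase().match(/[a-z]+/g) || [];
}

// The `limit` words that occur in the most titles.
function topTerms(titles, limit) {
  const counts = new Map();
  for (const title of titles) {
    for (const word of new Set(tokenize(title))) { // count once per title
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}

// Fraction of titles that contain at least one of the given terms.
function coverage(titles, terms) {
  const termSet = new Set(terms);
  const covered = titles.filter((t) => tokenize(t).some((w) => termSet.has(w)));
  return titles.length ? covered.length / titles.length : 0;
}
```

Calling coverage(titles, topTerms(titles, 150)) gives the proportion of images reachable from the top-level cloud; the titles it misses are exactly the outliers mentioned above.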

The bottom pane is the thumbnail grid, which is where most of the new stuff is. The grid is an attempt at a "show everything" image visualisation that can scale from tens to thousands of elements. As the number of elements grows, the grid size decreases to fit in the available space. Rather than scale images down, we simply crop the thumbnails - the intention isn't to represent the whole image but to provide some rich but unstructured visual clues: a sort of visual core sample through the whole set. The results show how this can help reveal structure within the collection. Different photographic processes are instantly apparent - monochrome, sepia, cyanotype, stereoscopic, Kodachrome. Other similarities also pop out, even in small tiles - landscapes vs portraits, for example.
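One simple way to pick a tile size that fits a given number of square tiles into the pane is to try each possible column count and keep the largest size that works. This is a brute-force sketch of the idea, not the layout code in commonsExplorer (which was written in Processing), and the function name tileSize is an assumption.

```javascript
// Hypothetical sketch: largest whole-pixel square tile such that `count`
// tiles fit inside a width x height pane.
function tileSize(count, width, height) {
  let best = 0;
  for (let cols = 1; cols <= count; cols++) {
    const rows = Math.ceil(count / cols);
    // Tiles must fit both across (cols) and down (rows).
    const size = Math.floor(Math.min(width / cols, height / rows));
    best = Math.max(best, size);
  }
  return best;
}
```

With the tile size fixed, each thumbnail is cropped to a square of that size rather than scaled, so a shrinking grid shows smaller fragments of each image instead of smaller whole images.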

This "clue" approach actually sums up our visualisation approach nicely. The Explorer presents us with a rich mass of partial information - or rather data: linked fragments of titles, and of images. Moments of discovery come when we see those fragments unified in a source image: the fragments are contextualised and become more meaningful. This contextual information then propagates back to the fragmentary display - when it works best there is a feedback loop from discovery to context and back to discovery. I've argued for a distinction between data and information, which is relevant here: these fragments are data points, abstracted and decontextualised. Information occurs only when we link and interpret those fragments - and it happens strictly on the human side of the screen.

Another feature of the grid that isn't immediately obvious is chronological sorting. Many collections, including the SLNSW set we started with, include dates in image titles. We look for those dates and sort dated images first in the grid. This approach is simple, and prone to the occasional false positive, but it degrades gracefully, and adds a usable layer of structure to the grid layout. Why not use Flickr's "date taken" field instead? Most Commons collections don't set it, so instead it gives the date uploaded. For the same reason we decided not to use tags, or attempt to scrape data from descriptions: these fields are inconsistent across the Commons - some images have no tags, others have dozens. Title and thumbnail seem to be the richest data that is always available.
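The dated-first sort can be approximated as follows. The regular expression and the 1800 to 2099 year range are assumptions for this sketch, and like the approach described above it will happily treat any four-digit number in that range as a year; the function names are hypothetical.

```javascript
// Hypothetical sketch: find a plausible year in an image title, or null.
function yearFromTitle(title) {
  const match = /\b(1[89]\d{2}|20\d{2})\b/.exec(title);
  return match ? Number(match[1]) : null;
}

// Sort dated images first (oldest to newest), leaving undated images
// after them in their original order.
function sortDatedFirst(images) {
  return images
    .map((image) => ({ image, year: yearFromTitle(image.title) }))
    .sort((a, b) => {
      if (a.year === null && b.year === null) return 0;
      if (a.year === null) return 1;
      if (b.year === null) return -1;
      return a.year - b.year;
    })
    .map((entry) => entry.image);
}
```

Images with no recognisable date are not dropped, just pushed to the end, which is what lets the approach degrade gracefully on collections with few dated titles.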

Sam Hinton did the heavy programming work that makes the grid go. The main technical challenge we faced was memory usage: loading 700 tiny images just eats memory in Processing / Java. Sam devised a system for stashing the square thumbnails locally, optimising memory and acting as a cache to speed up loading. Drawing thousands of little images to the screen also raised performance issues - we draw to a single offscreen PGraphics context, then draw that to the screen.

In the end I think we've done what we set out to do - make a rich experience that encourages an understanding of context, and enables discovery in large collections. We've also shown that this approach is broadly applicable - if you've got a large image collection where you think it might apply, let us know. Most importantly though, try it out and let us know what you think.

Download commonsExplorer for Mac | Windows | Linux (1Mb)

A1 Explorer Screencast

2009-09-24T12:49:00.000+10:00

Visible Archive A1 Explorer from Mitchell Whitelaw on Vimeo.

Series Browser Screencast

2009-09-22T14:26:00.001+10:00

Visible Archive Series Browser from Mitchell Whitelaw on Vimeo.

Exploring A1 - Items to Documents

2009-08-11T10:27:00.010+10:00

In the last post I outlined the main approach used to visualise Series A1 - a word frequency cloud based on item titles, showing co-occurrences between terms. Here I'll show how that was expanded into an interactive tool for exploring the Series, all the way down to images of the documents themselves. If you're impatient, skip straight to the latest sketch for Mac, Windows and Linux (1.8Mb Java executables).

To turn the text cloud visualisation into a general-purpose interface, I added the ability to focus on terms (where focus means include only items containing that word). Exclusion and focus have an additive relationship; so I can exclude one term to create a subset of items, then focus on a second term to show only items in that subset, with a given term. I can also exclude, or focus on, multiple terms to further refine a subset. A simple interface allows for terms to be removed from any point in the sequence; so for example I can exclude all "naturalisation" items, then focus in on a second term (in the grab below, "immigration"). While this navigation technique isn't perfect, it is simple and scalable. We can rapidly move from Series level to small groups of items - in the grab below, we have zoomed from 65k items to 233 items, in two clicks. With this iterative navigation process, the co-occurrence display in the cloud becomes a useful way to scope or preview term relationships, and inform the next focus or exclusion.
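The focus and exclusion sequence can be modelled as a list of steps applied in order. The sketch below is an illustration in JavaScript rather than the applet's Processing code; the step format ({ mode, term }) and the function names are invented for the example.

```javascript
// Set of lowercase words in a title (crude tokeniser, assumed for this sketch).
function titleWords(title) {
  return new Set(title.toLowerCase().match(/[a-z]+/g) || []);
}

// Apply a sequence of focus / exclude steps, each narrowing the previous subset.
function applySteps(items, steps) {
  return steps.reduce(
    (subset, step) =>
      subset.filter((item) => {
        const has = titleWords(item.title).has(step.term);
        return step.mode === "focus" ? has : !has; // focus keeps, exclude drops
      }),
    items
  );
}
```

Because the subset is always recomputed from the full item list and the current steps, removing a step from any point in the sequence is just a matter of deleting it from the array and applying the rest again.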

The new visualisation element here is a simple histogram, showing the number of items with start dates in each year of the Series. The histogram visualises the current subset; so refining the text-cloud display also modifies the histogram; as well, hovering on a term in the cloud shows the relative distribution of that term, in the histogram. The date histogram becomes a powerful tool for exploration and discovery in this display. For example in the grab above, there's a big spike in the histogram in 1927: why? Hovering over the most prominent words in the cloud, we get a sense of their different distributions; for example "restriction" appears mainly between 1900 and 1915, whereas "deportation" occurs almost exclusively in items starting in 1927, and makes up most of the spike. Simply clicking either a term, or a histogram column, fills the lower pane with a list of relevant items, and from there we can explore much deeper - more of that later.

Here is a second example of how the text cloud, co-occurrences, and histogram can combine to reveal patterns in the dataset, and prompt discoveries in the content of the series. Focusing in on "Darwin" reveals another big spike in the histogram, this time in the year 1937. In this case, the co-occurrences and the distribution of terms give an accurate preview of what the items in that spike are about: a cyclone hit Darwin. The text cloud even reveals the month of the event, and again the item listing shows fine-grained confirmation of the pattern.

The final challenge in this process was to zoom in again, to the level of the individual document. The Archives has digitised a significant chunk of its records - it currently stores 18.2 million images which are accessible through the search interface of RecordSearch. With the invaluable help of the Archives' Tim Sherratt, I can access these images dynamically by passing item details - barcode and page number - to an Archives PHP script. Because the dataset I am working with does not record the number of digitised pages, this is a two-stage process: first, query RecordSearch for the item details, and scrape out the number of digitised pages (shown in the right hand column of the items listing). Then, when an item in the list is clicked, load and display the page images.

This involved getting around a couple of little technical issues. The loading of the images was surprisingly straightforward. Processing's requestImage() function happily grabs an image from the web without bringing the entire applet to a halt. Loading the pages data was slightly harder, because loadStrings() does halt everything while it waits; and in this case, I wanted to load up to 14 URLs at a time. Java threads provided the solution - another case where Processing's ability to call on Java for backup was extremely useful.
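As a rough sketch of this kind of background loading (not the applet's actual code): the class below hands each URL to a worker thread and returns immediately, so a draw loop can keep running. The names BackgroundLoader, startAll and waitFor are hypothetical, the fetch function stands in for a blocking call such as loadStrings(), and it uses java.util.concurrent rather than hand-rolled Thread objects.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BackgroundLoader {
    // Submit one blocking fetch per URL to a small thread pool and return at once.
    // The caller keeps the Futures and checks them later.
    static Map<String, Future<String[]>> startAll(List<String> urls,
                                                  Function<String, String[]> fetch,
                                                  int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        Map<String, Future<String[]>> jobs = new LinkedHashMap<>();
        for (String url : urls) {
            Callable<String[]> task = () -> fetch.apply(url);
            jobs.put(url, pool.submit(task));
        }
        pool.shutdown(); // accept no new tasks; submitted ones still run to completion
        return jobs;
    }

    // Block until one job finishes; a failed or interrupted fetch yields an empty array.
    static String[] waitFor(Future<String[]> job) {
        try {
            return job.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new String[0];
        } catch (ExecutionException e) {
            return new String[0];
        }
    }
}
```

In an interactive sketch the draw loop would test each job with isDone() once per frame and only read the result when it is ready, rather than calling waitFor and blocking.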

The first time I successfully loaded one of these scans - a crinkled, typewritten page, encrusted with notes - was a real thrill. What this shows is that given the opportunity, interactive visualisation can provide not only insights into the structure and content of an archival collection; it can also provide an interface to the (digitised) collection itself. If the text cloud and histogram visualisations hint at historical events in the items' content, the page images let us verify or explore their leads in the primary sources. For example in the 1927 deportation items above, the digitised documents reveal cases where recent migrants were deported to their country of origin because of mental illness. The Immigration Act (1901-1925), quoted in these documents, gives the Minister the power to deport recent arrivals who are convicted criminals, prostitutes, or "inmates of insane asylums." Not what I expected to find - but that's good, if the aim here is exploration. There's an amazing wealth of material in here - and it is beautifully material: the screen grab above shows a page from item 1921/22488, which documents the theft of a pearling lugger by its (indentured) Japanese crew. This page shows a handprint of one of the men, Unoske Shimomura, taken in 1914.

You can download the A1 Explorer applet for Mac, Windows and Linux (each is a 1.8Mb Java executable). System requirements are pretty minimal, though you will need a network connection to load images. One more caveat: the user interface is very rudimentary - again UI is not the focus of my research here - so below is a quick cheat sheet that should be enough to get you going. I'd love to hear your feedback on it, or any interesting discoveries you've made.

A1 Explorer Cheat Sheet

Text Cloud view

hover over words to see correlations, item distributions and numbers
click a word to load a list of its items into the lower pane
to exclude a word and regenerate the cloud hold down '-' and click the word
to focus on a word hold down '+' and click the word
to remove a focused or excluded word, click on it in the central info bar
use the up and down arrow keys to scroll through the items list in the lower pane
click on an item in the list to load its page images and switch to document view

Document view

page through the document with the left and right arrow keys
drag the page image to move it
press 'Z' to zoom the image up
press 'H' to load a higher-resolution image of the current page
press 'T' to revert to text-cloud view

Inside A1 - Text Clouds from Item Titles

2009-07-22T15:37:00.009+10:00

The final phase of the project was to focus in on Series A1, and explore techniques for visualising the items it contains. First, a few basic stats on the task at hand. A1 contains some 64,000 registered Items, dating largely from the period 1903-1939. It was recorded to by Agencies including the Department of Home Affairs, the Department of the Interior, and the Department of External Affairs. In the dataset I am working with, each Item has a title, contents start and end dates, a control symbol, and a barcode. Other than dates, the most informative data here about the contents of the item, is the title. That raises some interesting problems: the title is a more or less unstructured field of text. Titles range from "August ZALEWSKI - naturalisation." to "International conference re Bills of Exchange [0.5cm]" and "Northern Territory. Pastoral Permit No.256 in the name of C.J. Scrutton."

The initial approach was to use simple word-frequency techniques to gain a sense of the range and distribution of text in the titles. If we take all 64,397 titles, and split them into their constituent words, and exclude some uninteresting words ("of","and","to","with","the","for", "from") the 150 most frequently occurring words look like this. Note that here text size is proportional to the square root of the word count - in other words text area is proportional to word frequency.
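The word-frequency step can be sketched roughly like this. It is a minimal illustration rather than the original sketch code: the class and method names are made up, the tokenising rule (split on anything that is not a letter or digit) is an assumption, and it counts word occurrences rather than the number of items containing each word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TitleCloud {
    // The excluded words listed in the post.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("of", "and", "to", "with", "the", "for", "from"));

    // Count how often each non-stop word occurs across all titles (case-insensitive).
    static Map<String, Integer> countWords(List<String> titles) {
        Map<String, Integer> counts = new HashMap<>();
        for (String title : titles) {
            for (String word : title.toLowerCase().split("[^\\p{L}0-9]+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // The n most frequent words, highest count first; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    // Text size grows with the square root of the count, so text area tracks frequency.
    static double fontSize(int count, int maxCount, double maxSize) {
        return maxSize * Math.sqrt((double) count / maxCount);
    }
}
```

With this scaling a word that occurs a quarter as often as the most common word is drawn at half its height, which keeps the dominant terms from swamping the display.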

Naturalisation and certificate jump out fairly dramatically. In fact looking at the numbers, over 47000 items contain "naturalisation" - that's around 73% of the Series. Some 17,500 items contain "certificate" - 27%. A quick inspection of the records verifies this impression: the vast majority of the records listed are naturalisation certificates, or similar documents. Also notable in this image is the large number of names. Browsing the records suggests that these appear because the naturalisation documents always include the applicant's name. But underneath these layers are a wide range of more descriptive terms: "war", "papua", and "immigration", for example. Despite the dominance of the naturalisation records, the coverage of this list - the number of items with title words appearing in it - is quite high: over 60,000 of the 64,000 records are represented here, about 94%.

The text cloud gives an effective overview of the collection titles, compressing a huge mass of textual content into a single screen. But 6% of that content is unrepresented here; if this is our interface to the collection, that 6% is effectively invisible. As an initial experiment, I regenerated a cloud that excluded all items containing "naturalisation". The resulting cloud (below) covers some 14,500 items; as expected the names have all but disappeared, but more interestingly there is a rich set of new descriptive terms that were previously buried under the naturalisation records. If we add the coverage of this cloud and the first (14,571 plus the 47,058 containing "naturalisation") we get a total coverage of about 96%; so some, but not all, of that invisible 6% is now represented.

The other addition here uses interaction to extract more information from the cloud. One disadvantage of text clouds is the way they relentlessly decontextualise, breaking the local relations between terms. The lines between terms here - displayed on rolling over each term in the cloud - are an attempt to restore some of that context. They show links between terms that occur together in Item titles; so in the image above we can see that "new" occurs with "guinea" very frequently (not surprisingly). More informative though is that "employment" and "staff" are also correlated. Note also that "papua" is not strongly correlated with "guinea" - a bit of history explains why; Papua and New Guinea were administratively separate until 1945. So here a simple interactive visualisation device adds new context to the display and prompts new questions about the content.
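One way to compute these term-to-term links is to count, for every pair of cloud terms, how many titles contain both. The sketch below is an assumption about how that could be done, not the code behind the image; CoOccurrence, count and strength are hypothetical names, and the raw pair count stands in for whatever correlation measure the display actually uses.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CoOccurrence {
    // For each cloud term, count the titles it shares with every other cloud term.
    static Map<String, Map<String, Integer>> count(List<String> titles, Set<String> terms) {
        Map<String, Map<String, Integer>> links = new HashMap<>();
        for (String title : titles) {
            // Collect the distinct cloud terms present in this title.
            Set<String> present = new TreeSet<>();
            for (String word : title.toLowerCase().split("[^\\p{L}0-9]+")) {
                if (terms.contains(word)) present.add(word);
            }
            // Record every ordered pair, so a link can be looked up from either end.
            for (String a : present) {
                for (String b : present) {
                    if (a.equals(b)) continue;
                    links.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return links;
    }

    // Number of titles containing both terms; 0 if they never occur together.
    static int strength(Map<String, Map<String, Integer>> links, String a, String b) {
        return links.getOrDefault(a, Collections.emptyMap()).getOrDefault(b, 0);
    }
}
```

On rollover, the lines for a term can then be drawn with a weight or opacity derived from strength, so that pairs like "new" and "guinea" stand out against weaker links.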

In the next post: expanding these techniques into an interactive browser that can take us from a whole-Series view, to an image of a specific document, in a few clicks.

Bi-Directional Series Links

2009-07-21T14:22:00.009+10:00

Well, the project is officially complete now - thanks to all who attended the presentation at the Archives, it was a lot of fun. In the next week or two I'll be retrospectively documenting the final stages of the project's development.

In a comment on the last post, Tim Sherratt observed that there seemed to be fewer links between Series than there should be. I did some digging in the data and discovered that links in the Archives' data are uni-directional. In other words, when Series A lists Series B as a related Series, Series B does not automatically reciprocate. The same is true for succession and control relationships: Series data lists subsequent Series links, but not preceding Series (which are subsequent Series relationships in reverse). Controlling links are listed, but not controlled by relationships.

In order to represent these links I first had to rewrite the parsing code so that when it finds a link, it simply records the link in two Series - at both ends of the link - rather than one. Thinking about directionality I decided that succession links all could be represented in the same way, regardless of direction: since the grid layout shows chronological ordering, that relationship is already clear (succession relationships are blue, above). Related Series could also be represented symmetrically - if Series A is related to Series B, surely B is also related to A (related links are yellow, above). Control relationships however are highly directional, so I introduced a new link type to represent the controlled by relationship. In the image above the controlled by links are purple, and lead from a large series to a number of smaller ones.
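A minimal sketch of recording each link at both ends, under stated assumptions: SeriesLinks, LinkType and the method names are hypothetical, and I am assuming the source data lists a "controls" link on the controlling Series, so that the reverse entry becomes "controlled by" on the Series it controls. Succession and related links are stored with the same type in both directions, as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SeriesLinks {
    enum LinkType {
        SUCCESSION, RELATED, CONTROLS, CONTROLLED_BY;

        // Only control links are directional; the other types are their own reverse.
        LinkType reverse() {
            switch (this) {
                case CONTROLS: return CONTROLLED_BY;
                case CONTROLLED_BY: return CONTROLS;
                default: return this;
            }
        }
    }

    static final class Link {
        final String targetId;
        final LinkType type;
        Link(String targetId, LinkType type) { this.targetId = targetId; this.type = type; }
    }

    private final Map<String, List<Link>> linksBySeries = new HashMap<>();

    // Record the link on the Series that lists it, and the reverse link on its target.
    void addLink(String fromId, String toId, LinkType type) {
        linksBySeries.computeIfAbsent(fromId, k -> new ArrayList<>()).add(new Link(toId, type));
        linksBySeries.computeIfAbsent(toId, k -> new ArrayList<>()).add(new Link(fromId, type.reverse()));
    }

    List<Link> linksOf(String seriesId) {
        return linksBySeries.getOrDefault(seriesId, Collections.emptyList());
    }

    // Total number of stored link entries across all Series.
    int totalLinks() {
        int total = 0;
        for (List<Link> links : linksBySeries.values()) total += links.size();
        return total;
    }
}
```

Because every parsed link produces two entries, the stored total is exactly twice the number of links in the source data.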

This tweak has a number of important results. Not surprisingly, the number of links increases - it doubles, in fact - providing more impetus to explore the context around a focused series. Also, the addition of the controlled by relationship makes small controlling Series far more findable because they are often linked from large Series, as in the image above.

You can download this latest (and for now final) version here for Mac, Windows or Linux (5Mb each, and requires 1280 x 1024).

Update 20th August - updated these sketches to fix a memory allocation problem

Browsing Series and Agencies

2009-05-06T13:56:00.012+10:00

The latest step in building a browsable Series-based visualisation has been to add in Agency data. The previous post made a first step towards integrating Agencies into the visualisation - essentially using their ID codes to colour the Series squares. But Agencies also offer a powerful way to add context to our visual exploration of the Archives collection, as this post will show. To skip straight to the latest visualisation, download the executables for Mac, Windows or Linux (about 5Mb each - and needs 1280x1024).

To begin with I wanted to get a sense of the quantitative relationships between Series and Agencies. After converting the Series dataset to JSON and ingesting it to Processing, I generated some simple "utility" visualisations.

This graph shows all 9000 or so Agencies in my dataset, ranked in order of the number of Series that they record into. Agencies are arranged from left to right along the x axis; number of Series recorded is graphed on the y axis (click the graph to see a larger version over on Flickr). This shape is a classic power law distribution: there are a very small number of Agencies recording a large number of Series, and conversely, very many Agencies recording very few Series. For example, there are only around 100 Agencies that record to more than 100 Series; and the vast majority of Agencies record to very few series (less than 10, say).

This graph shows the same relationship from the other side: here Series are ranked along the x axis, and the y axis shows the number of Agencies recording into each Series. Again this is a power-law distribution, but the quantities are much smaller. We can see that around two thirds of all Series have only one recording Agency; almost all have fewer than 10; and a tiny number have as many as 45.

Where does this get us, in terms of making a visualisation of the whole collection? It shows that Agencies offer a useful way to break the collection up into manageable-sized subsets; the vast majority of Agencies record to fewer than 100 of the 57.5 thousand series. That's a significant refinement. At the same time most Agencies record to more than one Series: so Agencies should be able to show relationships between usefully-sized groups of Series.

The next step was to integrate the full Agency data with the previous Series visualisations; this was relatively straightforward, and again HashMaps were invaluable in cross-linking Series and Agency data. I rebuilt the floating caption display, to handle a complete listing of recording Agencies and their titles. This alone adds a wealth of context to the visualisation: Series with relatively generic titles ("Correspondence files") are brought into focus with a descriptive list of their Agencies.

For each browsed Series, we can then show other Series recorded to by its Agencies. In the interactive sketch we can select a Series with a mouse click, turn on the Agency display (hit "A"), then scroll through the agencies with the arrow keys. The floating caption box allows us to investigate highlighted Series, select them in turn, and so on. The result is contextually rich and far more browsable than before. The scale of these Agency-based groups is, as the first graphs show, an effective way to break the collection down. The highlights give a slight counterbalance to the size-based bias of the packed-square visualisation, leading us out into smaller series. Also, because the floating caption shows all the Agencies for each highlighted Series, we build up a sense of the range of related Agencies in a certain area; so highlighting CA 51, the (mid-century) Department of Immigration (Central Office), and browsing its Series, reveals other immigration-related Agencies. Have a browse: download the visualisation as a Java executable for Mac, Windows or Linux (about 5Mb each - and needs 1280x1024).

Finally, these latest visualisations also include an important tweak in the packed-square visualisation model. Tim Sherratt commented on an earlier post that the "hollow box" metaphor is potentially misleading, because it's based only on the ratio between shelf metres and recorded items. In other words, the way that a "hollow" suggests un-registered items is just not right. While browsing these visualisations I came across another, more serious problem with the "hollow" approach. Because the overall size of a square is determined by its shelf space, it's possible to have very small squares that represent Series with many thousands of recorded items; as many or more than physically larger Series. The solution is simple, once you think of it: visualise both items and shelf metres. Now, the area of the inner (brighter) square is proportional to items, while the area of the outer (duller) band is proportional to shelf metres. The result is that Series that are physically small, but contain many items, suddenly grow in size (in the visualisation these appear with very thin borders). Interestingly some very recent Series pop out, including a couple documenting the 2005 UN Oil-for-Food / AWB inquiry: with zero shelf metres, I wonder if these are "born digital" records?

This will be the final step, for the moment, in visualising the whole collection. With a public lecture at the Archives coming up I need to move on to the Items level, visualising the contents of A1. More on that shortly.

[update - the links to the executables were broken, sorry: fixed now (11 May)]

Series Links

2009-04-12T04:56:00.003+10:00

After a hiatus over summer and the start of the academic year, I finally have some more progress to report. Using the packed square visualisations as a base, I've been adding more data elements from the Series dataset, and working towards visualising relationships between Series and the Agencies that generate their content. This has taken longer than planned due to more data-plumbing issues, which I'll come to later.

The Archives' Series data records two kinds of links: Series-Agency and Series-Series. The latest sketches make a start in visualising both of these. Here colour (or more accurately hue) is derived from the first listed Recording or Controlling Agency. As the CRS Manual explains, the Recording Agency generates the records; while the Controlling Agency is the "agency currently responsible for some or all of the functions or legislation documented in records." In either case, given that there are some 9000 Agencies involved here, how do we visualise this link? For the moment I'm doing it in the simplest possible way: low Agency numbers have low hue values (red), while high Agency numbers have high hue values (blue to purple). There are a number of problems with this - notably that it's impossible to tell the difference between, for example, CA 11 (Treasury 1901-1976) and CA 12 (the PM's Department 1911-1971) - which is a very significant difference. These two images show the difference between visualising Recording Agency (top) and Controlling Agency (bottom).

Series data also records links to other Series, which come in three flavours: Succession (between previous and subsequent series), Controlling (where one Series acts as an index or register for another) and Related (for other relationships). In this dataset (57.5k Series) there are some 7.5k succession links, 6.2k controlling series links, and 25k related series links. My initial attempt to render all of these (by just drawing a line between linked Series) resulted in a giant, unreadable cloud. A simpler and more legible approach is to only draw links for one Series at a time.

In the latest interactive sketch, a single Series' links are drawn as coloured lines: controlling links are red, succession links are blue, and related Series links are yellow. Clicking a Series selects it and draws its links, rendering linked series in colour while dimming the rest to grey (clicking the Series again unselects it and returns to Technicolor mode). This begins to show the potential for a visual interface to the collection, I think. Here's the applet - note that it's fairly screen and memory-hungry. Feedback welcome, as always.

There are a few changes behind the scenes here as well. As outlined earlier, XML has been a mixed blessing: easy to use and human-readable, but the file sizes are large, and the DOM parsing method used in Processing is memory-hungry and slow. For these sketches I've switched to JSON, a simple, lightweight data format with its own Java library. So far, JSON is working nicely; its file sizes are around half those of the equivalent XML files, parsing is much faster, and the parsing code is simpler and neater. This thread has lots of useful info on implementing JSON in Processing.

HashMaps are the other new toy here. I'd never quite found a use for them until now, but because they easily connect an object (in this case a Series) with an index string (in this case a Series ID), they are essential here for building Series-Series links. I simply store each Series' links as a list of ID strings, then to draw the link, feed each ID into a HashMap to access the whole Series object. Thanks to @blprnt and @toxi for reminding me why I needed HashMaps!
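In outline, the lookup works something like the sketch below. The class names, fields and the idea of silently skipping IDs that are missing from the dataset are my illustration here, not a copy of the sketch code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SeriesIndex {
    static final class Series {
        final String id;
        final String title;
        // Links are stored as plain ID strings, exactly as they appear in the data.
        final List<String> linkedIds = new ArrayList<>();
        Series(String id, String title) { this.id = id; this.title = title; }
    }

    // Series ID -> Series object.
    private final Map<String, Series> byId = new HashMap<>();

    void add(Series series) {
        byId.put(series.id, series);
    }

    // Turn a Series' list of ID strings into the Series objects at the other end.
    // IDs that are not in the index are skipped.
    List<Series> resolveLinks(Series series) {
        List<Series> targets = new ArrayList<>();
        for (String id : series.linkedIds) {
            Series target = byId.get(id);
            if (target != null) targets.add(target);
        }
        return targets;
    }
}
```

Drawing a selected Series' links is then a loop over resolveLinks, with one line drawn from the selected square to each target's position.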

Next: digging deeper into the complexities of Agency-Series relations.

In the News

2009-01-20T08:28:00.004+11:00

The Visible Archive project was in the Canberra Times yesterday, with a nice full-page feature written by Nyssa Skilton (photo by Marina Neil). Frankly I would have preferred more of the visualisations and less of my quizzical mug but it's not a bad photo. If you've arrived here via the CT, welcome, and have a look around...

Packing Them In

2009-01-14T14:28:00.004+11:00

Up to this point the grid visualisations have taken a very simple approach to space: dividing it up equally among the data points, and then using hue and brightness to show attributes such as shelf metres and items. This has the advantage of simplicity, but it has a major disadvantage too: it's attempting to represent size (shelf metres or number of items) using other means. Why not just use size for size? Read on for the blow-by-blow account, or skip straight to the end result: the latest interactive sketch.

Before Christmas I had a first stab at this problem. The approach was basic, as usual. Maintaining the chronological ordering of the series, I drew each series as a square with area proportional to number of items. The packing procedure was simply: starting where the previous series is, step through the grid until we find a big enough space to draw the current series. The result looked like this:

After weeks of regular grids, this was a sight to see. The distribution of the sizes of series (overall and through time) is instantly apparent. This ultra-simple packing method is far from perfect, though, as you can see from all the black gaps. Because it tiles one series at a time, in strict sequence, and only searches forwards through the grid, gaps appear whenever a large square comes up as the search scrolls along to find a free space.

The main restriction here is the chronological ordering of the series. I need to maintain that ordering, but at the same time I need to be able to pack the squares more efficiently, which means changing the order. Luckily there's a loophole: as the first histogram showed, many series share the same start date. So we can change the sequence of those same-year series, without disrupting the overall order. We can pack them starting with the biggest squares and pack in the smaller ones around them. The latest sketches use this method, which can be described in pseudocode:

Make a list of series with a given start year
Working from biggest to smallest, pack each series into the grid, from a given start point: restart the search from the start point each time.
Keep track of the latest point in the grid that this group occupies. For the following year, start from this point.

This improves the packing dramatically:

In this image square area is mapped to shelf metres; as in the earlier sketch hue is derived from the series prefix (roughly A = red, Z = blue). One artefact is apparent here - those lines of squares graded by size occur when nothing gets in the way of the packing process. As a byproduct of this, the biggest squares in those sequences often mark the start of a new year in the grid.
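The packing steps described above can be sketched roughly as follows. This is one reading of the pseudocode rather than the sketch itself: the grid is a fixed array of cells, sizes are given as whole-cell side lengths, and "the latest point this group occupies" is taken to mean the furthest top-left cell (in left-to-right, top-to-bottom order) at which a square from that year was placed. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class SquarePacker {
    static final class Square {
        final String id;
        final int year;   // start year of the series
        final int side;   // side length in grid cells
        int col = -1, row = -1; // top-left cell once placed
        Square(String id, int year, int side) { this.id = id; this.year = year; this.side = side; }
    }

    private final int cols, rows;
    private final boolean[][] used;

    SquarePacker(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.used = new boolean[rows][cols];
    }

    // Pack year by year; within a year, biggest squares first.
    // Returns false if any square could not be placed.
    boolean pack(List<Square> squares) {
        TreeMap<Integer, List<Square>> byYear = new TreeMap<>();
        for (Square s : squares) byYear.computeIfAbsent(s.year, y -> new ArrayList<>()).add(s);
        boolean allPlaced = true;
        int start = 0; // cell index where the current year's search begins
        for (List<Square> group : byYear.values()) {
            group.sort((a, b) -> Integer.compare(b.side, a.side));
            int latest = start;
            for (Square s : group) {
                int placedAt = place(s, start); // restart from the year's start point each time
                if (placedAt < 0) allPlaced = false;
                else latest = Math.max(latest, placedAt);
            }
            start = latest; // the following year starts from here
        }
        return allPlaced;
    }

    // Scan forward from a cell index for the first spot where the square fits.
    private int place(Square s, int from) {
        for (int i = from; i < cols * rows; i++) {
            int c = i % cols, r = i / cols;
            if (fits(c, r, s.side)) {
                for (int y = r; y < r + s.side; y++)
                    for (int x = c; x < c + s.side; x++) used[y][x] = true;
                s.col = c;
                s.row = r;
                return i;
            }
        }
        return -1;
    }

    private boolean fits(int col, int row, int side) {
        if (col + side > cols || row + side > rows) return false;
        for (int y = row; y < row + side; y++)
            for (int x = col; x < col + side; x++)
                if (used[y][x]) return false;
        return true;
    }
}
```

Restarting each search from the year's start point is what lets small squares fall back into gaps left beside larger ones, at the cost of rescanning occupied cells.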

The latest sketches integrate both shelf metres and described items, and finally add interaction to this visualisation. To combine metres and items the squares are drawn as above, with area proportional to shelf metres; then overlaid with a second grey square, whose size is inversely proportional to the number of items in the series. The result is that series with many items are full of colour, and series with few items have large "hollows" and narrow coloured borders.

Again, there are relations between series here that are instantly apparent. It's easy to see those series that have lots of shelf metres but relatively few items, as well as even medium-sized series with many items. I couldn't find A1 in the earlier grids (though Tim Sherratt from the Archives could); it is much more prominent here. Tim also pointed out that B2455, one of the big series of WWI service records, didn't jump out of the grids: it's very prominent here. As well that cluster of post-War migration series spotted in the items grid reappears here. Promising signs for the usefulness of this visualisation.

All this is best demonstrated in the interactive version, which like the previous grids adds a caption overlay and some year labels on the vertical axis. Browse around and see what you can find - feedback very welcome.

Grid Browser (now with 100% more data)

2008-12-06T12:01:00.014+11:00

After being completely buried under end-of-year admin for a few weeks, it's great to be back to work on this project. I've been working on plumbing in the latest dataset from the Archives, which has doubled in size to around 57,500 series. In an attempt to create a browsable overview of the whole collection, I have been developing the earlier grid sketches, feeding in more data, and extra parameters. Also new in this dataset are two interesting features of archival series: items - number of catalogued items in the series - and shelf metres - the amount of physical space the series occupies. In this interactive browser, you can navigate around the whole collection, and switch between modes that display these parameters.

A brief explanation. Like the earlier grid, series are sorted by start date (still contents start date, rather than accumulation, for the moment) then simply laid out from top left to bottom right. In this version I've added some year labels on the Y axis, which show the distribution of the series through time. Hue is mapped directly to date span: red series have a short date span, blue have a long span. The four modes in this interactive change the mapping for brightness. In the default display brightness is mapped to items (I); M switches the brightness key to shelf metres; P shows items per shelf metre; and S switches the brightness key off (showing span/hue only).

Both these new parameters have a wide range and a very uneven distribution, and as you can see in the visualisation there are many series with zero items and/or zero metres. In fact around 30000 series (over half this collection) have zero digitised items; while around 2600 have between 100 and 1000 items, and 13 have more than 10000 items. Around 20000 series have zero shelf metres, around the same number have 0.1-1m, around 10000 have between 1m and 10m, and the rest have more than 10m - with a couple of dozen series with more than 1km of shelf space! It's important to remember, as Archives staff have mentioned to me, that items here refers to digitised items. Series with zero listed items aren't empty, they just haven't been digitised. Similarly I suspect that a value of zero shelf metres suggests that the data doesn't exist. Even if it can't be taken at face value, items is an interesting metric because the Archives digitises records largely on the basis of demand from users; so a series that is frequently requested is more likely to be digitised. Items, then, is partly a measure of how interesting a series is, to Archives users.

The items view of the grid allows us to see, for example, that there are more digitised items in series commencing in the 20s and 30s, than there are in series commencing in the 60s and 70s. We can also see a dense band of well-digitised series from the late 90s onwards. I don't know, but I'd suspect that these are "born digital" records - no digitisation required. The most striking feature of the items graph is the narrow red streaks around 1950: these are Displaced Persons records from 1948-52, each series corresponding to a single incoming ship (above). These records show up here because they are well digitised (interesting) but also because there are many sequential series forming visual groups. There are other pockets of "interestingness", but they are less obvious. This reveals one drawback of this grid layout, which is that related series are not necessarily grouped together. I'm hoping to address this when I start looking at agencies, functions, and links between series.

A few technical notes. After running into problems storing data in plain text, I changed the code to read the source XML in, pick out certain fields or elements, and write the data back out as XML. I used Christian Riekoff's ProXML library for Processing, which makes the file writing part very easy (Processing's built-in XML functions don't include a file writer). This worked well, except when it came to exporting web applets, which just refused to load. Rummaging around in the console log, and turning on Java's debugging tools (thanks Sam) showed that the applet was running out of memory while trying to load the XML - admittedly a fairly hefty 27Mb uncompressed. So for the web version at least, I have reverted to storing the data as plain text, which immediately reduced file size and loading time by a factor of 4, and solved the applet problem. Since then Dan and Toxi have suggested alternative ways of handling the XML, such as SAX, which streams the data in and generates events on the fly, rather than loading the whole XML tree into memory before parsing it. I'll be looking into that for any serious web implementation of this stuff.

Finally, with almost 60000 objects on the screen, this visualisation raises some basic computation and design issues. Even using accelerated OpenGL, this is a tall order; I found I was getting around one frame per second on a moderately powerful computer. I have solved the issue here with a simple workaround (thanks Geoff for this one) - pre-render an image of the grid, then overlay the interactive elements. Performance issue solved. But there are some limitations: this approach means the grid layout is fixed. It's a significant move away from a truly "dynamic" visualisation, where all the elements are drawn on the fly. For visualisations at this scale, I don't think there's any other way, but as the design develops I'll be trying to push back towards the live, dynamic approach, as the dataset permits.

Series Title Cloud

2008-10-17T20:17:00.005+11:00

So, it's no Wordle, but it's my first text cloud. This visualises the 250 most common words in the titles of each series in the initial dataset. It's also the first time I've mined the titles for data, and another step in the process of feeling out the attributes of this dataset. I've excluded a few "stop" words ("and","with","the","for") and anything with fewer than three characters, but otherwise this is a raw representation of the titles.

It shows first of all that the most frequently occurring terms in series titles are either generic descriptors ("files", "correspondence") or metadata, referring to the organisation or structure of the series, rather than its content ("alphabetical", "prefix", "single"). But then after the top twenty or so words, there's a large number of more descriptive terms. The difference in scale between these layers is significant; for example "series" and "files" occur in around 10,000 series titles (about a third of all series), whereas "drawings" occurs in around 800 series, and "HMAS", "Papua" and "Lighthouse" all occur in around 200 series. Some odd features show up as well, for example "Yokohama" and "specie"; turns out there are a large number of series consisting of records from the Yokohama Specie Bank, a Japanese bank involved in trade with China and Australia around the mid-C20th - it gets a mention in this 1940 telegram from Menzies to the High Commissioner in London. I wonder how the records ended up in the Archives?

Next, to try integrating text clouds as interfaces / overlays for the previous visualisations.

Something Completely Different: Interactive Grid

2008-10-15T15:09:00.006+11:00

I've been considering how to develop the stack histograms, but in the meantime decided to quickly trial a completely different approach to visualising the Series dataset. I don't want to get carried away with one metaphor / approach, when there may be others worth exploring. So, in this visualisation some 27000 series are laid out in a simple grid. Series are ordered by (contents) start date, and sequenced left to right, top to bottom. As in the last histograms, date span is mapped to hue, so long spans are blue, short spans are red. I've been having some weird issues with web applets so far, but this one seems to work (without OpenGL), so there's also an interactive version to play with.

This layout has a number of advantages over the stack approach. The primary one is visual density. This layout makes it possible to see all the series, in a single visual field. In the examples here the grid is 200 columns wide and around 135 rows high; each series is a 4 x 4 pixel square. Even allowing for 40000 series in an expanded dataset (more of which soon), this scale is functional. A related advantage is browsability. In the interactive version of this sketch, we can simply mouse over series to see their details; a usable, if still unstructured way to browse the collection.
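The arithmetic behind this layout is simple enough to sketch. The class below is illustrative only (SeriesGrid and its methods are made-up names): it maps a series' position in the sorted list to a pixel position, and maps a mouse position back to a series for the rollover.

```java
class SeriesGrid {
    final int columns;   // e.g. 200
    final int cellSize;  // e.g. 4 pixels
    final int count;     // number of series in the sorted list

    SeriesGrid(int columns, int cellSize, int count) {
        this.columns = columns;
        this.cellSize = cellSize;
        this.count = count;
    }

    // Rows needed to hold every series (rounding up for a partial last row).
    int rows() { return (count + columns - 1) / columns; }

    // Pixel position of the top-left corner of the i-th series, filling left to right, top to bottom.
    int xOf(int index) { return (index % columns) * cellSize; }
    int yOf(int index) { return (index / columns) * cellSize; }

    // Index of the series under the mouse, or -1 if the mouse is outside the grid.
    int indexAt(int mouseX, int mouseY) {
        if (mouseX < 0 || mouseY < 0) return -1;
        int col = mouseX / cellSize;
        int row = mouseY / cellSize;
        if (col >= columns) return -1;
        int index = row * columns + col;
        return index < count ? index : -1;
    }
}
```

With 27000 series, 200 columns and 4 pixel cells this gives 135 rows, so the whole grid fits in an 800 x 540 pixel area.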

The grid throws away the emergent histogram-form of the stack approach. However many related structures are still apparent: for example the pattern of long-span series having early start dates is clear; and the interactive version also reveals the date distribution; the reddish band in the middle of the grid is the wave of short series around WWII. One thing on the list to try is to add a date key to the vertical axis. This would effectively show the same thing as the tall peaks of the original histogram: the relative numbers of series commencing over time. The grid simply structures space according to the data elements (the series), so that the relation of date to visual space becomes nonlinear; but the relationship is still there and easily revealed.

Next on the list of things to try is a word-frequency visualisation based on series titles. This should provide a way to browse the grid more effectively; after that, I need to get to work on a new, expanded dataset with more series, but also useful quantitative measures like shelf space and digitised items, for each series. Then, more layers of structure and browsability: relationships between series, agency and function.

Stacked Series Histogram

2008-09-30T11:27:00.010+10:00

I've been developing the year-span histograms posted earlier. In these sketches, the series are again represented as single horizontal lines that correspond to their date spans. To address the problem of series overlapping each other, this sketch sorts and stacks the series into a single big, non-overlapping heap. The method is fairly simple. First, sort all the series by their span, longest to shortest (this involved learning to implement Java's Comparable interface). Then, place the series in the stack, longest to shortest, bottom to top. A simple 2D array is used to keep track of series positions and check collisions; if a collision is found, simply try the next row up and repeat until a space is found.
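For anyone wanting to try this, here is a minimal Java sketch of the sort-and-stack method as described. It is a reconstruction, not the sketch's own code: the names `Series` and `SeriesStacker` are hypothetical, end years are assumed to be inclusive (so a series ending in 1910 collides with one starting in 1910), and a growable list of boolean rows stands in for the fixed 2D array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical record of one series' date span.
final class Series implements Comparable<Series> {
    final int startYear;
    final int endYear;
    int row = -1; // assigned by SeriesStacker.stack; row 0 is the bottom

    Series(int startYear, int endYear) {
        if (endYear < startYear) throw new IllegalArgumentException("end before start");
        this.startYear = startYear;
        this.endYear = endYear;
    }

    int span() { return endYear - startYear; }

    // Longest span sorts first.
    @Override
    public int compareTo(Series other) {
        return Integer.compare(other.span(), span());
    }
}

final class SeriesStacker {
    // Assigns a row to every series and returns the height of the heap.
    // Assumes every span lies within [minYear, maxYear].
    static int stack(List<Series> series, int minYear, int maxYear) {
        List<Series> sorted = new ArrayList<>(series);
        Collections.sort(sorted);

        int width = maxYear - minYear + 1;
        List<boolean[]> rows = new ArrayList<>(); // rows.get(r)[y] is true if occupied

        for (Series s : sorted) {
            if (s.startYear < minYear || s.endYear > maxYear) {
                throw new IllegalArgumentException("series outside year range");
            }
            int from = s.startYear - minYear;
            int to = s.endYear - minYear;
            int row = 0;
            // On a collision, try the next row up until a space is found.
            while (true) {
                if (row == rows.size()) rows.add(new boolean[width]);
                if (isFree(rows.get(row), from, to)) break;
                row++;
            }
            Arrays.fill(rows.get(row), from, to + 1, true);
            s.row = row;
        }
        return rows.size();
    }

    private static boolean isFree(boolean[] cells, int from, int to) {
        for (int i = from; i <= to; i++) {
            if (cells[i]) return false;
        }
        return true;
    }
}
```

Because each series takes the lowest free row at the moment it is placed, shorter series fill gaps left beside longer ones. The packing is greedy rather than optimal, so some gaps remain.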

The result is almost, but not quite, a histogram (because the packing isn't perfect - there are some gaps). Unlike the earlier sketches though, it visualises the total number of series spanning a given year, rather than just the commencing year; this seems a more generally useful feature to visualise. It's interesting to note though that some of the features obvious in the commencing year histogram are less clear here - notably the spikes around Federation and the Wars.
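The quantity the stack approximates, the number of series spanning each year, can also be computed directly. A small Java sketch, assuming each series is given as a hypothetical `{startYear, endYear}` pair with inclusive end years (`SpanCounts` is an illustrative name, not part of the original sketch):

```java
// Hypothetical sketch: counts how many series span each year in a range.
final class SpanCounts {
    // spans[i] = {startYear, endYear}, end year inclusive.
    // Returns counts where counts[y - minYear] is the number of series
    // whose date span includes year y. Spans are clipped to the range.
    static int[] spanning(int[][] spans, int minYear, int maxYear) {
        int[] counts = new int[maxYear - minYear + 1];
        for (int[] span : spans) {
            int from = Math.max(span[0], minYear);
            int to = Math.min(span[1], maxYear);
            for (int year = from; year <= to; year++) {
                counts[year - minYear]++;
            }
        }
        return counts;
    }
}
```

A perfectly packed heap would be exactly as tall, at each year, as this count; the gaps in the stack are the difference between the two.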

The real payoff for the stacking is that now we have a potential interface to the entire collection, at series level. Adding interaction makes it easy to browse the visualisation by year, showing the relation between series in that year and the total collection. Sheer scale is still a problem. This "heap" is more than 10,000 series high - too big to usefully show every series even at one pixel each. Interaction allows zooming and panning (above), which helps. Next, I'd like to be able to filter the heap down to a more manageable size, to a point where this can become the interface to browse through individual series.

Year Span Histo(ry)grams

2008-09-20T08:43:00.006+10:00

One of the obvious limits of the first histogram is that series - or more specifically series contents, here - have an end date as well as a start date; and the date span of a series is far more informative than the start date alone. So here's a first attempt at introducing date span into the visualisation. It's really a minimal tweak of the previous sketch; instead of drawing a vertical line with the histogram count (number of series commencing at a given date), I draw a stack of translucent horizontal lines from start to end year. I've also increased the scale here, so that each series line is a single pixel high; and the grid lines are now at 10 rather than 25 year intervals. Click for the full res image.

This adds a lot of visual detail, but it also obscures quite a lot. The drawing order is essentially arbitrary (it's the order of series records in the dataset as provided) and there's no collision checking, so all the lines are just overlaying each other. We can get a vague sense of the range of date spans from the top of the "spike" years, where a single stack of series lines is more clearly visible; and we can see that although the series start dates drop off sharply after 1960 (as shown in the first histogram), many series have end dates in the last 20 years.

In another quick tweak I added colour to the graph, in an attempt to pull out some of what's hidden here. By simply mapping the duration of a series (in years) to the line's hue, we can see more about the overall distribution of durations. It seems, for example, that there is a small subset of series that commence around 1900 or earlier, with very long durations. It also seems that most of the series around WWII had quite short date spans - plausible enough. So we can see a bit more here but the overdrawing problem is still significant. My next step will be to address this, perhaps by managing the drawing / stacking order to reduce overdrawing; or adding some interaction that will allow date-based highlighting of series stacks. Also in my plans is a way to stack series without any overlaps at all; a kind of packing problem. Plenty to do...
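A duration-to-hue mapping of this kind might look like the following Java sketch. The specifics are assumptions for illustration: a linear map running from red for the shortest spans to blue for the longest, clamped at a chosen maximum duration, and full saturation and brightness. `SpanHue` is a hypothetical name; `Color.HSBtoRGB` is the standard `java.awt` call, standing in for Processing's HSB colour mode.

```java
import java.awt.Color;

// Hypothetical sketch: maps a series' duration in years to a hue.
final class SpanHue {
    static final float RED_HUE = 0f;           // shortest spans (assumed)
    static final float BLUE_HUE = 240f / 360f; // longest spans (assumed)

    // Linear interpolation between red and blue, clamped to [0, maxDurationYears].
    static float hueFor(int durationYears, int maxDurationYears) {
        if (maxDurationYears <= 0) {
            throw new IllegalArgumentException("maxDurationYears must be positive");
        }
        float t = durationYears / (float) maxDurationYears;
        t = Math.max(0f, Math.min(1f, t));
        return RED_HUE + t * (BLUE_HUE - RED_HUE);
    }

    // Packed ARGB colour at full saturation and brightness.
    static int rgbFor(int durationYears, int maxDurationYears) {
        return Color.HSBtoRGB(hueFor(durationYears, maxDurationYears), 1f, 1f);
    }
}
```

A linear map is only one option; since most durations are short, a logarithmic scale might spread the colours more evenly across the collection.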

Hello World - a Histogram

2008-09-17T14:37:00.007+10:00

After some admin delays, I collected the data from the Archives yesterday, and have been digging in with some excitement. The data consists of three big XML files, totalling around 300MB; initially I have been looking at the largest of these datasets (180MB), which records the 27000+ series in the Archives collection.

Initial data-munging presented some challenges, as expected; many of the records contained HTML in plain text wrapped inside the XML. Archives staff had warned me about this and I'd blithely replied that it would be fine, and the more data the better. Of course the first thing that happened as I attempted to parse the XML with Processing was that the HTML broke the parser. So step one was to make a copy of the dataset without the HTML; a quick grep tutorial later and I was able to use TextWrangler to automate the process of stripping it out, reducing the file size along the way to about 50MB.
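The same kind of regex clean-up could be done in code. The Java sketch below is not the expression I used in TextWrangler; it is a hypothetical illustration that assumes the embedded HTML is limited to a short list of common tags, and that none of those tag names are also used as XML element names in the dataset. `HtmlStripper` and its tag list are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: removes a fixed list of HTML tags from text while
// leaving other angle-bracket markup (the XML structure) untouched.
final class HtmlStripper {
    // Assumed tag list; a real dataset would need this checked against its contents.
    private static final Pattern HTML_TAG = Pattern.compile(
        "</?(?:p|br|b|i|a|ul|ol|li|table|tr|td|font|span|div)\\b[^>]*>",
        Pattern.CASE_INSENSITIVE);

    static String strip(String xml) {
        return HTML_TAG.matcher(xml).replaceAll("");
    }
}
```

The `\b` word boundary is what keeps an XML element such as `<title>` from matching the `<tr>` or `<td>` alternatives. Regex-based stripping like this is fragile; it suits a one-off clean-up of a known dataset better than general use.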

After that the process of getting the data into Processing has been straightforward, and I'm impressed with its ability to ingest a large lump of XML without complaint. As a sort of "hello world" visualisation I decided to make a simple histogram of the entire series dataset by date; specifically, the start date of the contents of each series (click the image to see it without the nasty scaling artefacts, at full resolution). The x axis is year, with a range from 1800 to 2000; the y axis is the number of series with that start date; it's unlabelled here but the maximum value (in 1950) is about 960. Already you can get a sense of the shape of the collection from this image; there are spikes at 1901 and 1914 that correspond, I'd guess, to Federation and World War I; and the next spike is, of course, 1939. One question I can't answer at the moment is why there is such a dramatic drop in the number of series commencing after 1960 - perhaps a change in recordkeeping or the archival process itself? Any thoughts?
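The counting behind a histogram like this is minimal. A Java sketch, assuming the start years have already been extracted from the XML into an array (the parsing step is omitted, and `StartYearHistogram` is a hypothetical name):

```java
// Hypothetical sketch: counts series per start year over a fixed range.
final class StartYearHistogram {
    // Returns counts where counts[y - minYear] is the number of series
    // starting in year y. Years outside [minYear, maxYear] are ignored.
    static int[] count(int[] startYears, int minYear, int maxYear) {
        int[] counts = new int[maxYear - minYear + 1];
        for (int year : startYears) {
            if (year >= minYear && year <= maxYear) {
                counts[year - minYear]++;
            }
        }
        return counts;
    }

    // The year with the highest count (the earliest, if there is a tie).
    static int peakYear(int[] counts, int minYear) {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) best = i;
        }
        return minYear + best;
    }
}
```

Drawing is then one vertical line per year, with its height scaled by the largest count.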

Day One, and A1

2008-07-30T13:36:00.007+10:00

So after signing the contracts at the Archives this morning, I can declare this Day One, and the project has officially started. The contract revealed some good stuff. The project will focus, as proposed, on two levels: visualising high level structures in the entire collection (series, agencies and functions); and within an individual series. Even better, that individual series will be A1, a huge collection dating from Federation up to WWII - more than 20,000 records that occupy over 450 metres of shelf space! I understand that this series is also very highly digitised - which raises the prospect of working with not only the catalogue data, but the digitised records themselves.

Can't wait to get my hands on the data; I'll be meeting with the Archives again soon to discuss starting points, data formats, and so on.

Meantime, welcome to the brand new project blog. Your comments and thoughts are always welcome - for now I'm especially interested in related work in the visualisation of cultural datasets and digital archives. I'll be posting some of my own research in those areas soon, but if you have any pointers, send them along. Here's a short outline of the project, for starters.

Project Outline

2008-07-30T12:58:00.000+10:00

This outline, presented to the Archives as a refinement of the original proposal, summarises the context, aims and outcomes of the project.

As archives are increasingly digitised, so their collections become available as rich, and very large, datasets. Individual records in these datasets are readily accessible through search interfaces, such as those the Archives already provides. However, it is more difficult to gain any wider sense of these cultural datasets, due to their sheer scale. Conventional text-based displays are unable to offer us any overall impression of the millions of items contained in modern collections such as the National Archives. Searching the collection is something like wandering through narrow paths in a forest: what we need is a map.

This proposal is to research and develop techniques for visualising, or mapping, archival collections in a way that supports their management, administration and use. The specific aim is to develop techniques for revealing context: the patterns, high-level structures and connections between items in a collection.

The practical outcomes of the project will be prototype interactive, browsable maps of the National Archives collection that apply these techniques at different structural levels:

A map of the whole collection, at Series level, will show the "big picture": the size, scope and historical distribution of different series, the relations between series, and their corresponding Agencies and functions.
A more detailed map will focus, as a test case, on a single series (A1), accumulating data from individual records to reveal the distinctive "shape" of that series.

The issue of navigating large digital collections is current and significant; interestingly some prominent American researchers have recently announced a broadly related project. This project is highly innovative; by supporting it, the Archives would take a leading position in the field. The project would be extensively documented and well disseminated, drawing an international audience.

Outcomes

A prototype browsable map showing the structure of the whole National Archives collection at a Series level, including the relationships between Series, collecting and controlling Agencies, and functions.
A prototype map of a single series, linking to and contextualising individual items in the series.
A set of sketches: static and dynamic visualisations that demonstrate a range of different approaches.
A set of techniques and approaches for creating interactive maps of archival datasets. These will be applicable across the archives sector, and among other institutions dealing with digital collections.
Documentation and dissemination of the project to an international audience.