Digital HASS Champion Simon Musgrave showcases content about digitised books etc.

Published by Katy McHugh on

Written by Dr Simon Musgrave.

Mahendra Mahey was one of the invited speakers at the Digital Humanities Australasia 2018 conference in Adelaide last year. One of the things he talked about was how the British Library came to be gifted a collection of digitised books by Microsoft. The rather prosaic description of this material is: “The datasets in this collection are comprised and derived from 49,455 digitised books (65,227 volumes) largely from the 19th Century.” That made me start wondering whether any of these books mentioned Australia. I may have been looking for a way to avoid some other task, but during the conference I downloaded the metadata for the collection and wrote a Python script to pull out any records which mentioned Australia or Australians:


(Don’t judge me please, I know I’m a crap programmer – and there’s a debugging artefact left in there.)

The metadata is in JSON format, but I didn’t try to parse the JSON beyond isolating records, so the results include mentions in any field of a record, including titles and places of publication. And because the search is a regular expression search, using ‘Australia’ as a keyword also found ‘Australian’ and ‘Australians’. The search retrieved 276 records from the catalogue, and this is what a sample looks like:


The complete set of records can be downloaded here.
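
For anyone who wants to try something similar, here is a minimal sketch of this kind of search. It is not the script described above. It assumes the metadata has already been loaded as a list of record objects, and the function name and sample field names are hypothetical. Each record is turned back into a single string and searched as a whole, which is why a match in any field counts and why ‘Australia’ also catches ‘Australian’ and ‘Australians’.

```python
import json
import re


def find_records(records, keyword="Australia"):
    """Return the records whose serialised text matches keyword in any field."""
    pattern = re.compile(keyword)
    matches = []
    for record in records:
        # Search the whole record as one string rather than field by field,
        # so a hit in a title, a place of publication, or anywhere else counts.
        raw = json.dumps(record, ensure_ascii=False)
        if pattern.search(raw):
            matches.append(record)
    return matches


# Made-up records for illustration only; the field names are hypothetical.
sample = [
    {"title": "A Visit to Australia", "place": "London"},
    {"title": "Poems", "place": "Melbourne, Australia"},
    {"title": "Tales of the Australians", "place": "Edinburgh"},
    {"title": "A History of Wales", "place": "Cardiff"},
]

hits = find_records(sample)
print(len(hits))
```

The search is case-sensitive, so a record containing only ‘australia’ in lower case would be missed. Passing `re.IGNORECASE` to `re.compile` would loosen that if needed.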

I should add that getting to the books is a bit more difficult than accessing the metadata. You can download a zip archive containing all the text in JSON format, but that is a 10GB zip archive which extracts to a single file. Alternatively, you can download ALTO XML files a decade at a time. I had hoped to use this material in a class I was teaching, but I haven’t managed to get usable versions yet. If anyone does manage, I’d love to hear about it!
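
As a possible starting point for the ALTO route, here is a rough sketch of pulling plain text out of a single ALTO page. It has not been run against the British Library files. It relies only on the general ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The function name, the sample page and the namespace version in the sample are assumptions.

```python
import xml.etree.ElementTree as ET


def alto_text(xml_string):
    """Return the plain text of one ALTO page, one output line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        # Compare local names only, so any ALTO namespace version is accepted.
        if element.tag.rsplit("}", 1)[-1] != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if child.tag.rsplit("}", 1)[-1] == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# A tiny made-up page for illustration; real files carry far more detail.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="New"/><SP/><String CONTENT="South"/><SP/><String CONTENT="Wales"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Australia"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text(sample))
```

This ignores hyphenation markers and layout coordinates, so words split across line ends stay split. For keyword searching of the kind described here, that may be good enough.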