Event Date: 28 October 2011
Flett Lecture Theatre
Natural History Museum
Anchoring Biodiversity Information:
From Sherborn to the 21st century and beyond
Digitising legacy taxonomic literature: processes, products and using the output
Department of Entomology, Natural History Museum, London
To date, most digitisation of taxonomic literature has led to a more or less simple digital copy of a paper original; the output has effectively been an electronic copy of a traditional library. While this has increased accessibility of publications through internet access, for many scientific papers the means of indexing and locating them is much the same as with traditional libraries. OCR and born-digital papers allow use of web search engines to locate instances of taxon names and other terms, but OCR efficiency in recognising names is still relatively poor, people's ability to use search engines effectively is mixed, and many papers cannot be directly searched. Instead of building digital analogues of traditional publications, we should consider what properties we require of future taxonomic information access. Ideally the content of each new digital publication should be accessible in the context of all previously published data, and the user should be able to retrieve nomenclatural, taxonomic and other data and information in the form required without having to scan all of the original paper and extract target content manually. This opens the door to dynamic linking of new content with extant systems: automatic population and updating of taxonomic catalogues, ZooBank and faunal lists, all descriptions of a taxon and its children instantly accessible with a single search, comparison of classifications used in different publications, and so on. The means to do this is currently marking up content into XML; the more atomised the mark-up, the greater the possibilities for data retrieval and integration. Mark-up requires XML that accommodates the required content elements and is interoperable with other XML schemas, and there are now several schemas written to do this, particularly TaxPub, TaxonX and taXMLit, the last of these being the most atomised.
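As a rough illustration of why atomised mark-up helps retrieval, the sketch below pulls every description of a taxon and its children out of a small marked-up document with one query. The element names (`publication`, `treatment`, `name`, `description`), the taxon names and the helper `descriptions_for` are all hypothetical and chosen for illustration; they do not follow TaxPub, TaxonX or taXMLit.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified mark-up. Element names are illustrative only
# and are NOT taken from TaxPub, TaxonX or taXMLit.
SAMPLE = """
<publication>
  <treatment>
    <name rank="genus">Aus</name>
    <description>Body elongate.</description>
    <treatment>
      <name rank="species">Aus bus</name>
      <description>Elytra black.</description>
    </treatment>
  </treatment>
  <treatment>
    <name rank="genus">Cus</name>
    <description>Body oval.</description>
  </treatment>
</publication>
"""


def descriptions_for(root, taxon_name):
    """Return (name, description) pairs for a taxon and all nested child taxa."""
    results = []
    for treatment in root.iter("treatment"):
        name = treatment.find("name")
        if name is not None and name.text == taxon_name:
            # Element.iter() yields the matching treatment itself first,
            # then every treatment nested inside it.
            for sub in treatment.iter("treatment"):
                results.append((sub.findtext("name"), sub.findtext("description")))
    return results


root = ET.fromstring(SAMPLE)
for taxon, text in descriptions_for(root, "Aus"):
    print(f"{taxon}: {text}")
```

Because each name and description sits in its own element, the same document could equally feed a catalogue or a faunal list without anyone re-reading the original paper; with a plain page image or unstructured OCR text, none of these queries would be possible.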
Building on earlier systems for mark-up of legacy literature, ViBRANT is developing a new workflow and seeking to increase the automated component of the process. Manual and automatic data and information retrieval is demonstrated by projects such as INOTAXA and Plazi. As we move to creating and using taxonomic products through the power of the internet, we need to ensure the output, while satisfying the requirements of the Code, is fit for purpose in the future.