No, it ISN'T all on computer these days!

Historians do a lot of their work in libraries and archives, reading old books and documents. But while a few years ago most people would have rather expected this, now we are often asked the question, "But isn't it all on computer these days?"

Electronic resources

It is true that electronic resources are growing fast, and that they have great advantages. In many fields, all or most of the information being accumulated is recorded electronically as well as on paper. In some cases this is then made available on the Internet. The Internet is thus becoming a very important source of new information in some fields, notably the sciences where the shelf-life of new information is short and most scholars seldom need to consult anything more than a few years old.

However, in fields where old (pre-electronic-era) information is important, the situation is rather different. If information was not originally recorded electronically, but only on paper, it can only be accessed electronically if someone creates an electronic version.

How is this done? Firstly, an existing paper document can be retyped. Secondly, the document can be scanned and converted to text by optical-character-recognition software. Thirdly, a page can be recorded as an image rather than as text.

Notice that in the first two options, the text is recorded (as if re-typesetting a book) not the exact appearance of the page. This has advantages for the scholar since it makes it possible to search the electronic version electronically. This is hugely valuable - consider the difference between trying to find a reference to "John X. Smith" in an unindexed thousand-page book and being able to type in "find 'John X. Smith'".
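To make the point concrete, here is a minimal sketch of such a search, assuming the document has already been retyped or run through OCR so that each page is available as plain text. The function name and the sample pages are invented for illustration only and are not taken from any real file.

```python
def find_mentions(pages: list[str], phrase: str) -> list[int]:
    """Return the 1-based numbers of the pages whose text contains the phrase."""
    needle = phrase.casefold()  # compare case-insensitively
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.casefold()
    ]


# Hypothetical three-page document, invented for this example.
sample_pages = [
    "Minutes of the meeting held in March.",
    "Letter received from John X. Smith regarding the survey.",
    "Reply sent to J. Smith.",
]

print(find_mentions(sample_pages, "John X. Smith"))  # [2]
```

Note that an exact-phrase search of this kind misses the abbreviated "J. Smith" on the third sample page, which is one reason the historian may still want to read around the hits.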

Archives

However, while this is fine for electronic versions of printed books, there are problems in applying it to other types of document. In old official files, for example, there will be numerous handwritten notations to the type-script, and older documents will be entirely handwritten. Leaving aside the practical problems in transcribing the words, there is a serious objection to doing so - the historian will want to see the original. For one thing, with bad handwriting, opinions may differ as to what it actually says. For another, the historian may wish to try to identify the handwriting.

Ideally, therefore, the historian would like an electronic version of an old file to include both a searchable text version and an image version.

The problem is the amount of work to be done in creating such electronic versions. A large archive may contain millions of documents. Retyping all of them would be an enormous, and enormously expensive, undertaking. This applies especially to modern archives, where the amounts of paper are so great. Some old collections (e.g. medieval papers) are sufficiently small, and sufficiently in demand in relation to their size, that retyping them is a practical undertaking. Thus, paradoxically, medieval historians are likely to be using electronic databases sooner than historians of the modern period.

What we really need is optical character recognition (OCR) sufficiently fast and sophisticated that it can accurately read a whole page of text, including scrawled handwriting, as fast as the pages can be held up to the reader. This is some way off yet, but if increases in computing speed continue, there seems no reason why it should not be achieved eventually.

Let us imagine that such technology is developed. How long would it take to enter archives into the database? Old files will probably need to be opened by hand. Let us assume that each archivist can turn pages so as to display one page every five seconds. In an eight-hour day, 5760 pages would be read. This seems a lot until you reflect that the records of the Public Record Office (Kew, London) take up about 97 miles (over 150 km) of shelving. (Source: http://www.pro.gov.uk/education/teachers/collect.htm, accessed July 2000.) The PRO web-site does not seem to contain an estimate of the total amount of paper, but it is apparent that even with the advanced technology we are imagining, it would be a major project involving years of work for a large number of people. It would not be an impossible task, and I am hopeful that eventually it will happen. But with existing technology, it is out of the question.
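The arithmetic behind that daily figure can be set out as a short sketch. The five-second page turn and the eight-hour day are the assumptions stated above. The one-million-page total in the last line is purely hypothetical, chosen only to show the scale; it is not a PRO figure, since no estimate of the total amount of paper was found.

```python
import math

SECONDS_PER_HOUR = 3600


def pages_per_day(seconds_per_page: int = 5, hours_per_day: int = 8) -> int:
    """Pages one archivist can hold up to the reader in a working day."""
    return (hours_per_day * SECONDS_PER_HOUR) // seconds_per_page


def archivist_days(total_pages: int) -> int:
    """Working days one archivist needs for total_pages, rounded up."""
    return math.ceil(total_pages / pages_per_day())


print(pages_per_day())            # 5760
print(archivist_days(1_000_000))  # 174 (hypothetical page count)
```

On these assumptions a single archivist would need about 174 working days for every million pages, before allowing for fetching files, fragile documents or pages that have to be read again.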

The Public Record Office in London has a programme, "Archives Direct 2001", which aims to provide on-line access. By the end of 2001 it is hoped that the catalogues will be searchable electronically. The 1901 census returns are also to be placed on-line. This will be a recording of images, with only a few details being entered as text for indexing. To give some sense of the scale of the task, the PRO estimates that if the census were put on CD-ROM, it would take about 1000 CDs. (Source: http://www.pro.gov.uk/census/factsheet.htm, accessed July 2000.)

Publications

Books and journals are much more manageable, although the scale of the task of creating electronic versions is still large. Publications, however, have a different problem: copyright. Publishers are in business to make a living, or at least to break even, and so do not generally give their work away. Note that although a work goes out of copyright eventually, a new edition of it can be (and probably is) copyright. Project Gutenberg is creating a collection of free, out-of-copyright electronic texts, but the collection is still tiny compared to the size of any large library. (See the section on electronic texts in this web-site.)

Copyright work, however, is made available on the basis of payment. In some fields, such as academic journals, arrangements are developing whereby (for example) a university pays for on-line access to a journal for on-site users. This can be very useful, but it should be remembered that on-site users in a university which can pay for such a service would already have access to many of these journals in printed form anyway. The new arrangements typically improve access for those who already have it in some degree, rather than creating the sort of revolutionary new access which the champions of the Internet hope for.

Again, I am hopeful that in time the Internet will create revolutionary new access, but for historians, at least in this part of the world, it is not here yet. Some have speculated that it will have to wait for some major reforms in the basic concepts of copyright (a system which evolved to cope with printing). Alternatively, "frictionless" electronic commerce may make it possible to have electronic access in a more flexible form, more affordable to individual researchers.

In the meantime, however, historians will still be packing up their bags and physically travelling to London, Serowe, or wherever else the archives they want are located.

Postscript: new developments

January 2002

A potentially revolutionary development: the British Library newspaper collection at Colindale is experimenting with OCR which can read and index newspapers straight from microfilm. Some samples have been indexed and can be accessed on-line at <http://www.uk.olivesoftware.com/archive/skins/bl/navigator.asp>. The results are startling. The articles, although searched for by a digital index, are shown as images.

Am I yet ready to eat my words? Not quite, as the problems with paper archives that have not been microfilmed remain. However, it does look as if the computerization of old records may be about to enter a new period of rapid progress. If so, it will be of enormous importance, especially to those of us located at long distances from archives we want to use.
