Image Capture

The process by which the digital images are created — known as ‘image capture’ — fundamentally affects what the user sees on screen. Images captured at a very high resolution, where a significant amount of detail is recorded per inch, look much more photo-realistic and permit deep zooming before appearing blocky. However, this comes at the price of vast file sizes and consequently slow loading times. A balance has to be struck between the quality of the image and the versatility of its use.

Images were therefore captured from microfilm at a resolution of 400dpi (dots per inch), in line with the standards provided by the Library of Congress for digitization projects. For medieval manuscripts a higher resolution (perhaps as much as 1200dpi) could be suitable, in order to permit close viewing of intricate details, but for newspapers a resolution of 400dpi provides a reasonable balance between readability and manageable file size.

 

A decision is also required about whether to capture the images as bitonal (i.e. simple black and white) or as greyscale (which permits subtlety of tone and different shades). Each has its pros and cons. Bitonal can make text very stark and clear against a white background, and is regularly used for historic newspapers. However, it renders illustrations and photographs very poorly. By contrast, greyscale handles photographs and illustrations much better, but also picks up the background grain of the paper, meaning that the images have a grey ‘noisy’ background. As documents became increasingly illustrated over the course of the twentieth century, we needed to use greyscale for the later content. We therefore opted for a hybrid solution, in which the images are bitonal up to 1962, when colour and true halftone photographs started appearing more regularly, and greyscale thereafter. Users will notice a contrast in the way the images appear on either side of this date.

 

As an example, the Daily Mail switched from broadsheet to tabloid format on 3 May 1971. The file sizes of the broadsheet years were noticeably larger than those after the move to tabloid format, especially for the greyscale images. In order to achieve manageable file sizes, the dimensions of the pre-tabloid Daily Mail were reduced by 50%, so the broadsheet issues in the archive appear to be the same physical size as the tabloid issues. In many ways, this makes the paper much easier to read on screen — few people have a broadsheet-sized monitor — but it is worth highlighting as a transformation brought about by the digital edition. The Atlantic Edition was filmed at 400dpi and captured as bitonal. The originals are tabloid-sized, and there was no reduction in size for the digital edition. The special issues were captured at 400dpi in full colour at their original size.
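To give a rough sense of why page dimensions matter so much, the sketch below works through the uncompressed size of a single page scan. The page dimensions are nominal assumptions rather than the Daily Mail's exact broadsheet measurements, and production files are compressed, so the figures are only indicative.

    # Back-of-the-envelope file sizes for one page scanned at 400dpi.
    # The 24 x 30 inch page size is an assumption for illustration only.
    dpi = 400
    width_in, height_in = 24, 30

    pixels = (width_in * dpi) * (height_in * dpi)     # ~115 million pixels

    bitonal_mb = pixels / 8 / 1_000_000               # 1 bit per pixel  -> ~14 MB
    greyscale_mb = pixels / 1_000_000                 # 8 bits per pixel -> ~115 MB

    # Halving both dimensions (as with the pre-tabloid reduction) quarters the pixel count.
    reduced_pixels = (width_in * dpi // 2) * (height_in * dpi // 2)

    print(f"bitonal: {bitonal_mb:.1f} MB, greyscale: {greyscale_mb:.1f} MB")
    print(f"pixel count after 50% reduction: {reduced_pixels:,}")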
 

OPTICAL CHARACTER RECOGNITION (OCR)

Scanned pages of a newspaper are simply a form of photograph — a picture of the text — and consequently of limited use in and of themselves. Without data to support those images, the scanned pages are not searchable or discoverable in a digital environment. The creation of such data is the key component of any digital archive project. It powers the functionality that allows users to search, retrieve and browse the hundreds of thousands of pages.

 

To convert the text on a document into a machine-readable form, we put it through a process known as optical character recognition (OCR). The text produced by the OCR process is what is actually being checked when a user enters a search term. OCR software analyses the light and dark areas of the scanned image in order to identify each alphabetic letter and numeric digit. When it recognizes a character, it converts it into regular text.
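As a simple illustration of what this step produces, the sketch below runs a page image through the open-source Tesseract engine via the pytesseract library. This is not the software used to build the archive, and the file name is hypothetical; it simply shows a page image going in and plain text coming out.

    from PIL import Image
    import pytesseract   # open-source OCR engine, used here purely for illustration

    # Hypothetical scanned page; the archive's own images and OCR software differ.
    page = Image.open("scanned_page.tif")

    # Convert the picture of the text into plain, searchable text.
    text = pytesseract.image_to_string(page)
    print(text[:500])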

(Example of how modern OCR software carries out ‘feature detection’. By learning the common features of a letter or character, the OCR software can recognize most letters, whatever the typeface. In this case a capital ‘A’ will always have two slanting lines and a bridging line in between. Image reproduced with the kind permission of www.explainthatstuff.com)

 

OCR is an imperfect process, and there is a wide array of challenges. The quality of the OCR text usually says more about the condition of the original materials than it does about the performance of the OCR software. Certain types of material are much harder to OCR than others. As a general rule of thumb, older newspapers produce much less satisfactory results than modern ones. This can be because the originals are worn or in poor condition, or because the text is smudged and difficult to read. Any document that was printed by hand is much more difficult for OCR software to analyse than machine-printed text. Wartime newspapers noticeably produce poorer results than those from adjacent periods of history; they were usually printed on poor-quality, thin paper owing to rationing, which leads to ‘bleed-through’ of text from the other side of the page.

People often ask ‘how accurate is the OCR in your digital archives?’ The word ‘accuracy’ is misleading here, because OCR software works from a confidence rating, not true accuracy. The software calculates a confidence level from 0 to 9 for each character it detects, but does not know whether a character has been converted correctly or not. The software can only be confident, or not confident, that it is correct.
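The distinction between confidence and accuracy is easy to see in practice. The sketch below again uses the open-source Tesseract engine via pytesseract, which reports a confidence figure for each recognized word (on a 0 to 100 scale rather than the 0 to 9 scale described above); the engine only ever estimates how sure it is, it never verifies the result against the original.

    from PIL import Image
    import pytesseract

    # Hypothetical page image; the output is per-word confidence, not true accuracy.
    data = pytesseract.image_to_data(
        Image.open("scanned_page.tif"),
        output_type=pytesseract.Output.DICT,
    )

    for word, conf in zip(data["text"], data["conf"]):
        if word.strip():
            # The engine reports how sure it is; it cannot know whether
            # the word is actually correct.
            print(f"{word!r}: confidence {conf}")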

 

True accuracy, that is, whether a character is actually correct, can only be determined by a human assessing each character manually. This is why it is not possible to correct the OCR text in these projects, at least initially. To put this in perspective, the Daily Mail Historical Archive had a team of over 400 people creating and reviewing the data for the archive, but with over 1.2 million pages to digitize and convert, it was not physically possible to clean up the OCR for every article. At the time of writing, only small-scale digitization projects have a realistic chance of producing 100% perfect OCR, and even those projects typically rely on the goodwill and time of unpaid enthusiasts manually correcting the OCR text.

Members of the 400-strong team involved in the creation of digital data. In the left photograph, the operator is scanning the microfilm to create digital images. In the right photograph the team is reviewing the digital images and data.

METADATA

While we cannot guarantee perfect OCR text in these projects, we do aim for 99.5% accuracy in the metadata we produce. Metadata is the ‘who, what, where and when’ of a digital object, providing it with key descriptive information that permits it to be organized more easily. If the quality of the metadata is high, it becomes much easier to find the specific types of information a researcher is looking for, and to place effective parameters and filters around a search query, such as date ranges or limits to specific article types.

 

Metadata is created at several layers, including publication level, issue level, page level, and article level. In the case of a newspaper article in a historical archive, the metadata will include:

  • Article title (or first line of text if no formal title is present)
  • Author (if known)
  • Newspaper title
  • Date of publication
  • Issue/edition number of the newspaper
  • Page number
  • Article category (e.g. Advertising, News, Letter)
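As an illustration of how these fields travel with each article, a single article-level record might look something like the fragment below. The element names and values are hypothetical and do not reflect the archive's actual schema.

    <!-- Hypothetical article-level record; element names and values are illustrative only -->
    <article id="example-0001" category="News">
      <title>Example article title</title>
      <author>Unknown</author>
      <newspaperTitle>Daily Mail</newspaperTitle>
      <publicationDate>1971-05-03</publicationDate>
      <issueNumber>00000</issueNumber>
      <pageNumber>1</pageNumber>
    </article>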

Most of this metadata is entered manually, and then verified by two operators working independently, to ensure that they reach the same result. If they disagree, additional opinions are sought. We ensure that our metadata is consistent across our newspaper archives, which allows a familiar user experience and permits cross-searching and cross-browsing.

 

Assigning categories to each document is one of the most difficult tasks. This is done manually, with an operator selecting the appropriate option. The categories we use for newspapers are based on a taxonomy we have developed across many projects, with rules defining what constitutes a ‘display advert’, what constitutes a ‘letter’, what constitutes a ‘news’ item, and so on. As well as our standard taxonomy, we also aim to introduce categories that pull out specific features of the paper. 

 

There are instances where the metadata has to be more bespoke. In the case of the Daily Mail, as with any archive that has additional or specialist categories, we decided to create additional metadata fields to accommodate the Atlantic Edition. Following extensive checking that involved looking at every single issue, we identified two common threads that ran through all issues of the Atlantic Edition: the name of the ship and the direction of travel. Both values are significant to the character of this edition. Direction of travel affects the advertising content, and further study may reveal other differences in content between eastward and westward travel. The name of the ship is important because each ship had an embedded editor on board who personally composed the wireless news pages, so these pages to an extent reflect that editor’s personal voice. Capturing both of these additional values for the Atlantic Edition therefore provides potential guidance to the researcher.

 

Such unique categories are most successful when they are easy to define. In the case of the Daily Mail, we have made the ‘Femail’ section articles a category. A more generic ‘women’s page’ category was much more difficult to define, as the paper ran a number of women’s columns over the twentieth century, each with a different presentation, layout, and position in the paper. It was not always obvious when to capture them, and if a category is captured incorrectly, its value is diluted. It was therefore decided to restrict the category to the well-known ‘Femail’ section instead.

XML (eXtensible Markup Language)

XML (which stands for eXtensible Markup Language) is the backbone of a digital archive. The XML files provide the structure for the various strands of data (including the OCR text and metadata), assigning tags to each element to define its role. By doing this in a clearly defined, consistent way across the whole data corpus, XML files allow a software application (i.e. the user interface) to make sense of the archive.

 

Creating XML is by far the costliest and most labour–intensive part of any digitization project. Yet, if done well, it is entirely invisible!

 

We start by creating a document type definition (DTD). The DTD defines the data structure for the archive, outlining the legal elements and attributes that may appear. In essence, it provides rules and order for what would otherwise be a jumbled mass of items. All data that is captured for a project must fit the DTD, or it does not pass verification — there are no exceptions. Examples of such rules include:

  • There must only be one title for an article
  • Every article must be assigned to a specific category
  • Every article must have a date

If articles fail to meet these rules, the failure is detected by our quality assurance process, and the problem can be addressed. For newspapers, the creation of XML includes a process of ‘article segmentation’, whereby each individual component of a page is manually identified, and its location coordinates on the scanned image captured. This is what allows articles to be displayed as ‘clips’ in an archive, and also permits each article to be individually highlighted when users are looking at the page as a whole.
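The coordinates captured during segmentation are what make the ‘clip’ view possible: given a zone's position on the page image, the corresponding region can simply be cut out and displayed on its own. The short sketch below illustrates the idea with the Pillow imaging library; the file names and coordinate values are hypothetical.

    from PIL import Image

    # Hypothetical zone coordinates recorded during article segmentation:
    # (left, top, right, bottom), in pixels on the 400dpi page scan.
    article_zone = (350, 1200, 1900, 4800)

    page = Image.open("page_image.tif")
    clip = page.crop(article_zone)        # cut the article out of the full page
    clip.save("article_clip.png")         # the 'clip' shown to the user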

Example of a newspaper page being segmented into individual zones.
Issue-level metadata is reviewed.
Article-level metadata is reviewed along with zoning and article ‘threading’, to ensure that an article spreading across more than one column or page is captured as a single article.
Checking the article in the context of the full page.
Operators can make data corrections while viewing each article.

QUALITY ASSURANCE

Once the XML is created, we put it through a thorough quality assurance (QA) process. Examples of these checks include:

  • Ensuring that an image exists for all XML references and vice versa
  • Confirming that file naming convention and directory structure meet requirements  
  • Validating the XML structure against the DTD  
  • Checking that the image format and size meet guidelines
  • Comparing the digital files against the manifest of expected items. Is anything missing or do we have more than we are expecting?
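Several of these checks lend themselves to automation. The sketch below cross-checks the images on disk against the image references in the XML and against a delivery manifest; the directory names, file extensions, attribute name and manifest format are all assumptions made for the sake of illustration.

    from pathlib import Path
    from lxml import etree

    # Hypothetical layout: page scans in images/, XML in xml/, plus a plain-text
    # manifest listing every file we expect to receive.
    image_files = {p.name for p in Path("images").glob("*.tif")}
    manifest = set(Path("manifest.txt").read_text().split())

    xml_refs = set()
    for xml_file in Path("xml").glob("*.xml"):
        tree = etree.parse(str(xml_file))
        # 'imageFile' is an assumed attribute name on each <page> element.
        xml_refs.update(page.get("imageFile") for page in tree.iter("page"))

    print("Referenced in XML but missing on disk:", xml_refs - image_files)
    print("On disk but never referenced in XML:", image_files - xml_refs)
    print("In the manifest but not delivered:", manifest - image_files)
    print("Delivered but not in the manifest:", image_files - manifest)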

Many of the checks above can be automated, but much of our QA process is carried out by human operators, who view every page image for quality and metadata capture accuracy. This is a different team to those who create the XML, so that we can obtain an independent view of the quality of data being created.

 

During manual QA, we will compare metadata to the source images, check the overall image quality, and confirm that the image coordinates have been captured correctly. Issues, pages, and articles that do not meet our standards are rejected and returned for reworking. Those that meet acceptable standards are then moved to a staging area for the final stages of content processing, before being prepared for loading to our content delivery systems.

THE APPLICATION

While we convert the content, we simultaneously create the ‘application’ that hosts the content. This is the front–end of the archive that users are familiar with, containing the search screens, results pages, and article display that define the way we use a digital archive. This application allows users to access the underlying database in an intuitive and straightforward manner, without the need for significant specialist knowledge.

 

We have an ongoing process of user–testing of our products, and use this feedback to inform the development of new archives. The goal is to have a number of useful ways in which to interact with the content, without unnecessarily complicating the archive and making it inaccessible to a diverse range of users.

 

Despite the surface-level simplicity, our newspaper archives contain a number of powerful features for advanced users. The search engine searches the OCR text, and the problems described above mean that some of the retrieved results may not be relevant, while other useful items may not have been picked up. To overcome this, the ability to use ‘wildcards’ as part of a search query can often help:

*    for any number of characters (e.g. carib* finds Caribbean and caribou)
?    in place of any single character (e.g. psych????y finds psychiatry and psychology but not psychotherapy)
!    for one or no characters (e.g. colo!r finds color and colour)
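As a rough illustration of what each symbol matches, the sketch below translates the three wildcards into regular expressions and tests them against the example words above. The archive's search engine does not work this way internally; this is purely a demonstration of the matching rules.

    import re

    # Illustrative mapping from the wildcard symbols to regular expressions.
    WILDCARDS = {"*": "[a-z]*",   # any number of characters
                 "?": "[a-z]",    # exactly one character
                 "!": "[a-z]?"}   # one or no characters

    def wildcard_to_regex(pattern):
        return "^" + "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in pattern.lower()) + "$"

    words = ["caribbean", "caribou", "psychiatry", "psychology", "psychotherapy", "color", "colour"]
    for pattern in ["carib*", "psych????y", "colo!r"]:
        matches = [w for w in words if re.match(wildcard_to_regex(pattern), w)]
        print(pattern, "finds:", matches)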

Other powerful search tools include ‘proximity operators’. These are used between two search terms to indicate that the terms must occur within a specified distance of each other. The benefit of this is that words that are close to each other are more likely to be related than words that are far apart.

 

A proximity operator has two components: a letter that indicates the direction and a number that indicates the distance in words. Two proximity operators are available:

Wn    The W (within) operator specifies that the word that follows the operator must occur within n words after the word that precedes the operator. For example, the search expression shared w3 values matches any records in which the word values occurs three or fewer words after the word shared.
Nn    The N (near) operator specifies that the words on either side of the operator must occur within n words of each other, in either direction. For example, the search expression memory n5 repressed matches any records in which the words memory and repressed occur within five or fewer words of each other.

Mastering the use of tools such as wildcards and proximity operators can significantly transform the experience of using a newspaper archive, and vastly improve the results retrieved.
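To pin down the logic of the two operators, here is a minimal sketch over tokenised text. The archive's search engine is of course far more sophisticated, working across huge indexes rather than single strings; this only illustrates the distance rules described above.

    def positions(tokens, term):
        """Indexes at which 'term' occurs in the token list (case-insensitive)."""
        return [i for i, t in enumerate(tokens) if t.lower() == term.lower()]

    def within(tokens, first, second, n):
        """first Wn second: 'second' occurs no more than n words after 'first'."""
        return any(0 < j - i <= n
                   for i in positions(tokens, first)
                   for j in positions(tokens, second))

    def near(tokens, first, second, n):
        """first Nn second: the terms occur within n words of each other, either direction."""
        return any(0 < abs(j - i) <= n
                   for i in positions(tokens, first)
                   for j in positions(tokens, second))

    tokens = "a statement of shared democratic values".split()
    print(within(tokens, "shared", "values", 3))   # True: 'values' is two words after 'shared'
    print(near(tokens, "values", "shared", 5))     # True: within five words in either direction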