In our conversations with libraries and researchers, we are commonly asked how we make the archives that form the Gale Primary Sources program, as there is an innate interest in the process that takes an original printed document to a text-searchable digital facsimile.

We asked Michelle Fappiano (Senior Director of Content Production), Megan Sullivan and Joe Williams (Product Managers), Sarah Holloway (Data Analyst), and Rick Rychecky (Vendor Manager) a series of questions to find out about the process behind digital archive production, and how the production team work with other parts of the Gale business to create our products.

What is the general development process and workflow for a digitised archive?

Joe and Megan: The development process begins with our acquisitions editors, who propose new digital archives, or new modules for existing archives, based on the availability of relevant source collections and on their research into needs in the market. Once a suitable project has been identified, the acquisitions editor prepares a business case to get the project approved, at which point we begin scheduling the scanning of archival materials. Scanning is typically done on-site at the source library unless there are extenuating circumstances. Throughout the scanning process, our development team creates the structure for the digital archive and our product team identifies any new features needed to support the content. Scanned documents are then sent to our conversion vendor, our content team applies indexing and metadata elements, and the content is loaded into the archive. From there, we rigorously test the content and the archive structure to ensure both function properly upon release.

What is the process for scanning documents, and how do we deal with fragile documents selected for digitisation?

Rick: The archival scanning process starts with the source content itself; the content or content types will dictate the production workflow. Our team of acquisitions editors selects collections globally that best fit their needs for the upcoming product. These collections come from various locations: libraries, universities, historical societies, and museums. Once the collection(s) are contracted, our editorial staff provide us with a content or scanning list with title, volume, folder, shelf mark, Bib IDs, etc. This data can be exported from the MARC records or source catalogue and is used both to validate that the contracted documents are being pulled for scanning and to create product metadata.

[Image: a book in a V-cradle]

Once content selections are confirmed, the included documents are assessed through a process called conservation review. Think of this as a preparation stage prior to scanning, used to evaluate the condition of the content and its suitability for scanning. This process is typically carried out by the source library's conservation staff. It is at this point that items that are fragile, damaged, or in need of special handling are repaired before moving into the scanning workflow. To be clear, not every document requires conservation review and repair; the aim is to identify and resolve issues such as cracked or broken bindings, creased or split pages, dog-eared pages that need unfolding, and mold or decay that needs removal. Once repairs are made, the items are delivered to our scanning vendors, along with any items not requiring conservation work, to begin the scanning phase.

Now that the collection has passed through conservation review, the source institution writes special handling requirements, which detail how best to handle rare and fragile items during the scanning process. These requirements include, to name a few:

  • Monographs cannot be opened more than 90 degrees due to stress on the bindings
  • Fragile and/or split pages must always be handled with two hands
  • Book cradles are required for any monograph larger than 14” in length
  • Perspex (plexiglass) must be used to hold down pages, and book spatulas are required
  • Fasteners (treasury tags, staples, paperclips) within manuscript titles should remain in place during scanning capture

The collection is now ready to begin the scanning phase, and we can start the vendor selection process. Generally, we provide the project details to our preferred scanning vendors and request a bid or RFP. During this period, our vendors assess the content on site at the source institution to gain a better understanding of the content itself, determine equipment needs, and ask logistical questions. This gives our vendors an opportunity to get a firsthand look at the collection, and it is important that they see a representative sample of it. Once all vendors have had an opportunity to submit their proposals, we review each one and select the vendor best suited for the job.

The source content type informs the equipment required for scanning. Different equipment is sometimes required for different content types, but in general most content can be digitized using an overhead or column-based scanner. There are some exceptions, however: for monograph material that is in good condition and has a sound, structured spine, we opt to use a robotic scanner. The use of robotics helps maximize production speed and increase our weekly throughput. For large-format material (such as maps or large newspaper prints), a map scanner or feeder scanner is used. This content can also be digitized on an overhead scanner, depending on the capture-bed size, but that requires Photoshop or similar editing software to stitch the separate quadrants together. Overhead scanners are generally the most versatile units for capturing Gale Primary Sources content. The camera (either a digital or CCD lens) is mounted above the operator's head at an adjustable height, pointing straight down at the content.

The scanning process itself is quite simple once equipment and handling have been outlined. The source library is responsible for transferring the content to our operators to keep production moving. The operators are responsible for documenting and tracking the items they have captured, along with the digital images they produce. They are asked to report page counts per item, dates of capture, and any scanning notes that will be useful, for example if source documents are missing pages, have tight gutters, are damaged, or have irregular pagination. These anomalies and notations are added to the production manifest to help inform our content team and QC vendor about the overall condition of the content.

The hands-on scanning works much like a production line. Each item, whether a box, folder, or book, is placed on the scanning bed one by one for capture. The vendor uses the scanning title list to track and manage the collection through the scanning process, and the operators work their way through the boxes and folders until complete. Once individual items are complete, they are set aside until QA is performed and the digital images are approved. The manifest is uploaded along with the scanned images to our cloud for download and inspection by our internal staff and QC vendor. This happens weekly, or as often as defined in the agreed schedule. The delivery schedule is used to track scanning progress; the variables to consider when building it include the total pages to be scanned, the daily throughput (which is based on the type of equipment, the type of content, and the number of scanning operators), and the target release date.
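To give a concrete (and purely illustrative) picture of the kind of record that ends up in a production manifest, here is a minimal sketch in Python; the field names and values are invented, not Gale's actual manifest format.

```python
# A minimal sketch of the per-item record a scanning manifest might hold.
# Field names and values here are illustrative, not Gale's actual schema.
import csv
from dataclasses import dataclass, asdict

@dataclass
class ManifestRow:
    item_id: str          # box / folder / volume identifier from the scanning list
    shelf_mark: str
    pages_scanned: int
    capture_date: str     # ISO date the item was digitised
    notes: str            # anomalies: missing pages, tight gutters, damage, etc.

rows = [
    ManifestRow("BOX-014/FOL-03", "MS 1234", 86, "2021-05-17",
                "tight gutter on ff. 12-14; irregular pagination"),
]

with open("production_manifest.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(asdict(rows[0]).keys()))
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
```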

We have written a standard set of scanning specifications that we provide to our vendors. Our standards are based on industry standards and internal Gale standards, and vendors agree to meet them. The scanned images are then put through a post-production process performed by the scanning vendor. Post-production refers to the process in which the RAW images are reviewed and enhanced to meet our specifications. This includes cropping and deskewing (text straightening), along with confirming that Gale standards were met: the processed images must pass our quality requirements at 100% before they move into the conversion phase of production. If the images do not pass inspection, a rescan request is submitted for the affected files and a complete QC report is delivered to the scanning vendor. The scanning vendor makes any adjustments noted in the report and/or provides an explanation for the discrepancy, then re-uploads the images for a secondary inspection. This continues until all images are passed and accepted.
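Deskewing is a standard image-processing step. The snippet below is a minimal illustration using the open-source OpenCV library, not the vendors' actual post-production toolchain; it estimates the skew angle of a scanned page from its text pixels and rotates the image upright.

```python
# Minimal deskew illustration with OpenCV; vendors use their own post-production tools.
import cv2
import numpy as np

def deskew(path: str, out_path: str) -> float:
    """Estimate the page's skew angle from its dark (text) pixels and rotate to correct it."""
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Invert and threshold so text pixels become white on black.
    thresh = cv2.threshold(cv2.bitwise_not(gray), 0, 255,
                           cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]   # angle of the minimum-area box around the text
    # (OpenCV's angle convention varies by version; this follows the classic deskew recipe.)
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h),
                             flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(out_path, rotated)
    return angle
```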

What is the process and resource we use for OCR, and how does that differ from HTR?

 

“Optical Character Recognition, or OCR, is a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera into editable and searchable data.1

OCR is produced using complex algorithms to identify individual letters in the image of the printed pages.

Gale starts with a scanned image of a paper document (in most cases); a book, manuscript, telegram, newspaper etc. This scanned image alone is not enough to make the information contained in the document available for editing or analysis, say in Microsoft Word. The scanned image is nothing more than a collection of black and white or colour dots, known as a raster image. In order to extract data from scanned documents for research, Gale uses an OCR software that identifies individual letters on the page before combining them into words and finally into sentences, thus enabling the researcher to access and edit the content of the original document.”

- Excerpt from the Gale article Explaining the OCR Process, written by Ray Bankoski in 2018.
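Gale's production relies on commercial recognition engines (the excerpt above references ABBYY). Purely as an illustration of the general idea, the sketch below drives the open-source Tesseract engine from Python to turn a scanned page into searchable text with per-word bounding boxes and confidence values.

```python
# Illustration only: open-source Tesseract via pytesseract, not the engine used in Gale production.
from PIL import Image
import pytesseract

page = Image.open("scanned_page.tif")

# Plain text for full-text search.
text = pytesseract.image_to_string(page)

# Word-level detail: bounding boxes plus a confidence value for each recognised word.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(data["text"], data["conf"],
                                  data["left"], data["top"],
                                  data["width"], data["height"]):
    if word.strip():
        print(f"{word!r} conf={conf} box=({x},{y},{w},{h})")
```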

 

Michelle: HTR does not differ much from OCR in terms of the overall process. The team does spend additional time reviewing the HTR output for errors, since the technology is newer; identifying the HTR engine's trouble spots tells us where to focus training and software improvements. The technology differs from OCR in the following ways:

Overall comparison of HTR vs OCR:

  • Text recognized: HTR is trained to recognize both machine-printed text and handwritten text; OCR recognizes machine-printed text only.
  • Training data: HTR is trained on standard fonts plus many handwritten patterns, since handwriting varies from person to person and over time; OCR is trained on standard fonts only.
  • Layout analysis: segmenting blocks, lines, and words is much more challenging for HTR because handwriting varies from person to person and over time; for OCR this is not much of a challenge, as printed text is very structured.
  • Cursive text: recognizing cursive handwriting is difficult, so HTR accuracy is lower; this is not applicable to OCR.
  • Languages: HTR supports a limited set of languages; OCR supports many languages.
 

Comparison of the specific HTR version we use with OCR:

  • Recognition of printed and handwritten styles: supported by both; HTR handles modern, western European, historical, and Gothic hands, while OCR handles printed text only
  • Oriented text: supported by both
  • Inverted text: supported by both
  • Input formats: HTR accepts multi-page TIFF, TIFF, JP2, JPEG, and PDF; OCR accepts multi-page TIFF, TIFF, JP2, JPEG, PDF, PNG, and BMP
  • Languages supported: HTR supports all Latin scripts; OCR supports above 200 languages
  • OCR confidence scores: provided by both
  • Formatting: not preserved by HTR; preserved by OCR
  • Output: HTR produces JSON and PDF; OCR produces text, XML, RTF, and PDF

With OCR, how do we reach the confidence score and how does it compare to the accuracy of the OCR?

From the Gale article Explaining the OCR Process, written by Ray Bankoski in 2018:

OCR engines use “confidence” levels to represent how well they think they performed. Here is an explanation from ABBYY:

During the layout analysis the text areas, lines and single characters coordinates are detected. After the character separation each character is recognized with different text recognition classifiers.2

The recognition confidence of a character image is a numerical estimate of the probability that the image does in fact represent this character. When recognising a character, the program provides several recognition variants which are ranked by their confidence values. For example, an image of the letter "e" may be recognised:

  • as the letter "e" with a confidence of 95,
  • as the letter "c" with a confidence of 85,
  • as the letter "o" with a confidence of 65, etc.

The hypothesis with the highest confidence rating is selected as the recognition result. But the choice also depends on the context (i.e. the word to which the character belongs) and the results of a differential comparison. For example, if the word with the "e" hypothesis is not a dictionary word while the word with the "c" hypothesis is a dictionary word, the latter will be selected as the recognition result, even though its confidence rating will still be 85. The rest of the recognition variants can be obtained as hypotheses.
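A toy sketch of the selection logic described above: start from the highest-confidence character at each position, but let a dictionary check promote a lower-confidence hypothesis when it yields a real word. The confidence values and the deliberately tiny dictionary are invented for illustration.

```python
# Toy illustration of confidence-plus-dictionary hypothesis selection; the confidences
# and the (deliberately tiny) dictionary are invented for this example.
DICTIONARY = {"cat", "cot", "coat"}

def choose_word(char_hypotheses):
    """char_hypotheses: one list per character position, each holding
    (candidate_character, confidence) pairs sorted by descending confidence."""
    # Start with the highest-confidence character at every position.
    best = "".join(candidates[0][0] for candidates in char_hypotheses)
    if best in DICTIONARY:
        return best
    # Otherwise, try lower-ranked hypotheses that turn the result into a dictionary word.
    for i, candidates in enumerate(char_hypotheses):
        for char, confidence in candidates[1:]:
            alternative = best[:i] + char + best[i + 1:]
            if alternative in DICTIONARY:
                return alternative  # chosen even though this character's confidence is lower
    return best

# The ambiguous first character from the example above: "e" (95), "c" (85), "o" (65).
hypotheses = [
    [("e", 95), ("c", 85), ("o", 65)],
    [("a", 98)],
    [("t", 97)],
]
print(choose_word(hypotheses))  # "cat": "eat" is not in the toy dictionary, so "c" wins
```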

It is extremely difficult to measure the accuracy of the OCR process precisely – to do so would require manually going through a database of millions of OCR pages to determine their accuracy. As a general rule:

“The majority of OCR software suppliers define accuracy in terms of a percentage figure based on the number of correct characters per volume of characters converted. This is very likely to be a misleading figure, as it is normally based upon the OCR engine attempting to convert a perfect laser-printed text of the modernity and quality of, for instance, the printed version of this document. In our experience, gaining character accuracies of greater than 1 in 5,000 characters (99.98%) with fully automated OCR is usually only possible with post-1950's printed text, whilst gaining accuracies of greater than 95% (5 in 100 characters wrong) is more usual for post-1900 and pre-1950's text and anything pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong).”3 [Tanner, Muñoz and Ros: Measuring Mass Text Digitization Quality and Usefulness. D-Lib Magazine July/August 2009]

Therefore, Gale’s OCR accuracy (the number of words correct as a proportion of the whole) can be estimated to be in the 85%-95% range, depending on the database.
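As a rough worked example with invented numbers, word accuracy is simply the number of correct words divided by the total number of words:

```python
# Invented numbers, purely to show how a word-accuracy percentage is derived.
total_words = 500           # words on a sampled page
words_with_errors = 40      # words containing at least one OCR mistake
word_accuracy = (total_words - words_with_errors) / total_words
print(f"{word_accuracy:.1%}")   # 92.0%, inside the 85%-95% range estimated above
```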

What is the process for creating Gale platform features, and how do we get ideas for developments and enhancements?

Joe and Megan: Ideas for creating Gale platform features come from several different sources, but we primarily employ user testing and user feedback to prioritize new features. Typically, we identify a need through customer conversations, internal testing, or larger initiatives at Gale to improve the research experience, then brainstorm potential solutions. Once we have developed an optimal solution, we typically create a prototype and gather feedback from customers and internal stakeholders to tweak the concept and ensure that the solution is both practical and satisfies the need we are attempting to address. After solidifying the ideal implementation for a new feature or enhancement, we work with our internal teams (content, development, etc.) to estimate how much work will be involved so we can prioritize the project. Lastly, our internal teams work to successfully implement the feature or enhancement.

How is the content indexed and metadata applied, and how does this relate to how our search engine works?

Joe and Megan: We index our content according to a robust controlled vocabulary that is locally maintained by Gale. This controlled vocabulary provides a standard for various metadata fields and search indexes such as authors, subjects, document and illustration types, geographic locations, and newspaper section headings. This is to ensure that our vast and diverse primary source content shares a unified search experience and can be seamlessly cross-searched. 

Our search engine also expands on synonyms of terms in our OCR and subject indexing so that users can search for variations of a name or word. For example, if a user performs a Keyword search for “marriage” they will also return results for “matrimony.” This can be turned off by placing quotes around a Keyword search or by using our Entire Document search index, which looks strictly at the full-text. This feature is driven by our Gale thesaurus, which is maintained by our in-house experts in library and information science.
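As a minimal sketch of the idea (not Gale's actual search implementation), thesaurus-driven expansion can be modelled as a lookup that is skipped when the query is quoted:

```python
# Minimal sketch of thesaurus-driven keyword expansion; not Gale's search implementation.
THESAURUS = {
    "marriage": ["matrimony", "wedlock"],   # illustrative synonym ring
}

def expand_query(query: str) -> list[str]:
    """Return the terms to match: quoted queries are taken literally,
    unquoted keywords are expanded with their thesaurus synonyms."""
    if query.startswith('"') and query.endswith('"'):
        return [query.strip('"')]           # exact phrase, no expansion
    term = query.lower()
    return [term] + THESAURUS.get(term, [])

print(expand_query("marriage"))      # ['marriage', 'matrimony', 'wedlock']
print(expand_query('"marriage"'))    # ['marriage'] only
```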

The content is indexed by an automated process, and goes through separate QA checks by content and metadata specialists. Our indexing and metadata process is rigorous and ensures that we adhere to a standardized framework that is applied accurately and consistently. This ensures an optimal search experience for our researchers and is an area in which Gale takes great pride.

What quality checks are done before an archive is released, and what goes into maintaining the archive after it comes out?

Michelle: The source content goes through several quality checks, both automated and manual, throughout the production process. The first quality check is performed upon receipt of the scans from our imaging vendor, and every scan is reviewed for quality. Some of the quality checks at this stage, a few of which can be automated (as sketched after the list), are:

  • Correct file format and image resolutions
  • Corrupt images or images that we are unable to read/open
  • A hand or finger captured in the image (this does happen!)
  • Negative exposure
  • Page orientation problems
  • Poor image quality
  • Torn or damaged pages
  • Missing pages
  • Duplicate pages
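
The sketch below, using the Pillow library, is purely illustrative of the automatable items in that list: it flags unreadable files, unexpected formats, and low resolutions, while checks such as spotting a stray hand or a torn page still need a human reviewer. The folder name and resolution threshold are assumptions, not Gale's actual settings.

```python
# Illustrative automated checks only (format, resolution, readability); the folder name
# and the 300 dpi threshold are assumptions for this sketch.
from pathlib import Path
from PIL import Image

MIN_DPI = 300                      # assumed target resolution for archival scans
ALLOWED_FORMATS = {"TIFF", "JPEG", "JPEG2000"}

def check_scan(path: Path) -> list[str]:
    problems = []
    try:
        with Image.open(path) as img:
            img.verify()           # cheap integrity check for corrupt files
        with Image.open(path) as img:
            if img.format not in ALLOWED_FORMATS:
                problems.append(f"unexpected format {img.format}")
            dpi = img.info.get("dpi", (0, 0))[0]
            if dpi and dpi < MIN_DPI:
                problems.append(f"resolution {dpi} dpi below {MIN_DPI} dpi")
    except Exception as exc:       # unreadable / corrupt image
        problems.append(f"cannot open: {exc}")
    return problems

for scan in sorted(Path("delivery_batch").glob("*.tif")):
    issues = check_scan(scan)
    if issues:
        print(scan.name, "->", "; ".join(issues))
```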

When the scans pass the quality check, they move into the data capture phase of production. This work is performed by our vendor and includes creating the OCR and keying metadata. Following completion of the data capture phase, the data enters another quality check. Our quality vendor reviews both the scans and the captured data (OCR and metadata).  Every document is reviewed for quality. Some of the quality checks at this stage are:

  • Capture requirements are followed
  • File naming meets requirements
  • XML output is validated against the provided schema
  • Keyed metadata matches the scans (for example, the title captured is the same as the title printed on the scanned title page)
  • Keyed metadata is free of typos and is captured consistently
  • Gale metadata standards are followed, and controlled vocabulary is applied
  • Image quality and missing pages are checked a second time

Data that meet the above quality checks are delivered to the production team (data is delivered in batches on a weekly or semi-weekly schedule). The data is then processed into the production workflow system. During this process, the data is run through a series of automated validation routines. These include:

  • The delivery contains the expected files and all files are present
  • XML files validate against the provided schema (see the sketch after this list)
  • XML is structurally correct based on Gale rules (this goes beyond validating against the schema)
  • Gale metadata standards were followed
  • Controlled vocabulary is applied
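
As an illustration of the schema-validation routine (the schema and folder names here are hypothetical), the lxml library can validate each delivered XML file against an XSD and report well-formedness and schema failures:

```python
# Illustrative schema-validation step using lxml; the schema and folder names are hypothetical.
from pathlib import Path
from lxml import etree

schema = etree.XMLSchema(etree.parse("gale_delivery_schema.xsd"))

for xml_file in sorted(Path("delivery_batch").glob("*.xml")):
    try:
        doc = etree.parse(str(xml_file))
    except etree.XMLSyntaxError as exc:
        print(f"{xml_file.name}: not well-formed ({exc})")
        continue
    if not schema.validate(doc):
        for error in schema.error_log:
            print(f"{xml_file.name}: line {error.line}: {error.message}")
```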

We perform one last quality check once the above process is complete. This time, the production team will manually inspect a sampling of the delivered data.

  • Confirm metadata matches the scans
  • Review metadata for typos and inconsistencies
  • Validate capture rules have been followed
  • Assess for quality
  • Evaluate HTR (when applicable)

Data that pass all quality checks are then imported into the product database. The import scripts include validation to ensure captured data follows schema rules. 

 

Megan and Joe: Once the archive is released, the final set of data is archived. We create multiple copies and store them on different servers. The content remains accessible to the product team for ease of retrieval; at times, changes may be required based on customer feedback or the addition of a new product index or feature. The complete data set is also provided to Portico (https://www.portico.org/) to ensure content remains available to our customers.

In addition to the work performed by Michelle’s team, once data is loaded to a digital archive, our product management and content teams test the content to identify any anomalies prior to its release. Also, once content has been added, the archive itself goes through several rounds of testing by our development team, product management team, and QA team to ensure all new and enhanced features are performing as expected to provide the best possible research experience when the archive is released. 

What would a typical ‘day in the life’ of Gale’s production team consist of, and at what points do you work with other parts of the business?

Sarah: For a production team member, there is no typical “day in the life”. We work on different projects with different teams from different countries, in different time zones, and with different languages, and each day presents unique challenges to overcome.

We take an idea from an Acquisitions Editor and get that idea online so that people can search archives and collections from source institutions around the world. We must then work with multiple different teams and vendors, both inside and outside Gale, to make the final product.

 

GATHERING REQUIREMENTS

 

From Product Managers:

New or updated requirements from the Product Management team can come in at any point as well. The product managers review comments, feedback, and suggestions from users across all our projects. They collate all the information and turn it into user stories and requirements to get the most highly sought-after features in place across our platforms. This usually means we in production need to add, change, or standardise something within the existing XML to support the new feature. An example of this is the brand-new Browse Manuscript feature. It was something users had been asking for consistently, so we received a requirement to make sure the required manuscripts had a manuscript number, and a way to sort the manuscript numbers in the product so that a user can easily find what they are looking for.
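As a small illustration of what such a requirement can mean in practice (the identifiers below are invented), manuscript numbers need a natural sort so that, for example, "MS 2" appears before "MS 10":

```python
# Illustration with invented identifiers: natural-order sorting of manuscript numbers
# so that "MS 2" sorts before "MS 10" rather than after it.
import re

def natural_key(ms_number: str):
    # Split into text and digit runs; compare digit runs numerically.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", ms_number)]

manuscripts = ["MS 10", "MS 2", "Add MS 4711", "MS 2a"]
print(sorted(manuscripts, key=natural_key))
# ['Add MS 4711', 'MS 2', 'MS 2a', 'MS 10']
```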

 

From Acquisitions Editors:

We can get a new idea from an Acquisitions Editor at any time and we start off with a kick-off meeting. This meeting is to get the idea, vision, and details about the project. Where is it coming from? What does it contain? How big is it? What should the final online archive look like? What is special about this archive?

Production then take the metadata and pore over it to make sure everything makes sense and nothing obvious is missing. We see how the metadata is organised and how the library has catalogued the content, and we try to keep the browsability similar so that users can easily find what they want. We sometimes create wireframes to show what a new feature or a new programme would look like; these vary from text boxes in a Word document to a fully fleshed-out working wireframe. We check to make sure that what they are asking for is feasible and in line with other archives, i.e. we do not want one archive using U.S.A. as a filter term and another archive using United States of America.

 

MAKING THE ARCHIVE

 

Source Institutions, universities, and libraries:

After an Acquisitions Editor gives us their idea, we work with the source institution, university, or library to work out how we will be able to get in to scan their material, or how we can get the content out to scan it elsewhere. We work with them on getting MARC records, complete lists, and metadata, and sometimes they help us flag the content so the scanning teams know exactly what to scan when they get there. Sometimes the source institution scans the material themselves, or has scanned some of the content previously, in which case production works with them to get the images to us for review. As some of these collections have not been properly catalogued or necessarily touched for a number of years, the source library will have clarifications and questions for us, and we will have clarifications and questions for them, as we start gathering all the content together to begin scanning.

 

Scanning:

To scan an average of 10 million pages of content each year, we have scanning vendors that can go into source institutions or can scan shipped content at the vendor's own location. We provide a list of required items to the scanning teams and the source institution. These items can be books, pamphlets, flyers, newspapers, periodicals, magazines, manuscripts, scrolls, maps, photographs, and so on, and each type of content has its own problems, solutions, and workflows. Production teams get questions and clarifications daily from the scanning vendors throughout the scanning process, which lasts anywhere from three months to two years. Questions and clarifications cover scenarios such as missing items, extra items, duplicate items, items that require conservation, items with special requirements (what is the best way to scan a 4ft scroll?), and items that are simply too delicate to scan. Often the questions require input from the original Acquisitions Editor. The production team help facilitate and monitor these questions, and monitor the progress of the scanning so that it stays on schedule and within its original scope.

The images then go through a quality assurance process with a different vendor. This QA vendor checks to make sure that the correct images have been scanned, in the correct order, in focus, at the right DPI, and without any pages missing. This process also generates clarifications and questions that we deal with or facilitate.

 

Conversion to XML (Extensible Markup Language):

To process 10 million pages of content each year, we send the scanned images to a conversion vendor. We also send metadata in the form of MARC records, library cataloguing, metadata we have created in house, or cataloguing from freelancers inside the libraries or institutions. The vendor matches the correct metadata to the correct set of images, captures all the words, both typed and handwritten, and maps each word's coordinates to its position on the page. This allows a user both to search the words on the scanned image and to locate them easily, as the matching word will be highlighted. This process takes around three to nine months depending on the size of the collection. Production teams get questions and clarifications daily from the vendor throughout this process as well. These include missing metadata or MARC records, wrong metadata or MARC records, and unique content that requires special instructions, e.g. what type label do we put on a book of stamps? The conversion vendor sends the XML and images back to us in batches and we check the metadata to make sure the vendor is capturing everything correctly, such as titles, authors, and publication dates. These clarifications and this QA of the vendors' work are the bulk of what the production teams do day to day.
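To illustrate why word coordinates matter (using a simplified, invented data structure rather than Gale's delivery XML), once each recognised word carries a bounding box on the page image, a search hit can be highlighted by drawing over that box:

```python
# Simplified illustration of coordinate-based hit highlighting; the word/box data
# structure here is invented, not Gale's delivery XML.
from PIL import Image, ImageDraw

# Each recognised word carries its bounding box on the page image (pixel coordinates).
words = [
    {"text": "marriage", "box": (412, 980, 560, 1012)},
    {"text": "register", "box": (570, 980, 690, 1012)},
]

def highlight(page_path: str, query: str, out_path: str) -> None:
    page = Image.open(page_path).convert("RGB")
    draw = ImageDraw.Draw(page, "RGBA")   # RGBA mode so the highlight is semi-transparent
    for word in words:
        if word["text"].lower() == query.lower():
            draw.rectangle(word["box"], fill=(255, 230, 0, 96), outline=(255, 160, 0, 255))
    page.save(out_path)

highlight("scanned_page.jpg", "marriage", "scanned_page_hit.jpg")
```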

 

Working with Dev:

We deliver the QA’d and sometimes corrected XML and images to the DEV and DEV processing teams, who load the content and images to the platforms. Occasionally we will work with the DEV team at this stage if they spot a problem or if new indexes or requirements are involved.

 

Working with indexing teams:

Sometimes the products call for additional requirements that are out of scope or not achievable within the production team, such as subject matter expert involvement or assigning subjects to newspaper articles. For these cases, the production team engages the Indexing Team, who take our XML or metadata, depending on the type of work, and give us back appropriate subjects for us to put into our XML.

Are there any achievements, statistics, or contributions from production that would surprise people?

Michelle: I think what might surprise people about the production team is that we were a remote team even before COVID, with team members in the US (three different states) and in the UK. The production team is small, with just seven members. Our workflow was designed and developed internally, by the production team, and is extremely efficient. We have processed over 200 million pages of primary source content since 2003.