Péter's Digital Reference Shelf

November 2006


Title: Google Book Search (a.k.a. GooglePrint)
Publisher: Google, Inc.
URL: http://books.google.com
Cost: Free
Tested: Continuously

The Context

Google Book Search (GBS), launched in 2004 under the name Google Print, is the most controversial of the many beta releases of Google, Inc., mostly because of the Google Books Library Project module. I skip the legal and ethical pros and cons of the case; there are many substantial sources that show both sides of the coin, and legal cases are pending. Here is an excellent bibliography by Charles Bailey. I focus on what the current content is, what is accessible, and how the software helps and hinders finding materials. Only a small segment of the books and other print materials seems to be available free in its entirety. For this column, I approach GBS primarily from the ready reference perspective, where even snippets of information can be useful.

Google has an unusually extensive background page about Google Book Search (but without any factual information about the size and composition of the database). It is full of success stories and happy testimonials. They are mostly from users who believe that the concept of digitizing books and making them full-text searchable is yet another innovation by Google, Inc. These happy users apparently have lived in the Google bubble, unaware of the alternatives.

The eBook idea first appeared in the early 1970s, when Michael Hart started Project Gutenberg to scan pages of public domain documents and convert them into plain-text format. By now there are 19,700 eBooks in Project Gutenberg. By today’s standard this is a relatively small number, but these items can be displayed and/or printed in their entirety (although the typography is plain and ugly ASCII text, not a facsimile of the books). Project Gutenberg is dwarfed by the beautiful American Memory multimedia super collection of historical materials. Its creation started in the early 1990s (15 years before Google Print), and it now has more than 9 million items. It has 465 items about the impeachment of Andrew Johnson alone.

The Million Books project is another mega database that started long before Google Print was conceived.

There are several relatively small but worthy eBook collections that let you search and display the full text of books for free, such as the small scholarly book collection of the National Academies Press or the free subset of ebrary with about 30,000 books. For further information, see Nicholas Tomaiuolo’s well-updated and annotated list of e-text collections and the Open Directory Project section on the topic as implemented by Google.

The Look Inside The Book (LIB) and, later, Search Inside The Book (SIB) features of Amazon, one of the most prominent pioneers of the Web era, must have been the obvious inspiration for Google Book Search (GBS). The SIB subset of Amazon has about 280,000 fully searchable books. Many of these are greatly enriched by extra information, such as book reviews from professional journals, information about the authors, and citing and cited references, as I discussed in my review.

The Software

I almost always discuss the software at the end of the review, but here I must make an exception and bring up serious software problems that confuse even veteran searchers, and distort or make enigmatic some results. Even with simple searches, there is enough confusion because of the ignorance, illiteracy and innumeracy of the software.

Boolean search

The most startling problem is the incorrect use of the Boolean OR operation, the simplest of all. It is taught in kindergarten that the search for A OR B cannot produce fewer results than the larger of the two result sets found for A and for B separately. Still, the query aboulia produces 26 items, abulia yields 40, but aboulia OR abulia produces only 35.

Neither can a search for A OR B produce more hits than the sum of the hits found for A and for B. But this is what happens, as illustrated by a simple search for books with the word arrogance in the title. It finds 2 books. The search for books with the word arrogant in the title finds 6 documents. (Minutes earlier the software produced 8 hits, and such disappearances add another dimension to the confusion.) The search for books with arrogant OR arrogance in the title yields 13 books.

This is surprising, as there could not be more than 8 books. The first page of the list shows books with the word arrogance in the title that were not shown when searching for that word. The same is true for arrogant. This may explain the result of the OR operation but then keeps the user wondering why those extra books were retrieved only for the Boolean OR operation.
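The two bounds described above can be written as a quick plausibility check. This is only a minimal sketch: the function name is made up for illustration, and the numbers fed into it are the hit counts quoted in this review, typed in by hand rather than fetched from GBS.

```python
def or_count_is_plausible(hits_a: int, hits_b: int, hits_a_or_b: int) -> bool:
    """Return True when the A OR B count lies between max(A, B) and A + B."""
    lower_bound = max(hits_a, hits_b)   # OR can never return fewer than the larger set
    upper_bound = hits_a + hits_b       # OR can never return more than both sets combined
    return lower_bound <= hits_a_or_b <= upper_bound


# Hit counts as reported in the review (hand-copied, not queried live).
print(or_count_is_plausible(26, 40, 35))  # aboulia, abulia, aboulia OR abulia
print(or_count_is_plausible(2, 6, 13))    # arrogance, arrogant, arrogant OR arrogance
```

Both calls return False: the first count falls below the lower bound of 40, and the second exceeds the upper bound of 8.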

Using limit fields

Most search programs make it easy to limit the search to the title field, the publication year and some other fields. Google serves up strange results even for the simple title search, ignoring obviously matching hits. Searching for the term Google in the title yields two hits. When you search for the word anywhere, the first 12 of the 28 hits show books where the term appears in the title. For perspective: Amazon has 23 fully searchable books with the word Google in the title.

Use of the date limit is also a letdown. It seems absurd that GBS has only 55 partially viewable books published in 2006. Amazon has 15,152. To its credit, GBS has 25 fully viewable books, but it is a small consolation.

Split results

The handling of fully viewable books is inconsistent in GBS, and therefore the results are unpredictable. Sometimes they are included in the All Books search, sometimes not; sometimes some of the fully viewable books are included in the All Books search, but not the others. The search for the word fundamentalism in the title yields 8 hits in the All Books list and 3 in the fully viewable result list. None of the latter appears in the former.

The search for the term ignorance returns 91 hits in the All Books result list, and 66 in the Full View result list. Four of the first five hits in the latter appear also in the All Books search result, but none of the other 62. Practically, if you want a comprehensive search you must repeat the search in both domains. This is very irritating. The simple query form should have check boxes to accommodate the user preferences for content type, and to make the result list consistent and predictable.
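Until the two domains are reconciled, a searcher has to combine the two result lists by hand. A minimal sketch of that merge follows, assuming each hit can be reduced to some stable identifier; the identifiers below are invented placeholders, since GBS does not document one.

```python
def merge_result_lists(all_books, full_view):
    """Combine two result lists, keeping first-seen order and dropping duplicates."""
    merged = []
    seen = set()
    for book_id in [*all_books, *full_view]:
        if book_id not in seen:
            seen.add(book_id)
            merged.append(book_id)
    return merged


# Placeholder identifiers mimicking the fundamentalism search:
# 8 hits in All Books, 3 in Full View, and no overlap between them.
all_books = [f"all-{i}" for i in range(8)]
full_view = [f"full-{i}" for i in range(3)]

combined = merge_result_lists(all_books, full_view)
print(len(combined))  # 11
```

With the ignorance search, where four Full View hits also appear in the All Books list, the same function would count those four only once.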

Confusing hit counts

It certainly discombobulates users when hits are reported in terms of pages rather than books. When searching for the macaque monkey, for example, 26 pages are reported in the result list. Actually, 26 is the number of books listed, not the number of pages. The first two books (with a total of more than 1,000 pages) are dedicated to the social behavior of macaque monkeys. The search term obviously must appear on hundreds of pages in those two books, so the number of pages should be much higher than the number of books.

Using the search cell on the page of the first matching book shows that there are 30 pages where the search word occurs and which are viewable. This is clearly the number of pages that GBS allows the user to view, not the number of pages on which the search term appears, let alone the total number of occurrences of the search word.

Even more enigmatic is the result list header on the first page of the search for the word arrogance, which says Books 1-10 with 4110 pages on intitle:arrogant OR intitle:arrogance. What is that score? The total number of pages in the books? Not likely, and it would not be relevant anyhow. The total number of hits matching the word arrogant or arrogance in the books? That could be useful, but why is it shown only when there are more than 10 hits for the query? Why does it disappear when you get to the end of the result list? Why is it not shown when you set the num= parameter higher than the default 10 hits per page?

The search for publisher Houghton Mifflin produces a list that claims 10,100,000 (yes, ten million one hundred thousand) pages as hits. By the time you scroll down the list, it settles for 53 books – and 53 pages.

The header on the top of the short result list should offer much better information, reporting that there are X number of occurrences of macaque, on Y number of pages in N books. There are Z number of pages which can be displayed.
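The header proposed above is easy to express as a template. In this minimal sketch the function name, parameter names and all four numbers are illustrative placeholders, not values reported by GBS.

```python
def format_result_header(term, occurrences, pages, books, viewable_pages):
    """Build a header that keeps occurrences, pages, books and viewable pages apart."""
    return (
        f"{occurrences:,} occurrences of {term} on {pages:,} pages "
        f"in {books:,} books; {viewable_pages:,} pages can be displayed."
    )


# Placeholder numbers, chosen only to show the layout of the header.
print(format_result_header("macaque", 1200, 450, 26, 30))
# 1,200 occurrences of macaque on 450 pages in 26 books; 30 pages can be displayed.
```

Keeping the four counts in separately named fields is the point of the exercise: the user never has to guess whether a number refers to books, pages or occurrences.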

The scanning process brings its own oddities. It caught my attention that in the search for the word ignorance there is an item authored by Plea, and the title starts with “A plea for strengthening …”. I just wondered why the letter A was not misinterpreted as the initial of the first name of Mr. Plea. I could not imagine why Haydn’s dictionary from 1883 came up for my search for tsunami in dictionaries, when the word was not even used in that year. It turns out that the name of a Turkish pasha, Osman, was considered to be a match. In fairness, Amazon also has odd results for scanning reasons, and Google has a much more difficult task scanning materials from centuries earlier. About 95% of the books in the SIB collection are less than 30 years old, in my estimate.

These problems are not nearly as lethal in this database as in Google Scholar, which has very similar deficiencies, and is used by some too-enthusiastic scientists in various disciplines. They take the hit counts and the citation scores reported by Google Scholar without checking their plausibility, then feed the numbers to their programs, which diligently churn out many useless statistical measures. They give a publisher an embellished pseudo-scholarly paper based on often inflated hit counts and phantom citations, and these papers are cited, exciting other researchers. You can find examples for the serious problems of Google Scholar, and the puppy love attitude of serious researchers, in a PowerPoint presentation for the closing session of the UKSG conference, and in a paper published in Online Information Review.

The Content

GBS includes eBooks converted from scanned print publication format and books received directly from the publishers in digital format. Character recognition in the scanning process is never 100% accurate, but the ratio of scanning errors was small in my samples (as it is in Amazon). Even in most of those cases, the context made clear for the naked eye what the original word may have been. Of course, for searching purposes these words are lost, as they are not matching the query term. However, if the word appears more than once in the text, the book is still retrieved, and if the word appears more than once on the same page and at least once correctly, the specific page will also show up in the results.

Database composition

GBS offers four content viewing options. The most generous is the full view option, which allows thumbing through the entire book as well as downloading it in PDF format. This option applies to books that are in the public domain, or to books whose copyright holder asked Google to make them viewable without restriction, as is the case with the 2001 edition of the nearly 300-page book in the Daytrips series about Hawaii. There is no equivalent to this category yet in Amazon.

Copyright holders mostly choose the limited view option, in which only about 20-25% of the pages can be viewed and downloading and printing are disabled. Still, these previews can be very informative for getting a feel for the content, style and format of the book, and for deciding if the book is worth buying, borrowing or requesting through interlibrary loan. You can read reviews about the spectacularly illustrated Concise Animal Encyclopedia, but taking a glance at a picture or two of this book is, indeed, worth a thousand words of reviews.

The limited view option is not too limiting for those who just need some factual information about a person, a place, an event or a concept. For example, the Best Beaches of Hawai’i book is just perfect in this format for getting concise information.

The index page shows one page for Lanikai, which turns out to be the first page of a three-page sub-section, and you can read it through from page 19 through page 20 to page 21. You can go fishing for another beach in the table of contents, which is usually available in its entirety for most books even in limited view, and pick another beach name for the next query, then jump to the appropriate page shown in the sidebar of the search result page.

The snippet view option is very restricted: at best it shows a paragraph from each of a few pages that include your search terms. This still could be useful for a ready reference question, such as the meaning of a word, especially when it is a geographic name (usually not included in general dictionaries) and a gazetteer would not provide the meaning. Occasionally, there are books that appear both as no preview and snippet view types.

It is another question if the source defines the term correctly. In ready reference, corroboration of the information is crucial, but can be time consuming. In the example above, heavenly shore is a tad loose as a translation of Lanikai. One of the beauties of GBS is that even the snippets might give a hint, and then clicking on an adjacent entry might reconfirm or contradict the information. In this search result the entry right above the entry with the snippet view happens to be an excerpt from the book Hawai’i Place Names, and it provides a much more informative and credible piece of information about the meaning of the name of the beach of the small town.

The most restrictive option provides only the usual bibliographic data, but no preview. It is still useful, as at least you would know that your search term occurs somewhere in the book, except when it does not. Searching for my last name, for example, brings back books that include Jacson instead of Jacso. Of course, you don’t know about such mistakes if there is no preview.

Database size

It would be useful to know the proportion of books in each category discussed above. Google does not provide any quantitative information about the database itself, or such details as the ratio of books in the different categories.

As is usual with Google services, it is not possible to determine through special searches how many items there are in the database, or get factual information about other aspects of the content, such as the distribution of items by publication year (at least by broad range, such as for the last decade).

There is a publication year range cell on the advanced template, but it is like a prop in a cheap B-movie. It does not work if you touch it. For example, the search for books published in the past 10 years which include the word “love” anywhere in the body of the text yields an implausibly low number of 18 hits from GBS.

Oprah used to recommend more than that between two commercial breaks. The Amazon SIB subset for books published in the past 10 years that include the word “love” anywhere in the body of the text yields 191,178 hits. It’s a reasonable number that would please all reading club members and talk-show participants. Extending the time span to more than 500 years increases the hit number in GBS by only 3, to 21.

If the subject word is dropped to find out how many books there are in GBS published between 1496 and 2005, the hit number goes up to 59. That would be pathetic even in the eye of those bloggers who get instantly infatuated with any Google service without really testing them.

Because of the crippling software limitations, the best alternative approach may be to compare results from GBS with Amazon’s SIB subset for the semantically equivalent (but sometimes syntactically different) queries, without using date limitation or more advanced but often dysfunctional query combinations and filters which would guarantee to leave GBS in the dust.

Database sources

My samples have shown that not only books, but all kinds of printed materials, such as pamphlets, are present in the database, from every time period and in every genre. Sometimes, odd items show up in the result list which are certainly not books but journals, whose GBS records were apparently created from the journal title lists of Ebsco and ProQuest (which are described as authors), or from publishers’ catalogs of books.

Unfortunately, it is impossible to estimate, let alone to determine, their absolute numbers. As for the scope of publishers, the biggest names have submitted books in digital format for inclusion, including both university presses, such as Oxford, Cambridge, Princeton and Chicago, and, to a lesser extent, commercial publishers, such as Penguin, Springer and Houghton Mifflin. From the perspective of ready reference, encyclopedias, dictionaries, almanacs, and factbooks are the most important traditional sources. Limiting the search to one of these words in the title showed a good variety of ready reference works with a definition and/or description for the term I searched for.

Even more importantly, non-reference books can now serve as ready reference sources by virtue of searching the entire body of text of all kinds of books. Occasionally, a quick search in GBS can return a wealth of ready reference information for a question which classical dictionaries, encyclopedias, and almanacs don’t answer.

Results of test searches

A search for the definition or description of affluenza yields no result from any of the following dictionaries: American Heritage, Chambers, Collins, Cambridge American English, Longman Contemporary English, Merriam-Webster (10th, 11th and unabridged editions), Oxford Concise, Compact Oxford, any of the dictionaries in the Oxford Reference Online suite, and Wordsmyth. Only the Oxford English Dictionary had a definition with sample citations.

In contrast, GBS finds 29 books where the word appears. Actually, the first one is a book titled Affluenza – dedicated to the topic. Even the snippets shown on the result list might give the answer, or take you directly to the answer in the book.

With that said, Amazon shows its superiority not only by bringing up the same book (although only as the 9th hit) but also 290 other books in which the word appears. It also includes reviews from Booklist, Library Journal, and several other review publications (incorporated in the master record), and offers many other informative features, including links to 116 other books cited by Affluenza.

Searches by the name of 15 publishers showed big differences between Amazon's SIB collection and GBS. The latter came up better only for O’Reilly and the University of Hawaii Press (with 36 versus 10, and 37 versus 3, respectively).

In the rest, Amazon was incomparably better, as illustrated by university presses such as Oxford (7,045 vs 57), Cambridge (11,445 vs 53), University of Chicago (2,923 vs 43), Princeton (2,193 vs 48), as well as commercial publishers Houghton Mifflin (736 vs 56), Blackwell (3,114 vs 61), Penguin (2,090 vs 16), Springer (13,138 vs 65), Taylor and Francis (1,565 vs 52), or McGraw-Hill (4,210 vs 34).
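For readers who prefer ratios to raw counts, the publisher figures above can be compared with a few lines of arithmetic. This is only an illustrative sketch: the counts are hand-copied from the paragraph above, and the variable names are my own.

```python
# (Amazon SIB hits, GBS hits) per publisher, as quoted in the review.
counts = {
    "Oxford": (7045, 57),
    "Cambridge": (11445, 53),
    "University of Chicago": (2923, 43),
    "Princeton": (2193, 48),
    "Houghton Mifflin": (736, 56),
    "Blackwell": (3114, 61),
    "Penguin": (2090, 16),
    "Springer": (13138, 65),
    "Taylor and Francis": (1565, 52),
    "McGraw-Hill": (4210, 34),
}

# How many times more hits Amazon returns than GBS for each publisher.
ratios = {name: amazon / gbs for name, (amazon, gbs) in counts.items()}

# Print the publishers from the largest gap to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.1f}x")
```

On these figures the gap ranges from roughly 13 times (Houghton Mifflin) to roughly 216 times (Cambridge).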

The hit numbers in GBS fluctuated somewhat during my test. (I did not reduce the hits because of false drops, such as a matching author name appearing in the publisher field for Taylor & Francis, in snippet view and no preview records.) These numbers may not include the 200 or so full view books offered by the publishers. As the difference is up to two orders of magnitude, it was not worth the effort to check how many of those are included in the All Books counts, and how many are indeed unique and thus to be added. Offering full view books is a laudable feature of GBS but does not change the picture. I hope that this low number of items from the largest publishing partners of Google is just a software failure rather than shallow content. Publishers could easily run some tests on their titles.

As far as the legally undisputed, clean subset of GBS is concerned, it complements Amazon’s SIB very well. Time and again I found top-notch ready reference sources in GBS with the limited preview option which are not searchable through Amazon’s SIB subset. There are many comments on the GBS site by some Google-smitten bloggers about GBS. Most of them sound like those in the midnight commercials by exuberant housewives finding their true love in a laundry detergent or sink cleaning gizmo. Google prominently quoted from Tom Bruno’s Jersey Exile blog, but should not take at face value what Tom, a library assistant at Harvard University, wrote: "Google's search capabilities beat the pants off of its competitor [Amazon]. Google Print also doesn't muddle the results of its searches by trying to sell you unrelated stuff conjured up by your keyword searches in Amazon." Beyond simple keyword searching, Google’s software seems to be cognitively challenged, to put it nicely, and hinders access to the content, which would deserve software that is at least functional and half as smart as Amazon's.

Opinions expressed in this review do not necessarily reflect the opinions of Thomson Gale, its employees or affiliates. We cannot guarantee the accuracy of information contained in non-Thomson Gale sites.
