August 29, 2007

The Quality of Google Book Search


Paul Duguid wrote an interesting article about Google Book Search in which he analyzed the quality of the indexed editions and of the search results by searching for Laurence Sterne's "Tristram Shandy", an 18th-century novel. Mr. Duguid noticed that the Harvard edition of the book had many quality problems and that some text wasn't scanned properly. Google Book Search doesn't distinguish between the volumes of a book, so it's difficult to realize that the Stanford edition is actually only the second volume of the novel.
Google may or may not be sucking the air out of other digitization projects, but like Project Gutenberg before, it is certainly sucking better-forgotten versions of classic texts from justified oblivion and presenting them as the first choice to readers. (...) The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google's technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don't submit equally to a standard shelf, a standard scanner, or a standard ontology.

Patrick Leary, the author of the article Googling the Victorians (PDF), has a pragmatic response, as seen on O'Reilly Radar:
Mass digitization is all about trade-offs. All mass digitizing programs compromise textual accuracy and bibliographical meta-data so that they can afford to include many more texts at a reasonable cost in money and time. All texts in mass digitization collections are corrupt to some degree. Everything else being equal, the more limited the number of texts included in a digital collection, the more care can be lavished on each text. Assessing the balance of value involved in this trade-off, I think, is one of the main places where we part company. You conclude, on the basis of your inspection of these two volumes, that the corruption of texts like Tristram Shandy makes Google Books a "highly problematic" way of getting at the meanings of the books it includes. By contrast, while acknowledging how unfortunate are some of the problems you mention, I believe that the sheer scale of the project and the power of its search function together far outweigh these "problematic" elements.

When you're scanning and indexing millions of books, it's difficult to assess the quality of each edition. Google Book Search's main goal is to let you discover books you can borrow or buy later on. But Google could add an option to rate the quality of each digitized book, or build algorithms that detect flaws and differences between editions, so that the next time you search for Tristram Shandy, all the editions are clustered and the best one comes up first.
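Here is a minimal sketch, in Python, of what such flaw detection and edition clustering could look like. This is not Google's actual pipeline: the dictionary-based quality score, the edition fields, and the sample data are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny stand-in vocabulary; a real scorer would use a full dictionary
# or a language model instead of this hypothetical word list.
KNOWN_WORDS = {"the", "life", "and", "opinions", "of", "tristram",
               "shandy", "gentleman", "uncle", "toby"}

def ocr_quality(text: str) -> float:
    """Estimate scan quality as the fraction of tokens that are
    recognizable words: garbled OCR yields many unrecognized tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in KNOWN_WORDS for t in tokens) / len(tokens)

def cluster_and_rank(editions):
    """Group digitized editions of the same work by (title, author),
    then sort each cluster so the cleanest scan comes first."""
    clusters = defaultdict(list)
    for ed in editions:
        clusters[(ed["title"].lower(), ed["author"].lower())].append(ed)
    for group in clusters.values():
        group.sort(key=lambda ed: ocr_quality(ed["text"]), reverse=True)
    return clusters

# Hypothetical sample: two scans of the same novel, one of them garbled.
editions = [
    {"title": "Tristram Shandy", "author": "Laurence Sterne",
     "library": "Harvard",
     "text": "The lIfe and opimons of Tristrarn Shandy, gentleman"},
    {"title": "Tristram Shandy", "author": "Laurence Sterne",
     "library": "Stanford",
     "text": "The life and opinions of Tristram Shandy, gentleman"},
]

for (title, _), group in cluster_and_rank(editions).items():
    print(title, "-> best scan:", group[0]["library"])
```

A production system would presumably combine a score like this with reader ratings and with the libraries' own metadata (volume numbers, OCLC records) rather than relying on text statistics alone.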

8 comments:

  1. Most of the legacy books, especially those whose pages have yellowed, are presented in Google as images, since Google stores the scanned images of books. For these books, the OCR text doesn't have to be 100% accurate; 95% is good enough for indexing.

    Of course, an image is not good for looking up content after you've located a book. I hope Google Books has stored or published the mapping between the scanned text and the original images, so that I could look something up by text and then jump to the corresponding page image.

    For new books, especially those originally digitized to PDF or another format, I would just read the text.

  2. I think the guy who wrote that article should have more appreciation for an effort that will digitize millions of books, make them available to the whole world, and not charge him a penny. If someone gave me a full-course dinner, I wouldn't complain that there wasn't enough salt.

  3. The problem with this wonderful project (and I mentioned it to the team) isn't the quality of the searching, but the poor quality of some copies available for download (missing page areas to such a degree that the book is unreadable). Despite that, the project on the whole is a success, and I am grateful to have access to a treasure of information (I wanted to take a closer look at Ancient Greek and Latin, and through Google Books I was able to build a library stored on DVD, a library that I couldn't have afforded without this project).

  4. I think brute-force approaches will win out over careful insider consideration of whether or not a volume is deemed worth scanning. However, attaching the right metadata is important, of course, as is good OCR (Google doesn't have super-great OCR results).

  5. I'm glad Duguid is calling attention to the failure to include metadata. I don't buy Leary's excuse for Google, because these are all library books with barcodes linking them to carefully organized metadata. I can only hope that, since there is evidence that Google has held on to the librarians' OCLC numbers, etc., we may see the design improved. This would render my Public Domain Books for Classicists obsolete. Meanwhile, some of these titles were very difficult to find even with intensive searching. With periodicals and series containing huge numbers of volumes, Google's presentation of each item without a volume number feels like a slap in the face (example: how are vol. 1 and vol. 2 of the Glossa ordinaria to be distinguished from the other 215 volumes of the Patrologia Latina?). With better metadata, I could be using Google Books to look up 19th-century German journal articles. In its present form, it's a (wonderful) mess. My fear would be that Google will care more about the OCR (it fits their "entire universe searchable" paradigm) than about the metadata (looking for things you already know you want to read is the old way of doing things).

  6. I just posted a comment on the Google Exchange post regarding Michigan's take on the quality issue: http://radar.oreilly.com/archives/2007/08/the_google_exch.html (scroll down)

  7. Google Books is an indisputable good, at least for those with scholarly interests not satisfied by visits to the local public library, by having to return books to a university library, or, worst of all, by having to pay hundreds of dollars for a rare volume. All the stuff about copyright can be set aside: every time a company has an innovative idea that is not inherently exploitative, but actually generous in some way, the little crows come flocking and parroting each other about "ethics" and so on; their only ethos is seeing you eventually pay for the air you breathe. It's all bollocks. Google Books gives us access to books, and opens us up to others, that have been collecting fine coats of dust for decades, if not centuries. It doesn't charge you. It doesn't ask for your name and address. It doesn't even force you to "register", like every other damned website today. Anyone who complains about Google Books is a moralistic ninny.

    Of course, there are procedural mistakes in GB: scanning seems somewhat haphazard, occasionally you'll see a finger, etc. Hey, there are people behind all that scanning; it's to be expected. TRY to be a LITTLE grateful. But I suppose people WHO DON'T REALLY GIVE A DAMN ABOUT BOOKS can't be expected to have some gratitude for those revolutionizing free media.

    -Marshall Lentini

  8. « Hey, there are people behind all that scanning »

    Yes, and when we see their fingers on the scans, we can be sure that these people have not done their job properly.

