Do you feel lucky? Google Books is at heart a catalogue of errors

Scholar highlights flawed metadata in the world's largest digital library. Matthew Reisz writes

December 8, 2011

Two years ago, Google Books was becoming the world's largest digital library and, with an effective monopoly, seemed "almost certain to be the last one".

The tragedy for scholars was that Google Books' metadata - which allow users to search the catalogue - were "a mishmash wrapped in a muddle wrapped in a mess".

Such was the argument made in 2009 by Geoffrey Nunberg, adjunct full professor in the School of Information at the University of California, Berkeley.

He went on to have a good deal of fun with the many strange anomalies: 115 hits for Greta Garbo and 325 for Woody Allen in books said to date from before they were born; editions of Jane Eyre classified under history or antiques and collectibles; Sigmund Freud listed as an author of a guide to an internet interface.

There was even a case of an 1890 guidebook assigned to 1774 because it happened to open with an advertisement for a shirt manufacturer founded in that year.

All this made Google Books' search facility a very dangerous tool for serious researchers looking to track, for example, the way a particular word has changed its meaning over time.

In response to Professor Nunberg's critique, Google offered to correct any errors that were brought to its attention. But while this process has ironed out specific glitches in the intervening years, Professor Nunberg does not believe it has made a fundamental difference.

"The changes are a drop in a greatly enlarged ocean," he said, adding that the flaws in Google's metadata remain "a big systematic structural problem".

In the course of his research alone, he has continued to come across glaring errors similar to those he flagged up two years ago.

While working on a history of swearing, for example, Professor Nunberg searched for the word "asshole". Google Books' search facility promptly provided much useful material.

But what is obviously a contemporary novel was listed as the complete works of the French composers Jean-Philippe Rameau and Camille Saint-Saëns. A novel by Arthur Hailey was catalogued as A Survey of American Chemistry, and a book about tattooing as Tudor Historical Thought.

A colleague of Professor Nunberg who was researching the history of alcohol searched for a kind of port known as a "30-year-old tawny" and was presented with a detailed discussion of the subject in a volume Google Books showed as bearing the title How to Play Better Soccer. There were also cases of Google technicians who had managed to scan in images of their fingers rather than the relevant pages of text. Among more general concerns, periodicals were often given the date of their first issue rather than that of the issue actually scanned.

Professor Nunberg said he could not understand why Google scans in copies of books from major research libraries, where the details tend to be recorded correctly, and then turns for its metadata to far less reliable sources.

To patch up the huge problems would now require substantial time and resources. These were unlikely to be forthcoming, Professor Nunberg said, because, "like most high-tech companies, Google puts a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work librarians are used to."

matthew.reisz@tsleducation.com.
