Google weighs in. Jon Orwant (manager of the Google Books metadata team) responds to the criticism. His comment is lengthy but well worth reading. The Google Books metadata team clearly has a gargantuan task: aggregating metadata from 100 providers for over 168 million digitized books.
I found this quote interesting:
Geoff also suggests that "most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text." However, we don't extract publication dates from OCR. Every misdating came from a human — some inside Google, most outside. Where the misdates come from the frontmatter (e.g., the frontispiece or the title page, as in the two examples Geoff cites) the error is more likely to have been a person inside Google. We are investigating the best ways to fix these — through better training for those people, through automated ways to identify the errors, and maybe someday through user-supplied metadata corrections. [emphasis added]
Yes, GIGO is the problem with machine-manipulated data with little or no human involvement. Nunberg's comments on the metadata problems with Google Books were excellent and thoroughly documented. He made some mistakes about the sources of some of the errors, but his overall point that Google was not being responsible with its evaluation of metadata was good. To their credit, Orwant and other members of the Google project seem to be taking concrete measures to change things. I think this is one of those rare situations in which criticism is taken seriously and improvements are made.
Posted by: Phred | Tuesday, September 08, 2009 at 10:44 AM
Thanks for posting this, Christine. I thought the Orwant comment was really helpful for answering some of my own questions about how they are building Google Books.
Posted by: Matt Ostercamp | Wednesday, September 09, 2009 at 11:06 AM