The Google Books metadata controversy resurfaced recently in a Salon article by Laura Miller, "The trouble with Google Books."
I focused a lot on this issue last year. It's fascinating to watch non-librarians, in this case UC-Berkeley professor Geoffrey Nunberg, arguing and explaining the importance of bibliographic metadata. Every cataloger/metadata librarian who wants to feel good about what they do should read this article.
One of my takeaways from the Google Books metadata "mess" is that full-text searching is not a substitute for accurate metadata. If it were, Google would not be spending any time or energy creating or providing metadata for Google Book Search.
It seems we need both full-text searching and metadata working in unison for optimal discovery of digital resources.
Don't get too self-congratulatory though.
" Woody Allen is mentioned in 325 books ostensibly published before he was born.
Other errors include misattributed authors -- Sigmund Freud is listed as a co-author of a book on the Mosaic Web browser and Henry James is credited with writing 'Madame Bovary.'"
My local cataloging corpus has many errors of this sort too, and I bet most readers' do as well. You're probably going to see different types of errors in the different approaches, though. Library cataloging is less likely to misattribute an author/creator (although more likely than Google to simply leave them out altogether, if they weren't an AACR/AACR2 'main entry'). But it is probably nearly as likely as Google to have the wrong dates on items in actual search filters: 260 dates are transcribed and not suitable for machine processing, and the fixed field dates, rarely used by 1980s-2000s OPACs, are often neglected. There are probably other sorts of errors that are even more likely in library-cataloged corpora than in Google's database.
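To make the date problem concrete, here is a rough Python sketch of the kind of check I have in mind. It assumes you already have the raw 008 string and the 260 $c text for a record in hand; the function names, the sample values, and the 1500-2099 year range are all made up for illustration and are not taken from any real system. The only MARC fact it leans on is that Date 1 sits in 008 character positions 07-10.

```python
import re


def year_from_260c(transcribed):
    """Best-effort year from a transcribed 260 $c string (hypothetical helper).

    Handles forms like 'c1987.' or '[1987?]'; returns None for '[19--?]'.
    The 1500-2099 range is an arbitrary assumption for this sketch.
    """
    match = re.search(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)", transcribed)
    return int(match.group(1)) if match else None


def year_from_008(field_008):
    """Date 1 from 008 positions 07-10, or None if it is not four plain digits.

    Values such as '19uu', blanks, or '||||' are treated as unusable.
    """
    date1 = field_008[7:11]
    return int(date1) if re.fullmatch(r"\d{4}", date1) else None


def check_dates(field_008, field_260c):
    """Compare the machine-readable date with the transcribed one."""
    fixed = year_from_008(field_008)
    transcribed = year_from_260c(field_260c)
    if fixed is None and transcribed is None:
        return "no usable date"
    if fixed is None:
        return "fixed field unusable"
    if transcribed is None:
        return "260$c unparseable"
    return "ok" if fixed == transcribed else "mismatch"


if __name__ == "__main__":
    # Invented sample values, not real records.
    print(check_dates("850101s1987    nyu", "c1987."))
    print(check_dates("850101s19uu    nyu", "[19--?]"))
```

Running something like this over a whole catalog would only flag candidates for review. It cannot say which of the two dates is right.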
Metadata is hard.
Posted by: Jonathan Rochkind | Thursday, September 16, 2010 at 08:18 PM
PS: "Although Google representatives did respond to Nunberg's article, blaming the bulk of the errors on outside contractors,"
Much (I have no idea how much, but I know some) of Google's metadata actually comes, believe it or not, from trying to make sense of library MARC. Both from scanning partners, and from OCLC. We're lucky Google's representatives didn't publicly blame libraries.
Posted by: Jonathan Rochkind | Thursday, September 16, 2010 at 08:21 PM
Hi Jonathan,
Thanks for the comments. I'll reply to the second one first.
I'm familiar with the Internet Archive's workflow and they make good use of MARC data. I think Google's problems come from capturing metadata after the fact rather than at the point of scanning. My impression from reading the articles and blog posts last year was that initially Google didn't think about metadata at all. But I'll have to go back and read everything again.
Posted by: Christine Schwartz | Friday, September 17, 2010 at 04:28 PM
Yes, yes and yes. I love this!
@Jonathan, I agree there are problems in all sorts of metadata, but these mistakes are glaringly obvious and seem to be machine driven. It is easy to misattribute, but then to mass-update that misattribution... well, that is when you get such a widespread mass of errors.
@Christine, I agree. I think trying to link the data after the fact contributed. I'd add that making it a mostly automated process without understanding MARC contributed greatly to this.
I'd still love to get involved in the clean-up if Google is hiring such [grin]
Posted by: carol seiler | Monday, September 20, 2010 at 04:49 PM