Thursday, December 17, 2009

Resource review #8 - Improvements and solutions

Martin, S. (2008). To Google or Not to Google, That Is the Question: Supplementing Google Book Search to Make It More Useful for Scholarship. Journal of Library Administration, 47(1/2), 141-150. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=33007439&loginpage=Login.asp&site=ehost-live

Sutherland, J. (2008). A Mass Digitization Primer. Library Trends, 57(1), 17-23. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=34929033&loginpage=Login.asp&site=ehost-live

Juliet Sutherland's article describes the process of digitization and outlines some of the many problems involved with using optical character recognition (OCR) to convert images of book pages into computer-readable text. She also describes a few potential solutions to these problems, including reCAPTCHA and Distributed Proofreaders. She notes the limits of both: reCAPTCHA doesn't correct spacing errors, and it can only be used to identify words that a computer is unsure of; it doesn't correct the computer's mistakes. The obvious limit of Distributed Proofreaders is time. Proofreading by dedicated humans is by far the most effective way to ensure the accuracy of digitized text, but there's no way that human proofreading will be able to match the speed of Google's scanning. Sutherland also mentions the possibilities of semantic coding, a process of enriching the text with useful information. She explains that "This can be as simple as identifying chapter titles or as complex as identifying whether a particular instance of the word 'Washington' refers to the person, the city, or the state."
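
To give a sense of what the simple end of that spectrum might look like, here's a minimal sketch of chapter-title identification over plain OCR output. This is my own illustration, not code from either article; the sample lines, the heading pattern, and the tag name are all invented for the example.

```python
import re

# Invented sample of OCR output: plain lines with no structural markup.
ocr_lines = [
    "CHAPTER I",
    "The Springs That Feed the River",
    "It was a dark and stormy night, and the rain",
    "fell in torrents.",
    "CHAPTER II",
    "Down the Valley",
]

# A naive pattern for chapter headings: "CHAPTER" followed by a Roman numeral.
CHAPTER_PATTERN = re.compile(r"^CHAPTER\s+[IVXLC]+\s*$")

def tag_chapters(lines):
    """Wrap recognized chapter headings in a (made-up) structural tag,
    leaving every other line as ordinary text."""
    return [
        f"<chapter_title>{line}</chapter_title>"
        if CHAPTER_PATTERN.match(line) else line
        for line in lines
    ]

for line in tag_chapters(ocr_lines):
    print(line)
```

A real project would use a richer markup vocabulary and far more robust heading detection, but even this toy version hints at why the "Washington" end of the spectrum is so much harder: disambiguating a name requires understanding context, not just matching a pattern.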

Similarly, Shawn Martin's article discusses the limitations of Google Books for academic research and proposes a possible solution. Martin is affiliated with the Text Creation Partnership (TCP). He describes their work as follows: "Instead of relying on a computer to read the book and extract readable text from an image (as OCR does) TCP works with companies whose employees read the text, transcribe it, and add structural tagging (that allows a computer to see elements of the book such as paragraphs, typeface changes, and chapters)." This process is typically undertaken by three people: two to transcribe and one to review and edit the results. If I understand the article correctly, Martin is not suggesting that Google Books move away from OCR and begin transcribing texts manually. Rather, he's suggesting that Google Books partner with a company like TCP to enhance their existing collections, particularly through the kind of semantic coding Sutherland describes.
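
Neither article spells out the mechanics of that three-person workflow, but the reconciliation step might work roughly as in the following sketch, where the third person is shown only the spots at which two independent transcriptions disagree. This is purely a hypothetical illustration of double keying; the sample strings and the function are my own invention.

```python
import difflib

# Invented transcriptions of the same page by two independent keyers;
# keyer B has mistyped "times" as "tirnes".
keyer_a = "It was the best of times, it was the worst of times."
keyer_b = "It was the best of tirnes, it was the worst of times."

def find_disagreements(a: str, b: str):
    """Return the word spans where the two transcriptions differ,
    so a third person can resolve them against the page image."""
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

for version_a, version_b in find_disagreements(keyer_a, keyer_b):
    print(f"editor must choose: {version_a!r} vs. {version_b!r}")
```

The appeal of the scheme is that an error survives only when both keyers make the same mistake in the same place, which is far less likely than either one slipping alone.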

I agree with both authors that the accuracy problems of OCR are a big obstacle to creating useful digital content. The solutions Sutherland proposes seem most intriguing to me; despite its limits, reCAPTCHA (acquired by Google since the publication of her article) is an especially clever approach to the problem. Although Martin describes some very successful projects undertaken by TCP, it's hard to imagine how such a detail-oriented, labor-intensive approach would work at Google's grand scale. I wonder whether Google Books will eventually identify materials that are of particular interest to researchers and, within those specific materials, make strides to ensure accuracy and provide enhanced content through semantic coding. In the meantime, though they can't match the scope of Google Books, the smaller projects undertaken by the universities Martin describes, as well as projects like the Internet Archive and Project Gutenberg, simply offer higher-quality products to researchers.

Because this is my last resource review, I'll also throw in a few thoughts about solutions to the larger problems of legality and privacy. Based on what I've read, I'm not sure the newly proposed settlement offers large enough changes to appease the Department of Justice. Grimmelmann's solution ("open up the settlement to any competitor on the same terms Google would receive") seems ideal. Competition would do more than (hopefully) keep the cost of access down: if other providers of digitized content find innovative ways to increase accuracy and enrich content while still digitizing books at a rapid pace, Google will be forced to make similar improvements.

As for privacy concerns and the larger problems that come with concentrating so much content in the hands of a private company, I'm not sure what can be done. Librarians can't expect Google to adhere to the values of our profession (beyond their pledge not to be evil). Google Books is an amazing tool, and if libraries must use it carefully and thoughtfully because of its proprietary nature, so be it. It's still extremely useful. I also think it's important for libraries to consider taking on their own digitization projects when possible, because that allows them a greater degree of quality control. Additionally, it's worth watching what libraries like the University of Michigan do with the digitized material Google provides. Libraries need not "eliminate their print collections and become dependent on Google's institutional subscription, only to see its price rise uncontrollably" (Grimmelmann, 2009); they can use Google Books without depending on it exclusively for digital content.
