Thursday, December 17, 2009

Resource review #8 - Improvements and solutions

Martin, S. (2008). To Google or Not to Google, That Is the Question: Supplementing Google Book Search to Make It More Useful for Scholarship. Journal of Library Administration, 47(1/2), 141-150. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=33007439&loginpage=Login.asp&site=ehost-live

Sutherland, J. (2008). A Mass Digitization Primer. Library Trends, 57(1), 17-23. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=34929033&loginpage=Login.asp&site=ehost-live

Juliet Sutherland's article describes the process of digitization and outlines some of the many problems involved in using optical character recognition (OCR) to convert images of book pages into computer-readable text. She outlines a few of the potential solutions to these problems, including reCAPTCHA and Distributed Proofreaders. She notes the limits of both--reCAPTCHA doesn't correct spacing errors, and it can only be used to identify words that a computer is unsure of; it doesn't correct the computer's mistakes. The obvious limit of Distributed Proofreaders is time. Proofreading by dedicated humans is by far the most effective way to ensure the accuracy of digitized text, but there's no way that human proofreading will be able to match the speed of Google's scanning. Sutherland also mentions the possibilities of semantic coding, a process of enriching the text with useful information. She explains that "This can be as simple as identifying chapter titles or as complex as identifying whether a particular instance of the word 'Washington' refers to the person, the city, or the state."
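
To make the reCAPTCHA mechanism concrete: each challenge pairs a word the system already knows with one that OCR was unsure of, and an answer for the unknown word is trusted only when the known word is typed correctly. Here's a minimal sketch of that pairing-and-voting idea in Python; the names, data structures, and threshold are my own illustrative assumptions, not Google's implementation:

    from collections import Counter

    # Votes per unknown word image; the key is an illustrative ID, not a real one.
    suspect_votes = {"word_1837": Counter()}

    def grade_response(control_truth, control_answer, suspect_id, suspect_answer):
        """Trust the unknown-word answer only if the known word was typed correctly."""
        if control_answer.strip().lower() != control_truth:
            return False  # failed the control word, so discard the whole response
        suspect_votes[suspect_id][suspect_answer.strip().lower()] += 1
        return True

    def accepted_reading(suspect_id, threshold=3):
        """Accept a transcription once enough independent users agree on it."""
        votes = suspect_votes[suspect_id]
        if not votes:
            return None
        word, count = votes.most_common(1)[0]
        return word if count >= threshold else None

    # Three of four users agree, so the unknown word gets a trusted reading.
    for answer in ("morning", "morning", "mourning", "morning"):
        grade_response("upon", "upon", "word_1837", answer)
    print(accepted_reading("word_1837"))  # -> morning

Note how this matches Sutherland's caveat: the system can only vote on words the OCR engine flagged as uncertain; it never revisits words the computer confidently got wrong.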

Similarly, Shawn Martin's article discusses the limitations of Google Books for academic research and proposes a possible solution. Martin is affiliated with the Text Creation Partnership (TCP). He describes their work as follows: "Instead of relying on a computer to read the book and extract readable text from an image (as OCR does) TCP works with companies whose employees read the text, transcribe it, and add structural tagging (that allows a computer to see elements of the book such as paragraphs, typeface changes, and chapters)." This process is typically undertaken by three people--two to transcribe and one person to review and edit the results. If I understand the article correctly, Martin is not suggesting that Google Books move away from OCR and begin transcribing texts manually. Rather, he's suggesting that Google Books partner with a company like TCP to enhance their existing collections, particularly through the use of the kind of semantic coding Sutherland describes.
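
That three-person workflow implies a simple reconciliation step: compare the two independent transcriptions and hand only the disagreements to the editor. A minimal sketch in Python (the function and sample text are my own illustration; the article doesn't specify TCP's actual tooling):

    import difflib

    def flag_disagreements(keyer_a, keyer_b):
        """Compare two independent transcriptions of the same page and return
        the word spans where they differ, for a third person to adjudicate."""
        words_a, words_b = keyer_a.split(), keyer_b.split()
        matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
        return [(words_a[a1:a2], words_b[b1:b2])
                for op, a1, a2, b1, b2 in matcher.get_opcodes()
                if op != "equal"]

    page_a = "It was a dark and stormy night"
    page_b = "It was a dank and stormy night"
    print(flag_disagreements(page_a, page_b))  # [(['dark'], ['dank'])] -> editor decides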

I agree with both authors that problems with the accuracy of OCR are a big obstacle to creating useful digital content. The solutions proposed by Sutherland seem most intriguing to me--despite its limits, reCAPTCHA (acquired by Google since the publication of this article) is an especially clever approach to the problem. Although Martin describes some very successful projects undertaken by TCP, it's hard to imagine how that kind of detail-oriented, labor-intensive approach would work on Google's grand scale. I wonder if it's possible that eventually Google Books will identify materials that are of particular interest to researchers, and make strides to ensure accuracy and provide enhanced content through semantic coding within those specific materials. In the meantime, though they can't match the scope of Google Books, the smaller projects undertaken by universities that Martin describes, as well as projects like the Internet Archive and Project Gutenberg, simply offer higher quality products to researchers.
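
To picture what that enhanced content might look like, here's a toy rendering of Sutherland's "Washington" example as semantic markup; the TEI-flavored tag names are my own assumption, not a schema either author specifies:

    # OCR output: a bare string of words, with no machine-readable meaning.
    plain = "Washington crossed the Delaware long before Washington was a state."

    # The same sentence after hypothetical semantic coding: identical surface
    # text, but each "Washington" now carries what it refers to.
    coded = ('<persName ref="George_Washington">Washington</persName> crossed the '
             'Delaware long before <placeName type="state">Washington</placeName> '
             'was a state.')
    print(coded)

A search tool working over the coded version could distinguish queries about the person from queries about the state, which plain full-text search cannot do.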

Because this is my last resource review, I'll also throw in a few thoughts about solutions to the larger problems of legality and privacy. Based on what I've read, I'm not sure that the newly proposed settlement offers large enough changes to appease the Department of Justice. Grimmelmann's solution ("open up the settlement to any competitor on the same terms Google would receive") seems ideal. Competition would do more than (hopefully) keep the cost of access down: if other providers of digitized content find innovative ways to increase accuracy and enrich content while still digitizing books at a rapid pace, Google would be forced to make similar improvements.


As for privacy concerns and the larger problems that come with this concentration of content in the hands of a private company, I'm not sure what can be done. Librarians can't expect Google to adhere to the values of our profession (beyond their pledge to not be evil). Google Books is an amazing tool, and if libraries must use it carefully and thoughtfully because of its proprietary nature, so be it. It's still extremely useful. I also think it's important for libraries to consider taking on their own digitization projects when possible, because that allows them a greater degree of quality control. Additionally, it's worth watching to see what libraries like the University of Michigan do with the digitized material that Google provides. Libraries need not "eliminate their print collections and become dependent on Google's institutional subscription, only to see its price rise uncontrollably" (Grimmelmann, 2009); libraries can use Google Books without depending on it exclusively for digital content.

Wednesday, December 16, 2009

Resource review #7: The amended settlement & objections

Band, J. (2009, November 23). A Guide for the Perplexed Part III: The Amended Settlement Agreement. American Library Association, Association of College and Research Libraries, and Association of Research Libraries. Retrieved from http://www.arl.org/bm~doc/guide_for_the_perplexed_part3_final.pdf

Grimmelmann, J. (2009, November 23). James Grimmelmann on the Google Settlement: What's right, what's wrong, what's left to do. Publishers Weekly. Retrieved from http://www.publishersweekly.com/article/CA6708106.html

In cooperation with the American Library Association, the Association of College and Research Libraries, and the Association of Research Libraries, Jonathan Band has published several "[Guides] for the Perplexed," which outline the changes made in various versions of the Google Books Settlement, with a focus on the implications for libraries. I highly recommend these guides, as they briefly and clearly explain concepts that in other contexts can seem kind of nebulous. I'll summarize his most recent guide, which covers the Amended Settlement Agreement (ASA):

1. In response to complaints made by foreign rightsholders, the ASA does not apply to books published abroad (with a few exceptions). This means that the great majority of books published elsewhere will not be available in full text. This is a lot of books, possibly half of the books Google has digitized so far. Google will keep scanning these kinds of books, and will attempt to get permission from rightsholders to provide full-text access. Not only does this remove a huge quantity of books from Google's collection, it also removes the great majority of plaintiffs from "the plaintiff class" in the settlement.
2. OCLC is now included among institutions "that can receive benefits under the settlement."
3. The ASA extends the time in which rightsholders can "request the removal" of books.
4. Changes to the Book Rights Registry set up by the ASA:
-The Book Rights Registry will have "the purely discretionary" ability to make more than one free public access terminal available at public libraries.
-The BRR will no longer be permitted to use unclaimed funds resulting from the sale of orphan works for operating expenses. These will now be given to charity. Up to 25% of these funds can be used to search for rightsholders of unclaimed works.
-An Unclaimed Works Fiduciary will be selected by the BRR (with the court's approval) to "[represent] the interests of the rightsholders of the unclaimed works."
-The BRR will protect the right of rightsholders to distribute their works through Creative Commons licensing, or other alternative licenses.
5. The ASA removes a few clauses that give Google favored treatment over third parties providing similar services.
6. Google waives its right to antitrust immunity -- apparently "under the Noerr-Pennington doctrine, if an activity receives government approval, it cannot form the basis of antitrust liability." This means that the Department of Justice now has time to see how things play out before determining if Google Books should actually be the target of an antitrust investigation. 

James Grimmelmann is a professor at New York Law School who has written extensively about Google Books. He is also responsible for the Public-Interest Book Search Initiative, which is in turn responsible for (as far as I'm concerned) the best and most comprehensive resource about Google Books, the Public Index. The Public Index provides access to the settlement documents and to related legal documents. Users can annotate the settlement. Additionally, the Public Index links to a wealth of articles on the subject, written from a legal perspective or for a wider audience.

Grimmelmann's most recent general audience article on the subject discusses the ASA and identifies positive changes and areas that remain problematic. The new settlement, as he sees it, consists of "one big feature cut and a bunch of small bug fixes" (the big feature cut being the exclusion of books with foreign rightsholders). Grimmelmann concludes that while the changes to the settlement are mostly positive, the bigger issues are unchanged. Or as he put it more eloquently, "the dark heart of the deal remains: Google will still have effectively exclusive access to unclaimed books." According to Grimmelmann, the issue of antitrust is unresolved. In addition, the opt-out feature of the settlement threatens to set what Grimmelmann calls "a bad precedent for future class actions." He explains: "the plaintiffs aren't just giving up the right to sue Google for scanning their books; they're also being shanghaied into a complicated commercial deal that includes a controversial allocation of electronic book rights and requires them to give up the right in the future to sue Google for plenty of things it hasn't even contemplated doing yet." Because of this, Grimmelmann argues that the court should reflect not only on the fairness of the proposed settlement, but also on the implications for future class action cases.

I've been trying to focus on resources that present perspectives on the potential library use of Google Books as a tool, without any of the legal and ethical arguments for and against. These kinds of articles are somewhat difficult to locate, because (understandably) it's difficult to write anything about Google Books without addressing its legal implications. It's been easy to dismiss a lot of the arguments librarians have made against Google Books--for example, the problem of poor scanning quality and metadata--because the project provides such unprecedented access to such a massive quantity of materials. However, at the core of the project is a massive trade-off: we have unprecedented access to these materials, but other similar vendors have their access to these materials severely curtailed (as Grimmelmann puts it: "A competitor, however, would need to get individual permission [to sell these works] first or be sued into oblivion. That's hard enough in general, and for orphan books it's impossible. There's no one to ask. The class action opens a door for Google, but leaves it closed for everyone else"). Without competition, it's possible that the cost of using Google Books will rise dramatically, again limiting access to the materials they've digitized. Grimmelmann describes the possible implications: "Will it drive libraries to eliminate their print collections and become dependent on Google's institutional subscription, only to see its price rise uncontrollably? Will the FBI force Google to turn over its lists of who's been reading the Qur'an? If these kinds of broad-reaching policy decisions were being made by Congress, the legislative process would in theory take everyone's interests into account. But in a settlement negotiated by a handful of lawyers, the danger is always that the “public interest” means whatever they say it does."

Grimmelmann's Public-Interest Book Search Initiative exists to encourage the public to discuss and evaluate the settlement. In that sense, the project attempts to allow the public to reflect on the real meaning of "public interest." As such, the materials provided in the Public Index and the guides written by Jonathan Band are valuable resources for librarians concerned about the Google Books Settlement's effects on public access to information.

Tuesday, December 1, 2009

Resource review #6 - ...maybe it is preservation after all?

Blakeley, R. (2009). What Was Lost, Now Is Found: Using Google Books and Internet Archive to Enhance a Government Documents Collection with Digital Documents. DttP, 37(3), 26-9. 
   
In this article, Rebecca Blakeley, a government documents librarian at McNeese State University, describes the process by which she used Google Books and the Internet Archive to supplement the McNeese library's government documents collection. The collection fared badly in Hurricane Rita, suffering water damage and mold. Blakeley eventually stumbled upon some full-text government documents in Google Books while helping a patron, and it occurred to her that digitized materials could compensate somewhat for the library's loss. She describes the search methods she used to find government documents in both Google Books and the Internet Archive, and compares the strengths and weaknesses of the two.


For Blakeley, the best feature in Google Books is the "my library" option, which can be used to compile items and share them with other users. She started compiling full-text government documents she found using that option - her collection is available here. (Because of extensive tagging, in some ways her small digital library is much more easily browsable than physical collections of government documents.) Blakeley notes that it's also possible to create RSS feeds that point out new items added to the collection. She mentions that the quality of scanning and metadata varies, but praises the range of viewing options: zooming, one- or two-page display, plain-text display, thumbnails, or full screen. Her biggest complaint is that Google Books only provides limited viewing of many government documents, even though the great majority of them are in the public domain. (Google responded to an email about this by explaining that rather than taking the time to figure out the rights status, they just add materials in limited view until the status can be determined for sure. Hopefully this means that more government documents will become available in full view later on.) It is for this reason that Blakeley prefers the Internet Archive.
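
As a sketch of the RSS workflow Blakeley describes, something like the following could watch a shared collection for newly added documents. The feed URL below is a placeholder of my own, not a documented Google Books endpoint:

    import feedparser  # third-party library: pip install feedparser

    # Placeholder URL standing in for a shared "my library" collection feed;
    # the real address would come from the RSS link the collection exposes.
    FEED_URL = "https://books.google.com/feeds/user/EXAMPLE/collection"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Each document added to the collection appears as a feed entry.
        print(entry.title, entry.link)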


This fits in pretty well with the comparisons drawn by Kalev Leetaru in an article I wrote about previously. The Internet Archive doesn't post books until it has determined that the materials are in the public domain or secured permission from the rightsholder in question. It also takes a lot more time to produce high-quality scans. As a result, there's significantly less there, but what's there isn't as messy as Google Books. Additionally, the Internet Archive allows users to upload materials to the collection -- Blakeley notes that some materials have been uploaded by users who originally downloaded them from Google Books. The Internet Archive allows users to bookmark items, which can then be shared via RSS feeds. The site also offers a "bookmark explorer," which allows users to view items bookmarked by others.


This article illustrates a pretty neat use of these two large digital repositories, and provides good examples of the differences between the two, in terms of features and underlying philosophies. The Internet Archive, while still growing, looks like a finished product, whereas Google Books is very much a work in progress. I came across an interesting blog post by Ed Felten recently, discussing another blog post about the metadata problems at Google Books; it addresses this point:
"What's most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google's metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google's metadata a "catalog" seems to connote a level of completion and immutability that Google might not assert. An electronic "card catalog" can change every day -- a good thing if the changes are strict improvements such as error fixes -- in a way that a traditional card catalog wouldn't."
I think one of the biggest reasons for the backlash against Google Books among librarians stems from overlooking this. They feel like they've turned over a bunch of their best materials to be digitized, but it's been done sloppily in terms of scanning or metadata, and no one knows exactly what final shape Google Books will take once the settlement is (or isn't) finalized. I think it's a good point, but I'm also skeptical about the plausibility of fixing all these errors. Is Google planning to rescan everything that's blurry, or all the pages with visible scanning hands? I think the "beta" label is a good explanation for some problems, but isn't Google digging itself kind of a deep hole by doing so much so quickly and imprecisely?

Either way, Blakeley's article serves as a great example of the flexibility that digitization allows. We've read a great deal this semester about the complicated nature of digital preservation, but in cases like this, digitized documents are certainly preferable to moldy ones.