Thursday, December 17, 2009

Resource review #8 - Improvements and solutions

Martin, S. (2008). To Google or Not to Google, That Is the Question: Supplementing Google Book Search to Make It More Useful for Scholarship. Journal of Library Administration, 47(1/2), 141-150. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=33007439&loginpage=Login.asp&site=ehost-live

Sutherland, J. (2008). A Mass Digitization Primer. Library Trends, 57(1), 17-23. Retrieved from http://search.ebscohost.com.ezproxy.library.wisc.edu/login.aspx?direct=true&db=lxh&AN=34929033&loginpage=Login.asp&site=ehost-live

Juliet Sutherland's article describes the process of digitization and outlines some of the many problems involved in using optical character recognition (OCR) to convert images of book pages into computer-readable text. She describes a few potential solutions to these problems, including reCAPTCHA and Distributed Proofreaders, and notes the limits of both--reCAPTCHA doesn't correct spacing errors, and it can only be used to identify words that a computer is unsure of; it doesn't correct the computer's mistakes. The obvious limit of Distributed Proofreaders is time. Proofreading by dedicated humans is by far the most effective way to ensure the accuracy of digitized text, but there's no way that human proofreading will be able to match the speed of Google's scanning. Sutherland also mentions the possibilities of semantic coding, a process of enriching the text with useful information. She explains that "This can be as simple as identifying chapter titles or as complex as identifying whether a particular instance of the word 'Washington' refers to the person, the city, or the state."
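Sutherland's "Washington" example is easy to make concrete. Here's a toy sketch of my own (the inline tags and the little parser are pure illustration--no digitization project actually uses this format) showing what semantically enriched text might look like, and how a program could pull the identifications back out:

```python
# A toy illustration of semantic coding -- the tag format is my invention.
import re

enriched = (
    'In 1790 <person>Washington</person> toured the South; today '
    '<city>Washington</city> and <state>Washington</state> share his name.'
)

def entities(text):
    """Return (type, value) pairs for every inline semantic tag."""
    return re.findall(r'<(\w+)>(.*?)</\1>', text)

print(entities(enriched))
# [('person', 'Washington'), ('city', 'Washington'), ('state', 'Washington')]
```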

Similarly, Shawn Martin's article discusses the limitations of Google Books for academic research and proposes a possible solution. Martin is affiliated with the Text Creation Partnership (TCP). He describes their work as follows: "Instead of relying on a computer to read the book and extract readable text from an image (as OCR does) TCP works with companies whose employees read the text, transcribe it, and add structural tagging (that allows a computer to see elements of the book such as paragraphs, typeface changes, and chapters)." This process is typically undertaken by three people--two to transcribe and one to review and edit the results. If I understand the article correctly, Martin is not suggesting that Google Books move away from OCR and begin transcribing texts manually. Rather, he's suggesting that Google Books partner with a company like TCP to enhance its existing collections, particularly through the kind of semantic coding Sutherland describes.
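Martin's three-person process is also easy to picture in code. Here's a toy sketch (my own illustration, not TCP's actual workflow): two typists' transcriptions are compared word by word, and any disagreement gets flagged for the third person to settle:

```python
# Toy double-keying check: compare two transcriptions of the same line
# and surface disagreements for a human editor. Sample text is invented.
from difflib import ndiff

keyer_a = "It was the best of times, it was the worst of tirnes."
keyer_b = "It was the best of times, it was the worst of times."

conflicts = [d for d in ndiff(keyer_a.split(), keyer_b.split())
             if d.startswith(('-', '+'))]
print(conflicts)  # ['- tirnes.', '+ times.'] -- sent to the editor to resolve
```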

I agree with both authors that problems with the accuracy of OCR are a big obstacle to creating useful digital content. The solutions proposed by Sutherland seem most intriguing to me--despite its limits, reCAPTCHA (acquired by Google since the publication of this article) is an especially clever approach to the problem. Although Martin describes some very successful projects undertaken by TCP, it's hard to imagine how that kind of detail-oriented, labor-intensive approach would work at Google's grand scale. I wonder if Google Books will eventually identify materials that are of particular interest to researchers, and make strides to ensure accuracy and provide enhanced content through semantic coding within those specific materials. In the meantime, though they can't match the scope of Google Books, the smaller projects undertaken by universities that Martin describes, as well as projects like the Internet Archive and Project Gutenberg, simply offer higher-quality products to researchers.

Because this is my last resource review, I'll also throw in a few thoughts about solutions to the larger problems of legality and privacy. Based on what I've read, I'm not sure that the newly proposed settlement offers large enough changes to appease the Department of Justice. Grimmelmann's solution ("open up the settlement to any competitor on the same terms Google would receive") seems ideal. Competition would do more than (hopefully) keep the cost of access down; if other providers of digitized content found innovative ways to increase accuracy and enrich content while still digitizing books at a rapid pace, Google would be forced to make similar improvements.


As for privacy concerns and the larger problems that come with this concentration of content in the hands of a private company, I'm not sure what can be done. Librarians can't expect Google to adhere to the values of our profession (beyond their pledge to not be evil). Google Books is an amazing tool, and if libraries must use it carefully and thoughtfully because of its proprietary nature, so be it. It's still extremely useful. I also think it's important for libraries to consider taking on their own digitization projects when possible, because that allows them a greater degree of quality control. Additionally, it's worth watching to see what libraries like the University of Michigan do with the digitized material that Google provides. Libraries need not "eliminate their print collections and become dependent on Google's institutional subscription, only to see its price rise uncontrollably" (Grimmelmann, 2009); libraries can use Google Books without depending on it exclusively for digital content.

Wednesday, December 16, 2009

Resource review #7: The amended settlement & objections

Band, J. (November 23, 2009). A Guide for the Perplexed Part III: The Amended Settlement Agreement, American Library Association, Association of College and Research Libraries, and Association of Research Libraries. Retrieved from http://www.arl.org/bm~doc/guide_for_the_perplexed_part3_final.pdf

Grimmelmann, J. (November 23, 2009). James Grimmelmann on The Google Settlement: what’s right, what’s wrong, what’s left to do, Publishers Weekly. Retrieved from http://www.publishersweekly.com/article/CA6708106.html

In cooperation with the American Library Association, the Association of College and Research Libraries, and the Association of Research Libraries, Jonathan Band has published several "[Guides] for the Perplexed," which outline the changes made in various versions of the Google Books Settlement, with a focus on the implications for libraries. I highly recommend these guides, as they briefly and clearly explain concepts that can seem nebulous in other contexts. I'll summarize his most recent guide, which covers the Amended Settlement Agreement (ASA):

1. In response to complaints made by foreign rightsholders, the ASA does not apply to books published abroad (with a few exceptions). This means that the great majority of books published elsewhere will not be available in full text. This is a lot of books, possibly half of those Google has digitized so far. Google will keep scanning these kinds of books, and will attempt to get permission from rightsholders to provide full-text access. Not only does this remove a huge quantity of books from Google's collection, it also removes the great majority of plaintiffs from "the plaintiff class" in the settlement.
2. OCLC is now included among institutions "that can receive benefits under the settlement."
3. The ASA extends the time in which rightsholders can "request the removal" of books.
4. Changes to the Book Rights Registry set up by the ASA:
-The Book Rights Registry will have "the purely discretionary" ability to make more than one free public access terminal available at public libraries.
-The BRR will no longer be permitted to use unclaimed funds resulting from the sale of orphan works for operating expenses. These will now be given to charity. Up to 25% of these funds can be used to search for rightsholders of unclaimed works.
-An Unclaimed Works Fiduciary will be selected by the BRR (with the court's approval) to "[represent] the interests of the rightsholders of the unclaimed works."
-The BRR will protect the right of rightsholders to distribute their works through Creative Commons licensing, or other alternative licenses.
5. The ASA removes a few clauses that give Google favored treatment over third parties providing similar services.
6. Google waives its right to antitrust immunity -- apparently "under the Noerr-Pennington doctrine, if an activity receives government approval, it cannot form the basis of antitrust liability." This means that the Department of Justice now has time to see how things play out before determining if Google Books should actually be the target of an antitrust investigation. 

James Grimmelmann is a professor at New York Law School who has written extensively about Google Books (other articles are available here). He is also responsible for the Public-Interest Book Search Initiative, which is in turn responsible for (as far as I'm concerned) the best and most comprehensive resource about Google Books, the Public Index. The Public Index provides access to the settlement documents and to related legal documents, and users can annotate the settlement. Additionally, the Public Index links to a wealth of articles on the subject, written from a legal perspective or for a wider audience.

Grimmelmann's most recent general-audience article on the subject discusses the ASA and identifies positive changes and areas that remain problematic. The new settlement, as he sees it, consists of "one big feature cut and a bunch of small bug fixes" (the big feature cut being the exclusion of books with foreign rightsholders). Grimmelmann concludes that while the changes to the settlement are mostly positive, the bigger issues are unchanged. Or as he puts it more eloquently, "the dark heart of the deal remains: Google will still have effectively exclusive access to unclaimed books." According to Grimmelmann, the issue of antitrust is unresolved. In addition, the opt-out feature of the settlement threatens to set what Grimmelmann calls "a bad precedent for future class actions." He explains: "the plaintiffs aren't just giving up the right to sue Google for scanning their books; they're also being shanghaied into a complicated commercial deal that includes a controversial allocation of electronic book rights and requires them to give up the right in the future to sue Google for plenty of things it hasn't even contemplated doing yet." Because of this, Grimmelmann argues that the court should reflect not only on the fairness of the proposed settlement, but also on the implications for future class action cases.

I've been trying to focus on resources that present perspectives on the potential library use of Google Books as a tool, without any of the legal and ethical arguments for and against. These kinds of articles are somewhat difficult to locate, because (understandably) it's difficult to write anything about Google Books without addressing its legal implications. It's been easy to dismiss a lot of the arguments librarians have made against Google Books--for example, the problem of poor scanning quality and metadata--because the project provides such unprecedented access to such a massive quantity of materials. However, at the core of the project is a massive trade-off: we have unprecedented access to these materials, but other similar vendors have their access severely curtailed (as Grimmelmann puts it: "A competitor, however, would need to get individual permission [to sell these works] first or be sued into oblivion. That's hard enough in general, and for orphan books it's impossible. There's no one to ask. The class action opens a door for Google, but leaves it closed for everyone else"). Without competition, it's possible that the cost of using Google Books will rise dramatically, again limiting access to the materials it has digitized. Grimmelmann describes the possible implications: "Will it drive libraries to eliminate their print collections and become dependent on Google's institutional subscription, only to see its price rise uncontrollably? Will the FBI force Google to turn over its lists of who's been reading the Qur'an? If these kinds of broad-reaching policy decisions were being made by Congress, the legislative process would in theory take everyone's interests into account. But in a settlement negotiated by a handful of lawyers, the danger is always that the “public interest” means whatever they say it does."

Grimmelmann's Public-Interest Book Search Initiative exists to encourage the public to discuss and evaluate the settlement. In that sense, the project attempts to allow the public to reflect on the real meaning of "public interest." As such, the materials provided in the Public Index and the guides written by Jonathan Band are valuable resources for librarians concerned about the Google Books Settlement's effects on public access to information.

Tuesday, December 1, 2009

Resource review #6 - ...maybe it is preservation after all?

Blakeley, R. (2009). What Was Lost, Now Is Found: Using Google Books and Internet Archive to Enhance a Government Documents Collection with Digital Documents. DttP, 37(3), 26-9. 
   
   In this article, Rebecca Blakeley, a government documents librarian at McNeese State University, describes the process by which she used Google Books and the Internet Archive to supplement the McNeese Library's government documents collection. The collection fared badly in Hurricane Rita, suffering water damage and mold. Blakeley eventually stumbled upon some full-text government documents in Google Books while helping a patron, and it occurred to her that digitized materials could compensate somewhat for the library's loss. She describes the search methods she used to find government documents in both Google Books and the Internet Archive, and compares the strengths and weaknesses of the two.


   For Blakeley, the best feature in Google Books is the "my library" option, which can be used to compile items and share them with other users. She started compiling full-text government documents she found using that option - her collection is available here. (Because of extensive tagging, in some ways her small digital library is much more easily browsable than physical collections of government documents.) Blakeley notes that it's also possible to create RSS feeds to point out new items added to the collection. She mentions that the quality of scanning and metadata varies, but praises the range of viewing options: zooming, one- or two-page display, plain-text display, thumbnails, or full screen. Her biggest complaint is that Google Books only provides limited viewing of many government documents, even though the great majority of them are in the public domain. (Google responded to an email about this by explaining that rather than taking the time to figure out the rights status, they just add materials in limited view until the status can be determined for sure. Hopefully this means that more government documents will be available in full view later on.) It is for this reason that Blakeley prefers the Internet Archive.


   This fits in pretty well with the comparisons drawn by Kalev Leetaru in an article I wrote about previously. The Internet Archive doesn't post books until they've determined that materials are in the public domain or secured permission from the rightsholder in question. They also take a lot more time to produce high-quality scans. As a result, there's significantly less there, but what's there isn't as messy as Google Books. Additionally, the Internet Archive allows users to upload materials to the collection -- Blakeley notes that some materials have been uploaded by users who originally downloaded them from Google Books. The Internet Archive allows users to bookmark items, which can then be shared via RSS feeds. The site also offers a "bookmark explorer," which allows users to view items bookmarked by others.


  This article illustrates a pretty neat use of these two large digital repositories, and provides good examples of the differences between the two, in terms of features and underlying philosophies. The Internet Archive, while growing, looks like a finished product, while Google Books is very much constantly in progress. I recently came across an interesting blog post by Ed Felten, discussing another blog post about the metadata problems at Google Books; it addresses this point:
"What's most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google's metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google's metadata a "catalog" seems to connote a level of completion and immutability that Google might not assert. An electronic "card catalog" can change every day -- a good thing if the changes are strict improvements such as error fixes -- in a way that a traditional card catalog wouldn't."
 I think one of the biggest reasons for the backlash against Google Books by librarians is that they overlook this. They feel like they've turned over a bunch of their best materials to be digitized, but it's been done sloppily in terms of scanning or metadata, and no one knows exactly what the final shape of Google Books will be, once the settlement is (or isn't) finalized. I think it's a good point, but I'm also skeptical about the plausibility of fixing all these errors. Is Google planning to rescan everything that's blurry, or all the pages with visible scanning hands? I think the "beta" label is a good explanation for some problems, but isn't Google digging itself a pretty deep hole by doing so much so quickly and imprecisely?

  Either way, Blakeley's article serves as a great example of the flexibility that digitization allows. We've read a great deal this semester about the complicated nature of digital preservation, but in cases like this, digitized documents are certainly preferable to moldy ones.

Sunday, November 29, 2009

Resource review #5: Libraries and Google: a love/hate relationship

Waller, V. (22 August 2009) "The relationship between public libraries and Google: Too much information" First Monday, 14 (9). Retrieved from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2477/2279 

   I'll just get this out of the way. One thing that's really gotten to me in the course of reading all these articles is that librarians and other authors can't seem to keep the name of Google's digitization project straight. I understand that it's changed several times, from Google Print, to the Google Library Project, to Google Book Search, to Google Books. According to Google's history of the project, the name was changed to Google Books in 2005. That's plenty of time for librarians to catch up. If you're going to write an article that's critical of something, you've got to have your facts straight, and getting the name of the object of your criticism right should be the absolute bare minimum. Similarly, Waller doesn't seem to know the names of Google's founders, Larry Page and Sergey Brin. When citing their published work, she lists their last names correctly, but when she mentions them in passing, she refers to them as Sergey and Brin every time. I'm sure I'm guilty of my own fair share of typos, so I try not to harp on this sort of thing. But in all other respects, this article is very academic (if I were on the reference desk, I'd tell students it was scholarly), so in that context, this kind of error is rather glaring.

   Waller's article describes the partnership between Google and libraries in terms of the stages of a romantic relationship. At first, libraries felt that their goals matched perfectly with Google's, whose "stated mission is to organise the world’s information and make it useful." However, with time, libraries have realized that the match isn't so perfect after all. Most of the problems Waller lists stem from the fact that Google is a private company--it is driven by advertising revenue; it doesn't share the concern of libraries for users' privacy; and as a private company, it could go out of business, taking with it the massive number of books it has digitized. Additionally, Waller notes the suggestion made by many librarians that Google has "[conflated] information retrieval and knowledge." Waller (and the librarians she cites) fear that "the Google effect" will lead to a mindset that ignores traditional research methods in favor of shortcuts. Similarly, she cites "concerns that the relationship between the reader and a digital text will be superficial in comparison with the intimate relationship that can develop between a reader or scholar and a physical text," and fears "the possibility that searching a book will become an increasingly adequate substitute for reading it." Finally, she notes the complicated nature of digital preservation issues. 

   To her credit, Waller is not suggesting that librarians cut ties with Google. Rather, she stresses the importance of using Google as a starting point, and emphasizing to library users the limits of Google as a research tool. She argues that librarians "should teach library users, through example, about the difference between freely flowing information and balanced information. They should not be afraid of giving priority to more significant information. They should also be discussing the losses involved in representing an aspect of an analogue world with ones and zeroes. As philosopher of technology Don Ihde says, it is ‘what is revealed is what excites; what is concealed may be forgotten.’ Libraries need to pay attention to that which is concealed by Google’s search results and by digitised information. What is concealed includes vital aspects of human knowledge and culture and it is part of the task of the public library to preserve these things." 

   One of Waller's main points is that for information to be accessible and useful, it has to be organized, with priority given to information of greater significance. She claims that Google's search results have no such organization, but are only organized and ranked in order to serve the needs of advertisers. In fact, the organization of Google's search results has been well-documented--results are ranked by an algorithm influenced by citation analysis. It's certainly not perfect, but it is a form of relevance ranking and organization. Waller's failure to acknowledge this seems like a pretty big oversight.
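For anyone curious, the citation-analysis algorithm in question is PageRank, and a textbook version of it fits in a few lines. Here's a minimal sketch over a made-up three-page web (obviously nothing like Google's production ranking):

```python
# Textbook PageRank by power iteration; the toy link graph is invented.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new = {page: (1 - damping) / n for page in links}
        for page, outgoing in links.items():
            for target in outgoing:          # each link acts like a citation
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

toy_web = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
print(pagerank(toy_web))  # 'c' ranks highest: it collects the most "citations"
```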

   I'm also somewhat bothered by Waller's suggestion that if Google goes out of business, its digital library will disappear. Every participating library gets a digital copy of the books it allows Google to digitize. These can be (and are) used in various ways: copies can be printed on the Espresso Book Machine, digital material can be integrated into the library's OPAC, and libraries can create their own digital libraries, like the Hathi Trust Digital Repository, which began at the University of Michigan and is now a partnership among thirteen large research universities. Just because content is originally digitized by Google doesn't mean it has to live there exclusively. However, the problems Waller notes are real. Librarians should feel uncomfortable partnering with an organization that collects data about individual users and shares that data with advertisers. The Google settlement stipulates that public libraries can each have one terminal from which the public can access all of the material available on Google Books--will the convenience and benefits of this access be outweighed by the presence of targeted advertising and the potential invasion of users' privacy? I don't know. I'm planning on tackling privacy issues (and the dreaded copyright/legal challenges) next.

 

Saturday, November 28, 2009

Resource review #4 - Mass digitization leads to more books in print?

Rosen, J. (2009). Bookselling Heads To the Espresso Age. Publishers Weekly, 256(40), 3-4.

Badger, B. (2009, September 9). Books Digitized by Google Available via the Espresso Book Machine. Retrieved from http://booksearch.blogspot.com/2009/09/books-digitized-by-google-available-via.html

(2008). U of Michigan Library Installs Espresso Book Machine. Advanced Technology Libraries, 37(11), 1, 10-11.

(2009). Espresso Book Machine. Retrieved from http://www.lib.umich.edu/espresso-book-machine

   In September, Google announced that it would partner with On Demand Books to make its two million-plus digitized public domain books available for printing on the Espresso Book Machine (EBM). The machine is "capable of making a 300-page perfect-bound book in five to seven minutes" and can print a yearly total of 60,000 books. According to Brandon Badger, a product manager at Google Books, "If sentient robots ever succeed in taking over the world, this is how they will print their books."
   When I originally came across this announcement at Inside Google Books, I thought it was just a neat bit of technology. According to Rosen's article, though, the implications are much larger. She cites one of the founders of On Demand Books as asserting that the machine's [relatively] low cost and the company's partnership with Google signal "the end of the Gutenberg age." The ability to quickly and inexpensively print books does have the potential to radically decentralize the publishing industry. Rosen's article explores the implications for small independent bookstores, and also alludes to possible library use. Dane Neller (cofounder and CEO of On Demand) suggests that "the Espresso machine enables local retailers to do everything a national behemoth like Amazon does." Additionally, Espresso machines can allow small bookstores to save space while offering a much larger inventory. Booksellers quoted in the article describe plans to sell copies of classics and, in a university bookstore, to print copies of books authored by faculty members. Rosen also suggests that libraries may begin printing copies of digitized rare books.
   I have some trouble imagining the use of the EBM in libraries. Would libraries be selling books to patrons, printing books to add to their collections, printing copies of digitized rare or fragile items? So far, the best example I can find is the University of Michigan library, which became the first university library to purchase an EBM in 2008. The library planned to sell copies of books it had digitized for the Open Content Alliance, as well as items from its pre-1923 collection, for about $10 a book. U of M's dean of libraries, Paul Courant, stated, "This is a significant moment in the history of book publishing and distribution. As a library, we're stepping beyond the limits of physical space. Now we can produce affordable printed copies of rare and hard-to-find books. It's a great step toward the democratization of information, getting information to readers when and where they need it." According to the library's website, U of M also expects to offer additional uses of the EBM: "Small runs of printed books produced by classes, such as anthologies of creative writing; printed copies of proceedings of University conferences and events; printing and binding course materials; self-publishing for Ann Arbor authors." This is potentially a large expansion of the library's role on campus, and I think it's illustrative of digitization's potential for expanding access to information--in digital form and, paradoxically, in print as well.
   Possibly the most interesting point this article makes is that (at least in theory) the EBM gives booksellers an opportunity to offer readers a convenient print alternative to e-books. Many librarians worry that Google Books represents a serious challenge to the relevance of libraries as physical repositories for printed objects; however, this partnership can provide booksellers and librarians with an opportunity to inexpensively put more materials into print. The University of Michigan, an early and enthusiastic participant in mass digitization projects, suggests that the opportunity to return digitized materials to print provides necessary flexibility in format: "Rather than a one-size-fits-all solution, we believe that the best book format varies in relationship to its uses and its users. Some of the time, an electronic book -- that can be accessed any time, anywhere, and quickly searched -- is exactly what we need. At other times, the ideal form of the book is a nicely bound copy that helps with sustained reading, that serves as a physical reminder of a reading experience, or that can easily be passed from hand to hand." This lends further credence to the argument that the greatest justification for Google Books is the expansion of access to information. If the EBM becomes widely available (a big if, I suppose), users don't even have to have internet access to benefit from mass digitization. They just have to have $10.
   Finally, after looking at the DIY book scanner several weeks ago, I have to wonder how plausible it is that someone will cook up a DIY bookmaking machine. It seems like a pretty huge undertaking, but who knows! Then we'll really democratize access.

Wednesday, November 18, 2009

Resource review #3

Musto, R. G. (2009, June 12). Google Books Mutilates the Printed Past. Chronicle of Higher Education, 55(39).

In this article, Ronald G. Musto, a medieval historian, describes the “promise and perils” of using Google Books for historical research. Musto’s work involves studying archival records related to Naples in the Middle Ages. He briefly describes the repeated destruction and subsequent reconstruction of those records. He notes that “for the few of us who work on the city's urban development, that double mutilation -- of both its archival and architectural past -- makes work difficult at best. More than many other historians, we have to rely on remnants to recreate this history.” Many of these remnants are now available on Google Books, which Musto is decidedly not satisfied with.

Like almost everyone involved in the debate about Google Books, Musto is pleased with the level of new access the resource provides. However, citing a key work in his field, he rails against the quality of Google’s scanning:

“In its frenzy to digitize the holdings of its partner collections, in this case those of the Stanford University Libraries, Google Books has pursued a "good enough" scanning strategy. The books' pages were hurriedly reproduced: No apparent quality control was employed, either during or after scanning. The result is that 29 percent of the pages in Volume 1 and 38 percent of the pages in Volume 2 are either skewed, blurred, swooshed, folded back, misplaced, or just plain missing. A few images even contain the fingers of the human page-turner. (Like a medieval scribe, he left his own pointing hand on the page!) Not bad, one might argue, for no charge and on your desktop. But now I'm dealing with a mutilated edition of a mutilated selection of a mutilated archive of a mutilated history of a mutilated kingdom -- hardly the stuff of the positivist, empirical method I was trained in a generation ago.”

While he admits that this is just one book, and that a cursory search of materials outside his field of study fails to reveal a similar concentration of errors, the poor scanning quality seems to essentially push him over the rhetorical edge. He expresses concerns that Google’s poorly scanned books will replace the world’s collections of rare books and archival materials, arguing that “should Google Books prevail, and the resources of the scholarly community be made irrelevant by Google's sheer scale and force, the future of our past will be in great doubt.”

This view seems pretty extreme to me, but it’s expressed often enough in a variety of articles and blog posts to merit discussion. I don’t think that Google Books is about preservation. I think it’s about access. The ability to do full-text searching in four million books is ridiculously convenient, and that massive opportunity comes at the expense of precision and quality. But given the legal complications of the Google Books settlement, I don’t think that access at the expense of preservation is an argument that Google’s leaders want to be publicly pushing. In a recent New York Times editorial called “A Library to Last Forever,” Google co-founder Sergey Brin is (on the basis of the title alone) obviously suggesting that the project is justified because it will digitally preserve the world’s libraries.* So I suppose it makes sense to judge Google Books by the stringent standards that the goal of preservation implies, since these are the claims the company itself is making.

Still, I don’t see any reason to jump to the conclusion that Google’s digital copies of books are going to make physical collections irrelevant, especially in the case of rare books. If Google were to launch a project involving digitization of the world’s art, I don’t think anyone would suggest that museum curators may as well trash the original “Starry Night.” However, I do understand that the tenor of Musto’s argument is provoked in part by the arrogance of Google’s stated goals. He suggests that Google believes that the noble goals and public good resulting from the project grant them the “right to turn copyright on its head.” It’s an important point, as the stakes are pretty high. Whatever comes out of the settlement, the repercussions will be huge. Once I read a bit more about the new proposed settlement, I’ll blog about it.


*Brin does point out that without access, preservation doesn’t really matter: “…if our cultural heritage stays intact in the world’s foremost libraries, it is effectively lost if no one can access it easily.”

Tuesday, November 3, 2009

No weekend plans?


 
scanner + image by Daniel Reetz.

So, apparently you can make a fully functional book scanner yourself for about $300, if you're willing to scavenge a bit and you happen to have some power tools lying around. More info here and here.

Wednesday, October 28, 2009

Resource review #2: Open Content Alliance and Google Books

Leetaru, K. (2008, October 6). Mass book digitization: The deeper story of Google Books and the Open Content Alliance. First Monday, 13(10).

In this article, Kalev Leetaru offers a nuanced perspective on the similarities and differences between Google Books and the Open Content Alliance (OCA). He focuses primarily on the technical aspects of their work, their willingness to reveal information about their technical processes, their approaches to copyright and user access, and their use of metadata. Although OCA formed as a reaction to the commercial and secretive nature of Google Books, Leetaru points out that they have not quite delivered on their promise of transparency. While Google has released technical reports about innovations they have made and revealed information about their processing in speeches, very little is known about OCA's technical process. Based on information gathered about both organizations, Leetaru suggests that the two projects are conducted using similar methods. However, OCA spends more time on quality control, while Google Books focuses on increasing output and efficiency. Additionally, Google's PDFs are bitonal, which makes them easier to view (even with limited bandwidth) than the full-color scans provided by OCA.
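As an aside, "bitonal" just means 1-bit black-and-white. Here's a quick sketch of the conversion using the Pillow imaging library; the filenames are placeholders of my own, not anything from either project's actual pipeline:

```python
# Bitonal conversion with Pillow (pip install Pillow); filenames are placeholders.
from PIL import Image

scan = Image.open("page_scan.png")   # a full-color page scan
bitonal = scan.convert("1")          # 1 bit per pixel: pure black or white
bitonal.save("page_scan_bw.png")     # far smaller than the color original
```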

Another difference is found in the search options -- Google offers full-text searching, while OCA allows searching only in title and description fields. The two organizations also differ in their approaches to copyright. Google scans copyrighted material, but only allows users to view limited portions in search results. OCA focuses on scanning out-of-copyright materials, but scans in-copyright materials if given permission by the publisher. Possibly the most striking difference described by Leetaru is the approach to restrictions on use of the materials. Public domain materials on Google Books can be downloaded in full. Members of OCA can set their own restrictions on use of the materials they contribute to the project, which means that restrictions vary from item to item. This can apparently get pretty complicated. Google provides metadata explaining the rights policy of each item; OCA does not.
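To make the rights-metadata point concrete, here's a hypothetical sketch of a per-item record and the kind of viewability check it enables. The field names and values are entirely my invention, not Google's or OCA's actual schema:

```python
# Hypothetical per-item rights record -- fields are my invention.
item = {
    "title": "A Treatise on Whale Fisheries (1854)",
    "rights": "public-domain",    # might also be "in-copyright" or "undetermined"
    "download_allowed": True,
}

def viewable_portion(record):
    """Full view for public domain items, snippets for everything else."""
    return "full" if record["rights"] == "public-domain" else "snippet"

print(viewable_portion(item))  # full
```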

This article provides useful information comparing Google Books to a similar mass digitization project, and it's interesting to evaluate OCA's attempt to provide an alternative approach to digitization. Leetaru offers a pretty convincing argument that OCA hasn't been too successful in meeting its stated goals of transparency and open access. The article also includes thorough descriptions of the digitization process. Leetaru makes a point of differentiating between preservation digitization and access digitization; the latter is focused primarily on providing user access to materials, rather than gathering and preserving them. He argues that both OCA and Google Books are attempts at access digitization, which negates much of the criticism directed at Google's quality control standards: if large-scale access is the goal, some level of attention to detail will be lost in order to provide access to more materials. This is an interesting perspective that I hadn't come across before.

Wednesday, October 7, 2009

Resource review #1: Metadata and Google Books

Jackson, M. (2008). Using Metadata to Discover the Buried Treasure in Google Books Search. Journal of Library Administration, 47(1/2).

 In this article, Millie Jackson discusses the relative merits of the metadata created by Google Books, compared to that provided by WorldCat and the MBooks project (now known as the Hathi Trust Digital Library) at the University of Michigan. As she points out, full-text keyword searching has advantages and disadvantages. This feature may allow the researcher to search for concepts more easily than can be done with traditional subject headings or controlled vocabulary. However, Jackson notes that the listing of frequently-used words in the text found on Google Books may fail to give the user a sense of the book's "aboutness." In WorldCat, she explains, the user can click on a subject heading and find similar works. In Google Books, it can be more difficult to find relevant materials in a similar manner.
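To see the difference Jackson describes, here's a toy contrast of my own -- a two-book "catalog" invented purely for illustration -- between controlled-vocabulary lookup and full-text keyword search:

```python
# Invented two-book catalog contrasting subject headings with full-text search.
catalog = [
    {"title": "Moby-Dick", "subjects": ["Whaling", "Sea stories"],
     "text": "Call me Ishmael. Some years ago..."},
    {"title": "Two Years Before the Mast", "subjects": ["Sea stories", "Voyages"],
     "text": "...the romance of the sea, and the whale fishery..."},
]

def by_subject(heading):
    """Controlled vocabulary: exact match on headings a cataloger assigned."""
    return [b["title"] for b in catalog if heading in b["subjects"]]

def by_keyword(word):
    """Full text: any occurrence at all, relevant or not."""
    return [b["title"] for b in catalog if word.lower() in b["text"].lower()]

print(by_subject("Sea stories"))  # both books, grouped by shared "aboutness"
print(by_keyword("whale"))        # only the book whose sample text mentions it
```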


I appreciated the author's even tone and willingness to acknowledge that Google Books will undoubtedly be improved and tweaked over time. Many of the other articles I've come across (some of which I'll discuss later) seem to serve primarily as lists of complaints about poor scanning and bad metadata (as Jackson explains, some of Google's metadata is retrieved automatically from a variety of sources, which can sometimes result in comical and/or frustrating errors). The author also argues that libraries should be looking to Google Books for new ideas, instead of simply finding fault, a perspective I agree with. (Of course, arguments about copyrights and monopolies are another thing entirely, and later on I'll write about resources that address this.)


Additionally, this article is useful because of its discussion of the ways in which libraries can integrate Google Books with other library services. The Hathi Trust Digital Library at the University of Michigan is a good example. As Jackson explains, the Hathi Trust Digital Library offers many features, some of which are similar or identical to those found on Google Books, but in a very different interface. At the Hathi Trust Digital Library, the user can export citations to a citation manager, find print copies in a library using WorldCat, or search a material's full text. More flexible search options are also available -- searches can be narrowed by viewability, subject, author, language, place or date of publication, and original format and location. (Many of these options are also available in Google's advanced search, but aren't as immediately obvious.)

Jackson's article strikes me as a good introduction to the strengths and weaknesses of the search options available in Google Books. Her points, while not discussed in great detail, will be useful in directing me toward other related resources.

Monday, September 7, 2009

First post!

It's been a long time since I've done any blogging. For now, I'll be writing about SLIS-related things, especially LIS 644, Digital Tools, Trends, and Debates. Should be fun!