Sep 10, 2009

Feeling Lucky? Enough to Trust Google With UC's Library Holdings?

By Carol Ness, UC Berkeley Public Affairs

An I School conference explores the pros and cons of letting Google control every aspect of 'the last library'

BERKELEY — On its face, the proposed settlement of a lawsuit over Google Book Search — the internet titan's effort to digitize vast numbers of books, academic and otherwise, for all the world to consult online — is about resolving copyright and fair-use issues.

But the settlement, which goes before a federal judge in New York for approval next month, is about much more than that, according to experts assembled for the Berkeley School of Information's recent day-long conference on the implications raised by Google Books.

"It's one of the most important issues of our era," said School of Information Dean AnnaLee Saxenian in opening the discussion, which attracted an almost full house to Banatao Auditorium in Sutardja Dai Hall on Aug. 28.

By the end of the day, settlement fans and critics alike had generated a long list of commitments they'd like to see Google make, in writing, before it is given legal permission to move ahead with its mammoth digital library project, which both the UC system and the Berkeley campus have supported and participated in.

Chief among the concerns raised were that the settlement does not contain guarantees protecting the public trust, the privacy of people using Google Book Search, the accuracy of the data in the digitized book collection, and broad public access to the database, or "corpus."

"This is likely to be the last library," said School of Information adjunct professor Geoffrey Nunberg. Google's massive head start in scanning the books, and the costs involved in such work, mean it's likely that no one else will ever try to duplicate its effort, Nunberg contended — "hence the urgency of [the] questions."

Google has already scanned more than 7 million books held in private and university libraries, including many of UC's, with the intent of making them available online. An estimated two-thirds are out of print but still in copyright, and many fall into a category called "orphan books" because the copyright holder cannot be identified.

The long, complicated settlement proposal would give Google the right to scan and put online all such books without risk of being sued. It would resolve lawsuits over copyright and fair-use issues brought by the Association of American Publishers and the Authors Guild.

Critics, among them Berkeley Law professor Pamela Samuelson, who moderated a panel on public access, have argued that the settlement amounts to granting Google a monopoly over digitization of the orphan books. Samuelson, director of the Berkeley Center for Law and Technology and founder of the Samuelson High Technology and Public Policy clinic at the law school, has written that the settlement would amount to "a privately negotiated compulsory license designed to monetize millions of orphan works" for the benefit of Google, plus some — but not all — authors and publishers.

Nineteen UC faculty have signed a letter to the court considering the settlement, raising concerns about its potential effects on the interests of scholarly authors — without opposing the settlement itself.

The issues raised at the conference touched squarely on areas critical to Berkeley's mission, including concerns that putting a corporation in charge of access to the digitized volumes might limit scholarship and research, compromise the accuracy of information in the books, and jeopardize readers' privacy.

What if, for instance, universities experiencing hard times could no longer afford Google's price for its books subscription? Several panelists raised that question, noting that many institutions have already struggled to keep paying high prices for scholarly journals.

Public libraries have long been on the front lines of the fight to protect readers' privacy, but concern was raised about how those values would apply in the digital world. The Google Books settlement, said Angela Maycock of the American Library Association, while incredibly detailed in many areas, "is silent on privacy."

Berkeley's University Librarian, Tom Leonard, said that "the core issue for research libraries is protecting users' anonymity," and that the reassurances Google has posted online don't go far enough.

"I trust Google," Leonard said. "I'd like to trust Google more."

Problems with the accuracy of Google's database raised the ire of the I School's Nunberg.

When it comes to metadata — keywords used to identify works by subject, publication date and the like — "Google's are terrible," Nunberg said. "The problems are pervasive and endemic in the database."

He provided illustrations from searches he'd run on the database. One turned up 527 books on the subject of the Internet that supposedly were published before 1950 — long before the Web came into being. Another got 182 hits on books about Charles Dickens carrying publication dates from before he was born. And in a third, 46 of 66 hits on the phrase "candy bar" were misdated.

"A 70 percent error rate seems to be quite high," Nunberg observed drily. Other problems cropped up in classification errors — Hamlet filed under "antiques and collectibles," for example.

"Google's response is, 'we're fixing it,' but that doesn't change systemic problems," Nunberg said, asking: "Why should Google have no obligations to do Google Book Search right?"

Dan Clancy, a Google spokesperson who was an active part of the discussion all day, said the settlement should be considered a starting point for discussion of such issues. Pointing out that Google Books' most vocal critics are among its most avid users, he said the ongoing dialogue among all parties will lead to resolution of the issues raised. "I don't think we are one library; I think we're part of a broader community," Clancy said.

UC and its libraries were active in the settlement, despite reservations, because it promises the public a great deal of access, said Dan Greenstein, UC vice provost for academic planning, programs, and coordination.

"In a post-settlement world, our library holdings — we have 35 million volumes, something like 14 million of them unique, probably 15-20 percent of them in the public domain — are going to be available online, available to be searched. If you want a physical copy of the book, you can be directed to places where you can buy it, or to libraries near you.

"Does anyone think that's not a good thing?" Greenstein asked.

Last updated: October 4, 2016