February 04, 2004

How to Digitize Eight Million Books

The New York Times reports (registration required) that Google and Stanford are planning to digitize books:

And Google has embarked on an ambitious secret effort known as Project Ocean, according to a person involved with the operation. With the cooperation of Stanford University, the company now plans to digitize the entire collection of the vast Stanford Library published before 1923, which is no longer limited by copyright restrictions. The project could add millions of digitized books that would be available exclusively via Google.

Meanwhile, The Book and the Computer has a fascinating interview with Stanford's Michael Keller. Keller, the University Librarian, is leading a project to digitize eight million books. The article can be found here, but the URL doesn't look to be permanent, so you may need to do some browsing to find it.

Some take-aways:


  • This is a really impressive project, and it gets even more interesting if the content generated becomes accessible via Google.

  • The storage needs for something like this are crazy. Keller: "With eight million volumes, if we were to digitize everything, we would end up with about a petabyte and a half of data. A petabyte is 10 to the 15th power."

  • The value of the project is tied directly to rights management and copyright law. Sure, they'll be able to digitize plenty of government documents and the like, but the bulk of the material that would be available on Google is pre-1923 material that has fallen into the public domain. This cutoff is a result of Congress' ongoing enlargements of the copyright term (see Lessig, Future of Ideas). Imagine the resource we'd have if more of this material were digitized and at our fingertips.
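For a sense of scale, Keller's numbers imply a hefty average size per book. A quick back-of-the-envelope check (the 8 million volumes and 1.5 petabyte figures are from the interview; the per-volume average is my own arithmetic, using the decimal definition of a petabyte):

```python
# Rough check of Keller's storage estimate:
# 8 million volumes totaling ~1.5 petabytes of scanned data.
VOLUMES = 8_000_000
TOTAL_BYTES = 1.5 * 10**15  # 1.5 PB; 1 PB = 10^15 bytes, per the quote

per_volume_mb = TOTAL_BYTES / VOLUMES / 10**6
print(f"Average per volume: {per_volume_mb:.1f} MB")  # → 187.5 MB
```

Nearly 200 MB per book is consistent with high-resolution page images rather than plain text, which would run to only a megabyte or so per volume.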