Information Access Seminar

Facilitating Diverse Collection and Curation in Web Crawling and Indexing and Blockchain: What's Not To Like?

Friday, November 2, 2018
3:10 pm - 5:00 pm
Matt Bayley, Mark Graham, and David S. H. Rosenthal

Facilitating Diverse Collection and Curation in Web Crawling and Indexing
(Matt Bayley & Mark Graham)

We propose to create an open, publicly available index of the public web. Building on the Internet Archive’s 22-year history of archiving web pages (URLs) and making them available, we will construct a publicly accessible list of web sites (hosts). We will provide a variety of ways for people to interact with the data, with two key areas of focus: supporting more and better web archiving, and enabling general research about the Web. In addition to indexing about 2 billion URLs for web hosts, we plan to create and associate various metadata, including language, genre, and last observed HTTP status codes. We consider this project foundational to an ongoing and expanding effort to map resources available via HTTP. Obvious additional enhancements (beyond the scope of this initial project phase) might include adding link graph data and user-generated metadata.

Blockchain: What's Not To Like?
(David S. H. Rosenthal)

We're in a period when blockchain or “distributed ledger technology” is the Solution to Everything™, so it is inevitable that it will be proposed as the solution to problems in academic communication and digital preservation. These proposals typically assume, despite the evidence, that real-world blockchain implementations actually deliver the theoretical attributes of decentralization, immutability, security, anonymity, lack of trust, etc. The proposers appear to believe that Satoshi Nakamoto revealed the infallible Bitcoin protocol to the world on golden tablets; they typically don't appreciate or cite the nearly three decades of research and implementation that led up to it. This talk will discuss the mismatch between theory and practice in blockchain technology.

Matt Bayley is a MIMS student at the I School with a background in data engineering and an interest in software, infrastructure, and tech policy.

Mark Graham has created and managed innovative online products and services since 1984. As director of the Wayback Machine, he is responsible for capturing, preserving, and helping people discover and use more than 1 billion new web captures each week.

David S. H. Rosenthal is retired from Stanford Libraries. He was a team member of CMU's Andrew Project; an early employee and distinguished engineer at Sun Microsystems; employee #4, first chief scientist, and first sysadmin at Nvidia; and, 20 years ago, co-founder of the LOCKSS Program. He has been blogging since 2007, and about blockchains and cryptocurrencies since November 2013.

Last updated: October 22, 2018