Capturing Moments in Time: Collecting Our World’s Memory Together
Courtney C. Mumma, Internet Archive
September 12 & 13, 2016 - Geneva and Zurich - Library Science Talks
Thanks to Zentralbibliothek Zurich, the CERN Scientific Information Service & the Association of International Librarians and Information Specialists (AILIS)

Talk overview
● What we do at the Internet Archive
● Evolution of web archiving at IA
● Working together
  ○ People
    ■ Coordinated collection development
  ○ Systems
    ■ Interoperability
    ■ Distributed preservation
  ○ People + Systems
    ■ Local and Cloud Access
    ■ Research services
    ■ New technology for new challenges

The Internet Archive Non-Profit Library Founded in 1996 by Brewster Kahle

Universal Access to All Knowledge

UN 2030 Agenda Sustainable Development Goals
Vision: A World With Universal Literacy
Goal 16: Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels
Target 16.10: “Ensure public access to information and protect fundamental freedoms, in accordance with national legislation and international agreements”

30,000,000,000,000,000 Bytes Archived (30 PetaBytes)

Books Digitization

Music Digitization

TV Collection

Software Preservation and Emulation

25,000 Software Titles
2,000,000 Moving Images
2,300,000 Book Archive
2,400,000 Audio Recordings
3,000,000 Hours of Television
4,000,000 eBooks
500,000,000,000+ URLs

20 Years of Archiving the Web

1996 US Presidential Campaigns with Smithsonian

218,342,520 Web Captures

1997 First Full Crawl

525,362,846 Web Captures

1998 Donation of Crawl to the Library Of Congress

1,166,891,826 Web Captures

2000 US Presidential Campaigns with the Library of Congress Started Collecting Television Started Digitizing Movies

6,153,042,235 Web Captures

2001 Launch of the WayBack Machine

12,082,859,018 Web Captures

2002 Book Digitization Begins

22,277,788,816 Web Captures

2003 International Internet Preservation Consortium Founded

38,868,116,181 Web Captures

2006 Archive-It Started

103,943,903,726 Web Captures

2007 Ireland

184,277,909,308 Web Captures

2008 National Archive Government Crawls

209,160,715,829 Web Captures

2009 Archive-It Adds its 100th Partner 7 National Library Partners

225,658,093,516 Web Captures

2010 Broad and Survey Web-Scale Crawls

246,744,306,660 Web Captures

2012 Australia National Library

324,068,259,435 Web Captures

2014 Emulation of Software Archive in the Browser

452,769,236,649 Web Captures

2015 Archive-It Adds its 400th Partner

467,195,419,069 Web Captures

Support Open Source Software

Global Wayback ● Broad snapshot ● Deep crawl on popular sites ● Broad crawl on known domains ● Donated and targeted crawls

● On-demand

● No More 404s

Working together Web Archiving Partnerships and Services

Domain Scale Web Preservation

Contract Crawling
Domain-scale
• Run by Internet Archive
• Average 300 million URLs per collection
Partial List of Partners:
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library
Partial List of Collections:
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations
• Australian domain (2005-2011)

Archive-It Curated, Selective Web Archiving

Archive-It
• Web-based application
• Fully hosted service
• Includes access and storage
• Select, manage, scope and catalog with metadata
• 10 different crawl frequencies
• html, videos, audio, social networking, PDFs, images, news
• Archives available 24 hours after a capture is complete & full text search is available within 7 days
• Restricted access options

How our partners use Archive-It
• Work with associate agencies to archive their websites and publications
• Enhance and supplement traditional offline collections
• Support records retention policies
• Collaborate with other like-minded institutions and share research
• Collaborate with subject specialists to capture "at risk" digital content during a spontaneous event

Spontaneous Web Archiving
• Started in 2007
• “At risk” web content that quickly disappears as an event unfolds and develops
• Global in scope or have global implications and impact
• Curators use Archive-It to add websites, metadata and set up automated crawls
• Collections are all publicly available

Examples
Some of these Collections:
• Virginia Tech Tragedy (2007)
• Georgia/Russia Conflict (2008)
• Earthquake in Haiti (2010)
• Jasmine Revolution (2011)
• Revolution in Egypt (2011)
• Japan Tsunami (2011)
• Russian Presidential Election (2012)
• Hurricane Sandy (2012)
• US Government Shutdown (2013)
• Paris Attacks (2015)
• Orlando Shootings (2016)

March 11th, 2011 Captured Content

March 11th, 2011 BBC Coverage of Earthquake http://wayback.archive-it.org/2438/20110311172026/http://www.bbc.co.uk/news/world-asia-pacific-12711313

Dynamically Created Content Seismic activity in Japan for last 30 days Dynamically created information - charts, data, maps, images, etc

http://wayback.archive-it.org/2438/20120119192420/http://www.hinet.bosai.go.jp/hypomap/

Interactive Content

Before & After Imagery http://wayback.archive-it.org/2438/20110714013452/http://www.abc.net.au/news/specials/japan-quake-2011/

Images & Video

Images of Military Assistance

Stories about the Earthquake and Tsunami

Capture Date: 10/30/2011

Capture Date: 12/29/2011

Archived: Reports of Medical Support Teams

Capture date: 08/06/2011 http://wayback.archive-it.org/2438/20110806224310/http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html

Live Web: Reports of Medical Support Teams

404 error: 10/25/2012 http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html
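The archived and live URLs above differ only by a prefix, which suggests how an Archive-It Wayback address is put together. The sketch below is an illustration, not Archive-It's own code: it assumes the first path segment (2438 here) is the collection identifier and the 14-digit segment is a capture timestamp in YYYYMMDDhhmmss form, both inferred from the URLs and capture dates on these slides. The names `ArchivedUrl` and `parse_archived_url` are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ArchivedUrl(NamedTuple):
    """Parts of an Archive-It Wayback URL (hypothetical helper type)."""
    collection: str
    captured_at: datetime
    original_url: str


# Assumed layout: http://wayback.archive-it.org/<collection>/<timestamp>/<original URL>
_PATTERN = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")


def parse_archived_url(url: str) -> ArchivedUrl:
    """Split an archived URL into collection, capture time and original URL."""
    match = _PATTERN.match(url)
    if match is None:
        raise ValueError(f"not an Archive-It Wayback URL: {url!r}")
    collection, stamp, original = match.groups()
    captured_at = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return ArchivedUrl(collection, captured_at, original)


# URLs taken from the medical support team example above.
archived = parse_archived_url(
    "http://wayback.archive-it.org/2438/20110806224310/"
    "http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html"
)
live = "http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html"

print(archived.collection)            # collection identifier
print(archived.captured_at.date())    # capture date
print(archived.original_url == live)  # does the capture point at the dead live URL?
```

Under these assumptions, the capture date recovered from the timestamp matches the 08/06/2011 capture date on the slide, and the embedded original URL is the same address that returned a 404 on the live web in 2012.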

Collaborative Collection Building Pairing subject expertise with technical expertise Content submissions come from: • Subject matter experts • Concurrent projects • Local experts • Crowdsourcing


2014 Congressional Election Cycle Collection

Stanford University & UC Berkeley


#blacklivesmatter Web Archive

Documenting Ferguson, Washington University, St. Louis


Charlie Hebdo

Members of the IIPC Including the Bibliothèque Nationale de France


2013 Government Shutdown

Call for URL submissions Via Facebook and Twitter

Challenges Inherently Ad hoc Curator Identification and Diversity Imperative to Act Quickly Rapid Change Quality Assurance Access and Data Analysis

Working together Systems Interoperability and Distributed Preservation

APIs (*application programming interfaces)

● Interoperability ● Flexibility and modularity ● Loose coupling of services (so we can improve pieces as needed) ● Scalability - Bulk data upload and download

Challenges
● Web archives are rarely integrated into Digital Initiatives infrastructure (DL, DA, DR, etc)
● Reliance on a few providers for entire lifecycle
● Variance of coordination on emergent efforts
● Lack of foresight for interoperability

WARCs, CDXs and derivatives

Access

Content Mgmt

Preservation

Storage

Lost in the maze in Labyrinth (1986, LucasFilm, screen capture)

Web Archiving Tools

APIs for Community
● Community building and coordinated collection
● Compare/contrast ‘duplicate’ archives
● Determine systems of record
● Local ingest and preservation of WARCs
● Broaden options for use (data mining, bulk access, federated browsing)

Related Web Archive API Work
● CDX Server APIs (IA, IIPC)
● Derivative formats (Archive-It, BL)
● Crawl logs and partner data (Archive-It)
● Wayback Machine APIs (IA)
● WASAPI - Web Archiving Services APIs
  ○ Community & API for WARC export/import (US IMLS, IA, Stanford, Rutgers, UNT, LOCKSS)
● Archiving Contemporary Composers (IA, NYU, Mellon Foundation)
● Memento and Time Travel APIs (Los Alamos National Laboratory, Old Dominion University, LC, Mellon Foundation)
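To make the CDX item above more concrete, here is a minimal sketch of reading CDX index lines and counting captures per year, the kind of change-over-time question a capture index can answer. It assumes a seven-field, space-separated layout (urlkey, timestamp, original, mimetype, statuscode, digest, length); CDX field layouts vary between tools, so treat that layout as an assumption. The sample lines, including their digests and lengths, are invented for illustration, and `CdxRecord`, `parse_cdx_line` and `captures_per_year` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CdxRecord(NamedTuple):
    """One capture in a CDX index (assumed seven-field layout)."""
    urlkey: str
    timestamp: str   # YYYYMMDDhhmmss
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one space-separated CDX line into a record."""
    parts = line.split()
    if len(parts) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(parts)}")
    return CdxRecord(*parts)


def captures_per_year(lines: Iterable[str]) -> Counter:
    """Count captures by the year prefix of each timestamp."""
    return Counter(parse_cdx_line(line).timestamp[:4] for line in lines if line.strip())


# Invented sample data: not real captures, digests or lengths.
SAMPLE = [
    "org,example)/ 20110311172026 http://example.org/ text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1234",
    "org,example)/ 20110806224310 http://example.org/ text/html 200 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1300",
    "org,example)/ 20120119192420 http://example.org/ text/html 404 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 512",
]

per_year = captures_per_year(SAMPLE)
print(sorted(per_year.items()))
```

Keeping the parser separate from the counting step mirrors the loose coupling discussed earlier: a different index layout would only require changing `parse_cdx_line`.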

Working together Access and Use

Research

Why focus on use?
● Web Archives
  ○ meaningful historical records
  ○ diversity of content types and born-digital resources across time
  ○ rich sources for data mining
● Current Access Models
  ○ URL-centric clicking & browsing through Wayback
  ○ Query-based retrieval and search functionality
  ○ Silos
  ○ Use largely TBD or not measured

Researchers | Web Archives
Want data | Have a lot of data
Interested in change over time | Have data segmented over time
Study semantics, entities, locations | Have a rich diversity of content
Multidisciplinary | Multidisciplinary
Value info about collection origins | Chock full of provenance information

Challenges - Technical
● Variations in capture proficiency
● Complex WARC format
● Tons o’ data, which needs lots of infrastructure
  ○ Storage
  ○ Computational infrastructure
● Diversity of contained data
● Copyright and IP, full content access problematic

Challenges - Conceptual
● Acquisition
  ○ Lure of evermore data
  ○ More data is not “better” data
  ○ The Neverending Web
● Provenance
  ○ Crawl configurations
  ○ Moving target
  ○ Meaningful access, traditional finding aids inadequate
● Attestation
  ○ Higher sensitivity to elision than in traditional archives?

Research Services

Goals of Archive-It Research Services ● Expand access models for web archives ● Enable new insights into collections ● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research ● Facilitate computational analysis and new use cases ● Increase use, visibility, and value of Archive-It partner collections

Web Archives Datasets

Archive-It Research Services http://bit.ly/ait_ars

Exploring the Canadian Political Interest Group and Political Parties Web Sphere via WAT files

Named Entities in the Human Rights Collection

Going forward
● Datasets to researchers, patrons, developers
● Minimize need for dedicated infrastructure
  ○ leverage the Internet Archive’s computing power and expertise
● Hide complexities of cluster computing and of web archive and derivative formats and organization
● Ongoing development of platforms and APIs for research data visualization and manipulation

Working together New technology to face new challenges

The Web keeps getting more complex, so archiving is harder

Personalized ads in Minority Report (2002, 20th Century Fox, screen capture)

Challenges
● Web archiving still a niche collecting activity
● Few coordinated efforts on shared tools
● Little familiarity with formats, software, or processes
● Few on-ramps for non-developer and developer participation

Ongoing efforts
• Open Wayback
• Social media
  – Brozzler and Umbra (Archive-It)
  – Documenting the Now
  – Social Feed Manager (GWU)
• UNT Nomination Tool
• Proliferating capture tools (GWU, IA, Rhizome)
• WASAPI - Community building
• Memento

BROZZLER!

Let’s Build Libraries Together!

THANK YOU [email protected]
