Capturing Moments in Time: Collecting Our World’s Memory Together Courtney C. Mumma, Internet Archive September 12&13, 2016 - Geneva and Zurich - Library Science Talks Thanks to Zentralbibliothek Zurich, the CERN Scientific Information Service & the Association of International Librarians and Information Specialists (AILIS)
Talk overview
● What we do at the Internet Archive
● Evolution of web archiving at IA
● Working together
  ○ People
    ■ Coordinated collection development
  ○ Systems
    ■ Interoperability
    ■ Distributed preservation
  ○ People + Systems
    ■ Local and Cloud Access
    ■ Research services
    ■ New technology for new challenges
The Internet Archive Non-Profit Library Founded in 1996 by Brewster Kahle
Universal Access to All Knowledge
UN 2030 Agenda Sustainable Development Goals Vision: A World With Universal Literacy Target 16.10: “Ensure public access to information and protect fundamental freedoms, in accordance with national legislation and international agreements” Goal 16: Promote peaceful and inclusive societies for sustainable development, provide access to justice for all and build effective, accountable and inclusive institutions at all levels
30,000,000,000,000,000 Bytes Archived (30 PetaBytes)
Books Digitization
Music Digitization
TV Collection
Software Preservation and Emulation
25,000 Software Titles
2,000,000 Moving Images
2,300,000 Book Archive
2,400,000 Audio Recordings
3,000,000 Hours of Television
4,000,000 eBooks
500,000,000,000+ URLs
20 Years of Archiving the Web
1996 US Presidential Campaigns with Smithsonian
218,342,520 Web Captures
1997 First Full Crawl
525,362,846 Web Captures
1998 Donation of Crawl to the Library Of Congress
1,166,891,826 Web Captures
2000 US Presidential Campaigns with the Library of Congress Started Collecting Television Started Digitizing Movies
6,153,042,235 Web Captures
2001 Launch of the WayBack Machine
12,082,859,018 Web Captures
2002 Book Digitization Begins
22,277,788,816 Web Captures
2003 International Internet Preservation Consortium Founded
38,868,116,181 Web Captures
2006 Archive-It Started
103,943,903,726 Web Captures
2007 Ireland
184,277,909,308 Web Captures
2008 National Archive Government Crawls
209,160,715,829 Web Captures
2009 Archive-It Adds its 100th Partner 7 National Library Partners
225,658,093,516 Web Captures
2010 Broad and Survey Web-Scale Crawls
246,744,306,660 Web Captures
2012 Australia National Library
324,068,259,435 Web Captures
2014 Emulation of Software Archive in the Browser
452,769,236,649 Web Captures
2015 Archive-It Adds its 400th Partner
467,195,419,069 Web Captures
Support Open Source Software
Global Wayback ● Broad snapshot ● Deep crawl on popular sites ● Broad crawl on known domains ● Donated and targeted crawls
● On-demand
● No More 404s
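A lookup of the "No More 404s" kind can be sketched against the public Wayback availability endpoint. This is a minimal illustration, not a description of how the feature itself is built: the helper names `availability_query` and `closest_capture` are hypothetical, and the sample response values are invented. Only the endpoint and the shape of its JSON reply are taken from the public API.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public Wayback availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL for a page that returned a 404 on the live web."""
    params = {"url": url}
    if timestamp:
        # YYYYMMDDhhmmss, or a shorter prefix such as YYYYMMDD.
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_capture(response_text: str) -> Optional[str]:
    """Return the URL of the closest archived capture, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Illustrative response body; the values are invented for this example.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20110311172026/http://example.com/",
            "timestamp": "20110311172026",
            "status": "200",
        }
    }
})

print(availability_query("example.com/missing", "20110311"))
print(closest_capture(sample))
print(closest_capture('{"archived_snapshots": {}}'))  # no capture available
```

A browser extension or link checker would fetch the query URL only after the live request fails, then offer the returned capture in place of the error page.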
Working together Web Archiving Partnerships and Services
Domain Scale Web Preservation
Contract Crawling
Domain-scale
• Run by Internet Archive
• Average 300 million URLs per collection
Partial List of Partners:
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library
Partial List of Collections:
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations
• Australian domain (2005-2011)
Archive-It Curated, Selective Web Archiving
Archive-It
• Web based application
• Fully hosted service
• Includes access and storage
• Select, manage, scope and catalog with metadata
• 10 different crawl frequencies
• html, videos, audio, social networking, PDFs, images, news
• Archives available 24 hours after a capture is complete & full text search is available within 7 days
• Restricted access options
How our partners use Archive-It
• Work with associate agencies to archive their websites and publications
• Enhance and supplement traditional offline collections
• Support records retention policies
• Collaborate with other like-minded institutions and share research
• Collaborate with subject specialists to capture "at risk" digital content on a spontaneous event
Spontaneous Web Archiving
• Started in 2007
• “At risk” web content that quickly disappears as an event unfolds and develops
• Global in scope or have global implications and impact
• Curators use Archive-It to add websites, metadata and set up automated crawls
• Collections are all publicly available
Examples
Some of these Collections:
• Virginia Tech Tragedy (2007)
• Georgia/Russia Conflict (2008)
• Earthquake in Haiti (2010)
• Jasmine Revolution (2011)
• Revolution in Egypt (2011)
• Japan Tsunami (2011)
• Russian Presidential Election (2012)
• Hurricane Sandy (2012)
• US Government Shutdown (2013)
• Paris Attacks (2015)
• Orlando Shootings (2016)
March 11th, 2011 Captured Content
March 11th, 2011 BBC Coverage of Earthquake http://wayback.archive-it.org/2438/20110311172026/http://www.bbc.co.uk/news/world-asia-pacific-12711313
Dynamically Created Content Seismic activity in Japan for last 30 days Dynamically created information - charts, data, maps, images, etc
http://wayback.archive-it.org/2438/20120119192420/http://www.hinet.bosai.go.jp/hypomap/
Interactive Content
Before & After Imagery http://wayback.archive-it.org/2438/20110714013452/http://www.abc.net.au/news/specials/japan-quake-2011/
Images & Video
Images of Military Assistance
Stories about the Earthquake and Tsunami
Capture Date: 10/30/2011
Capture Date: 12/29/2011
Archived: Reports of Medical Support Teams
Capture date: 08/06/2011 http://wayback.archive-it.org/2438/20110806224310/http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html
Live Web: Reports of Medical Support Teams
404 error: 10/25/2012 http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html
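The archived URLs on these slides appear to follow one pattern: the Archive-It Wayback host, a collection number, a 14-digit capture timestamp, then the original URL. The sketch below splits a capture URL into those parts. The pattern is inferred from the examples shown here rather than from a published specification, and `Capture` and `parse_capture_url` are hypothetical names.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Pattern inferred from the example URLs on these slides (an assumption):
#   http://wayback.archive-it.org/<collection id>/<YYYYMMDDhhmmss>/<original URL>
_CAPTURE_RE = re.compile(
    r"^https?://wayback\.archive-it\.org/"
    r"(?P<collection>\d+)/(?P<ts>\d{14})/(?P<original>.+)$"
)


class Capture(NamedTuple):
    collection: int
    captured_at: datetime
    original_url: str


def parse_capture_url(url: str) -> Capture:
    """Split a capture URL into collection id, capture time and original URL."""
    match = _CAPTURE_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised capture URL: {url}")
    return Capture(
        collection=int(match["collection"]),
        captured_at=datetime.strptime(match["ts"], "%Y%m%d%H%M%S"),
        original_url=match["original"],
    )


capture = parse_capture_url(
    "http://wayback.archive-it.org/2438/20110806224310/"
    "http://www.hyogo.med.or.jp/ishikai/northeastEarthquake/report.html"
)
print(capture.collection)          # 2438
print(capture.captured_at.date())  # 2011-08-06
print(capture.original_url)
```

The recovered original URL is the one that later returned a 404 on the live web, which is what makes the archived copy the only remaining record of the report.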
Collaborative Collection Building Pairing subject expertise with technical expertise Content submissions come from: • Subject matter experts • Concurrent projects • Local experts • Crowdsourcing
2014 Congressional Election Cycle Collection
Stanford University & UC Berkeley
#blacklivesmatter Web Archive
Documenting Ferguson, Washington University, St. Louis
Charlie Hebdo
Members of the IIPC Including the Bibliothèque Nationale de France
2013 Government Shutdown
Call for URL submissions Via Facebook and Twitter
Challenges
• Inherently Ad hoc
• Curator Identification and Diversity
• Imperative to Act Quickly
• Rapid Change
• Quality Assurance
• Access and Data Analysis
Working together Systems Interoperability and Distributed Preservation
APIs (*application programming interfaces)
● Interoperability ● Flexibility and modularity ● Loose coupling of services (so we can improve pieces as needed) ● Scalability - Bulk data upload and download
Challenges
• Web archives are rarely integrated into Digital Initiatives infrastructure (DL, DA, DR, etc)
• Reliance on a few providers for entire lifecycle
• Variance of coordination on emergent efforts
• Lack of foresight for interoperability
WARCs, CDXs and derivatives
Access
Content Mgmt
Preservation
Storage
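To make the WARC container format concrete, here is a minimal sketch that splits one uncompressed WARC record into its version line, named headers and content block. It assumes the basic record layout (a version line, header lines, a blank line, then `Content-Length` bytes of content). The function name `parse_warc_record` is hypothetical, and the sample record is invented and omits fields that real records carry, such as `WARC-Record-ID`. Production WARC files are usually gzip-compressed record by record and are better read with a dedicated library.

```python
from typing import Dict, Tuple


def parse_warc_record(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split one uncompressed WARC record into (version, headers, content block)."""
    # The WARC header block ends at the first blank line.
    head, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line after the WARC headers")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"unexpected version line: {version!r}")
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length says how many bytes of content follow the headers.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Invented sample: an HTTP response wrapped in a WARC "response" record.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"WARC-Date: 2011-03-11T17:20:26Z\r\n",
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n",
    b"\r\n",
    payload,
    b"\r\n\r\n",  # records are terminated by two CRLFs
])

version, headers, content = parse_warc_record(sample)
print(version)
print(headers["WARC-Target-URI"], headers["WARC-Date"])
print(content == payload)
```

A CDX index is in effect a table built from these headers (target URI, date, status, offset), which is why the same records can serve access, content management, preservation and storage layers.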
Lost in the maze in Labyrinth (1986, LucasFilm, screen capture)
Web Archiving Tools
APIs for Community
● Community building and coordinated collection
● Compare/contrast ‘duplicate’ archives
● Determine systems of record
● Local ingest and preservation of WARCs
● Broaden options for use (data mining, bulk access, federated browsing)
Related Web Archive API Work
● CDX Server APIs (IA, IIPC)
● Derivative formats (Archive-It, BL)
● Crawl logs and partner data (Archive-It)
● Wayback Machine APIs (IA)
● WASAPI - Web Archiving Services APIs
  ○ Community & API for WARC export/import (US IMLS, IA, Stanford, Rutgers, UNT, LOCKSS)
● Archiving Contemporary Composers (IA, NYU, Mellon Foundation)
● Memento and Time Travel APIs (Los Alamos National Laboratory, Old Dominion University, LC, Mellon Foundation)
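As one illustration of what these APIs make possible, the sketch below builds a query for the Wayback Machine's public CDX server and turns its JSON output, where the first row names the fields, into one dictionary per capture. The helper names `cdx_query` and `parse_cdx_json` are hypothetical, and the sample rows are invented. Other CDX servers may expose different parameters.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query(url: str, date_from: str, date_to: str, limit: int = 10) -> str:
    """Build a CDX query URL for captures of `url` within a date range."""
    params = {
        "url": url,
        "from": date_from,  # timestamp prefix, for example "2011"
        "to": date_to,
        "output": "json",
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(body: str) -> List[Dict[str, str]]:
    """Convert JSON output (header row, then data rows) into dictionaries."""
    if not body.strip():
        return []
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Invented sample in the header-then-rows shape of the JSON output.
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20110311172026", "http://example.com/", "text/html", "200", "AAAA", "1234"],
    ["com,example)/", "20111230010203", "http://example.com/", "text/html", "200", "BBBB", "1300"],
])

print(cdx_query("example.com", "2011", "2011"))
for capture in parse_cdx_json(sample):
    print(capture["timestamp"], capture["statuscode"], capture["digest"])
```

Comparing the `digest` column across two archives of the same URL is one simple way to approach the "compare/contrast duplicate archives" goal without moving any WARC data.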
Working together Access and Use
Research
Why focus on use?
● Web Archives
  ○ meaningful historical records
  ○ diversity of content types and born-digital resources across time
  ○ rich sources for data mining
● Current Access Models
  ○ URL-centric clicking & browsing through Wayback
  ○ Query-based retrieval and search functionality
  ○ Silos
● Use largely TBD or not measured
Researchers | Web Archives
Want data | Have a lot of data
Interested in change over time | Have data segmented over time
Study semantics, entities, locations | Have a rich diversity of content
Multidisciplinary | Multidisciplinary
Value info about collection origins | Chock full of provenance information
Challenges - Technical
• Variations in capture proficiency
• Complex WARC format
• Tons o’ data, plus the infrastructure to handle it: storage, computational infrastructure
• Diversity of contained data
• Copyright and IP, full content access problematic
Challenges - Conceptual Acquisition
Lure of evermore data
Provenance
More data is not “better” data
The Neverending Web Crawl configurations Moving target Meaningful access, traditional finding aids inadequate
Attestation Higher sensitivity to elision than in traditional archives?
Research Services
Goals of Archive-It Research Services ● Expand access models for web archives ● Enable new insights into collections ● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research ● Facilitate computational analysis and new use cases ● Increase use, visibility, and value of Archive-It partner collections
Web Archives Datasets
Archive-It Research Services http://bit.ly/ait_ars
Exploring the Canadian Political Interest Group and Political Parties Web Sphere via WAT files
Named Entities in the Human Rights Collection
Going forward
● Datasets to researchers, patrons, developers
● Minimize need for dedicated infrastructure
  ○ leverage the Internet Archive’s computing power and expertise
● Hide complexities of cluster computing and of web archive and derivative formats and organization
● Ongoing development of platforms and APIs for research data visualization and manipulation
Working together New technology to face new challenges
The Web keeps getting more complex, so archiving is harder
Personalized ads in Minority Report (2002, 20th Century Fox, screen capture)
Challenges
• Web archiving still a niche collecting activity
• Few coordinated efforts on shared tools
• Little familiarity with formats, software, or processes
• Few on-ramps for non-developer and developer participation
Ongoing efforts
• Open Wayback
• Social media
  – Brozzler and Umbra (Archive-It)
  – Documenting the Now
  – Social Feed Manager (GWU)
• UNT Nomination Tool
• Proliferating capture tools (GWU, IA, Rhizome)
• WASAPI - Community building
• Memento
BROZZLER!
Let’s Build Libraries Together!
THANK YOU