Building Event Collections from Crawling Web Archives

1. Building Event Collections from Crawling Web Archives
@mart1nkle1n, IIPC WAC 2018, 11/13/2018, Wellington, NZ
Martin Klein (1), Lyudmila Balakireva (1), Herbert Van de Sompel (2)
(1) Research Library, Los Alamos National Laboratory
(2) Data Archiving and Networked Services, The Netherlands

2. Inspiration from Previous Work
https://doi.org/10.1007/978-3-319-67008-9_10

3. Published at WebSci 2018
https://doi.org/10.1145/3201064.3201085

4. Questions
  1. Can we create event collections by focused crawling of online-available web archives?
  2. How do event collections created from the archived web compare to those created from the live web?
  3. How does the amount of time passed since the event affect the collections built from the live and the archived web?
  4. How do event collections built from the archived web compare to manually curated collections?

5. Background – Event Collection Building
• Often orchestrated by subject matter experts, archivists, special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed Manager

6. Problems and our Approach
• Temporal: time passed since event is of concern → Use of web archives via Memento infrastructure
• Selection: seeds often picked manually → Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans → Use of focused crawling with content and temporal relevance assessment




13. Focused Crawling
[Figure: crawl tree rooted at a Seed with children Child 1, Child 2, Child 3 and grandchildren Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2. Legend: Not crawled; Crawled and not relevant; Crawled and relevant.]

16. Content Relevance
  1. Content of Wikipedia page + random 60% of page's references
     • Generate topic vector (TF-IDF of 1-grams + 2-grams)
  2. Content of remaining 40% of Wikipedia page's references
     • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Run 10 times
• Take average cosine similarity value as content threshold
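
The threshold procedure on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the fixed random seed, and all function names (`ngrams`, `tfidf_vector`, `cosine`, `content_threshold`) are assumptions made for the example.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Return the lowercase 1-grams and 2-grams of a text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vector(docs, idf):
    """Sum term counts over docs and weight each term by its IDF."""
    counts = Counter(term for doc in docs for term in ngrams(doc))
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(wiki_text, references, runs=10, seed=0):
    """Average cosine similarity over `runs` random 60/40 reference splits."""
    rng = random.Random(seed)
    all_docs = [wiki_text] + references
    # Assumption: smoothed IDF computed over the page and all its references.
    doc_freq = Counter(t for doc in all_docs for t in set(ngrams(doc)))
    idf = {t: math.log((1 + len(all_docs)) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    cut = round(0.6 * len(references))
    scores = []
    for _ in range(runs):
        shuffled = rng.sample(references, len(references))
        # Vector 1: Wikipedia page plus a random 60% of the references.
        vec1 = tfidf_vector([wiki_text] + shuffled[:cut], idf)
        # Vector 2: the remaining 40% of the references.
        vec2 = tfidf_vector(shuffled[cut:], idf)
        scores.append(cosine(vec1, vec2))
    return sum(scores) / len(scores)
```

Calling `content_threshold(page_text, reference_texts)` returns one value between 0 and 1; a crawled page whose similarity to the topic vector falls below it would be treated as off-topic.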

17. Temporal Relevance
• Define temporal interval for which crawled pages are considered relevant
• Event date extracted from Wikipedia event page
[Figure: temporal relevance (y-axis, 0 to 1) along a timeline marked Event Date, Change Point, and Today.]
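
A minimal sketch of an interval-based temporal score follows. The slide only states that an interval is defined; how the score behaves outside the interval is not recoverable from the extracted figure, so the hard cut-off at both ends is an assumption, as are the function name and the change point date used in the usage note.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1.0 if page_date lies in [event_date, change_point], else 0.0.

    Assumption: a hard cut-off at both ends of the interval.
    """
    return 1.0 if event_date <= page_date <= change_point else 0.0
```

For example, with the Orlando event date of `date(2016, 6, 12)` and a hypothetical change point of `date(2016, 8, 1)`, a page dated `date(2016, 6, 20)` scores 1.0 and a page dated `date(2017, 1, 1)` scores 0.0.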

18. Change Point Detection
• Plot number of Wikipedia page edits per day
• Run R's changepoint algorithm
• Detect significant change in curve
https://cran.r-project.org/web/packages/changepoint/index.html
[Figure: x-axis: edit dates (2016-06-12 to 2017-08-24); y-axis: percentage (0 to 100); the plot carries the annotation 46.]
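
To illustrate the idea of a change point in a daily edit-count series, here is a deliberately simplified sketch. It is not R's changepoint package: it finds at most one change in the mean by least squares, always returns a split, and performs no significance test. The function name and the sample counts are made up for the example.

```python
def single_change_point(counts):
    """Return the index k (1 <= k < n) that best splits counts into two
    segments with different means, by minimizing total squared error."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")

    def sse(segment):
        # Sum of squared deviations from the segment mean.
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    return min(range(1, n), key=lambda k: sse(counts[:k]) + sse(counts[k:]))


# Hypothetical edits per day: heavy editing right after the event, then a drop.
edits_per_day = [40, 35, 50, 45, 3, 2, 4, 1, 2]
change_index = single_change_point(edits_per_day)
```

Here `change_index` is the first day of the quieter segment; mapping it back to a calendar date gives the end of the temporal interval.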

19. Datetime Extraction
• Extract datetime from pages via:
  • URI: http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags: <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU's Carbondate tool: http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
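
The first two extraction routes (URI path and meta tag) can be sketched with regular expressions. This is an illustration under assumptions: the patterns only cover the two shapes shown on the slide (a `/YYYY/MM/DD/` path segment, and a `datePublished` meta tag whose `itemprop` attribute precedes `content`), and the function names are hypothetical.

```python
import re
from datetime import date, datetime

# Matches a /YYYY/MM/DD path segment, as in the CNN example on the slide.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")
# Assumes itemprop="datePublished" appears before content="..." in the tag.
META_DATE = re.compile(
    r'<meta\b[^>]*\bitemprop="datePublished"[^>]*\bcontent="([^"]+)"',
    re.IGNORECASE,
)


def date_from_uri(uri):
    """Return the date embedded in a URI path, or None if there is none."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that do not form a valid calendar date
        return None


def date_from_meta(html):
    """Return the datePublished meta value as a datetime, or None."""
    match = META_DATE.search(html)
    if not match:
        return None
    try:
        return datetime.fromisoformat(match.group(1))
    except ValueError:  # content is not an ISO 8601 datetime
        return None
```

A real page would call for an HTML parser rather than a regular expression; the regex keeps the sketch dependency-free.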

20. Experiment Details
• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events' Wikipedia page as input for focused crawler
  • Version that was live at change point

21. Crawl Details
• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
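
Selecting the Memento "closest to and after event time" can be sketched as a filter over a list of captures, such as the entries of a Memento TimeMap. The input shape (a list of `(datetime, uri)` tuples), the function name, the choice to accept a capture made exactly at the event time, and the example URIs are all assumptions for illustration.

```python
from datetime import datetime


def first_memento_after(mementos, event_time):
    """Return the earliest (datetime, uri) capture at or after event_time,
    or None if every capture predates the event."""
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0], default=None)


# Hypothetical captures of one reference URI.
captures = [
    (datetime(2010, 12, 30, 8, 0), "https://web.archive.org/web/20101230080000/http://example.com/"),
    (datetime(2011, 1, 9, 14, 30), "https://web.archive.org/web/20110109143000/http://example.com/"),
    (datetime(2011, 3, 2, 11, 0), "https://web.archive.org/web/20110302110000/http://example.com/"),
]
chosen = first_memento_after(captures, datetime(2011, 1, 8))
```

With the Tucson event date, `chosen` is the capture from 2011-01-09; the capture from 2010 is skipped because it cannot describe the event.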

22. Crawl Details
• Crawl stop conditions:
  • No more relevant documents left
  • 5 levels deep
• Utilized crawl priority queue
[Figure: crawl tree annotated with levels: Level 0 (Seed), Level 1 (Child 1, Child 2, Child 3), Level 2 (Child 1.1, Child 1.2, Child 2.1, Child 3.1, Child 3.2).]
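
The stop conditions and the priority queue can be combined in a short best-first crawl sketch. Several details are assumptions rather than the authors' design: pages are dicts with an `outlinks` list, `fetch`, `content_score`, and `temporal_score` are caller-supplied hypothetical callables, a single threshold is applied to the equally weighted score, and a child inherits its parent's score as its queue priority.

```python
import heapq


def focused_crawl(seeds, fetch, content_score, temporal_score,
                  threshold, max_depth=5):
    """Best-first crawl that expands only relevant pages, down to max_depth."""
    # heapq is a min-heap, so priorities are negated relevance scores.
    queue = [(-1.0, 0, uri) for uri in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    relevant = []
    while queue:  # stops when no candidate documents are left
        _, depth, uri = heapq.heappop(queue)
        page = fetch(uri)
        if page is None:
            continue
        # Equal weighting of content and temporal relevance.
        score = 0.5 * content_score(page) + 0.5 * temporal_score(page)
        if score < threshold:
            continue  # crawled but not relevant: outlinks are not followed
        relevant.append((uri, score))
        if depth == max_depth:
            continue  # deepest level reached: keep the page, do not expand
        for link in page["outlinks"]:
            if link not in seen:
                seen.add(link)
                # Assumption: a child inherits its parent's score as priority.
                heapq.heappush(queue, (-score, depth + 1, link))
    return relevant
```

Because irrelevant pages are never expanded, the queue drains on its own once the relevant frontier is exhausted, which corresponds to the first stop condition on the slide.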

23. Collections Crawled (in November 2017)
• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009

24. NYC, 10/31/2017 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl; x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 2000); right y-axis: percent (0 to 100); series: All URIs, Relevant URIs.]

25. TUC, 01/08/2011 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl; x-axis: crawl depth (0 to 5); left y-axis: number of URIs (0 to 80000); right y-axis: percent (0 to 100); series: All URIs, Relevant URIs.]

26. NYC, 10/31/2017 – Relevance over…
[Figure: two panels, Crawled Documents and Crawl Time.]

27. TUC, 01/08/2011 – Relevance over…
[Figure: two panels, Crawled Documents and Crawl Time.]

28. TUC, 01/08/2011 – Comparison to Archive-It
[Figure: x-axis: Documents (0 to 15000); y-axis: Accumulated Relevance (0 to 15000); series: Web Archive Crawl, Archive-It Crawl.]

29. TUC, 01/08/2011 – Web Archive Contributions
• web.archive.org: 75%
• wayback.archive-it.org: 14%
• webarchive.loc.gov: 7%
• web.archive.bibalex.org: 2%
• archive.is: 2%

30. Takeaways
• Web archives are great resources to build event collections of web resources
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than the archived web, but
• Collections about events from the distant past benefit more from the archived web than the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building

31. https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384
