Improving Understanding of Web Archive Collections Through Storytelling - PhD Candidacy Exam

1. @shawnmjones @WebSciDL Improving Understanding of Web Archive Collections Through Storytelling PhD Candidacy Exam for: Shawn M. Jones Committee: Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna Thanks to:

2. @shawnmjones @WebSciDL Outline
   1. Motivation
   2. Research Questions
   3. Preliminary Work
   4. Proposed Research

3. @shawnmjones @WebSciDL Let’s say: you find a bag 3

4. @shawnmjones @WebSciDL Let’s say: you find a bag There are thousands of different items inside. Can you use the contents of this bag? How quickly can you make this decision? 4

5. @shawnmjones @WebSciDL Now let’s say: there are thousands of bags Which one might contain something useful for you? Do any? How do you know? How do you decrease your chances of wasting your time? 5

6. @shawnmjones @WebSciDL What does this have to do with web archives? 6

7. @shawnmjones @WebSciDL Researchers create their own web archive collections Archived web pages, or mementos, are used by journalists, sociologists, and historians. Example collections: Tucson Shootings, 2008 Olympics, University of Utah

8. @shawnmjones @WebSciDL Web archive collections have many versions of the same page 8 2013 2015 2018 University of Utah Office of Admissions from the University of Utah Web Archive Collection 4/1/2015 3/5/2015 Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection 2/12/2015

9. @shawnmjones @WebSciDL Different versions allow us to see an unfolding news story Memento from April 19, 2013 17:12: Searching for suspects, City on lockdown. Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled? Memento from April 24, 2013 2:24: Suspect found, Officer Collier lost life, Obama speaks.

10. @shawnmjones @WebSciDL Different versions allow us to see changes in an organization’s web presence 10 The White House: 2016 The White House: 2018

11. @shawnmjones @WebSciDL Archive-It allows curators to easily create collections Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 11

12. @shawnmjones @WebSciDL … and these collections are used by other researchers 12 The collection curator is not the only user of the collection! These collections live a life after their curator has stopped adding to them.

13. @shawnmjones @WebSciDL How do we tell the difference between collections? What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016? Which one should a researcher use? 13

14. @shawnmjones @WebSciDL 14 31 Archive-It collections match the search query “human rights” How are they different from each other? Which one is best for my needs?

15. @shawnmjones @WebSciDL Archive-It provides fields for metadata 15 Collection-wide metadata Metadata on individual seeds Dublin Core + Custom Fields

16. @shawnmjones @WebSciDL But, alas, the metadata does not help Because metadata is optional, it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation Example collection: 9 seeds with metadata, 132,599 seeds with no metadata.

17. @shawnmjones @WebSciDL But, alas, the metadata does not help Because metadata is optional, it is not always present. Metadata on Archive-It collections: • many different curators • different organizations • different content standards • different rules of interpretation • it is inconsistently applied This means that a user cannot reliably compare metadata fields to understand the differences between collections. Example collection: 9 seeds with metadata, 132,599 seeds with no metadata. Paradox: More seeds = more effort, yet more seeds = greater user need for metadata.

18. @shawnmjones @WebSciDL Reviewing mementos manually is costly This collection has 132,599 seeds, many with multiple mementos Some collections have 1000s of seeds Each seed can have many mementos In some cases, this can require reviewing 100,000+ documents to understand the collection 18

19. @shawnmjones @WebSciDL More Archive-It collections are added every year More than 8000 collections exist as of the end of 2016 19

20. @shawnmjones @WebSciDL The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.

21. @shawnmjones @WebSciDL The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.

22. @shawnmjones @WebSciDL Our proposal: a visualization made of representative mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our representative mementos have good information scent
    • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos. To something like this: a social media story of ~28 surrogates.
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20

24. @shawnmjones @WebSciDL Surrogates provide a visual summary of the content behind a URI… Long URI: https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582 The same URI represented by a browser thumbnail surrogate: The same URI represented by a social card surrogate:

25. @shawnmjones @WebSciDL Social media storytelling uses surrogates to provide a “summary of summaries” 2 resources are shown from this Wakelet story. 6 resources are shown from this Storify story. Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.

26. @shawnmjones @WebSciDL Traditional surrogates contain metadata generated by humans to convey aboutness 26

27. @shawnmjones @WebSciDL Web surrogates provide a visual summary of a web resource drawn from the content of the resource • Browser Thumbnail (example from UK Web Archive) • Text snippet (example from Bing) • Social Card (example from Facebook) • Text + Thumbnail (example from Internet Archive) S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
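To make "drawn from the content of the resource" concrete, here is a minimal sketch of how the fields of a social card could be pulled from a page's HTML. It is an illustration only, not how Facebook or any archive builds its cards. The names `CardFieldParser` and `build_social_card` and the dictionary keys are hypothetical, and the fallback order (Open Graph tags first, then `<title>` and the `description` meta tag) is an assumption.

```python
from html.parser import HTMLParser


class CardFieldParser(HTMLParser):
    """Collect <title> text and the content of named <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Open Graph uses property="og:..."; plain metadata uses name="..."
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def build_social_card(html, uri):
    """Return hypothetical card fields, falling back when metadata is absent."""
    parser = CardFieldParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "uri": uri,
        "title": meta.get("og:title") or parser.title.strip() or uri,
        "description": meta.get("og:description") or meta.get("description", ""),
        "image": meta.get("og:image"),
    }
```

A browser thumbnail, by contrast, needs a rendering engine, which is why it is not sketched here.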

28. @shawnmjones @WebSciDL Our research questions
• RQ1 (Collection Types): What types of web archive collections exist?
• RQ2 (Surrogate Types): What surrogates work best for understanding collections of mementos?
• RQ3 (Selecting Mementos): How do we select representative mementos for the different semantic types of collections?
• RQ4 (Evaluating Stories): How well do stories produced by different summarization algorithms work for collection understanding?

29. @shawnmjones @WebSciDL RQ2: What surrogates work best for web resources? Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Woodruff (2001): thumbnails > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Li (2008): social cards > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.

30. @shawnmjones @WebSciDL RQ3: How might we select representative mementos?
• Luhn (1958): automatic abstracts
• Silva (2014): word graphs from Luhn’s algorithm
• DUC Datasets (2001-2007)
• Napoles (2012): Gigaword
• Lin (2014): ROUGE metrics
• Grusky (2018): NEWSROOM
• Existing reference summaries were built from news articles.
• Existing reference summaries were not built from web archives.
• Mihalcea (2004): TextRank
• Dolan (2004): clustering news articles; Lede3 preferred by evaluators
• Xie (2008): MMR for meeting summaries
• Radev (1998): automatic news briefs
• Sipos (2008): scholarly corpus over time
• Zhang (2010)/Li (2011): aspects of disasters
• Hong (2014): word weighting

31. @shawnmjones @WebSciDL RQ3: How might we select representative mementos? – Related Concepts
• Scatter-Gather (Cutting 1992)
  • allows a user to explore a collection by drilling through topic clusters until they reach individual documents
  • we seek to provide a representative sample that a user can quickly glance at
• Recommender Systems
  • predict the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends
  • we want to provide a summary without any knowledge of the user
• Zero-Query Systems
  • predict the information a user will need based on time, location, environment, user interests, and other factors
  • again, we want to provide a summary with no knowledge of the user
Image reference: Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92). Copenhagen, Denmark, pp. 318-329. https://doi.org/10.1145/133160.133214

32. @shawnmjones @WebSciDL How have others explored collections? Conta Me Histórias, ArchiveSpark, Archives Unleashed Cloud. Existing solutions allow users to query and develop statistics on collections. Users must have some idea of a topic or concept a priori.

33. @shawnmjones @WebSciDL How have others visualized collections for understanding? Other attempts at visualizing Archive-It collections tried to visualize everything. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12) 15–18. DOI:10.1145/2232817.2232821

34. @shawnmjones @WebSciDL How have others told stories with web archive collections?
• AlNoamany told stories via the storytelling platform Storify
• She proved that test participants could not detect the difference between her automated stories and stories generated by human curators
Diagram inputs: characteristics of human-generated stories; characteristics of Archive-It collections.
Diagram steps: Exclude duplicates; Exclude off-topic pages; Exclude non-English language; Dynamically slice the collection; Cluster the pages in each slice; Select high-quality pages from each cluster; Order pages by time; Visualize.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508

35. @shawnmjones @WebSciDL How have others told stories with web archive collections?
• AlNoamany told stories via the storytelling platform Storify – which is no longer in service
• She proved that test participants could not detect the difference between her automated stories and stories generated by human curators
Diagram inputs: characteristics of human-generated stories; characteristics of Archive-It collections.
Diagram steps: Exclude duplicates; Exclude off-topic pages; Exclude non-English language; Dynamically slice the collection; Cluster the pages in each slice; Select high-quality pages from each cluster; Order pages by time; Visualize.
S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?” http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html 2017.

36. @shawnmjones @WebSciDL How have others told stories with web archive collections?
• AlNoamany told stories via the storytelling platform Storify – which is no longer in service
• She proved that test participants could not detect the difference between her automated stories and stories generated by human curators
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
• There are other types of Archive-It collections
Diagram inputs: characteristics of human-generated stories; characteristics of Archive-It collections.
Diagram steps: Exclude duplicates; Exclude off-topic pages; Exclude non-English language; Dynamically slice the collection; Cluster the pages in each slice; Select high-quality pages from each cluster; Order pages by time; Visualize.

38. @shawnmjones @WebSciDL Outline
   1. Motivation
   2. Research Questions
   3. Preliminary Work
      1. RQ1: What types of web archive collections exist?
      2. Partial RQ2: What surrogates work best for understanding collections of mementos?
         1. How effective are existing curation platforms at producing mementos?
         2. Preliminary user surrogate study
      3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
         1. The Off-Topic Memento Toolkit (OTMT)
   4. Proposed Research

39. @shawnmjones @WebSciDL As collection users, we view Archive-It collections from outside… 39 • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

40. @shawnmjones @WebSciDL As collection users, what structural features can we view from outside?
• Using only structural features is advantageous because it saves one from having to download a collection’s content.
• These structural features give us different insight than can be provided by text analysis or metadata.
Example collection: 81,014 seeds, 486,227 seed mementos. Structural features shown here: • number of seeds • number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

41. @shawnmjones @WebSciDL Was the collection built from web sites belonging to one domain or many? 41 Many domains One domain Structural feature discussed here: • domain diversity S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

42. @shawnmjones @WebSciDL Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? 42 Top-level pages Deeper links Structural feature discussed here: • path depth diversity • most frequent path depth S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
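The domain and path-depth features can be computed from seed URIs alone, without downloading any content. The sketch below is one way to do that. The `diversity` normalization, (unique - 1) / (total - 1), is an assumption chosen for illustration and may differ from the definition used in the cited paper. All function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def path_depth(uri):
    """Number of non-empty path segments; a top-level page has depth 0."""
    return len([seg for seg in urlparse(uri).path.split("/") if seg])


def diversity(values):
    """Assumed normalization: 0.0 when all values match, 1.0 when all differ."""
    values = list(values)
    if len(values) < 2:
        return 0.0
    return (len(set(values)) - 1) / (len(values) - 1)


def structural_features(seed_uris):
    """Compute domain and path-depth features from a non-empty list of URIs."""
    if not seed_uris:
        raise ValueError("need at least one seed URI")
    domains = [urlparse(u).netloc.lower() for u in seed_uris]
    depths = [path_depth(u) for u in seed_uris]
    return {
        "domain_diversity": diversity(domains),
        "path_depth_diversity": diversity(depths),
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
    }
```

Under this normalization, a collection built from a single domain scores 0.0 for domain diversity, and a collection whose seeds are mostly specific articles shows a most frequent path depth above 0.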

43. @shawnmjones @WebSciDL Growth curves provide some understanding of collection curation behavior
• Skew of the collection’s holdings (examples shown: positive and negative)
  • Indicates temporality of collection
• Skew of the curatorial involvement with the collection (examples shown: positive and negative)
  • When seeds were added
  • When interest was lost or regained
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

44. @shawnmjones @WebSciDL Does most of the collection exist earlier or later in its life? 44 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life. Structural feature discussed here: • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

45. @shawnmjones @WebSciDL When did the curator select and archive a collection’s contents? 45 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected all along its life. It has mementos as recent as July 2018. Structural feature discussed here: • area under the seed growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

46. @shawnmjones @WebSciDL Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 46 This collection was created in June 2014. The seeds were selected toward the beginning of its life. Mementos were captured all during its life. Structural feature discussed here: • area under the seed growth curve • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
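The "area under the growth curve" features from the last few slides can be illustrated with a small calculation. This sketch assumes the curve is a cumulative step function with both axes normalized to [0, 1] (fraction of lifespan against fraction of seeds or seed mementos). That normalization and the name `growth_curve_auc` are assumptions for illustration, not necessarily the paper's exact formulation.

```python
from datetime import datetime


def growth_curve_auc(event_times, start, end):
    """Area under the normalized cumulative step curve of events over [start, end].

    event_times holds the datetimes at which seeds were added or mementos
    were captured; start and end bound the collection's lifespan.
    """
    lifespan = (end - start).total_seconds()
    if lifespan <= 0 or not event_times:
        raise ValueError("need a positive lifespan and at least one event")
    # An event at normalized time x raises the curve by 1/n from x until the
    # end of the lifespan, so it contributes (1 - x) / n to the area.
    xs = [(t - start).total_seconds() / lifespan for t in event_times]
    return sum(1.0 - x for x in xs) / len(xs)
```

Under these assumptions, an area near 1.0 means most of the collection was gathered early in its life, an area near 0.0 means it was gathered late, and an area near 0.5 is consistent with steady growth throughout.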

47. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… 47 Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous In a study of 3,382 Archive-It collections S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

48. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… Self-Archiving: 54.1% of collections; Subject-based; Time Bounded – Expected; Time Bounded – Spontaneous. In a study of 3,382 Archive-It collections. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

49. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… Self-Archiving: 54.1% of collections; Subject-based: 27.6% of collections; Time Bounded – Expected; Time Bounded – Spontaneous. In a study of 3,382 Archive-It collections. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

50. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… Self-Archiving: 54.1% of collections; Subject-based: 27.6% of collections; Time Bounded – Expected: 14.1% of collections; Time Bounded – Spontaneous. In a study of 3,382 Archive-It collections. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

51. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… Self-Archiving: 54.1% of collections; Subject-based: 27.6% of collections; Time Bounded – Expected: 14.1% of collections; Time Bounded – Spontaneous: 4.2% of collections. In a study of 3,382 Archive-It collections. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

52. @shawnmjones @WebSciDL We discovered four semantic categories in Archive-It collections… Self-Archiving: 54.1% of collections; Subject-based: 27.6% of collections; Time Bounded – Expected: 14.1% of collections; Time Bounded – Spontaneous: 4.2% of collections. Some evaluated by AlNoamany. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

53. @shawnmjones @WebSciDL We can bridge the structural to the descriptive… 53 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720 S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

54. @shawnmjones @WebSciDL We have identified different types of Archive-It collections 54 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ We can take these features into account to address the other research questions. So, let’s tell some stories on social media! Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous

55. @shawnmjones @WebSciDL We have identified different types of Archive-It collections 55 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ We can take these features into account to address the other research questions. So, let’s tell some stories on social media! Self-Archiving Subject-based Time Bounded – Expected Time Bounded – Spontaneous Not so fast…

56. @shawnmjones @WebSciDL Outline
   1. Motivation
   2. Research Questions
   3. Preliminary Work
      1. RQ1: What types of web archive collections exist?
      2. Partial RQ2: What surrogates work best for understanding collections of mementos?
         1. How effective are existing curation platforms at producing mementos?
         2. Preliminary user surrogate study
      3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
         1. The Off-Topic Memento Toolkit (OTMT)
   4. Proposed Research

57. @shawnmjones @WebSciDL Existing platforms do not reliably produce surrogates for mementos… If we cannot rely upon the service to generate a surrogate for a memento, our system must then do the work to create our own surrogates. S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.

58. @shawnmjones @WebSciDL Some services have stories, but not long-term storytelling? Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/). Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/). Instagram stories (image ref: https://buffer.com/library/instagram-stories). These platforms delete the user’s stories 24 hours after they are posted. This form of social media storytelling is the opposite of what we are looking for. We want the stories to be artifacts themselves.

59. @shawnmjones @WebSciDL Some services’ longevity is in doubt… RIP: Storify 2018. RIP: Google+ 2019. RIP: Tumblr (soon?). S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.

60. @shawnmjones @WebSciDL Existing surrogate services create a confusing experience for mementos Examples shown: embed.rocks surrogate, embed.ly surrogate. Who published these resources? Archive-It? CNN? Is the story author sharing fake news? S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.

61. @shawnmjones @WebSciDL Neither social media services nor surrogate services were reliable for storytelling, so we created MementoEmbed… Information in the MementoEmbed social card surrogate is separated to avoid issues of confusion about attribution. MementoEmbed is archive-aware. It can locate information about the memento that is not available in other surrogates. S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.

62. @shawnmjones @WebSciDL MementoEmbed provides us with a tool for evaluating surrogates, a step on the road to answering RQ2… 62 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️

63. @shawnmjones @WebSciDL Outline
   1. Motivation
   2. Research Questions
   3. Preliminary Work
      1. RQ1: What types of web archive collections exist?
      2. Partial RQ2: What surrogates work best for understanding collections of mementos?
         1. How effective are live web curation platforms at producing mementos?
         2. Preliminary user surrogate study
      3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
         1. The Off-Topic Memento Toolkit (OTMT)
   4. Proposed Research

64. @shawnmjones @WebSciDL Using stories built from curator-selected mementos, we shared stories with Mechanical Turk (MT) participants…
• 4 stories of 15-17 mementos selected by human Archive-It curators from their collections
• 6 different surrogate types: Archive-It-like, Social Card, Browser thumbnails, Social Card With Thumbnail as Image (sc/t), Social Card With Thumbnail to Right (sc+t), Social Card with Thumbnail on Hover (sc^t)
• 24 different story-surrogate combinations (4 stories × 6 surrogate types)
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.

65. @shawnmjones @WebSciDL And then we asked them which 2 of 6 surrogates came from the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.

66. @shawnmjones @WebSciDL Response times per surrogate had interesting means, but p-values were not statistically significant at p < 0.05 66 p = 0.190 p = 0.202 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.

67. @shawnmjones @WebSciDL Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate [Chart: Correct Answers Per Surrogate, median and mean, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, sc^t; annotated with p = 0.0569 and p = 0.0770] S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.

68. @shawnmjones @WebSciDL Whenever thumbnails are present, more users interact with them 68 We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted with a thumbnail, regardless of surrogate. For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.

69. @shawnmjones @WebSciDL We have some results indicating that social cards perform better, but there is more to answering RQ2… RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️ [Chart: Correct Answers Per Surrogate, median and mean, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, sc^t]

70. @shawnmjones @WebSciDL Outline
   1. Motivation
   2. Research Questions
   3. Preliminary Work
      1. RQ1: What types of web archive collections exist?
      2. Partial RQ2: What surrogates work best for understanding collections of mementos?
      3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
         1. The Off-Topic Memento Toolkit (OTMT)
   4. Proposed Research

71. @shawnmjones @WebSciDL Identifying off-topic mementos is key to choosing representative mementos Collections have a theme. Seeds are selected to support that theme. Mementos are versions of seeds. Some of these versions are off-topic. Identifying these off-topic mementos is key to summarization. Examples of off-topic mementos shown: Hacked; Moved on from topic; Web Page Gone; Account Suspension. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87

72. @shawnmjones @WebSciDL The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
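Three of these measures are simple enough to sketch directly. The code below is an illustration of the general idea, not the OTMT's implementation. The whitespace `tokenize` function stands in for the toolkit's real preprocessing, and the word-count score is assumed to be the relative change from the first memento, which is consistent with the 0.0 and -1.0 endpoints listed above. The Jaccard and Sørensen-Dice formulas are the standard set-based definitions, expressed as distances.

```python
def tokenize(text):
    """Very rough stand-in for preprocessing: lowercase, split on whitespace."""
    return text.lower().split()


def word_count_change(first, later):
    """Assumed form: relative change in word count versus the first memento."""
    n_first = len(tokenize(first))
    if n_first == 0:
        return 0.0
    return (len(tokenize(later)) - n_first) / n_first


def jaccard_distance(first, later):
    """1 - |A ∩ B| / |A ∪ B| over the sets of terms."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, later):
    """1 - 2|A ∩ B| / (|A| + |B|) over the sets of terms."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

A later memento would then be flagged as off-topic when its score crosses a threshold chosen for that measure. No threshold values are given on this slide, so none are assumed here.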

73. @shawnmjones @WebSciDL After repeating AlNoamany’s experiment, Word Count had the best F1 score for identifying off-topic mementos… We reused AlNoamany’s labeled dataset. She did not try: • Sørensen-Dice • Simhash of raw content • Simhash of TF • Gensim LSI Our word count accuracy came out ahead of AlNoamany’s. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5

74. @shawnmjones @WebSciDL Finding off-topic mementos is one of the first steps to addressing RQ3… 74 RQ2: Surrogate Types RQ3: Selecting Mementos RQ4: Evaluating Stories RQ1: Collection Types ✅ ☑️ ☑️

76. @shawnmjones @WebSciDL This work requires a flexible framework – Dark and Stormy Archives (DSA) 2.0 [Diagram: components OTMT, Hypercane, Raintale, MementoEmbed, and Archive-It Utilities, connected by "calls" and "provides input to" arrows and grouped into tools for selecting representative mementos and tools for visualizing mementos as a story. Input: a web archive collection (thousands of HTML documents). Output: a story (< 30 representative mementos visualized as surrogates).] S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.

77. @shawnmjones @WebSciDL Evaluation of RQ2: What surrogates work best for understanding collections of mementos?
How well do users perform with different types of surrogates?
1. Select 5 collections from each semantic category
2. Select the earliest memento of each of the first 20 seeds from each collection – this is the number of surrogates a user views if they open an Archive-It story and page down once
3. Present the participant with a story of 20 surrogates, varying the surrogate between participants
4. Ask them to address a user task
Variations:
• For step #3, vary the time for participants to view the story
  • participants view for 5, 10, 20, 30 seconds
    • may surface the ability to “glance” and understand
  • some surrogates consist only of title, URI, etc.
    • may determine which surrogate elements perform best
• For step #4, ask the participant to:
  • determine if the collection behind the story is suited for a task – similar to traditional IR research
  • identify which items likely belong to the same collection
• Instead of steps 3 and 4 – ask former participants which surrogate they prefer for a given task
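The between-participants design in step 3 and its viewing-time variation can be sketched as a grid of conditions. This is only an illustration of the bookkeeping, not the study's actual protocol: the viewing times come from the slide, while the surrogate type labels, the participant identifiers, and the round-robin assignment are assumptions.

```python
from collections import Counter
from itertools import product

VIEWING_SECONDS = [5, 10, 20, 30]  # from the slide
# Hypothetical labels; the slide does not enumerate the surrogate types here.
SURROGATE_TYPES = ["social_card", "thumbnail", "title_and_uri"]


def build_conditions(surrogate_types, viewing_seconds):
    """Every (surrogate type, viewing time) pair is one experimental condition."""
    return list(product(surrogate_types, viewing_seconds))


def assign_round_robin(participants, conditions):
    """Give each participant exactly one condition, cycling through the grid."""
    return {
        participant: conditions[index % len(conditions)]
        for index, participant in enumerate(participants)
    }


conditions = build_conditions(SURROGATE_TYPES, VIEWING_SECONDS)
participants = [f"participant-{n:02d}" for n in range(1, 25)]  # invented IDs
assignment = assign_round_robin(participants, conditions)

# Check that the conditions are balanced across participants.
per_condition = Counter(assignment.values())
print(len(conditions), "conditions;", set(per_condition.values()), "participants each")
```

A real study would likely randomize the assignment order instead of cycling deterministically; round-robin is used here only because it makes the balance easy to see.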

78. @shawnmjones @WebSciDL Evaluation of RQ2: What surrogates work best for understanding collections of mementos? What information is available to users of the existing Archive-It story? Discover patterns in metadata usage that may indicate the semantic type of collection. How well do our stories compare to the existing metadata? How well do our stories cover the content of the underlying collection? How well does the Archive-It story cover the underlying collection? How well do surrogates cover the content of their mementos? [Diagrams pair: Collection Content with Our Story Content; Collection Content with Archive-It Story Content; Memento Content with Surrogate Content; Our Story Content with Existing Metadata for Seeds.] Similarity metrics will be used for evaluating coverage.

79. @shawnmjones @WebSciDL Evaluation of RQ3: How do we select representative mementos for different semantic types of collections? We will develop different algorithms and compare their output with several metrics to determine which algorithms provide the best “aboutness” for the collection. [Chart: DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 scored on a 0 to 10 scale; axis labels: Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, Performance.]
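To make the comparison concrete, here is a sketch of how three of the named metrics might be computed for one algorithm's output. The slide names the metrics but does not define them, so every definition below is an assumption: source diversity as the share of distinct hostnames, temporal spread as the number of days between the earliest and latest capture, and compression as story size divided by collection size. The URIs and dates are invented.

```python
from datetime import datetime
from urllib.parse import urlparse


def source_diversity(uris):
    """Assumed definition: distinct hostnames divided by number of mementos."""
    if not uris:
        return 0.0
    hosts = {urlparse(uri).hostname for uri in uris}
    return len(hosts) / len(uris)


def temporal_spread_days(capture_times):
    """Assumed definition: whole days between earliest and latest capture."""
    if not capture_times:
        return 0
    return (max(capture_times) - min(capture_times)).days


def compression(story_size, collection_size):
    """Assumed definition: fraction of the collection that the story retains."""
    return story_size / collection_size


def score_story(story, collection_size):
    """Score one algorithm's story, given (URI, capture datetime) pairs."""
    uris = [uri for uri, _ in story]
    times = [captured for _, captured in story]
    return {
        "source_diversity": source_diversity(uris),
        "temporal_spread_days": temporal_spread_days(times),
        "compression": compression(len(story), collection_size),
    }


# Invented output of one hypothetical selection algorithm.
toy_story = [
    ("https://example.org/a", datetime(2013, 4, 19)),
    ("https://example.org/b", datetime(2013, 4, 24)),
    ("https://example.com/c", datetime(2013, 4, 21)),
]
toy_scores = score_story(toy_story, collection_size=300)
print(toy_scores)
```

Running `score_story` on the output of each candidate algorithm for the same collection would give one row per algorithm, which is the shape of comparison the chart describes.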

80. @shawnmjones @WebSciDL RQ4: How well do stories produced by different summarization algorithms work for collection understanding? How well do our generated stories compare to the existing Archive-It interface? Do study participants understand key concepts of the collection represented by the story? Using the stories, can participants tell the difference between similar collections? Can participants compare stories and tell which are similar? Does the addition of existing metadata improve the participant’s performance? Does the layout of the surrogates improve the participant’s performance? [Progress diagram: RQ1: Collection Types; RQ2: Surrogate Types; RQ3: Selecting Mementos; RQ4: Evaluating Stories; status marks shown: ✅ ☑️ ☑️]

81. @shawnmjones @WebSciDL We plan to complete this research in 2021… [Timeline of venues shown: DTMH 2017, JCDL 2018, iPres 2018 (listed twice), CIKM 2019, ECIR 2020, WWW 2020, JCDL 2020, CIKM 2020, WebSci 2021]

82. @shawnmjones @WebSciDL Our methods are not just for Archive-It. Our methods will be applicable to web archive collections created on other platforms, like Rhizome’s Webrecorder.

83. @shawnmjones @WebSciDL Motivation Summary
• Collection understanding is a problem with web archive collections
  • inconsistent metadata
  • 1000s of mementos
  • 1000s of collections
  • costly for human review
• We intend to produce a visualization that serves as an abstract to assist in collection understanding
• Prior work in this area:
  • did not evaluate how well this method works for collection understanding
  • only focused on collections about events
  • relied upon Storify as a visualization medium

84. @shawnmjones @WebSciDL Contributions
• Existing work:
  • Derived semantic categories of web archive collections in Archive-It
    • Categories can be predicted by using structural features
    • Most collections are not about events
  • MementoEmbed – surrogates for the past web
    • Social cards probably provide better understanding of collections
  • Off-Topic Memento Toolkit – Identifying off-topic mementos
• Future work:
  • Evaluate algorithms for surfacing a representative sample from a document collection
  • Evaluate different surrogate types via user evaluation
  • Show which surrogate-sample combinations work best for collection understanding via user evaluation

85. @shawnmjones @WebSciDL Improving Understanding of Web Archive Collections Through Storytelling PhD Candidacy Exam for: Shawn M. Jones Committee: Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna Thanks to:

86. @shawnmjones @WebSciDL Discussion
