The Many Shapes of Archive-It

1. The Many Shapes of Archive-It. Shawn M. Jones (sjone@cs.odu.edu, @shawnmjones), Alexander Nwala (anwala@cs.odu.edu, @acnwala), Michele C. Weigle (mweigle@cs.odu.edu, @weiglemc), Michael L. Nelson (mln@cs.odu.edu, @phonedude_mln). Old Dominion University, Web Science and Digital Libraries Research Group (@WebSciDL).

2. @shawnmjones @WebSciDL Researchers Create Their Own Web Archive Collections. Archived web pages, or mementos, are used by journalists, sociologists, and historians. Example collections: Tucson Shootings, 2008 Olympics, University of Utah.

3. @shawnmjones @WebSciDL Web Archive Collections Have Many Versions of the Same Page. University of Utah Office of Admissions from the University of Utah Web Archive Collection: 2013, 2015, 2018. Tumblr Black Lives Matter Blog from the #blacklivesmatter Collection: 2/12/2015, 3/5/2015, 4/1/2015.

4. @shawnmjones @WebSciDL Different Versions Allow Us to See an Unfolding News Story
• Memento from April 19, 2013 17:12: Searching for Suspects, City on Lockdown
• Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
• Memento from April 21, 2013 2:24: Suspect Found, Officer Collier Lost Life, Obama speaks

5. @shawnmjones @WebSciDL Different Versions Allow Us To See Changes In An Organization’s Web Presence 5 The White House: 2016 The White House: 2018

6. @shawnmjones @WebSciDL The Internet Archive created Archive-It so organizations could create their own web archive collections. Curators supply live web resources as seeds and establish crawling schedules for those seeds, creating mementos of the seeds at different points in time.

7. @shawnmjones @WebSciDL But this is the interface available for browsing those collections… 7 How do we tell the difference without going through them all? What types of collections exist?

8. @shawnmjones @WebSciDL How can we understand an Archive-It collection?

9. @shawnmjones @WebSciDL We Can Understand It Based On Metadata: collection-wide metadata and metadata on individual seeds (Dublin Core + custom fields).

10. @shawnmjones @WebSciDL We Can Understand It Based On Metadata, but the Metadata Does Not Always Help… Because metadata is optional, it is not always present. Examples shown: 132,599 seeds, no metadata; 9 seeds, with metadata.

12. @shawnmjones @WebSciDL We Can Understand It Based On Metadata, but the Metadata Does Not Always Help… Because metadata is optional, it is not always present. When it is present, metadata on Archive-It collections is:
• generated by many different curators
• from different organizations
• with different content standards
• and different rules of interpretation
It is inconsistently applied! This means that a user cannot reliably compare metadata fields to understand the differences between collections.

13. @shawnmjones @WebSciDL We Can Understand It Based on Content. We can use techniques such as text mining and network analysis. Shown: the same collection in the Archives Unleashed Cloud, https://archivesunleashed.org

17. @shawnmjones @WebSciDL We Can Understand It Based on Content, but all of that Content Must Be Dereferenced… Remember: each result is a seed, and each seed has multiple mementos. There are 486,227 seed mementos that must be downloaded and processed to understand this collection. These 333 seeds correspond to 278,306 seed mementos. They must be downloaded and processed.

18. @shawnmjones @WebSciDL And what if we do not know the language? English-speaking, non-German speakers can discern: About University of Utah; About shootings in Tucson; ???

19. @shawnmjones @WebSciDL How else can we understand an Archive-It collection?

20. @shawnmjones @WebSciDL What kinds of questions can be answered with Structural Features?
• Using only structural features is advantageous because it avoids dereferencing all of the URIs in a collection.
• These structural features also give us different insights than text analysis or metadata can provide.
Example collection: 81,014 seeds, 486,227 seed mementos

21. @shawnmjones @WebSciDL Does most of the collection exist earlier or later in its life? 21 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life.

22. @shawnmjones @WebSciDL When did the curator select and archive a collection’s contents? 22 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected all along its life. It has mementos as recent as July 2018.

23. @shawnmjones @WebSciDL Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 23 This collection was created in June 2014. The seeds were selected at the beginning of its life. Mementos were captured all during its life.

24. @shawnmjones @WebSciDL Was the collection built from web sites belonging to one domain or many? 24 Many domains One domain

25. @shawnmjones @WebSciDL Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? 25 Top-level pages Deeper Links

26. @shawnmjones @WebSciDL Other questions answered by structural features:
• Was there renewed interest at some point later in the collection’s life?
• Did the curator nurture the selected web pages throughout the collection’s life and add content continuously?
• What time period does the collection span?
• What is the temporal skew of the collection?
• What is the lifetime of the collection?

27. @shawnmjones @WebSciDL Can we bridge the structural to the descriptive?
• We can categorize Archive-It’s collections into four main semantic categories.
• We can predict these categories with a Random Forest classifier that uses structural features.

28. @shawnmjones @WebSciDL Let’s go over a few things…

29. @shawnmjones @WebSciDL Looking at Archive-It collections from the outside • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds • In this work, we focus on seeds and seed mementos 29

30. @shawnmjones @WebSciDL TimeMaps from the Memento Protocol
Each seed has a corresponding TimeMap listing all of that seed’s mementos and their capture times, called memento-datetimes.
    <http://a.example.org>;rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org>
      ; rel="self"; type="application/link-format"
      ; from="Tue, 20 Jun 2000 18:02:59 GMT"
      ; until="Wed, 21 Jun 2000 04:41:56 GMT",
    <http://arxiv.example.net/timegate/http://a.example.org>
      ; rel="timegate",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>
      ; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
    <http://arxiv.example.net/web/20091027204954/http://a.example.org>
      ; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT",
    <http://arxiv.example.net/web/20000621011731/http://a.example.org>
      ; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
    <http://arxiv.example.net/web/20000621044156/http://a.example.org>
      ; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT"
    …
Callouts on the slide: original resource URI, TimeMap URI (URI-T), entries for mementos, Memento URI (URI-M), memento-datetime.
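
A minimal sketch of how the memento entries in a link-format TimeMap like the one above could be read with only the Python standard library. The function name `parse_timemap` and the regular expressions are illustrative assumptions, not the authors' code, and the parsing is deliberately simplified (it assumes attribute values are double-quoted and that URIs contain no `>`).

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by its attributes, up to the next "<".
ENTRY = re.compile(r'<([^>]+)>\s*;\s*([^<]*)')
# One attribute of the form name="value".
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def parse_timemap(text):
    """Return (URI-M, memento-datetime) pairs sorted by memento-datetime."""
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        # rel can hold several space-separated values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return sorted(mementos, key=lambda pair: pair[1])

# Abbreviated copy of the TimeMap shown on the slide.
SAMPLE = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT"'''

for uri_m, memento_datetime in parse_timemap(SAMPLE):
    print(memento_datetime.isoformat(), uri_m)
```

Only entries whose `rel` includes `memento` are kept, so the original resource, TimeGate, and TimeMap self-links are skipped.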

31. @shawnmjones @WebSciDL What other work is related to web collections?

38. @shawnmjones @WebSciDL Related Work: Nwala (2018) Mull (2014) Wang (2016) Ogden (2017) features of digital collections Fenlon (2017) selecting seeds for web archive collections Milligan (2016) motivations for creating collections behavior of web archivists Crook (2009) Slania (2013) Deutch (2016) studies of using Archive-It capabilities of web archive user interfaces Niu (2012)
How our work differs:
• We focus on web archive collections
• We examine web archive collections after they have been created
• We look to structural features of web archives rather than user studies of live web curation platforms
• We focus on the output of web archivists rather than studying their behavior in real time
• We focus on structural features rather than challenges with using Archive-It as a tool
• We focus on structural features of the archives rather than their user interfaces

44. @shawnmjones @WebSciDL Related Work: Sağlam (2014) Abramson (2012) AlSum (2014) metadata standards EAD topics in web archive collections AlNoamany (2016) classification of URIs web archive growth analysis studies of using Archive-It Dublin Core AlNoamany (2016)
How our work differs:
• We look at the structural features rather than metadata
• We are not looking at the content, but the structural features of collections
• We examine different features of URIs like domain and path depth
• We apply AlSum’s methods to specific collections rather than entire archives
• We look at collections as units rather than analyzing Archive-It as a whole

45. @shawnmjones @WebSciDL How did we acquire the data for this study?

46. @shawnmjones @WebSciDL Acquiring 9,351 Archive-It collections. We used BeautifulSoup to scrape the web pages of 9,351 Archive-It collections. From this scraping we discovered: • whether the collection was public or private • its seed URIs. Using the seed URIs, we discovered TimeMaps listing all seed mementos and their memento-datetimes.

47. @shawnmjones @WebSciDL Remove 4,823 private collections. Private collections do not allow access to seeds, seed mementos, or TimeMaps.

48. @shawnmjones @WebSciDL Remove 440 young collections. Collections younger than a year may still be growing, possibly skewing results.

49. @shawnmjones @WebSciDL Remove empty collections. Empty collections have no data to analyze.

50. @shawnmjones @WebSciDL Remove 48 collections with errors. Collections with download or processing errors may skew the results.

51. @shawnmjones @WebSciDL Remove 357 collections with a single memento. Singletons consist of a single seed with a single memento, offering no behavior to study.

52. @shawnmjones @WebSciDL Remove 21 instantaneous collections. Instantaneous collections were captured within a single second, offering no behavior over time to study.

53. @shawnmjones @WebSciDL Remove 32 test collections. Collections clearly marked as test or trial do not represent regular collection behavior.

54. @shawnmjones @WebSciDL We study the remaining 3,382 collections. This leaves us with 3,382 collections for study, with a total of: • 700,835 seeds • 6,943,677 seed mementos
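
The filtering steps on slides 46 to 54 can be checked with simple arithmetic. The sketch below assumes the filters were applied one after another to disjoint sets of collections, which the slides do not state; under that assumption, the count of empty collections (not given on slide 49) is whatever remains after subtracting the stated removals and the final 3,382.

```python
# Counts taken from the slides; the dictionary keys are informal labels.
scraped = 9351
removed = {
    "private": 4823,
    "young": 440,
    "errors": 48,
    "single memento": 357,
    "instantaneous": 21,
    "test": 32,
}
studied = 3382

# Assumption: filters are sequential and disjoint, so the remainder is the
# number of empty collections removed on slide 49.
implied_empty = scraped - sum(removed.values()) - studied
print(f"Implied number of empty collections: {implied_empty}")
```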

55. @shawnmjones @WebSciDL Understanding Collection Growth Through Time 55 collections that do not grow are not interesting for us

56. @shawnmjones @WebSciDL Growth curves help us understand collection growth, but require normalization for comparison
We want to compare memento count:
• “2014 Primaries” has 219,084 mementos
• “The Obama White House” has 140
• We normalize the number as a percentage
We want to compare seed count:
• “Hurricane Sandy” has 174,884 seeds
• “Scottish Politics” has 58 seeds
• We normalize the number as a percentage
We want to compare time:
• “Indiana: State and Local Documents” spans 2005 – 2018
• “Japan: Election 2016 House of Councilors” spans less than 2 days in July 2016
• We normalize time as a percentage of the lifespan of the collection, from the first memento-datetime to the last
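
A minimal sketch of the normalization described on this slide, assuming capture times are available as Python `datetime` objects. The function name `growth_curve` and the sample dates are made up for illustration. Values are returned as fractions between 0 and 1 rather than percentages.

```python
from datetime import datetime

def growth_curve(datetimes, start, end):
    """Return (fraction of lifespan elapsed, fraction of items captured) points.

    start and end are the collection's first and last memento-datetimes.
    Assumes end > start; instantaneous collections were excluded from the study.
    """
    lifespan = (end - start).total_seconds()
    ordered = sorted(datetimes)
    return [
        ((dt - start).total_seconds() / lifespan, rank / len(ordered))
        for rank, dt in enumerate(ordered, start=1)
    ]

# Hypothetical capture times, invented for illustration only.
captures = [datetime(2010, 1, 1), datetime(2010, 1, 6), datetime(2010, 1, 11)]
curve = growth_curve(captures, start=captures[0], end=captures[-1])
for x, y in curve:
    print(f"{x:.2f} of lifespan -> {y:.2f} of mementos")
```

Passing the first memento-datetime of each seed instead of every memento-datetime would give the seed curve on the same normalized time axis.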

57. @shawnmjones @WebSciDL Once normalized, we can compare behavior in the seed growth… • Skew of the curator’s involvement with the collection • When seeds were added • When interest was lost or regained. Example curves: “Seeds added all up front” and “Seeds added early, but not all up front”.

58. @shawnmjones @WebSciDL And, we can compare behavior in the memento growth… • Built from all mementos in the collection’s TimeMaps • Skew of the collection’s holdings • Indicates temporality of collection. Example curves: “Mementos crawled all along” and “Mementos crawled later”.

59. @shawnmjones @WebSciDL We can classify different behaviors of Growth Curves
• Using two features: area under the seed curve (AUCseed) and area under the seed memento curve (AUCsmem)
• We can classify a collection’s growth curve into 9 categories
• If AUC > 0.55, then those points occur early
• If AUC < 0.45, then those points occur late
• If 0.55 > AUC > 0.45, then those points occur continuously
The 9 categories form a grid: Seeds Early (AUCseed > 0.55), Seeds Continuously (0.55 > AUCseed > 0.45), or Seeds Late (AUCseed < 0.45), crossed with Seed Mementos Early (AUCsmem > 0.55), Seed Mementos Continuously (0.55 > AUCsmem > 0.45), or Seed Mementos Late (AUCsmem < 0.45).
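
The thresholds above can be sketched as follows. This is an illustration under stated assumptions: the area is computed with the trapezoidal rule over normalized (time, count) points, which the slides do not specify, and values exactly equal to 0.45 or 0.55 are treated as "Continuously" because the slide leaves those boundaries undefined. The function names are hypothetical.

```python
def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def timing(value, early=0.55, late=0.45):
    """Map one AUC value to Early, Late, or Continuously."""
    if value > early:
        return "Early"
    if value < late:
        return "Late"
    return "Continuously"  # includes the boundary values (an assumption)

def classify(auc_seed, auc_smem):
    """Name one of the 9 growth-curve categories."""
    return f"Seeds {timing(auc_seed)}, Seed Mementos {timing(auc_smem)}"

diagonal = [(0.0, 0.0), (1.0, 1.0)]              # steady growth
all_up_front = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # everything at the start
print(classify(auc(all_up_front), auc(diagonal)))
```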

60. @shawnmjones @WebSciDL Seeds Early: The curators added most of the seeds at the beginning of the collection’s life and then crawled them on different schedules.

61. @shawnmjones @WebSciDL Seeds Continuously 61 The curators keep adding new things to these collections throughout each collection’s life.

62. @shawnmjones @WebSciDL Seeds Late 62 There was renewed interest in adding seeds at some point in these collections’ lives.

63. @shawnmjones @WebSciDL From These Growth Curves we have some simple Structural Features
• Number of Seeds
• Number of Seed Mementos
• Collection Lifespan: time between first and last memento

64. @shawnmjones @WebSciDL We also have complex Growth Curve Features: Difference of Seed Curve AUC and Diagonal. Subtracting the AUC of the diagonal from the AUC of the seed curve: • We can more easily see if the seed curve is early or late • Early is positive • Late is negative • “Close” to 0 means continuous

65. @shawnmjones @WebSciDL More complex Growth Curve Features: Difference of Seed Memento Curve AUC and Diagonal. Subtracting the AUC of the diagonal from the AUC of the seed memento curve: • We can more easily see if the seed memento curve is early or late • Early is positive • Late is negative • “Close” to 0 means continuous

66. @shawnmjones @WebSciDL More complex Growth Curve Features: Diff. of Seed Curve AUC and Seed Memento Curve AUC 66 Difference between the seed curve AUC and the seed memento curve AUC indicates how close the two are. A value of 0 means that the two overlap, likely meaning that there is one memento per seed. A positive value means that the seeds are added earlier than the seed mementos. A negative value means that the seed memento growth has overtaken the seed growth.
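
The three difference features from slides 64 to 66 reduce to subtractions once the two AUC values are known. The sketch below assumes both curves are normalized to the unit square, so the diagonal has an area of 0.5; the function and key names are hypothetical.

```python
DIAGONAL_AUC = 0.5  # area under y = x on the unit square

def curve_difference_features(auc_seed, auc_smem):
    """Return the three AUC difference features for one collection."""
    return {
        # Positive: seeds added early. Negative: seeds added late.
        "seed_minus_diagonal": auc_seed - DIAGONAL_AUC,
        # Positive: seed mementos captured early. Negative: captured late.
        "smem_minus_diagonal": auc_smem - DIAGONAL_AUC,
        # Positive: seeds were added earlier than the seed mementos.
        "seed_minus_smem": auc_seed - auc_smem,
    }

print(curve_difference_features(auc_seed=0.75, auc_smem=0.25))
```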

67. @shawnmjones @WebSciDL What About Structural Features of Seeds? 67

68. @shawnmjones @WebSciDL Seed URI domain diversity
U = number of unique domains, C = number of seeds, D = diversity, D’ = normalized diversity
* Now known as the WSDL Diversity Index
Domain diversity: 0 (duplicate cnn.com hosts)
• http://www.cnn.com/path/to/story0
• http://news.cnn.com/path/to/story1
• http://top.cnn.com/path/to/story2
Domain diversity: 0.5 (1 duplicate cnn.com host)
• http://www.cnn.com/path/to/story0
• http://www.cnn.com/path/to/story1
• http://www.vox.com/path/to/story
Domain diversity: 1 (no duplicate domains)
• http://www.cnn.com/path/to/story0
• http://www.vox.com/path/to/story
• http://www.foxnews.com/path/to/story
Observation: Some collections only archive a single domain while others have more variety.
Source: Alexander Nwala. (2018 May) An Exploration of URL Diversity Measures. Web Science and Digital Libraries Research Group Blog. http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
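
The formula on this slide did not survive as text. The sketch below uses the normalization (U − 1) / (C − 1), chosen because it reproduces all three example values on the slide; treat it as an assumption and consult the cited blog post for the authoritative definition. The domain extraction is a naive "last two host labels" heuristic that would mishandle suffixes such as co.uk, and a single-seed collection is assigned 0 by convention here.

```python
from urllib.parse import urlsplit

def registered_domain(uri):
    """Naive domain: the last two labels of the host (an assumption)."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])

def diversity(values):
    """(U - 1) / (C - 1): 0 when all values match, 1 when all differ."""
    values = list(values)
    if len(values) < 2:
        return 0.0  # convention chosen for this sketch
    return (len(set(values)) - 1) / (len(values) - 1)

def domain_diversity(uris):
    return diversity(registered_domain(uri) for uri in uris)

mixed = [
    "http://www.cnn.com/path/to/story0",
    "http://www.cnn.com/path/to/story1",
    "http://www.vox.com/path/to/story",
]
print(domain_diversity(mixed))
```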

69. @shawnmjones @WebSciDL Path Depth
Path Depth is a concept measuring how many items exist in a URI’s path. Based on McCown’s work, we also add 1 for any path containing a query string:
    Example URI                                                    Path Depth
    http://example.com/                                            0
    http://example.com/directory                                   1
    http://example.com/dir1/dir2/dir3/dir4                         4
    http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3    3
Observation: Top-level pages tend to have more general information whereas deeper pages tend to have a more specific focus.
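
A short sketch of the path depth rule, written to reproduce the four examples in the table above. The function name `path_depth` is an assumption; the authors' own implementation may treat edge cases such as trailing slashes differently.

```python
from urllib.parse import urlsplit

def path_depth(uri):
    """Count non-empty path segments, plus 1 if the URI has a query string."""
    parts = urlsplit(uri)
    segments = [segment for segment in parts.path.split("/") if segment]
    return len(segments) + (1 if parts.query else 0)

for uri in [
    "http://example.com/",
    "http://example.com/directory",
    "http://example.com/dir1/dir2/dir3/dir4",
    "http://example.com/dir1/file2?key1=val1&key2=val2&key3=val3",
]:
    print(path_depth(uri), uri)
```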

70. @shawnmjones @WebSciDL Seed URI Path Depth Diversity
We reuse the WSDL Diversity Index, but this time apply it to path depth.
Path depth diversity: 0 (all path depths are 3)
• http://www.cnn.com/path/to/story0
• http://news.vox.com/path/to/story1
• http://top.cnn.com/path/to/story2
Path depth diversity: 0.5 (1 with path depth of 0, 2 with depth of 3)
• http://www.cnn.com/
• http://news.vox.com/path/to/story1
• http://top.cnn.com/path/to/story2
Path depth diversity: 1 (all completely different path depths)
• http://www.cnn.com/
• http://news.vox.com/path/
• http://top.cnn.com/path/to/story
Observation: Some collections only have seeds at the top level where others only link to deeper articles.

71. @shawnmjones @WebSciDL Other Seed Features
• Most Frequent Path Depth: the path depth that appears most often in the seed URIs. Observation: For some collections, most seeds exist at the top level while others link to deeper articles.
• % Query String Usage: the percentage of seed URIs that contain query strings. Observation: Some collections have many URIs with query strings, while others have none.
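
A sketch of the remaining seed features, assuming the path depth of each seed has already been computed with the rule from slide 69. Path depth diversity uses the same (U − 1) / (C − 1) normalization that matches the slide examples; that formula, the function names, and the tie-breaking for the most frequent depth (first value encountered) are assumptions. Both functions expect a non-empty input.

```python
from collections import Counter
from urllib.parse import urlsplit

def depth_features(depths):
    """Path depth diversity and most frequent path depth for a seed list."""
    count, unique = len(depths), len(set(depths))
    return {
        "path_depth_diversity": (unique - 1) / (count - 1) if count > 1 else 0.0,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
    }

def query_string_usage(uris):
    """Fraction of URIs that carry a query string."""
    return sum(1 for uri in uris if urlsplit(uri).query) / len(uris)

# Depths of http://www.cnn.com/, .../path/to/story1, .../path/to/story2
print(depth_features([0, 3, 3]))
print(query_string_usage([
    "http://example.com/dir1/file2?key1=val1",
    "http://example.com/directory",
]))
```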

72. @shawnmjones @WebSciDL Mapping the structural to the descriptive is hard… 72

73. @shawnmjones @WebSciDL At first, we tried to map the structural features to metadata directly…
• We tried using machine learning to predict the topics found in the metadata of a collection
• There are problems with this approach: not all collections have topics; many collections have multiple topics; many collections have user-supplied topics.

74. @shawnmjones @WebSciDL Instead, we established semantic categories of Archive-It collections
• We reviewed the descriptions of 3,382 Archive-It collections
• Based on their metadata and seeds, we placed them into 4 semantic categories

75. @shawnmjones @WebSciDL Self-Archiving dominates Archive-It
• Self-Archiving: 54.1%
• Subject-based: 27.6%
• Time Bounded – Expected: 14.1%
• Time Bounded – Spontaneous: 4.2%

76. @shawnmjones @WebSciDL We can predict the semantic category with structural features, without processing the page content. We found that a Random Forest classifier was best able to predict the semantic category using a collection’s structural features. The Random Forest classifier works best with collections in the Self-Archiving category. Charts shown: results for different machine learning algorithms; Random Forest results by semantic category.

77. @shawnmjones @WebSciDL We optimized our prediction. Using Kendall Tau, we determined which features had a strong correlation with the semantic category. Removing the “number of mementos” feature improved F1 scores for all categories except Self-Archiving. Charts shown: original; with feature removed.

78. @shawnmjones @WebSciDL Where do we go from here? 78

79. @shawnmjones @WebSciDL Future Work
• We will adapt these structural features for our collection summarization work
• The skew of growth curves may affect which mementos are chosen for review
• The seed analysis features will help us better choose seeds to be included
• We can incorporate this classifier to tailor summarization algorithms to specific semantic categories
• We intend to work further with Archive-It to make metadata and other data more accessible so that screen-scraping is not necessary

80. @shawnmjones @WebSciDL Conclusion 80

81. @shawnmjones @WebSciDL We adapted Growth Curves for collections We can normalize & visualize curator engagement with the collection 81

82. @shawnmjones @WebSciDL We introduced Seed Features
• Seed features also help us understand the curation strategy of a collection
• Are most of the seeds from the same domain?
• Are most of the seeds top-level pages or deeper pages?

83. @shawnmjones @WebSciDL We bridged the structural to the descriptive 83 Results of Random Forest Classifier

84. @shawnmjones @WebSciDL We can understand web archive collections using only structural features. Metadata scraping code available: https://github.com/oduwsdl/archiveit_utilities

85. @shawnmjones @WebSciDL Backup Slides 85

86. @shawnmjones @WebSciDL Growth curves allow us to understand collection curation behavior
Seed memento growth curve:
• Built from all mementos in the collection’s TimeMaps
• Skew of the collection’s holdings
• Indicates temporality of collection
Seed growth curve:
• Built from the first memento for each seed in the collection’s TimeMaps
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained

87. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Early (AUCseed > 0.55, AUCsmem > 0.55): Most curatorial decisions were made early in this collection’s life. Most crawling was done early in its life. The temporality of these collections skews early.

88. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Continuously (AUCseed > 0.55, 0.55 > AUCsmem > 0.45): Most curatorial decisions were made early in this collection’s life. Seed mementos were added continuously. The temporality of these collections spreads throughout their lives.

89. @shawnmjones @WebSciDL Seeds Early, Seed Mementos Late (AUCseed > 0.55, AUCsmem < 0.45): Most curatorial decisions were made early in this collection’s life. Seed mementos were added later. The temporality of these collections skews more recent.

90. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Early (0.55 > AUCseed > 0.45, AUCsmem > 0.55): Seeds are added throughout a collection’s life. Seed mementos were added earlier. This means that most of the content of the collection comes from earlier in its life.

91. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Continuously (0.55 > AUCseed > 0.45, 0.55 > AUCsmem > 0.45): Seeds are added throughout and their seed mementos are collected continuously. These collections have a lot of curatorial involvement throughout their lives. Their contents are spread throughout their lives.

92. @shawnmjones @WebSciDL Seeds Continuously, Seed Mementos Late (0.55 > AUCseed > 0.45, AUCsmem < 0.45): Seeds are added throughout, but the collection is built from mementos that were collected later.

93. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Early (AUCseed < 0.45, AUCsmem > 0.55): Most curatorial decisions were made later in this collection’s life, but most of the mementos were added earlier. The temporality of the collection skews earlier. Most of the mementos belong to the early seeds.

94. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Continuously (AUCseed < 0.45, 0.55 > AUCsmem > 0.45): The collection’s contents are spread throughout its life, but many seeds were added later. This means that some of the early seeds have more mementos.

95. @shawnmjones @WebSciDL Seeds Late, Seed Mementos Late (AUCseed < 0.45, AUCsmem < 0.45): In these cases, the collection appears to have experienced a “resurgence in interest” later in its life.
