The Off-Topic Memento Toolkit

I presented this paper at iPres 2018. Here, we introduce the Off-Topic Memento Toolkit, used to detect versions of web pages that have drifted off topic from the general topic of a collection.

Published in: Technology

  1. 1. The Off-Topic Memento Toolkit. Shawn M. Jones (sjone@cs.odu.edu, @shawnmjones), Michele C. Weigle (mweigle@cs.odu.edu, @weiglemc), Michael L. Nelson (mln@cs.odu.edu, @phonedude_mln). Old Dominion University, Web Science and Digital Libraries Research Group (@WebSciDL).
  2. 2. Many Curators Use Archive-It To Create Web Archive Collections. Archive-It makes it easy for curators to build collections and supply metadata for a collection.
  4. 4. When Building A Web Archive Collection…
       • Curators select web resources as seeds
       • Each version of a seed becomes a memento
       • They create a web archive collection with a purpose in mind
  5. 5. When Researchers Prepare to Analyze a Web Archive Collection… Some collections have thousands of seeds, and each seed has one or more mementos. Example: 81,014 seeds, 486,227 seed mementos. The sheer number of mementos to process means that researchers need a quick way to identify mementos with low information value. Off-topic mementos have low information value. We want to identify these mementos, not delete them, so that further decisions can be made about them. In particular, we identify them so that they are not selected as exemplars for storytelling.
  6. 6. How Can Mementos Go Off-Topic?
  7. 7. Mementos in a Collection Can Go Off-Topic: For Technical Reasons
       • http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/
       • http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/
  8. 8. Mementos in a Collection Can Go Off-Topic: Page Gone
       • http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/
       • http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/
  9. 9. Mementos in a Collection Can Go Off-Topic: Content Drift – A Change in Languages
       • http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/
       • http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/
  10. 10. Mementos in a Collection Can Go Off-Topic: Server Maintenance
       • http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/
       • http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/
  11. 11. Mementos in a Collection Can Go Off-Topic: Account Suspension
       • http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php
       • http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi
  12. 12. Mementos in a Collection Can Go Off-Topic: Site Redesign
       • http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/
       • http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main
  13. 13. Mementos in a Collection Can Go Off-Topic: Change Site Ownership
       • http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php
       • http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/
  14. 14. Mementos in a Collection Can Go Off-Topic: The Site Was Hacked
       • http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/
       • http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/
  15. 15. Mementos in a Collection Can Go Off-Topic: The Site Moves On From The Topic
       • http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/
       • http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
  16. 16. Presenting the Off-Topic Memento Toolkit (OTMT), a tool for identifying these off-topic mementos
  17. 17. The Off-Topic Memento Toolkit (OTMT). Currently in alpha status, the OTMT:
       • Accepts a collection of mementos
       • Executes similarity measures on those mementos
       • Rates them as on-topic or off-topic
       • Identifies, but does not delete, off-topic mementos
       https://github.com/oduwsdl/off-topic-memento-toolkit
  18. 18. Background and Related Work
  21. 21. Related Work – Similarity Measures for Documents
       • Measures: Jaccard Index, Sørensen-Dice Coefficient, Simhash, Cosine Similarity of TF-IDF Vectors, Cosine Similarity of Latent Semantic Indexing Vectors. OTMT supports these similarity measures.
       • Cited works: Jaccard (1912), Dice (1945), Sørensen (1948), Deerwester (1990), Charikar (2002), Manku (2007), Adar (2009), Hajishirzi (2010), Řehůřek (2011), Sivakumar (2015)
       • Content Drift in Web Archives: Zittrain (2014), Jones (2016)
       • Like these studies, we also use these similarity measures on mementos.
  25. 25. Related Work – Other Methods of Off-Topic Detection
       • Latent Dirichlet Allocation, Blei (2003): Topic modeling should help us find off-topic documents, but which cluster is off-topic?
       • Browser Thumbnails of Mementos, AlSum (2012): It is costly to manually review browser thumbnails to find off-topic mementos.
       • Off-Topic Analysis, AlNoamany (2016): We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit.
  26. 26. Memento Protocol Terminology. Each seed, or original resource, has a corresponding TimeMap listing all of that seed’s mementos and their capture times, called memento-datetimes. URI-T: a URI for a TimeMap. URI-M: a URI for a memento. Example TimeMap:
       <http://a.example.org>;rel="original",
       <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Wed, 21 Jun 2000 04:41:56 GMT",
       <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
       <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
       <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT",
       <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
       <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT"
       …
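A TimeMap in link format, like the one on this slide, can be read with a few lines of Python. The sketch below is illustrative only: the function name `parse_timemap` and both regular expressions are assumptions made for this example, not part of the OTMT, and they only handle quoted attribute values as they appear on the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by zero or more ;name="value" attributes.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return (URI-M, memento-datetime) pairs from a link-format TimeMap."""
    mementos = []
    for uri, raw_attrs in LINK.findall(text):
        attrs = dict(ATTR.findall(raw_attrs))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return mementos

timemap = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT"'''

for uri_m, when in parse_timemap(timemap):
    print(when.isoformat(), uri_m)
```

Only entries whose `rel` includes `memento` are returned, so the original resource, the TimeMap itself, and the TimeGate are skipped.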
  27. 27. Web Archives Augment Their Mementos. Augmentations include banners and rewritten links.
       • http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
       • http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  28. 28. The OTMT Uses Raw Mementos. Raw mementos are free of these augmentations. Archive-It and the Internet Archive provide access to raw mementos at special URIs. The OTMT finds these raw mementos and uses them in its similarity comparisons.
       • http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
       • http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  29. 29. The OTMT Performs Preprocessing
       • Input: <p class="homepage-description">The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p>
       • Boilerplate removal: The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
       • Tokenization: ['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.']
       • Remove stop words: ['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court']
       • Stemming: ['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court']
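The tokenization and stop-word steps on this slide can be approximated with the Python standard library. This is a minimal sketch, not the OTMT's code: the six-word stop list is an assumption chosen to cover the slide's sentence, and boilerplate removal and stemming are left out because they need a dedicated library.

```python
import re
import string

# Illustrative stop list (assumption); a real run would use a full list.
STOP_WORDS = {"the", "s", "for", "to", "and", "an"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop stop words and ASCII punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in string.punctuation]

text = ("The Women’s Initiatives for Gender Justice works globally to ensure "
        "justice for women and an independent and effective International "
        "Criminal Court.")
tokens = tokenize(text)
print(tokens)
print(remove_stop_words(tokens))
```

The typographic apostrophe survives the filter because it is not ASCII punctuation, which matches the token lists shown on the slide.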
  30. 30. We Evaluated the OTMT with a Gold Standard Dataset
       • In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic mementos
       • The mementos were manually marked as on-topic or off-topic
       • We reuse this dataset in our evaluation
       https://github.com/oduwsdl/offtopic-goldstandard-data
  31. 31. TimeMap Measures Supported by OTMT
  32. 32. General algorithm. For each TimeMap in a collection:
       1. Get the first memento
       2. Preprocess it
       3. For each memento in the TimeMap:
            1. Get the memento
            2. Preprocess it
            3. Compute the similarity to the first memento using a given measure
            4. Save the score
            5. A threshold value determines if a memento is on-topic or off-topic
       Figure labels: first memento, considered memento.
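The loop on this slide can be sketched as a small function. Everything here is hypothetical and only mirrors the listed steps: `rate_timemap`, the whitespace `preprocess`, and the `shared_term_ratio` toy measure are stand-ins, not OTMT functions or OTMT measures.

```python
def rate_timemap(mementos, preprocess, measure, is_off_topic):
    """Score every memento against the first one.

    mementos: list of (URI-M, raw content) pairs, first memento first.
    """
    first = preprocess(mementos[0][1])
    ratings = {}
    for uri_m, content in mementos:
        score = measure(first, preprocess(content))
        status = "off-topic" if is_off_topic(score) else "on-topic"
        ratings[uri_m] = (score, status)
    return ratings

# Toy stand-ins (assumptions): whitespace tokens and a shared-term ratio.
def preprocess(content):
    return set(content.lower().split())

def shared_term_ratio(first, current):
    return len(first & current) / len(first) if first else 0.0

mementos = [
    ("memento-1", "human rights news from the region"),
    ("memento-2", "human rights news and reports from the region"),
    ("memento-3", "this account has been suspended"),
]
ratings = rate_timemap(mementos, preprocess, shared_term_ratio,
                       lambda score: score < 0.5)
for uri_m, (score, status) in ratings.items():
    print(uri_m, score, status)
```

Passing the measure and the threshold test as arguments keeps the loop the same for every measure, whichever direction its scores run.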
  33. 33. Structural Measures – Byte Count and Word Count. On-topic example: 9599 bytes, 183 words (after preprocessing). Off-topic example: 401 bytes, 22 words (after preprocessing). Off-topic mementos tend to have fewer bytes and words. Scores range from 0 (equivalent) to -1 (fully dissimilar).
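One formula consistent with a score range of 0 to -1 is the change in size relative to the first memento. The slides do not state the OTMT's exact formula, so `relative_change` below is an assumption used only to work through the numbers on this slide.

```python
def relative_change(first_count, current_count):
    """Change in size relative to the first memento; 0 means no change."""
    return (current_count - first_count) / first_count

# Byte and word counts taken from the slide's on-topic and off-topic examples.
byte_score = relative_change(9599, 401)
word_score = relative_change(183, 22)
print(round(byte_score, 3), round(word_score, 3))
```

Under this assumption both scores for the off-topic example fall close to -1, the fully dissimilar end of the range.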
  34. 34. Set Operation Measures
       • Jaccard Distance: one minus the size of the intersection over the size of the union
       • Sørensen-Dice Distance: one minus twice the size of the intersection over the sum of the sizes of both sets
       • Scores range from 0 to 1
       • Words from Doc #1: ['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya']
       • Words from Doc #2: ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan']
       • The intersection is the set of words that appear in both lists.
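Both set measures are short enough to write out directly. The sketch below uses the standard definitions of the two distances and applies them to the two word lists from this slide; the function names are chosen for this example.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

def sorensen_dice_distance(a, b):
    """1 - 2|A ∩ B| / (|A| + |B|); 0 means identical sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1 - 2 * len(a & b) / (len(a) + len(b))

doc1 = ['women', '’', 'initi', 'gender', 'justic', 'current', 'work',
        'uganda', 'democrat', 'republ', 'congo', 'libya']
doc2 = ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda',
        'democrat', 'republ', 'congo', 'sudan', 'central', 'african',
        'republ', 'kenya', 'libya', 'kyrgyzstan']

print(round(jaccard_distance(doc1, doc2), 3))        # 6/17
print(round(sorensen_dice_distance(doc1, doc2), 3))  # 6/28
```

The lists share 11 distinct terms out of 17 in the union, so the two documents are closer to equivalent than to dissimilar under both measures.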
  35. 35. Simhash of Term Frequencies
       • Terms and frequencies from Document #1: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
       • Simhash of Document #1: 13221438115839111206
       • Terms and frequencies from Document #2: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
       • Simhash of Document #2: 13797903006343525414
       • Simhash Distance: 6 bits
       • Scores range from 0 to 64 bits
  36. 36. Simhash of raw content
       • Document #1: The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya.
       • Simhash of Document #1: 12358429319379250844
       • Document #2: The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan.
       • Simhash of Document #2: 12359555184926328508
       • Simhash Distance: 6 bits
       • Scores range from 0 to 64 bits
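The Simhash distance is the number of differing bits between two 64-bit fingerprints. The sketch below follows the general Simhash idea (weighted per-bit voting over hashed features); using truncated MD5 as the feature hash is an assumption for this example, so the fingerprints it produces will not match the values on the slides.

```python
import hashlib

def simhash(weighted_terms, bits=64):
    """Build a fingerprint from (term, weight) pairs by per-bit voting."""
    votes = [0] * bits
    for term, weight in weighted_terms:
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# First ten term frequencies shown on the slide (the slide truncates the lists).
doc1 = [('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2),
        ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1)]
doc2 = [('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2),
        ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1)]

print(hamming_distance(simhash(doc1), simhash(doc2)))
# Distance between the two fingerprints reported on the slide.
print(hamming_distance(13221438115839111206, 13797903006343525414))
```

Because similar inputs cast mostly the same votes, near-identical documents end up only a few bits apart, which is why a small distance indicates an on-topic memento.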
  37. 37. Cosine Similarities. Take the cosine of the document vectors.
       • Cosine of TF-IDF: vectors are formed from each document and their term frequencies.
       • Cosine of Latent Semantic Indexing (LSI): each vector is informed by LSI.
       • Scores range from 1 (equivalent) to 0 (fully dissimilar).
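For the TF-IDF case, the computation can be sketched without any external library. This is a simplified illustration, not the OTMT's implementation: the smoothed IDF formula and the tiny three-document corpus are assumptions, and LSI is not covered.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn token lists into sparse TF-IDF vectors (dicts of term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumption) so terms found in every document keep weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["women", "justic", "court", "gender"],   # first memento
    ["women", "justic", "court", "uganda"],   # later, still on topic
    ["server", "mainten", "page", "soon"],    # later, maintenance page
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Documents with no terms in common score exactly 0, while the overlapping pair scores well above it.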
  38. 38. Using the OTMT
  39. 39. OTMT Installation Options
       1. pip from PyPI (preferred): pip install otmt
       2. Experimental Docker Image: docker pull shawnmjones/otmt
       3. Source Code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
  40. 40. OTMT Usage
       # detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
       In this command, -i specifies the input, -tm the measures, and -o the output.
       Input types for -i:
       • timemap – followed by 1 or more TimeMap URIs, separated by commas
       • warc – followed by 1 or more WARC files, separated by commas
       • archiveit – followed by an Archive-It collection ID
       TimeMap measures for -tm:
       • bytecount
       • wordcount
       • jaccard
       • sorensen
       • simhash-tf
       • simhash-raw
       • cosine
       • gensim_lsi
       Output types for -ot:
       • json
       • csv
  42. 42. OTMT Output - JSON
       "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
         "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
           "timemap measures": {
             "cosine": {
               "stemmed": true,
               "tokenized": true,
               "removed boilerplate": true,
               "comparison score": 0.10969941307631487,
               "topic status": "off-topic"
             },
             "bytecount": {
               "stemmed": false,
               "tokenized": false,
               "removed boilerplate": false,
               "comparison score": 0.15971409055425445,
               "topic status": "on-topic"
             }
           },
           "overall topic status": "off-topic"
         },
         ...
       The outer key is the URI-T of the TimeMap and the inner key is the URI-M of the memento. Each entry under "timemap measures" holds the measure information: the preprocessing status, the measure score, and the on-topic or off-topic status for that measure. "overall topic status" gives the on-topic or off-topic status overall. If one measure scores as off-topic, the memento is considered off-topic.
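Output in this shape is straightforward to post-process. The sketch below assumes the file is a single JSON object with exactly the keys shown on the slide; `off_topic_mementos` is a hypothetical helper for this example, not something shipped with the OTMT.

```python
import json

def off_topic_mementos(results):
    """Return (URI-T, URI-M, measures that flagged it) for off-topic mementos."""
    found = []
    for uri_t, mementos in results.items():
        for uri_m, record in mementos.items():
            if record.get("overall topic status") != "off-topic":
                continue
            flagged = [name
                       for name, info in record.get("timemap measures", {}).items()
                       if info.get("topic status") == "off-topic"]
            found.append((uri_t, uri_m, flagged))
    return found

# Trimmed copy of the slide's example; a real run would json.load() the file.
sample = json.loads('''{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487, "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}''')

for uri_t, uri_m, measures in off_topic_mementos(sample):
    print(uri_m, "flagged by", ", ".join(measures))
```

Listing the flagging measures alongside each URI-M makes it easier to see which threshold was responsible for an off-topic rating.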
  43. 43. Supported Similarity Measures
       Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
       Byte Count | 0.0 | -1.0 | No | bytecount
       Word Count | 0.0 | -1.0 | Yes | wordcount
       Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
       Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
       Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
       Simhash of raw memento | 0 | 64 | No | simhash-raw
       Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
       Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
  44. 44. Establishing Reasonable Defaults
  45. 45. Experiment setup. For each measure:
       1. Start the threshold at the score of complete dissimilarity
       2. Test with the URI-Ms from the gold standard data set as if that threshold indicated off-topic
       3. Compute F1 using the real off-topic status of the memento from the gold standard data
       4. Increment the threshold
       5. Repeat 2 – 4 until the threshold matches the complete equivalence score
       Example using Byte Count:
       1. Start threshold at -1
       2. Test with the URI-Ms from the gold standard data set as if -1 indicated off-topic
       3. Compute F1 using the real off-topic status of the memento from the gold standard data
       4. Increment the threshold to -0.99
       5. Test with the URI-Ms from the gold standard data set as if -0.99 indicated off-topic
       6. Compute F1 with real status
       7. Increment to -0.98
       8. Repeat until the threshold is 0
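The sweep described on this slide can be sketched as follows. The scores and labels are made-up toy values, and two details are assumptions because the slide does not state them: a score at or below the threshold counts as off-topic (appropriate for byte count, where -1 is fully dissimilar), and the first threshold reaching the best F1 is the one kept.

```python
def f1_score(predicted, actual):
    """F1 for the off-topic class, given parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_thresholds(scores, actual_off_topic, start=-1.0, stop=0.0, step=0.01):
    """Try each threshold from start to stop; return (best F1, its threshold)."""
    best_f1, best_threshold = 0.0, start
    steps = int(round((stop - start) / step))
    for i in range(steps + 1):
        threshold = round(start + i * step, 10)
        # Assumption: at or below the threshold means off-topic.
        predicted = [score <= threshold for score in scores]
        f1 = f1_score(predicted, actual_off_topic)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold

# Toy byte-count scores with gold-standard labels (True = off-topic).
scores = [-0.95, -0.90, -0.10, 0.05, -0.20]
labels = [True, True, False, False, False]
print(sweep_thresholds(scores, labels))
```

For distance measures such as Jaccard, where larger scores mean less similar, the comparison would be reversed and the sweep would run from 1 down to 0.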
  46. 46. Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
       | AlNoamany’s Study | Our Study
       Year Conducted | 2015 | 2017
       Boilerplate Removal | Boilerpipe (Java) | Justext
       Tokenization and Stemming | Scikit-learn | NLTK
       Other changes:
       • Download errors
       • Gold Standard Dataset updates
  47. 47. Simhash of Term Frequencies. Our results: [plot]. AlNoamany’s results: not tested.
  48. 48. Simhash of raw memento. Our results: [plot]. AlNoamany’s results: not tested.
  49. 49. Sørensen-Dice Distance Results. Our results: [plot]. AlNoamany’s results: not tested.
  50. 50. Jaccard Distance Results. Our results: [plot]. AlNoamany’s results: best F1 score 0.538 at threshold 0.95.
  51. 51. Cosine Similarity of LSI Vectors. Our results: [plot]. AlNoamany’s results: not tested. Note: LSI scores are non-deterministic.
  52. 52. Byte Count Results. Our results: [plot]. AlNoamany’s results: best F1 score 0.584 at threshold -0.65.
  53. 53. Cosine Similarity of TF-IDF Vectors. Our results: [plot]. AlNoamany’s results: best F1 score 0.881 at threshold 0.15. This is the best score in AlNoamany’s results.
  54. 54. Word Count Results. Our results: [plot]. This is the best score in our results. AlNoamany’s results: best F1 score 0.806 at threshold -0.85.
  55. 55. Results Summarized – Best F1 Score is Word Count
       Similarity Measure | AlNoamany's Best F1 Score | AlNoamany's Corresponding Accuracy | AlNoamany's Corresponding Threshold | This Study's Best F1 Score | This Study's Corresponding Accuracy | This Study's Corresponding Threshold
       Word Count | 0.806 | 0.982 | -0.85 | 0.788 | 0.971 | -0.7
       Cosine Similarity of TF-IDF Vectors | 0.881 | 0.983 | 0.15 | 0.766 | 0.965 | 0.12
       Byte Count | 0.584 | 0.962 | -0.65 | 0.756 | 0.965 | -0.39
       Cosine Similarity of LSI Vectors | Not tested | | | 0.711 | 0.965 | 0.12 with 10 topics
       Jaccard Distance | 0.538 | 0.962 | 0.95 | 0.651 | 0.953 | 0.94
       Sørensen-Dice Distance | Not tested | | | 0.649 | 0.953 | 0.88
       Simhash on raw memento content | Not tested | | | 0.578 | 0.934 | 25
       Simhash on TF | Not tested | | | 0.523 | 0.942 | 28
       Our word count measure came out ahead of AlNoamany’s. AlNoamany’s Cosine Similarity measure came out ahead of ours.
  56. 56. What about using measures together? AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results. Our best F1 score for word count alone was 0.788. Word count combined with LSI turned out to be slightly better with the same accuracy. The success of word count appears to exert influence on the threshold of its partner measure, making its threshold more strict.
  57. 57. The Future of OTMT
  58. 58. Improving the OTMT
       • Bug fixes
            • Make LSI scores reproducible
       • New Measures
            • TimeMap Measures – compare first memento with considered memento:
                 • Spamsum of the raw content – used by Andy Jackson at the UKWA
                 • Cosine of LDA Vectors via Gensim
            • Collection Measures
                 1. Develop a collection-wide picture
                 2. Compare each memento against that picture
       • Control over preprocessing:
            • Options to use a different boilerplate removal method
            • Options to turn off stemming or stop word removal
  59. 59. Conclusion
  60. 60. Motivation - Mementos Can Go Off-Topic
       • Collections have a theme
       • Seeds are selected to support that theme
       • Mementos are versions of seeds
       • Some of these versions are off-topic
       • Identifying these off-topic mementos is key to some research activities, like summarization
       Examples shown: hacked, moved on from topic, web page gone, account suspension.
  61. 61. OTMT supports different similarity measures with thresholds established based on experimentation
       • Byte count
       • Word count
       • Jaccard distance
       • Sørensen-Dice distance
       • Simhash of term frequencies
       • Simhash of raw memento content
       • Cosine similarity of TF-IDF vectors
       • Cosine similarity of LSI vectors
  62. 62. Please try out the Off-Topic Memento Toolkit!
       1. pip (preferred): pip install otmt
       2. Experimental Docker Image: docker pull shawnmjones/otmt
       3. Source Code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
       https://github.com/oduwsdl/off-topic-memento-toolkit
       https://github.com/oduwsdl/offtopic-goldstandard-data
