Use Case: Document Keeping HTML Tags in Text Output #63

ianmilligan1 · Nov 2, 2018

An interesting use case came up at the datathon: a team wanted to work with the raw HTML so they could find data using specific tags. Makes sense to me! I will add this to the documentation.

Thanks to @obrienben for the suggestion.

Testing with:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// r.getContentString is left unwrapped (no RemoveHTML call), so the HTML tags are kept in the output.
RecordLoader.loadArchives("/mnt/vol1/data_sets/ubc-wildfires-2017/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, r.getContentString))
  .saveAsTextFile("/mnt/vol1/derivative_data/ubc-wildfires-2017/plain-html")

I'll see how it works for the team and, if it meets their needs, add it to our docs.
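As a rough illustration of what "finding data using specific tags" could look like once the raw HTML is kept, here is a minimal sketch in plain Scala. `TagExtractSketch` and `extractTag` are hypothetical names introduced for this example and are not part of the Archives Unleashed Toolkit. It uses a regular expression, which is only an approximation: it will not handle nested tags of the same name or malformed markup the way a real HTML parser would.

```scala
import scala.util.matching.Regex

object TagExtractSketch {
  // Hypothetical helper (an assumption for illustration, not a toolkit function):
  // returns the trimmed inner content of every <tag>...</tag> pair found in `html`.
  def extractTag(html: String, tag: String): List[String] = {
    val quoted = Regex.quote(tag)
    // (?is): match case-insensitively and let '.' span newlines.
    // [^>]* skips any attributes; (.*?) captures the inner content non-greedily.
    val pattern = new Regex("(?is)<" + quoted + "\\b[^>]*>(.*?)</" + quoted + "\\s*>")
    pattern.findAllMatchIn(html).map(_.group(1).trim).toList
  }
}
```

In a job like the one above, this could presumably be called inside the `map` step, for example `r => (r.getUrl, TagExtractSketch.extractTag(r.getContentString, "title"))`, though that combination is untested here.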

ianmilligan1 added the documentation label Nov 2, 2018

ianmilligan1 self-assigned this Nov 2, 2018

ianmilligan1 changed the title from Use Case: Document Keeping HTML to Use Case: Document Keeping HTML Tags in Text Output Nov 2, 2018

ianmilligan1 closed this in 1133333 Nov 2, 2018

archivesunleashed/archivesunleashed.org
