Encoding management #428

Open
alxdrdelaporte opened this issue Mar 12, 2020 · 3 comments

@alxdrdelaporte alxdrdelaporte commented Mar 12, 2020

I am currently working on a project involving content extraction from a certain number of WARC archives.
I use Archives Unleashed Toolkit to extract plain HTML content (script below) and it works very well.

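import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load the archive, keep only valid HTML pages, and write one tuple per page.
RecordLoader.loadArchives("/path/to/archive.arc.gz", sc)
  .keepValidPages()
  // (crawl date, domain, URL, page content with the HTTP header stripped)
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHttpHeader(r.getContentString)))
  .saveAsTextFile("plain-html/")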

However, I have a problem with the output files this script produces: the web pages (mostly in French) extracted from the WARC archive are not all originally encoded in UTF-8, so some characters are not rendered properly in the output and are replaced by �.

For example:

nouvelle page consacr�e au 114�me bataillon de marche de chasseurs alpins

As my research is mainly based on lexicon analysis, proper encoding and character rendering are an important matter.
It would be great if there were a way to deal with non-UTF-8 encodings (such as windows-1252 or latin1) while extracting with AUT.
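
To illustrate the symptom outside of AUT, here is a minimal sketch using only the Java standard library from Scala. It assumes the page bytes are windows-1252 and shows that decoding them as UTF-8 yields the replacement character, while decoding them with the right charset does not. The object and value names are hypothetical, and this is not a statement about where in AUT or its dependencies the decoding actually happens.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object EncodingDemo {
  // Assumption: the JVM ships the windows-1252 charset (common, but not guaranteed by the spec).
  val cp1252: Charset = Charset.forName("windows-1252")

  // "consacrée" as it would be stored in a windows-1252 page: é is the single byte 0xE9.
  val rawBytes: Array[Byte] = "consacr\u00e9e".getBytes(cp1252)

  // Wrong charset: 0xE9 does not start a valid UTF-8 sequence here,
  // so the decoder substitutes U+FFFD (the � seen in the output).
  val decodedAsUtf8: String = new String(rawBytes, StandardCharsets.UTF_8)

  // Right charset: the original text is recovered.
  val decodedAsCp1252: String = new String(rawBytes, cp1252)
}
```

Once the bytes have been decoded with the wrong charset the original character is lost, so any fix has to happen before or during decoding, not on the resulting string.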


@ruebot ruebot commented Mar 12, 2020

@alxdrdelaporte thanks for creating the issue! There was an issue template, and we use that to collect some very useful information to troubleshoot the issue. Can you please provide some more information:

Environment information

  • AUT version: [e.g. 0.16.0, HEAD]
  • OS: [e.g. MacOS 10.13.3, Ubuntu 18.04]
  • Java version: [e.g. Java 8]
  • Apache Spark version: [e.g. 2.1.3, 2.3.1]
  • Apache Spark w/aut: [e.g. --jars, --packages]
  • Apache Spark command used to run AUT: [e.g. ./spark-shell --driver-memory 55G --packages "io.archivesunleashed:aut:0.16.0"]

I have some sample WARC/ARCs from BAnQ that should help me troubleshoot, but if you could provide one that you're working with, that would be very helpful.


@ruebot ruebot commented Mar 12, 2020

@alxdrdelaporte does this make sense? That's Spark 2.4.5 with the 0.50.0 release.

What needs to be sorted out:

  • how is the content encoded in the WARC/ARC (is it bad data in, bad data out?)
  • if the content is "good" in the WARC/ARC, where does the bad encoding happen? In one of the dependencies -- webarchive commons -- or somewhere else?
  • if everything is fine up until the write in Spark, that's a Spark issue, and I honestly don't know how one would handle multiple text encodings, and re-encoding them to something else in the derivative write.
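
On the question of handling several encodings at once, one common heuristic (a sketch only, not something AUT is known to do) is to try a strict UTF-8 decode first and fall back to a single-byte charset when the bytes are not valid UTF-8. The `DecodeWithFallback` object and its `decode` function below are hypothetical names, the default fallback of windows-1252 is an assumption suited to French pages, and applying this to archive records would require access to the raw content bytes rather than an already decoded string.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object DecodeWithFallback {
  // Hypothetical helper: decode as UTF-8 if the bytes are valid UTF-8,
  // otherwise decode with the fallback charset (assumed windows-1252).
  def decode(bytes: Array[Byte],
             fallback: Charset = Charset.forName("windows-1252")): String = {
    // A strict decoder reports malformed input instead of silently inserting U+FFFD.
    val strictUtf8 = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      strictUtf8.decode(ByteBuffer.wrap(bytes)).toString
    } catch {
      case _: CharacterCodingException => new String(bytes, fallback)
    }
  }
}
```

The resulting `String` can then be written out as UTF-8 as usual. The heuristic can guess wrong for pages in other legacy encodings, so a charset declared in the HTTP headers or in the HTML itself would be a better signal where one is available.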

All that said, this is all web archives, and crazy stuff from the web ends up in them. Text encoding is definitely going to be a giant recurring thorn.


@alxdrdelaporte alxdrdelaporte commented Mar 24, 2020

Thanks for your answer @ruebot (and sorry for not following the issue template, I didn't see there was one).

As my workplace is in lockdown until further notice, all projects are on hold and I am not able to provide more information for now. I will as soon as I can. The only thing I will not be able to provide at all is a sample WARC file, because duplicating or processing data outside of the premises is strictly forbidden.
