
Encoding management #428

Closed

alxdrdelaporte opened this issue Mar 12, 2020 · 11 comments

alxdrdelaporte commented Mar 12, 2020

I am currently working on a project involving content extraction from a certain number of WARC archives.
I use the Archives Unleashed Toolkit to extract plain HTML content (script below), and it works very well.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load the archive, keep only valid pages, and write out
// (crawl date, domain, URL, content with HTTP headers removed) as plain text.
RecordLoader.loadArchives("/path/to/archive.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHttpHeader(r.getContentString)))
  .saveAsTextFile("plain-html/")

However, I encounter a problem with the output produced by this script: the webpages (mostly in French) extracted from the WARC archive do not all have UTF-8 as their original encoding, so some characters are not rendered properly in the output (they are replaced by �).

For example:

nouvelle page consacr�e au 114�me bataillon de marche de chasseurs alpins

As my research is mainly based on lexicon analysis, proper encoding and character rendering are an important matter.
It would be great if there was a way to deal with non-UTF-8 encodings (such as windows-1252 or latin1) while extracting with AUT.

ruebot (Member) commented Mar 12, 2020

@alxdrdelaporte thanks for creating the issue! There is an issue template, and we use it to collect some very useful information for troubleshooting. Can you please provide some more information:

Environment information

  • AUT version: [e.g. 0.16.0, HEAD]
  • OS: [e.g. MacOS 10.13.3, Ubuntu 18.04]
  • Java version: [e.g. Java 8]
  • Apache Spark version: [e.g. 2.1.3, 2.3.1]
  • Apache Spark w/aut: [e.g. --jars, --packages]
  • Apache Spark command used to run AUT: [e.g. ./spark-shell --driver-memory 55G --packages "io.archivesunleashed:aut:0.16.0"]

I have some sample WARC/ARCs from BAnQ that should help me troubleshoot, but if you could provide one that you're working with, that would be very helpful.

ruebot (Member) commented Mar 12, 2020

@alxdrdelaporte does this make sense? That's Spark 2.4.5 with the 0.50.0 release.

What needs to be sorted out:

  • how is the content encoded in the WARC/ARC (is it bad data in, bad data out? a quick check is sketched after this list)
  • if the content is "good" in the WARC/ARC, where does the bad encoding happen? In one of the dependencies (webarchive-commons), or somewhere else?
  • if everything is fine up until the write in Spark, then it's a Spark issue, and I honestly don't know how one would handle multiple text encodings and re-encode them to something else in the derivative write.
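
As a quick way to probe those first two bullets, here is a hypothetical check, reusing the extraction script from the issue description: count how many records already contain the U+FFFD replacement character in their extracted content. A non-zero count would suggest the mangling happens at or before getContentString, well before anything is written.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Count records whose extracted content already contains the U+FFFD
// replacement character, i.e. records mis-decoded before any write happens.
RecordLoader.loadArchives("/path/to/archive.arc.gz", sc)
  .keepValidPages()
  .map(r => RemoveHttpHeader(r.getContentString))
  .filter(_.contains("\uFFFD"))
  .count()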

All that said, this is all web archives, and crazy stuff from the web ends up in them. Text encoding is definitely going to be a giant re-emerging thorn.

alxdrdelaporte (Author) commented Mar 24, 2020

Thanks for your answer @ruebot (and sorry for not following the issue template, I didn't see there was one).

As my workplace is in lockdown until further notice, all projects are on hold and I am not able to provide more information for now. I will as soon as I can (the only thing I will not be able to provide at all is a sample WARC file, because duplicating or processing data outside of the premises is strictly forbidden).

ianmilligan1 (Member) commented Mar 25, 2020

OK! I spent about an hour and a bit on this problem, and I am not sure how much closer I am.

In our sample data WARCs, we have the issue of mixed encodings in the same WARC. We can see this here:

[screenshot]

Looking at the WARC itself, we can see that the records that are not rendering properly are indeed encoded in ISO-8859-1. Opening the WARC in vim, we can see the accents, although even vim is having some trouble parsing them (note the super tiny é).

[screenshot]
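
A vim-free way to confirm the mix is to tally the charset each record declares. This is a rough, hypothetical sketch: it just grabs the first charset= it finds in the record text, whether that comes from the HTTP Content-Type header or a meta tag.

import io.archivesunleashed._

// Rough tally of declared charsets per record in the sample data.
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => "charset=([A-Za-z0-9_\\-]+)".r
    .findFirstMatchIn(r.getContentString)
    .map(_.group(1).toLowerCase)
    .getOrElse("undeclared"))
  .countByValue()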

My thought was to try writing the DataFrame out as ISO-8859-1 and see if the results were usable. My script was thus:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Select French-language pages, strip HTTP headers and HTML from the content,
// and write the result out as CSV encoded as ISO-8859-1.
val data = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .filter($"language" === "fr")
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")))

data.write.option("encoding", "ISO-8859-1").csv("test2")

Alas, that did not fix the ISO-8859-1 encoded content (for the most part it just changed the error characters to question marks), and it broke the properly rendered files as well.

[screenshot]
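
For what it's worth, here is a minimal illustration (plain Scala, nothing AUT-specific) of why a write-time encoding option can't repair this: once ISO-8859-1 bytes have been decoded as UTF-8, the é is already gone, and re-encoding the damaged string at write time just turns the replacement character into a question mark, which is exactly the behaviour described above.

import java.nio.charset.StandardCharsets

// Bytes as a Latin-1 server would send them: "consacrée" in ISO-8859-1.
val latin1Bytes = "consacrée".getBytes(StandardCharsets.ISO_8859_1)

// Decoding those bytes as UTF-8 (the mis-decode) loses the é immediately.
val misdecoded = new String(latin1Bytes, StandardCharsets.UTF_8)
println(misdecoded)  // consacr�e

// Re-encoding the already-damaged string as ISO-8859-1 at write time
// cannot bring the é back; U+FFFD just becomes "?".
println(new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1))  // consacr?e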

So @ruebot, looks like the data is "good" in the WARC (in the sense that it is readable, even if it is in different encodings), but by the time we get to the write stage something has gone awry.

I am probably the last person who should be digging into this, but hopefully this is helpful?

ianmilligan1 (Member) commented Mar 25, 2020

And I was curious, so I tried @greebie's idea of changing the WarcRecordUtils behaviour like so:

dout.write("WARC/0.17\n".getBytes("ISO-8859-1"));

But even when, in theory, reading the WARCs in using ISO-8859-1, the record theoretically encoded that way still doesn't render correctly.

[screenshot]

ruebot (Member) commented Mar 25, 2020

Thanks @ianmilligan1! This is some good hacking and supporting documentation. It all dovetails really nicely with my digging.

Without seeing @alxdrdelaporte's WARC, I'm going to hazard a guess that it has the same mixed encoding that many ARC/WARCs are going to have. Most likely the result of a bunch of wild web servers serving up a variety of encodings. Microsoft web servers? 👀

Roughly, we're pulling the "content" from an ARC/WARC record as a raw byte array in Java, and then handing that off to Scala in ArchiveRecord and RecordLoader. That's all under the same JVM, so there shouldn't be any issues between Java and Scala. Spark is going to write out as UTF-8 by default, if I'm not mistaken. So my next line of thinking goes down the path of figuring out whether there is a way to identify the encoding of a content payload on the fly while we're processing each record, and convert it to UTF-8. Maybe? But that all assumes we have good data coming in, and that we can do things relatively predictably.
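
For the "identify the encoding on the fly" idea, here is a minimal sketch of what a per-record detect-and-decode helper could look like, assuming a charset detector such as ICU4J's CharsetDetector is available on the classpath (this is not something aut currently does):

import com.ibm.icu.text.CharsetDetector
import java.nio.charset.{Charset, StandardCharsets}

// Guess the charset of a record's raw payload bytes and decode accordingly,
// so that everything downstream is a well-formed String that Spark can
// write out as UTF-8.
def decodeBestGuess(bytes: Array[Byte]): String = {
  val detector = new CharsetDetector()
  detector.setText(bytes)
  val charset = Option(detector.detect())         // best guess, e.g. ISO-8859-1
    .map(m => Charset.forName(m.getName))
    .getOrElse(StandardCharsets.UTF_8)            // fall back to UTF-8 if detection fails
  new String(bytes, charset)
}

In the extraction script from the issue description, something like RemoveHttpHeader(decodeBestGuess(r.getContentBytes)) could then stand in for the getContentString call, with the caveats that getContentBytes is assumed here to expose the raw payload bytes, and that running detection on every record adds per-record overhead.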

Looping back to my earlier comment, I'm honestly not sure this is something that aut can or should handle. It sounds like it should be, but I'm not 100% sure. I've hit a wall with cycles on this, and I'm not sure what else we could do. @lintool any ideas on your end?

@anjackson @tokee I'd assume y'all hit this hurdle before with webarchive-discovery?

greebie (Contributor) commented Mar 27, 2020

Hi @ianmilligan1 - could you check if loading using UTF-8 works okay instead? The issue with .getBytes(), as I understand it, is that it will use the platform's default encoding, which can differ from OS to OS. Also, just in case: you changed the encoding in all three places, including here, correct?

for (Map.Entry<String, Object> entry : record.getHeader()
        .getHeaderFields().entrySet()) {
  dout.write((entry.getKey() + ": " + entry.getValue().toString() + "\n")
          .getBytes("UTF-8"));
}

I think forcing to UTF-8 is a good idea at any rate. Might not solve this problem, but it may prevent other issues down the road.

ianmilligan1 (Member) commented Mar 27, 2020

Thanks @greebie (hope you're well!). I can try w/ UTF-8 in a bit, but yes, I did try it in all three places (same results).

greebie (Contributor) commented Mar 27, 2020

Thanks. Social distancing isn't fantastic for my fashion sense, but so far we are doing okay. (Thank goodness.)

greebie (Contributor) commented Mar 27, 2020

Confirmed that the WARCWriter library saves as UTF-8 by default, so this problem likely only occurs when the WARC has been saved in another encoding.

ruebot added the question label Jun 3, 2020

ianmilligan1 (Member) commented Jun 23, 2020

Hi @alxdrdelaporte - hope you're still staying safe and healthy amidst the global situation.

Our team has looked at this issue quite a bit, and our conclusion from sleuthing is that we're running into problems with mixed-encoding ARC/WARCs: i.e. one server is serving content in ISO-8859-1 (Latin-1) and another in, say, UTF-8.

We've tried a few different solutions and ultimately, short of doing encoding detection on each record (which would have performance implications as well), I think this might be out of scope for the Archives Unleashed Toolkit, since the problem lies with the data coming into it.

ruebot closed this Jun 23, 2020
ruebot added this to Done in 1.0.0 Release of AUT via automation Jun 23, 2020