Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.
Sign upUpdate to Spark 2.4.3 and update Tika to 1.20. #321
Conversation
ruebot
requested review from
lintool and
ianmilligan1
Jul 4, 2019
ruebot
reviewed
Jul 4, 2019
@@ -178,9 +178,13 @@ class CommandLineApp(conf: CmdAppConf) { | |||
|
|||
def save(d: Dataset[Row]): Unit = { | |||
if (!configuration.partition.isEmpty) { | |||
d.coalesce(configuration.partition()).write.csv(saveTarget) | |||
d.coalesce(configuration.partition()).write | |||
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") |
This comment has been minimized.
This comment has been minimized.
ruebot
Jul 4, 2019
Author
Member
FYI: These were added for the newer version of Tika (underlying libraries). Timestamp format is required.
This comment has been minimized.
This comment has been minimized.
codecov-io
commented
Jul 4, 2019
•
Codecov Report
@@ Coverage Diff @@
## master #321 +/- ##
==========================================
+ Coverage 74.84% 74.97% +0.13%
==========================================
Files 39 39
Lines 1117 1123 +6
Branches 197 197
==========================================
+ Hits 836 842 +6
Misses 215 215
Partials 66 66
Continue to review full report at Codecov.
|
ianmilligan1
requested changes
Jul 4, 2019
Putting through its paces, things work well with example data (including DataFrames) except the language filter. The following script: import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc)
.keepValidPages()
.keepDomains(Set("www.archive.org"))
.keepLanguages(Set("fr"))
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("plain-text-fr/") Fails with
et al. [full log here] Given the Tika issues around the language detector, probably somewhat to be expected? |
This comment has been minimized.
This comment has been minimized.
@ianmilligan1 That's what I was afraid of. |
This comment has been minimized.
This comment has been minimized.
Looks like I'm getting a slightly different error with:
ERROR:
I'll dig into it more. I think it might be a guava issue. |
This comment has been minimized.
This comment has been minimized.
Ah cool. We have 16 different occurrences of |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
I think I got it!
318-test-lang-scala
|
ianmilligan1
approved these changes
Jul 17, 2019
Tested and confirmed on multi-lingual data – fantastic work, @ruebot, congrats on getting this resolved! |
ruebot commentedJul 4, 2019
•
edited
GitHub issue(s):
What does this Pull Request do?
Update to Spark 2.4.3 and update Tika to 1.20, and pulls in unfinished work by @jrwiebe and @borislin.
I had to tweak the language tests. I believe is because of the updated language detection in Tika, and that's why the test fixtures changed. Though, I'm not a 100% certain. So, definitely need a sanity check here.
How should this be tested?
TravisCI should take care of the first bit.
@ianmilligan1 would you mind putting this through a bunch of examples?
rm -rf ~/.m2/repository/* && mvn clean install
rm -rf ~/.ivy2/* && spark-shell --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT"
Additional Notes:
Once we get this in, and #318, might as well cut a 0.18.0 release. There's a fair bit done since the last release. It'd be nice to get #317 sorted, but that isn't a blocker.
@jrwiebe if you have some time, can you give this a sanity look too, since I'll built on a bit of what you were previously digging into.