Add office document binary extraction. #346

Open
wants to merge 1 commit into
base: master
from

@ruebot
Member

commented Aug 15, 2019

GitHub issue(s):

What does this Pull Request do?

  • Add WordProcessor DF and binary extraction
  • Add Spreadsheets DF and binary extraction
  • Add Presentation Program DF and binary extraction
  • Add tests for new DF and binary extractions
  • Add test fixture for new DF and binary extractions
  • Resolves #303
  • Resolves #304
  • Resolves #305
  • Back out 39831c2 (We might not have to do this)

How should this be tested?

  • TravisCI
  • I tested on the 10 GeoCities WARCs with the command and script below:
$ rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && \
  time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell \
    --master local\[10\] \
    --driver-memory 35g \
    --conf spark.network.timeout=10000000 \
    --conf spark.executor.heartbeatInterval=600s \
    --conf spark.driver.maxResultSize=4g \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.shuffle.compress=true \
    --conf spark.rdd.compress=true \
    --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT \
    -i ~/office-document-extraction.scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Spreadsheets: build the details DataFrame, then write each record's bytes to disk.
val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF()
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

// Presentation program files: same steps, different extractor and output prefix.
val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF()
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

// Word processor files: same steps, different extractor and output prefix.
val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF()
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

sys.exit

Additional Notes:

  • Do we want to include plain text (txt) files in Word Processor?
  • Is it worth putting some NOT conditionals in these? Like r._1 != "text/html" or !r._1.startsWith("image/"). Or, is it worth leaving some of this noise in there?
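
To make the NOT-conditional question concrete, here is a rough sketch of what such a filter could look like. It is not the code in this PR: the object name, the helper names, and the extension list are all hypothetical, and it assumes r._1 is the MIME type string reported for the record. It only shows excluding text/html and image/* before falling back to an extension check.

```scala
// Hypothetical sketch: none of these names come from the aut codebase.
object OfficeMimeFilter {
  // Extensions treated as word processor files (an assumed, illustrative list).
  val wordProcessorExtensions: Set[String] = Set("doc", "docx", "odt", "rtf", "wpd")

  // MIME types treated as noise: exact matches and prefixes.
  val excludedMimeTypes: Set[String] = Set("text/html")
  val excludedMimePrefixes: Seq[String] = Seq("image/")

  // True when the reported MIME type is one we want to exclude.
  // Parameters such as "; charset=utf-8" are dropped before comparing.
  def isNoise(mimeType: String): Boolean = {
    val mime = mimeType.takeWhile(_ != ';').trim.toLowerCase
    excludedMimeTypes.contains(mime) || excludedMimePrefixes.exists(p => mime.startsWith(p))
  }

  // Lower-cased file extension of the URL path, or "" if there is none.
  def extensionOf(url: String): String = {
    val path = url.takeWhile(c => c != '?' && c != '#')
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot < 0) "" else name.substring(dot + 1).toLowerCase
  }

  // Keep a record only if its MIME type is not noise and its extension matches.
  def isWordProcessorCandidate(mimeType: String, url: String): Boolean =
    !isNoise(mimeType) && wordProcessorExtensions.contains(extensionOf(url))
}
```

The trade-off is the one raised above: excluding text/html drops pages that merely have a .doc URL, but it would also drop real documents served with the wrong MIME type.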

@ruebot ruebot requested review from ianmilligan1 and jrwiebe Aug 15, 2019

@codecov

commented Aug 15, 2019

Codecov Report

Merging #346 into master will decrease coverage by 2.3%.
The diff coverage is 52.51%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
- Coverage    75.2%   72.89%   -2.31%     
==========================================
  Files          39       39              
  Lines        1230     1369     +139     
  Branches      224      294      +70     
==========================================
+ Hits          925      998      +73     
  Misses        214      214              
- Partials       91      157      +66
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 69.25% <52.51%> (-11.13%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39831c2...2258207. Read the comment docs.
