Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344

Merged
merged 1 commit into from Aug 14, 2019

Conversation

@ruebot
Copy link
Member

commented Aug 14, 2019

GitHub issue(s): #342

What does this Pull Request do?

  • Move audio, pdf, and video DF extraction to tuple map
  • Provide two MimeType columns; mime_type_web_server and mime_type_tika
  • Update tests
  • Resolves #342

How should this be tested?

  • TravisCI
  • I did the same tests for #341 and got the same number of files before and after for the 10 GeoCities WARCs:
4809 audio files
644 PDF files
232 video files
5685 total

Times were slightly slower 🤷‍♂

  • This PR 4212.43s user 224.14s system 623% cpu 11:51.81 total
  • HEAD on master now: 3899.34s user 186.65s system 598% cpu 11:23.12 total

Interested parties

@jrwiebe @ianmilligan1

@ruebot ruebot requested review from ianmilligan1 and jrwiebe Aug 14, 2019

Use Tika's detected MIME type instead of ArchiveRecord getMimeType.
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342

@ruebot ruebot force-pushed the mimetype-columns branch from 7ff5db2 to 47612ce Aug 14, 2019

@codecov

This comment has been minimized.

Copy link

commented Aug 14, 2019

Codecov Report

Merging #344 into master will increase coverage by 0.28%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #344      +/-   ##
=========================================
+ Coverage   74.91%   75.2%   +0.28%     
=========================================
  Files          39      39              
  Lines        1216    1230      +14     
  Branches      224     224              
=========================================
+ Hits          911     925      +14     
  Misses        214     214              
  Partials       91      91
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 80.38% <60%> (+1.4%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54c0c3e...47612ce. Read the comment docs.

@ianmilligan1
Copy link
Member

left a comment

Tested locally on sample data and worked nicely on my end. Same results as in #341 as well.

Screen Shot 2019-08-14 at 2 49 11 PM

@jrwiebe
Copy link
Contributor

left a comment

I tested with a sample from the CPP dataset, using the script from #341 and modifying the paths. It worked as expected, but especially on video there were a lot of files with no extension. (Also a few with extensions like ".php".) A strategy for extensions like I describe in #343 would fix this.

@ruebot

This comment has been minimized.

Copy link
Member Author

commented Aug 14, 2019

@jrwiebe yeah, I've noticed that the majority of missing extensions so far have been video files. My guess is that it's mostly to do with streaming.

@ianmilligan1 ianmilligan1 merged commit 01d12b4 into master Aug 14, 2019

2 of 3 checks passed

codecov/patch 60% of diff hit (target 74.91%)
Details
codecov/project 75.2% (+0.28%) compared to 54c0c3e
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ianmilligan1 ianmilligan1 deleted the mimetype-columns branch Aug 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.