Add additional filters for textFiles; resolves #362. #393

Merged
merged 2 commits into from Dec 18, 2019

Conversation

ruebot commented Dec 18, 2019

GitHub issue(s): #362

What does this Pull Request do?

Add additional filters for textFiles; resolves #362.

  • Add filedesc and dns filters (ARC files)
  • Add test case

You can see filedesc and dns in the ARC test fixtures:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/*gz",sc)
df.all().select("url").show(5, false)
+-------------------------------------------------+
|url                                              |
+-------------------------------------------------+
|filedesc://IAH-20080430204825-00000-blackbook.arc|
|dns:www.archive.org                              |
|http://www.archive.org/robots.txt                |
|http://www.archive.org/                          |
|http://www.archive.org/index.php                 |
+-------------------------------------------------+
only showing top 5 rows
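
For anyone skimming, here is a rough sketch of the idea in plain Scala, so it runs without Spark or aut on the classpath. `TextUrlFilter`, `excludedPrefixes`, and `isWebUrl` are hypothetical names for illustration only, and the prefix check is an assumption about how such records could be excluded; it is not the actual change in this PR.

```scala
// Hypothetical helper for illustration; not the aut implementation.
object TextUrlFilter {
  // URL prefixes of the ARC bookkeeping records shown in the fixture output above.
  val excludedPrefixes: Seq[String] = Seq("filedesc://", "dns:")

  // True when the URL does not start with any excluded prefix.
  def isWebUrl(url: String): Boolean =
    !excludedPrefixes.exists(prefix => url.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // The five URLs from the fixture listing above.
    val urls = Seq(
      "filedesc://IAH-20080430204825-00000-blackbook.arc",
      "dns:www.archive.org",
      "http://www.archive.org/robots.txt",
      "http://www.archive.org/",
      "http://www.archive.org/index.php"
    )
    // Print only the URLs that survive the filter.
    urls.filter(isWebUrl).foreach(println)
  }
}
```

Run against the five fixture URLs, this should leave only the three `http://` entries.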

How should this be tested?

The updated test catches the examples above.

I'm doing a more robust test on the BANQ collection in question on #362 now. I'll move this out of draft if it is successful.


ruebot commented Dec 18, 2019

Testing with:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ/warcs/*gz", sc).textFiles();

df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.parquet("/data/banq-datathon/PQ/derivatives/parquet/text")

df.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit
codecov bot commented Dec 18, 2019

Codecov Report

Merging #393 into master will decrease coverage by 0.03%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   77.15%   77.11%   -0.04%     
==========================================
  Files          40       40              
  Lines        1484     1486       +2     
  Branches      278      280       +2     
==========================================
+ Hits         1145     1146       +1     
  Misses        217      217              
- Partials      122      123       +1
ruebot marked this pull request as ready for review Dec 18, 2019
ruebot commented Dec 18, 2019

Successfully ran the job on the BANQ dataset twice (with both commits) without issue.

ruebot requested a review from ianmilligan1 Dec 18, 2019
ianmilligan1 left a comment

Tested locally and looks great!

ianmilligan1 merged commit 8eb43ff into master Dec 18, 2019
1 of 3 checks passed
codecov/patch 50% of diff hit (target 77.15%)
codecov/project 77.11% (-0.04%) compared to 40a59de
continuous-integration/travis-ci/pr The Travis CI build passed
ianmilligan1 deleted the issue-362 branch Dec 18, 2019