Add additional filters for textFiles; resolves #362. #393
+17
−2
- Add filedesc and dns filters (ARC files)
- Add test case
Testing with:
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/data/banq-datathon/PQ/warcs/*gz", sc).textFiles()
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.parquet("/data/banq-datathon/PQ/derivatives/parquet/text")
df.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ/derivatives/binaries/text/pq-2012-text", "extension")
sys.exit
codecov bot commented Dec 18, 2019
Codecov Report
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   77.15%   77.11%   -0.04%
==========================================
  Files          40       40
  Lines        1484     1486       +2
  Branches      278      280       +2
==========================================
+ Hits         1145     1146       +1
  Misses        217      217
- Partials      122      123       +1
Successfully ran the job on the BANQ dataset twice (with both commits) without issue.
Tested locally and looks great!
ruebot commented Dec 18, 2019
GitHub issue(s):
What does this Pull Request do?
Add additional filters for textFiles; resolves #362.
You can see filedesc and dns in the ARC test fixtures:
How should this be tested?
The updated test catches the above examples.
I'm running a more robust test on the BANQ collection discussed in #362 now. I'll move this out of draft if it is successful.
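To illustrate the kind of filter described above, here is a minimal sketch in plain Scala with no Spark dependency. It assumes that filedesc and dns records can be recognised by their URL prefix. The object name, the helper names, the prefix list, and the sample URLs are all hypothetical and are not taken from the aut codebase; the actual change in this PR may be implemented differently.

```scala
// Hypothetical helper, not part of the aut API: recognises URLs that belong
// to archive bookkeeping records rather than harvested text content.
object TextFileUrlFilter {
  // Assumed prefixes, based only on the record types named in this PR.
  private val excludedPrefixes: Seq[String] = Seq("filedesc", "dns:")

  // True when the URL starts with one of the excluded prefixes (case-insensitive).
  def isArchiveMetadataUrl(url: String): Boolean = {
    val lower = url.toLowerCase
    excludedPrefixes.exists(prefix => lower.startsWith(prefix))
  }

  // Keeps only the URLs that are not archive bookkeeping records.
  def keepTextUrls(urls: Seq[String]): Seq[String] =
    urls.filterNot(isArchiveMetadataUrl)
}

object TextFileUrlFilterExample extends App {
  // Made-up sample URLs, for illustration only.
  val urls = Seq(
    "filedesc://example.arc",
    "dns:www.example.org",
    "http://www.example.org/readme.txt"
  )
  TextFileUrlFilter.keepTextUrls(urls).foreach(println)
}
```

In a DataFrame pipeline the same idea would typically be applied as a predicate on the url column before the other text-file checks, so that bookkeeping records never reach the extension and MIME type logic.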