Add additional filters for textFiles; resolves #362. #393
+17
−2
- Add filedesc, and dns filter (arc files) - Add test case
Testing with:
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/data/banq-datathon/PQ/warcs/*gz", sc).textFiles();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.parquet("/data/banq-datathon/PQ/derivatives/parquet/text")
df.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ/derivatives/binaries/text/pq-2012-text", "extension")
sys.exit
codecov bot commented Dec 18, 2019
Codecov Report
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   77.15%   77.11%   -0.04%
==========================================
  Files          40       40
  Lines        1484     1486       +2
  Branches      278      280       +2
==========================================
+ Hits         1145     1146       +1
  Misses        217      217
- Partials      122      123       +1
Successfully ran the job on the BANQ dataset twice (with both commits) without issue.
Tested locally and looks great!
ruebot commented Dec 18, 2019
GitHub issue(s):
What does this Pull Request do?
Add additional filters for textFiles; resolves #362.
You can see filedesc and dns in the ARC test fixtures:
How should this be tested?
The updated test catches the above examples.
I'm doing a more robust test on the BANQ collection in question on #362 now. I'll move this out of draft if it is successful.
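For readers without the diff at hand, the kind of check described above can be sketched as a plain URL predicate. This is only an illustration, not the code in this pull request: the object name, the helper names, the exact prefixes (`filedesc:` and `dns:`), and the sample URLs are all assumptions.

```scala
// Hypothetical sketch of excluding ARC header (filedesc) and DNS records
// from a text-file extraction. Not the project's actual implementation.
object TextFileFilterSketch {
  // Assumed URL prefixes for the records this PR wants to drop.
  val excludedPrefixes: Seq[String] = Seq("filedesc:", "dns:")

  // True when the URL starts with one of the excluded prefixes
  // (compared case-insensitively).
  def isExcluded(url: String): Boolean = {
    val lower = url.toLowerCase
    excludedPrefixes.exists(prefix => lower.startsWith(prefix))
  }

  // Keep only the URLs that are not filedesc or dns records.
  def keepTextUrls(urls: Seq[String]): Seq[String] =
    urls.filterNot(isExcluded)
}
```

With made-up inputs, `TextFileFilterSketch.keepTextUrls(Seq("dns:example.org", "http://example.org/a.txt"))` would keep only the `http` URL.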