Output filtered data to WARC Format #147
Comments
ianmilligan1 changed the title from "Output filtered data to Warc Format" to "Output filtered data to WARC Format" on Dec 19, 2017
greebie self-assigned this on Dec 19, 2017
ianmilligan1 added the enhancement label on Jan 6, 2018
@dportabella has an example to implement this here: https://gist.github.com/dportabella/3caf261c218a4448a03a14dbc06fe730. The other alternative is the more detailed WARCWriter class from iipc.

This feature has the potential to be dangerous, as there is no real way to test the total size of the request. Take, for example, this pseudocode:
which would save the entire WARC for every ArchiveRecord in record. It would be a juggernaut that would not stop until the server explodes for lack of disk space. I have to admit to being a little lost on the finer details of producing and saving WARC files here, and it's Monday, so I am prone to laziness. Any advice, @ruebot, @lintool and @ianmilligan1?
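One way to address the runaway-output worry is to give the writer a byte budget and refuse to go past it. The following is a minimal sketch under stated assumptions, not AUT or iipc code: it is written in Python rather than AUT's Scala, it treats each filtered record as a plain `(url, payload)` pair instead of an `ArchiveRecord`, and the names `warc_record`, `write_filtered`, `SizeLimitExceeded` and `max_bytes` are hypothetical. It emits WARC/1.0 `resource` records, each compressed as its own gzip member.

```python
import gzip
import io
import uuid
from datetime import datetime, timezone


class SizeLimitExceeded(Exception):
    """Raised when writing the next record would exceed the byte budget."""


def warc_record(url: str, payload: bytes,
                content_type: str = "application/octet-stream") -> bytes:
    """Serialize one WARC/1.0 'resource' record.

    Layout: version line and named fields, a blank line, the payload,
    then two CRLFs that terminate the record.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {url}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def write_filtered(records, out, max_bytes: int) -> int:
    """Write (url, payload) pairs to `out`, one gzip member per record.

    Raises SizeLimitExceeded before writing any record that would push
    the compressed output past `max_bytes`. Returns the record count.
    """
    written = 0
    count = 0
    for url, payload in records:
        member = gzip.compress(warc_record(url, payload))
        if written + len(member) > max_bytes:
            raise SizeLimitExceeded(
                f"stopped after {count} records ({written} bytes written)"
            )
        out.write(member)
        written += len(member)
        count += 1
    return count
```

A caller would open the destination in binary mode, for example `with open("filtered.warc.gz", "wb") as out: write_filtered(pairs, out, max_bytes=10 * 1024**3)`, and decide how to handle `SizeLimitExceeded`. In a Spark job the budget would have to be enforced per partition or checked up front, which this sketch does not attempt.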
The example looks promising, @greebie! I'm not too hung up on the danger, as long as the feature is well documented. But maybe I'm naive. The others may think differently, but my gut is that taking a stab at using @dportabella's example and seeing if it can play with AUT is probably the most fruitful way forward? We can also discuss tomorrow.
It would be nice to also create the CDX index at the same time.
Producing the CDX would be a safe start for testing purposes, actually. Thanks, dportabella!
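The reason an index can be built "at the same time" is that the writer already knows each record's byte offset and compressed length as it writes. Below is a rough sketch of that idea in Python, assuming each record is already serialized to bytes and gzipped as its own member. The function names `write_with_index` and `read_at` are hypothetical, and the `(url, offset, length)` tuples are a simplification, not the real CDX field layout.

```python
import gzip
import io


def write_with_index(records, out):
    """Write (url, serialized_record_bytes) pairs and return an offset index.

    Each record becomes its own gzip member, so it can later be read back
    on its own given only its offset and compressed length.
    """
    index = []
    offset = out.tell()
    for url, raw in records:
        member = gzip.compress(raw)
        out.write(member)
        index.append((url, offset, len(member)))
        offset += len(member)
    return index


def read_at(stream, offset: int, length: int) -> bytes:
    """Fetch and decompress the single record stored at offset/length."""
    stream.seek(offset)
    return gzip.decompress(stream.read(length))
```

With an index like this, a reader can jump straight to one record without decompressing the whole file, which is also a cheap way to spot-check that the written records are intact.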
Backing away from this issue for now until we find someone with a better understanding of the iipc toolkit.
ianmilligan1 added the RA-Task label on Jan 11, 2018
ianmilligan1 unassigned greebie on Aug 3, 2018
ruebot added this to To Do in 1.0.0 Release of AUT on Aug 13, 2018
I think our conversations have largely moved away from the idea of creating new WARC files and toward focusing on derivative datasets. Given this move in the project, could we consider closing this?
I still think that filtering WARC files is an important task that AUT can solve.
Thanks @dportabella! My sense is that our team's time is too limited to make this a short- or medium-term issue for us, but is there any chance you'd be interested in opening a PR based on the example code that @greebie shared above?
I shared a gist for achieving this task (included in @greebie's comment above), and I am currently using this approach.
greebie commented Dec 8, 2017 • edited by ianmilligan1
Users may desire outputs in WARC format after filtering their RDD[ArchiveRecord].