feature request: ArchiveRecord.archiveFile #164
Comments
Thanks as always (we're getting a bit of a backlog of issues as we're stretched in many different directions, so don't take lack of work as lack of interest!). Just so I am clear, the idea would be that you'd have a command like:
Where |
Hi, I am not sure I understand your question. Here is an example:
|
OK, thanks for this. We are quite swamped right now, but if you have a cycle, we always enthusiastically look for pull requests too.
I posted the feature request here, but I am not sure that it's useful for other people.
For the previous example, this would create the files:
which is actually even better for my needs.
dportabella commented Jan 22, 2018
I am querying the CommonCrawl archive, which is divided into hundreds of warc.gz files. I use RecordLoader.loadArchives to read all the warc files at once. Sometimes the log contains an exception thrown while processing a page, and I need to find out which of the individual warc.gz files it comes from (so that I can re-run the program on that file only).
Would it be possible for the ArchiveRecord class to also have a field with the input archive name? (With that, I could catch exceptions and show not only the URL but also the input archive file.)
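
To make the request concrete, here is a minimal sketch of the idea in Python. It is not the toolkit's API: `Record`, `archive_file`, `load_archives`, and `process_all` are hypothetical names invented for illustration, and the archives are plain in-memory lists rather than real warc.gz files. It only shows how carrying the source archive name on each record lets an exception handler report which file to re-run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for an archive record (not the toolkit's class)."""
    url: str
    content: str
    archive_file: str  # the proposed extra field: name of the input archive


def load_archives(archives: Dict[str, List[Tuple[str, str]]]) -> Iterator[Record]:
    """Yield records from several archives, tagging each with its source name.

    `archives` maps an archive file name to (url, content) pairs; real code
    would read the warc.gz files instead of an in-memory dict.
    """
    for archive_file, pages in archives.items():
        for url, content in pages:
            yield Record(url=url, content=content, archive_file=archive_file)


def process_all(
    records: Iterable[Record], process: Callable[[Record], None]
) -> List[Tuple[str, str, str]]:
    """Run `process` on every record; collect (archive_file, url, error) on failure."""
    failures = []
    for record in records:
        try:
            process(record)
        except Exception as exc:
            failures.append((record.archive_file, record.url, str(exc)))
    return failures


# Illustrative data only: two made-up archive names with one page each.
archives = {
    "part-00001.warc.gz": [("http://a.example/", "some text")],
    "part-00002.warc.gz": [("http://b.example/", "")],
}


def process(record: Record) -> None:
    """Toy processing step that fails on empty pages."""
    if not record.content:
        raise ValueError("empty page")


failures = process_all(load_archives(archives), process)
for archive_file, url, error in failures:
    print(f"{archive_file}: {url}: {error}")
```

With the archive name attached to each record, the failure list identifies the single input file to re-run instead of the whole collection.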