Method to perform finer-grained selection of ARCs and WARCs #247
Comments
lintool assigned borislin on Jul 30, 2018
ruebot added the enhancement and RA-Task labels on Aug 14, 2018
ruebot added this to In Progress in 1.0.0 Release of AUT on Aug 14, 2018
ruebot added the in progress label on Aug 20, 2018
Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
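The idea above could look something like the following sketch. This is not AUT code: the `filter_oversized` helper and the 10 MB threshold are illustrative assumptions, shown only to make the "skip records over a size cap" approach concrete.

```python
# Hypothetical sketch: drop records above a size threshold before processing,
# so that a single oversized record cannot fail the whole Spark job.
# MAX_RECORD_BYTES is an assumed cap, not an AUT default.

MAX_RECORD_BYTES = 10 * 1024 * 1024  # 10 MB

def filter_oversized(records, max_bytes=MAX_RECORD_BYTES):
    """Yield only records whose payload fits under max_bytes."""
    for record in records:
        if len(record) <= max_bytes:
            yield record

small = b"x" * 100
huge = b"y" * (MAX_RECORD_BYTES + 1)
kept = list(filter_oversized([small, huge, small]))
# the oversized record is dropped; the two small records survive
```

In a real loader this check would happen while streaming records out of the (W)ARC, before they are materialized into an RDD.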
borislin referenced this issue on Oct 12, 2018: Refactor loadArchives() function to limit size of individual record #275 (Closed)
ianmilligan1 unassigned borislin on Jan 11, 2019
Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.
I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post.
But I'll leave it up to him to close or keep it open.
lintool commented on Jul 30, 2018
We currently only have one method to load ARCs or WARCs:
It'd be nice to have some finer-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may have errors, corruption, etc.
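A minimal sketch of the selection the post asks for, assuming the caller has a plain list of archive paths to filter before loading. The `select_archives` helper and the sample filenames are hypothetical, not part of the AUT API.

```python
# Hypothetical sketch of finer-grained (W)ARC selection: keep every file whose
# basename starts with a given prefix, except an explicit exclusion list.

def select_archives(paths, prefix, exclude=()):
    """Return paths whose basename starts with prefix and is not excluded."""
    excluded = set(exclude)
    result = []
    for p in paths:
        name = p.rsplit("/", 1)[-1]  # basename of the archive file
        if name.startswith(prefix) and name not in excluded:
            result.append(p)
    return result

paths = [
    "/data/CRAWL-2018-001.warc.gz",
    "/data/CRAWL-2018-002.warc.gz",
    "/data/CRAWL-2017-990.warc.gz",
]
chosen = select_archives(paths, "CRAWL-2018", exclude=["CRAWL-2018-002.warc.gz"])
# chosen -> ["/data/CRAWL-2018-001.warc.gz"]
```

The filtered list could then be joined (e.g., comma-separated) into whatever path expression the archive loader accepts, which is how this would slot in front of an existing bulk-loading call.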