Method to perform finer-grained selection of ARCs and WARCs #247

lintool · Jul 30, 2018

We currently only have one method to load ARCs or WARCs:

RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)

It'd be nice to have some fine-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.
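One way to picture the "prefix, except for A, B, and C" selection is a small path filter applied before loading. The sketch below is only an illustration: `ArchiveSelection` and `selectArchives` are hypothetical names, not part of AUT, and passing the result to `RecordLoader.loadArchives` as a comma-separated string is an untested assumption about how the loader treats its path argument.

```scala
// Hypothetical helper, not part of AUT: choose archive paths by file-name
// prefix while skipping an explicit set of file names.
object ArchiveSelection {
  def selectArchives(paths: Seq[String], prefix: String, exclude: Set[String]): Seq[String] =
    paths.filter { p =>
      // Compare against the file name only, ignoring the directory part.
      val name = p.substring(p.lastIndexOf('/') + 1)
      name.startsWith(prefix) && !exclude.contains(name)
    }
}

// Possible use in a Spark shell (assumption: the loader accepts a
// comma-separated list of paths; this has not been verified against AUT):
//   val wanted = ArchiveSelection.selectArchives(allPaths, "PREFIX-", Set("A.warc.gz", "B.warc.gz", "C.warc.gz"))
//   RecordLoader.loadArchives(wanted.mkString(","), sc)
```

Filtering on the file name rather than the full path keeps the exclusion list short when every archive lives in the same directory.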

borislin · Oct 12, 2018

@lintool @ianmilligan1

Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
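The record-size idea can be sketched as a plain predicate over records. This is a minimal illustration under stated assumptions, not the change made in #275: `SimpleRecord`, `RecordSizeFilter`, and `withinLimit` are hypothetical stand-ins, and a real fix would likely need to check the size before the record content is read into memory.

```scala
// Hypothetical stand-in for an archive record; AUT's real record type differs.
object RecordSizeFilter {
  final case class SimpleRecord(url: String, content: Array[Byte])

  // Keep only records whose content is at most maxBytes long.
  def withinLimit(records: Seq[SimpleRecord], maxBytes: Int): Seq[SimpleRecord] =
    records.filter(_.content.length <= maxBytes)
}

// The same predicate could be applied to an RDD with `filter`, but by then the
// oversized content has already been loaded, so this only illustrates the rule.
```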

ianmilligan1 · Jan 11, 2019

Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.

ruebot · Jan 11, 2019

I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post:

It'd be nice to have some fine-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.

But I'll leave it up to him to close it or keep it open.

ianmilligan1 · Jan 11, 2019

Sounds good, thanks @lintool @ruebot !

lintool assigned borislin Jul 30, 2018

borislin referenced this issue Aug 12, 2018
Refactor loadArchives() function #257 (Closed)

ruebot added enhancement RA-Task labels Aug 14, 2018

ruebot added this to In Progress in 1.0.0 Release of AUT Aug 14, 2018

ruebot added the in progress label Aug 20, 2018

borislin referenced this issue Oct 12, 2018
Refactor loadArchives() function to limit size of individual record #275 (Closed)

ianmilligan1 unassigned borislin Jan 11, 2019

archivesunleashed/aut
