Method to perform finer-grained selection of ARCs and WARCs #247

Open
lintool opened this Issue Jul 30, 2018 · 4 comments

@lintool
Member

lintool commented Jul 30, 2018

We currently only have one method to load ARCs or WARCs:

RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)

It'd be nice to have some fine-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.
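
To make the request concrete, here is a minimal sketch of the "prefix, minus a few exclusions" selection. `ArchiveSelection`, `select`, and `toPathSpec` are hypothetical names, not part of the toolkit. Passing a comma-separated path list to `loadArchives` is an assumption based on how Hadoop input paths commonly behave, and has not been verified here.

```scala
// Hypothetical helper, not an existing toolkit API.
object ArchiveSelection {
  // Extract the file name from a slash-separated path.
  private def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  // Keep paths whose file name starts with `prefix`,
  // dropping any whose file name appears in `exclude`.
  def select(paths: Seq[String], prefix: String, exclude: Set[String]): Seq[String] =
    paths.filter { p =>
      val name = fileName(p)
      name.startsWith(prefix) && !exclude.contains(name)
    }

  // Assumption: the loader accepts a comma-separated list of paths.
  def toPathSpec(paths: Seq[String]): String = paths.mkString(",")
}
```

The resulting string could then be passed in place of the glob, e.g. `RecordLoader.loadArchives(ArchiveSelection.toPathSpec(selected), sc)`, if the loader does accept such a list.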

@borislin

Collaborator

borislin commented Oct 12, 2018

@lintool @ianmilligan1

Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
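
A rough sketch of what a per-record size limit could look like. `SimpleRecord` and `RecordSizeFilter` are hypothetical stand-ins for illustration only; the toolkit's actual record type and any existing size-limiting code are not shown in this thread.

```scala
// Hypothetical stand-in for an archive record; not the toolkit's record type.
final case class SimpleRecord(url: String, content: Array[Byte])

object RecordSizeFilter {
  // Keep only records whose payload is at most `maxBytes` bytes.
  def withinLimit(records: Seq[SimpleRecord], maxBytes: Long): Seq[SimpleRecord] =
    records.filter(_.content.length.toLong <= maxBytes)
}
```

On a Spark RDD the same predicate would go in a `.filter` call, so oversized records are dropped before later stages process them.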

@ianmilligan1

Member

ianmilligan1 commented Jan 11, 2019

Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.

@ruebot

Member

ruebot commented Jan 11, 2019

I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post:

It'd be nice to have some fine-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.

But, I'll leave it up to him to close or keep it open.

@ianmilligan1

Member

ianmilligan1 commented Jan 11, 2019

Sounds good, thanks @lintool @ruebot !
