Enhance keepValidPages #359

Closed
ruebot opened this issue Sep 11, 2019 · 0 comments · Fixed by #360

Comments

@ruebot
Member

commented Sep 11, 2019

We should add keepHttpStatus and keepMimeTypesTika filters to the keepValidPages helper, which currently reads:

def keepValidPages(): RDD[ArchiveRecord] = {
  rdd.filter(r =>
    r.getCrawlDate != null
      && (r.getMimeType == "text/html"
        || r.getMimeType == "application/xhtml+xml"
        || r.getUrl.toLowerCase.endsWith("htm")
        || r.getUrl.toLowerCase.endsWith("html"))
      && !r.getUrl.toLowerCase.endsWith("robots.txt"))
}
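
For illustration, here is a minimal sketch of how a 200 OK check could sit alongside the existing conditions. It is not the change that landed in #360. `PageRecord`, its `httpStatus` field (assumed to be a string), and `KeepValidPagesSketch` are hypothetical stand-ins so the example runs without Spark or the toolkit's `ArchiveRecord`. Tika-based MIME type detection is not modeled here.

```scala
// Hypothetical stand-in for ArchiveRecord; it carries only the fields this filter reads.
final case class PageRecord(
  crawlDate: String,
  mimeType: String,
  url: String,
  httpStatus: String // assumption: the status code is exposed as a string such as "200"
)

object KeepValidPagesSketch {
  private val htmlMimeTypes = Set("text/html", "application/xhtml+xml")

  // Same conditions as the helper above, plus a check that the response was 200 OK.
  def isValidPage(r: PageRecord): Boolean = {
    val url = r.url.toLowerCase
    r.crawlDate != null &&
      (htmlMimeTypes.contains(r.mimeType) || url.endsWith("htm") || url.endsWith("html")) &&
      !url.endsWith("robots.txt") &&
      r.httpStatus == "200"
  }

  // Works on a plain Seq here; with Spark the same predicate could be passed to rdd.filter.
  def keepValidPages(records: Seq[PageRecord]): Seq[PageRecord] =
    records.filter(isValidPage)
}
```

Keeping the predicate separate from the collection it runs over makes it straightforward to unit test each condition without a Spark context.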

@ruebot ruebot self-assigned this Sep 11, 2019

ruebot added a commit that referenced this issue Sep 11, 2019
Update keepValidPages to include a filter on 200 OK.
- Add status code filter to keepValidPages
- Add MimeTypeTika to valid pages DF
- Update tests since we filter more and better now 😄
- Resolves #359
ianmilligan1 added a commit that referenced this issue Sep 11, 2019
Update keepValidPages to include a filter on 200 OK. (#360)
- Add status code filter to keepValidPages
- Add MimeTypeTika to valid pages DF
- Update tests since we filter more and better now 😄
- Resolves #359