Update filters documentation for https://github.com/archivesunleashed… #33

Merged: ianmilligan1 merged 2 commits into master from aut-pr-391 on Dec 18, 2019

ruebot commented Dec 18, 2019

…/aut/pull/391

  • Add ToC
  • Add Scala RDD, Scala DF, and Python DF sections
ruebot requested a review from ianmilligan1 on Dec 18, 2019
ianmilligan1 left a comment

Had some failures when running through the new additions. I think we're just missing .all() or .webpages() after the RecordLoaders and then it should be good to go.

import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz",sc)
.discardMimeTypesDF(Set("text/html"))

ianmilligan1 Dec 18, 2019

This fails on the most recent build with

<pastie>:36: error: value discardMimeTypesDF is not a member of org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord]
possible cause: maybe a semicolon is missing before `value discardMimeTypesDF'?
          .discardMimeTypesDF(Set("text/html"))
           ^

Can be fixed by adding .all() or .webpages() after the RecordLoader call.
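
A minimal sketch of the suggested fix for the snippet above, not the final documentation text. It assumes a spark-shell session started with the aut jar (so `sc` is already defined) and that `.all()` yields a DataFrame that the `df` implicits can filter, as the comment indicates. The `import io.archivesunleashed._` line is an assumption; the snippet under review only shows the `df` import.

```scala
import io.archivesunleashed._    // assumed: brings RecordLoader into scope
import io.archivesunleashed.df._ // DataFrame filters such as discardMimeTypesDF

// Enter with :paste in spark-shell so the leading-dot lines parse as one chain.
// .all() converts the RDD[ArchiveRecord] to a DataFrame before the DF filter runs;
// per the review comment, .webpages() could be used in its place.
RecordLoader.loadArchives("example.warc.gz", sc)
  .all()
  .discardMimeTypesDF(Set("text/html"))
```

The same change would presumably apply to the `discardDateDF`, `discardUrlsDF`, and `discardDomainsDF` snippets flagged with the same error.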

import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz",sc)
.discardDateDF("20080429")

ianmilligan1 Dec 18, 2019

same error here as above

import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz",sc)
.discardUrlsDF(Set("http://www.archive.org/"))

ianmilligan1 Dec 18, 2019

same error here

import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz",sc)
.discardDomainsDF(Set("www.archive.org"))

ianmilligan1 Dec 18, 2019

same error

ianmilligan1 merged commit 19d8607 into master on Dec 18, 2019
ianmilligan1 deleted the aut-pr-391 branch on Dec 18, 2019