API implementations for DataFrame #391

Merged: 15 commits into archivesunleashed:master, Dec 17, 2019

SinghGursimran (Contributor) commented Dec 17, 2019

API implementations for DataFrame
#223

For Testing:

DiscardMimeTypes:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .all()
  .discardMimeTypesDF(Set("text/html"))
  .select($"mime_type_web_server")
  .show(10, false)

DiscardDate:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .discardDateDF("20080429")
  .select($"crawl_date")
  .show(10, false)

DiscardUrls:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .discardUrlsDF(Set("http://www.archive.org/"))
  .select($"url")
  .show(10, false)

DiscardDomains:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .discardDomainsDF(Set("www.archive.org"))
  .select(ExtractDomainDF($"url"))
  .show(10, false)
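
For readers who want to reason about what the four snippets above should print, here is a rough sketch of the discard semantics using plain Scala collections. It is not the code from this PR and does not use aut or Spark. The `Record` case class, the `DiscardSketch` object, and its method names are hypothetical stand-ins. The sketch also assumes that each filter drops rows on an exact match, and that the domain is simply the URL's host as returned by `java.net.URI`.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical stand-in for one DataFrame row; not an aut type.
final case class Record(url: String, mimeType: String, crawlDate: String)

object DiscardSketch {
  // Host part of a URL, or "" when the URL cannot be parsed or has no host.
  // Assumption: this approximates what ExtractDomainDF yields for simple URLs.
  def domainOf(url: String): String =
    Try(Option(new URI(url).getHost)).toOption.flatten.getOrElse("")

  // Drop records whose MIME type is in the given set.
  def discardMimeTypes(rs: Seq[Record], mimeTypes: Set[String]): Seq[Record] =
    rs.filterNot(r => mimeTypes.contains(r.mimeType))

  // Drop records whose crawl date equals the given string (exact match assumed).
  def discardDate(rs: Seq[Record], date: String): Seq[Record] =
    rs.filterNot(_.crawlDate == date)

  // Drop records whose URL is in the given set.
  def discardUrls(rs: Seq[Record], urls: Set[String]): Seq[Record] =
    rs.filterNot(r => urls.contains(r.url))

  // Drop records whose host is in the given set.
  def discardDomains(rs: Seq[Record], domains: Set[String]): Seq[Record] =
    rs.filterNot(r => domains.contains(domainOf(r.url)))
}
```

Under these assumptions, each test snippet above should show no row matching the value passed to its `discard*DF` call. For example, the DiscardMimeTypes output should contain no `text/html` entries.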

codecov bot commented Dec 17, 2019

Codecov Report

Merging #391 into master will increase coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
+ Coverage   77.03%   77.15%   +0.12%     
==========================================
  Files          40       40              
  Lines        1476     1484       +8     
  Branches      274      278       +4     
==========================================
+ Hits         1137     1145       +8     
  Misses        217      217              
  Partials      122      122

ruebot (Member) left a comment

Small change, but other than that tested and all good.

src/main/scala/io/archivesunleashed/package.scala (outdated)
g285sing
@ruebot
ruebot approved these changes Dec 17, 2019
@ruebot ruebot merged commit 40a59de into archivesunleashed:master Dec 17, 2019
3 checks passed
codecov/patch: 100% of diff hit (target 77.03%)
codecov/project: 77.15% (+0.12%) compared to ca928d8
continuous-integration/travis-ci/pr: The Travis CI build passed
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Dec 18, 2019
- Add ToC
- Add Scala RDD, Scala DF, and Python DF sections

ruebot (Member) commented Dec 18, 2019

Documentation PR: archivesunleashed/aut-docs#32

ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this pull request Dec 18, 2019
#33)

* Update filters documentation for archivesunleashed/aut#391

- Add ToC
- Add Scala RDD, Scala DF, and Python DF sections

* review