update for 'src' column #424

Merged
merged 3 commits into from Feb 12, 2020

Conversation

@SinghGursimran
Collaborator

SinghGursimran commented Feb 11, 2020

update for 'src' column

#418

For Testing:

// Run in spark-shell, where `sc` and the `$"..."` column syntax are already available.
import io.archivesunleashed._
import io.archivesunleashed.df._

// Keep rows whose 'src' matches a URL pattern.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlPatternsDF(Set(".*index.*".r))
  .show(10, false)

// Discard rows whose 'src' matches a URL pattern.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlPatternsDF(Set(".*images.*".r))
  .show(10, false)

// Keep rows whose 'src' is one of the given URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Discard rows whose 'src' is one of the given URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Keep rows whose 'src' is in the given domains (this one uses the image graph).
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .imagegraph()
  .select($"src")
  .keepDomainsDF(Set("www.archive.org"))
  .show(10, false)

// Discard rows whose 'src' is in the given domains.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardDomainsDF(Set("www.archive.org"))
  .show(10, false)
g285sing
@codecov


codecov bot commented Feb 11, 2020

Codecov Report

Merging #424 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #424   +/-   ##
=======================================
  Coverage   78.15%   78.15%           
=======================================
  Files          41       41           
  Lines        1584     1584           
  Branches      299      299           
=======================================
  Hits         1238     1238           
  Misses        218      218           
  Partials      128      128
@ruebot

Member

ruebot commented Feb 11, 2020

@SinghGursimran nice! I did a hasColumn function for a similar solution in twut. Can we get a test update too?

@lintool @ianmilligan1 do either of you see a use case for filtering on dest or image_url? Or are src and url good enough here? If we add dest or image_url, we'd probably need to change the implementation to pass the column name as well.
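
For illustration only, here is a rough sketch of the two ideas mentioned above: a `hasColumn`-style check, and a filter that takes the column name as a parameter. It is written against the plain Spark DataFrame API and is not the code from this PR or from twut. The object name `ColumnFilterSketch`, the helpers `hasColumn`, `defaultUrlColumn` and `keepUrls`, and the "use `url` if present, otherwise `src`" fallback are all assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnFilterSketch {
  // Hypothetical helper: true if the DataFrame has a top-level column with this name.
  def hasColumn(df: DataFrame, name: String): Boolean =
    df.columns.contains(name)

  // Assumed fallback rule: filter on "url" when it exists, otherwise on "src".
  def defaultUrlColumn(df: DataFrame): String =
    if (hasColumn(df, "url")) "url" else "src"

  // Hypothetical filter: keep rows whose value in `column` is one of `urls`.
  // Passing the column name explicitly is what would allow "dest" or "image_url".
  def keepUrls(df: DataFrame, urls: Set[String], column: String): DataFrame = {
    require(hasColumn(df, column), s"no column named '$column'")
    df.filter(col(column).isin(urls.toSeq: _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up link data with the same column names as a web graph.
    val links = Seq(
      ("http://www.archive.org/", "http://www.archive.org/index.php"),
      ("http://example.com/", "http://www.archive.org/")
    ).toDF("src", "dest")

    // No "url" column here, so the default falls back to "src".
    keepUrls(links, Set("http://www.archive.org/"), defaultUrlColumn(links)).show(false)

    // The same filter, applied to "dest" by naming the column.
    keepUrls(links, Set("http://www.archive.org/"), "dest").show(false)

    spark.stop()
  }
}
```

With an explicit column argument, the existing `src`/`url` behaviour could remain the default while `dest` or `image_url` become opt-in. Whether that is the right shape for the API is the open question in this thread.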

@ruebot ruebot added this to In review in DataFrames and PySpark Feb 11, 2020
@ruebot

Member

ruebot commented Feb 11, 2020

@SinghGursimran based on the new related issue (#425) let's just worry about getting the tests updated here, and don't worry about dest and image_url for now since the implementation of #425 would resolve that.

I'm running this right now on the entire GeoCities dataset for another project, and everything appears to be running smoothly 🙌

g285sing and others added 2 commits Feb 12, 2020
g285sing
@ruebot
ruebot approved these changes Feb 12, 2020
@ruebot ruebot merged commit ebb5298 into archivesunleashed:master Feb 12, 2020
3 checks passed
codecov/patch 100% of diff hit (target 78.15%)
codecov/project 78.15% (+0%) compared to c7687e8
continuous-integration/travis-ci/pr The Travis CI build passed