Finding Hyperlinks within Collection on Pages with Certain Keyword #377

Merged
merged 5 commits into archivesunleashed:master on Nov 12, 2019

Conversation

@SinghGursimran
Contributor

SinghGursimran commented Nov 12, 2019

Extracts hyperlinks from a collection, filtered to pages whose content contains a particular keyword (case-insensitive), using the DataFrame API.

import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions._ // udf, array, explode_outer, lower

// Keeps the destination URL from the stringified (source, destination, anchor text) link tuple.
val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),
    $"url".as("url"),
    $"crawl_date",
    explode_outer(ExtractLinks($"url", $"content")).as("link")
  )
  .filter(lower($"content").contains("internet")) // filtered on keyword "internet"

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .write
  .option("header", "true")
  .csv("filtered_results/")

#238

Writes CSV output to filtered_results/ with url, Domain, crawl_date, and destination_page columns.
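As an aside, the result UDF pulls out the destination by splitting the link struct's string form on commas. A minimal alternative sketch, assuming ExtractLinks yields (source, target, anchor text) tuples that Spark exposes as a struct with fields _1, _2, and _3, would read the field directly:

// Hypothetical alternative: read the struct field instead of string-splitting.
// Assumes the exploded "link" column is a struct<_1: string, _2: string, _3: string>.
df.select($"url", $"Domain", $"crawl_date", $"link._2".as("destination_page"))
  .write
  .option("header", "true")
  .csv("filtered_results/")

This also avoids mis-splitting when a URL happens to contain a comma.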

@codecov

codecov bot commented Nov 12, 2019

Codecov Report

Merging #377 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
+ Coverage   76.36%   76.37%   +0.01%     
==========================================
  Files          40       40              
  Lines        1413     1414       +1     
  Branches      268      268              
==========================================
+ Hits         1079     1080       +1     
  Misses        217      217              
  Partials      117      117

@ruebot
Member

ruebot commented Nov 12, 2019

Looks good!

import io.archivesunleashed._
import io.archivesunleashed.df._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/*gz", sc)
  .extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"), $"url".as("url"), $"crawl_date", explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter($"content".contains("keystone"))

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .show()

// Exiting paste mode, now interpreting.

+--------------------+---------------+----------+--------------------+          
|                 url|         Domain|crawl_date|    destination_page|
+--------------------+---------------+----------+--------------------+
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
+--------------------+---------------+----------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.df._
result: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,None)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Domain: string, url: string ... 2 more fields]

^^^ @ianmilligan1 is that what you're looking for? If so, I'll squash and merge, and add this to the cookbook section if that makes sense.

@ruebot approved these changes Nov 12, 2019
@lintool
Member

lintool commented Nov 12, 2019

hey @ruebot, is this something that should be encoded in a test case while we're at it?

@ruebot
Member

ruebot commented Nov 12, 2019

@lintool we already have a test case for ExtractLinks, unless you're thinking of something else.

@lintool
Member

lintool commented Nov 12, 2019

@ruebot add a separate test case that explicitly includes filtering? Maybe not. I dunno.

@ruebot
Member

ruebot commented Nov 12, 2019

Yeah, we could add a new test that adds a filter, or just add a filter to https://github.com/archivesunleashed/aut/blob/e32ae17c55740c6c6adc177d42690b39fa6321dd/src/test/scala/io/archivesunleashed/matchbox/ExtractLinksTest.scala if that's what you're thinking.

...and if that's the case, @SinghGursimran, want to update the PR with an updated test?
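
For instance, a rough sketch of such a check, written in the spark-shell style used above (the keyword, path, and assertion here are illustrative, not the test that ended up being merged):

import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{explode_outer, lower}

// Keep only pages whose content mentions the keyword, then extract their outgoing links.
val filteredLinks = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc)
  .extractValidPagesDF()
  .filter(lower($"content").contains("internet"))
  .select($"url", explode_outer(ExtractLinks($"url", $"content")).as("link"))

// The example WARC has pages containing "internet" (see the PR description), so some links should survive the filter.
assert(filteredLinks.count() > 0)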

@ianmilligan1
Member

ianmilligan1 commented Nov 12, 2019

@ruebot Looks perfect to me - and I love the sample results with "keystone" leading to David Suzuki. Thanks @SinghGursimran, great work.

@SinghGursimran
Contributor Author

SinghGursimran commented Nov 12, 2019

> Yeah, we could add a new test that adds a filter, or just add a filter to https://github.com/archivesunleashed/aut/blob/e32ae17c55740c6c6adc177d42690b39fa6321dd/src/test/scala/io/archivesunleashed/matchbox/ExtractLinksTest.scala if that's what you're thinking.
>
> ...and if that's the case, @SinghGursimran, want to update the PR with an updated test?

Should I add a separate test for the ExtractLinks UDF with a filter?

@ruebot
Member

ruebot commented Nov 12, 2019

@SinghGursimran yeah, why not. Let's go with that.

@ruebot merged commit c353dae into archivesunleashed:master Nov 12, 2019
3 checks passed
codecov/patch 100% of diff hit (target 76.36%)
codecov/project 76.37% (+0.01%) compared to 107def2
continuous-integration/travis-ci/pr The Travis CI build passed
ruebot added a commit to archivesunleashed/aut-docs-new that referenced this pull request Nov 12, 2019
ianmilligan1 added a commit to archivesunleashed/aut-docs-new that referenced this pull request Nov 12, 2019