Finding Hyperlinks within Collection on Pages with Certain Keyword #377

SinghGursimran · 2019-11-12T05:36:16Z

Extract hyperlinks within a collection, filtered to pages that contain a particular keyword (case-insensitive), using the DataFrame API.

import io.archivesunleashed._
import io.archivesunleashed.df._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),
          $"url".as("url"),
          $"crawl_date",
          explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter(lower($"content").contains("internet")) // filter on the keyword "internet"

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .write
  .option("header", "true")
  .csv("filtered_results/")

#238

Writes CSV output to filtered_results/ with the columns url, Domain, crawl_date, and destination_page.
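
As a rough illustration of the two string operations the snippet relies on, here is a minimal plain-Scala sketch that runs without Spark. The object name `KeywordLinkSketch`, the helper `containsKeyword`, the sample URLs, and the bracketed `[source,destination,anchor text]` rendering of a link are all assumptions made for this example; only the body of `destinationPage` is taken from the `result` UDF above.

```scala
object KeywordLinkSketch {
  // Case-insensitive keyword check, mirroring lower($"content").contains("internet").
  // Lowercasing the keyword as well is an addition in this sketch; the PR passes
  // an already-lowercase literal.
  def containsKeyword(content: String, keyword: String): Boolean =
    content.toLowerCase.contains(keyword.toLowerCase)

  // Same body as the `result` UDF: stringify the first element, split on commas,
  // and take the second field.
  def destinationPage(vs: Seq[Any]): String =
    vs(0).toString.split(",")(1)

  def main(args: Array[String]): Unit = {
    // Assumed rendering of one link (hypothetical values).
    val link = "[http://example.com/a,http://example.org/b,anchor text]"
    println(containsKeyword("The INTERNET archive", "internet")) // true
    println(destinationPage(Seq(link)))                          // http://example.org/b

    // A comma inside the source URL shifts the fields, so the split picks the wrong piece.
    val shifted = "[http://example.com/a,1,http://example.org/b,anchor text]"
    println(destinationPage(Seq(shifted)))                       // 1
  }
}
```

The last case suggests that splitting on commas is only reliable when the source URL itself contains no comma; selecting the destination field directly would avoid this, but that depends on the schema `ExtractLinks` returns, which is not shown here.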


codecov · 2019-11-12T05:53:53Z

Codecov Report

Merging #377 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
+ Coverage   76.36%   76.37%   +0.01%     
==========================================
  Files          40       40              
  Lines        1413     1414       +1     
  Branches      268      268              
==========================================
+ Hits         1079     1080       +1     
  Misses        217      217              
  Partials      117      117

ruebot · 2019-11-12T14:02:19Z

Looks good!

import io.archivesunleashed._
import io.archivesunleashed.df._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df= RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/*gz", sc).extractValidPagesDF()
.select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
.filter($"content".contains("keystone"))

df.select($"url",$"Domain",$"crawl_date",result(array($"link")).as("destination_page"))
.show()

// Exiting paste mode, now interpreting.

+--------------------+---------------+----------+--------------------+          
|                 url|         Domain|crawl_date|    destination_page|
+--------------------+---------------+----------+--------------------+
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
|http://www.davids...|davidsuzuki.org|  20091219|http://www.davids...|
+--------------------+---------------+----------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.df._
result: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,None)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Domain: string, url: string ... 2 more fields]

^^^ @ianmilligan1 that what you're looking for? If so, I'll squash and merge, and add this to the cookbook section if that makes sense.

lintool · 2019-11-12T14:35:44Z

hey @ruebot is this something that should be encoded in a test case while we're at it?

ruebot · 2019-11-12T14:43:41Z

@lintool we already have a test case for ExtractLinks, unless you're thinking of something else.

lintool · 2019-11-12T14:45:20Z

@ruebot add a separate test case that explicitly includes filtering? Maybe not. I dunno.

ruebot · 2019-11-12T14:48:14Z

Yeah, we could add a new test that adds a filter, or just add a filter to https://github.com/archivesunleashed/aut/blob/e32ae17c55740c6c6adc177d42690b39fa6321dd/src/test/scala/io/archivesunleashed/matchbox/ExtractLinksTest.scala if that's what you're thinking.

...and if that's the case, @SinghGursimran, want to update the PR with an updated test?

ianmilligan1 · 2019-11-12T15:10:21Z

@ruebot Looks perfect to me - and I love the sample results with "keystone" leading to David Suzuki. Thanks @SinghGursimran, great work.

SinghGursimran · 2019-11-12T17:12:39Z

> Yeah, we could add a new test that adds a filter, or just add a filter to https://github.com/archivesunleashed/aut/blob/e32ae17c55740c6c6adc177d42690b39fa6321dd/src/test/scala/io/archivesunleashed/matchbox/ExtractLinksTest.scala if that's what you're thinking.
>
> ...and if that's the case, @SinghGursimran, want to update the PR with an updated test?

Should I add a separate test for the ExtractLinks UDF with a filter?

ruebot · 2019-11-12T17:31:47Z

@SinghGursimran yeah, why not. Let's go with that.


        Add example for archivesunleashed/aut#377 / archivesunleashed/aut#238

g285sing added 4 commits Nov 7, 2019

Issue-368 (68922a2)

Issue238 (0db0093)

Issue238 (5817bf5)

Merge branch 'master' of https://github.com/SinghGursimran/aut (d65cb41)

ruebot approved these changes Nov 12, 2019

ianmilligan1 approved these changes Nov 12, 2019

SinghGursimran closed this Nov 12, 2019

SinghGursimran reopened this Nov 12, 2019

test (4e5a066)
