Dataframe Code Request: Finding Hyperlinks within Collection on Pages with Certain Keyword #238

ianmilligan1 · 2018-05-24T20:37:47Z

Use Case

I am interested in finding what pages particular organizations link to when they contain certain keywords.

In this case, I am curious to see where the Liberal Party of Canada (liberal.ca) has linked to from pages that contain the word "Keystone". Do they link to news sources? Do they link to activist groups? Do they link to opposition parties? Let's find out.

Data

In this case we need to do the following:

Find all pages within a collection of WARCs that contain the keyword Keystone. Let's make it case insensitive.
Find all hyperlinks that are leaving the pages that contain the keyword Keystone.
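The two steps above can be sketched with plain Spark, as a rough illustration only. The in-memory pages, the column names, and the regex-based `hrefs` helper are assumptions standing in for real WARC records and a real link extractor; nothing here is the aut API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, lower, udf}

object KeywordLinksSketch {
  // Hypothetical stand-in for a real link extractor: collects href values with a regex.
  def hrefs(html: String): Seq[String] =
    "href=\"([^\"]+)\"".r.findAllMatchIn(html).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keyword-links-sketch")
      .getOrCreate()
    import spark.implicits._

    val hrefsUdf = udf(hrefs _)

    // Toy pages in place of records loaded from WARCs.
    val pages = Seq(
      ("20180524", "http://liberal.ca/a",
        "<p>KEYSTONE pipeline</p><a href=\"http://example.com/news\">news</a>"),
      ("20180524", "http://liberal.ca/b",
        "<p>Unrelated</p><a href=\"http://example.org/\">other</a>")
    ).toDF("crawl_date", "url", "content")

    val links = pages
      // Step 1: keep pages containing the keyword, case-insensitively.
      .filter(lower($"content").contains("keystone"))
      // Step 2: one output row per outgoing link on the remaining pages.
      .select(
        $"crawl_date",
        $"url".as("origin_page"),
        explode(hrefsUdf($"content")).as("destination_page"))

    links.show(false)
    spark.stop()
  }
}
```

Lower-casing the content before matching is one simple way to get the case-insensitive behaviour asked for; filtering before exploding also keeps the link extraction from running on pages that will be discarded anyway.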

Desired Output

The desired output would be a CSV looking like:

domain, URL, crawl date, origin page, destination page

Question

What data frame query can we use to first filter down to pages containing a certain keyword, and then extract just links from them? Hopefully this is clear but do let me know if I can clarify further.

ruebot · 2019-08-17T01:38:12Z

@ianmilligan1 do you still need help on this one?

ianmilligan1 · 2019-08-17T02:31:59Z

I think the problem here is that we don’t have text as part of our dataframe implementation yet.


Various UDF implementation and cleanup for DF. (#370)

- Replace ExtractBaseDomain with ExtractDomain
- Closes #367
- Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
- Closes #369
- Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
- Partially addresses #238
- Updates tests where necessary
- Punts on #368 UDF CaMeL cASe consistency issues

ruebot · 2019-11-08T22:37:58Z

@SinghGursimran here's one for you.

SinghGursimran · 2019-11-09T09:14:27Z

import io.archivesunleashed._
import io.archivesunleashed.df._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"), $"url".as("url"), $"crawl_date", explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter($"content".contains("keyNote"))

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .write
  .option("header", "true")
  .csv("filtered_results/")

I have written this script to answer the above query. It works, but it could be done more cleanly if I added a dedicated function to the codebase for this query. I am not sure that is a good idea, though, since there could be many such one-off queries. @ruebot @lintool @ianmilligan1
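One possible simplification that needs no dedicated function: the `result` UDF in the script above recovers the destination by splitting the string form of each link on commas, which would pick the wrong piece if the source URL itself contained a comma. Assuming the link column holds (source, destination, anchor text) tuples, which this thread does not confirm, the destination could be read as a struct field instead. The `fakeLinks` helper and the toy data below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode_outer, udf}

object LinkFieldSketch {
  // Hypothetical extractor returning (source, destination, anchor text) tuples.
  // The destination deliberately contains a comma.
  def fakeLinks(src: String, html: String): Seq[(String, String, String)] =
    List((src, "http://example.com/a,b", "anchor"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("link-field-sketch")
      .getOrCreate()
    import spark.implicits._

    val fakeLinksUdf = udf(fakeLinks _)

    val df = Seq(("http://liberal.ca/", "<html/>")).toDF("url", "content")
      .select($"url", explode_outer(fakeLinksUdf($"url", $"content")).as("link"))

    // Tuples become structs with fields _1, _2, _3, so the destination
    // is available directly, with no string splitting involved.
    df.select($"url", $"link._2".as("destination_page")).show(false)

    spark.stop()
  }
}
```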

ruebot · 2019-11-11T13:17:27Z

We have to import matchbox on that for ExtractLinks, and then we hit:

import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.matchbox._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df= RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/example.arc.gz", sc).extractValidPagesDF()
.select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
.filter($"content".contains("keystone"))

df.select($"url",$"Domain",$"crawl_date",result(array($"link")).as("destination_page"))
.show()

// Exiting paste mode, now interpreting.

<console>:31: error: reference to ExtractDomain is ambiguous;
it is imported twice in the same scope by
import io.archivesunleashed.matchbox._
and import io.archivesunleashed.df._
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                               ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                                                                                                                             ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
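The first error comes from two wildcard imports that both bring in a name called ExtractDomain. A generic Scala way around that kind of clash is to import only the needed member from one package, optionally under an alias. The sketch below uses two hypothetical objects, `dfLike` and `matchboxLike`, in place of the real packages, so it only illustrates the import mechanism and not the aut API.

```scala
// Hypothetical stand-ins for two packages that export the same name.
object dfLike {
  val ExtractDomain: String => String = url => "df:" + url
}

object matchboxLike {
  val ExtractDomain: String => String = url => "matchbox:" + url
  val ExtractLinks: (String, String) => Seq[String] = (src, html) => List(src)
}

object ImportSketch {
  import dfLike._
  // Import just the one member that is needed, under an alias,
  // rather than matchboxLike._ which would clash on ExtractDomain.
  import matchboxLike.{ExtractLinks => RawExtractLinks}

  def main(args: Array[String]): Unit = {
    // Unambiguous: only dfLike.ExtractDomain is in scope.
    println(ExtractDomain("http://liberal.ca/"))
    println(RawExtractLinks("http://liberal.ca/", "<html/>"))
  }
}
```

This would not address the two type mismatches, which arise because a function expecting Strings is being handed columns.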

SinghGursimran · 2019-11-11T13:23:22Z

I had added another UDF in aut/src/main/scala/io/archivesunleashed/df/package.scala and forgot to mention it here:

val ExtractLinks = udf(io.archivesunleashed.matchbox.ExtractLinks.apply(_:String,_:String))

Sorry for the confusion. I will create a pull request.
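For context on why that wrapper removes the type mismatches: a function that takes Strings cannot be applied to columns directly, but wrapping it with `udf` gives something that can. A minimal sketch of the same pattern follows, using a hypothetical `PlainExtractLinks` object rather than the real matchbox one; its signature, including the defaulted third parameter, is an assumption made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical stand-in for an object whose apply takes plain Strings.
object PlainExtractLinks {
  def apply(src: String, html: String, base: String = ""): Seq[(String, String, String)] =
    "href=\"([^\"]+)\"".r.findAllMatchIn(html).map(m => (src, m.group(1), "")).toList
}

object UdfWrapSketch {
  // The placeholder syntax fixes the arity at two, so the default `base` is used.
  val ExtractLinksUdf = udf(PlainExtractLinks.apply(_: String, _: String))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-wrap-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("http://liberal.ca/", "<a href=\"http://example.com/\">x</a>")
    ).toDF("url", "content")

    // The wrapped function now accepts columns instead of Strings.
    df.select(ExtractLinksUdf($"url", $"content").as("links")).show(false)

    spark.stop()
  }
}
```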

ianmilligan1 added the question label May 24, 2018

ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018

ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

ruebot added the resolve before 0.18.0 label Aug 17, 2019

ruebot removed the resolve before 0.18.0 label Aug 17, 2019


archivesunleashed/aut
