Dataframe Code Request: Finding Hyperlinks within Collection on Pages with Certain Keyword #238

Open
ianmilligan1 opened this issue May 24, 2018 · 6 comments

@ianmilligan1 ianmilligan1 commented May 24, 2018

Use Case

I am interested in finding what pages particular organizations link to when they contain certain keywords.

In this case, I am curious to see where the Liberal Party of Canada (liberal.ca) has linked to from pages that contain the word "Keystone". Do they link to news sources? Do they link to activist groups? Do they link to opposition parties? Let's find out.

Data

In this case we need to do the following:

  • Find all pages within a collection of WARCs that contain the keyword Keystone. Let's make it case insensitive.
  • Find all hyperlinks that are leaving the pages that contain the keyword Keystone.

Desired Output

The desired output would be a CSV looking like:

domain, URL, crawl date, origin page, destination page
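
To make the two steps concrete, here is a minimal plain-Scala sketch that runs on in-memory records rather than on WARCs. Everything in it is an assumption made for illustration: the `Page` case class, the helper names, and the simple `href` regex are hypothetical and are not part of the toolkit. It also emits the origin URL only once, on the assumption that "URL" and "origin page" refer to the same value, and it does no CSV escaping.

```scala
import java.net.URI

object KeywordLinkSketch {
  // Hypothetical stand-in for one archived page record.
  final case class Page(url: String, crawlDate: String, content: String)

  // Naive matcher for double-quoted href attributes; a real HTML parser is more robust.
  private val Href = "(?i)href\\s*=\\s*\"([^\"]+)\"".r

  // Host name of a URL without a leading "www.", or "" if the URL has no host.
  def domainOf(url: String): String =
    Option(new URI(url).getHost).getOrElse("").stripPrefix("www.")

  // Returns one "domain,url,crawl_date,destination" line per outgoing link
  // found on a page whose content mentions the keyword (case-insensitive).
  def linksFromPagesMentioning(pages: Seq[Page], keyword: String): Seq[String] = {
    val needle = keyword.toLowerCase
    for {
      page <- pages
      if page.content.toLowerCase.contains(needle)                      // step 1: keyword filter
      dest <- Href.findAllMatchIn(page.content).map(_.group(1)).toList  // step 2: outgoing links
    } yield Seq(domainOf(page.url), page.url, page.crawlDate, dest).mkString(",")
  }
}
```

The point of the ordering is that the keyword filter is applied to whole pages first, so links are only ever extracted from pages that matched.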

Question

What data frame query can we use to first filter down to pages containing a certain keyword, and then extract just the links from them? Hopefully this is clear, but do let me know if I can clarify further.

@ruebot ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018
@ruebot ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018
@ruebot ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

@ruebot ruebot commented Aug 17, 2019

@ianmilligan1 do you still need help on this one?

@ianmilligan1 ianmilligan1 commented Aug 17, 2019

I think the problem here is that we don’t have text as part of our dataframe implementation yet.

ruebot added a commit that referenced this issue Nov 5, 2019
- Replace ExtractBaseDomain with ExtractDomain
- Closes #367
- Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
- Closes #369
- Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
- Partially addresses #238
- Updates tests where necessary
- Punts on #368 UDF CaMeL cASe consistency issues

@ruebot ruebot commented Nov 8, 2019

@SinghGursimran here's one for you.

@SinghGursimran SinghGursimran commented Nov 9, 2019

import io.archivesunleashed._
import io.archivesunleashed.df._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),
    $"url".as("url"),
    $"crawl_date",
    explode_outer(ExtractLinks($"url", $"content")).as("link"))
  .filter($"content".contains("keyNote"))

df.select($"url", $"Domain", $"crawl_date", result(array($"link")).as("destination_page"))
  .write
  .option("header", "true")
  .csv("filtered_results/")

I have written this script to answer the above query. It works, but it could be done more cleanly if I added a dedicated function to the codebase for this query. I am not sure that is a good idea, though, since there could be many such one-off queries. @ruebot @lintool @ianmilligan1
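
One fragile spot in the script is the `result` UDF, which turns the link into a string and splits it on commas. The plain-Scala sketch below illustrates the risk, assuming each link is a (source, destination, anchor text) triple, which is what taking index 1 as `destination_page` suggests. The tuple is only a stand-in for the Spark row, and the URLs are made up.

```scala
object SplitVsField {
  // Hypothetical link triple whose source URL happens to contain a comma.
  val link: (String, String, String) =
    ("http://example.com/a,b", "http://dest.example/", "Example")

  // String-splitting approach: the comma inside the source URL shifts every index.
  val viaSplit: String = link.toString.split(",")(1)

  // Positional access is unaffected by the contents of the fields.
  val viaField: String = link._2
}
```

Here `viaSplit` picks up a fragment of the source URL instead of the destination, while `viaField` returns the destination. If the exploded `link` column is a struct built from a Scala tuple, the same field should be reachable in Spark as `$"link._2"`, though that is an assumption about the column's schema and would need checking.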

@ruebot ruebot commented Nov 11, 2019

We have to import matchbox there to get ExtractLinks, and then we hit:

import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.matchbox._

val result = udf((vs: Seq[Any]) => vs(0).toString.split(",")(1))

val df= RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/example.arc.gz", sc).extractValidPagesDF()
.select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
.filter($"content".contains("keystone"))

df.select($"url",$"Domain",$"crawl_date",result(array($"link")).as("destination_page"))
.show()

// Exiting paste mode, now interpreting.

<console>:31: error: reference to ExtractDomain is ambiguous;
it is imported twice in the same scope by
import io.archivesunleashed.matchbox._
and import io.archivesunleashed.df._
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                               ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
                                                                                                                             ^
<console>:31: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: String
       .select(RemovePrefixWWW(ExtractDomain($"url")).as("Domain"),$"url".as("url"),$"crawl_date",explode_outer(ExtractLinks($"url",$"content")).as("link"))
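
The first error is an ordinary Scala import clash: two wildcard imports in the same scope both provide a name, so the compiler refuses to pick one. A possible workaround, sketched here with hypothetical stand-in objects (`DfLike` and `MatchboxLike` are not the real packages), is to hide the duplicate name from one of the imports. This does not address the type mismatch errors, which arise because a function expecting `String` arguments is being handed columns.

```scala
import java.net.URI

// Hypothetical stand-ins for two packages that both export ExtractDomain.
object DfLike {
  def ExtractDomain(url: String): String = "df:" + new URI(url).getHost
}

object MatchboxLike {
  def ExtractDomain(url: String): String = "matchbox:" + new URI(url).getHost
  // Naive link extractor used only for this illustration.
  def ExtractLinks(src: String, html: String): List[String] =
    "href=\"([^\"]+)\"".r.findAllMatchIn(html).map(_.group(1)).toList
}

object ImportDemo {
  import DfLike._
  // Import everything from MatchboxLike except ExtractDomain.
  import MatchboxLike.{ExtractDomain => _, _}

  val domain: String = ExtractDomain("http://www.example.com/page") // resolves to DfLike
  val links: List[String] =
    ExtractLinks("http://www.example.com/page", "<a href=\"http://dest.example/\">x</a>")
}
```

Without the `ExtractDomain => _` selector, the reference to `ExtractDomain` inside `ImportDemo` would fail with the same "is ambiguous" message shown above.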

@SinghGursimran SinghGursimran commented Nov 11, 2019

I have added another UDF in aut/src/main/scala/io/archivesunleashed/df/package.scala, which I forgot to mention here:

val ExtractLinks = udf(io.archivesunleashed.matchbox.ExtractLinks.apply(_:String,_:String))

Sorry for the confusion. I will create a pull request.
