Dataframe Code Request: Finding Hyperlinks within Collection on Pages with Certain Keyword #238

Open
ianmilligan1 opened this issue May 24, 2018 · 2 comments

Comments

@ianmilligan1 (Member) commented May 24, 2018

Use Case

I am interested in finding which pages a particular organization links to from those of its own pages that contain certain keywords.

In this case, I am curious to see where the Liberal Party of Canada (liberal.ca) has linked to from pages that contain the word "Keystone". Do they link to news sources? Do they link to activist groups? Do they link to opposition parties? Let's find out.

Data

In this case we need to do the following:

  • Find all pages within a collection of WARCs that contain the keyword Keystone. The match should be case-insensitive.
  • Find all outbound hyperlinks on the pages that contain the keyword Keystone.
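
To make the two steps concrete, here is a minimal sketch in plain Python using only the standard library. It is not the toolkit's DataFrame API. It assumes the pages have already been pulled out of the WARCs into dictionaries with hypothetical `url`, `crawl_date`, and `html` keys, and the names `LinkCollector` and `links_from_keyword_pages` are invented for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and the text between tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        # Note: this also picks up script and style text; fine for a sketch.
        self.text_parts.append(data)


def links_from_keyword_pages(pages, keyword):
    """Return (crawl_date, origin_url, destination_url) tuples for every
    link found on a page whose text contains `keyword`, ignoring case."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    rows = []
    for page in pages:
        parser = LinkCollector()
        parser.feed(page["html"])
        parser.close()
        # Step 1: keep only pages that mention the keyword.
        if not pattern.search(" ".join(parser.text_parts)):
            continue
        # Step 2: emit every hyperlink on the page, resolved to an absolute URL.
        for href in parser.hrefs:
            rows.append((page["crawl_date"], page["url"], urljoin(page["url"], href)))
    return rows


# Hypothetical sample records standing in for pages extracted from WARCs.
sample_pages = [
    {
        "url": "http://liberal.ca/news/a",
        "crawl_date": "20180524",
        "html": '<p>KEYSTONE pipeline</p><a href="/about">x</a>'
                '<a href="http://example.org/story">y</a>',
    },
    {
        "url": "http://liberal.ca/news/b",
        "crawl_date": "20180524",
        "html": '<p>Nothing here</p><a href="http://example.com/">z</a>',
    },
]

rows = links_from_keyword_pages(sample_pages, "keystone")
```

Only the first sample page mentions the keyword, so `rows` holds just its two links. The sketch keeps internal links as well; dropping destinations on the same host would be an extra filter if only external links are wanted.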

Desired Output

The desired output would be a CSV looking like:

domain, URL, crawl date, origin page, destination page
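
As a rough illustration of that layout, the sketch below writes such rows with Python's standard `csv` module. The issue does not say how "URL" differs from "origin page", so this sketch assumes both hold the URL of the page containing the keyword, and that "domain" is that page's host. The snake_case column names and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse

# Column names are an assumed snake_case rendering of the requested header.
FIELDNAMES = ["domain", "url", "crawl_date", "origin_page", "destination_page"]


def rows_to_csv(rows):
    """Serialize (crawl_date, origin_url, destination_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for crawl_date, origin, destination in rows:
        writer.writerow({
            "domain": urlparse(origin).netloc,  # host of the linking page
            "url": origin,                      # assumption: same as origin_page
            "crawl_date": crawl_date,
            "origin_page": origin,
            "destination_page": destination,
        })
    return buffer.getvalue()


csv_text = rows_to_csv([
    ("20180524", "http://liberal.ca/news/a", "http://example.org/story"),
])
```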

Question

Which DataFrame query can we use to first filter down to pages containing a certain keyword and then extract just the links from them? Hopefully this is clear, but do let me know if I can clarify further.

@ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018

@ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

@ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

@ruebot (Member) commented Aug 17, 2019

@ianmilligan1 do you still need help on this one?

@ianmilligan1 (Member, Author) commented Aug 17, 2019

I think the problem here is that we don’t have text as part of our dataframe implementation yet.
