Dataframe Code Request: Finding Hyperlinks within Collection on Pages with Certain Keyword #238
Comments
ianmilligan1 added the question label, May 24, 2018
ruebot added this to In Progress in DataFrames and PySpark, Aug 13, 2018
ruebot added this to To Do in 1.0.0 Release of AUT, Aug 13, 2018
ruebot moved this from In Progress to ToDo in DataFrames and PySpark, Aug 13, 2018
@ianmilligan1 do you still need help on this one?
ruebot added the resolve before 0.18.0 label, Aug 17, 2019
I think the problem here is that we don’t have text as part of our dataframe implementation yet.
ruebot removed the resolve before 0.18.0 label, Aug 17, 2019
ianmilligan1 commented May 24, 2018
Use Case
I am interested in finding what pages particular organizations link to when they contain certain keywords.
In this case, I am curious to see where the Liberal Party of Canada (liberal.ca) has linked to from pages that contain the word "Keystone". Do they link to news sources? Do they link to activist groups? Do they link to opposition parties? Let's find out.
Data
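In this case we need to do the following:
- Find all pages on liberal.ca that contain the keyword Keystone. Let's make it case insensitive.
- Extract the links from the pages that contain Keystone.
Desired Output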
The desired output would be a CSV looking like:
Question
What data frame query can we use to first filter down to pages containing a certain keyword, and then extract just links from them? Hopefully this is clear but do let me know if I can clarify further.
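To make the request concrete, here is a minimal sketch of the two steps (filter pages by keyword, then extract links) in plain Python using only the standard library. It is not the toolkit's DataFrame API and does not read web archive files. It assumes the pages are already available as (url, html) pairs. The function name `links_from_keyword_pages`, the sample pages, and the CSV column names `source_url` and `target_url` are all hypothetical, chosen only for illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_from_keyword_pages(pages, domain, keyword):
    """Yield (source_url, target_url) for pages on `domain` that mention `keyword`.

    The keyword match is case insensitive and runs over the raw HTML,
    so it can also match text inside tags or attributes.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for url, html in pages:
        host = urlparse(url).netloc.lower()
        # Step 0: keep only pages hosted on the domain or one of its subdomains.
        if host != domain and not host.endswith("." + domain):
            continue
        # Step 1: keep only pages that contain the keyword.
        if not pattern.search(html):
            continue
        # Step 2: extract links, resolving relative hrefs against the page URL.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            yield url, urljoin(url, href)


# Hypothetical sample pages, made up for illustration only.
pages = [
    ("http://www.liberal.ca/a",
     '<p>The KEYSTONE pipeline</p>'
     '<a href="http://example.org/news">news</a><a href="/about">about</a>'),
    ("http://www.liberal.ca/b",
     '<p>Nothing relevant</p><a href="http://example.com/">x</a>'),
    ("http://other.example/c",
     '<p>keystone</p><a href="http://example.net/">y</a>'),
]

rows = list(links_from_keyword_pages(pages, "liberal.ca", "Keystone"))

# Write the result as CSV; the column names are an assumption.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source_url", "target_url"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A DataFrame version would presumably follow the same shape: one filter on the domain, one case-insensitive filter on the page content, then a link-extraction step over the remaining rows.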