Dataframe Code Request: Finding Image Sharing between Domains #237
Comments
ianmilligan1 added the question label May 24, 2018
@ianmilligan1
Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).
@ianmilligan1
Some results shared in Slack.
This is awesome (and thanks for the results, looks great). Given the results, I realize maybe we should isolate to just a single crawl. If we want to do the above but slate it to just the crawl date in
@ianmilligan1
This particular dataset didn't return any results for the given month, but the script completed successfully.
@JWZ2018 in the above, the filter is being done on an RDD... the plan is to move everything over to DF, so we need a new set of UDFs... I'll create a new PR on this.
ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018
ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018
ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018
@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?
ruebot added the resolve before 0.18.0 label Aug 17, 2019
Realistically we could probably just do this by filtering the resulting CSV file, so I’m happy if we close this.
ruebot moved this from ToDo to In Progress in DataFrames and PySpark Aug 17, 2019
ruebot moved this from To Do to In Progress in 1.0.0 Release of AUT Aug 17, 2019
OK, thanks @lintool. Above you noted creating some new UDFs; is that still something you could do?
ianmilligan1 commented May 24, 2018 (edited)
Use Case
I am interested in finding substantial images (so larger than icons - bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here as we can begin assembling documentation for complicated dataframe queries.
Input
Imagine this Dataframe. It is the result of finding all images within a collection with heights and widths greater than 50 px.
The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same png); one that appears only once on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.
Desired Output
I would like to receive only the results for images that appear in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png because that is image borrowing within one domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
Question
What query could we use to
Let me know if this is unclear, happy to clarify however best I can.
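To make the desired output concrete, here is a minimal sketch of the filtering logic in plain Python rather than as a Spark DataFrame query. The rows, URLs, hash strings, and the function name `images_shared_across_domains` are all hypothetical stand-ins for the example described above, not actual AUT output or API. The sketch groups rows by MD5, collects the distinct domains for each hash, and keeps only hashes seen in more than one domain.

```python
from collections import defaultdict

# Hypothetical sample rows (domain, url, md5) mirroring the example above.
# URLs and hash values are invented placeholders, not real data.
rows = [
    ("greenparty.ca", "http://greenparty.ca/img/planet.png", "md5-aaa"),
    ("greenparty.ca", "http://greenparty.ca/img/planeta.png", "md5-aaa"),
    ("liberal.ca", "http://liberal.ca/img/pierre.png", "md5-bbb"),
    ("liberal.ca", "http://liberal.ca/img/shared.png", "md5-ccc"),
    ("conservative.ca", "http://conservative.ca/img/shared.png", "md5-ccc"),
]


def images_shared_across_domains(rows):
    """Return only the rows whose MD5 occurs in more than one distinct domain."""
    # Collect the set of domains in which each hash was seen.
    domains_by_md5 = defaultdict(set)
    for domain, _url, md5 in rows:
        domains_by_md5[md5].add(domain)

    # A hash counts as shared only if it spans at least two domains, so
    # repeats within a single domain (the greenparty.ca case) are dropped.
    shared = {md5 for md5, domains in domains_by_md5.items() if len(domains) > 1}
    return [row for row in rows if row[2] in shared]


shared_rows = images_shared_across_domains(rows)
for row in shared_rows:
    print(row)
```

On a DataFrame, the same idea would presumably be a group-by on the MD5 column with a distinct count of domains, followed by a filter on counts greater than one. The column names would depend on the actual schema.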