Dataframe Code Request: Finding Image Sharing between Domains #237
Comments
@ianmilligan1
Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).
@ianmilligan1
Some results shared in Slack.
This is awesome (and thanks for the results, looks great). Given the results, I realize maybe we should isolate to just a single crawl. If we want to do the above but slate it to just the crawl date in
@ianmilligan1
This particular dataset didn't return any results for the given month, but the script completed successfully.
@JWZ2018 in the above, the filter is being done on an RDD... the plan is to move everything over to DF, so we need a new set of UDFs... I'll create a new PR on this.
@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?
Realistically we could probably just do this by filtering the resulting csv file, so I’m happy if we close this.
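A minimal sketch of that CSV-filtering idea, in plain Python. It assumes the exported file has a header row and a date column; the column names (`crawl_date`, `domain`, `url`, `md5`), the sample rows, and the date prefix are all hypothetical placeholders, not values from this dataset.

```python
import csv
import io


def filter_by_crawl_date(csv_text, date_prefix, date_column="crawl_date"):
    """Keep only rows whose crawl date starts with date_prefix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[date_column].startswith(date_prefix)]


# Hypothetical sample standing in for the exported results file.
sample = """crawl_date,domain,url,md5
20091201,liberal.ca,http://liberal.ca/images/shared.png,md5-ccc
20100115,conservative.ca,http://conservative.ca/images/shared.png,md5-ccc
"""

for row in filter_by_crawl_date(sample, "200912"):
    print(row["crawl_date"], row["domain"], row["url"])
```

Matching on a prefix lets the same helper narrow the results to a year, a month, or a single day, depending on how much of the date is supplied.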
OK, thanks @lintool. Above you noted creating some new UDFs, is that still something you could do?
@SinghGursimran here's one for you.
ianmilligan1 commented May 24, 2018 (edited)
Use Case
I am interested in finding substantial images (so larger than icons - bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here as we can begin assembling documentation for complicated dataframe queries.
Input
Imagine this Dataframe. It is the result of finding all images within a collection with heights and widths greater than 50 px.
The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same png); one that appears only once on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.
Desired Output
I would like to receive only the results that appear more than once and in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png because that is image borrowing within one domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
Question
What query could we use to
Let me know if this is unclear, happy to clarify however best I can.
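To make the desired output concrete, here is a rough sketch of the logic in plain Python rather than the toolkit's DataFrame API: group rows by MD5 hash, then keep only the rows whose hash shows up on more than one distinct domain. The rows, URLs, hash values, and the `shared.png` filename are made up for illustration and only mirror the shape of the input described above.

```python
from collections import defaultdict

# Hypothetical rows: (domain, image URL, MD5 hash). All values are illustrative.
rows = [
    ("greenparty.ca", "http://greenparty.ca/images/planet.png", "md5-aaa"),
    ("greenparty.ca", "http://greenparty.ca/images/planeta.png", "md5-aaa"),
    ("liberal.ca", "http://liberal.ca/images/pierre.png", "md5-bbb"),
    ("liberal.ca", "http://liberal.ca/images/shared.png", "md5-ccc"),
    ("conservative.ca", "http://conservative.ca/images/shared.png", "md5-ccc"),
]


def shared_across_domains(rows):
    """Return the rows whose MD5 hash appears on more than one distinct domain."""
    domains_by_md5 = defaultdict(set)
    for domain, _url, md5 in rows:
        domains_by_md5[md5].add(domain)
    return [row for row in rows if len(domains_by_md5[row[2]]) > 1]


for domain, url, md5 in shared_across_domains(rows):
    print(domain, url, md5)
```

Counting distinct domains per hash, rather than counting rows per hash, is what drops the greenparty.ca pair: that image appears twice but only within one domain. The same grouping should translate to a group-by on the MD5 column with a distinct count of domains in a DataFrame query.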