Dataframe Code Request: Finding Image Sharing between Domains #237
Comments
@ianmilligan1
Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).
@ianmilligan1
Some results shared in the Slack.
This is awesome (and thanks for the results, looks great). Given the results, I realize maybe we should isolate this to just a single crawl. If we want to do the above but restrict it to just the crawl date in
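For reference, restricting to a single crawl month on the DataFrame side is a one-line filter. A minimal sketch, assuming crawl_date is stored as a yyyyMMdd string (as in the script later in this thread); pages is a hypothetical stand-in for the extracted valid-pages DataFrame:

// Sketch only: `pages` stands in for the extracted valid-pages DataFrame.
// Keep rows whose crawl_date falls in December 2009 (yyyyMMdd strings assumed).
val december2009 = pages.filter($"crawl_date" rlike "200912[0-9]{2}")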
@ianmilligan1
This particular dataset didn't return any results for the given month, but the script completed successfully.
@JWZ2018 in the above, the filtering is being done on an RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR on this.
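For context, wrapping a matchbox extractor as a Spark SQL UDF is a one-liner. A minimal sketch, mirroring the approach the script below takes (aut 0.18-era API assumed; the udf name and the usage line are illustrative):

import io.archivesunleashed.matchbox._
import org.apache.spark.sql.functions.udf

// Wrap the plain Scala extractor so it can be applied to DataFrame columns.
val domainUdf = udf((url: String) => ExtractDomain(url))
// Hypothetical usage: pagesDF.select(domainUdf($"url").as("Domain"))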
@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?
Realistically we could probably just do this by filtering the resulting CSV file, so I'm happy if we close this.
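A minimal sketch of that post-filtering approach, assuming the derivative CSV has the Domain, ImageUrl, and MD5 columns produced by the script below (the file path is illustrative, and this is meant to run in the same spark-shell session):

import org.apache.spark.sql.functions._

// Read the derivative back in and keep only hashes seen on 2+ distinct domains.
val images = spark.read.option("header", "true").csv("image-derivative.csv") // hypothetical path
val shared = images.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)
images.join(shared, "MD5").orderBy("MD5").show(10, false)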
OK, thanks @lintool. Above you noted creating some new UDFs; is that still something you could do?
@SinghGursimran here's one for you.
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.spark.sql.functions._

// Wrap the matchbox extractors as Spark SQL UDFs so everything stays in DataFrames.
val imgDetails = udf((url: String, mimeTypeTika: String, content: String) => ExtractImageDetails(url, mimeTypeTika, content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

// One row per (page, image link), with the page's domain and the image's MD5 hash,
// restricted to the December 2009 crawl.
val total = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    $"crawl_date".as("crawl_date"),
    domain($"url").as("Domain"),
    explode_outer(imgLinks($"url", $"content")).as("ImageUrl"),
    imgDetails($"url", $"mime_type_tika", $"content").as("MD5")
  )
  .filter($"crawl_date" rlike "200912[0-9]{2}")

// HAVING-style filter: keep only hashes that appear on two or more distinct domains.
val links = total.groupBy("MD5").count()
  .where(countDistinct("Domain") >= 2)

// One representative image URL per (domain, hash) pair.
val result = total.join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))

result.show(10, false)

The above script performs all operations on DataFrames. There are no potential hits for the given date in the dataset I used, though the script completed successfully.
Hrm... I think I should be getting matches here, but I'm not getting any.
Crawl dates that should match:
Filter for matching this pattern:
I think I should be getting results there.
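One way to debug that, as a sketch: inspect the distinct crawl_date values that are actually present before applying the rlike filter, to confirm the pattern can match the stored format. This assumes the same extractValidPagesDF() pipeline as the script above, run in spark-shell; the input path is hypothetical.

import io.archivesunleashed._

// Same WARC/ARC input as above, but without the crawl_date filter.
val unfiltered = RecordLoader.loadArchives("path/to/collection/*.gz", sc)
  .extractValidPagesDF()
  .select($"crawl_date")

unfiltered.distinct().orderBy("crawl_date").show(20, false)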
Are there 2 or more distinct domains with the same MD5 hash on the given date?
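That check can be expressed directly against the total DataFrame from the script above; a sketch:

import org.apache.spark.sql.functions._

// How many image hashes appear on two or more distinct domains in the filtered data?
total.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)
  .count()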
Oh, that's right. Now we have to search for a dataset that satisfies this. @ianmilligan1 I can run this on a larger portion of GeoCities on
Nope, I think running on GeoCities on
OK, I'm running it on the entire 4 TB of GeoCities and writing to CSV. I'll report back in a few days when it finishes.
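For a run that large, the tail end of the script above would swap show() for a CSV write. A sketch, reusing result (the joined/grouped DataFrame) from the script above; the output directory name is illustrative:

result
  .coalesce(1)                     // single part file; drop this for very large outputs
  .write
  .option("header", "true")
  .csv("geocities-shared-images")  // hypothetical output directory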
@ianmilligan1 @lintool if this completes successfully, where do you two envision this landing in
ianmilligan1 commented May 24, 2018 (edited)
Use Case
I am interested in finding substantial images (larger than icons, i.e. bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here so we can begin assembling documentation for complicated DataFrame queries.
Input
Imagine this DataFrame. It is the result of finding all images within a collection with heights and widths greater than 50 px.
The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same PNG); one that appears only once on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.
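A sketch of how the Input table above reduces to a simple size filter, assuming a hypothetical images DataFrame with Domain, ImageUrl, MD5, width, and height columns (however those were extracted):

// Keep only "substantial" images: wider and taller than 50 px.
val input = images
  .filter($"width" > 50 && $"height" > 50)
  .select($"Domain", $"ImageUrl", $"MD5")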
Desired Output
I would like to receive only the results that appear more than once in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png pair because it's image borrowing within one domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
Question
What query could we use to find only the images that appear on two or more distinct domains?
Let me know if this is unclear, happy to clarify however best I can.
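For reference, a minimal sketch of the kind of query the thread above converges on, expressed against the input DataFrame described in the Input section (columns Domain, ImageUrl, and MD5 are assumed):

import org.apache.spark.sql.functions._

// Hashes that appear on two or more distinct domains.
val crossDomain = input.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)

// One representative image URL per (domain, hash) pair, cross-domain images only.
input.join(crossDomain, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy("MD5")
  .show(false)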