Dataframe Code Request: Finding Image Sharing between Domains #237

Open
ianmilligan1 opened this issue May 24, 2018 · 10 comments

Comments

@ianmilligan1
Member

commented May 24, 2018

Use Case

I am interested in finding substantial images (larger than icons: more than 50 px wide and 50 px high) that appear on more than one domain within an Archive-It collection. @lintool suggested posting this here so that we can begin assembling documentation for complicated DataFrame queries.

Input

Imagine this Dataframe. It is the result of finding all images within a collection with heights and widths greater than 50 px.

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| liberal.ca | www.liberal.ca/images/pierre.png | a449a58d72cb497f2edd7ed5e31a9d1c |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |
| greenparty.ca | www.greenparty.ca/images/planet.png | f85243a4fe4cf3bdfd77e9effec2559c |
| greenparty.ca | www.greenparty.ca/images/planeta.png | f85243a4fe4cf3bdfd77e9effec2559c |

The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same png); one that appears only once on liberal.ca (pierre.png) and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.

Desired Output

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |

I would like to receive only the images that appear in more than one domain. I am not interested in greenparty.ca's planet.png and planeta.png, because that is image reuse within a single domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
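
To make the desired filter concrete, here is a minimal sketch on the five sample rows, using plain Scala collections rather than Spark. The names `ImageRow` and `sharedAcrossDomains` are illustrative and not part of AUT.

```scala
// One row per image occurrence, mirroring the sample table above.
case class ImageRow(domain: String, url: String, md5: String)

val rows = Seq(
  ImageRow("liberal.ca", "www.liberal.ca/images/trudeau.png", "4c028c4429359af2c724767dcc932c69"),
  ImageRow("liberal.ca", "www.liberal.ca/images/pierre.png", "a449a58d72cb497f2edd7ed5e31a9d1c"),
  ImageRow("conservative.ca", "www.conservative.ca/images/jerk.png", "4c028c4429359af2c724767dcc932c69"),
  ImageRow("greenparty.ca", "www.greenparty.ca/images/planet.png", "f85243a4fe4cf3bdfd77e9effec2559c"),
  ImageRow("greenparty.ca", "www.greenparty.ca/images/planeta.png", "f85243a4fe4cf3bdfd77e9effec2559c")
)

// Keep only the rows whose MD5 is seen on at least two distinct domains.
def sharedAcrossDomains(rows: Seq[ImageRow]): Seq[ImageRow] = {
  // For each hash, collect the set of domains it appears on.
  val domainsPerHash: Map[String, Set[String]] =
    rows.groupBy(_.md5).map { case (md5, group) => md5 -> group.map(_.domain).toSet }
  rows.filter(r => domainsPerHash(r.md5).size >= 2)
}

val shared = sharedAcrossDomains(rows)
shared.foreach(r => println(s"${r.domain}\t${r.url}\t${r.md5}"))
```

On the sample data this keeps the trudeau.png and jerk.png rows and drops the rest. The greenparty.ca rows are dropped because their hash maps to a single domain, even though it appears twice.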

Question

What query could we use to

  • take a directory of WARCs;
  • extract the image details above; and
  • filter so we just receive a list of images that appear in multiple domains.

Let me know if this is unclear; I'm happy to clarify.

@JWZ2018

Contributor

commented May 24, 2018

@ianmilligan1
I wrote a script to do this. Do you have a small-ish dataset that has images like this that I can test with?

@ianmilligan1

Member Author

commented May 24, 2018

Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).

@JWZ2018

Contributor

commented May 25, 2018

@ianmilligan1
I used this script:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{asc, countDistinct, first}
import spark.implicits._

val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-16606*", sc)
// domain taken from the src column, paired with each image URL
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"))
// MD5 hash of each image, keyed by image URL
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"))

// domains and images in one table
val total = domains.join(images, "ImageUrl")
// group identical images by MD5 and keep only the hashes seen on at least 2 distinct domains
val links = total.groupBy("MD5").agg(countDistinct("Domain").as("DomainCount")).where($"DomainCount" >= 2)
// rejoin with total to get one image URL per (Domain, MD5) pair
val result = total.join(links, "MD5").groupBy("Domain", "MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
// show() returns Unit, so call it separately to keep result as a DataFrame
result.show()

I shared some results in Slack.
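
The use case also asks for images larger than 50 px in each dimension, which the script above does not apply. Below is a rough sketch of how that size filter could combine with the cross-domain filter, run on a toy DataFrame so it does not need AUT or any WARCs. The column names (`Domain`, `ImageUrl`, `MD5`, `Width`, `Height`), the pixel sizes, and the all-zero icon hash are made up for illustration; the real column names in the AUT DataFrames may differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// Local session for a toy run; inside spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("shared-images-sketch").getOrCreate()
import spark.implicits._

// Toy rows: column names and pixel sizes are illustrative assumptions.
val total = Seq(
  ("liberal.ca", "www.liberal.ca/images/trudeau.png", "4c028c4429359af2c724767dcc932c69", 640, 480),
  ("liberal.ca", "www.liberal.ca/images/pierre.png", "a449a58d72cb497f2edd7ed5e31a9d1c", 640, 480),
  ("conservative.ca", "www.conservative.ca/images/jerk.png", "4c028c4429359af2c724767dcc932c69", 640, 480),
  ("greenparty.ca", "www.greenparty.ca/images/planet.png", "f85243a4fe4cf3bdfd77e9effec2559c", 300, 200),
  ("greenparty.ca", "www.greenparty.ca/images/planeta.png", "f85243a4fe4cf3bdfd77e9effec2559c", 300, 200),
  // Hypothetical icon shared by two domains; it should be dropped by the size filter.
  ("liberal.ca", "www.liberal.ca/images/icon.png", "00000000000000000000000000000000", 16, 16),
  ("conservative.ca", "www.conservative.ca/images/icon.png", "00000000000000000000000000000000", 16, 16)
).toDF("Domain", "ImageUrl", "MD5", "Width", "Height")

// Step 1: drop icons and other small images.
val substantial = total.where(col("Width") > 50 && col("Height") > 50)

// Step 2: find the hashes that occur on at least two distinct domains.
val sharedHashes = substantial
  .groupBy("MD5")
  .agg(countDistinct("Domain").as("DomainCount"))
  .where(col("DomainCount") >= 2)
  .select("MD5")

// Step 3: join back to recover the rows for those hashes.
val shared = substantial
  .join(sharedHashes, "MD5")
  .select("Domain", "ImageUrl", "MD5")
  .orderBy("MD5", "Domain")

shared.show(false)
```

Filtering by size before grouping keeps the shuffle smaller, and it means an icon shared across domains never counts as a hit.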

@ianmilligan1

Member Author

commented May 25, 2018

This is awesome (and thanks for the results, looks great).

Given the results, I realize maybe we should isolate to just a single crawl.

If we want to do the above but limit it to a single crawl month in yyyymm format (for example, 200912), where should we put that filter in the script above for optimal performance?

@JWZ2018

Contributor

commented May 25, 2018

@ianmilligan1
We can try something like this:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{asc, countDistinct, first}
import spark.implicits._

// filter on the crawl month right after loading, so later steps only see matching records
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-DNYDTY-20121103160515-00000-crawling202.us.archive.org-6683.warc.gz", sc).filter(r => r.getCrawlMonth == "201211")
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"))
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"))

// domains and images in one table
val total = domains.join(images, "ImageUrl")
// group identical images by MD5 and keep only the hashes seen on at least 2 distinct domains
val links = total.groupBy("MD5").agg(countDistinct("Domain").as("DomainCount")).where($"DomainCount" >= 2)
// rejoin with total to get one image URL per (Domain, MD5) pair
val result = total.join(links, "MD5").groupBy("Domain", "MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
// show() returns Unit, so call it separately to keep result as a DataFrame
result.show()

This particular dataset didn't return any results for the given month but the script completed successfully.

@lintool

Member

commented May 25, 2018

@JWZ2018 in the script above, the filter is being applied on the RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR for this.
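
For illustration only, here is a rough sketch of what a DataFrame-side crawl-month filter might look like, run on a toy DataFrame. It is not the set of UDFs proposed above: it assumes a hypothetical `crawl_date` string column in yyyyMMdd form, which the AUT DataFrames may or may not expose, and it uses Spark's built-in `substring` function rather than a UDF. The dates in the toy rows are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Local session for a toy run; inside spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("crawl-month-sketch").getOrCreate()
import spark.implicits._

// Toy rows: `crawl_date` (yyyyMMdd) is a hypothetical column and the dates are invented.
val records = Seq(
  ("20121103", "www.liberal.ca/images/trudeau.png"),
  ("20121215", "www.liberal.ca/images/pierre.png"),
  ("20091201", "www.greenparty.ca/images/planet.png")
).toDF("crawl_date", "url")

// Keep only the rows whose first six characters (yyyyMM) match the target month.
// Spark's substring is 1-indexed: position 1, length 6.
val november2012 = records.where(substring(col("crawl_date"), 1, 6) === "201211")

november2012.show(false)
```

Because the condition is an ordinary column expression, Spark can plan it together with the rest of the query, whereas a filter on the RDD runs before any DataFrame is built.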

@ruebot ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018

@ruebot ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

@ruebot ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

@ruebot

Member

commented Aug 17, 2019

@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?

@ianmilligan1

Member Author

commented Aug 17, 2019

Realistically we could probably just do this by filtering the resulting csv file, so I’m happy if we close this.

@ruebot ruebot moved this from ToDo to In Progress in DataFrames and PySpark Aug 17, 2019

@ruebot ruebot moved this from To Do to In Progress in 1.0.0 Release of AUT Aug 17, 2019

@lintool

Member

commented Aug 21, 2019

👎 on filtering CSVs - not scalable...

@ianmilligan1

Member Author

commented Aug 21, 2019

OK, thanks @lintool. Above you mentioned creating some new UDFs; is that still something you could do?
