Dataframe Code Request: Finding Image Sharing between Domains #237
Comments
@ianmilligan1
Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).
@ianmilligan1
Some results shared in the Slack.
This is awesome (and thanks for the results, looks great). Given the results, I realize maybe we should isolate this to just a single crawl. If we want to do the above but restrict it to just the crawl date in
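For reference, restricting to a single crawl month on the DataFrame side is a one-line filter. A minimal sketch, assuming crawl_date is stored as a yyyyMMdd string (as in the script later in this thread); pages is a hypothetical stand-in for the extracted valid-pages DataFrame:

// Sketch only: `pages` stands in for the extracted valid-pages DataFrame.
// Keep rows whose crawl_date falls in December 2009 (yyyyMMdd strings assumed).
val december2009 = pages.filter($"crawl_date" rlike "200912[0-9]{2}")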
@ianmilligan1
This particular dataset didn't return any results for the given month, but the script completed successfully.
@JWZ2018 in the above, the filtering is being done on an RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR on this.
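For context, wrapping a matchbox extractor as a Spark SQL UDF is a one-liner. A minimal sketch, mirroring the approach the script below takes (aut 0.18-era API assumed; the udf name and the usage line are illustrative):

import io.archivesunleashed.matchbox._
import org.apache.spark.sql.functions.udf

// Wrap the plain Scala extractor so it can be applied to DataFrame columns.
val domainUdf = udf((url: String) => ExtractDomain(url))
// Hypothetical usage: pagesDF.select(domainUdf($"url").as("Domain"))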
@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?
Realistically we could probably just do this by filtering the resulting CSV file, so I'm happy if we close this.
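A minimal sketch of that post-filtering approach, assuming the derivative CSV has the Domain, ImageUrl, and MD5 columns produced by the script below (the file path is illustrative, and this is meant to run in the same spark-shell session):

import org.apache.spark.sql.functions._

// Read the derivative back in and keep only hashes seen on 2+ distinct domains.
val images = spark.read.option("header", "true").csv("image-derivative.csv") // hypothetical path
val shared = images.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)
images.join(shared, "MD5").orderBy("MD5").show(10, false)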
OK, thanks @lintool. Above you noted creating some new UDFs; is that still something you could do?
@SinghGursimran here's one for you.
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.spark.sql.functions._

// Wrap the matchbox extractors as Spark SQL UDFs so everything stays in DataFrames.
val imgDetails = udf((url: String, mimeTypeTika: String, content: String) => ExtractImageDetails(url, mimeTypeTika, content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

// One row per (page, image link), with the page's domain and the image's MD5 hash,
// restricted to the December 2009 crawl.
val total = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    $"crawl_date".as("crawl_date"),
    domain($"url").as("Domain"),
    explode_outer(imgLinks($"url", $"content")).as("ImageUrl"),
    imgDetails($"url", $"mime_type_tika", $"content").as("MD5")
  )
  .filter($"crawl_date" rlike "200912[0-9]{2}")

// HAVING-style filter: keep only hashes that appear on two or more distinct domains.
val links = total.groupBy("MD5").count()
  .where(countDistinct("Domain") >= 2)

// One representative image URL per (domain, hash) pair.
val result = total.join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))

result.show(10, false)

The above script performs all operations on DataFrames. There are no potential hits for the given date in the dataset I used, though the script completed successfully.
Hrm... I think I should be getting matches here, but I'm not getting any.
Crawl dates that should match:
Filter for matching this pattern:
I think I should be getting results there.
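One way to debug that, as a sketch: inspect the distinct crawl_date values that are actually present before applying the rlike filter, to confirm the pattern can match the stored format. This assumes the same extractValidPagesDF() pipeline as the script above, run in spark-shell; the input path is hypothetical.

import io.archivesunleashed._

// Same WARC/ARC input as above, but without the crawl_date filter.
val unfiltered = RecordLoader.loadArchives("path/to/collection/*.gz", sc)
  .extractValidPagesDF()
  .select($"crawl_date")

unfiltered.distinct().orderBy("crawl_date").show(20, false)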
Are there 2 or more distinct domains with the same MD5 hash on the given date?
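That check can be expressed directly against the total DataFrame from the script above; a sketch:

import org.apache.spark.sql.functions._

// How many image hashes appear on two or more distinct domains in the filtered data?
total.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)
  .count()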
Oh, that's right. Now we have to search for a dataset that satisfies this. @ianmilligan1 I can run this on a larger portion of GeoCities on
Nope, I think running on GeoCities on
OK, I'm running it on the entire 4 TB of GeoCities and writing to CSV. I'll report back in a few days when it finishes.
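For a run that large, the tail end of the script above would swap show() for a CSV write. A sketch, reusing result (the joined/grouped DataFrame) from the script above; the output directory name is illustrative:

result
  .coalesce(1)                     // single part file; drop this for very large outputs
  .write
  .option("header", "true")
  .csv("geocities-shared-images")  // hypothetical output directory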
@ianmilligan1 @lintool if this completes successfully, where do you two envision this landing in
ianmilligan1 commented May 24, 2018 (edited)
Use Case
I am interested in finding substantial images (larger than icons, i.e. bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here so we can begin assembling documentation for complicated DataFrame queries.
Input
Imagine this DataFrame. It is the result of finding all images within a collection with heights and widths greater than 50 px.
The above has three images: one that appears twice on greenparty.ca with different URLs (but it's the same PNG); one that appears only once on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.
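A sketch of how the Input table above reduces to a simple size filter, assuming a hypothetical images DataFrame with Domain, ImageUrl, MD5, width, and height columns (however those were extracted):

// Keep only "substantial" images: wider and taller than 50 px.
val input = images
  .filter($"width" > 50 && $"height" > 50)
  .select($"Domain", $"ImageUrl", $"MD5")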
Desired Output
I would like to receive only the results that appear more than once in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png pair because it's image borrowing within one domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
Question
What query could we use to find only the images that appear on two or more distinct domains?
Let me know if this is unclear, happy to clarify however best I can.
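For reference, a minimal sketch of the kind of query the thread above converges on, expressed against the input DataFrame described in the Input section (columns Domain, ImageUrl, and MD5 are assumed):

import org.apache.spark.sql.functions._

// Hashes that appear on two or more distinct domains.
val crossDomain = input.groupBy("MD5")
  .agg(countDistinct("Domain").as("DistinctDomains"))
  .where($"DistinctDomains" >= 2)

// One representative image URL per (domain, hash) pair, cross-domain images only.
input.join(crossDomain, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy("MD5")
  .show(false)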