Dataframe Code Request: Finding Image Sharing between Domains #237

Open
ianmilligan1 opened this issue May 24, 2018 · 18 comments

Comments


@ianmilligan1 ianmilligan1 commented May 24, 2018

Use Case

I am interested in finding substantial images (larger than icons, i.e. bigger than 50 px wide and 50 px high) that are found across domains within an Archive-It collection. @lintool suggested putting this here, as we can begin assembling documentation for complicated DataFrame queries.

Input

Imagine this DataFrame. It is the result of finding all images within a collection with heights and widths greater than 50 px.

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| liberal.ca | www.liberal.ca/images/pierre.png | a449a58d72cb497f2edd7ed5e31a9d1c |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |
| greenparty.ca | www.greenparty.ca/images/planet.png | f85243a4fe4cf3bdfd77e9effec2559c |
| greenparty.ca | www.greenparty.ca/images/planeta.png | f85243a4fe4cf3bdfd77e9effec2559c |

The above has three distinct images: one that appears twice on greenparty.ca under different URLs (but it is the same PNG); one that appears only once, on liberal.ca (pierre.png); and one that appears on both liberal.ca and conservative.ca. We can tell there are three images because there are three distinct MD5 hashes.

Desired Output

| Domain | URL | MD5 |
| --- | --- | --- |
| liberal.ca | www.liberal.ca/images/trudeau.png | 4c028c4429359af2c724767dcc932c69 |
| conservative.ca | www.conservative.ca/images/jerk.png | 4c028c4429359af2c724767dcc932c69 |

I would like to receive only the images that appear in more than one domain. I am not interested in the greenparty.ca planet.png and planeta.png pair, because that is image borrowing within a single domain. But I am curious about why the same image appears on both liberal.ca and conservative.ca.
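
For illustration, here is a minimal sketch of just that cross-domain filter, run on the toy rows above with plain Spark in a spark-shell session (nothing aut-specific; the column names simply mirror the table, and the full WARC pipeline is what the question below asks about):

import org.apache.spark.sql.functions._
import spark.implicits._

// the toy input from the table above
val images = Seq(
  ("liberal.ca", "www.liberal.ca/images/trudeau.png", "4c028c4429359af2c724767dcc932c69"),
  ("liberal.ca", "www.liberal.ca/images/pierre.png", "a449a58d72cb497f2edd7ed5e31a9d1c"),
  ("conservative.ca", "www.conservative.ca/images/jerk.png", "4c028c4429359af2c724767dcc932c69"),
  ("greenparty.ca", "www.greenparty.ca/images/planet.png", "f85243a4fe4cf3bdfd77e9effec2559c"),
  ("greenparty.ca", "www.greenparty.ca/images/planeta.png", "f85243a4fe4cf3bdfd77e9effec2559c")
).toDF("Domain", "URL", "MD5")

// keep only MD5s that appear in at least two distinct domains
val crossDomain = images
  .groupBy("MD5")
  .agg(countDistinct("Domain").as("DomainCount"))
  .where($"DomainCount" >= 2)

images.join(crossDomain, "MD5").select("Domain", "URL", "MD5").show(false)
// expected output: only the trudeau.png and jerk.png rows, which share 4c028c4429359af2c724767dcc932c69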

Question

What query could we use to

  • take a directory of WARCs;
  • extract the image details above; and
  • filter so we just receive a list of images that appear in multiple domains.

Let me know if this is unclear; I'm happy to clarify as best I can.

@JWZ2018 JWZ2018 commented May 24, 2018

@ianmilligan1
I wrote a script to do this. Do you have a small-ish dataset that has images like this that I can test with?

@ianmilligan1 ianmilligan1 commented May 24, 2018

Great, thanks @JWZ2018 – just pinged you in Slack about access to a relatively small dataset that could be tested on (you could try on the sample data here, but I'm worried we need a large enough dataset to find these potential hits).

@JWZ2018 JWZ2018 commented May 25, 2018

@ianmilligan1
I used this script:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-16606*",sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

Some results were shared in Slack.
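
As a side note on the links step: the where(countDistinct(...)) after groupBy().count() appears to be handled by Spark as a HAVING-style filter on the aggregation. If that reads as surprising, an equivalent and more explicit formulation (a sketch, not re-tested on the same data) would be:

val links = total
  .groupBy("MD5")
  .agg(countDistinct("Domain").as("DomainCount"))
  .where($"DomainCount" >= 2)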

@ianmilligan1 ianmilligan1 commented May 25, 2018

This is awesome (and thanks for the results, looks great).

Given the results, I realize maybe we should isolate to just a single crawl.

If we want to do the above but limit it to a single crawl date in yyyymm format (200912), where should we put that filter for optimal performance?

@JWZ2018 JWZ2018 commented May 25, 2018

@ianmilligan1
We can try something like this:


import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-DNYDTY-20121103160515-00000-crawling202.us.archive.org-6683.warc.gz",sc).filter(r => r.getCrawlMonth == "201211")
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

This particular dataset didn't return any results for the given month but the script completed successfully.

@lintool lintool commented May 25, 2018

@JWZ2018 in the above, the filter is being done on the RDD... the plan is to move everything over to DataFrames, so we need a new set of UDFs... I'll create a new PR on this.

@ruebot ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018
@ruebot ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018
@ruebot ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018
@ruebot ruebot commented Aug 17, 2019

@ianmilligan1 are we good on this issue, or are we waiting for something from @lintool still?

@ianmilligan1 ianmilligan1 commented Aug 17, 2019

Realistically we could probably just do this by filtering the resulting csv file, so I’m happy if we close this.

@ruebot ruebot moved this from ToDo to In Progress in DataFrames and PySpark Aug 17, 2019
@ruebot ruebot moved this from To Do to In Progress in 1.0.0 Release of AUT Aug 17, 2019
@lintool lintool commented Aug 21, 2019

👎 on filtering CSVs - not scalable...

@ianmilligan1 ianmilligan1 commented Aug 21, 2019

OK, thanks @lintool. Above you noted creating some new UDFs; is that still something you could do?

@ruebot ruebot commented Nov 8, 2019

@SinghGursimran here's one for you.

@SinghGursimran SinghGursimran commented Nov 14, 2019

import io.archivesunleashed.matchbox._
import io.archivesunleashed._

val imgDetails = udf((url: String, MimeTypeTika: String, content: String) => ExtractImageDetails(url, MimeTypeTika, content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

val total = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()
  .select(
    $"crawl_date".as("crawl_date"),
    domain($"url").as("Domain"),
    explode_outer(imgLinks($"url", $"content")).as("ImageUrl"),
    imgDetails($"url", $"mime_type_tika", $"content").as("MD5")
  )
  .filter($"crawl_date" rlike "200912[0-9]{2}")

val links = total
  .groupBy("MD5")
  .count()
  .where(countDistinct("Domain") >= 2)

val result = total
  .join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))
  .show(10, false)

The above script performs all operations on DataFrames. There are no hits for the given date in the dataset I used, though the script completed successfully.
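
One small performance note: total feeds both the links aggregation and the final join, so each action may re-run the UDF-based extraction. Caching it is a cheap way to avoid that (a sketch, not benchmarked here):

// total is reused by the aggregation and the join, so keep it in memory
total.cache()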

@ruebot ruebot commented Nov 14, 2019

Hrm... I think I should be getting matches here, but I'm not getting any:

Crawl dates that should match: 20091027

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/", sc)
            .extractValidPagesDF()
            .show()

// Exiting paste mode, now interpreting.

+----------+--------------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|mime_type_web_server|      mime_type_tika|             content|
+----------+--------------------+--------------------+--------------------+--------------------+
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.geocit...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.geocit...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.talent...|           text/html|application/xhtml...|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.geocit...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.geocit...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.geocit...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|application/xhtml...|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://www.infoca...|           text/html|           text/html|HTTP/1.1 200 OK
...|
|  20091027|http://geocities....|           text/html|           text/html|HTTP/1.1 200 OK
...|
+----------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

Filter for matching this pattern: 200910

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed.matchbox._
import io.archivesunleashed._

val imgDetails = udf((url: String, MimeTypeTika: String, content: String) => ExtractImageDetails(url,MimeTypeTika,content.getBytes()).md5Hash)
val imgLinks = udf((url: String, content: String) => ExtractImageLinks(url, content))
val domain = udf((url: String) => ExtractDomain(url))

val total = RecordLoader
              .loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/", sc)
              .extractValidPagesDF()
              .select(
                $"crawl_date".as("crawl_date"),
                domain($"url").as("Domain"),
                explode_outer(imgLinks(($"url"),
                ($"content"))).as("ImageUrl"),
                imgDetails(($"url"), 
                ($"mime_type_tika"), 
                ($"content")).as("MD5")
              )
              .filter($"crawl_date" rlike "200910[0-9]{2}")

val links = total
              .groupBy("MD5")
              .count()
              .where(countDistinct("Domain")>=2)

val result = total
               .join(links, "MD5")
               .groupBy("Domain","MD5")
               .agg(first("ImageUrl")
               .as("ImageUrl"))
               .orderBy(asc("MD5"))
               .show(10,false)

// Exiting paste mode, now interpreting.

+------+---+--------+                                                           
|Domain|MD5|ImageUrl|
+------+---+--------+
+------+---+--------+

import io.archivesunleashed.matchbox._
import io.archivesunleashed._
imgDetails: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function3>,StringType,Some(List(StringType, StringType, StringType)))
imgLinks: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),Some(List(StringType, StringType)))
domain: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
total: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [crawl_date: string, Domain: string ... 2 more fields]
links: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [MD5: string, count: bigint]
result: Unit = ()

I think I should be getting results there.

@SinghGursimran SinghGursimran commented Nov 14, 2019

Are there 2 or more distinct domains with the same MD5 hash on the given date?
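
A quick way to check that directly would be to inspect the distinct-domain counts per MD5 before applying the >= 2 cutoff, against the total DataFrame from the script above (a sketch):

// how many distinct domains does each image hash appear in?
total
  .groupBy("MD5")
  .agg(countDistinct("Domain").as("DomainCount"))
  .orderBy(desc("DomainCount"))
  .show(20, false)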

@ruebot ruebot commented Nov 14, 2019

Oh, that's right. 🤦‍♂

Now we have to find a dataset that will actually produce matches. @ianmilligan1 I can run this on a larger portion of GeoCities on rho if you want, unless you have something better in mind.

@ianmilligan1 ianmilligan1 commented Nov 14, 2019

Nope, I think running on GeoCities on rho makes sense to me!

@ruebot ruebot commented Nov 14, 2019

OK, I'm running it on the entire 4 TB of GeoCities and writing to CSV. I'll report back in a few days when it finishes.
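
For anyone following along, writing the joined result out rather than show()-ing it would look roughly like this; note that result has to stay a DataFrame (the scripts above return Unit because they end in .show()), and the output path here is only a placeholder:

// build the result without .show(), then write it out with a header row
val result = total
  .join(links, "MD5")
  .groupBy("Domain", "MD5")
  .agg(first("ImageUrl").as("ImageUrl"))
  .orderBy(asc("MD5"))

result
  .write
  .option("header", "true")
  .csv("/path/to/image-sharing-output") // placeholder path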

@ruebot ruebot commented Nov 14, 2019

@ianmilligan1 @lintool if this completes successfully, where do you two envision this landing in aut-docs-new?
