Improve ExtractDomain Normalization #239

ianmilligan1 · May 25, 2018

Describe the bug
Right now, we have uneven behaviour with ExtractDomain. For example, in #237, when extracting domains we find things like:

+--------------------+--------------------+--------------------+                
|              Domain|                 MD5|            ImageUrl|
+--------------------+--------------------+--------------------+
| www.davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|
| www.davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|

It would be nice to have www.davidsuzuki.org and davidsuzuki.org combined.

To Reproduce
To reproduce the behaviour, run ExtractDomain on a large corpus. The above was generated with the following command (from #237, by @JWZ2018):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-16606*",sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

Expected behavior
Ideally, the above would have shown just:

+--------------------+--------------------+--------------------+                
|              Domain|                 MD5|            ImageUrl|
+--------------------+--------------------+--------------------+
|     davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|

I think it makes sense to drop the www, but I am agnostic if others feel it is better to keep it. Whatever we decide, it should be applied consistently.

Desktop/Laptop (please complete the following information):
Machine shouldn't matter, but above was run on an Ubuntu 16 server.

ianmilligan1 · May 25, 2018

I see @lintool's comment in #236

@TitusAn I would suggest renaming the DF ExtractDomain to ExtractBaseDomain since it also removes the www prefix. Giving it a different name will also reduce confusion in the matchbox version since it does something different.

Perhaps ExtractBaseDomain will resolve this? (two issues in one!)

greebie · Aug 29, 2018

This looks like it's been resolved at webarchive-discovery using a canonical regex. I'm going to take a crack at a fix using the same.

greebie · Aug 29, 2018

I see - it's a slightly different problem. Basically, there are two UDFs. The first, ExtractBaseDomain, uses ExtractDomain to get the base URL, which includes the www. The second, RemovePrefixWWW, removes the www. I will test the run and see what happens.
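To make the two-step idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names `extractHost` and `removePrefixWww` are hypothetical stand-ins for ExtractBaseDomain and RemovePrefixWWW; they are not the aut implementations and only assume the behaviour described in this thread (take the host, then drop a leading "www."). The example URLs use a made-up path.

```scala
import java.net.URI
import scala.util.Try

object DomainNormalizationSketch {
  // Hypothetical stand-in for ExtractBaseDomain: the host part of a URL,
  // or "" when the input cannot be parsed or has no host.
  def extractHost(url: String): String =
    Try(new URI(url).getHost).toOption.flatMap(Option(_)).getOrElse("")

  // Hypothetical stand-in for RemovePrefixWWW: drops a single leading "www.".
  def removePrefixWww(host: String): String = host.stripPrefix("www.")

  // Composed in the same order as RemovePrefixWWW(ExtractBaseDomain(src)).
  def normalize(url: String): String = removePrefixWww(extractHost(url))

  def main(args: Array[String]): Unit = {
    // Both spellings normalise to the same value: davidsuzuki.org
    println(normalize("http://www.davidsuzuki.org/logo.png"))
    println(normalize("http://davidsuzuki.org/logo.png"))
  }
}
```

Keeping the two steps separate means the host-only value is still available when the www distinction matters, while the composed form gives one consistent key for grouping.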

greebie · Aug 29, 2018

Hi Ian -- can you try the following code and see if it resolves your problem?

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val warcs = "{warc collection path}"
val data = RecordLoader.loadArchives(warcs, sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.RemovePrefixWWW(df.ExtractBaseDomain($"src")).as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()
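The effect of the normalisation on the "at least 2 distinct domains" step can be sketched with plain Scala collections. This is only an illustration under stated assumptions: the rows and the `md5-a` / `md5-b` labels are hypothetical placeholders modelled on the table in this issue, and `stripWww` stands in for the www-removal step rather than reproducing the aut UDF.

```scala
object DistinctDomainSketch {
  // Assumed normalisation: drop one leading "www." from the domain.
  def stripWww(domain: String): String = domain.stripPrefix("www.")

  // For each MD5, count the distinct domains it appears under.
  def distinctDomains(rows: Seq[(String, String)]): Map[String, Int] =
    rows.groupBy(_._2).map { case (md5, group) =>
      md5 -> group.map(_._1).distinct.size
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical (domain, md5) pairs; the MD5 labels are placeholders.
    val rows = Seq(
      ("www.davidsuzuki.org", "md5-a"),
      ("davidsuzuki.org", "md5-a"),
      ("davidsuzuki.org", "md5-b"),
      ("www.davidsuzuki.org", "md5-b"))

    val normalized = rows.map { case (domain, md5) => (stripWww(domain), md5) }

    println(distinctDomains(rows))       // each MD5 has 2 distinct domains
    println(distinctDomains(normalized)) // each MD5 has 1 distinct domain
  }
}
```

In this sketch, an image that appears only under the www and non-www spellings of one host no longer counts as shared across two domains once the prefix is removed, so it would not pass a filter requiring two or more distinct domains.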

ianmilligan1 · Aug 29, 2018

This looks good. Can you explain what you've changed here and the rationale? (I can look at the difference but figured you could walk me through it.)

greebie · Aug 29, 2018

The main difference is that the df.ExtractBaseDomain UDF (which is the same as ExtractDomain in RDD) is wrapped in df.RemovePrefixWWW, which removes the "www." prefix. As discussed above, the DF ExtractDomain has also been renamed to ExtractBaseDomain.

ianmilligan1 · Aug 30, 2018

Great, this works well. Thanks @greebie. Down the road, as we document data frames, we should make sure to use this as an example script!

ianmilligan1 added the bug label May 25, 2018

ianmilligan1 referenced this issue May 31, 2018: URL normalisation/canonicalisation fixes #176 (Merged)

ianmilligan1 closed this Aug 30, 2018

archivesunleashed/aut
