Improve ExtractDomain Normalization #239

Closed
ianmilligan1 opened this Issue May 25, 2018 · 7 comments

Member

ianmilligan1 commented May 25, 2018

Describe the bug
Right now, we have uneven behaviour with ExtractDomain. For example, in #237, when extracting domains we find things like:

+--------------------+--------------------+--------------------+                
|              Domain|                 MD5|            ImageUrl|
+--------------------+--------------------+--------------------+
| www.davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|
| www.davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|

It would be nice to have www.davidsuzuki.org and davidsuzuki.org combined.

To Reproduce
To reproduce the behaviour, run ExtractDomain on a large corpus. The output above was generated with the following commands (posted in #237 by @JWZ2018):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val data = RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_01/partner.archive-it.org/cgi-bin/getarcs.pl/ARCHIVEIT-227-QUARTERLY-16606*",sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.ExtractDomain($"src").as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

Expected behavior
Ideally, in the above we would have just seen:

+--------------------+--------------------+--------------------+                
|              Domain|                 MD5|            ImageUrl|
+--------------------+--------------------+--------------------+
|     davidsuzuki.org|10e2370b0958cd978...|http://www.davids...|
|     davidsuzuki.org|1576ed906c5f34291...|http://www.davids...|

I think it makes sense to drop the www, but I am agnostic if others feel it is better to keep it. Whatever decision we make, it should be applied consistently.

Desktop/Laptop (please complete the following information):
Machine shouldn't matter, but above was run on an Ubuntu 16 server.

@ianmilligan1 ianmilligan1 added the bug label May 25, 2018

Member

ianmilligan1 commented May 25, 2018

I see @lintool's comment in #236

@TitusAn I would suggest renaming the DF ExtractDomain to ExtractBaseDomain since it also removes the www prefix. Giving it a different name will also reduce confusion in the matchbox version since it does something different.

Perhaps ExtractBaseDomain will resolve this? (two issues in one!)

Contributor

greebie commented Aug 29, 2018

This looks like it's been resolved at webarchive-discovery using a canonical regex. I'm going to take a crack at a fix using the same approach.

Contributor

greebie commented Aug 29, 2018

I see - it's a slightly different problem. Basically, there are two UDFs. The first, ExtractBaseDomain, uses ExtractDomain to get the base URL, which still includes the www. The second, RemovePrefixWWW, removes the www. I will test a run and see what happens.
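
For readers who want to see the two-step idea outside of Spark, here is a minimal plain-Scala sketch. It is not the toolkit's actual implementation: `extractHost`, `removePrefixWww`, and `normalizedDomain` are hypothetical names used only for illustration, and the lowercasing step is an extra assumption that the comment above does not mention.

```scala
import java.net.URI
import scala.util.Try

object DomainNormalization {
  // Hypothetical stand-in for the host-extraction step: returns the host
  // part of a URL, or an empty string if the URL cannot be parsed or has no host.
  def extractHost(url: String): String =
    Try(Option(new URI(url).getHost)).toOption.flatten.getOrElse("")

  // Hypothetical stand-in for the prefix-removal step: lowercases the host
  // (an assumption, not from the thread) and strips one leading "www.".
  def removePrefixWww(host: String): String =
    host.toLowerCase.replaceFirst("^www\\.", "")

  // Compose the two steps, mirroring the "wrap one UDF in the other" idea.
  def normalizedDomain(url: String): String =
    removePrefixWww(extractHost(url))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://www.davidsuzuki.org/images/logo.png",
      "http://davidsuzuki.org/images/logo.png"
    )
    // Both URLs collapse to the same normalized domain.
    urls.map(normalizedDomain).distinct.foreach(println)
  }
}
```

Anchoring the pattern with `^` means only a leading `www.` is removed, so a host such as `wwwexample.org` is left untouched.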

Contributor

greebie commented Aug 29, 2018

Hi Ian -- can you try the following code and see if it resolves your problem?

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
val warcs = "{warc collection path}"
val data = RecordLoader.loadArchives(warcs, sc)
import spark.implicits._
val domains = data.extractImageLinksDF().select(df.RemovePrefixWWW(df.ExtractBaseDomain($"src")).as("Domain"), $"image_url".as("ImageUrl"));
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"));

//domains and images in one table
val total = domains.join(images, "ImageUrl")
//group same images by MD5 and only keep the md5 with at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
//rejoin with images to get the list of image urls
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5")).show()

Member

ianmilligan1 commented Aug 29, 2018

This looks good. Can you explain what you've changed here and the rationale? (I can look at the difference, but figured you could walk me through it.)

Contributor

greebie commented Aug 29, 2018

The main difference is that the df.ExtractBaseDomain UDF (which is the same as ExtractDomain in RDD) is wrapped in df.RemovePrefixWWW, which removes the "www." prefix. As discussed above, the DF ExtractDomain has also been renamed to ExtractBaseDomain.

Member

ianmilligan1 commented Aug 30, 2018

Great, this works well. Thanks @greebie. Down the road as we document data frames we should make sure to use this as an example script!
