Improve ExtractDomain Normalization #239
Comments
ianmilligan1 added the bug label May 25, 2018
ianmilligan1
May 25, 2018
Member
I see @lintool's comment in #236:
"@TitusAn I would suggest renaming the DF ExtractDomain to ExtractBaseDomain since it also removes the www prefix. Giving it a different name will also reduce confusion in the matchbox version since it does something different."
Perhaps ExtractBaseDomain will resolve this? (two issues in one!)
ianmilligan1 referenced this issue May 31, 2018
Merged: URL normalisation/canonicalisation fixes #176
greebie
Aug 29, 2018
Contributor
This looks like it's been resolved at webarchive-discovery using a canonical regex. I'm going to take a crack at a fix using the same approach.
greebie
Aug 29, 2018
Contributor
Hi Ian -- can you try the following code and see if it resolves your problem?
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.df._
// countDistinct, first and asc come from here (spark-shell imports this automatically)
import org.apache.spark.sql.functions._
val warcs = "{warc collection path}"
val data = RecordLoader.loadArchives(warcs, sc)
import spark.implicits._
// Domain of the page each image link was found on, with the "www." prefix removed
val domains = data.extractImageLinksDF().select(df.RemovePrefixWWW(df.ExtractBaseDomain($"src")).as("Domain"), $"image_url".as("ImageUrl"))
val images = data.extractImageDetailsDF().select($"url".as("ImageUrl"), $"md5".as("MD5"))
// Domains and images in one table
val total = domains.join(images, "ImageUrl")
// Group identical images by MD5 and keep only the MD5s seen on at least 2 distinct domains
val links = total.groupBy("MD5").count().where(countDistinct("Domain")>=2)
// Rejoin with the full table to get one image URL per (Domain, MD5) pair
val result = total.join(links, "MD5").groupBy("Domain","MD5").agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
result.show()
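The grouping logic in the script above can be hard to follow in one-line DataFrame form. The following is a minimal sketch of the same idea using plain Scala collections, with no Spark or Archives Unleashed dependency. The names `ImageHit` and `sharedAcrossDomains`, and the sample URLs and MD5 values, are hypothetical and chosen only for illustration. Unlike the script, the sketch also sorts by domain within each MD5 so that its output is deterministic.

```scala
object SharedImagesSketch {
  // Hypothetical record: one image occurrence together with its normalized domain.
  final case class ImageHit(domain: String, md5: String, imageUrl: String)

  // Keep only MD5s that appear on at least `minDomains` distinct domains,
  // then return one representative URL per (domain, md5) pair, ordered by MD5.
  def sharedAcrossDomains(hits: Seq[ImageHit], minDomains: Int = 2): Seq[ImageHit] =
    hits.groupBy(_.md5)
      .filter { case (_, group) => group.map(_.domain).distinct.size >= minDomains }
      .values
      .flatMap(group => group.groupBy(_.domain).values.map(_.head))
      .toSeq
      .sortBy(h => (h.md5, h.domain))

  def main(args: Array[String]): Unit = {
    // Made-up sample data: "abc" is shared by two domains, "def" is not.
    val hits = Seq(
      ImageHit("davidsuzuki.org", "abc", "http://davidsuzuki.org/logo.png"),
      ImageHit("example.org", "abc", "http://example.org/logo.png"),
      ImageHit("example.org", "def", "http://example.org/only-here.png")
    )
    sharedAcrossDomains(hits).foreach(println)
  }
}
```

With this sample data only the two "abc" rows survive, because "def" appears on a single domain.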
ianmilligan1
Aug 29, 2018
Member
This looks good. Can you explain what you've changed here and the rationale? (I can look at the difference but figured you could walk me through it)
greebie
Aug 29, 2018
Contributor
The main difference is that the df.ExtractBaseDomain UDF (which is the same as ExtractDomain in RDD) is wrapped in df.RemovePrefixWWW, which removes the "www." prefix. As discussed above, the DF ExtractDomain has also been renamed to ExtractBaseDomain.
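To illustrate the effect of that wrapping, here is a minimal sketch in plain Scala of extracting a host and then stripping a leading "www.". It is not the Archives Unleashed implementation: the helpers `extractHost` and `removePrefixWww` and the sample URL paths are hypothetical, and the real UDFs may handle more cases.

```scala
import java.net.URL
import scala.util.Try

object DomainNormalizationSketch {
  // Hypothetical helper: return the host part of a URL, or "" if it cannot be parsed.
  def extractHost(url: String): String =
    Try(new URL(url).getHost).toOption.filter(_ != null).getOrElse("")

  // Hypothetical helper: drop a single leading "www." (case-insensitive).
  def removePrefixWww(host: String): String =
    host.replaceAll("(?i)^www\\.", "")

  def main(args: Array[String]): Unit = {
    // Made-up paths on the two host variants mentioned in this issue.
    val urls = Seq(
      "http://www.davidsuzuki.org/issues/",
      "http://davidsuzuki.org/blogs/"
    )
    // Both URLs collapse to the same domain once the prefix is removed.
    val domains = urls.map(u => removePrefixWww(extractHost(u))).distinct
    println(domains.mkString(", "))
  }
}
```

Both sample URLs normalize to davidsuzuki.org, which is the combined behaviour the issue asks for.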
ianmilligan1
Aug 30, 2018
Member
Great, this works well. Thanks @greebie. Down the road as we document data frames we should make sure to use this as an example script!
ianmilligan1 commented May 25, 2018
Describe the bug
Right now, we have uneven behaviour with ExtractDomain. For example, in #237, when extracting domains we find things like:
It would be nice to have www.davidsuzuki.org and davidsuzuki.org combined.
To Reproduce
To reproduce the behaviour, run ExtractDomain on a large corpus. The above was generated with the command (in #237 by @JWZ2018)
Expected behavior
Ideally, in the above we would have just seen:
I think it makes sense to drop the www, but I am agnostic if others feel it is better to keep it. Whatever decision we make, it should just be consistent.
Desktop/Laptop (please complete the following information):
Machine shouldn't matter, but the above was run on an Ubuntu 16 server.