Improve ExtractDomain to Better Isolate Domains #269

Closed
ianmilligan1 opened this Issue Sep 13, 2018 · 10 comments

ianmilligan1 (Member) commented Sep 13, 2018

Describe the bug
ExtractDomain should produce domains like:

www.archive.org
www.liberal.ca

etc.

At times, however, we see domains like this:

seetorontonow.canada-booknow.com\booking_results.php

This is probably due to the URL having a backslash rather than the expected forward slash.

Expected behavior
In the above example, we should probably have:

seetorontonow.canada-booknow.com

This affects the GEXF files generated downstream.

What should we do?
Improve the ExtractDomain UDF so that it treats a backslash as a domain boundary as well.
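
To illustrate the suspected cause (a standalone sketch, not part of aut): java.net.URL treats a forward slash as the end of the host, but apparently not a backslash, so the backslash and everything after it remain in getHost.

import java.net.URL

object BackslashDemo {
  def main(args: Array[String]): Unit = {
    // A URL whose path is separated by a backslash instead of '/'.
    val u = new URL("http://seetorontonow.canada-booknow.com\\booking_results.php")
    // Suspected output: seetorontonow.canada-booknow.com\booking_results.php,
    // since java.net.URL does not appear to treat '\' as a host delimiter.
    println(u.getHost)
  }
}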

ianmilligan1 added the bug label Sep 13, 2018

ianmilligan1 (Member) commented Sep 13, 2018

FWIW, in ExtractDomain we use the Java URL class to extract the host for us.

We could potentially just add an extra line that splits the string at the backslash and takes the first part, but (a) I don't know Java at all, and (b) that might be a terrible idea.
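
As a sketch of that idea (a hypothetical helper, not the actual ExtractDomain code), the pre-split might look like this in Scala:

import java.net.URL

// Hypothetical pre-processing: cut the raw URL at the first backslash,
// then let java.net.URL pull out the host as before.
def hostIgnoringBackslash(rawUrl: String): String = {
  val cleaned = rawUrl.split('\\')(0) // keep everything before the first '\'
  new URL(cleaned).getHost            // still throws on a malformed URL
}

// hostIgnoringBackslash("http://seetorontonow.canada-booknow.com\\booking_results.php")
// should yield "seetorontonow.canada-booknow.com"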

borislin (Collaborator) commented Oct 8, 2018

@ruebot Quick question: in ExtractDomain, why do we check source first and then url? I think source should only be used when url doesn't contain a valid domain host.
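
For reference, a url-first version of that logic might look like the following sketch (it assumes an ExtractDomain(url, source)-style signature and is not the current source):

import java.net.URL
import scala.util.Try

// Sketch: take the host from url when possible, and fall back to
// source only when url yields no usable host.
def extractDomain(url: String, source: String = ""): String = {
  def host(s: String): Option[String] =
    Try(new URL(s).getHost).toOption.filter(_.nonEmpty)
  host(url).orElse(host(source)).getOrElse("")
}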

lintool (Member) commented Oct 9, 2018

@borislin According to git blame, that's @greebie's code.

Either way, we'll need more test cases and better coverage here...
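
A regression test pinned to the report above might look like this ScalaTest sketch (it assumes the ExtractDomain(url) entry point and the fixed, backslash-aware behavior; the import of ExtractDomain is omitted since its package path depends on the aut version):

import org.scalatest.FunSuite

// Sketch of a regression test for the backslash case; assumes the
// fixed ExtractDomain isolates the host before the first backslash.
class ExtractDomainBackslashTest extends FunSuite {
  test("domain is isolated when the path uses a backslash") {
    val url = "http://seetorontonow.canada-booknow.com\\booking_results.php"
    assert(ExtractDomain(url) === "seetorontonow.canada-booknow.com")
  }
}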

greebie (Contributor) commented Oct 9, 2018

Thanks for finding this, @borislin. It appears that I broke this functionality when I converted the code to an Option-based approach.

Adding a test case for this would help prevent the error in future.

borislin (Collaborator) commented Oct 10, 2018

@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?

ianmilligan1 (Member) commented Oct 10, 2018

I don't – I know there's one somewhere in collection 5421, though. It's 50GB. Do you want me to start a wget job and park it somewhere on tuna?

borislin (Collaborator) commented Oct 10, 2018

@ianmilligan1 I'll try to find a way to fake one and do some testing. But yes, we still need a real-life example for a final test to make sure my fix works. Please move the collection to tuna and let me know when it's done, along with the path.

ianmilligan1 (Member) commented Oct 11, 2018

Sorry for the delay (I'm in a European timezone today) – the collection is at /tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl

borislin (Collaborator) commented Oct 13, 2018

@ianmilligan1 Can you give me the script you ran that produces this URL issue?

I'm not able to reproduce the issue with this collection. The script I'm using is /tuna1/scratch/aut-issue-269/spark_jobs/269.scala and the output files are in /tuna1/scratch/aut-issue-269/derivatives/all-domains/. The combined output file is /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt.

Could you please list all the steps you've taken to reproduce this issue?

ianmilligan1 (Member) commented Oct 13, 2018

It was appearing when running the link generator, i.e. the standard AUK job:

val links = RecordLoader.loadArchives("#{collection_warcs}", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")

Try running the GraphML generator on the collection? Thanks, @borislin!
