Improve ExtractDomain to Better Isolate Domains #269

Closed
ianmilligan1 opened this Issue Sep 13, 2018 · 10 comments

ianmilligan1 (Member) commented Sep 13, 2018

Describe the bug
ExtractDomain should produce domains like:

www.archive.org
www.liberal.ca

etc.

At times, however, we see domains like this:

seetorontonow.canada-booknow.com\booking_results.php

This is probably due to the URL having a backslash rather than the expected forward slash.

Expected behavior
In the above example, we should probably have:

seetorontonow.canada-booknow.com

This affects the GEXF files generated downstream.

What should we do?
Improve the ExtractDomain UDF so that it treats a backslash as a domain boundary as well.
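
To illustrate the suspected cause (a standalone sketch, not part of aut): java.net.URL treats a forward slash as the end of the host, but apparently not a backslash, so the backslash and everything after it remain in getHost.

import java.net.URL

object BackslashDemo {
  def main(args: Array[String]): Unit = {
    // A URL whose path is separated by a backslash instead of '/'.
    val u = new URL("http://seetorontonow.canada-booknow.com\\booking_results.php")
    // Suspected output: seetorontonow.canada-booknow.com\booking_results.php,
    // since java.net.URL does not appear to treat '\' as a host delimiter.
    println(u.getHost)
  }
}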

ianmilligan1 added the bug label Sep 13, 2018

ianmilligan1 (Member) commented Sep 13, 2018

FWIW, in ExtractDomain we use the Java URL class to extract the host for us.

We could potentially just add an extra line that splits the string at the backslash and takes the first part, but (a) I don't know Java at all, and (b) that might be a terrible idea.
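
As a sketch of that idea (a hypothetical helper, not the actual ExtractDomain code), the pre-split might look like this in Scala:

import java.net.URL

// Hypothetical pre-processing: cut the raw URL at the first backslash,
// then let java.net.URL pull out the host as before.
def hostIgnoringBackslash(rawUrl: String): String = {
  val cleaned = rawUrl.split('\\')(0) // keep everything before the first '\'
  new URL(cleaned).getHost            // still throws on a malformed URL
}

// hostIgnoringBackslash("http://seetorontonow.canada-booknow.com\\booking_results.php")
// should yield "seetorontonow.canada-booknow.com"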

borislin (Collaborator) commented Oct 8, 2018

@ruebot Quick question: in ExtractDomain, why do we check source first and then url? I think source should only be used when url doesn't contain a valid domain host.
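
For reference, a url-first version of that logic might look like the following sketch (it assumes an ExtractDomain(url, source)-style signature and is not the current source):

import java.net.URL
import scala.util.Try

// Sketch: take the host from url when possible, and fall back to
// source only when url yields no usable host.
def extractDomain(url: String, source: String = ""): String = {
  def host(s: String): Option[String] =
    Try(new URL(s).getHost).toOption.filter(_.nonEmpty)
  host(url).orElse(host(source)).getOrElse("")
}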

lintool (Member) commented Oct 9, 2018

@borislin According to git blame, that's @greebie's code.

Either way, we'll need more test cases and better coverage here...
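
A regression test pinned to the report above might look like this ScalaTest sketch (it assumes the ExtractDomain(url) entry point and the fixed, backslash-aware behavior; the import of ExtractDomain is omitted since its package path depends on the aut version):

import org.scalatest.FunSuite

// Sketch of a regression test for the backslash case; assumes the
// fixed ExtractDomain isolates the host before the first backslash.
class ExtractDomainBackslashTest extends FunSuite {
  test("domain is isolated when the path uses a backslash") {
    val url = "http://seetorontonow.canada-booknow.com\\booking_results.php"
    assert(ExtractDomain(url) === "seetorontonow.canada-booknow.com")
  }
}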

greebie (Contributor) commented Oct 9, 2018

Thanks for finding this, @borislin. It appears that I broke this functionality when I converted the code to an Option-based approach.

Adding a test case for this would help prevent the error in future.

borislin (Collaborator) commented Oct 10, 2018

@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?

ianmilligan1 (Member) commented Oct 10, 2018

I don't – I know there's one somewhere in collection 5421, though. It's 50GB. Do you want me to start a wget job and park it somewhere on tuna?

borislin (Collaborator) commented Oct 10, 2018

@ianmilligan1 I'll try to find a way to fake one and do some testing. But yes, we still need a real-life example for a final test to make sure my fix works. Please move the collection to tuna and let me know when it's done, along with the path.

ianmilligan1 (Member) commented Oct 11, 2018

Sorry for the delay (I'm in a European timezone today) – the collection is at /tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl

borislin (Collaborator) commented Oct 13, 2018

@ianmilligan1 Can you give me the script you ran that produces this URL issue?

I'm not able to reproduce the issue with this collection. The script I'm using is /tuna1/scratch/aut-issue-269/spark_jobs/269.scala and the output files are in /tuna1/scratch/aut-issue-269/derivatives/all-domains/. The combined output file is /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt.

Could you please list all the steps you've taken to reproduce this issue?

ianmilligan1 (Member) commented Oct 13, 2018

It was appearing when running the link generator, i.e. the standard AUK job:

val links = RecordLoader.loadArchives("#{collection_warcs}", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")

Try running the GraphML generator on the collection? Thanks, @borislin!
