Improve ExtractDomain to Better Isolate Domains #269
Comments
ianmilligan1 added the bug label Sep 13, 2018
ianmilligan1 (Member) commented Sep 13, 2018
FWIW, in ExtractDomain we use the Java URL class to extract the host for us.
We could potentially just put in an extra line to split the string at backslash and take the first part, maybe, but I (a) don't know Java at all; (b) that might be a terrible idea.
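That "split at the backslash and take the first part" idea can be sketched in Java. This is a hypothetical helper, not the actual aut `ExtractDomain` code; the class and method names are invented for illustration. It relies on the fact that `java.net.URL` does not treat `\` as a path separator, so truncating before parsing keeps the backslash out of the host.

```java
import java.net.URL;

// Hypothetical helper (not the actual aut ExtractDomain): truncate the raw
// URL at the first backslash before handing it to java.net.URL, so the
// backslash and everything after it never reach the host parser.
public class DomainFromMalformedUrl {
    public static String extractDomain(String url) {
        int i = url.indexOf('\\');
        String cleaned = (i >= 0) ? url.substring(0, i) : url;
        try {
            return new URL(cleaned).getHost();
        } catch (Exception e) {
            // Malformed input (e.g. no protocol) yields an empty domain.
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("http://example.com\\index.html")); // example.com
        System.out.println(extractDomain("http://example.com/page.html"));   // example.com
    }
}
```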
ianmilligan1 assigned borislin Sep 18, 2018
borislin (Collaborator) commented Oct 8, 2018
@ruebot Quick question: in ExtractDomain, why do we check source first, then url? I think source will only be used when url doesn't contain any valid domain host.
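The ordering borislin describes can be illustrated as follows. This is a sketch, not the actual aut implementation; the class and method names are assumptions. It tries `url` first and falls back to `source` only when `url` yields no usable host.

```java
import java.net.URL;

// Illustrative sketch (not the actual aut code) of the suggested ordering:
// prefer the link's own url, and only fall back to the page's source when
// the url yields no valid host.
public class DomainFallback {
    static String hostOrEmpty(String s) {
        try {
            String h = new URL(s).getHost();
            return (h == null) ? "" : h;
        } catch (Exception e) {
            return "";
        }
    }

    public static String extractDomain(String url, String source) {
        String host = hostOrEmpty(url);
        return host.isEmpty() ? hostOrEmpty(source) : host;
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("http://example.com/a", "http://other.org")); // example.com
        System.out.println(extractDomain("not a url", "http://other.org"));            // other.org
    }
}
```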
greebie (Contributor) commented Oct 9, 2018
Thanks for finding this @borislin. It appears that I broke this functionality when I tried to convert using an Option approach.
Having the additional test case would help prevent this error in the future.
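A minimal sketch of the case such a regression test could pin down (this is not the project's actual test suite; the class and method names are hypothetical, and the fix under test simply truncates at the first backslash):

```java
import java.net.URL;

// Hypothetical regression check, not the project's real tests: verify that
// truncating at the first backslash before host extraction recovers the
// expected domain.
public class BackslashRegressionCheck {
    public static String fixedHost(String raw) throws Exception {
        // String.split takes a regex, so a literal backslash is "\\\\".
        return new URL(raw.split("\\\\")[0]).getHost();
    }

    public static void main(String[] args) throws Exception {
        // The pattern from the bug report: a backslash where "/" was expected.
        System.out.println(fixedHost("http://example.com\\page.html")); // example.com
        // Normal URLs must be unaffected.
        System.out.println(fixedHost("http://example.com/page.html"));  // example.com
    }
}
```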
borislin (Collaborator) commented Oct 10, 2018
@ianmilligan1 Do you have an example archive file that contains a backslash in the URL so I can test?
ianmilligan1 (Member) commented Oct 10, 2018
I don't – I know there's one somewhere in collection 5421 though. It's 50GB. Do you want me to start a wget job and park it somewhere on tuna?
borislin (Collaborator) commented Oct 10, 2018
@ianmilligan1 I'll try to find other ways to fake one and do the testing. But sure, we still need a real-life example for a final test to make sure my fix works. Please help move it to tuna, and let me know when it's done and the path to the collection.
ianmilligan1 (Member) commented Oct 11, 2018
Sorry for delay (am in European timezone today) – the collection is @ /tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl
borislin (Collaborator) commented Oct 13, 2018
@ianmilligan1 Can you give me the script you ran that produces this URL issue? I'm not able to reproduce it with this collection. The script I'm using is /tuna1/scratch/aut-issue-269/spark_jobs/269.scala, the output files are in /tuna1/scratch/aut-issue-269/derivatives/all-domains/, and the combined output file is /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt. Could you please provide me with all the steps you've taken to reproduce this issue?
ianmilligan1 (Member) commented Oct 13, 2018
It was appearing when running the link generator, i.e. the standard AUK job:
val links = RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""), ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")
Try running the GraphML generator on the collection? Thanks @borislin !
ianmilligan1 commented Sep 13, 2018
Describe the bug
ExtractDomain should be producing domains like:
etc.
At times we see domains like this, however
This is probably due to the URL having a backslash rather than the expected forward slash.
Expected behavior
In the above example, we should probably have:
This has impacts on the ensuing GEXF files.
What should we do?
Improve the ExtractDomain UDF so that it captures the domain correctly when the URL contains a backslash as well.
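Putting the pieces of this thread together, the proposed fix might look like the following. This is only a sketch of the idea, not the actual aut UDF; the class name and `apply` signature are assumptions. It strips everything from the first backslash onward, extracts the host, and falls back to the source URL when the primary URL yields no host (the ordering discussed above).

```java
import java.net.URL;

// Sketch of the proposed improvement (not the actual aut ExtractDomain UDF):
// truncate at the first backslash, extract the host with java.net.URL, and
// fall back to the source URL when the primary URL yields no host.
public class ExtractDomainSketch {
    private static String host(String s) {
        if (s == null || s.isEmpty()) return "";
        int i = s.indexOf('\\');
        String cleaned = (i >= 0) ? s.substring(0, i) : s;
        try {
            String h = new URL(cleaned).getHost();
            return (h == null) ? "" : h;
        } catch (Exception e) {
            return "";
        }
    }

    public static String apply(String url, String source) {
        String h = host(url);
        return h.isEmpty() ? host(source) : h;
    }

    public static void main(String[] args) {
        System.out.println(apply("http://example.com\\faq.html", "")); // example.com
        System.out.println(apply("bad", "http://fallback.org/"));      // fallback.org
    }
}
```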