Patch for #269: Replace backslash with forward slash in URL #276

Merged
merged 4 commits into master from fix-url on Oct 17, 2018

@borislin
Collaborator

borislin commented Oct 16, 2018

This PR improves ExtractDomain by replacing backslashes with forward slashes in a URL before passing it to Java's URL class.


GitHub issue(s): #269

What does this Pull Request do?

This PR improves URL parsing in ExtractDomain by replacing backslashes with forward slashes before the URL is passed to Java's URL class, which allows ExtractDomain to capture the true domain of a URL.
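As a rough illustration of the idea described above (not the actual ExtractDomain code from this PR), the sketch below normalises backslashes before handing the string to java.net.URL. The object name BackslashDomainSketch and the helper hostOf are hypothetical and exist only for this example.

```scala
import java.net.URL
import scala.util.Try

object BackslashDomainSketch {
  // Hypothetical helper, not the project's ExtractDomain:
  // replace every backslash with a forward slash, then let
  // java.net.URL parse the result and return its host.
  // Strings that java.net.URL cannot parse yield None.
  def hostOf(url: String): Option[String] =
    Try(new URL(url.replace("\\", "/")).getHost).toOption

  def main(args: Array[String]): Unit = {
    // The problematic URL shape reported in this PR.
    val raw = "http://seetorontonow.canada-booknow.com\\booking_results.php"
    println(hostOf(raw)) // Some(seetorontonow.canada-booknow.com)
  }
}
```

Wrapping the call in Try is a choice made for this sketch so that malformed input does not throw; how the real ExtractDomain handles such input is not shown here.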

How should this be tested?

  • git fetch --all
  • git checkout fix-url
  • mvn clean install
  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
  • Adapt the script below:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

// Input archive files; adapt this path and the output paths to your environment.
val input = "/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz"

val output1 = "/tuna1/scratch/aut-issue-269/derivatives/all-domains"
val output2 = "/tuna1/scratch/aut-issue-269/derivatives/all-text"
val output3 = "/tuna1/scratch/aut-issue-269/derivatives/gephi"

// Job 1: count how often each extracted domain occurs.
RecordLoader.loadArchives(input, sc).map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile(output1)

// Job 2: save crawl date, domain, URL and HTML-stripped text for every valid page.
RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile(output2)

// Job 3: build the domain-to-domain link graph, keep links seen more than 5 times, and write GraphML.
val links = RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, output3)

sys.exit
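The link-graph step in the script above strips a leading "www." from each extracted domain with replaceAll. A minimal standalone sketch of what that regular expression does, using only the Scala standard library, might look like this. The object name StripWwwSketch and the helper stripWww are hypothetical, and the example domains are made up.

```scala
object StripWwwSketch {
  // Same pattern as in the script: optional leading whitespace,
  // followed by a literal "www." anchored at the start of the string.
  def stripWww(domain: String): String =
    domain.replaceAll("^\\s*www\\.", "")

  def main(args: Array[String]): Unit = {
    println(stripWww("www.example.com"))      // example.com
    println(stripWww("  www.example.com"))    // example.com
    println(stripWww("news.www.example.com")) // unchanged: "www." is not at the start
  }
}
```

Because the pattern is anchored with ^, only a prefix is removed; a "www." that appears later in the host name is left alone.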

Current Results

With this PR patch:

  • /tuna1/scratch/aut-issue-269/derivatives/all-domains or /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt (a combined version of all files in /tuna1/scratch/aut-issue-269/derivatives/all-domains)
  • /tuna1/scratch/aut-issue-269/derivatives/gephi (no longer contains backslashes; the proper domain seetorontonow.canada-booknow.com is now extracted from the URL)

Without this PR patch (master branch):

  • /tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch or /tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch.txt (combined version)
  • /tuna1/scratch/aut-issue-269/derivatives/gephi-without-patch (still contains backslashes, as in seetorontonow.canada-booknow.com\booking_results.php)

Interested parties

@lintool @ianmilligan1 @ruebot @greebie

codecov-io commented Oct 16, 2018

Codecov Report

Merging #276 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #276   +/-   ##
=======================================
  Coverage   70.36%   70.36%           
=======================================
  Files          41       41           
  Lines        1046     1046           
  Branches      192      192           
=======================================
  Hits          736      736           
  Misses        244      244           
  Partials       66       66
Impacted Files Coverage Δ
.../io/archivesunleashed/matchbox/ExtractDomain.scala 87.5% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4fe05a5...bf458e9. Read the comment docs.

@borislin borislin requested review from greebie, lintool and ruebot Oct 16, 2018

Member

ruebot commented Oct 17, 2018

@ianmilligan1 you want to test this one out since it is for #269?

@borislin can you update your branch?

Member

ianmilligan1 commented Oct 17, 2018

@ruebot yep, will do!

@ianmilligan1

Tested and works well – thanks @borislin!

@ruebot

ruebot approved these changes Oct 17, 2018

@ruebot ruebot merged commit 7c3a80d into master Oct 17, 2018

4 checks passed

codecov/patch: 100% of diff hit (target 70.36%)
codecov/project: 70.36% (+0%) compared to 4fe05a5
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@ruebot ruebot deleted the fix-url branch Oct 17, 2018
