Add save to wayback function(s). (Issue #2) #3

greebie · Jan 11, 2019

This PR adds save to wayback functionality.

It can save a single URL:

r <- save_wayback("qdr.syr.edu")
r
$url
 [1] "https://qdr.syr.edu/"

$archived_snapshots
$archived_snapshots$closest
$archived_snapshots$closest$status
[1] "200"

$archived_snapshots$closest$available
[1] TRUE

$archived_snapshots$closest$url
[1] "http://web.archive.org/web/20190111181645/https://qdr.syr.edu/"

$archived_snapshots$closest$timestamp
[1] "20190111181645"
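
The nested list above can be read with ordinary `$` indexing. The following is a minimal sketch that rebuilds the printed result by hand so it runs without a network call; `snapshot_url` is a hypothetical helper for illustration, not a function from archivr.

```r
# Rebuild the result printed above as a plain R list (no network call).
r <- list(
  url = "https://qdr.syr.edu/",
  archived_snapshots = list(
    closest = list(
      status = "200",
      available = TRUE,
      url = "http://web.archive.org/web/20190111181645/https://qdr.syr.edu/",
      timestamp = "20190111181645"
    )
  )
)

# Hypothetical helper: return the snapshot URL, or NA if no snapshot is
# reported as available.
snapshot_url <- function(result) {
  closest <- result$archived_snapshots$closest
  if (is.null(closest) || !isTRUE(closest$available)) {
    return(NA_character_)
  }
  closest$url
}

snapshot_url(r)
```

Returning `NA` rather than raising an error is only one possible choice; it keeps the helper easy to apply over many results.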

Or it can save a list of URLs:

r <- archiv(c("qdr.syr.edu", "notaurl", "www.google.com"))

                    url status available?
1  https://qdr.syr.edu/    200       TRUE
2               notaurl    000      FALSE
3 http://www.google.com    200       TRUE
                                                      wayback_url
1  http://web.archive.org/web/20190111190523/https://qdr.syr.edu/
2                                                   url not found
3 http://web.archive.org/web/20190111190630/http://www.google.com
       timestamp
1 20190111190523
2        unknown
3 20190111190630
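
As a rough illustration of working with this result, the sketch below rebuilds the data frame by hand and picks out the URLs that could not be saved. It assumes the availability column is named `available` (the printed header shows `available?`), and it does not call `archiv` itself.

```r
# Hand-built copy of the result shown above; the column name `available`
# is an assumption made to keep the example simple.
res <- data.frame(
  url = c("https://qdr.syr.edu/", "notaurl", "http://www.google.com"),
  status = c("200", "000", "200"),
  available = c(TRUE, FALSE, TRUE),
  wayback_url = c(
    "http://web.archive.org/web/20190111190523/https://qdr.syr.edu/",
    "url not found",
    "http://web.archive.org/web/20190111190630/http://www.google.com"
  ),
  timestamp = c("20190111190523", "unknown", "20190111190630"),
  stringsAsFactors = FALSE
)

# URLs that were not archived, e.g. to retry or report them.
failed <- res$url[!res$available]
failed
```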

adam3smith · Jan 15, 2019

adam3smith approved these changes Jan 15, 2019

Saving to WBM works flawlessly. I also tested archiving a page with NOARCHIVE in robots, and the script returned the appropriate 403 message.

adam3smith · Jan 15, 2019

Still testing saving urls from files

greebie · Jan 15, 2019

Fingers crossed! I am starting on issue #5 now. Getting there!

adam3smith · Jan 15, 2019

adam3smith reviewed Jan 15, 2019

archivr.R

greebie · Jan 17, 2019

I included the relaxed regex, but it seems that I get false negatives in docx files (I haven't tried other formats). For example, from my syllabus, I got "http://gutenberg.ca/ebooks/innis-minerva/innis-minerva-00-h.html.Turner", which suggests that somewhere along the line the script is not detecting \n etc.
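
One way this kind of result could arise is sketched below, assuming the docx paragraphs are joined without a separator before matching. The pattern `url_pattern`, the helper `extract_urls`, and the sample sentences are all made up for illustration; they are not archivr's actual regex or code, and the real cause may differ.

```r
# Illustrative pattern only: a URL is "http(s)://" followed by non-space characters.
url_pattern <- "https?://[^[:space:]]+"

# Two made-up paragraphs, as they might come out of a docx file.
paragraphs <- c(
  "See http://gutenberg.ca/ebooks/innis-minerva/innis-minerva-00-h.html.",
  "Turner and others."
)

# Hypothetical helper: find URLs, then drop trailing sentence punctuation.
extract_urls <- function(text) {
  hits <- regmatches(text, gregexpr(url_pattern, text))[[1]]
  sub("[.,;:!?)]+$", "", hits)
}

# Joined with no separator, the next paragraph's first word sticks to the URL.
glued <- extract_urls(paste(paragraphs, collapse = ""))

# Joined with a newline, the URL ends where the paragraph ends.
separated <- extract_urls(paste(paragraphs, collapse = "\n"))

glued
separated
```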

Issues #7, #8 and #11 are resolved in the last commit.

adam3smith · Jan 17, 2019

adam3smith reviewed Jan 17, 2019

adam3smith · Jan 17, 2019

archivr.R

@@ -220,7 +308,7 @@ set_api_key <- function (key) {
 #'
 #' @param url The url to extract urls.
 #' @return a vector of urls.
-get_urls_from_webpage <- function (url) {
+extract_urls_from_webpage <- function (url) {


This function doesn't extract internal/relative links (because of the startsWith command). I think generally that's the right choice. We should either
a) document it (I think this is probably the better option) or
b) add an option for this that it would inherit from archiv.fromURL

What do you think?

I think documenting would be better for now. If all things work properly, I think fixing the README will be part of the PR that turns this into a package. (Packaging will require some serious refactoring of the code).
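
To make the behaviour under discussion concrete, here is a minimal sketch of how a `startsWith` filter drops relative links. The prefix `"http"` and the sample links are assumptions for illustration; the actual check in archivr.R is not shown in this thread.

```r
# Made-up links as they might appear in a page's href attributes.
links <- c(
  "https://qdr.syr.edu/",
  "/about",
  "deposit.html",
  "http://www.google.com",
  "#top"
)

# Keep only links that begin with "http" (assumed prefix); relative and
# fragment links do not match and are silently dropped.
absolute <- links[startsWith(links, "http")]
absolute
```

If relative links were ever wanted, they would first need to be resolved against the page URL, which is the extra option mentioned in (b).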

adam3smith · Jan 17, 2019

adam3smith reviewed Jan 17, 2019

archivr.R

adam3smith · Jan 17, 2019

adam3smith reviewed Jan 17, 2019

archivr.R

adam3smith · Jan 26, 2019

@greebie -- I think we're good to merge this PR and turn this into a package. I'll let you handle the merge. Thanks!

Add save to wayback function(s).

9351022

greebie requested a review from adam3smith Jan 11, 2019

Add functions to save urls in a file or url.

f2c2a4c

greebie referenced this pull request Jan 15, 2019
Closed
Set up saving urls from a webpage or markdown #4

greebie added some commits Jan 17, 2019

Update based on PR review and issues.

f70725e

(Partially) Resolve issues with docx files.
Relax regex for detecting urls.

b439114

get_urls_from_webpage to extract_urls_from_webpage

468ac64

greebie merged commit d43e594 into master Jan 26, 2019

greebie deleted the issue-2 branch Jan 26, 2019

QualitativeDataRepository/archivr
