Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse xml and htm[l]? documents #31

Open
wants to merge 7 commits into
base: master
from

Conversation

@adam3smith
Copy link
Contributor

adam3smith commented Oct 17, 2019

Closes #30
ideas for doing this more elegantly welcome, of course

Closes #30 
ideas for doing this more elegantly welcome, of course
@adam3smith adam3smith requested a review from greebie Oct 17, 2019
@adam3smith

This comment has been minimized.

Copy link
Contributor Author

adam3smith commented Nov 2, 2019

Closes #12

@adam3smith

This comment has been minimized.

Copy link
Contributor Author

adam3smith commented Nov 2, 2019

@greebie if you'd have time to review this? If code is OK, I'd add to docs and then I think we have this in good enough shape to start promoting, even if we aren't quite ready for CRAN yet.

Copy link
Collaborator

greebie left a comment

Just a minor issue that broke Travis and Appveyor (see review).

R/archivr.R Show resolved Hide resolved
adam3smith added 2 commits Nov 6, 2019
Add except to examples and include in Readme
Update examples & man files.

from_wayback and from_perma now both return equivalent lists. 
view_archiv now sets available to true if either has a copy.

I've removed the "status" column/variable, since it's just the status of 
the API call, not that of the URL, so doesn't add anything.

I've also made all examples assign function results to a new variable 
rather than print them to declutter CI (and model good practice)
@greebie greebie self-requested a review Nov 7, 2019
R/archivr.R Outdated
@@ -259,7 +260,7 @@ view_archiv <- function (lst, method="wayback") {
#' @return a dataframe containing the url, status, availability,
#' archived url(s) and timestamp(s)
#' @examples
#' view_archiv.fromUrl("https://www-cs-faculty.stanford.edu/~knuth/retd.html", method="both")
#' checkArchiveStatus <- view_archiv.fromUrl("https://www-cs-faculty.stanford.edu/~knuth/retd.html", method="both")

This comment has been minimized.

Copy link
@greebie

greebie Nov 7, 2019

Collaborator

Checkreview complained about the example line being over 100 characters. I suggest:

#' checkArchiveStatus <- view_archiv.fromUrl(
#'     "https (..)",
#'     method="both")
@@ -273,7 +274,7 @@ view_archiv.fromUrl <- function (url, method="wayback") {
#' archived url(s) and timestamp(s)
#' @examples
#' \dontrun{\
#' view_archiv.fromText("testfile.docx", method="both")
#' checkArchiveStatus <- view_archiv.fromText("testfile.docx", method="both")

This comment has been minimized.

Copy link
@greebie

greebie Nov 7, 2019

Collaborator

Same.

R/archivr.R Outdated
#' @export
#' @return a dataframe containing the url, status, availability,
#' archived url(s) and timestamp(s)
#' @examples
#' # Wayback
#' archiv.fromUrl("https://www-cs-faculty.stanford.edu/~knuth/retd.html")
#' archivedURLs <- archiv.fromUrl("https://www-cs-faculty.stanford.edu/~knuth/retd.html", except="validator\\.w3\\.org")

This comment has been minimized.

Copy link
@greebie

greebie Nov 7, 2019

Collaborator

Same.

archiv.fromUrl <- function (url, method="wayback") {
return(archiv(extract_urls_from_webpage(url), method))
archiv.fromUrl <- function (url, method="wayback", except = NULL) {
return(archiv(extract_urls_from_webpage(url, except), method))

This comment has been minimized.

Copy link
@greebie

greebie Nov 7, 2019

Collaborator

Same.

R/archivr.R Outdated
#' @examples
#' from_wayback("https://www-cs-faculty.stanford.edu/~knuth/retd.html")
#' checkStatus <- from_wayback("https://www-cs-faculty.stanford.edu/~knuth/retd.html")

This comment has been minimized.

Copy link
@greebie

greebie Nov 7, 2019

Collaborator

Same.

This comment has been minimized.

Copy link
@adam3smith

adam3smith Nov 7, 2019

Author Contributor

Will fix, thanks, didn't see these comments before the latest commit.

And remove it from more documentation
@greebie

This comment has been minimized.

Copy link
Collaborator

greebie commented Nov 7, 2019

Travis did not work this time. Seems like a problem with from_wayback?

  > ### ** Examples
  > 
  > urls <- c("https://qdr.syr.edu", "https://cran.r-project.org/", "https://apsa.net")
  > 
  > # archive in Wayback machine
  > archiv(urls)
  [1] "Received a NULL value from archived snapshots from Wayback for https://qdr.syr.edu"
  [1] "Discovered an error in saving the url. Received http status 502. Perhaps try again at another time."
  Error in names(x) <- value : 
    'names' attribute [5] must be the same length as the vector [4]
  Calls: archiv -> colnames<-
  Execution halted
adam3smith added 2 commits Nov 7, 2019
@adam3smith

This comment has been minimized.

Copy link
Contributor Author

adam3smith commented Nov 7, 2019

@greebie -- we're warning free, thanks! Let me know if this looks go to merge from your end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.