Idea for a text/binary sniffing function. Want it? #236

Closed
bokov opened this issue Oct 4, 2019 · 5 comments

@bokov bokov commented Oct 4, 2019

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

I got fed up with the fact that my OS always seems to know which files are plain-text and yet my own code never does, so I did some googling and found this: https://stackoverflow.com/questions/32184809/python-file1-why-are-the-numbers-7-8-9-10-12-13-27-and-range0x20-0x100

That led to the function below. I'd rather contribute it to an existing package than try to tackle CRAN maintenance on my own, so if the below seems useful to include among rio's utilities I'll submit it as a PR.

Put your code here:

#' Determine whether a file is "plain-text" or some sort of binary format
#'
#' @param filename Path to the file
#' @param maxsize  Maximum number of bytes to read
#' @param textbytes Which characters are used by normal (though not necessarily 
#'                  just ASCII) text. To detect just ASCII, the following value
#'                  can be used: `as.raw(c(7:16,18,19,32:127))`
#' @param tf       If `TRUE` (default) simply return `TRUE` when `filename` 
#'                 references a text-only file and `FALSE` otherwise. If set to
#'                 `FALSE` then returns the "non text" bytes found in the file.
#'
#' @return A logical value if `tf` is `TRUE`; otherwise a raw vector of the
#'         distinct non-text bytes found in the file
#'
#' @examples
#' library(datasets)
#' export(iris,"iris.yml")
#' isfiletext("iris.yml")
#' ## TRUE
#' 
#' export(iris,"iris.sav")
#' isfiletext("iris.sav")
#' ## FALSE
#' isfiletext("iris.sav", tf=FALSE)
#' ## These are the characters found in "iris.sav" that are not printable text
#' ## 02 00 05 03 06 04 01 14 15 11 17 16 1c 19 1b 1a 18 1e 1d 1f
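isfiletext <- function(filename, maxsize = Inf,
                       textbytes = as.raw(c(0x7:0x10, 0x12, 0x13, 0x20:0xFF)),
                       tf = TRUE) {
  # Open in binary mode and make sure the connection is closed on exit
  con <- file(filename, "rb")
  on.exit(close(con))
  # Read at most `maxsize` bytes of the file as a raw vector
  bytes <- readBin(con, raw(), n = min(file.info(filename)$size, maxsize))
  # Distinct bytes in the file that are not in the allowed "text" set
  nontextbytes <- setdiff(bytes, textbytes)
  if (tf) length(nontextbytes) == 0 else nontextbytes
}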
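
For anyone who wants to try the idea without rio installed, here is a minimal sketch using only base R. It restates the function compactly so the snippet runs on its own; the temporary files and the variable names (`txt_path`, `bin_path`, `offending`) are made up for illustration and are not part of the proposal.

```r
# Compact restatement of the check above so this snippet is self-contained.
isfiletext <- function(filename, maxsize = Inf,
                       textbytes = as.raw(c(0x7:0x10, 0x12, 0x13, 0x20:0xFF)),
                       tf = TRUE) {
  con <- file(filename, "rb")
  on.exit(close(con))
  bytes <- readBin(con, raw(), n = min(file.info(filename)$size, maxsize))
  nontextbytes <- setdiff(bytes, textbytes)
  if (tf) length(nontextbytes) == 0 else nontextbytes
}

# Hypothetical temporary files: one plain-text, one with NUL and 0x01 bytes.
txt_path <- tempfile(fileext = ".txt")
bin_path <- tempfile(fileext = ".bin")
writeLines(c("Sepal.Length,Species", "5.1,setosa"), txt_path)
writeBin(as.raw(c(0x00, 0x01, 0x48, 0x69)), bin_path)

is_txt     <- isfiletext(txt_path)              # only printable bytes and line endings
is_bin_txt <- isfiletext(bin_path)              # contains 0x00 and 0x01
offending  <- isfiletext(bin_path, tf = FALSE)  # the distinct non-text bytes

print(is_txt)
print(is_bin_txt)
print(offending)

# Clean up the temporary files
unlink(c(txt_path, bin_path))
```

With the default `textbytes`, the first file should be reported as text and the second as binary, with `00` and `01` returned as the offending bytes when `tf = FALSE`.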

@leeper leeper commented Oct 5, 2019

Is this OS specific?

@bokov bokov commented Oct 5, 2019

I don't think so but I'll write some tests and see what the CI servers say.

@bokov bokov commented Oct 8, 2019

Pushed my feature branch, with tests. If Travis and Appveyor run successfully, I will make a one-commit version of that branch and submit a PR.

bokov added a commit to bokov/rio that referenced this issue Oct 8, 2019
…orm matter, completing ticket leeper#236. This can be useful when a file extension is missing or ambiguous
@leeper leeper added the enhancement label Oct 19, 2019

@bokov bokov commented Nov 14, 2019

> Is this OS specific?

This function is not OS specific.

It works identically on Windows:
https://ci.appveyor.com/project/bokov/rio/builds/28858637

as well as Linux and MacOS:
https://travis-ci.org/bokov/rio/builds/612002288

I have not been able to formally demonstrate this until now because the CI scripts were broken by the R version update. That problem now seems to have been fixed upstream. The test failures in #239 are out of date and will clear if https://travis-ci.org/leeper/rio/builds/608907693 and https://ci.appveyor.com/project/leeper/rio/builds/28698005 are resubmitted to their respective CI services (in the case of the former, after switching to the default xcode in .travis.yml).

@bokov bokov commented Nov 14, 2019

CAVEAT: The MacOS xcode8.3 builds just keep dying with timeouts. The only way I could get it to build is by pretending that #247 already got accepted. The successful build above is based on the .travis.yml from #247, not the one in the current master.

@leeper leeper added this to the v0.6 milestone Dec 20, 2019
leeper added a commit that referenced this issue Dec 21, 2019
* isfiletext() checks whether a file is text or binary in a cross-platform manner, completing ticket #236. This can be useful when a file extension is missing or ambiguous

* Minor tweak-- removed redundant 'fwf' entry from 'txtformats' in the test for isfiletext()

* Incorporating feedback from #239#pullrequestreview-335144814 and merging in changes to master

* Incorporating review feedback
@leeper leeper closed this Dec 24, 2019