Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rclean: A Tool for Writing Cleaner, More Transparent Code #327

Open
MKLau opened this issue Jul 23, 2019 · 43 comments
Open

Rclean: A Tool for Writing Cleaner, More Transparent Code #327

MKLau opened this issue Jul 23, 2019 · 43 comments

Comments

@MKLau
Copy link

@MKLau MKLau commented Jul 23, 2019

Submitting Author: Matthew K. Lau (@MKLau)
Repository: https://github.com/MKLau/rclean


  • Paste the full DESCRIPTION file inside a code block below:
Type: Package
Package: Rclean
Title: A Tool for Writing Cleaner, More Transparent Code
Version: 1.1.0
Author: Matthew K. Lau
Maintainer: Matthew K. Lau <matthewklau@fas.harvard.edu>
Description: To create clearer, more concise code provides this
	     toolbox helps coders to isolate the essential parts of a
	     script that produces a chosen result, such as an object,
	     tables and figures written to disk and even warnings and
	     errors.
URL: https://github.com/ProvTools/Rclean
BugReports: https://github.com/ProvTools/Rclean/issues
License: GPL-3 | file LICENSE
Imports: igraph, jsonlite, formatR, CodeDepends, Rgraphviz, methods, knitr
Suggests: 
    roxygen2,
    testthat,
    covr
Language: en-US
Encoding: UTF-8
RoxygenNote: 6.1.1
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below.:

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • reproducibility
    • geospatial data
    • text analysis
  • Explain how the and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

In writing analytical scripts, software best practices are often a
lower priority than producing inferential results, leading to large,
complicated code bases that often need refactoring. The "code
cleaning" capabilities of the Rclean package provide a means to
rigorously identify the minimal code required to produce a given
result (e.g. object, table, plot, etc.), reducing the effort required
to create simpler, more transparent code that is easier to reproduce.

  • Who is the target audience and what are scientific applications of
    this package?

The target audience is domain scientists that have little to no formal
training in software engineering. Multiple studies on scientific
reproducibility have pointed to data and software availability as
limiting factors. This tool will provide an easy to use tool for
writing cleaner analytical code.

There are other packages that analyze the syntax and structure of
code, such as lintr, formatr and cleanr. Rclean, as far as we are
aware, is the only package written for R that uses a data provenance
approach to construct the interdependencies of objects and functions
and then uses graph analytics to rigorously determine the desired
pathways to determine the minimal code-base needed to generate an
result.

  • Any other questions or issues we should be aware of?:

Not that I can think of at the moment.

@MKLau MKLau changed the title *Rclean*: A Tool for Writing Cleaner, More Transparent Code Rclean: A Tool for Writing Cleaner, More Transparent Code Jul 23, 2019
@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Jul 24, 2019

Thanks for your submission @MKLau! I'll be handling your review. Stay tuned while I complete editors checks. I'll get back in touch when they are done.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Jul 24, 2019

Thanks!

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Jul 26, 2019

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Installation:

Hello again @MKLau 👋

I'm having issues installing the package. Although installing from CRAN works,
installing from GitHub using devtools::install_github("ProvTools/Rclean", ref = "dev") or from source locally does not, producing this error :

Downloading GitHub repo ProvTools/Rclean@master
Skipping 2 packages not available: CodeDepends, Rgraphviz
ERROR: dependency ‘CodeDepends’ is not available for package ‘Rclean’
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rclean’
Error: Failed to install 'Rclean' from GitHub:
  (converted from warning) installation of package ‘/var/folders/8p/87cqdx2s34vfvcgh04l6z72w0000gn/T//Rtmp0mSvCy/filea469117a2725/Rclean_1.1.0.tar.gz’ had non-zero exit status

CodeDepends is on github so the remote repository needs to be includes in the DESCRIPTION file under Remotes.

So after adding:

Remotes:
    duncantl/CodeDepends

to DESCRIPTION, I now get a different install error:

Skipping 1 packages not available: Rgraphviz
Downloading GitHub repo duncantl/CodeDepends@master
Skipping 1 packages not available: graph
✔  checking for file ‘/private/var/folders/8p/87cqdx2s34vfvcgh04l6z72w0000gn/T/RtmperTpaE/remotesa03323bd0a21/duncantl-CodeDepends-7c9cf7e/DESCRIPTION’ ...
─  preparing ‘CodeDepends’:
✔  checking DESCRIPTION meta-information ...
─  installing the package to build vignettes
         -----------------------------------
   ERROR: dependency ‘graph’ is not available for package ‘CodeDepends’
─  removing ‘/private/var/folders/8p/87cqdx2s34vfvcgh04l6z72w0000gn/T/Rtmpoz9E5n/Rinsta2487ca40f8/CodeDepends’
         -----------------------------------
   ERROR: package installation failed

Both Rgraphviz & graph are bioconductor packages.

Adding:

biocViews:

just above Imports: in your DESCRIPTION file resolves the Rgraphviz issue but not the error generated for graph through attempted installation of CodeDepends. And trying to just install it on it's own with devtools::install_github("duncantl/CodeDepends") produces the same error. How did you manage to install CodeDepends?

I tried just adding graph as a dependency in your DESCRIPTION but that didn't resolve the CodeDepends error. It might be useful to get in touch with the package authors to see if they can help.


Finally, I also got these warnings:

✔  checking for file ‘/private/var/folders/8p/87cqdx2s34vfvcgh04l6z72w0000gn/T/Rtmpkh4rA9/remotes26657867070/ProvTools-Rclean-47ec758/DESCRIPTION’ ...
─  preparing ‘Rclean’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a ‘data/datalist’ file should be added
     NB: this package now depends on R (>= 3.5.0)
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/clean.simple.out.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/format.simple.out.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/lib.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/libs.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/opt.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/pi.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/prov.g.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/prov.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/rp.clean.x.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/rp.clean.y.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/rp.clean.y.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Rclean/tests/testthat/rp.options.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/s.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/sp.test.rda'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Rclean/tests/testthat/vl.test.rda'
─

Which suggest that you need to set a minimum version of R in DESCRIPTION. For more information see: r-lib/devtools#1912
Adding:

Depends:
   R (>= 3.5.0)

in your DESCRIPTION file indeed solves the problem.


So until these installation issues are resolved, I can't really proceed with the rest of the editors checks as they are all performed on the source code which needs to be able to be installed from source (ie not just downloading the binary from CRAN. For example, running devtools::check() just errors because of missing the CodeDepends package.

Just let me know when this problem is fixed and I can carry on with checks.


Reviewers:
Due date:

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Aug 9, 2019

Hi @annakrystalli, thanks for your feedback on this. I've been looking into the issues with installing those dependencies.

The package installs from dev on github via devtools on my machine and on Travis' server. However, I manually installed the CodeDepends on my machine, and Travis manually installs as well. I haven't yet done a full install on a fresh machine just using devtools.

This is an important issue, as it would make install much easier to just run one command. So, I'll sort this out and let you know.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Aug 9, 2019

Hi @annakrystalli, alright, I did a fresh install of R on my machine with the addition of all your dependency fixes to the DESCRIPTION and I get a similar error as you:

> install_github("Provtools/Rclean", ref = "deps")
Downloading GitHub repo Provtools/Rclean@deps
Downloading GitHub repo duncantl/CodeDepends@master
Skipping 1 packages not available: graph
Installing 2 packages: graph, XML
Error: Failed to install 'Rclean' from GitHub:
  Failed to install 'CodeDepends' from GitHub:
  (converted from warning) package ‘graph’ is not available (for R version 3.6.1)

Travis still installs without error, which is installing the graph package separately.

I've done some searching on Stack Exchange, and I can see why RGraphviz would install but not graph. I'll keep searching for possible workarounds.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Aug 20, 2019

@annakrystalli

Hi Anna, I still haven’t found a solution to the graph package install other than manually installing graph via BiocManager. So, if you’d like you should be able to proceed by first installing graph before installing

install.packages(“devtools”)
install.packages(“BiocManager”)
BiocManager::install(“graph”)
devtools::install_github(“provtools/rclean”, ref = “dev”)

It looks like the CodeDepends maintainers are working on a solution for this (duncantl/CodeDepends#30) but currently they’re essentially using the manual install method like this.

I’m adding this approach to the dev branch install instructions but I’m dissatisfied with this as an install method. I’ll keep looking for a fix and follow the updates from the CodeDepends folks.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Aug 23, 2019

Hey @MKLau, sorry for slow reply, I am actually on holiday so will probably be slow getting back to you for the next week also. Apologies!

In the meantime, I found a short term workaround that hopefully we might be able to turn into a more long-term one.

I forked CodeDepends and just added biocViews: to the DESCRIPTION file in my fork: https://github.com/annakrystalli/CodeDepends.

I also changed the remote repository pointer in the Rclean DESCRIPTION to annakrystalli/CodeDepends to install my fork of the package and installation of RClean worked! 🎉. This setup installs version 1.62 of graph. Out of curiosity, what version do you have installed? I quickly ran the tests and they all passed so on the surface this version of graph seems to work with Rclean.

I'll make a pull request to the CodeDepends repo and hope they will merge it. But if not, you could just maintain a fork of CodeDepends that your package depends on where you include biocViews: in the DESCRIPTION. In fact, I would suggest that you make a fork, set it up as I did and use that as your dependency for now. That would allow both myself and reviewers to proceed with installing and reviewing the package. Then, if and once the PR is merged, you can revert back to the upstream CodeDepends repo.

Sound good?

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Aug 28, 2019

Great, @annakrystalli! Copy that. I’ll use this workaround and follow what the CodeDepends issue you opened. Hopefully they just merge it in, which seems likely.

Enjoy the rest of your break!

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Aug 28, 2019

Now completed, tested and merged into dev branch.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 10, 2019

OK great and apologies for massive delay! Been pretty snowed under since I got back and still catching up. I've now been able to complete editors checks and installation worked fine with the new setup.

So, firstly and most importantly, I'd recommend that all your installs happen from the master branch and that this is always stable. Use your dev branch for the more experimental / development state of code.

The rest of the editors checks threw up only minor issues.

goodpractice

goodpractice::gp()

Throws a warning

  ✖ fix this R CMD check NOTE: Namespaces in Imports field not imported from: ‘BiocManager’ ‘knitr’ All declared Imports
    should be used.

I wonder, are there calls to functions in BiocManagers in your code? It's hard to search the dev branch of your repo so could figure it out without scanning all your R/ code (which I didn't do 😜).

spell_check

A few potential genuine typos

devtools::spell_check("/Users/Anna/Documents/workflows/rOpenSci/editorials/Rclean")
#>   WORD              FOUND IN
#> formate           p.spine.Rd:10
#> funcitons         codeGraph.Rd:16
#> inforrmation      get.libs.Rd:16
#> parantage         get.spine.Rd:16
#> Parenatage        p.spine.Rd:5
#> Prases            parse.graph.Rd:5
#> reltionships      parse.graph.Rd:6
#> reproduciblity    README.md:49
#>                   Rclean.rmd:41

Created on 2019-09-10 by the reprex package (v0.3.0)

package coverage

Great coverage overall! Is there are reason var.id.R and write.code.R have such low coverage though?

covr::package_coverage("/Users/Anna/Documents/workflows/rOpenSci/editorials/Rclean")
#> Rclean Coverage: 81.22%
#> R/var.id.R: 0.00%
#> R/write.code.R: 45.45%
#> R/var.lineage.R: 70.00%
#> R/p.spine.R: 72.73%
#> R/parse.graph.R: 81.82%
#> R/parse.info.R: 87.50%
#> R/clean.R: 91.67%
#> R/codeGraph.R: 100.00%
#> R/get.libs.R: 100.00%
#> R/get.spine.R: 100.00%
#> R/read.prov.R: 100.00%

Created on 2019-09-10 by the reprex package (v0.3.0)

@MKLau, all in all I'm happy to proceed with finding reviewers. If you could have a go at fixing these minor issues during that time, that would be great. For me the most important is to get the master branch up to date and use that as the stable version of your package that will be reviewed.

Could you please add the rOpenSci under review badge to your README?

[![](https://badges.ropensci.org/327_status.svg)](https://github.com/ropensci/software-review/issues/327)

Thanks again for your patience during the holidays!


Reviewers:
Due date:

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Sep 11, 2019

Hi @annakrystalli, no worries! Hope you had a relaxing/restorative holiday.

  • dev and master branch usage: I had stopped moving master forward as so many changes were happening due to the ROpenSci and JOSS review. Travis is currently running checks on the following updates (see following items). I'll merge into master after those checks pass. However, CRAN will now be behind master. Do you recommend updating it or waiting for the ROpenSci review process to proceed?
  • goodpractice Imports: I had added BiocManager to Imports to deal with the installation of graph but I think that your suggested fixes to CodeDepends has made that vestigial. knitr is required for automatically building the vignette. I removed BiocManager entirely and added knitr under Suggests.
  • goodpractices spell_check: fixed all of those, funny how many times I've run spellcheck but typos keep popping up!
  • goodpracices package coverage: var.id has become a legacy function that I think may be useful in the future, so I haven't gotten rid of it, but I also haven't bothered to add unit tests. I've added an issue for that, and I'll add them when I next get a chance. write.code only has partial coverage because part of its functionality is copying to the clipboard, which I haven't figured out how to write tests for. If you (or maybe one of the reviewers) have a suggestion, I'd be happy to get the coverage on that function higher.
  • Add badge to README: done!

Thanks for the updates on this, and happy to see the review move forward.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Sep 11, 2019

@annakrystalli FYI - master is now up to date with dev (now v1.1.0). CRAN is still v1.0.0.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 12, 2019

Great work @MKLau !

Regarding your comments:

dev and master branch usage: I had stopped moving master forward as so many changes were happening due to the ROpenSci and JOSS review. Travis is currently running checks on the following updates (see following items). I'll merge into master after those checks pass. However, CRAN will now be behind master. Do you recommend updating it or waiting for the ROpenSci review process to proceed?

You should treat the GitHub version of your code as the development version in general and it is absolutely fine for it to not reflect CRAN. The way you link your GitHub code version to the CRAN released it through tags which snapshot more formally the state of the repository upon release. After that you are free to continue developing on GitHub by adding .9000 to the end of your release version in the DESCRIPTION to indicate this is the development version of the software.

That is somewhat different to the use of branches. You still want the master branch in the repository to reflect the most stable state of the development code and it should generally be the go to outward facing version of the development source code. I think the chapter on releasing packages sort of discusses this but think instead about how in general we often see in READMEs instruction to download from CRAN or download the development version from GitHub which is generally the master branch. Does that make sense?

write.code only has partial coverage because part of its functionality is copying to the clipboard, which I haven't figured out how to write tests for. If you (or maybe one of the reviewers) have a suggestion, I'd be happy to get the coverage on that function higher.

My go-to strategy in these situations is to have a look at the tests in packages that use the feature I want to test. I initially thought of reprex and saw that that just calls clipr::write_clip(). So here's a link to the tests script for that function write_clip() 😃.

For all your other comments:

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 12, 2019

Another minor thing I've just noted. In the DESCRIPTION file, we require that author details are captured using the Authors@R notation not character strings. More details in the relevant r packages chapter.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Sep 12, 2019

Great! Thanks for the pointers. Per these suggestions, I’ll go ahead and:

  • update the readme to point to master as the stable “beta” install
  • switch to using the Authors@R field
  • look into implementing that test per test_render
@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 23, 2019

We have reviewers! 🥳

Many thanks again for agreeing to review @wlandau and @nevrome. See below for review data. Any questions, just reach out.


Reviewers: @wlandau @nevrome
Due date: 2019-10-21

@nevrome

This comment has been minimized.

Copy link
Member

@nevrome nevrome commented Sep 25, 2019

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

A very interesting package to simplify R scripts. I really like the underlying idea and the main functions clean and write.code work well when applied to the example script example/simple_script.R. Good job, @MKLau! However, the usability of the package can be increased a bit more. Some suggestions in no particular order:

  • The function documentation is very brief at the moment. I don't think this necessarily a bad thing, but for some of the functions the lack of description and example code leaves me wondering, what the respective function is supposed to do. Please add some more descriptive text and example code for clean, codeGraph and write.code. I would like to see codeGraph mentioned in the README/vignette.
  • I would love to have an example script (maybe example/simple_script.R) as a package example dataset that allows to explore the functions without the need to provide/prepare an own script. An example script will also allow you to add meaningful example code for the functions.
  • Retrospective Provenance remained a mystery for me. The missing piece of the puzzle is how to generate the relevant json file. Maybe I just missed the crucial hint? I would prefer to see a more extensive explanation of this feature and how to use it in the README/vignette.
  • In the context of the Retrospective Provenance feature you use the option interface to access the content of the json file. I do not understand why this rather unusual solution is necessary. Why can't the rd parameter simply get the relevant character vector? Instead of options(prov.json = readLines("example/prov_micro.json")); clean(file = "example/micro.R", rp = TRUE) the interface could simply be clean(file = "example/micro.R", rp = readLines("example/prov_micro.json")) or even just the path clean(file = "example/micro.R", rp = "example/prov_micro.json"). Why did you decide against this?
  • When I tried to apply clean to some of my own scripts, I realized that all packages necessary to run this script have to be loaded in the current session. If this was not the case, then I got the error message Error in as.environment(pos): no item called "package:dplyr" on the search list. I got this message 10+ times for one of my more complex scripts before clean went through, until I loaded all the packages. This is a little bit tedious. Is there another way to implement package loading? Maybe the automagic package could help to automate that? If this is not possible or if you don't want to have an automatic solution, then I would at least ask you to catch the current error message and replace it with something more intelligible, e.g. "Package xyz has to be loaded to analyze this script.".
  • Internal functions do not need to be exported (e.g. parse.info, read.prov). Remove the @export tag for these functions and add the tags @keywords internal and @noRd.
  • The simple_script.R code appears many times in the package (in the directories test/, inst/, exec/, example/). I don't think you need this file so many times.
  • Although you not seem to work in RStudio I suggest to add a Rstudio project file in the root directory (+ the necessary additions to .Rbuildignore). That simplifies contributing for RStudio users.
  • You could add a help page for the whole package as described here. I often use this page as an entry point for the function index.
  • This might be way beyond the purpose of this review, but as I personally consider these kind of things when I decide whether I want to use a package, I would like to add that I was not able to access neither your homepage http://mklau.info/ nor the one the Github Orga ProvTools http://provtools.org/ [2019-09-24].
  • The JOSS paper manuscript seems to be in an unfinished state because several sections are mentioned with a keyword but not yet written. The text available so far is fine.
  • In the travis-ci setup I suggest to treat warnings as errors (warnings_are_errors: true). I think this is usual for R packages, because no package with warnings can go to CRAN.
@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 25, 2019

Thanks for your speedy response @nevrome!

@MKLau, it's probably best to wait till both reviews are in and then respond.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Sep 26, 2019

@annakrystalli, that’s what I was planning.

Double thanks @nevrome!

@wlandau

This comment has been minimized.

Copy link

@wlandau wlandau commented Sep 27, 2019

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 4

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Rclean is a promising package. It goes above and beyond typical static code analysis and addresses a huge unmet needs in scientific workflows. Well done, @MKLau!

Documentation

  • Please include a CODE_OF_CONDUCT.md file at the top level in your repo. You could generate one here or call usethis::use_code_of_conduct(). On this page, you can even generate issue and pull request templates that remind people of the code of conduct.

  • Consider renaming CONTRIBUTE.md to the more widely-used CONTRIBUTING.md.

  • Do you have a longer, real-world script to show? simple_script.R gets the idea across, but if you show how to extract simple chunks from a confusing script with 1000+ lines, you could make Rclean really shine.

  • At the top of the README, please consider reorganizing those bullet points into two paragraphs or subheadings: one introducing the existing problems in scripts and reproducibility, and then another focusing on how Rclean solves those problems. Maybe also elaborate on both points as well. I do not think this is the place to mention the provenance model or graphs, but you could definitely talk more about the survey you mentioned, the problem of messy code in science, and the fact that Rclean sifts through code to extract just what you need for a given result.

  • Would you cite that survey from the third bullet in the README? If you cannot find the citation, feel free to remove the mention entirely and choose a different way to motivate the problem.

A recent survey of over 1500 scientists reported a crisis of reproducibility with “selective reporting” being the most cited contributing factor and 80% saying code availability is playing a role.

  • You might consider using a README.Rmd to generate the README.md so it is easier to refresh the docs when you change the package (example here).

  • I strongly agree with @nevrome to add examples to the help files for the exported functions you choose to keep (see below). It is difficult for me to understand how to create arguments for most of your exported functions.

  • The vignette repeats a lot of information from the README. Let's think about removing the redundant sections and refocusing the vignette on a deeper dive in a feature set not covered in the README. Perhaps retrospective provenance ? See the next comment.

  • It is not intuitively clear from the docs why we need retrospective provenance. In the README or vignette, please demonstrate a situation where prospective provenance is not enough and retrospective provenance gives us what we seek.

  • Please consider citing a paper to make it clear that "retrospective" and "prospective" are widely used and accepted terms to describe provenance. A quick search reveals a couple possibilities (examples here and here)?

API

  • The clean() function tries to do three things: (1) list lines of execution, (2) list the names of possible results, and (3) get the code required to produce one of the results. I agree that clean() should do (3), but I think (1) and (2) are not really cleaning yet and should go in their own functions. Maybe list_lines() (which returns an integer vector) and list_results()?

  • Similarly, it is also worth considering an entirely new set of functions for retrospective provenance because they have different inputs and your underlying code is entirely different.

  • In help(package = "Rclean"), I see several API functions that represent very technical steps and need esoteric data structures. Are there specific reasons why you exported these things to the user? I am not sure users will want to handle matrix representations of graphs or PROV-JSON-formatted provenance objects in memory. It seems like the easiest inputs to understand are R script files and JSON files. Similarly, the best outputs are human-readable: cleaned-up scripts and plotted code graph visualizations. Maybe even consider writing files directly from clean(). Consistency and simplicity are extremely valuable.

  • I am questioning the need for most of Rclean's exported functions. Not all your functions need to be exported, and not all of them need roxygen2 docstrings.

    • Do we really need p.spine() and p.spine()? If you add a function to get the igraph from a script, it seems like you can simply instruct users to call dfs() themselves.
    • Should get.libs(), var.lineage(), and var.id() really be exported? clean() needs them, but do you expect most users to ever invoke them manually?
    • Do we really need a write.code() function? readLines() already writes code to a file, and clipr::write_clip() writes to the clipboard. Optional: maybe clean() itself could have an argument to control where the output goes: either to the clipboard with write_clip() or to a file if a path is given.
    • You explicitly list parse.info(), read.prov(), and parse.graph() as internal functions. Perhaps these should not be exported either.
  • The names of exported functions have different naming conventions: codeGraph() (camel case) vs prov_json() (snake case) vs get.spine() (dot-separated). For the exported functions you choose to keep, please consider sticking to one of these conventions (I recommend snake case). Depending on the user base you have right now, you may want to gracefully deprecate those functions as described here.

Installation and automated checks

  • Re #327 (comment), I also had some trouble installing Rclean. I needed to install Rgraphviz directly from Bioconductor. I suggest adding the following to the installation instructions in the README. At the very least, it is a useful stopgap until @annakrystalli's pull request to CodeDepends reaches CRAN.
install.packages("BiocManager")
BiocManager::install(c("graph", "Rgraphviz"))
  • devtools::spell_check("Rclean") identifies several correctly spelled words. I recommend listing them in an inst/WORDLIST file (example here).
  • goddpractice::gp("Rclean") suggests a bunch of style fixes I agree with. The styler package can help you turn assignment = into <-. Please let me know if you would like clarification on any of these notes.
It is good practice towrite unit tests for all functions, and all package code
    in general. 86% of code lines are covered by test cases.

    R/clean.R:49:NA
    R/clean.R:72:NA
    R/clean.R:82:NA
    R/clean.R:91:NA
    R/clean.R:95:NA
    ... and 21 more linesuse '<-' for assignment instead of '='. '<-' is the
    standard, and R users and developers are used it and it is easier
    to read your code for them if you use '<-'.

    R/var.id.R:41:44
    R/var.id.R:42:47avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/clean.R:47:1
    R/clean.R:93:1
    R/get.spine.R:44:1
    R/parse.info.R:39:1
    R/parse.info.R:40:1
    ... and 1 more linesavoid sapply(), it is not type safe. It might return a
    vector, or a list, depending on the input data. Consider using
    vapply() instead.

    R/clean.R:150:29
    R/clean.R:151:55
    R/clean.R:169:34avoid 1:length(...), 1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
    result 1:0 if the expression on the right hand side is zero. Use
    seq_len() or seq_along() instead.

    R/get.libs.R:40:13
    R/get.libs.R:42:17
    R/get.libs.R:44:21
    R/var.id.R:39:13
    R/var.lineage.R:43:22
    ... and 4 more lines

Miscellaneous

  • I am a bit concerned about the scope of the built-in unit tests. It would be helpful to test at least one longer script to make sure the correct lines get captured in a serious workflow. Not only that, but when you actually run the longer script and various object-specific scripts generated by clean(), do the output data objects agree?

  • What is the reason you chose JSON to store retrospective provenance? Have you considered a tidier format such as a long-form data frame? I like how pkgapi represents its output.

  • The clean() function is ambitious, long, and deeply indented. Please consider decomposing this monolithic function into smaller helper functions. No need to export the helper functions or write roxygen2 docstrings. A particularly good resource on this kind of refactoring is Jenny Bryan's talk on code smells and feels.

  • Are you open to using styler instead of formatR to reformat code? It is more modern and more actively maintained.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Sep 27, 2019

@wlandau thanks so much for a speedy review as well!

Thanks to everyone for helping to improve and push this project forward.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Sep 27, 2019

Thanks indeed @wlandau! 🎉

Lot's of useful and insightful comments in both reviews so over to you now @MKLau 😊

@wlandau

This comment has been minimized.

Copy link

@wlandau wlandau commented Oct 11, 2019

One clarification on a couple points I made:

I am questioning the need for most of Rclean's exported functions...Do we really need p.spine() and p.spine()?

Those functions are necessary to implement clean(), and I recommend keeping them internally. I only question the need for an @export tag because I suspect most users will not invoke these functions directly.

In fact (and this is not strictly necessary to satisfy my review) it could be helpful to split clean() up into multiple helper functions. This could reduce the deeply nested logic in your code and make Rclean easier to maintain. A couple resources:

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Oct 11, 2019

Thanks for taking the time to clarify @wlandau. I get what you’re saying both about the structure of the API and the logic of the clean function. This review process has been great because it’s forcing me to hone the target user/application, which has meandered a bit in development. I think the inclination to make the spine and other functions internal is the way to go.

Also, thanks for the pointers, especially the cyclocomp package.

@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Oct 29, 2019

Please update openjournals/joss-reviews#1312 when this is accepted.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Nov 25, 2019

Hi @MKLau, Just checking in with you. Do you have an ETA for a response to reviewers?

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Nov 25, 2019

Hi @annakrystalli, I’m finishing up addressing a final comments this week, and, as long as those get done, I should be able to write-up a response early next week.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Dec 4, 2019

General Response

Thanks again to @wlandau and @nevrome for reviewing the package and,
@annakrystalli included, providing insightful comments. I have updated the package
greatly based on the various suggestions and questions. These updates are
documented in the package's github repo issue and change logs. I detailed the changes as they pertain to the reviewers' comments point by point below. For unchecked, please see the comments that follow. In addition, I would like to point out that the package repo is now hosted at https://gitub.com/MKLau/Rclean, which is the result of shifts in how the project is now being managed.

Response to Reviewer Comments

Documentation

  • @wlandau Please include a CODE_OF_CONDUCT.md file at the top
    level in your repo. You could generate one
    here or call
    usethis::use_code_of_conduct() . On
    this page, you
    can even generate issue and pull request templates that remind
    people of the code of conduct.

    Added CODE_OF_CONDUCT.md at the top level via
    usethis::use_code_of_conduct().

  • @wlandau Consider renaming CONTRIBUTE.md to the more widely-used
    CONTRIBUTING.md .

    Changed to CONTRIBUTING.md.

  • @wlandau Do you have a longer, real-world script to show?
    simple_script.R gets the idea across, but if you show how to
    extract simple chunks from a confusing script with 1000+ lines,
    you could make Rclean really shine.

    I agree, and I added a longer, more realistic, script to the
    example scripts, see "examples/long_script.R". I used this in the
    vignette to demonstrate the utility of retrieving the code to
    produce a specific result when it is embedded in a longer,
    somewhat complicated script. I wasn't able to simply use an
    existing script from a real project because of issues with data
    dependencies. If the reviewer has a suggestion of an example, I
    would be happy to look at including it in a future update.

  • @wlandau At the top of the README, please consider reorganizing
    those bullet points into two paragraphs or subheadings: one
    introducing the existing problems in scripts and reproducibility,
    and then another focusing on how Rclean solves those
    problems. Maybe also elaborate on both points as well. I do not
    think this is the place to mention the provenance model or graphs,
    but you could definitely talk more about the survey you mentioned,
    the problem of messy code in science, and the fact that Rclean
    sifts through code to extract just what you need for a given
    result.

    I have edited and re-organized the content of the README to focus
    on the purpose of the package, installation and a quick start
    guide. The README refers users to the vignette for more information.

  • @wlandau Would you cite that survey from the third bullet in the
    README? If you cannot find the citation, feel free to remove the
    mention entirely and choose a different way to motivate the
    problem.

    A recent survey of over 1500 scientists reported a crisis of
    reproducibility with “selective reporting” being the most cited
    contributing factor and 80% saying code availability is playing a
    role.

    This reference has been removed from the README.

  • @wlandau You might consider using a README.Rmd to generate the
    README.md so it is easier to refresh the docs when you change the
    package (example
    here).

    Updated to use README.Rmd.

  • @wlandau I strongly agree with @nevrome to add examples to the
    help files for the exported functions you choose to keep (see
    below). It is difficult for me to understand how to create
    arguments for most of your exported functions.

    Exported functions now have simple examples to illustrate usage.

  • @wlandau The
    vignette
    repeats a lot of information from the README. Let's think about
    removing the redundant sections and refocusing the vignette on a
    deeper dive in a feature set not covered in the README. Perhaps
    retrospective provenance ? See the next comment.

    README.md now functions as a quick start guide, and more detailed
    information has been moved to the vignette and JOSS paper.

  • @wlandau It is not intuitively clear from the docs why we need
    retrospective provenance. In the README or vignette, please
    demonstrate a situation where prospective provenance is not enough
    and retrospective provenance gives us what we seek.

    Thank you for this question, it has really pushed the development
    of a more focused package. I have removed the retrospective
    provenance functionality from the package, as I came to realize
    that it was unnecessarily slowing Rclean and complicating
    usage. The code to use retrospective provenance has been
    integrated into
    https://github.com/end-to-end-provenance/provClean, but it is now
    not used for Rclean. However, to explain the difference,
    retrospective provenance gets the lineage of a result by
    monitoring the execution of code, while prospective provenance
    infers similar information from the code prior to
    execution. Because of this difference, retrospective provenance
    can do things that prospective provenance can't, such as observe
    and record random numbers and which paths were taken through
    control statements. I have clarified this information in the
    vignette and similar information will be added to the JOSS
    article.

  • @wlandau Please consider citing a paper to make it clear that
    "retrospective" and "prospective" are widely used and accepted
    terms to describe provenance. A quick search reveals a couple
    possibilities (examples
    here and
    here)?

    See the previous comment. Thanks for the references, I have added
    provenance explanations and other references to the vignette.

  • @nevrome Please add some more descriptive text and example code
    for clean , codeGraph and write.code . I would like to see
    codeGraph mentioned in the README/vignette.

    Examples are now present for exported functions and more
    expository text has been added.

  • @nevrome I would love to have an example script (maybe
    example/simple_script.R ) as a
    package example dataset that
    allows to explore the functions without the need to
    provide/prepare an own script. An example script will also allow
    you to add meaningful example code for the functions.

    Two example scripts, simple_script and long_script, are now
    available as R package data files. However, as clean() requires
    a file path, they cannot be passed directly into clean().

  • @nevrome Retrospective Provenance remained a mystery for
    me. The missing piece of the puzzle is how to generate the
    relevant json file. Maybe I just missed the crucial hint? I would
    prefer to see a more extensive explanation of this feature and how
    to use it in the README/vignette.

    See my explanation and comments above.

  • @nevrome In the context of the Retrospective Provenance feature
    you use the ** option interface** to access the content of the
    json file. I do not understand why this rather unusual solution is
    necessary. Why can't the rd parameter simply get the relevant
    character vector? Instead of options(prov.json = readLines("example/prov_micro.json")); clean(file = "example/micro.R", rp = TRUE) the interface could simply be clean(file = "example/micro.R", rp = readLines("example/prov_micro.json")) or even just the path
    clean(file = "example/micro.R", rp = "example/prov_micro.json") .
    Why did you decide against this?

    Similarly here, see previous provenance comments.

  • @nevrome You could add a help page for the whole package as
    described here. I
    often use this page as an entry point for the function index.

    A whole package help page has been added.

  • @nevrome This might be way beyond the purpose of this review,
    but as I personally consider these kind of things when I decide
    whether I want to use a package, I would like to add that I was
    not able to access neither your homepage http://mklau.info/
    nor the one the Github Orga ProvTools http://provtools.org/
    [2019-09-24].

    Thanks for checking these links. The provtools website is now
    defunct and materials have been moved to
    github.com/mklau/rclean. However, mklau.info is still supported
    and is working for me. Let me know if you happen to try accessing
    it again and have an issue.

JOSS Co-Submission

@wlandau: missing all sections
@nevrome: missing short summary

  • The JOSS paper manuscript seems to be in an unfinished state
    because several sections are mentioned with a keyword but not yet
    written. The text available so far is fine.

@MKLau: The initial submission of the JOSS paper was based on an
example of a submitted article present on the JOSS
instructions. However, after this package review, there will be
substantial content added to the article. I am currently working on
these updates to the text.

Functionality

  • @wlandau: installation and package guidelines missing.

    @MKLau: This has been added to the README.

API

  • @wlandau The clean() function tries to do three things: (1) list
    lines of execution, (2) list the names of possible results, and
    (3) get the code required to produce one of the results. I agree
    that clean() should do (3), but I think (1) and (2) are not really
    cleaning yet and should go in their own functions. Maybe
    list_lines() (which returns an integer vector) and list_results()?

    The clean function has been significantly refactored. Now that the
    retrospective provenance has been removed from the package, clean
    has been streamlined. It still returns a list of possible results,
    if no results are inputed, but the focus in now on getting the
    script and a variable or several variables to determine and return
    the minimal code. There is now an exported function get_vars to
    return a list of variables in a given script.

  • @wlandau Similarly, it is also worth considering an entirely new
    set of functions for retrospective provenance because they have
    different inputs and
    your underlying code is entirely different.

    Already discussed above, this feature has been removed from Rclean.

  • @wlandau In help(package = "Rclean") , I see several API
    functions that represent very technical steps and need esoteric
    data structures. Are there specific reasons why you exported these
    things to the user? I am not sure users will want to handle matrix
    representations of graphs or PROV-JSON-formatted provenance
    objects in memory. It seems like the easiest inputs to understand
    are R script files and JSON files. Similarly, the best outputs are
    human-readable: cleaned-up scripts and plotted code graph
    visualizations. Maybe even consider writing files directly from
    clean() . Consistency and simplicity are extremely valuable.

    Again, the API has been substantially simplified. Specific to this
    comment, JSON is no longer required because the package no longer
    uses retrospective provenance, which stores provenance in
    PROV-JSON format.

  • @wlandau I am questioning the need for most of Rclean 's
    exported functions. Not all your functions need to be exported,
    and not all of them need roxygen2 docstrings.

    Good suggestion. The API has been updated with a reduction in the
    exported functions, and significant refactoring has been done to
    the code base, particularly clean.R, which now has internal
    functions included in the clean.R file itself.

  • @wlandau Do we really need p.spine() and p.spine() ? If you add
    a function to get the igraph from a script, it seems like you can
    simply instruct users to call dfs() themselves.

    Per the previous comment, these functions (p.spine and get.spine)
    have been refactored and integrated into clean.R. In order to keep
    the workflow and API simple for users, manually conducting the
    depth first search is not described for the users. This could be
    done for some more advanced applications, but currently I can't
    see a purpose for exposing the user to this complexity.

  • @wlandau Should get.libs() , var.lineage() , and var.id() really
    be exported? clean() needs them, but do you expect most users to
    ever invoke them manually?

    var.id() has been removed. var.lineage(), now var_lineage(), has
    been internalized. get.libs(), now get_libs(), is still a part of
    the API for users, as it allows users to quickly dig out the
    libraries used in code.

  • @wlandau Do we really need a write.code() function? readLines()
    already writes code to a file, and clipr::write_clip() writes to
    the clipboard. Optional: maybe clean() itself could have an
    argument to control where the output goes: either to the clipboard
    with write_clip() or to a file if a path is given.

    Thanks for these suggestions. write.code() has become keep(),
    which now uses the clipr package. Although it may seem like a
    small step, my sense in using Rclean is that abstracting
    writeLines(), which requires the creation of a file connection,
    and combining its functionality with clipr's write_clip(),
    substantially eases the management of cleaned code.

  • @wlandau You explicitly list parse.info() , read.prov() , and
    parse.graph() as internal functions. Perhaps these should not be
    exported either.

    I agree here, however, as discussed above, this feature has been
    removed.

  • @wlandau The names of exported functions have different naming
    conventions: codeGraph() (camel case) vs prov_json() (snake
    case) vs get.spine() (dot-separated). For the exported functions
    you choose to keep, please consider sticking to one of these
    conventions (I recommend snake case). Depending on the user base
    you have right now, you may want to gracefully deprecate those
    functions as
    described here.

    Good suggestion. Switched to snake case throughout.

Installation and automated checks

  • @wlandau Re
    #327 (comment),
    I also had some trouble installing Rclean . I needed to install
    Rgraphviz directly from Bioconductor. I suggest adding the
    following to the installation instructions in the README. At the
    very least, it is a useful stopgap until @annakrystalli's pull
    request to CodeDepends reaches CRAN.

    install.packages("BiocManager")
    BiocManager::install(c("graph", "Rgraphviz"))

    This has been added to the README and vignette.

  • @wlandau devtools::spell_check("Rclean") identifies several
    correctly spelled words. I recommend listing them in an
    inst/WORDLIST file
    (example here).

    Thanks for this. I have added inst/wordlist.

  • @wlandau goddpractice::gp("Rclean") suggests a bunch of style
    fixes I agree with. The styler package can help you turn
    assignment = into <-. Please let me know if you would like
    clarification on any of these notes.

    It is good practice towrite unit tests for all functions, and all package code
        in general. 86% of code lines are covered by test cases.
    
        R/clean.R:49:NA
        R/clean.R:72:NA
        R/clean.R:82:NA
        R/clean.R:91:NA
        R/clean.R:95:NA
        ... and 21 more linesuse '<-' for assignment instead of '='. '<-' is the
        standard, and R users and developers are used it and it is easier
        to read your code for them if you use '<-'.
    
        R/var.id.R:41:44
        R/var.id.R:42:47avoid long code lines, it is bad for readability. Also,
        many people prefer editor windows that are about 80 characters
        wide. Try make your lines shorter than 80 characters
    
        R/clean.R:47:1
        R/clean.R:93:1
        R/get.spine.R:44:1
        R/parse.info.R:39:1
        R/parse.info.R:40:1
        ... and 1 more linesavoid sapply(), it is not type safe. It might return a
        vector, or a list, depending on the input data. Consider using
        vapply() instead.
    
        R/clean.R:150:29
        R/clean.R:151:55
        R/clean.R:169:34avoid 1:length(...), 1:nrow(...), 1:ncol(...),
        1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
        result 1:0 if the expression on the right hand side is zero. Use
        seq_len() or seq_along() instead.
    
        R/get.libs.R:40:13
        R/get.libs.R:42:17
        R/get.libs.R:44:21
        R/var.id.R:39:13
        R/var.lineage.R:43:22
        ... and 4 more lines

    These have either been fixed or deleted from the
    package. goodpractice::gp() currently reports "Ah! Perfect
    package! Keep up the cat's pajamas work!".

Miscellaneous

  • @wlandau I am a bit concerned about the scope of the built-in
    unit tests. It would be helpful to test at least one longer script
    to make sure the correct lines get captured in a serious
    workflow. Not only that, but when you actually run the longer
    script and various object-specific scripts generated by clean() ,
    do the output data objects agree?

    A longer script, that is arguably more realistic, has been added
    along with associated tests to check that minimized code for an
    entangled result can be correctly extracted.

  • @wlandau What is the reason you chose JSON to store
    retrospective provenance? Have you considered a tidier format
    such as a long-form data frame? I like how
    pkgapi represents its output.

    This functionality has been removed, but the PROV-JSON format was
    chosen to comply with standards used by the retrospective
    provenance community.

  • @wlandau The clean() function is ambitious, long, and deeply
    indented. Please consider decomposing this monolithic function
    into smaller helper functions. No need to export the helper
    functions or write roxygen2 docstrings. A particularly good
    resource on this kind of refactoring is Jenny Bryan's talk on
    code smells and feels.

    As mentioned previously, clean() has been refactored and the API
    as a whole has been re-organized to make it more transparent and
    approachable.

  • @wlandau Are you open to using
    styler instead of formatR to
    reformat code? It is more modern and more actively maintained.

    Given the higher maintenance activity, Rclean now uses styler.

  • @nevrome When I tried to apply clean to some of my own scripts,
    I realized that all packages necessary to run this script have to
    be loaded in the current session. If this was not the case, then I
    got the error message Error in as.environment(pos): no item called "package:dplyr" on the search list. I got this message
    10+ times for one of my more complex scripts before clean went
    through, until I loaded all the packages. This is a little bit
    tedious. Is there another way to implement package loading? Maybe
    the automagic package
    could help to automate that? If this is not possible or if you
    don't want to have an automatic solution, then I would at least
    ask you to catch the current error message and replace it with
    something more intelligible, e.g. "Package xyz has to be loaded
    to analyze this script.".

    Sorry you ran into this hurdle with the package. This should now
    no longer be an issue, as it was a product of how the get.libs()
    function found the packages that were being used in the script and
    the need for the retrospective provenance to run the code prior to
    analyzing it. Now, get_libs() no longer needs the packages to be
    loaded and retrospective provenance is no longer used.

  • @nevrome Internal functions do not need to be exported (e.g.
    parse.info , read.prov ). Remove the @export tag for these
    functions and add the tags @Keywords internal and @nord .

    Agreed, the API has been updated and @nord has been added where needed.

  • @nevrome The ** simple_script.R ** code appears many times in
    the package (in the directories test/ , inst/ , exec/ , example/
    ). I don't think you need this file so many times.

    simple_script.R now only occurs in inst/examples.

  • @nevrome Although you not seem to work in RStudio I suggest to
    add a Rstudio project file in the root directory (+ the
    necessary additions to .Rbuildignore ). That simplifies
    contributing for RStudio users.

    Yes, I don't work in RStudio, except when I'm teaching. So, I'm
    not sure how to best set this up for using with the package
    dev. If someone who uses RStudio starts to contribute, I would be
    happy to have it added.

  • @nevrome In the travis-ci setup I suggest to treat warnings
    as errors ( warnings_are_errors: true ). I think this is usual for
    R packages, because no package with warnings can go to CRAN.

    This has been updated to treat warnings as errors.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Dec 5, 2019

Thanks for the detailed response @MKLau!

@wlandau & @nevrome, please let us know whether you are happy with the responses/actions to your review comments.

@wlandau

This comment has been minimized.

Copy link

@wlandau wlandau commented Dec 6, 2019

@MKLau, fantastic work on the revisions! You put a ton of effort into refactoring the code, rethinking the package scope, and strengthening the documentation. The behavior of Rclean is more internally consistent, and the usage and intent are far easier to understand.

After reviewing MKLau/Rclean@b674396, I have some minor follow-up comments and requests.

  • Would you point me to the new test you added on a longer script? I am not sure I found what you are referring to.
  • Code coverage could be improved if you think it is reasonable. Raising coverage to 100% is not as important as diligently verifying the core claims of the package, but I thought I would raise the issue since Rclean seems relatively light in tests. I have included a coverage report below. You could do some combination of adding tests that hit those lines and adding inline comments in the source for lines you want to deliberately exempt.
Rclean Coverage: 89.69%
R/keep.R: 83.33%
R/clean.R: 85.00%
R/get_libs.R: 85.71%
R/get_vars.R: 85.71%
R/code_graph.R: 92.31%
R/var_lineage.R: 100.00%
         filename  functions line value
51      R/clean.R      clean   57     0
53      R/clean.R      clean   58     0
79      R/clean.R   get_path  132     0
81      R/clean.R   get_path  135     0
82      R/clean.R   get_path  136     0
83      R/clean.R   get_path  137     0
20 R/code_graph.R code_graph   46     0
9    R/get_libs.R   get_libs   42     0
18   R/get_vars.R   get_vars   45     0
31       R/keep.R       keep   49     0
  • The goodpractice::gp() report lists a couple other minor notes too.
avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/code_graph.R:15:1
    R/keep.R:17:1
    R/var_lineage.R:15:1
    R/var_lineage.R:22:1fix this R CMD check NOTE: Namespace in Imports field not
    imported from:jsonliteAll declared Imports should be used.

Without retrospective provenance, it appears you no longer need jsonlite. Also, line length is ultimately a matter of personal style, but lines less than 80 characters are relatively easy to read. You could either wrap those lines or let me know what you prefer stylistically.

  • Lastly, I recommend fixing a couple examples. In package.R, I recommend using system.file() to get the path like you do elsewhere (as opposed to the hard-coded example/simple_script.R). And here, the value returned by data() does not have the lines of code we want. I recommend something like this:
#' data(simple_script)
#' ## Copies code to your clipboard
#' keep(simple_script)
#' ## Saves code to your disk
#' keep(simple_script, file = "clean_code.R")
@nevrome

This comment has been minimized.

Copy link
Member

@nevrome nevrome commented Dec 9, 2019

I agree with @wlandau: You did an excellent job refactoring Rclean, @MKLau. I really enjoyed reading the vignette and running the example code. I will keep this in mind to analyse my own scripts.

Maybe an RStudio Addin would be nice: Select a variable in your code file with your mouse, press a key combo, get the clean() result. Would make it even more convinient and is simple enough to implement with rstudioapi.

Some last, small remarks:


  • I agree with @wlandau that the example code should run out of the box without any additional work for the user. That is especially true for the README. I therefore suggest to replace
script <- "example/simple_script.R"

with

script <- system.file("example", "simple_script.R", package = "Rclean")
  • The plot produced by code_graph() has very small node labels. Maybe you could add argument passing with ... (code_graph(script, pdf.file, ...)) to Rgraphviz::plot.graphNEL (plot(gNEL, ...)). That would allow me to set the label size. I personally would prefer to separate graph generation and graph plotting, but if you combine them, then at least I would like to have some control over the appearance.

  • Unrelated to the package: The lack of https access to http://mklau.info/ still causes problems for me in chromium. Maybe your hoster provides an easy way to get Let's encrypt certificates for this domain? It will be increasingly difficult for users to access your homepage without encryption.

  • Concerning the JOSS submission: Afaik the JOSS editors assume that the paper is perfectly fine already when it went through the rOpenSci review. Can you confirm that @annakrystalli? The paper is not completely ready yet though, so it does not make sense for me to read and review it at this point. I wonder what's the correct procedure here to make sure that JOSS receives reviewed texts.


If you want to add the reviewers to the DESCRIPTION file, then you can use the following string for me:

	person(given = "Clemens",
             family = "Schmid",
             role = "rev",
             email = "clemens@nevrome.de",
             comment = c(ORCID = "0000-0003-3448-5715"))
@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Dec 9, 2019

pinging @labarba as the JOSS editor for that part of this - note the comment about the JOSS paper and review (openjournals/joss-reviews#1312) in the comment above.

@labarba

This comment has been minimized.

Copy link

@labarba labarba commented Dec 9, 2019

@danielskatz — I don't understand what I need to opine about.

@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Dec 9, 2019

the part:

Afaik the JOSS editors assume that the paper is perfectly fine already when it went through the rOpenSci review. Can you confirm that @annakrystalli? The paper is not completely ready yet though, so it does not make sense for me to read and review it at this point. I wonder what's the correct procedure here to make sure that JOSS receives reviewed texts.

@labarba

This comment has been minimized.

Copy link

@labarba labarba commented Dec 9, 2019

When a paper goes through rOpenSci review, it gets handled by @arfon on the JOSS side. I have not handled any of these.

@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Dec 9, 2019

Actually, it can be handled by the AEiC on duty - I've done this a few times, as have others. See https://joss.readthedocs.io/en/latest/editing.html#processing-of-ropensci-reviewed-and-accepted-submissions

@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Dec 9, 2019

In any event, to answer @nevrome's question, my understanding is that once this submission is accepted by rOpenSci, the JOSS Associate-Editor-in-Chief responsible for the JOSS review (which is currently @labarba) will review the paper to make sure it meets JOSS standards, but will assume the review of everything else done by rOpenSci is sufficient. (But maybe it's worth asking @arfon to make sure I have this right)

@nevrome

This comment has been minimized.

Copy link
Member

@nevrome nevrome commented Dec 9, 2019

Thank you for your answer, @danielskatz. My issue is mostly caused by the unclear border between code and text review.

@wlandau and I provided code review according to this Guide for Reviewers. The guide does not mention the JOSS paper. The review template contains the following section, though:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

So obviously it is expected that I read and review the text. Also from my understanding of your answer, @danielskatz, there won't be additional review beyond a rather general check of the editor over at JOSS.

The takeaway message is that the course of action suggested above by @MKLau is not possible:

The initial submission of the JOSS paper was based on an example of a submitted article present on the JOSS instructions. However, after this package review, there will be substantial content added to the article. I am currently working on these updates to the text.

If there are significant changes of the article AFTER this review, then they won't be reviewed at all.

@annakrystalli

This comment has been minimized.

Copy link
Collaborator

@annakrystalli annakrystalli commented Dec 9, 2019

Thanks @nevrome for picking this up.

What we can do is consider the code review of the package complete and move it to rOpenSci. I will coordinate with @MKLau and when the paper.md is complete and ready for JOSS submission, I will review it before it gets submitted (unless of course you want to @nevrome). Ultimately I'm happy to consider your contribution complete and do the final check myself.

@danielskatz

This comment has been minimized.

Copy link

@danielskatz danielskatz commented Dec 9, 2019

If there are significant changes of the article AFTER this review, then they won't be reviewed at all.

Well, they are reviewed by the Associate-Editor-in-Chief, but we would certainly appreciate anything you do that improves the paper. Whether this is an obligation of rOpenSci reviewers of submissions that are also being submitted to JOSS is for rOpenSci to determine.

@nevrome

This comment has been minimized.

Copy link
Member

@nevrome nevrome commented Dec 9, 2019

Thanks for the clarification, @annakrystalli and @danielskatz. The solution suggested by you, @annakrystalli, is fine for me. I can also offer to read the text when it's ready and if you have the feeling that some additional feedback might be helpful, @MKLau.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Dec 9, 2019

Great, thanks to everyone for clarifying the relationship between the software and article review. Thanks for the offer to review the manuscript (@nevrome), it would be great to get your input on it if you have the time to spare.

@annakrystalli, I’ll send you the updated manuscript soon. Hopefully by week’s end.

@MKLau

This comment has been minimized.

Copy link
Author

@MKLau MKLau commented Dec 9, 2019

@wlandau and @nevrome, thanks also for the follow-up comments and suggestions.

  • I'll take care of the NAMESPACE update and the long line issues. I'm not sure why gp didn't pick these up when I ran it.
  • I've added "enhancement" issues for the RStudio add-in and the node label sizing for code_graph. I'd like to get those implemented soon, but probably after the review, as they're not integral to the functioning of the package.
  • @wlandau, per your request for the test reference, I was referring to line 26 in test-main.R. It tests the correct cleaning of code for a two variables that are "tangled" around several other variables. I agree with your general comments about testing. After taking a closer look using covr I think should be able to get the coverage up to over 95%, but I agree, it's more important to have tests that are checking for correct functioning of the package. I think that the clean function testing could be more extensive, and if you have more specific suggestions, I'd be happy to integrate them.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.