Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PRE REVIEW]: Hierdenc: Retrieval and clustering of large categorical data sets with locality-sensitive hashing #693

Open
whedon opened this issue Apr 23, 2018 · 46 comments

Comments

Projects
None yet
7 participants
@whedon
Copy link
Collaborator

commented Apr 23, 2018

Submitting author: @wandreopoulos (William Andreopoulos)
Repository: https://git.code.sf.net/p/hierdenc/code
Version: v1.0
Editor: @jedbrown
Reviewer: Pending

Author instructions

Thanks for submitting your paper to JOSS @wandreopoulos. The JOSS editor (shown at the top of this issue) will work with you on this issue to find a reviewer for your submission before creating the main review issue.

@wandreopoulos if you have any suggestions for potential reviewers then please mention them here in this thread. In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission.

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands

@whedon whedon added the pre-review label Apr 23, 2018

@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 23, 2018

Hello human, I'm @whedon. I'm here to help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@whedon commands
@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 23, 2018

Attempting PDF compilation. Reticulating splines etc...
@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 23, 2018

PDF failed to compile for issue #693 with the following error:

Can't find any papers to compile :-(

@arfon

This comment has been minimized.

Copy link
Member

commented Apr 23, 2018

@wandreopoulos - do you know why the source code is not showing up here? https://sourceforge.net/p/hierdenc/code/ref/master/

@wandreopoulos

This comment has been minimized.

Copy link

commented Apr 23, 2018

@arfon

This comment has been minimized.

Copy link
Member

commented Apr 24, 2018

@wandreopoulos - yes, please push to git too.

@wandreopoulos

This comment has been minimized.

Copy link

commented Apr 25, 2018

I did

@arfon

This comment has been minimized.

Copy link
Member

commented Apr 30, 2018

👋 @kyleniemeyer - would you be willing to edit this submission for JOSS?

@kyleniemeyer

This comment has been minimized.

Copy link

commented May 1, 2018

@arfon I don't think I'm the best person to edit this; someone more data-sciency might be better.

@arfon

This comment has been minimized.

Copy link
Member

commented May 3, 2018

@whedon generate pdf

@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented May 3, 2018

Attempting PDF compilation. Reticulating splines etc...
@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented May 3, 2018

PDF failed to compile for issue #693 with the following error:

/app/vendor/ruby-2.3.4/lib/ruby/2.3.0/find.rb:43:in block in find': No such file or directory (Errno::ENOENT) from /app/vendor/ruby-2.3.4/lib/ruby/2.3.0/find.rb:43:in collect!'
from /app/vendor/ruby-2.3.4/lib/ruby/2.3.0/find.rb:43:in find' from /app/vendor/bundle/ruby/2.3.0/bundler/gems/whedon-a31a5c2a9125/lib/whedon/processor.rb:52:in find_paper_paths'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/whedon-a31a5c2a9125/bin/whedon:32:in prepare' from /app/vendor/bundle/ruby/2.3.0/gems/thor-0.20.0/lib/thor/command.rb:27:in run'
from /app/vendor/bundle/ruby/2.3.0/gems/thor-0.20.0/lib/thor/invocation.rb:126:in invoke_command' from /app/vendor/bundle/ruby/2.3.0/gems/thor-0.20.0/lib/thor.rb:387:in dispatch'
from /app/vendor/bundle/ruby/2.3.0/gems/thor-0.20.0/lib/thor/base.rb:466:in start' from /app/vendor/bundle/ruby/2.3.0/bundler/gems/whedon-a31a5c2a9125/bin/whedon:99:in <top (required)>'
from /app/vendor/bundle/ruby/2.3.0/bin/whedon:22:in load' from /app/vendor/bundle/ruby/2.3.0/bin/whedon:22:in

'

@arfon

This comment has been minimized.

Copy link
Member

commented May 3, 2018

@whedon generate pdf

@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented May 3, 2018

Attempting PDF compilation. Reticulating splines etc...
@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented May 3, 2018

@whedon whedon added Python TeX labels May 4, 2018

@arfon

This comment has been minimized.

Copy link
Member

commented May 14, 2018

@whedon list reviewers

@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented May 14, 2018

Here's the current list of reviewers: https://bit.ly/joss-reviewers

@arfon

This comment has been minimized.

Copy link
Member

commented May 14, 2018

@wandreopoulos - could you take a look at the list above ☝️and suggest some potential reviewers?

@wandreopoulos

This comment has been minimized.

Copy link

commented May 25, 2018

@arfon

This comment has been minimized.

Copy link
Member

commented Jun 4, 2018

👋 @jedbrown - would you be willing to edit this submission for JOSS?

@jedbrown

This comment has been minimized.

Copy link
Member

commented Jun 4, 2018

I can take it.

@wandreopoulos It looks like your list formatting is missing some newlines to make it format as a list instead of as prose. Here is a recently-accepted example with a list that formats correctly. https://github.com/jrbourbeau/pyunfold/blob/master/docs/joss_paper/paper.md

Please also consider addressing community guidelines and examples.

@arfon Correct me if wrong, but I would expect the "repository" link to be viewable in a web browser, versus a URL that can only be cloned by the Git client.

@arfon

This comment has been minimized.

Copy link
Member

commented Jun 5, 2018

@arfon Correct me if wrong, but I would expect the "repository" link to be viewable in a web browser, versus a URL that can only be cloned by the Git client.

Usually yes. Although @whedon's automation relies upon this URL being cloneable.

@jedbrown

This comment has been minimized.

Copy link
Member

commented Jun 5, 2018

@whedon assign @jedbrown as editor

@whedon

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 5, 2018

OK, the editor is @jedbrown

@arfon

This comment has been minimized.

Copy link
Member

commented Jun 15, 2018

👋 @jedbrown - I think we just need to assign a reviewer (or two) to get moving on this one.

@jedbrown

This comment has been minimized.

Copy link
Member

commented Jun 15, 2018

This is a Python package with no install process (no setup.py; it is meant to be run out of the source directory), no tests, and no examples. It requires modifying source code just to run. The tool is a total of 480 lines of Python and there are no docstrings. I was hoping to hear from @wandreopoulos before reviewer assignment because it appears at first glance to be either "not feature complete" or a "minor utility", though it has potential.

@wandreopoulos

This comment has been minimized.

Copy link

commented Jun 16, 2018

@jedbrown jedbrown added the paused label Sep 21, 2018

@arfon

This comment has been minimized.

Copy link
Member

commented Nov 2, 2018

Hi @wandreopoulos - how are you getting along making these changes?

@wandreopoulos

This comment has been minimized.

Copy link

commented Nov 5, 2018

@jedbrown

This comment has been minimized.

Copy link
Member

commented Nov 10, 2018

Have you tested with Python-3 (up to 3.6, as setup.py claims to support)? The dependency on MySQL-python (which has been abandoned for years) seems to prevent it.

Please also double-check the review criteria, especially community guidelines (I don't see anything about contributions or support) and documentation. I'm perplexed by the software design to use classes with no members or data as a wrapper around a single run method that does not access self. The docstrings are then associated with the class but meant to apply to the method. Seems like all of these would be better as plain functions, but documentation should go in the right place in any case.

@wandreopoulos

This comment has been minimized.

Copy link

commented Nov 29, 2018

@jedbrown

This comment has been minimized.

Copy link
Member

commented Dec 28, 2018

Thanks for the updates and sorry about my slow reply in this thread. Please take a look at Packaging Python Projects and consider following the recommendations strictly unless you have a compelling reason not to (current directory structure is unusual). The paper should also include some discussion of related software. Under what circumstances is hierdenc a better choice for researchers than other software?

Can you also check whether the requirements are accurate? I presume nose is actually a requirement for the tests, thus should be in requirements_dev.txt. I don't see any direct use of SciPy, though the README, paper, and a ReadInput.py comment erroneously claims that murmurhash_32 (should be murmurhash3_32) is in SciPy (it is actually in sklearn.utils). Also, pandas is in requirements_dev.txt, but appears to not be used at all.

If one installs the package as is, they get $prefix/bin/create_db which is a rather audacious use of the public namespace. Also, there doesn't seem to be any Python package installed. I.e., one cannot install the package and write a script that imports it. Cf. JOSS submission requirements (emphasis mine).

The software should be feature complete (no half-baked solutions) and designed for maintainable extension (not one-off modifications). Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable.

@labarba

This comment has been minimized.

Copy link
Member

commented Feb 13, 2019

👋 @wandreopoulos — We haven't heard from you in a while. What's your status? Are you planning on improving this software for consideration in JOSS? Or should we withdraw this submission?

@wandreopoulos

This comment has been minimized.

Copy link

commented Feb 13, 2019

@wandreopoulos

This comment has been minimized.

Copy link

commented Feb 23, 2019

Hello:

Thank you for all your feedback. I am sorry for my terribly slow response. I just went through a hectic period.

Please take a look at Packaging Python Projects and consider following the recommendations strictly unless you have a compelling reason not to (current directory structure is unusual).

I worked on setup.py and the directory structure, according to the link you sent me.
In particular, the source *.py files are under a subdirectory, and there are LICENSE and README.md files.

The paper should also include some discussion of related software. Under what circumstances is hierdenc a better choice for researchers than other software?

I added to paper.md:
"As the dataset grows over time and new items are frequently getting added to the previous results, Hierdenc will preserve the previous cluster structure and add to it. Under these circumstances, it is a better choice than other methods. It won't build an indexed database faster than other methods like MCL do clustering, but it will preserve the results from datasets that were previously analyzed and add to the results incrementally as new data gets introduced."

Can you also check whether the requirements are accurate? I presume nose is actually a requirement for the tests, thus should be in requirements_dev.txt. I don't see any direct use of SciPy, though the README, paper, and a ReadInput.py comment erroneously claims that murmurhash_32 (should be murmurhash3_32) is in SciPy (it is actually in sklearn.utils). Also, pandas is in requirements_dev.txt, but appears to not be used at all.

I have checked the requirements and fixed them. I removed SciPy and Pandas.

If one installs the package as is, they get $prefix/bin/create_db which is a rather audacious use of the public namespace.

The scripts/create_db.py functionality is now provided in a py script that setups the schema of the db. It is not installed under the bin directory. It can be run under the hierdenc/scripts directory.

Also, there doesn't seem to be any Python package installed. I.e., one cannot install the package and write a script that imports it. Cf. JOSS submission requirements (emphasis mine).
The software should be feature complete (no half-baked solutions) and designed for maintainable extension (not one-off modifications). Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable.

This is fixed and the python package hierdenc is now installed as described in the README.md file (see sections 4.1-4.2). I used a virtualenv and then imported the package in python in order to check that it gets installed.

@labarba

This comment has been minimized.

Copy link
Member

commented Feb 23, 2019

@jedbrown -- Are you planning to move this paper into review?
We need to have these conversations in a Review issue. Here, we're just in the Pre-Review, which has the purpose of finding reviewers and making the initial assessment of scope and eligibility. If this submission passes the eligibility/scope assessment, then the handling editor's suggestions for improvement would be better recorded in the Review issue.

@jedbrown

This comment has been minimized.

Copy link
Member

commented Feb 24, 2019

@wandreopoulos Thanks for your work on this. The install currently doesn't actually install anything, and if I import hierdenc from the source directory, none of the functionality is exposed (it's just an empty module). To be a reusable library, one would expect to be able to python setup.py install --prefix=/some/prefix and then (with suitable PYTHONPATH or virtualenv) import hierdenc and actually use it as a library. Can you check that again?

@wandreopoulos

This comment has been minimized.

Copy link

commented Feb 24, 2019

Sorry. hierdenc is provided as a package under a directory with a init.py file.
I changed init.py, such that all .py modules under the hierdenc directory get imported.
From the source directory if I do from hierdenc import * the functionality is exposed as follows:

wandreo-lm:hierdenc andreopo$ python
Python 2.7.15 (default, Aug 22 2018, 16:36:18)

from hierdenc import *
ReadInput.readinput("test/plants.data", False)
...
Cluster.cluster(50)
....
RetrieveNN.retrieve("aerva,fl,ca")
...

I also tried sudo python setup.py install and the package with the modules became available.

Please remember to pull or clone the latest code: git clone ssh://billandreo@git.code.sf.net/p/hierdenc/code hierdenc-code

@jedbrown

This comment has been minimized.

Copy link
Member

commented Mar 4, 2019

I tried again and it looks like an environment issue. Please try this to reproduce in a clean environment.

$ virtualenv TESTENV
$ . TESTENV/bin/activate
$ python setup.py install
$ cd /
$ python -c 'from hierdenc import *'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jed/joss/hierdenc/hierdenc/TESTENV/lib/python3.7/site-packages/Hierdenc_billandreo-1.96-py3.7.egg/hierdenc/BandHasher.py", line 12, in <module>
    from Constants import *
ModuleNotFoundError: No module named 'Constants'

We change directory to force the import to use the installed version rather than the current directory.

@wandreopoulos

This comment has been minimized.

Copy link

commented Mar 10, 2019

@jedbrown

This comment has been minimized.

Copy link
Member

commented Mar 10, 2019

Oh, Python3 was the issue. It works with Python2, though that language will be unmaintained in 9 months and the libraries you use (numpy and sklearn) are dropping support sooner. What do you think about supporting Python-3?

@wandreopoulos

This comment has been minimized.

Copy link

commented Mar 12, 2019

Hi Jed,
I am currently trying to port the code to Python3, it should be done in a while. I will let you know.

@arfon arfon removed the TeX label May 11, 2019

@wandreopoulos

This comment has been minimized.

Copy link

commented May 13, 2019

@danielskatz

This comment has been minimized.

Copy link

commented May 20, 2019

👋 @jedbrown - I think the action on this submission is now yours

@jedbrown

This comment has been minimized.

Copy link
Member

commented May 20, 2019

@wandreopoulos I've tried installing with your branch and the virtualenv procedure above does not work. Some issues

  • To start with, you can remove sqlite3 from requirements.txt because it is part of the standard library (not a package).
  • I think nose is really only a development requirement.
  • You can also remove duplicate entries from requirements_dev.txt (it doesn't need to repeat those that are in requirements.txt).
  • setup.py should state that it supports Python-3.
  • sklearn no longer supports Python-2 and requirements.txt doesn't explicitly restrict to an old version.
@jedbrown

This comment has been minimized.

Copy link
Member

commented May 20, 2019

Also:

Note to Users

Nose has been in maintenance mode for the past several years and will likely cease without a new person/team to take over maintainership. New projects should consider using Nose2, py.test, or just plain unittest/unittest2.
https://nose.readthedocs.io/en/latest/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.