New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct atom configuration prior fingerprint calculation #94

Open
bachi55 opened this Issue Jan 17, 2019 · 3 comments

Comments

Projects
None yet
2 participants
@bachi55
Copy link
Contributor

bachi55 commented Jan 17, 2019

Hello,

I was wondering about the correct usage of do.typing(), do.aromaticity() and do.isotopes() before calculating the fingerprints of a molecules parsed from its SMILES using parse.smiles().

Let me go through a few examples.

MACCS
The documentation of get.fingerprints contains:

smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1','C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fps <- lapply(mols, get.fingerprint, type='maccs')

There is no information, that the aromaticity might need to be perceived first as otherwise some SMARTS are not properly matched (?). In the CDK tests for the MACCSFingerprinter at least there is atom-typing and aromaticity-detection done.

Pubchem
For this fingerprinter the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e. convert.implicit.to.explicit.

Klekota and Roth
Here the CDK tests indicate, that no "additional" function, e.g. typing or aromaticity detection, needs to be applied.

So I could continue the list of different fingerprinters, which seem to expect different "modifications" done to the parsed molecular structure. Some fingerprinters seems to take care about it internally (within the class).

I would like to know, how others are dealing with this issue? Which transformations you perform, when you calculate different fingerprints (I could believe that problem continues for descriptors as well)? Am I overthinking the problem here? Can a "to much modifications" (I called do.typing if it is not needed, e.g.) harm my calculation, i.e. wrong fingerprints?

I believe the problem should somehow be solved in CDK, but how can we maybe in the rcdk side make the documentation more precise?

Best regards,

Eric

@schymane

This comment has been minimized.

Copy link
Contributor

schymane commented Jan 17, 2019

@bachi55 this is a great point and it's a problem we are encountering beyond the fingerprints (look at e.g. #73 and the massfuncs branch where we are trying to capture this for the mass calculations). Historically in RMassBank we patched this by doing the typing etc our side in wrapper functions to ensure consistency. What concerns me is that they are also order dependent (change the order of the commands and you get different results) and this is should not happen (or if it does, then users need guidance what order is "right").

We've had conversation (email) with @rajarshi about updating documentation to roxygen anyway and not only to they need roxygenising but also a revamp to deliver additional clarity. If we start to sum up the time I'm investing into clarifying these aspects to colleagues via email, it'd start adding up to the time it would take to upgrade the docs and the unit tests to ensure we're doing it right. Just need this magic thing called time ;-)

@rajarshi any idea how best to coordinate? I am hiring and hope to soon have either own capacity or team member capacity to potentially contribute to a joint effort as it's impacting what we need to achieve. My logical start point would be on the masses but we're still stuck with the NPEs ...

@bachi55 my approach for the masses was to build unit tests for all the "edge cases" I would expect and I would use those also to build the documentation to clarify (in examples) what to do and what not to do ... once we have them as unit tests we can then run them and interact with the CDK guys to check rcdk is doing what it should ...

@bachi55

This comment has been minimized.

Copy link
Contributor

bachi55 commented Jan 17, 2019

Hei Emma,

thanks for your comments. It's "good" to hear, that the problem is somehow known.

I am at the moment using your approach to wrap rcdk, when calculating the fingerprints to have at least consistent computations. And, I will update my code so that it can test for some "edge cases". Do you have a list of those for fingerprints as well?

Actually, I think it would be good, if the CDK framework would directly take care about the correct configurations, depending on the fingerprint used.

Best regards,

Eric

@schymane

This comment has been minimized.

Copy link
Contributor

schymane commented Jan 17, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment