Correct atom configuration prior fingerprint calculation #94

bachi55 · Jan 17, 2019

Hello,

I was wondering about the correct usage of do.typing(), do.aromaticity() and do.isotopes() before calculating the fingerprints of a molecules parsed from its SMILES using parse.smiles().

Let me go through a few examples.

MACCS
The documentation of get.fingerprints contains:

smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1','C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fps <- lapply(mols, get.fingerprint, type='maccs')

There is no information, that the aromaticity might need to be perceived first as otherwise some SMARTS are not properly matched (?). In the CDK tests for the MACCSFingerprinter at least there is atom-typing and aromaticity-detection done.

Pubchem
For this fingerprinter the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e. convert.implicit.to.explicit.

Klekota and Roth
Here the CDK tests indicate, that no "additional" function, e.g. typing or aromaticity detection, needs to be applied.

So I could continue the list of different fingerprinters, which seem to expect different "modifications" done to the parsed molecular structure. Some fingerprinters seems to take care about it internally (within the class).

I would like to know, how others are dealing with this issue? Which transformations you perform, when you calculate different fingerprints (I could believe that problem continues for descriptors as well)? Am I overthinking the problem here? Can a "to much modifications" (I called do.typing if it is not needed, e.g.) harm my calculation, i.e. wrong fingerprints?

I believe the problem should somehow be solved in CDK, but how can we maybe in the rcdk side make the documentation more precise?

Best regards,

Eric

schymane · Jan 17, 2019

@bachi55 this is a great point and it's a problem we are encountering beyond the fingerprints (look at e.g. #73 and the massfuncs branch where we are trying to capture this for the mass calculations). Historically in RMassBank we patched this by doing the typing etc our side in wrapper functions to ensure consistency. What concerns me is that they are also order dependent (change the order of the commands and you get different results) and this is should not happen (or if it does, then users need guidance what order is "right").

We've had conversation (email) with @rajarshi about updating documentation to roxygen anyway and not only to they need roxygenising but also a revamp to deliver additional clarity. If we start to sum up the time I'm investing into clarifying these aspects to colleagues via email, it'd start adding up to the time it would take to upgrade the docs and the unit tests to ensure we're doing it right. Just need this magic thing called time ;-)

@rajarshi any idea how best to coordinate? I am hiring and hope to soon have either own capacity or team member capacity to potentially contribute to a joint effort as it's impacting what we need to achieve. My logical start point would be on the masses but we're still stuck with the NPEs ...

@bachi55 my approach for the masses was to build unit tests for all the "edge cases" I would expect and I would use those also to build the documentation to clarify (in examples) what to do and what not to do ... once we have them as unit tests we can then run them and interact with the CDK guys to check rcdk is doing what it should ...

bachi55 · Jan 17, 2019

Hei Emma,

thanks for your comments. It's "good" to hear, that the problem is somehow known.

I am at the moment using your approach to wrap rcdk, when calculating the fingerprints to have at least consistent computations. And, I will update my code so that it can test for some "edge cases". Do you have a list of those for fingerprints as well?

Actually, I think it would be good, if the CDK framework would directly take care about the correct configurations, depending on the fingerprint used.

Best regards,

Eric

schymane · Jan 17, 2019

I haven’t looked into this for fingerprints, but was answering some queries about them only a few days ago ;-) and some of the edge cases I pulled out for the mass functions may apply – I’d say you need cases that cover aromaticity, explicit/implicit Hs, tautomers and probably also isotopes and I have examples that cover several bits of that in massfuncs: https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R Make sure you also capture some cases that should distinguish the different features of the fingerprints .. @rajarshi may have better ideas there …

CDK-R/cdkr

Correct atom configuration prior fingerprint calculation #94

Correct atom configuration prior fingerprint calculation #94

bachi55 commented Jan 17, 2019

This comment has been minimized.

schymane commented Jan 17, 2019

This comment has been minimized.

bachi55 commented Jan 17, 2019

This comment has been minimized.

schymane commented Jan 17, 2019

CDK-R/cdkr

Join GitHub today

Correct atom configuration prior fingerprint calculation #94

Comments

bachi55 commented Jan 17, 2019

This comment has been minimized.

schymane commented Jan 17, 2019

This comment has been minimized.

bachi55 commented Jan 17, 2019

This comment has been minimized.

schymane commented Jan 17, 2019