Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/get mass #547

Open
wants to merge 7 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@johnmay
Copy link
Member

johnmay commented Feb 18, 2019

Adds getMass() functions to AtomContainerManipulator and MolecularFormulaManipulator unify the existing calculations. Existing methods are deprecated pointing to the new API.

Some open questions, I used integers for the options but a new 'MassType' enum might be useful. I really don't like adding new class but in this case might make sense. Naming of the options is flexible, would be good to get Emma's thoughts on this (can't find her GitHub handle ATM).

The tests summarise it nicely:

    @Test public void getMassC6Br6() throws InvalidSmilesException {
        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        SmilesParser smipar = new SmilesParser(bldr);
        IAtomContainer mol = smipar.parseSmiles("Brc1c(Br)c(Br)c(Br)c(Br)c1Br");
        assertThat(AtomContainerManipulator.getMass(mol, MolWeight),
                   closeTo(551.485, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, AverageWeight),
                   closeTo(551.485, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MonoIsotopic),
                   closeTo(545.510, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MostAbundant),
                   closeTo(551.503, 0.001));
    }

    @Test public void getMassCranbin() {
        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        IAtomContainer mol =
                MolecularFormulaManipulator.getAtomContainer("C202H315N55O64S6",
                                                              bldr);
        assertThat(AtomContainerManipulator.getMass(mol, MolWeight),
                   closeTo(4730.397, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, AverageWeight),
                   closeTo(4730.397, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MonoIsotopic),
                   closeTo(4727.140, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MostAbundant),
                   closeTo(4729.147, 0.001));
    }

    @Test public void getMassCranbinSpecIsotopes() {
        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        IAtomContainer mol =
                MolecularFormulaManipulator.getAtomContainer("[12]C200[13]C2[1]H315[14]N55[16]O64[32]S6",
                                                             bldr);
        assertThat(AtomContainerManipulator.getMass(mol, MolWeight),
                   closeTo(4729.147, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, AverageWeight),
                   closeTo(4730.397, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MonoIsotopic),
                   closeTo(4729.147, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MostAbundant),
                   closeTo(4729.147, 0.001));
    }

    @Test public void getMassCranbinMixedSpecIsotopes() {
        IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
        IAtomContainer mol =
                MolecularFormulaManipulator.getAtomContainer("C200[13]C2H315N55O64S6",
                                                             bldr);
        assertThat(AtomContainerManipulator.getMass(mol, MolWeight),
                   closeTo(4732.382, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, AverageWeight),
                   closeTo(4730.397, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MonoIsotopic),
                   closeTo(4729.147, 0.001));
        assertThat(AtomContainerManipulator.getMass(mol, MostAbundant),
                   closeTo(4731.154, 0.001));
    }
@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Feb 18, 2019

@schymane... hard to find you.

Will also ping in CDK-R/cdkr#65

@egonw
Copy link
Member

egonw left a comment

Thanks, I like the API. I also like your suggestion of a MassType.

* a distribution of isotopes by splitting the set of isotopes in two,
* the one under consideration (specified by 'idx') and the remaining to be
* considered ('>idx'). The inflection point is calculate as 'k'
* &le 'count' isotopes added. If there are remaining isotopes the method

This comment has been minimized.

@egonw

egonw Feb 18, 2019

Member

missing ';' in ≤

Show resolved Hide resolved ...a/org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator.java
@schymane

This comment has been minimized.

Copy link

schymane commented Mar 3, 2019

So, it seems you have built in 4 cases and we had 5 in the latest changes @rajarshi and I worked on in rcdk ... the missing one is the mass integer (maybe worth adding for our nominal mass friends?). Our test cases don't overlap which is making the terminology match confusing.
I don't find MolWeight and AverageWeight sufficiently clear, inherently, in what they do differently (you need to study the test cases) - and would prefer to use mass not weight, as it's technically a mass.

https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R
We had
total.exact
total.natural
mass.number
major.isotope
molecular.weight (I concede there it's likely the most used term so probably the best, maybe one could add molecular.mass and molecular.weight and have them equal to the same).

I cannot figure out if these overlap completely with the different terms and the unfamiliar masses ... I suspect one option doesn't - and I think once we have all the options we need, we should come to an agreement on names. It's not clear for mass spec people which one is monoisotopic mass in our rcdk terms, it is for yours, but it's not clear if and when isotopes are ignored and this is very confusing for end users. Are you able to implement a couple of my trickier test cases? I chose them to check the implicit/explicit H handling (a nasty case for deuterium, especially deuterium exchange) and the isotope handling as well.
https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R#L23
https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R#L43
https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R#L59
https://github.com/CDK-R/cdkr/blob/massfuncs/rcdk/inst/unitTests/runit.mass.R#L91

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 3, 2019

AverageWeight is essentially for backward compatibility with what CDK used to do an in IMO is wrong - not sure if you looked at the doc in the patch, here it is in case you missed it but I think it's clear:

  • MolWeight (default) - use an isotopes natural mass unless a specific isotope is specified
  • AverageWeight - use an isotopes natural mass even if a specific isotope is specified
  • MonoIsotopic - use an isotopes major isotope mass even if a specific isotope is specified
  • MostAbundant - use the distribution of isotopes based on their abundance and select the most abundant. For example C6Br6 would have three 79Br and 81Br because their abundance is 51 and 49%.

I kind of think MonoIsotopic should also me modified such that explicitly specified isotopes are considered (same as MolWeight) currently it just throws them away..

In terms of naming, happy to modify and drop the Weight from AverageWeight as needed but "total.exact" and "total.natural" are non starters for me, works in R but not in Java - we can't have the "dots" and TotalExact TotalNautral meh. As you say "MolWeight" is too ubiquitous, and we have to think about the community in general rather just MS so I would keep that, it's currently the default which I think makes sense.

molecular.weight (I concede there it's likely the most used term so probably the best, maybe one could add molecular.mass and molecular.weight and have them equal to the same).

Not really a fan of that duplication, think it makes it more confusing.

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 3, 2019

AverageWeight is essentially for backward compatibility with what CDK used to do an in IMO is wrong -

I agree that it's "wrong" but agree we have to leave it in for backwards compatibility.

not sure if you looked at the doc in the patch,

I confess I was severely undercaffeinated this morning and found those definitions in my emails after giving my feedback, sorry :-/ and thanks for pasting again

* MolWeight (default) - use an isotopes natural mass unless a specific isotope is specified

OK this is good, behaviour is "normal" unless we specify, and the name is what most would consider logical.

* AverageWeight - use an isotopes natural mass even if a specific isotope is specified

As you said, it's needed for backwards compatibility and having the "weight" in there will connect it back to MolWeight. I would not know a chemically correct term for this offhand since conceptually it's not really logical. Suggest to add "(this option is retained for backwards compatibility)" or similar to the documentation, so purists that have an issue with a rather illogical calculation know the reasoning?

* MonoIsotopic - use an isotopes major isotope mass even if a specific isotope is specified

This is an absolute killer for mass spec if this is the only exact mass option and we'd need an additional function that gives us the exact mass for specified isotopes. So that's the case that was missing. If we go strictly by definition (which is from the IUPAC Gold Book):
https://en.wikipedia.org/wiki/Monoisotopic_mass
The monoisotopic mass is the sum of the masses of the atoms in a molecule using the unbound, ground-state, rest mass of the principal (most abundant) isotope for each element instead of the isotopic average mass

Then we will need two cases. CDK's MonoIsotopic option should match the IUPAC definition (and does if your major isotope the principal by abundance - maybe adjust the definitions to be as close to the IUPAC as possible?), then we would need another option ExactMass (or similar) to return the exact mass of the exact isotopic combination specified. , i.e
*ExactMass - the sum of the exact masses of the atoms using the specified isotopes

* MostAbundant - use the distribution of isotopes based on their abundance and select the most abundant. For example C6Br6 would have three 79Br and
       81Br because their abundance is 51 and 49%.

Agree with the need for this and like the name. It says what it is.

I kind of think MonoIsotopic should also me modified such that explicitly specified isotopes are considered (same as MolWeight) currently it just throws them away..

See above - we should match the IUPAC definition.

In terms of naming, happy to modify and drop the Weight from AverageWeight as needed but "total.exact" and "total.natural" are non starters for me, works in R but not in Java - we can't have the "dots" and TotalExact TotalNautral meh. As you say "MolWeight" is too ubiquitous, and we have to think about the community in general rather just MS so I would keep that, it's currently the default which I think makes sense.

Sorry I was meaning we should match terms in terms of "getting our words consistent" and ignoring the obvious styling differences between the languages. Whether we use caps or dots to do the separation doesn't really matter (and we'll need to keep the dots to keep the syntax in rcdk consistent) but I think most humans can match AvgWeight v avg.weight and ExactMass v exact.mass ;-) and this is what we need to get consistent.

molecular.weight (I concede there it's likely the most used term so probably the best, maybe one could add molecular.mass and molecular.weight and have them equal to the same).

Not really a fan of that duplication, think it makes it more confusing.

Fine with me - I agree in this case that it could add too much confusion.

I would strongly suggest to put in some of the examples I suggested, or similar, as they were selected to show the need for the two different styles of exact mass - and will also check the H behaviour. Or I can give it a go once we have all the options if you trust my copy-paste work ;-)

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 3, 2019

* MostAbundant - use the distribution of isotopes based on their abundance and select the most abundant. For example C6Br6 would have three 79Br and
       81Br because their abundance is 51 and 49%.

I spoke too soon, I find the definition a little unclear, suggest a minor wording change (if I understood it correctly...seems to match the enviPat output at least ... ):

  • MostAbundant - returns the exact mass of the most abundant isotope in the distribution of isotopes. For example C6Br6 would have three 79Br and 81Br because their abundance is 51 and 49%.

For the record:
https://www.envipat.eawag.ch/index.php (wish it had a direct link to the formula results ;-) )

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 3, 2019

Sorry miss-documented! MonoIsotopic does do what we expect. Wording change for MostAbundant sounds fine.

I propose the following, we can have ugly names for the ones we don't want people to use:

  • Average (or MolWeight) - default
  • MonoIsotopic
  • AverageIgnoringSpecified
  • MonoIsotopicIgnoringSpecified
  • MostAbundant
  • MassNumber? - any better name for this I don't like repeating words in APIs when it gets called, e.g. it will look like this getMass(mol, MassNumber)
@schymane

This comment has been minimized.

Copy link

schymane commented Mar 3, 2019

Sorry miss-documented! MonoIsotopic does do what we expect.

I'm not yet convinced it does ... comments below ... and I've finally found the good overview:
https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Average_mass
(and the definitions below that one too)

I propose the following, we can have ugly names for the ones we don't want people to use:

* `Average` (or `MolWeight`) - default

I find MolWeight clearer than Average. Molecular weight is the term commonly used. Note at the above site, MolWeight (which is calls MolMass ;-) ) and Average mass are defined separately. Would much rather MolWeight for clarity.

* `MonoIsotopic`

I'm concerned here that you deviate from the IUPAC definition, because it specifies the exact isotopic state for this mass:
The monoisotopic mass is the sum of the masses of the atoms in a molecule using the unbound, ground-state, rest mass of the principal (most abundant) isotope for each element instead of the isotopic average mass.

I think this should be "Exact" (given your comment you don't want repetition).
https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Exact_mass
The default "Exact" (if isotopes are not given) is indeed MonoIsotopic

* `AverageIgnoringSpecified`

OK for ugly name here but then please MolWeightIgnoringSpecified? Or rather IgnoringIsotopes?

* `MonoIsotopicIgnoringSpecified`

See above I think this should be MonoIsotopic (it gives people the MonoIso mass irrespective of the input isotopes, which is exactly in line with the IUPAC definition; Exact will give MonoIso only if isotopes are unspecified) - there will be cases for mass spec where this is very useful

* `MostAbundant`

All good with this one still as far as I can tell

* `MassNumber`? - any better name for this I don't like repeating words in APIs when it gets called, e.g. it will look like this `getMass(mol, MassNumber)`

Looks like you hit the nail on the head ...
https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Mass_number

Do you see a need for the Nominal Mass in addition? As far as I can tell it's just a simple operation from MonoIso and shouldn't need to be hard coded (and becoming slowly obsolete ...)

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

Maybe there is something I'm missing but I need a bit more explanation on the difference between Exact vs MonoIsotopic. Why do you think the definition is not as implemented?

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

The crux of the issue is "What do you do when an isotope is unspecified?" If everything is specified MolWeight and MonoIsotopic will both yield the ExactMass.

The Exact Mass definition on Wikipedia doesn't say what you do when they're unspecified except for an extra sentence on the end "When an exact mass value is given without specifying an isotopic species, it normally refers to the most abundant isotopic species."

So if all were unspecified the exact mass is the same as mono isotopic?

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

As far as I can tell, MonoIsotopic and MostAbundant should be fixed, irrespective of the isotopes defined (because they have fixed definitions wrt which isotopes to use - or is it me misinterpreting the definition?), whereas Exact would give the exact mass for the exact isotopic state given in the input. Thus as I see it, Exact is the generic function and MonoIsotopic and MostAbundant are two, fixed "special cases". Exact should default to MonoIsotopic if isotopes are not defined.

This is at least what I set up in my test cases for rcdk (but I did not have MostAbundant so those numbers are missing):
Atrazine:
CCNC1=NC(NC(C)C)=NC(Cl)=N1
MonoIsotopic = Exact = 215.0938
Atrazine-2d:
[2H]N(CC)C1=NC(=NC(Cl)=N1)N([2H])C(C)C
MonoIsotopic = 215.0938 (ignores the 2D)
Exact = 217.1063
Atrazine-d5:
[2H]C([2H])([2H])C([2H])([2H])NC1=NC(Cl)=NC(NC(C)C)=N1
MonoIsotopic: 215.0938
Exact: 220.1252

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

The crux of the issue is "What do you do when an isotope is unspecified?" If everything is specified MolWeight and MonoIsotopic will both yield the ExactMass.

This should not be the case ...
Atrazine:
MonoIso=Exact=215.0938
MolWeight = 215.6835

The Exact Mass definition on Wikipedia doesn't say what you do when they're unspecified except for an extra sentence on the end "When an exact mass value is given without specifying an isotopic species, it normally refers to the most abundant isotopic species."

So if all were unspecified the exact mass is the same as mono isotopic?

Yes, if isotopes are unspecified, MonoIsotopic = Exact

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

The crux of the issue is "What do you do when an isotope is unspecified?" If everything is specified MolWeight and MonoIsotopic will both yield the ExactMass.

This should not be the case ...
Atrazine:
MonoIso=Exact=215.0938
MolWeight = 215.6835

...but of course here not every isotope is specified ... I think I misunderstood ... OK so we need a test case with every single individual isotope defined too.

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

What do you mean by "fixed" do you mean should not change if the isotopic definitions change? Do you think the value should change if I gave it D2O vs H2O currently these will give different values for both MonoIsotopic and MostAbundant.

Atrazine:
MonoIso=Exact=215.0938
MolWeight = 215.6835

If a user specifies absolutely everything the mass functions all return the same thing, the option is really what do you do when there is missing information.
Yes if unspecified but if I went through and set the isotopes for example lets say I set the isotopes. 12C713C2H1435Cl14N5 all the masses will be the same.

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

What do you mean by "fixed" do you mean should not change if the isotopic definitions change?

Taking the definitions strictly, these values should not change irrespective of the isotopic definitions, because the terms MostAbundant and MonoIsotopic define the isotopes to use (as currently written). Hence Exact is the generic form (and should probably be the setting to encourage people to use if they want to get either the Exact or MonoIsotopic mass). MonoIsotopic is still in there for the "normal" cases and also for backwards compatibility. I think anything else will end up giving masses that aren't really "defined" with a concept behind them, although I am on the fence about MostAbundant ... I need to see the numbers side by side. It might make more sense if this shifts with the given isotopes but then you have to assume the others are "normal" if they are only partially defined? What makes most sense with your current code base?

Do you think the value should change if I gave it D2O vs H2O currently these will give different values for both MonoIsotopic and MostAbundant.

I need to see numbers ...

Yes if unspecified but if I went through and set the isotopes for example lets say I set the isotopes. 12C713C2H1435Cl14N5 all the masses will be the same.

We should add this as a test case. Should I expand yours, or will you do?

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

C6Br6

MonoIso = Exact = "1"
MostAbundant = "0"
MolWeight = Average Mass
(see highlighting)

image

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

[12]C4[13]C2Br6

Exact = 547.5167 - in this picture also matches MonoIsotopic but this is not strictly in keeping with the definition(?). According to the IUPAC definition it should be 545.51? But if everywhere shifts this according to defined isotopes, we should be consistent. Then you just need to maintain backwards compatibility with the functionality that ignored isotopes with another setting?
MostAbundant = "1" = 553.5106
It makes no sense to keep this fixed to unspecified isotopes really ... wouldn't this only be useful in this context? Thus ... I think MostAbundant should shift with defined isotopes.

image

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

I would still prefer for clarity to name it "Exact" mass - and maybe revert to your suggestion a few comments back with MonoIsotopicIgnoresIsotopes for the backwards compatible function?

In most cases MonoIso = Exact up until now but times are changing with lots more isotopic experiments going on and we should encourage correct usage ...

The monoisotopic mass is not used frequently in fields outside of mass spectrometry because other fields can not distinguish molecules of differing isotopic composition.
https://en.wikipedia.org/wiki/Monoisotopic_mass

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

So I match those screenshots...

    @Test
    public void C6Br6() {
        assertMass("C6Br6", 551.485, MolWeight);
        assertMass("C6Br6", 545.510, MonoIsotopic);
        assertMass("C6Br6", 551.503, MostAbundant);
        assertMass("[12]C4[13]C2Br6", 553.427, MolWeight);
        assertMass("[12]C4[13]C2Br6", 547.516, MonoIsotopic);
        assertMass("[12]C4[13]C2Br6", 553.510, MostAbundant);
    }

[12]C4[13]C2Br6 seems a little off when getting the Average/MolWeight but could just be rounding error.

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

Looks OK if you call it Exact and not MonoIsotopic?

I still prefer MonoIsotopic, and envipat also calls it MonoIsotopic right?

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 4, 2019

Just clarify something you would argue the strict definition of MonoIsotopic means even the MF with 13C should be 545.510? That is information provided is ignored.

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 4, 2019

@hunter-moseley

This comment has been minimized.

Copy link

hunter-moseley commented Mar 5, 2019

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 5, 2019

I still prefer MonoIsotopic, and envipat also calls it MonoIsotopic right?

Based on Hunter's response, I think we can stick to MonoIsotopic.

@johnmay

This comment has been minimized.

Copy link
Member Author

johnmay commented Mar 5, 2019

OK, I'll remove the AverageWeight though and add in MassNumber.

@schymane

This comment has been minimized.

Copy link

schymane commented Mar 5, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.