CASMI2016 records with compound/spectrum mismatch #9

Open
schymane opened this issue Apr 27, 2018 · 7 comments

@schymane
Member

commented Apr 27, 2018

A user reported that SM858902 and SM858951 contain spectral data from acetylsulfamethoxazole but are labeled diphenhydramine (thank you!). On closer inspection, we seem to have a mismatch between the ID and the precursor and peaks for 3 IDs / 4 records in a series, surrounded by records that look OK; the series is "broken" because IDs are missing in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank

This should not be passing any form of validation. Screening the entire CASMI2016 database would be extremely useful for debugging the cause and for flagging which records need fixing, and how many. Thank you in advance @meier-rene, if you can :-)

From what I can see:
**this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05

**this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4

[no records with IDs between 8583 and 8588]

** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696

** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07

** still wrong ... it's using the same (wrong) exact mass to get equivalent wrong precursor
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554

** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!

[no 8591]

** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42
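
The fluorine annotation on SM859002 suggests one more automatic check: every element in a peak annotation formula should also occur in CH$FORMULA. Below is a minimal Python sketch of that idea. It is an illustration only, not part of RMassBank or any MassBank validator, and the function names `elements` and `foreign_elements` are made up for this example. It assumes plain protonated or deprotonated ions; adducts such as [M+Na]+ would legitimately add an element and would need special handling.

```python
import re

def elements(formula):
    """Return the set of element symbols in a formula such as 'C20H22FN2O+'."""
    # An element symbol is one uppercase letter, optionally followed by one lowercase letter.
    return set(re.findall(r"[A-Z][a-z]?", formula))

def foreign_elements(compound_formula, annotation_formulas):
    """Return elements used in peak annotations that the compound formula lacks."""
    allowed = elements(compound_formula)
    found = set()
    for formula in annotation_formulas:
        found |= elements(formula) - allowed
    return found

# Values copied from the records quoted above.
print(foreign_elements("C12H13N3O4S", ["C20H22FN2O+"]))     # SM859002
print(foreign_elements("C21H25ClN2O3", ["C21H26ClN2O3+"]))  # SM858203
```

For SM859002 the check reports fluorine as foreign, while SM858203 comes back empty.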

@schymane

Member Author

commented Apr 27, 2018

So, I just ran getMBRecordInfo (https://github.com/schymane/ReSOLUTION/) on the CASMI2016 directory from the OpenData SVN, extracting the precursor m/z and exact mass automatically from each record. Checking the difference between the two flags exactly and only these 4 records as deviating from the expected ~1.007 above ([M+H]+) or below ([M-H]-) the exact mass:
SM858801, SM858902, SM858951, SM859002
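
For readers who want to reproduce this kind of check without R, here is a rough Python sketch of the same idea. It is not the getMBRecordInfo implementation; the function names, the tolerance of 0.01 and the choice to read the adduct from the end of RECORD_TITLE are all assumptions made for this example, and only [M+H]+ and [M-H]- are handled.

```python
import re

PROTON = 1.007276  # m/z shift between neutral exact mass and [M+H]+ / [M-H]-
EXPECTED_SHIFT = {"[M+H]+": PROTON, "[M-H]-": -PROTON}

def parse_record(text):
    """Extract (accession, adduct, exact mass, precursor m/z) from record text.

    Raises AttributeError if one of the fields is missing.
    """
    acc = re.search(r"^ACCESSION:\s*(\S+)", text, re.M).group(1)
    title = re.search(r"^RECORD_TITLE:\s*(.+)$", text, re.M).group(1)
    adduct = title.split(";")[-1].strip()  # assumption: adduct is the last title field
    exact = float(re.search(r"^CH\$EXACT_MASS:\s*([\d.]+)", text, re.M).group(1))
    prec = float(
        re.search(r"^MS\$FOCUSED_ION:\s*PRECURSOR_M/Z\s+([\d.]+)", text, re.M).group(1)
    )
    return acc, adduct, exact, prec

def is_mismatch(adduct, exact, prec, tol=0.01):
    """True if precursor m/z minus exact mass deviates from the expected shift."""
    return abs((prec - exact) - EXPECTED_SHIFT[adduct]) > tol

record = """ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696
"""
acc, adduct, exact, prec = parse_record(record)
print(acc, is_mismatch(adduct, exact, prec))
```

Running `parse_record` and `is_mismatch` over every record file in a directory would give the same kind of screening as described above.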

@schymane

Member Author

commented Mar 24, 2019

Thanks to a diagnosis from Herbert Oberacher, the case is now clear (see issue online for case history):

SM858801 is diphenhydramine
SM858902 and SM858951 are Acetyl-sulfamethoxazole
SM859002 is citalopram

So, how to update? If I update the compound information to match the spectra, we will have a mismatch between the internal IDs, UFZ IDs and the MassBank accession numbers. However, if I change to the correct internal IDs, we'll be changing accession numbers, and I think this is worse. If I hear nothing back, I will correct the compound information in these four records and send along updates when I get a chance.

@meier-rene @tsufz @meowcat

@meier-rene

Collaborator

commented Mar 24, 2019

Is deleting the incorrect records and adding new and correct records an option?

@schymane

Member Author

commented Mar 24, 2019

Well, the records need to be fixed, that is for sure. However, if I correct the processing error, we will end up with new accession numbers, and I am not sure this is the right way to fix it in this case. This is the compound list ... it is still inexplicable how this happened, as it is kind of impossible given the way RMassBank works, but something certainly went wrong! According to the compound list, 8588 is certainly meant to be Finasteride, but the record ended up with the compound info of finasteride and the spectral data of diphenhydramine ... do you see the problem? If I reprocess now, the SM858801 record will turn into SM858901 and SM858902 will become SM859002 ...
I think it would be best to update the compound info under the current accession numbers, otherwise we are going to run into awful versioning problems?

[image: compound list]

@meowcat


commented Mar 25, 2019

I understand the problem. Would it be a reasonable option to upload the records under a tag other than SM? That way the new records (say, SZ) would have the correct internal ID, and the old ones should be marked obsolete... Just an idea, not yet thought through.

@schymane

Member Author

commented Mar 25, 2019

@schymane

Member Author

commented May 20, 2019

OK, here goes with a complicated update to address issues in the CASMI spectra. I suggest @meier-rene implements this on the MassBank-data side; I'll double-check to confirm once done and comment on the commit where necessary (@meier-rene, other ideas welcome if you see an alternative). This has been double-checked with the data source (Martin Krauss).
Note for the record: NONE of these issues actually affected the CASMI contest. The affected files were extracted but eliminated during quality control for the contest, and were then uploaded inadvertently. But we need to fix the database now ;-)

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM872102
This is a spectrum of Exemestane (identical SPLASH), please update the compound information in SM872102 to match the compound information of Exemestane in this record:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM873802&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM871901
This is a spectrum of Trenbolone (identical SPLASH), please update the compound information in SM871901 to match the compound information of Trenbolone in this record:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM874601&dsn=CASMI_2016

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM840901
This should be simazine, please take the compound information from SM841901
The analytical information is correct.

https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM841901
This should be Desethylterbutylazine, please take the compound information from SM840901.
The analytical information is correct.

The other ones we need to correct are indicated above, i.e.
SM858801 is diphenhydramine => please take compound information from the current
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM858902&dsn=CASMI_2016
The analytical information is correct.

SM858902 and SM858951 are Acetyl-sulfamethoxazole => please take compound information from the current https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM859002&dsn=CASMI_2016
The analytical information is correct.

SM859002 is citalopram => please take compound information from an existing record, for instance:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=EA290112&dsn=Eawag
The analytical information is correct.

By compound information I mean the CH$ entries, i.e.:
CH$NAME: Diphenhydramine
CH$NAME: 2-benzhydryloxy-N,N-dimethylethanamine
CH$COMPOUND_CLASS: N/A; Environmental Standard
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
CH$SMILES: CN(C)CCOC(c1ccccc1)c1ccccc1
CH$IUPAC: InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
CH$LINK: CAS 58-73-1
CH$LINK: CHEBI 4636
CH$LINK: KEGG D00300
CH$LINK: PUBCHEM CID:3100
CH$LINK: INCHIKEY ZZVUWRFHKOJYTH-UHFFFAOYSA-N
CH$LINK: CHEMSPIDER 2989
CH$LINK: COMPTOX DTXSID4022949
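
If it helps with the implementation, the swap of CH$ entries described above could be scripted roughly as follows. This is only a sketch under simple assumptions: records are handled as lists of lines, the CH$ lines of the donor record replace those of the target at the position of the target's first CH$ line, and the function name `replace_compound_info` is made up for this example. It does not touch RECORD_TITLE (which also contains the compound name) or any peak annotations, so those would still need to be updated separately.

```python
def replace_compound_info(target_lines, donor_lines):
    """Return the target record with its CH$ lines replaced by the donor's CH$ lines.

    All other lines of the target (accession, analytical information, peaks)
    are kept unchanged. If the target has no CH$ lines, nothing is inserted.
    """
    donor_ch = [line for line in donor_lines if line.startswith("CH$")]
    out = []
    inserted = False
    for line in target_lines:
        if line.startswith("CH$"):
            if not inserted:
                # Put the whole donor block where the target's block started.
                out.extend(donor_ch)
                inserted = True
            continue  # drop the target's own CH$ line
        out.append(line)
    return out

# Shortened example using values quoted in this issue.
target = [
    "ACCESSION: SM858801",
    "CH$NAME: Finasteride",
    "CH$FORMULA: C23H36N2O2",
    "MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696",
]
donor = [
    "ACCESSION: SM858902",
    "CH$NAME: Diphenhydramine",
    "CH$FORMULA: C17H21NO",
    "MS$FOCUSED_ION: PRECURSOR_M/Z 296.07",
]
fixed = replace_compound_info(target, donor)
print("\n".join(fixed))
```

Because SM858801 and SM858902 each need information from a record that is itself being changed, the donor blocks should be read from the current (unmodified) records before any file is written.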
