CASMI2016 records with compound/spectrum mismatch #9

Open
schymane opened this Issue Apr 27, 2018 · 6 comments

schymane commented Apr 27, 2018

A user reported that SM858902 and SM858951 contain spectral data from acetyl-sulfamethoxazole but are labeled diphenhydramine (thank you!). On closer inspection, the compound information does not match the precursor and peaks for 3 IDs (4 records) in a series, surrounded by records that look OK; the series is "broken" because IDs are missing in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank

These records should not be passing any form of validation. A screening of the entire CASMI2016 database would be extremely useful for debugging the cause and for flagging which records need fixing and how many there are. Thank you in advance @meier-rene if you can :-)

From what I can see:
**this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05

**this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4

[no records with IDs between 8583 and 8588]

** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696

** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07

** still wrong ... it uses the same (wrong) exact mass, and the precursor is equivalently wrong
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554

** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!

[no 8591]

** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42
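
The fluorine annotation in SM859002 hints at a cheap consistency check: every element in a peak annotation should also occur in the record's CH$FORMULA. The following is only a minimal sketch of that idea, not part of RMassBank or the MassBank validator; the function names `elements` and `foreign_elements` are made up for illustration, and the check ignores element counts and adduct atoms other than hydrogen.

```python
import re


def elements(formula: str) -> set:
    """Return the element symbols in a formula such as 'C20H22FN2O+'.

    Counts and charge signs are ignored; only the symbols matter here.
    """
    return set(re.findall(r"[A-Z][a-z]?", formula))


def foreign_elements(compound_formula: str, annotation_formula: str) -> set:
    """Elements that appear in the annotation but not in the compound formula."""
    return elements(annotation_formula) - elements(compound_formula)


# Formula pairs copied from the records quoted above.
print(foreign_elements("C12H13N3O4S", "C20H22FN2O+"))   # SM859002
print(foreign_elements("C20H23N", "C20H24N+"))          # SM859203
print(foreign_elements("C21H25ClN2O3", "C21H26ClN2O3+"))  # SM858203
```

A non-empty result, as for SM859002, means the annotation cannot belong to the stated compound.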

schymane commented Apr 27, 2018

So, I just ran getMBRecordInfo (https://github.com/schymane/ReSOLUTION/) on the CASMI2016 directory from the OpenData SVN, extracting the precursor and exact mass automatically. Checking the difference flags exactly these 4 records, and only these, as having a precursor/exact-mass difference other than roughly ±1.007:
SM858801, SM858902, SM858951, SM859002
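
For readers without R at hand, the check can be sketched in a few lines of Python. This is not the getMBRecordInfo code, just an illustration of the same idea under two assumptions: all records are [M+H]+ or [M-H]- (so the expected shift is one proton mass, about 1.00728 Da), and a tolerance of 0.01 Da is acceptable. The function name `flag_mismatches` and the tolerance are hypothetical; the masses are copied from the records quoted in this issue.

```python
PROTON_MASS = 1.00728  # Da; expected |precursor - exact mass| for [M+H]+ / [M-H]-

# accession: (CH$EXACT_MASS, PRECURSOR_M/Z), values as quoted in this issue
RECORDS = {
    "SM858203": (388.15537, 389.1626),
    "SM858353": (252.08988, 251.0826),
    "SM858801": (372.27768, 256.1696),
    "SM858902": (255.16231, 296.07),
    "SM858951": (255.16231, 294.0554),
    "SM859002": (295.06268, 325.1711),
    "SM859203": (277.18305, 278.1903),
}


def flag_mismatches(records, tol=0.01):
    """Return accessions whose precursor is not one proton away from the exact mass."""
    flagged = []
    for accession, (exact_mass, precursor_mz) in records.items():
        shift = abs(precursor_mz - exact_mass)
        if abs(shift - PROTON_MASS) > tol:
            flagged.append(accession)
    return flagged


print(flag_mismatches(RECORDS))
```

Other adducts would need their own expected shifts, so a real screening would have to read the adduct from the record rather than assume a single proton.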

schymane commented Mar 24, 2019

Thanks to a diagnosis from Herbert Oberacher, the case is now clear (see the issue online for the case history):

SM858801 is diphenhydramine
SM858902 and SM858951 are Acetyl-sulfamethoxazole
SM859002 is citalopram
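
The assignments above can be cross-checked against the precursor values in the records. The sketch below removes the proton shift from each precursor and looks for the closest neutral mass in a small candidate table. The helper `best_match` and the 0.005 Da tolerance are hypothetical. Three of the exact masses come from the records quoted in this issue; the citalopram value is an assumption, calculated here from the formula C20H21FN2O (the neutral form of the C20H22FN2O+ annotation), since no citalopram record is quoted in the thread.

```python
PROTON_MASS = 1.00728  # Da

CANDIDATES = {
    "Finasteride": 372.27768,              # from SM858801 compound info
    "Diphenhydramine": 255.16231,          # from SM858902 compound info
    "Acetyl-sulfamethoxazole": 295.06268,  # from SM859002 compound info
    "Citalopram": 324.16379,               # assumption: computed from C20H21FN2O
}


def best_match(precursor_mz, positive_mode, tol=0.005):
    """Return the candidate whose neutral mass fits the precursor, or None."""
    shift = PROTON_MASS if positive_mode else -PROTON_MASS
    neutral_mass = precursor_mz - shift
    name, mass = min(CANDIDATES.items(), key=lambda kv: abs(kv[1] - neutral_mass))
    return name if abs(mass - neutral_mass) <= tol else None


# (accession, PRECURSOR_M/Z, positive mode?) as quoted in the first comment
for accession, mz, positive in [
    ("SM858801", 256.1696, True),
    ("SM858902", 296.07, True),
    ("SM858951", 294.0554, False),
    ("SM859002", 325.1711, True),
]:
    print(accession, best_match(mz, positive))
```

A precursor match alone does not identify a compound, so this only shows that the diagnosis is consistent with the stored precursor values.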

So, how to update? If I update the compound information to match the spectra, we will have a mismatch between the internal IDs, the UFZ IDs and the MassBank accession numbers. However, if I change to the correct internal IDs, we will be changing accession numbers, and I think this is worse. If I hear nothing back, I will correct the compound information in these four records and send along updates when I get a chance.

@meier-rene @tsufz @meowcat

meier-rene commented Mar 24, 2019

Is deleting the incorrect records and adding new and correct records an option?

schymane commented Mar 24, 2019

Well, the records need to be fixed, that is for sure. However, if I correct the processing error, we will end up with new accession numbers, and I am not sure that is the right way to fix it in this case. This is the compound list ... it is still inexplicable how this happened, as it should be more or less impossible given the way RMassBank works, but something certainly went wrong! According to the compound list, 8588 is certainly meant to be finasteride, but the record ended up with the compound info of finasteride and the spectral data of diphenhydramine ... do you see the problem? If I reprocess now, the SM858801 record will turn into SM858901 and SM858902 will become SM859002 ...
I think it would be best to update the compound info under the current accession numbers, otherwise we are going to run into awful versioning problems?

[Image: the compound list]
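
The renumbering described above can be made concrete with a small sketch. It assumes the accession layout seen in this thread: a two-letter prefix, the four-digit internal ID, and a two-digit suffix that is carried over unchanged on reprocessing. The function names are hypothetical and this is not how RMassBank generates accessions internally.

```python
def split_accession(accession):
    """Split e.g. 'SM858801' into ('SM', 8588, '01') under the assumed layout."""
    return accession[:2], int(accession[2:6]), accession[6:]


def reprocessed_accession(accession, corrected_id):
    """Accession the record would get if reprocessed under the corrected internal ID."""
    prefix, _, suffix = split_accession(accession)
    return f"{prefix}{corrected_id:04d}{suffix}"


# Diphenhydramine spectrum currently stored under ID 8588, correct ID 8589
print(reprocessed_accession("SM858801", 8589))
# Acetyl-sulfamethoxazole spectrum currently stored under ID 8589, correct ID 8590
print(reprocessed_accession("SM858902", 8590))
```

Note that the second result is an accession that already exists in the database with a different spectrum, which is one way the versioning problem would show up.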

meowcat commented Mar 25, 2019

I understand the problem. Is it a reasonable option to upload the records under a tag that is not SM? That way the new records (say, SZ) will have the correct internal ID, and the old ones should be marked obsolete... Just an idea, not yet thought through.

schymane commented Mar 25, 2019
