CASMI2016 records with compound/spectrum mismatch #9
So, I just ran getMBRecordInfo (https://github.com/schymane/ReSOLUTION/) on the CASMI2016 directory from the OpenData SVN, extracting precursor and exact mass automatically. Checking the difference between the two flags exactly and only these 4 records as having a mass difference that deviates from ~1.007.
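For readers who want to reproduce this kind of check, here is a minimal sketch in Python. It is not the getMBRecordInfo implementation from ReSOLUTION; the function names and the 0.01 Da tolerance are assumptions made for illustration. The accession numbers, exact masses, precursor m/z values and adducts are copied from the record excerpts quoted in the issue description.

```python
# Mass of a proton in Da (hydrogen atom minus one electron).
PROTON_MASS = 1.007276


def expected_precursor(exact_mass, adduct):
    """Return the expected precursor m/z for the two adducts seen in this issue."""
    if adduct == "[M+H]+":
        return exact_mass + PROTON_MASS
    if adduct == "[M-H]-":
        return exact_mass - PROTON_MASS
    raise ValueError(f"unsupported adduct: {adduct}")


def is_mismatch(exact_mass, precursor_mz, adduct, tol=0.01):
    """True if the recorded precursor is further than tol (assumed, in Da) from the expected one."""
    return abs(precursor_mz - expected_precursor(exact_mass, adduct)) > tol


# (accession, CH$EXACT_MASS, PRECURSOR_M/Z, adduct) as quoted in the issue description
records = [
    ("SM858203", 388.15537, 389.1626, "[M+H]+"),
    ("SM858353", 252.08988, 251.0826, "[M-H]-"),
    ("SM858801", 372.27768, 256.1696, "[M+H]+"),
    ("SM858902", 255.16231, 296.07, "[M+H]+"),
    ("SM858951", 255.16231, 294.0554, "[M-H]-"),
    ("SM859002", 295.06268, 325.1711, "[M+H]+"),
    ("SM859203", 277.18305, 278.1903, "[M+H]+"),
]

flagged = [acc for acc, mass, mz, adduct in records if is_mismatch(mass, mz, adduct)]
print(flagged)
```

On these seven excerpts the sketch flags the four suspect records and passes the three that look OK.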
Thanks to diagnosis from Herbert Oberacher the case is now clear (see issue online for case history): SM858801 is diphenhydramine. So, how to update? If I update the compound information to match the spectra, then we will have a mismatch between the internal IDs, UFZ IDs and the MassBank accession numbers. However, if I change to the correct internal IDs, we'll be changing accession numbers, and I think this is worse. If I hear nothing back, I will correct the compound information in these four records and send along updates when I get a chance.
Is deleting the incorrect records and adding new and correct records an option?
Well, the records need to be fixed, that is for sure. However, if I correct the processing error, we will end up with new accession numbers. I am not sure this is the right way to fix it in this case though. This is the compound list ... it is still inexplicable how this happened, as it's kind of impossible the way that RMassBank works, but something certainly went wrong! According to the compound list, 8588 is certainly meant to be finasteride, but the record ended up with the compound info of finasteride and the spectral data of diphenhydramine ... do you see the problem? If I now reprocess, then the SM858801 record will turn into SM858901 and SM858902 will become SM859002 ...
I understand the problem. Is it a reasonable option to upload the records under a tag that is not SM? That way the new, say, SZ records will have the correct internal ID, and the old ones should be marked obsolete ... Just an idea, not yet thought through.
Quite honestly I don’t really want to reprocess them all as it was an incredibly complicated process and it’s only three records (although I have a few others with issues too). It’s made more difficult by the fact that we have several scans (multiple precursors but one CE) and thus the last number in the accession also shifts … it is highly unlikely we’ll use those internal IDs again as it was a once-off dataset. I’d prefer for now to update the compound info and leave a trace in the COMMENT field. We have a couple of others we’ll likely have to deprecate and a couple where I need input from Martin first.
OK, here goes with a complicated update to address issues in the CASMI spectra. I suggest @meier-rene implement this at the MassBank-data side, and I'll double check to confirm once done, and comment on the commit where necessary (@meier-rene other ideas welcome if you see an alternative). This has been double checked with the data source (Martin Krauss).
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM872102
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM871901
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM840901
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM841901
The other ones we need to correct are indicated above, i.e.
SM858902 and SM858951 are Acetyl-sulfamethoxazole => please take compound information from the current https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM859002&dsn=CASMI_2016
SM859002 is citalopram => please take compound information from an existing record, for instance:
With compound information I'm referring to the CH$ entries, ie
@schymane Who should curate this data?
I hoped @meier-rene could do this, but if not, someone just needs to update the files; all the info is there ...
@schymane Come on, you generated them, so why don't you curate them yourself? Or have they been copied from, for example, UFZ records?
At one point Rene said he'd do things centrally. This one is tough and I see why he didn't update it. I'll do it when I have a chance, but I currently don't have time; likely during Biohackathon. If you get to it first I'll be overjoyed. If not, I'll do it when I get the chance ...
Okay, first come, first served.
So, the move to the dev branch after I had forked the MassBank-data repo has caused a lot of unexpected issues. @meier-rene is walking me through fixing this before we will be able to change anything. I've had to delete the whole repo and hope that starting from scratch will fix things. Still cloning ...
MassBank#9 (both pos and neg spectra for this compound)
schymane commented Apr 27, 2018
User reported that SM858902 and SM858951 contain spectral data from acetylsulfamethoxazole but are labeled diphenhydramine (thank you!). Upon closer inspection we seem to have had a mismatch between the ID and the precursor and peaks for 3 IDs / 4 records in a series, surrounded by records that look OK; the series is "broken" due to missing IDs in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank
This should not be passing any form of validation; a screening of the entire CASMI2016 database would be extremely useful for debugging the cause and flagging how and how many records to fix, thank you @meier-rene in advance if you can :-)
From what I can see:
**this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05
**this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4
[no records with IDs between 8583 and 8588]
** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696
** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07
** still wrong ... it's using the same (wrong) exact mass to get equivalent wrong precursor
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554
** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!
[no 8591]
** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42
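The fluorine annotation on SM859002 points to a second simple check that a validator could run: every element in a peak annotation should also occur in the record's CH$FORMULA. Below is a minimal sketch in Python, assuming plain element-count formulas like those in the excerpts above (no brackets or isotope labels). The function names are hypothetical and this is not part of MassBank or RMassBank.

```python
import re


def elements(formula):
    """Return the set of element symbols in a simple formula such as 'C20H22FN2O+'."""
    # One uppercase letter, optionally followed by one lowercase letter, is an element
    # symbol; digits and a trailing charge sign are ignored by the pattern.
    return set(re.findall(r"[A-Z][a-z]?", formula))


def unexpected_elements(compound_formula, annotation_formula):
    """Elements used in the annotation that are absent from the compound formula."""
    return sorted(elements(annotation_formula) - elements(compound_formula))


# SM859002: compound formula vs. precursor annotation, as quoted above
print(unexpected_elements("C12H13N3O4S", "C20H22FN2O+"))
# SM858203: a record that looks OK
print(unexpected_elements("C21H25ClN2O3", "C21H26ClN2O3+"))
```

For SM859002 this reports fluorine as unexpected, while the SM858203 annotation contains only elements present in its formula. The check only compares element sets, so it would not catch a mismatch between two compounds built from the same elements; the precursor versus exact mass comparison covers that case.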