Inconsistency in data for records on MassBank vs MoNA #63

Open
ChemConnector opened this Issue Apr 26, 2019 · 5 comments

@ChemConnector

commented Apr 26, 2019

I am comparing the MoNA record at http://mona.fiehnlab.ucdavis.edu/spectra/display/BSU00002 with the MassBank record at https://massbank.eu/MassBank/RecordDisplay.jsp?id=BSU00002

I see stereochem in the structure depiction on MoNA but not in the MassBank record. I assume that InChIs are the basis of the stereo on MoNA but the SMILES has no stereochem on MassBank. The inconsistency is confusing. Is there a StereoSMILES in MassBank that is not displayed?

@schymane

Member

commented Apr 26, 2019

@ssmehta


commented Apr 27, 2019

Just to note that in this case MoNA would generate the displayed structure from the InChI. In general, it depends on what is provided - MOL data is given preference for validation and display purposes followed by InChI, then SMILES, and finally falling back to an InChIKey/CSID/PubChem CID/CAS lookup if they are provided.
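The order of preference described above can be sketched roughly as follows. This is only an illustration, not MoNA's actual code: the function name `pick_structure_source` and the record field names are hypothetical, and the relative order among the lookup identifiers is an assumption, since the comment lists them only as a group.

```python
from typing import Optional, Tuple

# Order of preference as described in the comment: MOL data, then InChI, then SMILES.
# Field names are hypothetical and not taken from MoNA's code base.
STRUCTURE_FIELDS = ("mol", "inchi", "smiles")

# Identifiers that would require an external lookup. Their relative order
# is an assumption; the comment only names them together as a fallback.
LOOKUP_FIELDS = ("inchikey", "csid", "pubchem_cid", "cas")


def pick_structure_source(record: dict) -> Optional[Tuple[str, str]]:
    """Return (field, value) for the first non-empty structure source, or None."""
    for field in STRUCTURE_FIELDS + LOOKUP_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            return field, value.strip()
    return None
```

With a record that carries both an InChI and a SMILES but no MOL data, this sketch picks the InChI, which matches the behaviour described for BSU00002.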

@meier-rene

Collaborator

commented Apr 27, 2019

Yes, it passes validation. At the moment we do not check whether the molecular structures in the InChI and the SMILES are identical. Only the molecular sum formula is compared to the molecular formula given. There are probably hundreds of inconsistencies of this kind in the data and I cannot think of an automatic procedure to fix this properly. That's why I haven't implemented this identity check.
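A much weaker check than a full identity comparison could at least flag records like the one in this issue, where one identifier carries stereo information and the other does not. The sketch below is an assumption-laden illustration, not part of the MassBank validator: the function names are hypothetical, and it only looks at the text of the strings (InChI stereo layers `/b`, `/t`, `/m`, `/s`; SMILES `@`, `/`, `\` symbols). It does not prove that two structures are the same; that would need a cheminformatics toolkit.

```python
# InChI layers that carry stereo information: /b (double bond), /t (tetrahedral),
# /m and /s (stereo descriptors for the tetrahedral layer).
_INCHI_STEREO_PREFIXES = ("b", "t", "m", "s")

# SMILES symbols for tetrahedral (@) and double-bond (/ and \) stereo.
_SMILES_STEREO_SYMBOLS = ("@", "/", "\\")


def inchi_has_stereo(inchi: str) -> bool:
    """True if any layer after the formula starts with a stereo prefix."""
    # Segment 0 is the "InChI=1S" header and segment 1 is the formula.
    layers = inchi.strip().split("/")[2:]
    return any(layer[:1] in _INCHI_STEREO_PREFIXES for layer in layers)


def smiles_has_stereo(smiles: str) -> bool:
    """True if the SMILES string contains any stereo symbol."""
    return any(symbol in smiles for symbol in _SMILES_STEREO_SYMBOLS)


def stereo_mismatch(inchi: str, smiles: str) -> bool:
    """Flag records where only one of the two identifiers has stereo."""
    return inchi_has_stereo(inchi) != smiles_has_stereo(smiles)
```

A record with a stereo-defined InChI and a flat SMILES would be flagged for manual curation rather than fixed automatically, which fits the concern that no automatic repair procedure is obvious.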

@schymane

Member

commented Apr 27, 2019

Thanks for clarifying @ssmehta !
Re validation checks on our side, @meier-rene: I have a whole lot of procedures that @ChemConnector and I worked on a while back to validate all EU contributions. I'd like to formalize this and work it into our validation gradually; I will post issues when I get the chance. Was looking at it again this week ;-)

@ChemConnector

Author

commented Apr 27, 2019

I would like to help in any way I can to map/curate/collapse the data and provide the appropriate SMILES strings to use with CDK depiction. I understand it will be an incremental effort and take time, but @schymane and I have been working iteratively on data streams for a couple of years now. For example, having the correct structure for cholesterol should be easily achievable: https://massbank.eu/MassBank/Result.jsp?compound=cholesterol&op1=and&mz=&tol=0.3&op2=and&formula=&type=quick&searchType=keyword&sortKey=not&sortAction=1&pageNo=1&exec=&inst_grp=ESI&inst=CE-ESI-TOF&inst=ESI-ITFT&inst=ESI-ITTOF&inst=ESI-QTOF&inst=ESI-TOF&inst=LC-ESI-IT&inst=LC-ESI-ITFT&inst=LC-ESI-ITTOF&inst=LC-ESI-Q&inst=LC-ESI-QFT&inst=LC-ESI-QIT&inst=LC-ESI-QQ&inst=LC-ESI-QTOF&inst=LC-ESI-TOF&ms=MS2&ion=0

I understand that we would likely not have DTXSIDs for all chemicals in the combined MassBank EU and JP, and that it could be very difficult to curate some of the data. However, I think we can make good progress in providing fully defined stereoforms of SMILES and InChIs, molfiles if necessary, and mapped DTXSIDs for more records than are available at present. For tasks like this I am willing to dedicate some time every day to check and curate as appropriate. It would be best to coordinate the process through @schymane, based on our previous experience doing this on other datasets.
