External report: issues with conflicting stereochemistry in identifiers #70
Comments
Well, another good example of why MassBank metadata needs curation. People now approach us frequently, which is a good sign that the community is interested in MassBank. However, if errors are not handled, people will lose trust in the data. We are on the right track.
Next answer, plus a list of affected identifiers. @egonw is following this up on the Wikidata side, @meier-rene @Treutler we will need to follow this up on the MassBank side to address the immediate issue, plus add some ideas how to catch these cases in the validator. I think we may be able to do this with checking identifiers for consistency and flagging clashes? MassBank/MassBank-web#158
The list: JP000136
egonw commented May 21, 2019
I want to stress that this is not caused by our data import into Wikidata, nor by MassBank. This example is caused by a merger of two Wikidata items with different InChIKeys. I'm still exploring how this happened, as the person who did it is an experienced chemist. These things do happen because of inconsistencies in Wikipedia, and if you clean them up, it can have downstream effects that are not always easy to detect (without automated, regular tests).
So, if this is not caused by problems on the MassBank side, we just need to double-check that these records have structural identifiers that are consistent within themselves (MassBank/MassBank-web#158 (comment)), and if so, we close the issue on our side. Do I understand that correctly?
I don't exactly understand the Wikidata part, but I understand that the current MassBank data might produce inconsistencies in external repositories because it is already inconsistent within MassBank. In this particular case the image of the structure is inconsistent with the structure in the InChI. The image is drawn from the SMILES field, which does not define the double bond configuration and is therefore depicted without it. The InChI, on the other hand, defines a trans double bond. Summary: we have two sources of chemical structures in our record files, InChI and SMILES, and they are not always consistent. I have code for the validator (#158), but it is not activated because we currently have 10026 records with this kind of inconsistency. I cannot think of an automatic procedure to fix this at the moment.
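As a rough illustration of the case described here (an InChI that defines stereochemistry next to a SMILES that does not), a cheap string-level pre-filter could look like the sketch below. This is not the validator code referenced in #158; the function names are hypothetical, and the check only detects missing stereo descriptors, it does not prove that two structures agree.

```python
def inchi_has_stereo(inchi: str) -> bool:
    """True if the InChI has a double-bond (/b) or tetrahedral (/t) stereo layer."""
    layers = inchi.split("/")[1:]  # drop the "InChI=1S" version prefix
    # Layer prefixes are lowercase; the formula layer starts with an uppercase letter.
    return any(layer[:1] in ("b", "t") for layer in layers)


def smiles_has_stereo(smiles: str) -> bool:
    """True if the SMILES uses a bond-direction marker or a chirality marker."""
    return any(marker in smiles for marker in ("/", "\\", "@"))


def stereo_dropped_in_smiles(smiles: str, inchi: str) -> bool:
    """Flag records where the InChI defines stereo but the SMILES defines none."""
    return inchi_has_stereo(inchi) and not smiles_has_stereo(smiles)


# Illustrative input (trans-but-2-ene), not taken from a MassBank record.
TRANS_BUTENE_INCHI = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
print(stereo_dropped_in_smiles("CC=CC", TRANS_BUTENE_INCHI))
```

A filter like this would miss records where both fields define stereochemistry but disagree; those need a comparison of identifiers calculated from each field.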
How many unique InChIs are associated with the 10026 records?
meier-rene pushed a commit that referenced this issue Jun 6, 2019
I have fixed the inconsistencies for this particular compound. Numbers for all inconsistencies will follow.
Here are some numbers: among the records with inconsistencies, we have 3351 unique InChIKeys and 2964 unique InChIKey first blocks. And here is a listing of inconsistencies by database: The main source of inconsistency is the use of SMILES without stereochemistry.
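Counts of this kind could be produced along the following lines. This is a minimal sketch, assuming the flagged records are available as (accession, InChIKey) pairs and that the alphabetic prefix of the accession ID identifies the contributing database; both are assumptions, `summarize` is a hypothetical helper, and the sample data is invented.

```python
import re
from collections import Counter


def summarize(flagged):
    """Count unique InChIKeys, unique first blocks, and records per accession prefix.

    flagged: iterable of (accession, inchikey) pairs for inconsistent records.
    """
    flagged = list(flagged)
    keys = {key for _, key in flagged}
    # The first InChIKey block (14 letters) is a hash of the connectivity.
    first_blocks = {key.split("-")[0] for key in keys}
    # Assumption: the leading letters of the accession identify the database.
    per_prefix = Counter(re.match(r"[A-Za-z]*", acc).group(0) for acc, _ in flagged)
    return len(keys), len(first_blocks), per_prefix


# Invented placeholder data, not real MassBank records.
sample = [
    ("AAA00001", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("AAA00002", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("BB000001", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
n_keys, n_first_blocks, per_prefix = summarize(sample)
print(n_keys, n_first_blocks, dict(per_prefix))
```

A gap between the two counts indicates compounds that share a skeleton but appear under more than one full key.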
And one last fact: we don't have any inconsistencies in the connection table. Only the stereochemical information differs between SMILES and InChI.
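The distinction between a connectivity clash and a stereo-only clash can be read off a pair of InChIKeys, because the first block hashes the connectivity and the second block hashes the remaining layers (stereochemistry, isotopes and so on). The sketch below uses the two keys quoted in the original report; `compare_inchikeys` is a hypothetical helper, and because the blocks are hashes, equal first blocks are strong evidence of equal connectivity rather than proof.

```python
def compare_inchikeys(key_a: str, key_b: str) -> str:
    """Classify how two InChIKeys differ, block by block."""
    blocks_a = key_a.split("-")
    blocks_b = key_b.split("-")
    if blocks_a[0] != blocks_b[0]:
        return "connectivity differs"
    if blocks_a[1:] != blocks_b[1:]:
        # Same skeleton hash; the difference is in stereo or other non-skeleton layers.
        return "same connectivity, other layers differ"
    return "identical"


# Both keys are quoted in the external report in this issue.
print(compare_inchikeys("CWVRJTMFETXNAD-NCZKRNLISA-N", "CWVRJTMFETXNAD-JUHZACGLSA-N"))
```

A validator could treat the first outcome as an error and the second as a warning, which would allow the check to be activated before all stereo-only cases are curated.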
Now I'm confused. Can we get a table of MassBank Accession ID, CH$NAME, SMILES, InChI and InChIKey fields in the records, as well as the corresponding InChIKeys calculated from the SMILES and from the InChI fields (as well as the key in the records)?
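A table like the one requested could be assembled roughly as follows. This is only a sketch: it assumes the records have already been parsed into dictionaries with the hypothetical keys shown, and it takes the two key calculators as parameters so that any cheminformatics toolkit can be plugged in.

```python
import csv
import io

FIELDS = [
    "accession", "name", "smiles", "inchi", "inchikey_record",
    "inchikey_from_smiles", "inchikey_from_inchi", "keys_agree",
]


def build_table(records, key_from_smiles, key_from_inchi):
    """Build one row per record, adding InChIKeys recalculated from each field."""
    rows = []
    for rec in records:
        from_smiles = key_from_smiles(rec["smiles"])
        from_inchi = key_from_inchi(rec["inchi"])
        rows.append({
            "accession": rec["accession"],
            "name": rec["name"],
            "smiles": rec["smiles"],
            "inchi": rec["inchi"],
            "inchikey_record": rec["inchikey"],
            "inchikey_from_smiles": from_smiles,
            "inchikey_from_inchi": from_inchi,
            # True only when the stored key and both calculated keys are equal.
            "keys_agree": rec["inchikey"] == from_smiles == from_inchi,
        })
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

If RDKit were used (an assumption; the thread does not say which toolkit the validator relies on), the two callables could be `lambda s: Chem.MolToInchiKey(Chem.MolFromSmiles(s))` and `Chem.InchiToInchiKey`, with extra handling for SMILES that fail to parse.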
schymane commented May 19, 2019
Copy-paste from email received; @meier-rene are you able to follow-up? Thx!
Comparing data from different databases, I found some discrepancies in your data. For the mentioned entry of your database (https://massbank.eu/MassBank/RecordDisplay.jsp?id=OUF00136), the chemical structure indicates that the configuration of the double bond is not defined. This configuration is defined in other databases as InChIKey CWVRJTMFETXNAD-NCZKRNLISA-N:
See:
PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/9476
ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:95271
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3186431/
EPA: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID3024786
Could you please check whether the definition of your entry is correct, and whether the chemical structure is the correct one or the structural identifiers are wrong?
The problem is the same for other entries like FIO00619, JP000136, FIO00623... where the chemical structure is not correct compared to the stereoconfiguration at the origin of InChIKey CWVRJTMFETXNAD-JUHZACGLSA-N. This InChIKey requires the definition of the 4 chiral carbons on the ring. Please see:
ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16112
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL284616/