External report: issues with conflicting stereochemistry in identifiers #70

Open
schymane opened this issue May 19, 2019 · 10 comments

@schymane
Member

commented May 19, 2019

Copy-pasted from an email we received; @meier-rene, are you able to follow up? Thx!

Comparing data from different databases, I found some discrepancies in your data. For the mentioned entry of your database (https://massbank.eu/MassBank/RecordDisplay.jsp?id=OUF00136), the chemical structure indicates that the configuration of the double bond is not defined. This configuration is defined in other databases as InChIKey CWVRJTMFETXNAD-NCZKRNLISA-N:

See:

PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/9476
ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:95271
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3186431/
EPA: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID3024786

Could you please check whether the definition of your entry is correct, and whether the chemical structure is the correct one or the structural identifiers are wrong?

The problem is the same for other entries such as FIO00619, JP000136, FIO00623..., where the chemical structure does not match the stereoconfiguration behind InChIKey CWVRJTMFETXNAD-JUHZACGLSA-N. This InChIKey requires all 4 chiral carbons on the ring to be defined. Please see:

ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16112
CHEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL284616/
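
For readers comparing the keys quoted above, here is a minimal Python sketch of how an InChIKey splits into blocks. It assumes well-formed 27-character InChIKeys; `split_inchikey` and `same_skeleton` are hypothetical helper names, not part of MassBank. The first 14 letters hash the connectivity, and the second block hashes the remaining layers such as stereochemistry. For a standard InChI, a second-block hash of `UHFFFAOY` means no stereo or isotopic layers were present.

```python
import re

# Assumed layout of a 27-character InChIKey:
# 14-letter skeleton hash, 8-letter hash of the remaining layers,
# S/N flag (standard or not), version letter, protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

# Second-block hash of a standard InChI without stereo or isotopic layers.
NO_EXTRA_LAYERS = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into its parts; raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    first, second, standard, version, protonation = match.groups()
    return {
        "first_block": first,
        "second_block_hash": second,
        "is_standard": standard == "S",
        "version": version,
        "protonation": protonation,
        "has_extra_layers": second != NO_EXTRA_LAYERS,
    }


def same_skeleton(key_a, key_b):
    """True if both keys share the connectivity (first) block."""
    return split_inchikey(key_a)["first_block"] == split_inchikey(key_b)["first_block"]


if __name__ == "__main__":
    # The two keys quoted in the email above.
    a = "CWVRJTMFETXNAD-NCZKRNLISA-N"
    b = "CWVRJTMFETXNAD-JUHZACGLSA-N"
    print(same_skeleton(a, b), a == b)
```

The two keys from the email share a first block but differ in the second, which is what one would expect for the same skeleton with different stereo definitions.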

@tsufz

Member

commented May 20, 2019

Well, another good example of why MassBank metadata needs curation. People now approach us frequently, which is a good sign that the community is interested in MassBank. However, if errors are not handled, people will lose confidence in its reliability. We are on a good way.

@schymane

Member Author

commented May 21, 2019

Next answer, plus a list of affected identifiers. @egonw is following this up on the Wikidata side; @meier-rene @Treutler, we will need to follow this up on the MassBank side to address the immediate issue, plus add some ideas on how to catch these cases in the validator. I think we may be able to do this by checking identifiers for consistency and flagging clashes? MassBank/MassBank-web#158

Since we agree about the problem with the chemical structure and the structural identifiers (mainly InChIKey and InChI), I can provide a full list of MassBank entries to check. I am curating chemical entries in Wikidata and I found that somebody uploaded all MassBank entries with InChIKey = CWVRJTMFETXNAD-JUHZACGLSA-N to the wrong item. I have not checked all entries mentioned on the page https://www.wikidata.org/wiki/Q27167119 (scroll down to find the MassBank identifiers), but I think that most of the entries have no defined chiral centers and should have the InChIKey = CWVRJTMFETXNAD-UHFFFAOYSA-N according to the chemical structure.
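
One possible reading of "flagging clashes", sketched in Python under stated assumptions: group records by InChIKey first block and report every group that carries more than one full key. This is not the validator code from #158; `find_clashes` is a hypothetical name and the accessions in the example are placeholders, not real MassBank records.

```python
from collections import defaultdict


def find_clashes(records):
    """Group (accession, inchikey) pairs by first block.

    Returns {first_block: {full_key: [accessions]}} for every first
    block that appears with more than one distinct full InChIKey.
    """
    by_first_block = defaultdict(lambda: defaultdict(list))
    for accession, key in records:
        first_block = key.split("-")[0]
        by_first_block[first_block][key].append(accession)
    return {
        first_block: dict(keys)
        for first_block, keys in by_first_block.items()
        if len(keys) > 1
    }


if __name__ == "__main__":
    # Placeholder accessions; the keys are the ones quoted in this thread.
    example = [
        ("XX000001", "CWVRJTMFETXNAD-JUHZACGLSA-N"),
        ("XX000002", "CWVRJTMFETXNAD-NCZKRNLISA-N"),
        ("XX000003", "CWVRJTMFETXNAD-JUHZACGLSA-N"),
    ]
    for first_block, keys in find_clashes(example).items():
        print(first_block, keys)
```

A clash found this way is only a prompt for manual review: different stereoisomers legitimately share a first block, so a curator still has to decide whether the records describe the same compound.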

The list:

JP000136
FIO00618
FIO00619
FIO00620
FIO00621
FIO00622
FIO00623
FIO00624
FIO00625
FIO00626
FIO00627
PB005541
PB006181
PB006182
KO000466
KO000467
KO000468
KO000469
KO000470
KO002577
KO002578
KO002579
KO002580
KO002581
KO008922
KO008923
OUF00135
OUF00136

@egonw


commented May 21, 2019

I want to stress that this is caused neither by our data import into Wikidata nor by MassBank. This example is caused by a merger of two Wikidata items with different InChIKeys. I'm still exploring how this happened, as the person who did it is an experienced chemist. These things do happen because of inconsistencies in Wikipedia, and if you clean them up, it can have downstream effects that are not always easy to detect (without automated, regular tests).

@schymane

Member Author

commented May 21, 2019

So, if this is not caused by problems on the MassBank side, we just need to double-check that these records have structural identifiers that are consistent within themselves (MassBank/MassBank-web#158 (comment)), and if so, we close the issue on our side. Do I understand that correctly?

@meier-rene

Collaborator

commented Jun 6, 2019

I don't exactly understand the Wikidata part, but I understand that the current MassBank data might produce inconsistencies in external repositories because it's already inconsistent within MassBank. In this particular case the image of the structure is inconsistent with the structure in the InChI. The image is drawn from the SMILES field, which does not define the double bonds as trans; that is why the depiction shows no configuration. The InChI, on the other hand, defines a trans double bond.

Summary: We have two sources of chemical structures in our record files, InChI and SMILES, and they are not always consistent. I have code for the validator (#158), but it's not activated because we currently have 10026 records with this kind of inconsistency. I cannot think of an automatic procedure to fix this at the moment.
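
To illustrate the kind of mismatch described here, a rough string-level heuristic in Python. It is not the validator code from #158 and does no real structure parsing: it only assumes that `/b` and `/t` layers in an InChI carry double-bond and tetrahedral stereo, and that `/`, `\` and `@` in a SMILES do the same. The function names are hypothetical, and the test molecules are (E)-but-2-ene and L-alanine rather than anything from MassBank.

```python
def inchi_stereo_layers(inchi):
    """Return {'b', 't'} subsets found as stereo layers in an InChI string."""
    # Skip the version prefix ("InChI=1S") and the formula layer.
    layers = inchi.strip().split("/")[2:]
    return {layer[0] for layer in layers if layer and layer[0] in "bt"}


def smiles_stereo_features(smiles):
    """Return the same labels based on stereo markers in a SMILES string."""
    features = set()
    if "/" in smiles or "\\" in smiles:
        features.add("b")  # double-bond configuration marker
    if "@" in smiles:
        features.add("t")  # tetrahedral centre marker
    return features


def stereo_mismatch(smiles, inchi):
    """Stereo kinds present in only one of the two representations."""
    return inchi_stereo_layers(inchi) ^ smiles_stereo_features(smiles)


if __name__ == "__main__":
    trans_butene_inchi = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
    print(stereo_mismatch("CC=CC", trans_butene_inchi))    # SMILES lacks E/Z
    print(stereo_mismatch("C/C=C/C", trans_butene_inchi))  # both define it
```

A check like this only says whether stereo information is present on both sides, not whether the two agree; confirming agreement would need a cheminformatics toolkit to regenerate one identifier from the other.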

@schymane

Member Author

commented Jun 6, 2019

How many unique InChIs are associated with the 10026 records?
The useful breakdown would be (1) how many unique InChIKeys and (2) how many unique InChIKey first blocks ... because from the number of 10,026 this sounds incredibly large, but there are surely at least an order of magnitude (hopefully two) fewer chemicals associated with this number of records?
For my own curiosity it would also be useful which databases are the main sources of these errors to see if we have anything systematic ...
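
The breakdown asked for here could be computed along these lines. This is a minimal Python sketch assuming the inconsistent records are available as (contributor, InChIKey) pairs; `summarize` is a hypothetical name and the contributor labels in the example are placeholders.

```python
from collections import Counter


def summarize(records):
    """Count unique keys, unique first blocks and records per contributor.

    records: iterable of (contributor, inchikey) pairs.
    """
    keys = set()
    first_blocks = set()
    per_contributor = Counter()
    total = 0
    for contributor, key in records:
        total += 1
        keys.add(key)
        first_blocks.add(key.split("-")[0])
        per_contributor[contributor] += 1
    return {
        "records": total,
        "unique_inchikeys": len(keys),
        "unique_first_blocks": len(first_blocks),
        "per_contributor": dict(per_contributor),
    }


if __name__ == "__main__":
    # Placeholder contributors; the keys are the ones quoted in this thread.
    example = [
        ("Lab_A", "CWVRJTMFETXNAD-JUHZACGLSA-N"),
        ("Lab_A", "CWVRJTMFETXNAD-UHFFFAOYSA-N"),
        ("Lab_B", "CWVRJTMFETXNAD-NCZKRNLISA-N"),
        ("Lab_B", "CWVRJTMFETXNAD-JUHZACGLSA-N"),
    ]
    print(summarize(example))
```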

meier-rene pushed a commit that referenced this issue Jun 6, 2019

@meier-rene

Collaborator

commented Jun 6, 2019

I have fixed the inconsistencies for this particular compound. Numbers for all inconsistencies will follow.

@meier-rene

Collaborator

commented Jun 6, 2019

Here are some numbers: the records with inconsistencies cover 3351 unique InChIKeys and 2964 unique InChIKey first blocks.

And here is a listing of inconsistencies by databases:
202 BS
3 Boise_State_Univ
174 Kyoto_Univ
225 MPI_for_Chemical_Ecology
75 Univ_Connecticut
349 Eawag
239 PFOS_research_group
199 Fiocruz
193 Fukuyama_Univ
41 GL_Sciences_Inc
14 JEOL_Ltd
2021 Fac_Eng_Univ_Tokyo
167 NAIST
1039 Keio_Univ
62 Kazusa
5 Osaka_MCHRI
31 MSSJ
70 Metabolon
35 NaToxAq
4 RIKEN_NPDepo
459 Nihon_Univ
147 Osaka_Univ
179 IPB_Halle
742 RIKEN
6 CASMI_2012
12 Tottori_Univ
171 Univ_Toyama
26 UOEH
3 UPAO
2312 Chubu_Univ
793 Waters

The main source of inconsistency is the use of SMILES without stereochemistry.

@meier-rene

Collaborator

commented Jun 6, 2019

And one last fact: we don't have any inconsistencies in the connection table. Only the stereochemical information differs between SMILES and InChI.

@schymane

Member Author

commented Jun 6, 2019

Now I'm confused. Can we get a table of MassBank Accession ID, CH$NAME, SMILES, InChI and InChIKey fields in the records, as well as the corresponding InChIKeys calculated from the SMILES and from the InChI fields (as well as the key in the records)?
I do not quite understand how this happens e.g. for the Eawag records where the InChIs should be systematically calculated from the SMILES within RMassBank ...
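
A sketch of how the record-side columns of such a table might be pulled out, in Python. It assumes the record files carry `ACCESSION:`, `CH$NAME:`, `CH$SMILES:`, `CH$IUPAC:` (holding the InChI) and `CH$LINK: INCHIKEY ...` lines; `parse_record` and `to_csv` are hypothetical names, and the example record is invented (it describes (E)-but-2-ene, not a MassBank entry). The InChIKeys recalculated from the SMILES and InChI fields are left out because they need a cheminformatics toolkit.

```python
import csv
import io

COLUMNS = ["ACCESSION", "CH$NAME", "CH$SMILES", "CH$IUPAC", "INCHIKEY"]


def parse_record(text):
    """Extract the identifier fields from one record's text."""
    row = {column: "" for column in COLUMNS}
    names = []
    for line in text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "TAG: value" line
        value = value.strip()
        if tag == "CH$NAME":
            names.append(value)  # a record may list several names
        elif tag in ("ACCESSION", "CH$SMILES", "CH$IUPAC"):
            row[tag] = value
        elif tag == "CH$LINK":
            parts = value.split(None, 1)
            if len(parts) == 2 and parts[0] == "INCHIKEY":
                row["INCHIKEY"] = parts[1]
    row["CH$NAME"] = "; ".join(names)
    return row


def to_csv(record_texts):
    """Render one CSV row per record, with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for text in record_texts:
        writer.writerow(parse_record(text))
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented record for illustration only.
    example = "\n".join([
        "ACCESSION: XX000001",
        "CH$NAME: (E)-but-2-ene",
        "CH$NAME: trans-2-butene",
        "CH$SMILES: CC=CC",
        "CH$IUPAC: InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+",
        "CH$LINK: INCHIKEY IAQRGUVFOMOMEM-ONEGZZNKSA-N",
    ])
    print(to_csv([example]))
```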
