Error in DTXSID mapping? #68

Open
schymane opened this issue May 16, 2019 · 18 comments

@schymane
Member

commented May 16, 2019

Bug report from external user:

First of all, thank you for the massive effort in developing and maintaining MassBank! I was very pleased to see in the News that all the records were linked to Comptox (if registered), so I gave it a go: the first record I randomly tested was MSJ01067 (Acetamiprid; GC-EI-Q; MS; Positive; M+), I clicked the Comptox link (DTXSID60861331) and...the substance ID does not exist - Acetamiprid ID is DTXSID0034300.

I therefore tested many other records which were all ok, so I assume that I was really unlucky (or an excellent proof-reader) :-)

I don't know if it's an isolated case, but give it a check.

Follow-up:
Indeed that DTXSID doesn't appear to exist in the public Dashboard, nor do I get a match for that InChIKey. If this is a name match, it's wrong ...
https://massbank.eu/MassBank/RecordDisplay.jsp?id=MSJ01067
image

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID60861331
image

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=WCXDHFDTOYPNIE-UHFFFAOYSA-N
image

This is the correct match:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID0034300
and is also found by name:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=Acetamiprid

Any ideas what went wrong here @meier-rene @ChemConnector ?
PubChem link looks fine
https://pubchem.ncbi.nlm.nih.gov/compound/213021

@meier-rene

Collaborator

commented May 16, 2019

That's my issue. I take care of it.

@ChemConnector

commented May 16, 2019

I hope @meier-rene can resolve the issue, as it is not at all obvious to me how this would happen. We do have DTXSID60861331 in our internal production, but it is not yet public and it is certainly not Acetamiprid. Rene, please let me know whether you can fix it. Thanks

@schymane

Member Author

commented May 16, 2019

I hope @meier-rene can find the cause but it's worrying that this exists but is not yet in production - it is going to get very confusing if we can access DTXSIDs that are not yet in production via the web services ... we will end up with broken links everywhere and no way to control it?

@ChemConnector

commented May 16, 2019

@meier-rene

Collaborator

commented May 16, 2019

Because the InChIKey resolver at https://actorws.epa.gov is my only source for DTXSIDs, I have to wait until this service is fixed.

@ChemConnector

commented May 16, 2019

@meier-rene

Collaborator

commented May 17, 2019

The service is back, thank you @ChemConnector. Unfortunately, there is some more work waiting for you. The erroneous record https://massbank.eu/MassBank/RecordDisplay.jsp?id=MSJ01067 contains the InChIKey WCXDHFDTOYPNIE-UHFFFAOYSA-N. If I put this into the resolver https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=WCXDHFDTOYPNIE-UHFFFAOYSA-N I get DTXSID60861331. This does not resolve to a valid substance: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID60861331 . Please have a look into this.
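
For anyone who wants to repeat this check for other records, here is a minimal Python sketch that only builds the two URLs used in this thread. The URL patterns are copied from the links posted here and may have changed since; the function names `resolver_url` and `dashboard_url` are made up for illustration, and no assumption is made about the format of the service response.

```python
from urllib.parse import urlencode

# Endpoints as they appear in the links in this thread (may be outdated).
RESOLVER_URL = "https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve"
DASHBOARD_URL = "https://comptox.epa.gov/dashboard/dsstoxdb/results"


def resolver_url(identifier):
    """Return the resolver query URL for an InChIKey (hypothetical helper)."""
    return RESOLVER_URL + "?" + urlencode({"identifier": identifier})


def dashboard_url(search_term):
    """Return the Dashboard search URL for a DTXSID, InChIKey or name (hypothetical helper)."""
    return DASHBOARD_URL + "?" + urlencode({"search": search_term})
```

Opening `resolver_url("WCXDHFDTOYPNIE-UHFFFAOYSA-N")` and then `dashboard_url(...)` for the DTXSID it returns reproduces the two manual steps described above.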

@schymane

Member Author

commented May 17, 2019

Interesting ... tautomer issue maybe contributing - plus differing stereochem in the InChIKeys?
image

@ChemConnector

commented May 17, 2019

I am still researching, but I think I know what it is and need to check with the developer. One comment, though: Acetamiprid has explicit stereochemistry (the E-form). See https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID0034300 . I have confirmed this with multiple resources, so you may wish to update your structure and associated InChIKey to WCXDHFDTOYPNIE-RIYZIHGNSA-N. This resolves correctly: https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=WCXDHFDTOYPNIE-RIYZIHGNSA-N

This is NOT the cause of the error you are seeing for sure. If my hypothesis is correct it's the fact that one of the synonyms for this chemical https://comptox.epa.gov/dashboard/dsstoxdb/results?search=FAIL%20peptide in the synonym table is "FAIL" and I believe that the service is passing a FAIL message and then resolving to this chemical....it matches the IndigoInChIKey here https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=WCXDHFDTOYPNIE-UHFFFAOYSA-N. I am off to go prove it...
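
On the stereochemistry point: the two InChIKeys in this thread share the first block (same skeleton) and differ only in the second block. A rough Python sketch of that comparison follows. It relies on the general convention that a standard InChIKey whose second block starts with `UHFFFAOY` carries no stereo or isotope layers; the helper names are hypothetical.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")

# Second-block hash seen when the InChI has no stereo/isotope layers.
NO_STEREO_PREFIX = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into its three blocks, or raise ValueError."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    return match.groups()


def same_skeleton(key_a, key_b):
    """True if both keys share the first (connectivity) block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


def lacks_stereo_layers(key):
    """True if the second block is the 'no stereo/isotope layers' hash."""
    return split_inchikey(key)[1].startswith(NO_STEREO_PREFIX)
```

A check like this could flag records where only a stereo-unspecified key is stored while the resolver knows a stereo-specific one for the same skeleton.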

@schymane

Member Author

commented May 17, 2019

@meier-rene can you take care of updating the record, or should I add it to my list to resolve, along with the CASMI and UFZ ones (I hope to do this next week)? Just let me know, thanks!

meier-rene pushed a commit that referenced this issue May 17, 2019

@meier-rene

Collaborator

commented May 17, 2019

I changed the chemical information for all records of Acetamiprid.

@adelenelai

commented May 27, 2019

Hi Tony @ChemConnector , came across another case of the same Dashboard issue described by Rene:

Benzoylecgonine's DTXSID does not seem to surface in the Dashboard...
Screenshot 2019-05-27 at 12 02 41

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID30859442
Screenshot 2019-05-27 at 12 03 53

...but does exist:
Screenshot 2019-05-27 at 13 59 34

A specific use case which would be affected: if users of RMassBank wanted to manually look up a DTXSID using an InChIKey, they would have to do it via the web service and not via the Dashboard (see MassBank/RMassBank#215).

Screenshot 2019-05-27 at 14 24 23

NB: Currently, the Benzoylecgonine that does surface in the Dashboard has specific stereochemistry (whereas in MassBank, no stereochemistry is specified):
Screenshot 2019-05-27 at 12 11 33

@ChemConnector

commented May 29, 2019

I understand what is happening here and what is causing this issue. The services are hitting the database and retrieving "Level 6" data, which we term "Incomplete" (not curated for public release). We have a filter in the dashboard code that does not retrieve Incompletes, but this filter was not added to the service. The service was developed outside of the core team to make it available faster than we could have managed otherwise, and this filter was missed. It will be added in a future sprint, which will fix the issue.

The mismatch did highlight that the structure for the chemical does not match the name since, as was pointed out, the chemical is lacking stereochemistry. It would be best to update the InChIKey and other mappings in MassBank too maybe?

http://comptox-prod.epa.gov/dashboard/dsstoxdb/results?search=DTXSID7046758

@schymane

Member Author

commented Jun 3, 2019

OK this is an Eawag record ... Tony and I went thru these and all Eawag substances DO have DTXSIDs and we need to figure out a smart way to upgrade all the Eawag records with stereo-specific information. @meier-rene we will need to discuss this, it's been on my todo list for a while but I will need some time to figure out how best to coordinate upgrading the chemical information with you. Pls put this on hold for the moment as it's not just this one ... maybe we need another issue for that.

@schymane

Member Author

commented Jun 18, 2019

So, @ChemConnector has provided me with a list of 390 DTXSIDs that are "Level 6", and thus not public, that we should remove from our InChIKey - DTXSID mappings. I will email this to @meier-rene; then I think we can close this issue, as the main cause was the incorrect retrieval of Level 6 substances from DSSTox due to an error in the CompTox web services [a fix that needs doing on their side].
Updating Eawag records is unrelated to this issue, so not a barrier to closing this and #66 once the incorrectly-mapped DTXSIDs are removed.
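
The clean-up itself could look roughly like the Python sketch below. It assumes the mappings are held as a simple InChIKey-to-DTXSID dictionary, which may not match how MassBank actually stores them, and `drop_nonpublic` is a made-up name. The example data uses the Acetamiprid identifiers from this thread; whether DTXSID60861331 is on the actual list of 390 is an assumption.

```python
def drop_nonpublic(mapping, nonpublic_dtxsids):
    """Split an InChIKey -> DTXSID mapping into kept and removed entries.

    mapping: dict of InChIKey -> DTXSID (assumed storage format).
    nonpublic_dtxsids: iterable of DTXSIDs that must not be linked.
    """
    blocked = set(nonpublic_dtxsids)
    kept = {key: dtxsid for key, dtxsid in mapping.items() if dtxsid not in blocked}
    removed = {key: dtxsid for key, dtxsid in mapping.items() if dtxsid in blocked}
    return kept, removed


# Illustrative data only, built from identifiers mentioned in this thread.
example_mapping = {
    "WCXDHFDTOYPNIE-UHFFFAOYSA-N": "DTXSID60861331",
    "WCXDHFDTOYPNIE-RIYZIHGNSA-N": "DTXSID0034300",
}
kept, removed = drop_nonpublic(example_mapping, ["DTXSID60861331"])
```

Returning the removed entries as well makes it easy to log which records lost their CompTox link.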

@schymane

Member Author

commented Jun 18, 2019

via @ChemConnector
Here’s an example service output to look at:
https://actorws.epa.gov/actorws/dsstox/v02/casTable.xml?DTXSID=DTXSID40860448 that has qclevel Incomplete. Hope this helps.

image

@schymane

Member Author

commented Jun 19, 2019

More clarifications via email conversation with @ChemConnector
@meier-rene @adelenelai please take careful note that the numbering is not particularly logical .. we should only be using qcId 1, 2, 4, 5, and 6 (see below).

There are 5 public levels of data:
image
and the 6th is the "incomplete" level that should not be public but has made it into MassBank via the web services.

The field qcId in the web services is numbered 1-6, but these numbers do not align with the public Levels.
qcId = 3 is the "incomplete" level that should be eliminated.
qcId = 1, 2, 4, 5, 6 are public data and can be included.
Here is the documentation mapping Level to qcId:

<qcDescription>
Level 1: Expert curated, highest confidence in accuracy and consistency of unique chemical identifiers
</qcDescription>
<qcId>1</qcId>
<qcLabel>DSSTox_High</qcLabel>
<qcName>DSSTox_High</qcName>

<qcDescription>
Level 2: Expert curated, unique chemical identifiers confirmed using multiple public sources
</qcDescription>
<qcId>2</qcId>
<qcLabel>DSSTox_Low</qcLabel>
<qcName>DSSTox_Low</qcName>

<qcDescription>
Level 3: Programmatically curated from high quality EPA source, unique chemical identifiers have no conflicts in ChemID and PubChem
</qcDescription>
<qcId>4</qcId>
<qcLabel>Public_High</qcLabel>
<qcName>Public_High</qcName>

<qcDescription>
Level 4: Programmatically curated from ChemID, unique chemical identifiers have no conflicts in PubChem
</qcDescription>
<qcId>6</qcId>
<qcLabel>Public_Medium</qcLabel>
<qcName>Public_Medium</qcName>

<qcDescription>
Level 5: Programmatically curated from ACToR or PubChem, unique chemical identifiers with low confidence, single public source
</qcDescription>
<qcId>5</qcId>
<qcLabel>Public_Low</qcLabel>
<qcName>Public_Low</qcName>

<qcDescription>Incomplete</qcDescription>
<qcId>3</qcId>
<qcLabel>Incomplete</qcLabel>
<qcName>Incomplete</qcName>
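
To make the non-obvious numbering harder to get wrong, the mapping above can be written down once and reused. This is a minimal Python sketch: the table is transcribed from the documentation quoted above, while the `<record>` wrapper element in the usage example and the function names are assumptions, since the full structure of the service output is not shown here. Unknown qcId values are treated as not public, which is a cautious choice rather than documented behaviour.

```python
import xml.etree.ElementTree as ET

# qcId -> (public Level or None, qcLabel), transcribed from the mapping above.
QC_LEVELS = {
    1: (1, "DSSTox_High"),
    2: (2, "DSSTox_Low"),
    4: (3, "Public_High"),
    6: (4, "Public_Medium"),
    5: (5, "Public_Low"),
    3: (None, "Incomplete"),
}


def is_public(qc_id):
    """True for qcId 1, 2, 4, 5, 6; False for 3 (Incomplete) and unknown ids."""
    level, _label = QC_LEVELS.get(qc_id, (None, "unknown"))
    return level is not None


def qc_id_from_xml(fragment):
    """Read the first <qcId> below the root of an XML fragment as an int."""
    text = ET.fromstring(fragment).findtext(".//qcId")
    if text is None:
        raise ValueError("no <qcId> element found")
    return int(text)
```

For example, `is_public(qc_id_from_xml("<record><qcId>3</qcId><qcLabel>Incomplete</qcLabel></record>"))` is false, so a DTXSID with that qcId would be skipped.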

@schymane

Member Author

commented Jun 19, 2019

Also to note:
The Incomplete filter on the web service is now in place so any future uses of the service should not pass any unresolvable DTXSIDs through (or InChIKey to DTXSIDs).
