Add DTXSIDs to all MassBank records with InChIKey match #66

Open
schymane opened this issue May 7, 2019 · 8 comments

@schymane
Member

commented May 7, 2019

@meier-rene @Treutler the EPA have set up a basic service that should allow retrieval of DTXSIDs by InChIKey. Can you look into implementing this on the database end, so that for now DTXSIDs are added to all records with matching entries? I will post a separate issue to get this into RMassBank and linked up in MassBank-web.
It's already in our Record format as
CH$LINK: COMPTOX DTXSID50274017
(https://github.com/MassBank/MassBank-web/blob/master/Documentation/MassBankRecordFormat.md)
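As a rough illustration of what adding the link on the record side could look like, here is a minimal sketch. It assumes a record is available as a list of text lines and that the new `CH$LINK: COMPTOX` line belongs next to the other `CH$LINK` lines; the function name `add_comptox_link` is hypothetical and not part of any MassBank tool.

```python
def add_comptox_link(record_lines, dtxsid):
    """Return a copy of record_lines with a 'CH$LINK: COMPTOX <dtxsid>' line.

    Assumption: the new line is placed after the last existing CH$LINK line,
    or after the last CH$ line if the record has no CH$LINK lines yet.
    """
    out = list(record_lines)
    # Leave the record unchanged if it already carries a COMPTOX link.
    if any(line.startswith("CH$LINK: COMPTOX ") for line in out):
        return out
    link_idx = [i for i, line in enumerate(out) if line.startswith("CH$LINK:")]
    ch_idx = [i for i, line in enumerate(out) if line.startswith("CH$")]
    if link_idx:
        anchor = link_idx[-1]
    elif ch_idx:
        anchor = ch_idx[-1]
    else:
        anchor = len(out) - 1  # no CH$ block found: append at the end
    out.insert(anchor + 1, f"CH$LINK: COMPTOX {dtxsid}")
    return out
```

Calling it twice with the same identifier leaves the record unchanged the second time, which makes a bulk update safe to re-run.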

https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.json?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.xml?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
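A minimal sketch of calling the JSON endpoint above from Python, using only the standard library. The function names and the InChIKey format check are assumptions for illustration. The layout of the JSON response is not described in this thread, so the sketch returns the parsed response as-is instead of guessing field names.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URLs listed above.
BASE_URL = "https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.json"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def build_resolve_url(inchikey):
    """Build the request URL for one InChIKey (hypothetical helper)."""
    if not INCHIKEY_RE.fullmatch(inchikey):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return BASE_URL + "?" + urlencode({"identifier": inchikey})


def resolve_inchikey(inchikey, timeout=30):
    """Query the service and return the parsed JSON response unchanged.

    The response schema is an unknown here; inspect the returned object
    before relying on any particular field.
    """
    with urlopen(build_resolve_url(inchikey), timeout=timeout) as response:
        return json.load(response)
```

Validating the key before the request keeps malformed or missing InChIKeys from generating pointless calls during a bulk run.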

Please send any feedback on the service to @ChemConnector.

Thanks!

@meier-rene

Collaborator

commented May 7, 2019

I will take care of this.

I would also like to give a short update on a related topic: I curated all records that have any structural information so that they contain proper InChIs and InChIKeys. There are just 900 records left which don't have structural information, only chemical names.

@schymane

Member Author

commented May 7, 2019

Great!
Can you post a list somewhere of the 900, with basic details like name, accession etc? Some of them are "tentative", but I am not sure we have that many ... I would be curious ... Thanks!

@meier-rene

Collaborator

commented May 8, 2019

noStructure.txt
The list of all records without a Structure given.

@schymane

Member Author

commented May 8, 2019

Oh interesting ... so the EawagAdditional records are ones that almost certainly don't have a structure because they are tentative records ... but I see a lot from BS, Fac_Eng_Univ_Tokyo (major culprit) and even IPB Halle! @sneumann should be able to comment about the latter ... do you see a systematic issue with BS and Fac_Eng_Univ_Tokyo (one critical identifier missing that we could fill in with other information available)?

@meier-rene

Collaborator

commented May 8, 2019

There are roughly 60 records with other database identifiers, like CAS, which I could use to retrieve proper chemical information. The remaining records have only chemical names. These need manual lookup, which might be unsuccessful in some cases. This will take some time...
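The split described here could be sketched roughly as follows: records that carry some other database identifier in a `CH$LINK` line go into one group, name-only records into another. This assumes each record is a list of text lines keyed by accession; the function names and the choice of which link types to skip are hypothetical.

```python
def other_identifiers(record_lines, skip=("INCHIKEY", "COMPTOX")):
    """Collect 'database -> identifier' pairs from CH$LINK lines.

    Link types listed in `skip` are ignored (an illustrative choice).
    """
    found = {}
    for line in record_lines:
        if not line.startswith("CH$LINK:"):
            continue
        parts = line[len("CH$LINK:"):].split(None, 1)
        if len(parts) == 2 and parts[0] not in skip:
            found[parts[0]] = parts[1].strip()
    return found


def triage(records):
    """Split records into those with usable identifiers and name-only ones.

    `records` maps accession -> list of record lines.
    Returns (resolvable, name_only): a dict of accession -> identifiers,
    and a list of accessions that need manual lookup.
    """
    resolvable, name_only = {}, []
    for accession, lines in records.items():
        ids = other_identifiers(lines)
        if ids:
            resolvable[accession] = ids
        else:
            name_only.append(accession)
    return resolvable, name_only
```

The name-only list is then the part that needs manual curation.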

Different topic:
Please could someone explain the difference between DTXCID and DTXSID? The code for adding COMPTOX id is nearly finished.

@schymane

Member Author

commented May 8, 2019

C = compound/chemical and S = substance. The "C" entries are the unique chemicals (roughly the "MS-ready" forms, put simply) and the "S" entries are the official database entries.
Effectively, we should always use and link via the substance identifier, the DTXSID.

Check out infoboxes here (@ChemConnector note inconsistencies in the DTXCID!)
https://comptox.epa.gov/dashboard/dsstoxdb/batch_search

@meier-rene

Collaborator

commented May 8, 2019

Sorry, I didn't understand this concept.

On PubChem we have the SID, which is something like the label on a bottle of chemicals and could potentially be a mixture, and we have the CID, which is a unique compound represented by exactly one formula (like you would draw on paper).

So, some more questions:
Does this mean that there might be several DTXSIDs for one InChIKey?
Is there a 1-to-n relation between DTXCID and DTXSID, like in PubChem?

@schymane

Member Author

commented May 8, 2019

As far as I'm aware, it's one DTXSID per InChIKey. The service should return one DTXSID per InChIKey request, and this is what @ChemConnector asked us to do: use the InChIKey-to-DTXSID lookup to add these identifiers to MassBank. (I'm therefore assuming this is the most robust way in his opinion, and from my experience I'd agree.)

One DTXSID may have multiple DTXCIDs associated with it. It's a bit different from the PubChem construct. IMHO we should not yet try mapping on DTXCIDs, as they don't have the full functionality that the DTXSIDs have; until recently they were hidden entirely.
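Since the one-DTXSID-per-InChIKey relation is stated here as an expectation, not a guarantee, a bulk update could check it before writing anything. A minimal sketch, assuming the lookup results have already been collected as (InChIKey, DTXSID) pairs; the function name is hypothetical.

```python
from collections import defaultdict


def ambiguous_inchikeys(pairs):
    """Return InChIKeys that map to more than one distinct DTXSID.

    `pairs` is an iterable of (inchikey, dtxsid) tuples.
    The result maps each offending InChIKey to a sorted list of DTXSIDs;
    an empty dict means the one-to-one expectation held for this data.
    """
    seen = defaultdict(set)
    for inchikey, dtxsid in pairs:
        seen[inchikey].add(dtxsid)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}
```

Any InChIKey reported by this check would be a candidate for manual review instead of automatic annotation.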

Some examples:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=nicotine
https://comptox.epa.gov/dashboard/dsstoxdb/ms_ready_mixture?cid=28128

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID10858175
This one has two DTXCIDs associated with it.
