CTS retrieving non-live CIDs #225

Open
schymane opened this issue Dec 13, 2019 · 5 comments

@schymane schymane commented Dec 13, 2019

We should make PubChem, not CTS, our default source of CIDs. Too many discrepancies and errors are cropping up.
@meier-rene we may have to check the "status" of CIDs during validation, to catch and fix non-live ones.

Example from freshly-created infolist:
https://pubchem.ncbi.nlm.nih.gov/compound/4644
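
One way to check CID status during validation is to compare each stored CID with the CID PubChem prefers for it. The following is only a sketch: `checkCidStatus` and its `resolver` argument are hypothetical names, the resolver stands in for whatever PubChem lookup ends up being used, and it assumes that a CID counts as "live" when its preferred CID is itself.

```r
# Hypothetical helper: flag CIDs whose preferred CID differs from the stored one.
# `resolver` is an assumed function mapping one CID to its preferred CID;
# it is passed in so the check can be exercised without network access.
checkCidStatus <- function(cids, resolver) {
  preferred <- vapply(
    cids,
    function(cid) {
      # A failed lookup is recorded as NA rather than stopping the whole check.
      res <- tryCatch(resolver(cid), error = function(e) NA)
      if (is.null(res) || length(res) == 0) NA_real_ else as.numeric(res[1])
    },
    numeric(1)
  )
  data.frame(
    cid = as.numeric(cids),
    preferred = preferred,
    # Assumption: a CID is live when PubChem prefers that same CID.
    live = !is.na(preferred) & preferred == as.numeric(cids)
  )
}
```

Rows with `live == FALSE` are candidates for fixing; note that a failed lookup also ends up there, so those rows would need a second look rather than an automatic replacement.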

@tsufz tsufz commented Dec 13, 2019

@schymane schymane commented Apr 16, 2020

getPcId will also return non-live CIDs for non-standard tautomeric forms. We can fix this using a function in RChemMass, but we should make sure we upgrade the getPcId function to automatically do this. On the "todo" list ...

> getPcId("NPZTUJOABDZTLV-UHFFFAOYSA-N")
[1] 2759291
> getPCIDs.CIDtype(getPcId("NPZTUJOABDZTLV-UHFFFAOYSA-N"),type="preferred")
[1] 135399369
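
The automatic upgrade could be a thin wrapper around the existing lookup. This is a minimal sketch, not the eventual implementation: `getLivePcId`, `lookup` and `resolver` are hypothetical names, with `lookup` standing in for `getPcId` and `resolver` for a call such as `function(cid) getPCIDs.CIDtype(cid, type = "preferred")`.

```r
# Hypothetical wrapper: look up a CID, then swap it for the preferred CID.
getLivePcId <- function(key, lookup, resolver) {
  cid <- lookup(key)
  # Nothing found: keep the NA convention of the original lookup.
  if (is.null(cid) || length(cid) == 0 || is.na(cid[1])) return(NA)
  preferred <- tryCatch(resolver(cid[1]), error = function(e) NA)
  # Assumption: if the preferred-CID lookup fails, fall back to the original CID.
  if (is.null(preferred) || length(preferred) == 0 || is.na(preferred[1])) {
    return(cid[1])
  }
  preferred[1]
}
```

Falling back to the original CID keeps records from losing their PubChem link when the second request fails; returning NA instead would be the stricter choice.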

@schymane schymane commented Apr 17, 2020

@meowcat I will need to upgrade getPcId to make sure it returns "live" CIDs, which is fine. But where can I check whether we grab PubChem CIDs from PubChem or from CTS? If we already use getPcId (not CTS), then all I need to do is fix that function and this issue is solved. Thanks.

@meowcat meowcat commented Apr 17, 2020

So:
https://github.com/MassBank/RMassBank/createMassBank.R
Line 577 is where we call gatherPubChem to get data off PubChem.

PcInfo <- gatherPubChem(inchikey_split)

Lines 602-608 are where we get CTS data. To do: what do we still need from CTS at this point?

##Use CTS to retrieve information
CTSinfo <- getCtsRecord(inchikey_split)
if((CTSinfo[1] == "Sorry, we couldn't find any matching results") || is.null(CTSinfo[1]))
{
    CTSinfo <- NA
}
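
In the check above, the `is.null` test only runs after the comparison has already been evaluated. A more defensive ordering is sketched below; `normalizeCtsRecord` is a hypothetical helper, the exact shape of what `getCtsRecord` returns is an assumption, and the sentinel string is copied from the excerpt.

```r
# Hypothetical helper: collapse "no result" responses from CTS to NA.
normalizeCtsRecord <- function(rec) {
  notFound <- "Sorry, we couldn't find any matching results"
  # Test for NULL or empty input first, so the string comparison below
  # is never evaluated on a missing value.
  if (is.null(rec) || length(rec) == 0) return(NA)
  first <- rec[[1]]
  if (is.character(first) && isTRUE(first[1] == notFound)) return(NA)
  rec
}
```

It would be used as `CTSinfo <- normalizeCtsRecord(getCtsRecord(inchikey_split))`, provided the CTS call is kept at all.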

Lines 775-786 are where we decide which PubChem ID to use. I guess you want to drop CTS completely as an option?

# PubChem CID
if(is.na(PcInfo$PcID[1])){
    if(!is.na(CTSinfo[1])){
        if("PubChem CID" %in% CTS.externalIdTypes(CTSinfo))
        {
            pc <- CTS.externalIdSubset(CTSinfo,"PubChem CID")
            link[["PUBCHEM"]] <- paste0(min(pc))
        }
    }
} else{
    link[["PUBCHEM"]] <- PcInfo$PcID[1]
}
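
If CTS is dropped as a fallback, the block above reduces to a single check on the PubChem result. A sketch under that assumption; `pubchemLink` is a hypothetical helper, not a function in RMassBank.

```r
# Hypothetical helper: PubChem-only choice of the CID for the record link.
pubchemLink <- function(PcInfo) {
  pcid <- PcInfo$PcID[1]
  # No usable CID from PubChem: leave the link unset instead of consulting CTS.
  if (is.null(pcid) || is.na(pcid)) return(NULL)
  pcid
}

link <- list()
# Assigning NULL to a list element leaves it unset, so a missing CID adds nothing.
link[["PUBCHEM"]] <- pubchemLink(list(PcID = 135399369))
```

The trade-off is that compounds PubChem cannot resolve get no PubChem link at all, instead of a CTS-derived one that may be non-live.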

Then the actual data retrieval from PubChem is in gatherPubChem, where getPcId is called:

gatherPubChem <- function(key){
    PubChemData <- list()
    ##Trycatches are there because pubchem has connection issues 1 in 50 times.
    ##Write NA into the respective fields if something goes wrong with the connection or the data.
    ##Retrieve Pubchem CID
    tryCatch(
        PubChemData$PcID <- getPcId(key),
        error=function(e){
            PubChemData$PcID <<- NA
        })
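
Because `tryCatch` returns the value of whichever branch ran, the NA fallback can also be written without `<<-`, and occasional connection failures can be retried before giving up. A minimal sketch; `fetchWithRetry` and its parameters are hypothetical and not part of RMassBank.

```r
# Hypothetical helper: call fetch(key), retrying on errors, NA if all attempts fail.
fetchWithRetry <- function(fetch, key, attempts = 3, wait = 0) {
  for (i in seq_len(attempts)) {
    # tryCatch yields either the fetched value or a marker for a failed call.
    result <- tryCatch(
      list(ok = TRUE, value = fetch(key)),
      error = function(e) list(ok = FALSE, value = NA)
    )
    if (result$ok) return(result$value)
    # Pause between attempts, but not after the last one.
    if (i < attempts) Sys.sleep(wait)
  }
  NA
}
```

Usage would look like `PubChemData$PcID <- fetchWithRetry(getPcId, key)`. Only calls that raise an error are retried; a lookup that quietly returns NA is passed through unchanged.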

getPcId is then the function in webAccess.R:

RMassBank/R/webAccess.R

Lines 109 to 144 in 611b785

getPcId <- function(query, from = "inchikey")
{
    baseURL <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"
    url <- paste(baseURL, from, query, "description", "json", sep="/")
    errorvar <- 0
    currEnvir <- environment()
    tryCatch(
        data <- getURL(URLencode(url),timeout=5),
        error=function(e){
            currEnvir$errorvar <- 1
        })
    if(errorvar){
        return(NA)
    }
    # This happens if the InChI key is not found:
    r <- fromJSON(data)
    if(!is.null(r$Fault))
        return(NA)
    titleEntry <- which(unlist(lapply(r$InformationList$Information, function(i) !is.null(i$Title))))
    titleEntry <- titleEntry[which.min(sapply(titleEntry, function(x)r$InformationList$Information[[x]]$CID))]
    PcID <- r$InformationList$Information[[titleEntry]]$CID
    if(is.null(PcID)){
        return(NA)
    } else{
        return(PcID)
    }
}
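
The selection step at the end of `getPcId` (take the lowest CID among the entries that carry a title) can be isolated and tested without network access. A minimal sketch, assuming the parsed JSON is a nested list with the `InformationList$Information` shape used above; `pickLowestTitledCid` is a hypothetical name. Unlike the excerpt, it returns NA when no entry has a title instead of indexing with an empty selection.

```r
# Hypothetical helper: lowest CID among entries that have a Title.
pickLowestTitledCid <- function(r) {
  info <- r$InformationList$Information
  if (is.null(info) || length(info) == 0) return(NA)
  # Keep only entries that have both a title and a CID.
  titled <- Filter(function(i) !is.null(i$Title) && !is.null(i$CID), info)
  if (length(titled) == 0) return(NA)
  cids <- vapply(titled, function(i) as.numeric(i$CID), numeric(1))
  min(cids)
}
```

Keeping this step separate also gives a natural place to swap "lowest CID" for "preferred CID" once that lookup is available.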

@schymane schymane commented Apr 18, 2020

@meowcat I've created a new branch:
https://github.com/MassBank/RMassBank/tree/preferredPCIDs

I've added getPCIDs.CIDtype, adjusted getPcId and createMassBank.R (and updated my old email address). I'm stuck on the documentation - see emails.
