How to retrieve all candidates in PubChem using MetFragR or MetFragCL #24

Open
JustinZZW opened this issue May 20, 2019 · 16 comments

@JustinZZW

commented May 20, 2019

Hi,
I would like to know how to retrieve all possible candidates from PubChem for a defined NeutralPrecursorMass and DatabaseSearchRelativeMassDeviation using MetFragR. I noticed that the CASMI 2017 description says "The candidates were retrieved as InChI structures from PubChem (mirror dated 2017-02-03) using MetFrag 2.4.2". This is easy to do in the MetFrag web server, but I do not know how to do it in MetFragR or MetFragCL.
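
For readers following along, a relative mass deviation in ppm corresponds to an absolute mass window around the neutral mass. The Python sketch below only illustrates that arithmetic. It assumes the window is symmetric around the mass, and the function name `ppm_window` is made up for this example (it is not part of MetFrag). The values 201.0776 and 5 ppm are the ones used in the parameter files posted later in this thread.

```python
def ppm_window(neutral_mass, ppm):
    """Return (lower, upper) mass bounds for a symmetric ppm tolerance.

    Assumption: the tolerance is applied symmetrically around the mass.
    """
    delta = neutral_mass * ppm / 1e6  # absolute tolerance in Da
    return neutral_mass - delta, neutral_mass + delta


# Example values taken from the parameter files in this thread
lower, upper = ppm_window(201.0776, 5.0)
print(f"search window: {lower:.6f} to {upper:.6f}")
```

At this mass, 5 ppm is roughly 0.001 Da on either side, so any candidate whose monoisotopic mass falls in that narrow range would be returned by the database query.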

Thanks very much!

@JustinZZW JustinZZW changed the title How to retrieve all candidates using MetFragR or MetFragCL How to retrieve all candidates in PubChem using MetFragR or MetFragCL May 20, 2019

@schymane


commented May 20, 2019

I have example functions, data and documentation here:
https://github.com/schymane/ReSOLUTION/
https://github.com/schymane/ReSOLUTION/blob/master/R/MetFragConfigR.R

We are looking at improving this on our side and streamlining the documentation etc. If you can install this package from GitHub, this should get you started in the interim. The IPB team also have other workflows that may be useful for you.

@JustinZZW

Author

commented May 20, 2019

Thanks, Emma,
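I tried the examples in the ReSOLUTION package as follows:

library(ReSOLUTION)

# Example peak list shipped with the ReSOLUTION package
peaklist_path <- system.file("extdata","EA026206_Simazine_peaks.txt",package="ReSOLUTION")
# Working directory for this test run
test_dir <- "I:/software/MetFrag/test_190520"
# Build the MetFrag config file for a neutral mass of 201.0776, searched against PubChem
config_file <- MetFragConfig(201.0776, 
                             "[M+H]+", 
                             "Simazine_neutralMass_PubChem", 
                             peaklist_path, 
                             test_dir, 
                             DB="PubChem", 
                             neutralPrecursorMass=TRUE)
MetFragAdductTypes <- read.csv(system.file("extdata","MetFrag_AdductTypes.csv",package="ReSOLUTION"))
# Folder containing the MetFragCL jar, and the jar's file name
metfrag_dir <- "I:/software/MetFrag/"

MetFragCL_name <- "MetFrag2.4.3-CL.jar"
# Run MetFragCL with the generated config file
runMetFrag(config_file, metfrag_dir, MetFragCL_name)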

It finds 1022 candidates and returns a final result with 682 candidates. Which function should I use to export all 1022 candidates? Or which parameter do I need to change to download all 1022 candidates?

This is the log:

INFO de.ipbhalle.metfraglib.database.OnlineExtendedPubChemDatabase - Fetching candidates from PubChem
INFO de.ipbhalle.metfraglib.database.OnlineExtendedPubChemDatabase - Fetching PubMed references
INFO de.ipbhalle.metfraglib.database.OnlineExtendedPubChemDatabase - Fetching patents
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Got 1022 candidate(s)
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 10 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 20 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 30 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 40 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 50 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 60 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 70 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 80 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 90 %
INFO de.ipbhalle.metfraglib.process.ProcessingStatus - 100 %
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Processed 951 candidate(s)
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 71 candidate(s) were discarded before processing due to pre-filtering
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 0 candidate(s) discarded during processing due to errors
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 269 candidate(s) discarded after processing due to post-filtering
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Stored 682 candidate(s)

@schymane


commented May 20, 2019

The log indicates that your post processing settings are reducing the candidates:
de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 269 candidate(s) discarded after processing due to post-filtering

I highly recommend you choose some examples and try them on the web interface, see what settings result in the parameter files (which you can download) and use this to choose the options that suit what you want to do.
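
To see where the candidates go, the counts in the log can be reconciled step by step. The following Python sketch is only an illustration: the regular expressions are written against the log lines pasted above and may not match other MetFrag versions, and the names `PATTERNS` and `candidate_counts` are invented for this example.

```python
import re

# Relevant lines copied from the log posted above
LOG = """\
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Got 1022 candidate(s)
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Processed 951 candidate(s)
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 71 candidate(s) were discarded before processing due to pre-filtering
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 0 candidate(s) discarded during processing due to errors
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - 269 candidate(s) discarded after processing due to post-filtering
INFO de.ipbhalle.metfraglib.process.CombinedMetFragProcess - Stored 682 candidate(s)
"""

# Assumed log wording, based only on the lines above
PATTERNS = {
    "fetched": r"Got (\d+) candidate",
    "processed": r"Processed (\d+) candidate",
    "pre_filtered": r"(\d+) candidate\(s\) were discarded before processing",
    "errors": r"(\d+) candidate\(s\) discarded during processing",
    "post_filtered": r"(\d+) candidate\(s\) discarded after processing",
    "stored": r"Stored (\d+) candidate",
}


def candidate_counts(log_text):
    """Extract the candidate counts reported in a MetFrag log."""
    counts = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            counts[name] = int(match.group(1))
    return counts


counts = candidate_counts(LOG)
after_pre = counts["fetched"] - counts["pre_filtered"]
after_post = counts["processed"] - counts["errors"] - counts["post_filtered"]
print(f"after pre-filtering:  {after_pre} (log says {counts['processed']} processed)")
print(f"after post-filtering: {after_post} (log says {counts['stored']} stored)")
```

With the numbers from this log, 1022 fetched minus 71 pre-filtered gives the 951 processed candidates, and 951 minus 269 post-filtered gives the 682 stored candidates.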

@JustinZZW

Author

commented May 20, 2019

Thanks, I can set "filter_by_InChIKey" and "filter_isotopes" to FALSE to turn off the post-filtering.

However, the log still reports that 70 candidates were discarded before processing due to pre-filtering. How do I turn this off?

Also, I tried the same neutral mass in the web server with the same settings, but it only retrieved 700 candidates. Why is there a difference? On my local computer I use MetFrag2.4.3-CL.

I have attached the local config and log files and the config from the web server:
MetFragWeb_Parameters.txt

Simazine_neutralMass_PubChem_new_2_log.txt

Simazine_neutralMass_PubChem_new_2_config.txt

@schymane


commented May 20, 2019

These candidates are salts / mixtures / disconnected and cannot possibly be observed at the mass of interest and are thus excluded entirely. It is a factor of the way the data is retrieved from PubChem. Since they will not be observed at the mass you have given, it makes no sense to include them, so we have not added an "on/off" option for this case.
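
As a rough illustration of what "disconnected" means: in SMILES notation a dot separates components that are not bonded to each other, so salts and mixtures appear as multi-component strings. The Python sketch below is a simplified check and is not MetFrag's actual pre-processing filter; the function name and the example SMILES are made up for this illustration.

```python
def is_disconnected(smiles):
    """Return True if the SMILES string contains a '.' component separator.

    Simplification: ring-closure digits can, in rare cases, bond atoms
    across a dot (e.g. "C1.C1"), which this check does not handle.
    """
    return "." in smiles


# Illustrative structures, not taken from the PubChem result list
examples = {
    "CCO": "ethanol, one connected structure",
    "[Na+].[Cl-]": "sodium chloride, two separate ions",
    "CC(=O)O.CCN": "acetic acid plus ethylamine, a two-component entry",
}

for smiles, description in examples.items():
    status = "disconnected" if is_disconnected(smiles) else "connected"
    print(f"{smiles:15s} {status:13s} {description}")
```

An entry like the second or third one has a total mass that matches the query, but no single component does, which is why such records would not be observed at the mass of interest.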

@schymane


commented May 20, 2019

To check compatibility between web and CL version in detail, you have to compare the parameter files. Some discrepancy can arise if the web uses the local PubChem mirror versus the live online PubChem query.

@JustinZZW

Author

commented May 20, 2019

Thanks a lot for the patient explanation.

I agree that the difference may be caused by the PubChem mirror versus the online query. But how can I tell which one is used by the web server and by the CL version? I compared both parameter files and could not find a parameter that indicates this.

In addition, if the CL version uses the online query, how can I make the results reproducible?

Thanks again.

@schymane


commented May 20, 2019

You have to check the database parameters to see. On the web, if you tick the references box, it automatically uses the online version. I can't seem to download your parameter file from here. I just gave and posted a talk on this today, maybe the slides help a little? https://zenodo.org/record/3046373

@JustinZZW

Author

commented May 21, 2019

Thanks very much for the slides. They are really useful and impressive.
And I'm sorry that the parameter file could not be downloaded. I still have some questions:

First, if the web server uses the online version, why does it retrieve fewer compounds (700) than the CL version (1022)?

Second, I'm not sure what "filter_isotopes" means. The help file says it removes all candidates containing non-standard isotopes. Could you give some examples? I really appreciate your kind reply.

These are the parameters of the CL version:

SampleName = Simazine_neutralMass_PubChem_new_2
ResultsPath = I:/software/MetFrag/test_190520_2/results
PeakListPath = C:/Users/user/Documents/R/win-library/3.5/ReSOLUTION/extdata/EA026206_Simazine_peaks.txt
PrecursorIonType = [M+H]+
NeutralPrecursorMass = 201.0776
IsPositiveIonMode = True
DatabaseSearchRelativeMassDeviation = 5
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
MaximumTreeDepth = 2
NumberThreads = 2
MetFragCandidateWriter = XLS
MetFragDatabaseType = PubChem
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter
MetFragScoreWeights = 1.0,1.0,1.0
MetFragScoreTypes = FragmenterScore,OfflineMetFusionScore,OfflineIndividualMoNAScore

These are the parameters of the web server:

PrecursorIonMode = 1
FragmentPeakMatchRelativeMassDeviation = 5.0
SampleName = MetFragWeb_Sample
MetFragCandidateWriter = XLS
DatabaseSearchRelativeMassDeviation = 5.0
FragmentPeakMatchAbsoluteMassDeviation = 0.001
MetFragDatabaseType = PubChem
ResultsPath = .
NeutralPrecursorMass = 201.0776
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1.0
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter,IsotopeFilter
IsPositiveIonMode = true
MaximumTreeDepth = 2
NumberThreads = 2
UseSmiles = true
PeakListPath = MetFragWeb_Peaklist.txt
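
The two parameter sets above can be compared key by key. The Python sketch below is one possible way to do that. It uses only an excerpt of the two files (sample names, paths and ion-type keys are left out), and the helper names `parse_params`, `normalise` and `diff_params` are invented for this example. Treating `5` and `5.0`, or `True` and `true`, as equal is an assumption about how the values are interpreted.

```python
CL_PARAMS = """\
NeutralPrecursorMass = 201.0776
IsPositiveIonMode = True
DatabaseSearchRelativeMassDeviation = 5
MetFragDatabaseType = PubChem
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter
MetFragScoreWeights = 1.0,1.0,1.0
MetFragScoreTypes = FragmenterScore,OfflineMetFusionScore,OfflineIndividualMoNAScore
"""

WEB_PARAMS = """\
NeutralPrecursorMass = 201.0776
IsPositiveIonMode = true
DatabaseSearchRelativeMassDeviation = 5.0
MetFragDatabaseType = PubChem
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter,IsotopeFilter
MetFragScoreWeights = 1.0
MetFragScoreTypes = FragmenterScore
UseSmiles = true
"""


def parse_params(text):
    """Parse 'Key = Value' lines into a dict, skipping blank lines."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def normalise(value):
    """Compare numbers numerically and everything else case-insensitively."""
    try:
        return float(value)
    except ValueError:
        return value.lower()


def diff_params(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if va is None or vb is None or normalise(va) != normalise(vb):
            diffs[key] = (va, vb)
    return diffs


diffs = diff_params(parse_params(CL_PARAMS), parse_params(WEB_PARAMS))
for key, (cl_value, web_value) in diffs.items():
    print(f"{key}: CL={cl_value!r} web={web_value!r}")
```

For this excerpt the differences are the pre-processing filter list, the score types and weights, and the `UseSmiles` key that only appears in the web file. Neither file says whether a PubChem mirror or the live service is queried.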

@schymane


commented May 21, 2019

I am not sure what the problem was with the file download, it redirected me to a completely different website instead of the parameter file, but it may have been the app I was using. Many thanks for copying the parameters, this is very helpful.
Regarding your question of the "IsotopeFilter", this is in place to remove e.g. structures that contain "non-standard" isotopes as one of the main atoms, e.g. 13C or 2H, as we would not expect to observe them in reality (they are usually e.g. internal standards). For instance for Simazine look at this link:
https://comptox.epa.gov/dashboard/dsstoxdb/multiple_results?input_type=inchikey_skeleton&inputs=ODCWYMIRDDJXKW
The second two structures we would not expect in a normal sample. But these would also not be retrieved from PubChem in this case as they have the wrong mass (only Simazine satisfies the mass+5ppm window) - but maybe the CL results from PubChem contain other molecules with non-standard isotopes. I would not normally expect such a large discrepancy of ~300 candidates however.
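
To make "non-standard isotopes" more concrete: in SMILES, an isotope label is written as a mass number in front of the element symbol inside square brackets, for example [13C] or [2H]. The Python sketch below spots such labels. It is only an illustration and not how the MetFrag IsotopeFilter is implemented; the function name and the labelled example structures are made up.

```python
import re

# A mass number directly after '[' marks an isotope label in SMILES,
# e.g. [13C], [2H], [13CH3]
ISOTOPE_LABEL = re.compile(r"\[(\d+)([A-Za-z][a-z]?)")


def isotope_labels(smiles):
    """Return a list of (mass_number, element) for labelled atoms."""
    return [(int(mass), element) for mass, element in ISOTOPE_LABEL.findall(smiles)]


# Simazine plus two small labelled molecules chosen for illustration
examples = {
    "CCNc1nc(Cl)nc(NCC)n1": "simazine, no isotope labels",
    "[13CH3]O": "methanol with a 13C label",
    "[2H]C([2H])([2H])O": "methanol with three deuterium labels",
}

for smiles, description in examples.items():
    labels = isotope_labels(smiles)
    verdict = "would be flagged" if labels else "kept"
    print(f"{smiles:25s} {verdict:17s} {description}")
```

Labelled compounds like these are typically internal standards, and each label shifts the monoisotopic mass by about 1 Da, so they rarely fall into the same 5 ppm window as the unlabelled compound.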
What you could do is try to run MetFragCL with exactly the parameters from the web interface (obviously you have to change the file path), or just add the IsotopeFilter to your command line parameter file and see if this changes the candidate numbers to be consistent.
I have tried the web interface myself and also get 700 candidates whatever PubChem option I use.

Also just to clarify: the CL uses the online version (unless you specify a local version of PubChem), the web uses an offline version by default, but when I select the references, I see now MetFragDatabaseType = ExtendedMetChem
My colleagues in Halle will have to confirm whether this is now also linked to an offline database?

Please let me know if this resolves the issue. If you still see different candidate numbers after testing this, can you submit your MetFrag web parameters to the team in Halle so they can investigate further?
@sneumann FYI ...

image

@JustinZZW

Author

commented May 21, 2019

Thanks, I followed your instructions and tried 4 different parameter sets. There is still a large difference in candidate numbers between CL (1022) and the web server (700). filter_isotopes changes the count by only 1 candidate. So, as you say, I guess the web server uses the offline version while the CL uses the online version. Here are the results of the 4 runs:

  1. Parameter set generated by ReSOLUTION and run in CL: filter_isotopes=FALSE, filter_by_InChIKey=FALSE
    Result: 1022 candidates, 70 discarded due to pre-filtering, 952 remaining

  2. Parameter set generated by ReSOLUTION and run in CL: filter_isotopes=TRUE, filter_by_InChIKey=FALSE
    Result: 1022 candidates, 71 discarded due to pre-filtering, 951 remaining

  3. Run in the web server: DB='PubChem'
    Result: 700 candidates, 63 discarded due to pre-filtering, 637 remaining

  4. Parameter set downloaded from the web server and run in CL
    Result: 1022 candidates, 71 discarded due to pre-filtering, 951 remaining


  1. Parameter set:

SampleName = Simazine_neutralMass_PubChem_filter_isotope
ResultsPath = I:/software/MetFrag/run_with_CL_no_filter_isotope_190521/results
PeakListPath = C:/Users/user/Documents/R/win-library/3.5/ReSOLUTION/extdata/EA026206_Simazine_peaks.txt
PrecursorIonType = [M+H]+
NeutralPrecursorMass = 201.0776
IsPositiveIonMode = True
DatabaseSearchRelativeMassDeviation = 5
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
MaximumTreeDepth = 2
NumberThreads = 2
MetFragCandidateWriter = XLS
MetFragDatabaseType = PubChem
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter
MetFragScoreWeights = 1.0,1.0,1.0
MetFragScoreTypes = FragmenterScore,OfflineMetFusionScore,OfflineIndividualMoNAScore

  2. Parameter set:

SampleName = Simazine_neutralMass_PubChem_filter_isotope
ResultsPath = I:/software/MetFrag/run_with_CL_add_filter_isotope_190521/results
PeakListPath = C:/Users/user/Documents/R/win-library/3.5/ReSOLUTION/extdata/EA026206_Simazine_peaks.txt
PrecursorIonType = [M+H]+
NeutralPrecursorMass = 201.0776
IsPositiveIonMode = True
DatabaseSearchRelativeMassDeviation = 5
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
MaximumTreeDepth = 2
NumberThreads = 2
MetFragCandidateWriter = XLS
MetFragDatabaseType = PubChem
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter,IsotopeFilter
MetFragScoreWeights = 1.0,1.0,1.0
MetFragScoreTypes = FragmenterScore,OfflineMetFusionScore,OfflineIndividualMoNAScore

  3. Parameter set:

PrecursorIonMode = 1
FragmentPeakMatchRelativeMassDeviation = 5.0
SampleName = MetFragWeb_Sample
MetFragCandidateWriter = XLS
DatabaseSearchRelativeMassDeviation = 5.0
FragmentPeakMatchAbsoluteMassDeviation = 0.001
MetFragDatabaseType = PubChem
ResultsPath = .
NeutralPrecursorMass = 201.0776
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1.0
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter,IsotopeFilter
IsPositiveIonMode = true
MaximumTreeDepth = 2
NumberThreads = 2
UseSmiles = true
PeakListPath = MetFragWeb_Peaklist.txt

  4. Parameter set:

PrecursorIonMode = 1
FragmentPeakMatchRelativeMassDeviation = 5.0
SampleName = MetFragWeb_Sample
MetFragCandidateWriter = XLS
DatabaseSearchRelativeMassDeviation = 5.0
FragmentPeakMatchAbsoluteMassDeviation = 0.001
MetFragDatabaseType = PubChem
ResultsPath = I:/software/MetFrag/run_with_webserver_para_190521
NeutralPrecursorMass = 201.0776
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1.0
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter,IsotopeFilter
IsPositiveIonMode = true
MaximumTreeDepth = 2
NumberThreads = 2
UseSmiles = true
PeakListPath = C:/Users/user/Documents/R/win-library/3.5/ReSOLUTION/extdata/EA026206_Simazine_peaks.txt

@schymane


commented May 21, 2019

@sneumann @korseby can you confirm that MetFrag web only works on an offline mirror now (is there any way to set it to use the online version instead)? Does this explain the candidate difference? What is the date of the PubChem mirror? This is a huge difference in candidate numbers ... quite surprising. Thanks!

@sneumann

Member

commented May 21, 2019

Hi, I can confirm that the web version is using an older local mirror of PubChem.
The SQL dump is at https://msbi.ipb-halle.de/~sneumann/metchem-2016.sql.gz
and some documentation on installing it locally is at
https://github.com/sneumann/container-metfrag/blob/master/deployments/README.md
Information on how to get an updated local database is at
https://github.com/c-ruttkies/container-metchemdata but we haven't got around
to updating the mirror on a regular basis in the web version.
Yours, Steffen

@JustinZZW

Author

commented May 21, 2019

Thanks for all your help, @schymane @sneumann! Although I am not familiar with Docker, I will try to learn how to install it.

Thanks again!

@korseby

Member

commented May 21, 2019

I noticed that you use an outdated version of the MetFrag-CLI. It is also possible to use the local MetChem mirror (deployed as docker container) from the command line (which speeds up processing). Please also make sure to have the latest version installed (currently 2.4.5). The easiest way would be to install it from bioconda, see here: https://github.com/bioconda/bioconda-recipes/tree/bc656c004ca3399595959403ca1870e83ac1e50b/recipes/metfrag or to use the biocontainers image.

There is also a command line reference for MetFrag-CLI, see here: http://ipb-halle.github.io/MetFrag/projects/metfragcl/

@JustinZZW

Author

commented May 21, 2019

OK, I will try the latest version. Thanks! @korseby
