Cheminformatics approaches to support chemical identification delivered via the EPA CompTox Chemicals Dashboard

The identification of chemicals in environmental media depends on the application of analytical methods, primarily one of several mass spectrometry techniques. Cheminformatics solutions are critical to supporting the chemical identification process. These include the assembly of large chemical substance databases, the prioritization and ranking of candidate search hits, and search approaches that support both targeted and non-targeted screening. The US Environmental Protection Agency CompTox Chemicals Dashboard is a web-based application providing access to data for over 760,000 chemical substances. The data include physicochemical properties, environmental fate and transport data, human and ecological toxicity data, information on chemicals contained in products in commerce, and in vitro bioactivity data. Searches can be based on chemical identifiers, product and use categories, genes and assays associated with EPA ToxCast and, specifically to support mass spectrometry, masses and formulae. The mass and formula searches make use of a novel “MS-Ready structures” approach that collapses chemicals related as mixtures, salts, stereoforms and isotopomers. The dashboard supports both singleton and batch searching by accurate mass or chemical formula, backed by MS-Ready structures, and uses rich metadata to facilitate candidate ranking and the prioritization of chemicals of concern based on toxicity and exposure data. The dashboard also hosts dozens of chemical lists assembled from public databases, many of them supporting non-targeted analysis and mass spectrometry databases.
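
As a rough illustration of the accurate mass and formula searching described above, the following Python sketch computes a monoisotopic mass from a molecular formula and matches a batch of observed masses against candidates within a ppm window. It is not the dashboard's implementation: the candidate table, the function names, and the 5 ppm default are assumptions made for this example, only a few elements are supported, and observed values are assumed to be neutral monoisotopic masses (no adduct handling).

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
# Only a few elements are listed; extend as needed.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
    "Cl": 34.96885268,
}

# Hypothetical candidate table (name -> neutral molecular formula).
CANDIDATES = {
    "EDTA": "C10H16N2O8",
    "caffeine": "C8H10N4O2",
    "glucose": "C6H12O6",
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C10H16N2O8'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def within_ppm(observed, theoretical, ppm):
    """True if the observed mass lies within +/- ppm of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6


def batch_search(observed_masses, candidates, ppm=5.0):
    """Map each observed mass to the candidate names within the ppm window."""
    theoretical = {name: monoisotopic_mass(f) for name, f in candidates.items()}
    return {
        obs: [name for name, mass in theoretical.items() if within_ppm(obs, mass, ppm)]
        for obs in observed_masses
    }


if __name__ == "__main__":
    for mass, hits in batch_search([292.0907, 194.0804, 100.0], CANDIDATES).items():
        print(mass, hits)
```

A relative (ppm) tolerance is used rather than a fixed Da window because instrument mass error typically scales with mass; a fixed window would be an equally simple alternative.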

This presentation will provide an overview of the dashboard and will review our latest research into structure identification by searching experimental mass spectrometry data against predicted fragmentation spectra for LC-MS (positive and negative ion mode) and GC-MS (EI), a total of 3 million predicted spectra. We will also provide an overview of our progress supporting structure and substructure searching, using mass and formula-based filtering, and report on the latest applications of the dashboard to support structure identification projects of interest to the EPA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
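
To make the idea of searching experimental spectra against predicted fragmentation spectra more concrete, here is a minimal Python sketch that scores one experimental spectrum against predicted spectra for several candidates and ranks the candidates. The scoring method is not stated in this abstract; cosine similarity with greedy peak matching is an assumption chosen for illustration, as are the function names, the 0.01 Da tolerance, and the spectrum format (lists of `(m/z, intensity)` pairs).

```python
import math


def cosine_score(experimental, predicted, tol=0.01):
    """Cosine similarity between two spectra given as [(mz, intensity), ...].

    Each experimental peak is greedily paired with the closest unused
    predicted peak within `tol` Da; unmatched peaks contribute nothing
    to the dot product but still count in the norms.
    """
    used = set()
    dot = 0.0
    for mz_e, int_e in experimental:
        best = None
        for j, (mz_p, _) in enumerate(predicted):
            if j in used or abs(mz_e - mz_p) > tol:
                continue
            if best is None or abs(mz_e - mz_p) < abs(mz_e - predicted[best][0]):
                best = j
        if best is not None:
            used.add(best)
            dot += int_e * predicted[best][1]
    norm_e = math.sqrt(sum(i * i for _, i in experimental))
    norm_p = math.sqrt(sum(i * i for _, i in predicted))
    if norm_e == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_e * norm_p)


def rank_candidates(experimental, predicted_by_candidate, tol=0.01):
    """Return (candidate, score) pairs sorted from best to worst match."""
    scores = {
        name: cosine_score(experimental, spectrum, tol)
        for name, spectrum in predicted_by_candidate.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical spectra, for illustration only.
    observed = [(100.0, 50.0), (150.0, 100.0)]
    library = {
        "candidate_A": [(100.002, 45.0), (150.001, 100.0)],
        "candidate_B": [(120.0, 100.0), (150.0, 20.0)],
    }
    for name, score in rank_candidates(observed, library):
        print(name, round(score, 3))
```

In practice such a spectral score would be one ranking signal among several, alongside the metadata-based ranking mentioned above.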

Published in: Science

  1. Cheminformatics approaches to support chemical identification delivered via the EPA CompTox Chemicals Dashboard. Antony Williams (1), Andrew D. McEachran (2), Chris Grulke (1), Elin Ulrich (3) and Jon R. Sobus (3). Affiliations: (1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC; (2) Oak Ridge Institute for Science and Education (ORISE) Research Participant, RTP, NC; (3) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC. 2019 ACS Spring Meeting, Orlando. http://www.orcid.org/0000-0002-2668-4821 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.
  2. Limit notetaking if you wish: www.slideshare.net/AntonyWilliams
  3. Suspect Screening and Non-Targeted Analysis Workflows [workflow diagram; labels: DSSTox Chemical Database; “Molecular Features”; Extracted Samples; Raw Samples; Raw Features; Matched Formulas; Mapped Structures; Prioritized Structures (using ToxPi); Confirmed Structures (using ToxCast standards); Processed Features; Prioritized Features; Predicted Formulas; Candidate Structures; Sorted Structures; Predicted Retention Times; Predicted/Observed Functional Use; Top Candidate Structure(s); Suspect Screening; Non-Targeted Analysis; Predicted Concentrations; Predicted/Observed Media Occurrence; Predicted Mass Spectra; Methodological Concordance. Color key: Red = Analytical Chemistry; Blue = Data Processing & Analysis; Green = Informatics & Web Services; Purple = Mathematical & QSPR Modeling]
  4. CompTox Chemicals Dashboard https://comptox.epa.gov/dashboard • 875k Chemical Substances
  5. BASIC Search
  6. Detailed Chemical Pages
  7. Access to Chemical Hazard Data
  8. In Vitro Bioassay Screening: ToxCast and Tox21
  9. Sources of Exposure to Chemicals
  10. Link Access
  11. NIST WebBook https://webbook.nist.gov/chemistry/
  12. MassBank of North America https://mona.fiehnlab.ucdavis.edu
  13. m/z CLOUD https://www.mzcloud.org/
  14. DO WE REALLY NEED ANOTHER DATABASE?
  15. Is a bigger database better? • ChemSpider was 26 million chemicals then • Much BIGGER today • Is bigger better??
  16. Comparing Search Performance • Dashboard content was 720k chemicals • Only 3% of ChemSpider size • What was the comparison in performance?
  17. SAME dataset for comparison
  18. How did performance compare?
  19. Data Quality is important • Data quality in free web-based databases!
  20. Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search
  21. Comparing ChemSpider Structures
  22. Comparing ChemSpider Structures
  23. Other Searches
  24. Delivering a Better Database • An ideal database would provide: – Curated CAS Number-Name mappings with “correct” chemical structures • We have full time curators checking data
  25. MASS AND FORMULA SEARCHING (and metadata ranking)
  26. Advanced Searches: Mass and Formula Based Search
  27. Advanced Searches: Mass and Formula Based Search
  28. Using Metadata for Ranking • Use available metadata to rank candidates – Associated data sources • Associated lists in DSSTox database • Associated sources in PubChem • Specific types (e.g. water, surfactants, pesticides etc.) – Number of associated PubMed articles – Number of products/categories containing the chemical
  29. Metadata rank ordering
  30. SPECIFIC APPLICATIONS TO MASS SPEC.
  31. Mass Spec Focused Applications
  32. “MS-Ready Structures” https://doi.org/10.1186/s13321-018-0299-2
  33. [slide without text]
  34. MS-Ready Mappings
  35. MS-Ready Mappings Set
  36. Advanced Searches: Mass Search
  37. Advanced Searches: Mass Search
  38. MS-Ready Structures for Formula Search
  39. MS-Ready Mappings • EXACT Formula: C10H16N2O8: 3 Hits
  40. MS-Ready Mappings • Same Input Formula: C10H16N2O8 • MS Ready Formula Search: 125 Chemicals
  41. MS-Ready Mappings • 125 chemicals returned in total – 8 of the 125 are single component chemicals – 3 of the 8 are isotope-labeled – 3 are neutral compounds and 2 are charged
  42. Batch Searching • Singleton searches are useful but we work with thousands of masses and formulae! • Typical questions – What is the list of chemicals for the formula CxHyOz? – What is the list of chemicals for a mass +/- error? – Can I get chemical lists in Excel files? In SDF files? – Can I include properties in the download file?
  43. Batch Searching Formula/Mass
  44. Searching batches using MS-Ready Formula (or mass) searching
  45. RELATED APPLICATIONS OF INTEREST TO MASS SPEC.
  46. Find me “related structures”: Formula-Based Search
  47. Select Chemicals of Interest
  48. Find me “related structures”: Based on Structure Similarity
  49. Find me “related structures”: Based on Structure Similarity
  50. Find me “related structures”: Structure Similarity – sort on mass
  51. Literature Searching
  52. Literature Searching
  53. Literature Searching
  54. FOCUSED CHEMICAL LISTS OF INTEREST
  55. Chemical Lists
  56. EPAHFR: Hydraulic Fracturing
  57. PFAS lists of Chemicals
  58. COMPLEX CHEMICAL SUBSTANCES
  59. UVCB Chemicals
  60. Many Hydraulic Fracturing Chemicals are “Complex”
  61. “Markush Structures” https://en.wikipedia.org/wiki/Markush_structure
  62. WORK IN PROGRESS
  63. Suspect Screening and Non-Targeted Analysis Workflow [workflow diagram; labels: DSSTox Chemical Database; “Molecular Features”; Extracted Samples; Raw Samples; Raw Features; Matched Formulas; Mapped Structures; Prioritized Structures (using ToxPi); Confirmed Structures (using ToxCast standards); Processed Features; Prioritized Features; Predicted Formulas; Candidate Structures; Sorted Structures; Predicted Retention Times; Predicted/Observed Functional Use; Top Candidate Structure(s); Suspect Screening; Non-Targeted Analysis; Predicted Concentrations; Predicted/Observed Media Occurrence; Predicted Mass Spectra; Methodological Concordance. Color key: Red = Analytical Chemistry; Blue = Data Processing & Analysis; Green = Informatics & Web Services; Purple = Mathematical & QSPR Modeling]
  64. Work in Progress • Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database
  65. Predicted Mass Spectra http://cfmid.wishartlab.com/ • MS/MS spectra prediction for ESI+, ESI-, and EI • Predictions generated and stored for >800,000 structures, to be accessible via Dashboard
  66. Search Expt. vs. Predicted Spectra
  67. Search Expt. vs. Predicted Spectra
  68. Spectral Viewer Comparison
  69. Work in Progress • Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction
  70. Retention Time Prediction for Ranking
  71. Moving to Relative Retention Times
  72. Work in Progress • Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction • Structure/substructure/similarity search
  73. Prototype Development
  74. Prototype Development
  75. Work in Progress • Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction • Structure/substructure/similarity search • Access to API and web services for programmatic access
  76. API services and Open Data • Groups waiting on our API and web services • Mass Spec companies instrument integration • Release will be in iterations but for now our data are available
  77. SIDE EFFECTS OF SHARING OPEN DATA
  78. NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236
  79. Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
  80. Conclusion • Dashboard access to data for ~875,000 chemicals • MS-Ready data facilitates structure identification • Related metadata facilitates candidate ranking • Relationship mappings and chemical lists of great utility • Dashboard and contents are one part of the solution • We are committed to open API development with time
  81. Acknowledgements • THANK YOU for the invitation! • IT Development team – especially Jeff Edwards and Jeremy Dunne • Chris Grulke for the ChemReg system • NERL colleagues – Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton • Emma Schymanski, LCSB, Luxembourg • The NORMAN Network and all contributors
  82. Contact: Antony Williams, US EPA Office of Research and Development, National Center for Computational Toxicology. EMAIL: Williams.Antony@epa.gov ORCID: https://orcid.org/0000-0002-2668-4821