TRIANGLE AREA MASS SPECTROMETRY MEETING: Structure Identification Approaches Using the US-EPA CompTox Chemicals Dashboard to Support Mass Spectrometry Analyses


This presentation was given at a Triangle Area Mass Spectrometry meeting on 01/29/2020 in Research Triangle Park, North Carolina. It provides a general overview of the CompTox Chemicals Dashboard for an audience of mass spectrometrists and others interested in the dashboard's capabilities for chemical forensics, structure identification, and related applications.

Published in: Science

  1. Structure Identification Approaches Using the US-EPA CompTox Chemicals Dashboard to Support Mass Spectrometry Analyses. Triangle Area Mass Spectrometry, RTP, January 2020. Antony Williams, Center for Computational Toxicology and Exposure, US-EPA, RTP, NC …and an enormous cast of characters. http://www.orcid.org/0000-0002-2668-4821 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.
  2. Outline • Quick overview of the dashboard • Specific data of interest to this audience (it’s not just Computational Toxicology) • Support for Mass Spectrometry • Data quality in the public domain • Work in progress – prototypes • A request for help
  3. CompTox Chemicals Dashboard https://comptox.epa.gov/dashboard • SEARCH • TOX DATA • BIOACTIVITY • SIMILARITY • READ-ACROSS • PUBMED • BATCH SEARCH
  4. BASIC Search
  5. Detailed Chemical Pages
  6. Properties, Fate and Transport
  7. Properties, Fate and Transport, e.g. Solubility
  8. Properties, Fate and Transport, e.g. logP
  9. Sources of Exposure to Chemicals
  10. Identifiers to Support Searches
  11. Link Access
  12. Mass Spec Links
  13. NIST WebBook https://webbook.nist.gov/chemistry/
  14. MassBank of North America https://mona.fiehnlab.ucdavis.edu
  15. Batch Searching
  16. Aggregate data for a list of chemicals
  17. Batch Search Names • Excel Download
  18. Add Other Data of Interest
  19. Chemical Lists of Interest…
  20. 225 Chemical Lists (and growing)
  21. “Volatilome” Human Breath
  22. “Volatilome” Saliva
  23. PFAS lists of Chemicals
  24. Building a “reference” PFAS list • PFAS structure list (PFASSTRUCT) is expanded from public databases, EPA agency lists and literature • Approaching 7000 structures – 98.8% have associated CAS Numbers • Compare with PubChem: 220,720 structures
  25. Formula Search can find isomers
  26. Active expansion of the PFAS list: from 2 to 8 variants of PFOS
  27. Disinfection By-Products
  28. Mycotoxins • Two lists: 328 and 88 members
  29. Vomitoxin
  30. BIG databases are GREAT! [Bar chart: number of chemical substances, log scale from 10^4 to 10^9, for PubChem, CAS Registry, ChemSpider, EPA DSSTox and Blood Exposome] • Thanks to all of the public database efforts • So much benefit from what’s been done • There are hundreds of them at this point…
  31. Vomitoxin – ChemSpider • 19 “Vomitoxins” – 3 isotopically labeled
  32. Vomitoxin – PubChem • 33 unique InChI Keys
  33. PubChem – “virtual chemistry” • Other databases grow quickly…a lot of “virtual chemistry” and “make on demand” compounds. Vomitoxin has 7 ZINC stereoforms. • The Dashboard database grows slowly (next release is +20k chemicals in 6 months)
  34. ChemSpider – lots of virtuals??? • 52 million chemicals from one vendor
  35. Taxol: 79 Results
  36. Data Quality is important • Data quality in free web-based databases!
  37. We’re still cleaning data too
  38. Tire Crumb Rubber (298)
  39. Terpenes in Vape (37)
  40. Hydraulic Fracturing (1640)
  41. Opioids and Metabolites (160)
  42. “MS-ready” structures
  43. Overview of MS-Ready Structures • All structure-based chemical substances are algorithmically processed to – Split multicomponent chemicals into individual structures – Desalt and neutralize individual structures – Remove stereochemical bonds from all chemicals • MS-Ready structures are then mapped to original substances to provide a path from chemicals detected by mass spectrometry to the original substances
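The processing steps listed on slide 43 can be illustrated with a toy sketch that works directly on SMILES strings. This is not the Dashboard's actual workflow, which relies on real cheminformatics tooling; the function name, the tiny counter-ion list and the string-level stereo stripping are all assumptions made for illustration, and neutralization of charged structures is not attempted.

```python
import re

# Hypothetical, deliberately tiny list of counter-ions and solvents to drop.
COUNTER_IONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "O"}

# Stereo markers in SMILES: '@' (tetrahedral) and '/' or '\' (double bond).
STEREO_CHARS = re.compile(r"[@/\\]")


def ms_ready_components(smiles: str) -> list[str]:
    """Return de-duplicated, stereo-free components of a SMILES string."""
    components: list[str] = []
    for part in smiles.split("."):         # 1. split multicomponent records
        if part in COUNTER_IONS:            # 2. crude desalting (toy list only)
            continue
        flat = STEREO_CHARS.sub("", part)   # 3. drop stereo annotations
        if flat and flat not in components:
            components.append(flat)
    return components


# L-alanine hydrochloride and D-alanine collapse to the same flat structure.
l_form = ms_ready_components("C[C@H](N)C(=O)O.Cl")
d_form = ms_ready_components("C[C@@H](N)C(=O)O")
print(l_form, d_form, l_form == d_form)
```

Because both enantiomers and the salt map to one flat structure, a single detected mass or formula can be linked back to every original substance that shares that structure, which is the mapping the slide describes.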
  44. [No text on slide]
  45. MS-Ready Mappings from Details Page
  46. MS-Ready Mappings: set of 20 substances for “PFOS”
  47. Mass and Formula Searching
  48. Advanced Searches: Mass Search
  49. Advanced Searches: Mass Search
  50. MS-Ready Structures for Formula Search
  51. MS-Ready Mappings • EXACT Formula: C10H16N2O8: 3 Hits
  52. MS-Ready Mappings • Same Input Formula: C10H16N2O8 • MS-Ready Formula Search: 125 Chemicals
  53. MS-Ready Mappings • 125 chemicals returned in total – 8 of the 125 are single component chemicals – 3 of the 8 are isotope-labeled – 3 are neutral compounds and 2 are charged • Multiple components, stereo, isotopes and charge all collapsed and mapped through MS-Ready
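Formula and mass searches are two views of the same query, since a molecular formula fixes a monoisotopic mass. The sketch below shows that conversion for the formulae quoted in the slides. The helper names are hypothetical, only a handful of elements are covered, and the parser handles simple formulae without brackets, charges or isotope labels.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "F": 18.998403163,
    "S": 31.9720711744,
}


def parse_formula(formula: str) -> dict[str, int]:
    """Parse a simple formula such as 'C10H16N2O8' into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum the monoisotopic masses of all atoms in the formula."""
    total = 0.0
    for element, count in parse_formula(formula).items():
        if element not in MONOISOTOPIC:
            raise ValueError(f"no mass tabulated for element {element!r}")
        total += MONOISOTOPIC[element] * count
    return total


print(round(monoisotopic_mass("C10H16N2O8"), 4))
print(round(monoisotopic_mass("C14H22N2O3"), 5))
```

The second formula is the one shown on the "known unknowns" slide, where it is paired with the mass 266.16304.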
  54. Batch Searching: mass and formula
  55. Batch Searching • Singleton searches are useful but we work with thousands of masses and formulae! • Typical questions – What is the list of chemicals for the formula CxHyOz? – What is the list of chemicals for a mass +/- error? – Can I get chemical lists in Excel files? In SDF files? – Can I include properties in the download file?
  56. Batch Searching Formula/Mass
  57. Searching batches using MS-Ready Formula (or mass) searching
  58. Batch Search in specific lists
  59. Benefits of bringing it all together • The true dashboard benefit is integration • Rank potential candidates for toxicity using available data – hazard, exposure, in vitro
  60. Candidate ranking using metadata
  61. Data Source Ranking of “known unknowns” • A mass and/or formula search is for an unknown chemical, but it is a known chemical contained within a reference database • The most likely candidate chemicals have the most associated data sources, the most associated literature articles, or both [Diagram: formula C14H22N2O3, mass 266.16304 → chemical reference database → sorted candidate structures]
  62. Data Streams for Ranking • CompTox Dashboard Data Sources • PubChem Data Source Count • PubMed Reference Count • ToxCast in vitro bioactivity • Presence in CPDat database • OPERA PhysChem Properties • Other possibilities – predicted media occurrence, frequency of InChIs online
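The ranking idea on slides 61 and 62 amounts to sorting the candidates returned by a mass or formula search by how much metadata is attached to each. A minimal sketch follows, assuming only two of the listed data streams and a simple lexicographic sort; the field names, the tie-breaking order and the example counts are invented for illustration and are not the Dashboard's scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    data_sources: int   # e.g. count of Dashboard data sources (hypothetical)
    pubmed_refs: int    # e.g. count of PubMed references (hypothetical)


def rank_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Most data sources first; PubMed count, then name, break ties."""
    return sorted(
        candidates,
        key=lambda c: (-c.data_sources, -c.pubmed_refs, c.name),
    )


# Made-up candidates sharing one formula; the counts are not real data.
hits = [
    Candidate("candidate A", data_sources=3, pubmed_refs=0),
    Candidate("candidate B", data_sources=41, pubmed_refs=120),
    Candidate("candidate C", data_sources=41, pubmed_refs=870),
]
for rank, hit in enumerate(rank_candidates(hits), start=1):
    print(rank, hit.name)
```

Other streams from the slide, such as ToxCast bioactivity or presence in CPDat, could be added as further fields in the sort key.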
  63. Search 228.115 +/- 5.0 ppm: 234 single component chemicals
  64. Search 228.115 +/- 5.0 ppm: 234 single component chemicals
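A search such as "228.115 +/- 5.0 ppm" translates into an absolute mass window that scales with the query mass. The sketch below shows the arithmetic; the function names and the candidate masses are illustrative assumptions, not Dashboard output.

```python
def ppm_window(mass: float, ppm: float) -> tuple[float, float]:
    """Return the (low, high) mass bounds for a query mass and ppm tolerance."""
    delta = mass * ppm / 1_000_000   # ppm is parts per million of the query mass
    return mass - delta, mass + delta


def within_ppm(query: float, candidate: float, ppm: float) -> bool:
    """True if the candidate mass falls inside the ppm window around the query."""
    low, high = ppm_window(query, ppm)
    return low <= candidate <= high


low, high = ppm_window(228.115, 5.0)
print(f"window: {low:.5f} to {high:.5f} Da")

# Made-up candidate masses to show the filter in use.
candidates = [228.1140, 228.1155, 228.1170]
print([m for m in candidates if within_ppm(228.115, m, 5.0)])
```

At this mass, 5 ppm corresponds to roughly ±0.0011 Da, which is why so many single component chemicals can fall inside one window and why metadata ranking is needed afterwards.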
  65. The original ChemSpider work
  66. Is a bigger database better? • ChemSpider was 26 million chemicals for the original work • Much BIGGER today • Is bigger better?? • Are there other metadata to use for ranking?
  67. Comparing Search Performance • When dashboard contained 720k chemicals • Only 3% of ChemSpider size • What was the comparison in performance?
  68. SAME dataset for comparison
  69. How did performance compare? For the same 162 chemicals, Dashboard outperforms ChemSpider for both Mass and Formula Ranking
  70. Identification ranks for 1783 chemicals using multiple data streams. Legend: DS: Data Sources; PC: PubChem; PM: PubMed; STOFF: DB; KEMI: DB. Data Sources alone rank ~75% of the chemicals as Top Hit
  71. “UVCB” Chemicals
  72. UVCB Chemicals
  73. UVCBs challenge in non-target analysis. Homologue screening plots from Swiss Wastewater (Schymanski et al 2014, left) and Novi Sad (right) • Complex mixtures (UVCBs) are a huge and very challenging part of the unknowns in many environmental samples
  74. Public TSCA Inventory on Dashboard: 31,460 Chemicals (1/24/2020)
  75. Many Chemicals are “Complex”: >14000 chemicals are UVCBs
  76. “Markush Structures” https://en.wikipedia.org/wiki/Markush_structure
  77. How to represent complexity?
  78. In the Dashboard: Abstract Sifter
  79. Literature Searching
  80. Literature Searching
  81. Abstract Sifter for Excel
  82. Work in Progress
  83. List Registration Activities • Registering and curating numerous lists – NIST library of chemicals – clean up especially around stereochemical representation – United States Geological Survey chemicals in water – Scientific Working Group for the Analysis of Seized Drugs – Synthetic Cannabinoids – Blood Exposome Database
  84. Blood Exposome Curation • Blood exposome data collection from Barupal and Fiehn. Great work, and we are reviewing it. • Aggregating large datasets is CHALLENGING • Comparing with our “Abstract Sifter” approach • We will iterate into a dashboard form.
  85. Prototype Work in Progress • CFM-ID – Viewing and downloading pre-predicted spectra – Search spectra against the database • Structure/substructure/similarity search • Access to API and web services
  86. Predicted Mass Spectra http://cfmid.wishartlab.com/ • MS/MS spectra prediction for ESI+, ESI-, and EI • Predictions generated and stored for >800,000 structures, to be accessible via Dashboard
  87. Search Expt. vs. Predicted Spectra
  88. Search Expt. vs. Predicted Spectra
  89. Spectral Viewer Comparison
  90. Predicted Data Already Public: Publication and Data Files https://epa.figshare.com/articles/CFM-ID_Paper_Data/7776212/1
  91. Published: Chao et al
  92. Prototype Development
  93. CASMI 2012-2017 revisited • Application of metadata candidate ranking and CFM-ID to all five years of CASMI data
  94. Method Amenability Prediction (Charlie Lowe). Why? • Chromatography-mass spectrometry can be LC or GC • Which phase is more appropriate for which chemicals?
  95. Ongoing Work • Data sources to date • MassBank of North America: 9,275 chemicals for non-derivatized GC; 846 for derivatized GC; 816 for APCI+; 454 for APCI-; 4,907 for ESI+; 3,430 for ESI- • EPA Non-targeted Analysis Collaborative Trial (ENTACT): 886 chemicals for non-derivatized GC; 44 for derivatized GC; 774 for APCI+; 431 for APCI-; 1,113 for ESI+; 648 for ESI-
  96. TMAP Visualization of MoNA GC Data
  97. Future Work: Add database of Collision Cross Section Prediction
  98. API services and Open Data • Web Services https://actorws.epa.gov/actorws/ • Data sets also available for download
  99. Web Services https://actorws.epa.gov/actorws/ • Data in UI, JSON and XML format • Our services are free, of course
  100. InChIKey to DTXCIDs https://actorws.epa.gov/actorws/dsstox/v02/msready?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
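The InChIKey lookup on slide 100 is a plain GET request with a single `identifier` query parameter. The sketch below only builds that URL from an InChIKey, using the endpoint exactly as it appears on the slide; it makes no network call and assumes nothing about the response format or about whether the service is still available. The function name and the format check are illustrative.

```python
import re
from urllib.parse import urlencode

# Endpoint as written on the slide; its current availability is not assumed.
MSREADY_ENDPOINT = "https://actorws.epa.gov/actorws/dsstox/v02/msready"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def msready_url(inchikey: str) -> str:
    """Build the MS-Ready lookup URL for one InChIKey."""
    if not INCHIKEY_PATTERN.fullmatch(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return f"{MSREADY_ENDPOINT}?{urlencode({'identifier': inchikey})}"


print(msready_url("UVOFGKIRTCCNKG-UHFFFAOYSA-N"))
```

Validating the key before building the URL catches truncated or lower-cased identifiers early, which matters when thousands of keys are submitted in a batch.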
  101. Data and Services used by the Community
  102. NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236
  103. Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
  104. MassBank mapping to Dashboard, based on Web Service lookup
  105. Conclusion • Dashboard access to data for ~875,000 chemicals (~895k in the Spring Release) • MS-Ready data facilitates structure identification • Related metadata facilitates candidate ranking • Relationship mappings and chemical lists of great utility • Curation and mutual sharing of chemical lists is important (e.g. NORMAN)
  106. Acknowledgements • ILS: Kamel Mansouri • EPA ORD: Ann Richard, Chris Grulke, Jeremy Dunne, Jeff Edwards, Grace Patlewicz, Alex Chao, Kristin Isaacs, Charles Lowe, James McCord, Seth Newton, Katherine Phillips, Jon Sobus, Mark Strynar, Elin Ulrich, Joach Pleil • GDIT: Ilya Balabin, Tom Transue, Tommy Cathey • Teams: IT Development Team, Curation Team • Collaborators: Emma Schymanski, NORMAN Network, Andrew McEachran
  107. MANY presentations online https://tinyurl.com/w5hqs55
  108. Contact: Antony Williams, CCTE, US EPA Office of Research and Development, Williams.Antony@epa.gov ORCID: https://orcid.org/0000-0002-2668-4821 https://doi.org/10.1186/s13321-017-0247-6
