Diversity and Depth: Implementing AI across many long tail domains

Presentation at the IJCAI 2018 Industry Day

Elsevier serves researchers, doctors, and nurses. They have come to expect the same AI-based services that they use in everyday life, e.g. recommendations, answer-driven search, and summarized information, to be available in their work environment. However, providing these sorts of services over the plethora of low-resource domains that characterize science and medicine is a challenging proposition. (For example, most off-the-shelf NLP components are trained on newspaper corpora and perform much worse on scientific text.) Furthermore, the level of precision expected in these domains is quite high. In this talk, we overview our efforts to overcome this challenge through the application of four techniques: 1) unsupervised learning; 2) leveraging highly skilled but low-volume expert annotators; 3) designing annotation tasks for non-experts in expert domains; and 4) transfer learning. We conclude with a series of open issues for the AI community stemming from our experience.


Diversity and Depth: Implementing AI across many long tail domains

  1. 1. Diversity and Depth: Implementing AI across many long tail domains Paul Groth | pgroth.com | @pgroth Elsevier Labs Thanks to Helena Deus, Tony Scerri, Sujit Pal, Corey Harper, Ron Daniel, Brad Allen IJCAI 2018 – Industry Day
  2. 2. Introducing Elsevier. Content: Chemistry database: 500m published experimental facts • User queries: 13m monthly users on ScienceDirect • Books: 35,000 published books • Drug database: 100% of drug information from pharmaceutical companies, updated daily • Research: 16% of the world’s research data and articles published by Elsevier. Technology: 1,000 technologists employed by Elsevier • Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events • Machine reading: 475m facts extracted from ScienceDirect • Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations • Semantic enhancement: knowledge on 50m chemicals captured as 11B facts
  3. 3. June 15, 2018 3 Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research. Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
  4. 4. INFORMATION OVERLOAD
  5. 5. IN PRACTICE Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017). Searching Data: A Review of Observational Data Retrieval Practices. arXiv preprint arXiv:1707.06937. Some observations from @gregory_km’s survey & interviews: • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented. • Participants require details about data collection and handling. • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common. Gregory, K., Cousijn, H., Groth, P., Scharnhorst, A., & Wyatt, S. (2018). Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv preprint arXiv:1801.04971.
  6. 6. PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND NURSES: ANSWERS NEED AI My work is moving towards a new field; what should I know? • Journal articles, reference works, profiles of researchers, funders & institutions • Recommendations of people to connect with, reading lists, topic pages How should I treat my patient given her condition & history? • Journal articles, reference works, medical guidelines, electronic health records • Treatment plan with alternatives personalized for the patient How can I master the subject matter of the course I am taking? • Course syllabus, reference works, course objectives, student history • Quiz plan based on the student’s history and course objectives
  7. 7. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT: MOTIVATING USE CASE • Clinical Key is Elsevier’s flagship medical reference search product • Clinicians prefer “answers” in the form of tables or flowcharts • Eliminates need to page through retrieved content to find actionable information • Clinical Key provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation • Solution: automatically classify images in medical content corpus at index time • Benefits: lower cost and improved user experience 8 “Curated Answers” section displays medical decision graphs
  8. 8. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT: SOLUTION • A perfect fit for a transfer learning approach • Input to the classifier is an image and the output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph • The image dataset is augmented by producing variations of the training images (rotating, flipping, transposing, jittering, etc.) • Reuse all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”) • VGG-CNN was trained on ImageNet (14 million images from the Web, 1,000 general topic classes, e.g. Cat, Airplane, House) • The last layer is a multinomial logistic regression (softmax) classifier • Model trained on 10,167 images with a 70/30 train/test split • Achieves 93% test set accuracy • Evaluated an image + caption text model but did not get a big performance boost • A searchable image base was used to support training set and model development 9
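The transfer-learning recipe above can be sketched compactly with modern tooling. Below is a minimal, illustrative sketch, assuming Keras/TensorFlow and using the ImageNet-pretrained VGG16 bundled with Keras as a stand-in for the VGG-CNN Caffe model named on the slide; the `figures/` directory layout, the added head, and the hyperparameters are assumptions, not the production setup.

```python
# Minimal transfer-learning sketch (assumption: Keras/TensorFlow with VGG16 as a
# stand-in for the VGG-CNN Caffe model; class list is from the slide, directory
# layout and head are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # Photo, Radiological, Data graphic, Illustration, Microscopy,
                 # Flowchart, Electrophoresis, Medical decision graph

# Load an ImageNet-pretrained backbone and drop its classification head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the pretrained convolutional features frozen

# Replace the final dense layers with a small head ending in a softmax
# (multinomial logistic regression) over the 8 figure classes.
model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Data augmentation mirrors the slide: rotations, flips, small shifts ("jitter").
augment = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20, horizontal_flip=True, vertical_flip=True,
    width_shift_range=0.1, height_shift_range=0.1, validation_split=0.3,
    preprocessing_function=tf.keras.applications.vgg16.preprocess_input)

train = augment.flow_from_directory("figures/", target_size=(224, 224),
                                    class_mode="sparse", subset="training")
val = augment.flow_from_directory("figures/", target_size=(224, 224),
                                  class_mode="sparse", subset="validation")
model.fit(train, validation_data=val, epochs=10)
```

Freezing the convolutional layers and retraining only a small softmax head is what lets a modest labelled set (about ten thousand images on the slide) reach useful accuracy.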
  9. 9. H-GRAPH KNOWLEDGE GRAPH • Total concepts: 540,632 • 100+ person-years of clinical expert knowledge
  10. 10. 11 Open Information Extraction • Knowledge bases are populated by scanning text and doing information extraction • Most information extraction systems look for very specific things, like drug-drug interactions • This gives the best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text • For a broad knowledge base, use Open Information Extraction, which relies only on some knowledge of grammar • One weird trick for open information extraction … • ReVerb*: 1. Find “relation phrases” starting with a verb and ending with a verb or preposition 2. Find noun phrases before and after the relation phrase 3. Discard relation phrases not used with multiple combinations of arguments. Example: “In addition, brain scans were performed to exclude other causes of dementia.” * Fader, A., Soderland, S., & Etzioni, O. Identifying Relations for Open Information Extraction. EMNLP 2011.
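To make the ReVerb recipe concrete, here is a minimal, illustrative sketch of its syntactic constraint applied to a pre-tagged version of the example sentence above; the coarse tag set, the hand tagging, and the omission of ReVerb's lexical and corpus-frequency filters (step 3) are all simplifications of the Fader et al. system.

```python
# Minimal, illustrative ReVerb-style extractor over a pre-tagged sentence.
# Assumptions: coarse tags N (noun), V (verb), A (adjective/adverb/determiner),
# P (preposition/particle/infinitive marker); ReVerb's lexical constraint and the
# corpus-level frequency filter (step 3) are omitted.
import re

def extract_triples(tagged):
    """tagged: list of (token, coarse_tag) pairs."""
    pos = "".join(tag for _, tag in tagged)
    triples = []
    # Step 1: a relation phrase is a maximal run of V | V W* P units, matched
    # here as a regex over the tag string; adjacent units are merged.
    for m in re.finditer(r"(?:V[NA]*P|V)+", pos):
        subj = _noun_phrase(tagged, m.start(), direction=-1)   # step 2: NP to the left
        obj = _noun_phrase(tagged, m.end(), direction=+1)      # step 2: NP to the right
        if subj and obj:
            rel = " ".join(tok for tok, _ in tagged[m.start():m.end()])
            triples.append((subj, rel, obj))
    return triples

def _noun_phrase(tagged, idx, direction):
    """Collect the contiguous run of nouns just before (direction=-1) or after (+1) idx."""
    words = []
    i = idx - 1 if direction < 0 else idx
    while 0 <= i < len(tagged) and tagged[i][1] == "N":
        words.append(tagged[i][0])
        i += direction
    return " ".join(reversed(words)) if direction < 0 else " ".join(words)

# The slide's example sentence, hand-tagged (leading "In addition," omitted for brevity).
sentence = [("brain", "N"), ("scans", "N"), ("were", "V"), ("performed", "V"),
            ("to", "P"), ("exclude", "V"), ("other", "A"), ("causes", "N"),
            ("of", "P"), ("dementia", "N")]
print(extract_triples(sentence))
# [('brain scans', 'were performed to exclude other causes of', 'dementia')]
```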
  11. 11. 12 ReVerb output • After ReVerb pulls out noun phrases, match them up to EMMeT concepts • Discard rare concepts, rare relations, or relations that are not used with many different concepts • SD documents scanned: 14,000,000 • Extracted ReVerb triples: 473,350,566
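A minimal sketch of this resolution and pruning step, assuming the taxonomy is available as a plain label-to-identifier map; the `EMMET:...` identifiers, the plural-stripping normalizer, and the frequency threshold are hypothetical stand-ins for the real EMMeT resolver.

```python
# Minimal, illustrative concept-resolution sketch: map extracted noun phrases onto a
# taxonomy by normalized surface-form lookup. (Assumption: the taxonomy is a plain
# dict of labels to concept IDs; the real resolver is far richer.)
from collections import Counter

concept_labels = {                       # hypothetical fragment of a label-to-ID map
    "brain scan": "EMMET:0001", "mri": "EMMET:0001",
    "dementia": "EMMET:0002", "glaucoma": "EMMET:0003",
}

def normalize(phrase):
    # Lowercase and strip a trailing plural "s"; real systems use lemmatization.
    p = phrase.lower().strip()
    return p[:-1] if p.endswith("s") and p[:-1] in concept_labels else p

def resolve(triples):
    """Keep only triples whose arguments both resolve to known concepts."""
    resolved = []
    for subj, rel, obj in triples:
        s, o = concept_labels.get(normalize(subj)), concept_labels.get(normalize(obj))
        if s and o:
            resolved.append((s, rel, o))
    return resolved

def prune_rare(resolved, min_distinct_pairs=2):
    """Discard relations not seen with enough distinct concept pairs.
    (Applied over the full extraction run, not a single sentence.)"""
    pair_counts = Counter()
    for _, rel, _ in set(resolved):
        pair_counts[rel] += 1
    return [t for t in resolved if pair_counts[t[1]] >= min_distinct_pairs]

print(resolve([("brain scans", "were performed to exclude other causes of", "dementia")]))
# [('EMMET:0001', 'were performed to exclude other causes of', 'EMMET:0002')]
```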
  12. 12. 13 Universal schemas - Initialization • Method to combine ‘facts’ found by machine reading with stronger assertions from an ontology. • Build an E×R matrix with entity pairs as rows and relations as columns. • Relation columns can come from EMMeT, or from ReVerb extractions. • Cells contain 1.0 if that pair of entities is connected by that relation.
  13. 13. 14 Universal schemas - Prediction • Factorize the matrix into E×K and K×R factors, then recombine. • “Learns” the correlations between text relations and EMMeT relations, in the context of entity pairs. • Find new triples to go into EMMeT, e.g. (glaucoma, has_alternativeProcedure, biofeedback)
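The initialization and prediction slides can be illustrated end to end on a toy matrix. The sketch below substitutes a plain truncated SVD for the learned factorization model, and the entity pairs, relation names, and 0.5 threshold are illustrative; the point is that recombining the low-rank factors assigns scores to cells that were never observed, surfacing candidates such as the slide's (glaucoma, has_alternativeProcedure, biofeedback).

```python
# Minimal universal-schema sketch: toy E x R matrix, truncated SVD in place of the
# learned factorization model; data, relation names, and threshold are illustrative.
import numpy as np

entity_pairs = [("glaucoma", "biofeedback"), ("hypertension", "biofeedback"),
                ("migraine", "biofeedback"), ("glaucoma", "steroid use")]
relations = ["has_alternativeProcedure",            # structured EMMeT relation
             "is treated with", "may respond to",   # surface-form relations from ReVerb
             "is caused by"]

# E x R matrix: 1.0 where that entity pair was observed with that relation.
X = np.array([[0., 1., 1., 0.],
              [1., 1., 1., 0.],
              [1., 1., 1., 0.],
              [0., 0., 0., 1.]])

# Factorize X ~= E (|pairs| x K) @ R (K x |relations|) with a rank-K truncated SVD.
K = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
E = U[:, :K] * s[:K]      # entity-pair embeddings
R = Vt[:K, :]             # relation embeddings
X_hat = E @ R             # recombined matrix: scores for unobserved cells as well

# Unobserved cells with a high reconstructed score are candidate new triples for EMMeT.
for i, (subj, obj) in enumerate(entity_pairs):
    for j, rel in enumerate(relations):
        if X[i, j] == 0 and X_hat[i, j] > 0.5:
            print(f"candidate: ({subj}, {rel}, {obj})  score={X_hat[i, j]:.2f}")
# -> candidate: (glaucoma, has_alternativeProcedure, biofeedback)  score=0.58
```

In practice universal schema models are trained with a ranking or logistic loss over observed cells rather than an exact SVD that treats unobserved cells as zeros, but the matrix-completion intuition is the same.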
  14. 14. 15 ONTOLOGY MAINTENANCE [Pipeline: Content (14M SD articles) → Open Information Extraction / Triple Extraction (475M triples) → Entity Resolution / Concept Resolution → Matrix Construction (universal schema combining surface-form relations and structured relations from the taxonomy) → Matrix Factorization / Matrix Completion → Predicted relations → Curation → Knowledge graph; other figures from the diagram: 3.3 million relations, 49M relations, ~15k -> 1M entries] Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion.” 5th Workshop on Automated Knowledge Base Construction (AKBC) 2016. Michael Lauruhn and Paul Groth. “Sources of Change for Modern Knowledge Organization Systems.” Knowledge Organization 43, no. 8 (2016). • F-measure of around 0.7 • Good enough to act as a recommender to expert curators
  15. 15. TOPIC PAGES Definition Related terms Relevant ranked snippets
  16. 16. GOOD & BAD DEFINITIONS • Inferior Colliculus. Bad: “By comparing activation obtained in an equivalent standard (non-cardiac-gated) fMRI experiment, Guimaraes and colleagues found that cardiac-gated activation maps yielded much greater activation in subcortical nuclei, such as the inferior colliculus.” Good: “The inferior colliculus (IC) is part of the tectum of the midbrain (mesencephalon) comprising the quadrigeminal plate (Lamina quadrigemina). It is located caudal to the superior colliculus on the dorsal surface of the mesencephalon (Figure 36.7: Overview of the human brainstem; view from dorsal. The superior and inferior colliculi form the quadrigeminal plate. Parts of the cerebellum are removed.). The ventral border is formed by the lateral lemniscus. The inferior colliculus is the largest nucleus of the human auditory system. …” • Purkinje cells. Bad: “It is felt that the aminopyridines are likely to increase the excitability of the potassium channel-rich cerebellar Purkinje cells in the flocculus (Etzion and Grossman, 2001).” Good: “Purkinje cells are the most salient cellular elements of the cerebellar cortex. They are arranged in a single row throughout the entire cerebellar cortex between the molecular (outer) layer and the granular (inner) layer. They are among the largest neurons and have a round perikaryon, classically described as shaped ‘like a chianti bottle,’ with a highly branched dendritic tree shaped like a candelabrum and extending into the molecular layer where they are contacted by incoming systems of afferent fibers from granule neurons and the brainstem…” • Olfactory Bulb. Bad: “The most common sites used for induction of kindling include the amygdala, perforant path, dorsal hippocampus, olfactory bulb, and perirhinal cortex.” Good: “The olfactory bulb is the first relay station of the central olfactory system in the vertebrate brain and contains in its superficial layer a few thousand glomeruli, spherical neuropils with sharp borders (Figure 1: Axonal projection pattern of olfactory sensory neurons to the glomeruli of the rodent olfactory bulb. The olfactory epithelium in rats and mice is divided into four zones (zones 1–4). A given odorant receptor is expressed by sensory neurons located within one zone of the epithelium. Individual olfactory sensory neurons express a single odorant receptor…”
  17. 17. HOW - OVERVIEW Content: books, articles, ontologies, ... Technologies (NLP, ML): • Identification of concepts (disambiguation, domain/sub-domain identification, abbreviations, variants, gazetteering) • Identification and classification of text snippets around concepts • Feature building for concept/snippet pairs (lexical, syntactic, semantic, document structure, …) • Ranking of concept/snippet pairs (machine learning, hand-made rules, similarities, deduplication) • Curation (white-list driven, black list, corrections/improvements) • Evaluation (gold set by domain, random set by domain, by SMEs (Subject Matter Experts)) • Automation (Content Enrichment Framework, taxonomy coverage extension) Output: Knowledge graph (concepts, snippets, metadata, …)
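As a concrete illustration of the "ranking concept/snippet pairs" step, here is a minimal scikit-learn sketch; the hand-picked features and the tiny labelled set are toy stand-ins for the lexical, syntactic, semantic, and document-structure features and the SME-built gold sets described on the slide.

```python
# Minimal, illustrative ranking sketch for topic-page snippets, assuming scikit-learn;
# the features and training examples are toy stand-ins, not the production pipeline.
from sklearn.linear_model import LogisticRegression

def features(concept, snippet):
    s = snippet.lower()
    return [
        1.0 if s.startswith(f"the {concept.lower()} is") else 0.0,  # definition-like opening
        1.0 if f" {concept.lower()} " in f" {s} " else 0.0,          # concept mentioned mid-sentence
        min(len(snippet) / 500.0, 1.0),                              # snippet length, capped
        float(s.count("(")),                                         # parenthetical detail
    ]

# Toy labelled pairs: 1 = good definitional snippet, 0 = incidental mention.
train = [
    ("inferior colliculus", "The inferior colliculus is part of the tectum of the midbrain.", 1),
    ("inferior colliculus", "Cardiac-gated maps showed activation in the inferior colliculus.", 0),
    ("purkinje cells", "Purkinje cells are the most salient elements of the cerebellar cortex.", 1),
    ("purkinje cells", "Aminopyridines may excite cerebellar Purkinje cells in the flocculus.", 0),
]
X = [features(c, s) for c, s, _ in train]
y = [label for _, _, label in train]
ranker = LogisticRegression().fit(X, y)

# Rank candidate snippets for a concept by the model's probability of being a good definition.
candidates = ["The olfactory bulb is the first relay station of the central olfactory system.",
              "Kindling is commonly induced via the amygdala or the olfactory bulb."]
scored = sorted(candidates,
                key=lambda s: ranker.predict_proba([features("olfactory bulb", s)])[0, 1],
                reverse=True)
print(scored[0])  # expected to print the definition-like snippet first
```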
  18. 18. SCIENTIFIC TEXT IS CHALLENGING [Figure: 698 unique relation types; 400 relation types] Open Information Extraction on Scientific Text: An Evaluation. Paul Groth, Mike Lauruhn, Antony Scerri and Ron Daniel, Jr. To appear at COLING 2018
  19. 19. 21 Augenstein, Isabelle, et al. "SemEval 2017 Task 10: ScienceIE-Extracting Keyphrases and Relations from Scientific Publications." Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017. SCIENTIFIC TEXT IS CHALLENGING
  20. 20. June 15, 2018 22 THE CROWD ISN’T AN EXPERT
  21. 21. AMIRSYS
  22. 22. Burger and Beans – weakly supervised/joint embeddings 24 [Diagram: hypersphere of joint embeddings, with an image vector, its correct text vector, and an incorrect text vector] Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers: Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
  23. 23. 25 APPLYING THE STATE OF THE ART [Diagram: image vector / text vector] 1. ResNet152 (not ResNet50 as usual) 2. Had to “pre-warm” with ImageNet – separate model/task 3. From Weldon model (had to be ported to Python from Lua) 4. Had to find the right embeddings (K=620) 5. Had to find a library and stack many SRUs
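A minimal PyTorch sketch of the joint visual-semantic embedding idea from the last two slides, using a ResNet-50 image encoder and a GRU text encoder as stand-ins for the ResNet152 and stacked-SRU encoders described above; the embedding size follows the slide (K=620), while the margin, vocabulary, and random data are purely illustrative.

```python
# Minimal joint visual-semantic embedding sketch (assumption: PyTorch/torchvision;
# ResNet-50 and a GRU stand in for the ResNet152 + SRU encoders on the slide).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

EMBED_DIM = 620  # joint embedding size (K=620 on the slide)

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)  # stand-in for ResNet152
        backbone.fc = nn.Linear(backbone.fc.in_features, EMBED_DIM)
        self.backbone = backbone
    def forward(self, images):
        return F.normalize(self.backbone(images), dim=-1)  # project onto the unit hypersphere

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.GRU(300, EMBED_DIM, batch_first=True)   # stand-in for stacked SRUs
    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))
        return F.normalize(h[-1], dim=-1)

def triplet_ranking_loss(img, txt, margin=0.2):
    """Pull each image toward its own caption and away from the other captions in the batch."""
    sims = img @ txt.t()                      # cosine similarities (unit vectors)
    pos = sims.diag().unsqueeze(1)            # similarities of the matching pairs
    cost = F.relu(margin + sims - pos)        # hinge on non-matching captions within the margin
    mask = 1.0 - torch.eye(sims.size(0))      # ignore the diagonal (the positives themselves)
    return (cost * mask).mean()

# Toy forward/backward pass with random data, just to show the pieces fit together.
imgs = torch.randn(4, 3, 224, 224)
caps = torch.randint(0, 10000, (4, 12))
img_vecs, txt_vecs = ImageEncoder()(imgs), TextEncoder()(caps)
triplet_ranking_loss(img_vecs, txt_vecs).backward()
```

Both encoders L2-normalize their outputs, so image and text vectors live on a shared hypersphere and the hinge loss pulls an image toward its correct text vector and away from incorrect ones, as in the slide's diagram.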
  24. 24. 26 CONCLUSION • Science and medicine are challenging domains for AI: long-tailed, deep knowledge, constantly changing • AI has the potential to change how we do scientific discovery and transition it into practice • At Elsevier we are applying AI to build platforms that support health and science professionals • Of course, we’re hiring :) Paul Groth (@pgroth) p.groth@elsevier.com Elsevier Labs
