Towards Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communications



Despite improved digital access to scientific publications in recent decades, the fundamental principles of scholarly communication remain unchanged and continue to be largely document-based. The document-oriented workflows in science have reached the limits of adequacy, as highlighted by recent discussions on the increasing proliferation of scientific literature, the deficiency of peer review and the reproducibility crisis. We need to represent, analyse, augment and exploit scholarly communication in a knowledge-based way by expressing and linking scientific contributions and related artefacts through semantically rich, interlinked knowledge graphs. This should be based on deep semantic representation of scientific contributions, their manual, crowd-sourced and automatic augmentation, and finally the intuitive exploration and interaction employing question answering on the resulting scientific knowledge base. We need to synergistically combine automated extraction and augmentation techniques with large-scale collaboration to reach an unprecedented level of knowledge graph breadth and depth. As a result, knowledge-based information flows can facilitate completely new ways of search and exploration. The efficiency and effectiveness of scholarly communication will increase significantly, since ambiguities are reduced, reproducibility is facilitated, redundancy is avoided, provenance and contributions can be better traced, and the interconnections of research contributions are made more explicit and transparent. In this talk we will present first steps in this direction in the context of our Open Research Knowledge Graph initiative and the ScienceGRAPH project.

Published in: Science


  1. 1. Prof. Dr. Sören Auer Faculty of Electrical Engineering & Computer Science Leibniz University of Hannover TIB Technische Informationsbibliothek Towards Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communications
  2. 2. Page 2 Zuse Z3: the beginning of computing, close to the hardware. Photo: Konrad Zuse Internet Archiv/Deutsches Museum/DFG
  3. 3. Page 3
  4. 4. Page 4 We can make things more intuitive. Picture: The illustrated recipes of Lucy Eldridge
  5. 5. Computing more intuitive: procedural programming
  6. 6. Page 6 Sören Auer
  7. 7. Computing more intuitive: OO programming
  8. 8. Page 8 Sören Auer
  9. 9. Page 9 Computing even more intuitive: with cognitive data?! Sören Auer
  10. 10. Page 10 Linked Data Principles: addressing the neglected third V (Variety)
      1. Use URIs to identify the “things” in your data.
      2. Use http:// URIs so people (and machines) can look them up on the web.
      3. When a URI is looked up, return a description of the thing in the W3C Resource Description Framework (RDF).
      4. Include links to related things.
  11. 11. Page 11 RDF & Linked Data in a Nutshell
      1. Graph-based RDF data model consisting of S-P-O (subject, predicate, object) statements (facts).
      2. Serialised as RDF triples:
         Et-Inf conf:organizes Antrittsvorlesung2019 .
         Antrittsvorlesung2019 conf:starts "2019-05-20"^^xsd:date .
         Antrittsvorlesung2019 conf:takesPlaceAt dbpedia:Hannover .
      3. Publication under a URL on the Web, intranet or extranet.
      Diagram (subject, predicate, object): Et-Inf -conf:organizes-> Antrittsvorlesung2019; Antrittsvorlesung2019 -conf:starts-> 20.05.2019; Antrittsvorlesung2019 -conf:takesPlaceIn-> dbpedia:Hannover
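The S-P-O statements on this slide can be sketched in plain Python as tuples. This is only an illustration, not how an RDF store works internally: the helper `match` is hypothetical, prefixed names stand in for full URIs, and the start date follows the slide's diagram (20.05.2019).

```python
from datetime import date

# Each fact is a (subject, predicate, object) tuple.
# Prefixed names such as "conf:organizes" stand in for full URIs.
TRIPLES = [
    ("Et-Inf", "conf:organizes", "Antrittsvorlesung2019"),
    ("Antrittsvorlesung2019", "conf:starts", date(2019, 5, 20)),
    ("Antrittsvorlesung2019", "conf:takesPlaceAt", "dbpedia:Hannover"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that agree with the pattern; None is a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything stated about the lecture (it is the subject of two triples).
about_lecture = match(TRIPLES, s="Antrittsvorlesung2019")

# Who organizes the lecture?
organizers = [s for s, _, _ in match(TRIPLES, p="conf:organizes", o="Antrittsvorlesung2019")]
```

Because every fact has the same three-part shape, new statements from other sources can be appended to the same list without changing a schema, which is the property the Linked Data principles rely on.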
  12. 12. Page 12 Creating Knowledge Graphs with RDF Linked Data. Example graph around two nodes, one of them still unidentified (??):
      • label: DHL Post Tower; height: 162.5 m; located in: Bonn
      • full name: DHL International GmbH; headquarters: DHL Post Tower; industry: Logistics
      • labels of the industry node: Logistics, Logistik, 物流
  13. 13. Page 13 Knowledge Graphs – A definition
      • Fabric of concept, class, property, relationships, entity descriptions
      • Uses a knowledge representation formalism (RDF, OWL)
      • Holistic knowledge (multi-domain, source, granularity):
        • instance data (ground truth)
        • open (e.g. DBpedia, WikiData), private (e.g. supply chain data), closed data (product models)
        • derived, aggregated data
        • schema data (vocabularies, ontologies)
        • meta-data (e.g. provenance, versioning, documentation, licensing)
        • comprehensive taxonomies to categorize entities
        • links between internal and external data
        • mappings to data stored in other systems and databases
  14. 14. Page 14
  15. 15. WDAqua project vision
      • Answer natural language questions
      • Exploit knowledge encoded in the Web of Data
      • Provide QA services to citizens, communities, and industry
      Diagram labels: Q, A, Web of Data
  16. 16. Who is the director of Clockwork Orange? 16
  17. 17. Who is the director of Clockwork Orange? 17 Understand a spoken question
  18. 18. Who is the director of Clockwork Orange? 18 Understand a spoken question Analyse question
  19. 19. Who is the director of Clockwork Orange? 19 Understand a spoken question Analyse question Find data to answer the question
  20. 20. Who is the director of Clockwork Orange? 20 Understand a spoken question Analyse question Find data to answer the question Present the answer
  21. 21. Who is the director of Clockwork Orange? 21 Understand a spoken question Analyse question Find data to answer the question Present the answer Data source:
  22. 22. 22 Which publications and health reports are related to Alzheimer in Greece? Understand a spoken question Analyse question Find data to answer the question Present the answer
  23. 23. 23 Which publications and health reports are related to Alzheimer in Greece? Understand a spoken question Analyse question Find data to answer the question Present the answer Data sources :
  24. 24. WDAqua QA architecture. Components shown: Data management layer, Data layer, Query decomposition, Data source selection, Query execution, Benchmarking, Profiling, Data quality, Data generation, QA pipeline configurator, Service repository, Monitoring, RESTful API, Versioning, Message dispatcher, Voice to text, NL to SPARQL, Disambiguator, Rel. extraction, UI, Answer generation
  25. 25. 25 Who is the director of Clockwork Orange? Understand a spoken question Analyse question Find data to answer the question Present the answer Demo:
  26. 26. Page 26 How did information flows change in the digital era?
  27. 27. Page 27 Computer Source:
  28. 28. Page 28 Road Maps
  29. 29. Page 29 Phone Books
  30. 30. Page 30 How does it work today?
  31. 31. Page 31, 04.2019
  32. 32. Page 32, 04.2019
  33. 33. Page 33,+Georgia,+USA/@33.756009,-84.4151149,13.5z/data=!4m5!3m4!1s0x88f5045d6993098d:0x66fede2f990b630b!8m2!3d33.7489954!4d-84.3879824, 04.2019
  34. 34. Page 34 The World of Publishing & Communication has profoundly changed
      • New means adapted to the new possibilities were developed, e.g. "zooming", dynamics
      • Business models changed completely
      • More focus on data, interlinking of data / services and search in the data
      • Integration and crowdsourcing play an important role
  35. 35. Page 35 What about Scholarly Communication?
  36. 36. Page 36 One of the earliest research journals: Philosophical Transactions of the Royal Society Scientific publishing in the 17th century © CC BY Henry Oldenburg
  37. 37. Page 37 Scholarly communication in 1865 Source:
  38. 38. Page 38 Publishing in 1970s Source:
  39. 39. Page 39 Scientific publishing today. WE HAVE (pictured) BUT it:
      • Is mainly based on PDF
      • Is only partially machine-readable
      • Does not preserve structure
      • Does not allow embedding of semantics
      • Does not facilitate interactivity / dynamicity / repurposing
      • …
  40. 40. Page 40 Scholarly Communication has not changed (much): 17th century, 19th century, 20th century, 21st century. Meanwhile other information-intensive domains were completely disrupted: mail order catalogs, street maps, phone books, …
  41. 41. Page 41 Challenges we are facing: We need to rethink how research is represented and communicated
      • Digitalisation of science: data integration and analysis; digital collaboration
      • Monopolisation by commercial actors: publisher lock-in effects; maximization of profits [1]
      • Reproducibility crisis: the majority of experiments are hard or impossible to reproduce [2]
      • Proliferation of publications: publication output doubled within a decade and continues to rise [3]
      • Deficiency of peer review: deteriorating quality [4]; predatory publishing
      [1], [2] M. Baker: 1,500 scientists lift the lid on reproducibility, Nature, 2016. [3] Science and Engineering Publication Output Trends, National Science Foundation, 2018. [4] J. Couzin-Frankel: Secretive and Subjective, Peer Review Proves Resistant to Study. Science, 2013.
  42. 42. Page 42 Science and engineering articles by region, country: 2004 and 2014 Proliferation of scientific literature Source: National Science Foundation: Science and Engineering Publication Output Trends:
  43. 43. Page 43 Reproducibility Crisis. 1,500 scientists lift the lid on reproducibility, Monya Baker in Nature, 2016. 533 (7604): 452–454. doi:10.1038/533452a:
      • 70% failed to reproduce at least one other scientist's experiment
      • 50% failed to reproduce one of their own experiments
      Failure to reproduce results by discipline (own results in brackets): chemistry 87% (64%); biology 77% (60%); physics and engineering 69% (51%); Earth sciences 64% (41%)
      Source: © Stanford Medicine - Stanford University
  44. 44. Page 44 Duplication and Inefficiency. How can we avoid duplication if the terminology, research problems, approaches, methods, characteristics, evaluations, … are not properly defined and identified? How would you build an engine or a building without properly defining its parts, relationships, materials, characteristics …?
  45. 45. Page 45 Root Cause – Deficiency of Scholarly Communication? Lack of…
      • Transparency: information is hidden in text
      • Integrability: fitting different research results together
      • Machine assistance: unstructured content is hard to process
      • Identifiability: of concepts beyond metadata
      • Collaboration: one brain barrier
      • Overview: scientists look for the needle in the haystack
  46. 46. Page 46 Search for CRISPR: > 9,000 results (source accessed 04.2019)
  47. 47. Page 47 Search for CRISPR: > 163,000 results (source accessed 04.2019). How good is CRISPR (with respect to precision, safety, cost)? What are the specifics of genome editing in insects? Who has applied it to butterflies?
  48. 48. Page 48 How can we fix it?
  49. 49. Page 49 Realizing Vannevar Bush's vision of Memex
  50. 50. Page 50 Concepts
      • Overarching concepts: research problems, definitions, research approaches, methods
      • Artefacts: publications, data, software, image/audio/video, knowledge graphs / ontologies
      • Domain-specific concepts:
        • Mathematics: definitions, theorems, proofs, methods, …
        • Physics: experiments, data, models, …
        • Chemistry: substances, structures, reactions, …
        • Computer Science: concepts, implementations, evaluations, …
        • Technology: standards, processes, elements, units, sensor data
        • Architecture: regulations, elements, models, …
  51. 51. Page 51 Chemistry Example: CRISPR Genome Editing Source:
  52. 52. Page 52 Chemistry Example: Populating the Graph
      1. Original publication
      2. Adaptive graph curation & completion: Author: Robert Reed; Research problem: Genome editing in Lepidoptera; Methods: CRISPR / cas9; Applied on: Lepidoptera; Experimental data
      3. Graph representation: nodes "CRISPR / cas9 editing in Lepidoptera", "Robert Reed", "Genome editing in Lepidoptera", "Experimental Data", "CRISPR/cas9", "Genome editing"; edge labels: addresses, isEvaluatedWith
  53. 53. Page 53 Cognitive Knowledge Graphs for scholarly knowledge
      KGs are proven to capture factual knowledge [1]
      Research challenge: manage
      • Uncertainty & disagreement
      • Varying semantic granularity
      • Emergence, evolution & provenance
      • Integrating existing domain models
      But maintain flexibility and simplicity
      ScienceGRAPH approach: Cognitive Knowledge Graphs
      • Fabric of knowledge molecules: compact, relatively simple, structured units of knowledge
      • Can be incrementally enriched, annotated, interlinked …
      [1] S. Auer et al.: DBpedia: A nucleus for a web of open data. 6th Int. Semantic Web Conf. (ISWC), 10-year best paper award. Cf. also knowledge graphs from WikiData, BBC, Google, Bing, Thomson Reuters, AirBnB, BNY Mellon …
  54. 54. Page 54 From Factual Knowledge Graphs. Factual (today): base entities: real world; granularity: atomic entities; evolution: addition/deletion of facts; collaboration: fact enrichment
  55. 55. Page 55 From Factual to Cognitive Knowledge Graphs

      | | Factual (today) | Cognitive (ScienceGRAPH) |
      | --- | --- | --- |
      | Base entities | Real world | Conceptual |
      | Granularity | Atomic entities | Interlinked descriptions (molecules) with annotations (provenance) |
      | Evolution | Addition/deletion of facts | Concept drift, varying aggregation levels |
      | Collaboration | Fact enrichment | Emergent semantics |
  56. 56. Page 56 Exploration and Question Answering
      Research challenge:
      • Intuitive exploration leveraging the rich semantic representations
      • Answer natural language questions
      ScienceGRAPH approach:
      • KG-based QA component integration for dynamic and automated composition of QA pipelines for cognitive knowledge graphs (e.g. following [1])
      • Round-trip refinement and integration of search, faceted exploration, question answering and conversational interfaces
      Pipeline: Question parsing, Named Entity Recognition (NER) & Linking (NEL), Relation extraction, Query construction, Query execution, Result rendering
      Q: How do different genome editing techniques compare?
      SELECT ?approach ?feature WHERE { ?approach :addresses :GenomeEditing . ?approach :hasFeature ?feature }
      [1] K. Singh, S. Auer et al.: Why Reinvent the Wheel? Let's Build Question Answering Systems Together. The Web Conference (WWW 2018).
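The pipeline stages named on this slide can be sketched end to end in a few lines of Python. Everything here is a toy under stated assumptions: the graph contents, the `LEXICON` lookup (standing in for NER and NEL), and all function names are hypothetical, and the query is evaluated by a hand-written pattern join instead of a SPARQL engine.

```python
# Toy knowledge graph; identifiers and feature strings are illustrative only.
GRAPH = [
    ("CRISPR/cas9", "addresses", "GenomeEditing"),
    ("ZFN", "addresses", "GenomeEditing"),
    ("CRISPR/cas9", "hasFeature", "easy to engineer"),
    ("ZFN", "hasFeature", "costly screening"),
    ("OtherApproach", "addresses", "OtherProblem"),
    ("OtherApproach", "hasFeature", "unrelated"),
]

# Hypothetical lexicon: surface form -> graph entity (stand-in for NER + NEL).
LEXICON = {"genome editing": "GenomeEditing"}


def link_entities(question):
    """Find known surface forms in the question and map them to entities."""
    text = question.lower()
    return [entity for phrase, entity in LEXICON.items() if phrase in text]


def construct_query(topic):
    """Build the two triple patterns of the slide's query; '?x' marks a variable."""
    return [("?approach", "addresses", topic), ("?approach", "hasFeature", "?feature")]


def extend(pattern, triple, binding):
    """Extend a variable binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def execute(patterns, graph):
    """Join the patterns left to right, keeping only consistent bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = extend(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


def answer(question, graph=GRAPH):
    """Run the stages in order: linking, query construction, execution, rendering."""
    entities = link_entities(question)
    if not entities:
        return []
    rows = execute(construct_query(entities[0]), graph)
    return sorted((row["?approach"], row["?feature"]) for row in rows)


result = answer("How do different genome editing techniques compare?")
```

Keeping each stage as a separate function mirrors the component-based composition the slide describes: any single stage could be swapped for a stronger implementation without touching the others.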
  57. 57. Page 57 Result: Automatic Generation of Comparisons / Surveys. Q: How do different genome editing techniques compare?

      | Engineered nucleases | Site-specificity | Safety | Ease of use / costs / speed |
      | --- | --- | --- | --- |
      | Zinc finger nucleases (ZFN) | ++ (9-18 nt) | + | -- ($$$: screening, testing to define efficiency) |
      | Transcription activator-like effector nucleases (TALENs) | +++ (9-16 nt) | ++ | ++ (easy to engineer; 1 week / a few hundred dollars) |
      | Engineered meganucleases | +++ (12-40 nt) | 0 | -- ($$$: protein engineering, high-throughput screening) |
      | CRISPR system/cas9 | ++ (5-12 nt) | - | +++ (easy to engineer; a few days / less than 200 dollars) |
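A comparison like the one on this slide can be generated by pivoting per-approach statements into rows and columns. The sketch below is an assumption about how that step might look, not the ORKG implementation; the ratings are transcribed from the slide (the column assignment follows a reading of the flattened slide text), and the function names are hypothetical.

```python
from collections import defaultdict

# (approach, property, rating) statements transcribed from the slide.
STATEMENTS = [
    ("ZFN", "site-specificity", "++"),
    ("ZFN", "safety", "+"),
    ("ZFN", "ease-of-use", "--"),
    ("TALENs", "site-specificity", "+++"),
    ("TALENs", "safety", "++"),
    ("TALENs", "ease-of-use", "++"),
    ("Meganucleases", "site-specificity", "+++"),
    ("Meganucleases", "safety", "0"),
    ("Meganucleases", "ease-of-use", "--"),
    ("CRISPR/cas9", "site-specificity", "++"),
    ("CRISPR/cas9", "safety", "-"),
    ("CRISPR/cas9", "ease-of-use", "+++"),
]


def build_comparison(statements):
    """Pivot statements into {approach: {property: rating}} plus column order."""
    table = defaultdict(dict)
    properties = []
    for approach, prop, rating in statements:
        table[approach][prop] = rating
        if prop not in properties:
            properties.append(prop)
    return dict(table), properties


def render(table, properties):
    """Render the pivoted table as aligned plain text, one approach per line."""
    header = ["approach"] + properties
    rows = [[name] + [cells.get(p, "") for p in properties] for name, cells in table.items()]
    widths = [max(len(row[i]) for row in [header] + rows) for i in range(len(header))]
    lines = [" | ".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in [header] + rows]
    return "\n".join(lines)


table, properties = build_comparison(STATEMENTS)
text = render(table, properties)
```

Missing ratings simply render as empty cells, so contributions described at different levels of detail can still be placed side by side.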
  58. 58. Page 58
  59. 59. Page 59
  60. 60. Page 60
  61. 61. Page 61
  62. 62. Page 62
  63. 63. Page 63
  64. 64. Page 64 Facilitating Comparisons of Research Contributions
  65. 65. Page 65 High-level Data Model: RDF + Metadata. Diagram elements: Statement, Predicate, Resource, Resource, Literal, each with its own metadata:
      • Resource: resource_id: R1, date: 2019-01-23, user: 1234
      • Resource: resource_id: R5, date: 2019-01-23, user: 6789
      • Literal: literal_id: L17, value: "ORKG", date: 2019-01-23, user: 1234
      • Predicate: predicate_id: P4, date: 2019-01-23, user: 6789
      • Statement: statement_id: S2, date: 2019-01-23, user: 6789
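One way to read this data model is as a set of small record types in which every element, including the statement itself, carries its own provenance. The Python sketch below is an interpretation under assumptions: the class layout is hypothetical, the slide's `date` field is renamed `created` to avoid clashing with the `date` type, and the slide does not say which elements statement S2 connects, so the example wiring (R1, P4, L17) is invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


@dataclass(frozen=True)
class Resource:
    resource_id: str
    created: date  # the slide's "date" field
    user: int


@dataclass(frozen=True)
class Predicate:
    predicate_id: str
    created: date
    user: int


@dataclass(frozen=True)
class Literal:
    literal_id: str
    value: str
    created: date
    user: int


@dataclass(frozen=True)
class Statement:
    """An RDF-style triple that also records who asserted it and when."""
    statement_id: str
    subject: Resource
    predicate: Predicate
    object: Union[Resource, Literal]
    created: date
    user: int


day = date(2019, 1, 23)
r1 = Resource("R1", day, 1234)
p4 = Predicate("P4", day, 6789)
l17 = Literal("L17", "ORKG", day, 1234)

# Assumed wiring: the slide does not state what S2 connects.
s2 = Statement("S2", subject=r1, predicate=p4, object=l17, created=day, user=6789)

# Provenance is available per element: the statement and its literal
# were contributed by different users.
contributors = {s2.user, s2.object.user}
```

Storing metadata on the statement, and not only on its parts, is what allows a curated graph to answer "who claimed this, and when" for each individual fact.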
  66. 66. Page 66 High-Level Architecture: Neo4j Graph Application
      • Layers: Application, Domain, Persistence
      • User Interface (Contribute, Curate, Explore) and Third-party Apps
      • REST API (interface to the outside world)
      • GraphQL (API query language)
      • SPARQL (data query language)
      • AuthN & AuthZ (ORCID or other SSO)
      • Business Logic (data input/output, consistency)
      • Domain Model (Statements, Resources, etc.)
      • Neo4j (Linked Property Graph), Virtuoso (Triple Store), ? (other database)
  67. 67. Term of the Gene Ontology, namely GO:0030350
  68. 68. The authors; The research contribution; The research result; The paper; A continuous variable value
  69. 69. Page 88 More projects. Stay tuned:
      • Mailing list/group: !forum/orkg
      • Open Research Knowledge Graph:
      • ERC Consolidator Grant ScienceGRAPH started in May
      • Transfer event on International Data Space on June 19:
  70. 70. Page 89 The Team Prof. (Univ. S. Bolivar) Dr. Maria Esther Vidal Software Development Kemele Endris Farah Karim Collaborators TIB/L3S Scientific Data Management Group Leaders PostDocs Project Management Doctoral Researchers Dr. Markus Stocker Dr. Gábor Kismihók Dr. Javad Chamanara Dr. Jennifer D’Souza Olga Lezhnina Allard Oelen Yaser Jaradeh Shereif Eid Manuel Prinz Alex Garatzogianni Laura Granzow Collaborators InfAI Leipzig / AKSW Dr. Michael Martin Natanael Arndt Sarven Capadisli Vitalis Wiens Wazed Ali
  71. 71. Contact Prof. Dr. Sören Auer TIB & Leibniz University of Hannover