Thinking About the Making of Data

Presentation at WU Wien June 12, 2019 - reflecting on the nature of constructing datasets and the difficulties therein.

Published in: Technology

  1. Faculty of Science. Paul Groth | @pgroth | pgroth.com. May 12, 2019. Institute for Information Business – WU Wien. Thinking About the Making of Data. Thanks to Kathleen Gregory (@gregory_km).
  2. The making of data is important. “There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science, http://dataintegration.codata.org, Committee on Data of the International Council for Science (CODATA)
  3. Software 2.0 (https://link.medium.com/srrJhEl5bS): “In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets.” Figure 8 Data Science Surveys 2017 & 2018. The making of data is hard.
  10. Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574 (http://arxiv.org/abs/1802.05574)
  11. COMPLEX DISTRIBUTED WORKFLOWS
  12. NOT JUST DATA SCIENCE
      Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019). Searching Data: A Review of Observational Data Retrieval Practices. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24165
      Some observations from the @gregory_km survey and interviews:
        • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
        • Participants require details about data collection and handling.
        • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
      Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (forthcoming). Understanding Data Search as a Socio-technical Practice. Journal of Information Science. arXiv preprint: arXiv:1801.04971.
  13. Spreadsheet Events https://www.seh.ox.ac.uk/news/the-case-for-ceres-developing-a-postgraduate-mission-with-the-european-space-agency
  14. BOTTLENECKS
        1. Manual
        2. Difficulty in creating flexible reusable workflows
        3. Lack of transparency
      Paul Groth, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
      Paul Groth, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
  15. New lab at the University of Amsterdam: http://indelab.org
        • Focus on intelligent systems for supporting people working with data.
        • 5 people by September 2019 + growing
        • 3 research areas:
            • AI for Data Engineering Tasks
                • Knowledge graph construction
                • Data wrangling support + automation
            • Transparency in data supply chains
                • Lineage of provenance of data
            • Understanding data professionals' work
                • Empirical insights into how people go about working with data
  16. Data search – is it just a regular search engine? Survey of research challenges: Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth (Jan 2019). "Dataset search: a survey". https://arxiv.org/abs/1901.00735
  17. Information Needs. “An information need is the topic about which the user desires to know more” – Manning
  18. Data as an information need
        • Researchers across communities need a diversity of observational data, requiring data of different types, from different sources and disciplines, and often collected at different scales.
        • Integrating diverse data is a challenge.
      Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review of observational data retrieval practices in selected disciplines. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24165
  19. How do researchers search for data? Primary: semi-structured interviews with data seekers across disciplines (n=22). Next stage: multidisciplinary survey (n=1677, still in analysis phase). Work of Kathleen Gregory with Sally Wyatt, Andrea Scharnhorst, Helena Cousijn.
  20. Users and data needs. A broader understanding of the data needed by users: data needed for research are not always research data; numerous roles, with data as hubs for collaboration and creativity.
  21. Do you discover data differently than how you discover academic literature? [Chart. Categories: No; Sometimes; Yes. Percentages as transcribed: 52.2, 29.8, 18.1]
  22. How do you discover data using the academic literature? [Chart. Categories: Following citations to data; Search with goal of finding data; While reading or searching for literature; Extract data directly from literature, tables, graphs; Other. Percentages as transcribed: 30.2, 29.4, 20.5, 19.3, 0.6]
  23. How frequently do you find data in the following ways? [Chart. Rows: Actively searching online; Serendipitously, while searching for something else; While sharing/managing own data; Serendipitously, when not actively searching. Scale: Never, Occasionally, Often (percentage)]
  24. Search and discovery strategies: key role of social interactions. “Actually, most of the times that I have looked for external data, it has been through (personal) connections” (11). “The human network of contacts is still the best way to find the information you want, especially if it is a small group...that is the most powerful and accurate source of information that I use at this point.” (17)
  25. Evaluation and sense-making: role of social interactions continues. “I think if there was a good search engine, then I could get the dataset directly. I would still get in touch with the data author anyway, both for social reasons - developing the network and eventual collaboration - and also because most of the times the metadata are not enough to really understand the biology behind the species” (4).
  26. Evaluation and sensemaking: role of social interactions continues. “I am used to working with experts from different areas of knowledge. For me it is usual to have partners with different expertise: biology, agronomy, economy…I know the language of LCA (life cycle assessment), not of electronics or agricultural biology. My limit is not the data that I cannot find, but people that can work with these data” (16).
  27. What does this mean for system design?
        • Consider how data are made available
            • Metadata standardization and enrichment
            • Summarization to facilitate sensemaking
        • Consider entirety of data needs
            • Point to best practices or resources for other data
            • Do disciplinary categories still fit?
        • Consider diversity and overlaps
            • Differentiated interfaces
            • Integration with infrastructures supporting other data and research practices
        • Consider how to incorporate role of social interactions
            • Contact data author, integration with author profiles, ORCID?
            • Links to in-person trainings? Connecting with “data experts”?
  28. Integration of Data Into Workflows. Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug discovery questions." Drug Discovery Today 20, no. 4 (2015): 399-405.
  29. Run structured queries
  30. BUILD A KNOWLEDGE GRAPH
      [Diagram labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction; Concept Resolution. Figures: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries]
      Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion”. 5th Workshop on Automated Knowledge Base Construction (AKBC), 2016.
      Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  31. SOURCES OF CHANGE
        1. Dealing with changing cultural and societal norms, specifically to address or correct bias
        2. Political influence
        3. New concepts and terminology arising from discoveries or change in perspective within a technical/scientific community
        4. Gardening
        5. Incremental contributorship
        6. Progressive formalization
        7. Software and automation
        8. Integration of large numbers of data sources
        9. Variance in algorithm training data
      [Diagram labels: Concept1, Concept2, Concept3; KOS; Professional Curators; Literature; Software; Non-professional contributors; Data; Society & Politics. Source groupings as transcribed: (4, 5, 6), (7, 8, 9), (3), (1, 2)]
      Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  32. WIKIDATA VOCABULARY
  33. GARDENING (source of change no. 4). Wikipedia categories: 25% increase in the number of categories over the 2012-2014 period vs. a 12% increase in the number of articles. Likewise, the number of disambiguation pages has increased by 13% (Bairi et al. 2015). http://blog.schema.org/2015/11/schemaorg-whats-new.html
  34. INCREMENTAL CONTRIBUTORSHIP. Over 17,000 active users on Wikidata as of Feb 2017.
  35. INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
        • 10 different extractors
        • E.g. mapping-based infobox extractor
        • Infobox uses a hand-built ontology based on the 350 most commonly used English-language infoboxes
        • Integrates with Yago
        • Yago relies on Wikipedia + Wordnet
        • Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies
        • Wordnet is built by psycholinguists
      Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
  36. Data are complex objects. Data are diverse. Data do not stand alone. Data are not always stable and do not travel easily. Borgman, C.L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press. Leonelli, S., Rappert, B., & Davies, G. (2017). Data shadows: Knowledge, openness, and absence. Science, Technology, & Human Values, 42(2), pp. 191-202.
  37. http://www.publicbooks.org/justice-for-data-janitors/
  38. A MORE TRANSPARENT DATA SUPPLY CHAIN. Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
  39. TRANSPARENCY ACKNOWLEDGES MESSINESS. M. C. Elish & danah boyd (2018). Situating methods in the magic of Big Data and AI. Communication Monographs, 85:1, 57-80. DOI: 10.1080/03637751.2017.1375130
  40. Conclusion
        • Data reuse through integration/munging/remixing is pervasive.
        • We need to reflect on the making, especially as we can automate more.
        • How can we use the knowledge of making to help support our information need?
      Contact: Paul Groth | @pgroth | pgroth.com
  41. Can you skip all that? Paul T. Groth, Antony Scerri, Ron Daniel Jr., Bradley P. Allen: End-to-End Learning for Answering Structured Queries Directly over Text. CoRR abs/1811.06303 (2018)
  42. Machine Comprehension + Question Answering Tasks https://nlp.stanford.edu/software/sempre/wikitable/
  43. We have a parallel corpus
  44. Triple Pattern Fragments http://linkeddatafragments.org/concept/
  45. Now we only need to answer slot filling queries. Hewlett et al., “WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia”, ACL 2016. Johannes Welbl, Pontus Stenetorp, Sebastian Riedel, “Constructing Datasets for Multi-hop Reading Comprehension Across Documents”, Transactions of the Association for Computational Linguistics, 2018.
  46. Off the shelf QA architectures
        • Question: lexicalize(?city wdt:P131 wd:Q55) => Located in the administrative territorial entity of …. Netherlands
        • Input text: “Amsterdam is the capital city and most populous municipality of the Netherlands. ….”
        • Answer span: Amsterdam [0,9]
      Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, 2017.
      Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktaschel, Matko Bosnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel. Jack the Reader – A Machine Reading Framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations, July 2018. URL https://arxiv.org/abs/1806.08727
  47. Results
  48. Results
  49. A Prototype
  50. Methodology. Primary: semi-structured interviews with data seekers across disciplines (n=22). Next stage: multidisciplinary survey (n=1677, still in analysis phase).
  52. Search and discovery strategies. Relationship with academic literature search. Overlaps with other practices.
  59. 59. Faculty of Science
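
The question-lexicalization example on slide 46 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the label tables, the helper names `lexicalize` and `answer_span`, and the exact question wording are hypothetical, and a plain string search stands in for the trained QA model that would predict the answer span in practice.

```python
# Hypothetical label lookups; a real system would fetch these from Wikidata.
PROPERTY_LABELS = {"wdt:P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"wd:Q55": "Netherlands"}


def lexicalize(variable, prop, obj):
    """Turn a triple pattern with one unbound variable into a question string.

    The wording template (property label followed by entity label) is an
    assumption made for this sketch. The variable name is accepted only to
    mirror the triple pattern; it does not affect the output.
    """
    return f"{PROPERTY_LABELS[prop]} {ENTITY_LABELS[obj]}"


def answer_span(text, answer):
    """Return (start, end) character offsets of the first match, or None.

    Stand-in for a QA model: here the answer string is already known and
    we only locate it, whereas a model would predict the offsets.
    """
    start = text.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))


question = lexicalize("?city", "wdt:P131", "wd:Q55")
text = ("Amsterdam is the capital city and most populous "
        "municipality of the Netherlands.")
span = answer_span(text, "Amsterdam")

print(question)
print(span)  # (0, 9), matching the answer span shown on the slide
```

The offsets are half-open, so `text[span[0]:span[1]]` recovers the answer string; this is one common convention and may differ from the one used in the original system.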
