The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research

Authors

  • Jennifer D'Souza
  • Anett Hoppe
  • Arthur Brack
  • Mohamad Yaser Jaradeh
  • Sören Auer
  • Ralph Ewerth

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Pages: 2192-2203
Number of pages: 12
ISBN (electronic): 9791095546344
Publication status: Published - May 2020
Externally published: Yes

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Abstract

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset was developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts from the 10 STEM disciplines found to be the most prolific on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight our findings regarding the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark for the automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated three-step entity resolution procedure for the human annotation of scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses returned by Babelfy for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are feasible in a setting as wide-ranging as STEM.
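As a rough illustration of the extraction task benchmarked in point 3, the sketch below frames scientific entity extraction as BERT-based token classification using the Hugging Face transformers library. It is a minimal, hypothetical example: the base model, the BIO label set, and the sample sentence are illustrative assumptions, not the authors' released code or the paper's exact configuration.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BIO tagging scheme with a single generic scientific-entity type;
# the paper's actual label inventory may differ.
LABELS = ["O", "B-ENTITY", "I-ENTITY"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS)
)  # in practice, fine-tuned on STEM-ECR annotations before use

text = "We measure the thermal conductivity of graphene nanoribbons."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_wordpieces, num_labels)

# Map each wordpiece to its highest-scoring label. An untrained
# classification head yields arbitrary tags; this only demonstrates
# the inference flow.
for token_id, pred in zip(inputs["input_ids"][0], logits.argmax(-1)[0]):
    print(tokenizer.convert_ids_to_tokens(token_id.item()), LABELS[pred.item()])

Fine-tuning would follow the standard token-classification recipe (cross-entropy over the BIO labels on STEM-ECR's annotated abstracts) before the model produces meaningful predictions.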

Keywords

    cs.IR, cs.AI, cs.CL, cs.DL, Entity Resolution, Entity Classification, Entity Linking, Word Sense Disambiguation, Evaluation Corpus, Entity Recognition, Language Resource


Cite this

The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. / D'Souza, Jennifer; Hoppe, Anett; Brack, Arthur et al.
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. ed. / Nicoletta Calzolari; Frederic Bechet; Philippe Blache; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Helene Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis. 2020. p. 2192-2203 (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).


D'Souza, J, Hoppe, A, Brack, A, Jaradeh, MY, Auer, S & Ewerth, R 2020, The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. in N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk & S Piperidis (eds), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings, pp. 2192-2203. <https://arxiv.org/abs/2003.01006>
D'Souza, J., Hoppe, A., Brack, A., Jaradeh, M. Y., Auer, S., & Ewerth, R. (2020). The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In N. Calzolari, F. Bechet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings (pp. 2192-2203). (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings). https://arxiv.org/abs/2003.01006
D'Souza J, Hoppe A, Brack A, Jaradeh MY, Auer S, Ewerth R. The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In Calzolari N, Bechet F, Blache P, Choukri K, Cieri C, Declerck T, Goggi S, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, editors, LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. 2020. p. 2192-2203. (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).
D'Souza, Jennifer ; Hoppe, Anett ; Brack, Arthur et al. / The STEM-ECR Dataset : Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. editor / Nicoletta Calzolari ; Frederic Bechet ; Philippe Blache ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Helene Mazo ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis. 2020. pp. 2192-2203 (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).
@inproceedings{719c7a674c8e4486b7798cc4cf2addd5,
title = "The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources",
abstract = "We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a wide-ranging setting as STEM is reasonable.",
keywords = "cs.IR, cs.AI, cs.CL, cs.DL, Entity Resolution, Entity Classification, Entity Linking, Word Sense Disambiguation, Evaluation Corpus, Entity Recognition, Language Resource",
author = "Jennifer D'Souza and Anett Hoppe and Arthur Brack and Jaradeh, {Mohamad Yaser} and S{\"o}ren Auer and Ralph Ewerth",
note = "Funding information: We thank the anonymous reviewersfor their comments and suggestions. We also thank the subject specialists at TIB fortheirhelpfulfeedbackinthefirstpartofthisstudy. This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.",
year = "2020",
month = may,
language = "English",
series = "LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings",
pages = "2192--2203",
editor = "Nicoletta Calzolari and Frederic Bechet and Philippe Blache and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis",
booktitle = "LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings",

}


TY - GEN

T1 - The STEM-ECR Dataset

T2 - Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

AU - D'Souza, Jennifer

AU - Hoppe, Anett

AU - Brack, Arthur

AU - Jaradeh, Mohamad Yaser

AU - Auer, Sören

AU - Ewerth, Ralph

N1 - Funding information: We thank the anonymous reviewers for their comments and suggestions. We also thank the subject specialists at TIB for their helpful feedback in the first part of this study. This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.

PY - 2020/5

Y1 - 2020/5

N2 - We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset was developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts from the 10 STEM disciplines found to be the most prolific on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight our findings regarding the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark for the automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated three-step entity resolution procedure for the human annotation of scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses returned by Babelfy for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are feasible in a setting as wide-ranging as STEM.

AB - We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset was developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts from the 10 STEM disciplines found to be the most prolific on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight our findings regarding the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark for the automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated three-step entity resolution procedure for the human annotation of scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses returned by Babelfy for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are feasible in a setting as wide-ranging as STEM.

KW - cs.IR

KW - cs.AI

KW - cs.CL

KW - cs.DL

KW - Entity Resolution

KW - Entity Classification

KW - Entity Linking

KW - Word Sense Disambiguation

KW - Evaluation Corpus

KW - Entity Recognition

KW - Language Resource

UR - http://www.scopus.com/inward/record.url?scp=85090882122&partnerID=8YFLogxK

M3 - Conference contribution

T3 - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

SP - 2192

EP - 2203

BT - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

A2 - Calzolari, Nicoletta

A2 - Bechet, Frederic

A2 - Blache, Philippe

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

ER -
