Details
Original language | English |
---|---|
Title of host publication | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
Editors | Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis |
Pages | 2192-2203 |
Number of pages | 12 |
ISBN (electronic) | 9791095546344 |
Publication status | Published - May 2020 |
Externally published | Yes |
Publication series
Name | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
---|
Abstract
We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a wide-ranging setting as STEM is reasonable.
Keywords
- cs.IR, cs.AI, cs.CL, cs.DL, Entity Resolution, Entity Classification, Entity Linking, Word Sense Disambiguation, Evaluation Corpus, Entity Recognition, Language Resource
ASJC Scopus subject areas
- Social Sciences(all)
- Education
- Social Sciences(all)
- Library and Information Sciences
- Arts and Humanities(all)
- Language and Linguistics
- Social Sciences(all)
- Linguistics and Language
Sustainable Development Goals
Cite this
- Standard
- Harvard
- Apa
- Vancouver
- BibTeX
- RIS
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. ed. / Nicoletta Calzolari; Frederic Bechet; Philippe Blache; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Helene Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis. 2020. p. 2192-2203 (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research
}
TY - GEN
T1 - The STEM-ECR Dataset
T2 - Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
AU - D'Souza, Jennifer
AU - Hoppe, Anett
AU - Brack, Arthur
AU - Jaradeh, Mohamad Yaser
AU - Auer, Sören
AU - Ewerth, Ralph
N1 - Funding information: We thank the anonymous reviewersfor their comments and suggestions. We also thank the subject specialists at TIB fortheirhelpfulfeedbackinthefirstpartofthisstudy. This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.
PY - 2020/5
Y1 - 2020/5
N2 - We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a wide-ranging setting as STEM is reasonable.
AB - We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a wide-ranging setting as STEM is reasonable.
KW - cs.IR
KW - cs.AI
KW - cs.CL
KW - cs.DL
KW - Entity Resolution
KW - Entity Classification
KW - Entity Linking
KW - Word Sense Disambiguation
KW - Evaluation Corpus
KW - Entity Recognition
KW - Language Resource
UR - http://www.scopus.com/inward/record.url?scp=85090882122&partnerID=8YFLogxK
M3 - Conference contribution
T3 - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
SP - 2192
EP - 2203
BT - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Bechet, Frederic
A2 - Blache, Philippe
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Moreno, Asuncion
A2 - Odijk, Jan
A2 - Piperidis, Stelios
ER -