Details
Original language | English |
---|---|
Title of host publication | Knowledge Science, Engineering and Management |
Subtitle of host publication | 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II |
Editors | Han Qiu, Cheng Zhang, Zongming Fei, Meikang Qiu, Sun-Yuan Kung |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 163-175 |
Number of pages | 13 |
ISBN (electronic) | 978-3-030-82147-0 |
ISBN (print) | 9783030821463 |
Publication status | Published - 7 Aug 2021 |
Event | 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021 - Tokyo, Japan. Duration: 14 Aug 2021 → 16 Aug 2021 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12816 LNAI |
ISSN (Print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Abstract
Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web’s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models’ performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals’ occupation and web page domain.
Keywords
- Bias, Classification, Dataset, Generation, Heuristic, Quality, Web archive
ASJC Scopus subject areas
- Mathematics(all)
- Theoretical Computer Science
- Computer Science(all)
- General Computer Science
Cite this
Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. ed. / Han Qiu; Cheng Zhang; Zongming Fei; Meikang Qiu; Sun-Yuan Kung. Springer Science and Business Media Deutschland GmbH, 2021. p. 163-175 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12816 LNAI).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT
AU - Paris, Michael
AU - Jäschke, Robert
N1 - Funding Information: Acknowledgments. Parts of this research were funded by the German Federal Ministry of Education and Research (BMBF) in the REGIO project (grant no. 01PU17012D).
PY - 2021/8/7
Y1 - 2021/8/7
KW - Bias
KW - Classification
KW - Dataset
KW - Generation
KW - Heuristic
KW - Quality
KW - Web archive
UR - http://www.scopus.com/inward/record.url?scp=85113788546&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-82147-0_14
DO - 10.1007/978-3-030-82147-0_14
M3 - Conference contribution
AN - SCOPUS:85113788546
SN - 9783030821463
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 163
EP - 175
BT - Knowledge Science, Engineering and Management
A2 - Qiu, Han
A2 - Zhang, Cheng
A2 - Fei, Zongming
A2 - Qiu, Meikang
A2 - Kung, Sun-Yuan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021
Y2 - 14 August 2021 through 16 August 2021
ER -