Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT

Michael Paris; Robert Jäschke

doi:10.1007/978-3-030-82147-0_14

Details

Original language	English
Title of host publication	Knowledge Science, Engineering and Management
Subtitle of host publication	14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II
Editors	Han Qiu, Cheng Zhang, Zongming Fei, Meikang Qiu, Sun-Yuan Kung
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	163-175
Number of pages	13
ISBN (electronic)	978-3-030-82147-0
ISBN (print)	9783030821463
Publication status	Published - 7 Aug 2021
Event	14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021 - Tokyo, Japan Duration: 14 Aug 2021 → 16 Aug 2021

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12816 LNAI
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web’s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models’ performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals’ occupation and web page domain.

Keywords

Bias, Classification, Dataset, Generation, Heuristic, Quality, Web archive

ASJC Scopus subject areas

Mathematics(all)
Theoretical Computer Science
Computer Science(all)
General Computer Science

Cite this

Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. / Paris, Michael; Jäschke, Robert.
Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. ed. / Han Qiu; Cheng Zhang; Zongming Fei; Meikang Qiu; Sun-Yuan Kung. Springer Science and Business Media Deutschland GmbH, 2021. p. 163-175 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12816 LNAI).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Paris, M & Jäschke, R 2021, Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. in H Qiu, C Zhang, Z Fei, M Qiu & S-Y Kung (eds), Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12816 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 163-175, 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021, Tokyo, Japan, 14 Aug 2021. https://doi.org/10.1007/978-3-030-82147-0_14

Paris, M., & Jäschke, R. (2021). Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. In H. Qiu, C. Zhang, Z. Fei, M. Qiu, & S.-Y. Kung (Eds.), Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II (pp. 163-175). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12816 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-82147-0_14

Paris M, Jäschke R. Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. In Qiu H, Zhang C, Fei Z, Qiu M, Kung SY, editors, Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. Springer Science and Business Media Deutschland GmbH. 2021. p. 163-175. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-82147-0_14

Paris, Michael ; Jäschke, Robert. / Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. editor / Han Qiu ; Cheng Zhang ; Zongming Fei ; Meikang Qiu ; Sun-Yuan Kung. Springer Science and Business Media Deutschland GmbH, 2021. pp. 163-175 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{1c1e6224c4ea4585b1f45d17009d82a0,

title = "Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT",

abstract = "Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web{\textquoteright}s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models{\textquoteright} performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals{\textquoteright} occupation and web page domain.",

keywords = "Bias, Classification, Dataset, Generation, Heuristic, Quality, Web archive",

author = "Michael Paris and Robert J{\"a}schke",

note = "Funding Information: Acknowledgments. Parts of this research were funded by the German Federal Ministry of Education and Research (BMBF) in the REGIO project (grant no. 01PU17012D). ; 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021 ; Conference date: 14-08-2021 Through 16-08-2021",

year = "2021",

month = aug,

day = "7",

doi = "10.1007/978-3-030-82147-0_14",

language = "English",

isbn = "9783030821463",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "163--175",

editor = "Han Qiu and Cheng Zhang and Zongming Fei and Meikang Qiu and Sun-Yuan Kung",

booktitle = "Knowledge Science, Engineering and Management",

address = "Germany",

}

TY - GEN

T1 - Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT

AU - Paris, Michael

AU - Jäschke, Robert

N1 - Funding Information: Acknowledgments. Parts of this research were funded by the German Federal Ministry of Education and Research (BMBF) in the REGIO project (grant no. 01PU17012D).

PY - 2021/8/7

Y1 - 2021/8/7

N2 - Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web’s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models’ performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals’ occupation and web page domain.

KW - Bias

KW - Classification

KW - Dataset

KW - Generation

KW - Heuristic

KW - Quality

KW - Web archive

UR - http://www.scopus.com/inward/record.url?scp=85113788546&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-82147-0_14

DO - 10.1007/978-3-030-82147-0_14

M3 - Conference contribution

AN - SCOPUS:85113788546

SN - 9783030821463

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 163

EP - 175

BT - Knowledge Science, Engineering and Management

A2 - Qiu, Han

A2 - Zhang, Cheng

A2 - Fei, Zongming

A2 - Qiu, Meikang

A2 - Kung, Sun-Yuan

PB - Springer Science and Business Media Deutschland GmbH

T2 - 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021

Y2 - 14 August 2021 through 16 August 2021

ER -
