Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Michael Paris
  • Robert Jäschke

Research Organisations

External Research Organisations

  • Humboldt-Universität zu Berlin (HU Berlin)

Details

Original language: English
Title of host publication: Knowledge Science, Engineering and Management
Subtitle of host publication: 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II
Editors: Han Qiu, Cheng Zhang, Zongming Fei, Meikang Qiu, Sun-Yuan Kung
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 163-175
Number of pages: 13
ISBN (electronic): 978-3-030-82147-0
ISBN (print): 9783030821463
Publication status: Published - 7 Aug 2021
Event: 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021 - Tokyo, Japan
Duration: 14 Aug 2021 to 16 Aug 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12816 LNAI
ISSN (Print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web’s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models’ performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals’ occupation and web page domain.

Keywords

    Bias, Classification, Dataset, Generation, Heuristic, Quality, Web archive

Cite this

Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. / Paris, Michael; Jäschke, Robert.
Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. ed. / Han Qiu; Cheng Zhang; Zongming Fei; Meikang Qiu; Sun-Yuan Kung. Springer Science and Business Media Deutschland GmbH, 2021. p. 163-175 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12816 LNAI).

Paris, M & Jäschke, R 2021, Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. in H Qiu, C Zhang, Z Fei, M Qiu & S-Y Kung (eds), Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12816 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 163-175, 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021, Tokyo, Japan, 14 Aug 2021. https://doi.org/10.1007/978-3-030-82147-0_14
Paris, M., & Jäschke, R. (2021). Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. In H. Qiu, C. Zhang, Z. Fei, M. Qiu, & S.-Y. Kung (Eds.), Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II (pp. 163-175). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12816 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-82147-0_14
Paris M, Jäschke R. Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. In Qiu H, Zhang C, Fei Z, Qiu M, Kung SY, editors, Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. Springer Science and Business Media Deutschland GmbH. 2021. p. 163-175. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-82147-0_14
Paris, Michael ; Jäschke, Robert. / Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT. Knowledge Science, Engineering and Management : 14th International Conference, KSEM 2021, Tokyo, Japan, August 14–16, 2021, Proceedings, Part II. editor / Han Qiu ; Cheng Zhang ; Zongming Fei ; Meikang Qiu ; Sun-Yuan Kung. Springer Science and Business Media Deutschland GmbH, 2021. pp. 163-175 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{1c1e6224c4ea4585b1f45d17009d82a0,
title = "Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT",
abstract = "Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web{\textquoteright}s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models{\textquoteright} performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals{\textquoteright} occupation and web page domain.",
keywords = "Bias, Classification, Dataset, Generation, Heuristic, Quality, Web archive",
author = "Michael Paris and Robert J{\"a}schke",
note = "Funding Information: Acknowledgments. Parts of this research were funded by the German Federal Ministry of Education and Research (BMBF) in the REGIO project (grant no. 01PU17012D). ; 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021 ; Conference date: 14-08-2021 Through 16-08-2021",
year = "2021",
month = aug,
day = "7",
doi = "10.1007/978-3-030-82147-0_14",
language = "English",
isbn = "9783030821463",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "163--175",
editor = "Han Qiu and Cheng Zhang and Zongming Fei and Meikang Qiu and Sun-Yuan Kung",
booktitle = "Knowledge Science, Engineering and Management",
address = "Germany",

}

TY - GEN

T1 - Evaluating Dataset Creation Heuristics for Concept Detection in Web Pages Using BERT

AU - Paris, Michael

AU - Jäschke, Robert

N1 - Funding Information: Acknowledgments. Parts of this research were funded by the German Federal Ministry of Education and Research (BMBF) in the REGIO project (grant no. 01PU17012D).

PY - 2021/8/7

Y1 - 2021/8/7

AB - Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by an uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web’s unstructured and heterogeneous nature and the cost of annotation. In that situation, annotation heuristics can be employed to create a dataset that captures the target concept, but in turn may lead to an unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models’ performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets, from which the classifier shall learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages, to assess the influence of the individuals’ occupation and web page domain.

KW - Bias

KW - Classification

KW - Dataset

KW - Generation

KW - Heuristic

KW - Quality

KW - Web archive

UR - http://www.scopus.com/inward/record.url?scp=85113788546&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-82147-0_14

DO - 10.1007/978-3-030-82147-0_14

M3 - Conference contribution

AN - SCOPUS:85113788546

SN - 9783030821463

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 163

EP - 175

BT - Knowledge Science, Engineering and Management

A2 - Qiu, Han

A2 - Zhang, Cheng

A2 - Fei, Zongming

A2 - Qiu, Meikang

A2 - Kung, Sun-Yuan

PB - Springer Science and Business Media Deutschland GmbH

T2 - 14th International Conference on Knowledge Science, Engineering and Management, KSEM 2021

Y2 - 14 August 2021 through 16 August 2021

ER -