Boilerplate Detection using Shallow Text Features

Christian Kohlschütter; Peter Fankhauser; Wolfgang Nejdl

doi:10.1145/1718487.1718542

Details

Original language	English
Title of host publication	WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
Pages	441-450
Number of pages	10
Publication status	Published - 4 Feb 2010
Event	3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States. Duration: 3 Feb 2010 → 6 Feb 2010

Publication series

Name	WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

Abstract

In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Computer Science (all)
Software

Cite this

Boilerplate Detection using Shallow Text Features. / Kohlschütter, Christian; Fankhauser, Peter; Nejdl, Wolfgang.
WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Kohlschütter, C, Fankhauser, P & Nejdl, W 2010, Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pp. 441-450, 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010, New York City, NY, United States, 3 Feb. 2010. https://doi.org/10.1145/1718487.1718542

Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 441-450). (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). https://doi.org/10.1145/1718487.1718542

Kohlschütter C, Fankhauser P, Nejdl W. Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450. (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). doi: 10.1145/1718487.1718542

Kohlschütter, Christian ; Fankhauser, Peter ; Nejdl, Wolfgang. / Boilerplate Detection using Shallow Text Features. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).


@inproceedings{34c945515c1e465c9f0965b804fabd16,
  title = "Boilerplate Detection using Shallow Text Features",
  abstract = "In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.",
  keywords = "Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling",
  author = "Christian Kohlsch{\"u}tter and Peter Fankhauser and Wolfgang Nejdl",
  year = "2010",
  month = feb,
  day = "4",
  doi = "10.1145/1718487.1718542",
  language = "English",
  isbn = "9781605588896",
  series = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
  pages = "441--450",
  booktitle = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
  note = "3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 ; Conference date: 03-02-2010 Through 06-02-2010",
}


TY  - GEN
T1  - Boilerplate Detection using Shallow Text Features
AU  - Kohlschütter, Christian
AU  - Fankhauser, Peter
AU  - Nejdl, Wolfgang
PY  - 2010/2/4
Y1  - 2010/2/4
N2  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
AB  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
KW  - Boilerplate removal
KW  - Full-text extraction
KW  - Template detection
KW  - Text cleaning
KW  - Web document modeling
UR  - http://www.scopus.com/inward/record.url?scp=77950904942&partnerID=8YFLogxK
U2  - 10.1145/1718487.1718542
DO  - 10.1145/1718487.1718542
M3  - Conference contribution
AN  - SCOPUS:77950904942
SN  - 9781605588896
T3  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
SP  - 441
EP  - 450
BT  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
T2  - 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010
Y2  - 3 February 2010 through 6 February 2010
ER  - 

