Boilerplate Detection using Shallow Text Features

Christian Kohlschütter; Peter Fankhauser; Wolfgang Nejdl

doi:10.1145/1718487.1718542

Details

Original language	English
Title of host publication	WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
Pages	441-450
Number of pages	10
Publication status	Published - 4 Feb 2010
Event	3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States
Duration	3 Feb 2010 → 6 Feb 2010

Publication series

Name	WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

Abstract

In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
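
As a rough illustration of what classifying text blocks by shallow features could look like, here is a minimal Python sketch. The abstract does not name the features or any thresholds, so everything here is an assumption made for illustration and not the authors' published method: the two features (word count and link density), the `TextBlock` type, the function names, and the cut-off values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One text element of a Web page (hypothetical representation)."""
    text: str          # visible text of the block
    linked_words: int  # number of words that sit inside hyperlinks


def shallow_features(block: TextBlock) -> tuple[int, float]:
    """Return (word count, link density) for one text block."""
    num_words = len(block.text.split())
    # Guard against empty blocks to avoid division by zero.
    link_density = block.linked_words / num_words if num_words else 0.0
    return num_words, link_density


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Assumed rule: short or link-heavy blocks are treated as boilerplate.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    num_words, link_density = shallow_features(block)
    return num_words < min_words or link_density > max_link_density


blocks = [
    TextBlock("Home News Sports Contact", linked_words=4),
    TextBlock("The council voted on Tuesday to approve the new budget after "
              "a lengthy debate over funding for public transport.",
              linked_words=0),
]
labels = [is_boilerplate(b) for b in blocks]
```

Under these assumed thresholds, the short navigation-style block is flagged and the longer sentence is kept. A real system would presumably learn such decision rules from labelled data instead of hard-coding them.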

Keywords

Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling

ASJC Scopus subject areas

Computer Science(all)
Computer Networks and Communications
Software

Cite this

Boilerplate Detection using Shallow Text Features. / Kohlschütter, Christian; Fankhauser, Peter; Nejdl, Wolfgang.
WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. p. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Kohlschütter, C, Fankhauser, P & Nejdl, W 2010, Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pp. 441-450, 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010, New York City, NY, United States, 3 Feb 2010. https://doi.org/10.1145/1718487.1718542

Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 441-450). (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). https://doi.org/10.1145/1718487.1718542

Kohlschütter C, Fankhauser P, Nejdl W. Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. p. 441-450. (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). doi: 10.1145/1718487.1718542

Kohlschütter, Christian ; Fankhauser, Peter ; Nejdl, Wolfgang. / Boilerplate Detection using Shallow Text Features. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).


@inproceedings{34c945515c1e465c9f0965b804fabd16,

title = "Boilerplate Detection using Shallow Text Features",

abstract = "In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.",

keywords = "Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling",

author = "Christian Kohlsch{\"u}tter and Peter Fankhauser and Wolfgang Nejdl",

year = "2010",

month = feb,

day = "4",

doi = "10.1145/1718487.1718542",

language = "English",

isbn = "9781605588896",

series = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",

pages = "441--450",

booktitle = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",

note = "3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 ; Conference date: 03-02-2010 Through 06-02-2010",

}


TY - GEN

T1 - Boilerplate Detection using Shallow Text Features

AU - Kohlschütter, Christian

AU - Fankhauser, Peter

AU - Nejdl, Wolfgang

PY - 2010/2/4

Y1 - 2010/2/4

N2 - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.

AB - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.

KW - Boilerplate removal

KW - Full-text extraction

KW - Template detection

KW - Text cleaning

KW - Web document modeling

UR - http://www.scopus.com/inward/record.url?scp=77950904942&partnerID=8YFLogxK

U2 - 10.1145/1718487.1718542

DO - 10.1145/1718487.1718542

M3 - Conference contribution

AN - SCOPUS:77950904942

SN - 9781605588896

T3 - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

SP - 441

EP - 450

BT - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

T2 - 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010

Y2 - 3 February 2010 through 6 February 2010

ER -
