Boilerplate Detection using Shallow Text Features

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl

Details

Original language: English
Title of host publication: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
Pages: 441-450
Number of pages: 10
Publication status: Published - 4 Feb 2010
Event: 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States
Duration: 3 Feb 2010 to 6 Feb 2010

Publication series

Name: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

Abstract

In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
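To make the idea of classifying text elements by shallow text features more concrete, here is a minimal sketch. It is not the authors' classifier: the abstract does not name the features, so the two used here (word count and link density), the `TextBlock` type, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A text element of a Web page (hypothetical type for this sketch)."""
    text: str         # all visible text in the block
    anchor_text: str  # the part of that text that sits inside hyperlinks


def shallow_features(block: TextBlock) -> dict:
    """Compute two cheap, surface-level features of a block."""
    num_words = len(block.text.split())
    num_link_words = len(block.anchor_text.split())
    # Link density: fraction of the block's words that are link text.
    link_density = num_link_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Rule-of-thumb classifier; thresholds are assumed, not from the paper."""
    f = shallow_features(block)
    return f["num_words"] < min_words or f["link_density"] > max_link_density


nav = TextBlock("Home About Contact", "Home About Contact")
body = TextBlock(
    "This boilerplate text typically is not related to the main content "
    "and may deteriorate search precision",
    "",
)
print(is_boilerplate(nav), is_boilerplate(body))
```

Short, link-heavy blocks such as navigation bars are flagged, while longer runs of unlinked prose are kept. A classifier trained on labeled pages would replace the hand-picked thresholds.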

Keywords

    Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling

Cite this

Boilerplate Detection using Shallow Text Features. / Kohlschütter, Christian; Fankhauser, Peter; Nejdl, Wolfgang.
WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. p. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).

Kohlschütter, C, Fankhauser, P & Nejdl, W 2010, Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pp. 441-450, 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010, New York City, NY, United States, 3 Feb 2010. https://doi.org/10.1145/1718487.1718542
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 441-450). (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). https://doi.org/10.1145/1718487.1718542
Kohlschütter C, Fankhauser P, Nejdl W. Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. p. 441-450. (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). doi: 10.1145/1718487.1718542
Kohlschütter, Christian ; Fankhauser, Peter ; Nejdl, Wolfgang. / Boilerplate Detection using Shallow Text Features. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).
BibTeX
@inproceedings{34c945515c1e465c9f0965b804fabd16,
title = "Boilerplate Detection using Shallow Text Features",
abstract = "In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.",
keywords = "Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling",
author = "Christian Kohlsch{\"u}tter and Peter Fankhauser and Wolfgang Nejdl",
year = "2010",
month = feb,
day = "4",
doi = "10.1145/1718487.1718542",
language = "English",
isbn = "9781605588896",
series = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
pages = "441--450",
booktitle = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
note = "3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 ; Conference date: 03-02-2010 Through 06-02-2010",

}

RIS

TY  - GEN
T1  - Boilerplate Detection using Shallow Text Features
AU  - Kohlschütter, Christian
AU  - Fankhauser, Peter
AU  - Nejdl, Wolfgang
PY  - 2010/2/4
Y1  - 2010/2/4
N2  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
AB  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
KW  - Boilerplate removal
KW  - Full-text extraction
KW  - Template detection
KW  - Text cleaning
KW  - Web document modeling
UR  - http://www.scopus.com/inward/record.url?scp=77950904942&partnerID=8YFLogxK
U2  - 10.1145/1718487.1718542
DO  - 10.1145/1718487.1718542
M3  - Conference contribution
AN  - SCOPUS:77950904942
SN  - 9781605588896
T3  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
SP  - 441
EP  - 450
BT  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
T2  - 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010
Y2  - 3 February 2010 through 6 February 2010
ER  -
