Boilerplate Detection using Shallow Text Features

Publication: Contribution to book/report/anthology/conference proceedings; Conference paper; Research; Peer-reviewed

Details

Original language: English
Title of host publication: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
Pages: 441-450
Number of pages: 10
Publication status: Published - 4 Feb. 2010
Event: 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States
Duration: 3 Feb. 2010 to 6 Feb. 2010

Publication series

Name: WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining

Abstract

In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straightforward heuristics, achieving a remarkable detection accuracy.

Cite this

Boilerplate Detection using Shallow Text Features. / Kohlschütter, Christian; Fankhauser, Peter; Nejdl, Wolfgang.
WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).

Kohlschütter, C, Fankhauser, P & Nejdl, W 2010, Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pp. 441-450, 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010, New York City, NY, United States, 3 Feb. 2010. https://doi.org/10.1145/1718487.1718542
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 441-450). (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). https://doi.org/10.1145/1718487.1718542
Kohlschütter C, Fankhauser P, Nejdl W. Boilerplate Detection using Shallow Text Features. in WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450. (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining). doi: 10.1145/1718487.1718542
Kohlschütter, Christian ; Fankhauser, Peter ; Nejdl, Wolfgang. / Boilerplate Detection using Shallow Text Features. WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 2010. pp. 441-450 (WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining).
@inproceedings{34c945515c1e465c9f0965b804fabd16,
title = "Boilerplate Detection using Shallow Text Features",
abstract = "In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.",
keywords = "Boilerplate removal, Full-text extraction, Template detection, Text cleaning, Web document modeling",
author = "Christian Kohlsch{\"u}tter and Peter Fankhauser and Wolfgang Nejdl",
year = "2010",
month = feb,
day = "4",
doi = "10.1145/1718487.1718542",
language = "English",
isbn = "9781605588896",
series = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
pages = "441--450",
booktitle = "WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining",
note = "3rd ACM International Conference on Web Search and Data Mining, WSDM 2010; Conference date: 3 February 2010 through 6 February 2010",

}

TY  - GEN
T1  - Boilerplate Detection using Shallow Text Features
AU  - Kohlschütter, Christian
AU  - Fankhauser, Peter
AU  - Nejdl, Wolfgang
PY  - 2010/2/4
Y1  - 2010/2/4
N2  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
AB  - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
KW  - Boilerplate removal
KW  - Full-text extraction
KW  - Template detection
KW  - Text cleaning
KW  - Web document modeling
UR  - http://www.scopus.com/inward/record.url?scp=77950904942&partnerID=8YFLogxK
U2  - 10.1145/1718487.1718542
DO  - 10.1145/1718487.1718542
M3  - Conference contribution
AN  - SCOPUS:77950904942
SN  - 9781605588896
T3  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
SP  - 441
EP  - 450
BT  - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
T2  - 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010
Y2  - 3 February 2010 through 6 February 2010
ER  -
