Details
Original language | English |
---|---|
Title of host publication | WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining |
Pages | 441-450 |
Number of pages | 10 |
Publication status | Published - 4 Feb. 2010 |
Event | 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010 - New York City, NY, United States. Duration: 3 Feb. 2010 → 6 Feb. 2010 |
Publication series
Name | WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining |
---|
Abstract
In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
ASJC Scopus subject areas
- Computer Science (all)
- Computer Networks and Communications
- Software
Cite this
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 441-450).
Publication: Contribution to book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - Boilerplate Detection using Shallow Text Features
AU - Kohlschütter, Christian
AU - Fankhauser, Peter
AU - Nejdl, Wolfgang
PY - 2010/2/4
Y1 - 2010/2/4
N2 - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
AB - In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable detection accuracy.
KW - Boilerplate removal
KW - Full-text extraction
KW - Template detection
KW - Text cleaning
KW - Web document modeling
UR - http://www.scopus.com/inward/record.url?scp=77950904942&partnerID=8YFLogxK
U2 - 10.1145/1718487.1718542
DO - 10.1145/1718487.1718542
M3 - Conference contribution
AN - SCOPUS:77950904942
SN - 9781605588896
T3 - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
SP - 441
EP - 450
BT - WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining
T2 - 3rd ACM International Conference on Web Search and Data Mining, WSDM 2010
Y2 - 3 February 2010 through 6 February 2010
ER -