Details
Original language | English |
---|---|
Title of host publication | Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 |
Publisher | Association for Computing Machinery (ACM) |
Pages | 1173-1182 |
Number of pages | 10 |
ISBN (print) | 9781595939913 |
Publication status | Published - 26 Oct 2008 |
Event | 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States. Duration: 26 Oct 2008 → 30 Oct 2008 |
Publication series
Name | International Conference on Information and Knowledge Management, Proceedings |
---|
Abstract
Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
Keywords
- Full-text extraction, Noise removal, Template detection, Web page Segmentation
ASJC Scopus subject areas
- General Decision Sciences
- General Business, Management and Accounting
Cite this
Kohlschütter, Christian; Nejdl, Wolfgang. A densitometric approach to web page segmentation. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. p. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - A densitometric approach to web page segmentation
AU - Kohlschütter, Christian
AU - Nejdl, Wolfgang
PY - 2008/10/26
Y1 - 2008/10/26
AB - Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
KW - Full-text extraction
KW - Noise removal
KW - Template detection
KW - Web page Segmentation
UR - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458237
DO - 10.1145/1458082.1458237
M3 - Conference contribution
AN - SCOPUS:70349243805
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1173
EP - 1182
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
PB - Association for Computing Machinery (ACM)
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -