Details
Original language | English |
---|---|
Title of host publication | Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 |
Publisher | Association for Computing Machinery (ACM) |
Pages | 1173-1182 |
Number of pages | 10 |
ISBN (Print) | 9781595939913 |
Publication status | Published - 26 Oct 2008 |
Event | 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States. Duration: 26 Oct 2008 → 30 Oct 2008 |
Publication series
Name | International Conference on Information and Knowledge Management, Proceedings |
---|
Abstract
Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
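The abstract names text density and a 1D-partitioning task but does not define either, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes text density means tokens per line after wrapping a block to a fixed width, and it assumes a simple greedy rule that fuses neighbouring blocks whose densities are similar. The names `text_density`, `segment`, `width` and `max_rel_diff`, as well as the threshold value, are hypothetical.

```python
import textwrap


def text_density(block: str, width: int = 80) -> float:
    """Tokens per wrapped line (an assumed definition, not from the abstract)."""
    tokens = block.split()
    if not tokens:
        return 0.0
    lines = textwrap.wrap(" ".join(tokens), width=width)
    return len(tokens) / len(lines)


def segment(blocks: list[str], max_rel_diff: float = 0.4,
            width: int = 80) -> list[list[str]]:
    """Greedy 1D partition of a block sequence.

    A block joins the current segment when its density differs from the
    previous block's density by at most max_rel_diff (relative difference);
    otherwise it starts a new segment. The threshold is an arbitrary choice.
    """
    segments: list[list[str]] = []
    for block in blocks:
        if not block.split():
            continue  # skip empty blocks
        if segments:
            prev = text_density(segments[-1][-1], width)
            curr = text_density(block, width)
            if abs(prev - curr) / max(prev, curr) <= max_rel_diff:
                segments[-1].append(block)
                continue
        segments.append([block])
    return segments


# Toy page: navigation links, two body paragraphs, a footer.
blocks = [
    "Home", "About", "Contact",
    " ".join(["word"] * 60),
    " ".join(["token"] * 45),
    "Copyright 2008",
]
print([len(s) for s in segment(blocks)])  # → [3, 2, 1]
```

In this toy input the short navigation links share a density of 1.0, the two long paragraphs have densities of 15.0 and 11.25, and the footer has a density of 2.0, so the sequence splits into three segments.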
ASJC Scopus subject areas
- Decision Sciences (all)
- General Decision Sciences
- Business, Management and Accounting (all)
- General Business, Management and Accounting
Cite this
A densitometric approach to web page segmentation. / Kohlschütter, Christian; Nejdl, Wolfgang. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. pp. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).
Publication: Chapter in book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - A densitometric approach to web page segmentation
AU - Kohlschütter, Christian
AU - Nejdl, Wolfgang
PY - 2008/10/26
Y1 - 2008/10/26
AB - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
KW - Full-text extraction
KW - Noise removal
KW - Template detection
KW - Web page Segmentation
UR - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458237
DO - 10.1145/1458082.1458237
M3 - Conference contribution
AN - SCOPUS:70349243805
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1173
EP - 1182
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
PB - Association for Computing Machinery (ACM)
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -