A densitometric approach to web page segmentation

Publication: Contribution to book/report/anthology/conference proceedings, conference paper, research, peer-reviewed

Authors

Christian Kohlschütter, Wolfgang Nejdl

Details

Original language: English
Title of host publication: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Publisher: Association for Computing Machinery (ACM)
Pages: 1173-1182
Number of pages: 10
ISBN (print): 9781595939913
Publication status: Published - 26 Oct 2008
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 2008 to 30 Oct 2008

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Abstract

Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
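
The abstract does not spell out how text density is computed or how the 1D partitioning is solved, so the following is only a minimal sketch under stated assumptions. It assumes that the density of a block is the number of whitespace-separated tokens per line after wrapping the text at a fixed width, and that adjacent blocks are fused greedily when their relative density difference is small. The names `text_density`, `fuse_blocks`, `WRAP_WIDTH` and `max_rel_diff`, as well as the values 80 and 0.4, are hypothetical choices for illustration and are not taken from this record.

```python
import textwrap

WRAP_WIDTH = 80  # assumed wrapping width in characters (hypothetical value)


def text_density(text: str, width: int = WRAP_WIDTH) -> float:
    """Tokens per wrapped line; an assumed stand-in for text density."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return 0.0
    return len(text.split()) / len(lines)


def fuse_blocks(blocks: list[str], max_rel_diff: float = 0.4) -> list[str]:
    """Greedy 1D partitioning: merge a block into the previous segment
    when their densities differ by at most max_rel_diff (relative)."""
    segments: list[str] = []
    for block in blocks:
        if segments:
            prev = segments[-1]
            d_prev, d_cur = text_density(prev), text_density(block)
            denom = max(d_prev, d_cur)
            rel_diff = abs(d_prev - d_cur) / denom if denom else 0.0
            if rel_diff <= max_rel_diff:
                segments[-1] = prev + " " + block
                continue
        segments.append(block)
    return segments


if __name__ == "__main__":
    page = [
        "Home",
        "About",
        " ".join(["word"] * 60),  # long, dense paragraph
        " ".join(["text"] * 48),  # second dense paragraph
    ]
    for segment in fuse_blocks(page):
        print(round(text_density(segment), 2), segment[:40])
```

In this toy input the two short navigation-like blocks end up in one low-density segment and the two paragraphs in one high-density segment, which illustrates how a page can be reduced to a one-dimensional sequence of densities before it is partitioned.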

Cite

A densitometric approach to web page segmentation. / Kohlschütter, Christian; Nejdl, Wolfgang.
Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. pp. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).

Kohlschütter, C & Nejdl, W 2008, A densitometric approach to web page segmentation. in Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery (ACM), pp. 1173-1182, 17th ACM Conference on Information and Knowledge Management, CIKM'08, Napa Valley, CA, United States, 26 Oct 2008. https://doi.org/10.1145/1458082.1458237
Kohlschütter, C., & Nejdl, W. (2008). A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 (pp. 1173-1182). (International Conference on Information and Knowledge Management, Proceedings). Association for Computing Machinery (ACM). https://doi.org/10.1145/1458082.1458237
Kohlschütter C, Nejdl W. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM). 2008. pp. 1173-1182. (International Conference on Information and Knowledge Management, Proceedings). doi: 10.1145/1458082.1458237
Kohlschütter, Christian ; Nejdl, Wolfgang. / A densitometric approach to web page segmentation. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. pp. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).
@inproceedings{08e490ab8d9e4219a3474dde051308a6,
title = "A densitometric approach to web page segmentation",
abstract = "Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.",
keywords = "Full-text extraction, Noise removal, Template detection, Web page Segmentation",
author = "Christian Kohlsch{\"u}tter and Wolfgang Nejdl",
year = "2008",
month = oct,
day = "26",
doi = "10.1145/1458082.1458237",
language = "English",
isbn = "9781595939913",
series = "International Conference on Information and Knowledge Management, Proceedings",
publisher = "Association for Computing Machinery (ACM)",
pages = "1173--1182",
booktitle = "Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08",
address = "United States",
note = "17th ACM Conference on Information and Knowledge Management, CIKM'08 ; Conference date: 26-10-2008 Through 30-10-2008",

}

TY  - GEN
T1  - A densitometric approach to web page segmentation
AU  - Kohlschütter, Christian
AU  - Nejdl, Wolfgang
PY  - 2008/10/26
Y1  - 2008/10/26
N2  - Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
AB  - Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
KW  - Full-text extraction
KW  - Noise removal
KW  - Template detection
KW  - Web page Segmentation
UR  - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK
U2  - 10.1145/1458082.1458237
DO  - 10.1145/1458082.1458237
M3  - Conference contribution
AN  - SCOPUS:70349243805
SN  - 9781595939913
T3  - International Conference on Information and Knowledge Management, Proceedings
SP  - 1173
EP  - 1182
BT  - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
PB  - Association for Computing Machinery (ACM)
T2  - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2  - 26 October 2008 through 30 October 2008
ER  -
