A densitometric approach to web page segmentation

Research output: Chapter in book/report/conference proceeding, Conference contribution, Research, peer review

Authors

Christian Kohlschütter, Wolfgang Nejdl

Details

Original language: English
Title of host publication: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Publisher: Association for Computing Machinery (ACM)
Pages: 1173-1182
Number of pages: 10
ISBN (print): 9781595939913
Publication status: Published - 26 Oct 2008
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 2008 to 30 Oct 2008

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Abstract

Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

Keywords

    Full-text extraction, Noise removal, Template detection, Web page Segmentation

Cite this

A densitometric approach to web page segmentation. / Kohlschütter, Christian; Nejdl, Wolfgang.
Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. p. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).

Kohlschütter, C & Nejdl, W 2008, A densitometric approach to web page segmentation. in Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery (ACM), pp. 1173-1182, 17th ACM Conference on Information and Knowledge Management, CIKM'08, Napa Valley, CA, United States, 26 Oct 2008. https://doi.org/10.1145/1458082.1458237
Kohlschütter, C., & Nejdl, W. (2008). A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 (pp. 1173-1182). (International Conference on Information and Knowledge Management, Proceedings). Association for Computing Machinery (ACM). https://doi.org/10.1145/1458082.1458237
Kohlschütter C, Nejdl W. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM). 2008. p. 1173-1182. (International Conference on Information and Knowledge Management, Proceedings). doi: 10.1145/1458082.1458237
Kohlschütter, Christian ; Nejdl, Wolfgang. / A densitometric approach to web page segmentation. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. pp. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).
@inproceedings{08e490ab8d9e4219a3474dde051308a6,
title = "A densitometric approach to web page segmentation",
abstract = "Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.",
keywords = "Full-text extraction, Noise removal, Template detection, Web page Segmentation",
author = "Christian Kohlsch{\"u}tter and Wolfgang Nejdl",
year = "2008",
month = oct,
day = "26",
doi = "10.1145/1458082.1458237",
language = "English",
isbn = "9781595939913",
series = "International Conference on Information and Knowledge Management, Proceedings",
publisher = "Association for Computing Machinery (ACM)",
pages = "1173--1182",
booktitle = "Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08",
address = "United States",
note = "17th ACM Conference on Information and Knowledge Management, CIKM'08 ; Conference date: 26-10-2008 Through 30-10-2008"
}

TY - GEN

T1 - A densitometric approach to web page segmentation

AU - Kohlschütter, Christian

AU - Nejdl, Wolfgang

PY - 2008/10/26

Y1 - 2008/10/26

N2 - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

AB - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

KW - Full-text extraction

KW - Noise removal

KW - Template detection

KW - Web page Segmentation

UR - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK

U2 - 10.1145/1458082.1458237

DO - 10.1145/1458082.1458237

M3 - Conference contribution

AN - SCOPUS:70349243805

SN - 9781595939913

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 1173

EP - 1182

BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08

PB - Association for Computing Machinery (ACM)

T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08

Y2 - 26 October 2008 through 30 October 2008

ER -
