A densitometric approach to web page segmentation

Christian Kohlschütter; Wolfgang Nejdl

doi:10.1145/1458082.1458237

Details

Original language	English
Title of host publication	Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Publisher	Association for Computing Machinery (ACM)
Pages	1173-1182
Number of pages	10
ISBN (print)	9781595939913
Publication status	Published - 26 Oct 2008
Event	17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration	26 Oct 2008 → 30 Oct 2008

Publication series

Name	International Conference on Information and Knowledge Management, Proceedings

Abstract

Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
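
The abstract does not define text density or the partitioning procedure, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that density can be approximated as tokens per line after word-wrapping a block at 80 characters, and that the 1D partitioning can be approximated by greedily merging neighbouring blocks whose densities differ by less than a threshold. The function names `text_density` and `fuse_blocks`, the wrap width, and the `max_delta` value are all hypothetical.

```python
import textwrap


def text_density(text: str, wrap_width: int = 80) -> float:
    """Tokens per wrapped line (assumed stand-in for a block's text density)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Word-wrap the normalised text and count the resulting lines.
    lines = textwrap.wrap(" ".join(tokens), width=wrap_width)
    return len(tokens) / len(lines)


def fuse_blocks(blocks: list[str], max_delta: float = 0.4) -> list[str]:
    """Greedily merge adjacent blocks with similar density (assumed rule)."""
    segments: list[str] = []
    for block in blocks:
        if segments:
            d_prev = text_density(segments[-1])
            d_cur = text_density(block)
            denom = max(d_prev, d_cur)
            # Relative density difference between the two neighbours.
            if denom > 0 and abs(d_prev - d_cur) / denom <= max_delta:
                segments[-1] = segments[-1] + " " + block
                continue
        segments.append(block)
    return segments
```

Under these assumptions, a one-word navigation label has a density of 1.0 while a multi-line paragraph has a much higher one, so a call such as `fuse_blocks(["Home", paragraph_a, paragraph_b])` keeps the label separate and merges the two paragraphs into one segment.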

Keywords

Full-text extraction, Noise removal, Template detection, Web page Segmentation

ASJC Scopus subject areas

Decision Sciences (all)
General Decision Sciences
Business, Management and Accounting (all)
General Business, Management and Accounting

Cite this

A densitometric approach to web page segmentation. / Kohlschütter, Christian; Nejdl, Wolfgang.
Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. p. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Kohlschütter, C & Nejdl, W 2008, A densitometric approach to web page segmentation. in Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery (ACM), pp. 1173-1182, 17th ACM Conference on Information and Knowledge Management, CIKM'08, Napa Valley, CA, United States, 26 Oct 2008. https://doi.org/10.1145/1458082.1458237

Kohlschütter, C., & Nejdl, W. (2008). A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 (pp. 1173-1182). (International Conference on Information and Knowledge Management, Proceedings). Association for Computing Machinery (ACM). https://doi.org/10.1145/1458082.1458237

Kohlschütter C, Nejdl W. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM). 2008. p. 1173-1182. (International Conference on Information and Knowledge Management, Proceedings). doi: 10.1145/1458082.1458237

Kohlschütter, Christian ; Nejdl, Wolfgang. / A densitometric approach to web page segmentation. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. pp. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).


@inproceedings{08e490ab8d9e4219a3474dde051308a6,

title = "A densitometric approach to web page segmentation",

abstract = "Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.",

keywords = "Full-text extraction, Noise removal, Template detection, Web page Segmentation",

author = "Christian Kohlsch{\"u}tter and Wolfgang Nejdl",

year = "2008",

month = oct,

day = "26",

doi = "10.1145/1458082.1458237",

language = "English",

isbn = "9781595939913",

series = "International Conference on Information and Knowledge Management, Proceedings",

publisher = "Association for Computing Machinery (ACM)",

pages = "1173--1182",

booktitle = "Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08",

address = "United States",

note = "17th ACM Conference on Information and Knowledge Management, CIKM'08 ; Conference date: 26-10-2008 Through 30-10-2008",

}


TY - GEN

T1 - A densitometric approach to web page segmentation

AU - Kohlschütter, Christian

AU - Nejdl, Wolfgang

PY - 2008/10/26

Y1 - 2008/10/26

N2 - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

AB - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

KW - Full-text extraction

KW - Noise removal

KW - Template detection

KW - Web page Segmentation

UR - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK

U2 - 10.1145/1458082.1458237

DO - 10.1145/1458082.1458237

M3 - Conference contribution

AN - SCOPUS:70349243805

SN - 9781595939913

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 1173

EP - 1182

BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08

PB - Association for Computing Machinery (ACM)

T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08

Y2 - 26 October 2008 through 30 October 2008

ER -

By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction