A densitometric approach to web page segmentation

Christian Kohlschütter; Wolfgang Nejdl

doi:10.1145/1458082.1458237

Details

Original language	English
Title of host publication	Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Publisher	Association for Computing Machinery (ACM)
Pages	1173-1182
Number of pages	10
ISBN (Print)	9781595939913
Publication status	Published - 26 Oct 2008
Event	17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States. Duration: 26 Oct 2008 → 30 Oct 2008

Publication series

Name	International Conference on Information and Knowledge Management, Proceedings

Abstract

Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
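
The idea of treating segmentation as a 1D partitioning task over text density can be illustrated with a minimal sketch. It is not the authors' implementation: the density measure (tokens per line after wrapping at a fixed width), the wrap width of 80, the threshold of 0.4, and the greedy merge rule are all assumptions chosen for illustration, and the names `text_density` and `fuse_blocks` are hypothetical.

```python
import textwrap

WRAP_WIDTH = 80   # assumed wrap width, not taken from this record
THRESHOLD = 0.4   # assumed maximum relative density difference for a merge


def text_density(text, width=WRAP_WIDTH):
    """Return tokens per wrapped line (an assumed density measure)."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return 0.0
    return len(text.split()) / len(lines)


def fuse_blocks(blocks, threshold=THRESHOLD):
    """Greedily merge adjacent text blocks whose densities are similar.

    Blocks are processed left to right, so the page is treated as a
    one-dimensional sequence and segmentation becomes a partitioning of it.
    """
    segments = []
    for block in blocks:
        if segments:
            prev = segments[-1]
            d_prev, d_cur = text_density(prev), text_density(block)
            highest = max(d_prev, d_cur)
            # Merge when the relative density difference is small enough.
            if highest > 0 and abs(d_prev - d_cur) / highest <= threshold:
                segments[-1] = prev + " " + block
                continue
        segments.append(block)
    return segments


if __name__ == "__main__":
    paragraph = " ".join(["word"] * 64)
    for segment in fuse_blocks(["Home", paragraph, paragraph]):
        print(round(text_density(segment), 2), segment[:40])
```

In this sketch a short navigation label such as "Home" has a low density and stays separate, while two adjacent long paragraphs have similar densities and are fused into one segment.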

ASJC Scopus subject areas

Decision Sciences (all)
General Decision Sciences
Business, Management and Accounting (all)
General Business, Management and Accounting

Cite this

A densitometric approach to web page segmentation. / Kohlschütter, Christian; Nejdl, Wolfgang.
Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. p. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Kohlschütter, C & Nejdl, W 2008, A densitometric approach to web page segmentation. in Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery (ACM), pp. 1173-1182, 17th ACM Conference on Information and Knowledge Management, CIKM'08, Napa Valley, CA, United States, 26 Oct 2008. https://doi.org/10.1145/1458082.1458237

Kohlschütter, C., & Nejdl, W. (2008). A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08 (pp. 1173-1182). (International Conference on Information and Knowledge Management, Proceedings). Association for Computing Machinery (ACM). https://doi.org/10.1145/1458082.1458237

Kohlschütter C, Nejdl W. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM). 2008. p. 1173-1182. (International Conference on Information and Knowledge Management, Proceedings). doi: 10.1145/1458082.1458237

Kohlschütter, Christian ; Nejdl, Wolfgang. / A densitometric approach to web page segmentation. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08. Association for Computing Machinery (ACM), 2008. p. 1173-1182 (International Conference on Information and Knowledge Management, Proceedings).

Download

@inproceedings{08e490ab8d9e4219a3474dde051308a6,

title = "A densitometric approach to web page segmentation",

abstract = "Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.",

keywords = "Full-text extraction, Noise removal, Template detection, Web page Segmentation",

author = "Christian Kohlsch{\"u}tter and Wolfgang Nejdl",

year = "2008",

month = oct,

day = "26",

doi = "10.1145/1458082.1458237",

language = "English",

isbn = "9781595939913",

series = "International Conference on Information and Knowledge Management, Proceedings",

publisher = "Association for Computing Machinery (ACM)",

pages = "1173--1182",

booktitle = "Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08",

address = "United States",

note = "17th ACM Conference on Information and Knowledge Management, CIKM'08 ; Conference date: 26-10-2008 Through 30-10-2008",

}

Download

TY - GEN

T1 - A densitometric approach to web page segmentation

AU - Kohlschütter, Christian

AU - Nejdl, Wolfgang

PY - 2008/10/26

Y1 - 2008/10/26

N2 - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

AB - Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text-density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.

KW - Full-text extraction

KW - Noise removal

KW - Template detection

KW - Web page Segmentation

UR - http://www.scopus.com/inward/record.url?scp=70349243805&partnerID=8YFLogxK

U2 - 10.1145/1458082.1458237

DO - 10.1145/1458082.1458237

M3 - Conference contribution

AN - SCOPUS:70349243805

SN - 9781595939913

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 1173

EP - 1182

BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08

PB - Association for Computing Machinery (ACM)

T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08

Y2 - 26 October 2008 through 30 October 2008

ER -

By the same authors

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction