Efficient Semantic-Aware Detection of Near Duplicate Resources

Ekaterini Ioannou; Odysseas Papapetrou; Dimitrios Skoutas; Wolfgang Nejdl

doi:10.1007/978-3-642-13489-0_10

Details

Original language	English
Title of host publication	The Semantic Web
Subtitle of host publication	Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings
Pages	136-150
Number of pages	15
Publication status	Published - 14 Jul 2010
Event	7th Extended Semantic Web Conference, ESWC 2010 - Heraklion, Crete, Greece
Duration	30 May 2010 → 3 Jun 2010

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number	PART 2
Volume	6089 LNCS
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
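The abstract does not name the indexing scheme or give the probabilistic analysis, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes MinHash signatures with banded locality-sensitive hashing (a standard similarity-search index, not necessarily the one used in the paper), treats each resource as a set of RDF triples, and uses the textbook banding formula 1 - (1 - s^r)^b to show how an application might pick parameters for a target detection probability. All function names, the `ex:` identifiers, and the parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    # Deterministic 64-bit hash of a string, salted with a seed (assumed helper).
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(triples, num_hashes):
    # One minimum per seeded hash function over the (non-empty) set of triples.
    return [
        min(_hash(seed, "|".join(triple)) for triple in triples)
        for seed in range(num_hashes)
    ]


def candidate_pairs(resources, bands, rows):
    # Resources whose signatures agree on every row of at least one band
    # land in the same bucket and become candidate near duplicates.
    buckets = defaultdict(set)
    for name, triples in resources.items():
        signature = minhash_signature(triples, bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def detection_probability(similarity, bands, rows):
    # Standard banding result: chance that a pair with the given Jaccard
    # similarity shares at least one bucket.
    return 1.0 - (1.0 - similarity ** rows) ** bands


def jaccard(a, b):
    # Exact Jaccard similarity, used to verify candidates.
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical RDF descriptions of three news articles.
    shared = {
        ("ex:article", "ex:topic", "ex:election"),
        ("ex:article", "ex:place", "ex:Athens"),
        ("ex:article", "ex:person", "ex:SomeMinister"),
    }
    resources = {
        "agency_a": shared | {("ex:article", "ex:source", "ex:AgencyA")},
        "agency_b": shared | {("ex:article", "ex:source", "ex:AgencyB")},
        "sports": {("ex:article", "ex:topic", "ex:football")},
    }
    for left, right in sorted(candidate_pairs(resources, bands=20, rows=2)):
        print(left, right, jaccard(resources[left], resources[right]))
    print(detection_probability(0.6, bands=20, rows=2))
```

In this sketch, raising `bands` makes it more likely that truly similar pairs are found, while raising `rows` reduces spurious candidates; an application with a fixed quality requirement could evaluate `detection_probability` at its similarity threshold and choose the two values accordingly.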

Keywords

data integration, near duplicate detection

ASJC Scopus subject areas

Mathematics(all)
Theoretical Computer Science
Computer Science(all)
General Computer Science

Cite this

Efficient Semantic-Aware Detection of Near Duplicate Resources. / Ioannou, Ekaterini; Papapetrou, Odysseas; Skoutas, Dimitrios et al.
The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings. PART 2. ed. 2010. p. 136-150 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6089 LNCS, No. PART 2).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Ioannou, E, Papapetrou, O, Skoutas, D & Nejdl, W 2010, Efficient Semantic-Aware Detection of Near Duplicate Resources. in The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings. PART 2 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 2, vol. 6089 LNCS, pp. 136-150, 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, 30 May 2010. https://doi.org/10.1007/978-3-642-13489-0_10

Ioannou, E., Papapetrou, O., Skoutas, D., & Nejdl, W. (2010). Efficient Semantic-Aware Detection of Near Duplicate Resources. In The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings (PART 2 ed., pp. 136-150). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6089 LNCS, No. PART 2). https://doi.org/10.1007/978-3-642-13489-0_10

Ioannou E, Papapetrou O, Skoutas D, Nejdl W. Efficient Semantic-Aware Detection of Near Duplicate Resources. In The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings. PART 2 ed. 2010. p. 136-150. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 2). doi: 10.1007/978-3-642-13489-0_10

Ioannou, Ekaterini ; Papapetrou, Odysseas ; Skoutas, Dimitrios et al. / Efficient Semantic-Aware Detection of Near Duplicate Resources. The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings. PART 2. ed. 2010. pp. 136-150 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 2).

@inproceedings{daa3743ce2d04f30a2a54d12f3259a9b,
  title = "Efficient Semantic-Aware Detection of Near Duplicate Resources",
  abstract = "Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.",
  keywords = "data integration, near duplicate detection",
  author = "Ekaterini Ioannou and Odysseas Papapetrou and Dimitrios Skoutas and Wolfgang Nejdl",
  year = "2010",
  month = jul,
  day = "14",
  doi = "10.1007/978-3-642-13489-0_10",
  language = "English",
  isbn = "3642134882",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  number = "PART 2",
  pages = "136--150",
  booktitle = "The Semantic Web",
  edition = "PART 2",
  note = "7th Extended Semantic Web Conference, ESWC 2010 ; Conference date: 30-05-2010 Through 03-06-2010",
}

TY  - GEN
T1  - Efficient Semantic-Aware Detection of Near Duplicate Resources
AU  - Ioannou, Ekaterini
AU  - Papapetrou, Odysseas
AU  - Skoutas, Dimitrios
AU  - Nejdl, Wolfgang
PY  - 2010/7/14
Y1  - 2010/7/14
N2  - Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
AB  - Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
KW  - data integration
KW  - near duplicate detection
UR  - http://www.scopus.com/inward/record.url?scp=77954405011&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-13489-0_10
DO  - 10.1007/978-3-642-13489-0_10
M3  - Conference contribution
AN  - SCOPUS:77954405011
SN  - 3642134882
SN  - 9783642134883
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 136
EP  - 150
BT  - The Semantic Web
T2  - 7th Extended Semantic Web Conference, ESWC 2010
Y2  - 30 May 2010 through 3 June 2010
ER  -

By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction