Details
Original language | English |
---|---|
Title of host publication | The Semantic Web |
Subtitle of host publication | Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings |
Pages | 136-150 |
Number of pages | 15 |
Edition | PART 2 |
Publication status | Published - 14 Jul 2010 |
Event | 7th Extended Semantic Web Conference, ESWC 2010 - Heraklion, Crete, Greece; Duration: 30 May 2010 → 3 Jun 2010 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Number | PART 2 |
Volume | 6089 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Abstract
Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
Keywords
- data integration, near duplicate detection
ASJC Scopus subject areas
- Mathematics(all)
- Theoretical Computer Science
- Computer Science(all)
Cite this
Ioannou, Ekaterini; Papapetrou, Odysseas; Skoutas, Dimitrios; Nejdl, Wolfgang. Efficient Semantic-Aware Detection of Near Duplicate Resources. In: The Semantic Web: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings. PART 2. ed. 2010. p. 136-150 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6089 LNCS, No. PART 2).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Efficient Semantic-Aware Detection of Near Duplicate Resources
AU - Ioannou, Ekaterini
AU - Papapetrou, Odysseas
AU - Skoutas, Dimitrios
AU - Nejdl, Wolfgang
PY - 2010/7/14
Y1 - 2010/7/14
KW - data integration
KW - near duplicate detection
UR - http://www.scopus.com/inward/record.url?scp=77954405011&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-13489-0_10
DO - 10.1007/978-3-642-13489-0_10
M3 - Conference contribution
AN - SCOPUS:77954405011
SN - 3642134882
SN - 9783642134883
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 136
EP - 150
BT - The Semantic Web
T2 - 7th Extended Semantic Web Conference, ESWC 2010
Y2 - 30 May 2010 through 3 June 2010
ER -