Details
Originalsprache | Englisch |
---|---|
Titel des Sammelwerks | Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings |
Herausgeber (Verlag) | Springer Verlag |
Seiten | 59-73 |
Seitenumfang | 15 |
Auflage | PART 2 |
ISBN (Print) | 9783319100845 |
Publikationsstatus | Veröffentlicht - 2014 |
Veranstaltung | 25th International Conference on Database and Expert Systems Applications, DEXA 2014 - Munich, Deutschland Dauer: 1 Sept. 2014 → 4 Sept. 2014 |
Publikationsreihe
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Nummer | PART 2 |
Band | 8645 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (elektronisch) | 1611-3349 |
Abstract
The problem of near-duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound of the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points would be crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
ASJC Scopus Sachgebiete
- Mathematik (insg.)
- Theoretische Informatik
- Informatik (insg.)
- Allgemeine Computerwissenschaft
Zitieren
- Standard
- Harvard
- Apa
- Vancouver
- BibTex
- RIS
Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings. PART 2. Aufl. Springer Verlag, 2014. S. 59-73 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Band 8645 LNCS, Nr. PART 2).
Publikation: Beitrag in Buch/Bericht/Sammelwerk/Konferenzband › Aufsatz in Konferenzband › Forschung › Peer-Review
}
TY - GEN
T1 - Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces
AU - Fisichella, Marco
AU - Ceroni, Andrea
AU - Deng, Fan
AU - Nejdl, Wolfgang
PY - 2014
Y1 - 2014
N2 - The problem of near-duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound of the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points would be crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
AB - The problem of near-duplicate detection consists in finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving such problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would take advantage of detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound of the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead introduced by SimPair LSH to compute near-duplicate pairs in advance is rewarded by the possibility of using that information to prune the candidate set, predicting the number of pruned points would be crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
KW - high-dimensional data sets
KW - Indexing methods
KW - Locality Sensitive Hashing
KW - Near-duplicate detection
KW - Query
UR - http://www.scopus.com/inward/record.url?scp=84958538278&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-10085-2_5
DO - 10.1007/978-3-319-10085-2_5
M3 - Conference contribution
AN - SCOPUS:84958538278
SN - 9783319100845
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 59
EP - 73
BT - Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings
PB - Springer Verlag
T2 - 25th International Conference on Database and Expert Systems Applications, DEXA 2014
Y2 - 1 September 2014 through 4 September 2014
ER -