Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces

Marco Fisichella; Andrea Ceroni; Fan Deng; Wolfgang Nejdl

doi:10.1007/978-3-319-10085-2_5

Details

Original language	English
Title of host publication	Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings
Publisher	Springer Verlag
Pages	59-73
Number of pages	15
ISBN (print)	9783319100845
Publication status	Published - 2014
Event	25th International Conference on Database and Expert Systems Applications, DEXA 2014 - Munich, Germany
Duration	1 Sept 2014 → 4 Sept 2014

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number	PART 2
Volume	8645 LNCS
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

The problem of near-duplicate detection consists of finding those elements within a data set that are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs by computing near-duplicate pairs in advance is repaid by using those pairs to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
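
The pruning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the pruning rule, so the triangle-inequality test below is an assumption, as are the Euclidean distance and every name used (`prune_candidates`, `similar_pairs`, `tau`). The candidate set is taken as given, standing in for the output of an LSH lookup. This is not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prune_candidates(query, candidates, points, similar_pairs, tau):
    """Check LSH candidates against a query, skipping those that can be pruned.

    query         -- the new input vector
    candidates    -- iterable of point ids (assumed to come from an LSH lookup)
    points        -- dict: point id -> vector
    similar_pairs -- dict: point id -> {neighbour id: precomputed distance}
                     (hypothetical layout for the stored near-duplicate pairs)
    tau           -- closeness threshold for reporting a near-duplicate

    Returns (near_duplicates, number_of_distance_computations).
    """
    pruned = set()
    near_duplicates = []
    computed = 0
    for pid in candidates:
        if pid in pruned:
            continue  # ruled out earlier without computing a distance
        d = euclidean(query, points[pid])
        computed += 1
        if d <= tau:
            near_duplicates.append(pid)
            continue
        # Assumed rule (triangle inequality): d(q, p') >= d(q, p) - d(p, p').
        # If that lower bound already exceeds tau, p' cannot be a near-duplicate.
        for other, d_pair in similar_pairs.get(pid, {}).items():
            if d - d_pair > tau:
                pruned.add(other)
    return near_duplicates, computed


# Toy data: points 1 and 2 are a stored near-duplicate pair far from the query.
points = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (10.5, 0.0), 3: (0.1, 0.0)}
similar_pairs = {1: {2: 0.5}, 2: {1: 0.5}}
near, computed = prune_candidates((0.0, 0.0), [1, 2, 3, 0], points, similar_pairs, tau=1.0)
```

In this toy run, checking point 1 rules out point 2 without a distance computation, so only three of the four candidates are compared against the query. The number of candidates skipped this way is the quantity the paper aims to bound from below.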

Keywords

high-dimensional data sets, Indexing methods, Locality Sensitive Hashing, Near-duplicate detection, Query

ASJC Scopus subject areas

Mathematics (all)
Theoretical Computer Science
Computer Science (all)
General Computer Science

Cite this

Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. / Fisichella, Marco; Ceroni, Andrea; Deng, Fan et al.
Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings. PART 2 ed. Springer Verlag, 2014. p. 59-73 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8645 LNCS, No. PART 2).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Fisichella, M, Ceroni, A, Deng, F & Nejdl, W 2014, Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. in Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings. PART 2 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 2, vol. 8645 LNCS, Springer Verlag, pp. 59-73, 25th International Conference on Database and Expert Systems Applications, DEXA 2014, Munich, Germany, 1 Sept 2014. https://doi.org/10.1007/978-3-319-10085-2_5

Fisichella, M., Ceroni, A., Deng, F., & Nejdl, W. (2014). Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. In Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings (PART 2 ed., pp. 59-73). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8645 LNCS, No. PART 2). Springer Verlag. https://doi.org/10.1007/978-3-319-10085-2_5

Fisichella M, Ceroni A, Deng F, Nejdl W. Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. In Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings. PART 2 ed. Springer Verlag. 2014. p. 59-73. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 2). doi: 10.1007/978-3-319-10085-2_5

Fisichella, Marco; Ceroni, Andrea; Deng, Fan et al. / Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings. PART 2 ed. Springer Verlag, 2014. pp. 59-73 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 2).

@inproceedings{5753fcb756bf43afb2a155f3023dbb9d,
title = "Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces",
abstract = "The problem of near-duplicate detection consists of finding those elements within a data set that are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs by computing near-duplicate pairs in advance is repaid by using those pairs to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.",
keywords = "high-dimensional data sets, Indexing methods, Locality Sensitive Hashing, Near-duplicate detection, Query",
author = "Marco Fisichella and Andrea Ceroni and Fan Deng and Wolfgang Nejdl",
year = "2014",
doi = "10.1007/978-3-319-10085-2_5",
language = "English",
isbn = "9783319100845",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
number = "PART 2",
pages = "59--73",
booktitle = "Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings",
address = "Germany",
edition = "PART 2",
note = "25th International Conference on Database and Expert Systems Applications, DEXA 2014 ; Conference date: 01-09-2014 Through 04-09-2014",
}

TY  - GEN
T1  - Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces
AU  - Fisichella, Marco
AU  - Ceroni, Andrea
AU  - Deng, Fan
AU  - Nejdl, Wolfgang
PY  - 2014
Y1  - 2014
AB  - The problem of near-duplicate detection consists of finding those elements within a data set that are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs by computing near-duplicate pairs in advance is repaid by using those pairs to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
KW  - high-dimensional data sets
KW  - Indexing methods
KW  - Locality Sensitive Hashing
KW  - Near-duplicate detection
KW  - Query
UR  - http://www.scopus.com/inward/record.url?scp=84958538278&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-10085-2_5
DO  - 10.1007/978-3-319-10085-2_5
M3  - Conference contribution
AN  - SCOPUS:84958538278
SN  - 9783319100845
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 59
EP  - 73
BT  - Database and Expert Systems Applications - 25th International Conference, DEXA 2014, Proceedings
PB  - Springer Verlag
T2  - 25th International Conference on Database and Expert Systems Applications, DEXA 2014
Y2  - 1 September 2014 through 4 September 2014
ER  - 

By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

A Trustworthy Approach to Classify and Analyze Epidemic-Related Information From Microblogs

LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction