Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

Research output: Chapter in book/report/conference proceedingConference contributionResearchpeer review

Research Organisations

View graph of relations

Details

Original languageEnglish
Title of host publicationDatabase and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings
Pages152-166
Number of pages15
EditionPART 1
Publication statusPublished - 8 Nov 2010
Event21st International Conference on Database and Expert Systems Applications, DEXA 2010 - Bilbao, Spain
Duration: 30 Aug 20103 Sept 2010

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
NumberPART 1
Volume6261 LNCS
ISSN (Print)0302-9743
ISSN (electronic)1611-3349

Abstract

In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.

ASJC Scopus subject areas

Cite this

Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. / Fisichella, Marco; Deng, Fan; Nejdl, Wolfgang.
Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1. ed. 2010. p. 152-166 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6261 LNCS, No. PART 1).

Research output: Chapter in book/report/conference proceedingConference contributionResearchpeer review

Fisichella, M, Deng, F & Nejdl, W 2010, Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. in Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 1, vol. 6261 LNCS, pp. 152-166, 21st International Conference on Database and Expert Systems Applications, DEXA 2010, Bilbao, Spain, 30 Aug 2010. https://doi.org/10.1007/978-3-642-15364-8_11
Fisichella, M., Deng, F., & Nejdl, W. (2010). Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. In Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings (PART 1 ed., pp. 152-166). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6261 LNCS, No. PART 1). https://doi.org/10.1007/978-3-642-15364-8_11
Fisichella M, Deng F, Nejdl W. Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. In Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 ed. 2010. p. 152-166. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1). doi: 10.1007/978-3-642-15364-8_11
Fisichella, Marco ; Deng, Fan ; Nejdl, Wolfgang. / Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1. ed. 2010. pp. 152-166 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1).
Download
@inproceedings{55744aa118964bb194b0478a7de15519,
title = "Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing",
abstract = "In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.",
author = "Marco Fisichella and Fan Deng and Wolfgang Nejdl",
year = "2010",
month = nov,
day = "8",
doi = "10.1007/978-3-642-15364-8_11",
language = "English",
isbn = "3642153631",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
number = "PART 1",
pages = "152--166",
booktitle = "Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings",
edition = "PART 1",
note = "21st International Conference on Database and Expert Systems Applications, DEXA 2010 ; Conference date: 30-08-2010 Through 03-09-2010",

}

Download

TY - GEN

T1 - Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

AU - Fisichella, Marco

AU - Deng, Fan

AU - Nejdl, Wolfgang

PY - 2010/11/8

Y1 - 2010/11/8

N2 - In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.

AB - In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.

UR - http://www.scopus.com/inward/record.url?scp=78049371730&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-15364-8_11

DO - 10.1007/978-3-642-15364-8_11

M3 - Conference contribution

AN - SCOPUS:78049371730

SN - 3642153631

SN - 9783642153631

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 152

EP - 166

BT - Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings

T2 - 21st International Conference on Database and Expert Systems Applications, DEXA 2010

Y2 - 30 August 2010 through 3 September 2010

ER -

By the same author(s)