Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

Marco Fisichella; Fan Deng; Wolfgang Nejdl

doi:10.1007/978-3-642-15364-8_11

Details

Original language	English
Title of host publication	Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings
Pages	152-166
Number of pages	15
Edition	PART 1
Publication status	Published - 8 Nov 2010
Event	21st International Conference on Database and Expert Systems Applications, DEXA 2010 - Bilbao, Spain. Duration: 30 Aug 2010 → 3 Sep 2010

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number	PART 1
Volume	6261 LNCS
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.
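The query-before-insert setting described in the abstract can be illustrated with a minimal sketch of the baseline only, that is, plain LSH without the paper's pruning step. The abstract does not specify how the stored near-duplicate pairs are used to prune candidates, so that part is deliberately not reproduced. Everything named here is an assumption made for illustration: the class `IncrementalLSH`, the method `query_then_insert`, the choice of Euclidean distance, the p-stable hash family h(v) = floor((a·v + b) / w), and all parameter values.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IncrementalLSH:
    """Hypothetical baseline: report near duplicates of a point, then insert it.

    Assumption: Euclidean distance with p-stable hashing. This is NOT the
    pruning scheme proposed in the paper, only the plain LSH it builds on.
    """

    def __init__(self, dim, threshold, num_tables=8, hashes_per_key=4,
                 bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.threshold = threshold
        self.bucket_width = bucket_width
        # Per table: a list of (a, b) pairs defining h(v) = floor((a.v + b) / w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)],
              rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_key)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, funcs, point):
        # Concatenate several hash values into one bucket key.
        return tuple(
            math.floor(
                (sum(a_i * x for a_i, x in zip(a, point)) + b) / self.bucket_width
            )
            for a, b in funcs
        )

    def query_then_insert(self, point):
        """Return ids of stored points within `threshold`, then store `point`."""
        keys = [self._key(funcs, point) for funcs in self.funcs]

        # Candidate set: every stored point sharing a bucket in any table.
        candidates = set()
        for table, key in zip(self.tables, keys):
            candidates.update(table.get(key, ()))

        # Verify candidates with the exact distance to drop false positives.
        duplicates = sorted(
            i for i in candidates
            if euclidean(self.points[i], point) <= self.threshold
        )

        # Only now is the new point inserted into the index.
        new_id = len(self.points)
        self.points.append(list(point))
        for table, key in zip(self.tables, keys):
            table[key].append(new_id)
        return duplicates


if __name__ == "__main__":
    index = IncrementalLSH(dim=3, threshold=0.5)
    print(index.query_then_insert([1.0, 2.0, 3.0]))
    print(index.query_then_insert([1.0, 2.0, 3.0]))
```

In this sketch an exact copy of a stored point always lands in the same buckets and is therefore always reported, whereas a merely close point is found only with some probability that depends on `num_tables`, `hashes_per_key`, and `bucket_width`. The verification loop over the candidate set is the cost that a pruning scheme such as the one described in the abstract would aim to reduce.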

ASJC Scopus subject areas

Mathematics (all)
Theoretical Computer Science
Computer Science (all)
General Computer Science

Cite this

Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. / Fisichella, Marco; Deng, Fan; Nejdl, Wolfgang.
Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 ed. 2010. pp. 152-166 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6261 LNCS, No. PART 1).

Publication: Chapter in book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed

Fisichella, M, Deng, F & Nejdl, W 2010, Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. in Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 1, vol. 6261 LNCS, pp. 152-166, 21st International Conference on Database and Expert Systems Applications, DEXA 2010, Bilbao, Spain, 30 Aug. 2010. https://doi.org/10.1007/978-3-642-15364-8_11

Fisichella, M., Deng, F., & Nejdl, W. (2010). Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. In Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings (PART 1 ed., pp. 152-166). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6261 LNCS, No. PART 1). https://doi.org/10.1007/978-3-642-15364-8_11

Fisichella M, Deng F, Nejdl W. Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. In Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 ed. 2010. p. 152-166. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1). doi: 10.1007/978-3-642-15364-8_11

Fisichella, Marco ; Deng, Fan ; Nejdl, Wolfgang. / Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing. Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings. PART 1 ed. 2010. pp. 152-166 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 1).

@inproceedings{55744aa118964bb194b0478a7de15519,

title = "Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing",

abstract = "In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.",

author = "Marco Fisichella and Fan Deng and Wolfgang Nejdl",

year = "2010",

month = nov,

day = "8",

doi = "10.1007/978-3-642-15364-8_11",

language = "English",

isbn = "3642153631",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

number = "PART 1",

pages = "152--166",

booktitle = "Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings",

edition = "PART 1",

note = "21st International Conference on Database and Expert Systems Applications, DEXA 2010 ; Conference date: 30-08-2010 Through 03-09-2010",

}

TY - GEN

T1 - Efficient Incremental Near Duplicate Detection Based on Locality Sensitive Hashing

AU - Fisichella, Marco

AU - Deng, Fan

AU - Nejdl, Wolfgang

PY - 2010/11/8

Y1 - 2010/11/8

AB - In this paper, we study the problem of detecting near duplicates for high dimensional data points in an incremental manner. For example, for an image sharing website, it would be a desirable feature if near-duplicates can be detected whenever a user uploads a new image into the website so that the user can take some action such as stopping the upload or reporting an illegal copy. Specifically, whenever a new point arrives, our goal is to find all points within an existing point set that are close to the new point based on a given distance function and a distance threshold before the new point is inserted into the data set. Based on a well-known indexing technique, Locality Sensitive Hashing, we propose a new approach which clearly speeds up the running time of LSH indexing while using only a small amount of extra space. The idea is to store a small fraction of near duplicate pairs within the existing point set which are found when they are inserted into the data set, and use them to prune LSH candidate sets for the newly arrived point. Extensive experiments based on three real-world data sets show that our method consistently outperforms the original LSH approach: to reach the same query response time, our method needs significantly less memory than the original LSH approach. Meanwhile, the LSH theoretical guarantee on the quality of the search result is preserved by our approach. Furthermore, it is easy to implement our approach based on LSH.

UR - http://www.scopus.com/inward/record.url?scp=78049371730&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-15364-8_11

DO - 10.1007/978-3-642-15364-8_11

M3 - Conference contribution

AN - SCOPUS:78049371730

SN - 3642153631

SN - 9783642153631

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 152

EP - 166

BT - Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings

T2 - 21st International Conference on Database and Expert Systems Applications, DEXA 2010

Y2 - 30 August 2010 through 3 September 2010

ER -
