Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; peer review

Authors

  • Daniel Basaran
  • Eirini Ntoutsi
  • Arthur Zimek

Research Organisations

External Research Organisations

  • Ludwig-Maximilians-Universität München (LMU)
  • University of Southern Denmark

Details

Original language: English
Title of host publication: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM)
Editors: Nitesh Chawla, Wei Wang
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 390-398
Number of pages: 9
ISBN (electronic): 9781611974973
Publication status: Published - 2017
Event: 17th SIAM International Conference on Data Mining, SDM 2017 - Houston, United States
Duration: 27 Apr 2017 to 29 Apr 2017

Abstract

A collection of datasets crawled from Amazon, "Amazon reviews", is popular in the evaluation of recommendation systems. These datasets, however, contain redundancies (duplicated recommendations for variants of certain items). These redundancies went unnoticed in earlier use of these datasets and thus incurred to a certain extent wrong conclusions in the evaluation of algorithms tested on these datasets. We analyze the nature and amount of these redundancies and their impact on the evaluation of recommendation methods. While the general and obvious conclusion is that redundancies should be avoided and datasets should be carefully preprocessed, we observe more specifically that their impact depends on the complexity of the methods. With this work, we also want to raise the awareness of the importance of data quality, model understanding, and appropriate evaluation.

Cite this

Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets. / Basaran, Daniel; Ntoutsi, Eirini; Zimek, Arthur.
Proceedings of the 2017 SIAM International Conference on Data Mining (SDM). ed. / Nitesh Chawla; Wei Wang. Society for Industrial and Applied Mathematics Publications, 2017. p. 390-398.

Basaran, D, Ntoutsi, E & Zimek, A 2017, Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets. in N Chawla & W Wang (eds), Proceedings of the 2017 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics Publications, pp. 390-398, 17th SIAM International Conference on Data Mining, SDM 2017, Houston, United States, 27 Apr 2017. https://doi.org/10.1137/1.9781611974973.44
Basaran, D., Ntoutsi, E., & Zimek, A. (2017). Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets. In N. Chawla, & W. Wang (Eds.), Proceedings of the 2017 SIAM International Conference on Data Mining (SDM) (pp. 390-398). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611974973.44
Basaran D, Ntoutsi E, Zimek A. Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets. In Chawla N, Wang W, editors, Proceedings of the 2017 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics Publications. 2017. p. 390-398 doi: 10.1137/1.9781611974973.44
Basaran, Daniel ; Ntoutsi, Eirini ; Zimek, Arthur. / Redundancies in Data and their Effect on the Evaluation of Recommendation Systems : A Case Study on the Amazon Reviews Datasets. Proceedings of the 2017 SIAM International Conference on Data Mining (SDM). editor / Nitesh Chawla ; Wei Wang. Society for Industrial and Applied Mathematics Publications, 2017. pp. 390-398
@inproceedings{9076c016f90a4b549e8f0e4cc4085e31,
title = "Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets",
abstract = "A collection of datasets crawled from Amazon, {"}Amazon reviews{"}, is popular in the evaluation of recommendation systems. These datasets, however, contain redundancies (duplicated recommendations for variants of certain items). These redundancies went unnoticed in earlier use of these datasets and thus incurred to a certain extent wrong conclusions in the evaluation of algorithms tested on these datasets. We analyze the nature and amount of these redundancies and their impact on the evaluation of recommendation methods. While the general and obvious conclusion is that redundancies should be avoided and datasets should be carefully preprocessed, we observe more specifically that their impact depends on the complexity of the methods. With this work, we also want to raise the awareness of the importance of data quality, model understanding, and appropriate evaluation.",
author = "Daniel Basaran and Eirini Ntoutsi and Arthur Zimek",
note = "Publisher Copyright: Copyright {\textcopyright} by SIAM. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.; 17th SIAM International Conference on Data Mining, SDM 2017 ; Conference date: 27-04-2017 Through 29-04-2017",
year = "2017",
doi = "10.1137/1.9781611974973.44",
language = "English",
pages = "390--398",
editor = "Nitesh Chawla and Wei Wang",
booktitle = "Proceedings of the 2017 SIAM International Conference on Data Mining (SDM)",
publisher = "Society for Industrial and Applied Mathematics Publications",
address = "United States",

}

TY  - GEN
T1  - Redundancies in Data and their Effect on the Evaluation of Recommendation Systems
T2  - 17th SIAM International Conference on Data Mining, SDM 2017
AU  - Basaran, Daniel
AU  - Ntoutsi, Eirini
AU  - Zimek, Arthur
N1  - Publisher Copyright: Copyright © by SIAM. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.
PY  - 2017
Y1  - 2017
N2  - A collection of datasets crawled from Amazon, "Amazon reviews", is popular in the evaluation of recommendation systems. These datasets, however, contain redundancies (duplicated recommendations for variants of certain items). These redundancies went unnoticed in earlier use of these datasets and thus incurred to a certain extent wrong conclusions in the evaluation of algorithms tested on these datasets. We analyze the nature and amount of these redundancies and their impact on the evaluation of recommendation methods. While the general and obvious conclusion is that redundancies should be avoided and datasets should be carefully preprocessed, we observe more specifically that their impact depends on the complexity of the methods. With this work, we also want to raise the awareness of the importance of data quality, model understanding, and appropriate evaluation.
AB  - A collection of datasets crawled from Amazon, "Amazon reviews", is popular in the evaluation of recommendation systems. These datasets, however, contain redundancies (duplicated recommendations for variants of certain items). These redundancies went unnoticed in earlier use of these datasets and thus incurred to a certain extent wrong conclusions in the evaluation of algorithms tested on these datasets. We analyze the nature and amount of these redundancies and their impact on the evaluation of recommendation methods. While the general and obvious conclusion is that redundancies should be avoided and datasets should be carefully preprocessed, we observe more specifically that their impact depends on the complexity of the methods. With this work, we also want to raise the awareness of the importance of data quality, model understanding, and appropriate evaluation.
UR  - http://www.scopus.com/inward/record.url?scp=85027880582&partnerID=8YFLogxK
U2  - 10.1137/1.9781611974973.44
DO  - 10.1137/1.9781611974973.44
M3  - Conference contribution
AN  - SCOPUS:85027880582
SP  - 390
EP  - 398
BT  - Proceedings of the 2017 SIAM International Conference on Data Mining (SDM)
A2  - Chawla, Nitesh
A2  - Wang, Wei
PB  - Society for Industrial and Applied Mathematics Publications
Y2  - 27 April 2017 through 29 April 2017
ER  -