On the Impact of Dataset Size: A Twitter Classification Case Study

Publication: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Authors

  • Thi Huyen Nguyen
  • Hoang H. Nguyen
  • Zahra Ahmadi
  • Tuan Anh Hoang
  • Thanh Nam Doan

Organisational units

External organisations

  • Vietnam National University
  • University of Tennessee, Chattanooga

Details

Original language: English
Title of host publication: WI-IAT '21
Subtitle: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Publisher: Association for Computing Machinery (ACM)
Pages: 210-217
Number of pages: 8
ISBN (electronic): 9781450391153
Publication status: Published - 13 Apr 2022
Event: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 - Virtual, Online, Australia
Duration: 14 Dec 2021 - 17 Dec 2021

Publication series

Name: ACM International Conference Proceeding Series

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. In practice, however, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods can be highly useful when selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe how the performance of supervised learning models improves as the training set grows on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. We also study how significantly recent deep learning models outperform traditional machine learning approaches across various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big-data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
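
The protocol the abstract describes (subsample the labeled training data at increasing sizes, retrain a model at each size, and record test performance) can be illustrated with a short learning-curve sketch. The Python code below is not the authors' implementation: the TF-IDF + logistic regression baseline, the macro-F1 metric, and the inverse power-law extrapolation (suggested by the "extrapolation methods" keyword) are illustrative assumptions.

# Minimal learning-curve sketch (illustrative assumptions, not the paper's code):
# train a TF-IDF + logistic regression baseline on growing subsets of tweets,
# record macro-F1 at each size, and fit an inverse power law to extrapolate.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def learning_curve(train_texts, train_labels, test_texts, test_labels, sizes, seed=42):
    """Macro-F1 of a TF-IDF + logistic regression baseline at each training size."""
    rng = np.random.default_rng(seed)
    scores = []
    for n in sizes:
        # Draw a random subset of n labeled training examples.
        idx = rng.choice(len(train_texts), size=n, replace=False)
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                              LogisticRegression(max_iter=1000))
        model.fit([train_texts[i] for i in idx], [train_labels[i] for i in idx])
        # Evaluate on a fixed held-out test set so sizes are comparable.
        scores.append(f1_score(test_labels, model.predict(test_texts), average="macro"))
    return np.array(scores)

def extrapolate(sizes, scores, target_size):
    """Fit f(n) = a - b * n**(-c) to the observed curve and predict the score at target_size."""
    popt, _ = curve_fit(lambda n, a, b, c: a - b * n ** (-c),
                        np.asarray(sizes, dtype=float), scores,
                        p0=[0.9, 1.0, 0.5], maxfev=10000)
    a, b, c = popt
    return a - b * float(target_size) ** (-c)

Running the same loop with a pre-trained transformer in place of the baseline gives the kind of model-versus-data-size comparison the abstract refers to; the rest of the pipeline is unchanged.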

ASJC Scopus subject areas

Cite this

On the Impact of Dataset Size: A Twitter Classification Case Study. / Nguyen, Thi Huyen; Nguyen, Hoang H.; Ahmadi, Zahra et al.
WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Association for Computing Machinery (ACM), 2022. pp. 210-217 (ACM International Conference Proceeding Series).


Nguyen, TH, Nguyen, HH, Ahmadi, Z, Hoang, TA & Doan, TN 2022, On the Impact of Dataset Size: A Twitter Classification Case Study. in WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. ACM International Conference Proceeding Series, Association for Computing Machinery (ACM), pp. 210-217, 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021, Virtual, Online, Australia, 14 Dec 2021. https://doi.org/10.1145/3486622.3493960
Nguyen, T. H., Nguyen, H. H., Ahmadi, Z., Hoang, T. A., & Doan, T. N. (2022). On the Impact of Dataset Size: A Twitter Classification Case Study. In WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (pp. 210-217). (ACM International Conference Proceeding Series). Association for Computing Machinery (ACM). https://doi.org/10.1145/3486622.3493960
Nguyen TH, Nguyen HH, Ahmadi Z, Hoang TA, Doan TN. On the Impact of Dataset Size: A Twitter Classification Case Study. in WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Association for Computing Machinery (ACM). 2022. pp. 210-217. (ACM International Conference Proceeding Series). doi: 10.1145/3486622.3493960
Nguyen, Thi Huyen ; Nguyen, Hoang H. ; Ahmadi, Zahra et al. / On the Impact of Dataset Size: A Twitter Classification Case Study. WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Association for Computing Machinery (ACM), 2022. pp. 210-217 (ACM International Conference Proceeding Series).
@inproceedings{bc80338fce4a4b688d1ce9f69cb3293c,
title = "On the Impact of Dataset Size: A Twitter Classification Case Study",
abstract = "The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraint. Hence, a comprehensive study on the relation between training set size and the classification performance of different methods could be essentially useful in the selection of a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models with the increase of data sizes on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. Besides, we study how significantly better the recent deep learning models are compared to traditional machine learning approaches in the case of various data sizes. Our extensive experiments show (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.",
keywords = "dataset size, empirical study, extrapolation methods, machine learning, neural network, Twitter classification",
author = "Nguyen, {Thi Huyen} and Nguyen, {Hoang H.} and Zahra Ahmadi and Hoang, {Tuan Anh} and Doan, {Thanh Nam}",
note = "Funding Information: This work is supported by the DFG Grant (NI-1760/1-1) Managed Forgetting, the European Union{\textquoteright}s Horizon 2020 research and innovation program under grant agreement No. 832921 (project MIRROR), and No. 833635 (project ROXANNE).; 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 ; Conference date: 14-12-2021 Through 17-12-2021",
year = "2022",
month = apr,
day = "13",
doi = "10.1145/3486622.3493960",
language = "English",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery (ACM)",
pages = "210--217",
booktitle = "WI-IAT '21",
address = "United States",

}


TY - GEN

T1 - On the Impact of Dataset Size

T2 - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021

AU - Nguyen, Thi Huyen

AU - Nguyen, Hoang H.

AU - Ahmadi, Zahra

AU - Hoang, Tuan Anh

AU - Doan, Thanh Nam

N1 - Funding Information: This work is supported by the DFG Grant (NI-1760/1-1) Managed Forgetting, the European Union’s Horizon 2020 research and innovation program under grant agreement No. 832921 (project MIRROR), and No. 833635 (project ROXANNE).

PY - 2022/4/13

Y1 - 2022/4/13

N2 - The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraint. Hence, a comprehensive study on the relation between training set size and the classification performance of different methods could be essentially useful in the selection of a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models with the increase of data sizes on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. Besides, we study how significantly better the recent deep learning models are compared to traditional machine learning approaches in the case of various data sizes. Our extensive experiments show (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.

AB - The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraint. Hence, a comprehensive study on the relation between training set size and the classification performance of different methods could be essentially useful in the selection of a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models with the increase of data sizes on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. Besides, we study how significantly better the recent deep learning models are compared to traditional machine learning approaches in the case of various data sizes. Our extensive experiments show (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.

KW - dataset size

KW - empirical study

KW - extrapolation methods

KW - machine learning

KW - neural network

KW - Twitter classification

UR - http://www.scopus.com/inward/record.url?scp=85128592543&partnerID=8YFLogxK

U2 - 10.1145/3486622.3493960

DO - 10.1145/3486622.3493960

M3 - Conference contribution

AN - SCOPUS:85128592543

T3 - ACM International Conference Proceeding Series

SP - 210

EP - 217

BT - WI-IAT '21

PB - Association for Computing Machinery (ACM)

Y2 - 14 December 2021 through 17 December 2021

ER -