On the Impact of Dataset Size: A Twitter Classification Case Study

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Thi Huyen Nguyen
  • Hoang H. Nguyen
  • Zahra Ahmadi
  • Tuan Anh Hoang
  • Thanh Nam Doan

External Research Organisations

  • Vietnam National University
  • University of Tennessee, Chattanooga

Details

Original language: English
Title of host publication: WI-IAT '21
Subtitle of host publication: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Publisher: Association for Computing Machinery (ACM)
Pages: 210-217
Number of pages: 8
ISBN (electronic): 9781450391153
Publication status: Published - 13 Apr 2022
Event: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 - Virtual, Online, Australia
Duration: 14 Dec 2021 - 17 Dec 2021

Publication series

Name: ACM International Conference Proceeding Series

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. However, collecting more labeled data is not an easy task, owing to difficulties such as manual labeling costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods can be highly useful when selecting a learning model for a specific task. Yet the literature lacks such a thorough and systematic study. In this paper, we focus on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to observe how the performance of supervised learning models improves as data size increases on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much better recent deep learning models are than traditional machine learning approaches at various data sizes. Our extensive experiments show that (a) recent pre-trained models have largely overcome big-data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
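The learning-curve methodology the abstract describes can be sketched as follows. This is an illustrative example only, not the authors' code: the toy corpus, the model choice (TF-IDF plus logistic regression as a traditional baseline), and the training-size grid are all assumptions standing in for the paper's Twitter datasets and classifiers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for short, noisy tweets: binary sentiment labels.
pos = ["love this", "great day", "so happy", "awesome stuff", "really good"]
neg = ["hate this", "bad day", "so sad", "awful stuff", "really terrible"]
texts = [f"{rng.choice(pos)} {rng.choice(pos)}" for _ in range(300)]
texts += [f"{rng.choice(neg)} {rng.choice(neg)}" for _ in range(300)]
labels = np.array([1] * 300 + [0] * 300)

# Fixed held-out test set; only the training subset grows.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

results = {}
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(len(X_train) * frac)
    vec = TfidfVectorizer()        # the text representation under study
    clf = LogisticRegression()     # traditional ML baseline
    clf.fit(vec.fit_transform(X_train[:n]), y_train[:n])
    results[n] = f1_score(y_test, clf.predict(vec.transform(X_test)))
    print(f"train size {n:4d} -> F1 {results[n]:.3f}")
```

Repeating this loop for several models (e.g., a pre-trained transformer versus a bag-of-words baseline) produces the kind of learning curves the paper compares; where a curve plateaus indicates when adding more labeled data stops paying off.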

Keywords

    dataset size, empirical study, extrapolation methods, machine learning, neural network, Twitter classification

Cite this

On the Impact of Dataset Size: A Twitter Classification Case Study. / Nguyen, Thi Huyen; Nguyen, Hoang H.; Ahmadi, Zahra et al.
WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Association for Computing Machinery (ACM), 2022. p. 210-217 (ACM International Conference Proceeding Series).

BibTeX
@inproceedings{bc80338fce4a4b688d1ce9f69cb3293c,
title = "On the Impact of Dataset Size: A Twitter Classification Case Study",
abstract = "The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraint. Hence, a comprehensive study on the relation between training set size and the classification performance of different methods could be essentially useful in the selection of a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models with the increase of data sizes on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. Besides, we study how significantly better the recent deep learning models are compared to traditional machine learning approaches in the case of various data sizes. Our extensive experiments show (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.",
keywords = "dataset size, empirical study, extrapolation methods, machine learning, neural network, Twitter classification",
author = "Nguyen, {Thi Huyen} and Nguyen, {Hoang H.} and Zahra Ahmadi and Hoang, {Tuan Anh} and Doan, {Thanh Nam}",
note = "Funding Information: This work is supported by the DFG Grant (NI-1760/1-1) Managed Forgetting, the European Union{\textquoteright}s Horizon 2020 research and innovation program under grant agreement No. 832921 (project MIRROR), and No. 833635 (project ROXANNE).; 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 ; Conference date: 14-12-2021 Through 17-12-2021",
year = "2022",
month = apr,
day = "13",
doi = "10.1145/3486622.3493960",
language = "English",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery (ACM)",
pages = "210--217",
booktitle = "WI-IAT '21",
address = "United States",
}
