Details
| Field | Value |
|---|---|
| Original language | English |
| Title of host publication | WI-IAT '21 |
| Subtitle of host publication | IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology |
| Publisher | Association for Computing Machinery (ACM) |
| Pages | 210-217 |
| Number of pages | 8 |
| ISBN (electronic) | 9781450391153 |
| Publication status | Published - 13 Apr 2022 |
| Event | 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 - Virtual, Online, Australia |
| Event duration | 14 Dec 2021 → 17 Dec 2021 |
Publication series
| Field | Value |
|---|---|
| Name | ACM International Conference Proceeding Series |
Abstract
The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is difficult for several reasons, such as manual labeling costs, data privacy, and computational constraints. Hence, a comprehensive study of the relationship between training set size and the classification performance of different methods would be highly useful for selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we focus on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to observe how the performance of supervised learning models improves as the training set grows, on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much better recent deep learning models are than traditional machine learning approaches across a range of data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big-data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
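The kind of learning-curve experiment the abstract describes can be sketched as follows: train the same classifier on increasing subsets of the labeled data and record test performance at each size. This is a minimal illustration only, not the paper's actual setup; the toy texts, the TF-IDF representation, and the logistic-regression model are all stand-ins chosen for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy labeled "tweets" (hypothetical stand-ins for a real Twitter corpus).
texts = ["great product love it", "terrible service never again",
         "absolutely fantastic experience", "worst purchase ever made",
         "really happy with this", "awful quality very disappointed"] * 50
labels = [1, 0, 1, 0, 1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels)

# Fixed text representation shared across all training-set sizes.
vec = TfidfVectorizer()
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

# Evaluate the same model at increasing training-set sizes.
curve = []
for n in (30, 60, 120, len(y_train)):
    clf = LogisticRegression(max_iter=1000).fit(Xtr[:n], y_train[:n])
    curve.append((n, f1_score(y_test, clf.predict(Xte))))

for n, f1 in curve:
    print(f"n={n:4d}  F1={f1:.3f}")
```

Plotting F1 against n yields the learning curve; on real noisy data, comparing such curves across representations and models is what lets one ask whether more data or a better representation helps more.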
Keywords
- dataset size, empirical study, extrapolation methods, machine learning, neural network, Twitter classification
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Human-Computer Interaction
- Computer Vision and Pattern Recognition
- Computer Networks and Communications
Cite this
WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Association for Computing Machinery (ACM), 2022. p. 210-217 (ACM International Conference Proceeding Series).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - On the Impact of Dataset Size
T2 - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
AU - Nguyen, Thi Huyen
AU - Nguyen, Hoang H.
AU - Ahmadi, Zahra
AU - Hoang, Tuan Anh
AU - Doan, Thanh Nam
N1 - Funding Information: This work is supported by the DFG Grant (NI-1760/1-1) Managed Forgetting, the European Union’s Horizon 2020 research and innovation program under grant agreement No. 832921 (project MIRROR), and No. 833635 (project ROXANNE).
PY - 2022/4/13
Y1 - 2022/4/13
AB - The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is difficult for several reasons, such as manual labeling costs, data privacy, and computational constraints. Hence, a comprehensive study of the relationship between training set size and the classification performance of different methods would be highly useful for selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we focus on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to observe how the performance of supervised learning models improves as the training set grows, on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much better recent deep learning models are than traditional machine learning approaches across a range of data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big-data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
KW - dataset size
KW - empirical study
KW - extrapolation methods
KW - machine learning
KW - neural network
KW - Twitter classification
UR - http://www.scopus.com/inward/record.url?scp=85128592543&partnerID=8YFLogxK
U2 - 10.1145/3486622.3493960
DO - 10.1145/3486622.3493960
M3 - Conference contribution
AN - SCOPUS:85128592543
T3 - ACM International Conference Proceeding Series
SP - 210
EP - 217
BT - WI-IAT '21
PB - Association for Computing Machinery (ACM)
Y2 - 14 December 2021 through 17 December 2021
ER -