Sentiment analysis on big sparse data streams with limited labels

Vasileios Iosifidis; Eirini Ntoutsi

doi:10.1007/s10115-019-01392-9

Details

Original language	English
Pages (from - to)	1393-1432
Number of pages	40
Journal	Knowledge and Information Systems
Volume	62
Issue number	4
Early online date	17 Aug. 2019
Publication status	Published - Apr. 2020

Abstract

Sentiment analysis is an important task in order to gain insights over the huge amounts of opinionated texts generated on a daily basis in social media like Twitter. Despite its huge amount, standard supervised learning methods won’t work upon such sort of data due to lack of labels and the impracticality of (human) labeling at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015 which consists of 228 million tweets without retweets (and 275 million with retweets). We present the insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation–Maximization. Moreover, we propose two annotation modes, the batch mode where all labeled and unlabeled data are available to the algorithms from the beginning and a lightweight streaming mode that processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves comparable results to batch processing while being more efficient. Finally, to tackle the class imbalance problem, as our dataset is imbalanced toward the positive sentiment class, and its aggravation by the semi-supervised learning methods, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation outperforms significantly the default semi-supervised annotation process. We make the so-called TSentiment15 sentiment-annotated dataset available to the community to be used for evaluation purposes and for developing new methods.

ASJC Scopus subject areas

Computer Science (all): Software
Computer Science (all): Information Systems
Computer Science (all): Human-Computer Interaction
Computer Science (all): Hardware and Architecture
Computer Science (all): Artificial Intelligence

Cite

Sentiment analysis on big sparse data streams with limited labels. / Iosifidis, Vasileios; Ntoutsi, Eirini.
In: Knowledge and Information Systems, Vol. 62, No. 4, 04.2020, pp. 1393-1432.

Publication: Contribution to journal › Article › Research › Peer-reviewed

Iosifidis, V & Ntoutsi, E 2020, 'Sentiment analysis on big sparse data streams with limited labels', Knowledge and Information Systems, vol. 62, no. 4, pp. 1393-1432. https://doi.org/10.1007/s10115-019-01392-9

Iosifidis, V., & Ntoutsi, E. (2020). Sentiment analysis on big sparse data streams with limited labels. Knowledge and Information Systems, 62(4), 1393-1432. https://doi.org/10.1007/s10115-019-01392-9

Iosifidis V, Ntoutsi E. Sentiment analysis on big sparse data streams with limited labels. Knowledge and Information Systems. 2020 Apr;62(4):1393-1432. Epub 2019 Aug 17. doi: 10.1007/s10115-019-01392-9

Iosifidis, Vasileios ; Ntoutsi, Eirini. / Sentiment analysis on big sparse data streams with limited labels. In: Knowledge and Information Systems. 2020 ; Vol. 62, No. 4. pp. 1393-1432.

BibTeX

@article{215d54dde39147ef9fa6d5a3d7a2ccc1,
title = "Sentiment analysis on big sparse data streams with limited labels",
abstract = "Sentiment analysis is an important task in order to gain insights over the huge amounts of opinionated texts generated on a daily basis in social media like Twitter. Despite its huge amount, standard supervised learning methods won{\textquoteright}t work upon such sort of data due to lack of labels and the impracticality of (human) labeling at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015 which consists of 228 million tweets without retweets (and 275 million with retweets). We present the insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation–Maximization. Moreover, we propose two annotation modes, the batch mode where all labeled and unlabeled data are available to the algorithms from the beginning and a lightweight streaming mode that processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves comparable results to batch processing while being more efficient. Finally, to tackle the class imbalance problem, as our dataset is imbalanced toward the positive sentiment class, and its aggravation by the semi-supervised learning methods, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation outperforms significantly the default semi-supervised annotation process. We make the so-called TSentiment15 sentiment-annotated dataset available to the community to be used for evaluation purposes and for developing new methods.",

keywords = "Class imbalance, Data augmentation, Semi-supervised learning, Sentiment analysis",
author = "Vasileios Iosifidis and Eirini Ntoutsi",
note = "Funding information: The work was inspired by the German Research Foundation (DFG) Project (Grant No. 317686254) OSCAR (Opinion Stream Classification with Ensembles and Active leaRners) for which the last author is a Principal Investigator.",
year = "2020",
month = apr,
doi = "10.1007/s10115-019-01392-9",
language = "English",
volume = "62",
pages = "1393--1432",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "4",
}

RIS

TY - JOUR
T1 - Sentiment analysis on big sparse data streams with limited labels
AU - Iosifidis, Vasileios
AU - Ntoutsi, Eirini
N1 - Funding information: The work was inspired by the German Research Foundation (DFG) Project (Grant No. 317686254) OSCAR (Opinion Stream Classification with Ensembles and Active leaRners) for which the last author is a Principal Investigator.
PY - 2020/4
Y1 - 2020/4
AB - Sentiment analysis is an important task in order to gain insights over the huge amounts of opinionated texts generated on a daily basis in social media like Twitter. Despite its huge amount, standard supervised learning methods won’t work upon such sort of data due to lack of labels and the impracticality of (human) labeling at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015 which consists of 228 million tweets without retweets (and 275 million with retweets). We present the insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation–Maximization. Moreover, we propose two annotation modes, the batch mode where all labeled and unlabeled data are available to the algorithms from the beginning and a lightweight streaming mode that processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves comparable results to batch processing while being more efficient. Finally, to tackle the class imbalance problem, as our dataset is imbalanced toward the positive sentiment class, and its aggravation by the semi-supervised learning methods, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation outperforms significantly the default semi-supervised annotation process. We make the so-called TSentiment15 sentiment-annotated dataset available to the community to be used for evaluation purposes and for developing new methods.

KW - Class imbalance
KW - Data augmentation
KW - Semi-supervised learning
KW - Sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85071118635&partnerID=8YFLogxK
U2 - 10.1007/s10115-019-01392-9
DO - 10.1007/s10115-019-01392-9
M3 - Article
AN - SCOPUS:85071118635
VL - 62
SP - 1393
EP - 1432
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
SN - 0219-1377
IS - 4
ER -
