Large Scale Sentiment Learning with Limited Labels

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • Vasileios Iosifidis
  • Eirini Ntoutsi

Details

Original language: English
Title of host publication: KDD '17
Subtitle of host publication: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages: 1823-1832
Number of pages: 10
ISBN (electronic): 9781450348874
Publication status: Published - 13 Aug 2017
Event: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 - Halifax, Canada
Duration: 13 Aug 2017 to 17 Aug 2017

Abstract

Sentiment analysis is an important task for gaining insights from the huge volume of opinions generated on social media every day. Although there is a lot of work on sentiment analysis, not many datasets are available for developing and evaluating new methods. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a dataset of 1.6 million machine-annotated tweets covering a period of about 3 months in 2009. This dataset, however, covers too short a period and is therefore insufficient for studying heterogeneous, fast-evolving streams. Therefore, we annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and we make it publicly available for research. For the annotation we leverage the power of unlabeled data together with labeled data, using semi-supervised learning and, in particular, Self-Learning and Co-Training. Our main contribution is the provision of the TSentiment15 dataset together with insights from the analysis, which includes both batch and stream processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the latter, they are revealed gradually based on their arrival time in the stream.
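
As an illustration of the Self-Learning idea named in the abstract, the following is a minimal batch-mode sketch, not the authors' implementation. It assumes a small multinomial Naive Bayes classifier, whitespace tokenisation, and a confidence threshold of 0.7; the abstract does not state the classifier, features, or thresholds used in the paper, so all of these, along with the function names `train_nb`, `predict_proba`, and `self_learning`, are hypothetical choices made for the example.

```python
import math
from collections import Counter

def train_nb(labeled):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    class_counts = Counter()  # label -> number of training texts
    for text, label in labeled:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_proba(model, text):
    """Return {label: posterior probability}, using Laplace smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: v / norm for label, v in exp_scores.items()}

def self_learning(labeled, unlabeled, threshold=0.7, max_rounds=10):
    """Iteratively move confidently predicted texts into the training set.

    The threshold and round limit are assumptions made for this sketch.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for text in pool:
            probs = predict_proba(model, text)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was labeled, so further rounds change nothing
        labeled.extend(confident)
        pool = remaining
    return labeled, pool

# Toy data, invented for the example.
seed = [("good great love", "pos"), ("bad awful hate", "neg")]
unlabeled = ["love this great day", "awful hate it", "the train leaves at noon"]
expanded, still_unlabeled = self_learning(seed, unlabeled)
```

In this batch form the whole unlabeled pool is visible from the first round. A stream variant in the spirit of the abstract would instead feed the pool in arrival-time order, one chunk at a time, so that each chunk is labeled by a model trained only on what has arrived so far.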

Keywords

    Cotraining, Self-learning, Semi-supervised learning, Sentiment analysis

Cite this

Large Scale Sentiment Learning with Limited Labels. / Iosifidis, Vasileios; Ntoutsi, Eirini.
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. p. 1823-1832.

Iosifidis, V & Ntoutsi, E 2017, Large Scale Sentiment Learning with Limited Labels. in KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1823-1832, 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, Halifax, Canada, 13 Aug 2017. https://doi.org/10.1145/3097983.3098159
Iosifidis, V., & Ntoutsi, E. (2017). Large Scale Sentiment Learning with Limited Labels. In KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1823-1832) https://doi.org/10.1145/3097983.3098159
Iosifidis V, Ntoutsi E. Large Scale Sentiment Learning with Limited Labels. In KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. p. 1823-1832 doi: 10.1145/3097983.3098159
Iosifidis, Vasileios ; Ntoutsi, Eirini. / Large Scale Sentiment Learning with Limited Labels. KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. pp. 1823-1832
@inproceedings{4f3ed32124d441478eaa9a149d506171,
title = "Large Scale Sentiment Learning with Limited Labels",
abstract = "Sentiment analysis is an important task for gaining insights from the huge volume of opinions generated on social media every day. Although there is a lot of work on sentiment analysis, not many datasets are available for developing and evaluating new methods. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a dataset of 1.6 million machine-annotated tweets covering a period of about 3 months in 2009. This dataset, however, covers too short a period and is therefore insufficient for studying heterogeneous, fast-evolving streams. Therefore, we annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and we make it publicly available for research. For the annotation we leverage the power of unlabeled data together with labeled data, using semi-supervised learning and, in particular, Self-Learning and Co-Training. Our main contribution is the provision of the TSentiment15 dataset together with insights from the analysis, which includes both batch and stream processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the latter, they are revealed gradually based on their arrival time in the stream.",
keywords = "Cotraining, Self-learning, Semi-supervised learning, Sentiment analysis",
author = "Vasileios Iosifidis and Eirini Ntoutsi",
note = "Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and by the German Research Foundation (DFG) project OSCAR (Opinion Stream Classification with Ensembles and Active leaRners).; 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 ; Conference date: 13-08-2017 Through 17-08-2017",
year = "2017",
month = aug,
day = "13",
doi = "10.1145/3097983.3098159",
language = "English",
pages = "1823--1832",
booktitle = "KDD '17",

}

TY - GEN

T1 - Large Scale Sentiment Learning with Limited Labels

AU - Iosifidis, Vasileios

AU - Ntoutsi, Eirini

N1 - Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and by the German Research Foundation (DFG) project OSCAR (Opinion Stream Classification with Ensembles and Active leaRners).

PY - 2017/8/13

Y1 - 2017/8/13

AB - Sentiment analysis is an important task for gaining insights from the huge volume of opinions generated on social media every day. Although there is a lot of work on sentiment analysis, not many datasets are available for developing and evaluating new methods. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a dataset of 1.6 million machine-annotated tweets covering a period of about 3 months in 2009. This dataset, however, covers too short a period and is therefore insufficient for studying heterogeneous, fast-evolving streams. Therefore, we annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and we make it publicly available for research. For the annotation we leverage the power of unlabeled data together with labeled data, using semi-supervised learning and, in particular, Self-Learning and Co-Training. Our main contribution is the provision of the TSentiment15 dataset together with insights from the analysis, which includes both batch and stream processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the latter, they are revealed gradually based on their arrival time in the stream.

KW - Cotraining

KW - Self-learning

KW - Semi-supervised learning

KW - Sentiment analysis

UR - http://www.scopus.com/inward/record.url?scp=85029066358&partnerID=8YFLogxK

U2 - 10.1145/3097983.3098159

DO - 10.1145/3097983.3098159

M3 - Conference contribution

AN - SCOPUS:85029066358

SP - 1823

EP - 1832

BT - KDD '17

T2 - 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017

Y2 - 13 August 2017 through 17 August 2017

ER -