On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis?

Marc Herrmann; Martin Obaidi; Larissa Chazette; Jil Klünder

doi:10.48550/arXiv.2207.07954

Details

Original language	English
Article number	111448
Journal	Journal of Systems and Software
Volume	193
Early online date	21 July 2022
Publication status	Published - Nov 2022

Abstract

Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.

ASJC Scopus subject areas

Computer Science (all)
Software
Computer Science (all)
Information Systems
Computer Science (all)
Hardware and Architecture

Cite this

On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis? / Herrmann, Marc; Obaidi, Martin; Chazette, Larissa et al.
In: Journal of Systems and Software, Vol. 193, 111448, 11.2022.

Research output: Contribution to journal › Article › Research › Peer-reviewed

Herrmann, M, Obaidi, M, Chazette, L & Klünder, J 2022, 'On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis?', Journal of Systems and Software, vol. 193, 111448. https://doi.org/10.48550/arXiv.2207.07954, https://doi.org/10.1016/j.jss.2022.111448

Herrmann, M., Obaidi, M., Chazette, L., & Klünder, J. (2022). On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis? Journal of Systems and Software, 193, Article 111448. https://doi.org/10.48550/arXiv.2207.07954, https://doi.org/10.1016/j.jss.2022.111448

Herrmann M, Obaidi M, Chazette L, Klünder J. On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis? Journal of Systems and Software. 2022 Nov;193:111448. Epub 2022 Jul 21. doi: 10.48550/arXiv.2207.07954, 10.1016/j.jss.2022.111448

Herrmann, Marc; Obaidi, Martin; Chazette, Larissa et al. / On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis? In: Journal of Systems and Software. 2022; Vol. 193.

@article{aa727167023e438495d552576c62f791,

title = "On the Subjectivity of Emotions in Software Projects: How Reliable Are Pre-labeled Data Sets for Sentiment Analysis?",

abstract = "Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant{\textquoteright}s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5\% of the cases, we observe a huge difference between the single participant{\textquoteright}s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.",

keywords = "Sentiment analysis, Software projects, Polarity, Development team, Communication",

author = "Marc Herrmann and Martin Obaidi and Larissa Chazette and Jil Kl{\"u}nder",

note = "Funding Information: This research was funded by the Leibniz University Hannover as a Leibniz Young Investigator Grant (Project ComContA, Project Number 85430128, 2020–2022).",

year = "2022",

month = nov,

doi = "10.48550/arXiv.2207.07954",

language = "English",

volume = "193",

journal = "Journal of Systems and Software",

issn = "0164-1212",

publisher = "Elsevier Inc.",

}

TY - JOUR

T1 - On the Subjectivity of Emotions in Software Projects

T2 - How Reliable Are Pre-labeled Data Sets for Sentiment Analysis?

AU - Herrmann, Marc

AU - Obaidi, Martin

AU - Chazette, Larissa

AU - Klünder, Jil

N1 - Funding Information: This research was funded by the Leibniz University Hannover as a Leibniz Young Investigator Grant (Project ComContA, Project Number 85430128, 2020–2022).

PY - 2022/11

Y1 - 2022/11

N2 - Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.

AB - Social aspects of software projects become increasingly important for research and practice. Different approaches analyze the sentiment of a development team, ranging from simply asking the team to so-called sentiment analysis on text-based communication. These sentiment analysis tools are trained using pre-labeled data sets from different sources, including GitHub and Stack Overflow. In this paper, we investigate if the labels of the statements in the data sets coincide with the perception of potential members of a software project team. Based on an international survey, we compare the median perception of 94 participants with the pre-labeled data sets as well as every single participant’s agreement with the predefined labels. Our results point to three remarkable findings: (1) Although the median values coincide with the predefined labels of the data sets in 62.5% of the cases, we observe a huge difference between the single participant’s ratings and the labels; (2) there is not a single participant who totally agrees with the predefined labels; and (3) the data set whose labels are based on guidelines performs better than the ad hoc labeled data set.

KW - Sentiment analysis

KW - Software projects

KW - Polarity

KW - Development team

KW - Communication

UR - http://www.scopus.com/inward/record.url?scp=85134891383&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2207.07954

DO - 10.48550/arXiv.2207.07954

M3 - Article

VL - 193

JO - Journal of Systems and Software

JF - Journal of Systems and Software

SN - 0164-1212

M1 - 111448

ER -
