Latent Class Cluster Analysis: Selecting the number of clusters

Publication: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • Olga Lezhnina
  • Gábor Kismihók

External organisations

  • Technische Informationsbibliothek (TIB) Leibniz-Informationszentrum Technik und Naturwissenschaften und Universitätsbibliothek

Details

Original language: English
Article number: 101747
Journal: MethodsX
Volume: 9
Early online date: 29 May 2022
Publication status: Published - 2022
Published externally: Yes

Abstract

Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method that is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task that involves inevitable subjectivity in analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it has been shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and a real-world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA, including data preprocessing. The researcher can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to a fit-indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.
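The selection logic of the extended strategy can be sketched in a few lines: instead of picking the solution with the best fit index alone, solutions are first screened for separation and stability. The numbers and thresholds below are illustrative assumptions, not values from the paper (the authors' actual implementation is an R script):

```python
# Illustrative sketch: combine model fit (BIC, lower is better), cluster
# separation (ASW, higher is better), and stability (bootstrap agreement,
# higher is better) instead of relying on fit indices alone.
# All values below are made up for illustration.
candidates = {
    # K: (BIC, ASW, stability)
    2: (10450.0, 0.61, 0.93),
    3: (10390.0, 0.55, 0.88),
    4: (10370.0, 0.34, 0.52),  # best BIC, but poorly separated and unstable
}

def acceptable(bic, asw, stability, asw_min=0.5, stab_min=0.8):
    """Keep only solutions that are reasonably separated and stable.

    The cut-offs are hypothetical; in practice they are a judgment call."""
    return asw >= asw_min and stability >= stab_min

# Among acceptable solutions, choose the one with the lowest BIC.
best_k = min(
    (k for k, vals in candidates.items() if acceptable(*vals)),
    key=lambda k: candidates[k][0],
)
print(best_k)  # → 3, not the BIC-optimal 4
```

A purely fit-based strategy would select K = 4 here; the extended screen discards it for poor separation and stability.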

ASJC Scopus subject areas

Cite

Latent Class Cluster Analysis: Selecting the number of clusters. / Lezhnina, Olga; Kismihók, Gábor.
In: MethodsX, Vol. 9, 101747, 2022.

Lezhnina O, Kismihók G. Latent Class Cluster Analysis: Selecting the number of clusters. MethodsX. 2022;9:101747. Epub 2022 May 29. doi: 10.1016/j.mex.2022.101747
Lezhnina, Olga ; Kismihók, Gábor. / Latent Class Cluster Analysis : Selecting the number of clusters. In: MethodsX. 2022 ; Vol. 9.
BibTeX
@article{152b04e6ce6b4de9b051104f110f9824,
title = "Latent Class Cluster Analysis: Selecting the number of clusters",
abstract = "Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method, which is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task involving inevitable subjectivity of analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it was shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and a real world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA including data preprocessing. The researcher can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to fit indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.",
keywords = "Cluster separation, Extended selecting strategy for LCCA, LCCA, Model fit, Stability of partitions",
author = "Olga Lezhnina and G{\'a}bor Kismih{\'o}k",
note = "Funding Information: We are grateful to anonymous reviewers who helped us to substantially improve the manuscript. In this section, the reader can find the definitions of the BIC, the ICL, the ASW, the ARI, and the Jaccard coefficient. We also explain in detail how to conduct the bootstrap stability assessment. The BIC is defined as follows: BIC = −2 log L + p log n, where p is the number of free parameters in the model, n is the number of observations, and L is the maximized likelihood function of the model. For a large n, minimizing the BIC corresponds to maximizing the posterior model probability. The BIC is useful when the sample size is sufficiently large; for small samples, Akaike's Information Criterion (AIC) is appropriate [4]. The ICL is defined as follows: ICL(m, K) = max_θ log p(x, ẑ | θ, m, K) − (υ_{m,K}/2) log n, where x is the data, ẑ is the estimated cluster membership for observations in the model m with K as the number of clusters, θ refers to the estimated mixture parameters, and υ_{m,K} is the number of free parameters in the model. The ICL is equal to the BIC penalized by the estimated mean entropy, which means that it aims at finding well-separated clusters and thus should not overestimate the number of clusters [2,11]. The ASW is the averaged value of silhouette widths for observations, which are defined as follows: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity between observation i and all other points of the cluster to which i belongs, and b(i) is the average dissimilarity between i and all observations of the nearest cluster to which i does not belong. The ASW values range from −1 to 1; higher positive values indicate better defined clusters characterized by within-cluster compactness and between-cluster separation, while values close to 0 or negative values show that the clusters are not well-separated. Stability of partitions is calculated as follows. 
We cluster the original data and apply the cluster solution to a bootstrap sample, which is also clustered anew. Thus, we have two cluster partitions for each bootstrap sample: the partition created by the original solution on the new sample and the new partition of this sample. They are compared using an external metric of our choice; this value is averaged over multiple repetitions to indicate the stability of the clustering [13]. To compare partitions, external measures should be used, such as the ARI and the Jaccard coefficient [11,13]. These measures can be explained as follows. We need to compare two different cluster partitions U = {U1, U2, … Ur} and V = {V1, V2, … Vs} obtained on the same data. Let n be the total number of observations, and n_ij the number of objects in common between clusters U_i and V_j, with marginal sums n_i. = ∑_j n_ij and n_.j = ∑_i n_ij. Writing C(·, 2) for the number of unordered pairs, the pairs of observations placed in the same cluster in both partitions are counted by a = ∑_{i,j} C(n_ij, 2); the pairs placed in the same cluster in U but in different clusters in V by b = ∑_i C(n_i., 2) − ∑_{i,j} C(n_ij, 2); and the pairs placed in the same cluster in V but in different clusters in U by c = ∑_j C(n_.j, 2) − ∑_{i,j} C(n_ij, 2). In this case, the Jaccard coefficient is defined as J = a / (a + b + c), and the ARI is defined as ARI = (∑_{i,j} C(n_ij, 2) − E) / ((1/2)(∑_i C(n_i., 2) + ∑_j C(n_.j, 2)) − E), where E = ∑_i C(n_i., 2) ∑_j C(n_.j, 2) / C(n, 2) is the value of ∑_{i,j} C(n_ij, 2) expected under random labelling.",
year = "2022",
doi = "10.1016/j.mex.2022.101747",
language = "English",
volume = "9",

}
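The pair-count definitions of the Jaccard coefficient and the ARI in the note above can be checked with a small script. This is an illustrative pure-Python sketch (function names are ours, not from the paper's R script):

```python
from itertools import combinations
from math import comb

def pair_counts(u, v):
    """Count a (pairs in the same cluster in both partitions), b (same
    cluster in u only), and c (same cluster in v only) over all pairs."""
    a = b = c = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        a += same_u and same_v
        b += same_u and not same_v
        c += same_v and not same_u
    return a, b, c

def jaccard(u, v):
    """Jaccard coefficient J = a / (a + b + c)."""
    a, b, c = pair_counts(u, v)
    return a / (a + b + c)

def ari(u, v):
    """Adjusted Rand Index from the contingency table of two partitions."""
    n = len(u)
    rows, cols, cells = {}, {}, {}
    for ui, vi in zip(u, v):
        rows[ui] = rows.get(ui, 0) + 1
        cols[vi] = cols.get(vi, 0) + 1
        cells[ui, vi] = cells.get((ui, vi), 0) + 1
    index = sum(comb(nij, 2) for nij in cells.values())
    sum_rows = sum(comb(ni, 2) for ni in rows.values())
    sum_cols = sum(comb(nj, 2) for nj in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # E under random labelling
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)

print(jaccard([0, 0, 1, 1], [0, 0, 1, 1]))  # identical partitions → 1.0
print(ari([0, 0, 1, 1], [1, 1, 0, 0]))      # relabelled but identical → 1.0
```

Note that, unlike the plain Rand index, the ARI is corrected for chance: for two partitions that agree no better than random labelling it is close to 0 and can be negative.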

RIS

TY - JOUR

T1 - Latent Class Cluster Analysis

T2 - Selecting the number of clusters

AU - Lezhnina, Olga

AU - Kismihók, Gábor


PY - 2022

Y1 - 2022

AB - Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method, which is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task involving inevitable subjectivity of analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it was shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and a real world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA including data preprocessing. The researcher can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to fit indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.

KW - Cluster separation

KW - Extended selecting strategy for LCCA

KW - LCCA

KW - Model fit

KW - Stability of partitions

UR - http://www.scopus.com/inward/record.url?scp=85132576032&partnerID=8YFLogxK

U2 - 10.1016/j.mex.2022.101747

DO - 10.1016/j.mex.2022.101747

M3 - Article

AN - SCOPUS:85132576032

VL - 9

JO - MethodsX

JF - MethodsX

SN - 2215-0161

M1 - 101747

ER -
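The bootstrap stability assessment described in the record's note can also be sketched compactly. The toy one-dimensional threshold "clusterer" below stands in for the fitted mixture model, and the partitions are compared with the plain Rand index rather than the ARI or Jaccard coefficient used in the paper; all names are ours:

```python
# Bootstrap stability sketch: cluster the original data, then for each
# bootstrap sample compare (1) the partition induced by the original
# solution with (2) a partition fitted anew on that sample.
import random

def cluster(data):
    """Toy clusterer: split points at the midpoint of the data range."""
    threshold = (min(data) + max(data)) / 2
    return [0 if x < threshold else 1 for x in data], threshold

def rand_index(u, v):
    """Plain Rand index: share of point pairs on which two partitions agree."""
    n = len(u)
    agree = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            if (u[i] == u[j]) == (v[i] == v[j]):
                agree += 1
    return agree / pairs

def bootstrap_stability(data, n_boot=50, seed=0):
    rng = random.Random(seed)
    _, threshold = cluster(data)  # solution fitted on the original data
    scores = []
    for _ in range(n_boot):
        sample = rng.choices(data, k=len(data))            # bootstrap resample
        induced = [0 if x < threshold else 1 for x in sample]  # original solution
        fresh, _ = cluster(sample)                         # sample clustered anew
        scores.append(rand_index(induced, fresh))
    return sum(scores) / len(scores)

data = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]  # two well-separated groups
print(bootstrap_stability(data))  # close to 1.0 for a stable solution
```

For well-separated groups like these the two partitions almost always coincide, so the averaged agreement is near 1; an unstable solution would push the average down.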