Details
Original language | English |
---|---|
Article number | 101747 |
Journal | MethodsX |
Volume | 9 |
Early online date | 29 May 2022 |
Publication status | Published - 2022 |
Published externally | Yes |
Abstract
Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method that is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task involving the inevitable subjectivity of analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it has been shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and on a real-world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA, including data preprocessing. Researchers can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to a fit-indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.
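The three criteria named in the abstract (model fit, cluster separation, stability of partitions) can be sketched in code. The following is a minimal illustration, not the authors' R script: it uses scikit-learn's GaussianMixture on continuous toy data as a stand-in for an LCCA model, and the number of bootstrap repetitions (20) is an arbitrary choice for the sketch.

```python
# Hypothetical sketch of the extended selection strategy:
# compare candidate K by (1) model fit, (2) cluster separation,
# and (3) bootstrap stability of partitions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

results = {}
for k in range(2, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    labels = gm.predict(X)
    bic = gm.bic(X)                    # (1) model fit: lower is better
    asw = silhouette_score(X, labels)  # (2) separation: higher is better
    # (3) stability: recluster each bootstrap sample anew and compare the
    # new partition with the original model's partition via the ARI
    aris = []
    for _ in range(20):
        idx = rng.integers(0, len(X), size=len(X))
        new = GaussianMixture(n_components=k, random_state=0).fit_predict(X[idx])
        aris.append(adjusted_rand_score(gm.predict(X[idx]), new))
    results[k] = (bic, asw, float(np.mean(aris)))
    print(f"K={k}: BIC={bic:.0f}, ASW={asw:.2f}, stability={results[k][2]:.2f}")
```

Under the extended strategy, one would prefer a K with competitive BIC together with high ASW and high mean bootstrap ARI, rather than relying on the BIC minimum alone.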
ASJC Scopus subject areas
- Biochemistry, Genetics and Molecular Biology (all)
- Clinical Biochemistry
- Health Professions (all)
- Medical Laboratory Technology
Cite
In: MethodsX, Vol. 9, 101747, 2022.
Publication: Contribution to journal › Article › Research › Peer-review
TY - JOUR
T1 - Latent Class Cluster Analysis
T2 - Selecting the number of clusters
AU - Lezhnina, Olga
AU - Kismihók, Gábor
N1 - Funding Information: We are grateful to the anonymous reviewers who helped us to substantially improve the manuscript.

In this section, the reader can find the definitions of the BIC, the ICL, the ASW, the ARI, and the Jaccard coefficient. We also explain in detail how to conduct the bootstrap stability assessment.

The BIC is defined as follows: $\mathrm{BIC} = -2 \log L + p \log n$, where $p$ is the number of free parameters in the model, $n$ is the number of observations, and $L$ is the maximized likelihood function of the model. For a large $n$, minimizing the BIC corresponds to maximizing the posterior model probability. The BIC is useful when the sample size is sufficiently large; for small samples, Akaike's Information Criterion (AIC) is appropriate [4].

The ICL is defined as follows: $\mathrm{ICL}_{m,K} = \log p(x, \hat{z} \mid \hat{\theta}, m, K) - \frac{\upsilon_{m,K}}{2} \log n$, where $x$ is the data, $\hat{z}$ is the estimated cluster membership for observations in the model $m$ with $K$ as the number of clusters, $\hat{\theta}$ refers to the estimated mixture parameters, and $\upsilon_{m,K}$ is the number of free parameters in the model. The ICL is equal to the BIC penalized by the estimated mean entropy, which means that it aims at finding well-separated clusters and thus should not overestimate the number of clusters [2,11].

The ASW is the averaged value of the silhouette widths of all observations, which are defined as follows: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the average dissimilarity between observation $i$ and all other points of the cluster to which $i$ belongs, and $b(i)$ is the average dissimilarity between $i$ and all observations of the nearest cluster to which $i$ does not belong. The ASW values range from $-1$ to $1$; higher positive values indicate better-defined clusters characterized by within-cluster compactness and between-cluster separation, while values close to 0 or negative values show that the clusters are not well separated.

Stability of partitions is calculated as follows. We cluster the original data and apply the cluster solution to a bootstrap sample, which is also clustered anew. Thus, we have two cluster partitions for each bootstrap sample: the partition created by the original solution on the new sample and the new partition of this sample. They are compared using an external metric of our choice; this value is averaged over multiple repetitions to indicate the stability of the clustering [13]. To compare partitions, external measures should be used, such as the ARI and the Jaccard coefficient [11,13]. These measures can be explained as follows. We need to compare two different cluster partitions $U = \{U_1, U_2, \ldots, U_r\}$ and $V = \{V_1, V_2, \ldots, V_s\}$ of the same data. Let $n$ be the total number of observations and $n_{ij}$ the number of objects in common between clusters $U_i$ and $V_j$, with marginal sums $n_{i.} = \sum_j n_{ij}$ and $n_{.j} = \sum_i n_{ij}$. Some pairs of observations are placed in the same cluster in both partitions: $a = \sum_{i,j} \binom{n_{ij}}{2}$. Other pairs of observations are placed in the same cluster in one partition but in different clusters in the other: $b = \sum_i \binom{n_{i.}}{2} - \sum_{i,j} \binom{n_{ij}}{2}$. Still other pairs of observations are in different clusters in both partitions: $c = \sum_j \binom{n_{.j}}{2} - \sum_{i,j} \binom{n_{ij}}{2}$. The Jaccard coefficient is then defined as $J = \frac{a}{a + b + c}$, and the ARI is defined as $\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \big[\sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2}\big] / \binom{n}{2}}{\frac{1}{2}\big[\sum_i \binom{n_{i.}}{2} + \sum_j \binom{n_{.j}}{2}\big] - \big[\sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2}\big] / \binom{n}{2}}$.
PY - 2022
Y1 - 2022
N2 - Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method that is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task involving the inevitable subjectivity of analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it has been shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and on a real-world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA, including data preprocessing. Researchers can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to a fit-indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.
AB - Latent Class Cluster Analysis (LCCA) is an advanced model-based clustering method that is increasingly used in social, psychological, and educational research. Selecting the number of clusters in LCCA is a challenging task involving the inevitable subjectivity of analytical choices. Researchers often rely excessively on fit indices, as model fit is the main selection criterion in model-based clustering; it has been shown, however, that a wider spectrum of criteria needs to be taken into account. In this paper, we suggest an extended analytical strategy for selecting the number of clusters in LCCA based on model fit, cluster separation, and stability of partitions. The suggested procedure is illustrated on simulated data and on a real-world dataset from the International Computer and Information Literacy Study (ICILS) 2018. For the latter, we provide an example of end-to-end LCCA, including data preprocessing. Researchers can use our R script to conduct LCCA in a few easily reproducible steps, or implement the strategy with any other software suitable for clustering. We show that the extended strategy, in comparison to a fit-indices-based strategy, facilitates the selection of more stable and well-separated clusters in the data.
KW - Cluster separation
KW - Extended selection strategy for LCCA
KW - LCCA
KW - Model fit
KW - Stability of partitions
UR - http://www.scopus.com/inward/record.url?scp=85132576032&partnerID=8YFLogxK
U2 - 10.1016/j.mex.2022.101747
DO - 10.1016/j.mex.2022.101747
M3 - Article
AN - SCOPUS:85132576032
VL - 9
JO - MethodsX
JF - MethodsX
SN - 2215-0161
M1 - 101747
ER -
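The pair-counting quantities defined in the notes above (a, b, and c) translate directly into code. The following Python sketch, using toy partitions chosen purely for illustration, computes the Jaccard coefficient and the ARI from the contingency table of two partitions:

```python
# Pair-counting comparison of two partitions U and V of the same data:
# a = pairs together in both partitions, b / c = pairs together in only
# one of them. Toy partitions below are illustrative only.
from math import comb
import numpy as np

U = [0, 0, 0, 1, 1, 1, 2, 2]
V = [0, 0, 1, 1, 1, 2, 2, 2]

n = len(U)
clusters_U, clusters_V = sorted(set(U)), sorted(set(V))
# contingency table: n_ij = objects shared by cluster i of U and j of V
nij = np.array([[sum(1 for u, v in zip(U, V) if u == i and v == j)
                 for j in clusters_V] for i in clusters_U])
ni_ = nij.sum(axis=1)  # row sums n_i.
n_j = nij.sum(axis=0)  # column sums n_.j

sum_ij = sum(comb(int(x), 2) for x in nij.ravel())
sum_i = sum(comb(int(x), 2) for x in ni_)
sum_j = sum(comb(int(x), 2) for x in n_j)

a = sum_ij          # same cluster in both partitions
b = sum_i - sum_ij  # same cluster in U, different clusters in V
c = sum_j - sum_ij  # same cluster in V, different clusters in U

jaccard = a / (a + b + c)
expected = sum_i * sum_j / comb(n, 2)
ari = (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)
print(f"Jaccard={jaccard:.3f}, ARI={ari:.3f}")
```

The same ARI value is what library implementations such as scikit-learn's adjusted_rand_score return; computing it by hand, as here, makes the correspondence with the definitions above explicit.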