Talk:Determining the number of clusters in a data set
This article is rated C-class on Wikipedia's content assessment scale.
Rule of Thumb
What is the basis for the rule of thumb for k? — Preceding unsigned comment added by 78.48.3.227 (talk) 00:08, 26 November 2013 (UTC)
Additional updates coming
A colleague will be adding details to the "Elbow method" and "Information criteria" subsections shortly. -JohnMeier (talk) 15:10, 9 April 2009 (UTC)
Not that common a problem
There are lots of alternative algorithms that do not require k to be specified beforehand. This is mostly a problem of k-means, k-medoids and the EM algorithm. Pretty much none of the more recent algorithms has this parameter. --Chire2 (talk) 14:13, 7 May 2010 (UTC)
- Any examples of such algorithms? Thanks. Talgalili (talk) 12:36, 20 June 2010 (UTC)
A well-known early example is the AutoClass algorithm, by Cheeseman et al. 1988, which applied a search-based method built around Expectation Maximization to find the maximum a posteriori distribution as a function of the number of classes. More modern approaches to this problem would equivalently apply the Bayesian information criterion to select k. Johnmark54 (talk) 15:26, 5 October 2011 (UTC)
Fully probabilistic methods (e.g. computing marginal likelihoods, or Dirichlet process mixture models) do deal with this problem, but they do so at substantial computational cost (and are conceptually subtle). Algorithms that are not fully probabilistic are still in very common use. Similarly, maximum likelihood or maximum a posteriori fits, rather than fully Bayesian methods, are also in common use, and these also have problems determining k. So I agree with Johnmark54, but only partially. It is probably worth adding mixture models to the article as a fully probabilistic alternative to clustering, with a pointer elsewhere. — Preceding unsigned comment added by 130.102.214.226 (talk) 00:58, 19 March 2014 (UTC)
Spectral Methods
Spectral methods automatically give k for many datasets. — Preceding unsigned comment added by 192.249.47.174 (talk) 15:38, 21 June 2012 (UTC)
Reference to such methods, please? — Preceding unsigned comment added by 152.16.225.159 (talk) 19:24, 31 October 2012 (UTC)
Information and text
The information-theoretic section is disproportionately long; it should be edited down to be commensurate with the others.
Also, I moved the heuristic about textual clustering down, as it is specialized and not of very general interest (compared to, say, the elbow method). — Preceding unsigned comment added by Bluedevil.knight (talk • contribs) 14:46, 1 November 2012 (UTC)
Elbow: not equivalent to the F-test
Original says: "Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test." This seems wrong. For one, the F-test is not this ratio. The F-statistic is between-group variance over within-group variance, which does not give the percentage of variance explained. Percentage of total variance explained is between-group variance over total variance (the sample variance), as the article states, which is not the F-statistic. Plus, isn't the F-test only technically used on univariate data? I'm not sure, but that is another strike against this claim of equivalence. If nobody protests, I will modify the original. How do you calculate variance explained with clusters in multivariate data sets? Bluedevil.knight (talk) 19:13, 1 November 2012 (UTC)
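To make the distinction concrete, here is a minimal Python sketch of the two ratios. It assumes one common multivariate convention, not confirmed by the article: "variance" is measured as a sum of squared Euclidean distances to a mean, and the F-like ratio divides each sum by its degrees of freedom (the form usually called the pseudo-F or Calinski-Harabasz index). The function names and the toy data are made up for illustration.

```python
from collections import defaultdict

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance_decomposition(points, labels):
    """Return (between, within, total) sums of squares for a labelled data set."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    grand = mean_point(points)
    centroids = {lab: mean_point(g) for lab, g in groups.items()}
    ss_total = sum(sq_dist(p, grand) for p in points)
    ss_within = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    ss_between = sum(len(g) * sq_dist(centroids[lab], grand)
                     for lab, g in groups.items())
    return ss_between, ss_within, ss_total

def explained_fraction(points, labels):
    """Between-group over total sum of squares (fraction of variance explained)."""
    ss_between, _, ss_total = variance_decomposition(points, labels)
    return ss_between / ss_total

def pseudo_f(points, labels):
    """Between-group over within-group mean squares (an F-like ratio)."""
    ss_between, ss_within, _ = variance_decomposition(points, labels)
    n, k = len(points), len(set(labels))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data (hypothetical): two tight clusters in two dimensions.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
print(explained_fraction(points, labels))  # 100/104, about 0.96
print(pseudo_f(points, labels))            # 50.0
```

On this toy data the between, within and total sums of squares are 100, 4 and 104. The explained fraction is bounded by 1, while the F-like ratio is unbounded, so the two quantities are related but not the same.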
Rule of Thumb
I have removed it. As someone with reasonable experience in cluster analysis, I don't see any use in this rule of thumb at all. Why should the number of individuals have any effect on the number of clusters present?
Consider this example: there is a large population with three hidden subpopulations which differ across several variables. If 100 members of this population were randomly sampled, the rule of thumb suggests that there are ~7 clusters. If 1000 members of this population were sampled, the rule of thumb suggests there are ~22 clusters. In either case, it was still the same population with three clusters. Combee123 (talk) 22:05, 25 January 2016 (UTC)
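The figures quoted above can be reproduced with a short Python sketch. It assumes the removed rule of thumb was k ≈ sqrt(n/2); that formula is not stated anywhere in this thread, but it matches the ~7 and ~22 given for n = 100 and n = 1000. The function name is made up for illustration.

```python
import math

def rule_of_thumb_k(n):
    """Suggested number of clusters for n samples, assuming k = sqrt(n / 2)."""
    return math.sqrt(n / 2)

# The suggestion depends only on the sample size, never on the data itself.
for n in (100, 1000, 10000):
    print(n, round(rule_of_thumb_k(n), 1))
```

Since the data never enter the function, the suggested k keeps growing with the sample size even when the underlying population has a fixed number of subpopulations, which is the objection raised in the comment.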