This is an archive of past discussions about Cluster analysis. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1
I find this in the article:
But when I looked at the bibliography, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)
The clustering impossibility theorem should be mentioned. It is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it, though. Briefly, it states that any clustering algorithm should have these 3 properties:
No clustering algorithm can have all 3.
-- BAxelrod 03:53, 16 December 2005 (UTC)
It is a good thing to have, and the reference, I guess, is \bibitem{JK02} Jon Kleinberg. An impossibility theorem for clustering. Advances in Neural Information Processing Systems, 2002.
However, if there is not another source, then I'd mention that there is a little problem with this theorem as it is presented in that article.
First, it deals with graphs G(V,E), having |V| >= 2 and having distances d(i,j) = 0 iff i == j, for i, j in V. Thus, take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), a graph with |V| = 2, and boom, there you go. For each clustering we get either scale invariance or richness. If there is richness, then scale invariance does not hold, and the other way round. Sweet, isn't it? Or am I wrong somewhere?
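The |V| = 2 argument can be sketched concretely. This is a toy illustration of two of Kleinberg's axioms; the function names and data are mine, not from the paper:

```python
# Toy illustration: with |V| = 2 there are only two possible partitions,
# {{a, b}} (merged) or {{a}, {b}} (split).
def threshold_clustering(d, t=1.0):
    """Merge the two points iff their distance is below the threshold t."""
    return "merged" if d < t else "split"

def constant_clustering(d):
    """Always split; ignores the distance entirely."""
    return "split"

# threshold_clustering is *rich*: both partitions are reachable ...
assert threshold_clustering(0.5) == "merged"
assert threshold_clustering(2.0) == "split"
# ... but not *scale-invariant*: scaling the distance by 10 flips the result.
assert threshold_clustering(0.5) != threshold_clustering(0.5 * 10)

# constant_clustering is scale-invariant but not rich:
# the "merged" partition is unreachable.
assert constant_clustering(0.5) == constant_clustering(0.5 * 10)
```

So on a two-point graph, each of these toy rules satisfies exactly one of the two axioms, which is the tension the commenter describes.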
Could you please explain something about Isodata Algorithm for data clustering
The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan
Beyond the division of clustering methodologies into hierarchical/partitional and agglomerative/divisive, it is possible to differentiate between: arrival of data points: on-line/off-line; type of data: stationary/non-stationary.
Additionally it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids, limits on memory or processing time, and techniques for solving them, such as measuring phase transition (Rose, Gurewitz & Fox)
It seems a large amount of the effort in text mining related to text clustering is left out of this article, but it seems to be most appropriate place. Josh Froelich 20:16, 9 January 2007 (UTC)
I believe that this algorithm, developed at the University of Toronto by Brendan Frey, a professor in the department of Electrical and Computer Engineering, and Delbert Dueck, a member of his research group, which appeared in Science (journal) in February 2007, will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However, I am not capable of writing a full introduction, so I hope someone better equipped for the job will do that. Including the AP breakthrough is a must in my view to retain the currency of this article. Bunty.Gill 08:50, 24 April 2007 (UTC)
"Recurse with the reduced set of points." (with link to recursion)
Is this really recursion? I would call it iteration. You repeat the process until you can't go any further, then you stop. Sounds like a while loop to me.
--84.9.95.214 17:28, 1 July 2007 (UTC)
If you take a look at the Recursion (computer science) article: "Any function that can be evaluated by a computer can be expressed in terms of recursive functions without the use of iteration, in continuation-passing style; and conversely any recursive function can be expressed in terms of iteration." That is, you can rewrite anything primitive recursive as an iterative algorithm. Here, recursion is in the philosophical sense, since you apply the same analysis to a reduced set of points.
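The equivalence discussed above can be shown in a minimal sketch. The `peel` step below is a made-up reduction rule, not the article's algorithm; the point is only that the recursive and iterative forms do exactly the same work:

```python
def peel_recursive(points, peel):
    """Recursive form: peel off one cluster, recurse on the remainder."""
    if not points:
        return []
    cluster, rest = peel(points)
    return [cluster] + peel_recursive(rest, peel)

def peel_iterative(points, peel):
    """Iterative form: a while loop doing exactly the same work."""
    clusters = []
    while points:
        cluster, points = peel(points)
        clusters.append(cluster)
    return clusters

def peel(points):
    """Toy reduction step (an assumption, purely for illustration):
    group everything within distance 1 of the first point."""
    seed = points[0]
    cluster = [p for p in points if abs(p - seed) <= 1]
    rest = [p for p in points if abs(p - seed) > 1]
    return cluster, rest

data = [0.0, 0.5, 5.0, 5.2, 9.9]
assert peel_recursive(data, peel) == peel_iterative(data, peel)
assert peel_iterative(data, peel) == [[0.0, 0.5], [5.0, 5.2], [9.9]]
```

Whether the article says "recurse" or "repeat" is therefore largely a matter of taste, as the comment above argues.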
As mentioned in the article there are several different terms used to describe cluster analysis. However, the most frequent term used in books, papers, etc, always appears to be cluster analysis by a fair margin. It would make sense to use the most widely used term as the page name.
Approximately 7-10 days ago I collected statistics on the number of hits on each term from several different sources. The data could be used as an approximate indicator for the prevalence of each term. The numbers in brackets represent papers released since January 2000. On average the ratio of [papers published since January 2000 to number of papers overall] for the ACM and IEEE combined is approximately 0.757 and 0.759 for data clustering and cluster analysis respectively, thereby suggesting a relatively stable prevalence of each term in the literature over time.
| Source | IEEE | ACM | Alexa | Yahoo | Ask | Gigablast | Live Search | |
|---|---|---|---|---|---|---|---|---|
| Results for "Data Clustering" | 524 (368) | 682 (555) | 20,000 | 96,900 | 353,000 | 43,600 | 38,219 | 27,279 |
| Results for "Cluster Analysis" | 571 (511) | 1159 (724) | 130,000 | 715,000 | 1,860,000 | 307,200 | 316,708 | 148,096 |
Based on these results, and the titles and content of books on the subject, I propose that the page title be changed to "Cluster Analysis".
--MatthewKarlsen 16:07, 15 July 2007 (UTC)
Normalized Google Distance (NGD) -- when I saw this I thought it was a prank. It turns out that someone has actually written a (surprisingly well-informed) paper or two on this -- but that does not mean it is a serious approach (or that it is not just the type of prank that bored computer scientists come up with in their free time).
Is anyone here informed on this? Is NGD a viable norm function (in its domain)? I have been unable to find any peer-reviewed publications regarding the topic (New Scientist hardly counts). --SteelSoul 23:12, 1 November 2007 (UTC)
Someone should edit out the blatant advertising for the Excel plugin product. It would also be nice to have some additional info on picking the number of clusters.
I'm wondering if there isn't an error in this sentence: "Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance." I'm thinking that as the number of clusters increases, the within-group variation decreases, which is not what is shown on the graph. Should this be "... the ratio of the between-group variance to the total variance." Mhorney (talk) 17:57, 11 January 2008 (UTC)
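The between-group reading matches the intuition that the explained fraction grows as clusters are added, which is what the graph shows. A minimal 1-D sketch (illustrative data only):

```python
def explained_variance_ratio(clusters):
    """Percent of variance explained: between-group variance over
    total variance, for 1-D data given as a list of clusters."""
    all_pts = [x for c in clusters for x in c]
    grand = sum(all_pts) / len(all_pts)
    total = sum((x - grand) ** 2 for x in all_pts)
    between = sum(len(c) * ((sum(c) / len(c)) - grand) ** 2
                  for c in clusters)
    return between / total

# One cluster explains nothing; splitting into the two natural groups
# explains almost everything, so the ratio RISES with more clusters.
one = explained_variance_ratio([[1, 2, 9, 10]])
two = explained_variance_ratio([[1, 2], [9, 10]])
assert one == 0.0
assert two > 0.9
```

This supports the suggested correction: the within-group variance is what shrinks as k grows, so the plotted quantity must be the between-group share.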
The section is in violation of WP:NOR. —Preceding unsigned comment added by 71.100.12.147 (talk) 11:09, 11 September 2008 (UTC)
Theoretical and Empirical Separation Curves
Use of the term "cluster" to refer to a subset of a group of attributes which define a bounded class shows an obvious lack of comprehension of the subject matter. Use of the term "cluster" is not valid in this context when referring to the number of attributes as a selected subset of a group of attributes but valid only when referring to a multiset count of the values of an attribute where the count of the set or multiset values equals the number of clusters.
The number of attributes selected as the number of attributes in the subset is not arbitrarily selected or fixed but initially set to one for the first separation analysis, and thereafter progressively incremented until 100% separation is achieved, or to some point before the target set size exceeds computer capacity or the time allocated for classification is exceeded. The minimum number of attributes (not clusters) is determined mathematically as follows:
71.100.14.204 (talk) 20:47, 11 September 2008 (UTC)
A user coming from several different IP-numbers 71.100.*.* (DSL verizon) has a compulsion to add tags such as "original research" around the "elbow criterion". It is not apparent to me why this is so. Looking at the explained variance as a function of the number of clusters is a well-known method. I haven't heard of the term "elbow criterion" before, but looking at Google Scholar there seems to be no doubt that it is used in peer-reviewed communication. — fnielsen (talk) 12:24, 15 September 2008 (UTC)
71.100.*.*, let me put it this way - your conduct is in direct violation of the Verizon Internet Acceptable Use Policy (see ), and consider this your final warning to stop your range of disruptive activities on Wikipedia, including but not limited to
A report to Wikipedia:Abuse reports will take place for any further abuse, which will then result in an official communication to Verizon, possibly leading to the termination of your account (as detailed by the AUP). You don't want this. Please stop. --Jiuguang (talk) 17:50, 15 September 2008 (UTC)
Hey, it would be nice to have some image that intuitively shows the idea of clustering. Usually in courses on machine learning or tutorials on clustering such images are shown. They are usually two dimensional depicting a number of points dispersed in the coordinate system, circles mark clusters/groups of points. I think for somebody opening an article about clustering and who is new to the topic such an image could be very helpful. Ben T/C 14:03, 10 February 2009 (UTC)
That's a pretty big topic. It shouldn't be just a subsection of this article. I may have to start one if no one else does within the next several months. Makewater (talk) 20:36, 6 April 2009 (UTC)
The number of different ways to choose k seems to warrant more than a subsection on this page, especially since identification of the number of clusters in a data set is a separate issue from ways of actually performing clustering. I've expanded the former subsection on the topic into a standalone page, Determining the number of clusters in a data set. -JohnMeier (talk) 00:42, 8 April 2009 (UTC)
Hi, I found this page when I was looking for information on Cluster Sampling. Perhaps there should be one of those nifty disambiguation links at the top of this page. I don't really understand what this cluster analysis thingy is or how important it is, so I'm not sure whether or not such a link would be justified, but it would have saved me some time. 220.239.204.226 (talk) 05:43, 4 November 2009 (UTC)
It's a pity that the application of the technique has been limited
--222.64.209.26 (talk) 03:40, 20 November 2009 (UTC)
I'm sure that if CA is used in conjunction with DNA profiling for population management, lots of population overloads can be managed.--222.64.209.26 (talk) 03:46, 20 November 2009 (UTC)
Look at that...
--222.64.209.26 (talk) 03:51, 20 November 2009 (UTC)
Addressing WHAT IS NOT
--222.67.208.51 (talk) 06:46, 24 November 2009 (UTC)
I've tagged this article as a cleanup as it is becoming very confusing and unwieldy. My vague suggestions:
Any comments? —3mta3 (talk) 17:00, 18 May 2009 (UTC)
There are lots of details about different methods and metrics used to solve the problem, which is defined too loosely.
As I understand it, clustering can be used for different purposes, but it is generally used to find classes of data points that aren't easily noticeable otherwise. However, if you are using the term 'problem' in the CS theoretical sense, then each clustering algorithm really has a different problem.
Darthhappyface (talk) 04:32, 25 May 2010 (UTC)
Hi all,
I wrote an article about a method to visualize cluster analysis, here: http://www.r-statistics.com/2010/06/clustergram-a-graph-for-visualizing-cluster-analyses-r-code/
I was wondering where (and if) to add the above information about this method to the page, but couldn't quite figure out where in the page to do so. Any suggestions?
Talgalili (talk) 16:36, 15 June 2010 (UTC)
In the section on spectral clustering, I think the formula P = S*D^(-1) should be P = D^(-1)*S instead, as written in eq. 4 of the paper by Meila and Shi, namely "A random walks view of spectral segmentation", Meila, M., Shi J., AISTATS 2001. In general D^(-1) and S DO NOT commute. One definition leads to the transpose of the other for P -- same eigenvalues but different eigenvectors. Unless someone can provide a justification for this formula, I think this could be a bona fide error. TonyMath (talk) 19:47, 19 April 2011 (UTC)
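A tiny numeric check supports this point: D^(-1)*S scales the rows of S and is row-stochastic (a proper random-walk transition matrix), while S*D^(-1) scales the columns and generally is not. Pure-Python sketch with made-up numbers:

```python
# S is a small symmetric similarity matrix; D is the diagonal matrix
# of row sums (degrees).
S = [[0.0, 2.0],
     [2.0, 3.0]]
deg = [sum(row) for row in S]          # D's diagonal: [2.0, 5.0]
n = len(S)

# P1 = D^(-1) * S : divides each ROW i of S by deg[i]
P1 = [[S[i][j] / deg[i] for j in range(n)] for i in range(n)]
# P2 = S * D^(-1) : divides each COLUMN j of S by deg[j]
P2 = [[S[i][j] / deg[j] for j in range(n)] for i in range(n)]

row_sums_P1 = [sum(row) for row in P1]
row_sums_P2 = [sum(row) for row in P2]

# P1 is row-stochastic, so it is a valid random-walk matrix; P2 is not,
# confirming that D^(-1) and S do not commute here.
assert all(abs(s - 1.0) < 1e-12 for s in row_sums_P1)
assert any(abs(s - 1.0) > 1e-9 for s in row_sums_P2)
```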
This section is not complete. It should give more details about the eigenvector extraction, and explain why. Moreover, the formula for the Laplacian matrix is L = D - S. Indeed, the formula in the article is the one for the normalized Laplacian matrix. This normalization is due to the relaxation of the constraints on the indicator vector. —Preceding unsigned comment added by Guillaumew (talk • contribs) 09:26, 4 May 2011 (UTC)
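A small sketch of the unnormalized Laplacian L = D - S mentioned above (illustrative numbers): its rows sum to zero, so the constant vector is an eigenvector with eigenvalue 0, which is the starting point of the spectral relaxation.

```python
# A 3-node "star" similarity matrix: node 0 connected to nodes 1 and 2.
S = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
deg = [sum(row) for row in S]
n = len(S)

# Unnormalized graph Laplacian: L = D - S
L = [[(deg[i] if i == j else 0.0) - S[i][j] for j in range(n)]
     for i in range(n)]

# Every row of L sums to zero, i.e. L * (1, 1, ..., 1)^T = 0:
# the constant vector is an eigenvector with eigenvalue 0.
assert all(abs(sum(row)) < 1e-12 for row in L)
```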
I understand that CA will cluster just anything we throw at it: random data, linearly correlated data, etc. Could somebody knowledgeable please point out when NOT to use CA? --Stevemiller (talk) 04:56, 15 February 2008 (UTC)
One person's opinion. Please allow me to suggest a criterion for determining whether two separate clusters really should be separate: two distinct, well-defined subpopulations constitute distinct clusters if and only if characteristics of interest have a non-zero difference in their means at a statistically significant level of confidence. Thus: (1) a criterion for identifying distinct clusters is given using a standard, well-accepted statistical methodology, the difference between two means; (2) whether two clusters are distinct may be ambiguous depending on the confidence level required (for example, two subpopulations may be distinct clusters with 95% confidence but not 99% confidence); (3) if no distinct clusters are identified under this criterion, cluster analysis fails: the null hypothesis cannot be rejected, and the population should therefore be considered homogeneous at the level of statistical confidence used in testing the hypothesis of distinctness. —Preceding unsigned comment added by Davidjcorliss (talk • contribs) 02:36, 24 May 2011 (UTC)
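The criterion proposed above can be sketched as code. This is a rough z-test approximation of the difference-in-means test (a Welch t-test would be more exact for small clusters); the function name and data are illustrative:

```python
from statistics import NormalDist, mean, stdev

def distinct_clusters(a, b, confidence=0.95):
    """Declare two clusters distinct iff the difference in their means
    is significant at the given confidence level. Two-sided z-test,
    an approximation valid for reasonably large clusters."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = abs(mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return p < (1 - confidence)

a = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9]
b = [3.0, 3.1, 2.9, 3.2, 2.8, 3.0, 3.1, 2.9]
assert distinct_clusters(a, b)        # well-separated means: distinct
assert not distinct_clusters(a, a)    # identical samples: not distinct
```

This also shows point (2) directly: raising `confidence` can flip the verdict for borderline pairs.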
Yes, the two should be merged. I don't think Wikipedia is a how-to manual, so the applications, use and basic methodology should be in a single article--Maven111 (talk) 12:36, 3 March 2010 (UTC)
The article cluster analysis (in marketing) seems to repeat a lot of information. Should we incorporate it somehow? —3mta3 (talk) 08:19, 21 May 2009 (UTC)
I think it's not a good idea, because the cluster analysis is dealing with the methods and the "in marketing" one is simply mentioning the uses of the analysis in economy. Kroolik (talk) 11:45, 25 May 2009 (UTC)
This section should be merged with market segmentation discussions of benefit segmentation analysis (marketing). It is an application of a broader concept, but it is very specific to benefit segmentation. —Preceding unsigned comment added by 74.75.128.221 (talk) 23:43, 25 October 2009 (UTC)
Yes - the two should be merged. I apply cluster analysis in marketing segmentation as a consultant and also in astrostatistics and in education analysis in my academic research. Yet, the mathematics is, at all points, the same. It is a single discipline with a multiplicity of applications. Davidjcorliss (talk) 02:46, 24 May 2011 (UTC)
There seems to be an inordinate and excessive use of boldface. In particular there is what appears to be simply a list of applications of clustering which has very little prose in it. Some trimming is suggested as per MOS:BOLD#Boldface, where the specific example shows the list as a list article, not in the middle of a non-list article. Chaosdruid (talk) 14:23, 16 February 2012 (UTC)
No link or mention in this article of spectral clustering, which has its own entire wikipedia article. — Preceding unsigned comment added by 192.249.47.174 (talk) 21:05, 19 June 2012 (UTC)
About the pairwise F-measure: "This measure is able to compare clusterings with different numbers of clusters". Actually, all of the reviewed indices are able to do so. Well, there are some well-known drawbacks, especially with the Rand index, which tends to 1 as the number of clusters increases because of the False Positive term, but you just have to keep that in mind ;) Moreover, if you want to compare clusters independently, a matching is obviously required, and it is generally done by the Hungarian method for the sake of efficiency. — Preceding unsigned comment added by 84.98.253.168 (talk) 02:19, 23 October 2012 (UTC)
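The matching step mentioned above can be sketched as follows. Brute force over label permutations is used here instead of the Hungarian method purely for brevity; both solve the same assignment problem (illustrative code, assumes both labelings use the same number of clusters):

```python
from itertools import permutations

def best_matching_overlap(labels_a, labels_b):
    """Match the clusters of two labelings so the total overlap is
    maximal. Brute force over permutations for clarity; the Hungarian
    method solves the same assignment problem in polynomial time."""
    ca = sorted(set(labels_a))
    cb = sorted(set(labels_b))
    # Contingency counts: how many points fall in cluster i of A
    # and cluster j of B simultaneously.
    overlap = {(i, j): sum(1 for x, y in zip(labels_a, labels_b)
                           if (x, y) == (i, j))
               for i in ca for j in cb}
    return max(sum(overlap[i, j] for i, j in zip(ca, perm))
               for perm in permutations(cb))

a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 2, 2, 0, 0]   # the same partition under permuted labels
assert best_matching_overlap(a, b) == 6   # a perfect matching exists
```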
On 11/3/2012 I added 4 independent citations to support the article's statement that Clustering is a main task of explorative data mining. These citations were removed the same day. I propose that it would be good to have citations for claims like this.
I am a novice wikipedia author, and on 11/3, I did not sign my edit or say anything on the TALK page. I now know that I should do both of these things. I should also point out that I am an author on 2 of the citations I added. However, I do not think that there is a COI in this. It is an area in which I have done research, so it is an area I know. I am simply trying to add citations that I think would help the article. However, if other wikipedia authors evaluate this and feel that my 2 citations should be excluded, I would still encourage the community to retain the other two citations that I added on 11/3.
Thank you for considering this.Karl (talk) 01:43, 6 November 2012 (UTC)
The sentence "Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion." is very long and adds very little value to the article. It should be rewritten. — Preceding unsigned comment added by 92.229.28.245 (talk) 13:44, 30 January 2013 (UTC)
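For reference, a minimal 1-D sketch of the index that sentence describes (illustrative data; lower values indicate compact, well-separated clusters):

```python
def davies_bouldin(clusters):
    """Davies-Bouldin index for 1-D clusters: for each cluster, take the
    worst-case ratio of summed within-cluster scatter to between-centroid
    distance, then average. Lower is better."""
    cents = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(x - m) for x in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
                     for j in range(k) if j != i)
    return total / k

tight_far = [[0.0, 0.2], [10.0, 10.2]]   # compact and well separated
loose_near = [[0.0, 2.0], [3.0, 5.0]]    # spread out and close together
assert davies_bouldin(tight_far) < davies_bouldin(loose_near)
```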
please see Talk:Statistical classification Fgnievinski (talk) 23:38, 3 May 2014 (UTC)
The data in the figures should be explained, meaning what the x-axis and y-axis represent in the figures above. A similar problem also occurs in other sections of this article. Moreover, the data in the two figures are different, so how can we compare the difference between the two methods?
The topic of multi-assignment clustering seems to be missing altogether from wikipedia. --Nicolamr (talk) 22:55, 28 July 2014 (UTC)
The text says: "Cluster analysis was originated in anthropology by Driver and Kroeber in 1932 and introduced to psychology by Zubin in 1938 and Robert Tryon in 1939[1][2] and famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology."
Driver and Kroeber 1932 and Zubin 1938 do not appear in the references list. Does anyone have access to Ken (1994) to check the full references for these two works?
--Lucas Gallindo (talk) 23:12, 2 December 2014 (UTC)
Here they are; should we put them back?
I think the list should either be kept out of the article altogether, or it should be explicitly limited to a small number (say half a dozen) of particularly notable clustering systems, for which some strong argument can be made for their inclusion beyond "it's a clustering system" or even "it's a free clustering system". By a strong argument, I mean statements such as "it's the most widely used free clustering system for Linux" or the like. Anything less restrictive is just an invitation for an unencyclopedic link farm. —David Eppstein 03:22, 30 March 2007 (UTC)
There are other well-known (within the field) directories for such things... Maybe use one of them? and , for example? stoborrobots (talk) 15:23, 22 February 2011 (UTC)
I read the earlier discussion "Software links were removed". I can somewhat appreciate the concerns raised but I think there is a problem.
Without information about relevant analysis software, the article seems to me lacking. I came to the article with a general concept of what cluster analysis is about and a desire to find out how to look for clusters in a data set I have.
The problem is that (in my perception) this article goes from the general conceptual level into the advanced level of discussing different theoretic approaches, with very little in between.
The article List of statistical packages has an extensive list. Could this article refer to that article to the extent of indicating packages there that include cluster analysis? Ideally also indicating which packages would be more helpful to someone unused to cluster analysis?
If this sort of information is NOT appropriate to have in the article, can anyone offer me any "private" user talk page advice? Thanks. Wanderer57 (talk) 19:21, 11 May 2011 (UTC)
It is unclear if they copied wikipedia, or someone pasted content from this article: http://files.aiscience.org/journal/article/html/70110028.html — Preceding unsigned comment added by Sakoht (talk • contribs) 02:23, 25 April 2016 (UTC)
I do not think references for applications are needed. This just attracts spam. HelpUsStopSpam (talk) 19:51, 1 November 2016 (UTC)
Hello fellow Wikipedians,
I have just modified one external link on Cluster analysis. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}}
(last update: 5 June 2024).
Cheers.—InternetArchiveBot (Report bug) 20:32, 9 August 2017 (UTC)
I distinctly remember there was a section on the Quality Threshold (QT) algorithm. I cannot find it anymore and it's a very useful algorithm that should be included. Why was this section deleted? I cannot find any discussion in the Talk section about this removal? — Preceding unsigned comment added by 131.107.160.106 (talk) 23:44, 1 April 2018 (UTC)
The evaluation section extension with benchmarking frameworks contained content copied from the Clubmark paper; however, that paper had already been published in the public domain (The Clubmark paper, arXiv) under the respective public licences, and the original paper was explicitly cited in the content.
In addition, today the author(s) have emailed the permission to permissions-en(at)wikimedia(dot)org to use the paper under the (CC-BY-SA), version 3.0. {{OTRS pending}}
How can the rolled-back refinements to the Cluster Analysis page, temporarily removed because of the copyright issues, be recovered? --Glokc (talk) 06:12, 11 February 2019 (UTC)
Could someone in the statistical field include a line or two in the intro (or elsewhere) that explains the purpose of cluster analysis? The "what" and "how" are explained to a good extent, but I can't find the "why" anywhere. Given its use in machine learning and data mining, I think it would be timely to include the reasons. Economicactvist (talk) 08:26, 25 June 2019 (UTC)
This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2020 and 6 December 2020. Further details are available on the course page. Student editor(s): Rc4230.
Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 17:53, 16 January 2022 (UTC)
Currently the description of Jaccard Index ends by saying "Also TN is not taken into account and can vary from 0 upward without bound." However, this is incorrect and contradicts the beginning of the description which correctly says the metric ranges from 0 to 1. The zero to one range is also attested in the main article for Jaccard index. I am therefore going to remove that incorrect last sentence. Showeropera (talk) 18:51, 14 February 2023 (UTC)
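A quick sketch supports the removal: in the pairwise formulation J = TP / (TP + FP + FN), TN never appears, and the value cannot leave [0, 1] (the variable names follow the usual pair-counting convention):

```python
def jaccard_index(tp, fp, fn):
    """Pairwise Jaccard index: J = TP / (TP + FP + FN).
    TN does not enter the formula, and the value is always in [0, 1]."""
    return tp / (tp + fp + fn)

assert jaccard_index(10, 0, 0) == 1.0     # perfect agreement
assert jaccard_index(0, 5, 5) == 0.0      # no agreed pairs at all
assert 0.0 <= jaccard_index(3, 4, 5) <= 1.0
```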