## kmeans and nan

ralph.ronnquist
Posts: 216
Joined: Mon Jun 02, 2014 1:40 am
Location: Melbourne, Australia

### kmeans and nan

I started investigating kmeans for vector classification, and I'm getting more 'nan' than I'm happy with. I hope there's a simple answer to what I'm doing wrong:

E.g. I run kmeans-train to form 5 clusters for a collection of ~50 vectors of 100 elements. Then some centroids consist entirely of -nan values. What does that mean?

Next, I run a sequence of kmeans-query using these centroids (including the -nan ones), and then often but not always some measures are 'nan'. What does that mean?

Alternatively, is it sensible to revise the centroids, e.g. map -nan to something small or large, to maybe avoid the nan classifications?

Lutz
Posts: 5279
Joined: Thu Sep 26, 2002 4:45 pm

### Re: kmeans and nan

First a general observation:
Do you have 50 data records (rows) in 100 dimensions (columns), or 100 data records in 50 dimensions? Either way, the number of dimensions (columns) seems very large relative to the number of data records. That in itself is not responsible for the nans, though; it is just a general observation.

But now about the nans (or NaNs depending on the platform). In the kmeans-train syntax:

``(kmeans-train <matrix-data> <int-k> <context> [<matrix-centroids>])``
int-k is the number of clusters to generate. This is a tentative number: if it is too big, kmeans-train will leave some clusters unpopulated, with a frequency of 0 data records and all values in the centroid set to nan. When these invalid centroids are used, distance calculations involving nans produce nans again. In other words: the distance of a data point (record or row) from a nan centroid is nan.
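
To illustrate why an unpopulated cluster ends up with a nan centroid, and why that nan propagates into the distances, here is a minimal sketch in plain Python (not newLISP, and not newLISP's actual implementation). It assumes a centroid is the column-wise mean of its member rows and that distance is Euclidean; `centroid`, `distance` and `is_valid` are hypothetical helper names.

```python
import math

def centroid(rows, dims):
    """Column-wise mean of the member rows; an empty cluster yields all-NaN."""
    if not rows:
        # Assumption: a cluster without members has no defined mean, hence NaN.
        return [math.nan] * dims
    return [sum(col) / len(rows) for col in zip(*rows)]

def distance(point, center):
    """Euclidean distance; a single NaN coordinate makes the result NaN."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def is_valid(center):
    """True if the centroid contains no NaN coordinate."""
    return not any(math.isnan(c) for c in center)

populated = centroid([[1.0, 2.0], [3.0, 4.0]], 2)  # cluster with two members
empty = centroid([], 2)                            # cluster with no members

d_ok = distance([2.0, 2.0], populated)   # ordinary finite distance
d_nan = distance([2.0, 2.0], empty)      # NaN, because the centroid is NaN

# One option before querying: keep only the centroids that are usable.
usable = [c for c in (populated, empty) if is_valid(c)]
```

Under these assumptions, filtering out the invalid centroids before querying avoids nan distances without having to map nan to an arbitrary small or large value.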

For 50 data records, I would probably start out with an int-k of no more than 5, or no more than 10 for 100 records. If some of the clusters have disproportionately large memberships, I would increase that number, trying to split up those relatively large clusters.

You could repeat the calculation using a smaller int-k, trying to eliminate nan centroids. Also look in K:labels for centroids with very little membership. Do these centroids describe real data outliers? Or are they just a sign that your int-k number is still a bit high?

When looking into K:labels (the cluster membership where K is the context), you will see that nan centroids are ignored. The expression:

``(count (unique K:labels) K:labels)``
can give you a count of data records in each cluster.
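
For readers less familiar with `count` and `unique`, the same tally can be sketched in Python. The `labels` list below is made up for illustration, and the sketch assumes cluster labels are numbered 1 to k; neither comes from the thread's actual data.

```python
from collections import Counter

# Hypothetical cluster labels for 10 data records, trained with k = 5.
labels = [1, 3, 1, 4, 3, 1, 1, 4, 3, 1]
k = 5

sizes = Counter(labels)

# Membership per cluster number, including clusters that never occur.
per_cluster = [sizes.get(c, 0) for c in range(1, k + 1)]

# Clusters without any members; these are the ones whose centroid would be nan.
unpopulated = [c for c in range(1, k + 1) if sizes.get(c, 0) == 0]
```

Counting over the unique labels alone only reports clusters that actually occur, so listing all k cluster numbers explicitly, as here, makes the unpopulated ones visible.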

ralph.ronnquist

### Re: kmeans and nan

Great. Thanks. That makes good sense, and is very helpful.

Now, a follow on question, which requires me to elaborate the scenario a bit: basically I'm developing a rolling classification scheme for a growing time series of data, where each time point has a 100 element characterisation vector. The purpose is to classify new data points as they come in against the past time series of data in a timely manner.

The past time series is quite large, but the timeliness requirement prohibits running clustering on all of it for each input. At the same time, the fundamental proposal is that there is a reasonably small number of clusters, which occur repetitively (though irregularly) and distinctly (i.e., distinctly either present or absent in shorter time periods). Apparently, 50-point periods usually have ~3 clusters.

So, my present thinking is, to repeatedly use the last (say) ~50 points every so often (say, at every ~30 points), obtain centroids for that, and collate these centroids into the "symbol set" for the rolling classification.
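
The windowing part of this scheme can be sketched as follows, in Python for illustration. The function name `training_windows` and the example length of 140 points are hypothetical; the window of 50 and the step of 30 are the approximate figures from the post. Each yielded index pair marks the slice of the series that would be passed to a training run.

```python
def training_windows(n_points, window=50, step=30):
    """Yield (start, end) index pairs for the trailing window at each retraining point.

    The first window is trained once `window` points are available; after that,
    a new window is taken every `step` points, covering the last `window` points.
    """
    for end in range(window, n_points + 1, step):
        yield (end - window, end)

# Example: a series that has grown to 140 points so far.
windows = list(training_windows(140))
```

With a step smaller than the window, consecutive windows overlap (here by 20 points), so the same cluster may well show up in two successive runs, which is one reason the collected centroids would need some form of collation.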

Would you have any comment on this approach?

Maybe the centroids should be collated by means of their own clustering scheme? Or perhaps it is significantly better to form larger cluster sets by training with more data points? (As you might notice, I'm not very well read on the underlying theory.)

ralph.ronnquist

### Re: kmeans and nan

I'm still fascinated with newLISP. It lets me formulate, test and discard a stack of stupid ideas in a single sitting, without forcing piles and piles of infrastructural coding. Now I just wish I had better ideas :-)

Lutz

### Re: kmeans and nan

It’s hard to make any recommendation without knowing more about the data and the purpose of classifying them. But here are some general thoughts.

Dealing with time series, perhaps you want to find classes of behavior over time. Currently your rows in the matrix seem to represent time points and the columns some other elements changing over time. Transpose the matrix! So now you have rows of elements changing over time (columns). Clustering now will give you types of time movement behavior patterns. Those could also be useful. Again it all depends where your data come from.
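
As a small illustration of what the transposition does to the data layout, here is a Python sketch with made-up numbers (in newLISP itself, the built-in `transpose` function serves the same purpose). Before transposing, each row is one time point; afterwards, each row is one element followed over time.

```python
# Hypothetical data: 4 time points (rows), 3 measured elements (columns).
rows_by_time = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 4.0],
    [3.0, 30.0, 3.0],
    [4.0, 40.0, 2.0],
]

# Transpose: each row now holds one element's values across all time points.
rows_by_element = [list(col) for col in zip(*rows_by_time)]
```

Clustering `rows_by_element` would then group elements by how they move over time, rather than grouping time points by their element values.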

What is your classification for? Classifying new time points? If so, develop some kind of quantifiable validation for applying the results of a previous cluster analysis to new data.

ralph.ronnquist wrote: test and discard a stack of stupid ideas in a single sitting, without forcing piles and piles of infrastructural coding
This is the way many people, especially creative ones, are using newLISP.

ralph.ronnquist

### Re: kmeans and nan

You just opened a new door for me! No, you gave me a whole new game level!

The transposition idea indeed is an interesting and useful perspective, provided the measurement dimensions are "compatible", or can be duly normalized. I think it applies for my purpose (which I have to stay apologetically secretive about).

In particular, I can see its cluster memberships being a useful suggestion of which dimensions "move together in a similar way", which is a quite interesting/important aspect, as is the labelling of which kinds of motions there are. I also see how this perspective easily extends to an abstracted, long-term motion analysis, simply by stepping away from the unit time step of the series and considering (transposed) sub-series of every nth point.
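
A rough Python sketch of that idea: take every nth time point, transpose, and normalize each element's sub-series so that differently scaled dimensions become comparable. The z-score normalization is only one possible choice and is an assumption here, as are the helper names `every_nth` and `zscore` and the made-up two-element series.

```python
import statistics

def every_nth(rows, n):
    """Sub-series made of every nth time point, starting with the first."""
    return rows[::n]

def zscore(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant series has no variation to scale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical series: 9 time points, 2 elements moving in opposite directions
# and on different scales.
series = [[float(t), 100.0 - 2.0 * t] for t in range(9)]

coarse = every_nth(series, 4)  # time points 0, 4 and 8
by_element = [zscore(list(col)) for col in zip(*coarse)]
```

After normalization the two element rows are mirror images of each other, so their difference in scale no longer dominates a distance comparison.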

... this will keep me off the streets a fair while :-)

rickyboy
Posts: 595
Joined: Fri Apr 08, 2005 7:13 pm
Location: Front Royal, Virginia

### Re: kmeans and nan

ralph.ronnquist wrote:... this will keep me off the streets a fair while :-)
New tagline for newLISP: "Use of newLISP may lower vagrancy in your hometown." :)