First a general observation:

Do you have 50 data records (rows) in 100 dimensions (columns), or 100 data records in 50 dimensions? In either case the number of dimensions (columns) seems very big relative to the number of data records. But that in itself is not responsible for the nans; it is just a general observation.

But now about the nans (or NaNs, depending on the platform). The kmeans-train syntax is:

- Code: Select all
`(kmeans-train <matrix-data> <int-k> <context> [<matrix-centroids>])`
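For illustration, a call might look like this (a sketch only, assuming your 50-by-100 data is already in a variable named data; the context name K is arbitrary):

- Code: Select all
```
; train with a tentative 5 clusters; returns the SSE after each iteration
(kmeans-train data 5 'K)
K:labels     ; cluster membership of each record, set by training
K:centroids  ; 5 centroid vectors; an unpopulated cluster shows nan values
```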

int-k is the number of clusters to generate. This is a tentative number. If the number is too big, kmeans-train will leave some clusters unpopulated, with a frequency of 0 data records and all values in the centroid set to nan. When these invalid centroids are used, any distance vector calculated from nans will contain nans again. In other words: the distance of a data point (record or row) from a nan centroid is nan.

For 50 data records, I would probably start out with an int-k of no more than 5, or for 100 records with no more than 10. If some of the clusters have disproportionately big memberships, I would increase that number, trying to split up those relatively large clusters.

You could repeat the calculation using a smaller int-k, trying to eliminate nan centroids. Look also in K:labels for centroids with very little membership. Do those centroids describe real data outliers? Or are they just a sign that your int-k is still a bit too high?
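That repetition could be automated with a small sketch like the following (not a definitive recipe; it assumes data holds the data matrix and uses NaN? to detect nan values in the trained centroids):

- Code: Select all
```
; keep lowering int-k until no centroid contains a nan
(let (k 10)
  (kmeans-train data k 'K)
  (while (and (> k 1) (exists NaN? (flat K:centroids)))
    (dec k)
    (kmeans-train data k 'K))
  k)  ; the largest tried k that produced only valid centroids
```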

When looking into K:labels (the cluster membership, where K is the context), you will see that nan centroids are ignored. The expression:

- Code: Select all
`(count (unique K:labels) K:labels)`

can give you a count of data records in each cluster.
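To see each label next to its count, the two lists can be paired up, for example (a minimal sketch using the trained context K from above):

- Code: Select all
```
; pair each cluster label with the number of records assigned to it
(map list (unique K:labels) (count (unique K:labels) K:labels))
```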