You may often want to partition your data into meaningful groups based on some measure of "closeness." Choosing the partition by hand is subjective and therefore open to criticism from other researchers. K-means clustering is one common way to address this problem. It is an unsupervised machine learning algorithm that automatically partitions your data into a number of groups you specify, assigning each point to the cluster whose mean is nearest. The algorithm starts from initial guesses and converges to a local optimum, so the result is not guaranteed to be the best possible partition and can change from run to run. MATLAB offers a k-means clustering function, "kmeans," in the Statistics and Machine Learning Toolbox that you can apply to your data set.

- Skill level: Moderate

## Instructions

- 1
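Read your data into MATLAB as a matrix. Locate the data file on your computer and note its filename (e.g. "datafile.dat"). Use the command "[dat, vars, cases] = tblread(filename)" where "filename" is a string holding the name of the file containing your data, such as 'datafile.dat'. The function expects variable names in the first row of the file and case names in the first column. Press Enter and the variable "dat" will be a numeric matrix containing your data, with one row per case and one column per variable; "vars" and "cases" hold the variable and case names.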
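If "tblread" is not available in your MATLAB installation, the same matrix can be built with "readtable". The following is only a sketch under assumptions: the file name, the comma delimiter, and the column names "height" and "weight" are hypothetical, and the snippet writes its own tiny file so that it can run on its own.

```matlab
% Write a tiny hypothetical data file so the sketch is self-contained.
filename = fullfile(tempdir, 'datafile.dat');
fid = fopen(filename, 'w');
fprintf(fid, 'case,height,weight\n');
fprintf(fid, 'a,1.0,2.0\n');
fprintf(fid, 'b,1.5,2.5\n');
fprintf(fid, 'c,8.0,9.0\n');
fclose(fid);

% Read the file, treating the first column as case (row) names.
T = readtable(filename, 'FileType', 'text', 'Delimiter', ',', ...
    'ReadRowNames', true);

dat   = T{:, :};                      % numeric matrix: rows = cases, columns = variables
vars  = T.Properties.VariableNames;   % cell array of variable names
cases = T.Properties.RowNames;        % cell array of case names

fprintf('%d cases, %d variables\n', size(dat, 1), size(dat, 2));
```

Whichever reader you use, check that "dat" has one row per observation before clustering, because "kmeans" treats each row as a point.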

- 2
Decide on the number of means for the k-means clustering algorithm. The number of means you choose will be exactly equal to the number of groups yielded. Use the properties of your data and the problem at hand to decide how many groups you wish to partition the data into.

- 3
Decide how the k-means clustering algorithm should compute the distance between points. Two common choices are Euclidean distance and correlation distance. Euclidean distance is the "physical" distance between points as if you graphed them on a Cartesian plane; MATLAB's "kmeans" uses the square of this distance. Correlation distance is one minus the sample correlation between two points, with each point treated as a sequence of values. It compares the pattern of values across variables rather than their magnitude, so it may be more suitable when the shape of each observation's profile matters more than its overall level.
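To see how the two distances differ, it can help to compute them by hand for a pair of points. This is a minimal sketch with made-up vectors "x" and "y"; it uses only base MATLAB and does not call "kmeans" itself.

```matlab
% Two hypothetical observations with the same pattern but different magnitude.
x = [1 2 3 4];
y = [2 4 6 8];

% Squared Euclidean distance: sum of squared coordinate differences.
dEuc = sum((x - y).^2);

% Correlation distance: one minus the sample correlation between x and y.
xc = x - mean(x);                 % center each point on its own mean
yc = y - mean(y);
dCorr = 1 - (xc * yc') / (norm(xc) * norm(yc));

fprintf('squared Euclidean: %g, correlation: %g\n', dEuc, dCorr);
```

Here the squared Euclidean distance is large because the magnitudes differ, while the correlation distance is essentially zero because the two points rise and fall together.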

- 4
Run the k-means clustering algorithm. Use the command "ind = kmeans(dat, g, 'Distance', distance)" where "g" is the number of clusters you want and "distance" is the type of distance you want the k-means clustering algorithm to use: 'sqeuclidean' for squared Euclidean distance or 'correlation' for correlation distance. The output "ind" is a column vector giving the cluster number, from 1 to g, assigned to each row of "dat."
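The steps above can be sketched end to end as follows. The data here are randomly generated stand-ins, not real measurements, and the use of 'Replicates' is an optional addition that reruns the algorithm from several random starts and keeps the best result.

```matlab
rng(1);  % fix the random seed so the run is repeatable

% Hypothetical data: two well-separated groups of 2-D points.
dat = [randn(20, 2); randn(20, 2) + 10];
g = 2;   % number of clusters

% Cluster with squared Euclidean distance, using 5 random restarts.
[ind, centers] = kmeans(dat, g, 'Distance', 'sqeuclidean', 'Replicates', 5);

% ind(i) is the cluster number (1..g) assigned to row i of dat;
% centers holds one row per cluster mean.
counts = accumarray(ind, 1);
disp(counts');
```

To use correlation distance instead, replace 'sqeuclidean' with 'correlation'. Cluster numbers are arbitrary labels, so the same grouping may be numbered differently from one run to the next.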