This tutorial was written using the kohonen package version 2.0.19. Some of the code will not work in the most recent version of this package. To install 2.0.19, run the following:
packageurl <- "https://cran.r-project.org/src/contrib/Archive/kohonen/kohonen_2.0.19.tar.gz"
install.packages(packageurl, repos = NULL, type = "source")
I hope to update all of the SOM tutorials to run properly on kohonen v3 in the near future.
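Self-Organizing Maps (SOMs) are a tool for visualizing patterns in high-dimensional data by producing a two-dimensional representation that (hopefully) displays meaningful patterns in the higher-dimensional structure. SOMs are “trained” with the given data (or a sample of your data) in the following way: a grid of cells is defined, each cell is assigned an initial representative vector in the data space, and the training vectors are then fed in repeatedly. For each training vector, the cell with the closest representative vector is identified, and the representative vectors of that cell and of nearby cells on the grid are adjusted to be closer to the training vector.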
The key feature this algorithm gives to the SOM is that points that were close in the data space are close in the SOM. Thus SOMs may be a good tool for representing spatial clusters in your data.
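To make the training loop concrete, here is a minimal base R sketch of it. It is an illustration under simplifying assumptions (rectangular grid, linearly shrinking neighborhood radius and learning rate), not the kohonen package's implementation; the function name train_som_sketch and the toy data are made up for this example.

```r
# Minimal SOM training sketch in base R (illustrative only; not the kohonen code).
set.seed(1)

train_som_sketch <- function(data, xdim = 3, ydim = 3, rlen = 50,
                             alpha = c(0.05, 0.01)) {
  # Grid coordinates of every cell (rectangular layout)
  grid <- as.matrix(expand.grid(x = seq_len(xdim), y = seq_len(ydim)))
  n_cells <- nrow(grid)
  # Initialise representative vectors with randomly chosen data rows
  codes <- data[sample(nrow(data), n_cells), , drop = FALSE]
  # Pairwise distances between cells on the map
  grid_dist <- as.matrix(dist(grid))
  # Neighbourhood radius and learning rate both shrink linearly over training
  n_steps <- rlen * nrow(data)
  radii <- seq(max(grid_dist) * 2 / 3, 0, length.out = n_steps)
  rates <- seq(alpha[1], alpha[2], length.out = n_steps)
  step <- 0
  for (epoch in seq_len(rlen)) {
    for (i in sample(nrow(data))) {
      step <- step + 1
      x <- data[i, ]
      # 1. Find the cell whose representative vector is closest to x
      winner <- which.min(rowSums(sweep(codes, 2, x)^2))
      # 2. Move the winner and its grid neighbours toward x
      hood <- grid_dist[winner, ] <= radii[step]
      delta <- sweep(codes[hood, , drop = FALSE], 2, x)  # codes - x
      codes[hood, ] <- codes[hood, , drop = FALSE] - rates[step] * delta
    }
  }
  list(codes = codes, grid = grid)
}

# Toy data: 200 random points in three dimensions, scaled and centered
toy <- scale(matrix(rnorm(200 * 3), ncol = 3))
fit <- train_som_sketch(toy)
dim(fit$codes)  # one 3-dimensional representative vector per cell: 9 x 3
```

Because neighboring cells are pulled toward the same training vectors, cells that are close on the grid end up with similar representative vectors, which is where the "close in the data space, close in the SOM" property comes from.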
require(kohonen)
require(RColorBrewer)
The kohonen package allows for quick creation of some basic SOMs in R. Our examples below will use player statistics from the 2015/16 NBA season. We will look at player stats per 36 minutes played, so variation in playing time is somewhat controlled for. These data are available at http://www.basketball-reference.com/. We’ve already cleaned the data. The kohonen functions require numeric fields with no missing entries.
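If you are preparing your own data, a quick check along these lines can help. This is a sketch on a made-up data frame (the stats object and its values are hypothetical), showing one way to keep only numeric columns and drop rows with missing entries.

```r
# Hypothetical toy data frame standing in for a table of player stats
stats <- data.frame(
  Player = c("A", "B", "C", "D"),
  FTA = c(3.1, NA, 5.4, 2.2),
  `3PA` = c(4.0, 1.2, 0.3, 6.1),
  check.names = FALSE
)

# Names of the numeric columns only
numeric_cols <- names(stats)[sapply(stats, is.numeric)]

# Drop rows with a missing value in any numeric column
clean <- stats[complete.cases(stats[numeric_cols]), ]

numeric_cols   # "FTA" "3PA"
nrow(clean)    # 3 (player B is dropped because FTA is missing)
```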
library(RCurl)
NBA <- read.csv(text = getURL("https://raw.githubusercontent.com/clarkdatalabs/soms/master/NBA_2016_player_stats_cleaned.csv"),
    sep = ",", header = TRUE, check.names = FALSE)
Before we create a SOM, we need to choose which variables we want to search for patterns in.
colnames(NBA)
## [1] "" "Player" "Pos" "Age" "Tm" "G" "GS"
## [8] "MP" "FG" "FGA" "FG%" "3P" "3PA" "3P%"
## [15] "2P" "2PA" "2P%" "FT" "FTA" "FT%" "ORB"
## [22] "DRB" "TRB" "AST" "STL" "BLK" "TOV" "PF"
## [29] "PTS"
We’ll start with some simple examples using shot attempts:
NBA.measures1 <- c("FTA", "2PA", "3PA")
NBA.SOM1 <- som(scale(NBA[NBA.measures1]), grid = somgrid(6, 4, "rectangular"))
plot(NBA.SOM1)
Note that we scaled and centered our training data, and defined the grid size and arrangement. The standard Kohonen SOM plot creates these pie representations of the representative vectors for the grid cells, where the radius of a wedge corresponds to the magnitude in a particular dimension. Some patterns start to emerge, with players generally being clustered by how many of each type of shot they take.
Remember that the above is just a map of the player data: each cell displays its representative vector. We could identify players with cells on the map by assigning each player to the cell whose representative vector is closest to that player’s stat line. The “counts” type SOM does exactly this, and creates a heatmap based on the number of players assigned to each cell. Just for fun, we reversed the order of the pre-defined palette heat.colors so that red represents grid cells with higher numbers of represented players.
# reverse color ramp
colors <- function(n, alpha = 1) {
    rev(heat.colors(n, alpha))
}
plot(NBA.SOM1, type = "counts", palette.name = colors, heatkey = TRUE)
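The assignment behind the counts plot can be sketched in a few lines of base R. The representative vectors and observations below are made-up toy values, not the NBA data; in a fitted kohonen object the equivalent assignment should be available as the unit.classif component when the data are kept.

```r
# Toy stand-ins (assumptions): 3 representative vectors, 6 observations, 2 dimensions
codes <- rbind(c(0, 0), c(5, 5), c(10, 0))
obs <- rbind(c(0.2, -0.1), c(-0.3, 0.4),
             c(4.8, 5.1), c(5.2, 4.9), c(5.5, 5.5),
             c(9.7, 0.3))

# For each observation, find the index of the closest representative vector
nearest_cell <- apply(obs, 1, function(x) {
  which.min(colSums((t(codes) - x)^2))
})

# Number of observations assigned to each cell (what the heatmap colors encode)
counts <- tabulate(nearest_cell, nbins = nrow(codes))
counts  # 2 3 1
```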
Alternatively you could plot the players as points on the grid using the “mapping” type SOM. We do it side by side with the regular SOM to start to make visual comparisons.
par(mfrow = c(1, 2))
plot(NBA.SOM1, type = "mapping", pchs = 20, main = "Mapping Type SOM")
plot(NBA.SOM1, main = "Default SOM Plot")
The representative vector of each map cell is displayed on the right. On the left, players are plotted on this map based on how close their stat lines are to these representative vectors. Note that each of these examples passes a different type parameter to the Kohonen plot function. If you want to customize these graphics by, for example, plotting points on a grid which displays some other measure of your SOM as a background, you will have to dig into some of the properties of the SOM objects. We’ll do this in a forthcoming post about text mining and SOMs.
This next example is not another type of SOM plot, but a way of changing the geometry of any of the plot types. When we trained the SOM for the above examples we used a rectangular grid. Since cells on the edges, and particularly in the corners, have fewer neighbors than interior cells, more extreme values tend to be pushed to the edges. In our first example, the maximum in each of the three stats we looked at fell in a separate corner. Alternatively, we can use toroidal topology for our map (basically Pac-Man rules), where the top-bottom and right-left edges are adjacent.
NBA.SOM2 <- som(scale(NBA[NBA.measures1]), grid = somgrid(6, 6, "hexagonal"),
toroidal = TRUE)
par(mfrow = c(1, 2))
plot(NBA.SOM2, type = "mapping", pchs = 20, main = "Mapping Type SOM")
plot(NBA.SOM2, main = "Default SOM Plot")
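The wrap-around rule of a toroidal map is easiest to see along a single axis. The helper below is a hypothetical illustration (axis_dist is not a kohonen function) and ignores the hexagonal layout: it only shows how the distance between two columns changes once opposite edges are treated as adjacent.

```r
# Distance between positions a and b along one axis of a grid with n cells
axis_dist <- function(a, b, n, toroidal = FALSE) {
  d <- abs(a - b)
  # On a torus you may also travel the other way around the axis
  if (toroidal) min(d, n - d) else d
}

axis_dist(1, 6, n = 6)                   # 5: opposite edges of a flat grid
axis_dist(1, 6, n = 6, toroidal = TRUE)  # 1: neighbours once the edges wrap
axis_dist(2, 4, n = 6, toroidal = TRUE)  # 2: interior distances are unchanged
```

With wrapping, no cell sits on an edge or in a corner, so every cell has the same number of neighbors.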
When we plot with type = "dist.neighbours", the cells are colored depending on the overall distance to their nearest neighbors, which allows us to visualize how far apart different features are in the higher-dimensional space.
plot(NBA.SOM2, type = "dist.neighbours", palette.name = terrain.colors)
You can think of this display with a topographic analogy. Cells with greater distances to their neighbors are like mountain peaks: the deformed surface area means surface distances are greater. We will explore this idea more in a follow-up post to this one, where we will attempt to visualize the distance between Shakespearean plays based on their word usage.
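As a rough sketch of what such a neighbor-distance display measures, the toy example below sums, for each cell, the data-space distances between its representative vector and those of its immediate grid neighbors. The strip of four cells and their vectors are made up, and summing (rather than, say, averaging) is an assumption about how the package aggregates the distances.

```r
# Toy 1 x 4 strip of cells with 2-dimensional representative vectors (made-up values)
grid <- cbind(x = 1:4, y = 1)
codes <- rbind(c(0, 0), c(1, 0), c(1, 1), c(6, 1))

grid_dist <- as.matrix(dist(grid))   # distances between cells on the map
code_dist <- as.matrix(dist(codes))  # distances between representative vectors

# For each cell, add up the data-space distances to its immediate map neighbours
neighbour_sum <- sapply(seq_len(nrow(grid)), function(i) {
  nb <- which(grid_dist[i, ] > 0 & grid_dist[i, ] < 1.5)
  sum(code_dist[i, nb])
})
neighbour_sum  # 1 2 6 5
```

The last two cells get the largest values because their representative vectors are far apart in the data space even though the cells are adjacent on the map; these are the "peaks" of the analogy.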
The kohonen package also supports supervised SOMs, which allow us to make classifications. So far we’ve only worked with mapping three-dimensional data to two dimensions. The utility of SOMs becomes more evident when we’re working with higher-dimensional data, so let’s do this supervised example with an expanded list of player stats:
NBA.measures2 <- c("FTA", "FT", "2PA", "2P", "3PA", "3P", "AST", "ORB", "DRB",
"TRB", "STL", "BLK", "TOV")
We’ll use the xyf() function to create a supervised SOM and classify players by their position on the court. We’ll randomly divide our data into training and testing sets.
training_indices <- sample(nrow(NBA), 200)
NBA.training <- scale(NBA[training_indices, NBA.measures2])
NBA.testing <- scale(NBA[-training_indices, NBA.measures2], center = attr(NBA.training,
"scaled:center"), scale = attr(NBA.training, "scaled:scale"))
Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
NBA.SOM3 <- xyf(NBA.training, classvec2classmat(NBA$Pos[training_indices]),
grid = somgrid(13, 13, "hexagonal"), toroidal = TRUE, rlen = 100, xweight = 0.5)
Note the xweight parameter for xyf(). This allows you to weight the set of training variables (NBA.training) versus the prediction variable (NBA$Pos) in the training algorithm. Now let’s check the accuracy of the prediction:
pos.prediction <- predict(NBA.SOM3, newdata = NBA.testing)
table(NBA[-training_indices, "Pos"], pos.prediction$prediction)
##
##                  Center Point Guard Power Forward Shooting Guard Small Forward
##   Center             16           0            26              1             4
##   Point Guard         0          49             0             12            11
##   Power Forward      10           1            29              5             8
##   Shooting Guard      0           8             4             38            19
##   Small Forward       0           0            15              9            38
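To turn the table into summary numbers, one option is to compute overall accuracy and the per-position hit rate. The sketch below re-enters the counts from the table above by hand (rows are actual positions, columns are predicted); since the training sample is random, your own run will give different counts.

```r
# Confusion matrix copied from the table above (rows = actual, columns = predicted)
positions <- c("Center", "Point Guard", "Power Forward",
               "Shooting Guard", "Small Forward")
confusion <- matrix(
  c(16,  0, 26,  1,  4,
     0, 49,  0, 12, 11,
    10,  1, 29,  5,  8,
     0,  8,  4, 38, 19,
     0,  0, 15,  9, 38),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = positions, predicted = positions)
)

# Share of all test players whose position was predicted correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)  # 0.561 (170 of 303)

# Share of each actual position that was predicted correctly
recall <- diag(confusion) / rowSums(confusion)
round(recall, 3)
```

In this run, point guards are recovered most reliably, while centers are most often confused with power forwards.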
For this example we’ll use xyf() to do a similar position-predicting training, but using all of the players instead of just a training set. This time we will weight the player stats more heavily than the player position using the xweight parameter.
NBA.SOM4 <- xyf(scale(NBA[, NBA.measures2]), classvec2classmat(NBA[, "Pos"]),
grid = somgrid(13, 13, "hexagonal"), toroidal = TRUE, rlen = 300, xweight = 0.7)
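Conceptually, the weighting blends a distance in the stats space with a distance in the position space when looking for the best-matching cell. The function below is only a sketch of that idea: combined_dist is a hypothetical helper, and the plain weighted sum of squared Euclidean distances is an assumption, since the package's actual distance measures and scaling may differ.

```r
# Hypothetical helper: blend the X-space and Y-space distances to one cell
combined_dist <- function(x, y, code_x, code_y, xweight = 0.5) {
  dx <- sum((x - code_x)^2)  # distance in the stats (X) space
  dy <- sum((y - code_y)^2)  # distance in the position (Y) space
  xweight * dx + (1 - xweight) * dy
}

# Toy player: stats vector x and a one-hot position vector y
x <- c(1, 0)
y <- c(1, 0, 0)
# Toy cell: close in stats, but representing a different position
code_x <- c(0, 0)
code_y <- c(0, 1, 0)

combined_dist(x, y, code_x, code_y, xweight = 0.5)  # 0.5 * 1 + 0.5 * 2 = 1.5
combined_dist(x, y, code_x, code_y, xweight = 0.7)  # 0.7 * 1 + 0.3 * 2 = 1.3
```

Raising xweight makes a mismatch in position count for less, so the map organizes itself more by the stats than by the labels.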
Plotting using type = "codes", we get the standard side-by-side visualization of the player stats (Codes X) and the player position prediction (Codes Y).
par(mfrow = c(1, 2))
plot(NBA.SOM4, type = "codes", main = c("Codes X", "Codes Y"))
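# group the map cells into 5 clusters by hierarchical clustering of their Codes Y vectors
NBA.SOM4.hc <- cutree(hclust(dist(NBA.SOM4$codes$Y)), 5)
# outline those clusters on the current plot
add.cluster.boundaries(NBA.SOM4, NBA.SOM4.hc)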
This view allows us to compare player stats to the position predictions, but doesn’t really give us any idea about the accuracy of these groupings or how well the players map into these groupings.
In this final example we’ll make a few customizations to the type = "mapping" plot so that we can simultaneously represent the actual player positions and the SOM’s predicted positions. We’ll start with the visualization and follow it with the code for you to explore.
Background colors are set by the predicted player position for that location. We set the background color transparency (alpha) to depend on the certainty with which our SOM classified that cell. Faded cells have multiple position values of similar magnitude, though only the position with the maximum value is used for the color. The fill colors of the plotted player dots represent their true positions.
bg.pallet <- c("red", "blue", "yellow", "purple", "green")
# make a vector of just the background colors for all map cells
position.predictions <- classmat2classvec(predict(NBA.SOM4)$unit.predictions)
base.color.vector <- bg.pallet[match(position.predictions, levels(NBA$Pos))]
# set alpha to scale with maximum confidence of prediction
bgcols <- character(length(base.color.vector))
max.conf <- apply(NBA.SOM4$codes$Y, 1, max)
for (i in seq_along(base.color.vector)) {
    bgcols[i] <- adjustcolor(base.color.vector[i], alpha.f = max.conf[i])
}
par(mar = c(0, 0, 0, 4), xpd = TRUE)
plot(NBA.SOM4, type = "mapping", pchs = 21, col = "black", bg = bg.pallet[match(NBA$Pos,
levels(NBA$Pos))], bgcol = bgcols)
legend("topright", legend = levels(NBA$Pos), text.col = bg.pallet, bty = "n",
inset = c(-0.03, 0))