This tutorial was written using the kohonen
package version 2.0.19. Some of the code will not work in the most recent version of this package. To install 2.0.19, run the following:
packageurl <- "https://cran.r-project.org/src/contrib/Archive/kohonen/kohonen_2.0.19.tar.gz"
install.packages(packageurl, repos = NULL, type = "source")
I hope to update all of the SOM tutorials to run properly on kohonen
v3 in the near future.
The zipped folder we downloaded in the part 1 also contains a document for each character in each play containing all of that character’s lines. In this example we’ll make a SOM to see if we can find relationships between characters based on their vocabularies.
I’ve already built a character_table
, similar to the play_table
used in the last example. You can see how this was constructed here.
library("RCurl")
character_table <- read.csv(text = getURL("https://raw.githubusercontent.com/clarkdatalabs/soms/master/Shakespeare_tables/character_table.csv"),
sep = ",", header = T, check.names = TRUE, stringsAsFactors = FALSE)
Let’s look at the metadata we have for each text file of lines:
head(character_table[, c("char", "play", "genre")], 10)
## char play genre
## 1 All Four A Midsummer-Night's Dreams comedies
## 2 All A Midsummer-Night's Dreams comedies
## 3 Bottom A Midsummer-Night's Dreams comedies
## 4 Cobweb A Midsummer-Night's Dreams comedies
## 5 Demetrius A Midsummer-Night's Dreams comedies
## 6 Dramatis_personae A Midsummer-Night's Dreams comedies
## 7 Egeus A Midsummer-Night's Dreams comedies
## 8 Fairy A Midsummer-Night's Dreams comedies
## 9 Flute A Midsummer-Night's Dreams comedies
## 10 Helena A Midsummer-Night's Dreams comedies
Note the Dramatis_personae character. Each play has one of these documents, which is just a list of characters and contains no lines. We could get rid of these now, but we are going to end up leveraging the row numbering in character_table
to add labels to our SOM, so we don’t want to remove any rows from character_table
. Instead, these will get filtered out later when we ignore documents with too few words in them.
library("tm")
character.corpus <- Corpus(VectorSource(character_table[, "text"]))
The text here has the same bracketed tags, white space, punctuation, and uncommon words that we stripped out of the play text in part 1, so we follow the same procedure to simplify it. This time we’ll use the special tm_reduce
option of the tm_map
function, which allows us to combine all of our text simplifications into one step.
Note: this may take a few minutes to run.
library("SnowballC")
striptags <- function(x) gsub(x, pattern = '<[^>]*>', replacement = '')
skipwords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(striptags),
removePunctuation,
removeNumbers,
content_transformer(tolower),
skipwords,
stemDocument,
stripWhitespace,
removePunctuation)
character.corpus <- tm_map(character.corpus, FUN = tm_reduce, tmFuns = funcs)
Again, we form a Document Term Matrix and remove sparse terms:
character.DTM <- DocumentTermMatrix(character.corpus)
character.DTM.modified <- removeSparseTerms(character.DTM, 0.9)
Let’s see what this did to the wordcounts for some of our characters.
wordcount.DTM <- rowSums(as.matrix(character.DTM))
wordcount.DTM.modified <- rowSums(as.matrix(character.DTM.modified))
head(cbind(character_table$char,
"DTM" = wordcount.DTM,
"DTM.modified" = wordcount.DTM.modified),
n = 10)
## DTM DTM.modified
## 1 "All Four" "5" "5"
## 2 "All" "8" "8"
## 3 "Bottom" "1013" "626"
## 4 "Cobweb" "16" "13"
## 5 "Demetrius" "765" "505"
## 6 "Dramatis_personae" "31" "2"
## 7 "Egeus" "188" "109"
## 8 "Fairy" "134" "60"
## 9 "Flute" "132" "84"
## 10 "Helena" "1079" "655"
Overall this seems to have cut a fair number of words out of each character’s wordcount. Also note that some characters don’t have many lines at all. Let’s exclude characters that have fewer than 200 filtered words.
character.DTM.modified <- character.DTM.modified[wordcount.DTM.modified > 200,
]
This time we are going to use Term Frequency-Inverse Document Frequency (TF-IDF) as a weighting on our terms. This scales a term to reflect it’s relative importance, which is determined by how many of the documents in our corpus the term appears in. Words that are only common to a few characters will have more influence over how our SOM groups characters. The tm
package supports this with the weightTfIdf
function, which we also use to normalize our rows.
character.DTM.modified <- weightTfIdf(character.DTM.modified, normalize = TRUE)
In order to use some of the metadata in character_table
in our SOM plots, we need to merge it onto our Document Term Matrix, which we’ll need to convert to a dataframe.
character.DTM_df <- data.frame(as.matrix(character.DTM.modified))
character.DTM_df[1:10, 1:4]
## act age air all
## 3 0.0002813951 0.000000000 0.000000000 0.002069929
## 5 0.0003348657 0.000000000 0.000000000 0.000000000
## 10 0.0001936342 0.000000000 0.002669836 0.001978284
## 11 0.0003041496 0.000000000 0.000000000 0.000000000
## 14 0.0002818453 0.000000000 0.000000000 0.000000000
## 18 0.0001666704 0.000000000 0.000000000 0.002113827
## 21 0.0002334562 0.000000000 0.000000000 0.005203919
## 24 0.0004418367 0.000000000 0.000000000 0.000000000
## 28 0.0002625888 0.002733877 0.000000000 0.000000000
## 30 0.0001995826 0.000000000 0.008614496 0.000000000
We can see that even though we removed rows representing characters with few lines, the row numbers from our original character_table
were maintained. We use these to merge:
character.DTM_df <- merge(character_table[,c("char", "play", "genre")],
character.DTM_df,
by.x = "row.names",
by.y = "row.names",
suffixes = c("", ".DTM"))
We form our SOM object as before. Note that the kohonen
package restricts the size of your map grid to a number of cells less than the size of your training data. In part 1, we were working with only 37 plays, and could have at most a 6x6 grid (or 1x37 if we were feeling particularly useless). Here we are free to have a much larger grid.
library(kohonen)
character.SOM <- som(as.matrix(character.DTM_df[, !names(character.DTM_df) %in%
c("Row.names", "char", "play", "genre")]), grid = somgrid(10, 10, "hexagonal"),
toroidal = TRUE)
As in part 1, we will choose a select set of characters to plot:
characters.to.plot <- c("Romeo",
"Juliet",
"K. Henry",
"Macbeth",
"Hermia",
"Othello",
"Puck",
"Borachio")
character.labels <- c()
X <- character.DTM_df[,"char"]
for (i in 1:length(X)) {
if (X[i] %in% characters.to.plot) {
character.labels[i] <- X[i]
} else {
character.labels[i] <- "+"
}
}
remove(i,X)
In this figure we will color the characters by genre. Since we are plotting over 400 characters on the same map, let’s make all of the characters labeled with “+” somewhat transparent so we can read the names of the characters we are actually interested in.
palette.3 <- c("firebrick1", "darkolivegreen3", "dodgerblue2")
dim.symbols <- function(x) {
if (nchar(x) < 2) {
return(0.5)
} else {
return(1)
}
}
genre.label.data <- cbind(char = character.DTM_df$char, label = character.labels,
base_color = palette.3[as.factor(character.DTM_df$genre)], transparency = lapply(character.labels,
dim.symbols))
genre.label.data <- as.data.frame(genre.label.data)
label.colors <- c()
for (row in 1:nrow(genre.label.data)) {
label.colors[row] <- adjustcolor(genre.label.data[row, "base_color"], alpha.f = genre.label.data[row,
"transparency"])
}
genre.label.data <- cbind(genre.label.data, label_color = label.colors)
remove(row, label.colors)
Here we define the same distance function from part 1 used to color the background of the map cells.
code.distances <- function(SOM) {
N <- nrow(SOM$codes)
# calculate all pairwise distances
pairwise.distances = c()
for (i in 1:N) {
list <- matrix()
for (j in 1:N) {
list[j] <- dist(rbind(SOM$codes[i, ], SOM$codes[j, ]))
}
pairwise.distances <- cbind(pairwise.distances, list)
}
## distance coefficient function, to be applied to geometric GRID distances
d.coef <- function(x) {
if (x == 0) {
return(0)
} else {
return(1/(x^4))
}
}
distance.coefficients <- apply(kohonen::unit.distances(SOM$grid, SOM$toroidal),
c(1, 2), d.coef)
# calculate scaled sum of distances
A <- distance.coefficients * pairwise.distances
scaled.dist.sums <- (colSums(A) - min(colSums(A)))/(max(colSums(A)) - min(colSums(A)))
# clean up variables
remove(i, j)
return(scaled.dist.sums)
}
The output of this function is a value between 0 and 1 for each cell on our SOM. Higher values represent greater overall distance from a cell to nearby cells.
par(bg = hsv(h=0,s=0,v=.7), font = 1, cex = 1.2)
plot(character.SOM,
type = "mapping",
col = as.character(genre.label.data$label_color),
bgcol = hsv(h=0.8,s=.3,v=(.5*(0.3+code.distances(character.SOM)))),
labels = genre.label.data$label,
main = "Shakespearean Characters - colored by genre")
There are some apparent clusters. Recall that we are using a toroidal SOM for this example, so the left and right edges and the top and bottom edges are identified. As in part 1, background color signifies a cell’s distance from neighboring cells. Dark cells are much closer to their neighbors than lighter colored cells.
Even to a Shakespearean neophyte such as myself, using this SOM to do some exploration reveals a couple interesting features. Romeo and Juliet are very close to each other linguistically, as we might have expected. King Henry, on the other hand, is pretty dissimilar from himself across the multiple plays in which he is a character. Cool!
Following the above example, it’s easy to instead color each character using a different criterion. In our character_table
we have the play that a character appears in as a readily available distinction.
palette.37 <- rainbow(37)
We’ll omit the rest of the code, since it mirrors the example above.
This is clearly not particularly useful, but this method suggests a few other ideas that you might want to explore: