A couple of months ago I published some examples of using the R package kohonen and player stats from the 2015/2016 NBA season to make Self Organizing Maps (check it out here). That post covered the basics of Self Organizing Maps and the basic plot types supported by kohonen.

Since that post, I’ve been playing around with text mining and a corpus of the dramas of Shakespeare. The central question these next two posts explore is:

Can we use SOMs to see relationships between plays (part 1) or characters (part 2) based only on an analysis of their spoken vocabulary?

This tutorial is designed so that no files need to be downloaded separately from this document. Run the code blocks sequentially to follow along.


This tutorial was written using the kohonen package version 2.0.19. Some of the code will not work in the most recent version of this package. To install 2.0.19, run the following:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/kohonen/kohonen_2.0.19.tar.gz"
install.packages(packageurl, repos = NULL, type = "source")

I hope to update all of the SOM tutorials to run properly on kohonen v3 in the near future.

Shakespeare’s Dramas

The following script downloads a zipped folder containing all of the text of 37 of Shakespeare’s plays (originally available at lexically.net). It sets up a table (play_table) which contains all of the information we will be working with, including the complete text of each play.

td <- tempdir()
tf <- tempfile(tmpdir = td, fileext = ".zip")

# Download the zipped plays (hosted at lexically.net; update the URL
# if the archive has moved)
download.file("https://lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip",
              tf, mode = "wb")

fname <- unzip(tf, list = TRUE)$Name[1]
unzip(tf, exdir = td, overwrite = TRUE)
fpath <- file.path(td, fname)

Within the ShakespearePlaysPlus folder there are subfolders categorizing works by genre. We’ll use this as metadata for plots later, so we want to capture it in our table.

genres <- list.dirs(fpath, full.names = FALSE, recursive = FALSE)

We build a dataframe with a row for each play and columns for play title, genre, and text:

library(readr)  # for read_file()

play_table <- data.frame(play = character(), genre = character(), text = character(),
                         stringsAsFactors = FALSE)

for (genre in genres) {
    plays <- list.files(file.path(fpath, genre), pattern = ".txt")
    for (play in plays) {
        play.text <- read_file(file.path(fpath, genre, play),
                               locale = locale(encoding = "UTF-16"))
        play.name <- gsub(".txt", "", play)
        play_table <- rbind(play_table,
                            data.frame(play = play.name, genre = genre,
                                       text = play.text,
                                       stringsAsFactors = FALSE))
    }
}
remove(genre, plays, play, play.name, play.text)

Finally, we clean up everything from the temporary download except the table we’ve created:

unlink(c(tf, fpath), recursive = TRUE)

Working with tm

Forming & Cleaning a Corpus

We will be using the tm package for text mining. The fundamental object we’ll be working with is a corpus, within which we have one document per play.


library(tm)

play.vec <- VectorSource(play_table[, "text"])
play.corpus <- Corpus(play.vec)

We can see the start of the contents of the first document in our corpus:

substr(play.corpus[[1]]$content, 1, 400)
## [1] "< Shakespeare -- A MIDSUMMER-NIGHT'S DREAM >\n\n\n< from Online Library of Liberty (http://oll.libertyfund.org) >\n\n\n< Unicode .txt version by Mike Scott (http://www.lexically.net) >\n\n\n< from \"The Complete Works of William Shakespeare\" >\n\n\n< ed. with a glossary by W.J. Craig M.A. >\n\n\n< (London: Oxford University Press, 1916) >\n\n\n<STAGE DIR>\n\n\n<Scene.<U+0097>Athens, and a Wood near it.>\n\n\n</STAGE DIR>\n\n\n\n\n\n\n\n"

Without using substr we would see the entire text of A Midsummer-Night’s Dream - over 110,000 characters instead of just the first 400. Yikes!

You can see there are lots of stage instructions and other metadata tags that are all enclosed in angle brackets. To transform the contents of documents within our corpus, we need to define a function that does exactly what we want: strip out anything within angle brackets.

striptags <- function(x) gsub(x, pattern = "<[^>]*>", replacement = "")
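As a quick check (a made-up snippet, not from the original post), applying striptags to a sample line shows that anything inside angle brackets disappears while the dialogue survives:

```r
# Tags (anything enclosed in angle brackets) are removed; dialogue is untouched
striptags("<Exit Philostrate> Now, fair Hippolyta, our nuptial hour")
## [1] " Now, fair Hippolyta, our nuptial hour"
```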

The tm_map function applies a transformation function to every document in our corpus. Many of the functions we’ll use are already defined on corpus objects. Our striptags function, however, is not. It will need to be put within the content_transformer wrapper function so it applies to the contents of each document within our corpus.

play.corpus <- tm_map(play.corpus, content_transformer(striptags))
substr(play.corpus[[1]]$content, 1, 400)
## [1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\n\tNow, fair Hippolyta, our nuptial hour\n\n\n\tDraws on apace: four happy days bring in\n\n\n\tAnother moon; but O! methinks how slow\n\n\n\tThis old moon wanes; she lingers my desires,\n\n\n\tLike to a step dame, or a dowager\n\n\n\tLong withering out a young man's revenue.\n\n\n\n\n\n\n\n\n\t\n\n\n\tFour days will quickly steep themselves in night;\n\n\n\tFour nights will q"

Next we want to really clean up our texts. The tm package has functions for stripping punctuation, numbers, whitespace, and the most common words (we don’t want to be comparing counts of “and” and “to”). We can change everything to lower case using tolower() from base R, but again we’ll need to wrap this in content_transformer so tm_map knows to apply this to the contents of each document. We’ll also use a function from the SnowballC package to cut words down to common stems for comparison.

Note: this may take a few minutes to run.


play.corpus <- tm_map(play.corpus, removePunctuation)
play.corpus <- tm_map(play.corpus, removeNumbers)
play.corpus <- tm_map(play.corpus, content_transformer(tolower))
play.corpus <- tm_map(play.corpus, removeWords, stopwords("english"))
play.corpus <- tm_map(play.corpus, stemDocument)
play.corpus <- tm_map(play.corpus, stripWhitespace)
play.corpus <- tm_map(play.corpus, removePunctuation)
substr(play.corpus[[1]]$content, 1, 400)
## [1] " now fair hippolyta nuptial hour draw apac four happi day bring anoth moon o methink slow old moon wane linger desires like step dame dowager long wither young man revenue four day will quick steep night four night will quick dream away time moon like silver bow newbent heaven shall behold night solemnities go philostrate stir athenian youth merriments awak pert nimbl spirit mirth turn melancholi "

The functions that rely on English dictionaries had a bit of trouble with the Shakespearean English, but they’re close enough for our analysis.

Document Term Matrix

Next we’ll create what’s called a Document Term Matrix. In this matrix, there will be a row for each document and a column for each word (in our case, word stem) appearing in our corpus. Entries in the matrix are simple counts of the number of times a word appeared in the corresponding document.
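To see what such a matrix looks like before building the real one, here is a toy illustration on two made-up "documents" (note that DocumentTermMatrix drops words shorter than three characters by default):

```r
library(tm)

# Two tiny documents; the resulting matrix has one row per document
# and one column per term, holding raw counts
toy <- Corpus(VectorSource(c("moon moon night", "night dream")))
inspect(DocumentTermMatrix(toy))
```

Our real matrix will have the same shape, with one row per play.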

play.DTM <- DocumentTermMatrix(play.corpus)

Restrict to Shared Words

This matrix currently has columns for 25313 different words or word stems. Many of these, including character names, appear in only a single play. We want to restrict our comparison to terms that are somewhat common across the collection of 37 plays. Restricting the sparsity of our Document Term Matrix has this effect: with a threshold of 0.8, only terms appearing in more than 20% of documents (at least 8 of the 37 plays) are kept.

play.DTM.modified <- removeSparseTerms(play.DTM, 0.8)
ncol(play.DTM.modified)
## [1] 4100

This has greatly reduced the number of terms that we will be using to compare plays.

Normalize Play Vectors

Next, I’ve decided I only care about the distribution of words within a play, not the length of the play. I’ll scale each row to be a unit vector (Euclidean length 1), so the length of the play will not influence the SOM.

play.DTM.modified <- as.matrix(play.DTM.modified)  # apply() needs a plain matrix
play.DTM.modified <- t(apply(play.DTM.modified, 1, function(x) x / sqrt(sum(x^2))))
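A quick sanity check (not part of the original post) confirms the scaling worked: every row should now have squared length (numerically) equal to 1.

```r
# All squared row sums should be ~1 after normalization
summary(rowSums(play.DTM.modified^2))
```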

Making a SOM

We’re going to be training our SOM using the rows of our document term matrix as vectors representing plays. These are 4100-dimensional, unit-length vectors of scaled word-stem counts, one per play.


play.SOM <- som(play.DTM.modified, grid = somgrid(6, 6, "hexagonal"), toroidal = FALSE)
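Once trained, a useful first check is the "changes" plot supplied by the kohonen package's plot method (the other plot types are covered in the earlier SOM post):

```r
# Mean distance to the closest codebook vector per training iteration;
# a flattening curve suggests the map has converged
plot(play.SOM, type = "changes")
```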