VERSION WARNING

This tutorial was written using the kohonen package version 2.0.19. Some of the code will not work in the most recent version of this package. To install 2.0.19, run the following:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/kohonen/kohonen_2.0.19.tar.gz"
install.packages(packageurl, repos = NULL, type = "source")

I hope to update all of the SOM tutorials to run properly on kohonen v3 in the near future.

Setup

The zipped folder we downloaded in the part 1 also contains a document for each character in each play containing all of that character’s lines. In this example we’ll make a SOM to see if we can find relationships between characters based on their vocabularies.

I’ve already built a character_table, similar to the play_table used in the last example. You can see how this was constructed here.

library("RCurl")
character_table <- read.csv(text = getURL("https://raw.githubusercontent.com/clarkdatalabs/soms/master/Shakespeare_tables/character_table.csv"), 
    sep = ",", header = T, check.names = TRUE, stringsAsFactors = FALSE)

Let’s look at the metadata we have for each text file of lines:

head(character_table[, c("char", "play", "genre")], 10)

##                 char                       play    genre
## 1           All Four A Midsummer-Night's Dreams comedies
## 2                All A Midsummer-Night's Dreams comedies
## 3             Bottom A Midsummer-Night's Dreams comedies
## 4             Cobweb A Midsummer-Night's Dreams comedies
## 5          Demetrius A Midsummer-Night's Dreams comedies
## 6  Dramatis_personae A Midsummer-Night's Dreams comedies
## 7              Egeus A Midsummer-Night's Dreams comedies
## 8              Fairy A Midsummer-Night's Dreams comedies
## 9              Flute A Midsummer-Night's Dreams comedies
## 10            Helena A Midsummer-Night's Dreams comedies

Note the Dramatis_personae character. Each play has one of these documents, which is just a list of characters and contains no lines. We could get rid of these now, but we are going to end up leveraging the row numbering in character_table to add labels to our SOM, so we don’t want to remove any rows from character_table. Instead, these will get filtered out later when we ignore documents with too few words in them.

Form & Clean Corpus

library("tm")
character.corpus <- Corpus(VectorSource(character_table[, "text"]))

The text here has the same bracketed tags, white space, punctuation, and uncommon words that we stripped out of the play text in part 1, so we follow the same procedure to simplify it. This time we’ll use the special tm_reduce option of the tm_map function, which allows us to combine all of our text simplifications into one step.

Note: this may take a few minutes to run.

library("SnowballC")

striptags <- function(x) gsub(x, pattern = '<[^>]*>', replacement = '')
skipwords <- function(x) removeWords(x, stopwords("english"))

funcs <- list(content_transformer(striptags),
              removePunctuation, 
              removeNumbers, 
              content_transformer(tolower), 
              skipwords, 
              stemDocument, 
              stripWhitespace, 
              removePunctuation)

character.corpus <- tm_map(character.corpus, FUN = tm_reduce, tmFuns = funcs)

Again, we form a Document Term Matrix and remove sparse terms:

character.DTM <- DocumentTermMatrix(character.corpus)

character.DTM.modified <- removeSparseTerms(character.DTM, 0.9)

Let’s see what this did to the wordcounts for some of our characters.

wordcount.DTM <- rowSums(as.matrix(character.DTM))
wordcount.DTM.modified <- rowSums(as.matrix(character.DTM.modified))

head(cbind(character_table$char,
           "DTM" = wordcount.DTM,
           "DTM.modified" = wordcount.DTM.modified),
     n = 10)

##                        DTM    DTM.modified
## 1  "All Four"          "5"    "5"         
## 2  "All"               "8"    "8"         
## 3  "Bottom"            "1013" "626"       
## 4  "Cobweb"            "16"   "13"        
## 5  "Demetrius"         "765"  "505"       
## 6  "Dramatis_personae" "31"   "2"         
## 7  "Egeus"             "188"  "109"       
## 8  "Fairy"             "134"  "60"        
## 9  "Flute"             "132"  "84"        
## 10 "Helena"            "1079" "655"

Overall this seems to have cut a fair number of words out of each character’s wordcount. Also note that some characters don’t have many lines at all. Let’s exclude characters that have fewer than 200 filtered words.

character.DTM.modified <- character.DTM.modified[wordcount.DTM.modified > 200, 
    ]

TF-IDF

This time we are going to use Term Frequency-Inverse Document Frequency (TF-IDF) as a weighting on our terms. This scales a term to reflect it’s relative importance, which is determined by how many of the documents in our corpus the term appears in. Words that are only common to a few characters will have more influence over how our SOM groups characters. The tm package supports this with the weightTfIdf function, which we also use to normalize our rows.

character.DTM.modified <- weightTfIdf(character.DTM.modified, normalize = TRUE)

Merge with Metadata

In order to use some of the metadata in character_table in our SOM plots, we need to merge it onto our Document Term Matrix, which we’ll need to convert to a dataframe.

character.DTM_df <- data.frame(as.matrix(character.DTM.modified))
character.DTM_df[1:10, 1:4]

##             act         age         air         all
## 3  0.0002813951 0.000000000 0.000000000 0.002069929
## 5  0.0003348657 0.000000000 0.000000000 0.000000000
## 10 0.0001936342 0.000000000 0.002669836 0.001978284
## 11 0.0003041496 0.000000000 0.000000000 0.000000000
## 14 0.0002818453 0.000000000 0.000000000 0.000000000
## 18 0.0001666704 0.000000000 0.000000000 0.002113827
## 21 0.0002334562 0.000000000 0.000000000 0.005203919
## 24 0.0004418367 0.000000000 0.000000000 0.000000000
## 28 0.0002625888 0.002733877 0.000000000 0.000000000
## 30 0.0001995826 0.000000000 0.008614496 0.000000000

We can see that even though we removed rows representing characters with few lines, the row numbers from our original character_table were maintained. We use these to merge:

character.DTM_df <- merge(character_table[,c("char", "play", "genre")],
                          character.DTM_df, 
                          by.x = "row.names", 
                          by.y = "row.names", 
                          suffixes = c("", ".DTM"))

SOM - Colored by Genre

We form our SOM object as before. Note that the kohonen package restricts the size of your map grid to a number of cells less than the size of your training data. In part 1, we were working with only 37 plays, and could have at most a 6x6 grid (or 1x37 if we were feeling particularly useless). Here we are free to have a much larger grid.

library(kohonen)
character.SOM <- som(as.matrix(character.DTM_df[, !names(character.DTM_df) %in% 
    c("Row.names", "char", "play", "genre")]), grid = somgrid(10, 10, "hexagonal"), 
    toroidal = TRUE)

Label Text

As in part 1, we will choose a select set of characters to plot:

characters.to.plot <- c("Romeo", 
                        "Juliet",
                        "K. Henry",
                        "Macbeth",
                        "Hermia",
                        "Othello",
                        "Puck",
                        "Borachio")

character.labels <- c()
X <- character.DTM_df[,"char"]
for (i in 1:length(X)) {
    if (X[i] %in% characters.to.plot) {
        character.labels[i] <- X[i]
    } else {
        character.labels[i] <- "+"
    }
}
remove(i,X)

Label Color & Opacity

In this figure we will color the characters by genre. Since we are plotting over 400 characters on the same map, let’s make all of the characters labeled with “+” somewhat transparent so we can read the names of the characters we are actually interested in.

palette.3 <- c("firebrick1", "darkolivegreen3", "dodgerblue2")

dim.symbols <- function(x) {
    if (nchar(x) < 2) {
        return(0.5)
    } else {
        return(1)
    }
}

genre.label.data <- cbind(char = character.DTM_df$char, label = character.labels, 
    base_color = palette.3[as.factor(character.DTM_df$genre)], transparency = lapply(character.labels, 
        dim.symbols))

genre.label.data <- as.data.frame(genre.label.data)

label.colors <- c()
for (row in 1:nrow(genre.label.data)) {
    label.colors[row] <- adjustcolor(genre.label.data[row, "base_color"], alpha.f = genre.label.data[row, 
        "transparency"])
}

genre.label.data <- cbind(genre.label.data, label_color = label.colors)
remove(row, label.colors)

Cell Background - Distance

Here we define the same distance function from part 1 used to color the background of the map cells.

code.distances <- function(SOM) {
    N <- nrow(SOM$codes)
    # calculate all pairwise distances
    pairwise.distances = c()
    for (i in 1:N) {
        list <- matrix()
        for (j in 1:N) {
            list[j] <- dist(rbind(SOM$codes[i, ], SOM$codes[j, ]))
        }
        pairwise.distances <- cbind(pairwise.distances, list)
    }
    
    ## distance coefficient function, to be applied to geometric GRID distances
    d.coef <- function(x) {
        if (x == 0) {
            return(0)
        } else {
            return(1/(x^4))
        }
    }
    distance.coefficients <- apply(kohonen::unit.distances(SOM$grid, SOM$toroidal), 
        c(1, 2), d.coef)
    
    # calculate scaled sum of distances
    A <- distance.coefficients * pairwise.distances
    scaled.dist.sums <- (colSums(A) - min(colSums(A)))/(max(colSums(A)) - min(colSums(A)))
    
    # clean up variables
    remove(i, j)
    
    return(scaled.dist.sums)
}

The output of this function is a value between 0 and 1 for each cell on our SOM. Higher values represent greater overall distance from a cell to nearby cells.

Plot

par(bg = hsv(h=0,s=0,v=.7), font = 1, cex = 1.2)
plot(character.SOM, 
     type = "mapping",
     col = as.character(genre.label.data$label_color),
     bgcol = hsv(h=0.8,s=.3,v=(.5*(0.3+code.distances(character.SOM)))),
     labels = genre.label.data$label,
     main = "Shakespearean Characters - colored by genre")

There are some apparent clusters. Recall that we are using a toroidal SOM for this example, so the left and right edges and the top and bottom edges are identified. As in part 1, background color signifies a cell’s distance from neighboring cells. Dark cells are much closer to their neighbors than lighter colored cells.

Even to a Shakespearean neophyte such as myself, using this SOM to do some exploration reveals a couple interesting features. Romeo and Juliet are very close to each other linguistically, as we might have expected. King Henry, on the other hand, is pretty dissimilar from himself across the multiple plays in which he is a character. Cool!

SOM - Other Ideas

Following the above example, it’s easy to instead color each character using a different criterion. In our character_table we have the play that a character appears in as a readily available distinction.

palette.37 <- rainbow(37)

We’ll omit the rest of the code, since it mirrors the example above.

This is clearly not particularly useful, but this method suggests a few other ideas that you might want to explore:

color by character gender - How different do characters of different gender speak in Shakespeare’s plays? If there is a pattern, are there any significant outliers? You could either go through the entire table of 1300+ characters and assign each one a gender, or use these data to experiment with the gender package, which predicts gender based on an input name and date range (the gender of names has not been static over time).
color characters by date first published - How did Shakespear’s language change over time? Would a SOM show a difference in how characters spoke over time? There is some ambiguity in dating Shakespearean plays, but estimated figures exist and could be easily used to make temporal groupings (e.g. “early, middle, late”).

Self Organizing Maps and Text Mining - Visualizing Shakespeare (Part 2)

Daniel Tanner

25 June, 2017