Package Setup

Install and load the tidyverse package we will use.

install.packages("tidyverse", repos = "https://cran.r-project.org")

library(tidyverse)

Getting The Dataset Into R

The adults dataset we are going to use in this tutorial originates from the UCI Machine Learning Repository.

There are many ways to read a dataset into R, depending on our needs and how the data file is formatted. This specific dataset was only available as a text (.txt) file, unparsed and without headings.

We use the following method to read in the adults data and store it in a variable.

To read in and parse this dataset correctly, we use the read.delim() function. This function is useful because it lets us specify how the file should be parsed. We set header = FALSE because the file has no heading row, and sep = "," because the values are separated by commas.

setwd("~/Documents/Projects & Work/Digital_Projects_Studio/R_tutorial/data")
adults <- read.delim("adult-data.txt", header = FALSE, sep = ",")

head(adults)

##   V1                V2     V3         V4 V5                  V6
## 1 39         State-gov  77516  Bachelors 13       Never-married
## 2 50  Self-emp-not-inc  83311  Bachelors 13  Married-civ-spouse
## 3 38           Private 215646    HS-grad  9            Divorced
## 4 53           Private 234721       11th  7  Married-civ-spouse
## 5 28           Private 338409  Bachelors 13  Married-civ-spouse
## 6 37           Private 284582    Masters 14  Married-civ-spouse
##                   V7             V8     V9     V10  V11 V12 V13
## 1       Adm-clerical  Not-in-family  White    Male 2174   0  40
## 2    Exec-managerial        Husband  White    Male    0   0  13
## 3  Handlers-cleaners  Not-in-family  White    Male    0   0  40
## 4  Handlers-cleaners        Husband  Black    Male    0   0  40
## 5     Prof-specialty           Wife  Black  Female    0   0  40
## 6    Exec-managerial           Wife  White  Female    0   0  40
##              V14    V15
## 1  United-States  <=50K
## 2  United-States  <=50K
## 3  United-States  <=50K
## 4  United-States  <=50K
## 5           Cuba  <=50K
## 6  United-States  <=50K
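As a hedged alternative to the call above: read.csv() uses the same underlying reader with comma defaults, and two of its arguments address quirks of this file. strip.white = TRUE trims the blank that follows each comma, and na.strings = "?" turns the "?" placeholders into NA at read time. The sketch below runs on a temporary stand-in file holding three shortened rows, not on adult-data.txt itself; tmp and demo are illustrative names.

```r
# Stand-in file: three shortened rows in the same comma-plus-space layout.
tmp <- tempfile(fileext = ".txt")
writeLines(c("39, State-gov, 77516, <=50K",
             "54, ?, 180211, >50K",
             "38, Private, 215646, <=50K"), tmp)

demo <- read.csv(tmp, header = FALSE,
                 strip.white = TRUE,        # drop the blank after each comma
                 na.strings = "?",          # treat "?" as a missing value
                 stringsAsFactors = FALSE)  # keep text columns as character

str(demo)     # V1 and V3 are integer, V2 and V4 are character
unlink(tmp)   # remove the temporary file
```

If the data are read this way, the later replacement of "?" with NA is no longer needed, and text values can be compared with == because they carry no leading space.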

Formatting The Data

Notice that although the dataset has been read in correctly, it has no headings to tell us what each column represents. The corresponding headings can be found in our original data source. Let’s add those headings with the names() function.

names(adults) <- c("age","workclass","fnlwgt","education","education-num",
                   "marital-status","occupation","relationship","race","sex",
                   "capital-gain","capital-loss","hours-per-week",
                   "native-country", "salary")

adults[20:30,]

##    age         workclass fnlwgt     education education-num
## 20  43  Self-emp-not-inc 292175       Masters            14
## 21  40           Private 193524     Doctorate            16
## 22  54           Private 302146       HS-grad             9
## 23  35       Federal-gov  76845           9th             5
## 24  43           Private 117037          11th             7
## 25  59           Private 109015       HS-grad             9
## 26  56         Local-gov 216851     Bachelors            13
## 27  19           Private 168294       HS-grad             9
## 28  54                 ? 180211  Some-college            10
## 29  39           Private 367260       HS-grad             9
## 30  49           Private 193366       HS-grad             9
##         marital-status        occupation   relationship
## 20            Divorced   Exec-managerial      Unmarried
## 21  Married-civ-spouse    Prof-specialty        Husband
## 22           Separated     Other-service      Unmarried
## 23  Married-civ-spouse   Farming-fishing        Husband
## 24  Married-civ-spouse  Transport-moving        Husband
## 25            Divorced      Tech-support      Unmarried
## 26  Married-civ-spouse      Tech-support        Husband
## 27       Never-married      Craft-repair      Own-child
## 28  Married-civ-spouse                 ?        Husband
## 29            Divorced   Exec-managerial  Not-in-family
## 30  Married-civ-spouse      Craft-repair        Husband
##                   race     sex capital-gain capital-loss hours-per-week
## 20               White  Female            0            0             45
## 21               White    Male            0            0             60
## 22               Black  Female            0            0             20
## 23               Black    Male            0            0             40
## 24               White    Male            0         2042             40
## 25               White  Female            0            0             40
## 26               White    Male            0            0             40
## 27               White    Male            0            0             40
## 28  Asian-Pac-Islander    Male            0            0             60
## 29               White    Male            0            0             80
## 30               White    Male            0            0             40
##    native-country salary
## 20  United-States   >50K
## 21  United-States   >50K
## 22  United-States  <=50K
## 23  United-States  <=50K
## 24  United-States  <=50K
## 25  United-States  <=50K
## 26  United-States   >50K
## 27  United-States  <=50K
## 28          South   >50K
## 29  United-States  <=50K
## 30  United-States  <=50K
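Because several of these names contain a dash, they are not syntactic R names, and referring to them with $ requires backticks. A minimal sketch on a made-up two-row data frame (demo and clean are hypothetical names) shows the backtick form and one optional alternative, replacing the dashes with underscores, which this tutorial does not do:

```r
demo <- data.frame(V1 = c(39, 50), V2 = c(13, 13))
names(demo) <- c("age", "education-num")

demo$`education-num`   # backticks are required for a name containing a dash

# Optional: switch to underscores so the names are syntactic.
clean <- demo
names(clean) <- gsub("-", "_", names(clean), fixed = TRUE)
clean$education_num    # no backticks needed now
```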

Handling Missing Values

It is always good practice to evaluate our data and check for missing values. We notice that there are some missing values in our dataset, particularly in the workclass, occupation, and native-country variables, that are represented with “?”. For example, in the variable workclass, the value in row 28 is represented with “?”.

R offers multiple methods for handling missing values, but they only recognize a value as missing if it is represented with NA. To format the missing values correctly, we can use the stringr library, along with a simple regular expression, to replace the question marks with NA.

After we load the stringr library, we can use str_detect() to locate the question marks in our data. str_detect() takes a vector of strings and a pattern, and returns TRUE or FALSE for each string according to whether it contains a match for the pattern. We notice that only the variables workclass, occupation, and native-country have question marks. To search for a pattern, we write a regular expression, which is a sequence of characters that defines a search pattern. Since a question mark is a special character in regular expressions, we must put two backslashes “\\” in front of it to tell R that what we actually want to match is the question mark itself. We specify the variable we want to look in, like adults$workclass, and put str_detect() inside square brackets to select the rows that match, then replace the selected values with NA.

library(stringr)

adults$workclass[str_detect(adults$workclass, "\\?")] <- NA
adults$occupation[str_detect(adults$occupation, "\\?")] <- NA
adults$`native-country`[str_detect(adults$`native-country`, "\\?")] <- NA

head(adults, 20)

##    age         workclass fnlwgt     education education-num
## 1   39         State-gov  77516     Bachelors            13
## 2   50  Self-emp-not-inc  83311     Bachelors            13
## 3   38           Private 215646       HS-grad             9
## 4   53           Private 234721          11th             7
## 5   28           Private 338409     Bachelors            13
## 6   37           Private 284582       Masters            14
## 7   49           Private 160187           9th             5
## 8   52  Self-emp-not-inc 209642       HS-grad             9
## 9   31           Private  45781       Masters            14
## 10  42           Private 159449     Bachelors            13
## 11  37           Private 280464  Some-college            10
## 12  30         State-gov 141297     Bachelors            13
## 13  23           Private 122272     Bachelors            13
## 14  32           Private 205019    Assoc-acdm            12
## 15  40           Private 121772     Assoc-voc            11
## 16  34           Private 245487       7th-8th             4
## 17  25  Self-emp-not-inc 176756       HS-grad             9
## 18  32           Private 186824       HS-grad             9
## 19  38           Private  28887          11th             7
## 20  43  Self-emp-not-inc 292175       Masters            14
##            marital-status         occupation   relationship
## 1           Never-married       Adm-clerical  Not-in-family
## 2      Married-civ-spouse    Exec-managerial        Husband
## 3                Divorced  Handlers-cleaners  Not-in-family
## 4      Married-civ-spouse  Handlers-cleaners        Husband
## 5      Married-civ-spouse     Prof-specialty           Wife
## 6      Married-civ-spouse    Exec-managerial           Wife
## 7   Married-spouse-absent      Other-service  Not-in-family
## 8      Married-civ-spouse    Exec-managerial        Husband
## 9           Never-married     Prof-specialty  Not-in-family
## 10     Married-civ-spouse    Exec-managerial        Husband
## 11     Married-civ-spouse    Exec-managerial        Husband
## 12     Married-civ-spouse     Prof-specialty        Husband
## 13          Never-married       Adm-clerical      Own-child
## 14          Never-married              Sales  Not-in-family
## 15     Married-civ-spouse       Craft-repair        Husband
## 16     Married-civ-spouse   Transport-moving        Husband
## 17          Never-married    Farming-fishing      Own-child
## 18          Never-married  Machine-op-inspct      Unmarried
## 19     Married-civ-spouse              Sales        Husband
## 20               Divorced    Exec-managerial      Unmarried
##                   race     sex capital-gain capital-loss hours-per-week
## 1                White    Male         2174            0             40
## 2                White    Male            0            0             13
## 3                White    Male            0            0             40
## 4                Black    Male            0            0             40
## 5                Black  Female            0            0             40
## 6                White  Female            0            0             40
## 7                Black  Female            0            0             16
## 8                White    Male            0            0             45
## 9                White  Female        14084            0             50
## 10               White    Male         5178            0             40
## 11               Black    Male            0            0             80
## 12  Asian-Pac-Islander    Male            0            0             40
## 13               White  Female            0            0             30
## 14               Black    Male            0            0             50
## 15  Asian-Pac-Islander    Male            0            0             40
## 16  Amer-Indian-Eskimo    Male            0            0             45
## 17               White    Male            0            0             35
## 18               White    Male            0            0             40
## 19               White    Male            0            0             50
## 20               White  Female            0            0             45
##    native-country salary
## 1   United-States  <=50K
## 2   United-States  <=50K
## 3   United-States  <=50K
## 4   United-States  <=50K
## 5            Cuba  <=50K
## 6   United-States  <=50K
## 7         Jamaica  <=50K
## 8   United-States   >50K
## 9   United-States   >50K
## 10  United-States   >50K
## 11  United-States   >50K
## 12          India   >50K
## 13  United-States  <=50K
## 14  United-States  <=50K
## 15           <NA>   >50K
## 16         Mexico  <=50K
## 17  United-States  <=50K
## 18  United-States  <=50K
## 19  United-States  <=50K
## 20  United-States   >50K
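To make the mechanics of the replacement concrete, here is a small sketch on a made-up character vector (x, hits, and same are illustrative names). It shows that str_detect() returns a logical vector, and that stringr's fixed() is another way to match a literal question mark without escaping it:

```r
library(stringr)

x <- c(" Private", " ?", " State-gov", " ?")

hits <- str_detect(x, "\\?")       # one TRUE/FALSE per element
hits
same <- str_detect(x, fixed("?"))  # fixed() treats the pattern as literal text

x[hits] <- NA                      # the logical vector picks the elements to replace
x
sum(is.na(x))                      # count of missing values after the replacement
```

On a whole data frame, colSums(is.na(adults)) would give the same count per column.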

Dplyr: Cleaning And Manipulating The Data

Let’s take a look at our correctly formatted data. Immediately we notice there are some variables that are not particularly interesting, and some that seem interesting and worth exploring. We will use the dplyr package that is part of tidyverse to manipulate the data we have into something we can create interesting visualizations with.

There are many functions in dplyr - we will cover some of the most useful and commonly used functions:

filter
select
arrange
mutate
transmute
summarize
group_by
rename

Select

The select() function is used to keep only the variables of interest to the current analysis. It is most useful when working with dataframes that have a large number of variables. Let’s extract the columns that we are interested in examining into a new dataframe, so that we don’t lose any columns completely, in case we are interested in them later on.

The most common way to use select() is to write down all the variable names we wish to keep in the new dataset. In this case, since some of our variable names contain dashes, we need to put those names in double quotes, single quotes, or backticks.

adults_select <- select(adults, age, workclass, fnlwgt, education, "education-num", "marital-status", occupation, race, sex, "hours-per-week", "native-country", salary)

head(adults_select, 10)

##    age         workclass fnlwgt  education education-num
## 1   39         State-gov  77516  Bachelors            13
## 2   50  Self-emp-not-inc  83311  Bachelors            13
## 3   38           Private 215646    HS-grad             9
## 4   53           Private 234721       11th             7
## 5   28           Private 338409  Bachelors            13
## 6   37           Private 284582    Masters            14
## 7   49           Private 160187        9th             5
## 8   52  Self-emp-not-inc 209642    HS-grad             9
## 9   31           Private  45781    Masters            14
## 10  42           Private 159449  Bachelors            13
##            marital-status         occupation   race     sex hours-per-week
## 1           Never-married       Adm-clerical  White    Male             40
## 2      Married-civ-spouse    Exec-managerial  White    Male             13
## 3                Divorced  Handlers-cleaners  White    Male             40
## 4      Married-civ-spouse  Handlers-cleaners  Black    Male             40
## 5      Married-civ-spouse     Prof-specialty  Black  Female             40
## 6      Married-civ-spouse    Exec-managerial  White  Female             40
## 7   Married-spouse-absent      Other-service  Black  Female             16
## 8      Married-civ-spouse    Exec-managerial  White    Male             45
## 9           Never-married     Prof-specialty  White  Female             50
## 10     Married-civ-spouse    Exec-managerial  White    Male             40
##    native-country salary
## 1   United-States  <=50K
## 2   United-States  <=50K
## 3   United-States  <=50K
## 4   United-States  <=50K
## 5            Cuba  <=50K
## 6   United-States  <=50K
## 7         Jamaica  <=50K
## 8   United-States   >50K
## 9   United-States   >50K
## 10  United-States   >50K

If there are more variables we want to keep than drop, it might be more efficient to use a second method. In this method, we use select() as we normally would, but put a - in front of the variables we want to drop from the original dataset, so that we have fewer variable names to type. This returns the same result as the previous method.

adults_negselect <- select(adults, -relationship, -"capital-gain", -"capital-loss")

head(adults_negselect, 10)

##    age         workclass fnlwgt  education education-num
## 1   39         State-gov  77516  Bachelors            13
## 2   50  Self-emp-not-inc  83311  Bachelors            13
## 3   38           Private 215646    HS-grad             9
## 4   53           Private 234721       11th             7
## 5   28           Private 338409  Bachelors            13
## 6   37           Private 284582    Masters            14
## 7   49           Private 160187        9th             5
## 8   52  Self-emp-not-inc 209642    HS-grad             9
## 9   31           Private  45781    Masters            14
## 10  42           Private 159449  Bachelors            13
##            marital-status         occupation   race     sex hours-per-week
## 1           Never-married       Adm-clerical  White    Male             40
## 2      Married-civ-spouse    Exec-managerial  White    Male             13
## 3                Divorced  Handlers-cleaners  White    Male             40
## 4      Married-civ-spouse  Handlers-cleaners  Black    Male             40
## 5      Married-civ-spouse     Prof-specialty  Black  Female             40
## 6      Married-civ-spouse    Exec-managerial  White  Female             40
## 7   Married-spouse-absent      Other-service  Black  Female             16
## 8      Married-civ-spouse    Exec-managerial  White    Male             45
## 9           Never-married     Prof-specialty  White  Female             50
## 10     Married-civ-spouse    Exec-managerial  White    Male             40
##    native-country salary
## 1   United-States  <=50K
## 2   United-States  <=50K
## 3   United-States  <=50K
## 4   United-States  <=50K
## 5            Cuba  <=50K
## 6   United-States  <=50K
## 7         Jamaica  <=50K
## 8   United-States   >50K
## 9   United-States   >50K
## 10  United-States   >50K
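The claim that both approaches return the same result can be checked directly. The sketch below uses a made-up two-row data frame with a subset of the columns (demo, keep, and drop are illustrative names), not the adults data itself:

```r
library(dplyr)

demo <- data.frame(age = c(39, 50),
                   relationship = c("Not-in-family", "Husband"),
                   `capital-gain` = c(2174, 0),
                   sex = c("Male", "Male"),
                   check.names = FALSE)   # keep the dash in the column name

keep <- select(demo, age, sex)                        # name what to keep
drop <- select(demo, -relationship, -`capital-gain`)  # name what to drop

identical(names(keep), names(drop))   # same columns in the same order
all.equal(keep, drop)                 # same contents
```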

There are also other ways to use select() that can reach the desired outcome with less typing. select() accepts helper functions that choose columns by their names, such as starts_with(), ends_with(), and contains(). For example, say that in our adults dataset we want to examine only the columns with education-related data. We can use starts_with("education") inside select(), which selects the columns whose names begin with the given string, “education”.

adults_educ <- select(adults, starts_with("education"))

head(adults_educ, 10)

##     education education-num
## 1   Bachelors            13
## 2   Bachelors            13
## 3     HS-grad             9
## 4        11th             7
## 5   Bachelors            13
## 6     Masters            14
## 7         9th             5
## 8     HS-grad             9
## 9     Masters            14
## 10  Bachelors            13
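The other helpers work the same way. A short sketch on a made-up data frame with a few of the adults column names (demo is an illustrative name) shows what each one matches; the helpers can also be negated with - to drop the matching columns:

```r
library(dplyr)

demo <- data.frame(age = c(39, 50),
                   education = c("Bachelors", "HS-grad"),
                   `education-num` = c(13, 9),
                   `capital-gain` = c(2174, 0),
                   `capital-loss` = c(0, 0),
                   check.names = FALSE)

names(select(demo, starts_with("capital")))   # names beginning with "capital"
names(select(demo, ends_with("num")))         # names ending in "num"
names(select(demo, contains("-")))            # names containing a dash
names(select(demo, -starts_with("capital")))  # everything except the capital columns
```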

Arrange

Now that we have a dataframe with only the columns we are interested in, let’s explore some variables that might be worth a closer look. We are interested in seeing the age range of adults in this dataset, and then focusing on a smaller group.

The arrange() function can order rows of a data frame using a variable name (or a more complicated expression). If we provide multiple expressions to order by, it uses the second one to break ties in the first one, third one to break ties in the second one, and so on. The default setting for arrangement is from low to high values.

adults_arrange <- arrange(adults_select, age)

head(adults_arrange, 20)

##    age         workclass fnlwgt education education-num marital-status
## 1   17              <NA> 304873      10th             6  Never-married
## 2   17           Private  65368      11th             7  Never-married
## 3   17           Private 245918      11th             7  Never-married
## 4   17           Private 191260       9th             5  Never-married
## 5   17           Private 270942   5th-6th             3  Never-married
## 6   17           Private  89821      11th             7  Never-married
## 7   17           Private 175024      11th             7  Never-married
## 8   17              <NA> 202521      11th             7  Never-married
## 9   17              <NA> 258872      11th             7  Never-married
## 10  17           Private 211870       9th             5  Never-married
## 11  17           Private 242718      11th             7  Never-married
## 12  17           Private 169658      10th             6  Never-married
## 13  17              <NA>  80077      11th             7  Never-married
## 14  17  Self-emp-not-inc 368700      11th             7  Never-married
## 15  17           Private 102726      12th             8  Never-married
## 16  17           Private 316929      12th             8  Never-married
## 17  17           Private 193830      11th             7  Never-married
## 18  17           Private  32607      10th             6  Never-married
## 19  17           Private 198124      11th             7  Never-married
## 20  17           Private 368700      11th             7  Never-married
##            occupation   race     sex hours-per-week native-country salary
## 1                <NA>  White  Female             32  United-States  <=50K
## 2               Sales  White  Female             12  United-States  <=50K
## 3       Other-service  White    Male             12  United-States  <=50K
## 4       Other-service  White    Male             24  United-States  <=50K
## 5       Other-service  White    Male             48         Mexico  <=50K
## 6       Other-service  White    Male             10  United-States  <=50K
## 7   Handlers-cleaners  White    Male             18  United-States  <=50K
## 8                <NA>  White    Male             40  United-States  <=50K
## 9                <NA>  White  Female              5  United-States  <=50K
## 10      Other-service  White    Male              6  United-States  <=50K
## 11              Sales  White    Male             12  United-States  <=50K
## 12      Other-service  White  Female             21  United-States  <=50K
## 13               <NA>  White  Female             20  United-States  <=50K
## 14    Farming-fishing  White    Male             10  United-States  <=50K
## 15      Other-service  White    Male             16  United-States  <=50K
## 16  Handlers-cleaners  White    Male             20  United-States  <=50K
## 17              Sales  White  Female             20  United-States  <=50K
## 18    Farming-fishing  White    Male             20  United-States  <=50K
## 19              Sales  White    Male             20  United-States  <=50K
## 20              Sales  White    Male             28  United-States  <=50K
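The default ordering and the tie-breaking rule described above can be sketched on a made-up data frame (demo, hours, and by_age_hours are illustrative names). desc() reverses the order for a single variable:

```r
library(dplyr)

demo <- data.frame(age   = c(17, 90, 17, 45),
                   hours = c(12, 40, 32, 40))

arrange(demo, age)         # ascending, the default
arrange(demo, desc(age))   # descending

# Ties in age are broken by hours, from high to low.
by_age_hours <- arrange(demo, age, desc(hours))
by_age_hours

range(demo$age)            # youngest and oldest age without sorting anything
```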

Filter

Let’s filter the data further. We want to take a closer look at adults aged 30 or older who originated from the United States and who work for a private company. The filter() function keeps the rows of the dataset that meet the conditions we pass to it.

Notice that the native-country variable in the dataset holds text data rather than numerical data. One convenient way to filter on a text column is the str_detect() function from the stringr library. Placed inside filter(), str_detect() returns TRUE for every row whose text contains the pattern, so filter() keeps those rows just as it would for a numerical condition. The text values in this dataset also carry a leading space, as the quoted values in the group_by() output later in this tutorial show, so a pattern match is more forgiving here than an exact comparison with ==.

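To filter on the age variable, I wrapped it in as.integer() so that the comparison is made on whole numbers. The filtered output below shows the ages themselves, which suggests age was read in as a number, so the call acts only as a safeguard. Had age really been stored as a factor, as.integer() would return its internal level codes, and as.integer(as.character(age)) would be needed instead.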
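A caution about as.integer() and factors, sketched on a made-up three-element factor (age_f, codes, and ages are illustrative names): applied directly to a factor, as.integer() returns the position of each value among the sorted levels, not the number that is printed.

```r
age_f <- factor(c("39", "50", "17"))   # levels are sorted: "17", "39", "50"

codes <- as.integer(age_f)                # level codes, not the ages
ages  <- as.integer(as.character(age_f))  # the ages themselves

codes
ages
```

Checking a column with class() or is.factor() before converting it shows which of the two forms is needed.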

One useful tool available in dplyr is the pipe operator %>%. The pipe operator is written at the end of a line of code. It takes the output of the expression on its left and feeds it to the first argument of the next function. I use the operator here so that I do not have to retype my dataset argument. The pipe operator is especially useful when we want to apply several functions to the same dataset in sequence, because we do not have to save an intermediate result in a new variable after each step.

library(stringr)

adults_filter <- adults_select %>%
  filter(str_detect(`native-country`, 'United-States') & 
         as.integer(age) >= 30 & str_detect(workclass, 'Private'))

head(adults_filter, 10)

##    age workclass fnlwgt     education education-num      marital-status
## 1   38   Private 215646       HS-grad             9            Divorced
## 2   53   Private 234721          11th             7  Married-civ-spouse
## 3   37   Private 284582       Masters            14  Married-civ-spouse
## 4   31   Private  45781       Masters            14       Never-married
## 5   42   Private 159449     Bachelors            13  Married-civ-spouse
## 6   37   Private 280464  Some-college            10  Married-civ-spouse
## 7   32   Private 205019    Assoc-acdm            12       Never-married
## 8   32   Private 186824       HS-grad             9       Never-married
## 9   38   Private  28887          11th             7  Married-civ-spouse
## 10  40   Private 193524     Doctorate            16  Married-civ-spouse
##            occupation   race     sex hours-per-week native-country salary
## 1   Handlers-cleaners  White    Male             40  United-States  <=50K
## 2   Handlers-cleaners  Black    Male             40  United-States  <=50K
## 3     Exec-managerial  White  Female             40  United-States  <=50K
## 4      Prof-specialty  White  Female             50  United-States   >50K
## 5     Exec-managerial  White    Male             40  United-States   >50K
## 6     Exec-managerial  Black    Male             80  United-States   >50K
## 7               Sales  Black    Male             50  United-States  <=50K
## 8   Machine-op-inspct  White    Male             40  United-States  <=50K
## 9               Sales  White    Male             50  United-States  <=50K
## 10     Prof-specialty  White    Male             60  United-States   >50K
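To illustrate what the pipe does, the sketch below writes the same filter three ways on a made-up data frame (demo, nested, piped, and exact are illustrative names). Separating conditions with commas inside filter() combines them with "and", just like &. The last version uses stringr's str_trim() to remove the leading space so that an exact comparison with == works; this is an alternative, not what the tutorial does.

```r
library(dplyr)
library(stringr)

demo <- data.frame(age = c(38, 25, 53, 41),
                   workclass = c(" Private", " Private", " State-gov", NA),
                   stringsAsFactors = FALSE)

# A nested call and a piped call are the same computation.
nested <- filter(demo, age >= 30 & str_detect(workclass, "Private"))
piped  <- demo %>% filter(age >= 30, str_detect(workclass, "Private"))

# Exact comparison after trimming the leading space.
exact  <- demo %>% filter(age >= 30, str_trim(workclass) == "Private")

piped   # only the 38-year-old remains; the row with a missing workclass is dropped
```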

There are also other ways to use filter() that can reach the desired outcome with less typing. Several helper functions are handy inside filter(), such as is.na(), between(), and near(). For example, say that in our adults_select dataset we want to examine only the rows with adults between the ages of 30 and 50, inclusive. We can use between(age, 30, 50) inside filter(), which keeps the rows that satisfy the condition. As before, age is wrapped in as.integer().

adults_filter_between <- adults_select %>%
  filter(between(as.integer(age), 30, 50))

head(adults_filter_between, 10)

##    age         workclass fnlwgt     education education-num
## 1   39         State-gov  77516     Bachelors            13
## 2   50  Self-emp-not-inc  83311     Bachelors            13
## 3   38           Private 215646       HS-grad             9
## 4   37           Private 284582       Masters            14
## 5   49           Private 160187           9th             5
## 6   31           Private  45781       Masters            14
## 7   42           Private 159449     Bachelors            13
## 8   37           Private 280464  Some-college            10
## 9   30         State-gov 141297     Bachelors            13
## 10  32           Private 205019    Assoc-acdm            12
##            marital-status         occupation                race     sex
## 1           Never-married       Adm-clerical               White    Male
## 2      Married-civ-spouse    Exec-managerial               White    Male
## 3                Divorced  Handlers-cleaners               White    Male
## 4      Married-civ-spouse    Exec-managerial               White  Female
## 5   Married-spouse-absent      Other-service               Black  Female
## 6           Never-married     Prof-specialty               White  Female
## 7      Married-civ-spouse    Exec-managerial               White    Male
## 8      Married-civ-spouse    Exec-managerial               Black    Male
## 9      Married-civ-spouse     Prof-specialty  Asian-Pac-Islander    Male
## 10          Never-married              Sales               Black    Male
##    hours-per-week native-country salary
## 1              40  United-States  <=50K
## 2              13  United-States  <=50K
## 3              40  United-States  <=50K
## 4              40  United-States  <=50K
## 5              16        Jamaica  <=50K
## 6              50  United-States   >50K
## 7              40  United-States   >50K
## 8              80  United-States   >50K
## 9              40          India   >50K
## 10             50  United-States  <=50K
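The three helpers named above can be sketched on a made-up data frame (demo, in_range, missing, and complete are illustrative names). between() includes both endpoints, is.na() finds missing values, and near() compares numbers within a small tolerance, which matters for floating-point results:

```r
library(dplyr)

demo <- data.frame(age = c(29, 30, 50, 51),
                   workclass = c("Private", NA, "Private", "State-gov"),
                   stringsAsFactors = FALSE)

in_range <- filter(demo, between(age, 30, 50))  # keeps 30 and 50: both ends are inclusive
missing  <- filter(demo, is.na(workclass))      # rows with a missing workclass
complete <- filter(demo, !is.na(workclass))     # rows without one

sqrt(2)^2 == 2       # FALSE because of floating-point rounding
near(sqrt(2)^2, 2)   # TRUE: equal within a small tolerance
```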

Mutate

The mutate() function can help us add additional variables to our dataset.

Suppose we want to include the fnlwgt variable in our visualizations, but its values are too large and we want to scale them down. We can use the mutate() function to create a new column in the dataset, give the new column a name, and set its values to whatever we want them to be. In this case, we divide the values by 10,000 so that they carry the same information, just on a smaller scale.

In the code below, fnlwgt is wrapped in as.double() so that the division is carried out on a numeric value.

adults_mutate <- mutate(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000)

head(adults_mutate)

##   age workclass fnlwgt     education education-num      marital-status
## 1  38   Private 215646       HS-grad             9            Divorced
## 2  53   Private 234721          11th             7  Married-civ-spouse
## 3  37   Private 284582       Masters            14  Married-civ-spouse
## 4  31   Private  45781       Masters            14       Never-married
## 5  42   Private 159449     Bachelors            13  Married-civ-spouse
## 6  37   Private 280464  Some-college            10  Married-civ-spouse
##           occupation   race     sex hours-per-week native-country salary
## 1  Handlers-cleaners  White    Male             40  United-States  <=50K
## 2  Handlers-cleaners  Black    Male             40  United-States  <=50K
## 3    Exec-managerial  White  Female             40  United-States  <=50K
## 4     Prof-specialty  White  Female             50  United-States   >50K
## 5    Exec-managerial  White    Male             40  United-States   >50K
## 6    Exec-managerial  Black    Male             80  United-States   >50K
##   scaled_fnlwgt
## 1       21.5646
## 2       23.4721
## 3       28.4582
## 4        4.5781
## 5       15.9449
## 6       28.0464

transmute() is a variant of the mutate() function. It acts the same way, except that the new dataset contains only the newly created variables and not the other, untouched ones.

head(transmute(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000), 10)

##    scaled_fnlwgt
## 1        21.5646
## 2        23.4721
## 3        28.4582
## 4         4.5781
## 5        15.9449
## 6        28.0464
## 7        20.5019
## 8        18.6824
## 9         2.8887
## 10       19.3524
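The difference between the two functions is easiest to see from the column names they return. The sketch below uses a made-up two-row data frame (demo, added, only_new, and hours_per_day are illustrative names, and the five-day week is an assumption made only for the example). It also shows that one mutate() call can create several columns:

```r
library(dplyr)

demo <- data.frame(fnlwgt = c(215646, 45781),
                   `hours-per-week` = c(40, 50),
                   check.names = FALSE)

added <- mutate(demo,
                scaled_fnlwgt = fnlwgt / 10000,
                hours_per_day = `hours-per-week` / 5)  # assumes a 5-day week

only_new <- transmute(demo, scaled_fnlwgt = fnlwgt / 10000)

names(added)      # original columns plus the two new ones
names(only_new)   # just the new column
```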

Summarize

The summarize() function can be used to summarize entire data frames by collapsing items into single number summaries.

Suppose we want to look at the average hours per week white males work, and compare them with the average hours per week white females work. We can use the summarize() function along with the filter() function and the dplyr pipe %>% to filter the adults whose race is white, and whose sex is male/female.

adults_mutate %>%
  filter(str_detect(race, 'White') & str_detect(sex, 'Male')) %>%
      summarize(white_males = mean(as.double(`hours-per-week`)))

##   white_males
## 1    44.09296

adults_mutate %>%
  filter(str_detect(race, 'White') & str_detect(sex, 'Female')) %>%
    summarize(white_females = mean(as.double(`hours-per-week`)))

##   white_females
## 1      38.44055
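summarize() is not limited to one statistic per call. As a hedged sketch on a made-up data frame (demo, overall, and the summary column names are illustrative), several summaries can be computed at once, and n() reports how many rows went into them:

```r
library(dplyr)

demo <- data.frame(sex = c("Male", "Male", "Female", "Female"),
                   hours = c(40, 50, 30, 40),
                   stringsAsFactors = FALSE)

overall <- summarize(demo,
                     n = n(),                       # number of rows summarized
                     mean_hours = mean(hours),
                     median_hours = median(hours),
                     max_hours = max(hours))
overall

# mean() returns NA if any value is missing unless na.rm = TRUE is set.
mean(c(40, NA, 50), na.rm = TRUE)
```

Reporting n alongside an average helps judge how much weight the average deserves.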

We see that, in this dataset, white males work more hours per week on average than white females. We are interested in visualizing characteristic distributions of these two groups to explore why white males work more hours on average than white females. Let’s note this result; we will come back to it when we move on to visualization.

Group_by

There is a more efficient way to achieve what we used summarize() for above. group_by() splits the data into groups defined by one or more categorical variables, so that a following summarize() computes its statistics once per group. This saves us the extra step of filtering out each group ourselves. After grouping and summarizing, it is easy to see that, for the adults in this dataset, Asian-Pac-Islander males work the most hours per week on average, at 44.2 hours. However, the differences between the groups we have formed by race and sex are modest: all ten averages fall between 37.7 and 44.2 hours per week.

adults_mutate %>%
  group_by(race, sex) %>%
  summarize(avg_hours = mean(as.double(`hours-per-week`)))

## # A tibble: 10 x 3
## # Groups:   race [?]
##    race                  sex       avg_hours
##    <fct>                 <fct>         <dbl>
##  1 " Amer-Indian-Eskimo" " Female"      38.6
##  2 " Amer-Indian-Eskimo" " Male"        42.9
##  3 " Asian-Pac-Islander" " Female"      39.9
##  4 " Asian-Pac-Islander" " Male"        44.2
##  5 " Black"              " Female"      37.7
##  6 " Black"              " Male"        41.6
##  7 " Other"              " Female"      39  
##  8 " Other"              " Male"        40.6
##  9 " White"              " Female"      38.4
## 10 " White"              " Male"        44.1
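
The quoted labels in the output above, such as " White", carry a leading space from the raw file, which is why the earlier filters used str_detect() rather than an exact match. As a hedged sketch, we could trim that whitespace first and also record the group sizes next to the means. The toy data frame and its values below are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data with the same leading-space quirk as the adults data
toy <- tibble(
  race  = c(" White", " White", " White", " Black", " Black"),
  sex   = c(" Male", " Male", " Female", " Male", " Female"),
  hours = c(40, 50, 38, 42, 36)
)

by_group <- toy %>%
  mutate(race = trimws(race), sex = trimws(sex)) %>%  # drop the leading spaces
  group_by(race, sex) %>%
  summarize(
    avg_hours = mean(hours),  # mean hours within each race/sex group
    n = n(),                  # number of rows in each group
    .groups = "drop"          # return an ungrouped tibble
  )

by_group
```

Seeing n next to each mean helps us judge how much weight to give an average that comes from a very small group.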

Rename

The rename() function is pretty self-explanatory: it lets us rename variables (columns) in the dataset. Let’s try renaming some of our variables.

For the sake of consistency with the other variable names, let’s rename our variable scaled_fnlwgt to scaled-fnlwgt. The dataset always goes first in the function call. The second argument is the desired new variable name, followed by = and the original variable name. Since the new name contains a dash, it must be wrapped in double quotes, single quotes, or backticks.

In order to make the change permanent, remember to assign the result back to a variable, as we do below with adults_mutate.

We can rename multiple variables in one call by adding further new_name = old_name pairs as the third, fourth, and later arguments.

adults_mutate <- rename(adults_mutate, "scaled-fnlwgt" = scaled_fnlwgt)
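
As a small sketch of renaming several variables in one call, the example below uses a made-up data frame (toy and its column names are hypothetical) and shows that names containing a dash need backticks when we refer to them afterwards.

```r
library(dplyr)

# Hypothetical toy data; the column names are made up for illustration
toy <- tibble(
  scaled_fnlwgt  = c(21.5646, 23.4721),
  hours_per_week = c(40, 13)
)

# Each pair is new_name = old_name; assign the result to keep the change
toy_renamed <- rename(
  toy,
  `scaled-fnlwgt`  = scaled_fnlwgt,
  `hours-per-week` = hours_per_week
)

names(toy_renamed)           # "scaled-fnlwgt" "hours-per-week"
toy_renamed$`scaled-fnlwgt`  # backticks are required because of the dash
```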

ggplot2: Visualizing The Data

Now that we have learned about some tools to clean and manipulate data, let’s make some interesting visualizations with R’s ggplot2 to explore patterns in our dataset! There are many different types of plots and variations of these plots we can create in ggplot2. In this section, we will mainly explore the following plots:

geom_point
geom_smooth
geom_bar
Coxcomb Chart
Pie Chart
Bullseye Chart

geom_point

Visualizations in ggplot2 start with a baseplot, and allow us to add different layers onto the plot, creating more complex visualizations. geom_point() allows us to create scatterplots with our data.

We are interested in seeing the relationship between the variables age and scaled-fnlwgt. We can visualize this relationship by creating a scatterplot with age as the x variable, and scaled-fnlwgt as the y variable.

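We can also visualize a third variable by mapping the color of the points to it. We are interested in seeing how sex is distributed in relation to age and scaled-fnlwgt. (In the UCI adult data, fnlwgt is the census "final weight", a sampling weight that estimates how many people in the population each record represents.)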

ggplot(adults_mutate) +
  geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`))

ggplot(adults_mutate) +
  geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`, color = sex))

In both scatterplots there appears to be a slight negative relationship between scaled-fnlwgt and age. For adults older than 30, the final weight seems to peak around ages 35-45, and older adults generally do not reach values as high. There is also no notable pattern in how sex is distributed across age and scaled-fnlwgt.

geom_smooth

In ggplot2, we can add many different geoms to a base plot. For example, to add a smoothed trend line to the scatterplot above, we can put a geom_smooth() layer on top. (By default, geom_smooth() fits a flexible smoother rather than a straight regression line.) When a plot has two or more geoms that share the same variables, we can specify the aesthetic mapping once in the ggplot() call, and every geom inherits it, which avoids repetition and keeps the layers consistent.

ggplot(adults_mutate, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
  geom_point() +
  geom_smooth()

The smoothed line added by geom_smooth() does not reveal any useful trend for this data. One likely reason is that adults_mutate contains so many observations that they overwhelm the plot. Let’s filter the data down to a smaller subset using dplyr.

To examine a smaller subset, let’s look only at female Asian-Pacific-Islander adults, because the dataset contains far fewer female than male adults, and far fewer Asian-Pacific-Islander adults than white or Black adults.

adults_asian_fem <- adults_mutate %>%
  filter(str_detect(sex, 'Female') & str_detect(race, 'Asian-Pac-Islander'))

After filtering on the sex and race variables, we have a much smaller dataset to work with. Let’s try visualizing it again using the same variables. By default, geom_smooth() draws a confidence band around the line. To remove it, set se = FALSE inside the geom_smooth() call.

ggplot(adults_asian_fem, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
  geom_point() +
  geom_smooth()

ggplot(adults_asian_fem, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
  geom_point() +
  geom_smooth(se = FALSE)

The plot shows no apparent trend between age and final weight for the Asian-Pacific-Islander females in this dataset. Let’s explore some other variables and visualizations to find interesting trends!
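
If we want a straight regression line instead of the default smoother, one option is to pass method = "lm" to geom_smooth(). The sketch below uses a made-up data frame (toy, with hypothetical age and weight values), not the adults data.

```r
library(ggplot2)

# Hypothetical toy data with a roughly linear downward trend
toy <- data.frame(
  age    = seq(20, 60, by = 5),
  weight = c(26.5, 24.8, 24.1, 22.7, 22.3, 20.6, 20.2, 18.9, 18.1)
)

# The mapping is given once in ggplot() and inherited by both geoms
p <- ggplot(toy, mapping = aes(x = age, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # linear fit, no band

p

# The same linear model fitted directly, to inspect the slope
fit <- lm(weight ~ age, data = toy)
coef(fit)
```

Fitting the model with lm() alongside the plot lets us read off the slope that the line in the plot represents.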

geom_bar

Let’s create a bar chart with geom_bar() showing the count of adults whose salary is <=50K or >50K, using sex as the fill color. The position = "dodge" argument places the bars side by side instead of stacking them, so the sexes are easier to compare. The chart shows that far more males than females make >50K.

ggplot(adults_mutate) +
  geom_bar(mapping = aes(x = salary, fill = sex), position = "dodge")

Let’s create a second bar chart with the scaled-fnlwgt of each race, also with sex as the fill color. The stat = "identity" argument is necessary because we are replacing the default statistic for geom_bar(), which is stat = "count", with the value of a variable in our dataset on the y-axis. In this chart, the bars for males are taller than those for females in every race, and the gap is most pronounced for Black and white adults. Keep in mind that fnlwgt is the census final (sampling) weight rather than a measure of income, so this chart compares sampling weights and does not tell us who earns more.

ggplot(adults_mutate) +
  geom_bar(mapping = aes(x = race, y = `scaled-fnlwgt`, fill = sex), position = "dodge", stat = "identity")
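
One caveat, as far as we understand ggplot2's behavior: with stat = "identity" on unsummarized rows, every row gets its own bar, and bars in the same race and sex slot are drawn on top of each other. A common alternative is to summarize first and then plot one bar per group with geom_col(), which is geom_bar() with stat = "identity" built in. The sketch below uses a made-up data frame (toy) and plots group means, which is an assumption about what we want to compare, not the chart shown above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for adults_mutate
toy <- tibble(
  race          = c("Black", "Black", "Black", "White", "White", "White"),
  sex           = c("Female", "Male", "Male", "Female", "Female", "Male"),
  scaled_fnlwgt = c(10, 20, 30, 12, 18, 25)
)

# One row per race/sex group, holding the group mean
toy_means <- toy %>%
  group_by(race, sex) %>%
  summarize(mean_weight = mean(scaled_fnlwgt), .groups = "drop")

# geom_col() plots the y values as given, one bar per summarized row
p <- ggplot(toy_means) +
  geom_col(mapping = aes(x = race, y = mean_weight, fill = sex), position = "dodge")

p
```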

We want to see if there are any characteristics of the dataset that might be causing these trends. Let’s go back to the data and examine these variables with some of the dplyr tools we learned earlier. We can use group_by() and summarize() with n() to count the adults of each race and of each sex in this dataset.

adults_mutate %>%
  group_by(race) %>%
  summarize(count = n())

## # A tibble: 5 x 2
##   race                  count
##   <fct>                 <int>
## 1 " Amer-Indian-Eskimo"   103
## 2 " Asian-Pac-Islander"   113
## 3 " Black"               1296
## 4 " Other"                 45
## 5 " White"              11762

adults_mutate %>%
  group_by(sex) %>%
  summarize(count = n())

## # A tibble: 2 x 2
##   sex       count
##   <fct>     <int>
## 1 " Female"  4101
## 2 " Male"    9218

Comparing the counts, we notice that there are about nine times as many white adults as Black adults in the dataset (11,762 versus 1,296), and a little over twice as many male adults as female adults (9,218 versus 4,101). This unequal representation may partly explain the differences we saw in the bar charts, such as why so many more males than females appear in the >50K group.
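
As a shortcut, dplyr's count() does the grouping and counting in one step, and a follow-up mutate() can turn the counts into proportions. This is a minimal sketch on a made-up data frame (toy), not the adults data.

```r
library(dplyr)

# Hypothetical toy data standing in for adults_mutate
toy <- tibble(sex = c("Male", "Male", "Male", "Female"))

sex_counts <- toy %>%
  count(sex) %>%                 # same idea as group_by(sex) %>% summarize(n = n())
  mutate(prop = n / sum(n))      # share of all rows in each group

sex_counts
```

Proportions make the imbalance easier to compare across variables with different numbers of categories.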

Facet

Facets provide another way to add a third variable to a scatterplot. We should facet on a discrete variable rather than a continuous one. To use it, add facet_wrap(~ variable) at the end of a normal geom_point() plot. The function draws a separate panel for each category of the third variable and displays the panels next to each other, so it is easy to see any major differences in trend between categories.

We filter the dataset to contain only females, since the full dataset has too many observations to plot clearly. We then plot this data with geom_point() and facet on the third variable, race. We do not see any particular trend in the relationship between age and final weight across races among females, but we do notice uneven representation, with far more white and Black females than females of any other race in this dataset.

adults_mutate %>%
  filter(str_detect(sex, 'Female')) %>%
  ggplot() +
  geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`)) +
  facet_wrap(~race)
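
facet_wrap() also accepts layout arguments such as nrow and ncol. The sketch below, on a made-up data frame (toy, with hypothetical values), places the panels in a single row.

```r
library(ggplot2)

# Hypothetical toy data; the real plot uses adults_mutate
toy <- data.frame(
  age    = c(25, 40, 55, 30, 45, 60),
  weight = c(12, 20, 15, 18, 22, 10),
  race   = c("Black", "Black", "Black", "White", "White", "White")
)

p <- ggplot(toy) +
  geom_point(mapping = aes(x = age, y = weight)) +
  facet_wrap(~ race, nrow = 1)  # one panel per level of race, in a single row

p
```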

***

Coordinate Systems

We can also use ggplot2 to draw plots in a polar coordinate system rather than on the usual Cartesian x-y plane.

Coxcomb Chart

We are interested in visualizing the distribution of marital status among the adults in this dataset. To create a Coxcomb chart, we must first create a bar chart using geom_bar(). When creating this base bar chart, it is important to map both x and the fill color to the variable we want to visualize. We set the width argument to 1 so that the bars touch each other.

bar <- ggplot(adults_mutate) +
  geom_bar(mapping = aes(x = `marital-status`, fill = `marital-status`), width = 1)

bar

After we create the base bar chart, we can turn it into a Coxcomb chart with the coord_polar() function, which transforms a chart from the x-y plane to the polar coordinate system. We set the x label to NULL to remove the x-axis label “marital-status”. Now we have a Coxcomb chart that represents the frequency of each marital status category for the adults in our dataset. We can see that the largest category is Married-civ-spouse, that is, adults married to a civilian spouse.

bar +
  labs(x = NULL) +
  coord_polar()

Pie Chart

We can also visualize the frequency of each marital status in this dataset with a pie chart. For a pie chart, we must also create a base bar chart using geom_bar(). However, this time we need a stacked bar chart. We set the x variable to factor(1) so that all the bars are stacked on top of each other in a single column. We set the fill color to marital-status to represent each category. We set the width to 1, as before.

bar <- ggplot(adults_mutate) +
  geom_bar(mapping = aes(x = factor(1), fill = `marital-status`), width = 1)

bar

For the pie chart, we also use coord_polar() to transform the stacked bar chart to polar coordinates, and we remove the x-axis label. We map the y-axis of the bar chart to the angle theta with theta = "y". The pie chart shows the share of each category more clearly, including some categories that might have been too small to see in the Coxcomb chart.

bar +
  labs(x = NULL) +
  coord_polar(theta = "y")

Bullseye Chart

We can also visualize marital-status in another kind of polar coordinate chart. The bullseye chart is created with a procedure very similar to the pie chart. The base bar chart is the same stacked bar chart we created above. When mapping to polar coordinates, if we remove theta = "y", which mapped the y-axis of the bar chart to the angle theta, coord_polar() falls back to its default of theta = "x", and we get the bullseye chart.

bar +
  labs(x = NULL) +
  coord_polar()
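
To see the pie chart and the bullseye chart side by side in code, the sketch below builds one stacked bar from a made-up data frame (toy and its status values are hypothetical) and changes only the theta argument of coord_polar().

```r
library(ggplot2)

# Hypothetical toy data standing in for the marital-status column
toy <- data.frame(
  status = c("Married", "Married", "Married",
             "Never-married", "Never-married", "Divorced")
)

# Base chart: a single stacked bar, one segment per status
stacked <- ggplot(toy) +
  geom_bar(mapping = aes(x = factor(1), fill = status), width = 1) +
  labs(x = NULL)

# Pie chart: the bar's y-axis (the counts) becomes the angle
pie <- stacked + coord_polar(theta = "y")

# Bullseye chart: theta = "x" is the default, so the counts become the radius
bullseye <- stacked + coord_polar(theta = "x")

pie
bullseye
```

In the pie chart the counts set the angle of each slice, while in the bullseye chart they set the thickness of each ring.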

***

I hope this tutorial helped you get started on your journey with R. You can check out more tutorials on useful data manipulation and visualization tools at the University of Michigan Clark Labs Digital Projects Studio!

Tidyverse dplyr and ggplot2: How To Manipulate and Visualize Data Easily in R

Emily Lin, Digital Projects Studio

3/28/2019