Install and load the tidyverse package we will use.
install.packages("tidyverse", repos = "https://cran.r-project.org")
library(tidyverse)
The adults dataset we are going to use in this tutorial originates from the UCI Machine Learning Repository.
There are many ways to read a dataset into R, depending on our needs and how the data file is formatted. This specific dataset was only available as a text (.txt) file, unparsed and without headings.
We use the following method to read in the adults data and store it in a variable.
To read in and parse this dataset correctly, use the read.delim() function. This function is useful because we can specify how we want to parse our file. By default read.delim() expects tab-separated fields, so we set sep = "," to tell it that the values in this file are separated by commas, and header = FALSE because the file has no heading row.
setwd("~/Documents/Projects & Work/Digital_Projects_Studio/R_tutorial/data")
adults <- read.delim("adult-data.txt", header = FALSE, sep = ",")
head(adults)
## V1 V2 V3 V4 V5 V6
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## V7 V8 V9 V10 V11 V12 V13
## 1 Adm-clerical Not-in-family White Male 2174 0 40
## 2 Exec-managerial Husband White Male 0 0 13
## 3 Handlers-cleaners Not-in-family White Male 0 0 40
## 4 Handlers-cleaners Husband Black Male 0 0 40
## 5 Prof-specialty Wife Black Female 0 0 40
## 6 Exec-managerial Wife White Female 0 0 40
## V14 V15
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
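As a side note on the sep argument, here is a minimal sketch on two made-up rows (not the real file) laid out the same way, with a comma followed by a space. It assumes only base R. It suggests that text fields keep the space after each comma unless strip.white = TRUE is added, which is worth knowing before comparing text values later on.

```r
# Two made-up rows in the same comma-plus-space layout as the adults file
raw <- "39, State-gov, 77516\n50, Private, 83311"

# Same call as in the tutorial, reading from a string instead of a file
kept <- read.delim(text = raw, header = FALSE, sep = ",",
                   stringsAsFactors = FALSE)

# strip.white = TRUE trims the spaces around each text field
trimmed <- read.delim(text = raw, header = FALSE, sep = ",",
                      stringsAsFactors = FALSE, strip.white = TRUE)

print(kept$V2)     # text values keep their leading space
print(trimmed$V2)  # leading space removed
str(kept)          # numeric columns are parsed as numbers either way
```

The tutorial keeps the default behaviour, which is why pattern matching rather than exact comparison is used on the text columns further down.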
Notice that although the dataset has been read in correctly, it does not have any headings that tell us what each column represents. We can see the corresponding headings from our original data file. Let’s add those headings in through the names() function.
names(adults) <- c("age","workclass","fnlwgt","education","education-num",
"marital-status","occupation","relationship","race","sex",
"capital-gain","capital-loss","hours-per-week",
"native-country", "salary")
adults[20:30,]
## age workclass fnlwgt education education-num
## 20 43 Self-emp-not-inc 292175 Masters 14
## 21 40 Private 193524 Doctorate 16
## 22 54 Private 302146 HS-grad 9
## 23 35 Federal-gov 76845 9th 5
## 24 43 Private 117037 11th 7
## 25 59 Private 109015 HS-grad 9
## 26 56 Local-gov 216851 Bachelors 13
## 27 19 Private 168294 HS-grad 9
## 28 54 ? 180211 Some-college 10
## 29 39 Private 367260 HS-grad 9
## 30 49 Private 193366 HS-grad 9
## marital-status occupation relationship
## 20 Divorced Exec-managerial Unmarried
## 21 Married-civ-spouse Prof-specialty Husband
## 22 Separated Other-service Unmarried
## 23 Married-civ-spouse Farming-fishing Husband
## 24 Married-civ-spouse Transport-moving Husband
## 25 Divorced Tech-support Unmarried
## 26 Married-civ-spouse Tech-support Husband
## 27 Never-married Craft-repair Own-child
## 28 Married-civ-spouse ? Husband
## 29 Divorced Exec-managerial Not-in-family
## 30 Married-civ-spouse Craft-repair Husband
## race sex capital-gain capital-loss hours-per-week
## 20 White Female 0 0 45
## 21 White Male 0 0 60
## 22 Black Female 0 0 20
## 23 Black Male 0 0 40
## 24 White Male 0 2042 40
## 25 White Female 0 0 40
## 26 White Male 0 0 40
## 27 White Male 0 0 40
## 28 Asian-Pac-Islander Male 0 0 60
## 29 White Male 0 0 80
## 30 White Male 0 0 40
## native-country salary
## 20 United-States >50K
## 21 United-States >50K
## 22 United-States <=50K
## 23 United-States <=50K
## 24 United-States <=50K
## 25 United-States <=50K
## 26 United-States >50K
## 27 United-States <=50K
## 28 South >50K
## 29 United-States <=50K
## 30 United-States <=50K
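Because several of the new names contain a dash, they are not ordinary R names and need backticks or double brackets when we refer to them. A minimal sketch on a made-up two-column data frame (the name toy and its values are only for illustration):

```r
# A tiny made-up data frame with default-style column names
toy <- data.frame(V1 = c(39, 50), V2 = c(13, 13))

# Assign new names in column order, as done for adults above
names(toy) <- c("age", "education-num")
print(names(toy))

# Names containing a dash must be wrapped in backticks with $ ...
print(toy$`education-num`)

# ... or passed as a string with [[ ]]
print(toy[["education-num"]])
```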
It is always good practice to evaluate our data and check for missing values. We notice that there are some missing values in our dataset, particularly in the workclass, occupation, and native-country variables, that are represented with “?”. For example, in the variable workclass, the value in row 28 is represented with “?”.
In R, most methods for handling missing values only recognize them if they are represented with NA. To format the missing values correctly, we can use the stringr library, along with a simple regular expression, to replace the question marks with NA.
After we load the stringr library, we can use str_detect() to find the question marks in our data. str_detect() takes in a vector of strings and a pattern, and returns TRUE or FALSE for each element depending on whether it contains a match for the pattern. We notice that only the variables workclass, occupation, and native-country have question marks. In order to search for a pattern, we must write a regular expression, which is a sequence of characters that defines a search pattern. Since a question mark is a special character in regular expressions, we must put two backslashes “\\” in front of it to tell R that what we actually want to match is the question mark itself. We specify the variable we want to look in, like adults$workclass, and put str_detect() inside square brackets to select the rows where it returns TRUE, then replace these selected values with NA.
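To see what str_detect() returns on its own, here is a minimal sketch on a made-up character vector (the values are invented, written with the same leading space as the real data):

```r
library(stringr)

# Made-up values; " ?" stands in for a missing entry
workclass <- c(" Private", " ?", " State-gov")

# One TRUE/FALSE per element: does it contain a literal question mark?
is_missing <- str_detect(workclass, "\\?")
print(is_missing)

# Use the logical vector as an index to overwrite only the matches
workclass[is_missing] <- NA
print(workclass)
print(sum(is.na(workclass)))  # number of values now treated as missing
```

Once the values are NA, functions such as is.na() can find them directly.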
library(stringr)
adults$workclass[str_detect(adults$workclass, "\\?")] <- NA
adults$occupation[str_detect(adults$occupation, "\\?")] <- NA
adults$`native-country`[str_detect(adults$`native-country`, "\\?")] <- NA
head(adults, 20)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## 11 37 Private 280464 Some-college 10
## 12 30 State-gov 141297 Bachelors 13
## 13 23 Private 122272 Bachelors 13
## 14 32 Private 205019 Assoc-acdm 12
## 15 40 Private 121772 Assoc-voc 11
## 16 34 Private 245487 7th-8th 4
## 17 25 Self-emp-not-inc 176756 HS-grad 9
## 18 32 Private 186824 HS-grad 9
## 19 38 Private 28887 11th 7
## 20 43 Self-emp-not-inc 292175 Masters 14
## marital-status occupation relationship
## 1 Never-married Adm-clerical Not-in-family
## 2 Married-civ-spouse Exec-managerial Husband
## 3 Divorced Handlers-cleaners Not-in-family
## 4 Married-civ-spouse Handlers-cleaners Husband
## 5 Married-civ-spouse Prof-specialty Wife
## 6 Married-civ-spouse Exec-managerial Wife
## 7 Married-spouse-absent Other-service Not-in-family
## 8 Married-civ-spouse Exec-managerial Husband
## 9 Never-married Prof-specialty Not-in-family
## 10 Married-civ-spouse Exec-managerial Husband
## 11 Married-civ-spouse Exec-managerial Husband
## 12 Married-civ-spouse Prof-specialty Husband
## 13 Never-married Adm-clerical Own-child
## 14 Never-married Sales Not-in-family
## 15 Married-civ-spouse Craft-repair Husband
## 16 Married-civ-spouse Transport-moving Husband
## 17 Never-married Farming-fishing Own-child
## 18 Never-married Machine-op-inspct Unmarried
## 19 Married-civ-spouse Sales Husband
## 20 Divorced Exec-managerial Unmarried
## race sex capital-gain capital-loss hours-per-week
## 1 White Male 2174 0 40
## 2 White Male 0 0 13
## 3 White Male 0 0 40
## 4 Black Male 0 0 40
## 5 Black Female 0 0 40
## 6 White Female 0 0 40
## 7 Black Female 0 0 16
## 8 White Male 0 0 45
## 9 White Female 14084 0 50
## 10 White Male 5178 0 40
## 11 Black Male 0 0 80
## 12 Asian-Pac-Islander Male 0 0 40
## 13 White Female 0 0 30
## 14 Black Male 0 0 50
## 15 Asian-Pac-Islander Male 0 0 40
## 16 Amer-Indian-Eskimo Male 0 0 45
## 17 White Male 0 0 35
## 18 White Male 0 0 40
## 19 White Male 0 0 50
## 20 White Female 0 0 45
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
## 11 United-States >50K
## 12 India >50K
## 13 United-States <=50K
## 14 United-States <=50K
## 15 <NA> >50K
## 16 Mexico <=50K
## 17 United-States <=50K
## 18 United-States <=50K
## 19 United-States <=50K
## 20 United-States >50K
Let’s take a look at our correctly formatted data. Immediately we notice there are some variables that are not particularly interesting, and some that seem interesting and worth exploring. We will use the dplyr package that is part of tidyverse to manipulate the data we have into something we can create interesting visualizations with.
There are many functions in dplyr - we will cover some of the most useful and commonly used functions:
The select() function is used to keep only a few variables of interest to the current analysis. It is most useful when working with dataframes involving a large number of variables. Let’s extract the columns that we are interested in examining into a new dataframe, so that the original dataframe keeps every column in case we are interested in them later on.
The most common way to use select() is to write down all the variable names we wish to keep in the new dataset. In this case, since some of our variable names contain dashes, we need to put those names in quotes (double or single) or backticks.
adults_select <- select(adults, age, workclass, fnlwgt, education, "education-num", "marital-status", occupation, race, sex, "hours-per-week", "native-country", salary)
head(adults_select, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## marital-status occupation race sex hours-per-week
## 1 Never-married Adm-clerical White Male 40
## 2 Married-civ-spouse Exec-managerial White Male 13
## 3 Divorced Handlers-cleaners White Male 40
## 4 Married-civ-spouse Handlers-cleaners Black Male 40
## 5 Married-civ-spouse Prof-specialty Black Female 40
## 6 Married-civ-spouse Exec-managerial White Female 40
## 7 Married-spouse-absent Other-service Black Female 16
## 8 Married-civ-spouse Exec-managerial White Male 45
## 9 Never-married Prof-specialty White Female 50
## 10 Married-civ-spouse Exec-managerial White Male 40
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
If there are more variables we want to keep than drop, it might be more efficient to use a second method. In this method, we use select() as we normally would, but put a - in front of the variables we want to drop from the original dataset, so that we have fewer variable names to type. This returns the same result as the previous method.
adults_negselect <- select(adults, -relationship, -"capital-gain", -"capital-loss")
head(adults_negselect, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## marital-status occupation race sex hours-per-week
## 1 Never-married Adm-clerical White Male 40
## 2 Married-civ-spouse Exec-managerial White Male 13
## 3 Divorced Handlers-cleaners White Male 40
## 4 Married-civ-spouse Handlers-cleaners Black Male 40
## 5 Married-civ-spouse Prof-specialty Black Female 40
## 6 Married-civ-spouse Exec-managerial White Female 40
## 7 Married-spouse-absent Other-service Black Female 16
## 8 Married-civ-spouse Exec-managerial White Male 45
## 9 Never-married Prof-specialty White Female 50
## 10 Married-civ-spouse Exec-managerial White Male 40
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
There are also other ways to use select() to achieve the desired outcome more efficiently. dplyr provides selection helper functions that are designed for use inside select(). Some of these functions are starts_with(), ends_with(), and contains(). For example, say in our adults dataset we wanted to only examine the columns with education-related data. We can use starts_with("education") inside select(), which selects the columns whose names begin with the given string, “education”.
adults_educ <- select(adults, starts_with("education"))
head(adults_educ, 10)
## education education-num
## 1 Bachelors 13
## 2 Bachelors 13
## 3 HS-grad 9
## 4 11th 7
## 5 Bachelors 13
## 6 Masters 14
## 7 9th 5
## 8 HS-grad 9
## 9 Masters 14
## 10 Bachelors 13
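The three styles of select() shown above can be compared side by side. A minimal sketch on a made-up data frame (toy and its values are only for illustration), printing just the column names each call keeps:

```r
library(dplyr)

# Made-up data; check.names = FALSE keeps the dashes in the column names
toy <- data.frame(
  age = c(39, 50),
  education = c("Bachelors", "HS-grad"),
  `education-num` = c(13, 9),
  `capital-gain` = c(2174, 0),
  check.names = FALSE
)

# Keep columns by listing them
keep_named <- names(select(toy, age, `education-num`))

# Drop a column with a leading minus
keep_rest <- names(select(toy, -`capital-gain`))

# Helpers match on the column names themselves
keep_prefix <- names(select(toy, starts_with("education")))
keep_suffix <- names(select(toy, ends_with("gain")))
keep_dash <- names(select(toy, contains("-")))

print(keep_named)
print(keep_rest)
print(keep_prefix)
print(keep_suffix)
print(keep_dash)
```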
Now that we have a dataframe with only the columns we are interested in, let’s explore some variables that might be interesting to take a closer look at. We are interested in seeing the age range of adults in this dataset and focusing on a smaller group.
The arrange() function can order the rows of a data frame using a variable name (or a more complicated expression). If we provide multiple expressions to order by, it uses the second one to break ties in the first one, the third one to break ties in the second one, and so on. The default setting for arrangement is from low to high values.
adults_arrange <- arrange(adults_select, age)
head(adults_arrange, 20)
## age workclass fnlwgt education education-num marital-status
## 1 17 <NA> 304873 10th 6 Never-married
## 2 17 Private 65368 11th 7 Never-married
## 3 17 Private 245918 11th 7 Never-married
## 4 17 Private 191260 9th 5 Never-married
## 5 17 Private 270942 5th-6th 3 Never-married
## 6 17 Private 89821 11th 7 Never-married
## 7 17 Private 175024 11th 7 Never-married
## 8 17 <NA> 202521 11th 7 Never-married
## 9 17 <NA> 258872 11th 7 Never-married
## 10 17 Private 211870 9th 5 Never-married
## 11 17 Private 242718 11th 7 Never-married
## 12 17 Private 169658 10th 6 Never-married
## 13 17 <NA> 80077 11th 7 Never-married
## 14 17 Self-emp-not-inc 368700 11th 7 Never-married
## 15 17 Private 102726 12th 8 Never-married
## 16 17 Private 316929 12th 8 Never-married
## 17 17 Private 193830 11th 7 Never-married
## 18 17 Private 32607 10th 6 Never-married
## 19 17 Private 198124 11th 7 Never-married
## 20 17 Private 368700 11th 7 Never-married
## occupation race sex hours-per-week native-country salary
## 1 <NA> White Female 32 United-States <=50K
## 2 Sales White Female 12 United-States <=50K
## 3 Other-service White Male 12 United-States <=50K
## 4 Other-service White Male 24 United-States <=50K
## 5 Other-service White Male 48 Mexico <=50K
## 6 Other-service White Male 10 United-States <=50K
## 7 Handlers-cleaners White Male 18 United-States <=50K
## 8 <NA> White Male 40 United-States <=50K
## 9 <NA> White Female 5 United-States <=50K
## 10 Other-service White Male 6 United-States <=50K
## 11 Sales White Male 12 United-States <=50K
## 12 Other-service White Female 21 United-States <=50K
## 13 <NA> White Female 20 United-States <=50K
## 14 Farming-fishing White Male 10 United-States <=50K
## 15 Other-service White Male 16 United-States <=50K
## 16 Handlers-cleaners White Male 20 United-States <=50K
## 17 Sales White Female 20 United-States <=50K
## 18 Farming-fishing White Male 20 United-States <=50K
## 19 Sales White Male 20 United-States <=50K
## 20 Sales White Male 28 United-States <=50K
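The tie-breaking rule and the reverse ordering can be seen on a small example. A minimal sketch on made-up values, using desc() from dplyr to sort from high to low:

```r
library(dplyr)

# Made-up rows with repeated ages so that ties occur
toy <- data.frame(age = c(30, 17, 30, 17), hours = c(40, 12, 20, 32))

# Sort by age, then use hours to break ties within the same age
sorted <- arrange(toy, age, hours)
print(sorted)

# Wrap a variable in desc() to sort from high to low instead
sorted_desc <- arrange(toy, desc(age))
print(sorted_desc)
```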
Let’s filter the data further. We want to take a closer look at adults aged 30 or older who are from the United States and who work for a private company. The filter() function keeps the rows in the dataset that match the conditions we pass as arguments.
Notice that the native-country variable in the dataset has text data rather than numerical data. One convenient way to filter on text data is the str_detect() function from the stringr library. Used inside filter(), str_detect() returns TRUE for every row whose text contains the given pattern, so those rows are kept just as they would be with a numerical comparison. A pattern match is also forgiving here: the text values in this file carry a leading space, so an exact comparison with 'United-States' would not match.
I also wrap age in as.integer() before comparing it. The ages already print as whole numbers in the output above, so the conversion is harmless here; be careful if a column really is stored as a factor, because as.integer() on a factor returns the internal level codes rather than the printed values.
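A caution that applies whenever a numeric-looking column really is stored as a factor: converting it directly gives the internal level codes, not the numbers. A minimal base R sketch on made-up ages:

```r
# A made-up factor whose labels look like numbers
age_factor <- factor(c("39", "50", "17"))

# Direct conversion returns the position of each label among the sorted levels
codes <- as.integer(age_factor)
print(codes)

# Converting to character first recovers the actual values
ages <- as.integer(as.character(age_factor))
print(ages)
```

Checking str(adults) or class(adults$age) before filtering shows which case applies.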
One useful tool in dplyr is the pipe operator %>%. The pipe operator is written at the end of a line of code. It takes the output of one function and feeds it to the first argument of the next function. I use the operator here so that I do not have to retype my dataset argument. The pipe operator is very useful when we want to use multiple functions on the same dataset, so that we do not have to save our dataset in a new variable for each function used.
library(stringr)
adults_filter <- adults_select %>%
filter(str_detect(`native-country`, 'United-States') &
as.integer(age) >= 30 & str_detect(workclass, 'Private'))
head(adults_filter, 10)
## age workclass fnlwgt education education-num marital-status
## 1 38 Private 215646 HS-grad 9 Divorced
## 2 53 Private 234721 11th 7 Married-civ-spouse
## 3 37 Private 284582 Masters 14 Married-civ-spouse
## 4 31 Private 45781 Masters 14 Never-married
## 5 42 Private 159449 Bachelors 13 Married-civ-spouse
## 6 37 Private 280464 Some-college 10 Married-civ-spouse
## 7 32 Private 205019 Assoc-acdm 12 Never-married
## 8 32 Private 186824 HS-grad 9 Never-married
## 9 38 Private 28887 11th 7 Married-civ-spouse
## 10 40 Private 193524 Doctorate 16 Married-civ-spouse
## occupation race sex hours-per-week native-country salary
## 1 Handlers-cleaners White Male 40 United-States <=50K
## 2 Handlers-cleaners Black Male 40 United-States <=50K
## 3 Exec-managerial White Female 40 United-States <=50K
## 4 Prof-specialty White Female 50 United-States >50K
## 5 Exec-managerial White Male 40 United-States >50K
## 6 Exec-managerial Black Male 80 United-States >50K
## 7 Sales Black Male 50 United-States <=50K
## 8 Machine-op-inspct White Male 40 United-States <=50K
## 9 Sales White Male 50 United-States <=50K
## 10 Prof-specialty White Male 60 United-States >50K
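To make the pipe concrete, here is a minimal sketch on a made-up data frame that writes the same filter twice, once as nested calls and once as a pipeline. It also shows that conditions separated by commas inside filter() are combined the same way as with &.

```r
library(dplyr)
library(stringr)

# Made-up rows, with the same leading space as the real text values
toy <- data.frame(
  age = c(28, 38, 53),
  workclass = c(" Private", " Private", " State-gov"),
  stringsAsFactors = FALSE
)

# Nested version: read from the inside out
nested <- head(filter(toy, age >= 30 & str_detect(workclass, "Private")), 1)

# Piped version: read from top to bottom
piped <- toy %>%
  filter(age >= 30, str_detect(workclass, "Private")) %>%
  head(1)

print(piped)
print(isTRUE(all.equal(nested, piped)))  # both versions give the same rows
```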
There are also other ways to use filter() to achieve the desired outcome more efficiently. Several helper functions are especially handy inside filter(), although they can also be used on their own. Some of these functions are is.na(), between(), and near(). For example, say in our adults_select dataset we wanted to only examine the rows with adults between ages 30 and 50. We can use between(age, 30, 50) inside filter(), which keeps the rows whose age falls in that range, including both endpoints. As before, age is wrapped in as.integer() before the comparison.
adults_filter_between <- adults_select %>%
filter(between(as.integer(age), 30, 50))
head(adults_filter_between, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 37 Private 284582 Masters 14
## 5 49 Private 160187 9th 5
## 6 31 Private 45781 Masters 14
## 7 42 Private 159449 Bachelors 13
## 8 37 Private 280464 Some-college 10
## 9 30 State-gov 141297 Bachelors 13
## 10 32 Private 205019 Assoc-acdm 12
## marital-status occupation race sex
## 1 Never-married Adm-clerical White Male
## 2 Married-civ-spouse Exec-managerial White Male
## 3 Divorced Handlers-cleaners White Male
## 4 Married-civ-spouse Exec-managerial White Female
## 5 Married-spouse-absent Other-service Black Female
## 6 Never-married Prof-specialty White Female
## 7 Married-civ-spouse Exec-managerial White Male
## 8 Married-civ-spouse Exec-managerial Black Male
## 9 Married-civ-spouse Prof-specialty Asian-Pac-Islander Male
## 10 Never-married Sales Black Male
## hours-per-week native-country salary
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 16 Jamaica <=50K
## 6 50 United-States >50K
## 7 40 United-States >50K
## 8 80 United-States >50K
## 9 40 India >50K
## 10 50 United-States <=50K
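A minimal sketch of the three helpers on made-up values, first on plain vectors and then inside filter(). The data frame toy is invented for illustration.

```r
library(dplyr)

# between() includes both endpoints
in_range <- between(c(29, 30, 50, 51), 30, 50)
print(in_range)

# near() compares numbers with a small tolerance, which == does not
exact <- sqrt(2)^2 == 2
close <- near(sqrt(2)^2, 2)
print(c(exact = exact, close = close))

# is.na() finds missing values; negate it to keep complete rows
toy <- data.frame(
  age = c(25, 35, 45),
  workclass = c("Private", NA, "Private"),
  stringsAsFactors = FALSE
)
kept <- filter(toy, !is.na(workclass), between(age, 30, 50))
print(kept)
```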
The mutate() function can help us add additional variables to our dataset.
Suppose we want to include the fnlwgt variable in our visualizations, but the values in the variable are too large and we want to scale it down. We can use the mutate() function to create a new column in the dataset, name the new column, and set the values in this column equal to what we want them to be. In this case, we divide the values by 10,000 so that they contain the same information, just on a smaller scale.
To make sure the division is done on decimal numbers, I convert fnlwgt with as.double() first.
adults_mutate <- mutate(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000)
head(adults_mutate)
## age workclass fnlwgt education education-num marital-status
## 1 38 Private 215646 HS-grad 9 Divorced
## 2 53 Private 234721 11th 7 Married-civ-spouse
## 3 37 Private 284582 Masters 14 Married-civ-spouse
## 4 31 Private 45781 Masters 14 Never-married
## 5 42 Private 159449 Bachelors 13 Married-civ-spouse
## 6 37 Private 280464 Some-college 10 Married-civ-spouse
## occupation race sex hours-per-week native-country salary
## 1 Handlers-cleaners White Male 40 United-States <=50K
## 2 Handlers-cleaners Black Male 40 United-States <=50K
## 3 Exec-managerial White Female 40 United-States <=50K
## 4 Prof-specialty White Female 50 United-States >50K
## 5 Exec-managerial White Male 40 United-States >50K
## 6 Exec-managerial Black Male 80 United-States >50K
## scaled_fnlwgt
## 1 21.5646
## 2 23.4721
## 3 28.4582
## 4 4.5781
## 5 15.9449
## 6 28.0464
transmute() is a variant of the mutate() function. transmute() acts the same way, except the new dataset will only contain the newly created variables, and not the other untouched ones.
head(transmute(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000), 10)
## scaled_fnlwgt
## 1 21.5646
## 2 23.4721
## 3 28.4582
## 4 4.5781
## 5 15.9449
## 6 28.0464
## 7 20.5019
## 8 18.6824
## 9 2.8887
## 10 19.3524
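The difference between the two functions is easiest to see from the column names they return. A minimal sketch on a made-up data frame that reuses two fnlwgt values from the output above:

```r
library(dplyr)

# Made-up data frame with one extra column to show what gets kept
toy <- data.frame(age = c(38, 31), fnlwgt = c(215646, 45781))

# mutate() keeps every existing column and appends the new one
with_mutate <- mutate(toy, scaled_fnlwgt = fnlwgt / 10000)
print(names(with_mutate))

# transmute() returns only the columns created in the call
with_transmute <- transmute(toy, scaled_fnlwgt = fnlwgt / 10000)
print(names(with_transmute))
print(with_transmute$scaled_fnlwgt)
```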
The summarize() function can be used to summarize entire data frames by collapsing many values into single-number summaries.
Suppose we want to look at the average hours per week white males work, and compare them with the average hours per week white females work. We can use the summarize() function along with the filter() function and the pipe %>% to keep only the adults whose race is white and whose sex is male or female, and then average their hours.
adults_mutate %>%
filter(str_detect(race, 'White') & str_detect(sex, 'Male')) %>%
summarize(white_males = mean(as.double(`hours-per-week`)))
## white_males
## 1 44.09296
adults_mutate %>%
filter(str_detect(race, 'White') & str_detect(sex, 'Female')) %>%
summarize(white_females = mean(as.double(`hours-per-week`)))
## white_females
## 1 38.44055
We see that, in this dataset, white males work more hours per week on average than white females. We are interested in visualizing how characteristics are distributed in these two groups to potentially find out why white males work more hours on average than white females. Let’s keep this result in mind; we will come back to it when we move on to visualization.
There is a more efficient way to achieve what we used summarize() above for. group_by() splits the data into groups based on the categories of one or more variables, so that summarize() produces one row of summary statistics per group, saving us the extra step of filtering out each group ourselves. After grouping and summarizing, it is easy to see that for the adults in the dataset, Asian Pacific Islander American males work the most hours per week on average, at 44.2 hours per week. The differences between groups are modest, however: every group averages between roughly 38 and 44 hours per week.
adults_mutate %>%
group_by(race, sex) %>%
summarize(avg_hours = mean(as.double(`hours-per-week`)))
## # A tibble: 10 x 3
## # Groups: race [?]
## race sex avg_hours
## <fct> <fct> <dbl>
## 1 " Amer-Indian-Eskimo" " Female" 38.6
## 2 " Amer-Indian-Eskimo" " Male" 42.9
## 3 " Asian-Pac-Islander" " Female" 39.9
## 4 " Asian-Pac-Islander" " Male" 44.2
## 5 " Black" " Female" 37.7
## 6 " Black" " Male" 41.6
## 7 " Other" " Female" 39
## 8 " Other" " Male" 40.6
## 9 " White" " Female" 38.4
## 10 " White" " Male" 44.1
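The same pattern can compute several summaries at once. A minimal sketch on made-up data (toy and its values are invented) that returns the mean and the group size together, using n() to count rows:

```r
library(dplyr)

# Made-up hours for two groups
toy <- data.frame(
  sex = c("Male", "Male", "Female", "Female"),
  hours = c(40, 50, 30, 40),
  stringsAsFactors = FALSE
)

# One output row per group, with one column per summary
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(avg_hours = mean(hours), count = n())

print(by_sex)
```

Reporting the group size next to each average makes it easier to judge how much weight an average deserves.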
The rename() function is pretty self-explanatory: it allows us to rename variables in the dataset. Let’s try renaming some of our variables.
For the sake of consistency with variable naming, let’s rename our variable scaled_fnlwgt to scaled-fnlwgt. After the dataset argument, which always goes first in the function, the second argument should be the desired new variable name, followed by = and the original variable name. Since the new name has a dash, either double quotes or single quotes should go around the name.
In order to make the change permanent, remember to assign the result back to a variable.
We can rename multiple variables by adding one new name = old name pair per variable as the third, fourth arguments and so on.
adults_mutate <- rename(adults_mutate, "scaled-fnlwgt" = scaled_fnlwgt)
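A minimal sketch of renaming two variables in one call, on a made-up one-row data frame. The column hours_per_week is hypothetical and only there to give a second name to change.

```r
library(dplyr)

# Made-up data frame with underscore-style names
toy <- data.frame(scaled_fnlwgt = 21.5646, hours_per_week = 40)

# Each pair is new name = old name; quotes are needed because of the dash
renamed <- rename(toy,
                  "scaled-fnlwgt" = scaled_fnlwgt,
                  "hours-per-week" = hours_per_week)

print(names(renamed))
print(names(toy))  # the original is unchanged until we assign the result back
```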
Now that we have learned about some tools to clean and manipulate data, let’s make some interesting visualizations with R’s ggplot2 to explore patterns in our dataset! There are many different types of plots and variations of these plots we can create in ggplot2. In this section, we will mainly explore the following plots:
Visualizations in ggplot2 start with a base plot, onto which we can add different layers to create more complex visualizations. geom_point() allows us to create scatterplots with our data.
We are interested in seeing the relationship between the variables age and scaled-fnlwgt. We can visualize this relationship by creating a scatterplot with age as the x variable, and scaled-fnlwgt as the y variable.
We can also visualize a third variable by mapping the color of the points to it. We are interested in seeing how sex is distributed in relation to age and the scaled fnlwgt. Note that fnlwgt stands for final weight, a sampling weight assigned by the census, rather than a financial measure.
ggplot(adults_mutate) +
geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`))
ggplot(adults_mutate) +
geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`, color = sex))
We can see in the two scatterplots we have created that there is a slight negative correlation between the scaled-fnlwgt and age variables. For the adults included in the dataset who are 30 or older, the scaled final weight (fnlwgt) seems to peak around 35-45 years, and adults older than this range generally do not have as high a weight. There is also no particularly notable trend in the sex distribution in relation to age and scaled-fnlwgt.
In ggplot2, we can add many different geoms to a base plot. For example, if we wanted to add a smoothed trend line to the scatterplot we had above, we can add a geom_smooth() layer on top, which adds the line to the graph. When we have two or more geoms in one plot, we can avoid repetition and stay consistent by specifying the aesthetic mappings only once, in the ggplot() line of code; every geom then inherits them.
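A minimal sketch of shared mappings on made-up data with an obvious linear trend. The data frame toy is invented, and method = "lm" is an assumption made here to request a straight regression line instead of the default smoother; the tutorial's own plots use the default.

```r
library(ggplot2)

# Made-up data: y rises with x, plus a little random noise
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30)

# The mapping is given once in ggplot(); both layers inherit it
p <- ggplot(toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p
```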
ggplot(adults_mutate, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
geom_point() +
geom_smooth()
We can see that the data we used for this graph was not very suitable for a trend line added by geom_smooth(), because the line does not show us any useful trends. It seems that there are too many observations in the adults_mutate dataset and they overwhelm the plot. Let’s filter the data down into smaller subsets using dplyr.
To examine a smaller subset of the data, let’s only look at female Asian Pacific Islander American adults, because there are significantly fewer female than male adults, and far fewer Asian Pacific Islander American adults than white or black adults in this dataset.
adults_asian_fem <- adults_mutate %>%
filter(str_detect(sex, 'Female') & str_detect(race, 'Asian-Pac-Islander'))
After we’ve filtered on the sex and race variables, we have a significantly smaller dataset to work with. Let’s try visualizing this dataset again using the same variables. By default, geom_smooth() displays a confidence band around the line. If we want to get rid of it, we can set se = FALSE inside the geom_smooth() line of code.
ggplot(adults_asian_fem, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
geom_point() +
geom_smooth()
ggplot(adults_asian_fem, mapping = aes(x = `age`, y = `scaled-fnlwgt`)) +
geom_point() +
geom_smooth(se = FALSE)
The plot shows no apparent trend between age and final weight (fnlwgt) for the Asian Pacific Islander American females in this dataset. Let’s explore some other variables and visualizations to find interesting trends!
Let’s create a bar chart with geom_bar() showing the count of adults whose salary is either <=50K or >50K, with sex as the fill color. The position = "dodge" argument places the bars side by side instead of stacking them, so that we can compare the sexes more easily. We can see in this bar chart that there are significantly more males than females who make >50K.
ggplot(adults_mutate) +
geom_bar(mapping = aes(x = salary, fill = sex), position = "dodge")
Let’s create a second bar chart with the scaled-fnlwgt of each race, also with sex as the fill color. The stat = "identity" argument is necessary because here we are replacing the default for geom_bar(), which is stat = "count", with the value of a variable in our dataset on the y-axis. In the second chart, the bars for males are taller than the bars for females in all races. This difference is particularly pronounced for black and white adults. Two cautions apply when reading it: fnlwgt is the dataset’s final weight (a census sampling weight), so taller bars do not mean higher earnings; and because each race and sex combination contains many rows, the dodged bars for those rows are drawn on top of one another, so the visible height reflects the largest value in the group rather than a total or average.
ggplot(adults_mutate) +
geom_bar(mapping = aes(x = race, y = `scaled-fnlwgt`, fill = sex), position = "dodge", stat = "identity")
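When the goal is to compare a typical value per group, one option is to summarize first and plot one bar per group. The following is a minimal sketch on made-up data, not the tutorial's method: the names toy, value, and avg_value are invented, and geom_col() is used as the ggplot2 shorthand for a bar layer that plots the values as given.

```r
library(dplyr)
library(ggplot2)

# Made-up data with several rows per race and sex combination
toy <- data.frame(
  race = c("A", "A", "A", "B", "B", "B"),
  sex = c("F", "F", "M", "F", "M", "M"),
  value = c(1, 3, 4, 2, 6, 8),
  stringsAsFactors = FALSE
)

# Collapse to one mean per combination before plotting
avg <- toy %>%
  group_by(race, sex) %>%
  summarize(avg_value = mean(value))
print(avg)

# One bar per combination, placed side by side within each race
p <- ggplot(avg) +
  geom_col(mapping = aes(x = race, y = avg_value, fill = sex),
           position = "dodge")

p
```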
We want to see if there are any characteristics in the dataset that might be causing these trends. Let’s go back to the dataset to examine the data in these variables with some of the dplyr tools we learned earlier. We can use group_by() and summarize() to count the number of adults of each race and each sex in this dataset.
adults_mutate %>%
group_by(race) %>%
summarize(count = n())
## # A tibble: 5 x 2
## race count
## <fct> <int>
## 1 " Amer-Indian-Eskimo" 103
## 2 " Asian-Pac-Islander" 113
## 3 " Black" 1296
## 4 " Other" 45
## 5 " White" 11762
adults_mutate %>%
group_by(sex) %>%
summarize(count = n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 " Female" 4101
## 2 " Male" 9218
Comparing the counts, we notice that there are about nine times as many white adults as black adults in the dataset, and a little over twice as many male adults as female adults. The unequal number of adults represented in each category may potentially explain the difference in salary counts for males vs. females.
Facets provide another way to add a third variable to a scatterplot. We should facet on a discrete variable rather than a continuous one. To use faceting, add facet_wrap(~variable) at the end of a normal geom_point() plot. The function splits the data by the third variable and draws a separate panel for each category, with the panels displayed next to each other, so it is easy to see any major differences in trend between categories.
We filter the dataset to only contain females, since there are too many observations in the full dataset. We can then plot this data with geom_point() and facet on the third variable, race, with facet_wrap(). We do not see any particular trends in the relationship between age and final weight (fnlwgt) across different races in females, but notice that there is an uneven representation in this dataset, with significantly more white and black females than females of any other race.
adults_mutate %>%
filter(str_detect(sex, 'Female')) %>%
ggplot() +
geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`)) +
facet_wrap(~race)
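A minimal sketch of faceting on made-up data with two groups that trend in opposite directions. The ncol argument of facet_wrap() is used here only to control how the panels are laid out; the data frame toy is invented.

```r
library(ggplot2)

# Made-up data: group "a" rises with x, group "b" falls
toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 6, 4, 2),
  group = rep(c("a", "b"), each = 3)
)

# One panel per value of group, stacked in a single column
p <- ggplot(toy) +
  geom_point(mapping = aes(x = x, y = y)) +
  facet_wrap(~group, ncol = 1)

p
```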
***
We can also create interesting visualizations in polar coordinates rather than on the x-y plane with the ggplot2 functions.
We are interested in visualizing the distribution of marital status of adults in this dataset. In order to create a Coxcomb chart for this visualization, we must first create a bar chart using geom_bar(). When creating this base bar chart, it is important to set the x variable as well as the color fill to the same variable that we want to visualize. We set the width argument to 1 to make sure the bars touch each other.
bar <- ggplot(adults_mutate) +
geom_bar(mapping = aes(x = `marital-status`, fill = `marital-status`), width = 1)
bar
After we create the base bar chart, we can then create the Coxcomb chart by using the coord_polar() function, which transforms x-y plane charts into polar coordinate charts. We set the x label to NULL to remove the x-axis label “marital-status”. Now we have a Coxcomb chart that represents the frequency of each marital status category for adults collected in our dataset. We can see that most adults in the dataset are married to a civilian spouse.
bar +
labs(x = NULL) +
coord_polar()
We can also visualize the frequency of marital status of adults in this dataset with a pie chart. For a pie chart, we must also create a base bar chart using geom_bar(). However, this time we need to create a stacked bar chart. We set the x variable to factor(1) to make sure all the bars are stacked on top of each other. We set the color fill to marital-status to represent each category. We set the width to 1 to make sure the different bars touch.
bar <- ggplot(adults_mutate) +
geom_bar(mapping = aes(x = factor(1), fill = `marital-status`), width = 1)
bar
For the pie chart, we also use coord_polar() to transform the stacked bar chart to polar coordinates and remove the x-axis label. We map the y-axis of the bar chart to the angle theta with theta = "y". We can see the distribution of each category more clearly in the pie chart, including some categories that might have been too small to see in the Coxcomb chart.
bar +
labs(x = NULL) +
coord_polar(theta = "y")
We can also visualize marital-status in another kind of polar coordinate chart. The bullseye chart can be created with a procedure very similar to the one for the pie chart. The base bar chart is the same as the stacked bar chart we created above. When mapping to polar coordinates, if we remove theta = "y", which mapped the y-axis of the bar chart to the angle theta, we will get the bullseye chart by default.
bar +
labs(x = NULL) +
coord_polar()
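The pie and bullseye charts differ only in which axis of the stacked bar becomes the angle. A minimal sketch on made-up marital status values (the data frame toy is invented) that builds both from the same base chart:

```r
library(ggplot2)

# Made-up categories with different frequencies
toy <- data.frame(
  status = c("Married", "Married", "Married",
             "Divorced", "Never-married", "Never-married")
)

# A single stacked bar: every row shares the same x position
stacked <- ggplot(toy) +
  geom_bar(mapping = aes(x = factor(1), fill = status), width = 1)

# Pie chart: the stacked counts (y) become the angle
pie <- stacked + labs(x = NULL) + coord_polar(theta = "y")

# Bullseye chart: the default maps x to the angle, so counts become the radius
bullseye <- stacked + labs(x = NULL) + coord_polar()

pie
bullseye
```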
***
I hope this tutorial was helpful to get you started on your journey with R. You can check out more tutorials on useful data manipulation and visualization tools at the University of Michigan Clark Labs Digital Projects Studio!