Install and load the tidyverse package we will use.
install.packages("tidyverse", repos = "https://cran.r-project.org")
library(tidyverse)
The adults dataset we are going to use in this tutorial originates from the UCI Machine Learning Repository.
There are many ways to read a dataset into R, depending on our needs and how the data file is formatted. This particular dataset was only available as an unparsed text (.txt) file without headings.
We use the following method to read in the adults data and store it in a variable.
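As one illustration of the other ways a file like this could be read, the sketch below parses two made-up sample lines with base R's read.csv(). The sample text and the object names sample_text and sample_rows are assumptions for illustration only; the options shown (strip.white and na.strings) are standard read.table() arguments and are not the method this tutorial uses.

```r
# Two sample lines in the same layout as the adult data: comma-separated,
# a space after each comma, no header row, and "?" for unknown values.
sample_text <- "39, State-gov, 77516, Bachelors\n54, ?, 180211, Some-college"

sample_rows <- read.csv(
  text = sample_text,
  header = FALSE,           # the data has no heading row
  strip.white = TRUE,       # drop the space that follows each comma
  na.strings = "?",         # read "?" as NA while parsing
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

# Inspect the column types and values that were parsed.
str(sample_rows)
```

Reading the placeholder as NA at this stage is optional; the tutorial instead replaces the question marks after loading.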
To read in and parse this dataset correctly, we use the read.delim() function. This function is useful because it lets us specify how the file should be parsed. Here, the sep argument tells it that the values in our dataset are separated by commas.
setwd("~/Documents/Projects & Work/Digital_Projects_Studio/R_tutorial/data")
adults <- read.delim("adult-data.txt", header = FALSE, sep = ",")
head(adults)
## V1 V2 V3 V4 V5 V6
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## V7 V8 V9 V10 V11 V12 V13
## 1 Adm-clerical Not-in-family White Male 2174 0 40
## 2 Exec-managerial Husband White Male 0 0 13
## 3 Handlers-cleaners Not-in-family White Male 0 0 40
## 4 Handlers-cleaners Husband Black Male 0 0 40
## 5 Prof-specialty Wife Black Female 0 0 40
## 6 Exec-managerial Wife White Female 0 0 40
## V14 V15
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
Notice that although the dataset has been read in correctly, it does not have any headings that tell us what each column represents. The corresponding headings can be found with our original data file. Let’s add those headings with the names() function.
names(adults) <- c("age","workclass","fnlwgt","education","education-num",
"marital-status","occupation","relationship","race","sex",
"capital-gain","capital-loss","hours-per-week",
"native-country", "salary")
adults[20:30,]
## age workclass fnlwgt education education-num
## 20 43 Self-emp-not-inc 292175 Masters 14
## 21 40 Private 193524 Doctorate 16
## 22 54 Private 302146 HS-grad 9
## 23 35 Federal-gov 76845 9th 5
## 24 43 Private 117037 11th 7
## 25 59 Private 109015 HS-grad 9
## 26 56 Local-gov 216851 Bachelors 13
## 27 19 Private 168294 HS-grad 9
## 28 54 ? 180211 Some-college 10
## 29 39 Private 367260 HS-grad 9
## 30 49 Private 193366 HS-grad 9
## marital-status occupation relationship
## 20 Divorced Exec-managerial Unmarried
## 21 Married-civ-spouse Prof-specialty Husband
## 22 Separated Other-service Unmarried
## 23 Married-civ-spouse Farming-fishing Husband
## 24 Married-civ-spouse Transport-moving Husband
## 25 Divorced Tech-support Unmarried
## 26 Married-civ-spouse Tech-support Husband
## 27 Never-married Craft-repair Own-child
## 28 Married-civ-spouse ? Husband
## 29 Divorced Exec-managerial Not-in-family
## 30 Married-civ-spouse Craft-repair Husband
## race sex capital-gain capital-loss hours-per-week
## 20 White Female 0 0 45
## 21 White Male 0 0 60
## 22 Black Female 0 0 20
## 23 Black Male 0 0 40
## 24 White Male 0 2042 40
## 25 White Female 0 0 40
## 26 White Male 0 0 40
## 27 White Male 0 0 40
## 28 Asian-Pac-Islander Male 0 0 60
## 29 White Male 0 0 80
## 30 White Male 0 0 40
## native-country salary
## 20 United-States >50K
## 21 United-States >50K
## 22 United-States <=50K
## 23 United-States <=50K
## 24 United-States <=50K
## 25 United-States <=50K
## 26 United-States >50K
## 27 United-States <=50K
## 28 South >50K
## 29 United-States <=50K
## 30 United-States <=50K
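Because several of the headings assigned above contain dashes, they are not ordinary R names and need backticks (or string indexing) when accessed. The sketch below shows this on a toy data frame; the object toy and its values are made up for illustration.

```r
# Toy data frame standing in for the unnamed columns V1, V2.
toy <- data.frame(V1 = c(39, 50), V2 = c(13, 13))

# Assign headings, one of which contains a dash.
names(toy) <- c("age", "education-num")

# "age" is an ordinary name, so $ works directly.
toy$age

# "education-num" contains a dash, so it needs backticks with $ ...
toy$`education-num`

# ... or it can be passed as a string inside [[ ]].
toy[["education-num"]]
```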
It is always good practice to evaluate our data and check for missing values. We notice that some missing values in our dataset, particularly in the workclass, occupation, and native-country variables, are represented with “?”. For example, the value of workclass in row 28 is “?”.
In R, most methods for handling missing values recognize them only when they are represented as NA. To format the missing values correctly, we can use the stringr library, along with a simple regular expression, to replace the question marks with NA.
After we load the stringr library, we can use str_detect() to find the question marks in our data. str_detect() takes a character vector and a pattern, and returns TRUE or FALSE for each element depending on whether it contains a match. Only the variables workclass, occupation, and native-country have question marks. To search for a pattern, we write a regular expression, which is a sequence of characters that defines a search pattern. Since a question mark is a special character in regular expressions, we must put two backslashes “\\” in front of it to tell R that we want to match the question mark itself. We specify the variable we want to look in, like adults$workclass, put str_detect() inside square brackets to select the matching elements, and then replace the selected values with NA.
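To make the behaviour of str_detect() concrete, here is a minimal sketch on a toy vector before the same pattern is applied to the real columns. The vector workclass and the helper name is_unknown are made up for illustration; fixed() is a stringr function that treats the pattern as literal text.

```r
library(stringr)

# Toy vector with the same "?" placeholder (and leading space) as the data.
workclass <- c(" Private", " ?", " State-gov", " ?")

# str_detect() returns one TRUE/FALSE per element.
is_unknown <- str_detect(workclass, "\\?")
is_unknown

# Use the logical vector to pick the matching elements and overwrite them.
workclass[is_unknown] <- NA
workclass

# fixed() is an alternative that matches the text literally,
# so the question mark does not need to be escaped.
str_detect(c("a?", "b"), fixed("?"))
```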
library(stringr)
adults$workclass[str_detect(adults$workclass, "\\?")] <- NA
adults$occupation[str_detect(adults$occupation, "\\?")] <- NA
adults$`native-country`[str_detect(adults$`native-country`, "\\?")] <- NA
head(adults, 20)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## 11 37 Private 280464 Some-college 10
## 12 30 State-gov 141297 Bachelors 13
## 13 23 Private 122272 Bachelors 13
## 14 32 Private 205019 Assoc-acdm 12
## 15 40 Private 121772 Assoc-voc 11
## 16 34 Private 245487 7th-8th 4
## 17 25 Self-emp-not-inc 176756 HS-grad 9
## 18 32 Private 186824 HS-grad 9
## 19 38 Private 28887 11th 7
## 20 43 Self-emp-not-inc 292175 Masters 14
## marital-status occupation relationship
## 1 Never-married Adm-clerical Not-in-family
## 2 Married-civ-spouse Exec-managerial Husband
## 3 Divorced Handlers-cleaners Not-in-family
## 4 Married-civ-spouse Handlers-cleaners Husband
## 5 Married-civ-spouse Prof-specialty Wife
## 6 Married-civ-spouse Exec-managerial Wife
## 7 Married-spouse-absent Other-service Not-in-family
## 8 Married-civ-spouse Exec-managerial Husband
## 9 Never-married Prof-specialty Not-in-family
## 10 Married-civ-spouse Exec-managerial Husband
## 11 Married-civ-spouse Exec-managerial Husband
## 12 Married-civ-spouse Prof-specialty Husband
## 13 Never-married Adm-clerical Own-child
## 14 Never-married Sales Not-in-family
## 15 Married-civ-spouse Craft-repair Husband
## 16 Married-civ-spouse Transport-moving Husband
## 17 Never-married Farming-fishing Own-child
## 18 Never-married Machine-op-inspct Unmarried
## 19 Married-civ-spouse Sales Husband
## 20 Divorced Exec-managerial Unmarried
## race sex capital-gain capital-loss hours-per-week
## 1 White Male 2174 0 40
## 2 White Male 0 0 13
## 3 White Male 0 0 40
## 4 Black Male 0 0 40
## 5 Black Female 0 0 40
## 6 White Female 0 0 40
## 7 Black Female 0 0 16
## 8 White Male 0 0 45
## 9 White Female 14084 0 50
## 10 White Male 5178 0 40
## 11 Black Male 0 0 80
## 12 Asian-Pac-Islander Male 0 0 40
## 13 White Female 0 0 30
## 14 Black Male 0 0 50
## 15 Asian-Pac-Islander Male 0 0 40
## 16 Amer-Indian-Eskimo Male 0 0 45
## 17 White Male 0 0 35
## 18 White Male 0 0 40
## 19 White Male 0 0 50
## 20 White Female 0 0 45
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
## 11 United-States >50K
## 12 India >50K
## 13 United-States <=50K
## 14 United-States <=50K
## 15 <NA> >50K
## 16 Mexico <=50K
## 17 United-States <=50K
## 18 United-States <=50K
## 19 United-States <=50K
## 20 United-States >50K
Let’s take a look at our correctly formatted data. Immediately we notice there are some variables that are not particularly interesting, and some that seem interesting and worth exploring. We will use the dplyr package that is part of tidyverse to manipulate the data we have into something we can create interesting visualizations with.
There are many functions in dplyr. We will cover some of the most useful and commonly used ones: select(), arrange(), filter(), mutate(), summarize(), group_by(), and rename().
The select() function is used to keep only the variables of interest to the current analysis. It is most useful when working with dataframes that have a large number of variables. Let’s extract the columns we are interested in examining into a new dataframe, so that the original columns remain available in case we want them later on.
The most common way to use select() is to list all the variable names we wish to keep in the new dataset. Since some of our variable names contain dashes, we need to wrap those names in double quotes, single quotes, or backticks.
adults_select <- select(adults, age, workclass, fnlwgt, education, "education-num", "marital-status", occupation, race, sex, "hours-per-week", "native-country", salary)
head(adults_select, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## marital-status occupation race sex hours-per-week
## 1 Never-married Adm-clerical White Male 40
## 2 Married-civ-spouse Exec-managerial White Male 13
## 3 Divorced Handlers-cleaners White Male 40
## 4 Married-civ-spouse Handlers-cleaners Black Male 40
## 5 Married-civ-spouse Prof-specialty Black Female 40
## 6 Married-civ-spouse Exec-managerial White Female 40
## 7 Married-spouse-absent Other-service Black Female 16
## 8 Married-civ-spouse Exec-managerial White Male 45
## 9 Never-married Prof-specialty White Female 50
## 10 Married-civ-spouse Exec-managerial White Male 40
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
If there are more variables we want to keep than drop, it might be more efficient to use a second method. In this method, we use select() as we normally would, but put a - in front of the variables we want to drop from the original dataset, so that we have fewer variable names to type. This returns the same result as the previous method.
adults_negselect <- select(adults, -relationship, -"capital-gain", -"capital-loss")
head(adults_negselect, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## 7 49 Private 160187 9th 5
## 8 52 Self-emp-not-inc 209642 HS-grad 9
## 9 31 Private 45781 Masters 14
## 10 42 Private 159449 Bachelors 13
## marital-status occupation race sex hours-per-week
## 1 Never-married Adm-clerical White Male 40
## 2 Married-civ-spouse Exec-managerial White Male 13
## 3 Divorced Handlers-cleaners White Male 40
## 4 Married-civ-spouse Handlers-cleaners Black Male 40
## 5 Married-civ-spouse Prof-specialty Black Female 40
## 6 Married-civ-spouse Exec-managerial White Female 40
## 7 Married-spouse-absent Other-service Black Female 16
## 8 Married-civ-spouse Exec-managerial White Male 45
## 9 Never-married Prof-specialty White Female 50
## 10 Married-civ-spouse Exec-managerial White Male 40
## native-country salary
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
## 7 Jamaica <=50K
## 8 United-States >50K
## 9 United-States >50K
## 10 United-States >50K
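The claim that keeping and dropping give the same result can be checked on a small example. The sketch below uses a made-up data frame toy; the names keep and drop are also illustrative only, and backticks are used for the dashed names.

```r
library(dplyr)

# Toy data frame with two plain names and two names containing dashes.
toy <- data.frame(
  age = c(39, 50),
  sex = c("Male", "Female"),
  `capital-gain` = c(2174, 0),
  `capital-loss` = c(0, 0),
  check.names = FALSE,        # keep the dashes in the column names
  stringsAsFactors = FALSE
)

# Method 1: list the columns to keep.
keep <- select(toy, age, sex)

# Method 2: list the columns to drop, each with a leading minus sign.
drop <- select(toy, -`capital-gain`, -`capital-loss`)

# Both approaches leave the same columns.
all.equal(keep, drop)
```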
There are also other ways to use select() that can get to the desired outcome more efficiently. dplyr provides selection helper functions that are designed for use inside selecting functions such as select(). Some of these helpers are starts_with(), ends_with(), and contains(). For example, say we only want to examine the columns of our adults dataset with education-related data. We can use starts_with("education") inside select(), which selects the columns whose names start with the given string, “education”.
adults_educ <- select(adults, starts_with("education"))
head(adults_educ, 10)
## education education-num
## 1 Bachelors 13
## 2 Bachelors 13
## 3 HS-grad 9
## 4 11th 7
## 5 Bachelors 13
## 6 Masters 14
## 7 9th 5
## 8 HS-grad 9
## 9 Masters 14
## 10 Bachelors 13
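The three helpers named above differ only in where they look for the string within a column name. A minimal sketch on a made-up data frame (toy is an illustrative name, and only the column names matter here):

```r
library(dplyr)

toy <- data.frame(
  education = c("Bachelors", "HS-grad"),
  `education-num` = c(13, 9),
  `capital-gain` = c(2174, 0),
  `capital-loss` = c(0, 0),
  check.names = FALSE,        # keep the dashes in the column names
  stringsAsFactors = FALSE
)

# Names that begin with "education".
names(select(toy, starts_with("education")))

# Names that end with "gain".
names(select(toy, ends_with("gain")))

# Names that contain a dash anywhere (contains() matches literal text).
names(select(toy, contains("-")))
```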
Now that we have a dataframe with only the columns we are interested in, let’s explore some variables that might be interesting to take a closer look at. We are interested in seeing the age range of adults in this dataset and then focusing on a smaller group.
The arrange() function orders the rows of a data frame by a variable name (or a more complicated expression). If we provide multiple expressions to order by, it uses the second one to break ties in the first one, the third one to break ties in the second one, and so on. By default, values are arranged from low to high.
adults_arrange <- arrange(adults_select, age)
head(adults_arrange, 20)
## age workclass fnlwgt education education-num marital-status
## 1 17 <NA> 304873 10th 6 Never-married
## 2 17 Private 65368 11th 7 Never-married
## 3 17 Private 245918 11th 7 Never-married
## 4 17 Private 191260 9th 5 Never-married
## 5 17 Private 270942 5th-6th 3 Never-married
## 6 17 Private 89821 11th 7 Never-married
## 7 17 Private 175024 11th 7 Never-married
## 8 17 <NA> 202521 11th 7 Never-married
## 9 17 <NA> 258872 11th 7 Never-married
## 10 17 Private 211870 9th 5 Never-married
## 11 17 Private 242718 11th 7 Never-married
## 12 17 Private 169658 10th 6 Never-married
## 13 17 <NA> 80077 11th 7 Never-married
## 14 17 Self-emp-not-inc 368700 11th 7 Never-married
## 15 17 Private 102726 12th 8 Never-married
## 16 17 Private 316929 12th 8 Never-married
## 17 17 Private 193830 11th 7 Never-married
## 18 17 Private 32607 10th 6 Never-married
## 19 17 Private 198124 11th 7 Never-married
## 20 17 Private 368700 11th 7 Never-married
## occupation race sex hours-per-week native-country salary
## 1 <NA> White Female 32 United-States <=50K
## 2 Sales White Female 12 United-States <=50K
## 3 Other-service White Male 12 United-States <=50K
## 4 Other-service White Male 24 United-States <=50K
## 5 Other-service White Male 48 Mexico <=50K
## 6 Other-service White Male 10 United-States <=50K
## 7 Handlers-cleaners White Male 18 United-States <=50K
## 8 <NA> White Male 40 United-States <=50K
## 9 <NA> White Female 5 United-States <=50K
## 10 Other-service White Male 6 United-States <=50K
## 11 Sales White Male 12 United-States <=50K
## 12 Other-service White Female 21 United-States <=50K
## 13 <NA> White Female 20 United-States <=50K
## 14 Farming-fishing White Male 10 United-States <=50K
## 15 Other-service White Male 16 United-States <=50K
## 16 Handlers-cleaners White Male 20 United-States <=50K
## 17 Sales White Female 20 United-States <=50K
## 18 Farming-fishing White Male 20 United-States <=50K
## 19 Sales White Male 20 United-States <=50K
## 20 Sales White Male 28 United-States <=50K
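The tie-breaking rule described above, and the way to reverse the default low-to-high order, can be sketched on a toy data frame. The data and the name toy are made up for illustration; desc() is the dplyr helper for descending order.

```r
library(dplyr)

toy <- data.frame(
  age   = c(25, 17, 25, 17),
  hours = c(40, 12, 20, 30)
)

# Sort by age; rows with the same age are then ordered by hours.
youngest_first <- arrange(toy, age, hours)
youngest_first

# desc() reverses the order for one variable: oldest first,
# with ties in age still broken by ascending hours.
oldest_first <- arrange(toy, desc(age), hours)
oldest_first
```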
Let’s filter the data further. We want to take a closer look at adults aged 30 and over whose native country is the United States and who work for a private company. The filter() function keeps the rows of the dataset that match the conditions we pass to it.
Notice that the native-country variable in the dataset holds text rather than numerical data. One convenient way to filter on a text column is the str_detect() function from the stringr library. Placing str_detect() inside filter() keeps a row whenever the text in that column contains the given pattern, which also works when the values carry stray whitespace.
To filter on the age variable, I wrap it in as.integer() so that it is compared as a number. Be aware that as.integer() applied to a factor returns the internal level codes rather than the printed values, so a column stored as a factor would need as.integer(as.character(age)) instead.
One useful tool in dplyr is the pipe operator %>%. The pipe operator is placed at the end of a line of code. It takes the output of one function and feeds it in as the first argument of the next function. I use the operator here so that I do not have to retype my dataset argument. The pipe is very useful when we want to apply several functions to the same dataset, because we do not have to save the dataset in a new variable after each step.
library(stringr)
adults_filter <- adults_select %>%
filter(str_detect(`native-country`, 'United-States') &
as.integer(age) >= 30 & str_detect(workclass, 'Private'))
head(adults_filter, 10)
## age workclass fnlwgt education education-num marital-status
## 1 38 Private 215646 HS-grad 9 Divorced
## 2 53 Private 234721 11th 7 Married-civ-spouse
## 3 37 Private 284582 Masters 14 Married-civ-spouse
## 4 31 Private 45781 Masters 14 Never-married
## 5 42 Private 159449 Bachelors 13 Married-civ-spouse
## 6 37 Private 280464 Some-college 10 Married-civ-spouse
## 7 32 Private 205019 Assoc-acdm 12 Never-married
## 8 32 Private 186824 HS-grad 9 Never-married
## 9 38 Private 28887 11th 7 Married-civ-spouse
## 10 40 Private 193524 Doctorate 16 Married-civ-spouse
## occupation race sex hours-per-week native-country salary
## 1 Handlers-cleaners White Male 40 United-States <=50K
## 2 Handlers-cleaners Black Male 40 United-States <=50K
## 3 Exec-managerial White Female 40 United-States <=50K
## 4 Prof-specialty White Female 50 United-States >50K
## 5 Exec-managerial White Male 40 United-States >50K
## 6 Exec-managerial Black Male 80 United-States >50K
## 7 Sales Black Male 50 United-States <=50K
## 8 Machine-op-inspct White Male 40 United-States <=50K
## 9 Sales White Male 50 United-States <=50K
## 10 Prof-specialty White Male 60 United-States >50K
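To show what the pipe does in the call above, the sketch below writes the same filter three ways on a toy data frame. The data and the names toy, piped, nested, and exact are made up for illustration; trimws() is a base R function that removes surrounding whitespace.

```r
library(dplyr)
library(stringr)

toy <- data.frame(
  age = c(28, 38, 53, 45),
  workclass = c(" Private", " Private", " State-gov", " Self-emp-not-inc"),
  stringsAsFactors = FALSE
)

# Piped form: toy becomes the first argument of filter().
piped <- toy %>%
  filter(age >= 30 & str_detect(workclass, "Private"))

# The same call written without the pipe.
nested <- filter(toy, age >= 30 & str_detect(workclass, "Private"))

# An exact comparison also works once the leading space is trimmed.
# Conditions separated by commas are combined with "and".
exact <- filter(toy, age >= 30, trimws(workclass) == "Private")

piped
```

All three keep only the row for the 38-year-old private-sector worker in this toy data.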
There are also other ways to use filter() that can get to the desired outcome more efficiently. Several helper functions are handy inside filter(), such as is.na(), between(), and near(). For example, say we only want to examine the rows of our adults_select dataset with adults between ages 30 and 50. We can use between(age, 30, 50) inside filter(), which keeps the rows where age falls in that range, endpoints included. As before, age is wrapped in as.integer() so that it is compared as a number.
adults_filter_between <- adults_select %>%
filter(between(as.integer(age), 30, 50))
head(adults_filter_between, 10)
## age workclass fnlwgt education education-num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 37 Private 284582 Masters 14
## 5 49 Private 160187 9th 5
## 6 31 Private 45781 Masters 14
## 7 42 Private 159449 Bachelors 13
## 8 37 Private 280464 Some-college 10
## 9 30 State-gov 141297 Bachelors 13
## 10 32 Private 205019 Assoc-acdm 12
## marital-status occupation race sex
## 1 Never-married Adm-clerical White Male
## 2 Married-civ-spouse Exec-managerial White Male
## 3 Divorced Handlers-cleaners White Male
## 4 Married-civ-spouse Exec-managerial White Female
## 5 Married-spouse-absent Other-service Black Female
## 6 Never-married Prof-specialty White Female
## 7 Married-civ-spouse Exec-managerial White Male
## 8 Married-civ-spouse Exec-managerial Black Male
## 9 Married-civ-spouse Prof-specialty Asian-Pac-Islander Male
## 10 Never-married Sales Black Male
## hours-per-week native-country salary
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 16 Jamaica <=50K
## 6 50 United-States >50K
## 7 40 United-States >50K
## 8 80 United-States >50K
## 9 40 India >50K
## 10 50 United-States <=50K
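The sketch below tries between() and is.na() on a toy data frame, and also shows why the type of a column matters when it is passed through as.integer(). All data and names here (toy, in_range, known, age_factor) are made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  age = c(25, 30, 50, 61),
  workclass = c("Private", NA, "Private", "State-gov"),
  stringsAsFactors = FALSE
)

# between() keeps values from 30 to 50, including both endpoints.
in_range <- filter(toy, between(age, 30, 50))
in_range

# !is.na() drops the rows where workclass is missing.
known <- filter(toy, !is.na(workclass))
known

# as.integer() on a factor returns level codes, not the printed values.
age_factor <- factor(c("39", "50", "17"))
as.integer(age_factor)

# Going through as.character() first recovers the ages themselves.
as.integer(as.character(age_factor))
```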
The mutate() function lets us add new variables to our dataset.
Suppose we want to include the fnlwgt variable in our visualizations, but its values are too large and we want to scale them down. We can use mutate() to create a new column in the dataset, name it, and set its values to whatever we want them to be. In this case, we divide the values by 10,000 so that they carry the same information on a smaller scale.
I convert fnlwgt with as.double() before dividing so that the calculation is done on numeric values.
adults_mutate <- mutate(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000)
head(adults_mutate)
## age workclass fnlwgt education education-num marital-status
## 1 38 Private 215646 HS-grad 9 Divorced
## 2 53 Private 234721 11th 7 Married-civ-spouse
## 3 37 Private 284582 Masters 14 Married-civ-spouse
## 4 31 Private 45781 Masters 14 Never-married
## 5 42 Private 159449 Bachelors 13 Married-civ-spouse
## 6 37 Private 280464 Some-college 10 Married-civ-spouse
## occupation race sex hours-per-week native-country salary
## 1 Handlers-cleaners White Male 40 United-States <=50K
## 2 Handlers-cleaners Black Male 40 United-States <=50K
## 3 Exec-managerial White Female 40 United-States <=50K
## 4 Prof-specialty White Female 50 United-States >50K
## 5 Exec-managerial White Male 40 United-States >50K
## 6 Exec-managerial Black Male 80 United-States >50K
## scaled_fnlwgt
## 1 21.5646
## 2 23.4721
## 3 28.4582
## 4 4.5781
## 5 15.9449
## 6 28.0464
transmute() is a variant of the mutate() function. transmute() works the same way, except that the new dataset contains only the newly created variables and none of the untouched ones.
head(transmute(adults_filter, "scaled_fnlwgt" = as.double(fnlwgt)/10000), 10)
## scaled_fnlwgt
## 1 21.5646
## 2 23.4721
## 3 28.4582
## 4 4.5781
## 5 15.9449
## 6 28.0464
## 7 20.5019
## 8 18.6824
## 9 2.8887
## 10 19.3524
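The difference between the two functions is easiest to see by comparing the column names they return. A minimal sketch on a made-up two-row data frame (toy, with_new, and only_new are illustrative names):

```r
library(dplyr)

toy <- data.frame(age = c(38, 53), fnlwgt = c(215646, 234721))

# mutate() keeps the existing columns and appends the new one.
with_new <- mutate(toy, scaled_fnlwgt = fnlwgt / 10000)
names(with_new)

# transmute() returns only the columns created in the call.
only_new <- transmute(toy, scaled_fnlwgt = fnlwgt / 10000)
names(only_new)
```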
The summarize() function collapses a data frame into single-number summaries.
Suppose we want to look at the average hours per week white males work and compare it with the average hours per week white females work. We can use summarize() together with filter() and the pipe %>%: filter() keeps the adults whose race is white and whose sex is male or female, and summarize() computes the average for those rows.
adults_mutate %>%
filter(str_detect(race, 'White') & str_detect(sex, 'Male')) %>%
summarize(white_males = mean(as.double(`hours-per-week`)))
## white_males
## 1 44.09296
adults_mutate %>%
filter(str_detect(race, 'White') & str_detect(sex, 'Female')) %>%
summarize(white_females = mean(as.double(`hours-per-week`)))
## white_females
## 1 38.44055
We see that, in this dataset, white males work more hours per week on average than white females. We are interested in visualizing the characteristics of these two groups to potentially find out why. Let’s keep this result in mind; we will return to it when we move on to visualization.
There is a more efficient way to achieve what we used summarize() for above. group_by() lets us group rows by the values of one or more categorical variables and compute summary statistics for every group at once, saving us the extra step of filtering out each group ourselves. After grouping and summarizing, it is easy to see that, among the adults in this filtered dataset, Asian Pacific Islander American males work the most hours per week on average, at 44.2. However, the averages are fairly close across the groups we have chosen by race and sex, ranging from 37.7 to 44.2 hours.
adults_mutate %>%
group_by(race, sex) %>%
summarize(avg_hours = mean(as.double(`hours-per-week`)))
## # A tibble: 10 x 3
## # Groups: race [?]
## race sex avg_hours
## <fct> <fct> <dbl>
## 1 " Amer-Indian-Eskimo" " Female" 38.6
## 2 " Amer-Indian-Eskimo" " Male" 42.9
## 3 " Asian-Pac-Islander" " Female" 39.9
## 4 " Asian-Pac-Islander" " Male" 44.2
## 5 " Black" " Female" 37.7
## 6 " Black" " Male" 41.6
## 7 " Other" " Female" 39
## 8 " Other" " Male" 40.6
## 9 " White" " Female" 38.4
## 10 " White" " Male" 44.1
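The grouping step can be sketched on a toy data frame small enough to check by hand. The data and the names toy and by_sex are made up for illustration; n() is the dplyr function that counts the rows in each group.

```r
library(dplyr)

toy <- data.frame(
  sex   = c("Male", "Male", "Female", "Female"),
  hours = c(40, 50, 30, 40),
  stringsAsFactors = FALSE
)

# One summary row per value of sex, computed in a single pipeline.
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(
    avg_hours = mean(hours),  # average hours within each group
    n = n()                   # number of rows in each group
  )

by_sex
```

Including the group size alongside each average is a useful habit, since an average over very few rows deserves less weight.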
The rename() function is pretty self-explanatory: it lets us rename variables in the dataset. Let’s try renaming some of our variables.
For the sake of consistency with variable naming, let’s rename our variable scaled_fnlwgt to scaled-fnlwgt. The dataset always goes first in the function. The second argument is the desired new variable name, followed by = and the original variable name. Since the new name has a dash, either double or single quotes should go around it.
To make the change permanent, remember to assign the result back to a variable.
We can rename several variables at once by adding a new = old pair for each one as the third argument, fourth argument, and so on.
adults_mutate <- rename(adults_mutate, "scaled-fnlwgt" = scaled_fnlwgt)
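Renaming more than one variable in a single call, as described above, might look like the following sketch. The data frame toy and the column hours are made up for illustration.

```r
library(dplyr)

toy <- data.frame(scaled_fnlwgt = c(21.5, 23.4), hours = c(40, 40))

# Each pair is written as new name = old name.
renamed <- rename(toy,
  "scaled-fnlwgt" = scaled_fnlwgt,
  "hours-per-week" = hours
)

names(renamed)
```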
Now that we have learned about some tools to clean and manipulate data, let’s make some interesting visualizations with R’s ggplot2 to explore patterns in our dataset! There are many different types of plots and variations of these plots we can create in ggplot2. In this section, we will mainly explore the following plots:
Visualizations in ggplot2 start with a base plot and let us add layers onto it to build more complex visualizations. geom_point() lets us create scatterplots from our data.
We are interested in seeing the relationship between the variables age and scaled-fnlwgt. We can visualize this relationship by creating a scatterplot with age as the x variable and scaled-fnlwgt as the y variable.
We can also visualize a third variable by mapping the color of the points to it. We are interested in seeing the distribution of sex in relation to age and the scaled final weight (fnlwgt).
ggplot(adults_mutate) +
geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`))
ggplot(adults_mutate) +
geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`, color = sex))
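The layering idea can be tried without the full dataset. The sketch below builds the same kind of scatterplot from a made-up four-row data frame and adds one more layer of axis labels with labs(); the data, the name toy, and the label text are assumptions for illustration.

```r
library(ggplot2)

toy <- data.frame(
  age = c(31, 38, 42, 53),
  `scaled-fnlwgt` = c(4.6, 21.6, 15.9, 23.5),
  sex = c("Female", "Male", "Male", "Male"),
  check.names = FALSE,        # keep the dash in the column name
  stringsAsFactors = FALSE
)

# Base plot, then a point layer, then a labels layer.
p <- ggplot(toy) +
  geom_point(mapping = aes(x = age, y = `scaled-fnlwgt`, color = sex)) +
  labs(x = "Age", y = "fnlwgt / 10,000", color = "Sex")

# Printing the object draws the plot.
p
```

Storing the plot in a variable makes it easy to keep adding layers with + before drawing it.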