Chapter 5 Introduction to the Tidyverse
The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures. These packages are essential tools for anyone working with data in R, providing intuitive and consistent functions for data cleaning, manipulation, and visualization.
The tidyverse follows Hadley Wickham’s philosophy of tidy data, where:
- Each variable is a column
- Each observation is a row
- Each value is a cell
This structure makes data analysis more intuitive and efficient.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
5.1 Data Structure: Wide vs. Tidy Data
5.1.1 Understanding the Difference
Let’s explore the concept of tidy data with a practical example:
# Create a wide-format dataset
flu_cases <- data.frame(
year = c(2020, 2021),
male_cases = c(120, 150),
female_cases = c(100, 130)
)
print("Wide format (not tidy):")
## [1] "Wide format (not tidy):"
## year male_cases female_cases
## 1 2020 120 100
## 2 2021 150 130
Why isn’t this tidy? The variable “sex” is encoded in the column names rather than being a proper column with values.
5.1.2 Reshaping Data with tidyr
5.1.2.1 Making Data Longer with pivot_longer()
flu_tidy <- flu_cases %>%
pivot_longer(
cols = c(male_cases, female_cases), # Columns to pivot
names_to = "sex", # New column for variable names
values_to = "cases" # New column for values
) %>%
mutate(sex = str_replace(sex, "_cases", "")) # Clean up the sex values
print("Tidy format (long):")
## [1] "Tidy format (long):"
## # A tibble: 4 × 3
## year sex cases
## <dbl> <chr> <dbl>
## 1 2020 male 120
## 2 2020 female 100
## 3 2021 male 150
## 4 2021 female 130
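The suffix cleanup can also happen inside the pivot itself. The following is a minimal sketch, not the approach used above: it assumes the `names_pattern` argument of `pivot_longer()`, which keeps only the text captured by the regular expression group. The object name `flu_tidy_alt` is made up for illustration.

```r
library(tidyr)

# Same wide data as in the example above
flu_cases <- data.frame(
  year = c(2020, 2021),
  male_cases = c(120, 150),
  female_cases = c(100, 130)
)

# names_pattern keeps only the part matched inside the parentheses,
# so "male_cases" becomes "male" during the pivot
flu_tidy_alt <- pivot_longer(
  flu_cases,
  cols = ends_with("_cases"),
  names_to = "sex",
  names_pattern = "(.*)_cases",
  values_to = "cases"
)

flu_tidy_alt
```

This avoids the separate `mutate()` and `str_replace()` step, at the cost of a slightly less obvious regular expression.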
5.1.2.2 Making Data Wider with pivot_wider()
flu_wide_again <- flu_tidy %>%
pivot_wider(
names_from = sex, # Column to get new column names from
values_from = cases, # Column to get values from
names_glue = "{sex}_cases" # Template for new column names
)
print("Back to wide format:")
## [1] "Back to wide format:"
## # A tibble: 2 × 3
## year male_cases female_cases
## <dbl> <dbl> <dbl>
## 1 2020 120 100
## 2 2021 150 130
5.2 Core dplyr Functions for Data Manipulation
Let’s explore the essential dplyr functions using the gapminder dataset from the gapminder R package. It is an excerpt of the Gapminder data that is widely used for teaching and visualization. Each row describes one country in one year, with its continent, life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap), recorded every five years from 1952 to 2007. This makes it easy to explore trends across countries, continents, and time. For example, you can track how life expectancy has improved in different regions or how population growth relates to economic development. Gapminder is also well known through the Gapminder Foundation and the animated bubble charts popularized by Hans Rosling, which make complex global trends intuitive and engaging.
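The examples in this section need the dataset to be attached first. A minimal sketch, assuming the gapminder package has been installed from CRAN (the install line is commented out and is an assumption about your setup), which also previews the first ten rows:

```r
# install.packages("gapminder")  # one-time setup, if not yet installed
library(dplyr)
library(gapminder)

# Preview the first ten rows of the dataset
gapminder %>%
  head(10)
```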
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
5.2.1 The Pipe Operator (%>%)
The pipe operator is the backbone of tidyverse workflows. It passes the result of one function as the first argument to the next function:
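As a small illustration of that idea, the sketch below computes the same result twice, once with nested calls and once with the pipe. The vector `values` and the variable names are made up for this example.

```r
library(dplyr)  # makes %>% available (it originates in magrittr)

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(sum(sqrt(values)), 1)

# The piped version reads left to right: values, then sqrt, then sum, then round
piped <- values %>%
  sqrt() %>%
  sum() %>%
  round(1)

identical(nested, piped)
```

Each step receives the previous result as its first argument, so `round(1)` here means `round(<previous result>, 1)`. R 4.1 and later also ship a native pipe, `|>`, which behaves similarly in simple chains like this one.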
5.2.2 select(): Choose Columns
# Select specific columns
selected_columns <- gapminder %>%
select(country, year, lifeExp, pop)
head(selected_columns)
## # A tibble: 6 × 4
## country year lifeExp pop
## <fct> <int> <dbl> <int>
## 1 Afghanistan 1952 28.8 8425333
## 2 Afghanistan 1957 30.3 9240934
## 3 Afghanistan 1962 32.0 10267083
## 4 Afghanistan 1967 34.0 11537966
## 5 Afghanistan 1972 36.1 13079460
## 6 Afghanistan 1977 38.4 14880372
## # A tibble: 6 × 2
## country lifeExp
## <fct> <dbl>
## 1 Afghanistan 28.8
## 2 Afghanistan 30.3
## 3 Afghanistan 32.0
## 4 Afghanistan 34.0
## 5 Afghanistan 36.1
## 6 Afghanistan 38.4
## # A tibble: 6 × 4
## country year lifeExp pop
## <fct> <int> <dbl> <int>
## 1 Afghanistan 1952 28.8 8425333
## 2 Afghanistan 1957 30.3 9240934
## 3 Afghanistan 1962 32.0 10267083
## 4 Afghanistan 1967 34.0 11537966
## 5 Afghanistan 1972 36.1 13079460
## 6 Afghanistan 1977 38.4 14880372
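The code that produced the last two tables above is not shown. The sketch below gives two `select()` calls that return the same columns; they are plausible reconstructions, not necessarily the calls originally used.

```r
library(dplyr)
library(gapminder)

# Keep two columns by name
gapminder %>%
  select(country, lifeExp) %>%
  head()

# Drop columns with a minus sign; the remaining columns keep their original order
without_two <- gapminder %>%
  select(-continent, -gdpPercap)

head(without_two)
```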
5.2.3 filter(): Choose Rows
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
# Filter by multiple conditions
gapminder %>%
filter(year == 2007, continent == "Europe") %>%
head()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Austria Europe 2007 79.8 8199783 36126.
## 3 Belgium Europe 2007 79.4 10392226 33693.
## 4 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 5 Bulgaria Europe 2007 73.0 7322858 10681.
## 6 Croatia Europe 2007 75.7 4493312 14619.
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Algeria Africa 1952 43.1 9279525 2449.
## 2 Algeria Africa 1957 45.7 10270856 3014.
## 3 Algeria Africa 1962 48.3 11000948 2551.
## 4 Algeria Africa 1967 51.4 12760499 3247.
## 5 Algeria Africa 1972 54.5 14760787 4183.
## 6 Algeria Africa 1977 58.0 17152804 4910.
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Brazil Americas 2002 71.0 179914212 8131.
## 2 Brazil Americas 2007 72.4 190010647 9066.
## 3 China Asia 1997 70.4 1230075000 2289.
## 4 China Asia 2002 72.0 1280400000 3119.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Indonesia Asia 2007 70.6 223547000 3541.
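Only one of the `filter()` calls behind the tables above is shown. The sketch below illustrates the common patterns: a single condition, matching against several values with `%in%`, and combining conditions with `&`. The specific thresholds and the name `big_and_long_lived` are illustrative assumptions, not the original code.

```r
library(dplyr)
library(gapminder)

# Single condition
rows_2007 <- gapminder %>%
  filter(year == 2007)

# Match any of several values with %in%
gapminder %>%
  filter(continent %in% c("Africa", "Oceania")) %>%
  head()

# Combine conditions with & (and) or | (or)
big_and_long_lived <- gapminder %>%
  filter(pop > 100000000 & lifeExp > 70)

head(big_and_long_lived)
```

Separating conditions with commas, as in `filter(year == 2007, continent == "Europe")`, is equivalent to joining them with `&`.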
5.2.4 mutate(): Create or Modify Columns
# Create new columns
gapminder_enhanced <- gapminder %>%
mutate(
pop_millions = pop / 1000000,
gdp_total = pop * gdpPercap,
pop_log = log10(pop)
)
head(gapminder_enhanced)
## # A tibble: 6 × 9
## country continent year lifeExp pop gdpPercap pop_millions gdp_total
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 8.43 6.57e 9
## 2 Afghanistan Asia 1957 30.3 9240934 821. 9.24 7.59e 9
## 3 Afghanistan Asia 1962 32.0 10267083 853. 10.3 8.76e 9
## 4 Afghanistan Asia 1967 34.0 11537966 836. 11.5 9.65e 9
## 5 Afghanistan Asia 1972 36.1 13079460 740. 13.1 9.68e 9
## 6 Afghanistan Asia 1977 38.4 14880372 786. 14.9 1.17e10
## # ℹ 1 more variable: pop_log <dbl>
# Conditional mutations with ifelse
gapminder %>%
mutate(
pop_category = ifelse(pop > 50000000, "Large", "Small")
) %>%
head()
## # A tibble: 6 × 7
## country continent year lifeExp pop gdpPercap pop_category
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. Small
## 2 Afghanistan Asia 1957 30.3 9240934 821. Small
## 3 Afghanistan Asia 1962 32.0 10267083 853. Small
## 4 Afghanistan Asia 1967 34.0 11537966 836. Small
## 5 Afghanistan Asia 1972 36.1 13079460 740. Small
## 6 Afghanistan Asia 1977 38.4 14880372 786. Small
5.2.5 transmute(): Create New Columns (Drop Others)
# Keep only the newly created columns
gapminder %>%
transmute(
country = country,
year = year,
gdp_total = pop * gdpPercap,
pop_millions = pop / 1000000
) %>%
head()
## # A tibble: 6 × 4
## country year gdp_total pop_millions
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 6567086330. 8.43
## 2 Afghanistan 1957 7585448670. 9.24
## 3 Afghanistan 1962 8758855797. 10.3
## 4 Afghanistan 1967 9648014150. 11.5
## 5 Afghanistan 1972 9678553274. 13.1
## 6 Afghanistan 1977 11697659231. 14.9
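The dplyr 1.1 documentation marks `transmute()` as superseded and suggests `mutate()` with `.keep = "none"` for the same job. A minimal sketch on a made-up two-row tibble (`df` and its values are invented for illustration):

```r
library(dplyr)

df <- tibble(
  country = c("A", "B"),
  pop = c(2e6, 5e6),
  gdpPercap = c(1000, 2000)
)

# transmute() keeps only the columns named in the call
via_transmute <- df %>%
  transmute(country = country, gdp_total = pop * gdpPercap)

# mutate(.keep = "none") also keeps only the columns created in the call
via_mutate <- df %>%
  mutate(country = country, gdp_total = pop * gdpPercap, .keep = "none")
```

Both results hold the same columns and values. Column order can differ between the two functions in some cases, so compare by name rather than by position if that matters.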
5.2.6 arrange(): Sort Rows
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Swaziland Africa 2007 39.6 1133066 4513.
## 2 Mozambique Africa 2007 42.1 19951656 824.
## 3 Zambia Africa 2007 42.4 11746035 1271.
## 4 Sierra Leone Africa 2007 42.6 6144562 863.
## 5 Lesotho Africa 2007 42.6 2012649 1569.
## 6 Angola Africa 2007 42.7 12420476 4797.
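The code for the table above is not shown. Its rows are consistent with the 2007 data sorted by life expectancy in ascending order, which is the default for `arrange()`. A sketch under that assumption (the name `lowest_life_exp` is made up):

```r
library(dplyr)
library(gapminder)

lowest_life_exp <- gapminder %>%
  filter(year == 2007) %>%
  arrange(lifeExp) %>%  # ascending order is the default
  head()

lowest_life_exp
```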
# Sort by multiple columns (descending)
gapminder %>%
filter(year == 2007) %>%
arrange(desc(gdpPercap), desc(lifeExp)) %>%
head()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Norway Europe 2007 80.2 4627926 49357.
## 2 Kuwait Asia 2007 77.6 2505559 47307.
## 3 Singapore Asia 2007 80.0 4553009 47143.
## 4 United States Americas 2007 78.2 301139947 42952.
## 5 Ireland Europe 2007 78.9 4109086 40676.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
5.2.7 group_by() and summarise(): Group Operations
# Basic grouping and summarizing
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(
mean_lifeExp = mean(lifeExp),
sd_lifeExp = sd(lifeExp),
total_pop = sum(pop),
n_countries = n(),
.groups = 'drop' # Automatically ungroup
)
## # A tibble: 5 × 5
## continent mean_lifeExp sd_lifeExp total_pop n_countries
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Africa 54.8 9.63 929539692 52
## 2 Americas 73.6 4.44 898871184 25
## 3 Asia 70.7 7.96 3811953827 33
## 4 Europe 77.6 2.98 586098529 30
## 5 Oceania 80.7 0.729 24549947 2
# Multiple grouping variables
gapminder %>%
filter(year %in% c(1997, 2007)) %>%
group_by(continent, year) %>%
summarise(
avg_gdp = mean(gdpPercap),
countries = n(),
.groups = 'drop'
)
## # A tibble: 10 × 4
## continent year avg_gdp countries
## <fct> <int> <dbl> <int>
## 1 Africa 1997 2379. 52
## 2 Africa 2007 3089. 52
## 3 Americas 1997 8889. 25
## 4 Americas 2007 11003. 25
## 5 Asia 1997 9834. 33
## 6 Asia 2007 12473. 33
## 7 Europe 1997 19077. 30
## 8 Europe 2007 25054. 30
## 9 Oceania 1997 24024. 2
## 10 Oceania 2007 29810. 2
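To see what `.groups = 'drop'` changes, the sketch below summarises a small made-up tibble (`sales`, with invented values) with and without it, then inspects the grouping that remains.

```r
library(dplyr)

sales <- tibble(
  region = c("N", "N", "S", "S"),
  year = c(1, 2, 1, 2),
  amount = c(10, 20, 30, 40)
)

# Default behaviour: only the last grouping variable (year) is dropped
kept <- sales %>%
  group_by(region, year) %>%
  summarise(total = sum(amount))

# With .groups = "drop" the result is fully ungrouped
dropped <- sales %>%
  group_by(region, year) %>%
  summarise(total = sum(amount), .groups = "drop")

group_vars(kept)     # still grouped by region
group_vars(dropped)  # no grouping left
```

Leftover grouping silently affects later `mutate()` or `summarise()` calls, which is why dropping it explicitly is a common habit.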
5.3 Additional Essential Functions
5.3.1 rename(): Change Column Names
# Rename columns
gapminder %>%
rename(
life_expectancy = lifeExp,
gdp_per_capita = gdpPercap,
population = pop
) %>%
head()
## # A tibble: 6 × 6
## country continent year life_expectancy population gdp_per_capita
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## # A tibble: 6 × 6
## COUNTRY CONTINENT year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
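The second table above, with COUNTRY and CONTINENT in upper case, has no visible code. One way to get that result is `rename_with()`, which applies a function to the selected column names. This is a sketch of that approach, not necessarily the original call.

```r
library(dplyr)
library(gapminder)

# Apply toupper() to the names of the selected columns only
upper_names <- gapminder %>%
  rename_with(toupper, .cols = c(country, continent))

names(upper_names)
```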
5.3.2 Data Type Conversions
# Create sample data with mixed types
sample_data <- data.frame(
id = c("1", "2", "3"),
date_string = c("2020-01-01", "2020-02-01", "2020-03-01"),
value = c("10.5", "20.3", "15.7"),
stringsAsFactors = FALSE
)
# Convert data types
sample_data_converted <- sample_data %>%
mutate(
id = as.numeric(id),
date_string = as.Date(date_string),
value = as.numeric(value),
id_char = as.character(id)
)
str(sample_data_converted)
## 'data.frame': 3 obs. of 4 variables:
## $ id : num 1 2 3
## $ date_string: Date, format: "2020-01-01" "2020-02-01" ...
## $ value : num 10.5 20.3 15.7
## $ id_char : chr "1" "2" "3"
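Conversions like these only work cleanly when every string is parseable. The base R sketch below, using a made-up vector `raw_values`, shows what happens otherwise: `as.numeric()` returns `NA` with a warning, and the failed entries can be picked out for inspection.

```r
raw_values <- c("10.5", "20.3", "n/a")

# as.numeric() returns NA (with a warning) for text it cannot parse;
# the warning is suppressed here because the NAs are checked explicitly
converted <- suppressWarnings(as.numeric(raw_values))

# Entries that were not missing before conversion but are NA afterwards
failed <- raw_values[is.na(converted) & !is.na(raw_values)]

converted
failed
```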
5.3.3 across(): Apply Functions to Multiple Columns
# Apply function to multiple columns
gapminder %>%
group_by(continent) %>%
summarise(
across(c(lifeExp, pop, gdpPercap), mean),
.groups = 'drop'
)
## # A tibble: 5 × 4
## continent lifeExp pop gdpPercap
## <fct> <dbl> <dbl> <dbl>
## 1 Africa 48.9 9916003. 2194.
## 2 Americas 64.7 24504795. 7136.
## 3 Asia 60.1 77038722. 7902.
## 4 Europe 71.9 17169765. 14469.
## 5 Oceania 74.3 8874672. 18622.
# Apply multiple functions
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(
across(c(lifeExp, gdpPercap),
list(mean = mean, sd = sd, min = min, max = max)),
.groups = 'drop'
)
## # A tibble: 5 × 9
## continent lifeExp_mean lifeExp_sd lifeExp_min lifeExp_max gdpPercap_mean
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 54.8 9.63 39.6 76.4 3089.
## 2 Americas 73.6 4.44 60.9 80.7 11003.
## 3 Asia 70.7 7.96 43.8 82.6 12473.
## 4 Europe 77.6 2.98 71.8 81.8 25054.
## 5 Oceania 80.7 0.729 80.2 81.2 29810.
## # ℹ 3 more variables: gdpPercap_sd <dbl>, gdpPercap_min <dbl>,
## # gdpPercap_max <dbl>
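When several functions are applied, `across()` names the results `{column}_{function}` by default, as in the output above. The `.names` argument changes that template. A short sketch, with `summary_2007` as an illustrative name:

```r
library(dplyr)
library(gapminder)

summary_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    across(
      c(lifeExp, gdpPercap),
      list(mean = mean, sd = sd),
      .names = "{.fn}_{.col}"  # put the function name first
    ),
    .groups = "drop"
  )

names(summary_2007)
```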
5.3.4 recode(): Recode Values
# Recode values
gapminder %>%
mutate(
continent_short = recode(continent,
"Africa" = "AF",
"Americas" = "AM",
"Asia" = "AS",
"Europe" = "EU",
"Oceania" = "OC"
)
) %>%
select(country, continent, continent_short) %>%
head()
## # A tibble: 6 × 3
## country continent continent_short
## <fct> <fct> <fct>
## 1 Afghanistan Asia AS
## 2 Afghanistan Asia AS
## 3 Afghanistan Asia AS
## 4 Afghanistan Asia AS
## 5 Afghanistan Asia AS
## 6 Afghanistan Asia AS
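The dplyr 1.1 documentation marks `recode()` as superseded and points to `case_match()` as an alternative. The sketch below shows the equivalent mapping on a plain character vector; the vector `continents` and the `"Other"` fallback are made up for illustration, and the call assumes dplyr 1.1.0 or later.

```r
library(dplyr)  # case_match() requires dplyr >= 1.1.0

continents <- c("Africa", "Asia", "Europe", "Antarctica")

short <- case_match(
  continents,
  "Africa" ~ "AF",
  "Asia" ~ "AS",
  "Europe" ~ "EU",
  .default = "Other"  # used for any value not listed above
)

short
```

Note the direction of the mapping: `case_match()` uses `old ~ new` formulas, whereas `recode()` uses `old = new` arguments.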
5.3.5 case_when(): Multiple Conditional Logic
# Multiple conditions with case_when
gapminder %>%
mutate(
development_level = case_when(
gdpPercap < 1000 ~ "Low income",
gdpPercap < 5000 ~ "Lower middle income",
gdpPercap < 15000 ~ "Upper middle income",
gdpPercap >= 15000 ~ "High income",
TRUE ~ "Unknown" # Default case
),
pop_size = case_when(
pop < 10000000 ~ "Small",
pop < 50000000 ~ "Medium",
pop >= 50000000 ~ "Large"
)
) %>%
select(country, year, gdpPercap, development_level, pop, pop_size) %>%
filter(year == 2007) %>%
head(10)
## # A tibble: 10 × 6
## country year gdpPercap development_level pop pop_size
## <fct> <int> <dbl> <chr> <int> <chr>
## 1 Afghanistan 2007 975. Low income 31889923 Medium
## 2 Albania 2007 5937. Upper middle income 3600523 Small
## 3 Algeria 2007 6223. Upper middle income 33333216 Medium
## 4 Angola 2007 4797. Lower middle income 12420476 Medium
## 5 Argentina 2007 12779. Upper middle income 40301927 Medium
## 6 Australia 2007 34435. High income 20434176 Medium
## 7 Austria 2007 36126. High income 8199783 Small
## 8 Bahrain 2007 29796. High income 708573 Small
## 9 Bangladesh 2007 1391. Lower middle income 150448339 Large
## 10 Belgium 2007 33693. High income 10392226 Medium
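Two details of `case_when()` are easy to miss: conditions are checked from top to bottom and the first match wins, and values that match no condition become `NA` unless a default such as `TRUE ~ ...` is supplied. A small sketch on a made-up vector `gdp`:

```r
library(dplyr)

gdp <- c(500, 3000, 20000, NA)

# With a catch-all default as the last condition
with_default <- case_when(
  gdp < 1000 ~ "Low",
  gdp < 5000 ~ "Middle",   # only reached when gdp >= 1000
  gdp >= 5000 ~ "High",
  TRUE ~ "Unknown"
)

# Without a default, unmatched values (here the NA) stay NA
without_default <- case_when(
  gdp < 1000 ~ "Low",
  gdp < 5000 ~ "Middle",
  gdp >= 5000 ~ "High"
)

with_default
without_default
```

This is why the `pop_size` column in the example above would contain `NA` for any row with a missing population.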
5.3.6 replace_na(): Handle Missing Values
# Create data with missing values
data_with_na <- data.frame(
name = c("Alice", "Bob", NA, "David"),
age = c(25, NA, 35, 40),
city = c("New York", "Boston", "Chicago", NA)
)
print("Original data:")
## [1] "Original data:"
## name age city
## 1 Alice 25 New York
## 2 Bob NA Boston
## 3 <NA> 35 Chicago
## 4 David 40 <NA>
# Replace NA values
data_cleaned <- data_with_na %>%
mutate(
name = replace_na(name, "Unknown"),
age = replace_na(age, median(age, na.rm = TRUE)),
city = replace_na(city, "Not specified")
)
print("After replacing NA values:")
## [1] "After replacing NA values:"
## name age city
## 1 Alice 25 New York
## 2 Bob 35 Boston
## 3 Unknown 35 Chicago
## 4 David 40 Not specified
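`replace_na()` also accepts a whole data frame together with a named list of replacements, which avoids the `mutate()` wrapper. A sketch of that form on the same data (the name `data_cleaned_alt` is made up):

```r
library(tidyr)

data_with_na <- data.frame(
  name = c("Alice", "Bob", NA, "David"),
  age = c(25, NA, 35, 40),
  city = c("New York", "Boston", "Chicago", NA)
)

# One replacement value per column, matched by name
data_cleaned_alt <- replace_na(
  data_with_na,
  list(
    name = "Unknown",
    age = median(data_with_na$age, na.rm = TRUE),
    city = "Not specified"
  )
)

data_cleaned_alt
```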
5.3.7 which(): Find Positions
# Using which to find positions
sample_vector <- c(10, 25, 30, 15, 40, 35)
# Find positions where condition is TRUE
positions <- which(sample_vector > 20)
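print(paste("Positions where value > 20:", paste(positions, collapse = ", ")))
## [1] "Positions where value > 20: 2, 3, 5, 6"
# In a data frame, rank() and filter() do a similar job without which():
# keep the five countries with the highest GDP per capita in 2007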
gapminder %>%
filter(year == 2007) %>%
mutate(rank_gdp = rank(desc(gdpPercap))) %>%
filter(rank_gdp <= 5) %>%
select(country, gdpPercap, rank_gdp) %>%
arrange(rank_gdp)
## # A tibble: 5 × 3
## country gdpPercap rank_gdp
## <fct> <dbl> <dbl>
## 1 Norway 49357. 1
## 2 Kuwait 47307. 2
## 3 Singapore 47143. 3
## 4 United States 42952. 4
## 5 Ireland 40676. 5
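The positions returned by `which()` are most useful as an index. The base R sketch below reuses `sample_vector` to pull out the matching values and adds the related helpers `which.max()` and `which.min()`.

```r
sample_vector <- c(10, 25, 30, 15, 40, 35)

# Positions where the condition holds
positions <- which(sample_vector > 20)

# Use the positions to extract the matching values
matching_values <- sample_vector[positions]

# Position of the largest and smallest value
largest_at <- which.max(sample_vector)
smallest_at <- which.min(sample_vector)

matching_values
largest_at
smallest_at
```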
5.4 Practical Example: Data Analysis Workflow
This example analyzes life expectancy patterns across continents, decades, and economic development levels using the gapminder dataset. The goal is to see how life expectancy varies when we consider geographic location, time period (1990s vs 2000s), and economic status (GDP per capita categories) at the same time. The result is a summary of average life expectancy for each combination of these factors, with groups of fewer than three observations removed because their averages are unreliable.
Workflow:
- Create categorical variables for time periods and GDP levels
- Group by continent, decade, and GDP category
- Calculate summary statistics
- Remove unreliable small groups
- Arrange results for easy interpretation
# Comprehensive analysis: Life expectancy trends by continent
life_exp_analysis <- gapminder %>%
# Filter for recent decades
filter(year >= 1990) %>%
# Create additional variables
mutate(
decade = case_when(
year %in% 1990:1999 ~ "1990s",
year %in% 2000:2007 ~ "2000s"
),
gdp_category = case_when(
gdpPercap < 2000 ~ "Low GDP",
gdpPercap < 10000 ~ "Medium GDP",
gdpPercap >= 10000 ~ "High GDP"
)
) %>%
# Group and summarize
group_by(continent, decade, gdp_category) %>%
summarise(
avg_life_exp = mean(lifeExp),
countries = n(), # counts country-year rows, so a country can appear more than once per decade
total_pop = sum(pop),
.groups = 'drop'
) %>%
# Filter out small groups
filter(countries >= 3) %>%
# Arrange results
arrange(continent, decade, desc(avg_life_exp))
print("Life expectancy analysis by continent, decade, and GDP category:")
## [1] "Life expectancy analysis by continent, decade, and GDP category:"
## # A tibble: 21 × 6
## continent decade gdp_category avg_life_exp countries total_pop
## <fct> <chr> <chr> <dbl> <int> <dbl>
## 1 Africa 1990s Medium GDP 61.9 28 381335877
## 2 Africa 1990s Low GDP 50.3 74 1019466696
## 3 Africa 2000s Medium GDP 59.7 27 632430409
## 4 Africa 2000s High GDP 58.5 7 13862646
## 5 Africa 2000s Low GDP 51.5 70 1116970553
## 6 Americas 1990s High GDP 75.1 10 689423253
## 7 Americas 1990s Medium GDP 69.9 38 833511034
## 8 Americas 2000s High GDP 76.3 15 976865109
## 9 Americas 2000s Medium GDP 72.3 33 755668372
## 10 Asia 1990s High GDP 75.0 21 479574324
## # ℹ 11 more rows
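In the output above, `countries` comes from `n()`, which counts rows. Because each decade contains two survey years (1992 and 1997, then 2002 and 2007), a country can be counted twice. The sketch below contrasts `n()` with `n_distinct(country)` on a simpler grouping; the names `group_sizes` and `observations` are made up for this illustration.

```r
library(dplyr)
library(gapminder)

group_sizes <- gapminder %>%
  filter(year >= 1990) %>%
  mutate(decade = if_else(year < 2000, "1990s", "2000s")) %>%
  group_by(continent, decade) %>%
  summarise(
    observations = n(),               # country-year rows in the group
    countries = n_distinct(country),  # unique countries in the group
    .groups = "drop"
  )

group_sizes
```

Here every group has twice as many observations as countries, since each country contributes two years per decade.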
5.5 Function Summary Table
Here’s a summary of the functions covered. Most come from tidyverse packages; as.numeric(), as.character(), and which() are base R:
| Function | Description | Example |
|---|---|---|
| `select()` | Choose specific columns from a dataset | `select(country, year, pop)` |
| `filter()` | Choose specific rows based on conditions | `filter(year == 2007, pop > 1000000)` |
| `mutate()` | Create new columns or modify existing ones | `mutate(gdp_total = pop * gdpPercap)` |
| `transmute()` | Create new columns and drop all others | `transmute(country, gdp_billions = gdp/1e9)` |
| `arrange()` | Sort rows by one or more columns | `arrange(desc(lifeExp))` |
| `group_by()` | Group rows for grouped operations | `group_by(continent, year)` |
| `summarise()` | Calculate summary statistics for groups | `summarise(mean_life = mean(lifeExp))` |
| `rename()` | Change column names | `rename(life_expectancy = lifeExp)` |
| `pivot_longer()` | Convert wide data to long format (tidy) | `pivot_longer(cols = c(col1, col2))` |
| `pivot_wider()` | Convert long data to wide format | `pivot_wider(names_from = key)` |
| `across()` | Apply functions to multiple columns at once | `across(c(col1, col2), mean)` |
| `recode()` | Recode values in a column | `recode(continent, 'Asia' = 'AS')` |
| `case_when()` | Handle multiple conditional logic | `case_when(pop < 1e6 ~ 'Small')` |
| `replace_na()` | Replace missing (NA) values | `replace_na(column, 'Unknown')` |
| `as.numeric()` | Convert to numeric data type | `as.numeric(character_column)` |
| `as.character()` | Convert to character data type | `as.character(numeric_column)` |
| `which()` | Find positions where condition is TRUE | `which(vector > threshold)` |
5.6 Conclusion
The tidyverse provides a consistent and intuitive grammar for data manipulation in R. By mastering these core functions, you’ll be able to efficiently clean, transform, and analyze data. The key principles to remember are:
- Start with tidy data: each variable in a column, each observation in a row
- Use the pipe operator (%>%) to chain operations together
- Combine functions to create powerful data manipulation workflows
- Group operations allow for sophisticated analyses across different categories