The {messy} R package turns clean datasets into realistic, messy ones containing typos, irregular missing values, inconsistent case, and unexpected whitespace. It is aimed at instructors who teach data cleaning and wrangling: students practice on data with the kinds of imperfections found in real datasets, rather than on examples that are already tidy.
Key Features:
- Random Data Alteration: Introduces realistic imperfections into clean datasets, allowing users to practice their data cleaning skills.
- Flexible Messiness Control: The messiness argument (for example, messiness = 0.7) sets how much of the data is altered, from a few minor changes to heavy distortion, so you can match the difficulty to your students.
Usage:
The package includes several core functions that help you create messy data:
messy()
Applies several types of messiness to a clean dataset in a single call:
set.seed(1234)
messy(ToothGrowth[1:10,])
Increase the messiness:
set.seed(1234)
messy(ToothGrowth[1:10,], messiness = 0.7)
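The examples call set.seed() before each function because the changes are random: fixing the seed makes the same messy output reproducible, for instance so that every student receives the same dataset. The following is a minimal base-R sketch of that idea. The helper corrupt_some() is hypothetical and not part of the package, it does not reflect how messy() is implemented, and it assumes that messiness behaves like a per-value probability of change.

```r
# Hypothetical helper (NOT part of {messy}): replace each value with NA
# with probability `messiness`, to show how a seed pins down random edits.
corrupt_some <- function(x, messiness = 0.1) {
  hit <- runif(length(x)) < messiness  # TRUE where a value gets altered
  x[hit] <- NA
  x
}

set.seed(1234)
a <- corrupt_some(ToothGrowth$len, messiness = 0.7)
set.seed(1234)
b <- corrupt_some(ToothGrowth$len, messiness = 0.7)

identical(a, b)  # TRUE: same seed, same altered positions
mean(is.na(a))   # proportion altered, which tracks `messiness`
```

Without the repeated set.seed() call, the two results would generally differ.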
add_whitespace()
Randomly adds whitespace to values. A number padded with spaces can no longer be stored as numeric, so affected numeric columns become character:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,])
Apply to specific columns:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,], cols = "supp")
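To see why whitespace changes a column's type, and what the matching cleaning step looks like, here is a small base-R sketch. It pads values by hand rather than calling add_whitespace(), so it illustrates the effect only, not the package's implementation.

```r
# Base-R illustration only; this is not how add_whitespace() is implemented.
len <- ToothGrowth$len[1:5]
class(len)                      # "numeric"

# A value padded with spaces has to be stored as text,
# so the whole column becomes character.
messy_len <- as.character(len)
messy_len[2] <- paste0(messy_len[2], " ")
messy_len[4] <- paste0(" ", messy_len[4])
class(messy_len)                # "character"

# Cleaning step: strip the whitespace, then convert back to numeric.
clean_len <- as.numeric(trimws(messy_len))
all.equal(clean_len, len)       # TRUE
```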
change_case()
Randomly switches the case of character or factor columns:
set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5)
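Inconsistent case matters because R treats each spelling as a separate category. The sketch below uses a made-up vector (not actual change_case() output) to show the problem and one way students might fix it in base R.

```r
# Base-R illustration only; the vector is made up for the example.
supp <- c("VC", "vc", "Vc", "OJ", "oj")  # two real groups, inconsistent case
length(unique(supp))                      # 5 distinct spellings

# Cleaning step: normalise the case, then rebuild the factor.
supp_clean <- factor(toupper(supp), levels = c("OJ", "VC"))
levels(supp_clean)                        # "OJ" "VC"
table(supp_clean)
```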
make_missing()
Randomly introduces missing values (NA) into your dataset:
set.seed(1234)
make_missing(ToothGrowth[1:10,])
Restrict the change to one column and use a different representation for missing values:
set.seed(1234)
make_missing(ToothGrowth[1:10,], cols = "supp", missing = "999")
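Sentinel values such as "999" are not recognised as missing by R, so students have to recode them before analysis. A minimal base-R sketch of that cleaning step, using a made-up vector rather than real make_missing() output:

```r
# Base-R illustration; the vector below is made up for the example.
supp <- c("VC", "999", "OJ", "VC", "999")
sum(is.na(supp))             # 0: "999" is just another string to R

# Cleaning step: recode the sentinel "999" to a real NA.
supp[supp == "999"] <- NA
sum(is.na(supp))             # 2
```

When the data comes from a file, the same recoding can happen at import time, for example with the na.strings argument of read.csv().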
Combining Functions:
Chain the functions with the native pipe (|>, available from R 4.1) to apply several transformations in sequence:
set.seed(1234)
ToothGrowth[1:10,] |>
  make_missing(cols = "supp", missing = " ") |>
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  add_whitespace(cols = "supp", messiness = 0.5)
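For the student's side of the exercise, a matching cleaning pipeline might look like the sketch below. The data frame messy_df is made up to resemble the kind of result the pipeline above could produce (blank strings and padded values in supp, NA and 999 in the numeric columns); it is not actual package output.

```r
# Made-up data frame for illustration; not actual {messy} output.
messy_df <- data.frame(
  len  = c(4.2, NA, 999, 5.8),
  supp = c("VC ", " ", "VC", " VC"),
  dose = c(0.5, 999, 0.5, NA)
)

clean_df <- messy_df |>
  # Strip padding first, so blank entries collapse to "".
  transform(supp = trimws(supp)) |>
  # Recode empty strings and the 999 sentinel to NA.
  transform(
    supp = factor(ifelse(supp == "", NA, supp)),
    len  = ifelse(len == 999, NA, len),
    dose = ifelse(dose == 999, NA, dose)
  )

str(clean_df)
colSums(is.na(clean_df))  # missing values per column after cleaning
```

Trimming before recoding matters here: a blank entry such as " " only becomes an empty string once the whitespace is removed.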
The {messy} package lets R users reproduce, on demand, the data problems they will meet in real analyses. Whether you use it in a classroom or for your own practice, it provides a controlled and repeatable way to build data cleaning skills.