Demetri Pananos Ph.D - Cleaning Data Patterns

I still clean data from time to time, and I find myself writing patterns that look like

library(dplyr)   # mutate(), case_match() (dplyr >= 1.1.0)
library(stringr) # str_to_lower()

raw_data %>%
  mutate(
    smoking = case_match(str_to_lower(smoking),
      "y" ~ "Y",
      "n" ~ "N",
      .default = "N"
    ),
    hypertension = case_match(str_to_lower(hypertension),
      "y" ~ "Y",
      "n" ~ "N",
      .default = "N"
    ),
    diabetes = case_match(str_to_lower(diabetes),
      "y" ~ "Y",
      "n" ~ "N",
      .default = "N")
  )

# A tibble: 10 × 4
      id smoking hypertension diabetes
   <int> <chr>   <chr>        <chr>   
 1     1 Y       Y            N       
 2     2 N       Y            Y       
 3     3 Y       N            Y       
 4     4 N       N            N       
 5     5 N       N            N       
 6     6 N       N            N       
 7     7 N       Y            Y       
 8     8 Y       N            N       
 9     9 N       N            N       
10    10 N       N            N

I know the aphorism is “If you write a piece of code 3 times, write a function” so I could do

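clean_yn <- function(x) {
  # Lowercase first so "Y"/"y" and "N"/"n" are treated alike.
  # Note: .default also sends NA and any unexpected value to "N".
  case_match(str_to_lower(x),
    "y" ~ "Y",
    "n" ~ "N",
    .default = "N"
  )
}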

raw_data %>%
  mutate(
    across(
      all_of(c("smoking", "hypertension", "diabetes")),
      ~ clean_yn(.x)
    )
  )

# A tibble: 10 × 4
      id smoking hypertension diabetes
   <int> <chr>   <chr>        <chr>   
 1     1 Y       Y            N       
 2     2 N       Y            Y       
 3     3 Y       N            Y       
 4     4 N       N            N       
 5     5 N       N            N       
 6     6 N       N            N       
 7     7 N       Y            Y       
 8     8 Y       N            N       
 9     9 N       N            N       
10    10 N       N            N

But I usually have to edit the name of the column too, because often smoking is actually Smoking (Yes or No) (and while {janitor} helps, it still leaves some cruft on the end of column names). This makes the first pattern more attractive, despite the repetition.
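To make the "cruft" concrete, here is a small sketch of what {janitor} does with headers like that. The column names in raw_names are hypothetical, made up for illustration and not taken from the real data.

```r
library(janitor)

# Hypothetical raw headers, as they might arrive from a spreadsheet export
raw_names <- c("ID", "Smoking (Yes or No)", "Hypertension (Yes or No)")

# make_clean_names() is the vector version of clean_names()
cleaned <- make_clean_names(raw_names)
cleaned
# "id" "smoking_yes_or_no" "hypertension_yes_or_no"
```

The names are now valid and consistent, but the _yes_or_no suffix is still there, so a rename step is needed either way.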

I asked Gemini how it might handle this problem, and I was shown a neat little pattern I want to share. The mutate-across pattern is actually pretty useful, so let’s abstract that into a function and tack on a rename like so

apply_cleaners <- function(data, clean_func, rename_map){
  data |> 
    mutate(
      across(all_of(unname(rename_map)), ~clean_func(.x))
    ) |> 
    rename(all_of(rename_map))
}
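The named vector does double duty in the function above: its values are the existing column names and its names are the new ones. A minimal sketch of the mechanics, using a hypothetical two-column tibble and toupper as a stand-in cleaner:

```r
library(dplyr)

# Hypothetical toy data, just to show the mechanics
toy <- tibble(smoking = c("y", "N"), diabetes = c("n", "Y"))

# Names are the NEW column names, values are the EXISTING ones
rename_map <- c(smoking_cleaned = "smoking", diabetes_cleaned = "diabetes")

# unname() drops the new names, leaving the plain existing column names
# for across() to select
unname(rename_map)

new_names <- toy |>
  mutate(across(all_of(unname(rename_map)), toupper)) |>
  rename(all_of(rename_map)) |> # new_name = old_name, taken from the vector
  names()

new_names
# "smoking_cleaned" "diabetes_cleaned"
```

Cleaning happens before the rename, so the cleaner always sees the original column names.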

Now, I can specify a list of columns I want to clean with the same function, and the new names I want to use

cleaning_tasks <- list(
  rename_map = c('smoking_cleaned' = 'smoking', 
                 'hypertension_cleaned' = 'hypertension',
                 'diabetes_cleaned' = 'diabetes'),
  func = clean_yn
)

apply_cleaners(raw_data, cleaning_tasks$func, cleaning_tasks$rename_map)

# A tibble: 10 × 4
      id smoking_cleaned hypertension_cleaned diabetes_cleaned
   <int> <chr>           <chr>                <chr>           
 1     1 Y               Y                    N               
 2     2 N               Y                    Y               
 3     3 Y               N                    Y               
 4     4 N               N                    N               
 5     5 N               N                    N               
 6     6 N               N                    N               
 7     7 N               Y                    Y               
 8     8 Y               N                    N               
 9     9 N               N                    N               
10    10 N               N                    N

This doesn’t seem that cool until you realize that apply_cleaners returns a tibble, which can be passed through apply_cleaners again with a different cleaning function and rename map. Hence, reduce seems like a good tool here

# Needs to be a list of lists 
cleaning_tasks <- list(
  "yn_cleaning_tasks" = list(
    rename_map = c('smoking_cleaned' = 'smoking', 
                   'hypertension_cleaned' = 'hypertension',
                   'diabetes_cleaned' = 'diabetes'),
    func = clean_yn
  )
)


reduce(cleaning_tasks, function(data, task) {
  apply_cleaners(data, task$func, task$rename_map)
}, .init=raw_data)

# A tibble: 10 × 4
      id smoking_cleaned hypertension_cleaned diabetes_cleaned
   <int> <chr>           <chr>                <chr>           
 1     1 Y               Y                    N               
 2     2 N               Y                    Y               
 3     3 Y               N                    Y               
 4     4 N               N                    N               
 5     5 N               N                    N               
 6     6 N               N                    N               
 7     7 N               Y                    Y               
 8     8 Y               N                    N               
 9     9 N               N                    N               
10    10 N               N                    N

This is overkill for many cleaning tasks, but when you’re cleaning dozens of columns, I think this is nice! It is very targets-y, which I like.
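The payoff shows up once there is more than one task in the list. Below is a sketch with a second task, assuming a hypothetical weight column and a hypothetical clean_kg cleaner (neither is in the data above). clean_yn and apply_cleaners are repeated so the block runs on its own.

```r
library(dplyr)
library(stringr)
library(purrr)

# Repeated from above so the sketch is self-contained
clean_yn <- function(x) {
  case_match(str_to_lower(x), "y" ~ "Y", "n" ~ "N", .default = "N")
}

apply_cleaners <- function(data, clean_func, rename_map) {
  data |>
    mutate(across(all_of(unname(rename_map)), ~ clean_func(.x))) |>
    rename(all_of(rename_map))
}

# Hypothetical second cleaner: strip a trailing "kg" and convert to numeric
clean_kg <- function(x) as.numeric(str_remove(x, "\\s*kg$"))

# Hypothetical messy data
toy <- tibble(
  id = 1:3,
  smoking = c("y", "N", NA),
  weight = c("70 kg", "82kg", "65")
)

# One entry per cleaner; each entry carries its own rename map
cleaning_tasks <- list(
  yn_cleaning_tasks = list(
    rename_map = c(smoking_cleaned = "smoking"),
    func = clean_yn
  ),
  kg_cleaning_tasks = list(
    rename_map = c(weight_kg = "weight"),
    func = clean_kg
  )
)

# Each step receives the tibble returned by the previous step
cleaned <- reduce(cleaning_tasks, function(data, task) {
  apply_cleaners(data, task$func, task$rename_map)
}, .init = toy)

cleaned
```

Adding another kind of cleaning is then just another entry in cleaning_tasks; the reduce call itself does not change.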