r - A loop for log transformation -
my task write function, aims calculate logarithms of given variables (vars
) in given data set (dset
) levels of declared variable (byvar
). if minimum of given variable given level of byvar
greater 0, simple natural logarithm calculated. otherwise, new value of given variable given segment calculated as:
new.value = log(old.value + 1 + abs(min.value.of.given.var.for.given.level)
in order achieve this, wrote such code (for reproducible example):
set.seed(1234567) data(iris) iris$random <- rnorm(nrow(iris), 0, 1) log.vars <- function(dset, vars, byvar, verbose = f){ # loop levels of "byvar" for(i in 1:length(unique(dset[[byvar]]))){ if(verbose == t){ print(paste0("------ level=", unique(dset[[byvar]])[i], "----")) } # loop variables in "vars" for(j in 1:length(vars)){ min.var <- min(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]]) # if minimum of given variable given level greater 0 # calculate logarithm; # otherwise, add value 1 , mode of minimum , calculate # logarithm dset[[paste0("ln_", vars[j])]][dset[[byvar]] == unique(dset[[byvar]])[i]] <- if(min.var > 0){ log(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]]) } else{ log(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]] + 1 + abs(min.var)) } } } return(dset) } iris2 <- log.vars(dset = iris, vars = c("sepal.length", "random", "sepal.width"), byvar = "species", verbose = t) head(iris2)
it works, however, there clear problem readability. additionally, wonder if performance enhanced. last not least, aim preserve order of observations in data set. kind of help/suggestions appreciated
turning comments answer:
don't reinvent wheel. there way "do function group" in base
(tapply
, ave
), data.table
, plyr
, , dplyr
. don't have provide function:
my_log = function(x) { m = min(x) if (m > 0) return(log(x)) return(log1p(x - m)) }
the above implements log describe. since want run on same grouping multiple columns, dplyr::mutate_each
can make our lives easy:
library(dplyr) iris %>% group_by(species) %>% mutate_each(funs = funs(logged = my_log)) # source: local data frame [150 x 11] # groups: species [3] # # sepal.length sepal.width petal.length petal.width species random sepal.length_logged # <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl> # 1 5.1 3.5 1.4 0.2 setosa 0.156703769 1.629241 # 2 4.9 3.0 1.4 0.2 setosa 1.373811191 1.589235 # 3 4.7 3.2 1.3 0.2 setosa 0.730670244 1.547563 # 4 4.6 3.1 1.5 0.2 setosa -1.350800927 1.526056 # 5 5.0 3.6 1.4 0.2 setosa -0.008514961 1.609438 # 6 5.4 3.9 1.7 0.4 setosa 0.320981863 1.686399 # 7 4.6 3.4 1.4 0.3 setosa -1.778148409 1.526056 # 8 5.0 3.4 1.5 0.2 setosa 0.909503835 1.609438 # 9 4.4 2.9 1.4 0.2 setosa -0.919404336 1.481605 # 10 4.9 3.1 1.5 0.1 setosa -0.157714831 1.589235 # # ... 140 more rows, , 4 more variables: sepal.width_logged <dbl>, petal.length_logged <dbl>, # # petal.width_logged <dbl>, random_logged <dbl>
and that's there it! seems nice, concise, , readable. if you'd "functionalize" more, can wrap function, below, same result:
log_vars = function(data, vars, byvar) { data %>% group_by_(byvar) %>% mutate_each_(funs = funs(logged = my_log), vars = vars) %>% return } log_vars(iris, vars = c("sepal.width", "random"), byvar = "species")
regarding 3 asks:
- readable - seems more readable. can rewritten without
%>%
pipes if prefer. - performance - faster counts: largeish data lots of groups.
- order - order of rows not changed.
Comments
Post a Comment