r - A loop for log transformation -


my task write function, aims calculate logarithms of given variables (vars) in given data set (dset) levels of declared variable (byvar). if minimum of given variable given level of byvar greater 0, simple natural logarithm calculated. otherwise, new value of given variable given segment calculated as:

new.value =  log(old.value + 1 + abs(min.value.of.given.var.for.given.level) 

in order achieve this, wrote such code (for reproducible example):

set.seed(1234567)    data(iris) iris$random <- rnorm(nrow(iris), 0, 1)  log.vars <- function(dset, vars, byvar, verbose = f){    # loop levels of "byvar"    for(i in 1:length(unique(dset[[byvar]]))){      if(verbose == t){       print(paste0("------ level=", unique(dset[[byvar]])[i], "----"))     }      # loop variables in "vars"      for(j in 1:length(vars)){        min.var <- min(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]])        # if minimum of given variable given level greater 0       # calculate logarithm;       # otherwise, add value 1 , mode of minimum , calculate       # logarithm        dset[[paste0("ln_", vars[j])]][dset[[byvar]] == unique(dset[[byvar]])[i]] <-          if(min.var > 0){           log(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]])         } else{           log(dset[[vars[j]]][dset[[byvar]] == unique(dset[[byvar]])[i]] + 1 +               abs(min.var))         }     }   }   return(dset) }  iris2 <- log.vars(dset = iris,          vars = c("sepal.length", "random", "sepal.width"),          byvar = "species",          verbose = t)  head(iris2) 

it works, however, there clear problem readability. additionally, wonder if performance enhanced. last not least, aim preserve order of observations in data set. kind of help/suggestions appreciated

turning comments answer:

don't reinvent wheel. there way "do function group" in base (tapply , ave), data.table, plyr, , dplyr. don't have provide function:

my_log = function(x) {     m = min(x)     if (m > 0) return(log(x))     return(log1p(x - m)) } 

the above implements log describe. since want run on same grouping multiple columns, dplyr::mutate_each can make our lives easy:

library(dplyr) iris %>% group_by(species) %>%     mutate_each(funs = funs(logged = my_log)) # source: local data frame [150 x 11] # groups: species [3] #  #    sepal.length sepal.width petal.length petal.width species       random sepal.length_logged #           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>        <dbl>               <dbl> # 1           5.1         3.5          1.4         0.2  setosa  0.156703769            1.629241 # 2           4.9         3.0          1.4         0.2  setosa  1.373811191            1.589235 # 3           4.7         3.2          1.3         0.2  setosa  0.730670244            1.547563 # 4           4.6         3.1          1.5         0.2  setosa -1.350800927            1.526056 # 5           5.0         3.6          1.4         0.2  setosa -0.008514961            1.609438 # 6           5.4         3.9          1.7         0.4  setosa  0.320981863            1.686399 # 7           4.6         3.4          1.4         0.3  setosa -1.778148409            1.526056 # 8           5.0         3.4          1.5         0.2  setosa  0.909503835            1.609438 # 9           4.4         2.9          1.4         0.2  setosa -0.919404336            1.481605 # 10          4.9         3.1          1.5         0.1  setosa -0.157714831            1.589235 # # ... 140 more rows, , 4 more variables: sepal.width_logged <dbl>, petal.length_logged <dbl>, # #   petal.width_logged <dbl>, random_logged <dbl> 

and that's there it! seems nice, concise, , readable. if you'd "functionalize" more, can wrap function, below, same result:

log_vars = function(data, vars, byvar) {     data %>% group_by_(byvar) %>%         mutate_each_(funs = funs(logged = my_log), vars = vars) %>%         return }  log_vars(iris, vars = c("sepal.width", "random"), byvar = "species") 

regarding 3 asks:

  1. readable - seems more readable. can rewritten without %>% pipes if prefer.
  2. performance - faster counts: largeish data lots of groups.
  3. order - order of rows not changed.

Comments

Popular posts from this blog

mysql - Dreamhost PyCharm Django Python 3 Launching a Site -

java - Sending SMS with SMSLib and Web Services -

java - How to resolve The method toString() in the type Object is not applicable for the arguments (InputStream) -