R - Aggregate entries in table by subset of column ID characters
I am working on a gene expression dataset using R. I am new to coding, so please forgive me if I do not describe my problem in adequate detail.
My dataset looks like this:
    geneid       sample1  sample2
    slc26a5-001        7        8
    slc26a5-002        1        2
    homer2-001         6        5
    slc26a5-200        8       10
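The gene name is the first part of the ID (slc26a5), and the transcript number is denoted by the suffix (-001). I need to find a way to collapse all of the different transcript IDs for a gene and sum their respective rows at the same time. The output should look like the following: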
    geneid   sample1  sample2
    slc26a5       16       20
    homer2         6        5
The aggregate function should work for summing rows based on gene ID. I am stuck because I cannot figure out how to refer to only the first part of the gene ID inside the aggregate function.
Does anyone know how to do this?
Thanks for your help!
The main thing is to remove the tail part of the geneid column to standardize the grouping. That is done below with sub(). After that, it's a pretty standard aggregation with aggregate(), as follows.
    aggregate(df[-1], list(geneid = sub("-.*", "", df$geneid)), sum)
    #    geneid sample1 sample2
    # 1  homer2       6       5
    # 2 slc26a5      16      20
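To see what the grouping key looks like on its own, here is a minimal sketch of the sub() step in isolation. The vector names ids and genes are made up for illustration. The pattern "-.*" matches from the first hyphen to the end of the string, so everything from that hyphen onward is dropped.

```r
# Hypothetical vector of transcript IDs, copied from the question's example
ids <- c("slc26a5-001", "slc26a5-002", "homer2-001", "slc26a5-200")

# Remove the first hyphen and everything after it, leaving the gene name
genes <- sub("-.*", "", ids)
genes
# [1] "slc26a5" "slc26a5" "homer2"  "slc26a5"
```

If a gene name could itself contain a hyphen (an assumption, nothing in the question's data shows this), a pattern anchored to the end of the string, such as "-[^-]*$", would strip only the final suffix instead.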
We could also use rowsum(), so as not to convert the data unnecessarily.
    rowsum(df[-1], sub("-.*", "", df$geneid))
    #         sample1 sample2
    # homer2        6       5
    # slc26a5      16      20
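Note that rowsum() keeps the gene names as row names rather than as a geneid column. If the column layout from the question is wanted, one possible sketch is shown here. The names sums and result are hypothetical, and the data frame is rebuilt from the question's example so the snippet runs on its own.

```r
# Example data rebuilt from the question
df <- data.frame(
  geneid  = c("slc26a5-001", "slc26a5-002", "homer2-001", "slc26a5-200"),
  sample1 = c(7L, 1L, 6L, 8L),
  sample2 = c(8L, 2L, 5L, 10L)
)

# Sum rows by gene; the group labels end up in the row names
sums <- rowsum(df[-1], sub("-.*", "", df$geneid))

# Move the row names into a regular geneid column
result <- data.frame(geneid = rownames(sums), sums, row.names = NULL)
result
```

By default rowsum() sorts the groups, so homer2 comes before slc26a5 here, as in the aggregate() output.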
Data:

    df <- structure(list(
        geneid = structure(c(2L, 3L, 1L, 4L),
            .Label = c("homer2-001", "slc26a5-001", "slc26a5-002", "slc26a5-200"),
            class = "factor"),
        sample1 = c(7L, 1L, 6L, 8L),
        sample2 = c(8L, 2L, 5L, 10L)),
        .Names = c("geneid", "sample1", "sample2"),
        class = "data.frame", row.names = c(NA, -4L))