
I'm trying to subset a given data.table

library(data.table)

DT <- data.table(
  a = 1:20,
  b = 3:4,   # recycled to length 20
  c = 5:14,  # recycled to length 20
  d = 1:4    # recycled to length 20
)

within a function by a parameter which is a named list

param <- list(a = 1:10,
              b = 2:3,
              c = c(5, 7, 10))

I am a bit stuck here, and I certainly do not want to implement something ugly like the following, especially since it's not very dynamic:

DT[(if (!is.null(param$a)) a %in% param$a else TRUE) &
   (if (!is.null(param$b)) b %in% param$b else TRUE) &
   (if (!is.null(param$c)) c %in% param$c else TRUE) &
   (if (!is.null(param$d)) d %in% param$d else TRUE)]
   a b c d
1: 1 3 5 1
2: 3 3 7 3

Any ideas how to achieve this elegantly in data.table or base R, using the names of the named list to subset the corresponding columns in the data.table by the associated values? Thanks!

EDIT

I performed a microbenchmark with some of the answers:

func_4 <- function(myp, DT) {
  # drop NULL entries so only the supplied filters are used
  myp <- Filter(Negate(is.null), myp)

  # build one `col %in% values` call per list element
  exs <- Map(function(var, val) call("%in%", var, val),
             var = sapply(names(myp), as.name),
             val = myp)
  # chain the calls together with `&`
  exi <- Reduce(function(x, y) call("&", x, y), exs)
  ex  <- call("[", x = as.name("DT"), i = exi)
  eval(ex)
}

library(microbenchmark)

microbenchmark(
  (DT[do.call(pmin, Map(`%in%`, DT[, names(param), with = FALSE], param)) == 1L]),
  (DT[rowSums(mapply(`%in%`, DT[, names(param), with = FALSE], param)) == length(param)]),
  (DT[do.call(CJ, param), on = names(param), nomatch = NULL]),
  (DT[expand.grid(param), on = names(param), nomatch = NULL]),
  (DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols = names(param)]$V1]),
  (func_4(myp = param, DT = DT)),
  times = 200)

(rows correspond to the expressions above, in order)

   min        lq      mean   median        uq       max neval
  446.656  488.5365  565.5597  511.403  533.7785  7167.847   200
  454.120  516.3000  566.8617  538.146  561.8965  1840.982   200
 2433.450 2538.6075 2732.4749 2606.986 2704.5285 10302.085   200
 2478.595 2588.7240 2939.8625 2642.311 2743.9375 10722.578   200
 2648.707 2761.2475 3040.4926 2814.177 2903.8845 10334.822   200
 3243.040 3384.6220 3764.5087 3484.423 3596.9140 14873.898   200
ANSWER (2 votes)

We can select the columns in DT named in param, apply %in% between each column and the corresponding list element, and keep only the rows where all comparisons are TRUE.

DT[which(rowSums(mapply(`%in%`, DT[, names(param), with = FALSE],
      param)) == length(param)), ]

#   a b c d
#1: 1 3 5 1
#2: 3 3 7 3
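To see what's happening under the hood, here is a small standalone sketch (with made-up stand-in vectors, not the question's DT): mapply() returns a logical matrix with one column per name in param, and rowSums() counts how many conditions each row satisfies.

```r
cols <- list(a = 1:3, b = c(3, 4, 3))   # stand-ins for DT's columns
vals <- list(a = 1:2, b = 3)            # stand-ins for param

m <- mapply(`%in%`, cols, vals)         # 3 x 2 logical matrix
m
#          a     b
# [1,]  TRUE  TRUE
# [2,]  TRUE FALSE
# [3,] FALSE  TRUE

rowSums(m) == length(vals)              # TRUE only where every column matched
# [1]  TRUE FALSE FALSE
```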
ANSWER (4 votes)

You can use the CJ (Cross Join) function from data.table to make a filtering table from the list.

lookup <- do.call(CJ, param)
head(lookup)
#    a b  c
# 1: 1 2  5
# 2: 1 2  7
# 3: 1 2 10
# 4: 1 3  5
# 5: 1 3  7
# 6: 1 3 10

DT[
    lookup,
    on = names(lookup),
    nomatch = NULL
]
#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

Note that nomatch = NULL means any combo in lookup that doesn't exist in DT won't return a row.

  • Very nice. But could I not use DT[expand.grid(param), on = names(param), nomatch = NULL] instead? – jaydee Apr 17 at 16:33
  • @jaydee: Yes, CJ is pretty much the data.table version of expand.grid. Two benefits of CJ: it doesn't convert strings to factors, and it handles list input consistently. See the differences between expand.grid(b = list(1, 2, 3)), expand.grid(a = 1, b = list(1, 2, 3)), and how CJ handles those inputs. – Nathan Werth Apr 17 at 17:19
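A quick sketch of the string-handling difference mentioned above (eg and cj are made-up names; exact printing may vary by data.table version):

```r
library(data.table)

# expand.grid() still defaults to stringsAsFactors = TRUE
eg <- expand.grid(x = c("a", "b"), y = 1:2)
class(eg$x)   # "factor"

# CJ() leaves character vectors alone (and sorts/keys the result)
cj <- CJ(x = c("a", "b"), y = 1:2)
class(cj$x)   # "character"
```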
ANSWER (4 votes)

Using Map we can do

DT[DT[, all(Map(`%in%`, .SD, param)), by = 1:nrow(DT)]$V1]
#   a b c d
#1: 1 3 5 1
#2: 3 3 7 3

For each row we check whether every value is present in the corresponding element of param.


Thanks to @Frank, this can be improved to

DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols=names(param)]$V1]
  • Thanks, this works but returns a lot of warnings: ... 34: In all(Map(`%in%`, .SD, param)) : coercing argument of type 'list' to logical 35: In mapply(FUN = f, ..., SIMPLIFY = FALSE) : longer argument not a multiple of length of shorter ... – jaydee Apr 17 at 13:30
ANSWER (3 votes)

You could build the expression with call(fun, ...) and as.name:

myp    = Filter(Negate(is.null), param)

exs = Map(function(var, val) call("%in%", var, val), var = sapply(names(myp), as.name), val = myp)
exi = Reduce(function(x,y) call("&", x, y), exs)
ex = call("[", x = as.name("DT"), i = exi)
# DT[i = a %in% 1:10 & b %in% 2:3 & c %in% c(5, 7, 10)]

eval(ex)
#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

By composing the call correctly, you can take advantage of efficient algorithms for "indices" in the data.table (see the package vignettes). You can also turn verbose on to get a note about the inefficiency of specifying param$c as numeric when DT$c is int:

> z <- as.call(c(as.list(ex), verbose=TRUE))
> eval(z)
Optimized subsetting with index 'c__b__a'
on= matches existing index, using index
Coercing double column i.'c' to integer to match type of x.'c'. Please avoid coercion for efficiency.
Starting bmerge ...done in 0.020sec 
   a b c d
1: 1 3 5 1
2: 3 3 7 3

That is, you should use c(5L, 7L, 10L).
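One way to avoid that coercion up front is to align the list's types with the table's column types before subsetting. A minimal sketch (param_typed is a made-up name; this assumes every name in param is a column of DT):

```r
# Coerce double-valued filters to integer where the matching DT column is integer
param_typed <- Map(function(nm, val) {
  if (is.integer(DT[[nm]]) && is.double(val)) as.integer(val) else val
}, names(param), param)

str(param_typed$c)   # now an integer vector, so no coercion during the subset
```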

A join, as in Nathan's answer, also uses indices, but building and joining on the Cartesian table of param will be costly if prod(lengths(param)) is large.


@markus's approach may be slow due to the by-row operation, so here is a variant:

DT[do.call(pmin, Map(`%in%`, DT[, names(param), with=FALSE], param)) == 1L]

#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

The trick is that the elementwise version of all is pmin(...) == 1L; likewise, any corresponds to pmax(...) == 1L. (This is why separate pany/pall functions aren't really needed; see this r-devel discussion: http://r.789695.n4.nabble.com/There-is-pmin-and-pmax-each-taking-na-rm-how-about-psum-td4647841.html)
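A tiny illustration of that equivalence:

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

pmin(x, y) == 1L   # elementwise all(): TRUE FALSE FALSE
pmax(x, y) == 1L   # elementwise any(): TRUE TRUE FALSE
```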
