
This is a speed optimization question.

Here is my sample data. The real data has over 100k rows and >300 columns.

library(data.table)
dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA
dt

#>     ref tgt1 tgt2
#>  1:   1   11   21
#>  2:   2   12   22
#>  3:   3   NA   23
#>  4:   4   14   NA
#>  5:   5   15   NA
#>  6:   6   16   26
#>  7:   7   17   27
#>  8:   8   NA   28
#>  9:   9   NA   29
#> 10:  10   20   30
#> 11:  11   21   31
#> 12:  12   22   32
#> 13:  13   23   33
#> 14:  14   24   34
#> 15:  15   NA   NA
#> 16:  16   NA   36
#> 17:  17   NA   NA
#> 18:  18   28   38
#> 19:  19   29   39
#> 20:  20   30   40

Some columns have NA at some positions, and my goal is to get the positions of the nearest non-NA flanking values. For instance, for the second column tgt1, I am using the following code

tgt = dt[, tgt1]
tgt.na = which(is.na(tgt))
tgt.non.na = which(!is.na(tgt))
start = sapply(tgt.na, function(x) max(tgt.non.na[tgt.non.na < x]))
stop = sapply(tgt.na, function(x) min(tgt.non.na[tgt.non.na > x]))
data.frame(start, stop)

#>   start stop
#> 1     2    4
#> 2     7   10
#> 3     7   10
#> 4    14   18
#> 5    14   18
#> 6    14   18

Here, for the tgt1 column, I get what I want: for example, for the NA at the 3rd row, the closest flanking non-NA values are at positions 2 and 4, and so on for the others. My issue is that the sapply calls are very slow. Imagine running this for >300 columns and 100k rows; in its current form it takes a few hours to finish. Ultimately, once these positions are found, they are used to index values from the ref column to compute the missing values in tgt1 and the other columns, but that is a topic for another time.
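
(For context, a minimal sketch of that follow-up step, assuming the simplest rule mentioned in the comments below (mean of the two flanking values) and reusing tgt, tgt.na, start, and stop from the code above; the real computation may instead pull values from ref or from other columns:)

tgt.filled <- tgt
tgt.filled[tgt.na] <- (tgt[start] + tgt[stop]) / 2  # e.g. row 3 becomes (12 + 14) / 2 = 13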

Is there any way I can make it faster? Is there a data.table solution for this?

Edit: These are all great solutions; here is my benchmark, and you can see that all proposed methods are lightning fast compared to my original sapply. I selected the lapply answer, not only because it is the fastest but also because it aligns well with my current code syntax.

Unit: milliseconds
           expr         min          lq        mean      median          uq         max neval
         sapply 3755.118949 3787.288609 3850.322669 3819.458269 3897.924530 3976.390790     3
 dt.thelatemail    9.145551    9.920238   10.242885   10.694925   10.791552   10.888180     3
  lapply.andrew    2.626525    3.038480    3.446682    3.450434    3.856760    4.263086     3
   zoo.chinsoon    6.457849    6.578099    6.629839    6.698349    6.715834    6.733318     3
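For reference, this is roughly how such a comparison can be set up (a reconstruction, not the exact code behind the table above; flank_sapply and flank_rle are hypothetical wrapper names for my original approach and Andrew's rle answer below, the other methods would be wrapped the same way, and the data is scaled down here so the example runs quickly):

library(data.table)
library(microbenchmark)

# scaled-down stand-in for the real data (the real table has 100k rows and >300 columns)
set.seed(42)
big <- data.table(ref = 1:1e4, tgt1 = 1:1e4)
big[sample(2:9999, 1000), tgt1 := NA_integer_]

# my original sapply approach, wrapped as a function (hypothetical name)
flank_sapply <- function(x) {
  na_pos <- which(is.na(x))
  ok_pos <- which(!is.na(x))
  data.frame(start = sapply(na_pos, function(i) max(ok_pos[ok_pos < i])),
             stop  = sapply(na_pos, function(i) min(ok_pos[ok_pos > i])))
}

# Andrew's rle-based approach (see his answer below), wrapped the same way
flank_rle <- function(x) {
  na_loc <- which(is.na(x))
  rle_x  <- rle(is.na(x))
  reps   <- rle_x$lengths[rle_x$values]
  start  <- (na_loc - 1)[!(na_loc - 1) %in% na_loc]
  end    <- (na_loc + 1)[!(na_loc + 1) %in% na_loc]
  data.frame(start = rep(start, reps), end = rep(end, reps))
}

microbenchmark(
  sapply        = flank_sapply(big$tgt1),
  lapply.andrew = flank_rle(big$tgt1),
  times = 3
)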
  • Can you please elaborate on your ultimate goal. How are you going "to compute the missing values"? Although you write that it is not the current topic, it may still be relevant. Cheers. – Henrik Apr 17 at 22:19
  • Can you have leading or trailing NA, or are they always flanked by non-NA values? – Henrik Apr 17 at 22:31
  • There are different ways we can compute the missing values; one simple way is to take the mean of the flanking values. Another is to take the corresponding non-missing values from other column(s) and do some computation. – SinghTheCoder Apr 17 at 22:54
  • In the real data, you can have trailing and leading NA, but they are not frequent so we don't worry about them – SinghTheCoder Apr 17 at 23:06
Answer (score: 3)

Here is a base R alternative using rle. I used lapply because I was not sure how you wanted to save all the output dataframes. Hope this helps!

dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA


lapply(dt[, -1], function(x) {
  na_loc <- which(is.na(x))             # positions of the NAs
  rle_x <- rle(is.na(x))
  reps <- rle_x$lengths[rle_x$values]   # length of each run of NAs

  # one position before/after each NA; dropping those that are themselves NA
  # leaves exactly one start and one end per run of NAs
  start <- na_loc - 1
  start <- start[!start %in% na_loc]
  end <- na_loc + 1
  end <- end[!end %in% na_loc]

  # repeat each run's start/end once per NA in that run
  data.frame(start = rep(start, reps),
             end = rep(end, reps))
})

$tgt1
   start end
1:     2   4
2:     7  10
3:     7  10
4:    14  18
5:    14  18
6:    14  18

$tgt2
   start end
1:     3   6
2:     3   6
3:    14  16
4:    16  18

It also scales fairly well on my laptop for a sample data frame with 300 columns:

df1 <- data.frame(ref = 1:1e5)
df1[paste0("tgt", 1:300)] <- replicate(300, sample(c(1:50, rep(NA, 5)), 1e5, replace = TRUE))

microbenchmark::microbenchmark(
  base = {
    lapply(df1[, -1], function(x) {
      na_loc <- which(is.na(x))
      rle_x <- rle(is.na(x))
      reps <- rle_x$lengths[rle_x$values]

      start <- na_loc - 1
      start <- start[!start %in% na_loc]
      end <- na_loc + 1
      end <- end[!end %in% na_loc]

      data.frame(start = rep(start, reps),
                 end = rep(end, reps))
    })
  },
  times = 5
)

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 base 1.863319 1.888617 1.897651 1.892166 1.898196 1.945954     5
  • Great solution. Thanks. Coding is one thing, but thinking like a coder is another. The way you decided to find the start and end positions is just elegant. Wonderful! – SinghTheCoder Apr 18 at 16:24
  • Thanks @nsinghs, glad it helped!! – Andrew Apr 18 at 17:03
  • Just to let you know: this method results in over 90% computational time savings. – SinghTheCoder Apr 18 at 19:58
  • That's fantastic, it was a really interesting problem to think through. I am glad you have something that works better, too!! – Andrew Apr 19 at 0:29
Answer (score: 3)

You should be able to take advantage of rleid to calculate the prior value to a run of NAs and then match it up. E.g.:

dt[, a := rleid(is.na(tgt1))]   # run id: runs of NA and non-NA alternate
# the non-NA run before each NA run has id a - 1; match() returns the FIRST
# position of a value, so matching against rev(a) (and indexing rev(ref))
# gives the LAST row of that preceding run
dt[, rev(ref)[match((a - 1)[is.na(tgt1)], rev(a))] ]
#[1]  2  7  7 14 14 14
# the non-NA run after each NA run has id a + 1; a plain match() gives its first row
dt[, ref[match((a + 1)[is.na(tgt1)], a)] ]
#[1]  4 10 10 18 18 18

Seems pretty quick to process 100k rows:

dt <- dt[rep(1:20,5e3),]
dt[, ref := 1:1e5]
system.time({
  dt[, a := rleid(is.na(tgt1))]
  dt[, rev(ref)[match((a-1)[is.na(tgt1)],rev(a))]]
  dt[, ref[match((a+1)[is.na(tgt1)],a)]]
})
#   user  system elapsed 
#   0.02    0.00    0.02
  • I am off work now but will try this tomorrow and see what benefit can I gain. – SinghTheCoder Apr 17 at 23:02
  • Thanks for the solution, I posted my benchmark in the question. All proposed solutions are quite fast and at par with each other. This is a good solution but it took me a while to understand what rev(ref)[match((a - 1)[is.na(tgt1)], rev(a))] was doing. Nice solution though. – SinghTheCoder Apr 18 at 16:21
Answer (score: 3)

Another possibility, using the zoo package:

library(zoo)
for (j in paste0("tgt", 1L:2L)) {
    print(dt[, {
        k <- is.na(get(j))
        x <- replace(ref, k, NA_integer_)
        .(start=na.locf0(x)[k], 
          end=na.locf0(x, fromLast=TRUE)[k])
    }])
}

output:

   start end
1:     2   4
2:     7  10
3:     7  10
4:    14  18
5:    14  18
6:    14  18
   start end
1:     3   6
2:     3   6
3:    14  16
4:    16  18
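
In case na.locf0 is unfamiliar: it is zoo's "last observation carried forward" helper, which fills each NA with the most recent non-NA value while leaving any leading NA untouched, and fromLast=TRUE fills backward instead. Replacing ref with NA wherever the target column is NA and then applying it in both directions is what produces the flanking positions above. A tiny illustration:

library(zoo)
x <- c(NA, 1, NA, NA, 4, NA)
na.locf0(x)                   # NA  1  1  1  4  4   (carry last non-NA forward)
na.locf0(x, fromLast = TRUE)  #  1  1  4  4  4  NA  (carry next non-NA backward)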

timing code:

library(data.table)
library(zoo)
sz <- 100e3
nc <- 400
dt <- data.table(ref=1L:sz, 
    as.data.table(matrix(sample(c(NA_integer_, 1L), sz*nc, replace=TRUE), ncol=nc)))

library(microbenchmark)
microbenchmark(
    mtd0=for (j in paste0("V", 1L:nc)) {
        k <- dt[,is.na(get(j))]
        dt[, a := rleid(k)][, 
            .(start=rev(ref)[match((a-1)[k],rev(a))], end=ref[match((a+1)[k],a)])]
    },
    mtd1=for (j in paste0("V", 1L:nc)) {
        dt[, {
            k <- is.na(get(j))
            x <- replace(ref, k, NA_integer_)
            .(start=na.locf0(x)[k], end=na.locf0(x, fromLast=TRUE)[k])
        }]
    },
    times=3L)

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval cld
 mtd0 6.638253 6.698023 6.730352 6.757794 6.776402 6.795010     3   b
 mtd1 4.832264 4.835764 4.854799 4.839264 4.866066 4.892867     3  a 

Not much diff in timings given the number of rows.

  • I will start using zoo package more from now on, very handy functions. – SinghTheCoder Apr 18 at 16:20
  • Look up the package authors and compare them with those who post answers here, and you will learn more from them :) – chinsoon12 Apr 18 at 23:25
