I have a data frame as follows:

S A B C D E 
1 N N N N N
2 N Y Y N N
3 Y N Y N N
4 Y N Y Y Y

How do I create a new column F that contains the most frequently occurring character across columns A, B, C, D, and E?

The output should look like the following:

 S A B C D E F
 1 N N N N N N
 2 N Y Y N N N
 3 Y N Y N N N
 4 Y N Y Y Y Y

We can create a Mode function (defined below) and apply it over the rows

df1$F <- apply(df1[-1], 1, Mode)
#  S A B C D E F
#1 1 N N N N N N
#2 2 N Y Y N N N
#3 3 Y N Y N N N
#4 4 Y N Y Y Y Y

Or another option is

df1$F <- c('N', 'Y')[max.col(table(c(row(df1[-1])), unlist(df1[-1])), 'first')]


where

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}

Or using tidyverse (with dplyr and purrr loaded)

df1 %>% 
    mutate(F = pmap_chr(.[-1], ~ Mode(c(...))))

Or another option, reshaping to long format with tidyr::gather, is

gather(df1, key, F, - S) %>% 
     group_by(S, F) %>% 
     summarise(n = n()) %>% 
     slice(which.max(n)) %>% 
     ungroup %>% 
     dplyr::select(F) %>% 
     bind_cols(df1, .)

Or we transpose the dataset, apply the Mode by each column and then bind the output as new column to original dataset

t(df1[-1]) %>%
   as.data.frame %>% 
   summarise_all(Mode) %>% 
   unlist %>%
   bind_cols(df1, F = .)

Or an option with data.table

setDT(df1)[,  F := names(which.max(table(unlist(.SD)))), S][]

NOTE: These are general methods that work for any set of values, rather than checks that only handle the two-value (N/Y) case.
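To illustrate what "general" means here, the following minimal sketch applies the same Mode function to a made-up data frame with three distinct values (df2 and its contents are hypothetical, not part of the question). It also shows the tie-breaking behaviour: because which.max returns the first maximum, a tie goes to the value that appears first in the row.

```r
# Mode as defined above: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical data with three distinct values (not from the question)
df2 <- data.frame(
  S = 1:3,
  A = c("N", "M", "Y"),
  B = c("M", "M", "N"),
  C = c("M", "Y", "N"),
  D = c("Y", "M", "Y"),
  stringsAsFactors = FALSE
)

df2$F <- apply(df2[-1], 1, Mode)
df2$F
# "M" "M" "Y"   (row 3 is a 2-2 tie between Y and N; Y appears first)
```

If ties need a different rule, Mode is the single place to change it.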

If we need an efficient method, without any ifelse, we can also do this by

df1$F <- c("Y", "N")[(rowSums(df1[-1] == "N") > 2) + 1]
#[1] "N" "N" "N" "Y"

Or with Reduce

c("Y", "N")[(Reduce(`+`, lapply(df1[-1], `==`, "N")) > 2) + 1]

Or another approach, with str_count from stringr, is

c("Y", "N")[(str_count(do.call(paste0, df1[-1]), "N") > 2) + 1]


df1 <- structure(list(S = 1:4, A = c("N", "N", "Y", "Y"), B = c("N", 
"Y", "N", "N"), C = c("N", "Y", "Y", "Y"), D = c("N", "N", "N", 
"Y"), E = c("N", "N", "N", "Y")), class = "data.frame", row.names = c(NA, 
-4L))

One dplyr possibility could be:

df %>%
 mutate(F = ifelse(rowSums(.[2:length(.)] == "N") > 2, "N", "Y"))

  S A B C D E F
1 1 N N N N N N
2 2 N Y Y N N N
3 3 Y N Y N N N
4 4 Y N Y Y Y Y

It assumes that the only values are N and Y and that there are exactly five value columns (hence the threshold of 2).

As @Sotos noted, it can easily be rewritten in base R:

df$F <- ifelse(rowSums(df[2:length(df)] == "N") > 2, "N", "Y")

Or without the assumption about the number of columns (based on @TinglTanglBob):

df %>%
 mutate(F = ifelse(rowMeans(.[2:length(.)] == "N") > 0.5, "N", "Y"))

The same with base R:

df$F <- ifelse(rowMeans(df[2:length(df)] == "N") > 0.5, "N", "Y")
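One caveat with the rowMeans form: with an even number of columns a row can hold equally many N and Y values. The share of N is then exactly 0.5, the test > 0.5 fails, and the row is labelled Y. A minimal sketch with a made-up four-column data frame (df_even is hypothetical, not from the question):

```r
# Hypothetical data with an even number of value columns
df_even <- data.frame(
  S = 1:3,
  A = c("N", "N", "Y"),
  B = c("N", "Y", "Y"),
  C = c("N", "N", "Y"),
  D = c("Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# Share of "N" in each row
share_n <- rowMeans(df_even[-1] == "N")
share_n
# 0.75 0.50 0.25

# Row 2 is a 2-2 tie, so it falls through to "Y"
df_even$F <- ifelse(share_n > 0.5, "N", "Y")
df_even$F
# "N" "Y" "Y"
```

Using >= 0.5 instead would send ties to N; which is preferable depends on the data.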
  • +1 Great idea, avoiding the apply() with margin 1. It could have been way better if you did not load dplyr just for the pipe and mutate. Just leave it in base R. In addition, you could also make 2 dynamic. Something like ceiling(ncol(df) / 2) – Sotos Apr 19 at 8:20
  • you could use rowMeans(...) > 0.5 to avoid making an assumption about the number of columns – TinglTanglBob Apr 19 at 8:26
  • @TinglTanglBob nice idea, I added it, thanks :) – tmfmnk Apr 19 at 8:31

An alternative, slightly different:

x$F <- unlist(do.call(Map, c(function(...) names(sort(-table(c(...)), partial=1)[1]), x[,-1])))
#   S A B C D E F
# 1 1 N N N N N N
# 2 2 N Y Y N N N
# 3 3 Y N Y N N N
# 4 4 Y N Y Y Y Y

Perhaps I'm just trying to produce obscure code now ...

I'm realizing this might be more general than absolutely necessary. This finds the most frequent "thing" regardless of how many different things exist among the rows.

The sort(..., partial=1) stops sorting after the first pass.
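Since the one-liner is admittedly dense, here is the same idea unpacked into named steps, as a sketch only. The helper name most_common is made up for illustration, and it uses a plain sort() rather than partial = 1 to keep the steps easy to follow. do.call(Map, ...) passes the i-th element of every column to the function, so each call sees one row.

```r
# The question's data, rebuilt as x
x <- data.frame(
  S = 1:4,
  A = c("N", "N", "Y", "Y"),
  B = c("N", "Y", "N", "N"),
  C = c("N", "Y", "Y", "Y"),
  D = c("N", "N", "N", "Y"),
  E = c("N", "N", "N", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: receives one row's values as separate arguments
most_common <- function(...) {
  counts <- table(c(...))   # frequency of each distinct value in the row
  names(sort(-counts)[1])   # negating puts the largest count first
}

# Map() calls most_common(A[i], B[i], C[i], D[i], E[i]) for each row i
x$F <- unname(unlist(do.call(Map, c(most_common, x[, -1]))))
x$F
# "N" "N" "N" "Y"
```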



d <- read.table(text ="S A B C D E 
1 N N N N N
2 N Y Y N N
3 Y N Y N N
4 Y N Y Y Y", header = TRUE, row.names = 1, stringsAsFactors = FALSE)

d$F <- with(
  stack(data.frame(t(as.matrix(d)), stringsAsFactors = FALSE)),
  tapply(values, ind, function(x) names(sort(table(x), decreasing = TRUE)[1])))

#A B C D E F
#1 N N N N N N
#2 N Y Y N N N
#3 Y N Y N N N
#4 Y N Y Y Y Y
