r - Separate a column in dataframe where each observation can have multiple concurrent values -


i believe question best practice tidying messy data, here goes.

below excerpt of dataframe lang.df, school-wide dataset of students. column, langauge.home, indicates parent responses question: "what languages speak @ home?"

> lang.df    nationality             language.home 1           hk                  mandarin 2       german   mandarin/english/german 3        saudi                    arabic 4    norwegian                 norwegian 5           uk                   english 6           hk mandarin/ min nan dialect 7   australian                  mandarin 8           hk                  mandarin 9    brazilian        portuguese/english 10      indian             hindi/english 

it obvious me poor way information poor way store it, job use data have.

outcome

i want explore effect home languages might have on achievement. need ability subset single langauge spoken @ home (e.g. students speak english @ home).

to so, seems have separate language@home column 3 ("language.home1", "language.home2", "language.home3") using dplyr's separate(). create new column each unique value (i.e. language) in new columns created

process

below attempt @ doing above, efficiently

library(dplyr) library(tidyr)  #separate langauge.home 3 new columns lang.df <- lang.df %>% separate(language.home,         c("language.home1", "language.home2", "language.home3"),         sep = "/",         remove = false)  #find distinct languages & remove nas langs <- unique(c(lang.df$language.home1,     lang.df$language.home2,     lang.df$language.home3)) langs <- langs[!is.na(langs)]  #create boolean column each unique language in new columns (i in langs) { lang.df[,paste(i)] <- grepl(i, lang.df$language.home)  } 

questions

  1. what situation called? tried @ tidyr docs , here on couldn't find it.
  2. is there more elegant way code transformations have done?
  3. what best practice
    • getting data (to change future data entry process)
    • dealing situation statistical perspective

thanks in advance help. i've been using r on-and-off year , first post. give me feedback can!

data

lang.df <- structure(list(nationality = structure(c(4l, 3l, 7l, 6l, 8l,  4l, 1l, 4l, 2l, 5l), .label = c("australian", "brazilian", "german",  "hk", "indian", "norwegian", "saudi", "uk"), class = "factor"),  `language.home` = structure(c(4l, 6l, 1l, 7l, 2l, 5l, 4l,  4l, 8l, 3l), .label = c("arabic", "english", "hindi/english",  "mandarin", "mandarin/ min nan dialect", "mandarin/english/german",  "norwegian", "portuguese/english"), class = "factor")), row.names = c(na,  10l), .names = c("nationality", "language.home"), class = "data.frame") 

we can use csplit splitstackshape split 'language.home' using delimiter / , convert long format.

library(splitstackshape) library(data.table) dt <- csplit(lang.df, "language.home", "/", "long") 

then, use dcast convert 'long' 'wide'

dcast(dt, nationality~language.home, fun.aggregate = function(x) length(x)>0) 

note: there duplicate 'nationality' rows, above group common elements together. may better group together.

if need have logical columns based on each row (irrespective of similar 'nationality')

 dcast(csplit(setdt(lang.df, keep.rownames=true), "language.home",    "/", "long"), rn +nationality ~language.home, function(x) length(x) >0) 

or option mtabulate qdaptools after splitting 'language.home' /.

 library(qdaptools)  cbind(lang.df, !!(mtabulate(setnames(strsplit(as.character(lang.df$language.home),                   "/"), lang.df$nationality)))) #   nationality             language.home  min nan dialect arabic english german hindi mandarin norwegian portuguese #1           hk                  mandarin            false  false   false  false false     true     false      false #2       german   mandarin/english/german            false  false    true   true false     true     false      false #3        saudi                    arabic            false   true   false  false false    false     false      false #4    norwegian                 norwegian            false  false   false  false false    false      true      false #5           uk                   english            false  false    true  false false    false     false      false #6           hk mandarin/ min nan dialect             true  false   false  false false     true     false      false #7   australian                  mandarin            false  false   false  false false     true     false      false #8           hk                  mandarin            false  false   false  false false     true     false      false #9    brazilian        portuguese/english            false  false    true  false false    false     false       true #10      indian             hindi/english            false  false    true  false  true    false     false      false 

Comments

Popular posts from this blog

Load Balancing in Bluemix using custom domain and DNS SRV records -

oracle - pls-00402 alias required in select list of cursor to avoid duplicate column names -

python - Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>] error -