r - Separate a column in dataframe where each observation can have multiple concurrent values -
i believe question best practice tidying messy data, here goes.
below excerpt of dataframe lang.df
, school-wide dataset of students. column, langauge.home
, indicates parent responses question: "what languages speak @ home?"
> lang.df nationality language.home 1 hk mandarin 2 german mandarin/english/german 3 saudi arabic 4 norwegian norwegian 5 uk english 6 hk mandarin/ min nan dialect 7 australian mandarin 8 hk mandarin 9 brazilian portuguese/english 10 indian hindi/english
it obvious me poor way information poor way store it, job use data have.
outcome
i want explore effect home languages might have on achievement. need ability subset single langauge spoken @ home (e.g. students speak english @ home).
to so, seems have separate language@home
column 3 ("language.home1", "language.home2", "language.home3"
) using dplyr's separate()
. create new column each unique value (i.e. language) in new columns created
process
below attempt @ doing above, efficiently
library(dplyr) library(tidyr) #separate langauge.home 3 new columns lang.df <- lang.df %>% separate(language.home, c("language.home1", "language.home2", "language.home3"), sep = "/", remove = false) #find distinct languages & remove nas langs <- unique(c(lang.df$language.home1, lang.df$language.home2, lang.df$language.home3)) langs <- langs[!is.na(langs)] #create boolean column each unique language in new columns (i in langs) { lang.df[,paste(i)] <- grepl(i, lang.df$language.home) }
questions
- what situation called? tried @
tidyr
docs , here on couldn't find it. - is there more elegant way code transformations have done?
- what best practice
- getting data (to change future data entry process)
- dealing situation statistical perspective
thanks in advance help. i've been using r on-and-off year , first post. give me feedback can!
data
lang.df <- structure(list(nationality = structure(c(4l, 3l, 7l, 6l, 8l, 4l, 1l, 4l, 2l, 5l), .label = c("australian", "brazilian", "german", "hk", "indian", "norwegian", "saudi", "uk"), class = "factor"), `language.home` = structure(c(4l, 6l, 1l, 7l, 2l, 5l, 4l, 4l, 8l, 3l), .label = c("arabic", "english", "hindi/english", "mandarin", "mandarin/ min nan dialect", "mandarin/english/german", "norwegian", "portuguese/english"), class = "factor")), row.names = c(na, 10l), .names = c("nationality", "language.home"), class = "data.frame")
we can use csplit
splitstackshape
split 'language.home' using delimiter /
, convert long
format.
library(splitstackshape) library(data.table) dt <- csplit(lang.df, "language.home", "/", "long")
then, use dcast
convert 'long' 'wide'
dcast(dt, nationality~language.home, fun.aggregate = function(x) length(x)>0)
note: there duplicate 'nationality' rows, above group common elements together. may better group together.
if need have logical columns based on each row (irrespective of similar 'nationality')
dcast(csplit(setdt(lang.df, keep.rownames=true), "language.home", "/", "long"), rn +nationality ~language.home, function(x) length(x) >0)
or option mtabulate
qdaptools
after splitting 'language.home' /
.
library(qdaptools) cbind(lang.df, !!(mtabulate(setnames(strsplit(as.character(lang.df$language.home), "/"), lang.df$nationality)))) # nationality language.home min nan dialect arabic english german hindi mandarin norwegian portuguese #1 hk mandarin false false false false false true false false #2 german mandarin/english/german false false true true false true false false #3 saudi arabic false true false false false false false false #4 norwegian norwegian false false false false false false true false #5 uk english false false true false false false false false #6 hk mandarin/ min nan dialect true false false false false true false false #7 australian mandarin false false false false false true false false #8 hk mandarin false false false false false true false false #9 brazilian portuguese/english false false true false false false false true #10 indian hindi/english false false true false true false false false
Comments
Post a Comment