dplyr::
gather()
separate()
mutate()
and case_when()
to create a new columnspread()
to make long data wideinner_join()
filter()
if_else()
Before the pandemic, I usually taught my “College English 2:Speaking” with this textbook, which is based on popular TED talks and includes a great variety of excellent classroom activities. However, when my university announced in February 2020 that all courses would move online, I had to ask myself if the same classroom activities from the textbook could be used for online classes via Zoom.
So, I selected a few interesting TED and TED-Ed talks and designed my own activities that would have better suited the needs of students learning online. One of such TED-Ed talks was Christian Rudder’s briefing on how he and his co-founders of OKCupid.com (a popular dating app in North America) came up with a matching algorithm for the users of their dating app. You can watch the video below, and a magazine article about their enterprise is found here. As I watched the video, I thought following his instructions in the video and calculating match scores for my students would be a fun ice-breaker in the beginning of the semester.
And now, I’m writing this post to help anyone out there who would like to learn how to use R. There are a kazillion posts and blogs out there that explain how to use R, but I thought writing about a small project I carried out for my class would be a fun way to introduce R to novices. So off we go!
The video is fun to watch, and it basically says that you need three different pieces of information to calculate how compatible two people will be: (1) Your traits, (2) Your preference for your potential match’s traits, and (3) How important each trait is to you.
To obtain such information, I simply wrote up a short survey in a Google form and distributed it around. Students’ responses were recorded in a spreadsheet connected to the google form, and you can download it here. I also wrote up an r markdown file so you can easily follow this post. Just create a folder named ‘aol’ on Desktop of your computer (be sure it’s all small letters; R is case-sensitive) and save both files in that folder.
Although most dating apps will make you answer hundreds of questions for more precise match calculations, I couldn’t possibly use too much of class time making students answer dating app questions. But I also did not want to calculate two people’s match scores based on only two traits as in the example in the video. So, I asked about five most common things people consider when they look for a partner (looks, intelligence, kindness, wealth, and ethics). Following are the column names for each trait and the questions asked. All questions were asked on a 5-point scale with 1 being ‘Not at all’ and 5 being ‘Extremely so.’
Column Headings | Questions |
---|---|
Appearance_1 | I am good looking. |
Appearance_2 | I want a good looking partner. |
Appearance_3 | How important is ‘appearance’ in your life? |
Intelligence_1 | I am intelligent. |
Intelligence_2 | I would like to be with someone intelligent. |
Intelligence_3 | How important is ‘intelligence’ in your life? |
Kindness_1 | I am kind-hearted and caring. |
Kindness_2 | I would like to have a kind-hearted and caring partner. |
Kindness_3 | How important is ‘kindness’ in your life? |
Wealth_1 | I am rich. |
Wealth_2 | I want my partner to be rich. |
Wealth_3 | How important is ‘wealth’ in your life? |
Ethics_1 | I am ethical. |
Ethics_2 | I would like my partner to be ethical. |
Ethics_3 | How important is ‘being ethical’ in your life? |
This post is intended for those who have absolutely no knowledge of how R works. Those who have at least a little bit of experience might find all the excruciating details unbearable. In such case, skip parts you already have a good knowledge of and move on. Or you can simply download the data file and the Rmd file and play with them.
Now, you need to set your working directory. It is the same folder that you saved your data file and the r markdown file earlier. The Rmd file I provided assumes that you have a folder named ‘aol’ on your Desktop and have both the data file and the Rmd file within that folder. If you are using Windows, just uncomment the Windows command. If you are a Mac user, uncomment the Mac command. Also, the parenthesized part (UserName)
should be replaced with your computer name. If your computer name has Korean characters, a space, or a special character in it, a nightmare begins. In such cases, look for the nearest computer geek in your vicinity. I have no intention to write another post about it.
# Windows
# setwd("C:/Users/(UserName)/Desktop/aol")
# Mac
# setwd("/Users/(UserName)/Desktop/aol")
# A hashtag (#) works as a sign for R to ignore whatever's after it. To execute either setwd command above, simply remove the hashtag in the begining of the line. But be sure to delete the hash tag only in one of the two lines. If you're a Windows user, delete the hash tag before the Windows command. If you're a Mac user, the one before the Mac command.
Or if you’d like to set your working directory elsewhere, refer to this video. Otherwise, just google ‘how to set up working directory in rstudio.’ You will realize that the whole world is eager to teach you how to do that.
Although R comes with some basic functions, which we call base R, people have written awesome packages that have convenient functions to use. In our case, we will load a package called tidyverse
for easy data manipulation and pracma
for nth root calculation. If you haven’t installed either package on your computer, you will first have to install.packages(c("tidyverse", "pracma"))
before you load them using library(tidyverse)
and library(pracma)
.
library(tidyverse)
## -- Attaching packages ------------------------------------------ tidyverse 1.3.0 --
## √ ggplot2 3.3.2 √ purrr 0.3.4
## √ tibble 3.0.3 √ dplyr 1.0.0
## √ tidyr 1.1.1 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
tidyverse
is a package of packages that are commonly used for tidying, manipulating, and plotting data. The eight packages that come in tidyverse
are listed above. Also, sometimes, some packages that you load might have the same function names as those in other packages. Then, the conflicting functions are also listed. Although the conflicts will not create errors in our script, if in doubt; you can add dplyr::
or stats::
before you use either the filter()
or lag()
functions to clarify which function from which package is being called for.
library(pracma)
## Warning: package 'pracma' was built under R version 4.0.4
##
## Attaching package: 'pracma'
## The following object is masked from 'package:purrr':
##
## cross
When I loaded pracma
, it gave me a warning that the package was written for R version 4.0.4
. It is because the current version of R I’m using is 4.0.2. If the version difference creates a problem, you can simply update your R. For now, let’s just carry on. Also, the message that an object is masked from another package is basically the same as the conflict message from tidyverse
. When you have to use cross
from the purrr
package, add purrr::
. In our case, we can simply ignore it.
Now, use read.csv()
to load in data. The data has responses to the 15 questions above from 19 students in total. Their names were all replaced with animal names.
raw.data <- read.csv("sampleDat.csv")
Now the data file is loaded, let’s take a look at how the data frame is structured. Instead of loading in all data, let’s take a look at the first six rows of the data frame using head()
.
head(raw.data)
## Name Appearance_1 Appearance_2 Appearance_3 Intelligence_1 Intelligence_2
## 1 Boar 4 4 2 4 3
## 2 Buck 2 3 4 5 4
## 3 Bull 4 4 4 4 4
## 4 Cob 3 3 4 3 4
## 5 Cow 3 5 4 3 3
## 6 Doe 3 4 3 4 4
## Intelligence_3 Kindness_1 Kindness_2 Kindness_3 Wealth_1 Wealth_2 Wealth_3
## 1 3 4 4 4 3 2 4
## 2 4 5 5 5 2 2 4
## 3 4 5 5 5 4 4 4
## 4 4 4 5 5 2 4 4
## 5 2 4 5 5 3 2 2
## 6 4 4 4 4 3 4 3
## Ethics_1 Ethics_2 Ethics_3
## 1 3 3 4
## 2 5 5 5
## 3 4 4 4
## 4 4 5 5
## 5 4 4 3
## 6 4 4 3
If you open the data file in Excel or in any other spreadsheet, all columns will be next to each other. In this case, each row is divided into three different rows because of the limited page width.
On each row is a respondent’s name and his/her responses to the 15 questions. This is a data format that most people are familiar with, but this is not the data format that R is happy with. We call it messy data and need to make it tidy.
dplyr::
gather()
The first thing you now need to do is to turn the wide data above into a long form. In this case, we can use the gather()
function from dplyr
.
data0<-raw.data%>%
gather(questions, responses, 2:16)
You have taken a peek at raw.data
earlier, the code above says that you are creating a new data frame named data0
out of raw.data
. The gather()
function takes column headings from the second column to the 16th column for questions
and responses in each cell as responses
. In the wide format, each person was listed only in one row with his/her 15 responses next to the name. In the long data format, each person is repeated in 15 different rows and will be accompanied by a question and a response. The first 15 rows of the long data look like this.
head(data0%>%arrange(Name), n = 15L)
## Name questions responses
## 1 Boar Appearance_1 4
## 2 Boar Appearance_2 4
## 3 Boar Appearance_3 2
## 4 Boar Intelligence_1 4
## 5 Boar Intelligence_2 3
## 6 Boar Intelligence_3 3
## 7 Boar Kindness_1 4
## 8 Boar Kindness_2 4
## 9 Boar Kindness_3 4
## 10 Boar Wealth_1 3
## 11 Boar Wealth_2 2
## 12 Boar Wealth_3 4
## 13 Boar Ethics_1 3
## 14 Boar Ethics_2 3
## 15 Boar Ethics_3 4
separate()
Now, you understand what a long form of data means. The next step is to separate keywords and numbers in the second column such as Intelligence_1 and Ethics_3. The keywords indicate the traits we are interested in, and the numbers indicate 1: Your trait, 2: Your preference for your partner’s trait. and 3: The importance of a trait. I have intentionally used such column names for easy separation of the keywords and the numbers. separate(questions, c("category", "Qnum"))
means that the column questions
will be separated into two columns category
and Qnum
. The keywords will be placed in the category
column and the suffixes will be placed in the Qnum
column.
data0<-data0%>%
separate(questions, c("category", "Qnum"))
The %>%
symbol is called a ‘pipe,’ and it links one code to the next. This way, you don’t have to create a new data frame every time you make a change in your data frame. Now, let’s take a look at how the column is now separated.
head(data0 %>% arrange(Name), n=15L)
## Name category Qnum responses
## 1 Boar Appearance 1 4
## 2 Boar Appearance 2 4
## 3 Boar Appearance 3 2
## 4 Boar Intelligence 1 4
## 5 Boar Intelligence 2 3
## 6 Boar Intelligence 3 3
## 7 Boar Kindness 1 4
## 8 Boar Kindness 2 4
## 9 Boar Kindness 3 4
## 10 Boar Wealth 1 3
## 11 Boar Wealth 2 2
## 12 Boar Wealth 3 4
## 13 Boar Ethics 1 3
## 14 Boar Ethics 2 3
## 15 Boar Ethics 3 4
The %>% arrange(Name)
command is added to arrange the data file by respondents’ names. If you’d like to see what happens without it, you can simply try head(data0, n =15L)
or just head(data0)
. What you see here now is that Boar thinks he is quite good looking since he gave 4 to the first question of Appearance
, and he also wants his partner to be just as good looking, but he says Appearance
is not that important as his score for the third question of Appearance
is only 2. So, that’s how you read this data set.
mutate()
and case_when()
to create a new columnRemember that the importance of a trait was weighted exponentially in the video? We now need to turn the values of 1 to 5 into an exponential scale.
data0<-data0%>%
mutate(weight=case_when(
Qnum==3 & responses ==1~0,
Qnum==3 & responses ==2~1,
Qnum==3 & responses ==3~10,
Qnum==3 & responses ==4~50,
Qnum==3 & responses ==5~250
))
mutate()
creates a new column called weight
based on values in the columns Qnum
and responses
using case_when()
. That is, if the question is about the importance of a trait and if the response is 1 (‘Not important at all’), then, its weight is 0. Likewise, if the question is 3 and the response is 5 (‘Extremely so’), then, the weight is 250, and so on and so forth.
head(data0 %>% arrange(Name), n=15L)
## Name category Qnum responses weight
## 1 Boar Appearance 1 4 NA
## 2 Boar Appearance 2 4 NA
## 3 Boar Appearance 3 2 1
## 4 Boar Intelligence 1 4 NA
## 5 Boar Intelligence 2 3 NA
## 6 Boar Intelligence 3 3 10
## 7 Boar Kindness 1 4 NA
## 8 Boar Kindness 2 4 NA
## 9 Boar Kindness 3 4 50
## 10 Boar Wealth 1 3 NA
## 11 Boar Wealth 2 2 NA
## 12 Boar Wealth 3 4 50
## 13 Boar Ethics 1 3 NA
## 14 Boar Ethics 2 3 NA
## 15 Boar Ethics 3 4 50
For every row that has a question about the importance of a trait and a response to it, the weight of the trait in the exponential scale is listed in the weight
column.
The next step is to turn the continuous values in the Qnum
column into categorical values because we do not want R to process the numbers in the Qnum
column as ordinal or interval values. We can again use the mutate()
and case_when()
functions to do that. Also, adding in select(-Qnum)
at the end of the codes will delete the now unwanted, redundant column (Qnum
).
data0<-data0%>%
mutate(number = case_when(
Qnum ==1 ~"q1",
Qnum ==2 ~"q2",
Qnum ==3 ~"q3"
))%>%
select(-Qnum)
head(data0 %>% arrange(Name), n=15L)
## Name category responses weight number
## 1 Boar Appearance 4 NA q1
## 2 Boar Appearance 4 NA q2
## 3 Boar Appearance 2 1 q3
## 4 Boar Intelligence 4 NA q1
## 5 Boar Intelligence 3 NA q2
## 6 Boar Intelligence 3 10 q3
## 7 Boar Kindness 4 NA q1
## 8 Boar Kindness 4 NA q2
## 9 Boar Kindness 4 50 q3
## 10 Boar Wealth 3 NA q1
## 11 Boar Wealth 2 NA q2
## 12 Boar Wealth 4 50 q3
## 13 Boar Ethics 3 NA q1
## 14 Boar Ethics 3 NA q2
## 15 Boar Ethics 4 50 q3
However, R does not like empty cells with NA
in it. Since the weight will be applicable just the same for the same trait of a person, we can just fill in all the blanks with the same weight as long as the empty cells belong to the same trait of the same person.
data0<-data0%>%
group_by(Name, category)%>%
fill(weight, .direction = "up")
The code above says that the data file should be sorted by respondents’ names and question categories. Then, it fills up the question cells for questions 1 and 2 within the same category of a person with the same weight
as the weight value in q3
(Question 3: The important of a trait). Check out what change has been made below.
head(data0 %>% arrange(Name), n=15L)
## # A tibble: 15 x 5
## # Groups: Name, category [5]
## Name category responses weight number
## <chr> <chr> <int> <dbl> <chr>
## 1 Boar Appearance 4 1 q1
## 2 Boar Appearance 4 1 q2
## 3 Boar Appearance 2 1 q3
## 4 Boar Intelligence 4 10 q1
## 5 Boar Intelligence 3 10 q2
## 6 Boar Intelligence 3 10 q3
## 7 Boar Kindness 4 50 q1
## 8 Boar Kindness 4 50 q2
## 9 Boar Kindness 4 50 q3
## 10 Boar Wealth 3 50 q1
## 11 Boar Wealth 2 50 q2
## 12 Boar Wealth 4 50 q3
## 13 Boar Ethics 3 50 q1
## 14 Boar Ethics 3 50 q2
## 15 Boar Ethics 4 50 q3
From here on, I have made a few changes in the way the video explains how calculations should be done. Because the video used only two questions, it just calculates the average score of the two questions for each of the two people, and they get the square root of the product of the two values. However, when there are five different questions (as in our case), the order in which calculations are done should be different from that in the video. I assume that the actual algorithm that they use is a lot more complex, but following is one way I believe will work.
spread()
to make long data wideThe spread()
function takes two arguments; one column that has a set number of values to provide new column headings and the other column that will fill the cells of the new columns. Below, we take the columns number
and responses
. Since number
has three different values q1
, q2
, and q3
, we will have new column headings labeled q1
, q2
, and q3
. Then, the columns will be filled with responses from each respondent.
data0<-data0%>%
spread(number, responses)
You can see the changes made below. In the previous table, we saw the same person (Boar) repeated in 15 different rows. Instead, the new table shows that each person is repeated only five different times (for the five traits), and responses for each question is now spread out sideways.
head(data0 %>% arrange(Name), n=15L)
## # A tibble: 15 x 6
## # Groups: Name, category [15]
## Name category weight q1 q2 q3
## <chr> <chr> <dbl> <int> <int> <int>
## 1 Boar Appearance 1 4 4 2
## 2 Boar Ethics 50 3 3 4
## 3 Boar Intelligence 10 4 3 3
## 4 Boar Kindness 50 4 4 4
## 5 Boar Wealth 50 3 2 4
## 6 Buck Appearance 50 2 3 4
## 7 Buck Ethics 250 5 5 5
## 8 Buck Intelligence 50 5 4 4
## 9 Buck Kindness 250 5 5 5
## 10 Buck Wealth 50 2 2 4
## 11 Bull Appearance 50 4 4 4
## 12 Bull Ethics 50 4 4 4
## 13 Bull Intelligence 50 4 4 4
## 14 Bull Kindness 250 5 5 5
## 15 Bull Wealth 50 4 4 4
The next step is to calculate how the other 18 people will score in each respondent’s scale. For that, let’s create a new data frame called data1.
data1<-data0%>%
select(Name, category, q1)%>%
rename(partner=Name, partner_resp=q1)
You create a new data frame called data1
out of data0
and select only three columns: Name, category, q1
. Then, rename the Name
column partner
and q1
partner_resp
. Everyone is a potential partner to everyone anyway! And it looks like this.
head(data1 %>% arrange(partner), n=15L)
## # A tibble: 15 x 3
## # Groups: partner, category [15]
## partner category partner_resp
## <chr> <chr> <int>
## 1 Boar Appearance 4
## 2 Boar Ethics 3
## 3 Boar Intelligence 4
## 4 Boar Kindness 4
## 5 Boar Wealth 3
## 6 Buck Appearance 2
## 7 Buck Ethics 5
## 8 Buck Intelligence 5
## 9 Buck Kindness 5
## 10 Buck Wealth 2
## 11 Bull Appearance 4
## 12 Bull Ethics 4
## 13 Bull Intelligence 4
## 14 Bull Kindness 5
## 15 Bull Wealth 4
inner_join()
There are many different commands to merge two data files. For now, we will use inner_join()
, which combines data frames x and y, returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. It takes three arguments, x (data0
), y (data1
), and by="common column name"
. In our case, the category
column will be the joiner of the two data frames.
data2<-inner_join(data0, data1, by="category")
head(data2, n=19L)
## # A tibble: 19 x 8
## # Groups: Name, category [1]
## Name category weight q1 q2 q3 partner partner_resp
## <chr> <chr> <dbl> <int> <int> <int> <chr> <int>
## 1 Boar Appearance 1 4 4 2 Boar 4
## 2 Boar Appearance 1 4 4 2 Buck 2
## 3 Boar Appearance 1 4 4 2 Bull 4
## 4 Boar Appearance 1 4 4 2 Cob 3
## 5 Boar Appearance 1 4 4 2 Cow 3
## 6 Boar Appearance 1 4 4 2 Doe 3
## 7 Boar Appearance 1 4 4 2 Dog 3
## 8 Boar Appearance 1 4 4 2 Drake 3
## 9 Boar Appearance 1 4 4 2 Duck 4
## 10 Boar Appearance 1 4 4 2 Gander 3
## 11 Boar Appearance 1 4 4 2 Goose 3
## 12 Boar Appearance 1 4 4 2 Hen 1
## 13 Boar Appearance 1 4 4 2 Lioness 4
## 14 Boar Appearance 1 4 4 2 Mare 2
## 15 Boar Appearance 1 4 4 2 Pen 3
## 16 Boar Appearance 1 4 4 2 Ram 3
## 17 Boar Appearance 1 4 4 2 Rooster 3
## 18 Boar Appearance 1 4 4 2 Tigress 4
## 19 Boar Appearance 1 4 4 2 Whale 3
The combined data frame now shows each person’s preference on each trait and a potential partner’s response for each trait in one row. The problem is that because R simply calculates everything we tell it to, it shows a comparison between Boar (in the name column) and Boar (in the partner column). Currently, the number of rows in the table is:
nrow(data2)
## [1] 1805
That is the same number as 19 * 19 * 5 = 1805
.
filter()
Although it is true that everyone becomes someone else’s partner, we do not need to calculate a match score between a person and him/herself. So, let’s drop cases where the Name == partner
. The code below says that data will be filtered in only when the Name
and partner
columns do not have the same values.
data2<-data2%>%
filter(Name != partner)
Now, the number of rows in the filtered data file is:
nrow(data2)
## [1] 1710
, which is the result of 19*18*5
.
head(data2, n=19L)
## # A tibble: 19 x 8
## # Groups: Name, category [2]
## Name category weight q1 q2 q3 partner partner_resp
## <chr> <chr> <dbl> <int> <int> <int> <chr> <int>
## 1 Boar Appearance 1 4 4 2 Buck 2
## 2 Boar Appearance 1 4 4 2 Bull 4
## 3 Boar Appearance 1 4 4 2 Cob 3
## 4 Boar Appearance 1 4 4 2 Cow 3
## 5 Boar Appearance 1 4 4 2 Doe 3
## 6 Boar Appearance 1 4 4 2 Dog 3
## 7 Boar Appearance 1 4 4 2 Drake 3
## 8 Boar Appearance 1 4 4 2 Duck 4
## 9 Boar Appearance 1 4 4 2 Gander 3
## 10 Boar Appearance 1 4 4 2 Goose 3
## 11 Boar Appearance 1 4 4 2 Hen 1
## 12 Boar Appearance 1 4 4 2 Lioness 4
## 13 Boar Appearance 1 4 4 2 Mare 2
## 14 Boar Appearance 1 4 4 2 Pen 3
## 15 Boar Appearance 1 4 4 2 Ram 3
## 16 Boar Appearance 1 4 4 2 Rooster 3
## 17 Boar Appearance 1 4 4 2 Tigress 4
## 18 Boar Appearance 1 4 4 2 Whale 3
## 19 Boar Ethics 50 3 3 4 Buck 5
if_else()
Now, we need to calculate how each partner’s response measures in each respondent’s scale. Boar wants his partner to be quite good-looking (4 on q2
of Appearance
), and all potential partners responded that they were quite good-looking (4) or lower. What if someone said they were extremely good-looking (5), should we give more than 100% to the person? Maybe just 100% will make better sense. To do that, if a partner’s response is higher than a respondent’s wish for partner’s trait (q2
), we can just give 1 for the partner’s score. If the partner’s response is equal to or smaller than the respondent’s wish, then, we should divide the partner’s response (partner_resp
) by the respondent’s original preference for a partner’s trait (q2
). And that calculation can be done through if_else()
within mutate()
. if_else()
takes three arguments, a condition, a value when the condition is met, and a value when the condition is not met. So, the code below says that partner_score
should be 1 if partner_resp
is greater than q2
; otherwise, it should be partner_resp/q2
.
data2<-data2%>%
mutate(partner_score = if_else(partner_resp > q2, c(1), c(partner_resp/q2)))
head(data2)
## # A tibble: 6 x 9
## # Groups: Name, category [1]
## Name category weight q1 q2 q3 partner partner_resp partner_score
## <chr> <chr> <dbl> <int> <int> <int> <chr> <int> <dbl>
## 1 Boar Appearance 1 4 4 2 Buck 2 0.5
## 2 Boar Appearance 1 4 4 2 Bull 4 1
## 3 Boar Appearance 1 4 4 2 Cob 3 0.75
## 4 Boar Appearance 1 4 4 2 Cow 3 0.75
## 5 Boar Appearance 1 4 4 2 Doe 3 0.75
## 6 Boar Appearance 1 4 4 2 Dog 3 0.75
Since we do not need q1, q2, q3, partner_resp
any more, let’s just get rid of them.
data2<-data2%>%
select(-c(q1, q2, q3, partner_resp))
head(data2)
## # A tibble: 6 x 5
## # Groups: Name, category [1]
## Name category weight partner partner_score
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Boar Appearance 1 Buck 0.5
## 2 Boar Appearance 1 Bull 1
## 3 Boar Appearance 1 Cob 0.75
## 4 Boar Appearance 1 Cow 0.75
## 5 Boar Appearance 1 Doe 0.75
## 6 Boar Appearance 1 Dog 0.75
nrow(data2)
## [1] 1710
The next thing I would like to do is to juxtapose B’s score on A’s scale and A’s score on B’s scale next to each other. And the following codes will just do that.
data3<-data2%>%
mutate(joiner=paste(partner, Name, sep = "_"))
head(data3)
## # A tibble: 6 x 6
## # Groups: Name, category [1]
## Name category weight partner partner_score joiner
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Boar Appearance 1 Buck 0.5 Buck_Boar
## 2 Boar Appearance 1 Bull 1 Bull_Boar
## 3 Boar Appearance 1 Cob 0.75 Cob_Boar
## 4 Boar Appearance 1 Cow 0.75 Cow_Boar
## 5 Boar Appearance 1 Doe 0.75 Doe_Boar
## 6 Boar Appearance 1 Dog 0.75 Dog_Boar
data2<-data2%>%
mutate(joiner=paste(Name, partner, sep = "_"))
head(data2)
## # A tibble: 6 x 6
## # Groups: Name, category [1]
## Name category weight partner partner_score joiner
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Boar Appearance 1 Buck 0.5 Boar_Buck
## 2 Boar Appearance 1 Bull 1 Boar_Bull
## 3 Boar Appearance 1 Cob 0.75 Boar_Cob
## 4 Boar Appearance 1 Cow 0.75 Boar_Cow
## 5 Boar Appearance 1 Doe 0.75 Boar_Doe
## 6 Boar Appearance 1 Dog 0.75 Boar_Dog
data4<-left_join(data2, data3, by = c("joiner", "category"))
head(data4)
## # A tibble: 6 x 10
## # Groups: category [1]
## Name.x category weight.x partner.x partner_score.x joiner Name.y weight.y
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 Boar Appeara~ 1 Buck 0.5 Boar_~ Buck 50
## 2 Boar Appeara~ 1 Bull 1 Boar_~ Bull 50
## 3 Boar Appeara~ 1 Cob 0.75 Boar_~ Cob 50
## 4 Boar Appeara~ 1 Cow 0.75 Boar_~ Cow 50
## 5 Boar Appeara~ 1 Doe 0.75 Boar_~ Doe 10
## 6 Boar Appeara~ 1 Dog 0.75 Boar_~ Dog 50
## # ... with 2 more variables: partner.y <chr>, partner_score.y <dbl>
Now the newly joined data4 has too many columns that it will not be displayed properly, let’s just check the names of the columns in the new data set.
colnames(data4)
## [1] "Name.x" "category" "weight.x" "partner.x"
## [5] "partner_score.x" "joiner" "Name.y" "weight.y"
## [9] "partner.y" "partner_score.y"
nrow(data4)
## [1] 1710
What I can tell you here is that Name.x
is the same as partner.y
and that partner.x
is the same as Name.y
. What we can do is set Name.x
to be A and Name.y
to be B. Then, partner_score.x
will be B’s score on A’s scale (BonA
), and partner_score.y
will be A’s score on B’s scale (AonB
). Also, weight.x
means the weight placed on a trait by a respondent, which we now call A, and weight.y
indicates the weight placed on a trait by a partner, which we now call B. Let’s rename them to be weight.A
and weight.B
, respectively.
data4<-data4%>%
rename(A = Name.x, B = Name.y, BonA = partner_score.x, AonB = partner_score.y, weight.A = weight.x, weight.B = weight.y)
head(data4)
## # A tibble: 6 x 10
## # Groups: category [1]
## A category weight.A partner.x BonA joiner B weight.B partner.y AonB
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 Boar Appearan~ 1 Buck 0.5 Boar_~ Buck 50 Boar 1
## 2 Boar Appearan~ 1 Bull 1 Boar_~ Bull 50 Boar 1
## 3 Boar Appearan~ 1 Cob 0.75 Boar_~ Cob 50 Boar 1
## 4 Boar Appearan~ 1 Cow 0.75 Boar_~ Cow 50 Boar 0.8
## 5 Boar Appearan~ 1 Doe 0.75 Boar_~ Doe 10 Boar 1
## 6 Boar Appearan~ 1 Dog 0.75 Boar_~ Dog 50 Boar 0.8
Now, let’s leave in necessary columns only.
data4<-data4%>%
select(A, B, category, AonB, BonA, weight.A, weight.B)
head(data4)
## # A tibble: 6 x 7
## # Groups: category [1]
## A B category AonB BonA weight.A weight.B
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Boar Buck Appearance 1 0.5 1 50
## 2 Boar Bull Appearance 1 1 1 50
## 3 Boar Cob Appearance 1 0.75 1 50
## 4 Boar Cow Appearance 0.8 0.75 1 50
## 5 Boar Doe Appearance 1 0.75 1 10
## 6 Boar Dog Appearance 0.8 0.75 1 50
Let’s now calculate each pair’s match score on each trait.
data4<-data4%>%
mutate(matchScore = ((weight.A*BonA + weight.B*AonB)/(weight.A+weight.B))*100)
head(data4)
## # A tibble: 6 x 8
## # Groups: category [1]
## A B category AonB BonA weight.A weight.B matchScore
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boar Buck Appearance 1 0.5 1 50 99.0
## 2 Boar Bull Appearance 1 1 1 50 100
## 3 Boar Cob Appearance 1 0.75 1 50 99.5
## 4 Boar Cow Appearance 0.8 0.75 1 50 79.9
## 5 Boar Doe Appearance 1 0.75 1 10 97.7
## 6 Boar Dog Appearance 0.8 0.75 1 50 79.9
Also, if you’d like to see how the scores are distributed, a histogram (hist()
) will help. Using ggplot2::
for a neat and pretty plot will be discussed in another post in the future.
hist(data4$matchScore)
With the match score calculated, we do not need the other columns (AonB
, BonA
, weight.A
, weight.B
), and I would like to spread out the trait category to a separate column for each trait. Let’s select()
necessary columns only and spread()
matchScore
by category
.
data4<-data4%>%
select(A, B, category, matchScore)%>%
spread(category, matchScore)
nrow(data4)
## [1] 342
head(data4)
## # A tibble: 6 x 7
## A B Appearance Ethics Intelligence Kindness Wealth
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boar Buck 99.0 66.7 100 83.3 100
## 2 Boar Bull 100 87.5 100 83.3 87.5
## 3 Boar Cob 99.5 66.7 100 83.3 87.5
## 4 Boar Cow 79.9 95.8 100 83.3 100
## 5 Boar Doe 97.7 95.8 100 100 95.8
## 6 Boar Dog 79.9 80 100 83.3 100
You can see that each pair’s match scores for the five different trait categories are next to each other in one line. We now need to get the 5th root of the product of the five values.
data4<-data4%>%
mutate(compatibility = nthroot(Appearance * Ethics * Intelligence * Kindness * Wealth, 5))
head(data4)
## # A tibble: 6 x 8
## A B Appearance Ethics Intelligence Kindness Wealth compatibility
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boar Buck 99.0 66.7 100 83.3 100 88.7
## 2 Boar Bull 100 87.5 100 83.3 87.5 91.4
## 3 Boar Cob 99.5 66.7 100 83.3 87.5 86.5
## 4 Boar Cow 79.9 95.8 100 83.3 100 91.4
## 5 Boar Doe 97.7 95.8 100 100 95.8 97.9
## 6 Boar Dog 79.9 80 100 83.3 100 88.2
Here is the distribution of the final compatibility scores.
hist(data4$compatibility)
If you’d like to see the entire data frame in a spreadsheet, you can write a csv file using write.csv()
.
write.csv(data4, file = "comp.csv")
Then, you will find a csv file within the same working directory you set in the beginning of this post. If you are interested in looking at the resulting file, here it is.
Following are all codes used in this script in one place.
data0<-raw.data%>%
gather(questions, responses, 2:16)%>%
separate(questions, c("category", "Qnum"))%>%
mutate(weight=case_when(
Qnum==3 & responses ==1~0,
Qnum==3 & responses ==2~1,
Qnum==3 & responses ==3~10,
Qnum==3 & responses ==4~50,
Qnum==3 & responses ==5~250
))%>%
mutate(number = case_when(
Qnum ==1 ~"q1",
Qnum ==2 ~"q2",
Qnum ==3 ~"q3"
))%>%
select(-Qnum)%>%
group_by(Name, category)%>%
fill(weight, .direction = "up")%>%
spread(number, responses)
data1<-data0%>%
select(Name, category, q1)%>%
rename(partner=Name, partner_resp=q1)
data2<-inner_join(data0, data1, by="category")%>%
filter(Name != partner)%>%
mutate(partner_score = if_else(partner_resp > q2, c(1), c(partner_resp/q2)))%>%
select(-c(q1, q2, q3, partner_resp))
data3<-data2%>%
mutate(joiner=paste(partner, Name, sep = "_"))
data2<-data2%>%
mutate(joiner=paste(Name, partner, sep = "_"))
data4<-left_join(data2, data3, by = c("joiner", "category"))%>%
rename(A = Name.x, B = Name.y, BonA = partner_score.x, AonB = partner_score.y, weight.A = weight.x, weight.B = weight.y)%>%
select(A, B, category, AonB, BonA, weight.A, weight.B)%>%
mutate(matchScore = ((weight.A*BonA + weight.B*AonB)/(weight.A+weight.B))*100)%>%
select(A, B, category, matchScore)%>%
spread(category, matchScore)%>%
mutate(compatibility = nthroot(Appearance * Ethics * Intelligence * Kindness * Wealth, 5))
Well, if anyone asks me if this was the most elegant way to compute compatibility scores or if this was exactly what Christian Rudder meant in his video, I’m not sure I can confidently say yes. But it’s definitely an idea that seemed to make the most sense to me. If anyone out there with a broader and deeper understanding of algorithmic thinking has better ideas, please email me with your ideas. I’m merely a learner of coding myself, who is striving to get better at it everyday.
Whoa! That was a massive amount of procrastination all in the midst of worrying about an IRB application, a manuscript, and a midterm exam to write. Hopefully, there is someone out there reading this and finding it helpful. So long for now! I will try not to take too long to return to write another post of this sort ;)