1 Introduction

1.1 Why algorithm of LOVE?

Before the pandemic, I usually taught my “College English 2:Speaking” with this textbook, which is based on popular TED talks and includes a great variety of excellent classroom activities. However, when my university announced in February 2020 that all courses would move online, I had to ask myself if the same classroom activities from the textbook could be used for online classes via Zoom.

So, I selected a few interesting TED and TED-Ed talks and designed my own activities that would have better suited the needs of students learning online. One of such TED-Ed talks was Christian Rudder’s briefing on how he and his co-founders of OKCupid.com (a popular dating app in North America) came up with a matching algorithm for the users of their dating app. You can watch the video below, and a magazine article about their enterprise is found here. As I watched the video, I thought following his instructions in the video and calculating match scores for my students would be a fun ice-breaker in the beginning of the semester.

And now, I’m writing this post to help anyone out there who would like to learn how to use R. There are a kazillion posts and blogs out there that explain how to use R, but I thought writing about a small project I carried out for my class would be a fun way to introduce R to novices. So off we go!

The video is fun to watch, and it basically says that you need three different pieces of information to calculate how compatible two people will be: (1) Your traits, (2) Your preference for your potential match’s traits, and (3) How important each trait is to you.

To obtain such information, I simply wrote up a short survey in a Google form and distributed it around. Students’ responses were recorded in a spreadsheet connected to the google form, and you can download it here. I also wrote up an r markdown file so you can easily follow this post. Just create a folder named ‘aol’ on Desktop of your computer (be sure it’s all small letters; R is case-sensitive) and save both files in that folder.

1.2 Data codes

Although most dating apps will make you answer hundreds of questions for more precise match calculations, I couldn’t possibly use too much of class time making students answer dating app questions. But I also did not want to calculate two people’s match scores based on only two traits as in the example in the video. So, I asked about five most common things people consider when they look for a partner (looks, intelligence, kindness, wealth, and ethics). Following are the column names for each trait and the questions asked. All questions were asked on a 5-point scale with 1 being ‘Not at all’ and 5 being ‘Extremely so.’

Column Headings Questions
Appearance_1 I am good looking.
Appearance_2 I want a good looking partner.
Appearance_3 How important is ‘appearance’ in your life?
Intelligence_1 I am intelligent.
Intelligence_2 I would like to be with someone intelligent.
Intelligence_3 How important is ‘intelligence’ in your life?
Kindness_1 I am kind-hearted and caring.
Kindness_2 I would like to have a kind-hearted and caring partner.
Kindness_3 How important is ‘kindness’ in your life?
Wealth_1 I am rich.
Wealth_2 I want my partner to be rich.
Wealth_3 How important is ‘wealth’ in your life?
Ethics_1 I am ethical.
Ethics_2 I would like my partner to be ethical.
Ethics_3 How important is ‘being ethical’ in your life?

2 Basics

This post is intended for those who have absolutely no knowledge of how R works. Those who have at least a little bit of experience might find all the excruciating details unbearable. In such case, skip parts you already have a good knowledge of and move on. Or you can simply download the data file and the Rmd file and play with them.

2.1 Setting up your working directory

Now, you need to set your working directory. It is the same folder that you saved your data file and the r markdown file earlier. The Rmd file I provided assumes that you have a folder named ‘aol’ on your Desktop and have both the data file and the Rmd file within that folder. If you are using Windows, just uncomment the Windows command. If you are a Mac user, uncomment the Mac command. Also, the parenthesized part (UserName) should be replaced with your computer name. If your computer name has Korean characters, a space, or a special character in it, a nightmare begins. In such cases, look for the nearest computer geek in your vicinity. I have no intention to write another post about it.

# Windows
# setwd("C:/Users/(UserName)/Desktop/aol") 

# Mac
# setwd("/Users/(UserName)/Desktop/aol")

# A hashtag (#) works as a sign for R to ignore whatever's after it. To execute either setwd command above, simply remove the hashtag in the begining of the line. But be sure to delete the hash tag only in one of the two lines. If you're a Windows user, delete the hash tag before the Windows command. If you're a Mac user, the one before the Mac command. 

Or if you’d like to set your working directory elsewhere, refer to this video. Otherwise, just google ‘how to set up working directory in rstudio.’ You will realize that the whole world is eager to teach you how to do that.

2.2 Loading up required libraries (packages)

Although R comes with some basic functions, which we call base R, people have written awesome packages that have convenient functions to use. In our case, we will load a package called tidyverse for easy data manipulation and pracma for nth root calculation. If you haven’t installed either package on your computer, you will first have to install.packages(c("tidyverse", "pracma")) before you load them using library(tidyverse) and library(pracma).

library(tidyverse)
## -- Attaching packages ------------------------------------------ tidyverse 1.3.0 --
## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.0.3     √ dplyr   1.0.0
## √ tidyr   1.1.1     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

tidyverse is a package of packages that are commonly used for tidying, manipulating, and plotting data. The eight packages that come in tidyverse are listed above. Also, sometimes, some packages that you load might have the same function names as those in other packages. Then, the conflicting functions are also listed. Although the conflicts will not create errors in our script, if in doubt; you can add dplyr:: or stats:: before you use either the filter() or lag() functions to clarify which function from which package is being called for.

library(pracma)
## Warning: package 'pracma' was built under R version 4.0.4
## 
## Attaching package: 'pracma'
## The following object is masked from 'package:purrr':
## 
##     cross

When I loaded pracma, it gave me a warning that the package was written for R version 4.0.4. It is because the current version of R I’m using is 4.0.2. If the version difference creates a problem, you can simply update your R. For now, let’s just carry on. Also, the message that an object is masked from another package is basically the same as the conflict message from tidyverse. When you have to use cross from the purrr package, add purrr::. In our case, we can simply ignore it.

2.3 Loading in data

Now, use read.csv() to load in data. The data has responses to the 15 questions above from 19 students in total. Their names were all replaced with animal names.

raw.data <- read.csv("sampleDat.csv")

Now the data file is loaded, let’s take a look at how the data frame is structured. Instead of loading in all data, let’s take a look at the first six rows of the data frame using head().

head(raw.data)
##   Name Appearance_1 Appearance_2 Appearance_3 Intelligence_1 Intelligence_2
## 1 Boar            4            4            2              4              3
## 2 Buck            2            3            4              5              4
## 3 Bull            4            4            4              4              4
## 4  Cob            3            3            4              3              4
## 5  Cow            3            5            4              3              3
## 6  Doe            3            4            3              4              4
##   Intelligence_3 Kindness_1 Kindness_2 Kindness_3 Wealth_1 Wealth_2 Wealth_3
## 1              3          4          4          4        3        2        4
## 2              4          5          5          5        2        2        4
## 3              4          5          5          5        4        4        4
## 4              4          4          5          5        2        4        4
## 5              2          4          5          5        3        2        2
## 6              4          4          4          4        3        4        3
##   Ethics_1 Ethics_2 Ethics_3
## 1        3        3        4
## 2        5        5        5
## 3        4        4        4
## 4        4        5        5
## 5        4        4        3
## 6        4        4        3

If you open the data file in Excel or in any other spreadsheet, all columns will be next to each other. In this case, each row is divided into three different rows because of the limited page width.

On each row is a respondent’s name and his/her responses to the 15 questions. This is a data format that most people are familiar with, but this is not the data format that R is happy with. We call it messy data and need to make it tidy.

3 Manipulating data using dplyr::

3.1 Making wide data long using gather()

The first thing you now need to do is to turn the wide data above into a long form. In this case, we can use the gather() function from dplyr.

data0<-raw.data%>%
  gather(questions, responses, 2:16)

You have taken a peek at raw.data earlier, the code above says that you are creating a new data frame named data0 out of raw.data. The gather() function takes column headings from the second column to the 16th column for questions and responses in each cell as responses. In the wide format, each person was listed only in one row with his/her 15 responses next to the name. In the long data format, each person is repeated in 15 different rows and will be accompanied by a question and a response. The first 15 rows of the long data look like this.

head(data0%>%arrange(Name), n = 15L)
##    Name      questions responses
## 1  Boar   Appearance_1         4
## 2  Boar   Appearance_2         4
## 3  Boar   Appearance_3         2
## 4  Boar Intelligence_1         4
## 5  Boar Intelligence_2         3
## 6  Boar Intelligence_3         3
## 7  Boar     Kindness_1         4
## 8  Boar     Kindness_2         4
## 9  Boar     Kindness_3         4
## 10 Boar       Wealth_1         3
## 11 Boar       Wealth_2         2
## 12 Boar       Wealth_3         4
## 13 Boar       Ethics_1         3
## 14 Boar       Ethics_2         3
## 15 Boar       Ethics_3         4

3.2 Seperating a column using separate()

Now, you understand what a long form of data means. The next step is to separate keywords and numbers in the second column such as Intelligence_1 and Ethics_3. The keywords indicate the traits we are interested in, and the numbers indicate 1: Your trait, 2: Your preference for your partner’s trait. and 3: The importance of a trait. I have intentionally used such column names for easy separation of the keywords and the numbers. separate(questions, c("category", "Qnum")) means that the column questions will be separated into two columns category and Qnum. The keywords will be placed in the category column and the suffixes will be placed in the Qnum column.

data0<-data0%>%
  separate(questions, c("category", "Qnum"))

The %>% symbol is called a ‘pipe,’ and it links one code to the next. This way, you don’t have to create a new data frame every time you make a change in your data frame. Now, let’s take a look at how the column is now separated.

head(data0 %>% arrange(Name), n=15L)
##    Name     category Qnum responses
## 1  Boar   Appearance    1         4
## 2  Boar   Appearance    2         4
## 3  Boar   Appearance    3         2
## 4  Boar Intelligence    1         4
## 5  Boar Intelligence    2         3
## 6  Boar Intelligence    3         3
## 7  Boar     Kindness    1         4
## 8  Boar     Kindness    2         4
## 9  Boar     Kindness    3         4
## 10 Boar       Wealth    1         3
## 11 Boar       Wealth    2         2
## 12 Boar       Wealth    3         4
## 13 Boar       Ethics    1         3
## 14 Boar       Ethics    2         3
## 15 Boar       Ethics    3         4

The %>% arrange(Name) command is added to arrange the data file by respondents’ names. If you’d like to see what happens without it, you can simply try head(data0, n =15L) or just head(data0). What you see here now is that Boar thinks he is quite good looking since he gave 4 to the first question of Appearance, and he also wants his partner to be just as good looking, but he says Appearance is not that important as his score for the third question of Appearance is only 2. So, that’s how you read this data set.

3.3 Using mutate() and case_when() to create a new column

Remember that the importance of a trait was weighted exponentially in the video? We now need to turn the values of 1 to 5 into an exponential scale.

data0<-data0%>%
  mutate(weight=case_when(
    Qnum==3 & responses ==1~0,
    Qnum==3 & responses ==2~1,
    Qnum==3 & responses ==3~10,
    Qnum==3 & responses ==4~50,
    Qnum==3 & responses ==5~250
  ))

mutate() creates a new column called weight based on values in the columns Qnum and responses using case_when(). That is, if the question is about the importance of a trait and if the response is 1 (‘Not important at all’), then, its weight is 0. Likewise, if the question is 3 and the response is 5 (‘Extremely so’), then, the weight is 250, and so on and so forth.

head(data0 %>% arrange(Name), n=15L)
##    Name     category Qnum responses weight
## 1  Boar   Appearance    1         4     NA
## 2  Boar   Appearance    2         4     NA
## 3  Boar   Appearance    3         2      1
## 4  Boar Intelligence    1         4     NA
## 5  Boar Intelligence    2         3     NA
## 6  Boar Intelligence    3         3     10
## 7  Boar     Kindness    1         4     NA
## 8  Boar     Kindness    2         4     NA
## 9  Boar     Kindness    3         4     50
## 10 Boar       Wealth    1         3     NA
## 11 Boar       Wealth    2         2     NA
## 12 Boar       Wealth    3         4     50
## 13 Boar       Ethics    1         3     NA
## 14 Boar       Ethics    2         3     NA
## 15 Boar       Ethics    3         4     50

For every row that has a question about the importance of a trait and a response to it, the weight of the trait in the exponential scale is listed in the weight column.

The next step is to turn the continuous values in the Qnum column into categorical values because we do not want R to process the numbers in the Qnum column as ordinal or interval values. We can again use the mutate() and case_when() functions to do that. Also, adding in select(-Qnum) at the end of the codes will delete the now unwanted, redundant column (Qnum).

data0<-data0%>%
  mutate(number = case_when(
    Qnum ==1 ~"q1",
    Qnum ==2 ~"q2",
    Qnum ==3 ~"q3"
  ))%>%
  select(-Qnum)
head(data0 %>% arrange(Name), n=15L)
##    Name     category responses weight number
## 1  Boar   Appearance         4     NA     q1
## 2  Boar   Appearance         4     NA     q2
## 3  Boar   Appearance         2      1     q3
## 4  Boar Intelligence         4     NA     q1
## 5  Boar Intelligence         3     NA     q2
## 6  Boar Intelligence         3     10     q3
## 7  Boar     Kindness         4     NA     q1
## 8  Boar     Kindness         4     NA     q2
## 9  Boar     Kindness         4     50     q3
## 10 Boar       Wealth         3     NA     q1
## 11 Boar       Wealth         2     NA     q2
## 12 Boar       Wealth         4     50     q3
## 13 Boar       Ethics         3     NA     q1
## 14 Boar       Ethics         3     NA     q2
## 15 Boar       Ethics         4     50     q3

However, R does not like empty cells with NA in it. Since the weight will be applicable just the same for the same trait of a person, we can just fill in all the blanks with the same weight as long as the empty cells belong to the same trait of the same person.

data0<-data0%>%
  group_by(Name, category)%>%
  fill(weight, .direction = "up")

The code above says that the data file should be sorted by respondents’ names and question categories. Then, it fills up the question cells for questions 1 and 2 within the same category of a person with the same weight as the weight value in q3 (Question 3: The important of a trait). Check out what change has been made below.

head(data0 %>% arrange(Name), n=15L)
## # A tibble: 15 x 5
## # Groups:   Name, category [5]
##    Name  category     responses weight number
##    <chr> <chr>            <int>  <dbl> <chr> 
##  1 Boar  Appearance           4      1 q1    
##  2 Boar  Appearance           4      1 q2    
##  3 Boar  Appearance           2      1 q3    
##  4 Boar  Intelligence         4     10 q1    
##  5 Boar  Intelligence         3     10 q2    
##  6 Boar  Intelligence         3     10 q3    
##  7 Boar  Kindness             4     50 q1    
##  8 Boar  Kindness             4     50 q2    
##  9 Boar  Kindness             4     50 q3    
## 10 Boar  Wealth               3     50 q1    
## 11 Boar  Wealth               2     50 q2    
## 12 Boar  Wealth               4     50 q3    
## 13 Boar  Ethics               3     50 q1    
## 14 Boar  Ethics               3     50 q2    
## 15 Boar  Ethics               4     50 q3

From here on, I have made a few changes in the way the video explains how calculations should be done. Because the video used only two questions, it just calculates the average score of the two questions for each of the two people, and they get the square root of the product of the two values. However, when there are five different questions (as in our case), the order in which calculations are done should be different from that in the video. I assume that the actual algorithm that they use is a lot more complex, but following is one way I believe will work.

3.4 Using spread() to make long data wide

The spread() function takes two arguments; one column that has a set number of values to provide new column headings and the other column that will fill the cells of the new columns. Below, we take the columns number and responses. Since number has three different values q1, q2, and q3, we will have new column headings labeled q1, q2, and q3. Then, the columns will be filled with responses from each respondent.

data0<-data0%>%
  spread(number, responses)

You can see the changes made below. In the previous table, we saw the same person (Boar) repeated in 15 different rows. Instead, the new table shows that each person is repeated only five different times (for the five traits), and responses for each question is now spread out sideways.

head(data0 %>% arrange(Name), n=15L)
## # A tibble: 15 x 6
## # Groups:   Name, category [15]
##    Name  category     weight    q1    q2    q3
##    <chr> <chr>         <dbl> <int> <int> <int>
##  1 Boar  Appearance        1     4     4     2
##  2 Boar  Ethics           50     3     3     4
##  3 Boar  Intelligence     10     4     3     3
##  4 Boar  Kindness         50     4     4     4
##  5 Boar  Wealth           50     3     2     4
##  6 Buck  Appearance       50     2     3     4
##  7 Buck  Ethics          250     5     5     5
##  8 Buck  Intelligence     50     5     4     4
##  9 Buck  Kindness        250     5     5     5
## 10 Buck  Wealth           50     2     2     4
## 11 Bull  Appearance       50     4     4     4
## 12 Bull  Ethics           50     4     4     4
## 13 Bull  Intelligence     50     4     4     4
## 14 Bull  Kindness        250     5     5     5
## 15 Bull  Wealth           50     4     4     4

The next step is to calculate how the other 18 people will score in each respondent’s scale. For that, let’s create a new data frame called data1.

data1<-data0%>%
  select(Name, category, q1)%>%
  rename(partner=Name, partner_resp=q1)

You create a new data frame called data1 out of data0 and select only three columns: Name, category, q1. Then, rename the Name column partner and q1 partner_resp. Everyone is a potential partner to everyone anyway! And it looks like this.

head(data1 %>% arrange(partner), n=15L)
## # A tibble: 15 x 3
## # Groups:   partner, category [15]
##    partner category     partner_resp
##    <chr>   <chr>               <int>
##  1 Boar    Appearance              4
##  2 Boar    Ethics                  3
##  3 Boar    Intelligence            4
##  4 Boar    Kindness                4
##  5 Boar    Wealth                  3
##  6 Buck    Appearance              2
##  7 Buck    Ethics                  5
##  8 Buck    Intelligence            5
##  9 Buck    Kindness                5
## 10 Buck    Wealth                  2
## 11 Bull    Appearance              4
## 12 Bull    Ethics                  4
## 13 Bull    Intelligence            4
## 14 Bull    Kindness                5
## 15 Bull    Wealth                  4

3.5 Merging two data frames using inner_join()

There are many different commands to merge two data files. For now, we will use inner_join(), which combines data frames x and y, returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. It takes three arguments, x (data0), y (data1), and by="common column name". In our case, the category column will be the joiner of the two data frames.

data2<-inner_join(data0, data1, by="category")
head(data2, n=19L)
## # A tibble: 19 x 8
## # Groups:   Name, category [1]
##    Name  category   weight    q1    q2    q3 partner partner_resp
##    <chr> <chr>       <dbl> <int> <int> <int> <chr>          <int>
##  1 Boar  Appearance      1     4     4     2 Boar               4
##  2 Boar  Appearance      1     4     4     2 Buck               2
##  3 Boar  Appearance      1     4     4     2 Bull               4
##  4 Boar  Appearance      1     4     4     2 Cob                3
##  5 Boar  Appearance      1     4     4     2 Cow                3
##  6 Boar  Appearance      1     4     4     2 Doe                3
##  7 Boar  Appearance      1     4     4     2 Dog                3
##  8 Boar  Appearance      1     4     4     2 Drake              3
##  9 Boar  Appearance      1     4     4     2 Duck               4
## 10 Boar  Appearance      1     4     4     2 Gander             3
## 11 Boar  Appearance      1     4     4     2 Goose              3
## 12 Boar  Appearance      1     4     4     2 Hen                1
## 13 Boar  Appearance      1     4     4     2 Lioness            4
## 14 Boar  Appearance      1     4     4     2 Mare               2
## 15 Boar  Appearance      1     4     4     2 Pen                3
## 16 Boar  Appearance      1     4     4     2 Ram                3
## 17 Boar  Appearance      1     4     4     2 Rooster            3
## 18 Boar  Appearance      1     4     4     2 Tigress            4
## 19 Boar  Appearance      1     4     4     2 Whale              3

The combined data frame now shows each person’s preference on each trait and a potential partner’s response for each trait in one row. The problem is that because R simply calculates everything we tell it to, it shows a comparison between Boar (in the name column) and Boar (in the partner column). Currently, the number of rows in the table is:

nrow(data2)
## [1] 1805

That is the same number as 19 * 19 * 5 = 1805.

3.6 Selecting rows based on cell values using filter()

Although it is true that everyone becomes someone else’s partner, we do not need to calculate a match score between a person and him/herself. So, let’s drop cases where the Name == partner. The code below says that data will be filtered in only when the Name and partner columns do not have the same values.

data2<-data2%>%
  filter(Name != partner)

Now, the number of rows in the filtered data file is:

nrow(data2)
## [1] 1710

, which is the result of 19*18*5.

head(data2, n=19L)
## # A tibble: 19 x 8
## # Groups:   Name, category [2]
##    Name  category   weight    q1    q2    q3 partner partner_resp
##    <chr> <chr>       <dbl> <int> <int> <int> <chr>          <int>
##  1 Boar  Appearance      1     4     4     2 Buck               2
##  2 Boar  Appearance      1     4     4     2 Bull               4
##  3 Boar  Appearance      1     4     4     2 Cob                3
##  4 Boar  Appearance      1     4     4     2 Cow                3
##  5 Boar  Appearance      1     4     4     2 Doe                3
##  6 Boar  Appearance      1     4     4     2 Dog                3
##  7 Boar  Appearance      1     4     4     2 Drake              3
##  8 Boar  Appearance      1     4     4     2 Duck               4
##  9 Boar  Appearance      1     4     4     2 Gander             3
## 10 Boar  Appearance      1     4     4     2 Goose              3
## 11 Boar  Appearance      1     4     4     2 Hen                1
## 12 Boar  Appearance      1     4     4     2 Lioness            4
## 13 Boar  Appearance      1     4     4     2 Mare               2
## 14 Boar  Appearance      1     4     4     2 Pen                3
## 15 Boar  Appearance      1     4     4     2 Ram                3
## 16 Boar  Appearance      1     4     4     2 Rooster            3
## 17 Boar  Appearance      1     4     4     2 Tigress            4
## 18 Boar  Appearance      1     4     4     2 Whale              3
## 19 Boar  Ethics         50     3     3     4 Buck               5

3.7 Creating a new column using if_else()

Now, we need to calculate how each partner’s response measures in each respondent’s scale. Boar wants his partner to be quite good-looking (4 on q2 of Appearance), and all potential partners responded that they were quite good-looking (4) or lower. What if someone said they were extremely good-looking (5), should we give more than 100% to the person? Maybe just 100% will make better sense. To do that, if a partner’s response is higher than a respondent’s wish for partner’s trait (q2), we can just give 1 for the partner’s score. If the partner’s response is equal to or smaller than the respondent’s wish, then, we should divide the partner’s response (partner_resp) by the respondent’s original preference for a partner’s trait (q2). And that calculation can be done through if_else() within mutate(). if_else() takes three arguments, a condition, a value when the condition is met, and a value when the condition is not met. So, the code below says that partner_score should be 1 if partner_resp is greater than q2; otherwise, it should be partner_resp/q2.

data2<-data2%>%
  mutate(partner_score = if_else(partner_resp > q2, c(1), c(partner_resp/q2)))
head(data2)
## # A tibble: 6 x 9
## # Groups:   Name, category [1]
##   Name  category   weight    q1    q2    q3 partner partner_resp partner_score
##   <chr> <chr>       <dbl> <int> <int> <int> <chr>          <int>         <dbl>
## 1 Boar  Appearance      1     4     4     2 Buck               2          0.5 
## 2 Boar  Appearance      1     4     4     2 Bull               4          1   
## 3 Boar  Appearance      1     4     4     2 Cob                3          0.75
## 4 Boar  Appearance      1     4     4     2 Cow                3          0.75
## 5 Boar  Appearance      1     4     4     2 Doe                3          0.75
## 6 Boar  Appearance      1     4     4     2 Dog                3          0.75

Since we do not need q1, q2, q3, partner_resp any more, let’s just get rid of them.

data2<-data2%>%
  select(-c(q1, q2, q3, partner_resp))
head(data2)
## # A tibble: 6 x 5
## # Groups:   Name, category [1]
##   Name  category   weight partner partner_score
##   <chr> <chr>       <dbl> <chr>           <dbl>
## 1 Boar  Appearance      1 Buck             0.5 
## 2 Boar  Appearance      1 Bull             1   
## 3 Boar  Appearance      1 Cob              0.75
## 4 Boar  Appearance      1 Cow              0.75
## 5 Boar  Appearance      1 Doe              0.75
## 6 Boar  Appearance      1 Dog              0.75
nrow(data2)
## [1] 1710

The next thing I would like to do is to juxtapose B’s score on A’s scale and A’s score on B’s scale next to each other. And the following codes will just do that.

data3<-data2%>% 
  mutate(joiner=paste(partner, Name, sep = "_"))

head(data3)
## # A tibble: 6 x 6
## # Groups:   Name, category [1]
##   Name  category   weight partner partner_score joiner   
##   <chr> <chr>       <dbl> <chr>           <dbl> <chr>    
## 1 Boar  Appearance      1 Buck             0.5  Buck_Boar
## 2 Boar  Appearance      1 Bull             1    Bull_Boar
## 3 Boar  Appearance      1 Cob              0.75 Cob_Boar 
## 4 Boar  Appearance      1 Cow              0.75 Cow_Boar 
## 5 Boar  Appearance      1 Doe              0.75 Doe_Boar 
## 6 Boar  Appearance      1 Dog              0.75 Dog_Boar
data2<-data2%>%
  mutate(joiner=paste(Name, partner, sep = "_"))

head(data2)
## # A tibble: 6 x 6
## # Groups:   Name, category [1]
##   Name  category   weight partner partner_score joiner   
##   <chr> <chr>       <dbl> <chr>           <dbl> <chr>    
## 1 Boar  Appearance      1 Buck             0.5  Boar_Buck
## 2 Boar  Appearance      1 Bull             1    Boar_Bull
## 3 Boar  Appearance      1 Cob              0.75 Boar_Cob 
## 4 Boar  Appearance      1 Cow              0.75 Boar_Cow 
## 5 Boar  Appearance      1 Doe              0.75 Boar_Doe 
## 6 Boar  Appearance      1 Dog              0.75 Boar_Dog
data4<-left_join(data2, data3, by = c("joiner", "category"))

head(data4)
## # A tibble: 6 x 10
## # Groups:   category [1]
##   Name.x category weight.x partner.x partner_score.x joiner Name.y weight.y
##   <chr>  <chr>       <dbl> <chr>               <dbl> <chr>  <chr>     <dbl>
## 1 Boar   Appeara~        1 Buck                 0.5  Boar_~ Buck         50
## 2 Boar   Appeara~        1 Bull                 1    Boar_~ Bull         50
## 3 Boar   Appeara~        1 Cob                  0.75 Boar_~ Cob          50
## 4 Boar   Appeara~        1 Cow                  0.75 Boar_~ Cow          50
## 5 Boar   Appeara~        1 Doe                  0.75 Boar_~ Doe          10
## 6 Boar   Appeara~        1 Dog                  0.75 Boar_~ Dog          50
## # ... with 2 more variables: partner.y <chr>, partner_score.y <dbl>

Now the newly joined data4 has too many columns that it will not be displayed properly, let’s just check the names of the columns in the new data set.

colnames(data4)
##  [1] "Name.x"          "category"        "weight.x"        "partner.x"      
##  [5] "partner_score.x" "joiner"          "Name.y"          "weight.y"       
##  [9] "partner.y"       "partner_score.y"
nrow(data4)
## [1] 1710

What I can tell you here is that Name.x is the same as partner.y and that partner.x is the same as Name.y. What we can do is set Name.x to be A and Name.y to be B. Then, partner_score.x will be B’s score on A’s scale (BonA), and partner_score.y will be A’s score on B’s scale (AonB). Also, weight.x means the weight placed on a trait by a respondent, which we now call A, and weight.y indicates the weight placed on a trait by a partner, which we now call B. Let’s rename them to be weight.A and weight.B, respectively.

data4<-data4%>%
  rename(A = Name.x, B = Name.y, BonA = partner_score.x, AonB = partner_score.y, weight.A = weight.x, weight.B = weight.y)
head(data4)
## # A tibble: 6 x 10
## # Groups:   category [1]
##   A     category  weight.A partner.x  BonA joiner B     weight.B partner.y  AonB
##   <chr> <chr>        <dbl> <chr>     <dbl> <chr>  <chr>    <dbl> <chr>     <dbl>
## 1 Boar  Appearan~        1 Buck       0.5  Boar_~ Buck        50 Boar        1  
## 2 Boar  Appearan~        1 Bull       1    Boar_~ Bull        50 Boar        1  
## 3 Boar  Appearan~        1 Cob        0.75 Boar_~ Cob         50 Boar        1  
## 4 Boar  Appearan~        1 Cow        0.75 Boar_~ Cow         50 Boar        0.8
## 5 Boar  Appearan~        1 Doe        0.75 Boar_~ Doe         10 Boar        1  
## 6 Boar  Appearan~        1 Dog        0.75 Boar_~ Dog         50 Boar        0.8

Now, let’s leave in necessary columns only.

data4<-data4%>%
  select(A, B, category, AonB, BonA, weight.A, weight.B)
head(data4)
## # A tibble: 6 x 7
## # Groups:   category [1]
##   A     B     category    AonB  BonA weight.A weight.B
##   <chr> <chr> <chr>      <dbl> <dbl>    <dbl>    <dbl>
## 1 Boar  Buck  Appearance   1    0.5         1       50
## 2 Boar  Bull  Appearance   1    1           1       50
## 3 Boar  Cob   Appearance   1    0.75        1       50
## 4 Boar  Cow   Appearance   0.8  0.75        1       50
## 5 Boar  Doe   Appearance   1    0.75        1       10
## 6 Boar  Dog   Appearance   0.8  0.75        1       50

Let’s now calculate each pair’s match score on each trait.

data4<-data4%>%
  mutate(matchScore = ((weight.A*BonA + weight.B*AonB)/(weight.A+weight.B))*100)

head(data4)  
## # A tibble: 6 x 8
## # Groups:   category [1]
##   A     B     category    AonB  BonA weight.A weight.B matchScore
##   <chr> <chr> <chr>      <dbl> <dbl>    <dbl>    <dbl>      <dbl>
## 1 Boar  Buck  Appearance   1    0.5         1       50       99.0
## 2 Boar  Bull  Appearance   1    1           1       50      100  
## 3 Boar  Cob   Appearance   1    0.75        1       50       99.5
## 4 Boar  Cow   Appearance   0.8  0.75        1       50       79.9
## 5 Boar  Doe   Appearance   1    0.75        1       10       97.7
## 6 Boar  Dog   Appearance   0.8  0.75        1       50       79.9

Also, if you’d like to see how the scores are distributed, a histogram (hist()) will help. Using ggplot2:: for a neat and pretty plot will be discussed in another post in the future.

hist(data4$matchScore)

With the match score calculated, we do not need the other columns (AonB, BonA, weight.A, weight.B), and I would like to spread out the trait category to a separate column for each trait. Let’s select() necessary columns only and spread() matchScore by category.

data4<-data4%>%
  select(A, B, category, matchScore)%>%
  spread(category, matchScore)

nrow(data4)
## [1] 342
head(data4)
## # A tibble: 6 x 7
##   A     B     Appearance Ethics Intelligence Kindness Wealth
##   <chr> <chr>      <dbl>  <dbl>        <dbl>    <dbl>  <dbl>
## 1 Boar  Buck        99.0   66.7          100     83.3  100  
## 2 Boar  Bull       100     87.5          100     83.3   87.5
## 3 Boar  Cob         99.5   66.7          100     83.3   87.5
## 4 Boar  Cow         79.9   95.8          100     83.3  100  
## 5 Boar  Doe         97.7   95.8          100    100     95.8
## 6 Boar  Dog         79.9   80            100     83.3  100

You can see that each pair’s match scores for the five different trait categories are next to each other in one line. We now need to get the 5th root of the product of the five values.

data4<-data4%>%
  mutate(compatibility = nthroot(Appearance * Ethics * Intelligence * Kindness * Wealth, 5))
head(data4)
## # A tibble: 6 x 8
##   A     B     Appearance Ethics Intelligence Kindness Wealth compatibility
##   <chr> <chr>      <dbl>  <dbl>        <dbl>    <dbl>  <dbl>         <dbl>
## 1 Boar  Buck        99.0   66.7          100     83.3  100            88.7
## 2 Boar  Bull       100     87.5          100     83.3   87.5          91.4
## 3 Boar  Cob         99.5   66.7          100     83.3   87.5          86.5
## 4 Boar  Cow         79.9   95.8          100     83.3  100            91.4
## 5 Boar  Doe         97.7   95.8          100    100     95.8          97.9
## 6 Boar  Dog         79.9   80            100     83.3  100            88.2

Here is the distribution of the final compatibility scores.

hist(data4$compatibility)

If you’d like to see the entire data frame in a spreadsheet, you can write a csv file using write.csv().

write.csv(data4, file = "comp.csv")

Then, you will find a csv file within the same working directory you set in the beginning of this post. If you are interested in looking at the resulting file, here it is.

3.8 In a nutshell

3.8.2 Codes in one place

Following are all codes used in this script in one place.

data0<-raw.data%>%
  gather(questions, responses, 2:16)%>%
  separate(questions, c("category", "Qnum"))%>%
  mutate(weight=case_when(
    Qnum==3 & responses ==1~0,
    Qnum==3 & responses ==2~1,
    Qnum==3 & responses ==3~10,
    Qnum==3 & responses ==4~50,
    Qnum==3 & responses ==5~250
  ))%>%
  mutate(number = case_when(
    Qnum ==1 ~"q1",
    Qnum ==2 ~"q2",
    Qnum ==3 ~"q3"
  ))%>%
  select(-Qnum)%>%
  group_by(Name, category)%>%
  fill(weight, .direction = "up")%>%
  spread(number, responses)

data1<-data0%>%
  select(Name, category, q1)%>%
  rename(partner=Name, partner_resp=q1)

data2<-inner_join(data0, data1, by="category")%>%
  filter(Name != partner)%>%
  mutate(partner_score = if_else(partner_resp > q2, c(1), c(partner_resp/q2)))%>%
  select(-c(q1, q2, q3, partner_resp))

data3<-data2%>% 
  mutate(joiner=paste(partner, Name, sep = "_"))

data2<-data2%>%
  mutate(joiner=paste(Name, partner, sep = "_"))

data4<-left_join(data2, data3, by = c("joiner", "category"))%>%
  rename(A = Name.x, B = Name.y, BonA = partner_score.x, AonB = partner_score.y, weight.A = weight.x, weight.B = weight.y)%>%
  select(A, B, category, AonB, BonA, weight.A, weight.B)%>%
  mutate(matchScore = ((weight.A*BonA + weight.B*AonB)/(weight.A+weight.B))*100)%>%
  select(A, B, category, matchScore)%>%
  spread(category, matchScore)%>%
  mutate(compatibility = nthroot(Appearance * Ethics * Intelligence * Kindness * Wealth, 5))

4 Closing

Well, if anyone asks me if this was the most elegant way to compute compatibility scores or if this was exactly what Christian Rudder meant in his video, I’m not sure I can confidently say yes. But it’s definitely an idea that seemed to make the most sense to me. If anyone out there with a broader and deeper understanding of algorithmic thinking has better ideas, please email me with your ideas. I’m merely a learner of coding myself, who is striving to get better at it everyday.

Whoa! That was a massive amount of procrastination all in the midst of worrying about an IRB application, a manuscript, and a midterm exam to write. Hopefully, there is someone out there reading this and finding it helpful. So long for now! I will try not to take too long to return to write another post of this sort ;)