July 15, 2011
Birthday Problem: R Simulation Explained As promised, here is the mini-lesson taking apart the code for the birthday problem simulation. First, let’s take another look at the code. y<-NULL for(i in 1:100){ x<-sample(1:365,30,replace=TRUE) tab<-table(x) y[i]<-ifelse(length(tab)<30,"Had common b-day","No common b-days") } table(y) barplot(table(y)) What this does is it randomly generates birthdays for 30 people, then repeats it 100 times. That way, we can see about what percentage of groups have two people with the same birthday experimentally. Let’s start with the first line of code. y<-NULL
What this does is create the y variable without putting anything in it. I did this because I want to assign values to y later on, but don’t want to declare exactly how long it is yet. The next part is the for loop. for(i in 1:100){ x<-sample(1:365,30,replace=TRUE) tab<-table(x) y[i]<-ifelse(length(tab)<30,"Had common b-day","No common b-days") } Those of you familiar with another language can probably tell what the for loop does already. The part in the parentheses shows for which values to iterate the statement in the curly braces. So, since we have for(i in 1:100){, R will run whatever is inside the curly braces for i = 1, then run it all again for i = 2, then for i = 3, and so on until it runs it for i = 100, after which it stops. Basically, this is what I am using to repeatedly generate groups of 30 over and over. If you wanted to create a bigger sample, you could change the 1:100 to a bigger range, like 1:1000. Now let’s take a look at what exactly we’re looping. The first line inside the for loop is the actual sample. The sample function takes two main inputs: the set it’s sampling from and the number of times you are sampling from it. The 1:365 represents the former (each number represents a day of the year — we are ignoring leap years for simplicity) and the 30 represents the latter. I added the replace=TRUE to tell R that I want to sample with replacement since the sample function by default samples without replacement. This sample is then stored in the new variable x. Next is the table function. The table function looks at the vector x and counts how many of each value is inside x. So, if we had 30 different birthdays, we would expect 30 different numbers
1
each with a count of 1. If we had two people with the same birthday, then one number would have a count of 2. This table is then assigned to the variable tab. Now we get to the final part. For each ith iteration, I want to record whether the group had a common birthday or not. So, we are assigning a value to y[i] using the ifelse command. There are three part to the ifelse command: the test, what you do if it’s true, and what you do if it’s false. The test is in this case is length(tab)<30. In other words, if the length of tab is less than 30, the ifelse command returns the second argument. If it is not less than 30, the ifelse command returns the third argument. Why did I choose length(tab)<30? Simply put, if everyone had a different birthday in this particular group, the table would have had 30 different numbers with 1 value for each of them. Thus, the length of tab would be 30. If there were any common birthdays, there would be fewer than 30 unique birthdays, making the length of tab less than 30. So, in short, if tab was shorter than 30, there was a common birthday, while if it was exactly 30, there were no common birthdays. This explains why the second and third arguments are what they are. These two arguments are strings — characters rather than numbers (yes, you can have a vector of characters too). The quotation marks are there to indicate that these are strings. Now, after you run the for loop, if you look at y, you’ll see that it is just a list of ”Had common b-day” and ”No common b-day” entries. In order to see how many of each we have, we can just use the table function again. table(y)
Finally, to get a nice graphical look at it, we can make a barplot of the table. barplot(table(y))
And that’s it! There was a lot of new code here, so feel free to ask me anything about how any of this works. Feel free to play around with it all too, like taking bigger or smaller sample sizes, and checking what is inside each variable at various points.
2