Dummy Variables in R
As stated earlier, to consider a categorical variable as a predictor in a regression model, we create indicator variables to represent the categories that are not the reference. Continuing with the BMI category example we described above, let's walk through the steps of making dummy variables so that we can include BMI category as a predictor in a multiple linear regression model.
Since we are using the "normal" BMI category as our reference, we need to make indicator variables for being underweight, overweight, or obese. To do this in R, we can use ifelse statements to create the indicator variables. The general form of an ifelse statement is indicator <- ifelse(a, b, c), where indicator is the name of the dummy variable, a is the condition present among those with the indicator, b is the value that you want to assign to those who meet the condition in a (i.e., 1), and c is the value that you want to assign to those who do not meet the condition (i.e., 0).
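As a minimal sketch of the ifelse pattern, the snippet below builds a single indicator from a numeric vector. The variable names (age, senior) and the cutoff of 65 are hypothetical and are not taken from the course data.

```r
# Hypothetical ages, used only to illustrate the ifelse pattern
age <- c(34, 71, 58, 66)

# indicator <- ifelse(condition, value if TRUE, value if FALSE)
senior <- ifelse(age >= 65, 1, 0)

print(senior)
```

Each element of the vector is tested separately, so the result has one 0/1 value per observation.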
As a more concrete example, the code below can be used to create indicators for the underweight, overweight, and obese BMI categories. Note that people in the "normal" BMI category do not meet any of the conditions specified in the code, so they will have values of 0 for each of the indicator variables. See the table below to observe how the combination of dummy variables uniquely identifies each BMI category. Now that the dummy variables have been created, we can perform a multiple linear regression that includes this set of indicators in addition to other independent variables.
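One way the three indicators might be created is sketched here. The variable names (bmi, underweight, overweight, obese) and the cutoffs (below 18.5, 25 to below 30, and 30 or above) are assumptions based on the conventional BMI categories, and the four BMI values are made up so that each category appears once.

```r
# Hypothetical BMI values, one per category (not the course data)
bmi <- c(17.0, 22.0, 27.5, 33.0)

# Assumed cutoffs: <18.5 underweight, 25 to <30 overweight, >=30 obese
underweight <- ifelse(bmi < 18.5, 1, 0)
overweight  <- ifelse(bmi >= 25 & bmi < 30, 1, 0)
obese       <- ifelse(bmi >= 30, 1, 0)

# The "normal" (reference) row has 0 for all three indicators
print(data.frame(bmi, underweight, overweight, obese))
```

In the printed data frame, every row has at most one indicator equal to 1, and the row with all zeros is the reference category.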
Let's use the same example as above, where systolic blood pressure serves as the dependent variable, and BMI, age, sex, and use of antihypertensive medication are the independent variables. However, instead of using continuous BMI, we will use BMI categories, represented by our newly created dummy variables.
To perform this regression analysis in R, we use the following code:
[R output: residual summary with columns Min, 1Q, Median, 3Q, Max]
On the previous page we explained that in the multiple regression model, the regression coefficients associated with each of the dummy variables are interpreted as the expected difference in the mean of the outcome variable for that BMI category as compared to the "normal" BMI group, holding all other predictors constant.
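A regression of this kind might be fit as in the sketch below. It runs on simulated stand-in data, not the course data set: every variable name (sbp, male, bpmed, and so on), the BMI cutoffs, and the effect sizes used to generate the outcome are assumptions made only so the example is self-contained.

```r
set.seed(1)
n <- 200

# Simulated stand-in predictors; names and distributions are made up
bmi   <- runif(n, 16, 40)
age   <- round(runif(n, 30, 80))
male  <- rbinom(n, 1, 0.5)
bpmed <- rbinom(n, 1, 0.3)

# Dummy variables with "normal" BMI as the reference category
underweight <- ifelse(bmi < 18.5, 1, 0)
overweight  <- ifelse(bmi >= 25 & bmi < 30, 1, 0)
obese       <- ifelse(bmi >= 30, 1, 0)

# Simulated outcome with arbitrary, made-up effect sizes
sbp <- 100 + 0.5 * age + 2 * male + 8 * bpmed +
  4 * overweight + 9 * obese + rnorm(n, sd = 10)

# All three indicators enter the model; "normal" is left out as reference
fit <- lm(sbp ~ underweight + overweight + obese + age + male + bpmed)
print(summary(fit))

# Estimated adjusted difference in mean SBP, obese vs. normal BMI
print(coef(fit)["obese"])
```

Because the "normal" category has no indicator in the model, each dummy coefficient is read as a difference from that group, adjusted for the other predictors.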
So, from the above output, we see that, on average, systolic blood pressure is 6.
From the output above, how much higher do we expect the systolic blood pressure of obese individuals to be compared to individuals with normal BMI, adjusting for age, sex, and use of antihypertensive medication?
From the output above, we would reject the global null hypothesis, and conclude that at least one of the independent variables is associated with the outcome.
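The global null hypothesis is assessed with the overall F-test reported on the last line of the summary output. As a hedged illustration, the sketch below recovers that p-value from a fitted model using simulated data; the variables x1, x2, and y are hypothetical.

```r
set.seed(2)
n  <- 100

# Simulated data: y depends on x1 but not on x2
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# summary()$fstatistic holds the F value and its two degrees of freedom
f <- summary(fit)$fstatistic
p_global <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(f)
print(unname(p_global))
```

A small p-value here leads to rejecting the hypothesis that all slope coefficients are zero; it does not say which predictor is responsible.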
The video below offers an additional example of how to perform dummy variable regression in R. Note that in the video, Mike Marin allows R to create the dummy variables automatically.
You can do that as well, but as Mike points out, R automatically assigns the reference category, and its automatic choice may not be the group you wish to use as the reference. Thus, by manually creating our dummy variables to include in the model, we have ultimate control over the choice of reference group. The data set used in this video is the same one that was used in the video on page 3 about multiple linear regression.
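For comparison, the sketch below shows how R chooses a reference when it builds the dummy variables itself, and how the base R function relevel() can change that choice. The numeric codes and labels are hypothetical and are not taken from the data set used in the video.

```r
# Hypothetical coded BMI categories: 1 = underweight, ..., 4 = obese
bmi_code <- c(1, 2, 3, 4, 2)
bmi_cat  <- factor(bmi_code, levels = 1:4,
                   labels = c("underweight", "normal", "overweight", "obese"))

# The first level is the automatic reference; here it is "underweight"
print(levels(bmi_cat))

# relevel() moves the chosen level to the front, making it the reference
bmi_ref <- relevel(bmi_cat, ref = "normal")
print(levels(bmi_ref))

# model.matrix() shows the indicator columns R would build for lm()
mm <- model.matrix(~ bmi_ref)
print(mm)
```

Either route yields the same set of indicators; creating them by hand simply makes the choice of reference group explicit in the code.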
If you have not yet downloaded that data set, it can be downloaded from the following link.