Dummy variable (statistics)

A dummy variable is a dichotomous variable (typically coded 0 or 1) which indicates whether a case belongs to a particular category of a categorical variable. Dummy variables are often used to include categorical predictors in multiple linear regression (MLR).

Dummy coding refers to the process of coding a categorical variable into dichotomous variables. For example, we may have data about participants' religion, with each participant coded as follows:

A categorical or nominal variable with three categories

Religion    Code
Christian   1
Muslim      2
Atheist     3

This is a nominal variable (see level of measurement): the codes 1, 2 and 3 are arbitrary labels with no order or magnitude, so entering the variable directly as a predictor in MLR would be inappropriate. However, this variable could be represented using a series of three dichotomous variables (each coded as 0 or 1), as follows:

Full dummy coding for a categorical variable with three categories

Religion    Christian   Muslim   Atheist
Christian   1           0        0
Muslim      0           1        0
Atheist     0           0        1
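
As a rough illustration, the full dummy coding in the table above could be produced from the numeric codes along these lines. This is only a sketch: the function name `full_dummy_code` and the `code_to_label` mapping are made up for this example and do not come from any particular statistics package.

```python
# Numeric codes from the first table, mapped back to their labels.
code_to_label = {1: "Christian", 2: "Muslim", 3: "Atheist"}
categories = ["Christian", "Muslim", "Atheist"]


def full_dummy_code(codes):
    """Return one row of 0/1 indicators per code, one column per category."""
    labels = [code_to_label[code] for code in codes]
    return [[1 if label == category else 0 for category in categories]
            for label in labels]


# One participant per category, in the order used in the table.
table = full_dummy_code([1, 2, 3])
for category, row in zip(categories, table):
    print(category, *row)
# Christian 1 0 0
# Muslim 0 1 0
# Atheist 0 0 1
```

Each row contains exactly one 1, in the column of the participant's own category.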

There is some redundancy in this dummy coding. For instance, in this simplified data set, if we know that someone is not Christian and not Muslim, then they are Atheist.

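So we only need to use two of these three dummy-coded variables as predictors. In fact, entering all three together with an intercept would make the predictors perfectly collinear, so the regression coefficients could not be uniquely estimated. More generally, the number of dummy-coded variables needed is one less than the number of categories.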
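
The redundancy can be checked directly. The short sketch below (the names `full_coding` and `dummies_needed` are illustrative only) confirms that, in the full coding, the three indicators always sum to 1, so any one of them can be recovered from the other two.

```python
# Full dummy coding: (Christian, Muslim, Atheist) indicators per category.
full_coding = {
    "Christian": (1, 0, 0),
    "Muslim": (0, 1, 0),
    "Atheist": (0, 0, 1),
}

for religion, (christian, muslim, atheist) in full_coding.items():
    # Exactly one indicator is 1, so the three always sum to 1 ...
    assert christian + muslim + atheist == 1
    # ... which means the third is fully determined by the other two.
    assert atheist == 1 - christian - muslim


def dummies_needed(n_categories):
    """Number of dummy variables needed for a variable with n categories."""
    return n_categories - 1


print(dummies_needed(3))  # 2
```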

The choice of which dummy variable to omit does not change how well the model fits, but it determines how the coefficients are interpreted, so it should follow from the research question. For example, if I'm interested in the effect of being religious, my reference (or baseline) category would be Atheist. I would then be interested in the extent to which being Christian (0 = No, 1 = Yes) or Muslim (0 = No, 1 = Yes) predicts a dependent variable (such as Happiness) in a regression analysis. In this case, the dummy coding to be used would be the following subset of the previous full dummy coding table:

Dummy coding for a categorical variable with three categories, using Atheist as the reference category

Religion    Christian   Muslim
Christian   1           0
Muslim      0           1
Atheist     0           0
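
To show what the reference category means in practice, the sketch below codes the two dummies with Atheist as the reference and fits a regression of Happiness on them by ordinary least squares. Everything here is an assumption made for illustration: the Happiness scores are invented, and `dummy_code` and `ols` are hypothetical helper names written in plain Python rather than calls from a statistics library. With only these dummies in the model, the intercept is the mean of the reference group and each coefficient is the difference between that category's mean and the reference mean.

```python
def dummy_code(values, categories, reference):
    """Return (column_names, rows): one 0/1 column per non-reference category."""
    columns = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return columns, rows


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data, invented purely for illustration.
religion = ["Christian", "Christian", "Muslim", "Muslim", "Atheist", "Atheist"]
happiness = [6.0, 8.0, 5.0, 7.0, 4.0, 6.0]

columns, dummies = dummy_code(
    religion, ["Christian", "Muslim", "Atheist"], reference="Atheist")
X = [[1] + row for row in dummies]  # the leading 1 is the intercept term
intercept, b_christian, b_muslim = ols(X, happiness)

# Group means are 7 (Christian), 6 (Muslim) and 5 (Atheist), so up to
# rounding error the fit gives intercept 5, b_christian 2 and b_muslim 1.
print(columns, round(intercept, 6), round(b_christian, 6), round(b_muslim, 6))
```

Choosing a different reference category would give different coefficients (each category would then be compared with the new baseline), but the same predicted Happiness for every group.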

Alternatively, I may simply want to recode religion into a single dichotomous variable indicating, for example, whether a participant is Atheist (0) or Religious (1), where Religious means Christian or Muslim. The coding would be as follows:

A dichotomous variable recoded from the three religion categories

Religiosity   Code
Atheist       0
Religious     1
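
A recode like this amounts to a simple lookup. The sketch below uses a hypothetical mapping named `religiosity_code`; an explicit mapping is used, rather than a test such as "not Atheist", so that an unexpected or misspelled category raises an error instead of being silently counted as Religious.

```python
# Map each original category to the dichotomous Religiosity code.
religiosity_code = {
    "Atheist": 0,    # Atheist
    "Christian": 1,  # Religious
    "Muslim": 1,     # Religious
}


def recode_religiosity(religions):
    """Recode a list of religion labels into 0 (Atheist) / 1 (Religious)."""
    return [religiosity_code[religion] for religion in religions]


print(recode_religiosity(["Christian", "Muslim", "Atheist"]))  # [1, 1, 0]
```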

This article is issued from Wikiversity - version of the Thursday, September 04, 2014. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.