Value Labels
For categorical variables the raw data will often show the string values for the selected response. For instance, you may see “male” and “female” displayed as possible responses to the variable gender. When doing calculations, however, you’ll need these variables to be numeric (in the float or long format) – if they are not already imported this way. It is helpful to preserve the extra information the strings capture by using “value labels.” A value label such as gender would assign “female” to 0 and “male” to 1 and display female and male to the analyst. See help label
for how to do this in Stata. It is very important to label values for two reasons:
- it provides information to the analyst that will reduce mistakes made in coding or analyzing data
- many programs will use information on whether a variable has value labels in order to identify it as a categorical variable, as opposed to a continuous numeric variable.
Encoding String Values in Stata
The quickest way to change string variables to numeric variables with value labels is the encode
command. encode
will automatically convert the string variable into a numeric variable and assign the numbers 1 – x (where x is the number of unique answer choices) to the alphabetized list of the answer choices (ordered 0-9, followed by a-z). Because this is done automatically based on alphabetical order, if you are trying to make value labels match some existing assignment, you may need to recode or label them manually.
Stata stores value labels independently from the variables, so it’s important to manage value labels separately from variables as they can contain PII. Deleting all variables that have a value label and saving the dataset will ensure that the value label is removed from the .dta file. To see which labels are currently defined in Stata and their content you can use the label list
command (helpful summary information is stored in the return values of label list
and label dir
as well). These labels can be modified or deleted to combine using the options of label define function:
*Drop old labels
label drop ex1 ex2 ex3
*Define label
label define yesno 1 "No" 3 "Yes"
label list yesno
*Modify label to correct the error
label define yesno 1 "No" 2 "Yes", modify
*Add extended values to the label defined above
label define yesno .n "No response" 3 "Maybe", add
*Apply the label to all of the variables it should apply to
loc dummy_vars ex1 ex2 ex3
label values `dummy_vars' yesno
Formatting Labeling in Stata
It can be useful to change the delimiter to a semicolon so that a single command can take up several rows in your text editor, making it easier to read labeling. This is especially useful when multiple values are labeled. See help delimit
to learn about delimiters in Stata. An example would be:
* Set delimiter for labeling
#delimit ;
label def female
0 "[0] Male"
1 "[1] Female"
;
label def region
1 "[1] Northern"
2 "[2] Southern"
3 "[3] Western"
4 "[4] Eastern"
5 "[5] Central"
;
#delimit cr
label values female female
label values region region
Note how the labels have the corresponding value as well as the description in the value label. This is not strictly necessary, but can be useful if you want to values to display alongside labels in outputs.
Defensive Workflow for Encoding Values
One way to ensure that data is encoded in an expected way is to check that values are only encoded from a list of value labels that you defined. The user written command sencode
(installed using ssc install sencode
) can help to support this. sencode
labels the variable according to the values that you’ve predefined and then adds additional values in order from the highest value if it encounters values that you haven’t defined. An example data flow follows:
*Ensure sencode is installed
cap which sencode
if _rc ssc install sencode
*Load data
sysuse auto, clear
keep if _n <= 10 // Keep the first ten observations of the sample
# del ;
/* This label is named by the variable name, an "_", and then "label"
so that we can loop over the labels.
*/
label define make_label
1 "AMC Concord"
2 "AMC Pacer"
3 "AMC Spirit"
4 "Buick Century"
5 "Buick Electra"
6 "Buick LeSabre"
7 "Buick Opel"
8 "Buick Regal"
9 "Buick Riviera"
10 "Buick Skylark"
;
# del cr
*Create a local list of variables to encode
loc str_var make
*Encode values and confirm expected
foreach var of local str_var {
*Encode variables
sencode `var', label(`var'_label) replace noextend // type h sencode to see options, the noextend option returns an error if the existing label doesn't capture all values.
}
// end foreach v of local str_var
*Display the labeled and unlabeled values
tab make
tab make, nol