Importing into Stata
Data comes in many forms, from raw text (.txt) files to multi-sheet Excel (.xls and .xlsx) files. Importing data into Stata is necessary if the data is not already in Stata format (.dta file). In general, you should be able to use Stata’s functions and loops to efficiently import data. Stata can import most data formats. If your project’s data was collected using a platform like SurveyCTO, the raw data will come in .csv format and the SurveyCTO server will provide a do file that imports the raw .csv data into the Stata .dta format.
Importing different file types
In Stata 13 and beyond, the import
command can import CSV files, excel files, and more depending on the option used (delimited for CSV, excel for excel), and export
does the same for exporting. The import excel
and import delimited
commands provide a number of options that allow for a large amount of control of importing including from where in the workbook data should be imported from and how data should be saved. See help import
for details on these options.
If you are new to using import
are importing a file type you have not seen before, it can be helpful to use the drop-down menu by clicking “File>Import” and then selecting the appropriate file type. Once you do this, you will be able to copy the specific command syntax directly from the command prompt or review window in Stata to your do-file.
The insheet
command remains an alternative for .csv, .tsv, and .txt files. It performes differently than import delimited
and can be useful for some forms of data, but import delimited
should be used preferentially. One thing to note is that it is often a good idea when using the insheet
command, to use the option names and have your dataset have the same variable names at those at the top of the raw dataset.
Importing multiple files at once
A useful function for importing multiple files within a folder is the dir
extended macro function. You can find documentation on this by typing help extended_fcn
in Stata. This function allows you to store all the names of the files in a folder in a local so you can loop through them for importing. See example code of this process below.
*This stores all files with the extension .xlsx in the "$raw" data folder into a local "files"
local files: dir "$raw" files "*.xlsx", respectcase
*Loop through the files to import, clean the file name, and save as a dta
foreach file in `files' {
*Show your progress of which file you are working on
di in red "working on `file'"
*Import each file
import excel using "$raw/`file'", clear firstrow
**Quality Checks (Optional)
*Assert you have the correct number of observations.
qui count
assert `r(N)' == [number_of_expected_observations] // If you know the amount of expected observations
*Check that what variables you think sould be unique identifiers are indeed unique.
isid unique_id_var
*Check for expected/necessary variables
confirm var expected_var_names
* Edit filename
/*The filenames in the local "files" includes the extension (in this cas .xlsx).
So, I remove these and make new clean file name to save the files as.
You can edit the filenames however you see fit. */
local cleanfilename = subinstr("`file'", ".xlsx","",.)
*Save the file with the new clean file name as a dta file
save "filepath/dtafiles/`cleanfilename'_raw.dta", replace
}
Note that the new .dta files are no longer saved in the same folder that your raw excel, csv, or any other type of files were saved in. As the imported data are no longer raw, they should be saved in either a temporary or data folder.
It can be helpful set up a “dta” or “temp” folder for you to save these intermediate data files before you start. To do so, you can create a folder to save your files in directly in your script by using mkdir "filepath"
. If the directory already exists, this will create an error. One solution is to use the capture
command and type cap mkdir "filepath"
which will suppress the error. We recommend avoiding capture
in most situations.