2  Download and import data

2.1 File formats

The Enroll-HD PDS dataset is provided in two formats:

  • CSV file: CSV stands for comma separated values (.csv) which is a delimiter-separated format. The PDS data uses the tab as the delimiter. Software settings need to be adapted respectively.

  • R file: binary code format for the R (R Core Team, 2021) software application (a software environment for statistical analysis).

Because of the complexity and the size of the data set, use of a statistical software package such as R, Stata, or SAS is recommended. The .csv file format can also be imported into Excel (caution is advisable).

It is important that files are not be edited in a word processing software or other programs that may potentially modify characters, as this may damage the integrity of the original files. CSV files can be saved in other formats which are compatible with other statistical software packages as needed.

2.2 Importing data

Importing CSV files into Excel

The .csv files can be imported and opened in Microsoft Excel. Because Excel is language dependent and delimiters differ from one country to another, some considerations need to be addressed when opening the .csv files to maintain data integrity. The procedures outlined here, to open the .csv files, can be applied to most recent versions of Excel.

As a default, Excel reads the values for each column as being in a “General” format. For example, unless otherwise specified, Excel interprets numeric data as numbers (e.g., 1234), entered dates as date format (as pre-set, e.g. 11/28/2016), and changes other values (e.g. strings) to text format (e.g. Aspirin). For some entries this is counterproductive, as Excel may misinterpret entries and incorrectly reformat the data, effectively changing the data (e.g. 1.5 is read as May 1 instead of 1.5 mg; or the WHO-DD Code for Tetrabenazine 00222101003 is changed to 22211003, removing the important leading “0”s).

To maintain the integrity of the data, each data column needs to be carefully examined prior to importing the data into Excel.

An illustrated guide for correctly importing CSV data files into Excel are provided in Appendix A.


Importing CSV files into R

Make sure the CSV file has not been opened and saved using a word processing software. A software package capable of reading CSV files must be loaded into R environment. The package “readr” is one of the most popular packages, but there are several others that will also work. If a package like “readr” is not already installed, the CSV data files can be imported using the following code line:

install.packages(readr)

To load the CSV data into R using a package like “readr” use the: library(readr) command. To ensure the CSV file is imported correctly, set the directory to the file folder where the PDS files are located, and then run the following code:

file = read_delim("file.csv", "\t", escape_double = FALSE, trim_ws = TRUE)


Importing R files into R

This data file is specific for R. After loading the R data files into R, 9 data frames are made available in the R environment and are ready to be used. The loading can be done using the function command:

load(“Rdata_directory”)

For RStudio users, the loading can be performed by clicking in the “load workspace” ribbon, and then browsing for the location of the R data file.

Appendix A: An illustrated guide to correctly importing CSV files into Excel

The file used for this demonstration is the ‘profile.csv’ file.

Step 1 – Open CSV file in Excel: Open the .csv file using Excel, or open Excel and on the “Data” tab click “From Text/CSV”. Data will be imported in entirety into the first column of the Excel file, as illustrated below.

Step 2 – Open Text to Columns Wizard: Select the first column, then on the tab “Data” click “Text to Columns”. A wizard will appearto guide you through the process.

Step 3 – Select delimiter type: In the Text to Columns Wizard(step 1 of 3), select the “Delimited” checkbox (this lets Excel know that the data fields are separated by commas or tabs), then click “Next”.

Step 4 – Specify data file type:: In the Text to Columns Wizard(step 2 of 3), select the Delimiter type “Tab” (this lets Excel know that the data fields are separated by tabs specifically), then click “Next”.

Step 5 – Assign column formats: For each column (i.e., variable), an appropriate format needs to be assigned. This is completed in the Text to Columns Wizard (step 3 of 3). The default format “General” works for most columns. Columns where numbers have leading “0” and columns with mixed entries like 1.5, 1,5, 1/5, need to be explicitly formatted as “Text”, as entries might otherwise become corrupted in an unchangeable way. After assigning the correct format to each column, click “Finish”.

NB: The data files pharmacotx and nutsuppl contain two columns ‘cmtrt_decod’ and ‘cmdostot’ that require formatting as “Text”.

Step 6 – Save data file: The .csv file is now column-separated and should be saved as an Excel file (.xls or .xlsx) using the ‘Save As’ option.


R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/