Chapter 1 Overview

1.1 Dataset

Extraction

The data contained in Enroll-HD PDS6 were extracted from the Enroll-HD electronic data capture (EDC) database on November 4, 2022, at 10:00 UTC.


Data sources

The PDS6 dataset encompasses data exclusively from Enroll-HD participants, collected from several sources. These sources are the Enroll-HD study, the REGISTRY study, and clinical data collected in ad hoc visits outside of the aforementioned studies.

Enroll-HD is an observational cohort study and global clinical research platform designed to facilitate Huntington’s disease (HD) clinical research. It includes participants from North America, Europe, Australasia, and Latin America. The study started in 2012 and is still active and actively recruiting.

REGISTRY is an observational cohort study of HD conducted in Europe. The study started in 2004 and concluded in 2015. As Enroll-HD began, REGISTRY sites and participants began to transition into Enroll-HD. Enroll-HD dataset releases include individuals who initially participated in REGISTRY, subsequently enrolled in Enroll-HD, and consented to the migration of their REGISTRY data into the Enroll-HD dataset. REGISTRY data are available for a subset of Enroll-HD participants.

Clinical data from additional sources (Ad Hoc data) are available for a subset of Enroll-HD participants. These data were collected at routine clinical visits outside of the Enroll-HD and REGISTRY studies, and comprise HD assessment data (e.g., UHDRS Motor). The collection dates of these data typically predate a participant’s enrolment into REGISTRY or Enroll-HD.

Study specific protocols, annotated eCRFs, and data collection guidelines are housed here.


Participant inclusion

To be included in the PDS6 release, a participant’s data had to meet several requirements. Figure 1.1 illustrates the number of participants whose data met each of the predefined inclusion requirements and how the final sample size of PDS6 was determined.

Figure 1.1: Participant flow chart for inclusion in PDS6.

Due to data exclusion requirements, not all participants enrolled in Enroll-HD at the time of the PDS6 data cut are included in PDS6. Similarly, not all participants included in PDS5 are included in PDS6. Data from 200 participants who were included in PDS5 were not included in the current PDS release. Enroll-HD is an active, longitudinal study. A participant eligible for inclusion in one release may be ineligible for the next (e.g., if the participant’s data are quarantined). Data for PDS5 participants not included in PDS6 may be available through a specified dataset (SPS) request.

1.2 Sample Size

PDS6 contains data on 25,550 Enroll-HD participants. Sample size by PDS release is presented in Figure 1.2.

Figure 1.2: Enroll-HD sample size by PDS release.

1.3 Visits

PDS6 contains data from 95,040 visits (baseline and follow-up visits only; all sources). Of these, 78,730 were Enroll-HD visits. The remainder are from Registry (N = 15,292) and ‘Ad Hoc’ sources (N = 1,018). A breakdown of visits by data source is provided in Table 1.1. The number of Enroll-HD visits by PDS release is illustrated in Figure 1.3.

Table 1.1: Number of visits in PDS6 by constituent data source. Participant number indicates the number of Enroll-HD participants with visit data available for the indicated data source (maximum N = 25,550).
Data source    Participants    Visits
Enroll-HD             25550     78730
Registry 3             4337     10114
Registry 2             2153      5178
Ad Hoc                  316      1018
Total                           95040

Figure 1.3: Enroll-HD visits only (baseline and follow-up only) by PDS release.

Considering baseline and follow-up visits from Enroll-HD only, the total number of visits per participant in PDS6 ranges from 1 to 11. In Figure 1.4, we illustrate participant counts by maximum number of Enroll-HD visits. Each participant is represented once, in the bar corresponding to their maximum number of visits. In Figure 1.5, we illustrate maximum participant counts for a specific number of visits. This plot is cumulative: the goal is to illustrate the largest available sample size for a specific number of visits. For example, a participant with 11 visits is represented in visit bars 1 through 11, a participant with 10 visits is represented in visit bars 1 through 10, and so on.
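The difference between the two kinds of count can be sketched in base R. The visit table below is invented for illustration (only the column names subjid and seq come from the dataset), and it is assumed to contain baseline and follow-up rows only.

```r
# Toy visit table: one row per Enroll-HD baseline or follow-up visit.
# All values are invented for illustration.
visits <- data.frame(
  subjid = c("A", "A", "A", "B", "C", "C"),
  seq    = c(1, 2, 3, 1, 1, 2)
)

# Number of visits per participant (A = 3, B = 1, C = 2)
n_visits <- as.integer(table(visits$subjid))
max_k    <- max(n_visits)

# Absolute counts: each participant is counted once, under their maximum
# number of visits (the logic of Figure 1.4)
absolute <- table(factor(n_visits, levels = 1:max_k))

# Cumulative counts: participants with at least k visits
# (the logic of Figure 1.5)
cumulative <- sapply(1:max_k, function(k) sum(n_visits >= k))
names(cumulative) <- 1:max_k
```

In real data, unscheduled visits and phone contacts would need to be removed before counting rows.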

Figure 1.4: Participant counts by maximum number of Enroll-HD visits (baseline and follow-up visits only, unscheduled visits and phone contacts excluded). Full sample represented (N = 25,550, Missing N = 0).


Figure 1.5: Maximum participant counts for a specific number of Enroll-HD visits (baseline and follow-up visits only; unscheduled visits and phone contacts excluded). Cumulative plot. Full sample represented (N = 25,550; Missing N = 0).

Considering baseline and follow-up visits from all data sources (Enroll-HD, REGISTRY, Ad Hoc visits), total number of visits per participant ranges from 1 to >20. Maximum participant counts by visit number are presented in Figure 1.6.

Figure 1.6: Maximum participant counts for a specific number of visits; Enroll-HD, REGISTRY, Adhoc (baseline and follow-up visits only; unscheduled visits and phone contacts excluded). Cumulative plot. Full sample represented (N = 25,550; Missing N = 0).

1.4 Sample Characteristics

The PDS6 sample is characterized below with respect to participant category, sociodemographic variables, and clinical characteristics (Figures 1.7 to 1.19).

Figure 1.7: Participant category at baseline Enroll-HD visit (hdcat_0). Full sample represented (N = 25,550; Missing N = 0)


Figure 1.8: Participant category at latest Enroll-HD visit (hdcat_l). Full sample represented (N = 25,550; Missing N = 0)


Figure 1.9: Geographical region (region). Full sample represented (N = 25,550; Missing N = 0)


Figure 1.10: Sex (sex). Full sample represented (N = 25,550; Missing N = 0)


Figure 1.11: ISCED (isced) at baseline Enroll-HD visit. Full sample represented (N = 25,435; Missing N = 115)


Figure 1.12: HD integrated staging system (HD-ISS) imputed stage (hdiss_stage_imp) at baseline Enroll-HD visit. Only individuals with a research CAG length ≥ 40 are represented (N = 17,594).


Figure 1.13: Age at baseline Enroll-HD visit (age_0). Full sample represented (N = 25,550; Missing N = 0). Note that in PDS6, age for individuals under the age of 18 years is represented by the value ‘<18’. These values have been transformed to 17 for inclusion in this histogram.


Figure 1.14: Research CAG length (caghigh). Full sample represented (N = 25,550; Missing N = 0). Note that in PDS6, CAG length for individuals with a CAG length greater than 70 is represented by the value ‘>70’; these values have been transformed to 71 for inclusion in this histogram. HDGEC = individual with CAG ≥ 36; Non-HDGEC = individual with CAG < 36.


Figure 1.15: CAP score (capscore) at baseline Enroll-HD visit. Calculated using the CAP score formula in Warner et al. Only individuals with a research CAG ≥ 36 are represented. Note: a subset of these individuals with an aggregated value for age and/or research CAG length (N = 51) are not represented; these individuals have blank entries for CAP score in the dataset.


Figure 1.16: UHDRS total motor score (motscore) at baseline Enroll-HD visit. Full sample represented (N = 25,366; Missing N = 184). HDGEC = individual with CAG ≥ 36; Non-HDGEC = individual with CAG < 36.


Figure 1.17: UHDRS total functional capacity score (tfcscore) at baseline Enroll-HD visit. Full sample represented (N = 25,463; Missing N = 87). HDGEC = individual with CAG ≥ 36; Non-HDGEC = individual with CAG < 36.


Figure 1.18: UHDRS functional assessment score (fascore) at baseline Enroll-HD visit. Full sample represented (N = 25,006; Missing N = 544). HDGEC = individual with CAG ≥ 36; Non-HDGEC = individual with CAG < 36.


Figure 1.19: Symbol digit modality test score (total correct) (sdmt1) at baseline Enroll-HD visit. Full sample represented (N = 24,316; Missing N = 1,234). HDGEC = individual with CAG ≥ 36; Non-HDGEC = individual with CAG < 36.
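The captions of Figures 1.13 and 1.14 note that aggregated values (‘<18’ for age, ‘>70’ for CAG length) were replaced by 17 and 71 before plotting. A minimal sketch of that kind of recoding in R follows, assuming the columns have been read as text. The helper name recode_aggregated and the example vectors are hypothetical and not part of the dataset.

```r
# Hypothetical helper: replace one aggregated text value with a numeric
# stand-in, then convert the whole column to numeric.
recode_aggregated <- function(x, label, value) {
  x <- as.character(x)
  x[!is.na(x) & x == label] <- as.character(value)
  as.numeric(x)
}

# Invented example values, not taken from the dataset
age_0   <- c("45", "<18", "62", NA)
caghigh <- c("42", ">70", "17", "40")

age_num <- recode_aggregated(age_0, "<18", 17)    # 45 17 62 NA
cag_num <- recode_aggregated(caghigh, ">70", 71)  # 42 71 17 40
```

The stand-in values are a plotting convenience only. For modelling, it may be more appropriate to exclude these participants or to treat the values as censored.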

1.5 Data availability and completeness

Figure 1.20: Completeness of data elements in Enroll-HD as a function of percentage of total participants (N = 25,550). Note that the completeness metric for ‘CAG (local)’ is 87% when limited to individuals with a CAG (research) ≥ 36.


Figure 1.21: Completeness of assessments (core and extended) as a function of percentage of total Enroll-HD visits (visit N = 78,730). For assessments and scales with a key outcome variable(s), the completeness metric was operationalized as sufficiently completed such that the key outcome variable(s) is available for that visit. Key outcome variables are indicated in parentheses alongside each scale. For scales with no key outcome variable(s), i.e., CSSRS, Caregivers QoL, and CSRI, the completeness metric was ‘scale administered’ (operationalized as at least one variable field completed at that visit).

1.6 Coverage data availability by region, participant category, visit count

Coverage charts

Availability of participant data by number of visits, region, and HD participant category is provided in the coverage charts below (Tables 1.2 to 1.4).

Table 1.2: PDS6 coverage chart (cumulative; latest). Maximum number of participants available for X visits by region and participant category at latest visit. Visit counts consider only Enroll-HD visits (baseline and follow-up only, unscheduled visits and phone contacts excluded). Note that participants with more than 1 visit are represented in multiple columns. M = manifest; PM = pre-manifest; FC = family control; GN = genotype negative.
Region HD Category Baseline visit Visit 2 Visit 3 Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11
Australasia Manifest 377 332 256 192 148 100 47 27 9 0 0
Pre-manifest 292 233 170 127 90 57 27 11 4 0 0
Family Control 95 71 51 36 27 20 9 4 0 0 0
Genotype Negative 96 83 57 41 29 16 9 4 0 0 0
Europe Manifest 9633 7280 5381 3758 2349 1289 679 250 21 0 0
Pre-manifest 3345 2339 1622 1117 678 368 170 46 5 0 0
Family Control 1407 978 705 492 285 153 59 9 4 0 0
Genotype Negative 1655 1086 740 520 315 184 83 18 0 0 0
Latin America Manifest 237 147 90 45 23 11 6 2 0 0 0
Pre-manifest 101 57 16 10 5 0 0 0 0 0 0
Family Control 26 13 8 4 1 0 0 0 0 0 0
Genotype Negative 165 95 35 12 8 2 1 0 0 0 0
Northern America Manifest 3739 2705 1973 1410 947 591 311 144 51 14 0
Pre-manifest 1818 1208 813 563 368 221 123 56 23 6 0
Family Control 1203 904 701 520 365 241 141 69 40 14 2
Genotype Negative 1361 871 632 468 350 233 137 69 30 5 0
Total 25550 18402 13250 9315 5988 3486 1802 709 187 39 2

Table 1.3: PDS6 coverage chart (absolute; latest). Absolute number of participants available for X visits by region and participant category at latest visit. Visit counts consider only Enroll-HD visits (baseline and follow-up only, unscheduled visits and phone contacts excluded). Note that in contrast to Table 1.2, each participant is represented in a single column only, indicative of their maximum visit count. M = manifest; PM = pre-manifest; FC = family control; GN = genotype negative.
Region HD Category Baseline visit Visit 2 Visit 3 Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11
Australasia Manifest 45 76 64 44 48 53 20 18 9 0 0
Pre-manifest 59 63 43 37 33 30 16 7 4 0 0
Family Control 24 20 15 9 7 11 5 4 0 0 0
Genotype Negative 13 26 16 12 13 7 5 4 0 0 0
Europe Manifest 2353 1899 1623 1409 1060 610 429 229 21 0 0
Pre-manifest 1006 717 505 439 310 198 124 41 5 0 0
Family Control 429 273 213 207 132 94 50 5 4 0 0
Genotype Negative 569 346 220 205 131 101 65 18 0 0 0
Latin America Manifest 90 57 45 22 12 5 4 2 0 0 0
Pre-manifest 44 41 6 5 5 0 0 0 0 0 0
Family Control 13 5 4 3 1 0 0 0 0 0 0
Genotype Negative 70 60 23 4 6 1 1 0 0 0 0
Northern America Manifest 1034 732 563 463 356 280 167 93 37 14 0
Pre-manifest 610 395 250 195 147 98 67 33 17 6 0
Family Control 299 203 181 155 124 100 72 29 26 12 2
Genotype Negative 490 239 164 118 117 96 68 39 25 5 0
Total 7148 5152 3935 3327 2502 1684 1093 522 148 37 2
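The relationship between the cumulative chart (Table 1.2) and the absolute chart (Table 1.3) can be checked with a short R sketch. It uses only the ‘Total’ rows printed in the two tables; the variable names are illustrative.

```r
# "Total" row of Table 1.3 (absolute counts), baseline visit through visit 11
absolute_total <- c(7148, 5152, 3935, 3327, 2502, 1684, 1093, 522, 148, 37, 2)

# Participants with at least k visits = sum of the absolute counts
# for k, k + 1, ..., 11 (a reverse cumulative sum)
cumulative_total <- rev(cumsum(rev(absolute_total)))

# "Total" row of Table 1.2 (cumulative counts), for comparison
table_1_2_total <- c(25550, 18402, 13250, 9315, 5988, 3486, 1802, 709, 187, 39, 2)

all(cumulative_total == table_1_2_total)  # TRUE
```

The same reverse cumulative sum can be applied to any single row of an absolute chart to obtain the corresponding cumulative row.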

Table 1.4: PDS6 coverage chart (absolute; baseline). Absolute number of participants available for X visits by region and participant category at baseline visit. Visit counts consider only Enroll-HD visits (baseline and follow-up only, unscheduled visits and phone contacts excluded). Note that in contrast to Table 1.2, each participant is represented in a single column only, indicative of their maximum visit count. M = manifest; PM = pre-manifest; FC = family control; GN = genotype negative.
Region HD Category Baseline visit Visit 2 Visit 3 Visit 4 Visit 5 Visit 6 Visit 7 Visit 8 Visit 9 Visit 10 Visit 11
Australasia Manifest 45 73 55 39 43 38 18 14 9 0 0
Pre-manifest 59 66 52 42 38 45 18 11 4 0 0
Family Control 24 20 15 9 7 11 5 4 0 0 0
Genotype Negative 13 26 16 12 13 7 5 4 0 0 0
Europe Manifest 2352 1835 1518 1300 933 524 362 189 19 0 0
Pre-manifest 1007 781 610 548 437 284 191 81 7 0 0
Family Control 429 273 213 207 132 94 50 5 4 0 0
Genotype Negative 569 346 220 205 131 101 65 18 0 0 0
Latin America Pre-manifest 44 40 8 6 6 0 0 0 0 0 0
Manifest 90 58 43 21 11 5 4 2 0 0 0
Family Control 13 5 4 3 1 0 0 0 0 0 0
Genotype Negative 70 60 23 4 6 1 1 0 0 0 0
Northern America Pre-manifest 611 441 288 276 197 144 97 69 32 11 0
Manifest 1033 686 525 382 306 234 137 57 22 9 0
Family Control 299 203 181 155 124 100 72 29 26 12 2
Genotype Negative 490 239 164 118 117 96 68 39 25 5 0
Total 7148 5152 3935 3327 2502 1684 1093 522 148 37 2

Geographical coverage

Enroll-HD PDS6 data were collected from 179 clinical sites located across 22 countries (Figure 1.22).

Figure 1.22: Enroll-HD PDS6 map. Enroll-HD data in PDS6 were collected from clinical sites in 22 countries.

Chapter 2 Download and import data

2.1 File formats

The Enroll-HD PDS dataset is provided in two formats:

  • CSV file: CSV stands for comma-separated values (.csv), a delimiter-separated format. The PDS data files use the tab character as the delimiter, so software import settings need to be adapted accordingly.

  • R file: binary format for the R software application (a software environment for statistical analysis).

Because of the complexity and the size of the data set, use of a statistical software package such as R, Stata, or SAS is recommended. The .csv file format can also be imported into Excel (caution is advisable).

It is important that files are not edited in word processing software or other programs that may modify characters, as this may damage the integrity of the original files. CSV files can be saved in other formats compatible with other statistical software packages as needed.

2.2 Importing data

Importing CSV files into Excel

The .csv files can be imported and opened in Microsoft Excel. Because Excel is language dependent and delimiters differ from one country to another, some considerations need to be addressed when opening the .csv files to maintain data integrity. The procedures outlined here, to open the .csv files, can be applied to most recent versions of Excel.

By default, Excel reads the values in each column as being in a “General” format. For example, unless otherwise specified, Excel interprets numeric data as numbers (e.g., 1234), entered dates as dates (in the pre-set format, e.g., 11/28/2016), and other values (e.g., strings) as text (e.g., Aspirin). For some entries this is counterproductive, as Excel may misinterpret entries and incorrectly reformat them, effectively changing the data (e.g., 1.5 is read as May 1 instead of 1.5 mg; or the WHO-DD Code for Tetrabenazine, 00222101003, is changed to 222101003, removing the important leading “0”s).

To maintain the integrity of the data, each data column needs to be carefully examined prior to importing the data into Excel.

An illustrated guide for correctly importing CSV data files into Excel is provided in Appendix A.


Importing CSV files into R

Make sure the CSV file has not been opened and saved using word processing software. A package capable of reading delimited files must be available in the R environment. The package “readr” is one of the most popular, but several others will also work. If “readr” is not already installed, it can be installed using the following line of code:

install.packages("readr")

To load the package, run library(readr). To ensure the CSV file is imported correctly, set the working directory to the folder where the PDS files are located, and then run the following code:

file <- read_delim("file.csv", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
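The concern about leading zeros and mixed entries described for Excel also applies when column types are guessed in R. The sketch below is one way to guard against it, shown with base R read.delim() rather than readr so that it runs without extra packages. The file is a tiny invented stand-in written to a temporary location; only the column names (subjid, cmtrt_decod, cmdostot) and the example code value come from this document.

```r
# Write a tiny tab-delimited file to a temporary location so the example is
# self-contained. The single data row is invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("subjid\tcmtrt_decod\tcmdostot",
             "R000000001\t00222101003\t1.5"), tmp)

# Read with tab as the delimiter and force the two columns to character,
# so leading zeros and mixed entries are kept exactly as stored.
pharmacotx <- read.delim(
  tmp,
  sep = "\t",
  colClasses = c(cmtrt_decod = "character", cmdostot = "character"),
  stringsAsFactors = FALSE
)

pharmacotx$cmtrt_decod  # leading zeros preserved
```

With readr, the col_types argument of read_delim() serves the same purpose.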


Importing R files into R

This data file is specific to R. After loading the R data file into R, 9 data frames are made available in the R environment and are ready to be used. The file can be loaded using the load() function, where "Rdata_directory" stands for the path to the R data file:

load("Rdata_directory")

For RStudio users, the loading can be performed by clicking in the “load workspace” ribbon, and then browsing for the location of the R data file.
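Because load() places objects directly into the current workspace, it can silently overwrite existing objects with the same names. One possible precaution, sketched below, is to load the file into a separate environment and inspect what it contains. The small .RData file created here is an invented stand-in for the PDS R file, and the names pds and rdata_path are illustrative.

```r
# Create a small stand-in .RData file so the example runs on its own.
# The two data frames are invented placeholders.
profile <- data.frame(subjid = "R000000001")
enroll  <- data.frame(subjid = "R000000001", seq = 1)
rdata_path <- tempfile(fileext = ".RData")
save(profile, enroll, file = rdata_path)
rm(profile, enroll)

# Load into a separate environment so existing objects are not overwritten
pds <- new.env()
loaded <- load(rdata_path, envir = pds)

loaded                                   # names of the restored objects
sapply(mget(loaded, envir = pds), nrow)  # number of rows per data frame
```

Individual data frames can then be accessed as pds$profile, pds$enroll, and so on.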

Appendix A: An illustrated guide to correctly importing CSV files into Excel

The file used for this demonstration is the ‘profile.csv’ file.

Step 1 – Open CSV file in Excel: Open the .csv file using Excel, or open Excel and on the “Data” tab click “From Text/CSV”. Data will be imported in entirety into the first column of the Excel file, as illustrated below.

Step 2 – Open Text to Columns Wizard: Select the first column, then on the tab “Data” click “Text to Columns”. A wizard will appear to guide you through the process.

Step 3 – Select delimiter type: In the Text to Columns Wizard (step 1 of 3), select the “Delimited” checkbox (this lets Excel know that the data fields are separated by commas or tabs), then click “Next”.

Step 4 – Specify data file type: In the Text to Columns Wizard (step 2 of 3), select the Delimiter type “Tab” (this lets Excel know that the data fields are separated by tabs specifically), then click “Next”.

Step 5 – Assign column formats: For each column (i.e., variable), an appropriate format needs to be assigned. This is completed in the Text to Columns Wizard (step 3 of 3). The default format “General” works for most columns. Columns where numbers have leading “0” and columns with mixed entries like 1.5, 1,5, 1/5, need to be explicitly formatted as “Text”, as entries might otherwise become corrupted in an unchangeable way. After assigning the correct format to each column, click “Finish”.

NB: The data files pharmacotx and nutsuppl contain two columns ‘cmtrt_decod’ and ‘cmdostot’ that require formatting as “Text”.

Step 6 – Save data file: The .csv file is now column-separated and should be saved as an Excel file (.xls or .xlsx) using the ‘Save As’ option.

Chapter 3 Explore Dataset Structure

3.1 Structure of dataset

The PDS dataset consists of several data files. These contain data items defined by variables. Variables are taken from the eCRFs of Enroll-HD, REGISTRY, and Ad Hoc data. Specific data items have been transformed or obscured for de-identification purposes.


Studies within the combined PDS dataset

All individuals in PDS6 are Enroll-HD participants. However, PDS6 contains data gathered not only from Enroll-HD, but also from Registry and Ad Hoc sources. Study specific protocols and annotated eCRFs are housed under the General Documents section.

Table 3.1: Data sources within Enroll-HD periodic dataset releases.
Study Name Acronym Chronological order
Enroll-HD ENR Enroll-HD is the most recent study a participant will have enrolled in. Participation in this study is mandatory for inclusion in the PDS
REGISTRY V3 R3 Participation in Registry v3 is optional. Participation in this study precedes Enroll-HD
REGISTRY V2 R2 Participation in Registry v2 is optional. Participation in this study precedes Enroll-HD, and Registry 3 (if available)
Ad Hoc RET Ad Hoc data are optional. These data are drawn from a variety of different sources, principally comprise UHDRS data, and are typically gathered prior to a participant’s enrollment into Enroll-HD

Data files within the combined PDS dataset

Enroll-HD PDS releases comprise 11 data files, each of which falls into one of three categories:

  • Participant-based: profile, pharmacotx, nonpharmacotx, nutsuppl, comorbid. These contain general study-independent information about the participant. This information is applicable to all studies.

  • Study-based: participation, assessment, events. These contain study specific information about a participant within a study. Note that the data file events is a special data file for Enroll-HD which contains all the reportable events of a participant.

  • Visit-based: enroll, registry, adhoc. These contain all visit-dependent information for the study, combined into one data file.

Each PDS6 data file is described in the table below:

Table 3.2: Enroll-HD periodic dataset data file descriptions.
Data file Type Studies Description
profile participant ENR, R3, R2, RET General and annually updated information including the forms: Demographics, HDCC (HD clinical characteristics), CAG, Mortality
pharmacotx participant ENR, R3, R2, RET Information about pharmacological therapies
nutsuppl participant ENR, R3, R2, RET Information about nutritional supplements
nonpharmacotx participant ENR, R3, R2, RET Information about non-pharmacological therapies
comorbid participant ENR, R3, R2, RET Information about comorbid conditions
participation study ENR, R3, R2, RET Provides study specific information about study participation. Includes participant identifier, participant status, study start day, study end day (in event of participant withdrawal or death). Note: if a participant is enrolled in several studies, one line per study participation is provided.
event study ENR Enroll-HD study reportable event information
assessment study ENR, R3, R2, RET Visit-specific information about which assessments were performed at each visit (per study)
enroll visit ENR Data from the Enroll-HD study
registry visit R3, R2 Data from both the REGISTRY3 and REGISTRY2 studies
adhoc visit RET Ad Hoc data including: Variable, Motor, Function, TFC, MMSE, Cognitive assessment data

For detailed information on each constituent form, please refer to the Annotated eCRFs.

The number of participants included in each PDS data file is illustrated in the figure below.

Figure 3.1: Number of participants included in each PDS6 data file.


Entity relation diagram

The PDS6 data file entity relation diagram is presented in the figure below. It illustrates the relationships between the component data files, along with the key variables (primary keys [PK] and foreign keys [FK]) required to combine the data files.

Figure 3.2: Entity relationship diagram.

3.2 Structure of variables

Each Enroll-HD periodic dataset file contains variables. This Data Dictionary lists all variables by form with the appropriate attributes. A list of the attribute types is provided in Table 3.3.

Table 3.3: Data dictionary column fields.
Attribute Description
Label Variable label (CDISC SDTM compliant)
Domain Assigned CDISC SDTM Domain
Category Assigned CDISC SDTM Category (optional)
Variable Internal variable name. Variable is defined in CDISC SDTM compliant naming convention or as close as possible.
Data Type Boolean: Represents the values 1 (yes) and 0 (no).
Number: Represents integer or floating-point data values.
Text: Represents alphanumeric string data values.
Date: The date type is represented as the number of days relative to the date of the participant’s Enroll-HD baseline visit. Note that dates specified in the original data as “incomplete” (e.g., without entry of a day) have been automatically completed by the following rule: use “15” as the day if the day is missing, and use “1” as the day and “7” as the month if both day and month are missing. After the date modification is complete, the number of days relative to the enrollment date is calculated and provided in the dataset. The information about whether a date has been automatically completed is not included in the PDS but can be obtained via SPS request.
Single choice: Variable with assigned list of options where one item can be selected. The value provided in the dataset is taken from the available options list. The list of options is defined as parameters in the data dictionary tables.
Parameter Parameter value of coded variables (optional).
Coding Internal parameter value of coded variables (optional).
Unit Unit of input field (optional).
Transformation One important objective of the periodic dataset is to de-identify the Enroll-HD data in order to minimize the possibility of identifying a participant. Therefore, many variables are transformed or recoded, or have outliers removed or cut. These transformations are described on a variable-by-variable basis.
Availability All variables in the Enroll-HD dataset are listed in the Data Dictionary. This availability column allows the researcher to identify which variables are available in the PDS (“PDS”), which are available via special request (“available upon SRC approval”), and which are restricted (“not available”).
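The completion rule for incomplete dates described under the Date data type can be illustrated with a short R sketch. The helper complete_date and all dates below are hypothetical and serve only to show the arithmetic; this is not the procedure actually used to build the dataset. The rule does not cover the case of a missing month with a known day, and the sketch treats that case as if both were missing.

```r
# Hypothetical helper illustrating the completion rule for incomplete dates.
# year is required; month and day may be NA.
complete_date <- function(year, month = NA, day = NA) {
  if (is.na(month)) {        # day and month missing -> 1 July
    month <- 7
    day   <- 1
  } else if (is.na(day)) {   # only the day missing -> 15th of the month
    day <- 15
  }
  as.Date(sprintf("%04d-%02d-%02d",
                  as.integer(year), as.integer(month), as.integer(day)))
}

# Invented dates for illustration
baseline <- as.Date("2015-03-10")   # Enroll-HD baseline visit
d1 <- complete_date(2014, 6)        # completed to 2014-06-15
d2 <- complete_date(2010)           # completed to 2010-07-01

# Values as they would be stored: days relative to the baseline visit
as.integer(d1 - baseline)
as.integer(d2 - baseline)
```

Negative values indicate dates before the Enroll-HD baseline visit.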

3.3 Ordering of visit data

The Enroll-HD PDS data files ‘enroll’, ‘registry’ and ‘adhoc’ contain data for all baseline and follow-up visits, for each participant, for each study. There is not a separate file for each follow-up visit.

The variable visdy indicates the timing of each visit in days relative to the Enroll-HD baseline visit date (please refer to the Date Values section for further information).

In addition, the files ‘enroll’, ‘registry’ and ‘adhoc’ also include a variable called seq. This variable refers to the sequence of the visits and enables the data analyst to order visits temporally. The seq value is in accordance with the number of days after the baseline visit (visdy), where seq=1 refers to the baseline visit, seq=2 refers to the 1st follow-up visit, seq=3 to the 2nd follow-up visit, and so on, including unscheduled and phone contact visits.

Table 3.4: Visit sequencing and visit day example.
subjid      studyid  visit      seq  visdy
R000000001  ENR      Baseline   1    0
R000000001  R3       Baseline   1    -728
R000000001  R3       Follow up  2    -363
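As Table 3.4 shows, seq starts again at 1 within each study, whereas visdy is always expressed relative to the Enroll-HD baseline visit. A minimal R sketch of ordering one participant's visits follows, using the rows of Table 3.4 as if the ‘enroll’ and ‘registry’ files had been stacked. The data frame names are illustrative.

```r
# Rows of Table 3.4, as they might appear after stacking visit-based files
visits <- data.frame(
  subjid  = rep("R000000001", 3),
  studyid = c("ENR", "R3", "R3"),
  visit   = c("Baseline", "Baseline", "Follow up"),
  seq     = c(1, 1, 2),
  visdy   = c(0, -728, -363)
)

# Across studies, order by visdy to obtain a single timeline per participant
timeline <- visits[order(visits$subjid, visits$visdy), ]

# Within a single study, ordering by seq gives the same result as visdy
r3 <- visits[visits$studyid == "R3", ]
r3 <- r3[order(r3$seq), ]
```

In this example the two REGISTRY visits come first in the timeline (visdy -728 and -363), followed by the Enroll-HD baseline visit (visdy 0).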

Phone contact visits only occur in the ‘enroll’ file. These visits contain missed visit information, reason for missed follow-up visit and participant’s availability to continue the study. If these data are not required for your analyses, these visits can be filtered out.

Unscheduled visits occur in the ‘enroll’ and ‘registry’ files. These visits contain all the same information as a follow-up visit, but occur outside the defined visit window. If these data are not required for your analyses, these visits can be filtered out.
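One possible way to filter out these visits in R is sketched below. The rows are invented. The labels “Baseline” and “Follow up” appear in Table 3.4, whereas the labels “Unscheduled” and “Phone contact” are assumptions made for this example; the actual coding of the visit variable should be checked in the Data Dictionary.

```r
# Invented rows. "Unscheduled" and "Phone contact" are assumed labels.
enroll <- data.frame(
  subjid = c("A", "A", "A", "A"),
  visit  = c("Baseline", "Phone contact", "Unscheduled", "Follow up"),
  seq    = 1:4,
  visdy  = c(0, 200, 290, 370)
)

# Keep baseline and follow-up visits only
keep <- enroll$visit %in% c("Baseline", "Follow up")
enroll_main <- enroll[keep, ]

# Inspect which visit labels were dropped before relying on the filter
table(enroll$visit[!keep])
```

Selecting the visit types to keep, rather than listing those to drop, avoids depending on the exact labels of the excluded visit types.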

3.4 Merging and aligning files

Enroll-HD PDS releases contain one key variable, subjid, which is included in every data file. This allows the user to merge two or more data files, linking information for each participant across data files.

The key variable subjid, labeled as HDID (recoded), is obtained by recoding the HDID. Note that the HDID is a unique participant identifier used across multiple HD studies. HDIDs are not included in any PDS release.

To merge longitudinal data available in visit-based data files, it is important to take into consideration the variable seq, as this variable provides information on the visit sequence. Visit day (visdy) is also available for sequencing visit data temporally.

WARNING: Merging data files in Excel can cause misalignment. Before analyzing the data, check that the resulting merged data file correctly lines up across appropriate fields. To avoid issues with merging data files, it is highly recommended that you use a reputable statistical software package.
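The alignment checks recommended above can also be scripted. The following base R sketch uses invented stand-ins for the ‘profile’ and ‘enroll’ files (only the names subjid, seq and hddiagn come from the dataset) and shows a few checks that could be run around a merge. Participant “D” is deliberately left without a profile row to show how an unmatched identifier would surface.

```r
# Invented stand-ins for the 'profile' and 'enroll' files
profile <- data.frame(subjid = c("A", "B", "C"), hddiagn = c(41, NA, 55))
enroll  <- data.frame(subjid = c("A", "A", "B", "D"), seq = c(1, 2, 1, 1))

# subjid should identify exactly one row in a participant-based file
stopifnot(!anyDuplicated(profile$subjid))

# Left join: keep every visit row, add hddiagn where a profile row exists
merged <- merge(enroll, profile, by = "subjid", all.x = TRUE)

# The merge must not add or drop visit rows
stopifnot(nrow(merged) == nrow(enroll))

# Participants in 'enroll' without a matching 'profile' row
unmatched <- setdiff(enroll$subjid, profile$subjid)
unmatched
```

An empty result for unmatched, together with an unchanged row count, suggests that the files lined up as intended.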

Below we provide guidance on selecting and merging entries using Excel or R. The example merges age at HD diagnosis (hddiagn) from the profile file with the age at the last visit of each participant in the enroll file.

EXCEL

  1. Sort your enroll database on the first level by subjid, then add a level by seq (Smallest to Largest);

  2. Create a new column with the name “select”;

  3. On the first row of this column type the formula =IF(A2=A3,"","1"), where A corresponds to the column of subjid and A2 to the first row/value of subjid. Then press Enter key and drag Auto fill to copy the formula to the range you need. This will create a column with the value “1” on the row with maximum seq for each participant;

  4. Filter for the variable select with the value “1”;

  5. Create a new column with the name of the variable you want to merge in ‘enroll’ file (in this case hddiagn);

  6. In that new column, use a VLOOKUP() function to merge hddiagn from profile file, using the variable subjid as a linker. Then press Enter key and drag Auto fill to copy the formula to the range you need.

R

  1. Select the rows with the highest value in the seq variable for each participant:

    install.packages("dplyr")  # only needed once

    library(dplyr)

    new_database <- enroll %>% group_by(subjid) %>% slice(which.max(seq))

  2. Use the merge() function to merge the variable hddiagn from profile file.

    Example: final_database <- merge(new_database, profile[, c("subjid", "hddiagn")], by="subjid")

    Note: Different code can reach the same result; this is just one example of how to perform the proposed task. We recommend that the user read and follow the available guidelines.
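
As a complement to the steps above, the following base R sketch repeats the selection and merge on toy data and then checks that the merged file lines up. The data frames and their values are made up for illustration; only the variable names subjid, seq and hddiagn come from this document. The left join (all.x = TRUE) is a design choice of this sketch, so that participants without a matching profile row are kept rather than silently dropped.

```r
# Toy stand-ins for the 'enroll' and 'profile' files (illustrative values only)
enroll <- data.frame(
  subjid = c("A", "A", "B", "C"),
  seq    = c(1, 2, 1, 1),
  age    = c(40, 41, 55, 33),
  stringsAsFactors = FALSE
)
profile <- data.frame(
  subjid  = c("A", "B", "C"),
  hddiagn = c(39, NA, 30),
  stringsAsFactors = FALSE
)

# Keep the row with the highest seq per participant
enroll_sorted <- enroll[order(enroll$subjid, -enroll$seq), ]
last_visit <- enroll_sorted[!duplicated(enroll_sorted$subjid), ]

# Left join so that participants without a profile row are kept
merged <- merge(last_visit, profile[, c("subjid", "hddiagn")],
                by = "subjid", all.x = TRUE)

# Alignment checks: one row per participant, and no participant lost
stopifnot(!any(duplicated(merged$subjid)))
stopifnot(nrow(merged) == length(unique(enroll$subjid)))
print(merged)
```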

3.5 Identifying PDS6 participants with data from Enroll-HD only

As described above, all individuals in the PDS6 are Enroll-HD participants. However, a subset of Enroll-HD participants also took part in the REGISTRY study. Should you wish to distinguish Enroll-HD participants who migrated from the REGISTRY study from those who did not, there are several methods to do so. One simple solution is as follows:

EXCEL

  1. Create a new column with the name “match_registry”;

  2. On the first row of the new variable enter the code: =IF(ISERROR(VLOOKUP(A2,registry.csv!$A:$A,1,FALSE)),"No Match", "Match") where A2 is the first row of the variable subjid, registry.csv!$A:$A is the column of subjid in the registry file and “1” is the index of the column of subjid in the registry file. Then press Enter key and drag Auto fill to copy the formula to the range you need.

  3. The new column match_registry will have value Match if the participant has migrated from Registry and value No Match otherwise. If you are interested only in migrated participants, just filter the variable for the value Match.

R

  1. Apply a filter to the enroll database:

    library(dplyr)

    final_database<- enroll %>% filter(subjid %in% registry$subjid)

    This piece of code will return only the rows of the enroll database that belong to the participants migrated from Registry.
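
The filter above keeps the migrated participants. To keep the participants with Enroll-HD data only, the same membership test can be negated. A minimal base R sketch on toy data (the data frames and values are made up for illustration):

```r
# Toy stand-ins for the 'enroll' and 'registry' files (illustrative values only)
enroll <- data.frame(
  subjid = c("A", "A", "B", "C"),
  seq    = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)
registry <- data.frame(subjid = "A", seq = 1, stringsAsFactors = FALSE)

# TRUE for rows whose participant also appears in the registry file
migrated <- enroll$subjid %in% registry$subjid

enroll_migrated <- enroll[migrated, ]   # participants migrated from REGISTRY
enroll_only     <- enroll[!migrated, ]  # participants with Enroll-HD data only
print(enroll_only)
```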

Chapter 4 Understand and Interpret the Data

4.1 HD classification

The Enroll-HD dataset contains the variables hdcat, hdcat_0 and hdcat_l. These variables refer to the subject group (HD category) of the participant at different points in time.

Variables hdcat_0 and hdcat_l are located in the participation data files and indicate subject group at the time of the baseline visit evaluation (hdcat_0) and at the most recent evaluation (hdcat_l) in a study.

The variable hdcat is included in the enroll and registry data files and denotes the subject group (HD category) of each participant at each study visit (note that in the registry file, hdcat is only available in R3, and not R2).

Values for hdcat, hdcat_0 and hdcat_l are assigned by site staff, based on clinical signs and symptoms and genotyping performed as part of clinical care, independent of the Enroll-HD study (except for participants in the subject group ‘genotype unknown’, who are at risk without clinical symptoms and whose gene status is unknown; see below for more details).

In the Enroll-HD periodic dataset (PDS) releases, there are four categorical response outcomes for hdcat, hdcat_0, and hdcat_l:

  • Pre-Manifest/Pre-Motor-manifest HD (hdcat/hdcat_0/hdcat_l = 2): Confirmed HD gene expansion carriers (HDGECs) without clinical features regarded as diagnostic of HD.

  • Manifest/Motor-manifest HD (hdcat/hdcat_0/hdcat_l = 3): HD gene expansion carriers (HDGECs) with clinical features that are considered HD symptoms, in the opinion of the investigator.

  • Genotype Negative (hdcat/hdcat_0/hdcat_l = 4): A first or second degree relative (i.e., related by blood) of a HD gene expansion carrier (HDGEC), who has undergone predictive testing for HD and is known not to carry the HD expansion.

  • Family Control (hdcat/hdcat_0/hdcat_l = 5): Family members or individuals not related by blood to HD gene expansion carriers (HDGECs) (e.g., spouses, partners, caregivers).

For the purposes of PDS releases, participants classified in the Enroll-HD study as Genotype Unknown (hdcat/hdcat_0/hdcat_l = 1; i.e., first- or second-degree blood relatives of a known HDGEC, who have not undergone predictive testing for HD and therefore have an undetermined HD gene status) are reclassified as Manifest, Pre-manifest, or Genotype Negative based on the genetic research testing (a CAG determination made at a central Enroll-HD lab, indicated by caghigh) and on the Diagnostic Confidence Level (diagconf) reported by the investigator for the participant. The following rules are used in the reclassification of genotype unknowns:

  • Reclassify as Genotype Negative: research genotype larger CAG allele (caghigh) < 36;

  • Reclassify as Pre-manifest: research genotype larger CAG allele (caghigh) ≥ 36 and Diagnostic Confidence Level from the UHDRS motor (diagconf) < 4;

  • Reclassify as Manifest: research genotype larger CAG allele (caghigh) ≥ 36 and Diagnostic Confidence Level from the UHDRS motor (diagconf) = 4.
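
The three rules above can be written as a small R function. This is only an illustrative sketch of the stated rules, not the code used to build the PDS: the function name reclassify_unknown is hypothetical, and the sketch assumes caghigh is available as a number and diagconf is not missing.

```r
# Reclassify a Genotype Unknown participant following the rules listed above.
# caghigh: research larger CAG allele (numeric); diagconf: DCL (0 to 4).
# Returns the hdcat code: 4 = Genotype Negative, 3 = Manifest, 2 = Pre-manifest.
reclassify_unknown <- function(caghigh, diagconf) {
  ifelse(caghigh < 36, 4L,
         ifelse(diagconf == 4, 3L, 2L))
}

# Illustrative inputs only
reclassify_unknown(caghigh = c(20, 42, 42), diagconf = c(0, 2, 4))
```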

Data on participants categorized as Genotype Unknown, not reclassified, may be obtained through special request, subject to Scientific Review Committee (SRC) approval. Please refer to the Access Data and Biosamples webpage for information on how to request a specified dataset (SPS).

Note that investigators and participants are blinded to the results of central Enroll-HD genetic research testing (i.e., caghigh) and reclassification. In other words, all Enroll-HD participants are tested at a central lab, but these results are not communicated to the participant or to site staff. The only ‘gene status’ known by the participant and the respective site is based on local CAG testing, not Enroll-HD testing.

Community Controls (hdcat/hdcat_0/hdcat_l = 6) are excluded from the dataset.

Note that the hdcat variables are available for the studies Enroll-HD and REGISTRY 3 but are not available for REGISTRY 2 and Ad Hoc since these studies did not use an HD classification system.

For further information on HD disease onset, diagnosis, and disease severity, and how these important concepts are captured in the Enroll-HD PDS, please refer to the following section: HD onset and diagnosis variables and HD Integrated Staging System (HD-ISS).

4.2 Missing values

There are two overarching categories of missing data in the dataset: system-defined missing data (indicated by blank variable ‘entries’), and user-defined missing data (indicated by specific codes, which indicate reason for missing data).


System defined missing data

System defined missing data occurs where the electronic data capture (EDC) system dictates a missing variable field. These missing data values are indicated by blank entries in the dataset.

These instances arise where a specific question depends on the response to a ‘parent’ question. For example, a ‘no’ response to the ‘parent’ question “Has the participant ever smoked?” (hxtobab) will result in a blank cell for the ‘child’ question “cigarettes per day?” (hxtobcpd).

In addition, total scores for assessments may also display as blank entries. This will occur where a mandatory assessment item, required for the calculation of the total score, is missing. Total scores are automatically generated by the ‘system’ if all necessary values are available.

Please note that blank entries may be converted to another value dependent on statistical software package (e.g., ‘NA’ in R).


User defined missing data

User defined missing data occur where a mandatory variable field, as determined by the EDC system, is not completed, or where the value entered into the EDC is incorrect. In these instances, data entry users are prompted to indicate why the value is missing, or why the value entered is not correct. These user-defined labels, termed ‘exceptional values’, are listed below. Each one is represented in the dataset by a specific code:

  • Unknown (entered by the site, only available for specific fields): 9999 (numeric); UNKNOWN (text) Refers to mandatory values which are occasionally unknown. This exceptional value code may be selected as a response to the question, “Is/was your Mother affected by HD? (response: yes/no/unknown)”, where a participant did not know their mother.

  • Missing (value expected, but not entered): 9998 (numeric); MISSING (text); 9998-09-09 (date) Refers to mandatory values which could not be completed because data collection was not performed. This code may be used if a participant refuses to provide a response, if the collection of data was accidentally omitted, or if a value could not be obtained because required instrumentation was not available.

  • Not applicable (value expected, but not entered): 9997 (numeric); NOTAPPL (text); 9997-09-09 (date) Refers to mandatory values which could not be completed because they do not apply to the participant due to certain circumstances or characteristics. For example, the question “Age at onset of symptoms in mother” for a mother who is still premanifest and does not yet have symptoms shall be answered as not applicable. The variable value is purposefully not entered. Note that the distinction between this value and system-defined missing values is that the user, as opposed to the EDC system, marks them as non-applicable.

  • Wrong (value was entered, but declared as wrong by the site. Entered value excluded from dataset): 9996 (numeric); WRONG (text); 9996-09-09 (date)

    Refers to mandatory values which are entered into the EDC and then identified to be wrong or highly questionable. This may be because data were collected by the wrong person (e.g., assessment performed by untrained site member), faulty instrumentation (e.g., uncalibrated weighing scales), etc. Although these data are not technically missing, they are recoded as wrong using the codes indicated above for PDS releases.
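
Before computing summaries, analysts will usually want to stop these codes from being treated as real measurements. The following R sketch recodes the numeric and text codes listed above to NA. The helper name recode_exceptional and the toy column names are hypothetical. Note that this discards the reason for missingness, and that a blanket recode is only safe for variables whose valid range cannot include 9996 to 9999.

```r
# Exceptional-value codes listed above
exceptional_num  <- c(9996, 9997, 9998, 9999)
exceptional_text <- c("WRONG", "NOTAPPL", "MISSING", "UNKNOWN")

# Replace exceptional values with NA in a single column
recode_exceptional <- function(x) {
  if (is.numeric(x)) {
    x[x %in% exceptional_num] <- NA
  } else {
    x[x %in% exceptional_text] <- NA
  }
  x
}

# Toy data; the column names are hypothetical
dat <- data.frame(
  score  = c(12, 9998, 30, 9997),
  answer = c("yes", "UNKNOWN", "no", "MISSING"),
  stringsAsFactors = FALSE
)
dat[] <- lapply(dat, recode_exceptional)
print(dat)
```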

4.3 Imputation

Imputation in the PDS is limited to the following instances:

HD-ISS variables

HD-ISS variables and input variables required for HD-ISS variable imputation (see HD Integrated Staging System (HD-ISS)).

Date Values

Imputation is performed for date variables in instances where an incomplete date has been provided (e.g., if the month and year are known, but not the day), according to the rules indicated in the section Date Values. Note that because of these imputation (autocompletion) rules, events with clear temporal definition sometimes appear out of sequence or have the same date values. For example, the number of days between a medication start date and end date or comorbidity start date and end date may be zero or a negative number.

BMI

The BMI variable provided in the PDS (i.e., bmi_imp) is an imputed value for all visits except the Enroll-HD baseline visit. Imputed BMI values are based on the weight value observed at a specific visit, and height as observed at Enroll-HD baseline. This is to avoid fluctuations in BMI driven by unexpected/implausible variation in height, which are observed in Enroll-HD data, but cannot be observed by end-users as height (and weight) are not included in the PDS for identification risk purposes. BMI is set to system-defined missing (blank) for all visits where the participant is under the age of 18 years. Unimputed BMI (i.e., bmi), weight, and height, are available for all participants at all visits via Specified Dataset (SPS) request.

4.4 Date values

Transformation of date values

To minimize participant identification risk, the Enroll-HD PDS does not contain date values. Date values referring to visit dates are transformed to a numeric value, reflective of the number of days between the Enroll-HD baseline visit date and the date of interest. Date values that refer to date of birth or symptom onset are transformed into age values.

Note that date values are negative if the date refers to a point in time before Enroll-HD enrollment. This is typical for start dates of medications and comorbid conditions, and visit dates in other studies (e.g., Registry).

For example, date values for a participant with a baseline enrollment date of 2020-11-01 (YYYY-MM-DD) would read as follows:

Entered Date Representation in dataset
2020-11-01 0
2020-11-30 29
2020-10-31 -1
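
The transformation in the table above amounts to a date difference in days. A short R sketch that reproduces the three example rows (the dates are those of the example, not real data):

```r
baseline <- as.Date("2020-11-01")
entered  <- as.Date(c("2020-11-01", "2020-11-30", "2020-10-31"))

# Number of days between the baseline visit and each date
day_offset <- as.numeric(entered - baseline)
print(day_offset)  # 0 29 -1
```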

Imputation of date values/Autocompletion

Incomplete date values were imputed according to the following rules:

Day missing: YYYY-MM-DD(missing) is imputed as YYYY-MM-15

Month and day missing: YYYY-MM(missing)-DD(missing) is imputed as YYYY-07-01

For example, date values for a participant with a baseline enrollment date of 2020-11-01 (YYYY-MM-DD) would read as follows:

Entered Date e.g. medication start Imputed date Representation in dataset
2020-11-01 N/A 0
2020-11 2020-11-15 14
2020 2020-07-01 -123

Note that because of these imputation rules, events with clear temporal definition sometimes appear out of sequence or have the same date values. End dates may appear prior to, or on the same day as, start dates for comorbidities and pharmacotherapies. For example:

Entered start date Imputed start date Entered end date Imputed end date Date differential (days)
2020-11-01 N/A 2020-11 2020-11-15 14
2020-11 2020-11-15 2020-11 2020-11-15 0
2020 2020-07-01 2020-06-15 N/A -16
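
The imputation rules and the resulting day values can be reproduced with a few lines of R. This is a sketch of the stated rules only: the helper name impute_date is hypothetical, and it assumes incomplete dates arrive as "YYYY-MM" or "YYYY" strings.

```r
# Impute an incomplete date string following the rules above:
# "YYYY-MM" becomes "YYYY-MM-15", "YYYY" becomes "YYYY-07-01"
impute_date <- function(x) {
  out <- ifelse(nchar(x) == 4, paste0(x, "-07-01"),
         ifelse(nchar(x) == 7, paste0(x, "-15"), x))
  as.Date(out)
}

baseline <- as.Date("2020-11-01")
entered  <- c("2020-11-01", "2020-11", "2020")

# Representation in the dataset: days relative to baseline
as.numeric(impute_date(entered) - baseline)  # 0 14 -123

# Start year only, complete end date: the end appears before the start
as.numeric(as.Date("2020-06-15") - impute_date("2020"))  # -16
```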

An additional variable containing date value precision information (d, m, and y) can be obtained through special request, subject to Scientific Review Committee (SRC) approval. Please refer to the Access Data and Biosamples webpage for information on how to request a specified dataset (SPS). The precision variable identifies the level of date completeness:

ymd – for a complete date (precision “days”) i.e., YYYY-MM-DD

ym – if day information is missing (precision “months”) i.e., YYYY-MM-DD(missing)

y – if day and month information is missing (precision “years”) i.e., YYYY-MM(missing)-DD(missing)

4.5 Aggregated values

To minimize participant identification risk, data aggregation techniques are applied to specific variables for PDS releases. These variables, and the criteria/thresholds used for aggregation in PDS6, are described in the table below.

Note that aggregation thresholds differ between PDS releases. Changes in the Enroll-HD cohort size and profile allow for such aggregation threshold adjustments while maintaining low identification risk thresholds.

Note that numerical variables with aggregated data have been converted to text variables (e.g., a possible entry for caghigh is ‘>70’). Cells that contain ‘>’ or ‘<’ values should be replaced by the end user with a numeric value for analysis.
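
One possible way to do this in R is sketched below. Replacing ‘>70’ with the threshold value 70 is an assumption made for illustration; which numeric value to substitute is the analyst's decision. Keeping a flag for aggregated entries allows them to be excluded or handled separately in sensitivity analyses.

```r
# Toy caghigh column as it may appear after aggregation (text values)
caghigh_txt <- c("42", "17", ">70", "45")

# Flag aggregated entries, then strip the '>' or '<' prefix and convert
is_aggregated <- grepl("^[<>]", caghigh_txt)
caghigh_num   <- as.numeric(sub("^[<>]", "", caghigh_txt))

print(data.frame(caghigh_txt, caghigh_num, is_aggregated))
```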

Deaggregated or suppressed data may be obtained through special request, subject to Scientific Review Committee (SRC) approval. Please refer to the Access Data and Biosamples webpage for information on how to request a specified dataset (SPS).

Table 4.1: Aggregated variables and aggregation thresholds in PDS6
Data file Variable Variable label Criteria for aggregation
participation age_0 Age at enrollment <18
enroll, registry, adhoc age Age at visit <18
profile caghigh Research larger CAG allele determined from DNA >70
profile caglow Research smaller CAG allele determined from DNA >28
profile race Ethnicity Fewer than 100 cases per category*
* The following categories for ethnicity were aggregated into “Other (6)”: “Native Hawaiian or Other Pacific Islander” (4), “Alaska Native/Inuit” (5), “African – South” (11), “African – North” (12). Note “Asian – West” (13) and “Asian – East” (14) categories are aggregated into the category “Asian” (16).
Table 4.2: Participants subject to aggregation thresholds in PDS6
Data file Variable Label Number of participants
participation age_0 <18 30
enroll age <18 30
profile caghigh >70 32
profile caglow >28 278
profile race Other (6)* 117
profile race Asian (16)** 149
* Includes individuals from the following categories: “Native Hawaiian or Other Pacific Islander” (4), “Alaska Native/Inuit” (5), “African – South” (11), “African – North” (12).
** Includes individuals from the following categories: “Asian – West” (13) and “Asian – East” (14).

4.6 Assessment score calculation

Assessment ‘total scores’ are automatically calculated in the Enroll-HD EDC system.

If a mandatory assessment item, required for the generation of the total score, is missing, a blank data entry will be displayed (indicative of system defined missing data). Note that incomplete total scores (calculation of scores with the available values) or detailed items are also available for some assessments (motor, function).


Unified Huntington’s Disease Rating Scale (UHDRS ®)

Enroll-HD PDS6 contains calculated composite UHDRS scores. Please refer to the following reference for further information: 2

The motor component of the UHDRS ® assesses domains such as chorea, dystonia, bradykinesia, and rigidity. The key disease variable generated by this assessment is total motor score (motscore), which ranges from 0 to 124.

The UHDRS ® Motor/Diagnostic Confidence component indicates the rater’s confidence in the patient’s motor onset, based on the UHDRS motor assessment above (diagconf). This variable ranges from 0 (no abnormalities) to 4 (motor abnormalities that are unequivocal signs of HD; ≥99% confidence).

The total functional capacity (TFC) component of the UHDRS ® consists of five items: occupation, finances, domestic chores, activities of daily living, and care level. The key disease variable generated by this assessment is total functional capacity score (tfcscore), which ranges from 13 (least severe) to 0 (most severe).

The functional assessment (FAS) component of the UHDRS ® includes 25 yes/no questions about common daily tasks. The key variable generated by this assessment is functional assessment score (fascore), which ranges from 25 to 0.

The independence component of the UHDRS ® assesses the participant’s independence. The single independence scale score (indepscl) is a percentage ranging from 100% (no special care needed) to 5% (participant is tube fed and needs total bed care).

Table 4.3: UHDRS score calculation for multi-item component sections.
UHDRS section Variable Score calculation
Motor motscore Sum of the values of all scores
Total Functional Capacity tfcscore Sum of the values of all scores
Functional Assessment fascore Sum of the values of all scores
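
The behaviour of total scores when an item is missing can be illustrated in R. The item columns below are hypothetical placeholders, not actual PDS variable names. The sketch contrasts a total that is blank when any item is missing with an ‘incomplete’ total computed from the available values.

```r
# Hypothetical item columns for one assessment (names are illustrative)
items <- data.frame(
  item1 = c(1, 2, 0),
  item2 = c(3, NA, 1),
  item3 = c(2, 1, 4)
)

# Total score: NA (blank) if any required item is missing
total <- rowSums(items)

# 'Incomplete' total: sum of the values that are available
total_incomplete <- rowSums(items, na.rm = TRUE)

print(data.frame(total, total_incomplete))
```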

Problem Behaviors Assessment – Short (PBA-s)

Enroll-HD contains calculated composite PBA-s scores. Please refer to the following reference for further information: 3

This instrument measures frequency and severity of symptoms related to altered affect, thought content, and coping styles. It includes items that cover an extensive range of behaviours including: depressed mood, low self-esteem, anxiety, suicidal thoughts, aggressive behaviour, irritability, perseveration, compulsive behaviours, delusions, hallucinations, and apathy. Key disease variables are the total scores on each sub-scale (depscore, irascore, psyscore, aptscore, exfscore).

Table 4.4: PBA-s sub-scale score calculations.
PBA-s section Variable Score calculation
Depression depscore Addition of composite scores* for depressed mood + suicidal ideation + anxiety
Irritability/Aggression irascore Addition of composite scores* for irritability + angry or aggressive behavior
Psychosis psyscore Addition of composite scores* for delusions / paranoid thinking + hallucinations
Apathy aptscore Addition of composite scores* for apathy
Executive function exfscore Addition of composite scores* for perseverative thinking or behavior + obsessive compulsive behaviors
* Composite scores are calculated by multiplying severity by frequency for each symptom; the composite scores are then summed to give the sub-scale score. For example: Depression = (severity of depressed mood × frequency of depressed mood) + (severity of suicidal ideation × frequency of suicidal ideation) + (severity of anxiety × frequency of anxiety).
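
As a worked illustration of the depression sub-scale calculation, the following R sketch uses hypothetical severity and frequency column names with made-up values; the item names in the PDS differ.

```r
# Hypothetical severity/frequency item names; values are illustrative
pba <- data.frame(
  dep_sev = c(2, 0), dep_freq = c(3, 0),   # depressed mood
  sui_sev = c(1, 0), sui_freq = c(1, 0),   # suicidal ideation
  anx_sev = c(3, 1), anx_freq = c(4, 2)    # anxiety
)

# Sub-scale score = sum over symptoms of severity x frequency
pba$depscore <- with(pba,
  dep_sev * dep_freq + sui_sev * sui_freq + anx_sev * anx_freq)
print(pba$depscore)  # 19 2
```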

Mini Mental State Examination (MMSE) score

Enroll-HD contains calculated MMSE scores. Please refer to the following reference for further information: 4

The Mini Mental State Examination is an 11-question measure that tests five areas of cognitive function: orientation, registration, attention and calculation, recall, and language. The key variable generated is MMSE score (mmsetotal), calculated by summing the value of all assessment scores.


Hospital Anxiety Depression Scale / Snaith Irritability Scale (HADS-SIS) score calculation

The HADS-SIS assessment used in Enroll-HD is a combination of two separate scales, the Hospital anxiety and depression scale - HADS 5 and the Snaith irritability scale - SIS 6. It is important to recognize that the HADS-SIS is comprised of these two separate scales so that analyses can incorporate the respective subscales and items appropriately.

The HADS combined with the SIS offer a brief rating of depression, anxiety, and irritability (inward and outward) symptoms that reflect primarily mood rather than cognitive and somatic symptoms. Key variables are subscale total scores (anxscore, depscore, irrscore, outscore, inwscore).


Short Form Health Survey - 12v2 (SF-12)

Enroll-HD PDS releases contain calculated scores for SF-12 scales, available in the ‘enroll’ data file.

The Short Form Health Survey-12 (SF-12) is extensively used in large population health surveys as a brief, reliable measure of overall health status. The 1-week recall version is used. Key variables are: group norm-based scores for physical functioning (pf), role-physical (rp), bodily pain (bp), general health (gh), vitality (vt), social functioning (sf), role-emotional (re), mental health (mh), all of which generate an overall physical component (pcs) and mental component (mcs).

The following reference provides further information on score calculation: 7


Short Form Health Survey – 36 v1/v2 (SF-36)

Enroll-HD PDS releases contain the total scores for the SF-36 scale (version 1 and version 2), available in the ‘registry’ data file. All sub-items for this scale are available upon SPS request.

The following reference provides further information on global score calculation: 8

4.7 Derived variables: e.g. periodic dosage (drugs, pharmacotherapies, nutritional supplements)

Data on the use of drugs, pharmacotherapy, and nutritional supplements and/or periodic dosage are included in the Enroll-HD PDS.

The variable cmdostot is derived from raw measures of dose and frequency of use, i.e., multiplication of a drug dose by the number of intakes per day (for example: 25 mg taken 4 times per day equals 100 mg intake per day).

If a drug is taken “as needed”, the frequency of use is often unknown and set to zero; the derived value is then zero (e.g., 25 mg taken 0 times per day equals an intake of 0 mg per day). If the frequency of use is set to one of the exceptional values (e.g., 9998), the variable cmdostot is also set to that exceptional value (e.g., 9998).

For combined drugs the dose is often not entered as a number, but rather as a string (e.g., 25/100 mg). It is not possible to derive the total daily dose from this, and the value of cmdostot remains blank.
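
The derivation described above can be sketched in R as follows. The raw dose and frequency values are not part of the PDS, so the vectors dose and freq and their contents are hypothetical; the sketch only mirrors the rules stated in the text.

```r
# Hypothetical raw inputs: dose per intake (mg) and intakes per day
dose <- c(25, 25, 25, NA)   # NA: dose entered as a string such as "25/100 mg"
freq <- c(4, 0, 9998, 2)    # 0 = taken "as needed"; 9998 = exceptional value

exceptional <- c(9996, 9997, 9998, 9999)

# Exceptional frequency codes are passed through; otherwise dose x frequency
cmdostot <- ifelse(freq %in% exceptional, freq, dose * freq)
print(cmdostot)  # 100 0 9998 NA
```

A zero in cmdostot therefore does not necessarily mean that no drug was taken.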

A large number of unusually high values are also observed for cmdostot. These values may be correct, and attributable to small units of measurement listed for dose, or reflective of data entry errors. These values are not queried by the Enroll-HD team, and as such we highlight the need for careful review before use in analysis.

Another example of a derived value is the variable ‘packy’, indicative of an individual’s cumulative lifetime exposure to tobacco in terms of pack-years. It is derived from the daily intake (tobcpd) and years of smoking (tobyos) variables (packy = [tobcpd/20] * tobyos).

If one of these input values is missing, a system-defined missing value will be generated (see Missing values). If one of these values is extremely low or zero (as may be entered for tobcpd for occasional, non-daily smokers), the derived value may be zero, due to either a zero entry for input value, or rounding down of derived value of <0.05. All individual values of low packy are identified in the Quality Control: Observations and Unusual Findings document provided along with the dataset.
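
A short R sketch of the pack-year formula and the edge cases described above. The input values are made up, and rounding to one decimal place is an assumption inferred from the statement that derived values below 0.05 are rounded down to zero.

```r
tobcpd <- c(20, 10, 0, 1, NA)    # cigarettes per day
tobyos <- c(10, 30, 5, 0.5, 12)  # years of smoking

# packy = [tobcpd / 20] * tobyos; rounding to one decimal is an assumption
packy <- round((tobcpd / 20) * tobyos, 1)
print(packy)  # 10 15 0 0 NA
```

The third value is zero because of a zero input, the fourth because 0.025 rounds down, and the fifth is missing because an input is missing.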

The raw data values used to calculate dose can be obtained through special request, subject to Scientific Review Committee (SRC) approval. Please refer to the Access Data and Biosamples webpage for information on how to request a specified dataset (SPS).

4.8 Data Exclusion

For data to be included in the PDS, certain requirements must be met at both a participant- and visit-level basis.

If participant-level requirements are not met, that participant and all their associated data will be excluded from the PDS release. If participant-level requirements are met but visit level data requirements are not met for that participant for one or more visits, the participant will be included in the PDS, but one or more of their study visits will be excluded.

Participant level data requirements:

  • Participant status is not ‘quarantined’

  • Existing value for caghigh (i.e., research CAG)

  • Valid baseline value for hdcat (i.e., HD category at enrollment)

  • Value for hdcat does not equal 6 (i.e., no community control)

  • If participant status is ‘withdrawn’ or ‘violator’, an End form must be completed

  • Central coding completed for pharmacotx (medications, indications), comorbidities and events

  • Study visit status for baseline visit is ‘completed’ (i.e., onsite monitoring of baseline is complete)

  • General data monitored at least once

  • Participant is not on the ‘excluded participants’ list generated by the statistical monitoring team. Reasons for exclusion include: exceeding identification risk threshold; caghigh is not consistent with hdcat

Visit level data requirements:

Enroll-HD (enroll) visits (applies to follow-up, unscheduled, phone contact):

  • Study visit status is ‘completed’

  • Visit is not on the ‘excluded visits’ list. Reasons for exclusion include: duplicated visits/visits with the same visit date; visits not covered by a valid ICF

REGISTRY (registry) visits:

  • Study visit status is ‘completed’, ‘signed’, or ‘reviewing’

  • Visit is not on the ‘excluded visits’ list

AdHoc (adhoc) visits:

  • Study visit status is ‘completed’, ‘signed’, or ‘reviewing’

  • Visit is not on the ‘excluded visits’ list

4.9 Quality control: observations and unusual findings

Prior to each PDS release, an enriched set of remote data quality control checks is performed. These include custom checks for unusual or implausible values, and systematic checks of continuous variables for extreme outlying values, which are flagged using data-driven thresholds or pre-specified custom thresholds based on plausibility.

Unusual and implausible values are reviewed by the monitoring and/or medical monitoring teams and queried directly with sites where deemed appropriate by expert determination. In certain instances, however, these values cannot be queried and corrected (e.g., if the observation was recorded in a REGISTRY visit and then transferred into the Enroll-HD database) or are queried and confirmed as correct by site staff. In instances such as these, those unusual values are provided ‘as is’, and it is left to the analyst to determine whether to include or exclude these values or perform sensitivity analyses.

The Quality Control: Observations and Unusual Findings document lists all quality control checks that are performed for PDS releases, alongside frequency counts of how many identified values remain ‘as is’ for each individual check.

While substantial efforts are made to maximize data quality, researchers are encouraged to visualize the data and perform their own QC checks prior to commencing analyses.

4.10 HD onset and diagnosis variables

HD onset and diagnosis are important clinical concepts and of critical interest to many researchers. In this section, we discuss these complex concepts in detail – and the nuances of the variables which capture them in the Enroll-HD study.

HD onset is complex. The timing of symptom onset, order of presentation, and consequent trajectory of symptoms in each domain - motor, cognitive, functional, or behavioural - are unique to each participant. Similarly, diagnosis of HD may be made for an individual in different ways at different times. In recognition of this, Enroll-HD collects data on a multitude of variables relating to timing of initial symptoms, disease onset, genetic testing and diagnosis. These are shown in Table 4.5.

Table 4.5: Key disease dates (symptoms, onset, diagnosis) captured in the Enroll-HD EDC. Note analogous age of onset variables are available in PDS releases.
Disease date domain Variable Variable Label Form
Date of first symptoms sxsubj Date symptoms first noted by participant HDCC
sxfam Date symptoms first noticed by family HDCC
sxrater Rater’s estimate of symptom onset HDCC
ccdepyr Year of onset (depression) HDCC
ccirbyr Year of onset (irritability) HDCC
ccvabyr Year of onset (violent or aggressive behavior) HDCC
ccaptyr Year of onset (apathy) HDCC
ccpobyr Year of onset (perseverative/obsessive behavior) HDCC
ccpsyyr Year of onset (psychosis) HDCC
cccogyr Year of onset (cognitive impairment; first began impacting on daily life) HDCC
ccmtryr Year of onset (motor symptoms) HDCC
Date of diagnosis lbdtc Date of report (local CAG) NB: this variable is not available in dataset releases CAG
svstdtc & diagconf Date of visit at which diagnostic confidence level (DCL) is updated from ‘1’, ‘2’, or ‘3’ to ‘4.’ NB: Indicates disease onset, motor. Variable items (Follow-up visit); Motor
svstdtc & hdcat Date of visit at which hdcat is updated from ‘premanifest’ to ‘manifest.’ NB: Indicates disease onset, any domain. Variable items (Follow-up visit);
hddiagn Date of clinical HD diagnosis (based on symptoms in any domain) NB: Indicates disease onset communicated to participant. HDCC

In Enroll-HD, dates relating to onset of first symptoms are captured as reported from several perspectives: the participant (sxsubj), their family (sxfam), and the Enroll-HD clinician/rater (sxrater).

Note that since the EDC release in December 2017 the variable sxrater can only be completed when a participant is considered manifest, as indicated by hdcat=3. This rule does not apply for sxsubj and sxfam which can be completed regardless of hdcat value. Values collected before December 2017 may still be present for some participants not considered manifest.

Onset dates pertaining to specific symptoms in each domain are also captured (e.g., cccogyr). These are completed from the clinician/rater’s perspective, based on their best judgement. This takes into account participant and family reports, available history from medical records, as well as Enroll-HD assessment scores.

Note that, among the ‘cc…’ symptom onset variables, only symptoms in the motor domain are required to be HD specific. Psychiatric symptoms or cognitive symptoms indicated by these variables may or may not be related to HD and should be considered with this caveat in mind.

Given the exclusion of certain participant visits from the PDS (see Data Exclusion) it is possible for the age listed for these onset variables to be greater than the age at ‘last’ visit.

The term ‘clinical diagnosis’ is used to denote the unequivocal onset of symptoms or signs attributed to HD, which can occur at vastly different times for each individual gene-expanded carrier. In the Enroll-HD protocol, a clinician-based judgement of disease “manifest” status, as indicated in Enroll-HD by participant category (i.e., hdcat = 3), is based on symptoms in any of the disease domains (i.e., motor, cognitive, behavioral). To this point, note that some participants classified as manifest in Enroll-HD may have low values for UHDRS total motor score (e.g., motscore < 10 and a DCL of < 4). In these instances, the manifest categorization may be due to psychiatric or cognitive symptom onset as opposed to motor.

Enroll-HD captures the date of clinical HD diagnosis (hddiagn). This variable represents the date on which a participant is informed by a clinician that the disease is evident. However, this can be years after actual symptom onset if the participant has not been seen by a doctor. If the date of first diagnosis is unknown and cannot be identified, hddiagn can be missing, even if a clinician is confident in their diagnosis of symptomatic HD and has correspondingly marked hdcat as manifest. If the field date of clinical HD diagnosis is filled for participants classified as “pre-manifest”, it is possible that the date of predictive genetic test result has erroneously been entered instead of date of clinical HD diagnosis. The other possibility is that the subject group was not entered correctly for the participant. The analyst should decide whether such values should be excluded from their analysis.

An alternative definition of disease onset, also termed “manifest” and widely used in the HD literature, concerns the transition from pre-symptomatic to symptomatic HD based on motor symptoms only; this is known as motor onset or motor diagnosis. This definition is based on a Diagnostic Confidence Level (DCL) score (i.e., diagconf) of 4, which indicates a clinician’s confidence that, based on the UHDRS Motor assessment, motor signs unequivocally represent HD (≥ 99% confidence). Provided a participant was not classified as hdcat = ‘manifest’, or diagconf = ‘4’, at study entry (i.e., at baseline visit), the date of the visit at which either of these variables is updated to the values above can be used to indicate date of clinical onset, as outlined respectively above.
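The visit-based onset rule can be sketched as follows. This is a minimal illustration, assuming each participant's visits are available as a list of dictionaries ordered from baseline onward; the key name `visit_key` and the numeric coding of the values (diagconf = 4, hdcat = 3 for manifest) are assumptions about how the data are held in memory, and the function name is hypothetical.

```python
def first_transition_visit(visits, variable, target):
    """Return the key of the first follow-up visit at which `variable` equals `target`.

    `visits` is a list of dicts ordered by visit, baseline first.
    Returns None when the participant already had the target value at
    baseline (onset preceded study entry) or never reaches it.
    """
    if not visits or visits[0].get(variable) == target:
        return None
    for visit in visits[1:]:
        if visit.get(variable) == target:
            return visit["visit_key"]  # "visit_key" is a hypothetical field name
    return None


# Illustrative (made-up) visit history for one participant.
example_visits = [
    {"visit_key": "baseline", "diagconf": 2, "hdcat": 2},
    {"visit_key": "follow-up 1", "diagconf": 3, "hdcat": 3},
    {"visit_key": "follow-up 2", "diagconf": 4, "hdcat": 3},
]
motor_onset_visit = first_transition_visit(example_visits, "diagconf", 4)
any_domain_onset_visit = first_transition_visit(example_visits, "hdcat", 3)
```

In this made-up history the two definitions give different visits, which mirrors the distinction between motor onset and onset in any domain.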

Note that there may be discrepancies between date of clinical diagnosis (hddiagn) and year of onset of motor symptoms (ccmtryr). Estimated onset of motor symptoms may be years earlier than date of clinical diagnosis, for example if a participant had not been seen by a doctor for a long period. Further, clinical diagnosis is distinct from first symptoms. Conversely, onset of motor symptoms may be years later than date of clinical diagnosis. This is plausible if the diagnosis was based on cognitive or psychiatric symptoms.

Genetic diagnosis of HD is performed by genetic test, which may be completed prior to symptom onset (known as a “predictive test”), or to confirm a clinical diagnosis (known as a “diagnostic test”). Diagnostic or predictive genetic testing is voluntary. Such genetic tests are completed at local labs for some individuals participating in Enroll-HD (not all) and are performed independently of the Enroll-HD study. Separately, all Enroll-HD participants undergo research CAG repeat genotyping at a central research laboratory. These results are used solely for research as opposed to predictive or diagnostic purposes, and are never shared with participants, investigators, or sites. For the Enroll-HD study, an individual is an HD gene expansion carrier if they have 36 or more CAG repeats, although the literature states that CAG repeats between 36 and 39 (inclusive) are not fully penetrant. CAG values between 27 and 35 (inclusive) are considered intermediate alleles. All CAG repeat lengths of 40 and higher are fully penetrant. In symptomatic individuals without family history of HD, clinical diagnosis is confirmed by genetic testing; therefore, date of local genetic testing (i.e., lbdtc) may be used as “date of clinical diagnosis” in such individuals. In asymptomatic individuals, with family history, who undergo predictive testing, date of genetic testing may be used as “date of genetic diagnosis”.

Note, however, that date of local genetic testing is not made available in Enroll-HD data releases.
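The CAG thresholds described above can be expressed as a small lookup. This is a sketch only: the function name and category labels are hypothetical, and the label for values below 27 is a placeholder because the text above does not name that range.

```python
def cag_category(cag):
    """Map a CAG repeat length to the ranges described in the text."""
    if cag >= 40:
        return "expansion carrier, fully penetrant"
    if cag >= 36:
        return "expansion carrier, reduced penetrance"
    if cag >= 27:
        return "intermediate allele"
    return "below intermediate range"  # range not named in the text above


def is_expansion_carrier(cag):
    """Enroll-HD definition quoted above: 36 or more CAG repeats."""
    return cag >= 36
```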

Finally, we highlight CAP score (CAG-age-product), which is a commonly used measure of cumulative exposure to mutant huntingtin. Multiple formulas for calculating CAP score exist. In the current Enroll-HD PDS release, we include CAP score (capscore) for applicable participants and visits, calculated per the definition provided by Warner et al.9

\[ \text{CAP score} = \frac{\text{Age} \times (\text{CAG} - L)}{K} \]

where L = 30 and K = 6.491.

This formula is standardized to ensure that CAP = 100 at the expected age of diagnosis.

Please note the following caveats associated with the CAP score values provided in the current Enroll-HD PDS release:

  • CAG length is based on research CAG (i.e., caghigh)

  • CAP score is calculated for each participant at each visit, provided CAG length (caghigh) is ≥ 36

  • Age is entered into the above formula as an integer (i.e., a whole number, without decimal points)

  • Where age and/or CAG values are aggregated, CAP score will be blank (i.e., system defined missing)
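The formula and caveats above can be sketched as a short function. This is an illustration under stated assumptions, not the code used to produce capscore: the function name is hypothetical, blank or aggregated inputs are represented here as None, and age is expected to be supplied already as a whole number of years (the text does not say whether age is truncated or rounded).

```python
CAP_L = 30      # L in the formula above
CAP_K = 6.491   # K in the formula above


def cap_score(age_years, cag):
    """CAP score = Age x (CAG - L) / K, or None where it is not provided.

    Returns None when age or CAG is unavailable (represented as None here)
    or when CAG is below 36, following the caveats listed above.
    """
    if age_years is None or cag is None or cag < 36:
        return None
    return age_years * (cag - CAP_L) / CAP_K


example = cap_score(50, 42)  # 50 * 12 / 6.491, roughly 92.4
```

Rearranging the formula gives the age at which CAP reaches 100 for a given CAG length: Age = 100 × K / (CAG − L). For CAG = 42 this is 649.1 / 12 ≈ 54.1 years.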

CAP score values are provided for all visits in all studies included in the PDS (i.e., Enroll-HD, R2, R3, and Ad-hoc). These values are located in the following files: enroll, registry, adhoc.

Quality Control of HD onset and diagnosis variables: A specific set of custom multivariate quality control checks are performed on the HD onset variables prior to PDS releases in an effort to identify unusual or implausible values. These values are reviewed by the monitoring and/or medical monitoring teams and queried directly with sites where relevant. In certain instances, however, these values cannot be queried and corrected (e.g., if the observation was recorded in a REGISTRY visit transferred into the Enroll-HD database) or are queried and confirmed as correct by site staff. In instances such as these, values are provided ‘as is’, and it is left to the analyst to determine whether to include or exclude these values. These custom HD onset checks, alongside a summary of findings, are listed in the Quality Control: Observations and Unusual Findings document.

4.11 HD Integrated Staging System (HD-ISS)

The current Enroll-HD PDS release includes imputed Huntington’s Disease Integrated Staging System (HD-ISS)10 variables.


Background

The Huntington’s Disease Integrated Staging System (HD-ISS) is a four-stage evidence-based framework intended to facilitate clinical research. In Stage 0, individuals have the Huntington’s disease genetic mutation (CAG ≥ 40) without any detectable pathological alterations. Stage 1 is marked by measurable underlying biomarker pathophysiology as indicated by striatal atrophy. Stage 2 indicates the appearance of HD signs and symptoms, and Stage 3 is evidenced by functional decline. Staging requires the collection of CAG length (Stage 0), caudate and putamen volume (Stage 1), and the UHDRS variables of total motor score and symbol digit modalities test (Stage 2), and total functional capacity and independence scale (Stage 3). Enroll-HD collects all the variables except brain volume. The missing imaging variables indicate that participants in Enroll-HD cannot be definitively staged. For this reason, we impute HD-ISS stage using machine learning methods that are known to generally provide excellent predictions in applied data analysis. In addition, we provide probabilities of classification of all the stages for a visit, which might be used to represent the confidence level of the imputed stage.


HD-ISS in the PDS

The Enroll-HD PDS includes five HD-ISS variables for each pwHD participant at each visit, as listed below. The values of each of these variables are the result of the imputation methods described below. These values are provided in the ‘enroll’ data file only; HD-ISS variables are not calculated for Registry and Adhoc visits.

  • “HDISS_stage_imp” The imputed HD-ISS stage; specific to participant and timepoint

  • “HDISS_stage0_prob” The probability of classification as HD-ISS Stage 0; specific to participant and timepoint

  • “HDISS_stage1_prob” The probability of classification as HD-ISS Stage 1; specific to participant and timepoint

  • “HDISS_stage2_prob” The probability of classification as HD-ISS Stage 2; specific to participant and timepoint

  • “HDISS_stage3_prob” The probability of classification as HD-ISS Stage 3; specific to participant and timepoint
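One way to use these variables together is to read off the probability attached to the imputed stage as a rough confidence value for that visit. The sketch below assumes a visit is held as a dictionary keyed by the variable names listed above, that the imputed stage is coded numerically as 0 to 3, and that blank values are represented as None; the function name is hypothetical.

```python
PROB_COLUMNS = {
    0: "HDISS_stage0_prob",
    1: "HDISS_stage1_prob",
    2: "HDISS_stage2_prob",
    3: "HDISS_stage3_prob",
}


def imputed_stage_probability(row):
    """Return the classification probability of the imputed stage, or None if blank."""
    stage = row.get("HDISS_stage_imp")
    if stage is None:
        return None
    return row[PROB_COLUMNS[int(stage)]]


# Made-up values for illustration only.
example_row = {
    "HDISS_stage_imp": 2,
    "HDISS_stage0_prob": 0.01,
    "HDISS_stage1_prob": 0.14,
    "HDISS_stage2_prob": 0.80,
    "HDISS_stage3_prob": 0.05,
}
confidence = imputed_stage_probability(example_row)
```

An analyst might, for example, run a sensitivity analysis restricted to visits whose value exceeds a chosen cut-off; any such cut-off is an analytic choice and is not prescribed by the PDS.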


Exclusion

It is not possible to impute HD-ISS stage for every participant and/or visit. Imputation is only calculated for participants aged 18 or older who have CAG in the range of 40 to 50 (inclusive). In instances where an imputed stage cannot be calculated, the variable values will be blank (i.e., system defined missing).
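The eligibility rule can be written as a simple predicate, which may help when checking why HD-ISS variables are blank for a given visit. This is a sketch: the function name is hypothetical, and unavailable age or CAG values are represented as None.

```python
def hdiss_imputable(age, cag):
    """True when a visit meets the stated criteria: age 18 or older and CAG 40 to 50 inclusive."""
    if age is None or cag is None:
        return False
    return age >= 18 and 40 <= cag <= 50
```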


Why is imputation of Stages 2 and 3 necessary in the Enroll-HD PDS?

The Enroll-HD dataset contains data on total motor score, symbol digit modalities test, total functional capacity, and independence scale – i.e., the landmark variables required to assess entry into Stage 2 and Stage 3 respectively. Given that these data are known for Enroll-HD participants, it is reasonable to ask why classification as Stage 2 or Stage 3 must be imputed. In short, because information on all landmark variables is required to definitively stage individuals. The absence of imaging variables in Enroll-HD prohibits definitive classification of participants, although we are able to impute Stages 2 and 3 with a higher degree of confidence than we are Stages 0 or 1 because of the presence of the Stage 2 and 3 landmark variables in the dataset. This is reflected in the HD-ISS stage probability scores displayed alongside the imputed stage.


How is regression in stages possible across visits?

The staging system – by design – is cross-sectional, only applying to a single visit for a person. There is no constraint in the staging system that considers prior or subsequent values for stage. Regression of stage values across visits – and stage skipping – are observed phenomena in the Enroll-HD PDS and in ‘ground truth’ datasets. This is due to several factors, including measurement error and real temporal fluctuations. This issue is not unique to the staging system; it is often seen in scores for other clinical measures such as the TMS and DCL.
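Analysts who wish to locate such patterns in a participant's visit sequence could use something like the sketch below. It assumes the imputed stages for one participant are supplied in visit order, with blank values represented as None; the function name is hypothetical.

```python
def stage_changes(stages):
    """Return (regressions, skips) as lists of positions in the visit sequence.

    A regression is a visit whose stage is lower than the previous non-blank
    stage; a skip is a visit whose stage is more than one above it.
    """
    regressions, skips = [], []
    previous = None
    for index, stage in enumerate(stages):
        if stage is None:
            continue  # blank stage: compare the next visit with the last known stage
        if previous is not None:
            if stage < previous:
                regressions.append(index)
            elif stage - previous > 1:
                skips.append(index)
        previous = stage
    return regressions, skips


# Made-up sequence: skip at position 3 (1 to 3), regression at position 4 (3 to 2).
regressions, skips = stage_changes([0, 1, 1, 3, 2, None, 2])
```

Whether to smooth, exclude, or retain such visits is left to the analyst; the sketch only identifies them.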


Missing Data Imputation

HD-ISS stage imputation for the periodic dataset (PDS) was based on random forest11 with chained equations12 as applied in the “missForest” algorithm13 (the R package “missRanger”14 was used). Because Enroll-HD does not collect imaging variables, a database from studies that did collect imaging (IMAGE-HD, PREDICT-HD and TRACK-HD/ON) was used to train the algorithm. Chained equations constitute a conditional specification approach to imputation. The imputation is performed on a variable-by-variable basis with each incomplete variable acting as the outcome variable and using all other variables in the imputation model as predictors. Each conditional imputation model is variable-specific, using the appropriate methods for the data type of the variable, whether it be continuous, binary, multicategory, etc. Thus, in our application all variables with missing data were imputed, not just the HD-ISS. However, only the imputed HD-ISS stages are provided in the PDS distribution, along with post hoc probabilities of classification, as described below. The chained equation approach has been shown to work well in simulation studies. The main advantage of the method is that a specification of the joint multivariate distribution for all the variables is not required. The multivariate distribution may be difficult or impossible to specify when the variables are a mix of types, as we have in Enroll-HD.

The “missForest” imputation algorithm proceeds as follows. Suppose we have variable vectors x_1, x_2, x_3, each with dimension n by 1. Assume the first two variables have missing data, and say that x_1 has fewer missing values than x_2. We start with x_1, and set it to be the outcome variable, y, which will be predicted by x_2 and x_3. The algorithm initiates by making a naive guess for the missing data in x_2, using the mean or mode (depending on the predictor variable type). Then a random forest is grown, consisting of a large number of random regression trees (1000 trees were used in this PDS release). The random forest is trained on the observed portion of y, and then the forest is used to predict the missing portion of y (those rows of x_2 and x_3 that correspond to the missing rows in y are “dropped down” the grown forest to compute predictions). After missing values on x_1 are imputed, we move on to setting x_2 to y, and a random forest is similarly trained on the observed values of x_2 using the newly imputed x_1 and the (non-imputed) x_3. The newly trained forest is used to impute the x_2 missing values. The process is repeated, and each time the imputed values are updated, until a convergence criterion is reached.
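The control flow of this chained procedure can be sketched in a few lines. Two simplifications are made so that the example stays self-contained, and both are assumptions rather than the method used for the PDS: a k-nearest-neighbour average stands in for the random forest, and a fixed number of passes replaces the convergence criterion. Only numeric variables are handled, and all function names are hypothetical.

```python
from statistics import mean


def knn_predict(train_X, train_y, query, k=3):
    """Stand-in learner (NOT a random forest): mean target of the k nearest rows."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), y)
        for row, y in zip(train_X, train_y)
    )
    return mean(y for _, y in nearest[:k])


def chained_impute(columns, n_iter=5):
    """columns: dict of name -> list of numbers, with None marking missing values."""
    names = list(columns)
    missing = {c: [i for i, v in enumerate(columns[c]) if v is None] for c in names}
    # Naive initial guess: fill each gap with the mean of the observed values.
    filled = {}
    for c in names:
        observed_mean = mean(v for v in columns[c] if v is not None)
        filled[c] = [observed_mean if v is None else v for v in columns[c]]
    # Visit incomplete variables in order of increasing missingness.
    order = sorted((c for c in names if missing[c]), key=lambda c: len(missing[c]))
    for _ in range(n_iter):  # fixed passes instead of a convergence check
        for target in order:
            predictors = [c for c in names if c != target]
            gaps = set(missing[target])
            rows = range(len(filled[target]))
            # Train on rows where the target was actually observed.
            train_X = [[filled[p][i] for p in predictors] for i in rows if i not in gaps]
            train_y = [filled[target][i] for i in rows if i not in gaps]
            # Predict the missing rows from the current (partly imputed) predictors.
            for i in missing[target]:
                query = [filled[p][i] for p in predictors]
                filled[target][i] = knn_predict(train_X, train_y, query)
    return filled


# Made-up data: x1 and x2 have gaps, x3 is complete.
data = {
    "x1": [2, 4, None, 8, 10, 12],
    "x2": [1, 1, 2, None, None, 3],
    "x3": [1, 2, 3, 4, 5, 6],
}
completed = chained_impute(data)
```

Observed values are never overwritten; only the positions that were originally missing are updated on each pass, using the latest imputations of the other variables as predictors.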


Post Hoc Probabilities

Though a single stage is assigned in the imputation method, it is of interest to compute the probability of each of the HD-ISS stage candidates (0, 1, 2, 3) given the vector of observed variables for a participant at a given study visit. For example, given the observed values of TMS = 10, SDMT = 40, TFC = 13, and IS = 100, we might want to know how probable it is that a visit would be classified in each of the stages. It is not possible to compute these probabilities directly from the “missForest” algorithm, so we used a separate random forest fitted on the imputed data. Specifically, the imputed HD-ISS stage was predicted using the landmark variables discussed above, which generated a probability of classification for each stage.

Chapter 5 Data Quality Management and Participant Privacy

5.1 Data monitoring and quality

Each Enroll-HD PDS goes through stringent Quality Control (QC) and de-identification procedures prior to release. These are described in detail below.

Data quality control checks are implemented routinely at multiple levels, from point of data entry, through to onsite and remote data monitoring. Prior to a PDS release, data are also subject to an enriched, unique set of remote data review checks. All of these checks aim to maximize data integrity.

Onsite monitoring visits are carried out routinely at each Enroll-HD site to review source data, ensure compliance with study protocol, Good Clinical Practice, and applicable regulations, and retrain staff as needed.

Each month, all data signed off during that month are subject to remote monitoring procedures, where participants’ data (core assessment and selected extended assessments) are subject to cross-sectional QC checks, which include checks for consistency, completeness and plausibility. Participant data are also subject to longitudinal (i.e., within subject) QC checks for a subset of variables (e.g., height, TFC score), which are conducted every 6 months.

Prior to a PDS release, an enriched set of remote data QC checks (~400) are also performed. These include custom multivariate checks for unusual or implausible values, and systematic checks of continuous variables for extreme outlying values, flagged using data-driven thresholds, or pre-specified custom thresholds based on plausibility.
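Researchers running their own checks on a continuous variable could start from something like the sketch below. The thresholds actually used for the PDS are not described here, so the interquartile-range fence (with multiplier k) is an assumption chosen purely for illustration, as are the function name and the optional fixed plausibility bounds.

```python
from statistics import quantiles


def flag_outliers(values, k=3.0, lower=None, upper=None):
    """Return positions of values outside the chosen bounds.

    By default the bounds are data-driven: Q1 - k*IQR and Q3 + k*IQR.
    Passing `lower` and/or `upper` replaces the corresponding bound with a
    pre-specified plausibility threshold. None entries are ignored.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    low = q1 - k * iqr if lower is None else lower
    high = q3 + k * iqr if upper is None else upper
    return [i for i, v in enumerate(values) if v is not None and not low <= v <= high]


# Made-up heights in cm; the last value looks like a data-entry error.
heights = [160, 165, 170, 172, 168, 175, 163, 17]
flagged = flag_outliers(heights)
```

Flagged values are candidates for review, not automatic exclusions; some unusual values in the PDS have been confirmed as correct by site staff.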

Values identified by the QC check battery are reviewed by the Data Monitoring and/or Medical Monitoring teams. Consequent actions (and outcomes) are listed below:

  • Query issued, sent to study site > datum confirmed incorrect and updated by site | datum confirmed correct and not updated by site, retained ‘as is’ for PDS

  • Query issued, sent to another source (e.g., BioRep) > datum confirmed incorrect and updated | participant added to “exclusion list”

  • Query not issued > participant added to “exclusion list”

  • Query not issued > visit added to “exclusion list”

  • Query not issued > datum recoded with exceptional value code (e.g., WRONG/9996)

  • Query not issued > datum retained ‘as is’ and included in the document Quality Control: Observations and Unusual Findings

Issued queries may result in identified values being updated/corrected. Certain issued queries are checked and confirmed as correct by site staff. In these cases, unless the datum is impossible (e.g., out of scale range), the value is left ‘as is’ in the dataset. All of these values should be listed in the PDS document Quality Control: Observations and Unusual Findings. It is left to the analyst to determine whether to include or exclude these values or perform sensitivity analyses.

Certain specific QC findings cannot be queried or corrected (e.g., inconsistency in assigned hdcat and caghigh). These result in the participant – or a visit - being excluded from the dataset.

In other instances where identified values cannot be queried and corrected (e.g., if the observation was recorded in a REGISTRY visit transferred into the Enroll-HD database), the variable value may be recoded with an exceptional value data code (e.g., WRONG/9996).

In response to frequently asked questions received from users of the Enroll-HD PDS, we now include a section on ‘HD onset and diagnosis’ variables in the Understand and Interpret the Data document to assist in interpretation of these variables, highlighting unexpected but plausible values and value combinations.

While substantial efforts are made to maximize data quality, researchers are encouraged to visualize the data and perform their own QC checks prior to commencing analyses.

5.2 Participant privacy and identification risk management

Enroll-HD takes great care to protect participant data and privacy.

Enroll-HD participant level data and samples are provided to the research community following three overarching principles:

  1. Informed consent and regulatory requirements: Data and samples are only shared in accordance with EU GDPR with special consideration of any Personal Health Information (defined by the HIPAA Privacy Rule), and the participant’s informed consent.

  2. Use agreements: An Enroll-HD Data Use Agreement (DUA) and/or Material Transfer Agreement (MTA) must be signed, and the terms honored by any requester.

  3. Identification risk assessment and minimization: Various methods are applied to data within each dataset to minimize identification risk, including variable suppression, transformation, and aggregation. The risk for participant identification is assessed for all participants in Enroll-HD and steps are taken to reduce the risk of identification below a predetermined threshold before data release.


To ensure the risk for participant identification is minimized, two methods are employed: 1) the “Safe Harbor” method and 2) the “Expert Determination” method.

The Safe Harbor method refers to the removal of specific identifiers that can directly identify a person in the dataset. The HIPAA Privacy Rule outlines a list of 18 variables, such as geographic subdivisions smaller than state, and other characteristics that could uniquely identify the individual. To the extent that these data are collected in the study, the data are removed from the dataset or transformed to reduce identification risk (e.g., date of birth is converted to age).

The Expert Determination method requires a qualified statistical expert to perform an analysis of the participant identification risk for all individuals in a dataset to ensure the risk is low. The methods used to make that determination and justification of the expert’s opinion must be documented and retained. As part of the Expert Determination method, the Enroll-HD Statistics team has identified additional variables that may be potentially identifying. Many of these variables are suppressed, transformed, or aggregated in dataset releases. Full details of variable availability, transformation, and aggregation – pertaining to the Enroll-HD dataset releases – are provided in the Enroll-HD Data Dictionary and the document Understand and Interpret the Data.

In addition, the probability for individual participant identification is calculated for each participant included in a dataset before release. For this purpose, a combination of potentially identifying variables (collectively referred to as a “key”) is considered. In Enroll-HD, these “key” variables are: HTT CAG size, age (baseline), race (aggregated), sex, educational level (ISCED; baseline), BMI (baseline) and region. These key variables were selected following thorough literature reviews and discussions with statistical experts. The R software package sdcMicro is used to determine individual participant identification risk based on the above variable set. Genotype unknown participants exceeding a 1% identification risk probability are excluded from the dataset; for all other participants, a 3% identification risk threshold is applied.

The individual risk of identification for a participant in Enroll-HD is defined as the probability that a participant could be correctly identified from the full Enroll-HD sample by looking at a specific combination of the key variables.
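The idea behind a key-based risk can be illustrated with a deliberately naive calculation: a participant's risk is taken as one divided by the number of participants in the sample who share the same combination of key values. This is not the sdcMicro calculation, whose risk model may differ; the function names, the record layout, and the `genotype_unknown` flag are all hypothetical.

```python
from collections import Counter


def naive_identification_risk(records, key_variables):
    """Return, per record, 1 / (number of records sharing the same key combination)."""
    keys = [tuple(record[v] for v in key_variables) for record in records]
    counts = Counter(keys)
    return [1 / counts[key] for key in keys]


def exceeds_threshold(risk, genotype_unknown):
    """Apply the thresholds quoted above: 1% for genotype unknown, 3% otherwise."""
    return risk > (0.01 if genotype_unknown else 0.03)


# Made-up records with two key variables only, for illustration.
records = [
    {"sex": "f", "region": "Europe"},
    {"sex": "f", "region": "Europe"},
    {"sex": "m", "region": "Europe"},
]
risks = naive_identification_risk(records, ["sex", "region"])
```

Under this simplified measure a participant whose key combination is unique in the sample has a risk of 1. It also shows why risk can fall as a dataset grows: more participants tend to share each key combination.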

Although some participants are excluded from a dataset due to exceeding acceptable risk thresholds, they may be included in future datasets; individual identification risk can change when additional data are added to a dataset.

An overview of the identification risk minimization process implemented for Enroll-HD dataset releases is provided in Figure 5.1.

Figure 5.1: Enroll-HD identification risk management process for dataset releases.

Chapter 6 Change Log

6.1 Overview

This document provides an overview of changes made to the structure, content, and format of the Enroll-HD Periodic Dataset (PDS) between each consecutive major (PDSX) and minor (RX) release, from the first (PDS1; R1) to the most recent (PDS6; R1). Information about files and file structure, variables (additions, exclusions, and modifications), coding systems, and quality control and identification risk protocol is provided. Sample size and visit counts by major and minor release are also documented. Individual participant value changes between major and minor releases are captured in separate documents, available on request.

The Change Log document can be consulted here: https://www.enroll-hd.org/enrollhd_documents/ENROLL-HD_ChangeLogOverviewPDS1toPDS6_v1.0_20230112.pdf.

Chapter 7 Data Dictionary

7.1 Overview

This Data Dictionary is designed to accompany all Enroll-HD dataset releases, including periodic datasets (PDS) and specified datasets (SPS). Note that Enroll-HD dataset releases encompass data on Enroll-HD participants from several sources: the Enroll-HD study, the REGISTRY study, and clinical data collected in adhoc visits outside of the aforementioned studies.

This document denotes each variable, including availability of the data in different types of datasets. A brief overview of constituent data sources, dataset structure, variable structure, and representation of special values is also provided.

We strongly encourage data users to thoroughly read the accompanying data support documentation for detailed descriptions of dataset structure, data quality control procedures and participant identification risk management, coding system descriptions, and data interpretation. Complete description of these important topics is out of scope of the current document.

The PDS Data Dictionary can be consulted at the PDS dataset explorer LITE.

Chapter 8 PDS dataset explorer LITE

8.1 Overview

The PDS Dataset Explorer is a web application designed to explore the main features of the PDS6 and to familiarize researchers with the dataset. It is organized as follows:

  • Data Orientation:

    • Dataset structure: An interactive tree-map illustrating the structure and content of each PDS file.

    • Dataset Preview: Illustrates exemplar data content within each data file (simulated data are presented for illustrative purposes only).

  • Data Dictionary: A filterable and interactive table of all variables collected in Enroll-HD, color coded by availability (e.g., included in PDS releases, available upon SRC approval, Not Available).

  • Population Characterization: A tool enabling quick and easy visualization of univariate data distributions or frequency counts, as a function of HD category.

The application can be accessed via the Enroll-HD website at https://enroll-hd.org/for-researchers/pds-data-explorer/


  1. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2022. https://www.R-project.org/

  2. Huntington Study Group. Unified Huntington’s Disease Rating Scale: Reliability and Consistency. Neuropsychiatry Movement Disorders 1996, Vol. II, No. 2, 136-142.

  3. Craufurd D, Thompson JC, Snowden JS. Behavioral changes in Huntington Disease. Neuropsychiatry Neuropsychol Behav Neurol. 2001 Oct-Dec;14(4):219-26.

  4. Folstein MF, Folstein SE, McHugh PR. “Mini-mental state”. A practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res. 1975 Nov;12(3):189-98.

  5. Zigmond AS, Snaith RP. The hospital anxiety and depression scale. Acta Psychiatr Scand. 1983 Jun;67(6):361-70.

  6. Snaith, RP. A clinical scale for the self-assessment of irritability. Brit J Psychiat. 1978; 132: 164-171.

  7. Ware JE, Kosinski M, and Keller SD. A 12-Item Short-Form Health Survey: Construction of scales and preliminary tests of reliability and validity. Medical Care, 1996; 34(3):220-233.

  8. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992 Jun; 30(6):473-83.

  9. Warner JH, Long JD, Mills JA, Langbehn DR, Ware J, Mohan A, Sampaio C. Standardizing the CAP Score in Huntington’s Disease by Predicting Age-at-Onset. J Huntingtons Dis. 2022 Apr 19. doi: 10.3233/JHD-210475.

  10. Tabrizi SJ, Schobel S, Gantman EC, et al. A biological classification of Huntington’s disease: The integrated staging system. The Lancet Neurology 2022; 21: 632–644.

  11. Breiman L. Random forests. Machine Learning 2001; 45: 5–32.

  12. Buuren S van, Groothuis-Oudshoorn K. Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 2011; 45: 1–67.

  13. Stekhoven DJ, Buehlmann P. MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics 2012; 28: 112–118.

  14. Mayer M. missRanger: Fast imputation of missing values. https://CRAN.R-project.org/package=missRanger (2021)