Age and CAG length in HD Data Analysis

How to account for age and CAG in statistical analysis of HD data.


HD develops over time with signs and symptoms typically appearing in mid-life (Ross et al. 2014). The timing of HD signs and symptoms is strongly related to CAG length (Figure 1), with longer lengths being associated with younger age at onset (Lee et al. 2012). Consequently, age and CAG length are key considerations in almost every HD analysis. This article discusses several age and CAG-related issues that a researcher might want to consider before starting analysis of observational HD datasets, such as Enroll-HD (Landwehrmeyer et al. 2016).

The relationship between age and CAG is a consideration in almost all HD analyses, but the details of how age and CAG are treated in statistical models depend on the context. Here we focus on two contexts: cross-sectional analysis and longitudinal analysis.

Figure 1. The association between CAG length and age at motor diagnosis.

Cross-sectional Analysis

Cross-sectional analysis uses variables that are collected at a single time point or visit. A commonly used single time point is the visit at study entry (i.e., baseline visit).

When there are multiple time points per participant, as in the Enroll-HD database, all the visits except the one at the time point of interest are ignored. Though some data are not used, what is gained in a cross-sectional analysis is simplicity. Most of the standard statistical methods, such as conventional multiple regression, are intended for cross-sectional analysis.
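As a minimal sketch of reducing multi-visit data to one visit per participant, the following keeps only the earliest visit for each person. The records and field names (`subjid`, `visit_day`, `tms`) are hypothetical and are not the actual Enroll-HD variable names.

```python
from typing import Dict, List

# Hypothetical long-format records: one dict per visit.
# Field names are illustrative only, not real Enroll-HD variables.
visits: List[dict] = [
    {"subjid": "A", "visit_day": 0, "tms": 4},
    {"subjid": "A", "visit_day": 370, "tms": 6},
    {"subjid": "B", "visit_day": 0, "tms": 20},
    {"subjid": "C", "visit_day": 365, "tms": 11},
    {"subjid": "C", "visit_day": 5, "tms": 9},
]


def baseline_visits(records: List[dict]) -> Dict[str, dict]:
    """Keep only the earliest visit for each participant."""
    first: Dict[str, dict] = {}
    for rec in records:
        sid = rec["subjid"]
        if sid not in first or rec["visit_day"] < first[sid]["visit_day"]:
            first[sid] = rec
    return first


baseline = baseline_visits(visits)
for sid in sorted(baseline):
    print(sid, baseline[sid]["visit_day"], baseline[sid]["tms"])
```

The same pattern applies to any other time point of interest by changing the selection rule inside the loop.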

Focusing on a single time point such as study entry avoids the problem of dropout over time, which often means the analysis maximizes the number of observations (participants). Cross-sectional analysis is also appropriate for examining long-term effects of HD (contingent on study sample characteristics). The progression of HD is relatively slow, with an average of 15 years from motor onset to death (Keum et al. 2017), so the time elapsed for HD gene expansion carriers (HDGECs) up to study entry is often much greater than the time that people are observed in the study. This means that information about long-term progression is often gleaned from variables measured at study entry and less so from the short-term change within the study.

Recent research suggests that the CAG-repeat length is dynamic, continuing to expand at the cell level, and that this expansion eventually triggers a mechanism that causes cell death (Hong et al. 2020). Cross-sectional analysis is relevant to the study of somatic expansion: because the only comparison to be made is between people, differences in the magnitude and duration of exposure to the toxic effects of mHTT must be accounted for in such comparisons. People enter a study with a variety of exposure times as indexed by age at entry, and a variety of disease magnitudes as indexed by inherited CAG length. It is crucial to account for these differences among people to avoid confounding and provide a level playing field for the comparison of variables of interest.

One common goal of a cross-sectional analysis is to examine the extent to which a variable is related to disease progression. For example, in the search for new fluid biomarkers (e.g., a substance measured in CSF), it is common to examine how the levels of a biomarker vary by age and CAG length at study entry (Leoni et al. 2013). Age and CAG length are used as indicators of progression, and are entered into the statistical models in various ways. The interaction of age and CAG length is important for indexing progression (Langbehn, Hayden, and Paulsen 2010), and so the product term, CAP, is often entered as a predictor (as in a multiple regression) along with the main effects (the individual variables).

CAG-Age Product (CAP)

To simplify modeling, the combined effect of age and CAG has been captured in the CAG-Age product (CAP) (Penney et al. 1997; Langbehn, Hayden, and Paulsen 2010; Zhang et al. 2011). CAP has the general form of CAP = (Age at Study Entry) ⋅  (CAG – L) / K, where L  is a centering constant and K is a scaling constant.

Based on the extensive analysis of Warner et al. (2020), the preferred CAP has L = 30 and K = 6.49, giving CAP = (Age at Study Entry) ⋅ (CAG – 30) / 6.49. This formula is standardized such that CAP = 100 at the expected age of diagnosis. However, different centering and scaling constants have been and are being used in various analyses. Specifically, the CAP developed with the PREDICT-HD database by Zhang et al. (2011) uses L = 33.66 and K = 1, so that CAP = (Age at Study Entry) ⋅ (CAG – 33.66). The CAP developed by Penney et al. (1997) uses L = 35.5 and K = 1, so that CAP = (Age at Study Entry) ⋅ (CAG – 35.5).
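The three CAP variants differ only in their constants, so a single function can cover them. The sketch below is illustrative: the function names and the example participant (age 45, CAG 42) are assumptions, while the constants are those quoted above.

```python
def cap_score(age: float, cag: int,
              centering: float = 30.0, scaling: float = 6.49) -> float:
    """CAP = age * (CAG - L) / K.

    Defaults are the Warner et al. (2020) constants (L = 30, K = 6.49).
    """
    return age * (cag - centering) / scaling


def age_at_cap(cap: float, cag: int,
               centering: float = 30.0, scaling: float = 6.49) -> float:
    """Invert the CAP formula: age at which a CAG length reaches `cap`."""
    return cap * scaling / (cag - centering)


age, cag = 45.0, 42  # hypothetical participant

warner = cap_score(age, cag)                                # L = 30, K = 6.49
zhang = cap_score(age, cag, centering=33.66, scaling=1.0)   # Zhang et al. (2011)
penney = cap_score(age, cag, centering=35.5, scaling=1.0)   # Penney et al. (1997)

print(round(warner, 1), round(zhang, 1), round(penney, 1))
print(round(age_at_cap(100.0, cag), 1))
```

Because the Warner et al. version is scaled so that CAP = 100 at the expected age of diagnosis, inverting the formula at CAP = 100 gives that expected age for a given CAG length (roughly 54 years for CAG = 42). The three versions are on different scales, so their values should not be compared directly.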

CAP’s advantage is that it is a single progression score, and it can be included as a predictor in a multiple regression model along with adjustment variables, such as sex, that the analyst deems important to control. For example, an analyst might estimate the regression coefficient of CAP predicting a fluid biomarker controlling for sex. A significant CAP coefficient in this example suggests a statistically reliable relationship between progression and the biomarker adjusting for being female or male.
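The regression described above can be sketched on simulated data. Everything here is an assumption made for illustration: the data are randomly generated, the "true" coefficients (0.5 for CAP, 2.0 for sex) are arbitrary, and the small least-squares solver stands in for a statistical package, which would also report the standard errors and p-values needed to judge significance.

```python
import random
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X: List[List[float]], y: List[float]) -> List[float]:
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Simulated (not real) study-entry data: biomarker depends on CAP and sex.
rng = random.Random(1)
X, y = [], []
for _ in range(200):
    cap = rng.uniform(60, 140)
    female = float(rng.choice([0, 1]))
    X.append([1.0, cap, female])          # intercept, CAP, sex indicator
    y.append(10.0 + 0.5 * cap + 2.0 * female + rng.gauss(0, 1))

intercept, b_cap, b_sex = ols(X, y)
print(round(b_cap, 3), round(b_sex, 3))
```

Here `b_cap` is the estimated change in the biomarker per unit of CAP, holding sex fixed.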

Using CAP as a continuous score in the example above is applicable only when the participants have an expanded CAG tract (primarily 40 or more repeats). CAP is not defined for people in the normal CAG repeat range, for whom it has no meaning as a progression index. Nevertheless, there are several published analyses of HD in which people who have an expanded CAG are compared to those who do not (e.g., non-affected family members or community controls). One reason for this comparison is to determine the timing of early signs and symptoms of HD (Paulsen et al. 2014; Tabrizi et al. 2013).

The duration of the illness means that manifest individuals may be grouped into CAP score categories that reflect early, mid, and late disease stages.

For example, Zhang et al. (2011) utilize the following thresholds to categorize disease stages using their version of CAP: Early = <290; Mid = 290-367; Late = >367. 

When using the preferred Warner et al. (2020) CAP (L = 30, K = 6.49), the analyst can use the first and third quartiles of the Enroll-HD distribution for fully penetrant participants (CAG ≥ 40): the 25th and 75th percentiles are 88 and 119, respectively (Enroll-HD PDS4; release v2018-10-R3). The groups would therefore be defined as <88, 88-119, and >119. Additional work needs to be done to establish optimal cut points.
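Assigning CAP groups from a pair of cut points can be sketched as follows. The function name and the group labels are assumptions. The default cut points are the Enroll-HD quartiles quoted above, and the Zhang et al. (2011) thresholds can be passed instead when their version of CAP is used.

```python
def cap_group(cap: float, low: float = 88.0, high: float = 119.0) -> str:
    """Classify a CAP score as 'early', 'mid', or 'late'.

    Scores below `low` are early, scores above `high` are late, and
    scores from `low` to `high` inclusive are mid.
    """
    if cap < low:
        return "early"
    if cap <= high:
        return "mid"
    return "late"


# Warner et al. (2020) CAP with the Enroll-HD quartile cut points.
print(cap_group(83.2))

# Zhang et al. (2011) CAP with their thresholds (290 and 367).
print(cap_group(375.3, low=290.0, high=367.0))
```

The cut points must match the CAP version: applying 88 and 119 to a score computed with the Zhang et al. constants would misclassify nearly everyone as late.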

Longitudinal Analysis

Most HD observational databases have repeat visits for at least a portion of participants; longitudinal data availability in Enroll-HD is illustrated in Figure 2. When the same person is measured over time at recurring visits, we refer to their data as longitudinal.

Longitudinal analysis has the distinct advantage over cross-sectional analysis of examining how processes evolve over time on a within-participant basis. The typical cross-sectional analysis is retrospective regarding progression in that it can only infer the results of progression up until the time point of interest (e.g., study entry). A longitudinal analysis is prospective, as we can examine progression as it is unfolding over time. Longitudinal data are considered crucial for providing evidence in support of cause and effect, which is why pivotal clinical trials are longitudinal in nature (see “Using Observational Data to Inform Clinical Trial Design” for further information). Furthermore, a longitudinal analysis subsumes a cross-sectional analysis because the first visit of the longitudinal trajectory is the visit at study entry. Therefore, all the results of the cross-sectional analysis are available plus the unique prospective results of the longitudinal analysis.

Figure 2. Longitudinal data availability in Enroll-HD PDS5 (release 2020-10-R1). Participant counts by maximum number of Enroll-HD visits (baseline and follow-up visits only; unscheduled visits and phone contacts excluded). Full sample represented (N = 21,116; Missing N = 0).

In HD research, longitudinal analysis has been used to describe the natural history of the disease, especially the pattern (or trajectory) of key clinical variables over time (Langbehn et al. 2019; Long et al. 2014; Paulsen, Smith, and Long 2013). Longitudinal analysis has also been used to examine the timing of landmark events, such as the age at motor diagnosis for different CAG expansions (Long and Mills 2018).

Along with the added prospective insight of a longitudinal analysis comes added complexity. Repeated observations from the same person will be correlated and the number of observations will vary due to people joining the study at different times in history (distant versus recent enrollment). These characteristics need to be accounted for with advanced statistical methods, such as linear mixed models for longitudinal data (Verbeke and Molenberghs 2009).

Similar to a cross-sectional analysis, a longitudinal analysis can use continuous CAP or CAP groups. For example, an analyst might want to examine how a fluid biomarker changes over time based on CAP at study entry. The cross-sectional retrospective information about the biomarker and progression can be examined with an intercept analysis (starting-point analysis), which focuses on the first visit at study entry. In addition, prospective information about the biomarker and progression can be learned with a slope analysis (change analysis), which focuses on the change over the repeated visits.
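One way to make the intercept and slope analyses concrete is a linear mixed model with random intercepts and slopes. The specification below is a sketch under stated assumptions rather than the model used in any cited study: y_ij is the biomarker for participant i at visit j, t_ij is the time of that visit (for example, years since study entry), and CAP_i is CAP at study entry.

```latex
\begin{aligned}
y_{ij} &= \left(\beta_0 + \beta_1\,\mathrm{CAP}_i + b_{0i}\right)
        + \left(\beta_2 + \beta_3\,\mathrm{CAP}_i + b_{1i}\right) t_{ij}
        + \varepsilon_{ij}, \\
(b_{0i},\, b_{1i})^{\top} &\sim N(\mathbf{0},\, \mathbf{G}),
\qquad \varepsilon_{ij} \sim N(0,\, \sigma^2).
\end{aligned}
```

In this form, β1 carries the intercept analysis (how the biomarker at study entry varies with CAP) and β3 carries the slope analysis (how the rate of change varies with CAP). The random effects b_0i and b_1i absorb the correlation among repeated observations from the same person.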

The selection of a time metric in longitudinal analysis is important. Various studies have shown that the trajectory of many HD clinical variables over the entire adult lifespan is not linear.  Figure 3 shows an example of the composite UHDRS (cUHDRS) tracked over time. As another example, the mean motor signs of a cohort with CAG = 42 will start at or very near 0 (normal) when people are in their early 20s, then slightly increase over the next several years, and then sharply rise just prior to motor diagnosis (Langbehn et al. 2019; Long et al. 2014; Paulsen et al. 2014). If age is used as the time metric, then methods to deal with non-linear trajectories should be used, such as polynomials of age (Long and Ryoo 2010) or spline terms (Long and Mills 2018).
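A simple way to let a regression bend with age is to add truncated-linear spline terms to the predictors. The sketch below is illustrative only: the knot locations (ages 30 and 45) are hypothetical, and this basis is not claimed to be the one used in the cited work.

```python
from typing import List, Sequence


def linear_spline_terms(age: float, knots: Sequence[float]) -> List[float]:
    """Return [age, (age - k1)+, (age - k2)+, ...].

    Each (age - k)+ term is zero below its knot and rises linearly above
    it, so its coefficient is the change in slope at that knot.
    """
    return [float(age)] + [max(0.0, age - k) for k in knots]


knots = [30.0, 45.0]  # hypothetical knot ages
for age in (25.0, 40.0, 50.0):
    print(age, linear_spline_terms(age, knots))
```

Entered as predictors in a regression or mixed model, these columns give a trajectory that is linear between knots but can change slope at each one. Polynomial terms (age squared, and so on) are an alternative that produces a smooth curve instead.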

Figure 3. Change in composite UHDRS (cUHDRS) scores over time in HDGECs and healthy control individuals. Data derived from Enroll-HD PDS4; release v2018-10-R3.

Interestingly, when change is examined for CAP or CAP groups, it is often sufficient to use a straight-line model. Recall that the early-mid-late CAP groups partition the CAP range. Within each CAP partition, the change over a few years is relatively linear. So each CAP group can be treated as a linear piece, and when all the pieces are concatenated side-to-side the change over all the stages will be non-linear, but the change within one stage will be linear.

In longitudinal analysis with CAP or CAP groups it is recommended that time since study entry (in years or months) be used as the time metric. Time 0 is the visit at entry, which acknowledges that CAP accounts for progression up to study entry. The progression examined in the longitudinal analysis is only the progression observed during the study and not progression from birth.
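Computing time since study entry from visit dates can be sketched as below. The dates are made up, and dividing by 365.25 days per year is one common convention rather than a requirement.

```python
from datetime import date
from typing import List


def years_since_entry(visit_dates: List[date]) -> List[float]:
    """Time metric for one participant: years elapsed since the first visit.

    The earliest date is treated as study entry (time 0); the output keeps
    the same order as the input.
    """
    entry = min(visit_dates)
    return [(d - entry).days / 365.25 for d in visit_dates]


# Hypothetical visit dates for a single participant.
dates = [date(2019, 1, 1), date(2020, 1, 1), date(2021, 1, 1)]
times = years_since_entry(dates)
print([round(t, 2) for t in times])
```

The entry visit always maps to 0, so the model intercept refers to status at study entry.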

Finally, the analysis of the timing of landmark events often relies on using a particular subset of participants, such as a subset who has not yet received a motor diagnosis. Survival analysis is often used to examine whether the duration from study entry to a landmark event such as motor diagnosis can be predicted by CAP or other variables measured at study entry (Long and Paulsen 2015; Long et al. 2017).

The information used in a survival analysis is the time of the event (or, for those who do not experience the event, the last recorded time in the study) and the predictor variables at study entry. Though all core variables are collected at all visits, this additional information is often not used. In addition, participants who have already had the event of interest (such as motor diagnosis) before enrolling in the study are usually excluded from the analysis. Such filtering may be justified if people and/or observations are excluded in a random fashion so that the remaining information is representative of the omitted information. But there are scenarios in which the filtering can lead to bias in results. Statistical methods to maximize the use of all the available data continue to be developed (see Long and Mills 2018), and the analyst is encouraged to think through the implications of any filtering of the database.
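The data preparation for such a survival analysis can be sketched as follows. The records and field names are hypothetical. Times are in years since study entry, a diagnosis time of `None` means no diagnosis was recorded, and a diagnosis time of zero or less means the participant was already diagnosed at entry.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical participants; dx_years and last_years are years since entry.
participants: List[dict] = [
    {"id": "A", "cap": 95.0, "dx_years": 2.5, "last_years": 4.0},
    {"id": "B", "cap": 70.0, "dx_years": None, "last_years": 3.0},
    {"id": "C", "cap": 120.0, "dx_years": -1.0, "last_years": 5.0},
]


def to_survival(p: dict) -> Optional[Tuple[float, int]]:
    """Return (time, event) for one participant, or None if excluded.

    event = 1: motor diagnosis observed at `time`.
    event = 0: censored at the last recorded visit.
    None:      diagnosed at or before study entry, so filtered out.
    """
    dx = p["dx_years"]
    if dx is not None and dx <= 0:
        return None
    if dx is None:
        return (p["last_years"], 0)
    return (dx, 1)


rows: Dict[str, Optional[Tuple[float, int]]] = {
    p["id"]: to_survival(p) for p in participants
}
analysis_set = {k: v for k, v in rows.items() if v is not None}
excluded = [k for k, v in rows.items() if v is None]

print(analysis_set)
print("excluded:", excluded)
```

Keeping the list of excluded participants makes it easier to check whether those filtered out differ systematically from those retained, which is the source of the potential bias noted above.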


Hong, P. E., M. E. MacDonald, V. C. Wheeler, L. Jones, P. Holmans, M. Orth, D. G. Monckton, et al. 2020. “Huntington’s Disease Pathogenesis: Two Sequential Components.” Journal of Huntington’s Disease.

Keum, J. W., A. Shin, T. Gillis, J. S. Mysore, K. A. Elneel, D. Lucente, T. Hadzi, et al. 2017. “The HTT CAG-Expansion Mutation Determines Age at Death but Not Disease Duration in Huntington Disease.” The American Journal of Human Genetics 98: 287–98.

Landwehrmeyer, B. G., C. Fitzer-Attas, J. Giuliano, et al. 2016. “Data Analytics from Enroll-HD, a Global Clinical Research Platform for Huntington’s Disease.” Movement Disorders Clinical Practice 4: 212–24.

Langbehn, D. R., M. R. Hayden, and J. S. Paulsen. 2010. “CAG-Repeat Length and the Age of Onset in Huntington Disease (HD): A Review and Validation Study of Statistical Approaches.” American Journal of Medical Genetics, Part B 153: 397–408.

Langbehn, D. R., J. C. Stout, S. Gregory, J. A. Mills, A. Durr, B. R. Leavitt, R. A. C. Roos, et al. 2019. “Association of CAG Repeats with Long-Term Progression in Huntington Disease.” JAMA Neurology 76: 1375–85.

Lee, J. M., E. M. Ramos, J. H. Lee, T. Gillis, J. S. Mysore, M. R. Hayden, S. C. Warby, et al. 2012. “CAG Repeat Expansion in Huntington Disease Determines Age at Onset in a Fully Dominant Fashion.” Neurology 78: 690–95.

Leoni, V., J. D. Long, J. A. Mills, S. Di Donato, and J. S. Paulsen. 2013. “Plasma 24S-Hydroxycholesterol Correlation with Markers of Huntington Disease Progression.” Neurobiology of Disease 55: 37–43.

Long, J. D., and J. A. Mills. 2018. “Joint Modeling of Multivariate Longitudinal Data and Survival Data in Several Observational Studies of Huntington’s Disease.” BMC Medical Research Methodology 18: 138–53.

Long, J. D., J. A. Mills, B. R. Leavitt, A. Durr, R. A. Roos, J. C. Stout, R. Reilmann, et al. 2017. “Survival Endpoints for Huntington’s Disease Trials Prior to a Motor Diagnosis.” JAMA Neurology 74: 1–9.

Long, J. D., and J. S. Paulsen. 2015. “Multivariate Prediction of Motor Diagnosis in Huntington Disease: 12 Years of PREDICT-HD.” Movement Disorders 12: 1664–72.

Long, J. D., J. S. Paulsen, K. Marder, Y. Zhang, J. Kim, and J. A. Mills. 2014. “Tracking Motor Impairments in the Progression of Huntington’s Disease.” Movement Disorders 29: 311–19.

Long, J. D., and J. Ryoo. 2010. “Using Fractional Polynomials to Model Non-Linear Trends in Longitudinal Data.” British Journal of Mathematical and Statistical Psychology 63: 177–203.

Paulsen, J. S., J. D. Long, C. A. Ross, D. L. Harrington, C. J. Erwin, J. K. Williams, H. J. Westervelt, et al. 2014. “Prediction of Manifest Huntington’s Disease with Clinical and Imaging Measures: A Prospective Observational Study.” Lancet Neurology 13: 1193–1201.

Paulsen, J. S., M. M. Smith, and J. D. Long. 2013. “Cognitive Decline in Prodromal Huntington Disease: Implications for Clinical Trials.” Journal of Neurology, Neurosurgery, and Psychiatry 84: 1233–39.

Penney, J. B., J. P. Vonsattel, M. E. MacDonald, J. F. Gusella, and R. H. Myers. 1997. “CAG Repeat Number Governs the Development Rate of Pathology in Huntington’s Disease.” Annals of Neurology 41: 689–92.

Ross, C. A., E. H. Aylward, E. J. Wild, D. R. Langbehn, J. D. Long, J. H. Warner, R. I. Scahill, et al. 2014. “Huntington Disease: Natural History, Biomarkers and Prospects for Therapeutics.” Nature Reviews Neurology 10: 204–16.

Tabrizi, S. J., R. I. Scahill, G. Owen, A. Durr, B. R. Leavitt, R. A. Roos, B. Borowsky, et al. 2013. “Predictors of Phenotypic Progression and Disease Onset in Premanifest and Early-Stage Huntington’s Disease in the TRACK-HD Study: Analysis of 36-Month Observational Data.” Lancet Neurology 12: 637–49.

Verbeke, G., and G. Molenberghs. 2009. Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.

Warner, J. H., J. D. Long, J. A. Mills, D. R. Langbehn, J. Ware, A. Mohan, and C. Sampaio. 2020. “Standardizing the CAP Score in Huntington’s Disease I: Predicting Age-at-Onset.”

Zhang, Y., J. D. Long, J. A. Mills, J. H. Warner, W. Lu, and J. S. Paulsen. 2011. “Indexing Disease Progression at Study Entry with Individuals at-Risk for Huntington Disease.” American Journal of Medical Genetics Part B Neuropsychiatric Genetics 156: 751–63.