A prevalent theme in heritage language (HL) research is that HLs are simpler than monolingual varieties. We focus on complexity from a comparative variationist perspective, a sociolinguistic approach that examines variable aspects of language (“different ways of saying the same thing”). Arguably, variable elements are harder to acquire than categorical ones, as a matrix of frequencies must be acquired with every element, containing probabilistic information about when each form is (more) appropriate. This matrix covers both inter-speaker (social) and intra-speaker (linguistic context) predictors, and it accounts for community members’ shared grammars, as well as for how we use stochastic information to recognize speakers’ group memberships (cf. Labov et al., 2011). So, how does the matrix of frequencies for the predictors of a particular variable compare between Heritage and Homeland speakers? How can the two be compared fairly?

A comparison of variable patterns in Heritage and Homeland Cantonese illustrates an approach that responds to these questions. We revise previously conducted analyses of two morphosyntactic variables: prodrop and classifiers (cf. Nagy, 2015; Nagy & Lo, 2019). Data are extracted from spontaneous speech samples from the Heritage Language Variation and Change Project (Nagy, 2009; Nagy, 2011). We apply a bootstrap procedure in which we run each regression model (of the set of potential predictors of the form selected in each extracted token) 10,000 times on equal-size samples of the dataset, created by random sampling with replacement.
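The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the token data are invented, the common sample size is assumed to be the size of the smaller dataset, and the full multivariate regression is replaced by a single binary predictor, for which the regression coefficient reduces to a log odds ratio. All function and variable names are hypothetical.

```python
import math
import random


def log_odds_ratio(sample):
    """Log odds of the variant (outcome 1) at predictor level 1 vs. level 0.

    Stand-in for a regression fit: with one binary predictor, the
    logistic-regression coefficient is this log odds ratio. Each cell
    starts at 0.5 (Haldane-Anscombe correction) so that an empty cell
    in a resample cannot cause a division by zero.
    """
    counts = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}
    for level, outcome in sample:
        counts[(level, outcome)] += 1
    odds_level_1 = counts[(1, 1)] / counts[(1, 0)]
    odds_level_0 = counts[(0, 1)] / counts[(0, 0)]
    return math.log(odds_level_1 / odds_level_0)


def bootstrap_estimates(tokens, n_tokens, n_boot=10_000, seed=0):
    """Re-estimate the effect on n_boot resamples of n_tokens tokens each,
    drawn at random with replacement."""
    rng = random.Random(seed)
    return [log_odds_ratio(rng.choices(tokens, k=n_tokens))
            for _ in range(n_boot)]


# Invented toy data: (predictor level, outcome) pairs of unequal size.
heritage = [(1, 1)] * 30 + [(1, 0)] * 20 + [(0, 1)] * 15 + [(0, 0)] * 35
homeland = [(1, 1)] * 120 + [(1, 0)] * 80 + [(0, 1)] * 60 + [(0, 0)] * 140

# Assumption: both groups are resampled to the size of the smaller dataset.
n = min(len(heritage), len(homeland))

# 1,000 replicates keep the sketch fast; the study reports 10,000.
heritage_boot = bootstrap_estimates(heritage, n, n_boot=1000, seed=1)
homeland_boot = bootstrap_estimates(homeland, n, n_boot=1000, seed=2)
```

Because every replicate for both groups contains the same number of tokens, differences in the spread of the two sets of estimates are not an artifact of one dataset being larger than the other.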

Prodrop, the variable presence of a subject pronoun within a finite clause, is influenced by predictors such as grammatical person and clause type, and is not a change in progress. For classifiers, in contrast, we examine a pattern in which heritage speakers innovate toward use of the generic classifier go3 個 to mark number in a way not found in Hong Kong Cantonese. Predictors for classifiers include number, NP-modifiers, and semantic characteristics of the noun.

Multivariate regression analyses reveal the predictors (and levels within each predictor) that best model the variability in a sample of utterances containing the variable feature (the matrix of frequencies). The variationist field lacks an established methodology for comparing models of different varieties to see if they differ. One noted weakness is that different-sized samples are often compared, which yields different levels of statistical significance even if the populations’ patterns are the same. That is, when a smaller heritage sample is compared to a bigger homeland sample, the heritage variety may appear less constrained (or vice versa). Bootstrapping makes the datasets more comparable by setting them to the same token size, ameliorating some of the issues associated with the unequal-sized datasets that are frequent in studies of minority or endangered varieties.

Our bootstrapping produces output consisting of 10,000 estimates for each predictor value, one for each time the model was run. The size and importance of each predictor’s effect is shown by the bootstrap confidence interval (the range within which the estimate fell in 95% of the runs). If the confidence interval includes 0, we are much less certain that an effect for that predictor exists than if it excludes 0. To compare two models (e.g., homeland vs. heritage), we calculate the difference between the two models’ estimates at each step of the bootstrap procedure, then examine the confidence interval of these differences to determine whether the estimates differ.
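The interval and the model comparison described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes a simple percentile interval, and the two series of estimates are simulated from normal distributions purely as placeholders for real bootstrap output. All names are hypothetical.

```python
import math
import random


def percentile_interval(values, level=0.95):
    """Percentile bootstrap interval: bounds enclosing the central
    `level` share of the sorted estimates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1.0 - tail) * last)]
    return lower, upper


def compare_models(estimates_a, estimates_b, level=0.95):
    """Interval for the step-by-step difference between two bootstrap runs.

    Returns (lower, upper, differs), where `differs` is True only when
    the interval of the differences excludes 0.
    """
    if len(estimates_a) != len(estimates_b):
        raise ValueError("both runs need the same number of bootstrap steps")
    differences = [a - b for a, b in zip(estimates_a, estimates_b)]
    lower, upper = percentile_interval(differences, level)
    return lower, upper, not (lower <= 0.0 <= upper)


# Placeholder estimates for one predictor value (simulated, not real data).
rng = random.Random(42)
heritage_estimates = [rng.gauss(0.9, 0.3) for _ in range(2000)]
homeland_estimates = [rng.gauss(0.8, 0.3) for _ in range(2000)]

heritage_interval = percentile_interval(heritage_estimates)
lower, upper, differs = compare_models(heritage_estimates, homeland_estimates)
print(f"heritage 95% interval: {heritage_interval}")
print(f"difference interval: ({lower:.2f}, {upper:.2f}); differs: {differs}")
```

Taking the difference at each step, rather than checking whether the two groups' separate intervals overlap, yields a single interval that directly addresses whether the homeland and heritage estimates differ.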

We confirm that heritage and homeland speakers both exhibit systematically variable use of prodrop and classifiers, and we also learn that the two groups’ grammars are similar in complexity: the matrices of (significant) frequencies are the same size. This approach allows us to consider not just which surface forms constitute the heritage vs. homeland varieties, but also the complexity of the decision-making process that speakers apply in selecting among the forms.