A prevalent theme in heritage language (HL) research is that HLs are simpler than monolingual
varieties. We focus on complexity from a comparative variationist perspective, a sociolinguistic
approach that examines variable aspects of language (“different ways of saying the same thing”).
Arguably, variable elements are harder to acquire than categorical ones, as a matrix of
frequencies must be acquired with every element, containing probabilistic information about
when each form is (more) appropriate. The relevant predictors include inter-speaker (social) and
intra-speaker (linguistic context) factors, and account for community members’ shared grammars, as well
as how we use stochastic information to recognize speakers’ group memberships (cf. Labov et
al., 2011). So, how does a matrix of frequencies for the predictors of a particular variable
compare between Heritage and Homeland speakers? How can these fairly be compared?
Comparison of variable patterns in Heritage and Homeland Cantonese illustrates an approach that
responds to these questions. We revise analyses conducted previously of two morphosyntactic
variables: prodrop and classifiers (cf. Nagy, 2015; Nagy & Lo, 2019). Data is extracted from
spontaneous speech samples from the Heritage Language Variation and Change Project (Nagy,
2011; Nagy, 2009). We apply a bootstrap procedure, in which we run each regression model (of
the set of potential predictors of the forms selected in each extracted token) 10,000 times on
equal-size samples of the dataset, created by random sampling with replacement.
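The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual analysis: the token data are synthetic, the helper names (`log_odds_ratio`, `bootstrap_estimates`, `make_tokens`) are hypothetical, and a log odds ratio for a single binary predictor stands in for the full multivariate regression. The common sample size is assumed here to be the size of the smaller dataset.

```python
import math
import random

def log_odds_ratio(tokens):
    """Log odds ratio of response=1 for predictor=1 vs. predictor=0.

    Stand-in for a regression fit: with one binary predictor this matches
    the logistic-regression coefficient, apart from the 0.5 continuity
    correction added to every cell to avoid taking log(0).
    """
    counts = {(p, r): 0.5 for p in (0, 1) for r in (0, 1)}
    for predictor, response in tokens:
        counts[(predictor, response)] += 1
    return math.log(
        (counts[(1, 1)] * counts[(0, 0)]) / (counts[(1, 0)] * counts[(0, 1)])
    )

def bootstrap_estimates(tokens, sample_size, n_reps=10_000, seed=0):
    """Re-estimate on n_reps samples drawn with replacement."""
    rng = random.Random(seed)
    return [
        log_odds_ratio(rng.choices(tokens, k=sample_size))
        for _ in range(n_reps)
    ]

def make_tokens(n, p_if_1, p_if_0, seed):
    """Synthetic (predictor, response) tokens, for illustration only."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n):
        predictor = rng.randint(0, 1)
        p = p_if_1 if predictor else p_if_0
        tokens.append((predictor, int(rng.random() < p)))
    return tokens

# Two datasets of unequal size with the same underlying pattern.
homeland = make_tokens(2000, 0.6, 0.4, seed=1)
heritage = make_tokens(500, 0.6, 0.4, seed=2)

# Assumption: both groups are resampled to the smaller dataset's size.
sample_size = min(len(homeland), len(heritage))

# 1,000 repetitions keep the demo fast; the abstract reports 10,000.
homeland_boot = bootstrap_estimates(homeland, sample_size, n_reps=1000, seed=3)
heritage_boot = bootstrap_estimates(heritage, sample_size, n_reps=1000, seed=4)
```

Each list then holds one estimate per repetition, all based on samples of the same token count.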
Prodrop, variable presence of a subject pronoun within a finite clause, is influenced by
predictors such as grammatical person and clause type, and is not a change in progress. In
contrast, for classifiers we examine a pattern in which heritage speakers innovate, using the generic
classifier go3 個 to mark number, a function it does not have in Hong Kong Cantonese. Predictors for
classifiers include number, NP-modifiers, and semantic characteristics of the noun.
Multivariate regression analyses reveal the predictors (and levels within each predictor) that best
model the variability in a sample of utterances containing the variable feature (the matrix of
frequencies). The variationist field lacks an established methodology for comparing models of
different varieties to see if they differ. One noted weakness is that different-sized samples are
often compared, which yields different levels of statistical significance even if the populations’
patterns are the same. That is, when a smaller heritage sample is compared to a bigger homeland
sample, the heritage variety may appear less constrained (or vice versa). Bootstrapping makes
the datasets more comparable by setting them to the same token size, ameliorating some issues
associated with unequal-sized datasets frequent in studies of minority or endangered varieties.
Our bootstrapping produces output consisting of 10,000 estimates for each predictor value, one
for each time the model was run. The size and importance of each predictor’s effect are shown by
the bootstrap confidence interval (the range within which 95% of the estimates for a given
parameter fall). If the confidence interval includes 0, we are much less certain that the predictor
has an effect than if the interval excludes 0. To compare two models (e.g., homeland vs. heritage),
we calculate the difference between the two models’ estimates at each step of the bootstrap
procedure, then examine the confidence interval of these differences to determine whether the
estimates differ.
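A sketch of the interval and comparison logic, assuming a simple percentile interval (the abstract does not specify which bootstrap interval is used). The function names are hypothetical, and the estimates are simulated with `random.gauss` purely for illustration.

```python
import math
import random

def percentile_interval(values, level=0.95):
    """Range containing (at least) the central `level` share of the values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1.0 - tail) * last)]
    return lower, upper

def difference_interval(estimates_a, estimates_b, level=0.95):
    """Interval for the per-repetition difference between two models."""
    diffs = [a - b for a, b in zip(estimates_a, estimates_b)]
    return percentile_interval(diffs, level)

def excludes_zero(interval):
    """True when 0 lies outside the interval."""
    lower, upper = interval
    return lower > 0 or upper < 0

# Simulated bootstrap output for one predictor in each variety (not real data).
rng = random.Random(0)
homeland_est = [rng.gauss(0.80, 0.10) for _ in range(10_000)]
heritage_est = [rng.gauss(0.75, 0.10) for _ in range(10_000)]

homeland_ci = percentile_interval(homeland_est)
heritage_ci = percentile_interval(heritage_est)
diff_ci = difference_interval(homeland_est, heritage_est)

print("homeland effect excludes 0:", excludes_zero(homeland_ci))
print("heritage effect excludes 0:", excludes_zero(heritage_ci))
print("varieties differ:", excludes_zero(diff_ci))
```

Pairing the estimates by repetition, as `zip` does here, mirrors the description of taking the difference "at each step" of the bootstrap.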
We confirm that heritage and homeland speakers both exhibit systematically variable use of
prodrop and classifiers, and also find that the two groups’ grammars are similarly complex:
the matrices of (significant) frequencies are the same size. This approach allows us to
consider not just which surface forms constitute the heritage vs. homeland varieties, but also the
complexity of the decision-making process the speakers apply in selecting among the forms.