Tenth Heritage Language Research Institute

Heritage Language Education and Research: Crossing New Frontiers

Building an audio-aligned parsed corpus of bilingual Russian child and child directed speech (BiRCh)

by Irina Dubinina & Sophia Malamud (Brandeis University)

At HLI 2016, we reported on the first stages of a project that aims to build a corpus of bilingual child and child-directed speech. Here we report on new developments based on a year of research and pilot transcription and annotation. Data collection has grown to include more families in the U.S. and Germany, and to add families from Ukraine as the closest dialectal match for children of Russian-speaking immigrants from Ukraine. Some families will record on a more continuous schedule in order to increase the speech sample above 1% of the current rate (cf. Behrens, 2008). We standardized the spelling of pause fillers, explored ways of marking so-called editing phrases and pause fillers (cf. Ginsburg et al., 2014), and adopted the typology of disfluencies in Hindle (1983), which is shared with other recent and in-progress audio-aligned parsed corpora (Appalachian English, NYC English, AAVE). Significant new developments include adding sociolinguistic information on participating families (employing the Utrecht Bilingual Language Exposure Calculator), pilot morphological annotation, and work on creating a hybrid syntactic annotation system that includes both the constituency information of the Penn-Helsinki standard and the dependency information of SynTagRus and the Russian National Corpus (Lưu, Malamud, & Xue, 2016).