This PhD thesis presents the development of a prosody system for European Portuguese (EP) for text-to-speech (TTS) applications. TTS systems automatically utter a text and consist of a sequence of modules. These modules implement the pre-processing of the input text, the phonetic transcription, and the suprasegmental processing, which imposes prosodic patterns. Prosody conveys communicative intention and gives the uttered speech some naturalness. The prosodic features are timing, characterized by segmental durations and pauses; intonation, characterized by the fundamental frequency (F0) curve; and intensity, characterized by the intensity curve.
The preparatory work that was fundamental for modelling and testing purposes is presented first. It starts with a preliminary study of the stressed syllable, which identifies the range of variation of the F0, duration and intensity features in stressed syllables across contexts. Then the FEUP-IPB EP speech database, which was used in the subsequent studies, is presented. The database is labelled at the phoneme, word, sentence and F0 levels. The thesis continues with the presentation of two algorithms for the syllabic splitting of the text and of the phoneme sequences. This chapter ends with a proposed set of rules for the automatic phonetic transcription of the most problematic graphemes in EP.
The proposed prosody model consists of several sub-models...
This study addresses regional variation in Brazilian Portuguese from the perceptual perspective of language studies. The perceptual studies of Atkinsons (1968), Bonte (1975), Maidment (1976), Ohala and Gilbert (1978), and Bezooijen and Gooskens (1999) have already pointed to the importance of prosody as a cue for recognizing languages and their regional varieties. Accordingly, this work aims to verify whether Brazilian Portuguese informants are able to recognize their own regional varieties from prosody alone, and to search, through a production analysis, for cues in F0 variation that could account for the recognition of these varieties. It is an experimental study of three varieties of Brazilian Portuguese: that of Pelotas (RS), that of São Paulo (Mooca neighbourhood), and that of Senador Pompeu (CE). A perceptual test was designed to remove the produced segments from the acoustic signal (experiment 1), leaving participants with only the prosodic characteristics as information. This experiment was subdivided into short and long speech excerpts, and the PURR script (SONNTAG; PORTELE, 1998b) was used to produce the delexicalized speech. In contrast...
Integrated master's dissertation in Psychology (specialization in School and Educational Psychology); In this longitudinal study, with four repeated measures over time and a sample of 98
participants, we examine the development of prosodic reading across the second and third years of
schooling, the role of prosody in reading comprehension, and the relative weight of prosody
and reading rate in reading comprehension. The results of a multilevel growth model
show a gradual development of prosodic reading, with a marked acceleration
from the second to the third assessment point, followed by a deceleration from the
third to the fourth. In these subjects, inter-individual differences in baseline
prosody do not always condition participants' performance. Simple linear
regressions reveal that, at all four assessment points, the phrasing/expression
dimension significantly predicts reading comprehension. Partial correlation
analyses show that, once reading rate is controlled for, the unique
contribution of prosody becomes residual. The results suggest that prosody emerges from the
automatization of the decoding process...
Prosody is an important but not fully understood component of reading. In this longitudinal study with a sample of 98 Portuguese elementary school children, a multilevel growth model with four repeated measures over time showed steady progress in participants’ reading prosody from the middle of 2nd to the end of 3rd grade. However, children’s growth in this area varied across time points. Results also showed that individual differences in prosody scores at baseline affect the performance of most but not all students. Simple linear regressions showed that the prosody dimension “phrasing/expression” significantly predicted reading comprehension at all time points. Partial correlation analysis showed that when reading rate was accounted for, the unique contribution of prosody to reading comprehension was marginal, except at the third measurement.
The objective was to determine whether disturbances of
affective prosody constitute part of the symptomatology of
schizophrenia. Affective prosody is defined here as a
neuropsychological function that encompasses all non-verbal aspects of
language that are necessary for recognising and conveying emotions in
communication. Twenty-six schizophrenic out-patients and twenty-four
normal controls underwent a standardised prosody test, assessing four
different aspects of affective prosody: spontaneous prosody, prosodic
recognition, prosodic repetition, and facial affect recognition. Patients scored significantly worse than controls on three of the
four subtests: spontaneous prosody, prosodic recognition, and prosodic repetition. There were no significant differences on a subtest for
facial affect recognition. Differences in educational level between
patients and controls could not account for these differences.
Individuals with autism exhibit significant impairments in prosody production, yet there is a paucity of research on prosody comprehension in this population. The current study adapted a psycholinguistic paradigm to examine whether individuals with autism are able to use prosody to resolve syntactically ambiguous sentences. Participants were 21 adolescents with high-functioning autism (HFA), and 22 typically developing controls matched on age, IQ, receptive language, and gender. The HFA group was significantly less likely to use prosody to disambiguate syntax, but scored comparably to controls when syntax alone or both prosody and syntax indicated the correct response. These findings indicate that adolescents with HFA have difficulty using prosody to disambiguate syntax in comparison to typically developing controls, even when matched on chronological age, IQ, and receptive language. The implications of these findings for how individuals with autism process language are discussed.
Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher-level suprasegmental cues that are characteristic of human speech. However, recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. Such categorical prosody models are nevertheless severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%...
Research on emotion processing in the visual modality suggests a processing advantage for emotionally salient stimuli, even at early sensory stages; however, results concerning the auditory correlates are inconsistent. We present two experiments that employed a gating paradigm to investigate emotional prosody. In Experiment 1, participants heard successively building segments of Jabberwocky “sentences” spoken with happy, angry, or neutral intonation. After each segment, participants indicated the emotion conveyed and rated their confidence in their decision. Participants in Experiment 2 also heard Jabberwocky “sentences” in successive increments, with half discriminating happy from neutral prosody, and half discriminating angry from neutral prosody. Participants in both experiments identified neutral prosody more rapidly and accurately than happy or angry prosody. Confidence ratings were greater for neutral sentences, and error patterns also indicated a bias for recognising neutral prosody. Taken together, results suggest that enhanced processing of emotional content may be constrained by stimulus modality.
Patients with schizophrenia have well-established deficits in their ability to identify emotion from facial expression and tone of voice. In the visual modality, there is strong evidence that basic processing deficits contribute to impaired facial affect recognition in schizophrenia. However, few studies have examined the auditory modality for mechanisms underlying affective prosody identification. In this study, we explored links between different stages of auditory processing, using event-related potentials (ERPs), and affective prosody detection in schizophrenia. Thirty-six schizophrenia patients and 18 healthy control subjects received tasks of affective prosody, facial emotion identification, and tone matching, as well as two auditory oddball paradigms, one passive for mismatch negativity (MMN) and one active for P300. Patients had significantly reduced MMN and P300 amplitudes, impaired auditory and visual emotion recognition, and poorer tone matching performance, relative to healthy controls. Correlations between ERP and behavioral measures within the patient group revealed significant associations between affective prosody recognition and both MMN and P300 amplitudes. These relationships were modality specific, as MMN and P300 did not correlate with facial emotion recognition. The two ERP waves accounted for 49% of the variance in affective prosody in a regression analysis. Our results support previous suggestions of a relationship between basic auditory processing abnormalities and affective prosody dysfunction in schizophrenia...
Williams syndrome (WS), a genetic neurodevelopmental disorder, has been taken as evidence that music and language constitute separate modules. This research focused on the linguistic component of prosody and aimed to assess whether relationships exist between the pitch processing mechanisms for music and prosody in WS. Children with WS and typically developing individuals were presented with a musical pitch and two prosody discrimination tasks. In the musical pitch discrimination task, participants were required to distinguish whether two musical tones were the same or different. The prosody discrimination tasks evaluated participants’ skills for discriminating pairs of prosodic contours based on pitch or pitch, loudness and length, jointly. In WS, musical pitch discrimination was significantly correlated with performance on the prosody task assessing the discrimination of prosodic contours based on pitch only. Furthermore, musical pitch discrimination skills predicted performance on the prosody task based on pitch, and this relationship was not better explained by chronological age, vocabulary or auditory memory. These results suggest that children with WS process pitch in music and prosody through shared mechanisms. We discuss the implications of these results for theories of cognitive modularity. The implications of these results for intervention programs for individuals with WS are also discussed.
This paper is an instrumental documentation of the tonal contours of two types of pragmatic constructions in Navajo to investigate observations by native-speaking linguists (Willie p.c., Austin-Garrison p.c.) that Navajo has no tonal intonation, and the predictions made by syntacticians (Jelinek 1989, Hale, Jelinek & Willie 2001) who have claimed that the lack of focus intonation is predicted by the features of Navajo’s argument structure. We found no systematic pitch perturbations that differentiate Yes/No questions and Focus constructions from their declarative counterparts. The evidence lays open the question of the relationship of the tonal prosody to syntactic features that generate the pattern and to the parameters of intonation, accentuation and metrical structure.
One of the overarching questions in the field of infant perceptual and cognitive development concerns how selective attention is organized during early development to facilitate learning. The following study examined how infants’ selective attention to properties of social events (i.e., prosody of speech and facial identity) changes in real time as a function of intersensory redundancy (redundant audiovisual, nonredundant unimodal visual) and exploratory time. Intersensory redundancy refers to the spatially coordinated and temporally synchronous occurrence of information across multiple senses. Real time macro- and micro-structural change in infants’ scanning patterns of dynamic faces was also examined.
According to the Intersensory Redundancy Hypothesis, information presented redundantly and in temporal synchrony across two or more senses recruits infants’ selective attention and facilitates perceptual learning of highly salient amodal properties (properties that can be perceived across several sensory modalities such as the prosody of speech) at the expense of less salient modality specific properties. Conversely, information presented to only one sense facilitates infants’ learning of modality specific properties (properties that are specific to a particular sensory modality such as facial features) at the expense of amodal properties (Bahrick & Lickliter...
Nonverbal signals play an important role in the way humans communicate with each other. Body movements such as gestures and facial expressions are only one part of this; another important factor is prosody, first defined in the clinical context by Monrad-Krohn (1947) as the faculty of language that creates different meanings, independently of semantics, through the modulation of speech rhythm, loudness, frequency and stress patterns.
Only about seven percent of the information about a speaker's emotional state is inferred from semantics, that is, from the content of the words, or “what” he or she says. Fifty-five percent is conveyed by body language, and the remaining 38 percent, an impressive share, is carried by prosody, that is, by “how” the speaker says it (Mehrabian, 1972).
Prosody, and its adequate interpretation, is therefore a vital tool in everyday human life.
So far, a great deal of research has been carried out to disentangle the contribution of different acoustic parameters to the expression of emotional prosody. Numerous scientists have tried to clarify the influence and importance of single acoustic features in the creation of different emotional intonations (such as anger, happiness, disgust...
The meaning of a coherent discourse is formed not only by the meanings of the individual sentences it consists of, but also by the meaningful links between them, the discourse relations. Discourse relations can be signalled by various linguistic expressions, e.g. connectives (but), anaphoric adverbials (then), or intonational patterns (contrastive accent), but they can also remain implicit, in which case they have to be inferred by the addressee. This dissertation addresses the question of how and why implicit discourse relations are inferred.
Most of the previous theoretical studies on the inference of discourse relations have been carried out within the framework of Segmented Discourse Representation Theory (SDRT, Asher & Lascarides, 2003). They provide a good coverage of the relevant linguistic data and work out a handy classification of discourse relations, which is used as a basis in the present work. However, discourse relations are primitive theoretical constructs in SDRT, so SDRT does not explain why there are these relations and not others.
Moreover, discourse relations differ in markedness. For instance, the discourse relation Concession is almost always marked by connectives like 'although', 'but', etc., whereas relations like Restatement and Elaboration usually remain implicit.
SDRT does not give an answer to the question why this is so. Thus the programmatic goal pursued by the present study is to explain the inventory and relative markedness of discourse relations in terms of the concepts and principles of general pragmatics.
The proposed explanation is based on two default principles:
(1) Topic continuity: By default...
Patients with Huntington's Disease (HD) who were without dementia were compared to unilateral stroke patients and controls, as previously reported in 1983, to discover whether they had a prosodic defect. Subjects were presented with tape-recorded, speech-filtered sentences and asked to indicate the tone of voice as happy, sad or angry (affective prosody), or as a question, command or statement (propositional prosody). HD patients were impaired in comprehension of both types of prosody compared to controls but were not different from stroke patients. A second study compared early HD patients with at-risk siblings and spouse controls on comprehension of affective and propositional prosody, discrimination of both types of prosody, rhythm discrimination and tonal memory (Seashore tests). HD patients were impaired in both comprehension and discrimination of all types of prosody. HD patients were less accurate than at-risk patients on the tonal memory task but not on the rhythm discrimination task. These findings suggest a compromised ability to understand the more subtle prosodic aspects of communication, which may contribute to the social impairment of HD patients very early in the course of the disease.
Building on previous work (e.g. Kubozono 2006, and Kang 2010), this article attempts to establish a taxonomy for loanword prosody, referring specifically to the patterns of stress, tone, or pitch-accent that are found in loanwords. Toward a taxonomy, we consider the following factors: (i) whether the pronunciation of the word in the source language influences the assignment of prosody in the borrowing language; (ii) whether prosody assignment is aided by rules (or constraints) that are specific to loanwords; and (iii) whether segmental features or suprasegmental features play a role. Languages instantiating the taxonomy will be exemplified, with discussion of the issues that arise from the proposed taxonomy.
We are grateful to Joan Borràs-Comes for kindly providing us with the map that appears in Figure 1. Alba Chacón, Verònica Crespo-Sendra and Marianna Nadeu deserve a special mention for having participated unselfishly as narrators of the short picture stories presented in a PowerPoint slide show. We also thank the participants and the people who helped us get in contact with potential participants: Gotzon Aurrekoetxea, Mercedes Cabrera, Verònica Crespo-Sendra, Irene de la Cruz, Gorka Elordieta, Leire Gandarias, Miriam Rodríguez, Paco Vizcaíno. This research has been funded by the project FFI2011–23829/FILO awarded by the Spanish Ministry of Economy and Competitiveness.; In this study we investigate the interaction between word order and prosody in expressing modality and different focus constructions in a range of Catalan and Spanish dialects. We analysed a corpus obtained through two tasks: (a) a production task designed to elicit different focus constructions through question-answer pairs with short stories presented through pictures, and (b) the Discourse Completion Task methodology. The collected data were analysed prosodically and syntactically. Our data confirm that in Catalan and Spanish intonational prominence tends to fall at the end of the sentence...
This paper analyses the role of prosody in parenthetical insertions, a type of structure that is extremely common in both speech and writing. The materials under study come from a corpus of spontaneous speech acts in Central Catalan (with varying degrees of spontaneity) from which a corpus of oral parenthetical insertions has been compiled. The prototypical prosodic features of a parenthetical insertion in Catalan are: prosodic autonomy, limited extension, production in between pauses or final pause, tendency towards acceleration, fall in intensity, lower pitch range and, finally, falling or rising melodic pattern. While the final fall is the most frequent pattern in spontaneous conversations with a high degree of confidence between interlocutors, a final rising structure is found in interviews in which the degree of confidence between participants is smaller, their roles are unequal, and the interviewee constructs a narrative discourse. We thus suggest that the pitch contour of parenthetical insertions is related to formality and discourse typology (in this case, narrative vs. dialogue). Bearing in mind the discursive functions performed by these insertions, we propose a typology which classifies them with regard to two main functions: completion of information...
Our investigation focuses on several types of structural ambiguity in European Portuguese. The materials include sentences with set-divider adverbs ambiguous as to the direction of syntactic attachment, adjunct and complement PPs ambiguous as to the level of syntactic embedding, nonrestrictive clauses with local and non-local possible antecedents, and relative clauses ambiguous as to their restrictive/non-restrictive meaning. Besides providing a prosodic description of sentences with these various sorts of ambiguity, the relation between prosody and syntactic structure is addressed. It is concluded that structural ambiguity is not always cued by prosody, and it may be resolved by prosodic means that are optional. Additionally, some options on sentence partition in intonational phrases are only available under some interpretations, and in specific configurations I-breaks may not be inserted (namely, between a head and an adjacent complement or modifier). In all cases studied, intonational phrase level properties play a crucial role in sentence disambiguation. An intonational phrase boundary after set-divider adverbs indicates left attachment, and one between a constituent and the preceding material implies non-local attachment. These facts are seen to follow in a principled way from the conditions on the formation of intonational phrases.