Festival training


Festival TTS Training Material
TTS Group
Indian Institute of Technology Madras
Chennai - 600036, India

June 5, 2012


Contents

1 Introduction
  1.1 Nature of scripts of Indian languages
  1.2 Convergence and divergence

2 What is Text to Speech Synthesis?
  2.1 Components of a text-to-speech system
  2.2 Normalization of non-standard words
  2.3 Grapheme-to-phoneme conversion
  2.4 Prosodic analysis
  2.5 Methods of speech generation
      2.5.1 Parametric synthesis
      2.5.2 Concatenative synthesis
  2.6 Primary components of the TTS framework
  2.7 Screen readers for the visually challenged

3 Overall Picture

4 Labeling Tool
  4.1 How to Install LabelingTool
  4.2 Troubleshooting of LabelingTool

5 Labeling Tool User Manual
  5.1 How To Use Labeling Tool
  5.2 How to do label correction using Labeling tool
  5.3 Viewing the labelled file
  5.4 Control file
  5.5 Performance results for 6 Indian Languages
  5.6 Limitations of the tool

6 Unit Selection Synthesis Using Festival
  6.1 Cluster unit selection
  6.2 Choosing the right unit type
  6.3 Collecting databases for unit selection
  6.4 Preliminaries
  6.5 Building utterance structures for unit selection
  6.6 Making cepstrum parameter files
  6.7 Building the clusters

7 Building Festival Voice

8 Customizing festival for Indian Languages
  8.1 Some of the parameters that were customized to deal with Indian languages in the festival framework
  8.2 Modifications in source code

9 Trouble Shooting in festival
  9.1 Troubleshooting (issues related with festival)
  9.2 Troubleshooting (issues that might occur while synthesizing)

10 ORCA Screen Reader

11 NVDA Windows Screen Reader
  11.1 Compiling Festival in Windows

12 SAPI compatibility for festival voice

13 Sphere Converter Tool
  13.1 Extraction of details from header of the input file
      13.1.1 Calculate sample minimum and maximum
      13.1.2 RAW Files
      13.1.3 MULAW Files
      13.1.4 Output in encoded format
  13.2 Config file

14 Sphere Converter User Manual
  14.1 How to Install the Sphere converter tool
  14.2 How to use the tool
  14.3 Fields in Properties
  14.4 Screenshot
  14.5 Example of data in the Config file (default properties)
  14.6 Limitations to the tool

1 Introduction

This training is conducted for new members who have joined the TTS consortium. The main aim of the consortium is to develop text-to-speech (TTS) systems in all 22 official Indian languages and to use them to build screen readers, spoken interfaces for information access that help visually challenged people use a computer with ease and make computing ubiquitous and inclusive.

1.1 Nature of scripts of Indian languages

The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of the writing system are referred to as Aksharas. The properties of Aksharas are as follows:

1. An Akshara is an orthographic representation of a speech sound in an Indian language.
2. Aksharas are syllabic in nature.
3. The typical forms of an Akshara are V, CV, CCV and CCCV, thus having a generalized form of C*V, where C denotes a consonant and V denotes a vowel.

As Indian languages are Akshara based, an Akshara being a subset of a syllable, a syllable-based unit selection synthesis system has been built for Indian languages. Further, a syllable corresponds to a basic unit of production, as opposed to the diphone or the phone. Earlier efforts by the consortium members, in particular IIIT Hyderabad and IIT Madras, indicate that natural-sounding synthesizers for Indian languages can be built using the syllable as the basic unit.

1.2 Convergence and divergence

The official languages of India, except English and Urdu, share a common phonetic base, i.e., they share a common set of speech sounds. This common phonetic base consists of around 50 phones, including 15 vowels and 35 consonants. While all of these languages share a common phonetic base, some of them, such as Hindi, Marathi and Nepali, also share a common script known as Devanagari, whereas languages such as Telugu, Kannada and Tamil have their own scripts. What makes these languages unique can be attributed to the phonotactics of each language rather than to the scripts and speech sounds. Phonotactics is the set of permissible combinations of phones that can co-occur in a language. This implies that the distribution of syllables encountered in each language is different. Another dimension in which the Indian languages differ significantly is prosody, which includes the duration, intonation and prominence associated with each syllable in a word or a sentence.


2 What is Text to Speech Synthesis?

A text-to-speech synthesis system converts text input to speech output. The conversion of text into spoken form is deceptively nontrivial. A naive approach is to store the basic sounds (also referred to as phones) of a language and concatenate them to produce a speech waveform. But natural speech exhibits co-articulation, i.e., the effect of coupling two sounds together, and prosody at the syllable, word, sentence and discourse levels, which cannot be synthesised by simple concatenation of phones. Another method often employed is to store a huge dictionary of the most common words. However, such a method cannot synthesise the millions of names and acronyms that are not in the dictionary, nor can it generate appropriate intonation and duration for words in different contexts. Thus a text-to-speech approach using phones provides flexibility but cannot produce intelligible and natural speech, while word-level concatenation produces intelligible and natural speech but is not flexible. To balance flexibility with intelligibility and naturalness, sub-word units such as diphones, which capture the essential coarticulation between adjacent phones, are used as units in a text-to-speech system.

2.1 Components of a text-to-speech system

A typical architecture of a text-to-speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized into text processing and methods of speech generation. In the real world, the typical input to a text-to-speech system is text as available in electronic documents, newspapers, blogs, emails etc. Such text is anything but a sequence of words found in a standard dictionary: it contains several non-standard words such as numbers, abbreviations, homographs and symbols built using punctuation characters, e.g. the exclamation mark (!) or smileys such as :-). The goal of the text processing module is to process the input text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate phone sequence for each word.

2.2 Normalization of non-standard words

Text in the real world consists of words whose pronunciation is typically not found in dictionaries or lexicons, such as "IBM", "CMU" and "MSN". Such words are referred to as non-standard words (NSW). The various categories of NSW are:

1. Numbers, whose pronunciation changes depending on whether they refer to currency, time, telephone numbers, zip codes etc.
2. Abbreviations, contractions and acronyms such as ABC, US, approx., Ctrl-C, lb.
3. Punctuation, e.g. 3-4, +/-, and/or.
4. Dates, time, units and URLs.
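As a rough illustration of this step, the sketch below expands a couple of classes of non-standard words (digit strings and a small abbreviation table) into speakable words. It is only a minimal, hypothetical Python example; the abbreviation table and the digit-by-digit expansion are invented here, and the consortium's actual normalization rules are language specific and far more elaborate.

    # Minimal, hypothetical NSW normalization sketch (illustrative only).
    import re

    ABBREVIATIONS = {          # tiny illustrative table, not the real one
        "approx.": "approximately",
        "lb.": "pounds",
        "Ctrl-C": "control C",
    }

    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

    def expand_number(token):
        """Spell out a digit string digit by digit (telephone-number style)."""
        return " ".join(ONES[int(d)] for d in token if d.isdigit())

    def normalize(text):
        words = []
        for token in text.split():
            if token in ABBREVIATIONS:
                words.append(ABBREVIATIONS[token])
            elif re.fullmatch(r"\d+", token):
                words.append(expand_number(token))
            else:
                words.append(token)
        return " ".join(words)

    if __name__ == "__main__":
        print(normalize("Call 4591 approx. at noon"))
        # -> "Call four five nine one approximately at noon"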

2.3 Grapheme-to-phoneme conversion

Given the sequence of words, the next step is to generate a sequence of phones. For languages such as Spanish, Telugu and Kannada, where there is a good correspondence between what is written and what is spoken, a set of simple rules often suffices. For languages such as English, where the relationship between orthography and pronunciation is complex, a standard pronunciation dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator is built using machine learning techniques.
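For the rule-based case, a grapheme-to-phoneme mapping can be as simple as a table lookup with longest-match rules. The sketch below is a hypothetical Python illustration on a toy Roman-transliterated input; the mapping table is invented, and real Indian-language G2P operates on UTF-8 Akshara sequences with language-specific rules (e.g. schwa deletion in Hindi).

    # Toy rule-based grapheme-to-phoneme sketch (hypothetical mapping, not a real language).
    G2P_TABLE = {
        "k": "k", "a": "a", "aa": "a:", "i": "i", "t": "t", "m": "m", "r": "r",
    }

    def g2p(word):
        """Greedy longest-match conversion of a word into a phone sequence."""
        phones, i = [], 0
        while i < len(word):
            # try two-character graphemes first, then single characters
            for length in (2, 1):
                chunk = word[i:i + length]
                if chunk in G2P_TABLE:
                    phones.append(G2P_TABLE[chunk])
                    i += length
                    break
            else:
                i += 1          # skip characters we have no rule for
        return phones

    if __name__ == "__main__":
        print(g2p("maatram"))   # -> ['m', 'a:', 't', 'r', 'a', 'm']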


2.4 Prosodic analysis

Prosodic analysis deals with modeling and generating appropriate duration and intonation contours for the given text. This is inherently difficult since prosody is absent in text. For example, the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have the same text content but can be uttered with different intonation and duration to convey different meanings. To predict appropriate duration and intonation, the input text needs to be analyzed. This can be done by a variety of algorithms, including simple rules, example-based techniques and machine learning algorithms. The generated duration and intonation contour can be used to manipulate the context-insensitive diphones in diphone-based synthesis or to select an appropriate unit in unit selection voices.
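To make the idea of rule-based prosody prediction concrete, the sketch below lengthens phrase-final syllables and marks a rising or falling intonation target depending on sentence type. The rules and numbers are invented purely for illustration; they are not the consortium's actual duration or intonation models.

    # Illustrative rule-based duration/intonation sketch (invented rules and numbers).
    BASE_DURATION_MS = 180          # hypothetical average syllable duration

    def predict_prosody(syllables, is_question):
        targets = []
        for idx, syl in enumerate(syllables):
            duration = BASE_DURATION_MS
            if idx == len(syllables) - 1:
                duration *= 1.4     # crude phrase-final lengthening
            # very crude intonation target: rise for questions, fall otherwise
            f0_target = "rise" if (is_question and idx == len(syllables) - 1) else "fall"
            targets.append({"syllable": syl, "duration_ms": duration, "f0": f0_target})
        return targets

    if __name__ == "__main__":
        for t in predict_prosody(["whe", "re", "are", "you", "go", "ing"], is_question=True):
            print(t)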

2.5 Methods of speech generation

The methods of converting a phone sequence into a speech waveform can be categorized into parametric, concatenative and statistical parametric synthesis.


2.5.1 Parametric synthesis

Parameters such as formants and linear prediction coefficients are extracted from the speech signal of each phone unit. These parameters are modified at synthesis time to incorporate the co-articulation and prosody of a natural speech signal. The required modifications are specified in terms of rules which are derived manually from observations of speech data. These rules cover duration, intonation, co-articulation and the excitation function. Examples of early parametric synthesis systems are Klatt's formant synthesizer and MITalk.

2.5.2 Concatenative synthesis

Derivation of rules in parametric synthesis is a laborious task. Also, the quality of speech synthesized using traditional parametric synthesis is found to be robotic. This has led to the development of concatenative synthesis, where examples of speech units are stored and used during synthesis. Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode, with some manual correction afterward using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree (a generic sketch of such a search is given at the end of this subsection). Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into gigabytes of recorded data and representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear), even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.

2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.

3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited and they closely match the prosody and intonation of the original recordings. Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in a word like "clear" is usually pronounced only when the following word begins with a vowel (e.g. in "clear out"). Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.

The speech units used in concatenative synthesis are typically at the diphone level so that natural co-articulation is retained. Duration and intonation are derived either manually or automatically from the data and are incorporated at synthesis time. Examples of diphone synthesizers are Festival diphone synthesis and MBROLA. The possibility of storing more than one example of a diphone unit, due to the increase in storage and computation capabilities, has led to the development of unit selection synthesis. Multiple examples of a unit, along with the relevant linguistic and phonetic context, are stored and used in unit selection synthesis. The quality of unit selection synthesis is found to be more natural than that of diphone and parametric synthesis. However, unit selection synthesis lacks consistency, i.e., its quality varies.
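The "best chain of candidate units" mentioned above is usually found by minimizing the sum of a target cost (how well a candidate matches the predicted context) and a concatenation or join cost (how smoothly adjacent candidates connect), typically with dynamic programming. The sketch below is a generic, hypothetical Python illustration of that search; the features, costs and weights are invented and are not those used by Festival's clunits module.

    # Hypothetical Viterbi-style unit selection sketch (invented costs, illustrative only).
    import math

    def target_cost(candidate, target):
        # penalize mismatches in pitch and duration against the predicted target
        return abs(candidate["f0"] - target["f0"]) / 50.0 + \
               abs(candidate["dur"] - target["dur"]) / 40.0

    def join_cost(prev, cur):
        # crude spectral/pitch discontinuity measure at the join point
        return math.dist(prev["cep_end"], cur["cep_start"]) + \
               abs(prev["f0"] - cur["f0"]) / 100.0

    def select_units(targets, candidates):
        """Dynamic programming over candidate lists, one list per target position."""
        best = [(target_cost(c, targets[0]), [c]) for c in candidates[0]]
        for t, cands in zip(targets[1:], candidates[1:]):
            new_best = []
            for c in cands:
                prev_cost, prev_path = min(
                    best, key=lambda sp, c=c: sp[0] + join_cost(sp[1][-1], c))
                total = prev_cost + join_cost(prev_path[-1], c) + target_cost(c, t)
                new_best.append((total, prev_path + [c]))
            best = new_best
        return min(best, key=lambda sp: sp[0])[1]

    if __name__ == "__main__":
        # invented candidate data purely to exercise the search
        targets = [{"f0": 120, "dur": 90}, {"f0": 115, "dur": 110}]
        candidates = [
            [{"f0": 118, "dur": 95, "cep_start": [0.1, 0.2], "cep_end": [0.3, 0.1]},
             {"f0": 140, "dur": 70, "cep_start": [0.5, 0.4], "cep_end": [0.2, 0.2]}],
            [{"f0": 116, "dur": 105, "cep_start": [0.3, 0.1], "cep_end": [0.0, 0.1]}],
        ]
        print(select_units(targets, candidates))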

2.6 Primary components of the TTS framework

1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the "formant synthesis" method, which allows many languages to be supported with a small footprint. The synthesized speech is intelligible and the responses are quick, but the output lacks naturalness. The demand is for a high-quality, natural-sounding TTS system. We have used the Festival speech synthesis system developed at The Centre for Speech Technology Research, University of Edinburgh, which provides a framework for building speech synthesis systems and offers full text-to-speech support through a number of APIs. A large corpus-based unit selection paradigm has been employed. This paradigm is known to produce intelligible, natural-sounding speech output, but has a larger footprint.

2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most popular screen reader used worldwide for Microsoft Windows based systems, but its main drawback is its high cost, approximately 1300 USD, whereas the average per capita income in India is 1045 USD. Different open source screen readers are freely available. We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is a flexible screen reader that provides access to the graphical desktop via user-customizable combinations of speech, braille and magnification. ORCA supports the Festival GNOME speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision-impaired people to access computers running Windows. NVDA is popular among the members of the AccessIndia community, a mailing list which provides an opportunity for visually impaired computer users in India to exchange information as well as conduct discussions related to assistive technology and other accessibility issues. NVDA has already been integrated with the Festival speech engine by Olga Yakovleva.

3. Typing tool for Indian Languages - Typing tools map the QWERTY keyboard to Indian language characters. Widely used tools to input data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard, for Linux and Windows systems respectively. The same have been used for our TTS systems as well.

2.7 Screen readers for the visually challenged

India is home to the world's largest visually challenged (VC) population. In today's digital world, disability is often equated with inability. Little attention is paid to people with disabilities, and social inclusion and acceptance remain a challenge. The perceived inability of people with disability, the perceived cost of special education and attitudes towards inclusive education are major constraints for effective delivery of education. Education is THE means of developing the capabilities of people with disability, to enable them to develop their potential, become self-sufficient, escape poverty and gain entry to fields previously denied to them. The aim of this project is to make a difference in the lives of VC persons. VC persons need to depend on others to access common information that others take for granted, such as newspapers, bank statements, and scholastic transcripts. Assistive technologies (AT) are necessary to enable physically challenged persons to become part of the mainstream of society. A screen reader is an assistive technology potentially useful to people who are visually challenged, visually impaired, illiterate or learning disabled, for using standard computer software such as word processors, spreadsheets, email and the Internet. Before the start of this project, the Indian Institute of Technology, Madras (IIT Madras) had been conducting a training programme for visually challenged people, to enable them to use the computer with the screen reader JAWS, with English as the language. Although the VC persons benefited from this programme, most of them felt that:

• The English accent was difficult to understand.
• Most students would have preferred a reader in their native language.
• They would prefer English spoken with an Indian accent.
• The price for the individual purchase of JAWS was very high.

Against this backdrop, it was felt imperative to build assistive technologies in the vernacular. An initiative was taken by DIT, Ministry of Information Technology, to sponsor the development of:

1. natural-sounding text-to-speech synthesis systems in different Indian languages, and
2. their integration with open source screen readers.


3 Overall Picture

1. Data Collection - Text was crawled from a news site and a site with stories for children.

2. Cleaning up of Data - From the crawled data, sentences were picked so as to maximize syllable coverage (a sketch of such a greedy selection is given after this list).

3. Recording - The selected sentences were then recorded in a studio, a completely noise-free environment.

4. Labeling - The wave files were then manually labeled using the semi-automatic labeling tool to get accurate syllable boundaries.

5. Training - Using the wave files and their transcriptions, the Indian language unit selection voice was built.

6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users as the evaluators.
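The sentence-selection step in item 2 can be pictured as a greedy set cover over syllables: repeatedly pick the sentence that adds the most syllables not yet covered. The Python sketch below is a hypothetical illustration of that idea; the stand-in syllabifier and the stopping criterion are invented and do not reflect the consortium's actual selection procedure.

    # Hypothetical greedy text selection for syllable coverage (illustrative only).
    def syllabify(sentence):
        """Stand-in syllabifier; a real one would operate on UTF-8 Akshara sequences."""
        return {w[i:i + 2] for w in sentence.lower().split() for i in range(0, len(w), 2)}

    def select_sentences(corpus, max_sentences):
        covered = set()
        chosen = []
        remaining = list(corpus)
        for _ in range(max_sentences):
            # pick the sentence contributing the most new syllables
            best = max(remaining, key=lambda s: len(syllabify(s) - covered), default=None)
            if best is None or not (syllabify(best) - covered):
                break
            chosen.append(best)
            covered |= syllabify(best)
            remaining.remove(best)
        return chosen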


4 Labeling Tool

It is a widely accepted fact that the accuracy of labeling of speech files has a great bearing on the quality of unit selection synthesis. The process of manual labeling is a time-consuming and daunting task, and it is not trivial to label waveforms manually at the syllable level. The DONLabel labeling tool provides an automatic way of performing labeling, given an input waveform and the corresponding text in UTF-8 format. The tool makes use of group delay based segmentation to provide the segment boundaries. The size of the segment labels generated can vary from monosyllables to polysyllables as the Window Scale Factor (WSF) parameter is varied from small to large values. Our labeling process makes use of:

• the Ergodic HMM (EHMM) labeling procedure provided by Festival,
• the group delay based algorithm (GD), and
• the Vowel Onset Point (VOP) detection algorithm.

The labeling tool displays a panel which shows the segment boundaries estimated by the group delay algorithm, another panel which shows the segment boundaries estimated by the EHMM process, and a panel for VOP, which shows how many vowel onset points are present between the segment boundaries provided by the group delay algorithm. This helps greatly in adjusting the labels provided by the group delay algorithm, if necessary, by comparing the labeling outputs of the EHMM process and the VOP algorithm. By using VOP as an additional cue, manual intervention during the labeling process can be eliminated. It also improves the accuracy of the labels generated by the labeling tool. The tool works for 6 different Indian languages, namely:

• Hindi
• Tamil
• Malayalam
• Marathi
• Telugu
• Bengali

The tool also displays the text (UTF-8) in segmented format along with the speech file.

4.1 How to Install LabelingTool

1. Copy the html folder to the /var/www folder. If the www folder is not present in /var, create a folder named www and extract the html folder into it, so that the labelingTool code is in /var/www/html/labelingTool/.

2. Install the Java compiler using the following command:

   sudo apt-get install sun-java6-jdk

   The following error may appear:

   Reading package lists... Done
   Building dependency tree
   Reading state information... Done
   Package sun-java6-jdk is not available, but is referred to by another package.
   This may mean that the package is missing, has been obsoleted, or is only available from another source

   E: Package 'sun-java6-jdk' has no installation candidate

   Similarly, for:

   sudo apt-get install sun-java6-jre

   the following error may appear:

   Reading package lists... Done
   Building dependency tree
   Reading state information... Done
   Package sun-java6-jre is not available, but is referred to by another package.
   This may mean that the package is missing, has been obsoleted, or is only available from another source
   E: Package 'sun-java6-jre' has no installation candidate

   One solution is:

   sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
   sudo add-apt-repository "deb http://ftp.debian.org/debian squeeze main contrib non-free"
   sudo add-apt-repository "deb http://ppa.launchpad.net/chromium-daily/ppa/ubuntu/ lucid main"
   sudo add-apt-repository "deb http://ppa.launchpad.net/flexiondotorg/java/ubuntu/ lucid main"
   sudo apt-get update

   The other solution is:

   For Ubuntu 10.04 LTS, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead. If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:

   sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
   sudo apt-get update
   sudo apt-get install sun-java6-jre sun-java6-plugin
   sudo apt-get install sun-java6-jdk
   sudo update-alternatives --config java

   For Ubuntu 10.10, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead. If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:

   sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
   sudo apt-get update
   sudo apt-get install sun-java6-jre sun-java6-plugin
   sudo apt-get install sun-java6-jdk
   sudo update-alternatives --config java


   If the above does not work (for other versions of Ubuntu), you can create a local repository as follows:

   cd ~/
   wget https://github.com/flexiondotorg/oab-java6/raw/0.2.1/oab-java6.sh -O oab-java6.sh
   chmod +x oab-java6.sh
   sudo ./oab-java6.sh

   and then run:

   sudo apt-get install sun-java6-jdk
   sudo apt-get install sun-java6-jre

   Source: https://github.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/README.rst

3. Install PHP using the following command:

   sudo apt-get install php5

4. Install apache2 using the following command:

   sudo apt-get install apache2

   Update the paths in the file /etc/apache2/sites-available/default: set all cgi-bin paths to "/var/www/html/cgi-bin". A sample default file is attached.

5. Install speech-tools using the following command:

   sudo apt-get install speech-tools

6. Install tcsh using the following command:

   sudo apt-get install tcsh

7. Enable JavaScript in the properties of the browser used. Use Google Chrome or Mozilla Firefox.

8. Install the Java plugin for the browser:

   sudo apt-get install sun-java6-plugin

   Create a symbolic link to the Java plugin libnpjp2.so using the following commands:

   sudo ln -s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.so /etc/alternatives/mozilla-javaplugin.so
   sudo ln -s /etc/alternatives/mozilla-javaplugin.so /usr/lib/mozilla/plugins/libnpjp2.so

9. Give full permissions to the html folder:

   sudo chmod -R 777 html/

10. Add the following code to /etc/java-6-sun/security/java.policy:


   grant {
       permission java.security.AllPermission;
   };

11. In the file /var/www/html/labelingTool/jsrc/install, make sure that the correct path of javac is provided as per your installation. For example:

   /usr/lib/jvm/java-6-sun-1.6.0.26/bin/javac

   The Java version is 1.6.0.26 here; it might be different in your installation. Check the path and give the correct values.

12. Install the tool. Go to /var/www/html/labelingTool/jsrc and run the command:

   sudo ./install

   It might give the following output, which is not an error:

   Note: LabelingTool.java uses or overrides a deprecated API.
   Note: Recompile with -Xlint:deprecation for details.

13. Restart apache using the following command:

   sudo /etc/init.d/apache2 restart

14. Check whether Java applets are enabled in the browser by using the following link:

   http://javatester.org/enabled.html

   On that webpage, in the LIVE box, it should display "This web browser can indeed run Java applets"; wait a while for the display to appear. If it displays "This web browser can NOT run Java applets", there is some issue with Java applets. Please look up how to enable Java in your version of the browser and fix the issue.

15. Replace Pronunciation_Rules.pl in the /var/www/html/labelingTool folder with your language-specific code (the name should remain the same - Pronunciation_Rules.pl).

16. Open the browser and go to the following link:

   http://localhost/main.php

NOTE: The VOP algorithm is not used in the current version of the labelingTool, so please ignore anything related to VOP in the sections below.

4.2 Troubleshooting of LabelingTool

1. When the LabelingTool is working fine, the following files will be generated in the labelingTool/results folder:

   boundary segments spec low vop wav sig gd spectrum low segments indicator tmp.seg vopsegments

2. When the boundaries are manually updated (deleted, added or moved) and saved, 2 more files get created in the results folder:

   ind_upd
   segments_updated

3. When, after manually updating and saving, the vopUpdate button is clicked, another new file gets created in the results folder:

   vop_updated

4. If a file named 'vop' is not getting generated in the labelingTool/results folder and the labelit.php page is getting stuck, you need to compile the vop module. Follow the steps below:

   (a) cd /var/www/html/labelingTool/Vop-Lab
   (b) make -f MakeEse clean
   (c) make -f MakeEse
   (d) cd bin
   (e) cp Multiple_Vopd ../../bin/

5. If the above files are not getting created, we can try running through the command line as follows. Execute the commands from the /var/www/html/labelingTool/bin folder.

   • The command line usage of the WordsWithSilenceRemoval program is as follows:

     WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms)

     Example:

     ./WordsWithSilenceRemoval fe-words2.base /home/text_0001.wav ../results/spec ../results/boun 100 100

     Two files named spec and boun have to be generated in the results folder. If they are not created, try recompiling:

     cd /var/www/html/labelingTool/Segmentation
     make -f MakeWordsWithSilenceRemoval clean
     make -f MakeWordsWithSilenceRemoval
     cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/

   • The command line usage of the Multiple_Vopd program is as follows:

     Multiple_Vopd ctrlFile waveFile(in sphere format) segmentsFile vopFile

     Example:

     ./Multiple_Vopd fe-ctrl.ese ../results/wav ../results/segments ../results/vop

     The file 'wav' in the results folder is already the sphere format of your input wave file.


   On running the Multiple_Vopd binary, a file 'vop' has to be generated in the results folder.

6. If the file 'wav' is not produced in the results folder, speech-tools is not installed. To check whether speech-tools is installed, run:

   ch_wave -info <wavefile>

   This command should give information about that wave file. If speech-tools was installed along with Festival and there is no link to it in /usr/bin, please make a link pointing to the ch_wave binary in the /usr/bin folder.

7. To check if tcsh is installed, type the command 'tcsh'; a new prompt will appear.

8. Provide full permissions to the labelingTool folder and its sub-folders so that new files can be created and updated without any permission issues. If required, the following commands can be used in the labelingTool folder:

   chmod -R 777 *
   chown -R root:root *

9. The java.policy file should be updated as specified in the installation steps, otherwise it may result in the error "Error writing Lab File".

10. When the lab file is viewed in the browser, if UTF-8 text is not displayed, set the character encoding of the browser to UTF-8 (Tools -> Options -> Content -> Fonts and Colors (Advanced menu) -> Default Character Encoding (UTF-8)) and restart the browser.


5 Labeling Tool User Manual

5.1 How To Use Labeling Tool

The front page of the tool can be accessed using the URL http://localhost/main.php. A screenshot of the front page is shown below.

The front page has the following fields:

• The speech file, in wav format, should be provided. It can be selected using the browse button.
• The corresponding UTF-8 text has to be provided in the text file. It can be selected using the browse button. The uploaded text file should not contain any special characters.
• The EHMM lab file generated by Festival while building the voice can be provided as input. This is an optional field.
• The GD lab file generated by the labeling tool in a previous attempt to label the same file. This is an optional field. If the user had earlier labelled a file half way and saved the lab file, it can be provided as input here so as to label the rest of it or to correct the labels.
• The threshold for voiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50).
• The threshold for unvoiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50). If the speech file has very long silences, a high threshold value can be provided.
• The WSF (window scale factor) can be selected from the drop-down list. The default value is 5. Depending on the output, the user may need to change the WSF value to find the value that gives the best possible segmentation for the speech file.
• The corresponding language can be selected using the radio buttons.

• Submit the details to the tool using the submit button. A screenshot of the filled-up front page is given below.

Loading Page: On clicking the 'submit' button on the front page, the following page will be displayed.

Validation of the data entered:

• If all files were loaded successfully and proper values were given for the thresholds on the front page, the message "Click View to see the results..." will be displayed as shown above.
• If the wave file was not provided on the front page, the following error will appear on the loading page: "Error uploading wav file. Wav file must be entered".
• If the text file was not provided on the front page, the following error will appear on the loading page: "Error uploading text file. Text file must be entered".
• If the threshold for voiced segments was not provided on the front page, the following error will appear on the loading page: "Threshold for voiced segments must be entered".
• If the threshold for unvoiced segments was not provided on the front page, the following error will appear on the loading page: "Threshold for unvoiced segments must be entered".


• If a numeric value is not entered for the voiced or unvoiced thresholds on the front page, the following error will appear on the loading page: "Numeric value must be entered for thresholds".
• The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav.
• The lab file (EHMM label file) will be copied to the /var/www/html/labelingTool/lab folder as temp.lab. If an error occurs while moving it to the lab folder, the following error will be displayed: "Error moving lab file."
• The GD lab file (group delay label file) will be copied to the /var/www/html/labelingTool/results folder with the name gd_lab. If an error occurs while moving it, the following error will be displayed: "Error moving gdlab file."

The Labelit Page: On clicking the view button on the loading page, the labelit page will be loaded. A screenshot of this page, with markings for each panel, is given below.

Note: If the error message "Error reading file 'http://localhost/labelingTool/tmp/temp.wav'" appears, it means that some other file (e.g. a text file) was uploaded in place of the wav file.

• Panels on the Labelit Page - It has 6 main panels:

– EHMM Panel: displays the lab file generated by Festival using the EHMM algorithm while building voices.
– Slider Panel: using this panel we can slide, delete or add segments/labels.


– Wave Panel: displays the speech waveform in segmented format. (Note: the speech waveform does not appear as it does in WaveSurfer; this is because of limitations in Java.)
– Text Panel: displays the segmented text (in UTF-8 format), with the syllable as the basic unit.
– GD Panel: draws the group delay curve, which is the result of the group delay algorithm. Wherever a peak appears is considered to be a segment boundary.
– VOP Panel: shows the number of vowel onset points found between the segment boundaries provided by group delay. Green corresponds to one vowel onset point, which means the segment boundary found by the group delay algorithm is correct. Red corresponds to zero vowel onset points, which means the segment boundary found by the group delay algorithm is wrong and needs to be deleted. Yellow corresponds to more than one vowel onset point, which means that between 2 boundaries found by the group delay algorithm there should be one or more additional boundaries.

• Resegment - The WSF selected for this example is 5. A different WSF will provide a different set of boundaries: the smaller the WSF, the greater the number of boundaries, and vice versa. To experiment with different WSF values, select the WSF from the drop-down list and click 'RESEGMENT'. A screenshot for the same text (as in the above figure) with a larger WSF selected is shown below.

The above figure shows the segmentation using WSF = 12, which gives fewer boundaries. The figure below shows the same waveform with a smaller WSF (WSF = 3), which gives more boundaries.


So the ideal WSF for the waveform has to be found. An easy way is to check that the text segments reach approximately to the end of the waveform (no text segments are missing and there are not many segments without text).

• Menu Bar - The menu bar is just above the EHMM Panel, with the heading 'Waveform'. It contains the following buttons, in order from left to right:

– Save button: the lab file can be saved using the save button. After making any changes to the segments (deletion, addition or dragging), the save button has to be clicked.
– Play the waveform: the entire wave file will be played on pressing this button.
– Play the selection: select some portion of the waveform (say a segment) and play just that part using this button. This button can be used to verify each segment.
– Play from selection: plays the waveform starting from the current selection to the end. Click the mouse on the waveform and a yellow line will appear to show the selection. On clicking this button, the file is played from that selected point to the end.
– Play to selection: plays the waveform from the beginning to the end of the current selection.
– Stop the playback: stops the playing of the wave file.
– Zoom to fit: displays the selected portion of the wave zoomed in.
– Zoom 1: displays the entire wave.
– Zoom in: zooms in on the wave.
– Zoom out: zooms out on the wave.
– Update VOP Panel: after changing the segments (dragging, adding or deleting), the VOP algorithm is recalculated on the new set of segments on clicking this button. After making the changes, the save button must be pressed before updating the VOP panel.

Some screenshots are given below to demonstrate the use of the menu bar. The figure below shows how to select a portion of the waveform (drag using the mouse on the wave panel) and play that part. The selected portion appears shaded in yellow as shown.

The figure below shows how to select a point (click using the mouse on the wave panel) and play from the selection to the end of the file. The selected point appears as a yellow line.


The next figure shows how to select a portion of the wave and zoom to fit.

The next figure shows how the portion of the wave file selected in the above figure appears when zoomed.



5.2 How to do label correction using Labeling tool

Each segment given by the group delay algorithm can be listened to, and with the help of the VOP and EHMM panels it can be decided whether the segment is correct and whether it matches the text segmentation.

• Deletion of a Segment - All the segments appear as red lines in the labeling tool output. A segment can be deleted by right-clicking on that particular segment in the slider panel. The figure below shows the original output of the labeling tool for a Hindi wave file.

The third and fourth segments are very close to each other and one has to be deleted; ideally we delete the fourth one. The VOP panel has given a red colour (an indication to delete one) for that segment. The user can decide whether to delete the boundary to the right or to the left of the red segment after listening. On deletion of the fourth segment (right-click on the segment head in the slider panel), the text segments get shifted and fit after the silence segment, as shown in the figure below.


On listening to each segment it is seen that the segment between and is wrong and has to be deleted. The VOP panel gives a red colour for that segment, and the corresponding peak in the group delay panel is below the threshold. Peaks below the threshold in the group delay curve usually are not segment boundaries, but sometimes the algorithm computes one as a boundary. The threshold value in the GD panel is the middle line, in magenta colour. There are 2 more red columns in the VOP panel. The last one is correct and we have to delete a segment. The second-last red column in the VOP panel is incorrect and GD gives the correct segment, hence it need not be deleted. The VOP is only used as a reference for the GD algorithm, and it can be wrong in some cases. The yellow colour in the VOP panel usually indicates that a new segment should be added, but here the yellow colour appears in the silence region and we ignore it. The figure below shows the corrected segments (after deletion).


On completion of correcting the labels, the save button has to be pressed. On clicking the Save button, a dialog box appears with the message "Lab File Saved. Click Next to Continue". A silence segment gets deleted on clicking the right boundary of the silence segment.

• Update VOP Panel - After saving the changes made to the labels, the VOP update button has to be clicked to recalculate the VOP algorithm on the new segments. The updated output is shown in the figure below.


• Adding a Segment - A segment can be added by right-clicking with the mouse on the slider panel at the point where a segment needs to be added. The figure below shows a case in which a segment needs to be added.

The VOP panel shows three yellow columns here, of which the second yellow column is true. The GD plot shows a small peak in that segment, so we can be sure that the segment has to be added at that peak. In the above figure it can be seen that the mouse is placed on the slider panel at the location where the new segment is to be added. The figure below shows the corresponding corrected wave file after the VOP update has been done.

• Sliding a Segment - A segment can be moved to the left or right by clicking on the head of the segment boundary on the slider panel and dragging it left or right. Sliding can be used, if required, while correcting the labels.

• Modification of a labfile - If a half-corrected lab file is already present (a GD lab file), upload it from the ./labelingTool/labfiles directory in the gd lab file option on the main page. Irrespective of the WSF value, the earlier lab file will be loaded. But if resegmentation is used, the already present labels will be lost and regenerated based on the new WSF value. After modification, when the Save button is pressed, the same labfile is updated, but a backup copy of the lab file is created before updating. Note: if the system creates a lab file with the same name as one that already exists in the labfiles directory, the system creates a backup copy of that file. The backup copy is hidden by default; to view it, just press CTRL + h.


• Logfiles - The tool generates a separate log file for each lab file (e.g. text0001.log) in the ./labelingTool/logfiles directory. Please clean this directory at regular intervals.

5.3 Viewing the labelled file

Once the corrections are made and the save button is clicked, the lab file is generated in the /var/www/html/labelingT directory and can be viewed by clicking on the 'next' link. On clicking 'next', the following message appears: "Download the labfile: labfile". Click on the link 'labfile'; the lab file will appear in the browser window as below.

5.4 Control file

A control file is placed at the location /var/www/html/labelingTool/bin/fewords.base. The parameters in the control file are given below. These parameters can be adjusted by the user to get better segmentation results.

• windowSize - size of the frame for energy computation (a sketch of this computation is given after this parameter list)
• waveType - type of the waveform: 0 Sphere PCM, 1 Sphere Ulaw, 2 plain sample (one short integer per line), 3 RAW (sequence of bytes, 8 bits per sample), 4 RAW16 (two bytes/sample, big endian), 5 RAW16 (two bytes/sample, little endian), 6 Microsoft RIFF (standard wav format)
• winScaleFactor - should be chosen based on the syllable rate; choose by trial and error
• gamma - reduces the dynamic range of energy
• fftOrder and fftSize - MUST be set to ZERO!!
• frameAdvanceSamples - frame shift for energy computation
• medianOrder - order of median smoothing for the group delay function (1 ==> no smoothing)

• ThresEnergy, thresZero, thresSpectralFlatness - thresholds used for voiced/unvoiced detection. When a parameter is set to zero, it is NOT used. The examples here were tested with ENERGY only.
• Sampling rate of the signal - required for giving boundary information in seconds.
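To illustrate how windowSize and frameAdvanceSamples are typically used, the Python sketch below computes a short-time energy contour, the kind of quantity on which the group delay based segmentation operates. This is an assumed, simplified illustration only; it is not the LabelingTool's actual implementation, and the parameter values shown are arbitrary.

    # Simplified short-time energy computation (illustrative; not the tool's implementation).
    # Assumes a 16-bit PCM mono wav file.
    import wave
    import numpy as np

    def short_time_energy(path, window_size, frame_advance):
        with wave.open(path, "rb") as wf:
            samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
        samples = samples.astype(np.float64) / 32768.0
        energies = []
        for start in range(0, len(samples) - window_size + 1, frame_advance):
            frame = samples[start:start + window_size]
            energies.append(float(np.sum(frame * frame)))
        return np.array(energies)

    if __name__ == "__main__":
        # windowSize and frameAdvanceSamples values here are arbitrary examples
        contour = short_time_energy("text_0001.wav", window_size=320, frame_advance=160)
        print(len(contour), "energy frames")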

5.5 Performance results for 6 Indian Languages

Testing was conducted on a set of test sentences for all 6 Indian languages, and the percentage of correctness was calculated using the following formula. The calculations were done after the segmentation was performed using the tool with the best WSF and threshold values.

   Percentage of correctness = [1 - (no. of insertions + no. of deletions) / (total no. of segments)] × 100

   Language      Percentage of Correctness
   Hindi         86.83%
   Malayalam     78.68%
   Telugu        85.40%
   Marathi       80.24%
   Bengali       77.84%
   Tamil         77.38%
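As a hypothetical worked example of this measure: if a labelled file contained 150 segments and manual correction required 10 insertions and 8 deletions, the correctness would be [1 - (10 + 8)/150] × 100 = 88%. These counts are invented purely to illustrate the arithmetic; they are not measurements from the consortium's data.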

5.6 Limitations of the tool

• Zooming is not enabled for the VOP and EHMM panels.
• The waveform is not displayed as accurately as it is in WaveSurfer.

6 Unit Selection Synthesis Using Festival

This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. By "unit selection" we mean the selection of some unit of speech, which may be anything from a whole phrase down to a diphone (or even smaller). Technically, diphone selection is a simple case of this. However, unlike diphone selection, in unit selection there is typically more than one example of the unit, and some mechanism is used to select between them at run time. The theory is obvious, but the design of such systems, finding the appropriate selection criteria and weighting the costs of the candidate units, is a non-trivial problem. Techniques like this often produce very high quality, very natural sounding synthesis. However, they can also produce some very bad synthesis, when the database has unexpected holes and/or the selection costs fail.

6.1 Cluster unit selection

The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units, based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster. The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows (a high level walkthrough of the scripts to run is given after these lower level details):

1. Collect the database of general speech.
2. Build the utterance structures.
3. Build coefficients for acoustic distances, typically some form of cepstrum plus F0, or some pitch synchronous analysis (e.g. LPC).
4. Build distance tables, precalculating the acoustic distance between each unit of the same phone type (a sketch of this step is given after this list).
5. Dump selection features (phone context, prosodic, positional and whatever else) for each unit type.
6. Build cluster trees using wagon with the features and acoustic distances dumped by the previous two stages.
7. Build the voice description itself.
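Step 4 above amounts to computing, for every pair of candidate units of the same phone type, a distance between their acoustic parameter tracks. The Python sketch below shows one simple way this could be done (frame-wise Euclidean distance on cepstral vectors, averaged over the overlapping frames); it is a hypothetical stand-in, not the actual distance measure or weighting used by Festival's clunits code.

    # Hypothetical acoustic distance table for units of one phone type (illustrative only).
    import numpy as np

    def unit_distance(cep_a, cep_b):
        """Mean frame-wise Euclidean distance over the overlapping frames."""
        n = min(len(cep_a), len(cep_b))
        if n == 0:
            return float("inf")
        return float(np.mean(np.linalg.norm(cep_a[:n] - cep_b[:n], axis=1)))

    def distance_table(units):
        """Symmetric table of pairwise distances between all units of the same phone type."""
        table = np.zeros((len(units), len(units)))
        for i in range(len(units)):
            for j in range(i + 1, len(units)):
                table[i, j] = table[j, i] = unit_distance(units[i], units[j])
        return table

    if __name__ == "__main__":
        # each unit is an (n_frames x n_cepstral_coeffs) array; random data for illustration
        rng = np.random.default_rng(0)
        units = [rng.standard_normal((rng.integers(5, 12), 12)) for _ in range(4)]
        print(distance_table(units))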

6.2 Choosing the right unit type

Before you start, you must make a decision about what unit type you are going to use. Note there are two dimensions here. The first is size, such as phone, diphone, demi-syllable. The second is the type itself, which may be simple phone, phone plus stress, phone plus word, etc. The code here and the related files basically assume the unit size is the phone. However, because you may also include a percentage of the previous unit in the acoustic distance measure, this unit size is effectively phone plus previous phone, and thus somewhat diphone-like. The cluster method itself places no restriction on the unit size; it simply clusters the given acoustic units with the given features, but the basic synthesis code

is currently assuming phone sized units. The second dimension, type, is very open, and we expect that controlling this will be a good method to attain high quality general unit selection synthesis. The parameter clunit_name_feat may be used to define the unit type. The simplest conceptual example is the one used in limited domain synthesis: there we distinguish each phone by the word it comes from, so a d from the word "limited" is distinct from the d in the word "domain". Such distinctions can hard-partition the space of phones into types that are more manageable. The decision of how to carve up that space depends largely on the intended use of the database. The more distinctions you make, the less you depend on the clustering acoustic distance, but the more you depend on your labels (and the speech) being (absolutely) correct. The mechanism to define the unit type is through a (typically) user-defined feature function. In the given setup scripts this feature function will be called lisp_INST_LANG_NAME::clunit_name. Thus the voice simply defines the function INST_LANG_NAME::clunit_name to return the unit type for the given segment. If you wanted to make a diphone unit selection voice, this function could simply be:

   (define (INST_LANG_NAME::clunit_name i)
     (string-append
      (item.name i)
      "_"
      (item.feat i "p.name")))

Thus the unit type would be the phone plus its previous phone. Note that the first part of a unit name is assumed to be the phone name in various parts of the code; thus, although you might think it would be neater to return previousphone_phone, that would mess up other parts of the code. In the limited domain case the word is attached to the phone. You can also consider some demi-syllable information, or more, to differentiate between different instances of the same phone. The important thing to remember is that at synthesis time the same function is called to identify the unit type, which is used to select the appropriate cluster tree to select from. Thus you need to ensure that, if you use say diphones, your database really does have all diphones in it.

6.3 Collecting databases for unit selection

Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis. As with diphone databases, the more cleanly and carefully the speech is recorded, the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or, more importantly, will sound best when it is reading


news stories). Again the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.

6.4 Preliminaries

Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise the scripts and examples will fail. There are many ways to organize databases and many of these choices are arbitrary; here is our "arbitrary" layout. The basic database directory should contain the following directories:

bin/ - Any database-specific scripts for processing. Typically this first contains a copy of the standard scripts that are then customized when necessary for the particular database.

wav/ - The waveform files. These should be headered, one utterance per file, with a standard name convention. They should have the extension .wav and a fileid consistent with all other files throughout the database (labels, utterances, pitch marks etc.).

lab/ - The segmental labels. These are usually the master label files; they may contain more information than the labels used by Festival, which will be in festival/relations/Segment/.

lar/ - The EGG files (laryngograph files), if collected.

pm/ - Pitchmark files, as generated from the lar files or from the signal directly.

festival/ - Festival-specific label files.

festival/relations/ - The processed label files for building Festival utterances, held in directories whose names reflect the relation they represent: Segment/, Word/, Syllable/ etc.

festival/utts/ - The utterance files as generated from the festival/relations/ label files.

Other directories will be created for various processing reasons.
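A quick sanity check of this layout is to verify that every waveform has matching label and pitch-mark files with the same fileid. The Python sketch below is a hypothetical helper along those lines; the directory names follow the layout above, while the .lab and .pm extensions are assumptions for the purpose of the example.

    # Hypothetical consistency check for the assumed database layout (illustrative only).
    from pathlib import Path

    def check_database(root):
        root_path = Path(root)
        problems = []
        for wav in sorted((root_path / "wav").glob("*.wav")):
            fileid = wav.stem
            for subdir, ext in (("lab", ".lab"), ("pm", ".pm")):
                expected = root_path / subdir / (fileid + ext)
                if not expected.exists():
                    problems.append("missing " + str(expected))
        return problems

    if __name__ == "__main__":
        for issue in check_database("."):
            print(issue)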

6.5 Building utterance structures for unit selection

In order to make access well defined, you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for segments, syllables, words, phrases, F0 targets and intonation events. Ideally these should all be carefully hand labeled, but in most cases that is impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labeling system you use (including labeling systems that involve human labelers). Note that when a unit selection method fundamentally uses segment boundaries, its quality is ultimately determined by the quality of the segmental labels in the database. For the unit selection algorithm described below, the segmental labels should use the same phoneset as the actual synthesis voice. However, a more detailed phonetic labeling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels

before actual use. Autoaligned databases typically aren’t accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or more usefully semantically correct) and do not require phone boundaries to be accurate. If the database is to be used for unit selection it is very important that the phone boundaries are accurate. Having said this though, we have successfully used the aligner described in the diphone chapter above to label general utterance where we knew which phone string we were looking for, using such an aligner may be a useful first pass, but the result should always be checked by hand. It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels and basically exclude any segments that appear to fall outside the typical range for the segment type. Thus it, is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labeling. This is the desire for researchers in the field, but we are some way from that and the easiest way at present to improve the quality of unit selection algorithms at present is to ensure that segmental labeling is as accurate as possible. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labeling. However it should be added that this unit selection technique (and many others) support what is termed ”optimal coupling” where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to at least a few tens of millisecond boundary labeling errors. For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See the Section called Utterance building in the Chapter called A Practical Speech Synthesis System for a more general discussion of how to build utterance structures for a database.
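For reference, in the FestVox-style voice layout used later in this document (Section 7), these utterance structures are generated from the label files and prompt list with a single command; this is only a usage sketch assuming that layout:

    # Build Festival utterance structures from the segment labels and prompt list.
    festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'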

6.6 Making cepstrum parameter files

In order to cluster similar units in a database we build an acoustic representation of them. This is still a research issue, but in the example here we will use Mel cepstrum. Interestingly, we do not generate these at fixed intervals, but at pitch marks; thus we have a parametric spectral representation of each pitch period. We have found this a better method, though it does require that pitchmarks are reasonably well identified. Here is an example script which will generate these parameters for a database; it is included in festvox/src/unitsel/make_mcep.

    for i in $*
    do
       fname=`basename $i .wav`
       echo $fname MCEP
       $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
    done

The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part of the system is particularly neat, but it does work. When pitch synchronous parameters are built, the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in festvox/src/general/make_lpc can be used to generate those parameters, assuming you have already generated pitch marks. Note that a secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, so less information about the database needs to be loaded at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients, but that should be tried. A more general duration/number-of-pitch-periods match algorithm is also worth defining.
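With the database layout of Section 6.4 and the standard FestVox copies of these scripts in bin/ (as in the walkthrough in Section 7), the coefficients are generated for every waveform from the top of the database directory:

    # Generate Mel cepstral coefficient files into mcep/, one per waveform.
    ./bin/make_mcep wav/*.wav
    # For the pitch-synchronous LPC alternative mentioned above, see festvox/src/general/make_lpc
    # (pitch marks must already have been generated).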

6.7 Building the clusters

Cluster building is mostly automatic. Of course you need the clunits module compiled into your version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits, add

    ALSO_INCLUDE += clunits

to the end of your festival/config/config file, and recompile. To check if an installation already has support for clunits, check the value of the variable *modules*.
The file festvox/src/unitsel/build_clunits.scm contains the basic parameters to build a cluster model for a database that has utterance structures and acoustic parameters. The function build_clunits will build the distance tables, dump the features and build the cluster trees. Many parameters are set for the particular database (and instance of cluster building) through the Lisp variable clunits_params. A reasonable set of defaults is given in that file, and reasonable run-time parameters will be copied into festvox/INST_LANG_VOX_clunits.scm when a new voice is set up.
The function build_clunits runs through all the steps, but in order to better explain what is going on, we will go through each step and explain which parameters affect that substep.
The first stage is to load in all the utterances in the database, sort them into segment types and name them with individual names (as TYPE_NUM). This first stage is required for all other stages, so even if you are not running build_clunits you still need to run this stage first. This is done by the calls

    (format t "Loading utterances and sorting types\n")
    (set! utterances (acost:db_utts_load dt_params))
    (set! unittypes (acost:find_same_types utterances))
    (acost:name_units unittypes)

though the function build_clunits_init will do the same thing. This uses the following parameters:
name STRING
A name for this database.


db_dir FILENAME
The pathname of the database, typically . as in the current directory.
utts_dir FILENAME
The directory containing the utterances.
utts_ext FILENAME
The file extension for the utterance files.
files
The list of file ids in the database.
For example, for the KED example these parameters are

    (name 'ked_timit)
    (db_dir "/usr/awb/data/timit/ked/")
    (utts_dir "festival/utts/")
    (utts_ext ".utt")
    (files ("kdt_001" "kdt_002" "kdt_003" ... ))

In the examples below the list of fileids is extracted from the given prompt file at call time.
The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time, as the clustering will require this number many times. This is done by the following two function calls

    (format t "Loading coefficients\n")
    (acost:utts_load_coeffs utterances)
    (format t "Building distance tables\n")
    (acost:build_disttabs unittypes clunits_params)

The following parameters influence the behaviour.
coeffs_dir FILENAME
The directory (from db_dir) that contains the acoustic coefficients as generated by the script make_mcep.
coeffs_ext FILENAME
The file extension for the coefficient files.
get_stds_per_unit
Takes the value t or nil. If t, the parameters for the type of segment are normalized using the means and standard deviations for that class, so a Mahalanobis-style (normalized Euclidean) distance is found between units rather than a simple Euclidean distance. The recommended value is t.
ac_left_context FLOAT
The amount of the previous unit to be included in the distance; 1.0 means all, 0.0 means none. This parameter may be used to make the acoustic distance sensitive to the previous acoustic context. The recommended value is 0.8.
dur_pen_weight FLOAT
The penalty factor for duration mismatch between units.
F0_pen_weight FLOAT
The penalty factor for F0 mismatch between units.
ac_weights (FLOAT FLOAT ...)
The weights for each parameter in the coefficient files, used while finding the acoustic distance between segments. There must be the same number of weights as there are parameters in the coefficient files. The first parameter is (in normal operations) F0. It is common to give proportionally more weight to F0 than to each other individual parameter. The remaining parameters are typically MFCCs (and possibly delta MFCCs). Finding the right parameters and weightings is one of the key goals in unit selection synthesis, so it is not easy to give concrete recommendations. The following aren't bad, though there may be better ones; we suspect that real human listening tests are probably the best way to find better values. An example is

    (coeffs_dir "mcep/")
    (coeffs_ext ".mcep")
    (dur_pen_weight 0.1)
    (get_stds_per_unit t)
    (ac_left_context 0.8)
    (ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))

The next stage is to dump the features that will be used to index the clusters. Remember that the clusters are defined with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features are those which will be available at text-to-speech time, when no acoustic information is available; thus they include things like phonetic and prosodic context rather than spectral information. The named features may (and probably should) be over-general, allowing the decision tree building program wagon to decide which of these features actually does have an acoustic distinction in the units. The function to dump the features is

    (format t "Dumping features for clustering\n")
    (acost:dump_features unittypes utterances clunits_params)

The parameters which affect this function are
feats_dir FILENAME
The directory where the features will be saved (by segment type).
feats LIST
The list of features to be dumped. These are standard Festival feature names with respect to the Segment relation.
For our KED example these values are

    (feats_dir "festival/feats/")
    (feats
       (occurid
        p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox
        n.name n.ph_vc n.ph_ctype n.ph_vheight n.ph_vlng n.ph_vfront n.ph_vrnd n.ph_cplace n.ph_cvox
        segment_duration seg_pitch p.seg_pitch n.seg_pitch
        R:SylStructure.parent.stress seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
        R:SylStructure.parent.accented pos_in_syl syl_initial syl_final
        R:SylStructure.parent.syl_break R:SylStructure.parent.R:Syllable.p.syl_break
        pp.name pp.ph_vc pp.ph_ctype pp.ph_vheight pp.ph_vlng pp.ph_vfront pp.ph_vrnd pp.ph_cplace pp.ph_cvox))

Now that we have the acoustic distances and the feature descriptions of each unit, the next stage is to find a relationship between those features and the acoustic distances. This we do using the CART tree builder wagon. It will find questions about which features best minimize the acoustic distance between the units in each class. wagon has many options, many of which are appropriate to this task, though it is worth noting that this learning task is somewhat closed: we are trying to classify all the units in the database, so there is no test set as such. However, in synthesis there will be desired units whose feature vectors did not exist in the training set. The clusters are built by the following function

    (format t "Building cluster trees\n")
    (acost:find_clusters (mapcar car unittypes) clunits_params)

The parameters that affect the tree building process are
trees_dir FILENAME
The directory where the decision tree for each segment type will be saved.
wagon_field_desc LIST
A filename of a wagon field descriptor file. This is a standard field description (field name plus field type) that is required by wagon. An example is given in festival/clunits/all.desc, which should be sufficient for the default feature list, though if you change the feature list (or the values those features can take) you may need to change this file.
wagon_progname FILENAME
The pathname for the wagon CART building program. This is a string and may also include any extra parameters you wish to give to wagon.


wagon_cluster_size INT
The minimum cluster size (the wagon -stop value).
prune_reduce INT
The number of elements in each cluster to remove in pruning. This removes the units in the cluster that are furthest from the center. This is done within the wagon training.
cluster_prune_limit INT
This is a post-wagon-build operation on the generated trees (and perhaps a more reliable method of pruning). It defines the maximum number of units that will be in a cluster at a tree leaf (wagon_cluster_size gives the minimum size). This is useful when there are large numbers of some particular unit type which cannot be differentiated, for example silence segments whose context is nothing other than silence. Another use of this is to cause only the center example units to be used. We have used this in building diphone databases from general databases, by making the selection features include only phonetic context features and then restricting the number of diphones we take by making this number 5 or so.
unittype_prune_threshold INT
When making complex unit types, this defines the minimal number of units of that type required before building a tree. When doing cascaded unit selection synthesizers it is often not worth excluding large stages if there is, say, only one example of a particular demi-syllable.
Note that as the distance tables can be large, there is an alternative function that does both the distance table and clustering in one, deleting each distance table immediately after use, so that you only need enough disk space for the largest number of phones in any type. To do this, use

    (acost:disttabs_and_clusters unittypes clunits_params)

and remove the calls to acost:build_disttabs and acost:find_clusters.
In our KED example these have the values

    (trees_dir "festival/trees/")
    (wagon_field_desc "festival/clunits/all.desc")
    (wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
    (wagon_cluster_size 10)
    (prune_reduce 0)

The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names, their files and their positions in them. This is done by the Lisp calls

    (acost:collect_trees (mapcar car unittypes) clunits_params)
    (format t "Saving unit catalogue\n")
    (acost:save_catalogue utterances clunits_params)

The only parameter that affects this is
catalogue_dir FILENAME
The directory where the catalogue will be saved (the name parameter is used to name the file). By default this is

    (catalogue_dir "festival/clunits/")

There are a number of parameters that are specified with a cluster voice. These are related to the run-time aspects of the cluster model. They are
join_weights FLOATLIST
A set of weights, in the same format as ac_weights, that are used in optimal coupling to find the best join point between two candidate units. This is different from ac_weights in that different values are likely desired, particularly increasing the F0 value (column 0).
continuity_weight FLOAT
The factor by which the join cost is multiplied relative to the target cost. This is probably not very relevant, given that the target cost is merely the position from the cluster center.
log_scores 1
If specified, the join scores are converted to logs. For databases that have a tendency to contain non-optimal joins (probably any non-limited-domain database), this may be useful to stop failed synthesis of longer sentences. The problem is that the sum of very large numbers can lead to overflow; this helps reduce that. You could alternatively change the continuity_weight to a number less than 1, which would also partially help. However, such overflows are often a pointer to some other problem (poor distribution of phones in the db), so this is probably just a hack.
optimal_coupling INT
If 1, this uses optimal coupling and searches the cepstrum vectors at each join point to find the best possible join point. This is computationally expensive (as well as having to load in lots of cepstrum files), but does give better results. If the value is 2, only the coupling distance at the given boundary is checked (the boundary is not moved); this is often adequate in good databases (e.g. limited domain), and is certainly faster.
extend_selections INT
If 1, then the selected cluster will be extended to include any unit from the cluster of the previous segment's candidate units that has the correct phone type (and isn't already included in the current cluster). This is experimental but has shown its worth and hence is recommended. This means that instead of selecting just units, selection is effectively selecting the beginnings of multiple-segment units. This option encourages far longer units.
pm_coeffs_dir FILENAME
The directory (from db_dir) where the pitchmarks are.
pm_coeffs_ext FILENAME
The file extension for the pitchmark files.
sig_dir FILENAME
Directory containing waveforms of the units (or residuals if residual LPC is being used, PCM waveforms if PSOLA is being used).
sig_ext FILENAME
File extension for waveforms/residuals.
join_method METHOD
Specifies the method used for joining the selected units. Currently it supports simple, a very naive joining mechanism, and windowed, where the ends of the units are windowed using a Hamming window and then overlapped (no prosodic modification takes place, though). The other two possible values for this feature are none, which does nothing, and modified_lpc, which uses the standard UniSyn module to modify the selected units to match the targets.
clunits_debug 1/2
With a value of 1, some debugging information is printed during synthesis, particularly how many candidate phones are available at each stage (and any extended ones), and where each phone is coming from. With a value of 2, more debugging information is given, including the above plus joining costs (which are very readable by humans).
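In practice, once the utterance structures and coefficient files exist, all of the build stages described above can be run in one pass from the shell. The following invocation (used again in Section 7) is a usage sketch assuming the standard FestVox voice directory layout:

    # Run the complete cluster build: distance tables, feature dump, tree building, catalogue.
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'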


7 Building Festival Voice

In the context of Indian languages, syllable units are found to be a much better choice than units like phones, diphones, and half-phones. Unlike most other languages, in which the basic unit of the writing system is an alphabet, Indian language scripts use the syllable as the basic linguistic unit. The syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the syllabic model is generic to all Indian languages. A syllable is typically of the following form: V, CV, VC, CCV, CCCV, or CCVC, where C is a consonant and V is a vowel. A syllable can be represented as C*VC*, containing at least one vowel and zero, one or more consonants. The following steps explain how to build a syllable-based synthesizer using FestVox (a consolidated command-line sketch of these steps is given after the list).
1. Create a directory and enter it.
$ mkdir iiit_tel_syllable
$ cd iiit_tel_syllable

2. Create the voice setup:
$ $FESTVOXDIR/src/unitsel/setup_clunits iiit tel syllable
$ $FESTVOXDIR/src/prosody/
Before running build_prompts, do the following steps:
(a) Modify your phoneset file so that the syllables are treated as phonemes.
(b) Modify the phoneme label files so that they contain syllable labels.
(c) Remove special symbols from the tokenizer.
(d) Call your pronunciation dictionary module from festvox/iiit_tel_syllable_lexicon.scm.
(e) Finally, change the default phoneme set to your language's unique syllables in the festival/clunits/all.desc file under the p.name field.
3. Generate prompts:
festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'
4. Record prompts:
./bin/prompt_them etc/time.data
5. Label automatically:
$FESTVOXDIR/src/ehmm/bin/do_ehmm help
Run the following steps individually: setup, phseq, feats, bw, align.
6. Generate pitch marks:
./bin/make_pm_wave wav/*.wav
7. Correct the pitch marks:
./bin/make_pm_fix pm/*.pm
Tuning pitch marks:
(a) Convert pitch marks into label format: ./bin/make_pmlab_pm pm/*.pm
(b) After modifying the pitch marks, convert the label format back to pitch marks: ./bin/make_pm_pmlab pm_lab/*.lab
8. Generate Mel cepstral coefficients:
./bin/make_mcep wav/*.wav
9. Generate utterance structures:
festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'


10. Open festival/clunits/all.desc and add all the syllables to the p.name field. Then cluster the units:
festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'
11. Open bin/make_dur_model, remove -stepwise, and build the duration model:
./bin/do_build do_dur
12. Test the voice:
festival festvox/iiit_tel_syllable_clunits.scm '(voice_iiit_tel_syllable_clunits)'
To synthesize a sentence, if you are building the voice on a local machine:
(SayText "your text")
If you are running the voice on a remote machine:
(utt.save.wave (utt.synth (Utterance Text "your text")) "test.wav")
If you want to see the selected units, run the following commands:
(set! utt (SayText "your text"))
(clunits::units_selected utt "filename")
(utt.save.wave utt "filename" "wav")
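For convenience, the command-line portion of the steps above can be collected into a single shell session. This is only a sketch that mirrors the numbered list (the voice name iiit_tel_syllable and prompt file etc/txt.done.data are the examples used above); the manual edits in steps 2, 10 and 11 still have to be done by hand at the appropriate points.

    # Consolidated sketch of the syllable-based voice build (manual edits not shown).
    mkdir iiit_tel_syllable && cd iiit_tel_syllable
    $FESTVOXDIR/src/unitsel/setup_clunits iiit tel syllable
    festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'
    ./bin/prompt_them etc/time.data
    $FESTVOXDIR/src/ehmm/bin/do_ehmm help      # then run: setup, phseq, feats, bw, align
    ./bin/make_pm_wave wav/*.wav
    ./bin/make_pm_fix pm/*.pm
    ./bin/make_mcep wav/*.wav
    festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'
    ./bin/do_build do_dur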


8 Customizing festival for Indian Languages

8.1 Some of the parameters that were customized to deal with Indian languages in the festival framework are:

• Cluster size − This is one of the parameters to be adjusted while building a tree. If the number of nodes for each branch of a tree is very large, it takes more time to synthesize speech, as the time required to search for the appropriate unit is longer. We therefore limit the size of each branch of the tree by specifying the maximum number of nodes, denoted by the cluster size. When the tree is built, the cluster size is limited by putting the clustered set of units through a larger set of questions, limiting the number of units clustered as one type.
• Duration penalty weight (dur_pen_weight) − While synthesizing speech, the duration of each unit being picked is also important, as units of different durations clustered together would make for very unpleasant listening. The dur_pen_weight parameter specifies how much importance should be given to the duration of the unit when the synthesizer is trying to pick units for synthesis. A high value of dur_pen_weight means a unit very similar in duration to the required unit is picked; otherwise not much importance is given to duration and more importance is given to other features of the unit.
• Fundamental frequency penalty weight (F0_pen_weight) − When listening to synthesized speech, an abrupt change in pitch between units is not pleasing to the ear. The F0_pen_weight parameter specifies how much importance is given to F0 while selecting a unit for synthesis. The F0 is calculated at the center of the unit, which is approximately where the vowel lies and which plays a major role in the F0 contour of the unit. We therefore try to select units which have similar values of F0, to avoid fluctuations in the F0 contour of the synthesized speech.
• ac_left_context − In speech, the way a particular unit is spoken depends a lot on the preceding and succeeding units, i.e. the context in which the unit is spoken. Usually a unit is picked based on what the succeeding unit is; ac_left_context specifies the importance given to picking a unit based on what the preceding unit was.
• Phrase markers − It is very hard to make sense of something that is said without a pause. It is therefore important to have pauses at the end of phrases to make what is spoken intelligible. Hindi has certain units called phrase markers which usually mark the end of a phrase. For the purpose of inserting silences at the end of phrases, these phrase markers were identified and a silence was inserted each time one of them was encountered.
• Morpheme tags − There are no phrase markers in Tamil, but there are units called morpheme tags, found at the end of words, which can be used to predict silences. The voice was built using these tags to predict phrase-end silences while synthesizing speech.
• Handling silences − Since there are a large number of silences in the database, a silence of the wrong duration in the wrong place is a common problem. There is a chance that a long silence is inserted at the end of a phrase, or an extremely short silence at the end of a sentence, which sounds very inappropriate. The silence units were therefore classified into two types: SSIL, the silence at the end of a phrase, and LSIL, the silence at the end of a sentence. The silence at the end of a phrase will be of short duration, while the silence at the end of a sentence will be of long duration.
• Inserting commas − Just picking phrase markers was not sufficient to make the speech prosodically rich. Commas were inserted in the text wherever a pause might have occurred, and the tree was built using these commas so that the locations of these commas could be predicted as pauses while synthesizing speech.
• Duration modeling − This was done so as to include the duration of the unit as a feature while building the tree, and also as a feature to narrow down the number of units selected while picking units for synthesis.
• Prosody modeling − This was achieved through phrase markers and by inserting commas in the text. Prosody modeling was done to make the synthesized speech more expressive, so that it will be more usable for visually challenged persons.
• Geminates − In Indian languages it is very important to preserve the intra-word pause while speaking, as a word spoken without the intra-word pause would have a completely different meaning. These intra-word pauses are called geminates, and care has been taken to preserve them during synthesis.

8.2 Modifications in source code

1. Add the three lines below to txt.done.data, and also add the corresponding wav and lab files to the respective folders:
( text_0998 "LSIL" )
( text_0999 "SSIL" )
( text_0000 mono )
2. Inside the bin folder, make the following modification in the make_pm_wave file. Comment out the line
PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'
and add the following line in its place:
PM_ARGS='-min 0.003 -max 0.7 -def 0.01 -wave_end -lx_lf 340 -lx_lo 91 -lx_hf 140 -lx_ho 51 -med_o 0'
3. Open the festvox/build_clunits.scm file:
=> Go to line 69, i.e. '(ac_left_context 0.8), and change the value 0.8 to 0.1.
=> Go to line 87, i.e. '(wagon_cluster_size 20), and change the value 20 to 7.
=> Go to line 89, i.e. '(cluster_prune_limit 40), and change the value 40 to 10.
4. Open the festvox/voicefoldername_clunits.scm file:
=> Go to line 136, '(optimal_coupling 1), and change the value 1 to 2.
5. Handling SIL − For a small system this issue does not need to be handled, but in a system with a large database the multiple occurrences of SIL create a problem. To solve the issue do the following:
=> Go to line 161; the line starts with (define (VOICE_FOLDER_NAME::clunit_name i). Replace the entire function with the following code.


(define (VOICE_FOLDER_NAME::clunit_name i)
  "(VOICE_FOLDER_NAME::clunit_name i)
Defines the unit name for unit selection for tam. This can be modified. It changes
the basic classification of unit for clustering. By default we just use the phone
name, but we may want to make this present phone plus previous phone (or something else)."
  (let ((name (item.name i)))
    (cond
     ((and (not iitm_tam_aarthi::clunits_loaded)
           (or (string-equal "h#" name)
               (string-equal "1" (item.feat i "ignore"))
               (and (string-equal "pau" name)
                    (or (string-equal "pau" (item.feat i "p.name"))
                        (string-equal "h#" (item.feat i "p.name")))
                    (string-equal "pau" (item.feat i "n.name")))))
      "ignore")
     ((string-equal name "SIL")
      ;; (set! pau_count (+ pau_count 1))
      (string-append name "_" (item.feat i "p.name") (item.feat i "p.p.name")))
     ;; Comment out this if you want a more interesting unit name
     ((null nil)
      name)
     ;; Comment out the above if you want to use these rules
     ;((string-equal "+" (item.feat i "ph_vc"))
     ; (string-append
     ;  name
     ;  "_"
     ;  (item.feat i "R:SylStructure.parent.stress")
     ;  "_"
     ;  (iiit_tel_lenina::nextvoicing i)))
     ;((string-equal name "SIL")
     ; (string-append
     ;  name
     ;  "_"
     ;  (VOICE_FOLDER_NAME::nextvoicing i)))
     ;(t
     ; (string-append
     ;  name
     ;  "_"
     ;  ;; (item.feat i "seg_onsetcoda")
     ;  ;; "_"
     ;  (iiit_tel_lenina::nextvoicing i)))
     )))
6. Then go to line number 309 and add the following code:

(define (phrase_number word)
  "(phrase_number word)
Phrase number of a given word in a sentence."
  (cond
   ((null word) 0)                                                  ;; beginning of utterance
   ((string-equal ";" (item.feat word "p.R:Token.parent.punc")) 0)  ;; end of a sentence
   ((string-equal "," (item.feat word "p.R:Token.parent.punc"))
    (+ 1 (phrase_number (item.prev word))))                         ;; end of a phrase
   (t (+ 0 (phrase_number (item.prev word))))))

7. Go to the festival/clunits/ folder:
===> Replace the all.desc file and copy the syllables and phones to both the p.name and n.name fields.
8. Generate the phoneset units along with their features, to be included in the phoneset.scm file, by running create_phoneset_languageName.pl. The phoneset.scm file contains a list of all units along with their phonetic features. The create_phoneset.pl script takes every syllable, breaks it down into smaller units and dumps their phonetic features into the phoneset.scm file. For every syllable the script first checks whether the vowel present in the syllable is a short vowel or a long vowel, and depending on this a particular value is assigned to that field. After that the starting and ending consonants of the syllable are checked, and depending on the place of articulation of the consonants a particular value is assigned to that field. Depending on the type of vowel and the type of beginning and end consonants we can then assign a value to the type-of-vowel field as well. The fields for manner of articulation are kept as zero.
9. In the VoiceFolderName_phoneset.scm file:
Uncomment the following line during TRAINING: (PhoneSet.silences '(SIL))
Uncomment the following line during TESTING: (PhoneSet.silences '(SSIL))
10. In the VoiceFolderName_phoneset.scm file we have to change the phoneset definitions. Replace the feature definitions in the defPhoneSet function with the following code.

  (;; vowel or consonant
   (vlng 1 0)
   ;; full vowel
   (fv 1 0)
   ;; syllable type - v vc/vcc cv/ccv cvc/cvcc
   (syll_type 1 2 3 4 0)
   ;; place of articulation of c1
   (poa_c1 1 2 3 4 5 6 7 0)
   ;; manner of articulation of c1
   (moa_c1 + - 0)
   ;; place of articulation of c2: labial alveolar palatal labio-dental dental velar
   (poa_c2 1 2 3 4 5 6 7 0)
   ;; manner of articulation of c2
   (moa_c2 + - 0)
   )

11. When running clunits, i.e. the final step, remove (text_0000 "mono") and (text_0000-2 "phone") from txt.done.data (if they exist).
12. Go to the VoiceFolderName_lexicon.scm file (calling the parser in the lexicon file). Go to line number 137 and add the following code in the hand-written letter-to-sound rules section:

(define (iitm_tam_lts_function word features)
  "(iitm_hin_lts_function WORD FEATURES)
Return pronunciation of word not in lexicon."
  (cond
   ((string-equal "LSIL" word)
    (set! wordstruct '( (("LSIL") 0) ))
    (list word nil wordstruct))
   ((string-equal "SSIL" word)
    (set! wordstruct '( (("SSIL") 0) ))
    (list word nil wordstruct))
   ((string-equal "mono" word)
    (set! myfilepointer (fopen "unit_size.sh" "w"))
    (format myfilepointer "%s" "mono")
    (fclose myfilepointer))
   ((string-equal word "phone")
    (set! myfilepointer (fopen "unit_size.sh" "w"))
    (format myfilepointer "%s" "phone")
    (fclose myfilepointer))
   (t
    (set! myfilepointer (fopen (path-append VoiceFolderName::dir "parser.sh") "w"))
    ;; (format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir "bin/il_parser-train.pl") word VoiceFolderName::dir)
    (format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir "bin/il_parser-test.pl") word VoiceFolderName::dir)
    (fclose myfilepointer)
    ;; (print "called")
    (system "chmod +x parser.sh")
    (system "./parser.sh")
    ;; (format t "%l\n" word)
    (load (path-append VoiceFolderName::dir "wordpronunciation"))
    (list word a wordstruct))))

During the training process uncomment the il_parser-train.pl line; during testing uncomment the il_parser-test.pl line.
13. Creating the pronunciation dictionary: perl test.pl
File names to be edited in il_parser_pronun_dict.pl:
(a) the file containing unique clusters, e.g. my $file="./unique_clusters_artistName";
(b) the pronunciation dictionary to be created: my $oF = "pronunciationdict_artistName";


(c) Rename the created pronunciation dictionary to instituteName_language_lex.out. Add "MNCL" (without the quotes) as the first line of instituteName_language_lex.out and put it into the festvox directory.
14. To handle English words, we use the preprocessing3.pl Perl script. When an English word is encountered, if it is not in the pronunciation dictionary, it will be sent to the parser, and the parser will send the word to preprocessing3.pl, which will generate wordpronunciation by splitting the word into individual letters. E.g. for the input word Ram, R A M will be the output in wordpronunciation.
15. To handle numbers, abbreviations, dates and times, we have included a separate scm file in the festvox folder called tokentowords.scm.


9 Troubleshooting in festival

9.1 Troubleshooting (issues related with festival)

Some errors, with solutions, that may occur during the installation/building process:
• Error: /usr/bin/ld: cannot find -lcurses
  Solution: sudo ln -s /lib/libncurses.so.5 /lib/libcurses.so
• Error: /usr/bin/ld: cannot find -lncurses
  Solution: apt-get install libncurses5-dev
• Error: /usr/bin/ld: cannot find -lstdc++
  Solution: sudo ln -s /usr/lib/libstdc++.so.6 /lib/libstdc++.so
• Error: gcc: error trying to exec 'cc1plus': execvp: No such file or directory
  Solution: sudo apt-get install g++
• Error: ln -s festival/bin/festival /usr/bin/festival
  ln: accessing '/usr/bin/festival': Too many levels of symbolic links
  Solution: sudo mv /usr/bin/festival /usr/bin/festival.orig
  ln -s /home/boss/festival/festival/src/main/festival /usr/bin/festival
  ln: creating symbolic link '/usr/bin/festival' to '/home/boss/festival/festival/

9.2 Troubleshooting (issues that might occur while synthesizing)

Error: Linux: can't open /dev/dsp
Solution: Go to your home directory and open .festivalrc (if it is not there, just create it):
$ cd
$ sudo gedit .festivalrc
Add the following lines to this file and save:
(Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
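Once .festivalrc has been saved, the audio path can be checked quickly from the shell. This is only a sanity-check sketch; any short text will do:

    # Pipe a single synthesis command into festival and listen for output.
    echo '(SayText "testing audio output")' | festival --pipe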


10 ORCA Screen Reader

Integrating Festival voices with Orca:
1. Place the voices in the festival/lib/voices/ folder, as Orca will load all the voices in this folder.
2. Edit the clunits.scm file for the voice in the festvox folder of the voice. (E.g. for Hindi, take the iitm_hin_anjana_clunits.scm file in the festival/lib/voices/hindi/iitm_hin_anjana_clunits/festvox/ folder and insert the following code just above the last line in the file. The last line in the file will be (provide 'voice_name); e.g. for Hindi the last line in the clunits.scm file will be (provide 'iitm_hin_anjana_clunits).)
(proclaim_voice
 'iitm_hin_anjana_clunits
 '((language hindi)
   (gender female)
   (dialect hindi)
   (description "This voice provides an Indian hindi language")
   (builtwith festvox2.1)
   (coding UTF8)))
Give the correct voice name and language in the above code.
3. Start festival in server mode with the following command:
festival -i --heap 2100000 --server
4. Start orca with the following command at another command prompt:
orca -n
5. Click on the Orca preferences button.
6. Click on the Speech tab.
7. The following fields should have the entries given here:
• Speech system should be GNOME Speech Services
• Speech synthesizer should be Festival GNOME Speech Driver
• Voice settings should be Default
• Select the Festival voice for your language from the drop-down list for the Person entry
8. Click on the Apply button and then the OK button.
9. Now Orca should be able to read the language data.
10. If the festival synthesizer does not load, install gnome-speech-swift (on Ubuntu, sudo apt-get install gnome-speech-swift) and start festival and orca again. If the Ubuntu version is greater than 10.04, the festival speech dispatcher has to be installed, as the GNOME speech driver is not supported in later versions of Ubuntu. Use the following command to install the speech dispatcher:
sudo apt-get install speech-dispatcher-festival


11. If a timeout occurs for Orca, type locate settings.py at the command prompt and open the files named settings.py in any Orca-related folders (usually there is more than one). Search for the phrase timeoutTime and change its value to 30. Do the same for all files named settings.py, then start festival and Orca again.
12. If an English word is not in the database, it is spelled out.
13. The setup can be tested using a gedit file containing your language data.
14. The cursor should be placed in front of the sentence to be read, using the keyboard arrow keys. Move the cursor to different lines in the file for it to read line by line.
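The two launch commands from steps 3 and 4 can be wrapped in a small helper script so that the Festival server is already running when Orca starts. This is only a convenience sketch; the options are exactly those given in the steps above.

    #!/bin/sh
    # Start the Festival server in the background, then launch Orca (steps 3 and 4 above).
    festival -i --heap 2100000 --server &
    sleep 2      # give the server a moment to come up
    orca -n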


11 NVDA Windows Screen Reader

11.1 Compiling Festival in Windows

1. Visual Studio 2008 (VC 9.0) Standard Edition must be successfully installed.
2. The service pack for Visual Studio 2008 must be installed.
3. Install cygwin.
4. Rename \speech_tools\config\systems\ix86_CYGWIN1.5.mak to \speech_tools\config\systems\ix86_unknown.mak (if a "file not found" error comes up).
Note: copy the new module (il_parser) to your festival/src/modules/ folder before compiling speech_tools and festival. Copy the Makefile provided to the festival/src/modules/ folder.
Follow the steps mentioned in http://www.eguidedog.net/doc_build_win_festival.php
There are more changes we made apart from those mentioned in the web page, which are listed below.
IMPORTANT: The following changes need to be made only if errors are thrown for these files while compiling festival following the steps in the link given above.
5. speech_tools/include/EST.h must have the following change: #include should be added at line 45 before using namespace std;
6. speech_tools/include/EST_math.h must have the following change: #include <iostream> should be added at line 54 after #include
7. speech_tools/include/EST_TKVL.h must have the following change: #include should be added at line 43 before using namespace std;
8. speech_tools/include/EST_Token.h must have the following change: #include should be added at line 44 before using namespace std;
9. speech_tools/include/EST_TrackMap.h must have the following change: #include should be added at line 38 before using namespace std;
10. speech_tools/stats/wagon/wagon_aux.cc must have the following change: #include "EST_Math.h" should be added at line 47 after #include "EST_Wagon.h"
11. speech_tools/stats/EST_DProbDist.cc must have the following changes: long long l; on line 62 must be changed to long l; and l = (long long)c; on line 66 must be changed to l = (long)c;
12. speech_tools/utils/EST_cutils.c must have the following change:
if (((tdir=getenv("TMPDIR")) == NULL) || ((tdir=getenv("TEMP")) == NULL) || ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";
must be replaced by
if (((tdir=getenv("TMPDIR")) == NULL) && ((tdir=getenv("TEMP")) == NULL) && ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";


13. speech_tools/utils/EST_ServiceTable.cc must have the following change. The following code must be moved from the end of the file to line 52, after #include "EST_ServiceTable.h":
Declare_KVL_T(EST_String, EST_ServiceTable::Entry, EST_String_ST_entry)
#if defined(INSTANTIATE_TEMPLATES)
#include "../base_class/EST_TList.cc"
#include "../base_class/EST_TSortable.cc"
#include "../base_class/EST_TKVL.cc"
Instantiate_KVL_T(EST_String, EST_ServiceTable::Entry, EST_String_ST_entry)
#endif
14. festival/src/main/festival_client.cc must have the following change: #include should be added at line 42 before using namespace std;
15. festival/src/main/festival_main.cc must have the following change: #include should be added at line 42 before using namespace std;

16. In festival/src/modules/MultiSyn/EST_FlatTargetCost.cc and in festival/src/modules/MultiSyn/EST_FlatT all references to the variables WORD and PWORD must be replaced by WORD1 and PWORD1 (WORD and PWORD are keywords in VC++, so the variable names have to be changed).
Change in EST_FlatTargetCost.h:
enum tcdata_t { VOWEL, SIL, BAD_DUR, NBAD_DUR, BAD_OOL, NBAD_OOL, BAD_F0, SYL, SYL_STRESS, N_SIL, N_VOWEL, NSYL, SYL_STRESS, RC, NNBAD_DUR, NNSYL, LC, PBAD_DUR, PSYL, WORD, NWORD, NNWORD, PWORD, SYLPOS, WORDPOS, PBREAK, POS, PUNC, NPOS, NPUNC, TCHI_LAST };
must be changed to
enum tcdata_t { VOWEL, SIL, BAD_DUR, NBAD_DUR, BAD_OOL, NBAD_OOL, BAD_F0, SYL, SYL_STRESS, N_SIL, N_VOWEL, NSYL, NSYL_STRESS, RC, NNBAD_DUR, NNSYL, LC, PBAD_DUR, PSYL, WORD1, NWORD, NNWORD, PWORD1, SYLPOS, WORDPOS, PBREAK, POS, PUNC, NPOS, NPUNC, TCHI_LAST };
Change in EST_FlatTargetCost.cc:
In function TCData *EST_FlatTargetCost::flatpack(EST_Item *seg) const


// seg word feature
if(word=tc_get_word(seg)) (*f)[WORD]=simple_id(word->S("id"));
else (*f)[WORD]=0;
must be replaced by
// seg word feature
if(word=tc_get_word(seg)) (*f)[WORD1]=simple_id(word->S("id"));
else (*f)[WORD1]=0;
In function TCData *EST_FlatTargetCost::flatpack(EST_Item *seg) const
// Prev seg word feature
if(seg->prev() && (word=tc_get_word(seg->prev()))) (*f)[PWORD]=simple_id(word->S("id"));
else (*f)[PWORD]=0;
must be replaced by
// Prev seg word feature
if(seg->prev() && (word=tc_get_word(seg->prev()))) (*f)[PWORD1]=simple_id(word->S("id"));
else (*f)[PWORD1]=0;
In function TCData *EST_FlatTargetCost::flatpack(EST_Item *seg) const
// segs wordpos
(*f)[WORDPOS]=0; // medial
if( f->a_no_check(WORD) != f->a_no_check(NWORD) ) (*f)[WORDPOS]=1; // inter
else if( f->a_no_check(WORD) != f->a_no_check(PWORD) ) (*f)[WORDPOS]=2; // initial
else if( f->a_no_check(NWORD) != f->a_no_check(NNWORD) ) (*f)[WORDPOS]=3; // final
must be replaced by
// segs wordpos
(*f)[WORDPOS]=0; // medial
if( f->a_no_check(WORD1) != f->a_no_check(NWORD) ) (*f)[WORDPOS]=1; // inter
else if( f->a_no_check(WORD1) != f->a_no_check(PWORD1) ) (*f)[WORDPOS]=2; // initial
else if( f->a_no_check(NWORD) != f->a_no_check(NNWORD) ) (*f)[WORDPOS]=3; // final
In function float EST_FlatTargetCost::position_in_phrase_cost() const
if ( !t->a_no_check(WORD) && !c->a_no_check(WORD) ) return 0;
if ( !t->a_no_check(WORD) || !c->a_no_check(WORD) ) return 1;
must be replaced by
if ( !t->a_no_check(WORD1) && !c->a_no_check(WORD1) ) return 0;
if ( !t->a_no_check(WORD1) || !c->a_no_check(WORD1) ) return 1;
In function float EST_FlatTargetCost::punctuation_cost() const
if ( (t->a_no_check(WORD) && !c->a_no_check(WORD)) || (!t->a_no_check(WORD) && c->a_no_check(WORD)) ) score += 0.5;
else if (t->a_no_check(WORD) && c->a_no_check(WORD))
must be replaced by
if ( (t->a_no_check(WORD1) && !c->a_no_check(WORD1)) || (!t->a_no_check(WORD1) && c->a_no_check(WORD1)) ) score += 0.5;
else if (t->a_no_check(WORD1) && c->a_no_check(WORD1))
In function float EST_FlatTargetCost::partofspeech_cost() const
// Compare left phone half of diphone
if(!t->a_no_check(WORD) && !c->a_no_check(WORD)) return 0;
if(!t->a_no_check(WORD) || !c->a_no_check(WORD)) return 1;
must be replaced by


// Compare left phone half of diphone if(!t->a no check(WORD1) && !c->a no check(WORD1)) return 0; if(!t->a no check(WORD1) k !c->a no check(WORD1)) return 1; 17. festival/src/modules/MultiSyn/EST JoinCostCache.h must have the following changes . comment out the following portion of the code static const unsigned char minVal = 0x0; static const unsigned char maxVal = 0xff; static const unsigned char defVal = 0xff; 18. festival/src/modules/MultiSyn/EST JoinCostCache.cc, replace all minVal by 0x0, all maxVal by 0xff and all defVal by 0xff Following are the changes comment out // #include if( a == b ) return minVal; else if( b > a ) return cache[(b*(b-1)>>1)+a]; else return cache[(a*(a-1)>>1)+b]; return defVal; Must be replaced by if( a == b ) return 0x0; else if( b > a ) return cache[(b*(b-1)>>1)+a]; else return cache[(a*(a-1)>>1)+b]; return 0xff; unsigned int qleveln = maxVal-minVal; must be replaced by unsigned int qleveln = 0xff-0x0; if( cost >= ulimit ) qcost = maxVal; else if( cost = ulimit ) qcost = 0xff; else if( cost features().val(”name”).String() ); const EST String &right phone( cand right->features().val(”name”).String() ); if( ph is vowel( left phone ) k ph is approximant( left phone ) k ph is liquid( left phone ) k ph is nasal( left phone ) ) Replace by if( ph is vowel( cand left->features().val(”name”).String() ) k ph is approximant( cand left->features().val(”name”).String() ) k ph is liquid( cand left->features().val(”name”).String() ) k ph is nasal( cand left->features().val(”name”).String() ) ) if( ph is vowel( right phone ) k ph k ph k ph fv =

is approximant( right phone ) is liquid( right phone ) is nasal( right phone ) ) fvector( cand right->f(”midcoef”) );

replace by if( ph is vowel( cand->next()->features().val(”name”).String() ) k ph is approximant( cand->next()->features().val(”name”).String() ) k ph is liquid( cand->next()->features().val(”name”).String() ) k ph is nasal( cand->next()->features().val(”name”).String() ) ) fv = fvector( cand->next()>f(”midcoef”) ); 20. festival/src/modules/Text/token.cc must have the following changes . #include should be added at line 48 before using namespace std; 21. festival/src/modules/UniSyn/us mapping.cc must have the following changes . declare int i; separately in the following functions and remove the declarations from for loops


(a) void make linear mapping(EST Track &pm, EST IVector &map) (b) static void pitchmarksToSpaces( const EST Track &pm, EST IVector *spaces, int start pm, int end pm, int wav srate ) (c) void make join interpolate mapping( const EST Track &source pm,EST Track &target pm, const EST Relation &units, EST IVector &map ) void make join interpolate mapping2( const EST Track &source pm, EST Track &target pm, const EST Relation &units, EST IVector &map ) 22. festival/src/modules/UniSyn/us prosody.cc must have the following changes . In function void F0 to pitchmarks(EST Track &fz, EST Track &pm, int num channels, float default F0 , float target end) remove the declaration of i in the for loop for( int i=0; i D:\fest install\festival\src\arch\festival festival main.cc -> D:\fest install\festival\src\main clunits.cc -> D:\fest install\festival\src\modules\clunits EST wave utils.cc -> D:\fest install\speech tools\speech class config.ini copy it to voice folder (hindi\iitm hin anjana clunits) config.ini file will be accessed by sapi code. (voice iitm hin anjana clunits)

This file has the command to set a voice

Now you need to compile festival as per the steps given in the section Compiling Festival in Windows above. 4. Install the Microsoft SDK from the link http://www.microsoft.com/download/en/details.aspx?id=11310

5. Replace the SampleTTSEngine folder { C:\ProgramFiles\MicrosoftSDKs\Windows\v6.1\Samples\winui\ } with the code provided by IITM.
6. In this SampleTTSEngine solution a file called register-vox.cpp has the details of our voice. The name and the language code should be changed for the respective languages.
7. Two environment variables have to be created:
FESTLIBDIR D:\festival\festival\lib
This should point to where your voice is kept. The lib folder should be there with all the scm files, and the voice should be kept in this lib folder under the voices\hindi folder.
voice_path D:\festival\festival\lib\voices\hindi\iitm_hin_anjana_clunits\
This will point to the voice folder.
8. Check the properties of the SampleTTSEngine solution. The library and include paths of festival and speech_tools must point to the correct locations. These libraries will be built when we compile festival and speech_tools (point 1).
9. Compile the SampleTTSEngine solution in release mode. It will generate SampleTtsEngine.dll.


10. Check if an entry is there in the registry (HKEY_LOCAL_MACHINE -> software -> Microsoft -> speech -> voices -> Tokens). An entry for our voice will be there.
11. Test with the sample TTS application (Control Panel -> Speech -> Text to speech) or with TTSAPP.exe that comes with the SDK.
12. If it works in these applications, now try it in NVDA or JAWS.


13 Sphere Converter Tool

The tool was developed to convert speech files in different formats to the standard SPHERE format. In the SPHERE format there is a header which holds all the details of the speech file. The speech files can be in wav, raw or mu-law format. The SPHERE files can either be encoded with wavpack or shorten encoding, or kept in the same format as the input speech file. The input file (either mu-law, wav or raw) is converted to a SPHERE file (either encoded with wavpack, shorten or no encoding) with a SPHERE header.
SPHERE files contain a strictly defined header portion followed by the file body (waveform). The header is an object-oriented, 1024-byte blocked, ASCII structure which is prepended to the waveform data. The header is composed of a fixed-format portion followed by an object-oriented variable portion. The fixed portion is as follows:
NIST_1A
   1024
The remaining object-oriented variable portion is composed of the header fields shown below. Below is a sample SPHERE header that this module generates; the first four fields are user-defined fields taken from the config file.
NIST_1A
   1024
location_id -s13 TTS_IITMadras
database_id -s22 Sujatha_20_RadioJockey
utterance_id -s9 Suj_trial
sample_sig_bits -i 16
channel_count -i 1
sample_n_bytes -i 2
sample_rate -i 16000
sample_count -i 46563
sample_coding -s3 pcm
sample_byte_format -s2 01
sample_min -i 16387
sample_max -i 23904
end_head

13.1 Extraction of details from the header of the input file

The input file can be a wav file, either PCM or mu-law encoded. The header of a wav file is shown in a table at the end of the document.


The necessary information is extracted from the header of the input file. If the fact chunk is present in the header, the sample count is obtained from the header; otherwise it is calculated as follows. The total number of data bytes is obtained from the cksize (second) field of the data chunk, and the bits per sample from the corresponding field of the format chunk. Then
bytes per sample = (bits per sample) / 8
sample count = (number of data bytes) / (bytes per sample)
In the SPHERE package the byte format of the data is stored in the field sample_byte_format. If the sample data is in little-endian format this field is given the value 01, if the data is in big-endian format the value is 10, and if the samples are single-byte the value is 1.
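As a small worked example of the calculation above (the byte count is illustrative, chosen to reproduce the sample_count of 46563 shown in the example header): a 16-bit file with 93126 data bytes gives

    # bits_per_sample = 16  ->  bytes_per_sample = 16 / 8 = 2
    # sample_count    = 93126 / 2 = 46563
    echo $(( 93126 / (16 / 8) ))    # prints 46563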

13.1.1 Calculate sample minimum and maximum values

The objective of this module is to find the maximum and minimum sample values among the sample data present in the input file. Each sample is read from the data part of the file, and the running maximum and minimum values are updated accordingly.

13.1.2 RAW files

RAW files are headerless audio files. The sample rate, sample size, channel count and data encoding must be given by the user in the config file for the program to read the file successfully. The sample count is obtained by counting the number of samples read while calculating the sample minimum and maximum values.

13.1.3 MULAW files

If the input file is a mu-law encoded file, the AudioFormat field in the format chunk of the header will have the value 7 and the FACT chunk will be present in the header.

13.1.4 Output in encoded format

The data in the output SPHERE file can be a shorten-compressed byte stream, a wavpack-compressed byte stream, or the data as present in the input file.

13.2 Configfile

The user-defined fields to be added to the header can be kept in this file, which is to be placed at the location where the executables are placed. The output SPHERE files can be played with the WaveSurfer utility. The SPHERE files have the .sph extension, and the SPHERE header can be verified by opening the file. The file can be opened in a hex editor (e.g. ghex2) to verify the header fields and the size of the file.


14 Sphere Converter User Manual

14.1 How to install the Sphere Converter tool

1. Untar sphere_2.6a.tar.Z (use tar -xvzf, or zcat sphere_2.6a.tar.Z | tar -xvf -):
tar -xvzf sphere_2.6a.tar.Z
2. A folder by the name 'nist' will be created.
3. Change the file exit.c (nist/src/lib/sp): replace
extern char *sys_errlist[];
by the following
#ifdef NARCH_linux
#include
#else
extern char *sys_errlist[];
#endif
4. Go to the folder 'nist' (cd nist) and install nist as follows:
sh src/scripts/install.sh
(a) : Sun OS4.1.[12]
(b) : Sun Solaris
(c) : Next OS
(d) : Dec OSF/1 (with gcc)
(e) : Dec OSF/1 (with cc)
(f) : SGI IRIX
(g) : HP Unix (with gcc)
(h) : HP Unix (with cc)
(i) : IBM AIX
(j) : Custom
Please Choose one: 10
What is/are the Compiler Command ? [cc] cc
OK, The Compiler Command command is 'cc'. Is this OK? [yes] yes
What is/are the Compiler Flags ? [-g] -g -c
OK, The Compiler Flags command is '-g -c'. Is this OK? [yes] yes
What is/are the Install Command ? [install -s -m 755] install -s -m 755
What is/are the Archive Sorting Command ? [ranlib]
What is/are the Archive Update Command ? [ar ru]
What is/are the Architecture ? [SUN] linux
OK, The Architecture command is 'linux'. Is this OK? [yes] yes


5. Copy the following files from the c files folder to nist/bin or to any user-defined folder:
decode_sphere.c
encode_sphere.c
configfile
compare.c
convert_to_sphere.c
wavtosphere.sh
6. Compile using (in the bin folder)
cc convert_to_sphere.c -combine decode_sphere.c encode_sphere.c -I../include/ -lsp -lutil -lm -L../lib/
If the c files are not in nist/bin but in a user-defined folder, give the appropriate paths for nist/lib and nist/include in the above command. This will create a file a.out which will be used by the front end of the tool.
7. Install Qt. Compile the Qt bin file qt-sdk-linux-x86-opensource-2010.02.bin:
./qt-sdk-linux-x86-opensource-2010.02.bin
A folder by the name qtsdk-2010.02 will be created. With this, installing the Qt software is done.
8. Now, to compile the sphere converter code using Qt:
Extract sphereconverter.tar
cd sphereconverter
Execute the following commands:
/home/......./qtsdk-2010.02/qt/bin/qmake -project   (This creates the .pro file.)
/home/......./qtsdk-2010.02/qt/bin/qmake            (This creates the Makefile.)
make                                                (This builds the project and creates the sphereconverter executable.)
9. Copy the sphereconverter executable to the bin folder of nist (cd nist/bin) or to the user-defined folder containing the c files.
10. Copy this user manual to both the sphereconverter and nist/bin folders (or to the user-defined folder containing the c files) to access it at all times from the help button in the tool.
11. Now, clicking on this executable in /home/.../nist/bin (or in the user-defined folder containing the c files) will run the tool.

14.2 How to use the tool

• Select the type of file to be converted using the radio buttons: wav (pcm or mulaw) or raw.
• Select whether a single file or a bulk of files has to be converted.
• If single file conversion is selected:
1. Load the input file with extension either 'wav' or 'raw'. It can be browsed using the 'Open' button. The files with the selected extension will be listed when browsing.
2. Specify the name of the output sphere file and where the output file has to be saved, using the 'Save as' button.
• If bulk file conversion is selected:


1. Load the input folder containing wav files or raw files, depending on the type of file selected. It can be browsed using the 'Open' button.
2. If the input folder contains files (wav or raw) other than the type of file selected, the other type of files will be converted to corresponding sphere files with default properties.
3. Specify the name of the output folder where the sphere files have to be located.
• Click on 'Edit properties' to enter the details that will be stored in the sphere header.
• If properties are not edited, the default properties stored in the configfile (present in the nist/bin folder) will be used.
• Select the type of encoding for the output sphere file. It can be wavpack encoding, shorten encoding, or no encoding.
• Click the 'Convert' button to convert a single file, or the 'Bulk Convert' button to convert a set of files.
• If the file is successfully converted, the message "File was successfully converted" will be displayed.
• If any field entered in the properties was wrong, or if the file was not successfully converted, an appropriate message will be displayed.
• After successful conversion the sphere header created by the tool will be displayed. If the user is satisfied with the header, the 'Ok' button can be clicked; else the 'Cancel' button can be clicked and the user can go back and make changes in the properties.
• On clicking the help button in the top right corner, the user manual will be opened, which can be referred to for any issues while using the tool.

14.3 Fields in Properties

• location_id: A mandatory field. The user can enter any value of string format; the maximum allowed length of the entered text is 100 characters. Preferably this field holds the name of the location/institution at which the conversion is taking place.
• database_id: A mandatory field. The user can enter any value of string format; the maximum allowed length of the entered text is 100 characters. Preferably this field holds the details of the project/database/speaker.
• utterance_id: A mandatory field. The user can enter any value of string format; the maximum allowed length of the entered text is 100 characters. Preferably this field holds the name of the speaker. In the sphere header the value of this field will be appended with the name of the file, separated by an underscore.
• language: A mandatory field. The user can enter any value of string format; the maximum allowed length of the entered text is 50 characters. Preferably this field holds the language used in the input file.
• sample_n_bytes: A mandatory field for raw files only. The user can enter the number of bytes in a sample in the file. For wav files this value will be retrieved from the wav header. If a non-integer value is entered, the tool will prompt the user to enter an integer value. This tool deals only with one byte per sample and two bytes per sample.


• sample sig bits : This is a mandatory field. The user can enter the number of significant bits in a sample. If a non-integer value is entered, the tool will prompt the user to enter an integer value. If sample sig bits > (sample n bytes * 8), an error is thrown informing the user that the value entered for this field is wrong.
• sample rate : This is a mandatory field for raw files only; for wav files this value is retrieved from the wav header. The user can enter the sampling rate (blocks per second) used in the file. If a non-integer value is entered, the tool will prompt the user to enter an integer value. If (sample count / sample rate) <= 0, meaning the duration of the file is less than or equal to zero, an error is thrown informing the user that the value entered for this field is wrong.
• sample byte format : This is a mandatory field for raw files only; for wav files this value is retrieved from the wav header. The user can enter the byte ordering used in the file: 01 for little endian, 10 for big endian, or 1 for single byte. If any other value is entered, the tool will ask the user to enter one of these three values. For raw files, if the sample n bytes entered by the user is 1, this field can only take the value 1; otherwise the user is informed that the value entered for this field is wrong. (These checks are summarised in the sketch after this list.)
• channel count : The tool deals only with channel count = 1. It is the number of interleaved channels: mono = 1, stereo = 2.
• sample count : The value for this field is calculated by the tool. It is the total number of samples in the file.
• sample coding : The value for this field is calculated by the tool. It is the encoding used in the input and output file, separated by a comma. The input file encodings can be pcm, mulaw or raw; the output file encodings are shorten or wavpack. If no encoding is selected for the output file, the value of the field contains only the encoding of the input file.
• sample max : The value for this field is calculated by the tool. It is the maximum sample value (amplitude of the sample with the maximum value) present in the file.
• sample min : The value for this field is calculated by the tool. It is the minimum sample value (amplitude of the sample with the minimum value) present in the file.
• The user can add more fields using the 'Add' button. The field name, data type and value have to be entered; the data type can be string, integer or real, selected from the drop-down list. Click the 'Ok' button after entering the details of the new field, or click the 'Cancel' button. Field names should not contain spaces.
• The maximum size of the sphere header is 1024 bytes. If the user enters more data, the tool informs the user that the header has exceeded 1024 bytes and asks the user to edit or delete a few properties.
• User-entered fields can be deleted: check the check box on the right of each field entered by the user and press the delete button.
• Once the properties are edited, click the 'Ok' button.
• If the 'Cancel' button is clicked, the tool will ask: 'Cancelling will remove the edited properties and use the default wav properties. Do you want to continue?'. The user can click 'yes' or 'no'.
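
The range and consistency checks described in the bullets above (significant bits versus sample width, positive duration, channel count, and the allowed byte-format codes) can be summarised in a short C sketch. The function below is only an illustration of those rules as stated in this manual; its name, signature and messages are assumptions and do not correspond to the tool's actual source code.

   #include <stdio.h>
   #include <string.h>

   /* Sketch of the property checks described above; illustrative only. */
   int validate_properties(long sample_n_bytes, long sample_sig_bits,
                           long sample_rate, long sample_count,
                           long channel_count, const char *sample_byte_format)
   {
       /* Only mono files are handled by the tool. */
       if (channel_count != 1) {
           fprintf(stderr, "error: channel count must be 1\n");
           return -1;
       }
       /* Only 1 or 2 bytes per sample are supported. */
       if (sample_n_bytes != 1 && sample_n_bytes != 2) {
           fprintf(stderr, "error: sample n bytes must be 1 or 2\n");
           return -1;
       }
       /* Significant bits cannot exceed the sample width in bits. */
       if (sample_sig_bits > sample_n_bytes * 8) {
           fprintf(stderr, "error: sample sig bits exceeds sample width\n");
           return -1;
       }
       /* The duration, sample_count / sample_rate, must be positive. */
       if (sample_rate <= 0 || sample_count / sample_rate <= 0) {
           fprintf(stderr, "error: file duration is not positive\n");
           return -1;
       }
       /* Byte ordering: "01" little endian, "10" big endian, "1" single byte. */
       if (strcmp(sample_byte_format, "01") != 0 &&
           strcmp(sample_byte_format, "10") != 0 &&
           strcmp(sample_byte_format, "1") != 0) {
           fprintf(stderr, "error: byte format must be 01, 10 or 1\n");
           return -1;
       }
       /* A one-byte-per-sample file can only use byte format "1". */
       if (sample_n_bytes == 1 && strcmp(sample_byte_format, "1") != 0) {
           fprintf(stderr, "error: byte format must be 1 for 1-byte samples\n");
           return -1;
       }
       return 0;
   }

A front end such as the Qt tool would run checks of this kind once per file, before the sphere header is written.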

14.4 Screenshot

Here is a screenshot of the sphere converter tool.

14.5 Example of data in the Config file (default properties)

WAV:
location id STRING IIT Madras, chennai
database id STRING Sujatha 20 RadioJockey
utterance id STRING Suj
language STRING tamil
sample sig bits INTEGER 16
$$
RAW:
location id STRING IIT Madras, chennai
database id STRING Sujatha 20 RadioJockey
utterance id STRING Suj
language STRING hindi
sample n bytes INTEGER 2
sample sig bits INTEGER 16
channel count INTEGER 1
sample rate INTEGER 16000
sample byte format STRING 01
$$
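
As a rough illustration of how properties such as the defaults above end up in a fixed-size header, here is a minimal C sketch that assembles a 1024-byte SPHERE-style text header from a few of the fields. The choice of fields, the "-i"/"-s" type markers and the function itself are assumptions for illustration only; they are not taken from the tool's convert to sphere.c.

   #include <stdio.h>
   #include <string.h>

   /* Illustrative sketch: builds a fixed 1024-byte text header and writes it. */
   int write_sphere_header(FILE *out, long sample_rate, long sample_n_bytes,
                           long sample_count, const char *language)
   {
       char header[1024];
       int n = snprintf(header, sizeof(header),
                        "NIST_1A\n   1024\n"
                        "channel_count -i 1\n"
                        "sample_rate -i %ld\n"
                        "sample_n_bytes -i %ld\n"
                        "sample_count -i %ld\n"
                        "language -s%zu %s\n"
                        "end_head\n",
                        sample_rate, sample_n_bytes, sample_count,
                        strlen(language), language);
       if (n < 0 || n >= (int)sizeof(header))
           return -1;                        /* header would exceed 1024 bytes */
       memset(header + n, ' ', sizeof(header) - n);  /* pad to exactly 1024 bytes */
       return fwrite(header, 1, sizeof(header), out) == sizeof(header) ? 0 : -1;
   }

The 1024-byte limit in this sketch mirrors the maximum sphere header size mentioned in the properties section above.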

14.6 Limitations of the tool

• The tool allows only files with a single channel.
• sample n bytes can take only the values 1 and 2.
• For raw files, the correct values for sample n bytes, sample rate and sample byte format should be given to obtain the correct values for sample max and sample min.
• The maximum size of the sphere header is 1024 bytes.
• For bulk conversion, if both pcm and mulaw files are present in the input folder, they will use the same properties.
• For bulk conversion, if the input folder has sub-folders or different types (pcm, mulaw or raw) of input files, the tool will report that all files were not successfully converted and that all files in the folder may not be of the same format. Although conversion will happen successfully for the type of file selected in the gui, this error is still thrown. It is advised to keep only files of one type in the input folder, without any sub-folders.


Structure of a WAV (RIFF) file header (field, length in bytes, content):

RIFF/RIFX chunk
  ChunkID (4): Contains the letters "RIFF" or "RIFX" in ASCII form; RIFF for little endian files and RIFX for big endian files.
  ChunkSize (4): The size of the rest of the chunk following this number. This is the size of the entire file in bytes minus 8 bytes for the two fields not included in this count: ChunkID and ChunkSize.
  Wave ID (4): Contains the letters "WAVE".

FORMAT chunk
  Subchunk1ID (4): Contains the letters "fmt ".
  Subchunk1Size (4): The size of the rest of the subchunk which follows this number. The value can be 16, 18 or 40; 16 for PCM.
  AudioFormat (2): PCM = 1 (i.e. linear quantization); values other than 1 indicate some form of compression. A mu-law file has the value 7.
  NumChannels (2): Number of interleaved channels; mono = 1, stereo = 2.
  SampleRate (4): Sampling rate (blocks per second): 8000, 16000, 44100, etc.
  ByteRate (4): Data rate; AvgBytesPerSec = SampleRate * NumChannels * BitsPerSample/8.
  BlockAlign (2): Data block size in bytes = NumChannels * BitsPerSample/8; the number of bytes for one sample including all channels.
  BitsPerSample (2): 8 bits = 8, 16 bits = 16, etc.

Optional portion in FORMAT chunk
  cbSize (2): Size of the extension (0 or 22); present only if Subchunk1Size is 18 or 40.
  ValidBitsPerSample (2): Number of valid bits.
  ChannelMask (4): Speaker position mask.
  SubFormat (16): GUID, including the data format code.

FACT chunk (all compressed, non-PCM formats must have a fact chunk)
  ckID (4): Chunk ID: "fact".
  cksize (4): Chunk size: minimum 4.
  SampleLength (4): Number of samples (per channel).

Data chunk
  ckID (4): Contains the letters "data".
  cksize (4): The number of bytes in the data.
  sampled data (n): The actual sound data.
  pad byte (0 or 1): Padding byte if n is odd.
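
To make the layout above concrete, the following C sketch declares the fixed PCM portion of the header and reads it from a little-endian RIFF file. It is only a sketch: chunk scanning, the optional extension fields, the FACT chunk and byte-order handling are deliberately omitted, and the struct and function names are not taken from the tool's sources.

   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   /* Fixed PCM portion of the header described in the table above. */
   struct wav_fmt_header {
       char     chunk_id[4];      /* "RIFF" (little endian) or "RIFX" */
       uint32_t chunk_size;       /* file size minus 8 bytes */
       char     wave_id[4];       /* "WAVE" */
       char     subchunk1_id[4];  /* "fmt " */
       uint32_t subchunk1_size;   /* 16, 18 or 40; 16 for PCM */
       uint16_t audio_format;     /* 1 = PCM, 7 = mu-law */
       uint16_t num_channels;     /* 1 = mono, 2 = stereo */
       uint32_t sample_rate;      /* blocks per second */
       uint32_t byte_rate;        /* sample_rate * num_channels * bits/8 */
       uint16_t block_align;      /* num_channels * bits_per_sample / 8 */
       uint16_t bits_per_sample;  /* 8, 16, ... */
   };

   /* Reads the fixed header from a little-endian "RIFF" file on a little-endian
      machine; byte swapping and scanning for later chunks are omitted here. */
   int read_wav_fmt(const char *path, struct wav_fmt_header *h)
   {
       FILE *fp = fopen(path, "rb");
       if (!fp)
           return -1;
       size_t got = fread(h, sizeof(*h), 1, fp);
       fclose(fp);
       if (got != 1 || memcmp(h->chunk_id, "RIFF", 4) != 0 ||
           memcmp(h->wave_id, "WAVE", 4) != 0)
           return -1;
       return 0;
   }

A converter would read this information for wav inputs (rather than asking the user, as it must for raw files) and copy fields such as the sampling rate and bytes per sample into the sphere header.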
