AES

JOURNAL OF THE AUDIO ENGINEERING SOCIETY AUDIO / ACOUSTICS / APPLICATIONS Volume 61 Number 1/2

2013 January/February

In this issue . . .
Evaluating Perceptual Trade-Offs with Wave Field Synthesis
Efficient Decomposition of Room Impulse Responses
Representing Audio Signals Optimized for Noise Reduction
Using Fixed-Pole Parallel Filters for Audio Equalizing
Extending Digital Waveguide Synthesis to the Korean Geomungo
Modeling Room Acoustic Parameters in Small Spaces
Influence of a Table Surface on the Response of a Boundary Microphone

Features . . .
Mastering for Today's Media
135th Convention, New York, Call for Papers

AUDIO ENGINEERING SOCIETY, INC. INTERNATIONAL HEADQUARTERS

60 East 42nd Street, Room 2520, New York, NY 10165-2520, USA
Tel: +1 212 661 8528  Fax: +1 212 682 0477
E-mail: [email protected]  Internet: www.aes.org

ADMINISTRATION

Bob Moses Executive Director
Roger K. Furness Deputy Director

OFFICERS 2012/2013

Frank Wells President
Sean Olive President-Elect
Jan A. Pedersen Past President
Robert Breen Vice President, Eastern Region, USA/Canada
Michael Fleming Vice President, Central Region, USA/Canada
Jonathan Novick Vice President, Western Region, USA/Canada
Bill Foster Vice President, Northern Region, Europe
Nadja Wallaszkovits Vice President, Central Region, Europe
Umberto Zanghieri Vice President, Southern Region, Europe
Joel Vieira De Brito Vice President, Latin America Region
Kimio Hamasaki Vice President, International Region
Ron Streicher Secretary
Garry Margolis Treasurer

GOVERNORS

Diemer de Vries, Gary Gottlieb, Jim Kaiser, Ben Kok, Bozena Kostek, John D. Krivit, Valerie Tyler, Michael Williams

COMMITTEES

ASSOCIATE EDITORS: Bozena Kostek Chair
AWARDS: Jim Kaiser Chair
CONFERENCE POLICY: Diemer de Vries Chair
CONVENTION POLICY: Jim Kaiser and Umberto Zanghieri Cochairs
EDUCATION: John Krivit Chair; Magdalena Plewa and Kyle P. Snyder Vice Chairs
FINANCE: Garry Margolis Chair
HISTORICAL: Bill Wray and Gene Radzik Cochairs; J. G. (Jay) McKnight Chair Emeritus
LAWS & RESOLUTIONS: Wieslaw Woszczyk Chair; Sean Olive Vice Chair
MEMBERSHIP: Theresa Leonard Chair; Jonathan Novick Vice Chair
NOMINATIONS: Jan Pedersen Chair
PRESIDENT'S ADVISORY GROUP: Andres A. Mayo Chair
PUBLICATIONS POLICY: Ville Pulkki Chair; Durand Begault Vice Chair
REGIONS AND SECTIONS: Peter Cook Chair
STANDARDS: Bruce Olson Chair
TELLERS: Christopher V. Freitag Chair

TECHNICAL COUNCIL

Francis Rumsey Chair; Jürgen Herre, Michael Kelly, and Bob Schulein Vice Chairs

TECHNICAL COMMITTEES

ACOUSTICS & SOUND REINFORCEMENT: Kurt Graffy and Peter Mapp Cochairs
ARCHIVING, RESTORATION AND DIGITAL LIBRARIES: David Ackerman and Chris Lacinak Cochairs
AUDIO FORENSICS: Jeff M. Smith Chair; Eddy B. Brixen and Christopher Peltier Vice Chairs
AUDIO FOR GAMES: Michael Kelly and Steve Martz Cochairs; Kazutaka Someya Vice Chair
AUDIO FOR TELECOMMUNICATIONS: Bob Zurek Chair; Antti Kelloniemi Vice Chair
AUDIO RECORDING & MASTERING SYSTEMS: Kimio Hamasaki Chair; Toru Kamekawa and Andres A. Mayo Vice Chairs
AUTOMOTIVE AUDIO: Richard S. Stroud Chair; Tim Nind Vice Chair
CODING OF AUDIO SIGNALS: Jürgen Herre and Schuyler Quackenbush Cochairs
ELECTROMAGNETIC COMPATIBILITY: Jim Brown Chair; Bruce Olson Vice Chair
FIBER OPTICS FOR AUDIO: Ronald Ajemian Chair; Werner Bachmann Vice Chair
HEARING AND HEARING LOSS PREVENTION: Robert Schulein Chair; Michael Santucci and Jan Voetmann Vice Chairs
HIGH-RESOLUTION AUDIO: Vicki R. Melchior and Josh Reiss Cochairs
HUMAN FACTORS IN AUDIO SYSTEMS: Michael Hlatky Chair; William L. Martens Vice Chair
LOUDSPEAKERS & HEADPHONES: Alan Trevena Chair; Juha Backman Vice Chair
MICROPHONES & APPLICATIONS: Eddy Bøgh Brixen Chair; David Josephson Vice Chair
NETWORK AUDIO SYSTEMS: Kevin Gross Chair; Thomas Sporer and Umberto Zanghieri Vice Chairs
PERCEPTION & SUBJECTIVE EVALUATION OF AUDIO: Thomas Sporer Chair; Eiichi Miyasaka and Sean Olive Vice Chairs
SEMANTIC AUDIO ANALYSIS: Mark Sandler Chair; Dan Ellis and Jay LeBoeuf Vice Chairs
SIGNAL PROCESSING: Christoph M. Musialik Chair; James Johnston Vice Chair
SOUND FOR DIGITAL CINEMA AND TELEVISION: Brian McCarty and Julius Newell Cochairs
SPATIAL AUDIO (FORMERLY MULTICHANNEL): James Johnston and Sascha Spors Cochairs
STUDIO PRACTICES & PRODUCTION: George Massenburg Chair; Mick Sawaguchi and Jim Kaiser Vice Chairs
TRANSMISSION & BROADCASTING: Stephen Lyman and Kimio Hamasaki Cochairs; Lars Jonsson Vice Chair

Correspondence to AES officers and committee chairs should be addressed to them at the society's international headquarters.

STANDARDS COMMITTEE

Bruce Olson Chair
John Woodgate Vice Chair
David Josephson Vice Chair, Western Hemisphere
Junichi Yoshio Vice Chair, International
Mark Yonge Secretary, Standards Manager

AES

Journal of the Audio Engineering Society

(ISSN 1549-4950), Volume 61, Number 1/2, 2013 January/February. Published monthly, except January/February and July/August when published bimonthly, by the Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA. Telephone: +1 212 661 8528. Fax: +1 212 682 0477. E-mail: [email protected]. Periodical postage paid at New York, New York, and at an additional mailing office. Postmaster: Send address corrections to Journal of the Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520. The Audio Engineering Society is not responsible for statements made by its contributors.

EDITORIAL STAFF

Bozena Kostek Editor-in-Chief
Francis Rumsey Consultant Technical Writer and Editor
William T. McQuaide Managing Editor
Barry A. Blesser Consulting Technical Editor
Mary Ellen Ilich Senior Editor and Copy Editor

Associate Technical Editors

Søren Bech: spatial perception and processing
Elizabeth Cohen: audio archiving, storage, and restoration
Jeremy Cooperstock: audio networking
Diemer de Vries: room acoustics and architectural acoustics
Christof Faller: low bit-rate audio coding
Andreas Floros: digital electroacoustics systems, audio/music information retrieval
Woon-Seng Gan: adaptive signal processing, active noise control
James D. Johnston: signal processing, analysis and synthesis of sound, game audio, multichannel sound, and human auditory perception
Jean-Marc Jot: signal processing
Francis (Feng) Li: instrumentation and measurement and low bit-rate audio coding
Robert C. Maher: analysis and synthesis of sound
Aki Mäkivirta: loudspeaker processing
Dan Mapes-Riordan: psychoacoustics and signal processing
Francesco Martellotta: acoustic modeling techniques, room and architectural acoustics
Jorge Mendoza-López: transducers, loudspeaker and sound reinforcement systems
Xiaojun Qiu: room acoustics and signal processing in sound field control
Thomas Sporer: psychoacoustics, perception, and listening tests
Sascha Spors: spatial audio, analysis and synthesis of sound
Tony Stockman: auditory display
Lamberto Tronchin: room acoustics and architectural acoustics
Toon van Waterschoot: audio and speech signal processing
Nick Zacharov: loudspeakers, sound quality, sensory evaluation, coding, spatial sound

SC-02 SUBCOMMITTEE ON DIGITAL AUDIO

John Grant Chair
Robin Caine Vice Chair

Working Groups
SC-02-01 Digital Audio Measurement Techniques: Tom Kite, Ian Dennis
SC-02-02 Digital Input-Output Interfacing: John Grant, Steve Lyman
SC-02-08 Audio-File Transfer and Exchange: Mark Yonge
SC-02-12 Audio Applications of Networks: Richard Foss, Matthew Mora

SC-04 SUBCOMMITTEE ON ACOUSTICS

David Josephson Chair

Working Groups
SC-04-01 Acoustics and Sound Source Modeling: Wolfgang Ahnert
SC-04-03 Loudspeaker Modeling and Measurement: Steve Hutt, Neil Harris
SC-04-04 Microphone Measurement and Characterization: David Josephson, Jackie Green

SC-05 SUBCOMMITTEE ON INTERCONNECTIONS

Ray Rayburn Chair
John M. Woodgate Vice Chair

Working Groups
SC-05-02 Audio Connectors: Ray Rayburn, Werner Bachmann, Ronald Ajemian
SC-05-05 Grounding and EMC Practices: Bruce Olson, Jim Brown

SC-07 SUBCOMMITTEE ON METADATA

Chris Chambers Chair
David Ackerman Vice Chair

Working Groups
SC-07-01 Audio Metadata: Chris Chambers, David Ackerman

THIS ISSUE Papers: Wierstorf et al., Tervo et al., Siedenburg and Dörfler, Bank, Kim et al., Zidan and Svensson, and Zidan and Svensson. Standards: Audio network control and other topics. Articles: Rumsey on mastering. Section News: Argentina, Central Indiana, Pacific Northwest, Sweden, and Toronto. Products: AKG Acoustics, CEDAR, Chandler Limited, Crown Audio, D.A.S. Audio, dbx, Nugen, and Soundcraft. Call for Papers: 135th Convention (New York). Obits: Frank G. Lennert and Bill Isenberg.

COPYRIGHT Copyright © 2013 by the Audio Engineering Society, Inc. It is permitted to quote from this Journal with customary credit to the source. COPIES Individual readers are permitted to photocopy isolated articles for research or other noncommercial use. Permission to photocopy for internal or personal use of specific clients is granted by the Audio Engineering Society to libraries and other users registered with the Copyright Clearance Center (CCC), provided that the base fee of $1 per copy plus $.50 per page is paid directly to CCC, 222 Rosewood Dr., Danvers, MA 01923, USA. 0004-7554/95. Photocopies of individual articles may be ordered from the AES Headquarters office: $5 per article (members) and $20 (nonmembers). REPRINTS AND REPUBLICATION Multiple reproduction or republication of any material in this Journal requires the permission of the Audio Engineering Society. Permission may also be required from the author(s). Send inquiries to AES Editorial office. ONLINE JOURNAL AES members can view the Journal online at

www.aes.org/journal/. SUBSCRIPTIONS The Journal is available by subscription. 2013 rates are $290, printed; printed plus online $695; online only $545; upgrade from printed or E-Library $445. For information, contact [email protected]. BACK ISSUES Selected back issues are available: $10 per issue (members), $25 (nonmembers). For information, contact Publications Department: [email protected]. MICROFILM Copies of Vol. 19, No. 1 (1971 January) to the present edition are available on microfilm from University Microfilms International, 300 North Zeeb Rd., Ann Arbor, MI 48106, USA. ADVERTISING For information contact Bill McQuaide: [email protected] or call +1 212 661 8528, ext. 22. MANUSCRIPTS For information on the presentation and processing of manuscripts, see our Author Guidelines at: http://www.aes.org/journal/authors/guidelines/

AES VOLUME 61 NUMBER 1/2

JOURNAL OF THE AUDIO ENGINEERING SOCIETY AUDIO/ACOUSTICS/APPLICATIONS 2013 JANUARY/FEBRUARY

Editor's Note and List of Associate Technical Editors ................................................................... 4

PAPERS

Perception of Focused Sources in Wave Field Synthesis ................................................... Hagen Wierstorf, Alexander Raake, Matthias Geier, and Sascha Spors 5
Wave Field Synthesis (WFS) can synthesize virtual sound sources that are perceived to be at locations between the loudspeakers and the listener, called focused sources. Because of practical limitations in the density of loudspeakers, there are artifacts. This research explores the amount of perceptual artifacts and the localization of the focused sources. The results from a variety of listening configurations illustrate the trade-offs. The truncation of loudspeaker arrays creates two opposite effects: (a) fewer additional wave fronts reduce the perception of artifacts, (b) stronger diffraction reduces the size of the listening area with adequate binaural cues.

Spatial Decomposition Method for Room Impulse Responses .......................................... Sakari Tervo, Jukka Pätynen, Antti Kuusinen, and Tapio Lokki 17
For perceptual evaluations of room acoustics, a spatial room impulse response is measured in a real space, encoded for a multichannel loudspeaker reproduction system, and then convolved with anechoic music. This paper describes the Spatial Decomposition Method (SDM). The encoding decomposes the spatial impulse response into a set of image-sources. In contrast to previous methods, SDM can be applied to an arbitrary compact microphone array having a small number of microphones without regard to the spatial sound reproduction technique. SDM relies upon the assumption that the sound propagation direction is the average of all the waves arriving at the microphone array, and that the sound pressure wave at the geometric center of the array represents its impulse response. Listening tests with simulated impulse responses show that the proposed method produces an auralization indistinguishable from the reference in the best case.

Persistent Time-Frequency Shrinkage for Audio Denoising ....... Kai Siedenburg and Monika Dörfler 29
In many audio processing applications, signals are represented by linear combinations of basis functions (such as with windowed Fourier transforms) that are collected in so-called dictionaries. These are considered well adapted to a particular class of signals if they lead to sparse representations, meaning only a small number of basis functions are required for good approximation of signals. Most natural signals have strong inherent structures, such as harmonics and transients, a fact that can be used for adapting audio processing algorithms. This paper considers the audio-denoising problem from the perspective of structured sparse representation. A generalized thresholding scheme is presented from which simple audio-denoising operators are derived. They perform equally well compared to state-of-the-art methods while incurring significantly lower computational cost.

Audio Equalization with Fixed-Pole Parallel Filters: An Efficient Alternative to Complex Smoothing ............................................................................................................................ Balázs Bank 39
A common method for displaying, modeling, and equalizing the frequency response of audio systems is to use smoothing to eliminate the raggedness of the response. Fixed-pole parallel filters, which produce modest computational loading for both signal filtering and parameter estimation, possess the beneficial properties of smoothing. This makes them an efficient method for modeling or equalizing audio systems. The resolution of smoothing is controlled by the choice of pole frequencies: for obtaining a smoothing with 1/β-octave resolution, β/2 pole pairs are placed in each octave (e.g., sixth-octave resolution is achieved by having three pole pairs per octave). In addition, an analysis shows the theoretical equivalence of parallel filters and Kautz filters; the formulas for converting the parameters of the two types of filter are given.

Digital Waveguide Synthesis of the Geomungo with a Time-Varying Loss Filter ............................................................................. Seunghun Kim, Moonseok Kim, and Woon Seung Yeo 50
Physical models of musical instruments are often the basis for computer synthesis of the sound when played. By manipulating control parameters of the model to mimic the performer's actions, a good imitation of the music can be achieved. A digital waveguide synthesis of the geomungo, a Korean traditional plucked string instrument, must take into account the fact that vigorous playing techniques produce extreme vibrato with noticeable fluctuation in the decay of the harmonics. To model pitch fluctuation and the decay characteristics of its harmonic partials, a time-varying loss filter with a sinusoidal loop gain is used. The model uses a generalized form of the Karplus-Strong algorithm with a one-pole filter to model loss, and a Lagrange interpolation filter to implement the fundamental frequency. A real-time system shows the potential for creating a virtual geomungo.

Room Acoustical Parameters in Small Rooms ......... Hassan El-Banna Zidan and U. Peter Svensson 62
While there has been great progress in inventing complex algorithms to evaluate room acoustics of large spaces, simple statistical models remain attractive because they can describe the space with a few parameters. This paper explores how Barron's model can be used to investigate the early and late energy parameters in small shoebox-shaped rooms, which place additional burdens on the model. Measurements and simulations for three small rooms without furnishings were used to evaluate this simple model. Errors were only a few dB for both reflected energy and early reflected sound when averaged across a room in the 250 Hz to 2 kHz frequency range. The model may be useful for studying the influence of parameters such as room volume, source-receiver distance, as well as microphone and loudspeaker directivity.

Influence of a Table on a Microphone's Frequency Response and Directivity ....................................................................................... Hassan El-Banna Zidan and U. Peter Svensson 70
Teleconferencing applications often use a microphone placed directly on the table surface. A boundary or pressure zone microphone has its membrane mounted so close to a sound-reflecting plane that the membrane receives direct and reflected sounds in phase at all frequencies of interest, thereby avoiding destructive phase interference. However, in typical applications with small tables or a podium surface, a more detailed model is required to determine if the planar boundary assumption is still valid. Diffraction at the surface edges might significantly affect the frequency response. A comparison between the calculated and measured frequency responses indicates that the simulation gives 1/3-octave-band levels that are typically within 1 dB of measured values. As expected, the response from a small table is less uniform than for a large table.

STANDARDS AND INFORMATION DOCUMENTS
AES Standards Committee News . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Audio network control; connector for surround microphones; miniature XLR connectors; carriage of MPEG Surround in AES3; audio-file transfer and exchange file format; acoustics, plane-wave tubes; measuring loudspeaker drive units

FEATURES
Mastering for Today's Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francis Rumsey 79
135th Convention, New York, Call for Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

DEPARTMENTS Section News . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Advertiser Internet Directory . . . . . . . . . . . . . . . . . . . . 85 Products and Developments . . . . . . . . . . . . . . . . . . . . 89

Obituaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 AES Conventions and Conferences. . . . . . . . . . . . . . . 96
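The pole-placement rule quoted in the Bank abstract above (β/2 logarithmically spaced pole pairs per octave for 1/β-octave smoothing resolution) can be checked with a few lines. This is an illustrative sketch, not code from the paper; the 20 Hz to 20 kHz working range is an assumption.

```python
import numpy as np

beta = 6                                   # 1/6-octave resolution
pairs_per_octave = beta / 2                # -> 3 pole pairs per octave
f_lo, f_hi = 20.0, 20000.0                 # assumed working range

n_octaves = np.log2(f_hi / f_lo)           # ~10 octaves
n_pairs = int(np.floor(n_octaves * pairs_per_octave))

# Logarithmically spaced pole frequencies, beta/2 per octave
pole_freqs = f_lo * 2.0 ** (np.arange(n_pairs + 1) / pairs_per_octave)

print(n_pairs, pole_freqs[:4])
```

Each successive pole frequency is a factor 2^(1/3) above the previous one, so three pole pairs indeed span one octave, matching the sixth-octave example in the abstract.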

EDITOR'S NOTE

Dear Readers, let me start with a few words on the role of the Associate Technical Editor (ATE). The ATE has a very important role in the editorial review process. Once a manuscript has been submitted, the Editor-in-Chief (EIC) or Managing Editor (ME) checks the proposed submission for its suitability. If the paper is not original or is poorly composed, it is returned to the authors without further review. Should the work be deemed suitable for starting the review process, the EIC (or ME) will contact an Associate Technical Editor (ATE), an expert in an appropriate audio area. The most important roles of the ATE are to identify appropriate reviewers for each submitted manuscript and to manage the review process in the allotted time. At least two reviewers should be assigned to the manuscript. After reviews have been received, the ATE formulates a recommendation for the manuscript (publish as is, publish with minor revisions, publish with major revisions, decline for publication) and sends it to the authors along with the reviewers' comments. The ATE may also make his or her own recommendations for changes. The ATE informs the EIC through the system about the recommendation. Once the decision is "publish as is," communication with the author(s) is maintained by the EIC and ME until the paper is ready for publication. Thus, the role of the ATEs ends when they make a final recommendation to the Editor-in-Chief. Occasionally an ATE may be invited to act as the Guest Editor for a Journal special issue, and will write the editorial introduction.

The first JAES Associate Technical Editors were assigned with the clear intention that their term of service should not go beyond two years. For that reason we issued a Call for ATEs in the JAES November issue, and as the submission deadline has already passed we have appointed some new ATEs. I would therefore like to thank all the current ATEs for their professional, high-quality work for JAES, and express my hope for their assistance in the future. Some of the original ATEs will remain with us, as we would like to use their expertise in the audio domain, and they are willing to assist us with the editorial process. As a further part of our initiative to improve the Journal's service to its authors, Francis Rumsey will now assist with the editorial management process. The names of all ATEs along with their areas of interest are listed below.

Bozena Kostek, JAES Editor-in-Chief

JAES Associate Technical Editors

Søren Bech: spatial perception and processing
Elizabeth Cohen: audio archiving, storage, and restoration
Jeremy Cooperstock: audio networking
Diemer de Vries: room acoustics and architectural acoustics
Christof Faller: low bit-rate audio coding
Andreas Floros: digital electroacoustics systems, audio/music information retrieval
Woon-Seng Gan: adaptive signal processing, active noise control
James D. Johnston: signal processing, analysis and synthesis of sound, game audio, multichannel sound, human auditory perception
Jean-Marc Jot: signal processing
Francis Feng Li: instrumentation and measurement, low bit-rate audio coding
Robert C. Maher: analysis and synthesis of sound
Aki Mäkivirta: loudspeaker processing
Dan Mapes-Riordan: psychoacoustics and signal processing
Francesco Martellotta: acoustic modeling techniques, room and architectural acoustics
Jorge Mendoza-López: transducers, loudspeaker and sound reinforcement systems
Thomas Sporer: psychoacoustics, perception, and listening tests
Sascha Spors: spatial audio, analysis and synthesis of sound
Tony Stockman: auditory display
Lamberto Tronchin: room acoustics and architectural acoustics
Xiaojun Qiu: room acoustics and signal processing in sound field control
Toon van Waterschoot: audio and speech signal processing
Nick Zacharov: loudspeakers, sound quality, sensory evaluation, coding, spatial sound

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February

PAPERS

Perception of Focused Sources in Wave Field Synthesis

HAGEN WIERSTORF,1 AES Student Member, ALEXANDER RAAKE,1 AES Member, MATTHIAS GEIER,2 ([email protected]) AND SASCHA SPORS,2 AES Member

1 Assessment of IP-based Applications, T-Labs, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany
2 Signal Theory and Digital Signal Processing, Institute of Communications Engineering, Universität Rostock, R.-Wagner-Str. 31, 18119 Rostock/Warnemünde, Germany

Wave Field Synthesis (WFS) allows virtual sound sources to be synthesized that are located between the loudspeaker array and the listener. Such sources are known as focused sources. Due to practical limitations related to real loudspeaker arrays, such as spatial sampling and truncation, there are different artifacts in the synthesized sound field of focused sources. In this paper we present a listening test to identify the perceptual dimensions that are associated with these artifacts. Two main dimensions were found, one describing the amount of perceptual artifacts and the other one describing the localization of the focused source. The influence of the array length on these two dimensions is evaluated further in a second listening test. A binaural model is used to model the perceived location of focused sources found in the second test and to analyze dominant localization cues.

0 INTRODUCTION

Wave Field Synthesis (WFS) [1] is one of the most prominent high-resolution sound field synthesis methods used and studied nowadays. Unlike traditional stereophonic techniques, it offers the potential of creating the impression of a virtual point source located inside the listening area, between the loudspeakers and the listeners [2]. These sources are known as focused sources, according to their strong relation to acoustic focusing [3]. The physical theory of WFS assumes a spatially continuous distribution of loudspeakers denoted as secondary sources. However, in practical implementations the secondary source distribution will be realized by a limited number of loudspeakers placed at discrete positions. This implies a spatial sampling and truncation process that typically leads to spatial aliasing and truncation artifacts in the sound field [4], depending on the position of the virtual source and the position of the listener. For focused sources these artifacts are of special interest, as they may become clearly audible, especially for large loudspeaker arrays [5,6].

In this paper the physical properties of focused sources are studied (Section 1), as well as their perceptual properties. It is shown that the synthesis of focused sources may be associated with a number of perceptually relevant artifacts that will be increasingly audible the larger the WFS system, if no further action is taken. The perceptual properties are investigated by performing a formal listening test (Section 2) using the Repertory Grid Technique (RGT). In a second listening test (Section 3), the two main perceptual dimensions identified in the first listening test were rated for loudspeaker arrays of different lengths. To further analyze the results, the results for the localization of focused sources are compared with the output of a binaural model (Section 4). To create reproducible test conditions, all tests were conducted with a "virtual" WFS system realized by dynamic (head-tracked) binaural resynthesis, presented to the participants via headphones.

1 THEORY

The theory of WFS was initially derived from the Rayleigh integrals for a linear secondary source distribution [1]. With this source distribution it is possible to synthesize a desired two-dimensional sound field in one of the half planes defined by the linear secondary source distribution.

The sound field in the other half plane is a mirrored version of the desired one. Without loss of generality, a geometry can be chosen for which the secondary source distribution is located on the x-axis of a Cartesian coordinate system. Then the synthesized sound field is given by

P(\mathbf{x}, \omega) = -\int_{-\infty}^{\infty} D_{\mathrm{2D}}(\mathbf{x}_0, \omega)\, G_{\mathrm{2D}}(\mathbf{x} - \mathbf{x}_0, \omega)\, \mathrm{d}x_0,   (1)

where \mathbf{x} = (x, y) with y > 0, \mathbf{x}_0 = (x_0, 0), and \omega = 2\pi f with temporal frequency f. The functions D_{\mathrm{2D}} and G_{\mathrm{2D}} denote the secondary source driving signal and the sound field emitted by a secondary source, respectively. In WFS the driving function is given as

D_{\mathrm{2D}}(\mathbf{x}_0, \omega) = 2\, \frac{\partial S(\mathbf{x}, \omega)}{\partial y} \Big|_{\mathbf{x} = \mathbf{x}_0},   (2)

where S(\mathbf{x}, \omega) denotes the sound field of the desired virtual source. The sound field G_{\mathrm{2D}}(\mathbf{x} - \mathbf{x}_0, \omega) of a secondary source can be interpreted as the field of a line source intersecting the xy-plane at position \mathbf{x}_0. For practical applications, only secondary sources with the field of a point source (G_{\mathrm{3D}}) are available in most cases. Hence the dimensional mismatch of a three-dimensional secondary source used for two-dimensional synthesis has to be considered. This leads to a so-called two-and-a-half-dimensional (2.5D) driving function that applies an amplitude correction to reduce this mismatch. Using the far-field approximation \frac{\omega}{c}|\mathbf{x} - \mathbf{x}_0| \gg 1, the following relationship between the field of a line source and the field of a point source can be derived [7]:

\underbrace{\frac{i}{4} H_0^{(1)}\!\Big(\tfrac{\omega}{c}|\mathbf{x}-\mathbf{x}_0|\Big)}_{G_{\mathrm{2D}}(\mathbf{x}-\mathbf{x}_0,\,\omega)} \approx \sqrt{\frac{ic}{\omega}}\, \sqrt{2\pi|\mathbf{x}-\mathbf{x}_0|}\; \underbrace{\frac{1}{4\pi}\, \frac{e^{i\frac{\omega}{c}|\mathbf{x}-\mathbf{x}_0|}}{|\mathbf{x}-\mathbf{x}_0|}}_{G_{\mathrm{3D}}(\mathbf{x}-\mathbf{x}_0,\,\omega)},   (3)

where H_0^{(1)} denotes the Hankel function of first kind and zeroth order. This results in the so-called 2.5D driving function, which is given with Eq. (3) as

D_{\mathrm{2.5D}}(\mathbf{x}_0, \omega) = \sqrt{\frac{ic}{\omega}}\; \underbrace{\sqrt{2\pi|\mathbf{x}_{\mathrm{ref}} - \mathbf{x}_0|}}_{g_0}\, D_{\mathrm{2D}}(\mathbf{x}_0, \omega),   (4)

where g_0 is chosen in such a way that it is a constant and does not depend on \mathbf{x}. In this case the amplitude is correct at a line positioned at |\mathbf{x}_{\mathrm{ref}} - \mathbf{x}_0| = y_{\mathrm{ref}} parallel to the loudspeaker array [2]. The synthesized sound field is given by

P(\mathbf{x}, \omega) = -\int_{-\infty}^{\infty} D_{\mathrm{2.5D}}(\mathbf{x}_0, \omega)\, G_{\mathrm{3D}}(\mathbf{x} - \mathbf{x}_0, \omega)\, \mathrm{d}x_0.   (5)

A reformulation of the theory based on the Kirchhoff-Helmholtz integral revealed that arbitrary convex distributions can also be employed [8,9]. This study limits itself to linear arrays, as these are the ones mainly applied in real-life scenarios at the moment. A detailed review of the theory of WFS can be found in the literature, such as [1,10].

Fig. 1. Simulation of the sound field P(x, ω) for a monochromatic focused source with a frequency of f = 1000 Hz, located at x_s = (0, 1) m. A continuous secondary source distribution with a length of L → ∞ is placed on the x-axis. The amplitude of the sound field is clipped at |P| = 1. In the area below the gray dashed line, the sound field is converging to the focus. Above the line it is diverging from the focus.

1.1 Focused Sources

In WFS, sound fields can be described by using source models to calculate the driving function. For example, to synthesize the sound field of a human speaker positioned at point \mathbf{x}_s, the model of a point source positioned at \mathbf{x}_s can be used. The point source is then driven by the speech signal of the human.

For the synthesis of a focused source, a sound field is desired that converges toward a focal point and diverges after passing it. This is known from the techniques of time-reversal focusing and can be achieved by using a point sink as source model for the converging part of the focused source sound field [3]. In order to derive an efficient implementation of the driving function, not a point sink but a line sink with a spectral correction given by Eq. (4) is used as source model for y < y_s [5]:

S(\mathbf{x}, \omega) = S_s(\omega)\, \sqrt{\frac{\omega}{ic}}\; \frac{i}{4} H_0^{(2)}\!\Big(\tfrac{\omega}{c}|\mathbf{x}-\mathbf{x}_s|\Big),   (6)

where S_s(\omega) denotes the frequency spectrum of the line sink, H_0^{(2)} the Hankel function of second kind and zeroth order, and \mathbf{x}_s = (x_s, y_s) the position of the focused source. Using Eqs. (2) and (4), this leads to the driving function

D_{\mathrm{2.5D}}(\mathbf{x}_0, \omega) = -S_s(\omega)\, g_0\, \frac{i\omega}{2c}\, \frac{y_0 - y_s}{|\mathbf{x}_0 - \mathbf{x}_s|}\, H_1^{(2)}\!\Big(\tfrac{\omega}{c}|\mathbf{x}_0 - \mathbf{x}_s|\Big),   (7)

where H_1^{(2)} denotes the Hankel function of second kind and first order. In Fig. 1 the simulated sound field P(x, ω) for a monochromatic focused source located at x_s = (0, 1) m is shown. The sound field converges for 0 < y < 1 m toward the position of the focused source and diverges for y > 1 m, which defines the listening area for the given focused source position. In addition, a phase jump occurs at y = y_s, which is well known for focal points [11].
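The driving function Eq. (7) and the synthesis integral Eq. (5) lend themselves to a direct numerical check. The following sketch is not from the paper: it discretizes Eq. (5) over a truncated array (the 0.15 m loudspeaker spacing and y_ref = 2 m are borrowed from the Fig. 2 parameters, the 4 m array length and grid resolution are illustrative assumptions) and uses SciPy's hankel2 for the Hankel function of the second kind.

```python
import numpy as np
from scipy.special import hankel2

c = 343.0                      # speed of sound in m/s
f = 1000.0                     # monochromatic source frequency in Hz
omega = 2 * np.pi * f
xs, ys = 0.0, 1.0              # focused source at (0, 1) m
y_ref = 2.0                    # amplitude reference line y_ref
dx0 = 0.15                     # loudspeaker spacing in m

# Sampled, truncated secondary source distribution on the x-axis
x0 = np.arange(-2.0, 2.0 + dx0, dx0)

# Driving function for a focused source, Eq. (7)
g0 = np.sqrt(2 * np.pi * y_ref)                     # constant for a linear array
r0 = np.hypot(x0 - xs, -ys)                         # |x0 - xs| with y0 = 0
D = -g0 * (1j * omega / (2 * c)) * (-ys / r0) * hankel2(1, omega / c * r0)

# Synthesized field, Eq. (5), with the integral replaced by a sum over loudspeakers
X, Y = np.meshgrid(np.linspace(-2, 2, 81), np.linspace(0, 3, 61))
P = np.zeros_like(X, dtype=complex)
for xk, Dk in zip(x0, D):
    r = np.maximum(np.hypot(X - xk, Y), 1e-3)       # avoid the singularity at the array
    G3D = np.exp(1j * omega / c * r) / (4 * np.pi * r)
    P -= Dk * G3D * dx0                             # minus sign from Eq. (5)

print(P.shape)
```

Plotting Re{P} over the grid should give a picture qualitatively similar to Fig. 1, plus the additional wave fronts that appear once the continuous distribution is sampled at discrete positions.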

Fig. 2. Graphs (a) and (b) show the amplitude distribution for a focused source positioned at x_s = (0, 1) m generated by the driving function Eq. (8) in black and by the driving function without large-argument approximation Eq. (7) in gray. In addition, the amplitude distribution for a real source located at the same position as the focused source is shown with the dashed line. (a) is parallel to the x-axis at y = 2 m and (b) parallel to the y-axis at x = 0 m. Graph (c) presents the frequency response of the focused source Eq. (8) at two listener positions x_1 = (0, 2) m and x_2 = (2, 2) m. The frequency of the monochromatic source in (a), (b) was f = 1000 Hz. Parameters: L = 1000 m, Δx_0 = 0.15 m, y_ref = 2 m.

at y = ys, which is well known for focal points [11]. Due to the limitation of the used source model S to the area y < ys, the evanescent part of the focused source sound field for y > ys is not identical with that of a point source located at xs. In order to reproduce this part correctly, very high amplitudes in the region y < ys are needed, because of the exponential decay of the evanescent waves along the y-axis. This can be shown by using the spectral division method to create the sound field [12]. Eq. (7) can be related to the traditional formulation of the driving function used in WFS [2, Eq. (2.30)] by replacing the Hankel function with its large-argument approximation [7]

$$
D_{2.5\mathrm{D}}(\mathbf{x}_0,\omega) \approx S_s(\omega)\, g_0\, \sqrt{\frac{\mathrm{i}\omega}{2\pi c}}\, \frac{y_0-y_s}{|\mathbf{x}_0-\mathbf{x}_s|^{3/2}}\, \mathrm{e}^{-\mathrm{i}\frac{\omega}{c}|\mathbf{x}_0-\mathbf{x}_s|}, \qquad (8)
$$

where g0 is explicitly given in [2]. When the driving function Eq. (8) is transformed into the time domain, it is given as

$$
d_{2.5\mathrm{D}}(\mathbf{x}_0,t) = s_s(t) * h(t) * \frac{g_0}{2\pi}\, \frac{y_0-y_s}{|\mathbf{x}_0-\mathbf{x}_s|^{3/2}}\, \delta\!\left(t + \frac{|\mathbf{x}_0-\mathbf{x}_s|}{c}\right), \qquad (9)
$$

where c is the speed of sound, δ the delta function, and h(t) denotes the inverse Fourier transform

$$
h(t) = \mathcal{F}^{-1}\!\left\{\sqrt{\frac{\mathrm{i}\omega}{c}}\right\}. \qquad (10)
$$

It is easy to see that this driving function can be implemented very efficiently by filtering the virtual source signal ss(t) with the so-called pre-equalization filter h(t) and weighting and delaying the pre-filtered signal appropriately for every secondary source. In order to verify the influence of the applied large-argument approximations, the amplitude distribution of the synthesized sound field can be studied. Fig. 2 shows the amplitude for a focused source positioned at xs = (0, 1) m along two axes. The amplitude of a real source positioned at xs is shown for reference as a gray dashed line. The black line shows the amplitude distribution for a focused

source synthesized by the classical 2.5D driving function Eq. (8), the gray line for a focused source synthesized with the 2.5D driving function Eq. (7). It can be observed that the amplitude of the focused source given by Eq. (7) deviates from that of a real point source due to the 2.5D synthesis (see Fig. 2a). Interestingly, the focused source synthesized by the classical driving function Eq. (8) has a more accurate amplitude distribution at larger distances from the focal point, but its additional large-argument approximation reinforces the ripples of the amplitude distribution that exist due to the 2.5D approximation.

1.2 Loudspeakers as Secondary Sources
Theoretically, when an infinitely long continuous secondary source distribution is used, no errors other than an amplitude mismatch due to the 2.5D synthesis are expected in the sound field [5]. However, such a continuous distribution cannot be implemented in practice, because a finite number of loudspeakers has to be used. This results in a spatial sampling and spatial truncation of the secondary source distribution. In principle, both can be described in terms of diffraction theory (see, e.g., [11]). Unfortunately, as a consequence of the dimensions of loudspeaker arrays and the large range of wavelengths in sound as compared to light, most of the assumptions made to solve diffraction problems in optics are not valid in acoustics. To present some of the basic properties of truncated and sampled secondary source distributions, simulations of the sound field are made and interpreted in terms of basic diffraction theory where possible.
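The prefilter-weight-delay structure of Eqs. (9) and (10) can be sketched in a few lines of numpy. This is our own minimal illustration, not the implementation used by the authors: the acausal time advance of Eq. (9) is made realizable by a constant pre-delay, delays are rounded to whole samples, and the function name is ours.

```python
import numpy as np

def focused_source_driving_signals(s, fs, x0, xs, c=343.0, g0=1.0):
    """Minimal sketch of the 2.5D focused-source driving function, Eq. (9).

    s  : source signal ss(t) as a 1-D array
    fs : sampling rate in Hz
    x0 : (N, 2) array of secondary source positions, with y0 = x0[:, 1]
    xs : (2,) position of the focused source
    """
    n = len(s)
    # Pre-equalization filter h(t) = F^-1{ sqrt(i w / c) }, Eq. (10),
    # applied once to the source signal via the FFT.
    w = 2.0 * np.pi * np.fft.rfftfreq(n, 1.0 / fs)
    s_pre = np.fft.irfft(np.fft.rfft(s) * np.sqrt(1j * w / c), n)

    r = np.linalg.norm(x0 - np.asarray(xs), axis=1)        # |x0 - xs|
    weights = g0 / (2.0 * np.pi) * (x0[:, 1] - xs[1]) / r**1.5
    # Time reversal: the advance delta(t + |x0 - xs|/c) is made causal
    # with a constant pre-delay, so the outermost loudspeakers fire first.
    delays = np.round((r.max() - r) / c * fs).astype(int)

    d = np.zeros((len(x0), n))
    for i in range(len(x0)):
        k = delays[i]
        d[i, k:] = weights[i] * s_pre[:n - k]
    return d
```

A practical renderer would use fractional-delay filters instead of sample rounding; the sketch only shows how the single pre-filter is shared by all secondary sources while weight and delay differ per loudspeaker.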

1.2.1 Spatial Sampling
The spatial sampling, which is equivalent to the diffraction by a grating, only has consequences for frequencies greater than the aliasing frequency

$$
f_{\mathrm{al}} \geq \frac{c}{2\,\Delta x_0}, \qquad (11)
$$

where Δx0 describes the distance between the secondary sources [5]. In general, the aliasing frequency is position dependent (cf. [8, Eq. 5.17]), but an analytical solution

WIERSTORF ET AL.



Fig. 3. Simulation of a sound field P(x, ω) for a focused source with a frequency of f = 3000 Hz. A sampled secondary source distribution with a distance of Δx0 = 0.15 m between the individual sources was used. The amplitude of the sound field is clipped at |P| = 1. The gray circle indicates the region without aliasing given by Eq. (12). Parameters: xs = (0, 1) m, L = 1000 m, yref = 2 m.
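Monochromatic field plots like Fig. 3 can be reproduced by summing discrete secondary-source contributions. The sketch below is our own simplification, not the authors' simulation code: it drives each loudspeaker with the classical driving function Eq. (8) assuming Ss(ω) = 1 and g0 = 1, and models each loudspeaker as a 3-D point source G = e^(-ikr)/(4πr).

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def sound_field(gx, gy, x0, xs, f, g0=1.0, c=C):
    """Monochromatic sound field P(x, w) of a focused source.

    gx, gy : 2-D grids of evaluation points (must not coincide with
             a loudspeaker position)
    x0     : (N, 2) secondary source positions, xs: (2,) focus position
    """
    w = 2.0 * np.pi * f
    k = w / c
    rs = np.linalg.norm(x0 - xs, axis=1)                   # |x0 - xs|
    # Driving weights according to Eq. (8) with Ss(w) = 1:
    drive = (g0 * np.sqrt(1j * w / (2.0 * np.pi * c))
             * (x0[:, 1] - xs[1]) / rs**1.5 * np.exp(-1j * k * rs))
    dx0 = np.linalg.norm(x0[1] - x0[0])                    # source spacing
    p = np.zeros(gx.shape, dtype=complex)
    for di, (sx, sy) in zip(drive, x0):
        r = np.hypot(gx - sx, gy - sy)
        p += di * np.exp(-1j * k * r) / (4.0 * np.pi * r) * dx0
    return p
```

Evaluating this on a grid for f above the aliasing frequency of Eq. (11) shows the interference artifacts visible in Fig. 3.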


Fig. 4. Simulation of the sound field p(x, t) for a broadband focused source at t = 2.4 ms after the wave front has passed the focus. Parameters: xs = (0, 1) m, L = 100 m, Δx0 = 0.15 m, yref = 2 m.

for focused sources is not available at the moment. Fig. 3 shows the monochromatic sound field for a focused source with a frequency of 3000 Hz generated by a secondary source distribution with Δx0 = 0.15 m. Clear interference artifacts are visible in the sound field, but there is also an area around the focus where no interference has taken place. This is a unique property of focused sources. The size of the area depends on the frequency f and becomes smaller with higher frequencies. It can be empirically described by a circle with a radius of (cf. [13])

$$
r_{\mathrm{al}} = \frac{y_s\, c}{f\, \Delta x_0}. \qquad (12)
$$

The area calculated using the parameters applicable to Fig. 3 is indicated by the gray circle. Fig. 4 shows a snapshot of the sound field of a broadband focused source to examine further implications of the spatial sampling artifacts. Every single loudspeaker emits a broadband signal according to Eq. (9). If no spatial aliasing occurs, the signals cancel each other out in the listening


Fig. 5. Simulation of a sound field P(x, ω) for a focused source with a frequency of f = 1000 Hz generated with a secondary source distribution of length L = 2 m. The amplitude of the sound field is clipped at |P| = 1. The two gray lines indicate the size of the focus according to Eq. (13). Parameters: xs = (0, 1) m, Δx0 = 0.15 m, yref = 2 m.

area, with the exception of the desired wave front of the focused source. In the case of spatial aliasing, for frequencies above the aliasing frequency the cancellation does not occur, and a number of additional wave fronts reach a given listener position before the desired wave front characterizing the focused source. These additional wave fronts are very critical for the perception of focused sources, as we will see in Section 2. The additional wave fronts also add energy to the signal, which can be seen from the spectrum shown in Fig. 2c. The figure depicts the frequency responses for two different listener positions x1 and x2, which are associated with different aliasing frequencies. Obviously, above the aliasing frequency the magnitude of the frequency response increases. This can be avoided by applying the pre-equalization filter only up to the aliasing frequency [5], which has the shortcoming of making the filter position dependent.
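For orientation, Eqs. (11) and (12) are simple enough to evaluate directly (a sketch; the helper names are ours). With Δx0 = 0.15 m and c = 343 m/s the aliasing frequency is about 1.1 kHz, and at f = 3000 Hz the aliasing-free circle of Fig. 3 has a radius of roughly 0.76 m:

```python
C = 343.0  # speed of sound in m/s

def aliasing_frequency(dx0, c=C):
    """Eq. (11): spatial aliasing sets in above f_al = c / (2 dx0)."""
    return c / (2.0 * dx0)

def aliasing_free_radius(ys, f, dx0, c=C):
    """Eq. (12): empirical radius r_al = ys c / (f dx0) of the
    aliasing-free circle around the focus."""
    return ys * c / (f * dx0)
```

Note how the aliasing-free region shrinks with increasing frequency, as stated above.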

1.2.2 Truncation
The spatial truncation of the loudspeaker array leads to further restrictions. On the one hand, the listening area becomes smaller with a smaller array, as shown in Fig. 5. The listening area can be approximated by the triangle that is spanned for y > ys by the two lines coming from the edges of the loudspeaker array and crossing the position of the focused source. Another problem is that a smaller loudspeaker array introduces diffraction into the sound field. The loudspeaker array can be seen as a single slit that diffracts the sound field propagating through it. This can be described in a way equivalent to the phenomenon of edge waves as shown by Sommerfeld and Rubinowicz (see [11] for a summary). The edge waves are two additional spherical waves originating from the edges of the array, which can be softened by applying a tapering window [14]. The resulting diffraction pattern adds artifacts to the desired sound field. For example, the interaural level differences (ILDs) will not be correct due to diffraction minima and maxima, as shown in Fig. 9.
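Such a tapering window amounts to per-loudspeaker amplitude weights. The squared-Hann taper over 15% of the array, as used for the stimuli in Section 2.1, might be sketched as follows (our own helper, not the authors' code):

```python
import numpy as np

def tapering_window(n_speakers, fraction=0.15):
    """Squared-Hann tapering weights for a linear loudspeaker array.

    `fraction` of the loudspeakers at each end are faded out to soften
    the edge waves caused by truncation; the middle stays at unity.
    """
    n_taper = int(round(fraction * n_speakers))
    win = np.ones(n_speakers)
    if n_taper > 0:
        # rising half of a Hann window, squared
        ramp = np.hanning(2 * n_taper + 1)[:n_taper] ** 2
        win[:n_taper] = ramp
        win[-n_taper:] = ramp[::-1]
    return win
```

The weights multiply the driving signals of Eq. (9); the trade-off is a slightly shorter effective array in exchange for weaker edge waves.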


The diffraction also leads to a wavelength-dependent widening of the focus. The width of the focus at its position ys can be defined as the distance between the first minima in the diffraction pattern and is given by

$$
s = 2\,|y_s - y_0|\,\tan\!\left(\sin^{-1}\frac{\lambda}{L}\right), \qquad (13)
$$

where s is the width of the focus, λ the wavelength, L the array length, ys the y-position of the focused source, and y0 the y-position of the loudspeaker array. This formula is based on the assumption of Fraunhofer diffraction near the focus [11, 8.3 Eq. 34]. In Fig. 5, the calculated size of the focal point is indicated by the gray lines.

2 PERCEPTUAL DIMENSIONS OF FOCUSED SOURCES
In the last section, different artifacts in the synthesized sound field of a focused source were discussed. These artifacts raise the question whether and how they affect the perception of focused sources. In this section, a listening test is presented that investigates this perceptual impact. It was shown in the previous section that the aliasing frequency f_al due to the spatial sampling introduced by the loudspeakers depends on the listening position. In addition, the diffraction due to truncation of the loudspeaker array depends on the size of the array. Hence, different array sizes and listener positions have to be considered in such a listening test. We have further shown that the time-reversal technique used to create focused sources, in combination with spatial aliasing, leads to additional wave fronts arriving at the listener position from different directions and before the desired wave front. This situation only occurs with such synthetic sound fields, not in everyday listening in natural environments. As a consequence, it can be expected that describing the related perceptual effects requires multidimensional attributes in the perceptual domain. To address this issue, the Repertory Grid Technique (RGT) was used to identify perceptually relevant attributes [15,16]. With this method, in a first step each participant creates her/his own set of attributes and in a second step uses the respective attribute scales for rating their perception.
No attributes are provided by the experimenter, and thus the test subjects have complete freedom in the choice of attributes. A more detailed discussion of this first experiment was presented in [6].

2.1 Method
2.1.1 Stimuli
The tests were conducted with a "virtual" WFS system realized by dynamic binaural re-synthesis [17] using headphones. See Fig. 6 for a sketch of the geometry of the employed virtual WFS configurations. Two linear loudspeaker arrays with a length L of 4 m and 10 m and a loudspeaker spacing of Δx0 = 0.15 m were synthesized. To handle truncation, a squared Hann tapering window with a length of

Fig. 6. Geometry of the experiment. Every listener was positioned at six different positions given by the head symbols. The focused source was always positioned at xs = (0, 1) m, and the center of the used loudspeaker array was always positioned at x = (0, 0) m.

0.15L on both ends of the arrays was used. The impulse responses of the individual virtual loudspeakers of the array were obtained by interpolating and weighting a database of head-related impulse responses (HRIRs) of the FABIAN manikin [18] to the required positions and distances of the loudspeakers. Every virtual loudspeaker was then weighted and delayed according to the driving function Eq. (9) for a given focused source and listener position; the contributions were summed and filtered with the pre-equalization filter from Eq. (10). The result was a pair of HRIRs of the desired WFS array producing the given focused source for a given listener position. For the dynamic binaural re-synthesis, these pairs of HRIRs had to be calculated for all possible head orientations of the listener, ranging from −180◦ to 180◦ in 1◦ steps. As discussed in Section 1.2, the aliasing frequency f_al depends on the listener position; therefore, the WFS pre-equalization filter was calculated separately for each simulated listening position. Coloration introduced by an improper choice of the pre-equalization filter was not part of the investigation and was to be avoided. For both arrays, three different listener positions on a circle around the focused source were used. The radius was R1 = 1 m for the short array and R2 = 4 m for the long array. Three different listener angles of φ = 0◦, 30◦, and 60◦ were applied for both array lengths (see Fig. 6). These six configurations will be referred to as 0◦4m, 30◦4m, 60◦4m, 0◦10m, 30◦10m, and 60◦10m. In all conditions, the focused source was located directly in front of the listener. A seventh reference condition ("ref.") was created, which consisted of a single sound source located at the position of the focused source. This was realized by directly using the corresponding HRIRs from the database. As audio source signals, anechoic recordings of speech and of castanets were chosen.
The speech signal was an 8 s sequence of three different sentences uttered by a female speaker. The castanets recording was 7 s long. The levels of the stimuli were normalized to equal loudness for all conditions by informal listening by the authors. The real-time convolution of these signals with the impulse responses for the WFS arrays was performed using the SoundScape


Renderer (SSR, http://tu-berlin.de/?id=ssr) [19], an open-source software environment for spatial audio reproduction. The SSR performed a real-time convolution of the input signal with the pair of impulse responses corresponding to the instantaneous head orientation of the test subject, as measured by a Polhemus Fastrak tracking system. In the SSR, switching between different audio signals is realized using a smooth cross-fade with raised-cosine shaped ramps. AKG K601 headphones were used, and the transfer functions of both earphones were compensated by appropriate filters [20]. Audio examples are available as supplementary material (http://audio.qu.tu-berlin.de/?p=625).

2.1.2 Participants
In order to generate a large number of meaningful attributes, test subjects with experience in analytical listening to audio recordings were recruited. The experiment was conducted with 12 Tonmeister students (3 female, 9 male, between 21 and 33 years old). The participants had between 5 and 20 years of musical education, and all of them had experience with listening tests. They had normal hearing levels and were financially compensated for their effort.

2.1.3 Procedure
The participants received written instructions explaining their tasks in the two phases of the experiment. The RGT procedure consisted of two parts, the elicitation phase and the rating phase. In the elicitation phase, groups of three conditions (triads) were presented to the test subject. The subjects were able to switch between them by pressing a corresponding button and could listen to each stimulus as long as they wanted. For each triad, the subject had to decide which two of the three stimuli were more similar and had to describe the characteristic that made them similar, and in which characteristic they differed from the third stimulus (which should be the opposite of the first property). If there were competing aspects, only the strongest one should be taken into account. One attribute pair per triad had to be specified, and two more could optionally be given if the test subject perceived several different properties. A screenshot of the used test GUI is shown in [6]. After a short training phase, every participant had to execute this procedure 12 times (using 12 different triads). Ten of the 12 triads resulted from a complete set of triads from the five conditions ref., 30◦4m, 60◦4m, 30◦10m, and 60◦10m. The two additional triads were (ref., 0◦4m, 0◦10m) and (0◦4m, 30◦4m, 0◦10m). These two were chosen in order to consider the additional, very similar conditions together, to get attributes for the small differences between them.
Complete triads for only five conditions were chosen because of the time-consuming procedure (a complete set of triads for 7 conditions would have resulted in 35 triads). The presented triads were the same for all participants; however, the order of the triads and the order of conditions


within a triad were varied over the participants based on a Latin square design. After the elicitation phase the participants took a break. During this time, the test supervisor removed repetitions of attribute pairs to construct the attribute list used in the second RGT test phase. In this rating phase, one previously elicited attribute pair was displayed at the top of the screen in each trial. Below, the seven conditions could be played back and had to be rated on corresponding sliders. The ratings were saved on a continuous scale ranging from −1.0 to 1.0. Once a rating was collected for all conditions, the test subject was able to switch to the next screen, a procedure repeated until all elicited attribute pairs had been used. Before the actual test, a training phase had to be completed for two rating screens. In the second session, which was in most cases done on another day, the elicitation and rating phases were repeated with the respective other source stimulus. Half of the subjects were presented with the speech sample in the first session and the castanets in the second session, and vice versa for the other half.

2.2 Results
One of the main results of the experiment was the set of elicited attribute pairs. They reflect the range of perceptual similarities and differences among the conditions. Their number differed between subjects, ranging from 6 to 17 pairs for individual subjects. The most prominent choices were artifacts (e.g., clean sound vs. chirpy, squeaky, unnatural sound) and localization (left vs. center). For the latter, it has to be noted that the focused source was always positioned straight in front of the listener. Attributes describing artifacts were provided by 10 of the 12 subjects for castanets and by 9 subjects for speech. Localization-related attributes were given by 7 subjects for castanets and 5 subjects for speech. Other common attributes were related to coloration (original vs. filtered, balanced vs. unbalanced frequency response), distance (far vs. close), and reverberation (dry vs. reverberant). All elicited attributes were originally collected in German and were translated to English for this paper. The ratings of the attributes can be used to identify the underlying dimensions that best describe the perception of focused sources. This was done using a principal component analysis (PCA) for individual subjects. For all subjects, two principal components could be identified as the main dimensions of the perceptual space. These dimensions explain 90% of the variance for castanets and 97% for speech. This also makes it possible to determine the positions of the different conditions in the resulting perceptual space. Fig. 7 shows the PCA results of one individual subject for speech and castanets. The PCA results for another subject can be found in [6]. The black dots represent the different conditions in this two-dimensional perceptual space. The gray lines show the arrangement of the elicited attribute pairs in this space. From Fig. 7 it can be seen that for both castanets and speech the first principal component


Fig. 7. Principal component analysis for castanets (left) and speech (right) for a single subject. The black points indicate the positions of the conditions in the two-dimensional space determined by the two components for each stimulus type. The gray lines show the arrangement of the attribute pairs in these two dimensions.

(C1 and S1, respectively) can be interpreted as a mixture of the amount of artifacts and the distance, and the second principal component (C2 and S2) as the localization of the source. Considering individual conditions, it can be observed that the 10 m loudspeaker array was rated as producing artifacts in the perception of the focused source, while the artifact-related ratings for the 4 m array are more or less the same as for the reference condition. For the longer array, the amount of artifacts depends on the listener position, with the highest rating of artifacts at the lateral position 60◦10m. The perception of a wrong (off-center) direction is most distinct for the lateral positions of the shorter array, with the condition 60◦4m as the most prominent case. Both lateral positions (φ = 60◦) were perceived as more off-center than the other ones. Furthermore, it can be noted that the perceptual deviation from the reference condition occurs for more conditions for the castanets than for the speech stimuli.
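The per-subject analysis can be illustrated with a plain-numpy PCA over a ratings matrix (conditions by attribute scales). This is our own sketch, not the authors' analysis code:

```python
import numpy as np

def two_component_pca(ratings):
    """PCA via SVD: rows = conditions, columns = attribute scales.

    Returns (scores, explained), where `scores` are the condition
    coordinates on the first two principal components and `explained`
    is the fraction of total variance these two components account for.
    """
    x = ratings - ratings.mean(axis=0)           # center each attribute
    u, svals, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :2] * svals[:2]                # project on first two PCs
    explained = (svals[:2] ** 2).sum() / (svals ** 2).sum()
    return scores, explained
```

Plotting the two columns of `scores` against each other reproduces the kind of condition layout shown in Fig. 7.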

2.3 Discussion
The results show that the amount of perceived artifacts depends on the length of the loudspeaker array and the position of the listener, being worse for a larger loudspeaker array and a more lateral listener position. This is due to the fact that for a larger loudspeaker array more additional wave fronts arrive before the desired one for the focused source. The perceived amount of artifacts further increases with the degree of lateral displacement of the listener relative to the focused source (see Fig. 6). The explanation for this finding can be illustrated using Fig. 8. Here, the directions of incidence of the desired (black arrow) and of the aliasing-related wave fronts for the focused sources are shown for the different listener and array configurations. Note that the arrows point into the direction of incidence from the listener perspective. The starting point of an arrow indicates the position in time of the wave front, and the


Fig. 8. Direction, amplitude, and time of appearance of the wave fronts for the 4 m loudspeaker array (left) and the 10 m array (right). The results are shown for different angles φ at a radius R1 = 1 m (left) and R2 = 4 m (right). The arrows point toward the direction from which the wave fronts arrive. The time of appearance is given by the starting point of an arrow. Note that the (temporal) starting points lie closely together for listener positions close to the contributing loudspeakers of the array, and further apart when the configuration involves larger distances from the loudspeakers. The length of an arrow is proportional to the amplitude of the wave front in dB; the length of the arrow in the legend corresponds to an amplitude of 30 dB. The black arrows indicate the desired wave fronts, the gray arrows aliasing-related wave fronts.
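The arrival pattern of Fig. 8 follows directly from the geometry. Under the time-reversed driving of Eq. (9), each loudspeaker's wave front can be timed relative to the desired focused wave front (our own sketch; amplitudes are omitted):

```python
import numpy as np

def wavefront_arrivals(x0, xs, xl, c=343.0):
    """Per-loudspeaker wave fronts at a listener position xl.

    Each secondary source emits at t = (r_max - r_focus)/c, the causal
    version of the advance in Eq. (9), and its front reaches the
    listener a further |x0 - xl|/c later. Returns the arrival times
    relative to the desired focused wave front (negative = earlier)
    and the incidence azimuth of each front in degrees.
    """
    x0, xs, xl = map(np.asarray, (x0, xs, xl))
    r_focus = np.linalg.norm(x0 - xs, axis=1)
    r_lis = np.linalg.norm(x0 - xl, axis=1)
    t = (r_focus.max() - r_focus + r_lis) / c
    # desired front: passes through the focus, then travels to the listener
    t_desired = (r_focus.max() + np.linalg.norm(xl - xs)) / c
    az = np.degrees(np.arctan2(x0[:, 0] - xl[0], x0[:, 1] - xl[1]))
    return t - t_desired, az
```

By the triangle inequality no aliasing front can arrive later than the desired one, which matches the pattern in Fig. 8: all gray arrows start at or before the black one.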



Fig. 9. Simulation of a sound field P(x, ω) for a focused source synthesized with different array lengths, indicated by the loudspeaker symbols. The array lengths are, from left to right: 1.8 m, 0.75 m, 0.3 m. The amplitude of the sound fields is clipped at |P| = 1. The two parallel gray lines indicate the size of the focal point calculated using Eq. (13). A tapering window of 15% of the array length was applied at the ends of the arrays, indicated by the different loudspeaker colors. Parameters: xs = (0, 1) m, f = 1000 Hz, Δx0 = 0.15 m, yref = 2 m.

length of the arrow is proportional to its amplitude in dB. It is obvious that the larger the loudspeaker array, the earlier the occurrence of additional wave fronts and the higher their amplitude. This is due to the fact that every single loudspeaker adds a wave front. For a given array, the number of wave fronts is the same regardless of the lateral listener position, but the time of arrival of the first wave front is earlier for lateral positions. This can be explained by the fact that the listener is then positioned closer to one end of the loudspeaker array. The loudspeakers at the ends of the array have to be driven first in order to create a focused source in the middle of the loudspeaker array, resulting in the significantly earlier incidence of the wave fronts from the loudspeakers close to the listener. The results show a dependency of the perceived direction on the listener position and the array size. The condition 60◦4m was perceived as coming most from the left. The perceived direction can be explained by the additional wave fronts, too. The conditions with φ = 0◦ were perceived as coming from the same direction as the reference condition, in front of the listener, for both array lengths. For these conditions, the additional wave fronts have no effect on the perceived direction, because they arrive at the listener position symmetrically from all directions (Fig. 8). For the lateral conditions, the first wave front comes mainly from the left side of the listener. Due to the precedence effect [21], this can lead to localization of the sound in the direction of the (first) wave front. For the 10 m array, the perceived direction differs from that of the shorter array. Most of the subjects localized the sound in the same direction as the reference. However, a few subjects indicated that they had heard more than one sound source: one high-frequency chirping source from the left and a cleaner source in front of them. This can be explained by the echo threshold associated with the precedence effect: wave fronts that follow the first one with a lag larger than the echo threshold are perceived as an echo [22]. In order to verify this hypothesis, an experiment was performed to examine the localization dominance for this kind of time-delayed wave-front pattern [23]. There, an approximate time of 8 ms between the first wave front

and the desired one was identified as the threshold up to which the perceived direction is dominated by the first wave front. This is consistent with the results for the large array.

3 INFLUENCE OF ARRAY LENGTH ON THE PERCEPTION
As mentioned in Section 1.2, truncation of the loudspeaker array leads to two opposing effects. On the one hand, a smaller array leads to fewer additional wave fronts and reduces the perception of artifacts, as shown in the last section. On the other hand, a smaller array leads to stronger diffraction of the sound field and therefore to a smaller possible listening area as well as wrong binaural cues. Fig. 9 shows the sound fields for a focused source created at xs = (0, 1) m using arrays of three different lengths, L = 1.8 m, L = 0.75 m, and L = 0.3 m, with the same fixed inter-loudspeaker distance of Δx0 = 0.15 m as used previously. Hence, the arrays consist of 13, 6, and 3 loudspeakers. The figure illustrates that the focal point gets very large and even disappears for short arrays. This is indicated by the gray lines, which show the size of the focus as calculated using Eq. (13). For the short array with L = 0.3 m the equation is not defined at the given frequency of f = 1000 Hz, because λ/L > 1. In this case, no focal point exists, and the source is located near the position of the loudspeaker array, as can be seen in the rightmost graph of Fig. 9. In addition, the maxima and minima of the diffraction pattern introduce wrong interaural level differences (ILDs) at different listener positions. Note that these wrong binaural cues may differ for a planar array due to the absence of the amplitude error of 2.5D WFS. To verify whether there is an array length for which the artifacts are inaudible and the wrong binaural cues are negligible as well, a listening test was conducted that included the three shorter array lengths shown in Fig. 9 together with the two array lengths used in the first experiment.
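The breakdown of Eq. (13) for λ/L > 1 is easy to check numerically (a sketch; the helper name is ours). At f = 1000 Hz (λ ≈ 0.34 m), the 1.8 m array yields a focus width of roughly 0.4 m, the 0.75 m array an even wider focus, and for the 0.3 m array the formula is undefined, matching the rightmost panel of Fig. 9:

```python
import math

def focus_width(ys, y0, f, L, c=343.0):
    """Eq. (13): s = 2 |ys - y0| tan(asin(lambda / L)).

    Returns None when lambda / L > 1, in which case Eq. (13) is not
    defined and no focal point forms."""
    lam = c / f
    if lam / L > 1.0:
        return None
    return 2.0 * abs(ys - y0) * math.tan(math.asin(lam / L))
```
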
In the test, two attribute pairs were rated by the subjects, one regarding the audible artifacts and one regarding the perceived position of the focused source. The middle of the array was

again chosen at x = (0, 0) m in order to have a symmetric loudspeaker distribution around the x-position of the focused source. A more detailed discussion of the experiment is presented in [24].

3.1 Method
3.1.1 Stimuli
The tests were conducted with a similar geometry and the same source materials as described in Section 2.1. The same listener positions as in Fig. 6 were used, now with the array sizes L = 10 m, 4 m, 1.8 m, 0.75 m, and 0.3 m. Again, the two radius values R1 = 1 m and R2 = 4 m were used: only R1 for the 4 m array, only R2 for the 10 m array, and both values for the three other array sizes. Altogether nine different conditions were created, again including the reference condition. Audio examples are available as supplementary material (http://audio.qu.tu-berlin.de/?p=625).

3.1.2 Participants
Six test subjects participated in the test. All of them were members of the Audio Group at the Quality and Usability Lab and had normal hearing.

3.1.3 Procedure
After an introduction and a short training phase with a violin piece as source material, one half of the participants started the first session with speech, the other half with castanets. In a second session, the speech and castanets source materials were switched between the groups. The subjects were presented with a screen containing nine sliders representing the nine different conditions. At the top of the screen, one of the two attribute pairs few artifacts vs. many artifacts and left vs. right was presented. After a subject had rated all conditions, the next attribute pair was presented for the same conditions. The order of the conditions attached to the sliders and the order of appearance of the attribute pairs were randomized. This procedure was repeated three times, once for each listening angle φ. For the listening angle of φ = 0◦, the attribute pair left vs. right was omitted.

3.2 Results
Fig. 10 presents the mean ratings over all subjects, all listener positions, and both source materials (speech and castanets) for the attribute pair few artifacts vs. many artifacts. Hence, the only remaining independent variable is the array length plotted on the x-axis. The 0◦ position for the speech material turned out to be an outlier and was not considered for the plot. At this position and with speech as source material, the targeted artifacts are barely audible; on the other hand, there is the coloration introduced by the spatial sampling, which is independent of the fact that focused sources were synthesized. An interview with the subjects revealed that four of them had rated this coloration rather than the targeted audible artifacts.

Fig. 10. Mean and variance of the ratings of the attribute pair few artifacts vs. many artifacts, plotted over the conditions. The mean is calculated over all subjects, source materials, and the different listener positions.

It can be seen in the figure that the results for the different loudspeaker arrays form three groups. The two shortest arrays resulted in as few artifacts as the reference condition. The 10 m array was found to lead to strong artifacts, as expected from the previous experiment. The amounts of artifacts caused by the 1.8 m and the 4 m arrays lie between these two groups. A one-way ANOVA shows that these three groups differ significantly (p < 0.05) from each other, while there are no significant differences within each group.

Fig. 11. Mean and variance of the ratings of the attribute pair left vs. right, plotted over the conditions. The mean is calculated over the different subjects, source materials, and the two listener angles 30◦ and 60◦. The results are presented separately for the two radius values R1 = 1 m and R2 = 4 m. The black points are the results obtained using the Lindemann model, see Section 4.

In Fig. 11, the results for the attribute pair left vs. right are presented. The means for the arrays were calculated over the 30◦ and 60◦ conditions, but separately for each radius, indicated by the two different shades of gray. It can be seen that the reference condition (arriving from straight ahead of the listener) was rated as coming slightly from the right side. All other conditions came from the left side, with shorter arrays and smaller radii leading to ratings further to the left. The two source materials speech and castanets showed significant differences only for the 10 m array and the 30◦ and 60◦ positions, with more artifacts perceivable for the castanets stimuli.

3.3 Discussion
As mentioned in Section 2, the appearance of additional wave fronts due to spatial aliasing leads to strong artifacts

for focused sources. The arrival time of the first wave front at the listener position can be reduced by using a shorter loudspeaker array. This leads to a reduction of audible artifacts, as shown by the results for the attribute pair few artifacts vs. many artifacts. The two smallest arrays, with lengths of 0.3 m and 0.75 m, are rated to have the same amount of artifacts as the single-loudspeaker reference. All three loudspeaker arrays with a length of L < 2 m have first-wave-front arrival times below 5 ms. This means that they fall into a time window in which the precedence effect should work and no echo should be audible. The artifacts audible for the array with L = 1.8 m are therefore due to a comb-filter-shaped ripple in the frequency spectrum of the signal, a result of the temporal delay-and-superposition procedure of the loudspeakers, see (5) and (9). However, there are other problems related to a shorter array. The main problem is the localization of the focused source. Fig. 11 shows a relation between array length and localization: the shorter the array, the further left the focused source is perceived. This result implies that the precedence effect cannot be the only reason for the wrong perception of the location. For a shorter array, too, the first wave front arrives from the loudspeaker at the edge of the array, but this loudspeaker is positioned less far to the left than for a longer array. Therefore, it is likely that the diffraction due to the short array length introduces wrong binaural cues, namely a wrong ILD.
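The arrival-time reasoning above can be illustrated with a small geometric sketch (not part of the original study; assumed geometry: linear array along the x-axis, focus point at distance R in front of the array center, listener on the array axis at distance d_listener behind the focus point):

```python
from math import hypot

C = 343.0  # speed of sound in m/s

def first_wavefront_advance(L, R, d_listener):
    """Time (in s) by which the wave front radiated by the edge loudspeaker
    of a linear array of length L overtakes the desired focused-source
    wave front at the listener position."""
    edge_x = L / 2.0                         # edge loudspeaker at (L/2, 0)
    d_edge_focus = hypot(edge_x, R)          # edge loudspeaker -> focus point
    d_edge_listener = hypot(edge_x, R + d_listener)
    # The edge loudspeaker is driven early enough that its wave front passes
    # the focus point at t0; the focused wave front then needs d_listener/c
    # to reach the listener, the edge wave front only the remaining direct path.
    return (d_listener - (d_edge_listener - d_edge_focus)) / C

for L in (0.3, 0.75, 1.8, 4.0, 10.0):
    dt_ms = 1e3 * first_wavefront_advance(L, R=1.0, d_listener=1.0)
    print(f"L = {L:5.2f} m: first wave front arrives {dt_ms:.2f} ms early")
```

The advance grows monotonically with the array length, consistent with the observation that only the short arrays stay safely inside the precedence-effect time window; the exact values depend on the assumed listener distance and radius.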

4 MODELING THE PERCEPTION OF FOCUSED SOURCES
To verify the findings for localization perception, a binaural model according to Lindemann [25] was applied using the parameters from the original paper. The model is part of the auditory modeling toolbox.4 This model analyzes binaural cues like the interaural time difference (ITD) and the interaural level difference (ILD). The ITD is calculated for a given signal by a cross-correlation ψn in different frequency bands n. The spacing of the frequency bands is 1 ERB. The model further analyzes the ILD via a contralateral inhibition mechanism, which leads to a shift of the resulting peak of the cross-correlation. This incorporates the ILD and ITD values into a single direction estimation, which has to be done manually in other binaural models, for example [27,28]. As a measure for the perceived direction, the mean of the cross-correlation over the frequency bands n = 5 . . . 40 was first calculated by

ψ = (1/36) Σ_{n=5}^{40} ψn.  (14)

4 http://amtoolbox.sf.net [26]
Then the centroid d of ψ was used as the model output for the perceived direction

d = ∫ τ ψ(τ) dτ / ∫ ψ(τ) dτ,  (15)

where τ is the lag of the cross-correlation. The predicted localization was scaled to achieve the same order of magnitude as the rating results. The results are plotted in Fig. 11, together with the subject ratings. Like the rating data, the model data are depicted as the mean over the two listener directions 30◦ and 60◦. The model results show good agreement with the test data. This indicates that the perceived localization is dominated by wrong binaural cues due to the diffraction artifacts of truncated arrays. Only for the two large arrays are clear deviations of the modeled results visible. For these large arrays, with L = 4 m and L = 10 m, the time of arrival between the first and the last wave front is in the region of 3 ms to 15 ms, which suggests that the precedence effect plays a role in explaining the perceived direction. The model does not account for the precedence effect, which explains the deviation of its prediction from the subject data for the large arrays.

5 SUMMARY AND CONCLUSIONS
In practice, the creation of focused sources with WFS is not free from perceptual artifacts. The time-reversal technique used in the synthesis of focused sources causes additional wave fronts to arrive at the listener position from every single loudspeaker before the desired focused-source signal. An experiment using the RGT method was carried out to identify attribute pairs that are able to describe the resulting perception of focused sources. The most dominant attribute pairs were those regarding audible artifacts, coloration, and the position of the focused source. In a second experiment with different linear array lengths and different listener positions, using only the attribute pairs few artifacts vs. many artifacts and left vs. right, it could be shown that artifacts can be reduced by using fewer loudspeakers.
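For reference, the direction estimate of Eqs. (14) and (15) reduces, for a sampled cross-correlation, to a weighted mean over the lag axis. A minimal sketch follows (this is only the centroid step, not the full Lindemann model, which additionally contains the auditory filterbank and the contralateral inhibition):

```python
import numpy as np

def lateralization_centroid(psi, tau):
    """Centroid of a cross-correlation function psi sampled at lags tau,
    i.e., the discrete counterpart of Eq. (15)."""
    return np.sum(tau * psi) / np.sum(psi)

# Synthetic check: a correlation peak centered at a lag of 0.2 ms should
# yield a centroid near 0.2 ms.
tau = np.linspace(-1e-3, 1e-3, 2001)                    # lags in seconds
psi = np.exp(-((tau - 2e-4) ** 2) / (2 * (1e-4) ** 2))  # Gaussian peak
d = lateralization_centroid(psi, tau)
```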
On the other hand, the perception of a focused source as a distinct source located at a given position is limited when using shorter arrays. The diffraction causes a wider focal point, and the localization of the focused source is disturbed. This was verified using a binaural model. The model results also indicated that the perceived localization for the small arrays is due to the wrong binaural cues introduced by the diffraction pattern. The results for the long arrays indicated that the precedence effect has to be considered for the perceived direction of focused sources created by these arrays. This study shows that the usage of focused sources in WFS with typical linear loudspeaker arrays has to be handled with care. The appearance of additional wave fronts before the desired one introduces different artifacts in the perception of the focused source, compared to the synthesis of a point source located behind the loudspeaker array. In addition to these artifacts, the strong dependency of the spatial aliasing frequency on the position of the listener


for focused sources will introduce even more coloration in a real setup using a fixed pre-equalization filter for the whole listening area. This was not the case in this study, because the pre-equalization filter was chosen adaptively for the different listener positions in order to investigate only the effect of the additional wave fronts.

For further studies it would be interesting to investigate how the perception of focused sources behaves in WFS with multiactuator panels [29]. These panels lead to a more chaotic distribution of the additional wave fronts that cause the perceptual artifacts for focused sources. Moreover, it would be beneficial to investigate the perception of focused sources in other sound field synthesis techniques such as near-field compensated higher-order Ambisonics [30] or numerical methods, for example [31].

6 ACKNOWLEDGMENT
This work was supported by DFG RA 2044/1-1.

7 REFERENCES
[1] A. Berkhout, D. de Vries, and P. Vogel, "Acoustic Control by Wave Field Synthesis," J. Acoust. Soc. Am., vol. 93, no. 5, pp. 2764–2778 (1993).
[2] E. N. G. Verheijen, Sound Reproduction by Wave Field Synthesis, Ph.D. thesis, Delft University of Technology, 1997.
[3] S. Yon, M. Tanter, and M. Fink, "Sound Focusing in Rooms: The Time-Reversal Approach," J. Acoust. Soc. Am., vol. 113, no. 3, pp. 1533–1543 (2003).
[4] H. Wittek, Perceptual Differences between Wavefield Synthesis and Stereophony, Ph.D. thesis, University of Surrey, 2007.
[5] S. Spors, H. Wierstorf, M. Geier, and J. Ahrens, "Physical and Perceptual Properties of Focused Sources in Wave Field Synthesis," presented at the 127th Convention of the Audio Engineering Society (2009 Oct.), convention paper 7914.
[6] M. Geier et al., "Perception of Focused Sources in Wave Field Synthesis," presented at the 128th Convention of the Audio Engineering Society (2010 May), convention paper 8069.
[7] E. G. Williams, Fourier Acoustics (Academic Press, San Diego, 1999).
[8] E. W. Start, "Application of Curved Arrays in Wave Field Synthesis," presented at the 100th Convention of the Audio Engineering Society (1996 May), convention paper 4143.
[9] J. Ahrens and S. Spors, "On the Secondary Source Type Mismatch in Wave Field Synthesis Employing Circular Distributions of Loudspeakers," presented at the 127th Convention of the Audio Engineering Society (2009 Oct.), convention paper 7952.
[10] S. Spors, R. Rabenstein, and J. Ahrens, "The Theory of Wave Field Synthesis Revisited," presented at the 124th Convention of the Audio Engineering Society (2008 May), convention paper 7358.
[11] M. Born and E. Wolf, Principles of Optics (Cambridge University Press, New York, 1999).


[12] S. Spors and J. Ahrens, "Reproduction of Focused Sources by the Spectral Division Method," 4th International Symposium on Communications, Control and Signal Processing (2010 Mar.).
[13] S. Spors and J. Ahrens, "Efficient Range Extrapolation of Head-Related Impulse Responses by Wave Field Synthesis Techniques," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2011).
[14] P. Vogel, Application of Wave Field Synthesis in Room Acoustics, Ph.D. thesis, Delft University of Technology, 1993.
[15] G. A. Kelly, The Psychology of Personal Constructs (Norton, New York, 1955).
[16] J. Berg and F. Rumsey, "Spatial Attribute Identification and Scaling by Repertory Grid Technique and Other Methods," presented at the 16th AES International Conference, Spatial Sound Reproduction (1999 Mar.), conference paper 16-005.
[17] A. Lindau, T. Hohn, and S. Weinzierl, "Binaural Resynthesis for Comparative Studies of Acoustical Environments," presented at the 122nd Convention of the Audio Engineering Society (2007 May), convention paper 7032.
[18] A. Lindau and S. Weinzierl, "FABIAN—An Instrument for the Software-Based Measurement of Binaural Room Impulse Responses in Multiple Degrees of Freedom," 24th VDT International Convention (2006 Nov.).
[19] M. Geier, J. Ahrens, and S. Spors, "The SoundScape Renderer: A Unified Spatial Audio Reproduction Framework for Arbitrary Rendering Methods," presented at the 124th Convention of the Audio Engineering Society (2008 May), convention paper 7330.
[20] Z. Schärer and A. Lindau, "Evaluation of Equalization Methods for Binaural Signals," presented at the 126th Convention of the Audio Engineering Society (2009 May), convention paper 7721.
[21] H. Wallach, E. B. Newman, and M. R. Rosenzweig, "The Precedence Effect in Sound Localization," Am. J. Psychol., vol. 57, pp. 315–336 (1949).
[22] J. Blauert, Spatial Hearing (The MIT Press, Cambridge, MA, 1997).
[23] H. Wierstorf, S. Spors, and A. Raake, "Die Rolle des Präzedenzeffektes bei der Wahrnehmung von räumlichen Aliasingartefakten bei der Wellenfeldsynthese," DAGA German Conference on Acoustics (2010 Mar.).
[24] H. Wierstorf, M. Geier, and S. Spors, "Reducing Artifacts of Focused Sources in Wave Field Synthesis," presented at the 129th Convention of the Audio Engineering Society (2010 Nov.), convention paper 8245.
[25] W. Lindemann, "Extension of a Binaural Cross-Correlation Model by Contralateral Inhibition. I. Simulation of Lateralization for Stationary Signals," J. Acoust. Soc. Am., vol. 80, no. 6, pp. 1608–1622 (1986).
[26] P. Søndergaard et al., "Towards a Binaural Modelling Toolbox," Forum Acusticum (2011 June).
[27] E. Blanco-Martin, F. J. Casajús-Quirós, J. J. Gómez-Alfageme, and L. I. Ortiz-Berenguer, "Objective Measurement of Sound Event Localization in Horizontal and Median Planes," J. Audio Eng. Soc., vol. 59, pp. 124–136 (2011 Mar.).


[28] M. Dietz, S. D. Ewert, and V. Hohmann, "Auditory Model Based Direction Estimation of Concurrent Speakers from Binaural Signals," Speech Communication, vol. 53, no. 5, pp. 592–605 (2011).
[29] B. Pueo, J. J. López, J. Escolano, and L. Hörchens, "Multiactuator Panels for Wave Field Synthesis: Evolution and Present Developments," J. Audio Eng. Soc., vol. 58, pp. 1045–1063 (2010 Dec.).

[30] J. Ahrens and S. Spors, "Spatial Encoding and Decoding of Focused Virtual Sound Sources," Ambisonics Symposium, Graz, Austria (2009 June).
[31] M. Kolundzija, C. Faller, and M. Vetterli, "Reproducing Sound Fields Using MIMO Acoustic Channel Inversion," J. Audio Eng. Soc., vol. 59, pp. 721–734 (2011 Oct.).

THE AUTHORS

Hagen Wierstorf

Matthias Geier

Hagen Wierstorf is a Ph.D. student in the Assessment of IP-Based Applications group at Technical University (TU) Berlin, Germany. He received his Diplom in physics from the Carl von Ossietzky Universität Oldenburg, Germany, in 2008. Since 2009 he has been working at the Telekom Innovation Laboratories at TU Berlin.


Matthias Geier is a Ph.D. student at the Institute of Communications Engineering at the University of Rostock, Germany. He received his Diplom in electrical engineering/sound engineering from the University of Technology and the University of Music and Dramatic Arts in Graz, Austria, in 2006. From 2008 to 2012 he worked at the Telekom Innovation Laboratories.


Alexander Raake

Sascha Spors

Alexander Raake is an Assistant Professor and heads the group for Assessment of IP-based Applications at Deutsche Telekom Labs, TU Berlin. From 2005 to 2009 he was a senior scientist at the Quality and Usability Lab of Deutsche Telekom Labs, TU Berlin. From 2004 to 2005 he was a postdoctoral researcher at LIMSI-CNRS in Orsay, France. He obtained his doctoral degree (Dr.-Ing.) from the electrical engineering and information technology faculty of the Ruhr-Universität Bochum in January 2005, with a book on the speech quality of VoIP. After his graduation in 1997 he took up research at the Technical University in Lausanne (EPFL) on ferroelectric thin films. Before that, he studied electrical engineering in Aachen (RWTH) and Paris (ENST/Télécom). Since 1999 he has been involved in the standardization activities of the International Telecommunication Union (ITU-T) on the transmission performance of telephone networks and terminals, where he currently acts as a Co-Rapporteur for question Q.14/12 on monitoring models for audiovisual services.


Sascha Spors is a full professor and heads the group Signal Theory and Digital Signal Processing at the Institute of Communications Engineering, University of Rostock. From 2006 to 2012 he was a senior research scientist at the Quality and Usability Lab of Deutsche Telekom Labs, TU Berlin, where he headed the audio technology group. From 2001 to 2005 he was a member of the research staff at the Chair of Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg. He received the Dr.-Ing. degree with distinction from the University of Erlangen-Nuremberg, Germany, in 2006. Dr. Spors holds several patents and has authored or co-authored more than 150 papers in journals and conference proceedings. He is a member of the IEEE Audio and Acoustic Signal Processing Technical Committee and chair of the AES Technical Committee on Spatial Audio.

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February


Spatial Decomposition Method for Room Impulse Responses

SAKARI TERVO ([email protected]), JUKKA PÄTYNEN (jukka.pä[email protected]), ANTTI KUUSINEN ([email protected]), AND TAPIO LOKKI, AES Member ([email protected])

Department of Media Technology, Aalto University School of Science, FI-00076 Aalto

This paper presents a spatial encoding method for room impulse responses. The method is based on decomposing spatial room impulse responses into a set of image-sources. The resulting image-sources can be used for room acoustics analysis and for multichannel convolution reverberation engines. The analysis method is applicable to any compact microphone array, and the reproduction can be realized with any of the current spatial reproduction methods. Listening test experiments with simulated impulse responses show that, in the best case, the proposed method produces an auralization indistinguishable from the reference.

0 INTRODUCTION
Spatial sound encoding and reproduction techniques are important tools for room acoustics research [1,2]. For perceptual evaluation of room acoustics, a spatial room impulse response is first measured, then encoded for a multichannel loudspeaker reproduction system, and convolved with anechoic music. This process of reproducing spatial sound from a spatial room impulse response is illustrated in Fig. 1. The last part of this spatial sound reproduction process is typically called convolution reverb.

Previous research has presented several spatial encoding methods that can be applied to spatial impulse responses. These methods can be divided into three groups according to their aim. In the first group, the aim is to reproduce the originally measured sound field over a certain area. These methods include First-Order Ambisonics (1.OA), Higher-Order Ambisonics (HOA) [3], and Wave-Field Synthesis (WFS) [4,5]. In contrast, in the second group, the binaural reproduction methods, the intention is to reproduce the sound pressure correctly at the listener's eardrums by recording the sound field close to or at the eardrum [6,7]. In the third group, the starting point is to analyze and reproduce some of the spatial cues correctly [8,9]. An example of an analysis method belonging to the third group is Spatial Impulse Response Rendering (SIRR) [10]. The first two groups require specialized microphones or microphone arrays, whereas the methods of the last group aim to present signal processing schemes that are applicable, at least to some extent, to several microphone arrays. An advantage of the first two groups is that they can be applied to a continuous signal, such as speech or music. It should be stated that this paper concentrates on the spatial encoding of spatial room impulse responses, not continuous signals.

Most professionals working in the field will agree that, when applied in a careful manner, any of the aforementioned encoding techniques will provide a realistic or at least plausible auralization of the acoustics. However, the use of special microphone arrays imposes limitations on the measurement procedure. First, some of the microphones are known to have inaccurate directional responses, especially at higher frequencies, which naturally affects the accuracy of the analysis and the reproduction. Second, the measurement, especially for WFS, either requires a multitude of microphones or is time consuming, which can be costly. Finally, the microphone array setups for some of the reproduction approaches are limited to that specific approach.

This paper presents a spatial encoding technique for spatial room impulse responses, named here the Spatial Decomposition Method (SDM). In contrast to previously developed methods, SDM can be applied to an arbitrary compact microphone array with a small number of microphones and to any spatial sound reproduction technique. The presented method relies upon the simple assumption that the sound propagation direction is the average of all the waves arriving at the microphone array at time t, and that the sound pressure of a single impulse response in the geometric center of the array is associated with it. The method analyzes the spatial


impulse response with this assumption and encodes it into a response that consists of samples that have a pressure value and a spatial location.

[Fig. 1: processing chain — room impulse response measurement or simulation → microphone array signals → spatial analysis and encoding (SDM / SIRR / 1.OA / HOA) → encoded stream → spatial decoding and reproduction (VBAP / 1.OA / HOA / WFS) → loudspeaker signals → convolution with anechoic source signal → convolved signals → loudspeakers → spatially reproduced acoustics.]
Fig. 1. The general processing applied in auralization using room impulse response measurements or simulations. The acronyms for the encoding and decoding techniques are given in the text.

1 THEORY AND METHODS
This section presents theoretical background on the spatial room impulse response and the proposed spatial analysis.

1.1 Room Impulse Response
A room impulse response captured with microphone n at location r_n is the sum of individual acoustic events h_{p,n}(t):

h_n(t) = h(t|r_n, x) = [ Σ_{p=0}^{P} h_{p,n}(t) ] + w_n(t) = Σ_{p=0}^{P} ∫_{−∞}^{∞} H_{p,n}(ω) e^{jωt} dω + w_n(t),  (1)

where n denotes the microphone index, t is time, ω is the angular frequency, x is the source position, p = 0, . . . , P is the index of each acoustic event, w_n(t) is the measurement noise, and H_{p,n}(ω) is the frequency domain representation of h_{p,n}(t). The acoustic events can be, for example, the direct sound, discrete reflections, diffractions, or diffuse reflections. At each time moment t, the sound pressure at receiving location r_n has a scalar value, i.e., it is a scalar function h(r_n, x|t). The scalar value is the overall sum of the different sound pressure waves arriving at the same time at the receiver location. In the context of this paper, the spatial room impulse response is measured with n = 1, . . ., N microphones, i.e., a microphone array.

The whole impulse response is shaped by several acoustic phenomena. A majority of the acoustic events are attenuated according to the 1/r law and affected by air absorption. In addition, the frequency response of an event is altered by the absorption of the surfaces in the enclosure. Moreover, the directivities of the microphones and the sound source have an effect on the impulse response. As time progresses, the number of acoustic events per time window increases. In room acoustics research and convolution reverberation engines, the impulse response is traditionally divided into three consecutive regions in time: the direct sound, the early reflections, and the late reverberation. The next subsections list features of these categories in theory and in practice.

1.1.1 Direct Sound, Specular and Diffuse Reflections, and Diffraction
In theory, the direct sound is a single impulse, i.e., a Dirac delta function. Moreover, under the assumption of ideally reflecting surfaces, the reflections are also impulses. In practice, ideal specular reflections are rare, since they require an infinite, rigid, and flat plane. Thus, the early reflections are often spread over time instead of being single events in the impulse response, and they have a certain frequency response due to the absorption at the boundaries. Also, due to the non-ideal responses of the loudspeakers and microphones, the direct sound is not an impulse. Moreover, the loudspeaker impulse response is typically different in all directions; therefore, reflections from different directions have different responses in time and frequency.

The concept of the image-source describes an ideally specular reflection from a surface [11]. Although such reflections are rare or not even possible in real situations, the model can be used to describe several acoustic events. First, the diffraction from an edge can be modeled with properly weighted image-sources [12]. Second, non-ideal reflections, i.e., diffuse reflections, are caused by diffraction on a very small scale [13]. Third, close-to-ideal specular reflections and the direct sound can be modeled with a limited number of properly weighted image-sources and a source, respectively. Thus, it can be concluded that the acoustics of an enclosure can be modeled to some extent with a limited number of image-sources. However, in practice, the acoustic modeling of complex room geometries with image-sources is a very demanding task.

1.1.2 Late Reverberation
The reflection density in the room impulse response increases as time progresses. When enough reflections arrive during the same time, the sound field becomes


Fig. 2. The processing in the proposed spatial encoding method consists of localization and combining the omni-directional pressure signal with the estimated locations.

diffuse. A diffuse sound field is spatially homogeneous and isotropic. In practice, this means that the distributions of the phase and direction are uniform and the amplitude is equally distributed at each position. It follows from these conditions that the net energy flow over a volume is zero. The time when this occurs in an impulse response is typically referred to as the mixing time [14], and after that the impulse response is considered to be late reverberation.

1.2 Analysis
SDM assumes that the impulse response can be represented as a set of a limited number of image-sources. SDM analyzes the spatial room impulse response at every discrete time step Δt = 1/fs, where fs is the sampling frequency. The sound arriving during these time windows has an average direction that is estimated with robust localization methods. As a result, a set of discrete pressure values and their corresponding locations, i.e., image-sources, represents the spatial room impulse response. The decomposition into image-sources describes this process, hence the name SDM (Spatial Decomposition Method). The overall processing of the method is illustrated in Fig. 2.

The analysis assumes the following general requirements for the used microphone array:
• For 3-D spatial sound encoding, the minimum number of microphones is four, which must not lie in the same plane, so that they span a 3-D space.
• The directivity of one of the microphones is omnidirectional, or it is possible to create one virtual omnidirectional pressure microphone signal from the others.
• The dimensions of the array are not large, i.e., the microphone array is compact. The dimensions should be less than or equal to the dimensions of a human head.
• Open microphone arrays are preferred, but closed ones can also be used as long as the above requirements are met.

In detail, for a set of room impulse responses H(t) = {h_n(t)}_{n=1}^{N}, i.e., a spatial room impulse response, the analysis proceeds as follows.
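The per-sample procedure described in the following two steps can be outlined as a simple loop (a sketch only; `localize_window` is a hypothetical callback standing in for the localization function of Step 1, and the center pressure microphone is assumed to be channel 0):

```python
import numpy as np

def sdm_encode(H, localize_window, L=36):
    """Sketch of the SDM analysis loop.  H is the spatial room impulse
    response (samples x microphones); for every sample k the dominant
    arrival in a window of about L samples around k is localized, and the
    estimated location is paired with the center-microphone pressure."""
    K, N = H.shape
    pressure = H[:, 0]                     # assumed center pressure signal
    locations = np.zeros((K, 3))
    half = L // 2
    for k in range(K):
        lo, hi = max(0, k - half), min(K, k + half + 1)
        locations[k] = localize_window(H[lo:hi, :])   # x_hat_k, cf. Eq. (2)
    return pressure, locations
```

The output is the encoded response: one pressure value and one 3-D location per sample.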

1.2.1 Step 1: Localization
First, SDM solves the locations of the source and image-sources from the spatial room impulse response. For each discrete time step, a localization function P(· | ·) estimates the average location of the arriving sound in a small time window with respect to the geometric center of the array. The localization function maps the received data into a cost function that is given for a location x and possible parameters or a priori models χ:

x̂_k = arg max_x { P(H(k)|x, χ) },  (2)

where H(k) is the spatial impulse response in a short time window defined by the vector k = [−L/2 + k . . . L/2 + k] of discrete time indices at time t_k, with k = 1 . . . K, where K is the length of the impulse response, and L is the window size. The a priori models and the localization function depend on the applied microphone array, the measurement conditions, and the assumed sound-field propagation model. As an example, for an arbitrary array with arbitrary directivities, one can apply the maximum likelihood estimation given in [15] with the reverberation parameter set to γ = 0. For acoustic vector sensors, e.g., a gradient microphone array, one can apply the solutions given in [16] or [17]. Moreover, [18] gives an overview of different localization functions that are based on time-difference-of-arrival and time-of-arrival estimation. The accuracy of the localization depends on the applied microphone array and localization method, as well as on the conditions during the measurements.

This paper uses the least squares solution for time-difference-of-arrival (TDOA) estimates for localizing the image-sources. A plane-wave propagation model is assumed for the localization, since an efficient estimator exists for this problem, unlike for the spherical-wave propagation model [19], and since the source and the image-sources can be assumed to be in the far field. Although the solution is then a set of plane waves instead of a set of image-sources, the method can treat them in a similar manner due to the far-field assumption. The TDOAs are obtained with the generalized correlation method with direct weighting [20] for each time step in the small analysis window k. In addition, each TDOA estimate


is interpolated with the exponential fit [21]. The TDOA estimates are denoted by

τ̂_k = [τ̂_{1,2}^{(k)}, τ̂_{1,3}^{(k)}, . . . , τ̂_{N−1,N}^{(k)}]^T,

where N is the number of microphones, and the corresponding microphone position difference vectors are denoted by

V = [r_1 − r_2, r_1 − r_3, . . . , r_{N−1} − r_N]^T.

The least squares solution for the slowness vector is then given as [22, p. 75]:

m̂_k = V^+ τ̂_k,  (3)

where (·)^+ is the Moore–Penrose pseudo-inverse. The direction of the arriving sound wave is given as n̂_k = −m̂_k/‖m̂_k‖, and the distance to image-source k follows directly from the time index and the speed of sound, d_k = c k Δt.
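The least squares direction estimate of Eq. (3) can be sketched as follows (a hypothetical example with a synthetic plane wave; the sign convention assumed here is that the TDOA between microphones i and j equals the propagation direction projected onto r_i − r_j, divided by c):

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def plane_wave_doa(mic_pos, tdoa):
    """Estimate the direction of arrival from pairwise TDOAs via the
    least squares slowness vector m_hat = V+ tdoa, cf. Eq. (3)."""
    N = len(mic_pos)
    V = np.array([mic_pos[i] - mic_pos[j]
                  for i in range(N) for j in range(i + 1, N)])
    m_hat = np.linalg.pinv(V) @ tdoa       # slowness vector
    return -m_hat / np.linalg.norm(m_hat)  # n_hat: direction of arrival

# Synthetic check: four non-coplanar microphones, plane wave arriving
# from direction `doa` (i.e., propagating along -doa).
mics = np.array([[0.02, 0, 0], [-0.02, 0, 0], [0, 0.02, 0], [0, 0, 0.02]])
doa = np.array([1.0, 2.0, 3.0]) / np.linalg.norm([1.0, 2.0, 3.0])
V = np.array([mics[i] - mics[j] for i in range(4) for j in range(i + 1, 4)])
tdoa = V @ (-doa / C)                      # tau_ij = u.(r_i - r_j)/c, u = -doa
n_hat = plane_wave_doa(mics, tdoa)
```

Because the four microphones span a 3-D space, V has full column rank and the pseudo-inverse recovers the slowness vector exactly for noise-free TDOAs.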

1.2.2 Step 2: Dividing the Omni-Directional Pressure Signal
The second step of the analysis selects one of the available omni-directional microphone signals as the pressure signal h_p. Ideally, the microphone for the pressure signal is located in the geometric center of the array. In this case, the analysis assigns each sample of the pressure impulse response h_p(t_k) a 3-D location x̂_k, which is the output of Step 1. The method has then encoded the spatial impulse response with four values per sample: the pressure value and the 3-D location of the sample.

In case the pressure microphone is not in the geometric center of the array, one has to predict the value of the pressure signal according to the image-source locations. This is done by first calculating the distance from the image-source location to the location r_p of the pressure microphone,

d_k = ‖r_p − x_k‖,  (4)

and then assigning each image-source the pressure value h_p(f_s d_k/c). When using the plane-wave propagation model, the distance is calculated as

d_k = |n_k · (r_p − x_{k,0})|,  (5)

where n_k and x_{k,0} are the plane normal and a point on the plane, respectively. Instead of predicting the pressure in the center of the array, one can predict the image-source locations at the location of the pressure signal. This is the easier choice because it does not require resampling of the signal. This paper applies neither of these approaches, since the pressure microphone is always located in the middle.

1.3 Limitations on the Performance and the Effect of the Window Size
Several aspects affect the accuracy of the analysis in SDM. When the noise level decreases and the number of microphones increases, the performance of the localization improves, as predicted by the Cramér–Rao lower bound (CRLB) (see, e.g., [18]). Other important factors are the

time interval between the samples and the size, or the dimensions, of the microphone array. The smaller these values are, the more spatial and temporal separation between individual acoustic events can be made, which improves the localization of individual acoustic events. Other methods require a larger aperture size to improve the approximations at low frequencies; in SDM this is not a requirement, since the lower frequencies can be estimated by elongating the window size. However, this would also require that SDM processing is done for different frequency bands with different window sizes. This is further discussed in Section 4.1.

A limiting factor for the window size is the largest dimension of the microphone array. That is, the window size should be larger than the time it takes for a sound wave to travel through the array, i.e., L_t > 2 d_max / c, where d_max is the maximum distance between any two microphones in the array.

Theoretically, a large window size improves and worsens the localization performance at the same time. As the window size increases, the localization performance for a single acoustic event improves, as stated by the CRLB. However, the probability that more than one acoustic event is present in the analysis window also increases. This latter part is seen as a possible problem in the analysis and, therefore, it is recommended that the window size is selected such that it is just over 2 d_max / c. In addition, if an acoustic event is assumed to be short, time-wise, increasing the window size would actually decrease the theoretical performance, since the energy of the noise in the time window increases relative to the energy of the signal, thus decreasing the signal-to-noise ratio.

The next part assesses the effect of the window size selection with a quantity called echo density. The echo density describes the average number of echoes in a room per time instant and is valid for any arbitrarily shaped enclosure [23, p. 92].
It is defined as

dN_r / dt = 4π c³ t² / V,    (6)

where N_r is the number of reflections and V is the volume. Echo density is a useful tool for inspecting the effect of the window size selection on the number of acoustic events, i.e., image-sources, per time window. The threshold below which there are fewer than N_r reflections present in the time window can be examined with

τ_1 = √( N_r V / (4π c³ dt) ) ≈ 0.0014 √V.    (7)

The last approximation is obtained for less than one reflection (N_r → 1), assuming a constant speed of sound c = 345 m/s and a window size of dt = 1 ms. For example, a window size of dt = L_t = 1 ms produces the value τ_1 = 119 ms for a room with volume 30 × 20 × 12 = 7200 m³, which indicates that there is only one acoustic event present in the analysis window until 119 ms after the direct sound. Thus, the parameter τ_1 describes the average time after which there will be more than one reflection present in the analysis time window. The smaller the window size, the bigger the parameter τ_1 and the more accurate localization of individual

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February
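A quick numerical check of the echo-density threshold τ_1 and of the window-length lower bound, using the values from the example in the text (the 100 mm array spacing is a hypothetical choice for illustration):

```python
import math

c = 345.0            # speed of sound [m/s]
V = 30 * 20 * 12     # room volume [m^3], i.e., 7200 m^3
dt = 1e-3            # analysis window length [s]

# tau_1 = sqrt(N_r * V / (4*pi*c^3*dt)) with N_r = 1
tau_1 = math.sqrt(V / (4 * math.pi * c**3 * dt))
print(tau_1)  # ~0.118 s, i.e., roughly the 119 ms quoted in the text

# Lower bound on the window length for a hypothetical array with
# d_max = 100 mm: the window must exceed the array traversal time.
d_max = 0.1
print(2 * d_max / c)  # ~0.58 ms, comfortably below the 1 ms window above
```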

PAPERS

SPATIAL DECOMPOSITION METHOD

acoustic events is achieved. To conclude, shorter time windows should be preferred over long ones, and the minimum length of the time window is defined by the maximum spacing between any microphone pair.

1.4 Rationale for SDM

The accurate localization of the first acoustic events in the impulse response, i.e., the direct sound and the first reflections, is possible, as shown in [24] and [18], respectively. However, as time progresses, the number of acoustic events per time window increases, and eventually more than one reflection arrives during the time window. In this case, a cross-correlation-based localization algorithm localizes the sound to the location of the strongest reflection in that time window. The strongest direction is selected because it appears as the strongest peak in the cross-correlation functions. An analogous example of this behavior with one localization algorithm is shown with speech sources in [25]. However, it is also possible that the estimated location is an intermediate point between the reflections within that analysis window. This is, for example, the case if the localization algorithm is based on the average direction of the sound intensity. Thus, the estimated location depends strongly on the localization algorithm. The behavior of localization algorithms in the case of several simultaneous acoustic events should be further investigated, but here this is left for future research. In any case, SDM assumes in the spatial reproduction that the estimated location corresponds to the correct perceptual location. This assumption has been used previously, for example, in SIRR [10,26].

SDM produces the diffuse sound field naturally. Namely, in a diffuse sound field each time step in SDM receives a random direction, so the directional distribution over the late reverberation is uniform. Further evidence for this is provided in a recently published article that uses SDM for spatial analysis [27].

Since the first acoustic events are correctly localized from the spatial room impulse response in the SDM framework, and these events are known to have a very prominent effect on the perception of spatial sound [1,28], the resulting auralization should be credible. Moreover, the late part of the spatial room impulse response will be naturally rendered as diffuse by SDM, because multiple arriving reflections will produce random directions.
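To make the per-sample encoding concrete, here is a toy sketch of the rendering idea. Everything in it is synthetic: the directions and pressure values are random stand-ins for SDM metadata, the loudspeaker layout is hypothetical, and nearest-loudspeaker routing stands in for the VBAP panning used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SDM metadata: one unit direction vector and one pressure value per
# time sample; random directions mimic a diffuse late field.
n_samples = 1000
directions = rng.normal(size=(n_samples, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
pressure = rng.normal(size=n_samples) * np.exp(-np.arange(n_samples) / 300)

# Hypothetical loudspeaker directions; nearest-speaker assignment stands in
# for VBAP, which would spread each sample over a loudspeaker triplet.
speakers = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

outputs = np.zeros((len(speakers), n_samples))
for t in range(n_samples):
    k = np.argmax(speakers @ directions[t])   # closest speaker direction
    outputs[k, t] = pressure[t]               # route the pressure sample there

# Each sample goes to exactly one channel, so the channels sum back to the
# original pressure response: the sweet-spot frequency response is unchanged.
assert np.allclose(outputs.sum(axis=0), pressure)
```

This also illustrates the claim revisited in Section 4.1 that the sweet-spot response equals the original pressure response.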


(b) Analyzed image-source locations and amplitudes

Fig. 3. An example of the locations and amplitudes of (a) simulated image-sources and (b) image-sources decomposed with SDM from a spatial room impulse response. The area of each filled circle illustrates the energy of that image-source. The image-sources with the highest energy are correctly analyzed.

1.5 An Example of the Analysis with SDM

This section demonstrates the principles of SDM with an illustration of the analysis results for a spatial room impulse response. The spatial room impulse response is recorded from a simulation of a shoebox room of size (20 × 30 × 12) m³, with the source at [16.04, 8.06, 3.58] m and the receiver at [7.35, 7.92, 3.22] m. The applied window was a 1.33 ms Hanning window, and the overlap between two consecutive windows was 99%. The speed of sound was set to c = 345 m/s, the sampling frequency to f_s = 48 kHz, the reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

Fig. 3, where the radius of each circle corresponds to the amplitude of the respective image-source, illustrates the results of the analysis. As can be seen in Fig. 3, the early part of the simulated spatial room impulse response (a) is very similar to the one analyzed by SDM (b).

2 LISTENING TEST EXPERIMENTS

This section describes the listening test setup, the listening room, the simulated room acoustic conditions, and the source signals. In addition, the listening test procedures and results are presented. This paper uses Vector Base Amplitude Panning (VBAP) [8] as the spatial reproduction technique for the listening tests. Other reproduction methods could also be used, but VBAP is preferred here since it can be implemented for 3-D spatial sound with fewer loudspeakers than the other methods and since it provides good overall subjective quality. The listening test compares the proposed method to SIRR [10,26], which can be considered the

TERVO ET AL.


state-of-the-art spatial sound encoding method for spatial room impulse responses, at least for VBAP. SIRR also operates under the same assumption as SDM, namely that the binaural cues are produced correctly.

2.1 Listening Room Setup and Stimuli

Listening tests were conducted in an acoustically treated room with dimensions of (x × y × z : 3.0 × 5.1 × 3.8) m³. Table 1 shows the reverberation time, sound pressure level, and noise level in the listening room. The listening room fulfills the recommendations given by the ITU in [29], with the exceptions that the noise level fulfills the noise rating (NR) 30 requirement, whereas the recommendation is NR 15, and that the listening distance is about 1.2 m on average, whereas the recommendation is more than two meters.

Table 1. Reverberation time (RT), sound pressure level (SPL), and noise level (NL) in the listening room. The sound pressure level is given with respect to the reference (Ref.) value at the 200 Hz - 4 kHz frequency band. In the calibration, the SPL was 87 dB, which gives a signal-to-noise ratio of more than 45 dB for each octave band.

Octave band            RT [s]   SPL [dB]   NL [dB]
Ref. [200 Hz - 4 kHz]   0.14     0.00        -
125 Hz                  0.24     1.66       39.9
250 Hz                  0.17     0.47       35.7
500 Hz                  0.13     0.36       32.9
1 kHz                   0.13     0.02       28.6
2 kHz                   0.12     0.16       20.4
4 kHz                   0.11    -1.03       18.9
8 kHz                   0.10    -2.93       21.5

The listening room includes a 3-D 14-channel loudspeaker setup, of which 12 loudspeakers are of type Genelec 8030A and two are of type Genelec 1029A. Table 2 gives the location of each loudspeaker with respect to the listening position at the origin (0,0,0) m. Each loudspeaker is calibrated so that they all produce an equal A-weighted sound pressure level, with slow temporal averaging, in the listening position for band-pass filtered noise from 100 Hz to 5 kHz. Since the distance to the reference position is not the same for all loudspeakers, they are all delayed with digital signal processing so that each loudspeaker is at a virtual distance of 1.40 m.

Table 2. The azimuth and elevation directions and distance of each individual loudspeaker (LPS) in the 14-channel loudspeaker reproduction setup in the listening room. 8, 4, and 2 loudspeakers are located approximately at the lateral plane, 45 degrees above the lateral plane, and -45 degrees below the lateral plane, respectively. The loudspeakers were localized using the method presented in [24].

LPS #   Azimuth [°]   Elevation [°]   Distance [m]
1          46.7            0.1           1.01
2          89.6           -1.5           1.02
3         134.2           -0.1           0.98
4         179.8           -0.2           0.95
5        -135.9            0.1           1.02
6         -91.6            0.2           0.94
7         -45.6            1.0           0.96
8          -0.3            1.6           0.98
9          45.9           43.5           1.30
10        135.1           40.5           1.33
11       -137.9           42.3           1.40
12        -45.4           46.4           1.33
13         24.0          -46.1           1.29
14        -19.6          -45.1           1.27

The simulated impulse responses for the listening test were produced with the image-source method [11] in two modeled rectangular rooms. In the image-source method, throughout this paper, the reflection coefficient is set to 0.85, the speed of sound to 345 m/s, the sampling frequency to 48 kHz, and reflections up to the 45th order are simulated. Table 3 shows the room dimensions and the source and receiver positions used in the image-source method. Two shoebox rooms, a large and a small one, are simulated for the listening tests. The large and the small room have wideband reverberation times of 2.0 s and 0.4 s, respectively. In all the cases, the room impulse responses are truncated from -40 dB onwards according to the backward-integrated Schroeder curve.

Table 3. Source and receiver positions, source signals, dimensions of the rooms, and sample naming used in the listening test. The speed of sound was set to c = 345 m/s, the sampling frequency to f_s = 48 kHz, the reflection coefficient to 0.85, and reflections up to the 45th order were simulated.

                 Source position [m]      Receiver position [m]
Sample            x      y      z          x      y      z
Large room (30 × 20 × 12) m³
A (Sp.)         16.04   8.06   3.58       7.35   7.92   3.22
B (Tr.)         17.44  12.81   2.88       2.64  13.48   3.72
C (Ca.)         20.37  11.99   2.52       3.10  12.10   2.86
Small room (5 × 3 × 2.8) m³
D (Sp.)          3.44   0.80   1.53       1.02   0.64   1.40
E (Tr.)          3.87   1.45   1.65       0.76   1.39   1.33
F (Ca.)          3.78   0.85   1.81       1.24   0.97   2.07

Sp.: Speech, Tr.: Trombone, Ca.: Castanet

2.1.1 Reference and Anchor

The reference was generated with the image-source method. The location and amplitude of each image-source were transferred into a virtual source, which was panned with VBAP for the current loudspeaker setup [8]. Finally, to simulate a real room impulse response measurement situation, the anechoic impulse response of a Genelec 1029A was convolved with the impulse responses, and the impulse response was filtered with air absorption filters, implemented according to [30]. The anchor for the listening test was selected to be the same mono impulse response as in the reference, but instead of VBAP processing according to the directional information obtained from the image-source method, it was used directly in the front loudspeaker (#8 in Table 2).

2.1.2 Spatial Encoding Methods

Similarly to the reference, the image-source method was used to generate spatial impulse responses for a virtual


Fig. 4. Processing of the samples in the listening test experiments. The shaded areas highlight the different spatial encoding methods (from top to bottom: SDM, SIRR, and reference).

Table 4. Origin-centered coordinates for the microphone arrays. The spacing d_spc is equal for each microphone pair on a single axis.

Microphone #    X [m]       Y [m]       Z [m]
1               d_spc/2     0           0
2              -d_spc/2     0           0
3               0           d_spc/2     0
4               0          -d_spc/2     0
5               0           0           d_spc/2
6               0           0          -d_spc/2
7               0           0           0
microphone array. The microphone array consists of seven microphones, of which six are on a sphere and one is in the geometric center of the array, as shown in Table 4. The central microphone is used as the microphone for the pressure signal in the spatial encoding methods.

The proposed spatial encoding method, SDM, was compared to two versions of SIRR. The first version of SIRR, as well as SDM, was implemented with seven microphones, and the second version of SIRR was implemented with 13 microphones. Their naming is the following:

• SDM with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SDML7,
• SIRR with a single microphone array with spacing d_spc = 100 mm and one microphone in the geometric center is named SIRRL7, and
• SIRR with two microphone arrays with spacings d_spc = 100 mm and d_spc = 25 mm and one microphone in the geometric center is named SIRR13.

The microphone arrays were selected as such since SIRR processing can be implemented for them [10,26]. Namely, SIRR requires the three components of the particle velocity, which can be calculated with the gradient microphone technique, and a pressure signal, which is taken from the microphone in the geometric center. SIRR13 analyzes the room impulse responses separately for the large and small spacing and combines the results in a post-processing phase. The combination adds the analysis result for low frequencies below 1 kHz from the large spacing, and

for high frequencies above 1 kHz from the smaller spacing. Before the addition, the analyzed signals for the small and large spacing are high-pass and low-pass filtered, respectively, with a 10th-order Butterworth IIR filter. The motivation for such processing is that the present authors have used such an array in measurements of concert halls [31].

To compare the methods under the same conditions, all the analyses use a Hanning window of 1.33 ms (64 samples at 48 kHz). SIRR has an overlap of 50% between two consecutive windows, and SDM has an overlap of 99% (63 samples), as explained in Section 1.2.1 (Step 1). The window size was selected as 1.33 ms since it is the one used in the original SIRR paper [26]. It should be emphasized that for SDM the optimal window size is much smaller than the selected one. Especially for the smaller simulated room, the lengthy time window is expected to cause problems in SDM, since the parameter τ_1 is 1.4 ms. However, since the goal is to compare the two techniques under the same conditions, the same window size is used for both.

Moreover, a virtual-microphone-based synthesis, originally developed for DirAC in [32], was found to provide a more natural sound for SIRR and was included in the processing. The output of SDM, i.e., the extracted image-sources, is directly panned with VBAP for the current loudspeaker setup. The output of the SIRR analysis is processed as described in [10] and [26] for VBAP reproduction. In addition, the diffuse part of SIRR is implemented with the Hybrid Method described in [26]. The processing of the listening test samples for SDM, SIRR, and the reference case is illustrated in Fig. 4.
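The SIRR13 band combination can be sketched as follows. This is illustrative only: the two impulse responses are random stand-ins for the large- and small-spacing analysis results, and SciPy's Butterworth design is assumed in place of whatever implementation the authors used.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 48_000
# 10th-order Butterworth low-/high-pass pair with a 1 kHz crossover,
# mirroring the combination of the two array spacings described above.
sos_lo = butter(10, 1000, btype="lowpass", fs=fs, output="sos")
sos_hi = butter(10, 1000, btype="highpass", fs=fs, output="sos")

rng = np.random.default_rng(1)
h_large = rng.normal(size=4096)   # stand-in: analysis from the 100 mm array
h_small = rng.normal(size=4096)   # stand-in: analysis from the 25 mm array

# Low frequencies from the large spacing, high frequencies from the small.
combined = sosfilt(sos_lo, h_large) + sosfilt(sos_hi, h_small)
```

Second-order sections (`output="sos"`) are used here because a 10th-order IIR filter in transfer-function form is numerically fragile.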

2.1.3 Source Signals and Test Samples

Approximately ten seconds of male speech, trombone, and castanets were selected as the source signals. Each sample was convolved separately with the corresponding 14-channel VBAP output for the reference, SIRR, or SDM. The test samples are named from A to F as indicated in Table 3.

2.2 Listening Test Procedure

The task in the listening test is to compare the "similarity" of the spatially encoded samples with the reference sample,


Fig. 5. Screen capture of the user interface (UI) used in the listening tests. The subjects can freely move, listen to, and rate the samples. Note that in the UI the letters stand for "method."

instead of the "impairment" recommended by the ITU [29]. This deviation from the ITU recommendation was made so that the test subjects are not encouraged to think that the samples are somehow impaired. In addition, the ITU-recommended impairment scale (imperceptible; perceptible, but not annoying; slightly annoying; annoying; very annoying) is not used in the listening test, since it is known from previous research [26] that SIRR-processed samples sound quite natural and are quite similar to the reference. Thus, the idea of this listening test is to find out which encoding method produces the sound that is most similar to the reference.

The listening test was implemented as a parallel comparison with a continuous scale, and the task was to compare the similarity of five samples to a reference sample. Test subjects completed the test twice. The order of the test cases A-F (Table 3) was randomized between subjects and repetitions. Also, in each test case, the five samples (Ref., Anchor, SIRR13, SIRRL7, SDML7) were presented in a random order with letters A-E. A screen shot of the user interface and one comparative evaluation of one test case is shown in Fig. 5.

During the listening test, the subjects could freely loop the time window they were listening to and listen to it an unlimited number of times. That is, there was no time limit for completing the test. In the beginning of the test, the subjects had adequate time to familiarize themselves with the samples. The test subjects were instructed to carefully consider the timbral and the spatial aspects of the samples. They were also told that one of the five samples is the hidden reference sample and another is a mono anchor sample, which is played back from the front loudspeaker. After the familiarization, the subjects rated the samples according to their similarity to the reference in the actual listening test.
When the test ended, the subjects were interviewed and asked for the attributes that they used for discriminating and rating the samples. Seventeen test subjects with normal hearing participated in the test. None of the subjects were among the present authors of

Fig. 6. Listening test results; the thicker boxes with solid color illustrate the 25th and 75th percentiles, the thinner lines illustrate the most extreme data points, the circles outside the boxes illustrate outliers, and the dots the medians. Samples: A: Speech, large room; B: Trombone, large room; C: Castanet, large room; D: Trombone, small room; E: Speech, small room; F: Castanet, small room.

this paper. Most (9/17) of the test subjects can be considered expert listeners in spatial audio due to their background in spatial audio research. Others (3/17) had experience in critical listening, although not necessarily in spatial audio; these subjects were considered experienced listeners. The rest (5/17) were naïve test subjects with limited or no experience in critical listening. The test took on average approximately 50 minutes, including an approximately 10-minute familiarization step and a 5-minute interview.

3 RESULTS

All the results from the listening tests are shown in Fig. 6. The results are normalized for each test case and subject between 0 and 1. As shown in Fig. 6, the references and anchors are identified correctly in most of the cases. SDML7 is mistaken for the reference 12 times, and SIRR13 for the anchor once. This result already suggests that SDML7 is well suited for spatial encoding.

Multi-way analysis of variance (ANOVA) is applied to examine the main effects and the two- and three-factor interactions. The examined main effects are the spatial encoding method (Method), the repetition of the test case (Repetition), the size of the room (Room), and the source sound sample (Sound). To perform ANOVA, the cases should be independent, the variances equal (homoscedasticity), and the residuals normally distributed. Here, the cases are assumed to be independent, but the other assumptions for ANOVA, the homoscedasticity and the normality of the residuals, are tested next with statistical tests.

Levene's test [33] shows that the variances between different test cases are significantly different. In addition, the Anderson-Darling test [34] indicates that the residuals are not normally distributed. Both of these results are most likely a consequence of the scale in the listening test. That is, as shown in Fig. 6, the results for the reference and anchor sound conditions are negatively and positively skewed,

respectively. For this reason, the anchor and the reference are removed from the ANOVA examination and the statistical tests are run again. Indeed, when these two methods are removed, Levene's test shows that the variances can be assumed equal [F(2,609) = 0.73, p > 0.05], and the Anderson-Darling test statistic indicates that the residuals are normally distributed [A²* = 0.99, p < 0.05]. This means that ANOVA is suitable for the data.

The results of the ANOVA indicate that the only significant main effect is the spatial encoding method [Method, F(2,576) = 332.80, p < 0.001], and none of the interactions are found significant. In total, the model explains 56% of the variance. The main effect for the spatial encoding method is very strong and explains 49% of the variance; the remaining 7% is due to non-significant effects. The results for the spatial encoding methods are presented in Fig. 7 with means and 95% confidence intervals.

As can be seen from Fig. 7, out of the spatial encoding methods, SDML7 is the most similar to the reference, SIRRL7 is the least similar, and SIRR13 is slightly more similar than SIRRL7. All the means are significantly different, and the average values for the methods are: reference: 0.98, SDML7: 0.80, SIRR13: 0.48, SIRRL7: 0.40, and anchor: 0.00. Thus, according to the listening test experiments, SDML7 is the most similar to the reference out of the tested encoding methods. In the best cases, samples A and B (speech and trombone in the large room), the results of the reference and SDML7 are not significantly different. In all the other samples, the SDML7 results are significantly different from those of the other methods and the reference. The furthest from the reference are the results for sample E (trombone in the small room).

Fig. 7. Rated similarity of the different spatial encoding methods for each sample individually. The results are presented with means and 95% confidence intervals. Samples: A: Speech, large room; B: Trombone, large room; C: Castanet, large room; D: Trombone, small room; E: Speech, small room; F: Castanet, small room.

All the attributes from the interviews are listed in Table 5. They are grouped into two groups, spatial and timbral aspects. The interviews of the test subjects revealed that they most often used localization, spatial impression, muddiness, coloration, distance, and clarity as the attributes for rating the samples.

Table 5. The attributes that the subjects used for assessing the similarity according to the interviews. Most of the attributes are translated from Finnish to English.

Spatial aspects: Localization (×5), spatial impression (×4), the amount of reverberation (×4), distance (×3), spatial width, spaciousness, gating effect, perception of room size, reverberation time, artifacts in the reverberation, direct-to-reverberant ratio.

Timbral aspects: Coloration (×7), muddiness (×4), clarity (×4), low-frequency content (×2), pitch (×2), brightness (×2), tone, depth, frequency shift, differences in the direct sound, metallic reverb, artifacts at high frequencies, high-frequency content.
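The statistical checks described in the results can be sketched with SciPy. The ratings below are synthetic stand-ins drawn around the reported method means; the real analysis used a multi-way model on the full subject data, whereas this sketch tests only the one significant main effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Stand-in similarity ratings (0..1) for the three encoding methods,
# centered on the means reported in the text (0.80, 0.48, 0.40).
sdm_l7  = np.clip(rng.normal(0.80, 0.10, 60), 0, 1)
sirr13  = np.clip(rng.normal(0.48, 0.10, 60), 0, 1)
sirr_l7 = np.clip(rng.normal(0.40, 0.10, 60), 0, 1)

# Homoscedasticity check (Levene's test), as used before the ANOVA.
W, p_levene = stats.levene(sdm_l7, sirr13, sirr_l7)

# One-way ANOVA on the encoding-method factor.
F, p_anova = stats.f_oneway(sdm_l7, sirr13, sirr_l7)
print(p_anova < 0.001)  # the method effect is clearly significant
```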

4 DISCUSSION

4.1 Advantages and Drawbacks of SDM and Future Work

In rendering, SDM uses only one omni-directional impulse response. Therefore, the frequency response in the exact sweet spot is identical to the original one in the pressure microphone. That is, due to the direct use of the pressure microphone signal, no peaks or dips occur in the frequency response in the sweet spot. The spatial distribution of sound can be inaccurate, but as the direct sound and early reflections are accurately reproduced, the perceived error is negligible. In addition, in SDM the diffuse part of the sound field is obtained automatically, whereas in SIRR the diffuse part of the sound is reproduced with uncorrelated loudspeaker signals, which are not easy to implement.

SDM assumes wideband signals. That is, the room impulse responses should be measured with the full bandwidth. In the case of band-limited room impulse responses, SDM will artificially increase the energy outside the frequency band, since each sample in the encoded version is represented by a Dirac impulse. In a real room impulse response, the energy at high frequencies decreases as time progresses due to air absorption and surface absorption. Thus, the frequency response of the late reverberation is a "low-pass" filtered version of the original response of the direct sound.

As pointed out by the listening test subjects, SDM slightly increases the perceived brightness or high-frequency content in the late part of the impulse response. The division of the pressure signal causes this drawback. Each sample of the pressure signal is represented by an image-source, and an image-source is a Dirac impulse in the time domain, which is wideband in the frequency domain. Since the late part of the impulse response does not have as much energy at high frequencies as the early part, this results in an increase in the perceived brightness of the reverberation.
The problem of increased brightness in the late reverberation can be overcome by equalizing the frequency response in a post-processing step. Another option is to analyze the locations of the image-sources in the frequency domain. This way, each frequency would have a correct weighting to


begin with. This requires additional research and is therefore left for future work. In addition, future work includes open-source implementations of the SDM encoding for a general pressure microphone array and a B-format microphone, as well as decoding implementations for wave field synthesis and higher-order Ambisonics. The problem of increased brightness may not be present in these other reproduction approaches.

In this paper the room acoustic simulation used ideal specular reflections, which is an inherent property of the applied image-source room simulation method. SDM should also be tested with diffuse reflections. However, the generation of the reference case for diffuse reflections is problematic, since the reference case requires the direction, time of arrival, and pressure value for each time instant. This information is available in beam-tracing or ray-tracing methods. Unfortunately, these methods neglect the temporal spreading of the reflections and consider that diffuse reflections only introduce spatial spreading of the reflected sound. Moreover, the room acoustic simulation methods that aim to solve the wave equation, e.g., the finite element method, the boundary element method, and finite differences in the time domain, may generate the correct pressure values, but they do not produce directional information. The only method that produces all the necessary information and takes into account both the temporal and the spatial spreading is presented in [35], but it only applies to low frequencies. A comparison against a reference case with diffuse reflections is therefore currently not possible.

SIRR was implemented with the parameters given in the original paper [26]. It should be emphasized that the advances made in Directional Audio Coding could possibly improve the quality of SIRR. In informal listening, for example, the multi-rate implementation [36] was found to increase the overall quality of SIRR. Studies using SIRR with such alternative processing approaches are, however, currently not available in the literature.

5 CONCLUSIONS

This paper presented a spatial encoding method for spatial room impulse responses. The analysis estimates a location in a very small time window at every discrete time sample, where the localization method depends on the applied microphone array and the acoustic conditions. Each discrete time sample is therefore represented by an image-source, and the analysis results in a set of image-sources. Then, depending on the spatial reproduction method, the samples are distributed to several reproduction channels to obtain individual impulse responses for all reproduction channels.

The main advantage of the method follows from the decomposition into image-sources. Namely, the method can be applied with any arbitrary microphone array, and the spatial reproduction method can be any of a variety of existing techniques. It should be emphasized that the method is not designed for continuous signals, but for spatial room impulse responses, which can then be convolved with an anechoic signal. In this paper the applied microphone array was


an open spherical microphone array with six microphones, with an additional seventh microphone in the geometric center of the array. Listening test experiments showed that the presented method produces sound that is indistinguishable from a reference sound in the best case. Overall, the sound samples encoded with the presented method were perceived to be more similar to the reference than those of a state-of-the-art method under the same conditions.

6 ACKNOWLEDGMENT

This work was supported by ERC grant agreement no. [203636] and Academy of Finland agreement nos. [218238 and 140786]. The authors are grateful to Prof. Lauri Savioja, Prof. Ville Pulkki, and Mr. Mikko-Ville Laitinen for discussions. The authors would also like to thank the anonymous reviewers for their valuable comments, which helped to improve the quality of this paper. Dr. Alex Southern is thanked for providing the implementation for the air absorption filter computations.

7 REFERENCES

[1] T. Lokki, H. Vertanen, A. Kuusinen, J. Pätynen, and S. Tervo, "Concert Hall Acoustics Assessment with Individually Elicited Attributes," J. Acoust. Soc. Am., vol. 130, pp. 835-849 (2011 Aug.).
[2] T. Lokki, J. Pätynen, S. Tervo, S. Siltanen, and L. Savioja, "Engaging Concert Hall Acoustics Is Made Up of Temporal Envelope Preserving Reflections," J. Acoust. Soc. Am., vol. 129, pp. EL223-EL228 (2011 Apr.).
[3] J. Daniel, R. Nicol, and S. Moreau, "Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging," presented at the 114th Convention of the Audio Engineering Society (2003 Mar.), convention paper 5788.
[4] A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic Control by Wave Field Synthesis," J. Acoust. Soc. Am., vol. 93, no. 5, pp. 2764-2778 (1993).
[5] M. M. Boone, E. N. G. Verheijen, and P. F. van Tol, "Spatial Sound-Field Reproduction by Wave-Field Synthesis," J. Audio Eng. Soc., vol. 43, pp. 1003-1012 (1995 Dec.).
[6] D. Hammershøi and H. Møller, "Binaural Technique - Basic Methods for Recording, Synthesis, and Reproduction," in Communication Acoustics, chapter 9, pp. 223-254 (Springer-Verlag, New York, NY, USA, 2005).
[7] D. Schönstein and B. F. G. Katz, "Variability in Perceptual Evaluation of HRTFs," J. Audio Eng. Soc., vol. 60, pp. 783-793 (2012 Oct.).
[8] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., vol. 45, pp. 456-466 (1997 June).
[9] F. Zotter and M. Frank, "All-Round Ambisonic Panning and Decoding," J. Audio Eng. Soc., vol. 60, pp. 807-820 (2012 Oct.).
[10] J. Merimaa and V. Pulkki, "Spatial Impulse Response Rendering I: Analysis and Synthesis," J. Audio Eng. Soc., vol. 53, pp. 1115-1127 (2005 Dec.).


[11] J. B. Allen and D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950 (1979).
[12] U. P. Svensson, R. I. Fred, and J. Vanderkooy, "An Analytic Secondary Source Model of Edge Diffraction Impulse Responses," J. Acoust. Soc. Am., vol. 106, pp. 2331–2344 (1999).
[13] B. I. Dalenbäck, M. Kleiner, and P. Svensson, "A Macroscopic View of Diffuse Reflection," J. Audio Eng. Soc., vol. 42, pp. 793–807 (1994 Oct.).
[14] J.-M. Jot, L. Cerveau, and O. Warusfel, "Analysis and Synthesis of Room Reverberation Based on a Statistical Time-Frequency Model," presented at the 103rd Convention of the Audio Engineering Society (1997 Sept.), convention paper 4629.
[15] C. Zhang, Z. Zhang, and D. Florêncio, "Maximum Likelihood Sound Source Localization for Multiple Directional Microphones," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 125–128 (2007).
[16] D. Levin, E. A. P. Habets, and S. Gannot, "Maximum Likelihood Estimation of Direction of Arrival Using an Acoustic Vector-Sensor," J. Acoust. Soc. Am., vol. 131, no. 2, pp. 1240–1248 (2012).
[17] S. Tervo, "Direction Estimation Based on Sound Intensity Vectors," European Signal Processing Conference, Glasgow, Scotland, August 24–28, pp. 700–704 (2009).
[18] S. Tervo, J. Pätynen, and T. Lokki, "Acoustic Reflection Localization from Room Impulse Responses," Acta Acustica united with Acustica, vol. 98, pp. 418–440 (2012).
[19] A. Host-Madsen, "On the Existence of Efficient Estimators," IEEE Trans. Signal Processing, vol. 48, no. 11, pp. 3028–3031 (2000).
[20] C. Knapp and G. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. Acoust., Speech and Signal Proc., vol. 24, no. 4, pp. 320–327 (1976).
[21] L. Zhang and X. Wu, "On Cross Correlation Based Discrete Time Delay Estimation," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 981–984 (2005).
[22] T. Pirinen, Confidence Scoring of Time Delay Based Direction of Arrival Estimates and a Generalization to Difference Quantities, Ph.D. thesis, Tampere University of Technology, 2009, Publication 854.
[23] H. Kuttruff, Room Acoustics, 4th Ed. (Spon Press, NY, NY, USA, 2000).


SPATIAL DECOMPOSITION METHOD

[24] S. Tervo, T. Lokki, and L. Savioja, "Maximum Likelihood Estimation of Loudspeaker Locations from Room Impulse Responses," J. Audio Eng. Soc., vol. 59, pp. 845–857 (2011 Nov.).
[25] A. Brutti, M. Omologo, and P. Svaizer, "Multiple Source Localization Based on Acoustic Map De-emphasis," EURASIP J. Audio, Speech, and Music Processing, 2010, paper 147495.
[26] V. Pulkki and J. Merimaa, "Spatial Impulse Response Rendering II: Reproduction of Diffuse Sound and Listening Tests," J. Audio Eng. Soc., vol. 54, pp. 3–20 (2006 Jan./Feb.).
[27] J. Pätynen, S. Tervo, and T. Lokki, "Analysis of Concert Hall Acoustics via Visualizations of Time-Frequency and Spatiotemporal Responses," J. Acoust. Soc. Am., vol. 133, no. 17 (2013 Jan.).
[28] J. S. Bradley, H. Sato, M. Picard, et al., "On the Importance of Early Reflections for Speech in Rooms," J. Acoust. Soc. Am., vol. 113, no. 6, pp. 3233–3244 (2003).
[29] International Telecommunication Union, Geneva, ITU-R BS.1116-1: Methods for the Subjective Assessment of Small Impairments in Audio Systems including Multichannel Sound Systems (1997).
[30] H. E. Bass, H.-J. Bauer, and L. B. Evans, "Atmospheric Absorption of Sound: Analytical Expressions," J. Acoust. Soc. Am., vol. 52, no. 3B, pp. 821–825 (1972).
[31] T. Lokki, J. Pätynen, A. Kuusinen, and S. Tervo, "Disentangling Preference Ratings of Concert Hall Acoustics Using Subjective Sensory Profiles," J. Acoust. Soc. Am., vol. 132, pp. 3148–3161 (2012 Nov.).
[32] J. Vilkamo, T. Lokki, and V. Pulkki, "Directional Audio Coding: Virtual Microphone-Based Synthesis and Subjective Evaluation," J. Audio Eng. Soc., vol. 57, pp. 709 (2009 Sept.).
[33] B. B. Schultz, "Levene's Test for Relative Variation," Systematic Biology, vol. 34, no. 4, pp. 449–456 (1985).
[34] M. A. Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons," J. Am. Statistical Assoc., vol. 69, no. 347, pp. 730–737 (1974 Sept.).
[35] S. Siltanen, T. Lokki, S. Tervo, and L. Savioja, "Modeling Incoherent Reflections from Rough Room Surfaces with Image Sources," J. Acoust. Soc. Am., vol. 132, pp. 4604–4614 (2012 June).
[36] T. Pihlajamäki and V. Pulkki, "Low-Delay Directional Audio Coding for Real-Time Human-Computer Interaction," presented at the 130th Convention of the Audio Engineering Society (2011 May), convention paper 8413.


TERVO ET AL.


THE AUTHORS

Sakari Tervo

Jukka Pätynen

Dr. Sakari Tervo is a post-doctoral researcher in the Department of Media Technology, Aalto University School of Science, from which he received a D.Sc. degree in acoustics in January 2012. His research focuses on objective room acoustic measures. He previously worked in the Department of Signal Processing, Tampere University of Technology, from which he graduated with an M.Sc. in audio signal processing in 2006. He visited the Digital Signal Processing Group of Philips Research, Eindhoven, The Netherlands, in 2007 and the Department of Electronics of the University of York, United Kingdom, in 2010.


Dr. Jukka Pätynen was born in 1981 in Espoo, Finland. He received M.Sc. and D.Sc. (Tech.) degrees from the Helsinki University of Technology, Finland, in 2007, and Aalto University, Finland, in 2011, respectively. He is currently working as a post-doctoral researcher in the Department of Media Technology, Aalto University. His research activities include room acoustics, musical acoustics, and signal processing.


Antti Kuusinen

Tapio Lokki

Antti Kuusinen received an M.Sc. degree in March 2012. He is currently working as a doctoral candidate at the Department of Media Technology at Aalto University School of Science under the supervision of Professor Tapio Lokki. In his doctoral research he focuses on the perceptual characteristics of concert hall acoustics. This research also includes development of descriptive vocabulary for sound, music, and room acoustics as well as elaboration of listening test methodology and statistical data analysis.


Dr. Tapio Lokki was born in Helsinki, Finland, in 1971. He has studied acoustics, audio signal processing, and computer science at the Helsinki University of Technology (TKK) and received an M.Sc. degree in electrical engineering in 1997 and a D.Sc. (Tech.) degree in computer science and engineering in 2002. At present Dr. Lokki is an Associate Professor (tenured) with the Department of Media Technology at Aalto University. Prof. Lokki leads the virtual acoustics team jointly with Prof. Lauri Savioja. The research aims to create novel objective and subjective ways to evaluate concert hall acoustics. In addition, the team develops physically-based room acoustics modeling methods to obtain authentic auralization. Furthermore, the team studies augmented reality audio. The team is funded by the Academy of Finland and by Prof. Lokki’s Starting Grant from the European Research Council (ERC).



Persistent Time–Frequency Shrinkage for Audio Denoising

KAI SIEDENBURG,1 AES Student Member, AND MONIKA DÖRFLER2

([email protected])

([email protected])

1 Center for Interdisciplinary Research in Music Media and Technology (CIRMMT), Schulich School of Music, McGill University, Montreal; Austrian Research Institute for Artificial Intelligence, Vienna, Austria
2 Numerical Harmonic Analysis Group, Faculty of Mathematics, University of Vienna, Vienna, Austria

Natural audio signals are known to be highly structured. Incorporating knowledge about this inherent structure helps to improve audio restoration algorithms. In this article audio denoising is addressed as a problem of structured sparse atomic decomposition. A class of time-frequency shrinkage operators is introduced that generalizes some well-known thresholding operators such as the empirical Wiener filter and basis pursuit denoising. The general framework allows for the exploitation of structural properties, in particular the persistence inherent to most natural audio signals. Fast iterative shrinkage algorithms are reviewed and their convergence is numerically evaluated. The denoising performance of the proposed persistent shrinkage operators is evaluated on real-life audio signals. The novel approach shows performance competitive with the state of the art when evaluated by means of signal-to-noise ratio and appears to have beneficial perceptual properties.

0 INTRODUCTION

In audio processing today, signals are commonly interpreted by means of their expansion coefficients with respect to basic functions taken from so-called dictionaries. Dictionaries comprised of windowed Fourier or cosine bases turn out to provide particularly valuable representations: they are easy to interpret and reflect physical reality by expanding a signal with respect to the dimensions of time and frequency. They have also proven to be well adapted for processing most audio signals of relevance for humans, in particular speech and music. A dictionary may be understood as being well adapted to a class of signals if it allows for sparse representations, i.e., only requiring a few atoms from the dictionary, cp., [1]. Today, sparsity in redundant dictionaries is a forceful paradigm in signal processing. While intuitively an ℓ0-prior on the synthesis coefficients should yield maximum sparsity, the ℓ1-norm as a prior on the representation coefficients plays a central role in modeling sparsity with respect to time-frequency coefficients.

Most natural signals feature strong inherent structures, such as harmonics and transients. In order to exploit the a priori knowledge about these structures, a growing body of research has recently focused on extending existing sparsity paradigms. The desire to take structural information into account has led to approaches of structured sparsity, see, e.g., [1] and references therein. In the audio context, structure is often interpreted in the sense of temporal and

spectral persistence. In fact, a consequence of basic acoustic laws concerning resonant systems and impact sounds is that large classes of audio components are either sparse in frequency and persistent in time or sparse in time and persistent in frequency [2].

In this article denoising is considered as a problem of structured sparse approximation. Therefore, we utilize the formal framework introduced in [3]. As opposed to [3], where different mixed-norm priors were applied to various applications, the current paper considers the case of weighted ℓ1-regularization, for which an accelerated algorithm is presented and a variety of refined, weight-dependent thresholding operators are evaluated in depth for the denoising task. The proposed framework allows the neighborhood-weighting to be adapted according to global signal properties, and the current work addresses the influence of the chosen neighborhood-weighting on the denoising algorithm's performance.

The paper first gives a brief review of the state of the art. Subsequently, the framework of sparse expansions using Gabor frames is developed in Section 2.1, the signal model and its link to soft-thresholding algorithms is established in Section 2.2, and a generalized viewpoint on soft-thresholding is introduced in Section 2.3. Section 3 turns to numerical aspects of the proposed denoising scheme. After evaluating the convergence of the algorithms in Section 3.1, Section 3.2 investigates the properties of the neighborhood-persistent thresholding operators.


A comparison to existing algorithms is provided in Section 3.3 using a variety of signals and the signal-to-noise ratio (SNR) measure. Perceptual aspects are discussed in Section 4, and the paper is concluded by a summary and some perspectives in Section 5.

1 STATE OF THE ART

In time-frequency based audio denoising, it is insightful to distinguish between diagonal and non-diagonal estimation. In diagonal denoising algorithms, the attenuation factor for each time-frequency coefficient is determined independently. Diagonal denoising procedures, such as the empirical Wiener estimator and other power subtraction estimators, typically ignore the correlation between coefficients, cf., [4,5]. Approaches of the diagonal type typically suffer from what is now known as musical noise: perceptually annoying, isolated noise residuals. Non-diagonal estimation then serves as a means to avoid these severe artifacts. Ephraim and Malah [6] were the first to apply time-recursive filtering for SNR estimation, introducing some coupling between coefficients; their estimator was equipped with different attenuation rules and was variously refined later, cf., [7,8]. Both diagonal and non-diagonal denoising techniques have been carefully reviewed in [8]. In the same paper the authors introduced a denoising algorithm by time-frequency block thresholding. The latter has been shown to improve performance significantly and can now be considered the state of the art in non-diagonal audio denoising.

Since the pioneering work of Donoho and Johnstone on wavelet thresholding in statistics [9,10], sparsity-based approaches have also led to significant contributions in signal processing and audio denoising. A particularly successful model of sparsity is based on ℓ1-norm regularization (to be thought of as a convex relaxation of the ℓ0-norm) of the signal expansion. It was introduced by Chen et al. in signal processing [11] as basis pursuit denoising and in an equivalent approach by Tibshirani as the Lasso in statistics [12]. Later, it was shown that solutions of ℓ1-regularized inverse problems are in many cases optimally sparse [13], and fast iterative algorithms have been proposed [14,15,16], following the initial theoretical work of Daubechies, Defrise, and De Mol [17]. However, by treating each coefficient independently, these initial sparse models such as the Lasso again give rise to diagonal denoising procedures.

Aiming at a more comprehensive mathematical model for the dependencies between time-frequency coefficients, a sparse regression approach with structured priors was suggested for audio denoising in [18]. With similar motivation, Kowalski and Torresani [19] extended the ℓ1-regularization method by introducing mixed-norm priors and neighborhood weighting on the coefficients. Each coefficient is thresholded according to the weight of its neighborhood, hence yielding non-diagonal shrinkage operators. Solutions to the mixed-norm regularized inverse problems were shown to be obtained by generalized classes of soft-thresholding operators in [20]. This approach was revisited and refined in [3] and demonstrated promising results for a variety of classical signal processing tasks such as multilayer-decomposition, transient estimation, and denoising. In [21] it was first evaluated for audio denoising in particular. The theoretical properties of iterated neighborhood-based shrinkage were investigated in [22]. Following the same idea of persistent estimation, a neighborhood-based empirical Wiener estimator was introduced and discussed in [23].

2 DENOISING BY TIME-FREQUENCY SHRINKAGE

Time-frequency methods are very common in audio signal processing; the best-known transform is the short-time Fourier transform (STFT) or sampled sliding window transform. We consider dictionaries known as Gabor frames, which, in their simplest instantiation, correspond to an STFT.

2.1 Signal Expansions Using Gabor Frames

We wish to expand a signal y ∈ R^L as a linear combination of time-frequency or Gabor atoms φ_{k,j}, i.e.,

y(n) = Σ_{k,j} c_{k,j} φ_{k,j}(n) + e(n),   n = 1, . . . , L.
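Such a dictionary is easy to build numerically. The following small sketch (our own illustration, not the authors' code; `gabor_atoms` is a hypothetical helper) generates a finite Gabor system by circularly translating and modulating a single Gaussian window, and checks that each atom is most strongly correlated with itself among all atoms of the dictionary:

```python
import numpy as np

def gabor_atoms(phi, a, b):
    """Columns are the Gabor atoms obtained by circularly time-shifting the
    window phi in steps of a and modulating it in frequency steps of b
    (a sketch of the construction described in the text)."""
    L = len(phi)
    K, J = L // a, L // b          # number of time and frequency shifts, Ka = Jb = L
    n = np.arange(L)
    cols = []
    for k in range(K):
        shifted = np.roll(phi, k * a)                                 # time shift by ka
        for j in range(J):
            cols.append(shifted * np.exp(2j * np.pi * n * (j * b) / L))  # modulation by jb
    return np.column_stack(cols)   # L x p matrix, p = KJ

# Redundant system: Gaussian window on L = 16 samples, a = 2, b = 4, so p = 2L.
L = 16
phi = np.exp(-0.5 * ((np.arange(L) - L / 2) / 2.0) ** 2)
Phi = gabor_atoms(phi, a=2, b=4)
```

Since each atom is a unit-direction copy of the same window, the magnitude of its inner product with any other atom is strictly smaller than its own energy, which makes the analysis coefficients of a single atom peak at that atom's own index.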

The atoms φ_{k,j} are generated by time-frequency shifts of a single window function, φ_{k,j} = M_{bj} T_{ka} φ. Here, T_x φ(n) = φ(n − x) and M_ω φ(n) = φ(n) e^{2πinω/L} denote time- and frequency-shift operators and φ is a standard window function. a and b are the time- and frequency-sampling constants, j = 0, . . . , J − 1, k = 0, . . . , K − 1, with Ka = Jb = L, and the complex numbers c_{k,j} are the expansion coefficients. In order to guarantee perfect and stable reconstruction of a signal from its associated analysis coefficients c_{k,j} = ⟨y, φ_{k,j}⟩, we always assume that the dictionary Φ = (φ_{k,j})_{k,j} forms a frame, see [24], which, in the finite-dimensional case, means that Φ, the matrix of dimension L × p, with p = JK ≥ L, whose columns consist of the atoms φ_{k,j}, has full rank L. We will assume even more, namely the tightness of the frames Φ in use, which means that, up to a constant, we obtain y = Σ_{k,j} ⟨y, φ_{k,j}⟩ φ_{k,j}, i.e., perfect reconstruction is achieved by using the same window φ for both analysis and synthesis. Tight frames are easily calculated in many situations of practical relevance, see, e.g., [25]. Tight windows further simplify the interpretation of operations like thresholding in the time-frequency domain, as they introduce a symmetric relation between analysis and synthesis. As we are especially interested in the redundant Gabor-frame case L < p, the additional degrees of freedom can be used to promote (structured) sparsity of the coefficients.

2.2 Sparse Approximation and Iterated Soft-Thresholding

In our basic signal model y = Φc + e, the coefficients c ∈ C^p are assumed to be sparsely distributed. Then, denoising the observation y means to sparsely approximate the coefficients c. This implies the recovery of Φc, the clean



signal. This approach is formalized via the minimization problem

min_{c ∈ C^p} { (1/2) ‖y − Φc‖₂² + λ ‖c‖₁ }   (1)

which is the well-known ℓ1 sparse minimization functional. In statistics, this problem is called the Lasso [12], in signal processing basis pursuit denoising [11]. The constant λ > 0 is the so-called Lagrange multiplier. It adjusts the weight given to the ℓ1 penalty term and hence adjusts the sparsity level of the solution. The higher the value of λ, the sparser the solution will be. Note that, depending on the relation of the noise variance and λ, the solution of Eq. (1) might not coincide with the true underlying sparse representation. In practice, finding a good sparsity level is consequently most crucial for obtaining satisfying results, cf., [23]. In case Φ is an orthonormal basis, the solution c of Eq. (1) is given by a simple soft-thresholding step of the analysis coefficients (see, e.g., [17]): c = S_λ(Φ*y), where S_λ(z) = exp(i arg(z)) (|z| − λ)₊ and, as usual, b₊ := max(b, 0). Φ* denotes the analysis operator, the adjoint of the synthesis operator Φ, given by Φ*s = (⟨s, φ_{k,j}⟩)_{k,j}. In the general case when Φ only forms a frame, it was shown in [17] (and extended by other means in [26]) that for an arbitrary starting point c⁰ ∈ C^p, the iterated soft-thresholding algorithm (ISTA) converges strongly to the solution c of (1):

c = lim_{n→∞} S_λ(cⁿ + Φ*(y − Φcⁿ))

provided that ‖Φ‖ < √2. The IST algorithm converges very slowly, and in general not even linearly.¹ Hence, various methods of acceleration have recently been proposed. It turns out that especially the Beck-Teboulle FISTA algorithm [15] improves convergence significantly, cp., [14]. Here, we give an adaptation of the algorithm to our setting.

Algorithm 2.1 (FISTA). Set c⁰ = b¹ = 0 and t₁ = 1. Do

cⁿ = S_λ(bⁿ + Φ*(y − Φbⁿ))

t_{n+1} = (1 + √(1 + 4 t_n²)) / 2

b^{n+1} = cⁿ + ((t_n − 1) / t_{n+1}) (cⁿ − c^{n−1})

Until convergence.

While not significantly adding computational complexity, FISTA accelerates the convergence of the objective functional value F(c) := (1/2)‖y − Φc‖₂² + λ‖c‖₁ quadratically [15], i.e., F(cⁿ) − F(c) ∼ O(1/n²), where c denotes the minimizer.

¹ In the case of Gabor frames, e.g., generated by a compactly supported window function, it is easy to see, however, that the sequence converges linearly, which, moreover, guarantees uniqueness of the solution, see [27].
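Algorithm 2.1 translates into a few lines of NumPy. The following is our own sketch (not the authors' toolbox; `fista` and `soft_threshold` are hypothetical names), under the assumption that ‖Φ‖ ≤ 1 so the plain gradient step needs no extra scaling:

```python
import numpy as np

def soft_threshold(z, lam):
    # S_lambda(z) = exp(i arg z) * (|z| - lam)_+ , applied componentwise
    return np.exp(1j * np.angle(z)) * np.maximum(np.abs(z) - lam, 0.0)

def fista(Phi, y, lam, n_iter=100):
    # Beck-Teboulle acceleration of iterated soft-thresholding for
    # min_c (1/2)||y - Phi c||_2^2 + lam ||c||_1 (sketch, unit step size)
    c = np.zeros(Phi.shape[1], dtype=complex)
    b = c.copy()
    t = 1.0
    for _ in range(n_iter):
        c_next = soft_threshold(b + Phi.conj().T @ (y - Phi @ b), lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        b = c_next + ((t - 1.0) / t_next) * (c_next - c)
        c, t = c_next, t_next
    return c
```

For an orthonormal Φ (e.g., the identity), the iteration immediately reaches the closed-form solution S_λ(Φ*y), which provides a simple sanity check of an implementation.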

2.3 Generalized Soft-Thresholding

Since the ℓ1-norm acts independently on each coefficient, problem (1) does not take into account the dependencies between time-frequency coefficients that are inherent in most natural audio signals. It was shown in [20] that the ℓ1-penalty term can be replaced by a mixed norm that acts independently on groups of coefficients and their members. The solution of the corresponding minimization functional is then obtained by generalized soft-thresholding. Furthermore, overlapping neighborhood structures can be introduced to exploit persistence properties of coefficients [19]. However, for denoising of quasi-stationary noise, the neighborhood operators derived from the ℓ1-norm suffice. Therefore, the formal framework introduced in this paper is restrained to that case. For the more general setting encompassing mixed norms, see [3].

Definition 2.2. Let ξ = ξ_λ : C^p → R₊ be a non-negative function, the so-called threshold function, and let Λ denote the set of time-frequency indices, i.e., γ = (k, j) ∈ Λ for j = 0, . . ., J − 1, k = 0, . . ., K − 1. Then, for z ∈ C^p the generalized thresholding operator is defined componentwise by

S_ξ(z_γ) := z_γ (1 − ξ_{λ,γ}(z))₊

and we write S_ξ(z) := (S_ξ(z_γ))_{γ∈Λ}.

Example 2.3. For ξ_{λ,γ}(z) = ξ_L(z_γ) := λ/|z_γ|, we recover the usual soft-threshold operator (Lasso) from above. For ξ = (ξ_L)², we obtain a so-called empirical Wiener estimator, cf., [23,24]. Setting ξ = lim_{k→∞} (ξ_L)^k, the hard thresholding operator is approached, defined by S_λ(z) = z if |z| > λ and 0 if |z| ≤ λ.

The neighborhood systems introduced in [19] under the name windowed Group Lasso (WGL) can be written in terms of generalized thresholding by setting ξ = ξ_L ∘ η, where η is a neighborhood smoothing functional. It can be defined by

η(c_γ) := ( Σ_{γ′∈Λ} v_γ(γ′) |c_{γ′}|² )^{1/2}

where γ, γ′ ∈ Λ are both indexing the time-frequency plane and are of the form γ = (k, j), with k denoting time and j frequency. The sequences v are appropriately chosen non-negative time-frequency neighborhood weights, that is, e.g., ‖v_γ‖₁ = 1, and v_γ(γ) > 0 ∀γ ∈ Λ. The first condition ensures that the overall extension of the neighborhood does not interfere with the sparsity of the solution, and without the second, the intuition of a neighborhood would not make sense. In this kind of persistent thresholding, ξ = ξ_L ∘ η, the coefficients undergo shrinkage according to the energy of a time-frequency neighborhood that can be modeled flexibly, e.g., by using weighting and overlap.
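A WGL-style shrinkage of this kind can be sketched as follows (our own illustration, not the authors' MATLAB toolbox; `wgl_shrink` is a hypothetical name). It uses uniform neighborhood weights normalized so that ‖v_γ‖₁ = 1 and mirrors the coefficient array at the borders; with σ = (0, 0, 0, 0) the operator reduces to plain Lasso soft-thresholding, and squaring the threshold function would give the empirical Wiener variant:

```python
import numpy as np

def wgl_shrink(C, lam, sigma):
    """Shrink a (time x frequency) coefficient array C: each coefficient is
    scaled by (1 - lam/eta)_+ with eta the uniformly weighted l2-energy of
    its neighborhood.  sigma = (north, east, south, west) extensions."""
    s1, s2, s3, s4 = sigma
    # mirror the coefficients at the borders of the time-frequency plane
    P = np.pad(C, ((s4, s2), (s3, s1)), mode="reflect")
    K, J = C.shape
    count = (s4 + s2 + 1) * (s3 + s1 + 1)
    eta = np.empty((K, J))
    for k in range(K):
        for j in range(J):
            block = P[k:k + s4 + s2 + 1, j:j + s3 + s1 + 1]
            eta[k, j] = np.sqrt((np.abs(block) ** 2).sum() / count)
    return C * np.maximum(1.0 - lam / np.maximum(eta, 1e-12), 0.0)
```

A lone large coefficient surrounded by silence is thus attenuated more strongly than under the plain Lasso, while coefficients embedded in an energetic neighborhood are preserved.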



Fig. 1. Sketch of different neighborhood shapes in a schematic time-frequency plane.

Using the same nomenclature, the persistent empirical Wiener shrinkage (PEW) introduced in [23] is written simply as ξ = (ξ_L)² ∘ η.

Example 2.4. For the neighborhood consisting of the singleton, WGL coincides with the regular Lasso. With disjoint neighborhoods we obtain the well-known Group-Lasso [28] as a special case of the WGL. Furthermore, block-thresholding estimators as introduced in [29] can be expressed in this framework by using disjoint neighborhoods and appropriately chosen threshold functions.

3 NUMERICAL EVALUATION

The generalized shrinkage operators were implemented in MATLAB with the following parameterization of the neighborhoods. For each time-frequency index γ = (k, j) (k denoting the time and j the frequency index) and a neighborhood size vector σ = (σ₁, . . ., σ₄), the neighborhood N_σ is defined as the set of indices N(γ) = N_σ(k, j) = {(k′, j′) : k′ ∈ [k − σ₄, k + σ₂], j′ ∈ [j − σ₃, j + σ₁]}. That is, the vector σ refers to the additional extension of the neighborhood in the orientations σ = (north, east, south, west) from the center coefficient. For instance, the size vector σ = (1, 1, 1, 1) therewith refers to the rectangular neighborhood containing nine coefficients. Neighborhoods of indices close to a border of the time-frequency plane are obtained by mirroring the index set at the respective border. Fig. 1 sketches different neighborhood shapes in the time-frequency plane. The corresponding toolbox and the audio files used in the evaluation, as well as further audio-visual examples, are available at the webpage homepage.univie.ac.at/monika.doerfler/StrucAudio.html

Fig. 2 gives a first overview of the basic behavior of the denoising operators. It shows parts of the clean, noisy, WGL-denoised, and Lasso-denoised time-frequency coefficients from a natural audio signal. For the sake of brevity, we leave out depictions of the (P)EW operators as they yield visual appearances similar to their respective counterparts.
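The index-set parameterization with border mirroring can be sketched as follows (our own illustration; the function name is hypothetical):

```python
def neighborhood(k, j, sigma, K, J):
    """N_sigma(k, j) = {(k', j') : k' in [k - s4, k + s2], j' in [j - s3, j + s1]}
    on a K x J time-frequency grid, with indices mirrored at the borders."""
    s1, s2, s3, s4 = sigma            # (north, east, south, west)

    def mirror(i, size):
        i = abs(i)                    # mirror at the lower border
        if i >= size:                 # mirror at the upper border
            i = 2 * (size - 1) - i
        return i

    return {(mirror(kk, K), mirror(jj, J))
            for kk in range(k - s4, k + s2 + 1)
            for jj in range(j - s3, j + s1 + 1)}
```

For σ = (1, 1, 1, 1), an interior index yields the nine-coefficient rectangular neighborhood mentioned above, while indices at a border fold back onto the grid.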
While using the same threshold level, the Lasso obviously gives much sparser results than the WGL. On the contrary, WGL yields a regularized solution, without apparent musical noise.

For all the following simulations we employed a tight Gabor frame with a Hann window of 1024 samples length

Fig. 2. Clean, noisy, Lasso-denoised, and WGL-denoised time-frequency coefficients of a natural audio signal (in left-to-right, top-to-bottom order). WGL is computed with neighborhood σ = (0, 4, 0, 4). Using the same threshold level λ = 0.01, the Lasso is sparser than WGL but suffers from severe musical noise.

with hop-size 256 at a 44100 Hz audio sampling rate. In this paper we only consider the case of Gaussian white noise at base noise levels of 0, 10, and 20 dB SNR.² The denoising performance of the different operators is measured via the SNR between the denoised estimate and the original, clean signal. The experiments were repeated with various alternative parameter settings, in particular windows with longer support. Since the results were very similar, we stick to one standard setting in all evaluations.

3.1 Convergence

Operators from the proposed extended class of shrinkage operators are in general no longer associated with simple convex penalty functionals. In [22], convergence of the iterative shrinkage process using the WGL setting within FISTA was studied by exploiting the connection to a related, convex minimization functional obtained by mapping the coefficient space into a higher-dimensional space. The results obtained in [22] and the experimental results presented in this section give strong evidence that FISTA with the WGL shrinkage operator converges for reasonable neighborhood weighting. Furthermore, it is observed that the iterative process improves sparsity (see also Fig. 6) and denoising quality as opposed to one-step thresholding, which provides valid results only when orthonormal bases are used instead of frames. On the contrary, the (persistent) empirical Wiener estimators do not have any obvious relation to an underlying minimization functional and can rather be derived as multipliers of the frame's expansion coefficients [23]. They will thus be used as non-iterated time-frequency thresholding operators. As an example of the convergence process, Fig. 3 depicts the evolution of the SNR of the sparse approximation and the clean signal as a function of the number of

² The signal-to-noise ratio of signals s, s̃ ∈ C^L is defined by 10 log₁₀ ( Σ_{n=1}^{L} |s(n)|² / Σ_{n=1}^{L} |s(n) − s̃(n)|² ).
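Footnote 2 and the construction of the 0/10/20 dB base noise levels translate directly into code (our own sketch; `snr_db` and `add_noise` are hypothetical helpers that scale Gaussian white noise exactly to a target level):

```python
import numpy as np

def snr_db(s, s_est):
    # SNR of footnote 2: 10 log10( sum |s|^2 / sum |s - s_est|^2 )
    return 10.0 * np.log10(np.sum(np.abs(s) ** 2) / np.sum(np.abs(s - s_est) ** 2))

def add_noise(s, target_snr_db, rng):
    # scale Gaussian white noise so that snr_db(s, s + noise) == target_snr_db
    noise = rng.standard_normal(s.shape)
    noise *= np.linalg.norm(s) / (np.linalg.norm(noise) * 10.0 ** (target_snr_db / 20.0))
    return s + noise
```

Because the noise norm is scaled relative to the signal norm, the resulting base SNR matches the target exactly, which keeps the three noise conditions comparable across test signals.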



Table 1. Comparison of the neighborhood's orientation w.r.t. signal characteristics. The maximal SNRs of WGL with vertical σ = (4, 0, 4, 0), horizontal σ = (0, 4, 0, 4), and rectangular σ = (1, 1, 1, 1) neighborhoods are averaged over noise levels of 0, 10, and 20 dB SNR.

              Strings   Piano   Percussion
Horizontal       20.2    19.3         20.8
Vertical         18.1    17.3         21.3
Rectangular      19.1    18.5         21.5

Fig. 3. Example of the evolution of SNR as a function of the number of iterations for the (F)ISTA using Lasso and WGL [σ = (0, 4, 0, 4)] operators for a piano sample at 20 dB SNR and sparsity level λ = 0.429.


Fig. 4. Number of required iterations of FISTA using Lasso and WGL [σ = (0, 4, 0, 4)] to reach an SNR tolerance of 0.5, 0.05, 0.005 dB w.r.t. the SNR of the 100th iteration as a reference. The test signal is the piano sample described in Section 3.2 at noise level 10 dB.

iteration steps for the Lasso and the Windowed Group Lasso. It is obvious that the FISTA algorithm accelerates the convergence significantly. Fig. 4 further sheds some light on the relation of approximation precision, sparsity level, and required number of iteration steps. The figure shows the number of steps required to obtain SNR tolerances of 0.5, 0.05, and 0.005 dB w.r.t. the SNR of the 100th iteration. Note that in this example (and also all others we have encountered so far) fewer than 20 iterations are required to reach an SNR tolerance of 0.05 dB. Furthermore, the WGL seems to regularize the convergence process, as it requires far fewer iterations to reach its stationary point. In the following simulations we used FISTA with (at least) 20 iteration steps.

3.2 Selection of Neighborhoods

In this section we investigate the influence of neighborhood support and weights. We give some numerical illustrations of the intuition that the neighborhood's influence depends on local signal characteristics. In particular, we illustrate the four aspects of orientation, extension, symmetry, and decay. To evaluate the operators with respect to varying signal characteristics we chose a set of three mutually contrasting signals. The first contains a rather sustained excerpt of a string quartet, the second a latin-style piano pattern, and the third a percussive conga groove, each of 2300 ms length. The samples do not contain reverb or audible room influences and are taken from a sample library of a commercially available audio software. Gaussian white noise was added to obtain "base" noise levels of 0, 10, and 20 dB SNR for each signal. The quantitative results presented in the following are always averaged over these three noise levels.

It is quite intuitive that the denoising performance should be optimal if the orientation of the neighborhoods in time and frequency fits the dominating persistence properties of the signal. For instance, the strings signal contains temporally persistent and spectrally sparse components, hence a horizontally oriented neighborhood such as σ = (0, 4, 0, 4) seems suitable. The contrary should hold for the percussion signal, where vertically oriented neighborhoods should perform best. Table 1 presents numerical evaluations concerning three different neighborhood orientations and the three musical test signals. It shows the respective maximal SNR (over the range of relevant sparsity levels) averaged over all three different noise levels. While the horizontally oriented neighborhood performs best for the sustained string signal, the vertical neighborhood works better for the percussive signal. Contrary to a first intuition, the rectangular neighborhood σ = (1, 1, 1, 1) performs best for the percussive signal, while being sub-optimal for the piano signal, which should exhibit persistence in both time and frequency.
Second, we disregard orientation and focus only on the extension of horizontally oriented neighborhoods. Again, depending on signal characteristics there are optimal neighborhood lengths. Table 2 shows the corresponding numerical results. The longest neighborhood, with σ = (0, 12, 0, 12), works best for the strings excerpt, which features the most temporally persistent structures. Similarly, the medium σ = (0, 8, 0, 8) and short σ = (0, 4, 0, 4) extensions optimize the SNR measure for the piano and percussion phrases, respectively. The symmetry of the energy distribution of most audio signals varies with the instruments being played and their

SIEDENBURG AND DÖRFLER

PAPERS

Table 2. Comparison of the neighborhood's extension w.r.t. signal characteristics. The maximal SNRs of WGL with short σ = (0, 4, 0, 4), medium σ = (0, 8, 0, 8), and long σ = (0, 12, 0, 12) neighborhoods are averaged over noise levels of 0, 10, and 20 dB SNR.

          Strings   Piano   Percussion
Short     20.2      19.3    20.8
Medium    20.5      19.5    20.4
Long      20.6      19.4    20.5

Table 3. Comparison of different neighborhood symmetries w.r.t. signal characteristics. The maximal SNRs of WGL with symmetric σ = (4, 0, 4, 0), 1/3-centered σ = (0, 2, 0, 6), and asymmetric σ = (0, 0, 0, 8) neighborhoods are averaged over noise levels of 0, 10, and 20 dB SNR.

              Strings   Piano   Percussion
Symmetric     20.2      19.3    20.8
1/3-Centered  20.1      19.3    20.9
Asymmetric    19.9      19.0    20.4

respective modes of excitation. It was noted in [3] that non-symmetric neighborhoods can be beneficial, especially for avoiding pre-echo artifacts in tasks like the estimation of a tonal signal layer. Table 3 shows the signal-dependent denoising results of the WGL operator using symmetric σ = (4, 0, 4, 0), 1/3-centered σ = (0, 2, 0, 6), and asymmetric σ = (0, 0, 0, 8) neighborhoods. The SNR differences are very subtle but nonetheless point in an interesting direction. While the asymmetric neighborhood is suboptimal for all three signals, the symmetric one is best for the strings (which play rather sustained notes of continuous excitation), and the 1/3-centered neighborhood performs best for the percussive signal (with fast decay after each excitation).

A last remark concerns the decay of the neighborhood weights. It seems intuitive that rectangular weighting is suboptimal, since it does not account for the continuously varying interdependence of two points in the time-frequency plane (the further away, the less important). However, this aspect is hard to evaluate: for instance, switching from rectangular to linear weighting has a similar impact as simply changing the neighborhood's length. In the numerical experiments conducted so far, non-rectangular weighting has not improved performance significantly.

In conclusion, this numerical case study confirms the intuition that neighborhood selection should ideally be adapted to the signal. While the performance differences for the parameter of symmetry, as measured in SNR, seem of minor relevance (which does not imply that the same holds perceptually), adjusting the overall shape of the neighborhood has the greatest impact.

3.3 Comparison with Other Algorithms

In 2008, Yu, Mallat, and Bacry proposed a non-diagonal denoising procedure based on time-frequency block thresholding [8]. By minimizing the corresponding Stein estimate of the risk, the algorithm automatically adapts the blocks of time-frequency coefficients on which a block-wise empirical Wiener filter is applied. The authors show that the method clearly outperforms other state-of-the-art algorithms such as power subtraction [4] and Ephraim and Malah's MMSE-LSA [6], both in terms of signal difference measured in SNR and in a perceptual evaluation with a representative group of subjects. We employ this algorithm with the same Gabor-transform settings as above and use it as a reference for the state of the art.

Using the generalized thresholding framework, we further evaluate the simple but ubiquitously used Lasso estimate with threshold function ξ = ξL and its quadratic counterpart ξ = (ξL)², corresponding to the empirical Wiener estimator, both classical diagonal denoising operators. In terms of neighborhoods of the WGL, we choose non-symmetric ones to avoid pre-echoes and evaluate the 1/3-centered horizontal support σ = σh = (0, 2, 0, 6), as well as the rather vertically oriented σ = σr = (2, 0, 2, 1). For both neighborhoods, rectangular weighting is used. To further enhance SNR performance, we also consider their respective non-iterated squared (empirical Wiener) counterparts. The pool of test signals was complemented by a male and a female voice (2300 ms length) and a six-second jazz quintet recording containing drums, double bass, piano, saxophone, and trumpet. All test signals were corrupted with Gaussian white noise at levels of 0, 10, and 20 dB.

Fig. 5. Comparison of the maximal SNR (w.r.t. the clean signal), averaged over noise levels of 0, 10, and 20 dB for six different signals. The evaluated operators are the Lasso (L), the empirical Wiener (EW), the block thresholding algorithm (BT), WGL with σ = (0, 2, 0, 6), WGL with σ = (2, 0, 2, 1), and the persistent empirical Wiener (PEW) with σ = (0, 2, 0, 6) and σ = (2, 0, 2, 1).

Fig. 5 shows the maximal (over all sparsity levels) SNR performance of the seven different operators, averaged over the three noise levels. The plain Lasso is consistently worst. WGL-0206 and WGL-2021 improve the estimation, with one being better than the other depending on the signal type.
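The two diagonal operators just mentioned reduce to simple per-coefficient gain functions. The sketch below is a standard reading of soft thresholding (Lasso) and its quadratic, empirical-Wiener counterpart; the function names are ours, and this is an illustration rather than the paper's code.

```python
import numpy as np

def lasso_shrink(x, lam):
    """Diagonal Lasso operator (soft thresholding): each coefficient keeps
    its phase and loses lam of its magnitude; |x| <= lam is set to zero."""
    mag = np.abs(x)
    gain = np.maximum(1.0 - lam / np.maximum(mag, 1e-12), 0.0)
    return x * gain

def empirical_wiener(x, lam):
    """Quadratic counterpart: gain (1 - lam^2 / |x|^2)_+ . It removes the
    same coefficients as the Lasso but shrinks the survivors less."""
    mag2 = np.abs(x) ** 2
    gain = np.maximum(1.0 - lam**2 / np.maximum(mag2, 1e-12), 0.0)
    return x * gain
```

Both operators zero out coefficients with magnitude at most lam; above the threshold the empirical Wiener gain approaches 1 faster, which is why its non-iterated "squared" variants tend to score higher in SNR.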
For rather percussive signals, such as the conga-percussion part and the speech signals, the vertically oriented WGL is better; for the rather tonal signals, the horizontally oriented WGL improves performance by


PERSISTENT SHRINKAGE

Table 4. Signal lengths (left) and respective computation times in seconds of different denoising operators, averaged over five trials each.

           Lasso   EW     WGL     PEW    BT
1 sec      0.1     0.1    3.1     0.1    3.2
10 sec     1.1     1.1    29.0    1.2    32.7
100 sec    11.1    11.0   293.6   11.6   339.0

Fig. 6. Comparison of the number of retained coefficients for different sparsity levels λ. The evaluation includes the Lasso (L), empirical Wiener (EW), WGL with σ = (0, 2, 0, 6), the persistent empirical Wiener (PEW) with σ = (0, 2, 0, 6), and the block thresholding algorithm (BT).

at least 1 dB. Most interestingly, Yu, Mallat, and Bacry's block thresholding algorithm works best for the percussive signals, i.e., the conga and speech signals, while the non-iterated PEW outperforms block thresholding for predominantly tonal signals.

Beyond the denoising performance as measured in SNR, we can compare the sizes of the algorithms' recovered support, i.e., their model selection, in order to better understand their behavior. Fig. 6 depicts the percentage of non-zero coefficients over different sparsity levels λ for the operators L, EW, WGL-0206, PEW-0206, and BT, each for the jazz signal used previously, here distorted with white noise at an SNR of 10 dB. The block thresholding algorithm (BT) clearly does not yield sparsity: over the whole range of sparsity levels it retains almost 100% of the coefficients. The Lasso shows the converse behavior; it is overall much sparser than any of the other operators. Of particular interest is the difference between PEW and WGL: the iterative scheme of the WGL results in up to 10% fewer coefficients than the non-iterated PEW.3 Thus, although PEW yields a higher SNR, the iterated WGL produces sparser estimates.

3.4 Processing Time

Table 4 shows the computation times4 of the discussed denoising operators for signals of lengths 1, 10, and 100 seconds, averaged over five trials each. Both persistent operators, WGL and PEW, use the neighborhood σ = (0, 2, 0, 6) in this experiment. For all operators, computation time grows almost linearly as a function of signal length. The Lasso, empirical Wiener, and PEW are much faster than WGL (using 20 FISTA iterations

3 Note that for each thresholding step, both operators produce the same support.
4 Experiments were conducted on a 2 × 2.26 GHz quad-core processor with 6 GB of RAM.

here) and the block thresholding algorithm. For the WGL, this relatively slow behavior is due to the included iterations, which would become completely obsolete (cf. Section 3.2 above) when switching from the over-complete Gabor dictionary used here to an orthonormal basis, such as an MDCT. Despite its neighborhood persistence, the PEW operator is almost as fast as its non-persistent counterparts, with a computation time of roughly one-tenth of the signal length. Let us stress that the presented results are only meant to give a basic sense of the relative differences, as none of the implementations used was particularly optimized for computational efficiency. Given this caveat, PEW clearly remains very attractive in terms of processing cost.

4 PERCEPTUAL ASPECTS

The signal-to-noise ratio measures a global energy difference, which generally does not coincide with perceptual distance. The SNR favors non-sparse solutions containing noise residuals of low energy (that are nevertheless clearly audible), such that noisy high-energy signal components, emphasized by the underlying ℓ2-norm, usually remain in the signal. Therefore, this section provides a subjective description of the perceptual qualities of the proposed algorithms. Note that in professional audio restoration the degree of denoising is often limited in order to avoid introducing other perceptual artifacts. In this evaluation, however, we applied rather extreme settings in order to accentuate such artifacts and emphasize the differences between the operators. A selection of the employed test signals and the denoised results can be found at the above-mentioned webpage.

Although quite distinct in terms of SNR, the difference between the WGL and PEW operators was inaudible to the authors (using appropriate audio equipment). Hence, the latter seems preferable for the denoising task, as it produces a higher SNR without iteration.
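As a reference point for the numbers quoted throughout, the global SNR measure can be written down in a few lines (a sketch; the function name is ours):

```python
import numpy as np

def snr_db(clean, estimate):
    """Global SNR in dB of a denoised estimate against the clean reference.

    Being an energy (l2) measure, it is blind to how audible a
    low-energy residual actually is, as discussed above."""
    err = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))
```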
In terms of neighborhood shape, the vertically oriented WGL/PEW with σ = (2, 0, 2, 1) tends to produce "underwater"-like residuals at high noise levels because of its short time persistence. As could be expected, it also tends to emphasize vertical time-frequency components, e.g., the snare clicks in the jazz quintet sample, and reduces some of the temporal smearing that becomes audible in the horizontal WGL/PEW and in block thresholding as subtle "ghost-like" artifacts, in particular in speech. Most importantly, all three alternatives successfully avoid musical noise as


well as pre-echo: WGL/PEW due to their non-symmetric neighborhoods, and block thresholding due to its signal-dependent block adaptation. A point in favor of block thresholding is that its lowpass-filtering effect appears less pronounced than for the WGL versions we tested, although the difference seems to be of minor magnitude. A more serious artifact is that block thresholding tends to produce a subtle but annoying background texture of clicks. This phenomenon is audible at higher noise levels, when sparser approximations are required (i.e., higher thresholds) such that no high-frequency components can mask the texture. The artifact is probably due to the disjoint partition of the time-frequency plane into blocks and the corresponding rapid amplitude modulations in the time-frequency plane. The overlapping neighborhood systems of the WGL avoid this artifact.

While conducting formal listening tests would obviously yield more valid assessments of the perceptual differences between the various operators, this was beyond the scope of the study presented in this paper. We would still like to point out that, beyond comparing plain SNR, a few computational approaches exist that attempt to simulate perceptual judgments automatically [30,31]. In our context, however, particularly when dealing with a variety of noise- and threshold-level settings, these criteria do not always provide reasonable results.5 From our rather extensive investigations, using in particular the ITU-R Recommendation BS.1387 perceptual evaluation of audio quality (PEAQ), we conclude that this and similar approaches do not lend themselves easily to general evaluation settings but have to be trained more specifically to the denoising scenario using musical stimuli. For a more detailed critique of the PEAQ measure and of audio evaluation in general, see [32,33,34].

5 CONCLUSION

This paper considered the denoising problem from the perspective of sparse atomic representation.
A general framework of time-frequency soft thresholding was proposed that encompasses and connects well-known shrinkage operators as special cases. In particular, it was demonstrated how neighborhood-persistent threshold operators can be computed efficiently and how a signal-adaptive choice of parameters can improve estimation quality. With respect to the denoising task, it was shown that simple non-iterated operators derived from the generalized thresholding scheme perform competitively with state-of-the-art methods, as evaluated in SNR, while being computationally less complex. Regarding perceived audio quality, the initial results are equally promising. The presented ℓ1-framework is a special case of the mixed-norm setting, which has already been used for modeling inter-channel dependencies in multichannel audio [19].

5 For instance, we obtained optimal PEAQ values [30] for thresholds that were obviously too large. We encountered similarly counterintuitive results using [31].



By enhancing the mixed-norm setting with the insights gained in this paper, further improvements of multichannel denoising algorithms can be expected.

In this paper the issue of adapting the underlying time-frequency transform remained untouched. In fact, using persistent thresholding in combination with non-stationary or multi-frame expansions seems to bear great potential for structured denoising. Such modified approaches, as well as more comprehensive comparisons with other denoising methods (as used in speech enhancement) using formal listening tests, will be addressed in future work. Finally, it should be noted that the basic idea discussed in this paper, namely supporting persistent structures in time-frequency representations, should also bear potential for other branches of audio processing using sparse representations, such as dictionary learning, audio coding, and source separation.

6 ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their helpful and constructive advice. This work was supported by the Austrian Science Fund (FWF): T384-N13 and the WWTF project Audio-Miner (MA09-024).

7 REFERENCES

[1] C. Kereliuk and P. Depalle, “Sparse Atomic Modeling of Audio: A Review,” Proc. of the 14th Int. Conference on Digital Audio Effects (DAFx-11), Paris, France (2011).
[2] M. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, “Sparse Representations in Audio and Music: From Coding to Source Separation,” Proceedings of the IEEE, vol. 98, no. 6, pp. 995–1005 (2010).
[3] K. Siedenburg and M. Dörfler, “Structured Sparsity for Audio Signals,” Proc. of the 14th Int. Conference on Digital Audio Effects (DAFx-11), Paris, France (2011).
[4] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of Speech Corrupted by Acoustic Noise,” Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP ’79), vol. 4, pp. 208–211 (1979).
[5] J. Lim and A.
Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604 (1979).
[6] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 2, pp. 443–445 (1985).
[7] O. Cappé, “Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345–349 (1994).
[8] G. Yu, S. Mallat, and E. Bacry, “Audio Denoising by Time-Frequency Block Thresholding,” IEEE Transactions on Signal Processing, vol. 56, no. 5, pp. 1830–1839 (2008).


[9] D. Donoho and I. Johnstone, “Ideal Spatial Adaptation by Wavelet Shrinkage,” Biometrika, vol. 81, no. 3, pp. 425–455 (1994).
[10] D. Donoho and I. Johnstone, “Adapting to Unknown Smoothness via Wavelet Shrinkage,” J. Am. Statistical Assn., pp. 1200–1224 (1995).
[11] S. Chen, D. Donoho, and M. Saunders, “Atomic Decomposition by Basis Pursuit,” SIAM J. Scientific Computing, vol. 20, no. 1, pp. 33–61 (1998).
[12] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” J. Royal Statistical Soc. Series B (Statistical Methodology), vol. 58, no. 1, pp. 267–288 (1996).
[13] D. Donoho, “For Most Large Underdetermined Systems of Linear Equations the Minimal ℓ1-Norm Solution Is Also the Sparsest Solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829 (2006).
[14] I. Loris, “On the Performance of Algorithms for the Minimization of ℓ1-Penalized Functionals,” Inverse Problems, vol. 25, no. 3 (2009).
[15] A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202 (2009).
[16] I. Daubechies, M. Fornasier, and I. Loris, “Accelerated Projected Gradient Method for Linear Inverse Problems with Sparsity Constraints,” J. Fourier Analysis and Applications, vol. 14, pp. 764–792 (2008).
[17] I. Daubechies, M. Defrise, and C. De Mol, “An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint,” Communications on Pure and Applied Mathematics (2004).
[18] C. Févotte, B. Torrésani, L. Daudet, and S. J. Godsill, “Sparse Linear Regression with Structured Priors and Application to Denoising of Musical Audio,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 174–185 (2008).
[19] M. Kowalski and B. Torrésani, “Sparsity and Persistence: Mixed Norms Provide Simple Signal Models with Dependent Coefficients,” Signal, Image and Video Processing, vol. 3, no. 3, pp. 251–264 (2009).
[20] M.
Kowalski, “Sparse Regression Using Mixed Norms,” Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 303–324 (2009).
[21] K. Siedenburg and M. Dörfler, “Audio Denoising by Generalized Time-Frequency Thresholding,” Proceedings of the AES 45th Conference on Applications of Time-Frequency Processing (2012 March), paper 5-2.



[22] M. Kowalski, K. Siedenburg, and M. Dörfler, “Social Sparsity! Neighborhood Systems Enrich Structured Shrinkage Operators,” to appear in IEEE Transactions on Signal Processing (2013).
[23] K. Siedenburg, “Persistent Empirical Wiener Estimation with Adaptive Threshold Selection for Audio Denoising,” Proceedings of the 9th Sound and Music Computing Conference, Copenhagen (2012 July 11–14).
[24] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way (Academic Press, 3rd ed., 2009).
[25] M. Dörfler, “Time-Frequency Analysis for Music Signals: A Mathematical Approach,” Journal of New Music Research, vol. 30, no. 1, pp. 3–12 (2001).
[26] P. L. Combettes and V. R. Wajs, “Signal Recovery by Proximal Forward-Backward Splitting,” Multiscale Modeling and Simulation, vol. 4, no. 4, pp. 1168–1200 (2005).
[27] K. Siedenburg, “Structured Sparsity in Time-Frequency Analysis,” Master’s thesis, Humboldt University Berlin (2011).
[28] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables,” J. Royal Statistical Soc. Series B (Statistical Methodology), vol. 68, pp. 49–67 (2006).
[29] T. T. Cai, “Adaptive Wavelet Estimation: A Block Thresholding and Oracle Inequality Approach,” The Annals of Statistics, vol. 27, no. 3, pp. 898–924 (1999).
[30] Method for Objective Measurements of Perceived Audio Quality, ITU-R Recommendation BS.1387 (1998 December).
[31] Y. Hu and P. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238 (2008).
[32] J. Blauert and U. Jekosch, “A Layer Model of Sound Quality,” J. Audio Eng. Soc., vol. 60, pp. 4–12 (2012 Jan./Feb.).
[33] M. Purat and T. Ritter, “Comparison of Receiver-Based Concealment and Multiple Description Coding in an 802.11-Based Wireless Multicast Audio Distribution Network,” J. Audio Eng. Soc., vol. 59, pp. 225–238 (2011 Apr.).
[34] P. Kabal, “An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality,” Tech. Rep., Dept. of Electrical and Computer Engineering, McGill University (2003).




THE AUTHORS

Kai Siedenburg

Kai Siedenburg is currently pursuing a Ph.D. in music technology at McGill University. He obtained his degree in mathematics and musicology from Humboldt University Berlin as a scholarship holder of the German National Academic Foundation. In 2008/09 he was a Fulbright visiting student at the University of California, Berkeley, and in 2012 he was with the Austrian Research Institute for Artificial Intelligence (OFAI). His research interests focus on music perception and cognition and on computational models of musical sounds. Artistically, Kai is active as a jazz pianist and electronic musician, playing groove jazz and improvised experimental music.


Monika Dörfler

Monika Dörfler obtained her Ph.D. in mathematics from the University of Vienna, where she is a researcher at the Faculty of Mathematics. She studied piano at the Music University of Vienna and works in the field of applied mathematics for audio signal processing. She is interested in the interplay of local and global aspects of time-frequency analysis and focuses on the benefits of theoretical results in practical applications. She heads the interdisciplinary research project Audio-Miner and has been a Hertha Firnberg fellow of the Austrian Science Fund (FWF).



Audio Equalization with Fixed-Pole Parallel Filters: An Efficient Alternative to Complex Smoothing∗

BALÁZS BANK, AES Member ([email protected])

Department of Measurement and Information Systems, Budapest University of Technology and Economics, H-1521 Budapest, Hungary

Recently, the fixed-pole design of parallel second-order filters has been proposed to accomplish arbitrary frequency resolution, similarly to Kautz filters, at two-thirds of their computational cost. This paper relates the parallel filter to the complex smoothing of transfer functions. Complex smoothing is a well-established method for limiting the frequency resolution of audio transfer functions for analysis, modeling, and equalization purposes. It is shown that the parallel filter response is similar to the one obtained by complex smoothing of the target response using a Hann window: a 1/β-octave resolution is achieved by using β/2 pole pairs per octave in the parallel filter. Accordingly, the parallel filter can either be used as an efficient implementation of smoothed frequency responses, or it can be designed from the unsmoothed responses directly, eliminating the need for frequency-domain processing. In addition, the theoretical equivalence of parallel filters and Kautz filters is developed, and formulas for converting between the parameters of the two structures are given. Examples of loudspeaker-room equalization are provided.

0 INTRODUCTION

Audio equalization using DSPs has been a subject of research for three decades. It generally means the correction of the magnitude (and sometimes the phase) response of an audio chain. Typical examples include loudspeaker equalization based on anechoic measurements [1–4] and the correction of loudspeaker-room responses [5–11]. Because the systems to be equalized are typically of very high order (e.g., due to the high modal density of room responses), direct inversion of the transfer function is usually not practical. As the final judge of sound quality is the human ear, it is more efficient to equalize only those aspects that lead to an audible error. A typical approach is to take into account the quasi-logarithmic frequency resolution of the human auditory system during equalizer design.

Besides efficiency and perceptual aspects, there is also a physical reason for applying a logarithmic or logarithmic-like frequency resolution in equalizer design. Namely, an audio system often has multiple outputs, such as multiple listening positions in a room, and the equalizer should maintain or improve the sound quality at all positions. Transfer functions measured at different points in space have more similarity at low frequencies than at high frequencies, due

* Based on a convention paper published at the 128th Convention of the Audio Engineering Society, London, UK, 2010 May 22–25.


to the different wavelengths of sound. Therefore, an overly precise correction at high frequencies would worsen the response at other positions [6–7].

This paper demonstrates that fixed-pole parallel filters can be used efficiently for the modeling or equalization of audio systems, as they possess the beneficial properties of complex smoothing while requiring relatively little processing power both for filtering and for parameter estimation. The paper first reviews fractional-octave smoothing in Section 1, then covers the related warped and Kautz filters in Section 2. Section 3 outlines the theory of parallel filters and proves their equivalence with Kautz filters. Section 4 relates parallel filter design to complex smoothing, and Section 5 presents loudspeaker-room equalization examples and comparisons. Finally, Section 6 gives practical implications and Section 7 concludes the paper.

1 FRACTIONAL-OCTAVE SMOOTHING OF TRANSFER FUNCTIONS

The quasi-logarithmic frequency resolution of human hearing is also reflected in how transfer functions are displayed in the audio field. From the earliest times, a logarithmic frequency scale has been used, and often the magnitude response is smoothed at a fractional-octave (e.g., third-octave) resolution. The motivation behind fractional-octave smoothing is that the original transfer function is too detailed for visual evaluation, while the smoothed version gives a good estimate of the perceived timbre. Although this practice stems from analog signal analyzers, virtually all current digital spectrum analyzers for audio offer this option.

While, traditionally, smoothing has been applied only to the magnitude response of audio systems, its complex alternative has also appeared, allowing the reconstruction of a smoothed impulse response via the IFFT [12]. Smoothing of the complex transfer function is in practice done by convolving it with a smoothing function W(f); thus, it is equivalent to multiplying the impulse response by a time-domain window function w(t). Frequency-dependent signal windowing, the time-domain equivalent of complex smoothing, has also been proposed in [13]. When the width of the frequency-domain smoothing function depends on frequency, the operation corresponds to multiplying the impulse response by a window whose length is frequency dependent.

Smoothing functions W(f) with a Hann window shape and widths of 50, 100, and 200 Hz are displayed in Fig. 1(a). Taking the inverse Fourier transform of W(f) yields the corresponding time-domain window functions w(t), displayed in Fig. 1(b). Note that in complex smoothing, the smoothing function W(f) is chosen to be a real (zero-phase) function [12], which leads to a corresponding time window that is symmetric around t = 0, that is, w(−t) = w(t). Since we are interested in smoothing causal responses (h(t) = 0 for t < 0), it is sufficient to multiply the impulse response h(t) by the right half of the window function when the frequency-dependent windowing operation is performed.

Fig. 1. Complex smoothing and frequency-dependent windowing: (a) smoothing windows W(f) with a Hann shape, having widths of 50 Hz (solid line), 100 Hz (dashed line), and 200 Hz (dash-dotted line). (b) The corresponding time-domain window functions w(t), computed by the inverse Fourier transform of W(f), are displayed with the same line types.

For obtaining a logarithmic frequency resolution, wider smoothing functions have to be used at high frequencies than at low ones. This means that the original impulse response is windowed to a shorter length at high frequencies than at low ones. Naturally, not only fixed fractional-octave (logarithmic) resolutions but arbitrary smoothing resolutions can be applied, including those corresponding to the Bark or ERB scales [12].

Besides signal analysis, complex-smoothed transfer functions have been successfully applied to loudspeaker-room response equalization, where the complex-smoothed impulse response is used for FIR inverse filter design [14]. In addition, most room equalization systems apply some kind of transfer-function smoothing (mostly magnitude smoothing) as a preprocessing step before filter design (see, e.g., [3,6,9–10]) to avoid the problems of direct inversion mentioned in the Introduction.
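The smoothing operation described above can be sketched numerically. The sketch below uses the simplifying assumption of a constant smoothing width (a fractional-octave version would let the width grow with frequency); the function name and bin-based parameterization are ours.

```python
import numpy as np

def complex_smooth(H, width_bins):
    """Complex smoothing sketch: circularly convolve the complex transfer
    function H with a real, zero-phase Hann smoothing window W(f)."""
    taps = np.hanning(2 * width_bins + 3)[1:-1]  # strictly positive Hann taps
    taps /= taps.sum()                           # unit area: a flat H stays flat
    out = np.zeros_like(H, dtype=complex)
    for k, wk in zip(range(-width_bins, width_bins + 1), taps):
        out += wk * np.roll(H, k)                # weighted H(f - k), zero phase
    return out
```

Taking the inverse FFT of the smoothed transfer function corresponds (up to the DFT scaling convention) to multiplying the original impulse response by the time window w(t), per the equivalence discussed above.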

2 WARPED AND KAUTZ FILTERS

Traditional FIR and IIR filter design techniques and system identification methods provide a uniform frequency resolution, as opposed to the quasi-logarithmic resolution of hearing. Therefore, specialized filter design methodologies are often used in audio. While there are many different techniques that take into account the frequency resolution of hearing, only those are addressed here that have a direct connection to complex smoothing or, equivalently, to frequency-dependent windowing.

2.1 Warped Filters

The most commonly used perceptually motivated design technique is based on frequency warping [15–16]. The basic idea of warped filters is that the unit delay z^{-1} in the traditional FIR or IIR filter is replaced by an allpass filter

z^{-1} ← D(z^{-1}) = (z^{-1} − λ) / (1 − λz^{-1}),   (1)

which results in the transformation (warping) of the frequency axis. The design of warped filters starts with warping the target impulse response h_t(n), e.g., by the use of an allpass chain, which can be considered a frequency-dependent resampling of the impulse response. Then, warped FIR (WFIR) filters can be obtained by truncating or windowing the warped target response h̃_t(n). It follows directly from this design principle that WFIR filters perform frequency-dependent windowing [13], because the traditional, fixed-length windowing operation is embedded between a frequency-dependent resampling and its inverse operation. Accordingly, WFIR filters have been used for the magnitude equalization of room responses with results similar to those of FIR filters obtained by complex smoothing, but at a significantly lower computational cost [7].

However, the choice of the frequency profile of the “built-in smoothing function” is limited in the case of WFIR filters, as the shape of the frequency transformation is controlled by a single parameter λ. This is illustrated in Fig. 2, where the minimum-phase version of a loudspeaker-room response (a) is modeled by 32nd-order warped FIR filters with λ = 0.55, 0.85, and 0.95, displayed in (b), (c), and (d), respectively. While the solid line of (a) displays the target response after third-octave
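The delay-to-allpass substitution of Eq. (1) can be illustrated with a small sketch (our own illustration, not the paper's implementation): each delay z^{-n} in the expansion of the target response is replaced by D(z^{-1})^n, and the resulting impulse responses are accumulated.

```python
import numpy as np

def allpass(x, lam):
    """First-order allpass D(z^-1) = (z^-1 - lam) / (1 - lam*z^-1), Eq. (1)."""
    y = np.zeros_like(x, dtype=float)
    s = 0.0                          # single state variable (transposed DF-II)
    for n, xn in enumerate(x):
        y[n] = -lam * xn + s         # y[n] = -lam*x[n] + x[n-1] + lam*y[n-1]
        s = xn + lam * y[n]
    return y

def warp_response(h, lam, n_out):
    """Warp an impulse response: h_warped = sum_n h(n) * d_n, where d_n is
    the impulse response of D^n (the n-th tap of an allpass chain)."""
    d = np.zeros(n_out)
    d[0] = 1.0                       # d_0 is the unit impulse (D^0 = 1)
    ht = np.zeros(n_out)
    for hn in h:
        ht += hn * d
        d = allpass(d, lam)          # advance to the next tap of the chain
    return ht
```

For λ = 0 the allpass degenerates to a unit delay and the response is returned unchanged; positive λ stretches the low-frequency region, which is the frequency-dependent resampling described above.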

PAPERS

AUDIO EQUALIZATION WITH PARALLEL FILTERS

where Gk (z−1 ) are the orthonormal Kautz functions determined by the pole set pk , and pk∗ are the complex conjugate of pk . The advantage of the orthonormality of Gk (z−1 ) functions is that the weights wk can be determined from the target response ht (n) by a scalar product

Magnitude [dB]

40 30

(a)

20

(b)

10

(c)

0

(d)

−10

(e)

−20

(f)

wk =

−40 2

10

3

10 Frequency [Hz]

4

10

Fig. 2. Loudspeaker-room response modeling comparison: (a) third-octave smoothed target response (dashed line without smoothing), (b)–(d) 32nd order WFIR filters with λ = 0.55; 0.85; 0.95 parameters, (e) Kautz filter with 16 logarithmically spaced pole pairs (filter order is 32), and (f) parallel filter with the same pole set as for the Kautz filter. The vertical lines indicate the pole frequencies of the Kautz and parallel filters.

smoothing,1 the WFIR filters are designed using the unsmoothed target response, to demonstrate the smoothing behavior of WFIR filters. It can be seen that the allocation of the frequency resolution is controlled by the λ parameter, and lower values lead to better high frequency resolution, while larger ones increase the resolution at low frequencies. It can be also seen that none of the λ parameters distribute the resolution evenly on the logarithmic frequency scale. Note that warped IIR (WIIR) filters have no direct connections with complex smoothing. 2.2 Kautz Filters Kautz filters can be seen as the generalization of WFIR filters, where the allpass filters in the chain are not identical [17–18]. As a result, the frequency resolution can be allocated arbitrarily by the choice of the filter poles. The Kautz structure is a linear-in-parameter model, where the basis functions are the orthonormalized versions of decaying exponentials. The transfer function of the Kautz filter is: −1

H (z ) =

N 

wk G k (z −1 )

k=0

⎛ ⎞ k−1 1 − pk pk∗  z −1 − p ∗j ⎠, wk ⎝ = −1 −1 1 − p z 1 − p z k j k=0 j=0 N 

gk (n)h t (n),

(3)

n=1

−30

−50

∞ 

(2)

where g_k(n) are the inverse z-transform of G_k(z^−1).

It is impractical to implement Kautz filters as a series of complex first-order allpass filters as in Eq. (2), because combining the complex pole pairs into second-order sections yields a lower computational complexity [17]. Still, the combined cascade-parallel nature of the filter requires more computational power compared to filters implemented in direct or cascade form with the same filter order.

For determining the poles p_k of the Kautz filter, several methods are discussed in [17], including some iterative techniques. For our purposes, the most interesting is the one that sets the poles according to the required resolution, e.g., by applying a logarithmic pole distribution. This is illustrated in Fig. 2(e), displaying the frequency response of a Kautz filter designed to match the unsmoothed target response of Fig. 2(a), dashed line. It can be seen that by placing pole pairs two-thirds of an octave apart (vertical lines in Fig. 2), the resulting filter response approximates the third-octave smoothed transfer function, Fig. 2(a) solid line. It is important to stress that the Kautz filter was designed from the unsmoothed response; thus, it performs smoothing "automatically." The present paper provides a theoretical explanation of this phenomenon after proving the equivalence of Kautz and parallel filters in Section 3.4.

3 THE PARALLEL FILTER

Recently, a fixed-pole design method has been introduced for parallel second-order filters [19–20]. It has been shown that effectively the same results can be achieved by the parallel filters as with Kautz filters for the same filter order, at 2/3 of their computational cost. Implementing IIR filters in the form of parallel second-order sections has traditionally been used because it has better quantization noise performance and allows code parallelization.
The parameters of the second-order sections are usually determined from direct form IIR filters by partial fraction expansion [21]. In contrast, here the poles are set to a predetermined (e.g., logarithmic) frequency scale, leaving the zeros as free parameters for optimization. As we shall see later, the motivation for fixing the poles is to gain control over the frequency resolution of the design.
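Such a parallel bank of second-order sections is straightforward to run in code. The following is a minimal sketch, not taken from the paper: the function name, the section-tuple layout, and the use of SciPy's `lfilter` are our choices.

```python
import numpy as np
from scipy.signal import lfilter

def parallel_filter(x, sections, b_fir=None):
    """Parallel second-order structure (cf. Fig. 3); a sketch, our naming.

    sections: iterable of (a1, a2, d0, d1), one tuple per section, i.e.,
    the transfer function (d0 + d1*z^-1) / (1 + a1*z^-1 + a2*z^-2).
    b_fir: optional coefficients of the FIR path (b0 ... bM)."""
    y = np.zeros(len(x))
    for a1, a2, d0, d1 in sections:
        y += lfilter([d0, d1], [1.0, a1, a2], x)   # one second-order section
    if b_fir is not None:
        y += lfilter(b_fir, [1.0], x)              # optional FIR part
    return y
```

A single section with d0 = 1 and a1 = a2 = d1 = 0 passes the input through unchanged, which gives a quick sanity check of the structure.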

¹ Note that there is no generally accepted standard for how third-octave smoothing should be performed. In this paper, third-octave smoothing corresponds to convolving the transfer function with a Hann window having a full width of 2/3 octave (i.e., its half-width, or the distance of its 0.5 points, is a third octave). This complies with the results of analog third-octave analyzers, where the half-power points of the bandpass filters have a third-octave distance [25].
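The footnote's definition can be sketched directly. The helper below is hypothetical (name and normalization are ours): it averages a complex transfer function with a Hann kernel of 2/3-octave full width on the logarithmic frequency axis.

```python
import numpy as np

def third_octave_smooth(freqs, H, width_oct=2/3):
    """Smooth a (complex) transfer function H sampled at `freqs` by a Hann
    window of `width_oct` full width on the log2-frequency axis, following
    the footnote's definition (normalized weighted mean; our formulation)."""
    logf = np.log2(freqs)
    Hs = np.empty_like(H)
    for i, lf in enumerate(logf):
        d = (logf - lf) / (width_oct / 2)              # -1..1 inside the window
        w = np.where(np.abs(d) < 1.0, 0.5 * (1.0 + np.cos(np.pi * d)), 0.0)
        Hs[i] = np.sum(w * H) / np.sum(w)              # Hann-weighted average
    return Hs
```

Smoothing a flat response leaves it unchanged, while a single-bin dip is largely averaged away, as expected from a 2/3-octave kernel.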

3.1 Problem Formulation

Every transfer function of the form H(z^−1) = B(z^−1)/A(z^−1) can be rewritten in the form of partial fractions:

H(z^−1) = Σ_{k=1}^{K} c_k / (1 − p_k z^−1) + Σ_{m=0}^{M} b_m z^−m,    (4)

where p_k are the poles, either real valued or in complex conjugate pairs if the system has a real impulse response. The second sum in Eq. (4) is the FIR part of order M. If the order of A(z^−1) and B(z^−1) is the same, then it reduces to a constant coefficient b_0. Note that Eq. (4) is valid only if no multiple poles are present. In the case of pole multiplicity, terms of higher order also appear.

Now let us assume that we are trying to fit the filter H(z^−1) to a target response, but the poles of our IIR filter are predefined. In this case Eq. (4) becomes linear in its free parameters c_k and b_m; thus, they can be estimated by a simple least-squares algorithm to match the required response. The resulting filter can be implemented directly as in Eq. (4), forming parallel first-order complex filters, and the estimation of the parameters can be carried out as described in [19]. However, it is more practical to combine the complex pole pairs to a common denominator. This results in second-order sections with real-valued coefficients, which can be implemented more efficiently. Those fractions of Eq. (4) that have real poles can be combined with other real poles to form second-order IIR filters, yielding a canonical structure. Accordingly, the transfer function becomes

H(z^−1) = Σ_{l=1}^{L} (d_{l,0} + d_{l,1} z^−1) / (1 + a_{l,1} z^−1 + a_{l,2} z^−2) + Σ_{m=0}^{M} b_m z^−m,    (5)

where L is the number of second-order sections, and M is the order of the FIR part. The filter structure is depicted in Fig. 3.

Fig. 3. Structure of the parallel second-order filter.

In this paper the FIR part is not utilized (M = 0), since only the "FIR-less" form of the parallel filter can be related to complex smoothing. We note, however, that for modeling certain largely non-minimum-phase responses applying the additional FIR part can lead to an overall lower computational complexity [19]. Using an Mth-order FIR part actually means that the first M samples of the target impulse response remain untouched (i.e., followed exactly by the filter response), and the smoothing operation affects the samples n > M.

In the context of approximating complex smoothing, the pole frequencies f_k should be set according to the required frequency resolution. Accordingly, the poles of the parallel filter, p_k, are computed by the following formulas:

ϑ_k = 2π f_k / f_s,    (6)

p_k = e^{−Δϑ_k/2} e^{±jϑ_k},    (7)

where ϑ_k are the pole frequencies in radians given by the predetermined (e.g., logarithmic) analog frequency series f_k and the sampling frequency f_s. The bandwidth Δϑ_k of the kth second-order section is computed from the neighboring pole frequencies:

Δϑ_k = (ϑ_{k+1} − ϑ_{k−1}) / 2   for k = 2, . . ., K − 1,
Δϑ_1 = ϑ_2 − ϑ_1,    Δϑ_K = ϑ_K − ϑ_{K−1}.    (8)
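Eqs. (6)–(8) translate to a few lines of code. A sketch under our assumptions: the helper name is ours, and a logarithmic pole-frequency series is chosen via `geomspace` (the paper only requires the series to be predetermined).

```python
import numpy as np

def logarithmic_pole_set(f_lo, f_hi, pairs, fs):
    """Pole pairs on a logarithmic scale following Eqs. (6)-(8):
    theta_k = 2*pi*f_k/fs, bandwidths from neighboring pole distances,
    radius |p_k| = exp(-delta_theta_k / 2)."""
    fk = np.geomspace(f_lo, f_hi, pairs)          # predetermined pole frequencies
    theta = 2 * np.pi * fk / fs                   # Eq. (6)
    dtheta = np.empty(pairs)
    dtheta[1:-1] = (theta[2:] - theta[:-2]) / 2   # Eq. (8), k = 2 ... K-1
    dtheta[0] = theta[1] - theta[0]
    dtheta[-1] = theta[-1] - theta[-2]
    poles = np.exp(-dtheta / 2) * np.exp(1j * theta)   # Eq. (7), upper half plane
    return np.concatenate([poles, poles.conj()])       # complex conjugate pairs

poles = logarithmic_pole_set(20.0, 20000.0, 16, 44100.0)
assert np.all(np.abs(poles) < 1.0)   # all sections are stable
```

Since the bandwidths Δϑ_k are positive, the radii e^{−Δϑ_k/2} are always below one, so the construction cannot produce an unstable section.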

Equation (7) sets the pole radii |p_k| in such a way that the transfer functions of the parallel sections cross approximately at their −3 dB points.

3.2 Filter Design

First, we investigate how the parameters of the parallel filter can be estimated to match a desired filter response. The simplest way is to find the coefficients in the time domain (for frequency-domain design, see [22]). The impulse response of the parallel filter is the inverse z-transform of Eq. (5):

h(n) = Σ_{l=1}^{L} [d_{l,0} u_l(n) + d_{l,1} u_l(n − 1)] + Σ_{m=0}^{M} b_m δ(n − m),    (9)

where u_l(n) is the impulse response of the transfer function 1/(1 + a_{l,1} z^−1 + a_{l,2} z^−2), which is an exponentially decaying sinusoidal function, and δ(n) is the discrete unit impulse. Conceptually, filter design simply consists of creating a weighted sum of the exponentially decaying sinusoidal functions (and delayed unit impulses if there is an FIR part) in such a way that the resulting signal best approximates the target impulse response h_t(n). Because Eq. (9) is linear in its parameters, it can be written in a matrix form:

h = Mp,    (10)

where p = [d_{1,0}, d_{1,1}, . . . d_{L,0}, d_{L,1}, b_0 . . . b_M]^T is a column vector composed of the free parameters. The columns of the modeling signal matrix M contain the modeling signals, which are u_l(n) and their delayed counterparts u_l(n − 1), and, for the optional FIR part, the unit impulse δ(n) and its delayed versions up to δ(n − M). Finally, h = [h(0) . . . h(N)]^T is a column vector composed of the resulting impulse response. The problem reduces to finding the optimal parameters p_opt such that h = Mp_opt is closest to the target response h_t. If the error function is evaluated in the mean-squares sense, the optimum is found by the well-known LS solution

p_opt = M^+ h_t,    (11)

M^+ = (M^H M)^−1 M^H,    (12)

where M^+ is the Moore–Penrose pseudoinverse, and M^H is the conjugate transpose of M. Note that if the frequency resolution (thus the pole set and modeling matrix M) is fixed, the pseudoinverse M^+ can be precomputed and stored, so the parameter estimation reduces to a matrix multiplication according to Eq. (11). This is especially useful for designing multiple sets of filters (e.g., for modeling MIMO systems), since in this case Eq. (12) has to be computed only once.

In Fig. 2(f), a parallel filter design example is presented for the same loudspeaker-room target response as for the warped and Kautz filters in Section 2. The pole set is identical to the one used for the Kautz filter. It can be seen that the parallel filter (f) results in the same filter response as the Kautz filter (e). The same frequency response is shown in Fig. 4, thick line, but the separate transfer functions of the second-order sections are also visualized.

Fig. 4. 32nd-order parallel filter design as in Fig. 2(f): frequency response of the parallel filter (thick solid line) and frequency responses of the second-order sections (thin lines). The vertical lines indicate the pole frequencies.

It can be seen that the final transfer function is the result of a complicated interference of the basis functions, unlike in a graphic equalizer. This is because now not only the gains but also the phases of the different "bands" are free parameters. As an effect, while the width of the resulting peaks is limited by the Q factors of the sections, the dips can be much narrower, since they arise from phase cancellations of neighboring bands.

3.3 Direct Equalizer Design

Equalizing a system (such as a loudspeaker) by the parallel filter can be done by first inverting the system response and designing the parallel filter as outlined in the previous section. However, it is more practical to design the equalizer directly without inverting the system response [20], since this simplifies the design and avoids many problems presented by the direct inversion of the measured transfer function. Designing an equalizer requires that the resulting response h(n), which is the convolution of the equalizer response h_eq(n) and the system response h_s(n), is close to the target response h_t(n) (that can be a unit impulse, for example). By looking at Fig. 3, the equalizer design can be considered as a system identification problem where the system response h_s(n) is the input of the parallel filter, and the weights d_{l,0}, d_{l,1}, and b_m should be set in such a way that the summed output best approximates the target impulse response h_t(n). Accordingly, the output of the parallel filter is computed as

h(n) = h_eq(n) ∗ h_s(n)
     = Σ_{l=1}^{L} [d_{l,0} u_l(n) ∗ h_s(n) + d_{l,1} u_l(n − 1) ∗ h_s(n)] + Σ_{m=0}^{M} b_m δ(n − m) ∗ h_s(n)
     = Σ_{l=1}^{L} [d_{l,0} s_l(n) + d_{l,1} s_l(n − 1)] + Σ_{m=0}^{M} b_m h_s(n − m),    (13)

where ∗ denotes convolution. The signal s_l(n) = u_l(n) ∗ h_s(n) is the system response h_s(n) filtered by 1/(1 + a_{l,1} z^−1 + a_{l,2} z^−2). It can be seen that Eq. (13) has the same structure as Eq. (9). Therefore, the parameters d_{l,0}, d_{l,1}, and b_m can be estimated in the same way as presented in the previous section. Similarly, writing this in a matrix form yields

h = M_eq p,    (14)

where the columns of the new signal modeling matrix M_eq contain s_l(n), s_l(n − 1), and the system response h_s(n) and its delayed versions up to h_s(n − M). Then, the optimal set of parameters is again obtained by

p_opt = M_eq^+ h_t,    (15)

M_eq^+ = (M_eq^H M_eq)^−1 M_eq^H.    (16)
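The LS design of Eqs. (9)–(12) and its direct-equalizer variant of Eqs. (13)–(16) differ only in the modeling signals, which the sketch below exploits. This is our illustration, not code from the paper: names are ours, M = 0 (no FIR part), and SciPy's `lfilter` generates the decaying sinusoids u_l(n).

```python
import numpy as np
from scipy.signal import lfilter

def design_parallel(poles, h_target, h_s=None):
    """LS design of the parallel filter, Eqs. (9)-(12); a sketch, our naming.

    poles: one pole per complex-conjugate pair.  If the measured system
    response `h_s` is given, the modeling signals become s_l(n) = u_l(n)*h_s(n),
    which turns the fit into the direct equalizer design of Eqs. (13)-(16)."""
    N = len(h_target)
    x = np.zeros(N); x[0] = 1.0                 # unit impulse
    if h_s is not None:
        x = np.asarray(h_s, dtype=float)[:N]    # excite sections with h_s(n)
    cols, den = [], []
    for p in poles:
        a = [1.0, -2.0 * p.real, abs(p) ** 2]   # (1 - p z^-1)(1 - p* z^-1)
        u = lfilter([1.0], a, x)                # u_l(n), or s_l(n) if h_s given
        cols += [u, np.concatenate(([0.0], u[:-1]))]   # u_l(n), u_l(n - 1)
        den.append(a)
    M = np.column_stack(cols)                   # modeling matrix, Eq. (10)/(14)
    d, *_ = np.linalg.lstsq(M, h_target, rcond=None)   # Eqs. (11)/(15)
    return den, d.reshape(-1, 2), M @ d         # denominators, weights, fit
```

If the target already lies in the span of the modeling signals (e.g., it is itself the response of one of the sections), the LS fit recovers it exactly; otherwise the fit exhibits the smoothing behavior analyzed in Section 4.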

3.4 Equivalence with Kautz Filters

While it was clear from simulations and theoretical reasoning [20] that Kautz and parallel filters result in the same filter response for the same pole set (see also Fig. 2), a formal proof has not been provided previously. The proof presented here is based on the partial fraction expansion of the Kautz basis functions. According to Eq. (2), the kth basis function of the Kautz filter is

G_k(z^−1) = [√(1 − p_k p_k*) / (1 − p_k z^−1)] ∏_{j=0}^{k−1} (z^−1 − p_j*) / (1 − p_j z^−1),    (17)

which is a kth-order filter. In the case of no multiple poles, which is easily satisfied when the pole set is predetermined, Eq. (17) can be written in a partial fraction form

G_k(z^−1) = Σ_{i=1}^{k} c_{k,i} / (1 − p_i z^−1),    (18)

where the k coefficients c_{k,i} are found by the usual procedure of partial fraction expansion [21] in a closed form:

c_{k,i} = √(1 − p_k p_k*) ∏_{j=1, j≠i}^{k} [1 / (p_i − p_j)] ∏_{j=1}^{k−1} (1 − p_j* p_i).    (19)

By noting that the partial fraction form of Eq. (18) is the same as the complex form of the parallel filter Eq. (4) without the FIR part (M = 0), it is clear that the Kautz basis functions can be reconstructed by the parallel filter exactly. As the Kautz filter response is the linear combination of the Kautz basis functions G_k(z^−1), it is straightforward to convert a Kautz filter into a parallel filter. If the parameters of the Kautz filter are given in a vector w = [w_1, . . ., w_K]^T, the parameter vector of the parallel filter c = [c_1, . . ., c_K]^T can be obtained by the matrix multiplication

c = Kw,    (20)

where the conversion matrix K is given as

K_{i,k} = √(1 − p_k p_k*) ∏_{j=1, j≠i}^{k} [1 / (p_i − p_j)] ∏_{j=1}^{k−1} (1 − p_j* p_i)   for i ≤ k,
K_{i,k} = 0   for i > k.    (21)

Since K is triangular, and for simple poles it does not have zero values in its diagonal, it is nonsingular. As a result, the inverse matrix K^−1 can be computed, which can be used to convert the parallel filter parameters to Kautz parameters (w = K^−1 c). Basically, we have shown that the basis functions of the parallel and Kautz filters span the same approximation space, and converting between the two filters is merely a base change. Therefore, approximating a target response using any error norm (e.g., the L2 norm in least-squares design) will lead to exactly the same filter response in both cases.

Besides its theoretical importance, the relation between the two filter structures allows a computationally more efficient design of the parallel filter. Namely, first a Kautz filter is designed by the scalar product of Eq. (3), then the parameters are converted by Eqs. (20) and (21). While this seems conceptually more complicated, the number of required arithmetic operations is reduced compared to the LS design of Eq. (11), so it is a useful alternative for high (>100) filter orders. Note that in the case of the direct equalizer design of Section 3.3 this procedure does not provide any computational benefit compared to the LS design of Eq. (15), since in that case the scalar product of Eq. (3) cannot be used and the Kautz filter also has to be designed by an LS equation [18].

4 AN EFFICIENT ALTERNATIVE TO COMPLEX SMOOTHING

By observing the results of parallel and Kautz filters (see Fig. 2(e) and (f)) it is apparent that the effect of filter design is similar to that of fractional-octave complex smoothing of transfer functions. However, the theoretical reasons for this similarity have remained unexplored. In this paper the case of the parallel filter is investigated, but since it results in exactly the same filter response as the Kautz filter (as was proven in Section 3.4), the observations are valid for the Kautz filter as well.

4.1 Uniform Pole Distribution

We start our analysis with the simplest case, where the K poles of the parallel filter are distributed uniformly on a circle of radius R < 1, that is,

H(z^−1) = Σ_{k=1}^{K} c_k / (1 − p_k z^−1) = Σ_{k=1}^{K} c_k / (1 − R e^{j2πk/K} z^−1).    (22)

After some algebra, it is relatively straightforward to demonstrate that the transfer function of the parallel filter with such a pole distribution is actually equivalent to a comb filter and a (K − 1)th-order FIR filter in series:

H(z^−1) = [1 / (1 − R^K z^−K)] Σ_{k=0}^{K−1} b_k z^−k,    (23)

where the FIR coefficients b_k arise as linear combinations of the parallel filter weights c_k. In practice, R^K ≪ 1; therefore, the transfer function can be approximated by the FIR part only:

H(z^−1) ≈ Σ_{k=0}^{K−1} b_k z^−k.    (24)

When designing the parallel filter according to Eq. (11), the mean-squared error between the target impulse response h_t(n) and the filter response h(n) is minimized. This error is minimal if the parallel filter coefficients c_k are set in such a way that the equivalent FIR coefficients b_k are equal to the first K samples of the target impulse response, b_k = h_t(k). As a result, the filter impulse response h(n) is the truncated version of the target response h_t(n), which is equivalent to multiplying the target response by a rectangular window w(n) of length K:

h(n) = w(n) h_t(n),    (25)

where

w(n) = 1 for 0 ≤ n ≤ K − 1,   w(n) = 0 elsewhere.    (26)

Note that this window is defined only for positive times n ≥ 0 (it is a half window), in contrast to the symmetric windows used in complex smoothing (see Section 1). Since for causal impulse responses h(n) = 0 for n < 0 anyway, we may think of symmetrically extending the window w(n) to negative times by setting w(−n) = w(n), without influencing the product h(n) = h_t(n)w(n).
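The truncation result of Eqs. (22)–(26) is easy to verify numerically. A self-contained check (our construction; K, R, and the random complex target are arbitrary choices) fits the complex first-order form of Eq. (4) with uniformly distributed poles and compares the result to the rectangularly windowed target:

```python
import numpy as np

# Numerical check of Section 4.1: K poles uniformly distributed on a circle
# of radius R < 1.  The LS fit (Eq. (11)) should reproduce the first K target
# samples and suppress the rest, i.e., act as the rectangular window of
# Eqs. (25)-(26).
K, N, R = 32, 96, 0.8
rng = np.random.default_rng(1)
h_t = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # arbitrary target

p = R * np.exp(2j * np.pi * np.arange(K) / K)   # uniform pole distribution
n = np.arange(N)
M = p[None, :] ** n[:, None]                    # modeling signals p_k^n
c, *_ = np.linalg.lstsq(M, h_t, rcond=None)     # LS design, Eq. (11)
h = M @ c                                       # resulting impulse response

w = (n < K).astype(float)                       # rectangular window, Eq. (26)
assert np.max(np.abs(h - w * h_t)) < 1e-2 * np.max(np.abs(h_t))
```

Here R^K ≈ 8·10^−4, so the comb-filter term of Eq. (23) is negligible and the fit agrees with the truncated target to roughly 0.1% of its peak magnitude.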

This has the advantage that now the results will be directly comparable with those of complex smoothing. Accordingly, designing a parallel filter with a uniform pole distribution is equivalent to multiplying the target impulse response by a symmetric rectangular window of total length 2K − 1. In the frequency domain, this corresponds to convolving the target transfer function H_t(ϑ) with a sinc-like (periodic sinc) function:

H(ϑ) = H_t(ϑ) ∗ [sin((2K − 1)ϑ/2) / sin(ϑ/2)],    (27)

which is clearly a form of transfer function smoothing. Actually, it corresponds to "filtering" the transfer function with an ideal lowpass filter. It can also be seen that the smoothing function does not depend on frequency, as expected from a uniform pole distribution. Note that the main lobe of the sinc-like function in Eq. (27) approximates a Hann window quite closely. The width of the main lobe is 4π/(2K − 1) ≈ 2π/K; therefore, the effect will be similar to smoothing the transfer function with a 2π/K wide symmetric Hann window. Since 2π/K actually equals the distance of the pole frequencies Δϑ, we can conclude that a parallel filter design with Δϑ pole-frequency distance corresponds to smoothing the target response by a Δϑ wide Hann window.

4.2 Stepwise Uniform Pole Distribution

Next, let us consider a more interesting case where the pole density is different in the various regions of the frequency range. Here it is not possible to derive the smoothing function in a closed form, as in Section 4.1. Therefore, a different approach is taken. Since the parameters of the parallel filter are determined by a linear LS design, the superposition principle holds. This means that if we decompose the target response h_t(n) as a sum of some test functions, the filter response h(n) will be equal to the sum of the filter responses designed for the test functions separately. Since we would like to gain some insight into the frequency-dependent nature of smoothing, a natural choice for such a test function is the basis function of the Fourier transform e^{−jϑ₀n}, where ϑ₀ is the angular frequency of the complex exponential. In the frequency domain, this is equivalent to δ(ϑ − ϑ₀), a Dirac delta function at position ϑ₀. Accordingly, in the frequency domain we are computing the "impulse response" of the smoothing operation, that is, we obtain the smoothing function directly.

If the overlap of the basis functions of the parallel filter is not too large, our test function e^{−jϑ₀n} will be approximated by parallel sections whose center frequencies are near ϑ₀, while the contribution of the other sections will be negligible. Therefore, we expect that the width of the smoothing function in the frequency domain and the length of the corresponding window function in the time domain will depend only on the local pole density near ϑ₀. From Section 4.1 we may deduce that if the distance of the poles is Δϑ in some frequency region, the "width" of the smoothing function in that region will be Δϑ, and signals in that frequency range will be windowed to a length 2π/Δϑ (or, the equivalent symmetric time window will have the length 4π/Δϑ − 1).

This is illustrated in Figs. 5 and 6, displaying a parallel filter design with 30 poles (15 pole pairs) around the unit circle. The pole frequencies are chosen in such a way that 20 poles are distributed evenly in the lower half of the frequency range |ϑ| ≤ π/2, while 10 poles are spread in the upper range π/2 < |ϑ| ≤ π (the dotted vertical lines show the pole frequencies in Fig. 6).

Fig. 5. Modeling a complex exponential e^{−jϑ₀n} by a parallel filter having stepwise uniform frequency resolution: (a) ϑ₀ = π/4, (b) ϑ₀ = π/2, and (c) ϑ₀ = 3π/4. The dashed line is the target response e^{−jϑ₀n} and the solid line is the resulting parallel filter impulse response. Only the real parts of the signals are shown.

Fig. 6. Smoothing functions corresponding to the cases of Fig. 5(a)–(c). The dotted vertical lines display the pole frequencies of the parallel filter. The dashed lines show the 1/|ϑ − ϑ₀| envelopes of the smoothing functions.

In Fig. 5 the target impulse responses e^{−jϑ₀n} are displayed by dashed lines and the resulting parallel filter responses by solid lines. Note that the target and filter responses are complex; here only the real parts of the signals are shown, since the imaginary parts would show a similar behavior. Fig. 5(a) shows a case where the frequency of the

BANK

5 DESIGN EXAMPLES AND COMPARISON An example for loudspeaker-room response modeling has already been displayed in Fig. 2. Now we are designing an equalizer for the same system response. The target response is a second-order high-pass filter with a cutoff frequency of 50 Hz, and the frequency resolution of the design is third octave. The system response hs (n) is displayed in Fig. 7(a) thin solid line. Its third-octave complex smoothed version hcs (n) is displayed in Fig. 7(a) thick line, which is used for a 2500 tap FIR equalizer design by a least-squares system identification approach [14]. In this case, the parameters are estimated in such a way that the complex-smoothed system response hcs (n) filtered by the FIR filter best approximates the target response ht (n) in the LS sense. The system response equalized by this FIR filter is shown in Fig. 7(b) thin 46

10 (a) 0 −10 Magnitude [dB]

exponential test function is in the high pole density region, while in (c) the frequency is in the low pole density region of the filter. As expected, the resulting impulse response (solid line) is “windowed” to a longer length in the first case compared to the second case. The theoretically computed half window length 2π/ϑ is 40 and 20 for (a) and (c), which is in agreement with what can be observed in practice. Fig. 5(b) displays an intermediate case when the frequency of the test signal is exactly at the boundary of the two pole density regions. This results in a more mild windowing, where the window length is somewhere in between the (a) and (c) cases. The same phenomenon can also be observed in the frequency domain in Fig. 6, for the same cases. The solid lines display the transfer functions of the resulting filters trying to approximate the test function e− jϑ0 n , again with (a) ϑ0 = 1/4π, (b) ϑ0 = 1/2π, and (c) ϑ0 = 3/4π. Note that the frequency responses were computed by first extending the parallel filter responses to negative times h( − n) = h*(n), to comply with the symmetric windows used in complex smoothing. Accordingly, the frequency responses displayed in Fig. 6 are real (zero phase) functions and can be directly compared to the smoothing functions used in complex smoothing. In Fig. 6 the dotted vertical lines show the pole frequencies of the parallel filter. It can be seen in (a) and (c) that the width of the main lobe equals the pole distance ϑ in that region and so is the periodicity. Locally, the smoothing function has a sinc-like shape, similar to the case of the uniform pole distribution of Section 4.1. The dashed lines show the theoretical 1/|ϑ − ϑ0 | envelopes of the sinc functions. Again, (b) is a borderline case where the envelope still follows that of a regular sinc function, but the periodicity is different at the left and right sides, coming from the different pole densities. 
To sum up, we have seen that the “width” of the frequency-domain smoothing function (and the length of the corresponding time-domain window) depends only on the local pole density around the frequency of interest. This phenomenon can be effectively utilized for obtaining different resolution (variable amount of smoothing) in different frequency regions by setting the pole density appropriately.

PAPERS

(b) −20 −30 (c) −40 −50 (d) −60 −70 2

10

3

10 Frequency [Hz]

10

4

Fig. 7. Loudspeaker-room equalization: (a) the unequalized system response, (b) equalized by a 2500 tap FIR filter estimated using the third-octave complex-smoothed system response, (c) 32nd order WIIR filter estimated using the third-octave complexsmoothed system response, and (d) a 16 section (32nd order) parallel filter designed using the unsmoothed system response. Thick lines show the third-octave smoothed versions, while the dashed line displays the target response.

line (thick line shows the third-octave smoothed version). The FIR filter designed from the smoothed response equalizes the overall response quite nicely while avoids equalizing the narrow peaks and dips that would happen when designing the filter from the unsmoothed system response (direct inversion). However, the computational complexity of the 2500 tap FIR filter is too large compared to the simple task it should accomplish. A straightforward option for decreasing the computational complexity is to estimate a warped IIR filter based on the complex-smoothed system response. Fig. 7(c) displays the loudspeaker-room response equalized by such a WIIR filter having an order of 32 and λ = 0.75 (again, the thick line is its third-octave smoothed version). It can be seen in Fig. 7 that the results are almost the same as for the FIR filter, except at low frequencies. Note that the case of strictly logarithmic frequency resolution coming from third-octave smoothing is a relatively easy task for the WIIR filter, since it is quite close to its natural frequency resolution. For arbitrary frequency resolution profiles (such as having higher resolution at low frequencies compared to the high ones), the performance of the WIIR filter would be less satisfactory. Fig. 7(d) shows the loudspeaker-room response equalized by a 16 section parallel filter, having the same total filter order as the WIIR filter of (c). The parallel filter was designed by the direct method presented in Section 3.3, by using the unsmoothed system response (we note again that the FIR and WIIR examples were designed using the complex-smoothed system response). To achieve the required third-octave resolution, three poles are placed in each two octaves (the pole density is 3/2 pole/octave). The resulting equalizer performance is at least as good as for the FIR and IIR case, while the design is much simplified, since J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February

PAPERS

AUDIO EQUALIZATION WITH PARALLEL FILTERS

6 DISCUSSION (a) 0 (b) Magnitude [dB]

−20 (c) −40 (d) −60 (e) −80

−100 10

2

3

10 Frequency [Hz]

4

10

Fig. 8. Illustrating the possibilities using various pole distributions in the parallel filter: (a) the unequalized loudspeaker-room response, (b) equalized by a parallel filter having double pole density below 320 Hz than above, (c) equalized by a parallel filter having poles only below 500 Hz. The transfer functions of the equalizers for the cases (b) and (c) are displayed in (d) and (e), respectively. The dotted vertical lines display the pole frequencies of the corresponding equalizers. In (a)–(c) the dashed lines show the target response, and the thick lines display the third-octave smoothed versions of the transfer functions.

there is no need for smoothing the target response. This is because the smoothing is done “automatically,” as already discussed in Section. 4. In addition, the procedure always results in a stable filter and is equally efficient at arbitrary (i.e., not strictly logarithmic) frequency resolution, on the contrary to the WIIR design. Fig. 8 shows two additional examples illustrating some of the capabilities of the parallel filter using arbitrary pole distributions. It is often desired to equalize the low frequencies of the loudspeaker-room response more precisely compared to the high ones. This is now achieved by a parallel filter having three poles per octave below 320 Hz (corresponding to sixth-octave resolution), and 3/2 pole per octave above (third-octave resolution). The equalized response is displayed in Fig. 8(b), while the transfer function of the equalizer is shown in (d), together with the pole frequencies as vertical dotted lines. One can see that the 22 section (44th order) parallel filter achieves quite a flat frequency response. Fig. 8(c) displays an example when only the problematic low frequency region is equalized by eight second-order sections (filter order is 16). In this case, the eight pole pairs are logarithmically distributed between 20 and 500 Hz, corresponding to a third-octave resolution (see dotted vertical lines in Fig. 8(e)), and a zero-order FIR part (the constant coefficient b0 in Fig. 3) is also utilized. The transfer function of the equalizer is shown in Fig. 8(e). It is important to note that the equalizer is designed from the unsmoothed loudspeaker-room response exactly as in the previous cases, and there is no need to do any additional processing, like flattening the system response above 500 Hz before equalization. J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February

We have seen in Section 4 that designing a parallel filter is equivalent to smoothing the target response by a sinc-like function, and in the examples of Section 5 that the smoothing behavior is also present in direct equalizer design. The width of the main lobe of the sinc function is equal to the pole frequency distance in that region. Moreover, the main lobe of the sinc function is very similar to the Hann window, a window function commonly used in transfer function smoothing. For obtaining a given Δf resolution in some frequency region, the distance of the analog pole frequencies of the parallel filter should be 2Δf, since that corresponds to smoothing by a 2Δf-wide Hann window. On the logarithmic frequency scale, if 1/β-octave resolution is desired, β/2 pole pairs have to be placed in each octave. In practice, fixed-pole parallel filters can be used in two ways in relation to complex smoothing: (1) Design of parallel filters instead of transfer function smoothing: In this case, the system is modeled or equalized by the parallel filter without any frequency-domain processing. Thus, the parallel filter is used both as the final implementation structure and as a means of achieving complex smoothing. Here, the resolution is controlled by the choice of the pole frequencies. The filter is optimal in the sense that the filter order corresponds to the obtained resolution. An additional advantage of this approach is that the parameter estimation is simplified, since there is no need to implement the complex-smoothing operation. (2) Design of parallel filters together with transfer function smoothing: In this case, the parallel filter is designed from the smoothed frequency response. This does not make much sense for complex-smoothed responses, since almost the same results could be achieved without prior smoothing.
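For instance, the 1/β-octave rule above can be turned into a pole-frequency list with a few lines of Python. This is a sketch of the bookkeeping only; the function name and interface are ours, not those of the downloadable parfilt MATLAB code:

```python
import numpy as np

def pole_frequencies(f_lo, f_hi, beta):
    """Logarithmically spaced analog pole frequencies giving roughly
    1/beta-octave resolution, i.e., beta/2 pole pairs per octave."""
    octaves = np.log2(f_hi / f_lo)
    n_poles = int(np.ceil(octaves * beta / 2.0)) + 1
    return f_lo * 2.0 ** np.linspace(0.0, octaves, n_poles)

# third-octave resolution between 20 and 500 Hz -> eight pole pairs,
# matching the low-frequency equalizer example of Fig. 8(c)
freqs = pole_frequencies(20.0, 500.0, 3.0)
```

Mixed resolutions, as in the Fig. 8(b) example, can be obtained by concatenating the lists for the individual frequency regions.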
However, if the system response is smoothed by some nonlinear processing (e.g., eliminating the dips below a certain threshold or by the use of an auditory model), then the parallel filter can be used as an efficient implementation of the smoothed response. In this case, the local pole density of the parallel filter should be set according to the local resolution of the smoothed transfer function.

7 CONCLUSION

Transfer function smoothing is a well-established method for displaying, modeling, and equalizing the frequency responses of audio systems, based on both perceptual and physical considerations. This paper has demonstrated that the frequency response of the parallel filter is similar to the response obtained by performing complex smoothing on the target response. As a result, the parallel filter can either be used as an efficient implementation of already smoothed responses, or it can be designed from the unsmoothed responses directly, eliminating the need for frequency-domain processing, since it performs smoothing "automatically." The obtained frequency resolution is not limited to the logarithmic scale; arbitrary resolution can be achieved


by the suitable choice of pole frequencies. The formulas for computing the pole angles and radii from the analog pole frequencies were also given. The theoretical equivalence of parallel filters and Kautz filters has also been developed, and the formulas for converting between the parameters of the two structures were presented. This implies that the favorable smoothing properties are also possessed by the Kautz filter. In addition, the conversion formulas can be used for obtaining the parameters of the parallel filter from the Kautz parameters, resulting in a design procedure that requires fewer arithmetic operations compared to the straightforward LS design. While only loudspeaker-room equalization examples have been provided, the parallel filter can also be successfully used in other fields where the flexible allocation of frequency resolution is beneficial. For example, this is the case for sound synthesis applications [19,23,24], and it is hoped that other applications will soon follow. The MATLAB code for designing fixed-pole parallel filters can be downloaded from http://www.mit.bme.hu/∼bank/parfilt
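The pole angle/radius formulas referred to above appear in the paper body, which this excerpt omits. Purely as an illustration of the kind of mapping involved, the sketch below uses the common choice of pole angle θ_k = 2πf_k/f_s and sets each radius from the local pole spacing; that radius rule is our assumption for the example, not the paper's formula:

```python
import numpy as np

def digital_poles(freqs, fs):
    """Map analog pole frequencies (Hz) to complex-conjugate z-domain
    pole pairs. Angle follows the frequency; the radius here is chosen
    so that each resonance bandwidth roughly matches the local pole
    spacing (an illustrative choice only)."""
    freqs = np.asarray(freqs, dtype=float)
    df = np.gradient(freqs)                  # local pole spacing, edges replicated
    theta = 2.0 * np.pi * freqs / fs         # pole angles
    r = np.exp(-np.pi * df / fs)             # radii from bandwidth ~ df
    poles = r * np.exp(1j * theta)
    return np.concatenate([poles, np.conj(poles)])

poles = digital_poles([100.0, 200.0, 400.0], 44100.0)
```

All poles land inside the unit circle, so the resulting parallel filter is stable by construction.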

8 ACKNOWLEDGMENT

This work has been supported by the EEA Norway Grants and the Zoltán Magyary Higher Education Foundation.

9 REFERENCES

[1] J. N. Mourjopoulos, P. M. Clarkson, and J. K. Hammond, "A Comparative Study of Least-Squares and Homomorphic Techniques for the Inversion of Mixed Phase Signals," Proc. IEEE Int. Conf. Acoust. Speech and Signal Process., pp. 1858–1861 (May 1982).
[2] M. Karjalainen, E. Piirilä, A. Järvinen, and J. Huopaniemi, "Comparison of Loudspeaker Equalization Methods Based on DSP Techniques," J. Audio Eng. Soc., vol. 47, pp. 14–31 (1999 Jan./Feb.).
[3] G. Ramos and J. J. Lopez, "Filter Design Method for Loudspeaker Equalization Based on IIR Parametric Filters," J. Audio Eng. Soc., vol. 54, pp. 1162–1178 (2006 Dec.).
[4] H. Behrends, A. von dem Knesebeck, W. Bradinal, P. Neumann, and U. Zölzer, "Automatic Equalization Using Parametric IIR Filters," J. Audio Eng. Soc., vol. 59, pp. 102–109 (2011 Mar.).
[5] M. Miyoshi and Y. Kaneda, "Inverse Filtering of Room Acoustics," IEEE Trans. Acoust. Speech Signal Process., vol. 36, no. 2, pp. 145–152 (Feb. 1988).
[6] P. G. Craven and M. A. Gerzon, "Practical Adaptive Room and Loudspeaker Equalizer for Hi-Fi Use," presented at the 92nd Convention of the Audio Engineering Society (1992 Mar.), convention paper 3346.
[7] M. Karjalainen, T. Paatero, J. N. Mourjopoulos, and P. D. Hatziantoniou, "About Room Response Equalization and Dereverberation," Proc. IEEE Workshop Appl. of Signal Process. to Audio and Acoust., pp. 183–186, New Paltz, NY, USA (Oct. 2005).


[8] S. Bharitkar and C. Kyriakakis, "Loudspeaker and Room Response Modeling with Psychoacoustic Warping, Linear Prediction, and Parametric Filters," presented at the 121st Convention of the Audio Engineering Society (2006 Oct.), convention paper 6982.
[9] J. A. Pedersen and K. Thomsen, "Fully Automatic Loudspeaker–Room Adaptation—The RoomPerfect System," Proc. AES 32nd Int. Conf. "DSP for Loudspeakers", pp. 11–20, Hillerød, Denmark (2007 Sept.).
[10] S. Cecchi, L. Palestini, P. Peretti, L. Romoli, F. Piazza, and A. Carini, "Evaluation of a Multipoint Equalization System Based on Impulse Response Prototype Extraction," J. Audio Eng. Soc., vol. 59, pp. 110–123 (2011 Mar.).
[11] A. J. Hill and M. O. J. Hawksford, "Wide-Area Psychoacoustic Correction for Problematic Room-Modes Using Nonlinear Bass Synthesis," J. Audio Eng. Soc., vol. 59, pp. 825–834 (2011 Nov.).
[12] P. D. Hatziantoniou and J. N. Mourjopoulos, "Generalized Fractional-Octave Smoothing for Audio and Acoustic Responses," J. Audio Eng. Soc., vol. 48, pp. 259–280 (2000 Apr.).
[13] M. Karjalainen and T. Paatero, "Frequency-Dependent Signal Windowing," Proc. IEEE Workshop Appl. of Signal Process. to Audio and Acoust., pp. 35–38, New Paltz, NY, USA (Oct. 2001).
[14] J. N. Mourjopoulos and P. D. Hatziantoniou, "Real-Time Room Equalization Based on Complex Smoothing: Robustness Results," presented at the 116th Convention of the Audio Engineering Society (2004 May), convention paper 6070.
[15] M. Waters and M. B. Sandler, "Least Squares IIR Filter Design on a Logarithmic Frequency Scale," Proc. IEEE Int. Symp. on Circuits and Syst., pp. 635–638 (May 1993).
[16] A. Härmä, M. Karjalainen, L. Savioja, V. Välimäki, U. K. Laine, and J. Huopaniemi, "Frequency-Warped Signal Processing for Audio Applications," J. Audio Eng. Soc., vol. 48, pp. 1011–1031 (2000 Nov.).
[17] T. Paatero and M. Karjalainen, "Kautz Filters and Generalized Frequency Resolution: Theory and Audio Applications," J. Audio Eng. Soc., vol. 51, pp. 27–44 (2003 Jan./Feb.).
[18] M. Karjalainen and T. Paatero, "Equalization of Loudspeaker and Room Responses Using Kautz Filters: Direct Least Squares Design," EURASIP J. on Advances in Signal Process., Special Issue on Spatial Sound and Virtual Acoustics, Article ID 60949 (2007), doi:10.1155/2007/60949.
[19] B. Bank, "Direct Design of Parallel Second-Order Filters for Instrument Body Modeling," Proc. Int. Computer Music Conf., pp. 458–465, Copenhagen, Denmark (Aug. 2007). URL: http://www.acoustics.hut.fi/go/icmc07parfilt.
[20] B. Bank, "Perceptually Motivated Audio Equalization Using Fixed-Pole Parallel Second-Order Filters," IEEE Signal Process. Lett., vol. 15, pp. 477–480 (2008).
[21] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, Englewood Cliffs, NJ, USA, 1975).


AUDIO EQUALIZATION WITH PARALLEL FILTERS

[22] B. Bank, “Logarithmic Frequency Scale Parallel Filter Design with Complex and Magnitude-Only Specifications,” IEEE Signal Process. Lett., vol. 18, no. 2, pp. 138–141 (Feb. 2011). [23] B. Bank and M. Karjalainen, “Passive Admittance Matrix Modeling for Guitar Synthesis,” Proc. Conf. on Digital Audio Effects, pp. 3–7, Graz, Austria (Sept. 2010).

[24] B. Bank, S. Zambon, and F. Fontana, "A Modal-Based Real-Time Piano Synthesizer," IEEE Trans. Audio, Speech, and Lang. Process., vol. 18, no. 4, pp. 809–821 (May 2010).
[25] S. P. Lipshitz, T. C. Scott, and J. Vanderkooy, "Increasing the Audio Measurement Capability of FFT Analyzers by Microprocessor Postprocessing," J. Audio Eng. Soc., vol. 33, no. 9, pp. 626–648 (1985 Sept.).

THE AUTHORS

Balázs Bank

Balázs Bank received his M.Sc. and Ph.D. degrees in electrical engineering from the Budapest University of Technology and Economics, Hungary, in 2000 and 2006, respectively. In the academic year 1999/2000 he was with the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland. From 2000 to 2006 he was a Ph.D. student and research assistant at the Department of Measurement and Information Systems, Budapest University of Technology and Economics. In 2001 he visited the Department of Information Engineering, University of Padova. In 2007 he returned to the Acoustics Laboratory of Helsinki University of Technology for a year, with the support of an FP6 Marie Curie EIF individual fellowship. In 2008 he was with the Department of Computer Science, Verona University, Italy. In 2009–2010 he was a postdoctoral researcher at the Budapest University of Technology and Economics, supported by the Norway and EEA Grants and the Zoltán Magyary Higher Education Foundation. Currently he is an associate professor at the Department of Measurement and Information Systems, Budapest University of Technology and Economics. His research interests include physics-based sound synthesis and filter design for audio applications. He is a member of the AES and the IEEE.



Digital Waveguide Synthesis of the Geomungo with a Time-Varying Loss Filter

SEUNGHUN KIM, MOONSEOK KIM, AND WOON SEUNG YEO
([email protected]) ([email protected]) ([email protected])

Graduate School of Culture Technology, KAIST, Daejeon, Korea

This article proposes a commuted waveguide synthesis model for the geomungo, a Korean traditional plucked-string instrument featuring unusually wide vibrato. To model pitch fluctuation as well as the decay characteristics of its harmonic partials, a time-varying loss filter with a sinusoidal loop gain is used. Filter parameters are estimated by approximating the gains, and excitation signals are generated by inverse filtering the original sound. Real-time implementation of the algorithm for the development of the virtual instrument is also discussed.

0 INTRODUCTION

Physical modeling synthesis is based on the use of mathematical models of physical acoustics for computer simulation of sound [24]. By manipulating control parameters, users can modify the physical structure of the sound source (usually a musical instrument) or the performer's interaction with it, thereby avoiding the need to record every possible sound from the source. Since the introduction of the Karplus-Strong (KS) algorithm for plucked string instruments [13], many studies on physical modeling synthesis have been proposed. Jaffe et al. suggested the use of the KS algorithm for the synthesis of stringed instrument tones [11]. Smith proposed a digital waveguide theory that simulates traveling waves with digital delay lines [28][29][31]. Based on the assumption that the commutativity of the linear time-invariant (LTI) system is valid, damping and dispersion are also lumped at specific points to reduce the computational cost and allow real-time sound synthesis. Karjalainen et al. showed that bi-directional digital waveguides can be reduced to a single-delay loop (SDL), a generalized form of the KS algorithm [12]. Here, the overall transfer function from the input (the excitation signal in the digital waveguide model) to the output at the bridge is obtained from the partial transfer functions. Since this results in several filters forming a cascade structure, the SDL model is conceptually simpler and easier to implement. Accordingly, a plucked string instrument model contains three components (excitation signal, string, and instrument body), each of which can be considered a filter. However, the order of the filter for modeling the body is usually very high and incurs too much computational cost for real-time sound synthesis.

To resolve this, Smith proposed commuted waveguide synthesis [30]: rather than forming a distinct filter, the body part is combined with the original excitation signal as a new input, which is stored in the excitation table and used to generate sound through the string model. Excitation samples can be extracted from digital recordings of the instrument by inverse filtering of the string model, which is a cascade of several filters. Parameters of the model can be estimated by analyzing the decays of harmonics from recorded samples. Examples of instrument sound synthesis based on digital waveguide theory include wind [14][25][26], bowed string [27][37], and plucked string instruments. Välimäki et al. presented a real-time digital waveguide model for several plucked string instruments, including the guitar, banjo, mandolin, and kantele [33]. They adopted the commuted waveguide synthesis model and Lagrange interpolation for fractional delay, and proposed a real-time synthesis model on a digital signal processor. This was followed by an improved model to implement a guitar synthesizer [36]. Synthesis models for the piano [2][32] and harpsichord [35] have also been investigated. In addition, the spring reverberation effect has been simulated based on filter components similar to those used in digital waveguide theory [34]. While the majority of digital waveguide synthesis examples feature western musical instruments, some studies have dealt with old Asian plucked-string instruments. Erkut et al. proposed a synthesis model for the ud and the Renaissance lute, which originated in the Middle East [8], and implemented the glissando (a glide from one pitch to another) on fretted/fretless instruments. A model-based sound synthesis algorithm for the guqin, a Chinese plucked string instrument, was proposed by Penttinen et al. based on the commuted waveguide method [23]. The whole synthesis


system includes a body model filter, a ripple filter for flageolet tones, multiple SDL string models for inharmonic partials (called phantom partials), and a friction model. Recently, model-based sound synthesis of the dan tranh, a Vietnamese plucked string instrument, has been reported [5]. Here, estimated loop gain values and loop coefficient values of the dan tranh were compared with those of the gayageum [4], a Korean traditional plucked string instrument with 12 strings. The fundamental sound production mechanism of these instruments, however, is basically similar to that of their western counterparts and does not properly reflect either the differences in the physical structure of the instrument (or materials used) or the playing techniques, which can play a crucial role in generating distinctive tones. This paper deals with the digital waveguide synthesis of the geomungo, a Korean traditional plucked string instrument. Due to vigorous playing techniques, extreme vibrato (usually more than 100 cents) with a noticeably fluctuating decay of the harmonics is characteristic of typical geomungo sound. While Mellody et al. found a similar amplitude variation in their analysis of the vibrato tones of the violin [20], research on digital waveguide synthesis of (not only Asian, but also western) stringed instruments concerning this phenomenon is rare. When modeling vibrato, most of the previous examples considered the fundamental frequency as the main control parameter but did not pay attention to fluctuations in the amplitudes of the partials, thereby failing to resynthesize the target vibrato tone properly. In this work we propose a new commuted waveguide synthesis algorithm that can handle changes in the decay of vibrato tones. Instead of a constant value, the algorithm uses a sinusoidal function to control its loop gain and hence models the time-varying decay pattern of the vibrato partials more naturally.
1 THE GEOMUNGO

Known as a remolded guqin instrument imported from China before the fifth century, the geomungo is one of three major traditional string instruments from Silla, an ancient Korean kingdom. The name came from geomun (meaning "black") and go (meaning "zither"). The pitch range of the geomungo is approximately three octaves, the widest of all the traditional Korean musical instruments, and its sound is generally described as low and resonant. Although the instrument was usually played solo by scholars before the modern era, it is also widely used nowadays to play bass in Buddhist music and modern court music. Since the 1980s, new songs have been composed for the geomungo, and consequently the playing technique and structure of the instrument have been improved. Composer and geomungo virtuoso Jin Hi Kim introduced the instrument to the world and developed an electric geomungo [15].

1.1 Structure

Fig. 1 shows a picture of the geomungo and the names of its strings.

Fig. 1. The geomungo [10][21].

The dimensions of the instrument are approximately 160 cm long, 22 cm wide, and 10 cm tall. The body is hollow inside to amplify the vibration from the strings. The curved top of the body is made of paulownia wood and the back is crafted from chestnut wood. Six strings made of twisted silk are fastened to the body. The names of the strings are moon-hyun, yu-hyun, dae-hyun, gwae-sang-cheong, gwae-ha-cheong, and moo-hyun (hyun and cheong mean "string" and "sound," respectively). Yu-hyun, dae-hyun, and gwae-sang-cheong are stretched over sixteen frets called gwae, while each of the rest is supported by a movable bridge called the anjok (meaning "a seagull's foot"). The geomungo does not have standard frequencies for tuning, but its six strings are typically tuned to E2 (moon-hyun), A2 (yu-hyun), D2 (dae-hyun), B1 (gwae-sang-cheong), B1 (gwae-ha-cheong), and B2 (moo-hyun). Yu-hyun is the thinnest string, with clear and crisp sonic characteristics, while the thickest, dae-hyun, produces a low and rich sound. These two strings are used most frequently in playing and cover a wide range of pitch.

1.2 Techniques

To play the geomungo, the performer places his/her left foot under the right thigh and puts the instrument on it to prop up one edge. The string is then plucked with his/her right hand, both downward and upward, using a 20-cm-long plectrum (called a sooldae), while controlling the tension of the string with his/her left hand (Fig. 2). There are three plucking techniques (dae-jeom, joong-jeom, and so-jeom, where jeom means "point" and dae, joong, and so mean "large," "medium," and "small," respectively) [18] as well as four left-hand techniques (nong-hyun, choo-sung, toh-sung, and jeon-sung) [16][17], many of which cause vibrato. Tables 1 and 2 summarize the techniques.



Table 1. Right-hand plucking techniques of the geomungo.

dae-jeom: To hit the string very strongly, thereby including the sound of the body being hit
joong-jeom: To hang the plectrum on the string and then pluck the string
so-jeom: To pluck the string weakly

Fig. 2. Picture of typical geomungo playing.

2 ANALYSIS

The sounds of the geomungo were recorded for analysis in an anechoic chamber. A microphone (Rode Classic II) was placed toward the back of the instrument body, 50 cm away. Sounds generated with different playing techniques, e.g., nong-hyun, choo-sung, and jeon-sung, and/or plucking styles were recorded at 12 positions on yu-hyun and dae-hyun. Plectrum-hit body sounds were also recorded for the simulation of the dae-jeom technique. Fig. 3 shows the waveforms of three geomungo tones from the same string (yu-hyun, pressed on the fourth fret). Fig. 3a is a tone without any left-hand techniques, while Figs. 3b and 3c are sounds of the nong-hyun technique, with the latter featuring greater intensity and faster vibrato.

Fig. 3. Waveforms of recorded geomungo tones of yu-hyun (pressed on the fourth fret): (a) with no left-hand technique, (b) nong-hyun, (c) nong-hyun with greater intensity and faster vibrato.

Table 2. Playing techniques with the left hand of the geomungo.

choo-sung: To press on the string to produce a slightly higher pitch
toh-sung: To pull the string down
jeon-sung: To vibrate the plectrum around one position on the string
nong-hyun: Similar to vibrato, including jeon-sung and toh-sung

2.1 Frequency Response

Changes in the frequency response of vibrato sounds along time were measured by the short-time Fourier transform (STFT) [1], which is defined as

Y_m(k) = Σ_{n=0}^{N−1} w(n) y(n + mH) e^{−2πjkn/N},   m = 0, 1, 2, ...,   k = 0, 1, 2, ..., N − 1,

where N is the FFT size, w(n) is a window function, and H is the hop size (the number of samples between successive frames). Fig. 4 shows the STFT results of the geomungo tones in Fig. 3. Pitch modulations are apparent in Figs. 4b and 4c. In addition, the seventh and higher partials are negligible in loudness, i.e., about 40 dB less than the fundamental, and decay instantly. This phenomenon is common to all the geomungo tones we recorded and is considered for the development of the synthesis algorithm (as discussed later).

Fig. 4. Short-time Fourier transform (STFT) results of the geomungo tones in Fig. 3. Here N is the next power of 2 greater than the length of each tone, and H is half the window size (2048 samples). A Blackman-Harris window is used for analysis. Compared with (a), pitch modulations are apparent in (b) and (c).

2.2 Fundamental Frequency Curve

Information on the fundamental frequency of a tone is required for its digital waveguide synthesis. In the case of a vibrato tone, the fundamental frequency can be considered a function of time (hence denoted f0(n)). To estimate the fundamental frequency, we used the YIN algorithm developed by de Cheveigné et al. [7], based on the cumulative mean normalized difference function

d′_t(τ) = 1,   if τ = 0,
d′_t(τ) = d_t(τ) / [(1/τ) Σ_{j=1}^{τ} d_t(j)],   otherwise,

where

d_t(τ) = Σ_{j=t}^{t+W−1} (x_j − x_{j+τ})².

This allowed us to avoid selecting the zero lag. The fundamental frequency at time t was then estimated as f_0(t) = f_s/τ̂(t), where τ̂(t) is the lag minimizing d′_t(τ) and f_s is the sampling rate.

Fig. 5. Fundamental frequency curves of the geomungo tones in Fig. 3, based on the cumulative mean normalized difference function. Compared with (a), (b) and (c) show pitch modulations of nearly 10 to 20 Hz, a unique feature of vibrato tones of the geomungo.

Fig. 5 shows the fundamental frequency curves of the same geomungo tones as in Fig. 3. Compared with Fig. 5a, Figs. 5b and 5c show larger pitch modulations, reaching up to nearly 10 Hz, i.e., 100 cents. This feature is distinguishable from the case of the guitar [9], whose vibrato does not usually exceed 1 Hz.
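A minimal sketch of this estimate in Python follows: the difference function d_t(τ), its cumulative-mean normalization d′_t(τ), and f0 = fs/τ̂. The 0.1 absolute threshold and the descent to the following local minimum are standard YIN details assumed here; the text above does not spell them out:

```python
import numpy as np

def yin_f0(x, fs, tau_max):
    """Single-frame YIN-style f0 estimate (sketch)."""
    W = len(x) - tau_max                      # analysis window length
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = x[:W] - x[tau:tau + W]
        d[tau] = np.dot(diff, diff)           # difference function d_t(tau)
    dprime = np.ones(tau_max + 1)             # d'(0) = 1 by definition
    dprime[1:] = d[1:] * np.arange(1, tau_max + 1) / (np.cumsum(d[1:]) + 1e-12)
    tau_hat = int(np.argmin(dprime[1:])) + 1  # fallback: global minimum, zero lag excluded
    for tau in range(1, tau_max + 1):
        if dprime[tau] < 0.1:                 # first dip below the absolute threshold...
            while tau + 1 <= tau_max and dprime[tau + 1] < dprime[tau]:
                tau += 1                      # ...then descend to its local minimum
            tau_hat = tau
            break
    return fs / tau_hat

fs = 8000
t = np.arange(2048) / fs
f0 = yin_f0(np.sin(2 * np.pi * 160.0 * t), fs, tau_max=400)
```

Applying this frame by frame over the recording yields the f0(n) curves of Fig. 5.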



Fig. 6. Synthesis model based on commuted waveguide synthesis, which consists of a delay line (of length L(n)), a loss filter, a fractional delay filter, and an optional body-hit sound sample for simulation of the dae-jeom playing technique. Each component of the system is time-variant for implementation of pitch modulation.

Fig. 7. Loss filter model. It is an IIR low-pass filter for simulating the damping effect.

3 SYNTHESIS MODEL

We propose a new synthesis model for the geomungo, which is based on commuted waveguide synthesis [30]. For pitch modulation, all filters used in this model are time-variant. Fig. 6 shows the flow diagram of the system. The output of the system Y(z) to the input X(z) is

Y(z) = Y(z)S(z) + X(z),

where S(z) is the filter corresponding to the string model of the geomungo,

S(z) = z^{−L(n)} H(z) F(z).

Here, z^{−L(n)} is the delay line of length L at time index n, H(z) is the loss filter, and F(z) is the fractional delay filter in Fig. 6. The overall transfer function T(z) is then

T(z) = Y(z)/X(z) = 1/(1 − S(z)).

3.1 Loss Filter

Fig. 7 shows the structure of the time-varying loss filter, a simplified IIR low-pass filter designed to simulate the damping effect, whose transfer function is

H(z) = g(n) (1 + a)/(1 + a z^{−1}),    (1)

where g(n) is the overall loop gain at 0 Hz and a is the feedback gain that determines the cutoff frequency of the filter. Since changes in the feedback gain result in transient effects that create audible clicks [9], only the loop gain g(n) is time-variant.

3.2 Delay Line and Fractional Delay Filter

The length of the delay line L(n) is determined by the fundamental frequency f0(n), i.e.,

L(n) = floor(fs / f0(n)).

For variable fractional delay, a Lagrange interpolation FIR filter is used. The equation of the filter is given as

F(z) = Σ_{i=0}^{N} h_i z^{−i},

where the filter coefficients h_n for the desired phase delay D are

h_n = Π_{k=0, k≠n}^{N} (D − k)/(n − k).

For simulation of vibrato, D is also defined as a time-varying value D(n) such that

D(n) = fs / f0(n) − floor(fs / f0(n)).

Due to their flat group delay and flat magnitude response, odd-order interpolators (typically third or fifth order) are preferred [22]. We used a third-order one, whose structure is shown in Fig. 8.

Fig. 8. Fractional delay filter model based on a Lagrange interpolation FIR filter.

To preserve energy when the pitch is altered, a gain controller is added at the end [22]. Its gain factor gc is

gc = √(1 − Δx) ≈ 1 − Δx/2 for small Δx (Δx = x_n − x_{n−1}),

where x_n is the position in the string at time index n (fs/f0(n) − D(n)) and Δx is the change of the delay line length.

3.3 Body Hit Sound

Samples of the body hit sounds can be optionally mixed into the output for the synthesis of sounds played with the dae-jeom technique (explained in Table 1).

Fig. 9. Comparison of the envelopes of the second partial of the geomungo sound in Fig. 3c. The solid curve shows the maxima of the bandpassed signal, while the dashed curve is the result estimated by STFT. Not only the curves but also their linear regressions are highly similar to each other.
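The string loop of Fig. 6 can be sketched in a few dozen lines of Python. This is a simplified, fixed-pitch version (constant L and D, no gain controller, illustrative parameter values rather than the paper's calibrated ones); with vibrato, L(n), D(n), and g(n) would be updated per sample from f0(n) as described above:

```python
import numpy as np

def lagrange_coeffs(D, N=3):
    """Lagrange FIR coefficients h_n = prod_{k != n} (D - k) / (n - k)."""
    h = np.ones(N + 1)
    for n in range(N + 1):
        for k in range(N + 1):
            if k != n:
                h[n] *= (D - k) / (n - k)
    return h

def synthesize(excitation, f0, fs, g=0.996, a=-0.1, n_out=None):
    """Single-delay-loop sketch: delay line + one-pole loss filter
    H(z) = g(1+a)/(1+a z^-1) + third-order Lagrange fractional delay."""
    L = int(fs / f0) - 1                 # integer part of the loop delay
    D = fs / f0 - (L + 1) + 1            # fractional delay centered in [1, 2)
    h = lagrange_coeffs(D)
    n_out = n_out if n_out is not None else int(fs)
    dline = np.zeros(L + len(h))         # dline[i] holds y[n - 1 - i]
    v = 0.0                              # loss filter state
    y = np.zeros(n_out)
    for n in range(n_out):
        x = excitation[n] if n < len(excitation) else 0.0
        # fractional-delay tap: approximates y[n - L - D]
        frac = sum(h[k] * dline[L - 1 + k] for k in range(len(h)))
        # one-pole loss filter: v[n] = g(1+a) u[n] - a v[n-1]
        v = g * (1 + a) * frac - a * v
        y[n] = x + v
        dline = np.roll(dline, 1)
        dline[0] = y[n]
    return y

# short pluck for illustration
y_demo = synthesize(np.ones(1), 160.0, 22050.0, n_out=300)
```

Since g < 1 and the low-pass and interpolator stages only attenuate, the loop is stable and the tone decays, as in the recorded-sound analyses.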

Fig. 10. Envelope curves (solid) and their linear approximations (dotted, dash-dotted, and dashed) of the six lowest partials of the sound in Fig. 4.

4 CALIBRATION

We estimated system parameters and extracted the excitation signals for the synthesis model. Based on the estimated fundamental frequency, loop gains at harmonic frequencies were determined to approximate the coefficients of the loop filter. The input signal was then extracted by an inverse loop filter.

4.1 Estimation of Frequency-Dependent Loss

For the design of the loss filter, frequency-dependent loss characteristics should be investigated. Nelson et al. suggested a bandpass-filter-based method for the estimation of frequency-dependent loss [19], which extracts each harmonic partial from the source using a bandpass filter and calculates local maxima to acquire the amplitude envelope. Compared with this, however, an STFT-based approach is simpler and easier to implement, while providing a comparable result (Fig. 9).
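As a rough illustration of such an STFT-based envelope measurement, the sketch below reads the dB envelope of one partial off the nearest FFT bin. The window, FFT size, and hop are illustrative (np.blackman stands in for the Blackman-Harris window mentioned with Fig. 4):

```python
import numpy as np

def stft_mag(y, n_fft, hop, window):
    """Frame-wise magnitude spectra: |sum_n w(n) y(n+mH) e^{-2*pi*j*k*n/N}|."""
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[m * hop : m * hop + n_fft] * window
                       for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def partial_envelope_db(y, fs, f_partial, n_fft=4096, hop=2048):
    """dB envelope of one harmonic partial from the nearest FFT bin."""
    w = np.blackman(n_fft)                 # stand-in for Blackman-Harris
    mag = stft_mag(y, n_fft, hop, w)
    k = int(round(f_partial * n_fft / fs))
    return 20.0 * np.log10(mag[:, k] + 1e-12)

# toy example: a decaying 160 Hz partial
fs = 44100
t = np.arange(fs) / fs
tone = np.exp(-3.0 * t) * np.sin(2 * np.pi * 160.0 * t)
env = partial_envelope_db(tone, fs, 160.0)
```

For a real vibrato tone, the bin index would have to track f0(n); the fixed-bin version suffices for the near-constant-pitch case.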

To determine the loop gains of the loss filter at harmonic frequencies, the envelope curve of each harmonic partial needs to be approximated. We used linear regression on the STFT results of the geomungo tones in Fig. 4 for approximation as a linear function on the decibel scale (Fig. 10). Since the seventh and higher partials are negligible (as mentioned in Section 2.1), only the six lowest partials were considered for the loss filter design. The loop gain at each harmonic frequency can be obtained from the slope of its linear approximation, as in

G_k = 10^{β_k L / (20H)},   k = 1, 2, ..., 6,

where β_k is the slope of the approximation of the kth partial, H is the STFT hop size used in the analysis, and L is fs/f0.
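A minimal sketch of this slope-to-gain conversion (our own helper; env_db would come from the STFT envelope analysis of Section 4.1, with β_k in dB per frame):

```python
import numpy as np

def loop_gains(env_db, hop, fs, f0):
    """Per-period loop gains G_k = 10**(beta_k * L / (20 * H)), where
    beta_k is the linear-regression slope of partial k's dB envelope,
    H the hop size, and L = fs/f0 the loop length in samples."""
    L = fs / f0
    n_frames, n_partials = env_db.shape
    t = np.arange(n_frames)
    betas = np.array([np.polyfit(t, env_db[:, k], 1)[0]
                      for k in range(n_partials)])
    return 10.0 ** (betas * L / (20.0 * hop))

# synthetic check: partial k decays at 0.5*(k+1) dB per frame
t = np.arange(30)
env_db = np.stack([-0.5 * (k + 1) * t for k in range(6)], axis=1)
Gk = loop_gains(env_db, hop=2048, fs=44100, f0=441.0)
```

Faster-decaying (higher) partials map to smaller loop gains, which is what the low-pass loss filter of Eq. (1) is then fitted to.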




4.2 Design of Loss Filter We estimate the filter coefficients by using the weighted least-squares method, in which the solution minimizes the total error. E=

N 

W (G k )[|H (wk )| − G k ] , 2

W (G k ) =

0.6

0.8 Time (s)

1

1.2

1.4

0.2

0.1

0

−0.1

k=1

where N is the maximum order of harmonics used to design the filter (six in this case), W(Gk ) is an error weighting function, H is the transfer function of the loss filter defined in Eq. (1), and wk is the kth harmonic frequency. Here W(Gk ) can be defined as

0.4

Fig. 12. Excitation signal extracted from the geomungo tone of Fig. 3a, which features a constant pitch.

Amplitude

Fig. 11. Magnitude response of the designed loss filters and the linearly approximated loop gains at harmonic frequencies (Gk , stems) for the geomungo tone in Fig. 3a.

0.2

−0.2

0.2

0.4

0.6

0.8

1

1.2

1.4

Time (s)

Fig. 13. Excitation signal extracted from the geomungo tone of Fig. 3b, which is the target vibrato tone.

1 1 − Gk 0.2

0.1 Amplitude

to provide more weighting to harmonics with higher gain. In Eq. (1), g(n) determines the overall decay rate of the one-pole filter and is generally selected as G1 , i.e., the loop gain value at the fundamental frequency. Then a is chosen to minimize the total error E. Fig. 11 shows the magnitude response of the loss filter and the values of Gk .

0

−0.1

4.3 Inverse Filtering An excitation signal can be extracted from the recorded geomungo sound using an inverse filter and then used as an input to the synthesis model. In this case, the loop gain coefficient (g) was assumed to be a constant in the inverse filter. Two different methods were used to extract excitation signals for the vibrato tones, as shown in Figs. 3b and 3c. The first method was to extract the residual signal from a geomungo sound that was recorded at the same fret position without vibrato. Fig. 12 is the excitation signal inverse filtered from Fig. 3a. The second method, on the other hand, was to extract the residual signal from the target signal itself. Since the frequency of the geomungo sound varies over time, harmonic partials of the vibrato tone after plucking cannot be inverse filtered. Therefore, the length of the residual signal should be shortened by deamplifyng and removing the vibrating sound after the string is plucked. Fig. 13 shows the residual signal generated by this method. 56


Fig. 14. Synthesized geomungo tone from the recorded sound in Fig. 3a.

5 SYNTHESIS

5.1 Constant Pitch

When there is little vibrato in the original tone (as in Fig. 5a), the loop gain coefficient of the loss filter and the length of the delay line for synthesis are fixed. Fig. 14 shows the resynthesized geomungo tone from Fig. 3a, and Fig. 15 depicts the STFT result of the same sound; harmonic partials feature decay characteristics similar to those of the original sound. Fig. 16 is the waveform of a dae-jeom sound, which includes the body-hit sound sample.

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February
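A minimal constant-pitch string loop in this spirit can be sketched in Python. The one-pole loss filter form, sample rate, pitch, and coefficient values are illustrative assumptions; fractional-delay tuning and the body-hit sample of the dae-jeom simulation are omitted.

```python
import random

def synthesize(excitation, n_samples, fs=44100.0, f0=110.0, g=0.996, a=-0.1):
    # String loop: integer delay line of length round(fs/f0) feeding a
    # one-pole loss filter y(n) = g*(1 + a)*x(n) - a*y(n - 1), summed with
    # the excitation and fed back into the delay line.
    L = int(round(fs / f0))
    out = [0.0] * n_samples
    lp = 0.0                              # loss-filter state
    for n in range(n_samples):
        fb = out[n - L] if n >= L else 0.0
        lp = g * (1 + a) * fb - a * lp
        out[n] = (excitation[n] if n < len(excitation) else 0.0) + lp
    return out

random.seed(1)
burst = [random.uniform(-1, 1) for _ in range(400)]   # noise-burst excitation
tone = synthesize(burst, 44100)
```

Replacing the noise burst with an excitation extracted by inverse filtering (Section 4.3) is what makes the method "commuted": body and excitation effects are folded into the input signal.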

PAPERS

SPECTRAL DELAY FILTERS


Fig. 18. Envelope curves of the synthesized geomungo tone without gain control. Unlike the fluctuation in the original sound (Fig. 10c), lower partials of the synthesized tone feature rather monotonous decay.

Fig. 19. Envelope curve of the lowest partial in Fig. 10c. Points marked with the circles are peaks used to approximate the oscillation as a sinusoid.

Fig. 16. Another synthesized geomungo tone from the recorded sound in Fig. 3a, but with the body-hit sound for simulation of the dae-jeom technique.

Fig. 15. STFT analysis of the synthesized geomungo tone in Fig. 14.

5.2 Vibrato

Fig. 17 shows a synthesized tone from Fig. 3c with a constant loop gain in the loss filter. Only the delay line length and the fractional delay filter are changed over time, based on its fundamental frequency curve (Fig. 5b). Envelope curves of its partials are shown in Fig. 18. Compared with those of the original sound in Fig. 10c, which decrease with noticeable fluctuations, lower partials of the synthesized tone decay almost exponentially. To address this problem, a time-varying loop gain is used in the loss filter. First, we assume that the envelope curve of the first partial of the original sound in Fig. 10c can be modeled as a combination of a first-order function and a simple sinusoidal function, e.g., C0 t + C1 cos(2πft + θ). The loop gain g(t), a continuous version of g(n), is then considered as the derivative of the envelope function and can be formulated as

g(t) = g0 − A sin(2πft + θ),

Fig. 17. Synthesized vibrato tone without time-varying gain control. Only the delay line length and the fractional delay filter are changed based on the fluctuation in the fundamental frequency.

where g0 is the constant loop gain value, and A, f, and θ are the amplitude, the frequency, and the initial phase of the sinusoid, respectively. To resynthesize a geomungo sound with vibrato, these parameters should be estimated from the first partial of the target sound, as illustrated in Fig. 19. In addition to g0, which was already discussed in Section 4.2, f and θ are obtained from the locations of the peaks. A can be determined by measuring the rate of change of the magnitude from a peak to the next valley, i.e., A = 10^(M/(20 f0)) − g0, where M is the mean of the peak-to-peak magnitudes of the envelope curve. Table 3 summarizes the parameter values estimated from the analysis of the geomungo sounds in Figs. 3b and 3c. Fig. 20 shows that g(t) estimated from the real sound in Fig. 3c matches the sequence of the instantaneous rates

KIM ET AL.


Table 3. Parameter values estimated from the original geomungo sounds in Figs. 3b and 3c.

Sound     g0     A      f (Hz)   θ
Fig. 3b   0.99   0.03   3.7      0.18π
Fig. 3c   0.99   0.1    4.17     0.2π
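Using the Fig. 3b parameter values above (g0 = 0.99, A = 0.03, f = 3.7 Hz, θ = 0.18π), the sinusoidal loop-gain control g(t) = g0 − A sin(2πft + θ) can be sketched as below. The loop itself is reduced to a constant-gain Karplus-Strong recursion (no one-pole smoothing, no fractional delay), and the sample rate, pitch, and excitation are illustrative assumptions.

```python
import math, random

def loop_gain(n, fs, g0, A, f, theta):
    # g(t) = g0 - A*sin(2*pi*f*t + theta), evaluated at sample n
    return g0 - A * math.sin(2 * math.pi * f * n / fs + theta)

def synth_vibrato(exc, n_samples, fs, f0, g0, A, f, theta):
    # String loop with a time-varying loop gain; the delay-line length is
    # kept fixed here, so only the amplitude fluctuation is modeled.
    L = int(round(fs / f0))
    out = [0.0] * n_samples
    for n in range(n_samples):
        g = loop_gain(n, fs, g0, A, f, theta)
        fb = g * out[n - L] if n >= L else 0.0
        out[n] = (exc[n] if n < len(exc) else 0.0) + fb
    return out

random.seed(2)
fs = 44100.0
exc = [random.uniform(-1, 1) for _ in range(300)]
tone = synth_vibrato(exc, 44100, fs, 130.0, 0.99, 0.03, 3.7, 0.18 * math.pi)
```

Note that g(t) momentarily exceeds 1 during part of each vibrato cycle, but the tone still decays on average because the geometric mean of g(t) stays below 1; this is the smoothness advantage the text attributes to the sinusoid over the raw measured gain sequence.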


Fig. 20. Comparison of the sinusoidal loop gain g(t) (curve) with the sequence of loop gain values of the lowest partial (stems).

Fig. 22. Envelope curves of the partials in the synthesized geomungo tone in Fig. 21b.


Fig. 23. Envelope curve of the first harmonic of the sound in Fig. 21b.


Fig. 21. Synthesized geomungo tones from (a) Fig. 3b and (b) Fig. 3c with sinusoidal loop gain and the excitation signal in Fig. 13.

of change of the original partial envelope in Fig. 19 quite closely. Note that use of this sequence as the loop gain instead of g(t) may cause the loop filter to become unstable and fail to simulate the fluctuations of higher partials, which should be avoided. Finally, the waveforms of the synthesized geomungo tones are shown in Fig. 21. Figs. 21a and 21b are the results originating from the geomungo tones in Figs. 3b and 3c, respectively. Fig. 22 depicts the envelope curves of the partials in the synthesized results, which are comparable to the behavior of the original geomungo tones depicted in Figs. 10b and 10c. For a closer look, Fig. 23 shows the envelope curve of the lowest partial of the synthesized result in Fig. 21b, which may be compared with the result in Fig. 19.

5.3 Evaluation

A listening test was conducted to evaluate the quality of the synthesis results; we resynthesized geomungo tones from recorded samples and performed a user survey on their similarity to the original ones. In [3], Beauchamp presented four steps for formal listening tests: (1) stimuli preparation, (2) the actual listening test, (3) data processing/preparation, and (4) interpretation of results. In our case, we tried to "normalize" the stimuli in terms of loudness level, pitch, and duration. While the latter two were virtually identical between the original and synthesized stimuli, loudness levels of the synthesized sounds were manually adjusted to match the



Fig. 24. Four fret positions (A, B, C, and D) selected for recording reference tones for evaluation.

Table 4. Evaluation scores for resynthesis of non-vibrato tones.

Position   Fundamental frequency (Hz)   Score
A          127.2                        4.9
B          155.8                        4.3
C          191.1                        4.5
D          232.8                        4.1

original, without any precise loudness testing or prediction. Therefore, this can be considered an informal listening test to find the perceptual effects of time-varying gain control in vibrato tones. We selected the four most frequently played fret positions (A, B, C, and D, as shown in Fig. 24) for evaluation. Sounds were played by a professional geomungo performer, and three tones with different levels of vibrato, i.e., zero, weak, and strong, were recorded at each position. Sounds with a constant pitch were obtained using a time-invariant filter, while vibrato tones were generated both with and without real-time gain control. Ten participants, who were professional or amateur musicians or graduate students majoring in computer music, evaluated the similarity between the original sounds and their resynthesized results on a scale of 1 (completely different) to 5 (virtually identical) and commented on their sonic quality as well. Table 4 summarizes the evaluation results for tones with a constant pitch. As shown by the scores, the commuted waveguide synthesis method introduced in this paper works effectively for non-vibrato tones, producing an average score of 4.45. Evaluation results on vibrato tones are shown in Table 5. In general, resynthesized results of vibrato tones were rated lower than those of non-vibrato cases, which remains a limitation of this work. However, tones generated with time-varying gain control scored higher than those without gain control in six out of eight cases, especially when A became larger. Regarding the sonic quality of these synthesized tones, eight participants commented that the

Fig. 25. Structure of the STK-based sound synthesis program for the geomungo. The block diagram connects an stk::FileWvIn excitation signal to an stk::Iir loss filter (parameters g(n), a) and an stk::DelayL delay line (length set by f0(n)), with stk::FileWvIn body-hit signals mixed in and output through stk::FileWvOut and stk::RtAudio.

gain control provided more depth to the vibrato. Six participants, although they gave the same score to both, said they preferred the gain-controlled tones.

6 SOFTWARE IMPLEMENTATION

Based on the algorithm presented above, a real-time sound synthesis engine for the geomungo was implemented using the Synthesis Toolkit (STK) [6]. Fig. 25 shows the structure of the synthesis engine, which is by nature similar to that of the sound synthesis model we proposed (see Fig. 6). To simulate the geomungo's vibrato, the gain parameter of the Iir class is controlled by the approximated g, and the delay parameter of the DelayL class is determined by the estimated fundamental frequency curve. The result demonstrates the efficiency of the algorithm for the synthesis of geomungo sounds.
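The engine structure in Fig. 25 can be mimicked in plain Python as below: a stored excitation (standing in for stk::FileWvIn), a recursive loop whose gain g(n) and delay length (from f0(n)) are updated every sample (the roles of stk::Iir and stk::DelayL), and an optional body-hit sample mixed into the output. All signals and parameter curves here are synthetic placeholders; the STK classes themselves are not used.

```python
import math, random

def engine(exc, body_hit, f0_curve, g_curve, fs):
    # Per-sample parameter update, mirroring the Fig. 25 signal flow.
    # Integer delay rounding stands in for stk::DelayL's interpolation.
    n_samples = len(f0_curve)
    out = [0.0] * n_samples
    for n in range(n_samples):
        L = int(round(fs / f0_curve[n]))                  # delay from f0(n)
        fb = g_curve[n] * out[n - L] if n >= L else 0.0   # gain from g(n)
        hit = body_hit[n] if n < len(body_hit) else 0.0
        out[n] = (exc[n] if n < len(exc) else 0.0) + fb + hit
    return out

random.seed(3)
fs, dur = 22050.0, 22050
f0_curve = [130.0 + 3.0 * math.sin(2 * math.pi * 4.0 * n / fs) for n in range(dur)]
g_curve = [0.99 - 0.03 * math.sin(2 * math.pi * 4.0 * n / fs) for n in range(dur)]
exc = [random.uniform(-1, 1) for _ in range(200)]
body = [0.0] * dur                                 # no body hit in this demo
y = engine(exc, body, f0_curve, g_curve, fs)
```

In the actual engine these stages are the STK classes named in Fig. 25; stk::DelayL additionally provides the fractional (interpolated) delay that the integer rounding here omits.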

Table 5. Evaluation scores for resynthesis of weak (technique 1) and strong (technique 2) vibrato tones. Here f0 is the range of fundamental frequency, and A and f are the parameters used for time-varying gain control (as in Table 3).

Position   Technique   f0 (Hz)         A      f (Hz)   Score (no gain ctrl.)   Score (gain ctrl.)
A          1           108.1 ∼ 126.5   0.05   3.80     3.6                     3.4
A          2           105.9 ∼ 126.8   0.1    4.43     2.8                     3.5
B          1           151.2 ∼ 160.3   0.05   3.65     3.3                     3.6
B          2           148.6 ∼ 168.1   0.05   4.17     3.9                     3.7
C          1           186.4 ∼ 198.7   0.02   3.78     3.3                     3.6
C          2           182.5 ∼ 210.8   0.04   9.62     3.2                     3.3
D          1           215.1 ∼ 241.4   0.04   4.22     3.2                     3.4
D          2           215.3 ∼ 252.8   0.08   5.15     3.3                     3.7


7 CONCLUSION

This paper introduced a commuted waveguide synthesis model for the geomungo. The model is based on a generalized form of the Karplus-Strong algorithm. A one-pole filter was used to model the loss, and a Lagrange interpolation FIR filter was implemented to tune the fundamental frequency. Filter parameters can be calibrated by estimating the loss of the harmonic partials. The excitation signals for the synthesis model are then extracted by inverse filtering the recorded samples. Because of the characteristics of typical geomungo playing technique, the system was designed with extreme vibrato in mind. The loop gain of the loss filter was modeled as a sinusoidal function based on the magnitude analysis of the target sound. The length of the delay line and the fractional delay filter were manipulated based on the frequency analysis of the target tone. Finally, a real-time sound synthesis system was implemented using the STK to show the potential of this model for a virtual geomungo. Future work includes the design of a more elaborate model to reflect the nonlinear acoustic characteristics of the geomungo, as well as the parametrization of the current model for real-time performance with physical controllers. We will also consider prototyping a model that generates the input signal necessary to synthesize the sounds based only on mathematical equations, without any recorded samples.

8 ACKNOWLEDGMENTS

This research was supported by the Culture Technology (CT) Research & Development Program grant (R2012050017) of the Korea Creative Content Agency (KOCCA) and the Ministry of Culture, Sports and Tourism (MCST) of Korea.

9 REFERENCES

[1] J. B. Allen and L. R. Rabiner, "A Unified Approach to Short-Time Fourier Analysis and Synthesis," Proceedings of the IEEE, vol. 65, pp. 1558–1564 (Nov. 1977).
[2] B. Bank, F. Avanzini, G. Borin, G. De Poli, F. Fontana, and D. Rocchesso, "Physically Informed Signal Processing Methods for Piano Sound Synthesis: A Research Overview," EURASIP J. Applied Signal Processing, vol. 2003, pp. 941–952 (2003).
[3] J. W. Beauchamp, "Perceptually Correlated Parameters of Musical Instrument Tones," Archives of Acoustics, vol. 36, no. 2, pp. 225–238 (2011).
[4] S.-J. Cho, Physical Modeling of the Sanjo Gayageum Using Digital Waveguide Theory, Ph.D. thesis, University of Ulsan, Ulsan, Korea (2006).
[5] S.-J. Cho, U.-P. Chong, and S.-B. Cho, "Synthesis of the Dan Tranh Based on a Parameter Extraction System," J. Audio Eng. Soc., vol. 58, pp. 498–507 (2010 Jun.).
[6] P. R. Cook and G. P. Scavone, "The Synthesis ToolKit (STK)," Proceedings of the International Computer Music Conference, pp. 164–166, Beijing, China (1999).
[7] A. de Cheveigné and H. Kawahara, "YIN, a Fundamental Frequency Estimator for Speech and Music," J.


Acous. Soc. Am., vol. 111, no. 4, pp. 1917–1930 (Apr. 2002).
[8] C. Erkut, M. Laurson, M. Kuuskankare, and V. Välimäki, "Model-Based Synthesis of the Ud and the Renaissance Lute," Proceedings of the International Computer Music Conference, pp. 119–122, Havana, Cuba (Sept. 2001).
[9] C. Erkut, V. Välimäki, M. Karjalainen, and M. Laurson, "Extraction of Physical and Expressive Parameters for Model-Based Sound Synthesis of the Classical Guitar," presented at the 108th Convention of the Audio Engineering Society (2000 Feb.), convention paper 5114.
[10] Society for Korean Music Educators, "The Geomungo," J. Korean Music & Korean Music Educ., vol. 5, pp. 1–10 (May 1982).
[11] D. A. Jaffe and J. O. Smith, "Extensions of the Karplus-Strong Plucked-String Algorithm," Computer Music J., vol. 7, no. 2, pp. 56–69 (1983).
[12] M. Karjalainen, V. Välimäki, and T. Tolonen, "Plucked-String Models: From the Karplus-Strong Algorithm to Digital Waveguides and Beyond," Computer Music J., vol. 22, no. 3, pp. 17–32 (1998).
[13] K. Karplus and A. Strong, "Digital Synthesis of Plucked String and Drum Timbres," Computer Music J., vol. 7, no. 2, pp. 43–55 (1983).
[14] D. H. Keefe, "Physical Modeling of Wind Instruments," Computer Music J., vol. 16, no. 4, pp. 57–73 (1992).
[15] J.-H. Kim, "Komungo," La Folia, vol. 3, no. 4 (2001).
[16] S.-O. Kim, "The Musical Study of the Geomungo Performing Technique—On the Basis of Sangryongsan in Julpungryu," J. Korean Music & Korean Music Educ., vol. 25, pp. 113–150 (2007).
[17] I.-O. Kown, "The Musical Study of the Geomungo Sanjo Performing Technique—On the Basis of Joongmori," J. Korean Music & Korean Music Educ., vol. 26, pp. 147–218 (2008).
[18] J. E. Lee, "Sound Spectrum and Performance Analysis of Geomungo, and Performance Reproduction in Midi," Master's thesis, Ewha Womans University (2006).
[19] N. Lee, J. O. Smith, and V. Välimäki, "Analysis and Synthesis of Coupled Vibrating Strings Using a Hybrid Modal-Waveguide Synthesis Model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 4, pp. 833–842 (May 2010).
[20] M. Mellody and G. H. Wakefield, "The Time-Frequency Characteristics of Violin Vibrato: Modal Distribution Analysis and Synthesis," J. Acoustical Soc. Am., vol. 107, no. 1, pp. 598–611 (Jan. 2000).
[21] Museum Portal of Korea, "Geomungo," http://www.emuseum.go.kr.
[22] J. Pakarinen, M. Karjalainen, V. Välimäki, and S. Bilbao, "Energy Behavior in Time-Varying Fractional Delay Filters for Physical Modeling Synthesis of Musical Instruments," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), vol. 3, pp. iii/1–iii/4 (2005).
[23] H. Penttinen, J. Pakarinen, V. Välimäki, M. Laurson, H. Li, and M. Leman, "Model-Based Sound


Synthesis of the Guqin," J. Acoustical Soc. Am., vol. 120, no. 6, pp. 4052–4063 (2006).
[24] C. Roads, The Computer Music Tutorial (MIT Press, 1996).
[25] G. P. Scavone, "Modeling Wind Instrument Sound Radiation Using Digital Waveguides," Proceedings of the International Computer Music Conference, pp. 355–358, Beijing, China (1999).
[26] G. P. Scavone and P. R. Cook, "Real-Time Computer Modeling of Woodwind Instruments," Proceedings of the International Symposium on Musical Acoustics (ISMA), pp. 197–202, Leavenworth, WA, USA (1998).
[27] S. Serafin and D. Young, "Recent Advances in Real-Time Bowed String Synthesis: Evaluation of the Models," Proceedings of the International Congress of Acoustics, Madrid, Spain (2007).
[28] J. O. Smith, "Waveguide Filter Tutorial," Proceedings of the International Computer Music Conference, pp. 9–16, Champaign-Urbana, IL, USA (1987).
[29] J. O. Smith, "Physical Modeling Using Digital Waveguides," Computer Music J., vol. 16, no. 4, pp. 74–91 (1992).
[30] J. O. Smith, "Efficient Synthesis of Stringed Musical Instruments," Proceedings of the International Computer Music Conference, pp. 64–71, Tokyo, Japan (1993).
[31] J. O. Smith, "Principles of Digital Waveguide Models of Musical Instruments," in Applications of Digital Signal Processing to Audio and Acoustics, vol. 437, pp. 417–466 (2002).
[32] J. O. Smith and S. A. Van Duyne, "Commuted Piano Synthesis," Proceedings of the International Computer Music Conference, pp. 319–326 (1995).
[33] V. Välimäki, J. Huopaniemi, and M. Karjalainen, "Physical Modeling of Plucked String Instruments with Application to Real-Time Sound Synthesis," J. Audio Eng. Soc., vol. 44, pp. 331–352 (1996 May).
[34] V. Välimäki, J. Parker, and J. S. Abel, "Parametric Spring Reverberation Effect," J. Audio Eng. Soc., vol. 58, pp. 547–562 (2010 Jul./Aug.).
[35] V. Välimäki, H. Penttinen, J. Knif, M. Laurson, and C. Erkut, "Sound Synthesis of the Harpsichord Using a Computationally Efficient Physical Model," EURASIP J. Applied Signal Processing, vol. 2004, pp. 934–948 (2004).
[36] V. Välimäki and T. Tolonen, "Development and Calibration of a Guitar Synthesizer," J. Audio Eng. Soc., vol. 46, pp. 766–778 (1998 Sep.).
[37] J. Woodhouse, "Physical Modeling of Bowed Strings," Computer Music J., vol. 16, no. 4, pp. 43–56 (1992).

THE AUTHORS

Seunghun Kim

Moonseok Kim

Seunghun Kim is a Ph.D. candidate at KAIST and a member of the Audio and Interactive Media (AIM) Lab, Graduate School of Culture Technology (GSCT). He received his B.S. degree in electrical and communications engineering from the Information and Communications University (ICU), Korea, in 2009 and his M.S. degree in culture technology from KAIST, Korea, in 2011. His research interests are audio signal processing, physical modeling, and musical interfaces.

Moonseok Kim was born in Incheon, Korea, in 1975. He received his bachelor's degrees in both mathematics and electrical engineering from Pohang University of Science and Technology, Korea, in 2004 and his master's degree in music, science, and technology from Stanford University, CA, USA, in 2006. He also earned a master's degree in music technology from McGill University, Montreal, Quebec, Canada, in 2010. Since 2004, Mr. Kim has worked in the field of musical signal processing. His research interests include physical modeling of musical instruments, real-time synthesis methods, violin acoustics, and numerical analysis. Mr. Kim is a member of the IEEE.

Woon Seung Yeo

Woon Seung Yeo is a bassist, media artist, and computer music researcher. He is an Assistant Professor at the Korea Advanced Institute of Science and Technology (KAIST), where he leads the Audio and Interactive Media (AIM) Lab and the KAIST Mobile Phone Orchestra (KAMPO). Mr. Yeo received degrees from Seoul National University (B.S. and M.S. in electrical engineering), the University of California at Santa Barbara (M.S. in media arts and technology), and Stanford University (M.A. and Ph.D. in music). His research includes audiovisual art, cross-modal display, musical interfaces, mobile media for music, and audio DSP. Mr. Yeo is a member of the International Computer Music Association (ICMA) and the general chair of the 13th International Conference on New Interfaces for Musical Expression (NIME).



Room Acoustical Parameters in Small Rooms1

HASSAN EL-BANNA ZIDAN ([email protected]) AND U. PETER SVENSSON, AES Member ([email protected])

Acoustics Group, Department of Electronics and Telecommunications, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway

This paper presents a study of the room acoustic conditions in small shoebox-shaped rooms. Barron's model is used to predict room acoustical parameters such as the reflected energy, Grefl, and the early reflected energy, G50refl. Measurements were carried out in three small rooms, and detailed computer simulations using CATT-Acoustic were done for different rooms. Comparisons between these measurements and detailed simulations, and the predictions based on Barron's model, are typically within −1 dB and +1 dB when averaged across a room, from 250 Hz to 2 kHz.

0 INTRODUCTION

There are many different approaches for predicting the properties of the sound field in a room. Three classes of prediction methods are wave-theoretical methods, geometrical-acoustics-based methods, and statistical methods. The former two can give highly detailed information about a sound field: the sound pressure at any point in space inside a room. At the same time, these methods require, or can handle, geometry, boundary conditions, and source properties down to the smallest detail. At the opposite end of the scale is the third approach, statistical methods. These use very few input parameters about a room and ignore most details, so the results are not as accurate and detailed as those of the first two methods. Even with the tremendous growth of computational power and algorithmic sophistication, simple statistical methods are still very prominent. This is probably because of the usefulness of models with few parameters, such that the influence of those basic parameters can be studied readily. One example is presented in [1], where the authors proposed an objective scheme for ranking halls based on a small set of room acoustical parameters. A common statistical model used in room acoustics is Barron's revised theory, which views the sound field as consisting of direct sound and diffuse exponential reverberation only. It ignores the early reflections (i.e., the early part of the impulse response is viewed as part of the exponential diffuse field), and it ignores possible non-diffuseness of the reverberation, non-exponential decays, and the room modes.

1 Some parts of this paper were presented (convention paper 8223) at the 129th Convention of the Audio Engineering Society, San Francisco, CA, 2010 November 4–7.


Some of these aspects could possibly be included in extensions to the model. Barron's model has been shown to predict the sound field in large auditoria quite well [2,3], and the largest inaccuracy seems to appear when the space is subdivided into less diffuse sub-volumes [4]. Furthermore, better agreement is often found for the late part of the impulse response than for its early part. The same model has been evaluated in smaller auditoria [5]. The results verified the model, with slight but consistent differences between the measured and the predicted strength levels, and a correction was proposed for possibly improved prediction accuracy. For classrooms, the relationship between early energy, in the form of the C50 parameter, and the reverberation time has been analyzed in [6]. Measured values of C50 tended to be 1–3 dB higher, on average, than the diffuse-field relationship. The direct sound was ignored in the diffuse-field relationship, which to some degree explains why the measured values were higher. The purpose of this paper is to study how well Barron's model performs in smaller rooms, similar to the study in [6]. Smaller rooms might be more demanding for such simplified methods, but we aim at quantifying the deviations. As a part of this study, an impulse response model will be introduced that facilitates extensions to Barron's model and that can be used in further work where electroacoustically coupled rooms are studied. The paper is organized as follows. Section 1 briefly reviews Barron's model and its relationship to a simplified impulse response model. Section 2 presents the measurements and simulations done in different rooms to verify the model used. The achieved results are discussed in Section 3. Finally, the fourth section concludes this study.

PAPERS

ROOM ACOUSTICAL PARAMETERS IN SMALL ROOMS

1 THEORY

1.1 Impulse Response Models

In any closed space or room, the acoustical properties of that room can be characterized and assessed by a room impulse response. Typically, impulse responses are viewed as being composed of three main components: the direct sound; early reflections, which are a set of discrete reflections and can be chosen to be those that arrive within the first 50–80 ms of the first wave front; and late reflections, which arrive after 50–80 ms. While the direct sound is a well-defined component that usually stands out in an impulse response, and which is typically analyzed for, e.g., detecting the location of a source [7], the transition from early reflections to late reflections is not possible to pinpoint. Recent research has been done to determine the mixing time for this transition in room impulse responses [8]. A simple discrete-time room impulse response model, introduced in [9] and expanded in [10], is used in the following way. It is composed of a direct sound unit pulse followed by an exponential-decay weighted noise sequence. Such a model can be used to find estimates of the direct sound energy, early energy, and late energy. These energy estimates are subsequently used in various room acoustical parameters such as Strength, Clarity, etc. Energy estimates can also be derived from Barron's model without introducing an equivalent impulse response model per se, but the impulse response model gives further flexibility, e.g., for the study of properties of convolved impulse responses [11]. Here, these three energy components will be denoted d, e, and l, respectively. In addition, a quantity for the total reflected sound energy will be used: rev = e + l. They are all normalized relative to the direct sound energy at 10 m distance for an omnidirectional sound source with the same output power as the sound source used in a measurement.
This normalization leads to the various Strength parameters [12], which can be calculated directly in terms of the d, e, and l energies as follows:

G = 10 · log10(d + e + l)  dB   (1)

The strength of the reverberation, i.e., all reflected sound, Grefl, is then given by

Grefl = 10 · log10(e + l)  dB   (2)

The strength of the direct and early arriving sound over the first 50 ms, G50, is given by introducing a time-limit specifier for the early (and late) energies,

G50 = 10 · log10(d + e50)  dB   (3)

where e50 is the sound energy arriving within the first 50 ms (and l50 is the sound energy arriving after 50 ms). Finally, the direct sound, d, in Eq. (3) can be excluded, giving the early reflected sound strength

G50refl = 10 · log10(e50)  dB   (4)

Now, under the assumption that all reflected sound can be described as an exponential decay, e, l, and rev are related via the reverberation time T60:

l50 = rev · exp(−6 ln(10) · 0.050 / T60) ≈ rev · exp(−0.69 / T60)   (5)

e50 = rev · [1 − exp(−6 ln(10) · 0.050 / T60)] ≈ rev · [1 − exp(−0.69 / T60)]   (6)
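The discrete-time IR model of Section 1.1 can be used to check this 50 ms split numerically. The sketch below is illustrative only: sample rate, T60, source distance, reverberation energy, and the Gaussian noise sequence are assumptions, and the directivity factor is set to 1.

```python
import math, random

def ir_model(fs, T60, r, rev, n_len, seed=0):
    # Direct-sound pulse of energy d = 100/r^2 (directivity factor 1),
    # followed by noise with envelope exp(-delta*t), delta = 3*ln(10)/T60,
    # scaled so that the total reflected energy equals rev.
    rng = random.Random(seed)
    delta = 3.0 * math.log(10.0) / T60
    tail = [rng.gauss(0.0, 1.0) * math.exp(-delta * n / fs) for n in range(n_len)]
    scale = math.sqrt(rev / sum(x * x for x in tail))
    return [math.sqrt(100.0 / r ** 2)] + [x * scale for x in tail]

fs, T60 = 32000.0, 0.75
h = ir_model(fs, T60, r=2.0, rev=50.0, n_len=32000)
tail = h[1:]
n50 = int(0.050 * fs)
e50 = sum(x * x for x in tail[:n50])
l50 = sum(x * x for x in tail[n50:])
# Eq. (5) predicts l50/rev close to exp(-0.69/T60)
```

Because the squared envelope decays as exp(−2δt), the energy fraction after 50 ms is exp(−2δ · 0.05) = exp(−0.69/T60), which is exactly the approximation in Eq. (5).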

Furthermore, the direct sound energy is readily available as

d = 100 · Γ / r²   (7)

where r is the source-receiver distance in meters and Γ is the directivity factor. For the reverberation energy, or reflected sound energy, several values have been proposed in the literature. The first is the classical expression for a constant reverberation energy in a room,

revclassical = 1600π / A   (8)

where the absorption area, A, can be written as

A = 24 · ln(10) · V / (c · T60)   (9)

V is the room volume and c is the speed of sound. Beranek suggested a modification of the reverberation energy with the so-called room constant, R′ = A/(1 − ᾱ), ᾱ being the average absorption coefficient [13]. Thus, the reverberation energy is

revBeranek = 1600π(1 − ᾱ) / A   (10)

Vorländer [14] proved that the room constant, R′, is just a special case of his correction factor e^(−A/S), where A = AEyring = −S ln(1 − ᾱ) + 4mV and m is the air attenuation coefficient. Finally, Barron's revised formula [15] modified the reverberation term by a factor e^(−2δr/c), where δ = 3 ln(10)/T60 is the decay constant, making the reverberation distance dependent, so that

revBarron = (1600π / A) · exp(−2δr/c) ≈ (31200 · T60 / V) · exp(−0.040 · r / T60)   (11)

By inserting the proper expressions from Eqs. (5)–(11) into Eqs. (1)–(4), estimates of the various G-parameters are found. These estimates can be based on the classical reverberation model, Beranek's model, or Barron's model, and they will be compared with measurements and computer simulations in Section 2.
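Inserting Eqs. (5)–(7) and (11) into Eqs. (1)–(4) gives the following prediction sketch in plain Python. The room values are R2 from Table 1 (V = 104 m³, T60 = 0.77 s); a directivity factor of 1 is assumed for simplicity.

```python
import math

def strength_barron(V, T60, r, gamma=1.0):
    # Eqs. (7), (11), and (5)-(6), inserted into Eqs. (1)-(4);
    # gamma is the directivity factor of the source.
    d = 100.0 * gamma / r ** 2                               # Eq. (7)
    rev = (31200.0 * T60 / V) * math.exp(-0.040 * r / T60)   # Eq. (11)
    l50 = rev * math.exp(-6.0 * math.log(10.0) * 0.050 / T60)  # Eq. (5)
    e50 = rev - l50                                          # Eq. (6)
    return {
        "G": 10 * math.log10(d + rev),         # Eq. (1)
        "Grefl": 10 * math.log10(rev),         # Eq. (2)
        "G50": 10 * math.log10(d + e50),       # Eq. (3)
        "G50refl": 10 * math.log10(e50),       # Eq. (4)
    }

# Room R2 from Table 1 (V = 104 m^3, T60 = 0.77 s) at r = 3 m
p = strength_barron(104.0, 0.77, 3.0)
```

For room R2 at 3 m this gives Grefl ≈ 23 dB, of the same order as the measured Grefl values plotted in Fig. 2.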

ZIDAN AND SVENSSON

PAPERS

Fig. 1. Top view of microphone and loudspeaker configuration setup in the three measured rooms.

Table 1. Data for the three rooms measured.

Room   Size                        T60,mid   fSchroeder
R1     5.7 × 5.7 × 2.7 = 88 m³     0.75 s    185 Hz
R2     5.8 × 5.9 × 3 = 104 m³      0.77 s    176 Hz
R3     10.3 × 6 × 2.9 = 180 m³     0.5 s     107 Hz

1.2 Limitations and Possibilities with the IR Model

There is no doubt that early reflections are significant in many respects, but the simple IR model(s) ignore(s) these. Only comparisons with measurements and/or detailed simulations will indicate how severe this neglect is for the energy relations. The IR models could, in principle, be expanded if desired. The direct sound pulse can, e.g., be replaced by a measured impulse response in order to get a detailed direct sound impulse response for auralization purposes. Furthermore, very early reflections could be included in the direct sound as done in [10], where the influence of a reflecting table on a microphone's response was studied. As mentioned above, the IR model offers the possibility to study convolutions of IRs. Such a convolution represents, e.g., an audio or telephone conference situation, where a microphone and a loudspeaker in one room (represented by a single room impulse response) are connected to a loudspeaker and a microphone in another room (also represented by a single room impulse response). The convolution can be assumed to smear out early reflection patterns and thereby reduce the problem caused by ignoring early reflections.

2 EXPERIMENTS

A series of measurements was carried out as earlier reported in [10] and summarized in Section 2.1. Due to the limited scope of the measurement series, computer simulations were carried out as well, as described in Section 2.2. Computer simulations have been shown to be quite accurate for strength parameters [16] in medium-sized and large auditoria.

2.1 Measurements

A series of impulse response measurements was done in three small rooms using a small two-way loudspeaker (Genelec 1029A) as the sound source. Its position was 1 m from one of the end walls and centered between the sidewalls, as shown in Fig. 1. The height of the loudspeaker was 1.5 m from the floor to the center of the woofer. An omnidirectional 1/2'' microphone (Norsonic 1220) was positioned

in a sequence of receiver positions, in steps of 1 m, starting at 1 m in front of the loudspeaker. The height of the microphone, like that of the loudspeaker, was 1.5 m. Only positions on-axis of the loudspeaker were measured in order to get the same directivity index for all the measured positions. The rooms had no furniture during the measurements, except for room R3, where tables were standing next to the side walls. Using symmetric measurement positions in a room could be viewed as a worst-case test because of accentuated interference effects caused by the two side-wall reflections. To further enhance deviations between measurements and predictions, the direct sound energy was excluded from the analysis. For the measurements, this was done simply by subtracting the direct sound energy from the impulse response energy values. The data of the rooms are shown in Table 1. For the calibration of the Strength parameter, a series of measurements was done in an anechoic room at distances of 1, 2, 3, and 4 m, respectively. All these responses were rescaled to a distance of 10 m, and the average value was chosen as representing a 10 m distance. All the measured impulse responses were filtered in octave bands and the total energy was summed for each band. Then a scaling factor was derived for each band such that the energy in each octave band for a 10 m free-field impulse response would be 0 dB. For the subsequent measurements, the gain settings were kept identical to the ones used for the anechoic calibration measurements. An important aspect is the directivity of the loudspeaker and, to a very minor degree, of the microphone. The standard ISO 3382 for room acoustical measurements indicates that both the loudspeaker and the microphone should be omnidirectional, which is practically unattainable for the loudspeaker. Allowable deviations for practical loudspeakers are given in ISO 3382, but here it is not critical to use a sound source that is omnidirectional. It is rather more important to use good estimates of the actual directivity factors. Simulations were done with an edge diffraction method in order to get the values of the directivity index in each octave band. This approach has been shown to be accurate [17]. Table 2 shows the on-axis directivity index of the loudspeaker averaged in octave bands. All IR measurements were done using the WinMLS software [18], and the reverberation times were calculated with this software as well. Strength parameters were, however, computed separately by Matlab scripts. As one example, Fig. 2 shows the reflected energy values, Grefl, as a function of source-receiver distance in the octave band range 125 Hz to 4 kHz. The corresponding error between the measured and the predicted reflected
It is rather more important to use good estimates of the actual directivity factors. Simulations were done with an edge diffraction method in order to get the values of the directivity index in each octave band; this approach has been shown to be accurate [17]. Table 2 shows the on-axis directivity index of the loudspeaker averaged in octave bands. All IR measurements were done using the WinMLS software [18], and the reverberation times were calculated with this software as well. The Strength parameters were, however, computed separately with Matlab scripts. As one example, Fig. 2 shows the reflected energy values, Grefl, as a function of source-receiver distance in the octave band range 125 Hz to 4 kHz. The corresponding error between the measured and the predicted reflected

PAPERS

ROOM ACOUSTICAL PARAMETERS IN SMALL ROOMS

Table 2. On-axis directivity index of the two-way Genelec 1029A loudspeaker, based on simulations.

Octave band         125 Hz   250 Hz   500 Hz   1 kHz   2 kHz   4 kHz
Directivity index   0.2 dB   0.6 dB   2.3 dB   5.2 dB  7.5 dB  4.6 dB

Fig. 2. Measured and predicted reflected energy, Grefl, for room 2. Predictions were based on Barron's model.

Fig. 3. Prediction error of the reflected energy, Grefl,Barron, for room 2.

energy, Grefl, is shown in Fig. 3, where predictions are based on Barron's model. Naturally, for such small rooms, only a very weak distance dependence can be observed for the reverberation level. It can be seen in Fig. 3 that the prediction error is mostly within −1 and +2 dB for the octave bands 125 Hz to 2 kHz. Quite large errors occur for the 4 kHz octave band; the reasons will be discussed in Section 3. Figs. 2 and 3 show examples from one room, while results for all three rooms are compiled in Fig. 4. Mean values, as well as the minimum and maximum values, are shown. It can be noticed that the mean prediction errors are within −1 to +2 dB from 250 Hz to 2 kHz. At 4 kHz the prediction is high on average, and the error varies between 2.4 dB and 3.5 dB. Fig. 5 shows a similar compiled view of the mean error of the early reflected energy, G50refl,Barron, for the three rooms. The mean prediction errors of the early reflected energy, G50refl,Barron, are within −1 to +3 dB from 250 Hz to 2 kHz. These values are a little higher than the mean prediction errors of the reflected energy, Grefl. Similarly to the situation for the reflected energy, the error is higher for the 4 kHz band, ranging from 3 dB to 4 dB.
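For reference, the predictions labeled "Barron's model" can be written in closed form. The sketch below assumes the standard formulation of Barron's revised theory [15], with energies expressed relative to the free-field direct sound at 10 m (the 31200T/V constant) and a 50 ms early/late split (exponent 13.8 × 0.05/T); the paper does not spell out its exact implementation.

```python
import numpy as np

def barron_g_refl(T, V, r):
    """Reflected-sound strength (dB): integrated reflected energy
    (31200*T/V)*exp(-0.04*r/T), re the free-field direct sound at 10 m.
    T in seconds, V in m^3, source-receiver distance r in m."""
    return 10 * np.log10(31200 * T / V * np.exp(-0.04 * r / T))

def barron_g50_refl(T, V, r):
    """Early reflected strength (dB): the reflected energy arriving
    within 50 ms of the direct sound, i.e. an extra factor
    (1 - exp(-13.8 * 0.05 / T))."""
    e50 = 31200 * T / V * np.exp(-0.04 * r / T) * (1 - np.exp(-0.69 / T))
    return 10 * np.log10(e50)
```

For T = 0.5 s the distance term exp(−0.04 r/T) changes by only about 1.4 dB between r = 1 m and r = 5 m, which is the weak distance dependence visible in Fig. 2.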

Fig. 4. Mean (and minimum and maximum) prediction error of reflected energy, Grefl,Barron, in the three rooms. Two curves were shifted for readability.

2.2 Simulations

A series of computer simulations was done using the CATT acoustics software [19] to further investigate the validity of the prediction model in small rooms. Three rooms were designed according to the data given in Table 3. Computer simulation software usually requires more input parameters for the room (such as absorption coefficients, diffusion, etc.) than the simple impulse response models in Section 1. Consequently, the absorption coefficients were chosen based on the simplified assumption that each room has one absorption coefficient for the floor and another for all walls and the ceiling. The absorption coefficient of the floor was fixed in all the rooms to 0.05, while the absorption coefficient of the walls and ceiling was chosen to give the designated reverberation times in Table 3 based on Sabine's formula.
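The choice of wall/ceiling absorption can be sketched by inverting Sabine's formula T = 0.161 V / A under the stated simplification (fixed floor absorption, one common coefficient for ceiling and walls). This is a sketch of the procedure only; the function name and example numbers are illustrative.

```python
def sabine_wall_alpha(L, W, H, T_target, alpha_floor=0.05):
    """Common wall/ceiling absorption coefficient that gives the target
    reverberation time under Sabine's formula T = 0.161 * V / A,
    with the floor absorption coefficient fixed."""
    V = L * W * H
    S_floor = L * W
    S_rest = L * W + 2 * (L + W) * H   # ceiling plus the four walls
    A_needed = 0.161 * V / T_target   # required absorption area (m^2 Sabine)
    return (A_needed - alpha_floor * S_floor) / S_rest
```

Re-inserting the solved coefficient into Sabine's formula reproduces the target reverberation time exactly, which is a useful consistency check for this kind of inversion.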

ZIDAN AND SVENSSON


Table 3. Data of the three rooms simulated.

Room   Size                          Absorption coefficient α (walls/ceil.)   T60,mid   fSchroeder
R1     5.7 × 5.6 × 3 m = 95.76 m3    0.18                                     0.5 s     145 Hz
R2     10 × 6 × 3 m = 180 m3         0.35                                     0.5 s     105 Hz
R3     8 × 7 × 3 m = 168 m3          0.29                                     0.6 s     120 Hz

Fig. 5. Mean prediction error of early reflected energy, G50refl,Barron, in the three rooms. Two curves were shifted for readability.

Fig. 7. Simulated reflected energy, Grefl,CATT, compared to classical theory and Barron's model for room 1 with reverberation time 0.5 s and 10% diffusion.

Fig. 6. Top view of two sources and a sample of the 99 receivers in a simulated room.

To cover a wide range of cases, each room was simulated with many different levels of diffusion, from which 10%, 60%, and 100% were chosen to represent low, medium, and high levels of diffusion, respectively. Each room was simulated with the following source/receiver configuration. Two omnidirectional source positions were used: a first position laterally centered in the room, 0.5 m from the front wall, and a second position 1 m from the front wall and 0.5 m from the sidewall, as illustrated in Fig. 6. The 99 receivers were located randomly, with the constraints of being at least 1 m away from the first source, 0.5 m away from the second source, and 0.5 m away from all the sidewalls. The height of the sources and the receivers was 1.5 m. Fig. 6 shows one example of a source/receiver configuration.

Fig. 8. Simulated reflected energy, Grefl,CATT, compared to classical theory and Barron's model for room 1 with reverberation time 0.5 s and 60% diffusion.

The idea behind the random distribution was to get as large a range of variations as possible. In addition, the random distribution avoids bias due to possibly systematic receiver placements. Fig. 7 and Fig. 8 show examples of the simulated reflected energy, Grefl,CATT, compared with classical theory, Barron's model, and Beranek's model. Fig. 7 gives an example with mostly specular reflections (10% diffusion), while Fig. 8 shows a case with more diffuse reflections (60% diffusion).
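The constrained random receiver placement described above can be sketched with simple rejection sampling. The sketch is illustrative: the function name is the sketch's own, and the source coordinates in the usage line are hypothetical stand-ins for the positions in Fig. 6.

```python
import random

def random_receivers(n, L, W, sources, min_src_dist, margin=0.5, h=1.5, seed=1):
    """Draw n receiver positions in an L x W (m) room, at least
    `margin` from every side wall and at least min_src_dist[i] from
    source i (horizontal distances); candidates violating a
    constraint are simply redrawn."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.uniform(margin, L - margin)
        y = rng.uniform(margin, W - margin)
        if all(((x - sx) ** 2 + (y - sy) ** 2) ** 0.5 >= d
               for (sx, sy), d in zip(sources, min_src_dist)):
            out.append((x, y, h))
    return out

# Hypothetical layout: two sources in a 10 m x 6 m room
receivers = random_receivers(99, 10.0, 6.0,
                             sources=[(5.0, 0.5), (0.5, 1.0)],
                             min_src_dist=[1.0, 0.5])
```

Rejection sampling keeps the distribution uniform over the allowed region, which is exactly what avoids the systematic bias mentioned in the text.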


Table 4. Average absolute error for Grefl with Grefl,CATT as reference.

Room              Diffusion   Mean error,          Mean error,        Mean error,
                              classical theory     Beranek's model    Barron's model
R1, T60 = 0.5 s      10%          0.69 dB              0.26 dB            0.10 dB
R1, T60 = 0.5 s      60%          0.71 dB              0.37 dB            0.18 dB
R1, T60 = 0.5 s     100%          0.70 dB              0.46 dB            0.26 dB
R2, T60 = 0.5 s      10%          1.91 dB              0.95 dB            0.27 dB
R2, T60 = 0.5 s      60%          2.13 dB              1.31 dB            0.47 dB
R2, T60 = 0.5 s     100%          2.46 dB              1.67 dB            0.81 dB
R3, T60 = 0.6 s      10%          1.26 dB              0.51 dB            0.18 dB
R3, T60 = 0.6 s      60%          1.29 dB              0.73 dB            0.37 dB
R3, T60 = 0.6 s     100%          1.41 dB              0.90 dB            0.50 dB

Fig. 9. Simulated early reflected energy, G50refl, compared to Barron's model for room 1 with reverberation time 0.5 s and 10% diffusion.

Fig. 10. Simulated early reflected energy, G50refl, compared to Barron's model for room 1 with reverberation time 0.5 s and 60% diffusion.

Table 5. Average absolute error for G50refl.

All the rooms were simulated with a larger set of diffusion percentages. It was observed that for diffusion levels of 10%, 20%, and 40%, the values of the reflected energy, Grefl,CATT, were all very similar. Only when the diffusion factor was 60% or more (80% and 100%) did the values differ markedly from those in Fig. 7. It can be noticed from Figs. 7 and 8 that the maximum difference in reflected energy, Grefl, between the CATT results and Barron's model does not exceed ±0.5 dB, which is quite acceptable. Furthermore, it can be observed that Beranek's model corresponds closely to the average value through the room, as explained in [14]. Table 4 summarizes the mean absolute error of the reflected energy, Grefl, according to classical theory, Barron's model, and Beranek's model across all the rooms and all the cases, with the simulation results, Grefl,CATT, as the reference. Figs. 9 and 10 show the simulated early reflected energy, G50refl, compared to Barron's model for diffusion percentages of 10% and 60%, respectively. Table 5 summarizes the mean error of the early reflected energy, G50refl, according to Barron's model across all the rooms and all the cases.

Room              Diffusion   Mean error for Barron's model
R1, T60 = 0.5 s      10%          0.36 dB
R1, T60 = 0.5 s      60%          0.38 dB
R1, T60 = 0.5 s     100%          0.43 dB
R2, T60 = 0.5 s      10%          0.49 dB
R2, T60 = 0.5 s      60%          0.57 dB
R2, T60 = 0.5 s     100%          1.01 dB
R3, T60 = 0.6 s      10%          0.63 dB
R3, T60 = 0.6 s      60%          0.63 dB
R3, T60 = 0.6 s     100%          0.65 dB

It can be noticed that the mean error for the early reflected energy, G50refl, was higher than for the reflected energy, Grefl, in all the cases. On average, over the nine configurations, the prediction error for G50refl is 0.2 dB larger than for Grefl.

3 DISCUSSION OF RESULTS

The comparison of the reflected energy, Grefl, for the measurements done in the three rooms, and the predictions


according to Barron's model, showed quite close agreement for the frequency bands 250 Hz to 2 kHz. The prediction error varied between −1 dB and +2 dB. For the 4 kHz band the error increased to between 2.5 dB and 3.5 dB. A possible cause for this noticeable error is the difficulty of accurately estimating the directivity index of the loudspeaker at higher frequencies. The simulation method modeled the drivers as a collection of point sources on a flat frontal baffle, while the loudspeaker does have a shallow recession around the tweeter. Such a recession would act as a mild horn loading, which would be expected to lead to an increased directivity. Based on the detailed simulations with the CATT Acoustics software, Barron's model also worked very well for the prediction of average trends. The predictions are even closer to the detailed simulation values than to the measurements. This is not surprising since, among other factors, the detailed calculations used an omnidirectional source, so the possible prediction error caused by an uncertain directivity was eliminated. Moreover, the trend lines of the results (strength values as a function of distance) were the most interesting quantities to compare, and the predicted trend lines did indeed agree well with the trend lines of the detailed simulations. Certainly, when looking at individual receiver points, the prediction error could be as large as observed for the measurements. When the diffusion was as low as 10%, as shown in Fig. 7, the mean error was only a few tenths of a dB. When the diffusion was increased to 60% or more, the mean error increased but was still bounded within ±0.5 dB. For the early reflected energy, G50refl, the measurements show, as in Fig. 5, that the model also worked reasonably well, although the prediction error was a little higher than for the total reflected energy, Grefl. The prediction error varied between −2 dB and +2 dB.
It might seem surprising that the prediction error for Barron's model was smaller when distinct early reflections occurred (that is, when the diffusion was low), but Barron's model has been explained in terms of the image source model, which corresponds to a diffusion coefficient of zero [15]. It can be noticed that the error was only somewhat higher for the early reflected energy, G50refl, than for Grefl, for the same rooms and the same cases. These errors illustrate that, on average, the simple model predicts the strength values fairly well in small shoebox-sized rooms with little or no furnishing. There is no doubt that the error at one specific receiving point might be substantially higher. However, the average behavior of the model might be very useful for studying the influence of parameters such as room volume, source-receiver distance, and microphone and loudspeaker directivity. Furthermore, as the impulse response model used in this paper is composed of only two components, direct sound and an exponential reverberation tail, it can be used to study the properties of two or more rooms through the convolution of room impulse responses, as has been done in [10] and [11].
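Because the model consists only of a direct impulse and an exponential tail, cascading two rooms by convolution is straightforward to sketch. The sketch below is illustrative: a deterministic positive tail is used to keep the example simple (a noise-like tail would be more realistic), and the sample rate and tail level are arbitrary choices.

```python
import numpy as np

FS = 8000  # sample rate in Hz, kept low for the sketch

def simple_rir(T, r, c=343.0, length=1.0):
    """Two-component model: a direct impulse after the propagation
    delay r/c, followed by an exponential tail whose decay reaches
    -60 dB at t = T (amplitude factor exp(-6.9 * t / T))."""
    n = int(length * FS)
    d = int(r / c * FS)
    h = np.zeros(n)
    h[d] = 1.0                                 # direct sound
    t = np.arange(n - d - 1) / FS
    h[d + 1:] = 0.1 * np.exp(-6.9 * t / T)     # reverberation tail
    return h

# Two rooms in cascade, as in a video conference transmission chain
h_ab = np.convolve(simple_rir(0.5, 2.0), simple_rir(0.6, 3.0))
```

The convolved response contains a "direct-direct" component at the sum of the two propagation delays, plus the combined reverberant energy of both rooms, which is the situation studied in [10] and [11].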


4 CONCLUSIONS

A simple room impulse response model, corresponding to Barron's model, has been used to investigate the room acoustical conditions (early and late energy parameters) in small shoebox-shaped rooms. Measurements and computer simulations were done to evaluate the simple model. The measurements show that the error was limited to between −1 dB and +2 dB for the reflected energy, Grefl, across all the octave bands except at 4 kHz. For the early reflected energy, G50refl, the prediction error did not exceed ±2 dB. Detailed simulations showed that the mean error did not exceed +0.8 dB for the reflected energy, Grefl, and +2.3 dB for the early reflected energy, G50refl. In all the cases, for both measurements and simulations, the error for the early reflected sound, G50refl, was a bit higher than the error for the reflected energy, Grefl, but still within acceptable limits. Therefore, it is suggested that the impulse response model can be useful for studying two rooms connected as in a video conference setup.

5 REFERENCES

[1] S. Cerdá, A. Giménez, and R. M. Cibrián, "An Objective Scheme for Ranking Halls and Obtaining Criteria for Improvements and Design," J. Audio Eng. Soc., vol. 60, pp. 419–430 (2012 June).
[2] M. Barron, "Using the Standard on Objective Measures for Concert Auditoria, ISO 3382, to Give Reliable Results," Acoust. Sci. & Tech., vol. 26, pp. 162–169 (2005).
[3] J. S. Bradley, "Using ISO 3382 Measures, and Their Extensions, to Evaluate Acoustical Conditions in Concert Halls," Acoust. Sci. & Tech., vol. 26, pp. 170–178 (2005).
[4] J. S. Bradley, "Experience with New Auditorium Acoustic Measurements," J. Acoust. Soc. Am., vol. 73, no. 6, pp. 2051–2058 (1983).
[5] M. Aretz and R. Orlowski, "Sound Strength and Reverberation Time in Small Concert Halls," Applied Acoustics, vol. 70, pp. 1099–1110 (2009).
[6] E. Nilsson, "Room Acoustic Measures for Classrooms," Proceedings of Internoise, Lisbon, Portugal (2010).
[7] S. Tervo, T. Lokki, and L. Savioja, "Maximum Likelihood Estimation of Loudspeaker Locations from Room Impulse Responses," J. Audio Eng. Soc., vol. 59, pp. 845–857 (2011 Nov.).
[8] A. Lindau, L. Kosanke, and S. Weinzierl, "Perceptual Evaluation of Model- and Signal-Based Predictors of the Mixing Time in Binaural Room Impulse Responses," J. Audio Eng. Soc., vol. 60, pp. 887–898 (2012 Nov.).
[9] U. P. Svensson, "Energy-Time Relations in an Electroacoustic System in a Room," J. Acoust. Soc. Am., vol. 104, pp. 1483–1490 (1998).
[10] U. P. Svensson and H. EL-Banna Zidan, "Early Energy Conditions in Small Rooms and in Convolutions of Small-Room Impulse Responses," presented at the 129th Convention of the Audio Engineering Society (2010 Nov.), convention paper 8223.
[11] U. P. Svensson, H. EL-Banna Zidan, and J. L. Nielsen, "Properties of Convolved Room Impulse Responses," Proc. of the 2011 IEEE Workshop on Applications


of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY (2011).
[12] ISO 3382:1997, "Acoustics—Measurement of the Reverberation Time of Rooms with Reference to Other Acoustic Parameters," International Organization for Standardization (1997).
[13] L. Beranek, Acoustics (McGraw-Hill, New York, 1954).
[14] M. Vorländer, "Revised Relation between the Sound Power and the Average Sound Pressure Level in Rooms and Consequences for Acoustic Measurements," Acustica, vol. 81, pp. 332–343 (1995).

[15] M. Barron and L. J. Lee, "Energy Relations in Concert Auditoriums. I," J. Acoust. Soc. Am., vol. 84, pp. 618–628 (1988).
[16] M. Vorländer, "International Round Robin on Room Acoustical Computer Simulations," Proceedings of the 15th ICA, Trondheim, Norway, pp. 689–692 (1995).
[17] U. P. Svensson and K. Wendlandt, "The Influence of a Loudspeaker Cabinet's Shape on the Radiated Power," Proc. of Baltic Acoustic, Sept. 17–21; J. of Vibroeng., no. 3(4), pp. 189–192 (2000).
[18] WinMLS software, http://www.winmls.com.
[19] CATT Acoustics software, http://www.catt.se.

THE AUTHORS

Hassan EL-Banna Zidan was born in Cairo, Egypt, in 1975. He received a B.Sc. and an M.Sc. in electronics and telecommunications engineering in 1998 and 2007, respectively, both from the Arab Academy for Science, Technology, and Maritime Transport (Alexandria/Cairo). Since 2008 he has been a Ph.D. candidate in the acoustics group at the Norwegian University of Science and Technology (NTNU), Trondheim, Norway. His research interests include audio signal processing, speech enhancement, and room acoustics.


Peter Svensson received an M.Sc. degree in engineering physics in 1987 and a Ph.D. degree in 1994, both from Chalmers University of Technology, Gothenburg, Sweden. He has held postdoctoral positions at Chalmers University, the University of Waterloo, Ont., Canada, and Kobe University, Japan.


Since 1999 he has been a professor in electroacoustics at the Norwegian University of Science and Technology, Trondheim, Norway. His main research interests are auralization, especially computational room acoustics and sound reproduction techniques, measurement techniques, and perceived room acoustical quality. Prof. Svensson has published 33 journal papers and more than 100 conference papers. He is currently vice president of the European Acoustics Association and a past president of the Norwegian Acoustical Society. He is an associate editor in the field of electroacoustics for Acta Acustica united with Acustica. In 2001 he received a best paper award from the Journal of the Audio Engineering Society (for authors 35 years or younger), together with Johan L. Nielsen, and in 2009 a best paper award from the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), together with Haohai Sun and Shefeng Yan.



Influence of a Table on a Microphone's Frequency Response and Directivity

HASSAN EL-BANNA ZIDAN ([email protected]) AND U. PETER SVENSSON, AES Member ([email protected])

Acoustics Group, Department of Electronics and Telecommunications, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway

In a video conference setup, the microphone is often placed on a table and thus might act like a boundary, or pressure-zone, microphone. However, a typical table is small enough that diffraction from its edges might significantly affect the frequency response, rather than giving the ideal +6 dB boost. In this paper a microphone's frequency response and directivity are studied using an edge diffraction-based calculation method. Measurements were done in an anechoic room. Comparisons between the calculated and measured frequency responses indicate that the simulation gives 1/3-octave-band levels that are typically within 1 dB of the measured values. Furthermore, for a smaller table the response might be affected by several dB, whereas a larger table decreases these variations.

0 INTRODUCTION

The microphone is the first link of the electroacoustic transmission chain in many communication systems, such as audio/video conferencing and tele-presence systems. Any loss of quality occurring at this stage, through the pick-up of too much reverberation or extraneous noise, is very difficult to compensate for at subsequent stages. An audio/video conferencing situation is challenging because several sources will be active at more or less unknown locations. In some installations fixed microphone positions might be chosen, e.g., microphones hanging from the ceiling. Hanging microphones might give relatively similar distances to the source/talker positions, with correspondingly similar levels, but they might also pick up problematic levels of background noise and reverberant sound. A very common situation is that microphones are handled by the users without any sound engineer present, and then one of the few practical solutions is to place the microphones on a table. Such a solution has some advantages, as it is the idea behind boundary microphones (e.g., the Pressure Zone Microphone, PZM) [1]. A boundary microphone has its membrane mounted so close to a sound-reflecting plate or boundary that the membrane receives the direct and reflected sounds in phase at all frequencies of interest, avoiding destructive interference between them. The pressure doubling boosts the level of the direct sound by 6 dB, and in addition the directivity index is increased by 3 dB due to the shielding of the diffuse field. These ideal benefits apply only if the microphone is placed on a

very large table and other significant early reflections are ignored. The effect of a finite-sized reflector on a boundary microphone is often described in handbooks and white papers on microphones [2,3,4], but this effect seems not to have been studied quantitatively to any great extent, probably due to a lack of practical simulation methods. The boundary element method would be suitable for such studies but is computationally quite demanding at high frequencies. This paper uses an edge diffraction-based method that permits low- and high-frequency studies with relatively little computational effort. This approach has been used before for studying the influence of loudspeaker cabinets. Previous attempts have used high-frequency asymptotic methods [5], as well as other edge-based methods [6], but here a formulation without asymptotic limitations is employed [7]. The effect of finite-sized reflectors is important and well-known in a number of situations: a microphone placed at a lecture podium, a near-field monitoring loudspeaker placed on top of a mixing console, the shape and size of a loudspeaker enclosure [5,8], the in-situ surface reflection factor measurement technique [9], auditorium ceiling reflectors [10], etc. Some of these cases have been analyzed to a large extent [5,8,10], but others have not. Therefore, the approach in this paper might prove useful also for other applications. Recently, the same edge diffraction-based method as in [7] has been used to study the influence of a loudspeaker's finite baffle on its measured frequency response [11]. The paper is structured as follows. In Section 1, the prediction model that is based on an edge diffraction method


Fig. 1. (a) Illustration of a typical audio/video conference setup with a loudspeaker (representing a talking person) and a microphone; (b) a simplified model with a point source and a point receiver elevated D/2 above the table (D = microphone diameter).

Fig. 2. Illustration of edge diffraction paths of (a) first order and (b) second order.

Table 1. List of the tables used

is explained. In Section 2, experiments and comparisons between the measurements and the predicted values are presented. Section 3 shows the results, and conclusions are given in Section 4.

1 PREDICTION MODEL

The transfer function from a loudspeaker to a microphone on a table can be studied using the simplified model in Fig. 1. The simple representation of the microphone as a point receiver will not capture the directivity of the microphone at higher frequencies but will be adequate for a study of the influence of the table at low and middle frequencies. Furthermore, the simple model of the loudspeaker will obviously not capture its directivity, but in this study only the direct sound of the loudspeaker is observed. Certainly, more elaborate models of the loudspeaker and the microphone could be expected to increase the accuracy. A straightforward extension would be to represent the microphone with a number of discrete point receivers, and the loudspeaker might be modeled with a number of discrete point sources, possibly placed on a box to represent the diffraction from the loudspeaker box edges. The situation in Fig. 1(b) can be accurately and efficiently studied numerically with the edge diffraction method [7,11]. All the edges of the table are subdivided into edge elements, and the edge diffraction method calculates first-, second-, and possibly higher-order contributions, as shown in Fig. 2. Generally, lower frequencies require higher-order diffraction, but here second-order diffraction has been satisfactory for the studied frequency range. By this method it is possible to compute the impulse response and find the transfer function using the fast Fourier transform (FFT). In that way, the real response for the direct sound, rather than the ideal +6 dB response, can be found. A second important quantity is the diffuse-field response.
This can be dealt with by repeating the transfer function computation for all possible incidence angles (from a further distance) and computing the average squared response. Once the direct sound response and the diffuse-field response are known, the directivity index can be computed. The details of the edge diffraction calculation method can be found in [7]. Here, it suffices to say that the implementation in the Edge diffraction toolbox for Matlab [12] was used, with a sampling frequency of 96 kHz. The toolbox uses an accurate numerical integration technique for the first-order diffraction components. For the second-order diffraction components, all edges were subdivided into 18 mm long elements, and the simple midpoint numerical integration scheme was used, with a subsequent splitting of each element's contribution into the two neighboring time samples [7].

Table   Size L[m] × W[m]   Used for
T1      1.2 × 1.2          Anechoic room meas. & sim.
T2      3.2 × 1.4          Simulations
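The diffuse-field averaging and the directivity index just described can be sketched as follows. The paper's exact 1000-source distribution is not specified, so a quasi-uniform Fibonacci sphere stands in for it; the function names are the sketch's own.

```python
import numpy as np

def fibonacci_sphere(n):
    """Quasi-uniform incidence directions on the unit sphere."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i      # golden-angle azimuth
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z ** 2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def directivity_index_db(p_direct, p_all_angles):
    """Directivity index: direct-sound power response over the average
    squared response across uniform incidence directions."""
    diffuse = np.mean(np.abs(p_all_angles) ** 2)
    return 10 * np.log10(np.abs(p_direct) ** 2 / diffuse)

# Sanity check: pressure doubling on an infinite rigid plane, with the
# lower half-space shielded, gives the textbook +3 dB directivity index.
dirs = fibonacci_sphere(1000)
p_half = np.where(dirs[:, 2] > 0, 2.0, 0.0)
di = directivity_index_db(2.0, p_half)   # 10*log10(2) ~ +3 dB
```

For a finite table the per-angle responses would come from the edge diffraction computation instead of the idealized half-space values used in the sanity check.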

2 EXPERIMENTS

In order to compare simulations with measurements, a rather small table (T1 in Table 1) was chosen, for which diffraction effects are expected to be stronger than for a large table. Impulse responses were measured in an anechoic chamber using the WinMLS measurement software [13]. A small two-way loudspeaker (Genelec 1029A) was used, and an omnidirectional free-field equalized half-inch microphone (Norsonic 1220) was placed on the table in a sequence of positions in steps of half a meter. The influence of the frequency responses of the loudspeaker and microphone was suppressed by first measuring with the microphone placed on the table and then repeating the measurement without the table but with the microphone positioned as similarly as possible relative to the loudspeaker. By dividing these two frequency responses by each other, the effect of the uneven response of the loudspeaker was largely removed.
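The division of the two measured responses can be sketched as a regularized spectral division. The small regularization term is an implementation detail assumed here, not something specified in the paper.

```python
import numpy as np

def relative_response(ir_on_table, ir_free_field, eps=1e-6):
    """Transfer function of the table alone: the on-table measurement
    divided by the free-field measurement, bin by bin."""
    n = max(len(ir_on_table), len(ir_free_field))
    H1 = np.fft.rfft(ir_on_table, n)
    H0 = np.fft.rfft(ir_free_field, n)
    return H1 * np.conj(H0) / (np.abs(H0) ** 2 + eps)
```

Dividing H1·conj(H0) by |H0|² + eps is equivalent to H1/H0 except near bins where the free-field response is weak, where the regularization prevents the quotient from blowing up.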


Fig. 3. Top view of table T1 (1.2 m × 1.2 m), with the loudspeaker and microphone positions used for simulations and for measurements in the anechoic chamber (microphone positions 0.04 m, 0.54 m, and 1.04 m from the front edge).

Fig. 4. Top view of table T2 (3.2 m × 1.4 m) with the loudspeaker positions Lsp 1–3 and the microphone positions used in the simulation study (microphone positions 0.5 m, 1.5 m, and 2.5 m from the front edge).

Fig. 5. Measured and simulated frequency responses of a microphone on table T1 in an anechoic room. The responses are 1/3 octave band smoothed and the thin solid lines represent the ideal +6 dB. Note that two of the curves have been shifted, and grid lines have been removed, for clarity.

A separate, larger table (T2 in Table 1) was studied in simulations. A table size was chosen that might be used in small/medium video conference setups. Fig. 3 shows the configuration of the loudspeaker and microphone for table T1 in the measurements in the anechoic room. The loudspeaker was placed at a horizontal distance of 1.38 m from the table and a height of 0.78 m above the table (to the center of the woofer). The microphone was placed in three positions as shown, and the microphone was laid down on the table, so that the microphone membrane was perpendicular to the table with its center at a height of 7 mm. Both the loudspeaker and the microphone were shifted 0.125 m away from the center axis. Fig. 4 shows the table used in the simulations, table T2, with a length of 3.2 m and a width of 1.4 m. The loudspeaker was placed in three different locations, representing possible talker locations. The first location was at a horizontal distance of 0.5 m from the table's front edge, at a height of 0.4 m above the table, and laterally centered on the table's center axis. The second and third locations were along the long sides of the table, at a horizontal distance of 0.5 m from the side edge of the table and at the same height as Pos. 1, 0.4 m. The microphone was centered on the table's center axis and placed in three positions. A boundary microphone was simulated with the membrane at a height of 1 mm above the table.

3 RESULTS

3.1 Verification of Simulation Method with Small Table T1

Fig. 5 shows the measured frequency responses in the anechoic room and the simulations done using the edge

diffraction prediction model. As described in Section 2, the free-field response of the loudspeaker was measured separately and used to suppress the effect of the loudspeaker and microphone responses. It can be noticed that the simulations are very close to the measurements: between 200 Hz and 10 kHz the difference is on average 0.94 dB. Apparently the simple point-source modeling worked well. Furthermore, the ideal +6 dB cannot be practically achieved across all frequencies for a relatively small table. This comparison confirms that the edge diffraction method can predict quite accurately the effect of a table on the microphone's frequency response.

3.2 Diffuse Field Response for the Microphone on the Small Table T1

In a reflective room, the ideal +6 dB for the direct sound is complemented by an ideal +3 dB for a diffuse incidence sound field. In order to simulate the diffuse-field response in an anechoic room, a set of 1000 sources regularly distributed over a sphere (of radius 1000 m) around each one of the three microphone positions was constructed. The simulated response, for all incidence angles, was squared and averaged for each microphone position, and the results are shown in Fig. 6. Also for the diffuse-field response, the table causes quite a deviation from the ideal +3 dB. At high frequencies, the microphone height of 7 mm causes a strong roll-off around 10 kHz.

3.3 Simulations for Typical Table Size, T2, Distributed Talkers and a True Boundary Microphone

Simulations were done to investigate the effect of a slightly larger and more realistically sized table, table T2, on the direct sound for different positions of talkers around the table.
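The strong high-frequency roll-off for a membrane 7 mm above the table follows from simple two-path interference. The sketch below treats normal incidence only, so it is an order-of-magnitude check rather than a reproduction of the simulated curves (oblique incidence moves the notch).

```python
import numpy as np

def boundary_gain_db(f, h, c=343.0):
    """Level of direct plus plane reflection at a membrane h (m) above
    a rigid surface, re the direct sound alone, for normal incidence:
    the reflected path is 2*h longer than the direct path."""
    return 20 * np.log10(np.abs(1 + np.exp(-2j * np.pi * f * 2 * h / c)))

print(boundary_gain_db(100.0, 0.007))    # ~ +6 dB: paths still in phase
print(boundary_gain_db(12000.0, 0.007))  # deep dip: 2h near half a wavelength
```

The first null sits at f = c/(4h), about 12 kHz for h = 7 mm but about 86 kHz for h = 1 mm, which is consistent with the reduced high-frequency dip for the simulated 1 mm boundary microphone.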

INFLUENCE OF A TABLE ON A MICROPHONE'S FREQUENCY RESPONSE AND DIRECTIVITY

4 CONCLUSIONS

From the measurements done in the anechoic room, the simulation method used was confirmed to be accurate, with average deviations for the 1/3-octave band smoothed response being around 0.9 dB from 200 Hz to 10 kHz. Conference tables might, due to their finite size, lead to frequency response variations for the direct sound of +/− 2–3 dB within the frequency range 200 Hz to 10 kHz. For smaller tables, or reflectors, variations could be even larger. As expected, boundary microphones need to be very close to the table surface (a few mm) in order to avoid clear high-frequency roll-off effects.


Figs. 7(a)–(c) show the simulation results for the three loudspeaker/talker positions around the table as illustrated in Fig. 4. Comparisons can be made between the results for the slightly larger table in Figs. 4 and 7 and the smaller table in Figs. 3 and 5. First, it is clear that placing a microphone at 1 mm height rather than 7 mm height reduces the high-frequency dip, as can be seen in Figs. 7(a)–(c). Furthermore, the low-frequency roll-off, which was prominent in Fig. 5, is quite substantial also for the larger table T2 in Fig. 7. For the studied case, it seems that this roll-off starts from a peak around 2 kHz, except for positions 1 and 2 in Fig. 7(a), where the roll-off starts at a lower frequency. Apparently, where this "cut-off frequency" ends up will depend on the arrival times of the two distinct diffraction contributions, from the two edges that are perpendicular to the line from loudspeaker to microphone. The longer the distance to these two edges, the further down in frequency this roll-off can be pushed. Finally, the general unevenness of the response is around −3 dB to +2 dB relative to the ideal +6 dB response, in the frequency range from 200 Hz to 10 kHz.
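The dependence on the edge arrival times can be illustrated with a crude rule of thumb (not the paper's prediction model): take the frequency whose half wavelength matches the smallest detour via the relevant edges. All coordinates below are assumptions made for illustration only.

```python
import math

C = 343.0  # speed of sound in air, m/s

def detour(src, mic, edge_pt):
    """Extra path length of a diffraction contribution via a point on an edge."""
    return math.dist(src, edge_pt) + math.dist(edge_pt, mic) - math.dist(src, mic)

def rolloff_onset(src, mic, edge_pts):
    """Frequency whose half wavelength equals the smallest edge detour --
    a crude proxy for where the low-frequency roll-off sets in."""
    d = min(detour(src, mic, e) for e in edge_pts)
    return C / (2.0 * d)

# Assumed illustrative geometry, loosely T2-like (all coordinates in m):
src = (-0.5, 0.0, 0.4)     # talker 0.5 m beyond the near edge, 0.4 m up
mic = (1.0, 0.0, 0.001)    # boundary microphone lying on the table
edges = [(0.0, 0.0, 0.0), (3.2, 0.0, 0.0)]   # near and far edge points
print(round(rolloff_onset(src, mic, edges)), "Hz")
```

For this assumed geometry the estimate comes out a little below 2 kHz, in the same region as the peak noted above; a real prediction must of course come from the edge diffraction model itself.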

Fig. 6. Simulations of the effect of placing a microphone on table T1 on the average/diffuse-field frequency response. The responses are 1/3-octave band smoothed and the thin solid lines represent the ideal +3 dB. Note that two of the curves are shifted for clarity.


Fig. 7. Simulated frequency responses of a microphone placed on table T2. The responses are 1/3-octave band smoothed and the thin solid lines represent the ideal +6 dB, for (a) Loudspeaker 1, (b) Loudspeaker 2, (c) Loudspeaker 3. Note that two of the curves in each diagram have been shifted for clarity.


ZIDAN AND SVENSSON


5 REFERENCES

[1] J. Eargle, The Microphone Book, second edition (Focal Press, Burlington, MA, USA, 2004).
[2] Boundary Microphone Application Guide: Boundary Microphone Theory and Applications for Crown Boundary Microphones: MB, PCC and PZM Series (Crown Inc., Elkhart, IN, USA, 2000).
[3] Boundary Layer Solutions for Installed Sound (AKG Inc., Northridge, CA, USA).
[4] The Directional Boundary Microphone (Bartlett Microphones Inc., Elkhart, IN, USA, 2009).
[5] J. Vanderkooy, "A Simple Theory of Cabinet Edge Diffraction," J. Audio Eng. Soc., vol. 39, pp. 923–933 (1991 Dec.).
[6] M. Urban, C. Heil, C. Pignon, C. Combet, and P. Bauman, "The Distributed Edge Dipole (DED) Model for Cabinet Diffraction Effects," J. Audio Eng. Soc., vol. 52, pp. 1043–1059 (2004 Oct.).
[7] U. P. Svensson, R. I. Fred, and J. Vanderkooy, "An Analytic Secondary Source Model of Edge Diffraction Impulse Responses," J. Acoust. Soc. Am., vol. 106, no. 5, pp. 2331–2344 (1999).
[8] H. F. Olson, "Direct Radiator Loudspeaker Enclosures," J. Audio Eng. Soc., vol. 17, pp. 22–29 (1969 Jan.).
[9] E. Mommertz, "Angle-Dependent In-Situ Measurements of Reflection Coefficients Using a Subtraction Technique," Appl. Acoust., vol. 46, no. 3, pp. 251–263 (1995).
[10] M. Long, Auditorium Acoustics (Academic Press, Boston, MA, USA, 2005).
[11] Y. Le, Y. Shen, and L. Xia, "A Diffractive Study on the Relation between Finite Baffle and Loudspeaker Measurement," J. Audio Eng. Soc., vol. 59, pp. 944–952 (2011 Dec.).
[12] http://www.iet.ntnu.no/~svensson/software/
[13] http://www.winlms.com

THE AUTHORS

Hassan Zidan

Hassan EL-Banna Zidan was born in Cairo, Egypt, in 1975. He received a B.Sc. and an M.Sc. in electronics and telecommunications engineering in 1998 and 2007, respectively, both from the Arab Academy for Science, Technology, and Maritime Transport (Alexandria/Cairo). Since 2008 he has been a Ph.D. candidate in the Acoustics group at the Norwegian University of Science and Technology (NTNU), Trondheim, Norway. His research interests include audio signal processing, speech enhancement, and room acoustics.


Peter Svensson

Peter Svensson received an M.Sc. degree in engineering physics in 1987 and a Ph.D. degree in 1994, both from Chalmers University of Technology, Gothenburg, Sweden. He has held postdoctoral positions at Chalmers University, the University of Waterloo, Ontario, Canada, and Kobe University, Japan. Since 1999 he has been a professor in electroacoustics at the Norwegian University of Science and Technology, Trondheim, Norway. His main research interests are auralization, especially computational room acoustics and sound reproduction techniques, measurement techniques, and perceived room acoustical quality. Prof. Svensson has published 33 journal papers and more than 100 conference papers. He is currently vice president of the European Acoustics Association and a past president of the Norwegian Acoustical Society. He is an associate editor in the field of electroacoustics for Acta Acustica united with Acustica. In 2001 he received a best paper award, together with Johan L. Nielsen, for authors 35 years or younger from the Journal of the Audio Engineering Society, and in 2009 a best paper award from the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), together with Haohai Sun and Shefeng Yan.



STANDARDS NEWS

Detailed information regarding AES Standards Committee (AESSC) proceedings including formal notices, structure, procedures, reports, meetings, and membership is published on the AES Standards Web site at http://www.aes.org/standards/. Membership in AESSC working groups is open to any individual materially and directly affected by the work of the group. For current project schedules, see the project-status document, also on the Web site. AES standards are published at http://www.aes.org/publications/standards/. For its published documents and reports the AESSC is guided by International Electrotechnical Commission (IEC) style as described in the ISO/IEC Directives, Part 2. IEC style differs in some respects from the style of the AES as used elsewhere in this Journal. AESSC document stages referenced are: Project initiation request (PIR); Proposed task-group draft (PTD); Proposed working-group draft (PWD); Proposed call for comment (PCFC); Call for comment (CFC).

Newly Published


AES64-2012, AES standard for audio applications of networks—Command, control, and connection management for integrated media, has been published. This standard for networked command, control, and connection management for integrated media is an IP-based peer-to-peer network protocol, in which any device on the network may initiate or accept control, monitoring, and connection-management commands. The AES64 protocol has been developed around three important concepts: structuring, joining, and indexing. Every parameter is part of a structure, and control is possible at any level of the structure, allowing control over sets of parameters. Parameters can be joined into groups, thereby enabling control over many disparate parameters from a single control source. Every parameter has an index associated with it and, once discovered, this index provides a low-bandwidth alternative for parameter control.

AES65-2012, AES standard for interconnections—Connector for surround microphones, has been published. An increasing number of surround sound microphones are becoming available; however, there has been no common standard for the connectors between microphone and recording device. It is expected that a standard connection will create a basis for

smaller and lighter recording devices. This standard specifies a connector type and contact assignment for microphones having up to six balanced analog output channels, as used in surround sound applications. It includes specifications for marking and identification of the audio channels, and recommendations for cable type and detailed wiring. It is expected that other applications will also use this connection.

AES66-2012, AES standard for professional audio equipment—Application of connectors—Miniature XLR-type polarity and gender, has been published. This standard is intended to apply to three- and five-pin circular connectors, commonly and generically known as miniature XLR-type, used for the balanced interconnection of all categories of sound system components for professional audio, commercial, recording, broadcast, and similar applications, regardless of function, type, or level of the signal. It specifies the application and polarity of analog signals for these connectors and is intended to avoid the inversion of absolute polarity among the items in the analog signal chain.

AES55-2012, REVISED AES standard for digital audio engineering—Carriage of MPEG Surround in an AES3 bitstream, has been published. MPEG-D is an ISO/IEC standard



describing MPEG Surround, which extends mono or stereo audio toward multiple channels. The mono or stereo audio channels represent a downmix of the original multichannel audio that is generated by the MPEG Surround encoder. In addition, the MPEG Surround encoder generates spatial side information (MPEG Surround data). An MPEG Surround decoder is able to combine this information with the downmix to produce a multichannel audio signal. In this way backward compatibility with mono and stereo systems is achieved. More recently, MPEG-D has been revised to include MPEG SAOC (Spatial Audio Object Coding), which uses the same method to convey the related side information over PCM. This standard specifies how MPEG Surround or MPEG SAOC shall be carried within an AES3 bitstream, where the downmix channels remain in the linear PCM domain and the MPEG Surround or MPEG SAOC data is embedded into the least significant bits of the PCM audio data.

AES31-2-2012, REVISED AES standard on network and file transfer of audio—Audio-file transfer and exchange—File format for transferring digital audio data between systems of different type and manufacture, has been published. The Broadcast Wave Format is a file format for audio data. It can be used for the seamless exchange of audio material between (i) different broadcast environments and (ii) equipment based on different computer platforms. As well as the audio data, a BWF file (BWFF) contains the minimum information—or metadata—that is considered necessary for all broadcast applications. The Broadcast Wave Format is based on the Microsoft WAVE audio file format; this specification adds a "Broadcast Audio Extension" chunk to the basic WAVE format.

This revision of AES31-2-2006 incorporates the content of Amendment 1 (2008), which specifies an optional Extended Broadcast Wave Format (BWF-E) file format, designed to be a compatible extension of the Broadcast Wave Format (BWF) for audio file sizes larger than a conventional Wave file. It extends the maximum size capabilities of the RIFF/WAVE format by increasing its address space to 64 bits where necessary. BWF-E is also designed to be mutually compatible with the EBU T3306 "RF64" extended format. This revision additionally packages a set of machine-readable loudness metadata into the BWF file. This is compatible with EBU v2 broadcast wave files.

Calls for Comment

The following documents will be published by the AES after any adverse comment received within six weeks of the publication of its call on the AES Standards Web site has been resolved. For more detailed information on each call please go to www.aes.org/standards/comments. Comments should be sent by e-mail to the secretariat at [email protected]. All comments will be published on the Web site.

A Call for Comment on DRAFT REVISED AES2-xxxx, AES standard for acoustics—Methods of measuring and specifying the performance of loudspeakers for professional applications—Drive units, was published 2012-12-22.

A Call for Comment on DRAFT REVISED AES-1id-xxxx, AES Information document for acoustics—Plane-wave tubes—Design and practice, was published 2012-12-22.


FEATURE ARTICLE

Mastering for today's media

Francis Rumsey, Staff Technical Writer

Mastering for today's music delivery media requires an understanding of recent developments in audio coding and in physical and online formats. The advent of Apple's "Mastered for iTunes" program gives rise to new challenges, which were tackled by a panel of experts at the 133rd Convention.

The wide range of current delivery formats for music has presented new challenges to mastering engineers in recent years. Mastering for today's media differs from the past because of a shift from the limited number of physical formats previously used, such as CD, to the plethora of digital files and streaming options available today. The ubiquity of iTunes and Apple's "Mastered for iTunes" initiative also makes it de rigueur for mastering engineers to have a grasp of the requirements and techniques involved. Two workshops at the recent AES Convention in San Francisco tackled the topic head on: "Are You Ready for the New Media Express?," chaired by Jim Kaiser, and "Platinum Mastering: Mastered for iTunes," moderated by Bob Ludwig.

ARE YOU READY FOR THE NEW MEDIA EXPRESS?

Jim Kaiser opened his workshop by explaining that the processes of recording and mastering were set up in a particular way in the past because of the requirements of the specific delivery formats used at the time. There were defined methods to prepare recordings for vinyl records and tape cassettes, for example, and reel-to-reel analog masters were prepared for these. Various shiny-disk formats followed, and something of a format war ensued between some of the later ones, not necessarily much to do with audio quality. Now there are numerous online digital delivery formats, the question of surround sound to deal with, and the physical medium that is Blu-ray Disc. His panel, consisting of Robert Bleidt from Fraunhofer USA, Stefan Bock of MSM Studios, Bob Katz of Digital Domain Mastering, and Morten Lindberg of 2L, concentrated on how to get the best results from this new group of formats.

ARCHIVE AUDIO QUALITY

Morten Lindberg concentrated on questions of archive audio quality, noting that there are two domains—the analog one that contains all of the traditional operations involving acoustics, microphone placement, and human reception, and the digital "framework" within which one chooses to work. In his company's case they have chosen to work with convertors that sample at the very high rate of 5.6 MHz, from which it's possible to format the bitstream either as DSD (Direct


Stream Digital) or PCM. PCM is preferred here for its flexibility in processing, and very high-quality archive masters are created at a sampling rate of 352.8 kHz, which can be down-sampled for various delivery formats. This rate also acts as a useful intermediate format for preparing DSD masters if needed. It's not up to the record label to dictate what format the customer should use, according to Lindberg. Some of 2L's customers still prefer physical media, and for this the Pure Audio Blu-ray Disc is an excellent solution. Physical media still account for some 50% of the revenue in 2L's case. For those customers who say they will never go back to physical media after departing from the CD, it's relatively easy to provide a range of different file formats for online delivery, including FLAC and others. The customer can then choose what they want to use. Stefan Bock commented that for high-resolution surround sound material there is little alternative but to use physical media because of the amount of data involved. There is apparently not much momentum behind high-res surround music file distribution, although occasional use is made of 5.1-channel FLAC files.
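The 352.8 kHz archive rate divides evenly into the common delivery rates: it is exactly 8 × 44.1 kHz, so producing a CD-rate file is an integer-factor decimation. A minimal windowed-sinc sketch of that operation follows; it is illustrative only, and production sample-rate converters use far more sophisticated filters.

```python
import numpy as np

def downsample_8x(x, taps=255):
    """Decimate from 352.8 kHz to 44.1 kHz: low-pass filter at the new
    Nyquist frequency (22.05 kHz), then keep every 8th sample."""
    m = 8
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / m) / m          # ideal low-pass at 1/8 of the old Nyquist
    h *= np.hamming(taps)           # window to make the FIR realizable
    y = np.convolve(x, h, mode="same")
    return y[::m]

x = np.random.default_rng(0).standard_normal(3528)   # 10 ms at 352.8 kHz
y = downsample_8x(x)
print(len(y))                                        # 441 samples = 10 ms at 44.1 kHz
```

The anti-aliasing filter before decimation is the essential step: simply keeping every 8th sample would fold everything above 22.05 kHz back into the audio band.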

PURE AUDIO BLU-RAY

Stefan Bock's background has included DVD-Audio, DVD-Video, SACD, and Blu-ray. Recently he has been working on Pure Audio Blu-ray, which he called a new way to bring high-definition audio to a wide range of homes. There's an option for multichannel audio and MP3 content, as well as additional content that the artist wants to include. These disks will play on any standard Blu-ray player, as opposed to the former situation with DVD, which required a special DVD-Audio player to play DVD-A disks. It's important, said Bock, that the user doesn't have to turn on his television to control disk replay—it should be possible simply to put in a disk and play it. However, without some means of display and control it can be difficult to select the audio stream to be replayed, so a solution was needed. Every player has the standard four colored buttons on the remote control, so Bock found a way of using these buttons to select the audio format (e.g., 2-channel stereo, 5.1 surround). There is also a display available for use on a TV screen if one is connected. Java code stored on the disks is used to implement some of the Pure-Audio-specific features. At the moment the only company that can offer this option in mastering is msm, but they are working closely with others to try to make it available elsewhere. Pure Audio Blu-ray can work at up to 192-kHz/24-bit resolution, and there are also the losslessly compressed HD formats introduced by Dolby and DTS, as well as FLAC and MP3. It's possible to have all this content, including 7.1 surround, on one Blu-ray disk without

Stefan Bock, the leading light of the Pure Audio Blu-ray format

running out of space as one can store around 50 Gbytes of data. mShuttle is an option that was introduced for Pure Audio Blu-ray disks, which turns the player into a small web server, enabling audio files to be served from the player over the home network to other devices. That way the content can be used on portable media devices or played out of alternative file-based audio players. This requires a player working to “Profile 2.0,” which was introduced in 2009 and includes the provision of network functionality.

ISSUES WITH AAC AND MP3 Robert Bleidt of Fraunhofer spoke about “how to make great-sounding tracks using AAC and MP3.” (AAC and MP3 are both lossy compressed audio formats.) AAC has an installed base of about 5 billion devices, said Bleidt, across all platforms, players, and browsers. When mastering for compressed tracks it’s a good idea to start off with the highest quality master possible and back off on any clipping or hard limiting. Using the best encoding software possible is another important factor, as not all systems are created equal, and it’s also important to check the sound quality on a simple decoder to get an idea of how it will sound on older phones or music players. Distributors don’t usually accept encoded bit streams, so there are different degrees of control that you may have over the delivered content. Depending on the nature of the mix and the mastering style employed, it’s possible that you may not need to make separate CD and AAC masters. A best-case scenario when it comes to control over the content can be found in Apple’s “Mastered for iTunes” program (see below). Before this came along, you sent a CD to Apple, they encoded it, and you hoped it sounded good. With the new system you can present a 24/96 master recording to a copy of the encoding software yourself, listen to the result of the encoding/decoding, and tweak the master accordingly. The original file is then sent to Apple, which encodes it using the same process, so that what the customer gets is exactly the same as what you heard. The Sonnox/Fraunhofer plug-in also allows you to audition the results of encoding and decoding in real time, using conventional workstation software. As far as Bleidt was aware, Apple is the only operation to offer such a mastering program, so it’s more difficult to tell with other

Robert Bleidt of Fraunhofer USA discusses issues with AAC and MP3.

services what the quality of the distributed audio will be. If you can hear coding artifacts when using data-compressed audio, it's very hard to offer generic advice about what to do to improve matters, said Bleidt. Fraunhofer is always interested to hear examples but has yet to perfect a "remove artifacts" feature in its encoders, he quipped. Sometimes it can help to back off the peak levels a little in order to avoid internal clipping problems that can lead to artifacts. This rather suggests the need to avoid the tempting level-maximization options often offered in workstation software. The clipping behavior of decoders varies, with some of them including mechanisms for avoiding clipping when signal levels are very hot. Simple decoders on older players, such as the VLC player for PC, tend to exhibit more noticeable clipping under these circumstances, and the Sonnox decoder can be used to audition the worst effects of decoder clipping. It's also likely that the effects of any encoder clipping will be more audible the lower the bit rate, so it's particularly important to keep levels down when compressing material for very low bit-rate streaming, for example. The most challenging tracks for encoding are usually single instruments, which don't give rise to a lot of the masking that arises in a busier track. If you reduce the bit rate far enough with any material then there is no doubt you will hear artifacts, but at the bit rates typically used for downloads these days artifacts are rarer. (Whereas previously a bit rate of 128 kbit/s was used for iTunes tracks, they are now encoded at 256 kbit/s.) Double-blind comparison tools are available in the Sonnox plug-in that enable users to compare the original and encoded versions without knowing which


is which. This allows a much more reliable way of determining whether artifacts are audible, as people are notoriously unreliable when they think they know what they're listening to. It's best to home in on short sections where artifacts are thought to be audible, then do an ABX comparison to discover what percentage of the time the encoded and original can be distinguished from each other.
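The bookkeeping of such an ABX session is simple to sketch. This is a toy harness with a callable standing in for actual playback and listening; the function names and the 16-trial count are illustrative assumptions, not part of any particular tool.

```python
import random

def run_abx_trials(listener, n_trials=16, seed=None):
    """Score an ABX session: on each trial X is randomly the original (A) or
    the encoded version (B), and the listener must say which it was.
    `listener` is a callable returning True if it judges X to be the original."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        x_is_original = rng.random() < 0.5   # hidden assignment of X
        correct += listener() == x_is_original
    return correct

# A random guesser scores near n/2; a listener who genuinely hears the coding
# artifacts scores well above that. With 16 trials, 12 or more correct answers
# has p < 0.04 under the guessing-only (binomial) null hypothesis.
guesser = lambda: random.random() < 0.5
print(run_abx_trials(guesser, 16), "/ 16 correct")
```

The hidden random assignment of X on every trial is what makes the test double-blind: neither the listener nor the presenter can bias the outcome by knowing which stimulus is playing.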

MASTERED FOR ITUNES

Bob Katz was also a speaker at Bob Ludwig's Platinum Mastering session on Mastered for iTunes. Eric Boulanger of The Mastering Lab, the second speaker, had mastered the very first Mastered for iTunes title, by Colbie Caillat.

KATZ COMMENTS

Lossy coding gives rise to a number of challenges for mastering engineers, said Bob Katz. Lossy-coded material should never be used as original source material, he suggested, because re-encoding causes a build-up of artifacts that will make the sound quality worse. It's therefore really important to hang on to the original PCM masters of projects, rather than relying on MP3 or AAC files that might otherwise be used if nothing else is available. Robert Bleidt commented that in the broadcasting world they do use AAC as a contribution codec for source material, and tandem coding (multiple generations of coding) is done on a more regular basis. The quality can be satisfactory for the purpose up to several generations of encoding if a reasonably high bit rate is used. Jim Kaiser noted that the quality requirements differ between applications, so one shouldn't necessarily expect broadcasting to be the same as music distribution.

The importance of avoiding clipping when using lossy formats was reinforced by Bob Katz—it can give rise to very nasty artifacts, worse than with linear PCM. Even original signals just below clipping can give rise to problems, either internally—within the encoder, on decoding, or during oversampled D/A conversion.

Before Mastered for iTunes (MfiT), tracks for iTunes were either simply ripped from CDs or taken from the major record company servers and loaded into iTunes Producer. With MfiT, AAC encoding is done from 24-bit masters, often with lowered level to avoid clipping and get a much cleaner result. (The aim is to get the best results out of the 256 kbit/s constrained variable bit-rate (CVBR) of the iTunes Plus format.) All the encoding for an iTunes release is done by Apple, and there is identical free Apple software called "afconvert" that enables users to do the same thing themselves before submitting masters. The first process in this software is Sound Check, which looks at the relative loudness levels of songs to be encoded and attempts to determine how much their levels should be raised or lowered on replay to make their loudness comparable. It adds metadata that can be used by players to avoid loudness differences when tracks are played alongside each other. If the track is at a higher sampling frequency than 44.1 kHz it is down-sampled to 44.1 kHz; otherwise it is left alone. There is also a process that will convert the AAC-encoded track back to PCM so that you can hear the decoded version. "afclip" looks for likely on-sample and inter-sample clips, behaving like a true-peak-reading meter, enabling the user to determine the potential for encoder and post-decoder clipping. Even when no clips are indicated, Ludwig suggested that it can be necessary to experiment with different amounts of level reduction and compare the encoded result to the 24-bit master, as the sonic differences between settings can be quite dramatic.

Bob Ludwig (center) with Bob Katz (left) and Eric Boulanger (right), before the "Mastered for iTunes" Platinum Mastering workshop.

MfiT SAMPLE RATE CONVERSION

Apple's own documentation states that it prefers to receive high-resolution masters at sampling frequencies above 44.1 kHz, preferably 96 kHz. That way the encoding process uses its mastering-quality sample-rate conversion, which generates 32-bit floating-point CAF files as the input to AAC encoding. It's claimed that this avoids the need for redithering and preserves all the dynamic range inherent in the original file, avoiding the potential for aliasing or clipping that can otherwise arise in sample-rate conversion. If you supply 44.1-kHz files to Apple, the advantages of the above process are bypassed, as the sample-rate conversion is not initiated.

After the track is transferred to Apple, it is encoded in exactly the same way as the user would have done. "Test pressings" are then returned to the record company to confirm what is about to be released on iTunes. Usually these turn out to be bit-for-bit the same as the final encoding created by the mastering engineer, which confirms the integrity of the process. According to Ludwig, MfiT has been a tremendous success, having been embraced by all the major record companies, with substantial support from recording and mastering engineers, and with options for independent artists. When properly done the results are excellent and the advantages are clear, with quality that is much closer to what the artist and engineer intended. While it might be asked why Apple would not go directly to lossless downloads, Ludwig pointed out that the installed base of some 400 million Apple music players, as well as Windows players that access iTunes, made it necessary to retain compatibility. Apple had spent 18 months figuring out the best possible AAC encoding method to achieve excellent sound quality, considering the "player landscape" in question. John Coltrane's "Blue Train" track was a good example of the file-size advantages of using AAC over lossless—the AAC track was 3.2 times smaller than the lossless version—so it can be seen that going to lossless would make downloads much longer and make the entire experience rather sluggish, as


APPLE’S MfiT TOOLS

The tools contained in the current free mastering suite that can be downloaded from Apple include the Master for iTunes Droplet, which is used to automate the creation of iTunes Plus masters. The suite of tools requires at least the Snow Leopard (10.6) version of OS X to run. The droplet needs either AIFF or WAVE files to be provided as source material and converts them temporarily to Apple's Core Audio Format (CAF) with a Sound Check metadata profile attached that can normalize the relative loudness levels of songs on replay. AAC files are then encoded. afconvert is a command-line utility that enables more direct control over all of the above MfiT encoding operations. AURoundTripAAC is an Audio Unit (AU) that allows the comparison of encoded audio against the original source file, and also includes clip and peak detection (see screen shots below). There is a listening facility similar to that provided in the Sonnox/Fraunhofer Pro Codec plug-in that allows a simple double-blind ABX test to be set up, in order that users can check whether they can reliably tell the difference between source and encoded versions. The plug-in can be used with workstation software that conforms to the AU plug-in format, such as Logic, or alternatively the AU Lab application can be used to run the process.
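Apple has not published how Sound Check measures loudness, but the general idea of replay-gain metadata can be sketched with a plain RMS match. Everything here is illustrative: the function name, the −16 dB reference, and the RMS measure are assumptions, not Apple's algorithm.

```python
import numpy as np

def loudness_offsets_db(tracks, reference_db=-16.0):
    """Per-track replay gain (dB) that brings each track's RMS level to a
    common reference. Purely illustrative: Sound Check's actual measurement
    is Apple's own and is not a plain RMS like this."""
    offsets = []
    for x in tracks:
        rms_db = 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)
        offsets.append(reference_db - rms_db)   # positive = turn up on replay
    return offsets

rng = np.random.default_rng(0)
quiet = 0.05 * rng.standard_normal(44100)   # one second of low-level noise
loud = 0.5 * rng.standard_normal(44100)     # same material, 20 dB hotter
g_quiet, g_loud = loudness_offsets_db([quiet, loud])
print(round(g_quiet - g_loud, 1), "dB")     # offsets differ by ~20 dB
```

The key design point matches the article's description: the audio itself is untouched, and only a metadata gain value is stored for the player to apply at replay time.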

Display of WAVE file generated by afclip showing impulses marking clipped samples in the right (lower) channel

Tables showing instances of clipping, total number of clipped samples and pinned samples

AU Lab is a free standalone digital mixer utility that lets you use AUtype plug-ins without needing an AU-compatible DAW. afclip is a Unix command line tool that can be used to check a file for on–sample and inter-sample clipping. Inter-sample clipping can


arise in oversampling D/A converters used after decoding, for example. (Four-times oversampling is used to estimate sample values in afclip.) When mastering a track for iTunes that peaks very close to digital maximum, it’s necessary to check it using this tool and reduce the level slightly until an acceptable number of clips is indicated (usually zero, unless a small number turn out to be inaudible). If there’s any on-sample clipping, the output of this process is an audio file (.wav) in which the left channel carries the original audio and the right channel contains impulses where the audio is clipped, so that clips can be quickly located visually in a digital audio editor. A table also comes up in the Terminal window (see screen shots above) showing the timing locations of clips and the amount by which the samples exceed the clipping point. “Pinned samples” can also be reported—that is, any run of consecutive samples at a digital level of exactly ±1.0 (peak level), which suggests that on-sample clipping may have occurred. Finally, the Audio to WAVE Droplet converts files that are in other audio file formats (any supported by Mac OS X) to the WAVE format. This only works at 44.1 kHz/24 bits at the moment.
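The inter-sample check afclip performs can be illustrated in a few lines: interpolate the waveform between samples (here with a windowed sinc at four points per sample, matching afclip’s stated 4x oversampling) and compare the reconstructed peak against full scale. This is an independent sketch of the technique, not Apple’s implementation:

```python
import math

def true_peak(samples, oversample=4, taps=16):
    """Estimate the true (inter-sample) peak by windowed-sinc interpolation
    at `oversample` points per sample."""
    n = len(samples)
    peak = 0.0
    for i in range(n * oversample):
        t = i / oversample                     # fractional sample position
        acc = 0.0
        for k in range(max(0, int(t) - taps), min(n, int(t) + taps + 1)):
            x = t - k
            if abs(x) > taps:
                continue                       # outside the windowed kernel
            if x == 0.0:
                acc += samples[k]
            else:
                sinc = math.sin(math.pi * x) / (math.pi * x)
                hann = 0.5 + 0.5 * math.cos(math.pi * x / taps)
                acc += samples[k] * sinc * hann
        peak = max(peak, abs(acc))
    return peak

# A tone at fs/4 whose samples never exceed 0.849, but whose waveform peaks at 1.2:
sig = [1.2 * math.sin(math.pi * n / 2 + math.pi / 4) for n in range(200)]
print(max(abs(s) for s in sig) < 1.0, true_peak(sig) > 1.0)  # prints: True True
```

The example shows exactly the case afclip is designed to catch: every stored sample is below full scale, yet the reconstructed waveform a D/A converter must produce exceeds it.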

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February

well as consuming more processing power, memory, and so on. Most laptops, phones, and tablets only use 44.1 and 48 kHz for D/A conversion, so they are likely to down-sample anything else to get it to play from their analog outputs. Mac Snow Leopard can send digital audio data out at any sampling frequency up to 192 kHz and Windows 7 can go up to 96 kHz. However, in order to make this happen on the Mac it’s important to remember to change the output sampling frequency in the Audio MIDI Setup utility and then close and reopen iTunes; otherwise iTunes will continue to attempt to convert the material to the previous sampling frequency. (In other words, iTunes only appears to read the output sampling frequency from the audio preferences when it opens, not if it is changed subsequently.) A number of other iTunes-compatible players are available that interface directly with the Core Audio functionality of the Mac; these may provide better sonic results and switch sampling frequencies automatically.
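The file-size trade-off Ludwig described is straightforward arithmetic: size ≈ average bit rate × duration ÷ 8. A quick sketch (the 256 kbit/s iTunes Plus rate is from the article; the ~820 kbit/s lossless average is a hypothetical figure chosen to reproduce the 3.2× ratio quoted for “Blue Train”):

```python
def file_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate file size in MB from average bit rate and duration."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1e6

aac = file_size_mb(256, 600)        # 10-minute track at the iTunes Plus rate
lossless = file_size_mb(820, 600)   # hypothetical lossless average bit rate
print(round(aac, 1), round(lossless / aac, 1))  # prints: 19.2 3.2
```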

KATZ COMMENTS

Bob Katz had commented on the Mastered for iTunes project in the previously mentioned workshop, noting that although the iTunes Plus bit rate is nominally 256 kbit/s, it is normally a constrained variable-bit-rate format. This means the bit rate can go higher than that on occasion to allow for sections that are difficult to encode. It’s the first lossy format that he hasn’t felt embarrassed to use, he said, and the quality is surprisingly good. Because 24-bit PCM audio files are supposed to be submitted to Apple for encoding, it’s important to remember to dither the signal correctly at that resolution when rendering material from a digital audio workstation whose internal resolution may be 32-bit floating point. Although some claim the effects of dither at this level are inaudible, Katz reinforced the importance of doing it correctly for optimum sound field depth and audio quality. It’s not always clear whether workstation software is doing this automatically, and some surprises can be encountered. Apple stores everything for posterity using its own lossless format called ALAC (Apple Lossless Audio Coding), Katz said. So 24-bit masters submitted to them are first converted into this format for long-term archiving purposes, then they can

be rendered for delivery at whatever rate is required. This lossless storage is partly what allowed them to reissue earlier material at the higher bit rate of 256 kbit/s used in iTunes Plus. (iTunes tracks were encoded at 128 kbit/s until around 2009.)
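Katz’s point about dithering down to 24 bits can be sketched numerically: without dither, signal detail below the 24-bit LSB is simply truncated away, while TPDF dither of ±1 LSB preserves it in the long-term average. An illustrative sketch, not any particular workstation’s implementation:

```python
import random

def quantize_24bit(x: float, rng: random.Random, dither: bool = True) -> int:
    """Quantize a sample in [-1.0, 1.0) to signed 24-bit, optionally adding
    TPDF (triangular) dither of +/-1 LSB before rounding."""
    scale = 1 << 23                                        # 24-bit signed full scale
    d = (rng.random() - rng.random()) if dither else 0.0   # triangular pdf
    q = round(x * scale + d)
    return max(-scale, min(scale - 1, q))                  # clamp to the 24-bit range

rng = random.Random(42)
x = 0.3 / (1 << 23)                                # a signal sitting 0.3 LSB above zero
undithered = quantize_24bit(x, rng, dither=False)  # always rounds to 0: detail lost
mean = sum(quantize_24bit(x, rng) for _ in range(50000)) / 50000
print(undithered, round(mean, 2))  # dithered average recovers the sub-LSB value
```

The undithered path returns 0 every time, whereas the dithered output averages back to 0.3 LSB, which is why low-level detail survives the word-length reduction.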

MfiT IN PRACTICE

Eric Boulanger debunked the myth that MfiT is a new format. In fact, AAC encoding has been around for a long time and MfiT is just a new process for preparing and encoding AAC files. It puts mastering

WHY IT PAYS TO UNDERSTAND SOUND CHECK

In iTunes players, Sound Check, when turned on, scans through the audio information of music files and works out their average loudness. It then stores loudness-normalization metadata with the files, either as ID3 tags in MP3-like files or in the iTunes Music Library database. That metadata is then used to set the relative replay level of tracks, provided Sound Check is still turned on. This is an example of loudness normalization being performed at the end of the chain by the music player itself. It can therefore pay to normalize the loudness levels yourself to something like the –16 LUFS level before putting files through the Sound Check process when using MfiT. If you don’t, and the consumer then uses Sound Check on their player, overloud tracks will simply be corrected in any case and their potential artifacts exposed. If the track had been normalized correctly and the metadata applied during MfiT mastering, it would not then suffer any further loudness corrections on replay and would be more likely to sound as you intended.

for digital downloads into the foreground, instead of it being an afterthought to the CD mastering process. Now that digital downloads have overtaken CD sales, particularly for some artists, this seems important. Apple’s Sound Check process does something similar to Dolby’s dialnorm and the recent ITU-R BS.1770 standard, in that it attempts to measure the average


loudness of songs and add loudness-normalization metadata based on aiming for about 16 dB of headroom between the average loudness level and the peak signal level. It has the potential to end the loudness wars and restore the concept of headroom, suggested Boulanger, if it becomes enabled as a default function in iTunes. In fact, if someone masters a song too loud, its level will automatically be reduced by this process when the finished iTunes file is replayed, as long as Sound Check is active in the player. So it tends to reward good practice and discourage the recent tendency to “slam” levels. Ludwig suggested all his clients should listen to both a Sound Check-optimized version of a master and a more slammed version on an iTunes player with Sound Check turned on. In almost all cases it will persuade them to go with the optimized version, as the slammed version just sounds messy when the level is reduced. Bob Katz explained that the independent music website CD Baby (www.cdbaby.com) is now able to work with MfiT to publish higher-quality independent music through iTunes. Two licenses are required: one for submitting 16-bit masters for normal releases and one for submitting the 24-bit masters required by MfiT. There is no guarantee that your album will be featured on any of their MfiT promotional material or be identified as such, but it will benefit from the higher sound quality available from MfiT.
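The normalization arithmetic itself is simple: the replay gain is just the target loudness minus the measured loudness. A sketch using the –16 LUFS figure suggested in the sidebar (real Sound Check and BS.1770 meters derive the loudness from gated, frequency-weighted measurements, which this ignores):

```python
def normalization_gain_db(measured_lufs: float, target_lufs: float = -16.0) -> float:
    """Gain a loudness-normalizing player applies to hit the target level."""
    return target_lufs - measured_lufs

def apply_gain(samples, gain_db):
    """Scale linear sample values by a gain expressed in dB."""
    g = 10 ** (gain_db / 20)
    return [s * g for s in samples]

# A track "slammed" to -8 LUFS is turned down 8 dB on replay;
# one already at -16 LUFS passes through untouched.
print(normalization_gain_db(-8.0), normalization_gain_db(-16.0))  # prints: -8.0 0.0
```

This is the mechanism behind Boulanger’s point: a slammed master gains nothing on a Sound Check-enabled player, it is simply turned down, with its limiting artifacts intact.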

SUMMARY

The predominance of online music distribution today has changed the face of mastering. While there have been limitations in the quality of downloads available to date, new initiatives such as Mastered for iTunes have enabled mastering engineers to exercise greater control over the quality of data-compressed music downloads. There seems to be broad support for that particular initiative among audio professionals, as it provides a means to ensure that a better rendering of high-resolution audio masters is offered to consumers. It also has the potential to revitalize the work of mastering engineers because back catalogues may be remastered so as to get the best out of iTunes Plus. Editor’s note: To purchase and download mp3s (12AES-W06 and -SE06) of the 133rd panels highlighted in this article, go to www.mobiltape.com/conference/AudioEngineering-Society-133rd-Convention/.


134th Audio Engineering Society International Convention

Fontana di Trevi Conference Centre

Rome, Italy

Conference May 4th – 7th, 2013

Join the AES in Rome, the birthplace of engineering 2,000 years ago, to celebrate the latest audio advances. For over 60 years, the AES has been the largest gathering of audio professionals and enthusiasts on the globe, attracting delegates from over 100 countries worldwide. Workshops, tutorials, and research papers provide attendees with a wealth of learning, networking and business opportunities.

Listen, Learn, and Connect on these Topics • Loudness • Test & Measurement • Game Audio • Spatial Audio • Audio Signal Processing • Semantic Audio • Transducers • Perception • Acoustics

• Education • Recording & Production • Forensic Audio • Applications In Audio and Much More... For more information and to make reservations, visit our website at

www.aes.org Photo credit: Steve Johnson


SECTION NEWS

Racing game sound design

Nick Wiswell (center) of Microsoft’s Turn 10 Studios, who spoke on racing game sound, PNW chair Dave Tosti-Lane (left), and Scott Mehrens (right)

The PNW meeting on January 8 took “a deep dive into the interesting, diverse, and sometimes dangerous world of sound design for racing games” with Nick Wiswell, creative audio director for Turn 10 Studios (part of Microsoft). Nick revealed many details about how painstakingly sound is produced for modern car-racing computer games. More than a few attendees confessed to being car nuts as well as gamers. Nick’s duties include the sound direction for the Forza Motorsports game series, and he previously had similar duties at Bizarre Creations (UK) doing car game audio. He noted that a given car sounds the way it does due to its particular construction, which can be compared to a musical instrument, making sound from a combination of mouth, mouthpiece, tubing (intake and exhaust), and so on. He showed the formula used for calculating the fundamental note in Hz from the RPM and number of cylinders. To get the real individual flavors of the cars in the game, recordings are made of the real cars on a dynamometer under various performance conditions, with mics on the engine, exhaust, and intake areas, as well as from a distance. Some recordings are made during actual driving, but this has limitations. Nick stated that they do not just play the recordings, but rather use hundreds of driving physics parameters that adjust the audio during the game. Recordings are cut into segments appropriate to use with the physics logic. Combinations of crossfade loops and granular synthesis are used to stitch the sound back into the game, with full synthesis of great interest for the future. There are many other sound details/layers to consider, and attention to detail counts. A lot of DSP may be done to such things as exhaust sound changes, complex gearbox noises, tire noises (the most complex sound in their games, with over 30 surfaces to consider under myriad conditions and up to half the voice count), environmental sounds with reverb, early reflections, Doppler shifts, crowd sound, collision sounds, ambient music, and interface sounds. Nick demonstrated a little Forza Motorsport 4 on Xbox for us to show some of the sounds. Some Q&A revealed more interesting details: about mixing in wind noise, the size of Turn 10’s office, the fact that engine types can trump the actual car as far as sound goes, that 800+ cars have been recorded in the past 12 years, user testing and game authenticity checks, licenses from car manufacturers, suspension sounds, game hardware restrictions/CPU budgets, and dynamometers. Nick went into mixing audio for interactive games, which is not like mixing for linear media. He discussed several common methods used, hardware performance budgets, and accommodating various lo-fi platforms (he assumes most games will be played on a plain TV sound system). His YouTube video showed him and his crew recording destructive sounds on cars in the junkyard. Another lively Q&A period finished the meeting. A link to the Microsoft Research video of the meeting is available at www.aes.org/sections/view.cfm?section=158. Gary Louie
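The report doesn’t reproduce Nick’s RPM-to-Hz formula, but for a four-stroke engine the standard relationship is firing rate = (RPM/60) × (cylinders/2), since each cylinder fires once every two crankshaft revolutions. A sketch of that textbook formula (assumed here, not necessarily the exact slide shown):

```python
def engine_fundamental_hz(rpm: float, cylinders: int, strokes: int = 4) -> float:
    """Firing-rate fundamental: each cylinder fires once per strokes/2 revolutions."""
    revs_per_sec = rpm / 60.0
    fires_per_rev = cylinders / (strokes / 2)
    return revs_per_sec * fires_per_rev

print(engine_fundamental_hz(6000, 8))  # V8 at 6000 RPM -> prints: 400.0
```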

Neil Muncy tribute in Toronto

Neil A. Muncy, 1938–2012

Toronto Section’s meeting on December 18 paid tribute to Neil A. Muncy. It was an opportunity for those who knew him as a friend, who worked with him as a colleague, to share their thoughts and memories. A Life Member and AES Fellow, teacher at the Eastman School of Music Summer Advanced Recording Institute, and studio guru, Neil’s love of jazz drew him to a life of getting the best sound out of the audio technology of the time. Neil was a huge friend of the Toronto AES. This meeting was generously sponsored by Neil’s wife, Mary Muncy. After the end of the presentations, the evening included refreshments, socializing, and excerpts from the AES Oral History DVD: “An Interview with Neil Muncy.” This occasion was recorded for a future DVD. A tribute page for Neil Muncy is under development by the Toronto AES at www.torontoaes.org/Neil_Muncy-TRIBUTE.html. Karl Machat

Advertiser Internet Directory

* Rohde & Schwarz www2.rohde-schwarz.com


* AES Sustaining Member

We appreciate the assistance of the section secretaries in providing the information for these reports. Statements reported here are the personal opinions of the presenters, which may not represent the opinions of the AES or the audio industry at large.



Toronto Section at Koerner Hall

Clockwise from top left, John O’Keefe, Jim Hayward, Martin van Dijk, and Jeff Bamford

Moderator Rob DiVito began the January 8 meeting by highlighting the fact that the Toronto AES and SMPTE group often do a joint session at least once a year. This was one of those sessions. After taking care of section business, the meeting proceeded with a presentation and tour of Koerner Hall. Briefly, Koerner Hall was built over three years at a cost of some $110 million. The 1,135-seat hall is the jewel of the new TELUS Centre for Performance and Learning at the Royal Conservatory of Music. Since its opening on September 25, 2009, the hall’s design, flexible performance characteristics, and superb acoustics have been praised by critics and performers alike. The hall achieved the highest possible acoustic rating of N1, rendering it ideal for the finest acoustical performances of classical music, jazz, and world music. The incorporation of variable acoustics makes it equally well suited to amplified music, lectures, and film presentations. The hall features an innovative and almost invisible “voice stick” designed by Engineering Harmonics to maximize intelligibility, rather than sound reinforcement. Aercoustics was responsible for optimizing the acoustics. Presentations were given by John O’Keefe, Aercoustics; Jim Hayward, Radio College of Canada; Jeff Bamford, Engineering Harmonics; and Martin van Dijk, Engineering Harmonics. Rob presented Jeff and Martin with an AES certificate of appreciation. (Jim Hayward was also presented with a certificate at a later time.) The audience was invited to tour the facility individually and ask questions until closing time. Karl Machat

Digital asset management at Central Indiana

Al Grosnicklaus, director of engineering, WTHR-TV, Indianapolis; John Wright, assistant chief engineer for IT, WTIU-TV, Bloomington; and Brian McGinnis, counsel in the Intellectual Property Department of Barnes & Thornburg, gave presentations regarding aspects of digital media archiving at the January 16 meeting of the AES Central Indiana Section. Topics included workflow—ingesting, managing, searching, and tracking a database; recording formats; metadata—how much and what information is tracked; and rights—ownership, usage, authorization, and legal concerns. Al Grosnicklaus gave a presentation that demonstrated the news footage archiving process and databases for WTHR. He explained the evolution of archiving and metadata entry. John Wright demonstrated the digital rights

tracking procedures for WTIU, and specifically showed members how program airings were monitored. Brian McGinnis answered questions regarding the legality of using media owned by networks and media generated by private citizens. Fallon Stillman

Al Grosnicklaus, John Wright, Brian McGinnis

Audio over IP in Sweden

Patrik Eriksson, Lars Jonsson and Johan Boqvist

Lars Jonsson and Johan Boqvist gave a detailed presentation of the audio network technology used by the national public radio service, Sveriges Radio (SR), at the Swedish Section meeting on November 15. The organization is at the forefront of research and development of transport mechanisms for contribution networks and distribution networks alike. Audio over IP doesn’t always mean over the Internet. SR owns a dedicated gigabit fiber-optic network that physically spans the extreme reaches of Sweden. This allows the routing of audio between remote sites with secure, low-latency links. SR has also been running pilot projects to test new audio-over-IP systems for contributions. The motivation for developing new systems is the forthcoming withdrawal of ISDN services. Several manufacturers have already made hardware that is compatible with the network and protocols—this encourages competition and emphasizes the possibilities for interoperability. This led us on to a summary of the status of the AES X192 committee’s work on a high-performance streaming audio-over-IP interoperability standard. Johan explained that the draft of version 1.0 is about 50% complete with several points already closed. Still remaining is work on transport, encoding, and discovery, among other things. Patrik Eriksson from Luthman SMTTS AB talked us through the features of the Merging Technologies Horus audio processor and the implementation of the Ravenna audio protocol. Similar to other protocols such as Dante and Q-LAN, Ravenna is a Layer 3 IP-based digital audio transport with low latency and high channel count. This allows the use of standard Ethernet technology as a transport medium for digital audio. During the Q&A, concerns were expressed about security when passing audio over shared networks. SR solves this issue by leasing specially reserved VLAN channels with guaranteed bandwidth.
Luthman SMTTS very generously provided excellent refreshments in the form of a buffet that all the attendees thoroughly enjoyed. Thanks! Steven Liddle



AES Argentina sessions at Estudio Urbano’s Conectar 2012

The Argentina Section meetings at Estudio Urbano’s Conectar 2012 on November 27, 28, and 29 encompassed many interesting lectures. Here are some of the highlights. Natalia Sotelo’s talk was entitled “Soundtracks for Animation Movies.” From the very first experiments to the realization of Disney’s Fantasia, Natalia went through the development of soundtracks in both the technical and the expressive stages, showing how the need to convey human emotions extended to sonic vocabulary as much as the new technologies. This was supported by examples and extracts taken from the works of Georges Méliès, the Fischers, Norman McLaren, and Disney, to name a few. Ezequiel Morfi spoke about “The Imprint in Mastering.” In a workshop spanning two hours, Ezequiel spoke to an audience that was able to listen to several different versions of the same song: whether mastered for CD, mastered for vinyl, or remastered for a compilation or for an anniversary edition. Talk back and forth between Ezequiel and the crowd gave way to an analysis of how the mastering process can leave a particular imprint in a song, as shown and heard by tracks mastered 30 years ago and then remastered more recently. Songs by the Beatles, Guns N’ Roses, Oasis, and Yes were played back. To top this off, the presentation ended in some technical questions and the classic queries about the process of mastering, all of which were discussed by the keynote speaker. Indio Gauvron, AES Argentina’s chair, presented his seminar on the various topics related to tracking, referring to the use of microphones and their components, explaining the purposes of their application in different situations, and commenting on the monitoring of sound in uncontrolled environments. Basing his tutorial on some previous papers published in the AES Journal, Indio discussed such subjects as sample rate and bit rate, making a point of immersing the audience in the experience of listening to an orchestral fragment with varying bit-rate distortion, from the most apparent to the nearly imperceptible. Another experiment included real-time measurement using Smaart technology, through which Indio demonstrated that

small household items (such as clothes or a wooden board) can modify the frequency response of a room without the need to apply an equalization process to the audio signal. Part of Estudio Urbano’s CONECTAR Festival, held on December 1 and 2 at Parque Centenario in Buenos Aires, is the annual trade show, which brings together all the institutions that accompany, support, and participate in Estudio Urbano’s activities throughout the year. The fair offers a space for interaction among visitors as well as live performances from the indie bands who take part in Estudio Urbano’s music programs. For several years the AES Argentina Section has been present at the fair so that attendees can learn about our activities during the year and find out how they can become AES members, establishing a networking environment for all those who have been with us in the past and those who want to get close to the Society by supporting our initiatives for the future. Ezequiel Morfi, Iván Marcovick, Martín Szneider, Natalia Sotelo, Martín Diaz Velez, and Indio Gauvron

Joint meeting with ECOS School of Sound

The closing event for 2012 took place on December 18 at the 5th Congress on “Sound and Recording,” organized by AES Argentina together with ECOS School of Sound. The location was the National University of Technology (Universidad Tecnológica Nacional), based in Avellaneda, Buenos Aires. The whole day included different lectures. Néstor Stazzoni explained the “Advantages of Digital Consoles over Analog Consoles”—a talk that ranged from characteristics and a temporal walkthrough to demonstrations and practical activities. R. Alejandro Barés, lawyer, sound engineer, and UTRA (Argentina’s Union of Technicians) representative, gave a talk about the current problems in the technical field of audio and all its disciplines. Afterward, Esteban Freier introduced us to the secrets of “Monitoring Systems Adjustments,” including such topics as system gain structure, application and exploitation of polar diagrams from both microphones and sound sources, and nonabusive use of processors such as equalizers, to avoid acoustic feedback and frequency-response problems in on-stage monitoring systems. Later on, Hernán Nupieri, freelance sound engineer, showed us the world of controllers and processors for live sound systems’ adjustment and alignment in “Sound Processors.” From filter topologies, protocols, phase alignment, and peak and RMS limiters to up-to-date brands and products, and advantages and disadvantages, everything was included in this lecture. Esteban Frangoudis, a student from ECOS, gave a lecture on “Audio Electronics” about the selection of components and design of high-performance audio equipment. Last but not least, as a closure to this event, Gustavo Baldrati (teacher at ECOS and musical producer at BALDRATI MUSICA) and students from the recording group showed some self-recorded and produced material, as well as professional recordings, in “Artistic Production,” with questions and comments from the audience. We sincerely thank the speakers who enlightened us with their knowledge, ECOS School of Sound, and AES Argentina’s members Natalia Sotelo and Ezequiel Nakasone, who managed the event. Alejandro Goldstein and Indio Gauvron


Néstor Stazzoni interacting with the audience

135th Audio Engineering Society International Convention

Jacob Javits Center

New York City

Conference Oct 17th – 20th, 2013

Join the AES in New York City to celebrate the latest audio advances, October 17th through October 20th, 2013.

For more information and to make reservations, visit our website at

www.aes.org


PRODUCTS AND DEVELOPMENTS

AKG D12 VR in new vintage

AES Sustaining Member

Developed in the 1950s, the first-of-its-kind AKG D12 microphone was marketed as a vocal microphone, ideal for broadcasting and public address, becoming a staple in television and radio studios around the world. Over the years, use of the microphone has shifted, and it is now one of the most versatile and reliable microphones in

the industry, with artists playing their instruments through the mic. The newly launched Harman AKG D12 VR large-diaphragm cardioid microphone for recording and live applications has since been utilized as a vocal microphone in numerous applications. The D12 VR (vintage sound re-issue) offers an ultra-thin diaphragm within its newly designed capsule—normally a foil of only 7 microns’ thickness is used in condenser microphones. Low-frequency performance is enhanced by an updated bass chamber below the capsule element. With phantom power disabled, the D12 delivers accurate, pure character from the sound source. With phantom power enabled, one of three switchable active-filter presets can be used to quickly adapt the mic’s response to suit the user’s desired kick-drum sound.

Battery supply for mics

With its very low noise level, a frequency response of 20 Hz–24 kHz, and an inverse-polarity protection feature against incorrect battery insertion, AKG’s B48 L battery power supply provides users with more than 20 hours of use on only two AA batteries. When phantom power for a condenser is unavailable from a mixer, the B48 L supplies microphones with 48-V phantom power and meets the P48 phantom power standard. Additionally, the B48 L’s mini-XLR audio output and included mini-XLR to mini-XLR connection cable support a direct connection of condenser microphones to all AKG bodypack transmitters. A low-battery LED illuminates if phantom power drops below 40 V.

Tabletop mic stands

AKG also now offers a new line of tabletop microphone stands for all environments. The ST6 houses all AKG XLR-based mics; the CGN321STS and CGN521STS, with their programmable switch, cater to gooseneck mics; and the STS DAM+ provides a base for DAM+ gooseneck modules and CS321/521 goosenecks. The heavy-duty CGN321STS also keeps the microphones stable while absorbing any shock. Its slim 30 cm gooseneck holds a professional condenser capsule with a cardioid polar pattern of 120 degrees. AKG Acoustics GmbH, Lemböckgasse 21-25, A-1230 Vienna, Austria, [email protected], Tel. +43 (0)1 86654 0, Fax +43 (0)1 86654 8800, www.akg.com


Classic tube sound from new guitar amp

The GAV19T guitar amplifier is a cathode-biased, class A, all-tube, 19-watt guitar amp designed in the vein of classic English amplifiers, delivering a big, vintage sound—making this amplifier head the perfect choice for recording guitarists. Chandler Limited is also offering a selection of loudspeaker enclosures optimized for use with the GAV19T guitar amplifier. Of particular note, the new amplifier incorporates its overdrive and tone circuitry directly within the power section. This design enables the GAV19T to be highly overdriven while still maintaining the amplifier’s classic tone. The drive portion of the GAV19T is a boost/overdrive circuit that works only on the power section of the amplifier. Instead of adding preamp gain and distortion to achieve the overdrive effect, the GAV19T uses the power amp. The Drive has two controls: (1) the amount of boost, and (2) the tone of the boost. Guitarists can choose from four selections or bypass. The tone control section of the new GAV19T has a Baxandall tone stack that uses an ECC803 tube. The GAV19T’s bias control affects only the bias of the preamp tube—enabling the guitarist to gently tweak its overall tone independently of the power section. This enables the tube to run hotter or colder and change the responsiveness of the amp while also affecting the sustain and frequency response. Chandler Limited, P.O. Box 38, Shell Rock, Iowa 50670, USA, Tel. +1 319 885 4200 or 885 4201, Fax +1 319 885 4202, [email protected], www.chandlerlimited.com


Advanced digital console from Soundcraft

Available in three frame sizes, the Si Expression 1, 2, and 3 offer 16, 24, and 32 faders and mic inputs respectively; all three are capable of up to a staggering 66 inputs to mix by connecting any Soundcraft stagebox, including the two new Mini Stagebox 16 and 32 (16 x 8 and 32 x 16) models, or by connecting additional inputs over MADI or AES/EBU. All external inputs are additional to the connections on the desk itself. The mixer is loaded with industry-standard processing from HARMAN siblings BSS, dbx, Lexicon, and Studer, and many top-end professional features like a color touchscreen, iPad ViSi Remote control, and Soundcraft FaderGlow, adopted from

Soundcraft’s Vi Series large-format flagship consoles. A powerful DSP engine provides 4-band parametric EQ, delays, gates, and compressors on every input; parametric and 30-band graphic EQ, compressors, and delays on all outputs; as well as four Lexicon stereo effects devices, all capable of being utilized at the same time, unlike most consoles in this class. Soundcraft ViSi Remote allows remote control of the console from an iPad. Si Expression is designed for all mixing applications, ranging from gigging bands, performance venues, installations and tours, corporate AV, and theaters to houses of worship and education facilities. Soundcraft, Harman International Industries Ltd., Cranborne House, Cranborne Road, Potters Bar, Herts. EN6 3JN, UK, Tel. +44 (0)1707 665000, Fax +44 (0)1707 660742, www.soundcraft.com

dbx networked audio connectivity

dbx is now shipping its TR1616 BLU link I/O, a 16-input/16-output digital on-ramp/off-ramp module that brings professional digital networked-audio connectivity to an attractive price point. The TR1616 offers 16 XLR/TRS analog inputs and 16 analog outputs, along with BLU link RJ-45 Loop input and output and RJ-45 Snake input and output ports. Users can configure up to 256 channels at 48 kHz or up to 128 channels at 96 kHz. No addressing or programming is required. Each of the TR1616’s 16 front-panel channels features a high-quality dbx mic preamp with mic gain, 20 dB pad, low-cut filter, +48V phantom power, and polarity selection, along with signal, clipping, and additional indicators. A rear-panel USB and Ethernet port enables uploading future firmware updates. Up to 60 TR1616 units can be linked together with no loss of control or fidelity, enabling first-time users to expand their systems as their connectivity requirements increase. The dbx TR1616 is ideal for use with mixing consoles and with the new dbx PMC Personal Monitor Controller, a remote control that allows performers to easily set up and control their personal mix of up to 16 channels of audio via a BSS Audio BLU link interface. A live mode option provides the user the capability to see a real-time view of channels. dbx, 8760 South Sandy Pkwy., Sandy, UT 84070, USA, Tel. +1 801 566 8800, [email protected], www.dbxpro.com

Wireless connectivity for amplifiers

Crown Audio’s USBX is an accessory that enables Crown XTi, CDi, and DSi Series amplifiers with a USB port to be operated via an Apple iPad/iPhone using the “Powered by Crown” App. The USBX plugs into an AC outlet and up to eight compatible Crown USB products can be connected to the USBX. The USBX has built-in Wi-Fi, enabling wireless control of connected Crown amps without the need for a separate wireless router. The “Powered by Crown” App allows wireless control and monitoring of Crown Ethernet-enabled devices and JBL loudspeakers that are equipped with DrivePack DPDA built-in amplifiers. Since this App uses the same protocols as HARMAN HiQnet System Architect, users can import custom control panels from System Architect into the “Powered by Crown” App for added functionality. A strategic long-term commitment to Ethernet technology by Crown means that users have the benefit of this simple-to-use, cost-efficient program as well as seamless integration of their legacy and future Crown products through the App. Crown Audio, Inc., 1718 W. Mishawaka Rd., Elkhart, IN 46517, USA, Tel. +1 574 294 8000, www.crownaudio.com

Visual loudness tool NUGEN Audio has released VisLM version 1.6, a significant update to the company’s visual loudness monitoring tool. For the first time, VisLM is now available in the Avid AAX format as well as in a 64-bit OS X version. VisLM 1.6 includes a new dialog gate option that enables automatic measurement of dialog sections within the source material. Measurements can now be written to automation tracks so that the loudness profile and true-peak clips can be seen against the waveform in the digital audio workstation timeline for diagnostic referencing, helping users manage the complexity of loudness compliance. NUGEN Audio Airedale House 423 Kirkstall Road, Leeds, LS4 2EW, UK www.nugenaudio.com

J. Audio Eng. Soc., Vol. 61, No. 1/2, 2013 January/February

PRODUCTS AND DEVELOPMENTS

Cambridge Series III faster and more powerful

AES Sustaining Member To meet the ever-increasing desire for greater efficiency and faster throughput, the processing power of the CEDAR Cambridge Series III hardware will be significantly increased. Featuring 12 Xeon processing cores and 32 GB of RAM, the system will be even better suited to batch processing large bodies of audio in libraries and archives, as well as in forensic laboratories handling high volumes of surveillance material. It will also support multiple instances and multiple users more efficiently and, depending upon the number of processing paths and the number of processes running in each, this can lead to potentially huge increases in productivity. To aid the user, the new system will support up to six DVI monitors, making it easy to set up, view, and control multiple instances of CEDAR Cambridge on the same host hardware, with all of their signal analysis, processing paths, metadata, and report generation windows comfortably laid out and visible simultaneously. CEDAR Audio Ltd. 20 Home End Fulbourn Cambridge, CB21 5BS, UK Tel. +44 1223 881771 Fax +44 1223 881778 www.cedaraudio.com

Advertise in the Journal of the Audio Engineering Society It's the best way to reach the broadest spectrum of decision makers in the audio industry. The more than 15,000 AES members and subscribers worldwide who read the Journal include record producers and engineers, audio equipment designers and manufacturers, scientists and researchers, educators and students (tomorrow's decision makers). Deliver your message where you get the best return on your investment. For information contact: [email protected] Advertising Department Journal of the Audio Engineering Society 60 East 42nd Street, Room 2520, New York, NY 10165-2520 Tel: +1 212 661 8528, ext. 22 www.aes.org/journal/ratecard.cfm


Multifunction loudspeaker system AES Sustaining Member

Part of the Aero Series 2 product line, the new Aero 40 is a mid-size, powered line array system utilizing 12-inch and 6-inch transducers for the low and mid frequencies, respectively, plus dual compression drivers for the highs. D.A.S. Audio’s new Convert 15A arrayable loudspeaker is a powered, multifunction system featuring an innovative design that provides user-definable dispersion characteristics, enabling the system to be deployed as a curved source array or used individually as a point source. The new SX-218A subwoofer is a powered, front-loaded sub-bass enclosure. This loudspeaker system incorporates dual 18-inch LF transducers and is driven by an 1,800 W RMS integrated power amplifier. For easy transport, the enclosure has two casters mounted on its rear. Together, the D.A.S. Audio Convert 15A powered multifunction loudspeaker system and SX-218A subwoofer make a versatile live sound reinforcement system capable of delivering robust music reproduction and first-class speech intelligibility. D.A.S. Audio, S.A. Islas Baleares 24 46988 Fuente del Jarro Valencia, Spain Tel. +34 96 134 0206 Fax +34 96 134 0607 www.dasaudio.com

In December 1988, Delft University professor Guus Berkhout published his article in the AES Journal in which he proposed the concept of Wave Field Synthesis (WFS) as a format for spatial sound reproduction without sweet spot limitations. Now, more than twenty years later, Diemer de Vries, who was involved in the development and application of the concept from the very beginning, presents a 93-page monograph on the theme. The principles of WFS are explained with a digestible summary of the underlying mathematics. An explanation is also given of how the necessary steps from theory (where things are infinitely large or small, having ideal properties) to real-world application have successfully been made by making ample use of the properties of the human hearing mechanism. Reproduction in WFS is most effective when dedicated recording techniques are used; a survey of such techniques is given. Special attention is given to the EU-sponsored CARROUSO project, in which ten institutes successfully cooperated on WFS-oriented research and development. Due to the results of this project, WFS is now known and recognized worldwide. An illustrated short description is given of all WFS systems in the world known to the author. Applications range from high-level audio research via multimedia education to nightclub entertainment. The monograph ends with a view to the future. Purchase online, $35 (members), $50 (nonmembers), at www.aes.org/publications/anthologies/ For information email the Publications Dept. at [email protected] or call +1 212 661 8528.

AUDIO ENGINEERING SOCIETY CALL for PAPERS and ENGINEERING BRIEFS AES 135th Convention, 2013 New York, USA Dates: 2013 October 17–20 Location: Javits Center, New York, USA

www.aes.org/events/135

Authors may submit proposals in three categories: 1. Complete-manuscript peer-reviewed convention papers (submit at www.aes.org/135th_authors) 2. Abstract-précis-reviewed convention papers (submit at www.aes.org/135th_authors) 3. Synopsis-reviewed engineering briefs (submit at www.aes.org/135th_ebriefs) Category 1 and 2 proposals are to be submitted electronically to the AES 135th proposal submission site at www.aes.org/135th_authors by 2013 May 16. For the complete-manuscript peer-reviewed convention papers (category 1), authors are asked to submit papers of 4–10 pages to the submission site. Papers exceeding 10 pages run the risk of rejection without review. These complete-manuscript papers will be reviewed by at least two experts in the field, and authors will be notified of acceptance by 2013 June 20. Final manuscripts with the revisions requested by the reviewers must be submitted before 2013 July 8. If rejected as a convention paper (Cat. 1), the proposal may still be accepted for category 2 or 3. For abstract-précis-reviewed proposals (Cat. 2), a title, 60- to 120-word abstract, and 500- to 750-word précis of the proposed paper must be submitted by 2013 May 16. Authors will be notified of acceptance by 2013 June 6, and the authors must submit their final manuscripts (4 to 10 pages) before 2013 July 8. If rejected from this category, the proposal may still be accepted as an engineering brief (Cat. 3). Both complete-manuscript peer-reviewed papers (Cat. 1) and abstract-précis-reviewed papers (Cat. 2) will be available on the CD-ROM of convention papers, and also later in the AES E-Library (www.aes.org/e-lib). If a paper is longer than 10 pages, the author will be charged a fee of $25 for every page over 10. For engineering briefs (Cat. 3), authors must supply a short synopsis by 2013 August 14 to indicate their desire to present an engineering brief, followed later by an electronic manuscript.
These manuscripts will be freely available in the AES E-Library to AES members, and there will be no paper copies. Topics for the engineering briefs can be very wide-ranging. Relaxed reviewing of submissions will consider mainly whether they are of interest to AES convention attendees and are not overly commercial. PDF manuscripts 1–5 pages in length following the prescribed template must be emailed to [email protected] by 2013 September 12. Presenting authors (one per paper for all 3 categories) will be required to pay the full convention registration fee (member and student member rates are lower than nonmember rates), and only then will the presenting author receive a free CD-ROM of the convention papers. Presenting authors who are student members and whose convention papers (Cat. 1 or 2) are accepted will be eligible for the Student Paper Award at the 135th. During the online submission process you will be asked to specify whether you prefer to present your paper in a lecture or poster session. Highly detailed papers are better suited to poster sessions, which permit greater interaction between author and audience. The convention committee reserves the right to reassign papers to any session, lecture or poster. The submission sites, www.aes.org/135th_authors (Cat. 1 and 2) and www.aes.org/135th_ebriefs (Cat. 3), will be available by early April.

PROPOSED TOPICS Perception Spatial audio Audio signal processing Semantic audio

Transducers Game audio Recording and production Applications in audio

Education Room acoustics Forensic audio Sonification

Audio computing

Submit title/abstract/précis or full papers at www.aes.org/135th_authors. Submit eBriefs at www.aes.org/135th_ebriefs.

SCHEDULE Proposal (Cat. 1 and 2) deadline: 2013 May 16 Acceptance (précis proposals, Cat. 2) emailed: 2013 June 6 Acceptance (full papers, Cat. 1) emailed: 2013 June 20 Paper (Cat. 1 and 2) deadline: 2013 July 8 Engineering brief (Cat. 3) deadline: 2013 August 14 Engineering brief (Cat. 3) 1–5 page PDF deadline: 2013 Sep. 12

PAPERS COCHAIRS Brett Leonard and Tae Hong Park Questions? Contact: [email protected]

Open Access Authors will have the option of submitting Open Access papers.

Rozenn Nicol has been investigating spatial audio technology at Orange Labs for more than ten years. With the publication of this monograph, she aims to promote a better understanding of how binaural technology really works. Despite its straightforwardness, the reproduction of binaural audio with headphones is always impressive: a really convincing 3-D sound scene is achieved. This is possible because binaural technology merely mimics the spatial encoding that we use daily when we localize sounds in real life. Starting from practical issues concerning what sound recording and rendering really mean for binaural technology, the underlying theory is then progressively examined. The diffraction of the acoustic wave by the listener’s body defines the key concept of binaural technology and can be represented by the associated transfer functions, known as Head-Related Transfer Functions (HRTFs). HRTFs are therefore the raw material of binaural spatialization. It is shown how the spatial information is conveyed through HRTFs, investigating both physical phenomena and auditory localization mechanisms. Special attention is given to binaural synthesis, which consists of simulating the left and right signals as they would have been recorded by a pair of microphones inserted in the listener’s ears. This is one of the best-known applications of binaural technology, since it allows one to create, with full control and by straightforward filtering, a virtual auditory space for psychoacoustic experiments or virtual reality purposes. Although binaural technology is a powerful tool for sound spatialization, it should be kept in mind that, because the spatialization is determined by the listener’s morphology, which is unfortunately strongly individual, the spatial encoding of a sound scene is theoretically valid for only a single individual. The monograph ends with an overview of solutions for adapting the binaural spatialization process to an individual’s variability.

Purchase online, $35 (Members), $50 (nonmembers), at www.aes.org/publications/anthologies/ For information email Publications Dept. at [email protected] or call +1 212 661 8528.

OBITUARIES

Obituaries Frank G. Lennert 1924–2013 Frank Lennert was born and raised in Hayward, California. When he was in high school, he joined a radio club and quickly became convinced that his future would involve electronics. While at school, he designed and built his own (mechanical) disk recording equipment so that he could make phonograph recordings of small vocal groups as well as radio programs off the air. After he graduated from the University of California, Berkeley, in 1947, he started his own phonograph recording company. In early 1948, Alex Poniatoff, founder and president of Ampex Corporation, heard about Frank and asked him to join his company. He was the first person to be hired who had a background in electronics. His contribution to Ampex was very great, especially in its early years of developing and manufacturing magnetic tape recording equipment. Frank joined just in time to check out Ampex’s serial number 1, Model 200A. This recorder was delivered in 1948 April and was the first in the company’s long series of professional audio tape recorders. Frank then designed the electronics for a new Model 300, which was introduced in 1949 and performed even better than the 200A. He then designed a conversion kit to upgrade the 200A to match the performance of the 300, and each converted 200A became known as an Ampex Model 201. He established a standard laboratory to provide tapes to assure compatibility between Ampex recorders. He then helped establish a standard to assure that tapes recorded on one manufacturer’s professional tape recorder could be played back properly on another.

In 1952 Frank was the ideal choice to become manager of manufacturing. He knew the engineering of the products to be manufactured, he knew the engineers, and he worked very well with all of them. He was a “team player” during those early years of helping Ampex become a world leader in magnetic tape recording of audio, instrumentation, and video. Frank never lost his love for engineering. In 1953 he thought about the different subassemblies he was manufacturing, and decided one weekend to try an experiment. With the help of his machine shop, he assembled a combination of subassemblies onto a new top-plate. The result was the prototype of a new, smaller, and less expensive magnetic tape recorder. It became the highly successful Model 350. Ampex produced over 6,000 of them, and some are still in use. In 1960 Frank resigned from Ampex in order to have more time to build his new home in Woodside and become involved with a few small startup companies. In 1973 he acquired a small company called Pemtek, which manufactured a line of special instrumentation recorders primarily for military applications. He built it up and sold it in 1976. Then he bought another small company called Standard Tape Laboratory (STL). It was an offshoot of the original standard tape laboratory that Frank had started at Ampex in 1949. He stayed with STL until his retirement in 1987. Frank was AES member 423, and became an AES Fellow in 1959. He published the paper “Equalization of Magnetic Tape Recorders for Audio and Instrumentation Applications” in the Transactions of the IRE Professional Group on Audio (PGA). He lived in his home in Woodside for over 50 years. In 2007 he lost his wonderful wife, Mary Ellen, to whom he had been married for nearly 60 years. Frank is survived by two sons, David and Steven; two grandsons, Jeremy and Jason; and his sister, Elizabeth. John Leslie


Bill Isenberg 1944–2013 On Wednesday, February 6th, William H. Isenberg passed away from natural causes. Bill—or as his friends knew him “B-I”—was known and respected for his talents as a circuit designer, field technician, and all-around on-call “technical guru” for many of the most prestigious manufacturers and recording studios in Southern California. After a short period of study at Lawrence University he joined the Air Force in 1964, rising to the rank of Tech Sergeant, specializing in electronics technology. In the late 1960s, Bill worked for Daniel A. Flickinger and Associates, where he helped in the design and construction of the Flickinger mixing console. After installing one of the consoles at Bolic Sound (Ike and Tina Turner’s studio) in Inglewood, Bill decided to stay in Southern California. Following his move to Southern California in 1972, Bill worked for and with a number of companies, including Audio Engineering Associates, Capp’s Electronics, Cetec Vega, Filmways Audio, Groove Tubes, Harman Pro Audio, Hollywood Sound Systems, JBL Loudspeakers, Jensen Transformers, Litton Westrex, Marshall Long Acoustics, Pioneer Research, Record Plant Recording, RTS Systems (Telex), SAE High Fidelity, and Seymour Duncan Pickups. As a long-time member of the Hollywood Sapphire Group and the Los Angeles Section of the Audio Engineering Society, Bill was a contributor to numerous technical presentations and workshops for both organizations and served as the AES Section’s treasurer for several years. Ron Streicher 95

AES CONVENTIONS AND CONFERENCES The latest details on the following events are posted on the AES Website: www.aes.org


49th International Conference “Audio for Games” Date: 2013 February 6–8 Location: London, England

Conference chair: Michael Kelly Email: [email protected]

Papers chair: Damian Murphy Email: [email protected]

Call for papers: Vol. 60 No. 5, p. 382 (2012 May) Conference preview: Vol. 60 No. 11, pp. 958–963 (2012 November)

AES REGIONAL OFFICES Europe Conventions Kerkstraat 122/1, BE 1653 Dworp, Belgium, Tel: +32 2 345 7971, Fax: +32 2 345 3419, Email for convention information: [email protected] Europe Services B.P. 20048, FR-94364 Bry Sur Marne Cedex, France, Tel: +33 1 4881 4632, Email for membership and publication sales: [email protected] United Kingdom British Section, Audio Engineering Society Ltd., P.O. Box 645, Slough, SL1 8BJ UK, Tel: +44 1628 663725 Email: [email protected] Japan AES Japan Section, 2-19-9 Yayoi-cho, Nakano-ku, Tokyo 164-0013, Japan, Tel: +81 3 5358 7320, Fax: +81 3 5358 7328, Email: [email protected] AES REGIONS AND SECTIONS


134th Convention Rome, Italy Date: 2013 May 4–7 Location: Roma Eventi – Fontana di Trevi Conference Centre

Convention chair: Umberto Zanghieri Email: [email protected]

Papers cochairs: Angelo Farina and Véronique Larcher Email: [email protected]

Call for papers: Vol. 60 No. 10, p. 879 (2012 October)

50th International Conference “Audio Education” Date: 2013 July 25–27 Location: Murfreesboro, USA

Conference cochairs: Michael Fleming and Bill Crabtree Email: [email protected]

Papers chair: Jason Corey Email: [email protected]

Call for papers: Vol. 60 No. 5, p. 383 (2012 May)

51st International Conference “Loudspeakers and Headphones” Date: 2013 August 22–24 Location: Helsinki, Finland

Conference chair: Juha Backman Email: [email protected]

Papers chair: Aki Mäkivirta Email: [email protected]

Call for papers: Vol. 60 No. 12, p. 1085 (2012 December)

52nd International Conference “Sound Field Control” Date: 2013 September 2–4 Location: Guildford, UK

Conference chair: Francis Rumsey Email: [email protected]

Papers cochairs: Filippo Fazi and Søren Bech Email: [email protected]

Call for papers: Vol. 60 No. 10, p. 881 (2012 October)

135th Convention New York, USA Date: 2013 October 17–20 Location: Javits Center, New York, NY, USA

Convention chair: Jim Anderson Email: [email protected]

Papers cochairs: Brett Leonard Tae Hong Park Email: [email protected]

Call for papers: This issue, p. 93 (2013 January/February)

Eastern Region, USA/Canada Sections: Atlanta, Boston, District of Columbia, New York, Philadelphia, Toronto Student Sections: American University, Appalachian State University, Atlanta, Bay State College, Berklee College of Music, Boston University CDIA, Carnegie Mellon University, Duquesne University, Emerson College, Fredonia, Full Sail Real World Education, Institute of Audio Research, Ithaca College, London Ontario, McGill University, New England Institute of Art, New England School of Communications, New York University, Old Dominion, Peabody Institute of Johns Hopkins University, Pennsylvania State University, Shenandoah University, University of Hartford, University of Massachusetts Lowell, University of Miami, University of New Haven, University of North Carolina at Asheville, Valencia College, William Paterson University Central Region, USA/Canada Sections: Central Indiana, Central Texas, Chicago, Cincinnati, Detroit, Kansas City, Nashville, Heartland, New Orleans, St. Louis, Upper Midwest, West Michigan Student Sections: Art Institute of Tennessee-Nashville, Ball State University, Belmont University, Capital University, Columbia College, Flashpoint Academy, Indiana University, Institute of Production & Recording, Kansas City Kansas Community College, Michigan Technological University, McNally Smith College of Music, Middle Tennessee State University, Purdue University, IADT Nashville, SAE Nashville, Ridgewater College, Southern Illinois University, University of Arkansas Pine Bluff, University of Central Missouri, University of Illinois, Urbana-Champaign, University of Memphis, University of Michigan, Webster University Western Region, USA/Canada Sections: Alberta, Colorado, Los Angeles, Pacific Northwest, Portland, Sacramento Valley, San Francisco, Southern Nevada, Vancouver Student Sections: American River Regional, Brigham Young University, California State University–Chico, Cal Poly Pomona, Cal Poly San Luis Obispo, Citrus College, Cogswell 
Polytechnical College, Conservatory of Recording Arts and Sciences, Ex’pression College for Digital Arts, Long Beach City College, Loyola Marymount University, San Francisco State University, Southwestern College, Stanford University, The Art Institute of Las Vegas, The Art Institute of Seattle, University of Colorado at Denver, University of Lethbridge, University of Southern California Northern Region, Europe Sections: Belgian, British, Danish, Finnish, Moscow, Netherlands, Norwegian, St. Petersburg, Swedish Student Sections: Danish, London UK, Netherlands, Russian Academy of Music, St. Petersburg, University of Lulea-Pitea, University of York Central Region, Europe Sections: Austrian, Czech, Central German, North German, South German, Hungarian, Lithuanian, Polish, Slovak, Swiss, Ukrainian Student Sections: Aachen, Berlin, Czech Republic, Darmstadt, Detmold, Düsseldorf, Graz, Hamburg, Krakow, Ilmenau, Technical University of Gdansk, Vienna, Wroclaw University of Technology Southern Region, Europe Sections: Bulgarian, Croatian, French, Greek, Israel, Italian, Portugal, Romanian, Spanish, Serbia, Slovenia, Turkish Student Sections: Croatian, Conservatoire de Paris, Galatasaray ITM, Istanbul Bilgi University, Louis-Lumière, MicroFusa Barcelona, Serbian Latin American Region Sections: Argentina, Brazil, Chile, Colombia, Guatemala, Mexico, Peru, Puebla, Uruguay Student Sections: Academia de Musica Fermatta, Cochabamba, ECOS Escuela de Sonido, I.A.V., Javeriana University, Instituto Mendocino de Audio y Sonido, Orson Welles Institute, ORT Institute, SAE Mexico, Sala de Audio, Sonar Escuela de Sonido, Tecnológico de Monterrey Monterrey, Tecnológico de Monterrey Santa Fe, Universidad de San Buenaventura International Region Sections: Adelaide, Beijing, Brisbane, Hong Kong, Japan, Korea, Malaysia, Melbourne, Philippines, Singapore, Sydney Student Section: Greater Sydney, Japan PURPOSE: The Audio Engineering Society is organized for the purpose of: uniting persons 
performing professional services in the audio engineering field and its allied arts; collecting, collating, and disseminating scientific knowledge in the field of audio engineering and its allied arts; advancing such science in both theoretical and practical applications; and preparing, publishing, and distributing literature and periodicals relative to the foregoing purposes and policies. MEMBERSHIP: Individuals interested in audio engineering may become members of the AES (www.aes.org/join). 2013 annual dues are: full members and associate members, $149 for both the printed and online Journal; $99 for online Journal only. Student members: $89 for printed and online Journal; $39 for online Journal only. Subscribe to the AES E-Library (www.aes.org/elib/subscribe), $145 per year (members), $255 nonmembers. Sustaining memberships are available to persons, corporations, or organizations who wish to support the Society.

GUIDELINES FOR AUTHORS ARE AVAILABLE AT THE JOURNAL’S WEBSITE: http://www.aes.org/journal/authors/guidelines/


AES The Audio Engineering Society recognizes with gratitude the financial support given by its sustaining members, which enables the work of the Society to be extended. Addresses and brief descriptions of the business activities of the sustaining members appear in the October issue of the Journal. A&R Cambridge ACO Pacific, Inc. acouStaCorp Acustica Beyma SL Air Studios Ltd. AKG Acoustics GmbH Amber Technology Limited Anchor Audio, Inc. Apogee Electronics Corp. ATC Loudspeaker Technology Ltd. Audio Limited Audio Logic Systems, LLC Audiomatica S.r.l. Audio Partnership PLC Audio Precision, Inc. AudioScience, Inc. Audio-Technica Audyssey Labs Autograph Sound Recording Ltd. Avid B & W Group Ltd. The Banff Centre Bose Corporation British Broadcasting Corporation Calrec Audio CEDAR Audio Ltd. Celestion International Limited ClearSounds Communications Community Professional Loudspeakers, Inc. CSR Daktronics, Inc. Dan Dugan Sound Design D.A.S. Audio, S.A. dCS Digigram

sustaining member organizations

The Society invites applications for sustaining membership. Information may be obtained from the Chair, Sustaining Memberships Committee, Audio Engineering Society, 60 East 42nd St., Room 2520, New York, New York 10165-2520, USA, Tel.: +1 212-661-8528. Fax: +1 212-682-0477.

Dolby Laboratories, Inc. Dolphin Integration DTS, Inc. DYNACORD, EVI Audio GmbH Focal Press Focusrite Audio Engineering Ltd. Fraunhofer IIS-A General Dynamics Gentex Corporation GoerTek Electronics, Inc. Gracenote Hal Leonard Corporation Harman/Becker Automotive Systems HiWave Technologies (UK) Limited immSound, SA InfoComm International Standards and Industry Innovations Dept. Iron Mountain Film & Sound Archives Institute of Production and Recording International Audio Group iZotope, Inc. KEF Audio (UK) Limited KHALDI EST. Klipsch Group, Inc. L-Acoustics US Linear Integrated Systems Logitech Lynx Studio Technology, Inc. Magnetic Reference Laboratory (MRL) Inc. Markertek Martin Audio Ltd. Master Audio Meridian Audio Limited Metropolis Group Meyer Sound Laboratories Inc. Midas Klark Teknik Limited Monitor Audio Limited Motorola mPATHX Neutrik AG Neutrik USA

NISS – Nordic Institute of Stage and Studio Ontario Institute of Audio Recording Technology Oxford Digital Limited Outline snc PMC Ltd. Polk Audio PreSonus Audio Electronics Prism Sound/SADiE Quantum Data, Inc. Rane Corporation Robert Bosch GmbH Rohde & Schwarz Rycote Microphone Windshields Ltd. SAE Institute Group, Inc. Sencore Sennheiser Electronic Corporation Shure Inc. Silynx Communications, Inc. Snell Solid State Logic Ltd. Solteras, Inc. Sons D’Encanto Sonnox Ltd. Sontia Logic Ltd. Sound On Sound Ltd. SRS Labs, Inc. Sterling Sound, Inc. Tannoy Limited TASCAM THAT Corporation Thwapr THX Ltd. TOA Electronics, Inc. Tommex Tymphany Corporation Uniton AG Universal Audio, Inc. University of Surrey VCS Aktiengesellschaft Wolfson Microelectronics Yamaha Research and Development
