Fundamentals of Multimedia

Ze-Nian Li and Mark S. Drew
School of Computing Science, Simon Fraser University

Pearson Prentice Hall
Pearson Education International
If you purchased this book within the United States or Canada, you should be aware that it has been wrongfully imported without the approval of the Publisher or the Author.
Vice President and Editorial Director, ECS: Marcia J. Horton
Senior Acquisitions Editor: Kate Hargett
Editorial Assistant: Michael Giacobbe
Vice President and Director of Production and Manufacturing, ESM: David W. Riccardi
Executive Managing Editor: Vince O'Brien
Managing Editor: Camille Trentacoste
Production Editor: Irwin Zucker
Director of Creative Services: Paul Belfanti
Art Director and Cover Manager: Jayne Conte
Cover Designer: Suzanne Behnke
Managing Editor, AV Management and Production: Patricia Burns
Art Editor: Gregory Dulles
Manufacturing Manager: Trudy Pisciotti
Manufacturing Buyer: Lisa McDowell
Marketing Manager: Pamela Shaffer
To my mom, and my wife Yansin.
Ze-Nian

To Noah, James (Ira), Eva, and, especially, to Jenna.
Mark

© 2004 by Pearson Education, Inc., Pearson Prentice Hall, Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced in any format or by any means, without permission in writing from the publisher.

Images of Lena that appear in Figures 3.1, 3.3, 3.4, 3.10, 8.20, 9.2, and 9.3 are reproduced by special permission of Playboy magazine. Copyright 1972 by Playboy.

The author and publisher of this book have used their best efforts in preparing this book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The author and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the documentation contained in this book. The author and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.

Printed in the United States of America
10 9 8 7 6 5
ISBN 0-13-127256-X
Pearson Education Ltd.
Pearson Education Australia Pty., Limited
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd.
Pearson Education Canada, Ltd.
Pearson Educación de Mexico, S.A. de C.V.
Pearson Education Japan
Pearson Education Malaysia, Pte. Ltd.
Pearson Education, Upper Saddle River, New Jersey
List of Trademarks

The following is a list of products noted in this text that are trademarks or registered trademarks of their associated companies.

3D Studio Max is a registered trademark of Autodesk, Inc. After Effects, Illustrator, Photoshop, Premiere, and Cool Edit are registered trademarks of Adobe Systems, Inc. Authorware, Director, Dreamweaver, Fireworks, and Freehand are registered trademarks, and Flash and SoundEdit are trademarks, of Macromedia, Inc., in the United States and/or other countries. Cakewalk Pro Audio is a trademark of Twelve Tone Systems, Inc. CorelDRAW is a registered trademark of Corel and/or its subsidiaries in Canada, the United States and/or other countries. Cubase is a registered trademark of Pinnacle Systems. DirectX, Internet Explorer, PowerPoint, Windows, Word, Visual Basic, and Visual C++ are registered trademarks of Microsoft Corporation in the United States and/or other countries. Gifcon is a trademark of Alchemy Mindworks Corporation. HyperCard and Final Cut Pro are registered trademarks of Apple Computer, Inc. HyperStudio is a registered trademark of Sunburst Technology. Java Media Framework and Java 3D are trademarks of Sun Microsystems, Inc., in the United States and other countries. Jell-O is a registered trademark of Kraft Foods Incorporated. MATLAB is a trademark of The MathWorks, Inc. Maya and OpenGL are registered trademarks of Silicon Graphics Inc. Mosaic is a registered trademark of the National Center for Supercomputing Applications (NCSA). Netscape is a registered trademark of Netscape Communications Corporation in the U.S. and other countries. PlayStation is a registered trademark of Sony Corporation. Pro Tools is a registered trademark of Avid Technology, Inc. Quest Multimedia Authoring System is a registered trademark of Allen Communication Learning Services. RenderMan is a registered trademark of Pixar Animation Studios. Slinky is a registered trademark of Slinky Toys. Softimage XSI is a registered trademark of Avid Technology Inc. Sound Forge is a registered trademark of Sonic Foundry. WinZip is a registered trademark of WinZip Computing, Inc.

Contents

Preface

Part I  Multimedia Authoring and Data Representations

1 Introduction to Multimedia
  1.1 What Is Multimedia?
    1.1.1 Components of Multimedia
    1.1.2 Multimedia Research Topics and Projects
  1.2 Multimedia and Hypermedia
    1.2.1 History of Multimedia
    1.2.2 Hypermedia and Multimedia
  1.3 World Wide Web
    1.3.1 History of the WWW
    1.3.2 HyperText Transfer Protocol (HTTP)
    1.3.3 HyperText Markup Language (HTML)
    1.3.4 Extensible Markup Language (XML)
    1.3.5 Synchronized Multimedia Integration Language (SMIL)
  1.4 Overview of Multimedia Software Tools
    1.4.1 Music Sequencing and Notation
    1.4.2 Digital Audio
    1.4.3 Graphics and Image Editing
    1.4.4 Video Editing
    1.4.5 Animation
    1.4.6 Multimedia Authoring
  1.5 Further Exploration
  1.6 Exercises
  1.7 References

2 Multimedia Authoring and Tools
  2.1 Multimedia Authoring
    2.1.1 Multimedia Authoring Metaphors
    2.1.2 Multimedia Production
    2.1.3 Multimedia Presentation
    2.1.4 Automatic Authoring
  2.2 Some Useful Editing and Authoring Tools
    2.2.1 Adobe Premiere
    2.2.2 Macromedia Director
    2.2.3 Macromedia Flash
    2.2.4 Dreamweaver
  2.3 VRML
    2.3.1 Overview
    2.3.2 Animation and Interactions
    2.3.3 VRML Specifics
  2.4 Further Exploration
  2.5 Exercises
  2.6 References
3 Graphics and Image Data Representations
  3.1 Graphics/Image Data Types
    3.1.1 1-Bit Images
    3.1.2 8-Bit Gray-Level Images
    3.1.3 Image Data Types
    3.1.4 24-Bit Color Images
    3.1.5 8-Bit Color Images
    3.1.6 Color Lookup Tables (LUTs)
  3.2 Popular File Formats
    3.2.1 GIF
    3.2.2 JPEG
    3.2.3 PNG
    3.2.4 TIFF
    3.2.5 EXIF
    3.2.6 Graphics Animation Files
    3.2.7 PS and PDF
    3.2.8 Windows WMF
    3.2.9 Windows BMP
    3.2.10 Macintosh PAINT and PICT
    3.2.11 X Windows PPM
  3.3 Further Exploration
  3.4 Exercises
  3.5 References

4 Color in Image and Video
  4.1 Color Science
    4.1.1 Light and Spectra
    4.1.2 Human Vision
    4.1.3 Spectral Sensitivity of the Eye
    4.1.4 Image Formation
    4.1.5 Camera Systems
    4.1.6 Gamma Correction
    4.1.7 Color-Matching Functions
    4.1.8 CIE Chromaticity Diagram
    4.1.9 Color Monitor Specifications
    4.1.10 Out-of-Gamut Colors
    4.1.11 White-Point Correction
    4.1.12 XYZ to RGB Transform
    4.1.13 Transform with Gamma Correction
    4.1.14 L*a*b* (CIELAB) Color Model
    4.1.15 More Color-Coordinate Schemes
    4.1.16 Munsell Color Naming System
  4.2 Color Models in Images
    4.2.1 RGB Color Model for CRT Displays
    4.2.2 Subtractive Color: CMY Color Model
    4.2.3 Transformation from RGB to CMY
    4.2.4 Undercolor Removal: CMYK System
    4.2.5 Printer Gamuts
  4.3 Color Models in Video
    4.3.1 Video Color Transforms
    4.3.2 YUV Color Model
    4.3.3 YIQ Color Model
    4.3.4 YCbCr Color Model
  4.4 Further Exploration
  4.5 Exercises
  4.6 References

5 Fundamental Concepts in Video
  5.1 Types of Video Signals
    5.1.1 Component Video
    5.1.2 Composite Video
    5.1.3 S-Video
  5.2 Analog Video
    5.2.1 NTSC Video
    5.2.2 PAL Video
    5.2.3 SECAM Video
  5.3 Digital Video
    5.3.1 Chroma Subsampling
    5.3.2 CCIR Standards for Digital Video
    5.3.3 High-Definition TV (HDTV)
  5.4 Further Exploration
  5.5 Exercises
  5.6 References

6 Basics of Digital Audio
  6.1 Digitization of Sound
    6.1.1 What Is Sound?
    6.1.2 Digitization
    6.1.3 Nyquist Theorem
    6.1.4 Signal-to-Noise Ratio (SNR)
    6.1.5 Signal-to-Quantization-Noise Ratio (SQNR)
    6.1.6 Linear and Nonlinear Quantization
    6.1.7 Audio Filtering
    6.1.8 Audio Quality versus Data Rate
    6.1.9 Synthetic Sounds
  6.2 MIDI: Musical Instrument Digital Interface
    6.2.1 MIDI Overview
    6.2.2 Hardware Aspects of MIDI
    6.2.3 Structure of MIDI Messages
    6.2.4 General MIDI
    6.2.5 MIDI-to-WAV Conversion
  6.3 Quantization and Transmission of Audio
    6.3.1 Coding of Audio
    6.3.2 Pulse Code Modulation
    6.3.3 Differential Coding of Audio
    6.3.4 Lossless Predictive Coding
    6.3.5 DPCM
    6.3.6 DM
    6.3.7 ADPCM
  6.4 Further Exploration
  6.5 Exercises
  6.6 References
Part II  Multimedia Data Compression

7 Lossless Compression Algorithms
  7.1 Introduction
  7.2 Basics of Information Theory
  7.3 Run-Length Coding
  7.4 Variable-Length Coding (VLC)
    7.4.1 Shannon–Fano Algorithm
    7.4.2 Huffman Coding
    7.4.3 Adaptive Huffman Coding
  7.5 Dictionary-Based Coding
  7.6 Arithmetic Coding
  7.7 Lossless Image Compression
    7.7.1 Differential Coding of Images
    7.7.2 Lossless JPEG
  7.8 Further Exploration
  7.9 Exercises
  7.10 References

8 Lossy Compression Algorithms
  8.1 Introduction
  8.2 Distortion Measures
  8.3 The Rate-Distortion Theory
  8.4 Quantization
    8.4.1 Uniform Scalar Quantization
    8.4.2 Nonuniform Scalar Quantization
    8.4.3 Vector Quantization*
  8.5 Transform Coding
    8.5.1 Discrete Cosine Transform (DCT)
    8.5.2 Karhunen–Loève Transform*
  8.6 Wavelet-Based Coding
    8.6.1 Introduction
    8.6.2 Continuous Wavelet Transform*
    8.6.3 Discrete Wavelet Transform*
  8.7 Wavelet Packets
  8.8 Embedded Zerotree of Wavelet Coefficients
    8.8.1 The Zerotree Data Structure
    8.8.2 Successive Approximation Quantization
    8.8.3 EZW Example
  8.9 Set Partitioning in Hierarchical Trees (SPIHT)
  8.10 Further Exploration
  8.11 Exercises
  8.12 References

9 Image Compression Standards
  9.1 The JPEG Standard
    9.1.1 Main Steps in JPEG Image Compression
    9.1.2 JPEG Modes
    9.1.3 A Glance at the JPEG Bitstream
  9.2 The JPEG2000 Standard
    9.2.1 Main Steps of JPEG2000 Image Compression
    9.2.2 Adapting EBCOT to JPEG2000
    9.2.3 Region-of-Interest Coding
    9.2.4 Comparison of JPEG and JPEG2000 Performance
  9.3 The JPEG-LS Standard
    9.3.1 Prediction
    9.3.2 Context Determination
    9.3.3 Residual Coding
    9.3.4 Near-Lossless Mode
  9.4 Bilevel Image Compression Standards
    9.4.1 The JBIG Standard
    9.4.2 The JBIG2 Standard
  9.5 Further Exploration
  9.6 Exercises
  9.7 References

10 Basic Video Compression Techniques
  10.1 Introduction to Video Compression
  10.2 Video Compression Based on Motion Compensation
  10.3 Search for Motion Vectors
    10.3.1 Sequential Search
    10.3.2 2D Logarithmic Search
    10.3.3 Hierarchical Search
  10.4 H.261
    10.4.1 Intra-Frame (I-Frame) Coding
    10.4.2 Inter-Frame (P-Frame) Predictive Coding
    10.4.3 Quantization in H.261
    10.4.4 H.261 Encoder and Decoder
    10.4.5 A Glance at the H.261 Video Bitstream Syntax
  10.5 H.263
    10.5.1 Motion Compensation in H.263
    10.5.2 Optional H.263 Coding Modes
    10.5.3 H.263+ and H.263++
  10.6 Further Exploration
  10.7 Exercises
  10.8 References
11 MPEG Video Coding I: MPEG-1 and 2
  11.1 Overview
  11.2 MPEG-1
    11.2.1 Motion Compensation in MPEG-1
    11.2.2 Other Major Differences from H.261
    11.2.3 MPEG-1 Video Bitstream
  11.3 MPEG-2
    11.3.1 Supporting Interlaced Video
    11.3.2 MPEG-2 Scalabilities
    11.3.3 Other Major Differences from MPEG-1
  11.4 Further Exploration
  11.5 Exercises
  11.6 References
12 MPEG Video Coding II: MPEG-4, 7, and Beyond
  12.1 Overview of MPEG-4
  12.2 Object-Based Visual Coding in MPEG-4
    12.2.1 VOP-Based Coding vs. Frame-Based Coding
    12.2.2 Motion Compensation
    12.2.3 Texture Coding
    12.2.4 Shape Coding
    12.2.5 Static Texture Coding
    12.2.6 Sprite Coding
    12.2.7 Global Motion Compensation
  12.3 Synthetic Object Coding in MPEG-4
    12.3.1 2D Mesh Object Coding
    12.3.2 3D Model-Based Coding
  12.4 MPEG-4 Object Types, Profiles and Levels
  12.5 MPEG-4 Part 10/H.264
    12.5.1 Core Features
    12.5.2 Baseline Profile Features
    12.5.3 Main Profile Features
    12.5.4 Extended Profile Features
  12.6 MPEG-7
    12.6.1 Descriptor (D)
    12.6.2 Description Scheme (DS)
    12.6.3 Description Definition Language (DDL)
  12.7 MPEG-21
  12.8 Further Exploration
  12.9 Exercises
  12.10 References

13 Basic Audio Compression Techniques
  13.1 ADPCM in Speech Coding
    13.1.1 ADPCM
  13.2 G.726 ADPCM
  13.3 Vocoders
    13.3.1 Phase Insensitivity
    13.3.2 Channel Vocoder
    13.3.3 Formant Vocoder
    13.3.4 Linear Predictive Coding
    13.3.5 CELP
    13.3.6 Hybrid Excitation Vocoders*
  13.4 Further Exploration
  13.5 Exercises
  13.6 References

14 MPEG Audio Compression
  14.1 Psychoacoustics
    14.1.1 Equal-Loudness Relations
    14.1.2 Frequency Masking
    14.1.3 Temporal Masking
  14.2 MPEG Audio
    14.2.1 MPEG Layers
    14.2.2 MPEG Audio Strategy
    14.2.3 MPEG Audio Compression Algorithm
    14.2.4 MPEG-2 AAC (Advanced Audio Coding)
    14.2.5 MPEG-4 Audio
  14.3 Other Commercial Audio Codecs
  14.4 The Future: MPEG-7 and MPEG-21
  14.5 Further Exploration
  14.6 Exercises
  14.7 References
Part III  Multimedia Communication and Retrieval

15 Computer and Multimedia Networks
  15.1 Basics of Computer and Multimedia Networks
    15.1.1 OSI Network Layers
    15.1.2 TCP/IP Protocols
  15.2 Multiplexing Technologies
    15.2.1 Basics of Multiplexing
    15.2.2 Integrated Services Digital Network (ISDN)
    15.2.3 Synchronous Optical NETwork (SONET)
    15.2.4 Asymmetric Digital Subscriber Line (ADSL)
  15.3 LAN and WAN
    15.3.1 Local Area Networks (LANs)
    15.3.2 Wide Area Networks (WANs)
    15.3.3 Asynchronous Transfer Mode (ATM)
    15.3.4 Gigabit and 10-Gigabit Ethernets
  15.4 Access Networks
  15.5 Common Peripheral Interfaces
  15.6 Further Exploration
  15.7 Exercises
  15.8 References

16 Multimedia Network Communications and Applications
  16.1 Quality of Multimedia Data Transmission
    16.1.1 Quality of Service (QoS)
    16.1.2 QoS for IP Protocols
    16.1.3 Prioritized Delivery
  16.2 Multimedia over IP
    16.2.1 IP-Multicast
    16.2.2 RTP (Real-Time Transport Protocol)
    16.2.3 Real Time Control Protocol (RTCP)
    16.2.4 Resource ReSerVation Protocol (RSVP)
    16.2.5 Real-Time Streaming Protocol (RTSP)
    16.2.6 Internet Telephony
  16.3 Multimedia over ATM Networks
    16.3.1 Video Bitrates over ATM
    16.3.2 ATM Adaptation Layer (AAL)
    16.3.3 MPEG-2 Convergence to ATM
    16.3.4 Multicast over ATM
  16.4 Transport of MPEG-4
    16.4.1 DMIF in MPEG-4
    16.4.2 MPEG-4 over IP
  16.5 Media-on-Demand (MOD)
    16.5.1 Interactive TV (ITV) and Set-Top Box (STB)
    16.5.2 Broadcast Schemes for Video-on-Demand
    16.5.3 Buffer Management
  16.6 Further Exploration
  16.7 Exercises
  16.8 References
17 Wireless Networks
  17.1 Wireless Networks
    17.1.1 Analog Wireless Networks
    17.1.2 Digital Wireless Networks
    17.1.3 TDMA and GSM
    17.1.4 Spread Spectrum and CDMA
    17.1.5 Analysis of CDMA
    17.1.6 3G Digital Wireless Networks
    17.1.7 Wireless LAN (WLAN)
  17.2 Radio Propagation Models
    17.2.1 Multipath Fading
    17.2.2 Path Loss
  17.3 Multimedia over Wireless Networks
    17.3.1 Synchronization Loss
    17.3.2 Error Resilient Entropy Coding
    17.3.3 Error Concealment
    17.3.4 Forward Error Correction (FEC)
    17.3.5 Trends in Wireless Interactive Multimedia
  17.4 Further Exploration
  17.5 Exercises
  17.6 References

18 Content-Based Retrieval in Digital Libraries
  18.1 How Should We Retrieve Images?
  18.2 C-BIRD: A Case Study
    18.2.1 C-BIRD GUI
    18.2.2 Color Histogram
    18.2.3 Color Density
    18.2.4 Color Layout
    18.2.5 Texture Layout
    18.2.6 Search by Illumination Invariance
    18.2.7 Search by Object Model
  18.3 Synopsis of Current Image Search Systems
    18.3.1 QBIC
    18.3.2 UC Santa Barbara Search Engines
    18.3.3 Berkeley Digital Library Project
    18.3.4 Chabot
    18.3.5 Blobworld
    18.3.6 Columbia University Image Seekers
    18.3.7 Informedia
    18.3.8 MetaSEEk
    18.3.9 Photobook and FourEyes
    18.3.10 MARS
    18.3.11 Virage
    18.3.12 Viper
    18.3.13 Visual RetrievalWare
  18.4 Relevance Feedback
    18.4.1 MARS
    18.4.2 iFind
  18.5 Quantifying Results
  18.6 Querying on Videos
  18.7 Querying on Other Formats
  18.8 Outlook for Content-Based Retrieval
  18.9 Further Exploration
  18.10 Exercises
  18.11 References
Preface
A course in multimedia is rapidly becoming a necessity in computer science and engineering curricula, especially now that multimedia touches most aspects of these fields. Multimedia was originally seen as a vertical application area; that is, a niche application with methods that belong only to itself. However, like pervasive computing, multimedia is now essentially a horizontal application area and forms an important component of the study of computer graphics, image processing, databases, real-time systems, operating systems, information retrieval, computer networks, computer vision, and so on. Multimedia is no longer just a toy but forms part of the technological environment in which we work and think.

This book fills the need for a university-level text that examines a good deal of the core agenda computer science sees as belonging to this subject area. Multimedia has become associated with a certain set of issues in computer science and engineering, and we address those here. The book is not an introduction to simple design issues; it serves a more advanced audience than that. On the other hand, it is not a reference work: it is more a traditional textbook. While we perforce discuss multimedia tools, we would like to give a sense of the underlying principles in the tasks those tools carry out. Students who undertake and succeed in a course based on this text can be said to really understand fundamental matters in regard to this material; hence the title of the text.

In conjunction with this text, a full-fledged course should also allow students to make use of this knowledge to carry out interesting or even wonderful practical projects in multimedia: interactive projects that engage and sometimes amuse and, perhaps, even teach these same concepts.

Who Should Read This Book?

This text aims at introducing the basic ideas in multimedia to an audience comfortable with technical applications; that is, computer science and engineering students.
It aims to cover an upper-level undergraduate multimedia course but could also be used in more advanced courses and would be a good reference for anyone, including those in industry, interested in current multimedia technologies. Graduate students needing a solid grounding in materials they may not have seen before would undoubtedly benefit from reading it.

The text mainly presents concepts, not applications. A multimedia course, on the other hand, teaches these concepts and tests them but also allows students to use coding and presentation skills they already know to address problems in multimedia. The accompanying web site shows some of the code for multimedia applications, along with some of the better projects students have developed in such a course and other useful materials best presented electronically. The ideas in the text drive the results shown in student projects. We assume the reader knows how to program and is also completely comfortable learning yet another tool. Instead of concentrating on tools, however, we emphasize what students do not already know.
Using the methods and ideas collected here, students are also able to learn more themselves, sometimes in a job setting. It is not unusual for students who take the type of multimedia course this text aims at to go on to jobs in a multimedia-related industry immediately after their senior year, and sometimes before. The selection of material in the text addresses real issues these learners will face as soon as they show up in the workplace. Some topics are simple but new to the students; some are more complex but unavoidable in this emerging area.

Have the Authors Used This Material in a Real Class?

Since 1996, we have taught a third-year undergraduate course in multimedia systems based on the introductory materials set out in this book. A one-semester course could very likely not include all the material covered in this text, but we have usually managed to consider a good many of the topics addressed and to mention a select number of issues in Part III within that time frame. Over the same time period, as an introduction to more advanced materials, we have also taught a one-semester graduate-level course using notes covering topics similar to the ground covered by this text. A fourth-year or graduate course would do well to consider material from Parts I and II of the book and then some material from Part III, perhaps in conjunction with some of the original research references included here and results presented at topical conferences.

We have attempted to fill both needs, concentrating on an undergraduate audience but including more advanced material as well. Sections that can safely be omitted on a first reading are marked with an asterisk.

What Is Covered in This Text?

In Part I, Multimedia Authoring and Data Representations, we introduce some of the notions included in the term multimedia and look at its history as well as its present.
Practically speaking, we carry out multimedia projects using software tools, so in addition to an overview of these tools, we get down to some of the nuts and bolts of multimedia authoring. Representing data is critical in multimedia, and we look at the most important data representations for multimedia applications, examining image data, video data, and audio data in detail. Since color is vitally important in multimedia programs, we see how this important area impacts multimedia issues.

In Part II, Multimedia Data Compression, we consider how we can make all this data fly onto the screen and speakers. Data compression turns out to be an important enabling technology that makes modern multimedia systems possible, so we look at lossless and lossy compression methods. For the latter category, JPEG still-image compression standards, including JPEG2000, are arguably the most important, so we consider these in detail. But since a picture is worth a thousand words and video is worth more than a million words per minute, we examine the ideas behind the MPEG standards MPEG-1, MPEG-2, MPEG-4, MPEG-7, and beyond. Separately, we consider some basic audio compression techniques and take a look at MPEG Audio, including MP3.

In Part III, Multimedia Communication and Retrieval, we consider the great demands multimedia places on networks and systems. We go on to consider network technologies
and protocols that make interactive multimedia possible. Some of the applications discussed include multimedia on demand, multimedia over IP, multimedia over ATM, and multimedia over wireless networks. Content-based retrieval is a particularly important issue in digital libraries and interactive multimedia, so we examine ideas and systems for this application in some detail.

Textbook Web Site

The book's web site is www.cs.sfu.ca/mmbook. There, you will find copies of figures from the book, an errata sheet updated regularly, programs that help demonstrate concepts in the text, and a dynamic set of links for the Further Exploration section of each chapter. Since these links are regularly updated (and of course URLs change often), they are mostly online rather than in the text.

Instructors' Resources

The main text web site has no ID and password, but access to sample student projects is at the instructor's discretion and is password-protected. Prentice Hall also hosts a web site containing course instructor resources for adopters of the text. These include an extensive collection of online course notes, a one-semester course syllabus and calendar of events, solutions for the exercises in the text, sample assignments and solutions, sample exams, and extra exam questions.

Acknowledgements

We are most grateful to colleagues who generously gave of their time to review this text, and we wish to express our thanks to Shu-Ching Chen, Edward Chang, Qianping Gu, Rachelle S. Heller, Gongzhu Hu, S. N. Jayaram, Tiko Kameda, Xiaobo Li, Siwei Lu, Dennis Richards, and Jacques Vaisey.

The writing of this text has been greatly aided by a number of suggestions from present and former colleagues and students. We would like to thank James Au, Chad Ciavarro, Hao Jiang, Steven Kilthau, Michael King, Cheng Lu, Yi Sun, Dominic Szopa, Zinovi Tauber, Malte von Ruden, Jian Wang, Jie Wei, Edward Yan, Yingchen Yang, Osmar Zaïane, Wenbiao Zhang, and William Zhong for their assistance.
As well, Mr. Ye Lu made great contributions to Chapters 8 and 9, and his valiant efforts are particularly appreciated. We are also most grateful to the students who generously made their course projects available for instructional use in this book.
PART ONE
MULTIMEDIA AUTHORING AND DATA REPRESENTATIONS

Chapter 1  Introduction to Multimedia
Chapter 2  Multimedia Authoring and Tools
Chapter 3  Graphics and Image Data Representations
Chapter 4  Color in Image and Video
Chapter 5  Fundamental Concepts in Video
Chapter 6  Basics of Digital Audio
Introduction to Multimedia

As an introduction to multimedia, in Chapter 1 we consider the question of just what multimedia is. We examine its history and the development of hypertext and hypermedia. We then get down to practical matters with an overview of multimedia software tools. These are the basic means we use to develop multimedia content. But a multimedia production is much more than the sum of its parts, so Chapter 2 looks at the nuts and bolts of multimedia authoring design and a taxonomy of authoring metaphors. The chapter also sets out a list of important contemporary multimedia authoring tools in current use.

Multimedia Data Representations

As in many fields, the issue of how best to represent the data is of crucial importance in the study of multimedia. Chapters 3 through 6 consider how this is addressed in this field, setting out the most important data representations in multimedia applications. Because the main areas of concern are images, moving pictures, and audio, we begin investigating these
in Chapter 3, Graphics and Image Data Representations, then look at the basics of video in Chapter 5. Before going on to Chapter 6, Basics of Digital Audio, we take a side trip into issues on the use of color, since color is vitally important in multimedia.
C H A P T E R  1

Introduction to Multimedia
1.1 WHAT IS MULTIMEDIA?

People who use the term "multimedia" often seem to have quite different, even opposing, viewpoints. A PC vendor would like us to think of multimedia as a PC that has sound capability, a DVD-ROM drive, and perhaps the superiority of multimedia-enabled microprocessors that understand additional multimedia instructions. A consumer entertainment vendor may think of multimedia as interactive cable TV with hundreds of digital channels, or a cable-TV-like service delivered over a high-speed Internet connection.

A computer science student reading this book likely has a more application-oriented view of what multimedia consists of: applications that use multiple modalities to their advantage, including text, images, drawings (graphics), animation, video, sound (including speech), and, most likely, interactivity of some kind.

The popular notion of "convergence" is one that inhabits the college campus as it does the culture at large. In this scenario, PCs, DVDs, games, digital TV, set-top web surfing, wireless, and so on are converging in technology, presumably to arrive in the near future at a final all-around, multimedia-enabled product. While hardware may indeed involve such devices, the present is already exciting: multimedia is part of some of the most interesting projects underway in computer science. The convergence going on in this field is in fact a convergence of areas that have in the past been separated but are now finding much to share in this new application area. Graphics, visualization, HCI, computer vision, data compression, graph theory, networking, database systems: all have important contributions to make in multimedia at the present time.

1.1.1 Components of Multimedia

The multiple modalities of text, audio, images, drawings, animation, and video in multimedia are put to use in ways as diverse as:

• Video teleconferencing

• Distributed lectures for higher education

• Telemedicine

• Cooperative work environments that allow business people to edit a shared document or schoolchildren to share a single game using two mice that pass control back and forth
• "Augmented" reality: placing real-appearing computer graphics and video objects into scenes so as to take the physics of objects and lights (e.g., shadows) into account

• Audio cues for where video-conference participants are seated, as well as taking into account gaze direction and attention of participants

• Building searchable features into new video and enabling very high to very low bitrate use of new, scalable multimedia products

• Making multimedia components editable: allowing the user side to decide what components (video, graphics, and so on) are actually viewed, allowing the client to move components around or delete them, and making components distributed

• Building "inverse-Hollywood" applications that can re-create the process by which a video was made, allowing storyboard pruning and concise video summarization
• Using voice recognition to build an interactive environment, say a kitchen-wall web browser
From the computer science student's point of view, what makes multimedia interesting is that so much of the material covered in traditional computer science areas bears on the multimedia enterprise: networks, operating systems, real-time systems, vision, information retrieval. Like databases, multimedia touches on many traditional areas.

1.1.2 Multimedia Research Topics and Projects

To the computer science researcher, multimedia consists of a wide variety of topics [1]:

• Multimedia processing and coding. This includes multimedia content analysis, content-based multimedia retrieval, multimedia security, audio/image/video processing, compression, and so on.

• Multimedia system support and networking. People look at such topics as network protocols, the Internet, operating systems, servers and clients, quality of service (QoS), and databases.
• Multimedia tools, end systems, and applications. These include hypermedia systems, user interfaces, authoring systems, multimodal interaction, and integration: "ubiquity" (web-everywhere devices), multimedia education (including computer-supported collaborative learning and design), and applications of virtual environments.

The concerns of multimedia researchers also impact researchers in almost every other branch of computer science. For example, data mining is an important current research area, and a large database of multimedia data objects is a good example of just what we may be interested in mining. Telemedicine applications, such as "telemedical patient consultative encounters," are multimedia applications that place a heavy burden on existing network architectures.

Current Multimedia Projects

Many exciting research projects are currently underway in multimedia, and we'd like to introduce a few of them here.

For example, researchers are interested in camera-based object-tracking technology. One aim is to develop control systems for industrial control, gaming, and so on that rely on moving scale models (toys) around a real environment (a board game, say). Tracking the control objects (toys) provides user control of the process.

3D motion capture can also be used for multiple actor capture, so that multiple real actors in a virtual studio can be used to automatically produce realistic animated models with natural movement.

Multiple views from several cameras, or from a single camera under differing lighting, can accurately acquire data that gives both the shape and surface properties of materials, thus automatically generating synthetic graphics models. This allows photo-realistic (video-quality) synthesis of virtual actors.

3D capture technology is next to fast enough now to allow acquiring dynamic characteristics of human facial expression during speech, to synthesize highly realistic facial animation from speech.

Multimedia applications aimed at handicapped persons, particularly those with poor vision, and at the elderly are a rich field of endeavor in current research.

"Digital fashion" aims to develop smart clothing that can communicate with other such enhanced clothing using wireless communication, so as to artificially enhance human interaction in a social setting. The vision here is to use technology to allow individuals to allow certain thoughts and feelings to be broadcast automatically, for exchange with others equipped with similar technology.

Georgia Tech's Electronic Housecall system, an initiative for providing interactive health monitoring services to patients in their homes, relies on networks for delivery, challenging current capabilities.

Behavioral science models can be brought into play to model interaction between people, which can then be extended to enable natural interaction by virtual characters. Such "augmented interaction" applications can be used to develop interfaces between real and virtual humans for tasks such as augmented storytelling.

Each of these application areas pushes the development of computer science generally, stimulates new applications, and fascinates practitioners.

1.2 MULTIMEDIA AND HYPERMEDIA

To place multimedia in its proper context, in this section we briefly consider the history of multimedia, a recent part of which is the connection between multimedia and hypermedia. We go on to a quick overview of multimedia software tools available for the creation of multimedia content, which prepares us to examine, in Chapter 2, the larger issue of integrating this content into full-blown multimedia productions.
• Searching (very) large video and image databases for target visual objecis
• Using voice recognition to build an interactive environment web browser
Multimedia and Hypermedia
1.2.1
History of Multimedia
A brief history of the use of multimedia to communicate ideas might begin with newspapers, which were perhaps the first mass communication medium, using text, graphics, and images.
Chapter 1
Introduction to Multimedia
Motion pictures were originally conceived of in the 1830s to observe motion too rapid for perception by the human eye. Thomas Alva Edison commissioned the invention of a motion picture camera in 1887. Silent feature films appeared from 1910 to 1927; the silent era effectively ended with the release of The Jazz Singer in 1927.
In 1895, Guglielmo Marconi sent his first wireless radio transmission at Pontecchio, Italy. A few years later (1901), he detected radio waves beamed across the Atlantic. Initially invented for telegraph, radio is now a major medium for audio broadcasting. In 1909, Marconi shared the Nobel Prize for physics. (Reginald A. Fessenden, of Quebec, beat Marconi to human voice transmission by several years, but not all inventors receive due credit. Nevertheless, Fessenden was paid $2.5 million in 1928 for his purloined patents.)
Television was the new medium for the twentieth century. It established video as a commonly available medium and has since changed the world of mass communication.
The connection between computers and ideas about multimedia covers what is actually only a short period:
1945 As part of MIT's postwar deliberations on what to do with all those scientists employed on the war effort, Vannevar Bush (1890-1974) wrote a landmark article [2] describing what amounts to a hypermedia system, called "Memex." Memex was meant to be a universally useful and personalized memory device that even included the concept of associative links; it really is the forerunner of the World Wide Web. After World War II, 6,000 scientists who had been hard at work on the war effort suddenly found themselves with time to consider other issues, and the Memex idea was one fruit of that new freedom.
1960s Ted Nelson started the Xanadu project and coined the term "hypertext." Xanadu was the first attempt at a hypertext system; Nelson called it a "magic place of literary memory."
1967 Nicholas Negroponte formed the Architecture Machine Group at MIT.
1968 Douglas Engelbart, greatly influenced by Vannevar Bush's "As We May Think," demonstrated the "On-Line System" (NLS), another early hypertext program. Engelbart's group at Stanford Research Institute aimed at "augmentation, not automation," to enhance human abilities through computer technology. NLS consisted of such critical ideas as an outline editor for idea development, hypertext links, teleconferencing, word processing, and e-mail, and made use of the mouse pointing device, windowing software, and help systems [3].
1969 Nelson and van Dam at Brown University created an early hypertext editor called FRESS [4]. The present-day Intermedia project by the Institute for Research in Information and Scholarship (IRIS) at Brown is the descendant of that early system.
1976 The MIT Architecture Machine Group proposed a project entitled "Multiple Media." This resulted in the Aspen Movie Map, the first hypermedia videodisc, in 1978.
1985 Negroponte and Wiesner cofounded the MIT Media Lab, a leading research institution investigating digital video and multimedia.
1989 Tim Berners-Lee proposed the World Wide Web to the European Council for Nuclear Research (CERN).
1990 Kristina Hooper Woolsey headed the Apple Multimedia Lab, with a staff of 100. Education was a chief goal.
1991 MPEG-1 was approved as an international standard for digital video. Its further development led to newer standards, MPEG-2, MPEG-4, and further MPEGs, in the 1990s.
1991 The introduction of PDAs in 1991 began a new period in the use of computers in general and multimedia in particular. This development continued in 1996 with the marketing of the first PDA with no keyboard.
1992 JPEG was accepted as the international standard for digital image compression. Its further development has now led to the new JPEG2000 standard.
1992 The first MBone audio multicast on the Net was made.
1993 The University of Illinois National Center for Supercomputing Applications produced NCSA Mosaic, the first full-fledged browser, launching a new era in Internet information access.
1994 Jim Clark and Marc Andreessen created the Netscape program.
1995 The JAVA language was created for platform-independent application development.
1996 DVD video was introduced; high-quality, full-length movies were distributed on a single disk. The DVD format promised to transform the music, gaming, and computer industries.
1998 XML 1.0 was announced as a W3C Recommendation.
1998 Handheld MP3 devices first made inroads into consumer tastes in the fall, with the introduction of devices holding 32 MB of flash memory.
2000 World Wide Web (WWW) size was estimated at over 1 billion pages.
1.2.2
Hypermedia and Multimedia
Ted Nelson invented the term "HyperText" around 1965. Whereas we may think of a book as a linear medium, basically meant to be read from beginning to end, a hypertext system is meant to be read nonlinearly, by following links that point to other parts of the document, or indeed to other documents. Figure 1.1 illustrates this idea.
Hypermedia is not constrained to be text-based. It can include other media, such as graphics, images, and especially the continuous media — sound and video. (Apparently, Ted Nelson was also the first to use this term.) The World Wide Web (WWW) is the best example of a hypermedia application.
As we have seen, multimedia fundamentally means that computer information can be represented through audio, graphics, images, video, and animation in addition to traditional media (text and graphics). Hypermedia can be considered one particular multimedia application.
FIGURE 1.1: Hypertext is nonlinear.
Examples of typical multimedia applications include: digital video editing and production systems; electronic newspapers and magazines; the World Wide Web; online reference works, such as encyclopedias; games; groupware; home shopping; interactive TV; multimedia courseware; video conferencing; video-on-demand; and interactive movies.
1.3 WORLD WIDE WEB
The World Wide Web is the largest and most commonly used hypermedia application. Its popularity is due to the amount of information available from web servers, the capacity to post such information, and the ease of navigating such information with a web browser. WWW technology is maintained and developed by the World Wide Web Consortium (W3C), although the Internet Engineering Task Force (IETF) standardizes the technologies. The W3C has listed the following three goals for the WWW: universal access of web resources (by everyone everywhere), effectiveness of navigating available information, and responsible use of posted material.
1.3.1
History of the WWW
Amazingly, one of the most predominant networked multimedia applications has its roots in nuclear physics! As noted in the previous section, Tim Berners-Lee proposed the World Wide Web to CERN (the European Center for Nuclear Research) as a means for organizing and sharing their work and experimental results. The following is a short list of important dates in the creation of the WWW:
1960s It is recognized that documents need to have formats that are human-readable and that identify structure and elements. Charles Goldfarb, Edward Mosher, and Raymond Lorie developed the Generalized Markup Language (GML) for IBM.
1986 The ISO released a final version of the Standard Generalized Markup Language (SGML), mostly based on the earlier GML.
1990 With approval from CERN, Tim Berners-Lee started developing a hypertext server, browser, and editor on a NeXTStep workstation. He invented hypertext markup language (HTML) and the hypertext transfer protocol (HTTP) for this purpose.
1993 NCSA released an alpha version of Mosaic based on the version by Marc Andreessen for the X Windows System. This was the first popular browser. Microsoft's Internet Explorer is based on Mosaic.
1994 Marc Andreessen and some colleagues from NCSA joined Dr. James H. Clark (also the founder of Silicon Graphics Inc.) to form Mosaic Communications Corporation. In November, the company changed its name to Netscape Communications Corporation.
1998 The W3C accepted XML version 1.0 specifications as a Recommendation. XML is the main focus of the W3C and supersedes HTML.
1.3.2 HyperText Transfer Protocol (HTTP)
HTTP is a protocol that was originally designed for transmitting hypermedia, but it also supports transmission of any file type. HTTP is a "stateless" request/response protocol, in the sense that a client typically opens a connection to the HTTP server, requests information, the server responds, and the connection is terminated; no information is carried over for the next request. The basic request format is

    Method URI Version
    Additional-Headers

    Message-body

The Uniform Resource Identifier (URI) identifies the resource accessed, such as the host name, always preceded by the token "http://". A URI could be a Uniform Resource Locator (URL), for example. Here, the URI can also include query strings (some interactions require submitting data). Method is a way of exchanging information or performing tasks on the URI. Two popular methods are GET and POST. GET specifies that the information requested is in the request string itself, while the POST method specifies that the resource pointed to in the URI should consider the message body. POST is generally used for submitting HTML forms. Additional-Headers specifies additional parameters about the client. For example, to request access to this textbook's web site, the following HTTP message might be generated:

    GET http://www.cs.sfu.ca/mbook/ HTTP/1.1
The basic response format is

    Version Status-Code Status-Phrase
    Additional-Headers

    Message-body

Status-Code is a number that identifies the response type (or error that occurs), and Status-Phrase is a textual description of it. Two commonly seen status codes and phrases are 200 OK when the request was processed successfully and 404 Not Found when the URI does not exist. For example, in response to the example request above for this textbook's URL, the web server may return something like

    HTTP/1.1 200 OK
    Server: [No-plugs-here-please]
    Date: Wed, 25 July 2002 20:04:30 GMT
    Content-Length: 1045
    Content-Type: text/html
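The request and response layouts above can be sketched in code. The following minimal Python sketch (our illustration, not from the book; the function names are invented) assembles a request in the Method URI Version form and splits a status line into its code and phrase; no network connection is needed:

```python
# Illustrative sketch of the HTTP request/response layouts described
# in the text. We build a GET request string and parse a sample
# response status line.

def build_request(method, uri, version="HTTP/1.1", headers=None, body=""):
    """Assemble a request: Method URI Version, then headers, blank line, body."""
    lines = [f"{method} {uri} {version}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("")                 # blank line separates headers from body
    lines.append(body)
    return "\r\n".join(lines)

def parse_status_line(status_line):
    """Split 'HTTP/1.1 200 OK' into (version, status code, status phrase)."""
    version, code, phrase = status_line.split(" ", 2)
    return version, int(code), phrase

print(build_request("GET", "http://www.cs.sfu.ca/mbook/",
                    headers={"Host": "www.cs.sfu.ca"}).splitlines()[0])
print(parse_status_line("HTTP/1.1 404 Not Found"))
```

A real client would, of course, send this string over a TCP connection and read the response back, but the stateless request/response shape is the same.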
1.3.3 HyperText Markup Language (HTML)
HTML is a language for publishing hypermedia on the World Wide Web. It is defined using SGML and derives elements that describe generic document structure and formatting. Since it uses ASCII, it is portable to all different (even binary-incompatible) computer hardware, which allows for global exchange of information. The current version of HTML is version 4.01, specified in 1999. The next generation of HTML is XHTML, a reformulation of HTML using XML.
HTML uses tags to describe document elements. The tags are in the format <token params> to define the start point of a document element and </token> to define the end of the element. Some elements have only inline parameters and don't require ending tags. HTML divides the document into a HEAD and a BODY part as follows:

    <html>
      <head>
        ...
      </head>
      <body>
        ...
      </body>
    </html>

The HEAD describes document definitions, which are parsed before any document rendering is done. These include page title, resource links, and meta-information the author decides to specify. The BODY part describes the document structure and content. Common structure elements are paragraphs, tables, forms, links, item lists, and buttons.
A very simple HTML page is as follows:

    <html>
      <head>
        <title>A sample web page</title>
      </head>
      <body>
        <p>
          We can put any text we like here, since this is a paragraph element.
        </p>
      </body>
    </html>
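The HEAD/BODY nesting just described can also be inspected programmatically. As an illustration (ours, not from the book), Python's standard html.parser module can walk the tag structure of a page like the one above and print its outline:

```python
# Illustrative sketch: print each HTML element's tag name, indented
# by its nesting depth, using Python's standard html.parser.
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collect an indented outline of the element tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []
    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

page = """<html>
  <head><title>A sample web page</title></head>
  <body><p>We can put any text we like here.</p></body>
</html>"""

parser = OutlineParser()
parser.feed(page)
print("\n".join(parser.outline))
```

Running this prints html at the top level, with head/title and body/p indented beneath it, mirroring the HEAD and BODY structure described in the text.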
Naturally, HTML has more complex structures and can be mixed with other standards. The standard has evolved to allow integration with script languages, dynamic manipulation of almost all elements and properties after display on the client side (dynamic HTML), and modular customization of all rendering parameters using a markup language called Cascading Style Sheets (CSS). Nonetheless, HTML has rigid, nondescriptive structure elements, and modularity is hard to achieve.
1.3.4 Extensible Markup Language (XML)
There is a need for a markup language for the WWW that has modularity of data, structure, and view. That is, we would like a user or an application to be able to define the tags (structure) allowed in a document and their relationship to each other, in one place, then define data using these tags in another place (the XML file) and, finally, define in yet another document how to render the tags.
Suppose you wanted to have stock information retrieved from a database according to a user query. Using XML, you would use a global Document Type Definition (DTD) you have already defined for stock data. Your server-side script will abide by the DTD rules to generate an XML document according to the query, using data from your database. Finally, you will send users your XML Style Sheet (XSL), depending on the type of device they use to display the information, so that your document looks best both on a computer with a 21-inch CRT monitor and on a cellphone.
The current XML version is XML 1.0, approved by the W3C in February 1998. XML syntax looks like HTML syntax, although it is much stricter. All tags are lowercase, and a tag that has only inline data has to terminate itself (that is, be self-closing). XML also uses namespaces, so that multiple DTDs declaring different elements but with similar tag names can have their elements distinguished. DTDs can be imported from URIs as well. As an example of an XML document structure, here is the definition for a small XHTML document:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      [html that follows the above-mentioned XML rules]
    </html>
All XML documents start with an XML declaration, <?xml version="1.0"?>. The <!DOCTYPE> declaration is a special tag used for importing DTDs. Since it is a DTD definition, it does not adhere to XML rules. xmlns defines a unique XML namespace for the document elements. In this case, the namespace is the XHTML specifications web site.
In addition to the XML specifications, the following XML-related specifications are standardized:
• XML Protocol. Used to exchange XML information between processes. It is meant
to supersede HTTP and extend it, as well as to allow interprocess communications across networks.
• XML Schema. A more structured and powerful language for defining XML data types (tags). Unlike a DTD, XML Schema uses XML tags for type definitions.
• XSL. This is basically CSS for XML. On the other hand, XSL is much more complex, having three parts: XSL Transformations (XSLT), XML Path Language (XPath), and XSL Formatting Objects.
• SMIL: Synchronized Multimedia Integration Language, pronounced "smile". This is a particular application of XML (globally predefined DTD) that permits specifying temporally scripted interaction among any media types and user input. For example, it can be used to show a streaming video synchronized with a slide show presentation, both reacting to user navigation through the slide show or video.
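To make the stock-data scenario above concrete, here is a small illustrative sketch (ours, not from the book; the tag names, namespace URI, and values are invented) that parses namespaced XML with Python's standard xml.etree module:

```python
# Illustrative sketch: parse a namespaced XML fragment, of the kind a
# server-side script might generate for a stock query. All names and
# values here are invented for the example.
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<quotes xmlns="http://example.com/stocks">
  <stock symbol="ABC"><price>12.50</price></stock>
  <stock symbol="XYZ"><price>7.25</price></stock>
</quotes>"""

ns = {"s": "http://example.com/stocks"}   # map a prefix to the namespace URI
root = ET.fromstring(doc)
for stock in root.findall("s:stock", ns):
    price = stock.find("s:price", ns).text
    print(stock.get("symbol"), price)
```

Note how the namespace keeps these stock tags distinct from any other DTD's elements that happen to use the same tag names, which is exactly the role of namespaces described above.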
1.3.5 Synchronized Multimedia Integration Language (SMIL)
Just as it was beneficial to have HTML provide text-document publishing using a readable markup language, it is also desirable to be able to publish multimedia presentations using a markup language. Multimedia presentations have additional characteristics: whereas in text documents the text is read sequentially and displayed all at once (at the same time), multimedia presentations can include many elements, such as video and audio, that have content changing through time. Thus, a multimedia markup language must enable scheduling and synchronization of different multimedia elements and define these elements' interactivity with the user.
The W3C established a Working Group in 1997 to come up with specifications for a multimedia synchronization language. That group produced specifications for SMIL 1.0 that became a Recommendation in June 1998. As HTML was being redefined in XML (the XHTML specifications), so too was SMIL 1.0, with some enhancements. SMIL 2.0, which also provides integration with HTML, was accepted as a Recommendation in August 2001.
SMIL 2.0 is specified in XML using a modularization approach similar to the one used in XHTML. All SMIL elements are divided into modules — sets of XML elements, attributes, and values that define one conceptual functionality. In the interest of modularization, not all available modules must be included for all applications. For that reason, Language Profiles are defined, specifying a particular grouping of modules. Particular modules may have integration requirements a profile must follow. SMIL 2.0 has a main language profile that includes almost all SMIL modules, a Basic profile that includes only modules necessary to support basic functionality, and an XHTML+SMIL profile designed to integrate HTML and SMIL. The latter includes most of the XHTML modules, with only the SMIL timing modules (but not structure modules — XHTML has its own structure modules) added.
The SMIL language structure is similar to XHTML. The root element is smil, which contains the two elements head and body. head contains information not used for synchronization — meta-information, layout information, and content control, such as media bitrate. body contains all the information relating to which resources to present, and when. Three types of resource synchronization (grouping) are available: seq, par, and excl. seq specifies that the elements grouped are to be presented in the specified order (sequentially). Alternatively, par specifies that all the elements grouped are to be presented at the same time (in parallel).
excl specifies that only one of the grouped elements can be presented at a time (exclusively); order does not matter. Let's look at an example of SMIL code:
    <?xml version="1.0"?>
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <head>
        <meta name="Author" content="Some Body"/>
      </head>
      <body>
        <par>
          <seq>
            <video src="authorview.mpg"/>
            <img src="onagoodday.jpg"/>
          </seq>
          <audio src="authorview.wav"/>
          <text src="http://booksite.html"/>
        </par>
      </body>
    </smil>
A SMIL document can optionally use the <!DOCTYPE> directive to import the SMIL DTD, which will force the interpreter to verify the document against the DTD. A SMIL document starts with <smil> and specifies the default namespace, using the xmlns attribute. The <head> section specifies the author of the document. The body element contains the synchronization information and resources we wish to present. In the example given, a video source called "authorview.mpg", an audio source, "authorview.wav", and an HTML document at "http://booksite.html" are presented simultaneously at the beginning. When the video ends, the image "onagoodday.jpg" is shown, while the audio and the HTML document are still presented. At this point, the audio will thank the listeners and conclude the interview. Additional information on SMIL specifications and available modules is available on the W3C web site.
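The seq/par timing rules just described can be stated computationally: a seq group's duration is the sum of its children's durations, while a par group lasts as long as its longest child. The following toy Python sketch (our illustration, not from the book; the clip durations are invented) applies these rules to the interview example:

```python
# Toy model of SMIL seq/par timing: seq durations add, par durations
# take the maximum. Clip names come from the text; durations are
# invented for illustration.

def duration(node):
    """Total duration in seconds of a ('clip'|'seq'|'par', ...) tree."""
    kind, children = node[0], node[1:]
    if kind == "clip":                      # ("clip", name, seconds)
        return children[1]
    child_durs = [duration(c) for c in children]
    if kind == "seq":                       # played one after another
        return sum(child_durs)
    if kind == "par":                       # played simultaneously
        return max(child_durs)
    raise ValueError(f"unknown grouping: {kind}")

# Video then image in sequence, in parallel with a continuous audio track.
presentation = ("par",
                ("seq",
                 ("clip", "authorview.mpg", 90),
                 ("clip", "onagoodday.jpg", 10)),
                ("clip", "authorview.wav", 100))

print(duration(presentation))   # 100
```

Here the 90-second video followed by the 10-second image takes 100 seconds, and the parallel 100-second audio track neither stretches nor truncates the presentation.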
1.4 OVERVIEW OF MULTIMEDIA SOFTWARE TOOLS
In this section, we look briefly at some of the software tools available for carrying out tasks in multimedia. These tools are really only the beginning — a fully functional multimedia project can also call for stand-alone programming as well as just the use of predefined tools to fully exercise the capabilities of machines and the Net.¹ The categories of software tools we examine here are
• Music sequencing and notation
• Digital audio
• Graphics and image editing
• Video editing
• Animation
• Multimedia authoring
¹See the accompanying web site for several interesting uses of software tools. In a typical computer science course in multimedia, the tools described here might be used to create a small multimedia production as a first assignment. Some of the tools are powerful enough that they might also form part of a course project.
1.4.1 Music Sequencing and Notation
Cakewalk. Cakewalk is a well-known older name for what is now called Pro Audio. The firm producing this sequencing and editing software, Twelve Tone Systems, also sells an introductory version of their software, "Cakewalk Express", over the Internet for a low price. The term sequencer comes from older devices that stored sequences of notes in the MIDI music language (events, in MIDI; see Section 6.2). It is also possible to insert WAV files and Windows MCI commands (for animation and video) into music tracks. (MCI is a ubiquitous component of the Windows API.)
Cubase. Cubase is another sequencing/editing program, with capabilities similar to those of Cakewalk. It includes some digital audio editing tools (see below).
Macromedia Soundedit. Soundedit is a mature program for creating audio for multimedia projects and the web that integrates well with other Macromedia products such as Flash and Director.
1.4.2 Digital Audio
Digital Audio tools deal with accessing and editing the actual sampled sounds that make up audio.
Cool Edit. Cool Edit is a powerful, popular digital audio toolkit with capabilities (for PC users, at least) that emulate a professional audio studio, including multitrack productions and sound file editing, along with digital signal processing effects.
Sound Forge. Sound Forge is a sophisticated PC-based program for editing WAV files. Sound can be captured from a CD-ROM drive or from tape or microphone through the sound card, then mixed and edited. It also permits adding complex special effects.
Pro Tools. Pro Tools is a high-end integrated audio production and editing environment that runs on Macintosh computers as well as Windows. Pro Tools offers easy MIDI creation and manipulation as well as powerful audio mixing, recording, and editing software.
1.4.3 Graphics and Image Editing
Adobe Illustrator. Illustrator is a powerful publishing tool for creating and editing vector graphics, which can easily be exported for use on the web.
Adobe Photoshop. Photoshop is the standard in a tool for graphics, image processing, and image manipulation. Layers of images, graphics, and text can be separately manipulated for maximum flexibility, and its "filter factory" permits creation of sophisticated lighting effects.
Macromedia Fireworks. Fireworks is software for making graphics specifically for the web. It includes a bitmap editor, a vector graphics editor, and a JavaScript generator for buttons and rollovers.
Macromedia Freehand. Freehand is a text and web graphics editing tool that supports many bitmap formats, such as GIF, PNG, and JPEG. These are pixel-based formats, in that each pixel is specified. It also supports vector-based formats, in which endpoints of lines are specified instead of the pixels themselves, such as SWF (Macromedia Flash) and FHC (Shockwave Freehand). It can also read Photoshop format.
1.4.4 Video Editing
Adobe Premiere. Premiere is a simple, intuitive video editing tool for nonlinear editing: putting video clips into any order. Video and audio are arranged in tracks, like a musical score. It provides a large number of video and audio tracks, superimpositions, and virtual clips. A large library of built-in transitions, filters, and motions for clips allows easy creation of effective multimedia productions.
Adobe After Effects. After Effects is a powerful video editing tool that enables users to add and change existing movies with effects such as lighting, shadows, and motion blurring. It also allows layers, as in Photoshop, to permit manipulating objects independently.
Final Cut Pro. Final Cut Pro is a video editing tool offered by Apple for the Macintosh platform. It allows the capture of video and audio from numerous sources, such as film and DV. It provides a complete environment, from capturing the video to editing and color correction and finally output to a video file or broadcast from the computer.
1.4.5
Animation
Multimedia APIs. Java3D is an API used by Java to construct and render 3D graphics, similar to the way the Java Media Framework handles media files. It provides a basic set of object primitives (cube, splines, etc.) upon which the developer can build scenes. It is an abstraction layer built on top of OpenGL or DirectX (the user can select which), so the graphics are accelerated. DirectX, a Windows API that supports video, images, audio, and 3D animation, is the most common API used to develop modern multimedia Windows applications, such as computer games. OpenGL was created in 1992 and has become the most popular 3D API in use today. OpenGL is highly portable and will run on all popular modern operating systems, such as UNIX, Linux, Windows, and Macintosh.
Rendering Tools. 3D Studio Max includes a number of high-end professional tools for character animation, game development, and visual effects production. Models produced using this tool can be seen in several consumer games, such as for the Sony Playstation. Softimage XSI (previously called Softimage 3D) is a powerful modeling, animation, and rendering package for animation and special effects in films and games. Maya, a competing product to Softimage, is a complete modeling package. It features a wide variety of modeling and animation tools, such as to create realistic clothes and fur. RenderMan is a rendering package created by Pixar. It excels in creating complex surface appearances and images and has been used in numerous movies, such as Monsters, Inc. and Final Fantasy: The Spirits Within. It is also capable of importing models from Maya.
GIF Animation Packages. For a simpler approach to animation that also allows quick development of effective small animations for the web, many shareware and other programs permit creating animated GIF images. GIFs can contain several images, and looping through them creates a simple animation. Gifcon and GifBuilder are two of these. Linux also provides some simple animation tools, such as animate.
1.4.6 Multimedia Authoring
Tools that provide the capability for creating a complete multimedia presentation, including interactive user control, are called authoring programs.
Macromedia Flash. Flash allows users to create interactive movies by using the score metaphor: a timeline arranged in parallel event sequences, much like a musical score consisting of musical notes. Elements in the movie are called symbols in Flash. Symbols are added to a central repository, called a library, and can be added to the movie's timeline. Once the symbols are present at a specific time, they appear on the Stage, which represents what the movie looks like at a certain time, and can be manipulated and moved by the tools built into Flash. Finished Flash movies are commonly used to show movies or games on the web.
Macromedia Director. Director uses a movie metaphor to create interactive presentations. This powerful program includes a built-in scripting language, Lingo, that allows creation of complex interactive movies.² The "cast" of characters in Director includes bitmapped sprites, scripts, music, sounds, and palettes. Director can read many bitmapped file formats. The program itself allows a good deal of interactivity, and Lingo, with its own debugger, allows more control, including control over external devices, such as VCRs and videodisc players. Director also has web authoring features available, for creation of fully interactive Shockwave movies playable over the web.
Authorware. Authorware is a mature, well-supported authoring product that has an easy learning curve for computer science students because it is based on the idea of flowcharts (the so-called iconic/flow-control metaphor). It allows hyperlinks to link text, digital movies, graphics, and sound. It also provides compatibility between files produced in PC and Mac versions. Shockwave Authorware applications can incorporate Shockwave files, including Director movies, Flash animations, and audio.
Quest. Quest, which uses a type of flowcharting metaphor, is similar to Authorware in many ways. However, the flowchart nodes can encapsulate information in a more abstract way (called "frames") than simply subroutine levels. As a result, connections between icons are more conceptual and do not always represent flow of control in the program.
1.5 FURTHER EXPLORATION
Chapters 1 and 2 of Steinmetz and Nahrstedt [5] provide a good overview of multimedia concepts. The web site for this text is kept current on new developments. Chapter 1 of the Further Exploration directory on the web site provides links to much of the history of multimedia. As a start, the complete Vannevar Bush article on the Memex system conception is online. This article was and still is considered seminal. Although written over 50 years ago, it adumbrates many current developments, including fax machines and the associative memory model that underlies the development of the web. Nielsen's book [6] is a good overview of hypertext and hypermedia. For more advanced reading, the collection of survey papers by Jeffay and Zhang [1] provides in-depth background as well as future directions of research. Other links in the text web site include information on
• Ted Nelson and the Xanadu project
• Nicholas Negroponte's work at the MIT Media Lab. Negroponte's small book on multimedia [7] has become a much-quoted classic.
• Douglas Engelbart and the history of the "On-Line System"
• The MIT Media Lab. Negroponte and Wiesner cofounded the MIT Media Lab, which is still going strong and is arguably the most influential idea factory in the world.
• Client-side execution. Java and client-side execution started in 1995; "Duke", the first JAVA applet, is also on the textbook's web site.
Chapter 12 of Buford's book [8] provides a detailed introduction to authoring. Neuschotz's introductory text [9] gives step-by-step instructions for creating simple Lingo-based interactive Director movies. Other links include
• Digital Audio. This web page includes a link to the Sonic Foundry company for information on Sound Forge, a sample Sound Forge file, and the resulting output WAV file. The example combines left and right channel information in a complex fashion. Little effort is required to produce sophisticated special effects with this tool. Digidesign is one firm offering high-end Macintosh software, which can even involve purchasing extra boards for specialized processing.
• Music sequencing and notation
²Therefore, Director is often a popular choice with students for creating a final project in multimedia courses — it provides the desired power without the inevitable pain of using a full-blown C++ program.
section 1.7
References
19
3. Your Lask is Lo Lhink about lhe lransmission of smell over Lhe InlemeL. Suppose we have a srnell sensor aL one location and wish lo Lransmit lhe Aroma Vector (say) to a receiver lo reproduce Lhe sarne sensaLion. You are asked lo design such a system. LisL three key issues to consider and Lwo applications of such a delivery system. Hini: Think abouL medical applications. 4. Tracking objecLs or people can be done by boLh sighL and sound. While vision systems are precise, Lhey are relalively expensive; on lhe oLher hand, a pair of microphones can delect a person’s bearing inaccurately bui cheaply. Sensorfusion of sound and vision is thus useful. Surf Lhe web Lo find ouL who is developing tools for video conferencing using lhis kind of multimedia idea. 5. Non-pho:orealistic graphics means compuler graphics Lhat do well enough wilhouL aLtempLing Lo make images IbaL look like camera images. An example is conferenc ing (leIs look aL Lhis cutting-edge applicaLion again). For example, if we track lip movements, we can generate lhe right animaLion lo fiL mar face. lf we don’L much like our own face, we can substitute another one — facial-feature rnodeling can rnap correct lip movemenls onto anoLher model. See if you can find oul who is carrying OUL research 011 generaLing avaLars lo represenL conference participanls’ bodies. 6. WaLerrnarking is a means of embedding a hidden message in data. This could have importanL legal implications: Is Lhis image copied? Is this image docLored? Who looic iL? Where? Think of “messages” LhaL could be sensed while capturing an image and secretly embedded in lhe image, so as lo answer these questions. (A similar quesLion derives from lhe use of celI phones. WhaL could we use Lo determine who is pulLing Lhis phone to use, and where, and when? This could eliminaLe Lhe need for passwords.) 1.7 REFERENCES 1 K. Jeffay and H. Zhang, Readangs ii. Muhirnedia Conipunng and Networking, San Francisco: Morgan Kaufmann, CA, 2002. 
2 Vannevar Bush, “As We May ThinlC’ The Aiianric Monthty, Jul. 1945.
• Graphics and image ediLing information
3 D. Engelbart and Ii. Lehtman, “Working TogeLherT BYTE Magazine, Dcc. 1998. 245—252.
• Video editing producls and information
4 N. Yankelovitch, N. Meyrowitz, and A. van Dam, “Reading and Writing Lhe Eleclronic BookT ia Hypennedia and Literary Studies, ed. P. Delany and G.P. Landow, Cambridge, MA: MIT Press, 1991.
• AnimaLion siLes • Multimedia auLhoring Lools • XML.
5 R. SLeinmeLz and K. Nahrsledt, Multimedia: Compuung, C’om,nunications and Applications, Upper Saddle River, P11: Prentice HalI PTR, 1995. 6 J. Nielsen, Multimedia and Hypertexr: lhe Interne? and Beyond, San Diego: AP Professional, 1995. 7 N. Negroponie, Being Digital, New York: VinLage Books, 1995.
1.6
EXERCISES 1. ldentify three novel applicaLions of Lhe InLernel or multimedia applicalions. Discuss why you think lhese are novel. 2. Brielly expIam, in your own words, “Memex” and its role regarding hypertext. Could we carry oul Lhe Memex Lask today? How do you use Memex ideas in your own work?
8 J.F.K. Buford, Mui:i,nedia Systenrs, Reading, MA: Addison Wesley, 1994. 9 P1. Neuscholz, Introduction ‘o Director and Lingo: Muitimedia and Internes Appiications,
Upper Saddle River, N,J: Prenlice Hail, 2000.
CHAPTER 2
Multimedia Authoring and Tools

2.1 MULTIMEDIA AUTHORING
Multimedia authoring is the creation of multimedia productions, sometimes called "movies" or "presentations". Since we are interested in this subject from a computer science point of view, we are mostly interested in interactive applications. Also, we need to consider still-image editors, such as Adobe Photoshop, and simple video editors, such as Adobe Premiere, because these applications help us create interactive multimedia projects.
How much interaction is necessary or meaningful depends on the application. The spectrum runs from almost no interactivity, as in a slide show, to full-immersion virtual reality. In a slide show, interactivity generally consists of being able to control the pace (e.g., click to advance to the next slide). The next level of interactivity is being able to control the sequence and choose where to go next. Next is media control: start/stop video, search text, scroll the view, zoom. More control is available if we can control variables, such as changing a database search query. The level of control is substantially higher if we can control objects: say, moving objects around a screen, playing interactive games, and so on. Finally, we can control an entire simulation: move our perspective in the scene, control scene objects.
For some time, people have indeed considered what should go into a multimedia project; references are given at the end of this chapter. In this section, we shall look at
• Multimedia authoring metaphors
• Multimedia production
• Multimedia presentation
• Automatic authoring
The final item deals with general authoring issues and what benefit automated tools, using some artificial intelligence techniques, for example, can bring to the authoring task. As a first step, we consider programs that carry out automatic linking for legacy documents.
After an introduction to multimedia paradigms, we present some of the practical tools of multimedia content production: software tools that form the arsenal of multimedia
production. Here we go through the nuts and bolts of a number of standard programs currently in use.
2.1.1 Multimedia Authoring Metaphors
Authoring is the process of creating multimedia applications. Most authoring programs use one of several authoring metaphors, also known as authoring paradigms: metaphors for easier understanding of the methodology employed to create multimedia applications [1]. Some common authoring metaphors are as follows:
• Scripting language metaphor
The idea here is to use a special language to enable interactivity (buttons, mouse, etc.) and allow conditionals, jumps, loops, functions/macros, and so on. An example is the OpenScript language in Asymetrix Learning Systems' Toolbook program. OpenScript looks like a standard object-oriented, event-driven programming language. For example, a small Toolbook program is shown below. Such a language has a learning curve associated with it, as do all authoring tools — even those that use the standard C programming language as their scripting language — because of the object libraries that must be learned.

    -- load an MPEG file
    extFileName of MediaPlayer "theMpegPath" = "c:\windows\media\home3j.mpg";
    -- play
    extPlayCount of MediaPlayer "theMpegPath" = 1;
    -- put the MediaPlayer in frames mode (not time mode)
    extDisplayMode of MediaPlayer "theMpegPath" = 1;
    -- if want to start and end at specific frames:
    extSelectionStart of MediaPlayer "theMpegPath" = 103;
    extSelectionEnd of MediaPlayer "theMpegPath" = 1997;
    -- start playback
    get extPlay() of MediaPlayer "theMpegPath";

• Slide show metaphor
Slide shows are by default a linear presentation. Although tools exist to perform jumps in slide shows, few practitioners use them. Example programs are PowerPoint or ImageQ.
• Hierarchical metaphor
Here, user-controllable elements are organized into a tree structure. Such a metaphor is often used in menu-driven applications.
• Iconic/flow-control metaphor
Graphical icons are available in a toolbox, and authoring proceeds by creating a flowchart with icons attached.
The standard example of such a metaphor is Authorware, by Macromedia. Figure 2.1 shows an example flowchart. As well as simple flowchart elements, such as an IF statement, a CASE statement, and so on, we can
FIGURE 2.1: Authorware flowchart.

group elements using a Map (i.e., a subroutine) icon. With little effort, simple animation is also possible.
• Frames metaphor
As in the iconic/flow-control metaphor, graphical icons are again available in a toolbox, and authoring proceeds by creating a flowchart with icons attached. However, rather than representing the actual flow of the program, links between icons are more conceptual. Therefore, "frames" of icon designs represent more abstraction than in the simpler iconic/flow-control metaphor. An example of such a program is Quest, by Allen Communication. The flowchart consists of "modules" composed of "frames". Frames are constructed from objects, such as text, graphics, audio, animations, and video, all of which can respond to events. A real benefit is that the scripting language here is the widely used programming language C. Figure 2.2 shows a Quest frame.
• Card/scripting metaphor
This metaphor uses a simple index-card structure to produce multimedia productions. Since links are available, this is an easy route to producing applications that use hypertext or hypermedia. The original of this metaphor was HyperCard by Apple. Another example is HyperStudio by Knowledge Adventure. The latter program is now used in many schools. Figure 2.3 shows two cards in a HyperStudio stack.
• Cast/score/scripting metaphor
In this metaphor, time is shown horizontally in a type of spreadsheet fashion, where rows, or tracks, represent instantiations of characters in a multimedia production. Since these tracks control synchronous behavior, this metaphor somewhat parallels a music score. Multimedia elements are drawn from a "cast" of characters, and "scripts"
FIGURE 2.2: Quest frame.
are basically event procedures or procedures triggered by timer events. Usually, you can write your own scripts. In a sense, this is similar to the conventional use of the term "scripting language" — one that is concise and invokes lower-level abstractions, since that is just what one's own scripts do. Director, by Macromedia, is the chief example of this metaphor. Director uses the Lingo scripting language, an object-oriented, event-driven language.

2.1.2 Multimedia Production
A multimedia project can involve a host of people with specialized skills. In this book we focus on more technical aspects, but multimedia production can easily involve an art director, graphic designer, production artist, producer, project manager, writer, user interface designer, sound designer, videographer, and 3D and 2D animators, as well as programmers.
2.1.3 Multimedia Presentation
In this section, we briefly outline some effects to keep in mind for presenting multimedia content as well as some useful guidelines for content design.

Graphics Styles
Careful thought has gone into combinations of color schemes and how lettering is perceived in a presentation. Many presentations are meant for business displays, rather than appearing on a screen. Human visual dynamics are considered in regard to how such presentations must be constructed. Most of the observations here are drawn from Vetter et al. [2], as is Figure 2.4.
FIGURE 2.3: Two cards in a HyperStudio stack.
The production timeline would likely only involve programming when the project is about 40% complete, with a reasonable target for an alpha version (an early version that does not contain all planned features) being perhaps 65–70% complete. Typically, the design phase consists of storyboarding, flowcharting, prototyping, and user testing, as well as a parallel production of media. Programming and debugging phases would be carried out in consultation with marketing, and the distribution phase would follow.
A storyboard depicts the initial idea content of a multimedia concept in a series of sketches. These are like "keyframes" in a video — the story hangs from these "stopping places". A flowchart organizes the storyboards by inserting navigation information — the multimedia concept's structure and user interaction. The most reliable approach for planning navigation is to pick a traditional data structure. A hierarchical system is perhaps one of the simplest organizational strategies.
Multimedia is not really like other presentations, in that careful thought must be given to organization of movement between the "rooms" in the production. For example, suppose we are navigating an African safari, but we also need to bring specimens back to our museum for close examination — just how do we effect the transition from one locale to the other? A flowchart helps imagine the solution.
The flowchart phase is followed by development of a detailed functional specification. This consists of a walk-through of each scenario of the presentation, frame by frame, including all screen action and user interaction. For example, during a mouseover for a character, the character reacts, or a user clicking on a character results in an action.
The final part of the design phase is prototyping and testing. Some multimedia designers use an authoring tool at this stage already, even if the intermediate prototype will not be used in the final product or continued in another tool.
User testing is, of course, extremely important before the final development phase.
Color Principles and Guidelines
Some color schemes and art styles are best combined with a certain theme or style. Color schemes could be, for example, natural and floral for outdoor scenes and solid colors for indoor scenes. Examples of art styles are oil paints, watercolors, colored pencils, and pastels.
A general hint is to not use too many colors, as this can be distracting. It helps to be consistent with the use of color — then color can be used to signal changes in theme.

Fonts
For effective visual communication, large fonts (18 to 36 points) are best, with no more than six to eight lines per screen. As shown in Figure 2.4, sans serif fonts work better than serif fonts (serif fonts are those with short lines stemming from and at an angle to the upper and lower ends of a letter's strokes).
Figure 2.4 shows a comparison of two screen projections (Figures 2 and 3 from Vetter, Ward, and Shapiro [2]). The top figure shows good use of color and fonts. It has a consistent color scheme and uses large and all sans-serif (Arial) fonts. The bottom figure is poor, in that too many colors are used, and they are inconsistent. The red adjacent to the blue is hard to focus on, because the human retina cannot focus on these colors simultaneously. The serif (Times New Roman) font is said to be hard to read in a darkened, projection setting. Finally, the lower right panel does not have enough contrast — pretty pastel colors are often usable only if their background is sufficiently different.

A Color Contrast Program
Seeing the results of Vetter et al.'s research, we constructed a small Visual Basic program to investigate how readability of text colors depends on color and the color of the background. (See the Further Exploration section at the end of this chapter for a pointer to this program on the text web site. There, both the executable and the program source are given.)
The simplest approach to making readable colors on a screen is to use the principal complementary color as the background for text. For color values in the range 0 to 1 (or, effectively, 0 to 255), if the text color is some triple (R, G, B), a legible color for the background is likely given by that color subtracted from the maximum:

(R, G, B) → (1 − R, 1 − G, 1 − B)    (2.1)

That is, not only is the color "opposite" in some sense (not the same sense as artists use), but if the text is bright, the background is dark, and vice versa.
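Equation (2.1) is easy to try out directly. The book's demonstration program is in Visual Basic; the following is our own minimal Python sketch of the same rule, using 8-bit (0 to 255) channels:

```python
def complement(rgb):
    """Principal complementary color per Eq. (2.1): each channel
    is subtracted from the channel maximum (255 for 8-bit values)."""
    r, g, b = rgb
    return (255 - r, 255 - g, 255 - b)

# Bright yellow text -> dark blue background, matching the
# bright/dark reversal noted above.
print(complement((255, 255, 0)))   # (0, 0, 255)
print(complement((200, 30, 120)))  # (55, 225, 135)
```

As the rule predicts, a bright text color yields a dark background and vice versa.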
FIGURE 2.5: Program to investigate colors and readability.
FIGURE 2.4: Colors and fonts. (This figure also appears in the color insert section.) Courtesy of Ron Vetter.

In the Visual Basic program given, sliders can be used to change the background color. As the background changes, the text changes to equal the principal complementary color. Clicking on the background brings up a color-picker as an alternative to the sliders. If you feel you can choose a better color combination, click on the text. This brings up a color picker not tied to the background color, so you can experiment. (The text itself can also be edited.) A little experimentation shows that some color combinations are more
pleasing than others — for example, a pink background and forest green foreground, or a green background and mauve foreground. Figure 2.5 shows this small program in operation.
Figure 2.6 shows a "color wheel", with opposite colors equal to (1 − R, 1 − G, 1 − B). An artist's color wheel will not look the same, as it is based on feel rather than on an algorithm. In the traditional artist's wheel, for example, yellow is opposite magenta, instead of opposite blue as in Figure 2.6, and blue is instead opposite orange.
Sprite Animation
Sprites are often used in animation. For example, in Macromedia Director, the notion of a sprite is expanded to an instantiation of any resource. However, the basic idea of sprite animation is simple. Suppose we have produced an animation figure, as in Figure 2.7(a). Then it is a simple matter to create a 1-bit mask M, as in Figure 2.7(b), black on white, and the accompanying sprite S, as in Figure 2.7(c).
Now we can overlay the sprite on a colored background B, as in Figure 2.8(a), by first ANDing B and M, then ORing the result with S, with the final result as in Figure 2.8(e). Operations are available to carry out these simple compositing manipulations at frame rate and so produce a simple 2D animation that moves the sprite around the frame but does not change the way it looks.
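The AND/OR compositing just described can be tried on toy one-row "images" of 8-bit pixel values. A sketch in pure Python (the toy data and function name are ours, not from the book): the mask is all-ones outside the figure and zero inside, and the sprite is zero outside the figure.

```python
def composite(background, mask, sprite):
    """Overlay a sprite via (B AND M) OR S, pixel by pixel.

    mask:   0xFF (all ones) outside the figure, 0x00 inside (black on white)
    sprite: pixel values inside the figure, 0x00 outside
    """
    return [(b & m) | s for b, m, s in zip(background, mask, sprite)]

# Toy 1x6 "frame": the sprite occupies the middle two pixels.
B = [10, 20, 30, 40, 50, 60]              # colored background
M = [0xFF, 0xFF, 0x00, 0x00, 0xFF, 0xFF]  # 1-bit mask, widened to 8 bits
S = [0x00, 0x00, 0x77, 0x88, 0x00, 0x00]  # sprite pixels
print(composite(B, M, S))  # [10, 20, 119, 136, 50, 60]
```

The AND step punches a hole in the background where the sprite will go; the OR step drops the sprite into that hole, leaving the rest of the background untouched. Shifting M and S together each frame moves the sprite around the frame.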
FIGURE 2.6: Color wheel. (This figure also appears in the color insert section.)
Video Transitions
Video transitions can be an effective way to indicate a change to the next section. Video transitions are syntactic means to signal "scene changes" and often carry semantic meaning. Many different types of transitions exist; the main types are cuts, wipes, dissolves, fade-ins, and fade-outs.
A cut, as the name suggests, carries out an abrupt change of image contents in two consecutive video frames from their respective clips. It is the simplest and most frequently used video transition.
A wipe is a replacement of the pixels in a region of the viewport with those from another video. If the boundary line between the two videos moves slowly across the screen, the second video gradually replaces the first. Wipes can be left-to-right, right-to-left, vertical, horizontal, like an iris opening, swept out like the hands of a clock, and so on.
FIGURE 2.8: Sprite animation: (a) Background B; (b) Mask M; (c) B AND M; (d) Sprite S; (e) B AND M OR S.

A dissolve replaces every pixel with a mixture over time of the two videos, gradually changing the first to the second. A fade-out is the replacement of a video by black (or white), and fade-in is its reverse. Most dissolves can be classified into two types, corresponding, for example, to cross dissolve and dither dissolve in Adobe Premiere video editing software.
In type I (cross dissolve), every pixel is affected gradually. It can be defined by

D = (1 − α(t)) · A + α(t) · B    (2.2)

where A and B are the color 3-vectors for video A and video B. Here, α(t) is a transition function, which is often linear with time t:
FIGURE 2.7: Sprite creation: (a) original; (b) mask image M; and (c) sprite S. "Duke" figure courtesy of Sun Microsystems.
α(t) = k·t, with k·t_max = 1    (2.3)
Type II (dither dissolve) is entirely different. Determined by α(t), increasingly more and more pixels in video A will abruptly (instead of gradually, as in Type I) change to video B. The positions of the pixels subjected to the change can be random or sometimes follow a particular pattern.
Obviously, fade-in and fade-out are special types of a Type I dissolve, in which video A or B is black (or white). Wipes are special forms of a Type II dissolve, in which changing pixels follow a particular geometric pattern.
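Equations (2.2) and (2.3) translate directly into code. A small Python sketch (ours) of a Type I cross dissolve applied to a single pixel, with the linear transition function α(t) = t/t_max:

```python
def cross_dissolve(A, B, t, t_max):
    """Type I dissolve, Eq. (2.2): D = (1 - alpha) * A + alpha * B,
    with the linear transition alpha(t) = t / t_max from Eq. (2.3)."""
    alpha = t / t_max
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(A, B))

# Dissolving one pixel from pure red to pure blue over t_max = 10 frames:
print(cross_dissolve((255, 0, 0), (0, 0, 255), 0, 10))  # (255.0, 0.0, 0.0)
print(cross_dissolve((255, 0, 0), (0, 0, 255), 5, 10))  # (127.5, 0.0, 127.5)
```

Setting B to black (0, 0, 0) turns the same function into a fade-out, as noted above. A dither dissolve would instead flip whole pixels from A to B, with the fraction of flipped pixels growing with α(t).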
FIGURE 2.9: (a) VideoL; (b) VideoR; (c) VideoL sliding into place and pushing out VideoR.

Despite the fact that many digital video editors include a preset number of video transitions, we may also be interested in building our own. For example, suppose we wish to build a special type of wipe that slides one video out while another video slides in to replace it. The usual type of wipe does not do this. Instead, each video stays in place, and the transition line moves across each "stationary" video, so that the left part of the viewport shows pixels from the left video, and the right part shows pixels from the right video (for a wipe moving horizontally from left to right).
Suppose we would like to have each video frame not held in place, but instead move progressively farther into (out of) the viewport: we wish to slide VideoL in from the left and push out VideoR. Figure 2.9 shows this process. Each of VideoL and VideoR has its own values of R, G, and B. Note that R is a function of position in the frame, (x, y), as well as of time t. Since this is video and not a collection of images of various sizes, each of the two videos has the same maximum extent, x_max. (Premiere actually makes all videos the same size — the one chosen in the preset selection — so there is no cause to worry about different sizes.)
As time goes by, the horizontal location x_T for the transition boundary moves across the viewport from x_T = 0 at t = 0 to x_T = x_max at t = t_max. Therefore, for a transition that is linear in time, x_T = (t/t_max) · x_max.
So for any time t, the situation is as shown in Figure 2.10(a). The viewport, in which we shall be writing pixels, has its own coordinate system, with the x-axis from 0 to x_max. For each x (and y) we must determine (a) from which video we take RGB values, and (b) from what x position in the unmoving video we take pixel values — that is, from what position x from the left video, say, in its own coordinate system.
It is a video, so of course the image in the left video frame is changing in time. Let's assume that dependence on y is implicit. In any event, we use the same y as in the source video. Then for the red channel (and similarly for the green and blue), R = R(x, t).
Suppose we have determined that pixels should come from VideoL. Then the x-position x_L in the unmoving video should be x_L = x + (x_max − x_T), where x is the position we are trying to fill in the viewport, x_T is the position in the viewport that the transition boundary has reached, and x_max is the maximum pixel position for any frame.
To see this, we note from Figure 2.10(b) that we can calculate the position x_L in VideoL's coordinate system as the sum of the distance x, in the viewport, and the difference x_max − x_T.
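As a sanity check on this mapping, here is a small Python sketch (ours; the function name is hypothetical) that reports, for a viewport position x at time t, which video supplies the pixel and from what x-position in that video's own coordinate system, under the linear-in-time transition:

```python
def source_pixel(x, t, x_max, t_max):
    """For the slide-in/push-out wipe: map viewport position x at
    time t to a (video, source_x) pair, with x_T = x_max * t / t_max."""
    x_T = x_max * t / t_max           # transition boundary in the viewport
    if x < x_T:                       # left of the boundary: VideoL,
        return ("L", x + (x_max - x_T))  # shifted by x_max - x_T
    else:                             # right of the boundary: VideoR,
        return ("R", x - x_T)         # shifted left by x_T

# Halfway through (t = 5 of 10), the boundary sits at x_T = 50:
print(source_pixel(10, 5, 100, 10))  # ('L', 60.0)
print(source_pixel(80, 5, 100, 10))  # ('R', 30.0)
```

The "R" branch uses x_R = x − x_T, the slight change in formula for VideoR mentioned below in connection with Figure 2.11.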
FIGURE 2.10: (a) Geometry of VideoL pushing out VideoR; (b) Calculating the position in VideoL from which pixels are copied to the viewport.
Substituting the fact that the transition moves linearly with time, x_T = x_max · (t/t_max), we can set up a pseudocode solution as in Figure 2.11. In Figure 2.11, the slight change in formula if pixels are actually coming from VideoR instead of from VideoL is easy to derive.

Some Technical Design Issues
Technical parameters that affect the design and delivery of multimedia applications include computer platform, video format and resolution, memory and disk space, and delivery methods.
• Computer Platform. Usually we deal with machines that are either some type of UNIX box (such as a Sun) or else a PC or Macintosh. While a good deal of software is ostensibly "portable", much cross-platform software relies on runtime modules that may not work well across systems.
    for t in 0..t_max
        for x in 0..x_max
            if ( x / x_max < t / t_max )
                R = R_L( x + x_max * (1 - t/t_max), t )
            else
                R = R_R( x - x_max * (t/t_max), t )
FIGURE 2.11: Pseudocode for slide video transition.

• Video Format and Resolution. The most popular video formats are NTSC, PAL, and SECAM. They are not compatible, so conversion is required to play a video in a different format. The graphics card, which displays pixels on the screen, is sometimes referred to as a "video card". In fact, some cards are able to perform "frame grabbing", to change analog signals to digital for video. This kind of card is called a "video capture card". The graphics card's capacity depends on its price. An old standard for the capacity of a card is S-VGA, which allows for a resolution of 1,280 × 1,024 pixels in a displayed image and as many as 65,536 colors using 16-bit pixels or 16.7 million colors using 24-bit pixels. Nowadays, graphics cards that support higher resolution, such as 1,600 × 1,200, and 32-bit pixels or more are common.
• Memory and Disk Space Requirement. Rapid progress in hardware alleviates the problem, but multimedia software is generally greedy. Nowadays, at least 128 megabytes of RAM and 20 gigabytes of hard-disk space should be available for acceptable performance and storage for multimedia programs.
• Delivery Methods. Once coding and all other work is finished, how shall we present our clever work? Since we have presumably purchased a large disk, so that performance is good and storage is not an issue, we could simply bring along our machine and show the work that way. However, we likely wish to distribute the work as a product. Presently, rewritable DVD drives are not the norm, and CD-ROMs may lack sufficient storage capacity to hold the presentation. Also, access time for CD-ROM drives is longer than for hard-disk drives. Electronic delivery is an option, but this depends on network bandwidth at the user side (and at our server). A streaming option may be available, depending on the presentation.
No perfect mechanism currently exists to distribute large multimedia projects. Nevertheless, using such tools as PowerPoint or Director, it is possible to create acceptable presentations that fit on a single CD-ROM.

2.1.4 Automatic Authoring
Thus far, we have considered notions developed for authoring new multimedia. Nevertheless, a tremendous amount of legacy multimedia documents exists, and researchers have been interested in methods to facilitate automatic authoring. By this term is meant either an advanced helper for creating new multimedia presentations or a mechanism to facilitate automatic creation of more useful multimedia documents from existing sources.

Hypermedia Documents
Let us start by considering hypermedia documents. Generally, three steps are involved in producing documents meant to be viewed nonlinearly: information generation or capture, authoring, and publication. A question that can be asked is, how much of this process can be automated?
The first step, capture of media, be it from text or using an audio digitizer or video frame-grabber, is highly developed and well automated. The final step, presentation, is the objective of the multimedia tools we have been considering. But the middle step, authoring, is most under consideration here. Essentially, we wish to structure information to support access and manipulation of the available media. Clearly, we would be well advised to consider the standard computing science data structures in structuring this information: lists, trees, or networks (graphs). However, here we would like to consider how best to structure the data to support multiple views, rather than a single, static view.

Externalization versus Linearization
Figure 2.12 shows the essential problem involved in communicating ideas without using a hypermedia mechanism: the author's ideas are "linearized" by setting them down in linear order on paper. In contrast, hyperlinks allow us the freedom to partially mimic the author's thought process (i.e., externalization). After all, the essence of Bush's Memex idea in Section 1.2.1 involves associative links in human memory.
Now, using Microsoft Word, say, it is trivial to create a hypertext version of one's document, as Word simply follows the layout already set up in chapters, headings, and so on. But problems arise when we wish to extract semantic content and find links and anchors, even considering just text and not images. Figure 2.13 displays the problem: while it is feasible to mentally manage a few information nodes, once the problem becomes large, we need automatic assistants.
Once a dataset becomes large, we should employ database methods. The issues become focused on scalability (to a large dataset), maintainability, addition of material, and reusability. The database information must be set up in such a way that the "publishing" stage, presentation to the user, can be carried out just-in-time, presenting information in a user-defined view from an intermediate information structure.
FIGURE 2.14: Nodes and anchors in hypertext. Courtesy of David Lowe. [6]

Semiautomatic Migration of Hypertext

The structure of hyperlinks for text information is simple: "nodes" represent semantic information and are anchors for links to other pages. Figure 2.14 illustrates these concepts.
FIGURE 2.12: Communication using hyperlinks. Courtesy of David Lowe; (©1995 IEEE). [5]
A semiautomatic version of such a program is most likely to be successful, making suggestions that can be accepted or rejected and manually added to. A database management system can maintain the integrity of links when new nodes are inserted. For the publishing stage, since it may be impractical to re-create the underlying information structures, it is best to delay imposing a viewpoint on the data until as late as possible.
Hyperimages

Matters are not nearly so straightforward when considering image or other multimedia data. To treat an image in the same way as text, we would wish to consider an image to be a node that contains objects and other anchors, for which we need to determine image entities and rules. What we desire is an automated method to help us produce true hypermedia, as in Figure 2.15.
For text, the first step for migrating paper-based information to hypertext is to automatically convert the format used to HTML. Then, sections and chapters can be placed in a database. Simple versions of data mining techniques, such as word stemming, can easily be used to parse titles and captions for keywords, for example, by frequency counting. Keywords found can be added to the database being built. Then a helper program can automatically generate additional hyperlinks between related concepts.
FIGURE 2.13: Complex information space: (a) complexity: manageable; (b) complexity: overwhelming. Courtesy of David Lowe; (©1995 IEEE). [5]
It is possible to manually delineate syntactic image elements by masking image areas. These can be tagged with text, so that previous text-based methods can be brought into play. Figure 2.16 shows a "hyperimage", with image areas identified and automatically linked to other parts of a document.
FIGURE 2.15: Structure of hypermedia. Courtesy of David Lowe. [6]

Such methods are certainly in their infancy but provide a fascinating view of what is to come in authoring automation. Naturally, we are also interested in what tools from database systems, data mining, artificial intelligence, and so on can be brought to bear to assist production of full-blown multimedia systems, not just hypermedia systems. The above discussion shows that we are indeed at the start of such work.

2.2 SOME USEFUL EDITING AND AUTHORING TOOLS

This text is primarily concerned with principles of multimedia: the fundamentals to be grasped for a real understanding of this subject. Nonetheless, we need real vehicles for showing this understanding, and straight programming in C++ or Java is not always the best way of showing your knowledge. Most introductory multimedia courses ask you to at least start off delivering some multimedia product (e.g., see Exercise 11). So we need a jump-start to help you learn "yet another software tool." This section aims to give you that jump-start. Therefore, we'll consider some popular authoring tools. Since the first step in creating a multimedia application is probably creation of interesting video clips, we start off looking at a video editing tool. This is not really an authoring tool, but video creation is so important that we include a small introduction to one such program. The tools we look at are the following:

• Adobe Premiere 6
• Macromedia Director 8 and MX
• Flash 5 and MX
• Dreamweaver MX

While this is not an exhaustive list, these tools are often used in creating multimedia content.
2.2.1 Adobe Premiere
FIGURE 2.16: Hyperimage. Courtesy of David Lowe. [6]
Premiere Basics

Adobe Premiere is a very simple video editing program that allows you to quickly create a simple digital video by assembling and merging multimedia components. It effectively uses the score authoring metaphor, in that components are placed in "tracks" horizontally, in a Timeline window.

The File > New Project command opens a window that displays a series of "presets": assemblies of values for frame resolution, compression method, and frame rate. There are many preset options, most of which conform to some NTSC or PAL video standard.

Start by importing resources, such as AVI (Audio Video Interleave) video files and WAV sound files, and dragging them from the Project window onto tracks 1 or 2. (In fact, you can use up to 99 video and 99 audio tracks!) Video 1 is actually made up of three tracks: Video 1A, Video 1B, and Transitions. Transitions can be applied only to Video 1. Transitions are dragged into the Transitions track from the Transition window, such as a gradual replacement of Video 1A by Video 1B (a dissolve), sudden replacement of random pixels in a checkerboard (a dither dissolve), or a wipe, with one video sliding over another. There are many transitions to choose from, but you can also design an original transition, using Premiere's Transition Factory.

You can import WAV sound files by dragging them to Audio 1 or Audio 2 of the Timeline window or to any additional sound tracks. You can edit the properties of any sound track by right-clicking on it.
FIGURE 2.17: Adobe Premiere screen.
Figure 2.17 shows what a typical Premiere screen might look like. The yellow ruler at the top of the Timeline window delineates the working timeline; drag it to the right amount of time. The 1 Second dropdown box at the bottom represents showing one video keyframe per 1 second.

To "compile" the video, go to Timeline > Render Work Area and save the project as a .ppj file. Now it gets interesting, because you must make some choices here, involving how and in what format the movie is to be saved. Figure 2.18 shows the project options. The dialogs that tweak each codec are provided by the codec manufacturer; bring these up by clicking on the Configure button. Compression codecs (compression/decompression protocols) are often in hardware on the video capture card. If you choose a codec that requires hardware assistance, someone else's system may not be able to play your brilliant digital video, and all is in vain!

Images can also be inserted into tracks. We can use transitions to make the images gradually appear or disappear in the final video window. To do so, set up a "mask" image, as in Figure 2.19. Here, we have imported an Adobe Photoshop 6.0 layered image, with accompanying alpha channel made in Photoshop. Then in Premiere, we click on the image, which has been placed in its own video track, and use Clip > Video Options > Transparency to set the Key (which triggers transparency) to Alpha Channel. It is also simple to use Clip > Video Options > Motion to have the image fly in and out of the frame.
FIGURE 2.18: (a) Output options; (b) compression options.
In Photoshop, we set up an alpha channel as follows:

1. Use an image you like (a .JPG, say).
2. Make the background some solid color (white, say).
3. Make sure you have chosen Image > Mode > RGB Color.
4. Select that background area (you want it to remain opaque in Premiere); use the magic wand tool.
5. Go to Select > Save Selection....
6. Ensure that Channel = New. Press OK.
7. Go to Window > Show Channels, double-click the new channel, and rename it Alpha; make its color (0, 0, 0).
8. Save the file as a PSD.

FIGURE 2.19: (a) RGB channels; (b) Alpha channel.

If the alpha channel you created in Photoshop has a white background, you'll need to choose Reverse Key in Premiere when you choose Alpha. Premiere has its own simple method of creating titles (to give credit where credit is due) for your digital video. Another nice feature of Premiere is that it is simple to use in capturing video. To form a digital video from a videotape or camcorder input, go to File > Capture > Movie Capture. (The menu for video/audio capture options appears by right-clicking the capture window.) Similarly, saving to analog tape format is also simple.

Premiere Transitions

Premiere offers an interesting assortment of video transitions. However, examining the resulting video frame by frame reveals that the built-in transitions do not work quite as "advertised". For example, on close examination, what purports to be a wipe that is linear with time turns out to have a nonlinear dip as it begins: the video transition line moves at not quite constant speed. The Premiere Transition Factory provides a good many functions for building our own transitions, if we are interested in doing so. Since we are actually in an int regime, these functions, such as sin and cos, have both domain and range in the ints rather than floats. Therefore, some care is required in using them. Exercise 9 gives some of these details in a realistic problem setting.

FIGURE 2.20: Director: main windows.

2.2.2
Macromedia Director

Director Windows

Director is a complete environment (see Figure 2.20) for creating interactive "movies". The movie metaphor is used throughout Director, and the windows used in the program reflect this. The main window, in which the action takes place, is the Stage. Explicitly opening the Stage automatically closes the other windows. (A useful shortcut is Shift + Keypad-Enter (the Enter key next to the numeric keypad, not the usual Enter key); this clears all windows except the Stage and plays the movie.) The other two main windows are Cast and Score.

A Cast consists of resources a movie may use, such as bitmaps, sounds, vector-graphics shapes, Flash movies, digital videos, and scripts. Cast members can be created directly or simply imported. Typically you create several casts, to better organize the parts of a movie. Cast members are placed on the Stage by dragging them there from the Cast window. Because several instances may be used for a single cast member, each instance is called a sprite. Typically, cast members are raw media, whereas sprites are objects that control where, when, and how cast members appear on the stage and in the movie.

Sprites can become interactive by attaching "behaviors" to them (for example, make the sprite follow the mouse), either prewritten or specially created. Behaviors are in the internal script language of Director, called Lingo. Director is a standard event-driven program that allows easy positioning of objects and attachment of event procedures to objects. The set of predefined events is rich and includes mouse events as well as network events (an example of the latter would be testing whether cast members are downloaded yet). The type of control achievable might be to loop part of a presentation until a video is downloaded, then continue or jump to another frame. Bitmaps are used for buttons, and the most typical use would be to jump to a frame in the movie after a button-click event. The Score window is organized in horizontal lines, each for one of the sprites, and vertical frames.
Thus the Score looks somewhat like a musical score, in that time is from left to right, but it more resembles the list of events in a MIDI file (see Chapter 6). Both types of behaviors, prewritten and user-defined, are in Lingo. The Library palette provides access to all prewritten behavior scripts. You can drop a behavior onto a sprite or attach behaviors to a whole frame.
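As a minimal sketch of what such a user-defined behavior script looks like (the marker label "mainMenu" is a hypothetical name chosen for illustration, not from the text), a Lingo behavior that jumps to a labeled frame when its sprite is clicked could read:

```lingo
-- sketch of a simple navigation behavior;
-- "mainMenu" is a hypothetical marker label
on mouseUp me
  go to frame "mainMenu"  -- jump to the frame bearing this marker
end
```

Dropping this script onto a sprite in the Score attaches it, just as for the prewritten behaviors described above.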
FIGURE 2.21: A tweened sprite.
FIGURE 2.22: Score window.
If a behavior includes parameters, a dialog box appears. For example, navigation behaviors must have a specified frame to jump to. You can attach the same behavior to many sprites or frames and use different parameters for each instance. Most behaviors respond to simple events, such as a click on a sprite or the event triggered when the "playback head" enters a frame. Most basic functions, such as playing a sound, come prepackaged. Writing your own user-defined Lingo scripts provides more flexibility. Behaviors are modified using Inspector windows: the Behavior Inspector, or Property Inspector.
Animation

Traditional animation (cel animation) is created by showing slightly different images over time. In Director, this approach amounts to using different cast members in different frames. To control this process more easily, Director permits combining many cast members into a single sprite. (To place on the score, select all the images to be combined, then use the Cast To Time menu item to place them in the current score location.) A useful feature is that expanding the time used on the score for such an animation slows the playback time for each image, so the whole animation takes the required amount of time.

A less sophisticated-looking but simple animation is available with the tweening feature of Director. Here, you specify a particular image and move it around the stage without altering the original image. "Tweening" refers to the job of minor animators, who used to have to fill in between the keyframes produced by more experienced animators, a role Director fulfills automatically. To prepare such an animation, specify the path on the stage for the tweened frames to take. You can also specify several keyframes and the kind of curve for the animation to follow between keyframes. You also specify how the image should accelerate and decelerate at the beginning and end of the movement ("ease-in" and "ease-out"). Figure 2.21 shows a tweened sprite.
A simple kind of animation called palette animation is also widely used. If images are 8-bit, cycling through the color lookup table or systematically replacing lookup table entries produces interesting (or strange) effects.

The Score window's important features are channels, frames, and the playback head. The latter shows where we are in the score; clicking anywhere in the score repositions the playback head. Channels are the rows in the Score and can contain sprite instances of visible media. Therefore, these numbered channels are called Sprite channels. At the top of the Score window are Special Effects channels for controlling the palettes, tempo, transitions, and sounds. Figure 2.22 shows these channels in the Score window. Frames are numbered horizontally in the Sprite and Special Effects channels. A frame is a single step in the movie, as in a traditional film. The movie's playback speed can be modified by resetting the number of frames per second.

Control

You can place named markers at any frame. Then the simplest type of control event would be to jump to a marker. In Director parlance, each marker begins a Scene. Events triggered for frame navigation are Go To Frame, Go To Marker, or Hold on Current Frame, which stops the movie at that frame. Behaviors for frames appear in a Script Channel in the score window.
Buttons are simply bitmaps with behaviors attached. You usually make use of two bitmaps, one depicting the depressed state of the button and one for the undepressed state. Then the built-in event on mouseUp effects the jump.

Lingo Scripts

Director uses four types of scripts: behaviors, scripts attached to cast members, movie scripts, and parent scripts. Behaviors, movie scripts, and parent scripts all appear as cast members in the Cast window. A "behavior" is a Lingo script attached to a sprite or a frame. You might use a script to determine whether a sprite moves, based on whether the user has clicked a button. A useful feature is that a script can control when a multimedia resource is played, depending on how much of the resource has already streamed from the web. To attach a behavior, drag it from a cast to a sprite or frame in the Score or on the Stage.

Also used are Movie scripts, which are available to the entire movie. Movie scripts can control event responses when a movie starts, stops, or pauses and can also respond to events, such as key presses and mouse clicks. Parent scripts can be used to create multiple instances of an object without adding cast members to the score.

User-written Lingo scripts can be used to create animation or to respond to typical events, such as user actions with the keyboard and mouse. Scripts can also be used to stream videos from the Internet, perform navigation, format text, and so on. Lingo scripts also extend behaviors beyond what the Score alone can do. The basic data type is a list, which is of course the fundamental data structure. Using lists, you can manipulate arrays as well. Math operations and string handling are also available. Lists are of two types: linear and property. A linear list is simply a list as in LISP, such as [32, 43, 12]. A property list is an association list, again as in LISP: each element contains two values separated by a colon. Each property is preceded by a number sign.
For example, statements to create two different property lists to specify the Stage coordinates of two sprites are as follows:

sprite1Location = [#left:100, #top:150, #right:300, #bottom:350]
sprite2Location = [#left:400, #top:550, #right:500, #bottom:750]
Lingo has many functions that operate on lists, such as append to add an element to the end of a list and deleteOne to delete a value from a list.

Lingo Specifics

• The function the frame refers to the current frame.
• Special markers next or previous refer to adjacent markers (not adjacent frames).
• Function marker(-1) returns the identifier for the previous marker. If the frame is marked and has a marker name, marker(0) returns the name of the current frame; otherwise, it returns the name of the previous marker.
• movie "Jaws" refers to the start frame of the global movie named "Jaws". This would typically be the name of another Director movie. The reference frame 100 of movie "Jaws" points into that movie.
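A short sketch of these list and marker functions, as they might be tried in Director's Message window (the values are arbitrary, chosen only for illustration):

```lingo
-- exercising the linear-list functions in the Message window
myList = [32, 43, 12]     -- a linear list
append myList, 99         -- myList is now [32, 43, 12, 99]
deleteOne myList, 43      -- removes the value 43, leaving [32, 12, 99]
put myList                -- prints the list
put marker(0)             -- prints the current (or previous) marker name
```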
These details are well outlined in the Lingo Help portion of the online help. The Help directory Learning > Lingo_Examples has many DIR files that detail the basics of Lingo use.
Lingo Movie-in-a-Window

For an excellent example of Lingo usage, the Lingo Help article on creating a movie-in-a-window shows a good overview of how to attach a script. Lingo is a standard, event-driven programming language. Event handlers are attached to specific events, such as a mouseDown message. Scripts contain event handlers. You attach a set of event handlers to an object by attaching the script to the object.

3D Sprites

A new feature recently added to Director is the ability to create, import, and manipulate 3D objects on the stage. A simple 3D object that can be added in Director is 3D text. To create 3D text, select any regular text, then in the Property Inspector click on the Text tab and set the display mode to 3D. Other options, such as text depth and texture, can be changed from the 3D Extruder tab in the Property Inspector window. These properties can also be set dynamically in Lingo, to change the text as the movie progresses. 3D objects other than text can be formed only using Lingo or imported from 3D Studio Max.

Director supports many basic elements of 3D animation, including basic shapes such as spheres and user-definable meshes. The basic shapes can have textures and shaders added to them; textures are 2D images drawn onto the 3D models, while shaders define how the basic model looks. Lights can also be added to the scene; by default, one light provides ambient lighting to the whole scene. Four types of lights can be added: ambient, directional, point, and spotlight. The strength and color of the light can also be specified. The viewpoint of the user, called the camera, can be moved around to show the 3D objects from any angle. Movement of the camera, such as panning and tilting, can be controlled using built-in scripts in the Library window.

Properties and Parameters

Lingo behaviors can be created with more flexibility by specifying behavior parameters.
Parameters can change a behavior by supplying input to the behavior when it is created. If no parameters are specified, a default value will be used. Parameters can be easily changed for a particular behavior by double-clicking on the name of the behavior while it is attached to another cast member, with dialog-driven parameter change as shown in Figure 2.23.

A behavior can have a special handler called getPropertyDescriptionList that is run when a sprite attached to the behavior is created. The handler returns a list of parameters that can be added by the addProp function. For example, if a movement behavior is made in Lingo, parameters can be added to specify the direction and speed of the movement. The behavior can then be attached to many cast members for a variety of movements.

The parameters defined in the getPropertyDescriptionList handler are properties of the behavior that can be accessed within any handler of that behavior. Defining a property in a behavior can be done by simply using the property keyword outside any handler and listing all the properties, separated by commas. Global variables can be
FIGURE 2.23: Parameters dialog box.
accessed across behaviors; they can be declared like a property, except that the global keyword is used instead. Each behavior that needs to access a global variable must declare it with the global keyword.

Director Objects

Director has two main types of objects: those created in Lingo and those on the Score. Parent scripts are used to create a new object in Lingo. A behavior can be transformed into a parent script by changing the script type in the Property Inspector. Parent scripts are different from other behaviors, in that parameters are passed into the object when it is created in Lingo script. Parent scripts can be created and changed only in Lingo, while objects in the Score can only be manipulated.

The most common objects used are the sprites in the Score. Sprites can be used only in the same time period as the Lingo script referencing them. Reference the sprite at a channel using the Sprite keyword followed by the sprite channel number. A sprite has many properties that perform a variety of actions. The location of the sprite can be changed by the locv and loch properties, which change the vertical and horizontal position, respectively. The member property specifies the sprite's cast member and can be used to change the cast member attached to that behavior. This can be useful in animation: instead of changing the sprite in the Score to reflect a small change, it can be done in Lingo.
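Pulling these pieces together, the following is a hedged sketch of a behavior (the property names, defaults, and the cast-member name "ball2" are invented for illustration, not taken from the text) that declares two parameters in getPropertyDescriptionList and then uses the loch and member sprite properties from its exitFrame handler:

```lingo
property pSpeed, pMemberName  -- behavior parameters, set per instance

on getPropertyDescriptionList me
  pList = [:]
  addProp pList, #pSpeed, [#comment:"Speed (pixels per frame)", #format:#integer, #default:5]
  addProp pList, #pMemberName, [#comment:"Cast member to swap in", #format:#string, #default:"ball2"]
  return pList
end

on exitFrame me
  -- move the sprite pSpeed pixels to the right each frame
  sprite(me.spriteNum).loch = sprite(me.spriteNum).loch + pSpeed
  if sprite(me.spriteNum).loch > 400 then
    -- past x = 400, swap in a different cast member
    sprite(me.spriteNum).member = member(pMemberName)
  end if
end
```

Attached to a sprite, a parameters dialog like the one in Figure 2.23 would then offer pSpeed and pMemberName for each instance.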
2.2.3 Macromedia Flash

Flash is a simple authoring tool that facilitates the creation of interactive movies. Flash follows the score metaphor in the way the movie is created and the windows are organized. Here we give a brief introduction to Flash and provide some examples of its use.

Windows

A movie is composed of one or more scenes, each a distinct part of the movie. The command Insert > Scene creates a new scene for the current movie.

In Flash, components such as images and sound that make up a movie are called symbols, which can be included in the movie by placing them on the Stage. The Stage is always visible as a large, white rectangle in the center window of the screen. Three other important windows in Flash are the Timeline, Library, and Tools.
Library Window

The Library window shows all the current symbols in the scene and can be toggled by the Window > Library command. A symbol can be edited by double-clicking its name in the library, which causes it to appear on the stage. Symbols can also be added to a scene by simply dragging the symbol from the Library onto the stage.

Timeline Window

The Timeline window manages the layers and timelines of the scene. The left portion of the Timeline window consists of one or more layers of the Stage, which enables you to easily organize the Stage's contents. Symbols from the Library can be dragged onto the Stage, into a particular layer. For example, a simple movie could have two layers, the background and foreground. The background graphic from the library can be dragged onto the stage when the background layer is selected.

Another useful function for layering is the ability to lock or hide a layer. Pressing the circular buttons next to the layer name can toggle their hidden/locked state. Hiding a layer can be useful while positioning or editing a symbol on a different layer. Locking a layer can prevent accidental changes to its symbols once the layer has been completed.

The right side of the Timeline window consists of a horizontal bar for each layer in the scene, similar to a musical score. This represents the passage of time in the movie. The Timeline is composed of a number of keyframes in different layers. A new keyframe can be inserted into the current layer by pressing F6. An event such as the start of an animation or the appearance of a new symbol must be in a keyframe. Clicking on the timeline changes the current time in the movie being edited.

Tools Window

The Tools window, which allows the creation and manipulation of images, is composed of four main sections: Tools, View, Colors, and Options. Tools consists of selection tools that can be used to demarcate existing images, along with several simple drawing tools, such as the pencil and paint bucket. View consists of a zoom tool and a hand tool, which allow navigation on the Stage. Colors allows foreground and background colors to be chosen, and symbol colors to be manipulated. Options allows additional options when a tool is selected.

Many other windows are useful in manipulating symbols. With the exception of the Timeline window, which can be toggled with the View > Timeline command, all other windows can be toggled under the Window menu. Figure 2.24 shows the basic Flash screen.
View consisLs of a zoom boi and a hand 1001, which allow navigaLion on Lhe SLage. Coiors aiiows foreground and background colors lo be chosen, and symbol colors lo be manipuialed. OpLions ailows addiLional opLions when a 100115 selecLed. Many olher windows are useful in manipuiaLing symbois. WiLh Lhe exceplion of Lhe Tímeline window, which can be boggied wilh lhe View > Timeline command, ali oLher windows can be loggied under Lhe Window menu. Figure 2.24 shows lhe basic Flash screen. Symbols Symbols can be eiLher composed from oLher symbols, drawn, or imported inLo Flash. Fiash is abie Lo imporl severa] audio, image, and video formaIs mIo lhe symbo] library. A symboi can be imported by using Lhe command File > Import, which aulomaLicaliy adds il lo Lhe current library. To creale a new symboi for Lhe movie, press ctrl + F8. A pop-up diaiog box wiil appear in which you can specify lhe name and behavior of lhe symboi. Symbols can talce on one of lhree behaviors: a buiton, a graphic, ora movie. Symbols, such as a bulton, can be drawn using Lhe Toois window.
FIGURE 2.25: Create symbol dialog.
FIGURE 2.24: Macromedia Flash.

Buttons

To create a simple button, create a new symbol with the button behavior. The Timeline window should have four keyframes: up, down, over, and hit. These keyframes show different images of the button when the specified action is taken. Only the up keyframe is required and is the default; all others are optional. A button can be drawn by selecting the rectangular tool in the Tools window and then dragging a rectangle onto the Stage. To add images, so that the button's appearance will change when an event is triggered, click on the appropriate keyframe and create the button image. After at least one keyframe is defined, the basic button is complete, although no action is yet attached to it. Actions are discussed further in the action scripts section below.

Creating a symbol from other symbols is similar to creating a scene: drag the desired symbols from the Library onto the Stage. This allows the creation of complex symbols by combining simpler symbols. Figure 2.25 shows a dialog box for symbol creation.

Animation in Flash

Animation can be accomplished by creating subtle differences in each keyframe of a symbol. In the first keyframe, the symbol to be animated can be dragged onto the stage from the Library. Then another keyframe can be inserted, and the symbol changed. This can be repeated as often as needed. Although this process is time-consuming, it offers more flexibility than any other technique for animation. Flash also allows specific animations to be more easily created in several other ways. Tweening can produce simple animations, with changes automatically created between keyframes.
Tweening There are two Lypes of Lweening: shape and moveinen( Lweening. Shape tweening allows you to create a shape that continuously changes to a different shape over time. Movement tweening allows you to place a symbol in different places on Lhe Stage in different keyframes. Fiash automatically fihIs in Lhe keyframes along a path between Lhe stan and finish. To cany OUL movement tweening, select the symbol te be tweened, choose Insert > Create Motion Tween,andselecttheendframe. Thenusethecominand Insert > Fraine and move Lhe symboi toLhe desired position. More advanced tweening allows conLrol of Lhe paLh as well as of acceleraLion. Movement and shape tweenings can be combined for addiLionai effecL. Mask animation involves Lhe manipulation of a layer mask — a layer thaL selectively hides portions of anoLher layer. For exampie, to creaLe an explosion effect, you could use a mask Lo cover ali buL Lhe center of Lhe explosion. Shape tweening could Lhen expand Lhe mask, so that eventually Lhe whoie explosion is seen te take place. Figure 2.26 shows a scene before and afLer a tweening effecL is added. Action Scripts AcLion scripts allow you to trigger events such as moving Lo a different keyframe or requiring the movie Lo sLop. Action scripLs can be aLtached Lo a keyframe or symbols in a keyframe. Right-clicking on Lhe symbol and pressing Actions in Lhe list can modify Lhe actions of a symbol. Simiiarly, by right-clicking on Lhe keyframe and pressing Actions in Lhe pop-up. you can appiy actions to a keyframe. A Frame Actions window will come up. wiLh a list of available acLions on the iefL and the current actions being applied symbol on the right. AcLion scripLs are broken into six categories: Basic
FIGURE 2.26: Before and after tweening letters.
Chapter 2
Multimedia Authoring and Tools
Buttons need action scripts — event procedures — so that pressing the button will cause an effect. It is straightforward to attach a simple action, such as replaying the Flash movie, to a button. Select the button and click to launch the action script window, located at the bottom right of the screen. Then click on Basic Actions, which generates a drop-down list of actions. Double-clicking on the Play action automatically adds it to the right side of the window. This button now replays the movie when clicked.
~~iMo’sieÉxpIorer ~iJ Frarne Actions
+
Frarne Actions Basio Actions
VRML 51
4%
2.2.4 Dreamweaver
FIGURE 2.27: Action scripts window.
Dreamweaver is quite a popular Macromedia product (Dreamweaver MX is the current version) for building multimedia-enabled web sites as well as Internet applications in HTML, XML, and other formats. It provides visual layout tools and code-editing capability for file types such as JavaScript, Active Server Pages, PHP, and XML. The product is integrated with other Macromedia products such as Flash MX and Fireworks MX. Along with its use as basically a WYSIWYG web development tool, an interesting part of Dreamweaver that relates more directly to authoring is the fact that it comes with a prepackaged set of behaviors and is also extensible. The behaviors are essentially event procedures, responding to events such as mouseover — the set of possible events is different for each target browser and is reconfigurable for each browser and version number. Computer Science students can write their own JavaScript code, say, and attach this to events.
Actions, Actions, Operators, Functions, Properties, and Objects. Figure 2.27 shows the Frame Actions window. Basic Actions allow you to attach many simple actions to the movie. Some common actions are

• Goto. Moves the movie to the keyframe specified and can optionally stop. The stop action is commonly used to stop interactive movies when the user is given an option.
• Play. Resumes the movie if the movie is stopped.
• Stop. Stops the movie if it is playing.
• Tell Target. Sends messages to different symbols and keyframes in Flash. It is commonly used to start or stop an action on a different symbol or keyframe.

The Actions category contains many programming constructs, such as Loop and Goto statements. Other actions are also included, similar to those in typical high-level, event-driven programming languages, such as Visual Basic. The Operators category includes many comparison and assignment operators for variables. This allows you to perform operations on variables in the action script. The Functions category contains built-in functions included in Flash that are not specific to a Flash object. The Properties section includes all the global variables predefined in Flash. For example, to refer to the current frame, the variable _currentframe is defined. The Objects section lists all objects, such as movie clips or strings, and their associated functions.

2.3 VRML
2.3.1 Overview VRML, which stands for Virtual Reality Modeling Language, was conceived at the first international conference of the World Wide Web. Mark Pesce, Tony Parisi, and David Raggett outlined the structure of VRML at the conference and specified that it would be a platform-independent language that would be viewed on the Internet. The objective of VRML was to have the capability to put colored objects into a 3D environment. VRML is an interpreted language, which can be seen as a disadvantage, because it runs slowly on many computers today. However, it has been influential, because it was the first method available for displaying a 3D world on the World Wide Web. Strictly speaking, VRML is not a "tool," like Premiere or Director. In fact, the only piece of software needed to create VRML content is a text editor. Nonetheless, VRML is a tool used to create 3D environments on the web, much like Flash is a tool used to create interactive movies.

History VRML 1.0 was created in May 1995, with a revision for clarification called VRML 1.0C in January 1996. VRML is based on a subset of the Open Inventor file format created by Silicon Graphics Inc. VRML 1.0 allowed for the creation of many simple 3D objects, such as a cube, sphere, and user-defined polygons. Materials and textures can be specified for objects to make the objects more realistic. The last major revision of VRML was VRML 2.0. This revision added the ability to create an interactive world. VRML 2.0, also called "Moving Worlds", allows for animation and sound in an interactive virtual world. New objects were added to make the creation of virtual worlds easier. Java and JavaScript have been included in VRML, to allow for interactive
FIGURE 2.28: Basic VRML shapes.
objects and user-defined actions. VRML 2.0 was a major change from VRML 1.0, and the two versions are not compatible. However, utilities are available to convert VRML 1.0 to VRML 2.0.

VRML 2.0 was submitted for standardization to the International Organization for Standardization (ISO), and as a result, VRML97 was specified. VRML97 is virtually identical to VRML 2.0 — only minor documentation changes and clarifications were added. VRML97 is an ISO/IEC standard.
VRML Shapes VRML is made up of nodes put into a hierarchy that describe a scene of one or more objects. VRML contains basic geometric shapes that can be combined to create more complex objects. The Shape node is a generic node for all objects in VRML. The Box, Cylinder, Cone, and Sphere are geometry nodes that place basic objects in the virtual world. VRML allows for the definition of complex shapes that include IndexedFaceSet and Extrusion. An IndexedFaceSet is a set of faces that make up an object. This allows for the creation of complex shapes, since an arbitrary number of faces is allowed. An Extrusion is a 2D cross-section extruded along a spine and is useful in creating a simple curved surface, such as a flower petal. An object's shape, size, color, and reflective properties can be specified in VRML. The Appearance node controls the way a shape looks and can contain a Material node and texture nodes. Figure 2.28 displays some of these shapes. A Material node specifies an object's surface properties. It can control what color the object is by specifying the red, green, and blue values of the object. The specular and emissive colors can be specified similarly. Other attributes, such as how much the object reflects direct and indirect light, can also be controlled. Objects in VRML can be transparent or partially transparent. This is also included in the Material node. Three kinds of texture nodes can be used to map textures onto any object. The most common one is the ImageTexture, which can take an external JPEG or PNG image file and map it onto the shape. The way the image is textured can be specified — that is, the way the image should be tiled onto the object is editable. A MovieTexture node allows mapping an MPEG movie onto an object; the starting and stopping time can also be specified.
FIGURE 2.29: A simple VRML scene.

The final texture-mapping node is called a PixelTexture, which simply means creating an image to use with ImageTexture in VRML. Although it is more inefficient than an ImageTexture node, it is still useful for simple textures. Text can be put into a VRML world using the Text node. You can specify the text to be included, as well as the font, alignment, and size. By default, the text faces in the positive Y direction, or "up". All shapes and text start in the middle of the VRML world. To arrange the shapes, Transform nodes must be wrapped around the shape nodes. The Transform node can contain Translation, Scale, and Rotation nodes. Translation simply moves the object a specific distance from its current location, which is by default the center of the world. Scale increases or decreases the size of the object, while Rotation rotates the object around its center.
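The combined effect of these nodes can be illustrated numerically. The Python sketch below is only a 2D analogy of our own (the function and numbers are not part of any VRML library): scale is applied first, then rotation about the object's center, then translation:

```python
import math

def transform_point(p, translation=(0, 0), scale=(1, 1), rotation=0.0):
    """Apply Scale, then Rotation (radians, about the origin), then
    Translation to a 2D point -- mimicking a VRML Transform node."""
    x, y = p[0] * scale[0], p[1] * scale[1]
    c, s = math.cos(rotation), math.sin(rotation)
    x, y = x * c - y * s, x * s + y * c
    return (x + translation[0], y + translation[1])

# Double the size of a point at (1, 0), rotate 90 degrees, move up 10 units:
q = transform_point((1, 0), translation=(0, 10), scale=(2, 2),
                    rotation=math.pi / 2)
```

The point (1, 0) is scaled to (2, 0), rotated to (0, 2), and finally translated to (0, 12), mirroring the fixed scale-rotate-translate ordering of a Transform node.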
VRML World A virtual world needs more than just shapes to be realistic; it needs cameras to view the objects, as well as backgrounds and lighting. The default camera is aligned with the negative z-axis, a few meters from the center of the scene. Using Viewpoint nodes, the default camera position can be changed and other cameras added. Figure 2.29 displays a simple VRML scene from one viewpoint. The viewpoint can be specified with the position node and can be rotated from the default view with the orientation node. The camera's angle for its field of view can be changed from its default 0.78 radians with the fieldOfView node. Changing the field of view can create a telephoto effect. Three types of lighting can be used in a VRML world. A DirectionalLight node shines a light across the whole world in a certain direction, similar to the light from the sun — it is from one direction and affects all objects in the scene. A PointLight shines a light in all directions from a certain point in space. A SpotLight shines a light in a certain direction from a point. Proper lighting is important in adding realism to a world. Many parameters, such as the color and strength of the light, can be specified for every type of light.
The background of the VRML world can also be specified using the Background node. The background color, black by default, as well as the sky color can be changed. A Panorama node can map a texture to the sides of the world. A panorama is mapped onto a large cube surrounding the VRML world. If a panorama is used, the user can never approach the texture, because the panorama is centered on the user. It is also possible to add fog in VRML using the Fog node, where the color and density of the fog can be specified. Fog can increase the frame rate of a world, since objects hidden by the fog are not rendered.

2.3.2
Animation and Interactions An advantage of VRML97 over the original VRML 1.0 is that the VRML world can be interactive. The only method of animation in VRML is tweening, which can be done by slowly changing an object specified in an interpolator node. This node will modify an object over time, based on the type of interpolator. There are six interpolators: color, coordinate, normal, orientation, position, and scalar. All interpolators have two nodes that must be specified: the key and keyValue. The key consists of a list of two or more numbers, starting with 0 and ending with 1. Each key element must be complemented with a keyValue element. The key defines how far along the animation is, and the keyValue defines what values should change. For example, a key element of 0.5 and its matching keyValue define what the object should look like at the middle of the animation. A TimeSensor node times an animation, so that the interpolator knows what stage the object should be in. A TimeSensor has no physical form in the VRML world and just keeps time. To notify an interpolator of a time change, a ROUTE is needed to connect two nodes. One is needed between the TimeSensor and the interpolator and another between the interpolator and the object to be animated. Most animation can be accomplished this way. Chaining ROUTE commands so that one event triggers many others can accomplish complex animations. Two categories of sensors can be used in VRML to obtain input from a user. The first is environment sensors. There are three kinds of environment sensor nodes: VisibilitySensor, ProximitySensor, and Collision. A VisibilitySensor is activated when a user's field of view enters an invisible box. A ProximitySensor is activated when a user enters or leaves an area. A Collision is activated when the user hits the node. The second category of sensors is called pointing device sensors.
The first pointing device sensor is a touch sensor, activated when an object is clicked with the mouse. Three other sensors are called drag sensors. These sensors allow the rotation of spheres, cylinders, and planes when a mouse is dragging the object.
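The key/keyValue mechanism described above amounts to piecewise-linear interpolation: given the fraction a TimeSensor supplies, find the surrounding keys and blend their keyValues. A minimal Python sketch of the idea (our own helper, not the actual VRML runtime), for a scalar interpolator:

```python
def interpolate(key, key_value, fraction):
    """Piecewise-linear interpolation, as a VRML scalar interpolator does.
    key: ascending list starting at 0 and ending at 1;
    key_value: one value per key element."""
    if fraction <= key[0]:
        return key_value[0]
    if fraction >= key[-1]:
        return key_value[-1]
    for i in range(len(key) - 1):
        if key[i] <= fraction <= key[i + 1]:
            # How far we are between this pair of keys:
            t = (fraction - key[i]) / (key[i + 1] - key[i])
            return key_value[i] + t * (key_value[i + 1] - key_value[i])

# Height of an object rising from 0 to 10 and back down, sampled at 1/4 time:
h = interpolate([0.0, 0.5, 1.0], [0.0, 10.0, 0.0], 0.25)
```

Here a key of 0.5 with keyValue 10.0 defines the animation's midpoint, so sampling at fraction 0.25 gives a height of 5.0, halfway up.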
2.3.3
VRML Specifics A VRML file is simply a text file with a .wrl extension. VRML97 must include the line #VRML V2.0 utf8 in the first line of the file. A # denotes a comment anywhere in the file except for the first line. The first line of a VRML file tells the VRML client what version of VRML to use. VRML nodes are case sensitive and are usually built hierarchically.
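Since a VRML world is just text, these points can be illustrated by building one as a string and performing the version check a client would. A small Python sketch (the world content is our own minimal example):

```python
# The mandatory VRML97 header line; "#" starts a comment on any other line.
HEADER = "#VRML V2.0 utf8"

world = "\n".join([
    HEADER,
    "# one default box at the origin (this line is a comment)",
    "Shape { geometry Box {} }",
])

# A VRML client's first check: is line 1 exactly the version header?
first_line = world.splitlines()[0]
is_vrml97 = (first_line == HEADER)
```

Saving the string to a file with a .wrl extension is all that is needed to make it loadable by a VRML97 browser.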
Although only a simple text editor such as Notepad is needed, VRML-specific text editors are available, such as VRMLpad. They aid in creating VRML objects by providing different colors and collapsing or expanding nodes. All nodes begin with "{" and end with "}" and most can contain nodes inside nodes. Special nodes, called group nodes, can cluster multiple nodes. The keyword children followed by "[" begins the list of children nodes, which ends with "]". A Transform node is an example of a group node. Nodes can be named using DEF and can be used again later by using the keyword USE. This allows for creation of complex objects using many simple objects. To create a simple box in VRML:

Shape {
    geometry Box {}
}
The box defaults to a 2-meter-long cube in the center of the screen. Putting it into a Transform node can move this box to a different part of the scene. We can also give the box a different color, such as red:

Transform {
    translation 0 10 0
    children [
        Shape {
            geometry Box {}
            appearance Appearance {
                material Material { diffuseColor 1 0 0 }
            }
        }
    ]
}

This VRML fragment puts a red box centered in the +10 Y direction. The box can be reused if DEF mybox is put in front of the Transform. Now, whenever the box needs to be used again, simply putting USE mybox will make a copy.
2.4 FURTHER EXPLORATION Good general references for multimedia authoring are introductory books [3, 1] and Chapters 5–8 in [4]. Material on automatic authoring is fully expanded in [7]. A link to the overall, very useful FAQ file for multimedia authoring is in the textbook web site's Further Exploration section for this chapter. Our TextColor.exe program for investigating complementary colors, as in Figure 2.5, is on the textbook web site as well. We also include a link to a good FAQ collection on Director. A simple Director movie demonstrating the ideas set out in Section 2.2.2 may be downloaded from the web site, along with information on Dreamweaver, VRML, and a small demo VRML world.
2.5 EXERCISES
1. What extra information is multimedia good at conveying?
(a) What can spoken text convey that written text cannot?
(b) When might written text be better than spoken text?
2. Find and learn 3D Studio Max in your local lab software. Read the online tutorials to see this software's approach to a 3D modeling technique. Learn texture mapping and animation using this product. Make a 3D model after carrying out these steps.
3. Design an interactive web page using Dreamweaver. HTML 4 provides layer functionality, as in Adobe Photoshop. Each layer represents an HTML object, such as text, an image, or a simple HTML page. In Dreamweaver, each layer has a marker associated with it. Therefore, highlighting the layer marker selects the entire layer, to which you can apply any desired effect. As in Flash, you can add buttons and behaviors for navigation and control. You can create animations using the Timeline behavior.
4. In regard to automatic authoring,
(a) What would you suppose is meant by the term "active images"?
(b) What are the problems associated with moving text-based techniques to the realm of image-based automatic authoring?
(c) What is the single most important problem associated with automatic authoring using legacy (already written) text documents?
5. Suppose we wish to create a simple animation, as in Figure 2.30. Note that this image is exactly what the animation looks like at some time, not a figurative representation of the process of moving the fish; the fish is repeated as it moves. State what we need to carry out this objective, and give a simple pseudocode solution for the problem.
Assume we already have a list of (x, y) coordinates for the fish path, that we have available a procedure for centering images on path positions, and that the movement takes place on top of a video.
6. For the slide transition in Figure 2.11, explain how we arrive at the formula for x in the unmoving right video R_R.
7. Suppose we wish to create a video transition such that the second video appears under the first video through an opening circle (like a camera iris opening), as in Figure 2.31. Write a formula to use the correct pixels from the two videos to achieve this special effect. Just write your answer for the red channel.
FIGURE 2.31: Iris wipe: (a) iris is opening; (b) at a later moment.

8. Now suppose we wish to create a video transition such that the second video appears under the first video through a moving radius (like a clock hand), as in Figure 2.32. Write a formula to use the correct pixels from the two videos to achieve this special effect for the red channel.
FIGURE 2.32: Clock wipe: (a) clock hand is sweeping out; (b) at a later moment.

FIGURE 2.30: Sprite, progressively taking up more space.
FIGURE 2.33: Filter applied to video.

9. Suppose you wish to create a wavy effect, as in Figure 2.33. This effect comes from replacing the image x value by an x value offset by a small amount. Suppose the image size is 160 rows × 120 columns of pixels.
(a) Using float arithmetic, add a sine component to the x value of the pixel such that the pixel takes on an RGB value equal to that of a different pixel in the original image. Make the maximum shift in x equal to 16 pixels.
(b) In Premiere and other packages, only integer arithmetic is provided. Functions such as sin are redefined so as to take an int argument and return an int. The argument to the sin function must be in 0 .. 1,024, and the value of sin is in −512 .. 512: sin(0) returns 0, sin(256) returns 512, sin(512) returns 0, sin(768) returns −512, and sin(1,024) returns 0. Rewrite your expression in part (a) using integer arithmetic.
(c) How could you change your answer to make the waving time-dependent?
10. How would you create the image in Figure 2.6? Write a small program to make such an image. Hint: Place R, G, and B at the corners of an equilateral triangle inside the circle. It's best to go over all columns and rows in the output image rather than simply going around the disk and trying to map results back to (x, y) pixel positions.
11. As a longer exercise for learning existing software for manipulating images, video, and music, make a 1-minute digital video. By the end of this exercise, you should be familiar with PC-based equipment and know how to use Adobe Premiere, Photoshop, Cakewalk Pro Audio, and other multimedia software.
(a) Capture (or find) at least three video files. You can use a camcorder or VCR to make your own (through Premiere or the like) or find some on the Net.
(b) Compose (or edit) a small MIDI file with Cakewalk Pro Audio.
(c) Create (or find) at least one WAV file. You may either digitize your own or download some from the net.
(d) Use Photoshop to create a title and an ending.
(e) Combine all of the above to produce a movie about 60 seconds long, including a title, some credits, some soundtracks, and at least three transitions. Experiment with different compression methods; you are encouraged to use MPEG for your final product.
(f) The above constitutes a minimum statement of the exercise. You may be tempted to get very creative, and that's fine, but don't go overboard and take too much time away from the rest of your life!

2.6 REFERENCES
1. A.C. Luther, Authoring Interactive Multimedia, The IBM Tools Series, San Diego: AP Professional, 1994.
2. R. Vetter, C. Ward, and S. Shapiro, "Using Color and Text in Multimedia Projections," IEEE Multimedia, 2(4): 46–54, 1995.
3. J.C. Shepherd and D. Colaizzi, Authoring Authorware: A Practical Guide, Upper Saddle River, NJ: Prentice Hall, 1998.
4. D.E. Wolfgram, Creating Multimedia Presentations, Indianapolis: Que Publishing, 1994.
5. A. Ginige, D. Lowe, and J. Robertson, "Hypermedia Authoring," IEEE Multimedia, 2: 24–35, 1995.
6. A. Ginige and D. Lowe, "Next Generation Hypermedia Authoring Systems," in Proceedings of Multimedia Information Systems and Hypermedia, 1995, 1–11.
7. D. Lowe and W. Hall, Hypermedia and the Web: An Engineering Approach, New York: Wiley, 1999.
CHAPTER 3
Graphics and Image Data Representations

In this chapter we look at images, starting with 1-bit images, then 8-bit gray images and how to print them, then 24-bit color images and 8-bit versions of color images. The specifics of file formats for storing such images will also be discussed. We consider the following topics:

• Graphics/image data types
• Popular file formats

FIGURE 3.1: Monochrome 1-bit Lena image.
3.1 GRAPHICS/IMAGE DATA TYPES

The number of file formats used in multimedia continues to proliferate [1]. For example, Table 3.1 shows a list of file formats used in the popular product Macromedia Director. In this text, we shall study just a few popular file formats, to develop a sense of how they operate. We shall concentrate on GIF and JPG image file formats, since these two formats are distinguished by the fact that most web browsers can decompress and display them. To begin, we shall discuss the features of file formats in general.

TABLE 3.1: Macromedia Director file formats.

File import:
  Image:      BMP, DIB, GIF, JPG, PICT, PNG, PNT, PSD, TGA, TIFF, WMF
  Palette:    PAL, ACT
  Sound:      AIFF, AU, MP3, WAV
  Video:      AVI, MOV
  Animation:  DIR, FLA, FLC, FLI, GIF, PPT
File export:
  Image:      BMP
  Video:      AVI, MOV
Native:       DIR, DXR, EXE

3.1.1 1-Bit Images

Images consist of pixels, or pels — picture elements in digital images. A 1-bit image consists of on and off bits only and thus is the simplest type of image. Each pixel is stored as a single bit (0 or 1). Hence, such an image is also referred to as a binary image. It is also called a 1-bit monochrome image, since it contains no color. Figure 3.1 shows a 1-bit monochrome image (called "Lena" by multimedia scientists — this is a standard image used to illustrate many algorithms). A 640 × 480 monochrome image requires 38.4 kilobytes of storage (= 640 × 480/8). Monochrome 1-bit images can be satisfactory for pictures containing only simple graphics and text.

3.1.2 8-Bit Gray-Level Images

Now consider an 8-bit image — that is, one for which each pixel has a gray value between 0 and 255. Each pixel is represented by a single byte — for example, a dark pixel might have a value of 10, and a bright one might be 230. The entire image can be thought of as a two-dimensional array of pixel values. We refer to such an array as a bitmap — a representation of the graphics/image data that parallels the manner in which it is stored in video memory. Image resolution refers to the number of pixels in a digital image (higher resolution always yields better quality). Fairly high resolution for such an image might be 1,600 × 1,200, whereas lower resolution might be 640 × 480. Notice that here we are using an aspect ratio of 4:3. We don't have to adopt this ratio, but it has been found to look natural. Such an array must be stored in hardware; we call this hardware a frame buffer. Special (relatively expensive) hardware called a "video" card (actually a graphics card) is used for this purpose. The resolution of the video card does not have to match the desired resolution of the image, but if not enough video card memory is available, the data has to be shifted around in RAM for display. We can think of the 8-bit image as a set of 1-bit bitplanes, where each plane consists of a 1-bit representation of the image at higher and higher levels of "elevation": a bit is turned on if the image pixel has a nonzero value at or above that bit level. Figure 3.2 displays the concept of bitplanes graphically. Each bitplane can have a value of 0 or 1 at each pixel but, together, all the bitplanes make up a single byte that stores
FIGURE 3.2: Bitplanes for 8-bit grayscale image (plane 0 through plane 7).

values between 0 and 255 (in this 8-bit situation). For the least significant bit, the bit value translates to 0 or 1 in the final numeric sum of the binary number. Positional arithmetic implies that for the next, second, bit each 0 or 1 makes a contribution of 0 or 2 to the final sum. The next bits stand for 0 or 4, 0 or 8, and so on, up to 0 or 128 for the most significant bit. Video cards can refresh bitplane data at video rate but, unlike RAM, do not hold the data well. Raster fields are refreshed at 60 cycles per second in North America. Each pixel is usually stored as a byte (a value between 0 to 255), so a 640 × 480 grayscale image requires 300 kilobytes of storage (640 × 480 = 307,200).

FIGURE 3.3: Grayscale image of Lena.

Figure 3.3 shows the Lena image again, this time in grayscale. If we wish to print such an image, things become more complex. Suppose we have available a 600 dot-per-inch (dpi) laser printer. Such a device can usually only print a dot or not print it. However, a 600 × 600 image will be printed in a 1-inch space and will thus not be very pleasing. Instead, dithering is used. The basic strategy of dithering is to trade intensity resolution for spatial resolution. (See [2], p. 568, for a good discussion of dithering.)

Dithering For printing on a 1-bit printer, dithering is used to calculate larger patterns of dots, such that values from 0 to 255 correspond to pleasing patterns that correctly represent darker and brighter pixel values. The main strategy is to replace a pixel value by a larger pattern, say 2 × 2 or 4 × 4, such that the number of printed dots approximates the varying-sized disks of ink used in halftone printing. Halftone printing is an analog process that uses smaller or larger filled circles of black ink to represent shading, for newspaper printing, say.

If instead we use an n × n matrix of on-off 1-bit dots, we can represent n² + 1 levels of intensity resolution — since, for example, three dots filled in any way counts as one intensity level. The dot patterns are created heuristically. For example, if we use a 2 × 2 "dither matrix":

    0  2
    3  1

we can first remap image values in 0 .. 255 into the new range 0 .. 4 by (integer) dividing by 256/5. Then, for example, if the pixel value is 0, we print nothing in a 2 × 2 area of printer output. But if the pixel value is 4, we print all four dots. So the rule is:

If the intensity is greater than the dither matrix entry, print an on dot at that entry location: replace each pixel by an n × n matrix of dots.

However, we notice that the number of levels is small for this type of printing. If we increase the number of effective intensity levels by increasing the dither matrix size, we also increase the size of the output image. This reduces the amount of detail in any small part of the image, effectively reducing the spatial resolution. Note that the image size may be much larger for a dithered image, since replacing each pixel by a 4 × 4 array of dots, say, makes an image 16 times as large. However, a clever trick can get around this problem. Suppose we wish to use a larger, 4 × 4 dither matrix, such as:

     0   8   2  10
    12   4  14   6
     3  11   1   9
    15   7  13   5

Then suppose we slide the dither matrix over the image four pixels in the horizontal and vertical directions at a time (where image values have been reduced to the range 0 .. 16). An "ordered dither" consists of turning on the printer output bit for a pixel if the intensity level is greater than the particular matrix element just at that pixel position. Figure 3.4(a) shows a grayscale image of Lena. The ordered-dither version is shown as Figure 3.4(b), with a detail of Lena's right eye in Figure 3.4(c).
FIGURE 3.4: Dithering of grayscale images. (a) 8-bit gray image lenagray.bmp; (b) dithered version of the image; (c) detail of dithered version. (This figure also appears in the color insert section.)

An algorithm for ordered dither, with an n × n dither matrix, is as follows:

ALGORITHM 3.1 ORDERED DITHER
begin
    for x = 0 to x_max          // columns
        for y = 0 to y_max      // rows
            i = x mod n
            j = y mod n
            // I(x, y) is the input, O(x, y) is the output, D is the dither matrix.
            if I(x, y) > D(i, j)
                O(x, y) = 1;
            else
                O(x, y) = 0;
end

Foley, et al. [2] provides more details on ordered dithering.
3.1.3 Image Data Types

The next sections introduce some of the most common data types for graphics and image file formats: 24-bit color and 8-bit color. We then discuss file formats. Some formats are restricted to particular hardware/operating system platforms, while others are platform-independent, or cross-platform, formats. Even if some formats are not cross-platform, conversion applications can recognize and translate formats from one system to another. Most image formats incorporate some variation of a compression technique due to the large storage size of image files. Compression techniques can be classified as either lossless or lossy. We will study various image, video, and audio compression techniques in Chapters 7 through 14.
3.1.4 24-Bit Color Images In a color 24-bit image, each pixel is represented by three bytes, usually representing RGB. Since each value is in the range 0–255, this format supports 256 × 256 × 256, or a total
of 16,777,216, possible combined colors. However, such flexibility does result in a storage penalty: a 640 × 480 24-bit color image would require 921.6 kilobytes of storage without any compression. An important point to note is that many 24-bit color images are actually stored as 32-bit images, with the extra byte of data for each pixel storing an α (alpha) value representing special-effect information. (See [2], p. 835, for an introduction to use of the α-channel for compositing several overlapping objects in a graphics image. The simplest use is as a transparency flag.) Figure 3.5 shows the image forestfire.bmp, a 24-bit image in Microsoft Windows BMP format (discussed later in the chapter). Also shown are the grayscale images for just the red, green, and blue channels, for this image. Taking the byte values 0 .. 255 in each color channel to represent intensity, we can display a gray image for each color separately.

FIGURE 3.5: High-resolution color and separate R, G, B color channel images. (a) example of 24-bit color image forestfire.bmp; (b, c, d) R, G, and B color channels for this image. (This figure also appears in the color insert section.)

3.1.5 8-Bit Color Images If space is a concern (and it almost always is), reasonably accurate color images can be obtained by quantizing the color information to collapse it. Many systems can make use of only 8 bits of color information (the so-called "256 colors") in producing a screen image. Even if a system has the electronics to actually use 24-bit information, backward compatibility demands that we understand 8-bit color image files.
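The storage figures quoted in these sections follow from simple arithmetic. A quick check, in bytes, for the uncompressed 640 × 480 sizes mentioned so far:

```python
# Uncompressed sizes for a 640 x 480 image at the bit depths in the text.
width, height = 640, 480

bits_1   = width * height // 8   # 1-bit monochrome: one bit per pixel
gray_8   = width * height        # 8-bit grayscale: one byte per pixel
color_24 = width * height * 3    # 24-bit color: three bytes per pixel
```

These give 38,400 bytes (the 38.4 KB quoted for 1-bit), 307,200 bytes (the roughly 300 KB quoted for 8-bit), and 921,600 bytes (the 921.6 KB quoted for 24-bit).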
FIGURE 3.7: Exampie of 8-biL color image. (This figure aiso appears in Lhe cobor inseri seclion.) noL aiways lhe case. Consider Lhe fieid of medical imaging: wouid you be saLisfied wiLh only a “reasonabiy accuraLe” image of your brain for poLential laser surgery? Likeiy not — and Lhal is why consideraLion of 64-bit imaging for medical appiicaLions is noL ouL of Lhe question. NoLe lhe greaL savings in space for 8-bit images over 24-bil ones: a 640 x 480 8-bit color image requires only 300 kiiobyLes of sLorage, compared 10921.6 kiiobytes for a coior image (again, wilhouL any compression applied).
FIGURE 3.6: Three-dimensional histogram of RGB colors in forestfire.bmp.
Such image files use the concept of a lookup table to store color information. Basically, the image stores not color but instead just a set of bytes, each of which is an index into a table with 3-byte values that specify the color for a pixel with that lookup table index. In a way, it's a bit like a paint-by-number children's art set, with number 1 perhaps standing for orange, number 2 for green, and so on — there is no inherent pattern to the set of actual colors.

It makes sense to carefully choose just which colors to represent best in the image: if an image is mostly red sunset, it's reasonable to represent red with precision and store only a few greens.

Suppose all the colors in a 24-bit image were collected in a 256 x 256 x 256 set of cells, along with the count of how many pixels belong to each of these colors stored in that cell. For example, if exactly 23 pixels have RGB values (45, 200, 91), then store the value 23 in a three-dimensional array, at the element indexed by the index values [45, 200, 91]. This data structure is called a color histogram (see, e.g., [3, 4]).

Figure 3.6 shows a 3D histogram of the RGB values of the pixels in forestfire.bmp. The histogram has 16 x 16 x 16 bins and shows the count in each bin in terms of intensity and pseudocolor. We can see a few important clusters of color information, corresponding to the reds, yellows, greens, and so on, of the forestfire image. Clustering in this way allows us to pick the most important 256 groups of color.

Basically, large populations in 3D histogram bins can be subjected to a split-and-merge algorithm to determine the "best" 256 colors. Figure 3.7 shows the resulting 8-bit image in GIF format (discussed later in this chapter). Notice that it is difficult to discern the difference between Figure 3.5(a), the 24-bit image, and Figure 3.7, the 8-bit image. This is
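The color histogram just described can be sketched in a few lines. This is an illustrative version (the pixel data below are invented, not from the book): each RGB byte triple lands in a 16 x 16 x 16 bin grid by dropping the low 4 bits of each channel.

```python
def color_histogram(pixels, bins_per_channel=16):
    """Count pixels falling into each (R, G, B) bin.

    pixels: iterable of (r, g, b) byte triples in 0..255.
    Returns a dict mapping (rbin, gbin, bbin) -> count.
    """
    # For 16 bins per channel we keep the top 4 bits of each byte.
    shift = 8 - (bins_per_channel - 1).bit_length()
    hist = {}
    for r, g, b in pixels:
        key = (r >> shift, g >> shift, b >> shift)
        hist[key] = hist.get(key, 0) + 1
    return hist

# The text's example: 23 pixels with RGB value (45, 200, 91).
# In a 16x16x16 grid they all land in bin (2, 12, 5).
hist = color_histogram([(45, 200, 91)] * 23)
print(hist[(2, 12, 5)])   # 23
```

With `bins_per_channel=256` this degenerates to the full 256 x 256 x 256 histogram described in the text.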
3.1.6 Color Lookup Tables (LUTs)

Again, the idea used in 8-bit color images is to store only the index, or code value, for each pixel. Then, if a pixel stores, say, the value 25, the meaning is to go to row 25 in a color lookup table (LUT). While images are displayed as two-dimensional arrays of values, they are usually stored in row-column order as simply a long series of values. For an 8-bit image, the image file can store in the file header information just what 8-bit values for R, G, and B correspond to each index. Figure 3.8 displays this idea. The LUT is often called a palette.
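Dereferencing the LUT is then a single table lookup per pixel. A minimal sketch follows; the palette contents are invented for illustration:

```python
# A palette ("LUT") of 256 rows, each an (R, G, B) triple.
# The entries here are invented for illustration.
palette = [(0, 0, 0)] * 256
palette[25] = (255, 128, 0)          # suppose row 25 holds orange

# An 8-bit image stores one palette index per pixel, in row-column order.
indexed_image = [25, 25, 0, 25]

# Displaying the image means replacing each index by its 24-bit palette color.
rgb_image = [palette[i] for i in indexed_image]
print(rgb_image[0])   # (255, 128, 0)
```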
FIGURE 3.8: Color LUT for 8-bit color images.
FIGURE 3.10: (a) 24-bit color image lena.bmp; (b) version with color dithering; (c) detail of dithered version.
FIGURE 3.9: Color picker for 8-bit color: each block of the color picker corresponds to one row of the color LUT.
A color picker consists of an array of fairly large blocks of color (or a semicontinuous range of colors) such that a mouse click will select the color indicated. In reality, a color picker displays the palette colors associated with index values from 0 to 255. Figure 3.9 displays the concept of a color picker: if the user selects the color block with index value 2, then the color meant is cyan, with RGB values (0, 255, 255).

A simple animation process is possible via simply changing the color table: this is called color cycling or palette animation. Since updates from the color table are fast, this can result in a simple, pleasing effect.

Dithering can also be carried out for color printers, using 1 bit per color channel and spacing out the colors with R, G, and B dots. Alternatively, if the printer or screen can print only a limited number of colors, say using 8 bits instead of 24, color can be made to seem printable, even if it is not available in the color LUT. The apparent color resolution of a display can be increased without reducing spatial resolution by averaging the intensities of neighboring pixels. Then it is possible to trick the eye into perceiving colors that are not available, because the eye carries out a spatial blending that can be put to good use. Figure 3.10(a) shows a 24-bit color image of Lena, and Figure 3.10(b) shows the same image reduced to only 5 bits via dithering. Figure 3.10(c) shows a detail of the left eye.

How to Devise a Color Lookup Table

In Section 3.1.5, we briefly discussed the idea of clustering to generate the most important 256 colors from a 24-bit color image. However, in general, clustering is an expensive and slow process. But we need to devise color LUTs somehow — how shall we accomplish this? The most straightforward way to make 8-bit lookup color out of 24-bit color would be to divide the RGB cube into equal slices in each dimension.
Then the centers of each of the resulting cubes would serve as the entries in the color LUT, while simply scaling the RGB ranges 0..255 into the appropriate ranges would generate the 8-bit codes.
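As a sketch, here is such a uniform quantization using 3 bits for R, 3 for G, and 2 for B (a common split, and the one discussed next in the text). Decoding each code to the center of its quantization cube is an illustrative choice, not code from the book:

```python
def pack_332(r, g, b):
    """Quantize a 24-bit (r, g, b) pixel to one byte, laid out RRRGGGBB."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

def unpack_332(code):
    """Recover a 24-bit color: take the center of each quantization cube."""
    r = ((code >> 5) & 0x7) * 32 + 16    # centers of the 32-wide R slices
    g = ((code >> 2) & 0x7) * 32 + 16    # centers of the 32-wide G slices
    b = (code & 0x3) * 64 + 32           # centers of the 64-wide B slices
    return (r, g, b)

print(pack_332(255, 255, 255))   # 255
print(unpack_332(255))           # (240, 240, 224)
```

Here `r >> 5` is exactly the "divide the byte value by 32 and truncate" step the text describes for R and G, and `b >> 6` divides by 64 for B.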
Since humans are more sensitive to R and G than to B, we could shrink the R range and G range 0..255 into the 3-bit range 0..7 and shrink the B range down to the 2-bit range 0..3, making a total of 8 bits. To shrink R and G, we could simply divide the R or G byte value by (256/8 =) 32 and then truncate. Then each pixel in the image gets replaced by its 8-bit index, and the color LUT serves to generate 24-bit color.

However, what tends to happen with this simple scheme is that edge artifacts appear in the image. The reason is that if a slight change in RGB results in shifting to a new code, an edge appears, and this can be quite annoying perceptually.

A simple alternate solution for this color reduction problem, called the median-cut algorithm, does a better job (and several other competing methods do as well or better). This approach derives from computer graphics [5]; here, we show a much simplified version. The method is a type of adaptive partitioning scheme that tries to put the most bits, the most discrimination power, where colors are most clustered.

The idea is to sort the R byte values and find their median. Then values smaller than the median are labeled with a 0 bit and values larger than the median are labeled with a 1 bit. The median is the point where half the pixels are smaller and half are larger. Suppose we are imaging some apples, and most pixels are reddish. Then the median R byte value might fall fairly high on the red 0..255 scale. Next, we consider only pixels with a 0 label from the first step and sort their G values. Again, we label image pixels with another bit, 0 for those less than the median in the greens and 1 for those greater. Now applying the same scheme to pixels that received a 1 bit for the red step, we have arrived at 2-bit labeling for all pixels. Carrying on to the blue channel, we have a 3-bit scheme.
Repeating all steps, R, G, and B, results in a 6-bit scheme, and cycling through R and G once more results in 8 bits. These bits form our 8-bit color index value for pixels, and corresponding 24-bit colors can be the centers of the resulting small color cubes. You can see that in fact this type of scheme will indeed concentrate bits where they most need to differentiate between high populations of close colors. We can most easily visualize finding the median by using a histogram showing counts at position 0..255. Figure 3.11 shows a histogram of the R byte values for the forestfire.bmp image along with the median of these values, depicted as a vertical line.
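The simplified bit-labeling procedure just described can be sketched as follows. This is an illustrative simplification (the recursive structure, the tie handling, and the fixed R, G, B channel cycle are our choices, not code from the book):

```python
def median_cut_labels(pixels, nbits=8):
    """Assign an nbits-wide index to each pixel by recursive median splitting.

    pixels: list of (r, g, b) triples. Channels are cycled R, G, B, R, G, ...
    Returns a list of integer labels, parallel to `pixels`.
    """
    labels = [0] * len(pixels)

    def split(indices, depth):
        if depth == nbits or len(indices) < 2:
            return
        channel = depth % 3                      # R, G, B in turn
        vals = sorted(pixels[i][channel] for i in indices)
        median = vals[len(vals) // 2]
        for i in indices:
            # Values at or above the group's median get a 1 bit
            # (ties go to the upper half in this sketch).
            if pixels[i][channel] >= median:
                labels[i] |= 1 << (nbits - 1 - depth)
        split([i for i in indices if pixels[i][channel] < median], depth + 1)
        split([i for i in indices if pixels[i][channel] >= median], depth + 1)

    split(list(range(len(pixels))), 0)
    return labels

# A red pixel and a green pixel separate on the very first (R-median) bit:
print(median_cut_labels([(200, 10, 10), (10, 200, 10)]))   # [128, 0]
```

Note that each subgroup computes its own median, which is what concentrates discrimination power wherever the pixel population is densest.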
FIGURE 3.11: Histogram of R bytes for the 24-bit color image forestfire.bmp.

Sorting and labeling in this way results in a 0 or 1 bit label for every pixel. For the second bit of the color table index being built, we take R values less than the R median and label just those pixels as 0 or 1 according as their G value is less or greater than the median of the G value. Continuing over R, G, B for 8 bits gives a color LUT 8-bit index.

The 24-bit color image resulting from replacing every pixel by its corresponding color LUT 24-bit color is only an approximation to the original 24-bit image, of course, but the above algorithm does a reasonable job of putting most discriminatory power where it is most needed — where small color shading differences will be most noticeable. It should also be mentioned that several methods exist for distributing the approximation errors from one pixel to the next. This has the effect of smoothing out problems in the 8-bit approximation.

The more accurate version of the median-cut algorithm proceeds via the following steps:

1. Find the smallest box that contains all the colors in the image.
2. Sort the enclosed colors along the longest dimension of the box.
3. Split the box into two regions at the median of the sorted list.
4. Repeat the above process in steps (2) and (3) until the original color space has been divided into, say, 256 regions.
5. For every box, call the mean of R, G, and B in that box the representative (the center) color for the box.
6. Based on the Euclidean distance between a pixel RGB value and the box centers, assign every pixel to one of the representative colors. Replace the pixel by the code in a lookup table that indexes representative colors (in the table, each representative color is 24 bits — 8 bits each for R, G, and B).

This way, we might have a table of 256 rows, each containing three 8-bit values. The row indices are the codes for the lookup table, and these indices are what are stored in pixel values of the new, color-quantized or palettized image.

3.2 POPULAR FILE FORMATS

Some popular file formats for information exchange are described below. One of the most important is the 8-bit GIF format, because of its historical connection to the WWW and HTML markup language as the first image type recognized by net browsers. However, currently the most important common file format is JPEG, which will be explored in great depth in Chapter 9.
3.2.1 GIF

Graphics Interchange Format (GIF) was devised by UNISYS Corporation and Compuserve, initially for transmitting graphical images over phone lines via modems. The GIF standard uses the Lempel-Ziv-Welch algorithm (a form of compression — see Chapter 7), modified slightly for image scanline packets to use the line grouping of pixels effectively.

The GIF standard is limited to 8-bit (256) color images only. While this produces acceptable color, it is best suited for images with few distinctive colors (e.g., graphics or drawing).

The GIF image format has a few interesting features, notwithstanding the fact that it has been largely supplanted. The standard supports interlacing — the successive display of pixels in widely spaced rows by a four-pass display process. In fact, GIF comes in two flavors. The original specification is GIF87a. The later version, GIF89a, supports simple animation via a Graphics Control Extension block in the data. This provides simple control over delay time, a transparency index, and so on. Software such as Corel Draw allows access to and editing of GIF images.

It is worthwhile examining the file format for GIF87 in more detail, since many such formats bear a resemblance to it but have grown a good deal more complex than this "simple" standard. For the standard specification, the general file format is as in Figure 3.12. The Signature is 6 bytes: GIF87a; the Screen Descriptor is a 7-byte set of flags. A GIF87 file can contain more than one image definition, usually to fit on several different parts of the screen. Therefore each image can contain its own color lookup table, a Local Color Map, for mapping 8 bits into 24-bit RGB values. However, it need not, and a global color map can instead be defined to take the place of a local table if the latter is not included.

The Screen Descriptor comprises a set of attributes that belong to every image in the file. According to the GIF87 standard, it is defined as in Figure 3.13.
Screen Width is given in the first 2 bytes. Since some machines invert the order MSB/LSB (most significant byte/least significant byte — i.e., byte order), this order is specified. Screen Height is the next 2 bytes. The "m" in byte 5 is 0 if no global color map is given. Color resolution, "cr", is 3 bits in 0..7. Since this is an old standard meant to operate on a variety of low-end hardware, "cr" is requesting this much color resolution.
FIGURE 3.12: GIF file format. (The file comprises a Signature, a Screen Descriptor, and a Global Color Map, followed by one or more blocks of Image Descriptor, Local Color Map, and Raster Area, repeated n times, and finally a GIF Terminator.)
FIGURE 3.13: GIF screen descriptor. (Bytes 1 and 2: Screen Width, the raster width in pixels, LSB first. Bytes 3 and 4: Screen Height, the raster height in pixels, LSB first. Byte 5 packs the flags: the "m" bit (m = 1 means a global color map follows the descriptor), 3 bits "cr" (cr + 1 is the number of bits of color resolution), a zero bit, and 3 bits "pixel" (pixel + 1 is the number of bits per pixel in the image). Byte 6: Background, the color index of the screen background; the color is defined from the global color map or, if none is specified, from the default map. Byte 7: all zeros.)
FIGURE 3.14: GIF color map. (Byte 1: red intensity, the red value for color index 0. Byte 2: green intensity, the green value for color index 0. Byte 3: blue intensity, the blue value for color index 0. Bytes 4, 5, and 6: the red, green, and blue values for color index 1; and so on for the remaining colors.)
The next bit, shown as "0", is extra and is not used in this standard. "Pixel" is another 3 bits, indicating the number of bits per pixel in the image, as stored in the file. Although "cr" usually equals "pixel", it need not. Byte 6 gives the color table index byte for the background color, and byte 7 is filled with zeros. For present usage, the ability to use a small color resolution is a good feature, since we may be interested in very low-end devices such as web-enabled wristwatches, say.

A color map is set up in a simple fashion, as in Figure 3.14. However, the actual length of the table equals 2^(pixel+1) as given in the screen descriptor.

Each image in the file has its own Image Descriptor, defined as in Figure 3.15. Interestingly, the developers of this standard allowed for future extensions by ignoring any bytes between the end of one image and the beginning of the next, identified by a comma character. In this way, future enhancements could have been simply inserted in a backward-compatible fashion.

If the interlace bit is set in the local Image Descriptor, the rows of the image are displayed in a four-pass sequence, as in Figure 3.16. Here, the first pass displays rows 0 and 8, the second pass displays rows 4 and 12, and so on. This allows for a quick sketch to appear when a web browser displays the image, followed by more detailed fill-ins. The JPEG standard (below) has a similar display mode, denoted progressive mode.

The actual raster data itself is first compressed using the LZW compression scheme (see Chapter 7) before being stored.

The GIF87 standard also set out, for future use, how Extension Blocks could be defined. Even in GIF87, simple animations can be achieved, but no delay was defined between images, and multiple images simply overwrite each other with no screen clears. GIF89 introduced a number of Extension Block definitions, especially those to assist animation: transparency and delay between images.
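The four-pass interlace order can be generated directly. This small sketch (ours, not from the book) lists, for an image of a given height, which rows each pass draws:

```python
def gif_interlace_order(height):
    """Row display order for GIF's four-pass interlace.

    Pass 1: every 8th row starting at row 0; pass 2: every 8th starting at 4;
    pass 3: every 4th starting at 2; pass 4: every other row starting at 1.
    Returns a list of four lists of row numbers.
    """
    passes = [(0, 8), (4, 8), (2, 4), (1, 2)]
    return [list(range(start, height, step)) for start, step in passes]

order = gif_interlace_order(13)
print(order[0])   # [0, 8]     -- pass 1, as in the text
print(order[1])   # [4, 12]    -- pass 2
```

Together the four passes cover every row exactly once, which is what lets a browser paint a coarse sketch first and fill in detail later.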
A quite useful feature introduced in GIF89 is the idea of a sorted color table. The most important colors appear first, so that if
FIGURE 3.15: GIF image descriptor. (Byte 1: the image separator character, a comma, bit pattern 00101100. Bytes 2 and 3: start of the image in pixels from the left side of the screen, LSB first. Bytes 4 and 5: start of the image in pixels from the top of the screen, LSB first. Bytes 6 and 7: width of the image in pixels, LSB first. Bytes 8 and 9: height of the image in pixels, LSB first. Byte 10 packs the flags: m = 0 means use the global color map and ignore "pixel", m = 1 means a local color map follows and "pixel" is used; i = 0 means the image is formatted in sequential order, i = 1 in interlaced order; pixel + 1 is the number of bits per pixel for this image.)
a decoder has fewer colors available, the most important ones are chosen. That is, only a segment of the color lookup table is used, and nearby colors are mapped as well as possible into the colors available.

We can investigate how the file header works in practice by having a look at a particular GIF image. Figure 3.7 is an 8-bit color GIF image. To see how the file header looks, we can simply use everyone's favorite command in the UNIX operating system: od (octal dump). In UNIX,¹ then, we issue the command

od -c forestfire.gif | head -2

and we see the first 32 bytes interpreted as characters:

G   I   F   8   7   a \208  \2 \188  \1 \247  \0  \0  \6
 \3  \5 \174 \132 \24   |   )  \7 \198 \195   \ \128   U \27 \196 \166   &   T
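The same header fields can also be decoded programmatically. A sketch follows, assuming the screen-descriptor layout of Figure 3.13 (the function name is ours); the byte values are the ones shown in the dump:

```python
def parse_gif_screen_descriptor(data):
    """Decode the 7-byte GIF screen descriptor that follows the 6-byte signature.

    data: bytes of the file (at least the first 13 bytes).
    """
    assert data[:6] == b"GIF87a", "not a GIF87a file"
    width = data[6] | (data[7] << 8)      # LSB first
    height = data[8] | (data[9] << 8)
    flags = data[10]
    return {
        "width": width,
        "height": height,
        "global_color_map": bool(flags & 0x80),        # the 'm' bit
        "color_resolution": ((flags >> 4) & 0x7) + 1,  # 'cr' + 1 bits
        "bits_per_pixel": (flags & 0x7) + 1,           # 'pixel' + 1 bits
        "background_index": data[11],
    }

# First 13 bytes of forestfire.gif, as shown in the dumps:
header = bytes([0x47, 0x49, 0x46, 0x38, 0x37, 0x61,
                0xD0, 0x02, 0xBC, 0x01, 0xF7, 0x00, 0x00])
print(parse_gif_screen_descriptor(header))
# {'width': 720, 'height': 444, 'global_color_map': True,
#  'color_resolution': 8, 'bits_per_pixel': 8, 'background_index': 0}
```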
FIGURE 3.16: GIF four-pass interlace display row order. (Pass 1 displays rows 0, 8, 16, and so on; pass 2 displays rows 4, 12, and so on; pass 3 displays rows 2, 6, 10, and so on; pass 4 displays the remaining odd-numbered rows.)
To decipher the remainder of the file header (after GIF87a), we use hexadecimal:

od -x forestfire.gif | head -2
with the result

4749 4638 3761 d002 bc01 f700 0006 0305
ae84 187c 2907 c6c3 5c80 551b c4a6 2654

The d002 bc01 following the Signature are Screen Width and Height; these are given in least-significant-byte-first order, so for this file in decimal the Screen Width is 0 + 13 x 16 + 2 x 16² = 720, and Screen Height is 11 x 16 + 12 + 1 x 16² = 444. Then the f7 (which is 247 in decimal) is the fifth byte in the Screen Descriptor, followed by the background color index, 00, and the 00 delimiter. The set of flags, f7, in bits, reads 1, 111, 0, 111 — in other words: global color map is used, 8-bit color resolution, a 0 separator bit, and 8-bit pixel data.

¹Solaris version; older versions use slightly different syntax.

3.2.2 JPEG

The most important current standard for image compression is JPEG [6]. This standard was created by a working group of the International Organization for Standardization (ISO) that was informally called the Joint Photographic Experts Group and is therefore so named. We shall study JPEG in a good deal more detail in Chapter 9, but a few salient features of this compression standard can be mentioned here.

The human vision system has some specific limitations, which JPEG takes advantage of to achieve high rates of compression. The eye-brain system cannot see extremely fine detail. If many changes occur within a few pixels, we refer to that image segment as having high spatial frequency — that is, a great deal of change in (x, y) space. This limitation is even more conspicuous for color vision than for grayscale (black and white). Therefore, color information in JPEG is decimated (partially dropped, or averaged) and then small blocks of an image are represented in the spatial frequency domain (u, v), rather than in (x, y). That is, the speed of changes in x and y is evaluated, from low to high, and a new "image" is formed by grouping the coefficients or weights of these speeds.
Weights that correspond to slow changes are then favored, using a simple trick: values are divided by some large integer and truncated. In this way, small values are zeroed out. Then a scheme for representing long runs of zeros efficiently is applied, and voila! — the image is greatly compressed.

Since we effectively throw away a lot of information by the division and truncation step, this compression scheme is "lossy" (although a lossless mode exists). What's more, since it is straightforward to allow the user to choose how large a denominator to use and hence how much information to discard, JPEG allows the user to set a desired level of quality, or compression ratio (input divided by output). As an example, Figure 3.17 shows our forestfire image with a quality factor Q = 10%. (The usual default quality factor is Q = 75%.)

FIGURE 3.17: JPEG image with low quality specified by user. (This figure also appears in the color insert section.)

This image is a mere 1.5% of the original size. In comparison, a JPEG image with Q = 75% yields an image size 5.6% of the original, whereas a GIF version of this image compresses down to 23.0% of the uncompressed image size.

3.2.3 PNG

One interesting development stemming from the popularity of the Internet is efforts toward more system-independent image formats. One such format is Portable Network Graphics (PNG). This standard is meant to supersede the GIF standard and extends it in important ways. The motivation for a new standard was in part the patent held by UNISYS and Compuserve on the LZW compression method. (Interestingly, the patent covers only compression, not decompression: this is why the UNIX gunzip utility can decompress LZW-compressed files.)

Special features of PNG files include support for up to 48 bits of color information — a large increase. Files may also contain gamma-correction information (see Section 4.1.6) for correct display of color images and alpha-channel information for such uses as control of transparency. Instead of a progressive display based on widely separated rows, as in GIF images, the display progressively displays pixels in a two-dimensional fashion a few at a time over seven passes through each 8 x 8 block of an image.

3.2.4 TIFF

Tagged Image File Format (TIFF) is another popular image file format. Developed by the Aldus Corporation in the 1980s, it was later supported by Microsoft. Its support for attachment of additional information (referred to as "tags") provides a great deal of flexibility. The most important tag is a format signifier: what type of compression etc. is in use in the stored image. For example, TIFF can store many different types of images: 1-bit, grayscale, 8-bit, 24-bit RGB, and so on. TIFF was originally a lossless format, but a new JPEG tag allows you to opt for JPEG compression. Since TIFF is not as user-controllable as JPEG, it does not provide any major advantages over the latter.

3.2.5 EXIF

Exchangeable Image File (EXIF) is an image format for digital cameras. Initially developed in 1995, its current version (2.2) was published in 2002 by the Japan Electronics and Information Technology Industries Association (JEITA). Compressed EXIF files use the baseline JPEG format. A variety of tags (many more than in TIFF) is available to facilitate higher-quality printing, since information about the camera and picture-taking conditions (flash, exposure, light source, white balance, type of scene) can be stored and used by printers for possible color-correction algorithms. The EXIF standard also includes specification of file format for audio that accompanies digital images. It also supports tags for information needed for conversion to FlashPix (initially developed by Kodak).
3.2.6 Graphics Animation Files

A few dominant formats are aimed at storing graphics animations (i.e., series of drawings or graphic illustrations) as opposed to video (i.e., series of images). The difference is that animations are considerably less demanding of resources than video files. However, animation file formats can be used to store video information and indeed are sometimes used for such.

FLC is an important animation or moving picture file format; it was originally created by Animation Pro. Another format, FLI, is similar to FLC.

GL produces somewhat better quality moving pictures. GL animations can also usually handle larger file sizes. Many older formats are used for animation, such as DL and Amiga IFF, as well as alternates such as Apple Quicktime. And, of course, there are also animated GIF89 files.
3.2.7 PS and PDF

PostScript is an important language for typesetting, and many high-end printers have a PostScript interpreter built into them. PostScript is a vector-based, rather than pixel-based, picture language: page elements are essentially defined in terms of vectors. With fonts defined this way, PostScript includes text as well as vector/structured graphics; bit-mapped images can also be included in output files. Encapsulated PostScript files add some information for including PostScript files in another document. Several popular graphics programs, such as Illustrator and FreeHand, use PostScript. However, the PostScript page description language itself does not provide compression; in fact, PostScript files are just stored as ASCII. Therefore files are often large, and in academic settings, it is common for such files to be made available only after compression by some UNIX utility, such as compress or gzip.

Therefore, another text + figures language has begun to supersede PostScript: Adobe Systems Inc. includes LZW (see Chapter 7) compression in its Portable Document Format (PDF) file format. As a consequence, PDF files that do not include images have about the same compression ratio, 2:1 or 3:1, as do files compressed with other LZW-based compression tools, such as UNIX compress or gzip, or PC-based winzip (a variety of pkzip). For files containing images, PDF may achieve higher compression ratios by using separate JPEG compression for the image content (depending on the tools used to create original and compressed versions). The Adobe Acrobat PDF reader can also be configured to read documents structured as linked elements, with clickable content and handy summary tree-structured link diagrams provided.
3.2.8 Windows WMF

Windows MetaFile (WMF) is the native vector file format for the Microsoft Windows operating environment. WMF files actually consist of a collection of Graphics Device Interface (GDI) function calls, also native to the Windows environment. When a WMF file is "played" (typically using the Windows PlayMetaFile() function) the described graphic is rendered. WMF files are ostensibly device-independent and unlimited in size.
3.2.9 Windows BMP

BitMap (BMP) is the major system standard graphics file format for Microsoft Windows, used in Microsoft Paint and other programs. It makes use of run-length encoding compression (see Chapter 7) and can fairly efficiently store 24-bit bitmap images. Note, however, that BMP has many different modes, including uncompressed 24-bit images.
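Run-length encoding itself is simple to sketch. The following is an illustrative byte-level RLE of our own devising, not the exact BMP RLE8 wire format (which adds escape codes for absolute runs, end-of-line, and end-of-bitmap):

```python
def rle_encode(data):
    """Encode bytes as (count, value) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(encoded):
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        out += bytes([encoded[i + 1]]) * encoded[i]
    return bytes(out)

row = b"\x00" * 10 + b"\xff" * 3          # a scanline with two long runs
assert rle_decode(rle_encode(row)) == row
print(rle_encode(row))                    # b'\n\x00\x03\xff'
```

The 13-byte scanline compresses to 4 bytes here, which is why RLE works well on images with large flat regions and poorly on noisy ones.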
Macintosh PAINT and PiO’ PAÍNT was originally used in the MacPaint program, iniLially only for 1 -bit monochrome images. pl~ is used in MacDraw (a vecLor-based drawing program) for storing strucwred graphics.
3.2.11 X Windows PPM

This is the graphics format for the X Windows System. Portable PixMap (PPM) supports 24-bit color bitmaps and can be manipulated using many public domain graphic editors, such as xv. It is used in the X Windows System for storing icons, pixmaps, backdrops, and so on.

3.3 FURTHER EXPLORATION

Foley et al. [2] provide an excellent introduction to computer graphics. For a good discussion on issues involving image processing, see Gonzalez and Woods [7].

More information, including a complete up-to-date list of current file formats, can be viewed on the textbook web site, in Chapter 3 of the Further Exploration directory. Other links include

• GIF87 and GIF89 details. Although these file formats are not so interesting in themselves, they have the virtue of being simple and are a useful introduction to how such bitstreams are set out.
• A popular shareware program for developing GIF animations
• JPEG considered in detail
• PNG details
• The PDF file format
• The ubiquitous BMP file format

In terms of actual input/output of such file formats, code for simple 24-bit BMP file reading and manipulation is given on the web site.
3.4 EXERCISES

1. Briefly explain why we need to be able to have less than 24-bit color and why this makes for a problem. Generally, what do we need to do to adaptively transform 24-bit color values to 8-bit ones?

2. Suppose we decide to quantize an 8-bit grayscale image down to just 2 bits of accuracy. What is the simplest way to do so? What ranges of byte values in the original image are mapped to what quantized values?

3. Suppose we have a 5-bit grayscale image. What size of ordered dither matrix do we need to display the image on a 1-bit printer?

4. Suppose we have available 24 bits per pixel for a color image. However, we notice that humans are more sensitive to R and G than to B — in fact, 1.5 times more sensitive to R or G than to B. How could we best make use of the bits available?
5. At your job, you have decided to impress the boss by using up more disk space for the company's grayscale images. Instead of using 8 bits per pixel, you'd like to use 48 bits per pixel in RGB. How could you store the original grayscale images so that in the new format they would appear the same as they used to, visually?

6. Sometimes bitplanes of an image are characterized using an analogy from mapmaking called "elevations". Figure 3.18 shows some elevations. Suppose we describe an 8-bit image using 8 bitplanes. Briefly discuss how you could view each bitplane in terms of geographical concepts.
3.5 REFERENCES

1. J. Miano, Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP, Reading, MA: Addison-Wesley, 1999.
2. J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice in C, 2nd ed., Reading, MA: Addison-Wesley, 1996.
3. M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision, Boston: PWS Publishing, 1999.
4. L. G. Shapiro and G. C. Stockman, Computer Vision, Upper Saddle River, NJ: Prentice Hall, 2001.
5. P. Heckbert, "Color Image Quantization for Frame Buffer Display," in SIGGRAPH Proceedings, vol. 16, pp. 297-307, 1982.
6. W. B. Pennebaker and J. L. Mitchell, The JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
7. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.
FIGURE 3.18: Elevations in geography (contour values range from 0 to 120).
7. For the color LUT problem, try out the median-cut algorithm on a sample image. Explain briefly why it is that this algorithm, carried out on an image of red apples, puts more color gradation in the resulting 24-bit color image where it is needed, among the reds.

8. In regard to nonordered dithering, a standard graphics text [2] states, "Even larger patterns can be used, but the spatial versus intensity resolution trade-off is limited by our visual acuity (about one minute of arc in normal lighting)."

(a) What does this sentence mean?

(b) If we hold a piece of paper out at a distance of 1 foot, what is the approximate linear distance between dots? (Information: One minute of arc is 1/60 of one degree of angle. Arc length on a circle equals angle (in radians) times radius.) Could we see the gap between dots on a 300 dpi printer?

(c) Write down an algorithm (pseudocode) for calculating a color histogram for RGB data.
CHAPTER 4
Color in Image and Video

Color images and videos are ubiquitous on the web and in multimedia productions. Increasingly, we are becoming aware of the discrepancies between color as seen by people and the sometimes very different color displayed on our screens. The latest version of the HTML standard attempts to address this issue by specifying color in terms of a standard, "sRGB", arrived at by color scientists.

To become aware of the simple yet strangely involved world of color, in this chapter we shall consider the following topics:

• Color science
• Color models in images
FIGURE 4.1: Sir Isaac Newton's experiments. By permission of the Warden and Fellows, New College, Oxford.
• Color models in video
4.1 COLOR SCIENCE

4.1.1 Light and Spectra
Recall from high school that light is an electromagnetic wave and that its color is characterized by the wavelength of the wave. Laser light consists of a single wavelength; for example, a ruby laser produces a bright, scarlet beam. So if we were to plot the light intensity versus wavelength, we would see a spike at the appropriate red wavelength and no other contribution to the light. In contrast, most light sources produce contributions over many wavelengths. Humans cannot detect all light, just contributions that fall in the visible wavelengths. Short wavelengths produce a blue sensation, and long wavelengths produce a red one.
We measure visible light using a device called a spectrophotometer, by reflecting light from a diffraction grating (a ruled surface) that spreads out the different wavelengths, much as a prism does. Figure 4.1 shows the phenomenon that white light contains all the colors of a rainbow. If you have ever looked through a prism, you will have noticed that it generates a rainbow effect, due to a natural phenomenon called dispersion. You see a similar effect on the surface of a soap bubble.
Visible light is an electromagnetic wave in the range 400-700 nm (where nm stands for nanometer, or 10^-9 meter). Figure 4.2 shows the relative power in each wavelength interval for typical outdoor light on a sunny day. This type of curve, called a spectral power distribution (SPD), or spectrum, shows the relative amount of light energy (electromagnetic signal) at each wavelength. The symbol for wavelength is λ, so this type of curve might be called E(λ).
In practice, measurements are used that effectively sum up voltage in a small wavelength range, say 5 or 10 nanometers, so such plots usually consist of segments joining function values every 10 nanometers. This means also that such profiles are actually stored as vectors. Below, however, we show equations that treat E(λ) as a continuous function, although in reality, integrals are calculated using sums.
FIGURE 4.2: Spectral power distribution of daylight.
4.1.2 Human Vision
The eye works like a camera, with the lens focusing an image onto the retina (upside-down and left-right reversed). The retina consists of an array of rods and three kinds of cones, so named because of their shape. The rods come into play when light levels are low and produce an image in shades of gray ("At night, all cats are gray!"). For higher light levels, the cones each produce a signal. Because of their differing pigments, the three kinds of cones are most sensitive to red (R), green (G), and blue (B) light. Higher light levels result in more neurons firing, but just what happens in the brain further down the pipeline is the subject of much debate. However, it seems likely that the brain makes use of the differences R - G, G - B, and B - R, as well as combining all of R, G, and B into a high-light-level achromatic channel (and thus we can say that the brain is good at algebra).
4.1.3 Spectral Sensitivity of the Eye
The eye is most sensitive to light in the middle of the visible spectrum. Like the SPD profile of a light source, as in Figure 4.2, for receptors we show the relative sensitivity as a function of wavelength. The blue receptor sensitivity is not shown to scale, because it is much smaller than the curves for red or green. Blue is a late addition in evolution (and, statistically, is the favorite color of humans, regardless of nationality; perhaps because blue is a bit surprising!). Figure 4.3 shows the overall sensitivity as a dashed line, called the luminous-efficiency function. It is usually denoted V(λ) and is the sum of the response curves to red, green, and blue [1, 2]. The rods are sensitive to a broad range of wavelengths, but produce a signal that generates the perception of the black-white scale only. The rod sensitivity curve looks like the luminous-efficiency function V(λ) but is shifted somewhat to the red end of the spectrum [1].
The eye has about 6 million cones, but the proportions of R, G, and B cones are different. They likely are present in the ratios 40:20:1 (see [3] for a complete explanation). So the achromatic channel produced by the cones is thus something like 2R + G + B/20.
These spectral sensitivity functions are usually denoted by some other letters than R, G, and B, so here let us denote them by the vector function q(λ), with components

    q(λ) = ( q_R(λ), q_G(λ), q_B(λ) )^T      (4.1)

That is, there are three sensors (a vector index k = 1..3 therefore applies), and each is a function of wavelength. The response in each color channel in the eye is proportional to the number of neurons firing. For the red channel, any light falling anywhere in the nonzero part of the red cone function in Figure 4.3 will generate some response. So the total response of the red channel is the sum over all the light falling on the retina to which the red cone is sensitive, weighted by the sensitivity at that wavelength. Again thinking of these sensitivities as continuous functions, we can succinctly write down this idea in the form of an integral:

    R = ∫ E(λ) q_R(λ) dλ
    G = ∫ E(λ) q_G(λ) dλ      (4.2)
    B = ∫ E(λ) q_B(λ) dλ
Since the signal transmitted consists of three numbers, colors form a three-dimensional vector space.

4.1.4 Image Formation
Equation (4.2) above actually applies only when we view a self-luminous object (i.e., a light). In most situations, we image light reflected from a surface. Surfaces reflect different amounts of light at different wavelengths, and dark surfaces reflect less energy than light surfaces. Figure 4.4 shows the surface spectral reflectance from orange sneakers and faded blue jeans [4]. The reflectance function is denoted S(λ).
The image formation situation is thus as follows: light from the illuminant with SPD E(λ) impinges on a surface, with surface spectral reflectance function S(λ), is reflected, and is then filtered by the eye's cone functions q(λ). The basic arrangement is as shown in Figure 4.5. The function C(λ) is called the color signal and is the product of the illuminant E(λ) and the reflectance S(λ): C(λ) = E(λ) S(λ). The equations similar to Eqs. (4.2) that take into account the image formation model are
    R = ∫ E(λ) S(λ) q_R(λ) dλ
    G = ∫ E(λ) S(λ) q_G(λ) dλ      (4.3)
    B = ∫ E(λ) S(λ) q_B(λ) dλ

FIGURE 4.3: Cone sensitivities: R, G, and B cones, and luminous-efficiency curve V(λ).
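As noted in Section 4.1.1, SPDs and sensitivities are stored as sampled vectors, so the integrals of Eqs. (4.2) and (4.3) are computed as sums in practice. A small sketch of that idea; all the curve values below are made-up placeholders, not real illuminant or cone data:

```python
# Wavelength samples, e.g. every 10 nm across part of the visible range
lam = [500, 510, 520, 530, 540]
E   = [1.0, 1.0, 1.0, 1.0, 1.0]      # illuminant SPD E(lambda) (placeholder)
S   = [0.5, 0.5, 0.5, 0.5, 0.5]      # surface reflectance S(lambda) (placeholder)
q_R = [0.1, 0.2, 0.3, 0.4, 0.5]      # "red" cone sensitivity (placeholder)

d_lam = 10  # sampling interval in nm

# Eq. (4.3) as a sum: R = sum over lambda of E * S * q_R * d_lambda
R = sum(e * s * q * d_lam for e, s, q in zip(E, S, q_R))
```

Setting S(λ) = 1 everywhere recovers the self-luminous case of Eq. (4.2).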
FIGURE 4.4: Surface spectral reflectance functions S(λ) for two objects.

4.1.5 Camera Systems
Now, we humans develop camera systems in a similar fashion. A good camera has three signals produced at each pixel location (corresponding to a retinal position). Analog signals are converted to digital, truncated to integers, and stored. If the precision used is 8-bit, the maximum value for any of R, G, B is 255, and the minimum is 0.
However, the light entering the eye of the computer user is what the screen emits: the screen is essentially a self-luminous source. Therefore, we need to know the light E(λ) entering the eye.

4.1.6 Gamma Correction
The RGB numbers in an image file are converted back to analog and drive the electron guns in the cathode ray tube (CRT). Electrons are emitted proportional to the driving voltage, and we would like to have the CRT system produce light linearly related to the voltage. Unfortunately, it turns out that this is not the case. The light emitted is actually roughly proportional to the voltage raised to a power; this power is called "gamma", with symbol γ.
Thus, if the file value in the red channel is R, the screen emits light proportional to R^γ, with SPD equal to that of the red phosphor paint on the screen that is the target of the red-channel electron gun. The value of gamma is around 2.2.
Since the mechanics of a television receiver are the same as those for a computer CRT, TV systems precorrect for this situation by applying the inverse transformation before transmitting TV voltage signals. It is customary to append a prime to signals that are "gamma-corrected" by raising to the power (1/γ) before transmission. Thus we have

    R → R' = R^(1/γ)      (4.4)

and we arrive at "linear signals". Voltage is often normalized to maximum 1, and it is interesting to see what effect these gamma transformations have on signals. Figure 4.6(a) shows the light output with no gamma correction applied. We see that darker values are displayed too dark. This is also shown in Figure 4.7(a), which displays a linear ramp from left to right.
Figure 4.6(b) shows the effect of precorrecting signals by applying the power law R^(1/γ), where it is customary to normalize voltage to the range 0 to 1. We see that applying first the correction in Figure 4.6(b), followed by the effect of the CRT system in Figure 4.6(a),
FIGURE 4.5: Image formation model.
FIGURE 4.6: Effect of gamma correction: (a) no gamma correction, the effect of the CRT on light emitted from the screen (voltage is normalized to range 0..1); (b) gamma correction of signal.
FIGURE 4.7: Effect of gamma correction: (a) display of ramp from 0 to 255, with no gamma correction; (b) image with gamma correction applied.

would result in linear signals. Figure 4.7(b) shows the combined effect. Here, a ramp is shown in 16 steps, from gray level 0 to gray level 255.
A more careful definition of gamma recognizes that a simple power law would result in an infinite derivative at zero voltage, which makes a circuit to accomplish gamma correction difficult to devise in analog. In practice a more general transform, such as R → R' = a × R^(1/γ) + b, is used, along with special care at the origin:

    V_out = 4.5 × V_in,                   V_in < 0.018
    V_out = 1.099 × (V_in)^0.45 - 0.099,  V_in ≥ 0.018      (4.5)
This is called a camera transfer function, and the above law is recommended by the Society of Motion Picture and Television Engineers (SMPTE) as standard SMPTE-170M.
Why a gamma of 2.2? In fact, this value does not produce a final power law of 1.0. The history of this number is buried in decisions of the National Television System Committee of the U.S.A. (NTSC) when TV was invented. The power law for color receivers may in actuality be closer to 2.8. However, if we compensate for only about 2.2 of this power law, we arrive at an overall value of about 1.25 instead of 1.0. The idea was that in viewing conditions with a dim surround, such an overall gamma produces more pleasing images, albeit with color errors: darker colors are made even darker, and also the eye-brain system changes the relative contrast of light and dark colors [5].
With the advent of CRT-based computer systems, the situation has become even more interesting. The camera may or may not have inserted gamma correction; software may write the image file using some gamma; software may decode expecting some (other) gamma; the image is stored in a frame buffer, and it is common to provide a lookup table for gamma in the frame buffer. After all, if we generate images using computer graphics, no gamma is applied, but a gamma is still necessary to precompensate for the display.
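Eq. (4.5) is simple to state in code. A sketch, assuming the conventional 0.45 exponent used by SMPTE-170M for the power-law segment:

```python
def camera_transfer(v_in):
    """SMPTE-170M camera transfer function, Eq. (4.5).

    Linear below 0.018 to avoid the infinite derivative of a pure
    power law at zero; a 0.45 power law (roughly 1/2.2) above.
    """
    if v_in < 0.018:
        return 4.5 * v_in
    return 1.099 * v_in ** 0.45 - 0.099

# The two pieces meet almost exactly at the breakpoint 0.018
lo = camera_transfer(0.017999)
hi = camera_transfer(0.018001)
```

Note that the constants are chosen so that an input of 1.0 maps to an output of 1.0, keeping the normalized voltage range intact.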
It makes sense, then, to define an overall "system" gamma that takes into account all such transformations. Unfortunately, we must often simply guess at the overall gamma. Adobe Photoshop allows us to try different gamma values. For WWW publishing, it is important to know that a Macintosh does gamma correction in its graphics card, with a gamma of 1.8. SGI machines expect a gamma of 1.4, and most PCs or Suns do no extra gamma correction and likely have a display gamma of about 2.5. Therefore, for the most common machines, it might make sense to gamma-correct images at the average of Macintosh and PC values, or about 2.1. However, most practitioners might use a value of 2.4, adopted by the sRGB group. A new "standard" RGB for WWW applications called sRGB, to be included in all future HTML standards, defines a standard modeling of typical light levels and monitor conditions and is (more or less) a "device-independent color space for the Internet".
An issue related to gamma correction is the decision of just what intensity levels will be represented by what bit patterns in the pixel values in a file. The eye is most sensitive to ratios of intensity levels rather than absolute intensities. This means that the brighter the light, the greater must be the change in light level for the change to be perceived. If we had precise control over what bits represented what intensities, it would make sense to code intensities logarithmically for maximum usage of the bits available. Then we could include that coding in an inverse of the (1/γ) power law transform, as in Equation (4.4), or perhaps a lookup table implementation of such an inverse function (see [6], p. 564). However, it is most likely that images or videos we encounter have no nonlinear encoding of bit levels but have indeed been produced by a camcorder or are for broadcast TV. These images will have been gamma corrected according to Equation (4.4).
The CIE-sponsored CIELAB perceptually based color-difference metric discussed in Section 4.1.14 provides a careful algorithm for including the nonlinear aspect of human brightness perception.

4.1.7 Color-Matching Functions
Practically speaking, many color applications involve specifying and re-creating a particular desired color. Suppose you wish to duplicate a particular shade on the screen, or a particular shade of dyed cloth. Over many years, even before the eye-sensitivity curves of Figure 4.3 were known, a technique evolved in psychology for matching a combination of basic R, G, and B lights to a given shade. A particular set of three basic lights was available, called the set of color primaries. To match a given shade, a set of observers was asked to separately adjust the brightness of the three primaries using a set of controls, until the resulting spot of light most closely matched the desired color. Figure 4.8 shows the basic situation. A device for carrying out such an experiment is called a colorimeter.
The international standards body for color, the Commission Internationale de L'Eclairage (CIE), pooled all such data in 1931, in a set of curves called the color-matching functions. They used color primaries with peaks at 440, 545, and 580 nanometers. Suppose, instead of a swatch of cloth, you were interested in matching a given wavelength of laser (i.e., monochromatic) light. Then the color-matching experiments are summarized by a statement of the proportion of the color primaries needed for each individual narrow-band wavelength of light. General lights are then matched by a linear combination of single-wavelength results.
FIGURE 4.10: CIE standard color-matching functions x̄(λ), ȳ(λ), z̄(λ).

FIGURE 4.8: Colorimeter experiment.
Why are some parts of the curves negative? This indicates that some colors cannot be reproduced by a linear combination of the primaries. For such colors, one or more of the primary lights has to be shifted from one side of the black partition in Figure 4.8 to the other, so they illuminate the sample to be matched instead of the white screen. Thus, in a sense, such samples are being matched by negative lights.
Figure 4.9 shows the CIE color-matching curves, denoted r̄(λ), ḡ(λ), b̄(λ). In fact, these curves are a linear matrix-multiplication away from the eye sensitivities in Figure 4.3.

4.1.8 CIE Chromaticity Diagram
In times long past, engineers found it upsetting that one CIE color-matching curve in Figure 4.9 has a negative lobe. Therefore, a set of fictitious primaries was devised that led to color-matching functions with only positive values. Figure 4.10 shows the resulting curves; these are usually referred to as the color-matching functions. They result from a linear (3 × 3 matrix) transform from the r̄, ḡ, b̄ curves, and are denoted x̄(λ), ȳ(λ), z̄(λ). The matrix is chosen such that the middle standard color-matching function ȳ(λ) exactly equals the luminous-efficiency curve V(λ) shown in Figure 4.3.
For a general SPD E(λ), the essential "colorimetric" information required to characterize a color is the set of tristimulus values X, Y, Z, defined in analogy to Equation (4.2) as
FIGURE 4.9: CIE color-matching functions r̄(λ), ḡ(λ), b̄(λ).
    X = ∫ E(λ) x̄(λ) dλ
    Y = ∫ E(λ) ȳ(λ) dλ      (4.6)
    Z = ∫ E(λ) z̄(λ) dλ
The middle value, Y, is called the luminance. All color information and transforms are tied to these special values, which incorporate substantial information about the human visual system. However, 3D data is difficult to visualize, and consequently, the CIE devised a 2D diagram based on the values of (X, Y, Z) triples implied by the curves in Figure 4.10. For each wavelength in the visible, the values of X, Y, Z given by the three curve values form the limits of what humans can see. However, from Equation (4.6) we observe that increasing the brightness of illumination (turning up the light bulb's wattage) increases the tristimulus values by a scalar multiple. Therefore, it makes sense to devise a 2D diagram by somehow factoring out the magnitude of vectors (X, Y, Z). In the CIE system, this is accomplished by dividing by the sum X + Y + Z:
    x = X/(X + Y + Z)
    y = Y/(X + Y + Z)      (4.7)
    z = Z/(X + Y + Z)
This effectively means that one value out of the set (x, y, z) is redundant, since we have

    x + y + z ≡ 1      (4.8)
so that

    z = 1 - x - y      (4.9)

FIGURE 4.11: CIE chromaticity diagram.
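Eqs. (4.7)-(4.9) in code form; a small sketch (the function name is ours):

```python
def chromaticity(X, Y, Z):
    """Project a tristimulus vector onto (x, y) via Eq. (4.7)."""
    s = X + Y + Z
    x, y = X / s, Y / s
    z = 1.0 - x - y          # Eq. (4.9): z is redundant
    return x, y, z

# Equi-energy white: X = Y = Z gives chromaticity (1/3, 1/3)
x, y, z = chromaticity(1.0, 1.0, 1.0)

# Scaling the light (a brighter bulb) leaves chromaticity unchanged
x2, y2, _ = chromaticity(5.0, 5.0, 5.0)
```

The second call illustrates why the diagram is 2D: multiplying (X, Y, Z) by a scalar cancels in the division.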
Values x, y are called chromaticities. Effectively, we are projecting each tristimulus vector (X, Y, Z) onto the plane connecting points (1, 0, 0), (0, 1, 0), and (0, 0, 1). Usually, this plane is viewed projected onto the z = 0 plane, as a set of points inside the triangle with vertices having (x, y) values (0, 0), (1, 0), and (0, 1).
Figure 4.11 shows the locus of points for monochromatic light, drawn on this CIE "chromaticity diagram". The straight line along the bottom of the "horseshoe" joins points at the extremities of the visible spectrum, 400 and 700 nanometers (from blue through green to red). That straight line is called the line of purples. The horseshoe itself is called the spectrum locus and shows the (x, y) chromaticity values of monochromatic light at each of the visible wavelengths.
The color-matching curves are devised so as to add up to the same value [the area under each curve is the same for each of x̄(λ), ȳ(λ), z̄(λ)]. Therefore for a white illuminant with all SPD values equal to 1 (an "equi-energy white light"), the chromaticity values are (1/3, 1/3). Figure 4.11 displays this white point in the middle of the diagram. Finally, since we must have x, y ≤ 1 and x + y ≤ 1, all possible chromaticity values must necessarily lie below the dashed diagonal line in Figure 4.11.
Note that one may choose different "white" spectra as the standard illuminant. The CIE defines several of these, such as illuminant A, illuminant C, and standard daylights D65 and D100. Each of these will display as a somewhat different white spot on the CIE diagram: D65 has a chromaticity equal to (0.312713, 0.329016), and illuminant C has chromaticity
(0.310063, 0.316158). Figure 4.12 displays the SPD curves for each of these standard lights. Illuminant A is characteristic of incandescent lighting, with an SPD typical of a tungsten bulb, and is quite red. Illuminant C is an early attempt to characterize daylight, while D65 and D100 are respectively a midrange and a bluish commonly used daylight. Figure 4.12 also shows the much more spiky SPD for a standard fluorescent illumination, called F2 [2].
Colors with chromaticities on the spectrum locus represent "pure" colors. These are the most "saturated": think of paper becoming more and more saturated with ink. In contrast, colors closer to the white point are more unsaturated.
The chromaticity diagram has the nice property that, for a mixture of two lights, the resulting chromaticity lies on the straight line joining the chromaticities of the two lights. Here we are being slightly cagey in not saying that this is the case for colors in general, just for "lights". The reason is that so far we have been adhering to an additive model of color mixing. This model holds good for lights or, as a special case, for monitor colors. However, as we shall see below, it does not hold for printer colors (see p. 102).
For any chromaticity on the CIE diagram, the "dominant wavelength" is the position on the spectrum locus intersected by a line joining the white point to the given color and extended through it. (For colors that give an intersection on the line of purples, a complementary dominant wavelength is defined by extending the line backward through the white point.) Another useful definition is the set of complementary colors for some given color, which is given by all the colors on the line through the white spot. Finally, the excitation purity is the
TABLE 4.1: Chromaticities and white points for monitor specifications.
(Red: x_r, y_r; Green: x_g, y_g; Blue: x_b, y_b; White Point: x_W, y_W)

    System    x_r     y_r     x_g     y_g     x_b     y_b     x_W     y_W
    NTSC      0.67    0.33    0.21    0.71    0.14    0.08    0.3101  0.3162
    SMPTE     0.630   0.340   0.310   0.595   0.155   0.070   0.3127  0.3291
    EBU       0.64    0.33    0.29    0.60    0.15    0.06    0.3127  0.3291
ratio of distances from the white spot to the given color and to the dominant wavelength, expressed as a percentage.

FIGURE 4.12: Standard illuminant SPDs.

4.1.9 Color Monitor Specifications
Color monitors are specified in part by the white point chromaticity desired if the RGB electron guns are all activated at their highest power. Actually, we are likely using gamma-corrected values R', G', B'. If we normalize voltage to the range 0 to 1, then we wish to specify a monitor such that it displays the desired white point when R' = G' = B' = 1 (abbreviating the transform from file value to voltage by simply stating the pixel color values, normalized to maximum 1).
However, the phosphorescent paints used on the inside of the monitor screen have their own chromaticities, so at first glance it would appear that we cannot independently control the monitor white point. This is remedied by setting the gain control for each electron gun such that at maximum voltages the desired white appears.
Several monitor specifications are in current use. Monitor specifications consist of the fixed, manufacturer-specified chromaticities for the monitor phosphors, along with the standard white point needed. Table 4.1 shows these values for three common specification statements. NTSC is the standard North American and Japanese specification. SMPTE is a more modern version of this, wherein the standard illuminant is changed from illuminant C to illuminant D65 and the phosphor chromaticities are more in line with modern machines. Digital video specifications use a similar specification in North America. The EBU system derives from the European Broadcasting Union and is used in PAL and SECAM video systems.

4.1.10 Out-of-Gamut Colors
For the moment, let's not worry about gamma correction. Then the really basic problem for displaying color is how to generate device-independent color, by agreement taken to be specified by (x, y) chromaticity values, using device-dependent color values RGB.
For any (x, y) pair we wish to find that RGB triple giving the specified (x, y, z): therefore, we form the z values for the phosphors via z = 1 - x - y and solve for RGB from the manufacturer-specified chromaticities. Since, if we had no green or blue value (i.e., file values of zero) we would simply see the red-phosphor chromaticities, we combine nonzero values of R, G, and B via
    [x]   [x_r  x_g  x_b] [R]
    [y] = [y_r  y_g  y_b] [G]      (4.10)
    [z]   [z_r  z_g  z_b] [B]
If (x, y) is specified instead of derived from the above, we have to invert the matrix of phosphor (x, y, z) values to obtain the correct RGB values to use to obtain the desired chromaticity.
But what if any of the RGB numbers is negative? The problem in this case is that while humans are able to perceive the color, it is not representable on the device being used. We say in that case the color is out of gamut, since the set of all possible displayable colors constitutes the gamut of the device.
One method used to deal with this situation is to simply use the closest in-gamut color available. Another common approach is to select the closest complementary color.
For a monitor, every displayable color is within a triangle. This follows from so-called Grassman's Law, describing human vision, stating that "color matching is linear". This means that linear combinations of lights made up of three primaries are just the linear set of weights used to make the combination times those primaries. That is, if we compose colors from a linear combination of the three "lights" available from the three phosphors, we can create colors only from the convex set derived from the lights, in this case a triangle. (We'll see below that for printers, this convexity no longer holds.)
Figure 4.13 shows the triangular gamut for the NTSC system drawn on the CIE diagram. Suppose the small triangle represents a given desired color. Then the in-gamut point on the boundary of the NTSC monitor gamut is taken to be the intersection of (a) the line connecting the desired color to the white point with (b) the nearest line forming the boundary of the gamut triangle.
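The gamut test described above can be sketched numerically: build the phosphor matrix of Eq. (4.10) from the NTSC chromaticities in Table 4.1, invert it, and flag negative components. Cramer's rule is just one convenient way to solve the 3 × 3 system, and the test chromaticities are arbitrary illustrations:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, v):
    """Solve the 3x3 system m @ t = v by Cramer's rule."""
    d = det3(m)
    t = []
    for j in range(3):
        mj = [row[:] for row in m]
        for i in range(3):
            mj[i][j] = v[i]
        t.append(det3(mj) / d)
    return t

# NTSC phosphor chromaticities from Table 4.1; z = 1 - x - y for each
xs = [0.67, 0.21, 0.14]                      # x_r, x_g, x_b
ys = [0.33, 0.71, 0.08]                      # y_r, y_g, y_b
zs = [1 - x - y for x, y in zip(xs, ys)]
M = [xs, ys, zs]                             # Eq. (4.10): (x,y,z)^T = M (R,G,B)^T

def rgb_for_chromaticity(x, y):
    """Invert Eq. (4.10); negative components mean out of gamut."""
    rgb = solve3(M, [x, y, 1 - x - y])
    in_gamut = all(c >= -1e-9 for c in rgb)  # tiny slack for roundoff
    return rgb, in_gamut

# The red phosphor's own chromaticity is (trivially) in gamut ...
rgb_red, ok_red = rgb_for_chromaticity(0.67, 0.33)
# ... while a spectral green far outside the NTSC triangle is not
rgb_spec, ok_spec = rgb_for_chromaticity(0.08, 0.84)
```

Note that this ignores magnitudes and the white point, exactly as the section does; the correction for that follows in Section 4.1.11.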
4.1.11 White-Point Correction
One deficiency in what we have done so far is that we need to be able to map tristimulus values XYZ to device RGBs, and not just deal with chromaticity xyz. The difference is that XYZ values include the magnitude of the color. We also need to be able to alter matters such that when each of R, G, B is at maximum value, we obtain the white point.
But so far, Table 4.1 would produce incorrect values. Consider the SMPTE specifications. Setting R = G = B = 1 results in a value of X that equals the sum of the x values, or 0.630 + 0.310 + 0.155, which is 1.095. Similarly, the Y and Z values come out to 1.005 and 0.9. Dividing by (X + Y + Z) results in a chromaticity of (0.365, 0.335) rather than the desired values of (0.3127, 0.3291).
The method used to correct both deficiencies is to first take the white-point magnitude of Y as unity:

    Y(white point) = 1      (4.11)

Now we need to find a set of three correction factors such that if the gains of the three electron guns are multiplied by these values, we get exactly the white point XYZ value at R = G = B = 1. Suppose the matrix of phosphor chromaticities x_r, x_g, ... in Equation (4.10) is called M. We can express the correction as a diagonal matrix D = diag(d1, d2, d3) such that

    XYZ_white ≡ M D (1, 1, 1)^T      (4.12)

where ( )^T means transpose. For the SMPTE specification, we have (x, y, z) = (0.3127, 0.3291, 0.3582) or, dividing by the middle value, XYZ_white = (0.95045, 1, 1.08892). We note that multiplying D by (1, 1, 1)^T simply picks out the diagonal entries (d1, d2, d3)^T, so Eq. (4.12) becomes

    [X]         [0.630  0.310  0.155] [d1]
    [Y]       = [0.340  0.595  0.070] [d2]      (4.13)
    [Z]_white   [0.030  0.095  0.775] [d3]

Inverting, with the new values XYZ_white specified as above, we arrive at

    (d1, d2, d3) = (0.6247, 1.1783, 1.2364)      (4.14)

FIGURE 4.13: Approximating an out-of-gamut color by an in-gamut one. The out-of-gamut color shown by a triangle is approximated by the intersection of (a) the line from that color to the white point with (b) the boundary of the device color gamut.

4.1.12 XYZ to RGB Transform
Now the 3 × 3 transform matrix from XYZ to RGB is taken to be

    T = M D      (4.15)

even for points other than the white point:

    [X]       [R]
    [Y] = T · [G]      (4.16)
    [Z]       [B]

For the SMPTE specification, we arrive at

    T = [0.3935  0.3653  0.1916]
        [0.2124  0.7011  0.0866]      (4.17)
        [0.0187  0.1119  0.9582]

Written out, this reads

    X = 0.3935 · R + 0.3653 · G + 0.1916 · B
    Y = 0.2124 · R + 0.7011 · G + 0.0866 · B      (4.18)
    Z = 0.0187 · R + 0.1119 · G + 0.9582 · B

4.1.13 Transform with Gamma Correction
The above calculations assume we are dealing with linear signals. However, instead of linear R, G, B, we most likely have nonlinear, gamma-corrected R', G', B'. The best way of carrying out an XYZ-to-RGB transform is to calculate the linear RGB required by inverting Equation (4.16), then create nonlinear signals via gamma correction. Nevertheless, this is not usually done as stated. Instead, the equation for the Y value is used as is but is applied to nonlinear signals. This does not imply much error, in fact, for colors near the white point. The only concession to accuracy is to give the new name Y' to this new Y value created from R', G', B'. The significance of Y' is that it codes a descriptor of brightness for the pixel in question.¹

¹In the Color FAQ file, this new value Y' is called "luma".
The most-used transform equations are those for the original NTSC system, based upon an illuminant C white point, even though these are outdated. Following the procedure outlined above but with the values in Table 4.1, we arrive at the following transform:

    X = 0.607 · R + 0.174 · G + 0.200 · B
    Y = 0.299 · R + 0.587 · G + 0.114 · B      (4.19)
    Z = 0.000 · R + 0.066 · G + 1.116 · B

Thus, coding for nonlinear signals begins with encoding the nonlinear-signal correlate of luminance:

    Y' = 0.299 · R' + 0.587 · G' + 0.114 · B'      (4.20)
(See Section 4.3 below for more discussion on encoding of nonlinear signals.)

FIGURE 4.14: CIELAB model. (This figure also appears in the color insert section.)

4.1.14 L*a*b* (CIELAB) Color Model
The discussion above of how best to make use of the bits available to us touched on the issue of how well human vision sees changes in light levels. This subject is actually an example of Weber's Law, from psychology: the more there is of a quantity, the more change there must be to perceive a difference. For example, it's relatively easy to tell the difference in weight between your 4-year-old sister and your 5-year-old brother when you pick them up. However, it is more difficult to tell the difference in weight between two heavy objects. Another example is that to see a change in a bright light, the difference must be much larger than to see a change in a dim light.
A rule of thumb for this phenomenon states that equally perceived changes must be relative. Changes are about equally perceived if the ratio of the change is the same, whether for dark or bright lights, and so on. After some thought, this idea leads to a logarithmic approximation to perceptually equally spaced units.
For human vision, however, the CIE arrived at a somewhat more involved version of this kind of rule, called the CIELAB space. What is being quantified in this space is, again, differences perceived in color and brightness. This makes sense because, practically speaking, color differences are most useful for comparing source and target colors. You would be interested, for example, in whether a particular batch of dyed cloth has the same color as an original swatch. Figure 4.14 shows a cutaway into a 3D solid of the coordinate space associated with this color difference metric.
CIELAB (also known as L*a*b*) uses a power law of 1/3 instead of a logarithm. CIELAB uses three values: one that corresponds roughly to luminance, and a pair that combine to make colorfulness and hue (variables have an asterisk to differentiate them from previous versions devised by the CIE). The color difference is defined as
ΔE = sqrt( (ΔL*)² + (Δa*)² + (Δb*)² )   (4.21)

where

L* = 116 (Y/Y_n)^(1/3) − 16
a* = 500 [ (X/X_n)^(1/3) − (Y/Y_n)^(1/3) ]
b* = 200 [ (Y/Y_n)^(1/3) − (Z/Z_n)^(1/3) ]   (4.22)
with X_n, Y_n, Z_n the XYZ values of the white point. Auxiliary definitions are

chroma = c* = sqrt( (a*)² + (b*)² )
hue angle = h* = arctan( b* / a* )   (4.23)
Roughly, the maximum and minimum of value a* correspond to red and green, while b* ranges from yellow to blue. The chroma is a scale of colorfulness, with more colorful (more saturated) colors occupying the outside of the CIELAB solid at each L* brightness level, and more washed-out (desaturated) colors nearer the central achromatic axis. The hue angle expresses more or less what most people mean by "the color" — that is, you would describe it as red or orange.
The development of such color-difference models is an active field of research, and there is a plethora of other human-perception-based formulas (the other competitor of the same vintage as CIELAB is called CIELUV — both were devised in 1976). The interest is generated partly because such color metrics impact how we model differences in lighting and
viewing across device and/or network boundaries [7]. Several high-end products, including Adobe Photoshop, use the CIELAB model.
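The conversion above is easy to sketch in code. This is a minimal sketch of Equations (4.21)-(4.23) exactly as given here (the simple cube-root form, without the linear low-ratio branch of the full CIE definition); the white-point values used are illustrative D65-like assumptions.

```python
import math

def xyz_to_lab(X, Y, Z, Xn=95.047, Yn=100.0, Zn=108.883):
    """XYZ to CIELAB per Eq. (4.22), cube-root form only."""
    fx, fy, fz = (X / Xn) ** (1/3), (Y / Yn) ** (1/3), (Z / Zn) ** (1/3)
    L = 116 * fy - 16
    a = 500 * (fx - fy)
    b = 200 * (fy - fz)
    return L, a, b

def delta_E(lab1, lab2):
    """Color difference, Eq. (4.21)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))

def chroma_hue(L, a, b):
    """Auxiliary definitions, Eq. (4.23): chroma and hue angle."""
    return math.hypot(a, b), math.atan2(b, a)

# The white point itself maps to L* = 100, a* = b* = 0
print(xyz_to_lab(95.047, 100.0, 108.883))   # (100.0, 0.0, 0.0)
```

Comparing a dyed-cloth batch against a reference swatch then amounts to one `delta_E` call on the two L*a*b* triples.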
4.1.15 More Color-Coordinate Schemes
There are several other coordinate schemes in use to describe color as humans perceive it, with some confusion in the field as to whether gamma correction should or should not be applied. Here we are describing device-independent color — based on XYZ and correlated to what humans see. However, generally users make free use of RGB or R′, G′, B′.
Other schemes include: CMY (described on p. 101); HSL — Hue, Saturation and Lightness; HSV — Hue, Saturation and Value; HSI — with I for Intensity; HCI — C = Chroma; HVC — V = Value; HSD — D = Darkness; the beat goes on.
4.1.16 Munsell Color Naming System
Accurate naming of colors is also an important consideration. One time-tested standard system was devised by Munsell in the early 1900s and revised many times (the latest version is called the Munsell renotation) [8]. The idea is to set up (yet another) approximately perceptually uniform system of three axes to discuss and specify color. The axes are value (black-white), hue, and chroma. Value is divided into 9 steps, hue is in 40 steps around a circle, and chroma (saturation) has a maximum of 16 levels. The circle's radius varies with value.
The main idea is a fairly invariant specification of color for any user, including artists. The Munsell corporation therefore sells books of all these patches of paint, made up with proprietary paint formulas (the book is quite expensive). It has been asserted that this is the most often used uniform scale.

4.2 COLOR MODELS IN IMAGES
We now have had an introduction to color science and some of the problems that crop up with respect to color for image displays. But how are color models and coordinate systems really used for stored, displayed, and printed images?

4.2.1 RGB Color Model for CRT Displays
According to Chapter 3, we usually store color information directly in RGB form. However, we note from the previous section that such a coordinate system is in fact device-dependent.
We expect to be able to use 8 bits per color channel for color that is accurate enough. In fact, we have to use about 12 bits per channel to avoid an aliasing effect in dark image areas — contour bands that result from gamma correction, since gamma correction results in many fewer available integer levels (see Exercise 7).
For images produced from computer graphics, we store integers proportional to intensity in the frame buffer. Then we should have a gamma correction LUT between the frame buffer and the CRT. If gamma correction is applied to floats before quantizing to integers, before storage in the frame buffer, then we can use only 8 bits per channel and still avoid contouring artifacts.
FIGURE 4.15: RGB and CMY color cubes. In the RGB cube, black is at (0, 0, 0) and white at (1, 1, 1); in the CMY cube, white is at (0, 0, 0) and black at (1, 1, 1). (This figure also appears in the color insert section.)
4.2.2 Subtractive Color: CMY Color Model
So far, we have effectively been dealing only with additive color. Namely, when two light beams impinge on a target, their colors add; when two phosphors on a CRT screen are turned on, their colors add. So, for example, red phosphor + green phosphor makes yellow light.
But for ink deposited on paper, in essence the opposite situation holds: yellow ink subtracts blue from white illumination but reflects red and green; which is why it appears yellow!
So, instead of red, green, and blue primaries, we need primaries that amount to −red, −green, and −blue; we need to subtract R, G, or B. These subtractive color primaries are cyan (C), magenta (M), and yellow (Y) inks.
Figure 4.15 shows how the two systems, RGB and CMY, are connected. In the additive (RGB) system, black is "no light", RGB = (0, 0, 0). In the subtractive CMY system, black arises from subtracting all the light by laying down inks with C = M = Y = 1.
4.2.3 Transformation from RGB to CMY
Given our identification of the role of inks in subtractive systems, the simplest model we can invent to specify what ink density to lay down on paper, to make a certain desired RGB color, is as follows:

[C]   [1]   [R]
[M] = [1] − [G]   (4.24)
[Y]   [1]   [B]
FIGURE 4.16: Additive and subtractive color: (a) RGB is used to specify additive color; (b) CMY is used to specify subtractive color. (This figure also appears in the color insert section.)
Then the inverse transform is

[R]   [1]   [C]
[G] = [1] − [M]   (4.25)
[B]   [1]   [Y]
4.2.4 Undercolor Removal: CMYK System
C, M, and Y are supposed to mix to black. However, more often they mix to a muddy brown (we all know this from kindergarten). Truly "black" black ink is in fact cheaper than mixing colored inks to make black, so a simple approach to producing sharper printer colors is to calculate that part of the three-color mix that would be black, remove it from the color proportions, and add it back as real black. This is called "undercolor removal". The new specification of inks is thus

K = min(C, M, Y)

[C′]   [C − K]
[M′] = [M − K]   (4.26)
[Y′]   [Y − K]
Figure 4.16 depicts the color combinations that result from combining primary colors available in the two situations: additive color, in which we usually specify color using RGB, and subtractive color, in which we usually specify color using CMY or CMYK.
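Equations (4.24) and (4.26) combine into a few lines of code. This is a naive sketch with all values in [0, 1]; real printer pipelines use measured ink profiles rather than this idealized model:

```python
def rgb_to_cmyk(r, g, b):
    """Naive CMY (Eq. 4.24) plus undercolor removal (Eq. 4.26)."""
    c, m, y = 1 - r, 1 - g, 1 - b      # Eq. (4.24): subtract from white
    k = min(c, m, y)                   # the "common" black component
    return c - k, m - k, y - k, k      # Eq. (4.26)

print(rgb_to_cmyk(1, 0, 0))   # pure red  -> (0, 1, 1, 0)
print(rgb_to_cmyk(0, 0, 0))   # black     -> (0, 0, 0, 1): all real black ink
```

Note how black is carried entirely by the K channel, which is exactly the point of undercolor removal.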
FIGURE 4.17: (a) transmission curves for block dyes; (b) spectrum locus, triangular NTSC gamut, and six-vertex printer gamut.
4.2.5 Printer Gamuts
In a common model of the printing process, printers lay down transparent layers of ink onto a (generally white) substrate. If we wish to have a cyan printing ink truly equal to minus-red, our objective is to produce a cyan ink that completely blocks red light but also completely passes all green and blue light. Unfortunately, such "block dyes" are only approximated in industry. In reality, transmission curves overlap for the C, M, and Y inks. This leads to "crosstalk" between the color channels and difficulties in predicting colors achievable in printing.
Figure 4.17(a) shows typical transmission curves for real block dyes, and Figure 4.17(b) shows the resulting color gamut for a color printer that uses such inks. We see that the gamut is smaller than that of an NTSC monitor and can overlap it.
Such a gamut arises from the model used for printer inks. Transmittances are related to optical density D via a logarithm: D = −ln T, where T is one of the curves in Figure 4.17(a). A color is formed by a linear combination D of inks, with D a combination of the three densities weighted by weights w_i, i = 1..3, and w_i can be in the range from zero to the maximum allowable without smearing. So the overall transmittance T is formed as a product of exponentials of the three weighted densities — light is extinguished exponentially as it travels through a "sandwich" of transparent dyes. The light reflected from paper (or through a piece of slide film) is T E = e^(−D) E, where E is the illuminating light.
Forming colors XYZ with Equation (4.6) leads to the printer gamut in Figure 4.17(b). The center of the printer gamut is the white-black axis, and the six boundary vertices correspond to C, M, Y, and the three combinations CM, CY, and MY laid down at full density. Lesser ink densities lie more in the middle of the diagram. Full density for all inks corresponds to the black/white point, which lies in the center of the diagram, at the point marked "o". For these particular inks, that point has chromaticity (x, y) = (0.276, 0.308).
4.3 COLOR MODELS IN VIDEO
4.3.1 Video Color Transforms
Methods of dealing with color in digital video derive largely from older analog methods of coding color for TV. Typically, some version of the luminance is combined with color information in a single signal. For example, a matrix transform method similar to Equation (4.19) called YIQ is used to transmit TV signals in North America and Japan. This coding also makes its way into VHS videotape coding in these countries, since videotape technologies also use YIQ.
In Europe, videotape uses the PAL or SECAM codings, which are based on TV that uses a matrix transform called YUV. Finally, digital video mostly uses a matrix transform called YCbCr that is closely related to YUV.²
4.3.2 YUV Color Model
Initially, YUV coding was used for PAL analog video. A version of YUV is now also used in the CCIR 601 standard for digital video.
First, it codes a luminance signal (for gamma-corrected signals) equal to Y′ in Equation (4.20). (Recall that Y′ is often called the "luma.") The luma Y′ is similar to, but not exactly the same as, the CIE luminance value Y, gamma-corrected. In multimedia, practitioners often blur the difference and simply refer to both as the luminance.
As well as magnitude or brightness we need a colorfulness scale, and to this end chrominance refers to the difference between a color and a reference white at the same luminance. It can be represented by the color differences U, V:

U = B′ − Y′
V = R′ − Y′   (4.27)
From Equation (4.20), Equation (4.27) reads

[Y′]   [ 0.299   0.587   0.114] [R′]
[U ] = [−0.299  −0.587   0.886] [G′]   (4.28)
[V ]   [ 0.701  −0.587  −0.114] [B′]
We go backward, from (Y′, U, V) to (R′, G′, B′), by inverting the matrix in Equation (4.28). Note that for a gray pixel, with R′ = G′ = B′, the luminance Y′ is equal to that same gray value, R′, say, since the sum of the coefficients in Equation (4.20) is 0.299 + 0.587 + 0.114 = 1.0. So for a gray ("black-and-white") image, the chrominance (U, V) is zero, since the sum of coefficients in each of the lower two equations in (4.28) is zero. Color TV can be displayed on a black-and-white television by just using the Y′ signal.³ For backward compatibility, color TV uses old black-and-white signals with no color information by identifying the signal with Y′.
²The luminance-chrominance color models (YIQ, YUV, YCbCr) are proven effective. Hence, they are also adopted in image-compression standards such as JPEG and JPEG2000.
³It should be noted that many authors and users simply use these terms with no primes and (perhaps) mean them as if they were with primes!
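The gray-pixel property is easy to check in code. A minimal sketch of Equations (4.27)-(4.28), with float R′G′B′ in [0, 1]:

```python
def rgb_to_yuv(r, g, b):
    """Y'UV from gamma-corrected R'G'B', Eqs. (4.27)-(4.28)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma, Eq. (4.20)
    u = b - y                               # U = B' - Y'
    v = r - y                               # V = R' - Y'
    return y, u, v

# A gray pixel has zero chrominance (up to float rounding)
y, u, v = rgb_to_yuv(0.5, 0.5, 0.5)
print(round(y, 9), abs(u) < 1e-9, abs(v) < 1e-9)
```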
Finally, in the actual implementation, U and V are rescaled for purposes of having a more convenient maximum and minimum. For analog video, the scales are chosen such that each of U or V is limited to the range between ±0.5 times the maximum of Y′ [9]. (Note that actual voltages are in another, non-normalized range — for analog, Y′ is often in the range 0 to 700 mV, so rescaled U and V, called P_B and P_R in that context, range over ±350 mV.)
Such scaling reflects how to deal with component video — three separate signals. However, for dealing with composite video, in which we want to compose a single signal out of Y′, U, and V at once, it turns out to be convenient to contain the composite signal magnitude Y′ ± sqrt(U² + V²) within the range −1/3 to +4/3, so that it will remain within the amplitude limits of the recording equipment. For this purpose, U and V are rescaled as follows:

U = 0.492111 (B′ − Y′)
V = 0.877283 (R′ − Y′)   (4.29)

(with multipliers sometimes rounded to three significant digits). Then the chrominance signal is composed from U and V as the composite signal

C = U cos(ωt) + V sin(ωt)   (4.30)
where ω represents the NTSC color frequency.
From equations (4.29) we note that zero is not the minimum value for U, V. In terms of real, positive colors, U is approximately from blue (U > 0) to yellow (U < 0) in the RGB cube; V is approximately from red (V > 0) to cyan (V < 0).
Figure 4.18 shows the decomposition of a typical color image into its Y′, U, V components. Since both U and V go negative, the images are in fact shifted, rescaled versions of the actual signals.
Because the eye is most sensitive to black-and-white variations, in terms of spatial frequency (e.g., the eye can see a grid of fine gray lines more clearly than fine colored lines), in the analog PAL signal a bandwidth of only 1.3 MHz is allocated to each of U and V, while 5.5 MHz is reserved for the Y′ signal. In fact, color information transmitted for color TV is actually very blocky.

4.3.3 YIQ Color Model
YIQ (actually, Y′IQ) is used in NTSC color TV broadcasting. Again, gray pixels generate zero (I, Q) chrominance signal. The original meanings of these names came from combinations of analog signals — I for in-phase chrominance, and Q for quadrature chrominance — and can now be safely ignored.
It is thought that, although U and V are more simply defined, they do not capture the most-to-least hierarchy of human vision sensitivity. Although they nicely define the color differences, they do not best correspond to actual human perceptual color sensitivities. NTSC uses I and Q instead. YIQ is just a version of YUV, with the same Y′ but with U and V rotated by 33°:
I = 0.877283 (R′ − Y′) cos 33° − 0.492111 (B′ − Y′) sin 33°
Q = 0.877283 (R′ − Y′) sin 33° + 0.492111 (B′ − Y′) cos 33°   (4.31)
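Equation (4.31) is exactly the scaled (U, V) pair of Equation (4.29) rotated by 33°, which a few lines of code can confirm (the R′G′B′ values here are just illustrative):

```python
import math

r, g, b = 0.6, 0.3, 0.1
y = 0.299 * r + 0.587 * g + 0.114 * b   # luma, Eq. (4.20)

# Scaled chrominance, Eq. (4.29)
U = 0.492111 * (b - y)
V = 0.877283 * (r - y)

# Direct form, Eq. (4.31)
c, s = math.cos(math.radians(33)), math.sin(math.radians(33))
I = 0.877283 * (r - y) * c - 0.492111 * (b - y) * s
Q = 0.877283 * (r - y) * s + 0.492111 * (b - y) * c

# Same result as rotating the (U, V) pair by 33 degrees
assert abs(I - (V * c - U * s)) < 1e-12
assert abs(Q - (V * s + U * c)) < 1e-12
```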
FIGURE 4.19: (a) I and (b) Q components of color image.
4.3.4 YCbCr Color Model
The international standard for component (three-signal, studio quality) digital video is officially Recommendation ITU-R BT.601-4 (known as "Rec. 601"). This standard uses another color space, Y′CbCr, often simply written YCbCr. The YCbCr transform is used in JPEG image compression and MPEG video compression and is closely related to the YUV transform. YUV is changed by scaling such that Cb is U, but with a coefficient of 0.5 multiplying B′. In some software systems, Cb and Cr are also shifted such that values are between 0 and 1. This makes the equations as follows:
FIGURE 4.18: Y′UV decomposition of color image: (a) original color image; (b) Y′; (c) U; (d) V. (This figure also appears in the color insert section.)
(Equation (4.31) written as a matrix transform is

[Y′]   [0.299      0.587      0.114   ] [R′]
[I ] = [0.595879  −0.274133  −0.321746] [G′]   (4.32)
[Q ]   [0.211205  −0.523083   0.311878] [B′]

for the YIQ model.) For YCbCr, the shifted chrominance components are

Cb = ((B′ − Y′) / 1.772) + 0.5
Cr = ((R′ − Y′) / 1.402) + 0.5   (4.33)

This leads to the following matrix transform. Written out, we then have

[Y′]   [ 0.299      0.587      0.114   ] [R′]   [0  ]
[Cb] = [−0.168736  −0.331264   0.5     ] [G′] + [0.5]   (4.34)
[Cr]   [ 0.5       −0.418688  −0.081312] [B′]   [0.5]
I is roughly the orange-blue direction, and Q roughly corresponds to the purple-green direction. Figure 4.19 shows the decomposition of the same color image as above into YIQ components. Only the I and Q components are shown, since the original image and the Y′ component are the same as in Figure 4.18.
For this particular image, most of the energy is captured in the Y′ component, which is typical. However, in this case the YIQ decomposition does a better job of forming a hierarchical sequence of images: for the 8-bit Y′ component, the root-mean-square (RMS) value is 137 (with 255 the maximum possible). The U, V components have RMS values … and 44. For the YIQ decomposition, the I and Q components have RMS values 42 and 14, so they better prioritize color values.
Originally, NTSC allocated 4.2 MHz to Y′, 1.5 MHz to I, and 0.6 MHz to Q. Today, both I and Q are each allocated 1.0 MHz.
In practice, however, Rec. 601 specifies 8-bit coding, with a range of 219 and a minimum of +16 for Y′, that is, a maximum value of only 235. Values below and above (denoted footroom and headroom) are reserved for other processing. Cb and Cr have a range of ±112 and offset of +128 (in other words, a maximum of 240 and a minimum of 16). If R′, G′, B′ are floats in [0, +1], we obtain Y′, Cb, Cr in [0, 255] via the transform [9]
r y’ 1
r
[1 J1 1[ =
Cr
65.481 —37.797 112
128.553 —74.203
—93.786
24.966 1 r R’ 1 r ~ 112 O’ + 128 —18.214 8’ [128
J1 1[ J1 1
]
(4.35)
In fact, the output range is also clamped to [1 .. 254], since the Rec. 601 synchronization signals are given by codes 0 and 255.

4.4 FURTHER EXPLORATION
In a deep way, color is one of our favorite pleasures as humans and arguably one of the chief attributes that makes multimedia so compelling. The most-used reference on color in
general is the classic handbook [2]. A compendium of important techniques used today is the collection [10].
Links in the Chapter 4 section of the Further Exploration directory on the text web site include
• More details on gamma correction for publication on the WWW
• The full specification of the new sRGB standard color space for WWW applications
• An excellent review of color transforms and a standard color FAQ
• A MATLAB script to exercise (and expand upon) the color transform functions that are part of the Image Toolbox in MATLAB: the standard Lena image is transformed to YIQ and to YCbCr
• A new color space. The new MPEG standard, MPEG-4 (discussed in Chapter 12), somewhat sidesteps the thorny question of whose favorite color space to use in a standard definition by including six color spaces. One of them is a new variant on HSV space, HMMD color space, that purports to allow a simple color quantization — from 24-bit down to 8-bit color, say — that is effectively equivalent to a complex vector color quantization (i.e., considering a more careful but also more expensive mapping of the colors in an image into the color LUT). This new color space may indeed become important.
(b) What are (roughly) the relative sizes of the L*a*b* gamut, the CMYK gamut, and a monitor gamut?
4. Where does the chromaticity "horseshoe" shape in Figure 4.11 come from? Can we calculate it? Write a small pseudocode solution for the problem of finding this so-called "spectrum locus". Hint: Figure 4.20(a) shows the color-matching functions in Figure 4.10 drawn as a set of points in three-space. Figure 4.20(b) shows these points mapped into another 3D set of points. Another hint: Try a programming solution for this problem, to help you answer it more explicitly.
FIGURE 4.20: (a) color-matching functions; (b) transformed color-matching functions.
4.5 EXERCISES
1. Consider the following set of color-related terms:
(a) Wavelength
(b) Color level
(c) Brightness
(d) Whiteness
How would you match each of the following (more vaguely stated) characteristics to each of the above terms?
(e) Luminance
(f) Hue
(g) Saturation
(h) Chrominance
2. What color is outdoor light? For example, around what wavelength would you guess the peak power is for a red sunset? For blue sky light?
3. "The L*a*b* gamut covers all colors in the visible spectrum."
(a) What does this statement mean? Briefly, how does L*a*b* relate to color? Just be descriptive.
5. Suppose we use a new set of color-matching functions x̄new(λ), ȳnew(λ), z̄new(λ) with values

λ (nm)   x̄new(λ)   ȳnew(λ)   z̄new(λ)
450      0.2        0.1        0.5
500      0.1        0.4        0.3
600      0.1        0.4        0.2
700      0.6        0.1        0.0

In this system, what are the chromaticity values (x, y) of equi-energy white light E(λ) where E(λ) ≡ 1 for all wavelengths λ? Explain.
6. (a) Suppose images are not gamma corrected by a camcorder. Generally, how would they appear on a screen?
(b) What happens if we artificially increase the output gamma for stored image pixels? (We can do this in Photoshop.) What is the effect on the image?
7. Suppose image file values are in 0 .. 255 in each color channel. If we define R̄ = R/255 for the red channel, we wish to carry out gamma correction by passing a new value R̄′ to the display device, with R̄′ = R̄^(1/2.0).
It is common to carry out this operation using integer math. Suppose we approximate the calculation as creating new integer values in 0 .. 255 via

(int) (255 · (R̄^(1/2.0)))
(We're not discussing any psychophysical effects that change our perception — here we're just worried about the machine itself.)
(b) If gamma correction is not applied, does the second RGB have the same hue as the first, when displayed?
(a) Comment (very roughly) on the effect of this operation on the number of actually available levels for display. Hint: Coding this up in any language will help you understand the mechanism at work better and will allow you to simply count the output levels.
(b) Which end of the levels 0 .. 255 is affected most by gamma correction — the low end (near 0) or the high end (near 255)? Why? How much at each end?
(c) For what color triples is the hue always unchanged?
11. We wish to produce a graphic that is pleasing and easily readable. Suppose we make the background color pink. What color text font should we use to make the text most readable? Justify your answer.
8. In many computer graphics applications, γ-correction is performed only in the color LUT (lookup table). Show the first five entries of a color LUT meant for use in γ-correction.
6.1.3 Nyquist Theorem
Signals can be decomposed into a sum of sinusoids, if we are willing to use enough sinusoids. Figure 6.3 shows how weighted sinusoids can build up quite a complex signal.
Whereas frequency is an absolute measure, pitch is a perceptual, subjective quality of sound — generally, pitch is relative. Pitch and frequency are linked by setting the note A above middle C to exactly 440 Hz. An octave above that note corresponds to doubling the frequency and takes us to another A note. Thus, with the middle A on a piano ("A4" or "A440") set to 440 Hz, the next A up is 880 Hz, one octave above.
Here, we define harmonics as any series of musical tones whose frequencies are integral multiples of the frequency of a fundamental tone. Figure 6.3 shows the appearance of these harmonics. Now, if we allow noninteger multiples of the base frequency, we allow non-A notes and have a complex resulting sound. Nevertheless, each sound is just made from sinusoids. Figure 6.4(a) shows a single sinusoid: it is a single, pure, frequency (only electronic instruments can create such boring sounds).
Now if the sampling rate just equals the actual frequency, we can see from Figure 6.4(b) that a false signal is detected: it is simply a constant, with zero frequency. If, on the other hand, we sample at 1.5 times the frequency, Figure 6.4(c) shows that we obtain an incorrect (alias) frequency that is lower than the correct one — it is half the correct one (the wavelength, from peak to peak, is double that of the actual signal). In computer graphics, much effort is aimed at masking such alias effects by various methods of antialiasing. An alias is any artifact that does not belong to the original signal. Thus, for correct sampling we must use a sampling rate equal to at least twice the maximum frequency content in the signal. This is called the Nyquist rate. The Nyquist Theorem is named after Harry Nyquist, a famous mathematician who worked at Bell Labs.
FIGURE 6.3: Building up a complex signal by superposing weighted sinusoids (the fundamental, plus 0.33 times the third harmonic, 0.25 times the fourth, and 0.5 times the fifth).

More generally, if a signal is band-limited — that is, if it has a lower limit f1 and an upper limit f2 of frequency components in the signal — then we need a sampling rate of at least 2(f2 − f1).
Suppose we have a fixed sampling rate. Since it would be impossible to recover frequencies higher than half the sampling rate in any event, most systems have an antialiasing filter that restricts the frequency content of the sampler's input to a range at or below half the sampling frequency. Confusingly, the frequency equal to half the Nyquist rate is called the Nyquist frequency. Then for our fixed sampling rate, the Nyquist frequency is half the sampling rate. The highest possible signal frequency component has frequency equal to that of the sampling itself.
Note that the true frequency and its alias are located symmetrically on the frequency axis with respect to the Nyquist frequency pertaining to the sampling rate used. For this reason, the Nyquist frequency associated with the sampling frequency is often called the "folding" frequency. That is to say, if the sampling frequency is less than twice the true frequency, and is greater than the true frequency, then the alias frequency equals the sampling frequency
minus the true frequency. For example, if the true frequency is 5.5 kHz and the sampling frequency is 8 kHz, then the alias frequency is 2.5 kHz:

f_alias = f_sampling − f_true,  for  f_true < f_sampling < 2 × f_true   (6.1)

As well, a frequency at double any frequency could also fit sample points. In fact, adding any positive or negative multiple of the sampling frequency to the true frequency always gives another possible alias frequency, in that such an alias gives the same set of samples when sampled at the sampling frequency.
So, if again the sampling frequency is less than twice the true frequency and is less than the true frequency, then the alias frequency equals n times the sampling frequency minus the true frequency, where the n is the lowest integer that makes n times the sampling frequency larger than the true frequency. For example, when the true frequency is between 1.0 and 1.5 times the sampling frequency, the alias frequency equals the true frequency minus the sampling frequency.

FIGURE 6.4: Aliasing: (a) a single frequency; (b) sampling at exactly the frequency produces a constant; (c) sampling at 1.5 times per cycle produces an alias frequency that is perceived.

In general, the apparent frequency of a sinusoid is the lowest frequency of a sinusoid that has exactly the same samples as the input sinusoid. Figure 6.5 shows the relationship of the apparent frequency to the input (true) frequency.

FIGURE 6.5: Folding of sinusoid frequency sampled at 8,000 Hz. The folding frequency, shown dashed, is 4,000 Hz.

6.1.4 Signal-to-Noise Ratio (SNR)
In any analog system, random fluctuations produce noise added to the signal, and the measured voltage is thus incorrect. The ratio of the power of the correct signal to the noise is called the signal-to-noise ratio (SNR). Therefore, the SNR is a measure of the quality of the signal. The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages:

SNR = 10 log10 ( V²_signal / V²_noise ) = 20 log10 ( V_signal / V_noise )   (6.2)
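Equation (6.1) can be demonstrated numerically: sampled at 8 kHz, a 5.5 kHz sinusoid produces exactly the samples of a 2.5 kHz sinusoid (with a sign flip for a sine, since the alias is folded about the Nyquist frequency):

```python
import math

fs, f_true = 8000, 5500
f_alias = fs - f_true               # Eq. (6.1): 2500 Hz

for n in range(16):                 # sample times t = n / fs
    t = n / fs
    s_true  = math.sin(2 * math.pi * f_true * t)
    s_alias = -math.sin(2 * math.pi * f_alias * t)   # folded: sign flips
    assert abs(s_true - s_alias) < 1e-9

print("5.5 kHz sampled at 8 kHz is indistinguishable from 2.5 kHz")
```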
The power in a signal is proportional to the square of the voltage. For example, if the signal voltage V_signal is 10 times the noise, the SNR is 20 × log10(10) = 20 dB. In terms of power, if the squeaking we hear from ten violins playing is ten times the squeaking we hear from one violin playing, then the ratio of power is given in terms of decibels as 10 dB, or, in other words, 1 bel.
Notice that decibels are always defined in terms of a ratio. The term "decibels" as applied to sounds in our environment usually is in comparison to a just-audible sound with frequency 1 kHz. The levels of sound we hear around us are described in terms of decibels, as a ratio to the quietest sound we are capable of hearing. Table 6.1 shows approximate levels for these sounds.

6.1.5 Signal-to-Quantization-Noise Ratio (SQNR)
For digital signals, we must take into account the fact that only quantized values are stored. For a digital audio signal, the precision of each sample is determined by the number of bits per sample, typically 8 or 16.
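The voltage-ratio form of Equation (6.2) gives the 20 dB figure directly:

```python
import math

def snr_db(v_signal, v_noise):
    """SNR in decibels from a voltage ratio, Eq. (6.2)."""
    return 20 * math.log10(v_signal / v_noise)

print(snr_db(10, 1))    # voltage ratio 10:1 -> 20.0 dB
print(snr_db(100, 1))   # each factor of 10 adds another 20 dB
```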
TABLE 6.1: Magnitudes of common sounds, in decibels (quietest to loudest): threshold of hearing; rustle of leaves; very quiet room; average room; conversation; busy street; loud radio; train through station; riveter; threshold of discomfort; threshold of pain; damage to eardrum.
Aside from any noise that may have been present in the original analog signal, additional error results from quantization. That is, if voltages are in the range of 0 to 1 but we have only 8 bits in which to store values, we effectively force all continuous values of voltage into only 256 different values. Inevitably, this introduces a roundoff error. Although it is not really "noise," it is called quantization noise (or quantization error). The association with the concept of noise is that such errors will essentially occur randomly from sample to sample.
The quality of the quantization is characterized by the signal-to-quantization-noise ratio (SQNR). Quantization noise is defined as the difference between the value of the analog signal, for the particular sampling time, and the nearest quantization interval value. At most, this error can be as much as half of the interval.
For a quantization accuracy of N bits per sample, the range of the digital signal is −2^(N−1) to 2^(N−1) − 1. Thus, if the actual analog signal is in the range from −V_max to +V_max, each quantization level represents a voltage of 2 V_max / 2^N, or V_max / 2^(N−1). SQNR can be simply expressed in terms of the peak signal, which is mapped to the level V_signal of about 2^(N−1), and the SQNR has as denominator the maximum V_quan_noise of 1/2. The ratio of the two is a simple definition of the SQNR:¹

SQNR = 20 log10 ( V_signal / V_quan_noise ) = 20 log10 ( 2^(N−1) / (1/2) )
     = 20 × N × log10 2 = 6.02 N (dB)   (6.3)
Since larger is better, this shows that a more realistic approximation gives a better characterization number for the quality of a system.
Typical digital audio sample precision is either 8 bits per sample, equivalent to about telephone quality, or 16 bits, for CD quality. In fact, 12 bits or so would likely do fine for adequate sound reproduction.
We have examined the worst case. If, on the other hand, we assume that the input signal is sinusoidal, that quantization error is statistically independent, and that its magnitude is uniformly distributed between 0 and half the interval, we can show ([2], p. 37) that the expression for the SQNR becomes

SQNR = 6.02 N + 1.76 (dB)   (6.4)
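Both formulas are one-liners; for 16-bit audio they give the familiar 96 dB (worst case, Eq. 6.3) and roughly 98 dB (sinusoidal input, Eq. 6.4):

```python
import math

def sqnr_worst(n_bits):
    """Eq. (6.3): 20 N log10(2), i.e. about 6.02 dB per bit."""
    return 20 * n_bits * math.log10(2)

def sqnr_sine(n_bits):
    """Eq. (6.4): 6.02 N + 1.76 dB for a sinusoidal input."""
    return 6.02 * n_bits + 1.76

print(round(sqnr_worst(16), 2))   # 96.33
print(round(sqnr_sine(16), 2))    # 98.08
```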
6.1.6 Linear and Nonlinear Quantization
We mentioned above that samples are typically stored as uniformly quantized values. This is called linear format. However, with a limited number of bits available, it may be more sensible to try to take into account the properties of human perception and set up nonuniform quantization levels that pay more attention to the frequency range over which humans hear best. Remember that here we are quantizing magnitude, or amplitude — how loud the signal is.
In Chapter 4, we discussed an interesting feature of many human perception subsystems (as it were) — Weber's Law — which states that the more there is, proportionately more must be added to discern a difference. Stated formally, Weber's Law says that equally perceived differences have values proportional to absolute levels:

ΔResponse ∝ ΔStimulus / Stimulus   (6.5)
This means that, for example, if we can feel an increase in weight from 10 to 11 pounds, then if instead we start at 20 pounds, it would take 22 pounds for us to feel an increase in weight. Inserting a constant of proportionality k, we have a differential equation that states

dr = k (1/s) ds
(6.6)
with response r and stimulus s. Integrating, we arrive at a solution

r = k ln s + C    (6.7)
with constant of integration C. Stated differently, the solution is

r = k ln(s / s0)    (6.8)
(Returning to Equation (6.3) for a moment: in other words, each bit adds about 6 dB of resolution, so 16 bits provide a maximum SQNR of 96 dB. This worst-case ratio is actually the peak signal-to-quantization-noise ratio, or PSQNR.)
where s0 is the lowest level of stimulus that causes a response (r = 0 when s = s0). Thus, nonuniform quantization schemes that take advantage of this perceptual characteristic make use of logarithms. The idea is that in a log plot derived from Equation (6.8), if we simply take uniform steps along the s axis, we are not mirroring the nonlinear response along the r axis.
Instead, we would like to take uniform steps along the r axis. Thus, nonlinear quantization works by first transforming an analog signal from the raw s space into the theoretical r space, then uniformly quantizing the resulting values. The result is that for steps near the low end of the signal, quantization steps are effectively more concentrated on the s axis, whereas for large values of s, one quantization step in r encompasses a wide range of s values. Such a law for audio is called μ-law encoding, or u-law, since it's easier to write. A very similar rule, called A-law, is used in telephony in Europe. The equations for these similar encodings are as follows:
μ-law:

    r = (sgn(s) / ln(1 + μ)) × ln(1 + μ |s/s_p|),   |s/s_p| ≤ 1    (6.9)

A-law:

    r = (A / (1 + ln A)) × (s/s_p),                 |s/s_p| ≤ 1/A
    r = sgn(s) × (1 + ln(A |s/s_p|)) / (1 + ln A),  1/A ≤ |s/s_p| ≤ 1    (6.10)

where sgn(s) = 1 if s > 0, and -1 otherwise, and s_p is the peak signal value.
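Equation (6.9) is straightforward to implement. The sketch below is an illustrative transform pair only, not the standard 8-bit telephony codec tables; μ = 255 (the common telephony choice) and a peak value s_p = 1 are assumptions for illustration.

```python
import math

MU = 255.0  # mu = 255 is the common telephony choice

def mu_law(s, sp=1.0, mu=MU):
    """Forward mu-law transform of Equation (6.9): s in [-sp, sp] -> r in [-1, 1]."""
    x = abs(s) / sp
    r = math.log(1 + mu * x) / math.log(1 + mu)
    return math.copysign(r, s)

def mu_law_inverse(r, sp=1.0, mu=MU):
    """Inverse transform: r in [-1, 1] -> s in [-sp, sp]."""
    x = ((1 + mu) ** abs(r) - 1) / mu
    return math.copysign(sp * x, r)
```

Uniformly quantizing r and then applying the inverse yields the desired nonuniform steps in s: small amplitudes, to which hearing is most sensitive, get finely spaced levels.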
f̂_n = f̃_(n-1),   e_n = f_n - f̂_n,
ẽ_n = +k if e_n > 0, -k otherwise (where k is a constant),
f̃_n = f̂_n + ẽ_n    (6.21)

Note that the prediction simply involves a delay.
Again, let's consider actual numbers. Suppose signal values are as follows:

    f_1 = 10,  f_2 = 11,  f_3 = 13,  f_4 = 15

We also define an exact reconstructed value f̃_1 = f_1 = 10. Suppose we use a step value k = 4. Then we arrive at the following values:

    f̂_2 = 10,  e_2 = 11 - 10 = 1,   ẽ_2 = 4,   f̃_2 = 10 + 4 = 14
    f̂_3 = 14,  e_3 = 13 - 14 = -1,  ẽ_3 = -4,  f̃_3 = 14 - 4 = 10
    f̂_4 = 10,  e_4 = 15 - 10 = 5,   ẽ_4 = 4,   f̃_4 = 10 + 4 = 14
Adaptive DM. However, if the slope of the actual signal curve is high, the staircase approximation cannot keep up. A straightforward approach to dealing with a steep curve is to simply change the step size k adaptively, that is, in response to the signal's current properties.

ADPCM. Adaptive DPCM takes the idea of adapting the coder to suit the input much further. Basically, two pieces make up a DPCM coder: the quantizer and the predictor. Above, in Adaptive DM, we adapted the quantizer step size to suit the input. In DPCM, we can adaptively modify the quantizer, by changing the step size as well as decision boundaries in a nonuniform quantizer. We can carry this out in two ways: using the properties of the input signal (called forward adaptive quantization), or the properties of the quantized output. For if quantized errors become too large, we should change the nonuniform Lloyd-Max quantizer (this is called backward adaptive quantization). We can also adapt the predictor, again using forward or backward adaptation. Generally, making the predictor coefficients adaptive is called Adaptive Predictive Coding (APC). It is interesting to see how this is done. Recall that the predictor is usually taken to be a linear function of previously reconstructed quantized values, f̃_n. The number of previous values used is called the order of the predictor. For example, if we use M previous values, we need M coefficients a_i, i = 1..M, in a predictor
f̂_n = Σ_{i=1}^{M} a_i f̃_(n-i)    (6.22)
We see that the reconstructed set of values 10, 14, 10, 14 never strays far from the correct set 10, 11, 13, 15. Nevertheless, it is not difficult to discover that DM copes well with more or less constant signals, but not as well with rapidly changing signals. One approach to mitigating this problem is to simply increase the sampling, perhaps to many times the Nyquist rate. This scheme can work well and makes DM a very simple yet effective analog-to-digital converter.
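The worked example is easy to reproduce. This sketch (illustrative code, not from the text) follows the DM equations of (6.21) directly:

```python
def delta_modulate(samples, k=4):
    """Delta modulation, Equation (6.21): returns the reconstructed values.
    The first sample is assumed to be transmitted exactly (f~1 = f1)."""
    recon = [samples[0]]                  # f~1 = f1
    for f in samples[1:]:
        f_hat = recon[-1]                 # prediction is just a delay
        e = f - f_hat                     # prediction error
        e_tilde = k if e > 0 else -k      # 1-bit quantized error
        recon.append(f_hat + e_tilde)     # f~n = f^n + e~n
    return recon

print(delta_modulate([10, 11, 13, 15]))   # [10, 14, 10, 14], as in the example
```

With a steeper input such as 0, 5, 9, 14 the staircase lags behind, illustrating why an adaptive step size helps.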
Suppose we decide to use a least-squares approach to solving a minimization, trying to find the best values of the a_i:

min Σ_{n=1}^{N} (f_n - f̂_n)²    (6.23)
Again, leL’s consider actual numbers. Suppose signal values are as follows: fI
However, we can get into a difficult situation if we try to change the prediction coefficients that multiply previous quantized values, because that makes a complicated set of equations to solve for these coefficients.
In this minimization, we would sum over a large number of samples f_n for the current patch of speech, say. But because f̃_n depends on the quantization, we have a difficult problem to solve. Also, we should really be changing the fineness of the quantization at the same time, to suit the signal's changing nature; this makes things problematical. Instead, we usually resort to solving the simpler problem that results from using not f̃_n in the prediction but simply the signal f_n itself. This is indeed simply solved, since, explicitly writing in terms of the coefficients a_i, we wish to solve
min Σ_{n=1}^{N} ( f_n - Σ_{i=1}^{M} a_i f_(n-i) )²    (6.24)
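Minimization (6.24) leads to a small linear system (the Wiener-Hopf equations discussed next). The following self-contained sketch (illustrative only; a real codec would work block by block on speech samples) forms the normal equations and solves them for the a_i with Gaussian elimination:

```python
def solve_predictor(f, M=2):
    """Least-squares coefficients a_1..a_M minimizing
    sum_n (f[n] - sum_i a_i * f[n-i])^2 over one block (Equation 6.24)."""
    N = len(f)
    # Normal equations A a = b: A[j][i] = sum_n f[n-1-i]*f[n-1-j],
    # b[j] = sum_n f[n]*f[n-1-j]  (0-based: a[j] multiplies f[n-1-j])
    A = [[sum(f[n - 1 - i] * f[n - 1 - j] for n in range(M, N)) for i in range(M)]
         for j in range(M)]
    b = [sum(f[n] * f[n - 1 - j] for n in range(M, N)) for j in range(M)]
    # Gaussian elimination with partial pivoting
    for col in range(M):
        piv = max(range(col, M), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, M):
            m = A[r][col] / A[col][col]
            for c in range(col, M):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    a = [0.0] * M
    for r in range(M - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, M))) / A[r][r]
    return a
```

Fed a noiseless signal that exactly obeys f_n = 0.5 f_(n-1) + 0.3 f_(n-2), it recovers coefficients very close to (0.5, 0.3).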
Differentiation with respect to each of the a_i and setting to zero produces a linear system of M equations that is easy to solve. (The set of equations is called the Wiener-Hopf equations.) Thus we indeed find a simple way to adaptively change the predictor as we go. For speech signals, it is common to consider blocks of signal values, just as for image coding, and adaptively change the predictor, quantizer, or both. If we sample at 8 kHz, a common block size is 128 samples (16 msec of speech). Figure 6.18 shows a schematic diagram for the ADPCM coder and decoder [7].

6.4 FURTHER EXPLORATION

Fascinating work is ongoing in the use of audio to help sight-impaired persons. One technique is presenting HTML structure by means of audio cues, using creative thinking as in the papers [8,9,10]. An excellent resource for digitization and SNR, SQNR, and so on is the book by Pohlmann [2]. The audio quantization μ-law is described in the Chapter 6 web page in the Further Exploration section of the text web site. Other useful links included are:
• An excellent discussion of the use of FM to create synthetic sound
• An extensive list of audio file formats
• A good description of various CD audio file formats, which are somewhat different. The main music format is called red book audio.
• A General MIDI Instrument Patch Map, along with a General MIDI Percussion Key Map
• A link to a good tutorial on MIDI and wave-table music synthesis
• A link to a Java program for decoding MIDI streams
FIGURE 6.18: Schematic diagram for: (a) ADPCM encoder (64 kbps A-law or μ-law PCM input; 32 kbps output); (b) decoder (32 kbps input; 64 kbps A-law or μ-law PCM output).
• A good multimedia/sound page, including a source for locating Internet sound/music materials
• A performing-arts-oriented site that is an excellent all-around resource on sound generally, including a great deal of information on definitions of terms, signal processing, and sound perception
6.5 EXERCISES

1. My old SoundBlaster card is an 8-bit card.
   (a) What is it 8 bits of?
   (b) What is the best SQNR it can achieve?

2. If a set of ear protectors reduces the noise level by 30 dB, how much do they reduce the intensity (the power)?

3. A loss of audio output at both ends of the audible frequency range is inevitable, due to the frequency response function of an audio amplifier and the medium (e.g., tape).
   (a) If the output was 1 volt for frequencies at midrange, what is the output voltage after a loss of 3 dB at 18 kHz?
   (b) To compensate for the loss, a listener can adjust the gain (and hence the output) on an equalizer at different frequencies. If the loss remains -3 dB and a gain through the equalizer is 6 dB at 18 kHz, what is the output voltage now? Hint: Assume log10 2 = 0.3.

4. Suppose the sampling frequency is 1.5 times the true frequency. What is the alias frequency?

5. In a crowded room, we can still pick out and understand a nearby speaker's voice, notwithstanding the fact that general noise levels may be high. This is known as the cocktail-party effect. The way it operates is that our hearing can localize a sound source by taking advantage of the difference in phase between the two signals entering our left and right ears (binaural auditory perception). In mono, we could not hear our neighbor's conversation well if the noise level were at all high. State how you think a karaoke machine works. Hint: The mix for commercial music recordings is such that the "pan" parameter is different going to the left and right channels for each instrument. That is, for an instrument, either the left or right channel is emphasized. How would the singer's track timing have to be recorded to make it easy to subtract the sound of the singer (which is typically done)?

6. The dynamic range of a signal V is the ratio of the maximum to the minimum absolute value, expressed in decibels. The dynamic range expected in a signal is to some extent an expression of the signal quality. It also dictates the number of bits per sample needed to reduce the quantization noise to an acceptable level. For example, we may want to reduce the noise to at least an order of magnitude below V_min. Suppose the dynamic range for a signal is 60 dB. Can we use 10 bits for this signal? Can we use 16 bits?

7. Suppose the dynamic range of speech in telephony implies a ratio V_max / V_min of about 256.
   Using uniform quantization, how many bits should we use to encode speech to make the quantization noise at least an order of magnitude less than the smallest detectable telephonic sound?

8. Perceptual nonuniformity is a general term for describing the nonlinearity of human perception. That is, when a certain parameter of an audio signal varies, humans do not necessarily perceive the difference in proportion to the amount of change.
   (a) Briefly describe at least two types of perceptual nonuniformities in human auditory perception.
   (b) Which one of them does A-law (or μ-law) attempt to approximate? Why could it improve quantization?

9. Draw a diagram showing a sinusoid at 5.5 kHz and sampling at 8 kHz (show eight intervals between samples in your plot). Draw the alias at 2.5 kHz and show that in the eight sample intervals, exactly 5.5 cycles of the true signal fit into 2.5 cycles of the alias signal.

10. Suppose a signal contains tones at 1, 10, and 21 kHz and is sampled at the rate 12 kHz (and then processed with an antialiasing filter limiting output to 6 kHz). What tones are included in the output? Hint: Most of the output consists of aliasing.
11.
(a) Can a single MIDI message produce more than one note sounding?
(b) Is it possible for more than one note to sound at once on a particular instrument? If so, how is it done in MIDI?
(c) Is the Program Change MIDI message a Channel Message? What does this message accomplish? Based on the Program Change message, how many different instruments are there in General MIDI? Why?
(d) In general, what are the two main kinds of MIDI messages? In terms of data, what is the main difference between the two types of messages? Within those two categories, list the different subtypes.
12.
(a) Give an example (in English, not hex) of a MIDI voice message.
(b) Describe the parts of the "assembler" statement for the message.
(c) What does a Program Change message do? Suppose Program Change is hex "&HC1". What does the instruction "&HC103" do?
13. In PCM, what is the delay, assuming 8 kHz sampling? Generally, delay is the time penalty associated with any algorithm due to sampling, processing, and analysis.
(a) Suppose we use a predictor as follows:
f̂_n = trunc[ (f̃_(n-1) + f̃_(n-2)) / 2 ]
e_n = f_n - f̂_n    (6.25)
FIGURE 6.19: (a) DPCM reconstructed signal (dotted line) tracks the input signal (solid line); (b) DPCM reconstructed signal (dashed line) steers farther and farther from the input signal (solid line).

6.6 REFERENCES

1 B. Truax, Handbook for Acoustic Ecology, 2nd ed., Burnaby, BC, Canada: Cambridge Street Publishing, 1999.
2 K.C. Pohlmann, Principles of Digital Audio, 4th ed., New York: McGraw-Hill, 2000.
3 J.H. McClellan, R.W. Schafer, and M.A. Yoder, DSP First: A Multimedia Approach, Upper Saddle River, NJ: Prentice-Hall PTR, 1998.
Also, suppose we adopt the quantizer of Equation (6.20). If the input signal has values as follows:
4 J. Heckroth, Tutorial on MIDI and Music Synthesis, La Habra, CA: The MIDI Manufacturers Association, 1995, www.harmony-central.com/MIDI/Doc/tutorial.html.
5 P.K. Andleigh and K. Thakrar, Multimedia Systems Design, Upper Saddle River, NJ: Prentice Hall PTR, 1984.
20 38 56 74 92 110 128 146 164 182 200 218 236 254
6 K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco: Morgan Kaufmann, 2000.
show that the output from a DPCM coder (without entropy coding) is as follows:
7 Roger L. Freeman, Reference Manual for Telecommunications Engineering, 2nd ed., New York: Wiley, 1997.
20 44 56 74 89 105 121 153 161 181 195 212 243 251
8 M.M. Blattner, D.A. Sumikawa, and R. Greenberg, "Earcons and Icons: Their Structure and Common Design Principles," Human-Computer Interaction, 4: 11-44, 1989.
Figure 6.19(a) shows how the quantized reconstructed signal tracks the input signal. As a programming project, write a small piece of code to verify your results.

(b) Suppose by mistake on the coder side we inadvertently use the predictor for lossless coding, Equation (6.14), using original values f_n instead of quantized ones, f̃_n. Show that on the decoder side we end up with reconstructed signal values as follows:

20 44 56 74 89 105 121 137 153 169 185 201 217 233

so that the error gets progressively worse. Figure 6.19(b) shows how this appears: the reconstructed signal gets progressively worse. Modify your code from above to verify this statement.
9 M.M. Blattner, "Multimedia Interfaces: Designing for Diversity," Multimedia Tools and Applications, 3: 87-122, 1996.
10 W.W. Gaver and R. Smith, "Auditory Icons in Large-Scale Collaborative Environments," in Readings in Human-Computer Interaction: Toward the Year 2000, ed. R. Baecker, J. Grudin, W. Buxton, and S. Greenberg, San Francisco: Morgan-Kaufman, 1990, pp. 564-569.
PART TWO

MULTIMEDIA DATA COMPRESSION

Chapter 7   Lossless Compression Algorithms  167
Chapter 8   Lossy Compression Algorithms  199
Chapter 9   Image Compression Standards  253
Chapter 10  Basic Video Compression Techniques  288
Chapter 11  MPEG Video Coding I – MPEG-1 and 2  312
Chapter 12  MPEG Video Coding II – MPEG-4, 7, and Beyond  332
Chapter 13  Basic Audio Compression Techniques  374
Chapter 14  MPEG Audio Compression  395
In this part, we examine the role played by data compression, perhaps the most important enabling technology that makes modern multimedia systems possible.

We start off in Chapter 7 looking at lossless data compression, that is, involving no distortion of the original signal once it is decompressed or reconstituted. So much data exists, in archives and elsewhere, that it has become critical to compress this information. Lossless compression is one way to proceed. For example, suppose we decide to spend our savings on a whole-body MRI scan, looking for trouble. Then we certainly want this costly medical information to remain pristine, with no loss of information. This example of volume data forms a simply huge dataset, but we can't afford to lose any of it, so we'd best use lossless compression. WinZip, for example, is a ubiquitous tool that uses lossless compression. Another good example is archival storage of precious artworks. Here, we may go to the trouble of imaging an Old Master's painting using a high-powered camera mounted on a dolly, to avoid parallax. Certainly we do not wish to lose any of this hard-won information, so again we'll use lossless compression.

On the other hand, when it comes to home movies, we're more willing to lose some information. If we have a choice between losing some information anyway, because our PC cannot handle all the data we want to push through it, or losing some information on purpose, using a lossy compression method, we'll choose the latter. Nowadays, almost all
video you see is compressed in some way, and the compression used is mostly lossy. Almost every image on the web is in the standard JPEG format, which is usually lossy.

So in Chapter 8 we go on to look at lossy methods of compression, mainly focusing on the Discrete Cosine Transform and the Discrete Wavelet Transform. The major application of these important methods is in the set of JPEG still image compression standards, including JPEG2000, examined in Chapter 9.

We then go on to look at how data compression methods can be applied to moving images: videos. We start with basic video compression techniques in Chapter 10. We examine the ideas behind the MPEG standard, starting with MPEG-1 and 2 in Chapter 11 and MPEG-4, 7, and beyond in Chapter 12.

Audio compression in a sense stands by itself, and we consider some basic audio compression techniques in Chapter 13, while in Chapter 14 we look at MPEG Audio, including MP3.
CHAPTER 7

Lossless Compression Algorithms
7.1 INTRODUCTION

The emergence of multimedia technologies has made digital libraries a reality. Nowadays, libraries, museums, film studios, and governments are converting more and more data and archives into digital form. Some of the data (e.g., precious books and paintings) indeed need to be stored without any loss.

As a start, suppose we want to encode the call numbers of the 120 million or so items in the Library of Congress (a mere 20 million, if we consider just books). Why don't we just transmit each item as a 27-bit number, giving each item a unique binary code (since 2^27 > 120,000,000)?

The main problem is that this "great idea" requires too many bits. And in fact there exist many coding techniques that will effectively reduce the total number of bits needed to represent the above information. The process involved is generally referred to as compression [1,2].

In Chapter 6, we had a beginning look at compression schemes aimed at audio. There, we had to first consider the complexity of transforming analog signals to digital ones, whereas here, we shall consider that we at least start with digital signals. For example, even though we know an image is captured using analog signals, the file produced by a digital camera is indeed digital.

The more general problem of coding (compressing) a set of any symbols, not just byte values, say, has been studied for a long time. Getting back to our Library of Congress problem, it is well known that certain parts of call numbers appear more frequently than others, so it would be more economic to assign fewer bits as their codes. This is known as variable-length coding (VLC): the more frequently-appearing symbols are coded with fewer bits per symbol, and vice versa. As a result, fewer bits are usually needed to represent the whole collection. In this chapter we study the basics of information theory and several popular lossless compression techniques.
Figure 7.1 depicts a general data compression scheme, in which compression is performed by an encoder and decompression is performed by a decoder. We call the output of the encoder codes or codewords. The intermediate medium could either be data storage or a communication/computer network. If the compression and decompression processes induce no information loss, the compression scheme is lossless; otherwise, it is lossy. The next several chapters deal with lossy compression algorithms, as they are commonly used for image, video, and audio compression. Here, we concentrate on lossless compression.
FIGURE 7.1: A general data compression scheme: input data is fed to the encoder (compression), passes through storage or networks, and is restored by the decoder (decompression) as output data.
If the total number of bits required to represent the data before compression is B0 and the total number of bits required to represent the data after compression is B1, then we define the compression ratio as

compression ratio = B0 / B1    (7.1)

In general, we would desire any codec (encoder/decoder scheme) to have a compression ratio much larger than 1.0. The higher the compression ratio, the better the lossless compression scheme, as long as it is computationally feasible.
7.2 BASICS OF INFORMATION THEORY

According to the famous scientist Claude E. Shannon, of Bell Labs [3,4], the entropy η of an information source with alphabet S = {s_1, s_2, ..., s_n} is defined as:

η = H(S) = Σ_{i=1}^{n} p_i log2 (1/p_i)    (7.2)
         = - Σ_{i=1}^{n} p_i log2 p_i    (7.3)
where p_i is the probability that symbol s_i in S will occur. The term log2 (1/p_i) indicates the amount of information (the so-called self-information defined by Shannon [3]) contained in s_i, which corresponds to the number of bits¹ needed to encode s_i. For example, if the probability of having the character n in a manuscript is 1/32, the amount of information associated with receiving this character is 5 bits. In other words, a character string nnn will require 15 bits to code. This is the basis for possible data reduction in text compression, since it will lead to character coding schemes different from the ASCII representation, in which each character is always represented with 8 bits.

What is entropy? In science, entropy is a measure of the disorder of a system: the more entropy, the more disorder. Typically, we add negative entropy to a system when we impart more order to it. For example, suppose we sort a deck of cards. (Think of a bubble sort for the deck, though perhaps this is not the usual way you actually sort cards.) For

¹ Since we have chosen 2 as the base for logarithms in the above definition, the unit of information is the bit, naturally also most appropriate for the binary code representation used in digital computers. If the log base is 10, the unit is the hartley; if the base is e, the unit is the nat.
every decision to swap or not, we impart 1 bit of information to the card system and transfer 1 bit of negative entropy to the card deck. The definition of entropy includes the idea that two decisions means the transfer of twice the negative entropy, in its use of the log base 2. A two-bit vector can have 2² states, and the logarithm takes this value into 2 bits of negative entropy. Twice as many sorting decisions impart twice the entropy change.

Now suppose we wish to communicate those swapping decisions, via a network, say. Then for our two decisions we'd have to send 2 bits. If we had a two-decision system, then of course the average number of bits for all such communications would also be 2 bits. If we like, we can think of the possible number of states in our 2-bit system as four outcomes. Each outcome has probability 1/4. So on average, the number of bits to send per outcome is 4 × (1/4) × log2 (1/(1/4)) = 2 bits, no surprise here.

To communicate (transmit) the results of our two decisions, we would need to transmit 2 bits. But if the probability for one of the outcomes were higher than the others, the average number of bits we'd send would be different. (This situation might occur if the deck were already partially ordered, so that the probability of a not-swap were higher than for a swap.) Suppose the probability of one of our four states were 1/2, and the other three states each had probability 1/6 of occurring. To extend our modeling of how many bits to send on average, we need to go to noninteger powers of 2 for probabilities. Then we can use a logarithm to ask how many (float) bits of information must be sent to transmit the information content. Equation (7.3) says that in this case, we'd have to send just (1/2) × log2 2 + 3 × (1/6) × log2 6 = 1.7925 bits, a value less than 2. This reflects the idea that if we could somehow encode our four states, such that the most-occurring one means fewer bits to send, we'd do better (fewer bits) on average.
The definition of entropy is aimed at identifying often-occurring symbols in the datastream as good candidates for short codewords in the compressed bitstream. As described earlier, we use a variable-length coding scheme for entropy coding: frequently-occurring symbols are given codes that are quickly transmitted, while infrequently-occurring ones are given longer codes. For example, E occurs frequently in English, so we should give it a shorter code than Q, say.

This aspect of "surprise" in receiving an infrequent symbol in the datastream is reflected in the definition (7.3). For if a symbol occurs rarely, its probability p_i is low (e.g., 1/100), and thus its logarithm is a large negative number. This reflects the fact that it takes a longer bitstring to encode it. The probabilities p_i sitting outside the logarithm in Eq. (7.3) say that over a long stream, the symbols come by with an average frequency equal to the probability of their occurrence. This weighting should multiply the long or short information content given by the element of "surprise" in seeing a particular symbol.

As another concrete example, if the information source S is a gray-level digital image, each s_i is a gray-level intensity ranging from 0 to (2^k - 1), where k is the number of bits used to represent each pixel in an uncompressed image. The range is often [0, 255], since 8 bits are typically used: this makes a convenient one byte per pixel. The image histogram (as discussed in Chapter 3) is a way of calculating the probability p_i of having pixels with gray-level intensity i in the image.

One wrinkle in the algorithm implied by Eq. (7.3) is that if a symbol occurs with zero frequency, we simply don't count it into the entropy: we cannot take a log of zero.
7.3 RUN-LENGTH CODING

Instead of assuming a memoryless source, run-length coding (RLC) exploits memory present in the information source. It is one of the simplest forms of data compression. The basic idea is that if the information source we wish to compress has the property that symbols tend to form continuous groups, then instead of coding each symbol in the group individually, we can code one such symbol and the length of the group.
As an example, consider a bilevel image (one with only 1-bit black and white pixels) with monotone regions. This information source can be efficiently coded using run-length coding. In fact, since there are only two symbols, we do not even need to code any symbol at the start of each run. Instead, we can assume that the starting run is always of a particular color (either black or white) and simply code the length of each run.
FIGURE 7.2: Histograms for two gray-level images.
The above description is the one-dimensional run-length coding algorithm. A two-dimensional variant of it is usually used to code bilevel images. This algorithm uses the coded run information in the previous row of the image to code the run in the current row. A full description of this algorithm can be found in [5].
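The 1-D scheme just described can be sketched in a few lines (illustrative code, not from the text). Following the convention above, the starting color is fixed, so a row that actually begins with the other color simply starts with a zero-length run:

```python
def rle_encode(bits, start=0):
    """1-D run-length code for a bilevel row: emit only the run lengths.
    A row beginning with the non-`start` color gets an initial zero-length run."""
    runs = []
    current, run = start, 0
    for b in bits:
        if b == current:
            run += 1
        else:
            runs.append(run)       # close the current run
            current, run = 1 - current, 1
    runs.append(run)
    return runs

def rle_decode(runs, start=0):
    """Rebuild the row from run lengths, alternating colors from `start`."""
    bits, current = [], start
    for run in runs:
        bits.extend([current] * run)
        current = 1 - current
    return bits
```

A row such as 0001101111 becomes the four run lengths 3, 2, 1, 4, and decoding alternates colors to restore the original pixels exactly.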
Figure 7.2(a) shows the histogram of an image with a uniform distribution of gray-level intensities, that is, for all i, p_i = 1/256. Hence, the entropy of this image is

η = Σ_{i=0}^{255} (1/256) × log2 256 = 8    (7.4)
As can be seen in Eq. (7.3), the entropy η is a weighted sum of terms log2 (1/p_i); hence it represents the average amount of information contained per symbol in the source S. For a memoryless source² S, the entropy η represents the minimum average number of bits required to represent each symbol in S. In other words, it specifies the lower bound for the average number of bits to code each symbol in S.

If we use l̄ to denote the average length (measured in bits) of the codewords produced by the encoder, the Shannon Coding Theorem states that the entropy is the best we can do (under certain conditions):

l̄ ≥ η    (7.5)

Coding schemes aim to get as close as possible to this theoretical lower bound.

It is interesting to observe that in the above uniform-distribution example we found that η = 8: the minimum average number of bits to represent each gray-level intensity is at least 8. No compression is possible for this image! In the context of imaging, this will correspond to the "worst case," where neighboring pixel values have no similarity.

Figure 7.2(b) shows the histogram of another image, in which 1/3 of the pixels are rather dark and 2/3 of them are rather bright. The entropy of this image is

η = (1/3) × log2 3 + (2/3) × log2 (3/2)
  = 0.33 × 1.59 + 0.67 × 0.59
  = 0.52 + 0.40
  = 0.92
In general, the entropy is greater when the probability distribution is flat and smaller when it is more peaked.

² An information source that is independently distributed, meaning that the value of the current symbol does not depend on the values of the previously appeared symbols.
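All of the entropy figures in this section can be verified with a direct transcription of Eq. (7.3); zero-probability symbols are skipped, per the "wrinkle" noted earlier:

```python
import math

def entropy(probs):
    """Entropy of Eq. (7.3): sum of -p * log2(p), skipping zero-probability symbols."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(round(entropy([1 / 256] * 256), 2))   # 8.0  : the flat histogram of Figure 7.2(a)
print(round(entropy([1 / 3, 2 / 3]), 2))    # 0.92 : the peaked histogram of Figure 7.2(b)
```

The card-sorting distribution from Section 7.2, (1/2, 1/6, 1/6, 1/6), likewise comes out to 1.7925 bits.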
7.4 VARIABLE-LENGTH CODING (VLC)

Since the entropy indicates the information content in an information source S, it leads to a family of coding methods commonly known as entropy coding methods. As described earlier, variable-length coding (VLC) is one of the best-known such methods. Here, we will study the Shannon-Fano algorithm, Huffman coding, and adaptive Huffman coding.

7.4.1
Shannon-Fano Algorithm

The Shannon-Fano algorithm was independently developed by Shannon at Bell Labs and Robert Fano at MIT [6]. To illustrate the algorithm, let's suppose the symbols to be coded are the characters in the word HELLO. The frequency count of the symbols is

Symbol  H  E  L  O
Count   1  1  2  1

The encoding steps of the Shannon-Fano algorithm can be presented in the following top-down manner:

1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol.

A natural way of implementing the above procedure is to build a binary tree. As a convention, let's assign bit 0 to its left branches and 1 to the right branches.
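The two steps translate directly into a short recursive routine. This sketch is illustrative only; as the text discusses, tie-breaking in the split is not unique, and here ties go to the first balanced split point:

```python
def shannon_fano(freqs):
    """Top-down Shannon-Fano codes from a symbol->count mapping."""
    symbols = sorted(freqs, key=lambda s: -freqs[s])   # step 1: sort by count
    codes = {}

    def split(syms, prefix):
        if len(syms) == 1:
            codes[syms[0]] = prefix or "0"
            return
        # step 2: find the division with approximately equal counts on each side
        total = sum(freqs[s] for s in syms)
        acc, best_i, best_diff = 0, 1, float("inf")
        for i in range(1, len(syms)):
            acc += freqs[syms[i - 1]]
            diff = abs(total - 2 * acc)    # |left count - right count|
            if diff < best_diff:
                best_diff, best_i = diff, i
        split(syms[:best_i], prefix + "0")  # left branch coded 0
        split(syms[best_i:], prefix + "1")  # right branch coded 1

    split(symbols, "")
    return codes
```

For the HELLO counts this reproduces one valid result, L -> 0, H -> 10, E -> 110, O -> 111, for 10 bits in total.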
FIGURE 7.3: Coding tree for HELLO by the Shannon-Fano algorithm.

Initially, the symbols are sorted as LHEO. As Figure 7.3 shows, the first division yields two parts: (a) L with a count of 2, denoted as L:(2); and (b) H, E and O with a total count of 3, denoted as H,E,O:(3). The second division yields H:(1) and E,O:(2). The last division is E:(1) and O:(1).

Table 7.1 summarizes the result, showing each symbol, its frequency count, information content (log2 (1/p_i)), resulting codeword, and the number of bits needed to encode each symbol in the word HELLO. The total number of bits used is shown at the bottom. To revisit the previous discussion on entropy, in this case,
η = p_L log2 (1/p_L) + p_H log2 (1/p_H) + p_E log2 (1/p_E) + p_O log2 (1/p_O)
  = 0.4 × 1.32 + 0.2 × 2.32 + 0.2 × 2.32 + 0.2 × 2.32
  = 1.92
TABLE 7.1: One result of performing the Shannon-Fano algorithm on HELLO.

Symbol  Count  log2 (1/p_i)  Code  Number of bits used
L       2      1.32          0     2
H       1      2.32          10    2
E       1      2.32          110   3
O       1      2.32          111   3
TOTAL number of bits: 10
7.4.2 Huffman Coding

First presented by David A. Huffman in a 1952 paper [7], this method attracted an overwhelming amount of research and has been adopted in many important and/or commercial applications, such as fax machines, JPEG, and MPEG. In contradistinction to Shannon-Fano, which is top-down, the encoding steps of the Huffman algorithm are described in the following bottom-up manner. Let's use the same example word, HELLO. A similar binary coding tree will be used as above, in which the left branches are coded 0 and right branches 1. A simple list data structure is also used.
TABLE 7.2: Another result of performing the Shannon–Fano algorithm on HELLO.

Symbol | Count | log2(1/p_i) | Code | Number of bits used
L      | 2     | 1.32        | 00   | 4
H      | 1     | 2.32        | 01   | 2
E      | 1     | 2.32        | 10   | 2
O      | 1     | 2.32        | 11   | 2
TOTAL number of bits: 10
This suggests that the minimum average number of bits to code each character in the word HELLO would be at least 1.92. In this example, the Shannon–Fano algorithm uses an average of 10/5 = 2 bits to code each symbol, which is fairly close to the lower bound of 1.92. Apparently, the result is satisfactory.

It should be pointed out that the outcome of the Shannon–Fano algorithm is not necessarily unique. For instance, at the first division in the above example, it would be equally valid to divide into the two parts L,H:(3) and E,O:(2). This would result in the coding in Figure 7.4. Table 7.2 shows the codewords are different now. Also, these two sets of codewords may behave differently when errors are present. Coincidentally, the total number of bits required to encode the word HELLO remains at 10.

The Shannon–Fano algorithm delivers satisfactory coding results for data compression, but it was soon outperformed and overtaken by the Huffman coding method.
FIGURE 7.4: Another coding tree for HELLO by the Shannon–Fano algorithm.
FIGURE 7.5: Coding tree for HELLO using the Huffman algorithm.
ALGORITHM 7.1 HUFFMAN CODING

1. Initialization: put all symbols on the list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left:
   (a) From the list, pick two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes and create a parent node for them.
   (b) Assign the sum of the children's frequency counts to the parent and insert it into the list, such that the order is maintained.
   (c) Delete the children from the list.
3. Assign a codeword for each leaf based on the path from the root.

In the above figure, new symbols P1, P2, P3 are created to refer to the parent nodes in the Huffman coding tree. The contents in the list are illustrated below:

After initialization:  L H E O
After iteration (a):   L P1 H
After iteration (b):   L P2
After iteration (c):   P3

For this simple example, the Huffman algorithm apparently generated the same coding result as one of the Shannon–Fano results shown in Figure 7.3, although the results are usually better. The average number of bits used to code each character is also 2 (i.e., (1 + 1 + 2 + 3 + 3)/5 = 2). As another simple example, consider a text string containing a set of characters and their frequency counts as follows: A:(15), B:(7), C:(6), D:(6) and E:(5). It is easy to show that the Shannon–Fano algorithm needs a total of 89 bits to encode this string, whereas the Huffman algorithm needs only 87.

As shown above, if correct probabilities ("prior statistics") are available and accurate, the Huffman coding method produces good compression results. Decoding for the Huffman coding is trivial as long as the statistics and/or coding tree are sent before the data to be compressed (in the file header, say). This overhead becomes negligible if the data file is sufficiently large.

The following are important properties of Huffman coding:

• Unique prefix property. No Huffman code is a prefix of any other Huffman code. For instance, the code 0 assigned to L in Figure 7.5(c) is not a prefix of the code 10 for H or 110 for E or 111 for O; nor is the code 10 for H a prefix of the code 110 for E or 111 for O. It turns out that the unique prefix property is guaranteed by the above Huffman algorithm, since it always places all input symbols at the leaf nodes of the Huffman tree. The Huffman code is one of the prefix codes for which the unique prefix property holds. The code generated by the Shannon–Fano algorithm is another such example. This property is essential and also makes for an efficient decoder, since it precludes any ambiguity in decoding. In the above example, if a bit 0 is received, the decoder can immediately produce a symbol L without waiting for any more bits to be transmitted.

• Optimality. The Huffman code is a minimum-redundancy code, as shown in Huffman's 1952 paper [7]. It has been proven [8,2] that the Huffman code is optimal for a given data model (i.e., a given, accurate, probability distribution):
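Algorithm 7.1's sorted list maps naturally onto a min-heap. The following sketch (our own names, not the book's code) builds the tree bottom-up and then walks it to assign codewords; the tie-breaking order may differ from Figure 7.5, but the total code length does not:

```python
import heapq
from collections import Counter

def huffman_codes(counts):
    """Build Huffman codewords from {symbol: count} using a min-heap."""
    # Heap entries: (count, tie_breaker, tree); a tree is a symbol or a pair.
    heap = [(c, i, s) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)  # two lowest-count subtrees
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tie, (t1, t2)))  # new parent node
        tie += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left branch: bit 0
            walk(tree[1], prefix + "1")  # right branch: bit 1
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

counts = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman_codes(counts)
total = sum(counts[s] * len(codes[s]) for s in counts)  # 87 bits
```

For the A:(15) ... E:(5) string this yields the 87-bit total quoted in the text, and for HELLO it again averages 2 bits per symbol.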
Variabie-Length Coding (VLC)
  – The two least frequent symbols will have the same length for their Huffman codes, differing only at the last bit. This should be obvious from the above algorithm.

  – Symbols that occur more frequently will have shorter Huffman codes than symbols that occur less frequently. Namely, for symbols s_i and s_j, if p_i ≥ p_j then l_i ≤ l_j, where l_i is the number of bits in the codeword for s_i.

One way to address the problem of integral codeword length is to group several symbols and assign a single codeword to the group. Huffman coding of this type is called Extended Huffman Coding [2]. Assume an information source has alphabet S = {s1, s2, ..., sn}. If k symbols are grouped together, then the extended alphabet (each group containing k symbols) is

    S^(k) = {s1 s1 ... s1, s1 s1 ... s2, ..., s1 s1 ... sn, s1 s1 ... s2 s1, ..., sn sn ... sn}

Note that the size of the new alphabet S^(k) is n^k. If k is relatively large (e.g., k > 3), then for most practical applications where n ≫ 1, n^k would be a very large number, implying a huge symbol table. This overhead makes Extended Huffman Coding impractical. As shown in [2], if the entropy of S is η, then the average number of bits needed for each symbol in S is now

    η ≤ l̄ < η + 1/k    (7.7)

so we have shaved quite a bit from the coding schemes' bracketing of the theoretical best limit. Nevertheless, this is not as much of an improvement over the original Huffman coding (where group size is 1) as one might have hoped for.

Procedures for Adaptive Huffman Coding

    ENCODER                      DECODER
    -------                      -------
    Initial_code();              Initial_code();
    while not EOF                while not EOF
    {                            {
        get(c);                      decode(c);
        encode(c);                   output(c);
        update_tree(c);              update_tree(c);
    }                            }

• Initial_code assigns symbols with some initially agreed-upon codes, without any prior knowledge of the frequency counts for them. For example, some conventional code such as ASCII may be used for coding character symbols.

• update_tree is a procedure for constructing an adaptive Huffman tree. It basically does two things: it increments the frequency counts for the symbols (including any new ones), and updates the configuration of the tree.
Although it is possible to group symbols into metasymbols for codeword assignment (as in Extended Huffman Coding) to overcome the limitation of an integral number of bits per symbol, the increase in the resultant symbol table required by the Huffman encoder and decoder would be formidable. Arithmetic coding can treat the whole message as one unit. In practice, the input data is usually broken up into chunks to avoid error propagation. However, in our presentation below, we take a simplistic approach and include a terminator symbol.

A message is represented by a half-open interval [a, b) where a and b are real numbers between 0 and 1. Initially, the interval is [0, 1). When the message becomes longer, the length of the interval shortens, and the number of bits needed to represent the interval increases. Suppose the alphabet is [A, B, C, D, E, F, $], in which $ is a special symbol used to terminate the message, and the known probability distribution is as shown in Figure 7.8(a).
ALGORITHM 7.5 ARITHMETIC CODING ENCODER

BEGIN
    low = 0.0;  high = 1.0;  range = 1.0;
    while (symbol != terminator)
    {
        get(symbol);
        high = low + range * Range_high(symbol);
        low  = low + range * Range_low(symbol);
        range = high - low;
    }
    output a code so that low <= code < high;
END

(Computing high before low ensures that both updates use the old value of low.)
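The encoder loop can be exercised directly in Python. The interval table below is an assumption standing in for Figure 7.8(a), which is not reproduced in this excerpt; it is chosen so that CAEE$ ends with range 0.00036, consistent with the discussion in the text:

```python
# Assumed symbol intervals standing in for Figure 7.8(a).
RANGES = {
    "A": (0.00, 0.20), "B": (0.20, 0.30), "C": (0.30, 0.50),
    "D": (0.50, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.90),
    "$": (0.90, 1.00),  # '$' terminates the message
}

def arithmetic_encode(message):
    """Shrink [low, high) symbol by symbol; return the final interval."""
    low, high, rng = 0.0, 1.0, 1.0
    for symbol in message:
        # Update high before low so both use the old value of low.
        high = low + rng * RANGES[symbol][1]
        low = low + rng * RANGES[symbol][0]
        rng = high - low
    return low, high  # any number in [low, high) encodes the message

low, high = arithmetic_encode("CAEE$")
# The 8-bit codeword 0.01010101 (binary) = 0.33203125 lies in [low, high)
```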
high. Hence, we assign 0 to the first bit. Since value(0.0) = 0 < low, the while loop continues. Assigning 1 to the second bit makes a binary code 0.01 and value(0.01) = 0.25, which is less than high, so it is accepted. Since it is still true that value(0.01) < low, the iteration continues. Eventually, the binary codeword generated is 0.01010101, which is 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.33203125.

It must be pointed out that we were lucky to have found a codeword of only 8 bits to represent this sequence of symbols CAEE$. In this case, log2(1/P_C) + log2(1/P_A) + log2(1/P_E) + log2(1/P_E) + log2(1/P_$) = log2(1/range) ≈ 11.44, which would suggest that it could take 12 bits to encode a string of symbols like this.
TABLE 7.5: Arithmetic coding: decode symbols CAEE$.

It can be proven [2] that ⌈log2(1/∏_i P_i)⌉ is the upper bound. Namely, in the worst case, the shortest codeword in arithmetic coding will require k bits to encode a sequence of symbols, and

    k = ⌈log2(1/range)⌉ = ⌈log2(1/∏_i P_i)⌉    (7.8)

where P_i is the probability for symbol i and range is the final range generated by the encoder. Apparently, when the length of the message is long, its range quickly becomes very small, and hence log2(1/range) becomes very large; the difference between log2(1/range) and ⌈log2(1/range)⌉ is negligible.

Generally, Arithmetic Coding achieves better performance than Huffman coding, because the former treats an entire sequence of symbols as one unit, whereas the latter has the restriction of assigning an integral number of bits to each symbol. For example, Huffman coding would require 12 bits for CAEE$, equaling the worst-case performance of Arithmetic Coding. Moreover, Huffman coding cannot always attain the upper bound illustrated in Eq. (7.8). It can be shown (see Exercise 5) that if the alphabet is [A, B, C] and the known probability distribution is P_A = 0.5, P_B = 0.4, P_C = 0.1, then for sending BBB, Huffman coding will require 6 bits, which is more than ⌈log2(1/∏_i P_i)⌉ = 4, whereas arithmetic coding will need only 4 bits.
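The BBB comparison is a one-line check (our own sketch, using the stated probabilities):

```python
import math

p_B = 0.4
# Worst-case arithmetic-coding bound of Eq. (7.8) for the string BBB.
arithmetic_bits = math.ceil(math.log2(1 / p_B**3))
# Under the Huffman code for {A: 0.5, B: 0.4, C: 0.1}, each B costs 2 bits.
huffman_bits = 3 * 2
```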
ALGORITHM 7.6 ARITHMETIC CODING DECODER

BEGIN
    get binary code; value = value(code);
    Do
        find a symbol s so that
            Range_low(s) <= value ...

Comparing the original ramp function with approximations using only three terms of both the DCT and DFT, we notice that the DCT approximation is much closer. Figure 8.11 shows the comparison.

FIGURE 8.11: Approximation of the ramp function: (a) three-term DCT approximation; (b) three-term DFT approximation.

8.5.2 Karhunen–Loève Transform*

The Karhunen–Loève Transform (KLT) is a reversible linear transform that exploits the statistical properties of the vector representation. Its primary property is that it optimally decorrelates the input. To do so, it fits an n-dimensional ellipsoid around the (mean-subtracted) data. The main ellipsoid axis is the major direction of change in the data.

Think of a cigar that has unfortunately been stepped on. Cigar data consists of a cloud of points in 3-space giving the coordinates of positions of measured points in the cigar. The long axis of the cigar will be identified by a statistical program as the first KLT axis. The second most important axis is the horizontal axis across the squashed cigar, perpendicular to the first axis. The third axis is orthogonal to both and is in the vertical, thin direction. A KLT component program carries out just this analysis.

To understand the optimality of the KLT, consider the autocorrelation matrix R_X of the input vector X, defined as

    R_X = E[XX^T]    (8.30)

        = | R_X(1,1)  R_X(1,2)    ...  R_X(1,k)   |
          | R_X(2,1)  R_X(1,1)    ...  R_X(2,k-1) |
          |   ...        ...      ...     ...     |
          | R_X(k,1)  R_X(k-1,1)  ...  R_X(1,1)   |    (8.31)

Let u_1, u_2, ..., u_k be the eigenvectors of R_X, ordered so that the corresponding eigenvalues satisfy λ1 ≥ λ2 ≥ ... ≥ λk > 0. We define the Karhunen–Loève transform as

    T = [u_1, u_2, ..., u_k]^T    (8.32)

Then, the autocorrelation matrix of Y becomes

    R_Y = E[YY^T]    (8.33)
        = E[T X X^T T^T]    (8.34)
        = T R_X T^T    (8.35)
        = diag(λ1, λ2, ..., λk)    (8.36)

Clearly, we have the required autocorrelation matrix for Y. Therefore, the KLT is optimal, in the sense that it completely decorrelates the input. In addition, since the KLT depends on the computation of the autocorrelation matrix of the input vector, it is data dependent: it has to be computed for every dataset.

EXAMPLE 8.5

To illustrate the mechanics of the KLT, consider the four 3D input vectors x1 = (4, 4, 5), x2 = (3, 2, 5), x3 = (5, 7, 6), and x4 = (6, 7, 7). To find the required transform, we must first estimate the autocorrelation matrix of the input. The mean of the four input vectors is

    m_x = (1/4) [18, 20, 23]^T

We can estimate the autocorrelation matrix using the formula

    R_X = (1/n) Σ_{i=1}^{n} x_i x_i^T − m_x m_x^T

where n is the number of input vectors. From this equation, we obtain

    R_X = | 1.25  2.25  0.88 |
          | 2.25  4.50  1.50 |
          | 0.88  1.50  0.69 |    (8.37)
Chapter 8
Lossy CompressiOn Algorithms
The eigenvalues of R_X are λ1 = 6.1963, λ2 = 0.2147, and λ3 = 0.0264. Clearly, the first component is by far the most important. The corresponding eigenvectors are

    u1 = |  0.4385 |    u2 = |  0.4460 |    u3 = | -0.7803 |
         |  0.8471 |         | -0.4952 |         |  0.1929 |
         |  0.3003 |         |  0.7456 |         |  0.5949 |

Therefore, the KLT is given by the matrix

    T = |  0.4385   0.8471   0.3003 |
        |  0.4460  -0.4952   0.7456 |
        | -0.7803   0.1929   0.5949 |
Subtracting the mean vector from each input vector and applying the KLT, we have
    y1 = | -1.2916 |    y2 = | -3.4242 |
         | -0.2870 |         |  0.2573 |
         | -0.2490 |         |  0.1453 |

    y3 = |  1.9885 |    y4 = |  2.7273 |
         | -0.5809 |         |  0.6107 |
         |  0.1445 |         | -0.0408 |
Since the rows of T are orthonormal vectors, the inverse transform is just the transpose: T^-1 = T^T. We can obtain the original vectors from the transform coefficients using the inverse relation

    x = T^T y + m_x    (8.38)
In terms of the transform coefficients y_i, the magnitude of the first few components is usually considerably larger than that of the other components. In general, after the KLT, most of the "energy" of the transform coefficients is concentrated within the first few components. This is the energy compaction property of the KLT. For an input vector x with n components, if we coarsely quantize the output vector y by setting its last k components to zero, calling the resulting vector ŷ, the KLT minimizes the mean squared error between the original vector and its reconstruction.
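Example 8.5 can be reproduced numerically (a sketch using NumPy's `numpy.linalg.eigh`; variable names are ours, and eigenvector signs may differ from the printed values):

```python
import numpy as np

X = np.array([[4, 4, 5], [3, 2, 5], [5, 7, 6], [6, 7, 7]], dtype=float)
m = X.mean(axis=0)                        # mean vector m_x
Rx = (X.T @ X) / len(X) - np.outer(m, m)  # Rx = (1/n) sum x x^T - m m^T

# eigh returns eigenvalues in ascending order; flip to descending.
lam, U = np.linalg.eigh(Rx)
lam, U = lam[::-1], U[:, ::-1]
T = U.T                                   # KLT matrix: rows are eigenvectors

Y = (X - m) @ T.T                         # transform coefficients y_i
Ry = (Y.T @ Y) / len(Y)                   # should equal diag(lambda_i)
X_rec = Y @ T + m                         # inverse: x = T^T y + m_x
```

The off-diagonal entries of Ry vanish (complete decorrelation), and the inverse relation recovers the original vectors exactly.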
8.6 WAVELET-BASED CODING

8.6.1 Introduction

Decomposing the input signal into its constituents allows us to apply coding techniques suitable for each constituent, to improve compression performance. Consider again a time-dependent signal f(t) (it is best to base discussion on continuous functions to start with). The traditional method of signal decomposition is the Fourier transform. Above, in our discussion of the DCT, we considered a special cosine-based transform. If we carry out analysis based on both sine and cosine, then a concise notation assembles the results into a
function F(ω), a complex-valued function of real-valued frequency ω, given in Eq. (8.26). Such decomposition results in very fine resolution in the frequency domain. However, since a sinusoid is theoretically infinite in extent in time, such a decomposition gives no temporal resolution.

Another method of decomposition that has gained a great deal of popularity in recent years is the wavelet transform. It seeks to represent a signal with good resolution in both time and frequency, by using a set of basis functions called wavelets. There are two types of wavelet transforms: the Continuous Wavelet Transform (CWT) and the Discrete Wavelet Transform (DWT). We assume that the CWT is applied to the large class of functions f(x) that are square integrable on the real line, that is, ∫ [f(x)]² dx < ∞. In mathematics, this is written as f(x) ∈ L²(R). The other kind of wavelet transform, the DWT, operates on discrete samples of the input signal. The DWT resembles other discrete linear transforms, such as the DFT or the DCT, and is very useful for image processing and compression.

Before we begin a discussion of the theory of wavelets, let's develop an intuition about this approach by going through an example using the simplest wavelet transform, the so-called Haar Wavelet Transform, to form averages and differences of a sequence of float values. If we repeatedly take averages and differences and keep results for every step, we effectively create a multiresolution analysis of the sequence. For images, this would be equivalent to creating smaller and smaller summary images, one-quarter the size for each step, and keeping track of differences from the average as well. Mentally stacking the full-size image, the quarter-size image, the sixteenth-size image, and so on, creates a pyramid. The full set, along with difference images, is the multiresolution decomposition.

EXAMPLE 8.6
A Simple Wavelet Transform
The objective of the wavelet transform is to decompose the input signal, for compression purposes, into components that are easier to deal with, have special interpretations, or have some components that can be thresholded away. Furthermore, we want to be able to at least approximately reconstruct the original signal, given these components. Suppose we are given the following input sequence:

    {x_{n,i}} = (10, 13, 25, 26, 29, 21, 7, 15)    (8.39)

Here, i ∈ [0..7] indexes "pixels", and n stands for the level of a pyramid we are on. At the top, n = 3 for this sequence, and we shall form three more sequences, for n = 2, 1, and 0. At each level, less information will be retained in the beginning elements of the transformed signal sequence. When we reach pyramid level n = 0, we end up with the sequence average stored in the first element. The remaining elements store detail information.

Consider the transform that replaces the original sequence with its pairwise average x_{n-1,i} and difference d_{n-1,i}, defined as follows:

    x_{n-1,i} = (x_{n,2i} + x_{n,2i+1}) / 2    (8.40)
    d_{n-1,i} = (x_{n,2i} - x_{n,2i+1}) / 2    (8.41)
Notice that the averages and differences are applied only on consecutive pairs of input sequences whose first element has an even index. Therefore, the number of elements in each set {x_{n-1,i}} and {d_{n-1,i}} is exactly half the number of elements in the original sequence. We can form a new sequence having length equal to that of the original sequence by concatenating the two sequences {x_{n-1,i}} and {d_{n-1,i}}. The resulting sequence is thus

    {x_{n-1,i}, d_{n-1,i}} = (11.5, 25.5, 25, 11, -1.5, -0.5, 4, -4)    (8.42)
where we are now aI levei ri — 1 = 2. This sequence has exacLly Lhe sarne number of elements as lhe input sequence — lhe Lransform did not increase che arnount of daLa. Since Lhe first half of lhe above sequence contains averages from lhe original sequence, we can view it as a coarser approximaLion tolhe original signal. The second haif of this sequence can be viewed as lhe details or approximaLion errors of lhe firsl half. MosI of lhe values in lhe detail sequence are much smaller lhan Lhose of lhe original sequence. Thus, most of lhe energy is effectively concentrated in lhe firsl haif. Therefore, we can potentially store (4_j,t) using fewer bits. lt is easily verified lhat lhe original sequence can be reconstructed from lhe Lransformed sequence, using Lhe relations
Xn.21+I
=
x~_~ +d~.-~j
=
Xn_ij
—
(8.43)
~
This transform is the discrete Haar wavelet transform. Averaging and differencing can be carried out by applying a so-called scaling function and wavelet function along the signal. Figure 8.12 shows the Haar version of these functions.
We can further apply the same transform to {x_{n-1,i}}, to obtain another level of approximation x_{n-2,i} and detail d_{n-2,i}:

    {x_{n-2,i}, d_{n-2,i}, d_{n-1,i}} = (18.5, 18, -7, 7, -1.5, -0.5, 4, -4)    (8.44)

This is the essential idea of multiresolution analysis. We can now study the input signal in three different scales, along with the details needed to go from one scale to another. This process can continue n times, until only one element is left in the approximation sequence. In this case, n = 3, and the final sequence is given below:

    {x_{n-3,i}, d_{n-3,i}, d_{n-2,i}, d_{n-1,i}} = (18.25, 0.25, -7, 7, -1.5, -0.5, 4, -4)    (8.45)
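The whole pyramid can be generated mechanically (our own small sketch of the averaging/differencing recursion):

```python
def haar_level(seq):
    """One level: pairwise averages followed by pairwise differences."""
    avgs = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
    diffs = [(seq[i] - seq[i + 1]) / 2 for i in range(0, len(seq), 2)]
    return avgs, diffs

signal = [10, 13, 25, 26, 29, 21, 7, 15]
approx, details = signal, []
while len(approx) > 1:
    approx, d = haar_level(approx)
    details = d + details          # coarser details go in front
result = approx + details          # [18.25, 0.25, -7, 7, -1.5, -0.5, 4, -4]
```

The loop reproduces the final sequence of Eq. (8.45), with the overall average 18.25 in the first slot.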
Now we realize that n was 3 because only three resolution changes were available until we reached the final form. The value 18.25, corresponding to the coarsest approximation to the original signal, is the average of all the elements in the original sequence. From this example, it is easy to see that the cost of computing this transform is proportional to the number of elements N in the input sequence, that is, O(N).

Extending the one-dimensional Haar wavelet transform into two dimensions is relatively easy: we simply apply the one-dimensional transform to the rows and columns of the two-dimensional input separately. We will demonstrate the two-dimensional Haar transform applied to the 8 x 8 input image shown in Figure 8.13.

EXAMPLE 8.7
2D Haar Transform
FIGURE 8.12: Haar Wavelet Transform: (a) scaling function; (b) wavelet function.

This example of the 2D Haar transform not only serves to illustrate how the wavelet transform is applied to two-dimensional inputs but also points out useful interpretations of the transformed coefficients. However, it is intended only to provide the reader with an intuitive feeling of the kinds of operations involved in performing a general 2D wavelet transform. Subsequent sections provide more detailed descriptions of the forward and inverse 2D wavelet transform algorithms, as well as a more elaborate example using a more complex wavelet.

2D Haar Wavelet Transform. We begin by applying a one-dimensional Haar wavelet transform to each row of the input:

      0    0    0    0    0    0    0    0
      0    0    0    0    0    0    0    0
      0    0   63  127  127   63    0    0
      0    0  127  255  255  127    0    0
      0    0  127  255  255  127    0    0
      0    0   63  127  127   63    0    0
      0    0    0    0    0    0    0    0
      0    0    0    0    0    0    0    0

FIGURE 8.13: Input image for the 2D Haar Wavelet Transform: (a) pixel values; (b) an 8 x 8 image.

The first and last two rows of the input are trivial. After performing the averaging and differencing operations on the remaining rows, we obtain the intermediate output shown in Figure 8.14.

FIGURE 8.14: Intermediate output of the 2D Haar Wavelet Transform.

We continue by applying the same 1D Haar transform to each column of the intermediate output. This step completes one level of the 2D Haar transform. Figure 8.15 gives the resulting coefficients.

We can naturally divide the result into four quadrants. The upper left quadrant contains the averaged coefficients from both the horizontal and vertical passes. Therefore, it can be viewed as a low-pass-filtered version of the original image, in the sense that higher-frequency edge information is lost, while low-spatial-frequency smooth information is retained. The upper right quadrant contains the vertical averages of the horizontal differences and can be interpreted as information about the vertical edges within the original image. Similarly, the lower left quadrant contains the vertical differences of the horizontal averages and represents the horizontal edges in the original image. The lower right quadrant contains the differences from both the horizontal and vertical passes. The coefficients in this quadrant represent diagonal edges.

These interpretations are shown more clearly as images in Figure 8.16, where bright pixels code positive and dark pixels code negative image values.

FIGURE 8.16: A simple graphical illustration of the Wavelet Transform.

The inverse of the 2D Haar transform can be calculated by first inverting the columns using Eq. (8.43), and then inverting the resulting rows.

FIGURE 8.15: Output of the first level of the 2D Haar Wavelet Transform.
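One level of the 2D transform can be sketched by reusing the 1D step on rows and then on columns (helper code and names are ours):

```python
def haar_1d(v):
    """One level of the 1D Haar transform: averages then differences."""
    a = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    d = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return a + d

img = [
    [0, 0,   0,   0,   0,   0, 0, 0],
    [0, 0,   0,   0,   0,   0, 0, 0],
    [0, 0,  63, 127, 127,  63, 0, 0],
    [0, 0, 127, 255, 255, 127, 0, 0],
    [0, 0, 127, 255, 255, 127, 0, 0],
    [0, 0,  63, 127, 127,  63, 0, 0],
    [0, 0,   0,   0,   0,   0, 0, 0],
    [0, 0,   0,   0,   0,   0, 0, 0],
]

rows = [haar_1d(r) for r in img]                    # transform each row
cols = list(zip(*rows))                             # transpose
out = list(zip(*[haar_1d(list(c)) for c in cols]))  # then each column

# Upper-left 4x4 quadrant: the low-pass (average-average) coefficients.
LL = [[out[r][c] for c in range(4)] for r in range(4)]
```

The upper-left quadrant comes out as a blurred, quarter-size version of the input, as described above.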
8.6.2 Continuous Wavelet Transform*

We noted that the motivation for the use of wavelets is to provide a set of basis functions that decompose a signal in time over parameters in the frequency domain and the time domain simultaneously. A Fourier transform aims to pin down only the frequency content of a signal, in terms of spatially varying rather than time varying signals. What wavelets aim to do is pin down the frequency content at different parts of the image. For example, one part of the image may be "busy" with texture and thus high-frequency content, while another part may be smooth, with little high-frequency content. Naturally, one can think of obvious ways to consider frequencies for localized areas of an image: divide an image into parts and fire away with Fourier analysis. The time-sequence version of that idea is called the Short-Term (or Windowed) Fourier Transform. And other ideas have also arisen. However, it turns out that wavelets, a much newer development, have neater characteristics.

To further motivate the subject, we should consider the Heisenberg uncertainty principle, from physics. In the context of signal processing, this says that there is a tradeoff between accuracy in pinning down a function's frequency, and its extent in time. We cannot do both
accurately, in general, and still have a useful basis function. For example, a sine wave is exact in terms of its frequency but infinite in extent. As an example of a function that dies away quickly and also has limited frequency content, suppose we start with a Gaussian function,

    f(t) = (1 / (√(2π) σ)) e^(-t² / (2σ²))    (8.46)

The parameter σ expresses the scale of the Gaussian (bell-shaped) function. The second derivative of this function, called ψ(t), looks like a Mexican hat, as in Figure 8.17(a). Clearly, the function ψ(t) is limited in time. Its equation is as follows:

    ψ(t) = (1 / (√(2π) σ³)) (t²/σ² - 1) e^(-t² / (2σ²))    (8.47)

We can explore the frequency content of function ψ(t) by taking its Fourier transform. This turns out to be given by

    Ψ(ω) = ω² e^(-σ² ω² / 2)    (8.48)

FIGURE 8.17: A Mexican-hat Wavelet: (a) σ = 0.5; (b) its Fourier transform.

Figure 8.17(b) displays this function: the candidate wavelet (8.47) is indeed limited in frequency as well. In general, a wavelet is a function ψ ∈ L²(R) with a zero average,

    ∫_{-∞}^{+∞} ψ(t) dt = 0    (8.49)

that satisfies some conditions that ensure it can be utilized in a multiresolution decomposition. The conditions ensure that we can use the decomposition for zooming in locally in some part of an image, much as we might be interested in closer or farther views of some neighborhood in a map.

The constraint (8.49) is called the admissibility condition for wavelets. A function that sums to zero must oscillate around zero. Also, from (8.26), we see that the DC value, the Fourier transform of ψ(t) for ω = 0, is zero. Another way to state this is that the 0th moment M_0 of ψ(t) is zero. The pth moment is defined as

    M_p = ∫_{-∞}^{+∞} t^p ψ(t) dt    (8.50)

The function ψ is normalized with ‖ψ‖ = 1 and centered in the neighborhood of t = 0. We can obtain a family of wavelet functions by scaling and translating the mother wavelet ψ as follows:

    ψ_{s,u}(t) = (1/√s) ψ((t - u) / s)    (8.51)

If ψ(t) is normalized, so is ψ_{s,u}(t). The Continuous Wavelet Transform (CWT) of f ∈ L²(R) at time u and scale s is defined as

    W(f, s, u) = ∫_{-∞}^{+∞} f(t) ψ_{s,u}(t) dt    (8.52)
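As a numerical sanity check (our own sketch), the Mexican-hat wavelet of Eq. (8.47) does integrate to zero, as the admissibility condition requires:

```python
import math

def psi(t, sigma=0.5):
    """Mexican-hat wavelet: second derivative of a Gaussian, Eq. (8.47)."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma**3)) * \
           ((t * t) / (sigma * sigma) - 1.0) * \
           math.exp(-t * t / (2 * sigma * sigma))

dt = 0.001
ts = [-10 + i * dt for i in range(20001)]  # uniform grid over [-10, 10]
m0 = sum(psi(t) for t in ts) * dt          # 0th moment: should vanish
```

The central dip of the hat (psi(0) < 0) is balanced exactly by the two positive side lobes, which is what drives m0 to zero.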
The CWT of a 1D signal is a 2D function: a function of both scale s and shift u. A very important issue is that, in contradistinction to (8.26), where the Fourier analysis function is stipulated to be the sinusoid, here (8.52) does not state what ψ(t) actually is! Instead, we create a set of rules such functions must obey and then invent useful functions that obey these rules: different functions for different uses. Just as we defined the DCT in terms of products of a function with a set of basis functions, here the transform W is written in terms of inner products with basis functions that are a scaled and shifted version of the mother wavelet ψ(t).

The mother wavelet ψ(t) is a wave, since it must be an oscillatory function. Why is it a wavelet? The spatial-frequency analyzer parameter in (8.52) is s, the scale. We choose some scale s and see how much content the signal has around that scale. To make the function decay rapidly, away from the chosen s, we have to choose a mother wavelet ψ(t) that decays as fast as some power of s. It is actually easy to show, from (8.52), that if all moments of ψ(t) up to the nth are zero (or quite small, practically speaking), then the CWT coefficient W(f, s, u) has a Taylor expansion around u = 0 that is of order s^(n+2) (see Exercise 9). This is the localization in frequency we desire in a good mother wavelet.

We derive wavelet coefficients by applying wavelets at different scales over many locations of the signal. Excitingly, if we shrink the wavelets down small enough that they cover a part of the function f(t) that is a polynomial of degree n or less, the coefficient for that wavelet and all smaller ones will be zero. The condition that the wavelet should have vanishing moments up to some order is one way of characterizing mathematical regularity conditions on the mother wavelet. The inverse of the continuous wavelet transform is:
    f(t) = (1/C_ψ) ∫_0^{+∞} ∫_{-∞}^{+∞} W(f, s, u) (1/√s) ψ((t - u)/s) (1/s²) du ds    (8.53)
where

    C_ψ = ∫_0^{+∞} (|Ψ(ω)|² / ω) dω    (8.54)

and Ψ(ω) is the Fourier transform of ψ(t). Eq. (8.54) is another phrasing of the admissibility condition.

The trouble with the CWT is that (8.52) is nasty: most wavelets are not analytic but result simply from numerical calculations. The resulting infinite set of scaled and shifted functions is not necessary for the analysis of sampled functions, such as the ones that arise in image processing. For this reason, we apply the ideas that pertain to the CWT to the discrete domain.
weights limes Lhe ievei-j basis. The scaiing funclion is chosen such ihaL Lhe coefficients of lis iransiates are ali necessariiy bounded (iess lhan infiniLe). The scaiing funcLion, aiong wilh its lransiaLes, forms a basis aL ihe coarser levei j + 1 (say 3, or lhe 1/8 levei) bui noL aL levei j (say 2, or lhe 1/4 ievel). InsLead, ai ievei j lhe sei of Lransiaies of lhe scaiing funciion 4’ a!ong with the sei of trans/ates of lhe ,noiher wavelet4’ do form a basis. We are ieft wilh lhe siLualion ihaL Lhe scaling funcLion describes smooth, or approximalion, information and lhe wavelei describes whai is ieft over — delail informaiion. Since Lhe seL of Iransiates of ihe scaiing function 4’ ai a coarser ievei can be wriiten exactiy as a weighLed sum of lhe iransiaies ata finer ievei, Lhe scaiing funciion musi satisfy Lhe so-caiied dilation equation [7]: «1)
‘Jihc[n]Ø(2t
=
—
n)
(8.56)
neZ
8.6.3
Discrete Wavelet TransfOrlfl* Discrete wavelets are again formed from a molher waveiet, but with scaie and shift in discrete steps. MnitiresolutiOfl Anaiysis and lhe Discrete Wavelet Transform. The connection be tween wavelets in lhe continuous time domam andflhter banks in Lhe discrete time domam is muitiresolution anaiysis; we discuss lhe DWT within ihis frameworic. Mailai [5] showed that it is possibie lo construct waveiets ~ such Lhat lhe diiated and Lraflsiated family
{
ψ_{j,n}(t) = 2^{-j/2} ψ(2^{-j} t - n),  (j, n) ∈ Z²   (8.55)

is an orthonormal basis of L²(R), where Z represents the set of integers. This is known as "dyadic" scaling and translation and corresponds to the notion of zooming out in a map by factors of 2. (If we draw a cosine function cos(t) from time 0 to 2π and then draw cos(t/2), we see that while cos(t) goes over a whole cycle, cos(t/2) has only a half cycle: the function cos(2^{-1} t) is a wider function and thus is at a broader scale.) Note that we change the scale of translations along with the overall scale 2^j, so as to keep movement in the lower-resolution image in proportion. Notice also that the notation used says that a larger index j corresponds to a coarser version of the image.

Multiresolution analysis provides the tool to adapt signal resolution to only relevant details for a particular task. The octave decomposition introduced by Mallat [6] initially decomposes a signal into an approximation component and a detail component. The approximation component is then recursively decomposed into approximation and detail at successively coarser scales. Wavelets are set up such that the approximation at resolution 2^{-j} contains all the necessary information to compute an approximation at the coarser resolution 2^{-(j+1)}.

Wavelets are used to characterize detail information. The averaging information is formally determined by a kind of dual to the mother wavelet, called the scaling function φ(t). The main idea in the theory of wavelets is that at a particular level of resolution j, the set of translates indexed by n forms a basis at that level. Interestingly, the set of translates forming the basis at the next level j + 1, a coarser level, can all be written as a sum of translates at the finer level j. The scaling function satisfies the dilation equation

φ(t) = Σ_{n∈Z} √2 h0[n] φ(2t - n)   (8.56)

The square brackets come from the theory of filters, and their use is carried over here. The dilation equation is a recipe for finding a function that can be built from a sum of copies of itself that are first scaled, translated, and dilated. Equation (8.56) expresses a condition that a function must satisfy to be a scaling function and at the same time forms a definition of the scaling vector h0. Not only is the scaling function expressible as a sum of translates, but the wavelet at the coarser level is also expressible as such:

ψ(t) = Σ_{n∈Z} √2 h1[n] φ(2t - n)   (8.57)

Below, we will show that the set of coefficients h1 for the wavelet can in fact be derived from the scaling function ones h0 [Eq. (8.59) below], so we also have that the wavelet can be derived from the scaling function, once we have one. The equation reads

ψ(t) = Σ_{n∈Z} (-1)^n √2 h0[1 - n] φ(2t - n)   (8.58)
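As a concrete illustration of the dilation equation, here is a minimal sketch of one step of the cascade algorithm for Eq. (8.56), assuming the Haar scaling vector h0 = [1/√2, 1/√2]; the grid size and piecewise-constant sampling are illustrative choices, not from the text. For Haar, the box function on [0, 1) is already a fixed point of the recipe.

```python
import math

SQRT2 = math.sqrt(2)
H0_HAAR = [1 / SQRT2, 1 / SQRT2]        # Haar scaling vector h0[0], h0[1]

def refine(phi, h0):
    """One cascade-algorithm step for Eq. (8.56):
    phi_new(t) = sum_n sqrt(2) * h0[n] * phi(2t - n),
    with phi sampled at t = k/N for k = 0..N-1 on [0, 1)."""
    N = len(phi)
    def phi_at(t):                       # sampled phi, zero outside [0, 1)
        return phi[int(t * N)] if 0.0 <= t < 1.0 else 0.0
    return [sum(SQRT2 * h0[n] * phi_at(2 * (k / N) - n) for n in range(len(h0)))
            for k in range(N)]

box = [1.0] * 8                          # phi = 1 on [0, 1): the Haar scaling function
once = refine(box, H0_HAAR)              # refinement leaves the box unchanged
```

Iterating `refine` from a crude initial guess is the usual way to plot a scaling function for longer filters as well.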
So the condition on a wavelet is similar to that on the scaling function, Eq. (8.56), and in fact uses the same coefficients, only in the opposite order and with alternating signs. Clearly, for efficiency, we would like the sums in (8.56) and (8.57) to have as few terms as possible, so we choose wavelets that have as few vector entries h0 and h1 as possible.

The effect of the scaling function is a kind of smoothing, or filtering, operation on a signal. Therefore it acts as a low-pass filter, screening out high-frequency content. The vector values h0[n] are called the low-pass filter impulse response coefficients, since they describe the effect of the filtering operation on a signal consisting of a single spike with magnitude unity (an impulse) at time t = 0. A complete discrete signal is made of a set of such spikes, shifted in time from 0 and weighted by the magnitudes of the discrete samples.

Hence, to specify a DWT, only the discrete low-pass filter impulse response h0[n] is needed. These coefficients specify the approximation filtering, given by the scaling function. The discrete high-pass impulse response h1[n], describing the details using the wavelet function, can be derived from h0[n] using the following equation:

h1[n] = (-1)^n h0[1 - n]   (8.59)

The number of coefficients in the impulse response is called the number of taps in the filter. If h0[n] has only a finite number of nonzero entries, the resulting wavelet is said to have compact support. Additional constraints, such as orthonormality and regularity, can be imposed on the coefficients h0[n].

The vectors h0[n] and h1[n] are called the low-pass and high-pass analysis filters. To reconstruct the original input, an inverse operation is needed. The inverse filters are called synthesis filters. For orthonormal wavelets, the forward transform and its inverse are transposes of each other, and the analysis filters are identical to the synthesis filters. Without orthogonality, the wavelets for analysis and synthesis are called biorthogonal, a weaker condition. In this case, the synthesis filters are not identical to the analysis filters. We denote them as h~0[n] and h~1[n]. To specify a biorthogonal wavelet transform, we require both h0[n] and h~0[n]. As before, we can compute the discrete high-pass filters in terms of the low-pass ones:

h1[n] = (-1)^n h~0[1 - n]   (8.60)
h~1[n] = (-1)^n h0[1 - n]   (8.61)

Section 8.6 Wavelet-Based Coding (Chapter 8, Lossy Compression Algorithms)

Tables 8.2 and 8.3 (cf. [8]) give some commonly used orthogonal and biorthogonal wavelet filters. The "start index" columns in these tables refer to the starting value of the index n used in Eqs. (8.60) and (8.61).

TABLE 8.2: Orthogonal wavelet filters, listing for each wavelet the filter, the number of taps, the start index, and the coefficients.

TABLE 8.3: Biorthogonal wavelet filters

Wavelet       Filter   Taps  Start  Coefficients
Antonini 9/7  h0[n]      9    -4    [0.038, -0.024, -0.111, 0.377, 0.853, 0.377, -0.111, -0.024, 0.038]
              h~0[n]     7    -3    [-0.065, -0.041, 0.418, 0.788, 0.418, -0.041, -0.065]
Villa 10/18   h0[n]     10    -4    [0.029, 0.0000824, -0.158, 0.077, 0.759, 0.759, 0.077, -0.158, 0.0000824, 0.029]
              h~0[n]    18    -8    [0.000954, -0.00000273, -0.009, -0.003, 0.031, -0.014, -0.086, 0.163, 0.623, 0.623, 0.163, -0.086, -0.014, 0.031, -0.003, -0.009, -0.00000273, 0.000954]
Brislawn      h0[n]     10    -4    [0.027, -0.032, -0.241, 0.054, 0.900, 0.900, 0.054, -0.241, -0.032, 0.027]
              h~0[n]    10    -4    [0.020, 0.024, -0.023, 0.146, 0.541, 0.541, 0.146, -0.023, 0.024, 0.020]

Figure 8.18 shows a block diagram for the 1D dyadic wavelet transform. Here, x[n] is the discrete sampled signal. The box marked with a down-arrow means subsampling by taking every second element, and the box marked with an up-arrow means upsampling by replication. The reconstruction phase yields series y[n].

For analysis, at each level we transform the series x[n] into another series of the same length, in which the first half of the elements is approximation information and the second half consists of detail information. For an N-tap filter, this is simply the series

y[n] = { Σ_j x[j] h0[n - j] ;  Σ_j x[j] h1[n - j] }   (8.62)

where for each half, the odd-numbered results are discarded. The summation over shifted coefficients in (8.62) is referred to as a convolution.

2D Discrete Wavelet Transform. The extension of the wavelet transform to two dimensions is quite straightforward. A two-dimensional scaling function is said to be separable if it can be factored into a product of two one-dimensional scaling functions. That is,

φ(x, y) = φ(x) φ(y)   (8.63)

For simplicity, only separable wavelets are considered in this section. Furthermore, let's assume that the width and height of the input image are powers of 2.
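Equations (8.59)-(8.61) can be checked numerically. The sketch below applies Eq. (8.60), h1[n] = (-1)^n h~0[1 - n], to the Antonini 9/7 dual low-pass filter from Table 8.3; the dict-based indexed representation is an illustrative choice, not the book's notation.

```python
def highpass_from_lowpass(h0):
    """h1[n] = (-1)**n * h0[1 - n]  (Eqs. 8.59-8.61).
    h0 is a dict mapping filter index -> coefficient value."""
    h1 = {1 - m: (-1) ** (1 - m) * v for m, v in h0.items()}
    return [h1[n] for n in sorted(h1)]        # coefficients in increasing index order

# Antonini 9/7 dual low-pass filter h~0[n]: 7 taps, start index -3 (Table 8.3)
h0_dual = dict(enumerate([-0.065, -0.041, 0.418, 0.788, 0.418, -0.041, -0.065], -3))
h1 = highpass_from_lowpass(h0_dual)           # analysis high-pass filter, Eq. (8.60)
```

The result reproduces the 7-tap analysis high-pass filter used later in Example 8.8: the same magnitudes as h~0, in reversed order with alternating signs.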
FIGURE 8.18: Block diagram of the 1D dyadic wavelet transform.

FIGURE 8.19: The two-dimensional discrete wavelet transform: (a) one-level transform, producing subbands LL, HL, LH, and HH; (b) two-level transform, with the LL subband further decomposed into LL2, HL2, LH2, and HH2.
For an N by N input image, the two-dimensional DWT proceeds as follows:

1. Convolve each row of the image with h0[n] and h1[n], discard the odd-numbered columns of the resulting arrays, and concatenate them to form a transformed row.
2. After all rows have been transformed, convolve each column of the result with h0[n] and h1[n]. Again discard the odd-numbered rows and concatenate the result.

After the above two steps, one stage of the DWT is complete. The transformed image now contains four subbands LL, HL, LH, and HH, standing for low-low, high-low, and so on, as Figure 8.19(a) shows. As in the one-dimensional transform, the LL subband can be further decomposed to yield yet another level of decomposition. This process can be continued until the desired number of decomposition levels is reached or the LL component only has a single element left. A two-level decomposition is shown in Figure 8.19(b).

The inverse transform simply reverses the steps of the forward transform:

1. For each stage of the transformed image, starting with the last, separate each column into low-pass and high-pass coefficients. Upsample each of the low-pass and high-pass arrays by inserting a zero after each coefficient.
2. Convolve the low-pass coefficients with h0[n] and the high-pass coefficients with h1[n], and add the two resulting arrays.
3. After all columns have been processed, separate each row into low-pass and high-pass coefficients and upsample each of the two arrays by inserting a zero after each coefficient.
4. Convolve the low-pass coefficients with h0[n] and the high-pass coefficients with h1[n], and add the two resulting arrays.

If biorthogonal filters are used for the forward transform, we must replace the h0[n] and h1[n] above with h~0[n] and h~1[n] in the inverse transform.
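The two forward steps can be sketched as follows, using the 2-tap Haar filter rather than the 9/7 filter (an assumption made here so that boundary handling is trivial); this is an illustrative sketch, not the book's program.

```python
import math

R = 1 / math.sqrt(2)

def haar_analyze(row):
    """One 1D analysis step with the orthonormal Haar filter:
    approximation half followed by detail half, odd-indexed results discarded."""
    a = [(row[2 * k] + row[2 * k + 1]) * R for k in range(len(row) // 2)]
    d = [(row[2 * k] - row[2 * k + 1]) * R for k in range(len(row) // 2)]
    return a + d

def dwt2_stage(img):
    """One stage of the 2D DWT: transform every row, then every column."""
    rows = [haar_analyze(r) for r in img]
    cols = [haar_analyze([rows[i][j] for i in range(len(rows))])
            for j in range(len(rows[0]))]
    # transpose back so out[i][j] addresses row i, column j
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(rows))]

flat = [[10.0] * 4 for _ in range(4)]    # constant test image
out = dwt2_stage(flat)                   # all energy lands in the 2 x 2 LL block
```

Applying `dwt2_stage` again to the top-left LL quadrant yields the next decomposition level, as in Figure 8.19(b).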
EXAMPLE 8.8

The input image is a subsampled version of the image Lena, as shown in Figure 8.20. The size of the input is 16 x 16. The filter used in the example is the Antonini 9/7 filter set given in Table 8.3. Before we begin, we need to compute the analysis and synthesis high-pass filters using Eqs. (8.60) and (8.61). The resulting filter coefficients are

h1[n] = [-0.065, 0.041, 0.418, -0.788, 0.418, 0.041, -0.065]
h~1[n] = [-0.038, -0.024, 0.111, 0.377, -0.853, 0.377, 0.111, -0.024, -0.038]   (8.64)
The input image in numerical form is

I00(x, y) = (a 16 x 16 array of grayscale pixel values, shown as the subsampled image in Figure 8.20(b))

FIGURE 8.20: Lena: (a) original 128 x 128 image; (b) 16 x 16 subsampled image.

I represents the pixel values. The first subscript of I indicates the current stage of the transform, while the second subscript indicates the current step within a stage. We start by convolving the first row with both h0[n] and h1[n] and discarding the values with odd-numbered index. The results of these two operations are

(I00(0, :) * h0[n]) ↓2 = [245, 156, 171, 183, 184, 173, 228, 160]
(I00(0, :) * h1[n]) ↓2 = [-30, 3, 0, 7, -5, -16, -3, 16]

where the colon in the index position indicates that we are showing a whole row. If you like, you can verify these operations using MATLAB's conv function.

Next, we form the transformed output row by concatenating the resulting coefficients. The first row of the transformed image is then

[245, 156, 171, 183, 184, 173, 228, 160, -30, 3, 0, 7, -5, -16, -3, 16]

Similar to the simple one-dimensional Haar transform examples, most of the energy is now concentrated in the first half of the transformed image. We continue the same process for the remaining rows and obtain the row-transformed image I11(x, y), a 16 x 16 array whose left half holds approximation coefficients and whose right half holds detail coefficients.

We now go on and apply the filters to the columns of the resulting image. As before, we apply both h0[n] and h1[n] to each column and discard the odd-indexed results:

(I11(:, 0) * h0[n]) ↓2 = [353, 280, 269, 256, 240, 206, 160, 153]T
(I11(:, 0) * h1[n]) ↓2 = [-12, 10, -7, -4, 2, -1, 43, 16]T

Concatenating the above results into a single column and applying the same procedure to each of the remaining columns, we arrive at the final transformed image I12(x, y), whose first column is [353, 280, 269, 256, 240, 206, 160, 153, -12, 10, -7, -4, 2, -1, 43, 16]T.
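The energy-compaction remark about the first transformed row can be checked directly from the two half-rows computed in the example:

```python
# Values taken from Example 8.8 (first row of the 16 x 16 Lena subimage,
# filtered with the Antonini 9/7 pair and downsampled by 2)
approx = [245, 156, 171, 183, 184, 173, 228, 160]   # (I00(0,:) * h0) downsampled
detail = [-30, 3, 0, 7, -5, -16, -3, 16]            # (I00(0,:) * h1) downsampled

first_row = approx + detail                         # concatenated transformed row
energy_lo = sum(v * v for v in approx)              # energy in the approximation half
energy_hi = sum(v * v for v in detail)              # energy in the detail half
```

For this row the approximation half carries more than 99% of the energy, which is exactly why quantizing the detail subbands coarsely costs so little.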
This completes one stage of the discrete wavelet transform. We can perform another stage by applying the same transform procedure to the upper left 8 x 8 DC image of I12(x, y). The result is the two-stage transformed image I22(x, y), a 16 x 16 array in which the upper left 8 x 8 block has itself been decomposed into four subbands. Notice that I12 corresponds to the subband diagram shown in Figure 8.19(a), and I22 to Figure 8.19(b). At this point, we may apply different levels of quantization to each subband according to some preferred bit allocation algorithm, given a desired bitrate. This is the basis for a simple wavelet-based compression algorithm. However, since in this example we are illustrating the mechanics of the DWT, here we will simply bypass the quantization step and perform an inverse transform to reconstruct the input image.

We refer to the top left 8 x 8 block of values as the innermost stage, in correspondence with Figure 8.19. Starting with the innermost stage, we extract the first column and separate the low-pass and high-pass coefficients. The low-pass coefficients are simply the first half of the column, and the high-pass coefficients are the second half. Then we upsample them by appending a zero after each coefficient. The upsampled low-pass array is

a = [558, 0, 463, 0, 464, 0, 422, 0]T

together with the corresponding upsampled high-pass array b. Since we are using biorthogonal filters, we convolve a and b with h~0[n] and h~1[n] respectively. The results of the two convolutions are then added to form a single 8 x 1 array. The resulting column is

[414, 354, 323, 338, 333, 294, 324, 260]T

All columns in the innermost stage are processed in this manner, giving the partially reconstructed image I21(x, y). We are now ready to process the rows. For each row of the upper left 8 x 8 sub-image, we again separate the coefficients into low-pass and high-pass halves. Then we upsample both by adding a zero after each coefficient. The results are convolved with the appropriate h~0[n] and h~1[n] filters. After these steps are completed for all rows, we recover I12(x, y), as expected.

We then repeat the same inverse transform procedure on I12(x, y) to obtain the reconstructed image I^00(x, y). Notice that I^00(x, y) is not exactly the same as I00(x, y), but the difference is small. These small differences are caused by round-off errors during the forward and inverse transforms, and truncation errors when converting from floating-point numbers to integer grayscale values. Figure 8.21 shows a three-level image decomposition using the Haar wavelet.
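The analysis-then-synthesis round trip can be sketched end to end with the orthonormal Haar filter (chosen here, as an assumption, instead of the 9/7 pair so that reconstruction is exact up to machine precision); the sample row is arbitrary.

```python
import math

R = 1 / math.sqrt(2)

def analyze(x):
    """Split a row into Haar approximation and detail halves."""
    a = [(x[2 * k] + x[2 * k + 1]) * R for k in range(len(x) // 2)]
    d = [(x[2 * k] - x[2 * k + 1]) * R for k in range(len(x) // 2)]
    return a, d

def synthesize(a, d):
    """Inverse step: for Haar, upsampling by zeros, filtering, and adding
    reduces to interleaving (a[k]+d[k])/sqrt(2) and (a[k]-d[k])/sqrt(2)."""
    x = []
    for ak, dk in zip(a, d):
        x.append((ak + dk) * R)
        x.append((ak - dk) * R)
    return x

x = [58.0, 164.0, 115.0, 94.0, 100.0, 103.0, 102.0, 99.0]   # arbitrary sample row
a, d = analyze(x)
x_rec = synthesize(a, d)                                     # matches x to machine precision
```

With the orthonormal Haar pair there are no round-off surprises; the small errors mentioned in the example come from the longer floating-point 9/7 filters and from integer truncation.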
I^00(x, y) = (the reconstructed 16 x 16 array of pixel values, essentially identical to I00(x, y))

FIGURE 8.21: Haar wavelet decomposition. Courtesy of Steve Kilthau.

Wavelet-Based Reduction Program. Keeping only the lowest-frequency content amounts to an even simpler wavelet-based image zooming-out reduction algorithm. Program wavelet_reduction.c on the book's web site gives a simple illustration of this principle, limited to just the scaling function and analysis filter, to scale down an image some number of times (three, say) using wavelet-based analysis. The program operates on the Unix-based PGM (portable graymap) file format and uses the Antonini 9/7 biorthogonal filter in Table 8.3.

8.7 WAVELET PACKETS

Wavelet packets can be viewed as a generalization of wavelets. They were first introduced by Coifman, Meyer, Quake, and Wickerhauser [9] as a family of orthonormal bases for discrete functions of R^N. A complete subband decomposition can be viewed as a decomposition of the input signal, using an analysis tree of depth log N.

In the usual dyadic wavelet decomposition, only the low-pass-filtered subband is recursively decomposed and thus can be represented by a logarithmic tree structure. However, a wavelet packet decomposition allows the decomposition to be represented by any pruned subtree of the full tree topology. Therefore, this representation of the decomposition topology is isomorphic to all permissible subband topologies [10]. The leaf nodes of each pruned subtree represent one permissible orthonormal basis.

The wavelet packet decomposition offers a number of attractive properties, including:

• Flexibility, since a best wavelet basis in the sense of some cost metric can be found within a large library of permissible bases
• Favorable localization of wavelet packets in both frequency and space
• Low computational requirement for wavelet packet decomposition, because each decomposition can be computed in the order of N log N using fast filter banks

Wavelet packets are currently being applied to solve various practical problems such as image compression, signal de-noising, fingerprint identification, and so on.
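The best-basis idea, choosing the pruned subtree that minimizes an additive cost, can be sketched with a toy cost metric (the count of above-threshold coefficients) and a Haar packet split; both choices are illustrative assumptions, not the metrics used in practice.

```python
import math

def haar_split(x):
    """Split a signal into low-pass and high-pass halves (Haar filter bank)."""
    r = 1 / math.sqrt(2)
    a = [(x[2 * k] + x[2 * k + 1]) * r for k in range(len(x) // 2)]
    d = [(x[2 * k] - x[2 * k + 1]) * r for k in range(len(x) // 2)]
    return a, d

def cost(x, eps=1e-2):
    """Toy additive cost: number of coefficients above a small threshold."""
    return sum(1 for v in x if abs(v) > eps)

def best_basis(x):
    """Cost of the best pruned-subtree (wavelet packet) basis for signal x:
    either keep this node as a leaf, or split it and recurse on both children."""
    if len(x) < 2:
        return cost(x)
    a, d = haar_split(x)
    return min(cost(x), best_basis(a) + best_basis(d))

sig = [1.0] * 8                 # constant signal: fully decomposing it leaves
c = best_basis(sig)             # a single nonzero coefficient, so cost 1
```

The recursion visits every node of the full packet tree once, which is the N log N filter-bank work mentioned above.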
8.8 EMBEDDED ZEROTREE OF WAVELET COEFFICIENTS

So far, we have described a wavelet-based scheme for image decomposition. However, aside from referring to the idea of quantizing away small coefficients, we have not really addressed how to code the wavelet transform values, that is, how to form a bitstream. This problem is precisely what is dealt with in terms of a new data structure, the Embedded Zerotree. The Embedded Zerotree Wavelet (EZW) algorithm introduced by Shapiro [11] is an effective and computationally efficient technique in image coding. This work has inspired a number of refinements to the initial EZW algorithm, the most notable being Said and Pearlman's Set Partitioning in Hierarchical Trees (SPIHT) algorithm [12] and Taubman's Embedded Block Coding with Optimized Truncation (EBCOT) algorithm [13], which is adopted into the JPEG2000 standard.
The EZW algorithm addresses two problems: obtaining the best image quality for a given bitrate and accomplishing this task in an embedded fashion. An embedded code is one that contains all lower-rate codes "embedded" at the beginning of the bitstream. The bits are effectively ordered by importance in the bitstream. An embedded code allows the encoder to terminate the encoding at any point and thus meet any target bitrate exactly. Similarly, a decoder can cease to decode at any point and produce reconstructions corresponding to all lower-rate encodings.

To achieve this goal, the EZW algorithm takes advantage of an important aspect of low-bitrate image coding. When conventional coding methods are used to achieve low bitrates, using scalar quantization followed by entropy coding, say, the most likely symbol after quantization is zero. It turns out that a large fraction of the bit budget is spent encoding the significance map, which flags whether input samples (in the case of the 2D discrete wavelet transform, the transform coefficients) have a zero or nonzero quantized value. The EZW algorithm exploits this observation to turn any significant improvement in encoding the significance map into a corresponding gain in compression efficiency. The EZW algorithm consists of two central components: the zerotree data structure and the method of successive approximation quantization.

8.8.1 The Zerotree Data Structure

The coding of the significance map is achieved using a new data structure called the zerotree. A wavelet coefficient x is said to be insignificant with respect to a given threshold T if |x| < T. The zerotree operates under the hypothesis that if a wavelet coefficient at a coarse scale is insignificant with respect to a given threshold T, all wavelet coefficients of the same orientation in the same spatial location at finer scales are likely to be insignificant with respect to T.

Using the hierarchical wavelet decomposition presented in Chapter 8, we can relate every coefficient at a given scale to a set of coefficients at the next finer scale of similar orientation. Figure 8.22 provides a pictorial representation of the zerotree on a three-stage wavelet decomposition.

FIGURE 8.22: Parent-child relationship in a zerotree.

The coefficient at the coarse scale is called the parent, while all corresponding coefficients at the next finer scale of the same spatial location and similar orientation are called children. For a given parent, the set of all coefficients at all finer scales is called its descendants. Similarly, for a given child, the set of all coefficients at all coarser scales is called its ancestors. The scanning of the coefficients is performed in such a way that no child node is scanned before its parent. Figure 8.23 depicts the scanning pattern for a three-level wavelet decomposition.

Given a threshold T, a coefficient x is an element of the zerotree if it is insignificant and all its descendants are insignificant as well. An element of a zerotree is a zerotree root if it is not the descendant of a previously found zerotree root. The significance map is coded using the zerotree with a four-symbol alphabet:

• The zerotree root. The root of a zerotree is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable.
• Isolated zero. The coefficient is insignificant but has some significant descendants.
• Positive significance. The coefficient is significant with a positive value.
• Negative significance. The coefficient is significant with a negative value.

The cost of encoding the significance map is substantially reduced by employing the zerotree. The zerotree works by exploiting self-similarity in the transform coefficients. The underlying justification for the success of the zerotree is that even though the image has been transformed using a decorrelating transform, the occurrences of insignificant coefficients are not independent events.
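A sketch of the zerotree-root test under a simplified parent-child map; the four-way relation below ignores the special cases in the LL band, an assumption made for brevity.

```python
def children(i, j, n):
    """Simplified parent -> child map for an n x n transform: coefficient (i, j)
    has up to four children at the next finer scale (LL-band special cases ignored)."""
    kids = [(2 * i, 2 * j), (2 * i, 2 * j + 1), (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]
    return [(a, b) for a, b in kids if a < n and b < n and (a, b) != (i, j)]

def descendants_insignificant(c, i, j, t):
    """True if every descendant of (i, j) has magnitude below threshold t."""
    n = len(c)
    return all(abs(c[a][b]) < t and descendants_insignificant(c, a, b, t)
               for a, b in children(i, j, n))

def is_zerotree_root_candidate(c, i, j, t):
    """Insignificant coefficient whose descendants are all insignificant too."""
    return abs(c[i][j]) < t and descendants_insignificant(c, i, j, t)

coeffs = [[0] * 4 for _ in range(4)]
coeffs[0][2] = 50                 # one significant detail coefficient
```

With threshold 32, position (0, 1) fails the test because its descendant holds 50 (an isolated zero), while (1, 0) heads an all-zero subtree and qualifies as a zerotree root.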
FIGURE 8.23: EZW scanning order.
In addition, the zerotree coding technique is based on the observation that it is much easier to predict insignificance than to predict significant details across scales. This technique focuses on reducing the cost of encoding the significance map so that more bits will be available to encode the expensive significant coefficients.
8.8.2 Successive Approximation Quantization

Embedded coding in the EZW coder is achieved using a method called Successive Approximation Quantization (SAQ). One motivation for developing this method is to produce an embedded code that provides a coarse-to-fine, multiprecision logarithmic representation of the scale space corresponding to the wavelet-transformed image. Another motivation is to take further advantage of the efficient encoding of the significance map using the zerotree data structure, by allowing it to encode more significance maps.

The SAQ method sequentially applies a sequence of thresholds T0, ..., T_{N-1} to determine the significance of each coefficient. The thresholds are chosen such that T_i = T_{i-1}/2. The initial threshold T0 is chosen so that |x_j| < 2T0 for all transform coefficients x_j. A dominant list and a subordinate list are maintained during the encoding and decoding process. The dominant list contains the coordinates of the coefficients that have not yet been found to be significant, in the same relative order as the initial scan. Using the scan ordering shown in Figure 8.23, all coefficients in a given subband appear on the initial dominant list prior to coefficients in the next subband. The subordinate list contains the magnitudes of the coefficients that have been found to be significant. Each list is scanned only once for each threshold.

During a dominant pass, the coefficients whose coordinates are on the dominant list (that is, those not yet found significant) are compared to the threshold T_i to determine their significance. If a coefficient is found to be significant, its magnitude is appended to the subordinate list, and the coefficient in the wavelet transform array is set to zero to enable the possibility of a zerotree occurring on future dominant passes at smaller thresholds. The resulting significance map is zerotree-coded. The dominant pass is followed by a subordinate pass.
All coefficients on the subordinate list are scanned, and their magnitudes, as made available to the decoder, are refined to an additional bit of precision. Effectively, the width of the uncertainty interval for the true magnitude of each coefficient is cut in half. For each magnitude on the subordinate list, the refinement can be encoded using a binary alphabet, with a 1 indicating that the true value falls in the upper half of the uncertainty interval and a 0 indicating that it falls in the lower half. The string of symbols from this binary alphabet is then entropy-coded. After the subordinate pass, the magnitudes on the subordinate list are sorted in decreasing order, to the extent that the decoder can perform the same sort.

The process continues to alternate between the two passes, with the threshold halved before each dominant pass. The encoding stops when some target stopping criterion has been met.
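The threshold sequence and the subordinate-pass interval refinement can be sketched as follows; the decoder's reconstruction value is the center of the current uncertainty interval, which for [48, 64) and [32, 48) gives 56 and 40 respectively.

```python
def thresholds(t0, passes):
    """T_i = T_{i-1} / 2 (Section 8.8.2)."""
    return [t0 // 2 ** i for i in range(passes)]

def refine(lo, hi, bit):
    """One subordinate-pass refinement: halve the uncertainty interval,
    keeping the upper half for bit 1 and the lower half for bit 0."""
    mid = (lo + hi) // 2
    return (mid, hi) if bit else (lo, mid)

def center(interval):
    """Decoder-side reconstruction value for an uncertainty interval."""
    return (interval[0] + interval[1]) // 2

iv = (32, 64)                  # uncertainty interval after the first dominant pass
hi_iv = refine(*iv, 1)         # e.g. coefficient 57: bit 1 selects the upper half
lo_iv = refine(*iv, 0)         # e.g. coefficient 37: bit 0 selects the lower half
```

Each additional subordinate bit halves the interval again, which is why the last subordinate pass becomes unnecessary once the interval width reaches 1.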
8.8.3 EZW Example

The following example demonstrates the concepts of zerotree coding and successive approximation quantization. Shapiro [11] presents an example of EZW coding in his paper for an 8 x 8 three-level wavelet transform. However, unlike the example given by Shapiro, we will complete the encoding and decoding process and show the output bitstream up to the point just before entropy coding.

FIGURE 8.24: Coefficients of a three-stage wavelet transform used as input to the EZW algorithm (an 8 x 8 array; the largest coefficient is 57).

Figure 8.24 shows the coefficients of a three-stage wavelet transform that we attempt to code using the EZW algorithm. We will use the symbols p, n, t, and z to denote positive significance, negative significance, zerotree root, and isolated zero, respectively. Since the largest coefficient is 57, we choose the initial threshold T0 to be 32. At the beginning, the dominant list contains the coordinates of all the coefficients. We begin scanning in the order shown in Figure 8.23 and determine the significance of the coefficients. The following is the list of coefficients visited, in the order of the scan:

(57, -37, -29, 30, 39, -20, 17, 33, 14, 6, 10, 19, 3, 7, 8, 2, 2, 3, 12, -9, 33, 20, 2, 4)

With respect to the threshold T0 = 32, it is easy to see that the coefficients 57 and -37 are significant. Thus, we output a p and an n to represent them. The coefficient -29 is insignificant but contains a significant descendant, 33, in LH1. Therefore, it is coded as z. The coefficient 30 is also insignificant, and all its descendants are insignificant with respect to the current threshold, so it is coded as t. Since we have already determined the insignificance of 30 and all its descendants, the scan will bypass them, and no additional symbols will be generated. Continuing in this manner, the dominant pass outputs the following symbols:

D0: pnztpttptzttpnt

Five coefficients are found to be significant: 57, -37, 39, 33, and another 33. Since we know that no coefficients are greater than 2T0 = 64, and the threshold used in the first dominant pass is 32, the uncertainty interval is thus [32, 64). Therefore, we know that the values of the significant coefficients lie somewhere inside this uncertainty interval.

The subordinate pass following the dominant pass refines the magnitudes of these coefficients by indicating whether they lie in the first half or the second half of the uncertainty
interval. The output is 0 if a value lies in [32, 48) and 1 for values within [48, 64). According to the order of the scan, the subordinate pass outputs the following bits:

S0: 10000   (8.65)
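The bits of S0 follow mechanically from the five significant magnitudes and the uncertainty interval [32, 64); a minimal sketch:

```python
T0 = 32
significant = [57, 37, 39, 33, 33]     # magnitudes found significant in D0, in scan order

def subordinate_bits(mags, lo, hi):
    """Emit 1 for a magnitude in the upper half [mid, hi), else 0."""
    mid = (lo + hi) // 2
    return "".join("1" if m >= mid else "0" for m in mags)

s0 = subordinate_bits(significant, T0, 2 * T0)   # reproduces Eq. (8.65)
```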
Thesubordinateiistisnow (57,37,39,33,33,29,30,20. 17, 19,201. Tbesubordinatepass lhat follows will halve each of lhe three current unceriainty intervals [48, 64), [32,48), and [16, 32). The subordinate pass ouLputs lhe following bits: St: 10000110000 Now we set Lhe value of Lhe coefficients found Lo be significant Lo O in Lhe wavelet transforrn array. The output of lhe subsequent dominant and subordinate passes is shown below: 02:
zzzzzzzzptpzpptn:tptppttpttpttpnppttitttpttttttlttttutt
82:
oiiooi1i001101100000110h10 zzzzzzztzpttztntIptttttptnnttttptttpptpptlptIttt
83:
OOI00OIO(JOIIIOIOOII000IOO11I 1101100010 zzzzzttztZtZZtZZPPPPhIPhtPttfl~~~Pl
84:
O
O
O
O
O
O
O
O
O
O
O 40
00000000
Now Lhe dominant list contains Lhe coordinates of ali lhe coefficients except Lhose found Lo be significant, and lhe subordinate Iist contains lhe values (57, 37, 39, 33. 331 After lhe subordinate pass is conipleted, we atternpt Lo rearrange Lhe values in lhe subordinate list such that larger coefficients appear before smailer ones, wilh lhe consfliflt that Lhe decoder is abie do exactly lhe sarne. Since Lhe subordinate pass halves Lhe unceriainty interval, lhe decoder is able Lo distin guish values from [32,48) and [48, 64). Since 39 and 37 are not distinguishable in Lhe decoder, Lheir order wiII not be changed. Therefore, Lhe subordinate list remains lhe sarne after Lhe reordering operation. Before we move ou to the second round of dorninant and subordinaLe passes, we need Lo set Lhe values of lhe significant coefficients to O in lhe wavelet transform array se lhat lhey do not prevent lhe emergence of a new zeroLree. The new threshold for a second dorninant pass is 7’j = 16. Using lhe sarne procedure as above, lhe dominant pass outpuLs lhe foilowing symbols. Note lhat Lhe coefficienls in Lhe dominant list wilI not be scanned. D1: zznptnpttztPtttttthtfhPttUtt
Since the length of the uncertainty interval in the last pass is 1, the last subordinate pass is unnecessary.

On the decoder side, suppose we received information only from the first dominant and subordinate passes. We can reconstruct a lossy version of the transform coefficients by reversing the encoding process. From the symbols in D0 we can obtain the positions of the significant coefficients. Then, using the bits decoded from S0, we can reconstruct the values of these coefficients using the center of the uncertainty interval. Figure 8.25 shows the resulting reconstruction.

FIGURE 8.25: Reconstructed transform coefficients from the first dominant and subordinate passes.

It is evident that we can stop the decoding process at any point to reconstruct a coarser representation of the original input coefficients. Figure 8.26 shows the reconstruction if the decoder received only D0, S0, D1, S1, D2, and only the first 10 bits of S2. The coefficients that were not refined during the last subordinate pass appear as if they were quantized using a coarser quantizer than those that were. In fact, the reconstruction value used for these coefficients is the center of the uncertainty interval from the previous pass. The heavily shaded coefficients in the figure are those that were refined, while the lightly shaded coefficients are those that were not. As a result, it is not easy to see where the decoding process ended, and this eliminates much of the visual artifact contained in the reconstruction.

8.9 SET PARTITIONING IN HIERARCHICAL TREES (SPIHT)

SPIHT is a revolutionary extension of the EZW algorithm. Based on EZW's underlying principles, namely partial ordering of transformed coefficients, ordered bitplane transmission of refinement bits, and exploitation of self-similarity in the transformed wavelet image, the SPIHT algorithm significantly improves the performance of its predecessor by changing the ways subsets of coefficients are partitioned and refinement information is conveyed. A unique property of the SPIHT bitstream is its compactness.
The resulting bitstream from the SPIHT algorithm is so compact that passing it through an entropy coder would produce only marginal gain in compression, at the expense of much more computation. Therefore, a fast SPIHT coder can be implemented without any entropy coder, or possibly with just a simple patent-free Huffman coder.

Another signature of the SPIHT algorithm is that no ordering information is explicitly transmitted to the decoder. Instead, the decoder reproduces the execution path of the encoder
and recovers the ordering information. A desirable side effect of this is that the encoder and decoder have similar execution times, which is rarely the case for other coding methods. Said and Pearlman [12] give a full description of this algorithm.

FIGURE 8.26: Reconstructed transform coefficients from D0, S0, D1, S1, D2, and the first 10 bits of S2.

8.10 FURTHER EXPLORATION

Sayood [14] deals extensively with the subject of lossy data compression in a well-organized and easy-to-understand manner. Gersho and Gray [15] cover quantization, especially vector quantization, comprehensively. In addition to the basic theory, this book provides a nearly exhaustive description of available VQ methods.

Gonzalez and Woods [7] discuss mathematical transforms and image compression, including straightforward explanations for a wide range of algorithms in the context of image processing.

The mathematical foundation for the development of many lossy data compression algorithms is the study of stochastic processes. Stark and Woods [16] is an excellent textbook on this subject. Finally, Mallat [5] is a book on wavelets, emphasizing theory.

Links included in the Further Exploration directory of the text web site for this chapter are

• An online, graphics-based demonstration of the wavelet transform. Two programs are included, one to demonstrate the 1D wavelet transform and the other for 2D image compression. In the 1D program, you simply draw the curve to be transformed.

• The Theory of Data Compression web page, which introduces basic theories behind both lossless and lossy data compression. Shannon's original 1948 paper on information theory can be downloaded from this site as well.

• The FAQ for the comp.compression and comp.compression.research groups. This FAQ answers most of the commonly asked questions about wavelet theory and data compression in general.

• A set of slides for scalar quantization and vector quantization, from the information theory course offered at Delft University.

• A link to an excellent article, "Image Compression - from DCT to Wavelets: A Review".

• Links to documentation and source code related to quantization.
8.11 EXERCISES

1. Assume we have an unbounded source we wish to quantize using an M-bit midtread uniform quantizer. Derive an expression for the total distortion if the step size is 1.

2. Suppose the domain of a uniform quantizer is [-bM, bM]. We define the loading fraction as

    loading fraction = bM / σ

where σ is the standard deviation of the source. Write a simple program to quantize a Gaussian distributed source having zero mean and unit variance using a 4-bit uniform quantizer. Plot the SNR against the loading fraction and estimate the optimal step size that incurs the least amount of distortion from the graph.

3. * Suppose the input source is Gaussian-distributed with zero mean and unit variance; that is, the probability density function is defined as

    f_X(x) = (1/√(2π)) e^(-x²/2)    (8.66)

We wish to find a four-level Lloyd–Max quantizer. Let y^i = [y1^i, ..., y4^i] denote the reconstruction levels and b^i = [b0^i, ..., b4^i] the boundaries at iteration i. The initial reconstruction levels are set to y^0 = [-2, -1, 1, 2]. This source is unbounded, so the outer two boundaries are +∞ and -∞.

Follow the Lloyd–Max algorithm in this chapter: the other boundary values are calculated as the midpoints of the reconstruction values. We now have b^0 = [-∞, -1.5, 0, 1.5, +∞]. Continue one more iteration for i = 1, using Eq. (8.13), and find y1^1, y2^1, y3^1, y4^1, using numerical integration. Also calculate the squared error of the difference between y^1 and y^0.

Iteration is repeated until the squared error between successive estimates of the reconstruction levels is below some predefined threshold ε. Write a small program to implement the Lloyd–Max quantizer described above.

4. If the block size for a 2D DCT transform is 8 × 8, and we use only the DC components to create a thumbnail image, what fraction of the original pixels would we be using?

5. When the block size is 8, the definition of the DCT is given in Eq. (8.17).
FIGURE 8.27: Sphere shaded by a light.

(a) If an 8 × 8 grayscale image is in the range 0 .. 255, what is the largest value a DCT coefficient could be, and for what input image? (Also, state all the DCT coefficient values for that image.)

(b) If we first subtract the value 128 from the whole image and then carry out the DCT, what is the exact effect on the DCT value F[2, 3]?

(c) Why would we carry out that subtraction? Does the subtraction affect the number of bits we need to code the image?

(d) Would it be possible to invert that subtraction, in the IDCT? If so, how?

6. We could use a similar DCT scheme for video streams by using a 3D version of DCT. Suppose one color component of a video has pixels f_ijk at position (i, j) and time k. How could we define its 3D DCT transform?

7. Suppose a uniformly colored sphere is illuminated and has shading varying smoothly across its surface, as in Figure 8.27.
(a) What would you expect the DCT coefficients for its image to look like?

(b) What would be the effect on the DCT coefficients of having a checkerboard of colors on the surface of the sphere?

(c) For the uniformly colored sphere again, describe the DCT values for a block that straddles the top edge of the sphere, where it meets the black background.

(d) Describe the DCT values for a block that straddles the left edge of the sphere.

8. The Haar wavelet has a scaling function which is defined as follows:

    φ(t) = 1 for 0 ≤ t ≤ 1, and 0 otherwise    (8.67)

and its scaling vector is h0[0] = h0[1] = 1/√2.
(a) Draw the scaling function, then verify that its dilated translates φ(2t) and φ(2t - 1) satisfy the dilation equation (8.56). Draw the combination of these functions that makes up the full function φ(t).

(b) Derive the wavelet vector h1[0], h1[1] from Eq. (8.59) and then derive and draw the Haar wavelet function ψ(t) from Eq. (8.57).
9. Suppose the mother wavelet ψ(t) has vanishing moments M_p up to and including M_n. Expand f(t) in a Taylor series around t = 0, up to the nth derivative of f [i.e., up to leftover error of order O(n + 1)]. Evaluate the summation of integrals produced by substituting the Taylor series into (8.52) and show that the result is of order O(s^(n+2)).

10. The program wavelet_compression.c on this book's web site is in fact simple to implement as a MATLAB function (or similar fourth-generation language). The advantage in doing so is that the imread function can input image formats of a great many types, and imwrite can output as desired. Using the given program as a template, construct a MATLAB program for wavelet-based image reduction, with perhaps the number of wavelet levels being a function parameter.

11. It is interesting to find the Fourier transform of functions, and this is easy if you have available a symbolic manipulation system such as MAPLE. In that language, you can just invoke the fourier function and view the answer directly! As an example, try the following code fragment:

    with(inttrans);
    f := 1;
    F := fourier(f,t,w);

The answer should be 2πδ(w). Let's try a Gaussian:

    f := exp(-t^2);
    F := fourier(f,t,w);

Now the answer should be √π e^(-w²/4): the Fourier transform of a Gaussian is simply another Gaussian.

12. Suppose we define the wavelet function

    ψ(t) = exp(-t^(1/4)) sin(t^4),  t ≥ 0    (8.68)
This function oscillates about the value 0. Use a plotting package to convince yourself that the function has a zero moment M_p for any value of p.

13. Implement both a DCT-based and a wavelet-based image coder. Design your user interface so that the compressed results from both coders can be seen side by side for visual comparison. The PSNR for each coded image should also be shown, for quantitative comparisons. Include a slider bar that controls the target bitrate for both coders. As you change the target bitrate, each coder should compress the input image in real time and show the compressed results immediately on your user interface. Discuss both qualitative and quantitative compression results observed from your program at target bitrates of 4 bpp, 1 bpp, and 0.25 bpp.

14. Write a simple program, or refer to the sample DCT program dct_1D.c on the book's web site, to verify the results in Example 8.2 of the 1D DCT example in this chapter.
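As a rough starting point for the program asked for in Exercise 3, the Lloyd–Max iteration for the Gaussian source might be sketched as follows. The midpoint-rule integration, the 8-sigma clipping of the infinite intervals, and the iteration count are arbitrary choices of this sketch, not prescriptions from the text.

```python
# Sketch of the Lloyd-Max iteration for a four-level quantizer on a
# zero-mean, unit-variance Gaussian source (Exercise 3).
import math

def gauss(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def centroid(lo, hi, n=8000):
    """E[X | lo <= X < hi] by the midpoint rule; infinite interval ends
    are clipped to +/- 8 standard deviations."""
    lo, hi = max(lo, -8.0), min(hi, 8.0)
    dx = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        w = gauss(x) * dx
        num += x * w
        den += w
    return num / den

def lloyd_max(y, iters=40):
    for _ in range(iters):
        # boundaries are midpoints of the current reconstruction levels
        b = [-math.inf] + [(a + c) / 2 for a, c in zip(y, y[1:])] + [math.inf]
        # each level moves to the centroid of its decision interval
        y = [centroid(b[j], b[j + 1]) for j in range(len(y))]
    return y

levels = lloyd_max([-2.0, -1.0, 1.0, 2.0])
print(levels)  # approaches roughly [-1.51, -0.45, 0.45, 1.51]
```

The first iteration reproduces the b^0 = [-∞, -1.5, 0, 1.5, +∞] step described in the exercise; in a real solution one would stop when successive level estimates differ by less than the threshold ε rather than after a fixed iteration count.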
8.12 REFERENCES

1. A. György, "On the Theoretical Limits of Lossy Source Coding," Tudományos Diákköri (TDK) Conference [Hungarian Scientific Students' Conference] at Technical University of Budapest, 1998.

2. S. Arimoto, "An Algorithm for Calculating the Capacity of an Arbitrary Discrete Memoryless Channel," IEEE Transactions on Information Theory, 18: 14–20, 1972.

3. R. Blahut, "Computation of Channel Capacity and Rate-Distortion Functions," IEEE Transactions on Information Theory, 18: 460–473, 1972.

4. J.F. Blinn, "What's the Deal with the DCT?" IEEE Computer Graphics and Applications, 13(4): 78–83, 1993.

5. S. Mallat, A Wavelet Tour of Signal Processing, San Diego: Academic Press, 1998.

6. S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 11: 674–693, 1989.

7. R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2002.

8. B.E. Usevitch, "A Tutorial on Modern Lossy Wavelet Image Compression: Foundations of JPEG 2000," IEEE Signal Processing Magazine, 18(5): 22–35, 2001.

9. R. Coifman, Y. Meyer, S. Quake, and V. Wickerhauser, "Signal Processing and Compression with Wavelet Packets," Numerical Algorithms Research Group, Yale University, 1990.

10. K. Ramchandran and M. Vetterli, "Best Wavelet Packet Basis in a Rate-Distortion Sense," IEEE Transactions on Image Processing, 2: 160–173, 1993.

11. J. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Transactions on Signal Processing, 41(12): 3445–3462, 1993.

12. A. Said and W.A. Pearlman, "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees," IEEE Transactions on Circuits and Systems for Video Technology, 6(3): 243–250, 1996.

13. D. Taubman, "High Performance Scalable Image Compression with EBCOT," IEEE Transactions on Image Processing, 9(7): 1158–1170, 2000.

14. K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco: Morgan Kaufmann, 2000.

15. A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers, 1992.

16. H. Stark and J.W. Woods, Probability and Random Processes with Application to Signal Processing, 3rd ed., Upper Saddle River, NJ: Prentice Hall, 2001.

CHAPTER 9

Image Compression Standards

Recent years have seen an explosion in the availability of digital images, because of the increase in numbers of digital imaging devices, such as scanners and digital cameras. The need to efficiently process and store images in digital form has motivated the development of many image compression standards for various applications and needs. In general, standards have greater longevity than particular programs or devices and therefore warrant careful study. In this chapter, we examine some current standards and demonstrate how topics presented in Chapters 7 and 8 are applied in practice.

We first explore the standard JPEG definition, used in most images on the web, then go on to look at the wavelet-based JPEG2000 standard. Two other standards, JPEG-LS (aimed particularly at a lossless JPEG, outside the main JPEG standard) and JBIG (for bilevel image compression), are included for completeness.

9.1 THE JPEG STANDARD

JPEG is an image compression standard developed by the Joint Photographic Experts Group. It was formally accepted as an international standard in 1992 [1].

JPEG consists of a number of steps, each of which contributes to compression. We'll look at the motivation behind these steps, then take apart the algorithm piece by piece.

9.1.1 Main Steps in JPEG Image Compression

As we know, unlike one-dimensional audio signals, a digital image f(i, j) is not defined over the time domain. Instead, it is defined over a spatial domain; that is, an image is a function of the two dimensions i and j (or, conventionally, x and y). The 2D DCT is used as one step in JPEG, to yield a frequency response that is a function F(u, v) in the spatial frequency domain, indexed by two integers u and v.

JPEG is a lossy image compression method. The effectiveness of the DCT transform coding method in JPEG relies on three major observations:

Observation 1. Useful image contents change relatively slowly across the image; that is, it is unusual for intensity values to vary widely several times in a small area, for example, in an 8 × 8 image block. Spatial frequency indicates how many times pixel values change across an image block. The DCT formalizes this notion with a measure of how much the image contents change in relation to the number of cycles of a cosine wave per block.

Observation 2. Psychophysical experiments suggest that humans are much less likely to notice the loss of very high spatial frequency components than lower-frequency components.
JPEG's approach to the use of DCT is basically to reduce high-frequency contents and then efficiently code the result into a bitstring. The term spatial redundancy indicates that much of the information in an image is repeated: if a pixel is red, then its neighbor is likely red also. Because of Observation 2 above, the DCT coefficients for the lowest frequencies are most important. Therefore, as frequency gets higher, it becomes less important to represent the DCT coefficient accurately. It may even be safely set to zero without losing much perceivable image information. Clearly, a string of zeros can be represented efficiently as the length of such a run of zeros, and compression of the bits required is possible. Since we end up using fewer numbers to represent the pixels in blocks, by removing some location-dependent information, we have effectively removed spatial redundancy.

JPEG works for both color and grayscale images. In the case of color images, such as YIQ or YUV, the encoder works on each component separately, using the same routines. If the source image is in a different color format, the encoder performs a color-space conversion to YIQ or YUV. As discussed in Chapter 5, the chrominance images (I, Q or U, V) are subsampled. JPEG uses the 4:2:0 scheme, making use of another observation about vision:

Observation 3. Visual acuity (accuracy in distinguishing closely spaced lines) is much greater for gray ("black and white") than for color. We simply cannot see much change in color if it occurs in close proximity; think of the blobby ink used in comic books. This works simply because our eye sees the black lines best, and our brain just pushes the color into place. In fact, ordinary broadcast TV makes use of this phenomenon to transmit much less color information than gray information.

When the JPEG image is needed for viewing, the three compressed component images can be decoded independently and eventually combined. For the color channels, each pixel must first be enlarged to cover a 2 × 2 block. Without loss of generality, we will simply use one of them, for example, the Y image, in the description of the compression algorithm below.

Figure 9.1 shows a block diagram for a JPEG encoder. If we reverse the arrows in the figure, we basically obtain a JPEG decoder. The JPEG encoder consists of the following main steps:

• Transform RGB to YIQ or YUV and subsample color
• Perform DCT on image blocks
• Apply Quantization
• Perform Zigzag ordering and run-length encoding
• Perform Entropy coding

FIGURE 9.1: Block diagram for JPEG encoder.
DCT on Image Blocks. Each image is divided into 8 × 8 blocks. The 2D DCT (Equation 8.17) is applied to each block image f(i, j), with output being the DCT coefficients F(u, v) for each block. The choice of a small block size in JPEG is a compromise reached by the committee: a number larger than 8 would have made accuracy at low frequencies better, but using 8 makes the DCT (and IDCT) computation very fast.

Using blocks at all, however, has the effect of isolating each block from its neighboring context. This is why JPEG images look choppy ("blocky") when the user specifies a high compression ratio; we can see these blocks. (And in fact removing such "blocking artifacts" is an important concern of researchers.)

To calculate a particular F(u, v), we select the basis image in Figure 8.9 that corresponds to the appropriate u and v and use it in Equation 8.17 to derive one of the frequency responses F(u, v).

Quantization. The quantization step in JPEG is aimed at reducing the total number of bits needed for a compressed image [2]. It consists of simply dividing each entry in the frequency space block by an integer, then rounding:

    F̂(u, v) = round( F(u, v) / Q(u, v) )    (9.1)

Here, F(u, v) represents a DCT coefficient, Q(u, v) is a quantization matrix entry, and F̂(u, v) represents the quantized DCT coefficients JPEG will use in the succeeding entropy coding.

The default values in the 8 × 8 quantization matrix Q(u, v) are listed in Tables 9.1 and 9.2 for luminance and chrominance images, respectively. These numbers resulted from psychophysical studies, with the goal of maximizing the compression ratio while minimizing perceptual losses in JPEG images. The following should be apparent:
TABLE 9.1: The luminance quantization table.

16  11  10  16   24   40   51   61
12  12  14  19   26   58   60   55
14  13  16  24   40   57   69   56
14  17  22  29   51   87   80   62
18  22  37  56   68  109  103   77
24  35  55  64   81  104  113   92
49  64  78  87  103  121  120  101
72  92  95  98  112  100  103   99
TABLE 9.2: The chrominance quantization table.

17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
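The quantization step of Equation (9.1) can be sketched with the luminance table above. The rounding convention (half away from zero) is one common implementation choice, not something mandated by the text.

```python
# Sketch of Eq. (9.1): quantizing and de-quantizing one 8x8 DCT block
# with the luminance table of Table 9.1.
Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize(F):
    # round half away from zero (one common convention)
    return [[int(F[u][v] / Q[u][v] + (0.5 if F[u][v] >= 0 else -0.5))
             for v in range(8)] for u in range(8)]

def dequantize(Fhat):
    return [[Fhat[u][v] * Q[u][v] for v in range(8)] for u in range(8)]

F = [[0.0] * 8 for _ in range(8)]
F[0][0] = 515.0                        # a large DC coefficient
print(quantize(F)[0][0])               # 32  (515 / 16, rounded)
print(dequantize(quantize(F))[0][0])   # 512: this coefficient lost 3
```

The gap between 515 and the de-quantized 512 is exactly the loss introduced by this step.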
• Since the numbers in Q(u, v) are relatively large, the magnitude and variance of F̂(u, v) are significantly smaller than those of F(u, v). We'll see later that F̂(u, v) can be coded with many fewer bits. The quantization step is the main source for loss in JPEG compression.

• The entries of Q(u, v) tend to have larger values toward the lower right corner. This aims to introduce more loss at the higher spatial frequencies, a practice supported by Observations 1 and 2.

We can handily change the compression ratio simply by multiplicatively scaling the numbers in the Q(u, v) matrix. In fact, the quality factor, a user choice offered in every JPEG implementation, is essentially linearly tied to the scaling factor. JPEG also allows custom quantization tables to be specified and put in the header; it is interesting to use low-constant or high-constant values such as Q ≡ 2 or Q ≡ 100 to observe the basic effects of Q on visual artifacts.

Figures 9.2 and 9.3 show some results of JPEG image coding and decoding on the test image Lena. Only the luminance image (Y) is shown. Also, the lossless coding steps
FIGURE 9.2: JPEG compression for a smooth image block. (The figure shows an 8 × 8 block f(i, j) from the Y image of Lena, together with F(u, v), F̂(u, v), F̃(u, v), the reconstructed block f̃(i, j), and the error ε(i, j) = f(i, j) - f̃(i, j).)
after quantization are not shown, since they do not affect the quality/loss of the JPEG images. These results show the effect of compression and decompression applied to a relatively smooth block in the image and a more textured (higher-frequency-content) block, respectively.

Suppose f(i, j) represents one of the 8 × 8 blocks extracted from the image, F(u, v) the DCT coefficients, and F̂(u, v) the quantized DCT coefficients. Let F̃(u, v) denote the de-quantized DCT coefficients, determined by simply multiplying by Q(u, v), and let f̃(i, j) be the reconstructed image block. To illustrate the quality of the JPEG compression, especially the loss, the error ε(i, j) = f(i, j) - f̃(i, j) is shown in the last row in Figures 9.2 and 9.3.
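The F(u, v) values in Figures 9.2 and 9.3 can be generated by a direct (slow) implementation of the 2D DCT. The normalization below assumes the usual JPEG form of Equation (8.17), with C(0) = 1/√2 and C(u) = 1 otherwise; a constant block is used as the simplest check, since all of its energy lands in the DC term.

```python
# Direct (unoptimized) 8x8 2D DCT, assuming the normalization of
# Eq. (8.17): F(u,v) = C(u)C(v)/4 * sum_ij cos(...) cos(...) f(i,j),
# with C(0) = 1/sqrt(2) and C(u) = 1 otherwise.
import math

def dct2_8x8(f):
    def C(x):
        return 1 / math.sqrt(2) if x == 0 else 1.0
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for i in range(8):
                for j in range(8):
                    s += (f[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / 16)
                          * math.cos((2 * j + 1) * v * math.pi / 16))
            F[u][v] = C(u) * C(v) / 4 * s
    return F

# A constant block of 255s: only the DC coefficient is nonzero.
print(dct2_8x8([[255] * 8 for _ in range(8)])[0][0])  # 2040.0 (to rounding)
```

A production coder would of course use a fast separable DCT rather than this quadruple loop; the sketch is only meant to pin down the definition.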
In Figure 9.2, an image block (indicated by a black box in the image) is chosen at an area where the luminance values change smoothly. Actually, the left side of the block is brighter, and the right side is slightly darker. As expected, except for the DC and the first few AC components, representing low spatial frequencies, most of the DCT coefficients F(u, v) have small magnitudes. This is because the pixel values in this block contain few high-spatial-frequency changes.

An explanation of a small implementation detail is in order. The range of 8-bit luminance values f(i, j) is [0, 255]. In the JPEG implementation, each Y value is first reduced by 128 by simply subtracting. The idea here is to turn the Y component into a zero-mean image, the same as the chrominance images. As a result, we do not waste any bits coding the mean value. (Think of an 8 × 8 block with intensity values ranging from 120 to 135.) Using f(i, j) - 128 in place of f(i, j) will not affect the output of the AC coefficients; it alters only the DC coefficient.

In Figure 9.3, the image block chosen has rapidly changing luminance. Hence, many more AC components have large magnitudes (including those toward the lower right corner, where u and v are large). Notice that the error ε(i, j) is also larger now than in Figure 9.2; JPEG does introduce more loss if the image has quickly changing details.

Preparation for Entropy Coding.
FIGURE 9.3: JPEG compression for a textured image block (another 8 × 8 block from the Y image of Lena).
We have so far seen two of the main steps in JPEG compression: DCT and quantization. The remaining small steps shown in the block diagram in Figure 9.1 all lead up to entropy coding of the quantized DCT coefficients. These additional data compression steps are lossless. Interestingly, the DC and AC coefficients are treated quite differently before entropy coding: run-length encoding on ACs versus DPCM on DCs.

Run-Length Coding (RLC) on AC Coefficients. Notice in Figure 9.2 the many zeros in F̂(u, v) after quantization is applied. Run-length Coding (RLC) (or Run-length Encoding, RLE) is therefore useful in turning the F̂(u, v) values into sets {#-zeros-to-skip, next nonzero value}. RLC is even more effective when we use an addressing scheme that makes it most likely to hit a long run of zeros: a zigzag scan turns the 8 × 8 matrix F̂(u, v) into a 64-vector, as Figure 9.4 illustrates. After all, most image blocks tend to have small high-spatial-frequency components, which are zeroed out by quantization. Hence the zigzag
TABLE 9.3: Baseline entropy coding details (size category).

SIZE   AMPLITUDE
1      -1, 1
2      -3, -2, 2, 3
3      -7..-4, 4..7
4      -15..-8, 8..15
...    ...
10     -1023..-512, 512..1023

FIGURE 9.4: Zigzag scan in JPEG.
scan order has a good chance of concatenating long runs of zeros. For example, F̂(u, v) in Figure 9.2 will be turned into

(32, 6, -1, -1, 0, -1, 0, 0, 0, -1, 0, 0, 1, 0, 0, ..., 0)

with three runs of zeros in the middle and a run of 51 zeros at the end.

The RLC step replaces values by a pair (RUNLENGTH, VALUE) for each run of zeros in the AC coefficients of F̂, where RUNLENGTH is the number of zeros in the run and VALUE is the next nonzero coefficient. To further save bits, a special pair (0, 0) indicates the end of block after the last nonzero AC coefficient is reached. In the above example, not considering the first (DC) component, we will thus have

(0, 6) (0, -1) (0, -1) (1, -1) (3, -1) (2, 1) (0, 0)
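The zigzag scan and the RLC step can be sketched as follows. The generator and helper names are our own, and the (15, 0) extension for runs longer than 15 (described later in this section) is omitted for brevity.

```python
# Sketch of the zigzag scan (Figure 9.4) and run-length coding of the
# AC coefficients. Illustrative only; no (15, 0) run extensions.

def zigzag_order(n=8):
    """Traversal order of an n x n block: walk anti-diagonals i+j,
    alternating direction on odd/even diagonals."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def rlc(ac):
    """Run-length code a zigzag-ordered list of AC coefficients into
    (RUNLENGTH, VALUE) pairs, ending with the (0, 0) end-of-block pair."""
    pairs, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((0, 0))  # end-of-block
    return pairs

# The AC part of the zigzag vector from Figure 9.2:
ac = [6, -1, -1, 0, -1, 0, 0, 0, -1, 0, 0, 1] + [0] * 51
print(rlc(ac))
# [(0, 6), (0, -1), (0, -1), (1, -1), (3, -1), (2, 1), (0, 0)]
```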
Differential Pulse Code Modulation (DPCM) on DC Coefficients. The DC coefficients are coded separately from the AC ones. Each 8 × 8 image block has only one DC coefficient. The values of the DC coefficients for various blocks could be large and different, because the DC value reflects the average intensity of each block; but, consistent with Observation 1 above, the DC coefficient is unlikely to change drastically within a short distance. This makes DPCM an ideal scheme for coding the DC coefficients.

If the DC coefficients for the first five image blocks are 150, 155, 149, 152, 144, DPCM would produce 150, 5, -6, 3, -8, assuming the predictor is simply d_i = DC_i - DC_{i-1}, with d_0 = DC_0. We expect DPCM codes to generally have smaller magnitude and variance, which is beneficial for the next entropy coding step.

It is worth noting that unlike the run-length coding of the AC coefficients, which is performed on each individual block, DPCM for the DC coefficients in JPEG is carried out on the entire image at once.

Entropy Coding. The DC and AC coefficients finally undergo an entropy coding step. Below, we will discuss only the basic (or baseline) entropy coding method, which uses Huffman coding and supports only 8-bit pixels in the original images (or color image components). Let's examine the two entropy coding schemes, using a variant of Huffman coding for DCs and a slightly different scheme for ACs.

Huffman Coding of DC Coefficients. Each DPCM-coded DC coefficient is represented by a pair of symbols (SIZE, AMPLITUDE), where SIZE indicates how many bits are needed for representing the coefficient and AMPLITUDE contains the actual bits. Table 9.3 illustrates the size category for the different possible amplitudes. Notice that DPCM values could require more than 8 bits and could be negative values. The one's complement scheme is used for negative numbers; that is, binary code 10 for 2, 01 for -2; 11 for 3, 00 for -3; and so on. In the example we are using, the codes 150, 5, -6, 3, -8 will be turned into

(8, 10010110), (3, 101), (3, 001), (2, 11), (4, 0111)
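The DPCM and (SIZE, AMPLITUDE) steps can be sketched together. This is a toy fragment reproducing the running example, not a full JPEG entropy coder.

```python
# DPCM on the DC coefficients, followed by the (SIZE, AMPLITUDE)
# representation with one's-complement coding of negative values.

def dpcm(dc):
    # d_0 = DC_0, then successive differences
    return [dc[0]] + [b - a for a, b in zip(dc, dc[1:])]

def size_amplitude(v):
    size = abs(v).bit_length()        # the SIZE category of Table 9.3
    if v >= 0:
        bits = format(v, "b") if v else ""
    else:
        # one's complement: invert the bits of |v|
        bits = "".join("1" if b == "0" else "0" for b in format(-v, "b"))
    return size, bits

diffs = dpcm([150, 155, 149, 152, 144])
print(diffs)                              # [150, 5, -6, 3, -8]
print([size_amplitude(d) for d in diffs])
# [(8, '10010110'), (3, '101'), (3, '001'), (2, '11'), (4, '0111')]
```

In a real coder, SIZE would then be Huffman coded while the AMPLITUDE bits are emitted verbatim, as discussed next.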
In the JPEG implementation, SIZE is Huffman coded and is hence a variable-length code. In other words, SIZE 2 might be represented as a single bit (0 or 1) if it appeared most frequently. In general, smaller SIZEs occur much more often; the entropy of SIZE is low, so deployment of Huffman coding brings additional compression. After encoding, a custom Huffman table can be stored in the JPEG image header; otherwise, a default Huffman table is used. On the other hand, AMPLITUDE is not Huffman coded. Since its value can change widely, Huffman coding has no appreciable benefit.

(The JPEG standard allows both Huffman coding and Arithmetic coding; both are entropy coding methods. It also supports both 8-bit and 12-bit pixel lengths.)
Huffman Coding of AC Coefficients. Recall we said that the AC coefficients are run-length coded and are represented by pairs of numbers (RUNLENGTH, VALUE). However, in an actual JPEG implementation, VALUE is further represented by SIZE and AMPLITUDE, as for the DCs. To save bits, RUNLENGTH and SIZE are allocated only 4 bits each and squeezed into a single byte — let's call this Symbol 1. Symbol 2 is the AMPLITUDE value; its number of bits is indicated by SIZE:

Symbol 1: (RUNLENGTH, SIZE)
Symbol 2: (AMPLITUDE)

The 4-bit RUNLENGTH can represent only zero-runs of length 0 to 15. Occasionally, the zero-run length exceeds 15; then a special extension code, (15, 0), is used for Symbol 1. In the worst case, three consecutive (15, 0) extensions are needed before a normal terminating Symbol 1, whose RUNLENGTH will then complete the actual run-length. As in DC, Symbol 1 is Huffman coded, whereas Symbol 2 is not.

9.1.2 JPEG Modes

The JPEG standard supports numerous modes (variations). Some of the commonly used ones are:

• Sequential Mode
• Progressive Mode
• Hierarchical Mode
• Lossless Mode

Sequential Mode. This is the default JPEG mode. Each gray-level image or color image component is encoded in a single left-to-right, top-to-bottom scan. We implicitly assumed this mode in the discussions so far. The "Motion JPEG" video codec uses Baseline Sequential JPEG, applied to each image frame in the video.

Progressive Mode. Progressive JPEG delivers low-quality versions of the image quickly, followed by higher-quality passes, and has become widely supported in web browsers. Such multiple scans of images are of course most useful when the speed of the communication line is low. In Progressive Mode, the first few scans carry only a few bits and deliver a rough picture of what is to follow. After each additional scan, more data is received, and image quality is gradually enhanced. The advantage is that the user-end has a choice whether to continue receiving image data after the first scan(s). Progressive JPEG can be realized in one of the following two ways. The main steps (DCT, quantization, etc.) are identical to those in Sequential Mode.

Spectral selection: This scheme takes advantage of the spectral (spatial frequency spectrum) characteristics of the DCT coefficients: the higher AC components provide only detail information.

Scan 1: Encode DC and first few AC components, e.g., AC1, AC2.
Scan 2: Encode a few more AC components, e.g., AC3, AC4, AC5.
...
Scan k: Encode the last few ACs, e.g., AC61, AC62, AC63.

Successive approximation: Instead of gradually encoding spectral bands, all DCT coefficients are encoded simultaneously, but with their most significant bits (MSBs) first.

Scan 1: Encode the first few MSBs, e.g., Bits 7, 6, 5, and 4.
Scan 2: Encode a few more less-significant bits, e.g., Bit 3.
...
Scan m: Encode the least significant bit (LSB), Bit 0.

Hierarchical Mode. As its name suggests, Hierarchical JPEG encodes the image in a hierarchy of several different resolutions. The encoded image at the lowest resolution is basically a compressed low-pass-filtered image, whereas the images at successively higher resolutions provide additional details (differences from the lower-resolution images). Similar to Progressive JPEG, Hierarchical JPEG images can be transmitted in multiple passes with progressively improving quality. Figure 9.5 illustrates a three-level hierarchical JPEG encoder and decoder (separated by the dashed line in the figure).
FIGURE 9.5: Block diagram for Hierarchical JPEG.
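Returning to the entropy coding of AC coefficients described earlier, the generation of (RUNLENGTH, SIZE, AMPLITUDE) symbols, including the (15, 0) extension and an end-of-block marker, can be sketched as follows (plain tuples stand in for the packed RUNLENGTH/SIZE byte and the Huffman stage):

```python
def ac_runlength_symbols(zigzag_ac):
    """Turn the 63 zigzag-ordered AC coefficients of one block into
    (RUNLENGTH, SIZE, AMPLITUDE) symbols.  Zero runs longer than 15 emit
    the special extension code (15, 0); a trailing run of zeros is
    replaced by the end-of-block symbol (0, 0)."""
    symbols = []
    run = 0
    for v in zigzag_ac:
        if v == 0:
            run += 1
            continue
        while run > 15:                     # extension code (15, 0)
            symbols.append((15, 0, None))
            run -= 16
        symbols.append((run, abs(v).bit_length(), v))
        run = 0
    if run > 0:                             # trailing zeros -> EOB
        symbols.append((0, 0, None))
    return symbols
```

A run of 17 zeros before a coefficient thus produces one (15, 0) extension followed by a terminating symbol whose RUNLENGTH of 1 completes the actual run-length.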
ALGORITHM 9.1 THREE-LEVEL HIERARCHICAL JPEG ENCODER

1. Reduction of image resolution. Reduce resolution of the input image f (e.g., 512 x 512) by a factor of 2 in each dimension to obtain f2 (e.g., 256 x 256). Repeat this to obtain f4 (e.g., 128 x 128).
2. Compress low-resolution image f4. Encode f4 using any other JPEG method (e.g., Sequential, Progressive) to obtain F4.
3. Compress difference image d2. (a) Decode F4 to obtain f~4. Use any interpolation method to expand f~4 to be of the same resolution as f2 and call it E(f~4). (b) Encode the difference d2 = f2 - E(f~4) using any other JPEG method (e.g., Sequential, Progressive) to generate D2.
4. Compress difference image d1. (a) Decode D2 to obtain d~2; add it to E(f~4) to get f~2 = E(f~4) + d~2, which is a version of f2 after compression and decompression. (b) Encode the difference d1 = f - E(f~2) using any other JPEG method (e.g., Sequential, Progressive) to generate D1.

ALGORITHM 9.2 THREE-LEVEL HIERARCHICAL JPEG DECODER

1. Decompress the encoded low-resolution image F4. Decode F4 using the same JPEG method as in the encoder, to obtain f~4.
2. Restore image f~2 at the intermediate resolution. Use f~2 = E(f~4) + d~2 to obtain f~2.
3. Restore image f~ at the original resolution. Use f~ = E(f~2) + d~1 to obtain f~.

It should be pointed out that at step 3 in the encoder, the difference d2 is not taken as f2 - E(f4) but as f2 - E(f~4). Employing f~4 has its overhead, since an additional decoding step must be introduced on the encoder side, as shown in the figure. So, is it necessary? It is, because the decoder never has a chance to see the original f4. The restoration step in the decoder uses f~4 to obtain f~2 = E(f~4) + d~2. Since f4 differs from f~4 when a lossy JPEG method is used in compressing f4, the encoder must use f~4 in d2 = f2 - E(f~4) to avoid unnecessary error at decoding time. This kind of decoder-encoder step is typical in many compression schemes. In fact, we have seen it in Section 6.3.5. It is present simply because the decoder has access only to encoded, not original, values. Similarly, at step 4 in the encoder, d1 uses the difference between f and E(f~2), not E(f2).

Lossless Mode

Lossless JPEG is a very special case of JPEG which indeed has no loss in its image quality. As discussed in Chapter 7, however, it employs only a simple differential coding method, involving no transform coding. It is rarely used, since its compression ratio is very low compared to other, lossy modes. On the other hand, it meets a special need, and the newly developed JPEG-LS standard is specifically aimed at lossless image compression (see Section 9.3).

9.1.3 A Glance at the JPEG Bitstream

Figure 9.6 provides a hierarchical view of the organization of the bitstream for JPEG images. Here, a frame is a picture, a scan is a pass through the pixels (e.g., the red component), a segment is a group of blocks, and a block consists of 8 x 8 pixels.

FIGURE 9.6: JPEG bitstream (Start_of_image; frames with tables, headers, and scans; segments of blocks separated by restart markers; End_of_image).

Examples of some header information are:

• Frame header
  — Bits per pixel
  — (Width, height) of image
  — Number of components
  — Unique ID (for each component)
  — Horizontal/vertical sampling factors (for each component)
  — Quantization table to use (for each component)
• Scan header
  — Number of components in scan
  — Component ID (for each component)
  — Huffman/Arithmetic coding table (for each component)
9.2 THE JPEG2000 STANDARD

The JPEG standard is no doubt the most successful and popular image format to date. The main reason for its success is the quality of its output for a relatively good compression ratio. However, anticipating the needs and requirements of next-generation imagery applications, the JPEG committee has defined a new standard: JPEG2000. The new JPEG2000 standard [3] aims to provide not only a better rate-distortion tradeoff and improved subjective image quality but also additional functionalities the current JPEG standard lacks. In particular, the JPEG2000 standard addresses the following problems [4]:
• Low-bitrate compression. The current JPEG standard offers excellent rate-distortion performance at medium and high bitrates. However, at bitrates below 0.25 bpp, subjective distortion becomes unacceptable. This is important if we hope to receive images on our web-enabled ubiquitous devices, such as web-aware wristwatches, and so on.
• Lossless and lossy compression. Currently, no standard can provide superior lossless compression and lossy compression in a single bitstream.
• Large images. The new standard will allow image resolutions greater than 64k x 64k without tiling. It can handle image sizes up to 2^32 - 1.
• Single decompression architecture. The current JPEG standard has 44 modes, many of which are application-specific and not used by the majority of JPEG decoders.
• Transmission in noisy environments. The new standard will provide improved error resilience for transmission in noisy environments such as wireless networks and the Internet.
• Progressive transmission. The new standard provides seamless quality and resolution scalability from low to high bitrates. The target bitrate and reconstruction resolution need not be known at the time of compression.
• Region-of-interest coding. The new standard permits specifying Regions of Interest (ROI), which can be coded with better quality than the rest of the image. We might, for example, like to code the face of someone making a presentation with more quality than the surrounding furniture.
• Computer-generated imagery. The current JPEG standard is optimized for natural imagery and does not perform well on computer-generated imagery.
• Compound documents. The new standard offers metadata mechanisms for incorporating additional non-image data as part of the file. This might be useful for including text along with imagery, as one important example.

In addition, JPEG2000 is able to handle up to 256 channels of information, whereas the current JPEG standard is able to handle only three color channels. Such huge quantities of data are routinely produced in satellite imagery. Consequently, JPEG2000 is designed to address a variety of applications, such as the Internet, color facsimile, printing, scanning, digital photography, remote sensing, mobile applications, medical imagery, digital libraries, e-commerce, and so on.
The method looks ahead and provides the power to carry out remote browsing of large compressed images. The JPEG2000 standard operates in two coding modes: DCT-based and wavelet-based. The DCT-based coding mode is offered for backward compatibility with the current JPEG standard and implements baseline JPEG. All the new functionalities and improved performance reside in the wavelet-based mode.
FIGURE 9.7: Code block structure of EBCOT.
9.2.1 Main Steps of JPEG2000 Image Compression

The main compression method used in JPEG2000 is the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm, designed by Taubman [5]. In addition to providing excellent compression efficiency, EBCOT produces a bitstream with a number of desirable features, including quality and resolution scalability and random access.

The basic idea of EBCOT is the partition of each subband LL, LH, HL, HH produced by the wavelet transform into small blocks called code blocks. Each code block is coded independently, in such a way that no information from any other block is used. A separate, scalable bitstream is generated for each code block. With its block-based coding scheme, the EBCOT algorithm has improved error resilience. The EBCOT algorithm consists of three steps:

1. Block coding and bitstream generation
2. Postcompression rate-distortion (PCRD) optimization
3. Layer formation and representation

Block Coding and Bitstream Generation. Each subband generated by the 2D discrete wavelet transform is first partitioned into small code blocks of size 32 x 32 or 64 x 64. Then the EBCOT algorithm generates a highly scalable bitstream for each code block B_i. The bitstream associated with B_i may be independently truncated to any member of a predetermined collection of different lengths R_i^n, with associated distortion D_i^n.

For each code block B_i (see Figure 9.7), let s_i[k] = s_i[k1, k2] be the two-dimensional sequence of subband samples, with k1 and k2 the row and column indices. (With this definition, the horizontally high-pass subband HL must be transposed so that k1 and k2 will have meaning consistent with the other subbands. This transposition
FIGURE 9.8: Dead zone quantizer. The length of the dead zone is 2δ. Values inside the dead zone are quantized to 0.
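The dead zone quantizer of Figure 9.8 can be sketched as follows (the midpoint reconstruction offset gamma is our illustrative choice, not mandated by the standard):

```python
import math

def deadzone_quantize(s, delta):
    """Sign/magnitude dead zone quantizer: the zero bin spans
    (-delta, delta), twice the width of every other bin."""
    sign = -1 if s < 0 else 1
    v = int(math.floor(abs(s) / delta))    # quantized magnitude
    return sign, v

def deadzone_dequantize(sign, v, delta, gamma=0.5):
    """Reconstruct gamma of the way into the decision interval;
    values that fell inside the dead zone reconstruct to 0."""
    if v == 0:
        return 0.0
    return sign * (v + gamma) * delta
```

For example, with delta = 1.0 the sample 0.7 lies inside the dead zone and reconstructs to 0, while -2.3 quantizes to magnitude 2 and reconstructs to -2.5.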
means that the HL subband can be treated in the same way as the LH, HH, and LL subbands and use the same context model.)

The algorithm uses a dead zone quantizer, shown in Figure 9.8 — a double-length region straddling 0. Let χ_i[k] ∈ {-1, 1} be the sign of s_i[k] and let v_i[k] be the quantized magnitude. Explicitly, we have

v_i[k] = floor( |s_i[k]| / δ_βi )    (9.2)

where δ_βi is the step size for subband β_i, which contains code block B_i. Let v_i^p[k] be the pth bit in the binary representation of v_i[k], where p = 0 corresponds to the least significant bit, and let p_i^max be the maximum value of p such that v_i^p[k] ≠ 0 for at least one sample in the code block. The encoding process is similar to that of a bitplane coder, in which the most significant bit v_i^(p_i^max)[k] is coded first for all samples in the code block, followed by the next most significant bit v_i^(p_i^max - 1)[k], and so on, until all bitplanes have been coded. In this way, if the bitstream is truncated, then some samples in the code block may be missing one or more least significant bits. This is equivalent to having used a coarser dead zone quantizer for these samples.

In addition, it is important to exploit the previously encoded information about a particular sample and its neighboring samples. This is done in EBCOT by defining a binary-valued state variable σ_i[k], which is initially 0 but changes to 1 when the relevant sample's first nonzero bitplane v_i^p[k] = 1 is encoded. This binary state variable is referred to as the significance of a sample.

Section 8.8 introduces the zerotree data structure as a way of efficiently coding the bitstream for wavelet coefficients. The underlying observation behind the zerotree data structure is that significant samples tend to be clustered, so that it is often possible to dispose of a large number of samples by coding a single binary symbol. EBCOT takes advantage of this observation; however, with efficiency in mind, it exploits the clustering assumption only down to relatively large sub-blocks of size 16 x 16. As a result, each code block is further partitioned into a two-dimensional sequence of sub-blocks B_i[j]. For each bitplane, explicit information is first encoded that identifies sub-blocks containing one or more significant samples. The other sub-blocks are bypassed in the remaining coding phases for that bitplane.

Let σ^p(B_i[j]) be the significance of sub-block B_i[j] in bitplane p. The significance map is coded using a quad tree. The tree is constructed by identifying the sub-blocks with leaf nodes — that is, B_i^0[j] = B_i[j]. The higher levels are built using recursion: B_i^{t+1}[j] = ∪_{z ∈ {0,1}^2} B_i^t[2j + z], 0 ≤ t < T. The root of the tree represents the entire code block: B_i^T[0] = ∪_j B_i[j]. The significance of the code block is identified one quad level at a time, starting from the root at t = T and working toward the leaves at t = 0. The significance values are then sent to an arithmetic coder for entropy coding. Significance values that are redundant are skipped. A value is taken as redundant if any of the following conditions is met:

• The parent is insignificant.
• The current quad was already significant in the previous bitplane.
• This is the last quad visited among those that share the same significant parent, and the other siblings are insignificant.

EBCOT uses four different coding primitives to code new information for a single sample in a bitplane p, as follows:

• Zero coding. This is used to code v_i^p[k], given that the quantized sample satisfies v_i[k] < 2^{p+1}. Because the sample statistics are measured to be approximately Markovian, the significance of the current sample depends on the values of its eight immediate neighbors. The significance of these neighbors can be classified into three categories:
  — Horizontal. h_i[k] = Σ_{z ∈ {-1,1}} σ_i[k1 + z, k2], with 0 ≤ h_i[k] ≤ 2
  — Vertical. v_i[k] = Σ_{z ∈ {-1,1}} σ_i[k1, k2 + z], with 0 ≤ v_i[k] ≤ 2
  — Diagonal. d_i[k] = Σ_{z1, z2 ∈ {-1,1}} σ_i[k1 + z1, k2 + z2], with 0 ≤ d_i[k] ≤ 4

The optimal value λ can be found using a simple bisection method operating on the distortion-rate curve. A detailed description of this method can be found in [6].

Layer Formation and Representation. The EBCOT algorithm offers both resolution and quality scalability, as opposed to other well-known scalable image compression algorithms such as EZW and SPIHT, which offer only quality scalability. This functionality is achieved using a layered bitstream organization and a two-tiered coding strategy. The final bitstream EBCOT produces is composed of a collection of quality layers. The quality layer Q_1 contains the initial R_i^{n_i^1} bytes of each code block B_i, and the other layers Q_q contain the incremental contribution L_i^q = R_i^{n_i^q} - R_i^{n_i^{q-1}} ≥ 0 from code block B_i. The quantity n_i^q is the truncation point corresponding to the rate-distortion threshold λ_q selected for the qth quality layer. Figure 9.10 illustrates the layered bitstream (after [5]).
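The MSB-first bitplane coding and its truncation behavior can be illustrated as follows (a sketch on plain integer magnitudes, ignoring the coding-pass and context details):

```python
def bitplanes(values, nplanes):
    """List the bits of each magnitude, most significant plane first."""
    return [[(v >> p) & 1 for v in values]
            for p in range(nplanes - 1, -1, -1)]

def reconstruct(planes, nplanes, kept):
    """Keep only the top `kept` planes.  Zeroing the missing low planes
    is equivalent to having quantized with a coarser step size."""
    vals = [0] * len(planes[0])
    for idx in range(kept):
        p = nplanes - 1 - idx
        for j, bit in enumerate(planes[idx]):
            vals[j] |= bit << p
    return vals
```

Truncating after two of four planes maps the magnitudes 13, 6, 1 to 12, 4, 0: each sample loses its least significant bits, exactly as a coarser dead zone quantizer would.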
FIGURE 9.10: Three quality layers with eight blocks each.

Along with these incremental contributions, auxiliary information such as the length L_i^q, the number of new coding passes N_i^q = n_i^q - n_i^{q-1}, the value p_i^max when B_i makes its first nonempty contribution to quality layer Q_q, and the index q_i of the quality layer to which B_i first makes a nonempty contribution must be explicitly stored. This auxiliary information is compressed in the second-tier coding engine. Hence, in this two-tiered architecture, the first tier produces the embedded block bitstreams, while the second encodes the block contributions to each quality layer.

The focus of this subsection is the second-tier processing of the auxiliary information accompanying each quality layer. The second-tier coding engine handles carefully the two quantities that exhibit substantial interblock redundancy: p_i^max and the index q_i of the quality layer to which B_i first makes a nonempty contribution.

The quantity q_i is coded using a separate embedded quad-tree code within each subband. Let B_j^0 = B_j be the leaves and B^T be the root of the tree that represents the entire subband. Let q_j^t = min{q_i | B_i ⊂ B_j^t} be the index of the first layer in which any code block in quad B_j^t makes a nonempty contribution. A single bit identifies whether q_j^t > q for each quad at each level t, with redundant quads omitted. A quad is redundant if either q_j^t < q - 1 or q_j^{t+1} > q for some parent quad B_j^{t+1}.

The other redundant quantity to consider is p_i^max. It is clear that p_i^max is irrelevant until the coding of the quality layer Q_{q_i}. Thus, any unnecessary information concerning p_i^max need not be sent until we are ready to encode that layer. EBCOT does this using a modified embedded quad tree driven from the leaves rather than from the root. Let B_j^t be the elements of the quad-tree structure built on top of the code blocks B_i from any subband, and let p_j^{max,t} = max{p_i^max | B_i ⊂ B_j^t}. In addition, let B_i^t be the ancestor quads from which B_i descends and let P be a value guaranteed to be larger than p_i^max for any code block B_i. When code block B_i first contributes to the bitstream in quality layer Q_q, the value of p_i^max is coded using the following algorithm:

• For p = P - 1, P - 2, ..., 0:
  — Send binary digits to identify whether p_i^{max,t} < p. The redundant bits are skipped.
  — If p_i^max = p, then stop.

The redundant bits are those corresponding to the condition p_i^{max,t} < p that can be inferred either from ancestors such that p_i^{max,t+1} < p or from the partial quad-tree code used to identify p_j^max for a different code block B_j.
9.2.2 Adapting EBCOT to JPEG2000

JPEG2000 uses the EBCOT algorithm as its primary coding method. However, the algorithm is slightly modified to enhance compression efficiency and reduce computational complexity.

To further enhance compression efficiency, as opposed to initializing the entropy coder using equiprobable states for all contexts, the JPEG2000 standard makes an assumption of highly skewed distributions for some contexts, to reduce the model adaptation cost for typical images.

Several small adjustments are made to the original algorithm to further reduce its execution time. First, a low-complexity arithmetic coder that avoids multiplications and divisions, known as the MQ coder [7], replaces the usual arithmetic coder used in the original algorithm. Furthermore, JPEG2000 does not transpose the HL subband's code blocks. Instead, the corresponding entries in the zero coding context assignment map are transposed. To ensure a consistent scan direction, JPEG2000 combines the forward- and reverse-significance-propagation passes into a single forward-significance-propagation pass with a neighborhood requirement equal to that of the original reverse pass. In addition, reducing the sub-block size to 4 x 4 from the original 16 x 16 eliminates the need to explicitly code sub-block significance. The resulting probability distribution for these small sub-blocks is highly skewed, so the coder behaves as if all sub-blocks are significant.

The cumulative effect of these modifications is an increase of about 40% in software execution speed, with an average loss of about 0.15 dB relative to the original algorithm.

9.2.3 Region-of-Interest Coding

A significant feature of the new JPEG2000 standard is the ability to perform region-of-interest (ROI) coding [8]. Here, particular regions of the image may be coded with better quality than the rest of the image or the background.
The method is called MAXSHIFT, a scaling-based method that scales up the coefficients in the ROI so that they are placed into higher bitplanes. During the embedded coding process, the resulting bits are placed in front of the non-ROI part of the image. Therefore, given a reduced bitrate, the ROI will be decoded and refined before the rest of the image. As a result of these mechanisms, the ROI will have much better quality than the background.
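The MAXSHIFT idea can be sketched on a flat list of coefficients (the mask-based interface is ours for illustration, not the standard's syntax):

```python
def maxshift_scale(coeffs, roi_mask):
    """Scale ROI coefficients up by s bits, where 2^s exceeds the largest
    background magnitude.  Every nonzero ROI coefficient then occupies
    bitplanes strictly above every background bitplane, so the embedded
    coder emits all ROI bits first."""
    bg_max = max((abs(c) for c, m in zip(coeffs, roi_mask) if not m),
                 default=0)
    s = bg_max.bit_length()          # smallest s with 2^s > bg_max
    return [c << s if m else c for c, m in zip(coeffs, roi_mask)], s
```

The decoder only needs the single shift value s to undo the scaling, which is what makes arbitrary ROI shapes cheap to signal.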
One thing to note is that regardless of scaling, full decoding of the bitstream will result in reconstruction of the entire image with the highest fidelity available. Figure 9.11 demonstrates the effect of region-of-interest coding as the target bitrate of the sample image is increased.

FIGURE 9.11: Region-of-interest (ROI) coding of an image with increasing bitrate, using a circularly shaped ROI: (a) 0.4 bpp; (b) 0.5 bpp; (c) 0.6 bpp; (d) 0.7 bpp.

9.2.4 Comparison of JPEG and JPEG2000 Performance

After studying the internals of the JPEG2000 compression algorithm, a natural question that comes to mind is, how well does JPEG2000 perform compared to other well-known standards, in particular JPEG? Many comparisons have been made between JPEG and other well-known standards, so here we compare JPEG2000 only to the popular JPEG.

Various criteria, such as computational complexity, error resilience, compression efficiency, and so on, have been used to evaluate the performance of systems. Since our main focus is on the compression aspect of the JPEG2000 standard, here we simply compare compression efficiency. (Interested readers can refer to [9] and [10] for comparisons using other criteria.)

Given a fixed bitrate, let's compare the quality of compressed images quantitatively by the PSNR: for color images, the PSNR is calculated based on the average of the mean square errors of all the RGB components. Also, we visually show results for both JPEG2000 and JPEG compressed images, so that you can make your own qualitative assessment. We perform a comparison for three categories of images: natural, computer-generated, and medical, using three images from each category. The test images used are shown on the textbook web site in the Further Exploration section for this chapter. For each image, we compress using JPEG and JPEG2000 at four bitrates: 0.25 bpp, 0.5 bpp, 0.75 bpp, and 1.0 bpp.

Figure 9.12 shows plots of the average PSNR of the images in each category against bitrate. We see that JPEG2000 substantially outperforms JPEG in all categories. For a qualitative comparison of the compression results, let's choose a single image and show decompressed output for the two algorithms using a low bitrate (0.75 bpp) and the lowest bitrate (0.25 bpp). From the results in Figure 9.13, it should be obvious that images compressed using JPEG2000 show significantly fewer visual artifacts.

9.3 THE JPEG-LS STANDARD
Generally, we would likely apply a lossless compression scheme to images that are critical in some sense, say medical images of a brain, or perhaps images that are difficult or costly to acquire. A scheme in competition with the lossless mode provided in JPEG2000 is the JPEG-LS standard, specifically aimed at lossless encoding [11]. The main advantage of JPEG-LS over JPEG2000 is that JPEG-LS is based on a low-complexity algorithm. JPEG-LS is part of a larger ISO effort aimed at better compression of medical images.

JPEG-LS is in fact the current ISO/ITU standard for lossless or "near lossless" compression of continuous-tone images. The core algorithm in JPEG-LS is called LOw COmplexity LOssless COmpression for Images (LOCO-I), proposed by Hewlett-Packard [11]. The design of this algorithm is motivated by the observation that complexity reduction is often more important overall than any small increase in compression offered by more complex algorithms.

LOCO-I exploits a concept called context modeling. The idea of context modeling is to take advantage of the structure in the input source — conditional probabilities of what pixel values follow from each other in the image. This extra knowledge is called the context. If the input source contains substantial structure, as is usually the case, we could potentially compress it using fewer bits than the 0th-order entropy.
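This claim can be checked numerically with illustrative probabilities for a binary source conditioned on its previous symbol:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 0th-order entropy of a binary source with P(0) = 0.4, P(1) = 0.6
h0 = entropy([0.4, 0.6])                    # about 0.97 bits/symbol

# Conditioning on the previous symbol (the context), with illustrative
# conditional probabilities P(0|prev=0) = 0.8 and P(0|prev=1) = 0.1:
h_given_0 = entropy([0.8, 0.2])             # about 0.72
h_given_1 = entropy([0.1, 0.9])             # about 0.47
h_cond = 0.4 * h_given_0 + 0.6 * h_given_1  # about 0.57 bits/symbol
```

The context-conditioned average rate is well below the 0th-order entropy, which is exactly the structure LOCO-I exploits.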
FIGURE 9.13: Comparison of JPEG and JPEG2000: (a) original image; (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp; (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp. (This figure also appears in the color insert section.)
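The PSNR measure used in the comparison above averages the mean square error over all RGB components. A minimal sketch (flat lists of (R, G, B) tuples and the 8-bit peak of 255 are our assumed layout):

```python
import math

def psnr_rgb(img1, img2, peak=255.0):
    """PSNR in dB from the MSE averaged over all RGB components."""
    n = 0
    sse = 0.0
    for p1, p2 in zip(img1, img2):        # pixels as (R, G, B) tuples
        for c1, c2 in zip(p1, p2):
            sse += float(c1 - c2) ** 2
            n += 1
    mse = sse / n
    if mse == 0.0:
        return float("inf")               # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

For example, two 8-bit images differing by exactly 1 in every channel have MSE = 1 and thus PSNR of about 48.13 dB.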
FIGURE 9.14: Causal context template of LOCO-I: neighbors c, b, d (in the row above) and a (to the left) of the current pixel x.
As a simple example, suppose we have a binary source with P(0) = 0.4 and P(1) = 0.6. Then the 0th-order entropy H(S) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.97. Now suppose we also know that this source has the property that if the previous symbol is 0, the probability of the current symbol being 0 is 0.8, and if the previous symbol is 1, the probability of the current symbol being 0 is 0.1. If we use the previous symbol as our context, we can divide the input symbols into two sets, corresponding to context 0 and context 1, respectively. Then the entropy of each of the two sets is

H(S1) = -0.8 log2(0.8) - 0.2 log2(0.2) = 0.72
H(S2) = -0.1 log2(0.1) - 0.9 log2(0.9) = 0.47

The average bitrate for the entire source would be 0.4 x 0.72 + 0.6 x 0.47 = 0.57, which is substantially less than the 0th-order entropy of the entire source in this case.

LOCO-I uses the context model shown in Figure 9.14. In raster scan order, the context pixels a, b, c, and d all appear before the current pixel x. Thus, this is called a causal context. LOCO-I can be broken down into three components:

• Prediction. Predicting the value of the next sample x' using a causal template
• Context determination. Determining the context in which x' occurs
• Residual coding. Entropy coding of the prediction residual conditioned by the context of x'

9.3.1 Prediction

A better version of prediction can use an adaptive model based on a calculation of the local edge direction. However, because JPEG-LS is aimed at low complexity, the LOCO-I algorithm instead uses a fixed predictor that performs primitive tests to detect vertical and horizontal edges. The fixed predictor used by the algorithm is given as follows:

x̂ = min(a, b)    if c ≥ max(a, b)
x̂ = max(a, b)    if c ≤ min(a, b)
x̂ = a + b - c    otherwise

9.3.2 Context Determination

The context model that conditions the current prediction error (the residual) is indexed using a three-component context vector Q = (q1, q2, q3), whose components are

q1 = d - b
q2 = b - c
q3 = c - a

Since the frame rate of video is often relatively high (e.g., ≥ 15 frames per second) and the camera parameters (focal length, position, viewing angle, etc.) usually do not change rapidly between frames, the contents of consecutive frames are usually similar, unless certain objects in the scene move extremely fast. In other words, the video has temporal redundancy. Temporal redundancy is often significant and it is exploited, so that not every frame of the video needs to be coded independently as a new image. Instead, the difference between the current frame and other frame(s) in the sequence is coded. If the redundancy between them is great enough, the difference images could consist mainly of small values and low entropy, which is good for compression.

As we mentioned, although a simplistic way of deriving the difference image is to subtract one image from the other (pixel by pixel), such an approach is ineffective in yielding a high compression ratio. Since the main cause of the difference between frames is camera and/or object motion, these motion generators can be "compensated" by detecting the displacement of corresponding pixels or regions in these frames and measuring their differences. Video compression algorithms that adopt this approach are said to be based on motion compensation (MC). The three main steps of these algorithms are:

1. Motion estimation (motion vector search)
2. Motion-compensation-based prediction
3. Derivation of the prediction error — the difference

For efficiency, each image is divided into macroblocks of size N x N. By default, N = 16 for luminance images. For chrominance images, N = 8 if 4:2:0 chroma subsampling is adopted. Motion compensation is performed not at the pixel level, nor at the level of a video object, as in later video standards (such as MPEG-4). Instead, it is at the macroblock level. The current image frame is referred to as the Target frame. A match is sought between the macroblock under consideration in the Target frame and the most similar macroblock in
Chapter 10: Basic Video Compression Techniques
previous and/or future frame(s) [referred to as Reference frame(s)]. In that sense, the Target macroblock is predicted from the Reference macroblock(s). The displacement of the reference macroblock to the target macroblock is called a motion vector MV. Figure 10.1 shows the case of forward prediction, in which the Reference frame is taken to be a previous frame. If the Reference frame is a future frame, it is referred to as backward prediction. The difference of the two corresponding macroblocks is the prediction error. For video compression based on motion compensation, after the first frame, only the motion vectors and difference macroblocks need be coded, since they are sufficient for the decoder to regenerate all macroblocks in subsequent frames. We will return to the discussion of some common video compression standards after the following section, in which we discuss search algorithms for motion vectors.
10.3 Search for Motion Vectors
Target frame, pixel by pixel, and their respective MAD is then derived using Equation (10.1). The vector (i, j) that offers the least MAD is designated the MV (u, v) for the macroblock in the Target frame.

PROCEDURE 10.1 Motion-vector: sequential search

BEGIN
    min_MAD = LARGE_NUMBER;   /* Initialization */
    for i = -p to p
        for j = -p to p
        {
            cur_MAD = MAD(i, j);
            if cur_MAD < min_MAD
            {
                min_MAD = cur_MAD;
                u = i;   /* Get the coordinates for MV */
                v = j;
            }
        }
END

A TDM multiplexer combines data from m (m > 1) channels. A bit (or byte) will be taken from each buffer at one of the m cycled time slots until a frame is formed. The TDM frame will be transmitted and then demultiplexed after its reception. The scheme described above is known as Synchronous TDM, in which each of the m buffers is scanned in turn and treated equally. If, at a given time slot, some sources
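Procedure 10.1, together with the MAD cost of Equation (10.1), can be sketched as follows (frames as lists of rows, with y indexing rows and x columns, is our illustrative layout):

```python
def mad(target, ref, x, y, i, j, N):
    """Mean absolute difference between the N x N target macroblock at
    (x, y) and the reference macroblock displaced by (i, j)."""
    total = 0
    for r in range(N):
        for c in range(N):
            total += abs(target[y + r][x + c] - ref[y + j + r][x + i + c])
    return total / float(N * N)

def sequential_search(target, ref, x, y, N, p):
    """Exhaustive scan of all (2p + 1)^2 displacements within +/-p,
    returning the motion vector (u, v) with minimum MAD."""
    min_mad, u, v = float("inf"), 0, 0
    for i in range(-p, p + 1):
        for j in range(-p, p + 1):
            cur = mad(target, ref, x, y, i, j, N)
            if cur < min_mad:
                min_mad, u, v = cur, i, j
    return u, v
```

The cost is quadratic in the search range p, which is what motivates the faster (e.g., logarithmic) search methods discussed in this chapter.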
Chapter 15: Multiplexing Technologies
TABLE 15.1: Comparison of TDM Carrier Standards

Format   Channels   Data rate (Mbps)   |   Format   Channels   Data rate (Mbps)
T1       24         1.544              |   E1       32         2.048
T2       96         6.312              |   E2       128        8.448
T3       672        44.736             |   E3       512        34.368
T4       4032       274.176            |   E4       2048       139.264
                                       |   E5       8192       565.148
(accordingly buffers) do not have data to transmit, the slot is wasted. Asynchronous TDM (or Statistical TDM) gathers the statistics of the buffers in this regard. It will assign only k (k < m) time slots to scan the k buffers likely to have data to send. Asynchronous TDM has the potential for higher throughput, given the same carrier data rate. There is, however, an overhead, since now the source address must also be sent, along with the data, to have the frame demultiplexed correctly.

Traditionally, voice data over a telephone channel has a bandwidth of 4 kHz. According to the Nyquist theorem, 8,000 samples per second are required for a good digitization. This yields a time interval of 125 µsec for each sample. Each channel can transmit 8 bits per sample, producing a gross data rate (including data and control) for each voice channel of 8 x 8,000 = 64 kbps.

In North America and Japan, a T1 carrier* is basically a synchronous TDM of 24 voice channels (i.e., 24 time slots). Each T1 frame contains 8 x 24 = 192 bits, plus one bit for framing [1, 2]. This yields a gross data rate of 193 bits per 125 µsec, that is, 193 bits/frame x 8,000 frames/sec = 1.544 Mbps. Four T1 carriers can be further multiplexed to yield a T2. Note that T2 has a gross data rate of 6.312 Mbps, which is more than 4 x 1.544 = 6.176 Mbps, because more framing and control bits are needed. In a similar fashion, T3 and T4 are created.

Similar carrier formats have been defined by the ITU-T, with level 1 (E1) starting at 2.048 Mbps, in which each frame consists of 32 time slots: 8 x 32 x 8,000 = 2.048 Mbps. Two slots are used for framing and synchronization; the other 30 are for data channels. The multiplexed number of channels quadruples at each of the next levels: E2, E3, and so on. Table 15.1 compares the data rates of both TDM carrier standards.
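The T1 and E1 arithmetic above can be verified in a few lines:

```python
# T1: 24 slots of 8 bits plus 1 framing bit per 125-microsecond frame
t1_bits_per_frame = 8 * 24 + 1            # 193 bits
t1_rate = t1_bits_per_frame * 8000        # 8,000 frames per second
assert t1_rate == 1_544_000               # 1.544 Mbps

# E1: 32 slots of 8 bits per frame (30 data + 2 framing/synchronization)
e1_rate = 8 * 32 * 8000
assert e1_rate == 2_048_000               # 2.048 Mbps

# One voice channel: 8 bits per sample at 8,000 samples per second
assert 8 * 8000 == 64_000                 # 64 kbps
```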
*The format for the T1 carrier is called DS1, T2 is called DS2, and so on. Less strictly, these two notations (T and DS) are often used interchangeably.

15.2.2 Integrated Services Digital Network (ISDN)

For over a century, Plain Old Telephone Service (POTS) was supported by the public circuit-switched telephone system for analog voice transmission. In the 1980s, the ITU-T started to
develop ISDN to meet the needs of various digital services (e.g., caller ID, instant call setup, teleconferencing) in which digital data, voice, and sometimes video (e.g., in videoconferencing) can be transmitted. By default, ISDN refers to Narrowband ISDN. The ITU-T has subsequently developed Broadband ISDN (B-ISDN). Its default switching technique is Asynchronous Transfer Mode (ATM) [3], which will be discussed later.

ISDN defines several types of full-duplex channels:

• B (bearer)-channel. 64 kbps each. B-channels are for data transmission. Mostly they are circuit-switched, but they can also support packet switching. If needed, one B-channel can be readily used to replace POTS.

• D (delta)-channel. 16 kbps or 64 kbps. The D-channel takes care of call setup, call control (call forwarding, call waiting, etc.), and network maintenance. The advantage of having a separate D-channel is that control and maintenance can be done in real time in the D-channel while B-channels are transmitting data.

The following are the main specifications of ISDN:

• It adopts Synchronous TDM, in which the above channels are multiplexed.

• Two types of interfaces are available to users, depending on the data and subscription rates:

  - Basic Rate Interface provides two B-channels and one D-channel (at 16 kbps). The total of 144 kbps (64 x 2 + 16) is multiplexed and transmitted over a 192 kbps link.

  - Primary Rate Interface provides 23 B-channels and one D-channel (at 64 kbps) in North America and Japan; 30 B-channels and two D-channels (at 64 kbps) in Europe. The 23B and 1D fit in T1 nicely, because T1 has 24 time slots and a data rate of 24 slots x 64 kbps/slot = 1,536 kbps, whereas the 30B and 2D fit in E1, which has 32 time slots (30 of them available for user channels) and a data rate of 32 x 64 = 2,048 kbps.
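The interface arithmetic can be checked the same way:

```python
B, D16, D64 = 64, 16, 64   # channel rates in kbps

# Basic Rate Interface: 2B + D(16 kbps), carried on a 192 kbps link
bri = 2 * B + D16
assert bri == 144

# Primary Rate Interface (North America/Japan): 23B + 1D fills T1's slots
pri_na = 23 * B + D64
assert pri_na == 24 * 64   # 24 slots of 64 kbps each

# Primary Rate Interface (Europe): 30B + 2D fills E1's 32 slots
pri_eu = 30 * B + 2 * D64
assert pri_eu == 32 * 64
```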
Because of its relatively slow data rate and high cost, narrowband ISDN has generally failed to meet the requirements of data and multimedia networks. For home computer/Internet users, it has largely been replaced by Cable Modem and Asymmetric Digital Subscriber Line (ADSL), discussed below.

15.2.3
Synchronous Optical NETwork (SONET)

SONET is a standard initially developed by Bellcore for optical fibers that support data rates much beyond T3. Subsequent SONET standards are coordinated and approved by ANSI in ANSI T1.105, T1.106, and T1.107. SONET uses circuit switching and synchronous TDM.
TABLE 15.2: Equivalency of SONET and SDH

SONET electrical level   SONET optical level   SDH equivalent   Line rate (Mbps)   Payload rate (Mbps)
STS-1                    OC-1                  -                51.84              50.112
STS-3                    OC-3                  STM-1            155.52             150.336
STS-9                    OC-9                  STM-3            466.56             451.008
STS-12                   OC-12                 STM-4            622.08             601.344
STS-18                   OC-18                 STM-6            933.12             902.016
STS-24                   OC-24                 STM-8            1244.16            1202.688
STS-36                   OC-36                 STM-12           1866.24            1804.032
STS-48                   OC-48                 STM-16           2488.32            2405.376
STS-96                   OC-96                 STM-32           4976.64            4810.752
STS-192                  OC-192                STM-64           9953.28            9621.504
In optical networks, electrical signals must be converted to optical signals for transmission and converted back after their reception. Accordingly, SONET uses the terms Synchronous Transport Signal (STS) for the electrical signals and Optical Carrier (OC) for the optical signals. An STS-1 (OC-1) frame consists of 810 TDM bytes. It is transmitted in 125 µsec, 8,000 frames per second, so the data rate is 810 x 8 x 8,000 = 51.84 Mbps. All other STS-N (OC-N) signals are further multiplexings of STS-1 (OC-1) signals. For example, three STS-1 (OC-1) signals are multiplexed for each STS-3 (OC-3) at 155.52 Mbps.

Instead of SONET, the ITU-T developed a similar standard, Synchronous Digital Hierarchy (SDH), using the technology of Synchronous Transport Module (STM). STM-1 is the lowest in SDH; it corresponds to STS-3 (OC-3) in SONET. Table 15.2 lists the SONET electrical and optical levels and their SDH equivalents and data rates. Among all, OC-3 (STM-1), OC-12 (STM-4), OC-48 (STM-16), and OC-192 (STM-64) are the ones mostly used.

15.2.4 Asymmetric Digital Subscriber Line (ADSL)

ADSL is the telephone industry's answer to the last mile challenge: delivering fast network service to every home. It adopts a higher data rate downstream (from network to subscriber) and a lower data rate upstream (from subscriber to network); hence, it is asymmetric. ADSL makes use of existing telephone twisted-pair lines to transmit Quadrature Amplitude Modulated (QAM) digital signals. Instead of the conventional 4 kHz for audio signals on telephone wires, the signal bandwidth on ADSL lines is pushed to 1 MHz or higher. ADSL uses FDM (Frequency Division Multiplexing) to multiplex three channels:

• The high speed (1.5 to 9 Mbps) downstream channel at the high end of the spectrum
• A medium speed (16 to 640 kbps) duplex channel
• A POTS channel at the low end (next to DC, 0-4 kHz) of the spectrum.²

The three channels can themselves be further divided into 4 kHz subchannels (e.g., 256 subchannels for the downstream channel, for a total of 1 MHz). The multiplexing scheme among these subchannels is also FDM.

Because signals (especially the higher-frequency signals near or at 1 MHz) attenuate quickly on twisted-pair lines, and noise increases with line length, the signal-to-noise ratio will drop to an unacceptable level after a certain distance. Not considering the effect of bridged taps, ADSL has the distance limitations shown in Table 15.3 when using only ordinary twisted-pair copper wires.

The key technology for ADSL is Discrete Multi-Tone (DMT). For better transmission in potentially noisy channels (either downstream or upstream), the DMT modem sends test signals to all subchannels first. It then calculates the signal-to-noise ratios, to dynamically determine the amount of data to be sent in each subchannel. The higher the SNR, the more data sent. Theoretically, 256 downstream subchannels, each capable of carrying over 60 kbps, will generate a data rate of more than 15 Mbps. In reality, DMT delivers 1.5 to 9 Mbps under current technology.

Table 15.4 offers a brief history of various digital subscriber lines (xDSL). DSL corresponds to the basic-rate ISDN service. HDSL was an effort to deliver the T1 (or E1) data rate within a low bandwidth (196 kHz) [2]. However, it requires two twisted pairs for 1.544 Mbps or three twisted pairs for 2.048 Mbps. SDSL provides the same service as HDSL on a single twisted-pair line. VDSL is a standard that is still actively evolving and forms the future of xDSL.
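The DMT idea of sending more data on cleaner subchannels can be sketched with a toy SNR-based allocation. The 10 dB threshold and the 2 kbps-per-dB rule below are made up for illustration and are not the actual DMT bit-loading algorithm:

```python
def dmt_downstream_rate(snrs_db, max_kbps_per_subchannel=60):
    """Toy bit-loading: each 4 kHz subchannel carries a rate that grows
    with its measured SNR, capped at the per-subchannel maximum."""
    total = 0
    for snr in snrs_db:
        if snr < 10:          # too noisy: subchannel left unused
            continue
        total += min(max_kbps_per_subchannel, 2 * snr)  # toy rule: 2 kbps/dB
    return total              # aggregate rate in kbps

# 256 clean subchannels at 30 dB reach the theoretical ~15 Mbps ceiling
assert dmt_downstream_rate([30] * 256) == 256 * 60    # 15,360 kbps
```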
²Alternatively, an ISDN channel can be supported in place of the low and medium-speed channels.

15.3 LAN AND WAN

A Local Area Network (LAN) is restricted to a small geographical area, usually to a relatively small number of stations. A Wide Area Network (WAN) refers to networks across cities and countries. Between LAN and WAN, the term Metropolitan Area Network (MAN) is sometimes also used.
TABLE 15.3: Maximum Distances for ADSL Using Twisted-Pair Copper Wire
(The distance entries are not recoverable here; the rows give maximum distances for data rates of 1.544 Mbps and 6.1 Mbps.)

TABLE 15.4: History of Digital Subscriber Lines

Name           Meaning                                       Data rate
V.32 or V.34   Voice band modems                             1.2 to 56 kbps
DSL            Digital subscriber line                       160 kbps
HDSL           High data rate digital subscriber line        1.544 Mbps or 2.048 Mbps
SDSL           Single line digital subscriber line           1.544 Mbps or 2.048 Mbps
ADSL           Asymmetric digital subscriber line            1.5 to 9 Mbps downstream; 16 to 640 kbps upstream
VDSL           Very high data rate digital subscriber line   13 to 52 Mbps downstream; 1.5 to 2.3 Mbps upstream

15.3.1
Local Area Networks (LANs)

Most LANs use a broadcast technique; without exception, they use a shared medium. Hence, medium access control is an important issue. The IEEE 802 committee developed the IEEE 802 reference model for LANs. Since layer 3 and above in the OSI reference model are applicable to either LAN, MAN, or WAN, the main developments of the IEEE 802 standards are on the lower layers, the Physical and the Data Link layers. In particular, the Data Link layer's functionality is enhanced, and the layer has been divided into two sublayers:

• Medium Access Control (MAC) layer. This layer assembles or disassembles frames upon transmission or reception, performs addressing and error correction, and regulates access control to a shared physical medium.

• Logical Link Control (LLC) layer. This layer performs flow and error control and MAC-layer addressing. It also acts as an interface to higher layers. LLC is above MAC in the hierarchy.

Following are some of the active IEEE 802 subcommittees and the areas they define:

• 802.1 (Higher Layer LAN Protocols). The relationship between the 802.X standards and the OSI reference model, the interconnection and management of the LANs

• 802.2 (LLC). The general standard for logical link control (LLC)

• 802.3 (Ethernet). Medium access control (CSMA/CD) and physical layer specifications for Ethernet
• 802.5 (Token Ring). Medium access control and physical layer specifications for token ring

• 802.9. LAN interfaces at the medium access control and physical layers for integrated services

• 802.10 (Security). Interoperable LAN/MAN security for other IEEE 802 standards

• 802.11 (Wireless LAN). Medium access method and physical layer specifications for wireless LAN (WLAN)

• 802.14 (Cable-TV based broadband communication network). Standard protocol for two-way transmission of multimedia services over cable TV, e.g., Hybrid Fiber-Coax (HFC) cable modem and cable network

• 802.15 (Wireless PAN). Access method and physical layer specifications for Wireless Personal Area Network (WPAN). A Personal Area Network (PAN) supports coverage on the order of 10 meters

• 802.16 (Broadband wireless). Access method and physical layer specifications for broadband wireless networks
Ethernet. Ethernet is a packet-switched bus network; it is the most popular LAN to date. As of 1998, the coverage of Ethernets had reached 85% of networked computers. To send a message, the recipient's Ethernet address is attached to the message, and the message is sent to everyone on the bus. Only the designated station will receive the message, while others will ignore it.

The problem of medium access control for the network is solved by Carrier Sense Multiple Access with Collision Detection (CSMA/CD). The station that wishes to send a message must listen to the network (carrier sense) and wait until there is no traffic. Apparently, multiple stations could be waiting and then send their messages at the same time, causing a collision. During frame transmission, the station compares the signals received with the ones sent. If they are different, it detects a collision. Once a collision is detected, the station stops sending the frame, and the frame is retransmitted after a random delay.

A good transmission medium for Ethernet is coaxial cable (or optical fiber for newer generations). However, it is also possible to use twisted pair. Since these are simply telephone wires, in most cases they are already in office buildings or homes and do not need to be reinstalled. Often a star LAN is used, in which each station is connected directly to a hub, which also helps cope with the potential of lower transmission quality. The hub is an active device and acts as a repeater: every time it receives a signal from one station, it repeats it, so other stations will hear. Logically, this is still a bus, although it is physically a star network.

The maximum data rate for ordinary Ethernet is 10 Mbps. For the 10 Mbps LAN, unshielded twisted pair was used in 10BASE-T within 100 meters, whereas optical fiber was used in 10BASE-F up to 2 kilometers.
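Classic Ethernet implements the "random delay" above as truncated binary exponential backoff (the text does not detail this rule); a sketch:

```python
import random

def backoff_slots(collisions, rng):
    """After the n-th successive collision on a frame, wait k slot times,
    with k drawn uniformly from 0 .. 2^min(n, 10) - 1; the station gives
    up on the frame after 16 collisions."""
    if collisions > 16:
        raise RuntimeError("excessive collisions: frame dropped")
    return rng.randrange(2 ** min(collisions, 10))

rng = random.Random(0)
for n in range(1, 17):
    k = backoff_slots(n, rng)
    assert 0 <= k < 2 ** min(n, 10)   # delay window doubles per collision
```

The widening window is what lets many contending stations spread their retries out over time after repeated collisions.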
Fast Ethernet (known as 100BASE-T) has a maximum data rate of 100 Mbps³ and is entirely Ethernet-compatible. Indeed, it is common nowadays to mix 100BASE-T and 10BASE-T through a switch (instead of a hub): that is, a 100BASE-T link between server and 100BASE-T switch, and several 10BASE-T links between the switch and workstations. Since the switch is capable of handling multiple communications at the same time, all workstations can communicate up to a maximum data rate of 10 Mbps.

Token Ring. Stations on a token ring are connected in a ring topology, as the name suggests. Data frames are transmitted in one direction around the ring and can be read by all stations. The ring structure can be merely logical when stations are actually (physically) connected to a hub, which repeats and relays the signal down the "ring".

A small frame, called a token, circulates while the ring is idle. To transmit, a station S must wait until the token arrives. The source station S then seizes the token and converts it to a front end of its data frame, which then travels on the ring and is received by the destination station. The data frame continues traveling on the ring until it comes back to station S, which releases it and puts it back onto the ring. Access to the shared medium is regulated by allowing only one token; hence, collision is avoided.

By default, the ring operates in a round-robin fashion. Every time a token is released, the next station gets the chance to take it, and so on. Optionally, a multiple priority scheme can also be used for access control: a station can transmit a frame at a given priority if it can grab a token with an equal or lower priority; otherwise, it makes a reservation and waits for its turn.

The data rates of the token rings were either 4 Mbps or 16 Mbps over shielded twisted pair. The 4 Mbps ring manipulates the token as described above. In the 16 Mbps ring, the token can be released as soon as the source station sends out the data frame.
This increases ring usage by allowing more than one frame to travel on the ring simultaneously. New technology has enabled 100 Mbps token rings [2], and IEEE 802.5v was a feasibility study for Gigabit token ring in 1998.

Fiber Distributed Data Interface (FDDI). FDDI is a successor of the original token ring [4]. Medium access control (MAC) of FDDI is similar to MAC in IEEE 802.5 described above for token rings. FDDI has a dual-ring topology, with its primary ring for data transmission and secondary ring for fault tolerance [5]. If damage is detected in both rings, they can be joined to function as a single ring.

The bit rate of FDDI is 100 Mbps. Because of the relatively fast transmission speed, the source stations will simply absorb the token (instead of converting it as part of the data frame, as in the original token ring) before sending their data frame(s). In FDDI, once a station captures a token, it is granted a time period and may send as many data frames as it can within the period. Also, the token will be released as soon as the frames are transmitted (early token release).

³Next-generation Ethernets are Gigabit Ethernet and 10-Gigabit Ethernet, which will be described later.
The FDDI network is allowed to spread over distances up to 100 km. It supports up to 500 stations, as long as the maximum distance between neighboring stations is less than 2 kilometers. Hence, FDDI is primarily used in LAN or MAN backbones.

FDDI supports both synchronous and asynchronous modes [5]. Synchronous mode enables bandwidth reservation and guaranteed data transmission up to the synchronous capacity. Asynchronous mode is similar to the token ring protocol. FDDI-2 supports an additional mode, isochronous mode [5], in which the network is time-sliced, with each machine getting a fixed piece. FDDI can thus provide isochronous services for delay-sensitive applications (such as audio and video) and synchronous and asynchronous services for others in the same network.

15.3.2
Wide Area Networks (WANs)
WAN usually refers to networks across cities and countries. Instead of broadcast, they invariably use some type of switching technology.

Switching Technologies. The common types of switching technologies are circuit switching and packet switching. The latter also has its modern variants of frame relay and cell relay.

• Circuit Switching. The public switched telephone network (PSTN) is a good example of circuit switching, in which an end-to-end circuit (duplex, in this case) must be established that is dedicated for the duration of the connection at a guaranteed bandwidth. Although initially designed for voice communications, it can also be used for data transmission. Indeed, it is still the basis for narrowband ISDN, discussed in Section 15.2.2. To cope with multiple users and variable data rates, it adopts FDM or synchronous TDM multiplexing. Circuit switching is preferable if the user demands a connection and/or more or less constant data rates, as in certain constant-bitrate video communications. It is inefficient for general multimedia communication, especially for variable (sometimes bursty) data rates.

• Packet Switching. Packet switching is used for almost all data networks in which data rates tend to be variable and sometimes bursty. Before transmission, data is broken into small packets, usually 1,000 bytes or less. The header of each packet carries necessary control information, such as destination address, routing, and so on. X.25 was the most commonly used protocol for packet switching. Generally, two approaches are available to switch and route the packets: datagram and virtual circuits. In the former, each packet is treated independently as a datagram. No transfer route is predetermined prior to the transmission; hence, packets may be unknowingly lost or arrive in the wrong order. It is up to the receiving station to detect and recover the errors, as is the case with TCP/IP. In virtual circuits, a route is predetermined through request and accept by all nodes along the route. It is a "circuit" because the route is fixed (once negotiated) and used for the duration of the connection; nonetheless, it is "virtual" because the "circuit" is only logical and not dedicated, and packets from the same source to the same destination can be transferred through different "circuits". Sequencing (ordering the packets) is much easier in virtual circuits. Retransmission is usually requested upon detection of an error. Packet switching becomes ineffective when the network is congested and becomes unreliable by severely delaying or losing a large number of packets.

• Frame Relay. Modern high-speed links have a low error rate; in optical fiber, it can be down to the order of 10^-12. The many bits added to each packet for excessive error checking in ordinary packet switching (X.25) thus become unnecessary. Like X.25, frame relay works at the data link control layer. Frame relay made the following major changes to X.25:

  - Reduction of error checking. No more acknowledgment, no more hop-to-hop flow control and error control. Optionally, end-to-end flow control and error control can be performed at a higher layer.

  - Reduction of layers. The multiplexing and switching of virtual circuits are changed from layer 3 in X.25 to layer 2. Layer 3 of X.25 is eliminated.
Frame relay is basically a cheaper version of packet switching, with minimal services. Frames have a length up to 1,600 bytes. When a bad frame is received, it will simply be discarded. The data rate for frame relay is thus much higher, in the range of T1 (1.5 Mbps) to T3 (44.7 Mbps).

• Cell Relay (ATM). Asynchronous Transfer Mode adopts small, fixed-length (53-byte) packets referred to as cells; hence, ATM is also known as cell relay. As Figure 15.2 shows, the small packet size is beneficial in reducing latency in ATM networks. When the darkened packet arrives slightly behind another packet of a normal size (e.g., 1 kB) in Figure 15.2(a), it must wait for the completion of the other's transmission, causing serialization delay. When the packet (cell) size is small, as in Figure 15.2(b), much less waiting time is needed for the darkened cell to be sent. This turns out to significantly increase network throughput, which is especially beneficial for real-time multimedia applications. ATM is known to have the potential to deliver high data rates at hundreds (and thousands) of Mbps.

Figure 15.3 compares the above four switching technologies in terms of their bitrates and complexity. It can be seen that circuit switching is the least complex and offers a constant (fixed) data rate, while packet switching is the opposite.

FIGURE 15.3: Comparison of different switching techniques.

15.3.3 Asynchronous Transfer Mode (ATM)

Ever since the 1980s, the dramatic increase in data communications and multimedia services (voice, video, etc.) has posed a major challenge to telecommunication networks. With the ever-expanding bandwidth through optical fiber, broadband ISDN (B-ISDN) became a reality. By 1990, the ITU-T (formerly CCITT) adopted Synchronous Optical Network / Synchronous Digital Hierarchy (SONET/SDH) as the base of B-ISDN. Since SONET
FIGURE 15.4: ATM UNI cell header. The 5-byte (40-bit) header comprises, from the most significant bits: GFC (4 bits), VPI (8 bits), VCI (16 bits), PT (3 bits), CLP (1 bit), and HEC (8 bits).

GFC = General Flow Control          PT = Payload Type
VPI = Virtual Path Identifier       CLP = Cell Loss Priority
VCI = Virtual Channel Identifier    HEC = Header Error Check

Initially, ATM was used for WANs, especially serving as backbones. Nowadays, it is also used in LAN applications.
uses circuit switching technology and specifies only the transmission and multiplexing of data, a new standard for switching technology was desired. ATM can provide high speed and low delay; its operational version has been scaled to 2.5 Gbps (OC-48). ATM is also flexible in supporting various technologies, such as frame relay (bursty), IP Ethernet, xDSL, SONET/SDH, and wireless networks. Moreover, it is capable of guaranteeing predefined levels of Quality of Service (QoS). Hence, ATM was chosen as the switching technology for B-ISDN.

FIGURE 15.2: Latency: (a) serialization delay in a normal packet-switching network; (b) lower latency in a cell network.
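The serialization delay illustrated in Figure 15.2 is easy to quantify; the link rate below (an OC-3 line rate) and the packet sizes are illustrative:

```python
def serialization_delay_us(packet_bytes, link_bps):
    """Time to clock one packet onto the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

link = 155_520_000                                 # OC-3: 155.52 Mbps
wait_packet = serialization_delay_us(1024, link)   # behind a 1 kB packet
wait_cell = serialization_delay_us(53, link)       # behind a 53-byte cell
assert wait_cell < wait_packet / 19                # roughly a 19x shorter wait
```

A cell arriving just behind another cell waits only a couple of microseconds, instead of tens of microseconds behind a full-size packet, which is the throughput and latency advantage the figure depicts.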
The ATM Cell Structure. ATM cells have a fixed format: their size is 53 bytes, of which the first 5 bytes are for the cell header, followed by 48 bytes of payload. The ATM layer has two types of interfaces: the User-Network Interface (UNI) is local, between a user and an ATM network, and the Network-Network Interface (NNI) is between ATM switches.

Figure 15.4 illustrates the structure of an ATM UNI cell header. The header starts with a 4-bit General Flow Control (GFC) field, which controls traffic entering the network at the local user-network level. It is followed by an 8-bit Virtual Path Identifier (VPI) and a 16-bit Virtual Channel Identifier (VCI) for selecting a particular virtual path and virtual circuit, respectively. The combination of VPI (8 bits) and VCI (16 bits) provides a unique routing indicator for the cell. As an analogy, the VPI is like an area code (604), and the VCI is like the following digits (555-1212) in a phone number.

The 3-bit payload type (PT) specifies whether the cell is for user data or management and maintenance, network congestion, and so on. For example, 000 indicates user data cell type 0, no congestion; 010 indicates user data cell type 0, congestion experienced. PT may be altered by the network, say from 000 to 010, to indicate that the network has become congested. The 1-bit cell loss priority (CLP) allows the specification of a low-priority cell when CLP is set to 1. This provides a hint to the ATM switches about which cells to drop when the network is congested. The 8-bit header error check (HEC) checks errors only in the header (not in the payload). Since the rest of the header is only 32 bits long, this is a relatively long 8-bit field; it is used for both error checking and correction [2].

The NNI header is similar to the UNI header, except it does not have the 4-bit GFC. Instead, its VPI is increased to 12 bits.
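As a sketch of the UNI header layout in Figure 15.4, the following packs and unpacks the 40-bit header. The HEC value is simply taken as given, since its CRC-8 computation is not covered here:

```python
def pack_uni_header(gfc, vpi, vci, pt, clp, hec):
    """Pack the 40-bit ATM UNI header: GFC(4) VPI(8) VCI(16) PT(3) CLP(1) HEC(8)."""
    word = (gfc << 36) | (vpi << 28) | (vci << 12) | (pt << 9) | (clp << 8) | hec
    return word.to_bytes(5, "big")

def unpack_uni_header(header):
    """Recover the six fields from a 5-byte UNI header."""
    word = int.from_bytes(header, "big")
    return {
        "gfc": (word >> 36) & 0xF,
        "vpi": (word >> 28) & 0xFF,
        "vci": (word >> 12) & 0xFFFF,
        "pt":  (word >> 9) & 0x7,
        "clp": (word >> 8) & 0x1,
        "hec": word & 0xFF,
    }

# A low-priority (CLP = 1) user data cell reporting congestion (PT = 010)
h = pack_uni_header(gfc=0, vpi=6, vci=0x0212, pt=0b010, clp=1, hec=0)
fields = unpack_uni_header(h)
assert fields["vpi"] == 6 and fields["vci"] == 0x0212
assert fields["pt"] == 0b010 and fields["clp"] == 1
```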
ATM Layers and Sublayers. Figure 15.5 illustrates the comparison between OSI layers and ATM layers and sublayers at the ATM Adaptation Layer (AAL) and below. As
FIGURE 15.5: Comparison of OSI (layer 4 and below) and ATM layers.

AAL = ATM Adaptation Layer          TC = Transmission Convergence
CS = Convergence Sublayer           PMD = Physical Medium Dependent
SAR = Segmentation and Reassembly

TABLE 15.5: Comparison of Fast, Gigabit, and 10-Gigabit Ethernets

                     Fast Ethernet         Gigabit Ethernet       10-Gigabit Ethernet
                     (100BASE-T)           (1000BASE-T)
Data rate            100 Mbps              1 Gbps                 10 Gbps
Transmission mode    Full or half duplex   Full or half duplex    Full duplex only
Access method        CSMA/CD               CSMA/CD                N/A (no collision)
Medium               Copper or fiber       Fiber or copper        Fiber only
Target distance      Up to 2 km (fiber)    Up to 5 km (SM fiber)  Up to 40 km (SM fiber)
                     200 m (copper)        550 m (MM fiber)       300 m (MM fiber)
                                           25 m (copper)
Network type         LAN                   LAN/MAN                LAN/MAN/WAN
IEEE Standard        802.3u                802.3z                 802.3ae
Year                 1995                  1998                   2002
shown, AAL corresponds to the OSI Transport layer and part of the Network layer. It consists of two sublayers: the convergence sublayer (CS) and segmentation and reassembly (SAR). CS provides an interface (convergence) to user applications. SAR is in charge of cell segmentation and reassembly. The ATM layer corresponds to parts of the OSI Network and Data Link layers. Its main functions are flow control, management of virtual circuits and paths, and cell multiplexing and demultiplexing. The ATM Physical layer consists of two sublayers: Transmission Convergence (TC) and Physical Medium Dependent (PMD). PMD corresponds to the OSI Physical layer, whereas TC does header error checking and packing/unpacking of frames (cells). This makes the ATM Physical layer very different from the OSI Physical layer, where framing is left for the OSI Data Link layer.
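SAR's segmentation of a higher-layer payload into 48-byte cell payloads can be sketched as follows; the zero-padding convention is illustrative (real AALs add their own trailers and length fields):

```python
CELL_PAYLOAD = 48  # bytes of payload in each 53-byte ATM cell

def segment(data):
    """Split data into 48-byte cell payloads, zero-padding the last one."""
    cells = []
    for off in range(0, len(data), CELL_PAYLOAD):
        chunk = data[off:off + CELL_PAYLOAD]
        cells.append(chunk + b"\x00" * (CELL_PAYLOAD - len(chunk)))
    return cells

cells = segment(b"x" * 100)   # 100 bytes -> 3 cells (48 + 48 + 4 padded)
assert len(cells) == 3
assert all(len(c) == CELL_PAYLOAD for c in cells)
assert cells[2][:4] == b"xxxx" and cells[2][4:] == b"\x00" * 44
```

Reassembly is the reverse: concatenate the payloads and strip the padding (which is why real AALs must record the true payload length).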
15.3.4 Gigabit and 10-Gigabit Ethernets

Gigabit Ethernet became a standard (IEEE 802.3z) in 1998 [2]. It employs the same frame format and size as the previous Ethernets and is backward compatible with 10BASE-T and 100BASE-T. It is generally known as 1000BASE-T, although it can be further classified as 1000BASE-LX, 1000BASE-SX, 1000BASE-CX, and 1000BASE-T when it uses various fiber or copper media. The maximum link distance under 1000BASE-LX is 5 kilometers for single-mode optical fiber (SM fiber), 550 meters for multimode fiber (MM fiber), and merely 25 meters for shielded twisted pair.
Gigabit Ethernet adopts full-duplex modes for connections to and from switches and half-duplex modes for shared connections that use repeaters. Since collisions do occur frequently in half-duplex modes, Gigabit Ethernet uses the standard Ethernet access method, Carrier Sense Multiple Access with Collision Detection (CSMA/CD), as in its predecessors. Gigabit Ethernet has been rapidly replacing Fast Ethernet and FDDI, especially in network backbones. It has gone beyond LAN and found use in MANs.

10-Gigabit Ethernet was completed in 2002. It retains the main characteristics of Ethernet (bus, packet switching) and the same packet format as before. At a data rate of 10 Gbps, it functions only over optical fiber. Since it operates only under full duplex (switches and buffered distributors), it does not need CSMA/CD for collision detection. 10-Gigabit Ethernet is expected to finally enable the convergence of voice and data networks. It can be substantially cheaper than ATM. Its design encompasses all of LAN, MAN, and WAN, and its carrying capacity is equivalent or superior to Fiber Channel, High Performance Parallel Interface (HIPPI), Ultra 320 or 640 SCSI, and ATM/SONET OC-192. The maximum link distance is increased to 40 kilometers for SM fiber (see Table 15.5). In fact, special care is taken for interoperability with SONET/SDH, so Ethernet packets can readily travel across SONET/SDH links.

Table 15.5 provides a brief comparison of Fast Ethernet, Gigabit Ethernet, and 10-Gigabit Ethernet.
15.4 ACCESS NETWORKS

An access network connects end users to the core network. It is also known as the "last mile" for delivering various multimedia services, which could include Internet access, telephony, and digital and analog TV services.
Beside ADSL, discussed earlier, some known options for access networks are:

• Hybrid Fiber-Coax (HFC) Cable Network. Optical fibers connect the core network with Optical Network Units (ONUs) in the neighborhood, each of which typically serves a few hundred homes. All end users are then served by a shared coaxial cable. Traditionally, analog cable TV was allocated a frequency range of 50-500 MHz, divided into 6 MHz channels for NTSC TV and 8 MHz channels in Europe. For HFC cable networks, the downstream is allocated a frequency range of 450-750 MHz, and the upstream is allocated a range of 5-42 MHz. For the downstream, a cable modem acts as a tuner to capture the QAM-modulated digital stream. The upstream uses Quadrature Phase-Shift Keying (QPSK) [2] modulation, because it is more robust in the noisy and congested frequency spectrum. A potential problem of HFC is the noise or interference on the shared coaxial cable. Privacy and security on the upstream channel are also a concern.

• Fiber To The Curb (FTTC). Optical fibers connect the core network with ONUs at the curb. Each ONU is then connected to dozens of homes via twisted-pair copper or coaxial cable. For FTTC, a star topology is used at the ONUs, so the media to the end user are not shared: a much improved access network over HFC. Typical data rates are T1 to T3 in the downstream direction and up to 19.44 Mbps in the upstream direction.

• Fiber To The Home (FTTH). Optical fibers connect the core network directly with a small group of homes, providing the highest bandwidth. For example, before reaching four homes, a 622 Mbps downstream can be split into four 155 Mbps downstreams by TDM. Since most homes have only twisted pairs and/or coaxial cables, the implementation cost of FTTH will be high.

• Terrestrial Distribution. Terrestrial broadcasting uses the VHF and UHF spectra (approximately 40-800 MHz).
Each channel occupies 8 MHz in Europe and 6 MHz in the U.S., and each transmission covers about 100 kilometers in diameter. AM and FM modulations are employed for analog videos, and Coded Orthogonal Frequency Division Multiplexing (COFDM) for digital videos. The standard is known as Digital Video Broadcasting-Terrestrial (DVB-T). Since the return channel (upstream) is not supported in terrestrial broadcasting, a separate POTS or N-ISDN link is recommended for the upstream in interactive applications.

• Satellite Distribution. Satellite broadcasting uses the Gigahertz spectrum. Each satellite covers an area of several thousand kilometers. For digital video, each satellite channel typically has a data rate of 38 Mbps, good for several Digital Video Broadcasting (DVB) channels. Its standard is Digital Video Broadcasting-Satellite (DVB-S). Similar to DVB-T, POTS or N-ISDN is proposed as a means of supporting upstream data in DVB-S.
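As a rough illustration of the spectrum arithmetic above, the number of whole channels that fit in a band is simply the band width divided by the channel width (the helper below is a hypothetical sketch, not from the text):

```python
def channel_count(low_mhz: float, high_mhz: float, width_mhz: float) -> int:
    """Number of whole channels that fit between low and high (MHz)."""
    return int((high_mhz - low_mhz) // width_mhz)

# Analog cable TV band with 6 MHz NTSC channels (50-500 MHz)
print(channel_count(50, 500, 6))    # 75 channels
# HFC downstream band with 6 MHz channels (450-750 MHz)
print(channel_count(450, 750, 6))   # 50 channels
```

The same arithmetic shows why the narrow 5–42 MHz upstream band is so easily congested: it holds only a handful of channels.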
TABLE 15.6: Speed of Common Peripheral Interfaces

Interface                                     Data rate
Serial port                                   115 kbps
Standard parallel port                        115 kB/s
USB                                           1.5 MB/s
ECP/EPP parallel port                         3 MB/s
IDE                                           3.3–16.7 MB/s
SCSI-1                                        5 MB/s
SCSI-2 (Fast SCSI, Fast narrow SCSI)          10 MB/s
Fast wide SCSI (Wide SCSI)                    20 MB/s
Ultra SCSI (SCSI-3, Ultra narrow SCSI)        20 MB/s
EIDE                                          33 MB/s
Ultra2 SCSI                                   40 MB/s
Wide Ultra SCSI (Fast 20)                     40 MB/s
IEEE 1394 (FireWire, i.Link)                  1.5–50 MB/s
USB 2                                         60 MB/s
Wide Ultra2 SCSI (Fast 40)                    80 MB/s
Ultra3 SCSI                                   80 MB/s
HIPPI                                         100–200 MB/s
Fiber Channel                                 100–400 MB/s
Ultra ATA 133                                 133 MB/s
Wide Ultra3 SCSI (Ultra 160 SCSI, Fast 80)    160 MB/s
Ultra 320 SCSI                                320 MB/s
Ultra 640 SCSI                                640 MB/s

Abbreviations: USB = Universal Serial Bus; SCSI = Small Computer System Interface; ECP = Enhanced Capability Port; EPP = Enhanced Parallel Port; IDE = Integrated Disk Electronics; EIDE = Enhanced IDE; HIPPI = High Performance Parallel Interface; Narrow = 8-bit data; Wide = 16-bit data.
15.5 COMMON PERIPHERAL INTERFACES

For a comparison, Table 15.6 lists the speeds of various common peripheral interfaces for connecting I/O and other devices [hard disk, printer, CD-ROM, pointing devices (e.g., mouse), Personal Digital Assistant (PDA), digital camera, and so on].
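To make the comparison concrete, here is a small sketch (illustrative only; the file size and the conversion 1 MB = 8,000,000 bits are assumptions) estimating how long a 700 MB transfer takes at two of the listed rates; note that the table mixes kbps (bits) with MB/s (bytes):

```python
def transfer_seconds(size_mb: float, rate: float, unit: str) -> float:
    """Estimate transfer time; unit is 'kbps' (kilobits/s) or 'MB/s' (megabytes/s)."""
    size_bits = size_mb * 8_000_000          # treat 1 MB as 8,000,000 bits
    if unit == "kbps":
        return size_bits / (rate * 1_000)
    if unit == "MB/s":
        return size_mb / rate
    raise ValueError("unknown unit")

print(round(transfer_seconds(700, 115, "kbps")))  # serial port: 48696 s (~13.5 h)
print(round(transfer_seconds(700, 60, "MB/s")))   # USB 2: 12 s
```

The three-orders-of-magnitude gap between the slowest and fastest rows of the table is what makes interface choice matter for multimedia devices.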
15.6 FURTHER EXPLORATION

Good general discussions on computer networks and data communications are given in the books by Tanenbaum [1] and Stallings [2]. The Further Exploration section of the text web site for this chapter provides an extensive set of web resources for computer and multimedia networks, including links to

• SONET FAQ, etc.
• xDSL introductions at the DSL Forum web site
• Introductions and White Papers on ATM
• FAQ and White Papers on 10 Gigabit Ethernet at the Alliance web site
• IEEE 802 standards
• IETF Request for Comments (RFC) for IPv6 (Internet Protocol, Version 6)

15.7 EXERCISES

1. What is the main difference between the OSI and TCP/IP reference models?
2. IPv6 is a newer IP protocol. What is its advantage over IPv4?
3. UDP does not provide end-to-end flow control, but TCP does. Explain how this is achieved using sequence numbers. Give an example where a packetized message sent using UDP is received incorrectly, but when using TCP it is received correctly under the same circumstances (without channel errors).
4. As a variation of FDM, WDM is used for multiplexing over fiber-optic channels. Compare WDM with FDM.
5. Both ISDN and ADSL deliver integrated network services, such as voice, video, and so on, to home users or small-office users. What are the advantages of ADSL over ISDN?
6. Several protocols, such as Ethernet, Token ring, and FDDI, are commonly used in LANs. Discuss the functionalities of these three technologies and the differences among them.
7. Frame relay and Cell relay are variants of packet switching. Compare these two technologies.
8. What is the difference between switching and routing? Are routing algorithms specific to a switching technology?
9. How many sublayers are there in ATM? What are they?
10. In HFC cable networks, two modulation schemes are used for sending downstream and upstream data. Why should the upstream case be handled differently from downstream? Should we employ different multiplexing technologies as well?

15.8 REFERENCES

1. A.S. Tanenbaum, Computer Networks, 4th ed., Upper Saddle River, NJ: Prentice Hall PTR, 2003.
2. W. Stallings, Data & Computer Communications, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2000.
3. W. Stallings, ISDN and Broadband ISDN, with Frame Relay and ATM, Upper Saddle River, NJ: Prentice Hall, 1999.
4. K. Tolly, "Introduction to FDDI," Data Communications, 22(11): 81–86, 1993.
5. R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Upper Saddle River, NJ: Prentice Hall PTR, 1995.

CHAPTER 16

Multimedia Network Communications and Applications

Fundamentally, multimedia network communication and (traditional) computer network communication are similar, since they both deal with data communications. However, challenges in multimedia network communications arise because multimedia data (audio, video, etc.) are known as continuous media. They have the following characteristics:

• Voluminous. They demand high data rates, possibly dozens or hundreds of Mbps.

• Real-Time and Interactive. They demand low delay and synchronization between audio and video for "lip sync". In addition, applications such as videoconferencing and interactive multimedia require two-way traffic.

• Sometimes Bursty. Data rates fluctuate drastically; for example, in video-on-demand, there is no traffic most of the time, but it bursts to high volume.
16.1 QUALITY OF MULTIMEDIA DATA TRANSMISSION

16.1.1 Quality of Service (QoS)

Quality of Service (QoS) for multimedia data transmission depends on many parameters. Some of the most important are:

• Data Rate. A measure of transmission speed, often in kilobits per second (kbps) or megabits per second (Mbps).

• Latency (maximum frame/packet delay). Maximum time needed from transmission to reception, often measured in milliseconds (msec). In voice communication, for example, when the round-trip delay exceeds 50 msec, echo becomes a noticeable problem; when the one-way delay is longer than 250 msec, talker overlap will occur, since each caller will talk without knowing the other is also talking.

• Packet loss or error. A measure (in percentage) of the error rate of the packetized data transmission. Packets get lost or garbled, such as over the Internet. They may also be delivered late or in the wrong order. Since retransmission is often undesirable, a
simple error-recovery method for real-time multimedia is to replay the last packet, hoping the error is not noticeable. In general, for uncompressed audio/video, the desirable packet loss is < 10^-2 (lose every hundredth packet, on average). When it approaches 10%, it becomes intolerable. For compressed multimedia and ordinary data, the desirable packet loss is less than 10^-7 to 10^-8. Some prioritized delivery techniques, described in Section 16.1.3, can alleviate the impact of packet loss.

• Jitter (or delay jitter). A measure of smoothness of the audio/video playback. Technically, jitter is related to the variance of frame/packet delays. A large buffer (jitter buffer) can hold enough frames to allow the frame with the longest delay to arrive, reducing playback jitter. However, this increases the latency and may not be desirable in real-time and interactive applications. Figure 16.1 illustrates examples of high and low jitter in frame playbacks.

FIGURE 16.1: Jitters in frame playback: (a) high jitter; (b) low jitter. [Each panel plots frame played against time.]

• Sync skew. A measure of multimedia data synchronization, often measured in milliseconds (msec). For good lip synchronization, the limit of sync skew is ±80 msec between audio and video. In general, ±200 msec is still acceptable. For a video with speaker and voice, the limit of sync skew is 120 msec if video precedes voice and 20 msec if voice precedes video. (The discrepancy is probably because we are used to having sound lag image at a distance.)

Multimedia Service Classes. Based on the above measures, multimedia applications can be classified into the following types:

• Real-Time (also Conversational). Two-way traffic, low latency and jitter, possibly with prioritized delivery, such as voice telephony and video telephony.

• Priority data. Two-way traffic, low loss and low latency, with prioritized delivery, such as e-commerce applications.

• Silver. Moderate latency and jitter, strict ordering and sync. One-way traffic, such as streaming video; or two-way traffic (also Interactive), such as web surfing and Internet games.

• Best Effort (also Background). No real-time requirement, such as downloading or transferring large files (movies).

• Bronze. No guarantees for transmission.

Table 16.1 lists the general bandwidth/bit rate requirements for multimedia networks. Table 16.2 lists some specifications for tolerance to delay and jitter in digital audio and video of different qualities.

TABLE 16.1: Requirement on network bandwidth/bit rate.

Application                      Speed requirement
Telephone                        16 kbps
Audio conferencing               32 kbps
CD-quality audio                 128–192 kbps
Digital music (QoS)              64–640 kbps
H.261                            64 kbps–2 Mbps
H.263                            <64 kbps
MPEG-4 video-on-demand (QoS)     250–750 kbps
Videoconferencing (QoS)          384 kbps–2 Mbps
HDTV (compressed)                >20 Mbps
HDTV (uncompressed)              1 Gbps

Perceived QoS. Although QoS is commonly measured by the above technical parameters, QoS itself is a "collective effect of service performances that determine the degree of satisfaction of the user of that service," as defined by the International Telecommunications Union. In other words, it has everything to do with how the user perceives it. In real-time multimedia, regularity is more important than latency (i.e., jitter and quality fluctuation are more annoying than slightly longer waiting); temporal correctness is more important than the sound and picture quality (i.e., ordering and synchronization of audio and video are of primary importance); and humans tend to focus on one subject at a time.
User focus is usually at the center of the screen, and it takes time to refocus, especially after a scene change. Together with the perceptual nonuniformity we have studied in previous chapters, many issues of perception can be exploited in achieving the best perceived QoS in networked multimedia.

TABLE 16.2: Tolerance of latency and jitter in digital audio and video.

Application                          Average latency      Average jitter
                                     tolerance (msec)     tolerance (msec)
Low-end videoconference (64 kbps)    300                  130
Compressed voice (16 kbps)           30                   130
MPEG NTSC video (1.5 Mbps)           5                    7
MPEG audio (256 kbps)                7                    9
HDTV video (20 Mbps)                 0.8                  1

16.1.2 QoS for IP Protocols

QoS policies and technologies enable key metrics discussed in the previous section, such as latency, packet loss, and jitter, to be controlled by offering different levels of service to different packet streams or applications. Frame relay routing protocol and ATM provide some levels of QoS, but currently most Internet applications are built on IP. IP is a "best-effort" communications technology and does not differentiate among different IP applications. Therefore it is hard to provide QoS over IP by current routing methods. Abundant bandwidth improves QoS, but in complex networks, abundant bandwidth is unlikely to be available everywhere (in practice, many IP networks routinely use oversubscription). In particular, it is unlikely to be available in all the access links. Even if it is available everywhere, bandwidth alone can't resolve problems due to sudden peaks in traffic.

Differentiated Service (DiffServ) uses the DiffServ code [Type of Service (TOS) octet in the IPv4 packet and Traffic Class octet in the IPv6 packet] to classify packets to enable their differentiated treatment. It is becoming more widely deployed in intradomain networks and enterprise networks, as it is simpler and scales well, although it is also applicable to end-to-end networks. DiffServ, in conjunction with other QoS techniques, is emerging as the de facto QoS technology. See IETF Request for Comments (RFC) 2998 for more information.

Multiprotocol Label Switching (MPLS) facilitates the marriage of IP to OSI layer 2 technologies, such as ATM, by overlaying a protocol on top of IP. It introduces a 32-bit label and inserts one or more shim labels into the header of an IP packet in a backbone IP network. It thus creates tunnels, called Label Switched Paths (LSPs). By doing so, the backbone IP network becomes connection-oriented. The two main advantages of MPLS are to support Traffic Engineering (TE), which is used essentially to control traffic flow, and Virtual Private Networks (VPNs). Both TE and VPNs help delivery of QoS for multimedia data. MPLS supports eight service classes. For more detail, refer to RFC 3031. DiffServ and MPLS can be used together to allow better control of both QoS performance per class and provision of bandwidth, retaining the advantages of both MPLS and DiffServ.

16.1.3 Prioritized Delivery

When a high packet loss or error rate is detected in the event of network congestion, prioritized delivery of multimedia data can be used to alleviate the perceived deterioration.

• Prioritization for types of media. Transmission algorithms can provide prioritized delivery to different media (for example, giving higher priority to audio than to video), since loss of content in audio is often more noticeable than in video.

• Prioritization for uncompressed audio. PCM audio bitstreams can be broken into groups of every nth sample: prioritize and send k of the total of n groups (k ≤ n) and ask the receiver to interpolate the lost groups if so desired. For example, if two out of four groups are lost, the effective sampling rate is 22.05 kHz instead of 44.1 kHz. Loss is perceived as a change in sampling rate, not dropouts.
• Prioritization for JPEG image. The different scans in Progressive JPEG and different resolutions of the image in Hierarchical JPEG can be given different priorities: for example, highest priority for the scan with the DC and first few AC coefficients, and higher priority to lower-resolution components of the Hierarchical JPEG image.

• Prioritization for compressed video. Video prioritization algorithms can set priorities to minimize playback delay and jitter by giving the highest priority to reception of I-frames and the lowest priority to B-frames. In scalable video (such as MPEG-2 and 4) using layered coding, the base layer can be given higher priority than the enhancement layers.

16.2 MULTIMEDIA OVER IP

Due to the great popularity and availability of the Internet, various efforts have been made to make Multimedia over IP a reality, although it was known to be a challenge. This section will study some of the key issues, technologies, and protocols.

16.2.1 IP-Multicast

In network terminology, a broadcast message is sent to all nodes in the domain, a unicast message is sent to only one node, and a multicast message is sent to a set of specified nodes.¹ IP-multicast enables multicast on the Internet. It is vital for applications such as mailing lists, bulletin boards, group file transfer, audio/video-on-demand, audio/videoconferencing, and so on. Steve Deering introduced IP-multicast technology in his 1988 Ph.D. dissertation.

¹IPv6 also allows anycast, whereby the message is sent to any one of the specified nodes.
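Returning to the prioritization rules of Section 16.1.3, a sender-side priority assignment for compressed video might look like this sketch (the numeric priority levels are arbitrary assumptions, not from the text):

```python
# Map frame/layer types to drop priorities: lower number = keep longer under congestion.
PRIORITY = {"I": 0, "P": 1, "B": 2}

def drop_for_congestion(frames, level):
    """Keep only frames whose priority does not exceed the allowed congestion level."""
    return [f for f in frames if PRIORITY[f] <= level]

gop = ["I", "B", "B", "P", "B", "B", "P"]
print(drop_for_congestion(gop, 1))  # ['I', 'P', 'P'] -- B-frames dropped first
print(drop_for_congestion(gop, 0))  # ['I'] -- only I-frames survive heavy congestion
```

The same idea extends to layered coding: give the base layer priority 0 and each enhancement layer a higher (more droppable) number.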
FIGURE 16.2: Tunnels for IP Multicast in MBone. [The figure shows users attached to routers and mrouters, with tunnels connecting the mrouters.]

One of the first trials of IP-multicast was in March 1992, when the Internet Engineering Task Force (IETF) meeting in San Diego was broadcast (audio only) on the Internet.

MBone. The Internet Multicast Backbone (MBone) is based on IP-multicast technology [1]. Starting in the early 1990s, it has been used, for example, for audio and video conferencing on the Internet [2, 3]. Earlier applications include vat for audio conferencing, vic and nv for video conferencing. Other application tools include wb for whiteboards in shared workspace and sdr for maintaining session directories on MBone.

Since many routers do not support multicast, MBone uses a subnetwork of routers (mrouters) that support multicast to forward multicast packets. As Figure 16.2 shows, the mrouters (or so-called islands) are connected with tunnels. Multicast packets are encapsulated inside regular IP packets for "tunneling", so that they can be sent to the destination through the islands.

Recall that under IPv4, IP addresses are 32 bits. If the first 4 bits are 1110, the message is an IP-multicast message. It covers IP addresses ranging from 224.0.0.0 to 239.255.255.255.

IP-multicast has anonymous membership. The source host multicasts to one of the above IP-multicast addresses; it doesn't know who will receive. The host software maps IP-group addresses into a list of recipients. Then it either multicasts when there is hardware support (e.g., Ethernet and FDDI have hardware multicast) or sends multiple unicasts through the next node in the spanning tree.

One potential problem of multicasting is that too many packets will be traveling and alive in the network. Fortunately, IP packets have a time-to-live (TTL) field that limits the packet's lifetime. Each router decrements the TTL of the pass-by packet by at least one. The packet is discarded when its TTL is zero.

The IP-multicast method described above is based on UDP (not TCP), so as to avoid excessive acknowledgments from multiple receivers for every message. As a result, packets are delivered by "best effort", so reliability is limited.

Internet Group Management Protocol (IGMP). Internet Group Management Protocol (IGMP) was designed to help the maintenance of multicast groups. Two special types of IGMP messages are used: Query and Report. Query messages are multicast by routers to all local hosts, to inquire about group membership. Report is used to respond to a query and to join groups. On receiving a query, members wait for a random time before responding. If a member hears another response, it will not respond. Routers periodically query group membership, and declare themselves group members if they get a response to at least one query. If no responses occur after a while, they declare themselves nonmembers. IGMP version 2 enforces a lower latency, so the membership is pruned more promptly after all members in the group leave.

Reliable Multicast Transport. IETF RFC 2357 was an attempt to define criteria for evaluating reliable IP-multicast protocols. As Almeroth [4] points out, MBone maintains a flat virtual topology and does not provide good route aggregation (at the peak time, MBone had approximately 10,000 routes). Hence, it is not scalable. Moreover, the original design is highly distributed (and simplistic). It assumes no central management, which results in ineffective tunnel management; that is, tunnels connecting islands are not optimally allocated. Sometimes multiple tunnels are created over a single physical link, causing congestion.

Paul et al. [5] presented the Reliable Multicast Transport Protocol (RMTP), which supports route aggregation and hierarchical routing. Whetten and Taskale [6] provided an overview of Reliable Multicast Transport Protocol II (RMTP-II), which supports forward error control (FEC) and is targeted for real-time delivery of multimedia data.
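As an aside, the class-D address test described above (first four address bits 1110) is easy to verify with the Python standard library; a minimal sketch:

```python
import ipaddress

def is_multicast(addr: str) -> bool:
    """True if the top 4 bits of the IPv4 address are 1110 (class D, multicast)."""
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    return first_octet >> 4 == 0b1110

print(is_multicast("224.0.0.1"))        # True  (start of the multicast range)
print(is_multicast("239.255.255.255"))  # True  (end of the multicast range)
print(is_multicast("192.0.2.1"))        # False (ordinary unicast address)
```

The bit test and the dotted range 224.0.0.0–239.255.255.255 are two views of the same condition, since 224 = 11100000 and 239 = 11101111 in binary.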
16.2.2 RTP (Real-time Transport Protocol)

The original Internet design provided "best-effort" service and was adequate for applications such as e-mail and FTP. However, it is not suitable for real-time multimedia applications. RTP is designed for the transport of real-time data, such as audio and video streams, often for audio or videoconferencing. It is intended primarily for multicast, although it can also
be applied to unicast. It was used, for example, in nv for MBone [3], Netscape LiveMedia, Microsoft NetMeeting, and Intel Videophone.

RTP usually runs on top of UDP, which provides efficient (but less reliable) connectionless datagram service. There are two main reasons for using UDP instead of TCP. First, TCP is a connection-oriented transport protocol; hence, it is more difficult to scale up in a multicast environment. Second, TCP achieves its reliability by retransmitting missing packets. As mentioned earlier, in multimedia data transmissions, the reliability issue is less important. Moreover, the late arrival of retransmitted data may not be usable in real-time applications anyway.

Since UDP will not guarantee that the data packets arrive in the original order (not to mention synchronization of multiple sources), RTP must create its own timestamping and sequencing mechanisms to ensure the ordering. RTP introduces the following additional parameters in the header of each packet [7]:

• Payload type indicates the media data type as well as its encoding scheme (e.g., PCM, H.261/H.263, MPEG 1, 2, and 4 audio/video, etc.), so the receiver knows how to decode it.

• Timestamp is the most important mechanism of RTP. The timestamp records the instant when the first octet of the packet is sampled; it is set by the sender. With the timestamps, the receiver can play the audio/video in proper timing order and synchronize multiple streams (e.g., audio and video) when necessary.

• Sequence number is to complement the function of timestamping. It is incremented by one for each RTP data packet sent, to ensure that the packets can be reconstructed in order by the receiver. This becomes necessary, for example, when all packets of a video frame sometimes receive the same timestamp, and timestamping alone becomes insufficient.

• Synchronization source (SSRC) ID identifies sources of multimedia data (e.g., audio, video). If the data come from the same source (translator, mixer), they will be given the same SSRC ID, so as to be synchronized.

• Contributing source (CSRC) ID identifies the source of contributors, such as all speakers in an audio conference.

FIGURE 16.3: RTP packet header.

16.2.3 Real Time Control Protocol (RTCP)

RTCP is a companion protocol of RTP. It monitors QoS in providing feedback to the server (sender) on quality of data transmission and conveys information about the participants of a multiparty conference. RTCP also provides the necessary information for audio and video synchronization, even if they are sent through different packet streams. The five types of RTCP packets are as below.

1. Receiver report (RR) provides quality feedback (number of last packet received, number of lost packets, jitter, timestamps for calculating round-trip delays).
2. Sender report (SR) provides information about the reception of RR, number of packets/bytes sent, and so on.
3. Source description (SDES) provides information about the source (e-mail address, phone number, full name of the participant).
4. Bye indicates the end of participation.
5. Application-specific functions (APP) provides for future extension of new features.
RTP and RTCP packets are sent to the same IP address (multicast or unicast) but on different ports.
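The fixed 12-octet RTP header described in Section 16.2.2 can be packed with Python's struct module; this is a sketch of the bit layout only (V = 2, no padding, extension, or CSRC list), not a working RTP stack:

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    """Pack a minimal fixed RTP header: version 2, P=0, X=0, CC=0."""
    byte0 = 2 << 6                          # bits 0-1: version; P, X, CC all zero
    byte1 = (marker << 7) | payload_type    # bit 8: M; bits 9-15: payload type
    # Network byte order: two single octets, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC ID.
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

hdr = rtp_header(payload_type=96, seq=1, timestamp=3000, ssrc=0x1234)
print(len(hdr))       # 12 (the fixed header is 12 octets)
print(hdr[:2].hex())  # '8060' -> version 2, marker 0, payload type 96
```

Optional CSRC IDs would simply be appended as additional 32-bit words, with the CC field in the first octet counting them.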
Figure 16.3 shows the RTP header format. The first 12 octets are of fixed format, followed by optional (0 or more) 32-bit Contributing Source (CSRC) IDs. Bits 0 and 1 are for the version of RTP, bit 2 (P) for signaling a padded payload, bit 3 (X) for signaling an extension to the header, and bits 4 through 7 for a 4-bit CSRC count that indicates the number of CSRC IDs following the fixed part of the header. Bit 8 (M) signals the first packet in an audio frame or last packet in a video frame, since an audio frame can be played out as soon as the first packet is received, whereas a video frame can be rendered only after the last packet is received. Bits 9 through 15 describe the payload type. Bits 16 through 31 are for the sequence number, followed by a 32-bit timestamp and a 32-bit Synchronization Source (SSRC) ID.
16.2.4 Resource Reservation Protocol (RSVP)

RSVP is a setup protocol for Internet resource reservation. Protocols such as RTP, described above, do not address the issue of QoS control. RSVP was thus developed [8] to guarantee desirable QoS, mostly for multicast, although it is also applicable to unicast. A general communication model supported by RSVP consists of m senders and n receivers, possibly in various multicast groups (e.g., in Figure 16.4(a), m = 2, n = 3, and the trees for the two multicast groups are depicted by the solid and dashed arrows, respectively). In the special case of broadcasting, m = 1; whereas in audio- or videoconferencing, each host acts as both sender and receiver in the session, that is, m = n.
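Reservation merging, described later in this section (a merged reservation accommodates the highest bandwidth among the merged requests), reduces to taking a maximum at each router; a toy sketch:

```python
def merge_at_router(requests_kbps):
    """A merged reservation accommodates the highest requested bandwidth."""
    return max(requests_kbps)

# Two receivers request different rates for the same session; their Resv
# messages meet at a router, so only the larger rate is reserved upstream of it.
print(merge_at_router([384, 128]))  # 384
```

This is why the receiver-initiated scheme scales: upstream of a merge point, the network carries one reservation per session, not one per receiver.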
The main challenges of RSVP are that many senders and receivers may compete for the limited network bandwidth, the receivers can be heterogeneous in demanding different contents with different QoS, and they can be dynamic by joining or quitting multicast groups at any time.

The most important messages of RSVP are Path and Resv. A Path message is initiated by the sender and travels towards the multicast (or unicast) destination addresses. It contains information about the sender and the path (e.g., the previous RSVP hop), so the receiver can find the reverse path to the sender for resource reservation. A Resv message is sent by a receiver that wishes to make a reservation.

RSVP is receiver-initiated. A receiver (at a leaf of the multicast spanning tree) initiates the reservation request Resv, and the request travels back toward the sender, but not necessarily all the way. A reservation will be merged with an existing reservation made by other receiver(s) for the same session as soon as they meet at a router. The merged reservation will accommodate the highest bandwidth requirement among all merged requests. The user-initiated scheme is highly scalable, and it meets users' heterogeneous needs.

• RSVP creates only soft state. The receiver host must maintain the soft state by periodically sending the same Resv message; otherwise, the state will time out. There is no distinction between the initial message and any subsequent refresh message. If there is any change in reservation, the state will automatically be updated according to the new reservation parameters in the refreshing message. Hence, the RSVP scheme is highly dynamic.

Figure 16.4 depicts a simple network with two senders (S1, S2), three receivers (R1, R2, and R3), and four routers (A, B, C, D). Figure 16.4(a) shows that S1 and S2 send Path messages along their paths to R1, R2, and R3. In (b) and (c), R1 and R2 send out Resv messages to S1 and S2, respectively, to make reservations for S1 and S2 resources. From C to A, two separate channels must be reserved, since R1 and R2 requested different datastreams. In (d), R2 and R3 send out their Resv messages to S1, to make additional requests. R3's request was merged with R1's previous request at A, and R2's was merged with R1's at C. Any possible variation of QoS that demands higher bandwidth can be dealt with by modifying the reservation state parameters.

FIGURE 16.4: A scenario of network resource reservation with RSVP: (a) senders S1 and S2 send out their PATH messages to receivers R1, R2, and R3; (b) receiver R1 sends out RESV message to S1; (c) receiver R2 sends out RESV message to S2; (d) receivers R2 and R3 send out their RESV messages to S1.

16.2.5 Real-Time Streaming Protocol (RTSP)
Streaming Audio and Video. In the early days, multimedia data was transmitted over the network (often with slow links) as a whole large file, which would be saved to a disk, then played back. Nowadays, more and more audio and video data is transmitted from a stored media server to the client in a datastream that is almost instantly decoded: streaming audio and streaming video. Usually, the receiver will set aside buffer space to prefetch the incoming stream. As soon as the buffer is filled to a certain extent, the (usually) compressed data will be uncompressed and played back. Apparently, the buffer space needs to be sufficiently large to deal with the possible jitter and to produce continuous, smooth playback. On the other hand, too
large a buffer will introduce unnecessary initial delay, which is especially undesirable for interactive applications such as audio- or videoconferencing [9].

The RTSP Protocol. RTSP is for communication between a client and a stored media server. Figure 16.5 illustrates a possible scenario of four RTSP operations:

1. Requesting presentation description. The client issues a DESCRIBE request to the stored media server to obtain the presentation description, such as media types (audio, video, graphics, etc.), frame rate, resolution, codec, and so on, from the server.

2. Session setup. The client issues a SETUP to inform the server of the destination IP address, port number, protocols, and TTL (for multicast). The session is set up when the server returns a session ID.

3. Requesting and receiving media. After receiving a PLAY, the server starts to transmit streaming audio/video data, using RTP. It is followed by a RECORD or PAUSE. Other VCR commands, such as FAST-FORWARD and REWIND, are also supported. During the session, the client periodically sends an RTCP packet to the server, to provide feedback information about the QoS received (as described in Section 16.2.3).

4. Session closure. TEARDOWN closes the session.

FIGURE 16.5: A possible scenario of RTSP operations. [The figure shows the client/server exchange: GET, OPTIONS, SETUP, PLAY, PAUSE, and TEARDOWN requests and responses, with RTP audio, RTP video, and RTCP flowing during playback.]

16.2.6 Internet Telephony

The Public Switched Telephone Network (PSTN) relies on copper wires carrying analog voice signals. It provides reliable and low-cost voice and facsimile services. In the eighties and nineties, modems were a popular means of "data over voice networks". In fact, they were predominant before the introduction of ADSL and cable modems. As PCs and the Internet became readily available and more and more voice and data communications became digital (e.g., in ISDN), "voice over data networks", especially Voice over IP (VoIP), started to attract a great deal of interest in research and user communities.

With ever-increasing network bandwidth and the ever-improving quality of multimedia data compression, Internet telephony [10] has become a reality. Increasingly, it is not restricted to voice (VoIP); it is about integrated voice, video, and data services. The main advantages of Internet telephony over POTS² are the following:

• It provides great flexibility and extensibility in accommodating integrated services such as voicemail, audio- and videoconferences, mobile phone, and so on.

• It uses packet switching, not circuit switching; hence, network usage is much more efficient (voice communication is bursty and VBR-encoded).
• With the technologies of multicast or multipoint communication, multiparty calls are not much more difficult than two-party calls.

• With advanced multimedia data-compression techniques, various degrees of QoS can be supported and dynamically adjusted according to the network traffic, an improvement over the "all or none" service in POTS.

• Good graphical user interfaces can be developed to show available features and services, monitor call status and progress, and so on.

As Figure 16.6 shows, the transport of real-time audio (and video) in Internet telephony is supported by RTP (whose control protocol is RTCP), as described in Section 16.2.2. Streaming media is handled by RTSP, and Internet resource reservation is taken care of by
RSVP. InterneI telephony is noL simply a streaming media service over lhe InterneL, becaLzse it re quires a sophisticaLed signaling protocol. A sLreaming media server can be readily idenLified
by a URI (Universal Resource Identifier), whereas acceptance of a call via Internet telephony depends on the callee's current location, capability, availability, and desire to communicate. The following are brief descriptions of the H.323 standard and one of the most commonly used signaling protocols, the Session Initiation Protocol (SIP).

²POTS refers to plain old telephone services that do not include new features such as call waiting, call forwarding, and so on.

FIGURE 16.6: Network protocol structure for Internet telephony.

H.323. H.323 [11, 12] is a standard for packet-based multimedia communication services over networks (LAN, Internet, wireless network, etc.) that do not provide a guaranteed QoS. It specifies signaling protocols and describes terminals, multipoint control units (for conferencing), and gateways for integrating Internet telephony with General Switched Telephone Network (GSTN)³ data terminals. The H.323 signaling process consists of two phases:

1. Call setup. The caller sends the gatekeeper (GK) a Registration, Admission and Status (RAS) Admission Request (ARQ) message, which contains the name and phone number of the callee. The GK may either grant permission or reject the request, with reasons such as "security violation" and "insufficient bandwidth".

2. Capability exchange. An H.245 control channel will be established, for which the first step is to exchange capabilities of both the caller and callee, such as whether it is audio, video, or data; compression and encryption; and so on.

H.323 provides mandatory support for audio and optional support for data and video. It is associated with a family of related software standards that deal with call control and data compression for Internet telephony. Following are some of the related standards:

Signaling and Control

• H.225. Call control protocol, including signaling, registration, admissions, packetization and synchronization of media streams
• H.245. Control protocol for multimedia communications — for example, opening and closing channels for media streams, obtaining gateway between GSTN and Internet telephony
• H.235. Security and encryption for H.323 and other H.245-based multimedia terminals

Audio Codecs

• G.711. Codec for 3.1 kHz audio over 48, 56, or 64 kbps channels. G.711 describes Pulse Code Modulation for normal telephony
• G.722. Codec for 7 kHz audio over 48, 56, or 64 kbps channels
• G.723.1. Codec for 3.1 kHz audio over 5.3 or 6.3 kbps channels. (The VoIP Forum adopted G.723.1 as the codec for VoIP.)
• G.728. Codec for 3.1 kHz audio over 16 kbps channels
• G.729, G.729a. Codec for 3.1 kHz audio over 8 kbps channels. (The Frame Relay Forum adopted G.729 as the codec for voice over frame relay.)

Video Codecs

• H.261. Codec for video at p × 64 kbps (p ≥ 1)
• H.263. Codec for low-bitrate video (< 64 kbps) over the GSTN

Related Standards

• H.320. The original standard for videoconferencing over ISDN networks
• H.324. An extension of H.320 for videoconferencing over the GSTN
• T.120. Real-time data and conferencing control

³GSTN is a synonym for PSTN (public switched telephone network).

Session Initiation Protocol (SIP) — A Signaling Protocol. SIP [10] is an application-layer control protocol in charge of establishing and terminating sessions in Internet telephony. These sessions are not limited to VoIP communications; they also include multimedia conferences and multimedia distribution. Similar to HTTP, SIP is a text-based protocol that is different from H.323. It is also a client-server protocol. A caller (the client) initiates a request, which a server processes and responds to. There are three types of servers. A proxy server and a redirect server forward call requests. The difference between the two is that the proxy server forwards the requests to the next-hop server, whereas the redirect server returns the address of the next-hop server to the client, so as to direct the call toward the destination. The third type is a location server, which finds the current locations of users. Location servers usually communicate with the redirect or proxy servers. They may use finger,
rwhois, Lightweight Directory Access Protocol (LDAP), or other multicast-based protocols to determine a user's address. SIP can advertise its session using e-mail, news groups, web pages or directories, or the Session Announcement Protocol (SAP) — a multicast protocol. The methods (commands) for clients to invoke are

• INVITE — invites callee(s) to participate in a call.
• ACK — acknowledges the invitation.
• OPTIONS — inquires about media capabilities without setting up a call.
• CANCEL — terminates the invitation.
• BYE — terminates a call.
• REGISTER — sends the user's location information to a registrar (a SIP server).

Figure 16.7 illustrates a possible scenario when a caller initiates a SIP session:

Step 1. The caller sends an INVITE [email protected] to the local proxy server P1.
Step 2. The proxy uses its Domain Name Service (DNS) to locate the server for [email protected] and sends the request to it.
Steps 3, 4. [email protected] is not logged on the server. A request is sent to the nearby location server. John's current address, [email protected], is located.
Step 5. Since the server is a redirect server, it returns the address [email protected] to the proxy server P1.
Step 6. Try the next proxy server P2 for [email protected].
Steps 7, 8. P2 consults its location server and obtains John's local address, [email protected].
Steps 9, 10. The next-hop proxy server P3 is contacted, which in turn forwards the invitation to where the client (callee) is.
Steps 11-14. John accepts the call at his current location (at work), and the acknowledgments are returned to the caller.

SIP can also use the Session Description Protocol (SDP) to gather information about the callee's media capabilities.

FIGURE 16.7: A possible scenario of SIP session initiation.

Session Description Protocol (SDP). As its name suggests, SDP describes multimedia sessions. As in SIP, SDP descriptions are in textual form. They include the number and types of media streams (audio, video, whiteboard session, etc.), the destination address (unicast or multicast) for each stream, sending and receiving port numbers, and media formats (payload types). When initiating a call, the caller includes the SDP information in the INVITE message. The called party responds and sometimes revises the SDP information, according to its capability.

16.3 MULTIMEDIA OVER ATM NETWORKS

16.3.1 Video Bitrates over ATM

The ATM Forum supports various types of video bitrates:

• Constant Bitrate (CBR). For example, for uncompressed video or CBR-coded video. As mentioned before, if the allocated bitrate of CBR is too low, cell loss and distortion of the video content are inevitable.
• Variable Bitrate (VBR). The most commonly used video bitrate for compressed video. It can be further divided into real-time Variable Bitrate (rt-VBR), suitable for compressed video, and non-real-time Variable Bitrate (nrt-VBR) for specified QoS.
• Available Bitrate (ABR). As in IP-based service, data transmission can be backed off or buffered due to congestion. Cell loss rate and minimum cell data rate can sometimes be specified.
• Unspecified Bitrate (UBR). Provides no guarantee on any quality parameter.
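The distinctions among these service categories can be summarized in a small dispatch sketch. This is purely illustrative: the ATM Forum defines no such selection algorithm, and the function name and its arguments are invented for this example.

```python
def video_service_category(compressed, real_time, qos_specified):
    """Pick an ATM service category for a video stream.

    Illustrative summary of the bullet list above, not an ATM Forum rule.
    """
    if not compressed:
        return "CBR"        # e.g., uncompressed or CBR-coded video
    if real_time:
        return "rt-VBR"     # compressed real-time video
    if qos_specified:
        return "nrt-VBR"    # compressed, non-real-time, with specified QoS
    return "ABR or UBR"     # best-effort categories

category = video_service_category(compressed=True, real_time=True,
                                  qos_specified=False)
# category == "rt-VBR"
```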
TABLE 16.3: Comparison of AAL types.

                               AAL 1            AAL 3/4    AAL 5
CS header/trailer overhead     0 bytes          8 bytes    8 bytes
SAR header/trailer overhead    1 or 2 bytes     4 bytes    0 bytes
SAR payload                    47 or 46 bytes   44 bytes   48 bytes
CS checksum                    None             None       4 bytes
SAR checksum                   None             10 bits    None
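The AAL 5 overhead rows of Table 16.3 can be made concrete with a short sketch (an illustration, not part of the standard's text): AAL 5 appends an 8-byte CS trailer (which contains the 4-byte checksum) and pads the CS-PDU to a multiple of 48 bytes, adding no per-cell SAR overhead. Conveniently, two 188-byte MPEG-2 transport packets plus the trailer fill exactly eight cells (376 + 8 = 384 = 8 × 48), which is what makes the default mapping discussed in Section 16.3.3 so clean:

```python
def aal5_cells(sdu_bytes):
    """Return (cells, padding) for an AAL 5 CS-PDU of sdu_bytes user data.

    AAL 5 appends an 8-byte CS trailer (including the 4-byte checksum) and
    pads so the CS-PDU is a whole number of 48-byte cell payloads; the SAR
    sublayer adds no per-cell header or trailer.
    """
    total = sdu_bytes + 8          # user data plus CS trailer
    cells = -(-total // 48)        # ceiling division
    padding = cells * 48 - total
    return cells, padding

cells, pad = aal5_cells(2 * 188)   # two MPEG-2 transport-stream packets
# cells == 8, pad == 0: the SDU fits exactly into eight ATM cells
```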
FIGURE 16.8: Headers and trailers added at the CS and SAR sublayers.

16.3.2 ATM Adaptation Layer (AAL)

AAL converts various formats of user data into ATM datastreams and vice versa. The following lists five types of AAL protocols:

• AAL type 1 supports real-time, constant bitrate (CBR), connection-oriented data streams.
• AAL type 2 was intended for variable bitrate (VBR) compressed video and audio. However, the protocol never really materialized and is now inactive.
• AAL types 3 and 4 were similar and have since been combined into one type: AAL type 3/4. It supports variable bitrate (VBR) of either connection-oriented or connectionless general (non-real-time) data services.
• AAL type 5 was the new protocol introduced for multimedia data transmission. It promises to support all classes of data and video services (from CBR to UBR, from rt-VBR to nrt-VBR). It is assumed that the layers above the AAL are connection-oriented and that the ATM layer beneath it has a low error rate.

As Figure 16.8 shows, headers and trailers are added to the original user data at the Convergence Sublayer (CS) and the Segmentation And Reassembly (SAR) sublayer. They eventually form the 53-byte ATM cells with the 5-byte ATM header appended.

The existence of the five different types of AAL was due largely to history. In particular, all AAL types except AAL 5 were developed by the telecommunications industry and were generally unsuitable for interactive multimedia applications and services [13]. Table 16.3 provides a comparison among the three active AAL types. For example, comparing AAL 3/4 with AAL 5:

• AAL 3/4 has an overhead of designating 4 bytes for each SAR cell, whereas AAL 5 has none at this sublayer. Considering the numerous SAR cells, this is a substantial saving for AAL 5. It is of course possible only with modern, relatively error-free fiber-optic technology.
• As part of the SAR trailer, AAL 3/4 has a checksum field for error checking. To cut down the overhead, the checksum is only 10 bits long, which is unfortunately inadequate. AAL 5 does its checking at the CS and allocates 4 bytes for the checksum. Again, this is based on the assumption that bit-transmission error is rare. However, when AAL 5 does error checking, it has enough information from the long checksum.

By now, AAL 5 has superseded AAL 3/4. The ATM Forum agrees that besides CBR services, which will use AAL 1, every other service will use AAL 5. For more details of the AALs, see Tanenbaum [13] and Stallings [14]. Table 16.4 summarizes the support for video transmission with and without ATM.

16.3.3 MPEG-2 Convergence to ATM

The ATM Forum has decided that MPEG-2 will be transported over AAL 5. As mentioned in Section 11.3.3, by default, two MPEG-2 packets (each 188 bytes) from the transport stream (TS) will be mapped into one AAL-5 service data unit (SDU) [15]. When establishing a virtual channel connection, the following QoS parameters must be specified:

• Maximum cell transfer delay (latency)
TABLE 16.4: Support for digital video transmission.

Video requirement     Support in ATM                      Support without ATM
Bandwidth             Scalable to several Gbps            Up to 100 Mbps
Latency and jitter    CBR or VBR; AAL 1, 2, 5; LAN        QoS support: RSVP;
                      emulation, circuit emulation, etc.  ISDN and ADSL
Multicasting          Multicast switch, or                IP-multicast or protocol-
                      permanent virtual circuit           independent multicast (PIM)
• Maximum cell delay jitter
• Cell loss ratio (CLR)
• Cell error ratio (CER)
• Severely errored cell block ratio (SECBR)

An audio-visual service-specific convergence sublayer (AVSSCS) is also proposed, to enable transmitting video over AAL 5 using ABR services.

16.3.4 Multicast over ATM

Compared to IP multicast, which is a "best-effort" service provided on top of UDP, multicast in ATM networks had several challenges [16, 17]:

• ATM is connection-oriented; hence, ATM multicasting must set up all multipoint connections.
• QoS in ATM must be negotiated at connection setup time and be known to all switches.
• It is difficult to support multipoint-to-point or multipoint-to-multipoint connections in ATM, because AAL 5 does not keep track of multiplexer number or sequence number. It cannot reassemble the data correctly at the receiver side if cells from different senders are interleaved at their reception.

Scalable and efficient ATM multicast (SEAM) and shared many-to-many ATM reservations (SMART) are two approaches to multicasting over ATM [16]. The former uses a unique identifier and the latter a token scheme to avoid the ambiguity caused by cell interleaving.

16.4 TRANSPORT OF MPEG-4

The design of MPEG-4 was motivated by multimedia applications on the WWW. In particular, multimedia (text, graphics, audio, video, etc.) objects and scene descriptions (temporal and spatial relationships of the video objects) can be transmitted by the server and interpreted and reassembled at the client side, to drastically reduce the multimedia data transmitted onto the WWW. This section briefly describes the Delivery Multimedia Integration Framework (DMIF) and the issue of MPEG-4 over IP.
16.4.1 DMIF in MPEG-4

DMIF is an interface between multimedia applications and their transport. It supports remote interactive network access (IP, ATM, PSTN, ISDN, or mobile), broadcast media (cable or satellite), and local media on disks. The interface is transparent to the application, so a single application can run on different transport layers, as long as the right DMIF is instantiated.

Figure 16.9 shows the integration of delivery through three types of communication mediums. As shown, the local application interacts with a uniform DMIF Application Interface (DAI), which translates the application's requests into specific protocol messages, to be transported through one of the three types of mediums. When the delivery is through a network, the DMIF is unaware of the application. In fact, an additional DMIF Network Interface (DNI) is needed, to take care of the signaling messages for specific networks.

FIGURE 16.9: DMIF — the multimedia content delivery integration framework.

When delivering multimedia data, DMIF is similar to FTP. First, a SETUP session is established with the remote network site. Second, streams are selected, and a STREAM request is sent to the DMIF peer, which returns a pointer to a separate connection where the streaming will take place. Third, the new connection is established, and data is streamed.

In the scenarios of Broadcast and Local storage, the application will know how the data is stored and delivered. Hence, this becomes part of the DMIF implementation.

DMIF has built-in QoS monitoring capability. It supports (a) continuous monitoring, (b) specific QoS queries, and (c) QoS violation notification.

16.4.2 MPEG-4 over IP
The specifications on MPEG-4 over IP networks are jointly developed by the MPEG and IETF, as a framework in Part 8 of MPEG-4 (ISO/IEC 14496-8) and an Informative RFC in IETF.
MPEG-4 sessions can be carried over IP-based protocols such as RTP, RTSP, and HTTP. Details regarding the RTP payload format are specified by IETF RFC 3016. In short, the generic RTP payload format defines a mapping between logical MPEG-4 SL packets and RTP packets, and the FlexMux payload format maps FlexMux packetized streams to RTP packets.
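To make the RTP side of this concrete, the sketch below packs the fixed 12-byte RTP header of RFC 3550 in front of an arbitrary payload. The payload type, sequence number, timestamp, and SSRC values are arbitrary examples; an actual MPEG-4 sender would additionally follow the RFC 3016 payload rules on top of this header:

```python
import struct

def rtp_packet(payload, payload_type, seq, timestamp, ssrc, marker=False):
    """Prepend the fixed 12-byte RTP header (version 2, no CSRC list)."""
    byte0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload

# Hypothetical values: dynamic payload type 96, 90 kHz timestamp units.
pkt = rtp_packet(b"\x00\x00\x01\xb6", payload_type=96, seq=1,
                 timestamp=90000, ssrc=0x1234)
# pkt[0] == 0x80 (version 2), pkt[1] == 96, then the 4-byte payload
```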
16.5 MEDIA-ON-DEMAND (MOD)
Media-on-Demand involves many fundamental multimedia network communication issues. In this section, we will briefly introduce Interactive TV, broadcast schemes for Video-on-Demand, and issues of buffer management.

16.5.1 Interactive TV (ITV) and Set-Top Box (STB)

Interactive TV (ITV) is a multimedia system based on the television sets in homes. It can support a growing number of activities, such as

• TV (basic, subscription, pay-per-view)
• Video-on-Demand (VOD)
• Information services (news, weather, magazines, sports events, etc.)
• Interactive entertainment (Internet games, etc.)
• E-commerce (online shopping, stock trading)
• Access to digital libraries and educational materials

A new development in Digital Video Broadcasting (DVB) is the Multimedia Home Platform (DVB-MHP), which supports all the activities above as well as an electronic program guide (EPG) for television.

The fundamental differences between ITV and conventional cable TV are, first, that ITV invites user interactions; hence the need for two-way traffic — downstream (content provider to user) and upstream (user to content provider). Second, ITV is rich in information and multimedia content.

To perform the above functions, a Set-top Box (STB) is required, which generally has the following components, as Figure 16.10 shows:

• Network interface and communication unit, including tuner and demodulator (to extract the digital stream from the analog channel), security devices, and a communication channel for basic navigation of the WWW and digital libraries as well as services and maintenance
• Processing unit, including CPU, memory, and a special-purpose operating system for the STB
• Audio/video unit, including audio and video (MPEG-2 and 4) decoders, Digital Signal Processor (DSP), buffers, and D/A converters
• Graphics unit, supporting real-time 3D graphics for animation and games
• Peripheral control unit, controllers for disks, audio and video I/O devices (e.g., digital video cameras), CD/DVD reader and writer, and so on

Section 15.4 described various Access Networks and their comparative advantages and disadvantages in transmitting multimedia data efficiently and securely for ITV services.

FIGURE 16.10: General architecture of Set-top Box.
16.5.2 Broadcast Schemes for Video-on-Demand

Among all possible Media-on-Demand services, the most popular is likely to be subscription to movies: over high-speed networks, customers can specify the movies they want and the time they want to view them. The statistics of such services suggest that most of the demand is usually concentrated on a few (10 to 20) popular movies (e.g., new releases and top-ten movies of the season). This makes it possible to multicast or broadcast these movies, since a number of clients can be put into the next group following their request.

An important quality measure of such MOD service is the waiting time (latency). We will define access time as the upper bound between the time of requesting the movie and the time of consuming the movie. Given the potentially extremely high bandwidth of fiber-optic networks, it is conceivable that the entire movie could be fed to the client in a relatively short time if it has access to some high-speed network. The problem with this approach is the need for an unnecessarily large storage space at the client side.

Staggered Broadcasting. For simplicity, we will assume all movies are encoded using constant-bitrate (CBR) encoding, are of the same length L (measured in time units), and will be played sequentially from beginning to end without interruption. The available high bandwidth W is divided by the playback rate b to yield the bandwidth ratio B. The bandwidth
FIGURE 16.11: Staggered broadcasting with M = 8 movies and K = 6 channels.
of the server is usually divided up into K logical channels (K ≥ 1). Assuming the server broadcasts up to M movies (M > 1), all can be periodically broadcast on all these channels, with the start-time of each movie staggered. This is therefore referred to as Staggered broadcasting. Figure 16.11 shows an example of Staggered broadcasting in which M = 8 and K = 6. If the division of the bandwidth is equal among all K logical channels, then the access time for any movie is δ = (M/B)L. (Note: the access time is actually independent of the value of K.) In other words, access time will be reduced linearly with the increased network bandwidth.

Pyramid Broadcasting. Viswanathan and Imielinski [18] proposed Pyramid broadcasting, in which movies are divided up into segments of increasing sizes. That is, L_{i+1} = α L_i, where L_i is the size (length) of Segment S_i and α > 1. Segment S_i will be periodically broadcast on Channel i. In other words, instead of staggering the movies on K channels, the segments are now staggered. Each channel is given the same bandwidth, and the larger segments are broadcast less frequently.

Since the available bandwidth is assumed to be significantly larger than the movie playback rate b (i.e., B >> 1), it is argued that the client can be playing a smaller Segment S_i and simultaneously be receiving a larger Segment S_{i+1}. To guarantee continuous (noninterrupted) playback, the necessary condition is

playback_time(S_i) ≥ access_time(S_{i+1})        (16.1)

The playback_time(S_i) = L_i. Given that the bandwidth allocated to each channel is (B/K) b,

access_time(S_{i+1}) = L_{i+1} M / (B/K) = α L_i M K / B ≤ L_i,

which yields α ≤ B/(M K).
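The two results above, the staggered access time δ = (M/B)L and the pyramid growth bound α ≤ B/(MK), can be checked numerically; the parameter values below are arbitrary examples:

```python
def staggered_access_time(M, L, B):
    """Access time for staggered broadcasting of M movies of length L over
    a total bandwidth of B times the playback rate (independent of K)."""
    return M * L / B

def pyramid_alpha_bound(M, K, B):
    """Largest segment growth factor alpha that still satisfies
    playback_time(S_i) >= access_time(S_{i+1}): alpha <= B / (M * K)."""
    return B / (M * K)

# Arbitrary example: 8 movies of 120 minutes, bandwidth ratio B = 48.
delta = staggered_access_time(M=8, L=120, B=48)   # 20-minute access time
alpha = pyramid_alpha_bound(M=8, K=2, B=48)       # alpha may grow up to 3.0
```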
FIGURE 16.12: Skyscraper broadcasting with seven segments.
The size of S_1 determines the access time for Pyramid broadcasting. By default, we set

α = B / (M K)        (16.2)

to yield the shortest access time. Consequently, access time drops exponentially with the increase in total bandwidth B, because α can be increased linearly.

A main drawback of the above scheme is the need for a large storage space on the client side, because the last two segments are typically 75-80% of the movie size. Instead of using a geometric series, Skyscraper broadcasting [19] uses (1, 2, 2, 5, 5, 12, 12, 25, 25, 52, 52, ...) as the series of segment sizes, to alleviate the demand on a large buffer. Figure 16.12 shows an example of Skyscraper broadcasting with seven segments. As shown, two clients who made a request at time intervals (1, 2) and (16, 17), respectively, have their respective transmission schedules. At any given moment, no more than two segments need to be received.

Hu [20] described Greedy Equal Bandwidth Broadcasting (GEBB) in 2001. The segment sizes and their corresponding channel bandwidths are analyzed, with the objective of minimizing the total server bandwidth required to broadcast a specific video. Different from the above pyramid-based broadcasting schemes, GEBB operates in a "greedy" fashion. The client receives as much data as possible from all the channels immediately after "tuning in" to a video broadcast. The client ceases receiving a segment immediately before playing back the corresponding segment. Figure 16.13 illustrates GEBB. In this figure, all the bandwidths are equal. The server bandwidth optimization problem can be formally stated as:

minimize    Σ_{i=1}^{K} B_i        (16.3)

subject to

B_i = s_i / (w + Σ_{j=1}^{i-1} s_j),    i = 1, 2, ..., K        (16.4)
FIGURE 16.13: Illustration of GEBB. The shaded area represents data received and played back by the client.
where w is the wait time and B_i is the bandwidth of Channel i (with segment sizes s_i measured in playback-time units). The condition represented by Equation (16.4) ensures that Segment S_i is completely received at the exact time when the playback of Segment S_{i-1} terminates. Thus, the segments are available exactly on time for their playback.

The above nonlinear optimization problem is solved using the Lagrange multiplier method. The result is that the required bandwidth is minimized when the channel bandwidths are equal. The broadcasting bandwidth of each channel is

B_i = B = ((1 + L/w)^{1/K} − 1) b,    i = 1, 2, ..., K
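This equal-bandwidth result can be checked numerically. The sketch below normalizes the playback rate to b = 1 and derives the segment lengths directly from the greedy constraint (segment i downloads at rate B during the wait time plus the playback of all earlier segments); the parameter values are arbitrary:

```python
def gebb_bandwidth(L, w, K):
    """Per-channel GEBB bandwidth, in units of the playback rate b,
    for a movie of length L, wait time w, and K channels."""
    return (1 + L / w) ** (1 / K) - 1

def gebb_segments(L, w, K):
    """Segment lengths implied by the equal-bandwidth greedy constraint:
    segment i downloads at rate B during the wait time plus the playback
    time of all earlier segments."""
    B = gebb_bandwidth(L, w, K)
    sizes, window = [], w
    for _ in range(K):
        s = B * window      # download of segment i exactly fills the window
        sizes.append(s)
        window += s         # playback of segment i extends the next window
    return sizes

segments = gebb_segments(120, 1, 6)   # 2-hour movie, 1-minute wait
# the K segments sum (up to rounding) to the full movie length L
```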
(> 3 GHz) and establish line-of-sight (LOS) communication, or for lower frequencies reflect off the ionosphere and the ground, or travel along the ionosphere to the receiver. Frequencies over 3 GHz (which are necessary for satellite transmissions to penetrate the ionosphere) experience gaseous attenuations, influenced primarily by oxygen and water (vapor and rain).
17.2.1 Radio Propagation Models
Multipath Fading

Fading models try to model the amplitude of the superimposed signal at the receiver. The Doppler spread of a signal is defined as the distribution of the signal power over the frequency spectrum (the signal is modulated at a specific frequency bandwidth). When the Doppler spread of the signal is small enough, the signal is coherent — that is, there is only one distinguishable signal at the receiver. This is typically the case for narrowband signals. However, when the signal is wideband, different frequencies of the signal have different fading paths, and a few distinguishable signal paths are observed at the receiver, separated in time.

For narrowband signals, the most popular models are Rayleigh fading and Rician fading. The Rayleigh model assumes an infinite number of signal paths with no line-of-sight (LOS) to the receiver for modeling the probability density function P_r of the received signal amplitude r:

P_r(r) = (r / σ²) e^(−r² / (2σ²))        (17.9)

where σ is the standard deviation of the probability density function. Although the number of signal paths is typically not too large, the Rayleigh model does provide a good approximation when the number of paths is over 5.

A more general model that assumes a LOS is the Rician model. It defines a K-factor as the ratio of the LOS signal power to the scattered power — that is, K is the factor by which the LOS signal is greater than the other paths. The Rician probability density function P_r is

P_r(r) = (r / σ²) e^(−r²/(2σ²) − K) I₀((r/σ) √(2K)),    where K = s² / (2σ²)        (17.10)

As before, r and σ are the signal amplitude and standard deviation, respectively, and s is the LOS signal amplitude (so s² is the LOS signal power). I₀ is a modified Bessel function of the first kind with 0 order.
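Both densities are easy to evaluate with the standard library alone; the sketch below uses a truncated power series for I₀ (adequate for small arguments) and confirms that setting K = 0 reproduces the Rayleigh density, as noted in the text:

```python
import math

def bessel_i0(x, terms=30):
    """Modified Bessel function of the first kind, order 0 (series sum)."""
    return sum((x / 2) ** (2 * k) / math.factorial(k) ** 2
               for k in range(terms))

def rayleigh_pdf(r, sigma=1.0):
    return (r / sigma**2) * math.exp(-r**2 / (2 * sigma**2))

def rician_pdf(r, K, sigma=1.0):
    return ((r / sigma**2) * math.exp(-r**2 / (2 * sigma**2) - K)
            * bessel_i0((r / sigma) * math.sqrt(2 * K)))

# With K = 0 (no line-of-sight), the Rician density reduces to Rayleigh.
assert math.isclose(rician_pdf(1.5, K=0), rayleigh_pdf(1.5))
```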
FIGURE 17.6: Rician PDF plot with K-factor = 0, 1, 3, 5, 10, and 20.
Note that when s = 0 (K = 0) there is no LOS, and the model thus reduces to a Rayleigh distribution. When K → ∞, the model reflects additive white Gaussian noise (AWGN) conditions. Figure 17.6 shows the Rician probability density function for K-factors of 0, 1, 3, 5, 10, and 20, with a standard deviation of σ = 1.0.

For a wideband signal, the fading paths are more empirically driven. One way is to model the amplitude as a summation over all the paths, each having randomized fading. The number of paths can be 7 for a closed-room environment (six walls and LOS) or a larger number for other environments. An alternative technique for modeling the channel fading is measuring the channel impulse response. A similar technique is utilized in CDMA systems, proposed in cdma2000 as well and added to WCDMA as part of the harmonization effort. A CDMA station (both mobile and base station) has rake receivers, which are multiple CDMA radio receivers tuned to signals with different phase and amplitude, to recompose a CDMA transmission that split into different distinguishable paths. The signal at each rake receiver is added up to achieve better SNR. To tune the rake receivers to the proper fading paths, CDMA systems have a special pilot channel that sends a well-known pilot signal, and the rake receivers are adjusted to recognize that symbol on each fading path.
17.2.2 Path Loss

For long-range communication, the signal loss is dominated by attenuation. The free-space attenuation model for LOS transmission is in inverse proportion to the square of the distance (d²) and is given by the Friis radiation equation

S_r = (S_t G_t G_r λ²) / ((4π)² d² L)        (17.11)

S_r and S_t are the received and transmitted signal power, G_r and G_t are the antenna gain factors, λ is the signal wavelength, and L is the receiver loss. It can be shown, however, that if we assume ground reflection, attenuation increases to be proportional to d⁴.

Another popular medium-scale (urban city size) model is the Hata model, which is empirically derived based on Okumura path loss data in Tokyo. The basic form of the path loss equation in dB is given by

L = A + B log₁₀(d) + C        (17.12)

Here, A is a function of the frequency and antenna heights, B is an environment function, and C is a function depending on the carrier frequency. Again, d is the distance from the transmitter to the receiver.

Satellite models are attenuated primarily by rain. Hence, meteorological rainfall density maps can be used to communicate with the region. Attenuation is computed according to the amount of rainfall in the area on the given date.

17.3 MULTIMEDIA OVER WIRELESS NETWORKS

We have studied the evolution of current 2G networks to future high-capacity 3G networks, but is there a demand for 3G networks? Multimedia over wireless will certainly need a higher bandwidth. Suggested multimedia applications range from web browsing, streaming video, videoconferencing, collaborative work, and slide-show presentations to enhanced roadside assistance and downloadable GPS maps for drivers.

In this section we are concerned mainly with sending video robustly over wireless channels, such as for a videoconferencing application. This application should be prominent on 3G handhelds, since it is a natural extension to voice communication. Because wireless data transmissions incur the most data loss and distortion, error resilience and error correction become primary concerns. We have thus included some brief description of synchronization loss, error-resilient entropy coding, error concealment, and Forward Error Correction (FEC) in this section, although most of these techniques are also applicable to other networks.

A few characteristics of wireless handheld devices are worth keeping in mind when designing multimedia transmission, in particular video transmission. First, both the handheld size and battery life limit the processing power and memory of the device. Thus, encoding and decoding must have relatively low complexity. Of course, one advantage of the smaller device size is that lower-resolution videos are acceptable, which helps reduce processing time. Second, due to memory constraints and the reasons for the use of wireless devices, as well as billing procedures, real-time communication is likely to be required. Long delays before starting to see a video are either not possible or not acceptable. Finally, wireless channels have much more interference than wired channels, with specific loss patterns depending on the environment conditions. The bitrate for wireless channels is also much more limited, although the 3G bitrates are more suitable for video. This implies that although a lot of bit protection must be applied, coding efficiency has to be maintained as well. Error-resilient coding is important.

3G standards specify that video shall be standard compliant. Moreover, most companies will concentrate on developing products using standards, in the interest of interoperability of mobiles and networks. The video standards reasonable for use over wireless channels are MPEG-4 and H.263 and its variants, since they have low bitrate requirements.

The 3GPP2 group has defined the following QoS parameters for wireless videoconferencing services [7]. The QoS parameters specified for the wireless part are more stringent than those required for end-to-end transmissions. The 3GPP QoS requirements for multimedia transmission are nearly identical [8].

• Synchronization. Video and audio should be synchronized to within 20 msec.
• Throughput. The minimum video bitrate to be supported is 32 kbps. Video rates of 128 kbps, 384 kbps, and above should be supported as well.
• Delay. The maximum end-to-end transmission delay is defined to be 400 msec.
• Jitter. The maximum delay jitter (maximum difference between the average delay and the 95th percentile of the delay distribution) is 200 msec.
• Error rate. The videoconferencing system should be able to tolerate a frame error rate of 10⁻², or a correspondingly small bit error rate, for circuit-switched transmission.

In the following, we discuss the vulnerability of a video sequence to bit errors and ways to improve resilience to errors.

17.3.1 Synchronization Loss

A video stream is either packetized and transmitted over a packet-switched channel or transmitted as a continuous bitstream over a circuit-switched channel. In either case, it is obvious that packet loss or bit error will reduce video quality. If a bit loss or packet loss is localized in the video in both space and time, the loss can still be acceptable, since a frame is displayed for a very short period, and a small error might go unnoticed. However, digital video coding techniques involve variable-length codes, and frames are coded with different prediction and quantization levels. Unfortunately, when a packet
containing variable bit-length data (such as DCT coefficients) is damaged, that error, if unconstrained, will propagate all the way throughout the stream. This is called loss of decoder synchronization. Even if the decoder can detect the error, due to an invalid coded symbol or coefficients out of range, it still cannot establish the next point from which to start decoding [9].
As we have learned in Chapter 10, this complete bitstream loss does not happen for videos coded with standardized protocol layers. The Picture layer and the Group Of Blocks (GOB) layer or Slice headers have synchronization markers that enable decoder resynchronization.
For example, the H.263 bitstream has four layers — the Picture layer, GOB layer, Macroblock layer, and Block layer. The Picture layer starts with a unique 22-bit picture start code (PSC). The longest entropy-coded symbol possible is 13 bits, so the PSC serves as a synchronization marker as well. The GOB layer is provided for synchronization after a few blocks rather than the entire frame. The group of blocks start code (GBSC) is 17 bits long and also serves as a synchronization marker.² The Macroblock and Block layers do not contain unique start codes, as these are deemed high overhead.
ITU standards after H.261 (e.g., H.263, H.263+, etc.) support slice-structured mode instead of GOBs (H.263 Annex K), where slices group blocks together according to the block's coded bit length rather than the number of blocks. The objective is to space slice headers within a known distance of each other. That way, when a bitstream error looks like a synchronization marker, if the marker is not where the slice headers should be, it is discarded, and no false resynchronization occurs. Since slices need to group an integral number of macroblocks together, and macroblocks are coded using VLCs, it is not possible to have all slices the same size.
However, there is a minimum distance after which the next scanned macroblock will be added to a new slice.
We know that DC coefficients in macroblocks and motion vectors of macroblocks are differentially coded. Therefore, if a macroblock is damaged and the decoder locates the next synchronization marker, it might still not be able to decode the stream. To alleviate the problem, slices also reset spatial prediction parameters; differential coding across slice boundaries is not permitted. The ISO MPEG standards (and H.264 as well) specify slices that are not required to be of similar bit length and so do not protect against false markers well.
Other than synchronization loss, we should note that errors in prediction reference frames cause much more damage to signal quality than errors in frames not used for prediction. That is, a frame error for an I-frame will deteriorate the quality of a video stream more than a frame error for a P- or B-frame. Similarly, if the video is scalable, an error at the base layer will deteriorate the quality of a video stream more than in enhancement layers.
MPEG-4 defines additional error-resilient tools that are useful for coding under noisy and wireless channel conditions. These are in addition to slice coding and Reversible Variable Length Codes (RVLCs) [10, 11]. To further help with synchronization, a data partitioning scheme will group and separate header information, motion vectors, and DCT coefficients into different packets and put synchronization markers between them. As we shall see later on, such a scheme is also beneficial to unequal protection Forward Error Correction (FEC) schemes.

²Synchronization markers are always larger than the minimum required, in case bit errors change bits to look like synchronization markers.
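The false-resynchronization check described above can be sketched as follows; the marker pattern and the fixed slice spacing here are illustrative stand-ins, not the actual H.263 start codes:

```python
# Hypothetical illustration of false-resynchronization filtering: a decoder
# accepts a candidate start code only at positions where a slice header is
# allowed to be. The 17-bit marker below is illustrative, not the real GBSC.
MARKER = "0" * 16 + "1"       # 16 zeros followed by a 1 (illustrative)
SLICE_SPACING = 64            # assumed fixed bit distance between slice headers

def find_sync_points(bitstream: str):
    """Return (accepted, rejected) positions of candidate markers."""
    accepted, rejected = [], []
    pos = bitstream.find(MARKER)
    while pos != -1:
        # A marker not aligned to the known slice-header spacing is assumed
        # to be payload bits that merely look like a marker.
        (accepted if pos % SLICE_SPACING == 0 else rejected).append(pos)
        pos = bitstream.find(MARKER, pos + 1)
    return accepted, rejected

# A marker at bit 64 is accepted; the same pattern at bit 100 is discarded.
stream = "1" * 64 + MARKER + "1" * 19 + MARKER + "1" * 100
ok, bad = find_sync_points(stream)
```

This is the essence of the Annex K argument: fixed header spacing turns marker emulation in the payload into a detectable, ignorable event.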
Additionally, an adaptive intra-frame refresh mode is allowed, where each macroblock can be coded independently of the frame as an inter- or intra- block according to its motion, to assist with error concealment. A faster-moving block will require more frequent refreshing — that is, be coded in intra- mode more often.
Synchronization markers are easy to recognize and are particularly well suited to devices with limited processing power, such as cell phones and mobile devices.
For interactive applications, if a back channel is available to the encoder, a few additional error control techniques are available, classified as sender-receiver feedback. According to the bandwidth available at any moment, the receiver can ask the sender to lower or increase the video bitrate (transmission rate control), which combats packet loss due to congestion. If the stream is scalable, it can also ask for enhancement layers. Additionally, Annex N of H.263+ specifies that the receiver can notice damage in a reference frame and request that the encoder use a different reference frame for prediction — a reference frame the decoder has reconstructed correctly.
The above techniques can be used in wireless real-time video applications such as videoconferencing, since wireless cell communication supports a back channel if necessary. However, it is obviously cheaper not to use one (it would reduce multiple-access interference in the uplink).

17.3.2
Error Resilient Entropy Coding

The main purpose of GOBs, slices, and synchronization markers is to reestablish decoder synchronization as soon as possible after an error. In Annex K of H.263+, slices achieve better resilience, since they impose further constraints on where the stream can be synchronized. However, another algorithm, called Error Resilient Entropy Coding (EREC), can achieve synchronization after every single macroblock, without any of the overhead of the slice headers or GOB headers. The algorithm is called EREC because it takes entropy-coded variable-length macroblocks and rearranges them in an error-resilient fashion. In addition, it can provide graceful degradation.
EREC takes a coded bitstream of a few blocks and rearranges them so that the beginning of all the blocks is a fixed distance apart. Although the blocks can be of any size and any media we wish to synchronize, the following description will refer to macroblocks in videos.
The algorithm proceeds as in Figure 17.7. Initially, EREC slots (rows) of fixed bit-length are allocated with total bit-length equal to (or exceeding) the total bit-length of all the macroblocks. The number of slots is equal to the number of macroblocks, except that the macroblocks have varying bit-length and the slots have a fixed bit-length (approximately equal to the average bit-length of all the macroblocks). As shown, the last EREC slot (row) is shorter when the total number of bits does not divide evenly by the number of slots.
Let k be the number of macroblocks, which is equal to the number of slots; l be the total bit-length of all the macroblocks; mbs[] be the macroblocks; and slots[] be the EREC slots. The procedure for encoding the macroblocks is shown below.
FIGURE 17.7: Example of macroblock encoding using EREC.

FIGURE 17.8: Example of macroblock decoding using EREC.
PROCEDURE 17.1   ERECEncode

BEGIN
    j = 0;
    Repeat until l = 0
        for i = 0 to k - 1
            m = (i + j) mod k;   // m is the slot tried by macroblock i in this pass
            Shift as many bits as possible (without overflow) from mbs[i] into slots[m];
            sb = number of bits successfully shifted into slots[m];
            l = l - sb;
        j = j + 1;   // shift the macroblocks downwards
END

The macroblocks are shifted into the corresponding slots until all the bits of the macroblock have been assigned or the remaining bits of the macroblock do not fit into the slot. Then the macroblocks are shifted down, and this procedure repeats.
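A minimal executable sketch of this encoding pass follows; variable names (k, mbs, slots) mirror the pseudocode, while the slot-size choice (ceiling of the average bit-length) and the example bit strings are illustrative:

```python
def erec_encode(mbs):
    """EREC encoding sketch: pack variable-length macroblock bitstrings
    into k fixed-length slots so every slot start is a known distance apart.
    mbs is a list of bit strings, e.g. "10110"."""
    k = len(mbs)
    total = sum(len(mb) for mb in mbs)
    slot_len = -(-total // k)                    # ceiling of the average length
    # The last slot is shorter when total bits don't divide evenly by k.
    sizes = [slot_len] * (k - 1) + [total - slot_len * (k - 1)]
    slots = [""] * k
    remaining = list(mbs)                        # unplaced bits per macroblock
    j = 0
    while any(remaining):
        for i in range(k):
            m = (i + j) % k                      # slot tried by macroblock i
            room = sizes[m] - len(slots[m])
            if room > 0 and remaining[i]:
                slots[m] += remaining[i][:room]  # shift as many bits as fit
                remaining[i] = remaining[i][room:]
        j += 1                                   # shift macroblocks downwards
    return slots

# Three macroblocks of lengths 6, 2, 4 -> 12 bits packed into three 4-bit slots.
slots = erec_encode(["101011", "11", "0001"])
```

In the first pass each macroblock fills its own slot; the leftover bits of the long first macroblock land in slot 1 on the second pass, exactly as in Figure 17.7.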
The decoder side works in reverse, with the additional requirement that it has to detect when a macroblock has been read in full. It accomplishes this by detecting the end of a macroblock when all DCT coefficients have been decoded (or a block end code). Figure 17.8 shows an example of the decoding process for the macroblocks coded using EREC in Figure 17.7.
The transmission order of the data in the slots is row-major — that is, at first the data in slot 0 is sent, then slot 1, and so on, left to right. It is easy to see how this technique is resilient to errors. No matter where the damage is, even at the beginning of a macroblock, we still know where the next macroblock starts — it is a fixed distance from the previous one. In this case, no synchronization markers are used, so the GOB layer or slices are not necessary either (although we still might want to restrict spatial propagation of error).
When the macroblocks are coded using a data partitioning technique (such as the one for MPEG-4 described in the previous section) and also bitplane partitioning, an error in the bitstream will destroy less significant data while the significant data is still received; the chance for error propagation is greater for bits at the end of the slot than at the beginning. On average, this will also reduce visual deterioration over a nonpartitioned encoding. This achieves graceful degradation under worsening error conditions.
17.3.3 Error Concealment

Despite all the efforts to minimize occurrences of errors and their significance, errors can still be visually annoying. Error concealment techniques are thus introduced to approximate the lost data on the decoder side.
Many error concealment techniques apply either in the spatial, temporal, or frequency domain, or a combination of them. All the techniques use neighboring frames temporally
or neighboring macroblocks spatially. The transport stream coder interleaves the video packets, so that in case of a burst packet loss, not all the errors will be in one place, and the missing data can be estimated from the neighborhood.
Error concealment is necessary for wireless video communication, since the error rates are higher than for wired channels and might even be higher than can be handled with appropriate bit protection. Moreover, the error rate fluctuates more often, depending on various mobility or weather conditions. Decoding errors due to missing or wrong data received are more noticeable on devices with limited resolution and small screen sizes. This is especially true if macroblock size remains large, to achieve encoding efficiency for lower wireless bitrates.
Following is a summary of techniques for error concealment. (See [12] for further details.)

1. Dealing with lost macroblock(s). A simple and popular technique for concealment can be used when DCT blocks are damaged but the motion vectors are received correctly. The missing block coefficients are estimated from the reference frame, assuming no prediction errors. Since the goal of motion-compensated video is to minimize prediction errors, this is an appropriate assumption. The missing block is hence temporally masked using the block in the reference frame.
We can achieve even better results if the video is scalable. In that case, we assume that the base layer is received correctly and that it contains the motion vectors and base layer coefficients that are most important. Then, for a lost macroblock at the enhancement layer, we use the motion vectors from the base layer, replace the DCT coefficients at the enhancement layer, and decode as usual from there. Since coefficients of less importance are estimated (such as higher-frequency coefficients), even if the estimation is not too accurate due to prediction errors, the concealment is more effective than in a nonscalable case.
If motion vector information is damaged as well, this technique can be used only if the motion vectors are estimated using another concealment technique (to be discussed next). The estimation of the motion vector has to be good, or the visual quality of the video could suffer. To apply this technique to intra-frames, some standards, such as MPEG-2, also allow the acquisition of motion vectors for intra-coded frames (i.e., treating them as intra- as well as inter-frames). These motion vectors are discarded if the block has no error.

2. Combining temporal, spatial, and frequency coherences. Instead of just relying on the temporal coherence of motion vectors, we can combine it with spatial and frequency coherences. By having rules for estimating missing block coefficients using the received coefficients and neighboring blocks in the same frame, we can conceal errors for intra-frames and for frames with damaged motion vector information. Additionally, combining with prediction using motion vectors will give us a better approximation of the prediction error block.
Missing block coefficients can be estimated spatially by minimizing the error of a smoothness function defined over the block and neighboring blocks. For simplicity, the smoothness function can be chosen as the sum of squared differences of pairwise
neighboring pixels in the block. The function unknowns are the missing coefficients. In the case where motion information is available, prediction smoothness is added to the objective function for minimization, weighted as desired.
The simple smoothness measure defined above has the problem that it smoothes edges as well. We can attempt to do better by increasing the order of the smoothing criterion from linear to quadratic or cubic. This will increase the chances of having both edge reconstruction and smoothing along the edge direction. At a larger computational cost, we can use an edge-adaptive smoothing method, whereby the edge directions inside the block are first determined, and smoothing is not permitted across edges.

3. Frequency smoothing for high-frequency coefficients. Smoothing can be defined much more simply, to save on computational cost. Although the human visual system is more sensitive to low frequencies, it would be disturbing to see a checkerboard pattern where it does not belong. This will happen when a high-frequency coefficient is erroneously assigned a high value. The simplest remedy is to set high-frequency coefficients to 0 if they are damaged.
If the frequencies of neighboring blocks are correlated, it is possible to estimate lost coefficients in the frequency domain directly. For each missing frequency coefficient in a block, we estimate its value using an interpolation of the same frequency coefficient values from the four neighboring blocks. This is applicable at higher frequencies only if the image has regular patterns. Unfortunately, that is not usually the case for natural images, so most of the time the high coefficients are again set to 0. Temporal prediction error blocks are even less correlated at all frequencies, so this method applies only to intra-frames.

4. Estimation of lost motion vectors. Loss of motion vectors prevents decoding of an entire predicted block, so it is important to estimate motion vectors well.
The easiest way to estimate lost motion vectors is to set them to 0. This works well only in the presence of very little motion. A better estimation is obtained by examining the motion vectors of reference macroblocks and of neighboring macroblocks.
Assuming motion is also coherent, it is reasonable to take the motion vectors of the corresponding macroblock in the reference frame as the motion vectors for the damaged target block. Similarly, assuming objects with consistent motion fields occupy more than one macroblock, the motion vector for the damaged block can be approximated as an interpolation of the motion vectors of the surrounding blocks that were received correctly. Typical simple interpolation schemes are weighted-average and median. Also, the spatial estimation of the motion vector can be combined with the estimation from the reference frame using weighted sums.

17.3.4 Forward Error Correction
(FEC)
Some data are vitally important for correct decoding. Missing DCT coefficients may be estimated or their effect visually concealed to some degree. However, some lost and improperly estimated data, such as picture coding mode, quantization level, or most data in higher layers of a video standard protocol stack, will cause catastrophic video decoding failure. In such cases, we would like to ensure "error-free" transmission. However, most channels, in particular wireless channels, are noisy, and to ensure correct transmission, we
must provide adequate redundancy or retransmissions (when no back channel is available, retransmission is not an option). Forward Error Correction (FEC) is a technique that adds redundant data to a bitstream to recover some random bit errors in it. Ideally, the channel packet error rate (or bit error rate) is estimated, and enough redundancy is added to make the probability of error after FEC recovery low. The interval over which the packet error rate is estimated is chosen to be the smallest possible (to minimize latency and computation cost) that reliably estimates the frame loss probability. Naturally, when burst frame loss occurs, the estimation may no longer be adequate. Frame errors are also called erasures, since the entire packet is dropped on an error.
Videos have to be transmitted over a channel with limited bandwidth. Therefore, it is important to minimize redundancy, because it comes at the expense of bitrate available for video source coding. At the same time, enough redundancy is needed so that the video can maintain the required QoS under the current channel error conditions. There is an optimal amount of redundancy that minimizes video distortion, given certain channel conditions.
FEC codes in general fall into two categories: block codes and convolutional codes. Block codes apply to a group of bits at once to generate redundancy. Convolutional codes apply to a string of bits one at a time and have memory that can store previous bits as well. The following presents both types of FEC codes in brief [13].

Block Codes. Block codes [2] take as input k bits and append r = n - k bits of FEC data, resulting in an n-bit-long string. These codes are referred to as (n, k) codes. The two types of block codes are linear and cyclic.
All error correction codes operate by adding space between valid source strings. The space is measured using a Hamming distance, defined as the minimum number of bits that need to be changed in one coded string to make it identical to another coded string.
To detect r errors, the Hamming distance has to be at least r + 1; otherwise, the corrupt string might seem valid again. This is not sufficient for correcting r errors, however, since there is not enough distance among valid codes to choose a preferable correction. To correct r errors, the Hamming distance must be at least 2r + 1 [14, 15].
Linear codes are simple to compute but have higher coding overhead than cyclic codes. Cyclic codes are stated in terms of generator polynomials of maximum degree equal to the number of source bits. The source bits are the coefficients of the polynomial, and redundancy is generated by multiplying with another polynomial. The code is cyclic, since the modulo operation in effect shifts the polynomial coefficients.
One of the most used classes of cyclic codes is the Bose–Chaudhuri–Hocquenghem (BCH) codes, since they apply to any binary string. The generator polynomial for BCH is given over GF(2) (the binary Galois field) and is the lowest-degree polynomial with roots a^i, where a is a primitive element of the field and i goes over the range of 1 to twice the number of bits we wish to correct. BCH codes can be encoded and decoded quickly using integer arithmetic, since they use Galois fields. H.261 and H.263 use BCH to allow for 18 parity bits every 493 source bits. Unfortunately, the 18 parity bits will correct at most two errors in the source. Thus, the packets are still vulnerable to burst bit errors or single-packet errors.
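The distance bounds above can be demonstrated with a small sketch; the toy code words are illustrative, not an actual BCH code:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(codebook):
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming(a, b) for a, b in combinations(codebook, 2))

# A toy codebook with minimum distance 3: it can detect up to 2 bit errors
# (3 >= r + 1 for r = 2) and correct 1 (3 >= 2r + 1 for r = 1).
codebook = ["00000", "01011", "10101", "11110"]
d = min_distance(codebook)
detect = d - 1         # errors guaranteed detectable
correct = (d - 1) // 2 # errors guaranteed correctable
```

With distance 3, a single flipped bit leaves the corrupt string closer to its original code word than to any other, which is exactly why correction works.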
FIGURE 17.9: Interleaving scheme for redundancy codes. Packets or bits are stored in rows, and redundancy is generated in the last r columns. The sending order is by columns, top to bottom, then left to right.
An important subclass of BCH codes that applies to multiple packets is the Reed–Solomon (RS) codes. RS codes have a generator polynomial over GF(2^m), with m being the packet size in bits. RS codes take a group of k source packets and output n packets, with r = n - k redundancy packets. Up to r lost packets can be recovered from the n coded packets if we know the erasure points. Otherwise, as with all FEC codes, recovery can be applied to only half that number of packets (similarly, the number of bits), since error-point detection is now necessary as well; in that case, only ⌊r/2⌋ packets can be recovered. Fortunately, in the packet FEC scenario the packets have headers that can contain a sequence number, and CRC codes on the physical layer. In most cases, a packet with an error is dropped, and we can tell the location of the missing packet from the missing sequence number. RS codes are used in storage media such as CD-ROMs and in network multimedia transmissions that can have burst errors.
It is also possible to use packet interleaving to increase resilience to burst packet loss. As Figure 17.9 shows, the RS code is generated for each of the h rows of k source video packets. Then it is transmitted in column-major order, so that the first packet of each of the h rows is transmitted first, then the second, and so on. If a burst packet loss occurs, we can tolerate more than r erasures, since there is enough redundancy data. This scheme introduces additional delay but does not increase computational cost.
RS codes can be useful for transmission over packet networks. When there are burst packet losses, packet interleaving, and packet sequencing, it is possible to detect which packets were received incorrectly and recover them using the available redundancy.
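The row/column interleaving idea of Figure 17.9 can be sketched as follows; for simplicity a single XOR parity packet per row stands in for the r RS redundancy packets a real system would generate:

```python
def xor_packets(packets):
    """Bytewise XOR of equal-length packets (stand-in for RS redundancy)."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for idx, byte in enumerate(p):
            out[idx] ^= byte
    return bytes(out)

def interleave(rows):
    """Append one parity packet per row, then emit in column-major order,
    so a burst loss hits different rows rather than one row's recovery set."""
    table = [row + [xor_packets(row)] for row in rows]
    cols = len(table[0])
    return [table[r][c] for c in range(cols) for r in range(len(table))]

# Two rows of k = 2 source packets each; send order is column by column.
rows = [[b"\x01\x02", b"\x04\x08"], [b"\xff\x00", b"\x0f\xf0"]]
sent = interleave(rows)
# A burst dropping two consecutive packets on the wire now removes at most
# one packet from each row, which the per-row parity can reconstruct.
```

The cost, as the text notes, is delay: a whole h-by-n table must be buffered before the first column can be sent.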
If the video has scalability, a better use of allocated bandwidth is to apply adequate FEC protection on the base layer, containing motion vectors and all header information required to decode the video to the minimum QoS. The enhancement layers can receive either less protection or none at all, relying just on resilient coding and error concealment. Either way, the minimum QoS is already achieved.
A disadvantage of block codes is that they cannot be selectively applied to certain bits. It is difficult to protect higher-protocol-layer headers with more redundancy bits than for, say, DCT coefficients, if they are sent in the same transport packet (or even group of packets). On the other hand, convolutional codes can do this, which makes them more efficient for data in which unequal protection is advantageous, such as videos. Although convolutional codes are not as effective against burst packet loss, for wireless radio channels burst packet loss is not predominant (and not present in most propagation models).

Convolutional Codes. Convolutional FEC codes are defined over generator polynomials as well [13]. They are computed by shifting k message bits into a coder that convolves them with the generator polynomials to generate n bits. The rate of such a code is defined to be k/n. The shifting is necessary, since coding is achieved using memory (shift) registers. There can be more than k registers, in which case past bits also affect the redundancy code generated.
After producing the n bits, some redundancy bits can be deleted (or "punctured") to decrease the size of n and increase the rate of the code. Such FEC schemes are known as rate compatible punctured convolutional (RCPC) codes. The higher the rate, the lower the bit protection will be, but also the less overhead on the bitrate. A Viterbi algorithm with soft decisions decodes the encoded bit stream, although turbo codes are gaining popularity.
RCPC codes provide an advantage over block codes for wireless (sections of the) network, since burst packet losses are not likely. RCPC puncturing is done after generation of parity information. Knowing the significance of the source bits for video quality, we can apply a different amount of puncturing and hence a different amount of error protection.
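A minimal sketch of this idea, using the classic rate-1/2, constraint-length-3 convolutional generators (7, 5 in octal) and an illustrative puncturing pattern that raises the rate to 2/3:

```python
def conv_encode(bits, g1=0b111, g2=0b101):
    """Rate-1/2 convolutional encoder, constraint length 3.
    g1, g2 are tap masks on the shift register (the textbook (7, 5) pair)."""
    state = 0
    out = []
    for b in bits:
        state = ((state << 1) | b) & 0b111          # shift new bit into register
        out.append(bin(state & g1).count("1") % 2)  # parity of g1 taps
        out.append(bin(state & g2).count("1") % 2)  # parity of g2 taps
    return out

def puncture(coded, pattern=(1, 1, 1, 0)):
    """Delete coded bits where the pattern is 0: rate 1/2 becomes 2/3,
    trading error protection for lower bitrate overhead (the RCPC idea)."""
    return [bit for bit, keep in zip(coded, pattern * len(coded)) if keep]

coded = conv_encode([1, 0, 1, 1])   # 8 coded bits for 4 message bits (rate 1/2)
sent = puncture(coded)              # 6 bits after puncturing (rate 4/6 = 2/3)
```

An RCPC scheme keeps a family of puncturing patterns of increasing density, so important bits (headers, motion vectors) can be sent at a low rate with heavy protection while DCT coefficients use a heavily punctured, high-rate variant.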
Studies and simulations of wireless radio models have shown that applying unequal protection using RCPC according to bit significance information results in better video quality (up to 2 dB better) for the same allocated bitrate than videos protected using RS codes. Simplistically, the Picture layer in a video protocol should get the highest protection, the macroblock layer that is more localized will get lower protection, and the DCT coefficients in the block layer can get little protection, or none at all. This could be extended further to scalable videos in similar ways.
The cdma2000 standard uses convolutional codes to protect transmitted bits for any data type, with different code rates for different transmission bitrates. If future 3G networks incorporate data-type-specific provisions and recognize the video standard chosen for transmission, they can adaptively apply transport coding of the video stream with enough unequal redundancy suitable to the channel conditions at the time and the QoS requested.

17.3.5
Trends in Wireless Interactive Multimedia

The UMTS forum foresees that by 2010, the number of subscribers to wireless multimedia communication will exceed a billion worldwide, and such traffic will be worth several hundred billion dollars to operators. Additionally, 3G will also speed the convergence of telecommunications, computers, multimedia content, and content providers to support enhanced services.
Most cellular networks around the world have already offered 2.5G services for a few years. Initial 3G services are also being offered globally, with cdma2000 1x service already commercially available in most countries.
Some of the present and future 3G applications are:

• Multimedia Messaging Service (MMS), a new messaging protocol for multimedia data on mobile phones that incorporates audio, images, and other multimedia content, along with traditional text messages
• Mobile videophone, VoIP, and voice-activated network access
• Mobile Internet access, with streaming audio and video services
• Mobile intranet/extranet access, with secure access to corporate LANs, Virtual Private Networks (VPNs), and the Internet
• Customized infotainment service that provides access to personalized content anytime, anywhere, based on mobile portals
• Mobile online multiuser gaming
• Ubiquitous and pervasive computing [6], such as automobile telematics, where an automated navigation system equipped with GPS and voice recognition can interact with the driver to obviate reading maps while driving

The industry has long envisioned the convergence of IT, entertainment, and telecommunications. A major portion of the telecommunication field is dedicated to handheld wireless devices — the mobile stations (cell phones). At the same time, the computer industry has focused on creating handheld computers that can do at least some important tasks necessary for people on the go. Handheld computers are classified as Pocket PCs or PDAs.
Pocket PCs are typically larger, have a keyboard, and support most functions and programs of a desktop PC. PDAs do simpler tasks, such as storing event calendars and phone numbers. PDAs normally use a form of handwriting recognition for input, although some incorporate keyboards as well. PDA manufacturers are striving to support more PC-like functions and at the same time provide wireless packet services (including voice over IP), so that a PDA can be used as a phone as well as for wireless Internet connectivity. As with all small portable computers, the Human Computer Interaction (HCI) problem is more significant than when using a desktop computer.
Where there is no space for a keyboard, it is envisioned that command input will be accomplished through voice recognition.
Most of the new PDA products support image and video capture, MP3 playback, e-mail, and wireless protocols such as 802.11b and Bluetooth. Some also act as cell phones when connected to a GPRS or PCS network (e.g., the Handspring Treo). They have color screens and support web browsing and multimedia e-mail messaging. Some Bluetooth-enabled PDAs rely on Bluetooth-compatible cell phones to access mobile networks. However, as cell phones become more powerful and PDAs incorporate 802.11b interface cards, Bluetooth might become less viable.
As PDA manufacturers look to the future, they wish to support not only voice communication over wireless networks but also multimedia, such as video communication. Some PDAs incorporate advanced digital cameras with flash and zoom (e.g., the Sony CLIE). The encoding of video can be done using MPEG-4 or H.263, and the PDA could support multiple playback formats.
Cell phone manufacturers, for their part, are trying to incorporate more computer-like functionality, including the basic tasks supported by PDAs, web browsing, games, image and video capture, attachments to e-mail, streaming video, videoconferencing, and so on. Growth in demand is steady for interactive multimedia, in particular image and video communications. Most cell phone manufacturers and mobile service providers already support some kind of image or video communication, either in the form of e-mail attachments, video streaming, or even videoconferencing. Similarly to the Short-text Messaging Service (SMS), the new messaging protocol Multimedia Messaging Service (MMS) is gaining support in the industry as an interim solution to the bandwidth limitation.
New cell phones feature color displays and have built-in digital cameras. Most cell phones use integrated CMOS sensors, and some handsets even have two of them. By 2004, the number of camera sensors on mobile phones is estimated to exceed the number of digital cameras sold worldwide. Cell phones have supported web browsing and e-mail functionality for a few years, but with packet services, Bluetooth, and MMS, they can support video streaming in various formats and MP3 playback. Some cell phones even include a touch screen that uses handwriting recognition and a stylus, as most PDAs do. Other cell phones are envisioned to be small enough to be wearable, instead of a wrist watch.

17.4
FURTHER EXPLORATION

Tanenbaum [14] has a good general discussion of wireless networks, and Wesel [2] offers some specifics about wireless communications networks. Viterbi [5] provides a solid analysis of spread spectrum and the foundation of CDMA. Wang et al. [16] give an in-depth discussion of error control in video communications.
The Further Exploration section of the text web site for this chapter contains current web resources for wireless networks, including:

• A survey on wireless networks and cellular phone technologies
• A report on GSM
An iniroduction Lo GPRS and links Lo NTIA for informalion on spectmm management • Home pages of lhe COMA Oeveiopment Group, IMT-2000, UMTS, cdma2000 RTT, 3GPP, and so on • Wireiess LAN standards We also show images of severai PDAs and severai modera ccli phones. 17.5
17.5 EXERCISES

1. In implementations of TDMA systems such as GSM and IS-136, and to a lesser degree in networks based on CDMA, such as IS-95, an FDMA technology is still in use to divide the allocated carrier spectrum into smaller channels. Why is this necessary?
2. Discuss the difference between the way GSM/GPRS and WCDMA achieve variable bitrate transmissions.
3. We have seen a geometric layout for a cellular network in Figure 17.1. The figure assumes hexagonal cells and a symmetric plan (i.e., that the scheme for splitting the frequency spectrum over different cells is uniform). Also, the reuse factor is K = 7. Depending on cell sizes and radio interference, the reuse factor may need to be different. Still requiring hexagonal cells, can all possible reuse factors achieve a symmetric plan? Which ones can? Can you speculate on a formula for general possible reuse factors?
4. What is the spreading gain for IS-95? What is the spreading gain for WCDMA UTRA FDD mode, assuming all users want to transmit at maximum bitrate? What is the impact of the difference between the spreading gains?
5. When a cellular phone user travels across the cell boundary, a handoff (or handover) from one cell to the other is necessary. A hard (imperfect) handoff causes dropped calls.
(a) CDMA (Direct Sequence) provides much better handoff performance than FDMA or Frequency Hopping (FH). Why?
(b) Suggest an improvement to handoff so it can be softer.
6. In a CDMA cell, when a CDMA mobile station moves across a cell boundary, a soft handoff occurs. Moreover, cells are also split into sectors, and when a mobile station moves between sectors, a softer handoff occurs.
(a) Provide arguments for why a softer handoff is necessary.
(b) State at least one other difference between the two handoffs. Hint: During handoff in a CDMA system, the mobile stations can transmit at lower power levels than inside the cell.
7. Most of the schemes for channel allocation discussed in this chapter are fixed (or uniform) channel assignment schemes. It is possible to design a dynamic channel allocation scheme to improve the performance of a cellular network. Suggest such a dynamic channel allocation scheme.
8. The 2.5G technologies are designed for packet-switching services.
This provides data-on-demand connectivity without the need to establish a circuit first. This is advantageous for sporadic data bursts.
(a) Suggest a method to implement multiple access control for TDMA packet services (such as GPRS).
(b) Circuits are more efficient for longer data. Extend your suggested method so that the channel goes through a contention process only for the first packet transmitted. Hint: Add reservations to your scheme.
Chapter 17
Wireless Networks
9. H.263+ and MPEG-4 use RVLCs, which allow decoding of a stream in both forward and backward directions from a synchronization marker. The RVLCs increase the bitrate of the encoding over regular entropy codes.
(a) Why is this beneficial for transmissions over wireless channels?
(b) What condition is necessary for it to be more efficient than FEC?
10. Why are RVLCs usually applied only to motion vectors? If you wanted to reduce the bitrate impact, what changes would you make?

17.6 REFERENCES

1. M. Rahnema, "Overview of GSM System and Protocol Architecture," IEEE Communications Magazine, 31(4): 92-100, 1993.
2. E.K. Wesel, Wireless Multimedia Communications: Networking Video, Voice, and Data, Reading, MA: Addison-Wesley, 1998.
3. F. Meeks, "The Sound of Lamarr," Forbes, May 14, 1990.
4. H. Holma and A. Toskala, eds., WCDMA for UMTS: Radio Access for Third Generation Mobile Communications, New York: Wiley, 2001.
5. A.J. Viterbi, CDMA: Principles of Spread Spectrum Communication, Reading, MA: Addison-Wesley, 1995.
6. J. Burkhardt, et al., Pervasive Computing: Technology and Architecture of Mobile Internet Applications, Boston, MA: Addison-Wesley, 2002.
7. Third Generation Partnership Project 2 (3GPP2), Video Conferencing Services, 3GPP2 Specifications, S.R0022, July 2000.
8. Third Generation Partnership Project (3GPP), QoS for Speech and Multimedia Codec, 3GPP Specifications, TR-26.912, March 2000.
9. K. N. Ngan, C. W. Yap, and …

C H A P T E R  18

Content-Based Retrieval in Digital Libraries

18.1 HOW SHOULD WE RETRIEVE IMAGES?

Consider the image in Figure 18.1 of a small portion of The Garden of Delights by Hieronymus Bosch (1453-1516), now in the Prado museum in Madrid. This is a famous painting, but we may be stumped in understanding the painter's intent. Therefore, if we are aiming at automatic retrieval of images, it should be unsurprising that encapsulating the semantics (meaning) in the image is an even more difficult challenge. A proper annotation of such an image certainly should include the descriptor "people". On the other hand, should this image be blocked by a "Net nanny" screening out "naked people" (as in [1])? We know very well that most major web browsers have a web search button for multimedia content, as opposed to text. For Bosch's painting, a text-based search will very likely do the best job, should we wish to find this particular image. Yet we may be interested in fairly general searches, say for scenes with deep blue skies and orange sunsets. By pre-calculating some fundamental statistics about images stored in a database, we can usually find simple scenes such as these.

…then pixel i is suppressed too. To make a binary edge image, we set all pixels with D greater than a threshold value to 1 and all others to 0.
To extract an edge map, the image is first converted to luminance Y via Y = 0.299R + 0.587G + 0.114B. A Sobel edge operator [10] is applied to the Y-image by sliding the following 3 x 3 weighting matrices (convolution masks) over the image:

    S_x =  -1  0  1        S_y =   1   2   1
           -2  0  2                0   0   0
           -1  0  1               -1  -2  -1

For edge separation ξ, for each edge pixel i we measure the distance along its gradient direction φ_i to the nearest edge pixel j having φ_j within 15°. If such a pixel j doesn't exist, the separation is considered infinite. Having created edge directionality and edge separation maps, C-BIRD constructs a 2D texture histogram of ξ versus φ. The initial histogram size is 193 x 180, where separation value ξ = 193 is reserved for a separation of infinity (as well as any ξ > 192). The histogram size is then reduced by three for each dimension to size 65 x 60, where joined entries are summed together. The histogram is "smoothed" by replacing each pixel with a weighted sum of its neighbors and is then reduced again to size 7 x 8, with separation value 7 reserved for infinity. At this stage, the texture histogram is also normalized by dividing by the number of pixels in the image segment.
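As a rough sketch of this edge-map step (the function name, naive convolution loop, and threshold value are ours for illustration, not C-BIRD code), the Sobel magnitude, direction, and binary edge map might be computed as:

```python
import numpy as np

def sobel_edge_map(Y, threshold):
    """Slide the two 3x3 Sobel masks over luminance image Y and return
    gradient magnitude D, direction phi (degrees), and a binary edge map."""
    sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    sy = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)
    h, w = Y.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):            # naive convolution, fine for small images
        for j in range(w - 2):
            patch = Y[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(sx * patch)
            gy[i, j] = np.sum(sy * patch)
    D = np.hypot(gx, gy)                        # edge magnitude
    phi = np.degrees(np.arctan2(gy, gx))        # edge directionality
    binary = (D > threshold).astype(np.uint8)   # 1 where D exceeds threshold
    return D, phi, binary
```

A vertical intensity step produces a strong response in the interior of the map, while a flat image yields no edges at all.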
FIGURE 18.5: Color layout grid.

FIGURE 18.6: Texture layout grid.
C-BIRD — A Case Study
18.2.6 Search by Illumination Invariance

Illumination change can dramatically alter the color measured by camera RGB sensors, from pink under daylight to purple under fluorescent lighting, for example. To deal with illumination change from the query image to different database images, each color-channel band of each image is first normalized, then compressed to a 36-vector [11]. Normalizing each of the R, G, and B bands of an image serves as a simple yet effective guard against color changes when the lighting color changes. A two-dimensional color histogram
FIGURE 18.7: Search with illumination invariance. Some thumbnail images are from the Corel Gallery and are copyright Corel. All rights reserved.
is then created using the chromaticity, which is the set of band ratios {R, G}/(R + G + B). Chromaticity is similar to the chrominance in video, in that it captures color information only, not luminance (or brightness). A 128 x 128-bin 2D color histogram can then be treated as an image and compressed using a wavelet-based compression scheme [12]. To further reduce the number of vector components in a feature vector, the DCT coefficients for the smaller histogram are calculated and placed in zigzag order, then all but 36 components are dropped. Matching is performed in the compressed domain by taking the Euclidean distance between two DCT-compressed 36-component feature vectors. (This illumination-invariant scheme and the object-model-based search described next are unique to C-BIRD.) Figure 18.7 shows the results of such a search. Several of the above types of searches can be done at once by checking multiple check boxes. This returns a reduced list of images, since the list is the conjunction of all resulting separate return lists for each method.
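A much-simplified sketch of this signature pipeline follows, assuming a 16 x 16 chromaticity histogram rather than 128 x 128 and skipping the wavelet stage; all function names are ours, and the naive DCT is only adequate for small square arrays:

```python
import numpy as np

def chromaticity_histogram(rgb, bins=16):
    """2D histogram over the band ratios {r, g} = {R, G}/(R+G+B)."""
    px = rgb.reshape(-1, 3).astype(float)
    s = px.sum(axis=1)
    s[s == 0] = 1.0                       # avoid division by zero
    r, g = px[:, 0] / s, px[:, 1] / s
    h, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return h / h.sum()                    # normalize to unit mass

def dct2(a):
    """Naive orthonormal 2D DCT-II for small square arrays."""
    n = a.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    C = scale[:, None] * basis
    return C @ a @ C.T

def zigzag36(m):
    """First 36 coefficients of m in zigzag (JPEG-style) order."""
    idx = sorted(((i, j) for i in range(m.shape[0]) for j in range(m.shape[1])),
                 key=lambda p: (p[0] + p[1],
                                p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return np.array([m[i, j] for (i, j) in idx[:36]])

def signature(rgb):
    return zigzag36(dct2(chromaticity_histogram(rgb)))

def distance(sig1, sig2):
    return float(np.linalg.norm(sig1 - sig2))  # Euclidean match in DCT domain
```

Two images are then compared simply via `distance(signature(a), signature(b))`.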
18.2.7 Search by Object Model

The most important search type C-BIRD supports is the model-based object search. The user picks a sample image and interactively selects a region for object searching. Objects photographed under different scene conditions are still effectively matched. This search type proceeds by the user selecting a thumbnail and clicking the Model tab to enter Object
FIGURE 18.8: C-BIRD interface, showing object selection using an ellipse primitive. (This figure also appears in the color insert section.) Image is from the Corel Gallery and is copyright Corel. All rights reserved.
Selection mode. An object is then interactively selected as a portion of the image; this constitutes an object query by example. Figure 18.8 shows a sample object selection. An image region can be selected using primitive shapes such as a rectangle or ellipse, a magic wand tool that is basically a seed-based flooding algorithm, an active contour (a "snake"), or a brush tool, where the painted region is selected. All the selections can be combined with each other using Boolean operations such as union, intersection, or exclusion. Once the object region is defined to a user's satisfaction, it can be dragged to the right pane, showing all current selections. Multiple regions can be dragged to the selection pane, but only the active object in the selection pane will be searched on. The user can also control parameters such as flooding thresholds, brush size, and active contour curvature. Details of the underlying mechanisms of this Search by Object Model are set out in [12] and introduced below as an example of a working system. Figure 18.9 shows a block diagram for how the algorithm proceeds. First, the user-selected model image is processed and its features are localized (details in the following sections). Color histogram intersection, based on the reduced chromaticity histogram described in Section 18.2.6, is then applied as a first "screen". Further steps estimate the pose (scale, translation, rotation) of the object inside a target image from the database. This is followed by verification by intersection of texture histograms and then a final check using an efficient version of a Generalized Hough Transform for shape verification.
[Figure 18.9 diagram: user object model selection → model feature localization → color-based image screening → pose estimation → texture support → shape verification, with candidate images drawn from the database and matches output.]
FIGURE 18.10: Model and target images: (a) sample model image; (b) sample database image containing the model book. (This figure also appears in the color insert section.) Active Perception textbook cover courtesy Lawrence Erlbaum Associates, Inc.
A possible model image and one of the target images in the database might be as in Figure 18.10, where the scene in (b) was illuminated with a dim fluorescent light.

FIGURE 18.9: Block diagram of object matching steps.

Locales in Feature Localization

The Search by Object Model introduced above — finding an object inside a target image — is a desirable yet difficult mechanism for querying multimedia data. An added difficulty is that objects can be photographed under different lighting conditions. Human vision has "color constancy" [13], an invariant processing, presumably, that allows us to see colors under different lighting as the same. For image indexing, it should be useful to determine only a covariant processing that changes along with changing light [12]. In that case, we could aim at also recovering the lighting change. Since object-based search considers objects within an image, we should apply some sort of segmentation to look at regions of objects — say, patches that have about the same color. However, it has been found to be more useful to use a set of rough, possibly overlapping regions (called locales [7]) to express not a complete image segmentation but instead a coarser feature localization. It is worthwhile looking in more detail at this locale-directed search method, which we describe along with the process of feature localization. Since we are interested in lighting change, we also look at a technique to compensate for illumination change, so as to carry out a color covariant search.

Feature Localization versus Image Segmentation

For image segmentation (cf. [14]): if R is a segmented region,
1. R is usually connected; all pixels in R are connected (8-connected or 4-connected).
2. R_i ∩ R_j = ∅, i ≠ j; regions are disjoint.
3. ∪_i R_i = I, where I is the entire image; the segmentation is complete.

Object retrieval algorithms based on image segmentation permit imprecise regions by allowing a tolerance on the region-matching measure. This accounts for small imprecision in the segmentation but not for over- or under-segmentation, which can be attributed to the pixel-level approach. This works only for simplified images, where object pixels have statistics that are position-invariant. A coarse localization of image features based on proximity and compactness is likely to be a more effective and attainable process than image segmentation.

Definition: A locale L_f is a local enclosure of feature f. A locale uses blocks of pixels called tiles as its positioning units and has the following descriptors:

1. Envelope L_f. A set of tiles representing the locality of L_f.
2. Geometric parameters:

    Mass M(L_f) = count of the pixels having feature f,

    Centroid C(L_f) = Σ_{i=1}^{M(L_f)} P_i / M(L_f), where P_i = position of pixel i,

    Eccentricity E(L_f) = Σ_{i=1}^{M(L_f)} ||P_i − C(L_f)||² / M(L_f).

3. Color, texture, and shape parameters of the locale. For example, locale chromaticity, elongation, and locale texture histogram.

Initially, an image is subdivided into square tiles (e.g., 8 x 8 or 16 x 16). While the pixel is the building unit for image segmentation, the tile is the building unit for feature localization. Tiles group pixels with similar features within their extent and are said to have feature f if enough pixels in them have feature f (e.g., 10%). Tiles are necessary for good estimation of initial object-level statistics and representation of multiple features at the same location. However, locale geometric parameters are measured in pixels, not tiles. This preserves feature granularity. Hence, feature localization is not merely a reduced-resolution variation on image segmentation. After a feature localization process, the following can be true:

1. ∃f : L_f is not connected.
2. ∃f, g : L_f ∩ L_g ≠ ∅, f ≠ g; locales are non-disjoint.
3. ∪_f L_f ≠ I; non-completeness: not all image pixels are represented.

Figure 18.11 shows a sketch of two locales for color red and one for color blue. The links represent an association with an envelope, which demonstrates that locales do not have to be connected, disjoint, or complete, yet colors are still localized.

FIGURE 18.11: Locales for feature localization.

Tile Classification

Before locales can be generated, tiles are first classified as having certain features, for example, red tiles, or red and blue tiles. Since color is most useful for CBIR and is invariant to translations, rotations, and scaling, we will start with color localization, although other features (texture, shape, motion, etc.) can certainly be localized similarly.

Dominant Color Enhancement

To localize on color, we first remove noise and blurring by restoring colors smoothed out during image acquisition. The image is converted from the RGB color space to a chromaticity-luminance color space. For a pixel with color (R, G, B), we define

    l = R + G + B,  r = R/l,  g = G/l    (18.4)

where the luminance l is separated from the chromaticity (r, g). Clearly, we can also use an approximately illumination-invariant version of color, as in Section 18.2.6. Prior to classifying feature tiles, image pixels are classified as having either dominant color or transitional color, by examining their neighborhood.

Definition: Dominant colors are pixel colors that do not lie on a slope of color change in their pixel neighborhood. Transitional colors do.
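The locale geometric descriptors (mass, centroid, eccentricity) are straightforward to compute directly from the pixel positions carrying a feature; a minimal sketch (our function, not C-BIRD code):

```python
import numpy as np

def locale_stats(positions):
    """Mass, centroid, and eccentricity of a locale.
    positions: (N, 2) array of pixel coordinates having the feature."""
    P = np.asarray(positions, dtype=float)
    M = len(P)                                     # mass: pixel count
    C = P.sum(axis=0) / M                          # centroid
    E = np.sum(np.sum((P - C) ** 2, axis=1)) / M   # eccentricity: mean squared
                                                   # distance to the centroid
    return M, C, E
```

For two pixels at (0, 0) and (2, 0), for instance, the mass is 2, the centroid (1, 0), and the eccentricity 1.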
If a pixel does not have a sufficient number of neighbors with similar color values within a threshold, it is considered noise and is also classified as transitional. The uniformity of the dominant colors is enhanced by smoothing the dominant pixels only, using a 5 x 5 averaging filter, with the exception that only dominant pixels having similar color are averaged. Figure 18.12 shows how dominant color enhancement can clarify the target image in Figure 18.10 above.

Tile Feature List

Tiles have a tile feature list of all the color features associated with the tile and their geometrical statistics. On the first pass, dominant pixels are added to the tile feature list. For each pixel added, if the color is close to a feature on the list within the luminance-chromaticity thresholds, the color and geometrical statistics for the feature are updated. Otherwise, a new color feature is added to the list. This feature list is referred to as the dominant feature list. On the second pass, all transitional colors are added to the dominant feature list without modifying the color, but updating the geometrical statistics. To determine which dominant feature list node the transitional pixel should merge to, we examine the neighborhood of the transitional pixel and find the closest color that is well represented in the neighborhood. If an associated dominant color doesn't exist, it is necessary to create a second transitional feature list and add the transitional color to it. The dominant color (r_i, g_i, l_i) taken on by a transitional pixel tp having color (r, g, l) satisfies the following minimization:

    min_{i=1}^{nc}  || (r, g)^T − (r_i, g_i)^T || / F(r_i, g_i, l_i)    (18.5)
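Eq. (18.5) picks, for a transitional pixel, the neighborhood color that best trades off chromaticity distance against the frequency F of that color in the neighborhood. A small sketch (the function name and data layout are ours for illustration):

```python
import numpy as np

def pick_dominant_color(pixel_rg, candidates):
    """Choose the dominant color for a transitional pixel, per the idea of
    Eq. (18.5): minimize chromaticity distance divided by frequency.
    pixel_rg: (r, g) chromaticity of the transitional pixel.
    candidates: list of ((r_i, g_i, l_i), F_i) pairs from the neighborhood."""
    best, best_val = None, float("inf")
    for (ri, gi, li), F in candidates:
        val = np.hypot(pixel_rg[0] - ri, pixel_rg[1] - gi) / F
        if val < best_val:
            best, best_val = (ri, gi, li), val
    return best
```

Note how a slightly farther color can still win if it is much better represented (larger F) in the neighborhood.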
The parameter nc is the number of nonsimilar colors in the neighborhood of tp. Similar colors are averaged to generate the (r_i, g_i, l_i) colors. F(r_i, g_i, l_i) is the frequency of the i-th average color, or in other words, the number of similar colors averaged to generate color i. The color that minimizes this equation is the best compromise for dominant color selection for tp, in terms of color similarity and number of similar colors in the neighborhood. The neighborhood size was chosen to be 5 x 5 in our implementation. When all pixels have been added to the tiles, the dominant and transitional color feature lists are merged. If a transitional list node is close in color to a dominant list node, the geometrical statistics for the merged node are updated, but only the color from the dominant list is preserved. Otherwise, the nodes from both lists are just concatenated onto the joint list.

FIGURE 18.12: Smoothing using dominant colors: (a) original image, not smoothed; (b) smoothed image with transitional colors shown in light gray; (c) smoothed image with transitional colors shown in the replacement dominant colors (if possible). Lower row shows detail images.

Locale Generation

Locales are generated using a dynamic 4 x 4 overlapped pyramid linking procedure [15]. On each level, parent nodes compete for inclusion of child nodes in a fair competition. Image tiles are the bottom-level child nodes of the pyramid, and locales are generated for the entire image when the competition propagates to the top level. The top-level pyramid node has a list of color features with associated envelopes (collections of tiles) and geometrical statistics [12]. Competition on each level is initialized by using a 2 x 2 nonoverlapped linkage structure, where four child nodes are linked with a single parent node. The LocalesInit initialization proceeds as follows:

PROCEDURE 18.1   LocalesInit // Pseudocode for link initialization

BEGIN
    Let c[n_x][n_y] be the 2D array of child nodes.
    Let p[n_x/2][n_y/2] be the 2D array of parent nodes.
    For each child node c[i][j] do
        Let cn = c[i][j] and pn = p[i/2][j/2].
        For each node cn_p in the feature list of cn do
            Find node pn_q in the feature list of pn that has similar color.
            If the merged eccentricity of cn_p and pn_q has E < τ then
                merge cn_p and pn_q.
            If pn_q doesn't exist or E >= τ then
                add cn_p to the start of the feature list of pn.
END

After the pyramid linkage initialization, the competition begins. Since a 4 x 4 overlapped pyramid structure is used, four parents compete for linkage with each child, one of which is already linked to it. This process is illustrated by the EnvelopeGrowing pseudocode:

PROCEDURE 18.2   EnvelopeGrowing // Pseudocode for locale generation

BEGIN
    Let c[n_x][n_y] be the 2D array of child nodes.
    Let p[n_x/2][n_y/2] be the 2D array of parent nodes.
    Repeat until the parent-child linkage does not change anymore:
        For each child node c[i][j] do
            Let cn = c[i][j], and let pn range over the candidate parents of cn.
            For each node cn_p in the feature list of cn do
                Find node pn_q in the feature lists of pn that has similar color
                and minimizes the distance ||C(cn_p) − C(pn_q)||.
                If the merged eccentricity of cn_p and pn_q has E < τ then
                    swap the linkage of cn_p to its parent to pn_q
                    and update the associated geometrical statistics.
        In the parent feature lists remove empty nodes.
    Go up a level in the pyramid and repeat the procedure.
END

Following the pyramidal linking, locales having small mass are removed, since small locales are not accurate enough and are probably either an insignificant part of an object or noise. To increase the efficiency of the search, locales are also sorted according to decreasing mass size.
The color update equation for parent locale j and child locale i at iteration k + 1 is

    (r_j^(k+1), g_j^(k+1), l_j^(k+1))^T = [ (r_j^(k), g_j^(k), l_j^(k))^T M_j^(k) + (r_i^(k), g_i^(k), l_i^(k))^T M_i^(k) ] / ( M_j^(k) + M_i^(k) )    (18.6)

and the update equations for the geometrical statistics are

    M_j^(k+1) = M_j^(k) + M_i^(k)    (18.7)

    C_j^(k+1) = ( C_j^(k) M_j^(k) + C_i^(k) M_i^(k) ) / M_j^(k+1)    (18.8)

    E_j^(k+1) = [ ( E_j^(k) + ||C_j^(k) − C_j^(k+1)||² ) M_j^(k) + ( E_i^(k) + ||C_i^(k) − C_j^(k+1)||² ) M_i^(k) ] / M_j^(k+1)    (18.9)
Figure 18.13 shows how color locales appear for sample model and target images.
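Because Eqs. (18.7)-(18.9) merge sufficient statistics exactly, a merged locale has the same mass, centroid, and eccentricity as if they were computed directly over the union of the two pixel sets. A sketch of the geometric updates (function name ours):

```python
import numpy as np

def merge_locales(Mj, Cj, Ej, Mi, Ci, Ei):
    """Absorb child locale i into parent locale j, per Eqs. (18.7)-(18.9)."""
    Cj, Ci = np.asarray(Cj, float), np.asarray(Ci, float)
    M_new = Mj + Mi                                          # (18.7)
    C_new = (Cj * Mj + Ci * Mi) / M_new                      # (18.8)
    E_new = ((Ej + np.sum((Cj - C_new) ** 2)) * Mj +
             (Ei + np.sum((Ci - C_new) ** 2)) * Mi) / M_new  # (18.9)
    return M_new, C_new, E_new
```

For example, merging a locale of pixels {(0,0), (2,0)} (mass 2, centroid (1,0), eccentricity 1) with a single pixel at (4,0) yields mass 3, centroid (2,0), and eccentricity 8/3, exactly the direct statistics of {(0,0), (2,0), (4,0)}.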
Texture Analysis

Every locale is associated with a locale-based texture histogram as discussed in Section 18.2.5. Thus a locale-dependent threshold makes more sense in generating the edge map. The threshold is obtained by examining the histogram of the locale edge magnitudes. The texture histogram is smoothed using a Gaussian filter and subsampled to size 8 x 7, then normalized. The locale-based texture is a more effective measure of texture than a global one, since the locale-dependent thresholds can be adjusted adaptively. Figure 18.14 compares locale-based edge detection to global-threshold-based edge detection, as discussed in Section 18.2.5. The edge maps shown demonstrate that for the lamp and the banana objects, some edge points are missing when using global thresholding, but most of them exist when using locale-based thresholding. To draw the locale-based edge map, edge pixels generated for any locale are unioned together.
FIGURE 18.13: Color locales: (a) color locales for the model image; (b) color locales for a database image. (This figure also appears in the color insert section.)
Object Modeling and Matching

Object models in C-BIRD consist of a set of localized features. As shown above, they provide a rich set of statistical measures for later matching. Moreover, their geometric relationships, such as the spatial arrangement of locales, are also extracted. They are best represented using vectors connecting centroids of the respective locales. The object-search method recovers 2D rigid object translation, scale, and rotation, as well as illumination change (full details are given in [12]). C-BIRD also allows a combination search, where an object search can be combined with other, simpler search types. In that case, the searches are executed according to decreasing speed. Since object search is the most complex search available, it is executed last, and only on the search results passed on so far by the other search types.
The object image selected by the user is sent to the server for matching against the locales database. The localization of the submitted model object is considered the appropriate localization for the object, so that image locales need to be found that have a one-to-one correspondence with model locales. Such a correspondence is called an assignment. A locale assignment has to pass several screening tests to verify an object match. Screening tests are applied in order of increasing complexity and dependence on previous tests. Figure 18.9 shows the sequence of steps during an object matching process: (a) user object model selection and model feature localization, (b) color-based screening test, (c) pose estimation, (d) texture support, and (e) shape verification.
FIGURE 18.14: Global versus locale-based thresholds: (a) the edge map for the database image using a global threshold; (b) the edge map for the database image using a locale-based threshold.

The object match measure Q is formulated as follows:

    Q = n Σ_{l=1}^{m} w_l Q_l    (18.10)

where n is the number of locales in the assignment, m is the number of screening tests considered for the measure, Q_l is the fitness value of the assignment in screening test l, and w_l are weights that correspond to the importance of the fitness value of each screening test. The w_l can be arbitrary; they do not have to sum to 1. Care has to be taken to normalize the Q_l values to lie in the range [0..1], so that they all have the same numerical meaning. Locales with higher mass (more pixels) statistically have a smaller percentage of localization error. The features are better defined, and small errors average out, so we have higher confidence in locales with large mass. Similarly, assignments with many model locales are preferable to few model locales, since the cumulative locale mass is larger and the errors average out. We try to assign as many locales as possible first, then compute the match measure and check the error using a tight threshold. Locales are removed or changed in the assignment as necessary until a match is obtained. At that point, it is probably the best match measure possible, so it is unnecessary to try other assignments. In this case, all possible permutations of locale assignments do not have to be checked. In the worst case, when the object model is not present in the search image, we have to test all assignments to determine there is no match. The image locales in the database and the object model locales are sorted according to decreasing mass size.

Matching Steps

The screening tests applied to locales to generate assignments and validate them are:

• Color-based screening tests (step b):
  - Illumination color covariant screening
  - Chromaticity voting
  - Elastic correlation
• Estimation of image object pose (step c)
• Texture support (step d)
• Shape verification (step e)
• Recovery of lighting change
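A minimal sketch of the match measure of Eq. (18.10), assuming fitness values Q_l already normalized to [0, 1] (the function name and the interpretation of n as a multiplier favoring larger assignments are ours):

```python
def match_measure(fitness, weights, n_locales):
    """Object match measure Q = n * sum_l w_l * Q_l over m screening tests.
    fitness: list of normalized fitness values Q_l in [0, 1].
    weights: list of importance weights w_l (need not sum to 1)."""
    assert all(0.0 <= q <= 1.0 for q in fitness), "Q_l must be in [0, 1]"
    return n_locales * sum(w * q for w, q in zip(weights, fitness))
```

With this form, an assignment of more locales scores higher for the same per-test fitness, matching the preference stated above.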
The idea of color covariant matching is to realize that colors may change, from model to target, since the lighting may easily change. A diagonal model of lighting change states that the entire red channel responds to lighting change via an overall multiplicative change, as do the green and blue channels, each with their own multiplicative constant [11]. Locales vote on the correct lighting change, since each assignment of one model locale color to a target one implies a diagonal lighting shift. Many votes in the same cell of a voting space will imply a probable peak value for lighting change. Using the chromaticity voting scheme, all image locales are paired with all model locales to vote for lighting change values in a voting array. We can evaluate the feasibility of having an assignment of image locales to model locales using the estimated chromaticity shift parameters by a type of elastic correlation. This computes the probability that there can be a correct assignment and returns the set of possible assignments. Having a candidate set of chromaticity shift parameters, each candidate is successively used to compute the elastic correlation measure. If the measure is high enough (higher than 80%, say), the possible assignments returned by the elastic correlation process are tested for object matching using pose estimation, texture support, and shape verification. Figure 18.15 shows the elastic correlation process applied in the model chromaticity space Ω(r′, g′): the model image has three locale colors at A′, B′, and C′. All the image locale colors, A, B, C, D, E, and F, are shifted to the model illuminant. Although the locales (A′, B′, C′) and (A, B, C) are supposed to be matching entities, they do not appear at exactly the same location. Instead of a rigid template matching (or correlation) method, we employ elastic correlation, in which the nodes A, B, C are allowed to be located in the vicinity of A′, B′, C′, respectively.
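The voting idea under the diagonal lighting model can be sketched as follows; the bin size, RGB (rather than chromaticity) ratios, and data layout are our illustrative simplifications:

```python
from collections import Counter

def vote_lighting_change(model_colors, image_colors, bin_size=0.1):
    """Each (model, image) locale color pair implies a diagonal lighting
    shift (R_m/R_i, G_m/G_i, B_m/B_i); accumulate votes in a quantized
    voting array and return the peak cell and its vote count."""
    votes = Counter()
    for (Rm, Gm, Bm) in model_colors:
        for (Ri, Gi, Bi) in image_colors:
            if min(Ri, Gi, Bi) <= 0:
                continue                     # skip degenerate colors
            ratio = (Rm / Ri, Gm / Gi, Bm / Bi)
            cell = tuple(round(c / bin_size) for c in ratio)
            votes[cell] += 1
    peak, count = votes.most_common(1)[0]
    return tuple(c * bin_size for c in peak), count
```

Correct locale correspondences all imply the same diagonal shift and pile up in one cell, while incorrect pairings scatter, so the peak suggests the actual lighting change.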
The pose estimation method (step (c)) uses geometrical relationships between locales for establishing pose parameters. For that reason, it has to be performed on a feasible locale assignment. Locale spatial relationships are represented by relationships between their centroids. The number of assigned locales is allowed to be as few as two, which is enough geometry information to drive estimation of a rigid-body 2D displacement model with four parameters to recover: x, y translation, rotation R, and scale s [12].
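With only two locale centroid pairs, the four pose parameters can be recovered in closed form; one convenient formulation (ours, not necessarily C-BIRD's solver) uses complex arithmetic, writing the similarity transform as z′ = a·z + b with a = s·e^{iθ}:

```python
import cmath

def pose_from_two_centroids(model_pts, image_pts):
    """Solve image = s * R(theta) * model + t from two centroid pairs."""
    (m1, m2), (p1, p2) = model_pts, image_pts
    z1, z2 = complex(*m1), complex(*m2)
    w1, w2 = complex(*p1), complex(*p2)
    a = (w2 - w1) / (z2 - z1)      # encodes scale and rotation jointly
    b = w1 - a * z1                # translation
    s = abs(a)
    theta = cmath.phase(a)
    return s, theta, (b.real, b.imag)
```

For example, model centroids (0,0) and (1,0) mapped to image centroids (3,4) and (3,6) recover scale 2, rotation 90°, and translation (3, 4). With more than two assigned locales, a least-squares fit over all centroid pairs would be used instead.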
FIGURE 18.15: Elastic correlation in Ω(r′, g′).
Results of pose estimation are both the best pose parameters for an assignment and the minimization objective value, which is an indication of how well the locales assignment fit using the rigid-body displacement model. If the error is within a small threshold, the pose estimate is accepted. The texture-support screening test uses a variation of the histogram intersection technique, where the texture histograms of locales in the assignment are intersected. If the intersection measure is higher than a threshold, the texture match is accepted. The final match verification process (step (e)) is shape verification by the method of Generalized Hough Transform (GHT) [16]. The GHT is robust with respect to noise and occlusion [17]. Performing a full GHT search for all possible rotation, scale, and translation parameters is computationally expensive and inaccurate. Such a search is not feasible for large databases. However, after performing pose estimation, we already know the pose parameters and can apply them to the model reference point to find the estimated reference point in the database image. Hence, the GHT search reduces to a mere confirmation that the number of votes in a small neighborhood around the reference point is indicative of a match. This GHT matching approach takes only a few seconds for a typical search. The reference point used is the model center, since it minimizes voting error caused by errors in edge gradient measurements. Once we have shape verification, the image is reported as a match, and its match measure Q returned, if Q is large enough. After obtaining match measures Q_i for all images in the database, the Q_i measures are sorted according to decreasing value. The number of matches can further be restricted to the top k if necessary. An estimate of the correct illumination change follows from correct matches reported. Figure 18.16(a) shows the GHT voting result for searching for the pink book in one of the database images, as in Figure 18.10(b).
Darkness indicates the number of votes received, which in turn indicates the likelihood that the object is in the image and at that location. Figure 18.16(b) shows the reconstructed edge map for the book. Since the model edge map and the location, orientation, and scale of the object are now known, this reconstruction is entirely automated.
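The texture-support screening step above relies on histogram intersection. A minimal sketch of that measure follows; the bin counts and the 0.8 threshold are illustrative, not C-BIRD's actual parameters:

```python
def histogram_intersection(h_model, h_image):
    """Normalized histogram intersection: 1.0 means a perfect match."""
    total = sum(h_model)
    if total == 0:
        return 0.0
    return sum(min(m, i) for m, i in zip(h_model, h_image)) / total

# Screening: accept the locale assignment only if the texture
# histograms overlap strongly enough.
model_hist = [4, 10, 6, 0, 2]     # illustrative texture-bin counts
locale_hist = [3, 11, 5, 1, 2]
score = histogram_intersection(model_hist, locale_hist)
accepted = score > 0.8            # illustrative threshold
```

Because the measure is a sum of bin-wise minima, it degrades gracefully under partial occlusion: missing texture only lowers the overlap rather than invalidating it.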
FIGURE 18.16: Using the GHT for shape verification: (a) GHT accumulator array image; (b) reconstruction of the detected object using the estimated pose and the GHT template (edge map).

Figure 18.17 shows some search results for the pink book in C-BIRD. While C-BIRD is an experimental system, it does provide a proof in principle that the difficult task of search by object model is possible.

Video Locales

Definition: A video locale is a sequence of image feature locales that share similar features in the spatiotemporal domain of videos.

Like locales in images, video locales have their color, texture, and geometric properties. Moreover, they capture motion parameters, such as motion trajectory and speed, as well as temporal information, such as the lifespan of the video locale and its temporal relationships with respect to other video locales. Since video proceeds in small time steps, we can also expect to develop new locales from ones already known from previous video frames more easily than simply starting from scratch in each frame [18]. Figure 18.18 shows that, while the prediction speeds up the generation of locales substantially, little difference occurs between the locales generated from each image independently (intra-frame) and those obtained by predicting and then refining (inter-frame). While we shall not go into the details of generating the video locales, suffice it to say that the inter-frame algorithm is always much faster than the intra-frame one. Moreover, video locales provide an effective means toward real-time video object segmentation and tracking [18].
18.3 SYNOPSIS OF CURRENT IMAGE SEARCH SYSTEMS

Some other current image search engines are mentioned here, along with URLs for each (more URLs and resources are in the Further Exploration section). The following is by no means a complete synopsis. Most of these engines are experimental, but all those included here are interesting in some way. Several include query features different from those outlined for C-BIRD.
Chapter 18: Content-Based Retrieval in Digital Libraries
The striking feature of this metric is that it allows us to use simple differences of average three-dimensional color as a first screen, because the simpler metric is guaranteed to be a bound on the more complex one in Equation (18.12) [20]. QBIC has been developed further since its initial version and now forms an essential (and licensable) part of IBM's suite of Digital Library products. These aim at providing a complete media-collection management system. An interesting development in the QBIC research effort at IBM is the attempt to include grayscale imagery in its domain [21], a difficult retrieval task. QBIC can combine other attributes with color-only-based searches — these can be textual annotations, such as captions, and texture. Texture, particularly, helps in graylevel image retrieval, since to some extent it captures structural information in an image. Database issues begin to dominate once the data set becomes very large, with careful control on cluster sizes and representatives for a tree-based indexing scheme.
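The two-stage screen can be sketched as follows; the bin colors, histograms, and the way the full metric would be invoked are all illustrative, not QBIC's actual implementation:

```python
def mean_color(hist, bin_colors):
    """Average three-dimensional color implied by a histogram."""
    n = sum(hist)
    return tuple(sum(h * c[d] for h, c in zip(hist, bin_colors)) / n
                 for d in range(3))

def mean_color_dist2(h1, h2, bin_colors):
    """Cheap first-screen distance: squared distance of average colors."""
    m1 = mean_color(h1, bin_colors)
    m2 = mean_color(h2, bin_colors)
    return sum((a - b) ** 2 for a, b in zip(m1, m2))

# Two illustrative RGB bin centers and two tiny histograms.
bins = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
query, candidate = [1, 1], [0, 2]
d_cheap = mean_color_dist2(query, candidate, bins)
# Only when d_cheap is small enough (relative to the best candidate so
# far, scaled by the bound constant) would the expensive quadratic-form
# histogram distance of Equation (18.12) be computed.
```

The point of the bound is that most database images can be rejected after the three-subtraction mean-color test, so the quadratic-form distance is computed for only a few candidates.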
18.3.6 Columbia University Image Seekers

A team at Columbia University has developed the following search engines:

• Content-Based Visual Query (CBVQ), developed by the ADVENT project at Columbia University, is the first of the series. (ADVENT stands for All Digital Video Encoding, Networking and Transmission.) It uses content-based image retrieval based on color, texture, and color composition. http://maya.ctr.columbia.edu:8088/cbvq

• VisualSEEk is a color-photograph retrieval system. Queries are by color layout, by an image instance, such as the URL of a seed image, or by instances of prior matches. VisualSEEk supports queries based on the spatial relationships of visual features. http://www.ctr.columbia.edu/visualseek

• SaFe, an integrated spatial and feature image system, extracts regions from an image and compares the spatial arrangements of regions. http://disney.ctr.columbia.edu/safe
• NETRA [22] is also part of the Alexandria Digital Library project. Now in its second generation as NETRA II, it emphasizes color image segmentation for object- or region-based search. http://maya.ece.ucsb.edu/Netra/
• WebSEEk collects images (and text) from the web. The emphasis is on making a searchable catalogue with such topics as animals, architecture, art, astronomy, cats, and so on. Relevance feedback is provided in the form of thumbnail images and motion icons. For video, a good form of feedback is also the inclusion of small, short video sequences as animated GIF files. http://www.ctr.columbia.edu/webseek (includes a demo version)
• Perception-Based Image Retrieval (PBIR) aims at a better version of learning and relevance feedback techniques, with learning algorithms that try to get at the underlying query behind the user's choices in zeroing in on the right target. http://www.mmdb.ece.ucsb.edu/demo/corelactn/
18.3.3 Berkeley Digital Library Project

The URL for this University of California, Berkeley, search engine is http://elib.cs.berkeley.edu. Text queries are supported, with search aimed at a particular commercial or other set of stock photos. The experimental version tries to include semantic information from text as a clue for image search.

18.3.4 Chabot

Chabot is an earlier system, also from UC Berkeley, that aims to include 500,000 digitized multiresolution images. Chabot uses the relational database management system POSTGRES to access these images and associated textual data. The system stores both text and color histogram data. Instead of color percentages, a "mostly red" type of simple query is acceptable. http://http.cs.berkeley.edu/ginger/chabot.html
18.3.5 Blobworld

Blobworld [23] was also developed at UC Berkeley. It attempts to capture the idea of objects by segmenting images into regions. To achieve a good segmentation, an expectation maximization (EM) algorithm derives the maximum likelihood for a good clustering in the feature space. Blobworld allows for both textual and content-based searching. The system has some degree of feedback, in that it displays the internal representation of the submitted image and the query results, so the user can better guide the algorithm. http://elib.cs.berkeley.edu/photos/blobworld
18.3.2 UC Santa Barbara Search Engines

Alexandria Digital Library (ADL) is a seasoned image search engine devised at the University of California, Santa Barbara. The ADL is presently concerned with geographical data: "spatial data on the web". The user can interact with a map, zoom into it, and then retrieve, as a query result, images that pertain to the selected map area. This approach mitigates the fact that terabytes, perhaps, of data need to be stored for LANDSAT images, say. Instead, ADL uses a multiresolution approach that allows fast browsing by making use of image thumbnails. Multiresolution imagery means that it is possible to select a certain region within an image and zoom in on it. http://www.alexandria.ucsb.edu
18.3.7 Informedia

The Informedia Digital Video Library project at Carnegie Mellon University is now in its second generation, known as Informedia II. This centers on "video mining" and is funded by a consortium of government and corporate sponsors. http://informedia.cs.cmu.edu/
18.3.8 MetaSEEk

MetaSEEk is a meta-search engine, also developed at Columbia, under their IMKA (Intelligent Multimedia Knowledge Application) project. The idea is to query several other online image search engines, rank their performance for different types of visual queries, and use them selectively for any particular search. http://ana.ctr.columbia.edu/metaseek/
18.3.9 Photobook and FourEyes

Photobook [24] was one of the earlier CBIR systems, developed by the MIT Media Laboratory. It searches for three different types of image content (faces, 2-D shapes, and texture images) using three mechanisms. For the first two types, it creates an eigenfunction space, a set of "eigenimages". Then new images are described in terms of their coordinates in this basis. For textures, an image is treated as a sum of three orthogonal components in a decomposition denoted as Wold features [25]. With relevance feedback added, Photobook became FourEyes [26]. Not only does this system assign positive and negative weight changes for images, but, given a query similar to one it has seen before, it can also reuse the weightings it has already learned. http://vismod.www.media.mit.edu/vismod/demos/photobook/
18.4 RELEVANCE FEEDBACK

Relevance feedback is a powerful tool that has been brought to bear in recent CBIR systems (see, e.g., [27]). Briefly, the idea is to involve the user in a loop, whereby images retrieved are used in further rounds of convergence onto correct returns. The usual situation is that the user identifies images as good, bad, or don't care, and weighting systems are updated according to this user guidance. (Another approach is to move the query toward positively marked content [29]. An even more interesting idea is to move every data point in a disciplined way, by warping the space of feature points [30]. In the latter approach, the points themselves move along with the high-dimensional space being warped, much like raisins embedded in a volume of Jello that is being squeezed!)
18.3.10 MARS

MARS (Multimedia Analysis and Retrieval System) [27] was developed at the University of Illinois at Urbana-Champaign. The idea was to create a dynamic system of feature representations that could adapt to different applications and different users. Relevance feedback (see Section 18.4), with changes of weightings directed by the user, is the main tool used.
18.3.11 Virage

Visual Information Retrieval (Virage) [28] operates on objects within images. Image indexing is performed after several preprocessing operations, such as smoothing and contrast enhancement. The details of the feature vector are proprietary; however, it is known that the computation of each feature is made by not one but several methods, with a composite feature vector composed of the concatenation of these individual computations. http://www.virage.com/
18.3.12 Viper

Visual Information Processing for Enhanced Retrieval (VIPER) is an experimental system that concentrates on a user-guided shaping of finer and finer search constraints — that is, on relevance feedback. The system is developed by researchers at the University of Geneva. VIPER makes use of a huge set of approximately 80,000 potential image features, based on color and texture at different scales and in a hierarchical decomposition of the image. VIPER is distributed under the auspices of the open software distribution system GNU ("GNU's Not Unix") under a General Public License. http://viper.unige.ch
18.3.13 Visual RetrievalWare

Visual RetrievalWare is an image search technology owned by Convera, Inc. It is built on techniques created for use by various government agencies for searching databases of standards documents. Its image version powers Yahoo!'s Image Surfer. Honeywell has licensed this technology as well: Honeywell x-rayed over one million of its products and plans to be able to index and search a database of these x-ray images. The features this software uses are color content, shape content, texture content, brightness structure, color structure, and aspect ratio. http://vrw.convera.com:8015/cs/
18.4.1 MARS

In the MARS system [27], weights assigned to feature points are updated by user input. First, the MARS authors suppose that there are many features, i = 1..I of them, such as color, texture, and so on. For each such feature, they further suppose that we can use multiple representations. For example, for color we may use color histograms, color layout, moments of color histograms, dominant colors, and so on. Suppose that, for each i, there are j = 1..J_i such representations. Finally, for each representation j of feature i, suppose there is an associated set of k = 1..K_ij components of a feature vector. So in the end, we have feature vector components r_ijk.

Each kind of feature i has an importance, or weight, W_i, and weights W_ij are associated with each of the representations for the kind of feature i. Weights W_ijk are also associated with each component of each representation. Weights are meant to be dynamic, in that they change as further rounds of user feedback are incorporated.

Let F = {f_i} be the whole set of features f_i. Let R = {r_ij} be the set of representations for a given feature f_i. Then, again, just for the current feature i, suppose that M = {m_ij} is a set of similarity measures used to determine how similar or dissimilar two representations in set R are. That is, different metrics should be used for different representations: a vector-based representation might use Mahalanobis distance for comparing feature vectors, while histogram intersection may be used for comparing color histograms. With set D being the raw image data, an entire expression of a relevance feedback algorithm is expressed as a model (D, F, R, M). Then the retrieval process suggested in [29] is as follows:

1. Initialize weights as uniform values:
W_i = 1/I,    W_ij = 1/J_i,    W_ijk = 1/K_ij

Recall that I is the number of features in set F, J_i is the number of representations for feature f_i, and K_ij is the length of the representation vector r_ij.
2. A database image's similarity to the query is first defined in terms of components:

S(r_ij) = m_ij(r_ij, W_ijk)

Then each representation's similarity values are grouped as

S(f_i) = Σ_j W_ij S(r_ij)

3. Finally, the overall similarity S is defined as

S = Σ_i W_i S(f_i)

4. The top N images similar to query image Q are then returned.

5. Each of the retrieved images is marked by the user as highly relevant, relevant, no opinion, nonrelevant, or highly nonrelevant, according to his or her subjective opinion.

6. Weights are updated, and the process is repeated.

Similarities have to be normalized to get a meaningful set of images returned:

1. Since representations may have different scales, features are normalized, both offline (intranormalization) and online (internormalization).

2. Intranormalization: the idea here is the normalization of the r_ijk so as to place equal emphasis on each component within a representation vector r_ij. For each component k, find the mean μ_k and standard deviation σ_k over all M images in the database. Then replace that component by its normalized score in the usual fashion from statistics:

r_ijk → (r_ijk − μ_k) / σ_k

3. Internormalization: here we look for equal emphasis for each similarity value S(r_ij) within the overall measure S. We find the mean μ_ij and standard deviation σ_ij over all database image similarity measures S(r_ij).

4. Then, online, for any new query Q we replace the raw similarity between Q and a database image m by

S_mQ(r_ij) → (S_mQ(r_ij) − μ_ij) / (3 σ_ij)

Finally, the weight-update process is as follows:

1. Scores of (3, 1, 0, −1, −3) are assigned to the user opinions "highly relevant" through "highly nonrelevant".

2. Weights are updated as W_ij → W_ij + Score for images viewed by the user. Then weights are normalized by

W_ij → W_ij / Σ_j W_ij

3. The inverse of the standard deviation of the component r_ijk is assigned to weight W_ijk:

W_ijk = 1 / σ_ijk

That is, the smaller the variance, the larger the weight.

4. Finally, these weights are also normalized:

W_ijk → W_ijk / Σ_k W_ijk

18.4.2 iFind

An experimental system that explicitly uses relevance feedback in image retrieval is the Microsoft Research system iFind [31]. This approach attempts to get away from just low-level image features by addressing the semantic content in images. Images are associated with keywords, and a semantic net is built for image access based on these, integrated with low-level features. Keywords have links to images in the database, with weights assigned to each link. The degree of relevance, the weight, is updated on each relevance feedback round. Clearly, an image can be associated with multiple keywords, each with a different degree of relevance. Where do the keywords come from? They can be generated manually or retrieved from the ALT HTML tag associated with an image, using a web crawler.

The basic advantage of putting the user into the loop by using relevance feedback is that this way, the user need not provide a completely accurate initial query. Relevance feedback establishes a more accurate link between low-level features and high-level concepts, somewhat closing the semantic gap. Of course, retrieval performance of CBIR systems is bettered this way.
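The weight-initialization and update rules of Section 18.4.1 can be sketched as follows; the two features, their representation counts, and the feedback lists are invented for illustration:

```python
I = 2                          # number of features (say, color and texture)
J = {0: 2, 1: 1}               # representations per feature, J_i

# 1. Initialize weights uniformly: W_i = 1/I and W_ij = 1/J_i.
W_i = {i: 1.0 / I for i in range(I)}
W_ij = {(i, j): 1.0 / J[i] for i in range(I) for j in range(J[i])}

# Scores attached to the five possible user opinions.
SCORES = {"highly relevant": 3, "relevant": 1, "no opinion": 0,
          "nonrelevant": -1, "highly nonrelevant": -3}

def update_representation_weights(W_ij, i, feedback):
    """Add the user's scores to each representation weight of feature i,
    then renormalize so that feature i's weights sum to 1."""
    for j in range(J[i]):
        W_ij[(i, j)] += sum(SCORES[opinion] for opinion in feedback[j])
    norm = sum(W_ij[(i, j)] for j in range(J[i]))
    for j in range(J[i]):
        W_ij[(i, j)] /= norm
    return W_ij

# One feedback round for feature 0: opinions gathered per representation.
W_ij = update_representation_weights(
    W_ij, 0, {0: ["highly relevant"], 1: ["relevant"]})
```

After this round, the first representation of feature 0 carries most of the weight (0.7 versus 0.3), mirroring the book's W_ij → W_ij + Score followed by normalization.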
18.5 QUANTIFYING RESULTS

Generally speaking, some simple expression of the performance of image search engines is desirable. In information retrieval theory, precision is the percentage of relevant documents retrieved compared to the number of all the documents retrieved, and recall is the percentage of relevant documents retrieved out of all relevant documents. Recall and precision are widely used for reporting retrieval performance for image retrieval systems as well. However, these measures are affected by the database size and the amount of similar information in the database. Also, they do not consider fuzzy matching or search-result ordering. In equation form, these quantities are defined as

Precision = (Desired images returned) / (All retrieved images)
Recall = (Desired images returned) / (All desired images)          (18.13)
In general, the more we relax thresholds and allow more images to be returned, the smaller the precision but the larger the recall. The resulting curve of precision versus recall plays a role analogous to a receiver operating characteristic (ROC) curve, which plots the relationship between sensitivity and specificity over a range of parameters.
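Equation (18.13) is straightforward to compute once the retrieved and relevant sets are known; the image IDs below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall as in Equation (18.13)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # desired images returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four images returned, six relevant images in the database, two overlap.
p, r = precision_recall(retrieved=[1, 2, 3, 4],
                        relevant=[2, 4, 5, 6, 7, 8])
```

Relaxing the threshold so that more images are retrieved can only keep or raise `hits` (raising recall), while the denominator of precision grows, which is exactly the trade-off described above.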
18.6 QUERYING ON VIDEOS

Video indexing can make use of motion as the salient feature of temporally changing images for various types of queries. We shall not examine video indexing in any detail here but refer the reader to the excellent survey in [32]. In brief, since temporality is the main difference between a video and just a collection of images, dealing with the time component is first and foremost in comprehending the indexing, browsing, search, and retrieval of video content.

A direction taken by the QBIC group [21] is a new focus on storyboard generation for automatic understanding of video — the so-called "inverse Hollywood" problem. In the production of a video, the writer and director start with a visual depiction of how the story proceeds. In a video understanding situation, we would ideally wish to regenerate this storyboard as the starting place for comprehending the video.

The first place to start, then, would be dividing the video into shots, where each shot consists roughly of the video frames between the on and off clicks of the Record button. However, transitions are often placed between shots — fade-in, fade-out, dissolve, wipe, and so on — so detection of shot boundaries may not be as simple as for abrupt changes.

Generally, since we are dealing with digital video, if at all possible we would like to avoid uncompressing MPEG files, say, to speed throughput. Therefore, researchers try to work on the compressed video. A simple approach to this idea is to uncompress just enough to recover the DC term, generating a thumbnail 64 times smaller than the original. Since we must consider P- and B-frames as well as I-frames, even generating a good approximation of the best DC image is itself a complicated problem. Once DC frames are obtained from the whole video — or, even better, are obtained on the fly — many approaches have been used for finding shot boundaries.
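One common family of shot-boundary detectors simply thresholds the color-histogram difference between consecutive (DC-thumbnail) frames. A minimal sketch, with toy two-bin histograms and an illustrative threshold:

```python
def hist_diff(h1, h2):
    """L1 distance between normalized frame color histograms."""
    n1, n2 = sum(h1), sum(h2)
    return sum(abs(a / n1 - b / n2) for a, b in zip(h1, h2))

def shot_boundaries(frame_hists, threshold=0.5):
    """Frame indices where the histogram changes sharply (cut detection).
    Gradual transitions such as dissolves and wipes need more elaborate
    tests, e.g., comparing cumulative change over a window of frames."""
    return [i for i in range(1, len(frame_hists))
            if hist_diff(frame_hists[i - 1], frame_hists[i]) > threshold]

# Toy sequence: two shots, with a cut between frames 1 and 2.
hists = [[8, 2], [8, 2], [1, 9], [1, 9]]
cuts = shot_boundaries(hists)
```

Working on DC thumbnails keeps this test cheap, since the histograms are built from images 64 times smaller than the decoded frames.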
Features used have typically been color, texture, and motion vectors, although such concepts as trajectories traversed by objects have also been used [33]. Shots are grouped into scenes. A scene is a collection of shots that belong together and are contiguous in time. Even higher-level semantics exist in so-called "film grammar" [34]. Semantic information such as the basic elements of the story may be obtainable. These are (at the coarsest level) the story's exposition, crisis, climax, and denouement.

Audio information is important for scene grouping. In a typical scene, the audio has no break within a scene, even though many shots may take place over the course of the scene. General timing information from movie creation may also be brought to bear. Text may indeed be the most useful means of delineating shots and scenes, making use of closed-captioning information already available. However, relying on text is unreliable, since it may not exist, especially for legacy video.
Different schemes have been proposed for organizing and displaying storyboards reasonably succinctly. The most straightforward method is to display a two-dimensional array of keyframes. Just what constitutes a good keyframe has, of course, been subject to much debate. One approach might be to simply output one frame every few seconds. However, action has a tendency to occur between longer periods of inactive story. Therefore, some kind of clustering method is usually used, to represent a longer period of time that is more or less the same within the temporal period belonging to a single keyframe.

Some researchers have suggested using a graph-based method. Suppose we have a video of two talking heads, the interviewer and the interviewee. A sensible representation might be a digraph with directed arcs taking us from one person to the other, then back again. In this way, we can encapsulate much information about the video's structure and also have available the arsenal of tools developed for graph pruning and management.

Other "proxies" have also been developed for representing shots and scenes. A grouping of sets of keyframes may be more representative than just a sequence of keyframes, as may keyframes of variable sizes. Annotation by text or voice, of each set of keyframes in a "skimmed" video, may be required for sensible understanding of the underlying video. A mosaic of several frames may be useful, wherein frames are combined into larger ones by matching features over a set of frames. This results in a set of larger keyframes that are perhaps more representative of the video. An even more radical approach to video representation involves selecting (or creating) a single frame that best represents the entire movie. This could be based on making sure that people are in the frame, that there is action, and so on.
In [35], Dufaux proposes an algorithm that selects shots and keyframes based on measures of motion activity (via frame difference), spatial activity (via entropy of the pixel-value distribution), skin-color pixels, and face detection. By taking into account skin color and faces, the algorithm increases the likelihood of the selected keyframe including people and portraits, such as close-ups of movie actors, thereby producing interesting keyframes. Skin color is learned using labeled image samples. Face detection is performed using a neural net.

Figure 18.19(a) shows a selection of frames from a video of beach activity (see [36]). Here, the keyframes in Figure 18.19(b) are selected based mainly on color information (but being careful with respect to the changes incurred by changing illumination conditions when videos are shot). A more difficult problem arises when changes between shots are gradual and when colors are rather similar overall, as in Figure 18.20(a). The keyframes in Figure 18.20(b) are sufficient to show the development of the whole video sequence.

Other approaches attempt to deal with more profoundly human aspects of video, as opposed to lower-level visual or audio features. Much effort has gone into applying data mining or knowledge-based techniques to classifying videos into such categories as sports, news, and so on, and then subcategories such as football and basketball. Zhou and Kuo [37] give a good summary of attempts to provide intelligent systems for video analysis.
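A drastically simplified sketch in the spirit of Dufaux's selection criteria, scoring frames by motion activity plus entropy of the pixel-value distribution (the skin-color and face-detection terms are omitted, and the tiny grayscale frames are invented for illustration):

```python
import math

def entropy(pixels):
    """Shannon entropy of the pixel-value distribution (spatial activity)."""
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def motion_activity(prev, cur):
    """Mean absolute difference between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)

def pick_keyframe(frames):
    """Index of the frame maximizing motion activity plus entropy."""
    best, best_score = 1, float("-inf")
    for i in range(1, len(frames)):
        score = motion_activity(frames[i - 1], frames[i]) + entropy(frames[i])
        if score > best_score:
            best, best_score = i, score
    return best

# Toy 4-pixel grayscale frames: the last frame is busy and high-contrast.
frames = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 255, 128, 64]]
key = pick_keyframe(frames)
```

The flat frames score zero on both terms, so the busy final frame is selected; in the full algorithm, skin-color and face terms would further bias the choice toward frames containing people.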
FIGURE 18.19: Digital video and associated keyframes, beach video: (a) frames from a digital video; (b) keyframes selected.

FIGURE 18.20: Garden video: (a) frames from a digital video; (b) keyframes selected.
18.7 QUERYING ON OTHER FORMATS

Work on using audio, or combining audio with video, to better comprehend multimedia content is fascinating. Wang et al. [38] is a good introduction to using both audio and video cues. He et al. [39] offer an interesting effort to understand and navigate slides from lectures, based on the time spent on each slide and the speaker's intonation. Other interesting approaches include search-by-audio [40] and "query-by-humming" [41]. Other features researchers have looked at for indexing include actions, concepts and feelings, facial expressions, and so on. Clearly, this field is a developing and growing one, particularly because of the advent of the MPEG-7 standard (see Chapter 12).
18.8 OUTLOOK FOR CONTENT-BASED RETRIEVAL

A recent overview [42] collecting the very latest ideas in content-based retrieval identified the following present and future trends: indexing, search, query, and retrieval of multimedia data based on

1. Video retrieval using video features: image color and object shape, video segmentation, video keyframes, scene analysis, structure of objects, motion vectors, optical flow (from computer vision), multispectral data, and so-called "signatures" that summarize the data
2. Spatiotemporal queries, such as trajectories
3. Semantic features; syntactic descriptors
4. Relevance feedback, a well-known technique from information retrieval
5. Sound, especially spoken documents, such as using speaker information
6. Multimedia database techniques, such as using relational databases of images
7. Fusion of textual, visual, and speech cues
8. Automatic and instant video manipulation; user-enabled editing of multimedia databases
9. Multimedia security, hiding, and authentication techniques such as watermarking
This field is truly rich and meshes well with the outline direction of MPEG-7. In another direction, researchers try to create a search profile to encompass most instances available, say all "animals". Then, for relational database searches, such search profiles are communicated via database queries. For searches using visual features, intelligent search engines learn a user's query concepts through active learning [43]. This type of endeavor is called "query-based learning". Another approach focuses on comprehending how people view images as similar, on the basis of perception [44]. The function used in this approach is a type of "perceptual similarity measure" and is learned by finding the best set of features (color, texture, etc.) to capture "similarity" as defined via the groups of similar images identified.
18.9 FURTHER EXPLORATION

Good books [45, 46, 47, 48] are beginning to appear on the issues involved in CBIR.
8. Suppose a color histogram is defined coarsely, with bins quantized to 8 bits, with 3 bits for each of red and green and 2 for blue. Set up an appropriate structure for such a histogram, and fill it from some image you read. Template Visual C++ code for reading an image is on the text web site, as sampleCcode.zip under "Sample Code".
9. Try creating a texture histogram as described in Section 18.2.5. You could try a small image and follow the steps given there, using MATLAB, say, for ease of visualization.
10. Describe how you may find an image containing some 2D "brick pattern" in an image database, assuming the color of the "brick" is yellow and the color of the "gaps" is blue. (Make sure you discuss the limitations of your method and the possible improvements.)
Links to many useful content-based retrieval sites are collected in the Further Exploration section of the text web site for this chapter:

• A Java applet version of the C-BIRD system described in Section 18.2
• A demo of QBIC as an artwork server
• Demo versions of the Alexandria Digital Library, the Berkeley Digital Library Project, Photobook, Visual RetrievalWare, VIPER, and VisualSEEk
• A demo of MediaSite, now rebranded Sonic Foundry Media Systems. The Informedia project provided the search-engine power for this commercially available system.
• A demo of the NETRA system. The idea is to select an image, then a particular segment within the image, and search on that model.
(a) Use color only.
(b) Use edge-based texture measures only.
(c) Use color, texture, and shape.
• A video describing the technology for the Virage system. Virage provides the search engine for AltaVista's Image Search.
11. The main difference between a static image and video is the availability of motion in the latter. One important part of CBR from video is motion estimation (e.g., the direction and speed of any movement). Describe how you could estimate the movement of an object in a video clip, say a car, if MPEG (instead of uncompressed) video is used.
12. Color is three-dimensional, as Newton pointed out. In general, we have made use of several different color spaces, all of which have some kind of brightness axis, plus two intrinsic-color axes. Let's use a chromaticity two-dimensional space, as defined in Equation (4.7). We'll use just the first two dimensions, {x, y}. Devise a 2D color histogram for a few images, and find their histogram intersections. Compare image similarity measures with those derived using a 3D color histogram, comparing over several different color resolutions. Is it worth keeping all three dimensions, generally?
13. Implement an image search engine using low-level image features such as color histogram, color moments, and texture. Construct an image database that contains at least 500 images from at least 10 different categories. Perform retrieval tasks using a single low-level feature as well as a combination of features. Which feature combination gives the best retrieval results, in terms of both precision and recall, for each category of images?
• The keyframe production method for Figures 18.19 and 18.20
• Links to standard sets of digital images and videos, for testing retrieval and video segmentation programs
18.10 EXERCISES

1. What is the essence of feature localization? What are the pros and cons of this approach, as opposed to the traditional image segmentation method?
2. Show that the update equation (Equation 18.9) is correct — that is, the eccentricity for parent locale j at iteration k + 1 can be derived using the eccentricity, centroid, and mass information for the parent locale j and child locale i at iteration k. (Note: c_j^x and c_j^y are the x and y components of the centroid c_j, respectively.)
3. Try the VIPER search engine, refining the search with relevance feedback for a few iterations. The demo mentions Gabor histograms and Gabor blocks. Read enough of the files associated with the site to determine the meaning of these terms, and write a short explanation of their use.
4. Try a few of the more experimental image search engines in the Further Exploration section above. Some are quite impressive, but most are fairly undependable when used on broad data domains.
5. Devise a text-annotation taxonomy (categorization) for image descriptions, starting your classification using the set of Yahoo! categories, say.
6. Examine several web site image captions. How useful would you say the textual data is as a cue for identifying image contents? (Typically, search systems use word stemming, for eliminating tense, case, and number from words — the word stemming becomes the word stem.)
7. Suggest at least three ways in which audio analysis can assist in video retrieval-related tasks.
35 E Dufaux, “Key Frame Seleclion lo Represent a Video,” in Inteniatianal Confemence ou hnage Processing, 2000,2:275—278. 36 M.S. Drew and 3. Au, “Video Keyframe Production by EfficienL Cluscering of Compressed Chromaticity Signalures’ In ACM Mulrinsedia 2000, 2000, 365 368. 37 W. Zhou and C.C.J. Kuo, lntelligenn Systems for Video Undersranding, Upper Saddle River, NJ: Prenlice-Hali PTR, 2002. 38 Y. Wang, Z. Liu, and J.C. Huang, “Multimedia Conlent Analysis Using Both Audio and Visual Clues,” lESE Signal Pmocessing Magazine, 17:12—36, 2000. 39 L. He, E. Sanocicj, A. Gupla, and 3. Grudin, “Auto-Summarizalion of Audio-Video Presenia Lions,” in ACM Multimedia, 1999, 1:489-498. 40 E. Woid, T. Bium, D. Keislar, and 1. Wheaion, “Conlenl-Based Classification, Search, and ReLrieval of Audio’ lESE Multimedia, 3:27—36, 1996. 41 N. Kosugi, Y. Nishihara, T. Sakala, M. Yamamuro, and 1