
Design and Implementation of a Mobile Videophone

 Ben Appleton

Department of Information Technology and Electrical Engineering, University of Queensland

Submitted for the degree of  Bachelor of Engineering (Honours) in the division of Signal and Image Processing

October 2001

2/123 Macquarie St St Lucia, Q4067

The Dean School of Electrical Engineering University of Queensland St Lucia, Q4072

Dear Professor Simmons,

In accordance with the requirements of the degree of Bachelor of Electrical Engineering (Honours) in the division of Electrical Engineering, I present the following thesis entitled “Design and Implementation of a Mobile Videophone”. This work was performed under the supervision of Dr Vaughan Clarkson.

I declare that the work submitted in this thesis is my own, except as acknowledged in the text.

To my wife, Jenna

Acknowledgements

This thesis could not have been completed without the support and assistance of a number of important people.

First and foremost my thanks go to Dr Vaughan Clarkson, my supervisor, for always having an open door. His advice and assistance over the last year have been invaluable.

My appreciation is also extended to Dr Brian Lovell, for his valuable teaching in the field of DSP and his boundless enthusiasm for grand projects.

To my wife Jenna, I am grateful for her patience, love and support.

Thanks go to my parents, Charles and Christine, for their love and guidance over the last 22 years. They have given me a great head start on life.

Abstract

Mobile videoconferencing is an exciting field of research with a lot of potential. Applications include advanced telecommuting, remote job interviews, improved distance learning, flexible telemedicine, and the obvious advantage to interpersonal communications. However, with the majority of potential videophone manufacturers using compression technology a decade out of date, current solutions are aimed at high bandwidth communication channels which are many years from implementation.

This thesis takes a different approach. Recent advances in video compression research have produced techniques which are strong enough to allow video conferencing over the existing low bandwidth mobile channels. The aim of this project is to build a device which plugs into an existing mobile phone, extending its capabilities to 2-way audio and video communication. By combining Set Partitioning In Hierarchical Trees with a new motion field estimation technique, a real-time video codec has been developed.

Table of Contents

Design and Implementation of a Mobile Videophone
Acknowledgements
Abstract
Table of Contents
List of Figures
Chapter 1 - Introduction
  1.1 Aims
  1.2 Overview
Chapter 2 - Literature Review
  2.1 Overview
  2.2 General Videoconferencing
  2.3 Mobile Phone Networks
  2.4 Compression Research
  2.5 Relevance
Chapter 3 - Existing Theory
  3.1 Data Compression
  3.5.5 Properties of EZW and SPIHT
  3.6 Motion-Based Video Coding
    3.6.1 Underlying Model
    3.6.2 Residual Image
  3.7 Speech Compression
  3.8 Linear Predictive Coding
    3.8.1 Speech Production Model
    3.8.2 Vocal Tract Filter Analysis
    3.8.3 Voicing and Pitch Determination
    3.8.4 Voice Synthesis
    3.8.5 LPC-10e Specifics
Chapter 4 - Theoretical Contribution
  4.1 Motion Estimation
    4.1.1 Model-Based Motion Estimation
    4.1.2 Algorithm 2 - Model-Based Surface Tracking
  4.2 Variable Bit Rate Linear Predictive Coding
Chapter 5 - Design and Implementation
  5.1 Specifications
    5.1.1 Video and Audio Quality
    5.1.2 Real-Time Compression and Decompression
    5.1.3 Portability
  5.2 Software Components
  10.3 MEC.c
  10.4 VIDCODEC.c

List of Figures

Figure 1 – Original Signal
Figure 2 – Quantised Signal
Figure 3 – Super-Symbol (Pixel Pair) Distribution for a Sample Image
Figure 4 – Example: Daubechies 7-9 Mother Wavelet
Figure 5 – Sampled Signal
Figure 6 – Discrete Fourier Transform
Figure 7 – Short-Time Fourier Transform
Figure 8 – Discrete Wavelet Transform
Figure 9 – Quadrature Mirror Filter and Perfect Reconstruction
Figure 10 – Wavelet Coefficient Arrangement
Figure 11 – Full-Scale Image
Figure 12 – 1st Scale
Figure 13 – 2nd Scale
Figure 14 – Completed Wavelet Transform
Figure 15 – A Binary Tree of 1-D Wavelet Coefficients
Figure 16 – LPC Speech Production Model
Figure 17 – Previous Frame
Figure 18 – Current Frame
Figure 19 – Motion Field V_y
Figure 20 – Bit Allocation Scheme for VBR LPC

Chapter 1 - Introduction

Mobile videoconferencing is a commercial gold mine, widely recognised as the next step forward in mobile communications. The introduction of commonly available mobile videoconferencing will bring a host of applications including advanced telecommuting, remote job interviews, telemarketing, improved distance learning, telemedicine, and the obvious advantage to interpersonal communications. Many big players in the communication and computing industry are planning to launch products in this market over the next 5 years.

There is however one serious obstacle which must first be overcome. This is the difficulty of sending a high-bandwidth video signal through a low-bandwidth mobile phone system. It is one of the major problems driving two large fields of research: high-bandwidth wireless communication systems such as 3G networks, and new low-bandwidth videoconferencing standards such as H.324/M and T.120.

The aim of this thesis is to design and implement a mobile videophone operating over a GSM or GPRS channel. The final product should consist of:
1. A video camera and display
2. A speaker and a microphone
3. A processor

The central requirements for the final product are:
1. Real-time operation
2. Very low bit rate
3. Good quality video
4. Acceptable quality audio
5. Scalability
6. Portability
These are reviewed in greater detail in Chapter 5, Design and Implementation.

Chapter 2 - Literature Review

This chapter reviews the state of the art in videoconferencing, particularly mobile videoconferencing. It also assesses the current state of mobile phone networks and their predicted improvements over the next decade. A brief review of current compression technology for audio and video follows.

2.1 Overview 

Of all of the publications reviewed, “Video Coding for Mobile Handheld Conferencing” [Faichney & Gonzalez 1999] is the most generally relevant. This paper considers the potential implementations of handheld video conferencing systems. In particular it examines the major competing video compression schemes currently available, including SPIHT [Said & Pearlman 1996]. It also assesses two

2. Dedicated videoconferencing hardware with an ISDN line, or
3. A PDA and a mobile phone with extra audio-visual hardware (see also [Faichney & Gonzalez 1999])
These achieve varying degrees of success, but are generally realising some of the promise of videoconferencing.

At present very few mobile videoconferencing solutions are available. Many companies are advertising products under development aimed at 3G networks which will supply in excess of 1Mbps. For example, Toshiba and Winnov are targeting their videophones at the $1000 mark [MobileInfo 2001], not including the laptop computer on which they run. These are naturally expected to achieve very good quality audio and video.

2.3 Mobile Phone Networks

The paper which first inspired this thesis is a master’s thesis [van der Walle 1995] published in 1995 from the University of Waterloo. This thesis analysed the mathematics of fractal image compression and wavelet image compression, two competing views of image compression prevalent at the time, to bridge the gap between them.

A relatively recent image compression scheme based on wavelets is Set Partitioning in Hierarchical Trees [Said & Pearlman 1996]. SPIHT is based on the properties of  a wavelet-transformed image, and has demonstrated remarkably good results at low  bit rates.

A couple of video compression standards have recently been produced targeting mobile videoconferencing. One evolving standard, H.324/M (“M” for mobile) is an extension of the existing H.324 standard for videoconferencing over Public Switched Telephone Networks. It is specifically designed for mobile terminals, allowing for Bit Error Rates of up to 0.05 and the very low bandwidth normally available on mobile channels.

Our project brings together much of the above. We are aiming at an order of magnitude lower bandwidth than modern video conferencing systems, so we have developed our own video compression technique and our own improved LPC speech codec. We also aim to support the current standard mobile channel (GSM Phase 1) while providing a scalable codec to allow for the advances in mobile services expected over the next decade.

Chapter 3 - Existing Theory

3.1 Data Compression

3.1.1 Speech
Speech coding is used for low quality telephony and higher quality teleconferencing. Telephone speech is sampled at 8kHz with 8-bit precision, producing a PCM code at 64kb/s. Model based analysis allows intelligible speech at rates as low as 2.4kb/s.

3.1.2 Audio
Audio signals are a more general class of signals. On a CD, audio is sampled at 44.1kHz with 16-bit precision, producing a PCM code at 706kb/s. Model based

compensation. Each frame is predicted from previous frames by compensating for the motion encoded by the transmitter. An error image, called a residual, is compressed with an image compression scheme.

For teleconferencing, Quarter Common Intermediate Format (QCIF) colour video has only 176 by 144 pixels [Cherriman 1996]. A maximum of 30 frames per second are transmitted, although 10 or 15 is usual. If the motion in the video is relatively simple then moderate quality video can be obtained at 128kb/s.
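As a rough check (assuming 4:2:0 chrominance subsampling, i.e. an average of 12 bits per pixel, which is an assumption made here rather than a figure from the text), raw QCIF video at 10 frames per second requires

$$176 \times 144 \times 12 \times 10 \approx 3.0\ \text{Mb/s}$$

so a 128kb/s stream corresponds to a compression ratio of roughly 24:1.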

3.2 Information Theory

In 1948 Claude Shannon, a researcher for Bell Labs, published a number of remarkable theorems [Shannon 1948] which have formed the foundation for the fields of data compression and error coding. He established that the fundamental measure of the information contained in any message, including text, audio and video, is its entropy.

Consider two variables $x$ and $y$, with probability distribution functions $f_X(x)$ and $f_Y(y)$. $x$ and $y$ are called independent if and only if $f_X(x \mid y) = f_X(x)$. In other words, the information obtained from $y$ in no way contributes to the information known about $x$; it does not alter the distribution of $x$.

We define the information obtained from the reception of a symbol $s_k$ from a memoryless source as follows:

$$I(s_k) = \log_2\!\left(\frac{1}{p_k}\right) = -\log_2(p_k)$$

where $p_k = f_S(s_k)$ is the probability of receiving symbol $s_k$.

This definition implies that the less probable a given symbol is, the more information we gain from its reception. If we take the expected value of the information contained in a random symbol produced by the source we obtain the source entropy:

$$H(S) = \mathrm{E}(I(s_k)) = -\sum_k p_k \log_2(p_k)$$

For two independent sources $X$ and $Y$ the entropy is additive:

$$H(XY) = \mathrm{E}(I(XY)) = \mathrm{E}\!\left(-\log_2(p_x p_y)\right) = \mathrm{E}\!\left(-\log_2(p_x) - \log_2(p_y)\right) = \mathrm{E}(I(X) + I(Y))$$

Therefore $H(XY) = H(X) + H(Y)$ for independent $X$ and $Y$.
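As an illustration of these definitions, the following is a minimal sketch in C that estimates the entropy of a memoryless source from a symbol histogram. The function and variable names are illustrative only; they are not taken from the thesis code.

#include <math.h>
#include <stdio.h>

/* Estimate the entropy (bits/symbol) of a memoryless source from symbol counts. */
double entropy(const unsigned long *counts, int numSymbols)
{
    unsigned long total = 0;
    double h = 0.0;
    int k;

    for (k = 0; k < numSymbols; k++)
        total += counts[k];

    for (k = 0; k < numSymbols; k++) {
        if (counts[k] > 0) {
            double p = (double)counts[k] / (double)total;
            h -= p * log2(p);    /* I(s_k) = -log2(p_k), weighted by p_k */
        }
    }
    return h;
}

int main(void)
{
    /* A 4-symbol source with probabilities 1/2, 1/4, 1/8, 1/8 gives H = 1.75 bits. */
    unsigned long counts[4] = { 4, 2, 1, 1 };
    printf("H = %.2f bits/symbol\n", entropy(counts, 4));
    return 0;
}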

The difference between a message’s length and its entropy is known as its redundancy. Redundancy is additional information which reinforces existing knowledge rather than adding new information. Most data is very redundant; see [Shannon 1948] for an example of the redundancy of English text.

Redundancy allows for a reconstruction of the information from an incomplete fraction of the message. Artificially introduced redundancy is used extensively in coding theory. Conversely the removal of redundancy results in a shorter message, which is the goal of compaction and compression.

The concept of entropy allows us to draw a parallel between data compaction and

So far we have dealt only with data compaction, which is the lossless compression of a redundant message. However by discarding insignificant components of the data much higher compression gains may be made. Data compression is the technique of approximating the data within an acceptable level of distortion. The choice of distortion measure depends on both the type of data and the intended recipient.

For image and video compression a widely accepted distortion measure is the Peak Signal to Noise Ratio (PSNR). It is defined as:

$$PSNR = 10\log_{10}\!\left(\frac{255^2}{MSE}\right)\ \mathrm{(dB)}$$

where MSE is the Mean Squared Error between the original image and the reconstructed image in 8-bit format. PSNR for video is measured frame-by-frame.

3.3 Image and Video Compression

Table 1 – Scalar Quantisation Example

In Vector Quantisation (VQ) we group data symbols into super-symbols before quantisation. Quantisation takes place in a multi-dimensional symbol space where the interdependency of neighbouring data elements can be exploited to reduce distortion.

Figure 3 – Super-Symbol (Pixel Pair) Distribution for a Sample Image

to produce a set of coefficients that are often close to independent, concentrating the information content of a signal into the first few coefficients.

3.3.2.1 Derivation

Let $X \in \Re^N$ be a (column) data vector of zero mean, drawn from a known distribution D. Let $\Gamma$ be the autocovariance matrix of $X$:

$$\Gamma = \mathrm{E}\!\left(X \cdot X^T\right)$$

$\Gamma$ quantifies the correlations between data elements in $X$.

As $\Gamma$ is a symmetric matrix it will have orthogonal eigenvectors. Let $T$ be the orthogonal matrix of normalised eigenvectors of $\Gamma$. We now apply a change of basis to the data vector $X$: $X' = TX$.

Consider the resulting autocovariance matrix $\Gamma'$ of $X'$:

$$\Gamma' = \mathrm{E}\!\left(X' \cdot X'^T\right) = T\,\mathrm{E}\!\left(X X^T\right)T^T = T\,\Gamma\,T^T$$

which is diagonal, with the eigenvalues of $\Gamma$ on its diagonal; the transformed components are therefore uncorrelated.
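A minimal sketch in C of how $\Gamma$ might be estimated from a set of zero-mean sample vectors (an illustration only; the thesis does not compute the KLT explicitly, and the function name is assumed):

/* Estimate the n x n autocovariance matrix gamma = E(X X^T) from m zero-mean
   sample vectors of length n, stored row-by-row in samples. */
void autocovariance(const float *samples, int m, int n, float *gamma)
{
    int i, j, k;

    for (i = 0; i < n * n; i++)
        gamma[i] = 0.0f;

    /* Accumulate the outer product of each sample vector with itself. */
    for (k = 0; k < m; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                gamma[i*n + j] += samples[k*n + i] * samples[k*n + j];

    for (i = 0; i < n * n; i++)
        gamma[i] /= (float)m;
}

The eigenvectors of the estimated matrix then form the KLT basis T described above.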

One defining property of images is their   scale invariance. For example, if we take a picture of a scene from a large distance and then take a second picture from a near  distance the two images will have the same statistical properties.

Scale invariance in its direct form motivates fractal image compression [van der  Walle 1995], while the equivalent statement in the Fourier domain motivates wavelet image compression. We now give a simple derivation of the statistical distribution of the Fourier coefficients of a scale invariant signal.

Scale invariance in 1-D may be expressed simply as

$$i(x) \sim i(sx)$$

where $i(x)$ is the intensity as a function of position and $s$ is the scaling factor. $A \sim B$ means that the random functions $A$ and $B$ are drawn from the same distribution.

By the Fourier scaling theorem, if we let $I(f) = \mathcal{F}\{i(x)\}$ then we obtain

$$I(f) \sim \frac{1}{|s|}\,I\!\left(\frac{f}{s}\right)$$

image database, or by determining a perceptual colour space based on human psychovisual experiments. The most commonly used colour space is a Luminance-Chrominance space known as YUV. Typically converting from RGB to YUV allows a compression gain of 2 [Bourke 2000].

3.4 Wavelets

3.4.1 Introduction to Wavelets
The classic tool of signal analysis has been the Fourier Transform. It represents a signal as a set of complex sinusoidal components of infinite duration. As a complex sinusoid is an eigenfunction of a linear system the Fourier Transform has found applications in systems for signal processing, control and communication, and indeed for modelling any linear time-invariant system.

The Fourier Transform provides an excellent model for stationary (time-invariant) signals and systems, and when using it for signal analysis we implicitly assume that the signal is stationary. However speech, images and video clearly are not stationary processes. They are spatially and temporally variant - within a single signal occur both periods of silence and of noise, transients and oscillations.

Wavelet analysis offers a view of signal processing in which time and frequency are represented simultaneously and which naturally represents objects at all scales.

3.4.2 Wavelet Transform Mathematics
The Wavelet Transform represents a signal as composed of a set of dilations and translations of a single waveform, the wavelet. A wavelet is a single function of zero mean:

$$\int_{-\infty}^{\infty} \psi(t)\,dt = 0$$

that is dilated with a scale parameter $s$, and translated by $u$ to produce a wavelet atom:

$$\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-u}{s}\right)$$

The wavelet coefficient of a function $f(t)$ at the scale $s$ and position $u$ is computed as the correlation of $f$ with the corresponding wavelet atom:

$$Wf(u,s) = \int_{-\infty}^{\infty} f(t)\,\psi_{u,s}(t)\,dt$$

3.4.3 Time-Frequency Tilings
One way to view the wavelet transform and its relatives is by observing their time-frequency tilings. The time-frequency tiling is used to describe how a transform is able to resolve time and frequency components of a signal. Each rectangle describes the area of influence of a single transform coefficient. In an orthogonal transform each rectangle must have the same area.

Figure 5 - Sampled Signal

Figure 6 - Discrete Fourier Transform

Figure 7 shows the Short-Time Fourier Transform. In this transform the data has  been broken up into temporal blocks, allowing a coefficient to represent a transient frequency component. Although this simplified diagram is unable to show it, the rectangles do not overlap so discontinuities exist between temporal blocks.

Figure 8 shows the Discrete Wavelet Transform. In this transform the high frequency components influence a short time period, while the low frequency components influence a correspondingly long time period. The rectangles in this diagram overlap substantially, such that altering any coefficient will smoothly alter the surrounding neighbourhood. This prevents a compression scheme based on the wavelet transform from introducing ‘blocky’ artefacts. The wavelet transform acts as a spatially adaptive filter, generating a large number of significant coefficients in the vicinity of signal discontinuities and relatively low amplitudes in smooth regions.

The Biorthogonal Wavelet Transform (BWT) relaxes strict orthogonality but still allows coefficient extraction by correlation, against the dual analysis wavelet. Note that the term BWT is used specifically for biorthogonal wavelets, while the term DWT is used to denote the wavelet transform in general. Due to the octave-band split in the frequency domain the DWT has an efficient implementation as a pyramid of filters. Quadrature Mirror Filters developed from sub-band coding for audio compression are used to ensure perfect reconstruction. Consequently the computation time is near-linear in the number of coefficients.

The pyramidal algorithm initially splits the signal into a high frequency and low frequency stream. The procedure is then repeated, recursively splitting the low frequency component into successively lower octaves.
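To make the filter-bank structure concrete, here is a minimal sketch of a single analysis level using the Haar filters (chosen purely for brevity; the thesis itself uses Daubechies 7-9 biorthogonal filters, and the real implementation is in Appendix 10.1):

#include <math.h>

/* One level of the pyramidal algorithm with Haar filters: the first half of out
   receives the low-pass (father) coefficients, the second half the high-pass
   (mother) coefficients. length must be even. */
void haar_analysis_level(const float *in, float *out, int length)
{
    int half = length / 2;
    int i;
    const float s = (float)(1.0 / sqrt(2.0));

    for (i = 0; i < half; i++) {
        out[i]        = s * (in[2*i] + in[2*i + 1]);   /* low-pass, downsampled by 2  */
        out[half + i] = s * (in[2*i] - in[2*i + 1]);   /* high-pass, downsampled by 2 */
    }
}

Repeating this step on the low-pass half of the output yields successively lower octaves.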

The wavelet transform may be simply extended to any N-dimensional signal. As the wavelet transform is linear this may be achieved by transforming the signal separably along each axis.

By convention, we store the coefficients in the following arrangement for a 2dimensional image transform:

Figure 10 - Wavelet Coefficient Arrangement

Figure 11 – Full-Scale Image

Figure 12 – 1st Scale

Figure 13 – 2nd Scale

Figure 14 – Completed Wavelet Transform

Table 3 – Successive Scales of a 2-D Wavelet Transform

Figure 15 - A Binary Tree of 1-D Wavelet Coefficients

The concept of a binary tree of wavelet coefficients extends simply to a 2^N-ary tree in N dimensions.

The DWT and its relatives may be viewed as an approximation to the KLT. Taking into account the statistical properties of images we expect to observe both scale and translation invariance in the autocovariance matrix $\Gamma$. The DWT explicitly assumes such a structure, thus requiring only a small set of wavelet coefficients as shared knowledge rather than the entire $\Gamma$ matrix. This approximation produces good results for typical images and greatly reduces the computational load of the transform. See [van der Walle 1995] for a more formal discussion of the relationship between the DWT and the KLT.

3.5 Embedded Zerotree Wavelets and Set Partitioning in Hierarchical Trees

EZW and SPIHT are two efficient schemes for representing the wavelet coefficients of an image. They are among the best available image compression schemes currently under research. Since its invention in the mid-1990’s SPIHT has become a de facto baseline against which all other image compression schemes are measured. We now discuss the underlying principles and operation of EZW and SPIHT, before summarising their advantages for real-time video compression.

As we have observed, the wavelet transform is an approximation of the Karhunen-Loeve transform for images. It removes the correlations between pixels in the image, producing a set of coefficients heavily biased towards 0. This allows a lossy coding scheme to discard the majority of the coefficients to obtain an efficient approximation of an image, sending only the significant coefficients for

that at any stage the minimum Sum of Square Error (SSE) approximation of the coefficients has been transmitted. The SOT allows the efficient representation of  the positions of the significant coefficients.

EZW and SPIHT differ only in their choice of SOT, so we begin by describing the first two concepts.

3.5.1 Successive Approximation
Successive approximation is the process of using a fixed-point binary representation of each coefficient, sending the nth bit of each coefficient in the image, before decrementing n. It is also known as bit-plane coding as it treats the image as a number of planes (one for each bit number) and compresses the corresponding binary array. At any stage the coefficient’s value is specified to within a certain range, the width of which is halved with each successive approximation. The receiver tentatively assumes that the coefficient is in the centre of the current range.
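A minimal sketch of this bit-plane view in C, assuming integer-quantised coefficient magnitudes (the function names here are illustrative, not taken from the thesis code):

/* Return 1 if the coefficient magnitude is significant at bit-plane n,
   i.e. |x| >= 2^n, the threshold used in successive approximation. */
int significant(unsigned int magnitude, int n)
{
    return (magnitude >> n) != 0;    /* equivalently magnitude >= (1u << n) */
}

/* Tentative reconstruction used by the receiver: once a coefficient is known to
   lie in [2^n, 2^(n+1)), assume it sits at the centre of that range. */
float tentative_value(int negative, int n)
{
    float centre = 1.5f * (float)(1u << n);
    return negative ? -centre : centre;
}

Each later refinement bit then halves the uncertainty range, as described above.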

magnitude, such that the largest coefficients are sent first. This ensures that at any stage the compressed bit stream achieves the smallest SSE possible.

3.5.3 Spatial Orientation Trees
The Spatial Orientation Tree is responsible for obtaining an efficient representation of the positions of the significant coefficients. Although the wavelet coefficients are uncorrelated, they are not all independent. The magnitudes of the coefficients at the lowest scale are somewhat equivalent to an edge map of the image, consisting of an x-edge subimage, a y-edge subimage, and an xy-edge (corner) subimage. Each successively higher scale represents edges on a larger scale, with a correspondingly lower spatial resolution. As a result the edge map at a given scale is approximately a downsampled version of the edge map at the next lower scale. This scale space of edges allows us to state the following heuristic:

If a coefficient is insignificant with respect to a threshold T (|x| < T), then all of its descendants in the spatial orientation tree are likely to be insignificant with respect to T as well.

Z (for Zero): insignificant, but with a significant descendant
T (for zero-Tree root): insignificant, with insignificant descendants

If a node is labelled as a zero-tree root then none of its children’s labels need to be transmitted. Thus the vast majority of the insignificant coefficients are labelled by a few zero tree roots in the highest levels of the tree, producing an extremely efficient representation.

3.5.3.2 SPIHT’s SOT

EZW’s SOT is inefficient, as a coefficient that is already known to be significant must be relabelled significant at each lower threshold. Set Partitioning in Hierarchical Trees avoids this by treating the insignificant coefficients as subtrees represented by their root node. As new coefficients become significant during encoding, these subtrees are repeatedly split into smaller subtrees until the significant coefficients have been extracted. The following three lists are maintained in synchronisation by both the encoder and the decoder:

Both algorithms are very similar with the exception of the SOT, so here we describe only SPIHT. For a more detailed description of the algorithms see the original   papers [Shapiro 1993] and [Said & Pearlman 1996]. Both EZW and SPIHT have symmetric encoding/decoding algorithms, which means that the encoding algorithm is the same as the decoding algorithm with each output step replaced by an input step.

3.5.4.1 Algorithm 1 – SPIHT encoding

1. Initialisation:
   1.1. Output n, the highest significant bit position.
   1.2. Set the LSP as empty.
   1.3. Add the significant coefficients’ co-ordinates to the LIP, and those with descendants to the LIS as type A entries.
2. Sorting pass:
   2.1. For each entry in the LIP:
      2.1.1. Output their significance at the current threshold
   3.1. For each entry in the LSP, except the new entries, output the n-th most significant bit of |x|.
4. Quantisation step update:
   4.1. Decrement n (halve the threshold) and go back to Step 2.
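As a small illustration of the significance test used in the sorting pass (a sketch only, with assumed names; the full encoder is in the appendix), the test applied to a single coefficient or to a set is simply whether any magnitude reaches the current threshold 2^n:

/* Significance of a single coefficient magnitude at threshold 2^n. */
int sig_coeff(unsigned int magnitude, int n)
{
    return magnitude >= (1u << n);
}

/* Significance of a set (e.g. the descendants of a tree root):
   1 if any member's magnitude reaches the threshold. */
int sig_set(const unsigned int *magnitudes, int count, int n)
{
    int i;
    for (i = 0; i < count; i++)
        if (magnitudes[i] >= (1u << n))
            return 1;
    return 0;
}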

For further detail, see the attached code [Appendices, 9.2 – SPIHT.c]. This algorithm has been extended to colour by transformation to YUV space before coding the three components independently. When interlaced, the three coded streams compete for bit allocation such that the PSNR is maximised at each point.

3.5.5 Properties of EZW and SPIHT
Both schemes produce an embedded representation of the image, such that the compressed bit stream may be truncated to any length without recompression and still achieve the best possible compression for the corresponding number of bits. This ensures completely controlled Variable Bit Rate coding, allowing an image to

3.6 Motion-based video coding 

3.6.1 Underlying Model Video is composed of a sequence of successive images, or frames, of a scene. Each frame is closely related to its predecessor, with deformations due to objects moving within the scene and new information in the form of new objects entering the scene. By suitably modelling these deformations of the scene between frames we may remove a large amount of redundancy in the video data.

This view of video data motivates the widely used video compression paradigm of motion estimation and compensation. Motion estimation is the process of determining the motion between two successive frames of a video sequence, and is performed in the transmitter. The set of motion vectors assigned to each pixel is known as a motion field. Motion compensation is the process of applying the motion field in the receiver to the previous frame to produce an estimate of the current frame.

and the predicted frame based on the previous frame and the corresponding motion field.

3.7 Speech Compression

There are currently a wide variety of speech compression standards available. At low bit rates ranging from 2.4 to 4.8kbps are the LPC model based vocoders such as LPC-10e, CELP, MELP, and the more advanced Mixed-Band Excitation vocoders. At the higher bit rates ranging upwards from 8kbps are more general audio codecs, such as G.711, G.722 and G.728 that have been developed for the H.320 videoconferencing standard, and G.729 which has been developed for mobile telecommunications [G.729 1996]. Given the very low available bandwidth the model-based vocoders are of particular interest in this thesis.

3.8 Linear Predictive Coding

3.8.1 Speech Production Model

resulting filter varies slowly relative to the pitch period, which is typically between 1 and 10 milliseconds.

Unvoiced phonemes are generated by forming a constriction at some point in the vocal tract, usually in the mouth, forcing the air through the constriction at a high enough velocity to produce turbulence. This turbulent flow produces white noise which is acoustically filtered by the mouth and nasal cavity.

The vocal tract filter can be modelled as an all-pole filter. This assumption is only true for non-nasal phonemes, as the vocal tract lacks side-branches to produce zeroes. For nasal phonemes it is considered a satisfactory approximation as zeros are less perceptible than poles.

This simple model of speech production therefore represents the speech by its amplitude, voicing decision, pitch (if voiced), and vocal tract filter. The speech is  broken up into 30 to 50 frames per second for analysis and synthesis. By extracting

speech waveform is the result of filtering the source by an all-pole filter. Therefore we can retrieve the source waveform from a frame of speech by passing it through the inverse (all-zero) filter.

For the system in Figure 16 above, the speech samples $s(n)$ are related to the excitation $u(n)$ by:

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + u(n)$$

where $p$ is the order of the filter.

This assumption of a weakly stationary signal is approximately true over short time  periods, typically fewer than 10 pitch periods.

So the all-zero filter’s output is:

$$u'(n) = s(n) - \sum_{k=1}^{p} \alpha_k\, s(n-k)$$

This procedure extracts the filter coefficients, and with little extra effort we may then extract the source waveform. The resultant coefficients are subsequently quantised for inclusion in the encoded frame packet.
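A minimal sketch of the all-zero (inverse) filter described above, producing the residual estimate u'(n) from the speech samples and the predictor coefficients (a sketch under the stated model only; this is not the thesis LPC code):

/* Inverse (all-zero) LPC filter: u'(n) = s(n) - sum_{k=1..p} a[k-1] * s(n-k).
   Samples before the start of the frame are treated as zero. */
void lpc_residual(const float *s, int numSamples, const float *a, int p, float *u)
{
    int n, k;
    for (n = 0; n < numSamples; n++) {
        float pred = 0.0f;
        for (k = 1; k <= p && k <= n; k++)
            pred += a[k - 1] * s[n - k];
        u[n] = s[n] - pred;
    }
}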

3.8.3 Voicing and Pitch Determination
While the voicing and pitch can be detected directly from the speech waveform, a more robust approach is to observe the source waveform. This technique is preferred as the filter estimation allows us to generate the source waveform with little extra computation.

The approach we discuss here measures the pitch frequency of the source waveform. If the source waveform does not have a well-defined pitch then it is designated unvoiced. The calculation of the pitch period measures the Average Magnitude Difference Function over the range of possible pitch periods τ:
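The AMDF itself is not reproduced in this extract; a common form (assumed here) sums the absolute difference between the source waveform and a delayed copy of itself. A sketch of a pitch search based on it, with illustrative names, follows:

#include <math.h>

/* Average Magnitude Difference Function pitch search over one frame of the
   source waveform u, for lags between minLag and maxLag (maxLag < frameLength).
   Returns the lag with the deepest minimum; the caller declares the frame
   unvoiced if that minimum is not pronounced enough. */
int amdf_pitch(const float *u, int frameLength, int minLag, int maxLag)
{
    int lag, n, bestLag = minLag;
    float bestScore = 1e30f;

    for (lag = minLag; lag <= maxLag; lag++) {
        float score = 0.0f;
        for (n = lag; n < frameLength; n++)
            score += fabsf(u[n] - u[n - lag]);
        score /= (float)(frameLength - lag);    /* normalise by number of terms */
        if (score < bestScore) {
            bestScore = score;
            bestLag = lag;
        }
    }
    return bestLag;
}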

3.8.4 Voice Synthesis
The speech production model of Figure 16 suggests a very simple speech synthesis scheme. For each frame generate a pulse train or white noise according to the voicing and pitch, pass it through the all-pole filter whose coefficients were extracted in the transmitter, and amplify the resulting speech waveform to the desired volume.

In practice though the realistic regeneration of continuous speech from the LPC   parameters is hindered by the inherent discontinuity between speech frames. A good synthesiser must produce a smooth transition between successive frames, taking into account the voicing of each frame and the synchronisation of pitch   pulses. In general it must also ensure that the filter coefficients and pitch period vary smoothly.

3.8.5 LPC-10e Specifics

Amplitude: 5 bits
Reflection Coefficients 1-4: 5 bits each
Reflection Coefficients 5-8: 4 bits each
Reflection Coefficient 9: 3 bits
Reflection Coefficient 10: 2 bits

The reflection coefficients are the $\{\alpha_k\}$ introduced in Section 3.8.2 above.

As the pitch does not exist for an unvoiced frame, the pitch and voicing bits are combined into a single symbol. For unvoiced frames only the first 4 reflection coefficients are encoded, as unvoiced frames typically have fewer poles [Rabiner & Schafer 1978]. The remaining 21 bits are used to Hamming encode the 28 most significant bits of the frame, which are the frame synchronisation, pitch, voicing, amplitude and the significant bits of the reflection coefficients.

The synthesis module in the decoder ensures that the pitch pulses are aligned across frames as well as smoothing the amplitude and pitch frequency between frames. It

Chapter 4 - Theoretical Contribution

4.1 Motion Estimation
Motion estimation typically dominates the computational load of a video compression algorithm. Some naïve implementations break each frame into blocks, exhaustively searching for the displacement of each block between frames that yields the best match. For example, Common Intermediate Format (CIF) video has 352x288 pixels at 15 frames per second [Cherriman 1996]. Exhaustive motion estimation with a maximum displacement of 30 pixels requires approximately 5.7x10^9 pixel comparisons per second, well beyond the capabilities of most processors. To complicate this further we often require half-pixel resolution in the motion field, more than quadrupling the processing. In order to reduce this to a reasonable load we must restrict our search space by modelling the structure of the motion field.
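This figure can be checked directly, counting one pixel comparison per pixel per candidate displacement over a ±30 pixel search window of $(2 \times 30 + 1)^2 = 3721$ positions:

$$352 \times 288 \times 15 \times 3721 \approx 5.7 \times 10^{9}\ \text{pixel comparisons per second}$$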

In determining the motion field we break the image into 4x4 pixel blocks. Each block is compared to the previous frame within a dynamic search range using the L2 or Sum of Squared Error metric. If a match with sufficiently low SSE is found then the corresponding displacement forms the value of the motion field for that block.

Let $I_n$ be the nth frame of the video sequence. The motion field $\vec{V}$ specifies the offset between each pixel of $I_n$ and its best match in $I_{n-1}$ as follows:

$$I_n(x,y) \approx I_{n-1}\!\left(x + V_x(x,y),\, y + V_y(x,y)\right)$$

Within an object in a scene we impose the following smoothness constraint on $\vec{V}$:

$$\left\|\nabla \vec{V}\right\| \le M$$

where M is the maximum motion field gradient. This hard constraint reduces the size of the search space.
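A minimal sketch of the block comparison at the heart of this search, computing the SSE of a 4x4 block against a candidate displacement in the previous frame (illustrative only; the full tracking algorithm is in Appendix 10.3):

/* Sum of Squared Error between the 4x4 block at (bx, by) in the current frame
   and the block displaced by (dx, dy) in the previous frame. Both frames are
   width-pixel-wide greyscale images; the caller guarantees the blocks are in range. */
float block_sse(const float *cur, const float *prev, int width,
                int bx, int by, int dx, int dy)
{
    float sse = 0.0f;
    int x, y;
    for (y = 0; y < 4; y++) {
        for (x = 0; x < 4; x++) {
            float d = cur[(by + y)*width + (bx + x)]
                    - prev[(by + y + dy)*width + (bx + x + dx)];
            sse += d * d;
        }
    }
    return sse;
}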

Figure 17 - Previous Frame

Figure 18 - Current Frame

Figure 19 – Motion Field $V_y$

Table 6 – Motion Field Example

The pair of sequential frames shown above (Table 6) demonstrates the key features of a motion field. The background is stationary and so produces a large flat area in the motion field estimate. The face is moving downwards and produces the bright face-shaped region. Careful comparison of the two frames shows that the right eyebrow and the mouth are moving relative to the face, as can be clearly seen in the motion field estimate.

this point by ultimate erosion of the set of unmatched points. Go back to step 2, obtaining a motion vector for each point in the new object.
6. Continue finding new objects and determining their motion fields until one of the following conditions is met:
   a. A real-time deadline occurs
   b. No remaining points may be matched
7. Fill any gaps in the motion field estimate by local averaging.

For further detail, see the attached code [Appendices, 9.3 – MEC.c].

4.1.2.1 High Frame Rate Limit

The values of the motion field are inversely proportional to the frame rate. As $\nabla$ is a linear operator we find that for high frame rates M can be reduced proportionally. As the search size is proportional to $M^2$, we see that this algorithm has per-frame complexity

$$O\!\left(\frac{1}{\text{frame rate}^2}\right)$$

As this algorithm is run once per frame, we then see that as the frame rate increases the total computational load decreases, falling as $O(1/\text{frame rate})$ per second of video.

The compression may be improved further still by removing the Hamming bits from the unvoiced frames, producing a shorter data frame. As the pitch and voicing data precedes the unused bits of an unvoiced frame, the receiver will be able to determine the frame data length without further synchronisation information. See figure 20 for the structure of a compressed data packet.

To assist in the later integration of audio and video data frames, the audio frame rate is altered to 40fps. The resulting audio stream averages 600bps during conversation, ranging from 40bps during silence to 2.1kbps instantaneous peak rate.
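These figures are consistent with the frame structure of Section 3.8.5, assuming the standard 54-bit LPC-10e voiced frame and roughly a one-bit silence indication per frame (both assumptions here rather than statements from the text):

$$54\ \text{bits} \times 40\ \text{fps} = 2160\ \text{b/s} \approx 2.1\ \text{kb/s (peak)}, \qquad 1\ \text{bit} \times 40\ \text{fps} = 40\ \text{b/s (silence)}$$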

Figure 20 - Bit Allocation Scheme for VBR LPC

Chapter 5 - Design and Implementation

5.1 Specifications
The aim of the project was to build a device known as the Mobile Video Phone or MVP. It was to plug into a mobile phone, extending its capabilities to 2-way audio and video communication. Naturally such a device would include:
1. A video camera
2. A video display
3. A microphone
4. A speaker
5. A processor

Key design requirements for the MVP were:
1. Good quality video

embedded nature of SPIHT the codec is scalable with bandwidth, allowing for  increased video quality with future improvements in the mobile phone networks.

5.1.2 Real-time Compression and Decompression
The low channel rate of 12-25kbps required very strong compression of both audio and video. Real-time compression and decompression of video requires a very fast processor and well-designed algorithms. In addition it restricts the delay of the codec, reducing the extent to which the codec may “look ahead” to future frames. Thus schemes such as 3D SPIHT are inapplicable or severely limited in such an application. Speech is typically 8-16 bits wide with a sampling rate of 8-22kHz and requires relatively little processing.

5.1.3 Portability
For portability a fast and low power processor is required, ideally one which is

The project was divided into two areas, hardware and software. The software includes the audio and video compression, error control and synchronisation, and drivers for the audio and visual interfaces. The hardware includes the camera and converter, display, microphone and speaker, and supporting circuitry for the DSP. This thesis is predominantly concerned with the software; for a detailed examination of the hardware see Elliot Hill’s thesis [Hill 2001].

5.2 Software Components

5.2.1 Audio Compression
As described in Section 4.2 an enhanced VBR LPC audio codec was developed. The compressed bitstream averaged 600bps during conversation, with a peak rate of 2400bps. Initial testing indicated that the quality was quite acceptable for voice communication with low background noise. As a variable rate codec the output bit

5.2.3 Integration of Audio and Video Codecs
For robust transmission over a wireless channel the audio and video data must be combined into a single stream. The combination must take into account the variable output bit rates and levels of control of each of the data components. Where necessary it must include synchronisation data so that the decoder can parse the data packet, extracting the relevant components for each step of the decompression.

The audio codec’s output rate is uncontrolled but contains embedded synchronisation information. As such it may be placed at the beginning of the data packet without specifying its length. The audio codec is altered from its original 44.44 frames per second to 40 frames per second to simplify the interleaving of audio and video frames.

The residual image data is dependent on the motion field data, and so is placed

Figure 22 - Bit Allocation for Compressed Data Packets

Figure 22 above shows the bit allocation scheme used in the MVP. The diagram is to scale, based on the average length of each data component at 12kbps. Larger   bandwidths simply increase the bits available for residue data.

For further detail, see the attached code [Appendices, 9.4 – VIDCODEC.c].

5.2.3.1 Low Motion Limit

5.3 Implementation
The target platform consisting of the DSP, audio and video interface hardware and support circuitry was to be constructed by Elliot Hill [Hill 2001]. However toward the end of the project it was decided that the target platform was too far behind schedule to be used. Instead I am developing a demonstration system consisting of a laptop computer with a GPRS phone, downloading and decompressing a precompressed video stream in real time. This is near completion, due to be finalised by the 30th of October for demonstration at the University of Queensland’s Innovation Expo.

Chapter 6 - Results

This chapter outlines the results obtained from the individual system components as well as from the entire MVP system. It also discusses the performance of these components and compares this with their expected behaviour.

6.1 Image Compression
The SPIHT-based image coding scheme works remarkably well, achieving very good quality images at low compression rates and acceptable images at extremely high compression rates.

[Image panels: JPEG Image (512x512x24bit) and SPIHT Image (512x512x24bit), at 360x and 800x compression]

[Image panels: Original Image (512x512x24bit, 786kB) and SPIHT Image (512x512x24bit), 50x compression, 15.7kB]

Figure 24 – Low Compression Rate Example

The 800 times compressed image is quite blurry but still recognisable. The 50 times compressed image is near perfect, showing only very slight blurring effects.

noticeable coding artifacts afterwards. Due to its origins in LPC-10e this codec’s output is of closely comparable quality.

6.3 Video Compression
In this section we view the results of the video compression. Overall the quality of the decompressed video is quite good for the low bandwidth available to GPRS (25kbps) phones, with the quality for GSM (12kbps) phones considered satisfactory.

6.3.1 Typical Quality

6.3.2 Early Behaviour 

Figure 27 - Frames 1, 2, 3, 10 of Sample Video

Figure 27 above shows the typical behaviour of the first few frames. In the first frame there is no previous information about the scene, so the image is of very low quality. Successive frames show the information flowing into the image in the form of the wavelet ‘blobs’ which make up the scene. After 1 second (frame 10) the video begins to resemble the scene of interest.

Sample 1 - Ben Talking

6.4 Computational Results
The decompression and display system runs in real-time on a 750MHz Pentium III laptop computer with 64MB of memory. The code consists of 6000 lines of C code with a 1000 line C++ front end and has not been optimised to a commercial standard. Initially targeted at the C6701 floating point DSP, the majority of the processing is performed in floating point. It is expected that significant speed gains may be made by conversion to a fixed-point or integer implementation, particularly with the use of the MMX SIMD instructions available on a Pentium III or higher processor.

The encoding of audio and video operates at 3.3 frames per second on the same computer described above. The lower encoding rate is due to two reasons. Firstly, the encoder must decode its own output in order to maintain synchronisation with the receiver. Secondly, motion estimation between frames is a significant

GPRS phone has a serial connector and plugs directly into the laptop computer as a modem. Thus the system is mobile, performing real-time decompression on precompressed video sequences. Applications of this prototype could include video on demand, or with a sufficiently fast compression server, TV over a mobile phone and mobile surveillance. A fast implementation would allow bi-directional mobile videoconferencing, the ultimate goal of this project.

Chapter 7 - Discussion

Overall the performance of the mobile videophone is excellent. Given the very low bandwidth of the channel the quality of the audio and video is good. This chapter outlines the key achievements of this system as well as its limitations. Possible improvements and extensions are discussed for future work in the field.

7.1 Achievements

7.1.1 Extremely Low Bit Rate Speech Codec This thesis describes the design and implementation of an extremely low bit rate (600bps) speech codec. It is a Variable Bit Rate Linear Predictive Coding speech compression scheme based on LPC-10e, with silence detection and voicing

algorithm that acquires the motion vector of an object and then tracks the motion surface outwards, determining each object’s motion field in turn. Although it does not provide an optimal motion field, such an approach drastically reduces the size of the search space to allow real-time motion estimation. Its fundamental model causes it to produce smooth, easily compressible motion field estimates. The algorithm is shown to produce less computational loading at higher frame rates, which suggests its application to high frame rate video coding.

7.2 Future Work
Mobile videoconferencing and more general video conferencing are enormous fields of research. Current research includes the compression of audio and video, video-specific error correction, packet loss recovery, network latency hiding schemes and even multimedia-driven network designs. However from this thesis a number of specific avenues for future work have emerged.

Chapter 8 - Conclusion

This thesis described the theory, design and implementation of a mobile videophone. The system demonstrates good performance at 25kbps (GPRS), with satisfactory performance at 12kbps (GSM). A variable bit rate speech codec based on LPC-10e has been developed requiring only 600bps, significantly increasing the bandwidth available for video. The original video compression developed is scalable to ensure that the quality improves with future increases in mobile communications bandwidth. A prototype has been implemented on a laptop computer connected to a GPRS phone, demonstrating the system’s potential.

This thesis covered the following: Chapter 1 introduced the need for and the applications of videoconferencing. Chapter 2 reviewed the present state of mobile videoconferencing, including current compression research, mobile telecommunications and more general videoconferencing. Chapter 3 explores the

Chapter 9 - References

[Bourke 2000] YCC Colour Space and Image Compression, http://astronomy.swin.edu.au/pbourke/colour/ycc/, April 2000.

[Buckingham 2001] Simon Buckingham, An Introduction to the General Packet Radio Service, http://www.gsmworld.com/technology/yes2gprs.html, Mobile Lifestreams Ltd. (Issued Jan 2001)

[Cherriman 1996]

[Donaho & Johnstone 1994] D. Donaho and I. Johnstone, Ideal Spatial Adaption via Wavelet Shrinkage , Biometrika, 81:425-455, Dec 1994.

[Duran & Sauer 1997] Duran and Sauer, Mainstream Videoconferencing: A Developer’s Guide to  Distance Multimedia , Addison-Wesley, 1997.

[Faichney & Gonzalez 1999] J. Faichney, R. Gonzalez, “Video Coding for Mobile Handheld Videoconferencing,” Proc. IASTED International Conference, Internet and Multimedia Systems and Applications, Nassau, Bahamas, 1999.

[Hill 2001] E. Hill,   Hardware Design of a GPRS Videophone, Undergraduate thesis, University of Queensland, Brisbane, Australia, Department of Information Technology and Electrical Engineering, 2001.

[Mallat 1998] S. Mallat,   A Wavelet Tour of Signal Processing , Academic Press Ltd., San Diego, CA, 1998.

[Majani 1994] E. Majani, “Biorthogonal Wavelets for Image Compression”, Proc. SPIE, VCIP 1994, Vol. 2308, pp. 478-488, Sept. 1994.

[MPEG 2001] Moving Pictures Expert Group, MPEG 4 Applications – Coding of Moving   Pictures and Audio , March 1999/Seoul.

[Ohr 1996] Stephan Ohr, ITU effort eyes mobile video phone, http://www.icsl.ucla.edu/~luttrell/pubs/eetimes.html, October 1996.

[Rabiner & Schafer 1978] L. Rabiner, R. Schafer, Digital Processing of Speech Signals , Prentice-Hall, Eaglewood Cliffs, N.J., 1978.

[Storn 1996] Rainer Storn, “Echo Cancellation Techniques for Multimedia Applications –  a Survey”, Berkeley, CA, 1996.

[Vanstone & Oorschot 1989] S. A. Vanstone and P. C. Oorschot, An Introduction to Error Correcting Codes with Applications, Kluwer Academic Publishers, Norwell, Massachusetts, 1989.

[van der Walle 1995] A. van der Walle, Relating Fractal Image Compression to Transform Methods, Masters thesis, Univ. of Waterloo, Ontario, Canada, Department of Applied Mathematics, 1995.

[3G 2001] 3G – The Future of Communications, http://www.gsmworld.com/technology/3g_future.html, Mobile Lifestreams Ltd. (Issued April 26, 2001)

Chapter 10 - Appendices

10.1 DWT.c
This appendix contains the C code written to perform forward and inverse Biorthogonal Wavelet Transforms on colour images.

/*****************************************************************************
 DWT.c
 A group of functions to perform wavelet transforms and inverse transforms
*****************************************************************************/
#include "DWT.h"

/* - toYUV: Converts an RGB image into a YUV image - a simple matrix operation */
int toYUV(float *in, int imageSize)
{
    int x, y;
    for (x = 0; x < imageSize; x++) {
        for (y = 0; y < imageSize; y++) {
            float R, G, B;
            /* RGB to YUV:
               Y =  0.299 R + 0.587 G + 0.114 B
               U = -0.146 R - 0.288 G + 0.434 B
               V =  0.617 R - 0.517 G - 0.100 B */
            R = *(in + RED   + 3*(x + imageSize*y));
            G = *(in + GREEN + 3*(x + imageSize*y));
            B = *(in + BLUE  + 3*(x + imageSize*y));

            *(in + YCOMP + 3*(x + imageSize*y)) =  0.299f*R + 0.587f*G + 0.114f*B;
            *(in + UCOMP + 3*(x + imageSize*y)) = -0.146f*R - 0.288f*G + 0.434f*B;
            *(in + VCOMP + 3*(x + imageSize*y)) =  0.617f*R - 0.517f*G - 0.100f*B;
        }
    }
    return 0;
}

/* - toRGB: Converts a YUV image into an RGB image - a simple matrix operation */
int toRGB(float *in, int imageSize)
{
    int x, y;
    for (x = 0; x < imageSize; x++) {
        for (y = 0; y < imageSize; y++) {
            float Y, U, V;
            /* YUV to RGB:
               R = 1.0000 Y - 0.0009 U + 1.1359 V
               G = 1.0000 Y - 0.3959 U - 0.5783 V
               B = 1.0000 Y + 2.0411 U - 0.0016 V */
            Y = *(in + YCOMP + 3*(x + imageSize*y));
            U = *(in + UCOMP + 3*(x + imageSize*y));
            V = *(in + VCOMP + 3*(x + imageSize*y));

            *(in + RED   + 3*(x + imageSize*y)) = Y - 0.0009f*U + 1.1359f*V;
            *(in + GREEN + 3*(x + imageSize*y)) = Y - 0.3959f*U - 0.5783f*V;
            *(in + BLUE  + 3*(x + imageSize*y)) = Y + 2.0411f*U - 0.0016f*V;
        }
    }
    return 0;
}

/* - fwt2: 2D forward wavelet transform of a colour image. CAUTION - In-place! YUV BUILT-IN
   Note: Not simply a sequential fwt in both directions - we have to do the separable
   filtering at the same level before downsampling!
   CAREFUL OF ACCIDENTALLY CYCLING SUBBANDS RELATIVE TO EACH OTHER!!!! */
int fwt2(float *in, int imageSize, float *motherWavelet, int motherLength,
         float *fatherWavelet, int fatherLength,

/* - upLPF: A variation on symmConv to include EVEN upsampling */
int upLPF(float *in, int inLength, int dataStep, int numColours,
          float *filter, int filterLength, float *out)
{
    int oneSidedLength = (filterLength - 1)/2;
    int inIndex, outIndex, filterIndex, colour, startIndex;
    int outLength = 2*inLength;
    int modBase = 2*(outLength - 1);
    char boundsApply;
    float filterCoefficient;
    float *inPtr, *outPtr;

    for (outIndex = 0; outIndex < outLength; outIndex++) {
        outPtr = (out + numColours*outIndex);
        /* Check outside the loop if the bounds are going to apply for this output index */
        boundsApply = (outIndex < oneSidedLength*2 ||
                       outIndex > (outLength - 1 - oneSidedLength*2));
        startIndex = -oneSidedLength;
        if (((outIndex + startIndex)&1) != 0)
            startIndex++;
        for (filterIndex = startIndex; filterIndex