DSP PROCESSORS AND DSP IMPLEMENTATION
Contents
Introduction
General and special purpose DSP processors
Computer architectures for signal processing
General purpose fixed point DSP processors
Selecting DSP Processors
Implementation of DSP algorithms
Special purpose DSP processors
1. Introduction
DSP processors are used to implement and execute DSP algorithms in real-time (real-time often implies 'as soon as possible', but always within specified time limits). The main objectives of this section of the DSP course (lecture sessions and associated laboratory/course work) are to provide an understanding of:
(1) key issues underlying DSP processors and their hardware/software architectures;
(2) how DSP algorithms are implemented for real-time execution using fixed point DSP processors (digital filtering will be used as a vehicle for this);
(3) finite word length effects in fixed point DSP systems (again using digital filtering as a vehicle for the discussion).
2. General and special purpose DSP processors
For convenience, DSP processors can be divided into two broad categories:

(1) General purpose DSP processors – these are basically high speed microprocessors with hardware and instruction sets optimized for DSP operations. Examples include fixed-point devices such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating point processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors.

(2) Special purpose DSP processors – these include: (i) hardware designed for efficient execution of specific DSP algorithms (sometimes called algorithm-specific hardware), e.g. FFT, and (ii) hardware designed for specific applications (sometimes called application-specific processors), e.g. for PCM in telecommunications or audio applications. Examples of special-purpose DSP processors are Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony voice echo canceller (MT9300), the FFT processor (PDSP16515A) and the programmable FIR filter (VPDSP16256).
3. Computer architectures for signal processing
Standard microprocessors are based on the von Neumann concept, where operations are performed sequentially. An increase in processor speed can then only be achieved by making the individual units of the processor operate faster, and there is a limit to this (see Figure 1). For real-time operation, DSP processors must have an architecture optimised for executing DSP operations. Figure 2 depicts a simplified generic hardware architecture for DSP.
Figure 1 A simplified architecture for standard microprocessors
Figure 2 A simplified generic hardware architecture for DSP
The characteristic features of the architecture of Figure 2 include:
- Multiple bus structure, with separate memory spaces for data and programs.
- Arithmetic units for logical and arithmetic operations, including a hardware multiplier/accumulator.
Why is such an architecture necessary? Most DSP algorithms, e.g. digital filtering and the FFT, involve repetitive arithmetic operations such as multiplications, additions and memory accesses, with heavy data flow through the CPU. The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP hardware design is to optimise both the hardware architecture and the instruction set to increase speed and make real-time execution possible, whilst keeping quantization errors low. In DSP, this is achieved by making extensive use of the concept of parallelism. In particular, the following techniques are used:
- Harvard architecture
- Pipelining
- Fast, dedicated hardware multiplier/accumulator
- Specialised instructions dedicated to DSP
- Replication
- On-chip memory/cache
- Extended parallelism – SIMD, VLIW and static superscalar processing.
We will examine some of the above techniques to gain more understanding of the architectural features of DSP processors.

3.1 Harvard architecture
In a standard microprocessor, the program codes and the data are held in one memory space. Thus, the fetching of the next instruction while the current one is executing is not allowed, because the fetch and execution phases each require memory access (see Figure 3).
Figure 3 An illustration of instruction fetch, decode and execute in a non-Harvard architecture with a single memory space: (a) instruction fetch from memory; (b) timing diagram.

NB: The example illustrates reading a value op1 at address ADR1 in memory into the accumulator and then storing it at two other addresses, ADR2 and ADR3. The instructions could be:

LDA  ADR1    Load the operand op1 into the accumulator from ADR1
STA  ADR2    Store op1 in address ADR2
STA  ADR3    Store op1 in address ADR3
Typically, an instruction in a microprocessor involves three distinct steps:
- Instruction fetch
- Instruction decode
- Instruction execute.
The main feature of the Harvard architecture is that the program and data memories lie in two separate spaces, see Figure 4. This permits a full overlap of instruction fetch and execution.
Figure 4 The basic Harvard architecture with separate data and program spaces.
Figure 5 An illustration of instruction overlap made possible by Harvard architecture.
In a Harvard architecture, since the program codes and data lie in separate memory spaces, the fetching of the next instruction can overlap the execution of the current instruction. Normally, the program memory holds the program codes, whilst the data memory stores variables such as the input data samples.
3.2 Pipelining
Pipelining is a technique used extensively in DSP to increase speed, as it allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct sub-tasks which are then overlapped during execution. A pipeline is akin to a typical production line in a factory, such as a car or TV assembly plant: the task is broken down into small, independent sub-tasks called pipe stages, which are connected in series to form a pipe, and each instruction passes through the stages in sequence.
Figure 6 An illustration of the concepts of pipelining.
Figure 6 gives a timing diagram of a 3-stage pipeline. Typically, each step in the pipeline takes one machine cycle to complete. Thus, during a given cycle up to three different instructions may be active at the same time, although each will be at a different stage of completion.
The speedup is

\[
\text{speedup} = \frac{\text{average instruction time (non-pipelined)}}{\text{average instruction time (pipelined)}} \qquad (1)
\]
Example 1
In a non-pipelined processor, the instruction fetch, decode and execute take 35 ns, 25 ns and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.

Solution
In an ideal non-pipelined processor, the average instruction time is simply the sum of the times for instruction fetch, decode and execute: 35 + 25 + 40 = 100 ns. However, if we assume a fixed machine cycle, then each instruction takes three machine cycles to complete: 40 ns x 3 = 120 ns (the execute time – the maximum stage time – determines the cycle time). This corresponds to a throughput of 8.3 x 10^6 instructions per second. In the pipelined processor, the clock period is determined by the speed of the slowest stage plus overheads, i.e. 40 + 5 = 45 ns. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second.

Speedup = 120/45 = 2.67
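To make the arithmetic concrete, the short Python sketch below (an illustration of our own, not part of the original notes; the function name is arbitrary) reproduces the Example 1 calculation. The same function can be applied to the drill that follows.

def pipeline_speedup(stage_times_ns, overhead_ns):
    """Return (non-pipelined time, pipelined clock, speedup) per instruction."""
    n_stages = len(stage_times_ns)
    # With a fixed machine cycle, the slowest stage sets the cycle time,
    # and a non-pipelined instruction needs one cycle per stage.
    non_pipelined = max(stage_times_ns) * n_stages
    # The pipelined clock is the slowest stage plus the per-stage overhead.
    pipelined = max(stage_times_ns) + overhead_ns
    return non_pipelined, pipelined, non_pipelined / pipelined

non_pipe, pipe, speedup = pipeline_speedup([35, 25, 40], overhead_ns=5)
print(non_pipe, pipe, round(speedup, 2))  # 120 45 2.67
print(round(1e3 / pipe, 1))               # throughput when full: ~22.2 MIPS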
Pipelining has a major impact on the system memory because it leads to an increased number of memory accesses (typically by the number of stages). The use of Harvard architecture where data and instructions lie in separate memory spaces promotes pipelining.
Drill
Assuming the times in the above example are as follows:

fetch      20 ns
decode     25 ns
execute    15 ns
overhead    1 ns
Determine the increase in throughput if the instructions were pipelined.
Solution
Example 2
Most DSP algorithms are characterised by multiply-and-accumulate operations typified by the following equation:

\[
y(n) = a_0 x(n) + a_1 x(n-1) + a_2 x(n-2) + \dots + a_{N-1} x(n-(N-1))
\]

Figure 7 shows a non-pipelined configuration for an arithmetic element for executing the above equation. Assume a transport delay of 200 ns, 100 ns and 100 ns, respectively, for the memory, multiplier and accumulator.
(1) What is the system throughput?
(2) Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the new configuration with a timing diagram.
Figure 7 Non-pipelined MAC configuration.

Solution
(1) The coefficients, $a_k$, and the data arrays are stored in memory as shown in Figure 7. In the non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier. The products are summed in the accumulator. Successive MAC operations are performed once every 400 ns (200 + 100 + 100), that is a throughput of 2.5 x 10^6 operations per second.
(2) The arithmetic operations involved can be broken up into three distinct steps: memory read, multiply, and accumulate. To improve speed, these steps can be overlapped. A speed improvement of 2:1 can be achieved by inserting pipeline registers between the memory and the multiplier, and between the multiplier and the accumulator, as shown in Figure 8. The timing diagram for the pipelined configuration is shown in Figure 9. As is evident in the timing diagram, a MAC operation is performed once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in this case the memory. Pipeline overheads have been ignored.
Figure 8 Pipelined MAC configuration. The pipeline registers serve as a temporary store for the coefficient and data sample pair. The product register also serves as a temporary store for the product.
Figure 9 Timing diagram for a pipelined MAC unit. When the pipeline is full, a MAC operation is performed every clock cycle (200 ns).
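The same throughput arithmetic can be checked in a few lines of Python (again an illustration of our own, not part of the notes):

# Stage transport delays from Example 2.
memory_ns, multiply_ns, accumulate_ns = 200, 100, 100

# Non-pipelined: each MAC traverses all three stages in sequence.
non_pipelined_ns = memory_ns + multiply_ns + accumulate_ns   # 400 ns

# Pipelined: once full, one MAC completes per cycle of the slowest stage.
pipelined_ns = max(memory_ns, multiply_ns, accumulate_ns)    # 200 ns

print(1e3 / non_pipelined_ns)            # 2.5 million MACs per second
print(1e3 / pipelined_ns)                # 5.0 million MACs per second
print(non_pipelined_ns / pipelined_ns)   # speed increase of 2:1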
DSP algorithms are often repetitive but highly structured, making them well suited to multilevel pipelining. Pipelining ensures a steady flow of instructions to the CPU and in general leads to a significant increase in system throughput. However, on occasion pipelining may cause problems (e.g. an unwanted instruction execution, especially near branch instructions).
3.3 Multiplier/Accumulator
The basic numerical operations in DSP are multiplication and addition. Multiplication in software is time consuming; additions are even worse if floating point arithmetic is used. To make real-time DSP possible, a fast, dedicated hardware MAC, using either fixed point or floating point arithmetic, is mandatory. Characteristics of a typical fixed point MAC include:

- 16 x 16-bit 2's complement inputs
- 16 x 16-bit multiplier with a 32-bit product in 25 ns
- 32/40-bit accumulator
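As a software illustration of these characteristics, the sketch below (our own Python model, assuming wrap-around on overflow; real devices may saturate instead) simulates a 16 x 16-bit multiply with a 32-bit product accumulated in a 40-bit register:

def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(acc, x, h):
    """One multiply-accumulate: acc + x*h, wrapped to a 40-bit accumulator."""
    product = to_signed(x, 16) * to_signed(h, 16)  # fits in a 32-bit product
    return to_signed(acc + product, 40)            # 40-bit accumulation

acc = 0
for x, h in [(30000, 30000), (-20000, 25000)]:
    acc = mac(acc, x, h)
print(acc)  # 400000000

The extra 8 accumulator bits beyond the 32-bit product act as guard bits, allowing many products to be summed before overflow can occur.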
3.4 Special instructions
These are instructions optimised for DSP; they lead to compact code and increased speed of execution of operations that are repeated. For example, digital filtering requires data shifts or delays to make room for new data, followed by multiplication of the data samples by the filter coefficients, and then accumulation of the products. Recall that FIR filters are characterised by the following equation:

\[
y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)
\]

where N is the filter length.
In the TMS320C50, for example, the FIR equation can be efficiently implemented using the instruction pair:

RPT   NM1
MACD  HNM1, XNM1
The first instruction, RPT NM1, loads the filter length minus 1 (N-1) into the repeat instruction counter, and causes the multiply-accumulate with data move (MACD) instruction following it to be repeated N times. The MACD instruction performs a number of operations in one cycle:

(1) multiplies the data sample, x(n-k), in the data memory by the coefficient, h(k), in the program memory;
(2) adds the previous product to the accumulator;
(3) implements the unit delay, symbolised by $z^{-1}$, by shifting the data sample, x(n-k), up to update the tapped delay line.
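In Python terms, the work done by the RPT/MACD pair corresponds to the following sketch (our own illustration; the function name and the 3-tap filter are arbitrary, and the data move is modelled as a list shift):

def fir_step(h, x_line, new_sample):
    """Compute one FIR output sample and update the tapped delay line."""
    # The data move: shift in the new sample, discard the oldest.
    x_line.insert(0, new_sample)
    x_line.pop()
    # N multiply-accumulates, as MACD performs under RPT.
    acc = 0
    for h_k, x_k in zip(h, x_line):
        acc += h_k * x_k
    return acc

h = [0.25, 0.5, 0.25]        # example coefficients, N = 3
x_line = [0.0, 0.0, 0.0]     # tapped delay line
for sample in [1.0, 2.0, 3.0]:
    print(fir_step(h, x_line, sample))   # 0.25, 1.0, 2.0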
In the Motorola DSP56000 DSP processor family, as in the TMS320 family, the MAC instruction, together with the repeat instruction (REP), may be used to implement an FIR filter efficiently:

REP  #N-1
MAC  X0, Y0, A    X:(R0)+, X0    Y:(R4)+, Y0
Here the repeat instruction is used with the MAC instruction to perform sustained multiplication and sum-of-products operations. Again, notice the ability to perform multiple operations with one instruction, made possible by having multiple data paths.
The contents of the registers X0 and Y0 are multiplied together and the product added to the accumulator. At the same time, the next data sample and corresponding coefficient are fetched from the X and Y memories for multiplication. In most modern DSP processors, the concept of instruction repeat has been taken further by providing instructions that allow a block of code, not just a single instruction, to be repeated a specified number of times. In the TMS320 family (e.g. TMS320C50, TMS320C54 and TMS320C30), the format for repeat execution of a block of instructions, with a zero-overhead loop, is:
      RPTB  loop
      :
      :
loop  (last instruction)
Repeat instructions provided by some DSP processors have high-level language features. In the Motorola DSP56000 and DSP56300 families, zero-overhead DO loops are provided which may also be nested. The example below illustrates a nested DO loop in which the outer loop is executed N times and the inner loop M times on each pass:

        DO  #N, LOOP1
        :
        DO  #M, LOOP2
        :
LOOP2   (last instruction in the inner loop is placed here)
        :
LOOP1   (last instruction in the outer loop is placed here)

Nested loops are useful for efficient implementation of DSP functions such as FFT algorithms and two-dimensional signal processing. Analog Devices DSP processors (e.g. the ADSP-2115 and SHARC processors) also have nested-looping capability; the ADSP-2115 supports up to 4 levels of nested loops. The format for looping is:

        CNTR = N
        DO LOOP UNTIL CE
        :
        :
LOOP:   (last instruction in the loop)

The loop is repeated until the counter expires, and it can contain a large block of instructions, not just a single instruction. The format for nested looping is essentially the same as for the DSP56000 family.

Modern DSP processors also feature application-oriented instructions for applications such as speech coding (e.g. instructions for codebook search), digital audio (e.g. for surround sound) and telecommunications (e.g. for Viterbi decoding). Other application-oriented instructions include those that support coefficient update for adaptive filters and bit-reversed addressing for FFTs (see later).
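As a preview of bit-reversed addressing, the sketch below (our own Python illustration, not processor code) shows the address sequence that such hardware generates to reorder the input or output of a radix-2 FFT:

def bit_reverse(index, num_bits):
    """Reverse the low num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

N = 8  # FFT length (a power of two), so 3 address bits
print([bit_reverse(i, 3) for i in range(N)])  # [0, 4, 2, 6, 1, 5, 3, 7]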
3.5 Extended parallelism – SIMD, VLIW and static superscalar processing
The trend in DSP processor architecture design is to increase both the number of instructions executed in each cycle and the number of operations performed per instruction to enhance performance. In newer DSP processor architectures, parallel processing techniques are extensively used to achieve increased computational performance. The three techniques that are used, often in combination, are:
- Single instruction, multiple data (SIMD) processing
- Very long instruction word (VLIW) processing
- Superscalar processing
Figure 10 An illustration of the use of SIMD processing and multiple data size capability to extend the number of multiplier/accumulators (MACs) from one to four in a TigerSHARC DSP processor.
Note: SIMD processing is used to increase the number of operations performed per instruction. Typically, in DSP processors with SIMD architectures the processor has multiple data paths and multiple execution units. Thus, a single instruction may be issued to the multiple execution units to process blocks of data simultaneously and in this way the number of operations performed in one cycle is increased.
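As a loose software analogy (not how a SIMD DSP processor is actually programmed), a vectorised NumPy operation applies one nominal instruction to a block of data elements at once, much as a SIMD instruction drives several execution units in parallel:

import numpy as np

# Four 16-bit multiply-accumulates expressed as a single vectorised
# operation, analogous to one SIMD instruction driving four MAC units.
x = np.array([1, 2, 3, 4], dtype=np.int16)
h = np.array([5, 6, 7, 8], dtype=np.int16)
acc = np.zeros(4, dtype=np.int32)

acc += x.astype(np.int32) * h   # products kept in 32 bits, as in a MAC
print(acc)                      # [ 5 12 21 32]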
Figure 11 Principles of very long instruction word (VLIW) architecture and data flow in the advanced, fixed point DSP processor, TMS320C62x.
Note: Very long instruction word (VLIW) processing is an important approach for substantially increasing the number of instructions that are processed per cycle. A very long instruction word is essentially a concatenation of several short instructions, and it requires multiple execution units, running in parallel, to carry out the instructions in a single cycle. In the TMS320C62x, the CPU contains two data paths and eight independent execution units, organised in two sets (L1, S1, M1 and D1) and (L2, S2, M2 and D2). Each short instruction is 32 bits wide, and eight of these are linked together to form a very long instruction word packet which may be executed in parallel.

VLIW processing starts when the CPU fetches an instruction packet (eight 32-bit instructions) from the on-chip program memory. The eight instructions in the fetch packet are formed into an execute packet, if they can be executed in parallel, and then dispatched to the eight execution units as appropriate. The next 256-bit instruction packet is fetched from the program memory while the execute packet is decoded and executed. If the eight instructions in a fetch packet are not executable in parallel, then several execute packets are formed and dispatched to the execution units, one at a time. A fetch packet is always 256 bits wide (eight instructions), but an execute packet may vary between 1 and 8 instructions.
Figure 12 Principles of superscalar architecture and data flow in the TigerSHARC DSP processor.

Note: Superscalar processing is used to increase the instruction rate of a DSP processor by exploiting instruction-level parallelism. Traditionally, the term superscalar refers to computer architectures that enable multiple instructions to be executed in one cycle. Such architectures are widely used in general purpose processors, such as PowerPC and Pentium processors. In superscalar DSP processors, multiple execution units are provided and several instructions may be issued to the units for concurrent execution. Extensive use is also made of pipelining techniques to increase performance further. The TigerSHARC is described as a static superscalar DSP processor because the parallelism in the instructions is determined before run-time. In fact, the TigerSHARC processor combines SIMD, VLIW and superscalar concepts.

This advanced DSP processor has multiple data paths and two sets of independent execution units, each with a multiplier, an ALU, a 64-bit shifter and a register file. The TigerSHARC is a floating point processor, but it supports fixed point arithmetic with multiple data types (8-, 16- and 32-bit numbers). The instruction width is not fixed in the TigerSHARC processor. In each cycle, up to four 32-bit instructions are fetched from the internal program memory and issued to the two sets of execution units in parallel. An instruction may be issued to both units in parallel (SIMD instructions) or to each execution unit independently. Each execution unit (ALU, multiplier or shifter) takes its inputs from, and returns its results to, the register file. The register files are connected to the three data paths and so can simultaneously read two inputs and write an output to memory in a cycle. This load/store architecture is suited to basic DSP operations, which often take two inputs and compute an output. Because the processor can work on several data sizes, the execution units allow further levels of parallel computation. Thus, in each cycle the TigerSHARC can execute up to eight add/subtract operations and eight multiply-accumulate operations with 16-bit inputs, instead of two multiply-accumulate operations with 32-bit inputs.