body 12-12-13
December 11, 2016 | Author: Umamahesh Mavuluri | Category: N/A
Design of Low Power ALU using Area Efficient Carry Select Adder
CHAPTER 1 OVERVIEW OF THE PROJECT
1.1 Introduction
The design of low-power VLSI circuits with small area and high speed has become a main concern for digital designers. Low-power VLSI systems are in high demand because of the fast-growing technology in mobile communication and computation. Battery technology does not advance at the same rate as microelectronics technology, so mobile systems have a limited amount of power available. Designers therefore face several simultaneous constraints: high speed, high throughput, small silicon area, and low power consumption. Building low-power, high-performance adder cells is thus of great interest [1]-[5].
Over the past few decades, the electronics industry has experienced unprecedented growth, thanks to the use of integrated circuits in computing, telecommunications and consumer electronics. We have come a long way from the single-transistor era in 1958 to present-day ULSI (Ultra Large Scale Integration) systems with more than 50 million transistors on a single chip [6]. As the performance of processors has increased, the demand for high-speed arithmetic blocks has also increased. With clock frequencies approaching 1 GHz, arithmetic blocks must keep pace with the continued demand for more computational power. The purpose of this thesis is to present methods of implementing an area- and power-efficient carry select adder.
To reduce the power and area requirements of computationally complex designs, transistor sizes are shrunk into the deep sub-micron region [7], which is predominantly handled by process engineering. Several adder designs have been proposed to reduce power consumption. Logic minimization results not only in better system throughput but also in lower-power designs. For low-power results it is always
Department of ECE, MRITS
advisable to use CMOS technology, in which the power dissipation is a complex function of the gate delays, clock frequency, process parameters, circuit topology and structure, and the input vectors applied. Once the processing and structural parameters have been fixed, the power dissipation is dominated by the switching activity (toggle count) of the circuit. The dynamic power is given by
$P = \frac{1}{2} \cdot C_{load} \cdot \frac{V_{dd}^{2}}{T_{cycle}} \cdot E(switching)$
where $C_{load}$ is the load capacitance of the gate, $T_{cycle}$ is the clock cycle time, $E(switching)$ is the expected number of signal transitions per cycle, and $V_{dd}$ is the supply voltage [8].
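The dynamic power expression above can be evaluated numerically. The following is a minimal sketch; the function name and the operating point (10 fF load, 1 ns cycle, 0.5 transitions per cycle) are illustrative assumptions, not values measured in this work. Only the 1.2 V supply matches the simulation setup described later.

```python
def dynamic_power(c_load, vdd, t_cycle, e_switching):
    """Dynamic power in watts: 1/2 * C_load * (Vdd^2 / T_cycle) * E(switching)."""
    return 0.5 * c_load * (vdd ** 2 / t_cycle) * e_switching

# Hypothetical operating point (assumed values for illustration only).
p = dynamic_power(c_load=10e-15, vdd=1.2, t_cycle=1e-9, e_switching=0.5)
print(f"{p * 1e6:.2f} uW")

# Doubling the supply voltage quadruples the dynamic power (quadratic in Vdd).
p_double = dynamic_power(c_load=10e-15, vdd=2.4, t_cycle=1e-9, e_switching=0.5)
print(p_double / p)
```

The quadratic dependence on the supply voltage is why lowering Vdd is usually the first lever considered for reducing dynamic power.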
1.2 Objective
The objective is to design a high-speed Arithmetic Logic Unit (ALU) using an efficient carry select adder. The adder is the most important block in the ALU, and the speed of the ALU is limited by the adder because the carry has to propagate through many bit positions. In digital adders, the Ripple Carry Adder (RCA) is modified into the Carry Select Adder (CSLA) to speed up the operation. To achieve still more speed, the CSLA is replaced by the square-root (SQRT) CSLA. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [9]-[10]. However, the CSLA is not area efficient because it uses multiple pairs of RCAs to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers (mux) [11]-[15].
1.2.1 Existing SQRT Carry Select Adder
In general, the complete SQRT CSLA is divided into blocks. The block size and the number of blocks depend on the width of the SQRT CSLA according to the SQRT technique. From the second block onwards, each block contains three levels: the first level is a ripple carry adder with input carry zero, the second level is a ripple carry adder with input carry one, and the third level is a multiplexer that selects one of the two ripple carry adder outputs according to the carry from the previous block. The disadvantage of the SQRT CSLA is its larger area requirement, since it uses two levels of RCAs.
To achieve better area efficiency [13]-[15], a Binary to Excess-1 Converter (BEC) is used in place of the RCA with Cin=1 in the regular CSLA. To replace an n-bit RCA, an (n+1)-bit BEC is required. Although the BEC technique reduces area and power [16], the reduction is not considerable, and the design is not suitable for sub-threshold-level modifications. The drawback of this logic structure is therefore that it does not reduce the area and power to a satisfactory level, and there is still scope to reduce the delay. In order to reduce the power and area, a new logic structure for the BEC is proposed.
1.2.2 Proposed SQRT Carry Select Adder
The 16-bit SQRT CSLA using a BEC in its second level requires 792 transistors. There is scope to reduce the number of transistors, and with it the area and power dissipation, by using the proposed logic: implementing a 16-bit SQRT CSLA with the proposed logic requires 736 transistors. The proposed replacement for the second-level RCA is Special Hardware using Multiplexers (SHM). The inputs are applied to the first-level RCA, the output of the RCA is applied to the second-level SHM, and both results go to the third-level multiplexer, which selects either the RCA output or the SHM output according to the previous carry. Using the proposed logic, an 8-bit Arithmetic Logic Unit (ALU) is designed that performs arithmetic operations (addition, subtraction, increment and decrement) and logical operations (AND, OR, XOR and XNOR).
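The role of the BEC in the existing design can be illustrated with a short behavioural sketch. It shows why an (n+1)-bit BEC can stand in for the n-bit RCA with Cin=1: incrementing the Cin=0 result (n sum bits plus carry) gives exactly the Cin=1 result. The helper names are hypothetical, and this models only the logic function, not the transistor-level structure or the proposed SHM.

```python
def to_bits(value, width):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))

def bec(bits):
    """Binary to Excess-1 Converter: add 1 to an LSB-first bit list (same width).

    Gate-level form: x[0] = NOT b[0]; x[i] = b[i] XOR (b[0] AND ... AND b[i-1]).
    """
    out, all_ones_below = [], 1
    for b in bits:
        out.append(b ^ all_ones_below)   # a bit flips while every lower bit is 1
        all_ones_below &= b
    return out

n = 4
for a in range(2 ** n):
    for b in range(2 ** n):
        with_cin0 = to_bits(a + b, n + 1)       # RCA result for Cin=0, incl. carry
        with_cin1 = from_bits(bec(with_cin0))   # (n+1)-bit BEC output
        assert with_cin1 == a + b + 1           # matches an RCA with Cin=1
print("BEC output matches the Cin=1 adder for all", 4 ** n, "input pairs")
```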
1.3 Tools used
SOFTWARE:
Logic Editor: DSCH 2.6c
Layout Editor: Microwind 2.6a
The performance of the proposed design is analyzed. The simulations are performed in 120 nm (0.12 µm) technology using the simulation tool Microwind2, with a power supply of 1.2 V and a nominal temperature of 27°C, to extract the critical path delay and power consumption.
1.4 Thesis outline
Chapter 2 presents the literature survey, covering different types of adders, low-power design techniques used in the design of a low-power ALU, and different logic styles. Chapter 3 covers the existing designs, namely an 8-bit ALU using ripple carry adders and the implementation of the SQRT CSLA using the BEC technique. Chapter 4 describes the implementation of the proposed SQRT CSLA and the proposed ALU using the efficient carry select adder. Comparative analysis and results are presented in Chapter 5. The conclusion and future scope are discussed in Chapter 6.
CHAPTER 2 LITERATURE SURVEY
2.1 Introduction
In nearly all digital IC designs today, addition is one of the most essential and frequent operations. Instruction sets for DSPs and general-purpose processors include at least one type of addition. Other instructions such as subtraction and multiplication employ addition in their operations, and their underlying hardware is similar if not identical to addition hardware. Often, an adder or multiple adders will be in the critical path of the design, hence the performance of a design will often be limited by the performance of its adders. When looking at other attributes of a chip, such as area or power, the designer will find that the hardware for addition is a large contributor to these as well. It is therefore beneficial to choose the correct adder to implement in a design because of the many factors it affects in the overall chip. In this chapter we begin with the basic building blocks used for addition, then go through different algorithms and name their advantages and disadvantages.
2.2 Basic Adder Blocks
2.2.1 Half Adder
The half adder is an example of a simple, functional digital circuit built from two logic gates. It adds two one-bit binary numbers (A and B). The outputs are the sum of the two bits (S) and the carry (C). Note how the same two inputs are directed to two different gates: the inputs to the XOR gate are also the inputs to the AND gate. The input "wires" to the XOR gate are tied to the input wires of the AND gate; thus, when voltage is applied to the A input of the XOR gate, the A input of the AND gate receives the same voltage.
$S = A \oplus B$ (2.1)
$C = A \cdot B$ (2.2)
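The half adder described above can be sketched behaviourally in a few lines. The function name is an assumption for illustration; the XOR and AND operations correspond to the two gates in the description.

```python
def half_adder(a, b):
    """Return (sum, carry) for two one-bit inputs."""
    return a ^ b, a & b   # XOR gate gives the sum, AND gate gives the carry

for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        assert 2 * c + s == a + b   # (carry, sum) is the two-bit value of a + b
        print(f"A={a} B={b} -> S={s} C={c}")
```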
Fig.2.1 Half adder
2.2.2 Full Adder
In electronics, an adder is a digital circuit that performs addition of numbers. Full adders are fundamental units in various circuits, especially circuits that perform arithmetic operations, such as compressors, comparators, parity checkers, and arithmetic logic units. The full adder takes into account a carry input so that multiple adders can be used to add larger numbers. To remove ambiguity between the input and output carry lines, the carry in is labeled Cin while the carry out is labeled Cout. The full-adder circuit adds three one-bit binary numbers (Cin, A, B) and outputs two one-bit binary numbers, a sum (S) and a carry (Cout). The full adder is usually a component in a cascade of adders that adds 8-bit, 16-bit, 32-bit, etc. binary numbers. The carry input of a full adder comes from the carry output of the circuit "above" it in the cascade, and its carry output is fed to the full adder "below" it. Hence, a full adder is a digital circuit that performs an addition operation on three binary digits and produces a sum and a carry value, both of which are binary digits. It can be combined with other full adders or work on its own.
Fig.2.2 Schematic symbol of 1-bit full-adder cell (inputs A, B, CIN; outputs S, CO)
The final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1, and for the adder shown here one can check that this never happens. Using only two types of gates is convenient if one wants to implement the adder directly using common IC chips. A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that half adder to one input of the second half adder, connecting Cin to the other input, and ORing the two carry outputs. Equivalently, S can be made the three-input XOR of A, B, and Cin, and Cout can be made the three-input majority function of A, B, and Cin. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
Figure 2.3 Circuit diagram of 1-bit full-adder cell
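The construction from two half adders, and the claim that the final OR gate can be swapped for an XOR gate, can be checked exhaustively. This is a behavioural sketch with hypothetical function names; the `final_gate` parameter exists only to compare the two variants.

```python
from itertools import product

def half_adder(a, b):
    """Return (sum, carry) for two one-bit inputs."""
    return a ^ b, a & b

def full_adder(a, b, cin, final_gate="or"):
    """Full adder built from two half adders plus one gate merging the carries."""
    s1, c1 = half_adder(a, b)       # first half adder: A and B
    s, c2 = half_adder(s1, cin)     # second half adder: partial sum and Cin
    cout = (c1 | c2) if final_gate == "or" else (c1 ^ c2)
    return s, cout

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin                 # two-bit arithmetic sum
    assert s == a ^ b ^ cin                            # three-input XOR
    assert cout == int(a + b + cin >= 2)               # three-input majority
    assert full_adder(a, b, cin, "xor") == (s, cout)   # OR and XOR agree here
print("all 8 input combinations verified")
```

The last assertion holds because the two half-adder carries are never 1 at the same time: the first is 1 only when A = B = 1, in which case the partial sum is 0 and the second carry must be 0.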
$S = A \oplus B \oplus C_{in}$ (2.3)
$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in}$ (2.4)
2.2.3 Partial Full Adder
The Partial Full Adder (PFA) is a structure that implements intermediate signals that can be used in the calculation of the carry bit. It is an extension of the FA that adds the signals generate (g), kill (k), and propagate (p). When g=1, the carryout will be 1 (generated) regardless of the carry-in. When k=1, the carryout will be 0 (killed) regardless of the carry-in. When p=1, the carryout will equal the carry-in (the carry-in will be propagated). Table 2.1 reflects these three additional signals, with a comment on the carryout bit in an additional column. Equations 2.5 to 2.7 are the Boolean equations for generate, kill, and propagate, respectively. It should be noted that for the propagate signal the XOR function can also be used, since in the case a = b = 1 the generate signal already asserts that the carryout is 1. The Boolean equations for the sum and carryout can now be written as functions of g, p, or k, as shown by Equations 2.8 and 2.9. Figure 2.4 shows a circuit for creating the generate, propagate, and sum signals. It is a partial full adder because it does not calculate the carryout signal directly; rather, it creates the signals needed to calculate the carryout signal.
$g_i = a_i \cdot b_i$ (2.5)
$k_i = \overline{a_i} \cdot \overline{b_i}$ (2.6)
$p_i = a_i + b_i$ (2.7)
$\text{sum}_i = p_i \oplus \text{carry-in}_i$, with $p_i$ taken in its XOR form $a_i \oplus b_i$ (2.8)
$\text{carry-out}_{i+1} = a_i \cdot b_i + b_i \cdot \text{carry-in}_i + a_i \cdot \text{carry-in}_i = g_i + p_i \cdot \text{carry-in}_i$ (2.9)
Figure 2.4 Generation of GENERATE, PROPAGATE and SUM
Table 2.1 Truth table of partial full adder
Inputs: Carry-in, a, b. Outputs: Carry-out, Sum, g, k, p.
Carry-in  a  b  |  Carry-out  Sum  g  k  p  |  Carry-status
0         0  0  |  0          0    0  1  0  |  delete
0         0  1  |  0          1    0  0  1  |  propagate
0         1  0  |  0          1    0  0  1  |  propagate
0         1  1  |  1          0    1  0  1  |  generate/propagate
1         0  0  |  0          1    0  1  0  |  delete
1         0  1  |  1          0    0  0  1  |  propagate
1         1  0  |  1          0    0  0  1  |  propagate
1         1  1  |  1          1    1  0  1  |  generate/propagate
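The rows of Table 2.1 can be regenerated from the generate, kill and propagate definitions. The sketch below assumes the OR form of propagate for the p column (as in the table) and the XOR form for the sum; the function names and the status strings are chosen to match the table and are otherwise hypothetical.

```python
from itertools import product

def pfa(a, b, cin):
    """Partial full adder signals for one bit position."""
    g = a & b                  # generate (Eq. 2.5)
    k = (1 - a) & (1 - b)      # kill: both inputs are 0 (Eq. 2.6)
    p = a | b                  # propagate, OR form (Eq. 2.7)
    s = (a ^ b) ^ cin          # sum uses the XOR form of propagate (Eq. 2.8)
    cout = g | (p & cin)       # carryout from g, p and carry-in (Eq. 2.9)
    return g, k, p, s, cout

def carry_status(g, k):
    """Label used in the last column of Table 2.1."""
    if g:
        return "generate/propagate"
    return "delete" if k else "propagate"

print("cin a b | cout sum g k p | status")
for cin, a, b in product((0, 1), repeat=3):
    g, k, p, s, cout = pfa(a, b, cin)
    assert 2 * cout + s == a + b + cin   # signals reproduce a full adder
    print(f"{cin}   {a} {b} | {cout}    {s}   {g} {k} {p} | {carry_status(g, k)}")
```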
2.3 Adder Algorithms
2.3.1 Ripple Carry Adder
The Ripple Carry Adder (RCA) is one of the simplest adders to implement. This adder takes in two N-bit inputs (where N is a positive integer) and produces (N + 1) output bits (an N-bit sum and a 1-bit carryout). The RCA is built from N full adders cascaded together, with the carryout bit of one FA tied to the carry-in bit of the next FA. Figure 2.5 shows the schematic for an N-bit RCA. The input operands are labeled a and b, the carryout of each FA is labeled cout (which is equivalent to the carry-in (cin) of the subsequent FA), and the sum bits are labeled sum. Each sum bit requires both input operands and cin before it can be calculated. To estimate the propagation delay of this adder, we should look at the worst-case delay over every possible combination of inputs. This is also known as the critical path. The most significant sum bit can only be calculated when the carryout of the previous FA is known. In the worst case (when all the carryouts are 1), this carry bit needs to ripple across the structure from the least significant position to the most significant position. Figure 2.6 has a darkened line indicating the critical path. Hence, the delay of this implementation of the adder is expressed in Equation 2.10, where $t_{RCAcarry}$ is the delay for the carryout of a FA and $t_{RCAsum}$ is the delay for the sum of a FA.
$t_{RCAprop} = (N-1) \cdot t_{RCAcarry} + t_{RCAsum}$ (2.10)
From Equation 2.10, we can see that the delay is proportional to the length of the adder. An example of a worst-case propagation delay input pattern for a 4-bit ripple carry adder is where the input operands change from 1111 and 0000 to 1111 and 0001, resulting in a sum changing from 01111 to 10000. From a VLSI design perspective, this is the easiest adder to implement. One just needs to design and lay out one FA cell, and then array N of these cells to create an N-bit RCA. The performance of the one FA cell will largely determine the speed of the whole RCA. From the critical path in Equation 2.10, minimizing the carryout delay ($t_{RCAcarry}$) of the FA will minimize $t_{RCAprop}$. Various implementations of the FA cell aim to minimize the carryout delay.
Figure 2.5 Schematic for an N-bit Ripple Carry Adder
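The ripple carry structure and its delay expression can be sketched as follows. This is a behavioural model only; the function names are hypothetical, and the unit delays passed to `rca_delay` are placeholders rather than extracted values.

```python
def to_bits(value, width):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain one full adder per bit; each carryout feeds the next carry-in."""
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (b & carry) | (a & carry)
    return sum_bits, carry

def rca_delay(n, t_carry, t_sum):
    """Worst-case propagation delay of an N-bit RCA (Equation 2.10)."""
    return (n - 1) * t_carry + t_sum

# Worst-case pattern from the text: 1111 + 0001 ripples a carry through every bit.
sum_bits, cout = ripple_carry_add(to_bits(0b1111, 4), to_bits(0b0001, 4))
print("carryout:", cout, "sum (MSB first):", sum_bits[::-1])

# With placeholder unit delays the delay grows linearly with the width.
for n in (4, 8, 16, 32):
    print(n, "bits ->", rca_delay(n, t_carry=1, t_sum=1), "unit delays")
```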
Figure 2.6 Critical paths for an N-bit Ripple Carry Adder
2.3.2 Carry Skip Adder
From examination of the RCA, the limiting factor for speed in that adder is the propagation of the carryout bit. The Carry Skip Adder (CSKA, also known as the Carry Bypass Adder) addresses this issue by looking at groups of bits and determining whether each group has a carryout or not. This is accomplished by creating a group propagate signal ($P_{CSKAgroup}$) that determines whether the group carry-in ($\text{carry-in}_{CSKAgroup}$) will propagate across the group to the carryout ($\text{carry-out}_{CSKAgroup}$). To explore the operation of the whole CSKA, take an N-bit adder and divide it into N/M groups, where M is the number of bits per group. Each group contains a 2-to-1 multiplexer, logic to calculate M sum bits, and logic to calculate $P_{CSKAgroup}$. The select line for the mux is simply the $P_{CSKAgroup}$ signal, and it chooses between $\text{carry-in}_{CSKAgroup}$ and $c_{out4}$. To aid the explanation, we refer the reader to Figure 2.7, which shows the hardware for a group of 4 bits (M=4) in the CSKA. There are four full adders cascaded together, and each FA creates a carryout (cout), a propagate (p) signal, and a sum (sum, not shown). The propagate signal from each FA comes at no extra hardware cost since it is calculated in the sum logic (the hardware is identical to the sum hardware for the PFA shown in Figure 2.4). For $\text{carry-out}_{CSKAgroup}$ to equal $\text{carry-in}_{CSKAgroup}$, all of the individual propagates must be asserted (Equations 2.11 and 2.12). If this is true, then $\text{carry-in}_{CSKAgroup}$ "skips" past the group of full adders and equals $\text{carry-out}_{CSKAgroup}$. For the case where $P_{CSKAgroup}$ is 0, at least one of the propagate signals is 0. This implies that a delete and/or a generate occurred in the group. A delete signal simply means that the carryout for the group is 0 regardless of the carry-in, and a generate signal means that the carryout is 1 regardless of the carry-in. This is advantageous because it implies that the carryout for the group is not dependent on the carry-in. No hardware is needed to implement these two signals because the group carryout signal will reflect one of the three cases (a delete, a generate, or a group propagate occurred). The additional hardware to realize the group carryout in Figure 2.7 is a 4-input AND gate and a 2-to-1 multiplexer (mux). In general, an M-input AND gate and a 2-to-1 mux are required for a group of M bits, along with the logic to calculate the sum bits.
$P_{CSKAgroup} = p_0 \cdot p_1 \cdot p_2 \cdot p_3$ (2.11)
$\text{carry-out}_{CSKAgroup} = \text{carry-in}_{CSKAgroup} \cdot P_{CSKAgroup}$ (2.12)
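The behaviour of one skip group, and of a 16-bit adder assembled from four such groups, can be sketched as below. This is a functional model under stated assumptions: the per-bit propagate is taken in its XOR form (the one shared with the sum logic), the function names are hypothetical, and the adder width is assumed to be a multiple of the group size. The mux expression covers both cases of the group carryout, of which Equation 2.12 describes the skip case.

```python
def full_adder(a, b, cin):
    """Return (sum, carryout, propagate) for one bit position."""
    p = a ^ b                          # propagate, shared with the sum logic
    return p ^ cin, (a & b) | (p & cin), p

def cska_group(a_bits, b_bits, group_cin):
    """One M-bit carry-skip group (LSB first): returns (sum_bits, group_cout)."""
    sums, carry, group_p = [], group_cin, 1
    for a, b in zip(a_bits, b_bits):
        s, carry, p = full_adder(a, b, carry)
        sums.append(s)
        group_p &= p                   # AND of all propagates (Eq. 2.11)
    # 2-to-1 mux: pass the group carry-in when every bit propagates,
    # otherwise pass the carryout of the most significant full adder.
    group_cout = group_cin if group_p else carry
    return sums, group_cout

def cska_add(a, b, n=16, m=4):
    """N-bit carry skip adder built from N/M groups (n must be a multiple of m)."""
    carry, total = 0, 0
    for lo in range(0, n, m):
        a_bits = [(a >> (lo + i)) & 1 for i in range(m)]
        b_bits = [(b >> (lo + i)) & 1 for i in range(m)]
        sums, carry = cska_group(a_bits, b_bits, carry)
        total |= sum(s << (lo + i) for i, s in enumerate(sums))
    return total | (carry << n)

# A carry generated in the first group skips the middle groups:
print(hex(cska_add(0b1111111111111000, 0b0000000000001000)))
```

In this functional model the mux does not change the numerical result, since a fully propagating group would ripple the same carry through anyway; its benefit in hardware is purely the shorter path.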
In examining the critical path for the CSKA, we are primarily concerned with whether the carry-in can be propagated ("skipped") across a group or not. Assuming all input bits arrive at the adder at the same time, each group can calculate its group propagate signal (the mux select line) simultaneously. Every mux then knows which signal to pass as the carryout of the group. There are two cases to consider after the mux select line has been determined. In the first case, $\text{carry-in}_{CSKAgroup}$ will propagate to the carryout. This means $P_{CSKAgroup} = 1$ and the carryout is dependent on the carry-in. In the second case, the carryout signal of the most significant adder will become the group carryout. This means $P_{CSKAgroup} = 0$ and the carryout is independent of the carry-in. If we isolate a particular group (as in Figure 2.7), the second case (signal $c_{out4}$) always takes longer because the carryout signal must be calculated through logic, whereas the first case ($\text{carry-in}_{CSKAgroup}$) requires only a wire to propagate the signal. Looking at the whole architecture, however, this second case is part of the critical path for only the first CSKA group. Since the second case is not dependent on the group carry-in, all the groups in the CSKA can compute this carryout in parallel. If a group needs its carry-in ($P_{CSKAgroup} = 1$), then it must wait until the carry-in arrives after being calculated by a previous group. In the worst case, a carryout must be calculated in the first group, and every group afterwards needs to propagate this carryout. When the final group receives this propagated signal, it can calculate its sum bits. Figure 2.8 shows a 16-bit CSKA with 4-bit groups, and Figure 2.9 shows a darkened line indicating the critical path of the signals in the 16-bit CSKA. If we assume a 16-bit CSKA with 4-bit groups, with each group containing a 4-bit RCA for the sum logic, then the worst-case propagation delay through this adder is expressed in Equation 2.13.
In this equation, $t_{RCAcarry}$ and $t_{RCAsum}$ are the delays to calculate the carryout and sum signals of an RCA, respectively. Each group has 4 bits, so the delay through the first group is 4 RCA carryout delays. This carryout of the first group potentially propagates through 3 muxes, where one mux delay is expressed as $t_{muxdelay}$. Finally, when the carryout signal reaches the final group, the sum for this group can be calculated. This is represented by the final two components of Equation 2.13.
Figure 2.7 One group in a Carry Skip Adder, in this case M=4
Figure 2.8 A 16-bit Carry Skip Adder N=16, M=4
Figure 2.9 Critical path through 16-bit CSKA
$t_{CSKA16} = 4 \cdot t_{RCAcarry} + 3 \cdot t_{muxdelay} + 3 \cdot t_{RCAcarry} + t_{RCAsum}$ (2.13)
For Equation 2.13, there are some assumptions about the delay through the circuit. First, we assume that in the first CSKA group the group propagate signal is calculated before the carryout of the most significant adder. Thus, the mux for this first group is waiting for the carryout. For the final CSKA group, we assume that it takes longer for sum15 to be calculated than for sum16. Once the carry-in for this last group is known, the delay for sum16 is the delay of the mux; for sum15 it is a delay of $3 \cdot t_{RCAcarry} + t_{RCAsum}$ (3 ripples through the adder before the last sum bit can be calculated). For an N-bit CSKA, the critical path is expressed in Equation 2.14, where M represents the number of bits in each group. There are N/M groups in the adder, and every mux except the last one is in the critical path. As in Equation 2.13, Equation 2.14 assumes that each group contains a ripple carry adder.
$t_{CSKAN} = M \cdot t_{RCAcarry} + \left(\frac{N}{M} - 1\right) \cdot t_{muxdelay} + (M-1) \cdot t_{RCAcarry} + t_{RCAsum}$ (2.14)
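To see how Equation 2.14 compares with the ripple carry delay of Equation 2.10, both expressions can be evaluated side by side. The sketch below uses placeholder unit delays (every gate delay set to 1), which is an assumption for illustration and not a measured result; the function names are hypothetical.

```python
def rca_delay(n, t_carry, t_sum):
    """Worst-case delay of an N-bit ripple carry adder (Equation 2.10)."""
    return (n - 1) * t_carry + t_sum

def cska_delay(n, m, t_carry, t_mux, t_sum):
    """Worst-case delay of an N-bit carry skip adder with M-bit groups (Eq. 2.14)."""
    groups = n // m                      # assumes N is a multiple of M
    return m * t_carry + (groups - 1) * t_mux + (m - 1) * t_carry + t_sum

# For N=16, M=4 this reduces to Equation 2.13: 4 carries, 3 muxes, 3 carries, 1 sum.
print("16-bit CSKA:", cska_delay(16, 4, t_carry=1, t_mux=1, t_sum=1))
print("16-bit RCA: ", rca_delay(16, t_carry=1, t_sum=1))

# Both delays still grow linearly with N, but the CSKA slope is smaller.
for n in (16, 32, 64):
    print(n, rca_delay(n, 1, 1), cska_delay(n, 4, 1, 1, 1))
```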
From a VLSI design perspective, this adder shows improved speed over an RCA without much area increase. The additional hardware comes from the 2-to-1 mux and group propagate logic in each group, which is about 15% more area. One drawback of this structure is that its delay is still linearly dependent on the width of the adder; therefore, for large adders where speed is important, the delay may be unacceptable. Also, there is a long wire between the groups that $\text{carry-out}_{CSKAgroup}$ needs to travel on. This path begins at the carryout of the first CSKA group and ends at the carry-in to the final CSKA group. This signal also needs to travel through ((N/M)-1) muxes, and these will introduce long delays and signal degradation if pass-gate muxes are used. If buffers are required between these groups to restore the signal, then the critical path is lengthened. An example of a worst-case delay input pattern for a 16-bit CSKA with 4-bit groups is where the input operands are 1111111111111000 and 0000000000001000. This forces a carryout in the first group that skips through the middle two groups and enters the final group. This carry-in to the final group ripples through to the final sum bit (sum15). To determine the optimal speed for this adder, one needs to find the delay through a mux and the carryout delay of a FA. It is one of these two delays that will dominate the delay of the whole CSKA. For short adders (≤ 16 bits), the carryout delay of a FA will probably dominate, and for long adders the long wire that skips through stages and muxes will probably dominate.
2.3.3 Carry Look Ahead Adder
From the critical path equations in Sections 2.3.1 and 2.3.2, the delay is linearly dependent on N, the length of the adder. Equations 2.10 and 2.14 also show that the carryout delay ($t_{carryout}$) contributes largely to the total delay. An algorithm that reduces the time to calculate the carryout and removes the linear dependency on N can greatly speed up the addition operation. Equation 2.9 shows that the carryout can be calculated from g, p, and carry-in. The signals g and p are not dependent on carry-in, and can be calculated as soon as the two input operands arrive. Weinberger and Smith invented the Carry Look Ahead (CLA) Adder [19]. Using Equation 2.9, we can write the carryout equations for a 4-bit adder. These are shown in Equations 2.15 to 2.18, where $c_{i+1}$ represents the carryout of the ith position (0 ≤ i ≤ N − 1) and $g_i$ and $p_i$ are the generate and propagate signals of that position. Each carryout can thus be calculated with just the input operands and the initial carry-in ($c_0$). This process of calculating $c_i$ using only the $p_i$, $g_i$ and $c_0$ signals can be continued indefinitely; however, each subsequent carryout generated in this manner becomes increasingly difficult to implement because of the large number of high fan-in gates [20].
$c_1 = g_0 + p_0 \cdot c_0$ (2.15)
$c_2 = g_1 + p_1 \cdot c_1 = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_0$ (2.16)
$c_3 = g_2 + p_2 \cdot c_2 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0 + p_2 \cdot p_1 \cdot p_0 \cdot c_0$ (2.17)
$c_4 = g_3 + p_3 \cdot c_3 = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 + p_3 \cdot p_2 \cdot p_1 \cdot p_0 \cdot c_0$ (2.18)
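The expanded carry equations can be checked against ordinary addition. The sketch below builds each carry directly from the g, p and c0 signals, following the sum-of-products pattern of Equations 2.15 to 2.18 rather than rippling from one carry to the next. The function name is hypothetical and the OR form of propagate is assumed.

```python
def cla_carries(a, b, c0=0, n=4):
    """Return [c1, ..., cn], each computed only from g, p and c0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # generate per bit
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]    # propagate per bit (OR form)
    carries = []
    for i in range(n):
        # c_{i+1} = g_i + p_i g_{i-1} + ... + p_i ... p_1 g_0 + p_i ... p_0 c0
        product_of_p, c = 1, 0
        for j in range(i, -1, -1):
            c |= product_of_p & g[j]
            product_of_p &= p[j]
        c |= product_of_p & c0
        carries.append(c)
    return carries

# Every look-ahead carry must match the carry of a plain addition.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            for i, c in enumerate(cla_carries(a, b, c0)):
                mask = (1 << (i + 1)) - 1
                assert c == ((a & mask) + (b & mask) + c0) >> (i + 1)
print(cla_carries(0b1011, 0b0110))
```

The number of product terms for $c_{i+1}$ grows with i, which is the fan-in problem the text refers to.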
The CLA adder uses partial full adders as described in Section 2.2.3 to calculate the generate and propagate signals needed for the carryout equations. Figure 2.10 shows the schematic for a 4-bit CLA adder. The CLA logic block implements the logic in Equations 2.15 to 2.18, and the gate schematic for this block is in Figure 2.11. For a 4-bit CLA adder the 4th carryout signal can also be considered as the 5th sum bit. Although it is impractical to have a single level of carry look-ahead logic for long adders, this can be solved by adding another level of carry look-ahead logic. To achieve this, each adder block requires two additional signals: a group generate and a group propagate. The equations for these two signals, assuming adder block sizes of 4 bits, are shown in Equations 2.19 and 2.20. A group generate occurs if a carry is generated within the adder block, and a group propagate occurs if the carry-in to the adder block will be propagated to the carryout. Figure 2.11 shows the gate schematic of the two additional signals.
$G_{group} = g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0$ (2.19)
$P_{group} = p_3 \cdot p_2 \cdot p_1 \cdot p_0$ (2.20)
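The group generate and group propagate signals can be used to obtain the block carries of a 16-bit adder, as a second level of look-ahead logic would. The following sketch is a functional model with hypothetical function names; for brevity it evaluates the second-level carries one block after another, whereas the hardware expands them in the same sum-of-products form as Equations 2.15 to 2.18.

```python
def bit_gp(a, b, n):
    """Per-bit generate and propagate (OR form) lists, LSB first."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    return g, p

def group_gp(g, p):
    """Group generate and group propagate of one 4-bit block (Eqs. 2.19, 2.20)."""
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    gp = p[3] & p[2] & p[1] & p[0]
    return gg, gp

def block_carries_16(a, b, c0=0):
    """Carries c4, c8, c12, c16 of a 16-bit add from the group signals."""
    g, p = bit_gp(a, b, 16)
    carries, c = [], c0
    for blk in range(4):
        gg, gp = group_gp(g[4 * blk:4 * blk + 4], p[4 * blk:4 * blk + 4])
        c = gg | (gp & c)          # same form as Eq. 2.15, applied to a block
        carries.append(c)
    return carries

print(block_carries_16(0xFFF8, 0x0008))   # carry generated in block 0, then propagated
```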
With multiple levels of CLA logic, carry look-ahead adders of any length can be built. To illustrate the use of another level of CLA logic, Figure 2.11 shows the schematic for a 16-bit CLA adder. There is a second level of CLA logic which takes the group generate and group propagate signals from each 4-bit adder sub cell and calculates the carryout signals for each adder block. If an adder has multiple levels of CLA logic, only the final level needs to generate the c4 signal. All other levels replace this c4 signal with the group generate and group propagate. The CLA logic for this 16-bit adder is identical to the CLA logic for the 4-bit adder in Figure 2.10; therefore the equations for the carryout signals are Equations 2.15 to 2.18.
Figure 2.10 4-bit carry look-ahead adder
Figure 2.11 Schematic for a 16-bit CLA adder
A third level of CLA logic and four 16-bit adder blocks can be used to build a 64-bit adder. The CLA logic would create the c16, c32, and c48 signals to be used as carry-ins to the 16-bit adder blocks, and c64 as the sum64 signal. If a design calls for an adder of length 32, a designer can simply use two 16-bit adder blocks and the first two carryout signals (c16, c32) from the third level of CLA logic. The identical hardware in the CLA logic, coupled with the fact that the adder blocks can be instantiated as sub cells, makes building long adders with this architecture simple. Determining the critical path for a CLA adder is difficult because the gates in the carry path have different fan-in. To get a general idea, we first assume that all gate delays are the same. The delay for a 4-bit CLA adder then consists of one gate delay to calculate the propagate and generate signals, two gate delays to calculate the carry signals, and one gate delay to calculate the sum signals; this equates to four gate delays. For a 16-bit CLA adder there is one gate delay to calculate the propagate and generate signals (from the PFA), two gate delays to calculate the group propagate and generate in the first level of carry logic, two gate delays for the carryout signals in the second level of carry logic, and one gate delay for the sum signals. The second level of carry logic for the 16-bit CLA adder contributes an additional two gate delays over the 4-bit CLA adder, thus increasing the total to six gate delays. Continuing in this manner (a 64-bit add takes eight gate delays, a 256-bit add takes ten gate delays), we see that the delay of a CLA adder depends on the number of levels of carry logic, and not directly on the length of the adder. If a group size of four is chosen, then the number of levels in an N-bit CLA is expressed in Equation 2.21, and in general the number of levels in a CLA for a group size of k is expressed in Equation 2.22. For an N-bit CLA adder, each level of carry logic introduces two gate delays, in addition to a gate delay for the generate and propagate signals and a gate delay for the sum. The total gate delay is expressed in Equation 2.23, which shows that the delay of a CLA adder is logarithmically dependent on the size of the adder. This theoretically results in one of the fastest adder architectures.
CLA levels (with group size of 4) $= \lceil \log_4 N \rceil$ (2.21)
CLA levels (with group size of k) $= \lceil \log_k N \rceil$ (2.22)
CLA gate delay $= 2 + 2 \cdot \lceil \log_k N \rceil$ (2.23)
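Equations 2.21 to 2.23 can be evaluated directly to reproduce the gate-delay counts quoted above (four, six, eight and ten gate delays for 4, 16, 64 and 256 bits). The sketch computes the ceiling of the logarithm with integer arithmetic to avoid floating-point rounding; the function names are hypothetical, and the equal-gate-delay assumption is the one stated in the text.

```python
def cla_levels(n, k=4):
    """Number of carry-logic levels: smallest L with k**L >= n (Eqs. 2.21, 2.22)."""
    levels, span = 0, 1
    while span < n:
        span *= k
        levels += 1
    return levels

def cla_gate_delay(n, k=4):
    """Total gate delays assuming all gate delays are equal (Eq. 2.23)."""
    return 2 + 2 * cla_levels(n, k)

for n in (4, 16, 64, 256):
    print(f"{n:3d}-bit CLA: {cla_levels(n)} level(s), {cla_gate_delay(n)} gate delays")
```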
From a VLSI design perspective, this adder may take more time to implement, but the architecture is still regular enough to allow long adders to be built fairly easily. The reuse of the CLA logic contributes to the feasibility of building a long adder without additional design time. Also, after an adder is built, it can be used as a sub cell, as is done with the 4-bit adders as blocks in the 16-bit CLA adder. A drawback of CLA adders is their larger area. There is a large amount of hardware dedicated to calculating the carry bits from cell to cell. However, if the application calls for high performance, then the benefits of decreased delay can outweigh the larger area.
2.3.4 Carry Select Adder
Adding two numbers by using redundancy can speed addition even further. That is, for any number of sum bits we can perform two additions, one assuming the carry-in is 1 and one assuming the carry-in is 0, and then choose between the two results once the actual carry-in is known. This scheme, proposed by Sklansky in 1960, is called conditional-sum addition [21]. An implementation of this scheme was first realized by Bedrij and is called the Carry Select Adder (CSLA) [22]. The CSLA divides the adder into blocks that have the same input operands except for the carry-in. Figure 2.12 shows a possible implementation for a 16-bit CSLA using ripple carry adder blocks. The carryout of the first block is used as the select line for the 9-bit 2-to-1 mux. The second and third blocks calculate the signals sum16 to sum8 in parallel, with one block having its carry-in hardwired to 0 and the other hardwired to 1. After one 8-bit ripple adder delay there is only the delay of the mux to choose between the results of block 2 or 3. Equation 2.24 shows the delay for this adder. The 16-bit CSLA can also be built by dividing it into even more blocks. Figure 2.13 shows the block diagram for the adder if it were divided into 4-bit RCA blocks. Equation 2.25 expresses the delay for this structure.
$t_{CSLA16a} = t_{8bitRCA} + t_{9bitmux}$ (2.24)
$t_{CSLA16b} = t_{4bitRCA} + 3 \cdot t_{5bitmux}$ (2.25)
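The select mechanism can be sketched functionally as follows. Each block after the first computes both candidate results, and the real carry from the previous block picks one of them. The ripple carry blocks are modelled arithmetically for brevity, the function names are hypothetical, and the width is assumed to be a multiple of the block size.

```python
def rca(a, b, cin, width):
    """Ripple carry block modelled arithmetically: returns (sum, carryout)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width

def csla_add(a, b, n=16, block=4):
    """Linear carry select adder with equal-width blocks (n multiple of block)."""
    mask = (1 << block) - 1
    result, carry = rca(a & mask, b & mask, 0, block)   # first block: one RCA
    for lo in range(block, n, block):
        x, y = (a >> lo) & mask, (b >> lo) & mask
        s0, c0 = rca(x, y, 0, block)    # result assuming carry-in = 0
        s1, c1 = rca(x, y, 1, block)    # result assuming carry-in = 1
        # (block+1)-bit 2-to-1 mux: the previous block's carry selects sum and carry
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << lo
    return result | (carry << n)

print(hex(csla_add(0xFFFF, 0x0001)))            # 4-bit blocks, as in Figure 2.13
print(hex(csla_add(0x12FF, 0x0001, block=8)))   # 8-bit blocks, as in Figure 2.12
```

One of the two candidate results in every block is discarded on each addition, which is the area cost that the BEC-based and proposed designs try to reduce.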
The CSLA described so far is called the Linear Carry Select Adder, because its delay is linearly dependent on the length of the adder. In the worst case, the carry signal must ripple through each mux in the adder. Also, notice that the sub cells all finish their additions at the same time, yet the more significant bits wait at the input of the mux to be selected. From a VLSI design perspective, the CSLA uses a large amount of area compared to the other adders. There is hardware in this architecture that computes results which are thrown away on every addition, but the fact that the delay for an addition can be replaced by the delay of a mux makes this architecture very fast. Also, the Linear CSLA has regularity that makes it easier to lay out.
Figure 2.12 Schematic for a 16-bit CSLA with 8-bit RCA blocks
Figure 2.13 Schematic for a 16-bit CSLA with 4-bit RCA blocks
2.3.5 SQRT Carry Select Adder
To increase the speed further, the SQRT technique was developed. In this design the number of bits per block is not constant; it depends on the total number of bits, and the corresponding delay equation is given in Equation 2.26. Using this technique, the bits per block for a 16-bit SQRT CSLA are 2-2-3-4-5. For 8 bits the sequence is 1-3-4. The 16-bit SQRT CSLA is shown in Figure 2.14.
$t_{add} = t_{setup} + m \cdot t_{carry} + \sqrt{2n} \cdot t_{mux} + t_{sum}$ (2.26)
Figure 2.14 Schematic for a 16-bit SQRT CSLA
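Equation 2.26 can be evaluated numerically to see how the delay grows with the adder width. The sketch below is illustrative only: the function name `sqrt_csla_delay`, the unit delay values, and the reading of n as the total width and m as the first-block width are assumptions, and the partition 2, 2, 3, 4, 5 is the five-group split of a 16-bit adder.

```python
import math


def sqrt_csla_delay(n, m, t_setup=1.0, t_carry=1.0, t_mux=1.0, t_sum=1.0):
    """Evaluate equation 2.26 for an n-bit adder whose first block has m bits.

    All delay arguments are placeholder unit delays, not measured values.
    """
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum


# A five-group partition of 16 bits with growing block sizes.
blocks_16 = [2, 2, 3, 4, 5]
assert sum(blocks_16) == 16

for width in (16, 64):
    print(width, round(sqrt_csla_delay(width, m=2), 3))
```

With unit delays, quadrupling the width from 16 to 64 bits only doubles the mux term, which is the point of the square-root scheme.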
2.4 Low power design techniques
Designing systems for low power is not a straightforward task, as it involves all the IC design stages, beginning with the system behavioral description and ending with the fabrication and packaging processes. In some of these stages there are clear guidelines and steps that reduce power consumption, such as decreasing the power-supply voltage. In other stages there are no clear steps to follow, so statistical or probabilistic heuristic methods are used to estimate the power consumption of a given design. There are three major components of power dissipation in complementary metal–oxide–semiconductor (CMOS) circuits.
1) Switching Power: Power consumed by the circuit node capacitances during transistor switching.
2) Short Circuit Power: Power consumed because of the current flowing from power supply to ground during transistor switching.
3) Static Power: Due to leakage and static currents.
The first two components are referred to as dynamic power, as given in equation 2.1. Dynamic power constitutes the majority of the power dissipated in CMOS VLSI circuits. It is the power dissipated during charging or discharging the load capacitances of a given circuit. It depends on the input pattern, which will either cause the transistors to switch (consuming dynamic power) or not to switch (consuming no dynamic power) at every clock cycle. The summation in that equation is over all the nodes of the circuit. Reducing any of these components lowers the power consumption, although it is equally important to increase the system clock frequency for faster operation. Estimating the power of a large circuit is a complex task. Heuristic algorithms and statistical and probabilistic methods are used to generate random input patterns to test the switching activity of the circuit. These methods become less accurate as the size of the circuit increases. It is better to decompose the large circuit into smaller modules and then use these methods to estimate the power consumption of each module. When the decomposed modules are small enough, exact methods can be used to optimize their performance.
2.4.1 Transistor sizing optimization
The transistor sizing for optimal performance is technology dependent. As the demand for high speed, low power consumption and high packing density continues to grow each year, there is a need to scale devices to smaller dimensions. As the market trend moves towards a greater scale of integration, the move towards a reduced supply voltage also has the advantage of improving the reliability of IC components of ever-reducing dimensions. This is easily understood if one recalls that IC components with smaller dimensions are more prone to breakdown at high voltages. It has already been accepted that scaled-down CMOS devices, even at 2.5 V, do not sacrifice device performance as they maintain device reliability.
Scaling brings about the following benefits: improved device characteristics for low voltage operation due to the improvement in the current driving capabilities, reduced capacitance through small geometries and junction capacitances, improved interconnect technology, and higher density of integration.
The major device problem associated with simple scaling lies in the increase of the threshold voltage and the decrease of the carrier surface mobility when the substrate doping concentration is increased to prevent punch-through.
2.4.2 Low-power clock distribution
The clock network constitutes one of the most important parts of a synchronous very large scale integration (VLSI) chip, as it can significantly influence the speed, area, and power dissipation of the system. Recent research on clock network construction has developed procedures for building zero or near-zero skew clock networks with sharp clock edge rates at the clock utilization points. However, one major drawback associated with clock networks is their power dissipation. Studies have shown that the clock network can dissipate 20–50% of the total power on a chip. In the context of the growing importance of low-power designs for portable electronics, it is necessary to develop strategies to significantly reduce the power dissipation of the clock network, since this will lead to a major reduction in the overall power dissipation of the chip. By using a lower Vdd to distribute the signal over the chip, the clock network can be made to dissipate less power. However, for reasons related to performance requirements, the rest of the circuitry on the chip may use a higher Vdd, which implies that the clock levels have to be converted to this higher value at the utilization points.
2.4.3 Low power design through voltage scaling
Equation (2.1) shows that the average switching power dissipation is proportional to the square of the power supply voltage; hence, reduction of VDD will significantly reduce the power consumption. If the power supply voltage is scaled down while all other parameters are kept constant, the propagation delay time increases. The dependence of circuit speed on the power supply voltage therefore also affects the relationship between power and supply voltage. Equation (2.1) suggests that a quadratic reduction of power consumption is possible as the power supply voltage is reduced. If, however, the circuit is always operated at the maximum frequency allowed by its propagation delay, the operating frequency, or the number of switching events per unit time, will drop as the propagation delay becomes larger with the reduction of the power supply voltage. The net result is that the dependence of switching power dissipation on the power supply voltage becomes stronger than a simple quadratic relationship. The propagation delay expressions show that the negative effect of reducing the power supply voltage upon delay can be compensated for if the threshold voltage of the transistors (VT) is scaled down accordingly. However, this approach is limited because the threshold voltage may not be scaled to the same extent as the supply voltage. When scaled linearly, reduced threshold voltages allow the circuit to produce the same speed performance at a lower VDD.
2.4.4 Reduction of switching activity
Switching activity can be reduced by algorithmic optimization, proper choice of logic topology, glitch reduction, and gated clock signals.
Algorithmic optimization
This depends heavily on the application and the characteristics of the data, such as dynamic range, correlation, and statistics of data transmission. The representation of data can have a significant impact on switching activity at the system level. In applications where data bits change sequentially and are highly correlated, the use of Gray coding leads to a reduced number of transitions compared to binary coding. Another example is the use of sign-magnitude representation instead of conventional two's complement representation for signed data. A change in sign will cause transitions of the higher order bits in the two's complement representation, whereas only the sign bit will change in sign-magnitude representation. Hence, switching activity can be reduced by using the sign-magnitude representation in applications where the data sign changes are frequent.
Glitch reduction
An important architecture level measure to reduce switching activity is based on delay balancing and reduction of glitches.
In multi-level logic circuits, the propagation delay from one logic block to the next can cause spurious signal transitions, or glitches. Glitches occur primarily due to a mismatch or imbalance in the path lengths in the logic network. Such a mismatch in path lengths results in a mismatch of signal timing with respect to the primary inputs. Redesigning the logic network in order to balance the delay paths can significantly reduce glitches and, consequently, the dynamic power dissipation in complex multi-level networks.
Gated Clock Signals
Another effective design technique for reducing the switching activity in CMOS logic circuits is the use of conditional or gated clock signals. If certain logic blocks in a system are not immediately used during the current clock cycle, temporarily disabling the clock signals of these blocks will save switching power that is otherwise wasted. As an example, consider an N-bit number comparator, which compares the magnitudes of two unsigned N-bit binary numbers and produces an output to indicate which one is larger. In the conventional approach, all input bits are first latched into two N-bit registers and subsequently applied to the comparator circuit. In this case, the two N-bit register arrays dissipate power in every cycle. Yet, if the most significant bits of the two binary numbers are different from each other, the decision can be made by comparing the MSBs only. The two MSBs are latched in a two-bit register which is driven by the original system clock. At the same time, these two bits are applied to an XNOR gate, and its output is used to generate the gated clock signal with an AND gate. If the two MSBs are different, the XNOR produces logic 0 at the output, disabling the clock signal of the lower-order registers. If the two MSBs are the same, the gated clock signal is applied to the lower-order registers and the decision is made by the (N-1)-bit comparator. The gated clock strategy effectively reduces the overall switching power dissipation of the system by about 50%, since a large portion of the system is disabled for half of all input combinations.
2.4.5 Reduction of switching capacitance
The amount of switched capacitance plays a significant role in the dynamic power dissipation of the circuit.
Hence, reduction of this parasitic capacitance is a major goal for low-power design of digital integrated circuits.
System-Level Measures
At the system level, one approach to reduce the switched capacitance is to limit the use of shared resources. If a single shared bus is connected to all modules, for example, a large bus capacitance comes into play due to the large number of drivers and receivers sharing the same transmission medium, and the parasitic capacitance of the long bus line. Driving the large capacitance requires a significant amount of power during each bus access. Alternatively, the global bus structure can be partitioned into a number of smaller dedicated local buses to handle the data transmission between neighboring modules. As a result, the switched capacitance during each bus access is significantly reduced, although multiple buses may increase the overall routing area on the chip.
Circuit-Level Measures
The type of logic style used to implement a digital circuit also affects the output load capacitance of the circuit. The capacitance is a function of the number of transistors that are required to implement a given function. Pass-gate logic design is attractive since fewer transistors are required for certain functions such as XOR and XNOR. However, pass-transistor structures typically require complementary control signals, so dual-rail logic is used to provide all signals in complementary form. This diminishes the inherent advantages of pass-transistor logic gates over conventional CMOS logic. Thus, the use of pass-transistor logic gates to achieve low power dissipation must be carefully considered, and the choice of logic design style must ultimately be based on a detailed comparison of all design aspects such as silicon area, overall delay and switching power dissipation.
Mask-Level Measures
The amount of parasitic capacitance that is switched (i.e., charged up or charged down) during operation can also be reduced at the physical design level, or mask level. A simple mask-level measure to reduce power dissipation is keeping the transistors at minimum dimensions whenever possible and feasible, thereby minimizing the parasitic capacitances.
Designing a logic gate with minimum-size transistors certainly affects the dynamic performance of the circuit, and this trade-off between dynamic performance and power dissipation should be carefully considered in critical circuits.
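Two of the claims in Section 2.4 can be checked with a few lines of arithmetic: that Gray coding reduces bit transitions for sequential data, and that switching power falls with the square of the supply voltage. The sketch below is an illustration under stated assumptions. Equation 2.1 is not reproduced in this section, so the product form `alpha * C * Vdd^2 * f` used in `dynamic_power` is the usual textbook expression and is assumed here; all function names and numeric values are hypothetical.

```python
def bit_transitions(codes):
    """Total number of bit flips between consecutive code words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


def gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)


def dynamic_power(alpha, c_load, vdd, f_clk):
    """Assumed switching-power form: activity * capacitance * Vdd^2 * frequency."""
    return alpha * c_load * vdd ** 2 * f_clk


# A 4-bit counter stepping through 0..15 in binary and in Gray code.
binary_seq = list(range(16))
gray_seq = [gray(i) for i in binary_seq]
binary_flips = bit_transitions(binary_seq)
gray_flips = bit_transitions(gray_seq)
print("binary flips:", binary_flips, "gray flips:", gray_flips)

# Halving the supply voltage with everything else fixed (placeholder values).
p_high = dynamic_power(alpha=0.5, c_load=1e-12, vdd=5.0, f_clk=100e6)
p_low = dynamic_power(alpha=0.5, c_load=1e-12, vdd=2.5, f_clk=100e6)
print("power ratio:", p_low / p_high)
```

The power ratio holds only while the clock frequency is kept constant; as noted in Section 2.4.3, a circuit run at its maximum frequency also slows down when the supply voltage is reduced.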
2.5 Different logic styles
Several logic styles, including several variants of static CMOS, have been used to implement low-power 1-bit full adder cells. Each design style has its own merits and demerits. In general, they can be broadly divided into two major categories: 1) static logic style and 2) dynamic logic style. A major distinction, also with respect to power dissipation, must be made between static and dynamic logic styles. As opposed to static gates, dynamic gates are clocked and work in two phases, a precharge and an evaluation phase. The logic function is realized in a single NMOS pull-down or PMOS pull-up network, resulting in small input capacitances and fast evaluation times. This makes dynamic logic attractive for high speed applications. However, the large clock loads and the high signal transition activities due to the precharging mechanism result in excessively high power dissipation. Also, the usage of dynamic gates is not as straightforward and universal as it is for static gates, and robustness is considerably degraded. With the exception of some very special circuit applications, dynamic logic is not a viable candidate for low-power circuit design.
Although full adder cells all perform the same function, their ways of generating the intermediate nodes and the outputs are different, the loads on the inputs and intermediate nodes are different, and the transistor count varies significantly. The standard implementations of the full adder cell are the following:
1) The double pass-transistor logic full adder, which uses both N and P channel transistors, with dual logic paths for every function. It uses 28 transistors.
2) The complementary pass-transistor logic (SR-CPL) full adder, which has 26 transistors and uses the CPL logic family.
3) The multiplexer-based low power full adder, which uses 34 transistors and only multiplexer operations.
All these adder cells are compared based on power consumption, speed, power-delay product, area, and driving capability. Classical full-adder designs normally use only one logic style for the whole full adder. Hybrid designs, in contrast, exploit the features of different logic styles to improve upon the performance of designs that use a single logic style. All hybrid designs use the best available modules implemented in different logic styles, or enhance the available modules, in an attempt to build a low power full-adder cell. Generally, the main focus in such attempts is to reduce the number of transistors in the adder cell and, consequently, the number of power dissipating nodes. In doing so, designers often trade off other vital requirements such as driving capability, noise immunity, and layout complexity. Most of these adders lack driving capability because the inputs are coupled to the outputs. Their performance as a single unit or in small chains is good, but when large adders are built by cascading these 1-bit full-adder cells, the performance degrades drastically. The degradation can be handled by inserting buffers between stages to improve the delay characteristics. However, this adds overhead, and the initial advantage of having fewer transistors is lost.
CHAPTER 3 DESIGN OF ALU AND SQRT CSLA
3.1 Introduction to ALU and SQRT CSLA
The arithmetic logic unit (ALU) is one of the main components inside a microprocessor. It is a digital circuit responsible for performing arithmetic and logic operations such as addition, subtraction, increment, decrement, logical AND, logical OR, logical XOR and logical XNOR. Generally the performance of the ALU is limited by the adder because of carry propagation. Many adders have been proposed to reduce the carry propagation delay. To speed up the operation, the Ripple Carry Adder (RCA) is modified into the CSLA, and to achieve more speed the CSLA is replaced by the SQRT CSLA. The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [8]-[9]. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by the multiplexers (mux). To achieve better area efficiency [10]-[14], a Binary to Excess-1 Converter (BEC) is used in place of the RCA with Cin=1 in the regular CSLA. The 16-bit SQRT CSLA is divided into different blocks. The block size and the number of blocks depend on the size of the SQRT CSLA according to the SQRT technique. From the second block onwards, each block contains three levels: the first level is a ripple carry adder with input carry zero, the second level is a ripple carry adder with input carry one, and the third level is a multiplexer that selects one of the ripple carry adder outputs according to the carry of the previous block. The disadvantage of the SQRT CSLA is its larger area requirement, as it uses two levels of RCAs. To reduce the area, a BEC is used in place of the second-level RCA. In place of a 2-bit RCA, a 3-bit BEC is used.
3.1.1 Delay and Area evaluation methodology of the basic adder blocks
The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in fig 3.1. For the delay evaluation we add up the number of gates in the longest path, and for the area evaluation we count the gates in each block. Using this approach, the CSLA adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated and listed in Table 3.1.
Table 3.1 Delay and area for basic gates
Figure 3.1 AOI implementation of XOR gate
3.1.2 Binary to Excess one Converter (BEC)
As stated above, the main idea of this work is to use a BEC instead of the RCA with Cin=1 in order to reduce the area and power consumption of the regular CSLA. To replace the n-bit RCA, an (n+1)-bit BEC is required. The structure and the function table of a 4-bit BEC are shown in Fig. 3.2 and Table 3.1.2, respectively.
Fig. 3.3 illustrates how the basic function of the CSLA is obtained by using the 4-bit BEC together with the mux. One input of the 8:4 mux is (B3, B2, B1, and B0) and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux selects either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction when CSLAs with a large number of bits are designed. The Boolean expressions of the 4-bit BEC are listed as (note the functional symbols NOT, & AND, XOR)
Fig 3.2 A 4- bit BEC
Fig 3.3 Functional block of CSLA
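The Boolean expressions of the 4-bit BEC are not legible in this copy, so the sketch below uses the standard gate-level form of an add-one circuit (one NOT, AND terms of the lower bits, and XOR gates) as an assumption, and checks it against plain arithmetic. The function names `bec4` and `bec_mux` are hypothetical; `bec_mux` models the 8:4 mux that picks either the direct inputs or the BEC output according to Cin.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter: returns (b + 1) modulo 16.

    Assumed gate-level form: X0 = NOT B0, X1 = B0 XOR B1,
    X2 = B2 XOR (B0 AND B1), X3 = B3 XOR (B0 AND B1 AND B2).
    """
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_mux(b, cin):
    """Select the direct input when cin is 0 and the BEC output when cin is 1."""
    return bec4(b) if cin else b


for value in (0b0000, 0b0111, 0b1111):
    print(format(value, "04b"), "->", format(bec4(value), "04b"))
```

Because the BEC only adds one to a result that the Cin=0 adder has already produced, it needs far fewer gates than a second ripple adder, which is the source of the area saving discussed in this chapter.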
Figure 3.4 Block diagram for a 16-bit SQRT CSLA
3.1.3 Delay and area evaluation methodology of regular 16-bit SQRT CSLA
The structure of the 16-b regular SQRT CSLA is shown in Fig. 3.4. It has five groups of different size RCA. In the delay and area evaluation of each group, the numerals within [] specify the delay values, e.g., sum2 requires 10 gate delays. The steps leading to the evaluation are as follows.
1) The group2 has two sets of 2-b RCA. Based on the delay values of Table 3.1, the arrival time of selection input c1 [time (t) = 7] of the 6:3 mux is earlier than s3 [t = 8] and later than s2 [t = 6]. Thus, sum3 [t = 11] is the summation of s3 and mux [t = 3], and sum2 [t = 10] is the summation of c1 and mux.
2) Except for group2, the arrival time of the mux selection input is always greater than the arrival time of the data outputs from the RCAs. Thus, the delay of group3 to group5 is determined, respectively, as follows:
3) One set of 2-b RCA in group2 has 2 FA for Cin=1 and the other set has 1 FA and 1 HA for Cin=0. Based on the area count of Table 3.1, the total number of gate counts in group2 is determined as follows:
4) Similarly, the estimated maximum delay and area of the other groups in the regular SQRT CSLA are evaluated and listed in Table 3.2.
Table 3.2 Delay and area for SQRT CSLA
3.1.4 Delay and area evaluation methodology of modified 16-bit SQRT CSLA
The structure of the proposed 16-b SQRT CSLA, using a BEC for the RCA with Cin=1 to optimize the area and power, is shown in Fig. 3.5. We again split the structure into five groups. The delay and area estimation of each group are shown in the figure.
Figure 3.5 A 16-bit SQRT CSLA using BEC
1) The group2 has one 2-b RCA, which has 1 FA and 1 HA for carry input zero. Instead of another 2-b RCA with carry input one, a 3-bit BEC is used, which adds one to the output of the 2-b RCA. The sum3 and the final c3 (output from mux) depend on s3 and mux and on the partial c3 (input to mux) and mux, respectively. The sum2 depends on c1 and mux.
2) For the remaining groups, the arrival time of the mux selection input is always greater than the arrival time of the data inputs from the BECs. Thus, the delay of the remaining groups depends on the arrival time of the mux selection input and the mux delay.
3) The area count of group2 is determined as follows:
Table 3.3 Delay and area for modified SQRT CSLA
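The group2 area counts can be reproduced with a short calculation. The values of Table 3.1 are not legible in this copy, so the sketch below derives the block areas from an assumed AOI composition in which every AND, OR and inverter counts as one unit (an XOR as two ANDs, one OR and two inverters, and so on). The dictionary names and the decomposition of the 6:3 mux into three 2:1 muxes are assumptions made for illustration.

```python
# Unit-gate areas: every AND, OR and NOT gate counts as one unit.
AREA = {"AND": 1, "OR": 1, "NOT": 1}
AREA["XOR"] = 2 * AREA["AND"] + AREA["OR"] + 2 * AREA["NOT"]   # AOI form of XOR
AREA["MUX2"] = 2 * AREA["AND"] + AREA["OR"] + AREA["NOT"]      # 2:1 multiplexer
AREA["HA"] = AREA["XOR"] + AREA["AND"]                         # half adder
AREA["FA"] = 2 * AREA["XOR"] + 2 * AREA["AND"] + AREA["OR"]    # full adder


def total_area(blocks):
    """Sum the unit-gate area of a dict mapping block name to instance count."""
    return sum(AREA[name] * count for name, count in blocks.items())


# Regular group2: one 2-b RCA with 2 FA, one with 1 FA + 1 HA, and a 6:3 mux.
regular_group2 = {"FA": 3, "HA": 1, "MUX2": 3}
# Modified group2: one 2-b RCA (1 FA + 1 HA), a 3-bit BEC and a 6:3 mux.
modified_group2 = {"FA": 1, "HA": 1, "XOR": 2, "AND": 1, "NOT": 1, "MUX2": 3}

print("regular group2 gates:", total_area(regular_group2))
print("modified group2 gates:", total_area(modified_group2))
```

Under these assumptions the saving in group2 comes entirely from replacing two full adders with the four gates of the 3-bit BEC.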
3.1.5 Transistor Level design of existing technique
1) Conventional full adder
A conventional full adder takes 28 transistors to implement the sum and carry functions. The conventional full adder is shown in figure 3.6.
2) A 2-bit RCA
A two-bit Ripple Carry Adder (RCA) is formed by connecting two full adders. It takes a total of 56 transistors to implement. It is shown in figure 3.7.
Figure 3.6 A conventional full adder
3) A 3-bit BEC
A 3-bit BEC uses two XOR gates, one AND gate and one NOT gate, which takes 32 transistors overall, whereas the 2-bit RCA that it replaces takes 56 transistors. A 3-bit BEC is shown in figure 3.8. A comparison between the 2-bit RCA and the 3-bit BEC is shown in table 3.4.
Figure 3.7 A 2-bit RCA using conventional full adder
Figure 3.8 Transistor level 3-bit BEC
Table 3.4 Comparison between 2-bit RCA and BEC

| Logic for second level | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|---|---|---|---|---|---|---|
| RCA using CMOS | 56 | 1.900 | 1342 | 6.706 | 42.565 | 49.271 |
| BEC using CMOS | 32 | 1.200 | 781 | 3.269 | 25.746 | 29.015 |
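The relative savings implied by Table 3.4 can be worked out directly from its entries. The sketch below only restates the table values; the helper name `percent_reduction` and the dictionary keys are introduced here for illustration.

```python
def percent_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# (RCA using CMOS, BEC using CMOS) pairs taken from Table 3.4.
table_3_4 = {
    "transistors": (56, 32),
    "delay_ns": (1.900, 1.200),
    "area_um2": (1342, 781),
    "total_power_uW": (49.271, 29.015),
}

reductions = {name: percent_reduction(*pair) for name, pair in table_3_4.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower with the BEC")
```

All four figures of merit improve by roughly 37 to 43 percent for this one block, which gives a concrete scale for the comparison that follows.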
Though the BEC technique reduces area and power [16], the reduction is not considerable, and the design is not suitable for sub-threshold level modifications. The drawback of this logic structure is that it does not reduce the area and power to a satisfactory level. There is also still scope to reduce the delay. In order to improve the delay, a new logic structure for a full-adder cell is proposed.
3.2 ALU
The arithmetic logic unit (ALU) is one of the main components inside a microprocessor. It is responsible for performing arithmetic and logic operations such as addition, subtraction, increment, decrement, logical AND, logical OR, logical XOR and logical XNOR. An ALU is a digital circuit that performs arithmetic and logical operations. The ALU is a fundamental building block of the Central Processing Unit (CPU) of a computer, and even the simplest microprocessors contain one. The processors found inside modern CPUs and Graphics Processing Units (GPUs) contain very powerful ALUs. We have designed the ALU using multiplexers and a full adder circuit. The input and output sections consist of 4x1 and 2x1 multiplexers, and the logic is implemented using the full adder, which performs the computing function of the ALU. A full adder can be defined as a combinational circuit that forms the arithmetic sum of three input bits. It has three inputs and two outputs. We have designed the ALU using a 4x1 mux, a 2x1 mux and an 8T full adder. All the blocks in the ALU are designed using Gate Diffusion Input (GDI).
3.2.1 GDI Technique
There is scope to reduce power, area and delay using the GDI cell technique. A simple GDI cell is shown in Fig. 3.9. Any Boolean function can be implemented using GDI cells. Low-swing problems arise because the inputs are applied directly to the sources of the P and N transistors: the N transistor is weak at passing logic high and the P transistor is weak at passing logic low. When a transition occurs from high to low at the P transistor source or from low to high at the N transistor source, a low-swing problem arises. What demands special emphasis is that in 50% of the cases the GDI cell operates as a regular CMOS inverter, which is widely used as a digital buffer for logic-level restoration. In some of these cases, when Vdd=1 without a swing drop from the previous stages, a GDI cell functions as an inverter buffer and recovers the voltage swing. Basic logic gates are shown in figure 3.10.
Figure 3.9 Simple GDI cell
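The logical behavior of a GDI cell can be sketched in a few lines. This is an idealized model that ignores the low-swing effects discussed above, and it assumes the usual GDI terminal roles: G drives both gates, P is the source of the PMOS transistor and N is the source of the NMOS transistor, so the output follows P when G is 0 and N when G is 1. The function names are hypothetical.

```python
from itertools import product


def gdi(g, p, n):
    """Ideal GDI cell: the PMOS passes P when G is 0, the NMOS passes N when G is 1."""
    return n if g else p


def gdi_not(a):
    return gdi(a, 1, 0)             # P tied high, N tied low: a CMOS inverter


def gdi_and(a, b):
    return gdi(a, 0, b)             # output is 0 when A is 0, otherwise B


def gdi_or(a, b):
    return gdi(a, b, 1)             # output is B when A is 0, otherwise 1


def gdi_xor(a, b):
    return gdi(a, b, gdi_not(b))    # passes B or its complement


def gdi_mux(sel, d0, d1):
    return gdi(sel, d0, d1)         # the cell itself is a 2:1 multiplexer


for a, b in product((0, 1), repeat=2):
    print(a, b, gdi_and(a, b), gdi_or(a, b), gdi_xor(a, b))
```

Each of these functions costs a single two-transistor cell (plus an inverter cell for XOR), which is why GDI implementations need fewer transistors than the corresponding static CMOS gates.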
Figure 3.10 Basic logic gates using GDI cell
3.2.2 A 10-transistor full adder
A full adder using the GDI technique takes 10 transistors, whereas the conventional full adder takes 28 transistors. It is shown in figure 3.11.
3.2.3 An 8-transistor full adder
A full adder can be implemented with 8 transistors using the GDI technique. The 10-transistor full adder differs from the 8-transistor full adder by two pull-up transistors. It is shown in figure 3.12.
Figure 3.11 A 10- transistor full adder
Figure 3.12 An 8-transistor full adder
3.2.4 A 1-bit ALU
The ALU is designed using multiplexers and a full adder circuit. The input and output sections consist of 4x1 and 2x1 multiplexers, and the logic is implemented using the full adder. A set of three select signals has been incorporated in the design to determine the operation being performed and the inputs and outputs being selected. Figure 3.13 shows the block diagram of the 1-bit ALU using two 4x1 multiplexers and one 2x1 multiplexer. The complement of B is used for the SUBTRACTION operation. The full adder performs the SUBTRACT operation by the two's complement method. Table 3.5 shows the truth table for the operations performed by the ALU based on the status of the select signals.
Table 3.5 Truth table of one bit ALU

| s2 | s1 | s0 | Operation |
|---|---|---|---|
| 0 | 0 | 0 | AND |
| 0 | 0 | 1 | XOR |
| 0 | 1 | 0 | XNOR |
| 0 | 1 | 1 | OR |
| 1 | 0 | 0 | DECREMENT |
| 1 | 0 | 1 | ADDITION |
| 1 | 1 | 0 | SUBTRACTION |
| 1 | 1 | 1 | INCREMENT |
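The operation selection of Table 3.5 can be expressed as a small behavioral model. This is a sketch of the function only, not of the multiplexer and full adder structure of Figure 3.13. The function name `alu`, the default 8-bit width, and the choice of operand A as the target of INCREMENT and DECREMENT are assumptions; subtraction is carried out by the two's complement method described in the text.

```python
def alu(sel, a, b, width=8):
    """Behavioral ALU model; sel is the 3-bit code (s2 s1 s0) of Table 3.5."""
    mask = (1 << width) - 1
    operations = {
        0b000: a & b,                  # AND
        0b001: a ^ b,                  # XOR
        0b010: ~(a ^ b),               # XNOR
        0b011: a | b,                  # OR
        0b100: a - 1,                  # DECREMENT (assumed to act on A)
        0b101: a + b,                  # ADDITION
        0b110: a + (~b & mask) + 1,    # SUBTRACTION: A plus two's complement of B
        0b111: a + 1,                  # INCREMENT (assumed to act on A)
    }
    # Keep only the low `width` bits, as a fixed-width datapath would.
    return operations[sel] & mask


print(alu(0b101, 200, 100))   # addition wraps around modulo 256
print(alu(0b110, 5, 7))       # a negative difference appears in two's complement
```

Keeping every result inside `mask` mirrors what happens when eight 1-bit slices are chained and the final carry-out is discarded.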
Figure 3.13 A 1-bit ALU
3.2.5 8-bit ALU using ripple carry adders
An 8-bit ALU is formed by connecting eight 1-bit ALUs in series. The 8-bit ALUs using 10-transistor and 8-transistor full adders are shown in figure 3.14.
Figure 3.14 Eight bit ALU using 10 and 8 transistor full adders
An eight-bit ALU using ripple carry adders has a larger propagation delay, because the speed of the ALU is limited by the propagation of the carry. To reduce the carry propagation delay, the proposed design is implemented using a carry select adder.
CHAPTER 4 DESIGN OF ALU USING MODIFIED SQRT CSLA
4.1 Introduction to different transistor types
Combinational logic forms the core of most digital integrated circuits such as fast arithmetic units and controllers. The design requirements imposed on the logic circuitry can vary widely. Area is often the prime concern, as it has a direct impact on cost. In many state-of-the-art designs, speed tends to be the dominating requirement. Contemporary microprocessors are excellent examples of designs in this class. For other applications, minimizing the power consumption is crucial, as in the design of portable applications such as mobile telephones. These different design requirements generally translate into the use of different circuit styles, or even different manufacturing technologies. The static CMOS style has excellent properties in many areas: low sensitivity to noise and process variations, excellent speed, and low power consumption. Most of these properties carry over to more complex static CMOS gates, but gates with large fan-in, such as NAND gates with three or more inputs, become large and slow. Other design styles, like the complementary, the ratioed and the pass-transistor logic styles, have been devised to address this issue, all of which belong to the class of static circuits.
4.1.1 Complementary CMOS
A static CMOS gate is a combination of two networks, called the pull-up network (PUN) and the pull-down network (PDN). The PUN consists solely of PMOS transistors and provides a conditional connection to Vdd. The PDN potentially connects the output to Vss and contains only NMOS devices. The PUN and PDN networks should be designed so that, whatever the value of the inputs, one and only one of the networks is conducting in steady state. In this way, a path always exists either between Vdd and the output, realizing a high output (one), or between Vss and the output, realizing a low output (zero).
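The requirement that exactly one of the two networks conducts can be checked exhaustively for a small gate. The sketch below uses a two-input NAND as an assumed example, with series NMOS devices in the PDN and parallel PMOS devices in the PUN; the function names are hypothetical and the model is purely logical.

```python
from itertools import product


def nand2_pdn(a, b):
    """Series NMOS transistors conduct only when both inputs are high."""
    return bool(a and b)


def nand2_pun(a, b):
    """Parallel PMOS transistors conduct when at least one input is low."""
    return (not a) or (not b)


def static_cmos_output(pun_on, pdn_on):
    """Output of a complementary gate; exactly one network must conduct."""
    if pun_on == pdn_on:
        raise ValueError("PUN and PDN must be complementary")
    return 1 if pun_on else 0


for a, b in product((0, 1), repeat=2):
    out = static_cmos_output(nand2_pun(a, b), nand2_pdn(a, b))
    print(a, b, "->", out)
```

The check never raises for the NAND gate, which reflects the absence of a steady-state path between Vdd and Vss in a complementary gate.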
Properties of complementary CMOS
Complementary CMOS gates inherit all the desirable properties of the static CMOS inverter: high noise margins, no static power consumption (as there is never a direct path between Vdd and Vss in steady state), and comparable rise and fall times. The complementary gate is inverting (implementing functions such as NAND, NOR and XNOR). Implementing a non-inverting Boolean function (such as AND, OR, XOR) in one stage is not possible and requires the addition of an extra inverter stage.
4.1.2 Pseudo NMOS
A grounded PMOS device can be used as the load of an NMOS pull-down network. This configuration, which is called pseudo-NMOS because it resembles the depletion NMOS load, is superior to that load for two reasons. First, the PMOS transistor does not experience any body effect, as its Vsb is constant and equal to 0. Secondly, the PMOS device is driven by a Vgs equal to -Vdd, resulting in a higher load-current level for similarly sized devices.
Figure 4.1 Pseudo NMOS
An important disadvantage is that the gate consumes static power when the output is low, because a direct path exists between Vdd and ground through the load and the driver devices. The grounded PMOS load is a good imitation of an ideal current-source load. For certain circuit configurations, some simple modifications can further improve either the speed or the power consumption. The following approach eliminates the static current completely.
4.1.3 Differential cascode voltage switch logic (DCVSL)
Let us assume that the complement of each signal is always available. This requires each gate to generate both polarities of the output signal. Such a gate, called a Differential Cascode Voltage Switch Logic (DCVSL) gate, is presented in Figure 4.2. PDN1 and PDN2 are complementary and implement the required logic function and its inverse. Assume now that, for a given set of inputs, PDN1 conducts while PDN2 does not. Node out is pulled down. This turns on the load transistor M2, pulling up out'. This in turn cuts off load transistor M1. The gate is clearly free of static current paths, as only PDN1 and M2 are conducting.
Figure 4.2 DCVSL logic gate: basic principle
Figure 4.3 XOR-XNOR gates
The availability of complementary signals eliminates extra inverter stages. The example circuit in Figure 4.3 implements a two-input XOR and XNOR gate. The transistors connected to the A inputs are shared between the two PDNs. DCVSL has, for instance, been used for the implementation of fast error-correcting logic in memories. The DCVSL gate has a speed advantage: the reduction of the parasitic capacitances at the output nodes produces a faster response. At the same time the static power consumption is eliminated. This comes at the expense of extra area, as each gate requires two pull-down networks.
4.1.4 Pass transistor logic
Another promising approach is to implement complex logic as a logical network of switches or pass transistors. The pass transistor approach has the advantage of being simple and fast. Complex CMOS combinational logic is implemented with a minimal number of transistors. This reduces the parasitic capacitances and results in fast circuits. The static and transient performance of such a structure strongly depends upon the availability of a high-quality switch with low parasitic capacitance and resistance. Although the MOS transistor in itself is a switch of reasonable performance, it has some deficiencies. Pass transistor logic networks are, therefore, often constructed from bidirectional transmission gates (pass gates). These gates are composed of an NMOS transistor and a PMOS device in a parallel arrangement. The transmission gate acts as a bidirectional switch controlled by the gate signal C. When C = 1, both MOSFETs are on, allowing the signal to pass through the gate, i.e., A = B if C = 1. On the other hand, C = 0 places both transistors in cutoff, creating an open circuit between nodes A and B.
Figure 4.4 Pass transistor logic
Although the transmission gate possesses some excellent properties, such as an almost constant resistance and no threshold loss, it has the disadvantage that it requires both an NMOS and a PMOS transistor, which have to be located in different wells. This reduces the layout efficiency of the design. Also, the control signal has to be provided in both polarities, which once again has a negative influence on the layout density. Furthermore, the parallel connection of PMOS and NMOS results in increased node capacitances and reduced performance. It would therefore be advantageous if we could implement the transmission gate using NMOS transistors only. Unfortunately, NMOS-only pass transistors are subject to a voltage loss when passing a high level. The voltage levels can subsequently be restored by a complementary CMOS inverter, but such a circuit suffers from two major drawbacks: reduced noise margin, due to the threshold voltage drop, and static power consumption. Several techniques have been proposed to get around this problem.
4.1.5 Transmission Gate logic
Transmission gate logic includes at least two field-effect transistor elements used as pass transistors, each having a channel of conductivity type opposite to that of the other (i.e., complementary FETs). A transmission gate is a switching element which passes the input to the output according to the gate input. It is a parallel connection of an n-transistor, which is good at passing logic zero, and a p-transistor, which is good at passing logic one. The basic arrangement of a transmission gate is shown in figure 4.5.
Figure 4.5 A simple Transmission gate
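The difference between an NMOS-only pass transistor and a transmission gate can be illustrated with a first-order behavioural sketch. The threshold voltage VTN below is a hypothetical value chosen only for illustration, and the model ignores body effect, resistance and capacitance.

```python
VDD = 1.2   # supply voltage in volts (logic high level used in this work)
VTN = 0.4   # hypothetical NMOS threshold voltage, for illustration only


def nmos_pass(vin, control):
    """NMOS-only pass transistor.

    With the gate at VDD, a high input is degraded to at most VDD - VTN,
    while a low input passes unchanged. Returns None when the switch is open.
    """
    if not control:
        return None  # open circuit: the output node floats
    return min(vin, VDD - VTN)


def transmission_gate(vin, control):
    """NMOS and PMOS in parallel: both levels pass without threshold loss."""
    if not control:
        return None  # both transistors in cutoff
    return vin


if __name__ == "__main__":
    for vin in (0.0, VDD):
        print(vin, nmos_pass(vin, True), transmission_gate(vin, True))
```

In this sketch the degraded high level of the NMOS-only switch is what reduces the noise margin of the following stage, whereas the transmission gate keeps the full swing at the cost of a second transistor and a complementary control signal.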
4.2 Special Hardware using Multiplexers (SHM)
Although the BEC technique reduces area and power, the reduction is not considerable, and the design is not suitable for sub-threshold level modifications. The 16-bit SQRT CSLA using BEC in its second level requires 792 transistors. The proposed logic offers scope to reduce the number of transistors, and with it the area and the power dissipation: a 16-bit SQRT CSLA implemented with the proposed logic requires 736 transistors. The proposed replacement for the second level RCA is Special Hardware using Multiplexers (SHM), as shown in figure 4.6. The inputs are applied to the first level RCA. The output of the RCA is applied to the second level SHM and then to the third level multiplexer. The third level multiplexer selects either the RCA output or the SHM output according to the previous carry.
b0, b1, b2 are the inputs to the 3-bit SHM and x0, x1, x2 are the corresponding outputs. The SHM takes the first level RCA output as input and increments its value by one. A 3-bit SHM uses three multiplexers and three inverters. The first inverter gives the first output bit x0 from input bit b0, and that output is used as the select line for the first multiplexer. The first multiplexer passes either the second bit b1 or its inversion to the output, because the first inverter output acts like a carry to the second bit. The first multiplexer gives the second output bit x1, which is used as the select line of the second multiplexer. From the output bit x1 and the input bit b1, the second multiplexer generates the carry for input bit b2. One input to the second multiplexer is b1 and the other input is grounded; the output of the second multiplexer is connected as the select line of the third multiplexer. The third multiplexer passes the third bit or its inversion to the output according to this carry bit. This logic can be extended to any number of bits.
Here it is implemented for the second block, with two inputs under consideration. When the number of inputs is increased, the proposed technique produces more efficient results on a large scale. Despite the above advantages, the delay is increased, as the carry has to pass through 2(n-1) levels in an n-bit SHM in order to appear at the output. The comparison between the numbers of transistors is shown in table 4.1.
Figure 4.6 A 3-bit SHM
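$$X_0 = \overline{b_0}$$
$$X_1 = x_0 \cdot b_1 + \overline{x_0} \cdot \overline{b_1}$$
$$X_2 = (x_1 + \overline{b_1}) \cdot b_2 + \overline{x_1} \cdot b_1 \cdot \overline{b_2}$$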
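The behaviour of the 3-bit SHM described in Section 4.2 can be sketched at the logic level as follows. This is only a behavioural model built from the textual description (three multiplexers and three inverters); the helper names mux2 and shm3 are assumptions made for this illustration and are not part of the DSCH or Microwind design files.

```python
def mux2(sel, in0, in1):
    """2:1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def shm3(b0, b1, b2):
    """Behavioural sketch of a 3-bit SHM (b0 is the least significant bit).

    Returns (x0, x1, x2), the input value incremented by one (modulo 8).
    """
    x0 = 1 - b0                    # first inverter
    x1 = mux2(x0, 1 - b1, b1)      # x0 = 1 passes b1, x0 = 0 passes NOT b1
    carry = mux2(x1, b1, 0)        # x1 = 0 passes b1, x1 = 1 passes ground
    x2 = mux2(carry, b2, 1 - b2)   # carry = 1 passes NOT b2
    return x0, x1, x2


def shm3_value(value):
    """Apply the SHM to an integer 0..7 and return the resulting integer."""
    b0, b1, b2 = value & 1, (value >> 1) & 1, (value >> 2) & 1
    x0, x1, x2 = shm3(b0, b1, b2)
    return x0 + 2 * x1 + 4 * x2


if __name__ == "__main__":
    for value in range(8):
        print(value, shm3_value(value))
```

Running the loop over all eight input values confirms that the three multiplexer stages together implement an add-one function, which is the same role the BEC plays in the existing design.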
Table 4.1 Area comparison between 3-bit BEC and 3-bit SHM

| Type of logic | Gates | Number of transistors | Total number of transistors |
|---|---|---|---|
| 3-bit BEC | 2-XOR | 24 | 32 |
| | 1-AND | 6 | |
| | 1-NOT | 2 | |
| 3-bit SHM | 3-MUX | 18 | 24 |
| | 3-NOT | 6 | |
4.2.1 Transistor level design of SHM
A 3-bit SHM takes 24 transistors, as shown in figure 4.7. Its functional verification and the corresponding waveforms are shown in figure 4.8, and the waveforms with the power dissipation window are shown in figure 4.9.
Figure 4.7 Transistor level 3-bit SHM
Figure 4.8 Critical path details of a 3-bit SHM
Figure 4.9 Power dissipation of a 3-bit SHM
The power and area of the existing technique (BEC) and the proposed technique (SHM) are compared in table 4.2.
Table 4.2 Power and delay comparison between 3-bit BEC and 3-bit SHM

| Logic for second level | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|---|---|---|---|---|---|---|
| BEC using CMOS | 32 | 1.200 | 781 | 3.269 | 25.746 | 29.015 |
| SHM using CMOS | 24 | 2.350 | 486 | 3.100 | 22.843 | 25.943 |
4.3 An 8-bit ALU using proposed carry select adder
The proposed technique is applied to an 8-bit ALU built with the 10-transistor full adder and to an 8-bit ALU built with the 8-transistor full adder. The circuit diagram is shown in figure 4.10.
Figure 4.10 Eight bit ALU using modified SQRT CSLA
4.4 Wave forms
The output waveforms are obtained by applying a 20 ns clock to every input. For the 8-bit ALU using the proposed technique with the 10-transistor full adder, the output waveforms and power dissipation are shown in figure 4.11; for the 8-transistor full adder, the waveforms are shown in figure 4.12.
Figure 4.11 Wave forms of 8- bit ALU for 10- transistor full adder
Figure 4.12 Wave forms of 8- bit ALU for 8- transistor full adder
CHAPTER 5 RESULTS
5.1 Comparative analysis of existing CSLA and modified CSLA
In the design of the 8-bit ALU using the efficient carry select adder, all the blocks of the 16-bit SQRT CSLA, including the second level of the second block (3-bit BEC and 3-bit SHM), are implemented in Dsch2.6c (Logic Editor) and synthesized in Microwind 2.6a (Layout Editor) in 0.12 µm technology with 1.2 V as the logic high voltage. The first level of the second block in the 16-bit SQRT CSLA is a two-bit RCA, which requires 56 transistors when implemented in CMOS logic. The second level of the second block is the 3-bit SHM in the proposed logic design; it uses 24 transistors. The third level of the second block is the multiplexer. A simple 2x1 multiplexer uses six transistors in CMOS technology. Block 2 needs three 2x1 multiplexers, hence eighteen transistors are required for their implementation. The total number of transistors required for the complete block 2 is 98 when SHM is used, compared with 106 transistors with the BEC technique. With SHM, block 3 requires 146 transistors, block 4 requires 194 and block 5 requires 242. With the BEC technique, block 3 requires 158, block 4 requires 210 and block 5 requires 262 transistors. In total, a 16-bit SQRT CSLA requires 736 transistors with SHM, whereas it requires 792 transistors with the BEC technique. Finally, the complete second block of the 16-bit SQRT CSLA is implemented with BEC and with SHM using CMOS technology, and the observed results are shown in Tables 5.1 and 5.2.
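The transistor totals quoted above can be cross-checked with a short calculation. The sketch below uses only the counts given in the text; the one assumption, stated here and in the code, is that block 1 of the 16-bit SQRT CSLA is a plain 2-bit RCA of 56 transistors, which is the value that makes the block totals add up to the quoted 736 and 792.

```python
RCA_2BIT = 56      # 2-bit RCA in CMOS logic (from the text)
BEC_3BIT = 32      # 3-bit BEC (Table 4.1)
SHM_3BIT = 24      # 3-bit SHM (Table 4.1)
MUX_2X1 = 6        # one 2x1 multiplexer in CMOS (from the text)

# Assumption: block 1 is a plain 2-bit RCA without a second or third level.
BLOCK1 = RCA_2BIT

# Per-block totals for blocks 2..5 as quoted in the text.
BLOCKS_SHM = {2: 98, 3: 146, 4: 194, 5: 242}
BLOCKS_BEC = {2: 106, 3: 158, 4: 210, 5: 262}


def block2_total(second_level):
    """Block 2 = 2-bit RCA + second level logic + three 2x1 multiplexers."""
    return RCA_2BIT + second_level + 3 * MUX_2X1


def adder_total(blocks):
    """Total transistor count of the 16-bit SQRT CSLA."""
    return BLOCK1 + sum(blocks.values())


if __name__ == "__main__":
    print("block 2 with SHM:", block2_total(SHM_3BIT))
    print("block 2 with BEC:", block2_total(BEC_3BIT))
    print("16-bit CSLA with SHM:", adder_total(BLOCKS_SHM))
    print("16-bit CSLA with BEC:", adder_total(BLOCKS_BEC))
```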
5.2 Comparative analysis of existing ALU and modified ALU
All the basic gates in the ALU, such as the AND gate, XOR gate, multiplexer and full adder, are designed using the GDI technique. The full adder is designed using 10 transistors as well as 8 transistors. The final comparison of the 8-bit ALU is made between the ripple carry adder and the carry select adder.
The 8-bit ALU using the efficient carry select adder is faster than the 8-bit ALU using ripple carry adders. The ALU using the efficient carry select adder gives a speed advantage of 41.6% for the 10-transistor adder and 44.7% for the 8-transistor adder. The corresponding results are shown in tables 5.3 and 5.4.

Table 5.1 Comparison of second level 2-bit RCA, 3-bit BEC and 3-bit SHM implemented using CMOS technology

| Logic for second level | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|---|---|---|---|---|---|---|
| RCA using CMOS | 56 | 1.900 | 1342 | 6.706 | 42.565 | 49.271 |
| BEC using CMOS | 32 | 1.200 | 781 | 3.269 | 25.746 | 29.015 |
| SHM using CMOS | 24 | 2.350 | 486 | 3.100 | 22.843 | 25.943 |
Table 5.2 Comparison between second block with BEC and second block with SHM using CMOS

| Design type | Number of transistors | Critical path delay (ns) | Area (µm²) | Static power (µW) | Dynamic power (µW) | Total power (µW) |
|---|---|---|---|---|---|---|
| RCA-BEC-MUX | 106 | 3.240 | 3465 | 21.005 | 106 | 127.005 |
| RCA-SHM-MUX | 98 | 3.770 | 2996 | 20.138 | 98.624 | 118.762 |
Table 5.3 Comparison of 8-bit ALU using 10 transistor adder

| Model (ALU) | Number of transistors | Critical path delay (ns) | Area (µm²) | Power (mW) |
|---|---|---|---|---|
| 8-bit ALU using 10 transistor RCA | 448 | 3.195 | 12384 | 0.204 |
| 8-bit ALU using 10 transistor CSLA | 508 | 1.865 | 24682 | 0.205 |

Table 5.4 Comparison of 8-bit ALU using 8 transistor adder

| Model (ALU) | Number of transistors | Critical path delay (ns) | Area (µm²) | Power (mW) |
|---|---|---|---|---|
| 8-bit ALU using 8 transistor RCA | 432 | 3.745 | 11832 | 0.221 |
| 8-bit ALU using 8 transistor CSLA | 494 | 2.070 | 20988 | 0.262 |
CHAPTER 6 CONCLUSIONS AND FUTURE SCOPE
6.1 Conclusions
In the process of designing a low power ALU, various tradeoffs between area, delay and power dissipation arise. As the adder is the main block in the ALU, an efficient adder is always preferred. For that reason, the SQRT carry select adder is modified to be more power and area efficient. In this process all second level RCA blocks of the 16-bit SQRT CSLA are replaced by SHM, and the results are compared with the existing technique, BEC. From the comparisons in Table 5.1, it is observed that, relative to the 2-bit RCA, the proposed 3-bit SHM reduces the number of transistors by 57.1%, the area by 63.7% and the power dissipation by 47.3%. The existing 3-bit BEC, in contrast, reduces the number of transistors by only 42.8%, the area by 41.8% and the power dissipation by 41.1%.
Finally, the second block of the 16-bit SQRT CSLA is designed using a logic level modification, SHM in place of BEC. From table 5.2, it is observed that the number of transistors is reduced by 7.5%, the area by 13.5% and the power by 6.4%, but the critical path delay is increased by 16.3%. This again shows the tradeoff between area, power and delay: the design is optimized for power and area at the cost of a delay overhead. This delay overhead can also be reduced by using various existing low power circuit level modifications. Using the proposed efficient carry select adder and the GDI technique, an 8-bit ALU is designed for both the 10-transistor and the 8-transistor full adders and compared with the existing 8-bit ALU using ripple carry adders in tables 5.3 and 5.4. It is observed that the speed is increased by 41.6% in the case of the 10-transistor full adder and by 44.7% in the case of the 8-transistor full adder.
The proposed design has been shown to outperform the existing design. A satisfactory level of power consumption and propagation delay can be achieved using the proposed technique without the need to purchase new technology libraries, which may lead to a reduction in design cost. Consequently, the proposed design is suitable for application in high-performance arithmetic and VLSI circuits in the future.
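The percentages quoted in Section 6.1 follow directly from the values in Tables 5.1 to 5.4. As a check, the relative change can be recomputed with the small sketch below; the helper name reduction_pct is an assumption made for this illustration.

```python
def reduction_pct(reference, value):
    """Relative reduction of value with respect to reference, in percent.

    A negative result means that the quantity increased.
    """
    return 100.0 * (reference - value) / reference


if __name__ == "__main__":
    # Table 5.1: 2-bit RCA versus 3-bit SHM
    print("transistors:", reduction_pct(56, 24))
    print("area:", reduction_pct(1342, 486))
    print("power:", reduction_pct(49.271, 25.943))
    # Table 5.2: second block with BEC versus second block with SHM
    print("transistors:", reduction_pct(106, 98))
    print("area:", reduction_pct(3465, 2996))
    print("delay:", reduction_pct(3.240, 3.770))  # negative: delay overhead
    # Tables 5.3 and 5.4: RCA-based ALU versus CSLA-based ALU delay
    print("speed, 10T adder:", reduction_pct(3.195, 1.865))
    print("speed, 8T adder:", reduction_pct(3.745, 2.070))
```

The delay comparison of Table 5.2 comes out negative, which corresponds to the delay overhead of the SHM-based block discussed above.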
6.2 Future Scope
The proposed work can be extended by increasing the number of bits and by moving to newer technologies such as 0.08 µm and 0.06 µm. The resulting design, with fewer transistors, will in turn reduce the total area and the power consumption.
REFERENCES
[1] Arun Prakash Singh, Rohit Kumar, “Implementation of 1-bit Full Adder Using Gate Diffusion Input (GDI) cell”, International Journal of Electronics and Computer Science Engineering.
[2] N. M. Chore, R. N. Mandavgane, “A survey of low power high speed one bit full adder”, Recent Advances in Networking, VLSI and Signal Processing, ISSN: 1790-5117, ISBN: 978-960-474-162-5.
[3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System Perspective. Reading, MA: Addison-Wesley, 1993.
[4] Pardeep Kumar, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 6, November-December 2012, pp. 599-606.
[5] M. Sreedevi and P. Jeno Paul, “Design and Optimization of a High Performance Low-Power CMOS Flex Cell”, International Journal of Signal System Control and Engineering Application, 2010, vol. 3, no. 4, pp. 65-69. DOI: 10.3923/ijssceapp.2010.65.69.
[6] A good overview of leakage and reduction methods is given in the book Leakage and reduction in Nanometer CMOS Technologies, ISBN 0-387-25737-3.
[7] M. Parvathi, N. Vasantha, K. Satya Prasad, “Design of High Speed - Low Power - High Accurate (HS-LP-HA) Adder”, ICECT, International Conference on Electronics Computer Technology Proceedings, 2012, pp. 523-527, 978-1-46731850-1/12 ©2012 IEEE.
[8] K. Allipeera, S. Ahmed Basha, “An Efficient 64-Bit Carry Select Adder With Less Delay And Reduced Area Application”, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 5, September-October 2012, pp. 550-554.
[9] O. J. Bedrij, “Carry Select Adder”, IRE Trans. Electron. Comput., pp. 340-344, 1962.
[10] U. Sreenivasulu, T. Venkata Sridhar, “Implementation of An 4 Bit - ALU Using Low-Power And Area-Efficient Carry Select Adder”, International Conference on Electronics and Communication Engineering, 20th May 2012, Bangalore, ISBN: 978-93-81693-29-2.
[11] A. Andamuthu, S. Rithanyaa, “Design Of 128 Bit Low Power and Area Efficient Carry Select Adder”, International Journal of Advanced Research in Engineering (IJARE), Vol. 1, Issue 1, 2012, pp. 31-34.
[12] B. Ramkumar, H. M. Kittur, and P. M. Kannan, “ASIC implementation of modified faster carry save adder”, Eur. J. Sci. Res., vol. 42, no. 1, pp. 53-58, 2010.
[13] T. Y. Ceaing and M. J. Hsaio, “Carry-select adder using single ripple carry adder”, Electron. Lett., vol. 34, no. 22, pp. 2101-2103, Oct. 1998.
[14] Y. Kim and L. S. Kim, “64-bit carry select adder with reduced area”, Electron. Lett., vol. 37, no. 10, pp. 614-615, May 2001.
[15] B. Ramkumar and Harish M. Kittur, “Low-Power And Area-Efficient Carry Select Adder”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems
APPENDIX
About Microwind2
The MICROWIND2 program allows the student to design and simulate an integrated circuit at physical description level. The package contains a library of common logic and analog ICs to view and simulate. MICROWIND2 includes all the commands for a mask editor as well as original tools never gathered before in a single module (2D and 3D process view, Verilog compiler, tutorial on MOS devices). You can gain access to Circuit Simulation by pressing one single key. The electric extraction of your circuit is automatically performed and the analog simulator produces voltage and current curves immediately. This includes details on the device modeling and simulation at logic and layout levels.
Figure A: MICROWIND window as it appears at the initialization stage.
We use MICROWIND2 to draw the MOS layout and simulate its behavior. Go to the directory in which the software has been copied (by default, microwind2).
Double-click on the MicroWind3 icon. The MICROWIND2 display window includes four main windows: the main menu, the layout display window, the icon menu and the layer palette. The layout window features a grid scaled in lambda (λ) units. The lambda unit is fixed to half of the minimum available lithography of the technology. The default technology is a CMOS 6-metal-layer 0.12 μm technology; consequently, lambda is 0.06 μm (60 nm).
Simulation of a layout
MICROWIND3 includes a 3D process viewer for visualizing the fabrication of the layout. Click Simulate → Process steps in 3D. The simulation of the CMOS fabrication process is performed step by step, by a click on Next Step. The picture on the left represents the nMOS device, pMOS device, common polysilicon gate and contacts. The picture on the right represents the same portion of layout with the metal layers stacked on top of the active device.
The inverter simulation is conducted as follows. Firstly, a VDD supply source (1.2 V) is fixed to the upper metal2 supply line, and a VSS supply source (0.0 V) is fixed to the lower metal2 supply line. The properties are located in the palette menu. Simply click the desired property, and click on the desired location in the layout. Add a clock on the inverter input node (the default node name clock1 has been changed to Vin) and a visible property on the output node Vout.
The command Simulate → Run Simulation gives access to the analog simulation in four simulation modes: voltage vs. time, voltage and current vs. time, static voltage vs. voltage, and frequency vs. time. All these simulation modes are applicable to the inverter simulation. Select the simulation mode Voltage vs. Time. The analog simulation of the circuit is performed. The time domain waveform, proposed by default, details the evolution of the voltages in1 and out1 versus time. This mode is also called transient simulation.
Because the layout InvSteps.MSK includes not only the correctly polarized inverter but also several other MOS devices without any simulation properties, a warning window appears prior to the analog simulation. In this case you may click Simulate as it. In normal cases, all n-well regions should be stuck at VDD.
The inverter consumes power during transitions, due to two separate effects. The first is short circuit power, arising from the momentary short circuit current that flows from VDD to VSS while the transistors are in an incomplete on/off state. The second is charging/discharging power, which depends on the output wire capacitance. With small loading the short circuit power loss is dominant. With huge loading, that is, a large output node capacitance, the load power is dominant. The power consumption occurs briefly during transitions of the output, either from 0 to 1 or from 1 to 0. The simulation shows the supply currents in the upper window and all voltage waveforms in the lower window. The current consumption is important only during a very short period corresponding to the charge or discharge of the output node. Without any switching activity, the current almost equals zero.
Delay
As the number of gates connected to the inverter output node increases, the load capacitance increases. The fan-out corresponds to the number of gates connected to the cell output. Physically, a large fan-out means a large number of connections, that is, a large load capacitance. An inverter circuit is simulated using different clock, fan-out and supply conditions. The initial configuration is based on one inverter controlled by a 2 GHz clock, with its output connected either to a single inverter or to four inverters. The supply voltage is 1.2 V, with a 0.12 μm CMOS technology. When we connect four inverter circuits to the output node, thus increasing the load capacitance, the simulation chronograms show that the inverter delay is significantly increased. When we investigate the delay variation with the output capacitance load, the curve shows that the gate delay varies almost linearly with the loading capacitance. A 100 fF load leads to around 300 ps delay in CMOS 0.12 μm technology.
In Microwind this type of screen is obtained with the command parametric analysis. Load the file Invcapa.MSK and invoke the command parametric analysis. By default the capacitance of the output node is increased step by step from its default value Cdef to Cdef + 100 fF. For each value of the output capacitance, the analog simulation is performed, and the last computed rise time is plotted, appearing as one single red dot in the graph. The complete graph is built once all analog simulations have been completed. The memory button enables us to store one curve prior to a new parametric simulation, for comparison purposes. Three main parameters may vary in the parametric analysis: the capacitance, the voltage and the temperature. Several analog parameters may be monitored: rise and fall delay, oscillating frequency, power consumption, final voltage of a node, cross talk, etc.
Power consumption
The power consumption P is computed by Microwind as the average product of the supply voltage VDD and the supply current IDD, computed at each iteration step. In other words:
$$P = \frac{1}{\text{steps}} \sum I_{DD} \cdot V_{DD}$$
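The averaging described above can be written out as a short sketch. It assumes that the supply current has already been sampled at every simulation step; the function name and the sample values are hypothetical and are not taken from Microwind.

```python
def average_power(idd_samples, vdd):
    """Average of IDD * VDD over all simulation steps, in watts.

    idd_samples: supply current in amperes at each iteration step.
    vdd: supply voltage in volts (assumed constant during the simulation).
    """
    if not idd_samples:
        raise ValueError("at least one current sample is required")
    return sum(i * vdd for i in idd_samples) / len(idd_samples)


if __name__ == "__main__":
    # Hypothetical samples: current flows only during output transitions.
    samples = [0.0, 1.0e-3, 0.0, 0.0, 1.0e-3, 0.0, 0.0, 0.0]
    print(average_power(samples, 1.2))
```

Because the current is close to zero between transitions, the average depends strongly on how often the output switches within the simulated interval.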
Three main factors contribute to the power consumption P: the load capacitance C, the supply voltage VDD and the clock frequency f. For a CMOS inverter, this relation is usually represented by the first order approximation below. The equation shows a linear dependence of the power consumption P on the total capacitance C and the operating frequency f. The power consumption is also proportional to the square of the supply voltage VDD.
$$P = \tfrac{1}{2}\,\eta\,C\,V_{DD}^{2}\,f$$
where η is the switching activity factor, C is the output load capacitance, VDD is the supply voltage and f is the clock frequency.
Frequency dependence
We can verify the linear dependence of the power consumption on the operating frequency by simulating a CMOS inverter circuit. At each time domain analog simulation, we get a value of the power consumption, which is computed by Microwind as the average product of the supply voltage VDD and the supply current IDD. As the power consumption is linearly proportional to the clock frequency, a usual metric found in most cell libraries is the μW/GHz.
Supply voltage dependence
As a first order approximation, the average power consumption is proportional to VDD². We use the parametric analysis tool in Microwind to control the incremental change of the supply voltage from 0.5 to 2.0 V. The supply voltage step is 0.1 V. In the measurement window, the item dissipation is selected. The result shows a non-linear dependence of the power dissipation on VDD. The square law fits the experimental data from 0.8 to 1.5 V. We notice a very important rise of the power consumption above 1.5 V, due to avalanche effects in n-channel MOS devices. The simulation demonstrates the interest of operating at a minimum supply voltage to achieve optimum low power operation.
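The first order approximation for the switching power can be evaluated numerically to see the linear dependence on frequency and the quadratic dependence on supply voltage. The sketch below is only an illustration of the formula; the load of 100 fF and the activity factor of 1 are assumed example values, while the 1.2 V supply and 2 GHz clock are the values used in the inverter example above.

```python
def dynamic_power(c_load, vdd, freq, activity=1.0):
    """First order switching power P = 0.5 * eta * C * VDD^2 * f, in watts.

    c_load: output load capacitance in farads.
    vdd: supply voltage in volts.
    freq: clock frequency in hertz.
    activity: switching activity factor eta (dimensionless).
    """
    return 0.5 * activity * c_load * vdd ** 2 * freq


if __name__ == "__main__":
    c_load = 100e-15   # assumed example load of 100 fF
    base = dynamic_power(c_load, 1.2, 2e9)
    print("P at 1.2 V, 2 GHz:", base)
    print("ratio when f doubles:", dynamic_power(c_load, 1.2, 4e9) / base)
    print("ratio when VDD doubles:", dynamic_power(c_load, 2.4, 2e9) / base)
```

The approximation covers only the charging and discharging of the load; it does not model the short circuit power or the rise above 1.5 V reported in the simulation.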
Minimum supply voltage
We must know the supply voltage below which the inverter no longer works. The answer is given by the parametric analysis, focusing this time on the dependence of the inverter delay on the supply voltage. Load the file cmosload.msk for this study. Invoke the command parametric analysis of the analysis menu. Click the layout region corresponding to the node VDD. Verify that the voltage menu is selected in the parametric analysis window. Verify that the node VDD is selected. Modify the VDD voltage range from 0.5 to 1.5 V, step 0.1. Finally, in the measurement menu, select the item rise delay and click start analysis. We observe that the delay is significantly increased as we decrease VDD from its nominal value of 1.2 V down to 0.6 V. Below 0.7 V the inverter delay is higher than the default transient simulation time, so that the delay evaluator does not work anymore.
Static characteristics
The static characteristics of the inverter correspond to the variation plot of the output voltage versus the input voltage. The simulation involves a step by step increase of Vin and the monitoring of Vout. In the simulation window, the static characteristics are obtained by a click on the item voltage versus voltage situated in the selection menu, at the bottom of the chronograms. When Vin is low, Vout is high, which corresponds to one logic state of the inverter. When Vin increases, Vout starts to decrease slowly and suddenly crosses the VDD/2 boundary. At that point the value of Vin is the commutation point of the inverter, called Vc. Then, when Vin rises to VDD, Vout reaches 0, which corresponds to the other logic state of the inverter.
About DSCH3
The DSCH3 program is a logic editor and simulator. DSCH3 is used to validate the architecture of the logic circuit before the microelectronics design is started. DSCH3 provides a user-friendly environment for hierarchical logic design, and fast simulation with delay analysis, which allows the design and validation of complex logic structures. Some techniques for low power design are described in the manual. DSCH3 also features the symbols, models and assembly support for 8051. DSCH3 also includes an interface to SPICE.
Figure B: DSCH schematic editor
Features
• User-friendly environment for rapid design of logic circuits.
• Supports hierarchical logic design.
• Handles both conventional pattern-based logic simulation and intuitive on-screen mouse-driven simulation.
• Built-in extractor, which generates a SPICE netlist from the schematic diagram (compatible with PSPICE™ and WinSpice™).
• Current and power consumption analysis.
• Generates a VERILOG description of the schematic for the layout editor.
• Immediate access to symbol properties (delay, fan-out).
• Models and supports the 8051 microcontroller.
An example of the design of the schematic diagram in the DSCH and the generation of its layout in the MICROWIND is shown. The CMOS inverter design is detailed in the figure C below. First click new on main menu then draw the circuit diagram on DSCH window by dragging the components from symbol library. Draw the circuit diagram as shown below.
Figure C: Inverter circuit
Save the file and click Simulate → Start simulation in the main menu. Then click inside the buttons situated on the left part of the diagram. The result is displayed on the LED. Here the p-channel and n-channel MOS transistors function as switches, as shown in figure D. When the input signal is logic 0, as shown in figure 5.4, the NMOS is switched off while the PMOS passes VDD to the output. When the input signal is logic 1, as shown in figure 6.12, the PMOS is switched off while the NMOS passes VSS to the output.
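The switch behaviour described above can be sketched as a tiny switch-level model, in which each transistor is an ideal switch controlled by the shared input. This is a simplified illustration only: the function names are hypothetical, and the model ignores delays and analog voltage levels that DSCH takes into account.

```python
VDD, VSS = 1, 0  # logic levels carried by the two supply rails


def pmos_conducts(gate):
    """An ideal p-channel switch closes when its gate is at logic 0."""
    return gate == 0


def nmos_conducts(gate):
    """An ideal n-channel switch closes when its gate is at logic 1."""
    return gate == 1


def cmos_inverter(vin):
    """Return the output level of a CMOS inverter for a logic input."""
    pull_up = pmos_conducts(vin)      # pMOS connects the output to VDD
    pull_down = nmos_conducts(vin)    # nMOS connects the output to VSS
    if pull_up and not pull_down:
        return VDD
    if pull_down and not pull_up:
        return VSS
    raise ValueError("input must be logic 0 or logic 1")


for vin in (0, 1):
    print(vin, "->", cmos_inverter(vin))
```

Exactly one of the two switches conducts for each valid input, which is why the output is always driven to one rail and never left floating.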
Figure D: Circuit diagram of the CMOS inverter, and the CMOS inverter during simulation
The fan-out corresponds to the number of gates connected to the inverter output. Physically, a large fan-out means a large number of connections, that is, a large load capacitance. If we simulate an inverter loaded with a single output, the switching delay is small. If we now load the inverter with several outputs, the delay and the power consumption increase. The power consumption increases linearly with the load capacitance, mainly because of the current needed to charge and discharge that capacitance. Click the button Stop simulation shown in the figure below to return to the editor.
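The linear dependence of power on load capacitance noted in the fan-out discussion follows from the usual switching-power estimate P = C × VDD² × f. The sketch below applies it for several fan-out values. The per-gate input capacitance and the switching frequency are assumed placeholder values chosen for illustration, not figures from DSCH or Microwind, and short-circuit and leakage power are ignored.

```python
def dynamic_power(c_load, vdd, freq):
    """Switching power in watts for a load charged and discharged once per cycle."""
    return c_load * vdd ** 2 * freq


VDD = 1.2              # nominal supply voltage quoted in the text (volts)
C_PER_GATE = 1e-15     # assumed input capacitance of one driven gate (farads)
FREQ = 100e6           # assumed switching frequency (hertz)

for fan_out in (1, 2, 4, 8):
    # The load capacitance is taken as proportional to the number of driven gates.
    power = dynamic_power(fan_out * C_PER_GATE, VDD, FREQ)
    print(f"fan-out {fan_out}: {power * 1e6:.3f} uW")
```

Doubling the fan-out doubles the estimated power, which matches the linear trend described in the text.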
Figure E: Timing diagram of inverter
Click the chronogram icon to access the chronograms of the previous simulation. As seen in the waveform, the output is the logical opposite of the input.
Generation of layout of the schematic diagram
Next, open the Microwind window and click Open in the main menu. Then open the CMOS inverter circuit diagram. Then click Compile Verilog File to process the Verilog file of the corresponding circuit diagram. This generates the corresponding stick diagram of the inverter circuit, as shown in the figure. Then click the Simulate icon in the main menu to generate the waveforms.
Verilog program
// DSCH Ver 3.0
// G:\project\dsch microwind\self\example.sch
module example (in1, out1);
input in1;
output out1;
wire ;
pmos #(17) pmos_1(out1,vdd,in1); // 2.0u 0.12u
nmos #(17) nmos_2(out1,vss,in1); // 1.0u 0.12u
endmodule
// Simulation parameters in Verilog Format
always #1000 in1=~in1;
in1 CLK 10
Layout
This section describes the procedure for manually creating the layout of a CMOS inverter. Click the icon MOS generator on the palette. The following window appears. By default, the proposed length is the minimum length available in the technology (2 lambda), and the width is 10 lambda. In 0.12 μm technology, where lambda is 0.06 μm, the corresponding size is 0.12 μm for the length and 0.6 μm for the width. Click on the top of the nMOS to fix the pMOS device. The result is displayed in figure F.
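The lambda-based sizing quoted in the layout procedure is a simple scaling, sketched below with the values from the text (lambda = 0.06 μm, default length 2 lambda, default width 10 lambda). The helper name lambda_to_um is hypothetical and used only for this illustration.

```python
LAMBDA_UM = 0.06  # lambda in micrometres for the 0.12 um technology (from the text)


def lambda_to_um(n_lambda, lambda_um=LAMBDA_UM):
    """Convert a dimension expressed in lambda units to micrometres."""
    return n_lambda * lambda_um


length_um = lambda_to_um(2)    # default MOS length: 2 lambda
width_um = lambda_to_um(10)    # default MOS width: 10 lambda
print(f"L = {length_um:.2f} um, W = {width_um:.2f} um")
```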
Figure F: Layout of inverter in MICROWIND
Figure G: Selecting the NMOS device
Connection between devices
Within CMOS cells, metal and polysilicon are used as interconnects for signals. Metal is a much better conductor than polysilicon. Consequently, polysilicon is only used to interconnect gates, such as the bridge (1) between the pMOS and nMOS gates, as described in the schematic diagram of figure G. Polysilicon is rarely used for long interconnects, except where a very large resistance value is wanted. In the layout shown in figure G, the polysilicon bridge links the gate of the n-channel MOS with the gate of the p-channel MOS device. The polysilicon thus serves both as the gate control and as the bridge between the MOS gates.
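To see why polysilicon is avoided for long interconnects, the resistance of a wire can be estimated as the sheet resistance multiplied by the number of squares (length divided by width). The sketch below compares the two materials for the same geometry. The sheet resistance values are assumed placeholders chosen only to illustrate the ratio; they are not taken from the Microwind technology files.

```python
def wire_resistance(r_sheet, length_um, width_um):
    """Resistance in ohms of a rectangular wire: sheet resistance times squares."""
    return r_sheet * length_um / width_um


R_SHEET_POLY = 10.0    # assumed sheet resistance of polysilicon (ohms per square)
R_SHEET_METAL = 0.1    # assumed sheet resistance of metal (ohms per square)

LENGTH_UM = 12.0       # example wire length in micrometres (assumed)
WIDTH_UM = 0.12        # example wire width in micrometres (assumed)

r_poly = wire_resistance(R_SHEET_POLY, LENGTH_UM, WIDTH_UM)
r_metal = wire_resistance(R_SHEET_METAL, LENGTH_UM, WIDTH_UM)
print(f"polysilicon: {r_poly:.0f} ohm, metal: {r_metal:.0f} ohm")
```

With these assumed values the polysilicon wire is one hundred times more resistive than the metal wire of the same shape, so a short gate-to-gate bridge is acceptable while a long polysilicon route is not.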
Figure H: Connections required to build the inverter (CmosInv.SCH)