
ACA NOTES

Parallel Computing
A common way of satisfying the described needs is to use parallel computers. A parallel computer consists of two or more processing units which operate more or less independently in parallel. Using such a computer, a problem can (theoretically) be divided into n subproblems (where n is typically the number of available processing units), and each part of the problem is solved by one of the processing units concurrently. Ideally, the completion time of the computation will be t/n, where t is the completion time for the problem on a computer containing only one processing unit. In practice, a value of t/n will rarely be achieved, for several reasons: sometimes a problem cannot be divided exactly into n independent parts, there is usually a need for communication between the parallel executing processes (e.g. for data exchanges, synchronization, etc.), some problems contain parts that are inherently sequential and therefore cannot be processed in parallel, and so on. This leads us to the term scalability. Scalability is a measure that specifies whether or not a given problem can be solved faster as more processing units are added to the computer. This applies to hardware and software.

Scalability
A computer system, including all its hardware and software resources, is called scalable if it can scale up (i.e., improve its resources) to accommodate ever-increasing performance and functionality demand and/or scale down (i.e., decrease its resources) to reduce cost.

Parallel computers can be classified by various aspects of their architecture. Here we present three different classification schemes. In the first, parallel computers are distinguished by the way the processors are connected with the memory. The second scheme (called "Flynn's Classification Scheme") takes the number of instruction streams and the number of data streams into account. Finally, the third scheme (ECS, the "Erlanger Classification Scheme") focuses on the number of control units, functional units, and the word size of the computer.
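The limit on the ideal t/n completion time discussed above is usually summarized by Amdahl's law (the law itself is not stated in these notes; the sketch below is added purely as an illustration, and the sequential fraction and processor counts in it are made-up values):

/* Illustration (not from the notes): Amdahl's law formalizes why the ideal
 * completion time t/n is rarely reached.  If a fraction f of the work is
 * inherently sequential, the speedup with n processing units is bounded by
 * 1 / (f + (1 - f)/n), no matter how large n becomes. */
#include <stdio.h>

/* speedup for n processors when a fraction f of the work is sequential */
static double amdahl_speedup(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    double f = 0.05;                       /* assume 5% sequential work  */
    int n[] = { 2, 8, 64, 1024 };

    for (int i = 0; i < 4; i++)
        printf("n = %4d  speedup = %6.2f (ideal %d)\n",
               n[i], amdahl_speedup(f, n[i]), n[i]);
    return 0;                              /* speedup saturates near 1/f = 20 */
}

With 5% sequential work the speedup can never exceed 20, no matter how many processing units are added; this is the scalability limit the text refers to.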

Memory-Processor Organization
In terms of memory-processor organization, three main groups of architectures can be distinguished. These are
• shared memory architectures,
• distributed memory architectures, and
• distributed shared memory architectures.

Shared Memory Architectures
The main property of shared memory architectures is that all processors in the system have access to the same memory; there is only one global address space. Typically, the main memory consists of several memory modules (whose number is not necessarily equal to the number of processors in the computer, see Figure 2-1). In such a system, communication and synchronization between the processors is done implicitly via shared variables.

The processors are connected to the memory modules via some kind of interconnection network. This type of parallel computer is also called UMA, which stands for uniform memory access, since all processors access every memory module in the same way concerning latency and bandwidth. A big advantage of shared memory computers is that programming them is very convenient: all data are accessible by all processors, so there is no need to copy data. Furthermore, the programmer does not have to take care of synchronization, since this is carried out by the system automatically (which makes the hardware more complex and hence more expensive). However, it is very difficult to obtain high levels of parallelism with shared memory machines; most systems do not have more than 64 processors. This limitation stems from the fact that a centralized memory and the interconnection network are both difficult to scale once built.

Distributed Memory Architectures
In the case of a distributed memory computer (in the literature also called a multicomputer), each processor has its own private memory. There is no common address space, i.e. the processors can access only their own

memories. Communication and synchronization between the processors is done by exchanging messages over the interconnection network. The figure shows the organization of the processors and memory modules in a distributed memory computer. In contrast to the shared memory architecture, a distributed memory machine scales very well, since all processors have their own local memory, which means that there are no memory access conflicts. Using this architecture, massively parallel processors (MPP) can be built, with up to several hundred or even thousands of processors.

Typical representatives of a pure distributed memory architecture are clusters of computers, which are becoming more and more important nowadays. In a cluster, each node is a complete computer, and these computers are connected through a low-cost commodity network (e.g. Ethernet, Myrinet, etc.). The big advantage of clusters compared to MPPs is that they have a much better cost/performance ratio.

Distributed Shared Memory Architectures
To combine the advantages of the architectures described above, ease of programming on the one hand and high scalability on the other, a third kind of architecture has been established: distributed shared memory machines. Here, each processor has its own local memory, but, contrary to the distributed memory architecture, all memory modules form one common address space, i.e. each memory cell has a system-wide unique address. In order to avoid the disadvantage of shared memory computers, namely the low scalability, each processor uses a cache, which keeps the number of memory access conflicts and the network contention low. However, the usage of caches introduces a number of problems, for example how to keep the data in the memory and the copies in the caches up to date. This problem is solved by using sophisticated cache coherence and consistency protocols. A detailed description of the most important protocols can be found in the literature.

Interconnection Networks
Since the processors in a parallel computer need to communicate in order to solve a given problem, there is a need for some kind of communication infrastructure, i.e. the processors need to be connected in some way. Basically, there are two kinds of interconnection networks: static and dynamic. In a static interconnection network all connections are fixed, i.e. the processors are wired directly, whereas in a dynamic network there are switches in between. The decision whether to use a static or a dynamic interconnection network depends on the kind of problem that should be solved with the computer. Generally, static topologies are suitable for problems whose communication patterns can be predicted reasonably well, whereas dynamic topologies (switching networks), though more expensive, are suitable for a wider class of problems [1]. In the following, we give a description of some important static and dynamic topologies, including routing protocols.

Static Topologies: Descriptions

Meshes and Rings
The simplest - and cheapest - way to connect the nodes of a parallel computer is to use a one-dimensional mesh. Each node has two connections; boundary nodes have one. If the boundary nodes are connected to each other, we have a ring, and all nodes have two connections. The one-dimensional mesh can be generalized to a k-dimensional mesh, where each node (except boundary nodes) has 2k connections. Again, boundary nodes can be connected, but there is no general consensus on how to treat boundary nodes.

However, this type of topology is not suitable for building large-scale computers, since the maximum message latency, that is, the maximum delay of a message from one of the N processors to another, grows with N (proportional to N in a one-dimensional mesh or ring, and to the k-th root of N in a k-dimensional mesh); this is bad for two reasons: firstly, there is a wide range of latencies (the latency between neighbouring processors is much lower than between non-neighbours), and secondly the maximum latency grows with the number of processors.

Stars
In a star topology there is one central node, to which all other nodes are connected; each node has one connection, except the centre node, which has N-1 connections.

Stars are also not suitable for large systems, since the centre node becomes a bottleneck as the number of processors increases.

Hypercubes
The hypercube topology is one of the most popular and is used in many large-scale systems. A k-dimensional hypercube has 2^k nodes, each with k connections. In the figure a four-dimensional hypercube is displayed. Hypercubes scale very well; the maximum latency in a k-dimensional hypercube is log2 N, with N = 2^k. An important property of hypercubes is the relationship between the node number and which nodes are connected together. The rule is that any two nodes in the hypercube whose binary representations differ in exactly one bit are connected together. For example, in a four-dimensional hypercube, node 0 (0000) is connected to node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). This numbering scheme is called the Gray code scheme.
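The one-bit-difference rule can be checked mechanically. The following small C sketch (the function names are illustrative, not from any standard library) tests adjacency and lists the neighbours of a node:

/* Sketch of the hypercube numbering rule described above: two nodes are
 * neighbours exactly when their binary labels differ in one bit. */
#include <stdio.h>

/* 1 if nodes a and b are directly connected in a hypercube */
static int hypercube_adjacent(unsigned a, unsigned b)
{
    unsigned x = a ^ b;                    /* bits in which the labels differ */
    return x != 0 && (x & (x - 1)) == 0;   /* exactly one bit set             */
}

int main(void)
{
    int k = 4;                    /* four-dimensional hypercube            */
    unsigned node = 0;            /* node 0000                             */

    for (int d = 0; d < k; d++)   /* its k neighbours: 0001,0010,0100,1000 */
        printf("neighbour in dimension %d: %u\n", d, node ^ (1u << d));

    printf("0 and 3 adjacent? %d\n", hypercube_adjacent(0, 3));  /* 0 */
    return 0;
}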

Routing

Meshes and Rings
Typically, in meshes the so-called dimension-order routing technique is used. That is, routing is performed in one dimension at a time. In a three-dimensional mesh, for example, a message travelling from node (a,b,c) to node (x,y,z) would first be moved along the first dimension to node (x,b,c), then along the second dimension to node (x,y,c), and finally, in the third dimension, to the destination node (x,y,z).

Stars
Routing in stars is trivial. If one of the communicating nodes is the centre node, then the path is just the edge connecting them. If not, the message is routed from the source node to the centre node, and from there to the destination node.

Hypercubes
A k-dimensional hypercube is nothing other than a k-dimensional mesh with only two nodes in each dimension, and thus the routing algorithm is the same as for meshes, apart from one difference: the path from node A to node B is calculated by simply computing the exclusive-or X = A XOR B of the binary representations of nodes A and B. If the i-th bit of X is '1', the message is moved to the neighbouring node in the i-th dimension; if the i-th bit is '0', the message is not moved in that dimension. This means that it takes at most log2 N steps for a message to reach its destination (where N is the number of nodes in the hypercube).
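The XOR-based hypercube routing just described can be written down as a small simulation (a plain illustration under the assumptions above, not a real router implementation):

/* Sketch of the XOR-based routing described above: X = A ^ B, and the
 * message is forwarded in every dimension whose bit in X is 1, taking at
 * most log2(N) hops. */
#include <stdio.h>

static void hypercube_route(unsigned src, unsigned dst, int k)
{
    unsigned cur = src;
    unsigned diff = src ^ dst;            /* dimensions still to correct */

    printf("%u", cur);
    for (int i = 0; i < k; i++) {
        if (diff & (1u << i)) {           /* bit i is 1: move in dim i   */
            cur ^= (1u << i);
            printf(" -> %u", cur);
        }
    }
    printf("\n");                         /* cur == dst here             */
}

int main(void)
{
    hypercube_route(0 /*0000*/, 13 /*1101*/, 4);   /* 3 hops, at most k  */
    return 0;
}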

Dynamic Topologies

Single-Stage Networks
Buses and crossbars are the two main representatives of this class. A bus is the simplest way to connect a number of processors with each other: all processors are simply connected to one wire. This makes communication, and especially message routing, very simple. The drawback of this type of network is that the available bandwidth is inversely proportional to the number of connected processors. This means that buses are good only for small networks with a maximum of about 10 processors.

The other extreme in terms of complexity is the crossbar network. With a crossbar, full connectivity is given, i.e. all processors can communicate with each other simultaneously without reduction of bandwidth. In the figure, the connection of n processors with m memory modules (as in a shared memory system) is shown. Certainly crossbars can also be used to connect processors with each other. In that case the memory modules are connected directly to the processors (which results in a distributed memory system), and the lines that were connected to the memory modules Mi are now connected to the processors Pi.

To connect n processors to n memory modules, n^2 switches are needed. Consequently, crossbar networks cannot be scaled to arbitrary size. Today's commercially available crossbars can connect up to 256 units.

Multi-Stage Networks
Multi-stage networks are based on the so-called shuffle-exchange switching element, which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network. Depending on the way these elements are connected, the following topologies can be distinguished:

• Banyan
• Baseline
• Cube
• Delta
• Flip
• Indirect cube
• Omega

As an example of a multistage network, an 8 x 8 Benes network is shown in the figure.

Summary

The networks can be classified as static or dynamic. Static interconnection networks are mainly used in message-passing architectures; the following types are commonly defined:

• completely-connected network.
• star-connected network.
• linear array or ring of processors.
• mesh network (in 2- or 3D). Each processor has a direct link to four/six (in 2D/3D) neighbour processors. Extensions of this kind of network are the wraparound mesh or torus. Commercial examples are the Intel Paragon XP/S and Cray T3D/E. These examples also cover another class, namely the direct network topology.
• tree network of processors. The communication bottleneck likely to occur in large configurations can be alleviated by increasing the number of communication links for processors closer to the root, which results in the fat-tree topology, efficiently used in the TMC CM5 computer. The CM5 could also serve as an example of the indirect network topology.
• hypercube network. Classically this is a multidimensional mesh of processors with exactly two processors in each dimension. An example of such a system is the Intel iPSC/860 computer. Some newer projects incorporate the idea of several processors in each node, which results in a fat hypercube, i.e. an indirect network topology. An example is the SGI/Cray Origin2000 computer.

Dynamic interconnection networks implement one of four main alternatives:

• bus-based networks - the simplest and most efficient solution when cost matters and only a moderate number of processors is involved. The main drawbacks are the bottleneck to memory when the number of processors becomes large, and the single point of failure. To overcome these problems, sometimes several parallel buses are incorporated. A classical example of such a machine is the SGI Power Challenge computer with a packet data bus.

Table 1: Properties of various types of multiprocessor interconnections

Property         Bus    Crossbar  Multistage
Speed            low    high      high
Cost             low    high      moderate
Reliability      low    high      high
Configurability  high   low       moderate
Complexity       low    high      moderate





• crossbar switching networks, which employ a grid of switching elements. The network is nonblocking, since the connection of a processor to a memory bank does not block the connection of any other processor to any other memory bank. In spite of their high speed, their use is limited due to nonlinear complexity (O(p^2), where p is the number of processors) and cost (cf. Table 1). They are applied mostly in multiprocessor vector computers (like the Cray YMP) and in multiprocessors with multilevel interconnections (e.g. HP/Convex Exemplar SPP). One outstanding example is the Fujitsu VPP500, which incorporates a 224x224 crossbar switch.

• multistage interconnection networks formulate the most advanced pure solution, which lies between the two extremes (Table 1). A typical example is the omega network, which consists of log2(p) stages of 2 x 2 switches, where p is the number of inputs and outputs (usually the number of processors and of memory banks). Its complexity is O(p log p), less than the O(p^2) of the crossbar switch. However, in the omega network some memory accesses can be blocked. Although machines with this kind of interconnection offer a virtual global memory programming model and ease of use, they are still not very popular. Examples from the past cover the BBN Butterfly and IBM RP-3 computers; at present the IBM RS6K SP incorporates multistage interconnections with the Vulcan switch.

• multilevel interconnection networks seem to be a relatively recent development. The idea comes directly from clusters of computers and consists of two or more levels of connections with different aggregated bandwidths. Typical examples are: SGI/Cray Origin2000, IBM RS6K SP with PowerPC604 SMP nodes, and HP/Convex Exemplar. This kind of architecture is getting the most interest at present.
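To make the omega network concrete, the sketch below simulates the standard destination-tag routing through its log2(p) shuffle-exchange stages (the routing rule is the textbook one for omega networks; the code itself is an added illustration, not from these notes):

/* Sketch of destination-tag routing in a p x p omega network (p = 2^n):
 * each of the n stages performs a perfect shuffle followed by a 2x2
 * exchange switch that is set by one bit of the destination address. */
#include <stdio.h>

static void omega_route(unsigned src, unsigned dst, int n)
{
    unsigned p = 1u << n;
    unsigned cur = src;

    printf("input %u", cur);
    for (int i = n - 1; i >= 0; i--) {
        /* perfect shuffle = rotate the n-bit address left by one      */
        cur = ((cur << 1) | (cur >> (n - 1))) & (p - 1);
        /* the 2x2 switch sets the low bit to destination bit i        */
        cur = (cur & ~1u) | ((dst >> i) & 1u);
        printf(" -> stage %d: %u", n - i, cur);
    }
    printf("  (reached %u)\n", cur);      /* cur == dst after n stages  */
}

int main(void)
{
    omega_route(5, 2, 3);   /* 8 x 8 omega network: log2(8) = 3 stages  */
    return 0;
}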

Hierarchical Memory Technology The hierarchical arrangement of storage in current computer architectures is called the memory hierarchy. It is designed to take advantage of memory locality in computer

programs. Each level of the hierarchy has the properties of higher bandwidth, smaller size, and lower latency than lower levels.

LEVELS in Memory Hierarchy

A ranking of computer memory devices, with devices having the fastest access time at the top of the hierarchy, and devices with slower access times but larger capacity and lower cost at lower levels. Most modern CPUs are so fast that, for most program workloads, the locality of reference of memory accesses and the efficiency of the caching and memory transfer between different levels of the hierarchy are the practical limitations on processing speed. As a result, the CPU spends much of its time idling, waiting for memory I/O to complete. The memory hierarchy in most computers is:

• Processor registers – fastest possible access (usually 1 CPU cycle), only hundreds of bytes in size.
• Level 1 (L1) cache – often accessed in just a few cycles, usually tens of kilobytes.
• Level 2 (L2) cache – higher latency than L1 by 2× to 10×, often 512 KB or more.
• Main memory (DRAM) – may take hundreds of cycles, but can be multiple gigabytes. Access times may not be uniform in the case of a NUMA machine.
• Disk storage – millions of cycles latency, but very large.
• Tertiary storage – several seconds latency, can be huge.

The various major units in a typical memory system can be viewed as forming a hierarchy of memories (M1,M2,...,Mn) in which each member Mi is in a sense subordinate to the next highest member Mi-1 of the hierarchy.

Memory Hierarchy MANAGEMENT
Modern programming languages mainly assume two levels of memory, main memory and disk storage, though in assembly language, and in inline assembler in languages such as C, registers can be directly accessed. Taking optimal advantage of the memory hierarchy requires the cooperation of programmers, hardware, and compilers (as well as underlying support from the operating system):

• Programmers are responsible for moving data between disk and memory through file I/O.
• Hardware is responsible for moving data between memory and caches.
• Optimizing compilers are responsible for generating code that, when executed, will cause the hardware to use caches and registers efficiently.

Memory Hierarchy: TERMINOLOGY



• Hit: the data appears in some block in the cache (example: Block X).
  – Hit Rate: the fraction of cache accesses found in the cache.
  – Hit Time: time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.

• Miss: the data needs to be retrieved from a block in the main memory (Block Y).
  – Miss Rate = 1 - (Hit Rate).
  – Miss Penalty: time to replace a block in the cache plus the time to deliver the block to the processor.
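These terms combine into the average memory access time (AMAT). A minimal numeric sketch, using made-up example values:

/* A small numeric sketch of the terms defined above: the average memory
 * access time (AMAT) seen by the processor is
 *     AMAT = hit time + miss rate * miss penalty.
 * The numbers below are example values, not measurements. */
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles to access the cache          */
    double miss_rate    = 0.05;   /* 5% of accesses miss in the cache    */
    double miss_penalty = 100.0;  /* cycles to fetch the block from DRAM */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05*100 = 6 cycles   */
    return 0;
}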

Memory Hierarchy – PARAMETERS & PROPERTIES
The five parameters associated with memory technologies arranged in a hierarchy are:
1) Access Time: time for the CPU to fetch a value from memory, including delays through any intermediate levels.
2) Memory Size: the amount of memory of a given type in a system.
3) Cost Per Unit (byte): cost per unit times size roughly equals total cost.
4) Transfer Bandwidth: units (bytes) per second transferred to the next level.
5) Unit of Transfer: number of units moved between adjacent levels in a single move.

The three properties of a memory hierarchy are:
1) Inclusion: if a value is found at one level, it is present at all of the levels below it.
2) Coherence: the copies at all of the levels are consistent.
3) Locality: programs access a restricted portion of their address space in any time window.

Obviously, none of these is strictly true. Most hierarchies are inclusive from the registers to the main memory (although we could imagine a multi-level cache that skips a level on loading, and only copies out to the lower level when writing back). However, most tape units do not spool to disk before going to main memory -- they are DMA devices, just like the disks.

INCLUSION Property The Inclusion property is stated by the following set inclusion relations among n memory levels. M1 ⊂ M2⊂ M3 ⊂ … ⊂ Mn Here : M1,M2,M3 are memory levels. n is the number of levels. The above equation signifies that if a value is found at one level in the memory hierarchy, it is present at all of the memory levels below it.

COHERENCE Property Coherence requires that copies of the same information items be consistent at different memory levels. The coherence property extends all the way from the cache at M1 to the outermost memory Mn. 2 strategies to maintain the coherence in a memory hierarchy are: 1) Write Through (WT) 2) Write Back (WB)

LOCALITY Property
Locality is entirely program-dependent. For example, LISP programs have logical locality that does not correspond to physical address locality. Most caches assume array-type data access and sequential code. The book identifies three aspects of this form of locality:
1) Temporal locality: recently accessed items tend to be accessed again in the near future.
2) Spatial locality: accesses are clustered in the address space.
3) Sequential locality: instructions tend to be accessed in sequential memory locations.

NEED of Multilevel Hierarchy
Current trends in hardware technology show a constantly increasing performance gap between the speed of CPUs (processors) and main memory. This gap results in the need for large but fast memory caches. If we had only one very big (L1) cache, then we would be forced to have longer clock cycles. Thus, we overcome this problem by adding additional levels of larger and slower memory caches between the CPU and the main memory, as shown in Figure 1.

Figure 1: Multi-level cache hierarchy (CPU – L1 – L2 – L3 – main memory).

The main reason for using multi-level caching is the need to reduce the penalty for cache misses. In particular, the cache that is smallest and closest to the processor can operate at a very high frequency, equal to the frequency of the processor. The lower-level caches are further away from the processor and operate at a lower frequency, but their access time is still much smaller than the corresponding access time of the main memory. Those caches are big enough to contain data that effectively reduces the need for main memory accesses (they have large hit ratios). It is obvious that by using this technique the average memory access time is reduced.
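With a second cache level, the L1 miss penalty becomes an average over L2 rather than a full main-memory access, which is exactly why the multi-level hierarchy lowers the average access time. A small sketch with assumed example numbers (not measurements):

/* Extending the single-level formula: with an L2 cache between L1 and main
 * memory, the L1 miss penalty is itself an average over L2:
 *     AMAT = t_L1 + m_L1 * ( t_L2 + m_L2 * t_mem )                       */
#include <stdio.h>

int main(void)
{
    double t_l1 = 1.0,  m_l1 = 0.05;    /* L1: 1 cycle, 5% miss rate      */
    double t_l2 = 10.0, m_l2 = 0.20;    /* L2: 10 cycles; 20% of L1 misses
                                           also miss in L2                */
    double t_mem = 200.0;               /* main memory access time        */

    double amat = t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
    printf("two-level AMAT = %.1f cycles\n", amat);   /* 1 + 0.05*50 = 3.5 */
    return 0;
}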

INTRODUCTION
A vector processor, or array processor, is a CPU design where the instruction set includes operations that can perform mathematical operations on multiple data

elements simultaneously. This is in contrast to a scalar processor, which handles one element at a time using multiple instructions. The vast majority of CPUs are scalar. Vector processors were common in the scientific computing area, where they formed the basis of most supercomputers through the 1980s and into the 1990s, but general increases in performance and processor design saw the near disappearance of the vector processor as a general-purpose CPU.

A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the vector elements are ordered to have a fixed addressing increment between successive elements, called the stride. Vector processing occurs when arithmetic or logical operations are applied to vectors. It is distinguished from scalar processing, which operates on one data item or one pair of data items. The conversion from scalar code to vector code is called vectorization. Vector processing is faster and more efficient than scalar processing. It reduces the software overhead incurred in the maintenance of looping control and reduces memory-access conflicts.

VECTOR INSTRUCTION TYPES
1. Vector-vector instructions – one or two vector operands are fetched from the respective vector registers, enter through a functional pipeline unit and produce results in another vector register. They are defined by the following mappings:
f1: Vi → Vj
f2: Vj × Vk → Vi
Examples are V1 = sin(V2) and V3 = V1 + V2 for the mappings f1 and f2, respectively.

2. Vector-scalar instructions – a scalar s is operated on the elements of a vector V to produce another vector. This is defined as:
f3: s × Vi → Vj

3. Vector-memory instructions – these correspond to vector load or vector store, element by element, between a vector register (V) and the memory (M), defined as:
f4: M → V (vector load)
f5: V → M (vector store)

4. Vector reduction instructions – these are defined as:
f6: Vi → sj
f7: Vi × Vj → sk
Examples of f6 include finding the maximum, minimum, sum and mean value of all elements in a vector. An example of f7 is the dot product.

5. Gather and scatter instructions – these instructions use two vector registers to gather or to scatter vector elements randomly throughout the memory. They are defined as follows:
f8: M → V1 × V0 (gather)
f9: V1 × V0 → M (scatter)
Gather is an operation that fetches from memory the nonzero elements of a sparse vector using indices that themselves are indexed. Scatter does the opposite, storing into memory a sparse vector whose nonzero entries are indexed. The vector register V1 contains the data, and the vector register V0 is used as an index to gather or scatter data from or to random memory locations.

6. Masking instructions – this type of instruction uses a mask vector to compress or to expand a vector to a shorter or longer index vector, respectively. They are defined as follows:
f10: V0 × Vm → V1
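To see what the instruction types above mean, the loops below spell a few of them out as ordinary scalar C code; each loop body corresponds to what one vector instruction would do over all elements at once (the register names V0, V1, ... follow the text, but the C function names and the vector length are only illustrative):

/* Loop-level equivalents of some vector instruction types named above. */
#include <stdio.h>
#include <stddef.h>

#define N 8                         /* illustrative vector register length */

/* f2: vector-vector, V3 = V1 + V2 */
void vadd(const double v1[N], const double v2[N], double v3[N]) {
    for (size_t i = 0; i < N; i++) v3[i] = v1[i] + v2[i];
}

/* f3: vector-scalar, Vj = s * Vi */
void vscale(double s, const double vi[N], double vj[N]) {
    for (size_t i = 0; i < N; i++) vj[i] = s * vi[i];
}

/* f8: gather -- V1 receives the memory words selected by index register V0 */
void gather(const double mem[], const size_t v0[N], double v1[N]) {
    for (size_t i = 0; i < N; i++) v1[i] = mem[v0[i]];
}

/* f9: scatter -- the elements of V1 are stored at the addresses in V0 */
void scatter(double mem[], const size_t v0[N], const double v1[N]) {
    for (size_t i = 0; i < N; i++) mem[v0[i]] = v1[i];
}

int main(void)
{
    double mem[16], v1[N];
    size_t v0[N] = { 3, 1, 4, 1, 5, 9, 2, 6 };    /* index register       */
    for (size_t i = 0; i < 16; i++) mem[i] = 10.0 * i;

    gather(mem, v0, v1);                          /* V1[i] = mem[V0[i]]   */
    for (size_t i = 0; i < N; i++) printf("%g ", v1[i]);
    printf("\n");                                 /* 30 10 40 10 50 90 20 60 */
    return 0;
}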

VECTOR-ACCESS MEMORY SCHEMES
The flow of vector operands between the main memory and the vector registers is usually pipelined with multiple access paths.

Vector Operand Specifications – vector operands may have arbitrary length. Vector elements are not necessarily stored in contiguous memory locations. For example, the entries of a matrix may be stored in row-major or in column-major order. Each row, column or diagonal of the matrix can be used as a vector. When row elements are stored in contiguous locations with a unit stride, the column elements must be stored with a stride of n, where n is the matrix order. Similarly, the diagonal elements are separated by a stride of n+1. To access a vector in memory, one must specify its base address, stride and length. Since each vector register has a fixed number of component registers, only a segment of the vector can be loaded into the vector register in a fixed number of cycles. Long vectors must be segmented and processed one segment at a time.

Vector operands should be stored in memory so as to allow pipelined or parallel access. The memory system for a vector processor must be specifically designed to enable fast vector access. The access rate should match the pipeline rate. The access path is often itself pipelined and is called an access pipe. The main memory is built with multiple modules. Once presented with a memory address, each memory module returns one word per cycle. It is possible to present different addresses to different memory modules so that parallel access of multiple words can be done simultaneously or in a pipelined fashion. Consider a main memory formed with m = 2^a memory modules, each containing w = 2^b words of memory cells. The total memory capacity is m · w = 2^(a+b) words. These memory words are assigned linear addresses. Different ways of assigning linear addresses result in different memory organizations. There are three vector-access memory organizations:

C-Access Memory Organization: this uses low-order interleaving, which spreads contiguous memory locations across the m modules horizontally.

This implies that the low-order a bits of the memory address are used to identify the memory module, while the high-order b bits give the word address within each module. Access of the m memory modules can be overlapped in a pipelined fashion, called pipelined memory access. For this purpose, the memory cycle (called the major cycle) is subdivided into m minor cycles. An eight-way interleaved memory (with m = 8 and w = 8, and thus a = b = 3) is shown below:

[Figure: eight-way interleaved memory – a 6-bit memory address register split into a 3-bit word address (high-order bits) and a 3-bit module address (low-order bits) selecting one of the modules M0 through M7.]

The same word address is applied to all memory modules simultaneously. A module address decoder is used to distribute module addresses. This type of concurrent access of contiguous words has been called a C-access memory scheme.

S-Access Memory Organization: the low-order interleaved memory can be rearranged to allow simultaneous access, or S-access. In this case, all memory modules are accessed simultaneously in a synchronized manner. The high-order (n-a) bits select the same offset word from each module.

C/S-Access Memory Organization: this is a memory access scheme in which C-access and S-access are combined. Here n access buses are used, with m interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to allow C-access. The n buses operate in parallel to allow S-access. In each memory cycle, at most m · n words are fetched if the n buses are fully used with pipelined memory accesses.
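The address split used by low-order interleaving can be shown directly. The sketch below assumes the eight-way example from the text (m = 2^a = 8 modules, w = 2^b = 8 words per module):

/* Address decomposition for the low-order interleaving described above:
 * the low-order a bits pick the module and the high-order b bits pick the
 * word within the module.  Illustrative sketch only. */
#include <stdio.h>

#define A 3                               /* m = 2^a = 8 modules        */
#define B 3                               /* w = 2^b = 8 words/module   */

int main(void)
{
    for (unsigned addr = 0; addr < 16; addr++) {
        unsigned module = addr & ((1u << A) - 1);   /* low-order a bits  */
        unsigned word   = addr >> A;                /* high-order b bits */
        printf("address %2u -> module M%u, word %u\n", addr, module, word);
    }
    return 0;   /* consecutive addresses cycle through M0..M7 (C-access) */
}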

SYSTEM INTERCONNECT ARCHITECTURES

Direct networks for static connections and indirect networks for dynamic connections can be used for internal connections among processors, memory modules and I/O disk arrays in a centralised system, or for distributed networking of multicomputer nodes. Various topologies for building networks are specified below. We focus on the communication properties of interconnection networks. These include latency analysis, bisection bandwidth, and data routing functions. The communication efficiency of the underlying network is critical to the performance of a parallel computer. We hope to achieve a low-latency network with a high data transfer rate and thus a wide communication bandwidth. These network properties help make design choices for machine architecture.

NETWORK PROPERTIES AND ROUTING
The topology of an interconnection network can be either static or dynamic. Static networks are formed of point-to-point direct connections, which will not change during program execution. Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demands in user programs. Static networks are used for fixed connections among subsystems of a centralized system or multiple computing nodes of a distributed system. Dynamic networks include buses, crossbar switches, and multistage networks, which are often used in shared memory multiprocessors. Some of the parameters used to estimate the complexity, communication efficiency, and cost of a network are defined below:

(i) Network Size: in general, a network is represented by a graph of a finite number of nodes linked by directed or undirected edges. The number of nodes in the graph is called the network size.

(ii) Node Degree and Network Diameter: the number of edges (links or channels) incident on a node is called the node degree d. In the case of unidirectional channels, the number of channels into a node is the in-degree, and the number of channels out of a node is the out-degree. The node degree is then the sum of the two. The node degree should be kept constant and small in order to reduce the cost. The diameter D of a network is the maximum shortest path between any two nodes. The path length is measured by the number of links traversed. The diameter should be small from a communication point of view.

(iii) Bisection Width: when a given network is cut into two equal halves, the minimum number of edges (channels) along the cut is called the channel bisection width b. In the case of a communication network, each edge corresponds to a channel with w bit wires. Then the wire bisection width is B = bw. B reflects the wiring density of a network. Thus,

bisection width provides a good indicator of the maximum communication bandwidth along the bisection of a network.

(iv) Data Routing Functions: a data routing network is used for inter-PE data exchange. This routing network can be static, such as the hypercube routing network used in the TMC/CM-2, or dynamic, such as the multistage network used in the IBM GF11. In the case of a multicomputer network, data routing is achieved through message passing. Hardware routers are used to route messages among multiple computers. The versatility of a routing network will reduce the time needed for data exchange and thus will significantly improve the performance. Commonly used data routing functions among the processing elements include shifting, rotation, broadcast (one-to-all), multicast (many-to-many), personalized communication (one-to-many), etc.

Hypercube Routing Functions
A three-dimensional binary cube network is shown below. Three routing functions are defined by the three bits in the node address.
a. One can exchange data between adjacent nodes which differ in the least significant bit, e.g. 000↔001, 010↔011, 100↔101, 110↔111.
b. Routing by the middle bit, e.g. 000↔010, 001↔011.
c. Routing by the most significant bit, e.g. 000↔100, 001↔101.

Instruction Set Architecture (ISA)
An instruction set is a list of all the instructions, and all their variations, that a processor (or, in the case of a virtual machine, an interpreter) can execute. Instructions include:

• Arithmetic instructions such as add and subtract
• Logic instructions such as and, or, and not
• Data instructions such as move, input, output, load, and store
• Control flow instructions such as goto, if ... goto, call, and return.

An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine language), the native commands implemented by a particular CPU design.

Categories of ISA
• CISC
• RISC

RISC
Why is this architecture called RISC? What is reduced about it? The answer is that, to make all instructions the same length, the number of bits used for the opcode is reduced. Thus fewer instructions are provided. The instructions that were thrown out are the less important string and BCD (binary-coded decimal) operations. In fact, now that memory access is restricted, there aren't several kinds of MOV or ADD instructions. Thus the older architecture is called CISC (Complex Instruction Set Computer). RISC architectures are also called LOAD/STORE architectures. The number of registers in RISC is usually 32 or more.

The CISC Architecture

The RISC Architecture
Reduced instruction set computer
The acronym RISC (pronounced risk), for reduced instruction set computing, represents a CPU design strategy emphasizing the insight that simplified instructions that "do less" may still provide higher performance if this simplicity can be utilized to make instructions execute very quickly. Many proposals for a "precise" definition have been attempted, and the term is being slowly replaced by the more descriptive load-store architecture. Well-known RISC families include Alpha, ARC, ARM, AVR, MIPS, PA-RISC, Power Architecture (including PowerPC), SuperH, and SPARC. Complex addressing inherently takes many cycles to perform. It was argued that such functions would better be performed by sequences of simpler instructions, if this could

yield implementations simple enough to cope with really high frequencies, and small enough to leave room for many registers [1], factoring out slow memory accesses. Uniform, fixed-length instructions with arithmetic restricted to registers were chosen to ease instruction pipelining in these simple designs, with special load-store instructions accessing memory.

Typical characteristics of RISC
For any given level of general performance, a RISC chip will typically have far fewer transistors dedicated to the core logic, which originally allowed designers to increase the size of the register set and increase internal parallelism. Other features typically found in RISC architectures are:

• Uniform instruction format, using a single word with the opcode in the same bit positions in every instruction, demanding less decoding;
• Identical general-purpose registers, allowing any register to be used in any context, simplifying compiler design (although normally there are separate floating-point registers);
• Simple addressing modes, with complex addressing performed via sequences of arithmetic and/or load-store operations;
• Few data types in hardware; some CISCs have byte string instructions or support complex numbers, which are so far unlikely to be found on a RISC.





Complex instruction set computer
A complex instruction set computer (CISC, pronounced like "sisk") is a computer instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction.

Benefits
Before the RISC philosophy became prominent, many computer architects tried to bridge the so-called semantic gap, i.e. to design instruction sets that directly supported high-level programming constructs such as procedure calls, loop control, and complex addressing modes, allowing data structure and array accesses to be combined into single instructions. The compact nature of such instruction sets results in smaller program sizes and fewer calls to main memory, which at the time (the early 1960s and onwards) resulted in tremendous savings on the cost of computer memory and disc storage.

Problems
While many designs achieved the aim of higher throughput at lower cost and also allowed high-level language constructs to be expressed by fewer instructions, it was observed that this was not always the case. For instance, low-end versions of complex architectures (i.e. using less hardware) could lead to situations where it was possible to

improve performance by not using a complex instruction (such as a procedure call or enter instruction), but instead using a sequence of simpler instructions.

CISC vs RISC

CISC
Pronounced "sisk", and stands for Complex Instruction Set Computer. Most PCs use a CPU based on this architecture; for instance, Intel and AMD CPUs are based on CISC architectures. Typically CISC chips have a large number of different and complex instructions. The philosophy behind it is that hardware is always faster than software, therefore one should make a powerful instruction set, which provides programmers with assembly instructions to do a lot with short programs. In general, CISC chips are relatively slow (compared to RISC chips) per instruction, but need fewer instructions than RISC.

RISC
Pronounced "risk", and stands for Reduced Instruction Set Computer. RISC chips evolved around the mid-1980s as a reaction to CISC chips. The philosophy behind it is that almost no one uses the complex assembly language instructions provided by CISC, and people mostly use compilers which never use complex instructions. Apple, for instance, uses RISC chips. Therefore fewer, simpler and faster instructions would be better than the large, complex and slower CISC instructions. However, more instructions are needed to accomplish a task. Another advantage of RISC is that - in theory - because of the simpler instructions, RISC chips require fewer transistors, which makes them easier to design and cheaper to produce. Finally, it's easier to write powerful optimised compilers, since fewer instructions exist.

RISC vs CISC
There is still considerable controversy among experts about which architecture is better. Some say that RISC is cheaper and faster and therefore the architecture of the future.

Others note that by making the hardware simpler, RISC puts a greater burden on the software. Software needs to become more complex. Software developers need to write more lines for the same tasks. Therefore they argue that RISC is not the architecture of the future, since conventional CISC chips are becoming faster and cheaper anyway. RISC has now existed for more than 10 years and hasn't been able to kick CISC out of the market. If we forget about the embedded market and mainly look at the market for PCs, workstations and servers, I guess at least 75% of the processors are based on the CISC architecture. Most of them follow the x86 standard (Intel, AMD, etc.), but even in mainframe territory CISC is dominant via the IBM/390 chip. Looks like CISC is here to stay.

Is RISC then really not better? The answer isn't quite that simple. RISC and CISC architectures are becoming more and more alike. Many of today's RISC chips support just as many instructions as yesterday's CISC chips. The PowerPC 601, for example, supports more instructions than the Pentium. Yet the 601 is considered a RISC chip, while the Pentium is definitely CISC. Furthermore, today's CISC chips use many techniques formerly associated with RISC chips. So, simply said: RISC and CISC are growing towards each other.

x86
An important factor is also that the x86 standard, as used by Intel and AMD for instance, is based on the CISC architecture. x86 is the standard for home PCs. Windows 95 and 98 won't run on any other platform. Therefore companies like AMD and Intel will not abandon the x86 market overnight, even if RISC were more powerful. Changing their chips in such a way that on the outside they stay compatible with the CISC x86 standard, but use a RISC architecture inside, is difficult and gives all kinds of overhead which could undo all the possible gains. Nevertheless, Intel and AMD are doing this more or less with their current CPUs. Most acceleration mechanisms available to RISC CPUs are now available to x86 CPUs as well. Since in the x86 market the competition is killing, prices are low, even lower than for most RISC CPUs. Although RISC prices are also dropping, a SUN UltraSPARC, for instance, is still more expensive than an equally performing PII workstation - equal, that is, in terms of integer performance. In the floating-point area RISC still holds the crown. However, CISC's 7th-generation x86 chips like the K7 will catch up with that. The one exception to this might be the Alpha EV-6. Those machines are overall about twice as fast as the fastest x86 CPU available. However, this Alpha chip costs about €20000, not something you're willing to pay for a home PC.

Maybe interesting to mention is that it's no coincidence that AMD's K7 was developed in co-operation with Alpha and is for a large part based on the same Alpha EV-6 technology.

EPIC
The biggest threat for CISC and RISC might not be each other, but a new technology called EPIC. EPIC stands for Explicitly Parallel Instruction Computing. As the word parallel already says, EPIC can do many instruction executions in parallel to one another. EPIC was created by Intel and is in a way a combination of both CISC and RISC. This will in theory allow the processing of Windows-based as well as UNIX-based applications by the same CPU. It will not be until 2000 before we can see an EPIC chip. Intel is working on it under the code-name Merced. Microsoft is already developing its Win64 standard for it. As the name says, Merced will be a 64-bit chip. If Intel's EPIC architecture is successful, it might be the biggest threat for RISC. All of the big CPU manufacturers but Sun and Motorola are now selling x86-based products, and some are just waiting for Merced to come out (HP, SGI). Because of the x86 market it is not likely that CISC will die soon, but RISC may. So the future might bring EPIC processors and more CISC processors, while the RISC processors become extinct.

Conclusion
The difference between RISC and CISC chips is getting smaller and smaller. What counts is how fast a chip can execute the instructions it is given and how well it runs existing software. Today, both RISC and CISC manufacturers are doing everything to get an edge on the competition. The future might not bring victory to one of them, but may make both extinct: EPIC might make first RISC obsolete and later CISC too.

Written by A.A. Gerritsen for the CPU Site, March '99

DATAFLOW ARCHITECTURE

One of the few experimental dataflow computer projects is the tagged-token architecture for building dataflow computers developed at MIT. The global architecture consists of n processing elements (PEs) interconnected by an n x n routing network. The entire system supports pipelined dataflow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.

Processing Element Architecture
Within each PE, the machine provides a low-level token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes.

Demand-Driven Mechanisms
In a reduction machine, the computation is triggered by the demand for an operation's result. Consider the evaluation of a nested arithmetic expression a = ((b+1)*c - (d%e)). A data-driven computation chooses a bottom-up approach, starting from the innermost operations b+1 and d%e, then proceeding to the * operation, and finally to the outermost operation, the subtraction. Such a computation has been called eager evaluation because operations are carried out immediately after all their operands become available.

A demand-driven computation chooses a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b+1)*c and d%e, which in turn triggers the demand for evaluating b+1 at the innermost level. The results are then returned to the nested demander in the reverse order before a is evaluated. A demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. The demand-driven approach matches naturally with the functional programming concept. The removal of side effects in functional programming makes programs easier to parallelize. There are two types of reduction machine models, both having a recursive control mechanism.

Reduction Machine Models
In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. Each reduction step has an operator followed by an embedded reference to demand the corresponding input operands. The operator is suspended while its input arguments are being evaluated. An expression is said to be fully reduced when all the arguments have been replaced by literal values.

In a graph reduction model, the expression is represented as a directed graph. Different parts of the graph, or subgraphs, can be reduced or evaluated in parallel upon demand. Each demander is given a pointer to the result of the reduction. The demander manipulates all references to that graph. Graph manipulation is based on sharing the arguments using pointers. The traversal of the graph and reversal of the references are continued until constant arguments are encountered. This proceeds until the value of a is determined and a copy is returned to the original demanding instruction.

MULTIPROCESSOR SYSTEM INTERCONNECTS

Parallel processing demands the use of efficient system interconnects for fast communication among multiple processors and shared memory, I/O, and peripheral devices. Hierarchical buses, crossbar switches, and multistage networks are often used for this purpose. A generalized multiprocessor architecture combines features from the UMA, NUMA, and COMA models. Each processor Pi is attached to its own local memory and private cache. Multiple processors are connected to shared-memory modules through an interprocessor-memory network (IPMN). The processors share access to I/O and peripheral devices through a processor-I/O network (PION). Both IPMN and PION are necessary in a shared-resource multiprocessor. Direct interprocessor communications are supported by an optional interprocessor communication network (IPCN) instead of through the shared memory.

Network characteristics: each of the above types of networks can be designed with many choices. The choices are based on the topology, timing protocol, switching method, and control strategy. Dynamic networks are used in multiprocessors in which the interconnections are under program control. Timing, switching, and control are three major operational characteristics of an interconnection network. The timing control can be either synchronous or asynchronous. Synchronous networks are controlled by a global clock that synchronizes all network activities. Asynchronous networks use handshaking or interlocking mechanisms to coordinate fast and slow devices requesting use of the same network.

A network can transfer data using either circuit switching or packet switching. In circuit switching, once a device is granted a path in the network, it occupies the path for the entire duration of the data transfer. In packet switching, the information is broken into small packets individually competing for a path in the network. Network control strategy is classified as centralized or distributed. With centralized control, a global controller receives requests from all devices attached to the network and grants the network access to one or more requesters. In a distributed system, requests are handled by local devices independently.

HIERARCHICAL BUS SYSTEMS
A bus system consists of a hierarchy of buses connecting various system and subsystem components in a computer. Each bus is formed with a number of signal, control and power lines. Different buses are used to perform different interconnection functions. In general, the hierarchy of bus systems is packaged at different levels, including local buses on boards, backplane buses, and I/O buses.

Local Bus: buses implemented on printed circuit boards are called local buses. On a processor board one often finds a local bus which provides a common communication path among the major components (chips) mounted on the board. A memory board uses a memory bus to connect the memory with the interface logic.

An I/O board or network interface board uses a data bus. Each of these board buses consists of signal and utility lines. With the sharing of the lines by many I/O devices, the layout of these lines may be at different layers of the PC board.

Backplane Bus: a backplane is a printed circuit on which many connectors are used to plug in functional boards. A system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus provides a common communication path among all plug-in boards. Several backplane bus systems have been developed, such as the VME bus, Multibus II and Futurebus+.

I/O Bus: input/output devices are connected to a computer system through an I/O bus such as the SCSI (small computer system interface) bus. This bus is made of coaxial cables with taps connecting disks, printer, and tape unit to a processor through an I/O controller. Special interface logic is used to connect various board types to the backplane bus.

Hierarchical Bus System

Multiprocessor System Interconnects
Legend:
IPMN – interprocessor-memory network
PION – processor I/O network
IPCN – interprocessor communication network
P – processor
C – cache
SM – shared memory
LM – local memory

COMPARISON OF FLOW MECHANISMS

Machine Model: Control Flow (control-driven)
  Basic Definition: conventional computation; a token of control indicates when a statement should be executed.
  Advantages: full control; complex data and control structures are easily implemented.
  Disadvantages: less efficient; difficult in programming; difficult in preventing runtime errors.

Machine Model: Dataflow (data-driven)
  Basic Definition: eager evaluation; statements are executed when all of their operands are available.
  Advantages: very high potential for parallelism; high throughput; free from side effects.
  Disadvantages: time lost waiting for unneeded arguments; high control overhead; difficult in manipulating data structures.

Machine Model: Reduction (demand-driven)
  Basic Definition: lazy evaluation; statements are executed only when their result is required for another computation.
  Advantages: only required instructions are executed; high degree of parallelism; easy manipulation of data structures.
  Disadvantages: does not support sharing of objects with changing local state; time needed to propagate demand tokens.

Computer Arithmetic Principles
Arithmetic operations can be performed by considering two basic forms of operands:
1) operations performed on values constrained by a fixed memory (word) size, and
2) operations performed by rounding off or truncating the value.

Fixed Point Operations: as defined earlier, fixed-point arithmetic works on sign-magnitude values or uses the concepts of 1's complement and 2's complement. Note that 1's complement introduces a second zero, also known as the "dirty zero". Fixed-point arithmetic includes the general operations:
1) Add
2) Subtract
3) Multiply
4) Divide
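The "dirty zero" of 1's complement can be seen by negating values at the bit level. The small C sketch below is an added illustration (bit patterns worked out for 8-bit words):

/* Sketch of the "second zero" mentioned above: in 8-bit one's complement,
 * 0000 0000 (+0) and 1111 1111 (-0) both represent zero, while two's
 * complement has a single zero. */
#include <stdio.h>

int main(void)
{
    unsigned char plus_zero = 0x00;           /* +0 in both systems          */
    unsigned char ones_neg0 = 0xFF;           /* -0 in one's complement      */
    unsigned char twos_neg1 = 0xFF;           /* but -1 in two's complement  */

    /* one's complement negation: flip the bits; two's complement: flip + 1 */
    unsigned char x = 5;
    printf("one's complement of %u : 0x%02X\n", x, (unsigned char)~x);       /* 0xFA */
    printf("two's complement of %u : 0x%02X\n", x, (unsigned char)(~x + 1)); /* 0xFB */

    printf("+0 = 0x%02X, one's-complement -0 = 0x%02X (two zeros)\n",
           plus_zero, ones_neg0);
    printf("the same pattern 0x%02X is -1 in two's complement\n", twos_neg1);
    return 0;
}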

Floating Point Numbers: a floating-point number has two parts:
1) M – the mantissa, and
2) E – the exponent, with an implied base (radix) R.
The formula we work with is X = M * R^E, where R = 2 in the case of the binary number system.
The 32 bits are utilized as:
1) 1 bit for the sign (bit 0),
2) 8 bits for the exponent (bits 1-8), and
3) 23 bits for the mantissa (bits 9-31).
E in the range (-127, 128) is represented as (0, 255), and X = (-1)^s * 2^(E-127) * (1.M).
The conditions that exist are as follows:
1) If E = 255 and m != 0, then X is Not a Number (NaN).
2) If E = 255 and m = 0, then X is an infinite number.
3) If E = 0 and m != 0, then X is a denormalized number.
4) If E = 0 and m = 0, then X is +0 or -0.
Floating Point Operations: the operations that can be performed are as follows (where Ey is taken to be the larger exponent):
1) X + Y = (Mx * 2^(Ex-Ey) + My) * 2^Ey
2) X - Y = (Mx * 2^(Ex-Ey) - My) * 2^Ey
3) X * Y = (Mx * My) * 2^(Ex+Ey)
4) X / Y = (Mx / My) * 2^(Ex-Ey)
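The sign/exponent/mantissa split and the special cases for E = 255 and E = 0 can be decoded directly from the 32 bits of a float. The sketch below is an added illustration of the classification rules, not a production floating-point library:

/* Decoding the 32-bit format described above (1 sign bit, 8 exponent bits,
 * 23 mantissa bits), including the special cases for E = 255 and E = 0. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void classify(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits   */

    uint32_t s = bits >> 31;                  /* sign bit                  */
    uint32_t e = (bits >> 23) & 0xFF;         /* biased exponent E         */
    uint32_t m = bits & 0x7FFFFF;             /* 23-bit mantissa           */

    if (e == 255 && m != 0)      printf("%g: NaN\n", f);
    else if (e == 255 && m == 0) printf("%g: %s infinity\n", f, s ? "-" : "+");
    else if (e == 0 && m != 0)   printf("%g: denormalized number\n", f);
    else if (e == 0 && m == 0)   printf("%g: %s zero\n", f, s ? "-" : "+");
    else                          /* normalized: X = (-1)^s * 2^(E-127) * 1.M */
        printf("%g: normalized, E-127 = %d\n", f, (int)e - 127);
}

int main(void)
{
    classify(6.5f);
    classify(0.0f);
    classify(-0.0f);
    classify(1e-45f);                         /* smallest denormal         */
    return 0;
}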

SIMD ARCHITECTURE

FLYNN'S TAXONOMY OF COMPUTER ARCHITECTURE
Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can thus be classified into the following four distinct categories:
1) Single-Instruction Single-Data streams (SISD);
2) Single-Instruction Multiple-Data streams (SIMD);
3) Multiple-Instruction Single-Data streams (MISD); and
4) Multiple-Instruction Multiple-Data streams (MIMD).

The architecture of SIMD computer models is determined by:
1) the memory distribution, and
2) the addressing scheme used.

SIMD computers use a single control unit and distributed memories and some of them use associative memories. The instruction set of an SIMD computer is decoded by the array control unit. The major components of SIMD computers are:

1) Processing Elements (PEs): the PEs in the SIMD array are passive.
2) Arithmetic and Logic Units (ALUs): the ALUs of the PEs execute instructions broadcast from the control unit.

All PEs must operate in lockstep, synchronized by the same array controller.

DISTRIBUTED MEMORY MODEL



• It consists of an array of PEs which are controlled by the same array control unit.
• The host computer is responsible for the programs and data being loaded into the control memory.
• When an instruction is sent to the control unit for decoding, a scalar or program-control operation is executed directly by the scalar processor attached to the control unit.
• In the case of a vector operation, the instruction is broadcast to all the PEs for parallel execution. The partitioned data are distributed to all the local memories through a vector data bus.
• The data routing network is under program control through the control unit.
• Masking logic is provided to enable or disable any PE from participating in an instruction cycle (a small sketch of masked lockstep execution follows this list).
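A minimal simulation of lockstep execution under a mask (the PE count, data and mask values are invented for illustration; a real SIMD array would perform the loop body in all enabled PEs in the same cycle):

/* Simulation of the lockstep execution and masking described above: the
 * array control unit broadcasts one instruction, and every enabled PE
 * applies it to its own local datum. */
#include <stdio.h>

#define NPE 8                           /* number of processing elements  */

int main(void)
{
    int local[NPE] = { 1, 2, 3, 4, 5, 6, 7, 8 };   /* one datum per PE    */
    int mask[NPE]  = { 1, 1, 0, 1, 0, 1, 1, 0 };   /* 1 = PE enabled      */
    int scalar = 10;                    /* broadcast scalar operand        */

    /* broadcast instruction: local = local + scalar (vector-scalar add)   */
    for (int pe = 0; pe < NPE; pe++)    /* conceptually all PEs at once    */
        if (mask[pe])                   /* masking logic disables some PEs */
            local[pe] += scalar;

    for (int pe = 0; pe < NPE; pe++)
        printf("PE%d: %d\n", pe, local[pe]);
    return 0;
}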

Examples of Distributed Memory:

1) MESH ARCHITECTURE: Illiac IV, Goodyear MPP, AMT DAP 610. 2) HYPERCUBE: CM-2, X-Net.

SHARED MEMORY



An alignment network is used as the inter PE memory communication network.



The number of memory modules (m) should be relatively prime to the number of PEs so that conflict-free parallel memory access can be achieved through skewing.

Examples of Shared Memory: Burroughs Scientific Processor (BSP, with 16 PEs and 17 memory modules), CM/200.

SIMD Instructions: SIMD computers execute vector instructions for arithmetic, logic, data-routing and masking operations over vector quantities. Two cases are distinguished:

Bit-Slice SIMD: vectors are binary vectors.



Word-Parallel SIMD: vector components are 4-byte or 8-byte numerical values.

All SIMD instructions operate on vectors of equal length n, where n corresponds to the number of PEs.

HOST & I/O: All I/O operations are handled by the host computer in an SIMD organization. A special control memory is used between the host and the array control unit; it serves as a staging memory for holding programs and data.

Partitioned data sets are distributed to the local memory modules or to the shared memory modules before program execution starts. The host also manages mass storage and the graphics display of computational results.


Introduction: In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements. Computer-related pipelines include:
1. Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.
2. Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering, etc.).

3. Software pipelines, consisting of multiple processes arranged so that the output stream of one process is automatically and promptly fed as the input stream of the next one. Unix pipelines are the classical implementation of this concept.

Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
2. Some combinatorial circuits such as adders or multipliers can be made faster by adding more circuitry.
3. If pipelining is used instead, it can save circuitry vs. a more complex combinatorial circuit.
Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip-flops must be added to the data path of a pipelined processor.

3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.

Arithmetic pipelines
The arithmetic operations most often used in the literature to illustrate the operation of arithmetic pipelines are floating-point addition and multiplication.

Floating-point addition
Consider the addition of two normalized floating-point numbers A = (Ea, Ma) and B = (Eb, Mb) to obtain the sum S = (Es, Ms), where E and M represent the exponent and mantissa, respectively. The addition follows the steps shown below:
1. Equalize the exponents: if Ea < Eb, swap A and B; Ediff = Ea - Eb; shift Mb right by Ediff bits.
2. Add mantissae: Ms = Ma + Mb; Es = Ea.
3. Normalize Ms and adjust Es to reflect the number of shifts required to normalize.
4. The normalized Ms might have more bits than can be accommodated by the mantissa field in the representation. If so, round Ms.
5. If rounding causes a mantissa overflow, renormalize Ms and adjust Es accordingly.
The figure shows a five-stage pipeline configuration for the addition process given above.

Floating-point add pipeline
The throughput of the above pipeline can be enhanced by rearranging the computations into a larger number of stages, each consuming a smaller amount of time, as shown in Figure 3.6. Here, equalizing exponents is performed using a subtract-exponents stage and a shift stage that shifts the mantissa appropriately. Similarly, normalizing is split into two stages. This eight-stage pipeline provides a speedup of 8/5 = 1.6 over the pipeline of the above figure.

Modified floating-point add pipeline

In the pipeline of the above figure we have assumed that the shift stages can perform an arbitrary number of shifts in one cycle. If that is not the case, the shifters have to be used repeatedly. Figure 3.7 shows the rearranged pipeline, where the feedback paths indicate the reuse of the corresponding stage.
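Before moving on to multiplication, the five addition steps above can be mirrored in software. The C sketch below uses a toy normalized format (a 24-bit integer mantissa m and an exponent e, so the value is m * 2^e); the struct, field names and test values are illustrative only, and rounding (steps 4 and 5) is omitted.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy normalized format: value = m * 2^e, with 2^23 <= m < 2^24.
       Field widths and names are illustrative, not a hardware spec.   */
    typedef struct { uint32_t m; int e; } fp_t;

    static fp_t fp_add(fp_t a, fp_t b) {
        /* 1) Equalize exponents: swap so that a holds the larger one,
              then shift the smaller mantissa right by the difference. */
        if (a.e < b.e) { fp_t t = a; a = b; b = t; }
        int diff = a.e - b.e;
        uint64_t mb = (diff < 64) ? ((uint64_t)b.m >> diff) : 0;

        /* 2) Add mantissae; the result exponent is the larger one.    */
        uint64_t ms = (uint64_t)a.m + mb;
        int es = a.e;

        /* 3) Normalize: a carry out of bit 23 means one right shift.  */
        while (ms >= (1u << 24)) { ms >>= 1; es++; }

        /* 4)-5) Rounding and renormalizing after overflow omitted.    */
        fp_t s = { (uint32_t)ms, es };
        return s;
    }

    int main(void) {
        fp_t a = { 1u << 23, 2 };   /* 1.0 * 2^25 in this toy format */
        fp_t b = { 1u << 23, 0 };   /* 1.0 * 2^23                    */
        fp_t s = fp_add(a, b);
        printf("m=0x%06X e=%d\n", (unsigned)s.m, s.e);
        return 0;
    }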

Floating-point multiplication
Consider the multiplication of two floating-point numbers A = (Ea, Ma) and B = (Eb, Mb), resulting in the product P = (Ep, Mp). The multiplication follows the pipeline configuration shown in the figure, and the steps are listed below:
1. Add exponents: Ep = Ea + Eb.
2. Multiply mantissae: Mp = Ma * Mb. Mp will be a double-length mantissa.
3. Normalize Mp and adjust Ep accordingly.
4. Convert Mp into a single-length mantissa by rounding.
5. If rounding causes a mantissa overflow, renormalize Mp and adjust Ep accordingly.

Stage 2 in the above pipeline would consume the largest amount of time. In Figure below stage 2 is split into two stages, one performing partial products and the other accumulating them. In fact, the operations of these two stages can be overlapped in the sense that when the accumulate stage is adding, the other stage can be producing the next partial product.
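A matching sketch of the multiplication steps, in the same toy format as the addition example (again, the names and test values are ours, and rounding is omitted):

    #include <stdio.h>
    #include <stdint.h>

    /* Same toy normalized format as in the addition sketch:
       value = m * 2^e with 2^23 <= m < 2^24.                           */
    typedef struct { uint32_t m; int e; } fp_t;

    static fp_t fp_mul(fp_t a, fp_t b) {
        int ep = a.e + b.e;                 /* 1) add exponents          */
        uint64_t mp = (uint64_t)a.m * b.m;  /* 2) double-length mantissa */

        /* 3) Normalize: mp lies in [2^46, 2^48); shift it back down
              to 24 bits and charge the shifts to the exponent.          */
        int shift = 0;
        while ((mp >> (24 + shift)) != 0) shift++;
        fp_t p = { (uint32_t)(mp >> shift), ep + shift };

        /* 4)-5) Rounding of the discarded low bits is omitted.          */
        return p;
    }

    int main(void) {
        fp_t a = { 3u << 22, 0 };   /* 1.5 * 2^23 in this format */
        fp_t b = { 1u << 23, 1 };   /* 1.0 * 2^24                */
        fp_t p = fp_mul(a, b);
        printf("m=0x%06X e=%d\n", (unsigned)p.m, p.e);  /* 1.5 * 2^47 */
        return 0;
    }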

Floating-point multiplication pipeline

Floating-point multiplier pipeline with feedback loops

Floating-point adder/multiplier

The pipelines shown so far in this section are unifunction pipelines, since they are designed to perform only one function. Note that the pipelines of the figures above have several common stages. If a processor is required to perform both addition and multiplication, the two pipelines can be merged into one as shown in the figure above. Obviously, there will be two distinct paths of dataflow in this pipeline, one for

addition and the other for multiplication. This is a multifunction pipeline. A multifunction pipeline can perform more than one operation. The interconnection between the stages of the pipeline changes according to the function it is performing. Obviously, a control input that determines the particular function to be performed on the operands being input is needed for proper operation of the multifunction pipeline.
Static Arithmetic Pipelines
Most of today's arithmetic pipelines are designed to perform fixed functions. These arithmetic and logic units perform fixed-point and floating-point operations separately. The fixed-point unit is also called the integer unit. The floating-point unit can be built either as part of the central processor or on a separate coprocessor. These arithmetic units perform scalar operations involving one pair of operands at a time. The pipelining in scalar arithmetic pipelines is controlled by software loops. Vector arithmetic units can be designed with pipeline hardware directly under firmware or hardwired control. Scalar and vector arithmetic pipelines differ mainly in the areas of register files and control mechanisms involved. Vector hardware pipelines are often built as an add-on option to a scalar processor or as an attached processor driven by a control processor. Both scalar and vector processors are used in modern supercomputers.
Arithmetic Pipeline Stages
Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic. Since all arithmetic operations (such as add, subtract, multiply, divide, squaring, square rooting, logarithm, etc.) can be implemented with the basic add and shift operations, the core arithmetic stages require some form of hardware to add or to shift. For example, a typical three-stage floating-point adder includes a first stage for exponent comparison and equalization, which is implemented with an integer adder and some shifting logic; a second stage for fraction addition using a high-speed carry-lookahead adder; and a third stage for fraction normalization and exponent readjustment using a shifter and another addition logic. Arithmetic or logical shifts can be easily implemented with shift registers. High-speed addition requires either the use of a carry-propagation adder (CPA), which adds two numbers and produces an arithmetic sum as shown in Fig. 6.22a, or the use of a carry-save adder

(CSA) to "add" three input numbers and produce one sum output and a carry output, as exemplified in the figure below. In a CPA, the carries generated in successive digits are allowed to propagate from the low end to the high end, using either ripple carry propagation or some carry-lookahead technique. In a CSA, the carries are not allowed to propagate but instead are saved in a carry vector.
In general, an n-bit CSA is specified as follows. Let X, Y, and Z be three n-bit input numbers, expressed as X = (xn-1, xn-2, ..., x1, x0), and similarly for Y and Z. The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers, denoted as Sb = (0, Sn-1, Sn-2, ..., S1, S0) and C = (Cn, Cn-1, ..., C1, 0). Note that the leading bit of the bitwise sum Sb is always a 0, and the tail bit of the carry vector C is always a 0. The input-output relationships are expressed below:
Si = xi ⊕ yi ⊕ zi
Ci+1 = xi yi ∨ yi zi ∨ zi xi        (6.21)
for i = 0, 1, 2, ..., n-1, where ⊕ is the exclusive OR and ∨ is the logical OR operation. Note that the arithmetic sum of the three input numbers, i.e., S = X + Y + Z, is obtained by adding the two output numbers, i.e., S = Sb + C, using a CPA. We use the CPA and CSAs to implement the pipeline stages of a fixed-point multiply unit as follows.
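Before looking at the multiply pipeline design, note that the CSA equations above translate directly into bitwise C. The sketch below (the function name csa is ours) produces the bitwise sum Sb and the carry vector C for word-sized operands, and then uses an ordinary addition to stand in for the final CPA.

    #include <stdio.h>

    /* One carry-save step on word-sized operands: reduce X, Y, Z to a
       bitwise sum Sb and a carry vector C with no carry propagation
       between digits: Si = xi XOR yi XOR zi, Ci+1 = xiyi OR yizi OR zixi. */
    static void csa(unsigned x, unsigned y, unsigned z,
                    unsigned *sb, unsigned *c) {
        *sb = x ^ y ^ z;
        *c  = ((x & y) | (y & z) | (z & x)) << 1;  /* carries move up one digit */
    }

    int main(void) {
        unsigned x = 23, y = 101, z = 250, sb, c;
        csa(x, y, z, &sb, &c);
        /* The true sum X + Y + Z is recovered by one carry-propagate add. */
        printf("X+Y+Z = %u, Sb + C = %u\n", x + y + z, sb + c);
        return 0;
    }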

Multiply Pipeline Design

a.) An n-bit carry-propagate adder (CPA), which uses either ripple carry propagation or the carry-lookahead technique.

b.) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y, and Z, and C is the carry vector generated without carry propagation between digits.
Consider the multiplication of two 8-bit integers, A x B = P, where P is the 16-bit product in double precision. This fixed-point multiplication can be written as the summation of eight partial products as shown below:
P = A x B = p0 + p1 + p2 + ... + p7
where x and + are arithmetic multiply and add operations, respectively.

Note that the partial product pj is obtained by multiplying the multiplicand A by the j-th bit of B and then shifting the result j bits to the left, for j = 0, 1, 2, ..., 7. Thus pj is (8+j) bits long with j trailing zeros. The first stage generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously. The second stage is made up of two levels of four CSAs, which essentially merge the eight numbers into four numbers ranging from 13 to 15 bits. The third stage consists of two CSAs which merge the four numbers into two 16-bit numbers. The final stage is a CPA which adds up the last two numbers to produce the final product P.
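The partial-product formulation P = p0 + p1 + ... + p7 can be checked with a few lines of C. The sketch below collapses the CSA merging tree into ordinary additions, so it shows the arithmetic of the four stages rather than their timing; the function name mul8 is ours.

    #include <stdio.h>
    #include <stdint.h>

    /* 8-bit x 8-bit fixed-point multiply as a sum of eight partial
       products: pj = A * (j-th bit of B), shifted left j places, i.e.
       (8+j) bits with j trailing zeros. In the pipeline the eight pj
       would be merged by two levels of CSAs and a final CPA.          */
    static uint16_t mul8(uint8_t a, uint8_t b) {
        uint16_t p = 0;
        for (int j = 0; j < 8; j++) {
            uint16_t pj = ((b >> j) & 1u) ? (uint16_t)(a << j) : 0;
            p += pj;             /* stands in for the CSA/CPA merge tree */
        }
        return p;                /* 16-bit double-precision product      */
    }

    int main(void) {
        printf("%u\n", (unsigned)mul8(200, 150));   /* expect 30000 */
        return 0;
    }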

A pipeline unit for fixed point multiplication of 8 bit integers.

SIMD Computers and Performance Enhancement SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level parallelism.

Performance Enhancement Using SIMD: The SIMD concept is a method of improving performance in applications where highly repetitive operations need to be performed. Simply put, SIMD is a technique of performing the same operation, be it arithmetic or otherwise, on multiple pieces of data simultaneously. Traditionally, when an application is being programmed and a single operation needs to be performed across a large dataset, a loop is used to iterate through each element in the dataset and perform the required procedure. During each iteration, a single piece of data has a single operation performed on it. This is known as Single Instruction Single Data (SISD) programming. SISD is generally trivial to implement and both the intent and method of the programmer can quickly be seen at a later time.

Loops such as this, however, are typically very inefficient, as they may have to iterate thousands, or even millions of times. Ideally, to increase performance, the number of iterations of a loop needs to be reduced. One method of reducing iterations is known as loop unrolling. This takes the single operation that was being performed in the loop, and carries it out multiple times in each iteration. For example, if a loop was previously performing a single operation and taking 10,000 iterations, its efficiency could be improved by performing this operation 4 times in each loop and only having 2500 iterations. The SIMD concept takes loop unrolling one step further by incorporating the multiple actions in each loop iteration, and performing them simultaneously. With SIMD, not only can the number of loop iterations be reduced, but also the multiple operations that are required can be reduced to a single, optimized action. SIMD does this through the use of ‘packed vectors’ (hence the alternate name of vector processing). A packed vector, like traditional programming vectors or arrays, is a data structure that contains multiple pieces of basic data. Unlike traditional vectors, however, a SIMD packed vector can then be used as an argument for a specific instruction (For example an arithmetic operation) that will then be performed on all elements in the vector simultaneously (Or very close to). Because of this, the number of values that can be loaded into the vector directly affects performance; the more values being processed at once, the faster a complete dataset can be completed. This size depends on two things: 1. The data type being used (ie int, float, double etc) 2. The SIMD implementation When values are stored in packed vectors and ‘worked upon’ by a SIMD operation, they are actually moved to a special set of CPU registers where the parallel processing takes place. The size and number of these registers is determined by the SIMD implementation being used. The other area that dictates the usefulness of a SIMD implementation (Other than the level of hardware performance itself) is the instruction set. The instruction set is the list of available operations that a SIMD implementation provides for use with packed vectors. These typically include operations to efficiently store and load values to and from a vector, arithmetic operations (add, subtract, divide, square root etc), logical operations (AND, OR etc) and comparison operations (greater than, equal to etc). The more operations a SIMD implementation provides, the simpler it is for a developer to perform the required function. SIMD operations are available directly when writing code in assembly however not in the C language. To simplify SIMD optimization in C, intrinsics can be used that are essentially a header file containing functions that translate values to their corresponding call in assembler. SIMD Example The best way to demonstrate the effectiveness of SIMD is through an example. One area where SIMD instructions are particularly useful is within image manipulation. When a raster-based image, for example a photo, has a filter of some kind applied to it, the filter has to process the colour value of each pixel and return the new value. The larger the image, the more pixels that need to be processed. The operation of calculating each new

pixel value, however, is the same for every pixel. Put another way, there is a single operation to be performed and multiple pieces of data on which it must be completed. Such a scenario is perfect for SIMD optimization. In this case, a SIMD optimized version of the filter would still have a main loop to go through the entire pixel array, however the number of iterations would be significantly reduced because in each pass, the loop would be transforming multiple pixels.
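As a concrete illustration of packed vectors, the C sketch below uses the x86 SSE intrinsics _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps to add four floats per loop iteration instead of one. It assumes an SSE-capable x86 compiler (e.g. gcc or clang); the array sizes and values are ours, and the length is kept a multiple of four so that no scalar clean-up loop is needed.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    #define N 16             /* multiple of 4, so there is no scalar tail */

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 100.0f; }

        /* The SISD version would iterate N times; here every iteration
           adds four packed floats at once, so the loop runs N/4 times. */
        for (int i = 0; i < N; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats           */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 adds */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results         */
        }

        for (int i = 0; i < N; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }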

SIMD Computers
A SIMD computer consists of N identical processors, each with its own local memory where it can store data. All processors work under the control of a single instruction stream issued by a central control unit. There are N data streams, one per processor. The processors operate synchronously: at each step, all processors execute the same instruction on a different data element. SIMD computers are much more versatile than MISD computers. Numerous problems covering a wide variety of applications can be solved by parallel algorithms on SIMD computers. Another interesting feature is that algorithms for these computers are relatively easy to design, analyze and implement. On the downside, only problems that can be subdivided into a set of identical subproblems, all of which are then solved simultaneously by the same set of instructions, can be tackled with SIMD computers. There are many computations that do not fit this pattern: such problems are typically subdivided into subproblems that are not necessarily identical, and are solved using MIMD computers.

SIMD machines have one instruction processing unit, sometimes called a controller and indicated by a K in the PMS notation, and several data processing units, generally called D-units or processing elements (PEs). The first operational machine of this class was the ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of Illinois Institute for Advanced Computation. Later machines included the Distributed Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP. The control unit is responsible for fetching and interpreting instructions. When it encounters an arithmetic or other data processing instruction, it broadcasts the instruction to all PEs, which then all perform the same operation. For example, the instruction might be "add R3,R0". Each PE would add the contents of its own internal register R3 to its own R0. To allow for needed flexibility in implementing algorithms, a PE can be deactivated. Thus on each instruction, a PE is either idle, in which case it does nothing, or it is active, in which case it performs the same operation as all other active PEs. Each PE has its own memory for storing data. A memory reference instruction, for example "load R0,100", directs each PE to load its internal register with the contents of memory location 100, meaning the 100th cell in its own local memory. One of the advantages of this style of parallel machine organization is a savings in the amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is

devoted to control, namely to fetching, decoding, and scheduling instructions. The remainder is used for on-chip storage (registers and cache) and the logic required to implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one control unit fetches and processes instructions, so more logic can be dedicated to arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-1, and a 1024-processor system is built from 32 chips, all of which fit on a single board (the control unit occupies a separate board). Vector processing is performed on an SIMD machine by distributing elements of vectors across all data memories. For example, suppose we have two vectors, a and b, and a machine with 1024 PEs. We would store element i of a in location 0 of memory i and element i of b in location 1 of memory i. To add a and b, the machine would tell each PE to load the contents of location 0 into one register, the contents of location 1 into another register, add the two registers, and write the result. As long as the number of PEs is greater than the length of the vectors, vector processing on an SIMD machine is done in constant time, i.e., it does not depend on the length of the vectors. Vector operations on a pipelined SISD vector processor, however, take time that is a linear function of the length of the vectors.
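To relate this to code, the sketch below models (purely for illustration; all names and sizes are ours) a handful of PEs, each with a two-word local memory holding one element of a and one element of b. The "control unit" is simply the sequence of loops, each of which applies the same register-transfer step to every active PE in lockstep; the masking described earlier corresponds to the active flag.

    #include <stdio.h>
    #include <stdbool.h>

    #define NPE 8   /* number of processing elements (illustrative) */

    /* Each PE has its own local memory and registers: location 0 holds
       its element of a, location 1 its element of b, as described above. */
    typedef struct { int mem[2]; int r0, r1; bool active; } pe_t;

    int main(void) {
        pe_t pe[NPE];
        for (int i = 0; i < NPE; i++) {
            pe[i].mem[0] = i;        /* element of vector a            */
            pe[i].mem[1] = 10 * i;   /* element of vector b            */
            pe[i].active = true;     /* masking could disable some PEs */
        }

        /* The control unit broadcasts one instruction at a time; every
           active PE executes it in lockstep on its own local data.     */
        for (int i = 0; i < NPE; i++) if (pe[i].active) pe[i].r0 = pe[i].mem[0];  /* load R0,0  */
        for (int i = 0; i < NPE; i++) if (pe[i].active) pe[i].r1 = pe[i].mem[1];  /* load R1,1  */
        for (int i = 0; i < NPE; i++) if (pe[i].active) pe[i].r0 += pe[i].r1;     /* add R0,R1  */
        for (int i = 0; i < NPE; i++) if (pe[i].active) pe[i].mem[0] = pe[i].r0;  /* store R0,0 */

        for (int i = 0; i < NPE; i++) printf("%d ", pe[i].mem[0]);
        printf("\n");
        return 0;
    }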

DIRECT MAPPING AND ASSOCIATIVE CACHES

As you will recall, we discussed three cache mapping functions, i.e., methods of addressing to locate data within a cache:

• Direct
• Full Associative
• Set Associative

Each of these depends on two facts:

RAM is divided into blocks of memory locations. In other words, memory locations are grouped into blocks of 2^n locations, where n represents the number of bits used to identify a word within a block. These n bits are found at the least-significant end of the physical address. The image below has n = 2, indicating that for each block of memory, there are 2^2 = 4 memory locations.

Therefore, for this example, the two least-significant bits of an address indicate the location within a block, while the remaining bits indicate the block number. The table below shows an example with a 20-bit address and four words per block. Notice that for each group of four words, the word bits take on each of the four possible values allowed with 2 bits, while the block identification bits remain constant.

Block            Address    Block identification bits    Word bits
Block 0          0x00000    0000 0000 0000 0000 00       00
                 0x00001    0000 0000 0000 0000 00       01
                 0x00002    0000 0000 0000 0000 00       10
                 0x00003    0000 0000 0000 0000 00       11
Block 1          0x00004    0000 0000 0000 0000 01       00
                 0x00005    0000 0000 0000 0000 01       01
                 0x00006    0000 0000 0000 0000 01       10
                 0x00007    0000 0000 0000 0000 01       11
Block 2          0x00008    0000 0000 0000 0000 10       00
                 0x00009    0000 0000 0000 0000 10       01
                 0x0000A    0000 0000 0000 0000 10       10
                 0x0000B    0000 0000 0000 0000 10       11
Block 3          0x0000C    0000 0000 0000 0000 11       00
                 0x0000D    0000 0000 0000 0000 11       01
                 0x0000E    0000 0000 0000 0000 11       10
                 0x0000F    0000 0000 0000 0000 11       11
And so on... until we get to the last row:
Block 2^18 - 1   0xFFFFC    1111 1111 1111 1111 11       00
                 0xFFFFD    1111 1111 1111 1111 11       01
                 0xFFFFE    1111 1111 1111 1111 11       10
                 0xFFFFF    1111 1111 1111 1111 11       11

The cache is organized into lines, each of which contains enough space to store exactly one block of data and a tag uniquely identifying where that block came from in memory.

As far as the mapping functions are concerned, the book did an okay job describing the details and differences of each. I, however, would like to describe them with an emphasis on how we would model them using code.
Direct Mapping
Remember that direct mapping assigns each memory block to a specific line in the cache. If a line is already taken up by a memory block when a new block needs to be loaded, the old block is trashed. The figure below shows how multiple blocks are mapped to the same line in the cache. This line is the only line that each of these blocks can be sent to. In the case of this figure, there are 8 bits in the block identification portion of the memory address.

The address for this example is broken down something like the following:
Tag | 8 bits identifying line in cache | word id bits
Once the block is stored in the line of the cache, the tag is copied to the tag location of the line.
Direct Mapping Summary
The address is broken into three parts: the (s-r) MSB bits represent the tag to be stored in a line of the cache, corresponding to the block stored in that line; the r bits in the middle identify which line the block is always stored in; and the w LSB bits identify each word within the block. This means that:

• The number of addressable units = 2^(s+w) words or bytes
• The block size (cache line width, not including the tag) = 2^w words or bytes
• The number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• The number of lines in the cache = m = 2^r
• The size of the tag stored in each line of the cache = (s - r) bits
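Using the s/r/w notation above, the address split can be written out directly in C. The sketch below picks, as an example, a 24-bit address with w = 2 word bits and r = 10 line bits (so the tag is s - r = 12 bits), matching the partitioning used in the set-associative discussion further down; the concrete address is arbitrary.

    #include <stdio.h>

    /* Split a physical address into (tag, line, word) for a direct-mapped
       cache using the s/r/w fields above. Example sizes: 24-bit address,
       w = 2 word bits, r = 10 line bits, so the tag is s - r = 12 bits.  */
    #define W_BITS 2
    #define R_BITS 10

    int main(void) {
        unsigned addr = 0xABCDEF;                       /* arbitrary 24-bit address */
        unsigned word = addr & ((1u << W_BITS) - 1);    /* lowest w bits            */
        unsigned line = (addr >> W_BITS) & ((1u << R_BITS) - 1);   /* middle r bits */
        unsigned tag  = addr >> (W_BITS + R_BITS);      /* remaining s - r bits     */

        printf("addr=0x%06X  tag=0x%03X  line=%u  word=%u\n", addr, tag, line, word);
        return 0;
    }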

Direct mapping is simple and inexpensive to implement, but if a program repeatedly accesses 2 blocks that map to the same line, the cache begins to thrash back and forth, reloading the line over and over again, meaning misses are very high.

Full Associative Mapping
In full associative mapping, any block can go into any line of the cache. This means that the word id bits are used to identify which word in the block is needed, but the tag becomes all of the remaining bits.
Tag | word id bits

Full Associative Mapping Summary
The address is broken into two parts: a tag used to identify which block is stored in which line of the cache (s bits) and a fixed number of LSB bits identifying the word within the block (w bits). This means that:

• The number of addressable units = 2^(s+w) words or bytes
• The block size (cache line width, not including the tag) = 2^w words or bytes
• The number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• The number of lines in the cache is not dependent on any part of the memory address
• The size of the tag stored in each line of the cache = s bits

Set Associative Mapping
This is the one that you really need to pay attention to because this is the one for the homework. Set associative mapping addresses the problem of possible thrashing in the direct mapping method. It does this by saying that instead of having exactly one line that a block can map to in the cache, we will group a few lines together, creating a set. A block in memory can then map to any one of the lines of a specific set. There is still only one set that the block can map to.

Note that blocks 0, 256, 512, 768, etc. can only be mapped to one set. Within the set, however, they can be mapped associatively to one of two lines. The memory address is broken down in a similar way to direct mapping, except that there is a slightly different number of bits for the tag (s-r) and the set identification (r). It should look something like the following:
Tag (s-r bits) | set identifier (r bits) | word id (w bits)
Now if you have a 24-bit address in direct mapping with a block size of 4 words (2-bit word id) and 1K lines in the cache (10-bit line id), the partitioning of the address for the cache would look like this.
Direct Mapping Address Partitions
Tag (12 bits) | line identifier (10 bits) | word id (2 bits)
If we took the exact same system, but converted it to 2-way set associative mapping (2-way meaning we have 2 lines per set), we'd get the following:
Tag (13 bits) | set identifier (9 bits) | word id (2 bits)

Notice that by making the number of sets equal to half the number of lines (i.e., 2 lines per set), one less bit is needed to identify the set within the cache. This bit is moved to the tag so that the tag can be used to identify the block within the set.
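For the 2-way case just described (13-bit tag, 9-bit set identifier, 2 word bits on a 24-bit address), the lookup can be sketched as follows. The structure below is only an illustration: the data payload is omitted and the replacement policy is simplified, whereas a real cache would use LRU or similar. The two example addresses map to the same set (and, under direct mapping, to the same line), so the second round of accesses hits only because the set holds two lines.

    #include <stdio.h>
    #include <stdbool.h>

    #define W_BITS   2                  /* word-within-block bits        */
    #define SET_BITS 9                  /* 512 sets x 2 lines = 1K lines */
    #define WAYS     2                  /* 2-way set associative         */
    #define NSETS    (1u << SET_BITS)

    /* One cache line: a valid bit and the stored tag (data omitted). */
    typedef struct { bool valid; unsigned tag; } line_t;

    static line_t cache[NSETS][WAYS];

    /* Returns true on a hit; on a miss the block is placed in an empty way
       if one exists, otherwise way 0 is evicted (a real cache would use
       LRU or a similar replacement policy).                              */
    static bool access_cache(unsigned addr) {
        unsigned set = (addr >> W_BITS) & (NSETS - 1);
        unsigned tag = addr >> (W_BITS + SET_BITS);
        for (int w = 0; w < WAYS; w++)
            if (cache[set][w].valid && cache[set][w].tag == tag) return true;
        int victim = 0;
        for (int w = 0; w < WAYS; w++)
            if (!cache[set][w].valid) { victim = w; break; }
        cache[set][victim].valid = true;
        cache[set][victim].tag = tag;
        return false;
    }

    int main(void) {
        /* Both addresses fall in the same set (and the same direct-mapped
           line), so the second pair of accesses hits only because the set
           is 2-way associative: miss, miss, hit, hit.                    */
        unsigned addrs[] = { 0x000100, 0x080100, 0x000100, 0x080100 };
        for (int i = 0; i < 4; i++)
            printf("0x%06X -> %s\n", addrs[i], access_cache(addrs[i]) ? "hit" : "miss");
        return 0;
    }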
