
Computer Architecture: Fundamentals and Principles of Computer Design

Solutions Manual

Joseph D. Dumas II University of Tennessee at Chattanooga Department of Computer Science and Electrical Engineering

Copyright © 2006 CRC Press/Taylor & Francis Group

1

Introduction to Computer Architecture

1.

Explain in your own words the differences between computer systems architecture and implementation. How are these concepts distinct, yet interrelated? Give a historical example of how implementation technology has affected architectural design (or vice versa). Architecture is the logical design of a computer system, from the top level on down to the subsystems and their components – a specification of how the parts of the system will fit together and how everything is supposed to function. Implementation is the physical realization of an architecture – an actual, working hardware system on which software can be executed. There are many examples of advances in implementation technology affecting computer architecture. An obvious example is the advent of magnetic core memory to replace more primitive storage technologies such as vacuum tubes, delay lines, magnetic drums, etc. The new memory technology had much greater storage capacity than was previously feasible. The availability of more main memory resulted in changes to machine language instruction formats, addressing modes, and other aspects of instruction set architecture.

2.

Describe the technologies used to implement computers of the first, second, third, fourth, and fifth generations. What were the main new architectural features that were introduced or popularized with each generation of machines? What advances in software went along with each new generation of hardware? First generation computers were unique machines built with very primitive implementation technologies such as electromagnetic relays and (later) vacuum tubes. The main new architectural concept was the von Neumann stored-program paradigm itself. (The early first generation machines were not programmable in the sense we understand that term today.) Software, for those machines where the concept was actually relevant, was developed in machine language.

Second-generation computers made use of the recently invented transistor as a basic switching element. The second generation also saw the advent of magnetic core memory as a popular storage technology. At least partly in response to these technological advances, new architectural features were developed including virtual memory, interrupts, and hardware representation of floating-point numbers. Advances in software development included the use of assembly language and the first high-level languages including Fortran, Algol, and COBOL. Batch processing systems and multiprogramming operating systems were also devised during this time period. The third generation featured the first use of integrated circuits (with multiple transistors on the same piece of semiconductor material) in computers. Not only was this technology used to create smaller CPUs requiring less wiring between components, but semiconductor memory devices began to replace core memory as well. This led to the development of minicomputer architectures that were less expensive to implement and helped give rise to families of computer systems sharing a common instruction set architecture. Software advances included increased use of virtual memory, the development of more modern, structured programming languages, and the dawn of timesharing operating systems such as UNIX. Fourth generation computers were the first machines to use VLSI integrated circuits including microprocessors (CPUs fabricated on a single IC). VLSI technology continued to improve during this period, eventually yielding microprocessors with over one million transistors and large-capacity semiconductor RAM and ROM devices. VLSI “chips” allowed the development of inexpensive but powerful microcomputers during the fourth generation. These systems gradually began to make use of virtual memory, cache memory, and other techniques previously reserved for mainframes and minicomputers; they provided direct

support for high-level languages either in hardware (CISC) or by using optimizing compilers (RISC). Other software advances included new languages like BASIC, Pascal, and C, and the first object-oriented language (C++). Office software including word processors and spreadsheet applications helped microcomputers gain a permanent foothold in small businesses and homes. Fifth generation computers exhibited fewer architectural innovations than their predecessors, but advances in implementation technology (including pipelined and superscalar CPUs and larger, faster memory devices) yielded steady gains in performance. CPU clock frequencies increased from tens, to hundreds, and eventually to thousands of megahertz; today, CPUs operating at several gigahertz are common. “Standalone” systems became less common as most computers were connected to local area networks and/or the Internet. Object-oriented software development became the dominant programming paradigm, and network-friendly languages like Java became popular.

3.

What characteristics do you think the next generation of computers (say, 5-10 years from now) will display? The answer to this question will undoubtedly vary from student to student, but might include an increased reliance on networking (especially wireless networking), increased use of parallel processing, more hardware support for graphics, sound, and other multimedia functions, etc.

4.

What was the main architectural difference between the two early computers ENIAC and EDVAC? ENIAC was not a programmable machine. Connections had to be re-wired to do a different calculation. EDVAC was based on the von Neumann paradigm, where instructions were not hard-wired but rather resided in main memory along with the data. The program, and thus the system’s functionality, could be changed without any modification to the hardware. Thus, EDVAC (and all its successors

based on the von Neumann architecture) were able to run “software” as we understand it today.

5.

Why was the invention of solid state electronics (in particular, the transistor) so important in the history of computer architecture? The invention of the transistor, and its subsequent use as a switching element in computers, enabled many of the architectural enhancements that came about during the second (and later) generations of computing. Earlier machines based on vacuum tubes were limited in capability because of the short lifetime of each individual tube. A machine built with too many (more than a few thousand) switching elements could not be reliable; it would frequently “go down” due to tube failures. Transistors, with their much longer life span, enabled the construction of computers with tens or hundreds of thousands of switching elements, which allowed more complex architectures to flourish.

6.

Explain the origin of the term “core dump.” The term “core dump” dates to the second and third generations of computing, when most large computers used magnetic core memory for main storage. Since core memory was nonvolatile (retained its contents in the absence of power), when a program crashed and the machine had to be taken down and restarted, the offending instruction(s) and their operands were still in memory and could be examined for diagnostic purposes. Some later machines with semiconductor main memory mimic this behavior by “dumping” an image of a program’s memory space to disk to aid in debugging in the event of a crash.

7.

What technological advances allowed the development of minicomputers, and what was the significance of this class of machines? How is a microcomputer different from a minicomputer? The main technological development that gave rise to minicomputers was the invention of the integrated circuit. (The shrinking sizes of secondary storage devices

and advances in display technology such as CRT terminals also played a part.) The significance of these machines was largely due to their reduced cost as compared to traditional mainframe computers. Because they cost “only” a few thousand dollars instead of hundreds of thousands or millions, minicomputers were available to smaller businesses (and to small workgroups or individuals within larger organizations). This trend toward proliferation and decentralization of computing resources was continued by the microcomputers of the fourth generation. The main difference between a microcomputer and a minicomputer is the microcomputer’s use of a microprocessor (or single-chip CPU) as the main processing element. Minicomputers had CPUs consisting of multiple ICs or even multiple circuit boards. The availability of microprocessors, coupled with the miniaturization and decreased cost of other system components, made computers smaller and cheaper and thus, for the first time, accessible to the average person.

8.

How have the attributes of very high performance systems (a.k.a. supercomputers) changed over the third, fourth, and fifth generations of computing? The third generation of computing saw the development of the first supercomputer-class machines, including the IBM “Stretch”, the CDC 6600 and 7600, the TI ASC, the ILLIAC IV and others. These machines were very diverse and did not share many architectural attributes. During the fourth generation, vector machines including the Cray-1 and its successors (and competitors) became the dominant force in high-performance computing. By processing vectors (large one-dimensional arrays) of operands in highly pipelined fashion, these machines achieved impressive performance on scientific and engineering calculations (though they did not achieve comparable performance increases on more general applications). Massively parallel machines (with many, simple processing elements) also debuted during this period. Vector machines lost popularity in the fifth generation, largely giving way to

highly parallel scalar systems using large numbers of conventional microprocessors. Many of these systems are cluster systems built around a network of relatively inexpensive, “commodity” computers.

9.

What is the most significant difference between computers of the last 10-15 years versus those of previous generations? Fifth generation computers are smaller, cheaper, faster, and have more memory than their predecessors – but probably the single most significant difference between modern systems and those of the past is the pervasiveness of networking. Almost every general-purpose or high-performance system is connected to a local area network, or a wide area network such as the Internet, via some sort of wired or wireless network connection.

10.

What is the principal performance limitation of a machine based on the von Neumann (Princeton) architecture? How does a Harvard architecture machine address this limitation? The main performance limitation of a von Neumann machine is the “von Neumann bottleneck” – the single path between the CPU and main memory, over which instructions as well as data must be accessed. A Harvard architecture removes this bottleneck by having either separate main memories for instructions and data (with a dedicated connection to each), or (much more common nowadays) by having only one main memory, but separate cache memories (see Chapter 2) for instructions and data. The separate memories can be optimized for access patterns typical of each type of memory reference in order to maximize data and instruction bandwidth to the CPU.

11.

Summarize in your own words the von Neumann machine cycle. Fetch instruction, decode instruction, determine operand address(es), fetch operand(s), perform operation, store result … repeat for next instruction.

12.

Does a computer system with high generality tend to have higher quality than other

systems? Explain. Not necessarily. If anything, a more general architecture tends to be more complex, as its designers try to make it capable of doing a wide variety of things reasonably well. This increased complexity, as compared to a more specialized architecture, may lead to a higher probability of “bugs” in the implementation, all else being equal.

13.

How does “ease of use” relate to “user friendliness”? Not at all; at least, not directly. User friendliness refers to the end user’s positive experience with the operating system and applications that run under it. Ease of use is an attribute that describes how well the architecture facilitates the development of system software such as operating systems, compilers, linkers, etc. In other words, it is a measure of “systems programmer friendliness.” While there is no direct connection, an architecture that is not “easy to use” could possibly give rise to systems software with a higher probability of bugs, which may ultimately lead to a lower quality experience on the part of the end user.

14.

The obvious benefit of maintaining upward and/or forward compatibility is the ability to continue to run “legacy” code. What are some of the disadvantages of compatibility? Building in compatibility with previous machines makes the design of an architecture more complex. This may result in higher design and implementation costs, less architectural ease of use, and a higher probability of flaws in the implementation of the design.

15.

Name at least two things (other than hardware purchase price, software licensing cost, maintenance, and support) that may be considered cost factors for a computer system. Costs are not always monetary – at least, not directly. Other cost factors, depending on the nature of the system and where it is used, might include power consumption, heat dissipation, physical volume, mass, and losses incurred if a system fails due to reliability issues.

16.

Give as many reasons as you can why PC compatible computers have a larger market share than Macs. It is probably impossible to know all the reasons, but one of the biggest is that PCs have an “open”, rather than proprietary, architecture. Almost from the very beginning, compatible “clones” were available at competitive prices, holding down not only the initial cost of buying a computer, but also the prices for software and replacement parts. Success breeds success, and the larger market share meant that manufacturers who produced PC hardware were able to invest in research and development that produced better, faster, and more economical PC compatible machines.

17.

One computer system has a 3.2 GHz processor, while another has only a 2.7 GHz processor. Is it possible that the second system might outperform the first? Explain. It is entirely possible that this might be the case. CPU clock frequency is only one small aspect of system performance. Even with a lower clock frequency (fewer clock cycles occurring each second) the second system’s CPU might outperform the first because of architectural or implementation differences that result in it accomplishing more work per clock cycle. And even if the first system’s CPU is indeed more capable, differences in the memory and/or input/output systems might still give the advantage to the second system.

18.

A computer system of interest has a CPU with a clock cycle time of 2.5 ns. Machine language instruction types for this system include: integer addition/subtraction/logic instructions which require 1 clock cycle to be executed; data transfer instructions which average 2 clock cycles to be executed; control transfer instructions which average 3 clock cycles to be executed; floating-point arithmetic instructions which average 5 clock cycles to be executed; and input/output instructions which average 2 clock cycles to be executed. a) Suppose you are a marketing executive who wants to hype the performance of this

system. Determine its “peak MIPS” rating for use in your advertisements. The fastest instructions take only one clock cycle to execute, so in order to calculate peak MIPS, assume that the whole program uses only these instructions. That means that the machine will execute one instruction every 2.5 ns. Thus we calculate: Instruction execution rate = (1 instruction) / (2.5 * 10^-9 seconds) = 4 * 10^8 instructions/second = 400 * 10^6 instructions/second = 400 MIPS

b) Suppose you have acquired this system and want to estimate its performance when running a particular program. You analyze the compiled code for this program and determine that it consists of 40% data transfer instructions, 35% integer addition, subtraction, and logical instructions, 15% control transfer instructions, and 10% I/O instructions. What MIPS rating do you expect the system to achieve while running this program? First, we need to determine the mean number of cycles per instruction using a weighted average based on the percentages of the different types of instructions: CPIavg = (0.40)(2 cycles) + (0.35)(1 cycle) + (0.15)(3 cycles) + (0.10)(2 cycles) = (0.80 + 0.35 + 0.45 + 0.20) = 1.80 cycles/instruction We already determined in part (a) above that if instructions take a single cycle, then we can execute 400 * 10^6 of them per second. This is another way of saying that the CPU clock frequency is 400 MHz. Given this knowledge and the average cycle count per instruction just calculated, we obtain: Instruction execution rate = (400 M cycles / second) * (1 instruction / 1.8 cycles) ≈ 222 M instructions/second = 222 MIPS

c) Suppose you are considering purchasing this system to run a variety of programs using mostly floating-point arithmetic. Of the widely-used benchmark suites discussed in this chapter, which would be the best to use in comparing this system to others you are considering?

If general-purpose floating-point performance is of interest, it would be hard to go wrong by using the SPECfp floating-point CPU benchmark suite (or some subset of it, if specific types of applications to be run on the system are known). Other possibilities include the Whetstones benchmark or (if applications of interest perform vector computations) LINPACK or Livermore Loops. Conversely, you would definitely not want to compare the systems using any of the integer-only or non-CPU-intensive benchmarks such as Dhrystones, TPC, etc.

d) What does MFLOPS stand for? Estimate this system’s MFLOPS rating; justify your answer with reasoning and calculations. MFLOPS stands for Millions of Floating-point Operations Per Second. Peak MFLOPS can be estimated in a similar manner to parts (a) and (b) above: Peak floating-point execution rate = (400 M cycles / second) * (1 FLOP / 5 cycles) = 80 MFLOPS A more realistic estimate of a sustainable floating-point execution rate would have to take into account the additional operations likely to be required along with each actual numeric computation. While this would vary from one program to another, a reasonable estimate might be that for each floating-point arithmetic operation, the program might also perform two data transfers (costing a total of four clock cycles) plus one control transfer (costing three clock cycles). This would mean that the CPU could only perform one floating-point computation every 12 clock cycles for a sustained execution rate of (400 M cycles / second) * (1 FLOP / 12 cycles) ≈ 33 MFLOPS. The student may come up with a variety of estimates based on different assumptions, but any realistic estimate would be significantly less than the 80 MFLOPS peak rate.
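
For convenience, the arithmetic in parts (a), (b), and (d) can be checked with a short Python sketch. This is illustrative only; the cycle counts and instruction mix are taken from the problem statement, and the 12-cycle figure for the sustained-MFLOPS estimate is the assumption made in the discussion above.

# A minimal check of the problem 18 arithmetic (all values taken from the problem statement).
cycle_time_ns = 2.5
clock_rate = 1e9 / cycle_time_ns          # cycles per second = 400 MHz

# (a) peak MIPS: assume every instruction is a one-cycle instruction
print("peak MIPS:", clock_rate / 1e6)                               # 400.0

# (b) MIPS for the stated instruction mix: (fraction of instructions, cycles each)
mix = [(0.40, 2), (0.35, 1), (0.15, 3), (0.10, 2)]
cpi_avg = sum(fraction * cycles for fraction, cycles in mix)
print("CPI:", round(cpi_avg, 2), "MIPS:", round(clock_rate / cpi_avg / 1e6))   # 1.8 222

# (d) peak MFLOPS (5 cycles per floating-point op) and the 12-cycle sustained estimate
print("peak MFLOPS:", clock_rate / 5 / 1e6)                         # 80.0
print("sustained MFLOPS:", round(clock_rate / 12 / 1e6))            # 33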

19.

Why does a hard disk that rotates at higher RPM generally outperform one that rotates at lower RPM? Under what circumstances might this not be the case? There are generally three components to the total time required to read or

write data on a rotating disk. These are the time required to step the read/write head in or out to the desired track, the rotational delay in getting to the start of the desired sector within that track, and then the time needed to actually read or write the sector in question. All else being equal, increasing disk RPM reduces the time it takes for the disk to make a revolution and so tends to reduce the second and third delay components, while it does nothing to address the first. If the higher-RPM drive had a longer track-to-track seek time, though, it might take just as long or even longer, overall, to access desired data as compared with a lower-RPM drive with shorter track-to-track access time.

20.

A memory system can read or write a 64-bit value every 2 ns. Express its bandwidth in MB/s. Since one byte equals 8 bits, a 64-bit value is 8 bytes. So we can compute the memory bandwidth as: BW = (8 bytes) / (2 * 10^-9 seconds) = 4 * 10^9 bytes/second = 4 GB/s or 4000 MB/s
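
As a quick check of the unit conversion (using the decimal convention 1 MB = 10^6 bytes assumed above):

# Bandwidth check for problem 20: one 64-bit (8-byte) transfer every 2 ns.
bytes_per_transfer = 64 // 8                                 # = 8 bytes
transfer_time_ns = 2
bandwidth = bytes_per_transfer / transfer_time_ns * 1e9      # bytes per second
print(bandwidth / 1e6, "MB/s")                               # 4000.0 MB/s, i.e. 4 GB/s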

21.

If a manufacturer’s brochure states that a given system can perform I/O operations at 500 MB/s, what questions would you like to ask the manufacturer’s representative regarding this claim? One should probably ask under what conditions this data transfer rate can be achieved. If it is a “peak” transfer rate, it is probably unattainable under any typical circumstances. It would be very helpful to know the size of the blocks of data being transferred and the length of time for which this 500 MB/s rate was sustained. Odds are probably good that, if this is a peak rate, it is only valid for fairly large block transfers of optimum size, and for very short periods of time. This may or may not reflect the nature of the I/O demands of a customer’s application.

22.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:

Implementation - the actual, physical realization of a computer system, as opposed to the conceptual or block-level design

Babbage’s Analytical Engine - this was the first design for a programmable digital computer, but a working model was never completed

Integrated circuits - this technological development was an important factor in moving from second generation to third generation computers

CDC 6600 - this system is widely considered to have been the first supercomputer

Altair - this early microcomputer kit was based on an 8-bit microprocessor; it introduced 10,000 hobbyists to (relatively) inexpensive personal computing

Microcontroller - this type of computer is embedded inside another electronic or mechanical device such as a cellular telephone, microwave oven, or automobile transmission

Harvard architecture - a type of computer system design in which the CPU uses separate memory buses for accessing instructions and data operands

Compatibility - an architectural attribute that expresses the support provided for previous or other architectures by the current machine

MFLOPS - a CPU performance index that measures the rate at which computations can be performed on real numbers rather than integers

Bandwidth - a measure of memory or I/O performance that tells how much data can be transferred to or from a device per unit of time

Benchmark - a program or set of programs that are used as standardized means of comparing the performance of different computer systems

2

Computer Memory Systems

1.

Consider the various aspects of an ideal computer memory discussed in Section 2.1.1 and the characteristics of available memory devices discussed in Section 2.1.2. Fill in the columns of the table below with the following types of memory devices, in order from most desirable to least desirable: magnetic hard disk, semiconductor DRAM, CD-R, DVD-RW, semiconductor ROM, DVD-R, semiconductor flash memory, magnetic floppy disk, CD-RW, semiconductor static RAM, semiconductor EPROM.

Cost/bit (will obviously fluctuate somewhat depending on market conditions): CD-R, DVD-R, CD-RW, DVD-RW, magnetic hard disk, magnetic floppy disk, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, semiconductor static RAM.

Speed (will vary somewhat depending on specific models of devices): semiconductor static RAM, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, magnetic hard disk, DVD-R, DVD-RW, CD-R, CD-RW, magnetic floppy disk.

Information Density (again, this may vary by specific types of devices): Magnetic hard disk, DVD-R and DVD-RW, CD-R and CD-RW, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, semiconductor static RAM, magnetic floppy disk.

Volatility: Optical media such as DVD-R, CD-R, DVD-RW, and CD-RW are all equally nonvolatile. The read-only variants cannot be erased and provide secure storage unless physically damaged. (The same is true of semiconductor ROM.) The read-write optical disks (and semiconductor EPROMs and flash memories) may be intentionally or accidentally erased, but otherwise retain their data indefinitely in the absence of physical damage. Magnetic hard and floppy disks are nonvolatile except in the presence of strong external magnetic fields. Semiconductor static RAM is volatile, requiring continuous application of electrical power to maintain

stored data. Semiconductor DRAM is even more volatile since it requires not only electrical power, but also periodic data refresh in order to maintain its contents.

Writability (all memory is readable): Magnetic hard and floppy disks and semiconductor static RAM and DRAM can be written essentially indefinitely, and as quickly and easily as they can be read. DVD-RW, CD-RW, and semiconductor flash memory can be written many times, but not indefinitely, and the write operation is usually slower than the read operation. Semiconductor EPROMs can be written multiple times, but only in a special programmer, and only after a relatively long erase cycle under ultraviolet light. DVD-R and CD-R media can be written once and only once by the user. Semiconductor ROM is pre-loaded with its binary information at the factory and can never be written by the user.

Power Consumption: All types of optical and magnetic disks as well as semiconductor ROMs, EPROMs, and flash memories can store data without power being applied at all. Semiconductor RAMs require continuous application of power to retain data, with most types of SRAMs being more power-hungry than DRAMs. (Low-power CMOS static RAMs, however, are commonly used to maintain data for long periods of time with a battery backup.) While data are being read or written, all memories require power. Semiconductor DRAM requires relatively little power, while semiconductor ROMs, flash memories, and EPROMs tend to require more and SRAMs, more still. All rotating disk drives, magnetic and optical, require significant power in order to spin the media and move the read/write heads as well as to actually perform the read and write operations. The specifics vary considerably from device to device, but those that rotate the media at higher speeds tend to use slightly more power.

Durability: In general, the various types of semiconductor memories are more durable than disk memories because they have no moving parts. Only severe physical shock or static discharges are likely to harm them. (CMOS devices are

particularly susceptible to damage from static electricity.) Optical media are also very durable; they are nearly impervious to most dangers except that of surface scratches. Magnetic media such as floppy and hard disks tend to be the least durable as they are subject to erasure by strong magnetic fields and also are subject to “head crashes” when physical shock causes the read-write head to impact the media surface.

Removability/Portability: Flash memory, floppy disks, and optical disks are eminently portable and can easily be carried from system to system to transfer data. A few magnetic hard drives are designed to be portable, but most are permanently installed in a given system and require some effort for removal. Semiconductor ROMs and EPROMs, if placed in sockets rather than being soldered directly to a circuit board, can be removed and transported along with their contents. Most semiconductor RAM devices lose their contents when system power is removed and, while they could be moved to another system, would not arrive containing any valid data.

2.

Describe in your own words what a hierarchical memory system is and why it is used in the vast majority of modern computer systems. A hierarchical memory system is one that is comprised of several types of memory devices with different characteristics, each occupying a “level” within the overall structure. The higher levels of the memory system (the ones closest to, or a part of, the CPU) offer faster access but, due to cost factors and limited physical space, have a smaller storage capacity. Thus, each level can typically hold only a portion of the data stored in the next lower level. As one moves down to the lower levels, speed and cost per bit generally decrease, but capacity increases. At the lowest levels, the devices offer a great deal of (usually nonvolatile) storage at relatively low cost, but are quite slow. For the overall system to perform well, the hierarchy must be managed by hardware and software such that the stored items

that are used most frequently are located in the higher levels, while items that are used less frequently are relegated to the lower levels.

3.

What is the fundamental, underlying reason why low-order main memory interleaving and/or cache memories are needed and used in virtually all high-performance computer systems? The main underlying reason why speed-enhancing techniques such as low-order interleaving and cache continue to be needed and used in computer systems is that main memory technology has never been able to keep up with the speed of processor implementation technologies. The CPUs of each generation have always been faster than any devices (from the days of delay lines, magnetic drums, and core memory all the way up to today’s high-capacity DRAM ICs) that were feasible, from a cost standpoint, to be used as main memory. If anything, the CPU-memory speed gap has widened rather than narrowed over the years. Thus, the speed and size of a system’s cache may be even more critical to system performance than almost any other factor. (If you don’t believe this, examine the performance difference between an Intel Pentium 4 and an otherwise similar Celeron processor.)

4.

A main memory system is designed using 15 ns RAM devices using a 4-way low-order interleave. (a)

What would be the effective time per main memory access under ideal conditions? Under ideal conditions, four memory accesses would be in progress at any

given time due to the low-order interleaving scheme. This means that the effective time per main memory access would be (15 / 4) = 3.75 ns. (b)

What would constitute “ideal conditions”? (In other words, under what circumstances could the access time you just calculated be achieved?) The ideal condition for best performance of the memory system would be

continuous access to sequentially numbered memory locations. Equivalently, any

access pattern that consistently used all three of the other “leaves” before returning to the one just accessed would have the same benefit. Examples would include accessing every fifth numbered location, or every seventh, or any spacing that is relatively prime with 4 (the interleaving factor). (c)

What would constitute “worst-case conditions”? (In other words, under what circumstances would memory accesses be the slowest?) What would the access time be in this worst-case scenario? If ideal conditions exist 80% of the time and worst-case conditions occur 20% of the time, what would be the average time required per memory access? The worst case would be a situation where every access went to the same

device or group of devices. This would happen if the CPU needed to access every fourth numbered location (or every eighth, or any spacing that is an integer multiple of 4). In this case, access time would revert to that of an individual device (15 ns) and the interleaving would provide no performance benefit at all. In the hypothetical situation described, we could take a weighted average to determine the effective access time for the memory system: (0.80)(3.75 ns) + (0.20)(15 ns) = (3 + 3) = 6 ns. (d)

When ideal conditions exist, we would like the processor to be able to access memory every clock cycle with no “wait states” (that is, without any cycles wasted waiting for memory to respond). Given this requirement, what is the highest processor bus clock frequency that can be used with this memory system? In part (a) above, we found the best-case memory access time to be 3.75 ns.

Matching the CPU bus cycle time to this value and taking the reciprocal (since f = 1/T) we obtain: f = 1/T = (1 cycle) / (3.75 * 10^-9 seconds) ≈ 2.67 * 10^8 cycles/second = 267 MHz. (e)

Other than increased hardware cost and complexity, are there any potential disadvantages of using a low-order interleaved memory design? If so, discuss one

such disadvantage and the circumstances under which it might be significant. The main disadvantage that could come into play is due to the fact that under ideal conditions, all memory modules are busy all the time. This is good if only one device (usually the CPU) needs to access memory, but not good if other devices need to access memory as well (for example, to perform I/O). Essentially all the memory bandwidth is used up by the first device, leaving little or none for others. Another possible disadvantage is lower memory system reliability due to decreased fault tolerance. In a high-order interleaved system, if one memory device were to fail, 3/4 of the memory space would still be usable. In the low-order interleaved case, if one of the four “leaves” fails, the entire main memory space is effectively lost.
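
The arithmetic for parts (a), (c), and (d) can be summarized in a few lines of Python; the 80%/20% split of ideal versus worst-case accesses is the assumption stated in part (c).

# Interleaved-memory estimates for problem 4: 15 ns devices, 4-way low-order interleave.
device_time_ns = 15.0
ways = 4

best_case = device_time_ns / ways          # all four leaves kept busy
worst_case = device_time_ns                # every access goes to the same leaf
print(best_case, worst_case)               # 3.75 15.0 (ns)

# part (c): weighted average, 80% ideal and 20% worst-case accesses
print(round(0.80 * best_case + 0.20 * worst_case, 2))        # 6.0 ns

# part (d): fastest bus clock allowing one access per cycle under ideal conditions
print(round(1e3 / best_case), "MHz")                          # 1 / 3.75 ns ≈ 267 MHz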

5.

Is it correct to refer to a typical semiconductor integrated circuit ROM as a “random access memory”? Why or why not? Name and describe two other logical organizations of computer memory that are not “random access.” It is correct to refer to a semiconductor ROM as a “random access memory” in the strict sense of the definition – a “random access” memory is any memory device that has an access time independent of the specific location being accessed. (In other words, any randomly chosen location can be read or written in the same amount of time as any other location.) This is equally true of most semiconductor read-only memories as it is of semiconductor read/write memories (which are commonly known as “RAMs”). Because of the commonly-used terminology, it is probably better not to confuse the issue by referring to a ROM IC as a “RAM”, even though that is technically a correct statement. Besides random access, the other two logical memory organizations that may be found in computer systems are sequential access (typical of tape and disk memories) and associative (or content addressable).

6.

Assume that a given system’s main memory has an access time of 6.0 ns, while its cache has an access time of 1.2 ns (five times as fast). What would the hit ratio need to be in order for the effective memory access time to be 1.5 ns (four times as fast as main memory)? Since effective memory access time in such a system is based on a weighted average, we would need to solve the following equation: ta effective = ta cache * (ph) + ta main * (1 - ph) for the particular values given in the problem, as shown: 1.5 ns = (1.2 ns)(ph) + (6.0 ns)(1 - ph) Using basic algebra, 1.5 = 1.2 ph + 6.0 - 6.0 ph, so 4.8 ph = 4.5 and ph = 4.5 / 4.8 = 0.9375.

7.

A particular program runs on a system with cache memory. The program makes a total of 250,000 memory references; 235,000 of these are to cached locations. (a)

What is the hit ratio in this case?

ph = number of hits / (number of hits + number of misses) = 235,000 / 250,000 = 0.94 (b)

If the cache can be accessed in 1.0 ns but the main memory requires 7.5 ns for an access to take place, what is the average time required by this program for a memory access assuming all accesses are reads?

ta effective = ta cache * (ph) + ta main * (1 - ph) = (1.0 ns)(0.94) + (7.5 ns)(0.06) = (0.94 + 0.45) ns = 1.39 ns (c)

What would be the answer to part (b) if a write-through policy is used and 75% of memory accesses are reads? If a write-through policy is used, then all writes require a main memory

access and write hits do nothing to improve memory system performance. The average write access time is equal to the main memory access time, which is 7.5 ns. The average read access time is equal to 1.39 ns as calculated in (b) above. The overall average time per memory access is thus given by: ta effective = (7.5 ns)(0.25) + (1.39 ns)(0.75) = (1.875 + 1.0425) ns = 2.9175 ns
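
The weighted averages in parts (b) and (c) can be restated compactly; the sketch below simply re-expresses the formulas used above with the numbers from the problem.

# Effective access time for problem 7: 1.0 ns cache, 7.5 ns main memory.
t_cache, t_main = 1.0, 7.5                  # ns
hit_ratio = 235_000 / 250_000               # part (a): 0.94

# part (b): all accesses are reads
t_read = t_cache * hit_ratio + t_main * (1 - hit_ratio)
print(round(t_read, 4))                     # 1.39 ns

# part (c): write-through with 75% reads; every write costs a full main memory access
t_avg = 0.75 * t_read + 0.25 * t_main
print(round(t_avg, 4))                      # 2.9175 ns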

8.

Is hit ratio a dynamic or static performance parameter in a typical computer memory system? Explain your answer. Hit ratio is a dynamic parameter in any practical computer system. Even though the cache and main memory sizes, mapping strategy, replacement policy, etc. (which can all affect the hit ratio) are constant within a given system, the proportion of cache hits to misses will still vary from one program to another. It will also vary widely within a given run, based on such factors as the length of time the program has been running, the code structure (procedure calls, loops, etc.) and the properties of the specific data set being operated on by the program.

9.

What are the advantages of a set-associative cache organization as opposed to a direct-mapped or fully associative mapping strategy? A set-associative cache organization is a compromise between the direct-mapped and fully associative organizations that attempts to maximize the advantages of each while minimizing their respective disadvantages. Fully associative caches are expensive to build but offer a higher hit ratio than direct-mapped caches of the same size. Direct-mapped caches are cheaper and less complex to build but performance can suffer due to usage conflicts between lines with the same index. By limiting associativity to just a few parallel comparisons (two- and four-way set-associative caches are most common) the set-associative organization can achieve nearly the same hit ratio as a fully associative design at a cost not much greater than that of a direct-mapped cache.

10.

A computer has 64 MB of byte-addressable main memory. It is proposed to design a 1 MB cache memory with a refill line (block) size of 64 bytes. (a)

Show how the memory address bits would be allocated for a direct-mapped cache organization. Since 64M = 2^26, the total number of bits required to address the main

memory space is 26. And since 64 = 2^6, it takes 6 bits to identify a particular byte

within a line. The number of refill lines in the cache is 1M / 64 = 2^20 / 2^6 = 2^14 = 16K. Since there are 2^14 lines in the cache, 14 index bits are required. 26 total address bits – 6 “byte” bits – 14 “index” bits leaves 6 bits to be used for the tag. So the address bits would be partitioned as follows: Tag (6 bits) | Index (14 bits) | Byte (6 bits) (b)

Repeat part (a) for a four-way set-associative cache organization. For the purposes of this problem, a four-way set-associative cache can be

treated as four direct-mapped caches operating in parallel, each one-fourth the size of the cache described above. Each of these four smaller units would thus be 256 KB in size, containing 4K = 2^12 refill lines. Thus, 12 bits would need to be used for the index, and 26 – 6 – 12 = 8 bits would be used for the tag. The address bits would be partitioned as follows: Tag (8 bits) | Index (12 bits) | Byte (6 bits) (c)

Repeat part (a) for a fully associative cache organization. In a fully associative cache organization, no index bits are required.

Therefore the tags would be 26 – 6 = 20 bits long. Addresses would be partitioned as follows: Tag (20 bits) | Byte (6 bits) (d)

Given the direct-mapped organization, and ignoring any extra bits that might be needed (valid bit, dirty bit, etc.), what would be the overall size (“depth” by “width”) of the memory used to implement the cache? What type of memory devices would be used to implement the cache (be as specific as possible)? The overall size of the direct-mapped cache would be:

(16K lines) * (64 data bytes + 6 bit tag) = (16,384) * ((64 * 8) + 6) = (16,384 * 518) = 8,486,912 bits. This would be in the form of a fast 16K by 518 static RAM. (e)

Which line(s) of the direct-mapped cache could main memory location 1E0027A₁₆ map into? (Give the line number(s), which will be in the range of 0 to (n-1) if there are n lines in the cache.) Give the memory address (in hexadecimal) of another location that could not reside in cache at the same time as this one (if

such a location exists). To answer this question, we need to write the memory address in binary. 1E0027A hexadecimal equals 01111000000000001001111010 binary. We can break this down into a tag of 011110, an index of 00000000001001 and a byte offset within the line of 111010. In a direct-mapped cache, the binary index tells us the number of the only line that can contain the given memory location. So, this location can only reside in line 1001₂ = 9 decimal. Any other memory location with the same index but a different tag could not reside in cache at the same time as this one. One example of such a location would be the one at address 2F0027A₁₆.
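
The bit-field arithmetic in parts (a) through (c), and the line-number calculation in part (e), can be checked with a short Python sketch. The memory size, cache size, line size, and the address 1E0027A₁₆ come from the problem; the helper function below is only illustrative.

# Address-field breakdown for problem 10: 64 MB byte-addressable memory,
# 1 MB cache, 64-byte refill lines.
mem_bits = 26                      # 64 MB = 2^26 bytes
byte_bits = 6                      # 64-byte lines
total_lines = (1 << 20) // 64      # 1 MB / 64 B = 16K lines

def fields(ways):
    """Return (tag, index, byte) bit counts for a cache with the given associativity."""
    sets = total_lines // ways
    index_bits = sets.bit_length() - 1         # log2(sets); sets is a power of two
    return mem_bits - index_bits - byte_bits, index_bits, byte_bits

print(fields(1))              # direct-mapped:         (6, 14, 6)
print(fields(4))              # 4-way set-associative: (8, 12, 6)
print(fields(total_lines))    # fully associative:     (20, 0, 6)

# part (e): which direct-mapped line does address 1E0027A (hex) map to?
addr = 0x1E0027A
index = (addr >> byte_bits) & ((1 << 14) - 1)
print(index)                  # 9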

11.

Define and describe virtual memory. What are its purposes, and what are the advantages and disadvantages of virtual memory systems? Virtual memory is a technique that separates the (virtual) addresses used by the software from the (physical) addresses used by the memory system hardware. Each virtual address referenced by a program goes through a process of translation (or mapping) that resolves it into the correct physical address in main memory, if such a mapping exists. If no mapping is defined, the desired information is loaded from secondary memory and an appropriate mapping is created. The translation process is overseen by the operating system, with much of the work done in hardware by a memory management unit (MMU) for speed reasons. It is usually done via a multi-level table lookup procedure, with the MMU internally caching frequently- or recently-used translations so that the costly (in terms of performance) table lookups can be avoided most of the time. The principal advantage of virtual memory is that it frees the programmer from the burden of fitting his or her code into available memory, giving the illusion of a large memory space exclusively owned by the program (rather than the usually much more limited physical main memory space that is shared with other resident

programs). The main disadvantage is the overhead of implementing the virtual memory scheme, which invariably results in some increase in average access time vs. a system using comparable technology with only physical memory. Table lookups take time, and even when a given translation is cached in the MMU’s Translation Lookaside Buffer, there is some propagation delay involved in address translation.

12.

Name and describe the two principal approaches to implementing virtual memory systems. How are they similar and how do they differ? Can they be combined, and if so, how? The two principal approaches to implementing virtual memory (VM) are demand-paged VM and demand-segmented VM (paging and segmentation, for short). They are similar in that both map a virtual (or logical) address space to a physical address space using a table lookup process managed by an MMU and overseen by the computer’s operating system. They are different in that paging maps fixed-size regions of memory called pages, while segmentation maps variable-length segments. Page size is usually determined by hardware considerations such as disk sector size, while segment size is determined by the structure of the program’s code and data. A paged system can concatenate the offset within a page with the translated upper address bits, while a segmented system must translate a logical address into the complete physical starting address of a segment and then add the segment offset to that value. It is possible to create a system that uses aspects of both approaches; specifically, one in which the variable-length segments are each comprised of one or more fixed-sized pages. This approach, known as segmentation with paging, trades off some of the disadvantages of each approach to try to take advantage of their strengths.

13.

What is the purpose of having multiple levels of page or segment tables rather than a single table for looking up address translations? What are the disadvantages, if any, of

this scheme? The main purpose of having multiple-level page or segment tables is to replace one huge mapping table with a hierarchy of smaller ones. The advantage is that the tables are smaller (remember, they are stored in main memory, though some entries may be cached) and easier for the operating system to manage. The disadvantage is that “walking” the hierarchical sequence of tables takes longer than a single table lookup. Most systems have a TLB to cache recently-used address translations, though, so this time penalty is usually only incurred once when a given page or segment is first loaded into memory (or perhaps again later if the TLB fills up and a displaced entry has to be reloaded).

14.

A process running on a system with demand-paged virtual memory generates the following reference string (sequence of requested pages): 4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3. The operating system allocates each process a maximum of four page frames at a time. What will be the number of page faults for this process under each of the following page replacement policies? a) LRU

7 page faults

b) FIFO

8 page faults

c) LFU (with FIFO as tiebreaker)

7 page faults
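
These counts can be verified with a short simulation. The sketch below is one straightforward way to model the three policies for this reference string with four frames; LFU usage counts are kept only while a page is resident, and ties are broken FIFO (oldest load first), as the problem specifies.

# Page-fault counts for problem 14: given reference string, four page frames.
from collections import OrderedDict, deque

REFS = [4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3]
FRAMES = 4

def fifo(refs, frames):
    resident, order, faults = set(), deque(), 0
    for p in refs:
        if p not in resident:
            faults += 1
            if len(resident) == frames:
                resident.remove(order.popleft())     # evict the oldest-loaded page
            resident.add(p)
            order.append(p)
    return faults

def lru(refs, frames):
    recency, faults = OrderedDict(), 0               # insertion order = recency order
    for p in refs:
        if p in recency:
            recency.move_to_end(p)                   # hit: mark as most recently used
        else:
            faults += 1
            if len(recency) == frames:
                recency.popitem(last=False)          # evict the least recently used page
            recency[p] = None
    return faults

def lfu_fifo(refs, frames):
    resident, faults, loads = {}, 0, 0               # page -> [use count, load sequence]
    for p in refs:
        if p in resident:
            resident[p][0] += 1
        else:
            faults += 1
            if len(resident) == frames:
                victim = min(resident, key=lambda q: tuple(resident[q]))
                del resident[victim]                 # least frequently used; FIFO tiebreak
            resident[p] = [1, loads]
            loads += 1
    return faults

print(lru(REFS, FRAMES), fifo(REFS, FRAMES), lfu_fifo(REFS, FRAMES))   # 7 8 7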

15.

In what ways are cache memory and virtual memory similar? In what ways are they different? Cache memory and virtual memory are similar in several ways. Both involve the interaction between two levels of a hierarchical memory system – one larger and slower, the other smaller and faster. Both have the goal of performing close to the speed of the smaller, faster memory while taking advantage of the capacity of the larger, slower one; both depend on the principle of locality of reference to achieve this. Both operate on a demand basis and both perform a mapping of addresses generated by the CPU.

One significant difference is the size of the blocks of memory that are mapped and transferred between levels of the hierarchy. Cache lines tend to be significantly smaller than pages or segments in a virtual memory system. Because of the size of the mapped areas as well as the speed disparity between levels of the memory system, cache misses tend to be more frequent, but less costly in terms of performance, than page or segment faults in a VM system. Cache control is done entirely in hardware, while virtual memory management is accomplished via a combination of hardware (the MMU) and software (the operating system). Cache exists for the sole reason of making main memory appear faster than it really is; virtual memory has several purposes, one of which is to make main memory appear larger than it is, but also to support multiprogramming, relocation of code and data, and the protection of each program’s memory space from other programs.

16.

In systems which make use of both virtual memory and cache, what are the advantages of a virtually addressed cache? Does a physically addressed cache have any advantages of its own, and if so, what are they? Describe a situation in which one of these approaches would have to be used because the other would not be feasible. All else being equal, a virtually mapped cache is faster than a physically mapped cache because no address translation is required prior to checking the tags to see if a hit has occurred. The appropriate bits from the virtual address are matched against the (virtual) tags. In a physically addressed cache, the virtual-tophysical translation must be done before the tags can be matched. A physically addressed cache does have some advantages, though, including the ability to perform task switches without having to flush (invalidate) the contents of the cache. In a situation where the MMU is located on-chip with the CPU while a cache is located off-chip (for example a level-2 or level-3 cache on the motherboard) the address is already translated before it appears on the system bus and, therefore, that cache would have to be physically addressed.

17.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:

Information density - a characteristic of a memory device that refers to the amount of information that can be stored in a given physical space or volume

Dynamic Random Access Memory (DRAM) - a semiconductor memory device made up of a large array of capacitors; its contents must be periodically refreshed in order to keep them from being lost

Magnetic RAM (MRAM) - a developing memory technology that operates on the principle of magnetoresistance; it may allow the development of “instant-on” computer systems

Erasable/Programmable Read-Only Memory (EPROM) - a type of semiconductor memory device, the contents of which cannot be overwritten during normal operation, but can be erased using ultraviolet light

Associative memory - this type of memory device is also known as a CAM

Argument register - a register in an associative memory that contains the item to be searched for

Locality of reference - the principle that allows hierarchical storage systems to function at close to the speed of the faster, smaller level(s)

Miss - this occurs when a needed instruction or operand is not found in cache and thus a main memory access is required

Refill line - the unit of information that is transferred between a cache and main memory

Tag - the portion of a memory address that determines whether a cache line contains the needed information

Fully associative mapping - the most flexible but most expensive cache organization, in which a block of information from main memory can reside anywhere in the cache

Write-back - a policy whereby writes to cached locations update main memory only when the line is displaced

Valid bit - this is set or cleared to indicate whether a given cache line has been initialized with “good” information or contains “garbage” due to not yet being initialized

Memory Management Unit (MMU) - a hardware unit that handles the details of address translation in a system with virtual memory

Segment fault - this occurs when a program makes reference to a logical segment of memory that is not physically present in main memory

Translation Lookaside Buffer (TLB) - a type of cache used to hold virtual-to-physical address translation information

Dirty bit - this is set to indicate that the contents of a faster memory subsystem have been modified and need to be copied to the slower memory when they are displaced

Delayed page fault - this can occur during the execution of a string or vector instruction when part of the operand is present in physical main memory and the rest is not

3

Basics of the Central Processing Unit

1.

Does an architecture that has fixed-length instructions necessarily have only one instruction format? If multiple formats are possible given a single instruction size in bits, explain how they could be implemented; if not, explain why this is not possible. Not necessarily. It is possible to have multiple instruction formats, all of the same length. For example, SPARC has three machine language instruction formats, all 32 bits long. This is implemented by decoding a subset of the op code bits (in the SPARC example, the two leftmost bits) and using the decoded outputs to determine how to decode the remaining bits of the instruction.

2.

The instruction set architecture for a simple computer must support access to 64 KB of byte-addressable memory space and eight 16-bit general-purpose CPU registers. (a)

If the computer has three-operand machine language instructions that operate on the contents of two different CPU registers to produce a result that is stored in a third register, how many bits are required in the instruction format for addressing registers? Since there are 8 = 2^3 registers, three bits are needed to identify each register

operand. In this case there are two source registers and one destination register, so it would take 3 * 3 = 9 bits in the instruction to address all the needed registers. (b)

If all instructions are to be 16 bits long, how many op codes are available for the three-operand, register operation instructions described above (neglecting, for the moment, any other types of instructions that might be required)? 16 bits total minus 9 bits for addressing registers leaves 7 bits to be used as

the op code. Since 2^7 = 128, there are 128 distinct op codes available to specify such instructions. (c)

Now assume (given the same 16-bit instruction size limitation) that, besides the instructions described in (a), there are a number of additional two-operand instructions to be implemented, for which one operand must be in a CPU register

while the second operand may reside in a main memory location or a register. If possible, detail a scheme that allows for at least 50 register-only instructions of the type described in (a) plus at least 10 of these two-operand instructions. (Show how you would lay out the bit fields for each of the machine language instruction formats.) If this is not possible, explain in detail why not and describe what would have to be done to make it possible to implement the required number and types of machine language instructions. We can accomplish this design goal by adopting two instruction formats that could be distinguished by a single bit. Format 1 will have a specific bit (say, the leftmost bit) = 0 while format 2 will have a 1 in that bit position. Three-operand (register-only) instructions would use format 1. With one bit already used to identify the format, of the remaining 15 bits, 6 would constitute the op code (giving us 2^6 = 64 possible instructions of this type). The other 9 bits would be used to identify source register 1 (3 bits), source register 2 (3 bits), and the destination register (3 bits). The two-operand instructions would use format 2. These instructions cannot use absolute addressing for memory operands because that would require 16 bits for the memory address alone, and there are only 16 total bits per instruction. However, register indirect addressing or indexed addressing could be used to locate memory operands. In this format, 3 of the remaining 15 bits would be needed to identify the operand that is definitely in a register. One additional bit would be required to tell whether the second operand was in a register or in a memory location. Then, another set of 3 bits would identify a second register that contains either the second operand or a pointer to it in memory. This leaves 8 bits, of which 4 or more would have to be used for op code bits since we need at least 10 instructions of this type. The remaining 4 bits could be used to provide additional op codes or as a small displacement for indexed addressing.
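
One way to make the proposed two-format scheme concrete is to sketch it as an encoder/decoder. The field widths follow the layout described above; the exact ordering of the fields and the use of the last 4 bits as a displacement (rather than extra op code bits) are illustrative assumptions, not requirements of the problem.

# Hypothetical 16-bit instruction layouts for problem 2(c).
# Format 1 (bit 15 = 0): | 0 | op(6) | rs1(3) | rs2(3) | rd(3) |
# Format 2 (bit 15 = 1): | 1 | op(4) | reg(3) | mode(1) | reg2(3) | disp(4) |

def encode_fmt1(op, rs1, rs2, rd):
    assert op < 64 and max(rs1, rs2, rd) < 8
    return (0 << 15) | (op << 9) | (rs1 << 6) | (rs2 << 3) | rd

def encode_fmt2(op, reg, mode, reg2, disp):
    assert op < 16 and reg < 8 and mode < 2 and reg2 < 8 and disp < 16
    return (1 << 15) | (op << 11) | (reg << 8) | (mode << 7) | (reg2 << 4) | disp

def decode(word):
    if word >> 15 == 0:                       # format 1: three register operands
        return ("fmt1", (word >> 9) & 0x3F, (word >> 6) & 7, (word >> 3) & 7, word & 7)
    return ("fmt2", (word >> 11) & 0xF, (word >> 8) & 7,     # format 2: reg + reg/memory
            (word >> 7) & 1, (word >> 4) & 7, word & 0xF)

print(decode(encode_fmt1(op=5, rs1=1, rs2=2, rd=3)))
print(decode(encode_fmt2(op=9, reg=6, mode=1, reg2=2, disp=12)))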

3.

What are the advantages and disadvantages of an instruction set architecture with variable-length instructions? For an architecture with a sufficient degree of complexity, it is natural that some instructions may be expressible in fewer bits than others. (Some may have fewer options, operands, addressing modes, etc. while others have more functionality.) Having variable-length instructions means that the simpler instructions need take up no more space than absolutely necessary. (If all instructions are the same length, then even the simplest ones must be the same size, in bits, as the most complex.) Variable-length instructions can save significant amounts of code memory, but at the expense of requiring a more complex decoding scheme that can complicate the design of the control unit. Variable-length instructions also make it more difficult to pipeline the process of fetching, decoding, and executing instructions (see Chapter 4).

4.

Name and describe the three most common general types (from the standpoint of functionality) of machine instructions found in executable programs for most computer architectures. In most executable programs (one can always find isolated counterexamples) the bulk of the machine instructions are, usually in this order: data transfer instructions, computational (arithmetic, logic, shift, etc.) instructions, and control transfer instructions.

5.

Given that we wish to specify the location of an operand in memory, how does indirect addressing differ from direct addressing? What are the advantages of indirect addressing, and in what circumstances is it clearly preferable to direct addressing? Are there any disadvantages of using indirect addressing? How is register indirect addressing different from memory indirect addressing, and what are the relative advantages and disadvantages of each? Direct (or absolute) addressing specifies the location of the operand(s)

explicitly as part of the machine language instruction (as opposed to immediate addressing, which embeds the operand itself in the instruction). Indirect addressing uses the machine language instruction to specify not the location of the operand, but the location of the location of the operand. (In other words, it tells where to find a pointer to the operand.) The advantage of indirect addressing is that if a given instruction is executed more than once (as in a program loop) the operand does not have to be in the same memory location every time. This is of particular use in processing tables, arrays, and other multi-element data structures. The only real disadvantages of indirect addressing vs. direct addressing are an increase in complexity and a decrease in processing speed due to the need to dereference the pointer. Depending on the architecture, the pointer (which contains the operand address) specified by the instruction may reside in either a memory location (memory indirect addressing) or a CPU register (register indirect addressing). Memory indirect addressing allows a virtually unlimited number of pointers to be active at once, but requires an additional memory access – which complicates and slows the execution of the instruction, exacerbating the disadvantages mentioned above. To avoid this complexity, most modern architectures support only register indirect addressing, which limits the pointers to exist in the available CPU registers but allows instructions to execute more quickly. 6.

Various computer architectures have featured machine instructions that allow the specification of three, two, one, or even zero operands. Explain the tradeoffs inherent to the choice of the number of operands per machine instruction. Pick a current or historical computer architecture, find out how many operands it typically specifies per instruction, and explain why you think its architects implemented the instructions the way they did. The answer to this question will obviously depend on the architecture chosen. The main tradeoff is programmer (or compiler) convenience, which favors more

operands per instruction, versus the desire to keep instructions smaller and more compact, which favors fewer operands per instruction. 7.

Why have load-store architectures increased in popularity in recent years? (How do their advantages go well with modern architectural design and implementation technologies?) What are some of their less desirable tradeoffs vs. memory-register architectures, and why are these not as important as they once were? Load/store architectures have become popular in large measure because the decoupling of memory access from computational operations on data keeps the control unit logic simpler and makes it easier to pipeline the execution of instructions (see Chapter 4). Simple functionality of instructions makes it easier to avoid microcode and use only hardwired control logic, which is generally faster and takes up less “real estate” on the IC. Not allowing memory operands also helps keep instructions shorter and can help avoid the need to have multiple instruction formats of different sizes. Memory-register architectures, on the other hand, tend to require fewer machine language instructions to accomplish the same programming task, thus saving program memory. The compiler (or the assembly language programmer) has more flexibility and not as many registers need to be provided if operations on memory contents are allowed. Given the decrease in memory prices, the improvements in compiler technology, and the shrinking transistor sizes over the past 20 years or so, the advantages of memory-register architectures have been diminished and load/store architectures have found greater favor.

8.

Discuss the two historically dominant architectural philosophies of CPU design: a)

Define the acronyms CISC and RISC and explain the fundamental differences between the two philosophies. CISC stands for “Complex Instruction Set Computer” and RISC stands for

“Reduced Instruction Set Computer.” The fundamental difference between these

two philosophies of computer system design is the choice of whether to put the computational complexity required of the system in the hardware or in the software. CISC puts the complexity in the hardware. The idea of CISC was to support high-level language programming by making the machine directly execute high-level functions in hardware. This was usually accomplished by using microcode to implement those complex functions. Programs were expected to be optimized by coding in assembly language. RISC, on the other hand, puts the complexity in the software (mainly, the high-level language compilers). No effort was made to encourage assembly language programming; instead there is a reliance on optimization by the compiler. The RISC idea was to make the hardware as simple and fast as possible by eliminating microcode and explicitly encouraging pipelining of the hardware. Any task that cannot be quickly and conveniently done in hardware is left for the compiler to implement by combining simpler functions.

b)

Name one commercial computer architecture that exemplifies the CISC architectural approach and one other that exemplifies RISC characteristics. CISC examples include the DEC VAX, Motorola 680x0, Intel x86, etc. RISC

examples include the IBM 801, Sun SPARC, MIPS Rx000, etc. c)

For each of the two architectures you named in (b) above, describe one distinguishing characteristic not present in the other architecture that clearly shows why one is considered a RISC and the other a CISC. Answers will vary depending on the architectures chosen, but may include

the use of hardwired vs. microprogrammed control, the number and complexity of machine language instructions and memory addressing modes, the use of fixed- vs. variable-length instructions, a memory-register vs. a load/store architecture, the number of registers provided and their functionality, etc. d)

Name and explain one significant advantage of RISC over CISC and one significant advantage of CISC over RISC.

Significant advantages of RISC include simpler, hardwired control logic that takes up less space (leaving room for more registers, on-chip cache and/or floating-point hardware, etc.) and allows higher CPU clock frequencies, the ability to execute instructions in fewer clock cycles, and ease of instruction pipelining. Significant advantages of CISC include a need for fewer machine language instructions per program (and thus a reduced appetite for code memory), excellent support for assembly language programming, and less demand for complexity in, and optimization by, the compilers.

9.

Discuss the similarities and differences between the programmer-visible register sets of the 8086, 68000, MIPS, and SPARC architectures. In your opinion, which of these CPU register organizations has the most desirable qualities, and which is least desirable? Give reasons to explain your choices. The 8086 has a small number of highly specialized registers. Some are for addresses, some for computations; many functions can only be carried out using a specific register or a limited subset of the registers. The 68000, another CISC processor, has a few more (16) working registers and divides them only into two general categories: data registers and address registers. Within each group, registers have identical functionality (except for address register 7 which acts as the stack pointer). MIPS and SPARC, both RISC designs, have larger programmer-visible register sets (32 working registers) and do not distinguish between registers used for data vs. registers used for pointers to memory. For the most part, “all registers are created equal”, though in both architectures register 0 is a ROM location that always contains the value 0. SPARC processors actually have a variable number (up to hundreds) of registers and use a hardware register renaming scheme to make different subsets of 32 of them visible at different times. This “overlapping register window” scheme was devised to help optimize parameter passing across procedure

calls. Students can be expected to have different preferences, but should point to specific advantages of a given architecture to back up their choices. 10.

A circuit is to be built to add two 10-bit numbers x and y plus a carry-in. (Bit 9 of each number is the MSB, while bit 0 is the LSB. c0 is the carry-in to the LSB position.) The propagation delay of any individual AND or OR gate is 0.4 ns, and the carry and sum functions of each full adder are implemented in sum of products form. (a)

If the circuit is implemented as a ripple carry adder, how much time will it take to produce a result? Each full adder takes (0.4 + 0.4) = 0.8 ns to produce a result (sum and carry

outputs). Since the carry output of each adder is an input to the adder in the next more significant position, the operation of the circuit is sequential and it takes 10 * (0.8 ns) = 8.0 ns to compute the sum of two 10-bit numbers. (b)

Given that the carry generate and propagate functions for bit position i are given by gi = xiyi and pi = xi + yi, and that each required carry bit (c1...c10) is developed from the least significant carry-in c0 and the appropriate gi and pi functions using AND-OR logic, how much time will a carry lookahead adder circuit take to produce a result? (Assume AND gates have a maximum fan-in of 8 and OR gates have a maximum fan-in of 12.) In a carry lookahead adder, all the gi and pi functions are generated

simultaneously by parallel AND and OR gates. This takes 0.4 ns (one gate delay time). Since ci+1 = gi + pici, generating all the carries should take two more gate delay times or 0.8 ns. However, we have to consider the gate fan-in restrictions. Since OR gates can have a fan-in of 12 and we never need to OR that many terms, that restriction does not matter; but the fan-in limitation on the AND gates means that an extra level of logic will be needed (since there are cases where we have to AND more than 8 terms). Thus, 3 * (0.4 ns) = 1.2 ns is required for this AND-OR logic for a total of 4 * (0.4 ns) = 1.6 ns to generate all the carries. Once the carries

are available, all full adders operate simultaneously, requiring an additional 2 * (0.4 ns) = 0.8 ns to generate the final result. The overall propagation delay of the circuit is 6 gate delays or 2.4 ns.
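The delay arithmetic for both parts can be checked with a few lines of Python, using the 0.4 ns gate delay given in the problem:

    GATE = 0.4    # propagation delay of one AND or OR gate, in ns
    bits = 10

    ripple = bits * 2 * GATE        # each full adder is two gate levels (sum-of-products)
    lookahead = (1 + 3 + 2) * GATE  # g/p generation + three-level carry logic + final sums
    print(round(ripple, 1), round(lookahead, 1))   # 8.0 ns and 2.4 ns, as computed above

11.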

Under what circumstances are carry save adders more efficient than “normal” binary adders that take two operands and produce one result? Where, in a typical general-purpose CPU, would one be most likely to find carry save adders? Carry save adders are more efficient when there are several numbers, rather than just two, that need to be added together. While multi-operand addition is not a typical operation supported by general-purpose computers, multiplication (which requires the addition of several partial products) is such an operation. Carry save adders are frequently used in multiplication hardware.

12.

Given two 5-bit, signed, two’s complement numbers x = -6 = 11010₂ and y = +5 = 00101₂, show how their 10-bit product would be computed using Booth’s algorithm (you may wish to refer to Figures 3.24, 3.25, and 3.26). M = -6 = 11010; therefore -M = +6 = 00110. The initial contents of P are 0000000101 (upper part zero, lower part = the multiplier (+5)). The computation proceeds as follows:

         P             C
     00000 00101       0
    +00110                  (1. add -M)
     00110 00101       0
     00011 00010       1    (then shift right)
    +11010                  (2. add +M)
     11101 00010       1
     11110 10001       0    (then shift right)
    +00110                  (3. add -M)
     00100 10001       0
     00010 01000       1    (then shift right)
    +11010                  (4. add +M)
     11100 01000       1
     11110 00100       0    (then shift right)
     11111 00010       0    (5. just shift right)

Done. Answer = 1111100010 = -30 (represented as the two’s complement of 30).
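As a cross-check, the small Python sketch below implements Booth’s algorithm for 5-bit operands and reproduces the result above (a is the upper half of P, q the lower half, and q_1 the extra bit C):

    def booth_multiply(m, q, n=5):
        # Booth's algorithm on n-bit two's complement operands; returns a signed 2n-bit product.
        mask = (1 << n) - 1
        m &= mask
        neg_m = (-m) & mask
        a, q, q_1 = 0, q & mask, 0
        for _ in range(n):
            if (q & 1, q_1) == (1, 0):
                a = (a + neg_m) & mask           # add -M
            elif (q & 1, q_1) == (0, 1):
                a = (a + m) & mask               # add +M
            # arithmetic shift right of the combined A, Q, C
            q_1 = q & 1
            q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
            a = (a >> 1) | (a & (1 << (n - 1)))  # replicate the sign bit of A
        product = (a << n) | q
        return product - (1 << 2 * n) if product & (1 << (2 * n - 1)) else product

    print(booth_multiply(-6, 5))   # -30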

13.

Discuss the similarities and differences between “scientific notation” (used for manual calculations in base 10) and floating-point representations for real numbers used in digital computers. Numbers expressed in scientific notation and floating-point format have a great deal in common. Both approaches divide a number into its mantissa (significant digits) and its exponent (the power of the system’s base or radix) such that the number is the product of the mantissa times the base raised to the given power. (Both the mantissa and the exponent are signed values, allowing us to represent a wide range of positive and negative values.) In both cases, we gain the considerable advantage of being able to represent very large and very small numbers without having to write (or store) a large number of digits that are either zero or insignificant. The main difference is that numbers in scientific notation usually work with base 10, with the mantissa and exponent themselves expressed in decimal notation also. Computer floating-point formats invariably express the mantissa and exponent in some type of binary format, and the radix is either 2 or some small power of 2 (i.e. 4, 8, or 16). Of course, the signs of the mantissa and exponent must be represented in sign-magnitude, complement, or biased notation since the computer cannot directly represent the symbols “+” and “-” in hardware. Floatingpoint formats also need representations for special cases like zero, infinity, and invalid or out-of-range results that may result from computations, so such results can be distinguished from normalized floating-point values.

14.

Why was IEEE 754-1985 a significant development in the history of computing, especially in the fields of scientific and engineering applications? IEEE 754 was significant because it represented the first time that a consortium of major computer manufacturers was able to come to an agreement on

a floating-point arithmetic standard that all of them could and would support. Before the adoption of IEEE 754, every vendor had its own, unique floating-point format (some more mathematically robust than others). Moving binary files containing real-number data between incompatible systems was impossible. Porting any kind of code that performed arithmetic on real numbers (as most scientific and engineering applications do extensively) from one system to another was very tedious, tricky, and prone to surprising – and in some cases, mathematically invalid – results. IEEE 754, once adopted and generally supported, essentially put an end to floating-point compatibility problems. (Much to the joy of scientific applications programmers everywhere!) 15.

Assume that IEEE has modified standard 754 to allow for “half-precision” 16-bit floating-point numbers. These numbers are stored in similar fashion to the single precision 32-bit numbers, but with smaller bit fields. In this case, there is one bit for the sign, followed by six bits for the exponent (in excess-31 format), and the remaining 9 bits are used for the fractional part of the normalized mantissa. Show how the decimal value +17.875 would be represented in this format. The fact that the number in question is positive means that the leftmost (sign) bit would be 0. To determine the rest of the bits, we express 17.875 in binary as 10001.111 or, normalizing, 1.0001111 * 2^4. The stored exponent would thus be 4 + 31 = 35 decimal or 100011 binary. Stripping off the leading “1.” and padding on the right with zeroes, the significand would be 000111100. So +17.875 would be expressed in this fictional format as 0100011000111100 binary or 463C hexadecimal.
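A short Python sketch (valid for normalized values only, using the 1/6/9-bit split assumed for this fictional format) confirms the encoding:

    def encode_half(value):
        # Pack a normalized value into the fictional 16-bit format:
        # sign (1 bit) | exponent, excess-31 (6 bits) | fraction (9 bits).
        sign = 0 if value >= 0 else 1
        mag, exp = abs(value), 0
        while mag >= 2.0:
            mag, exp = mag / 2.0, exp + 1
        while mag < 1.0:
            mag, exp = mag * 2.0, exp - 1
        frac = int(round((mag - 1.0) * (1 << 9)))   # drop the hidden leading 1
        return (sign << 15) | ((exp + 31) << 9) | frac

    print(format(encode_half(17.875), '04X'))   # 463C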

16.

Show how the decimal value -267.5625 would be represented in IEEE-754 single and double precision formats. Because the number is negative, the leftmost (sign) bit will be 1 in either format. To determine the rest of the bits, we express 267.5625 in binary as 100001011.1001 or, normalizing, 1.000010111001 * 2^8. In single precision, the stored

exponent would be 8 + 127 = 135 decimal or 10000111 binary. In double precision, the stored exponent would be 8 + 1023 = 1031 decimal or 10000000111 binary. Stripping off the leading “1.” and padding on the right with zeroes, the significand would be 00001011100100000000000 (single precision) or 0000101110010000000000000000000000000000000000000000 (double precision). So –267.5625 would be expressed in the IEEE single precision format as 11000011100001011100100000000000 binary or C385C800 hexadecimal, and in the IEEE double precision format as 1100000001110000101110010000000000000000000000000000000000000000 binary or C070B90000000000 hexadecimal.
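Both standard encodings can be verified directly in Python, since the struct module packs values using the IEEE-754 single and double precision formats:

    import struct
    print(struct.pack('>f', -267.5625).hex().upper())   # C385C800
    print(struct.pack('>d', -267.5625).hex().upper())   # C070B90000000000

17.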

Consider a simple von Neumann architecture computer like the one discussed in Section 3.3.1 and depicted in Figure 3.32. One of its machine language instructions is an ANDM instruction which reads the contents of a memory location (specified by direct addressing), bitwise ANDs this data with the contents of a specified CPU register, then stores the result back in that same register. List and describe the sequence of steps that would have to be carried out under the supervision of the processor’s control unit in order to implement this instruction.
1. MAR ← PC

Copy the contents of the Program Counter (the address of the instruction) to the MAR so they can be output on the address bus.

2. Read; PC ← PC + 1

Activate the Read control signal to the memory system to initiate the memory access. While the memory read is taking place, increment the Program Counter so that it points to the next sequential instruction in the program.

3. MDR ← [MAR]

When the memory read is complete, transfer the contents of the memory location over the data bus and latch them into the MDR.

4. IR ← MDR

Transfer the contents of the MDR (the machine language instruction) to the IR and decode the instruction. At this point, the control unit discovers that this is an “AND with memory direct” instruction.

5. MAR ← IRlow

Transfer the lower 8 bits from the IR (the operand address) to the MAR to prepare to read the operand.

6. Read

Activate the Read control signal to the memory system to initiate the

memory access for the operand.

7. MDR ← [MAR]

When the memory read is complete, transfer the contents of the memory location over the data bus and latch them into the MDR.

8. MDRout; R4outB; AND

Transfer the contents of the MDR (the memory operand) and register 4 (the register operand) to the ALU inputs and activate the control signal telling the ALU to perform the logical AND function.

9. R4 ← ALU

Transfer the output of the ALU (the logical AND of the operands) to the destination register (R4). Execution of the current instruction is now complete and the control unit is ready to fetch the next instruction.
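The same sequence of register transfers can be illustrated with a toy Python simulation. The memory contents, the use of R4 as the destination register, and the assumption that the operand address occupies the low 8 bits of the 16-bit instruction are invented purely for illustration:

    memory = {0x10: 0xA00F,     # hypothetical ANDM instruction stored at address 0x10
              0x0F: 0x00FF}     # hypothetical memory operand at address 0x0F
    PC, R4 = 0x10, 0x1234

    MAR = PC                    # 1. MAR <- PC
    MDR = memory[MAR]; PC += 1  # 2-3. read the instruction; increment PC
    IR = MDR                    # 4. IR <- MDR; decode as ANDM
    MAR = IR & 0x00FF           # 5. MAR <- low 8 bits of IR (operand address)
    MDR = memory[MAR]           # 6-7. read the memory operand
    R4 = R4 & MDR               # 8-9. ALU ANDs MDR with R4; result goes back to R4
    print(hex(R4))              # 0x34

18.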

What are the two principal design approaches for the control unit of a CPU? Describe each of them and discuss their advantages and disadvantages. If you were designing a family of high performance digital signal processors, which approach would you use, and why? The two principal design approaches are hardwired control and microprogrammed control. A hardwired (conventional) control unit is implemented using standard sequential and combinational logic design techniques to design the control step counter, decoders, and other circuitry required to generate the machine’s control signals. It has the disadvantages of requiring more design effort (especially if the machine has a complex instruction set architecture) and being more difficult to change once designed, but the advantages of being faster (all else being equal, logic tends to be faster than memory access) and taking up less space for implementation. Microprogrammed control is implemented by treating each individual machine operation as a task to be implemented using programmed steps. These steps are carried out by a very simple computing engine within the CPU itself; thus, microprogramming is sometimes described as the “computer within a computer” approach. Microinstructions, representing sets of control signals to be generated for each step in carrying out a machine operation, are fetched from microprogram memory (control store) and issued (sent out over the control lines) to control internal and external operations. Microprogramming simplifies the design process

and makes it easy to change or extend the capabilities of an architecture, but the control store takes up a considerable amount of space on the chip. And since control signals are read out of a memory rather than generated in logic, microprogrammed control tends to yield a slower implementation than hardwired control. In the situation described, it would probably be most appropriate to use hardwired control since digital signal processing is likely to demand raw CPU speed over just about any other criterion and hardwired logic will contribute to a “lean, mean” and fast design. Since it is a special-purpose and not a general-purpose processor, the DSP’s instruction set is likely to be fairly simple and so the design would not be likely to profit much from microcode’s more flexible capabilities and ability to easily implement complexity. 19.

In a machine with a microprogrammed control unit, why is it important to be able to do branching within the microcode? Transfers of control (branches) are needed within microcode for many of the same reasons that they are needed in machine language, assembly language, and high-level language programming. At any level of programming, the next sequential instruction in memory is not always the one we wish to perform next. Unconditional transfers of control are needed to move between one microroutine and another, for example from the end of the execution microroutine for one instruction to the beginning of the instruction fetch microroutine that will retrieve the next microinstruction. Conditional transfers of control in the microcode are needed in order to implement conditional branching in the machine language program (which, in turn, supports conditional structures in high-level code).

20.

Given the horizontal control word depicted in Figure 3.39 for our simple example machine, develop the microroutines required to fetch and execute the ANDM instruction using the steps you outlined in question 17.

For each step needed to execute an instruction we would need to form a binary microinstruction that had 1s in the correct places to carry out the desired action, and 0s everywhere else. For example, to do step one (which is MAR ← PC in register transfer language) we would make PCout = 1 and MARin = 1, and all the rest of the bits of the control word would be 0. So the first microinstruction would look like this: 0000000000000000000000000010100000000000000 The same sort of procedure would have to be followed for each of the remaining steps. The complete binary microroutine would be as follows: 0000000000000000000000000010100000000000000 0000000000000000000000000000000100000000110 0000000000000000000000000001000000000000000 0000000000000000000000000100001000000000000 0000000000000000000000000010010000000000000 0000000000000000000000000000000000000000110 0000000000000000000000000001000000000000000 0000000000000000000100000000001000100000000 0000100000000000000000000000000000000000000 21.

Repeat question 20 using the vertical control word depicted in Figure 3.40. The steps needed to carry out the instruction are still the same, but the microinstructions are more compact since subsets of the control signals are encoded into bit fields. To come up with specific binary values for the microinstructions, we would have to know how the four system registers (PC, IR, MAR, and MDR) and the eight ALU operations were encoded. Assuming that PC = 00, IR = 01, MAR = 10, and MDR = 11, and that the AND operation is represented by 010, the binary microroutine in the vertical format would be as follows: 0000000001000000000 0000000000000100011 0000000001100000000 0000000000111000000 0000000001001000000 0000000000000000011

0000000001100000000 0000001000011001000 1000000000000000000 22.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:
Op code - the portion (bit field) of a machine language instruction that specifies the operation to be done by the CPU
Control transfer instruction - a type of instruction that modifies the machine’s program counter (other than by simply incrementing it)
Indexed addressing - a way of specifying the location of an operand in memory by adding a constant embedded in the instruction to the contents of a “pointer” register inside the CPU
Zero-operand instructions - these would be characteristic of a stack-based instruction set
Accumulator machine - this type of architecture typically has instructions that explicitly specify only one operand
Load-store architecture - a feature of some computer architectures where “operate” instructions do not have memory operands; their operands are found in CPU registers
Complex Instruction Set Computer (CISC) - machines belonging to this architectural class try to “bridge the semantic gap” by having machine language instructions that approximate the functionality of high-level language statements
Datapath - this part of a CPU includes the registers that store operands as well as the circuitry that performs computations
Carry lookahead adder - this type of addition circuit develops all carries in logic, directly from the inputs, rather than waiting for them to propagate from less significant bit positions
Wallace tree - a structure comprised of multiple levels of carry save adders, which can be used to efficiently implement multiplication
Excess (biased) notation - this type of notation stores signed numbers as though they were unsigned; it is used to represent exponents in some floating-point formats
Significand (fraction) - in IEEE-754 floating-point numbers, a normalized mantissa with the leading 1 omitted is called this
Positive infinity - this is the result when the operation 1.0/0.0 is performed on a system with IEEE-754 floating-point arithmetic
Instruction Register (IR) - this holds the currently executing machine language instruction so its bits can be decoded and interpreted by the control unit
Microroutine - a sequence of microinstructions that fetches or executes a machine language instruction, initiates exception processing, or carries out some other basic machine-level task
Horizontal microprogramming - a technique used in microprogrammed control unit design in which mutually-exclusive control signals are not encoded into bit fields, thus eliminating the need for decoding microinstructions
Microprogram Counter (µPC) - this keeps track of the location of the next microword to be retrieved from microcode storage

4

Enhancing CPU Performance

1.

Suppose that you are designing a machine that will frequently have to perform 64 consecutive iterations of the same task (for example, a vector processor with 64-element vector registers). You wish to implement a pipeline that will help speed up this task as much as is reasonably possible, but recognize that dividing a pipeline into more stages takes up more chip area and adds to the cost of implementation. (a) Make the simplifying assumptions that the task can be subdivided as finely or coarsely as desired and that pipeline registers do not add a delay. Also assume that one complete iteration of the task takes 16 ns (thus, a non-pipelined implementation would take 64 * 16 = 1024 ns to complete 64 iterations). Consider possible pipelined implementations with 2, 4, 8, 16, 24, 32, and 48 stages. What is the total time required to complete 64 iterations in each case? What is the speedup (vs. a nonpipelined implementation) in each case? Considering cost as well as performance, what do you think is the best choice for the number of stages in the pipeline? Explain. (You may wish to make graphs of speedup and/or total processing time vs. the number of stages to help you analyze the problem.) In general, for a pipeline with s stages processing n iterations of a task, the time taken to complete all the iterations may be expressed as: tTOTAL = [s * tSTAGE] + [(n-1) * tSTAGE] = [(s+n-1) * tSTAGE] In a pipelined implementation with 2 stages: the stage time is 16/2 = 8 ns, and the total time for 64 iterations = [(2 + 64 – 1) * 8] ns = 520 ns; the speedup factor is 1.969. In a pipelined implementation with 4 stages: the stage time is 16/4 = 4 ns,

and the total time for 64 iterations = [(4 + 64 – 1) * 4] ns = 268 ns; the speedup factor is 3.821. In a pipelined implementation with 8 stages: the stage time is 16/8 = 2 ns, and the total time for 64 iterations = [(8 + 64 – 1) * 2] ns = 142 ns; the speedup factor is 7.211. In a pipelined implementation with 16 stages: the stage time is 16/16 = 1 ns, and the total time for 64 iterations = [(16 + 64 – 1) * 1] ns = 79 ns; the speedup factor is 12.962. In a pipelined implementation with 24 stages: the stage time is 16/24 = 0.667 ns, and the total time for 64 iterations = [(24 + 64 – 1) * 0.667] ns = 58 ns; the speedup factor is 17.655. In a pipelined implementation with 32 stages: the stage time is 16/32 = 0.5 ns, and the total time for 64 iterations = [(32 + 64 – 1) * 0.5] ns = 47.5 ns; the speedup factor is 21.558. In a pipelined implementation with 48 stages: the stage time is 16/48 = 0.333 ns, and the total time for 64 iterations = [(48 + 64 – 1) * 0.333] ns = 37 ns; the speedup factor is 27.676. The speedup achieved versus the number of stages goes up in nearly linear fashion through about 8-16 stages, then starts to fall off somewhat. Still, while gains are not linear, performance continues to improve significantly all the way up to a 48-stage pipeline. If hardware cost were a critical limiting factor, it would be reasonable to build only a 16-stage pipeline. If, on the other hand, performance is paramount, it would probably be worth building the most finely-grained (48-stage)

pipeline even though it achieves a speedup of “only” about 28. (b) Now assume that a total of 32 levels of logic gates are required to perform the task, each with a propagation delay of 0.5 ns (thus the total time to produce a single result is still 16 ns). Logic levels cannot be further subdivided. Also assume that each pipeline register has a propagation delay equal to that of two levels of logic gates, or 1 ns. Re-analyze the problem; does your previous recommendation still hold? If not, how many stages would you recommend for the pipelined implementation under these conditions? The same general pipeline performance equation still holds; however, in this case we are presented with reasonable limitations on what the stage time can be. Clearly, if there are 32 levels of logic gates involved, then the number of stages should be 2, 4, 8, 16, or 32; a pipeline with 24 or 48 stages is not a realistic approach since the delays do not divide evenly. (Some stages would do less work than others, or perhaps no work at all!) In addition, each stage has a minimum propagation delay of 1 ns due to its pipeline register regardless of the number of logic levels it contains. Given these more realistic limitations, we analyze the situation as follows: In a pipelined implementation with 2 stages: the stage time is (16/2) + 1 = 9 ns, and the total time for 64 iterations = [(2 + 64 – 1) * 9] ns = 585 ns; the speedup factor is 1.750. In a pipelined implementation with 4 stages: the stage time is (16/4) + 1 = 5 ns, and the total time for 64 iterations = [(4 + 64 – 1) * 5] ns = 335 ns; the speedup factor is 3.057. In a pipelined implementation with 8 stages: the stage time is (16/8) + 1 = 3

ns, and the total time for 64 iterations = [(8 + 64 – 1) * 3] ns = 213 ns; the speedup factor is 4.808. In a pipelined implementation with 16 stages: the stage time is (16/16) + 1 = 2 ns, and the total time for 64 iterations = [(16 + 64 – 1) * 2] ns = 158 ns; the speedup factor is 6.481. In a pipelined implementation with 32 stages: the stage time is (16/32) + 1 = 1.5 ns, and the total time for 64 iterations = [(32 + 64 – 1) * 1.5] ns = 142.5 ns; the speedup factor is 7.186. Here, because of the more realistic design constraints, the speedup achieved versus the number of stages is not nearly linear, especially beyond about 4-8 stages. An implementation with 32 stages is twice as costly as one with 16 stages, yet performs only a fraction better. Even the 16-stage implementation is not appreciably better than the 8-stage pipeline. In both these latter cases, we would incur considerable extra hardware expense for very modest reductions in the total time required to complete the task. The best tradeoff, depending on the exact constraints of the situation, is probably to use 4 or, at most, 8 stages. As we said in Section 4.1, “achieving a speedup factor approaching s (the number of pipeline stages) depends on n (the number of consecutive iterations being processed) being large, where ‘large’ is defined relative to s.” Here, n is 64 which is large relative to 4 (or perhaps 8), but 64 is not particularly large relative to 16 or 32.
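Both sets of figures above come from the same expression, tTOTAL = (s + n - 1) * tSTAGE; the short Python sketch below reproduces them (the 1 ns register delay applies only to part (b)):

    def total_time(stages, iterations=64, task_ns=16.0, reg_delay=0.0):
        stage_time = task_ns / stages + reg_delay
        return (stages + iterations - 1) * stage_time

    for s in (2, 4, 8, 16, 24, 32, 48):           # part (a): ideal subdivision, no register delay
        t = total_time(s)
        print(s, round(t, 1), round(1024 / t, 3))
    for s in (2, 4, 8, 16, 32):                   # part (b): 1 ns pipeline register delay per stage
        t = total_time(s, reg_delay=1.0)
        print(s, round(t, 1), round(1024 / t, 3))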

2.

Given the following reservation table for a static arithmetic pipeline:

              0    1    2    3    4    5
   Stage 1    X                        X
   Stage 2         X    X
   Stage 3                   X    X

(a) Write the forbidden list.
(0, 1, 5)
(b) Determine the initial collision vector C.
C = c5c4c3c2c1c0 = 100011
(c) Draw the state diagram.
The diagram includes three states. The initial state is given by the initial collision vector, 100011. For i = 4 or i ≥ 6, we remain in this state. For i = 3 we transition from the initial state to state 100111, while for i = 2 we transition from the initial state to state 101011. Once in state 100111 we remain in that state for i = 3 and return to the initial state for i = 4 or i ≥ 6. Once in state 101011 we remain in that state for i = 2 and return to the initial state for i = 4 or i ≥ 6.
(d) Find the MAL.
2
(e) Find the minimum latency.
2

3.

Considering the overall market for all types of computers, which of the following are more commonly found in today’s machines: arithmetic pipelines (as discussed in Section 4.2) or instruction unit pipelines (Section 4.3)? Explain why this is so. Arithmetic pipelines are primarily found in vector supercomputers, which once had a significant share of the high-performance computing market but have now largely fallen out of favor – mainly because they exhibit little generality. (They

are only useful for a limited subset of applications.) On the other hand, instruction unit pipelines are found in all RISC and superscalar microprocessors and even in most CISC microprocessors (other than low-end embedded microcontroller units). Nowadays, instruction unit pipelines are nearly ubiquitous. 4.

Why do control transfers, especially conditional control transfers, cause problems for an instruction-pipelined machine? Explain the nature of these problems and discuss some of the techniques that can be employed to cover up or minimize their effect. In a non-pipelined machine, control transfer instructions are not appreciably more troublesome than other instructions since all instructions are processed one at a time. In an instruction-pipelined machine, control transfers are problematic because by the time the instruction in question is decoded and determined to be a control transfer (and, in the case of conditional control transfers, by the time the branch condition is evaluated to determine whether or not the branch succeeds), one or more additional instructions may have been fetched and proceeded some distance into the pipeline. If, as is the normal case, these instructions are the ones sequentially following the control transfer, they are not the correct instructions that the machine should be executing. (It should instead be executing the instructions starting at the control transfer’s target location.) While the control unit can suppress the effect of these following instructions by not allowing them to update any registers or memory locations (and thus insure correct operation of the program), it cannot recover the clock cycles lost by bringing them into the pipeline. One technique that can be used to minimize the performance effect of control transfers is delayed branching. This approach, commonly used in RISC

architectures, documents the fact that the instruction(s) immediately following a control transfer (a.k.a. the delay slot instruction(s)) – since it is (or they are) already in the pipeline – will be executed as though it (they) came before the control transfer instruction. (In the case of a conditional branch, this means that the delay slot instruction(s) are executed regardless of the outcome of the branch). Another approach (these are not mutually exclusive) is to use static (compile time) and/or dynamic (run time) branch prediction. The compiler and/or processor attempt to predict, based on the structure of the code and/or the history of execution, whether a conditional branch will succeed or fail and fetch subsequent instructions accordingly. A successful prediction will reduce or eliminate the “branch penalty”; an incorrect prediction, however, will incur a significant performance penalty due to the need to drain the pipeline. (See the answer to the next question for an example of a quantitative analysis.) 5.

A simple RISC CPU is implemented with a single scalar instruction processing pipeline. Instructions are always executed sequentially except in the case of branch instructions. Given that pb is the probability of a given instruction being a branch, pt is the probability of a branch being taken, pc is the probability of a correct prediction, b is the branch penalty in clock cycles, and c is the penalty for a correctly predicted branch: (a)

Calculate the throughput for this instruction pipeline if no branch prediction is done, given that pb = 0.16, pt = 0.3, and b = 3. The average number of clock cycles per instruction is (0.16)(0.3)(1 + 3 cycles)

+ (0.16)(0.7)(1 cycle) + (1 – 0.16)(1 cycle) = 0.192 + 0.112 + 0.84 = 1.144 cycles/instruction. The throughput equals 1 / (1.144 cycles/instruction) ≈ 0.874

instructions per cycle. (b)

Assume that we use a branch prediction technique to try to improve the pipeline’s performance. What would be the throughput if c = 1, pc = 0.8, and the other values are the same as above? The average number of clock cycles per instruction would be

(0.16)(0.3)(0.8)(1 + 1 cycles) + (0.16)(0.3)(0.2)(1 + 3 cycles) + (0.16)(0.7)(0.8)(1 cycle) + (0.16)(0.7)(0.2)(1 + 3 cycles) + (0.84)(1 cycle) = 0.0768 + 0.0384 + 0.0896 + 0.0896 + 0.84 = 1.1344 cycles/instruction. The throughput in this case would improve slightly to 1 / (1.1344 cycles/instruction) ≈ 0.882 instructions per cycle.
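The arithmetic for both parts can be reproduced with a few lines of Python:

    pb, pt, b = 0.16, 0.3, 3
    cpi_a = pb * pt * (1 + b) + pb * (1 - pt) * 1 + (1 - pb) * 1
    print(round(cpi_a, 4), round(1 / cpi_a, 3))   # 1.144 cycles/instruction, ~0.874 instructions/cycle

    pc, c = 0.8, 1
    cpi_b = (pb * pt * pc * (1 + c) + pb * pt * (1 - pc) * (1 + b)
             + pb * (1 - pt) * pc * 1 + pb * (1 - pt) * (1 - pc) * (1 + b)
             + (1 - pb) * 1)
    print(round(cpi_b, 4), round(1 / cpi_b, 3))   # 1.1344 cycles/instruction, ~0.882 instructions/cycle

6.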

What are the similarities and differences between a delayed branch and a delayed load? A delayed branch, as explained in the answer to question 4 above, is a special type of control transfer instruction used in some pipelined architectures to minimize or “cover up” the branch penalty caused by having the wrong instruction(s) in progress following a control transfer instruction. A delayed load is also a feature of some instruction sets that is used to cover up a potential performance penalty associated with pipelined implementation. It has nothing to do with branching, but rather deals with the (also problematic) latency normally associated with accessing memory as compared to internal CPU operations. Since (even in the case of a cache hit) reading a value from memory usually takes at least one additional clock cycle vs. obtaining it from a register, the architects may document the fact that the instruction immediately following a load should not (or must not) use the value being loaded; instead, it should operate on some other data already inside the CPU. Depending on the architecture, if the following instruction does use the value being

loaded from memory into a register, it may be documented to reference the old, rather than the new, value in that register; or a hardware interlock may be employed to delay the subsequent instruction until the load completes and the newly loaded value can be used. 7.

Given the following sequence of assembly language instructions for a CPU with multiple pipelines, indicate all data hazards that exist between instructions.
I1: Add R2, R4, R3    ; R2 = R4 + R3
I2: Add R1, R5, R1    ; R1 = R5 + R1
I3: Add R3, R1, R2    ; R3 = R1 + R2
I4: Add R2, R4, R1    ; R2 = R4 + R1

Three RAW hazards exist: between I1 and I3 over the use of R2; between I2 and I3 over the use of R1; and between I2 and I4 over the use of R1. Two WAR hazards exist: between I1 and I3 over the use of R3, and between I3 and I4 over the use of R2. Only one WAW hazard exists; it is between I1 and I4 over the use of R2.
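A small Python sketch (which ignores intervening writes, sufficient for this short sequence) finds the same set of hazards; each instruction is listed as (name, destination, sources):

    instrs = [("I1", "R2", ["R4", "R3"]),
              ("I2", "R1", ["R5", "R1"]),
              ("I3", "R3", ["R1", "R2"]),
              ("I4", "R2", ["R4", "R1"])]

    for i, (ni, di, si) in enumerate(instrs):
        for nj, dj, sj in instrs[i + 1:]:
            if di in sj:
                print("RAW between", ni, "and", nj, "over", di)
            if dj in si:
                print("WAR between", ni, "and", nj, "over", dj)
            if di == dj:
                print("WAW between", ni, "and", nj, "over", di)

8.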

What are the purposes of the scoreboard method and Tomasulo’s method of controlling multiple instruction execution units? How are they similar and how are they different? Both the scoreboard method and Tomasulo’s method are techniques of controlling and scheduling a superscalar processor’s internally parallel hardware execution units. By detecting and correcting for data hazards, both methods ensure that RAW, WAR, and WAW relationships in the code being executed do not cause the machine to improperly execute a sequentially written program. Even though instructions may actually be executed out of order, these methods make them appear to be executed in the original, sequential order.

These two methods are different in that each was initially devised for a different 1960s-era supercomputer (the CDC 6600 vs. the IBM 360/91); each bears the specific characteristics of its parent machine. The scoreboard method uses a centralized set of registers and logic to schedule the use of processor hardware by multiple instructions; when a data hazard is detected, the issue of one or more instructions to a functional unit is stalled to ensure correct operation. Tomasulo’s method uses a distributed control technique (with reservation stations associated with each functional unit) to schedule operations. It implements a “dataflow” approach to scheduling hardware, using register renaming and data forwarding to help avoid stalls as much as possible and enhance the machine’s ability to execute operations concurrently. 9.

List and explain nine common characteristics of RISC architectures. In each case, discuss how a typical CISC processor would (either completely or partially) not exhibit the given attribute. The characteristics that are common to virtually all RISC architectures are discussed in Section 4.4. A RISC architecture may primarily be distinguished by its adherence to the following characteristics:
• Fixed-length instructions are used to simplify instruction fetching.
• The machine has only a few instruction formats in order to simplify instruction decoding.
• A load/store instruction set architecture is used to decouple memory accesses from computations so that each can be optimized independently.
• Instructions have simple functionality, which helps keep the control unit design simple.
• A hardwired control unit optimizes the machine for speed.
• The architecture is designed for pipelined implementation, again to optimize for speed of execution.
• Only a few, simple addressing modes are provided, since complex ones may slow down the machine and are rarely used by compilers.
• There is an emphasis on optimization of functions by the compiler since the architecture is designed to support high-level languages rather than assembly programming.
• Complexity is in the compiler (where it only affects the performance of the compiler), not in the hardware (where it would affect the performance of every program that runs on the machine).
Secondary characteristics that are prevalent in RISC machines include:
• Three-operand instructions make it easier for the compiler to optimize code.
• A large register set (typically 32 or more registers) is possible because the machine has a small, hardwired control unit, and desirable because of the need for the compiler to optimize code for the load/store architecture.
• Instructions execute in a single clock cycle (or at least most of them appear to, due to pipelined implementation).
• Delayed control transfer instructions are used to minimize disruption to the pipeline.
• Delay slots behind loads and stores help to cover up the latency of memory accesses.
• A Harvard architecture is used to keep memory accesses for data from interfering with instruction fetching and thus keep the pipeline(s) full.
• On-chip cache is possible due to the small, hardwired control unit and necessary to speed instruction fetching and keep the latency of loads and stores to a minimum.
The characteristics of a CISC would depend on the specific architecture and

implementation being compared, but would include such things as variable-length instructions, many instruction formats, a memory-register architecture, complex functionality of individual machine instructions, the use of microprogrammed control, the lack of explicit support for pipelining, support for many and/or complex addressing modes, an emphasis on optimization of code by manual profiling and recoding in assembly language, support for fewer than three operands per instruction, a limited number of programmer-visible registers, instructions that require multiple clock cycles to execute, the lack of delay slots, a Princeton architecture and, in some cases, a smaller or nonexistent on-chip cache. In general, complexity is in the hardware rather than in the compiler! A machine need not exhibit all these characteristics to be considered a CISC architecture, but the more of them that are observed, the more confident we can be in identifying it as such.

10.

How does the “overlapping register windows” technique, used in the Berkeley RISC and its commercial successor the Sun SPARC, simplify the process of calling and returning from subprograms? This technique partitions the register set not by intended use (data registers vs. pointer registers), as is common in many other architectures, but rather by scope (global, local, inputs, and outputs) relative to the procedure(s) in which values are used. When a procedure calls another procedure, the internal decoding scheme is altered such that registers are renumbered; the caller’s output registers automatically become the called procedure’s input registers and a new set of local and output registers is allocated to that procedure. In many cases, the number of procedure parameters is such that all of them can be passed in the overlapped registers, thus eliminating the need to access memory to write and then read values to/from a stack frame. This reduction in the number of accesses to data memory can help improve performance, especially for high-level language programs that are written to use a number of modular functions.

11.

You are on a team helping to design the new Platinum V® processor for AmDel Corporation. Consider the following design issues: (a) Your design team is considering a superscalar vs. superpipeline approach to the design. What are the advantages and disadvantages of each option? What technological factors would tend to influence this choice one way or the other? The superpipelined approach has the advantage of being simpler to control (since there is no out-of-order execution to cause WAR and WAW hazards); it also may be able to achieve a higher clock frequency since each pipeline stage does less work. The simpler control logic and single pipeline leave more space on the chip for registers, on-chip cache, floating-point hardware, memory management unit(s), and other enhancements. However, superpipelined processors suffer from an increased branch penalty and do not perform well on code with many control transfers.

The superscalar approach takes advantage of spatial parallelism to better exploit the instruction-level parallelism inherent in most programs. Superscalar processors also may achieve high performance with only a modest clock frequency, avoiding the need to generate and distribute a high-frequency clock signal. On the other hand, a superscalar implementation has the disadvantage of being more difficult to control internally, due to its vulnerability to all types of data hazards. The multiple pipelines and more complex control logic take up more space on the chip, leaving less room for other functionality. It is also more difficult to handle exceptions precisely in a superscalar architecture. An implementation technology with short propagation delays (i.e. very fast transistors) favors a superpipelined approach, while an implementation technology with small feature sizes (and thus more transistors per IC) favors a superscalar approach. (b) Your design team has allocated the silicon area for most of the IC and has narrowed the design options to two choices: one with 32 registers and a 512KB on-chip cache and one with 512 registers but only a 128 KB on-chip cache. What are the advantages and disadvantages of each option? What other factors might influence your choice? One would think that having more registers would always be a good thing, but this is not necessarily true. Registers are only beneficial to the extent that the compiler (or the assembly language programmer) can make use of them; adding additional registers past that point provides no benefit and has some definite costs. For example, one significant advantage of the first option is that it only requires 5 bits in the instruction format for addressing each register operand. Unless some scheme such as register windowing is used, the second option (with 29 registers) will require 9 bits in the instruction for each register operand used. The first option also requires fewer registers to be saved and restored on each context switch or

interrupt. On the other hand, while having more cache is essentially always a good thing, even cache hits are not as beneficial for performance as keeping data in a CPU register, since cache generally requires at least one extra clock cycle to read or write data as compared to internal operations. Cache, like registers, also has a “diminishing returns” effect; while 512KB is four times the capacity of 128KB, it certainly won’t provide four times the hit ratio. To make the best choice given the specified alternatives, one would probably need to consider factors including the intended application(s) of the processor, the anticipated speed gap between cache and main memory, whether the on-chip cache is going to have a Harvard (instructions and data kept separate) or Princeton (unified) architecture, whether there will be any off-chip (level 2) cache, etc. 12.

How are VLIW architectures similar to superscalar architectures, and how are they different? What are the relative advantages and disadvantages of each approach? In what way can VLIW architectures be considered the logical successors to RISC architectures? Both VLIW and superscalar architectures make use of internally parallel hardware (more than one pipelined instruction execution unit operating at the same time). The main difference is that superscalar architectures use hardware inside the CPU’s control unit to dynamically schedule the execution of instructions and avoid incorrect operation due to hazards; VLIW machines instead use the compiler to statically perform dependency resolution and resource scheduling. The principal advantages of the superscalar approach are better compatibility with existing architectures and a reduced burden on the (more conventional) compiler. VLIW has the advantage of simplified control hardware, which may reduce internal delays and allow operation at a higher clock frequency than a superscalar processor given the same implementation technology. Also, since the compiler can

use more sophisticated scheduling logic, a VLIW architecture may take better advantage of instruction-level parallelism and execute more instructions, on average, per clock cycle. However, VLIW architectures in general suffer from poor code density (programs take up a lot of memory), are not easily made compatible with existing architectures, and perform poorly on code with many branches. VLIW is the logical successor to RISC in the sense that both design philosophies emphasize making hardware simpler and faster while transferring the burden of computational complexity to the software (specifically, the compiler). 13.

Is Explicitly Parallel Instruction Computing (EPIC) the same thing as a VLIW architecture? Explain why or why not. Not exactly. EPIC is based on the same idea as VLIW, but the format of the “bundles” in the IA-64 processors is not tied to a particular hardware implementation. Some EPIC-based chips may be able to execute one bundle at a time, some others may do less than a complete bundle at a time, and some higherperformance implementations may be able to execute multiple bundles at a time (if not prevented by dependencies between operations). By standardizing the size of the bundles, EPIC (unlike a VLIW architecture) allows binary compatibility to be maintained between generations of CPUs. In a pure VLIW machine the compiler does all the resource scheduling. In a superscalar machine the control unit does the scheduling. In EPIC the compiler does much of the work, but the control unit still has to do some of it. So EPIC is somewhere between VLIW and superscalar, though closer to VLIW.

14.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:
Flow-through time - the time required for the first result in a series of computations to emerge from a pipeline
Pipeline register - this is used to separate one stage of a pipeline from the next
Multifunction pipeline - this type of pipeline can perform different kinds of computations at different times
Collision - this occurs if we mistakenly try to use a pipeline stage for two different computations at the same time
Average Latency - over time, this tells the mean number of clock cycles between initiation of operations into a pipeline (if an optimal pipeline control strategy is used, this would be equal to the Minimum Average Latency or MAL)
Pipeline throughput - over time, this tells the mean number of operations completed per clock cycle
Branch penalty - the clock cycles that are wasted by an instruction-pipelined processor due to executing a control transfer instruction
Static branch prediction - a technique used in pipelined CPUs where the compiler supplies a hint as to whether or not a given conditional branch is likely to succeed
Delay slot instruction(s) - the instruction(s) immediately following a conditional control transfer instruction in some pipelined processors, which are executed whether or not the control transfer occurs
Delayed load - a technique used in pipelined CPUs where the instruction immediately following another instruction that reads a memory operand cannot use the updated value of the operand
Read After Write (RAW) hazard - the most common data hazard in pipelined processors; also known as a true data dependence
Write After Write (WAW) hazard - also known as an output dependence, this hazard can occur in a processor that utilizes out-of-order execution
Scoreboard - a centralized resource scheduling mechanism for internally concurrent processors; it was first used in the CDC 6600 supercomputer
Reservation stations - these are used by a Tomasulo scheduler to hold operands for functional units
Overlapping register windows - a technique used in some RISC processors to speed up parameter passing for high-level language procedure calls
Superpipelined - this type of processor architecture maximizes temporal parallelism by using a very deep pipeline with very fast stages
Superscalar - this approach to high-performance processing uses multiple pipelines with resolution of inter-instruction data dependencies done by the control unit
Explicitly Parallel Instruction Computing (EPIC) - the “architecture technology” used in Intel’s IA-64 (Itanium) chips
Predication - the IA-64 architecture uses this approach instead of branch prediction to minimize the disruption caused by conditional control transfers

5

Exceptions, Interrupts, and Input/Output Systems

1.

What do we mean when we say that interrupts must be processed “transparently”? What does this involve and why is it necessary? Since interrupts are asynchronous to CPU operations (that is, they can occur at any time, without warning), it is necessary that the complete run-time context of the program that was executing be preserved across the servicing of the interrupt. That is to say, the interrupted program must not experience any changes to the state of the processor (or to its program or data memory) due to interrupt handling; it should not “see” any effects other than a time lag, and should compute the same results in the presence vs. absence of an interrupt(s). For this to happen it is necessary to save and restore not only the program counter (so the program can be resumed at the next instruction that would have been executed had the interrupt not occurred), but also the condition codes (a.k.a. status flags) and the contents of all the CPU registers. This is normally accomplished via pushing these values on the system stack when an interrupt occurs and popping them back off after it has been serviced. If this were not done, the interrupt would not be “transparent” and the interrupted program could operate incorrectly.

2.

Some processors, before servicing an interrupt, automatically save all register contents. Others automatically save only a limited amount of information. In the second case, how can we be sure that all critical data are saved and restored? What are the advantages and disadvantages of each of these approaches? The advantage of automatically saving everything is that we are sure that it has been done and thus we know that the interrupt will be serviced in transparent fashion (as described in the answer to question 1 above). The disadvantage of this is that a given interrupt service routine may actually use only a small subset of the CPU registers, and the time spent saving and restoring all the other, unused registers is wasted.

The additional delay involved in saving a large number of registers can significantly increase the latency in responding to an interrupt; for some timing-sensitive I/O devices, it is important to keep this latency as small as possible. Some processors only automatically save the program counter and condition codes; other registers are left to be preserved with push and pop instructions inside the service routine. In that case, we save only the necessary registers and keep latency to a minimum; but the potential disadvantage is that we are dependent on the vigilance of the programmer who writes the service routine to track his/her register usage and save all required registers. If he or she fails to do this, the interrupted program could have its run-time context corrupted.

3.

Explain the function of a watchdog timer. Why do embedded control processors usually need this type of mechanism? A watchdog timer is a mechanism that can be used to reset a system if it “locks up”, without requiring the intervention of a human user (which is not always possible, especially in embedded systems). To implement this mechanism, a counter that runs off the system clock (or some derivative of it) is initialized and allowed to count up toward a maximum count (or down toward zero). Software running on the system is assigned the task of periodically resetting the counter to its initial value. If the counter ever “rolls over”, presumably due to a system software failure, that rollover event is detected by hardware and used to generate a reset signal to reinitialize and recover the system to a known state. Embedded control processors, unlike general-purpose CPUs, are not normally mounted in a case with a convenient reset button within reach of a user’s finger. Due to the embedded location, it may be difficult or impossible to perform a manual reset; a watchdog timer may be the only mechanism that allows the system to be restarted in the event of a crash.
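A minimal sketch of the software side of this mechanism, assuming a hypothetical memory-mapped watchdog whose reload register address and key value are invented only for illustration; the application’s main loop “kicks” the watchdog each time around:

    #include <stdint.h>

    #define WDT_RELOAD_ADDR 0x40001000u   /* hypothetical reload register address */
    #define WDT_KEY         0xA5A5A5A5u   /* hypothetical "kick" value            */

    static volatile uint32_t * const wdt_reload =
            (volatile uint32_t *)WDT_RELOAD_ADDR;

    static void do_control_work(void)
    {
        /* normal application processing would go here */
    }

    void main_loop(void)
    {
        for (;;) {
            do_control_work();
            *wdt_reload = WDT_KEY;   /* reset the counter before it rolls over;
                                        if the software hangs, the reload stops,
                                        the counter rolls over, and the hardware
                                        forces a system reset */
        }
    }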

4.

How are vectored and autovectored interrupts similar and how are they different? Can they be used in the same system? Why or why not? What are their advantages and

disadvantages vs. nonvectored interrupts? Vectored and autovectored interrupts are similar in that both use an interrupt number to index into a table that contains the addresses of the various interrupt service routines. (The value read from the table is loaded into the program counter and execution of the service routine begins from that point.) The only difference between the two techniques is that hardware devices provide their own interrupt numbers (via the system bus during the interrupt acknowledge cycle) in a system with vectored interrupts, while an autovectoring scheme uses interrupt numbers internally generated by the CPU based on the priority level of an incoming interrupt. Yes, both of these techniques can be used in the same system; the Motorola 680x0 family of CPUs is a prime example. The “smarter” devices can provide their own vectors, while a special hardware signal or timeout mechanism can alert the CPU to the need to generate autovectors for other devices. While nonvectored interrupts are slightly easier to implement in hardware as compared to vectored or autovectored interrupts, the additional complexity required of the software (to identify the source of each interrupt and execute the correct code to handle it), the corresponding increase in the latency to service an interrupt, and the limitations it places on the design of the memory system are significant drawbacks.

5.

Given the need for user programs to access operating system services, why are traps a better solution than conventional subprogram call instructions? The main “problem” with a typical subprogram call instruction is that it generally requires a target address that is explicitly specified using one of the machine’s addressing modes; that is to say, we must know where a given routine resides in memory in order to call it. While this is normally not a problem for userwritten code or procedures in a link library, we often do not know the location of routines that are part of the operating system. Their location may vary from one

specific system or OS version to another. Also, code that performs system functions such as I/O usually needs to be executed at a system privilege level, while called procedures normally execute with the same privilege level as the code that called them. Traps, since they make use of the same vectoring mechanism as interrupts or other exceptions, allow OS routines to be accessed implicitly, without the programmer having to know the exact location of the code he or she wishes to execute. By executing a specific trapping instruction, the desired routine can be executed at a system privilege level with control returning (at the proper privilege level) to the user program that called it.

6.

Compare and contrast program-controlled I/O, interrupt-driven I/O, and DMA-based I/O. What are the advantages and disadvantages of each? Describe scenarios that would favor each particular approach over the others. In a system with program-controlled I/O, the CPU executes code to poll the various hardware devices to see when they require service, then executes more code to carry out the data transfers. This is the simplest way to handle I/O, requiring no extra hardware support; but the need for the CPU to spend time polling devices complicates the software and detracts from system performance. This approach would only be favored in a very low-cost, embedded system where the CPU is not doing much other than I/O and the goal is to keep the hardware as simple and inexpensive as possible. In a system with interrupt-driven I/O, the devices use hardware interrupt request lines to notify the CPU when they need servicing. The CPU then executes instructions to transfer the data (as it would in a system using program-controlled I/O). This approach doesn’t eliminate the need for CPU involvement in moving data and also involves a bit more hardware complexity, but support for interrupt processing is already built into virtually every microprocessor so the additional cost is minimal. The upside of this technique is that the CPU never has to waste time

polling devices. System software is simplified by having separate interrupt service routines for each I/O device, and devices are typically serviced with less latency than if interrupts were not used. This approach is good for many systems, especially general-purpose machines that have a wide variety of I/O devices with different speeds, data transfer volumes, and other characteristics. DMA-based I/O is carried out by a hardware DMA controller that is separate from the system CPU. When the CPU determines (often by receiving an interrupt) that a transfer of data to or from an I/O device needs to take place, it initializes the DMA controller with the particulars of the transfer; the DMA controller then carries out the operation, transferring data directly between the chosen device and memory, without further intervention by the CPU. This approach has the highest hardware cost of the three, since it requires an extra system component; it also requires the overhead of the CPU having to set up the DMA controller for each transfer. However, DMA is very efficient, especially when large blocks of data are frequently transferred. Its use would be favored in a general-purpose or (especially) a high-performance system with high-speed devices that can benefit significantly from large block I/O operations.

7.

Systems with “separate I/O” have a second address space for I/O devices as opposed to memory and also a separate category of instructions for doing I/O operations as opposed to memory data transfers. What are the advantages and disadvantages of this method of handling I/O? Name and describe an alternative strategy and discuss how it exhibits a different set of pros and cons. Separate I/O has the advantage of a unique address space for I/O devices; because of this, there are no “holes” in the main memory address space where I/O device interface registers have been decoded. The full physical memory address space is available for use by memory devices. Also, I/O operations are easily distinguished from memory operations by their use of different machine language

instructions. On the other hand, hardware complexity (and possibly cost) is increased slightly and the additional instructions required for I/O make the instruction set architecture a bit more complex. The alternative, memory-mapped I/O, shares a single physical address space between memory and I/O devices. This keeps the hardware and instruction set simpler while sacrificing the distinct functionality of I/O instructions as well as the complete, contiguous address space that would otherwise be available to memory. Given the widespread use of virtual memory in all but the simplest of systems, the pros and cons of either approach are not as noteworthy as they once were and either approach can be made to work well.

8.

Given that many systems have a single bus which can be controlled by only one bus master at a time (and thus the CPU cannot use the bus for other activities during I/O transfers), explain how a system that uses DMA for I/O can outperform one in which all I/O is done by the CPU. On the face of it, it would seem that DMA I/O would provide little or no advantage in such a system, since only one data transfer can occur at a time regardless of whether the CPU or DMAC is initiating it. However, DMA still has a considerable advantage for a couple of important reasons. One of these is that, due to the widespread use of on-chip instruction and data cache, it is likely that the CPU can continue to execute code for some time (in parallel with I/O activities) even without any use of the system bus. The second reason is that even if the CPU “stalls out” for lack of ability to access code or data in main memory, the I/O operation itself is done more efficiently than it would be if the CPU performed it. Instead of reading a value from a buffer in memory and then writing it to an I/O device interface (or vice versa), the CPU (which would be the middleman in the transaction) gets out of the way and the two transactions are replaced with one direct data transfer between memory and the device in question.
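A minimal sketch of the contrast, assuming hypothetical memory-mapped DMAC registers (the addresses, register names, and start bit below are invented for illustration, not taken from any particular controller):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical DMA controller registers (illustrative addresses only). */
    #define DMAC_SRC   (*(volatile uint32_t *)0x40002000u)
    #define DMAC_DST   (*(volatile uint32_t *)0x40002004u)
    #define DMAC_COUNT (*(volatile uint32_t *)0x40002008u)
    #define DMAC_CTRL  (*(volatile uint32_t *)0x4000200Cu)
    #define DMAC_START 0x1u

    /* CPU as the "middleman": one bus read plus one bus write per word. */
    void cpu_copy(volatile uint32_t *dst, const uint32_t *src, size_t words)
    {
        for (size_t i = 0; i < words; i++)
            dst[i] = src[i];
    }

    /* With DMA the CPU only sets up the transfer; the DMAC then moves the
       data directly between memory and the device, one transaction per word,
       while the CPU is free to keep executing out of its caches. */
    void dma_copy(uint32_t dst_addr, uint32_t src_addr, uint32_t words)
    {
        DMAC_SRC   = src_addr;
        DMAC_DST   = dst_addr;
        DMAC_COUNT = words;
        DMAC_CTRL  = DMAC_START;   /* completion is typically signaled by an interrupt */
    }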

9.

Compare and contrast the channel processors used in IBM mainframes with the PPUs used in CDC systems. The channel processors used in IBM mainframes were simple von Neumann machines with their own program counters, register sets, (simpler) instruction set architecture, etc. They communicated with the main system processor(s) by reading and writing a shared area of main memory. CDC’s Peripheral Processing Units were complete computers dedicated to I/O operations. The PPUs had their own separate memory and were architecturally similar to the main system processor (although they lacked certain capabilities, such as hardware support for floating-point arithmetic, that were not useful for I/O). In addition to controlling I/O devices they performed other operations such as buffering, checking, formatting, and translating data.

10.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:
Exception - a synchronous or asynchronous event that occurs, requiring the attention of the CPU to take some action
Service routine (handler) - a special program that is run in order to service a device, take care of some error condition, or respond to an unusual event
Stack - when an interrupt is accepted by a typical CPU, critical processor status information is usually saved here
Non-maskable interrupt - the highest priority interrupt in a system; one that will never be ignored by the CPU
Reset - a signal that causes the CPU to reinitialize itself and/or its peripherals so that the system starts from a known state
Vectoring - the process of identifying the source of an interrupt and locating the service routine associated with it
Vectored interrupt - when this occurs, the device in question places a number on the bus which is read by the processor in order to determine which handler should be executed
Trap - another name for a software interrupt, this is a synchronous event occurring inside the CPU because of program activity
Abort - on some systems, the “Blue Screen Of Death” can result from this type of software-related exception
Device interface registers - these are mapped in a system’s I/O address space; they allow data and/or control information to be transferred between the system bus and an I/O device
Memory-mapped I/O - a technique that features a single, common address space for both I/O devices and main memory
Bus master - any device that is capable of initiating transfers of data over the system bus by providing the necessary address, control, and/or timing signals
Direct Memory Access Controller (DMAC) - a hardware device that is capable of carrying out I/O activities after being initialized with certain parameters by the CPU
Burst mode DMA - a method of handling I/O where the DMAC takes over exclusive control of the system bus and performs an entire block transfer in one operation
Input/Output Processor (IOP) (also known as Peripheral Processor or Front-End Processor) - an independent, programmable processor that is used in some systems to offload input and output activities from the main CPU

6

Parallel and High-Performance Systems

1.

Discuss at least three distinguishing factors that can be used to differentiate among parallel computer systems. Why do systems vary so widely with respect to these factors? Some of the main factors that differ from one parallel system to another are the number and type of processors used and the way in which they are connected to each other. Some parallel systems use shared main memory for communication between processors, while others use a message-passing paradigm. With either of these approaches, a wide variety of networks can be used to facilitate the exchange of information. The main reason why parallel systems exhibit such a wide range of characteristics is probably because there is such a wide variety of applications. The characteristics of the intended applications drive the characteristics of the machines that are built to run them.

2.

Michael Flynn defined the terms SISD, SIMD, MISD, and MIMD to represent certain classes of computer architectures that have been built or at least considered. Tell what each of these abbreviations stands for; describe the general characteristics of each of these architectures; and explain how they are similar to and different from one another. If possible, give an example of a specific computer system fitting each of Flynn’s classifications. SISD stands for Single Instruction stream, Single Data stream – this is a single-processor system such as a typical desktop or notebook PC or workstation. Generally, such systems are built around a processor with a conventional von Neumann (Princeton) or Harvard architecture. SIMD stands for Single Instruction stream, Multiple Data stream. SIMD systems are commonly known as array

processors because they execute the same operation on a large collection of operands at the same time. The control unit of a SIMD is much like that of a SISD machine, but it controls a number of processing elements simultaneously. Examples of SIMD computers include the ILLIAC IV, the Connection Machine, etc. MISD (Multiple Instruction stream, Single Data stream) machines would have carried out multiple algorithms on the same data sets. Such machines, while conceptually possible, have yet to be developed. MIMD is an acronym for Multiple Instruction stream, Multiple Data stream. This classification of machines encompasses the vast majority of parallel systems, which consist of multiple CPUs (each with a Princeton or Harvard architecture like the one in a typical SISD system) connected together by some type of communications network. Examples of MIMD computers include the Silicon Graphics Origin series, the Cray T3E, and any “Beowulf” class cluster system.

3.

What is the main difference between a vector computer and the scalar architectures that we studied in Chapters 3 and 4? Do vector machines tend to have a high, or low, degree of generality as defined in Section 1.4? What types of applications take best advantage of the properties of vector machines? The main difference between a vector computer and conventional scalar architectures is the fact that instructions executed by vector processors operate not on individual values or pairs of values, but on vectors (one-dimensional arrays of values). In a scalar machine the ADD instruction adds two numbers to produce their sum; in a vector processor the ADD instruction adds each element of one set of numbers to the corresponding element of a second set, producing a corresponding set of results. This is usually accomplished with a deeply pipelined execution unit(s)

through which vector elements are fed in succession. Because of their unique construction, vector machines have a very low degree of generality. They are extremely well suited to certain applications, particularly scientific and engineering applications like weather forecasting, CFD simulations, etc. that do a great deal of “number crunching” on vectors or arrays of data. However, they offer little to no advantage when running office applications or any type of scalar code. While vector processors are more powerful now than they have ever been, they are not as popular in the overall supercomputer market as they once were because they are useful for a relatively narrow range of specialized applications. Cluster computers based on RISC and superscalar microprocessors are now more popular since they tend to be less expensive (per MIP or MFLOP) and can run a wider range of applications efficiently.

4.

How are array processors similar to vector processors and how are they different? Explain the difference between fine-grained and coarse-grained array processors. Which type of array parallelism is more widely used in today’s computer systems? Why? Array processors are similar to vector processors in that a single machine instruction causes a particular computation to be carried out on a large set of operands. They are different in their construction: array processors use a number of relatively simple processing elements (spatial parallelism) while vector processors generally employ a small number of deeply pipelined processing units (temporal parallelism). Fine-grained array processors consist of a very large number of extremely simple processing elements, while coarse-grained array processors have a few (but usually somewhat more capable) processing elements. Coarse-grained

array parallelism is more widely used in today’s computers, particularly in the multimedia accelerators that have been added to many popular microprocessor families. This is probably because a coarse-grained SIMD is useful in a wider range of applications than a fine-grained array processor would be.

5.

Explain the difference between multiprocessor and multicomputer systems. Which of these architectures is more prevalent among massively parallel MIMD systems? Why? Which architecture is easier to understand (for programmers familiar with the uniprocessor model)? Why? Multiprocessors are systems in which the CPUs communicate by sharing main memory locations, while multicomputers are systems in which each CPU has its own, local memory and communication is accomplished by passing messages over a network. Most massively parallel MIMD systems are multicomputers because sharing main memory among large numbers of processors is difficult and expensive. Multiprocessors are easier for most programmers to work with because the shared memory model allows communication to be done using the same approaches that are used in systems with a single CPU. Multicomputers, on the other hand, must make use of message passing – a more "artificial" and counterintuitive paradigm for communication.

6.

Explain the similarities and differences between UMA, NUMA, and COMA multiprocessors. All three of these architectural classifications refer to machines with shared main memory. Any location in memory can be read or written by any processor in the system; this is how processes running on various processors communicate with

each other. In a system with UMA (Uniform Memory Access), any memory location can be read or written by any CPU in the same amount of time (unless the memory module in question is already busy). This is a desirable property, but the hardware required to accomplish it does not scale economically to large systems. Multiprocessors with many CPUs tend to use the NUMA (Non-Uniform Memory Access) or the more recently developed COMA (Cache-Only Memory Architecture) approaches. NUMA systems use a modular interconnection scheme in which memory modules are directly connected to some CPUs but only indirectly connected to others. This is more cost-effective for larger multiprocessors, but access time is variable (remote modules take longer to read or write than local modules) and thus code/data placement must be “tuned” to the specific characteristics of the memory system (a non-trivial exercise) for best performance. In a COMA system, the entire main memory space is treated as a cache; all addresses represent tags rather than physical locations. Items in memory can be migrated and/or replicated dynamically so they are nearer to where they are most needed. This experimental approach requires even more hardware support than the other two, and is therefore more expensive to implement, but it has the potential to make larger multiprocessors behave more like SMPs and thus perform well without the software having to be tuned to a particular hardware configuration.

7.

What does “cache coherence” mean? In what type of computer system would cache coherence be an issue? Is a write-through strategy sufficient to maintain cache coherence in such a system? If so, explain why. If not, explain why not and name and describe an approach that could be used to ensure coherence.

Cache coherence means that every CPU in the system sees the same view of memory (“looking through” its cache(s) to the main memory). In a coherent system, any CPU should get the same value as any other when it reads a shared location, and this value should reflect updates made by this or any other processor. A write-through strategy is not sufficient to ensure this, because updating the contents of main memory is not enough to make sure the other caches’ contents are consistent with the updated memory. Even if main memory contains the updated value, the other caches might have previously loaded that refill line and thus might still contain an old value. To ensure a coherent view of memory across the whole machine, copies of a line that has been modified (written) need to be updated in other caches (so that they immediately receive the new data) or invalidated by them (so that they will miss if they try to access the old data and update it at that point by reading the new value from main memory). The write-update or write-invalidate operations can be accomplished by implementing a snoopy protocol (typical of smaller, SMP systems) in which caches monitor a common interconnection network, such as a bus, to detect writes to cached locations. In larger multiprocessor systems such as NUMA machines where bus snooping is not practical, a directory protocol (in which caches notify a centralized controller(s) of relevant transactions and, in turn, receive notifications of other caches’ write operations) is often used.

8.

What are the relative advantages and disadvantages of write-update and write-invalidate snoopy protocols? A write-invalidate snoopy protocol is simpler to implement and uses less bus

bandwidth since there is no need for other caches to load modified data. It works well when data are lightly shared, but not as well when data are heavily shared since the hit ratio is usually lower. A write-update snoopy protocol keeps the hit ratio higher for heavily shared data and works well when reads and writes alternate, but it is more complex to implement and may use more bus bandwidth (which can be a limiting factor if several processors are sharing a common bus).

9.

What are directory-based protocols and why are they often used in CC-NUMA systems? Directory-based protocols are cache coherence schemes that do not rely on the “snooping” of a common bus or other single interconnection between CPUs in a multiprocessor system. (While snooping is feasible in most SMP systems, it is not so easily accomplished in larger NUMA architectures with distributed shared memory using a number of local and system-wide interconnections.) Communications with the system directory, which is a hardware database that maintains all the information necessary to ensure coherence of the memory system, are done in point-to-point fashion and thus scale better to a larger system. In very large systems, the directory itself may be distributed (split into subsets residing in different locations) to further enhance scalability.

10.

Explain why synchronization primitives based on mutual exclusion are important in multiprocessors. What is a read-modify-write cycle and why is it significant? A read-modify-write (RMW) cycle is one in which a memory location is read, its contents are modified by a processor, and the new value is written back into the same memory location in indivisible, “atomic” fashion. This type of operation is important in accessing mutual exclusion primitives such as semaphores. The RMW

cycle protects the semaphore test/update operation from being interrupted by any other process; if such an interruption did occur, this could lead to a lack of mutual exclusion on a shared resource, which in turn could cause incorrect operation of the program.
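A minimal sketch of such a primitive in portable C11, where atomic_flag_test_and_set() performs the indivisible read-modify-write; the lock variable and shared counter are illustrative names, not from the text:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* shared "semaphore"  */
    static long shared_counter = 0;               /* shared resource     */

    void increment_shared(void)
    {
        /* atomic_flag_test_and_set returns the old value and sets the flag
           in one indivisible read-modify-write step */
        while (atomic_flag_test_and_set(&lock))
            ;                                     /* spin until the lock is free */

        shared_counter++;                         /* critical section    */

        atomic_flag_clear(&lock);                 /* release the lock    */
    }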

11.

Describe the construction of a “Beowulf cluster” system. Architecturally speaking, how would you classify such a system? Explain. A Beowulf-type cluster is a parallel computer system made up of a number of inexpensive, commodity computers (often generic Intel-compatible PCs) networked together, usually with off-the-shelf components such as 100 megabit/s or 1 gigabit/s Ethernet. The operating system is often an open-source package such as Linux. The idea is to aggregate a considerable amount of computational power as inexpensively as possible. Because they are comprised of multiple, complete computer systems and communicate via message passing over a network, Beowulf clusters are considered multicomputers (LM-MIMD systems).
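A minimal sketch of the message-passing style typically used on such clusters, written against the standard MPI C interface; the rank numbers and the single integer payload are arbitrary illustrative choices:

    #include <mpi.h>
    #include <stdio.h>

    /* typically launched on (at least) two nodes, e.g.: mpirun -np 2 ./demo */
    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which node am I? */

        if (rank == 0) {
            value = 42;
            /* node 0 sends one int to node 1; no shared memory is involved */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }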

12.

Describe the similarities and differences between circuit-switched networks and packet-switched communications networks. Which of these network types is considered “static” and which is “dynamic”? Which type is more likely to be centrally controlled and which is more likely to use distributed control? Which is more likely to use asynchronous timing and which is more likely to be synchronous? Circuit-switched and packet-switched networks are both used to facilitate communications in parallel (SIMD and MIMD) computer systems. Both allow connections to be made for the transfer of data between nodes. They are different in that circuit-switched networks actually make and break physical connections to

allow various pairs of nodes to communicate, while packet-switched networks maintain the same physical connections all the time and use a routing protocol to guide message packets (containing data) to their destinations. Because physical connections are not changed to facilitate communication, packet-switched networks are said to be static, while circuit-switched networks dynamically reconfigure hardware connections. Packet-switched networks generally exhibit distributed control and are most likely to be asynchronous in their timing. Circuit-switched networks are more likely to be synchronous (though some are asynchronous) and more commonly use a centralized control strategy.

13.

What type of interconnection structure is used most often in small systems? Describe it and discuss its advantages and disadvantages. Small systems often use a single bus as an interconnection. This is a set of address, data, and control/timing signals connected to all components in the system to allow information to be transferred between them. The principal advantage of a bus-based system is simplicity of hardware (and therefore low cost). Its main disadvantage is limited performance; only one transaction (read or write operation) can take place at a time. In a system with only one or a few bus masters (such as CPUs, IOPs, DMACs, etc.) there will typically be little contention for use of the bus and performance will not be compromised too much; in larger systems, however, there will often be a desire for multiple simultaneous data transfers which, of course, cannot happen. Thus one or more bus masters will have to wait to use the bus and overall performance will suffer to some degree.

14.

Describe the operation of a static network with a “star” topology. What connection

degree do its nodes have? What is its communication diameter? Discuss the advantages and disadvantages of this topology. A star network has all the computational and/or memory nodes connected to a single, central communications node or “hub”. All the nodes except the hub have a connection degree of one (the hub is of degree n if there are n other nodes). Such a network has a communication diameter of two (one hop from the source to the hub, one hop from the hub to the destination). Advantages include the network’s small communication diameter (2) and the fact that the communication distance is the same for all messages sent across the network. Another advantage is that the network structure is simple and it is easy to add additional nodes if system expansion is desired. The main disadvantage of a star network is that all communications must pass through the hub. Because of this, the hub may become saturated with traffic, thus becoming a bottleneck and limiting system performance.

15.

How are torus and Illiac networks similar to a two-dimensional nearest-neighbor mesh? How are they different? Both torus and Illiac networks are similar to a two-dimensional nearest-neighbor mesh in that their nodes are of degree four (they are connected to neighbors in each of the x and y directions; in other words, to a “north”, “south”, “east”, and “west” neighbor). This is true of the interior nodes in a nearest-neighbor mesh as well. The differences between these three network topologies lie in how the “edge” and “corner” nodes are connected. In a basic nearest-neighbor mesh, the nodes on the side and top/bottom edges lack one of the possible connections and thus have a connection degree of only three; the corner nodes lack

one connection in the x direction and one in the y direction and therefore are of degree two. In a torus network, the edge nodes have a “wrap-around” connection to the node in the same row or column on the opposite edge (corner nodes have wrap-around connections in both dimensions); therefore all nodes in the network have a connection degree of four. The Illiac network has the same configuration, except that the rightmost node in each row has a wrap-around connection to the leftmost node in the next (rather than the same) row, with the rightmost node in the last row being connected to the leftmost node in the first row.

16.

Consider a message-passing multicomputer system with 16 computing nodes.

(a)

Draw the node connections for the following connection topologies: linear array, ring, two-dimensional rectangular nearest-neighbor mesh (without edge connections), binary n-cube. (The drawings will be similar to the appropriate figures in Chapter 6.)

(b)

What is the connection degree for the nodes in each of the above interconnection networks? Linear array: 2 (1 for end nodes). Ring: 2 (all nodes). 2-D mesh: 4 (3 on

edges, 2 in corners). Binary n-cube: 4 (all nodes).

(c)

What is the communication diameter for each of the above networks? Linear array: 15. Ring: 8. 2-D mesh: 6. Binary n-cube: 4. (These follow from the standard diameter formulas for each topology; see the sketch after part (d) below.)

(d)

How do these four networks compare in terms of cost, fault tolerance, and speed of communications? (For each of these criteria, rank them in order from most desirable to least desirable.)

Cost: linear array, ring, 2-D mesh, n-cube. Fault tolerance: n-cube, 2-D mesh, ring, linear array. Speed: n-cube, 2-D mesh, ring, linear array.
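The diameters quoted in part (c) follow from the standard formulas for these topologies; a small sketch that checks them for n = 16 (assuming, as here, that n is both a perfect square and a power of two):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int n = 16;
        int side = (int)sqrt((double)n);              /* 4 x 4 mesh side   */

        printf("linear array: %d\n", n - 1);          /* 15                */
        printf("ring:         %d\n", n / 2);          /* 8                 */
        printf("2-D mesh:     %d\n", 2 * (side - 1)); /* 6                 */
        printf("n-cube:       %d\n", (int)log2(n));   /* 4 = log2(16)      */
        return 0;
    }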

17.

Describe, compare, and contrast store-and-forward routing with wormhole routing. Which of these approaches is better suited to implementing communications over a static network with a large number of nodes? Why? Both wormhole and store-and-forward routing are methods for transmitting message packets across a static network. Store-and-forward routing treats each message packet as a unit, transferring the entire packet from one node in the routing path to the next before beginning to transfer it to a subsequent node. This is a simple but not very efficient way to transfer messages across the network; it can cause significant latency for messages that must traverse several nodes to reach their destinations. Wormhole routing divides the message packets into smaller pieces called flits; as soon as an individual flit is received by an intermediate node along the routing path, it is sent on to the next node without waiting for the entire packet to be assembled. This effectively pipelines, and thus speeds up, the transmission of messages between remote nodes. A network with a large number of nodes is likely to have a large communication diameter, with many messages requiring several “hops” to reach their destinations, and thus is likely to benefit considerably from the use of wormhole routing.

18.

In what type of system would one most likely encounter a full crossbar switch interconnection? Why is this type of network not usually found in larger (measured by number of nodes) systems? A full crossbar switch would most likely be found in a high-performance

symmetric multiprocessor (SMP) system. Such a network would probably not be used in a system with many nodes because it has a cost and complexity that increases proportionally to the square of the number of nodes connected. (In other words, its cost is O(n²).)

19.

Consider the different types of dynamic networks discussed in this chapter. Explain the difference between a blocking network and a non-blocking network. Explain how a rearrangeable network compares to these other two dynamic network types. Give an example of each. In a blocking network such as the Omega network, any node on one side of the network can be connected to any node on the other side. However, creating one connection across the network prevents (blocks) certain other pairs of nodes from being connected as long as the first connection exists. Only certain subsets of connections can exist simultaneously. In a non-blocking network such as a full crossbar switch, any node on one side of the network can be connected to any node on the other side, and this connection does not interfere with the establishment of a connection between any other (idle) nodes. A rearrangeable network such as a Benes network represents a middle ground (in functionality, complexity, and expense) between the previous two alternatives. It is similar in structure to a blocking network, but adds redundancy in the form of additional stages and/or connections. The redundancy allows for multiple possible paths connecting any two given nodes. While any particular connection across the network does block certain other connections, an established

connection can always be rerouted along one of the alternate paths (a.k.a. rearranged) in order to allow another desired connection to be made.

20.

Choose the best answer to each of the following questions:
(a) Which of the following is not a method for ensuring cache coherence in a multiprocessor system? (1) write-update snoopy cache; (2) write-through cache; (3) write-invalidate snoopy cache; (4) full-map directory protocol. Answer: (2) write-through cache.
(b) In a 16-node system, which of these networks would have the smallest communication diameter? (1) n-cube; (2) two-dimensional nearest-neighbor mesh; (3) ring; (4) torus. Answer: tie between (1) and (4); both have diameter = 4.
(c) Which of the following is a rearrangeable network? (1) Illiac network; (2) multistage cube network; (3) crossbar switch; (4) Benes network; (5) none of the above. Answer: (4) Benes network.
(d) In a 64-node system, which of the following would have the smallest node connection degree? (1) ring; (2) two-dimensional nearest-neighbor mesh; (3) Illiac network; (4) n-cube. Answer: (1) ring.

21.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:
Multicomputer (LM-MIMD) - a parallel computer architecture in which there are several processing nodes, each of which has its own local or private memory modules
Multiprocessor (GM-MIMD) - a parallel computer architecture in which there are several processing nodes, all of which have access to shared memory modules
Single Instruction stream, Multiple Data stream (SIMD) machine - another name for an array processor
Symmetric Multiprocessor (SMP) - a relatively small MIMD system in which the “uniform memory access” property holds
Deadlock - a situation in which messages on a network cannot proceed to their destinations because of mutual or cyclic blocking
Blocking network - an interconnection network in which any node can be connected to any node, but some sets of connections are not simultaneously possible
Communication diameter - the maximum number of “hops” required to communicate across a network
Packet-switched network - multicomputers with many nodes would be interconnected by this
(Full) Crossbar switch - the classic example of a non-blocking, circuit-switched interconnection network for multiprocessor systems
Store-and-forward routing - a method of message passing in which flits do not continue toward the destination node until the rest of the packet is assembled
Write-update snoopy cache protocol - a method used for ensuring coherence of data between caches in a multiprocessor system where a write hit by one CPU causes other processors’ caches to receive a copy of the written value
Flow control digit (flit) - the basic unit of information transfer through the network in a multicomputer system using wormhole routing

7

Special-Purpose and Future Architectures

1.

Explain how a dataflow machine avoids the “von Neumann bottleneck.” A dataflow machine, unlike one based on a von Neumann architecture, does not rely on the use of sequential algorithms to guide the processing of data. Processing is data-driven rather than instruction-driven; it is not inherently sequential, but instead allows the hardware to exploit any parallelism inherent in the task. Since there is no need to fetch “instructions” separately from data, the von Neumann bottleneck described in Chapter 1 does not come into play.

2.

Draw a dataflow graph and an activity template for the following programming construct: if (x >= 0) { z = (x + y) * 4; } else { z = (y - x) * 4; } The dataflow graph will be similar to Figure 7.2 except for the details of the operations. Likewise, the activity template will be similar to Figure 7.3.

3.

If you had a scientific application that involved a large number of matrix manipulations, would you rather run it on a dataflow computer or a SIMD computer? Explain. It would probably be better to run such an application on a SIMD computer, since that type of system is designed to optimize performance on array processing (matrices are handled as two-dimensional arrays). Dataflow computers do reasonably well with “unstructured” parallelism but are not particularly good at exploiting array-type parallelism.

4.

What do you think is the main reason why dataflow computers have so far not been

widely adopted? There are several legitimate reasons why dataflow computers have not reached the mainstream. One is their reliance on specialized programming languages to express the constructs represented in a dataflow graph or activity template. Machines that do not use standard programming languages tend to have a higher software development cost. Also, dataflow machines do not perform particularly well (at least, not well enough to justify their cost) on many common applications. Unless the parallelism inherent to the task is a good match for the parallelism of the machine hardware, performance gains will be modest at best. Finally, dataflow machines are not easy or cheap to build and do not take much advantage of the locality of reference that is essential to the function of hierarchical memory systems.

5.

Give an example of how dataflow techniques have influenced and/or been used in conventional computer design. Dataflow techniques were a part of the control strategy developed by Robert Tomasulo for the IBM 360/91 computer in the 1960s. While the machine was programmed like a traditional von Neumann computer, internally its hardware execution units were scheduled using a dataflow approach: an operation was sent to a functional unit once its operands were available. Some modern, superscalar microprocessors still use Tomasulo’s method (or variations on it) and thus bear the influence of dataflow computing.

6.

Are superthreaded and hyper-threaded processors the same thing? If not, how do they differ?

Superthreaded and hyper-threaded processors are close cousins, but not identical. Superscalar machines that use superthreading (also called time-slice multithreading) can issue multiple instructions belonging to one process (or thread) during a given clock cycle; during a different clock cycle, instructions belonging to another process can be issued. Effectively, use of the CPU is time-multiplexed on a cycle-by-cycle basis. Hyper-threading (or simultaneous multithreading) takes this concept one step further: during a given clock cycle, instructions from more than one process may be issued in order to make the maximum possible use of CPU resources.

7.

Would you classify an artificial neural network as an SISD, SIMD, MISD, or MIMD system, or something else? Make a case to support your choice. The best answer is probably “something else” – ANNs represent a unique class of architectures with their own special characteristics. Artificial neural networks really do not fit the description of any of the systems described in Flynn’s taxonomy of computer systems. If one had to try to pigeonhole them into one of his four classifications, they could be said to at least resemble, in some ways, MIMD or even MISD machines. (ANNs are clearly not SISD or SIMD architectures because they lack a single instruction stream.)

8.

Explain how the processing elements and interconnections in an artificial neural network relate to the structure of the human nervous system. The many, simple processing elements in an artificial neural network correspond to the many neurons in the human body. Each processing element, like each real neuron, accepts several inputs and computes the weighted sum of those

inputs. This sum is applied to an activation function that simulates the action potential threshold of a biological neuron. The activation function determines whether a given processing element will send an output to the input of another processing element (simulated neuron).

9.

How is a supervised artificial neural network programmed to carry out a particular task? What is the difference between a supervised vs. unsupervised ANN? An artificial neural network is not so much programmed as trained iteratively to perform a given task. A supervised ANN receives its training via the user’s repeated applications of inputs, producing outputs that are compared to the corresponding desired outputs; the neuron weights are adjusted after each pass until the network “learns” to produce good output for the full range of inputs. Unsupervised ANNs are used in situations for which feedback is unavailable (no “known good” output data exists). Instead, they use “competitive learning” techniques to learn on their own without intervention by a human trainer.
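A minimal sketch of both ideas, assuming a single simulated neuron with a step activation function and the classic perceptron weight-update rule; the learning rate, the AND-function training data, and the epoch count are illustrative choices, not from the text:

    #include <stdio.h>

    #define NUM_INPUTS 2

    /* Weighted sum of inputs followed by a step activation function,
       mimicking a biological neuron's firing threshold. */
    static int neuron_output(const double w[], double bias, const double x[])
    {
        double sum = bias;
        for (int i = 0; i < NUM_INPUTS; i++)
            sum += w[i] * x[i];
        return (sum >= 0.0) ? 1 : 0;
    }

    int main(void)
    {
        /* Supervised training example: teach the neuron the logical AND function. */
        double inputs[4][NUM_INPUTS] = {{0,0}, {0,1}, {1,0}, {1,1}};
        int desired[4] = {0, 0, 0, 1};
        double w[NUM_INPUTS] = {0.0, 0.0}, bias = 0.0, rate = 0.1;

        for (int epoch = 0; epoch < 100; epoch++) {
            for (int p = 0; p < 4; p++) {
                int error = desired[p] - neuron_output(w, bias, inputs[p]);
                for (int i = 0; i < NUM_INPUTS; i++)
                    w[i] += rate * error * inputs[p][i];   /* perceptron rule */
                bias += rate * error;
            }
        }

        for (int p = 0; p < 4; p++)
            printf("%g AND %g -> %d\n", inputs[p][0], inputs[p][1],
                   neuron_output(w, bias, inputs[p]));
        return 0;
    }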

10.

Why are ANNs well suited to applications such as robotic control? Give an example of an application for which you do not think an ANN would be a good choice. ANNs are a good choice for robotic control because complex motions are difficult to program algorithmically using traditional programming languages. Since it is possible to generate examples of the desired functionality and since most ANNs operate on the principle of “training” a system to produce outputs corresponding to given examples, they are a natural “fit”. After all, the biological neural networks of human beings and animals can be trained to produce desired output, so why shouldn’t artificial neural networks exhibit similar strengths? On

the other hand, applications such as computational fluid dynamics, etc. that require a great deal of numeric computations (“number crunching”) would probably perform a lot better on a conventional supercomputer than on an ANN.

11.

What is different about logical variables in a fuzzy system as compared to a conventional computer system? In a conventional computer system, logical variables are binary in nature. That is to say, they are either true or false; on or off; 1 or 0; 100% or 0%. In a fuzzy system, logical values can take on a continuum of truth values between the limits of 0 and 1, inclusive.

12.

Both ANNs and fuzzy logic systems attempt to mimic the way human beings make decisions. What is the main difference between the two approaches? The main difference is that artificial neural networks attempt to mimic the actual structure of the human brain by simulating the functionality of neurons and the connections that exist between them. Fuzzy logic systems attempt to model the uncertain, imprecise methods people use to make decisions based on (often incomplete) available information, but their structure is not based on any biological model.

13.

What is a fuzzy subset and how does the idea of a membership function relate to it? Propose a simple membership function rich() that deals with the concept of a fuzzy subset of wealthy people. A fuzzy subset is a portion of the universe of discourse (the set of all things under consideration in formulating a given problem), whose membership is not defined precisely. A membership function expresses the perceived likelihood that a

given member of the universe belongs to a particular fuzzy subset. In other words, it produces a truth value (in the range of 0 to 1, inclusive) that indicates an object’s degree of membership in the fuzzy subset. One possible definition of a membership function rich() would be as follows:

rich(x) = 0, if income(x) < $100K
rich(x) = (income(x) - $100K) / $900K, if $100K ≤ income(x) ≤ $1M
rich(x) = 1, if income(x) > $1M
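A minimal sketch of this membership function in C, using the $100K and $1M breakpoints assumed in the definition above:

    /* Degree of membership in the fuzzy subset "rich people", based on income
       in dollars; returns a truth value between 0.0 and 1.0 inclusive. */
    double rich(double income)
    {
        if (income < 100000.0)
            return 0.0;                             /* clearly not rich   */
        if (income > 1000000.0)
            return 1.0;                             /* clearly rich       */
        return (income - 100000.0) / 900000.0;      /* partial membership */
    }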

14.

Can the Boolean, or crisp, logic operations AND, OR, and NOT be defined in regard to fuzzy logic? If so, explain how; if not, explain why not. Yes, the Boolean functions AND, OR, and NOT correspond to fuzzy operations. NOT is generally defined such that truth (not x) is equal to 1.0 – truth (x). Various definitions have been used for the AND and OR functions; the most common are truth (x AND y) = min (truth (x), truth (y)) and truth (x OR y) = max (truth (x), truth (y)). Note that if the variables x and y are restricted to only the discrete values 0 and 1 (as in binary logic systems), these definitions are consistent with Boolean algebra properties.
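These definitions translate directly into code; a minimal sketch using the complement and min/max forms given above:

    /* Fuzzy logic connectives operating on truth values in [0.0, 1.0]. */
    double fuzzy_not(double x)            { return 1.0 - x; }
    double fuzzy_and(double x, double y)  { return (x < y) ? x : y; }  /* min */
    double fuzzy_or (double x, double y)  { return (x > y) ? x : y; }  /* max */

With x and y restricted to the values 0.0 and 1.0, these reduce to the ordinary Boolean NOT, AND, and OR.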

15.

Explain, in the context of a fuzzy expert system, what rules are and how they are used. Rules are statements that reflect the knowledge of a human expert about how a given system works. They are typically expressed in terms of if-then relationships between fuzzy output subsets and linguistic variables derived from the inputs. The rules are used to make inferences about the system based on the fuzzified input data (that is, the results after the membership functions are applied to the “raw” input data). The outputs of the various rules that make up the system’s rule base are combined and used to create a single fuzzy subset for each output variable. These

outputs can then be defuzzified to produce “crisp” outputs if they are needed.

16.

For what type(s) of physical system is fuzzy control particularly well suited? Fuzzy control is a good choice for controlling systems that are nonlinear, complex, and/or have poorly specified characteristics, thus making them a poor match for conventional analog or digital control systems using algorithms that depend on having a well-defined, linear model of the process to be controlled.

17.

What is Moore’s Law and how has it related to advances in computing over the last 40 years? Is Moore’s Law expected to remain true forever or lose its validity in the future? Explain your answer and discuss the implications for the design of future highperformance computer systems. Moore’s Law says that the continually shrinking sizes of semiconductor devices will result in an exponential growth (doubling on approximately a yearly basis) in the number of transistors that can feasibly be integrated on a single chip. This has resulted in a doubling of computational power approximately every 18-24 months (since, apparently, computational power is not a linear function of the number of transistors). Given the known laws of physics, it is not possible that Moore’s Law will continue to hold true indefinitely. The problem is that transistors used as switching elements in computers cannot keep shrinking once they get to the size of individual atoms or small groups of atoms. Devices that small will no longer work under the binary logic principles of Boolean algebra, and the performance of traditional computer architectures will reach a hard limit. (This has been estimated to occur within the next 10-20 years.) At that point, further increases in performance will

only be achievable if some new approach, for example quantum computing, is adopted.

18.

How does a quantum computer fundamentally differ from all the other computer architectures discussed in this book? What allows a quantum computer to achieve the effect of a massively parallel computation using a single piece of hardware? A quantum computer differs from all traditional computer architectures in that it does not operate on the principles of Boolean algebra, where computations are done sequentially (or in parallel, by replicating hardware) on binary digits (bits) or groups of bits. In conventional machines, each bit can only be 0 or 1 at any given time. Quantum computers instead operate on quantum bits (qubits). Qubits can take on not only the distinct values 0 or 1, but also – by the principle of quantum superposition – they can take on states that can be 0 and 1 at the same time. By adding more qubits, a quantum computer becomes exponentially more powerful. While a 16-bit binary register can take on only one of its 65,536 possible states at a time, a 16-qubit quantum register can be in all 65,536 states at once in coherent superposition. This allows the effect of a massively parallel computation to be achieved using only one piece of hardware.

19.

What are some of the problems scientists must solve in order to make supercomputers based on the principles of quantum mechanics practical? Researchers working on quantum computers have encountered a number of problems that have so far made it impractical to construct machines that can compete with supercomputers based on conventional architectures. First of all, it is difficult to build a quantum computer, since there is a need to separate one or a

small number of atoms from others and keep them in a steady state in order to use them for computation. Another significant problem is decoherence, a phenomenon that can introduce errors in computations due to interactions of the computer hardware with the surrounding environment. Finally, assuming one has performed a quantum computation, it is difficult to observe the result without collapsing the coherent superposition of states and destroying one’s work. Research into solving these problems is ongoing.

20.

What application(s) are expected to be a good match for the unique capabilities of quantum computers? Explain. If large-scale quantum computers can be practically constructed, they will probably be used to solve extremely numerically intensive problems that have proven to be intractable with even the fastest conventional systems. (They won’t be used for word processing, sending e-mail, or surfing the Net!) One area that seems to be a good potential match for the capabilities of quantum computers is cryptography – the making and breaking of highly secure codes that protect sensitive information from being intercepted by unauthorized parties.

21.

Fill in the blanks below with the most appropriate term or concept discussed in this chapter:
Dataflow machine - a type of computer architecture in which execution depends on the availability of operands and execution units rather than a sequential-instruction program model
Node (actor) - an element in a dataflow graph that represents an operation to be performed on data
Tokens - these are used to represent data values (operands and results) in algorithms for a dataflow architecture
IBM 360/91 - this (outwardly) von Neumann machine made use of dataflow techniques for internal scheduling of operations
Hyper-threading - a machine using this technique can issue instructions from more than one thread of execution during the same clock cycle
Artificial Neural Network (ANN) - a type of computer architecture with a structure based on that of the human nervous system
Neurons - the fundamental units that make up a biological neural network
Dendrites - these are fibers that act as “input devices” for neurons in human beings
Convergence - when an artificial neural network achieves this, it is “trained” and ready to be put into operating mode
Single-Layer Perceptron (SLP) - the earliest and simplest type of artificial neural network
Unsupervised neural network - a type of artificial neural network that does not require user intervention for training
Fuzzy logic architecture - a type of computer architecture in which logical values are not restricted to purely “true” or “false” (1 or 0)
Linguistic variable - a type of variable that expresses a “fuzzy” concept; for example, “slightly dirty” or “very fast”
Universe of discourse - the set of all objects under consideration in the design of a fuzzy system
Truth value - the numerical degree (between 0 and 1, inclusive) of membership that an object has in a fuzzy subset
Fuzzification - the first step performed in doing “fuzzy computations” for an expert system, control system, etc.
Defuzzification - this is necessary if a fuzzy result must be converted to a crisp output
Quantum computer - a type of computer architecture in which the same physical hardware can be used to simultaneously compute many results as though it were parallel hardware; its operation is not based on Boolean algebra, but on the physics of subatomic particles
Moore’s Law - a prophetic observation of the fact that conventional computers would tend to grow exponentially more powerful over time as integrated circuit features got smaller and smaller
Quantum bit (qubit) - the basic unit of information in a quantum computer
Quantum interference - this phenomenon results from the superposition of multiple possible quantum states
Quantum entanglement - a state in which an atom’s properties are identically assumed by another atom, but with opposite spin
Decoherence - the tendency for interactions with the surrounding environment to disturb the state of qubits, possibly resulting in computational errors
Thirty - a quantum computer with this many qubits has been estimated to have 10 TFLOPS of computational power
Cryptography - so far, this appears to be the most likely application for supercomputers based on quantum principles
