Computer Architecture Elementary Pipelining Study
May 8, 2017 | Author: Jordan D. Ulmer | Category: N/A
Computers are essentially very fast idiots; thus, it becomes necessary to understand their idiosyncrasies and augment t...
Jordan Daniel Ulmer
Computer Org. HW#5 CH(6)
Page | 1
FIGURE CREDIT: Computer Organization And Embedded Systems, Hamacher, Vranesic, Zaky, Manjikian, 6Ed, Mgh, 2012
6.1
Time of print: 10:32 PM 10/8/2014
6.1 [M] Consider the following instructions at the given addresses in the memory:

1000  Add       R3, R2, #20
1004  Subtract  R5, R4, #3
1008  And       R6, R4, #0x3A
1012  Add       R7, R2, R4

Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.
(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8. --
(6.1.a) Flow of instructions (assuming a positive-edge clock, where all results are latched at the END of the cycle):

Cycle:                        1   2   3   4   5   6   7   8
1000 Add      R3, R2, #20     F   D   C   M   W
1004 Subtract R5, R4, #3          F   D   C   M   W
1008 And      R6, R4, #0x3A           F   D   C   M   W
1012 Add      R7, R2, R4                  F   D   C   M   W

Stage actions (F = Fetch, D = Decode, C = Compute, M = Memory, W = Write):

Add R3, R2, #20:      F: IR <- (1000);  D: RA <- [R2], RB <- [R3];  C: RZ <- [R2] + #20;  M: RY <- [RZ];  W: R3 <- [RY]
Subtract R5, R4, #3:  F: IR <- (1004);  D: RA <- [R4], RB <- [R5];  C: RZ <- [R4] - #3;   M: RY <- [RZ];  W: R5 <- [RY]
And R6, R4, #0x3A:    F: IR <- (1008);  D: RA <- [R4], RB <- [R6];  C: RZ <- [R4] AND #0x3A;  M: RY <- [RZ];  W: R6 <- [RY]
Add R7, R2, R4:       F: IR <- (1012);  D: RA <- [R2], RB <- [R4];  C: RZ <- [R2] + [R4];  M: RY <- [RZ];  W: R7 <- [RY]
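The hazard-free schedule above can be generated mechanically. Below is a minimal Python sketch (not part of the original assignment) that computes which stage each instruction occupies in each cycle, assuming one instruction is fetched per cycle and no stalls:

```python
# Ideal five-stage pipeline: instruction i (0-based) enters Fetch in
# cycle i+1 and then advances exactly one stage per cycle (no stalls).
STAGES = ["Fetch", "Decode", "Compute", "Memory", "Write"]

def schedule(instructions):
    """Map each instruction to {cycle: stage} for an ideal pipeline."""
    return {
        instr: {i + 1 + s: stage for s, stage in enumerate(STAGES)}
        for i, instr in enumerate(instructions)
    }

program = [
    "1000 Add      R3, R2, #20",
    "1004 Subtract R5, R4, #3",
    "1008 And      R6, R4, #0x3A",
    "1012 Add      R7, R2, R4",
]

for instr, cycles in schedule(program).items():
    cells = " ".join(f"{c}:{stage[0]}" for c, stage in sorted(cycles.items()))
    print(f"{instr:30s} {cells}")
```

As in the table, the fourth instruction (fetched in cycle 4) writes its result in cycle 8.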
(b) With reference to Figures 5.8 and 5.9, describe the contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8. --
(6.1.b) Contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ at the END of each cycle ("?" = value not yet defined by this program). Intermediate values: [R2]+#20 = 2020, [R4]-#3 = 47, [R4] AND #0x3A = 50, [R2]+[R4] = 2050.

Register | Cycle 2    | Cycle 3    | Cycle 4    | Cycle 5    | Cycle 6    | Cycle 7    | Cycle 8
PC       | 1008       | 1012       | 1016       | 1016       | 1016       | 1016       | 1016
IR       | Sub(1004)  | And(1008)  | Add(1012)  | Add(1012)  | Add(1012)  | Add(1012)  | Add(1012)
RA       | 2000       | 50         | 50         | 2000       | 2000       | 2000       | 2000
RB       | [R3] = ?   | [R5] = ?   | [R6] = ?   | 50         | 50         | 50         | 50
RZ       | ?          | 2020       | 47         | 50         | 2050       | 2050       | 2050
RY       | ?          | ?          | 2020       | 47         | 50         | 2050       | 2050
R2       | 2000       | 2000       | 2000       | 2000       | 2000       | 2000       | 2000
R3       | ?          | ?          | ?          | 2020       | 2020       | 2020       | 2020
R4       | 50         | 50         | 50         | 50         | 50         | 50         | 50
R5       | ?          | ?          | ?          | ?          | 47         | 47         | 47
R6       | ?          | ?          | ?          | ?          | ?          | 50         | 50
R7       | ?          | ?          | ?          | ?          | ?          | ?          | 2050
6.2
6.2 [M] Repeat Problem 6.1 for the following program:

1000  Add       R3, R2, #20
1004  Subtract  R5, R4, #3
1008  And       R6, R3, #0x3A   // Data dependency on R3!!!
1012  Add       R7, R2, R4

Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8 and that the processor uses forwarding of operands.

From Problem 6.1: initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.
(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8. --
(6.2.a) Flow of instructions (assuming a positive-edge clock, where all results are latched at the END of the cycle). With forwarding from RZ/RY, the dependency of And on R3 causes no stall:

Cycle:                        1   2   3   4   5   6   7   8
1000 Add      R3, R2, #20     F   D   C   M   W
1004 Subtract R5, R4, #3          F   D   C   M   W
1008 And      R6, R3, #0x3A           F   D   C   M   W
1012 Add      R7, R2, R4                  F   D   C   M   W

Stage actions (F = Fetch, D = Decode, C = Compute, M = Memory, W = Write):

Add R3, R2, #20:      F: IR <- (1000);  D: RA <- [R2], RB <- [R3];  C: RZ <- [R2] + #20;  M: RY <- [RZ];  W: R3 <- [RY]
Subtract R5, R4, #3:  F: IR <- (1004);  D: RA <- [R4], RB <- [R5];  C: RZ <- [R4] - #3;   M: RY <- [RZ];  W: R5 <- [RY]
And R6, R3, #0x3A:    F: IR <- (1008);  D: RA <- [RZ] (the forwarded new value of "R3" = 2020), RB <- [R6];  C: RZ <- [R3] AND #0x3A;  M: RY <- [RZ];  W: R6 <- [RY]
Add R7, R2, R4:       F: IR <- (1012);  D: RA <- [R2], RB <- [R4];  C: RZ <- [R2] + [R4];  M: RY <- [RZ];  W: R7 <- [RY]
(b) With reference to Figures 5.8 and 5.9, describe the contents of R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8. --
(6.2.b) Contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ at the END of each cycle ("?" = value not yet defined by this program). Intermediate values: [R2]+#20 = 2020, [R4]-#3 = 47, [R3] AND #0x3A = 2020 AND 0x3A = 32, [R2]+[R4] = 2050. Note that in cycle 4 the And instruction's decode picks up the forwarded new value of R3 (2020) from RZ.

Register | Cycle 2    | Cycle 3    | Cycle 4        | Cycle 5    | Cycle 6    | Cycle 7    | Cycle 8
PC       | 1008       | 1012       | 1016           | 1016       | 1016       | 1016       | 1016
IR       | Sub(1004)  | And(1008)  | Add(1012)      | Add(1012)  | Add(1012)  | Add(1012)  | Add(1012)
RA       | 2000       | 50         | "R3" = 2020    | 2000       | 2000       | 2000       | 2000
RB       | [R3] = ?   | [R5] = ?   | [R6] = ?       | 50         | 50         | 50         | 50
RZ       | ?          | 2020       | 47             | 32         | 2050       | 2050       | 2050
RY       | ?          | ?          | 2020           | 47         | 32         | 2050       | 2050
R2       | 2000       | 2000       | 2000           | 2000       | 2000       | 2000       | 2000
R3       | ?          | ?          | ?              | 2020       | 2020       | 2020       | 2020
R4       | 50         | 50         | 50             | 50         | 50         | 50         | 50
R5       | ?          | ?          | ?              | ?          | 47         | 47         | 47
R6       | ?          | ?          | ?              | ?          | ?          | 32         | 32
R7       | ?          | ?          | ?              | ?          | ?          | ?          | 2050
6.7
6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls caused by other factors.

(a) First, derive an expression for the execution time in cycles if all delay slots are filled with NOP instructions.

Branches: 20% of instructions; delay slots: 100% NOPs. For N dynamic instructions, each branch contributes one wasted NOP cycle:

T_NOP = N (1 + 0.2 x 1) = 1.2 N cycles

(b) Then, derive another expression that reflects the execution time with 70 percent of delay slots filled with useful instructions by the optimizing compiler.

Branches: 20%; delay slots: 30% NOPs, 70% useful operations. Only the unfilled slots waste a cycle:

T_opt = N (1 + 0.2 x 0.3) = 1.06 N cycles

(c) From these expressions, determine the compiler's contribution to the increase in performance, expressed as a speedup percentage.

Speedup = T_NOP / T_opt = 1.2 / 1.06 ≈ 1.132

The optimizing compiler with a 1-branch-delay processor makes operations about 13.2% faster than the same processor without an optimizing compiler.
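The two expressions are easy to check numerically. A small sketch under the problem's stated assumptions (20% branches, one delay slot, N large enough that pipeline fill time is negligible):

```python
def exec_time(n, branch_frac=0.20, slots=1, useful_fill=0.0):
    """Cycles to run n dynamic instructions with delayed branching.

    Every delay slot NOT filled with a useful instruction costs one
    wasted (NOP) cycle per branch.
    """
    wasted = branch_frac * slots * (1.0 - useful_fill)
    return n * (1.0 + wasted)

n = 1_000_000
t_nop    = exec_time(n, useful_fill=0.0)  # all slots NOPs  -> 1.20 n
t_useful = exec_time(n, useful_fill=0.7)  # 70% filled      -> 1.06 n
print(t_nop / n, t_useful / n, t_nop / t_useful)  # speedup ~ 1.132
```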
6.8
6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots. The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction 70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time. Compare the compiler-optimized execution time for this case with the compiler-optimized execution time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which processor/compiler combination is faster, and determine the speedup percentage by which it is faster.

(a) The execution time in cycles if all delay slots are filled with NOP instructions:

T_NOP = N (1 + 0.2 x 2) = 1.4 N cycles

(b) The execution time in cycles if the delay slots are filled as described above. Branches: 20% of operations; slot #1: 30% NOP / 70% useful; slot #2: 90% NOP / 10% useful:

T_opt = N (1 + 0.2 x (0.3 + 0.9)) = 1.24 N cycles
(c) From these expressions, determine the compiler's contribution to the increase in performance, expressed as a speedup percentage.

Speedup = T_NOP / T_opt = 1.4 / 1.24 ≈ 1.129

An optimizing compiler for a 2-branch-delay processor makes executions about 12.9% faster than the same processor without an optimizing compiler. Comparing the two compiler-optimized combinations at the same clock rate, the 1-delay-slot processor/compiler of Problem 6.7 is best: 1.24 N / 1.06 N ≈ 1.17, so it is roughly 17% faster.
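Checking the two-slot arithmetic numerically (assumptions as stated in the problem; the 1.06 N figure is carried over from Problem 6.7):

```python
def exec_time_2slots(n, branch_frac=0.20, fill1=0.0, fill2=0.0):
    """Cycles for n instructions with two branch delay slots; each
    slot not filled with a useful instruction wastes one cycle per branch."""
    wasted = branch_frac * ((1.0 - fill1) + (1.0 - fill2))
    return n * (1.0 + wasted)

n = 1_000_000
t_nop = exec_time_2slots(n)                        # 1.40 n
t_opt = exec_time_2slots(n, fill1=0.7, fill2=0.1)  # 1.24 n
t_6_7 = n * (1.0 + 0.20 * 0.3)                     # 1.06 n (Problem 6.7)
print(t_nop / t_opt)  # compiler speedup on this processor, ~1.129
print(t_opt / t_6_7)  # 1-slot combination is faster, ~1.17
```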
6.14

[Figure 6.13, annotated with Jordan's assumptions in yellow: every pipeline stage takes 1 cycle, except the memory-access stage, which takes 2 cycles.]
6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar processor shown in Figure 6.13.

(a) What is the best execution time in cycles that can be expected if the mix of instructions consists of 75 percent arithmetic instructions and 25 percent memory-access instructions?

Given the assumptions marked on Figure 6.13 above, the arithmetic path handles its instructions at one per cycle, while the load/store path needs two cycles per instruction. Because the instruction mix is significantly unequal, we can safely assume that the load/store stream (1/4 of total instructions, the minority) finishes before the arithmetic stream (3/4 of total instructions, the majority).
T_arithmetic = 0.75 N x 1 cycle = 0.75 N cycles
T_load/store = 0.25 N x 2 cycles = 0.50 N cycles

Thus our limiting "bottleneck" is the arithmetic sector of the pipeline:

T_superscalar ≈ max(0.75 N, 0.50 N) = 0.75 N cycles
(b) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

Given the assumptions marked on Figure 6.2 above, the simple pipeline is limited by memory access, since every instruction flows through the single pipeline serially:

T_simple = 0.75 N x 1 + 0.25 N x 2 = 1.25 N cycles
Thus our comparison becomes: T_simple = 1.25 N vs. T_superscalar = 0.75 N. Because both processors are dealing with the same set of instructions and the same clock, the comparison simplifies to the ratio of cycle counts. So we save half a cycle per instruction by re-routing memory access to a separate pipeline; we could have saved even more cycles with a more balanced set of instructions (i.e. arithmetic 50%, memory access 50%). Furthermore, looking at our "speed-up ratio":

Speedup = 1.25 N / 0.75 N ≈ 1.67

And so we find that our double-path superscalar processor is about 67% faster than a single-path pipelined processor.
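The cycle counts above can be sketched numerically (assuming, as marked on the figures, 1 cycle per stage except 2 cycles for memory access):

```python
def superscalar_time(n, arith=0.75, mem=0.25, mem_cycles=2):
    """Two parallel execution paths; the busier path sets the time."""
    return max(n * arith * 1, n * mem * mem_cycles)

def simple_time(n, arith=0.75, mem=0.25, mem_cycles=2):
    """One pipeline; every instruction passes through it serially."""
    return n * arith * 1 + n * mem * mem_cycles

n = 1200
t_ss, t_sp = superscalar_time(n), simple_time(n)
print(t_ss, t_sp, round(t_sp / t_ss, 2))  # 900.0 1500.0 1.67
```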
6.15

HAS BRANCHES. 6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2 and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never taken, 65 percent arithmetic instructions, and 20 percent memory-access instructions. Assume a prediction accuracy of 100 percent for all branch instructions (the branch delay is minimized).

(a) What is the best execution time in cycles that can be expected for this mix: arithmetic (65%) + memory access (20%) + branch never taken (15%)?

Making the same assumptions as in Problem 6.14, marked on Figure 6.13 above: because the number of arithmetic instructions outweighs the number of memory accesses, the "bottleneck constraint" again originates from the arithmetic sector of the pipeline:

T_arithmetic = 0.65 N x 1 = 0.65 N cycles
T_load/store = 0.20 N x 2 = 0.40 N cycles

We now have the additional possibility of a branch delay; HOWEVER, THIS IS TAKEN CARE OF by the fetch sector, as a precursor to the execution stage, so branches consume no execution-pipeline slots:

T_superscalar ≈ 0.65 N cycles

So, in this scenario, our best execution time will be the SAME regardless of branching operations.
(b) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

Making the same assumptions as in Problem 6.14, marked on Figure 6.2 above, we now have the additional possibility of a single-cycle branch delay on every branch:

T_simple = 0.65 N x 1 + 0.20 N x 2 + 0.15 N x (1 + 1) = 1.35 N cycles

Thus our comparison becomes: T_simple = 1.35 N vs. T_superscalar = 0.65 N. Because both processors are dealing with the same set of instructions and the same clock, the comparison simplifies to the ratio of cycle counts. The simple pipeline takes longer to reconcile the branch delays, but the superscalar pipeline is essentially unaffected by branching considerations (because this is taken care of in the fetch stage). Furthermore, looking at our "speed-up ratio":

Speedup = 1.35 N / 0.65 N ≈ 2.08
And so we find that our double-path superscalar processor is roughly twice as fast (about 108% faster) than a single-path pipelined processor when we also take branching effects into consideration.
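As before, the comparison can be sketched numerically (assumed costs: 1 cycle per stage, 2 cycles for memory access, and in the simple pipeline 2 cycles per branch including its delay slot; branches are free in the superscalar fetch unit):

```python
def superscalar_time(n, arith=0.65, mem=0.20, mem_cycles=2):
    # Branches are resolved in the fetch unit (100% prediction accuracy),
    # so they occupy no execution-pipeline slots; the busier path wins.
    return max(n * arith * 1, n * mem * mem_cycles)

def simple_time(n, arith=0.65, mem=0.20, branch=0.15,
                mem_cycles=2, branch_cycles=2):
    # Single pipeline: each branch costs its own cycle plus one
    # delay-slot cycle (branch_cycles = 2).
    return n * arith * 1 + n * mem * mem_cycles + n * branch * branch_cycles

n = 2000
t_ss, t_sp = superscalar_time(n), simple_time(n)
print(t_ss, t_sp, round(t_sp / t_ss, 2))
```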
Supplementary (a, b, c)
KEY for the tables of operations: Fetch = F; Decode = D; Execute (using processor hardware) = E; Execute with register renaming (not using processor hardware) = E (distinguished by color in the original); Write Back = W; Waiting on = X. NOTE: This method of display was developed in collaboration with Nathan Genetzky.
(Supplementary.A) Completed the preceding table of operations. NOTE: horizontal = temporal axis.

[The original cycle-by-cycle occupancy grid did not survive extraction; the per-instruction Fetch/Decode/Execute/Writeback cycle numbers for this case are tabulated under "(A.) OUT OF ORDER EXECUTION" at the end of this section.]

OOPS, ERROR: instruction 8 cannot use the ALU at the same time as instruction 3.
Jordan Daniel Ulmer
Computer Org. HW#5 CH(6)
Page | 18
FIGURE CREDIT: Computer Organization And Embedded Systems, Hamacher, Vranesic, Zaky, Manjikian, 6Ed, Mgh, 2012
(B.2) What is the savings from using out-of-order capability? We save SEVEN cycles, performing 11 operations in 17 cycles (using out-of-order capability) instead of 24 cycles (not using out-of-order capability). THIS IS REALLY SIGNIFICANT!
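As a quick arithmetic check of the claimed savings (cycle counts taken from the tables in this section):

```python
ops = 11                     # instructions in the trace
cycles_out_of_order = 17     # completion time with out-of-order issue
cycles_in_order = 24         # completion time without out-of-order issue

saved = cycles_in_order - cycles_out_of_order    # 7 cycles saved
speedup = cycles_in_order / cycles_out_of_order  # ~1.41x
print(saved, round(speedup, 2),
      round(ops / cycles_out_of_order, 2),  # instructions/cycle with OoO
      round(ops / cycles_in_order, 2))      # instructions/cycle without
```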
(Supplementary.B) Completed the preceding table of operations: NO out-of-order capability, NO register renaming. NOTE: horizontal = temporal axis.

[The original cycle-by-cycle occupancy grid did not survive extraction; the per-instruction Fetch/Decode/Execute/Writeback cycle numbers for this case are tabulated under "(B.) NO OUT OF ORDER OPERATIONS" at the end of this section.]
(Supplementary.C) Re-completed the table of operations: OUT-OF-ORDER capability. NOTE: horizontal = temporal axis.

[The original cycle-by-cycle occupancy grid did not survive extraction; the per-instruction Fetch/Decode/Execute/Writeback cycle numbers for this case are tabulated under "(C.) SUPER SCALAR" at the end of this section.]
(A.) OUT OF ORDER EXECUTION

Instruction            Fetch  Decode  Execute  Writeback
0  ADD  R3, R1, R2     0      1       2        3
1  LOAD R6, [R3]       1      2       4        9
2  AND  R7, R5, 3      2      3       5        7
3  ADD  R1, R6, R0     3      4       10       11
4  SRL  R7, R0, 8      4      5       6        7
5  OR   R2, R4, R7     5      6       8        9
6  SUB  R5, R3, R4     6      7       9        10
7  ADD  R0, R1, R10    7      8       12       13
8  LOAD R6, [R5]       8      9       10       15
9  SUB  R2, R1, R6     9      10      16       17
10 AND  R3, R7, 15     10     11      13       14

(C.) SUPER SCALAR (2 instructions each stage) WITH OUT OF ORDER CAPABILITIES

Instruction            Fetch  Decode  Execute  Writeback
0  ADD  R3, R1, R2     0      1       2        3
1  LOAD R6, [R3]       0      1       4        9
2  AND  R7, R5, 3      1      2       3        4
3  ADD  R1, R6, R0     1      2       10       11
4  SRL  R7, R0, 8      2      3       4        5
5  OR   R2, R4, R7     2      3       6        7
6  SUB  R5, R3, R4     3      4       5        6
7  ADD  R0, R1, R10    3      4       12       13
8  LOAD R6, [R5]       4      5       6        11
9  SUB  R2, R1, R6     4      5       12       13
10 AND  R3, R7, 15     5      6       7        8

(B.) NO OUT OF ORDER OPERATIONS

Instruction            Fetch  Decode  Execute  Writeback
0  ADD  R3, R1, R2     0      1       2        3
1  LOAD R6, [R3]       1      2       4        9
2  AND  R7, R5, 3      2      3       9        10
3  ADD  R1, R6, R0     3      4       10       11
4  SRL  R7, R0, 8      4      5       11       12
5  OR   R2, R4, R7     5      6       13       14
6  SUB  R5, R3, R4     6      7       14       15
7  ADD  R0, R1, R10    7      8       15       16
8  LOAD R6, [R5]       8      9       16       21
9  SUB  R2, R1, R6     9      10      22       23
10 AND  R3, R7, 15     10     11      23       24
OOPS, ERROR: instruction 8 cannot use the ALU at the same time as instruction 3.