Computer Architecture
CPU execution time for a program (×) =
(CPU clock cycles for a program) × (Clock cycle time)
CPU execution time for a program (÷) =
(CPU clock cycles for a program) ÷ (Clock rate)
CPI =
(CPU clock cycles) ÷ (Instruction count)
Number of Cycles =
(Cycles per second) × seconds run
CPI =
(Execution time × clock rate) ÷ (instruction count)
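The formulas above fit together in a few lines of Python. This is a minimal sketch: the function names and the example numbers (2 billion instructions, CPI of 1.5, a 3 GHz clock) are hypothetical and chosen only to illustrate the arithmetic.

```python
def cpu_clock_cycles(instruction_count: int, cpi: float) -> float:
    """CPU clock cycles = instruction count × CPI."""
    return instruction_count * cpi


def cpu_time_seconds(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = (instruction count × CPI) ÷ clock rate."""
    return cpu_clock_cycles(instruction_count, cpi) / clock_rate_hz


def cpi_from_measurements(exec_time_s: float, clock_rate_hz: float, instruction_count: int) -> float:
    """CPI = (execution time × clock rate) ÷ instruction count."""
    return exec_time_s * clock_rate_hz / instruction_count


# Hypothetical program: 2 billion instructions, CPI 1.5, 3 GHz clock.
ic, cpi, rate = 2_000_000_000, 1.5, 3e9
t = cpu_time_seconds(ic, cpi, rate)
print(t)                                   # 1.0 (seconds)
print(cpi_from_measurements(t, rate, ic))  # 1.5, recovering the CPI
```

Dividing by the clock rate and multiplying by the clock cycle time are interchangeable, since one is the inverse of the other.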
Of a doubleword's 64 bits, what is the leftmost bit numbered?
63. The numbering of the rightmost bit starts with 0, so the leftmost bit is numbered with 63 rather than 64.
In the LEGv8 architecture, each register is _____ bits wide.
64
Workload
A set of programs run on a computer that is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix. A typical workload specifies both the programs and the relative frequencies.
Control signal
A signal used for multiplexor selection or for directing the operation of a functional unit; contrasts with a data signal, which contains information that is operated on by a functional unit.
Two's Complement
A signed number representation where a leading 0 indicates a nonnegative number and a leading 1 indicates a negative number. The two's complement (negation) of a value is obtained by complementing each bit (0 → 1 or 1 → 0) and then adding one to the result.
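A minimal Python sketch of 64-bit two's-complement negation (invert every bit, then add one). Python integers are unbounded, so the sketch masks results to 64 bits to model a LEGv8 register; the helper names are hypothetical.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1  # 64 one-bits


def negate(bits: int) -> int:
    """Two's-complement negation: invert every bit, then add one (mod 2**64)."""
    return ((bits ^ MASK) + 1) & MASK


def to_signed(bits: int) -> int:
    """Interpret a 64-bit pattern as signed: a leading 1 means negative."""
    return bits - (1 << WIDTH) if bits >> (WIDTH - 1) else bits


def from_signed(value: int) -> int:
    """Encode a Python int as a 64-bit two's-complement pattern."""
    return value & MASK


minus_two = negate(from_signed(2))
print(hex(minus_two))        # 0xfffffffffffffffe
print(to_signed(minus_two))  # -2
```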
Overflow (floating-point)
A situation in which a positive exponent becomes too large to fit in the exponent field.
Load-use data hazard
A specific form of data hazard in which the data being loaded by a load instruction has not yet become available when it is needed by another instruction.
Register file
A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.
Branch address table
A table of addresses of alternative instruction sequences.
Address
A value used to delineate the location of a specific data element within a memory array.
Doubleword
Another natural unit of access in a computer, usually a group of 64 bits; corresponds to the size of a register in the LEGv8 architecture.
CPU execution time (CPU time)
The actual time the CPU spends computing for a specific task.
Branch target address
The address specified in a branch, which becomes the new program counter (PC) if the branch is taken. In the MIPS architecture the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch; in LEGv8 the word offset is shifted left 2 bits and added to the address of the branch instruction itself.
CPI (or Cycles per instruction)
The average number of clock cycles per instruction for a program or program fragment. It is the multiplicative inverse of instructions per cycle.
T-F: A clock rate of 1 GHz corresponds to a period of 1 nanosecond, which is 1×10⁻⁹ seconds.
True. The clock period is the inverse of the clock rate, so 1 / (10⁹) = 1×10⁻⁹ seconds, which is 1 nanosecond.
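Because period and rate are inverses, converting between them is a single division. A minimal sketch with hypothetical helper names:

```python
def period_from_rate(rate_hz: float) -> float:
    """Clock period (seconds) is the inverse of the clock rate (Hz)."""
    return 1.0 / rate_hz


def rate_from_period(period_s: float) -> float:
    """Clock rate (Hz) is the inverse of the clock period (seconds)."""
    return 1.0 / period_s


print(period_from_rate(1e9))      # 1e-09, i.e. 1 nanosecond for a 1 GHz clock
print(rate_from_period(250e-12))  # approximately 4e9: a 250 ps period is a 4 GHz clock
```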
Opcode
The field that denotes the operation and format of an instruction.
Rn
The first register source operand
Clock rate
The frequency at which a chip such as a central processing unit (CPU), or one core of a multi-core processor, is running; it is used as an indicator of the processor's speed. It is the inverse of the clock period.
Stored-program concept
The idea that instructions and data of many types can be stored in memory as numbers and thus be easy to change, leading to the stored-program computer.
Little Endian
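The least significant byte of the data is placed at the byte with the lowest address. The remaining bytes follow at successively higher addresses in order of increasing significance (the next three bytes for a 32-bit word, the next seven for a 64-bit doubleword).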
most significant bit
The leftmost bit in a LEGv8 doubleword
Clock period
The length of each clock cycle.
Spatial Locality
The locality principle stating that if a data location is referenced, data locations with nearby addresses will tend to be referenced soon.
Big Endian
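The most significant byte of the data is placed at the byte with the lowest address. The remaining bytes follow at successively higher addresses in order of decreasing significance (the next three bytes for a 32-bit word, the next seven for a 64-bit doubleword).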
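Python's built-in `int.to_bytes` can show both byte orders for the same value. In this sketch the doubleword value 0x0102030405060708 is hypothetical, and index 0 of each byte string stands for the lowest memory address.

```python
value = 0x0102030405060708  # a hypothetical 64-bit doubleword

little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")

# Index 0 stands for the lowest address.
print(little.hex())  # 0807060504030201: least significant byte first
print(big.hex())     # 0102030405060708: most significant byte first
```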
Instruction count
The number of instructions executed by the program.
Latency (pipeline)
The number of stages in a pipeline or the number of stages between two instructions during execution.
Temporal Locality
The principle stating that if a data location is referenced then it will tend to be referenced again soon.
Program counter (PC)
The register containing the address of the instruction in the program being executed
Rd
The register destination operand. It gets the result of the operation.
Least Significant Bit
The rightmost bit in a LEGv8 doubleword.
Rm
The second register source operand
Clock cycle (tick, clock tick, clock period, clock, or cycle)
The time for one clock period, usually of the processor clock, which runs at a constant rate.
Response time (execution time)
The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.
Sign-extend
To increase the size of a data item by replicating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.
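A minimal sketch of sign extension, widening a narrow field (for example a 9-bit offset) to 64 bits by copying its sign bit into every new high-order bit. The function name and example values are hypothetical.

```python
def sign_extend(value: int, from_bits: int, to_bits: int = 64) -> int:
    """Replicate bit (from_bits - 1) into the high-order bits of a to_bits-wide result."""
    value &= (1 << from_bits) - 1          # keep only the original field
    sign_bit = 1 << (from_bits - 1)
    if value & sign_bit:                   # negative: fill the new high bits with 1s
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value                           # nonnegative: new high bits stay 0


# A 9-bit offset of -8 is 0b111111000; +8 is 0b000001000.
print(hex(sign_extend(0b111111000, 9)))  # 0xfffffffffffffff8
print(hex(sign_extend(0b000001000, 9)))  # 0x8
```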
WB
Write back
Does every byte in memory have a unique address?
Yes. Thus, each byte is addressable. However, programmers usually access memory by words and doublewords.
CPU time (÷) =
[(Instruction count) × (CPI)] ÷ Clock rate
ALUOp
a 2-bit control field that indicates whether the operation to be performed should be add (00) for loads and stores, pass input b (01) for CBZ, or be determined by the operation encoded in the opcode field (10).
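A sketch of how the ALU control could combine ALUOp with the opcode. It assumes the 4-bit ALU control encodings and R-type opcodes from the Patterson and Hennessy LEGv8 datapath (0010 add, 0110 subtract, 0000 AND, 0001 OR, 0111 pass input b); only four R-type instructions are covered, and the names are hypothetical.

```python
# 4-bit ALU control encodings (assumed, following the textbook LEGv8 datapath).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_PASS_B = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# 11-bit opcodes of a few R-type instructions and the ALU function each needs.
R_TYPE_FUNCTION = {
    0b10001011000: ALU_ADD,  # ADD
    0b11001011000: ALU_SUB,  # SUB
    0b10001010000: ALU_AND,  # AND
    0b10101010000: ALU_OR,   # ORR
}


def alu_control(alu_op: int, opcode: int) -> int:
    """Map the 2-bit ALUOp (plus the opcode when ALUOp is 10) to 4 ALU control bits."""
    if alu_op == 0b00:  # LDUR/STUR: add to form the memory address
        return ALU_ADD
    if alu_op == 0b01:  # CBZ: pass input b so the Zero output can be tested
        return ALU_PASS_B
    if alu_op == 0b10:  # R-type: the opcode field decides
        return R_TYPE_FUNCTION[opcode]
    raise ValueError(f"unsupported ALUOp {alu_op:02b}")


print(f"{alu_control(0b10, 0b11001011000):04b}")  # 0110 (subtract, for SUB)
```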
Multiplexor
a device that selects one of several input signals, based on the value of its control (select) input, and passes it to its single output
base register
a register that holds an array's base address
sign and magnitude representation
a signed number representation where a single bit is used to represent the sign, and the remaining bits represent the magnitude.
datapath element
a unit used to operate on or hold data within a processor. In the LEGv8 implementation, the datapath elements include the instruction and data memories, the register file, the ALU, and adders.
The PC is written once at the ( beginning | end ) of every clock cycle
end
(instructions per cycle) × (cycles per second) =
number of instructions executed per second
Number of Instructions =
number of cycles ÷ CPI
Dynamic branch prediction
prediction of branches at runtime using runtime information
shamt
shift amount
In a single-cycle implementation, the cycle length must be long enough to support the __________ instruction (probably load).
slowest
base address
the starting address of an array in memory
Flush
to discard instructions in a pipeline, usually due to an unexpected event
In a CB instruction, the offset is in ________, not ___________.
words, bytes. - Address = PC + (sign-extended offset × 4)
In a D instruction, the offset is in ________, not ___________.
bytes, words. - Address = (base register Rn) + sign-extended offset (the offset is not multiplied by 4)
In a B instruction, the offset is in _________, not _________.
words, bytes. - Address = PC + (sign-extended offset × 4)
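A minimal sketch of the three address calculations: CB and B offsets count words, so they are multiplied by 4, while the D-format address is computed by the ALU as base register plus the sign-extended byte offset. Field widths (19, 26, and 9 bits) come from the formats in this set; the function names and example values are hypothetical.

```python
MASK64 = (1 << 64) - 1  # results wrap like a 64-bit register


def signed_field(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def cb_target(pc: int, cond_br_address: int) -> int:
    """CBZ/CBNZ: 19-bit word offset, multiplied by 4 to get bytes."""
    return (pc + signed_field(cond_br_address, 19) * 4) & MASK64


def b_target(pc: int, br_address: int) -> int:
    """B: 26-bit word offset, multiplied by 4 to get bytes."""
    return (pc + signed_field(br_address, 26) * 4) & MASK64


def d_address(base: int, dt_address: int) -> int:
    """LDUR/STUR: base register value plus 9-bit byte offset (not scaled)."""
    return (base + signed_field(dt_address, 9)) & MASK64


print(cb_target(1000, (1 << 19) - 3))  # offset -3 words -> 988
print(b_target(1000, 25))              # offset +25 words -> 1100
print(d_address(2000, 8))              # offset +8 bytes -> 2008
```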
R-type instruction format
| 11-bit opcode | 5-bit Rm | 6-bit shamt | 5-bit Rn | 5-bit Rd | All operands are registers. Instructions: AND, ADD, ORR, SUB, LSL, ASR, EOR, etc.
D-type instruction format
| 11-bit opcode | 9-bit address | 2-bit op2 | 5-bit Rn | 5-bit Rt | Used by the data transfer instructions (loads and stores). Instructions: STUR, LDUR
B-type instruction format
| 6-bit opcode | 26-bit destination address | Instructions: B
I-type instruction format
| 10-bit opcode | 12-bit immediate | 5-bit Rn | 5-bit Rd | One source operand is a register (Rn) and the other is a constant (the immediate); the result is written to Rd. Instructions: ADDI, SUBI
CB-type instruction format
| 8-bit opcode | 19-bit destination address | 5-bit Rt | Instructions: CBZ, CBNZ
IM-type instruction format
| 9-bit opcode | 2-bit quad | 16-bit immediate | 5-bit Rd | Instructions: MOVZ, MOVK
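The R-format field widths (11 + 5 + 6 + 5 + 5 = 32 bits) fix each field's bit position, so packing and unpacking is shifting and masking. This sketch assumes ADD's 11-bit opcode is 10001011000 (1112), as in the textbook's LEGv8 encoding table; the function names are hypothetical.

```python
def encode_r(opcode: int, rm: int, shamt: int, rn: int, rd: int) -> int:
    """Pack R-format fields: opcode[31:21] Rm[20:16] shamt[15:10] Rn[9:5] Rd[4:0]."""
    return (opcode << 21) | (rm << 16) | (shamt << 10) | (rn << 5) | rd


def decode_r(word: int) -> dict:
    """Unpack a 32-bit R-format instruction into its fields."""
    return {
        "opcode": (word >> 21) & 0x7FF,  # 11 bits
        "Rm": (word >> 16) & 0x1F,       # 5 bits
        "shamt": (word >> 10) & 0x3F,    # 6 bits
        "Rn": (word >> 5) & 0x1F,        # 5 bits
        "Rd": word & 0x1F,               # 5 bits
    }


# ADD X9, X20, X21: Rd = 9, Rn = 20, Rm = 21, no shift.
word = encode_r(0b10001011000, rm=21, shamt=0, rn=20, rd=9)
print(hex(word))       # 0x8b150289
print(decode_r(word))  # {'opcode': 1112, 'Rm': 21, 'shamt': 0, 'Rn': 20, 'Rd': 9}
```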
Overall effective CPI =
[Σᵢ₌₁ⁿ (CPIᵢ × ICᵢ)] ÷ (Instruction count), where CPIᵢ and ICᵢ are the CPI and instruction count of instruction class i
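The weighted sum gives total clock cycles; dividing by the total instruction count gives the effective CPI. A minimal sketch with a hypothetical three-class instruction mix:

```python
def effective_cpi(mix: dict[str, tuple[float, int]]) -> float:
    """mix maps class -> (CPI_i, IC_i); returns sum(CPI_i * IC_i) / sum(IC_i)."""
    total_cycles = sum(cpi * count for cpi, count in mix.values())
    total_instructions = sum(count for _, count in mix.values())
    return total_cycles / total_instructions


# Hypothetical mix: classes A, B, C with CPI 1, 2, 3 and counts 2, 1, 2.
mix = {"A": (1, 2), "B": (2, 1), "C": (3, 2)}
print(effective_cpi(mix))  # (2 + 2 + 6) / 5 = 2.0
```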
MEM
Data memory access
Program counter =
Register + Branch offset (for PC-relative branches, the register is the PC itself)
Combinational element
An operational element, such as an AND gate or an ALU.
Execution time =
(Instruction count x CPI) / (Clock Rate)
CPU time (×) =
(Instruction count) × (CPI) × (Clock cycle time)
CPU clock cycles =
(Instructions for a program) × (Average clock cycles per instruction)
CPU time =
(Number of clock cycles) ÷ (clock rate)
3 types of hazards
-Structural -Data -Control
The two units needed to implement R-format ALU operations:
-register file -ALU
The four units needed to implement loads and stores:
-register file -ALU -data memory unit -sign extension unit (Wires/muxes are also needed)
The five units needed for a branch:
-register file -ALU -sign extension unit -shift-left-2 unit -adder (computes the branch target address by adding the PC to the sign-extended offset shifted left by 2) (Wires/muxes are also needed)
IM
-represents the instruction memory and the PC in the instruction fetch stage
REG
-stands for the register file and sign extender in the instruction decode/register file read stage (ID)
1 nanosecond (10⁻⁹) clock cycle => (in clock rate units)
1 GHz (10⁹) clock rate
The control unit's Branch output will be 1 for a compare and branch on zero instruction. However, the branch's target address is only loaded into the PC if the ALU's Zero output is _____. Otherwise, PC is loaded with PC + 4.
1. CBZ branches when the register equals 0. In that case the ALU's Zero output is 1, so the branch is taken.
The input to the control unit is the __________________ field from the instruction
6- to 11-bit opcode
five steps to execute a load instruction:
1. An instruction is fetched from the instruction memory, and the PC is incremented. 2. The value of a register (X2) is read from the register file. 3. The ALU computes the sum of the value read from the register file and the sign-extended 9-bit offset from the instruction. 4. The sum from the ALU is used as the address for the data memory. 5. The data from the memory unit is written into the register file (X1).
four steps to execute a CB-type instruction:
1. An instruction is fetched from the instruction memory, and the PC is incremented. 2. The register X1 is read from the register file using bits 4:0 of the instruction (Rt). 3. The ALU passes through the data value read from the register file. Meanwhile, the 19-bit offset from the instruction is sign-extended and shifted left by two, then added to the PC; the result is the branch target address. 4. The Zero output from the ALU is used to decide which adder result to store in the PC.
Typical 5 steps taken by LEGv8 instructions:
1. Fetch instruction from memory. 2. Read registers and decode the instruction. 3. Execute the operation or calculate an address. 4. Access an operand in data memory (if necessary). 5. Write the result into a register (if necessary).
four steps to execute an R-type instruction:
1. The instruction is fetched, and the PC is incremented. 2. Two registers, X2 and X3, are read from the register file; also, the main control unit computes the setting of the control lines during this step. 3. The ALU operates on the data read from the register file, using portions of the opcode to generate the ALU function. 4. The result from the ALU is written into the destination register (X1) in the register file.
For all R-type instructions, ALUOp is _______
10
Data hazards occur when (specific examples) ____
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
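The four conditions above can be sketched as a forwarding-unit decision for one ALU source register. Two extra checks are assumptions beyond the listed conditions: the earlier instruction must actually write a register (RegWrite), and writes to register 31 (XZR, the LEGv8 zero register) are never forwarded. EX/MEM is checked first so the most recent result wins; the class and function names are hypothetical.

```python
from dataclasses import dataclass

XZR = 31  # assumed: LEGv8 zero register, whose "result" is never forwarded


@dataclass
class WriteStage:
    reg_write: bool  # RegWrite control bit held in the pipeline register
    rd: int          # destination register number (RegisterRd)


def forward_select(src: int, ex_mem: WriteStage, mem_wb: WriteStage) -> str:
    """Choose where an ALU input comes from; src is ID/EX.RegisterRs or ID/EX.RegisterRt."""
    if ex_mem.reg_write and ex_mem.rd != XZR and ex_mem.rd == src:
        return "EX/MEM"         # conditions 1a/1b: newest result takes priority
    if mem_wb.reg_write and mem_wb.rd != XZR and mem_wb.rd == src:
        return "MEM/WB"         # conditions 2a/2b
    return "register file"      # no hazard: use the value read during ID


print(forward_select(2, WriteStage(True, 2), WriteStage(True, 2)))  # EX/MEM
```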
The control unit sends _____ bits to the ALU control.
2
Each doubleword consists of _____ bytes.
8. Each LEGv8 doubleword has 64 bits, which is 8 bytes (a byte is 8 bits).
Branch not taken
A branch where the branch condition is false and the program counter (PC) becomes the address of the instruction that sequentially follows the branch.
Branch taken
A branch where the branch condition is satisfied and the program counter (PC) becomes the branch target. All unconditional jumps are taken branches.
Edge-triggered clocking
A clocking scheme in which all state changes occur on a clock edge.
Data transfer instruction
A command that moves data between memory and registers.
Double precision
A floating-point value represented in a 64-bit doubleword.
Single precision
A floating-point value represented in a single 32-bit word.
Instruction format
A form of representation of an instruction composed of fields of binary numbers.
Instruction mix
A measure of the dynamic frequency of instructions across one or many programs.
State element
A memory element, such as a register or a memory.
Branch prediction
A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
Word
A natural unit of access in a computer, usually a group of 32 bits.
Normalized
A number in floating-point notation that has no leading 0s.
Benchmark
A program selected for use in comparing computer performance
Amdahl's Law
A rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.
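Amdahl's Law is commonly written as: execution time after improvement = (execution time affected by the improvement ÷ amount of improvement) + execution time unaffected. A minimal sketch; the program times below are hypothetical.

```python
def time_after_improvement(total_s: float, affected_s: float, improvement: float) -> float:
    """Amdahl's Law: affected time / improvement + unaffected time."""
    return affected_s / improvement + (total_s - affected_s)


# Hypothetical program: 100 s total, 80 s of which the improvement speeds up.
print(time_after_improvement(100, 80, 4))     # 40.0
print(time_after_improvement(100, 80, 1e12))  # just over 20: the unaffected 20 s remains
```

However large the improvement, the result never drops below the 20 s that the improvement does not touch, which is the diminishing-returns limit the definition describes.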
Basic block
A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).
When MemToReg is 0, the data appearing at the register file's data input comes from the _____.
ALU's output. The mux controlled by MemToReg passes the ALU's output when MemToReg is 0; when it is 1, the data memory's output is passed instead.
Throughput
Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time
Control hazard
Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.
Branch prediction buffer
Also called branch history table. A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.
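A sketch of a 1-bit branch prediction buffer: one bit per entry, indexed by low-order bits of the branch address. The table size (1024 entries), dropping the two low address bits (instructions are 4 bytes), and the initial "not taken" prediction are assumptions; branches whose addresses share those index bits share an entry.

```python
class BranchPredictionBuffer:
    """1-bit branch history table indexed by low-order bits of the branch address."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.taken = [False] * (1 << index_bits)  # assumed start: predict not taken

    def _index(self, branch_pc: int) -> int:
        # Instructions are 4 bytes, so skip the two low address bits (assumption).
        return (branch_pc >> 2) & self.mask

    def predict(self, branch_pc: int) -> bool:
        return self.taken[self._index(branch_pc)]

    def update(self, branch_pc: int, was_taken: bool) -> None:
        self.taken[self._index(branch_pc)] = was_taken


# A hypothetical loop branch at 0x4000: taken 9 times, then falls through once.
bpb = BranchPredictionBuffer()
mispredictions = 0
for taken in [True] * 9 + [False]:
    if bpb.predict(0x4000) != taken:
        mispredictions += 1
    bpb.update(0x4000, taken)
print(mispredictions)  # 2: the first iteration and the loop exit
```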
Pipeline stall
Also called bubble. A stall initiated in order to resolve a hazard.
Forwarding
Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.
Exception
Also called interrupt. An unscheduled event that disrupts program execution; used to detect overflow
PC-relative addressing
An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.
Von Neumann Architecture
An architecture for an electronic digital computer with these components: -A processing unit that contains an arithmetic logic unit and processor registers -A control unit that contains an instruction register and program counter -Memory that stores data and instructions -External mass storage -Input and output mechanisms
Harvard Architecture
An architecture with physically separate storage and signal pathways for instructions and data. These early machines had data storage entirely contained within the central processing unit, and provided no access to the instruction storage as data.
Don't-care term
An element of a logical function in which the output does not depend on the values of all the inputs. Don't care terms may be specified in different ways.
Interrupt
An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)
Pipelining
An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line. It increases the number of simultaneously executing instructions and the rate at which instructions are started and completed
Clock cycles per instruction (CPI) =
Average number of clock cycles per instruction for a program or program fragment.
Bias for an 11-bit exponent (IEEE 754)
1023
Bias for an 8-bit exponent (IEEE 754)
127
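The biases can be checked with Python's standard `struct` module, which exposes a float's raw bits. This sketch extracts the stored exponent field and subtracts the bias; it is meant for normalized values only (zero, denormals, infinities, and NaNs use the reserved exponent values), and the function names are hypothetical.

```python
import struct


def double_exponent(x: float) -> tuple[int, int]:
    """Return (stored biased exponent, true exponent) for a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    stored = (bits >> 52) & 0x7FF  # 11-bit exponent field
    return stored, stored - 1023   # bias 1023


def single_exponent(x: float) -> tuple[int, int]:
    """Return (stored biased exponent, true exponent) for a single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    stored = (bits >> 23) & 0xFF   # 8-bit exponent field
    return stored, stored - 127    # bias 127


print(double_exponent(1.0))    # (1023, 0)
print(single_exponent(-0.75))  # (126, -1), since -0.75 = -1.1 (binary) × 2^-1
```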
Machine Language
Binary representation used for communication within a computer system
Replacing a processor in a computer with a faster processor has what effect on response time and throughput?
Both. A faster processor certainly means decreased response time. Plus, more tasks could be executed per minute (or other unit time), so throughput increases too. Decreasing response time almost always also increases throughput.
Adding additional processors to a system that uses multiple processors for separate tasks (for example, searching the web) has what effect on response time and throughput? Assume that before adding processors, tasks often must wait to execute due to another task executing (tasks "queue up").
Both. If the demand for processing is almost as large as the throughput, the system might force requests to queue up. In this case, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue. Thus, in many real computer systems, changing either execution time or throughput often affects the other.
B meaning
Branch
CB meaning
Conditional Branch
D meaning
Data Transfer
EX
Execution or address calculation
T-F: Two instructions could possibly need to use the ALU simultaneously.
False. If an instruction uses the ALU, it does so in its third clock cycle (the EX stage). Since only one instruction starts per clock cycle, no two instructions can use the ALU in the same clock cycle.
T-F: The LEGv8 pipelined control approach determines all control line values during an instruction's 1st clock cycle, the instruction fetch stage.
False. The control signals are used only in the instruction's 3rd, 4th, and 5th cycles, so they are generated in the instruction's 2nd cycle (instruction decode), which avoids storing those bits in the IF/ID register.
T-F: The store and load instructions behave similarly in stage 3 (EX: execute).
False. The store instruction differs from the load instruction by writing the register file's second read register's value into the EX/MEM register.
T-F: A smart branch predictor assumes the branch is not taken.
False. A smart predictor uses branching history to make better predictions.
T-F: For a given number of instructions, assume CPI is increased by 20%, and clock cycle time is decreased by 10%. The program execution time decreases.
False. 1 × (1 + 0.20) × (1 - 0.10) = 1.08 If instruction count remains the same, CPI increases by 20%, and clock cycle time decreases by 10%, then the program execution time increases by 8%.
T-F: Consider a rising clock edge that causes 3000 to be written into the PC. 3001 will be waiting at the PC's input to be written on the next rising clock edge.
False. Assuming the instruction at 3000 is not a taken branch, 3004 will be waiting at the PC's input, because the adder computes PC + 4 (each LEGv8 instruction is 4 bytes).
T-F: Consider a rising clock edge that causes 3000 to be written into the PC. The 3000 waits at the instruction memory input for the next rising clock edge, at which time the instruction at address 3000 is read out.
False. Because the instruction memory only reads, the instruction memory is like combinational logic. So the read begins as soon as the new address arrives, without waiting for a rising clock edge.
T-F: Overflow does not occur in unsigned integers.
False. Overflow can occur in unsigned integers, but since unsigned integers are commonly used for memory addresses, such overflows are often ignored.
T-F: Consider a rising clock edge that causes 3000 to be written into the PC. The 3000 waits at the adder input for the next rising clock edge.
False. The adder is combinational logic, so the 3000 enters the adder logic without waiting for a rising clock edge.
T-F: An instruction set is a particular program provided in the language of a computer.
False. The instruction set is the language itself, not a particular program. The language is like English, whereas a program is like a particular book written in English.
T-F: The register file writes to one register on every rising clock edge.
False. The write only occurs if the RegWrite input is 1.
T-F: Consider a rising clock edge that causes 3000 to be written into the PC. After the address 3000 is read into the PC, the 3000 only propagates to the adder.
False. When the 3000 is written into the PC, the 3000 propagates simultaneously to both the instruction memory and the adder.
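For a double-precision number, the biased exponent of a normalized value is stored in the range _________
1 to 2046 (11-bit field; the values 0 and 2047 are reserved)
For a single-precision number, the biased exponent of a normalized value is stored in the range ________
1 to 254 (8-bit field; the values 0 and 255 are reserved)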
How many bits are used to hold a condition code?
Four. Four extra bits hold the condition codes (negative N, zero Z, carry C, and overflow V), which record what happened during an instruction's execution.
Arithmetic Logic Unit (ALU)
Hardware that performs addition, subtraction, and usually logical operations such as AND and OR.
Load-use hazards occur when (specific examples) ____
ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
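The hazard-detection condition above translates directly into a boolean check that tells the pipeline to stall for one cycle. A minimal sketch; the parameter names mirror the pipeline-register fields in the condition but the function itself is hypothetical.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int, if_id_rs: int, if_id_rt: int) -> bool:
    """Stall when the instruction in EX is a load (MemRead) whose destination,
    ID/EX.RegisterRt, is a source register of the instruction being decoded."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# A load into register 2 followed by an instruction that reads registers 2 and 5.
print(must_stall(True, 2, 2, 5))  # True: insert a bubble
```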
Five stages of instruction execution
IF: Instruction fetch ID: Instruction decode and register file read EX: Execution or address calculation MEM: Data memory access WB: Write back
I meaning
Immediate
Adding additional processors to a system that uses multiple processors for separate tasks (for example, searching the web) has what effect on response time and throughput? Assume that before adding processors, tasks do not wait to execute (tasks do not "queue up").
Increases throughput. More tasks can be completed per unit time, because the additional processor can execute additional tasks.
IF
Instruction fetch
ID
Instruction decode and register file read
Relative Performance formula: If α is n times faster at something than β, then ______________
Performance(α) ÷ Performance(β) = n
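Using the standard definition that performance is the inverse of execution time, the ratio reduces to execution time of β divided by execution time of α. A minimal sketch with hypothetical times:

```python
def times_faster(exec_time_a: float, exec_time_b: float) -> float:
    """n such that Performance(A) / Performance(B) = n.

    Performance = 1 / execution time, so n = time(B) / time(A).
    """
    return exec_time_b / exec_time_a


print(times_faster(10, 15))  # 1.5: A (10 s) is 1.5 times faster than B (15 s)
```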
The stored-program concept means:
Programs are stored in memory along with data
The control unit enables a write to the register file using the _____ signal.
RegWrite. When RegWrite is 1, a register will be written on the next rising clock edge. The data will either come from the ALU or data memory as determined by MemToReg.
R meaning
Register
User CPU time
The CPU time spent in a program itself.
System CPU time
The CPU time spent in the operating system performing tasks on behalf of the program.
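T-F: The simplest (static) branch predictor assumes the branch is not taken.
True. Always predicting not taken is the simplest approach; a smarter predictor uses branching history instead.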
T-F: Two instructions could possibly access the register file simultaneously.
True. An instruction might read the register file in the instruction's second clock cycle, and might write in the fifth clock cycle. Thus, two instructions might simultaneously access the register file, one reading, one writing. Such simultaneous reading and writing of the register file is supported.
T-F: Assume CPI and clock cycle time remain constant. Reducing the instruction count will reduce the program's execution time.
True. Execution time is directly proportional to instruction count, so instruction count and program execution time decrease together.
T-F: The most reliable method to evaluate performance is execution time.
True. Individually, performance metrics such as instruction count, CPI, and clock frequency can be flawed. Performance can be accurately reported only when all metrics are combined into execution time.
T-F: Instructions, as well as data, can be stored in memory as numbers.
True. Instructions can be represented as just 0's and 1's, leading to the basic concept of the "stored-program": Both instructions and data can be stored in memory as numbers.
T-F: A particular processor has a clock rate of 1 GHz. The clock thus ticks one billion times per second.
True. The G is short for giga, meaning billion, or 10⁹.
T-F: The register file always outputs the two registers' values for the two input read addresses.
True. The register file does not wait for a rising clock edge, nor any special control signals, to output those two registers' values.
Data race
Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.
Given the importance of registers, what is the rate of increase in the number of registers in a chip over time?
Very slow. Since programs are usually distributed in the language of the computer, there is inertia in instruction set architecture, and so the number of registers increases only as fast as new instruction sets become viable.
Data hazard
When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.
Structural hazard
When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.
IM meaning
Wide Immediate
For a load or store instruction, the ALU should perform a(n) ______ to determine the memory address
add
SPEC (System Performance Evaluation Cooperative)
an effort funded and supported by a number of computer vendors to create standard sets of benchmarks for modern computer systems