CSC258 Midterm2
Memory Access
# a. MEM (Memory Access) # b. This stage reads data from memory using the address from the EX/MEM pipeline register and loads the data into the MEM/WB register (a store instead writes its data to memory at that address). # c. EX/MEM register, MEM/WB register
Execute or address calculation
# a. EX (Execute or address calculation) # b. This stage executes the instruction: the ALU operates on the values from the ID/EX register (for example, adding a base register and an immediate to form an address) and places the result into the EX/MEM register. # c. ID/EX register, EX/MEM register, ALU
Write back
# a. WB (Write Back) # b. This stage reads the data from the MEM/WB register and writes it back to the register file. # c. MEM/WB register, register file
Cache Performance
# of misses = # of instructions * (misses / instruction)
memory stalls
# of misses * miss penalty
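As a rough illustration of the two formulas above, here is a minimal Python sketch. The function name and the sample numbers (one million instructions, 0.02 misses per instruction, a 100-cycle miss penalty) are assumptions made up for the example, not values from the course.

```python
def memory_stall_cycles(instructions, misses_per_instruction, miss_penalty):
    """Stall cycles = (# of misses) * (miss penalty)."""
    # of misses = # of instructions * (misses / instruction)
    misses = instructions * misses_per_instruction
    return misses * miss_penalty

# Hypothetical workload: 1,000,000 instructions, 2% miss rate, 100-cycle penalty.
stalls = memory_stall_cycles(1_000_000, 0.02, 100)
print(f"memory stall cycles: {stalls:.0f}")
```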
Cache Sizes
* Each cache has a finite size (it stores some maximum number of blocks) + must map each block to a location where it is stored.
Exception
* When an exception is detected in the execution stage, flush the instructions in the IF, ID, and EX stages - save the address of the offending instruction in SEPC and load the address of the exception handler into the PC
Whole System
* Includes main memory (RAM) + a series of caches * L1 -> closest to the processor - L2 -> further away and a bit bigger - L3 -> if there's one, possibly off chip and even larger * The further a cache is from the processor, the more it stores; the size of the cache block shrinks as you get closer to the processor.
Increase the potential amount of instruction level parallelism
* Increase the depth of the pipeline to overlap more instructions * Replicate the internal components of the computer (launch multiple instructions in every pipeline stage: multiple issue) • Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate: i.e., CPI < 1
Examples of Locality
* Iterating over an array exhibits both temporal + spatial locality * Executing code exhibits temporal + spatial locality * "Accessing items" from a dictionary does not: the items in a dict may not be close to each other in memory
Forwarding (Bypassing)
* No need to wait for values to be written back * Create additional connections in the datapath to allow recently computed values to be used. The value from the execution stage of the first instruction is passed to the stage after ID (the EX stage) of the second instruction.
Memory-Mapped I/O Devices
* Portions of the address space are assigned to I/O devices * Writes to those addresses are interpreted as commands to the I/O device * Reads at those addresses let the processor receive input from the device; in the project, we checked these addresses as we needed them. When I/O is sporadic, an interrupt is sent to inform the OS that input has been received and is ready to be read * This triggers the same exception handler as before; the data in the SCAUSE register allows the handler to identify which device needs attention
tag
* The rest of the address; used to check whether a block in the cache is the right block
vectored interrupts
- uses a vector table that holds, for each interrupt/exception, the starting address of the handler to execute -> the cause register is used to index into the table (SEPC still holds the address of the affected instruction)
Hazard Cost
- inserting stalls keeps the pipeline from being filled, which can tank our performance
2's complement
- invert the bits and add 1
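The rule above can be sketched in a few lines of Python. The helper name `twos_complement` and the default 8-bit width are assumptions chosen for the example.

```python
def twos_complement(value, bits=8):
    """Negate `value` in a fixed-width field: invert the bits, then add 1."""
    mask = (1 << bits) - 1          # e.g. 0b11111111 for 8 bits
    inverted = ~value & mask        # invert every bit, keep only `bits` bits
    return (inverted + 1) & mask    # add 1, discarding any carry out

# 5 is 00000101; inverting gives 11111010; adding 1 gives 11111011 (-5).
print(format(twos_complement(5), "08b"))
```

Applying the function twice returns the original value, which is a quick way to check the result by hand.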
Addresses + Caches
- Every load of a value at an address fetches an entire cache block -> not just a single value * the size of the cache block depends on the cache - block = a set of words with closely related data
Pipelining improves throughput. It doesn't decrease the time to complete one load of laundry, but with many loads the overlap results in less time to complete all the work
- the more loads, the more apparent the performance increase
Ecall Exception
- The address of the instruction that triggered the exception is saved in the supervisor exception program counter (SEPC) first. - Then, the processor is placed in supervisor mode (control transfers from user mode to a dedicated location in supervisor code space). - To return to user mode from the exception, use the supervisor exception return instruction, sret, which resets the processor to user mode and jumps to the address in SEPC.
Taken / not taken 2-bit predictor
0 (predict not taken) -> 1 not taken -> taken -> not taken
ALUSrc
0: the second ALU operand is a register value 1: the second ALU operand is the immediate * selects between the register value and the immediate
PCSrc
0: other instructions (next PC = PC + 4), 1: for taken branch instructions (next PC = branch target)
Steps to handle exceptions in RISCV
1. Pause + save the current location in the running user process (like a function call: store the PC in the SEPC register and save the registers that will be modified on the stack) 2. Store the cause of the exception/interrupt (we use an SCAUSE register) 3. Invoke a handler that deals with the issue (a single handler is used to handle all exceptions)
How to Handle an Exception
1. Pause and save current location in running user process • Like a function call! Store the PC (SEPC register) and save registers that will be modified (stack) 2. Store cause of the exception/interrupt (We use an SCAUSE register) 3. Invoke a handler that will deal with the issue (A single handler is used to handle all exceptions)
Single Handler Duties
1. Set up any resources required by the handler (or subsequent code) 2. Read the SCAUSE register 3. If the cause allows the user program to restart ... • Handle the error or transfer to code that can • Use the SEPC to return to the program 4. Else ... • Terminate the program and report the error
example
A cache with 16 blocks of 32 bytes each (32-bit addresses): index = 4 bits from the address (2^4 = 16); block size = 32 bytes = 2^5, so the block offset is 5 bits; tag = 32 - 4 - 5 = 23 bits
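To make the field widths in this example concrete, here is a small Python sketch that splits a 32-bit address into tag, index, and block offset. The function name `split_address` and the sample address are assumptions for illustration only.

```python
def split_address(address, index_bits=4, offset_bits=5):
    """Split an address into (tag, index, block offset).

    Defaults match the example: 16 blocks (4 index bits),
    32-byte blocks (5 offset bits), leaving 23 tag bits of 32.
    """
    offset = address & ((1 << offset_bits) - 1)                 # lowest 5 bits
    index = (address >> offset_bits) & ((1 << index_bits) - 1)  # next 4 bits
    tag = address >> (offset_bits + index_bits)                 # remaining 23 bits
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(f"tag={tag:#x} index={index} offset={offset}")
```

Shifting the three fields back into place, `(tag << 9) | (index << 5) | offset`, rebuilds the original address.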
Types of Data Hazard
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1 1b. EX/MEM.RegisterRd = ID/EX.RegisterRs2 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs1 2b. MEM/WB.RegisterRd = ID/EX.RegisterRs2 * ex. addi x7, x3, 42; sub x6, x3, x2; add x7, x7, x6 -> (2a): when add is in EX, its rs1 (x7) is the rd of addi, which is in MEM/WB (its rs2, x6, also matches 1b, since sub is in EX/MEM)
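The four conditions above can be checked mechanically. The sketch below is a simplified illustration, not the full forwarding unit: it assumes register numbers are plain integers, skips the RegWrite check, and ignores matches on x0 (which is hardwired to zero). The function name is made up for the example.

```python
def forwarding_hazards(ex_mem_rd, mem_wb_rd, id_ex_rs1, id_ex_rs2):
    """Return the labels (1a, 1b, 2a, 2b) of the hazard conditions that hold."""
    hazards = []
    if ex_mem_rd != 0 and ex_mem_rd == id_ex_rs1:
        hazards.append("1a")
    if ex_mem_rd != 0 and ex_mem_rd == id_ex_rs2:
        hazards.append("1b")
    if mem_wb_rd != 0 and mem_wb_rd == id_ex_rs1:
        hazards.append("2a")
    if mem_wb_rd != 0 and mem_wb_rd == id_ex_rs2:
        hazards.append("2b")
    return hazards

# addi x7, x3, 42 ; sub x6, x3, x2 ; add x7, x7, x6
# When add is in EX: sub (rd = x6) is in EX/MEM, addi (rd = x7) is in MEM/WB.
print(forwarding_hazards(ex_mem_rd=6, mem_wb_rd=7, id_ex_rs1=7, id_ex_rs2=6))
```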
SEPC
A 64-bit register used to hold the address of the affected instruction (such a register is needed even when exceptions are vectored) - SCAUSE: a register used to record the cause of the exception.
Control Signals for the instruction type R-format
ALUSrc = 0, MemToReg = 0, RegWrite = 1, MemRead = 0, MemWrite = 0. Datapath: instruction memory -> register file -> mux selecting the register or the immediate -> ALU -> another mux -> write back to the register file
beq
ALUSrc = 0, MemToReg = x, RegWrite = 0, MemRead = 0, MemWrite = 0 - fetch the instruction from instruction memory (PC) - read the source operands (rs1, rs2) from the RF (register file) * ALUSrc = 0 - the ALU subtracts the two registers - if the branch is taken: PC + immediate - if the branch is not taken: PC + 4 * Datapath: instruction memory -> register file -> mux -> ALU -> adder -> mux selecting the next PC (nothing is written back to the register file, since RegWrite = 0)
lw
ALUSrc = 1, MemToReg = 1, RegWrite = 1, MemRead = 1, MemWrite = 0 - fetch the instruction from instruction memory (PC) - read the source operand (rs1) from the RF (register file) - extend the immediate (ALUSrc = 1) - compute the memory address (ALU) - read the data memory at that address - write the loaded data back to the register file - PC + 4 for the next instruction. Datapath: instruction memory -> register file -> mux selecting register 2 or the immediate -> ALU -> data memory -> mux (+ setup delay) -> write back to the register file
sw
ALUSrc = 1, MemToReg = x, RegWrite = 0, MemRead = 0, MemWrite = 1. sw x6, 8(x9) -> the ALU computes the address x9 + 8, and in the MEM stage the value of x6 is written to memory at that address. Datapath: instruction memory -> register file -> mux -> ALU -> data memory
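The control settings from the R-format, beq, lw, and sw cards can be collected into one lookup table. This is only a study aid sketched in Python: the dictionary layout, the use of `None` for a don't-care (x), and the helper name are assumptions, and signals such as PCSrc and ALUOp are left out.

```python
# Control signals per instruction type, as listed on the cards above.
# None stands for a don't-care value (written "x" on the cards).
CONTROL_SIGNALS = {
    "R-format": {"ALUSrc": 0, "MemToReg": 0,    "RegWrite": 1, "MemRead": 0, "MemWrite": 0},
    "beq":      {"ALUSrc": 0, "MemToReg": None, "RegWrite": 0, "MemRead": 0, "MemWrite": 0},
    "lw":       {"ALUSrc": 1, "MemToReg": 1,    "RegWrite": 1, "MemRead": 1, "MemWrite": 0},
    "sw":       {"ALUSrc": 1, "MemToReg": None, "RegWrite": 0, "MemRead": 0, "MemWrite": 1},
}

def touches_data_memory(instruction_type):
    """True if the instruction reads or writes data memory."""
    signals = CONTROL_SIGNALS[instruction_type]
    return bool(signals["MemRead"] or signals["MemWrite"])

for name in CONTROL_SIGNALS:
    print(name, "uses data memory:", touches_data_memory(name))
```

Note that MemToReg is a don't-care exactly for the two instructions that never write the register file (RegWrite = 0).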
Single Cycle processor time
Add up the delays of all the stages
Elements in Single Cycle processor's datapath
Adder, immediate generation unit, instruction memory, data memory, multiplexer, program counter register, ALU, registers/register file
Ch.5
Data Path
Interrupt
An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)
Exceptions
An unscheduled event that disrupts program execution; used to detect undefined instructions
ALU
Arithmetic and Logic Unit - does all mathematical calculations and makes all logical decisions
Code Scheduling
Code scheduling to avoid stalls • Can reorder code to avoid using a load result in the very next instruction • Example C code: a = b + e; c = b + f
Exceptions: a Control Hazard
Consider a malfunction on an add in the EX stage: add x1, x2, x1 • Must prevent x1 from being clobbered • Must complete previous instructions • To do so, flush the add and subsequent instructions, but keep the previous ones. The steps required are similar to those for a mispredicted branch
Control Hazard
Deciding on the control action depends on a previous instruction. This happens with conditional branches: once the branch instruction is fetched, we do not know the next instruction to execute until we get the outcome of the branch (PC + 4 or PC + immediate). Ex: add x4, x5, x6; beq x1, x0, 40; or x7, x8, x9 * stall between the beq and the or (one nop) to execute correctly, so that the third instruction's fetch stage lines up with the second instruction's ALU (EX) stage.
Datapath with Hazard Detection
The hazard detection unit is placed in the ID stage. Here, it can easily introduce a bubble by zeroing out the control values in the ID/EX register
CH.08
EXCEPTIONS
Exceptions and Interrupts
Exceptions are "unexpected" (unpredictable) events requiring non-user code to be run. • Different ISAs use the terms exception and interrupt differently • Exceptions: generally come from within the CPU (syscall, floating point error, ...) • Interrupts: generally generated by an external device. Handling exceptions without sacrificing performance is challenging!
Block Offset
Finds the right location (byte or word) inside the cache block
Sign and Magnitude
First bit is a sign bit 0 for Positive 1 for Negative
How to stall the pipeline
Force control values in ID/EX register to 0 • Prevent update of PC and IF/ID register • This results in the current instruction being decoded again in the next cycle
Which of the following are events which may cause an exception as RISC-V defines it? Request from an I/O device Hardware error System reset ecall instruction Illegal (undefined) instruction
Hardware error System reset ecall instruction Illegal (undefined) instruction
RISC-Instruction stages
IF: Instruction fetch from memory. ID: Instruction decode & register read. EX: Execute operation or calculate address MEM: Access memory operand WB: Write result back to register
Spatial Locality
If one object in memory is accessed, objects close to it will likely be accessed soon as well
Temporal Locality
If one object is accessed, it (and the objects around it) will likely be accessed again soon
Interrupts, saved addresses
In memory-mapped I/O, portions of the address space are assigned to I/O devices. Writes to those addresses are interpreted as commands to the I/O device (recall that writing to specific addresses in your project turned LEDs on or off!). Reads at those addresses allow the processor to receive input from the device. In our project, we checked these addresses as we needed them, but when I/O is sporadic, an interrupt is sent to inform the operating system that input has been received and is ready to be read. This triggers the same exception handler as before, but the data in the SCAUSE register allows the handler to identify which device needs attention.
Branch Prediction
In a deep pipeline, the stall penalty is too high • Branches are common instructions! • So ... let's predict the outcome of the branch • If the prediction is wrong, then it's no worse than stalling. • And if the prediction is correct, there is no penalty. • Easiest prediction: not taken • Just fetch the instruction after the branch, with no delay. The compiler can make not taken work pretty well. • Code can be built to use only unconditional jumps and branches that are likely to not be taken. • But the cost of a missed prediction is very high on a deep pipeline. • Modern pipelines are typically 10-14 stages (with a max of 31!) • So there has been a lot of work on improving branch prediction.
Handling exceptions
In a single-cycle processor, an exception is fairly easy to handle. The current instruction is cancelled, appropriate exception data is stored, and the next instruction to be executed is the start of the exception handler. In a pipelined implementation, however, exceptions are a form of control hazard. When an exception is received, all instructions after the offending instruction must be flushed. Then, the PC that is saved is not the "current" PC but rather the address of the first instruction that must be re-executed.
Data Hazard
Need to wait for a previous instruction to complete its data read/write. Ex: add x19, x0, x1 followed by sub x2, x19, x3 needs two nops in between (without forwarding), so that the second instruction's ID cycle lines up with the first instruction's WB cycle (the first half of the WB cycle writes the data back; the second half of the ID cycle of instruction 2 reads that data). So 2 nops are needed to get the value back
An ILP Analogy
Pipelined laundry: overlapping execution improves performance. Completing a load of laundry requires four steps, and each step uses different hardware. Key insight: we can run four loads at once if the hardware is fully utilized. Four loads: speedup = 8/3.5 = 2.3. Non-stop (n loads): speedup = 2n/(0.5n + 1.5) ≈ 4 for large n, which equals the number of stages. The pipeline approaches 4 times faster as n approaches infinity
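The laundry numbers can be reproduced with a short Python sketch. It assumes four steps of 0.5 hours each, so one load takes 2 hours on its own and, once the pipeline is full, another load finishes every 0.5 hours; the function name is made up for the example.

```python
def laundry_speedup(n, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry for n loads."""
    sequential = n * stages * stage_hours                    # 2n hours
    # The first load takes all four steps; each later load adds one step time.
    pipelined = stages * stage_hours + (n - 1) * stage_hours  # 0.5n + 1.5 hours
    return sequential / pipelined

for loads in (1, 4, 100, 10_000):
    print(loads, round(laundry_speedup(loads), 3))
```

With four loads this gives 8 / 3.5, and as the number of loads grows the ratio approaches the number of stages.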
CH6.
Pipelining
One Handler vs. Many
RISC-V uses a single exception handler, but that's not the only possibility • Other systems use Vectored Interrupts • Here, the handler address is determined by the cause • To call the correct function, a vector table is maintained, where a handler is registered for every category of exception. • The cause register is used to index into this table to invoke the correct handler.
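The vector-table idea can be modelled in software as a table from cause code to handler. The Python sketch below is only an analogy for what the hardware does: the handler functions and their return strings are invented, the cause codes 2 (illegal instruction) and 8 (ecall from user mode) are used purely as illustrative keys, and resuming at SEPC + 4 after an ecall is an assumption about how the handler returns.

```python
def handle_illegal_instruction(sepc):
    # The program cannot be restarted: report where the fault happened.
    return f"terminate: illegal instruction at {sepc:#x}"

def handle_ecall(sepc):
    # Assumption: after servicing the call, resume at the next instruction.
    return f"service system call, resume at {sepc + 4:#x}"

# The "vector table": one registered handler per category of exception.
VECTOR_TABLE = {
    2: handle_illegal_instruction,
    8: handle_ecall,
}

def dispatch(scause, sepc):
    """Use the cause value to index the table and invoke the matching handler."""
    handler = VECTOR_TABLE.get(scause)
    if handler is None:
        return "terminate: unknown cause"
    return handler(sepc)

print(dispatch(8, 0x1000))
print(dispatch(2, 0x2000))
```

A single-handler design would instead run one routine every time and branch on the cause value inside it.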
Tag
The rest of the address; used to check whether a block in the cache is the right block
Caching
A solution that makes memory appear closer than it is. * Stores the value that was loaded and the values near it, in case they are needed soon.
Memory System
Sources of delay: size + distance; memory is large and it is far away from the processor
The Bottom Line: Stalls and Performance
• Stalls reduce performance, so we avoid them at all costs. • The addition of hardware to detect hazards and forward data can help. • In some situations, we rely on the compiler (or even more complex hardware) to rearrange the instruction stream
linker
Takes all independently assembled machine language programs and "stitches" them together to create an executable program: 1. places the code and data modules symbolically in memory 2. determines the addresses of data and instruction labels 3. patches both the internal and external references
Control Unit
Takes the instruction to be executed as input. Used to determine how to set the control lines for the functional units (register file, ALU, and memories) and two of the multiplexors. The third multiplexor (top) is driven by a combination of the control unit and the Zero output of the ALU (which performs the comparison for the beq instruction); it determines whether the next instruction is at PC + 4 or at the branch target, PC + immediate (for branches)
Pipelining
Technique where multiple instructions are overlapped during execution
Forwarding Paths
The forwarding unit detects a hazard condition. It emits a control signal to change the value selected by the multiplexer.
Hardware provides two execution modes
The hardware must provide at least two execution modes: one for user-level execution and the other for supervisor-level execution. To switch between the two modes, an ecall exception is performed by invoking the ecall assembly instruction. This exception causes the current PC to be saved in the SEPC register, the processor to be placed in supervisor mode, and the PC to be set to the start of the exception handler.
Handling an exception
To handle an exception, the operating system must know why the exception occurred. RISC-V uses an SCAUSE register. When the exception occurs, a code indicating the source of the exception is stored in the SCAUSE register. The same exception handler code starts each time; it inspects the SCAUSE register to determine what action to take. Other systems use vectored interrupts. In this scheme, different exception handlers are invoked depending on the cause of the exception.
Dynamic branch prediction
Track historical branch behaviour and predict based on that history • Use counters to track "taken" vs. "not taken"
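One common way to "use counters" is a 2-bit saturating counter per branch. The Python sketch below assumes the usual encoding (states 0 and 1 predict not taken, states 2 and 3 predict taken); the class name and the starting state of 0 are choices made for the example, not details taken from the course.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

predictor = TwoBitPredictor()
for outcome in (True, True, True, False, True):
    guess = predictor.predict()
    predictor.update(outcome)
    print(f"predicted={guess} actual={outcome} new_state={predictor.state}")
```

Because the counter has to be wrong twice in a row before the prediction flips, a single not-taken outcome at the end of a loop does not disturb the "taken" prediction for the next run of the loop.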
If multiple exceptions happen at the same time
We could have multiple exceptions at once ... • A pipelined processor has more than one instruction in flight • Or an external interrupt could occur while an exception is occurring.
Structural Hazard
When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute. * A required resource is busy • In a RISC-V pipeline with a single memory • Load/store requires data access • Instruction fetch would have to stall for that cycle • Hence, pipelined datapaths require separate ports for instruction/data access • Might be implemented as instruction/data caches
Instruction decode and Register Read
a. ID (Instruction decode and register read) b. This stage decodes the instruction and reads the register values. c. IF/ID pipeline register, ID/EX register (stores the PC value and register values), register file
Fetching instruction (pipelined)
a. IF (Instruction fetch) b. This stage fetches the instruction, and the PC address is saved in the IF/ID register. c. PC, IF/ID pipeline register
Multiplexor
A device that receives multiple input signals and conveys one selected input to a single output signal; used where alternate data sources are needed for different instructions
Compiler Tool Chain
C program -> Compiler -> assembly language program -> Assembler -> object: machine language module (plus object: library routine, in machine language) -> Linker -> executable: machine language program -> Loader -> memory. Converts a high-level program into an executable - the compiler converts the C program into an assembly program - the assembler converts the assembly program into a machine language module - multiple pieces of machine code are combined using a linker/link editor to create an executable program
Ideal speed up
clock cycle (pipelined) = clock cycle (non-pipelined) / number of stages, assuming perfectly balanced stages - the speedup is due to increased throughput; latency (the time for each instruction) does not decrease - the number of instructions we can execute in a unit of time does increase
Assembler
Converts an assembly program into machine language code
Single Cycle data path
Executes each instruction in one clock cycle
Loader
Places the executable in memory - reads the executable file header to determine the size of the text and data segments - creates an address space large enough (allocation) - copies instructions and data from the executable file into memory - copies parameters, if any, to the main program onto the stack - initializes the processor registers and sets the stack pointer to the first free location. Range of Unsigned bits
Set
The index in the cache where a block is placed
set
The index in the cache where a block is placed
ALUOp
Control signal, derived from the opcode, that tells the ALU control what specific operation the ALU should perform
I type
Datapath: instruction memory -> register file -> mux -> ALU -> mux -> write back to the register file
Pipeline five stage processor cycle time
Take the delay of the longest stage
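The two cycle-time rules (add all stages for single cycle, take the longest stage for the pipeline) can be compared with a short Python sketch. The stage delays below are hypothetical numbers in picoseconds chosen only for illustration.

```python
# Hypothetical stage delays in picoseconds (not measured values).
STAGE_DELAYS_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_time(delays):
    """Single-cycle processor: the clock must cover every stage in sequence."""
    return sum(delays.values())

def pipelined_cycle_time(delays):
    """Five-stage pipeline: the clock must cover the slowest stage."""
    return max(delays.values())

print("single cycle:", single_cycle_time(STAGE_DELAYS_PS), "ps")
print("pipelined:   ", pipelined_cycle_time(STAGE_DELAYS_PS), "ps")
```

With these numbers the pipelined clock is 4 times faster rather than 5 times, because the stages are not perfectly balanced.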
Instruction Level Parallelism
The set of techniques and designs that enable parallel execution of instructions in an architecture: the simultaneous execution of instructions from a single thread of execution in a program. Also, the opportunity to execute multiple instructions in a program simultaneously due to a lack of dependence between the instructions.
Bit Mask
A value that can be used to turn specific bits in a bit vector on or off
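A minimal Python sketch of turning bits on and off with a mask; the helper names are invented for the example. OR with the mask turns the selected bits on, and AND with the inverted mask turns them off.

```python
def set_bits(value, mask):
    """Turn ON every bit that is 1 in the mask."""
    return value | mask

def clear_bits(value, mask):
    """Turn OFF every bit that is 1 in the mask."""
    return value & ~mask

def any_bit_set(value, mask):
    """Check whether any of the masked bits is currently on."""
    return (value & mask) != 0

print(format(set_bits(0b1000, 0b0011), "04b"))    # low two bits turned on
print(format(clear_bits(0b1111, 0b0101), "04b"))  # bits 0 and 2 turned off
```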
Pipelining and ISA Design
• All instructions are 32 bits • Easier to fetch and decode in one cycle • Few and regular instruction formats • Can decode and read registers in one step • Load/store addressing • Can calculate address in 3rd stage, access memory in 4th stage
Control Hazards
• Branches change the next instruction to execute • As a result, the pipeline can't always fetch the correct instruction ... or even know what the correct instruction to fetch will be. • We could just stall ... but we can't determine the next instruction until AFTER the execution stage ... so it's better to compute the target as early as possible and to make an educated guess about the next instruction.
Reducing Branch Delay
• First, move hardware to determine outcome to ID stage • Target address adder and/or a memory to store previously computed targets (the "branch target buffer") • Register comparator • Second, add hardware to choose whether to load target address or PC + 4
Instruction Execution
• PC -> instruction memory, fetch instruction • Register numbers -> register file, read registers • Use ALU to calculate • Arithmetic result • Memory address for load/store • Branch comparison • Access data memory for load/store • PC <- target address or PC + 4
Static branch prediction
• Predict backward branches taken (e.g., the end of a loop body) • Predict forward branches not taken (e.g., if statements)