CSE120 all
Decreasing Stalls: Forwarding
"Forward" the data to the appropriate unit Slide 19 This eliminates stalls for dependencies between ALU instructions
Relative Performance
"x is n times faster than y" means (Performance x/Performance y) = (Execution y/Execution x) = n Ex. A: Cycle Time = 250ps, CPI = 2.0 B: Cycle Time = 500ps, CPI = 1.2 A is faster okay bro I = number of instructions CPU Time A = I * 2.0 * 250 = 500 *I ps CPU Time B = I * 1.2 * 500 = 600 * I ps Execution Time B / Execution Time A - 600 / 500 = 1.2 So A is 1.2 times faster than B
NOP example
- NOP = Instruction that does nothing, e.g. add $0, $0, $0
Load-Use Case: Hardware Stall
-A pipeline interlock checks and stops the instruction issue
Fixed Length
-Address of next instruction is easy to compute -Worse code density: common instructions take as many bits as rare ones
Variable Length
-Better code density -x86 averages 3 bytes (from 1 to 16 per instruction) -common instructions are shorter -Less instruction memory to fetch -Fetch and decode are more complex
Forwarding Limitation: Load-Use Case
-Data is not available yet to be forwarded
Dependencies vs Hazards
-Dependencies are a property of your code: you can read through the instructions and find them. -Hazards are dependencies that have become an issue in the current pipeline.
RAW Hazard Example
-Dependencies backwards in time are hazards (Slide 16 in Pipelining). When an instruction reads a register before an earlier instruction has written it back, you have a hazard.
RISC-V ISA
-Developed by UC Berkeley -Reduced Instruction Set Computer (RISC) -Simple, easy to understand, elegant -Not bloated like x86. Design Principles of RISC-V: -Simplicity favours regularity -Smaller is faster -Good design demands good compromises -Make the common case fast
Semiconductor Chips
-Dominant tech for integrated circuits -Print-Like Process -Print resolution improves over time -> more devices
ISA does NOT define
-How instructions are implemented -How fast/slow instructions are -How much power instructions consume
perf record
-Low overhead sampling -Lists time spent per function
Time Application
-Measures execution time of the application -Distinguishes between user and kernel
Power and Energy
-Power Density (cooling) -limits compaction and integration e.g. cellphone cannot exceed 1-2W -Battery Life for mobile devices -Reliability at high temperatures -Cost -Energy Cost -Cost of power delivery, cooling system, packaging -Environmental Issues -IT responsible for 0.53 billion tons of CO2 in 2002.
Energy Proportionality
-Power consumption should scale with delivered performance -Most modern computers don't achieve this, and we don't yet know how
Decreasing Stalls: Fast RF
-Register file writes on first half and reads on second half
ISA defines
-The state of the system (Registers, Memory) -The functionality of each HW instruction -The encoding of each HW instruction
Processors
-These are programmable HW components. -A generic HW component with an expressive and stable interface to SW. -The SW can define the functionality of HW through programming.
Complexity Challenge
-limiting factor in modern chip design -want to use all transistors on a chip Only way to survive: -Hide complexity using abstraction -design reusable hardware components
Arithmetic Mean
-Use with times, not with rates. Represents total execution time
perf topdown
-Uses hardware performance counters (PMU) -Enables microarchitectural studies -4 categories: Backend Bound, Retiring, Bad Speculation, Frontend Bound. These categories give you an idea of where time and energy are being spent in the program.
Maximum Speedup
Max Speedup = 1 / (1 - Fraction). Example: 80% parallelizable, what is the max speedup? 1 / (1 - 0.8) = 1 / 0.2 = 5 times
Big Picture: Running a Program
1.High Level Language Program -> (compiler) -> 2.Assembly Language Program -> (assembler) -> 3.Machine Language Program -> (machine interpretation) -> 4.Control Signal Specification
Procedure Call Steps
1.Place parameters in a place where the procedure can access them 2.Transfer control to the procedure -Some kind of jump 3.Allocate the memory resources needed for the procedure 4.Perform the desired task (procedure body) 5.Place the result value in a place where the calling program can access it 6.Free the memory allocated in (3) 7.Return control to the point of origin slide 29 of ISA
What is the clock cycle time of a processor with a clock frequency of 8GHz?
125ps
Assume a code sequence that uses instruction types A, B, C. Given below are the instruction count and CPI of each type. What is the average CPI of the sequence? Class: A B C; CPI for class: 1 4 2; IC: 4 2 6
2
What frequency is a processor operating at that has a clock cycle time of 400us?
2.5KHz
How many RISC-V instructions are required at a minimum to implement the following C-code snippet: a = b + *c + *d assume a,b,c,d are in registers.
4
Assume the clock of the digital system above is 2 GHz. Select the maximum propagation delay for the combinatorial logic block:
450ps
sll x1, x1, 3 performs a multiplication of the value in x1 by
8
In the RISC-V ISA there exist 32 general purpose registers. Assume we wanted to increase the number of registers to 256. To enable this, the machine instruction format of three operand instructions (e.g. add x1, x2, x3) would require:
9 more bits
Execution Time
= (Instructions/Program) * (Clock cycles/Instruction) * (Seconds/Clock Cycle) =Seconds/Program
HW1 Q3 Amdahl's Law
P = (N - N*S) / (S(1 - N)), where S = speedup, N = number of threads, P = parallelizable fraction of code. Threads = 8, speedup = 2.9: P = (8 - 8(2.9)) / (2.9(1 - 8)) = 0.748 = 75%
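The rearranged formula on this card can be checked with a small Python sketch (mine, not from the assignment; the function name is just for illustration). It inverts Amdahl's Law, S = 1 / ((1 - P) + P/N), to solve for P:

```python
# Solve Amdahl's Law for the parallel fraction P, given measured
# speedup S on N threads: P = (N - N*S) / (S * (1 - N)).
def parallel_fraction(S, N):
    return (N - N * S) / (S * (1 - N))

P = parallel_fraction(S=2.9, N=8)
print(P)  # about 0.75
```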
Control: Load
RegWrite is activated. The base register needs to be added to an offset, so ALUSrc is set. MemtoReg is set because we read memory and then put the value into a register. MemWrite is set to 0 though because we aren't writing to memory.
Examples of Benchmarks
SPEC CPU2006 (integer and FP benchmarks) TPC-H and TPC-W (database benchmarks) EEMBC (embedded benchmarks)
Tech Scaling Past and Present
Used to be Moore's Law plus Dennard Scaling. Moore's Law - more transistors; Dennard Scaling - lower Vdd. In 2005 Dennard Scaling stopped working, and now there is a 32x-per-decade gap in chip capability compared to past scaling. That's why we use multiple cores now.
Why stop at 5 pipeline stages?
Three issues: Some things have to complete in a cycle (you can't split the washing machine into two steps); CPI is not really 1; Cost (area and power)
Measuring Performance (Linux)
Time application perf record perf topdown
What does Moore's Law say:
Transistor density increases by 2x every 3 years
Dependency Examples
True dependency => RAW hazard addu xt0, xt1, xt2 subu xt3, xt4, xt0 Output dependency => WAW hazard addu xt0, xt1, xt2 subu xt0, xt4, xt5 Anti dependency => WAR hazard addu xt0, xt1, xt2 subu xt1, xt4, xt5
Instruction Set Architecture
What is the HW/SW interface? -RISC-V ISA
Reducing Stalls
When is the new data actually available? The register file is typically fast, so we can write in the first half of a cycle and read in the second half. This fits two steps into one cycle, eliminating at least 1 cycle from the stall length. This decrease of 1 cycle would still only push the performance to a little more than 1/2 of what you want.
Execution (or CPU) Time
=Cycles Per Program * Clock Cycle Time OR =Cycles Per Program / Clock Rate -Execution time is improved by reducing the number of clock cycles or increasing the clock rate. -These two goals are not always compatible, so you must often trade off clock rate against cycle count
Data Transfer Instructions: Load
Operation name, destination register, base register, and constant offset: ld dst, offset(base)
Performance Example
A: cycle time = 250ps, CPI = 2.0 B: cycle time = 500ps, CPI = 1.2 Which is faster and by how much? CPU Time A = Instruction Count x CPI A x Cycle Time A = I x 2.0 x 250ps = I x 500ps CPU Time B = Instruction Count x CPI B x Cycle Time B = I x 1.2 x 500ps = I x 600ps CPU Time B / CPU Time A = (I x 600ps) / (I x 500ps) = 1.2 A is faster by 1.2 times
Performance is affected by
Algorithm: IC and CPI Programming Languages: IC and CPI Compiler: IC and CPI Instruction Set Architecture: IC and CPI HW Design: CPI and Tc
Amortization
All hardware used all the time
RISC-V Format
All instructions are 32 bits, to ease decoding. -Smaller is faster. -Requires interpreting bit fields differently for different instructions (R vs I vs S/B vs U/J) -Simplicity favors regularity -Limits the register count to 32 (5 bits per register operand) -Good design demands good compromises
Memory Data Transfer
All memory access happens through loads and stores -Store sends both data and address to memory -Load sends an address to memory and the data comes back into a register.
Consider the following instruction: Ld x2, offset(x3) Which of the following statements is correct?
All three are correct
Useful Techniques for Evaluating Efficiency
Amdahls Law and Benchmarks
Heap
A memory region of our running code (thread or process) that lets us allocate memory on the fly (dynamically)
Stack
An array-like data structure in the memory in which data can be stored and removed from a location called the 'top' of the stack. Contains function local variables Whenever you call a new function, you extend the stack downward to contain all of the new variables. You can deallocate all of the new variables very quickly by just moving the bottom boundary of the stack.
Summarizing Benchmarks
Arithmetic Mean, Harmonic Mean, Geometric Mean
Examples of Code in RISC-V
Around slide 20 in ISA sheet
Arrays
Arrays are really pointers to the base address in memory. Each element sits at a constant offset from the base: offset = index x element size.
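The base-plus-offset idea can be sketched in Python (a sketch with made-up numbers; the base address 0x1000 and the function name are hypothetical):

```python
# Element i of an array of 8-byte (doubleword) values lives at
# base_address + i * element_size.
def element_addr(base, index, elem_size=8):
    return base + index * elem_size

addr = element_addr(0x1000, 5)  # hypothetical base address 0x1000, A[5]
print(hex(addr))                # 0x1028
```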
HW1 Q2 parallel Speedup
As you add more threads to the mix (up to 8), a regular system achieves a speedup of about 3, disable-asm has a speedup of about 5.
Transistor
Basic Building block for logic
Combinational Logic
Boolean algebra
The seven Controls
Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite
Controls
Branch - 1 if branching MemRead - 1 if reading from memory MemtoReg - 1 if putting memory into a register ALUOp - tells ALU what operation to do MemWrite - 1 if writing to memory ALUSrc - 1 if using an immediate RegWrite - 1 if Writing a value into a register
CPU Time formula
CPU Time = Instruction Count * CPI/Clock Rate CPU Time = Execution Time
Abstraction
Captures design at different levels of representation. Stable interfaces expose functionality but not low-level implementation details of lower levels. We really only care about two layers in this class: -Instruction Set Architecture -Compute Cores/Memories
CPI Example
Class: A B C; CPI for class: 1 2 3; IC in sequence 1: 2 1 2; IC in sequence 2: 4 1 1. CPI(1) = (2x1 + 1x2 + 2x3) / 5 = 10/5 = 2. CPI(2) = (4x1 + 1x2 + 1x3) / 6 = 9/6 = 1.5. Sequence 2 is faster by 2/1.5
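The weighted-average CPI on this card can be checked with a short Python sketch (mine, not from the slides; the function name is just for illustration):

```python
# Average CPI = total clock cycles / total instruction count,
# where cycles are weighted by each class's CPI.
def average_cpi(cpi_by_class, ic_by_class):
    cycles = sum(cpi_by_class[c] * ic_by_class[c] for c in cpi_by_class)
    return cycles / sum(ic_by_class.values())

cpi = {'A': 1, 'B': 2, 'C': 3}
seq1 = {'A': 2, 'B': 1, 'C': 2}
seq2 = {'A': 4, 'B': 1, 'C': 1}
print(average_cpi(cpi, seq1))  # 2.0
print(average_cpi(cpi, seq2))  # 1.5
```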
Instruction Count and CPI
Clock Cycles = Instruction Count x Cycles Per Instruction CPU Time = Instruction Count x CPI x Clock Cycle Time OR CPU Time = (Instruction Count x CPI) / Clock Rate -Instruction Count (IC) - determined by program, ISA, and compiler -Average cycles per instruction (CPI) - determined by HW design; if different instructions have different CPI, take the average over the instruction mix.
Calculating CPI
Clock Cycles = sum over instruction classes of (CPI of class x IC of class)
Benchmark Suite
Collection of benchmarks -plus datasets, metrics, and rules for evaluation -Plus a way to summarize performance in one number
How can we delay the younger instruction?
The compiler inserts independent work or NOPs ahead of it
Challenges for Performance
Complexity, Efficiency
Execution Time example
Computer A: 2GHz clock, 10s CPU Time. Designing Computer B: aim for 6s CPU Time -Can do a faster clock, but it causes 1.2x clock cycles. How fast must Computer B be? Clock Rate B = Clock Cycles B / CPU Time B = (1.2 x Clock Cycles A) / 6s. Clock Cycles A = CPU Time A * Clock Rate A = 10s * 2GHz = 20 x 10^9. Clock Rate B = (1.2 x 20 x 10^9) / 6s = (24 x 10^9) / 6s = 4GHz
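The same derivation can be run as a Python sketch (mine, not from the slides; variable names are just for illustration):

```python
# Solve for Computer B's required clock rate.
cpu_time_a = 10.0                            # seconds
clock_rate_a = 2e9                           # 2 GHz
clock_cycles_a = cpu_time_a * clock_rate_a   # 20 x 10^9 cycles
clock_cycles_b = 1.2 * clock_cycles_a        # faster clock costs 1.2x cycles
clock_rate_b = clock_cycles_b / 6.0          # target CPU time: 6 s
print(clock_rate_b)                          # 4e9, i.e. 4 GHz
```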
RISC-V Constants
Constant == immediate == literal == offset
Single Cycle Processor Performance
Cycle speed determined by slowest instruction.
Kinds of Hazards
Data Hazard -Must wait for previous instructions to produce/consume data. Control Hazard -Next PC depends on a previous instruction. Structural Hazard -A required resource is busy
Solutions for RAW Hazards
Delay the reading instruction until data is available. Also called stalling or inserting pipeline bubbles
Moore's Law
Devices get smaller; more devices fit on a chip and devices get faster. Initial graph from the 1965 paper. Prediction: 2x density per year. Reality: ~2x density every 3 years. Not a law, just an observation that has held true for over 50 years.
Latency
Digital HW operates using a constant-rate clock
How to Stall the Pipeline
Discover the need to stall when the 2nd instruction is in the ID stage -Repeat its ID stage until the hazard is resolved -Let all instructions ahead of it move forward -Stall all instructions behind it. 1.Force the control values in the ID/EX register to a NOP instruction -As if you fetched or $0, $0, $0 -When it propagates to EX, MEM and WB, nothing will happen. 2.Prevent update of the PC and IF/ID register -The using instruction is decoded again -The following instruction is fetched again
Moore's Law
Double transistors every two years. In reality, it's more like every 3.
Clock Period
Duration of a clock cycle Example: 250ps = 0.25ns = 250 x10^-12s This is the basic unit of time in all computers
Energy =/= Power
Dynamic or active power consumption. Joules = Watts * seconds. You can improve energy by reducing power consumption or by improving execution time
What should we optimize processors for?
EPI = Energy per instruction After minimizing EPI you can tune performance and power as needed.
StackFrame or Procedure Activation Record
Each procedure creates an activation record on the stack (Slide 32 of ISA). From higher to lower addresses, the stack frame holds: saved arguments (if any), saved return address, saved registers (if any), and local arrays and structures (if any). The stack grows downwards, toward lower addresses.
Data Hazards - Stalls
Eliminate reverse time dependency by stalling
Energy =/= Power
Energy = Average Power * Execution Time -Joules = Watts * Seconds -Power is limited by infrastructure (e.g. power supply) -Energy: what the utilities charge for or battery can store -you can improve energy by -reducing power consumption -Or by improving execution time -Race to Halt!
What should we optimize processors for?
Energy per instruction (EPI) -After minimizing EPI, tune performance and power as needed -higher for server, lower for cellphone
Performance Summary
Execution Time = ( Instructions/Program ) x ( Clock Cycles/Instruction ) x ( Seconds / Clock Cycle )
RISC-V designed for pipelining
Few and regular instruction formats - can decode and read registers in one step.
fewer or more registers?
Fewer registers need fewer bits to encode and can be faster to access, but you would need more memory accesses (spills). More registers require more bits to encode them all, making access a bit slower, but you would access memory less.
Geometric Mean
Good with normalized performance Does not represent total execution time
Identifying the Forwarding Paths
Identify all stages that produce new values: EX and MEM. Add stages after the first producer as sources of forwarding data: EX, MEM, WB. Identify all stages that really consume values: EX. Add a multiplexor for each input operand of each consumer: 2. Multiplexors have sources+1 inputs: EX, MEM, WB + ID (register file)
Branching Far Away
If the target is more than -2^11 to 2^11-1 words away, then the compiler inverts condition and inserts an unconditional jump
RISC-V Instruction Format
Image 1 in file
Instruction Execution
Image 2 -Start with the before state of the machine -PC, Regs, Memory -PC is used to fetch an instruction from memory -Instruction directs how to compute the after state of the machine -For add: PC = PC + 4 and rd = rs + rt -This happens atomically. Referencing Image 2: -The entire line of 0's and 1's represents a single instruction. -The two blue bars represent the source registers -The line in the memory box represents the instruction coming from memory -The result is put into a destination register, the red bar on the right. -The PC is also incremented to receive the next instruction in line -This all happens in one step, making it atomic
Pipelining
Executing steps from multiple instructions simultaneously. If there are 6 steps and they all take the same time, then the maximum speedup is 6 times. Unequal step times are less efficient and lower the speedup. Pipelining doesn't help the latency of a single task, just the throughput of the entire workload. Multiple tasks operate simultaneously. Potential speedup = number of stages. Pipeline rate is limited by the slowest stage. Time to fill and drain the pipeline lowers speedup
Instruction Count and CPI
Instruction Count (IC) for a program - Determined by program, ISA, and compiler Average cycles per instruction (CPI) - determined by HW design. If different instructions have different CPI, then average CPI is affected by instruction mix.
HW2 Q6
Instructions = 4359, Average CPI = 2.12, Clock Rate = 3 GHz = 3 x 10^9 Hz. Execution Time = (4359 * 2.12) / (3 x 10^9) = 3.08us
HW2 any other question
Just look at the assignment dude
Stack Discipline
Last In First Out (LIFO). State for a given procedure is needed for a limited time: from when it is called to when it returns. The stack is allocated in frames.
Performance: Latency vs Throughput
Latency or response/execution time: How long it takes to do a task Throughput: Total work done per unit time(e.g. queries/second) Speeding up a single core improves both Latency and Throughput. Adding more cores improves Throughput but not Latency. Latency helps to improve Throughput but not necessarily the other way around.
Mapping Memory
Loads move data from an I/O device register to a CPU register Stores move data from a CPU register to an I/O device register
For perf stuff
Look at Q4-6 in HW1
Amdahls Law
Make the common case efficient. -Given an optimization x that accelerates fraction fx of a program by a factor of Sx, how much is the overall speedup? Speedup = CPUTime old / CPUTime new = CPUTime old / (CPUTime old * ((1-fx) + fx/Sx)) = 1 / ((1-fx) + fx/Sx). Lessons from Amdahl's Law: -Make common cases fast: as fx -> 1, speedup -> Sx -Don't overoptimize the common case: as Sx -> infinity, speedup -> 1 / (1-fx) -The uncommon case will eventually become the common one -Applies to cost, power consumption, energy, etc.
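The speedup formula and both limits can be sketched in Python (mine, not from the slides; the function name is just for illustration):

```python
# Amdahl's Law: overall speedup when fraction fx of a program
# is accelerated by a factor of Sx.
def amdahl(fx, Sx):
    return 1.0 / ((1 - fx) + fx / Sx)

print(amdahl(0.8, 10))            # accelerating 80% of a program by 10x
print(amdahl(0.8, float('inf')))  # the limit 1 / (1 - fx) = 5.0
```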
Efficiency Challenge
Many Factors to consider: Transistors, Parallel App performance, Single-Thread Performance (SpecInt), Frequency (MHz), Typical Power (Watts), Number of Cores
Efficiency
Many factors to consider when attempting to improve efficiency. Transistors, Parallel App Performance, Single-Threaded Performance (SpecInt), Frequency (MHz), Typical Power (Watts), and Number of Cores all affect efficiency.
In RISC-V, Ld and Sd are
Memory transfer Instructions
Technology Scaling in the past
Moore's Law (more transistors) + Dennard Scaling (lower Vdd). Together these led to a huge increase in performance every couple of years. However, in 2005 Dennard Scaling stopped. The idea is that as we shrink transistors, electrons don't have to move as far, which is why frequency increases.
Store
Moves data from a register to a memory location
Pipeline Control
Need to control functional units, but they are working on different instructions! Just pipeline the control signals along with the data.
Assuming a in x1 and base address of B in x2, a correct assembly code representation of a = B[5000] is: ld x1, 5000(x2)
No, this is not correct RISC-V assembly: B[5000] holds 8-byte values, so the byte offset would be 5000 x 8 = 40000, which (like 5000 itself) does not fit in the 12-bit signed immediate field
How does power scale with load?
Not well. Ex. 100% load = 258 watts; 20% load = 121 watts. 121/258 is about 47%, not 20%, so power does not scale proportionally with load.
Technology Scaling in the present
Only Moore's Law, no Dennard Scaling. Nowadays we have started adding more cores to make up for the lack of frequency progress. Ex: 1 core at 100 MHz vs 2 cores at 80 MHz: each core takes a 20% reduction in clock rate but about a 50% reduction in power consumption.
Control: addu
Only needs RegWrite to be 1. Since it uses two registers, ALUSrc is 0. No memory needs to be written, so the Data Memory box is completely avoided.
Period to Frequency Conversions
Frequency = 1 / Period. Period 1 s -> 1 Hz; 1.0E-3 s -> 1 kHz; 1.0E-6 s -> 1 MHz; 1.0E-9 s -> 1 GHz
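The reciprocal relationship can be sketched in Python (mine, not from the slides; it also reproduces the 8 GHz -> 125 ps card above):

```python
# Frequency and period are reciprocals: f = 1 / T.
def freq_hz(period_s):
    return 1.0 / period_s

print(freq_hz(1.0))      # 1 Hz
print(freq_hz(1e-3))     # 1 kHz
print(freq_hz(250e-12))  # 4 GHz for a 250 ps clock period
print(1.0 / 8e9)         # 1.25e-10 s = 125 ps period for an 8 GHz clock
```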
The Big Change
Power = (Joules/Op) * (Ops/Second) (Joules/Op) = energy efficiency (Ops/Second) = performance -To improve performance we must reduce energy per op. -All computers are now power limited. -Popular approaches: co-optimize HW and SW, custom HW -This is why so many more companies design their own hardware nowadays: being power limited, they want hardware optimized to run their own software more efficiently.
Chip Power consumption formula
Power = C * Vdd^2 * F C = Capacitance Vdd = Voltage F = Frequency -Capacitance scales with number of transistors -In 2005 we hit maximum power per chip, due to very high heat. Multi core processors are now used.
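The formula's quadratic dependence on Vdd (which is why the "Supply Voltage" answer below is correct for reducing dynamic power) can be sketched in Python with made-up component values:

```python
# Dynamic power P = C * Vdd^2 * F.  Vdd enters squared, so halving Vdd
# cuts dynamic power 4x, while halving C or F only cuts it 2x.
def dynamic_power(C, Vdd, F):
    return C * Vdd ** 2 * F

base      = dynamic_power(1e-9, 1.0, 2e9)  # hypothetical C, Vdd, F
half_vdd  = dynamic_power(1e-9, 0.5, 2e9)
half_freq = dynamic_power(1e-9, 1.0, 1e9)
print(half_vdd / base)   # 0.25
print(half_freq / base)  # 0.5
```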
Power Consumption Formula
Power = C * Vdd^2 * F(0->1) + Vdd * I(leakage). Power = energy/second = (energy/instruction) x (instructions/second)
Power Consumption in Chips
Power = C*Vdd^2*F(0->1) + Vdd *I(leakage) -Dynamic or active power consumption -Charging or discharging capacitors -Depends on switching transistors and switching activity -Leakage current or Static Power consumption -Leaking diodes and transistors -Gets worse with smaller devices and lower Vdd -Gets worse at higher temperatures
Power formula
Power = energy/second = (energy/instruction * instruction/second)
Benchmarks
Programs used to measure performance -Supposedly typical of actual workload Warning: -Different benchmarks focus on different workloads -All benchmarks have shortcomings -Your design will be as good as the benchmarks you use
Pipeline Hazards
Situations that prevent completing an instruction every cycle This leads to CPI > 1
Forwarding Paths: Partial
Slide 21
RISC-V Calling Convention for registers
Slide 32
Call and Return
Slide 33 ISA
Digital Systems
Slide 35 ISA Digital systems have an internal clock. We try to do as much combinational logic as we can between each cycle. The cycle time defines how much time you have to do combinational logic.
In which memory segment would the variable count, listed in the C-code snippet below, be allocated from? int main(int argc, char *argv[]) { int count = 0; return count++; }
Stack
RISC-V Storage Layout
Stack and Heap grow toward each other; the Stack grows downwards. Memory also stores Static Data, Text/Code, and some reserved space. Stack and Heap grow during runtime; static data does not
Performance Effect from Stalls
Stalls can have a significant effect on performance. Consider: ideal CPI is 1, a RAW hazard causes a 3 cycle stall, and 40% of instructions cause a stall. The new effective CPI is 1 + 3 * 0.4 = 2.2 (and the real % is probably higher than 40%). You get less than 1/2 the desired performance!
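The effective-CPI arithmetic can be sketched in Python (mine, not from the slides; the function name is just for illustration):

```python
# Effective CPI = base CPI + stall cycles x fraction of instructions
# that incur the stall.
def effective_cpi(base_cpi, stall_cycles, stall_fraction):
    return base_cpi + stall_cycles * stall_fraction

print(effective_cpi(1.0, 3, 0.4))  # 2.2
```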
Control
State free: every instruction takes a single cycle, so you just need to decode the instruction bits -Few control points -Control on the multiplexors -Operation type for the ALU -Write control on the register file and data memory
You want to reduce the dynamic power consumption of a microprocessor. Assume you can improve one of the chip properties below by reducing it by 2x. Which one would you choose?
Supply Voltage
Instruction Set Architecture (ISA)
The HW/SW interface -The contract between SW and HW (compilers, assemblers, etc.). You want the ISA to be generic, giving power to the user. The ISA is the link between SW and HW: the processor, memory, and I/O system all operate on bits and gates, and the ISA maps human-readable instructions to them and back.
Complexity
The limiting factor in modern chip design. We try to hide complexity using Abstraction. We also design reusable hardware components. Reusable components: Cores, CPUs, MCU
Critically important for Pipelining
The number of stages determines the upper bound on the improvement you can achieve by pipelining. You never want to stall the pipeline: stall time needs to be avoided to improve speedup
Data Dependencies
These are dependencies for instruction j following instruction i. Read After Write (RAW or true dependence) -Instruction j tries to read an operand before instruction i writes it. Write After Write (WAW or output dependence) -Instruction j tries to write an operand before i writes its value. Write After Read (WAR or anti dependence) -Instruction j tries to write a destination before it is read by i
HW1 Q1 disable-asm
disable-asm adds much more time overall: an average time of 31 seconds, compared to 5 seconds regularly. Regular is about 6 times faster.
beq
beq reg1, reg2, label: if reg1 == reg2, jump to label.
Clock Frequency (rate)
cycles per second E.g. 4.0 GHz = 4000MHz = 4.0 * 10^9 Hz
Clock Period
duration of a clock cycle E.g. 250ps = 0.25ns = 250 * 10^-12s This is the basic unit of time in all computers
Unsigned vs Signed Values
Given 4 bits: Unsigned range is [0, 2^4 - 1]; Signed (two's complement) range is [-2^3, 2^3 - 1]. Equivalence: same encodings for non-negative values. Uniqueness: every pattern represents a unique integer (not true with sign-magnitude).
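The ranges and the reinterpretation of a bit pattern can be sketched in Python (mine, not from the slides; the function names are just for illustration):

```python
# n-bit unsigned vs two's-complement signed ranges, plus reinterpreting
# one bit pattern both ways.
def unsigned_range(n):
    return (0, 2 ** n - 1)

def signed_range(n):
    return (-(2 ** (n - 1)), 2 ** (n - 1) - 1)

def to_signed(pattern, n):
    # reinterpret an n-bit unsigned pattern as two's complement
    return pattern - 2 ** n if pattern >= 2 ** (n - 1) else pattern

print(unsigned_range(4))     # (0, 15)
print(signed_range(4))       # (-8, 7)
print(to_signed(0b1111, 4))  # -1
```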
What data is not part of a stack frame?
global variables
Hertz Conversions (1 Hertz)
kilohertz = 1.0E3 Hz megahertz = 1.0E6 Hz gigahertz = 1.0E9 Hz
HW3 rest
look at assignment
Seconds Conversions (1 second)
millisecond = 1.0E-3 microsecond = 1.0E-6 nanosecond = 1.0E-9 picosecond = 1.0E-12
Load
moves data from memory location to a register
Processors
Programmable HW components: a generic HW component with an expressive and stable interface to SW. Processor capability grows with more and faster transistors, which allow higher speeds.
Storing Data
Reverse of load: copy data from a source register to an address in memory. sd src, offset(base)
In RISC-V the stack pointer points to
the top of the stack (which is at the lowest address in use, since the stack grows downward)
Harmonic Mean
Use with rates, not with times. Represents total execution time