CSE120 all

Decreasing Stalls: Forwarding

"Forward" the data to the appropriate unit Slide 19 This eliminates stalls for dependencies between ALU instructions

Relative Performance

"x is n times faster than y" means (Performance x/Performance y) = (Execution y/Execution x) = n Ex. A: Cycle Time = 250ps, CPI = 2.0 B: Cycle Time = 500ps, CPI = 1.2 A is faster okay bro I = number of instructions CPU Time A = I * 2.0 * 250 = 500 *I ps CPU Time B = I * 1.2 * 500 = 600 * I ps Execution Time B / Execution Time A - 600 / 500 = 1.2 So A is 1.2 times faster than B

NOP example

- NOP = Instruction that does nothing, e.g. or $0, $0, $0

Load-Use Case: Hardware Stall

-A pipeline interlock checks and stops the instruction issue

Fixed Length

-Address of next instruction is easy to compute -Worse code density: common instructions are as long as the rare ones.

Variable Length

-Better code density -x86 averages 3 bytes (from 1 to 16 per instruction) -common instructions are shorter -Less instruction memory to fetch -Fetch and decode are more complex

Forwarding Limitation: Load-Use Case

-Data is not available yet to be forwarded

Dependencies vs Hazards

-Dependencies are a property of your code: you can find them just by reading the instructions. -Hazards are dependencies that have become an issue in the current pipeline.

RAW Hazard Example

-Dependencies backwards in time are hazards (Slide 16 in Pipelining). When an instruction reads a register before an earlier instruction has written it, you have a hazard.

RISC-V ISA

-Developed by UC Berkeley -reduced instruction set computer (RISC) -Simple, easy to understand, elegant -not bloated like x86 Design Principles of RISC-V: -simplicity favours regularity -smaller is faster -good design makes good compromises -make the common case fast

Semiconductor Chips

-Dominant tech for integrated circuits -Print-Like Process -Print resolution improves over time -> more devices

ISA does NOT define

-How instructions are implemented -How fast/slow instructions are -How much power instructions consume

perf record

-Low overhead sampling -Lists time spent per function

Time Application

-Measures execution time of the application -Distinguishes between user and kernel

Power and Energy

-Power Density (cooling) -limits compaction and integration e.g. cellphone cannot exceed 1-2W -Battery Life for mobile devices -Reliability at high temperatures -Cost -Energy Cost -Cost of power delivery, cooling system, packaging -Environmental Issues -IT responsible for 0.53 billion tons of CO2 in 2002.

Energy Proportionality

-Power consumption should scale with performance -Most modern computers are not energy proportional, and we don't yet know how to make them so

Decreasing Stalls: Fast RF

-Register file writes on first half and reads on second half

ISA defines

-The state of the system (Registers, Memory) -The functionality of each HW instruction -The encoding of each HW instruction

Processors

-These are programmable HW components. -A generic HW component with an expressive and stable interface to SW. -The SW can define the functionality of HW through programming.

Complexity Challenge

-limiting factor in modern chip design -want to use all transistors on a chip Only way to survive: -Hide complexity using abstraction -design reusable hardware components

Arithmetic Mean

-use with times, not with rates Represents total execution time

perf topdown

-uses hardware performance counters (PMU) -enables microarchitectural studies -4 categories: Backend Bound, Retiring, Bad Speculation, Frontend Bound These categories give you an idea of where time and energy are being spent in the program.

Maximum Speedup

1 / (1 - Fraction) Example: 80% parallelizable, what is the max speedup? = 1 / (1 - 0.8) = 1 / 0.2 = 5 times
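
The limit above can be checked with a few lines of Python (a sketch; the function name is mine):

```python
def max_speedup(parallel_fraction):
    # Amdahl's Law with the parallel part sped up infinitely:
    # speedup approaches 1 / (1 - fraction)
    return 1 / (1 - parallel_fraction)

# 80% parallelizable -> at most 5x, no matter how many cores
print(max_speedup(0.8))
```

The serial 20% puts a hard ceiling on speedup, which is why it is never worth over-optimizing the parallel part alone.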

Big Picture: Running a Program

1.High Level Language Program vvv (compiler) 2.Assembly Language Program vvv (assembler) 3.Machine Language Program vvv (Machine Interpretation) 4.Control Signal Specification

Procedure Call Steps

1.Place parameters in a place where the procedure can access them 2.Transfer control to the procedure -Some kind of jump 3.Allocate the memory resources needed for the procedure 4.Perform the desired task (procedure body) 5.Place the result value in a place where the calling program can access it 6.Free the memory allocated in (3) 7.Return control to the point of origin slide 29 of ISA

What is the clock cycle time of a processor with a clock frequency of 8GHz?

125ps

Assume a code sequence that uses instruction types A, B, C. Given below are the instruction count and the CPI of each instruction type. What is the average CPI of the sequence? Class: A B C; CPI for class: 1 4 2; IC: 4 2 6

2

What frequency is a processor operating at that has a clock cycle time of 400us?

2.5KHz
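
Both quiz answers above (125ps for 8 GHz, 2.5 kHz for 400us) follow from period = 1 / frequency; a quick check in Python (a sketch; the function names are mine):

```python
def cycle_time(freq_hz):
    # clock cycle time is the reciprocal of clock frequency
    return 1.0 / freq_hz

def clock_rate(cycle_time_s):
    # clock frequency is the reciprocal of the cycle time
    return 1.0 / cycle_time_s

print(cycle_time(8e9))     # 8 GHz -> 1.25e-10 s = 125 ps
print(clock_rate(400e-6))  # 400 us -> 2500 Hz = 2.5 kHz
```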

How many RISC-V instructions are required at a minimum to implement the following C-code snippet: a = b + *c + *d assume a,b,c,d are in registers.

4

Assume the clock of the digital system above is 2 GHz. Select the maximum propagation delay for the combinatorial logic block:

450ps

sll x1, x1, 3 performs a multiplication of the value in x1 by

8

In the RISC-V ISA there exist 32 general-purpose registers. Assume we wanted to increase the number of registers to 256. To enable this, the machine instruction format of three-operand instructions (e.g. add x1, x2, x3) would require:

9 more bits

Execution Time

= (Instructions/Program) * (Clock cycles/Instruction) * (Seconds/Clock Cycle) =Seconds/Program

HW1 Q3 Amdahl's Law

P = (N - N*S) / (S*(1 - N)) where S = speedup, N = number of threads, P = parallelizable fraction of code. Threads = 8, speedup = 2.9: P = (8 - 8*2.9) / (2.9*(1 - 8)) = 0.748 = ~75%
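
The same rearrangement, solving Amdahl's Law for P, in Python (a sketch; the function name is mine):

```python
def parallel_fraction(speedup, threads):
    # From S = 1 / ((1 - P) + P/N), solve for P:
    # P = N*(S - 1) / (S*(N - 1)),
    # algebraically equal to the card's (N - N*S) / (S*(1 - N))
    n, s = threads, speedup
    return n * (s - 1) / (s * (n - 1))

print(parallel_fraction(2.9, 8))   # about 0.75
```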

Control: Load

RegWrite is activated. The load address is a base register plus an offset, so ALUSrc is set. MemtoReg is set because we read memory and then write the value into a register. MemWrite is set to 0 because we aren't writing to memory.

Examples of Benchmarks

SPEC CPU2006 (integer and FP benchmarks) TPC-H and TPC-W (database benchmarks) EEMBC (embedded benchmarks)

Tech Scaling Past and Present

Used to be Moore's Law plus Dennard Scaling. Moore's Law - more transistors. Dennard Scaling - lower Vdd. In 2005 Dennard Scaling stopped working, and now there is a 32x gap per decade in chip capability compared to past scaling. That's why we use multiple cores now.

Why stop at 5 pipeline stages?

Three issues: -Some things have to complete in one cycle (you can't split the washing machine into two steps) -CPI is not really 1 -Cost (area and power)

Measuring Performance (Linux)

Time application perf record perf topdown

What does Moore's Law say:

Transistor density increases by 2x every 3 years

Dependency Examples

True dependency => RAW hazard: addu xt0, xt1, xt2 / subu xt3, xt4, xt0. Output dependency => WAW hazard: addu xt0, xt1, xt2 / subu xt0, xt4, xt5. Anti dependency => WAR hazard: addu xt0, xt1, xt2 / subu xt1, xt4, xt5.

Instruction Set Architecture

What is the HW/SW interface? -RISC-V ISA

Reducing Stalls

When is new data actually available? The register file is typically fast, so we can write in the first half of a cycle and read in the second half. This allows two steps in one cycle, eliminating at least 1 cycle from the stall length. This decrease of 1 cycle still only pushes performance to a little more than 1/2 of what you want.

Execution (or CPU) Time

=Cycles Per Program * Clock Cycle Time OR =Cycles Per Program / Clock Rate -Execution time is improved by reducing the number of clock cycles or increasing the clock rate. -These two goals are not always compatible, so you must often trade off clock rate against cycle count


Data Transfer Instructions: Load

Operator name, Destination register, Base register address and constant offset ld dst, offset (base)

Performance Example

A: cycle time = 250ps, CPI = 2.0 B: cycle time = 500ps, CPI = 1.2 Which is faster and by how much? CPU Time A = Instruction Count x CPI A x Cycle Time A =I x 2.0 x 250ps = I x 500ps CPU Time B = Instruction Count x CPI B x Cycle Time B =I x 1.2 x 500ps = I x 600ps CPU Time B / CPU Time a = I x 600ps / I x 500ps =1.2 A is faster by 1.2 times
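
The example above can be reproduced in Python (a sketch; the function and variable names are mine):

```python
def cpu_time(instr_count, cpi, cycle_time):
    # CPU Time = IC x CPI x clock cycle time
    return instr_count * cpi * cycle_time

I = 1_000_000                    # any instruction count cancels out
a = cpu_time(I, 2.0, 250e-12)    # machine A
b = cpu_time(I, 1.2, 500e-12)    # machine B
print(b / a)                     # A is 1.2x faster
```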

Performance is affected by

Algorithm: IC and CPI Programming Languages: IC and CPI Compiler: IC and CPI Instruction Set Architecture: IC and CPI HW Design: CPI and Tc

Amortization

All hardware used all the time

RISC-V Format

All instructions are 32 bits to simplify decoding. -Smaller is faster. -Requires interpreting bit fields differently for different instructions (R vs I vs S/B vs U/J) -Simplicity favors regularity -Limits the number of registers to 32 (5 bits per operand field) -Good designs demand good compromises

Memory Data Transfer

All memory access happens through loads and stores -Store sends both data and an address to memory -Load sends an address to memory, and the data comes back into a register.

Consider the following instruction: Ld x2, offset(x3) Which of the following statements is correct?

All three are correct

Useful Techniques for Evaluating Efficiency

Amdahls Law and Benchmarks

Heap

An area of memory that our running code (thread or process) can use to allocate memory on the fly

Stack

An array-like data structure in memory in which data can be stored and removed from a location called the 'top' of the stack. Contains function local variables. Whenever you call a new function, you extend the stack downward to contain all of the new variables. You can deallocate all of the new variables very quickly by just moving the stack boundary back.

Summarizing Benchmarks

Arithmetic Mean, Harmonic Mean, Geometric Mean

Examples of Code in RISC-V

Around slide 20 in ISA sheet

Arrays

Arrays are really pointers to the base address in memory. Each value in the array is offset from the base by a constant amount: the index times the size of an element.
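
The address arithmetic can be sketched in Python (assuming 8-byte doublewords, as with RISC-V ld/sd; the function name is mine):

```python
def element_address(base, index, elem_size=8):
    # address of A[index]: base pointer plus index times element size
    return base + index * elem_size

# e.g. the 5000th doubleword sits 5000 * 8 = 40000 bytes past the base
print(hex(element_address(0x1000, 5000)))
```

This is why `B[5000]` cannot be accessed with `ld x1, 5000(x2)`: the byte offset is 40000, not 5000.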

HW1 Q2 parallel Speedup

As you add more threads to the mix (up to 8), a regular system achieves a speedup of about 3, disable-asm has a speedup of about 5.

Transistor

Basic Building block for logic

Combinational Logic

Boolean algebra

The seven Controls

Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite

Controls

Branch - 1 if branching MemRead - 1 if reading from memory MemtoReg - 1 if putting memory into a register ALUOp - tells ALU what operation to do MemWrite - 1 if writing to memory ALUSrc - 1 if using an immediate RegWrite - 1 if Writing a value into a register

CPU Time formula

CPU Time = Instruction Count * CPI/Clock Rate CPU Time = Execution Time

Abstraction

Captures design at different levels of representation. Stable interfaces expose functionality but not low-level implementation details of lower levels.

Abstraction

Captures design at different levels of representation. Stable interfaces expose functionality but not low-level implementation details of lower levels. We really only care about two layers in this class: -Instruction Set Architecture -Compute Cores/Memories

CPI example

Class: A B C; CPI for class: 1 2 3; IC in seq. 1: 2 1 2; IC in seq. 2: 4 1 1. CPI(1) = (1*2 + 2*1 + 3*2) / 5 = 10/5 = 2 (5 = IC sum). CPI(2) = (1*4 + 2*1 + 3*1) / 6 = 9/6 = 1.5

CPI Example

Class: A B C; CPI for class: 1 2 3; IC in sequence 1: 2 1 2; IC in sequence 2: 4 1 1. CPI 1 = (1 x 2 + 2 x 1 + 3 x 2) / 5 = 10/5 = 2. CPI 2 = (1 x 4 + 2 x 1 + 3 x 1) / 6 = 9/6 = 1.5. Sequence 2 is faster by 2/1.5
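
Both sequences can be checked with a small weighted-average helper (a sketch; names are mine):

```python
def average_cpi(class_cpis, instruction_counts):
    # weighted average: total cycles divided by total instructions
    cycles = sum(c * n for c, n in zip(class_cpis, instruction_counts))
    return cycles / sum(instruction_counts)

print(average_cpi([1, 2, 3], [2, 1, 2]))   # sequence 1 -> 2.0
print(average_cpi([1, 2, 3], [4, 1, 1]))   # sequence 2 -> 1.5
```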

Instruction Count and CPI

Clock Cycles = Instruction Count x Cycles Per Instruction. CPU Time = Instruction Count x CPI x Clock Cycle Time OR CPU Time = (Instruction Count x CPI) / Clock Rate -Instruction Count (IC) - determined by program, ISA, and compiler -Average cycles per instruction (CPI) - determined by HW design; if different instructions have different CPI, take the average over all of them.

Calculating CPI

Clock Cycles = sum over instruction classes of (CPI of class x instruction count of class)

Benchmark Suite

Collection of benchmarks -plus datasets, metrics, and rules for evaluation -Plus a way to summarize performance in one number

How can we delay the younger instruction?

The compiler inserts independent work or NOPs ahead of it

Challenges for Performance

Complexity, Efficiency

Execution Time example

Computer A: 2GHz clock, 10s CPU Time Designing Computer B: Aim for 6s CPU Time -Can do faster clock, but causes 1.2x clock cycles. How fast must Computer B be? Clock Rate b = Clock Cycles b/CPUTime b =1.2x Clock Cycles a/6s Clock Cycles a = CPUTime a * Clock Rate a =10s * 2GHz = 20 * 10^9 Clock Rate b = 1.2 * 20 * 10^9/6s =24*10^9/6s = 4GHz
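
The steps above as a Python sketch (using the card's numbers):

```python
# Computer A: 2 GHz clock, 10 s CPU time
cycles_a = 10 * 2e9                 # CPU Time A x Clock Rate A = 20e9 cycles
# Computer B: 1.2x the cycles of A, target 6 s CPU time
clock_rate_b = 1.2 * cycles_a / 6   # required clock rate: 4 GHz
print(clock_rate_b)
```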

RISC-V Constants

Constant == immediate == literal == offset

Single Cycle Processor Performance

Cycle speed determined by slowest instruction.

Clock Frequency (rate)

Cycles per second

Kinds of Hazards

Data Hazard -Must wait for previous instructions to produce/consume data Control Hazard -Next PC depends on previous instruction Structure Hazard -A required resource is busy

Solutions for RAW Hazards

Delay the reading instruction until data is available. Also called stalling or inserting pipeline bubbles

Moore's Law

Devices get smaller. Also more devices on a chip/devices get faster. Initial graph from 1965 paper Prediction: 2x density per year Reality: ~2x density every 3 years Not a law, just an observation that has held true for over 50 years.

Latency

Digital HW operates using a constant-rate clock


How to Stall the Pipeline

Discover the need to stall when the 2nd instruction is in the ID stage -Repeat its ID stage until the hazard is resolved -Let all instructions ahead of it move forward -Stall all instructions behind it 1.Force control values in the ID/EX register to a NOP instruction -As if you fetched or $0, $0, $0 -When it propagates to EX, MEM and WB, nothing will happen 2.Prevent update of the PC and the IF/ID register -The using instruction is decoded again -The following instruction is fetched again

Moore's Law

Double transistors every two years. In reality, it's more like every 3.

Clock Period

Duration of a clock cycle Example: 250ps = 0.25ns = 250 x10^-12s This is the basic unit of time in all computers

Energy =/= Power

Dynamic or active power consumption. Joules = Watts * seconds. You can improve energy by reducing power consumption or by improving execution time.

What should we optimize processors for?

EPI = Energy per instruction After minimizing EPI you can tune performance and power as needed.

StackFrame or Procedure Activation Record

Each procedure creates an activation record on the stack (Slide 32 of ISA). The stack frame holds, from higher to lower addresses: -Saved arguments (if any) -Saved return address -Saved registers (if any) -Local arrays and structures (if any). The stack grows downwards.

Data Hazards - Stalls

Eliminate reverse time dependency by stalling

Energy =/= Power

Energy = Average Power * Execution Time -Joules = Watts * Seconds -Power is limited by infrastructure (e.g. power supply) -Energy: what the utilities charge for or battery can store -you can improve energy by -reducing power consumption -Or by improving execution time -Race to Halt!

What should we optimize processors for?

Energy per instruction (EPI) -After minimizing EPI, tune performance and power as needed -higher for server, lower for cellphone

Performance Summary

Execution Time = ( Instructions/Program ) x ( Clock Cycles/Instruction ) x ( Seconds / Clock Cycle )

RISC-V designed for pipelining

Few and regular instruction formats - can decode and read registers in one step.

fewer or more registers?

Fewer registers need fewer bits to encode and are faster to access, but you need more memory accesses (spilling values). More registers require more bits to encode them all, making access a bit slower, but you access memory less.

Geometric Mean

Good with normalized performance Does not represent total execution time

Identifying the Forwarding Paths

Identify all stages that produce new values -EX and MEM Add stages after first producer as sources of forwarding data -EX, MEM, WB Identify all stages that really consume values -EX Add multiplexor for each input operand for each consumer -2 Multiplexors have sources+1 inputs -EX, MEM, WB + ID (register file)

Branching Far Away

If the target is more than -2^11 to 2^11-1 words away, then the compiler inverts condition and inserts an unconditional jump

RISC-V Instruction Format

Image 1 in file

Instruction Execution

Image 2 -Start with the before state of the machine -PC, Regs, Memory -PC used to fetch an instruction from memory -Instruction directs how to compute the after state of the machine -For add: PC + 4 and rd = rs + rt -This happens atomically Referencing Image 2: -The entire line of 0's and 1's represents a single instruction. -The two blue bars represent the source registers -The line in the memory box represents the instruction coming from memory -The result is put into a destination register, the red bar on the right. -The PC is also incremented to receive the next instruction in line -This all happens in one step, making it atomic

Pipelining

Executing steps from multiple instructions simultaneously. If there are 6 steps and they all take the same time, then the maximum speedup is 6 times. Having steps of different lengths is less efficient and lowers speedup. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Multiple tasks operate simultaneously. Potential speedup = number of stages. Pipeline rate is limited by the slowest stage. Time to fill and drain the pipeline lowers speedup.

Instruction Count and CPI

Instruction Count (IC) for a program - Determined by program, ISA, and compiler Average cycles per instruction (CPI) - determined by HW design. If different instructions have different CPI, then average CPI is affected by instruction mix.

HW2 Q6

Instructions = 4359, Average CPI = 2.12, Clock rate = 3 GHz = 3 x 10^9 Hz. Execution Time = (4359 * 2.12) / (3 * 10^9) = 3.08us
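
The same calculation in Python (a sketch; variable names are mine):

```python
instructions = 4359
avg_cpi = 2.12
clock_rate = 3e9                 # 3 GHz

# Execution Time = IC x CPI / Clock Rate
exec_time = instructions * avg_cpi / clock_rate
print(exec_time)                 # about 3.08e-06 seconds = 3.08 us
```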

HW2 any other question

Just look at the assignment.

Stack Discipline

Last In, First Out (LIFO). State for a given procedure is needed for a limited time: from when it is called to when it returns. The stack is allocated in frames.

Performance: Latency vs Throughput

Latency - response/execution time. How long it takes to do a task. Throughput - Total work done per unit time (e.g. queries/sec)

Performance: Latency vs Throughput

Latency or response/execution time: How long it takes to do a task Throughput: Total work done per unit time(e.g. queries/second) Speeding up a single core improves both Latency and Throughput. Adding more cores improves Throughput but not Latency. Latency helps to improve Throughput but not necessarily the other way around.

Mapping Memory

Loads move data from I/O device register to a CPU register Stores move data from a CPU register to a I/O device register

For perf stuff

Look at Q4-6 in HW1

Amdahl's Law

Make the common case efficient -Given an optimization x that accelerates fraction fx of a program by a factor of Sx, how much is the overall speedup? Speedup = CPUTime old/CPUTime new = 1/((1-fx) + fx/Sx) -Lessons from Amdahl's Law -Make common cases fast, but don't over-optimize the common case -Speedup is limited by the fraction of the code accelerated -The uncommon case will eventually become the common one -Amdahl's Law applies to cost, power consumption, energy

Amdahl's Law

Make the common case efficient. -Given an optimization x that accelerates fraction fx of a program by a factor of Sx, how much is the overall speedup? Speedup = CPUTime old / CPUTime new = CPUTime old / (CPUTime old * [(1-fx) + fx/Sx]) = 1 / ((1-fx) + fx/Sx) Lessons from Amdahl's Law -Make common cases fast: as fx -> 1, speedup -> Sx -Don't over-optimize the common case: as Sx -> infinity, speedup -> 1 / (1-fx) -The uncommon case will eventually become the common one. Applies to everything.
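
The general formula as a helper, including the Sx -> infinity limit (a sketch; the function name is mine):

```python
def amdahl_speedup(fx, sx):
    # fx: fraction of execution time accelerated
    # sx: speedup applied to that fraction
    return 1.0 / ((1.0 - fx) + fx / sx)

print(amdahl_speedup(0.5, 2.0))            # half the program 2x faster: ~1.33x
print(amdahl_speedup(0.8, float('inf')))   # limit as Sx -> infinity: ~5x
```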

Efficiency Challenge

Many Factors to consider: Transistors, Parallel App performance, Single-Thread Performance (SpecInt), Frequency (MHz), Typical Power (Watts), Number of Cores

Efficiency

Many factors to consider when attempting to improve efficiency. Transistors, Parallel App Performance, Single-Threaded Performance (SpecInt), Frequency (MHz), Typical Power (Watts), and Number of Cores all affect efficiency.

In RISC-V, Ld and Sd are

Memory transfer Instructions

Technology Scaling in the past

Moore's Law (more transistors) + Dennard Scaling (lower Vdd). This combination led to a huge increase in performance every couple of years. However, in 2005, Dennard Scaling stopped. The idea is that as we shrink transistors, electrons don't have to move as far; this is why frequency increases.

Store

Moves data from a register to a memory location

Pipeline Control

Need to control functional units, but they are working on different instructions! Just pipeline the control signals along with the data.

Assuming a in x1 and base address of B in x2, a correct assembly code representation of a = B[5000] is: ld x1, 5000(x2)

No, this is not a correct RISC-V assembly

How does power scale with load?

Not well. Ex. 100% load = 258 watts; 20% load = 121 watts. 20/100 =/= 121/258
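
A quick check of those figures (a sketch): an energy-proportional server at 20% load would draw about 20% of peak power, but the card's numbers give nearly half.

```python
peak_power = 258      # watts at 100% load (figures from the card)
power_at_20 = 121     # watts at 20% load

fraction = power_at_20 / peak_power
print(round(fraction, 3))   # ~0.469, far above the 0.2 an ideal system would draw
```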

Technology Scaling in the present

Only Moore's Law, no Dennard Scaling. Nowadays we have started adding more cores to make up for the lack of progress. 1 core at 100 MHz vs 2 cores at 80 MHz: a 20% reduction in clock rate but a 50% reduction in power consumption.

Control: addu

Only needs RegWrite to be 1. Since it uses two registers, ALUSrc is 0. No memory needs to be written, so the Data Memory box is completely avoided.

Execution Time Example

PC A: 2GHz clock, 10s CPU time Designing PC B -Aim for 6s CPU time -Can do faster clock, but causes 1.2x clock cycles How fast must PC B be? Clock Rate B = Clock Cycles B / CPU Time B = 1.2x Clock Cycles A / 6s Clock Cycles A = CPU Time A * Clock Rate A =10s x 2GHz = 20 x 10^9 Clock Rate B = (1.2 x 20 x 10^9) / 6s = 24 x 10^9 / 6s = 4GHz

Period to Frequency Conversions

A period of 1 second corresponds to a frequency of: 1 Hz = 1.0E-3 kHz = 1.0E-6 MHz = 1.0E-9 GHz

The Big Change

Power = (Joules/Op) * (Ops/Second) (Joules/Op) = energy efficiency (Ops/Second) = performance -To improve performance we must reduce energy. -All computers are now power limited. -Popular approaches: co-optimize HW and SW, custom HW -This is why so many more companies are designing their own hardware nowadays. With limited potential, they want to optimize their hardware to run their own stuff more efficiently.

Chip Power consumption formula

Power = C * Vdd^2 * F C = Capacitance Vdd = Voltage F = Frequency -Capacitance scales with number of transistors -In 2005 we hit maximum power per chip, due to very high heat. Multi core processors are now used.


Power Consumption Formula

Power = C * Vdd^2 * F0->1 + Vdd * I(leakage) Power = energy / second = ( energy / instruction ) x ( instructions / second )
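A sketch of why the quiz below favors supply voltage: dynamic power scales with Vdd squared but only linearly with capacitance and frequency (function name is mine; the leakage term is omitted):

```python
def dynamic_power(c, vdd, f):
    # dynamic power = C * Vdd^2 * F
    return c * vdd ** 2 * f

base = dynamic_power(1.0, 1.0, 1.0)
print(base / dynamic_power(0.5, 1.0, 1.0))  # halving C: 2x reduction
print(base / dynamic_power(1.0, 0.5, 1.0))  # halving Vdd: 4x reduction
print(base / dynamic_power(1.0, 1.0, 0.5))  # halving F: 2x reduction
```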

Power Consumption in Chips

Power = C*Vdd^2*F(0->1) + Vdd *I(leakage) -Dynamic or active power consumption -Charging or discharging capacitors -Depends on switching transistors and switching activity -Leakage current or Static Power consumption -Leaking diodes and transistors -Gets worse with smaller devices and lower Vdd -Gets worse at higher temperatures

The Big Change

Power = Joules/Op *Ops/second To improve performance we must reduce energy, otherwise we cannot use additional transistors. All computers are now power limited. To get around this companies design their own HW to compensate for their custom SW.


Benchmarks

Programs used to measure performance -Supposedly typical of actual workload Warning: -Different benchmarks focus on different workloads -All benchmarks have shortcomings -Your design will be as good as the benchmarks you use


Pipeline Hazards

Situations that prevent completing an instruction every cycle This leads to CPI > 1

Forwarding Paths: Partial

Slide 21

RISC-V Calling Convention for registers

Slide 32

Call and Return

Slide 33 ISA

Digital Systems

Slide 35 ISA Digital systems have an internal clock. We try to do as much combinational logic as we can between each cycle. The cycle time defines how much time you have to do combinational logic.

In which memory segment would the variable count, listed in the C-code snippet below, be allocated from? int main(int argc, char *argv[]) { int count = 0; return count++; }

Stack

RISC-V Storage Layout

Stack and Heap grow toward each other, Stack grows downwards. Also stores Static data, Text/Code and some reserved trash. Stack and Heap grow during runtime, static data does not

Performance Effect from Stalls

Stalls can have a significant effect on performance. Consider: ideal CPI is 1, and a RAW hazard causes a 3-cycle stall. If 40% of the instructions cause a stall, the new effective CPI is 1 + 3 * 0.4 = 2.2. And the real % is probably higher than 40%. You get less than 1/2 the desired performance!
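
The effective-CPI calculation as a helper (a sketch; names are mine):

```python
def effective_cpi(base_cpi, stall_cycles, stall_fraction):
    # each stalling instruction adds stall_cycles extra cycles on average
    return base_cpi + stall_cycles * stall_fraction

# ideal CPI of 1 becomes 2.2 when 40% of instructions stall 3 cycles
print(round(effective_cpi(1, 3, 0.4), 2))   # 2.2
```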

Control

Stateless: every instruction takes a single cycle, you just need to decode the instruction bits -Few control points -Control on the multiplexors -Operation type for the ALU -Write control on the register file and data memory

You want to reduce the dynamic power consumption of a microprocessor. Assume you can improve one of the chip properties below by reducing it by 2x. Which one would you choose?

Supply Voltage

Instruction Set Architecture (ISA)

The HW/SW interface -The contract between SW and HW (compilers, assemblers, etc.) This is all because you want ISA to be generic. The user should have power in this. ISA is link between SW and HW -Processor, Memory, I/O System all use bits and gates -ISA converts back to human readable stuff and vice versa

Complexity

The limiting factor in modern chip design. We try to hide complexity using Abstraction. We also design reusable hardware components. Reusable components: Cores, CPUs, MCU

Critically important for Pipelining

The number of stages determines the upper bound on the improvement you can achieve by pipelining. You never want to stop a pipeline: stall time needs to be avoided to improve speedup.

Data Dependencies

These are dependencies for instruction j following instruction i. Read after Write (RAW or true dependence) -Instruction j tries to read an operand before instruction i writes it. Write after Write (WAW or output dependence) -Instruction j tries to write an operand before i writes its value. Write after Read (WAR or anti-dependence) -Instruction j tries to write a destination before it is read by i.

HW1 Q1 disable-asm

Adds much more time overall: an average time of 31 seconds, compared to 5 seconds regularly. Regular is 6 times faster.

beq

beq reg1, reg2, label if reg1 = reg2 jump to label.

Clock Frequency (rate)

cycles per second E.g. 4.0 GHz = 4000MHz = 4.0 * 10^9 Hz


Unsigned vs Signed Values

given 4 bits Unsigned: [0, 2^4 - 1] Signed: [-2^3, 2^3 - 1] Equivalence - same encodings for non-negative values Uniqueness - Every pattern represents a unique integer - Not true with sign magnitude
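
The ranges generalize to any bit width; a sketch in Python (function names are mine):

```python
def unsigned_range(bits):
    # unsigned n-bit integers span [0, 2^n - 1]
    return (0, 2 ** bits - 1)

def signed_range(bits):
    # two's complement: [-2^(n-1), 2^(n-1) - 1], one extra negative value
    return (-2 ** (bits - 1), 2 ** (bits - 1) - 1)

print(unsigned_range(4))   # (0, 15)
print(signed_range(4))     # (-8, 7)
```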

What data is not part of a stack frame?

global variables

Hertz Conversions (1 Hertz)

kilohertz = 1.0E3 Hz megahertz = 1.0E6 Hz gigahertz = 1.0E9 Hz

HW3 rest

Look at the assignment.

Seconds Conversions (1 second)

millisecond = 1.0E-3 microsecond = 1.0E-6 nanosecond = 1.0E-9 picosecond = 1.0E-12

Load

moves data from memory location to a register

Processors

programmable HW components. Generic HW component with expressive and stable interface to SW. Processor capability grows with more and faster transistors, they can achieve higher speeds.

Storing Data

reverse of load, copy data from source registers to an address in memory sd src, offset (base)

In RISC-V the stack pointer points to

the top of the stack (technically on the bottom of the stack though)

Harmonic Mean

use with rates not with times Represents total execution time

