Computer Organization & Design ARM - review
purpose of X30 (LR):
link register (return address)
what approach do computers take with division?
long division
Mean time to failure (MTTF)
reliability measure. you want a HIGH MTTF
What is a wide area network (WAN)?
the internet
In clearing an array, the compiler can achieve same effect as...
the manual use of pointers
how many characters in ascii?
128 (95 graphic, 33 control)
Machine Language
A mixture of electrical signals/string of binary bits that are unintelligible to humans.
ARM instructions are similar to those of...
MIPS
Stack is...
automatic storage
The Intel x86 ISA saw evolution in...
backward compatibility
in register offset addressing mode, another register is added to the ...
base register.
In an immediate pre-indexed load instruction, the content of the destination register changes _________ and the base register changes ________
based on the value fetched from memory; to the address that was used to access memory.
what is a block in the cache?
basic unit of cache storage; unit of copying. may be multiple words
in immediate pre-indexed addressing mode, when do addition and subtraction occur?
before the address is sent to memory.
Instructions are encoded in...
binary
what is ORR
bit-by-bit OR
2 components of a computer
datapath; control
what is a Header?
describes the contents of the object module
EOR operations are also called...
differencing operation
Addressing modes are...
different ways to get data into registers (or store from registers into RAM)
what are the 3 different ways of mapping cache?
direct mapped cache, n-way set, and fully associative
Many compilers produce object modules....
directly
what is the memory hierarchy?
disk memory at bottom (cheapest, but slowest), then DRAM, then SRAM (cache), then registers
what happens If divisor ≤ dividend bits
1 bit in quotient, subtract
direct mapped cache is...
1 to 1
What is the instruction to microoperation ratio of a simple instruction?
1-1
how long can it take for a new chip to be manufactured?
1-3 months
What is the instruction to microoperation ratio of a complex instruction?
1-many
Multithreading
Performs multiple threads of execution in parallel
Immediate operand avoids...
a load instruction
when comparing performance, we say...
"X is n times faster than Y"
64-bit data is called a...
"doubleword"
32-bit data called a...
"word"
what is the equation for total time a task will take?
# of instructions * # of clock cycles/instruction * clock cycle time
Execution Time
# of instructions × CPI × Clock Period OR #Clock Cycles×Clock period
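A minimal Python sketch of the execution time formula above. The function name and the sample numbers (2 billion instructions, CPI of 1.5, 2 GHz clock) are made up for illustration.

```python
def cpu_time(instruction_count, cpi, clock_period_s):
    """CPU time in seconds = instructions x cycles per instruction x seconds per cycle."""
    return instruction_count * cpi * clock_period_s


# Hypothetical program: 2 billion instructions, CPI 1.5, 2 GHz clock (0.5 ns period).
example_time = cpu_time(2_000_000_000, 1.5, 0.5e-9)
print(f"{example_time:.3f} s")  # roughly 1.5 seconds
```

Halving any one of the three factors halves the time, which is why instruction count or CPI alone says little about performance.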
Performance Ratios
(Performance A) ÷ (Performance B)
Amdahl's Law
(Execution time affected)/(Amount of improvement) + Execution time unaffected = Total Time
formula for cost per die
(cost per wafer)/(dies per wafer * yield)
Range of Signed ints
-2,147,483,648 to 2,147,483,647
List 3 things inside the processor
-Datapath -Control -Cache memory
3 additional ARMv8 features
-Flexible second operand -Additional addressing modes -Conditional instructions (e.g. CSET, CINC)
3 functions of an OS
-Handling input/output -Managing memory and storage -Scheduling tasks & sharing resources
2 features of instruction level parallelism
-Hardware executes multiple instructions at once -Hidden from the programmer
2 facts about embedded computer
-Hidden as components of systems -Stringent power/performance/cost constraints
2 facts about supercomputers
-High-end scientific and engineering calculations -Highest capability but represent a small fraction of the overall computer market
2 properties of a high level language
-Level of abstraction closer to problem domain -Provides for productivity and portability
3 examples of non-volatile secondary memory
-Magnetic disk -Flash memory -Optical disk (CDROM, DVD)
3 things that make parallel programming hard
-Programming for performance -Load balancing -Optimizing communication and synchronization
3 ways to improve cpu time
-Reducing number of clock cycles -Increasing clock rate -Hardware designer must often trade off clock rate against cycle count
Multicore processors
-contains more than one processing unit. -software must be written to specifically allow multiple jobs to be carried out simultaneously.
GPU architecture
-high number of cores, enabling it to handle multiple processes at once. -particularly good at processing multiple jobs in parallel.
GPU
-known as a co-processor. -traditionally responsible for the processing of large blocks of visual data very quickly.
what happens if divisor > dividend bits?
0 bit in quotient, bring down next dividend bit
Range of Unsigned ints
0 to 4,294,967,295
In regards to immediate constants, the integers __ through ____ will always work.
0, 4095
What is the sign extension (8 bits to 16 bits) of the following: +2: 0000 0010 =>
0000 0000 0000 0010
4 Types of general addressing in LEGv8
1. Immediate 2. Register 3. Base 4. PC-relative
what two steps are involved in array indexing?
1. Multiplying index by element size 2.Adding to array base address
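The two array indexing steps above, as a small Python sketch. The function name, base address and element size are illustrative assumptions, not something from the course material.

```python
def element_address(base_address, index, element_size):
    """Step 1: scale the index by the element size. Step 2: add the array base."""
    byte_offset = index * element_size
    return base_address + byte_offset


# Hypothetical array of doublewords (8 bytes each) starting at 0x1000.
print(hex(element_address(0x1000, 3, 8)))  # 0x1018
```

For power-of-two element sizes the multiply is usually a left shift (index << 3 for doublewords).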
List the '8 Great Ideas'
1.Design for Moore's Law 2.Use abstraction to simplify design 3.Make the common case fast 4.Performance via parallelism 5.Performance via pipelining 6.Performance via prediction 7.Hierarchy of memories 8.Dependability via redundancy
how does linking object modules produce an executable image (3 steps)?
1.Merges segments 2.Resolve labels (determine their addresses) 3.Patch location-dependent and external refs
6 Steps of Procedure Calling
1.Place parameters in registers X0 to X7 2.Transfer control to procedure 3.Acquire storage for procedure 4.Perform procedure's operations 5.Place result in register for caller 6.Return to place of call (address in X30)
What are the 6 steps in the process of loading from an image file on disk into memory?
1.Read header to determine segment sizes 2.Create virtual address space 3.Copy text and initialized data into memory (Or set page table entries so they can be faulted in) 4.Set up arguments on stack 5.Initialize registers (including SP, FP) 6.Jump to startup routine(Copies arguments to X0, ... and calls main and when main returns, do exit syscall)
What are the 4 design principles?
1.Simplicity favors regularity 2.Smaller is faster 3.Make the common case fast 4.Good design demands good compromises
formula for yield
1/(1 + (Defects per area * die area/2))^2
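A hedged Python sketch that combines the yield formula with the cost per die formula from the earlier card. Function names and sample values are assumptions chosen only to make the arithmetic easy to follow.

```python
def die_yield(defects_per_area, die_area):
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(cost_per_wafer, dies_per_wafer, yield_fraction):
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    return cost_per_wafer / (dies_per_wafer * yield_fraction)


# Hypothetical process: 0.5 defects per cm^2, 2 cm^2 die, $1000 wafer, 100 dies per wafer.
y = die_yield(0.5, 2.0)
print(round(y, 4))                           # 0.4444
print(round(cost_per_die(1000.0, 100, y), 2))  # 22.5
```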
A
1010
B
1011
C
1100
D
1101
E
1110
F
1111
What is the sign extension (8 bits to 16 bits) of the following: -2: 1111 1110 =>
1111 1111 1111 1110
how many instructions can you do per cycle on a 12 stage pipeline?
12 in flight at once (one per stage); ideally one instruction completes per cycle
how many registers in ARM?
15 ×32-bit
what was the 8086, and what year did it come out?
16-bit extension to 8080; 1978
2⁶⁴
18,446,744,073,709,551,616
5 examples of progressing technology
(year - technology - relative performance per unit cost) 1951 - Vacuum tube - 1; 1965 - Transistor - 35; 1975 - Integrated circuit (IC) - 900; 1995 - Very large scale IC (VLSI) - 2,400,000; 2013 - Ultra large scale IC - 250,000,000,000
Both ARM and MIPS were announced in what year?
1985
when did the Pentium come out and what did it add?
1993; superscalar, 64-bit datapath
when did the pentium pro come out?
1995
when did the pentium II come out?
1997
when did the pentium III come out?
1999
Performance
1÷(Execution time)
how many versions of immediate pre-indexed addressing are there?
2: one where the offset is added to the base and one where the offset is subtracted from the base, to allow the programmer to go through the array forwards or backwards.
how many processing steps are there in manufacturing IC's
20-40
when did the pentium IV come out?
2001
when did AMD64 come out and what did it do?
2003; extended architecture to 64 bits
when did we hit a physical limit on hardware?
2005
when did the intel core come out?
2006
floating point standard was last updated in...
2008
what did the 80286 add, and when did it come out?
24-bit addresses, MMU; 1982
2⁸
256
how many characters in latin-1?
256 (ASCII, plus 96 additional graphic characters)
Non-negative numbers have the same unsigned and ____ representation
2s-complement
saturating operations are contrasted with what sort of arithmetic?
2s-complement modulo arithmetic
how many data addressing modes in MIPS?
3
how many registers in MIPS?
31 ×32-bit
what is the ARM instruction size?
32 bits
LEGv8 has a ___ ×___ register file
32 x 64-bit
how many characters in unicode?
32-bit character set (Used in Java, C++ wide characters)
what was the 80386, and when did it come out?
32-bit extension; 1985
what size is the ARM address space?
32-bit flat
LEGv8 instructions are Encoded as...
32-bit instruction words
for many years, we were getting ____ increases in CPU performance per year
52%
2¹⁶
65,536
2³⁶
68,719,476,736
what was the 8080, and what year did it come out?
8-bit microprocessor; 1974
Graphics and media processing operates on vectors of ____ and ____ data
8-bit; 16-bit
how many data addressing modes in ARM?
9
Pre-indexing constants must fit in .....
9 bits (including the sign bit) (so -256 to 255)
Multicore Systems
A multi-core processor is an integrated circuit to which two or more processors have been attached for enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks
Serial Computing
A problem is broken into a discrete series of instructions. Instructions are executed sequentially, one after another, on a single processor. Only one instruction may execute at any moment in time.
Serial Computing
A serial computer is typified by bit-serial architecture — i.e., internally operating on one bit or digit for each clock cycle. Machines with serial main storage devices such as acoustic or magnetostrictive delay lines and rotating magnetic devices were usually serial computers.
Translate the following C Code into ARM: C code: f = (g + h) -(i + j); note: f, ..., j in X19, X20, ..., X23
ADD X9, X20, X21 ADD X10, X22, X23 SUB X19, X9, X10
RISCV is almost the same architecture as...
ARM
what is the most popular embedded core?
ARM
how does integer subtraction actually function?
Add negation of second operand
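A small Python sketch of subtraction done as "add the negation", using an 8-bit word to keep the numbers readable. The word width and the function name are illustrative assumptions.

```python
def subtract_via_negation(a, b, bits=8):
    """Compute a - b as a + (-b), where -b is the two's-complement negation of b."""
    mask = (1 << bits) - 1
    negated_b = (~b + 1) & mask      # invert every bit, then add 1
    return (a + negated_b) & mask    # the carry out of the top bit is dropped


print(subtract_via_negation(7, 6))       # 1
print(bin(subtract_via_negation(6, 7)))  # 0b11111111, the 8-bit pattern for -1
```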
What is the formula for determining address
Address = PC + offset (from instruction)
what 2 things will the following code do: BL ProcedureLabel
Address of following instruction put in X30 Jumps to target address
File Virtualization
Addresses the NAS challenges by eliminating the dependencies between the data accessed at the file level and the location where the files are physically stored.
what does the following code do: LDR X2, [X0,X1]!
Adds X1 to X0 and stores the result in X0. Then uses that result as the address in main memory to fetch from
4 things that impact cpu performance
Algorithm: affects IC, possibly CPI; Programming language: affects IC, CPI; Compiler: affects IC, CPI; Instruction set architecture: affects IC, CPI, Tc
Von Neumann Continued
All parts of the computer are connected together by Bus. Memory and devices are controlled by CPU. Data can pass through bus to and from CPU. Memory holds both programs and data. Memory is addressed linearly; this means that there is an address for each and every memory location. Memory is addressed by the location number without regard to the data contained within.
Fallacies
Amdahl's law doesn't apply to parallel computers; Peak performance tracks observed performance
ALU
Arithmetic and Logic Unit - Deals with all arithmetic and logic within the computer. The part of the central processing unit that deals with operations such as addition, subtraction, and multiplication of integers and Boolean operations. It receives control signals from the control unit telling it to carry out these operations.
Modeling Performance
Assume performance metric of interest is achievable GFLOPs/sec; Arithmetic intensity of a kernel; For a given computer, determine
What are the two types of branch addressing?
B-type CB-type
what is hardware representation
Binary digits (bits), Encoded instructions and data
what is MVN
Bit-by-bit NOT
how do conditional operations work in assembly?
Branch to a labeled instruction if a condition is true. Otherwise, continue sequentially
Loosely Coupled Clusters
Built of a network of independent computers -each has private memory and OS -Connected using high performance network system High availability, scalable, affordable, fault tolerant
CPU Performance
CPU Time = Seconds/Program = Instructions/Program X Cycles/Instruction X Seconds/Cycle. CPU performance depends on instruction count, CPI (cycles per instruction) and clock cycle time. All three are affected by the instruction set architecture.
CPU
Central Processing Unit - Brain of the computer; fetches, decodes and executes instructions.
CPI
Clock Cycles Per Instruction
Disadvantages of RISC
Code Quality. The performance of a RISC processor depends greatly on the code that it is executing. If the programmer (or compiler) does a poor job of instruction scheduling, the processor can spend quite a bit of time stalling: waiting for the result of one instruction before it can proceed with a subsequent instruction. Code Expansion. CISC machines perform complex actions with a single instruction; RISC machines may require multiple instructions for the same action, code expansion can be a problem. Code expansion refers to the increase in size that you get when you take a program that had been compiled for a CISC machine and re-compile it for a RISC machine. The exact expansion depends primarily on the quality of the compiler and the nature of the machine's instruction set. System Design. Another problem that faces RISC machines is that they require very fast memory systems to feed them instructions. RISC-based systems typically contain large memory caches, usually on the chip itself. This is known as a first-level cache.
CISC
Commonly implemented within large computers; uses a single complex instruction to carry out a multi-step operation, instead of using multiple simpler instructions.
Comparison
Comparison operations compare values in order to determine such things as whether one number is greater than, less than or equal to another. These operations can be performed by subtraction of one of the numbers from the other, and as such can be handled by the aforementioned logic gates. However, it is not strictly necessary for the result of the calculation to be stored in this instance, since the amount by which the values differ is not required. Instead, the appropriate status flags in the flag register are set and checked to determine the result of the operation.
3 layers of software/hardware?
Compiler, assembler, hardware
what does system software do
Compiler: translates HLL code to machine code
what does java's just-in-time compiler do?
Compiles bytecodes of "hot" methods into native code for host machine
Characteristics of CISC
Complex instruction-decoding logic: driven by the need for a single instruction to support multiple addressing modes. Small number of general purpose registers: instructions operate directly on memory, so only a limited amount of chip space is dedicated to general purpose registers. Several special purpose registers: many CISC designs set aside special registers for the stack pointer, interrupt handling, and so on. This can simplify the hardware design somewhat, at the expense of making the instruction set more complex. "Condition code" register: this register reflects whether the result of the last operation is less than, equal to, or greater than zero and records if certain error conditions occur.
Advantages of Von Neumann
Control unit gets data and instructions in the same way from memory. It simplifies design and development of the control unit. Data from memory and from devices are accessed in the same way. Memory organisation is in the hands of programmers.
Procedure return: jump register RET or BR LR What will this do?
Copies LR to program counter Can also be used for computed jumps
What is a local area network (LAN)?
Ethernet
how does signed division work
Divide using absolute values Adjust sign of quotient and remainder as required
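A sketch of the signed division recipe above in Python. The card only says the signs are adjusted "as required"; this sketch assumes the common convention that the remainder takes the sign of the dividend, so that dividend = quotient * divisor + remainder always holds. The function name is hypothetical.

```python
def signed_divide(dividend, divisor):
    """Divide the absolute values, then fix up the signs of quotient and remainder."""
    quotient, remainder = divmod(abs(dividend), abs(divisor))
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient      # operand signs differ: negative quotient
    if dividend < 0:
        remainder = -remainder    # assumed rule: remainder follows the dividend
    return quotient, remainder


print(signed_divide(-7, 2))   # (-3, -1)
print(signed_divide(7, -2))   # (-3, 1)
```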
how does restoring division work
Do the subtract, and if remainder goes < 0, add divisor back
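The long division cards (0 bit vs 1 bit in the quotient, restore on a negative remainder) can be sketched in Python as below. This is an unsigned software model with an assumed 8-bit default width, not a description of any particular divider hardware.

```python
def restoring_divide(dividend, divisor, bits=8):
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Do the subtract.
        remainder -= divisor
        if remainder < 0:
            remainder += divisor            # went below 0: add the divisor back
            quotient = quotient << 1        # 0 bit in quotient
        else:
            quotient = (quotient << 1) | 1  # 1 bit in quotient
    return quotient, remainder


print(restoring_divide(7, 2, bits=4))  # (3, 1)
```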
what is DRAM?
Dynamic RAM, the most common form of memory that must be refreshed occasionally (data is stored as a charge in a capacitor)
Message Passing
Each processor has private physical address space(clusters) Instructions/data sent to them Hardware sends/receives messages between processors
how is immediate addressing efficient?
Efficient in regards to space and time, but only if the value fits in the 12-bit encoding scheme
What does EM64T stand for?
Extended Memory 64 Technology
Task that isn't parallelizable
Fibonacci sequence
what is the difference between a signed and unsigned integer
For a signed integer one bit is used to indicate the sign - 1 for negative, zero for positive. Thus a 16 bit signed integer only has 15 bits for data whereas a 16 bit unsigned integer has all 16 bits available. This means unsigned integers can have a value twice as high as signed integers (but only positive values).
Direct Addressing
For direct addressing, the operands of the instruction contain the memory address where the data required for execution is stored. For the instruction to be processed the required data must be first fetched from that location.
Clock Rate
Frequency (X GHz) (X×10⁹ Hz)
Logical Tests
Further logic gates are used within the ALU to perform a number of different logical tests, including seeing if an operation produces a result of zero. Most of these logical tests are used to then change the values stored in the flag register, so that they may be checked later by separate operations or instructions. Others produce a result which is then stored, and used later in further processing.
what is the difference between the GPU and CPU?
GPU's processing is highly data-parallel, the GPU doesn't have any branching/logic like the CPU, and the GPU has a very small cache/a lot less memory
GPU
GPUs are processors which can be used for a range of tasks other than processing computer game graphics. GPUs are used to display high quality video content such as HDMI or Blu-Ray on a screen. Video editing also requires many calculations, especially where edits or effects have been made. The decoding and encoding of videos is also carried out by the GPU
Decode
Here, the control unit checks the instruction that is now stored within the instruction register. It determines which opcode and addressing mode have been used, and as such what actions need to be carried out in order to execute the instruction in question.
Vector Processors
Highly pipelined function units in CPU; stream data from/to vector registers to units; eliminates loops
what is Response time
How long it takes to do a task
What Determines how fast I/O operations are executed
I/O system (including OS)
80386 is now known as...
IA-32
floating point standard was defined by...
IEEE Std 754-1985
What is Amdahl's Law and its formula
Improving an aspect of a computer and expecting a proportional improvement in overall performance is false. Given by: Timproved = (Taffected/improvement factor) + Tunaffected
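A worked Amdahl's Law example in Python. The scenario (a 100 s program in which multiplication accounts for 80 s and is made 5 times faster) and the function name are illustrative assumptions.

```python
def amdahl_improved_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected


# 80 s of multiplies sped up 5x, 20 s of everything else left alone.
t_improved = amdahl_improved_time(80.0, 20.0, 5.0)
print(t_improved)                    # 36.0
print(round(100.0 / t_improved, 2))  # overall speedup of about 2.78, not 5
```

However large the improvement factor gets, the total can never drop below the 20 s that is unaffected.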
Pipelining
In computers, a pipeline is the continuous and somewhat overlapped movement of instructions to the processor or in the arithmetic steps taken by the processor to perform an instruction. Pipelining is the use of a pipeline. Without a pipeline, a computer processor gets the first instruction from memory, performs the operation it calls for, and then goes to get the next instruction from memory, and so forth. While fetching (getting) the instruction, the arithmetic part of the processor is idle. It must wait until it gets the next instruction. With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed. The staging of instruction fetching is continuous. The result is an increase in the number of instructions that can be performed during a given time period. Pipelining is sometimes compared to a manufacturing assembly line in which different parts of a product are being assembled at the same time although ultimately there may be some parts that have to be assembled before others are. Even if there is some sequential dependency, the overall process can take advantage of those operations that can proceed concurrently. Computer processor pipelining is sometimes divided into an instruction pipeline and an arithmetic pipeline. The instruction pipeline represents the stages in which an instruction is moved through the processor, including its being fetched, perhaps buffered, and then executed. The arithmetic pipeline represents the parts of an arithmetic operation that can be broken down and overlapped as they are performed. Pipelines and pipelining also apply to computer memory controllers and moving data through various memory staging places.
Multiplication and Division
In most modern processors, the multiplication and division of integer values is handled by specific floating-point hardware within the CPU. Earlier processors used either additional chips known as maths co-processors, or used a completely different method to perform the task.
main property of Volatile main memory
Loses instructions and data when power off
For nested call, caller needs to save what two things on the stack:
Its return address Any arguments and temporaries needed after the call
Java/JIT compiled code is significantly faster than...
JVM interpreted
Syntax of loading a byte
LDURSB Xt, [Xn, offset] (Sign extend to 64 bits in Xt (can be W or X))
Syntax of loading a halfword
LDURSH Xt, [Xn, offset] (Sign extends to 64 bits in Xt (can be W or X))
What is LRU page replacement?
Least Recently Used
Algorithm that is parallelizable
Linear search
Opcode Short Codes
MOV: moves a data value from one location to another; ADD: adds two data values using the ALU, and returns the result to the accumulator; STO: stores the contents of the accumulator in the specified location; END: marks the end of the program in memory
For the occasional 32 bit constant, what 2 versions of mov do we use?
MOVZ and MOVK
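A Python model of how a MOVZ followed by a MOVK can build a 32-bit constant 16 bits at a time. The helper names, the register variable x9 and the constant 0x12345678 are assumptions for illustration; this models the effect on a 64-bit register, not the instruction encoding.

```python
MASK64 = (1 << 64) - 1


def movz(imm16, shift):
    """Move wide with zeros: place a 16-bit immediate at `shift`, zero all other bits."""
    return (imm16 & 0xFFFF) << shift


def movk(reg, imm16, shift):
    """Move wide with keep: replace only the 16-bit field at `shift`, keep the rest."""
    cleared = reg & ~(0xFFFF << shift) & MASK64
    return cleared | ((imm16 & 0xFFFF) << shift)


x9 = movz(0x1234, 16)     # upper halfword first, everything else zeroed
x9 = movk(x9, 0x5678, 0)  # then patch in the lower halfword
print(hex(x9))            # 0x12345678
```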
what are the 3 LEGv8 multiply instructions and how do they function?
MUL: multiply (Gives the lower 64 bits of the product) SMULH: signed multiply high (Gives the upper 64 bits of the product, assuming the operands are signed) UMULH: unsigned multiply high (Gives the upper 64 bits of the product, assuming the operands are unsigned)
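A Python model of which half of the 128-bit product each multiply instruction returns. Registers are represented as 64-bit unsigned bit patterns; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Interpret a 64-bit pattern as a two's-complement signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def mul(a, b):
    """Lower 64 bits of the product (same bits for signed and unsigned operands)."""
    return (a * b) & MASK64


def umulh(a, b):
    """Upper 64 bits of the product, operands treated as unsigned."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def smulh(a, b):
    """Upper 64 bits of the product, operands treated as signed."""
    return ((to_signed64(a) * to_signed64(b)) >> 64) & MASK64


all_ones = MASK64                # -1 if signed, 2^64 - 1 if unsigned
print(hex(mul(all_ones, 2)))     # 0xfffffffffffffffe
print(umulh(all_ones, 2))        # 1
print(hex(smulh(all_ones, 2)))   # 0xffffffffffffffff (upper half of -2)
```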
Design Principle 3:
Make the common case fast
I/O in ARM is...
Memory mapped
Progress in computer technology has been underpinned by...
Moore's Law
A basic block is...
No embedded branches (except at end), No branch targets (except at beginning)
what is a multicore microprocessor
More than one processor per chip
MIMD
Multiple Instruction Multiple Data Stream - Clusters
MISD
Multiple Instruction Single Data Stream- None
Are instruction count and CPI good performance indicators in isolation?
No
NUMA
Non-Uniform Memory Access - is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users
Pitfalls
Not developing the software to take account of a multiprocessor architecture
Network Characteristics
Performance -Latency per message -Throughput Cost Power Routability in Silicon
EOR does the same operations that ___ does
ORR, except where both input bits are 1 (EOR gives 0 there, ORR gives 1)
what does saturating operations mean?
On overflow, result is largest representable value
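A Python sketch contrasting saturating addition with ordinary 2s-complement wraparound (modulo) addition, using 8-bit signed values. The card only mentions the largest value; clamping negative overflow to the smallest representable value is an assumption here, as are the function names.

```python
def saturating_add(a, b, bits=8):
    """Signed add that clamps to the representable range instead of wrapping."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, a + b))


def wrapping_add(a, b, bits=8):
    """Signed add with 2s-complement modulo behaviour."""
    mask = (1 << bits) - 1
    result = (a + b) & mask
    return result - (1 << bits) if result >> (bits - 1) else result


print(saturating_add(100, 100))  # 127
print(wrapping_add(100, 100))    # -56
```

Saturation is what you want for audio samples or pixel values, where a clipped result is far less noticeable than one that wraps to the opposite extreme.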
in an optimized divider, there is ___ cycle per partial-remainder subtraction
One
in multiplication, there is ___ cycle per partial-product addition
One
Features of RISC
One Cycle Execution Time: RISC processors have a CPI (clock per instruction) of one cycle. Pipelining: A technique that allows simultaneous execution of parts, or stages, of instructions to more efficiently process instructions. Large Number of Registers. The RISC design philosophy generally incorporates a larger number of registers to prevent large amounts of interactions with memory
Disadvantages of Von Neumann
One bus has a bottleneck effect. Only one piece of information can be accessed at a time. Instructions stored in the same memory as the data can be accidentally rewritten by an error in a program.
Coarse-grain multithreading
Only switch on long stall Simplifies hardware, but doesn't hide short stalls
what is cpu clocking
Operation of digital hardware governed by a constant-rate clock
Optimizing Performance
Optimize FP (floating-point) performance -balance adds & multiplies Optimize memory usage -Software prefetch -Memory affinity
What will Adding two -ve operands do?
Overflow if result sign is 0
What will Adding two +ve operands do?
Overflow if result sign is 1
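The two overflow rules above boil down to one check: the operands have the same sign and the result has the other sign. A Python sketch on 8-bit patterns follows; the width and function name are illustrative assumptions.

```python
def add_with_overflow(a, b, bits=8):
    """Add two bit patterns; return (result pattern, signed overflow flag)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a + b) & mask
    same_sign_operands = (a & sign) == (b & sign)
    overflow = same_sign_operands and (result & sign) != (a & sign)
    return result, overflow


print(add_with_overflow(0x70, 0x70))  # (224, True): two positives gave sign bit 1
print(add_with_overflow(0x90, 0x90))  # (32, True): two negatives gave sign bit 0
print(add_with_overflow(0x01, 0xFF))  # (0, False): mixed signs never overflow
```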
formula for performance
Performance = 1/Execution Time
what is the formula for power in CMOS IC technology
Power = Capacitive load x Voltage^2 x Frequency
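A quick Python check of how the power formula scales. The scenario (a new chip with 85% of the old capacitive load, voltage and frequency) is a made-up example, and the function name is hypothetical.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Power = capacitive load x voltage^2 x frequency."""
    return capacitive_load * voltage ** 2 * frequency


# Relative units: old chip is 1.0 for everything, new chip is 0.85 for everything.
old_power = dynamic_power(1.0, 1.0, 1.0)
new_power = dynamic_power(0.85, 0.85, 0.85)
print(round(new_power / old_power, 3))  # 0.522
```

Because voltage is squared, lowering it cuts power much faster than lowering load or frequency by the same fraction.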
Logic
Problems that need to be solved, logically.
what are non-leaf procedures
Procedures that call other procedures
GPU Architectures
Processing is highly data parallel -GPUs are highly multithreaded -use thread switching to hide memory latency -Graphics memory is wide and high-bandwidth
What determines how fast instructions are executed
Processor and memory system
what does linking object modules do?
Produces an executable image
Disadvantages of Harvard
Producing a computer with two buses and two memory stores is more expensive and takes more time.
What 3 things determine number of machine instructions executed per operation
Programming language, compiler, architecture
how does an assembler help to translate a program into machine instructions?
Provides information for building a complete program from the pieces
Quantum Computing
Quantum computing studies theoretical computation systems (quantum computers) that make direct use of quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data. Quantum computers are different from digital computers based on transistors.
IA-32 is a microengine similar to...
RISC
LEGv8 is typical of...
RISC ISAs
Reduced Instruction Set Architecture(RISC)
RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
What are the actions that the handler goes through when an exception is handled?
Read cause, and transfer to relevant handler; Determine action required; If restartable, Take corrective action & use EPC to return to program; Otherwise, Terminate program & Report error using EPC
the following code is an example of what? LDR X2, [X0],X1
Register Post-Indexed Addressing Mode
What is sign extension?
Representing a number using more bits. Preserves numeric value.
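A Python sketch of sign extension, reproducing the +2 and -2 cards (8 bits widened to 16). The function name is hypothetical; the result is returned as an unsigned bit pattern.

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a two's-complement value by copying its sign bit into the new upper bits."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    if value & sign_bit:
        # Negative: fill the bits between from_bits and to_bits with 1s.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


print(format(sign_extend(0b00000010, 8, 16), "016b"))  # 0000000000000010
print(format(sign_extend(0b11111110, 8, 16), "016b"))  # 1111111111111110
```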
what are the 2 LEGv8 division operations?
SDIV (signed) UDIV (unsigned)
how do SISD and SIMD differ?
SISD is where a processor executes a single instruction stream, to operate on data stored in a single memory; a SIMD processor performs a single, identical action simultaneously on multiple data pieces.
Shared Memory
SMP: shared memory multiprocessor -Hardware provides single physical address space for all processors -Synchronize shared variables using locks -Memory Access time -UMA vs NUMA
Syntax of storing a byte
STURB Wt, [Xn, offset] (Stores just rightmost byte of Wt (MUST be W and not X))
syntax of storing a halfword
STURH Wt, [Xn, offset] (Stores just rightmost halfword of Wt (MUST be W and not X))
translate the following to assembly: if (a > b) a += 1; a in X22, b in X23
SUBS X9, X22, X23 (use subtract to make the comparison); B.LE Exit (conditional branch: skip the add when a <= b); ADDI X22, X22, #1; Exit:
the following code is an example of what? LDR X2, [X0,X1, LSL#3]!
Scaled Register Pre-Indexing
what does a bit mask do
Select some bits, clear others to 0
Grid Computing
Separate Computers interconnected by long-haul network
3 Facts about servers
Server computer -Network based -High capacity, performance, reliability -Range from small servers to building sized
what does including bits in a word do?
Set some bits to 1, leave others unchanged
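The bit mask card (AND) and the "including bits" card (OR) side by side in Python. The function names and the sample word are illustrative assumptions.

```python
def select_bits(word, mask):
    """AND with a mask: keep the bits where the mask is 1, clear the others to 0."""
    return word & mask


def include_bits(word, mask):
    """OR with a mask: set the bits where the mask is 1, leave the others unchanged."""
    return word | mask


word = 0b11011010
print(format(select_bits(word, 0b00001111), "08b"))   # 00001010
print(format(include_bits(word, 0b00001111), "08b"))  # 11011111
```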
what does it mean to pipeline?
Several multiplications performed in parallel
what does the following code do: LDR X2, [X0,X1, LSL #3]
Shift X1 left by 3 bits (so multiply by 8), then add that result to X0 and use the result as the address in main memory to fetch from
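A Python model of the address arithmetic the addressing mode cards describe: the scaled register offset form above, and the pre-indexed form with "!" that also updates the base register. Registers are modelled as a plain dict and the function names are hypothetical; this follows the cards' descriptions, not any official instruction reference.

```python
def scaled_register_offset(regs, base, index, shift):
    """Address for LDR Xt, [base, index, LSL #shift]; the base register is unchanged."""
    return regs[base] + (regs[index] << shift)


def register_pre_indexed(regs, base, index):
    """Address for LDR Xt, [base, index]!; the base is updated before the access."""
    regs[base] = regs[base] + regs[index]
    return regs[base]


regs = {"X0": 0x1000, "X1": 3}
print(hex(scaled_register_offset(regs, "X0", "X1", 3)))  # 0x1018, X0 still 0x1000
print(hex(register_pre_indexed(regs, "X0", "X1")))       # 0x1003, X0 is now 0x1003
```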
Bit Shifting
Shifting operations move bits left or right within a word, with different operations filling the gaps created in different ways. This is accomplished via the use of a shift register, which uses pulses from the clock within the control unit to trigger a chain reaction of movement across the bits that make up the word.
Characteristics of RISC
Simple instructions: limited, fixed-length instructions, and no instructions combine load/store with arithmetic. Few data types: supports simple data types such as integers/characters to complex data structures such as records. Simple addressing modes: uses simple addressing modes and fixed-length instructions to facilitate pipelining; memory indirect addressing isn't provided. Identical general purpose registers: any register can be used in any context. Harvard architecture: in the Harvard memory model, the instruction stream and data stream are conceptually separated.
what are java class files?
Simple portable instruction set for the JVM
Design Principle 1:
Simplicity favors regularity
Advantages of Harvard
Since it has two memories, it allows parallel access to data and instructions. Data and instructions are accessed in the same way.
what is SIMD?
Single Instruction Multiple Data (the same instruction is applied to many data streams/AKA vector architecture)
SIMD
Single Instruction Multiple Data Stream - GPU's, SSE instructions of x86 Operate element-wise on vectors of data All processors execute same instruction at the same time with different data addresses Simplifies synchronization Reduced Instruction Control Hardware Works best for highly data-parallel applications
SISD
Single Instruction Single Data Stream -Pentium 4
SPMD
Single Program Multiple Data: a parallel program on a MIMD computer -conditional code for different processors
what are the 2 floating point representations
Single precision (32-bit) Double precision (64-bit)
What is cache memory?
Small fast SRAM memory for immediate access to data
Design Principle 2:
Smaller is faster
Advantages of RISC
Speed. Since a simplified instruction set allows for a pipelined, superscalar design RISC processors often achieve 2 to 4 times the performance of CISC processors using comparable semiconductor technology and the same clock rates. Simpler hardware. Because the instruction set of a RISC processor is so simple, it uses up much less chip space. Smaller chips allow a semiconductor manufacturer to place more parts on a single silicon wafer, which can lower the per-chip cost dramatically. Shorter design cycle. Since RISC processors are simpler than corresponding CISC processors, they can be designed more quickly, and can take advantage of other technological developments sooner than corresponding CISC designs, leading to greater leaps in performance between generations. Efficient Code. Higher-level language compilers produce more efficient code than formerly because they have always tended to use the smaller set of instructions to be found in a RISC computer. Simplicity. The simplicity of RISC allows more freedom to choose how to use the space on a microprocessor.
what is SRAM?
Static RAM, a lower-power and faster but more expensive type of memory, used for CPU caches
Storage Virtualization
Storage systems typically use special hardware and software along with disk drives in order to provide very fast reliable storage for computing and data.
what is assembly language
Textual representation of instructions
Complex Instruction Set Computer (CISC)
The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction.
what makes up the Application binary interface
The ISA plus system software interface
Operand
The Operand indicates where the data required for the operation can be found and how it can be accessed.
Accumulator
The accumulator is used to hold the result of operations performed by the arithmetic and logic unit, as covered in the section of the ALU.
Execute
The actual actions which occur during the execute cycle of an instruction depend on both the instruction itself, and the addressing mode specified to be used to access the data that may be required. However, four main groups of actions do exist, which are discussed in full later on.
Address Bus
The address bus contains the connections between the microprocessor and memory that carry the signals relating to the addresses which the CPU is processing at that time, such as the locations that the CPU is reading from or writing to. The width of the address bus corresponds to the maximum addressing capacity of the bus, or the largest address within memory that the bus can work with. The addresses are transferred in binary format, with each line of the address bus carrying a single binary digit. Therefore the maximum address capacity is equal to two to the power of the number of lines present (2^lines).
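A minimal Python sketch of the 2^lines rule from the address bus card. The helper name max_addressable_locations is hypothetical and not part of the original notes.

```python
def max_addressable_locations(address_lines: int) -> int:
    """Number of distinct addresses a bus with this many lines can carry.

    Each line carries one binary digit, so the count is 2 ** lines.
    """
    return 2 ** address_lines


# Print the capacity for a few bus widths (widths chosen only for illustration).
for lines in (16, 20, 32):
    print(lines, "lines ->", max_addressable_locations(lines), "addresses")
```

Adding one line to the bus doubles the number of locations it can address.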
Parallel
The computational problem should be able to: be broken apart into pieces of work that can be solved simultaneously; execute multiple program instructions at any moment in time; be solved in less time with multiple compute resources than with a single compute resource. The compute resources are typically: a single computer with multiple processors/cores; an arbitrary number of such computers connected by a network.
Control Bus
The control bus carries the signals relating to the control and co-ordination of the various activities across the computer, which can be sent from the control unit within the CPU. Different architectures result in differing number of lines of wire within the control bus, as each line is used to perform a specific task. For instance, different, specific lines are used for each of read, write and reset requests.
Control Logic Circuits
The control logic circuits are used to create the control signals themselves, which are then sent around the processor. These signals inform the arithmetic and logic unit and the register array what actions and steps they should be performing, what data they should be using to perform said actions, and what should be done with the results.
what is included in implementation
The details underlying an interface
what is a cache miss?
The event in which a memory access refers to a memory location that is not in the cache.
Fetch
The fetch cycle takes the instruction at the required address from memory, stores it in the instruction register, and moves the program counter on by one so that it points to the next instruction.
Flag register / status
The flag register is specially designed to contain all the appropriate 1-bit status flags, which are changed as a result of operations involving the arithmetic and logic unit. Further information can be found in the section on the ALU.
What is Instruction set architecture (ISA)
The hardware/software interface
Memory
The memory is not an actual part of the CPU itself, and is instead housed elsewhere on the motherboard. However, it is here that the program being executed is stored, and as such is a crucial part of the overall structure involved in program execution.
Opcode
The opcode is a short code which indicates what operation is expected to be performed. Each operation has a unique opcode. Once the opcode is known, the execution cycle can occur. Different actions need to be carried out depending on the opcode, with no two opcodes requiring the same actions to occur. 4 actions can occur: Transfer of data between CPU and memory.
what is an instruction set
The repertoire of instructions of a computer
what is Register Post-Indexed Addressing Mode
The same as Immediate Post-Indexed, except you add or subtract a register instead of a constant.
what is Scaled Register Pre-indexed Addressing Mode
The same as Register Pre-Indexed, except you shift the register before adding or subtracting it.
what is Register Pre-Indexed Addressing Mode
The same as Immediate Pre-Indexed, except you add or subtract a register instead of a constant.
Timer or Clock
The timer or clock ensures that all processes and instructions are carried out and completed at the right time. Pulses are sent to the other areas of the CPU at regular intervals (related to the processor clock speed), and actions only occur when a pulse is detected. This ensures that the actions themselves also occur at these same regular intervals, meaning that the operations of the CPU are synchronized.
Decoder
This is used to decode the instructions that make up a program when they are being processed, and to determine what actions must be taken in order to process them. These decisions are normally taken by looking at the opcode of the instruction, together with the addressing mode used. This is covered in greater detail in the instruction execution section of this tutorial.
Other general purpose registers
These registers have no specific purpose, but are generally used for the quick storage of pieces of data that are required later in the program execution. In the model used here these are assigned the names A and B, with suffixes of L and U indicating the lower and upper sections of the register respectively.
Addition and Subtraction
These two tasks are performed by constructs of logic gates, such as half adders and full adders. While they may be termed 'adders', they can also perform subtraction via use of inverters and 'two's complement' arithmetic.
Control Unit
This controls the movement of instructions in and out of the processor, and also controls the operation of the ALU. It consists of a decoder, control logic circuits, and a clock to ensure everything happens at the correct time. It is also responsible for performing the instruction execution cycle.
Von Neumann Architecture
This describes the design architecture for an electronic digital computer with parts consisting of a processing unit containing: ALU, Control Unit, Register Array, Memory to store both data and instructions, External Mass Storage, Input and Output. Programs consist of a sequence of instructions. Instructions are executed in the order they are stored in memory. Instructions, characters, data and numbers are represented in binary form.
Harvard Architecture
This is a computer architecture with physically separate storage and signal pathways for instructions and data.
Register Array
This is a small amount of internal memory that is used for the quick storage and retrieval of data and instructions. All processors include some common registers used for specific functions, namely the program counter, instruction register, accumulator, memory address register and stack pointer.
System Bus
This is comprised of the control bus, data bus and address bus. It is used for connections between the processor, memory and peripherals, and transferal of data between the various parts.
Block virtualization
This is the abstraction (separation) of logical storage from physical storage so that it may be accessed without regard to physical storage or varied structure. This separation allows the administrators of the storage system greater flexibility in how they manage storage for end users.
Parallel Computing
This is the simultaneous use of multiple compute resources to solve a computational problem: a problem is broken into parts that can be solved concurrently; each part is further broken down into a series of instructions; instructions from each part execute simultaneously on different processors; an overall control/coordination mechanism is employed.
Data Bus
This is used for the exchange of data between the processor, memory and peripherals, and is bi-directional so that it allows data flow in both directions along the wires. Again, the number of wires used in the data bus (sometimes known as the 'width') can differ. Each wire is used for the transfer of signals corresponding to a single bit of binary data. As such, a greater width allows greater amounts of data to be transferred at the same time.
Instruction Register
This is used to hold the current instruction in the processor while it is being decoded and executed, in order for the time taken by the whole execution process to be reduced. This is because the time needed to access the instruction register is much less than continual checking of the memory location itself.
Program Counter
This register is used to hold the memory address of the next instruction that has to be executed in a program. This is to ensure the CPU knows at all times where it has reached, that it is able to resume following an execution at the correct point, and that the program is executed correctly.
what does the following code do: LDR X2, [X0,X1]
This says add the contents of registers X0 and X1 and use the result as the address in main memory to fetch from
what is Throughput
Total work done per unit time
UMA
Uniform Memory Access -shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.
Memory Address Register
Used for storage of memory addresses, usually the addresses involved in the instructions held in the instruction register. The control unit then checks this register when needing to know which memory address to check or obtain data from.
Vector vs. Scalar
Vector architectures and vectorizing compilers: simplify data-parallel programming; explicit statement of absence of loop-carried dependencies; regular access patterns benefit from interleaved and burst memory; avoid control hazards by avoiding loops. More general than ad-hoc media extensions; better match with compiler technology.
Vector vs. Multimedia Extensions
Vector instructions have a variable vector length; multimedia extensions have a fixed width/length. Vector instructions support strided access. Vector units can be a combination of pipelined and arrayed function units.
Virtual Storage
Virtual storage is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console.
Parallel Computers
Virtually all stand-alone computers today are parallel from a hardware perspective: multiple functional units (L1 cache, L2 cache, branch, fetch, decode, floating-point, graphics processing (GPU), integer, etc.); multiple execution units/cores; multiple hardware threads.
31 x 32-bit general purpose sub-registers are...
W0 to W30
Execution Cycle
When a program is loaded into memory, it has to be executed.
Memory Buffer/Data Register
When an instruction or data is obtained from the memory or elsewhere, it is first placed in the memory buffer register. The next action to take is then determined and carried out, and the data is moved on to the desired location.
Indirect Addressing
When using indirect addressing, the operands give a location in memory similarly to direct addressing. However, rather than the data being at this location, there is instead another memory address given where the data actually is located. This is the most flexible of the modes, but also the slowest as two data look ups are required.
Von Neumann vs Harvard
With Von Neumann architecture the CPU can be either reading an instruction or reading/writing data from/to the memory. Both cannot occur at the same time since the instructions and data use the same bus system. In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity because instruction fetches and data access do not contend for a single memory pathway. Also, a Harvard architecture machine has distinct code and data address spaces.
Semantic GAP
With an objective of improving efficiency of software development, several powerful programming languages have been developed. They provide high level of abstraction, conciseness and power. By this evolution the semantic gap grows. To enable efficient compilation of high level language programs, CISC and RISC designs are the two options. CISC designs involve very complex architectures including a large number of instructions and addressing modes, whereas RISC designs involve simplified instruction set and adapt it to the real requirements of user programs.
Immediate Addressing
With immediate addressing, no look up of data is actually required. The data is located within the operands of the instruction itself, not in a separate memory location. This is the quickest of the addressing modes to execute, but the least flexible. As such it is the least used of the three in practice.
31 x 64-bit general purpose registers are...
X0 to X30
in immediate addressing, the operand is...
a constant within the instruction Example: ADD X2, X0, #5
what type of instruction does the compiler insert to produce a bubble?
a nop
In register addressing, the operand is...
a register Example: ADD X2, X0, X1
linking object modules could leave location dependencies for fixing by...
a relocating loader
what is multiple issue?
a scheme whereby multiple instructions are launched in one clock cycle
we make our technology "smaller and faster" with which of the 8 great ideas?
abstraction - it decomposes ideas
strided access
accessing memory locations that are separated by the same number of bytes (the stride) every time
In optimized multiplication, what two steps are performed in parallel?
add and shift
In immediate offset addressing, a constant is...
added to a base register. Example: LDURSB X1, [X2,#1]
What are the 5 integer operations?
addition, subtraction, multiplication, division, handling overflow
what does this line refer to?: mov $message, %rsi
address of string
Postfix bytes specify...
addressing mode
Multiple forms of addressing are generically called...
addressing modes
what does B.LS mean
less than or equal, unsigned
what is pipelining?
an implementation technique in which multiple instructions are overlapped in execution
Die area determined by...
architecture and circuit design
Register offset addressing mode can help with indexing into an...
array
how is pipelining achieved?
as soon as the resource is done with an instruction, it moves on to the next instruction, even if the first instruction hasn't gone through all the stages
If a particular constant can not be represented by the defined 12 bit format, you get an...
assembler error (invalid constant)
what does B.LT mean
less than, signed
Procedure call: ____ and ____
branch and link
what does the following code mean: B L1
branch unconditionally to instruction labeled L1;
how are locations determined in direct mapped cache?
by the address
What is a fully associative cache?
cache structure in which a block can be placed in any location in the cache
What is an N-way set associative cache?
cache structure that consists of a number of sets, which consist of n blocks. each block in the memory maps to a unique set in the cache, and a block can be placed in any element of that set
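A short Python sketch of how a memory block picks its set, assuming the usual modulo mapping (block address mod number of sets). The function name and the 8-block cache size are illustrative assumptions, not taken from the cards.

```python
def cache_set_index(block_address: int, num_blocks: int, ways: int) -> int:
    """Return the set that a memory block maps to.

    ways == 1           -> direct mapped (one block per set)
    ways == num_blocks  -> fully associative (a single set)
    """
    num_sets = num_blocks // ways
    return block_address % num_sets


# Where does memory block 12 go in an 8-block cache?
for ways in (1, 2, 8):
    print(f"{ways}-way: set {cache_set_index(12, 8, ways)}")
```

Within the chosen set, the block may be placed in any of the n entries, which is what distinguishes the set associative case from direct mapping.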
the higher the yield, the ____ the chip
cheaper
what does the following code do:
       MOV  X9,XZR        // i = 0
loop1: LSL  X10,X9,#3     // X10 = i * 8
       ADD  X11,X0,X10    // X11 = address of array[i]
       STUR XZR,[X11,#0]  // array[i] = 0
       ADDI X9,X9,#1      // i = i + 1
       CMP  X9,X1         // compare i to size
       B.LT loop1         // if (i < size) go to loop1
clears an array
what does the following code do:
       MOV  X9,X0         // p = address of array[0]
       LSL  X10,X1,#3     // X10 = size * 8
       ADD  X11,X0,X10    // X11 = address of array[size]
loop2: STUR XZR,[X9,#0]   // Memory[p] = 0
       ADDI X9,X9,#8      // p = p + 8
       CMP  X9,X11        // compare p to &array[size]
       B.LT loop2         // if (p < &array[size]) go to loop2
clears an array via pointers
GHz numbers refer to...
clock rate (frequency); the clock period is its reciprocal
CPU clock is used to sync...
combinational logic
system software is a...
compiler
IA-32 is a ____ instruction set
complex
conditional branches are potential for which type of hazard?
control hazard
procedure calls are potential for which type of hazard?
control hazard
procedure returns are potential for which type of hazard?
control hazard
what sort of trade-off must be accepted with a faster multiply?
cost/performance
formulas for cpu time
CPU time = CPU clock cycles x clock cycle time = CPU clock cycles / clock rate
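A small Python sketch of the two equivalent CPU time formulas. The function names and the sample numbers (20 billion cycles at 2 GHz) are assumptions chosen for illustration.

```python
def cpu_time_from_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def cpu_time_from_period(clock_cycles: float, clock_period_s: float) -> float:
    """CPU time = CPU clock cycles x clock cycle time."""
    return clock_cycles * clock_period_s


cycles = 20e9          # hypothetical program: 20 billion cycles
rate = 2e9             # 2 GHz clock
period = 1.0 / rate    # clock period is the reciprocal of the clock rate

print(cpu_time_from_rate(cycles, rate))
print(cpu_time_from_period(cycles, period))
```

Both calls describe the same quantity; they differ only in whether the clock is given as a rate or as a period.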
what is Clock frequency (rate)
cycles per second
what is a Static data segment?
data allocated for the life of the program
load/use hazard is a sub-type of which type of hazard?
data hazard
what type of hazard is the following? j tries to read a source before i writes it, so j incorrectly gets the old value.
data hazard
what type of hazard is the following? j tries to write a destination before it is read by i , so i incorrectly gets the new value.
data hazard
what type of hazard is the following? j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.
data hazard
Classifying GPUs
GPUs don't fit nicely into the SIMD/MIMD model. They are classified along two axes instead: static vs. dynamic, and instruction-level parallelism vs. data-level parallelism.
what is the downside/upside to a write through vs a write back?
downside: slow; upside: memory is synchronized
uses of GPUs
due to the large number of cores, many jobs are taken on by the GPU, such as: machine learning (AI); modelling; cryptocurrency mining (e.g. mining for bitcoins)
what is Clock period
duration of a clock cycle
what is the downside to direct mapped cache?
each memory block can go in only one cache location, so blocks that map to the same location keep evicting each other (conflict misses)
what is EOR
exclusive or
what is the best performance measure?
execution time
multicore microprocessors require...
explicitly parallel programming
what are the 5 steps for pipelining?
fetch, decode, execute, memory access, write back
what is the benefit of having a larger cache block size?
fewer cache misses
what does this line mean?: mov $1, %rdi
file handle 1 is stdout
Wafer cost and area are...
fixed
what was the 8087, and what year did it come out?
floating-point coprocessor; 1980
what does B.LO mean
less than, unsigned
what is the difference between a write-through and a write-back?
for a write-through, when cache is updated, memory is also updated. for a write-back, memory is only updated when the block of cache is replaced
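A toy Python model of the two write policies, assuming a cache that holds a single block in front of a dictionary standing in for main memory. The class and function names are hypothetical, and a real cache tracks many blocks and a dirty bit per block.

```python
class OneBlockCache:
    """Toy cache holding one (address, value) block in front of a dict 'memory'."""

    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.addr = None
        self.value = None
        self.dirty = False
        self.memory_writes = 0

    def _evict(self):
        # Write-back: memory is updated only when a dirty block is replaced.
        if self.write_back and self.dirty:
            self.memory[self.addr] = self.value
            self.memory_writes += 1
        self.dirty = False

    def write(self, addr, value):
        if self.addr != addr:
            self._evict()
            self.addr = addr
        self.value = value
        if self.write_back:
            self.dirty = True
        else:
            # Write-through: memory is updated on every cache write.
            self.memory[addr] = value
            self.memory_writes += 1


def count_memory_writes(write_back):
    """Write address 0 three times, then address 1 once (forcing a replacement)."""
    memory = {0: 0, 1: 0}
    cache = OneBlockCache(memory, write_back)
    for v in (1, 2, 3):
        cache.write(0, v)
    cache.write(1, 9)
    return cache.memory_writes, memory[0]


print("write-through:", count_memory_writes(False))
print("write-back:   ", count_memory_writes(True))
```

In this sketch both policies leave the same final value at address 0, but write-back reaches memory far less often, which is the performance argument for it.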
what is the purpose of Debug info?
for associating with source code
what is Relocation info?
for contents that depend on absolute location of loaded program
how do you resolve a data hazard?
forwarding, scheduling, or stalling (bubble)
purpose of X29 (FP):
frame pointer
LEGv8 register file is used for...
frequently accessed data
One cycle per partial-product addition is ok if...
frequency of multiplications is low
virtual memory uses which kind of cache mapping?
fully associative
What is Amdahl's Law?
gives a commonsense ceiling on performance (increased speed by a "better way" is limited by the usability of the "better way")
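The ceiling described by Amdahl's Law can be written as speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits and s is how much faster that fraction becomes. A minimal Python sketch follows; the function and parameter names are assumptions.

```python
def amdahl_speedup(fraction_improved: float, improvement_factor: float) -> float:
    """Overall speedup when only part of the execution time is improved."""
    unaffected = 1.0 - fraction_improved
    return 1.0 / (unaffected + fraction_improved / improvement_factor)


# If 80% of the time can be sped up, the overall gain stays below 1 / 0.2 = 5,
# no matter how large the improvement factor gets.
for factor in (2, 10, 100, 1_000_000):
    print(factor, amdahl_speedup(0.8, factor))
```

The unaffected fraction sets the limit: as s grows, the speedup approaches 1 / (1 - f).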
what is a Symbol table?
global definitions and external refs
Static data contains...
global variables
what does the following code do: CBNZ X19, Exit
go to Exit if X19 != 0
What does the following code do: B 10000
go to location 10000ten
immediate pre-indexed addressing mode is useful for...
going sequentially through an array
what does B.GE mean
greater than or equal, signed
what does B.HS mean
greater than or equal, unsigned
what does B.GT mean
greater than, signed
what does B.HI mean
greater than, unsigned
Dynamic data is...
heap
what does abstraction do
helps us deal with complexity by hiding lower-level detail
For local data on the stack, there is a ___ address and a ___ address
high, low
application software is written in...
high-level language
simplicity enables...
higher performance at lower cost
what does the following code mean: CBNZ register, L1
if (register != 0) branch to instruction labeled L1;
what does the following code mean: CBZ register, L1
if (register == 0) branch to instruction labeled L1;
dynamic linking avoids...
image bloat caused by static linking of all (transitively) referenced libraries
The following code is an example of what? LDR X2, [X0, #4]!
immediate pre-indexed addressing
what is the goal of parallel computing?
improve performance
Register offset addressing can help with indexing into an array, where the array index is...
in one register and the base of the array is in another
ORR operations are useful in...
including bits in a word
Each part is ... of the other
independent
pointers help avoid...
indexing complexity
purpose of X8
indirect result location register
Array version of clearing an array requires shift to be...
inside loop
in clearing an array, Array version requires shift to be...
inside loop
parallel programming for multicore microprocessors can compare with
instruction level parallelism
RISC
A type of microprocessor architecture that utilizes a small, highly optimized set of instructions, rather than the more specialized set of instructions often found in other types of architectures. The prime difference between RISC and CISC designs is the number and complexity of instructions. CISC designs include complex instruction sets so as to closely support the operations and data structures used by higher-level languages.
When you exclusive or (EOR) a register with itself, what happens?
it zeroes out
what is Immediate Post-indexed Addressing Mode
just like immediate pre-indexed except the address in the base register is used to access memory first and then the constant is added or subtracted later.
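A Python sketch contrasting immediate pre-indexed and post-indexed loads, assuming memory is modelled as a dictionary from address to value and registers as a dictionary from name to value. The helper names are hypothetical.

```python
def load_pre_indexed(memory, regs, base, offset):
    """Models LDR Xt, [Xn, #offset]! : update the base first, then load."""
    regs[base] += offset
    return memory[regs[base]]


def load_post_indexed(memory, regs, base, offset):
    """Models LDR Xt, [Xn], #offset : load from the old base, then update it."""
    value = memory[regs[base]]
    regs[base] += offset
    return value


memory = {100: "first", 108: "second"}

regs = {"X0": 100}
print(load_pre_indexed(memory, regs, "X0", 8), regs["X0"])

regs = {"X0": 100}
print(load_post_indexed(memory, regs, "X0", 8), regs["X0"])
```

Both forms leave the base register holding the updated address; they differ only in which address is used for the memory access.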
in virtual memory, which schema determines which block is going to be replaced?
least-recently used (LRU)
what does B.LE mean
less than or equal, signed
Vector Instructions
lv, sv (load/store vector); addv.d (add two vectors of doubles); addvs.d (add a scalar to each element of a vector of doubles)
Assembler (or compiler) translates program into...
machine instructions
induction variable elimination is better to...
make the program clearer and safer
Defect rate determined by...
manufacturing process
AND Operations are useful to...
mask bits in a word
purpose of X16 -X17 (IP0 -IP1):
may be used by linker as a scratch register, other times as temporary register
when loading a program, we load from image file on disk into...
memory
Pointers correspond directly to...
memory addresses
(IA-32) Hardware translates instructions to simpler...
microoperations
what does MOVZ do
move wide with zeros (16 bits)
what does MOVK do
move wide with keep (16 bits)
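A Python sketch of how MOVZ and MOVK combine to build a constant wider than 16 bits, assuming 64-bit registers and a shift of 0, 16, 32, or 48. The function names mirror the mnemonics, but the model itself is illustrative only.

```python
MASK64 = (1 << 64) - 1


def movz(imm16: int, shift: int) -> int:
    """Place a 16-bit immediate at the given shift; every other bit becomes zero."""
    return ((imm16 & 0xFFFF) << shift) & MASK64


def movk(reg: int, imm16: int, shift: int) -> int:
    """Replace one 16-bit field of reg, keeping all the other bits."""
    cleared = reg & ~(0xFFFF << shift) & MASK64
    return cleared | ((imm16 & 0xFFFF) << shift)


# Roughly: MOVZ X9, 0x00FF, LSL 16 followed by MOVK X9, 0x1234, LSL 0
x9 = movz(0x00FF, 16)
x9 = movk(x9, 0x1234, 0)
print(hex(x9))
```

One MOVZ followed by up to three MOVKs is enough to fill all four 16-bit fields of a 64-bit register.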
Prefix bytes modify...
operation
most caches use which kind of cache mapping?
n-way set associative
can faster division use parallel hardware like a multiplier?
no
Subtracting two +ve or two -ve operands will result in...
no overflow
in integrated circuit production, relation to area and defect rate is...
nonlinear
Having more cores guarantees quicker processing time
not always
what does this line refer to?: mov $13, %rdx
number of bytes
what does an algorithm determine
number of operations executed
In regards to prefix bytes, operation refers to what types of things?
operand length, repetition, locking, etc.
Adding +ve and -ve operands will prevent...
overflow
division operations ignore what two things
overflow and division-by-zero
use ____ to improve performance
parallelism
In reality most tasks are
partially parallelizable
Parallel Programming difficulties
partitioning, coordination, communications overhead (delay from the message passing interface)
in branch addressing, both addresses are...
pc-relative
clock period is not always indicative of...
performance
in determining how many n times faster one thing is to another, we use these formulas:
performanceX / performanceY = executionTimeY / executionTimeX = n
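A one-function Python sketch of the "n times faster" ratio, with hypothetical execution times used only as an example.

```python
def times_faster(execution_time_x: float, execution_time_y: float) -> float:
    """How many times faster X is than Y: n = executionTimeY / executionTimeX."""
    return execution_time_y / execution_time_x


# Hypothetical: X runs a program in 10 s, Y runs the same program in 15 s.
print(times_faster(10, 15))
```

Because performance is the reciprocal of execution time, the slower machine's time goes in the numerator.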
What does a datapath do?
performs operations on data
dynamic linking automatically...
picks up new library versions
how is dynamic scheduling implemented?
The pipeline is divided into 3 units: an instruction fetch/issue unit, multiple functional units, and a commit unit. The first unit fetches instructions, decodes them, and sends each to a functional unit. The functional units have reservation stations, which hold the operands and the operation. Once a reservation station contains all its operands and the functional unit is ready to execute, the result is calculated. It is sent to any reservation stations waiting for this result, as well as to the commit unit, which puts it in memory or a register.
faster multiplication can be...
pipelined
what did the i486 add and when did it come out?
pipelined, on-chip caches and FPU; 1989
purpose of X18:
platform register for platform independent code; otherwise a temporary register
what limits performance improvements
power
what are the pros/cons of a write-through?
pro: data is consistent between the memory and cache. con: is slow (solution: use a write buffer)
what are the pros/cons of a write-back?
pro: good performance. con: difficult to implement
purpose of X0 -X7
procedure arguments/results
dynamic linking requires...
procedure code to be relocatable
Parallel processing
processes are carried out at the same time.
in ic manufacturing, what is yield
proportion of working dies per wafer
Most caches use which replacement policy?
random
time/units of work is also known as...
raw speed/latency
Arithmetic instructions use ___ operands
register
in immediate offset addressing, the constant offset is added to the ______ and then the result is used as the address in main memory to fetch from
register inside the [ ]
the datapath includes...
registers
Subtracting +ve from -ve operand will overflow if...
result sign is 0
Subtracting -ve from +ve operand will overflow if...
result sign is 1
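A Python sketch of the sign rule for subtraction overflow from the cards above, assuming an 8-bit two's complement width to keep the numbers small. The function name and the default width are illustrative assumptions.

```python
def sub_overflows(a: int, b: int, bits: int = 8) -> bool:
    """True if a - b overflows a signed two's complement value of this width.

    Overflow needs operands of different signs and a result whose sign
    differs from the sign of the first operand.
    """
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    result = (a - b) & mask            # wrapped result, as hardware would store it
    a_neg = bool(a & mask & sign)
    b_neg = bool(b & mask & sign)
    r_neg = bool(result & sign)
    return a_neg != b_neg and r_neg != a_neg


print(sub_overflows(-128, 1))   # -ve minus +ve, result sign is 0
print(sub_overflows(127, -1))   # +ve minus -ve, result sign is 1
print(sub_overflows(5, 3))      # same signs
```

Subtracting operands of the same sign never trips the check, which matches the "no overflow" card.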
how are exceptions handled?
Save the PC of the offending instruction (in the ELR) and an indication of the problem (in the ESR). Instructions before the offending instruction are allowed to complete; the offending instruction and those after it are flushed from the pipeline.
purpose of X19 -X27:
saved
We use MOVZ and MOVK with a flexible...
second operand (shift)
what does the control contain
sequences datapath, memory, etc.
Operating systems provide ___ code
service
what is shamt
shift amount
what is LSL
shift left
what is LSR
shift right
scaled register offset addressing allows the register to be...
shifted before it is added to the base register.
In scaled register addressing, the register operand is...
shifted first Example: ADD X2, X0, X1, LSL #3
Bit 31 is ___ bit
sign
what does LDURSB do?
sign-extend loaded byte
In PC-relative addressing the displacement from the PC is...
signed (so branches can go forward or backward in the code)
what material are wafers made from?
silicon ingot
Compilers are good at making fast code from...
simple instructions
Regularity makes implementation...
simpler
SMT
Simultaneous multithreading. In a multiple-issue, dynamically scheduled processor: schedule instructions from multiple threads; instructions from independent threads execute when function units are available; within threads, dependencies are handled by scheduling and register renaming.
what is SISD?
single instruction single data (a uniprocessor)
what does SIMD stand for?
single-instruction, multiple-data
Most constants are ____, and __-bit immediate is sufficient
small; 12
in regards to immediate constants, large integers will....
sometimes work
purpose of X28 (SP):
stack pointer
how do you resolve a control hazard?
stalling or prediction
how is multiple issue implemented?
static (decisions are made by the compiler before execution) or dynamic (decisions are made during execution by the processor)
what are the 3 main types of hazards?
structure, data, and control
Fine-grain multithreading
Switch threads after each cycle; interleave instruction execution; if one thread stalls, the others are executed.
what does this line mean?: mov $1, %rax
system call 1 is write
The speed of processing depends on if program was written to .... of multiple cores
take advantage
Parallelizable means....
the task can be broken into separate processes
what are the 2 types of locality and how do they differ?
temporal locality: items accessed recently are likely to be accessed again soon. spatial locality: Items near those accessed recently are likely to be accessed soon
purpose of X9 -X15
temporaries
in java, what interprets bytecode?
the JVM
Compiler optimizations are sensitive to...
the algorithm
purpose of XZR (register 31):
the constant value 0
how do the pieces of the datapath fit together?
the memory stores the current instruction, the PC stores the address of the current instruction, the ALU executes the current instruction, and a mux chooses from multiple sources and steers one of those sources to its destination
in n-way set associative cache mapping, what is n, typically?
the number of blocks (ways) in each set; typically a small power of two such as 2, 4, or 8
In PC-relative addressing, the branch address is...
the sum of the PC and a constant in the instruction. Example: B Loop1
what is the principle of locality?
the tendency of a processor to access the same set of memory locations repetitively over a short period of time
units of work/time is also known as...
throughput/bandwidth
what are the 2 facets of performance?
time/units of work and units of work/time
what is a Text segment?
translated instructions
what tool helps blocks to be found quickly in virtual memory?
translation lookaside buffer
what is a TLB?
translation lookaside buffer (TLB); buffer that memory management hardware uses to improve virtual address translation speed (NOT used in the cache)
T/F - is the offset optional in immediate offset addressing?
true (written as LDURSB X1, [X2])
Many ARMv8 instructions allow for 12 bit.....
unsigned constants
how can you increase the speed of a multiply?
use multiple adders
how does virtual memory work?
maps virtual addresses to physical addresses through a page table, using a translation lookaside buffer to cache recent translations and speed up access
what is paralleling?
using multiple resources to solve problems concurrently
x86 Instruction Encoding features ____ length encoding.
variable
how does a computer handle multiplication?
via long multiplication - the length of the product is the sum of the operand lengths
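A Python sketch of long (shift-and-add) multiplication for unsigned operands, assuming a 32-bit operand width. The function name and the width are assumptions; real hardware uses the optimized variants covered in other cards.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, bits: int = 32) -> int:
    """Multiply two unsigned values by examining one multiplier bit per step."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:
            product += multiplicand   # add the partial product for this bit
        multiplicand <<= 1            # shift the multiplicand left
        multiplier >>= 1              # move on to the next multiplier bit
    return product


print(shift_add_multiply(13, 11))

# The product of two 32-bit operands needs at most 32 + 32 = 64 bits.
largest = shift_add_multiply(0xFFFFFFFF, 0xFFFFFFFF)
print(largest.bit_length())
```

The second call illustrates the card's point that the length of the product is the sum of the operand lengths.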
Market share makes IA-32 economically...
viable
A Program can be loaded into absolute location in...
virtual memory space
formulas for die per wafer
wafer area/die area
what is the Hamming SEC code?
a single error correcting (SEC) code that uses parity bits to detect and correct single-bit errors, making memory more reliable
what is AND
bit-by-bit AND
what is dynamic scheduling?
when the CPU executes instructions out of order to avoid stalls
2 examples of Wireless network
wifi, bluetooth
virtual memory uses write-through or write-back?
write-back
what does LDURB do?
zero-extend loaded byte
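A Python sketch of the difference between zero-extending (LDURB) and sign-extending (LDURSB) a loaded byte, assuming a 64-bit destination register. The helper names are hypothetical.

```python
def zero_extend_byte(byte: int) -> int:
    """Fill the upper bits with zeros, as LDURB does."""
    return byte & 0xFF


def sign_extend_byte(byte: int, bits: int = 64) -> int:
    """Copy bit 7 of the byte into all upper bits, as LDURSB does."""
    byte &= 0xFF
    if byte & 0x80:
        upper_ones = ((1 << bits) - 1) & ~0xFF
        return byte | upper_ones
    return byte


print(hex(zero_extend_byte(0xF0)))
print(hex(sign_extend_byte(0xF0)))
print(hex(sign_extend_byte(0x70)))
```

The two loads give the same register value whenever the top bit of the byte is 0; they differ only for bytes with bit 7 set.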