TDTS10 - Exam
An overwhelming dominance of simple (ALU and move) operations over complex operations. A dominance of simple addressing modes. Most of the referenced operands are scalars (can therefore be stored in registers) and are local variables or parameters. Implementing many registers in an architecture is good, since we can reduce the number of memory accesses. Optimizing the procedure CALL/RETURN mechanism promises large benefits in performance improvement
4. What are the main characteristics of program execution that have led to the development of the RISC architecture?
The collection of instructions a CPU can execute. Because the software and hardware are different, the interface is the instruction set. "It is an agreement between software and hardware on what functions will be supported, how they are supported etc"
4.1. What is an instruction set? Explain why we say that the instruction set serves as the interface between computer hardware and software
Cluster: A set of computers connected, typically over a high-bandwidth local area network, and used as a multi-computer system. It is a group of interconnected stand-alone computers that work together as a unified resource. Each computer is called a node. A node can itself be a multiprocessor, such as an SMP. Nodes communicate by message passing; there is no longer a shared memory space. Ex. Google uses several clusters of off-the-shelf PCs to provide a cheap solution for its services. More than 15,000 nodes form a cluster.
10. Define and discuss the main features of a cluster computer system. What are the advantages of having such a system?
A set of similar processors of comparable capacity. Each processor can perform the same functions (symmetric). The processors are connected by a bus or other interconnections. They share the same memory. A single memory or a set of memory modules. Memory access time is the same for each processor. All processors share access to I/O. Either through the same channels or different channels giving paths to the same devices. All processors are controlled by an integrated operating system
8. What is a symmetric multiprocessor system? What are its main characteristics?
Short: CISC - larger, more feature-rich instruction set (more operations, addressing modes, etc.), slower clock speeds, fewer general-purpose registers. Examples: x86 variants. RISC - smaller, simpler instruction set, faster clock speeds, more general-purpose registers. Examples: MIPS, ARM, PowerPC.
8.3 Discuss the differences and arguments for RISC and CISC machines, respectively. (Short answer)
CISC arguments: A rich instruction set should simplify the compiler by having instructions which match HLL statements. If a program is smaller in size, due to the use of complex machine instructions, it has better performance: it takes up less memory space and needs fewer instruction fetch cycles, which is important when memory space is limited. Fewer instructions are executed, which may lead to shorter execution time. Program execution efficiency can also be improved by implementing operations in microcode rather than machine code. Smaller program code size, simpler compilers. Used mainly in desktops and servers. CISC problems: A large instruction set requires complex and time-consuming hardware steps to decode and execute instructions. Some complex machine instructions may not match HLL statements exactly, in which case they may be of little use. Memory bottleneck is a major problem, due to complex addressing modes and instructions with multiple memory accesses. The irregularity of instruction execution reduces the efficiency of instruction pipelining. It also leads to more complex design tasks, and thus a longer time-to-market. Complex to implement, consumes more power. RISC arguments: Improved speed, simpler and easier-to-use hardware, and a shorter design cycle. Easier to implement, lower transistor count for RISC cores. Faster clock speeds. In vague terms, the power consumed per instruction execution is lower; hence it is mainly used in mobile and other power-sensitive products. Disadvantages of RISC: Larger program code size due to the reduced instruction set.
8.3. Discuss the differences and arguments for RISC and CISC machines, respectively.
Implementing many registers in an architecture is useful, since we can reduce the number of memory accesses. Registers can be accessed much more quickly than main memory. However, if the contents of all registers must be saved at every procedure call, more registers mean longer delay.
8.5. Why is it useful to have many registers in a CPU? Is there any disadvantage of having many registers?
A large number of registers is usually very useful. However, if the contents of all registers must be saved at every procedure call, more registers mean longer delay. A solution to this problem is to divide the register file into a set of fixed-size windows. Each window is assigned to a procedure. Windows for adjacent procedures overlap to allow parameter passing, so a call or return only switches windows instead of saving and restoring registers in memory.
8.6 Describe the concept of overlapping register windows. What are the main advantages of such mechanism?
• Single instruction, single data (SISD) stream: A single processor executes a single instruction stream to operate on data stored in a single memory. Uniprocessors fall into this category. • Single instruction, multiple data (SIMD) stream: Simultaneous execution on different sets of data. A large number of processing elements is usually implemented. A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that each instruction is executed on a different set of data by the different processors. Vector and array processors fall into this category. • Multiple instruction, single data (MISD) stream: A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure is not commercially implemented. • Multiple instruction, multiple data (MIMD) stream: A set of processors simultaneously execute different instruction sequences on different data sets. The MIMD class can be further divided into shared memory (tightly coupled): symmetric multiprocessor (SMP) and non-uniform memory access (NUMA); and distributed memory (loosely coupled): clusters.
8.7. Define Flynn's taxonomy for classification of computers. List and briefly define the (4) different types of computer system organization according to Flynn.
Non-uniform memory access (NUMA): All processors have access to all parts of memory. The access time of a processor differs depending on the region of memory. Different processors access different regions of memory at different speeds. A typical NUMA organization: Each node is, in general, an SMP. Each node has its own main memory. Each processor has its own L1 and L2 caches. Nodes are connected by some networking facility. Each processor sees a single addressable memory space: each memory location has a unique system-wide address. Memory request order: 1. L1 cache (local to processor). 2. L2 cache (local to processor). 3. Main memory (local to node). 4. Remote memory (via the interconnect network). All of this is done automatically and is transparent to the processor, but with very different access times.
9. Discuss the main features of a NUMA architecture.
A part of the CPU that controls the execution of instructions. Controls the datapath, I/O devices, memory, etc.
Control Unit
A collection of functional units that is a part of the CPU. It performs operations such as adding, shifting, jumping etc
Datapath
How long it takes for one task to be executed.
Response time:
How many tasks can be executed in a given time.
Throughput:
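Split cache: one for instructions and one for data. This gives better performance, since an instruction fetch and a data access can proceed at the same time, but there is no balancing between the two as in a unified cache (a single cache for both data and instructions).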
3.5. Define the Harvard architecture. Why is it useful to use the Harvard architecture?
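1. Ignore (disable) new interrupts while an interrupt is being handled, so interrupts are processed one after another. 2. Prioritize interrupts, so that a higher-priority interrupt can interrupt the handler of a lower-priority one.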
3.24. Discuss the two different ways to handle multiple interrupts.
The ISR services the event that caused the interrupt. It saves all status information, handles the device (e.g., reads data from the input device), and restores the saved status information.
3.23. What is the purpose and function of an Interrupt Service Routine (ISR)?
Input devices, output devices, memory, datapath and control (= processor). Input/output devices provide a means for us to make use of a computer system. Memory is where data is stored. Processor = datapath + control unit. A datapath is a collection of functional units, such as arithmetic logic units, that perform data processing operations, registers, and buses. The control unit controls the execution of instructions.
1.1 What are the main components of a computer system? Briefly explain the basic function of each component.
- Data and instructions are stored in a single read/write memory. - The contents of this memory are addressable by location, without regard to what are stored there. - Instructions are executed sequentially (from one instruction to the next) unless the order is explicitly modified.
1.2 What are the von Neumann architecture principles?
- Programmable - They can solve very different problems by executing different programs. Instruction execution is done automatically. - It can be built with very simple electronic components. The data processing function is performed by electronic gates. The data storage function is provided by memory cells. Data communication is achieved by electronic wires.
1.3 What are the advantages of the von Neumann architecture?
The execution of an instruction is carried out in a machine cycle (instruction cycle).
1.4 a) Describe the instruction execution cycle (machine cycle).
Fetch cycle and execute cycle. Fetch: fetches the instruction from the main memory into the IR and decodes it in the control unit. Execute: performs the operation, e.g., a calculation in the ALU.
1.4 b) What are the two main phases of instruction execution?
Control unit: Controls the operation of the CPU and hence the computer. It interprets (decodes) the instruction to be executed and "tells" the other components what to do. Arithmetic and logic unit (ALU): Performs the computer's data processing functions. Registers: Provide storage internal to the CPU; temporary storage devices used to hold control information, key data, and intermediate results. CPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers.
1.5. What are the main components of a CPU? What are the different components used for?
Program counter (PC): An incrementing counter that keeps track of the memory address of the instruction that is to be executed next or in other words, holds the address of the instruction to be executed next. Memory address register (MAR): Holds the address of a block of memory for reading from or writing to. Memory data register (MDR): A two-way register that holds data fetched from memory (and ready for the CPU to process) or data waiting to be stored in memory. (This is also known as the memory buffer register (MBR).) Instruction register (IR): A temporary holding ground for the instruction that has just been fetched from memory. Control unit (CU): Decodes the program instruction in the IR, selecting machine resources, such as a data source register and a particular arithmetic operation, and coordinates activation of those resources. Arithmetic logic unit (ALU): Performs mathematical and logical operations.
1.6 Explain briefly how the CPU components work together to execute instructions. Program counter (PC).
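A minimal sketch of how the components in the card above cooperate during the fetch and execute cycles. The toy accumulator machine, its four opcodes (LOAD, ADD, STORE, HALT), the tuple encoding of instructions, and the function name `run` are all assumptions made for illustration; they are not taken from the course material.

```python
def run(memory):
    """Simulate a toy accumulator machine (hypothetical instruction set)."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator: holds ALU operands and results
    while True:
        # Fetch: PC -> MAR, memory[MAR] -> MDR -> IR, then increment the PC.
        mar = pc
        mdr = memory[mar]
        ir = mdr
        pc += 1
        # Decode: the control unit splits the IR into opcode and address.
        opcode, address = ir
        # Execute: the control unit activates the needed resources.
        if opcode == "LOAD":
            mar = address
            mdr = memory[mar]
            acc = mdr
        elif opcode == "ADD":
            mar = address
            mdr = memory[mar]
            acc = acc + mdr          # the ALU performs the addition
        elif opcode == "STORE":
            mar = address
            mdr = acc
            memory[mar] = mdr
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


# Instructions and data share one memory (von Neumann principle).
program = [
    ("LOAD", 4),    # address 0
    ("ADD", 5),     # address 1
    ("STORE", 6),   # address 2
    ("HALT", 0),    # address 3
    7,              # address 4: first operand
    35,             # address 5: second operand
    0,              # address 6: result goes here
]
result = run(program)
```

After the run, `result` and `program[6]` both hold the sum of the two operands, showing the path PC, MAR, MDR, IR, control unit, ALU.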
Instruction register (IR): A register that is used to hold an instruction for interpretation. Contains the instruction most recently fetched. Program counter (PC)/instruction address register: A special-purpose register used to hold the address of the next instruction to be executed. (Contains the address of the next instruction-pair to be fetched from memory.) Accumulator register (AR): Employed to temporarily hold operands and results of ALU operations. General-purpose registers: Can store both data and addresses, i.e., they are combined data/address registers.
1.7 What is the function of each the following registers in the CPU? Instruction register (IR) Program counter (PC)/instruction address register. Accumulator register (AR) General-purpose registers
FLOPS is a measure of computer performance, useful in fields of scientific calculations that make heavy use of floating-point calculations. (FLoating-point Operations Per Second) MIPS - Million instructions per second. Where Instructions per second (IPS) is a measure of a computer's processor speed.
1.8 Define the computer performance measurement units MIPS and FLOPS.
Moore observed that the number of transistors that could be put on a single chip was doubling every year and correctly predicted that this pace would continue into the near future. To the surprise of many, including Moore, the pace continued year after year and decade after decade. (Moore later revised the rate to a doubling about every two years.) A negative implication of Moore's law is obsolescence: as technologies continue to "improve" rapidly, the improvements may be significant enough to render predecessor technologies obsolete quickly.
1.9 Explain Moore's law, and discuss its implications.
The main memory, also called primary memory, is used to store the program and data which are currently manipulated by the CPU.
2.1. What is the function of the main memory?
The secondary memory provides the long-term storage of large amounts of data and program.
2.10. What are the main purposes of the secondary memories?
• Magnetic tape: Magnetic tape is made from a layer of plastic coated with iron oxide. The oxide can be magnetized in different directions to represent data. Features: Sequential access (access time about 1-5 s). Relatively high storage capacity (ca. 80 MB per tape). Very cheap. • Diskette: Data are recorded on the surface of a floppy disk made of polyester coated with magnetic material. Features: Direct-access memory. Cheap. Portable, convenient to use. • Hard disk: Data are recorded on the surface of a hard disk made of metal coated with magnetic material. A hard disk spins constantly and at a very high speed to reduce seek time, rotational delay and read/write time. Features: Direct-access memory. Fast access: seek time ≈ 8 ms (vs. 100 ms for floppy), rotational delay ≈ 3 ms (vs. 100 ms for floppy), data transfer rate ≈ 1 Gbit/s (vs. 0.5 Mbit/s for floppy). Huge storage capacity (ca. 200 GB for a compact unit). • Optical memory: An optical disk's surface is imprinted with microscopic holes which record digital information. When a low-powered laser beam shines on the surface, the intensity of the reflected light changes, representing 0 or 1. E.g., CD-ROM. • USB flash drive: A small, portable flash memory card. It plugs into a computer's USB port. It functions as a portable hard drive. Convenient for storing and transferring data.
2.11. Briefly explain how the following secondary storage devices work and discuss their main features: • Magnetic tape • Diskette • Hard disk • Optical memory • USB flash drive
Access time: Seek time — the time required to spin the disk to a constant rotation speed and to position the read/write head at the right track. Rotational delay — the time required for the read/write head to position at the beginning of the sectors where data are stored. Read/write time — the time required to read/write a basic unit of data. Data transfer rate — DTR = 1 / (read/write time)
2.12. Give a short definition of seek time, rotational delay, read/write time, and data/transfer rate for a diskbased device
A secondary memory is usually divided into large blocks. Each block has a unique address and can be individually addressed. Data are moved between the secondary memory and the main memory one block at a time.
2.13. How is a secondary memory accessed by the CPU?
1. Daily dumping of data. 2. Logging of the transactions performed during the day. 3. Disk crash. 4. Data are safe in the backup. Recovery: 1. Copy the dumped data to a new disk. 2. Update the dumped data with the logged transactions. 3. The system is back to normal. Backup is necessary because otherwise you might lose all data.
2.14. Explain the back-up procedure. Why is back-up necessary?
In computer architecture the memory hierarchy is a concept used to discuss performance issues. A memory system has to store very large programs and a huge amount of data and still provide fast access. No single type of memory can provide all such needs of a computer system. Therefore, several different storage mechanisms are organized in a layered hierarchy. As one goes down the hierarchy, the following occur: a. Decreasing cost per bit. b. Increasing capacity. c. Increasing access time. d. Decreasing frequency of access of the memory by the processor. Smaller, more expensive, faster memories are supplemented by larger, cheaper, slower memories. The result is better access time and performance.
2.15. What does it mean by a memory hierarchy? Why it is useful to build a memory hierarchy?
It works if conditions (a) through (d) apply. The key to the success of this organization is item (d), "decreasing frequency of access", which holds because of locality of reference.
2.16. What is the fundamental assumption that makes a memory hierarchy work efficiently?
Locality of reference - Programs access a small proportion of their address space at any short period of time. Temporal locality: If an item is accessed, it will tend to be accessed again soon. Spatial locality: If an item is accessed, items whose addresses are close by will tend to be accessed soon.
2.17. Give the definitions of locality of reference, temporal locality, and spatial locality
As one goes down the memory hierarchy, one finds decreasing cost/bit, increasing capacity, and slower access time. It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory. The design challenge is to organize the data and programs in memory so that the accessed memory words are usually in the faster memory.
2.18. What is the general relationship among access time, storage capacity, and cost of a given memory technology?
Because the memory cycle time is much longer than the clock cycle time (and the machine cycle time) of the CPU.
2.2. Why is memory access the bottleneck of a computer?
The quantitative measurement of the capacity of the bottleneck is the memory bandwidth. Memory bandwidth denotes the amount of data that can be accessed from a memory per second. M-Bandwidth = (1 / memory cycle time) ∙ amount of data per access. Ex. MCT = 100 ns and 4 bytes (a word) per access: M-Bandwidth = 40 megabytes per second.
2.3. How do you define and compute the memory bandwidth of a memory?
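The formula in the card above can be checked with a few lines of Python. This is a small sketch; the function name `memory_bandwidth` and its parameter names are chosen here for illustration.

```python
def memory_bandwidth(cycle_time_s, bytes_per_access):
    """Bytes per second = (1 / memory cycle time) * data per access."""
    return bytes_per_access / cycle_time_s


# The card's example: 100 ns memory cycle time, 4 bytes (one word) per access.
bw = memory_bandwidth(100e-9, 4)
print(f"{bw / 1e6:.0f} MB/s")
```

With the card's numbers this gives about 40 megabytes per second, matching the worked example.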
Reduce the memory cycle time, which is expensive and limits the memory size. Or divide the memory into several banks, each of which has its own control unit (using parallelism).
2.4. How to increase the bandwidth of the main memory?
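A device signals the CPU when it needs service; the CPU suspends the current program, runs an interrupt service routine, and then resumes. The mechanism is used to free the CPU from having to check the devices periodically. For example, an input device with data ready might cause an interrupt.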
3.22. Describe the interrupt mechanism. What can it be used for? What may cause an interrupt?
It means that the memory is divided into several banks, each of which has its own control unit, and that program and data words which are accessed consecutively are placed in different banks. This increases bandwidth/access performance, since accesses to different banks can proceed in parallel.
2.5. What does it mean by interleaving placement of program and data? Why is this placement approach useful?
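A short sketch of one common way to interleave, assuming low-order interleaving (consecutive addresses go to consecutive banks). The card does not say which scheme the course uses, so the scheme, the function name `bank_of`, and the choice of 4 banks are assumptions.

```python
def bank_of(address, num_banks):
    """Return (bank, offset within bank) under low-order interleaving."""
    return address % num_banks, address // num_banks


# Eight consecutive word addresses spread over 4 banks.
placements = [bank_of(a, 4) for a in range(8)]
for address, (bank, offset) in enumerate(placements):
    print(f"address {address} -> bank {bank}, offset {offset}")
```

Addresses 0 to 3 land in four different banks, so a sequential access pattern keeps all banks busy instead of queuing on one.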
- They exhibit two stable (or semistable) states, which can be used to represent binary 1 and 0. - They are capable of being written into (at least once), to set the state. - They are capable of being read to sense the state. - It is also fast: no mechanical arm is needed to read the data, and access is direct/random.
2.6. What are the main features of a semiconductor main memory?
Sequential access: Memory is organized into units of data, called records. Access must be made in a specific linear sequence. If a data item is to be read, all data items before it must also be read. Access time is variable. Direct access: Individual blocks or records have a unique address based on physical location. Access is accomplished by direct access to reach a general vicinity plus sequential searching, counting, or waiting to reach the final location. Access time is variable. Random access: Each addressable location in memory has a unique, physically wired-in addressing mechanism. The time to access a given location is independent of the sequence of prior accesses and is constant.
2.7. What are the differences among sequential access, direct access, and random access?
A memory where a word is retrieved based on a portion of its contents rather than its address. Comparison of the given bits of a word with a specified pattern is made for each access, and this is performed for all words simultaneously. For searches, a CAM (content-addressable memory) is faster than RAM. "best for users that require searches to take place quickly and whose searches are critical for job performance on the machine." Advantages: It is suitable for parallel searches. It is also used where search time needs to be short.
2.8. What is an associative memory? What is the advantage of using an associative memory?
It is a permanent memory which can only be read, not written. It can be used for the instructions that start the computer when it is first switched on (BIOS) and when fast reading of programs/data is required (e.g., to store library subroutines, such as a division operation, or dictionaries for spell checking).
2.9. What can a read only memory (ROM) be used for? Why?
A cache memory is a fast memory between the CPU and the main memory, holding segments from the main memory. It is based on the logic that a computer is a "predictable and iterative reader", i.e., on locality of reference. Features: - It is transparent to the programmers. The CPU still refers to the instructions/data by their addresses in the MM. - Only a very small part of the program/data in the main memory has its copy in the cache. - If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory will be copied into the cache. - The immediate-future memory accesses will usually refer to the same word or to words in the neighborhood, and will not have to involve the main memory (locality of reference).
3.1. What is a cache memory? How does it work? What are the main features of a cache?
Tag - Set - Offset. Tag: identifies which main-memory block is stored in a slot of the set. Set: like the slot field for direct mapping, but it now selects a set, and the block can be stored in any of the slots of that set. Offset: where within a block we can find our data.
3.10. For a set-associative cache, a main memory address is viewed as consisting of three fields. List and define the three fields.
First-in-first-out, Least-recently used, Least-frequently used, Random.
3.11. What are the different cache replacement algorithms?
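As an illustration of one of the algorithms listed in the card above, here is a minimal sketch of least-recently-used replacement for a fully associative cache. The function name `simulate_lru`, the block numbers, and the 3-slot cache are made up for the example; real caches implement this in hardware, not with a dictionary.

```python
from collections import OrderedDict


def simulate_lru(references, num_slots):
    """Return (hit count, final cache contents from LRU to MRU)."""
    cache = OrderedDict()   # keys are block numbers, ordered by recency
    hits = 0
    for block in references:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            if len(cache) >= num_slots:
                cache.popitem(last=False)   # evict the least recently used
            cache[block] = True
    return hits, list(cache)


hits, contents = simulate_lru([1, 2, 3, 1, 4, 1, 2], num_slots=3)
print(hits, contents)
```

In this trace block 2 is evicted when block 4 arrives, because block 1 was touched more recently; FIFO would have evicted block 1 instead.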
Write through - All write operations are passed to the main memory. If the addressed location is currently in the cache, the cache copy is also updated so that it is coherent with the main memory. For writes the CPU slows to MM speed (but only about 15% of accesses are writes, so the decrease is small). Write through with buffered write - The CPU does not slow down: the write address and data are stored in a high-speed write buffer and the CPU continues; requires complex hardware. Write back - Writes update only the cache; the MM is updated only when the cache block is replaced.
3.12. Describe the three different write policies that are used to keep the cache contents and the contents of the main memory consistent.
Divide programs into equally sized pages and the MM into frames of the same size. Allocate the required number of frames to a program; the OS is responsible for the frames. The basic idea: load only the pieces of each executing program which are currently needed (on demand).
3.13. What does it mean by virtual memory? Describe how a virtual memory works.
Give the programmer bigger memory space than MM. To allow multiple programs to share main memory dynamically and efficiently. You don't need to negotiate with the others to share the physical memory addresses. Each program gets a private virtual address space
3.14. Why is it useful to have a virtual memory?
No, only the required pages are in the MM.
3.15. Is it necessary for all of the pages of a program to be in the main memory while the program is being executed?
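By using a page table, which maps each virtual page number to the frame that holds the page; the offset within the page is kept unchanged.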
3.16. How is a logical (virtual) address converted into a physical address of the main memory?
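A minimal sketch of the translation described in the card above, with a page fault raised when the page is not in main memory. The 4096-byte page size, the dictionary used as a page table, and the function name `translate` are assumptions for illustration; a real system does this in hardware with OS support.

```python
PAGE_SIZE = 4096  # assumed page/frame size in bytes


def translate(virtual_address, page_table, page_size=PAGE_SIZE):
    """Map a virtual address to a physical address via the page table."""
    page, offset = divmod(virtual_address, page_size)
    frame = page_table.get(page)
    if frame is None:
        # The page is not in main memory: the OS would have to load it.
        raise LookupError(f"page fault: page {page} is not in main memory")
    return frame * page_size + offset


# Hypothetical page table: page 0 is in frame 5, page 1 is in frame 2.
page_table = {0: 5, 1: 2}
physical = translate(4100, page_table)   # page 1, offset 4
print(physical)
```

Only the page number is replaced by a frame number; the offset within the page is carried over as is.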
When a page is not in MM, the page needs to be loaded by the OS.
3.17. What does it mean by page fault? How is a page fault dealt with by the computer?
First-in-first-out, Least-recently used, Least-frequently used
3.18. Describe the main principles for memory page replacement in a virtual memory.
Interactive (ipad), Indirect (laser printer), much slower speed than CPU.
3.19. What are the main features and types of I/O devices and operations?
A cache is much faster than the MM, and even a small cache (about 0.1% of the MM size) can give a cache hit ratio of around 96%.
3.2. What are the advantages of having a cache?
Function as an interface between a computer system and other physical systems. Control and timing, CPU communication, Device comm, Data buffer, Error detection and correction
3.20. What are the main functions of an I/O module?
• Programmed I/O - Operations are controlled by instructions, i.e., READ/WRITE. The CPU waits until the I/O operation is completed. Slow but simple; used in embedded systems. • Interrupt-driven I/O - The I/O device sends an interrupt to the CPU and the CPU runs the interrupt service routine (ISR); the CPU does not have to check periodically. • Direct memory access (DMA) - The I/O module transfers data directly to or from the MM without CPU involvement.
3.21. Explain the following ways of controlling I/O devices. What are the advantages and disadvantages of each technique?
AAT = Phit x Tcache_access + (1 - Phit) x (Tmm_access + Tcache_access) x Block_size + Tchecking. Ex. A computer has an 8 MB MM with 100 ns access time and an 8 KB cache with 10 ns access time; with Block_size = 4, Tchecking = 2.1 ns, and Phit = 0.97, the AAT will be 25 ns.
3.3. How is the average access time calculated for a given combined cache/memory system?
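The worked example in the card above can be reproduced directly from the formula. This is a small sketch; the function and parameter names are chosen here, and all times are in nanoseconds.

```python
def average_access_time(p_hit, t_cache, t_mm, block_size, t_checking):
    """AAT = Phit*Tcache + (1 - Phit)*(Tmm + Tcache)*Block_size + Tchecking."""
    hit_part = p_hit * t_cache
    miss_part = (1 - p_hit) * (t_mm + t_cache) * block_size
    return hit_part + miss_part + t_checking


# Values from the card: Phit = 0.97, 10 ns cache, 100 ns MM, BS = 4, 2.1 ns checking.
aat = average_access_time(0.97, 10, 100, 4, 2.1)
print(f"{aat:.1f} ns")
```

The three terms are 9.7 ns (hits), 13.2 ns (misses), and 2.1 ns (checking), which sum to the 25 ns stated in the card.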
The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full. Block/line size, total cache size, mapping function, replacement method, write policy, number of caches.
3.4. What are the main cache design issues/parameters?
Cheaper, and better balance if the mix of instruction and data fetches varies.
3.6. What are the advantages of using a unified cache?
Direct - Each block of MM maps to one fixed cache slot. Cheap, fast checking. Associative - Each block of MM can map to any slot. Needs a mechanism to examine every slot's tag. Set associative - The cache is divided into sets, and each set contains W slots. A block maps to one fixed set but can be placed in any slot of that set. For example, 2 slots per set (W = 2): 2-way associative mapping. Direct mapping corresponds to W = 1 (no alternative).
3.7. Describe the different cache mapping functions. Describe briefly the main features of each of the different mapping functions.
Tag: Identifies which of the main-memory blocks that map to this cache slot is currently stored there. Slot: Which slot in the cache this memory address is mapped to. Word: Which word within the block is being addressed.
3.8. For a direct-mapped cache, a main memory address is viewed as consisting of three fields. List and define the three fields.
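A minimal sketch of how a direct-mapped cache could split an address into the three fields from the card above. The field widths (4 slot bits, 2 word bits), the example address, and the function name `split_address` are assumptions for illustration only.

```python
def split_address(address, slot_bits, word_bits):
    """Split a main-memory address into (tag, slot, word) fields."""
    word = address & ((1 << word_bits) - 1)                 # lowest bits
    slot = (address >> word_bits) & ((1 << slot_bits) - 1)  # middle bits
    tag = address >> (word_bits + slot_bits)                # remaining high bits
    return tag, slot, word


# Address 0b1011011001 read as tag=1011, slot=0110, word=01.
tag, slot, word = split_address(0b1011011001, slot_bits=4, word_bits=2)
print(tag, slot, word)
```

The slot field picks the cache line, the stored tag is compared with the address tag to detect a hit, and the word field selects the data within the block.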
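Tag - Word. Tag: identifies the main-memory block, since a block can be placed in any slot. Word: selects the word within the block.
3.9. For an associative cache, a main memory address is viewed as consisting of two fields. List and define the two fields.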
Instructions a computer can execute directly
4.10. What are machine codes?
A symbolic program consisting of operation codes, operand addresses, and instruction addresses. An assembler translates the assembly program into machine codes. Before an assembly program can be executed, it must be translated into machine codes.
4.11. What are assembly programs? What is the basic function of an assembler?
A compiler translates the HL program into machine codes. An interpreted program is not compiled into machine codes. Rather, the statements are "executed" by the interpreter: it evaluates the HL program together with the input data, which results in output data.
4.12. What are the differences between a compiler and an interpreter?
Operation repertoire, Data types(supported), Instruction format( Length, number of addresses, size of various fields, etc), (CPU)Register organization, Addressing(Which addressing modes to be provided)
4.2. What are the main issues to be considered when designing an instruction set?
One can use hardwired or microcode for the IS, hardwired being faster but costs more to change or add features.
4.3. In what way can the instruction set influence the overall performance and implementation cost of a computer system?
A machine instruction specifies the following information: What has to be done (operation code) To whom the operation applies (source operands) Where does the result go (destination operand) How to continue after the operation is finished (next instruction address).
4.4. What information is usually specified in a machine instruction? Why?
Arithmetic and Logic, Data transfer between MM and CPU registers, Program control, IO transfer.
4.5. List and briefly discuss the four main types of instructions.
The layout of the instruction: the sizes of the op code, register number, and MM address fields.
4.6. What is the instruction format in an instruction set?
Source operand, destination operand and address of next instruction.
4.7. If an instruction contains one/two/three/four address(es), what can be the purpose of each address?
Direct - The operand address is given directly in the address field of the instruction. Indirect - The address field gives an MM location that holds the address of the operand, rather than the operand address itself. This can reduce the length of the address field(s) in the instruction. Indexed - The operand address equals the address field plus the value stored in an index register. Relative - The address is relative to a base address stored in a register.
4.8. Describe the four basic addressing modes: direct addressing, indirect addressing, indexed addressing, and relative addressing. Give examples to show how they are used.
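To make the four modes in the card above concrete, here is a small sketch that computes the effective operand address for each. The mode names as strings, the function name `effective_address`, and the sample memory and register values are all hypothetical.

```python
def effective_address(mode, address_field, memory=None, index_reg=0, base_reg=0):
    """Return the operand's memory address for a given addressing mode."""
    if mode == "direct":
        return address_field                # field is the operand address
    if mode == "indirect":
        return memory[address_field]        # field points to the address
    if mode == "indexed":
        return address_field + index_reg    # field plus index register
    if mode == "relative":
        return base_reg + address_field     # base register plus displacement
    raise ValueError(f"unknown addressing mode {mode!r}")


memory = {100: 250}   # location 100 holds the address 250
examples = {
    "direct": effective_address("direct", 100),
    "indirect": effective_address("indirect", 100, memory=memory),
    "indexed": effective_address("indexed", 100, index_reg=8),
    "relative": effective_address("relative", 12, base_reg=2000),
}
print(examples)
```

Indexed addressing is typical for stepping through an array (the index register holds the element offset), while relative addressing is typical for branches near the current instruction.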
Immediate - The operand is given in the instruction itself, e.g., #254, so no memory access is needed to fetch it. It is faster: it saves one memory or cache cycle in the instruction cycle.
4.9. What does it mean by immediate addressing? Why is it useful to have this addressing mode?
A processor organization in which the processor consists of a number of stages, allowing multiple instructions to be executed concurrently. Other instructions can be processed in the same clock cycle even though one stage is occupied. It therefore gives higher CPU throughput.
5.1. What is the basic principle of an instruction pipeline?
A loop-closing branch is mispredicted once rather than twice. It tolerates a branch going an unusual direction one time.
5.10. What is the advantage of the bimodal prediction method, as compared with the one-bit predictor?
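The claim in the card above can be checked with a small simulation, assuming the bimodal predictor is a two-bit saturating counter (predict taken when the counter is 2 or 3). The function names, the initial states, and the example loop (three taken branches followed by one not-taken exit, run three times) are assumptions for illustration.

```python
def mispredictions_one_bit(outcomes, last_taken=True):
    """One-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_two_bit(outcomes, counter=3):
    """Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop-closing branch: taken 3 times, then falls through; loop run 3 times.
outcomes = [True, True, True, False] * 3
one_bit = mispredictions_one_bit(outcomes)
two_bit = mispredictions_two_bit(outcomes)
print(one_bit, two_bit)
```

The one-bit predictor misses both at the loop exit and again at the next loop entry, while the two-bit counter misses only at each exit.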
If some stages of the execution cycle are not sequentially dependent, the stages can be pipelined for parallelism. A typical instruction execution sequence: 1. Fetch Instruction (FI): Fetch the instruction. 2. Decode Instruction (DI): Determine the op-code and the operand specifiers. 3. Calculate Operands (CO): Calculate the effective addresses (e.g., virtual address -> physical address). 4. Fetch Operands (FO): Fetch the operands. 5. Execute Instruction (EI): Perform the operation. 6. Write Operand (WO): Store the result in memory.
5.2. Discuss one approach to divide the instruction execution cycle into several stages.
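A short sketch of what the six-stage division above buys in the ideal case. It assumes every stage takes exactly one clock cycle and that there are no hazards or stalls, which real pipelines do not achieve; the function names are chosen here for illustration.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline; then one finishes per cycle."""
    return n_stages + (n_instructions - 1)


n, k = 10, 6   # 10 instructions on the 6-stage pipeline (FI, DI, CO, FO, EI, WO)
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
print(unpipelined_cycles(n, k), pipelined_cycles(n, k), speedup)
```

For long instruction sequences the ideal speedup approaches the number of stages; hazards and branch mispredictions reduce it in practice.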
The complexity and power consumption of the CPU grows with the number of stages. Difficult to keep a long pipeline at maximum rate due to the hazards/stalls and incorrect branch prediction. Also, Implementing a pipeline stage adds cost and delay to architecture.
5.3. In general, a larger number of pipeline stages gives a better performance. Why this assumption doesn't lead to the situation that we have a huge number of pipeline stages?
Structural (resource) hazards - Hardware conflicts, caused by the use of the same hardware resource at the same time, e.g., memory conflicts. Solution: hardware resources are duplicated in order to avoid structural hazards, e.g., two ALUs. Functional units can be pipelined themselves to support several instructions at the same time. Memory conflicts can be solved by: having two separate caches, one for instructions and the other for data (Harvard architecture); using multiple banks of the main memory; or keeping as many intermediate results as possible in the registers. Data hazards - Caused by reversing the order of data-dependent operations due to the pipeline (e.g., WRITE/READ conflicts). Solution: the penalty due to data hazards can be reduced by a technique called forwarding (bypassing). Control hazards - Caused by branch instructions, which change the instruction execution order. Solution: the branch penalty can be reduced by delayed branching and by branch prediction.
5.4. Discuss briefly the different pipeline hazards that limit the performance of an instruction pipeline and the different solutions to address them.
The penalty due to data hazards can be reduced by a technique called forwarding (bypassing). The ALU result is fed back to the ALU input. If the hardware detects that the value needed for an operation is the one produced by the previous operation and has not yet been written back, the ALU selects the forwarded result instead of the value from the memory system.
5.5. What does it mean by the forwarding (bypassing) technique in the context of instruction pipeline? Which problem does this technique solve?
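The operand selection described in the forwarding answer above can be sketched as a small multiplexer model. Everything here is hypothetical and chosen for illustration: the function `select_operand`, the register names, and the two-instruction example.

```python
def select_operand(src_reg, register_file, pending_dest, pending_value):
    """Choose the ALU input for src_reg.

    pending_dest / pending_value describe the previous ALU result that
    has been computed but not yet written back to the register file.
    """
    if pending_dest is not None and src_reg == pending_dest:
        return pending_value          # forwarded (bypassed) result
    return register_file[src_reg]     # normal read from the register file


# Register file before write-back of the first instruction.
register_file = {"R1": 0, "R2": 5, "R3": 7}

# Instruction 1: ADD R1 <- R2 + R3 (result computed, write-back still pending).
add_result = register_file["R2"] + register_file["R3"]

# Instruction 2: SUB R4 <- R1 - R2 needs the new value of R1 immediately.
with_forwarding = (
    select_operand("R1", register_file, "R1", add_result)
    - select_operand("R2", register_file, "R1", add_result)
)

# Without the bypass path the stale R1 would be read (or the pipeline must stall).
without_forwarding = (
    select_operand("R1", register_file, None, None)
    - select_operand("R2", register_file, None, None)
)
```

In real hardware the comparison of register numbers is done by dedicated logic in the pipeline; the model only shows which value is chosen.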
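A software (compiler) solution: the instructions are re-arranged so that the branch takes effect later than originally specified, and the slot after the branch is filled with a useful instruction that is executed regardless of the branch outcome. This reduces the branch penalty caused by control hazards.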
5.6. What does it mean by delayed branch? Which problem does this technique solve?
When a branch is encountered, a prediction is made and the predicted path is followed. The instructions on the predicted path are fetched. The fetched instructions can also be executed before the branch outcome is known, which is called speculative execution.
5.7. What is speculative execution?
- Predict always taken - Assume that the jump will happen. Always fetch the target instruction. - Predict never taken - Assume that the jump will not happen. Always fetch the next instruction. - Predict by operation codes - Some instructions are more likely to result in a jump than others, e.g., BNZ (Branch if the result is Not Zero) and BEZ (Branch if the result Equals Zero). Can get up to 75% success. - Predict by relative positions - Backward-pointing branches will be taken (usually a loop back). Forward-pointing branches will not be taken (often a loop exit).
5.8. Describe briefly the static branch prediction methods.
- One-bit predictor - Based on branch history: information about branches is stored in a branch-history table so as to predict the branch outcome more accurately, e.g., by assuming that the branch will do what it did last time. - Bimodal prediction - Uses a 2-bit saturating counter to predict the most common direction, where the most significant bit indicates the prediction. Branches evaluated as taken (T) increment the counter towards strongly taken, and branches evaluated as not taken (N) decrement the counter towards strongly not taken. It tolerates a branch going in an unusual direction one time.
5.9. What are the dynamic branch prediction methods? Discuss briefly the one-bit predictor and the bimodal prediction techniques.
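The difference between the two dynamic predictors above (and the advantage asked about in 5.10) can be checked on a loop-closing branch. This is a minimal sketch: the function names, the initial predictor states, and the branch history (a four-iteration loop executed three times) are assumptions made for the example.

```python
def one_bit_mispredictions(outcomes, initial_prediction=True):
    """Predict that the branch does what it did last time."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken            # remember only the last outcome
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move towards strongly taken (3) or strongly not taken (0).
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Loop-closing branch: taken three times, then not taken at loop exit.
one_loop_run = [True, True, True, False]
history = one_loop_run * 3            # the loop is entered three times

one_bit_misses = one_bit_mispredictions(history)
two_bit_misses = two_bit_mispredictions(history)
```

The one-bit predictor misses at every loop exit and again at every re-entry, while the 2-bit counter only drops from strongly taken to weakly taken at the exit and still predicts taken on re-entry.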
The control unit (CU) controls the execution of instructions by issuing control signals to the datapath in the correct sequence.
6.1. What are the main purpose and operation of the control unit inside a computer?
Software that provides control, monitoring and data manipulation.
6.10. What are the possible applications of the microprogramming technique?
The CU must send a set of control signals to the datapath, i.e., the collection of registers, buses, and functional units that perform data-processing operations. The signals are used to: switch a datapath component on or off; set a flag signal; select a bus or multiplexer input.
6.2. What are the purposes of the control signals in the control unit?
Hardwired Implementation and Microprogrammed Control
6.3. What are the two main techniques for control unit implementation?
The basic idea is to implement the control unit as a microprogram execution machine (a computer inside a computer). The set of micro-operations occurring at one micro-clock cycle defines a micro-instruction. A sequence of micro-instructions defines a microprogram. The execution of a machine instruction becomes the execution of a sequence of micro-instructions.
6.4. Describe how the microprogramming technique works.
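The idea of a "computer inside a computer" from the answer above can be sketched as a tiny microprogram interpreter. This is a toy model: the accumulator machine, the register names (`ACC`, `MAR`, `MBR`), the two opcodes, and their microprograms are all hypothetical and only illustrate the principle.

```python
# Micro-operations: each one changes a single part of the machine state.
def mar_from_ir(state):
    state["MAR"] = state["IR_addr"]               # address field -> MAR

def mbr_from_memory(state):
    state["MBR"] = state["MEM"][state["MAR"]]     # read memory into MBR

def acc_from_mbr(state):
    state["ACC"] = state["MBR"]                   # load accumulator

def acc_add_mbr(state):
    state["ACC"] += state["MBR"]                  # add to accumulator


# Control memory: opcode -> microprogram (a list of micro-instructions,
# each micro-instruction being the micro-operations of one micro-cycle).
CONTROL_MEMORY = {
    "LOAD": [[mar_from_ir], [mbr_from_memory], [acc_from_mbr]],
    "ADD": [[mar_from_ir], [mbr_from_memory], [acc_add_mbr]],
}


def execute(opcode, address, state):
    """Run the microprogram for one machine instruction; return micro-cycles."""
    state["IR_addr"] = address
    micro_cycles = 0
    for micro_instruction in CONTROL_MEMORY[opcode]:
        for micro_operation in micro_instruction:
            micro_operation(state)
        micro_cycles += 1
    return micro_cycles


state = {"ACC": 0, "MAR": 0, "MBR": 0, "IR_addr": 0, "MEM": {10: 4, 11: 6}}
execute("LOAD", 10, state)
execute("ADD", 11, state)
```

Adding a new machine instruction in this model means adding one entry to `CONTROL_MEMORY`, which mirrors the flexibility argument for microprogrammed control.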
The control memory (microprogram memory) holds the microprograms. The microcode is stored in a µ-memory, which is much faster than a cache. Since the µ-memory stores only µ-instructions, a ROM is often used.
6.5. What is the purpose of a control memory?
The execution of a machine instruction becomes the execution of a sequence of micro-instructions.
6.6. What is the main difference between microprogramming and programming machine instructions?
Microprograms relate to machine instructions much as Java code relates to the machine instructions it is turned into: each machine instruction is implemented by a microprogram. The advantages are that microprogramming can: - simplify the control circuits; - increase flexibility (e.g., introducing a new instruction for computing division); - make it efficient to use the same computer for different purposes.
6.7. What are the advantages of using the microprogramming technique?
The firmware is rarely or never changed during the lifetime of the computer, so a read-only memory is sufficient. The ROM can still be upgraded, though (usually called flashing the firmware).
6.8. Why is a ROM used for storing microprograms rather than a RAM?
The microprograms stored in the µ-memory, i.e., software that provides control, monitoring and data manipulation.
6.9. What is the firmware of a computer system?
A superscalar architecture is a computer architecture designed to improve the performance of the execution of scalar instructions. A scalar is a variable that can hold only one atomic value at a time, e.g., an integer or a floating-point number. A scalar architecture processes one data item at a time (like the computers discussed up to now). In a superscalar architecture (SSA), several scalar instructions can be initiated simultaneously and executed independently.
7.1. What is a superscalar architecture? What are the main features of such an architecture?
SSA includes all features of pipelining but, in addition, there can be several instructions executing simultaneously in the same pipeline stage. SSA therefore introduces a new level of parallelism, called instruction-level parallelism.
7.2. Define the concept of instruction-level parallelism.
Instruction-level parallelism (ILP) is the average number of instructions in a program that a processor might be able to execute at the same time. It is determined by the number of true dependencies and procedural (control) dependencies in relation to the number of other instructions. Machine parallelism of a processor is the ability of the processor to take advantage of the ILP of the program. It is determined by the number of instructions that can be fetched and executed at the same time, i.e., the capacity of the hardware.
7.4. What is the distinction between instruction-level parallelism and machine parallelism?
• Resource conflict - Several instructions compete for the same hardware resource at the same time. They can be solved partly by introducing several hardware units for the same functions.
7.5 a) Briefly define the following terms in a superscalar architecture: Resource conflict
• Control (Procedural) dependency - The presence of branches creates major problems in implementing the maximal parallelism. Cannot execute instructions after a branch, in parallel, with instructions before a branch.
7.5 b) Briefly define the following terms in a superscalar architecture: • Control (Procedural) dependency
• Data conflict - Caused by data dependencies between instructions in the program. Similar to data hazards in a pipeline, but there are many more data dependencies now, due to the parallel execution of many instructions. To increase the degree of parallel execution, SSA allows great liberty in the order in which instructions are issued and executed. Therefore, data dependencies have to be considered and dealt with much more carefully.
7.5 c) Briefly define the following terms in a superscalar architecture: • Data conflict
• True data dependency - True data dependencies exist when the output of one instruction is required as an input to a subsequent instruction.
7.5 d) Briefly define the following terms in a superscalar architecture: • True data dependency
• Output dependency - An output dependency exists if two instructions are writing into the same location, i.e., write after write (WAW) dependency.
7.5 e) Briefly define the following terms in a superscalar architecture: • Output dependency
• Anti dependency - An anti-dependency exists if an instruction uses a location as an operand while a following one is writing into that location, i.e., write after read (WAR) dependency.
7.5 f) Briefly define the following terms in a superscalar architecture: • Anti dependency
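The three dependency types defined in 7.5 d) to f) can be detected mechanically from the registers each instruction reads and writes. This is a minimal sketch: the representation of an instruction as a `(destination, sources)` pair and the three example instructions are assumptions made for the illustration.

```python
def classify(first, second):
    """Return the dependencies of `second` on an earlier instruction `first`.

    Each instruction is a (destination, sources) pair of register names.
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = set()
    if first_dest in second_sources:
        found.add("RAW")    # true data dependency: read after write
    if first_dest == second_dest:
        found.add("WAW")    # output dependency: write after write
    if second_dest in first_sources:
        found.add("WAR")    # anti-dependency: write after read
    return found


# Hypothetical instruction sequence.
i1 = ("R3", ("R3", "R5"))   # R3 <- R3 op R5
i2 = ("R4", ("R3",))        # R4 <- R3 + 1
i3 = ("R3", ("R5",))        # R3 <- R5 + 1

deps_i1_i2 = classify(i1, i2)
deps_i1_i3 = classify(i1, i3)
deps_i2_i3 = classify(i2, i3)
```

Only the RAW result reflects a real flow of data between instructions; WAW and WAR arise purely from reuse of the register name `R3`.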
Output dependencies and anti-dependencies can be eliminated. They are storage conflicts, caused by several instructions competing for the same register, and are not real data dependencies. They can be eliminated by using additional registers, a technique called "register renaming".
7.6. Which dependencies can be eliminated? What is the technique used to eliminate them?
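Register renaming as described in the answer above can be sketched in a few lines. This is a simplified model under stated assumptions: an unlimited pool of physical registers named `P0`, `P1`, ... is available, instructions are `(destination, sources)` pairs, and the four-instruction program is invented for the example.

```python
def rename(instructions):
    """Give every result its own physical register.

    Sources are rewritten to the most recent physical register holding
    the architectural register; a register that has not been written
    yet keeps its architectural name (its initial value).
    """
    mapping = {}            # architectural register -> physical register
    next_physical = 0
    renamed = []
    for dest, sources in instructions:
        new_sources = tuple(mapping.get(src, src) for src in sources)
        new_dest = f"P{next_physical}"
        next_physical += 1
        mapping[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


program = [
    ("R3", ("R3", "R5")),   # I1: R3 <- R3 op R5
    ("R4", ("R3",)),        # I2: R4 <- R3 + 1   (true dependency on I1)
    ("R3", ("R5",)),        # I3: R3 <- R5 + 1   (WAW with I1, WAR with I2)
    ("R7", ("R3", "R4")),   # I4: R7 <- R3 op R4 (true dependencies on I3, I2)
]

renamed_program = rename(program)
destinations = [dest for dest, _ in renamed_program]
```

After renaming, I3 writes `P2` instead of `R3`, so it no longer conflicts with I1 or I2, while the true dependencies (I2 on I1, I4 on I3 and I2) are preserved through the rewritten source names.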
The window of execution can be extended over basic block borders by branch prediction ("speculative execution"). With speculative execution, instructions of the predicted path are entered into the window of execution. Instructions from the predicted path can be executed tentatively. If the prediction turns out to be correct, the state change produced by these instructions becomes permanent and visible (the instructions commit); otherwise, all effects are removed and the other path is executed.
7.7. Why do we have a "commit" mechanism in a superscalar architecture? How does this mechanism work?
In-order issue with in-order completion: The simplest instruction issue policy is to issue instructions in the exact order that would be achieved by sequential execution (in-order issue) and to write results in that same order (in-order completion). With in-order issue, the processor will only decode instructions up to the point of a dependency or conflict. No additional instructions are decoded until the conflict is resolved. In-order issue with out-of-order completion: Out-of-order completion is used in scalar RISC processors to improve the performance of instructions that require multiple cycles. With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional units. This policy gives rise to output dependencies. Out-of-order issue with out-of-order completion: To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline, so that the processor can look ahead for independent instructions. This is done with a buffer referred to as an instruction window. It is the most efficient policy, but it gives rise to anti-dependencies.
7.8. List and briefly define the three types of superscalar instruction execution policies related to issue and completion.
To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline. This is done with a buffer referred to as an instruction window. With this organization, after a processor has finished decoding an instruction, it is placed in the instruction window. As long as this buffer is not full, the processor can continue to fetch and decode new instructions. The window of execution should be sufficiently large: a larger window gives the processor more independent instructions to choose from and therefore better performance, but it requires more buffer storage and thus costs more.
7.9. What is the purpose of an instruction window? Discuss the impact of the instruction-window size with respect to performance and cost.
Load-and-store architecture → Only LOAD and STORE instructions reference data in memory. All other instructions operate only with registers (register-to-register instructions). Only a few simple addressing modes are used, e.g., register, direct, register indirect, and displacement. Instructions are of fixed length and uniform format. Loading and decoding of instructions are simple and fast. There is no need to wait until the length of an instruction is known in order to start decoding it. Decoding is simplified because the opcode and address fields are located in the same position for all instructions. A large number of registers can be implemented. Variables and intermediate results can be stored in registers and do not require repeated loads from and stores to memory. All local variables of procedures and the passed parameters can also be stored in registers. The large number of registers is possible because the reduced complexity of the processor leaves silicon space on the chip to implement them. This is usually not the case with CISC machines.
8.1. What are the main features of RISC computers?
The opposite trend is the Complex Instruction Set Computer (CISC), or the "regular" computer. The main idea is to make machine instructions similar to high-level language statements. A large number of instructions (> 200) with complex instructions and data types. Many and complex addressing modes (e.g., indirect addressing is often used). Microprogramming techniques are used to implement the control unit, due to its complexity.
8.2. What are the main features of CISC computers?