Ch. 8 CSI240

Two types of CPU architecture (both based on von Neumann architecture)

(1) CISC (2) RISC

To improve performance of the CPU

(1) Either reduce the number of steps in the fetch-execute cycle, or (2) Reduce the time for each instruction in the program

Two ways of handling the process of returning changed data from cache to main storage

(1) Write-through: writes data back to main memory immediately upon any change in the cache. This method has the advantage that the two copies, cache and main memory, are always kept identical. (2) Write-back: the changed data is held in cache until the cache line is about to be replaced. This is faster.
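
The two policies can be contrasted with a minimal Python sketch. All names here (TinyCache, etc.) are hypothetical, chosen only to illustrate when a write actually reaches main memory under each policy.

```python
# Toy model of the two cache write policies. Not a real cache: it only
# counts how many writes reach simulated main memory.

class TinyCache:
    def __init__(self, policy):
        self.policy = policy           # "write-through" or "write-back"
        self.cache = {}                # address -> value
        self.dirty = set()             # changed but not yet written back
        self.memory = {}               # simulated main memory
        self.memory_writes = 0         # writes that reached main memory

    def write(self, addr, value):
        self.cache[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value  # update main memory immediately
            self.memory_writes += 1
        else:                          # write-back: defer until replacement
            self.dirty.add(addr)

    def evict(self, addr):
        if addr in self.dirty:         # write-back flushes only on eviction
            self.memory[addr] = self.cache[addr]
            self.memory_writes += 1
            self.dirty.discard(addr)
        self.cache.pop(addr, None)

wt, wb = TinyCache("write-through"), TinyCache("write-back")
for c in (wt, wb):
    for v in range(3):
        c.write(0x10, v)               # three writes to the same address
    c.evict(0x10)

print(wt.memory_writes)  # 3 -- every write reaches memory
print(wb.memory_writes)  # 1 -- only the final value is written back
```

The sketch shows why write-back is faster (fewer memory accesses) and why it needs more careful design: until the eviction, main memory holds stale data.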

difficult technical issues that must be resolved to make it possible to execute multiple instructions simultaneously

- Problems that arise from instructions completing in the wrong order
- Changes in program flow due to branch instructions
- Conflicts for internal CPU resources, particularly general-purpose registers

studies revealed that

- Specialized instructions were used rarely, but added hardware complexity to the instruction decoder that slowed down execution of the other, frequently used instructions.
- The number of data memory accesses and total MOVE instructions could be reduced by increasing the number of general-purpose registers and using those registers to manipulate data and perform calculations. The time to locate and access data in memory is much longer than that required to process data in a register, and instructions that access memory require more steps in the fetch-execute cycle than those that don't.
- Permitting the use of general-purpose registers to hold memory addresses would also allow the addressing of large amounts of memory while reducing instruction word size, addressing complexity, and instruction execution time, as well as simplifying the design of programs that require indexing. Reducing the number of available addressing methods simplifies CPU design significantly.
- The use of fixed-length, fixed-format instruction words with the op code and address fields in the same position for every instruction would allow instructions to be fetched and decoded independently and in parallel. With variable-length instructions, it is necessary to wait until the previous instruction is decoded before its length and instruction format can be established.

Code Morphing

- can be used to translate complex variable-width instruction words to simpler fixed-width internal equivalents for faster execution. - This technique allows the retention of legacy architectures while permitting the use of modern processing methods. Ex: Modern x86

The execution unit

- Contains the arithmetic/logic unit and the portion of the control unit that identifies and controls the steps that comprise the execution part of each different instruction.
- When the execution unit is ready for an instruction, the instruction decoder unit passes the next instruction to the control unit for execution.
- Instruction operands requiring memory references are sent to the addressing unit. The addressing unit determines the memory address required, and the appropriate data read or write request is then processed by the bus interface unit.

Three different approaches are commonly used to enhance the performance of memory:

- Wide path memory access
- Memory interleaving
- Cache memory
These approaches are complementary; cache memory has the most profound effect on system performance.

speculative execution

1. the execution of an instruction out of order may or may not be valid, so the instruction is executed speculatively, that is, on the assumption that its execution will be useful. 2. For this purpose, a separate bank of registers is used to hold results from these instructions until previous instructions are complete. 3. The results are then transferred to their actual register and memory locations, in correct program instruction order

cache memory

A small amount of high-speed memory (SRAM) between the CPU and main storage; a "hidden" (to the programmer) storage area. It is organized into blocks, each of which reproduces a corresponding amount of storage from somewhere in main memory.

cache line

A small amount of storage (between 8 and 64 bytes) held in each cache block.

CISC

Complex Instruction Set Computers (IBM mainframe, X86 CPUs)

Memory Interleaving

Dividing memory into parts, so that it is possible to access more than one location at a time.

n-way interleaving

Dividing memory into n separate blocks, where n is the number of blocks that can be accessed independently.

logical storage elements.

Each element can accept a memory request independently, so multiple requests can be processed concurrently.
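
A minimal sketch of the address mapping behind n-way interleaving (the function name and low-order mapping are my own illustration): consecutive addresses land in different banks, so a sequential run of n accesses can proceed concurrently.

```python
# Low-order n-way interleaving: consecutive addresses rotate through banks.

def bank_of(address, n):
    """Bank (logical storage element) that holds this address."""
    return address % n

n = 4  # 4-way interleaving
print([bank_of(a, n) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because addresses 0-3 fall in four different banks, all four requests can be serviced at the same time instead of queuing on one memory unit.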

cache controller

Hardware that checks tags to determine if the memory location of the request is presently stored within the cache. - If it is, the cache memory is used as if it were the main memory. If the request is a read, the corresponding word from cache memory is simply passed to the CPU.

tag

Identifies the location in main memory that corresponds to the data being held in a cache block.
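
For a direct-mapped cache, the tag comparison can be sketched as a simple address split. The line size and line count below are arbitrary example values, not from the text.

```python
# Split a memory address into (tag, line, offset) for a hypothetical
# direct-mapped cache with 256 lines of 16 bytes each.

LINE_BYTES = 16       # bytes per cache line
NUM_LINES  = 256      # lines in the cache

def split_address(addr):
    offset = addr % LINE_BYTES
    line   = (addr // LINE_BYTES) % NUM_LINES
    tag    = addr // (LINE_BYTES * NUM_LINES)  # identifies the memory block
    return tag, line, offset

# Two addresses that map to the same cache line but carry different tags:
print(split_address(0x0000))  # (0, 0, 0)
print(split_address(0x1000))  # (1, 0, 0) -- same line, different tag
```

The cache controller compares the stored tag of the selected line against the tag bits of the requested address; a match is a hit, a mismatch is a miss even though the line index is the same.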

The current organizational model of a CPU uses three primary, interrelated techniques to address the limitations of the conventional CU/ALU model and to improve performance.

- Implementation of the fetch-execute cycle is divided into two separate units: a fetch unit to retrieve and decode instructions and an execution unit to perform the actual instruction operation. This simple reorganization of the CU and ALU components allows independent, concurrent operation of the two parts of the fetch-execute cycle.
- The model uses an assembly line technique called pipelining to allow overlapping between the fetch-execute cycles of sequences of instructions. This reduces the average time needed to complete an instruction.
- The model provides separate execution units for different types of instructions. This makes it possible to separate instructions with different numbers of execution steps for more efficient processing. It also allows the parallel execution of unrelated instructions by directing each instruction to its own execution unit. In some CPUs, there will even be multiple execution units of each kind. For example, Figure 8.3 lists the twelve execution units present in the IBM POWER7 CPU.

The Intel x86 is characteristic of older CISC architectures; it has comparatively few general-purpose registers, numerous addressing methods, dozens of specialized instructions, and instruction word formats that vary from 1 to 15 bytes in length.

In contrast, every instruction in the newer SPARC RISC architecture is the same 32-bit length; there are only five primary instruction word formats, and only a single, register-based, LOAD/STORE memory-addressing mode.

What does the second level buy us?

Most system designers believe that more cache would improve performance enough to be worthwhile. To be useful, the second level of cache must have significantly more memory than the first level

There are several factors that determine the number of instructions that a computer can perform in a second.

Obviously the clock speed is one major factor.

branching problem

One common approach is to maintain two or more separate pipelines so that instructions from both possible outcomes can be processed until the direction of the branch is clear. Another approach attempts to predict the probable branch path based on the history of previous execution of the same instruction.

number of different ways to increase the instruction execution performance of a computer.

One method is to provide a number of CPUs in the computer rather than just one. Since a single CPU can process only one instruction at a time, each additional CPU would, in theory, multiply the performance of the computer by the number of CPUs included

scalar processor

Processes one instruction per clock cycle; the CPU can average instruction execution at a rate approximately equal to the clock speed of the machine.

RISC

Reduced Instruction Set Computers

disk cache

Storage set aside to temporarily hold data read from the hard disk, so that subsequent requests for that data can be satisfied more quickly.

superscalar processing

The ability to process more than one instruction per clock cycle. Superscalar processing can increase the throughput by double or more.

instruction unit

Analyzes, manages, and steers instructions to the proper execution unit at the proper time, combined with instruction fetching and decoding. It handles all preparation of instructions for execution.

locality of reference

The locality of reference principle states that at any given time, most memory references will be confined to one or a few small regions of memory

Pipelining

The overlapping of instructions, so that more than one instruction is being worked on at a time; thus, when the first instruction is completed, the next one is already only one stage short of completion. The pipelining technique results in a large overall increase in the average number of instructions performed in a given time and is responsible for large increases in program execution speed.
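
The payoff from overlapping can be shown with the standard back-of-the-envelope pipeline timing formula (the stage and instruction counts below are made-up example values): an ideal k-stage pipeline finishes n instructions in k + (n - 1) cycles, versus k * n cycles without overlap.

```python
# Ideal pipeline timing: once the pipeline is full, one instruction
# completes per clock cycle.

def cycles(n_instructions, k_stages, pipelined):
    if pipelined:
        return k_stages + (n_instructions - 1)  # fill once, then 1/cycle
    return k_stages * n_instructions            # each instruction runs alone

n, k = 100, 5
print(cycles(n, k, pipelined=False))  # 500
print(cycles(n, k, pipelined=True))   # 104
```

Note that each individual instruction still takes k cycles; it is only the average completion rate that improves, which matches the point made later about pipelining not affecting individual instruction cycle time.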

hit ratio

The ratio of hits to the total number of requests

stall time

The time required to move data into the cache; it is long compared to instruction execution time, reducing performance while the CPU waits for the data.
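
Hit ratio and stall time combine into an effective access time, which a short sketch makes concrete. The formula is the standard weighted average; the specific nanosecond figures are invented example values.

```python
# Effective memory access time as a weighted average of cache hits and
# misses (miss penalty approximates the stall time).

def effective_access_ns(hit_ratio, cache_ns, miss_penalty_ns):
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * miss_penalty_ns

# A small improvement in hit ratio pays off disproportionately when the
# miss penalty dwarfs the cache access time:
print(round(effective_access_ns(0.90, 1.0, 50.0), 2))  # 5.9
print(round(effective_access_ns(0.99, 1.0, 50.0), 2))  # 1.49
```

This is why designers pursue high hit ratios so aggressively: going from 90% to 99% hits cuts the average access time by roughly a factor of four in this example.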

Instruction Set Architecture

The characteristics of the CPU visible to the programmer, including such things as the number and types of registers, methods of addressing memory, and the basic design and layout of the instruction set.

Rename registers, logical registers, register alias tables

They hold the results of speculative instructions until instruction completion.

Some modern architectures even provide program instructions to request cache preloading for data or instructions that will be needed soon

A separate instruction cache improves execution speed even further and allows even more rapid access, since an instruction and its operands can be accessed simultaneously much of the time. The design of a separate instruction cache can also be simplified. The trade-off is that accommodating separate instruction and data caches requires additional circuit complexity, and many system designers opt instead for a combined, or unified, cache that holds both data and instructions.

data dependency

A situation in which a later instruction is supposed to use the results of an earlier instruction in its calculation. Ex: a MULTIPLY instruction takes longer to execute than a MOVE or ADD instruction. If a MULTIPLY instruction is followed in the program by an ADD instruction that adds a constant to the result of the multiplication, the result will be incorrect if the ADD instruction is allowed to complete ahead of the MULTIPLY instruction.
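
The MULTIPLY/ADD example can be made concrete with a toy calculation (the operand values are my own): if the ADD reads the register before the MULTIPLY has written its product, the program computes the wrong answer.

```python
# Toy illustration of the data dependency hazard described above.

a, b, constant = 6, 7, 10

correct = (a * b) + constant        # in-order: ADD sees the product

stale_register = a                  # value still in the register if the
wrong = stale_register + constant   # ADD completes ahead of the MULTIPLY

print(correct, wrong)  # 52 16
```

Hardware avoids this either by stalling the ADD until the MULTIPLY completes or by forwarding/renaming, but the dependency itself is exactly this ordering constraint.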

hazard or a dependency

When out-of-order instruction execution causes problems because a later instruction depends on the results of an earlier instruction.

miss

When the request is not found in the cache

branch history table

A small amount of dedicated memory built into the CPU that maintains a record of previous choices for each of several branch instructions used in the program being executed, to aid in branch prediction.
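
A common way to realize such a table is with 2-bit saturating counters, sketched below. The table size, indexing scheme, and class name are arbitrary choices for illustration, not details from the text.

```python
# Sketch of a branch history table using 2-bit saturating counters:
# counter values 0-1 predict not-taken, 2-3 predict taken.

TABLE_SIZE = 16

class BranchHistoryTable:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE  # start weakly not-taken

    def _index(self, pc):
        return pc % TABLE_SIZE            # map branch address to an entry

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
pc = 0x40
for _ in range(3):                        # a loop branch taken repeatedly...
    bht.update(pc, taken=True)
print(bht.predict(pc))  # True -- the recorded history now predicts "taken"
```

The 2-bit counter needs two consecutive mispredictions to flip its prediction, so a single anomalous branch outcome (e.g., a loop exit) does not destroy a well-established history.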

memory latency

The access time of DRAM is too slow to keep up with the CPU, so delays must be inserted into the LOAD/STORE execution pipeline to allow memory to keep up. Because instructions must be fetched from memory and data must be moved from memory into registers for processing, the use of DRAM is a potential bottleneck in processing.

Early CPU architectures were characterized by

Comparatively few general-purpose registers, a wide variety of memory-addressing techniques, a large number of specialized instructions, and instruction words of varying sizes.

Pipelining and instruction reordering complicate

They complicate the electronic circuitry required for the computer and also require careful design to eliminate the possibility of errors occurring under unusual sequences of instructions. Despite the added complexity, these methods are now generally accepted as a means for meeting the demand for more and more computer power.

control dependencies

Also known as flow or branch dependencies. Conditional branch instructions are more difficult to handle, because the condition decision may depend on the results of instructions that have not yet been executed.

Hit

If the request is a write, the data from the CPU is stored in the appropriate cache memory location.

Computer's organization

includes consideration of the implementation, instruction execution speed, details of the interface between the CPU and associated computer circuitry, and various optional features

All of these examples of caching share the common attribute that they

increase performance by providing faster access to data, anticipating its potential need in advance, then storing that data temporarily where it is rapidly available.

Instruction reordering

makes it possible to provide parallel pipelines, with duplicate CPU logic, so that multiple instructions can actually be executed simultaneously

The solution to the conditional branching problem may be broken into two parts

Methods to optimize correct branch selection and methods to prevent errors as a result of conditional branch instructions. Selection of the wrong branch is time-wasting, but not fatal.

fetch unit

This portion of the CPU consists of an instruction fetch unit and an instruction decode unit.
- Instructions are fetched from memory by the fetch unit, based on the current address stored in an instruction pointer (IP) register. The fetch unit is designed to fetch several instructions at a time in parallel. Fetching instructions in advance allows their execution to take place quickly, without the delay required to access memory.
- Instructions in the fetch unit buffer are sent to the instruction decoder unit. The decoder unit identifies the op code and, from it, determines the type of the instruction. If the instruction set is made up of variable-length instructions, it also determines the length of the particular instruction. The decoder then assembles the complete instruction with its operands, ready for execution.

Branch instructions must always be

processed ahead of subsequent instructions, since the addresses of the proper subsequent instructions to fetch are determined from the branch instruction

Clock

provides a master control as to when each step in the instruction cycle takes place - The pulses of the clock are separated sufficiently to assure that each step has time to complete, with the data settled down, before the results of that step are required by the next step. - Thus, use of a faster clock alone does not work if the electric circuitry cannot keep up.

caching is used to

reduce the time necessary to access data from a disk.

modern CPUs achieve high performance by

Separating the two major phases of the fetch-execute cycle into separate components, then further separating the execution phase into a number of independent execution units, each with pipeline capability. Once a pipeline is filled, an execution unit can complete an instruction with each clock tick.

The write-back method is faster

since writes to memory are made only when a cache line is actually replaced, but more care is required in the design to ensure that there are no circumstances under which data loss could occur.

At present, important CPU architectural families include

the IBM mainframe series, the Intel x86 family, the IBM POWER/PowerPC architecture, the ARM architecture, and the Oracle SPARC family. - Each of these is characterized by a lifetime exceeding twenty years.

The architecture may or may not include

the absence or presence of particular instructions, the amount of addressable memory, or the data widths that are routinely processed by the CPU. - Some architectures are more tightly defined than others

It is important to remember that pipelining and superscalar processing techniques do not affect

the cycle time of any individual instruction. It is the average instruction cycle time that is improved by performing some form of parallel execution. If an individual instruction must be completed for any reason before another can be executed, the CPU must stall for the full cycle time of the first instruction.

wide path memory access.

The simplest means to increase memory access speed is to widen the data path so as to read or write several bytes or words between the CPU and memory with each access; for example, the system can retrieve 2, 4, 8, or even 16 bytes simultaneously.
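
The benefit is simply fewer memory transactions for the same amount of data, as a small sketch shows (the byte counts are arbitrary example values).

```python
# Count memory accesses needed to transfer a buffer for different data
# path widths; each access moves path_width_bytes at once.

def accesses_needed(total_bytes, path_width_bytes):
    return -(-total_bytes // path_width_bytes)  # ceiling division

print(accesses_needed(64, 1))  # 64 accesses, one byte at a time
print(accesses_needed(64, 8))  # 8 accesses with an 8-byte-wide path
```

An 8-byte path cuts the number of accesses for this transfer by a factor of eight, which is why wider data paths are the most direct of the three memory-enhancement approaches listed earlier.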

