Final Question Pool
305. A load, add, store sequence can be contained in one convoy (270) A. F B. T
A
321. CUDA allows the programmer to assign code either to the host CPU or to the GPUs (289) A. T B. F
A
325. A block can contain up to (291) A. 512 threads B. Two grids C. four other blocks D. 9182 instructions
A
326. The typical NVIDIA SIMD instruction operates on ___ elements at once (291) A. 32 B. 8 C. 1024 D. 8192
A
329. A single NVIDIA GPU has ____ registers available (296) A. 32,768 B. 1024 C. 8192 D. 32
A
335. What do the GPU Special Function Units do? (307) A. Transcendental functions B. Raster operations C. Thread scheduling D. Wide-register adds and subtracts
A
337. All loops have exploitable loop-level parallelism (315) A. F B. T
A
343. Dependencies can exist within a single loop iteration. (317) A. T B. F
A
346. A loop may contain at most one dependency (317) A. F B. T
A
351. It can only be shown that a loop-level dependency exists, not that it does not (319) A. F B. T
A
355. What is the name of a problem-solving method which parallelizes part of a problem to produce intermediate results which can be further processed in parallel to produce the final result? (321) A. Reduction B. Retraction C. Induction D. Revelation
A
357. Data Level Parallelism allows a "Slow and wide" approach to computing which is more energy efficient (322) A. T B. F
A
358. Energy savings can be realized if parts of the CPU hardware can be shut down. (322) A. T B. F
A
363. The NVidia Fermi GPU card draws over 100 times the power of the Tegra processor (324) A. T B. F
A
374. Using multiple processors which address the same memory is a concept developed in the mid 1990s (344) A. F B. T
A
376. Multiple processors tied to common memory use the ____ model (345) A. Multiple Instructions, Multiple Data B. Single Instruction, Single Data C. Single Instruction, Multiple Data D. Multiple Instructions, Single Data
A
381. Cache coherence means all reads by any processor must return the most recently written value (352) A. T B. F
A
382. Cache coherence means writes to the same location by any two processors are seen in the same order by all processors (352) A. T B. F
A
387. An attempt to read from an invalid cache block (358) A. Results in a cache miss B. Will never happen in a snooping type protocol C. Invalidates other blocks D. Results in a cache hit
A
403. What is a significant feature of kernel code executing on multiple processors? (375) A. An increase in coherence misses B. An increase in instruction cache misses C. The number of true sharing misses decreases D. A decrease in false sharing misses E. A decrease in instruction cache misses F. A decrease in true sharing misses
A
412. Snooping and Directory-based cache coherence protocols use similar state diagrams but different communication methods (381) A. T B. F
A
414. Four processors (A, B, C and D) in a multiprocessor system have read word X from memory. What is the cache block status on Processor A after A writes to word X? (385) A. Modified B. Shared C. Invalid D. Excepted
A
415. Four processors (A, B, C and D) in a multiprocessor system have read word X from memory. What is the cache block status on Processor B after A writes to word X? (385) A. Invalid B. Exclusive C. Shared D. Modified
A
416. A processor has a read miss on a word. What is the resulting cache block state for that block on that processor? (385) A. Shared B. Exclusive C. Modified D. Invalid
A
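The write-invalidate questions in this pool (413 through 418) all exercise the same three block states. A minimal sketch of those states as a C enum follows; the transition comments summarize the answers in this pool and are illustrative, not a full protocol specification.

    /* Illustrative write-invalidate cache block states (questions 413-418). */
    enum block_state {
        INVALID,   /* unusable: never cached, or another processor wrote it    */
        SHARED,    /* read into this cache; other caches may hold clean copies */
        MODIFIED   /* written by this processor; all other copies invalidated  */
    };
    /* Transitions the questions exercise:
     *   read miss                    -> SHARED    (question 416)
     *   write hit or write miss      -> MODIFIED  (questions 414, 417, 418)
     *   another processor writes it  -> INVALID   (question 415)
     */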
423. Cache coherence mechanisms and locking memory locations are largely independent of one another (389) A. F B. T
A
424. The primary advantage of leveraging cache coherence mechanisms to support locking is that it (391) A. Reduces traffic on the memory bus B. Uses less power C. Is less error-prone. D. Resolves hazards without forwarding
A
428. Speculation (if correct most of the time) can be used to hide latency (396) A. T B. F
A
430. When all data in L1 cache is also in L2 cache, and all data in L2 cache is also in L3 cache, the cache is said to be (397) A. Inclusive B. Extroverted C. Exclusive D. Introverted E. Locked F. Complete
A
432. Which is probably not a good design for a server chip? (401) A. Seven cores B. Power consumption of about 100 watts C. Two threads per core D. Clock speed > 2GHz
A
437. Amdahl's Law doesn't apply to parallel computing (406) A. F B. T
A
438. Linear speedups relative to the number of processors are not needed to make multiprocessors effective because of economies of scale in design and production (407) A. T B. F
A
439. A Uniprocessor operating system runs significantly better on a multiprocessor (407) A. F B. T
A
443. The cost of 100,000 servers, the building they are housed in plus the electrical and cooling infrastructure is about (432) A. $150M B. $150K C. $15B D. $1.5M
A
446. Single servers typically have an abundance of parallelism to exploit, while large groups of servers find parallelism in short supply because of the diversity of users (434) A. F B. T
A
450. WSC servers are designed to handle continuously high workloads (439) A. F B. T
A
451. Running large numbers of servers, Google is able to obtain very favorable software licensing fees from suppliers such as Microsoft (441) A. F B. T
A
459. Use of electrical power in WSCs generates heat which takes even more power to remove (448) A. T B. F
A
466. Overall, the cost of operating a server in a WSC is on the order of (452) A. 10 cents per hour B. ten dollars per hour C. a tenth of a cent per hour D. a dollar per hour
A
472. The original Google data center in Dallas consumed about ___ megawatt(s) of power (464) A. 10 B. 1 C. 100 D. 1000
A
492. The cost of building a data warehouse facility is generally larger than the equipment housed in it (471) A. T B. F
A
493. Capital assets are amortized more quickly than equipment assets (471) A. F B. T
A
494. Capital assets generally last longer than equipment assets (471) A. T B. F
A
298. The load-store implementation in a vectorized MIPS processor (265) A. Can load/store one word every clock cycle after an initial latency B. Is pipelined. C. Always reads or stores 64 contiguous words of memory into the processor D. Always reads or stores 64 contiguous bytes of memory into the processor E. Uses un-pipelined architecture to save hardware F. Is the same for scalar processors. G. Uses strip-mining.
A, B
311. Operations on vectors with length < 64 require (275) A. The Vector Length Register B. Firmware which prevents overflow conditions C. Strip-mining techniques. D. Loop statements. E. Branch instructions and some arithmetic instructions F. The interrupt handler. G. The Exception Program Counter
A, C
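Questions 309 and 311 both point at strip-mining. A minimal C sketch of the idea, assuming a maximum hardware vector length of 64 and a DAXPY-style loop (the function and array names are illustrative):

    #define MVL 64                     /* assumed maximum vector length        */

    /* Process n elements as one odd-sized strip plus full 64-element strips. */
    void daxpy_stripmined(int n, double a, double *x, double *y)
    {
        int low = 0;
        int vl  = n % MVL;             /* left-over strip handled first        */
        while (low < n) {
            for (int i = low; i < low + vl; i++)   /* one vectorizable strip   */
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;                  /* every remaining strip is full length */
        }
    }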
320. In GPU-based applications, thread management is done by (289) A. The GPU hardware. B. The operating system C. The GPU processor D. The application itself E. The compiler F. The CPU G. The program loader.
A, C
371. A scatter-gather application called GJK runs 15 times faster on the NVIDIA GTX-280. Why? (329) A. The Core i7 doesn't do scatter-gather operations. B. GJK doesn't take advantage of the kind of parallelism the i7 offers. C. The GTX-280 has better scatter-gather data support. D. The GTX-280 supports an SIMD instruction set. E. The CUDA compiler organizes data into its most efficient representation. F. The Core i7 doesn't use SIMD instructions. G. The GTX card has more memory on it.
A, C
470. The more power a data center consumes (464) A. The more power is required to cool it B. The greater the possibility of a power outage. C. The more heat must be dissipated D. The more efficient it becomes E. The higher power supply MTTR becomes. F. The less other utilities like water and gas are consumed G. The denser the computing machinery can be packed
A, C
484. Compared to Google servers, Google data nodes (467) A. Consume more than half-power at half-load B. are automatically shut off when not in use C. Must keep more disks powered on regardless of load. D. Still consume a lot of power at idle E. Use about the same amount of power at idle
A, C
350. Greatest Common Divisor (GCD) analysis is useful for which example of a loop? (Assume the statement is within a "for (int i = 0; i < 100; i++)" loop) (319) A. X[a * i + b] = X[c * i + d] + 15; B. X[i] = X[A[j]] + 1; C. X[i] = X[i - 1] * 2; D. X[2 * i + 3] = X[2 * i] + 5.0; E. X[a * i + b] = Y[c * i + d] + 15; F. X[i] = X[i + 1] * 2;
A, D
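A worked run of the GCD test for question 350, applied to the statement in option D, X[2*i + 3] = X[2*i] + 5.0. The program below is a self-contained sketch; the gcd helper is written out only for illustration.

    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    int main(void)
    {
        /* Writes use index a*i + b, reads use c*i + d. */
        int a = 2, b = 3, c = 2, d = 0;

        /* GCD test: a loop-carried dependence is possible only if
         * gcd(a, c) divides d - b. */
        if ((d - b) % gcd(a, c) == 0)
            printf("dependence possible\n");
        else
            printf("no loop-carried dependence\n");  /* 2 does not divide -3 */
        return 0;
    }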
396. As cache size increases (371) A. Capacity misses decrease B. Cache hits take longer. C. False sharing misses decrease. D. False sharing misses remain the same. E. Compulsory misses decrease F. memory access becomes symmetric G. Bus response time increases
A, D
478. To conserve power, cooling fans in shipping containers (465) A. Can respond exactly to the cooling level needed. B. Are all located on one side of the container C. Are left on all the time D. are variable-speed and temperature controlled E. Are powered by DC motors. F. Are mostly not used G. Have irregularly spaced blades for quietness.
A, D
348. An affine index is (318) A. when the index can be written in the form ai+b, where i is the loop variable. B. Possible in only one dimension of a multiple-dimension array. C. Computed in the form ai^2+bi+c where i is the loop variable. D. When the index is written as x[y[i]] where i is the loop variable. E. The basis of loop dependency detection F. Independent of the loop variable. G. An indication that there is a loop-level dependency.
A, E
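A short C contrast for question 348 (the function and array names are illustrative):

    /* Affine index: a*i + b with loop-invariant a = 4 and b = 1.
     * (X is assumed to hold at least 4*n elements.) */
    void affine_update(double X[], int n)
    {
        for (int i = 0; i < n; i++)
            X[4 * i + 1] += 2.0;
    }

    /* Not affine: the index goes through another array (X[Y[i]]), so tests
     * such as the GCD test cannot be applied to it directly.
     * (Each Y[i] is assumed to be a valid index into X.) */
    void indirect_update(double X[], const int Y[], int n)
    {
        for (int i = 0; i < n; i++)
            X[Y[i]] += 2.0;
    }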
473. If power usage effectiveness (PUE) is 1.23, what does the ".23" represent? (464) A. Non-IT Overhead B. Capital loss. C. Indirect costs D. Power transmission loss E. Power not going to the computers. F. Administrative overhead G. Capacitive load
A, E
482. What is a unique feature of Google server power supplies? (467) A. A 12-volt lead-acid battery B. Ability to use 50-hz power for international operation C. They are ultra-reliable. D. Use of standard 1.5v D-cells E. A built-in battery backup F. Liquid cooling. G. Use of large transformers for efficient AC-DC conversion
A, E
300. MIPS vector code often requires ___ times fewer instructions than scalar code for the same work (268) A. 100 B. 2 C. 1000 D. Fifty E. 10 F. One Hundred G. Four
A, F
316. SIMD is similar to vector processing, except that (282) A. There is no strided access B. Memory can't supply data fast enough C. Optimizing compilers are not available. D. It is an older technology. E. The registers cannot be divided into sub-word data types F. There is no vector mask register. G. Warps are used instead of chimes.
A, F
354. Why is graphics DRAM soldered onto the circuit board? (320) A. To avoid capacitance in a physical junction. B. To tolerate higher temperatures C. To guard against data tampering D. It saves chip sockets. E. To prevent unauthorized upgrading. F. In order to run at higher speeds G. It reduces heat loss.
A, F
308. Latency is difficult to improve, but how can vector operations effectively be sped up? (272) A. Doing more at once. B. Reducing capacitive load C. Increasing clock speed D. Do not perform operations with 0. E. Raising the supply voltage F. Making the clock cycle shorter. G. Pipelining the functional units
A, G
344. If a loop iteration contains two statements which can be executed in the opposite order with no consequence (317) A. There is no dependency in the loop. B. The loop can be unrolled. C. The loop cannot be unrolled. D. The compiler will probably reverse the order of execution. E. There may still be a dependency within the iteration. F. There are no dependencies between iterations. G. There may still be dependencies between iterations.
A, G
309. The process of decomposing n operations into groups of m operations (m < n) and composing a final group of the left-over operations is known as (272) A. Gather. B. Strip mining C. De-merging D. Candy striping E. Multi-lane convergence F. Scatter.
B
328. NVIDIA's GPU instructions tend to be ___ and ___ than vector instructions (292) A. Shorter, faster B. Wider, shallower C. More complex, more memory intensive D. Narrower, longer
B
330. The NVIDIA instruction set architecture is similar to MIPS ISA in that it closely resembles the hardware instructions (298) A. T B. F
B
331. How do NVIDIA's GPUs handle conditional branching in threads? (300) A. They don't allow it B. Branching threads are halted. C. Both sides of the branch must contain the same number of instructions D. The compiler reschedules instructions so that branches aren't needed
B
340. The code "for (int i = 0; i < 100; i++) x[i] = x[i] + 3;" carries a loop-level dependency (316) A. T B. F
B
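A side-by-side C sketch for questions 336, 339, and 340 (function names are illustrative):

    /* No loop-carried dependence: each iteration touches only x[i],
     * so the iterations can run in parallel (question 340). */
    void independent(double x[], int n)
    {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + 3;
    }

    /* Loop-carried dependence: iteration i reads the value produced by
     * iteration i - 1, forming a recurrence that serializes the loop. */
    void recurrence(double x[], int n)
    {
        for (int i = 1; i < n; i++)
            x[i] = x[i - 1] + 3;
    }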
353. Loop-level dependence means a loop's operations cannot be parallelized. (320) A. T B. F
B
360. The goal for the Tegra processor is to (323) A. Have a 24-hour battery life B. Render full motion video by 2020 C. Beat Moore's law D. Be present in 98% of mobile devices E. Beat Amdahl's law F. Double its floating point calculation rate by 2017
B
361. Because of its small size, the Tegra processor runs hotter than the GTX-480 card (324) A. T B. F
B
365. GPU parallelism can be effectively leveraged in nearly all applications (326) A. T B. F
B
366. Processors generally have the same memory bandwidth limitations regardless of whether they are doing double- or single-precision floating point operations. (326) A. F B. T
B
368. The horizontal line of the Roofline diagram represents (326) A. The power wall B. The floating point operation speed limitation C. The limit of processor inefficiency D. The memory bandwidth limitation
B
379. One of the more complex architectural matters with multiple processors is (351) A. Workload balancing B. Cache management C. Exception handling D. Thread size
B
380. If processor cache is coherent (352) A. Processors are using the same cache B. Processors see the same value in the same memory location C. One processor will not see what another processor writes to cache D. there is no difficulty if processors each have their own local cache
B
384. In cache coherence snooping, (355) A. The sharing status of each block is kept in a single location B. Each cache controller monitors a common bus for cache activity C. Information is distributed so that the outage of any one processor will not affect the others D. Each processor has its own copy of the directory
B
386. If a cache controller attempts to read an unmodified word in an invalid cache block, what results? (356) A. A coherence miss B. A false miss C. Cache incoherence D. A true miss
B
390. In general, memory bus bandwidth is not the limiting factor for scaling symmetric multiprocessors (364) A. T B. F
B
398. Why would instruction misses drop significantly compared to data misses as cache size increases? (372) A. It takes instructions longer to execute than to be read from memory B. Many more instructions are read per data read C. Instructions are larger in size than data D. Data are better organized than are instructions E. Instruction and data cache are separate
B
399. As the number of processors increases (372) A. The number of instruction misses increases B. The number of true sharing misses increases C. The number of true sharing misses decreases D. Sharing misses (true and false) are unaffected E. The number of false sharing misses decreases
B
405. In general, kernel miss rates are higher than user application miss rates (376) A. F B. T
B
406. In multiprocessing systems, performance is more sensitive to block size than to cache size (377) A. F B. T
B
408. As block size increases, miss rate decreases but bus traffic increases (378) A. F B. T
B
418. A processor has a write hit on a word. What is the resulting cache block state for that block on that processor? (385) A. Invalid B. Modified C. Inverted D. Shared
B
426. Correct execution using relaxed consistency often means (395) A. Speculation cannot be used B. Programmers must implement explicit synchronization C. Multiprocessing is perilous D. The processor must execute instructions slowly
B
427. Optimizing code for execution by compilers does not involve (396) A. Inserting synchronization points B. Serializing a large percentage of the problem C. Exploiting the mechanisms of out-of-order execution D. Optimizing data location
B
429. One of the compiler optimization problems to be solved is how to handle pointers to shared memory (396) A. F B. T
B
433. In general, performance as a function of number of cores for SPEC suites is (402) A. Frequently Logarithmic B. Roughly Linear C. Mostly Exponential D. Approximately Quadratic
B
434. Java benchmark applications, being more inherently serial than the PARSEC benchmark suite, tend not to benefit from multicore execution (404) A. F B. T
B
435. Inherently multithreaded applications tend to be more energy efficient when run as a single thread (404) A. T B. F
B
436. The whole field of parallel programming suffers from "lack of maturity" with programming and techniques (405) A. F B. T
B
440. Supercomputers use exclusively parallel processing to obtain their high speed (409) A. T B. F
B
442. The same principles of computer architecture which apply on the chip level also apply on the scale of large numbers of servers (432) A. F B. T
B
444. Warehouse scale architects and single-server architects share all of the following goals except (433) A. Dependability through redundancy B. Lowest possible operational cost C. Energy efficiency D. Cost-Performance ratios
B
445. Server architects and warehouse-scale computing architects share the same problems and goals except in the case of (433) A. Handling both batch and interactive processing loads B. The availability of ample parallelism to exploit C. The need to factor power consumption into efficiency calculations D. Achieving reliability through redundancy
B
455. 48-port Ethernet switches connect servers in a single rack. What kind of switch connects racks of servers? (443) A. Bisection switch B. Array switch C. Dual-port switch D. RAID Switch
B
456. Data access across servers in a rack is best described as (444) A. SMP B. NUMA C. SMT D. SIMD
B
461. The measure of Power Utilization Effectiveness (PUE) is (450) A. Input power/ Output power B. Total facility power / IT equipment power C. Computation power / cooling power D. Power gained / Power lost
B
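A worked example of the definition in question 461, which also shows what the ".23" in question 473 stands for; the kilowatt figures are assumed purely for illustration.

    #include <stdio.h>

    int main(void)
    {
        double it_power_kw    = 1000.0;  /* power reaching the IT equipment */
        double total_power_kw = 1230.0;  /* power entering the facility     */

        double pue = total_power_kw / it_power_kw;           /* = 1.23      */
        printf("PUE = %.2f\n", pue);
        printf("Non-IT overhead = %.0f kW\n",
               total_power_kw - it_power_kw);                /* 230 kW      */
        return 0;
    }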
465. Personnel costs at a WSC center are approximately (452) A. 75% of the total B. 2% of the total C. 10% of the total D. 50% of the total
B
487. Compared to rack switches, array switches (469) A. Cannot use the reliability-through-redundancy principle B. Support much higher bandwidth C. Handle only a few servers at a time D. Are inexpensive, commodity products E. Interface directly with Google servers. F. Are undersubscribed.
B
491. Cloud services are generally inefficient users of computing power (471) A. T B. F
B
302. A convoy is a set of vector instructions that can execute together in predictable time. This means (269) A. They use less than the entire 64-element length of vectors B. They use separate functional units. C. They contain no structural hazards. D. There are two times as many reads as writes E. Data reads are from contiguous bytes in memory F. The VLR and VMR are being used. G. The VMR is non-zero.
B, C
303. In vector instructions, chaining means (269) A. Certain instructions are anchored to certain kinds of data B. An instruction can execute as soon as its data is available C. Pipeline operations may not occur left-to-right in the vector register. D. Processor rescheduling is not allowed. E. One instruction cannot begin until the previous instruction is complete F. The execution time is known. G. An instruction can execute as soon as the whole vector is loaded
B, C
333. A GPU has access to three types of memory, these are (304) A. Input, output and stayput. B. Private, local, and global C. Global, local and private D. Static, dynamic and cache E. User, kernel and privileged F. Permanent, semi-permanent and non-permanent. G. L1, L2 and L3
B, C
483. A standard Google server consumes about 160 W under 100% load. How much does it consume at idle? (467) A. 2 watts B. 85 watts C. About half that. D. 8 watts E. 120 watts
B, C
296. Which is an industry expectation in parallel computing (263) A. Vector lengths increase with every new processor release. B. The number of SIMD operations will double every four years. C. Instruction set size will double every year. D. There will be 2 more cores per chip every two years E. GPU cards will become smarter than humans F. Data bandwidth will double every year G. Parallel computing will be replaced by warp speed serial computing
B, D
356. How can MapReduce help in loop-level parallelism? (321) A. It can try many solutions at once to see if one is the correct one B. MapReduce can bound or limit loop-level dependencies C. MapReduce can detect loop-level dependencies D. The work in a loop can be separated into parallelizable operations and a smaller set of serial operations E. MapReduce can assure that there are no loop-level dependencies F. MapReduce can eliminate loop-level dependencies
B, D
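A minimal C sketch of the split described in questions 355 and 356: the per-chunk partial sums (the parallelizable work) can all run at once, and only the small combining step is serial. The function names are illustrative.

    /* Map step: each of several workers sums its own slice independently. */
    double partial_sum(const double *a, int lo, int hi)
    {
        double s = 0.0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }

    /* Reduce step: combine the few partial results serially. */
    double reduce(const double *partials, int nchunks)
    {
        double total = 0.0;
        for (int k = 0; k < nchunks; k++)
            total += partials[k];
        return total;
    }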
359. If the stride is larger than page size, what situation could occur? (323) A. All data could be held in L3 cache. B. The program could run very slowly. C. You could run out of disk space. D. Every data access could cause a page fault. E. Needed cores would be unavailable F. The unthinkable. G. The power supply could overheat
B, D
369. If we write a DGEMM program which fits nicely in cache, what would we expect to be the limitation? (327) A. Size of main memory B. Processor bandwidth. C. Memory bandwidth. D. Processor speed. E. Memory cycle time F. Cache blocks. G. Cache size.
B, D
401. In a symmetric multiprocessor system, what happens as cache block size is increased? (373) A. True sharing conflicts are increased B. True sharing conflicts are reduced C. False sharing conflicts are unaffected D. False sharing misses are more common. E. Performance decreases. F. Throughput drops. G. False sharing conflicts are decreased
B, D
485. Variation in power usage effectiveness (PUE) tends to be (468) A. Diurnal B. Seasonal C. Dictated by electricity prices. D. Dependent on climatic factors. E. Constant F. Based on user load G. in proportion to management effectiveness.
B, D
370. The NVIDIA GTX-280 card is faster than the Intel Core i7 processor for all tasks except sorting. Why might this be? (328) A. Sorting is easily parallelizable. B. Sorting has many data-dependent branches. C. The GPU card can't access as much memory as the CPU. D. The host processor gets in the way of the GPU card. E. Sorting isn't suitable for fine-grained multithreading. F. The Intel processor is designed for fine-grained multithreading. G. The compiler settings were wrong for this application.
B, E
402. With only one of many processors active in a multiprocessing system, kernel response time is largely the result of (375) A. Pipeline depth B. Disk I/O. C. Memory bandwidth D. Thread limitations. E. Waiting for the disk. F. Floating point computation limitation G. Number of users.
B, E
431. Server processors are often like desktop processors but with (400) A. Higher voltage power supplies B. More cache C. Liquid cooling. D. a larger but shorter instruction set E. More cores F. Smaller cache. G. Less power and higher clock speed
B, E
307. VMIPS functional units consume one element per clock cycle. This means (270) A. Latency is unusually short B. Steady state is one clock cycle per instruction C. Power can be saved. D. It takes 64 cycles for an operation to complete. E. Scalar operations are more efficient than vector operations F. The functional unit is pipelined. G. Calculation of time responsiveness is easy
B, F
338. A significant source of parallelism in source code is (315) A. Inlining B. Loops C. Recursion D. Switch statements E. Subroutine calls F. Looping G. Input/Output operations
B, F
407. The primary difficulty with the snooping cache coherence protocol is (378) A. It is expensive in terms of chip space B. Bus bandwidth requirements increase dramatically. C. It is error-prone D. Its implementation is too simple E. Three-dimensional circuits are difficult to produce. F. It does not scale well to more processors G. The protocol isn't fast enough.
B, F
477. If the airflow around equipment can be carefully controlled (465) A. It can be shut off periodically B. There will be fewer hot spots. C. It can be cooler D. It is not as noisy E. Filtration is not necessary. F. It can be warmer G. Humans don't need to take temperature measurements.
B, F
490. Generally, the first maintenance action on a failed stateless computer in a Google warehouse is to (469) A. Open a trouble ticket. B. Restart it. C. put it in the repair queue. D. Save its data. E. Have the operator verify the failure. F. reboot it. G. Shut it down and check it out.
B, F
322. CUDA code resembles (290) A. Structured Pascal B. Conventional C code. C. The BASIC language D. Assembly language E. Python. F. Ruby on Rails. G. The C or C++ languages
B, G
342. If a loop iteration uses a value computed in a previous iteration (316) A. There are no loop dependencies. B. There may be a loop dependency. C. Loop dependencies are a given. D. The compiler can unroll the loop with minor difficulty. E. Loop unrolling is not affected. F. Loop unrolling is impossible. G. Unrolling the loop may be difficult.
B, G
372. NVIDIA engineers (330) A. Failed to fix obvious flaws in the GTX-280 architecture until the market complained. B. Fixed weaknesses in the GTX-280 architecture before the market noticed them. C. Compete head-to-head with Intel engineers for high-performing processors. D. Copied Intel's SIMD instruction set design. E. Updated the GTX-480 design in the GTX-280. F. Tried, but could not implement Intel's AVX architecture. G. Updated the GTX-280 design in the GTX-480.
B, G
332. When a CUDA thread branches (300) A. Exception handling becomes very complex B. The process is suspended C. Its execution may be suspended D. The thread switch is costly
C
334. Which kind of memory is common to grids (304) A. Local B. Protected C. Global D. Privileged
C
336. What best defines "loop-carried dependence" (315) A. Two loops are nested B. The code within the loop does not contain branches out of the loop C. A value produced in one iteration of a loop is used in another iteration D. Values produced in loop iterations do not depend on each other
C
367. The slanted line of the Roofline diagram represents (326) A. The floating point operation speed limitation B. The limit of processor inefficiency C. The memory bandwidth limitation D. The power wall
C
375. Market forces driving the rise of the multiprocessor market include all of the following except (344) A. E-mail skimming B. Growth of data mining C. Growth of mobile phone devices D. Internet data analysis
C
377. Symmetric multiprocessors (347) A. feature the same number of processors on either side of the motherboard B. Exhibit non-uniform memory access C. Can each access memory in about the same amount of time D. Do not use shared cache to keep access time down
C
378. If 100 processors only speed up an application by a factor of two, what may explain the problem? (349) A. Ohm's Law B. The power wall C. Amdahl's Law D. Moore's Law
C
383. In directory-based cache coherence, (354) A. The directory is kept on disk. B. Each processor queries other processors for cache status C. The sharing status of each block is kept in a single location D. Information is distributed so that the outage of any one processor will not affect the others
C
385. A cache block marked as shared (355) A. Has just been written to by a processor B. Will miss when it is read C. Shows the same value to each processor D. Has been read by only one processor
C
388. An attempt to write to an invalid cache block (358) A. refreshes other processor's cache blocks B. Results in a cache hit C. Results in a cache miss D. Will never happen in a snooping type protocol
C
389. In the snooping protocol, multiple cache controllers must use a single bus to carry out their cache management. This gives rise to the possibility that (362) A. Write-through is not possible B. Data will be lost C. Deadlocks may occur D. Cache will become incoherent
C
413. A word in memory is read for the first time in a program run on a multiprocessor machine. The cache tag for the block has what status? (385) A. Modified B. Exclusive C. Shared D. Invalid
C
417. A processor has a write miss on a word. What is the resulting cache block state for that block on that processor? (385) A. Inverted B. Invalid C. Modified D. Shared
C
441. Hosting massive numbers of servers in one location is referred to as (432) A. Silicon Valley B. Massive Server Farming C. Warehouse Scale Computing D. Phenomenally Huge Computing
C
449. What kind of data consistency is often used in a WSC environment? (439) A. Relative consistency B. Strict consistency C. Eventual consistency D. Immediate consistency
C
453. What is the term used if the bandwidth capability of servers on one side of a switch is many times the bandwidth on the uplink side of the switch? (441) A. Map reduce B. Overrating C. Oversubscription D. Underestimation
C
457. What is a primary driver of operation costs of a WSC? (446) A. The number of servers B. The number of managers C. Its location D. The number of users
C
458. The minimum temperature that can be obtained through simple evaporative cooling is referred to as the (447) A. Dry bulb temperature B. Dew point C. Wet bulb temperature D. Relative temperatures
C
467. One of the major advantages to cloud computing via a WSC as opposed to a data center is (455) A. You don't know where the cloud is B. Availability C. Cost D. Reliability
C
468. The preferred operating system on Amazon Web Services (AWS) servers is (456) A. HPUX B. Amazon Server Software C. Linux D. Windows Server 2008 R2
C
488. A Google array switch handles about ____ servers (469) A. 100 B. 1000 C. 10,000 D. 100,000
C
489. In a Google data center, one operator can handle about ___ servers (469) A. 100 B. 100,000 C. 1000 D. 10
C
404. A major difference between user code and operating system code is (375) A. There is less OS code than user code B. User code has more coherence misses in uniprocessor system C. OS code causes more data and instruction misses than user code D. OS code has less locality than user code E. User code has less locality than OS code F. OS code has fewer coherence misses in multiprocessor systems
C, D
471. The original Google data center in Dallas had about ___ containers of computing equipment (464) A. 10 B. 1024 C. Fifteen stacks of two containers and 15 single containers. D. 45 E. 150
C, D
301. Vector execution time does not depend on (269) A. Chime time. B. Structural hazards C. The number of convoys. D. Data dependencies. E. Distance of branch jumps F. length of operand vectors G. Chaining.
C, E
314. Collecting data for vector processing and redistributing the results is known as (279) A. Pipelining B. Dual-porting C. Scatter-Gather D. Hunter-Gather E. Gather-Scatter. F. Strip-mining. G. Input/Output
C, E
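A scalar C sketch of the pattern in question 314; the index vector K holds the positions of the sparse elements (array names are illustrative):

    /* Gather: collect scattered elements of A, selected by K, into dense C. */
    void gather(double C[], const double A[], const int K[], int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[K[i]];
    }

    /* Scatter: write the dense results back to their original positions. */
    void scatter(double A[], const double C[], const int K[], int n)
    {
        for (int i = 0; i < n; i++)
            A[K[i]] = C[i];
    }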
318. What best describes the heterogeneous execution model used with GPU cards? (289) A. Optimization is handled by both the compiler and the processor B. The compiler determines what code is serial and what code is parallel. C. Applications are distributed between the CPU and the GPUs D. The GPU card runs different versions of the same program at the same time. E. Few applications are purely parallel, so they can benefit from a CPU in addition to GPUs. F. The GPU card runs different kinds of programs simultaneously. G. CUDA code is only for the GPU; separate C code is for the CPU.
C, E
393. The caches of two processors hold the same 16-byte block. One processor writes four bytes at offset 12, invalidating the other processor's copy of the block. If the other processor tries to read a word at offset 4, what kind of miss results? (366) A. Conflict miss B. Capacity miss C. False sharing miss D. True sharing miss E. Coherence miss F. Compulsory miss G. Inadvertent miss
C, E
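A data-layout sketch of the scenario in question 393, keeping the 16-byte block from the question: the two words are logically unrelated, but because they share a block, the write at offset 12 invalidates the other processor's copy of the word at offset 4 as well, so its next read misses.

    #include <stdint.h>

    /* Two independent words that happen to sit in one 16-byte block.
     * Processor 1 writes word_at_offset_12; processor 2 only ever reads
     * word_at_offset_4, yet its whole block is invalidated, making the
     * next read a false sharing (coherence) miss. */
    struct one_block {
        uint32_t word_at_offset_0;
        uint32_t word_at_offset_4;
        uint32_t word_at_offset_8;
        uint32_t word_at_offset_12;
    };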
422. A spin lock (389) A. Requires three numbers to be dialed in order to release the lock B. is like a combination lock. C. Keeps trying to obtain a lock until it succeeds. D. Is non-blocking E. Can cause resource contention. F. Can never spin indefinitely. G. Gives processors access to a variable in a predetermined order
C, E
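A minimal C11 spin lock for question 422: it keeps retrying an atomic test-and-set until the lock is obtained, which is also why heavy spinning causes contention on the lock's cache block. This is a sketch, not a production-quality lock.

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_acquire(void)
    {
        /* Keep trying to obtain the lock until the test-and-set succeeds. */
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin: every retry is traffic on the lock's cache block */
    }

    void spin_release(void)
    {
        atomic_flag_clear(&lock);
    }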
479. A low wet-bulb temperature is an indication of (467) A. An approaching storm. B. A requirement for high power consumption. C. How cool the air can get with evaporation D. How warm the air can be before clouds form E. Evaporative cooling is sufficient F. high relative humidity G. A higher PUE.
C, E
295. Which was not a step in the evolution of parallel processing? (262) A. Vector Processing B. Multimedia extensions C. Virtual Machines D. SIMD E. Graphics Processing Units F. Hitting the power wall
C, F
297. The vector version of the MIPS processor is known as (264) A. Advanced Vector Extensions B. A superscalar processor C. VMIPS D. MMX E. The Cray-1 F. Vector MIPS G. Plan B
C, F
397. What is a normal consequence of adding more processors to the design of a multiprocessor system? (372) A. CPI goes down. B. Instruction misses become dominant C. True sharing conflicts increase D. Compulsory misses are reduced E. Capacity misses rise sharply F. Cache misses account for a greater part of CPI. G. Less energy is used.
C, F
469. Google's original warehouse-scale computing concept was designed around (464) A. Other existing data centers. B. double-wide trailer houses C. 1AAA containers D. The space in underground coal mines E. The size of Larry Page's garage. F. Standard shipping containers G. Military surplus bunkers
C, F
306. Which is a significant problem for vector processors? (270) A. Wasting clock cycles. B. Exception handling C. Storing data quickly enough. D. Overheating E. Scalar register width. F. Working with vectors smaller than the designed size G. Loading data quickly enough
C, G
395. A major contributor to on-line transaction processing programs' CPI is (370) A. Interrupt CPI B. Pipeline CPI C. Cache CPI D. Multi-issue overhead E. Clock cycle time F. Memory bus limitations G. The L3 miss rate
C, G
312. To supply adequate memory bandwidth for vector processors, hardware designers often resort to the use of ____ to supply timely data to the processor (276) A. Front-side buses. B. bigger registers C. Quadruple data-rate memory (QDRM) D. memory banks E. Slower clock speeds F. Dual in-line memory modules
D
324. NVIDIA's GPU instructions resemble (291) A. Superscalar instructions B. Vector instructions C. MIMD D. SIMD instructions
D
327. GPU Thread scheduling is done by (292) A. The operating system B. Logic on the motherboard C. The host CPU D. A scoreboard system in the GPUs
D
339. What is a loop-carried dependence? (315) A. When the code in the loop contains a branch instruction B. When two different iterations of a loop access the same value C. When loops are nested D. When data used in one iteration is created in a previous iteration E. When calculations in two different iterations of a loop return the same value F. When a loop is not loop-level parallel
D
352. Detecting loop-level dependencies is a feature of (320) A. Main memory B. Cache memory. C. Linkers. D. compilers. E. Loaders. F. Processors.
D
373. Thread-Level Parallelism leverages patterns in (344) A. instructions B. data C. Requests D. processes
D
391. What technique is used by cache controllers to ensure that write-invalidates are atomic? (365) A. The controller waits a small amount of time and resends the invalidate if no response is received B. The invalidate command is sent on one bus and the confirmation is returned on another bus. C. A central clearing house is used to collect block-invalidate information D. The controller holds the bus until all other controllers have acknowledged the invalidate
D
392. When a cache controller monitors the memory bus for pertinent activity, what kind of cache coherence is being implemented? (365) A. Side-channel cache coherence B. Relaxed cache coherence C. Extended cache coherence D. Snooping cache coherence
D
411. Cache coherence is typically used for which level of cache? (380) A. L1 B. Disk C. Main Memory D. L3 E. L0 F. L2
D
425. Sequential consistency means that (394) A. Reads may occur out of order as long as the correct values are read B. Reorder buffers have nothing to do C. groups of writes must complete before subsequent groups of writes may be committed D. processor reads and writes are strictly ordered
D
447. What is a key difference between datacenter computing and warehouse-scale computing? (436) A. Servers can be more precisely tailored to tasks in the datacenter environment compared to the warehouse scale environment B. Datacenter computing is more power conscious than warehouse scale computing C. Network connectivity is more important in the datacenter scheme as opposed to the warehouse scale scheme D. Datacenter computing runs different applications for many different users, whereas a WSC runs essentially the same application for a huge number of users
D
448. Which is the best description of map-reduce? (437) A. Map: summarize data for further processing; Reduce: perform further processing B. Map: make the problem smaller; Reduce: make the problem bigger C. Map: create a multi-stage pipeline for data; Reduce: simplify the pipeline for optimal processing D. Map: arrange data for computation; Reduce: summarize results
D
452. WSC Servers are often housed in (441) A. Minitower cabinets with 16 minitowers per module B. Nineteen servers per 48-inch rack C. Tower cabinets with 24 towers per pallet D. 19-inch racks with 48 servers per rack
D
454. One of the main reasons eventual consistency is used in Google data storage is (442) A. Significantly less power consumption B. Data accuracy C. Reliability through redundancy D. Decreased bandwidth requirements.
D
460. Most of the power consumed inside a server is used by (450) A. Displays B. Disks C. Memory D. The CPU
D
462. Power Utilization Effectiveness (PUE) is one measure of WSC efficiency, the other measure is (450) A. Idle time B. Total annual sales C. Storage capacity D. Response time
D
463. Google's service level agreement for user response time is (450) A. 50% of requests under 100 milliseconds. B. 99% of requests under 10 seconds. C. 50% of requests under 10 seconds. D. 99% of requests under 100 milliseconds.
D
464. The cost of a Warehouse Scale Computing center is composed of (452) A. Power cost plus cooling cost B. Power cost plus personnel cost C. Management costs plus labor costs D. Capital expenditures plus operational expenditures
D
474. In general, the largest portion of non-IT power consumption goes to (464) A. Building lighting and security B. Ground C. powering visual displays D. Cooling E. Heat loss in transformers. F. Wiring resistance.
D
476. Each container full of equipment uses how much power? (465) A. 25 kilowatts B. 1 megawatt C. 8 megawatts D. 250 kilowatts
D
310. The Vector Mask Register is used to (275) A. Avoid loading 0 values from memory. B. Keep users from seeing sensitive operating system data. C. Store multiple elements in the same cycle. D. Perform vector operations on vectors < 64 elements long E. Conditionally perform operations on vector elements. F. Performing branch prediction lookups. G. Keep the CPU from knowing what values it is manipulating.
D, E
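The kind of scalar loop the Vector Mask Register in question 310 is built for: the comparison fills the mask, and the subtraction then executes only for elements whose mask bit is set. A C version of the scalar loop (array names are illustrative):

    /* Conditional element update; with a VMR, the compare sets the mask and
     * the subtraction is performed only where the mask bit is 1. */
    void masked_sub(double A[], const double B[], int n)
    {
        for (int i = 0; i < n; i++)
            if (A[i] != 0.0)
                A[i] = A[i] - B[i];
    }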
345. Once a loop-level dependency is identified (317) A. Loop unrolling can proceed B. The loop can be rewritten to remove the dependency C. Iterating the loop backwards will remove the dependency D. It is possible that the dependency won't matter E. It may be possible to rewrite the loop without the dependency F. The loop cannot be unrolled
D, E
409. In non-uniform, distributed shared memory systems (379) A. Snooping works well for small numbers of distributed processors B. Spatial locality is easy to obtain C. Directory-based coherence is unnecessary D. Snooping no longer works E. Directory based coherence schemes are necessary F. Scalability of the coherence protocol is not important
D, E
420. Load Linked (ll) and store conditional (sc) are used to perform what building-block operation? (387) A. Read after write. B. Load-use without forwarding. C. Register shifts. D. Atomic exchange E. All-or-nothing swap. F. Load register. G. Cache hit.
D, E
419. The Load Linked / Store Conditional are a way of handling (386) A. Error correction. B. Coherence. C. Sharing. D. An atomic exchange. E. Distribution. F. Synchronization. G. Parity.
D, F
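Questions 419 and 420 present load linked / store conditional as the building block for an atomic exchange. The C11 sketch below shows the same retry pattern; on LL/SC machines a weak compare-exchange loop typically compiles down to an ll/sc loop, so this is an illustration of the idea, not the MIPS instruction sequence itself.

    #include <stdatomic.h>

    /* Atomically swap new_val into *loc and return the old value.  The weak
     * compare-exchange may fail if another processor touched the location
     * (like sc failing), so it is simply retried. */
    int atomic_swap(atomic_int *loc, int new_val)
    {
        int old = atomic_load(loc);
        while (!atomic_compare_exchange_weak(loc, &old, new_val))
            ;   /* 'old' is refreshed on failure; retry the exchange */
        return old;
    }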
299. The 64-element register of 64-bit elements can be divided into (267) A. 4 256-bit elements. B. 2 32-bit elements C. 32 32-bit elements. D. 128 32-bit elements E. 64 32-bit elements F. 256 64-bit elements G. 256 16-bit elements
D, G
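Worked check for question 299: a register of 64 elements, each 64 bits wide, holds 64 × 64 = 4096 bits, which can be reinterpreted as 4096 / 32 = 128 32-bit elements (option D) or 4096 / 16 = 256 16-bit elements (option G).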
323. GPU architecture is similar to vector architecture in all the following ways except (291) A. They both work well with data-level parallel problems B. They both have large (wide) registers C. They both Perform scatter-gather transfers D. They both use multithreading to hide memory latency E. They both use the SIMD model. F. GPUs don't use thread blocks. G. Both processors can manage massive numbers of threads at once.
D, G
341. Analyzing a loop for dependence is typically done (316) A. By the compiler when the parse-tree is built B. By the optimizer at the assembly-code level C. Once the machine code is determined D. By the optimizer at the source-code level E. By the compiler at the source-code level
E
400. A side-effect of increasing block size is (373) A. Decreasing power requirements B. Increasing miss rate C. Decreased bus activity D. Increasing power requirements E. Increasing miss penalty
E
475. Each group of 40 servers in a rack is tied together with (465) A. custom-designed 40-port switch arrays B. Jumper cables. C. Nuts and bolts. D. Cisco optical switches E. Commodity 48-port ethernet switches F. Switching power supplies
E
347. A recurrence (in a loop) is (318) A. A rare form of loop dependency. B. Irresolvable. C. Algebraically impossible. D. When loop iterations can be unrolled with no conflicts. E. When a loop is dependent on the calculations in a previous loop. F. A common sort of loop dependency. G. When calculations in one loop are repeated in a subsequent loop.
E, F
349. If two affine indices into a one-dimensional array can produce the same index value with different loop variable values (319) A. There are loop-level dependencies. B. The compiler will ignore the loop. C. The loop cannot be unrolled. D. A single index is required. E. A loop dependency is likely. F. It may still be possible to unroll the loop. G. There will not be a loop-level dependency.
E, F
410. Tables needed to store cache state in the directory-based protocol scale ___ with the number of processors (379) A. Independently B. Exponentially C. Order 2 (O(n^2), as the square of the number of processors) D. Inversely E. Proportionally F. Linearly G. Sublinearly
E, F
421. Store conditional will not execute if (387) A. Another process executes. B. A lock is obtained. C. Locks are disabled. D. The system is already locked. E. The value in the location read by load-linked is modified F. Another process interferes with the swap. G. A pipeline exception is invalidated.
E, F
304. Which is not true of chimes? (269) A. m convoys execute in m chimes B. Chime length is proportional to vector length C. They are measured in cycles per instruction. D. It is the unit of time to execute one convoy E. Chimes can execute simultaneously F. They are measured in instructions per clock cycle. G. They contain no structural hazards
E, G
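Worked check for question 304: a vector sequence that needs m = 3 convoys takes about 3 chimes; with 64-element vectors a chime is roughly 64 cycles, so the sequence takes on the order of 3 × 64 = 192 clock cycles, ignoring start-up latency.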
315. To obtain the best performance results for vector code, what should happen? (281) A. Use vector mask registers to avoid useless operations. B. Nothing. Vector compilers are very advanced optimizers C. The code should be as short as possible. D. Compile the code multiple times and select the version which runs most quickly. E. Programmers give hints to the compiler. F. Use a combination of scalar and vector processing G. Consult with vector programming experts.
E, G
317. What is not true about graphics cards? (288) A. Despite their power, graphics cards can only accelerate parallel parts of a program. B. A GPU-based graphics card has hundreds of floating point units C. The market continues to be driven by graphics applications D. Most of today's supercomputers feature both serial and parallel processing based on graphics cards. E. Graphics cards can accelerate even purely serial programs. F. Languages have been developed to enable user programming of GPUs G. Applications running on GPU cards must be graphical in nature
E, G
480. In low-humidity environments, the cheapest source of cooling is (467) A. Freon. B. ground water C. snow D. Rain E. "Opening the windows." F. Compressed air. G. Outside air
E, G
486. An oversubscribed switch (469) A. Has much more potential bandwidth on the uplink side than it can actually downlink B. Has too many computers attached to it C. Can be a fire hazard D. Breaks down more easily. E. Can only uplink a fraction of the data sent to it on the downlink side. F. Has too few computers attached to it. G. Has much more potential bandwidth on the downlink side than it can actually uplink
E, G
313. The even spacing between vector elements stored in memory is known as (278) A. Row-major order. B. Data layout. C. Address space layout D. Pipelining E. Positive displacement F. Stride
F
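A short C illustration of stride for question 313, assuming a 100 × 100 matrix stored in row-major order: walking down one column touches every 100th element in memory, so the access has a stride of 100.

    #define N 100

    /* Column walk over a row-major matrix: consecutive accesses are
     * N elements (one full row) apart, i.e. the stride is N. */
    double column_sum(const double A[N][N], int col)
    {
        double s = 0.0;
        for (int row = 0; row < N; row++)
            s += A[row][col];
        return s;
    }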
319. The proper hierarchy (lowest to highest) of threads is (289) A. Grid, thread, block B. Grid, block, thread. C. Block, grid, thread D. Thread, grid, block E. Block, thread, grid. F. Thread, block, grid
F
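An index-arithmetic sketch of the thread < block < grid hierarchy in questions 319 and 325, written as plain C rather than CUDA syntax (the names are illustrative): each block holds up to 512 threads, and a thread locates its data element from its block number and its thread number within that block.

    /* Element assigned to one thread: blocks partition the grid's work,
     * and threads partition a block's work (threads_per_block <= 512). */
    int global_index(int block_index, int threads_per_block, int thread_index)
    {
        return block_index * threads_per_block + thread_index;
    }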
362. NVidia's graphics processor for mobile devices is the (324) A. Core i7 B. Tesla C. ARM A-8 D. Fermi E. ARM A-9 F. Tegra 2
F
364. Which processor has the greatest GFLOPS/Sec? (325) A. ARM A9 B. ARM A8 C. Fermi GTX-280 D. NVidia Tegra E. Intel Core i7 F. Fermi GTX-480
F
394. The study of cache performance on the Alpha 21164 server was done on a(n) (367) A. Business-to-business model B. Single-user scenario C. Financial transaction model D. Client-server architecture E. Web server simulation F. Commercial workload
F
481. Why would the memory bus of the standard Google server be downclocked from 666 to 533 MHz? (467) A. Google users typically don't require much information. B. So they can use cheaper memory chips. C. It discourages high use of bandwidth D. It allows lower latency E. Avoid radio interference with other computers F. More economical. G. Lower power consumption
F, G