Advanced OS Midterm

A hypervisor is implementing a new memory management scheme called "coordinated memory management" (CMM). The idea is the following. When a VM experiences memory pressure and comes to the hypervisor with a request for more memory, all the VMs cooperate under the direction of the hypervisor to alleviate the memory pressure of the needy VM, each chipping in what it can, commensurate with its current memory allocation and use. That is, the hypervisor collects memory-usage statistics dynamically from the VMs and instructs each of them to release an appropriate amount to meet the request. Assume that each VM has a balloon driver to support CMM. Present a design for the above scheme. Sketch the data structure in the hypervisor for CMM.

"Memory-Info" in the hypervisor for each VM: a.Allocated memory; Actual memory in use;

Consider an 8 core CPU with 4-way hardware multithreading in each core. The OS can schedule 32 threads to run on the CPU at any point of time. The expectation is: (circle one correct choice) (i) All 32 threads will run concurrently. (ii) At any point of time, 1 thread per core will be executing concurrently. (iii) At any point of time some random set of 8 threads will be executing concurrently. (iv) At any point of time some random set of 4 threads will be executing concurrently. (v) At any point of time a set of 8 threads will be executing concurrently in a round robin fashion.

(ii) At any point of time, 1 thread per core will be executing concurrently.

All the variables in the execution shown below are in the shared memory of a cache-coherent NUMA multiprocessor. The multiprocessor implements sequential consistency. All variables are initialized to 0 before execution starts.

Execution on processor P1:
    x = 20
    y = 30

Execution on processor P2:
    w = y + x
    z = x + 10

Which of the following are impossible with the above? (i) w = 0; z = 10; (ii) w = 0; z = 30; (iii) w = 20; z = 30; (iv) w = 30; z = 10; (v) w = 30; z = 30; (vi) w = 50; z = 30

(iv) w = 30; z = 10 and (v) w = 30; z = 30. In both, w = 30 requires P2 to read y = 30 but x = 0; under sequential consistency x = 20 is written before y = 30, so any execution that sees y = 30 must also see x = 20.

Explain how I/O rings may be used to implement the networking subsystem of a guest OS in Xen.

* Each guest OS has two I/O rings: receive and transmit.
* Transmit: enqueue a descriptor on the transmit ring that points to the buffer in guest OS space; no copying into Xen; the page is pinned until transmission completes; a round-robin packet scheduler is used across domains.
* Receive: Xen exchanges the received page for a page provided by the guest OS in the receive ring; no copying.

Explain how the page table for a newly created process is set up by a library operating system executing on top of Xen.

- Allocates and initializes a physical page frame and registers it with Xen to serve as the page table (PT) for the process.
- Creates the VPN-to-PPN mappings for the process.
- Communicates these mappings to Xen via hypervisor calls.
- Xen populates the PT above with these mappings.
- Allocates page frames for the process to back these VPN-to-PPN mappings.

Qualitatively argue why Anderson's queueing lock algorithm, which uses a unique memory location for each processor waiting for a lock, mitigates the performance problems posed by a simple test-and-set spinlock.

- Each processor spins on a private variable awaiting its turn for the lock. The spinning is thus local and does not cause bus traffic.
- Upon lock release, exactly one processor's private variable is modified, so only this processor goes to memory to complete its lock acquisition while the others continue to spin on their respective private variables in their caches.
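A minimal sketch of Anderson's array-based queueing lock, assuming C11 atomics; initialization (slot 0 open, all other slots closed) and the exact padding scheme are implementation details left out here:

    #include <stdatomic.h>
    #define NPROCS 64
    #define CLINE  64

    struct anderson_lock {
        struct {                      /* one flag per waiter, each padded to */
            atomic_int must_wait;     /* its own cache line so the spinning  */
            char pad[CLINE - sizeof(atomic_int)];  /* stays in that cache    */
        } slot[NPROCS];
        atomic_int next;              /* fetch-and-increment to take a slot  */
    };

    int lock(struct anderson_lock *l) {
        int me = atomic_fetch_add(&l->next, 1) % NPROCS;
        while (atomic_load(&l->slot[me].must_wait))
            ;                         /* local spin: no interconnect traffic */
        return me;                    /* caller remembers its slot number    */
    }

    void unlock(struct anderson_lock *l, int me) {
        atomic_store(&l->slot[me].must_wait, 1);                /* re-arm my slot */
        atomic_store(&l->slot[(me + 1) % NPROCS].must_wait, 0); /* wake exactly
                                                                   one successor  */
    }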

Consider a user process executing in a library OS on top of the Aegis Exokernel. The user process incurs a page fault (i.e., a TLB miss). Explain the action that follows in each of the following situations. You should identify the mechanisms and data structures available in Exokernel that enable it to accomplish the action. This is the first access to the faulting address by the user process.

- Exokernel calls the entry point in the lib OS for handling page faults (using the PE data structure).
- The lib OS calls its hard-disk driver to initiate I/O from disk to a physical page frame.
- The device driver presents the capabilities for the hard-disk controller and the page frame to Exokernel to perform the DMA.
- Once the DMA completes, Exokernel uses the PE (interrupt context) to upcall the lib OS.
- The lib OS completes the TLB update.

What is the role performed by I/O rings in the Xen architecture?

- Serves as a communication data structure between any given lib OS and Xen.
- The lib OS places the necessary information corresponding to a hypervisor call in the I/O ring for Xen to pick up.
- Similarly, Xen places its responses to the hypervisor calls from a particular lib OS in the associated I/O ring.

Using an example, describe the role performed by I/O rings in the Xen architecture.

- Used for data transfer and command initiation between the lib OS and Xen.
- Example, a disk write: the lib OS puts a request descriptor into the I/O ring; Xen performs the disk write; upon completion Xen puts a response descriptor for the lib OS in the I/O ring.

Consider a user process executing in a library OS on top of the Aegis Exokernel. The user process incurs a page fault (i.e., a TLB miss). Explain the action that follows in each of the following situations. You should identify the mechanisms and data structures available in Exokernel that enable it to accomplish the action. The faulting virtual address is not a guaranteed mapping for the library OS but a valid mapping exists for the virtual page in the page table of the faulting process.

- Using the PE data structure for this lib OS, Exokernel determines the exception context and upcalls the lib OS.
- The lib OS looks up the page table for this process and retrieves the (VPN, PFN) mapping corresponding to the faulting address.
- The lib OS calls the TLB-update primitive available in Exokernel to put this mapping into the TLB, presenting the capability it has for the PFN.
- Exokernel resumes execution of the faulting process.

Consider a user process executing in a library OS on top of the Aegis Exokernel. The user process incurs a page fault (i.e., a TLB miss). Explain the action that follows in each of the following situations. You should identify the mechanisms and data structures available in Exokernel that enable it to accomplish the action. The faulting virtual address is a guaranteed mapping for the library OS.

- Using the PE data structure for this lib OS, Exokernel looks up its software TLB to get the VPN-to-PFN mapping for this VA.
- Exokernel installs this mapping into the TLB and resumes execution of the user process.

Enumerate the actions (using figures) in LRPC in response to an import call from a client.

- Upon the import call, the kernel calls the clerk in the server with the details of the import call.
- The clerk determines the entry-point address and returns the details (entry point, number of arguments, etc.) to the kernel in a procedure descriptor list (PDL).
- The kernel uses the PDL to determine the sizes of the A-stack and E-stack; it allocates the A-stack and maps it into both the client and server address spaces.
- The kernel returns a Binding Object (BO) to the client for future calls from this client to this specific RPC entry point in the server.
- The client presents the BO on future calls.

There are four myths that discourage microkernel-based design of an OS: (1) kernel-user switches (border crossings) are expensive; (2) hardware address space switching is expensive; (3) thread switches are expensive; (4) locality loss in the caches can be expensive when protection domains are implemented using hardware address spaces. How does Liedtke debunk these myths?

- (1): Proof by construction: 123 processor cycles (including TLB and cache misses).
- (2): Not necessarily: exploit hardware features, e.g., segment registers, to pack small protection domains into the same hardware address space; for large protection domains, the explicit costs are much smaller than the implicit costs (cache no longer warm, etc.).
- (3): Shown by construction that this is not true; competitive with SPIN and Exokernel.
- (4): This was true in Mach due to its focus on portability, but with a focus on performance it can be overcome by tailoring the code to architecture-specific features (e.g., segment registers on x86).

Enumerate the four different types of program discontinuities that can occur during the execution of a process with a short sentence explaining what each one of the discontinuities is.

- Exception: program-generated condition, e.g., an arithmetic error such as divide by zero.
- Trap: synchronous entry into the kernel requested by the program, e.g., a system call.
- Page fault: access to a virtual page that is not currently mapped in physical memory.
- External interrupt: asynchronous event from outside the processor, e.g., I/O completion.

A library OS implements a paged virtual memory on top of Exokernel. An application running in this OS encounters a page fault. List the steps from the time the page fault occurs to the resumption of this application. Make any reasonable assumptions to complete this problem (stating the assumptions).

- Fielding by Exokernel.
- Exokernel notifies the library OS by calling the entry point in the PE data structure for this library OS.
- The library OS allocates a page frame, running its page-replacement algorithm if necessary to find a free frame among the page frames it already has.
- The library OS calls Exokernel through the API to install the translation (faulting page, allocated page frame) in the TLB, presenting its capability (an encrypted key).

This question pertains to supporting multiple library OS images on top of Exokernel. What is the role performed by the software TLB?

- Guaranteed virtual-to-physical mappings for each library OS are kept in the S-TLB.
- On a TLB miss, Exokernel checks whether the mapping exists in the S-TLB; if yes, it restores the mapping to the hardware TLB.
- Reduces the TLB-miss penalty a library OS would otherwise pay right after a context switch.

SPIN relies on programming language support for the implementation of protection domains. Discuss one pro and one con of this approach (two short bullet points should suffice).

- Pro: no cheating possible with Modula-3 (e.g., no void* casting as in C); a generic interface provides the entry points while the implementation stays hidden; correct pointer usage is checked at compile time; protection domains can share one hardware address space without incurring border-crossing costs.
- Con: drivers and other components that need hardware access must step outside the language's protection; also, Modula-3 is not widely used, so substantial rewriting would be necessary.

Distinguish between VPN, PPN, and MPN as discussed in VMWare.

- VPN: virtual page number, for a user-level process inside a VM.
- PPN: physical page number, the VM's idea of the physical page backing the VPN of the user process.
- MPN: machine page number, the actual machine page frame on the processor backing a given PPN (and hence the VPN).

A process "foo" executing in XenoLinux on top of Xen makes a blocking system call: fd = fopen(<file-name>); Show all the interactions between XenoLinux and Xen for this call.

- XenoLinux fields the fopen call (fast path).
- XenoLinux places all the information for performing the fopen call (e.g., the disk block address corresponding to <filename>) in the I/O ring corresponding to the device this pertains to.
- XenoLinux makes a hypervisor call to convey the request to Xen.
- Xen performs the request and places a response in the I/O ring.
- XenoLinux picks up the response and makes the blocked process runnable again.

Light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. A specific LRPC call has a total of 128 bytes of arguments to be passed to the server procedure, and 4 bytes of return value to be returned to the client. Assuming no programming-language-level support, how much of the copy overhead is due to copying into kernel buffers?

0 bytes. LRPC passes arguments through the A-stack, which is mapped into both the client and server address spaces, so nothing is copied into kernel buffers.

Assume a cache-coherent multiprocessor using a write-update policy. An OS designer implements the following lock algorithm:

Lock(L):
    while ((L == 1) or (T&S(L) == 1)) {   // lock is in use
        while (L == 1);                   // do nothing
        Delay(d[Pi]);                     // delay proportional to the processor number
    }

Unlock(L):
    L = 0;

Given the above algorithm, upon a lock release, how many times will the T&S operation be executed assuming N processors in the system? Explain your answer.

1 time. The processor with the smallest delay will be the first to do T&S upon lock release; the other processors will then see L == 1 (in the inner while) and hence will not execute T&S.
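A sketch of this delay-based lock in C, assuming C11 atomics; atomic_exchange plays the role of T&S, and delay_of[] is an assumed table of static per-processor delays:

    #include <stdatomic.h>

    extern unsigned delay_of[];               /* assumed per-processor delays */
    static void spin_for(unsigned n) { while (n--) ; }

    void lock(atomic_int *L, int me) {
        while (atomic_load(L) == 1 || atomic_exchange(L, 1) == 1) {
            while (atomic_load(L) == 1)
                ;                             /* write-update keeps the cached
                                                 copy of L current             */
            spin_for(delay_of[me]);           /* staggered delays: the processor
                                                 with the smallest delay reaches
                                                 T&S first; the rest see L == 1 */
        }
    }

    void unlock(atomic_int *L) { atomic_store(L, 0); }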

In the absence of any optimization, for making an LRPC call from a client domain to a server domain on the same processor, there are four copies involved. Enumerate them.

1. The client-side stub creates a message out of the actual arguments of the call.
2. The kernel copies this message into a buffer in kernel space.
3. The kernel copies this buffer from kernel space into the server domain.
4. The server-side stub places the arguments on the stack before calling the actual procedure.

This question pertains to supporting multiple library OS images on top of Exokernel. Enumerate the steps involved in context switching from one library OS to another.

1. The state of the currently executing lib OS is saved using its PE. This includes saving the hardware TLB into the software TLB for that lib OS, and the processor registers.
2. The state of the lib OS chosen to be scheduled is loaded into the hardware TLB using the PE of the chosen one.
3. The volatile processor state (registers, PC, etc.) is also loaded using the PE of the chosen lib OS.

Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client's address space into the kernel address space, and from the kernel address space to the server's address space before executing the server procedure. What are the reasons for the two additional copies (one in the client's address space, and one in the server's address space)? (Concise bullets please)

1. Client side: serialize all arguments of the call into a contiguous sequence of bytes in the client's address space.
2. Server side: deserialize the received contiguous sequence of bytes into the arguments for use by the called procedure in the server's address space.

Give one bullet for the similarity and one bullet for the difference between the "address range" concept of Corey and the "region" concept of Tornado.

1. Difference: "region" is internal to the kernel, to optimize the implementation of page-fault handling and increase concurrency; "address range" is exposed to the application and allows it to give hints to the OS to better optimize the coherence (or lack thereof) of kernel data structures.
2. Similarity: both serve to reduce the locking requirements on kernel data structures (and thus avoid serialization of OS services).

VM1 and VM2 are hosted on top of VMware ESX server. VM1 has a page fault at VPN = 100. The page is being brought into MPN = 4000. The contents hash for this page matches another MPN = 8000, which belongs to VM2 at VPN = 200. Give the steps taken by VMware ESX server including the data structures used to possibly optimize the memory usage by using VM oblivious page sharing.

1. The ESX server has a hash-table data structure in which each entry keeps the following information: <VM, PPN, MPN, hash of MPN contents, ref-count>.
2. Since the content hash for the faulting page of VM1 matches an entry in the hash table (MPN = 8000), perform a full comparison of this page (MPN = 4000) with the matched page (MPN = 8000).
3. If the comparison fails, create a new hint-frame entry for this new page in the hash table.
4. If the two pages are identical, mark both corresponding PPNs to point to the same MPN (here MPN = 8000) in their respective shadow page tables.
5. Mark these page-table entries "copy-on-write".
6. Increase the reference count for MPN = 8000 in the hash table by 1, and free up machine page MPN = 4000.
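A minimal C sketch of what such a hash-table entry might hold; the field names are illustrative assumptions, not VMware's actual layout:

    #include <stdbool.h>
    #include <stdint.h>

    struct page_share_entry {
        uint64_t content_hash;  /* hash of the page's contents                */
        uint32_t vm;            /* owning VM                                  */
        uint64_t ppn;           /* guest physical page number                 */
        uint64_t mpn;           /* machine page backing it                    */
        uint32_t ref_count;     /* number of PPNs mapped copy-on-write to MPN */
        bool     is_hint;       /* hint frame: a match not yet fully verified */
    };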

A hypervisor is implementing a new memory management scheme called "coordinated memory management" (CMM). The idea is the following. When a VM experiences memory pressure and comes to the hypervisor with a request for more memory, all the VMs cooperate under the direction of the hypervisor to alleviate the memory pressure of the needy VM, each chipping in what it can, commensurate with its current memory allocation and use. That is, the hypervisor collects memory-usage statistics dynamically from the VMs and instructs each of them to release an appropriate amount to meet the request. Assume that each VM has a balloon driver to support CMM. Present a design for the above scheme. Sketch the steps in CMM.

1. Each VM has an entry point called "Report Memory Status".
2. Each VM also installs the balloon driver.
3. The hypervisor calls "Report Memory Status" on each donor VM.
4. Each donor VM reports its actual memory in use.
5. The hypervisor updates the "Memory-Info" record for each donor VM.
6. The hypervisor calculates how much memory to ask of each donor VM based on the "Memory-Info" data structure.
7. The hypervisor instructs the balloon driver in each of the donor VMs to inflate commensurate with the decision in step 6.
8. Once all the balloon drivers in the donor VMs have completed the "inflate" command, the hypervisor has the necessary memory to give to the VM experiencing memory pressure.
9. The hypervisor instructs the balloon driver in the requesting VM to "deflate" the balloon by the amount of the memory request.

Identify one situation where true locking is necessary, and one situation in which "existence guarantee" is sufficient while dealing with shared objects.

1. Existence guarantee => the process object in the chain of page-fault handling; the object itself is not going to be modified, but the handler needs a guarantee that the object will not "go away" (e.g., due to process migration).
2. True locking => the region object in the chain of page-fault handling; here the handler has to eventually modify a specific entry in the object corresponding to the faulting VPN.

A library OS implements a paged virtual memory on top of Exokernel. Assume that the processor uses a TLB (which does not support address-space IDs) to perform address translation. A TLB miss is a page fault to be handled by the OS. Exokernel supports the following calls:
- GetTLBentry: get an entry in the TLB for use by a library OS
- TLBWr: insert a mapping into the TLB
- TLB_VA_delete: delete a virtual address from the TLB
A process running within this library OS encounters a page fault. Walk through the steps from the time the page fault occurs to the resumption of this application. Assume the following:
- The library OS uses a per-process page-replacement algorithm (meaning that if a process page faults and there are no free physical frames available, the OS will use one of the physical frames of the process to service the page fault).
- At the point of the page fault, the library OS does not have any free physical pages available.
If you make any other assumptions, please state them.

1. Exokernel detects the page fault.
2. Exokernel uses the PE data structure for the library OS to make an upcall to the appropriate entry point in the library OS that handles page faults, passing the faulting VPN.
3. The library OS runs its page-replacement algorithm to find a victim physical frame (PFN).
4. The library OS calls Exokernel to delete the victim VPN associated with this PFN from the TLB using "TLB_VA_delete".
5. The library OS issues an I/O call via Exokernel to get the faulting page from the disk into the physical page PFN.
6. Upon I/O completion, Exokernel upcalls the library OS via the appropriate entry point in the PE data structure for the library OS.
7. The library OS calls Exokernel to install the translation (faulting VPN, PFN) in the TLB using "TLBWr".
8. The application is ready to run at this point.
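A C sketch of the two library-OS handlers implied by these steps. TLB_VA_delete and TLBWr are the Exokernel calls named above; the types and every other helper (page_replacement, vpn_backed_by, disk_read) are assumptions made for illustration:

    typedef unsigned long vpn_t, pfn_t, capability_t;  /* illustrative types */
    extern pfn_t page_replacement(void);
    extern vpn_t vpn_backed_by(pfn_t);
    extern void  TLB_VA_delete(vpn_t);
    extern void  TLBWr(vpn_t, pfn_t, capability_t);
    extern void  disk_read(vpn_t, pfn_t);

    /* upcalled by Exokernel via the PE entry point with the faulting VPN */
    void libos_page_fault(vpn_t faulting_vpn) {
        pfn_t victim = page_replacement();     /* no free frames: pick a victim */
        vpn_t old    = vpn_backed_by(victim);  /* VPN currently using the frame */
        TLB_VA_delete(old);                    /* remove the stale TLB entry    */
        disk_read(faulting_vpn, victim);       /* start the I/O via Exokernel   */
    }

    /* upcalled again by Exokernel when the disk I/O completes */
    void libos_io_done(vpn_t vpn, pfn_t pfn, capability_t cap) {
        TLBWr(vpn, pfn, cap);   /* install the new mapping, presenting the
                                   capability the lib OS holds for the frame */
        /* the faulting application can now be resumed */
    }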

An application process starts up in a guest OS running on top of Xen. List the steps taken by the guest OS and Xen to set up the page table for this process. What is the mechanism by which the guest OS schedules this process to run on the processor? You can assume any reasonable hypervisor calls without worrying about the exact syntax. State your assumptions.

1. The Guest OS allocates a physical page frame from its own reserve to serve as the page table (PT) data structure for the newly created process.
2. The Guest OS registers this page frame with Xen as a page table by using a hypercall.
3. The Guest OS writes batched VPN-to-PPN mappings into this page-table data structure via a single hypercall.
4. When it wants to schedule this process, the Guest OS issues a hypervisor call to change the PTBR to point to the PT for this process.

Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.

1. The Guest OS executes the privileged instruction for changing the PTBR to point to the chosen process (say P2). This results in a trap to the hypervisor.
2. From the PPN of the PT for P2, the hypervisor knows the offset into the S-PT for that Guest OS where the PT resides in machine memory.
3. The hypervisor installs the MPN thus located as the PT for P2 into the PTBR.
4. Once the other formalities associated with context switching are complete (which also need hypervisor intervention), such as saving the volatile state of the currently running process into the associated PCB and loading the volatile state of P2 from its PCB into the processor, P2 can start executing.

Consider a processor that uses a page table to do address translation. A fully virtualized Guest OS runs on top of this processor. The Guest OS has an illusion of a fully contiguous physical memory. In fact, the machine memory (i.e., the real physical memory) given to this OS is NOT contiguous. Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.

1. The Guest OS tries to execute a privileged instruction to set the PTBR (page-table base register) to the new process's page-table data structure (the PPN that contains this process's page table).
2. The ensuing trap is caught by the hypervisor.
3. The hypervisor uses the shadow page table for this Guest OS to get the MPN that corresponds to this PPN. This gives the page table for this process in machine memory.
4. The hypervisor sets the PTBR to this MPN.
5. The Guest OS does the other steps needed for a normal context switch (e.g., load/store of volatile state to/from the PCB), all of which are accomplished using the "trap and emulate" method.

Consider a processor that uses a page table to do address translation. A fully virtualized Guest OS runs on top of this processor. The Guest OS has an illusion of a fully contiguous physical memory. In fact, the machine memory (i.e., the real physical memory) given to this OS is NOT contiguous. The process incurs a page fault. What are the steps in servicing this page fault and resuming the process? List all the interactions between the hypervisor and Guest OS via a set of concise bullets.

1. The hypervisor catches the page fault and passes it up to the Guest OS.
2. The Guest OS runs its page-replacement algorithm (if there are no free physical frames).
3. The Guest OS initiates I/O to fetch the missing page from the disk into the PPN assigned for the faulting page.
4. The hypervisor uses the shadow page table to find the MPN corresponding to this PPN.
5. The I/O is completed by the hypervisor using the "trap and emulate" method.
6. Upon I/O completion, the Guest OS is notified via a software interrupt from the hypervisor.
7. The Guest OS tries to install the VPN->PPN mapping into the TLB.
8. The hypervisor takes the necessary action via "trap and emulate" to install the VPN->MPN mapping into the shadow page table corresponding to this process for this Guest OS.

Consider a processor that uses a page table to do address translation. A fully virtualized Guest OS runs on top of this processor. The Guest OS has an illusion of a fully contiguous physical memory. In fact, the machine memory (i.e., the real physical memory) given to this OS is NOT contiguous. How is the above illusion handled in full virtualization?

1. The hypervisor maintains a shadow page table (PPN -> MPN).
2. The Guest OS attempts to write to its page table using a privileged instruction, which results in a trap caught by the hypervisor.
3. The hypervisor emulates the intended action of the Guest OS (writing the VPN->PPN mapping into the Guest's page table).
4. The hypervisor installs the VPN->MPN mapping directly into the shadow page table for this Guest OS.

Consider a processor that uses a page table to do address translation. A fully virtualized Guest OS runs on top of this processor. The Guest OS has an illusion of a fully contiguous physical memory. In fact, the machine memory (i.e., the real physical memory) given to this OS is NOT contiguous. A process belonging to this Guest OS is currently executing on the processor. Assume that the needed memory contents for this process are in the machine memory. What happens on every memory access by this process?

1. If the TLB contains the VPN->MPN mapping, the processor completes the memory access without having to go to the shadow page table for the Guest OS (for this process).
2. Upon a TLB miss, the processor walks the shadow page table, gets the VPN->MPN mapping, installs it into the TLB, and completes the memory access.

"Latency", "Contention", and "Waiting time" are the three sources of un-scalability of synchronization algorithms. With ONE concise bullet for each term, explain what these terms mean.

1. Latency: the time it takes for a process to acquire the lock in the absence of contention.
2. Contention: the interconnection-network traffic generated upon lock release by the competing processors that are spinning to acquire the lock.
3. Waiting time: a property of the application (e.g., the length of the critical section governed by the lock, which makes a new lock requestor wait for the release); it does not have anything to do with the scalability of the lock algorithm.

Guest OS is servicing a page fault. It finds a free PPN and decides to make the mapping of the VPN for the faulting process to this PPN. List the steps from here on taken by the guest and the hypervisor that ensures that the process will not fault on this VPN again.

1. To put this mapping into the page table, the Guest OS has to execute a privileged instruction.
2. This results in a trap into the hypervisor, which emulates the required action (the VPN->PPN mapping) in the Guest OS's page-table data structure.
3. Further, the hypervisor installs a direct VPN->MPN mapping (the MPN corresponding to the PPN) into the S-PT data structure for this Guest OS and into the TLB.
4. Since the S-PT is the "hardware page table" in an architecture such as Intel's, the process will not page fault on this VPN the next time it runs on the CPU.

XenoLinux runs on top of Xen. XenoLinux has installed a "fast handler" in Xen so that system calls by a process executing on top of XenoLinux get handled directly by XenoLinux without a level of indirection through Xen. XenoLinux uses an I/O ring to convey disk I/O requests to Xen, and gets responses back from Xen via this I/O ring. A process P1 is executing in XenoLinux. It makes a blocking system call: fopen(<filename>); With a set of concise bullets, show all the interactions between XenoLinux and Xen for servicing this call. Clearly indicate when P1 resumes execution.

1. Via the fast handler, XenoLinux fields the fopen system call directly, without Xen's intervention.
2. XenoLinux populates the I/O ring data structure with the details of the fopen call.
3. XenoLinux makes the I/O request using a hypervisor call.
4. Xen services the request.
5. Xen fills the response into the I/O ring data structure.
6. XenoLinux gets the response by polling the I/O ring.
7. XenoLinux completes what it needs to do and places P1 back on the ready queue; P1 resumes execution when it is next scheduled.

Given the following configuration for a chip multiprocessor:
- 4 cores
- Each core is 4-way hardware multithreaded
- 512 MB LLC (last-level cache on the chip)
Given the following pool of threads ready to run:
- Group 1: T1-T8, each requiring 8 MB of cache
- Group 2: T9-T16, each requiring 16 MB of cache
- Group 3: T17-T24, each requiring 32 MB of cache
- Group 4: T25-T32, each requiring 64 MB of cache
Which threads should be run by the scheduler to ensure all the hardware threads are fully utilized while maximizing LLC utilization?

4 * 4 = 16 hardware threads; 32 software threads.
Cumulative cache requirement of all 32 threads: 64 + 128 + 256 + 512 = 960 MB.
One of many feasible schedules that uses all 16 hardware threads and all of the available 512 MB of LLC:
- T25-T28 on Core 1: 4 hardware threads, 256 MB
- T17-T20 on Core 2: 4 hardware threads, 128 MB
- T9-T12 on Core 3: 4 hardware threads, 64 MB
- T13-T16 on Core 4: 4 hardware threads, 64 MB

(Explain why the following statement is true) (Tornado) "Dealing with concurrent page faults experienced by the threads of a multi-threaded process is a challenge for a parallel OS." (Concise bullets please)

A multithreaded process has a page table which is logically shared by all the threads. Concurrent page faults happening on different processors where the threads are executing could potentially end up getting serialized if careful attention is not paid to managing the shared page table.

Choose the correct two choices pertaining to the paper "Xen and the Art of Virtualization". A. System calls executed by a user process use a table lookup inside Xen to do an upcall to the appropriate guest operating system. B. Implementation of Xen on the Intel x86 architecture requires special handling of page faults incurred by a user process before the page fault can be handed over to the guest operating system. C. Xen allows unmodified binaries of guest operating systems to be run on top of it. D. Xen allows fine-grained access to hardware resources by a guest operating system exactly as in Exokernel.

A. System calls executed by a user process use a table lookup inside Xen to do an upcall to the appropriate guest operating system. B. Implementation of Xen on the Intel x86 architecture requires special handling of page faults incurred by a user process before the page fault can be handed over to the guest operating system.

Identify the components of an RPC call that ARE in the critical path of the latency: A. Time on the wire B. Controller latency C. Switching client out at the point of call D. Interrupt service E. Switching server in on call arrival F. Switching server out on call completion G. Switching client in to receive results of the call

A. Time on the wire B. Controller latency D. Interrupt service E. Switching server in on call arrival G. Switching client in to receive results of the call

What is the purpose of the "software TLB" in the Exokernel approach to providing extensibility?

All library OSes live above the kernel, so each library OS is in its own hardware address space. When a library OS is scheduled to run on the processor by Exokernel, the hardware TLB has to be flushed, and the newly scheduled library OS would suffer an inordinate number of TLB misses. To mitigate the associated loss of performance, Exokernel maintains a software TLB data structure for each library OS; the translations it contains are loaded into the hardware TLB when that library OS is scheduled.

Consider a 64-bit paged-segmented architecture. The virtual address space is 2^64 bytes. The TLB does not support address-space IDs, so on an address-space switch the TLB has to be flushed. The architecture has two segment registers:
- LB: lower bound for a segment
- UB: upper bound for a segment
There are three protection domains, each requiring 32 MiB of address space (note: Mi is 2^20). How would you recommend implementing the 3 protection domains to reduce border-crossing overheads among these domains?

All three protection domains can be packed into one hardware address space. Each domain takes up 2^25 bytes (32 MiB). The (LB, UB) range for each domain:

Domain 1: LB = 0,     UB = 2^25 - 1
Domain 2: LB = 2^25,  UB = 2^26 - 1
Domain 3: LB = 2^26,  UB = (3 * 2^25) - 1

One of the techniques for efficient use of memory which is a critical resource for performance is to dynamically adjust the memory allocated to a VM by taking away some or all of the "idle memory" (unused memory) from a VM. This is referred to in the literature as "taxing" the VM proportional to the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away ALL the idle memory from a VM) not a good idea?

Any sudden increase in the working set size of the VM will result in poor performance of that VM, potentially violating the service-level agreement for that VM.

In the Intel architecture, the processor does the following on every memory access: Translates virtual page number to a physical page frame number using a page table in physical memory. Let's call this hardware page table (HPT). The processor also has a hardware TLB to cache recent address translations. Full virtualization is implemented on top of this processor. How many SPTs are there?

As many as the number of guest OSes currently executing on top of the hypervisor. Each guest OS has its own illusion of contiguous physical memory, which the hypervisor supports by keeping a distinct SPT for each virtual machine.

Choose all that apply with respect to the advantages of component-based design (-1 for each incorrect choice): A. Decouples specification, verification, and implementation B. Ease of development C. Avoids side effects present in non-functional programming languages D. Ease of adaptation to meet the requirements of specific environments E. Ease of extensibility F. Optimal performance

B. Ease of development D. Ease of adaptation to meet the requirements of specific environments E. Ease of extensibility

In building a subsystem (e.g., a memory manager), you notice that an object is mostly shared in a read-only manner. You would choose to implement this object as a clustered object with (choose one of the following): A. Single copy shared on all the processors (a true shared object) B. One copy per processor (fully replicated representation) C. One copy per group of processors (partially replicated representation)

B. One copy per processor (fully replicated representation)

Choose the correct two choices pertaining to Exokernel from the following. A. Once acquired through secure binding, a library OS can hold on to resources as long as it wants. B. The processor environment (PE) data structure in Exokernel contains, among other things, information pertaining to entry points in the library OS for upcalls from the Exokernel. C. Upon a TLB miss, Exokernel always calls the faulting library OS for servicing the miss. D. The mechanism in Exokernel for downloading code into the kernel is to avoid border crossing for protection domains that are critical for achieving high performance.

B. The processor environment (PE) data structure in Exokernel contains, among other things, information pertaining to entry points in the library OS for upcalls from the Exokernel. D. The mechanism in Exokernel for downloading code into the kernel is to avoid border crossing for protection domains that are critical for achieving high performance.

Fixed Processor (FP) scheduling for a given thread has the following effect (choose the most appropriate selection): A. Good load balance B. Reduces the ill effects of cache reload C. Increases the ill effects of cache reload

B. Reduces the ill effects of cache reload

Choose the correct two choices pertaining to SPIN from the following. A. SPIN keeps each logical protection domain in a distinct address space. B. SPIN does not incur border-crossing overhead for protection domains that have been collocated with the kernel through kernel extension. C. Capabilities to entry points in protection domains are implemented with encrypted keys in SPIN. D. In SPIN, hardware events (such as a page fault) result in the direct execution of interface procedures that have been registered by a protection domain with the kernel through the extension model.

B. SPIN does not incur border-crossing overhead for protection domains that have been collocated with the kernel through kernel extension. D. In SPIN, hardware events (such as a page fault) result in the direct execution of interface procedures that have been registered by a protection domain with the kernel through the extension model.

Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. The kernel allocates A-stack in physical memory and maps this into the virtual address space of the client and the server. It also allocates an E-stack that is visible only to the server. What is the purpose of the E-stack?

By procedure calling convention, the server procedure expects the actual parameters to be in a stack in its address space. E-stack is provided for this purpose. The arguments placed in the A-stack by the client stub are copied into the E-stack by the server stub. Once this is done, the server procedure can execute as it would in a normal procedure call using the E-stack.

Choose the correct two choices pertaining to processor architecture and address spaces. A. Implementation of protection domains always requires switching the processor address space. B. The Translation Lookaside Buffer (TLB) is always flushed upon an address-space switch. C. Segmentation in hardware allows independent logical protection domains to share the same page table. D. Keeping very large protection domains in distinct address spaces is a reasonable choice since indirect costs (i.e., reloading the TLB and cache effects) dominate the overall penalty for border crossing into such large protection domains.

C. Segmentation in hardware allows independent logical protection domains to share the same page table. D. Keeping very large protection domains in distinct address spaces is a reasonable choice since indirect costs (i.e., reloading the TLB and cache effects) dominate the overall penalty for border crossing into such large protection domains.

False sharing refers to (choose the most appropriate selection): A. A memory location being currently in the cache of a processor B. A memory location being present simultaneously in the caches of multiple processors in a shared-memory parallel machine C. A memory location, which is private to a thread, appearing to be shared by multiple processors in a shared-memory parallel machine

C. A memory location, which is private to a thread, appearing to be shared by multiple processors in a shared-memory parallel machine

Identify the components of an RPC call that ARE NOT in the critical path of the latency (choose all that apply; -1 for each incorrect choice): A. Time on the wire B. Controller latency C. Switching client out at the point of call D. Interrupt service E. Switching server in on call arrival F. Switching server out on call completion G. Switching client in upon call completion

C. Switching client out at the point of call F. Switching server out on call completion

A process becomes runnable. Give two factors that are important to take into consideration in assigning it to one of the processors in a shared memory multiprocessor?

- Cache affinity: in particular, choose the processor on which the fewest "other processes" have been scheduled since the last time this process ran on it.
- Queue length: consider whether the processor already has "other" processes in its local queue ahead of this process.
- Ensure the working sets of all collocated processes, including the newly runnable one, fit in the last-level cache of a multicore processor.

In a multiprocessor, when a thread becomes runnable again after a blocking system call completes, conventional wisdom suggests running the thread on the last processor it ran on before the system call. What is the downside to this idea?

Cache pollution by other threads that ran on that processor since the last time this thread ran there.

Light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. A specific LRPC call has a total of 128 bytes of arguments to be passed to the server procedure, and 4 bytes of return value to be returned to the client. Assuming no programming-language-level support, what is the total copy overhead in bytes for the call-return with the LRPC system? Explain why.

Client → A-stack: 128 bytes
A-stack → server E-stack: 128 bytes
Server E-stack → A-stack: 4 bytes
A-stack → client: 4 bytes
Total = 264 bytes

Consider the following simple code for lock acquisition:

Lock:
    while (test-and-set(L) == locked) { /* do nothing */ }

Unlock:
    L = unlocked

In an invalidation-based cache-coherent multiprocessor, the performance problems caused by the above algorithm are: A. Increased network traffic upon lock release B. Increased latency for lock acquisition in the absence of any lock contention C. All processors spin on the same memory location in their respective caches D. All processors have to go to main memory every time through the loop

D. All processors have to go to main memory every time through the loop

What are the two largest components of potential overhead involved in border crossing from one protection domain to another? (Choose the two that apply; -1 for each incorrect choice) A. Saving and restoring the process context blocks pertaining to the protection domains B. Linked-list maintenance for the process context blocks C. Switching the page tables and associated processor registers D. TLB misses E. Cache misses

D. TLB misses E. Cache misses

Imagine a message-passing multiprocessor with point-to-point communication links between any two processors. Between the tournament and dissemination algorithms, which would you expect to do better for barrier synchronization? Why?

Dissemination would do better. In each round, the dissemination barrier has O(N) communications, but they can all go on in parallel, and the number of rounds is ceil(log2(N)). While all the communication can also proceed in parallel in the tournament barrier, that algorithm traverses the tree twice (arrival + wakeup), for a total of 2 * log2(N) rounds of communication.
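A minimal sketch of one processor's part in a dissemination barrier, in C; send() and recv() are placeholder names for whatever point-to-point primitives the message-passing layer provides:

    extern void send(int to);    /* assumed point-to-point primitives */
    extern void recv(int from);

    /* In round k, processor `me` signals (me + 2^k) mod N and waits to be
       signaled by (me - 2^k) mod N; after ceil(log2(N)) rounds, everyone
       has transitively heard from everyone else. */
    void dissemination_barrier(int me, int nprocs) {
        for (int dist = 1; dist < nprocs; dist <<= 1) {
            send((me + dist) % nprocs);           /* all sends in a round
                                                     proceed in parallel   */
            recv((me - dist + nprocs) % nprocs);  /* wait for my partner   */
        }
    }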

In the Intel architecture, the processor does the following on every memory access: Translates virtual page number to a physical page frame number using a page table in physical memory. Let's call this hardware page table (HPT). The processor also has a hardware TLB to cache recent address translations. Full virtualization is implemented on top of this processor. Each process created by a guest OS has its own page table. Let's call this physical page table (PPT). What is the relationship between PPT, SPT, and HPT?

Each process in a guest OS has its own PPT. The PPT contains the VPN-to-PPN mappings assigned by the guest OS. The hypervisor allocates the real machine pages that back the physical pages of each guest OS; the SPT contains the mapping between PPN and MPN. The HPT is what the processor uses to translate the VA of the currently running process to the real physical (machine) address.

Ticket lock, which simulates many real-life situations where we get a number and wait for our number to come up to get service in FCFS order, achieves fairness that is not present in a simple T&S-based algorithm. Equivalent to the real-life situation, a processor knows that its turn has come up in a cache-coherent multiprocessor via the check:

    while ((my_ticket - L->now_serving) > 0); // do nothing

Why does the ticket-lock algorithm not meet the goals of a scalable mutual-exclusion lock algorithm in a large-scale multiprocessor?

Every lock release results in an update to the shared variable L->now_serving, which in turn generates contention on the interconnection network, as all the waiting processors have to fetch the up-to-date value of L->now_serving.
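A minimal ticket-lock sketch in C11 atomics, illustrating why release is the hot spot:

    #include <stdatomic.h>

    struct ticket_lock { atomic_uint next_ticket, now_serving; };

    void lock(struct ticket_lock *l) {
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;               /* every waiter spins on the SAME location */
    }

    void unlock(struct ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);  /* one write that every spinning
                                                  processor must re-fetch: the
                                                  source of the contention     */
    }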

The hypervisor receives a packet arrival interrupt from the network interface card (NIC). How does it determine which virtual machine it has to deliver the interrupt?

Every packet from the network carries the MAC address of the (virtual) NIC for which it is destined, and the MAC addresses are uniquely associated with the VMs. Based on the destination MAC address in the packet header and the MAC-to-VM association, the packet-arrival interrupt is delivered to the right VM.

Give two plausible reasons (in the form of concise bullets) that could make system calls more expensive than simple procedure calls.

- Explicit cost of changing hardware address spaces.
- Implicit cost of the change in locality.

(Answer True/False with justification) The "region" concept in Tornado and the "address range" concept in Corey are one and the same.

False.
- Justification:
  - Region is invisible to the application; it is a structuring mechanism inside the kernel to increase concurrency in page-fault handling for a multithreaded application, since each region object manages a portion of the process address space.
  - Address range is a mechanism provided to applications by the kernel so that the threads of an application can selectively share parts of the process address space. This reduces contention for updating page tables and lets the kernel reduce its management work using hints from the application (e.g., reduced TLB-consistency overhead).

(Answer True/False with justification) In implementing a high-performance microkernel, one should keep the implementation of the microkernel abstractions to be architecture independent.

False. As Liedtke argues in his paper on the L3 microkernel, to achieve high performance the microkernel implementation should fully exploit the hardware features available in the processor architecture.

It is impossible to support sequential consistency memory model in a multiprocessor that has caches associated with each processor. (An answer without any justification gets zero credit.)

False. What is needed is a mechanism to get exclusive access to a memory location before writing to it in the local cache, and to ensure that the memory location is purged from the peer caches. For example, in an SMP, a hardware solution would be for the writing processor to acquire the bus and indicate to its peers that it is writing this memory location, so that the peers can take appropriate actions in their local caches (invalidate/update if that memory location exists in the cache).

(Answer True/False with justification)"SPIN's approach of extending logical protection domains and Exokernel's approach of downloading code into the kernel are one and the same."

False. Functionally, they accomplish the same thing, namely extend the kernel with additional functionality. However, in Exokernel once the code is downloaded into the kernel there is no protection against malicious or erroneous behavior of the downloaded code. In SPIN, since the extension via logical protection domains is achieved under the protection of the strong semantics of Modula-3, the other subsystems that live within the same hardware address space are protected from malicious or erroneous behavior of any given logical protection domain.

(Answer True/False with justification) Sequential consistency memory model makes sense only in a non-cache-coherent (NCC) shared memory multiprocessor.

False. Sequential consistency memory model is a contract between software and hardware. It is required for the programmer to reason about the correctness of the software. Cache coherence is only a mechanism for implementing the memory consistency model. It can be implemented in hardware or software.

The currently running OS switches from one process (say P1) to another process (P2). List the sequence of steps before P2 starts running on the processor.

- The guest OS executes the privileged instruction for changing the PTBR to point to the PT for P2. This results in a trap to the hypervisor.
- From the PPN of the PT for P2, the hypervisor knows the offset into the S-PT for that VM where the PT resides in machine memory.
- The hypervisor installs the MPN thus located as the PT for P2 into the PTBR.
- Once the other formalities associated with context switching are complete (which also need hypervisor intervention), such as saving the volatile state of P1 into its PCB and loading the volatile state of P2 from its PCB into the processor, P2 can start executing.

How does ballooning work?

The Guest OS installs a balloon driver. Upon a trigger from the hypervisor, the driver "inflates" (acquires memory from the Guest OS and gives the associated page frames to the hypervisor) or "deflates" (gets physical page frames from the hypervisor and releases them to the Guest OS).
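An illustrative guest-side sketch of the inflate path in C; every function name here is an assumption standing in for the guest allocator and the hypervisor interface:

    extern void *guest_alloc_page(void);
    extern void  pin_page(void *page);
    extern void  hypercall_give_page(void *page);

    /* Inflate: pressure the guest's own allocator, then hand the backing
       frames to the hypervisor so it can give them to another VM. */
    void balloon_inflate(unsigned npages) {
        for (unsigned i = 0; i < npages; i++) {
            void *page = guest_alloc_page();  /* may trigger guest page-out */
            pin_page(page);                   /* guest must not reclaim it  */
            hypercall_give_page(page);        /* frame is now free for the
                                                 hypervisor to reassign     */
        }
    }
    /* Deflate is the reverse: reclaim frames from the hypervisor and
       return the pages to the guest allocator. */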

How can the Tornado approach be applied to designing the memory management subsystem?

- HAT: processor-specific.
- Process: one representation per processor, shared by all the threads running there (mostly read-only).
- FCM: partitioned representation, region-specific.
- Region: partial replication, one representation per group of processors, since it is on the critical path of page faults (the granularity decides the concurrency of page-fault handling).
- COR: a true shared object with a single representation, since with good memory management (i.e., no thrashing) you don't expect too much VM-initiated file I/O.
- DRAM: several representations.

The MCS algorithm assumes the presence of an atomic instruction, compare-and-swap(a, b, c), whose semantics is as follows: if a == b, then set a = c and return 1; else do nothing and return 0. How does the MCS lock algorithm use this instruction to make sure that the right thing happens?

In MCS, at lock release the releasing processor T1 does the following:

    if (compare_and_swap(L, T1, nil) == 0) {
        /* CAS failed: a new requestor is joining the queue; spin until
           it links itself in by setting T1's next pointer */
        while (T1->next == nil);
        T1->next->got_it = 1;   /* signal the next processor it has the lock */
    }

If the compare-and-swap succeeds (L still pointed to T1), there are no waiters and the lock is simply set free.

In the Intel architecture, the processor does the following on every memory access: translates the virtual page number to a physical page frame number using a page table in physical memory. Let's call this the hardware page table (HPT). The processor also has a hardware TLB to cache recent address translations. Full virtualization is implemented on top of this processor. How is a page fault by a process handled with this setup?

- The hypervisor catches the page fault and passes the faulting VA to the currently executing guest OS.
- The guest OS services the page fault.
- The hypervisor traps the PT/TLB updates of the guest OS and directly enters the VPN-to-MPN mapping into the HPT.

A virtualized setting uses ballooning to cater to the dynamic memory needs of VMs. Imagine 4 VMs currently executing on top of the hypervisor. VM1 experiences memory pressure and requests the hypervisor for 100 MB of additional memory. The hypervisor has no machine memory available currently to satisfy the request. List the steps taken by the hypervisor in trying to give the requested memory to VM1 using ballooning.

- The hypervisor keeps information on the memory allocated to, and actively used by, each of the VMs.
- This allows the hypervisor to decide the amount of memory to take from each of the other VMs to meet VM1's request.
- It communicates the amount of memory to be acquired from each VM to the balloon driver in that VM.
- The balloon drivers go to work in the respective VMs and return the released machine pages to the hypervisor.
- The hypervisor gives the requested memory to the needy VM (VM1 in this case).

What is the purpose of "ballooning"?

Hypervisor mechanism for retrieving machine memory from a Guest VM to possibly allocate to another VM experiencing memory pressure.

Recall that in the distributed mutual exclusion algorithm of Lamport, every node acknowledges an incoming lock request message from a peer node by sending an ACK message. One suggested way to reduce the message complexity of the algorithm is to defer ACKs to lock requests. Using an example, show under what conditions a node can decide to defer sending an ACK to an incoming lock request.

If the node is holding the lock, it can defer sending the ACK. Likewise, if the incoming lock request's timestamp is larger than the timestamp of the node's own pending lock request, the node can defer sending the ACK.

(Tornado) Tornado suggests using "existence guarantees" with reference counts instead of hierarchical locking to avoid serialization. Using page-fault service as a concrete example, discuss how this helps with reducing the pitfalls of serialization of page-fault service in a multi-threaded process. (Concise bullets please)

Imagine two threads T1 and T2 of the same process executing in the multiprocessor, sharing the same representation of the "process" object. Both experience page faults simultaneously. Assume T1's page fault falls in Region 2 and T2's page fault falls in Region 1. The page-fault service will only update the respective region objects, so the "process" object need not be locked. But to ensure that some other subsystem (say, a load balancer) does not remove the "process" object, the page-fault service increments the ref-count on the "process" object on the forward path and decrements it on the reverse path (once the service is complete), thus avoiding serialization of the page-fault service for T1 and T2.
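A minimal sketch of the existence-guarantee idiom in C; the object layout and helper names are illustrative:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct kobj { atomic_int refs; /* ... object state ... */ };

    /* forward path: pin the object so it cannot "go away" (no lock is
       taken, so concurrent faults through the same object stay parallel) */
    struct kobj *kobj_get(struct kobj *o) {
        atomic_fetch_add(&o->refs, 1);
        return o;
    }

    /* reverse path: unpin; only the last reference destroys the object */
    void kobj_put(struct kobj *o) {
        if (atomic_fetch_sub(&o->refs, 1) == 1)
            free(o);
    }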

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure maintained by the hypervisor. Answer the following questions pertaining to S-PT. What does an entry in the S-PT contain?

In principle it is a mapping from PPN to MPN. However, since the S-PT is the "real" hardware page table used by the architecture for address translation (VPN->MPN), the hypervisor keeps a VPN->MPN mapping in each entry of the data structure.

How big is the shadow page table?

It is proportional to the number of processes currently running in that guest OS.

Given the attributes: (I) unfair; (II) fair but with contention on lock release; (III) fair and no contention on lock release. Associate the attribute (Roman numeral) with the correct spinlock algorithm below: MCS lock; T&S with delay; Ticket lock.

MCS lock (III) T&S with delay (I) Ticket lock (II)

How can we reduce the marshaling overhead in RPC at the point of the client call?

- Marshal directly into a kernel buffer, avoiding the extra copy in the client stub (requires having the marshaling code of the client-side stub in the kernel), OR
- Use a shared descriptor between the client stub and the kernel so that the kernel can create the RPC packet without knowing anything about the semantics of the RPC call.

In a shared memory multiprocessor in which the hardware is enforcing cache coherence, why is a "memory consistency model" necessary?

Memory consistency model serves as a contract between software and hardware to allow the programmer to reason about program behavior.

What is the main assertion of L3 microkernel?

Microkernel-based design of an OS need not be performance deficient. With the right abstractions in the microkernel and architecture-specific implementation of the abstractions, microkernel-based OS can be as performant as a monolithic kernel.

Give two reasons, in the form of concise bullets, explaining why SPIN's vision of OS extensibility purely via language-enforced protection checks may be impractical for building real operating systems.

- Modifying architectural features (e.g., hardware registers in the CPU, memory-mapped registers in device controllers) may necessitate that a service extension step outside its logical protection domain (i.e., the Modula-3 compiler-enforced access checks).
- A significant chunk (upwards of 50%) of the OS code base (e.g., device drivers) comes from third-party OEM vendors, and it is difficult if not impossible to force everyone to use a specific programming language.

Based on a reading of the SPIN and Exokernel papers, select the features (extensibility, protection, performance) best supported by each type of OS below: monolithic OS, DOS-like OS, microkernel OS.

- Monolithic OS: protection and performance
- DOS-like OS: extensibility and performance
- Microkernel OS: extensibility and protection

A hardware cache-coherent multiprocessor architecture has system-wide atomic load/store instructions (for loading into/storing from registers from/to memory) and arithmetic/logic instructions that work on register operands. An OS designer implements a simple mutual-exclusion lock algorithm as follows:

Lock(L):
    L1: if (L == 0)          // lock not in use
            L = 1;           // got lock
        else {
            while (L == 1);  // do nothing
            goto L1;
        }

Unlock(L):
    L = 0;

Will this work? If not, why not?

No. The code block "if (L == 0) L = 1" is not atomic, since it compiles into multiple machine instructions. Multiple processes can therefore interleave between the test and the set, and each assume it has the lock.
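If the hardware also provided an atomic read-modify-write primitive such as test-and-set (which the architecture in the question deliberately lacks), the lock could be fixed. A minimal sketch using C11's atomic_flag, which provides exactly that primitive:

#include <stdatomic.h>

atomic_flag L = ATOMIC_FLAG_INIT;

void lock(void) {
    /* atomically reads the old value and sets the flag, so two
       threads can never both observe "unlocked" */
    while (atomic_flag_test_and_set(&L))
        ;   /* spin */
}

void unlock(void) {
    atomic_flag_clear(&L);
}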

Are the terms "protection domain" and "hardware address space" synonymous? If not, explain the difference between the two.

No. A protection domain is a logical concept that signifies the reach of a particular subsystem. For example, each specific service of an OS, such as the memory manager, the file system, the network protocol stack, and the scheduler, may live in its own protection domain and interact with the others through well-defined interfaces implemented using an IPC mechanism (a la the L3 microkernel). A hardware address space is a physical concept: it pertains to the ability of the hardware to restrict the memory reach of the currently executing process. The OS can use this hardware mechanism to implement logical protection domains. A specific implementation of an OS may choose to pack multiple protection domains into the same hardware address space; another may choose to associate a distinct hardware address space with each protection domain.

On a Non-Cache-Coherent (NCC) NUMA machine with per-processor caches, would you advocate using the following lock algorithm? Explain why or why not.

LOCK(L):
back: while (L == LOCKED);                      // spin
      if (Test_and_Set(L) == LOCKED) goto back;

UNLOCK(L):
      L = UNLOCKED;

No. Since there is no cache coherence, a lock release (a write to L in memory) will never be seen by the waiting processors, which keep spinning on the stale cached copy of L.

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure maintained by the hypervisor. Answer the following questions pertaining to S-PT. How many S-PT data structures are there?

One per guest OS currently running.

For the above multiprocessor, we are implementing a DRAM object to manage physical memory. Give the design considerations for this object. Discuss with concise bullets the representation you would suggest for such an object from each core (singleton, partitioned, fully replicated). [Refers to the Tornado multiprocessor figure.]

One representation of the DRAM object for each NUMA node (i.e., shared by all 4 cores of that node). Each representation manages the DRAM at that node for memory allocation and reclamation. Core- and thread-sensitive allocation of physical frames wards off false sharing among threads running on different cores of the NUMA node.

Prior to the optimization done by LRPC, list the number of and the need for the copies involved in a client-server RPC call when the client and server are on the same machine. How does LRPC reduce the number of copies, and what are the copies after the optimization?

Originally: 4 copies each way (client args to server; server results to client).
After the LRPC optimization: 1 copy of the client args to the server, and 1 copy of the server result back to the client.

How can we reduce the context switch overhead in RPC?

Out of the potential 4 context switches (2 on the client side and 2 on the server side), only 2 are in the critical path:
- switching to the server when the remote call comes in
- switching back to the client when the result comes back
The other two context switches (to a different process on the client side upon an RPC call by a client; to a different process on the server side once the RPC call has been serviced) can be overlapped with communication.

The MCS barrier uses a 4-ary arrival tree where the nodes are the processors and the links point to children. What will the arrival tree look like for 16 processors labeled P0-P15? Use P0 as the root of the tree. What is the reason for such a construction of the arrival tree?

P0: [P1, P2, P3, P4]
P1: [P5, P6, P7, P8]
P2: [P9, P10, P11, P12]
P3: [P13, P14, P15, X]
P4 through P15: [] (leaves)

- Unique and static location for each processor to signal barrier completion.
- Spinning on a statically allocated local word-length variable, packing the data for four processors into one word, reduces bus contention.
- The 4-ary tree construction showed the best performance on the Sequent Symmetry used in the experimentation in the MCS paper.
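The tree above follows directly from array indexing: processor Pi's children are P(4i+1) through P(4i+4), when those exist. A small illustrative program that regenerates the table:

#include <stdio.h>

int main(void) {
    const int N = 16;                 /* P0 .. P15 */
    for (int i = 0; i < N; i++) {
        printf("P%d:", i);
        for (int c = 4 * i + 1; c <= 4 * i + 4 && c < N; c++)
            printf(" P%d", c);        /* children of Pi, if within range */
        printf("\n");
    }
    return 0;
}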

This question pertains to supporting multiple library OS images on top of Exokernel. What is the role performed by the Processor Environment?

The PE is a data structure, one per library OS executing on top of Exokernel, that contains four contexts:
1. The addressing context identifies the software TLB for that library OS.
2. The interrupt context, exception context, and protected entry context respectively identify the handler entry points registered with Exokernel for dealing with those three kinds of program discontinuities during the execution of the library OS.

Explain the role of the "processor environments" in the Aegis Exokernel. How is it used to achieve the functionalities provided by a library OS?

The PE is a per-library-OS data structure in Exokernel for passing events to the library OS:
* exception context for program-generated exceptions
* interrupt context for Exokernel events
* protected entry context for cross-domain calls
* addressing context for keeping the guaranteed mappings implemented by the software TLB
These are library-specific pieces of information maintained by Exokernel. For example, a process page fault will be communicated via the exception context to the page fault handler in the library OS.

From the class discussions and the papers we have read on processor scheduling for shared memory multiprocessors, what are important things to keep in mind to ensure good performance?

- Pay attention to cache affinity for a candidate process when selecting a processor on which to resume it.
- Pay attention to the number of intervening processes that executed on a processor since the last time the candidate process ran on it.
- Pay attention to the size of the scheduling queue on a processor before deciding to place the candidate process on that processor's queue.
- In a multicore processor, misses in the LLC are expensive. Therefore, pay attention to the working set size of a thread in deciding the mix of threads scheduled to run simultaneously on the multiple cores. The goal should be to ensure that the cumulative working set sizes of the co-scheduled threads fit in the LLC, minimizing misses and trips off-chip.

What is the distinction between "physical page" and "machine page" in a virtualized setting?

A physical page is the illusory view of physical memory from the guest OS: physical pages are deemed contiguous from the guest OS's point of view. A machine page is the view of physical memory from the hypervisor; it refers to the REAL hardware physical memory. The actual "physical memory" given to a specific guest OS maps to some portion of the real machine memory.

A processor has 2 cores, and each core is 4-way multithreaded. The last level cache of the processor is 32 MB. The OS has the following pool of ready-to-run threads:
Pool 1: 8 threads, each having a working set of 1 MB (medium priority)
Pool 2: 3 threads, each having a working set of 4 MB (highest priority)
Pool 3: 4 threads, each having a working set of 8 MB (medium priority)
The OS can choose any subset of the threads from the above pools to schedule on the cores. Which threads should be scheduled to make full use of the available parallelism and the cache capacity while respecting the thread priority levels?

Pool 2: 3 threads (12 MB) - highest priority, so all of them
Pool 1: 3 threads (3 MB)
Pool 3: 2 threads (16 MB)
That is 8 threads, matching the 2 cores x 4-way multithreading, with a total cache use of 31 MB out of 32 MB.

Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list. The MCS algorithm uses an atomic fetch-and-store(X,Y) primitive in its implementation. The primitive returns the old value of X, and stores Y in X. Assume the current state of a lock is as shown below:

Current state:
  L -> curr (running, holds the lock)

Intermediate state:
  curr (running)    T2 (queue node allocated, not yet linked)
  L -------------------------------------------------> T1

What sequence of subsequent actions will ensure the correct formation of the waiting queue of lock requestors behind the current lock holder?

Possibility 1 (T2 did its fetch-and-store before T1):
- T2 sets the next pointer in "curr" to point to T2.
- T1 sets the next pointer in "T2" to point to T1.
Possibility 2 (T2 has not yet done its fetch-and-store):
- T1 sets the next pointer in "curr" to point to T1.
- T2 then does a fetch-and-store on L; this results in two things:
  - T2 gets its predecessor T1 and sets the next pointer in "T1" to point to T2.
  - L now points to T2.
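For reference, a sketch of the MCS enqueue step using C11 atomic_exchange as the fetch-and-store primitive (field and function names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool             locked;
};

void mcs_acquire(_Atomic(struct qnode *) *L, struct qnode *me) {
    atomic_store(&me->next, NULL);
    /* fetch-and-store: atomically make me the tail, learn my predecessor */
    struct qnode *pred = atomic_exchange(L, me);
    if (pred != NULL) {
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me);       /* link in behind the predecessor */
        while (atomic_load(&me->locked))
            ;   /* spin locally until the predecessor hands over the lock */
    }
}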

Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list. The MCS algorithm uses an atomic fetch-and-store(X,Y) primitive in its implementation. The primitive returns the old value of X, and stores Y in X. Assume the current state of a lock is as shown below:

Current state:
  L -> curr (running, holds the lock)

Intermediate state:
  curr (running)    T2 (queue node allocated, not yet linked)
  L -------------------------------------------------> T1

What does each of T1 and T2 know about the state of the data structures at this point in time?

Possibility 1: T2 knows its predecessor is "curr"; T1 knows its predecessor is "T2".
Possibility 2: T2 does not know anything about the queue associated with L; T1 knows its predecessor is "curr".

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure maintained by the hypervisor. Answer the following questions pertaining to S-PT. How big is the S-PT data structure?

Proportional to the number of processes in that guest OS.

What is the difference between "protection domain" and "hardware address space"?

A protection domain is a logical concept that signifies the reach of a particular subsystem. For example, each specific service of an OS, such as the memory manager, the file system, the network protocol stack, and the scheduler, may live in its own protection domain and interact with the others through well-defined interfaces implemented using an IPC mechanism (a la the L3 microkernel). A hardware address space is a physical concept and pertains to the ability of the hardware to restrict the memory reach of the currently executing process. The OS can use this hardware mechanism to implement logical protection domains. A specific implementation of an OS may choose to pack multiple protection domains into the same hardware address space; another may choose to associate a distinct hardware address space with each protection domain.

Consider a byte-addressable 32-bit architecture. The virtual address space is 2^32 bytes. The TLB supports entries tagged with an address space id. We are building a microkernel-based design of an operating system for this architecture, in which every subsystem will be implemented as a process above the kernel. With four succinct bullets, suggest the design decisions you would take to make sure that the performance will be as good as that of a monolithic OS.

- Provide mechanisms in the kernel for generating unique IDs.
- Give a unique address space ID to each subsystem so that the AS-tagged TLB can be exploited; there is then no need to flush the TLB when moving from one subsystem to another.
- Use kernel threads and "hand-off" scheduling between subsystems (a la the "doctoring" of threads discussed in the LRPC paper) to implement protected procedure calls between the subsystems.
- Provide efficient memory sharing mechanisms across subsystems (a la L3's map/grant/flush mechanisms) so that copying overheads during protected procedure calls are avoided.
- Provide a low-overhead mechanism in the kernel for catching program discontinuities (e.g., external interrupts) and packaging and delivering them like a protected procedure call to the subsystem that has to deal with the discontinuity.

Consider an architecture that has address-space tagged TLB. The architecture has a flat 32-bit address space and has no support for hardware segmentation. You are implementing an OS using the L3 approach. The OS consists of a number of subsystems that can be implemented as small protection domains. How would you implement the small protection domains?

Put all the small protection domains in the same hardware address space. Assign a unique address space ID to each small protection domain to utilize the AS-tagging supported by the architecture.

Explain sequential consistency

- Respect the program order of memory accesses from each processor.
- Allow arbitrary interleaving of the memory accesses from different processors, so long as each processor's program order is respected.

Explain how SPIN makes OS service extensions as cheap as a procedure call.

SPIN implements each service extension as a Modula-3 object: interactions to and from this object are compile-time checked and runtime verified. Thus each object is a logical protection domain. SPIN and the service extensions are co-located in the same hardware address space. This means that OS service extensions do not require border crossings and can be as cheap as a procedure call.

Explain how content-based physical page sharing works across VMs in VMWare's ESX server

- Scan a candidate PPN and generate a content hash.
- If the hash matches an entry in the hash table, perform a full comparison of the scanned page and the matched page.
- If the two pages are identical, mark both PPNs to point to the same MPN, thus freeing up a machine page; mark both PTEs (PPN -> MPN entries) in the respective shadow page tables as "copy on write".
- If there is no match for the scanned page, create a new "hint frame" entry in the hash table.
- The scanning is done in the background during idle processor cycles.
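A sketch of the scan loop (the hash function, hash table, and sharing helpers are hypothetical placeholders, not ESX's actual code):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct frame;                                      /* a tracked machine page */
uint64_t hash_page(const void *data);              /* content hash, assumed */
struct frame *lookup(uint64_t h);                  /* hash-table lookup, assumed */
void insert_hint(uint64_t h, struct frame *f);     /* record a hint frame */
const void *frame_data(const struct frame *f);     /* page contents, assumed */
void share_cow(struct frame *a, struct frame *b);  /* point both PPNs at one MPN, mark COW */

void scan_candidate(struct frame *cand) {
    uint64_t h = hash_page(frame_data(cand));
    struct frame *match = lookup(h);
    if (match && memcmp(frame_data(cand), frame_data(match), PAGE_SIZE) == 0)
        share_cow(cand, match);   /* identical contents: one machine page freed */
    else
        insert_hint(h, cand);     /* no match yet: remember for future scans */
}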

In the Intel architecture, the processor does the following on every memory access: it translates the virtual page number to a physical page frame number using a page table in physical memory. Let's call this the hardware page table (HPT). The processor also has a hardware TLB to cache recent address translations. Full virtualization is implemented on top of this processor. What is the shadow page table (SPT), and who maintains it?

The shadow page table gives the mapping between physical pages (which are deemed contiguous in physical memory so far as the guest OS is concerned) and the machine pages (i.e., the real physical memory in DRAM, which is under the control of the hypervisor). It is maintained by the hypervisor (or virtual machine monitor).

Give the differences and similarities between the mechanisms in Exokernel for extensibility and those in Xen for virtualization.

Similarities:
- Both Exokernel and Xen catch program discontinuities and pass them up to the OS above.
- Both have a way of identifying the OS that is currently executing so that they can correctly upcall the OS that has to deal with the program discontinuity.
Differences:
- Exokernel allows a library OS to download arbitrary code into the kernel, while Xen has a structured way of communicating with the library OSes via the I/O ring data structure.
- Exokernel does hardware resource allocation to the library OSes at a very fine granularity (e.g., an individual TLB entry), while Xen does it at a much coarser granularity (e.g., batching page table updates).
- Exokernel's default CPU scheduling uses a linear vector of "time slots", while Xen uses proportional share and fair share schedulers and accurately accounts for time even when a program discontinuity occurs on behalf of an OS that is not currently scheduled on the CPU.

Give the differences and similarities between the goals of extensibility (a la SPIN/Exokernel/L3) and the goals of virtualization (a la Xen/VMware ESX server).

Similarities:
* Customization of system services commensurate with application needs.
* Allowing policies for resource allocation to be dictated by the customized system service, by providing a thin layer above the hardware with just the mechanisms.
Differences:
* The customization granularity is an entire OS in the case of virtualization, while it is any desired granularity (e.g., individual subsystems of the OS) in the case of SPIN/Exokernel/L3.
* Striving to avoid penalties for border crossings between a system service and the kernel is paramount in Exokernel/SPIN/L3, while in virtualization the isolation/integrity/independence of the address spaces of the hypervisor and the OSes above it is of paramount importance.

Consider an operating system for a shared memory multiprocessor. The operating system is structured to execute independently on each processor to handle system calls and page faults occurring for processes and threads executing on that processor. Each process executes within an address space. Multiple threads of the same process execute within the address space of the process. The operating system uses a single page table in shared memory for each process. Explain the performance problem with this approach if the application process is multithreaded.

Simultaneous page faults from different threads to different parts of the process address space cannot be serviced concurrently by the OS, despite the availability of hardware parallelism, because the single page table shared by all the threads must be locked during fault handling.

Why is the assertion of L3 at odds with the premise of SPIN/Exokernel?

SPIN and Exokernel used Mach as the exemplar of a microkernel-based OS structure; Mach's design focus was on portability. L3, on the other hand, argues that the primary goal of a microkernel should be performance, not portability.

A process is currently executing on the processor. The process makes a system call. List the steps involved in getting this system call serviced by the operating system that this process belongs to. You should be precise in mentioning the data structures in Exokernel that make this service possible.

The system call traps to Exokernel. Exokernel identifies the library OS responsible for handling this system call (using the time slice vector). Exokernel uses the PE (Processor Environment) data structure associated with this library OS to get the "protected entry context", which holds the entry point registered with Exokernel by the library OS for system calls. Exokernel "upcalls" the library OS using this entry point, and the library OS services the system call.

Let's say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a "new" request for the lock is just starting to appear. What could happen given the above state?

T1 could think there is no one else waiting for the lock and set L to nil, resulting in a live lock: the "new" process will spin forever waiting for the lock.

Consider an operating system for a shared memory multiprocessor. The operating system is structured to execute independently on each processor to handle system calls and page faults occurring for processes and threads executing on that processor. Each process executes within an address space. Multiple threads of the same process execute within the address space of the process. The operating system uses a single page table in shared memory for each process. Consider the following situation in a multiprocessor. The history of execution on Processor P1 is T1, T2, T1, T3, T2 The history of execution on Processor P2 is T3, T2, T4, T3 The history of execution on Processor P3 is T1, T2, T3, T4, T5 In the above history, time flows from left to right. That is T2 is the most recent thread to execute on P1; T3 is the most recent thread to execute on P2; and T5 is the most recent thread to execute on P3. Thread T1 is ready to run again and can be scheduled on any of the above three processors. Which processor would you recommend to use for scheduling T1 and why?

T1 never ran on P2, so P2's cache has no context for T1. Four other threads have run on P3 since T1 last ran on it, but only two (T3 and T2) have run on P1 since T1 last ran there. Therefore I would choose P1 to run T1, since P1's cache is likely the least polluted with respect to T1.

Let's say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a "new" request for the lock is just starting to appear. What should happen? (assume no other lock requestors appear on the scene)

T1 should recognize that there is a new request forming for the lock, and wait for its next pointer to be set to point to "new", so that it can signal "new" upon releasing the lock.

Tornado suggests using "existence guarantees" with reference counts instead of hierarchical locking to avoid serialization. [Refers to the Tornado clustered-object figure.] Using the "process" object as a concrete example, identify situations where existence guarantees may not be sufficient and you may have to lock an object.

The "process" object is the equivalent of a context block. Every representation of this object is shared by multiple threads that are to be scheduled on the cores of a single processor or multiple processors (depending on the details of the architecture).The "process" data structure has information pertaining to the currently executing threads. If a thread is context switched out, then the scheduler subsystem would need to save the volatile state of the thread, and re-adjust the ready-queue, etc. All of these actions have to be performed atomically which would require locking the process object.

In LRPC, the "client thread is doctored by the kernel to start executing the server code in the server domain." Explain.

- The Binding Object (BO) presented by the client allows the kernel to identify the PDL (Procedure Descriptor List) associated with this call.
- From the PDL, the kernel extracts the location of the E-stack to be used for this call and sets the SP to point to the E-stack.
- The kernel sets the address space pointer in the processor to point to the server's address space.
- From the PDL, the kernel extracts the entry point address (in the server) for this call and sets the PC to point to it.
- The "doctoring" is now complete: the original client thread starts executing in the server address space at the entry point for this call, using the E-stack as its stack.

(Explain why the following statement is true) (MCS) "Dissemination barrier is bound to do poorly on a bus-based shared memory multiprocessor compared to the MCS tree barrier algorithm."

The algorithm has O(N) communication in each round. The communications in each round are independent, and so far as the algorithm is concerned they can all go in parallel. In a bus-based shared memory multiprocessor, however, these O(N) communication events will get sequentialized. The total communication in the dissemination barrier is O(N log2 N), whereas it is O(N) for the MCS barrier. Thus the dissemination barrier is more adversely affected by the serialization of bus requests.
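For reference, a sketch of one processor's role in a dissemination barrier (assumes P is a power of two; sense reversal and reuse across barrier episodes are omitted for brevity):

#include <stdatomic.h>

#define P 8                    /* number of processors, assumed power of two */
atomic_int flag[P][P];         /* flag[to][from]: arrival signal, simplified layout */

void dissemination_barrier(int me) {
    for (int dist = 1; dist < P; dist <<= 1) {   /* ceil(log2 P) rounds */
        int to   = (me + dist) % P;
        int from = (me - dist + P) % P;
        atomic_store(&flag[to][me], 1);          /* signal my round-k partner */
        while (atomic_load(&flag[me][from]) == 0)
            ;                                    /* wait for my round-k peer */
        atomic_store(&flag[me][from], 0);        /* reset (simplified) */
    }
}

Each of the P processors performs one signal per round; those P signals are the O(N) per-round communication referred to above.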

Recall that in a para virtualized setting, the guest operating system manages its own allocated physical memory. (Note: you don't have to worry about being EXACT in terms of Xen's APIs in answering this question; we are only looking for your general understanding of how para virtualization works) A new process starts up in a para-virtualized library OS executing on top of the Xen hypervisor. List the interaction between the library OS and Xen to establish a distinct protection domain for the process to execute in.

The distinct protection domain mentioned in the question refers to the page table for the new process. The steps are as follows (see the sketch below):
● The library OS (Linux) allocates memory from its own reserve for a new page table.
● The library OS registers this memory with the hypervisor (Xen) as a page table by using a hypercall.
● The library OS makes updates to this page table (virtual page to physical page mappings) through the hypervisor, via batches of updates in a single hypercall.
● The library OS changes the active page table via the hypervisor, thus in effect "scheduling" the new process to run on the processor.
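In pseudo-code, the interaction might look like this (all function names are illustrative placeholders, not Xen's real hypercall API):

struct mapping { unsigned long vpn, ppn; };

extern void *alloc_from_guest_reserve(void);                         /* guest's own memory */
extern void hyp_register_page_table(void *pt);                       /* hypothetical hypercall */
extern void hyp_mmu_update_batch(void *pt, struct mapping *m, int n);
extern void hyp_switch_page_table(void *pt);

void create_process_pt(struct mapping *m, int n) {
    void *pt = alloc_from_guest_reserve();   /* step 1: allocate from own reserve */
    hyp_register_page_table(pt);             /* step 2: register the frame as a PT */
    hyp_mmu_update_batch(pt, m, n);          /* step 3: batched VPN->PPN updates */
    hyp_switch_page_table(pt);               /* step 4: activate, "scheduling" the process */
}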

Consider an architecture that gives segmentation support in hardware. How will you implement small protection domains in such an architecture?

The entire hardware virtual address space is carved into distinct non-overlapping regions, and each protection domain is assigned to one of these regions. The hardware segment register is loaded with the start address and length of a logical protection domain at the point of entry into that protection domain by the operating system.

A process in a guest OS makes a request to read a file from the disk. Using figures, explain the steps taken by the guest OS and the Xen hypervisor using I/O rings to fulfill the guest OS's request.

The file system of the guest OS translates the file name to a disk block address and passes it to the disk subsystem of the guest OS. The disk subsystem shares an I/O ring data structure with Xen. The guest OS (i.e., the disk subsystem) uses an available slot in the I/O ring to enqueue the disk read request, embedding a physical page frame pointer in the descriptor. The hypervisor performs the disk I/O using DMA, transferring the disk block directly into this physical page frame. The hypervisor then enqueues a response in an available slot of the I/O ring data structure.
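An illustrative layout for one request descriptor in the shared ring (field names are hypothetical, not Xen's exact ABI):

#include <stdint.h>

struct blk_request {
    uint16_t op;          /* e.g., READ */
    uint16_t id;          /* matches the eventual response to this request */
    uint64_t disk_block;  /* block address computed by the guest file system */
    uint64_t frame;       /* guest physical frame: the DMA target, no copying */
};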

The hypervisor gets an interrupt from the disk. How does it determine which virtual machine it has to deliver the interrupt?

The interrupt is a result of a request that originated from some specific VM. The hypervisor tags each request dispatched to the disk controller with the id of the requesting VM. Using this information, the interrupt is delivered to the appropriate VM.

Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host, without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client's address space into the kernel address space, and from the kernel address space to the server's address space, before executing the server procedure. What are the reasons for the two additional copies (one in the client's address space, and one in the server's address space)? (Concise bullets please)

The kernel has no knowledge of the semantics of the call. Therefore:
- Before going to the kernel, the client side has to serialize the arguments of the call into a contiguous sequence of bytes in the client's address space, which the kernel then copies into its kernel buffers.
- Similarly, on the server side, the contiguous sequence of bytes received from the kernel has to be unmarshaled in the server's address space into the actual arguments of the call as expected by the server procedure.

How many shadow page tables are there?

There is one S-PT per guest OS currently running.

Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list. The MCS algorithm uses an atomic fetch-and-store(X,Y) primitive in its implementation. The primitive returns the old value of X, and stores Y in X. Assume the current state of a lock is as shown below:

Current state:
  L -> curr (running, holds the lock)

Assume two threads T1 and T2 make a lock request simultaneously for the same lock. What sequence of actions would have brought the data structures to the intermediate state shown below from the current state?

Intermediate state:
  curr (running)    T2 (queue node allocated, not yet linked)
  L -------------------------------------------------> T1

Though T1 and T2 make their lock requests simultaneously, their attempts at queuing themselves behind the current lock holder (curr) get sequentialized through the atomic fetch-and-store operation. In the picture above, T1 has definitely done a fetch-and-store, so the lock is pointing to it as the last lock requestor. As for T2, the thread has allocated its queue node data structure, but there are two possibilities with respect to where it will be in the lock queue:
- (Possibility 1) T2 may have done a fetch-and-store prior to T1.
- (Possibility 2) T2 is yet to do its fetch-and-store.

What is the thesis of the Tornado approach to designing OS for shared memory multiprocessors?

To ensure scalability it is important to reduce the number of shared data structures in the kernel for which mutually exclusive access is necessary.

Imagine a 100-processor cache coherent multiprocessor. The word size of the processor is 32 bits, which is the unit of memory access by the processor's instruction set. The cache block size is 16 bytes. We would like to implement Anderson's queue-based lock algorithm. Recall that in this algorithm, each processor marks its unique spin position in a "flags" array (associated with each lock) so that it can locally spin on that variable without disrupting the work of other processors until its turn comes to acquire the lock. How much space is needed for each flags array associated with a lock?

To ensure there is no false sharing, we need to allocate each spin variable in a distinct cache line. Space for one spin variable = cache block size = 16 bytes. We will assume that the maximum number of threads in an application cannot exceed the number of processors, so in the worst case we need 100 distinct spin locations in the flags array. Total space needed in the flags array for each lock variable = 100 * 16 = 1600 bytes.
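A sketch of the layout, using C11 alignas to pad each flag to its own 16-byte cache block (names are illustrative):

#include <stdalign.h>

#define NPROCS     100
#define CACHE_LINE 16

struct spin_slot {
    alignas(CACHE_LINE) volatile int must_wait;  /* one word used; rest is padding */
};

/* sizeof(struct spin_slot) == 16, so the array occupies 100 * 16 = 1600 bytes */
struct spin_slot flags[NPROCS];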

In an Intel-like architecture, the CPU has to do address translation on every memory access by accessing the TLB and possibly the page table. In a non-virtualized set up, the page table is set up by the OS and is used by the hardware for address translation through the page table base register (PTBR) set by the OS upon scheduling a process. Recall that in a fully virtualized setting, the page table for a process is an internal data structure maintained by the library OS. When the library OS schedules a process to run, how does the CPU learn where the page table is for doing address translation on every memory access?

To schedule a process, the library OS will do the following:
- The library OS has a distinct PT data structure for each process.
- Dispatching this process to run on the processor involves setting the PTBR (a privileged operation).
- The library OS will try to execute this privileged operation.
- This results in a trap into the hypervisor.
- The hypervisor "emulates" this action by setting the PTBR.
Henceforth, the CPU will implicitly use the memory area pointed to by the PTBR as the page table.

What is the need for a shadow page table? (One or two brief sentences please)

To support the guest OS's illusion of a contiguous physical address space. Also, since the guest OS does not have access to the hardware page table, keeping the mapping from guest VPNs to MPNs in the hypervisor lets the address translation happen at hardware speed.

(Answer True/False with justification) Scalable implementation of page fault service for multiprocess workload is easier to achieve in a parallel OS than for a multithreaded workload.

True. Multiprocess workload => page table distinct for each process; no serialization of page fault handling even if there are concurrent page faults on different processors.

(Answer True/False with justification) In a large-scale invalidation-based cache coherent shared memory multiprocessor with rich connectivity among the nodes (i.e., the interconnection network is not a shared bus), the tournament barrier is likely to perform better than the MCS barrier. To jog your memory, the MCS algorithm uses a 4-ary arrival tree and a binary wakeup tree; the tournament barrier uses a binary tree for both arrival and wakeup.

True. The tournament algorithm can exploit the rich connectivity for peer-to-peer signaling among the processors in each round (going up and down the tree).

Consider a 64-bit paged-segmented architecture. The virtual address space is 2^64 bytes. The TLB does not support address space ids, so on an address space switch the TLB has to be flushed. The architecture has two segment registers: LB (lower bound for a segment) and UB (upper bound for a segment). Explain with succinct bullets what happens upon a call from one protection domain to another.

The UB and LB hardware registers are changed to correspond to the called protection domain. A context switch is made to transfer control to the entry point in the called protection domain. The architecture ensures that virtual addresses generated by the called domain are within the bounds of legal addresses for that domain. There is no need to flush the TLB on a context switch from one protection domain to another.

The guest OS is running multiple processes within it. The guest OS itself appears as a "process" so far as the hypervisor is concerned. How is each of the processes inside the guest OS guaranteed memory isolation/integrity/independence from one another?

- Upon process creation, the guest OS assigns a distinct PT data structure to the newly created process.
- As part of creating a memory footprint for the process, the guest OS creates VPN to PPN mappings in the PT by executing privileged operations.
- These privileged operations result in traps to the hypervisor, which emulates them on behalf of the guest OS (both populating the PT data structure for the newly created process in the guest OS and the S-PT data structure in the hypervisor).
- The distinct PT data structure for EACH process within the guest OS thus gives the isolation/integrity/independence guarantees for the processes from one another.

Each core in the figure below is 4-way hardware multithreaded. The workload managed by the OS consists of multiple multithreaded processes. You are designing a "thread scheduler" as a clustered object. Give the design considerations for this object. Discuss with concise bullets the representation you would suggest for such an object from each core (singleton, partitioned, fully replicated). [Refers to the Tornado multicore figure.]

- Use one representation of the "thread scheduler" object for each processor core.
- Each representation has its own local queue of threads. There is no need to lock the queue for access, since each representation is unique to a core.
- Each local queue is populated with at most 8 threads, since each core is 4-way hardware multithreaded (this is just a heuristic to balance interactive threads with compute-bound threads). The threads in a local queue could be a mix of threads from single-threaded and multi-threaded processes.
- Each local queue is populated with threads of the same process (up to a max of 4) when possible. If a process has fewer than 16 threads, the threads are placed in the local queues of the 4 cores of a single NUMA node. If a process has more than 16 threads, the threads are split across multiple local queues so that the threads of the same process can be co-scheduled on the different nodes (often referred to as gang scheduling).
- Implement an entry point in the "thread scheduler" object for peer representations of the same object to call each other for work stealing.

Enumerate the myths about a microkernel-based approach that are debunked by Liedtke in the L3 microkernel.

- User/kernel border crossings are expensive.
- Address space switches are expensive.
- Thread switches and IPC are expensive.
- Locality loss (memory/cache effects) due to context switches when protection domains are implemented as server processes.

What does the shadow page table of ESX server hold and why?

The VPN to MPN mapping for a VM; there is one shadow page table for each VM on top of the ESX server. The shadow page table removes one level of indirection:

  VPN -> PPN -> MPN
   |_____________^
     S-PT (direct VPN -> MPN)

Recall that in a para virtualized setting, the guest operating system manages its own allocated physical memory. (Note: you don't have to worry about being EXACT in terms of Xen's APIs in answering this question; we are only looking for your general understanding of how para virtualization works) The library OS is "up called" by Xen to deal with a page fault incurred by its currently running process. How does the library OS make the process runnable again?

When a page fault occurs, the hypervisor catches it and asynchronously invokes the corresponding registered handler. The steps are as follows.
Inside the hypervisor:
● Xen detects the address which caused the page fault; for example, the faulting virtual address may be in a specific architectural register.
● This register value is copied into a suitable space in the shared data structure between the hypervisor and the library OS. Then the hypervisor does the upcall, which activates the registered page fault handler.
Inside the library OS page fault handler:
● The library OS page fault handler allocates a physical page frame (from its pool of free, i.e., unused pre-allocated physical memory kept in a free list).
● If necessary, the library OS may run a page replacement algorithm to free up some page frames if its pool of free memory falls below a threshold.
● If necessary, the contents of the faulting page are paged in from the disk.
● Note that paging in the faulting page from the disk necessitates additional interactions with the hypervisor to schedule an I/O request using the appropriate I/O rings.
● Once the faulting page I/O is complete, the library OS establishes the mapping in the page table for the process by making a hypervisor call.

Assume DMA without scatter/gather. To send a message out at least one copy is needed in the kernel or user space before starting the DMA. Explain why.

Without scatter/gather, the DMA engine needs the entire message to be in one contiguous memory buffer. Thus, either the RPC stub or the kernel has to take the parameters of an RPC call (or its return values) and write them into a contiguous memory buffer before starting the DMA (see the sketch below).
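A sketch of that gathering copy (the DMA kick-off is a hypothetical placeholder):

#include <string.h>

void gather_and_send(char *dma_buf,
                     const void *hdr,  size_t hdr_len,
                     const void *args, size_t args_len) {
    memcpy(dma_buf, hdr, hdr_len);                 /* the unavoidable copy ...   */
    memcpy(dma_buf + hdr_len, args, args_len);     /* ... into one contiguous buffer */
    /* start_dma(dma_buf, hdr_len + args_len);        hypothetical DMA start */
}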

Consider an operating system for a shared memory multiprocessor. The operating system is structured to execute independently on each processor to handle system calls and page faults occurring for processes and threads executing on that processor. Each process executes within an address space. Multiple threads of the same process execute within the address space of the process. The operating system uses a single page table in shared memory for each process. Explain false sharing using a simple example

A write to x by T1 does not affect a write to y by T2; yet since x and y are in the same cache line, there will be the overhead of cache coherence actions between the two processors' caches. This is false sharing.

A processor architecture provides segmented-paged virtual memory (similar to x86). The TLB contains virtual to physical address translations and is divided into two parts: USER and KERNEL. (Answer True/False with justification) It is possible to avoid having to flush the TLB entries corresponding to the USER space upon a context switch.

True. All protection domains are co-located in the same address space, and each domain is given a distinct segment via the segment registers. The USER TLB entries therefore remain valid across a context switch between such domains and need not be flushed.

Recall that "limited minimum intervening" thread scheduling policy uses affinity information for"k" processors, where "k" is a subset of the processors that this thread ran on during its life time for the scheduler to make scheduling decisions. Sketch the data structure for the TCB (assume k = 4) in this scheduling regime (you need to show only the fields that are to be used for scheduling purposes).

struct affine_type {
    int processor_id;   /* processor number */
    int affinity;       /* number of intervening threads */
};

struct TCB_type {
    /* ... other unrelated info ... */
    struct affine_type affinity[4];   /* top-4 affinity info for this thread */
};

This question is with reference to Exokernel. A user process makes a system call. Enumerate the sequence of steps by which this system call gets serviced by the library OS that this process belongs to.

the "syscall" results in a trap fielded by Exokernel Exokernel knows the "processor environment" (PE) data structures that corresponds to the currently executing library OS. Specifically, the "protected entry context" in the PE gives the entry point address in the library OS that corresponds to the syscall. Using this entry point, exokernel upcalls into the library OS. Library OS then completes the syscall

Give two concise bullet points that can be considered virtues of dissemination barrier from the point of view of scalability of the algorithm.

- No hierarchy.
- No pairwise synchronization.
- Each node works independently to send out the messages per the protocol.
- All nodes realize the barrier is complete when they have received ceil(log2 N) messages from their peers.
- Total amount of communication: O(N log2 N).
- Works for NCC (non-cache-coherent) and message-passing machines in addition to shared memory.

Using a concrete example (e.g., a disk driver), show how copying memory buffers between guest VMs and the hypervisor is avoided in a para virtualized setting?

• I/O ring data structure shared between hypervisor and guest VM • Each slot in the I/O ring is a descriptor for a unique request from the guest VM or a unique response from the hypervisor • Address pointer to the physical memory page corresponding to a data buffer in the guest OS is placed in the descriptor • The physical memory page is pinned for the duration of the transfer

In a multiprocessor, when a thread becomes runnable again after a blocking system call completes, conventional wisdom suggests running the thread on the last processor it ran on before the system call. What do you think are important considerations in choosing the "right" processor to schedule this thread?

●Cache pollution by threads run after the last time a particular thread ran on a processor ●Cache pollution by threads that will be run after a thread is scheduled to run on a particular processor (queue length)

An application process starts up on Tornado multiprocessor OS. What are the steps that need to happen before the process actually starts to run?

● Clustered "process" objects are created, one representation per processor on which this process has threads.
● Clustered "region" objects are created commensurate with the partitioning of the address space.
● Clustered FCM objects are created to support the region objects.

A process running on top of Tornado frees up some portion of its memory space (say using a facility such as free()). What does Tornado have to do to cleanup under the cover?

●Locate the clustered region object that corresponds to this freed address range. This region object has the piece of the page table that corresponds to the address range being freed. ●Identify the replicas of this region object on all the processors and fix up the mapping of these address ranges in the replicated copies of the page table entries corresponding to the memory being freed.

Assume that messages arrive out-of-order in a distributed system. Does this violate the "happened before" relation? Justify your answer with examples.

● No.
● Justification: the happened-before relation is concerned only with the sending and receipt of each message individually; a receive event always happens after its corresponding send. Therefore, messages arriving out of order do not violate the relation.

In an Intel-like architecture, the CPU has to do address translation on every memory access by accessing the TLB and possibly the page table. In a non-virtualized set up, the page table is set up by the OS and is used by the hardware for address translation through the page table base register (PTBR) set by the OS upon scheduling a process. Recall that in a fully virtualized setting, the page table for a process is an internal data structure maintained by the library OS. When the library OS services a page fault and updates the page table for a given process, how is this mapping conveyed to the CPU so that it can do the address translation correctly for future access to this page by this process?

● Page fault service involves finding a free physical page frame and mapping the faulting virtual page to this allocated physical page frame.
● To establish this mapping, the library OS has to update the page table or the TLB, depending on the specifics of the processor architecture assumed by the fully virtualized library OS. Both of these are privileged operations, which will result in a trap when the library OS tries to execute either of them.
● The hypervisor will catch the trap and "emulate" the intended update in the library OS's illusion of the PT/TLB.
● More importantly, the hypervisor has, in its shadow page table, a mapping of which machine page (MPN) this physical page of the guest OS refers to.
● The hypervisor will establish a direct mapping between the virtual page and the corresponding machine page by entering the mapping into the hardware page table (the shadow page table) or the TLB, depending on the specifics of the processor architecture.

