CS 4410 P2

FIFO: Convoy effect

a number of relatively-short potential consumers of a resource get queued behind a heavyweight resource consumer

Inverted Page Table Optimization: Hashing

hash(page,pid) -> PT entry (or chain of entries)

Allocation Structures

Some way to track whether inodes or data blocks are free or allocated. We use bitmaps: one for the data region (the data bitmap) and one for the inode table (the inode bitmap).

Redundant Array of Inexpensive Disks/RAID

Small, slower disks are cheaper, and parallelism is free. Benefits of RAID: cost; capacity; reliability

TLB Issue: Context Switches Solutions: address space identifier (ASID)

some hardware systems provide an address space identifier (ASID) field in the TLB. You can think of the ASID as a process identifier (PID), but usually it has fewer bits

Basic Flash Operations: read (a page)

A client of the flash chip can read any page (e.g., 2KB or 4KB), simply by specifying the read command and appropriate page number to the device. This operation is typically quite fast, 10s of microseconds or so, regardless of location on the device, and (more or less) regardless of the location of the previous request (quite unlike a disk). Being able to access any location uniformly quickly means the device is a random access device.

**Virtual Memory/Caching Notes**

**CH 21 Beyond Physical Memory: Mechanisms**

**CH 22 Beyond Physical Memory: Policies**

**CH 23 Complete Virtual Memory Systems**

**CH 36 I/O Devices**

**CH 37 Hard Disk Drives**

**CH 38 Redundant Arrays of Inexpensive Disks (RAIDs)**

**CH 39 Interlude: Files and Directories**

**CH 40 File System Implementation**

**CH 41 Locality and The Fast File System**

**CH 42 Crash Consistency: FSCK and Journaling**

**CH 43 Log-structured File Systems**

**CH 44 Flash-based SSDs**

**CH 45 Data Integrity and Protection**

**CH 7 Scheduling: Introduction**

**CH 8 Scheduling: The Multi-Level Feedback Queue (MLFQ)**

**CH 9 Scheduling: Proportional/Fair Share**

**CPU Scheduling Notes**

**Disks and RAID Notes**

**File Systems Notes**

**Main Memory Notes**

Flash Limitations

• Can't write 1 byte/word (must write whole pages)
• Limited # of erase cycles per block (memory wear): 10^3-10^6 erases and the cell wears out; reads can "disturb" nearby words and overwrite them with garbage
• Lots of techniques to compensate:
-- error correcting codes
-- bad page/erasure block management
-- wear leveling: trying to distribute erasures across the entire drive

NFS: Client-Side Caching of Blocks

• Read-ahead + write buffering improve performance by eliminating message delays.
• Client-side buffering causes problems if multiple clients access the same file concurrently:
-- Update visibility: writes by client C are not seen by the server, so not seen by other clients C'. (Solution: flush-on-close semantics for files.)
-- Stale cache: writes by client C are seen by the server, but caches at other clients are stale. (The server does not know where the file is cached.) (Solution: periodically check the last-update time at the server to see if the cache could be invalid.)

**CH 19 Paging: Faster Translations (TLBs)**

**CH 20 Paging: Smaller Tables**

RAID Level 5: Rotating Parity

(Partially) addresses the small-write problem. RAID-5 works almost identically to RAID-4, except that it rotates the parity block across drives.

**CH 10 Scheduling: Multiprocessor Scheduling**

**CH 13 The Abstraction: Address Spaces**

**CH 14 Interlude: Memory API**

**CH 15 Mechanism: Address Translation**

**CH 16 Segmentation**

**CH 18 Paging: Introduction**

Problems with a Basic MLFQ (Rules 1-4)

*Starvation:* if there are "too many" interactive jobs in the system, they will combine to consume all CPU time, and thus long-running jobs will never receive any CPU time (they starve). We'd like to make some progress on these jobs even in this scenario.

*Gaming the scheduler:* generally refers to the idea of doing something sneaky to trick the scheduler into giving you more than your fair share of the resource. The algorithm we have described is susceptible to the following attack: before the time slice is over, issue an I/O operation (to some file you don't care about) and thus relinquish the CPU; doing so allows you to remain in the same queue, and thus gain a higher percentage of CPU time. When done right (e.g., by running for 99% of a time slice before relinquishing the CPU), a job could nearly monopolize the CPU.

*Change of behavior:* a program may change its behavior over time; what was CPU bound may transition to a phase of interactivity. With our current approach, such a job would be out of luck and not be treated like the other interactive jobs in the system.

File Storage Layout Options: Linked-list: File Allocation Table (FAT) Positives

+ Simple: state required per file: start block only + Widely supported + No external fragmentation + block used only for data

Evaluating RAID Performance: steady-state throughput

The total bandwidth of many concurrent requests. Because RAIDs are often used in high-performance environments, the steady-state bandwidth is critical, and thus will be the main focus of our analyses.

Journal: Update Protocol Step (write x; write y; write z)

- Append to journal: TxBegin, x, y, z
- Wait for completion of disk writes.
- Append to journal: TxEnd
- Wait for completion of disk write.
- Write x, y, z to final locations in file system.
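
A minimal sketch of this protocol, using a toy in-memory "disk" (block values are ints; disk_write, disk_barrier, and the block layout are illustrative stand-ins, not a real disk API):

```c
#include <stdio.h>

/* Toy disk: an array of "blocks". The journal occupies low block numbers;
   the file system proper starts at block 16. */
enum { TX_BEGIN = -1, TX_END = -2, DISK_SIZE = 64 };
static int disk[DISK_SIZE];
static int journal_head = 0;

static void disk_write(int addr, int val) { disk[addr] = val; }
static void disk_barrier(void) { /* stand-in for "wait for completion" */ }

/* Write x, y, z (destined for final blocks fx, fy, fz) via the journal. */
static void journal_update(int x, int fx, int y, int fy, int z, int fz) {
    disk_write(journal_head++, TX_BEGIN);  /* append TxBegin, x, y, z */
    disk_write(journal_head++, x);
    disk_write(journal_head++, y);
    disk_write(journal_head++, z);
    disk_barrier();                        /* wait for journal writes  */
    disk_write(journal_head++, TX_END);    /* append TxEnd             */
    disk_barrier();                        /* commit point             */
    disk_write(fx, x);                     /* write to final locations */
    disk_write(fy, y);
    disk_write(fz, z);
}

int main(void) {
    journal_update(11, 16, 22, 17, 33, 18);
    printf("final blocks: %d %d %d\n", disk[16], disk[17], disk[18]);
    return 0;
}
```

A crash before TxEnd is durable means recovery simply skips the transaction; a crash after it means recovery can replay the writes from the journal.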

Disk Failure Cases: (1) Isolated Disk Sectors

1+ sectors down, rest OK.
Permanent: physical malfunction (magnetic coating, scratches, contaminants).
Transient: data corrupted, but new data can be successfully written to / read from the sector.

Working Set

1. Collection of a process' most recently used pages 2. Pages referenced by process in last Δ time-units

Address Translation Problem Solution: Cache Coloring

1. Color frames according to cache configuration. 2. Spread each process' pages across as many colors as possible.

Overall Execution of a Write using Metadata Journaling (with Journal Superblock)

1. Data write: Write data to final location; wait for completion (the wait is optional; see below for details). 2. Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete. 3. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed. 4. Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system. 5. Free: Later, mark the transaction free in journal superblock.

Problems with old Unix File System (simple)

1. Horrible performance 2. Fragmentation 3. Original block-size was too small (helped minimize internal fragmentation but bad for transfer as each block might require a positioning overhead to reach it)

Handling a Page Fault

1. Identify page and reason (r/w/x) 2a. access inconsistent w/ segment access rights: terminate process 2b. access a page that is kept on disk: does frame with the code already exist? if no then allocate a frame and bring page in 2c. access of zero-initialized data (BSS) or stack: allocate a frame, fill page with zero bytes 2d. access of COW page: allocate a frame and copy

Overall Execution of a Write using Journaling (with Journal Superblock)

1. Journal write: Write the contents of the transaction (containing TxB and the contents of the update) to the log; wait for these writes to complete. 2. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction is now committed. 3. Checkpoint: Write the contents of the update to their final locations within the file system. 4. Free: Some time later, mark the transaction free in the journal by updating the journal superblock.

Overall Execution of a Write using Journaling

1. Journal write: Write the contents of the transaction (including TxB, metadata, and data) to the log; wait for these writes to complete. 2. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for write to complete; transaction is said to be committed. 3. Checkpoint: Write the contents of the update (metadata and data) to their final on-disk locations.

a typical interaction (protocol) between OS and I/O device

1. Polling: the OS waits until the device is ready to receive a command by repeatedly reading the status register.
2. Programmed I/O (PIO): the OS sends some data down to the data register; one can imagine that if this were a disk, for example, multiple writes would need to take place to transfer a disk block (say 4KB) to the device. When the main CPU is involved with the data movement, we refer to it as programmed I/O.
3. The OS writes a command to the command register; doing so implicitly lets the device know that both the data is present and that it should begin working on the command.
4. The OS waits for the device to finish by again polling it in a loop, waiting to see if it is finished (it may then get an error code to indicate success or failure).
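
A sketch of this protocol in C, with the device registers simulated as plain variables (STATUS, DATA, COMMAND, and the command code are stand-ins for real memory-mapped device registers):

```c
#include <stdio.h>

/* Simulated device registers (a real device exposes these via I/O
   ports or memory-mapped I/O). */
static int STATUS, DATA, COMMAND;
enum { READY = 0, BUSY = 1, CMD_WRITE = 7 };

static void issue_write(int data) {
    while (STATUS == BUSY) ;  /* 1. poll until the device is ready      */
    DATA = data;              /* 2. programmed I/O: CPU moves the data  */
    COMMAND = CMD_WRITE;      /* 3. command write starts the device     */
    while (STATUS == BUSY) ;  /* 4. poll until the device has finished  */
}

int main(void) {
    issue_write(42);
    printf("wrote %d via PIO\n", DATA);
    return 0;
}
```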

TLB entry

A TLB entry might look like this: VPN | PFN | other bits (protection/valid/dirty/etc)

Flash Operations

A block comprises a set of pages.
Erase block: sets each cell to "1"
• erase granularity = "erasure block" = 128-512 KB
• time: several ms
Write page: can only write erased pages
• write granularity = 1 page = 2-4 KB
• time: 100s of microseconds
Read page:
• read granularity = 1 page = 2-4 KB
• time: 10s of microseconds

compulsory miss (cold-start miss)

A cache miss caused by the first access to a block that has never been in the cache. occurs because the cache is empty to begin with and this is the first reference to the item

fsck: Bad blocks

A check for bad block pointers is also performed while scanning through the list of all pointers. A pointer is considered "bad" if it obviously points to something outside its valid range, e.g., it has an address that refers to a block greater than the partition size. In this case, fsck can't do anything too intelligent; it just removes (clears) the pointer from the inode or indirect block.

Single Block Failure: Detecting block corruption (The Checksum)

A checksum is simply the result of a function that takes a chunk of data (say a 4KB block) as input and computes a function over said data, producing a small summary of the contents of the data (say 4 or 8 bytes). This summary is referred to as the checksum. The goal of such a computation is to enable a system to detect if data has somehow been corrupted or altered by storing the checksum with the data and then confirming upon later access that the data's current checksum matches the original storage value. Whatever the method used, it should be obvious that there is no perfect checksum: it is possible two data blocks with non-identical contents will have identical checksums, something referred to as a collision
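
A minimal sketch, assuming a simple XOR-of-words checksum (real systems typically use stronger functions such as Fletcher or CRC32):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* XOR all 8-byte words of a block into a single 8-byte summary. */
static uint64_t xor_checksum(const uint8_t *block, size_t len) {
    uint64_t sum = 0, word;
    for (size_t i = 0; i + 8 <= len; i += 8) {
        memcpy(&word, block + i, 8);
        sum ^= word;
    }
    return sum;
}

int main(void) {
    uint8_t block[4096] = "hello, disk";
    uint64_t stored = xor_checksum(block, sizeof block); /* kept with the data  */
    block[0] ^= 1;                                       /* simulate corruption */
    printf("corruption detected: %s\n",
           xor_checksum(block, sizeof block) != stored ? "yes" : "no");
    return 0;
}
```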

Directory

A directory, like a file, also has a low-level name (i.e., an inode number), but its contents are quite specific: it contains a list of (user-readable name, low-level name) pairs (user name, inode number) By placing directories within other directories, users are able to build an arbitrary directory tree (or directory hierarchy), under which all files and directories are stored.

The File Abstraction

A file is a named assembly of data. Each file comprises: • data - information a user or application stores, typically an array of untyped bytes that is implemented by an array of fixed-size blocks • metadata - information added / managed by OS like size, owner, security info, modification time, etc.

File

A file is simply a linear array of bytes, each of which you can read or write. Each file has some kind of low-level name. For historical reasons, the low-level name of a file is often referred to as its *inode number*

Shortest Job First (SJF)

A scheduling algorithm that handles each user or task by getting the smaller ones out of the way first. Schedule in order of estimated execution time (with preemption, remaining execution time). The effect on short jobs is huge, while the effect on the long jobs is small.
Positives: optimal average turnaround time
Negatives: pessimal variance in turnaround time; needs an estimate of execution time
Horrible: can starve long jobs

Disk Tracks

A single track around the disk contains multiple sectors. Track length varies across the disk:
• Outside: more sectors per track; higher bandwidth
• Most of the disk area is in the outer regions

Distributed File System (DFS) NFSv2

A system that enables folders shared from multiple computers to appear as though they exist in one centralized hierarchy of folders instead of on many different computers. Goals:
• Clients share files
• Centralized file storage
-- Allows efficient backup
-- Allows uniform management
-- Enables physical security for files
• Client-side transparency
-- Same file system operations: open, read, write, close

Paths as Names

Absolute: path of file from the root directory Relative: path from the current working directory 2 special entries in each UNIX directory: "." current dir ".." parent of current dir (except .. for "/" (root) is "/") To access a file: • Go to the dir where file resides —OR— • Specify the path where the file is

Solving Starvation and change of behavior (card above)

Add rule 5: priority boost

Memory Management Common Errors: Calling free() Incorrectly (Invalid Free)

After all, free() expects you only to pass to it one of the pointers you received from malloc() earlier. When you pass in some other value, bad things can (and do) happen

File Storage Layout Options: Contiguous allocation

All bytes together, in order + Simple: state required per file: start block & size + Efficient: entire file can be read with one seek - Fragmentation: external fragmentation is bigger problem - Usability: user needs to know size of file at time of creation

FFS: Improving Performance

Allocate blocks that will be accessed together in the same cylinder group: - Result: shorter seek times. Buffer blocks in memory: - Combines writes prior to seek - Single fetch for reads by many processes - Allows re-ordering of disk access Pre-fetch for read: - Eliminates latency from critical path

OS Support for Paging: Process Creation

Allocate frames, create & initialize page table & PCB

Track Skew

Allows sequential transfer to continue after seek.

A Simple Policy: Random Replacement

Another similar replacement policy is Random, which simply picks a random page to replace under memory pressure. Random has properties similar to FIFO; it is simple to implement, but it doesn't really try to be too intelligent in picking which blocks to evict.

data region

Area of the disk where the actual files and directory data (user data) are stored. It takes up most of the partition.

device driver

At the lowest level, a piece of software in the OS must know in detail how a device works; this piece of software is the device driver, and any specifics of device interaction are encapsulated within it.

Priority Scheduling

Assign a number to each job and schedule jobs in (increasing) order Can implement any scheduling policy To avoid starvation, improve job's priority with time (aging)

Writing Sequentially And Effectively: Write Buffering

Before writing to the disk, LFS keeps track of updates in memory; when it has received a sufficient number of updates, it writes them to disk all at once, thus ensuring efficient use of the disk. The large chunk of updates LFS writes at one time is referred to by the name of a *segment*

Belady's Anomaly

Belady (of the optimal policy) and colleagues found an interesting reference stream that behaved a little unexpectedly (especially using FIFO replacement). The memory reference stream: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5. In general, you would expect the cache hit rate to increase (get better) when the cache gets larger. But in this case, with FIFO, it gets worse!
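
A small FIFO simulation over exactly this reference stream reproduces the anomaly (a sketch; the hit counts in the comments follow from tracing the stream by hand):

```c
#include <stdio.h>

/* Count hits for a FIFO-managed cache of the given number of frames. */
static int fifo_hits(const int *refs, int n, int frames) {
    int cache[8] = {0}, used = 0, victim = 0, hits = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (cache[j] == refs[i]) { hit = 1; break; }
        if (hit) { hits++; continue; }
        if (used < frames) cache[used++] = refs[i];           /* cold fill */
        else { cache[victim] = refs[i]; victim = (victim + 1) % frames; }
    }
    return hits;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d hits\n", fifo_hits(refs, n, 3));  /* 3 hits */
    printf("4 frames: %d hits\n", fifo_hits(refs, n, 4));  /* 2 hits */
    return 0;
}
```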

Page Replacement Algorithms: OPT

Belady's algorithm: Select page not used for longest time (farthest in future) OPT Approximation: In real life, we do not have access to the future page request stream of a program -> Need to make a guess at which pages will not be used for the longest time

LRU and MRU are stack algorithms

By definition:
• For LRU: M(m, r) contains the m most recently used pages, so M(m, r) is a subset of M(m + 1, r)
• Similar for MRU
• Similar also for LFU (Least Frequently Used)

NAND Flash

Charge is stored in Floating Gate (can have Single and Multi-Level Cells)

RAID 0: Simple Striping

Chunk size: the number of consecutive blocks placed on a disk. The first chunk-size blocks go on disk 1, the next chunk on disk 2, and so on. Striping reduces reliability:
• More disks ➜ higher probability of some disk failing
• N disks: 1/Nth the mean time between failures of 1 disk
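
The block-to-disk mapping reduces to simple arithmetic; a sketch (assuming blocks are striped round-robin in chunk-size units, with disks numbered from 0):

```c
#include <stdio.h>

/* Locate logical block `block` in a stripe across `num_disks` disks. */
static void locate(int block, int chunk_size, int num_disks,
                   int *disk, int *offset) {
    int chunk = block / chunk_size;                /* which chunk overall    */
    *disk = chunk % num_disks;                     /* round-robin over disks */
    *offset = (chunk / num_disks) * chunk_size + block % chunk_size;
}

int main(void) {
    for (int b = 0; b < 8; b++) {
        int d, off;
        locate(b, 2, 4, &d, &off);                 /* chunk size 2, 4 disks */
        printf("block %d -> disk %d, offset %d\n", b, d, off);
    }
    return 0;
}
```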

Disk Scheduling: C-SCAN

Circular list treatment: • head moves from one end to other • servicing requests as it goes • reaches the end, returns to beginning • no requests serviced on return trip (More uniform wait time than SCAN)

Fairness vs Turnaround time

Any policy (such as RR) that is fair, i.e., that evenly divides the CPU among active processes on a small time scale, will perform poorly on metrics such as turnaround time

Sharing: Protection Bits

Basic support adds a few bits per segment, indicating whether or not a program can read or write a segment, or perhaps execute code that lies within the segment By setting a code segment to *read-only* the same code can be shared across multiple processes, without worry of harming isolation; while each process still thinks that it is accessing its own private memory, the OS is secretly sharing memory which cannot be modified by the process, and thus the illusion is preserved With protection bits, the hardware algorithm described earlier would also have to change. In addition to checking whether a virtual address is within bounds, the hardware also has to check whether a particular access is permissible. If a user process tries to write to a read-only segment, or execute from a non-executable segment, the hardware should raise an exception, and thus let the OS deal with the offending process.

Writing To Disk Sequentially

Basically, every write (and subsequent metadata write) all lie in a sequential line. This basic idea, of simply writing all updates (such as data blocks, inodes, etc.) to the disk sequentially, sits at the heart of LFS

Crash Recovery And The Log: Making Sure Segment is Valid

Because LFS writes the CR every 30 seconds or so, the last consistent snapshot of the file system may be quite old. Thus, upon reboot, LFS can easily recover by simply reading in the checkpoint region, the imap pieces it points to, and subsequent files and directories; however, the last many seconds of updates would be lost. To improve upon this, LFS tries to rebuild many of those segments through a technique known as *roll forward*

SQMS Problems: Cache Affinity

Because each CPU simply picks the next job to run from the globally shared queue, each job ends up bouncing around from CPU to CPU, thus doing exactly the opposite of what would make sense from the standpoint of cache affinity

Metadata Journaling (ordered journaling)

Because of the high cost of writing every data block to disk twice, people have tried a few different things in order to speed up performance. Metadata journaling is nearly the same as data journaling, except that user data is not written to the journal. The data block, previously written to the log, would instead be written to the file system proper, avoiding the extra write; given that most I/O traffic to the disk is data, not writing data twice substantially reduces the I/O load of journaling.

Basic Flash Operations: erase (a block)

Before writing to a page within a flash, the nature of the device requires that you first erase the entire block the page lies within. Erase, importantly, destroys the contents of the block (by setting each bit to the value 1); therefore, you must be sure that any data you care about in the block has been copied elsewhere (to memory, or perhaps to another flash block) before executing the erase. The erase command is quite expensive, taking a few milliseconds to complete. Once finished, the entire block is reset and each page is ready to be programmed.

AFS Crash Recovery

Client crash/reboot/disconnect: -- Client might miss the callback from server -- On client reboot: treat all local files as suspect and recheck with server for each file open Server failure: -- Server forgets list of callbacks registered. -- On server reboot: Inform all clients; client must treat all local files as suspect

NFS: Tolerating Server Failure Solution

Client does retry (after timeout), and all NFS server operations are idempotent.
-- Idempotent = "a repeat of an operation generates the same response."

AFS Cache Consistency

Consistency between:
• Processes on different machines:
-- Updates to a file are made visible at the server when the file is closed()
» Last writer wins if multiple clients have the file open and are updating it. (So the file reflects updates from only one machine.)
» Compare with NFS: mixes updated blocks from different clients.
-- All clients with callbacks for that file are notified and the callback cancelled.
-- Subsequent open() re-fetches the file
• Processes on the same machine:
-- Updates are visible locally through the shared cache

FFS: File creation: create()

Creates: 1. a new file with some metadata; and 2. a name for the file in a directory

Ticket Mechanisms: ticket currency

Currency allows a user with a set of tickets to allocate tickets among their own jobs in whatever currency they would like; the system then automatically converts said currency into the correct global value

SSD Cost

Currently, an SSD costs something like $150 for a 250-GB drive; such an SSD costs 60 cents per GB. A typical hard drive costs roughly $50 for 1-TB of storage, which means it costs 5 cents per GB. There is still more than a 10× difference in cost between these two storage media

Hard Disk Drives: Components: track

Data is encoded on each surface in concentric circles of sectors; we call one such concentric circle a track. A single surface contains many thousands and thousands of tracks, tightly packed together, with hundreds of tracks fitting into the width of a human hair

Problems with MQMS: Load Imbalance

Depending on how processes are put on queues/CPUs, one process may get much more of the (overall) CPU time than another. Or, even worse, some CPU may sit idle.

Memory Management Common Errors: Not Allocating Enough Memory (Buffer Overflow)

Depending on what you allocate and how the OS is set up, you need to make sure you allocate enough memory. If you allocate too little and a function then writes past the end of the buffer, it may overwrite some other variable or stored information (because it expected that area of memory to belong to the object it is using/storing).

Garbage Collection Problem: Which Blocks To Clean, And When?

Determining when to clean is easier; either periodically, during idle time, or when you have to because the disk is full Determining which blocks to clean is more challenging: In the original LFS paper, the authors describe an approach which tries to segregate hot and cold segments. A hot segment is one in which the contents are being frequently over-written; thus, for such a segment, the best policy is to wait a long time before cleaning it, as more and more blocks are getting over-written (in new segments) and thus being freed for use. A cold segment, in contrast, may have a few dead blocks but the rest of its contents are relatively stable. Thus, the authors conclude that one should clean cold segments sooner and hot segments later, and develop a heuristic that does exactly that.

Directories

Directory: A file whose interpretation is a mapping from a character string to a low level name. Directories Compose into Trees: Each path from root is a name for a leaf.

Disk Operation Overhead

Disk Latency = Seek Time + Rotation Time + Transfer Time
Seek: get to the track (5-15 milliseconds (ms)) = (# of tracks seeked over) × (seek time for one track)
Rotational latency: get to the sector (4-8 milliseconds (ms); on average, only need to wait half a rotation); full rotation time in ms = 60,000/RPM
Transfer: get bits off the disk (25-50 microseconds (µs)) = (rotation time)/(# of blocks per track)
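
A back-of-envelope calculation with these formulas (all drive parameters here are illustrative, not from a specific model):

```c
#include <stdio.h>

int main(void) {
    double rpm = 7200.0;
    double full_rotation_ms = 60000.0 / rpm;          /* 60,000 ms/min / RPM  */
    double avg_rotation_ms = full_rotation_ms / 2.0;  /* wait half a turn     */
    double seek_ms = 7.0;                             /* assumed average seek */
    double transfer_ms = 0.05;                        /* one block off disk   */
    printf("avg latency ~ %.2f ms\n",
           seek_ms + avg_rotation_ms + transfer_ms);  /* ~11.2 ms */
    return 0;
}
```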

Aspects of Memory Multiplexing: Isolation

Don't want distinct process states colliding in physical memory (unintended overlap → chaos)

RAID-1: Mirroring

Each block is stored on 2 separate disks. Read either copy; write both copies (in parallel)

File Storage Layout Options: Linked-list

Each block points to the next block • First word of each block points to next block • Rest of disk block is file data + Space Utilization: no space lost to external fragmentation + Simple: only need to store 1st block of each file - Performance: random access is slow - Space Utilization: overhead of pointers

fsck: Inode state

Each inode is checked for corruption or other problems. For example, fsck makes sure that each allocated inode has a valid type field (e.g., regular file, directory, symbolic link, etc.). If there are problems with the inode fields that are not easily fixed, the inode is considered suspect and cleared by fsck; the inode bitmap is correspondingly updated

Solution #1: The File System Checker (fsck)

Early file systems decided to let inconsistencies happen and then fix them later (when rebooting) fsck is a UNIX tool for finding such inconsistencies and repairing them. The tool fsck operates in a number of phases. It is run before the file system is mounted and made available (fsck assumes that no other file-system activity is on-going while it runs); once finished, the on-disk file system should be consistent and thus can be made accessible to users.

static partitioning of memory

Early file systems thus introduced a fixed-size cache to hold popular blocks. As in our discussion of virtual memory, strategies such as LRU and different variants would decide which blocks to keep in cache. This fixed-size cache would usually be allocated at boot time to be roughly 10% of total memory

Disk Scheduling: SCAN

Elevator Algorithm: • arm starts at one end of disk • moves to other end, servicing requests • movement reversed @ end of disk • repeat

Journal: Performance Optimizations

Eliminate the disk write of the TxEnd record:
• Compute a checksum of the payload xxx in "TxBegin xxx TxEnd"
• Include the checksum in TxBegin.
• Recovery checks whether all log entries are present.
Eliminate the journal disk write of xxx when it is a data block:
• Step 1: Write data block to final disk location
• Step 2: Await completion
• Step 3: Write journal entries for the metadata blocks
• Step 4: Await completion

LFS: Garbage Collection

Eventually disk will fill. But many blocks ("garbage") not reachable via CP, because they were overwritten. Cleaner Protocol: 1. read entire segment; 2. find live blocks within (see below); 3. copy live blocks to new segment; 4. append new segment to disk log

Page Replacement Algorithms: LFU

Evict least frequently used page

Page Replacement Algorithms: LRU

Evict page that hasn't been used for the longest: Assumes past is a good predictor of the future Implementing LRU: • On reference: Timestamp each page • On eviction: Scan for oldest page Problems: • Large page lists • Timestamps are costly Solution: approximate LRU • Note: LRU is already an approximation • Exploit use (REF) bit in PTE

Page Replacement Algorithms: MRU

Evict the most recently used page

Paging Advantages: simplicity of free-space management

Ex, when the OS wishes to place our tiny 64-byte address space into our eight-page physical memory, it simply finds four free pages; perhaps the OS keeps a free list of all free pages for this, and just grabs the first four free pages off of this list.

Thrashing

Excessive rate of paging Cache lines evicted before they can be reused Causes: • Too many processes in the system • Cache not big enough to fit working set • Bad luck (conflicts) • Bad eviction policies Prevention: • Restructure code to reduce working set • Increase cache size • Improve caching policies

File Storage Layout Options: Linked-list: File Allocation Table (FAT)

FAT (is stored on disk): • Linear map of all blocks on disk • Each file is a linked list of blocks • 1 entry per block • EOF for last block • 0 indicates free block • directory entry maps name to FAT index
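
A toy sketch of reading a file by chasing FAT entries (the table contents and the EOF_BLOCK marker here are made up; real FAT uses reserved values for EOF and free):

```c
#include <stdio.h>

enum { EOF_BLOCK = -1, NUM_BLOCKS = 16 };
static int fat[NUM_BLOCKS];   /* 0 = free in this toy table */

/* Follow the linked list of block numbers starting at the first block. */
static void read_file(int start_block) {
    for (int b = start_block; b != EOF_BLOCK; b = fat[b])
        printf("read data block %d\n", b);   /* each hop may cost a seek */
}

int main(void) {
    fat[4] = 9; fat[9] = 2; fat[2] = EOF_BLOCK;  /* file: blocks 4 -> 9 -> 2 */
    read_file(4);   /* directory entry maps the name to start block 4 */
    return 0;
}
```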

FFS Organizing Structure: The Cylinder Group

FFS divides the disk into a number of cylinder groups. A single *cylinder* is a set of tracks on different surfaces of a hard drive that are the same distance from the center of the drive; it is called a cylinder because of its clear resemblance to the so-called geometrical shape FFS aggregates N consecutive cylinders into a group, and thus the entire disk can thus be viewed as a collection of cylinder groups Note that modern drives do not export enough information for the file system to truly understand whether a particular cylinder is in use: Thus, modern file systems instead organize the drive into block groups, each of which is just a consecutive portion of the disk's address space

Disk Layout

File System is stored on disks • sector 0 of disk called Master Boot Record (MBR) • end of MBR: partition table (partitions' start & end addrs) • Remainder of disk divided into partitions. --Each partition starts with a boot block --Boot block loaded by MBR and executed on boot --Remainder of partition stores file system.

NFS: Client Operations

File system operations at the client are translated to a message exchange with the server.
• fd := open("/foo", ...) -> {
send LOOKUP(rootdir FH, "foo") to NFS server
receive FH_for_foo from NFS server
openFileTable[i] := FH_for_foo {slot i presumed free}
return i }
• read(fd, buffer, start, MAX) {
FH := openFileTable[fd].fileHandle
send READ(FH, offset=start, count=MAX) to NFS server
receive data from NFS server
buffer := data }

File Names

Files have names: • a unique low-level name -- low-level name is distinct from location where file stored (File system provides mapping from low-level names to storage locations.) • one or more human-readable names (File system provides mapping from human-readable names to low-level names.)

First In First Out (FIFO)

Jobs run in the order they arrive.
Positives: simple; low overhead; no starvation
Negatives: average turnaround time very sensitive to schedule order
Horrible: not responsive to interactive jobs

write buffering

First, by delaying writes, the file system can batch some updates into a smaller set of I/Os; for example, if an inode bitmap is updated when one file is created and then updated moments later as another file is created, the file system saves an I/O by delaying the write after the first update. Second, by buffering a number of writes in memory, the system can then schedule the subsequent I/Os and thus increase performance. Finally, some writes are avoided altogether by delaying them; for example, if an application creates a file and then deletes it, delaying the writes to reflect the file creation to disk avoids them entirely. In this case, laziness (in writing blocks to disk) is a virtue.

OS Servicing Page Faults

First, the OS must find a physical frame for the soon-to-be-faulted-in page to reside within; if there is no such page, we'll have to wait for the replacement algorithm to run and kick some pages out of memory, thus freeing them for use here. With a physical frame in hand, the handler then issues the I/O request to read in the page from swap space. Finally, when that slow operation completes, the OS updates the page table and retries the instruction. The retry will result in a TLB miss, and then, upon another retry, a TLB hit, at which point the hardware will be able to access the desired item.

Memory Issues for OS to Handle: Process Creation

First, the OS must take action when a process is created, finding space for its address space in memory. When a new process is created, the OS will have to search a data structure (often called a *free list*) to find room for the new address space and then mark it used

Flash Translation Layer

Flash device firmware maps logical page # to a physical location: • Garbage collect erasure block by copying live pages to new location, then erase --More efficient if blocks stored at same time are deleted at same time (e.g., keep blocks of a file together) • Wear-levelling: only write each physical page a limited number of times • Remap pages that no longer work (sector sparing)

Flash

Flash, has some unique properties. For example, to write to a given chunk of it (i.e., a flash page), you first have to erase a bigger chunk (i.e., a flash block), which can be quite expensive. In addition, writing too often to a page will cause it to wear out

File Storage Layout Options: Linked-list: File Allocation Table (FAT) Structure

Folder: a file with 32-byte entries Each Entry: • 8 byte name + 3 byte extension (ASCII) • creation date and time • last modification date and time • first block in the file (index into FAT) • size of the file • Long and Unicode file names take up multiple entries

XOR

For a given set of bits, the XOR of all of those bits returns a 0 if there are an even number of 1's in the bits, and a 1 if there are an odd number of 1's You can remember this in a simple way: that the number of 1s in any row, including the parity bit, must be an even (not odd) number; that is the invariant that the RAID must maintain in order for parity to be correct
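
A tiny demonstration of parity in code (values arbitrary): the parity byte is the XOR of the data bytes, and any single lost byte is recovered by XORing the parity with the survivors.

```c
#include <stdio.h>

int main(void) {
    unsigned char d0 = 0x0F, d1 = 0x33, d2 = 0xA5;
    unsigned char parity = d0 ^ d1 ^ d2;    /* keeps each bit's 1-count even */

    /* Pretend d1 is lost: rebuild it from parity and the other blocks. */
    unsigned char rebuilt = parity ^ d0 ^ d2;
    printf("d1=0x%02X rebuilt=0x%02X\n", d1, rebuilt);
    return 0;
}
```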

Page Table

For each process, the operating system creates a page table, which keeps track of the corresponding frame location in physical memory where each page is stored. There is one entry in the table for each page in a program. The entry contains the page number and its corresponding frame number.
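
A toy linear page table in code (the sizes, 16-byte pages and 8 entries, and the frame numbers are illustrative):

```c
#include <stdio.h>

enum { PAGE_SIZE = 16, NUM_PAGES = 8 };
static int page_table[NUM_PAGES] = {3, 7, 5, 2, 0, 1, 6, 4}; /* VPN -> PFN */

static unsigned translate(unsigned vaddr) {
    unsigned vpn = vaddr / PAGE_SIZE;      /* which page            */
    unsigned offset = vaddr % PAGE_SIZE;   /* where within the page */
    return page_table[vpn] * PAGE_SIZE + offset;
}

int main(void) {
    printf("vaddr 21 -> paddr %u\n", translate(21)); /* VPN 1 -> PFN 7 */
    return 0;
}
```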

So how does LFS store directory data?

Fortunately, directory structure is basically identical to classic UNIX file systems, in that a directory is just a collection of (name, inode number) mappings. For example, when creating a file f on disk, LFS must write a new inode, some data, as well as the directory data and its inode that refer to this file. The piece of the inode map contains the information for the location of both the directory file dir as well as the newly-created file f. Thus, when accessing file f (with inode number k), you would first look in the inode map (usually cached in memory) to find the location of the inode of directory dir; you then read the directory inode, which gives you the location of the directory data; reading this data block gives you the name-to-inode-number mapping of (f, k). You then consult the inode map again to find the location of inode number k, and finally read the desired data block at that address.

RAID Level 4: Parity: Analysis: capacity

From a capacity standpoint, RAID-4 uses 1 disk for parity information for every group of disks it is protecting. Thus, our useful capacity for a RAID group is (N − 1) · B.

solid-state storage device (SSD)

Generically referred to as solid-state storage, such devices have no mechanical or moving parts like hard drives; rather, they are simply built out of transistors, much like memory and processors However, unlike typical random-access memory (e.g., DRAM), such a solid-state storage device (a.k.a., an SSD) retains information despite power loss, and thus is an ideal candidate for use in persistent storage of data.

Thrashing: Admission control

Given a set of processes, a system could decide not to run a subset of processes, with the hope that the reduced set of processes' working sets (the pages that they are using actively) fit in memory and thus can make progress This approach, generally known as admission control, states that it is sometimes better to do less work well than to try to do everything at once poorly, a situation we often encounter in real life as well as in modern computer systems

File System Design Challenges: Flexibility

Handle diverse application workloads

Memory Management Unit (MMU)

Hardware device (part of processor) that at run time maps virtual to physical address User Process deals with virtual addresses; Never sees the physical address Physical Memory deals with physical addresses; Never sees the virtual address

How to keep track of Offset: Explicit Approach

Here you chop up the address space into segments based on the top few bits of the virtual address. This limits each segment to a *maximum size* (address space size / # of possible segments). Ex. we have three segments; thus we need two bits to accomplish our task. If we use the top two bits of our 14-bit virtual address to select the segment, our virtual address looks like this: bits 13-12: segment, bits 11-0: offset. 13-12 = 00 means the code segment, 01 means heap, 10 means stack. 2 bits = 4 possible segments of a 16KB address space, so each segment has a max size of 4KB.
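
A sketch of this translation in code, using the card's 14-bit layout (the base/bounds values are made up, and only positive-growth segments are checked):

```c
#include <stdio.h>

enum { SEG_SHIFT = 12, OFFSET_MASK = 0xFFF };
/* Segment 00 = code, 01 = heap, 10 = stack, 11 = unused. */
static unsigned base[4]   = {32 * 1024, 34 * 1024, 28 * 1024, 0};
static unsigned bounds[4] = { 2 * 1024,  3 * 1024,  2 * 1024, 0};

static int translate(unsigned vaddr, unsigned *paddr) {
    unsigned seg = (vaddr >> SEG_SHIFT) & 0x3;  /* top two bits pick segment */
    unsigned offset = vaddr & OFFSET_MASK;      /* low 12 bits are offset    */
    if (offset >= bounds[seg]) return -1;       /* protection fault          */
    *paddr = base[seg] + offset;
    return 0;
}

int main(void) {
    unsigned p;
    if (translate(0x1068, &p) == 0)   /* segment 01 (heap), offset 0x068 */
        printf("paddr = %u\n", p);
    return 0;
}
```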

flash translation layer (FTL): goals: reliability

High reliability will be achieved through the combination of a few different approaches. One main concern, as discussed above, is wear out. If a single block is erased and programmed too often, it will become unusable; as a result, the FTL should try to spread writes across the blocks of the flash as evenly as possible, ensuring that all of the blocks of the device wear out at roughly the same time; doing so is called *wear leveling* and is an essential part of any modern FTL. Another reliability concern is program disturbance. To minimize such disturbance, FTLs will commonly program pages within an erased block in order, from low page to high page. This sequential-programming approach minimizes disturbance and is widely utilized.

Starvation

How bad can it get? The lack of progress for one job, due to resources given to higher priority jobs.

Garbage Collection Problem: Determining Block Liveness (Segment Summary Block)

How can LFS tell which blocks within a segment are live, and which are dead? To do so, LFS adds a little extra information to each segment that describes each block. Specifically, LFS includes, for each data block D, its inode number (which file it belongs to) and its offset (which block of the file this is). This information is recorded in a structure at the head of the segment known as the *segment summary block*.
Given this information, it is straightforward to determine whether a block is live or dead:
1. For a block D located on disk at address A, look in the segment summary block and find its inode number N and offset T.
2. Next, look in the imap to find where N lives and read N from disk (perhaps it is already in memory, which is even better).
3. Finally, using the offset T, look in the inode (or some indirect block) to see where the inode thinks the Tth block of this file is on disk. If it points exactly to disk address A, LFS can conclude that the block D is live. If it points anywhere else, LFS can conclude that D is not in use (i.e., it is dead) and thus know that this version is no longer needed.
*Pseudocode:*
(N, T) = SegmentSummary[A];
inode = Read(imap[N]);
if (inode[T] == A)
    // block D is alive
else
    // block D is garbage
There are some shortcuts LFS takes to make the process of determining liveness more efficient. For example, when a file is truncated or deleted, LFS increases its version number and records the new version number in the imap. By also recording the version number in the on-disk segment, LFS can short circuit the longer check described above simply by comparing the on-disk version number with a version number in the imap, thus avoiding extra reads.

Predictability

How consistent? Low variance in turnaround time for repeated jobs.

Finding the Imap: The Checkpoint Region (CR)

How do we find the inode map, now that pieces of it are also now spread across the disk? the file system must have some fixed and known location on disk to begin a file lookup. This location is the Checkpoint Region The checkpoint region contains pointers to (i.e., addresses of) the latest pieces of the inode map, and thus the inode map pieces can be found by reading the CR first. Note the checkpoint region is only updated periodically (say every 30 seconds or so), and thus performance is not ill-affected Thus, the overall structure of the on-disk layout contains a checkpoint region (which points to the latest pieces of the inode map); the inode map pieces each contain addresses of the inodes; the inodes point to files (and directories) just like typical UNIX file systems

Fairness

How equal is performance? Equality in the resources given to each job.

Evaluating A RAID: reliability

How many disk faults can the given design tolerate?

Throughput

How many jobs over time? The rate at which jobs are completed

Overhead

How much useless work? Time lost due to switching between jobs

Fault-tolerant Disk Update: Journaling (Write-Ahead Logging)

Idea: Protocol where performing a single disk write causes multiple disk writes to take effect. Implementation: New on-disk data structure ("journal") with a sequence of blocks containing updates plus ...

FFS Superblock

Identifies file system's key parameters: • type • block size • inode array location and size (or analogous structure for other FSs) • location of free list

The Page Fault

If a page is not present and has been swapped to disk, the OS will need to swap the page into memory in order to service the page fault. The OS could use the bits in the PTE normally used for data such as the PFN of the page for a disk address. When the OS receives a page fault for a page, it looks in the PTE to find the address, and issues the request to disk to fetch the page into memory. When the disk I/O completes, the OS will then update the page table to mark the page as present, update the PFN field of the page-table entry (PTE) to record the in-memory location of the newly-fetched page, and retry the instruction. This next attempt may generate a TLB miss, which would then be serviced and update the TLB with the translation

Using Overlap to improve I/O processes

If a process needs an I/O request (aka has a syscall) it must wait for that I/O to return. Instead of occupying the CPU and waiting, the scheduler should *overlap* another process that does not have an I/O request (at that time), allowing it to run while the first process waits for the I/O request

FFS: Tolerating Crashes

If a processor crashes then only some blocks on a disk might get updated. • Data is lost • On disk data structures might become inconsistent. Solutions: - Add fsync: programmer forces writes to disk - Detect and recover - Fault-tolerant disk update protocols

FTL Organization: direct mapped problems: reliability

If file system metadata or user file data is repeatedly overwritten, the same block is erased and programmed, over and over, rapidly wearing it out and potentially losing data. The direct mapped approach simply gives too much control over wear out to the client workload; if the workload does not spread write load evenly across its logical blocks, the underlying physical blocks containing popular data will quickly wear out

Recovery using the Journal

If the crash happens before the transaction is written safely to the log (i.e., before Step 2 above completes), then our job is easy: the pending update is simply skipped. If the crash happens after the transaction has committed to the log, but before the checkpoint is complete, the file system can recover the update as follows: When the system boots, the file system recovery process will scan the log and look for transactions that have committed to the disk; these transactions are thus replayed (in order), with the file system again attempting to write out the blocks in the transaction to their final on-disk locations. This form of logging is one of the simplest forms there is, and is called *redo logging*

Problems with interrupts

If the device does not take that long to run the cost of interrupt handling and context switching may outweigh the benefits interrupts provide. There are also cases where a flood of interrupts may overload a system and lead it to livelock. If the speed of the device is not known, or sometimes fast and sometimes slow, it may be best to use a hybrid that polls for a little while and then, if the device is not yet finished, uses interrupts. This two-phased approach may achieve the best of both worlds

The Present Bit

If the present bit is set to one, it means the page is present in physical memory and everything proceeds as expected; if it is set to zero, the page is not in memory but rather on disk somewhere. The act of accessing a page that is not in physical memory is commonly referred to as a *page fault* Upon a page fault, the OS is invoked to service the page fault. A particular piece of code, known as a page-fault handler, runs, and must service the page fault, as we now describe.

Address Translation with Upward/Negative Segment Growth (Stack)

If the segment grows in the negative direction, you take the offset (from the address) and subtract the maximum segment size from it to get the new, negative, offset. We then add this negative number to the base to find the correct physical address. Ex. assume we wish to access virtual address 15KB, which should map to physical address 27KB. Our virtual address, in binary form, looks like this: 11 1100 0000 0000. The hardware uses the top two bits (11) to designate the segment, but then we are left with an offset of 3KB. To obtain the correct negative offset, we must subtract the maximum segment size from 3KB: in this example, a segment can be 4KB, and thus the correct negative offset is 3KB minus 4KB, which equals -1KB. We simply add the negative offset (-1KB) to the base (28KB) to arrive at the correct physical address: 27KB.

How does the OS employ the use bit to approximate LRU? Clock algorithm

Imagine all the pages of the system arranged in a circular list. A clock hand points to some particular page to begin with (it doesn't really matter which). When a replacement must occur, the OS checks if the currently-pointed-to page P has a use bit of 1 or 0. If 1, this implies that page P was recently used and thus is not a good candidate for replacement. Thus, the use bit for P is set to 0 (cleared), and the clock hand is incremented to the next page (P + 1). The algorithm continues until it finds a use bit that is set to 0, implying this page has not been recently used (or, in the worst case, that all pages have been and we have now searched through the entire set of pages, clearing all the bits).
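
A sketch of that sweep over a circular array of use bits (the bit values and frame count are illustrative):

```c
#include <stdio.h>

enum { NUM_FRAMES = 4 };
static int use_bit[NUM_FRAMES] = {1, 1, 0, 1};
static int hand = 0;   /* the clock hand */

static int clock_evict(void) {
    for (;;) {
        if (use_bit[hand] == 0) {          /* not recently used: victim  */
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
        use_bit[hand] = 0;                 /* clear bit: "second chance" */
        hand = (hand + 1) % NUM_FRAMES;
    }
}

int main(void) {
    printf("evict frame %d\n", clock_evict());  /* frame 2 */
    return 0;
}
```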

Disk Scheduling: SSTF Problem- starvation

Imagine if there were a steady stream of requests to the inner track, where the head currently is positioned. Requests to any other tracks would then be ignored completely by a pure SSTF approach

crash-consistency problem

Imagine you have to update two on-disk structures, A and B, in order to complete a particular operation. Because the disk only services a single request at a time, one of these requests will reach the disk first (either A or B). If the system crashes or loses power after one write completes, the on-disk structure will be left in an inconsistent state

Caches in Single Processor CPUs

In a system with a single CPU, there are a hierarchy of hardware caches that in general help the processor run programs faster Caches are small, fast memories that (in general) hold copies of popular data that is found in the main memory of the system. Main memory, in contrast, holds all of the data, but access to this larger memory is slower. Caches are thus based on the notion of locality, of which there are two kinds: temporal locality and spatial locality

RAID Level 1: Mirroring

In a typical mirrored system, we will assume that for each logical block, the RAID keeps two physical copies of it When reading a block from a mirrored array, the RAID has a choice: it can read either copy. When writing a block, though, no such choice exists: the RAID must update both copies of the data, in order to preserve reliability All data is duplicated One drive is exact replica of other Most fault tolerant Least efficient storage Uses half of storage capacity

Memory Management Common Errors: Forgetting To Free Memory (Memory Leak)

In long-running applications or systems (such as the OS itself), this is a huge problem, as slowly leaking memory eventually leads one to run out of memory, at which point a restart is required. Thus, in general, when you are done with a chunk of memory, you should make sure to free it

Hard Disk Drives: Rotational Delay

In our simple disk (with only one track), the disk doesn't have to do much. In particular, it must just wait for the desired sector to rotate under the disk head. This wait happens often enough in modern drives, and is an important enough component of I/O service time, that it has a special name: rotational delay (sometimes rotation delay).

MLFQ: Basic Rules

In our treatment, the MLFQ has a number of distinct queues, each assigned a different *priority level.* At any given time, a job that is ready to run is on a single queue. MLFQ uses priorities to decide which job should run at a given time: a job with higher priority (i.e., a job on a higher queue) is chosen to run. Of course, more than one job may be on a given queue, and thus have the same priority. In this case, we will just use round-robin scheduling among those jobs.
• Rule 1: If Priority(A) > Priority(B), A runs (B doesn't).
• Rule 2: If Priority(A) = Priority(B), A & B run in RR.
• Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
• Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
• Rule 5: After some time period S, move all the jobs in the system to the topmost queue.

Stack Grows Upward/Backward

In physical memory, it "starts" at 28KB and grows back to 26KB, corresponding to virtual addresses 16KB to 14KB; translation must proceed differently The first thing we need is a little extra hardware support. Instead of just base and bounds values, the hardware also needs to know which way the segment grows (a bit, for example, that is set to 1 when the segment grows in the positive direction, and 0 for negative)

Interrupt Problems: Solution- coalescing

In such a setup, a device which needs to raise an interrupt first waits for a bit before delivering the interrupt to the CPU. While waiting, other requests may soon complete, and thus multiple interrupts can be coalesced into a single interrupt delivery, thus lowering the overhead of interrupt processing

Journal Superblock

In the journal superblock (not to be confused with the main file system superblock), the journaling system records enough information to know which transactions have not yet been checkpointed, and thus reduces recovery time as well as enables re-use of the log in a circular fashion

FTL Organization: direct mapped (bad approach)

In this approach, a read to logical page N is mapped directly to a read of physical page N. A write to logical page N is more complicated; the FTL first has to read in the entire block that page N is contained within; it then has to erase the block; finally, the FTL programs the old pages as well as the new one.

TLB miss: when the access could be to an invalid page

In this case, no other bits in the PTE really matter; the hardware traps this invalid access, and the OS trap handler runs, likely terminating the offending process.

TLB miss: when the page was both present and valid

In this case, the TLB miss handler can simply grab the PFN from the PTE, retry the instruction (this time resulting in a TLB hit), and thus continue as described (many times) before.

Crash Scenarios: Just the updated bitmap is written to disk

In this case, the bitmap indicates that block 5 is allocated, but there is no inode that points to it. Thus the file system is inconsistent again; if left unresolved, this write would result in a space leak, as block 5 would never be used by the file system.

Crash Scenarios: Just the data block is written to disk

In this case, the data is on disk, but there is no inode that points to it and no bitmap that even says the block is allocated. Thus, it is as if the write never occurred. This case is not a problem at all, from the perspective of file-system crash consistency

Crash Scenarios: The inode and bitmap are written to disk, but not data

In this case, the file system metadata is completely consistent: the inode has a pointer to block 5, the bitmap indicates that 5 is in use, and thus everything looks OK from the perspective of the file system's metadata. But there is one problem: 5 has garbage in it again

Crash Scenarios: Just the updated inode is written to disk

In this case, the inode points to the disk address (5) where Db was about to be written, but Db has not yet been written there. Thus, if we trust that pointer, we will read garbage data from the disk (the old contents of disk address 5). Further, we have a new problem, which we call a *file-system inconsistency.* The on-disk bitmap is telling us that data block 5 has not been allocated, but the inode is saying that it has. The disagreement between the bitmap and the inode is an inconsistency in the data structures of the file system; to use the file system, we must somehow resolve this problem

Crash Scenarios: The bitmap and data block are written, but not the inode

In this case, we again have an inconsistency between the inode and the data bitmap. However, even though the block was written and the bitmap indicates its usage, we have no idea which file it belongs to, as no inode points to the file

Crash Scenarios: The inode and the data block are written, but not the bitmap

In this case, we have the inode pointing to the correct data on disk, but again have an inconsistency between the inode and the old version of the bitmap (B1). Thus, we once again need to resolve the problem before using the file system.

fail-stop fault model

In this model, a disk can be in exactly one of two states: working or failed. With a working disk, all blocks can be read or written. In contrast, when a disk has failed, we assume it is permanently lost. One critical aspect of the fail-stop model is what it assumes about fault detection. Specifically, when a disk has failed, we assume that this is easily detected.

fail-partial disk failure model

In this view, disks can still fail in their entirety (as was the case in the traditional fail-stop model); however, disks can also seemingly be working and have one or more blocks become inaccessible (i.e., LSEs) or hold the wrong contents (i.e., corruption). Thus, when accessing a seemingly-working disk, once in a while it may either return an error when trying to read or write a given block (a non-silent partial fault), and once in a while it may simply return the wrong data (a silent partial fault).

LFS: Garbage Collection: Finding Live Blocks

Include at the start of each LFS segment a segment summary block that gives, for each data block D in that segment, D's inode number and D's offset within its file. Read the inode named by <in, of> from LFS and compare its pointer at that offset with D's address, to reveal whether D is live (=) or garbage (≠).

File Storage Layout Options: Indexed structure (FFS)

Index block points to many other blocks

inode

Inside each inode is virtually all of the information you need about a file: its type (e.g., regular file, directory, etc.), its size, the number of blocks allocated to it, protection information (such as who owns the file, as well as who can access it), some time information, including when the file was created, modified, or last accessed, as well as information about where its data blocks reside on disk (e.g., pointers of some kind). We refer to all such information about a file as metadata; in fact, any information inside the file system that isn't pure user data is often referred to as such

Hybrid Approach: Paging and Segments

Instead of having a single page table for the entire address space of the process, why not have one per logical segment? Remember that with segmentation, we had a base register that told us where each segment lived in physical memory, and a bound or limit register that told us the size of said segment. In our hybrid, we still have those structures in the MMU; here, we use the base not to point to the segment itself but rather to hold the physical address of the page table of that segment. The bounds register is used to indicate the end of the page table (i.e., how many valid pages it has). However, this approach is not without problems. First, it still requires us to use segmentation; as we discussed before, segmentation is not quite as flexible as we would like, as it assumes a certain usage pattern of the address space. Second, this hybrid causes external fragmentation to arise again. While most of memory is managed in page-sized units, page tables can now be of arbitrary size (in multiples of PTEs).

The Multi-Level Index

Instead of pointing to a block that contains user data, it points to a block that contains more pointers, each of which points to user data. Thus, an inode may have some fixed number of direct pointers (e.g., 12) and a single indirect pointer. If a file grows large enough, an indirect block is allocated (from the data-block region of the disk), and the inode's slot for an indirect pointer is set to point to it. You can also have *double indirect pointers*, which refer to a block that contains pointers to indirect blocks, each of which contains pointers to data blocks (or triple-indirect pointers, and so on). Overall, this imbalanced tree is referred to as the multi-level index approach to pointing to file blocks
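A quick back-of-the-envelope calculation of the maximum file size under assumed parameters (12 direct pointers, 4-KB blocks, 4-byte pointers):

```python
BLOCK = 4096           # assumed block size in bytes
PTRS  = BLOCK // 4     # pointers per indirect block (4-byte disk addresses)

direct          = 12 * BLOCK              # 12 direct pointers
single_indirect = PTRS * BLOCK            # one indirect block of 1024 pointers
double_indirect = PTRS * PTRS * BLOCK     # 1024 indirect blocks

max_file = direct + single_indirect + double_indirect
print(max_file / 2**30)   # ~4.004 GB with 12 direct + 1 single + 1 double indirect
```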

Disk Scheduling: Scan Variants- C-SCAN (Circular SCAN)

Instead of sweeping in both directions across the disk, the algorithm only sweeps from outer-to-inner, and then resets at the outer track to begin again. Doing so is a bit more fair to inner and outer tracks, as pure back-and-forth SCAN favors the middle tracks

Redundant Array of Inexpensive Disks (RAID)

Instead of using one large disk to store data, one can use many smaller disks (because they are cheaper). An approach to using many low-cost drives as a group to improve performance, yet also provides a degree of redundancy that makes the chance of data loss remote. Externally, a RAID looks like a disk: a group of blocks one can read or write. Internally, the RAID is a complex beast, consisting of multiple disks, memory (both volatile and non-), and one or more processors to manage the system. A hardware RAID is very much like a computer system, specialized for the task of managing a group of disks

File System

Interface provides operations involving: • Files • Directories (a special kind of file)

Problems with Paging: Memory Consumption

• Internal fragmentation • Page table space

Non-preemptive scheduler

Job runs until it voluntarily yields CPU: • job blocks on an event (e.g., I/O or P(sem)) • job explicitly yields • job terminates

FFS: Free List

Keeps track of blocks not in use Possible data structures: 1. linked list of free blocks - inefficient 2. linked list of metadata blocks that in turn point to free blocks - simple and efficient 3. bitmap - good for contiguous allocation

Recursive Update Problem: LFS Solution

LFS cleverly avoids this problem with the inode map Even though the location of an inode may change, the change is never reflected in the directory itself; rather, the imap structure is updated while the directory holds the same name-to-inode-number mapping. Thus, through indirection, LFS avoids the recursive update problem.

LFS: Crash Recovery

LFS writes to disk: CR and segment. After a crash: • Find most recent consistent CR (see below) • Roll forward by reading next segment for updates. Crash-resistant atomic CR update: • Two copies of CR: at start and end of disk. • Updates alternate between them. • Each CR has timestamp ts(CR,start) at start and ts(CR,end) at end. --CR consistent if ts(CR,start)=ts(CR,end) • Use consistent CR with largest timestamp

TLB Issue: Replacement Policy Solutions: least-recently-used (LRU)

LRU tries to take advantage of locality in the memory-reference stream, assuming it is likely that an entry that has not recently been used is a good candidate for eviction

RAID advantages: capacity

Large data sets demand large disks

Implementation of Ticket Lottery

Let's assume we keep the processes in a list, each with some number of tickets. To make a scheduling decision, we first have to pick a random number (the winner) from the total number of tickets. The code walks the list of processes, adding each ticket value to counter until the value exceeds winner. Once that is the case, the current list element is the winner
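A minimal sketch of that decision loop, assuming jobs are kept as (name, tickets) pairs:

```python
import random

def lottery_pick(jobs):
    """jobs: list of (name, tickets). Returns the winning job's name."""
    total = sum(t for _, t in jobs)
    winner = random.randint(0, total - 1)   # pick the winning ticket
    counter = 0
    for name, tickets in jobs:
        counter += tickets
        if counter > winner:                # counter exceeds winner: this job wins
            return name

# e.g., lottery_pick([("A", 100), ("B", 50), ("C", 250)])
```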

Multi-level Feedback Queue Scheduling

Like multilevel queue, but assignments are not static Jobs start at the top: • Use your quantum? move down • Don't? Stay where you are Need parameters for: • Number of queues • Scheduling alg. per queue • When to upgrade/downgrade job

Virtual Address Components: Page offset

Lower bits of the address *DO NOT CHANGE IN TRANSLATION* If the logical address space is 2^m and the page size is 2^n, the page offset is the lower n bits of the address

MQMS vs SQMS

MQMS has a distinct advantage over SQMS in that it should be inherently more scalable: as the number of CPUs grows, so too does the number of queues, and thus lock and cache contention should not become a central problem. MQMS also intrinsically provides cache affinity: jobs stay on the same CPU and thus reap the advantage of reusing cached contents therein.

Memory Management Common Errors: Forgetting To Allocate Memory (Segfault)

Many routines expect memory to be allocated before you call them. When you run code that expects allocated memory (but you did not allocate yet), it will likely lead to a *segmentation fault*

File System Implementation Basics: Mappings

Mappings: • Directories: file name ➜ low-level name • Index structures: low-level name➜ block • Free space maps: locate free blocks (near each other) To exploit locality of file references: • Group directories together on disk • Prefer (large) sequential writes/reads • Defragmentation: Relocation of blocks: --Blocks for a file appear on disk in sequence --Files for directories appear near each other

Problems with MQMS: Solution to Load Imbalance: Migration

Migration=moving a process from one CPU to another By migrating a job from one CPU to another, true load balance can be achieved

dynamic partitioning of memory

Modern systems, in contrast, employ a dynamic partitioning approach. Specifically, many modern operating systems integrate virtual memory pages and file system pages into a unified page cache. In this way, memory can be allocated more flexibly across virtual memory and file system, depending on which needs more memory at a given time.

Supporting Virtual Memory

Modify Page Tables with a valid bit (= "present bit") Page in memory -> valid = 1 Page not in memory -> PT lookup triggers page fault

Page Replacement Algorithms: Two-Handed Clock Algorithm

More complex Clock algorithm: Leading hand clears use bit • slowly clears history • finds victim candidates Trailing hand evicts frames with use bit set to 0

Solid State Drives (Flash)

Most SSDs based on NAND-flash (retains its state for months to years without power)

Multi-Level Queue Scheduling

Multiple ready queues based on job "type" (system jobs; interactive jobs; background batch jobs) Different queues may be scheduled using different algorithms

Disk operations

Must specify: • cylinder # (distance from spindle) • head # • sector # • transfer size • memory address Operations: • seek • read • write

fsck: Free blocks

Next, fsck scans the inodes, indirect blocks, double indirect blocks, etc., to build an understanding of which blocks are currently allocated within the file system. It uses this knowledge to produce a correct version of the allocation bitmaps; thus, if there is any inconsistency between bitmaps and inodes, it is resolved by trusting the information within the inodes. The same type of check is performed for all the inodes, making sure that all inodes that look like they are in use are marked as such in the inode bitmaps.

How to parameterize MLFQ (how many levels/queues? how long should the time-slice be? How long should S be in priority boost?)

There is no definitive answer; only experience with workloads and subsequent tuning of the scheduler will lead to a satisfactory balance. For example, most MLFQ variants allow the time-slice length to vary across queues. The high-priority queues are usually given short time slices; they consist of interactive jobs, after all, and thus quickly alternating between them makes sense. The low-priority queues, in contrast, contain long-running jobs that are CPU-bound; hence, longer time slices work well

Disk Scheduling

Objective: minimize seek time

Basic Flash Operations: program (a page)

Once a block has been erased, the program command can be used to change some of the 1's within a page to 0's, and write the desired contents of a page to the flash. Programming a page is less expensive than erasing a block, but more costly than reading a page, usually taking around 100s of microseconds on modern flash chips.

Checkpointing

Once this transaction (writing to the log) is safely on disk, we are ready to overwrite the old structures in the file system; this process is called checkpointing

TLB Issue: Context Switches Solutions: Flush

One approach is to simply flush the TLB on context switches, thus emptying it before running the next process the flush operation simply sets all valid bits to 0, essentially clearing the contents of the TLB. By flushing the TLB on each context switch, we now have a working solution, as a process will never accidentally encounter the wrong translations in the TLB. However, there is a cost: each time a process runs, it must incur TLB misses as it touches its data and code pages. If the OS switches between processes frequently, this cost may be high.

Journaling Tricky Case: Block Reuse Solution

One could, for example, never reuse blocks until the delete of said blocks is checkpointed out of the journal. What Linux ext3 does instead is to add a new type of record to the journal, known as a revoke record. In the case above, deleting the directory would cause a revoke record to be written to the journal. When replaying the journal, the system first scans for such revoke records; any such revoked data is never replayed, thus avoiding the problem mentioned above.

flash translation layer (FTL): goals: performance

One key will be to utilize multiple flash chips in parallel; all modern SSDs use multiple chips internally to obtain higher performance Another performance goal will be to reduce write amplification, which is defined as the total write traffic (in bytes) issued to the flash chips by the FTL divided by the total write traffic (in bytes) issued by the client to the SSD.

Direct Pointer Inodes

One simple approach would be to have one or more direct pointers (disk addresses) inside the inode; each pointer refers to one disk block that belongs to the file. However, this is not a good approach, especially for larger files: a file's maximum size is capped by the fixed number of pointers that fit in the inode.

Clock algorithm modification: Considering Dirty Pages

One small modification to the clock algorithm that is commonly made is the additional consideration of whether a page has been modified or not while in memory. The reason for this: if a page has been modified and is thus dirty, it must be written back to disk to evict it, which is expensive. If it has not been modified (and is thus clean), the eviction is free; the physical frame can simply be reused for other purposes without additional I/O. Thus, some VM systems prefer to evict clean pages over dirty pages. To support this behavior, the hardware should include a modified bit (a.k.a. dirty bit). This bit is set any time a page is written, and thus can be incorporated into the page-replacement algorithm.
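One possible sketch of a clock variant that folds in the dirty bit (a hedged illustration, not the exact algorithm of any particular system; write_back is a hypothetical stand-in for scheduling the page's contents to disk):

```python
def clock_evict(frames, hand, write_back):
    """frames: list of dicts with 'use' and 'dirty' bits. Returns victim index."""
    n = len(frames)
    while True:
        f = frames[hand]
        if f["use"]:
            f["use"] = 0          # second chance: clear use bit, keep sweeping
        elif f["dirty"]:
            write_back(f)         # pay the write-back I/O now...
            f["dirty"] = 0        # ...so this page is cheap to evict next sweep
        else:
            return hand           # not recently used and clean: free eviction
        hand = (hand + 1) % n
```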

FFS: Directory Structure

Originally array of 16 byte entries: • 14 byte file name • 2 byte i-node number Now linked lists. Each entry contains: • 4-byte inode number • Length of name • Name (UTF8 or some other Unicode encoding) First entry is ".", points to self (current directory) Second entry is "..", points to parent inode (of current directory)

File System Design Challenges: Performance

Overcome limitations of disks (leverage spatial locality to avoid seeks and to transfer block sequences)

Basic Flash Operations: overview

Pages start in an INVALID state. By erasing the block that a page resides within, you set the state of the page (and all pages within that block) to ERASED, which resets the content of each page in the block but also (importantly) makes them programmable. When you program a page, its state changes to VALID, meaning its contents have been set and can be read. Reads do not affect these states (although you should only read from pages that have been programmed). Once a page has been programmed, the only way to change its contents is to erase the entire block within which the page resides. Example trace:
iiii - Initial: pages in block are invalid (i)
Erase() → EEEE - State of pages in block set to erased (E)
Program(0) → VEEE - Program page 0; state set to valid (V)
Program(0) → error - Cannot re-program page after programming
Program(1) → VVEE - Program page 1
Erase() → EEEE - Contents erased; all pages programmable

RAID-4: Parity for Errors: How to Compute Parity

Parity P(Bi, Bj, Bk) = XOR(Bi, Bj, Bk): keeps an even number of 1's in each stripe. Theorem: XOR(Bj, Bk, P(Bi, Bj, Bk)) = Bi
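A tiny runnable check of both the parity definition and the recovery theorem, on toy 4-bit block values:

```python
from functools import reduce

blocks = [0b1011, 0b0110, 0b1100]           # Bi, Bj, Bk (toy values)
P = reduce(lambda a, b: a ^ b, blocks)      # parity keeps an even # of 1s per bit
# Recover Bi from the survivors plus parity: XOR(Bj, Bk, P) == Bi
assert blocks[1] ^ blocks[2] ^ P == blocks[0]
```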

RAID-4: Parity for Errors

Parity block for each stripe - saves space. Read block; write full stripe (including parity)

RAID-5: Rotating Parity

Parity block for each stripe - saves space. Read block; write full stripe (including parity) Instead of being on the same disk at the end of the stripe, the new parity blocks rotate through the disks

RAID Level 4: Saving Space With Parity

Parity-based approaches attempt to use less capacity and thus overcome the huge space penalty paid by mirrored systems. They do so at a cost, however: performance. To compute parity, we need to use a mathematical function that enables us to withstand the loss of any one block from our stripe. It turns out the simple function XOR does the trick quite nicely

Evaluating A RAID: performance

Performance is somewhat challenging to evaluate, because it depends heavily on the workload presented to the disk array. Thus, before evaluating performance, we will first present a set of typical workloads that one should consider

Page Replacement Algorithms: Random

Pick any page to eject at random • Used mainly for comparison

Linux Virtual Memory: Page Table Structure

Probably the biggest change in recent years is the move from 32-bit x86 to 64-bit x86. Moving to a 64-bit address affects the page table structure in x86 in the expected manner. Because x86 uses a multi-level page table, current 64-bit systems use a four-level table. The full 64-bit nature of the virtual address space is not yet in use, however; rather, only the bottom 48 bits are. Thus, a virtual address can be viewed as follows: • The top 16 bits are unused (and thus play no role in translation) • The bottom 12 bits (due to the 4-KB page size) are used as the offset (and hence just used directly, and not translated) • The middle 36 bits of the virtual address take part in the translation The P1 portion of the address is used to index into the topmost page directory, and the translation proceeds from there, one level at a time, until the actual page of the page table is indexed by P4, yielding the desired page table entry.

OPT is a stack algorithm

Proof non-trivial.

RAID Level 1: Mirroring: Analysis: reliability

RAID-1 does well. It can tolerate the failure of any one disk. RAID-1 can actually do better than this, with a little luck. More generally, a mirrored system (with mirroring level of 2) can tolerate 1 disk failure for certain, and up to N/2 failures depending on which disks fail.

RAID Level 1: Mirroring: Analysis: capacity

RAID-1 is expensive; with the mirroring level = 2, we only obtain half of our peak useful capacity. With N disks of B blocks, RAID-1 useful capacity is (N · B)/2.

RAID-2 and RAID-3

RAID-2: • Bit level striping • Multiple ECC disks (instead of parity) RAID-3: • Byte level striping • Dedicated parity disk RAID-2 and RAID-3 are not used in practice

Rate of I/O (R_I/O)

R_(I/O)=Size_transfer/(T_(I/O))

RAID Level 5: Rotating Parity: Analysis: performance (random read)

Random read performance is a little better, because we can now utilize all disks. Random write performance improves noticeably over RAID-4, as it allows for parallelism across requests. In fact, we can generally assume that given a large number of random requests, we will be able to keep all the disks about evenly busy. If that is the case, then our total bandwidth for small writes will be N/4 · R MB/s The factor of four loss is due to the fact that each RAID-5 write still generates 4 total I/O operations, which is simply the cost of using parity-based RAID.

How MLFQ Sets Priority

Rather than giving a fixed priority to each job, MLFQ varies the priority of a job based on its observed behavior. If, for example, a job repeatedly relinquishes the CPU while waiting for input from the keyboard, MLFQ will keep its priority high, as this is how an interactive process might behave. If, instead, a job uses the CPU intensively for long periods of time, MLFQ will reduce its priority. In this way, MLFQ will try to learn about processes as they run, and thus use the history of the job to predict its future behavior

FFS: Steps to reading /foo/bar/baz (Example)

Read & Open: (1) inode #2 (root always has inumber 2), find root's blocknum (912) (2) root directory (in block 912), find foo's inumber (31) (3) inode #31, find foo's blocknum (194) (4) foo (in block 194), find bar's inumber (73) (5) inode #73, find bar's blocknum (991) (6) bar (in block 991), find baz's inumber (40) (7) inode #40, find data blocks (302, 913, 301) (8) data blocks (302, 913, 301)

Real-Time Scheduling

Real-time processes have timing constraints (expressed as deadlines or rate requirements) Common RT scheduling policies: • Earliest deadline first (EDF) (priority = deadline) • Priority Inheritance: High priority process (needing lock) temporarily donates priority to lower priority process (with lock). This avoids priority inversion

OS Support for Paging: Process Termination

Release pages

RAID Level 4: Parity: Analysis: reliability

Reliability is also quite easy to understand: RAID-4 tolerates 1 disk failure and no more. If more than one disk is lost, there is simply no way to reconstruct the lost data

Journal: Recovery protocol

Replay journal from start, writing blocks as indicated by checkpoint steps. for TxBegin ... TxEnd: - if TxEnd present then redo writes to final locations following TxBegin - else ignore journal entries following TxBegin Infinite Journal -> Finite Journal: • introduce journal super block (JSB) as first entry in journal: JSB gives start / end entries of journal. • view journal as a circular log • delete journal entry once writes in checkpoint step complete.

File System Design Challenges: Reliability

Resilient to OS crashes and HW failure

Disk Scheduling: Shortest Seek Time First (SSTF)

SSTF orders the queue of I/O requests by track, picking requests on the nearest track to complete first
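A minimal sketch, assuming each pending request carries a track number:

```python
from collections import namedtuple

Request = namedtuple("Request", "track")

def sstf_next(requests, head_track):
    """Pick the pending request on the track nearest the current head position."""
    return min(requests, key=lambda r: abs(r.track - head_track))

# e.g., sstf_next([Request(30), Request(85), Request(12)], head_track=20)
# -> Request(track=12), since |12 - 20| = 8 is the smallest seek distance
```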

Preemptive scheduler

Same as non-preemptive • job blocks on an event (e.g., I/O or P(sem)) • job explicitly yields • job terminates Plus: • Timer and other interrupts (when jobs cannot be trusted to yield explicitly) • Incurs some context switching overhead

Disk Scheduling: FIFO

Schedule disk operations in order they arrive

Memory Issues for OS to Handle: Process Termination

Second, the OS must do some work when a process is terminated, reclaiming all of its memory for use by other processes or the OS. Upon termination of a process, the OS thus puts its memory back on the free list and cleans up any associated data structures as needed.

File Storage Layout Options: Log structure (LFS)

Sequence of segments, each containing updated blocks Idea: Buffer sets of writes and store as single log entry ("segment") on disk. File system implemented as a log! Technological drivers: • System memories are getting larger (Larger disk cache + Reads mostly serviced by cache + Traffic to disk mostly writes) • Sequential disk access performs better. (Avoid seeks for even better performance.)

Writing A File To Disk

Similar to reading: 1. First, the file must be opened (as in reading) 2. Then, the application can issue write() calls to update the file with new contents 3. the file is closed Unlike reading, writing to the file may also allocate a block (unless the block is being overwritten, for example). When writing out a new file, each write not only has to write data to disk but has to first decide which block to allocate to the file and thus update other structures of the disk accordingly (e.g., the data bitmap and inode). Thus, each write to a file logically generates five I/Os: one to read the data bitmap (which is then updated to mark the newly-allocated block as used), one to write the bitmap (to reflect its new state to disk), two more to read and then write the inode (which is updated with the new block's location), and finally one to write the actual block itself.

open()

Syscall to open a file. It returns a *file descriptor*: an integer, private per process, used in UNIX systems to access files. Thus, once a file is opened, you use the file descriptor to read or write the file

Disk Scheduling: Elevator (a.k.a. SCAN or C-SCAN)

Solution to starvation problem The algorithm simply moves back and forth across the disk servicing requests in order across the tracks. Let's call a single pass across the disk (from outer to inner tracks, or inner to outer) a sweep. Thus, if a request comes for a block on a track that has already been serviced on this sweep of the disk, it is not handled immediately, but rather queued until the next sweep (in the other direction).

Memory Management Common Errors: Freeing Memory Before You Are Done With It (Dangling Pointer)

Sometimes a program will free memory before it is finished using it. The subsequent use can crash the program, or overwrite valid memory

data integrity/protection

Specifically, how should a file system or storage system ensure that data is safe, given the unreliable nature of modern storage devices?

RAID advantages: reliability

Spreading data across multiple disks (without RAID techniques) makes the data vulnerable to the loss of a single disk; with some form of redundancy, RAIDs can tolerate the loss of a disk and keep operating as if nothing were wrong.

File System Design Challenges: Persistence

Storage for long term.

Journaling Tricky Case: Block Reuse

Suppose you are using some form of metadata journaling (and thus data blocks for files are not journaled). Let's say you have a directory called foo. The user adds an entry to foo (say by creating a file), and thus the contents of foo (because directories are considered metadata) are written to the log; assume the location of the foo directory data is block 1000. At this point, the user deletes everything in the directory and the directory itself, freeing up block 1000 for reuse. Finally, the user creates a new file (say bar), which ends up reusing the same block (1000) that used to belong to foo. The inode of bar is committed to disk, as is its data; note, however, because metadata journaling is in use, only the inode of bar is committed to the journal; the newly-written data in block 1000 in the file bar is not journaled. Now assume a crash occurs and all of this information is still in the log. During replay, the recovery process simply replays everything in the log, including the write of directory data in block 1000; the replay thus overwrites the user data of current file bar with old directory contents!

I/O Time

T_(I/O) = T_seek + T_rotation + T_transfer

Simple Solution: Bigger Pages

Take our 32-bit address space again, but this time assume 16KB pages. We would thus have an 18-bit VPN plus a 14-bit offset. Assuming the same size for each PTE (4 bytes), we now have 2^18 entries in our linear page table and thus a total size of 1MB per page table, a factor-of-four reduction in the size of the page table. The major problem with this approach, however, is that big pages lead to waste within each page, a problem known as internal fragmentation

Hardware Support of Memory Management: Exceptions

The CPU must be able to generate exceptions in situations where a user program tries to access memory illegally (with an address that is "out of bounds"); in this case, the CPU should stop executing the user program and arrange for the OS "out-of-bounds" exception handler to run.

flash translation layer (FTL)

The FTL takes read and write requests on logical blocks (that comprise the device interface) and turns them into low-level read, erase, and program commands on the underlying physical blocks and physical pages (that comprise the actual flash device). The FTL should accomplish this task with the goal of delivering excellent performance and high reliability.

Garbage Collection Problem: LFS Solution

The LFS cleaner works on a segment-by-segment basis, thus clearing up large chunks of space for subsequent writing. The basic cleaning process works as follows: periodically, the LFS cleaner reads in a number of old (partially-used) segments, determines which blocks are live within these segments, and then writes out a new set of segments with just the live blocks within them, freeing up the old ones for writing. Specifically, we expect the cleaner to read in M existing segments, compact their contents into N new segments (where N < M), and then write the N segments to disk in new locations. The old M segments are then freed and can be used by the file system for subsequent writes.

Using History: LFU (Least Frequently Used) Replacement

The Least-Frequently-Used (LFU) policy replaces the least-frequently-used page when an eviction must take place. We should also note that the opposite of this algorithm exists: Most-Frequently-Used (MFU)

Linux Virtual Memory: The Page Cache

The Linux page cache is unified, keeping pages in memory from three primary sources: (1) memory-mapped files; (2) file data and metadata from devices (usually accessed by directing read() and write() calls to the file system); and (3) heap and stack pages that comprise each process (sometimes called anonymous memory, because there is no named file underneath it, but rather swap space). These entities are kept in a page cache hash table, allowing for quick lookup when said data is needed. The page cache tracks if entries are clean (read but not updated) or dirty (a.k.a. modified). Dirty data is periodically written to the backing store (i.e., to a specific file for file data, or to swap space for anonymous regions) by background threads (called pdflush), thus ensuring that modified data is eventually written back to persistent storage. This background activity either takes place after a certain time period or if too many pages are considered dirty (both configurable parameters).

Goals of Memory Virtualization: Transparency

The OS should implement virtual memory in a way that is invisible to the running program. Thus, the program shouldn't be aware of the fact that memory is virtualized; rather, the program behaves as if it has its own private physical memory. Behind the scenes, the OS (and hardware) does all the work to multiplex memory among many different jobs, and hence implements the illusion.

Goals of Memory Virtualization: Protection

The OS should make sure to protect processes from one another as well as the OS itself from processes. When one process performs a load, a store, or an instruction fetch, it should not be able to access or affect in any way the memory contents of any other process or the OS itself (that is, anything outside its address space). Protection thus enables us to deliver the property of isolation among processes; each process should be running in its own isolated cocoon, safe from the ravages of other faulty or even malicious processes.

Goals of Memory Virtualization: Efficiency

The OS should strive to make the virtualization as efficient as possible, both in terms of time (i.e., not making programs run much more slowly) and space (i.e., not using too much memory for structures needed to support virtualization).

Wear Leveling

The basic idea is simple: because multiple erase/program cycles will wear out a flash block, the FTL should try its best to spread that work across all the blocks of the device evenly. In this manner, all blocks will wear out at roughly the same time, instead of a few "popular" blocks quickly becoming unusable. The basic log-structuring approach does a good initial job of spreading out write load, and garbage collection helps as well. However, sometimes a block will be filled with long-lived data that does not get over-written; in this case, garbage collection will never reclaim the block, and thus it does not receive its fair share of the write load. To remedy this problem, the FTL must periodically read all the live data out of such blocks and re-write it elsewhere, thus making the block available for writing again. This process of wear leveling increases the write amplification of the SSD, and thus decreases performance as extra I/O is required to ensure that all blocks wear at roughly the same rate.

Crash Recovery And The Log: Making Sure Segment is Valid: Roll Forward

The basic idea is to start with the last checkpoint region, find the end of the log (which is included in the CR), and then use that to read through the next segments and see if there are any valid updates within it. If there are, LFS updates the file system accordingly and thus recovers much of the data and metadata written since the last checkpoint

FFS Policies: How To Allocate Files and Directories

The basic mantra is simple: keep related stuff together (and its corollary, keep unrelated stuff far apart) FFS employs a simple approach to place *directories*: find the cylinder group with a low number of allocated directories (to balance directories across groups) and a high number of free inodes (to subsequently be able to allocate a bunch of files), and put the directory data and inode in that group For files, FFS does two things: First, it makes sure (in the general case) to allocate the data blocks of a file in the same group as its inode, thus preventing long seeks between inode and data (as in the old file system). Second, it places all files that are in the same directory in the cylinder group of the directory they are in

Solutions to Cache Coherency Problem

The basic solution is provided by the hardware: by monitoring memory accesses, hardware can ensure that the "right thing" happens and that the view of a single shared memory is preserved. One way to do this on a bus-based system is to use an old technique known as *bus snooping*. Also use synchronization to protect shared data

SSD Performance

The biggest difference in performance, as compared to disk drives, is realized when performing random reads and writes; while a typical disk drive can only perform a few hundred random I/Os per second, SSDs can do much better. First, and most dramatic, is the difference in random I/O performance between the SSDs and the lone hard drive. While the SSDs obtain tens or even hundreds of MB/s in random I/Os, this "high performance" hard drive has a peak of just a couple MB/s (in fact, we rounded up to get to 2 MB/s). Second, in terms of sequential performance, there is much less of a difference; while the SSDs perform better, a hard drive is still a good choice if sequential performance is all you need. Third, SSD random read performance is not as good as SSD random write performance. The reason for such unexpectedly good random-write performance is the log-structured design of many SSDs, which transforms random writes into sequential ones and improves performance

inode table

The collection of inodes for all files and directories in a file system; it holds an array of on-disk inodes

Root Directory

The directory hierarchy starts at a root directory and uses some kind of separator to name subsequent sub-directories until the desired file or directory is named

Hard Disk Drives: Interface

The drive consists of a large number of sectors (512-byte blocks), each of which can be read or written. The sectors are numbered from 0 to n − 1 on a disk with n sectors. Thus, we can view the disk as an array of sectors; 0 to n − 1 is thus the address space of the drive Multi-sector operations are possible; indeed, many file systems will read or write 4KB at a time (or more). However, when updating the disk, the only guarantee drive manufacturers make is that a single 512-byte write is atomic; thus, if an untimely power loss occurs, only a portion of a larger write may complete (sometimes called a *torn write*).

Disk Scheduling: SSTF Problem- Drive Geometry

The drive geometry is not available to the host OS; rather, it sees an array of blocks. Fortunately, this problem is rather easily fixed. Instead of SSTF, an OS can simply implement nearest-block-first (NBF), which schedules the request with the nearest block address next.

Reading A File From Disk

The file system must traverse the pathname and thus locate the desired inode: 1. All traversals begin at the root of the file system, in the root directory which is simply called /. The first thing the FS will read from disk is the inode of the root directory (but we need the i-number to find the inode. The root must have a "well known" i-number). 2. Once the inode is read in, the FS can look inside of it to find pointers to data blocks, which contain the contents of the root directory, looking for the next directory in the file path 3. The next step is to recursively traverse the pathname until the desired inode is found 4. The final step of open() is to read the file's inode into memory; the FS then does a final permissions check, allocates a file descriptor for this process in the per-process open-file table, and returns it to the user. 5. The first read will thus read in the first block of the file, consulting the inode to find the location of such a block; it may also update the inode with a new last-accessed time. The read will further update the in-memory open file table for this file descriptor, updating the file offset such that the next read will read the second file block, etc 6. the file is closed

Two Types of Allocated Memory: Stack

The first is called stack memory, and allocations and deallocations of it are managed implicitly by the compiler for you, the programmer; for this reason it is sometimes called *automatic memory*. Whenever you declare a variable in a function, it is added to the stack; when the function ends/returns, that stack memory is automatically deallocated

Swap Space

The first thing we will need to do is to reserve some space on the disk for moving pages back and forth. The size of the swap space is important, as ultimately it determines the maximum number of memory pages that can be in use by a system at a given time.

Finding Inodes: inode map (imap)

The imap is a structure that takes an inode number as input and produces the disk address of the most recent version of the inode. LFS places chunks of the inode map right next to where it is writing all of the other new information. Thus, when appending a data block to a file k, LFS actually writes the new data block, its inode, and a piece of the inode map all together onto the disk

RAID Level 0: Striping: Chunk Size

The number of blocks placed on one disk before moving to the next. Chunk size mostly affects the performance of the array: a small chunk size implies that many files will get striped across many disks, thus increasing the parallelism of reads and writes to a single file; however, the positioning time to access blocks across multiple disks increases, because the positioning time for the entire request is determined by the maximum of the positioning times of the requests across all drives. A big chunk size, on the other hand, reduces such intra-file parallelism, and thus relies on multiple concurrent requests to achieve high throughput. However, large chunk sizes reduce positioning time; if, for example, a single file fits within a chunk and thus is placed on a single disk, the positioning time incurred while accessing it will just be the positioning time of a single disk.

The Optimal Replacement Policy

The optimal replacement policy leads to the fewest number of misses overall. Belady showed that a simple (but, unfortunately, difficult to implement!) approach that *replaces the page that will be accessed furthest in the future* is the optimal policy, resulting in the fewest-possible cache misses.
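A minimal sketch of Belady's rule, given the (normally unknowable) future reference string:

```python
def opt_victim(frames, future_refs):
    """Evict the resident page whose next use lies furthest in the future."""
    def next_use(page):
        return future_refs.index(page) if page in future_refs else float("inf")
    return max(frames, key=next_use)

# e.g., opt_victim({1, 2, 3}, future_refs=[2, 1, 2, 1]) -> 3,
# since page 3 is never referenced again
```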

Page Replacement Algorithms: FIFO

The page brought in earliest is evicted • Ignores usage

Multi-level Page Tables: Page Directory

The page directory thus either can be used to tell you where a page of the page table is, or that the entire page of the page table contains no valid pages. The page directory, in a simple two-level table, contains one entry per page of the page table. It consists of a number of page directory entries (PDE). A PDE (minimally) has a valid bit and a page frame number (PFN), similar to a PTE. However, the meaning of this valid bit is slightly different: if the PDE is valid, it means that at least one of the pages of the page table that the entry points to (via the PFN) is valid, i.e., in at least one PTE on that page pointed to by this PDE, the valid bit in that PTE is set to one. If the PDE is not valid (i.e., equal to zero), the rest of the PDE is not defined.

FTL Organization: direct mapped problems: performance

The performance problems come on each write: the device has to read in the entire block (costly), erase it (quite costly), and then program it (costly). The end result is severe write amplification (proportional to the number of pages in a block) and as a result, terrible write performance, even slower than typical hard drives with their mechanical seeks and rotational delays

Hard Disk Drives: Components: spindle

The platters are all bound together around the spindle, which is connected to a motor that spins the platters around (while the drive is powered on) at a constant (fixed) rate The rate of rotation is often measured in *rotations per minute (RPM)*, and typical modern values are in the 7,200 RPM to 15,000 RPM range

Recursive Update Problem

The problem arises in any file system that never updates in place (such as LFS), but rather moves updates to new locations on the disk. Specifically, whenever an inode is updated, its location on disk changes. If we hadn't been careful, this would have also entailed an update to the directory that points to this file, which then would have mandated a change to the parent of that directory, and so on, all the way up the file system tree

Scheduling Metric: turnaround time

The turnaround time of a job is defined as the time at which the job completes minus the time at which the job arrived in the system: T_turnaround = T_completion − T_arrival

FTL Organization: A Log-Structured FTL: Garbage Collection

The process of finding garbage blocks (also called dead blocks) and reclaiming them for future use The basic process is simple: find a block that contains one or more garbage pages, read in the live (non-garbage) pages from that block, write out those live pages to the log, and (finally) reclaim the entire block for use in writing.

Address Space Layout

The program code lives at the top of the address space. Code is static (and thus easy to place in memory), so we can place it at the top of the address space and know that it won't need any more space as the program runs Next, we have the two regions of the address space that may grow (and shrink) while the program runs: Those are the heap (at the top) and the stack (at the bottom). We place them like this because each wishes to be able to grow, and by putting them at opposite ends of the address space, we can allow such growth: they just have to grow in opposite directions. The heap thus starts just after the code (at 1KB) and grows downward (say when a user requests more memory via malloc()); The stack starts at 16KB and grows upward (say when a user makes a procedure call)

Memory Management Common Errors: Freeing Memory Repeatedly (Double Free)

The result of doing so is undefined. As you can imagine, the memory-allocation library might get confused and do all sorts of weird things; crashes are a common outcome.

FTL Organization: A Log-Structured FTL: Mapping Table Size

The second cost of log-structuring is the potential for extremely large mapping tables, with one entry for each 4-KB page of the device. Thus, this page-level FTL scheme is impractical.

Solving Gaming the Scheduler (2 cards above)

The solution here is to perform better accounting of CPU time at each level of the MLFQ. Instead of forgetting how much of a time slice a process used at a given level, the scheduler should keep track; once a process has used its allotment, it is demoted to the next priority queue. Whether it uses the time slice in one long burst or many small ones does not matter. Replace the rules: • Rule 4a: If a job uses up an entire time slice while running, its priority is reduced (i.e., it moves down one queue). • Rule 4b: If a job gives up the CPU before the time slice is up, it stays at the same priority level. with the single rule: • Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).

RAID Level 4: Parity: Random Writes: subtractive parity

The subtractive method works in three steps. First, we read in the old data at the block we want to write (C_old) and the old parity (P_old). Then, we compare the old data and the new data; if they are the same, we know the parity bit will also remain the same. If, however, they are different, then we must flip the old parity bit to the opposite of its current state. We can express this whole mess neatly with XOR (where ⊕ is the XOR operator): P_new = (C_old ⊕ C_new) ⊕ P_old
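The same update on assumed toy values, checked against the full-stripe definition:

```python
C_old, C_new, P_old = 0b1010, 0b0011, 0b0110   # toy 4-bit values
P_new = (C_old ^ C_new) ^ P_old                 # flip parity bits where data changed

# Sanity check: if P_old was the XOR of C_old with the other (unchanged)
# blocks R, then the new parity must equal C_new XOR R.
R = P_old ^ C_old                               # XOR of the untouched blocks
assert P_new == C_new ^ R
```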

superblock

The superblock contains information about this particular file system, including, for example, how many inodes and data blocks are in the file system, where the inode table begins, and so forth. It will likely also include a magic number of some kind to identify the file system type

Virtual Address Components: Page number

The upper bits of the address Must be translated into physical frame number If logical address space is 2^m and page size is 2^n the page number is the upper m-n bits of the address

How Much To Buffer?

The way to think about this is that every time you write, you pay a fixed overhead of the positioning cost. Thus, how much do you have to write in order to amortize that cost? The more you write, the better (obviously), and the closer you get to achieving peak bandwidth. To obtain a concrete answer, let's assume we are writing out D MB. The time to write out this chunk of data (T_write) is the positioning time T_position plus the time to transfer D (D/R_peak), or: T_write = T_position + D/R_peak And thus the effective rate of writing (R_effective), which is just the amount of data written divided by the total time to write it, is: R_effective = D/T_write = D/(T_position + (D/R_peak)) We want the effective rate to be some fraction F of the peak rate, where 0 < F < 1; this means we want R_effective = F × R_peak. At this point, we can solve for D: D = F/(1 − F) × R_peak × T_position
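A worked instance with assumed (illustrative) numbers: F = 0.9, R_peak = 100 MB/s, T_position = 10 ms:

```python
# Buffer-size calculation; the specific numbers are assumptions for illustration.
F, R_peak, T_position = 0.9, 100.0, 0.01   # fraction of peak, MB/s, seconds
D = F / (1 - F) * R_peak * T_position      # D = 0.9/0.1 * 100 * 0.01
print(D)  # 9.0 -> buffer ~9 MB per write to reach 90% of peak bandwidth
```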

Memory Issues for OS to Handle: Context Switches

There is only one base and bounds register pair on each CPU, after all, and their values differ for each running program, as each program is loaded at a different physical address in memory. Thus, the OS must save (on the PCB) and restore the base-and-bounds pair when it switches between processes

Hard Disk Drives: cache (track buffer)

This cache is just some small amount of memory (usually around 8 or 16 MB) which the drive can use to hold data read from or written to the disk On writes, the drive has a choice: should it acknowledge the write has completed when it has put the data in its memory, or after the write has actually been written to disk? The former is called write-back caching (or sometimes immediate reporting), and the latter write through

Andrew File System (AFS)

This file system distributes, stores, and joins files on networked computers, making it possible for users to access information located on any computer in a network Goal: • Support large numbers of clients

internal structure

This part of the device is implementation specific and is responsible for implementing the abstraction the device presents to the system. Very simple devices will have one or a few hardware chips to implement their functionality; more complex devices will include a simple CPU, some general purpose memory, and other device-specific chips to get their job done

Solution #3: Other Approaches: copy-on-write (COW)

This technique never overwrites files or directories in place; rather, it places new updates to previously unused locations on disk. After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures. Doing so makes keeping the file system consistent straightforward

Thread Scheduling

Threads share code & data segments so Option 1: Ignore this fact Option 2: Gang scheduling- all threads of a process run together Option 3: Space-based affinity- assign tasks to processors (Improve cache hit ratio)

Job Execution time

Time needed to run the task without contention

How system writes to Journal (avoid data corruption in the journal)

To avoid this problem, the file system issues the transactional write in two steps: First, it writes all blocks except the TxE block to the journal, issuing these writes all at once When those writes complete, the file system issues the write of the TxE block, thus leaving the journal in this final, safe state
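In sketch form (journal.write and journal.flush are hypothetical stand-ins; the flush barrier between the two steps is the crucial part):

```python
def journal_commit(journal, txb, blocks, txe):
    journal.write(txb)       # step 1: TxB plus all logged blocks...
    for b in blocks:
        journal.write(b)
    journal.flush()          # ...and wait until they are durably on disk
    journal.write(txe)       # step 2: only now write TxE to seal the transaction
    journal.flush()          # the transaction is committed once TxE is durable
```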

Creating a File

To create a file, the file system must not only allocate an inode, but also allocate space within the directory containing the new file. The total amount of I/O traffic to do so is quite high: one read to the inode bitmap (to find a free inode), one write to the inode bitmap (to mark it allocated), one write to the new inode itself (to initialize it), one to the data of the directory (to link the high-level name of the file to its inode number), and one read and write to the directory inode to update it. If the directory needs to grow to accommodate the new entry, additional I/Os (i.e., to the data bitmap, and the new directory block) will be needed too.

Crash Recovery And The Log: Making Sure CR is Valid

To ensure that the CR update happens atomically, LFS actually keeps two CRs, one at either end of the disk, and writes to them alternately. LFS also implements a careful protocol when updating the CR with the latest pointers to the inode map and other information; specifically, it first writes out a header (with timestamp), then the body of the CR, and then finally one last block (also with a timestamp). If the system crashes during a CR update, LFS can detect this by seeing an inconsistent pair of timestamps. LFS will always choose to use the most recent CR that has consistent timestamps, and thus consistent update of the CR is achieved.

SQMS Problems: Lack of Scalability

To ensure the scheduler works correctly on multiple CPUs, the developers will have inserted some form of locking into the code Locks, unfortunately, can greatly reduce performance, particularly as the number of CPUs in the systems grows As contention for such a single lock increases, the system spends more and more time in lock overhead and less time doing the work the system should be doing

Implementing Historical Algorithms: LRU

To implement it perfectly, we need to do a lot of work. Specifically, upon each page access (i.e., each memory access, whether an instruction fetch or a load or store), we must update some data structure to move this page to the front of the list (i.e., the MRU side) To keep track of which pages have been least- and most-recently used, the system has to do some accounting work on every memory reference.
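A minimal sketch of exact LRU using an ordered map; note the bookkeeping on every access, which is exactly the cost that pushes real systems toward approximations like clock:

```python
from collections import OrderedDict

class LRU:
    def __init__(self, nframes):
        self.nframes, self.frames = nframes, OrderedDict()

    def access(self, page):
        if page in self.frames:
            self.frames.move_to_end(page)        # bookkeeping on EVERY reference
        else:
            if len(self.frames) == self.nframes:
                self.frames.popitem(last=False)  # evict the LRU page (front)
            self.frames[page] = True
```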

Hard Disk Drives: Components: disk head+disk arm

To read and write from the surface, we need a mechanism that allows us to either sense (i.e., read) the magnetic patterns on the disk or to induce a change in (i.e., write) them. This process of reading and writing is accomplished by the disk head; there is one such head per surface of the drive. The disk head is attached to a single disk arm, which moves across the surface to position the head over the desired track.

page table

To record where each virtual page of the address space is placed in physical memory, the operating system usually keeps a per-process data structure known as a page table. The major role of the page table is to store address translations for each of the virtual pages of the address space, thus letting us know where in physical memory each page resides. To translate this virtual address that the process generated, we have to first split it into two components: the virtual page number (VPN), and the offset within the page. With our virtual page number, we can now index our page table and find which physical frame virtual page 1 resides within. This is the physical frame number (PFN) (also sometimes called the physical page number or PPN)
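A sketch of the translation arithmetic, assuming 4-KB pages and a flat (linear) page table:

```python
OFFSET_BITS = 12                      # 4-KB pages -> low 12 bits are the offset

def translate(vaddr, page_table):
    vpn = vaddr >> OFFSET_BITS                  # virtual page number (upper bits)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)   # unchanged by translation
    pfn = page_table[vpn]                       # linear table: index by VPN
    return (pfn << OFFSET_BITS) | offset

# e.g., translate(0x1ABC, {0x1: 0x7}) == 0x7ABC
```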

Sharing

To save memory, sometimes it is useful to share certain memory segments between address spaces. In particular, code sharing is common and still in use in systems today To support sharing, we need a little extra support from the hardware, in the form of *protection bits*

Hard Disk Drives: Seek time

To service a read, the drive has to first move the disk arm to the correct track, in a process known as a seek. Seeks, along with rotations, are one of the most costly disk operations. The seek has many phases: first an acceleration phase as the disk arm gets moving; then coasting, as the arm moves at full speed; then deceleration, as the arm slows down; and finally settling, as the head is carefully positioned over the correct track. The settling time is often quite significant, as the drive must be certain to find the right track

Job Turnaround Time

Total time from when a job arrives to when it finishes execution: execution time + waiting time

RAID-4: Parity for Errors: How to Update Parity

Two approaches: 1. Read all blocks in the stripe and recompute 2. Use subtraction: given old/new data blocks B_old, B_new and old parity block P_old, use the theorem: P_new := XOR(B_old, B_new, P_old)

Fast File System (FFS)

UNIX Fast File System Tree-based, multi-level index Notable Characteristics: • Tree Structure (efficiently find any block of a file) • High Degree/fan-out (minimizes number of seeks + supports sequential reads & writes) • Fixed Structure (implementation simplicity) • Asymmetric (not all data blocks are at the same level + supports large files + small files don't pay large overheads)

Disk Scheduling: Problem with SCAN

Unfortunately, SCAN and its cousins do not represent the best scheduling technology. In particular, SCAN (or SSTF even) does not actually adhere as closely to the principle of SJF as they could. In particular, they ignore rotation

Flash Reliability

Unlike mechanical disks, which can fail for a wide variety of reasons, flash chips are pure silicon and in that sense have fewer reliability issues to worry about. The primary concern is wear out.

FTL Organization: A Log-Structured FTL

Upon a write to logical block N, the device appends the write to the next free spot in the currently-being-written-to block; we call this style of writing logging. To allow for subsequent reads of block N, the device keeps a mapping table (in its memory, and persistent, in some form, on the device); this table stores the physical address of each logical block in the system. Unfortunately, this basic approach to log structuring has some downsides. The first is that overwrites of logical blocks lead to something we call garbage, i.e., old versions of data around the drive and taking up space. The device has to periodically perform garbage collection (GC) to find said blocks and free space for future writes; excessive garbage collection drives up write amplification and lowers performance. The second is high cost of in-memory mapping tables; the larger the device, the more memory such tables need.
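A toy sketch of the mapping-table mechanics (garbage collection, persistence, and block/page structure omitted):

```python
class LogFTL:
    def __init__(self):
        self.log = []    # physical pages, appended in write order
        self.map = {}    # logical block address -> physical page index

    def write(self, lba, data):
        self.map[lba] = len(self.log)   # old mapping (if any) becomes garbage
        self.log.append(data)           # always append: logging

    def read(self, lba):
        return self.log[self.map[lba]]  # indirection through the mapping table
```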

FFS: File System Consistency Checks: Detection Algorithm for Directories

Use a per-file table instead of per-block Parse entire directory structure, start at root: • Increment counter for each file you encounter • This value can be >1 due to hard links • Symbolic links are ignored Compare table counts w/link counts in i-node: • If i-node count > our directory count (wastes space) • If i-node count < our directory count (catastrophic)

RAID advantages: performance

Using multiple disks in parallel can greatly speed up I/O times

Flash-Based Solid-State Storage Device (SSD)

Value stored by transistor - SLC (Single-level cell): 1 bit - MLC (Multi-level cell): 2 bits - TLC (triple-level cell): 3 bits

Full Page Table

contains metadata about each frame: protection (R/W/X) bits, a modified bit, a valid bit, etc. The MMU enforces the R/W/X protection (illegal accesses throw a page fault)

Aspects of Memory Multiplexing: Sharing

Want option to overlap when desired (for efficiency and communication)

Aspects of Memory Multiplexing: Utilization

Want to best use of this limited resource

Aspects of Memory Multiplexing: Virtualization

Want to create the illusion of more resources than exist in underlying physical system

Metadata Journaling (ordered journaling) problem

We need to write the data block to the disk BEFORE the metadata is written to the log. If we don't, the file system is consistent but the inode may end up pointing to garbage data. Specifically, consider the case where the inode and bitmap are written to the log but the data block did not make it to disk. The file system will then try to recover. Because the data block is not in the log, the file system will replay the writes to the inode and bitmap, producing a consistent file system (from the perspective of file-system metadata). However, the inode will be pointing to garbage data, i.e., at whatever was in the slot where the data block was headed

Interface And RAID Internals

When a file system issues a logical I/O request to the RAID, the RAID internally must calculate which disk (or disks) to access in order to complete the request, and then issue one or more physical I/Os to do so

Flash Reliability: disturbance

When accessing a particular page within a flash, it is possible that some bits get flipped in neighboring pages; such bit flips are known as read disturbs or program disturbs, depending on whether the page is being read or programmed, respectively.

Using Checksums

When reading a block D, the client also reads its checksum from disk C_s(D), which we call the stored checksum (hence the subscript C_s). The client then computes the checksum over the retrieved block D, which we call the computed checksum C_c(D). At this point, the client compares the stored and computed checksums; if they are equal, the data has likely not been corrupted, and thus can be safely returned to the user. If they do not match, this implies the data has changed since the time it was stored (since the stored checksum reflects the value of the data at that time). In this case, we have a corruption, which our checksum has helped us to detect
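A sketch of the read-path check, assuming CRC-32 as the checksum function and hypothetical reader callbacks:

```python
import zlib

def read_and_verify(read_block, read_stored_checksum, addr):
    """read_block/read_stored_checksum are hypothetical stand-ins for disk I/O."""
    data = read_block(addr)                 # block D (bytes)
    stored = read_stored_checksum(addr)     # Cs(D), recorded at store time
    computed = zlib.crc32(data)             # Cc(D), recomputed now
    if computed != stored:
        raise IOError(f"corruption detected at block {addr}")
    return data                             # checksums match: likely intact
```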

Linux Virtual Memory: Large Page Support: transparent huge page support

When this feature is enabled, the operating system automatically looks for opportunities to allocate huge pages (usually 2 MB, but on some systems, 1 GB) without requiring application modification

Solution #2: Journaling (or Write-Ahead Logging)

When updating the disk, before overwriting the structures in place, first write down a little note (somewhere else on the disk, in a well-known location) describing what you are about to do. Writing this note is the "write ahead" part, and we write it to a structure that we organize as a "log"; hence, write-ahead logging. By writing the note to disk, you are guaranteeing that if a crash takes place during the update (overwrite) of the structures you are updating, you can go back and look at the note you made and try again; thus, you will know exactly what to fix (and how to fix it) after a crash, instead of having to scan the entire disk. By design, journaling thus adds a bit of work during updates to greatly reduce the amount of work required during recovery

TLB Issue: Replacement Policy

When we are installing a new entry in the TLB, we have to replace an old one, and thus the question: which one to replace?

Log-structured File System. (LFS)

When writing to disk, LFS first buffers all updates (including metadata!) in an in-memory segment; when the segment is full, it is written to disk in one long, sequential transfer to an unused part of the disk. LFS never overwrites existing data, but rather always writes segments to free locations. Because segments are large, the disk (or RAID) is used efficiently, and performance of the file system approaches its zenith.

File Names: File name extensions are widespread

Windows: attaches meaning to extensions (.txt, .doc, .xls, ...); associates applications with extensions. UNIX: extensions not enforced by the OS; some apps might insist upon them (.c, .h, .o, .s, for the C compiler)

Memory Management Common Errors: Forgetting to Initialize Allocated Memory (Uninitialized Read)

With this error, you call malloc() properly, but forget to fill in some values into your newly-allocated data type. If you do forget, your program will eventually encounter an uninitialized read, where it reads from the heap some data of unknown value

Data Journaling

Written blocks in the log: The transaction begin (TxB) tells us about this update, including information about the pending update to the file system (e.g., the final addresses of the blocks), and some kind of transaction identifier (TID). The middle blocks just contain the exact contents of the file system blocks themselves; this is known as physical logging as we are putting the exact physical contents of the update in the journal (an alternate idea, logical logging, puts a more compact logical representation of the update in the journal, e.g., "this update wishes to append data block Db to file X", which is a little more complex but can save space in the log and perhaps improve performance). The final block (TxE) is a marker of the end of this transaction, and will also contain the TID.
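
A runnable sketch of the write ordering this implies, with printed stand-ins for the block layer (the helper names are assumptions, not a real journaling API):

```c
#include <stdio.h>

/* No-op stand-ins for real disk writes and write barriers. */
static void write_block(const char *what) { printf("write: %s\n", what); }
static void barrier(void)                 { printf("-- wait for I/O --\n"); }

/* Log first (write-ahead), then checkpoint the in-place structures.
   TxE is issued only after the logged contents are durable, so a crash
   can never leave a "committed" transaction with torn contents. */
static void journaled_update(void) {
    write_block("TxB + logged inode/bitmap/data blocks");
    barrier();
    write_block("TxE (commit)");
    barrier();
    write_block("checkpoint: in-place file-system structures");
}

int main(void) { journaled_update(); return 0; }
```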

Linux Virtual Memory: Address Space

a Linux virtual address space consists of a user portion (where user program code, stack, heap, and other parts reside) and a kernel portion (where kernel code, stacks, heap, and other parts reside). One slightly interesting aspect of Linux is that it contains two types of kernel virtual addresses: the first are known as kernel logical addresses, and the second are kernel virtual addresses

TLB miss

a TLB lookup that fails because the TLB does not contain a valid translation for that virtual address

TLB hit

a TLB lookup that succeeds at finding a valid address translation

Hard Disk Drives: Components: platter

a circular hard surface on which data is stored persistently by inducing magnetic changes to it. A disk may have one or more platters; each platter has 2 sides, each of which is called a surface. These platters are usually made of some hard material (such as aluminum), and then coated with a thin magnetic layer that enables the drive to persistently store bits even when the drive is powered off.

RAID Level 4: Parity: Analysis: performance-latency of a single read request

a single read (assuming no failure) is just mapped to a single disk, and thus its latency is equivalent to the latency of a single disk request. The latency of a single write requires two reads and then two writes; the reads can happen in parallel, as can the writes, and thus total latency is about twice that of a single disk

FFS Organizing Structure: The Cylinder Group: Block Groups

a consecutive portion of the disk's address space. By placing two files within the same group, FFS can ensure that accessing one after the other will not result in long seeks across the disk To use these groups to store files and directories, FFS needs to have the ability to place files and directories into a group, and track all necessary information about them therein. To do so, FFS includes all the structures you might expect a file system to have within each group. FFS keeps a copy of the super block (S) in each group for reliability reasons. The superblock is needed to mount the file system; by keeping multiple copies, if one copy becomes corrupt, you can still mount and access the file system by using a working replica. Within each group, FFS needs to track whether the inodes and data blocks of the group are allocated. A per-group inode bitmap (ib) and data bitmap (db) serve this role for inodes and data blocks in each group. Finally, the inode and data block regions are just like those in the previous very-simple file system

FFS: Hard & Soft Links: Soft (Sym) Links

a mapping from a file name to a file name - ... a file that contains the name of another file - use as alias: a soft link continues to remain valid when the (path of) the target file name changes

FFS: Hard & Soft Links: Hard Links

a mapping from name to a low-level name (inode)

Ticket Mechanisms: ticket transfer

a process can temporarily hand off its tickets to another process. This ability is especially useful in a client/server setting, where a client process sends a message to a server asking it to do some work on the client's behalf. To speed up the work, the client can pass the tickets to the server and thus try to maximize the performance of the server while the server is handling the client's request. When finished, the server then transfers the tickets back to the client and all is as before.

Ticket Mechanisms: ticket inflation

a process can temporarily raise or lower the number of tickets it owns Of course, in a competitive scenario with processes that do not trust one another, this makes little sense; one greedy process could give itself a vast number of tickets and take over the machine. Rather, inflation can be applied in an environment where a group of processes trust one another; in such a case, if any one process knows it needs more CPU time, it can boost its ticket value as a way to reflect that need to the system, all without communicating with any other processes

Caching Problem with Multiple CPUs: Cache Affinity

a process, when run on a particular CPU, builds up a fair bit of state in the caches (and TLBs) of the CPU. The next time the process runs, it is often advantageous to run it on the same CPU (its state is in that CPU's caches, so it will run faster). If, instead, one runs a process on a different CPU each time, the performance of the process will be worse, as it will have to reload the state each time it runs (even though it will still run correctly). Thus, a multiprocessor scheduler should consider cache affinity when making its scheduling decisions, perhaps preferring to keep a process on the same CPU if at all possible.

Frame

a section of physical memory

Page

a section of virtual memory

Two Types of Allocated Memory: Heap

all allocations and deallocations are explicitly handled by you, the programmer; allocations that are not explicitly deallocated survive the end of a function (specifically, the function in which the memory was allocated)

Internal Fragmentation

allocated memory may be slightly larger than requested memory; this size difference is memory internal to a partition that is not being used. If an allocator hands out chunks of memory bigger than that requested, any unasked-for (and thus unused) space in such a chunk is considered internal fragmentation (because the waste occurs inside the allocated unit), another example of space waste

Copy on Write (COW)

allows both parent and child processes to initially share the same pages in memory Ex. P1 forks -> P2 created with its own page table and same translations -> All pages are marked COW (in page table) Now one process tries to write to the stack: Page fault-> allocate new frame-> copy page -> both pages no longer COW Ex2. P1 forks -> P2 created with its own page table and same translations -> P2 calls exec: allocate new frames-> load in new pages-> pages no longer COW
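
A sketch of the write-fault path this describes, assuming a simplified PTE layout (all field and function names here are illustrative, not a real kernel's):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Assumed per-page bookkeeping for this sketch. */
typedef struct {
    uint8_t *frame;    /* physical frame backing the page       */
    bool     writable; /* hardware write permission              */
    bool     cow;      /* marked copy-on-write at fork() time    */
    int     *refcount; /* number of processes sharing the frame  */
} pte_t;

/* Called on a write fault: give the faulting process a private copy. */
void cow_write_fault(pte_t *pte) {
    if (!pte->cow) return;                /* a genuine protection error  */
    uint8_t *copy = malloc(PAGE_SIZE);    /* allocate a new frame        */
    memcpy(copy, pte->frame, PAGE_SIZE);  /* duplicate the page contents */
    (*pte->refcount)--;                   /* old frame has one less user */
    pte->frame    = copy;
    pte->cow      = false;                /* no longer COW               */
    pte->writable = true;                 /* retried write now succeeds  */
}
```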

TLB miss: when the page fault handler must be run

although this was a legitimate page for the process to access (it is valid, after all), it is not present in physical memory

page-replacement policy

an algorithm used by virtual memory systems to decide which page or segment to remove from main memory when a page frame is needed and memory is full

Page Table Entry

an entry in the page table (one for each page in the virtual address space) made up of a valid bit and an n-bit address field - the valid bit indicates whether or not a page is currently cached in DRAM/common to indicate whether the particular translation is valid We also might have protection bits, indicating whether the page could be read from, written to, or executed from A present bit indicates whether this page is in physical memory or on disk (i.e., it has been swapped out). A dirty bit is also common, indicating whether the page has been modified since it was brought into memory. A reference bit (a.k.a. accessed bit) is sometimes used to track whether a page has been accessed, and is useful in determining which pages are popular and thus should be kept in memory; such knowledge is critical during page replacement
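
The bits listed above, collected into one illustrative C bit-field layout (the widths are assumptions, not any particular architecture's encoding; bit-fields on uint64_t are a common compiler extension, used here for readability):

```c
#include <stdint.h>

typedef struct {
    uint64_t valid     : 1;   /* translation is valid                    */
    uint64_t present   : 1;   /* in physical memory vs. swapped to disk  */
    uint64_t read      : 1;   /* protection: may be read                 */
    uint64_t write     : 1;   /* protection: may be written              */
    uint64_t execute   : 1;   /* protection: may be executed             */
    uint64_t dirty     : 1;   /* modified since brought into memory      */
    uint64_t reference : 1;   /* accessed bit, consulted by replacement  */
    uint64_t pfn       : 40;  /* physical frame number                   */
    uint64_t unused    : 17;  /* pad to 64 bits                          */
} pte_t;
```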

conflict miss

arises in hardware because of limits on where an item can be placed in a hardware cache, due to something known as set-associativity; it does not arise in the OS page cache because such caches are always fully-associative

Multi-Queue Multiprocessor Scheduling (MQMS)

basic scheduling framework consists of multiple scheduling queues. Each queue will likely follow a particular scheduling discipline, such as round robin, though of course any algorithm can be used. When a job enters the system, it is placed on exactly one scheduling queue, according to some heuristic. Then it is scheduled essentially independently, thus avoiding the problems of information sharing and synchronization found in the single-queue approach

Finding Sector and Block on a disk

blk = (inumber * sizeof(inode_t)) / blockSize; /* which block of the inode table holds this inode */ sector = ((blk * blockSize) + inodeStartAddr) / sectorSize; /* device sector address of that block */

flash banks/planes

collections of cells. A bank is accessed in two different sized units: blocks (sometimes called erase blocks), which are typically of size 128 KB or 256 KB, and pages, which are a few KB in size (e.g., 4KB). Within each bank there are a large number of blocks; within each block, there are a large number of pages

FFS: File creation: link()

creates a hard link: a new name for the same underlying file, and increments the link count in the inode

Approximating LRU vs Implementing LRU

do we really need to find the absolute oldest page to replace? Can we instead survive with an approximation? And the answer is Yes. The idea requires some hardware support, in the form of a use bit (sometimes called the reference bit). Whenever a page is referenced (i.e., read or written), the use bit is set by hardware to 1. The hardware never clears the bit, though (i.e., sets it to 0); that is the responsibility of the OS.

Bitmap

each bit is used to indicate whether the corresponding object/block is free (0) or in-use (1)
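
A minimal sketch of the corresponding helpers (the names are illustrative):

```c
#include <stdint.h>

/* Bit i of the map tracks object i: 0 = free, 1 = in use. */
static inline void bm_set(uint8_t *bm, int i)   { bm[i / 8] |=  (uint8_t)(1 << (i % 8)); }
static inline void bm_clear(uint8_t *bm, int i) { bm[i / 8] &= (uint8_t)~(1 << (i % 8)); }
static inline int  bm_test(const uint8_t *bm, int i) { return (bm[i / 8] >> (i % 8)) & 1; }
```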

Solutions to Cache Coherency Problem: Bus Snooping

each cache pays attention to memory updates by observing the bus that connects them to main memory. When a CPU then sees an update for a data item it holds in its cache, it will notice the change and either invalidate its copy or update it.

Lottery Scheduling

each job is given lottery tickets, more for higher priority; on each time slice a ticket is drawn, and the winning job gets to run

RAID Level 4: Parity: Random Writes: small-write problem

even though the data disks could be accessed in parallel, the parity disk prevents any parallelism from materializing; all writes to the system will be serialized because of the parity disk.

Problems with Paging: Performance

every data/instruction access requires two memory accesses (one for the page table, one for the data/instruction)

FFS: File System Consistency Checks: Detection Algorithm for File Blocks

fsck (UNIX) & scandisk (Windows) • Build table with info about each block -- initially each block is unknown except superblock • Scan through the inodes and the freelist -- Keep track in the table -- If block already in table, note error • Finally, see if all blocks have been visited

fsck: Duplicates

fsck also checks for duplicate pointers, i.e., cases where two different inodes refer to the same block. If one inode is obviously bad, it may be cleared. Alternately, the pointed-to block could be copied, thus giving each inode its own copy as desired.

fsck: Inode links

fsck also verifies the link count of each allocated inode. As you may recall, the link count indicates the number of different directories that contain a reference (i.e., a link) to this particular file. To verify the link count, fsck scans through the entire directory tree, starting at the root directory, and builds its own link counts for every file and directory in the file system. If there is a mismatch between the newly-calculated count and that found within an inode, corrective action must be taken, usually by fixing the count within the inode. If an allocated inode is discovered but no directory refers to it, it is moved to the lost+found directory

fsck: Directory checks

fsck does not understand the contents of user files; however, directories hold specifically formatted information created by the file system itself. Thus, fsck performs additional integrity checks on the contents of each directory, making sure that "." and ".." are the first entries, that each inode referred to in a directory entry is allocated, and ensuring that no directory is linked to more than once in the entire hierarchy.

fsck: Superblock

fsck first checks if the superblock looks reasonable, mostly doing sanity checks such as making sure the file system size is greater than the number of blocks that have been allocated. Usually the goal of these sanity checks is to find a suspect (corrupt) superblock; in this case, the system (or administrator) may decide to use an alternate copy of the superblock.

Evaluating A RAID: capacity

given a set of N disks each with B blocks, how much useful capacity is available to clients of the RAID? Without redundancy, the answer is N ·B; in contrast, if we have a system that keeps two copies of each block (called mirroring), we obtain a useful capacity of (N · B)/2. Different schemes (e.g., parity-based ones) tend to fall in between.

Caching Problem with Multiple CPUs: Cache Coherency

if a process on CPU1 writes a value (updating its own cache) and a process on CPU2 reads that location before the write has propagated, CPU2 sees a stale value in its cache

Caches: spacial locality

if a program accesses a data item at address x, it is likely to access data items near x as well (this is spatial locality)

Shortest Time-to-Completion First (STCF)/Preemptive Shortest Job First (PSJF) Scheduling Scheme

improves SJF for situations where shorter jobs come in after a long job has already started. Any time a new job enters the system, the STCF scheduler determines which of the remaining jobs (including the new job) has the least time left, and schedules that one. Optimizes turnaround time, but is bad for response time

Round-Robin (RR)/Time Slicing Scheduling Scheme

improves response time: instead of running jobs to completion, RR runs a job for a *time slice* (sometimes called a *scheduling quantum*) and then switches to the next job in the run queue. It repeatedly does so until the jobs are finished. Note that the length of a time slice must be a multiple of the timer interrupt period: the shorter it is, the better the performance of RR under the response-time metric. However, making the time slice too short is problematic: suddenly the cost of context switching will dominate overall performance. Thus, deciding on the length of the time slice presents a trade-off to a system designer, making it long enough to amortize the cost of switching without making it so long that the system is no longer responsive. Optimizes response time, but is bad for turnaround time (one of the worst policies if turnaround time is our metric)

First In, First Out (FIFO)/First Come, First Served (FCFS) Scheduling Scheme

the first job to arrive runs first, to completion. Simple and easy to implement, but usually too simple

FFS Inodes

inode array: • F: inode number -> disk location • inode contains: metadata; 12 data pointers; 3 indirect pointers What else is in an inode? • Type: ordinary file/directory/symbolic link/special device • Size of the file (in #bytes) • # links to the inode • Owner (user id and group id) • Protection bits • Times: creation, last accessed, last modified

Proportional-share

instead of optimizing for turnaround or response time, a scheduler might instead try to guarantee that each job obtain a certain percentage of CPU time

Solution to polling problem

interrupts: Instead of polling the device repeatedly, the OS can issue a request, put the calling process to sleep, and context switch to another task. When the device is finally finished with the operation, it will raise a hardware interrupt, causing the CPU to jump into the OS at a predetermined interrupt service routine (ISR) or more simply an interrupt handler. The handler is just a piece of operating system code that will finish the request and wake the process waiting for the I/O, which can then proceed as desired Interrupts thus allow for overlap of computation and I/O, which is key for improved utilization This is only really good if the device takes a long time.

a translation-lookaside buffer (TLB)

is part of the chip's memory-management unit (MMU), and is simply a hardware cache of popular virtual-to-physical address translations; Associative cache of virtual to physical page translations the TLB improves performance due to spatial locality. The elements of the array are packed tightly into pages (i.e., they are close to one another in space), and thus only the first access to an element on a page yields a TLB miss. *Access TLB before you access memory.*
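
A sketch of the lookup step for a small, fully-associative, software-managed TLB (the structure and names are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SIZE 64

/* One entry of a small, fully-associative TLB (layout is illustrative). */
typedef struct { uint32_t vpn, pfn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_SIZE];

/* Consult the TLB before touching memory: on a hit the PFN comes back
   with no page-table access; on a miss the caller walks the page table. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn) {
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;      /* TLB hit */
        }
    }
    return false;             /* TLB miss */
}
```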

RAID Level 0: Striping: Analysis: capacity

it is perfect: given N disks each of size B blocks, striping delivers N ·B blocks of useful capacity

What problems does MLFQ try to solve?

First, it would like to optimize turnaround time; unfortunately, the OS doesn't generally know how long a job will run for, exactly the knowledge that algorithms like SJF (or STCF) require. Second, MLFQ would like to make a system feel responsive to interactive users (i.e., users sitting and staring at the screen, waiting for a process to finish), and thus minimize response time; unfortunately, algorithms like Round Robin reduce response time but are terrible for turnaround time. Thus, our problem: given that we in general do not know anything about a process, how can we build a scheduler to achieve these goals?

Making The Log Finite: circular log

journaling file systems treat the log as a circular data structure, re-using it over and over To do so, the file system must take action some time after a checkpoint. Specifically, once a transaction has been checkpointed, the file system should free the space it was occupying within the journal, allowing the log space to be reused. One way to do this: you could simply mark the oldest and newest non-checkpointed transactions in the log in a *journal superblock*; all other space is free

LFS: How to Find Inode on Disk

location of inode on disk changes. Maintain inode map (imap) in pieces and store updated pieces on disk. imap: inode number -> disk addr • For write performance: put piece(s) at end of segment • Checkpoint Region (CR): points to all inode map pieces and is updated every 30 secs; located at fixed disk address; also buffered in memory

SQMS Problems: Solving Cache Affinity

most SQMS schedulers include some kind of affinity mechanism to try to make it more likely that process will continue to run on the same CPU if possible Specifically, one might provide affinity for some jobs, but move others around to balance load

TLB Issue: Replacement Policy Solutions: Random eviction

a typical approach is to use a random policy, which evicts a TLB mapping at random. Such a policy is useful due to its simplicity and ability to avoid corner-case behaviors

Solution #3: Other Approaches: backpointer-based consistency (BBC)

no ordering is enforced between writes To achieve consistency, an additional back pointer is added to every block in the system; for example, each data block has a reference to the inode to which it belongs. When accessing a file, the file system can determine if the file is consistent by checking if the forward pointer (e.g., the address in the inode or direct block) points to a block that refers back to it. If so, everything must have safely reached disk and thus the file is consistent; if not, the file is inconsistent, and an error is returned. By adding back pointers to the file system, a new form of lazy crash consistency can be attained

capacity miss

occurs because the cache ran out of space and had to evict an item to bring a new item into the cache.

single-level cell (SLC) flash

only a single bit is stored within a transistor (i.e., 1 or 0)

Cache Management

our goal in picking a replacement policy for this cache is to minimize the number of cache misses Knowing the number of cache hits and misses let us calculate the average memory access time (AMAT) for a program. we can compute the AMAT of a program as follows: AMAT = T_M + (P_Miss · T_D) where T_M represents the cost of accessing memory (the cache), T_D the cost of accessing disk, and P_Miss the probability of not finding the data in the cache (a miss);
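
For a sense of scale (illustrative numbers, not from the text): with T_M = 100 ns, T_D = 10 ms, and P_Miss = 0.001, AMAT = 100 ns + 0.001 · 10 ms ≈ 10.1 µs, so even a 0.1% miss rate leaves the disk term dominant.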

Hard Disk Drives: multi-zoned disk drives

outer tracks tend to have more sectors than inner tracks, which is a result of geometry; there is simply more room out there. These tracks are often referred to as multi-zoned disk drives, where the disk is organized into multiple zones, and where a zone is consecutive set of tracks on a surface. Each zone has the same number of sectors per track, and outer zones have more sectors than inner zones.

Linux Virtual Memory: Large Page Support

over time, Linux has evolved to allow applications to utilize these huge pages. Huge pages reduce the number of mappings that are needed in the page table; the larger the pages, the fewer the mappings. Huge pages allow a process to access a large tract of memory without TLB misses, by using fewer slots in the TLB; this is the main advantage. There are other benefits: there is a shorter TLB-miss path, meaning that when a TLB miss does occur, it is serviced more quickly. In addition, allocation can be quite fast (in certain scenarios), a small but sometimes important benefit. However, huge pages are subject to internal fragmentation, and the overhead of allocation can also be high (in some cases)

Inverted Page Table

page table indexes physical memory; search to find an entry. Here, instead of having many page tables (one per process of the system), we keep a single page table that has an entry for each physical frame of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page. Tradeoffs: ↓ (less) memory to store page tables; ↑ (more) time to search page tables

A Simple Policy: FIFO (First-in, First-out) Replacement

pages were simply placed in a queue when they enter the system; when a replacement occurs, the page on the tail of the queue (the "first-in" page) is evicted. FIFO has one great strength: it is quite simple to implement.

RAID Level 0: Striping: Analysis: performance

performance is excellent: all disks are utilized, often in parallel, to service user I/O requests.

Page-table base register (PTBR)

points to the page table; saved/restored on context switches

hardware interface

the interface a hardware device presents to the rest of the system. Just like a piece of software, hardware must also present some kind of interface that allows the system software to control its operation. Thus, all devices have some specified interface and protocol for typical interaction.

Flash Performance

read latencies are quite good, taking just 10s of microseconds to complete. Program latency is higher and more variable, as low as 200 microseconds for SLC, but higher as you pack more bits into each cell; to get good write performance, you will have to make use of multiple flash chips in parallel. Finally, erases are quite expensive, taking a few milliseconds typically. Dealing with this cost is central to modern flash storage design

multi-level page table

reduces page table space: allocate only PTEs (page table entries) in use; simple memory allocation. Negative: more lookups per memory reference. The basic idea behind a multi-level page table is simple. First, chop up the page table into page-sized units; then, if an entire page of page-table entries (PTEs) is invalid, don't allocate that page of the page table at all. To track whether a page of the page table is valid (and if valid, where it is in memory), use a new structure, called the page directory.
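
A sketch of a two-level lookup under assumed parameters (32-bit addresses, 4KB pages, a 10/10/12 split; the types and macros are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define PDX(va) (((va) >> 22) & 0x3ff)   /* page-directory index */
#define PTX(va) (((va) >> 12) & 0x3ff)   /* page-table index     */
#define OFF(va) ((va) & 0xfff)           /* offset within page   */

typedef struct { uint32_t valid; uint32_t pfn; } pte_t;
typedef struct { pte_t *ptes; } pde_t;   /* NULL => page of PTEs not allocated */

/* Walk the two levels; return the physical address or -1 if unmapped. */
int64_t translate(pde_t *pgdir, uint32_t va) {
    pde_t pde = pgdir[PDX(va)];
    if (pde.ptes == NULL)                /* whole page of PTEs absent */
        return -1;
    pte_t pte = pde.ptes[PTX(va)];
    if (!pte.valid)
        return -1;
    return ((int64_t)pte.pfn << 12) | OFF(va);
}
```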

blocks

regions of the disk used by file systems to store information

FFS: File creation: unlink()

removes a name for a file from its directory and decrements link count in inode. If last link, file itself and resources it held are deleted

Shortest Job First (SJF) Scheduling Scheme

response to the convoy effect. A scheduling algorithm that deals with each user or task by getting the smaller ones out of the way first. Optimizes turnaround time, but is bad for response time

Single-Queue Multiprocessor Scheduling (SQMS)

reuse the basic framework for single processor scheduling, by putting all jobs that need to be scheduled into a single queue This approach has the advantage of simplicity; it does not require much work to take an existing policy that picks the best job to run next and adapt it to work on more than one CPU (where it might pick the best two jobs to run, if there are two CPUs, for example)

CPU Scheduler

selects a process to run from the run queue

Network Scheduler

selects next packet to send or process

Disk Scheduler

selects next read/write operation

Page Replacement Scheduler

selects page to evict

Disk Scheduling: Shortest Positioning Time First (SPTF)

sometimes also called shortest access time first (SATF); the solution to the problem with SCAN. It takes both seek time and rotation time into account

RAID Level 0: Striping: Analysis: reliability

striping is perfect, but in the bad way: any disk failure will lead to data loss

Garbage Collection Problem

LFS leaves old versions of file structures scattered throughout the disk. We call these old versions garbage. So what should we do with these older versions of inodes, data blocks, and so forth? One could keep those older versions around and allow users to restore old file versions (for example, when they accidentally overwrite or delete a file, it could be quite handy to do so); such a file system is known as a versioning file system because it keeps track of the different versions of a file

Using History: LRU (Least Recently Used) Replacement

the Least-Recently-Used (LRU) policy replaces the least-recently-used page. We should also note that the opposite of this algorithm exists: Most-Recently-Used (MRU)

Memory Issues for OS to Handle: Exception Handlers

the OS must provide exception handlers, or functions to be called, as discussed above; the OS installs these handlers at boot time (via privileged instructions)

RAID Level 4: Parity: full-stripe write

the RAID can simply calculate the new value of P0 (by performing an XOR across the blocks) and then write all of the blocks (including the parity block) to the disks in parallel
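
A sketch of that XOR across the stripe (illustrative code, not a real RAID implementation); for a small write, the subtractive method computes P_new = (D_old XOR D_new) XOR P_old instead, avoiding a full-stripe read:

```c
#include <stdint.h>
#include <stddef.h>

/* Full-stripe parity: P[i] = D0[i] ^ D1[i] ^ ... ^ D(n-1)[i]. */
void compute_parity(uint8_t *parity, uint8_t *const *data,
                    int ndisks, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (int d = 0; d < ndisks; d++)
            p ^= data[d][i];             /* XOR across the data disks */
        parity[i] = p;
    }
}
```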

TLB Issue: Context Switches

the TLB contains virtual-to-physical translations that are only valid for the currently running process; these translations are not meaningful for other processes. As a result, when switching from one process to another, the hardware or OS (or both) must be careful to ensure that the about-to-be-run process does not accidentally use translations from some previously run process

RAID Level 0: Striping

the basic idea: spread the blocks of the array across the disks in a round-robin fashion. This approach is designed to extract the most parallelism from the array when requests are made for contiguous chunks of the array (as in a large, sequential read, for example). We call the blocks in the same row a stripe. Striping serves as an excellent upper bound on performance and capacity and thus is worth understanding. No fault tolerance: if one drive fails, data is lost, so it is not good for critical applications

Finding a file with Inode 32

the file system would first calculate the offset into the inode region (32 · sizeof(inode)), add it to the start address of the inode table on disk, and thus arrive upon the correct byte address of the desired block of inodes. Recall that disks are not byte addressable, but rather consist of a large number of addressable sectors, usually 512 bytes. Thus, to fetch the block of inodes that contains inode 32, the file system would issue a read to sector (20 × 1024) / 512 = 40 to fetch the desired inode block.

How to keep track of Offset: Implicit Approach

the hardware determines the segment by noticing how the address was formed If, for example, the address was generated from the program counter (i.e., it was an instruction fetch), then the address is within the code segment; if the address is based off of the stack or base pointer, it must be in the stack segment; any other address must be in the heap

(Hardware-Based) Address Translation

the hardware transforms each memory access (e.g., an instruction fetch, load, or store), changing the virtual address provided by the instruction to a physical address where the desired information is actually located. Thus, on each and every memory reference, an address translation is performed by the hardware to redirect application memory references to their actual locations in memory

Evaluating RAID Performance: single-request latency

the latency of a single I/O request to a RAID is useful as it reveals how much parallelism can exist during a single logical I/O operation

TLB hit rate

the number of hits divided by the total number of accesses

TLB miss rate

the number of misses divided by the total number of accesses (1-hit rate)

Problem with basic protocol (above)

the first problem you might notice in the protocol is that polling seems inefficient; specifically, it wastes a great deal of CPU time just waiting for the (potentially slow) device to complete its activity, instead of switching to another ready process and thus better utilizing the CPU.

Address Space

the running program's view of memory in the system The address space of a process contains all of the memory state of the running program: The code of the program (the instructions) have to live in memory somewhere, and thus they are in the address space The program, while it is running, uses a stack to keep track of where it is in the function call chain as well as to allocate local variables and pass parameters and return values to and from routines. Finally, the heap is used for dynamically-allocated, user-managed memory, such as that you might receive from a call to malloc() in C or new in an object-oriented language such as C++ or Java

How scheduler picks a ticket

the scheduler must know how many total tickets there are. The scheduler then picks a winning ticket, which is a number from 0 to #tickets-1. Winning process runs
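
A sketch of one such scheduling decision, walking a job list while accumulating tickets (the job structure is an assumption for illustration):

```c
#include <stddef.h>
#include <stdlib.h>

/* Job list node; the structure is an assumption for this sketch. */
typedef struct job { int tickets; struct job *next; } job_t;

/* Pick the winner exactly as described: draw a number in
   [0, total_tickets), then walk the list accumulating tickets. */
job_t *pick_winner(job_t *head, int total_tickets) {
    int winner  = rand() % total_tickets;
    int counter = 0;
    for (job_t *j = head; j != NULL; j = j->next) {
        counter += j->tickets;
        if (counter > winner)
            return j;             /* this job's ticket range holds the draw */
    }
    return NULL;                  /* not reached if total_tickets is correct */
}
```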

Scheduling Metric: response time

the time from when the job arrives in a system to the first time it is scheduled T_response = T_firstrun − T_arrival

Disk Scheduling: Scan Variants- F-SCAN

this freezes the queue to be serviced when it is doing a sweep; this action places requests that come in during the sweep into a queue to be serviced later. Doing so avoids starvation of far-away requests, by delaying the servicing of late-arriving (but nearer by) requests.

linear page table

this is just an array. The OS indexes the array by the virtual page number (VPN), and looks up the page-table entry (PTE) at that index in order to find the desired physical frame number (PFN)

Job response time

time from when job arrives to when it is first run

Job total waiting time

time job spent on a queue (available to run but not running)

Job Arrival time

time the job arrives/is first available to run

paging

to chop up space into fixed-sized pieces

multi-level cell (MLC) flash

two bits are encoded into different levels of charge, e.g., 00, 01, 10, and 11 are represented by low, somewhat low, somewhat high, and high levels. There is even triple-level cell (TLC) flash, which encodes 3 bits per cell

high watermark (HW) and low watermark (LW)

used to help decide when to start evicting pages from memory. when the OS notices that there are fewer than LW pages available, a background thread that is responsible for freeing memory runs. The thread evicts pages until there are HW pages available. The background thread, sometimes called the *swap daemon or page daemon*, then goes to sleep, happy that it has freed some memory for running processes and the OS to use.
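
A sketch of the daemon's loop, with memory simulated by a counter so the example is self-contained (all names and numbers are illustrative):

```c
#include <stdio.h>

/* Simulated free-page counter so the sketch is self-contained. */
static int free_pages = 3;

static void evict_one_page(void) {
    free_pages++;                        /* pretend a page was written out */
    printf("evicted a page, %d free\n", free_pages);
}

/* One activation of the swap/page daemon: if free pages have fallen
   below LW, evict until HW pages are available, then go back to sleep. */
static void page_daemon_once(int LW, int HW) {
    if (free_pages < LW)
        while (free_pages < HW)
            evict_one_page();
}

int main(void) {
    page_daemon_once(5, 10);             /* illustrative watermarks */
    return 0;
}
```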

RAID Level 1: Mirroring: Analysis: performance-latency of a single read request

we can see it is the same as the latency on a single disk; all the RAID-1 does is direct the read to one of its copies A write is a little different: it requires two physical writes to complete before it is done. These two writes happen in parallel, and thus the time will be roughly equivalent to the time of a single write; however, because the logical write must wait for both physical writes to complete, it suffers the worst-case seek and rotational delay of the two requests, and thus (on average) will be slightly higher than a write to a single disk.

Flash Reliability: wear out

when a flash block is erased and programmed, it slowly accrues a little bit of extra charge. Over time, as that extra charge builds up, it becomes increasingly difficult to differentiate between a 0 and a 1. At the point where it becomes impossible, the block becomes unusable

Caches: temporal locality

when a piece of data is accessed, it is likely to be accessed again in the near future

Paging Advantages: flexibility

with a fully-developed paging approach, the system will be able to support the abstraction of an address space effectively, regardless of how a process uses the address space; we won't, for example, make assumptions about the direction the heap and stack grow and how they are used.

Address Translation Problem

• Adding a layer of indirection disrupts the spatial locality of caching • CPU cache is usually physically indexed • Adjacent pages may end up sharing the same CPU cache lines BIG PROBLEM: cache effectively smaller

NFS: Tolerating Server Failure

• Assumption: Server that fails is eventually rebooted. • Manifestations of failures: -- Failed server: no reply to client requests. -- Lost client request: no reply to client request. -- Lost reply: no reply to client request.

File System Operations

• Create a file • Write to a file • Read from a file • Seek to somewhere in a file • Delete a file • Truncate a file

Disk Failure Cases: (2) Entire Device Failure

• Damage to disk head, electronic failure, wear out • Detected by device driver, accesses return error codes • Annual failure rates or Mean Time To Failure (MTTF)

Round Robin

• Each job allowed to run for a quantum • Context is switched (at the latest) at the end of the quantum If the quantum is too long, this ends up being FIFO; too short, and too much time is wasted on context switches. Typical quantum: about 100X the cost of a context switch Positives: no starvation; can reduce response time Negatives: context switch overhead; mix of I/O and CPU bound jobs Horrible: bad average turnaround time for equal length jobs

Goals of Storage Disks

• Fast: data is there when you want it • Reliable: data fetched is what you stored • Affordable: won't break the bank

Bringing in a Page

• Find a free frame - evict one if there are no free frames • Issue disk request to fetch data for page • Block current process • Context switch to new process • When disk completes, update PTE (frame number, valid bit, RWX bits) • Put current process in ready queue

Swapping out a Page

• Find all page table entries that refer to old page (Frame might be shared or Core Map (frames → pages)) • Set each page table entry to invalid • Remove any TLB entries ("TLB Shootdown") • Write changes on page back to disk, if needed (Dirty/Modified bit in PTE indicates need or Text segments are (still) on program image on disk)

AFS V1 Problems

• Full path names sent to remote file server -- Remote file server spends too much time traversing the directory tree. • Too much traffic between client and file server devoted to testing if local file copy is current.

Paging Overview: Management

• Keep track of which pages are mapped to which frames • Keep track of all free frames

Page Replacement Algorithms: Stack Algorithms

• Let M(m, r) be the set of virtual pages in physical memory given m frames and reference string r • A page replacement algorithm is called a "stack algorithm" if for all #frames m and all reference strings r: M(m, r) is a subset of M(m + 1, r) - i.e., a stack algorithm does not suffer from Belady's anomaly (more frames -> not more misses)

Swapping vs. Paging: Swapping

• Loads entire process in memory • "Swap in" (from disk) or "Swap out" (to disk) a process • Slow (for large processes) • Wasteful (might not require everything) • Does not support sharing of code segments • Virtual memory limited by size of physical memory

NFS: Server Operations

• Lookup: name of file -> file handle • Read: file handle, offset, count -> data • Write: file handle, offset, count, data Initially, client obtains file handle for root directory from NFS server.

Page Replacement Algorithms: Working Set Algorithm (WS)

• Maintain for each frame the approximate time the frame was last used • At each clock tick: Update this time to the current time for all frames that were referenced since the last clock tick (i.e., the ones with use (REF) bits set); Clear all use bits; Put all frames that have not been used for some time Δ (working set parameter) on the free list • When a frame is needed, use free list • If empty, pick any frame

Page Replacement Algorithms: WSClock

• Merge WS and CLOCK algorithms • Maintain timestamp for each frame • When allocating a frame: -Inspect use bit of frame under hand -If set: Clear the use bit; Update the timestamp; Continue with next frame -If clear but (now - timestamp) < Δ: Continue with next frame (do not update timestamp) -Otherwise evict frame

The Perfect Scheduler

• Minimizes response time and turnaround time • Maximizes throughput • Maximizes utilization (aka "work conserving"): keeps all devices busy • Meets deadlines • Is Fair: everyone makes progress, no one starves • Is Envy-Free: no job wants to switch its schedule with another no such scheduler exists

Demand Loading

• Page not mapped until it is used • Requires free frame allocation • May involve reading page contents from disk or over the network

Page Replacement Algorithms: Not Recently Used

• Periodically (say, each clock tick), clear all use (aka REF) bits in PTEs (Ideally done in hardware) • When evicting a frame, scan for a frame that hasn't recently been referenced (use bit is clear in PTE; may require a scan of all frames, so keep track of last evicted frame) • If no such frame exists, select any

Paging Overview: Divide

• Physical memory into fixed-sized blocks called frames • Virtual memory into blocks of same size called pages

File Storage Layout Options: Linked-list: File Allocation Table (FAT) Negatives

• Poor locality • Many file seeks unless entire FAT in memory • Poor random access • Limited metadata • Limited access control • Limitations on volume and file size

OS Support for Paging: Process Execution

• Reset MMU (PTBR) for new process • Context switch: flush TLB (or TLB has pids) • Handle page faults

Swapping vs. Paging: Paging

• Runs all processes concurrently • A few pages from each process live in memory • Finer granularity, higher performance • Large virtual memory supported by small physical memory • Certain pages (read-only ones, for example) can be shared among processes A page can be mapped (to a physical frame) or not mapped (in a physical frame but not currently mapped; or still in the original program file; or still zero-filled; or on backing store/paged/swapped out); or illegal (not part of a segment/segfault)

Updated Context Switch

• Save current process' registers in PCB *• Also Page Table Base Register (PTBR)* *• Flush TLB (unless TLB is tagged) * • Restore registers and PTBR of next process to run • "Return from Interrupt"

Disk Scheduling: Shortest Seek Time First

• Select request with minimum seek time from current head position • A form of Shortest Job First (SJF) scheduling • Not optimal: suppose cluster of requests at far end of disk ➜ starvation

NFS: A stateless protocol

• Server does not maintain any state about clients accessing files. -- Eliminates possible inconsistency between state at server and state at client. -- Requires client to maintain and send state information to server with each client operation. • Client uses file handle to identify a file to the server. Components of a file handle are: -- Volume identifier -- Inode number -- Generation number (allows inode number reuse)

File Names: Naming conventions

• Some aspects of names are OS dependent: Windows is not case sensitive, UNIX is. • Some aspects are not: Names up to 255 characters long

Permanent Storage Devices: Flash memory

• Storage that rarely becomes corrupted • Capacity at intermediate cost (50x disk) • Block level random access --Good performance for reads; --Worse for random writes

Permanent Storage Devices: Magnetic disks

• Storage that rarely becomes corrupted • Large capacity at low cost • Block level random access --Slow performance for random access --Better performance for streaming access

Page Replacement Algorithms: Clock Algorithm

• To allocate a frame, inspect the use bit in the PTE at the clock hand and advance the clock hand • Use bit set? Clear it and repeat with the next frame • Use bit clear? Evict that frame
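
A sketch of that loop (representing the use bits as a plain array is an assumption for illustration):

```c
#include <stdbool.h>

/* use_bit[i] is the use (reference) bit of frame i; *hand is the clock hand.
   Returns the index of the frame to evict. */
int clock_evict(bool use_bit[], int nframes, int *hand) {
    for (;;) {
        if (use_bit[*hand]) {
            use_bit[*hand] = false;            /* give a second chance */
            *hand = (*hand + 1) % nframes;     /* advance the hand     */
        } else {
            int victim = *hand;                /* not recently used    */
            *hand = (*hand + 1) % nframes;
            return victim;                     /* evict this frame     */
        }
    }
}
```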

LFS: Storing Data on Disk

• Updates to file j and k are buffered. • Inode for a file points to log entry for data • An entire segment is written at once

AFS V1 Design

• Whole file caching on local disk » NFS caches blocks, not files -- open() copies file to local disk » ... unless file is already there from last access -- close() copies updates back -- read/write access copy on local disk (Blocks might be cached in local memory)

LFS: To Read a File

• [Load checkpoint region CR into memory] • [Copy inode map into memory] • Read appropriate inode from disk if needed • Read appropriate file (dir or data) block [...] = step not needed if information already cached

AFS V2 Design

• callbacks added: -- Client registers with server; -- Server promises to inform client that a cached file has been modified. • file identifier (FID) replaces pathnames: -- Client caches various directories in pathname (Register for callbacks on each directory and Directory maps to FID) -- Client traverses local directories, using FID to fetch actual files if not cached.

