Unit 2 Storage
Linear mode, "JBOD"
"Linear mode," also known as JBOD (for "just a bunch of disks") is not even a real RAID level. And yet, every RAID controller seems to implement it. JBOD concatenates the block addresses of multiple drives to create a single, larger virtual drive. It provides no data redundancy or performance benefit. These days, JBOD functionality is best achieved through a logical volume manager rather than a RAID controller.
Using EBS Stats and Volume Analyzer to identify volume performance issues
- Average Service Time: Service time is the time a request takes to leave the droplet and be processed by the EBS server; in other words, the aggregate round-trip time. This is the primary metric for identifying an underperforming volume. Ideally, the service time for a 1,000 PIOPS volume would be 1 ms: to perform 1,000 I/O operations in one second, one request must be serviced each millisecond. Similarly, for magnetic storage that we expect to perform about 100 IOPS, the expected service time would be 10 ms. If you see service times significantly higher than this ratio would suggest for extended periods (several hours), then a TT to EBS OPS may be valid.
- Derived Server Side Stats: A high S3 chunk time indicates that the volume is likely being restored from a snapshot or that a snapshot is being created. https://megamind.amazon.com/node/1596
Adding a disk
- Run sudo fdisk -l to list the system's disks and identify the new drive.
- Run any convenient partitioning utility to create a partition table for the drive. For drives of 2 TB or less, install a Windows-style MBR partition table; cfdisk is easiest, but you can also use fdisk, sfdisk, parted, or gparted. Larger disks require a GPT partition table (so you must use parted or gparted).
- Create a filesystem on the new partition.
- Mount the filesystem with sudo mount.
- In the /etc/fstab file, copy the line for an existing filesystem and adjust it so the new filesystem is mounted at boot (a sketch of the whole procedure follows).
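A minimal end-to-end sketch, assuming the new drive shows up as /dev/xvdf; the device name, partition label type, mount point, and filesystem type here are illustrative and should be adjusted for your system:
# fdisk -l                                   # identify the new drive
# parted /dev/xvdf mklabel gpt               # write a GPT partition table
# parted -a optimal /dev/xvdf mkpart primary 0% 100%
# mkfs -t ext4 /dev/xvdf1                    # create the filesystem
# mkdir -p /data && mount /dev/xvdf1 /data   # mount it
# echo '/dev/xvdf1 /data ext4 defaults 0 2' >> /etc/fstab   # persist across reboots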
Linux logical volume management
A Linux LVM configuration proceeds in a few distinct phases:
• Creating (defining, really) and initializing physical volumes: pvcreate, pvdisplay, pvscan
• Adding the physical volumes to a volume group: vgcreate, vgdisplay, vgscan
• Creating logical volumes on the volume group: lvcreate -L 10G -n volume1 volume_group1, lvextend
You can create copy-on-write duplicates of any LVM2 logical volume, whether or not it contains a filesystem. Because a snapshot's storage fills up as the source volume changes, LVM snapshots should, as a matter of practice, be either short-lived or as large as their source volumes. resize2fs forces you to check the consistency of the filesystem before resizing: run e2fsck -f, then resize2fs (see the sketch below).
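A minimal sketch of growing a logical volume and its ext4 filesystem, reusing the volume group and logical volume names above and assuming the filesystem is unmounted (the size increment is illustrative):
# lvextend -L +5G /dev/volume_group1/volume1   # grow the logical volume by 5 GiB
# e2fsck -f /dev/volume_group1/volume1         # resize2fs insists on a clean check first
# resize2fs /dev/volume_group1/volume1         # grow the filesystem to fill the volume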
RAID array
A RAID array (a redundant array of inexpensive/independent disks) combines multiple storage devices into one virtualized device. Depending on how you set up the array, this configuration can increase performance (by reading or writing disks in parallel), increase reliability (by duplicating or parity-checking data across multiple disks), or both. RAID can be implemented by the operating system or by various types of hardware. As the name suggests, RAID is typically conceived of as an aggregation of bare drives, but modern implementations let you use as a component of a RAID array anything that acts like a disk.
filesystem
A filesystem mediates between the raw bag of blocks presented by a partition, RAID array, or logical volume and the standard filesystem interface expected by programs: paths such as /var/spool/mail, UNIX file types, UNIX permissions, etc. The filesystem determines where and how the contents of files are stored, how the filesystem namespace is represented and searched on disk, and how the system is made resistant to (or recoverable from) corruption. Most storage space ends up as part of a filesystem, but swap space and database storage can potentially be slightly more efficient without "help" from a filesystem. The kernel or database imposes its own structure on the storage, rendering the filesystem unnecessary.
Partition
A partition is a fixed-size subsection of a storage device. Each partition has its own device file and acts much like an independent storage device. For efficiency, the same driver that handles the underlying device usually implements partitioning. Most partitioning schemes consume a few blocks at the start of the device to record the ranges of blocks that make up each partition.
Storage device
A storage device is anything that looks like a disk. It can be a hard disk, a flash drive, an SSD, an external RAID array implemented in hardware, or even a network service that provides block-level access to a remote device. The exact hardware doesn't matter, as long as the device allows random access, handles block I/O, and is represented by a device file.
EBS Encryption
Amazon EBS encryption offers you a simple encryption solution for your Amazon EBS volumes without the need for you to build, maintain, and secure your own key management infrastructure. When you create an encrypted EBS volume and attach it to a supported instance type, data stored at rest on the volume, disk I/O, and snapshots created from the volume are all encrypted. The encryption occurs on the servers that host Amazon EC2 instances, providing encryption of data in transit from EC2 instances to EBS storage.
Amazon EBS encryption uses AWS Key Management Service (AWS KMS) Customer Master Keys (CMKs) when creating encrypted volumes and any snapshots created from your encrypted volumes. The first time you create an encrypted Amazon EBS volume in a region, a default CMK is created for you automatically. This key is used for Amazon EBS encryption unless you select a CMK that you created separately using AWS Key Management Service. Creating your own CMK gives you more flexibility, including the ability to create, rotate, disable, define access controls for, and audit the encryption keys used to protect your data. For more information, see the AWS Key Management Service Developer Guide.
This feature is supported on all Amazon EBS volume types (General Purpose (SSD), Provisioned IOPS (SSD), and Magnetic), and you can expect the same provisioned IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency. You can access encrypted Amazon EBS volumes the same way you access existing volumes; encryption and decryption are handled transparently and require no additional action from you, your EC2 instance, or your application. Snapshots of encrypted EBS volumes are automatically encrypted, and volumes that are created from encrypted EBS snapshots are also automatically encrypted.
Important: Encrypted boot volumes are not supported at this time.
Encryption Key Management: Amazon EBS encryption handles key management for you. Each newly created volume is encrypted with a unique 256-bit key; any snapshots of this volume and any subsequent volumes created from those snapshots also share that key. These keys are protected by Amazon's own key management infrastructure, which implements strong logical and physical security controls to prevent unauthorized access. Your data and associated keys are encrypted using the industry-standard AES-256 algorithm. Amazon's overall key management infrastructure uses Federal Information Processing Standards (FIPS) 140-2 approved cryptographic algorithms and is consistent with National Institute of Standards and Technology (NIST) 800-57 recommendations.
Each AWS account has a regularly rotated, unique master key that is stored completely separately from your data, on a system surrounded by strong physical and logical security controls. Each encrypted volume (and its subsequent snapshots) is encrypted with a unique volume encryption key that is then encrypted with a region-specific secure master key. The volume encryption keys are used in memory on the server that hosts your EC2 instance; they are never stored on disk in plain text.
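As an illustration (not from the source doc), creating an encrypted volume with the AWS CLI might look like the following; the Availability Zone, size, and key ID are placeholders:
$ aws ec2 create-volume --availability-zone us-east-1a --size 100 --volume-type gp2 --encrypted
$ aws ec2 create-volume --availability-zone us-east-1a --size 100 --volume-type gp2 --encrypted --kms-key-id <your-cmk-key-id>
The first command uses the default EBS CMK for the region; the second uses a CMK you created yourself.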
Amazon Elastic Block Store (Amazon EBS)
Amazon Elastic Block Store (Amazon EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are highly available and reliable storage volumes that can be attached to any running instance that is in the same Availability Zone. Amazon EBS volumes that are attached to an Amazon EC2 instance are exposed as storage volumes that persist independently from the life of the instance. With Amazon EBS, you pay only for what you use.
Amazon EBS is recommended when data changes frequently and requires long-term persistence. Amazon EBS volumes are particularly well-suited for use as the primary storage for file systems, databases, or for any applications that require fine granular updates and access to raw, unformatted, block-level storage. Amazon EBS is particularly helpful for database-style applications that frequently encounter many random reads and writes across the data set.
For simplified data encryption, you can launch your Amazon EBS volumes as encrypted volumes. Amazon EBS encryption offers you a simple encryption solution for your EBS volumes without the need for you to build, manage, and secure your own key management infrastructure. When you create an encrypted EBS volume and attach it to a supported instance type, data stored at rest on the volume, disk I/O, and snapshots created from the volume are all encrypted. The encryption occurs on the servers that host EC2 instances, providing encryption of data-in-transit from EC2 instances to EBS storage. For more information, see Amazon EBS Encryption. Amazon EBS encryption uses AWS Key Management Service (AWS KMS) master keys when creating encrypted volumes and any snapshots created from your encrypted volumes. The first time you create an encrypted Amazon EBS volume in a region, a default master key is created for you automatically. This key is used for Amazon EBS encryption unless you select a Customer Master Key (CMK) that you created separately using the AWS Key Management Service. Creating your own CMK gives you more flexibility, including the ability to create, rotate, disable, define access controls for, and audit the encryption keys used to protect your data.
Features of EBS:
- You can create Amazon EBS volumes from 1 GiB to 1 TiB in size. You can mount these volumes as devices on your Amazon EC2 instances. You can mount multiple volumes on the same instance, but each volume can be attached to only one instance at a time. For more information, see Creating an Amazon EBS Volume.
- With General Purpose (SSD) volumes, your volume receives a base performance of 3 IOPS/GiB, with the ability to burst to 3,000 IOPS for extended periods of time. General Purpose (SSD) volumes are ideal for a broad range of use cases such as boot volumes, small and medium size databases, and development and test environments.
- With Provisioned IOPS (SSD) volumes, you can provision a specific level of I/O performance, up to 4,000 IOPS per volume. This allows you to predictably scale to thousands of IOPS per EC2 instance.
- Amazon EBS volumes behave like raw, unformatted block devices. You can create a file system on top of these volumes, or use them in any other way you would use a block device (like a hard drive).
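A rough CLI sketch of attaching an existing volume to an instance in the same Availability Zone (the volume ID, instance ID, and device name are placeholders):
$ aws ec2 attach-volume --volume-id vol-1234abcd --instance-id i-1234abcd --device /dev/sdf
$ aws ec2 describe-volumes --volume-ids vol-1234abcd    # confirm the attachment state
On the instance, the volume then has to be partitioned, formatted, and mounted as described elsewhere in this unit.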
Messed up partition: Repairing GPT Disks
An Ounce of Prevention: Before you get into trouble, it's worth taking preventive measures. Back up your partition tables! You can do this in either of two ways; only the first is covered here:
- On gdisk's main menu, you'll find the b option, which saves partition data to a disk file. The data saved is the protective MBR, the main header, the backup header, and one copy of the partition table, stored in a binary file in that order. You should back up your partition table and keep this file on another computer or on a removable medium.
If your disk is already damaged, performing a gdisk binary backup is a wise precaution before you begin repairing the disk. In the event that your repair attempts make matters worse, you may be able to recover the disk to its damaged state by restoring the backup; however, be aware that GPT fdisk's backup function saves the in-memory representation of the on-disk structures, and the program performs some minimal interpretation in the act of loading the data. Therefore, a backup of a corrupt partition table, when restored, might not exactly replicate the original corrupt state; it could be even worse!
GPT disks contain five data structures: the protective MBR, the main GPT header, the main partition table, the backup partition table, and the backup GPT header. Any or all of these data structures can become damaged. Although recovery from some problems is fairly simple, other problems may be impossible to fix. Things that Can Go Wrong -- http://www.rodsbooks.com/gdisk/repairing.html
Semi-Automated Recovery: When GPT fdisk starts, it attempts to read the various GPT data structures. In doing so, the program checks the CRC values stored in the main and backup headers, and it performs various other sanity checks. If GPT fdisk detects a problem, it will notify you, and if a fix is obvious, it will implement it automatically. For instance, consider what happens when GPT fdisk discovers that the main partition table's CRC doesn't match that stored in the main header: # gdisk /dev/sdc
Manual Recovery Procedures: In some cases, GPT fdisk won't be able to recover automatically. Several recovery options, most of them on the recovery & transformation menu, can help you recover your partitions, provided at least one valid partition table exists on the disk. You can see the available options by typing ? at the recovery & transformation menu's prompt.
In all cases, you should exercise extreme caution when performing data recovery. You can experiment with all of the options just described (except for w); none of the data-recovery tools causes immediate writes to the disk. The w option, though, overwrites all your GPT data structures, so if you aren't sure you've recovered your partitions, you should not use it. The z option on the experts' menu is also very dangerous; this option destroys all GPT (and optionally MBR) data structures and then exits. If your disk contains mission-critical data, I urge you to contact data-recovery specialists rather than poke around with GPT fdisk or any other data-recovery software. Although such specialists charge a great deal of money, they have the expertise needed to make full recovery of your data more likely. If hiring a data-recovery specialist is out of the question, making a complete backup of the problem disk can help ensure that you won't make matters worse.
You can use the dd command to do this job: # dd if=/dev/sda of=/dev/sdb Be sure to get the if and of parameters right; if you reverse them, you'll end up overwriting the disk you want to restore!
Out of Inodes
Another less common but tricky situation in which you might find yourself is the case of a file system that claims it is full, yet when you run df, you see that there is more than enough space. If this ever happens to you, the first thing you should check is whether you have run out of inodes (an inode is a data structure that holds information about a file). When you format a file system, the mkfs tool decides at that point the maximum number of inodes to use as a function of the size of the partition. Each new file that is created on that file system gets its own unique inode, and once you run out of inodes, no new files can be created. Generally speaking, you never get close to that maximum; however, certain servers store millions of files on a particular file system, and in those cases you might hit the upper limit. The df -i command will give you information on your inode usage:
$ df -i
Filesystem  Inodes  IUsed  IFree   IUse%  Mounted on
/dev/sda    520192  17539  502653  4%     /
In this example, the root partition has 520,192 total inodes but only 17,539 are used. That means you can create another 502,653 files on that file system. In the case where 100% of your inodes are used, only a few options are at your disposal. You can try to identify a large number of files that you can delete or move to another file system; you can possibly archive a group of files into a tar archive; or you can back up the files on your current file system, reformat it with more inodes, and copy the files back.
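If inode exhaustion is the problem, a quick way to see where the inodes are going, and a hypothetical reformat with a larger inode count, might look like this (device name and inode count are assumptions, and mkfs destroys existing data, so back up first):
$ sudo find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head   # directories with the most files
$ sudo mkfs.ext4 -N 2000000 /dev/sdb1    # reformat with roughly 2 million inodes (after backing up!)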
Using EBS Stats and Volume Analyzer to identify volume performance issues Average latency
Average Latency Latency vs. service time: latency = service time + time spent in the EBS client queue; therefore, high latency is expected to follow high service time. Service time is still the metric we are most concerned with.
Using EBS Stats and Volume Analyzer to identify volume performance issues Average Queue Length
Average Queue Length Queue length is the number of pending I/O requests for a volume. Volumes must maintain an average queue length (rounded to the nearest whole number) of 1 per minute for every 200 provisioned IOPS. The customer can check this through the CloudWatch metric VolumeQueueLength. If queue length spikes significantly and there is no corresponding increase in I/O Ops/Sec, it's likely that something is locking the I/O subsystem and preventing writes from being made to the disk, so pending operations pile up. The absolute queue length limit (an OS-level restriction) is 32. https://megamind.amazon.com/node/1596
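As an illustration, the metric can also be pulled with the AWS CLI (the volume ID and time range are placeholders):
$ aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name VolumeQueueLength --dimensions Name=VolumeId,Value=vol-1234abcd --start-time 2015-06-01T00:00:00Z --end-time 2015-06-01T01:00:00Z --period 60 --statistics Average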
Creating a filesystem
Creating a Filesystem Most filesystems, including all Linux-native filesystems, have Linux tools that can create the filesystem on a partition. Typically, these tools have filenames of the form mkfs.fstype, where fstype is the filesystem type code. These tools can also be called from a front-end tool called mkfs; you pass the filesystem type code to mkfs using its -t option:
# mkfs -t ext3 /dev/sda6
For ext2 and ext3 filesystems, the mke2fs program is often used instead of mkfs. The mke2fs program is just another name for mkfs.ext2.
In AWS docs:
[ec2-user ~]$ sudo mkfs -t ext4 device_name
Extra information: One obscure option that does deserve mention is -m percent, which sets the reserved-space percentage. The idea is that you don't want the disk to completely fill up with user files; if the disk starts getting close to full, Linux should report that the disk is full before it really is, at least for ordinary users. This gives the root user the ability to log in and create new files, if necessary, to help recover the system. The ext2fs/ext3fs/ext4fs reserved-space percentage defaults to 5 percent, which translates to quite a lot of space on large disks. You may want to reduce this value (say, by passing -m 2 to reduce it to 2 percent) on your root (/) filesystem and perhaps even lower (1 percent or 0 percent) on some, such as /home. Setting -m 0 also makes sense on removable disks, which aren't likely to be critical for system recovery and may be a bit cramped to begin with.
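For example, the reserved-space percentage can be set at creation time or adjusted later with tune2fs (the device name here is a placeholder):
# mkfs -t ext4 -m 2 /dev/sdb1     # create the filesystem with 2% reserved for root
# tune2fs -m 0 /dev/sdb1          # later, drop the reserved space to 0%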
Describe the purpose of disk partitions.
Describe the purpose of disk partitions. Disk partitions break the disk into a handful of distinct parts. Each partition can be used by a different OS, can contain a different filesystem, and is isolated from other partitions. These features improve security and safety and can greatly simplify running a multi-OS system.
Direct-attached storage (DAS)
Direct Attached Storage (DAS), also called Direct Attach Storage, is digital storage that is attached directly to a PC or a server. In other words, DAS isn't part of a storage network. The most familiar example of DAS is the internal hard drive in a laptop or desktop PC. In practice, the term "direct attached storage" is used most often in reference to dedicated storage arrays attached directly to servers. It is used to distinguish DAS from networked storage arrangements, like SAN or NAS devices. DAS can refer to a single drive or a group of drives that are connected together, as in a RAID array. In addition, DAS devices can be housed inside a PC or server (as is the case with internal hard drives) or outside the PC or server (as is the case with external hard drives and storage appliances). Multiple systems can use the same DAS device, as long as each PC or server has a separate connection to the storage device.
RAID: Redundant Arrays of Inexpensive Disks
Even with backups, the consequences of a disk failure on a server can be disastrous. RAID, "redundant arrays of inexpensive disks," is a system that distributes or replicates data across multiple disks. RAID not only helps avoid data loss but also minimizes the downtime associated with hardware failures (often to zero) and potentially increases performance. RAID can be implemented by dedicated hardware that presents a group of hard disks to the operating system as a single composite drive. It can also be implemented simply by the operating system's reading or writing multiple disks according to the rules of RAID. Because the disks themselves are always the most significant bottleneck in a RAID implementation, there is no reason to assume that a hardware-based implementation of RAID will necessarily be faster than a software- or OS-based implementation.
Sequential vs Random access:
Every time you need to access a block on a disk drive, the disk actuator arm has to move the head to the correct track (the seek time), then the disk platter has to rotate to locate the correct sector (the rotational latency). This mechanical action takes time, just like the sushi travelling around the conveyor belt. Obviously the amount of time depends on where the head was previously located and how fortunate you are with the location of the sector on the platter: if it's directly under the head you do not need to wait, but if it just passed the head you have to wait for a complete revolution. Even on the fastest 15k RPM disk that takes 4 milliseconds (15,000 rotations per minute = 250 rotations per second, which means one rotation is 1/250th of a second or 4ms). Admittedly that's faster than the sushi in my earlier analogy, but the chances are you will need to read or write a far larger number of blocks than I can eat sushi dishes (and trust me, on a good day I can pack a fair few away). What about the next block? Well, if that next block is somewhere else on the disk, you will need to incur the same penalties of seek time and rotational latency. We call this type of operation a random I/O. But if the next block happened to be located directly after the previous one on the same track, the disk head would encounter it immediately afterwards, incurring no wait time (i.e. no latency). This, of course, is a sequential I/O. Please continue reading here -- http://flashdba.com/2013/04/15/understanding-io-random-vs-sequential/
Expanding a Linux Partition Using parted (also provides some gdisk info)
Expanding a Linux Partition Using parted The parted utility is a partition editing tool that is available on most Linux distributions. It can create and edit both MBR partition tables and GPT partition tables. Some versions of parted (newer than version 2.1) have limited support for GPT partition tables and may cause boot issues if used to modify boot volumes. You can check your version of parted with the parted --version command. If you are expanding a partition that resides on a GPT partitioned device, you should use the gdisk utility instead. If you're not sure which disk label type your volume uses, you can check it with the sudo fdisk -l command. For more information, see To expand a Linux partition using gdisk. -- http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage_expand_partition.html#part-resize-gdisk Please see steps and test on own.
Common Filesystem Types
Ext2fs: The Second Extended File System (ext2fs or ext2) is the traditional Linux-native filesystem. It was created for Linux and was the dominant Linux filesystem throughout the late 1990s. Ext2fs has a reputation as a reliable filesystem. It has since been eclipsed by other filesystems, but it still has its uses. In particular, ext2fs can be a good choice for a small /boot partition, if you choose to use one, and for small (sub-gigabyte) removable disks. On such small partitions, the size of the journal used by more advanced filesystems can be a real problem, so the non-journaling ext2fs is a better choice. The ext2 filesystem type code is ext2.
Ext3fs: The Third Extended File System (ext3fs or ext3) is basically ext2fs with a journal added. The result is a filesystem that's as reliable as ext2fs but that recovers from power outages and system crashes much more quickly. The ext3 filesystem type code is ext3.
Ext4fs: The Fourth Extended File System (ext4fs or ext4) is the next-generation version of this filesystem family. It adds the ability to work with very large disks (those over 16TiB, the limit for ext2fs and ext3fs) or very large files (those over 2TiB), as well as extensions intended to improve performance. Its filesystem type code is ext4.
In addition to these Linux-native filesystems, you may need to deal with some others from time to time, including the following:
FAT: The File Allocation Table (FAT) filesystem is old and primitive, but ubiquitous. It's the only hard disk filesystem supported by DOS and Windows 9x/Me. For this reason, every major OS understands FAT, making it an excellent filesystem for exchanging data on removable disks. Two major orthogonal variants of FAT exist: it varies in the size of the FAT data structure after which the filesystem is named (12-, 16-, or 32-bit pointers), and it has variants that support long filenames. Linux automatically detects the FAT size, so you shouldn't need to worry about this. To use the original FAT filenames, which are limited to eight characters with an optional three-character extension (the so-called 8.3 filenames), use the Linux filesystem type code of msdos. To use Windows-style long filenames, use the filesystem type code of vfat. A Linux-only long filename system, known as umsdos, supports additional Linux features, enough that you can install Linux on a FAT partition, although this practice isn't recommended except for certain types of emergency disks or to try Linux on a Windows system.
NTFS: The New Technology File System (NTFS) is the preferred filesystem for Windows NT/200x/XP/Vista/7. Unfortunately, Linux's NTFS support is rather rudimentary. As of the 2.6.x kernel series, Linux can reliably read NTFS and can overwrite existing files, but the Linux kernel can't write new files to an NTFS partition.
Storage Hardware
There are a few basic ways to store computer data: hard disks, flash memory, magnetic tapes, and optical media.
EBS Pre-warming
For optimal EBS performance you must pre-warm the volume: For a completely new volume that was created from scratch, you should write to all blocks before using the volume. For a new volume created from a snapshot, you should read all the blocks that have data before using the volume. For example, on Linux you can read each block on the volume using the following command: $ dd if=/dev/md0 of=/dev/null
Using GNU parted
GNU Parted (http://www.gnu.org/software/parted/) is a partitioning tool that works with MBR, GPT, APM, BSD disk labels, and other disk types. You start GNU Parted much as you start fdisk, by typing its name followed by the device you want to modify, as in parted /dev/hda to partition /dev/hda. The result is some brief introductory text followed by a (parted) prompt at which you type commands. Type ? to see a list of commands, which are multi-character commands similar to Linux shell commands. For instance, print displays the current partition table, mkpart creates (makes) a partition, and rm removes a partition.
Hard disks
Hard disks: a typical hard drive contains several rotating platters coated with magnetic film. They are read and written by tiny skating heads that are mounted on a metal arm that swings back and forth to position them. The heads float close to the surface of the platters but do not actually touch. Reading from a platter is quick; it's the mechanical maneuvering needed to address a particular sector that drives down random-access throughput. There are two main sources of delay. First, the head armature must swing into position over the appropriate track. This part is called seek delay. Then, the system must wait for the right sector to pass underneath the head as the platter rotates. That part is rotational latency. Disks can stream data at tens of MB/s if reads are optimally sequenced, but random reads are fortunate to achieve more than a few MB/s. A set of tracks on different platters that are all the same distance from the spindle is called a cylinder. The cylinder's data can be read without any additional movement of the arm. Although heads move amazingly fast, they still move much slower than the disks spin around. Therefore, any disk access that does not require the heads to seek to a new position will be faster. Rotational speeds have increased over time. Currently, 7,200 RPM is the mass-market standard for performance-oriented drives, and 10,000 RPM and 15,000 RPM drives are popular at the high end. Higher rotational speeds decrease latency and increase the bandwidth of data transfers, but the drives tend to run hot. Hard disks fail frequently. Disk failures tend to involve either platter surfaces (bad blocks) or the mechanical components. The firmware and hardware interface usually remain operable after a failure, so you can query the disk for details. Drive reliability is often quoted by manufacturers in terms of mean time between failures (MTBF), denominated in hours.
Messed up partition: MBR
I recently had trouble with the MBR/partition table on my laptop. I managed to rebuild the partition table using testdisk and install GRUB to get it booting properly again (I'm using a dual-boot with Windows 7). However, I can no longer run gparted properly, as I get the error Can't have a partition outside the disk!. So the disk has 30401 cylinders, but sda6 ends at cylinder 30402; presumably that's where the problem is. When I run testdisk it has the 6th partition ending at cylinder 30401, but writing it to the partition table does not make any difference. To fix: use fdisk. Put it into sector mode with the u command, then p to print the table, d to delete the partition, and then n to recreate it. When you recreate it, use the same starting sector, but an ending sector that actually fits within the disk. When you are done and have double-checked (p again), save and quit with w.
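A sketch of what that fdisk session might look like (interactive prompts abbreviated; the partition number and sector values are illustrative):
# fdisk /dev/sda
Command (m for help): u          # switch display units to sectors
Command (m for help): p          # print the table; note sda6's starting sector
Command (m for help): d          # delete partition 6
Partition number (1-6): 6
Command (m for help): n          # recreate it with the same start, but an end inside the disk
Command (m for help): p          # double-check the result
Command (m for help): w          # write the table and exit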
I/O characteristics
I/O Characteristics On a given volume configuration, certain I/O characteristics drive the performance behavior on the back end. General Purpose (SSD) and Provisioned IOPS (SSD) volumes deliver consistent performance whether an I/O operation is random or sequential, and also whether an I/O operation is to read or write data. I/O size, however, does make an impact on IOPS because of the way they are measured. In order to fully understand how General Purpose (SSD) and Provisioned IOPS (SSD) volumes will perform in your application, it is important to know what IOPS are and how they are measured.
What are IOPS? IOPS are input/output operations per second. Amazon EBS measures each I/O operation per second (that is, 256 KB or smaller) as one IOPS. I/O operations that are larger than 256 KB are counted in 256 KB capacity units. For example, a 1,024 KB I/O operation would count as 4 IOPS. When you provision a 4,000 IOPS volume and attach it to an EBS-optimized instance that can provide the necessary bandwidth, you can transfer up to 4,000 chunks of data per second (provided that the I/O does not exceed the 128 MB/s per volume throughput limit of General Purpose (SSD) and Provisioned IOPS (SSD) volumes). This configuration could transfer 4,000 32 KB chunks, 2,000 64 KB chunks, or 1,000 128 KB chunks of data per second as well, before hitting the 128 MB/s per volume throughput limit. If your I/O chunks are very large, you may experience a smaller number of IOPS than you provisioned because you are hitting the volume throughput limit. For more information, see Amazon EBS Volume Types.
For 32 KB or smaller I/O operations, you should see the amount of IOPS that you have provisioned, provided that you are driving enough I/O to keep the drives busy. For smaller I/O operations, you may even see an IOPS value that is higher than what you have provisioned (when measured on the client side), because the client may be coalescing multiple smaller I/O operations into a smaller number of large chunks. If you are not experiencing the expected IOPS or throughput you have provisioned, ensure that your EC2 bandwidth is not the limiting factor; your instance should be EBS-optimized (or include 10 Gigabit network connectivity) and your instance type's EBS dedicated bandwidth should exceed the I/O throughput you intend to drive. For more information, see Amazon EC2 Instance Configuration. Another possible cause for not experiencing the expected IOPS is that you are not driving enough I/O to the EBS volumes. For more information, see Workload Demand.
Calculating a Single Drive's Maximum IOPS: Assume that we have a Seagate ST3146807FCV Cheetah 146GB 10K RPM Fibre Channel hard disk. It is rated as follows:
Average latency (avgLatency): 2.99 ms or .00299 seconds
Average seek time (avgSeek): 4.7 ms or .0047 seconds
To calculate this disk's IOPS, use the following equation (avgLatency and avgSeek measured in seconds):
IOPS = 1/(avgLatency + avgSeek)
IOPS = 1/(.00299 + .0047) = 130
The total maximum IOPS for this disk is about 130.
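The same arithmetic can be checked on the command line with bc, using the example values above:
$ echo 'scale=2; 1/(0.00299+0.0047)' | bc -l
130.03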
Using EBS Stats and Volume Analyzer to identify volume performance issues I/O Ops/sec
I/O Ops/sec This value is the total number of I/O operations per second, both read and write. You can turn it into a more tangible or useful value with the following formula: IOPS * {BlockSizeInBytes} = BytesPerSec. To find the block size, use iostat and look at the value for avgrq-sz: block size = avgrq-sz * 512 bytes. We use 512 bytes because blocks are equivalent to sectors with kernels 2.4 and later, and sectors are 512 bytes. You can use "blockdev --report" to identify your filesystem block size (look at the "BSZ" value). However, this doesn't necessarily mean that your application is performing I/Os at this block size! https://megamind.amazon.com/node/1596
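A rough worked example (the avgrq-sz figure is hypothetical): if iostat reports avgrq-sz = 16 sectors, the average I/O size is 16 * 512 = 8,192 bytes, so 1,000 IOPS would move roughly 8 MB/s. The relevant commands are:
$ iostat -x 1 3          # watch avgrq-sz (in 512-byte sectors) for the device in question
$ blockdev --report      # the BSZ column shows the filesystem block size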
General EBS Performance troubleshooting
If an instance is not EBS-optimized, we cannot guarantee reliable volume performance: EBS optimization provides dedicated network throughput for volumes. It is mandatory for PIOPS volumes, but it can also improve non-PIOPS performance. If a customer is using many standard EBS volumes and consistently does significant network I/O, EBS optimization is useful because it segregates that traffic. If the volume is not snapshotting or restoring, and a PIOPS volume is seeing high latency/svctime for several hours, a TT may be warranted. You can tell whether a volume is likely performing a snapshot action by looking at the Derived Server Side Stats graph and the S3 chunk time metric. Another thing to consider is the first-use penalty: if creating a new volume by restoring data from a snapshot, you need to read every block to avoid the first-use penalty; if creating a brand-new volume, you need to write every block to avoid it.
Instance metadata
Instance metadata is data about your instance that you can use to configure or manage the running instance. Instance metadata is divided into categories. For more information, see Instance Metadata Categories. (ami-id, block device mapping, hostname, instance id, instance type, network interface information, etc.) EC2 instances can also include dynamic data, such as an instance identity document that is generated when the instance is launched.
You can also access the user data that you supplied when launching your instance. For example, you can specify parameters for configuring your instance, or attach a simple script. You can also use this data to build more generic AMIs that can be modified by configuration files supplied at launch time. For example, if you run web servers for various small businesses, they can all use the same AMI and retrieve their content from the Amazon S3 bucket you specify in the user data at launch. To add a new customer at any time, simply create a bucket for the customer, add their content, and launch your AMI. If you launch more than one instance at the same time, the user data is available to all instances in that reservation.
Because you can access instance metadata and user data from within your running instance, you do not need to use the Amazon EC2 console or the CLI tools. This can be helpful when you're writing scripts to run from within your instance. For example, you can access your instance's local IP address from within the running instance to manage a connection to an external application.
When you are adding user data, take note of the following:
- User data is treated as opaque data: what you give is what you get back. It is up to the instance to be able to interpret it.
- User data is limited to 16 KB. This limit applies to the data in raw form, not base64-encoded form.
- User data must be base64-encoded before being submitted to the API. The API command line tools perform the base64 encoding for you. The data is decoded before being presented to the instance.
Retrieving Instance Metadata:
curl http://169.254.169.254/latest/meta-data/
GET http://169.254.169.254/latest/meta-data
Note that you are not billed for HTTP requests used to retrieve instance metadata and user data.
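For example, individual categories are simply paths under the base URL shown above:
$ curl http://169.254.169.254/latest/meta-data/instance-id
$ curl http://169.254.169.254/latest/meta-data/local-ipv4
$ curl http://169.254.169.254/latest/user-data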
Disk failure recovery
JBOD and RAID 0 modes are of no help when hardware problems occur; you must recover your data manually from backups. Other forms of RAID enter a degraded mode in which the offending devices are marked as faulty. The RAID arrays continue to function normally from the perspective of storage clients, although perhaps at reduced performance. Bad disks must be swapped out for new ones as soon as possible to restore redundancy to the array. A RAID 5 array or two-disk RAID 1 array can only tolerate the failure of a single device. Once that failure has occurred, the array is vulnerable to a second failure. The specifics of the process are usually pretty simple. You replace the failed disk with another of similar or greater size, then tell the RAID implementation to replace the old disk with the new one. What follows is an extended period during which the parity or mirror information is rewritten to the new, blank disk. Often, this is an overnight operation. The array remains available to clients during this phase, but performance is likely to be very poor. To limit downtime and the vulnerability of the array to a second failure, most RAID implementations let you designate one or more disks as "hot" spares. When a failure occurs, the faulted disk is automatically swapped for a spare, and the process of resynchronizing the array begins immediately.
All about swap space
Linux divides its physical RAM (random access memory) into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to a preconfigured space on the hard disk, called swap space, to free up that page of memory. The combined size of the physical memory and the swap space is the amount of virtual memory available.
Swapping is necessary for two important reasons. First, when the system requires more memory than is physically available, the kernel swaps out less-used pages and gives memory to the current application (process) that needs the memory immediately. Second, a significant number of the pages used by an application during its startup phase may only be used for initialization and then never used again. The system can swap out those pages and free the memory for other applications or even for the disk cache.
However, swapping does have a downside. Compared to memory, disks are very slow. Memory speeds can be measured in nanoseconds, while disks are measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. The more swapping that occurs, the slower your system will be. Sometimes excessive swapping, or thrashing, occurs: a page is swapped out, very soon swapped back in, then swapped out again, and so on. In such situations the system is struggling to find free memory and keep applications running at the same time. In this case only adding more RAM will help.
Linux has two forms of swap space: the swap partition and the swap file. The swap partition is an independent section of the hard disk used solely for swapping; no other files can reside there. The swap file is a special file in the filesystem that resides amongst your system and data files. To see what swap space you have, use the command swapon -s.
Mohammad's article: By default, EC2 instances do not include swap space. Depending on the requirements, it may be necessary to enable swap space within EC2 instances. Use of instance-store storage is recommended for swap space. Steps:
a. Ensure the AMI includes instance stores, or add one instance store to your instance at launch time.
b. SSH into the instance and query for the instance store:
[prompt]# df -ah | grep ephemeral
The above command should return output like the following (note this may differ based on your AMI; you have to test manually to see which device is mounted as ephemeral storage):
/dev/xvdb 37G 177M 35G 1% /media/ephemeral0
c. Unmount the mounted instance store:
[prompt]# umount /media/ephemeral0
d. Create the swap partition:
[prompt]# mkswap /dev/xvdb
e. Comment out the line for /media/ephemeral0 within /etc/fstab:
#/dev/sdb /media/ephemeral0 auto defaults,nofail,comment=cloudconfig 0 2
f. Activate the swap space:
[prompt]# swapon /dev/xvdb
g. Check that the swap space is active:
[prompt]# cat /proc/meminfo | grep Swap
The above command should return output like the following:
SwapCached: 0 kB
SwapTotal: 39313404 kB
SwapFree: 39313404 kB
i. Create a script to prepare swap space so that if the instance is shut down and then moved to a different physical server, a pristine instance store will be converted to swap space again. Invoking the script from /etc/rc.local would be ideal (see the sketch below).
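A minimal sketch of such a boot-time script, under the assumption that the ephemeral device appears as /dev/xvdb (the script name is hypothetical; verify the device name on your AMI before using anything like this):
#!/bin/sh
# Hypothetical /usr/local/sbin/ephemeral-swap.sh, invoked from /etc/rc.local.
# Converts the (possibly pristine) instance-store device to swap space at boot.
DEV=/dev/xvdb
if ! grep -q "^$DEV " /proc/swaps; then
    umount "$DEV" 2>/dev/null || true   # in case cloud-init mounted it as ephemeral storage
    mkswap "$DEV"
    swapon "$DEV"
fi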
LVM: Logical volume management
Logical volume management is essentially a supercharged and abstracted version of disk partitioning. It groups individual storage devices into "volume groups." The blocks in a volume group can then be allocated to "logical volumes," which are represented by block device files and act like disk partitions. However, logical volumes are more flexible and powerful than disk partitions. Here are some of the magical operations a volume manager lets you carry out:
• Move logical volumes among different physical devices
• Grow and shrink logical volumes on the fly
• Take copy-on-write "snapshots" of logical volumes
• Replace on-line drives without interrupting service
• Incorporate mirroring or striping in your logical volumes
The components of a logical volume can be put together in various ways. Concatenation keeps each device's physical blocks together and lines the devices up one after another. Striping interleaves the components so that adjacent virtual blocks are actually spread over multiple physical disks. By reducing single-disk bottlenecks, striping can often provide higher bandwidth and lower latency.
Mounting and Unmounting Filesystems
Maintaining filesystems is necessary, but the whole reason filesystems exist is to store files, in other words, to be useful. Under Linux, filesystems are most often used by being mounted, that is, associated with a directory. This task can be accomplished on a one-time basis by using tools such as mount (and then unmounted with umount) or persistently across reboots by editing the /etc/fstab file.
Syntax for mount:
mount [-alrsvw] [-t fstype] [-o options] [device] [mountpoint]
Mount All Filesystems: The -a parameter causes mount to mount all the filesystems listed in the /etc/fstab file, which specifies the most-used partitions and devices.
Mount Read-Only: The -r parameter causes Linux to mount the filesystem read-only, even if it's normally a read/write filesystem.
Show Verbose Output: As with many commands, -v produces verbose output; the program provides comments on operations as they occur.
Mount Read/Write: The -w parameter causes Linux to attempt to mount the filesystem for both read and write operations. This is the default for most filesystems, but some experimental drivers default to read-only operation. The -o rw option has the same effect.
Mount by Label or UUID: The -L label and -U uuid options tell mount to mount the filesystem with the specified label or UUID, respectively.
Mount Point: The mountpoint is the directory to which the device's contents should be attached. As with device, it's usually required, but it may be omitted under some circumstances.
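For example (the device, mount point, and UUID are illustrative; blkid shows the real UUID of a partition):
# blkid /dev/sdb1                      # show the partition's UUID and label
# mount -t ext4 /dev/sdb1 /mnt/data    # one-time mount
# mount -U 3631a288-673e-40f5-9e96-6539fec468e9 /mnt/data   # mount by UUID instead
# umount /mnt/data                     # unmount when done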
EC2 instance store volume
Many Amazon EC2 instance types can access disk storage located on disks that are physically attached to the host computer. This disk storage is referred to as instance store. An instance store provides temporary block-level storage for use with an instance. The size of an instance store ranges from 900 MiB up to 48 TiB and varies by instance type. Larger instance types have larger instance stores. Some smaller instance families, such as T2 and T1, do not support instance store volumes at all; they use Amazon EBS exclusively for storage.
An instance store consists of one or more instance store volumes. When you launch an instance store-backed AMI, each instance store volume available to the instance is automatically mapped. When you launch an Amazon EBS-backed AMI, instance store volumes must be configured using block device mapping at launch time (with either the default block device mapping for the chosen AMI or manually using the console or the CLI or SDK tools). Volumes must be formatted and mounted on the running instance before they can be used. Some AMIs (such as Ubuntu and Amazon Linux) that use the cloud-init utilities during the initial boot cycle may format and mount a single instance store volume, but this varies by AMI.
By default, instances launched from an Amazon EBS-backed AMI have no mounted instance store volumes. Instances launched from an instance store-backed AMI have a mounted instance store volume for the virtual machine's root device volume (the size of this volume varies by AMI, but the maximum size is 10 GiB) in addition to the instance store volumes included with the instance type.
Instance Store Swap Volumes: Swap space in Linux can be used when a system requires more memory than it has been physically allocated. When swap space is enabled, Linux systems can swap infrequently used memory pages from physical memory to swap space (either a dedicated partition or a swap file in an existing file system) and free up that space for memory pages that require high speed access. Note: Using swap space for memory paging is not as fast or efficient as using RAM. If your workload is regularly paging memory into swap space, you should consider migrating to a larger instance type with more RAM. For more information, see Resizing Your Instance. The c1.medium and m1.small instance types have a limited amount of physical memory to work with, and they are given a 900 MB swap volume at launch time to act as virtual memory for Linux AMIs. Although the Linux kernel sees this swap space as a partition on the root device, it is actually a separate instance store volume, regardless of your root device type. Amazon Linux AMIs automatically enable and use this swap space, but your AMI may require some additional steps to recognize and use it. To see if your instance is using swap space, you can use the swapon -s command.
Instance Store Usage Scenarios: Instance store volumes are ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.
Making Instance Stores Available on Your Instances: Instances that use Amazon EBS for the root device do not, by default, have instance store available at boot time. Also, you can't attach instance store volumes after you've launched an instance.
Therefore, if you want your Amazon EBS-backed instance to use instance store volumes, you must specify them using a block device mapping when you create your AMI or launch your instance. Examples of block device mapping entries are: /dev/sdb=ephemeral0 and /dev/sdc=ephemeral1.
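A rough example of such a mapping with the AWS CLI at launch time (the AMI ID, key name, and instance type are placeholders):
$ aws ec2 run-instances --image-id ami-12345678 --instance-type m3.medium --key-name my-key --block-device-mappings '[{"DeviceName":"/dev/sdb","VirtualName":"ephemeral0"},{"DeviceName":"/dev/sdc","VirtualName":"ephemeral1"}]'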
Preparing a Partition for Use
Once a partition is created, you must prepare it for use. This process is often called "making a filesystem" or "formatting a partition." It involves writing low-level data structures to disk. Linux can then read and modify these data structures to access and store files in the partition.
Pre-Warming Amazon EBS Volumes
Pre-Warming Amazon EBS Volumes When you create any new EBS volume (General Purpose (SSD), Provisioned IOPS (SSD), or Magnetic) or restore a volume from a snapshot, the back-end storage blocks are allocated to you immediately. However, the first time you access a block of storage, it must be either wiped clean (for new volumes) or instantiated from its snapshot (for restored volumes) before you can access the block. This preliminary action takes time and can cause a 5 to 50 percent loss of IOPS for your volume the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.
However, you can avoid this performance hit in a production environment by writing to or reading from all of the blocks on your volume before you use it; this process is called pre-warming. Writing to all of the blocks on a volume is preferred, but that is not an option for volumes that were restored from a snapshot, because that would overwrite the restored data. For a completely new volume that was created from scratch, you should write to all blocks before using the volume. For a new volume created from a snapshot, you should read all the blocks that have data before using the volume.
To Pre-Warm a New Volume on Linux: For a new volume, use the dd command to write to all blocks on the volume. This procedure will write zeroes to all of the blocks on the device and prepare the volume for use. Any existing data on the volume will be lost.
To Pre-Warm a Volume Restored from a Snapshot on Linux: For volumes that have been restored from snapshots, use the dd command to read from (and optionally write back to) all blocks on the volume. All existing data on the volume will be preserved.
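For example, assuming the volume is attached as /dev/xvdf (the device name and block size here are assumptions):
# dd if=/dev/zero of=/dev/xvdf bs=1M       # new volume: write every block (destroys any data on the device)
# dd if=/dev/xvdf of=/dev/null bs=1M       # restored volume: read every block; data is preserved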
RAID level 0
RAID 0 consists of striping, without mirroring or parity. RAID level 0 is used strictly to increase performance. It combines two or more drives of equal size, but instead of stacking them end-to-end, it stripes data alternately among the disks in the pool. Sequential reads and writes are therefore spread among several disks, decreasing write and access times. Use: When I/O performance is more important than fault tolerance; for example, as in a heavily used database (where data replication is already set up separately). Disadvantages: Performance of the stripe is limited to the worst performing volume in the set. Loss of a single volume results in a complete data loss for the array.
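As an illustration of software RAID 0 (not from the source text; the device names are assumptions, and on EC2 these would typically be two identically sized EBS volumes):
# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
# mkfs -t ext4 /dev/md0
# mkdir -p /mnt/raid0 && mount /dev/md0 /mnt/raid0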
RAID level 1
RAID 1 consists of mirroring, without parity or striping. RAID level 1 is colloquially known as mirroring. Writes are duplicated to two or more drives simultaneously. This arrangement makes writes slightly slower than they would be on a single drive. However, it offers read speeds comparable to RAID 0 because reads can be farmed out among the several duplicate disk drives. Use: When fault tolerance is more important than I/O performance; for example, as in a critical application. Advantages: Safer from the standpoint of data durability. Disadvantages: Does not provide a write performance improvement; requires more Amazon EC2 to Amazon EBS bandwidth than non-RAID configurations because the data is written to multiple volumes simultaneously.
Important note about RAID 5 and RAID 6 on AWS
RAID 5 and RAID 6 are not recommended for Amazon EBS because the parity write operations of these RAID modes consume some of the IOPS available to your volumes. Depending on the configuration of your RAID array, these RAID modes provide 20-30% fewer usable IOPS than a RAID 0 configuration. Increased cost is a factor with these RAID modes as well; when using identical volume sizes and speeds, a 2-volume RAID 0 array can outperform a 4-volume RAID 6 array that costs twice as much.
Drawbacks of RAID 5
RAID 5 is a popular configuration, but it has some weaknesses, too. The following issues apply to RAID 6 also, but for simplicity we frame the discussion in terms of RAID 5. First, it's critically important to note that RAID 5 does not replace regular off-line backups. It protects the system against the failure of one disk—that's it. It does not protect against the accidental deletion of files. It does not protect against controller failures, fires, hackers, or any number of other hazards. Second, RAID 5 isn't known for its great write performance. Finally, RAID 5 is vulnerable to corruption in certain circumstances.
RAID write penalty
RAID Write Penalty Each RAID type suffers a different write penalty. The most common RAID types and their write penalties are defined in the following table:
RAID Type    Write Penalty
RAID 1       2
RAID 5       4
RAID 10      2
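A quick worked example using the commonly cited sizing rule (the workload numbers are hypothetical, not from the source):
Back-end IOPS = read IOPS + (write IOPS x write penalty)
A 1,000 IOPS workload that is 50% writes needs roughly 500 + (500 x 4) = 2,500 back-end IOPS on RAID 5, but only 500 + (500 x 2) = 1,500 on RAID 1 or RAID 10.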
RAID Levels
RAID can do two basic things. First, it can improve performance by "striping" data across multiple drives, thus allowing several drives to work simultaneously to supply or absorb a single data stream. Second, it can replicate data across multiple drives, decreasing the risk associated with a single failed disk. Replication assumes two basic forms: mirroring, in which data blocks are reproduced bit-for-bit on several different drives, and parity schemes, in which one or more drives contain an error-correcting checksum of the blocks on the remaining data drives. Mirroring is faster but consumes more disk space. Parity schemes are more disk-space-efficient but have lower performance. The levels are simply different configurations; use whichever versions suit your needs.
RAID level 5
RAID level 5 stripes both data and parity information, adding redundancy while simultaneously improving read performance. In addition, RAID 5 is more efficient in its use of disk space than is RAID 1. If there are N drives in an array (at least three are required), N-1 of them can store data. The space-efficiency of RAID 5 is therefore at least 67%, whereas that of mirroring cannot be higher than 50%.
RAID level 6
RAID level 6 is similar to RAID 5 with two parity disks. A RAID 6 array can withstand the complete failure of two drives without losing data.
RAID level 1+0
RAID levels 1+0 and 0+1 are stripes of mirror sets or mirrors of stripe sets. Logically, they are concatenations of RAID 0 and RAID 1, but many controllers and software implementations provide direct support for them. The goal of both modes is to simultaneously obtain the performance of RAID 0 and the redundancy of RAID 1.
RAID level 2,3,4
RAID levels 2, 3, and 4 are defined but are rarely deployed. Logical volume managers usually include both striping (RAID 0) and mirroring (RAID 1) features.
Rebuild RAID
Repair Software RAID The hard drive is one of the pieces of hardware most likely to break on your server, and if you run a system that uses Linux software RAID, it's good to know how to repair the RAID. The first step is figuring out how to detect when a RAID has failed. On a modern software RAID install, the system should have mdadm configured to email the root user whenever there is a RAID problem (if you want to change this, edit the MAILADDR option in /etc/mdadm/mdadm.conf and run /etc/init.d/mdadm reload as root to load the changes). Otherwise you can view the /proc/mdstat file:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[0] sdd1[3](F) sdc1[1]
16771584 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
unused devices: <none>
Here you can see that sdd1 is marked with an (F), stating it has failed, and on the next line of output, the array shows two out of three disks ([3/2] [UU_]). The next step is to remove the disk from /dev/md0 so that you can swap it out with a new drive. To do this, run mdadm with the --remove option:
$ sudo mdadm /dev/md0 --remove /dev/sdd1
The drive must be set as a failed drive for you to remove it, so if for some reason mdadm hasn't picked up the drive as faulty but you want to swap it out, you might need to set it as faulty before you remove it:
$ sudo mdadm /dev/md0 --fail /dev/sdd1
The mdadm command supports chaining commands, so you could fail and remove a drive on the same line:
$ sudo mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1
Once you remove a drive from an array, it will be missing from /proc/mdstat:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[0] sdc1[1]
16771584 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
unused devices: <none>
Now you can swap out the drive with a fresh one and partition it (either a hot-swap if your system supports that, or otherwise by powering the system down and swapping the hard drives). Be sure that when you replace drives, you create new partitions equal to or greater in size than the rest of the partitions in the RAID array. Once the new partition is ready, use the --add command to add it to the array:
$ sudo mdadm /dev/md0 --add /dev/sdd1
Now mdadm will start the process of resyncing data. This can take some time, depending on the speed and size of your disks. You can monitor the progress from /proc/mdstat:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[3] sdb1[0] sdc1[1]
16771584 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
[>....................] recovery = 2.0% (170112/8385792) finish=1.6min speed=85056K/sec
unused devices: <none>
Solid State disks
SSDs spread reads and writes across banks of flash memory cells, which are individually rather slow in comparison to modern hard disks. But because of parallelism, the SSD as a whole meets or exceeds the bandwidth of a traditional disk. The great strength of SSDs is that they continue to perform well when data is read or written at random, an access pattern that's predominant in real-world use.

Hard disks and SSDs are enough alike that they can act as drop-in replacements for each other, at least at the hardware level. They use the same hardware interfaces and interface protocols. And yet they have different strengths.

A further complication is that flash memory pages must be erased before they can be rewritten. Erasing is a separate operation that is slower than writing. It's also impossible to erase individual pages; clusters of adjacent pages (typically 128 pages, or 512KiB) must be erased together. The write performance of an SSD can drop substantially when the pool of pre-erased pages is exhausted and the drive must recover pages on the fly to service ongoing writes.

Compared to standard hard disks, SSDs have faster random access times and faster sequential and random reads, but they are more expensive and their cells tolerate only a limited number of writes.
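The text above doesn't cover it, but one common way to help the drive maintain its pool of pre-erased pages is to tell it which blocks the filesystem no longer uses (TRIM/discard). A minimal sketch using fstrim on a mounted filesystem that supports it:

# Report unused blocks on / to the underlying SSD so they can be erased in advance
$ sudo fstrim -v /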
Creating Swap Space
Some partitions don't hold files. Most notably, Linux can use a swap partition, which is a partition that Linux treats as an extension of memory. (Linux can also use a swap file, which is a file that works in the same way. Both are examples of swap space.) Linux uses the MBR partition type code 0x82 to identify swap space, but as with other partitions, this code is mostly a convenience to keep other OSs from trying to access Linux swap partitions; Linux uses /etc/fstab to define which partitions to use as swap space.

Although swap space doesn't hold a filesystem per se and isn't mounted in the way that filesystem partitions are mounted, it does require preparation similar to the creation of a filesystem. This task is accomplished with the mkswap command, which you can generally use by passing it nothing but the device identifier:

# mkswap /dev/sda7

This example turns /dev/sda7 into swap space. To use the swap space, you must activate it with the swapon command:

# swapon /dev/sda7

To permanently activate swap space, you must create an entry for it in /etc/fstab.
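A typical /etc/fstab entry for the swap partition prepared above looks like this (some distributions use defaults rather than sw in the options field):

/dev/sda7    none    swap    sw    0    0

After adding the entry, sudo swapon -a activates all swap areas listed in /etc/fstab.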
/etc/fstab
The /etc/fstab file controls how Linux provides access to disk partitions and removable media devices. Linux supports a unified directory structure in which every disk device (partition or removable disk) is mounted at a particular point in the directory tree. The /etc/fstab file describes how these filesystems are laid out. (The filename fstab is an abbreviation for filesystem table.) Example:

#device     mount point    filesystem  options          dump  fsck
/dev/hda1   /              ext4        defaults         1     1
/dev/hdb5   /windows       vfat        uid=500,umask=0  0     0
/dev/hdc    /media/cdrom   iso9660     users,noauto     0     0

Device -- The first column specifies the mount device. These are usually device filenames that reference hard disks, floppy drives, and so on. Most distributions now specify partitions by their labels or UUIDs, as in LABEL=/home or UUID=3631a288-673e-40f5-9e96-6539fec468e9. When Linux encounters such an entry, it tries to find the partition whose filesystem has the specified label or UUID and mount it. This practice can help reduce problems if partition numbers change, but some filesystems lack these labels. It's also possible to list a network drive, as in server:/home, which is the /home export on the computer called server; or //winsrv/shr, which is the shr share on the Windows or Samba server called winsrv.

Mount Point -- The second column specifies the mount point; in the unified Linux filesystem, this is where the partition or disk will be mounted. This should usually be an empty directory in another filesystem. The root (/) filesystem is an exception. So is swap space, which is indicated by an entry of swap.

Filesystem Type -- The filesystem type code is the same as the type code used to mount a filesystem with the mount command. You can use any filesystem type code you can use directly with the mount command. A filesystem type code of auto lets the kernel auto-detect the filesystem type, which can be a convenient option for removable media devices. Auto-detection doesn't work with all filesystems, though.

Mount Options -- Most filesystems support several mount options, which modify how the kernel treats the filesystem. You may specify multiple mount options, separated by commas. For instance, uid=500,umask=0 for /windows in the example above sets the user ID (owner) of all files to 500 and sets the umask to 0. The most common mount options are summarized in the "/etc/fstab options" section below.

Backup Operation -- The next-to-last field contains a 1 if the dump utility should back up a partition or a 0 if it shouldn't. If you never use the dump backup program, this option is essentially meaningless. (The dump program was once a common backup tool, but it is much less popular today.)

Filesystem Check Order -- At boot time, Linux uses the fsck program to check filesystem integrity. The final column specifies the order in which this check occurs. A 0 means that fsck should not check a filesystem. Higher numbers represent the check order. The root partition should have a value of 1, and all others that should be checked should have a value of 2. Some filesystems, such as ReiserFS, shouldn't be automatically checked and so should have values of 0.
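To find the UUID to use in such an entry, blkid is the usual tool. A minimal sketch, with a hypothetical device and illustrative output reusing the UUID from the example above:

$ sudo blkid /dev/sdb2
/dev/sdb2: UUID="3631a288-673e-40f5-9e96-6539fec468e9" TYPE="ext4"

A corresponding /etc/fstab line might then read:

UUID=3631a288-673e-40f5-9e96-6539fec468e9   /home   ext4   defaults   0   2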
The File System Is Read-Only
Every now and then you may encounter a file system that isn't full but won't let you write to it all the same. When you try to copy or save a file, you get an error that the file system is read-only. The first step is to see whether you can simply remount the file system read-write; for instance, if the /home partition were read-only, you would type

$ sudo mount -o remount,rw /home

Chances are, though, if you get this error, it's because your file system has encountered some sort of error and has remounted itself read-only to protect itself from further damage. This sort of problem happens more frequently on virtual machines, due in part, I imagine, to the extra level of abstraction between the virtual disk and the physical hardware. When there's some hiccup between the two, the file system detects a serious error and protects itself. To know for sure, examine the output of the dmesg command, specifically for lines that begin with EXT3-fs error. You should see lines in the output that reference the errors ext3 found and a log entry that states Remounting filesystem read-only.

So what do you do if this happens to you? If the file system is not the root partition, you can try to unmount it completely and then remount it. If it's the root partition, or remounting doesn't work, unfortunately you will have to reboot the system so it can check and remount the file system cleanly. If after a reboot the file system still won't mount cleanly, move on to the "Repair Corrupted File Systems" section below.
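A quick way to filter dmesg for the messages described above (the exact wording varies with the filesystem and kernel version):

$ dmesg | grep -i -e 'EXT3-fs error' -e 'Remounting filesystem read-only'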
Summarize the tools that can help keep a filesystem healthy.
The fsck program is a front-end to filesystem-specific tools such as e2fsck and fsck.jfs. By whatever name, these programs examine a filesystem's major data structures for internal consistency and can correct minor errors.
Understanding inodes
The inode (index node) is a fundamental concept in the Linux and UNIX filesystem. Each object in the filesystem is represented by an inode. But what are these objects? Every file under Linux (and UNIX) has the following attributes:

=> File type (executable, block special, etc.)
=> Permissions (read, write, etc.)
=> Owner
=> Group
=> File size
=> File access, change, and modification times (note that traditional UNIX and Linux filesystems never store the file creation time; this is a favorite UNIX/Linux sysadmin interview question)
=> File deletion time
=> Number of links (soft/hard)
=> Extended attributes such as append-only or immutability (no one can delete the file, including the root user)
=> Access Control Lists (ACLs)

All of the above information is stored in the inode. In short, the inode identifies the file and its attributes. Each inode is identified by a unique inode number within the file system; the inode number is also known as the index number.

inode definition: An inode is a data structure on a traditional Unix-style file system such as UFS or ext3. An inode stores basic information about a regular file, directory, or other file system object.

How do I see a file's inode number? You can use the ls -i command:

$ ls -i /etc/passwd
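Two related commands worth knowing; a minimal sketch (output omitted): stat prints the attributes stored in a file's inode, and df -i reports how many inodes are used and free on each filesystem:

$ stat /etc/passwd    # inode number, type, permissions, owner, size, link count, timestamps
$ df -i /             # inode usage for the filesystem containing /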
Summarize important Linux disk partitions.
The most important Linux disk partition is the root (/) partition, which is at the base of the Linux directory tree. Other possible partitions include a swap partition, /home for home directories, /usr for program files, /var for transient system files, /tmp for temporary user files, /boot for the kernel and other critical boot files, and more.
mdadm: Linux software RAID
The standard software RAID implementation for Linux is called md, the "multiple disks" driver. It's front-ended by the mdadm command. md supports all the RAID configurations listed above as well as RAID 4.

The virtual file /proc/mdstat always contains a summary of md's status and the status of all the system's RAID arrays. It is especially useful to keep an eye on the /proc/mdstat file after adding a new disk or replacing a faulty drive. (watch cat /proc/mdstat is a handy idiom.)

mdadm does not technically require a configuration file, although it will use one if supplied (typically /etc/mdadm.conf). We strongly recommend the use of a configuration file. It documents the RAID configuration in a standard way, thus giving administrators an obvious place to look for information when problems occur. To enable the array at startup by using the freshly created /etc/mdadm.conf, we would execute

$ sudo mdadm -As /dev/md0

To remove a drive from the RAID configuration, use mdadm -r.

The following steps create and mount a striped (RAID 0) array:

1. To create a RAID 0 array, execute the following command (note the --level=stripe option to stripe the array):
[ec2-user ~]$ sudo mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=number_of_volumes device_name1 device_name2

2. Create a file system on your RAID device. For example, to create an ext4 file system, execute the following command:
[ec2-user ~]$ sudo mkfs.ext4 /dev/md0

3. Create a mount point for your RAID array:
[ec2-user ~]$ sudo mkdir /mnt/md0

4. Finally, mount the RAID device on the mount point that you created:
[ec2-user ~]$ sudo mount -t ext4 /dev/md0 /mnt/md0
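Since a configuration file is recommended, a common way to generate one is to capture the current array definitions with mdadm --detail --scan. A minimal sketch (the config file path varies by distribution: /etc/mdadm.conf or /etc/mdadm/mdadm.conf):

$ sudo sh -c 'mdadm --detail --scan >> /etc/mdadm.conf'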
Creating Partitions and Filesystems Partitioning a disk
The traditional Linux tool for disk partitioning is called fdisk. This tool's name is short for fixed disk. Although fdisk is the traditional tool, several others exist. One of these is GNU Parted, which can handle several different partition table types, not just the MBR that fdisk can handle. If you prefer fdisk to GNU Parted but must use GPT, you can use GPT fdisk (http://www.rodsbooks.com/gdisk/); this package's gdisk program works much like fdisk but on GPT disks. Some administrators prefer the related cfdisk (or the similar cgdisk for GPT), which has a friendlier user interface. The sfdisk (or sgdisk for GPT) tool is useful for writing scripts that can handle disk partitioning tasks.
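As an illustration of scripted, non-interactive partitioning for a disk that needs GPT, parted can be driven entirely from the command line. A minimal sketch, with /dev/sdb as a hypothetical target disk (this will replace its existing partition table):

# Create a GPT label and a single partition spanning the whole disk
$ sudo parted -s /dev/sdb mklabel gpt mkpart primary ext4 0% 100%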
Repair Corrupted File Systems
There are a number of scenarios in which a file system might get corrupted, through either a hard reboot or some other error. Normally Linux will automatically run a file system check command (called fsck) at boot to attempt to repair the file system. Often the default fsck is enough to repair the file system, but every now and then a file system gets corrupted enough that it needs manual intervention.

What you will often see is the boot process drop out after an fsck fails, hopefully to a rescue shell you can use to run fsck manually. Otherwise, track down a rescue disk you can boot from (many distribution install disks double as rescue disks nowadays), open up a terminal window, and make sure you have root permissions (on rescue disks that use sudo, you may have to type sudo -s to get root).

One warning before you start fscking a file system: be sure the file system is unmounted first. Otherwise fsck could potentially damage your file system further. You can run the mount command in the shell to see all mounted file systems and type umount <devicename> to unmount any that are mounted (except the root file system). Since this file system is preventing you from completing the boot process, it probably isn't mounted, so in this example let's assume that your /home directory is mounted on a separate partition at /dev/sda5. To scan and repair any file system errors on this file system, type

# fsck -y -C /dev/sda5

The -y option will automatically answer Yes to repair file system errors. Otherwise, if you do have any errors, you will find yourself hitting Y over and over again. The -C option gives you a nice progress bar so you can see how far along fsck is. A complete fsck can take some time on a large file system, so the progress bar can be handy.

Sometimes file systems are so corrupted that the primary superblock cannot be found. Luckily, file systems create backup superblocks in case this happens, so you can tell fsck to use one of those instead. You aren't likely to know the location of your backup superblocks off the top of your head. For ext-based file systems you can use the mke2fs tool with the -n option to list all of the superblocks on a file system:

# mke2fs -n /dev/sda5

Once you see the list of superblocks in the output, choose one and pass it as an argument to the -b option of fsck:

# fsck -b 8193 -y -C /dev/sda5

When you specify an alternate superblock, fsck will automatically update your primary superblock after it completes the file system check.
Understanding UNIX / Linux filesystem Superblock in File system, Linux, UNIX
Let us take the example of a 20 GB hard disk. The entire disk space is subdivided into multiple file system blocks. What are the blocks used for? Blocks serve two different purposes:

- Most blocks store user data, i.e., files.
- Some blocks in every file system store the file system's metadata.

So what is metadata? In simple words, metadata describes the structure of the file system. The most common metadata structures are the superblock, inodes, and directories. Inodes are described in the "Understanding inodes" section above; the superblock is described here.

Superblock: Each file system has a type, such as ext2 or ext3. Further, each file system has a size, such as 5 GB or 10 GB, and a status, such as its mount status. In short, each file system has a superblock, which contains information about the file system, such as:

- File system type
- Size
- Status
- Information about other metadata structures

If this information is lost, you are in trouble (data loss), so Linux maintains multiple redundant copies of the superblock in every file system. This is very important in many emergency situations; for example, you can use the backup copies to restore a damaged primary superblock. The following command displays the primary and backup superblock locations on /dev/sda3:

# dumpe2fs /dev/sda3 | grep -i superblock
Partitioning a disk using fdisk
To use Linux's fdisk, type the command name followed by the name of the disk device you want to partition, as in fdisk /dev/hda to partition the primary master PATA disk.

Display the Current Partition Table -- You may want to begin by displaying the current partition table. To do so, type p. If you only want to display the current partition table, you can type fdisk -l /dev/hda (or whatever the device identifier is) at a command prompt rather than enter fdisk's interactive mode.

Create a Partition -- To create a partition, type n. The result is a series of prompts asking for information about the partition: whether it should be a primary, extended, or logical partition; the partition's starting cylinder; the partition's ending cylinder or size; and so on.

Delete a Partition -- To delete a partition, type d. If more than one partition exists, the program will ask for the partition number, which you must enter.

Change a Partition's Type -- When you create a partition, fdisk assigns it a type code of 0x83, which corresponds to a Linux filesystem. If you want to create a Linux swap partition or a partition for another OS, you can type t to change a partition type code. The program then prompts you for a partition number and a type code.

List Partition Types -- Several dozen partition type codes exist, so it's easy to forget what they are. Type l (that's a lowercase L) at the main fdisk prompt to see a list of the most common ones. You can also get this list by typing L when you're prompted for the partition type when you change a partition's type code.

Mark a Partition Bootable -- Some OSs, such as DOS and Windows, rely on their partitions having special bootable flags in order to boot. You can set this flag by typing a, whereupon fdisk asks for the partition number.

Get Help -- Type m or ? to see a summary of the main fdisk commands.

Exit -- Linux's fdisk supports two exit modes. First, you can type q to exit the program without saving any changes; anything you do with the program is lost. This option is particularly helpful if you've made a mistake. Second, typing w writes your changes to the disk and exits the program.
Using umount
The umount command is simpler than mount. The basic umount syntax is as follows:

umount [-afnrv] [-t fstype] [device | mountpoint]

Unmount All -- Rather than unmount partitions listed in /etc/fstab, the -a option causes the system to attempt to unmount all the partitions listed in /etc/mtab, the file that holds information about mounted filesystems. On a normally running system, this operation is likely to succeed only partly because it won't be able to unmount some key filesystems, such as the root partition.

Force Unmount -- You can use the -f option to tell Linux to force an unmount operation that might otherwise fail. This feature is sometimes helpful when unmounting NFS mounts shared by servers that have become unreachable.
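Not covered above, but closely related: when umount fails because the target is busy, fuser can show which processes still have files open on the mount point. A minimal sketch with a hypothetical mount point:

$ fuser -vm /mnt/data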
Volume groups and logical volumes
Volume groups and logical volumes are associated with logical volume managers (LVMs). These systems aggregate physical devices to form pools of storage called volume groups. The administrator can then subdivide this pool into logical volumes in much the same way that disks of yore were divided into partitions. For example, a 750GB disk and a 250GB disk could be aggregated into a 1TB volume group and then split into two 500GB logical volumes. At least one volume would include data blocks from both hard disks. Since the LVM adds a layer of indirection between logical and physical blocks, it can freeze the logical state of a volume simply by making a copy of the mapping table. Therefore, logical volume managers often provide some kind of a "snapshot" feature. Writes to the volume are then directed to new blocks, and the LVM keeps both the old and new mapping tables. Of course, the LVM has to store both the original image and all modified blocks, so it can eventually run out of space if a snapshot is never deleted.
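A minimal sketch of the snapshot workflow described above, with volume group and volume names that are purely illustrative (the snapshot's size limits how much change it can absorb before filling up):

# Create a 5GB copy-on-write snapshot of an existing logical volume
$ sudo lvcreate --size 5G --snapshot --name snap1 /dev/volume_group1/volume1
# Remove it when no longer needed
$ sudo lvremove /dev/volume_group1/snap1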
Relationship between block size and IOPS
We calculate our IOPS performance based on a 16K block size, so adjusting block size can alter total IOPS, but the delivered throughput should remain the same. For example, if you have provisioned a 1000 IOPS volume and are writing to it at a 16K block size, you will consistently get approximately 16MB/sec of throughput. If you start writing at a block size of 32K, your IOPS would drop to 500 during this period and you'd continue to see 16MB/sec of throughput. In general, every increase in I/O size above 16KB linearly increases the resources you need to achieve the same IOPS rate.
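A quick check of the arithmetic behind those numbers:

Throughput = IOPS x I/O size
1000 IOPS x 16KB = 16,000 KB/s, or about 16MB/sec
 500 IOPS x 32KB = 16,000 KB/s, or about 16MB/sec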
Filesystem full
When the Disk Is Full Linux actually makes it pretty obvious when you run out of disk space:

$ cp /var/log/syslog syslogbackup
cp: writing `syslogbackup': No space left on device

Of course, depending on how your system is partitioned, you may not know which partition filled up. The first step is to use the df tool to list all of your mounted partitions along with their size, used space, and available space. If you add the -h option, it shows you the data in human-readable format instead of in 1K blocks:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.8G  7.4G   60K 100% /
none                  245M  192K  245M   1% /dev
none                  249M     0  249M   0% /dev/shm
none                  249M   36K  249M   1% /var/run
none                  249M     0  249M   0% /var/lock
none                  249M     0  249M   0% /lib/init/rw

Here you can see there is only one mounted partition, /dev/sda1; it has 7.8GB of total space, of which 7.4GB is used, and it says it's 100% full with 60KB available. Of course, with a full file system, how are you supposed to log in and fix anything? As you'll see later, one of the common ways to free up space on a file system is to compress uncompressed logs, but if the disk has no free space, how are you expected to do that?

Reserved Blocks -- If you look at the df numbers closely, though, you may say, wait a minute, is Linux really that bad at math? 7.4GB divided by 7.8GB is closer to 95% full. What's happening here is that Linux has set aside a number of blocks on the file system, known as reserved blocks, for just such an emergency (and also to help avoid fragmentation). Only the root user can write to those reserved blocks, so if the file system gets full, the root user still has some space left on the file system to log in and move around some files. On most servers with ext-based file systems, 5% of the total blocks are reserved, but this is something you can check with the tune2fs tool if you have root permissions. For instance, here is how you would check the reserved block count on your full /dev/sda1 partition:

$ sudo tune2fs -l /dev/sda1 | grep -i "block count"
Block count:              2073344
Reserved block count:     103667

If you divide 103667 by 2073344, you'll see that it works out to about 5%, or, in this case, it means the root user has about 400MB to play around with to try to fix the problem.

Track Down the Largest Directories -- The df command lets you know how much space is used by each file system, but after you know that, you still need to figure out what is consuming all of that disk space. The similarly named du command is invaluable for this purpose. This command, with the right arguments, can scan through a file system and report how much disk space is consumed by each directory. If you pipe it to a sort command, you can then easily see which directories consume the most disk space. What I like to do is save the results in /tmp (if there's enough free space, that is) so I can refer to the output multiple times and not have to rerun du. I affectionately call this the "duck command":

$ cd /
$ sudo du -ckx | sort -n > /tmp/duck-root

This command won't output anything to the screen; instead it creates a sorted list of which directories consume the most space and outputs the list to /tmp/duck-root.
If you then use tail on that file, you can see the top ten directories that use space:

$ sudo tail /tmp/duck-root
67872   /lib/modules/2.6.24-19-server
67876   /lib/modules
69092   /var/cache/apt
69448   /var/cache
76924   /usr/share
82832   /lib
124164  /usr
404168  /
404168  total

In this case, you can see that /usr takes up the most space, followed by /lib, /usr/share, and then /var/cache. Note that the output separates out /var/cache/apt and /var/cache, so you can tell that /var/cache/apt is the subdirectory that consumes the most space under /var/cache. Of course, you might have to open the duck-root file with a tool like less or a text editor so that you can see more than the last ten directories.

So what can you do with this output? In some cases the directory that takes up the most space can't be touched (as with /usr), but often when the free space disappears quickly, it is because of log files growing out of control. If you do see /var/log consuming a large percentage of your disk, you could then go to the directory and type sudo ls -lS to list all of the files sorted by their size. At that point, you could truncate (basically erase the contents of) a particular file:

$ sudo sh -c "> /var/log/messages"

Alternatively, if one of the large files has already been rotated (it ends in something like .1 or .2), you could either gzip it if it isn't already gzipped, or you could simply delete it if you don't need the log anymore. If you routinely find you have disk space problems due to uncompressed logs, you can tweak your logrotate settings in /etc/logrotate.conf and /etc/logrotate.d/ and make sure it automatically compresses rotated logs for you.
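As an illustrative sketch of that logrotate tweak (the log path and retention values are examples, not recommendations), a per-service file under /etc/logrotate.d/ might look like this:

/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}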
Surviving Linux Filesystem Failures
The term filesystem failure refers to corrupted filesystem data structures (objects such as inodes, directories, and the superblock). This can be caused by any one of the following reasons:

* Mistakes by the Linux/UNIX sysadmin
* Buggy device drivers or utilities (especially third-party utilities)
* Power outage (rare on production systems) due to UPS failure
* Kernel bugs (which is why you don't run the latest kernel on a production Linux/UNIX system; most of the time you should use a stable kernel release), e.g. BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:19060]

A filesystem failure can show up in several ways:

* The filesystem refuses to mount
* The entire system hangs
* Even if the mount operation succeeds, users may notice strange behavior, such as system reboots or gibberish characters in directory listings

So how do you survive a filesystem failure? Most of the time fsck (the front end to the ext2/ext3 utilities) can fix the problem. To check a Linux ext2/ext3 file system (using the /home filesystem on /dev/sda3 for demonstration), first unmount /dev/sda3, then type the following command:

# e2fsck -f /dev/sda3

Where -f forces checking even if the file system seems clean.

Please note that if the superblock is not found, e2fsck will terminate with a fatal error. However, Linux maintains multiple redundant copies of the superblock in every file system, so you can use the -b {alternative-superblock} option to get around this problem. The location of the backup superblocks depends on the filesystem's block size:

* For filesystems with 1K block sizes, a backup superblock can be found at block 8193
* For filesystems with 2K block sizes, at block 16384
* For 4K block sizes, at block 32768

Tip: you can also try either of the following commands to determine alternative-superblock locations:

# mke2fs -n /dev/sda3
OR
# dumpe2fs /dev/sda3 | grep -i superblock

To repair the file system using an alternative superblock, use a command such as:

# e2fsck -f -b 8193 /dev/sda3

However, it is highly recommended that you make a backup before you run fsck on the system. Use dd to create an image of the partition (provided that you have spare space under /disk2):

# dd if=/dev/sda3 of=/disk2/backup-sda3.img
/etc/fstab options
defaults -- supports all filesystems -- Causes the default options for this filesystem to be used. It's used primarily in the /etc/fstab file to ensure that the file includes an options column.

auto or noauto -- supports all filesystems -- Mounts or doesn't mount the filesystem at boot time or when root issues the mount -a command. The default is auto, but noauto is appropriate for removable media. Used in /etc/fstab.

user or nouser -- supports all filesystems -- Allows or disallows ordinary users to mount the filesystem. The default is nouser, but user is often appropriate for removable media. Used in /etc/fstab. When included in this file, user allows users to type mount /mountpoint (where /mountpoint is the assigned mount point) to mount a disk. Only the user who mounted the filesystem may unmount it.

users -- supports all filesystems -- Similar to user, except that any user may unmount a filesystem once it's been mounted.

owner -- supports all filesystems -- Similar to user, except that the user must own the device file. Some distributions, such as Red Hat, assign ownership of some device files (such as /dev/fd0 for the floppy disk) to the console user, so this can be a helpful option.

uid=value -- most filesystems that don't support Unix-style permissions, such as vfat, hpfs, ntfs, and hfs -- Sets the owner of all files. For instance, uid=1000 sets the owner to whoever has Linux user ID 1000. (Check Linux user IDs in the /etc/passwd file.)

gid=value -- most filesystems that don't support Unix-style permissions, such as vfat, hpfs, ntfs, and hfs -- Works like uid=value, but sets the group of all files on the filesystem. You can find group IDs in the /etc/group file.

umask=value -- most filesystems that don't support Unix-style permissions, such as vfat, hpfs, ntfs, and hfs -- Sets the umask for the permissions on files. value is interpreted in binary as bits to be removed from permissions on files. For instance, umask=027 yields permissions of 750, or -rwxr-x---. Used in conjunction with uid=value and gid=value, this option lets you control who can access files on FAT, HPFS, and many other foreign filesystems.
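To tie several of these options together, a hypothetical /etc/fstab line for a removable FAT-formatted drive might look like the following (the device name, mount point, and user ID are examples):

/dev/sdb1   /media/usb   vfat   user,noauto,uid=1000,umask=022   0   0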
Amazon EBS Volume Performance on Linux Instances
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-workload-demand.html
Migrating a live system from ext3 to ext4 filesystem
https://www.debian-administration.org/article/643/Migrating_a_live_system_from_ext3_to_ext4_filesystem
Disk devices for Linux
parted -l lists the sizes, partition tables, model numbers, and manufacturers of every disk on the system.