Maintaining a Linux Filesystem (203.2)

Candidates should be able to properly maintain a Linux filesystem using system utilities. This objective includes manipulating standard filesystems.

Key Knowledge Areas:

The following is a partial list of used files, terms, and utilities:

Resources: Bandel97 ; Btrfs ; the man pages for e2fsck , badblocks , dumpe2fs , debugfs and tune2fs .

Good disk maintenance requires periodic disk checks. Your best tool is fsck , and should be run at least monthly. Default checks will normally be run after 20 system reboots, but if your system stays up for weeks or months at a time, you'll want to force a check from time to time. Your best bet is performing routine system backups and checking your lost+found directories from time to time.

The frequency of the checks at system reboot can be changed with tune2fs . This utility can also be used to change the mount count, which will prevent the system from having to check all filesystems at the 20th reboot (which can take a long time).

The dumpe2fs utility will provide important information regarding hard disk operating parameters found in the superblock, and badblocks will perform surface checking. Finally, surgical procedures to remove areas grown bad on the disk can be accomplished using debugfs .

fsck (fsck.*)

fsck is a utility to check and repair a Linux filesystem. In actuality fsck is simply a front-end for the various filesystem checkers ( fsck. fstype ) available under Linux.

fsck is called automatically at system startup. If the filesystem is marked not clean , or the maximum mount count is reached or the time between checks is exceeded, the filesystem is checked. To change the maximum mount count or the time between checks, use tune2fs .

Frequently used options to fsck include:

-s

Serialize fsck operations. This is a good idea if you're checking multiple filesystems and the checkers are in an interactive mode.

-A

Walk through the /etc/fstab file and try to check all filesystems in one run. This option is typically used from the /etc/rc system initialization file, instead of multiple commands for checking a single filesystem.

The root filesystem will be checked first. After that, filesystems will be checked in the order specified by the fs_passno (the sixth) field in the /etc/fstab file. Filesystems with a fs_passno value of 0 are skipped and are not checked at all. If there are multiple filesystems with the same pass number, fsck will attempt to check them in parallel, although it will avoid running multiple filesystem checks on the same physical disk.

-R

When checking all filesystems with the -A flag, skip the root filesystem (in case it's already mounted read-write).

Options which are not understood by fsck are passed to the filesystem-specific checker. These arguments(=options) must not take arguments, as there is no way for fsck to be able to properly guess which arguments take options and which don't. Options and arguments which follow the -- are treated as filesystem-specific options to be passed to the filesystem-specific checker.

The filesystem checker for the ext2 filesystem is called fsck.e2fs or e2fsck . Frequently used options include:

-a

This option does the same thing as the -p option. It is provided for backwards compatibility only; it is suggested that people use -p option whenever possible.

-c

This option causes e2fsck to run the badblocks(8) program to find any blocks which are bad on the filesystem, and then marks them as bad by adding them to the bad block inode.

-C

This option causes e2fsck to write completion information to the specified file descriptor so that the progress of the filesystem check can be monitored. This option is typically used by programs which are running e2fsck. If the file descriptor specified is 0, e2fsck will print a completion bar as it goes about its business. This requires that e2fsck is running on a video console or terminal.

-f

Force checking even if the filesystem seems clean.

-n

Open the filesystem read-only, and assume an answer of no to all questions. Allows e2fsck to be used non-interactively. (Note: if the -c, -l, or -L options are specified in addition to the -n option, then the filesystem will be opened read-write, to permit the bad-blocks list to be updated. However, no other changes will be made to the filesystem.)

-p

Automatically repair ("preen") the filesystem without any questions.

-y

Assume an answer of yes to all questions; allows e2fsck to be used non-interactively.

mkfs (mkfs.*)

The mk2fs command is used to create a Linux filesystem. It can create different types of filesystems by specifying the -t filesystem or by giving the mkfs.filesystem command. Actually mkfs is a front-end for the several mkfs.fstype commands. Please read the mkfs man pages and section Creating Filesystems for more information.

tune2fs

tune2fs is used to tune a filesystem. This is mostly used to set filesystem check options, such as the maximum mount count and the time between filesystem checks .

It is also possible to set the mount count to a specific value. This can be used to 'stagger' the mount counts of the different filesystems, which ensures that at reboot not all filesystems will be checked at the same time.

So for a system that contains 5 partitions and is booted approximately once a month you could do the following to stagger the mount counts:

tune2fs -c 5 -C 0
partition1
tune2fs -c 5 -C 1
partition2
tune2fs -c 5 -C 2
partition3
tune2fs -c 5 -C 3
partition4
tune2fs -c 5 -C 4
partition5
				

The maximum mount count is 20, but for a system that is not frequently rebooted a lower value is advisable.

Frequently used options include:

-c max-mount-counts

Adjust the maximum mount count between two filesystem checks. If max-mount-counts is 0 then the number of times the filesystem is mounted will be disregarded by e2fsck(8) and the kernel. Staggering the mount-counts at which filesystems are forcibly checked will avoid all filesystems being checked at one time when using journalling filesystems.

You should strongly consider the consequences of disabling mount-count-dependent checking entirely. Bad disk drives, cables, memory and kernel bugs could all corrupt a filesystem without marking the filesystem dirty or in error. If you are using journalling on your filesystem, your filesystem will never be marked dirty, so it will not normally be checked. A filesystem error detected by the kernel will still force an fsck on the next reboot, but it may already be too late to prevent data loss at that point.

-C mount-count

Set the number of times the filesystem has been mounted. Can be used in conjunction with -c to force an fsck on the filesystem at the next reboot.

-i interval-between-checks[d|m|w]

Adjust the maximal time between two filesystem checks. No suffix or d result in days, m in months, and w in weeks. A value of zero will disable the time-dependent checking.

It is strongly recommended that either -c (mount-count-dependent) or -i (time-dependent) checking be enabled to force periodic full e2fsck(8) checking of the filesystem. Failure to do so may lead to filesystem corruption due to bad disks, cables or memory or kernel bugs to go unnoticed, until they cause data loss or corruption.

-m reserved-blocks-percentage

Set the percentage of reserved filesystem blocks.

-r reserved-blocks-count

Set the number of reserved filesystem blocks.

dumpe2fs

dumpe2fs prints the super block and blocks group information for the filesystem present on device.

-b

print the blocks which are reserved as bad in the filesystem.

-h

only display the superblock information and not any of the block group descriptor detail information.

badblocks

badblocks is a Linux utility to check for damaged sectors on a disk drive. It marks these sectors so that they are not used in the future and thus do not cause corruption of data. It is part of the e2fsprogs project.

It is strongly recommended that badblocks not be run directly but to have it invoked through the -c option in e2fsck or mke2fs . A commonly used option is:

-o output-file

write the list of bad blocks to output-file .

debugfs

With debugfs , you can modify the disk with direct disk writes. Since this utility is so powerful, you will normally want to invoke it as read-only until you are ready to actually make changes and write them to the disk. To invoke debugfs in read-only mode, do not use any options. To open in read-write mode, add the -w option. You may also want to include in the command line the device you wish to work on, as in /dev/hda1 or /dev/sda1 , etc. Once it is invoked, you should see a debugfs prompt.

When the superblock of a partition is damaged, you can specify a different superblock to use:

debugfs -b 1024 -s 8193 /dev/hda1
				

This means that the superblock at block 8193 will be used and the blocksize is 1024. Note that you have to specify the blocksize when you want to use a different superblock. The information about blocksize and backup superblocks can be found with:

dumpe2fs /dev/hda1
				

The first command to try after invocation of debugfs , is params to show the mode (read-only or read-write), and the current file system. If you run this command without opening a filesystem, it will almost certainly dump core and exit. Two other commands, open and close , may be of interest when checking more than one filesystem. Close takes no argument, and appropriately enough, it closes the filesystem that is currently open. Open takes the device name as an argument. To see disk statistics from the superblock, the command stats will display the information by group. The command testb checks whether a block is in use. This can be used to test if any data is lost in the blocks marked as bad by the badblocks command. To get the filename for a block, first use the icheck command to get the inode and then ncheck to get the filename. The best course of action with bad blocks is to mark the block bad and restore the file from backup.

To get a complete list of all commands, see the man page of debugfs or type ? , lr or list_requests .

ext4

Ext4 is the evolution of the most used Linux filesystem, Ext3. In many ways, Ext4 is a deeper improvement over Ext3 than Ext3 was over Ext2. Ext3 was mostly about adding journaling to Ext2, but Ext4 modifies important data structures of the filesystem such as the ones destined to store the file data. The result is a filesystem with an improved design, better performance, reliability, and features. Therefore converting ext3 to ext4 is not as straightforward and easy as it was converting ext2 to ext3.

To creating ext4 partitions from scratch, use:

mkfs.ext4 /dev/sdxY

Note

Tip: See the mkfs.ext4 man page for more options; edit /etc/mke2fs.conf to view/configure default options.

Be aware that by default, mkfs.ext4 uses a rather low bytes-per-inode ratio to calculate the fixed amount of inodes to be created.

Note

Note: Especially for contemporary HDDs (750 GB+) this usually results in a much too large inode number and thus many likely wasted GB. The ratio can be set directly via the -i option; one of 6291456 resulted in 476928 inodes for a 2 TB partition.

For the rest ext4 can be manipulated using all the same tools that are available for ext2/ext3 type of filesystems like badblocks, dumpe2fs, e2fsck and tune2fs.

btrfs awareness

Btrfs (abbreviation for: BTree File System)is a new copy on write (CoW) filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Jointly developed at Oracle, Red Hat, Fujitsu, Intel, SUSE, STRATO and many others, Btrfs is licensed under the GPL and open for contribution from anyone. Btrfs has several features characteristic of a storage device. It is designed to make the file system tolerant of errors, and to facilitate the detection and repair of errors when they occur. It uses checksums to ensure the validity of data and metadata, and maintains snapshots of the file system that can be used for backup or repair. The core datastructure used by btrfs is the B-Tree - hence the name.

Note

Btrfs is still under heavy development, but every effort is being made to keep the filesystem stable and fast. Because of the speed of development, you should run the latest kernel you can (either the latest release kernel from kernel.org, or the latest -rc kernel.

As of the beginning of the year 2013 Btrfs was included in the default kernel and its tools (btrfs-progs) are part of the default installation. GRUB 2, mkinitcpio, and Syslinux have support for Btrfs and require no additional configuration.

The main Btrfs features available at the moment include:

  • Extent based file storage

  • 2^64 byte == 16 EiB maximum file size

  • Space-efficient packing of small files

  • Space-efficient indexed directories

  • Dynamic inode allocation

  • Writable snapshots, read-only snapshots

  • Subvolumes (separate internal filesystem roots)

  • Checksums on data and metadata (crc32c)

  • Compression (zlib and LZO)

  • Integrated multiple device support

  • File Striping, File Mirroring, File Striping+Mirroring, Striping with Single and Dual Parity implementations

  • SSD (Flash storage) awareness (TRIM/Discard for reporting free blocks for reuse) and optimizations (e.g. avoiding unnecessary seek optimizations, sending writes in clusters, even if they are from unrelated files. This results in larger write operations and faster write throughput)

  • Efficient Incremental Backup

  • Background scrub process for finding and fixing errors on files with redundant copies

  • Online filesystem defragmentation

  • Offline filesystem check

  • Conversion of existing ext3/4 file systems

  • Seed devices. Create a (readonly) filesystem that acts as a template to seed other Btrfs filesystems. The original filesystem and devices are included as a readonly starting point for the new filesystem. Using copy on write, all modifications are stored on different devices; the original is unchanged.

  • Subvolume-aware quota support

  • Send/receive of subvolume changes

  • Efficient incremental filesystem mirroring

The most notable (unique) btrfs features are:

Snapshotting

Btrfs’s snapshotting is simple to use and understand. The snapshots will show up as normal directories under the snapshotted directory, and you can cd into it and walk around there as you would in any directory.

By default, all snapshots are writeable in Btrfs, but you can create read-only snapshots if you choose so. Read-only snapshots are great if you are just going to take a snapshot for a backup and then delete it once the backup completes. Writeable snapshots are handy because you can do things such as snapshot your file system before performing a system update; if the update breaks your system, you can reboot into the snapshot and use it like your normal file system. When you create a new Btrfs file system, the root directory is a subvolume. Snapshots can only be taken of subvolumes, because a subvolume is the representation of the root of a completely different filesystem tree, and you can only snapshot a filesystem tree.

The simplest way to think of this would be to create a subvolume for /home, so you could snapshot / and /home independently of each other. So you could run the following command to create a subvolume:

btrfs subvolume create /home
                                  

And then at some point down the road when you need to snapshot /home for a backup, you simply run:

btrfs subvolume snapshot /home/ /home-snap
                                  

Once you are done with your backup, you can delete the snapshot with the command

btrfs subvolume delete /home-snap/
                                  

The hard work of unlinking the snapshot tree is done in the background, so you may notice I/O happening on a seemingly idle box; this is just Btrfs cleaning up the old snapshot. If you have a lot of snapshots or don’t remember which directories you created as subvolumes, you can run the command:

# btrfs subvolume list /mnt/btrfs-test/
ID 267 top level 5 path home
ID 268 top level 5 path snap-home
ID 270 top level 5 path home/josef
                                  

This doesn’t differentiate between a snapshot and a normal subvolume, so you should probably name your snapshots consistently so that later on you can tell which is which.

Subvolumes

A subvolume in btrfs is not the same as an LVM logical volume, or a ZFS subvolume. With LVM, a logical volume is a block device in its own right; this is not the case with btrfs. A btrfs subvolume is not a block device, and cannot be treated as one.

Instead, a btrfs subvolume can be thought of as a POSIX file namespace. This namespace can be accessed via the top-level subvolume of the filesystem, or it can be mounted in its own right. So, given a filesystem structure like this:

toplevel
  `--- dir_a                * just a normal directory
  |         `--- p
  |         `--- q
  `--- subvol_z             * a subvolume
	    `--- r
	    `--- s
                                  

the root of the filesystem can be mounted, and the full filesystem structure will be seen at the mount point; alternatively the subvolume can be mounted (with the mount option subvol=subvol_z), and only the files r and s will be visible at the mount point.

A btrfs filesystem has a default subvolume, which is initially set to be the top-level subvolume. It is the default subvolume which is mounted if no subvol or subvolid option is passed to mount. Changing the default subvolume with btrfs subvolume default will make the top level of the filesystem inaccessible, except by use of the subvolid=0 mount option.

Space-efficient indexed directories

Directories and files look the same on disk in Btrfs, which is consistent with the UNIX way of doing things. The ext file system variants have to pre-allocate their inode space when making the file system, so you are limited to the number of files you can create once you create the file system.

With Btrfs we add a couple of items to the B-tree when you create a new file, which limits you only by the amount of metadata space you have in your file system. If you have ever created thousands of files in a directory on an ext file system and then deleted the files, you may have noticed that doing an ls on the directory would take much longer than you’d expect given that there may only be a few files in the directory.

You may have even had to run this command:

e2fsck -D /dev/sda1
                                  

to re-optimize your directories in ext. This is due to a flaw in how the directory indexes are stored in ext: they cannot be shrunk. So once you add thousands of files and the internal directory index tree grows to a large size, it will not shrink back down as you remove files. This is not the case with Btrfs.

In Btrfs we store a file index next to the directory inode within the file system B-tree. The B-tree will grow and shrink as necessary, so if you create a billion files in a directory and then remove all of them, an ls will take only as long as if you had just created the directory.

Btrfs also has an index for each file that is based on the name of the file. This is handy because instead of having to search through the containing directory’s file index for a match, we simply hash the name of the file and search the B-tree for this hash value. This item is stored next to the inode item of the file, so looking up the name will usually read in the same block that contains all of the important information you need. Again, this limits the amount of I/O that needs to be done to accomplish basic tasks.

mkswap

mkswap sets up a Linux swap area on a device or in a file. (After creating the swap area, you need to invoke the swapon command to start using it. Usually swap areas are listed in /etc/fstab so that they can be taken into use at boot time by a swapon -a command in some boot script.) See Swap file usage

xfs_info

xfs_info shows the filesystem geometry for an XFS filesystem. xfs_info is equivalent to invoking xfs_growfs with the -n option.

xfs_check

xfs_check checks whether an XFS filesystem is consistent. It is needed only when there is reason to believe that the filesystem has a consistency problem. Since XFS is a Journalling filesystem, which allows it to retain filesystem consistency, there should be little need to ever run xfs_check.

xfs_repair

xfs_repair repairs corrupt or damaged XFS filesystems. xfs_repair will attempt to find the raw device associated with the specified block device and will use the raw device instead. Regardless, the filesystem to be repaired must be unmounted, otherwise, the resulting filesystem may become inconsistent or corrupt.

smartmontools: smartd and smartctl

Two utility programs, smartctl and smartd (available when the smartmontools package is installed) can be used to monitor and control storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART). SMARTis built into most modern ATA and SCSI harddisks and solid-state drives. The purpose of SMARTis to monitor the reliability of the hard drive and predict drive failures, and to carry out different types of drive self-tests.

smartdis a daemon that will attempt to enable SMARTmonitoring on ATA devices and polls these and SCSI devices every 30 minutes (configurable), logging SMARTerrors and changes of SMARTAttributes via the SYSLOG interface. smartd can also be configured to send email warnings if problems are detected. Depending upon the type of problem, you may want to run self-tests on the disk, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable disk sectors.

smartd can be configured at start-up using the configuration file /usr/local/etc/smartd.conf. When the USR1 signal is sent to smartd it will immediately check the status of the disks, and then return to polling the disks every 30 minutes. Please consult the manual page for smartd for specific configuration options.

The smartctl utility controls the SMARTsystem. It can be used to scan devices and print info about them, to enable or disable SMARTon specific disks, to configure what to do when (imminent) errors are detected. Please consult the smartctl manual page for details.

Copyright Snow B.V. The Netherlands