Candidates should be able to properly maintain a Linux filesystem using system utilities. This objective includes manipulating standard filesystems and monitoring SMART devices.
Tools and utilities to manipulate ext2, ext3 and ext4.
Tools and utilities to perform Btrfs operations, including subvolumes and snapshots.
Tools and utilities to manipulate xfs.
Awareness of ZFS.
xfs_info, xfs_check, xfs_repair, xfs_dump and xfs_restore
smartd and smartctl
Good disk maintenance requires periodic disk checks. Your best tool is
fsck, and should be run at least
monthly. Default checks will normally be run after 20 system
reboots, but if your system stays up for weeks or months at a
time, you'll want to force a check from time to time. Your best
bet is performing routine system backups and checking your
lost+found directories from time to time.
The frequency of the checks at system reboot can be changed with tune2fs. This utility can also be used to change the mount count, which will prevent the system from having to check all filesystems at the 20th reboot (which can take a long time).
The dumpe2fs utility will provide important information regarding hard disk operating parameters found in the superblock, and badblocks will perform surface checking. Finally, surgical procedures to remove areas grown bad on the disk can be accomplished using debugfs.
fsck is called automatically at system startup. If the filesystem is marked “not clean”, or the maximum mount count is reached or the time between checks is exceeded, the filesystem is checked. To change the maximum mount count or the time between checks, use tune2fs.
Frequently used options to fsck include:
Walk through the
and try to check all filesystems in one run. This
option is typically used from the /etc/rc system
initialization file, instead of multiple commands for
checking a single filesystem.
The root filesystem will be checked first. After that,
filesystems will be checked in the order specified by
fs_passno (the sixth) field
/etc/fstab file. Filesystems
fs_passno value of 0 are
skipped and are not checked at all. If there are
multiple filesystems with the same pass number, fsck
will attempt to check them in parallel, although it will
avoid running multiple filesystem checks on the same
Options which are not understood by fsck are passed to
the filesystem-specific checker. These arguments(=options) must not
take arguments, as there is no way for fsck to be able to properly
guess which arguments take options and which don't. Options and
arguments which follow the
-- are treated as
filesystem-specific options to be passed to the filesystem-specific
The filesystem checker for the ext2 filesystem is called fsck.e2fs or e2fsck. Frequently used options include:
This option does the same thing as the
option. It is provided for backwards compatibility only; it is
suggested that people use -p option whenever possible.
This option causes e2fsck to write completion information to the specified file descriptor so that the progress of the filesystem check can be monitored. This option is typically used by programs which are running e2fsck. If the file descriptor specified is 0, e2fsck will print a completion bar as it goes about its business. This requires that e2fsck is running on a video console or terminal.
Open the filesystem read-only, and assume an answer of “no” to all questions. Allows e2fsck to be used non-interactively. (Note: if the -c, -l, or -L options are specified in addition to the -n option, then the filesystem will be opened read-write, to permit the bad-blocks list to be updated. However, no other changes will be made to the filesystem.)
The mk2fs command is used to create a Linux
filesystem. It can create different types of filesystems by
-t filesystem or by giving
the mkfs.filesystem command. Actually
mkfs is a front-end for the several
mkfs.fstype commands. Please read the mkfs
man pages and section for more information.
It is also possible to set the
to a specific value. This can be used to
'stagger' the mount counts of the different filesystems, which
ensures that at reboot not all filesystems will be checked at
the same time.
So for a system that contains 5 partitions and is booted approximately once a month you could do the following to stagger the mount counts:
tune2fs -c 5 -C 0
partition1tune2fs -c 5 -C 1
partition2tune2fs -c 5 -C 2
partition3tune2fs -c 5 -C 3
partition4tune2fs -c 5 -C 4
Frequently used options include:
Adjust the maximum mount count between two filesystem checks. If max-mount-counts is 0 then the number of times the filesystem is mounted will be disregarded by e2fsck(8) and the kernel. Staggering the mount-counts at which filesystems are forcibly checked will avoid all filesystems being checked at one time when using journalling filesystems.
You should strongly consider the consequences of disabling mount-count-dependent checking entirely. Bad disk drives, cables, memory and kernel bugs could all corrupt a filesystem without marking the filesystem dirty or in error. If you are using journalling on your filesystem, your filesystem will never be marked dirty, so it will not normally be checked. A filesystem error detected by the kernel will still force an fsck on the next reboot, but it may already be too late to prevent data loss at that point.
It is strongly recommended that either -c (mount-count-dependent) or -i (time-dependent) checking be enabled to force periodic full e2fsck(8) checking of the filesystem. Failure to do so may lead to filesystem corruption due to bad disks, cables or memory or kernel bugs to go unnoticed, until they cause data loss or corruption.
print the blocks which are reserved as bad in the filesystem.
only display the superblock information and not any of the block group descriptor detail information.
badblocks is a Linux utility to check for damaged sectors on a disk drive. It marks these sectors so that they are not used in the future and thus do not cause corruption of data. It is part of the e2fsprogs project.
It is strongly recommended that badblocks not be run directly but
to have it invoked through the
option in e2fsck or mke2fs.
A commonly used option is:
write the list of bad blocks to
With debugfs, you can modify the disk with
direct disk writes. Since this utility is so powerful, you
will normally want to invoke it as read-only until you are
ready to actually make changes and write them to the disk. To
invoke debugfs in read-only mode, do not
use any options. To open in read-write mode, add the
-w option. You may also want to include
in the command line the device you wish to work on, as in
etc. Once it is invoked, you should see a debugfs prompt.
debugfs -b 1024 -s 8193 /dev/hda1
This means that the superblock at block 8193 will be used and the blocksize is 1024. Note that you have to specify the blocksize when you want to use a different superblock. The information about blocksize and backup superblocks can be found with:
The first command to try after invocation of debugfs, is params to show the mode (read-only or read-write), and the current file system. If you run this command without opening a filesystem, it will almost certainly dump core and exit. Two other commands, open and close, may be of interest when checking more than one filesystem. Close takes no argument, and appropriately enough, it closes the filesystem that is currently open. Open takes the device name as an argument. To see disk statistics from the superblock, the command stats will display the information by group. The command testb checks whether a block is in use. This can be used to test if any data is lost in the blocks marked as “bad” by the badblocks command. To get the filename for a block, first use the icheck command to get the inode and then ncheck to get the filename. The best course of action with bad blocks is to mark the block “bad” and restore the file from backup.
To get a complete list of all commands, see the man page of debugfs or type ?, lr or list_requests.
Ext4 is the evolution of the most used Linux filesystem, Ext3. In many ways, Ext4 is a deeper improvement over Ext3 than Ext3 was over Ext2. Ext3 was mostly about adding journaling to Ext2, but Ext4 modifies important data structures of the filesystem such as the ones destined to store the file data. The result is a filesystem with an improved design, better performance, reliability, and features. Therefore converting ext3 to ext4 is not as straightforward and easy as it was converting ext2 to ext3.
To creating ext4 partitions from scratch, use:
Tip: See the mkfs.ext4 man page for more options; edit
view/configure default options.
Be aware that by default, mkfs.ext4 uses a rather low bytes-per-inode ratio to calculate the fixed amount of inodes to be created.
Note: Especially for contemporary HDDs (750 GB+) this usually results in a much too large inode number and thus many likely wasted GB. The ratio can be set directly via the -i option; one of 6291456 resulted in 476928 inodes for a 2 TB partition.
For the rest ext4 can be manipulated using all the same tools that are available for ext2/ext3 type of filesystems like badblocks, dumpe2fs, e2fsck and tune2fs.
Btrfs (abbreviation for: BTree File System)is a new copy on write (CoW) filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Jointly developed at Oracle, Red Hat, Fujitsu, Intel, SUSE, STRATO and many others, Btrfs is licensed under the GPL and open for contribution from anyone. Btrfs has several features characteristic of a storage device. It is designed to make the file system tolerant of errors, and to facilitate the detection and repair of errors when they occur. It uses checksums to ensure the validity of data and metadata, and maintains snapshots of the file system that can be used for backup or repair. The core datastructure used by btrfs is the B-Tree - hence the name.
Btrfs is still under heavy development, but every effort is being made to keep the filesystem stable and fast. Because of the speed of development, you should run the latest kernel you can (either the latest release kernel from kernel.org, or the latest -rc kernel.
As of the beginning of the year 2013 Btrfs was included in the default kernel and its tools (btrfs-progs) are part of the default installation. GRUB 2, mkinitcpio, and Syslinux have support for Btrfs and require no additional configuration.
The main Btrfs features available at the moment include:
Extent based file storage
2^64 byte == 16 EiB maximum file size
Space-efficient packing of small files
Space-efficient indexed directories
Dynamic inode allocation
Writable snapshots, read-only snapshots
Subvolumes (separate internal filesystem roots)
Checksums on data and metadata (crc32c)
Compression (zlib and LZO)
Integrated multiple device support
File Striping, File Mirroring, File Striping+Mirroring, Striping with Single and Dual Parity implementations
SSD (Flash storage) awareness (TRIM/Discard for reporting free blocks for reuse) and optimizations (e.g. avoiding unnecessary seek optimizations, sending writes in clusters, even if they are from unrelated files. This results in larger write operations and faster write throughput)
Efficient Incremental Backup
Background scrub process for finding and fixing errors on files with redundant copies
Online filesystem defragmentation
Offline filesystem check
Conversion of existing ext3/4 file systems
Seed devices. Create a (readonly) filesystem that acts as a template to seed other Btrfs filesystems. The original filesystem and devices are included as a readonly starting point for the new filesystem. Using copy on write, all modifications are stored on different devices; the original is unchanged.
Subvolume-aware quota support
Send/receive of subvolume changes
Efficient incremental filesystem mirroring
The most notable (unique) btrfs features are:
A Btrfs filesystem provides support for integrated RAID functionality, and can be created on top of many devices. More devices can be added after the filesystem is created. By default, metadata will be mirrored across two devices and data will be striped across all of the devices present. If only one device is present, metadata will be duplicated on that one device.
Btrfs can add and remove devices online, and freely convert between RAID levels after the filesystem is created. Btrfs supports raid0, raid1, raid10, raid5 and raid6, and it can also duplicate metadata on a single spindle. When blocks are read in, checksums are verified. If there are any errors, Btrfs tries to read from an alternate copy and will repair the broken copy if the alternative copy succeeds.
mkfs.btrfs will accept more than one device on the command line. It has options to control the raid configuration for data (-d) and metadata (-m). Valid choices are raid0, raid1, raid10 and single. The option -m single means that no duplication of metadata is done, which may be desired when using hardware raid. here some examples on creating the filesystem.
# Create a filesystem across four drives (metadata mirrored, linear data allocation) mkfs.btrfs /dev/sdb /dev/sdc /dev/sdd /dev/sde # Stripe the data without mirroring mkfs.btrfs -d raid0 /dev/sdb /dev/sdc # Use raid10 for both data and metadata mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde # Don't duplicate metadata on a single drive (default on single SSDs) mkfs.btrfs -m single /dev/sdb
After filesystem creation it can be mounted like any other filesystem. To see the devices used by the filesystem you can use the follwing command:
butterfs filesystem show label: none uuid: c67b1e23-887b-4cb9-b037-5958f6c0a333 total devices 2 FS bytes used 383.00KiB devid 1 size 4.00 GiB used 847.12MiB path /dev/sdb devid 2 size 4.00 GiB used 837.12MiB path /dev/sdc Btrfs v4.8.5
Btrfs’s snapshotting is simple to use and understand. The snapshots will show up as normal directories under the snapshotted directory, and you can cd into it and walk around there as you would in any directory.
By default, all snapshots are writeable in Btrfs, but you can create read-only snapshots if you choose so. Read-only snapshots are great if you are just going to take a snapshot for a backup and then delete it once the backup completes. Writeable snapshots are handy because you can do things such as snapshot your file system before performing a system update; if the update breaks your system, you can reboot into the snapshot and use it like your normal file system. When you create a new Btrfs file system, the root directory is a subvolume. Snapshots can only be taken of subvolumes, because a subvolume is the representation of the root of a completely different filesystem tree, and you can only snapshot a filesystem tree.
The simplest way to think of this would be to create a
/home, so you could
independently of each other. So you could run the
following command to create a subvolume:
btrfs subvolume create /home
And then at some point down the road when you need to snapshot
/home for a backup, you simply run:
btrfs subvolume snapshot /home/ /home-snap
Once you are done with your backup, you can delete the snapshot with the command
btrfs subvolume delete /home-snap/
The hard work of unlinking the snapshot tree is done in the background, so you may notice I/O happening on a seemingly idle box; this is just Btrfs cleaning up the old snapshot. If you have a lot of snapshots or don’t remember which directories you created as subvolumes, you can run the command:
# btrfs subvolume list /mnt/btrfs-test/ ID 267 top level 5 path home ID 268 top level 5 path snap-home ID 270 top level 5 path home/josef
This doesn’t differentiate between a snapshot and a normal subvolume, so you should probably name your snapshots consistently so that later on you can tell which is which.
A subvolume in btrfs is not the same as an LVM logical volume, or a ZFS subvolume. With LVM, a logical volume is a block device in its own right; this is not the case with btrfs. A btrfs subvolume is not a block device, and cannot be treated as one.
Instead, a btrfs subvolume can be thought of as a POSIX file namespace. This namespace can be accessed via the top-level subvolume of the filesystem, or it can be mounted in its own right. So, given a filesystem structure like this:
toplevel `--- dir_a * just a normal directory | `--- p | `--- q `--- subvol_z * a subvolume `--- r `--- s
the root of the filesystem can be mounted, and the full
filesystem structure will be seen at the mount point;
alternatively the subvolume can be mounted (with the
and only the files
will be visible at the mount point.
A btrfs filesystem has a default subvolume, which is
initially set to be the top-level subvolume. It is the
default subvolume which is mounted if no subvol or
subvolid option is passed to mount. Changing the default
subvolume with btrfs subvolume default
will make the top level of the filesystem inaccessible,
except by use of the
subvolid=0 mount option.
Directories and files look the same on disk in Btrfs, which is consistent with the UNIX way of doing things. The ext file system variants have to pre-allocate their inode space when making the file system, so you are limited to the number of files you can create once you create the file system.
With Btrfs we add a couple of items to the B-tree when you create a new file, which limits you only by the amount of metadata space you have in your file system. If you have ever created thousands of files in a directory on an ext file system and then deleted the files, you may have noticed that doing an ls on the directory would take much longer than you’d expect given that there may only be a few files in the directory.
You may have even had to run this command:
e2fsck -D /dev/sda1
to re-optimize your directories in ext. This is due to a flaw in how the directory indexes are stored in ext: they cannot be shrunk. So once you add thousands of files and the internal directory index tree grows to a large size, it will not shrink back down as you remove files. This is not the case with Btrfs.
In Btrfs we store a file index next to the directory inode within the file system B-tree. The B-tree will grow and shrink as necessary, so if you create a billion files in a directory and then remove all of them, an ls will take only as long as if you had just created the directory.
Btrfs also has an index for each file that is based on the name of the file. This is handy because instead of having to search through the containing directory’s file index for a match, we simply hash the name of the file and search the B-tree for this hash value. This item is stored next to the inode item of the file, so looking up the name will usually read in the same block that contains all of the important information you need. Again, this limits the amount of I/O that needs to be done to accomplish basic tasks.
mkswap sets up a Linux swap area on a device or in a file.
(After creating the swap area, you need to invoke the
swapon command to start using it. Usually swap
areas are listed in
/etc/fstab so that they
can be taken into use at boot time by a swapon
-a command in some boot script.) See
Swap file usage
xfs_check checks whether an XFS filesystem is consistent. It is needed only when there is reason to believe that the filesystem has a consistency problem. Since XFS is a Journalling filesystem, which allows it to retain filesystem consistency, there should be little need to ever run xfs_check.
xfs_repair repairs corrupt or damaged XFS filesystems. xfs_repair will attempt to find the raw device associated with the specified block device and will use the raw device instead. Regardless, the filesystem to be repaired must be unmounted, otherwise, the resulting filesystem may become inconsistent or corrupt.
Two utility programs, smartctl and smartd (available when the smartmontools package is installed) can be used to monitor and control storage systems using the Self-Monitoring, Analysis and Reporting Technology System (SMART). SMARTis built into most modern ATA and SCSI harddisks and solid-state drives. The purpose of SMARTis to monitor the reliability of the hard drive and predict drive failures, and to carry out different types of drive self-tests.
smartdis a daemon that will attempt to enable SMARTmonitoring on ATA devices and polls these and SCSI devices every 30 minutes (configurable), logging SMARTerrors and changes of SMARTAttributes via the SYSLOG interface. smartd can also be configured to send email warnings if problems are detected. Depending upon the type of problem, you may want to run self-tests on the disk, back up the disk, replace the disk, or use a manufacturer's utility to force reallocation of bad or unreadable disk sectors.
smartd can be configured at start-up using the configuration file /usr/local/etc/smartd.conf. When the USR1 signal is sent to smartd it will immediately check the status of the disks, and then return to polling the disks every 30 minutes. Please consult the manual page for smartd for specific configuration options.
The smartctl utility controls the SMART system. It can be used to scan devices and print info about them, to enable or disable SMART on specific disks, to configure what to do when (imminent) errors are detected. Please consult the smartctl manual page for details.
ZFS, currently owned by Oracle Corporation, was developed at Sun Microsystems as a next generation filesystem aimed at near infinite scalability and free from traditional design paradigms. Only Ubuntu currently offers kernel integration for ZFS on Linux. Other Linux distributions can make use of ZFS through userspace with the aid of Fuse.
The main ZFS features available at the moment include:
High fault tolerance and data integrity
Near unlimited storage due to 128 bit design
Hybrid volume/filesystem management with RAID capabilities
Compression/deduplication of data
Volume provisioning (zvols)
Ability to use different devices for caching and logging
Ability to use delegate administrative rights to unprivileged users
The high fault tolerance and data integrity design makes it unneccesary to have a command like fsck available. ZFS works in transactions in where the ueberblock is only updated if everything was completed. Copies of previous uberblocks (128) are being kept in a round robin fashion. The so called vdev labels, which identify the disks used in a zfs pool, also have multiple copies: 2 at the beginning of the disk and 2 at the end. Periodic scrubbing of pools can reveal data integrity issues, and if a pool is equipped with more than one device, can repair it on the fly.
ZFS allows for additional checksumming algoritms of blocks and can have multiple copies of those blocks stored. This can be configured per ZFS filesystem.
ZFS pools can be considered the logical volume managent and can consist out of single or multiple disks. The pools can have seperate cache and logging devices attached so that reading/writing is offloaded to faster devices. Having multiple disks in a pool allows for on the fly data recovery.
A typical simple ZFS pool:
$ sudo zpool list -v NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT tank 1.98G 448K 1.98G - 0% 0% 1.00x ONLINE - sdb 1.98G 448K 1.98G - 0% 0%
Status of the pool:
$ sudo zpool status -v pool: tank state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 sdb ONLINE 0 0 0 errors: No known data errors
History overview for ZFS pools:
$ sudo zpool history History for 'tank': 2017-01-11.11:32:50 zpool create -f -m /tank tank /dev/sdb 2017-01-11.12:37:44 zpool scrub tank
The ZFS pool is used as backing for one or more ZFS filesystems. By default the pool itself has a single filesystem named after the pool. ZFS filesystems are flexible in size and allow for customisation of various attributes like compression, size reservation, quota, block copies and so on. A full set of attributes can be seen with zfs get all and optionally a zfs filesystem, snapshot or volume.
A typical ZFS filesystem listing
$ sudo zfs list NAME USED AVAIL REFER MOUNTPOINT tank 242K 1.92G 19K /tank
Create a ZFS filesystem “documents” and use compression
$ sudo zfs create -o compression=on tank/documents $ sudo zfs list tank/documents NAME USED AVAIL REFER MOUNTPOINT tank/documents 19K 1.92G 19K /tank/documents
$ sudo zfs create tank/documents $ sudo zfs set compression=on tank/documents $ sudo zfs list tank/documents NAME USED AVAIL REFER MOUNTPOINT tank/documents 19K 1.92G 19K /tank/documents
With some data in
/tank/documents compression ratio can be seen with
$ sudo zfs get compressratio tank/documents NAME PROPERTY VALUE SOURCE tank/documents compressratio 1.68x -
Creating a backup of
/tank/documents can be done instantaniously
and takes no space until
/tank/documents's content changes. The contents of the
snapshot can be accessed through the
directory of that ZFS filesystem.
$ sudo zfs snap tank/documents@backup $ sudo zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT tank/documents@backup 0 - 45.0M -