General troubleshooting (213.2)

Candidates should be able to identify and correct common boot and run time issues.

Key Knowledge Areas

/proc filesystem

Various system and daemon log files

Content of /, /boot, and /lib/modules

Screen output during bootup

Kernel syslog entries in system logs (if entry is able to be gained)

Tools and utilities to analyse information about the used hardware

Tools and utilities to trace software and their system and library calls

The following is a partial list of the used files, terms and utilities:

dmesg
/sbin/lspci
/usr/bin/lsdev
/sbin/lsmod
/sbin/modprobe
/sbin/insmod
/bin/uname
strace
strings
ltrace
lsof
lsusb

Resources: the man pages for the various commands, Vasudevan02.

/proc filesystem

The /proc filesystem is a direct reflection of the system in memory, presented as files and directories. It provides an easy way to view kernel information and information about currently running processes. On Linux, some commands read /proc directly to get information about the state of the system. Through /proc you can view statistical information, hardware information, network and host parameters, and memory and performance information, and you can modify some kernel parameters at runtime by writing values to the files in it.
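
As a small illustration of reading kernel information from /proc, the sketch below extracts one field from a "key: value" style file such as /proc/meminfo. The helper name proc_field and its optional file argument are made up for this example and are not standard commands.

```shell
# proc_field KEY [FILE]: print the value of "KEY:" from a /proc-style
# "key: value" file. FILE defaults to /proc/meminfo.
# (proc_field is a hypothetical helper, not a standard command.)
proc_field() {
    key=$1
    file=${2:-/proc/meminfo}
    # Lines look like "MemTotal:        2059660 kB"; print the second column.
    awk -v k="$key:" '$1 == k { print $2; exit }' "$file"
}

# Typical use on a running system:
#   proc_field MemTotal                        # total memory in kB
#   cat /proc/sys/net/ipv4/ip_forward          # read a tunable
#   echo 1 > /proc/sys/net/ipv4/ip_forward     # as root: change it at runtime
```

Changes written to files under /proc/sys take effect immediately but are lost at the next reboot.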

Various system and daemon log files

Various system and daemon log files can be found at /var/log.

Below is an output from a Debian system:

debian-601a:~$ ls /var/log
alternatives.log    debug           kern.log       pm-powersave.log
alternatives.log.1  debug.1         kern.log.1     pm-powersave.log.1
apt                 debug.2.gz      kern.log.2.gz  pycentral.log
aptitude            debug.3.gz      kern.log.3.gz  samba
aptitude.1.gz       dmesg           lastlog        syslog
auth.log            dmesg.0         lpr.log        syslog.1
auth.log.1          dmesg.1.gz      lpr.log.1      syslog.2.gz
auth.log.2.gz       dmesg.2.gz      lpr.log.2.gz   syslog.3.gz
auth.log.3.gz       dmesg.3.gz      lpr.log.3.gz   syslog.4.gz
boot                dmesg.4.gz      mail.err       syslog.5.gz
btmp                dpkg.log        mail.info      syslog.6.gz
btmp.1              dpkg.log.1      mail.log       unattended-upgrades
ConsoleKit          exim4           mail.warn      user.log
cups                faillog         messages       wtmp
daemon.log          fontconfig.log  messages.1     wtmp.1
daemon.log.1        fsck            messages.2.gz  Xorg.0.log
daemon.log.2.gz     gdm3            messages.3.gz  Xorg.0.log.old
daemon.log.3.gz     installer       news

Different types of logging can be seen: auth, cups, daemon, debug, dmesg, fsck, kern, lpr, mail, messages, samba, syslog, wtmp and Xorg, along with a few others.

Make sure that the filesystem on which /var/log resides has enough room to hold a lot of information if things go wrong.
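
One way to keep an eye on the available room is to ask df how full the filesystem holding /var/log is. The following is a minimal sketch; the function name fs_use_percent is an assumption made for this example.

```shell
# fs_use_percent DIR: print how full the filesystem holding DIR is,
# as a bare number (e.g. 42 for 42%).
# (fs_use_percent is a hypothetical helper name.)
fs_use_percent() {
    # df -P prints one line per filesystem; column 5 is the capacity in use.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Typical use:
#   fs_use_percent /var/log
#   du -s /var/log/* | sort -n | tail     # largest logs last (may need root)
```

A check like this could be run from cron to warn before the log filesystem fills up.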

Contents of /, /boot, and /lib/modules

The listings below, showing the contents of /, /boot and /lib/modules, are from a Debian system.

Contents of /:

debian-601a:/$ ls -a
.    boot  home        lib32       media  proc  selinux  tmp  .ure  vmlinuz
..   dev   initrd.img  lib64       mnt    root  srv      u8   usr
bin  etc   lib         lost+found  opt    sbin  sys      u9   var

Contents of /boot:

debian-601a:/boot$ ls -a
.                      debian.bmp                 sarge.bmp
..                     debianlilo.bmp             sid.bmp
coffee.bmp             grub                       System.map-2.6.32-5-amd64
config-2.6.32-5-amd64  initrd.img-2.6.32-5-amd64  vmlinuz-2.6.32-5-amd64

The most important files are vmlinuz (the kernel image), System.map, initrd.img and the config file. The grub directory also resides here, along with some bitmap images.

Contents of /lib/modules:

debian-601a:/lib/modules$ ls -a
.  ..  2.6.32-5-amd64
debian-601a:/lib/modules/2.6.32-5-amd64$ ls -a
.      kernel             modules.dep      modules.order    modules.symbols.bin
..     modules.alias      modules.dep.bin  modules.softdep  source
build  modules.alias.bin  modules.devname  modules.symbols  updates

Kernel syslog entries in system logs

Below is part of /etc/rsyslog.conf on a Debian system:

#
auth,authpriv.*                 /var/log/auth.log
*.*;auth,authpriv.none          -/var/log/syslog
#cron.*                         /var/log/cron.log
daemon.*                        -/var/log/daemon.log
kern.*                          -/var/log/kern.log
lpr.*                           -/var/log/lpr.log
mail.*                          -/var/log/mail.log
user.*                          -/var/log/user.log

All kernel messages are sent to /var/log/kern.log. The kern.log files are rotated. Below are the various kern.log files on a Debian system.

debian-601a:/var/log# ls -l kern*
-rw-r----- 1 root adm      0 May 16 09:47 kern.log
-rw-r----- 1 root adm 149389 May 16 09:25 kern.log.1
-rw-r----- 1 root adm  16802 May 11 11:43 kern.log.2.gz
-rw-r----- 1 root adm  16807 Apr 26 10:33 kern.log.3.gz
-rw-r----- 1 root adm  24720 Apr 20 10:49 kern.log.4.gz

In rsyslog.conf there are some entries for "catch-all" files:

#
# Some "catch-all" log files.
#
*.=debug;\
        auth,authpriv.none;\
        news.none;mail.none     -/var/log/debug
*.=info;*.=notice;*.=warn;\
        auth,authpriv.none;\
        cron,daemon.none;\
        mail,news.none          -/var/log/messages

So kernel messages with a matching priority are also sent to the debug and messages files.

dmesg

The kernel logs messages into a ring buffer. dmesg dumps the contents of the ring buffer to standard output. Often, the dmesg command is issued at the end of the boot sequence, for example in one of the startup scripts, to dump the bootup messages into a file (e.g. boot.messages).

dmesg helps users to print out their bootup messages. Instead of copying the messages by hand, the user need only:

              dmesg > boot.messages

dmesg has some options; see the manpage for specific usage. With dmesg -c the ring buffer is cleared after its contents have been printed.
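
When troubleshooting, it is usually the error lines in the kernel messages that matter. The sketch below filters either a saved copy (such as boot.messages) or the live ring buffer. The function name kernel_errors and the chosen search words are assumptions for this example, so adapt the pattern to the problem at hand.

```shell
# kernel_errors [FILE]: print lines that mention errors, failures or warnings.
# With FILE, search a saved copy (e.g. boot.messages); without it, read
# the live ring buffer via dmesg (may require root on some systems).
# (kernel_errors is a hypothetical helper name.)
kernel_errors() {
    if [ -n "$1" ]; then
        grep -i -E 'error|fail|warn' "$1"
    else
        dmesg | grep -i -E 'error|fail|warn'
    fi
}

# Typical use:
#   dmesg > boot.messages
#   kernel_errors boot.messages
```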

lspci

This command displays information about all PCI buses in the system and all devices connected to them. Please refer to the lspci manpage as well.

By default, it shows a brief list of devices. In the manpage, many options are described to request either a more verbose output or output intended for parsing by other programs. To make the output of lspci more verbose, one or more -v parameters (up to 3) can be added.

Access to some parts of the PCI configuration space is restricted to root on many operating systems, so lspci features available to normal users are limited. However, lspci tries its best to display as much as available and mark all other information with <access denied> text.

Below is the lspci output of a Debian system running in VirtualBox.

debian-601a:~$ lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:02.0 VGA compatible controller: InnoTek Systemberatung GmbH VirtualBox Graphics Adapter
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:04.0 System peripheral: InnoTek Systemberatung GmbH VirtualBox Guest Service
00:05.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:06.0 USB Controller: Apple Computer Inc. KeyLargo/Intrepid USB
00:07.0 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:0d.0 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 02)
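
The first column of each line is the slot address of the device, which can be passed to lspci -s to request details about a single device. The sketch below picks the slot addresses of devices whose description contains a given text. The function name pci_slots and its optional file argument (a saved lspci listing) are assumptions for this example.

```shell
# pci_slots TEXT [FILE]: print the slot address of every device whose
# lspci line contains TEXT. With FILE, read a saved lspci listing
# instead of running lspci.
# (pci_slots is a hypothetical helper name.)
pci_slots() {
    pattern=$1
    if [ -n "$2" ]; then cat "$2"; else lspci; fi |
        awk -v p="$pattern" 'index($0, p) { print $1 }'
}

# Typical use:
#   pci_slots 'Ethernet controller'     # prints e.g. 00:03.0
#   lspci -v -s 00:03.0                 # verbose details for that slot
#   lspci -k -s 00:03.0                 # kernel driver handling the device
```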

lsdev

lsdev is a front end to the /proc filesystem and prints information about interrupts, I/O ports and DMA settings. It gives an overview of which hardware uses which I/O addresses, IRQs and DMA channels, and can aid in detecting conflicts.

FILES
       /proc/interrupts
              IRQ channels.

       /proc/ioports
              I/O memory addresses.

       /proc/dma
              DMA channels.

Below is the output of an lsdev command on a Debian system:

debian-601a:/var/log$ lsdev
Device            DMA   IRQ  I/O Ports
------------------------------------------------
0000:00:01.1                 0170-0177 01f0-01f7 0376-0376 03f6-03f6 d000-d00f
0000:00:03.0                 d010-d017
0000:00:04.0                 d020-d03f
0000:00:05.0                 d100-d1ff d200-d23f
0000:00:0d.0                 d240-d247 d250-d257 d260-d26f
82801AA-ICH               5
ACPI                         4000-4003 4004-4005 4008-400b 4020-4021
ahci                           d240-d247   d250-d257   d260-d26f
ata_piix              14 15    0170-0177   01f0-01f7   0376-0376   03f6-03f6   d000-d00f
cascade             4     2
dma                          0080-008f
dma1                         0000-001f
dma2                         00c0-00df
e1000                          d010-d017
eth1                     10
fpu                          00f0-00ff
i8042                  1 12
Intel                          d100-d1ff   d200-d23f
keyboard                     0060-0060 0064-0064
ohci_hcd:usb1            11
PCI                          0cf8-0cff
pic1                         0020-0021
pic2                         00a0-00a1
rtc0                      8    0070-0071
rtc_cmos                     0070-0071
timer                     0
timer0                       0040-0043
timer1                       0050-0053
vboxguest                 9
vga+                         03c0-03df

lshw

Note

The lshw command is not mentioned in the LPI objectives.

lshw - list hardware

SYNOPSIS
       lshw [ -version ]

       lshw [ -help ]

       lshw [ -X ]

       lshw  [  -html | -short | -xml | -businfo ] [ -class class ... ] [ -disable test
       ... ] [ -enable test ... ] [ -sanitize ] [ -numeric ] [ -quiet ]

lshw is a small tool to extract detailed information on the hardware configuration of the machine. It can report exact memory configuration, firmware version, mainboard configuration, CPU version and speed, cache configuration, bus speed, etc. on DMI-capable x86 or IA-64 systems and on some PowerPC machines (PowerMac G4 is known to work).

lsusb

lsusb - list USB devices

SYNOPSIS
       lsusb [ options ]

lsusb is a utility for displaying information about USB buses in the system and the devices connected to them.

lsusb -v gives lots of verbose information.

Below is the output of the lsusb command on a laptop running Ubuntu:

ubuntu:/var/log$ lsusb
Bus 007 Device 003: ID 03f0:0324 Hewlett-Packard SK-2885 keyboard
Bus 007 Device 002: ID 045e:0040 Microsoft Corp. Wheel Mouse Optical
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 002: ID 147e:2016 Upek Biometric Touchchip/Touchstrip Fingerprint Sensor
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 04f2:b018 Chicony Electronics Co., Ltd 2M UVC Webcam
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

lsmod

/sbin/lsmod

lsmod - program to show the status of modules in the Linux Kernel

lsmod shows all loaded modules.

lsmod is a trivial program that nicely formats the contents of /proc/modules, showing which kernel modules are currently loaded. Name, size, use count and a list of referring modules are displayed. The information displayed is identical to that in /proc/modules. lsmod is frequently used to check whether the proper modules have been loaded.

modprobe

/sbin/modprobe

modprobe - program to add and remove modules from the Linux Kernel

SYNOPSIS
       modprobe  [  -v ]  [ -V ]  [ -C config-file ]  [ -n ]  [ -i ]  [ -q ]  [ -b ]  [
       modulename ]  [ module parameters... ]

       modprobe [ -r ]  [ -v ]  [ -n ]  [ -i ]  [ modulename... ]

       modprobe [ -l ]  [ -t dirname ]  [ wildcard ]

       modprobe [ -c ]

       modprobe [ --dump-modversions ]  [ filename ]

modprobe intelligently adds or removes a module from the Linux kernel. For convenience, there is no difference between _ and - in module names (automatic underscore conversion is performed). modprobe looks in the module directory /lib/modules/`uname -r` for all the modules and other files, except for the optional /etc/modprobe.conf configuration file and /etc/modprobe.d directory (see modprobe.conf(5)). modprobe will also use module options specified on the kernel command line in the form of <module>.<option>.

modprobe is a high-level interface to insmod. Often, modules depend on each other and/or need to be loaded in a certain order. modprobe makes this easier for system administrators: it uses a dependency file, which is created by depmod, to load modules in the right order from certain specified locations. depmod is normally run from one of the rc-files in /etc/rc.d, so that the correct module dependencies are available immediately after booting the system. On older systems, the configuration file /etc/modules.conf can be used to steer the behavior of depmod and modprobe. If one module in a dependency chain fails to load, modprobe unloads all the modules of that chain again. See also the section called “Kernel Components (201.1)”.

modprobe -l lists all available modules.

lsmod displays all loaded modules.

Load a module:

debian:/lib/modules/2.6.35-22-generic/kernel/lib$ sudo modprobe cpu-notifier-error-inject

and display it with lsmod:

debian:/lib/modules/2.6.35-22-generic/kernel/lib$ lsmod|grep cpu
cpu_notifier_error_inject     1861  0

Remove the module and check:

debian:/lib/modules/2.6.35-22-generic/kernel/lib$ sudo modprobe -r cpu-notifier-error-inject
debian:/lib/modules/2.6.35-22-generic/kernel/lib$ lsmod|grep cpu
debian:/lib/modules/2.6.35-22-generic/kernel/lib$
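
Note that the module was loaded under a name with hyphens but is listed with underscores. A script that checks whether a module is loaded therefore has to normalise the name first. The following is a minimal sketch that reads /proc/modules directly; the function name module_loaded and its optional file argument are assumptions for this example.

```shell
# module_loaded NAME [FILE]: succeed (exit status 0) if kernel module NAME
# is loaded. FILE defaults to /proc/modules, the same data lsmod formats.
# (module_loaded is a hypothetical helper name.)
module_loaded() {
    # Loaded modules are listed with underscores, so convert "-" to "_".
    name=$(printf '%s' "$1" | tr '-' '_')
    # The first column of /proc/modules is the module name.
    awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Typical use:
#   if module_loaded cpu-notifier-error-inject; then
#       echo "module is loaded"
#   fi
```

Comparing the whole first column avoids the false matches that a plain grep on part of the name can produce.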

insmod

/sbin/insmod

insmod - Simple program to insert a module into the Linux Kernel

insmod is a trivial program to insert a module into the kernel: if the filename is a hyphen, the module is taken from standard input. Most users will want to use modprobe(8) instead, which is more clever and can handle module dependencies.

insmod installs a loadable module in the running kernel. It tries to do this by resolving all symbols from the kernel's exported symbol table. You can specify the (object) file name. If the file name is given without extension, insmod will search for the module in common default directories. These default locations can be overridden by the contents of an environment variable (MODPATH) or in the configuration file /etc/modules.conf.

Only the most general of error messages are reported: as the work of trying to link the module is now done inside the kernel, the dmesg command usually gives more information about errors.

uname

/bin/uname

uname - print system information

uname displays machine type, network hostname, OS release, OS name, OS version and processor type of the host.

       -a, --all
              print  all  information, in the following order, except omit -p and -i if
              unknown:

       -s, --kernel-name
              print the kernel name

       -n, --nodename
              print the network node hostname

       -r, --kernel-release
              print the kernel release

       -v, --kernel-version
              print the kernel version

       -m, --machine
              print the machine hardware name

       -p, --processor
              print the processor type or "unknown"

       -i, --hardware-platform
              print the hardware platform or "unknown"

       -o, --operating-system
              print the operating system

debian-601a:~$ uname -a
Linux debian-601a 2.6.32-5-amd64 #1 SMP Mon Mar 7 21:35:22 UTC 2011 x86_64 GNU/Linux
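
The kernel release printed by uname -r is the string that ties the running kernel to its files in /boot and /lib/modules. The sketch below builds the module directory name from it; the function name kernel_module_dir is an assumption for this example.

```shell
# kernel_module_dir: print the module directory of the running kernel.
# (kernel_module_dir is a hypothetical helper name.)
kernel_module_dir() {
    printf '/lib/modules/%s\n' "$(uname -r)"
}

# Typical use when checking a system that has trouble loading modules:
#   ls "$(kernel_module_dir)"            # should contain modules.dep
#   ls /boot/*"$(uname -r)"*             # kernel, initrd, System.map, config
```

If the directory does not exist, the modules for the running kernel are not installed, which explains failing modprobe calls.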

Tools and utilities to trace software and their system and library calls

strace

strace - trace system calls and signals

SYNOPSIS
       strace  [  -CdffhiqrtttTvxx  ] [ -acolumn ] [ -eexpr ] ...  [ -ofile ] [ -ppid ]
       ...  [ -sstrsize ] [ -uusername ] [ -Evar=val ] ...  [ -Evar ] ...  [ command  [
       arg ...  ] ]

       strace -c [ -eexpr ] ...  [ -Ooverhead ] [ -Ssortby ] [ command [ arg ...  ] ]

In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option.

strace is a useful diagnostic, instructional, and debugging tool. System administrators, diagnosticians and trouble-shooters will find it invaluable for solving problems with programs for which the source is not readily available since they do not need to be recompiled in order to trace them. Students, hackers and the overly-curious will find that a great deal can be learned about a system and its system calls by tracing even ordinary programs. And programmers will find that since system calls and signals are events that happen at the user/kernel interface, a close examination of this boundary is very useful for bug isolation, sanity checking and attempting to capture race conditions.

In short, strace needs neither the source code nor a recompiled binary, which also makes it a good tool for gaining a better understanding of the inner workings of a program.

Below is the output of strace cat /dev/null:

debian-601a:~$ strace cat /dev/null
execve("/bin/cat", ["cat", "/dev/null"], [/* 34 vars */]) = 0
brk(0)                                  = 0xc25000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fb4000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=59695, ...}) = 0
mmap(NULL, 59695, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fcac8fa5000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\355\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1432968, ...}) = 0
mmap(NULL, 3541032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fcac8a38000
mprotect(0x7fcac8b90000, 2093056, PROT_NONE) = 0
mmap(0x7fcac8d8f000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x157000) = 0x7fcac8d8f000
mmap(0x7fcac8d94000, 18472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fcac8d94000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa2000
arch_prctl(ARCH_SET_FS, 0x7fcac8fa3700) = 0
mprotect(0x7fcac8d8f000, 16384, PROT_READ) = 0
mprotect(0x7fcac8fb6000, 4096, PROT_READ) = 0
munmap(0x7fcac8fa5000, 59695)           = 0
brk(0)                                  = 0xc25000
brk(0xc46000)                           = 0xc46000
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1527584, ...}) = 0
mmap(NULL, 1527584, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fcac8e2d000
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
open("/dev/null", O_RDONLY)             = 3
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
read(3, "", 32768)                      = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
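
In a trace like the one above, the interesting lines are often the calls that fail, for example a configuration file that cannot be found. The sketch below summarises the failing calls in a trace saved with strace -o. The function name failed_calls is an assumption for this example, and the pattern only handles the plain line format shown above (no PID prefixes as added by -f).

```shell
# failed_calls FILE: summarise system calls that returned -1 in a saved
# strace log (as written by "strace -o FILE command"), grouped by
# call name and error code.
# (failed_calls is a hypothetical helper name.)
failed_calls() {
    # A failing line looks like:
    #   access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    sed -n 's/^\([a-z_0-9]*\)(.*= -1 \([A-Z0-9]*\) .*/\1 \2/p' "$1" | sort | uniq -c
}

# Typical use:
#   strace -o /tmp/cat.trace cat /dev/null
#   failed_calls /tmp/cat.trace
```

Many failures are harmless, such as the ENOENT results for the optional /etc/ld.so.* files above, so the summary is a starting point rather than a diagnosis.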

strings

strings - print the strings of printable characters in files.

SYNOPSIS
       strings [-afovV] [-min-len]
               [-n min-len] [--bytes=min-len]
               [-t radix] [--radix=radix]
               [-e encoding] [--encoding=encoding]
               [-] [--all] [--print-file-name]
               [-T bfdname] [--target=bfdname]
               [--help] [--version] file...

For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.

strings is mainly useful for determining the contents of non-text files, such as executables. It is often used to check for the names of environment variables and configuration files used by an executable.

ltrace

ltrace - A library call tracer

SYNOPSIS
       ltrace  [-CfhiLrStttV]  [-a  column] [-A maxelts] [-D level] [-e expr] [-l file-
       name] [-n nr] [-o filename] [-p pid] ... [-s strsize] [-u username] [-X  extern]
       [-x   extern]   ...   [--align=column]   [--debug=level]  [--demangle]  [--help]
       [--indent=nr] [--library=filename] [--output=filename] [--version] [command [arg
       ...]]

ltrace is a program that simply runs the specified command until it exits. It is similar to strace, but instead of system calls it intercepts and records the dynamic library calls made by the executed process and the signals received by that process. It can also intercept and print the system calls executed by the program. The program to be traced need not be recompiled for this, so you can use it on binaries for which you don't have the source handy.

Example with ltrace:

debian-601a:~$ ltrace cat /dev/null
__libc_start_main(0x401ad0, 2, 0x7fff5f357f38, 0x409110, 0x409100 <unfinished ...>
getpagesize()                                                     = 4096
strrchr("cat", '/')                                               = NULL
setlocale(6, "")                                                  = "en_US.utf8"
bindtextdomain("coreutils", "/usr/share/locale")                  = "/usr/share/locale"
textdomain("coreutils")                                           = "coreutils"
__cxa_atexit(0x4043d0, 0, 0, 0x736c6974756572, 0x7f4069c04ea8)    = 0
getenv("POSIXLY_CORRECT")                                         = NULL
__fxstat(1, 1, 0x7fff5f357d80)                                    = 0
open("/dev/null", 0, 02)                                          = 3
__fxstat(1, 3, 0x7fff5f357d80)                                    = 0
malloc(36863)                                                     = 0x014c4030
read(3, "", 32768)                                                = 0
free(0x014c4030)                                                  = <void>
close(3)                                                          = 0
exit(0 <unfinished ...>
__fpending(0x7f4069c03780, 0, 0x7f4069c04330, 0x7f4069c04330, 0x7f4069c04e40) = 0
fclose(0x7f4069c03780)                                            = 0
__fpending(0x7f4069c03860, 0, 0x7f4069c04df0, 0, 0x7f4069e13700)  = 0
fclose(0x7f4069c03860)                                            = 0
+++ exited (status 0) +++

lsof

lsof - list open files

lsof revision 4.81 lists on its standard output file information about files opened by processes for the following UNIX dialects:

            AIX 5.3
            FreeBSD 4.9 for x86-based systems
            FreeBSD 7.0 and 8.0 for AMD64-based systems
            Linux 2.1.72 and above for x86-based systems
            Solaris 9 and 10

lsof has many options to show open files. See man lsof for the various options and have a look at the examples at the bottom of the man page.

lsof by default lists all open files belonging to all active processes. Since Unix also uses the file metaphor for devices, an open file can be a regular file, a directory, a block special file, a character special file, an executing text reference, a library, a stream or a network file (Internet socket, NFS file or UNIX domain socket). The utility can be used to see which processes use which resources.

fuser

Note

The fuser command is not part of this LPI objective.

fuser accepts a filename and displays the PIDs of the processes using the specified files or filesystems. This comes in handy if you want to know which process uses a certain file. For example, if you are not able to unmount a filesystem, this is often caused by a process that still uses a file on that filesystem. fuser can be used to find the PID(s) of the process(es). A subsequent ps -p $PID will name the process.
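
If neither lsof nor fuser is installed, much of the same information can be read from /proc, where every process has an fd directory with one symbolic link per open file. The following is a rough sketch of that idea and not a replacement for fuser: it only matches one exact path and only sees processes you are allowed to inspect. The function name pids_using and its optional second argument (an alternative root instead of /proc) are assumptions for this example.

```shell
# pids_using FILE [PROCROOT]: print the PIDs that have FILE open, by
# scanning the fd symlinks under /proc. Run as root for a complete picture.
# (pids_using is a hypothetical helper name.)
pids_using() {
    target=$1
    root=${2:-/proc}
    for fd in "$root"/[0-9]*/fd/*; do
        # Each fd entry is a symbolic link to the open file.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#"$root"/}      # strip the leading "/proc/"
            echo "${pid%%/*}"       # keep only the PID component
        fi
    done | sort -un
}

# Typical use:
#   for p in $(pids_using /mnt/data/file.txt); do ps -p "$p"; done
```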

Troubleshooting - a word of caution

Debugging boot problems (or any other problems for that matter) can be complex at times. Your best teacher is experience. However, by carefully studying the boot messages and having a proper understanding of the mechanisms at hand, you should be able to solve most, if not all, common Linux problems.

You need a good working knowledge of the boot process to be able to solve boot problems. In previous sections the boot process was described in detail as was system initialization. By now you should be able to determine in what stage the boot process is from the messages displayed on the screen. You should be able to utilize kernel boot messages to diagnose kernel errors. If not, please re-read the previous sections carefully. We will provide a number of suggestions on how to solve common problems in the next sections. However, it is beyond the scope of this book to cover all possible permutations. If your problem is not covered here, search the web for peers with the same problem, consult your colleagues, read more documentation, investigate and experiment.

Cost effectiveness

In an ideal world you always have plenty of time to find out the exact nature of the problem and consequently solve it. In the real world you will have to be aware of cost effectiveness. Look for the most (cost-) effective way to solve the problem. For example: let's say that initial investigation indicates that the boot disk has hardware problems. You do not yet know the exact nature of the problems, but you have eliminated the most common causes. The disk still does not work reliably. Hence, you need more time to investigate further. If your customer has a recent backup and, after consultation, agrees to go back to that state, you might consider suggesting installation of a brand new disk and restoring the most recent backup instead. The total cost of your time so far and the new hardware is probably less than the cost of investigating any further, even more so if you consider the probability that you need to replace the disk after all.

Getting help

This book is not an omniscient encyclopedia that describes solutions for all problems you may encounter. If you get stuck, there are many ways to get help. First, you could call or mail the distributor of your Linux version. Often you are granted installation support or 30 day generic support. It may be worth your money to subscribe to a support network - most distributions offer such a service at nominal fees.

Some distributions grant on-line (Internet) access to support databases. In these databases you can find a wealth of information on hardware and software problems. You can often search for keywords or error messages and most of them are cross-referenced to enable you to find related topics. Some URLs you may check follow:

You can also grep or zgrep for (parts of) an error message or keyword in the documentation on your system, typically in and under /usr/doc/, /usr/share/doc/ or /usr/src/linux/Documentation. Additionally, you could try to enter the error message in an Internet search engine, for example http://www.google.com. This often returns URLs to FAQs, HOWTOs and other documents that may contain clues on how to solve your problem. Usenet news archives are another resource to use.

Generic issues with hardware problems

Often, Linux refuses to boot due to hardware problems. If a system boots and seems to be working fine, it still may have a hardware problem. If it is not working with Linux, but is working fine with other operating systems, for example DOS or Windows, this too often signifies hardware problems. Linux assumes the hardware to be up to specifications and will try to use it to its (specified) limits.

In general: regularly check the system log files for write and read errors. They indicate that the hardware is slowly becoming less reliable. Other indications of lurking hardware problems are: problems when accessing the CDROM (halt, long delays, bus errors, segmentation faults), kernel generation or compilation of other programs aborts with signal 11 or signal 7, scrambled or incorrect file contents, memory access errors, graphics that are not displayed correctly, CRC errors when accessing the floppy disk drive, crashes or halts during boot and errors when creating a filesystem.

To discover lurking hardware errors, you can use a simple, yet effective test: create a small script that compiles the kernel in an endless loop, for example:

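#!/bin/sh
# adapted from http://www.bitwizard.nl/sig11/
#
# Compile the kernel over and over, saving the output of each run
# in its own log file (log.0, log.1, ...).
cd /usr/src/linux || exit 1

c=0
while true
do
   make clean > /dev/null 2>&1
   make -k bzImage > log.${c} 2> /dev/null
   c=$((c + 1))
done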

Every iteration of this loop should create a log file, and all log files should have exactly the same content. You can use sum or md5sum to check this. Differences between the log files are often an indication of a hardware problem.
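
The comparison of the log files can be automated as well. The sketch below counts how many different checksums occur among the log.* files in a directory; the function name distinct_logs is an assumption for this example.

```shell
# distinct_logs [DIR]: print how many different checksums the log.* files
# in DIR (default: current directory) have. A result of 1 means every
# iteration produced identical output.
# (distinct_logs is a hypothetical helper name.)
distinct_logs() {
    md5sum "${1:-.}"/log.* | awk '{ print $1 }' | sort -u | wc -l
}

# Typical use, after the compile loop has run for a while:
#   distinct_logs /usr/src/linux
#   md5sum /usr/src/linux/log.* | sort    # shows which files deviate
```

Any result greater than 1 means at least one run produced different output and the hardware deserves a closer look.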

Resolving initial boot problems

If your system does not boot (anymore) you want to retrieve basic system functionality: your system should boot and the kernel should load. After that, you often are able to resolve the other problems using the rich set of debugging tools Linux offers.

There are a number of common boot problems. One group relates to the MBR and bootstrap files. Data corruption or accidental deletion of files or the boot partition will prevent your drive from booting. Another group clearly relates to hardware failures on the boot-drive.

Hardware problems.  If a hardware component failure causes the system to refuse to boot you typically get one of the following clues: the BIOS reports errors like No Fixed Disk Found or Disk Controller Error. Sometimes, numerical messages are displayed, often in the 1700 range, e.g. 1701, 1791 etc. If the controller is in error, a message that indicates this is often shown. In the most obvious case, the disk does not even start spinning.

Start by ensuring that your BIOS was set up correctly. The disk geometry needs to be set correctly to enable your BIOS to see the drive. Often, a Plug And Play option can be set in the BIOS. Set the option so that PNP is deactivated. If an additional option Reset Configuration Data or Update ESCD is offered, please set it to yes or alternatively to enabled.

If you have problems with a drive you just added to the system, the problems are often caused by incorrect cabling or jumper selections. Make sure all connectors are properly seated. Ensure that IDE master and slave drives are jumpered and cabled properly. Faster UDMA drives (UDMA/66 and above) require a special 80-conductor cable. Your BIOS needs to be able to access the disk during boot, so make sure your drive geometry and/or type has been correctly specified in your BIOS.

You should verify the seating of the connectors and check the cables between disk and motherboard. Check for corrosion. Also check the power cables. You can remove the cables and measure them, using a simple ohm meter or buzzer, and/or replace them with new ones. A hard disk controller error message can often be caused by bad cabling too. Always check your cabling first, even when the symptoms seem to indicate a controller error. If the error persists, try replacing the controller card.

Software problems.  If a software failure causes the system to refuse to boot, you typically get one of the following clues: LILO does not display all four of its letters, LILO hangs, you get a screen with scrolling error codes, e.g. 010101.., the system reports something like drive not bootable, insert system disk, or the system boots a different operating system than intended.

If you are able to boot from a CD/DVD, insert the distribution installation CD and select the RESCUE part. Most Linux distributions have a rescue part on the installation CD with which the system can be repaired or investigated.

Note

The part below about booting from a floppy has been kept, although some of the information is a bit outdated. The information is still useful.

If you are able to boot from a floppy, you can use a boot floppy or rescue disk to boot a kernel. You could use a rescue floppy that boots a kernel that has the root filesystem set to your hard disk (using the rdev command). Or you could use a special tiny Linux distribution, like tomsrtbt (http://www.toms.net/rb/), and mount the root filesystem by hand. Make sure the rescue disk contains the necessary functionality/modules to fit your hardware, e.g. if you have SCSI disks, be sure the drivers are either compiled into your boot kernel or available as modules. Once the boot floppy has booted, you should start by checking the validity of your MBR. When the LILO bootup messages indicate a geometry error or a related error, you should verify that /sbin/lilo ran correctly. If you are in doubt, you can rerun it, provided you have access to (a backup copy of) the lilo.conf file. Also check whether the error is caused by the 1024 cylinder boundary problem: your kernel(s) and related files must be accessible to your BIOS, and some BIOSes (as a rule of thumb: made before 1998) are not able to access disk cylinders above 1024.

Next, verify that the partition table in the MBR is correct by running fdisk. Verify that at least one partition is marked as boot or active: some BIOSes require the Linux boot partition to be marked bootable, and some distributions may require this too. If the partition table is incorrect, you will need to repair it. If you have a backup of your boot system's MBR (including the partition table, hence: the first 512 bytes of your hard disk) and have put it on a floppy, now is the time to restore it. Alternatively, you may have printed the partition list; if so, you can use fdisk to recreate the partition table.
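Making and checking such an MBR backup could look roughly like the sketch below. The function names are hypothetical, and the device and file names in the example are assumptions. The functions take their paths as arguments, so they can be tried on a disk image file before being pointed at a real disk.

```shell
# Copy the first 512 bytes (boot code plus partition table) of a disk or
# disk image to a backup file. Both paths are supplied by the caller.
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# Succeed if the sector stored in the given file ends with the boot
# signature bytes 0x55 0xAA (offsets 510 and 511).
has_boot_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example (names are assumptions; run as root from the rescue system):
#   backup_mbr /dev/sda /mnt/floppy/mbr.bin
#   has_boot_signature /mnt/floppy/mbr.bin || echo "no valid boot signature"
# Restoring is the reverse copy and overwrites the partition table:
#   dd if=/mnt/floppy/mbr.bin of=/dev/sda bs=512 count=1
```

A missing signature is a strong hint that the sector is damaged or was never a valid MBR. A present signature says nothing about whether the partition entries themselves are right, so still compare them with the output of fdisk -l.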

If the partition table looks correct when you print it in fdisk, the next step is to take a look at the root filesystem of your drive. You can use the fsck command to repair problems on your root filesystem. If all else fails, you have to resort to reformatting and/or repartitioning your disk and restoring the latest backup, or you may even need to replace the disk.

Resolving kernel boot problems

If the initial boot sequence could be completed, the kernel is loaded and tries to mount its root filesystem. This can either be a RAMDISK (initrd) or a partition on a hard disk. Of course, the kernel needs to be able to access (all of) its memory and the filesystem on the disk, which may require certain device drivers in the kernel. When the root filesystem could be mounted additional programs can be executed and additional kernel modules may be loaded to add functionality. If the root filesystem is a RAM disk, it may contain programs to load the modules needed to address other hardware devices such as disks. In these phases you could experience module loading problems, which can result in (partial) hardware inaccessibility.

In many cases it is not clear whether or not the cause of a problem lies with the hardware or the software. However, most of these errors result from invalid configuration.

Hardware problems.  If a hardware problem causes the system to refuse to boot, you typically get one of the following clues: PANIC VFS unable to mount root fs on ##:##, modprobe reports errors loading a module, IRQ/DMA conflicts are reported, parts of your hardware do not work, or intermittent errors occur. As a rule of thumb, if your kernel boots but your system subsequently hangs or issues error messages, you should check for software problems and configuration problems first. If this does not resolve the problem, check your hardware (the section called “Generic issues with hardware problems”).

Software problems.  If a software problem causes the system to refuse to boot, you typically also get one of the clues we listed under hardware errors: PANIC VFS unable to mount root fs on ##:##, modprobe reports errors loading a module, IRQ/DMA conflicts can be reported.

The PANIC VFS unable to mount... message occurs when the kernel could be loaded, but either the ramdisk or the physical partition could not be mounted. This can be the result of forgetting to run /sbin/lilo after a kernel change or update, or forgetting to run rdev in case you put the kernel-image directly into the boot-sectors of a partition. The PANIC message is sometimes caused by inaccessibility of (parts of) your system's memory, for example when it tries to mount the RAM-disk as its root filesystem. Older kernels may require you to set the amount of RAM by rebooting and setting the boot parameter:

mem=<size-of-memory-in-Kbytes>

In order to recognize memory above 64 MB, it may be necessary to append the mem= option to the kernel command line permanently. If you are using LILO as your boot loader, you would do this in the lilo.conf file. For example, if you had a machine with 128 MB you would type:

append="mem=131072K"
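The value is simply the amount of RAM in megabytes multiplied by 1024. As a small sketch (the function name lilo_mem_append is hypothetical), the line for lilo.conf can be generated like this:

```shell
# Print a lilo.conf append line for a machine with the given amount of RAM.
# $1 is the RAM size in megabytes; the kernel option is written in kilobytes.
lilo_mem_append() {
    printf 'append="mem=%dK"\n' "$(( $1 * 1024 ))"
}

# Example:
#   lilo_mem_append 128     prints: append="mem=131072K"
```

Remember to rerun /sbin/lilo after changing lilo.conf, otherwise the new option is not used at the next boot.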

To help you determine which device the kernel tries to mount, the PANIC message contains the major and minor number of the device, e.g. 9:0 for /dev/md0 (the first virtual RAID disk, the section called “Software RAID”) or 8:3 for /dev/sda3, the third partition on the first SCSI disk. This can pinpoint the area where you should start your search for configuration errors. For example: if this message points to the /dev/md0 device, you should check whether the kernel was compiled with support for software RAID; the same goes for SCSI disks.
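When you have booted a rescue system that does see the disks, the numbers from the PANIC message can be translated into a device name with the help of /proc/partitions. The following is a minimal sketch: the function name dev_for_numbers is hypothetical, and it assumes the numbers are given in decimal and that the device is listed in /proc/partitions.

```shell
# Look up the device name for a major:minor pair in /proc/partitions-style
# input (columns: major, minor, #blocks, name). A file can be passed as the
# second argument; the default is /proc/partitions of the running system.
dev_for_numbers() {
    major=${1%%:*}    # part before the colon
    minor=${1##*:}    # part after the colon
    awk -v maj="$major" -v min="$minor" \
        '$1 == maj && $2 == min { print "/dev/" $4 }' "${2:-/proc/partitions}"
}

# Example:
#   dev_for_numbers 9:0
```

If nothing is printed, the running kernel does not know a block device with those numbers, which in itself suggests that a driver is missing.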

In most cases, your system should have booted by now. It may be that you still see numerous errors, e.g. daemons won't start, your network card or sound card does not work. In the next sections we will describe tools that aid you with troubleshooting and give hints and tips on how to resolve more problems.

Resolving IRQ/DMA conflicts

IRQ conflicts are a common source of problems. You can use dmesg to see which interrupts were requested by the drivers and compare this with the contents of /proc/interrupts or the output of lsdev to determine conflicts. The IRQ a PCI device uses is also reported in /proc/pci (on older kernels that still provide this file) or by the lspci program (lspci -v). Sometimes a card cannot obtain an IRQ because the BIOS assigned all of them to non-PCI (ISA) cards. Check your BIOS settings if you suspect this to be the case.

Under some conditions IRQs can be shared between two devices. Devices on the PCI bus may share the same IRQ with other devices on the PCI bus, provided the driver software supports this. In other cases where there is potential for conflict, there should be no problem as long as no two devices with the same IRQ are ever in use at the same time. Even if devices with conflicting IRQs are used simultaneously, one of them will likely have its interrupts caught by its device driver and may work. The other device(s) will likely behave as if they were configured with the wrong interrupts.
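To see which IRQs are currently shared, you can scan /proc/interrupts for lines that list more than one handler. The sketch below uses a hypothetical function name, shared_irqs, and assumes that handlers sharing an IRQ are shown as a comma-separated list at the end of the line, which is how typical kernels print them.

```shell
# Print the numbers of all IRQs that have more than one handler registered.
# An alternative file can be passed as the first argument; the default is
# /proc/interrupts of the running system.
shared_irqs() {
    # Only lines that start with a numeric IRQ ("16:") are considered;
    # a comma on such a line separates the names of the sharing handlers.
    awk '$1 ~ /^[0-9]+:$/ && /,/ { sub(/:$/, "", $1); print $1 }' \
        "${1:-/proc/interrupts}"
}

# Example:
#   for irq in $(shared_irqs); do grep "^ *$irq:" /proc/interrupts; done
```

A shared IRQ is not an error in itself. It only becomes interesting when one of the devices on that line is the one that misbehaves.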

Copyright Snow B.V. The Netherlands