Candidates should be able to identify and correct common boot and run time issues.
/proc filesystem
Various system and daemon log files
Content of /, /boot, and /lib/modules
Screen output during bootup
Kernel syslog entries in system logs (if entry is able to be gained)
Tools and utilities to analyse information about the used hardware
Tools and utilities to trace software and their system and library calls
dmesg
/sbin/lspci
/usr/bin/lsdev
/sbin/lsmod
/sbin/modprobe
/sbin/insmod
/bin/uname
strace
strings
ltrace
lsof
lsusb
Resources: the man pages for the various commands, Vasudevan02.
/proc filesystem.
/proc is a direct reflection of the system in memory presented to you
as files and directories. It provides an easy way to view kernel information
and information about currently running processes. In Linux some commands read
/proc directly to get information about the state of the
system. It allows you to view statistical information, hardware information,
network and host parameters, memory and performance information and lets you
modify some parameters at runtime by writing values in it.
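As a minimal sketch of reading /proc from the shell, the function below prints a few values taken directly from files under /proc. The function name proc_summary is hypothetical. The paths used are standard on Linux, but the exact set of files available can differ between kernel versions.

```shell
#!/bin/sh
# proc_summary (hypothetical helper): print a few facts read directly from /proc.
proc_summary() {
    # Kernel name and release, the same data uname -s and uname -r report
    echo "kernel:  $(cat /proc/sys/kernel/ostype) $(cat /proc/sys/kernel/osrelease)"
    # The first field of /proc/uptime is the uptime in seconds
    echo "uptime:  $(cut -d' ' -f1 /proc/uptime) seconds"
    # The MemTotal line from /proc/meminfo
    grep '^MemTotal:' /proc/meminfo
}

proc_summary

# Writing works the same way, but needs root. For example (not run here):
#   echo 1 > /proc/sys/net/ipv4/ip_forward
```

Values written this way take effect immediately but are lost at the next reboot.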
Various system and daemon log files can be found at /var/log.
Below is an output from a Debian system:
debian-601a:~$ ls /var/log
alternatives.log    debug           kern.log        pm-powersave.log
alternatives.log.1  debug.1         kern.log.1      pm-powersave.log.1
apt                 debug.2.gz      kern.log.2.gz   pycentral.log
aptitude            debug.3.gz      kern.log.3.gz   samba
aptitude.1.gz       dmesg           lastlog         syslog
auth.log            dmesg.0         lpr.log         syslog.1
auth.log.1          dmesg.1.gz      lpr.log.1       syslog.2.gz
auth.log.2.gz       dmesg.2.gz      lpr.log.2.gz    syslog.3.gz
auth.log.3.gz       dmesg.3.gz      lpr.log.3.gz    syslog.4.gz
boot                dmesg.4.gz      mail.err        syslog.5.gz
btmp                dpkg.log        mail.info       syslog.6.gz
btmp.1              dpkg.log.1      mail.log        unattended-upgrades
ConsoleKit          exim4           mail.warn       user.log
cups                faillog         messages        wtmp
daemon.log          fontconfig.log  messages.1      wtmp.1
daemon.log.1        fsck            messages.2.gz   Xorg.0.log
daemon.log.2.gz     gdm3            messages.3.gz   Xorg.0.log.old
daemon.log.3.gz     installer       news
Several types of logs can be seen: auth, cups, daemon, debug, dmesg, fsck, kern, lpr, mail, messages, samba, syslog, wtmp and Xorg, among others.
Make sure that the filesystem holding /var/log has enough room to gather a lot of information if things go wrong.
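One way to keep an eye on the available room is to ask df how full the filesystem holding /var/log is. The sketch below is only an illustration: the helper name fs_usage_percent and the 90% threshold are assumptions, not something prescribed by any distribution.

```shell
#!/bin/sh
# fs_usage_percent (hypothetical helper): print how full the filesystem
# holding the given directory is, as a bare number (for example 42).
fs_usage_percent() {
    # -P forces the POSIX output format, so each filesystem is on one line
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding /var/log is more than 90% full
# (the threshold is an arbitrary example value).
dir=/var/log
[ -d "$dir" ] || dir=/
usage=$(fs_usage_percent "$dir")
if [ "$usage" -gt 90 ]; then
    echo "warning: filesystem holding $dir is ${usage}% full"
fi
```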
The listings below of /, /boot and /lib/modules are from a Debian system.
Contents of /:
debian-601a:/$ ls -a
.    boot  home        lib32       media  proc  selinux  tmp  .ure  vmlinuz
..   dev   initrd.img  lib64       mnt    root  srv      u8   usr
bin  etc   lib         lost+found  opt    sbin  sys      u9   var
Contents of /boot:
debian-601a:/boot$ ls -a
.                      debian.bmp                 sarge.bmp
..                     debianlilo.bmp             sid.bmp
coffee.bmp             grub                       System.map-2.6.32-5-amd64
config-2.6.32-5-amd64  initrd.img-2.6.32-5-amd64  vmlinuz-2.6.32-5-amd64
The most important files are vmlinuz, System.map, initrd.img and the config file. The grub directory resides here as well, along with some bitmap images.
Contents of /lib/modules:
debian-601a:/lib/modules$ ls -a
.  ..  2.6.32-5-amd64
debian-601a:/lib/modules/2.6.32-5-amd64$ ls -a
.      kernel             modules.dep      modules.order    modules.symbols.bin
..     modules.alias      modules.dep.bin  modules.softdep  source
build  modules.alias.bin  modules.devname  modules.symbols  updates
Below a part of /etc/rsyslog.conf can be seen. This is on a Debian system.
#
auth,authpriv.*                 /var/log/auth.log
*.*;auth,authpriv.none          -/var/log/syslog
#cron.*                         /var/log/cron.log
daemon.*                        -/var/log/daemon.log
kern.*                          -/var/log/kern.log
lpr.*                           -/var/log/lpr.log
mail.*                          -/var/log/mail.log
user.*                          -/var/log/user.log
All kernel messages are sent to /var/log/kern.log. The kern.log files are rotated. Below are the various kern.log files on a Debian system.
debian-601a:/var/log# ls -l kern*
-rw-r----- 1 root adm      0 May 16 09:47 kern.log
-rw-r----- 1 root adm 149389 May 16 09:25 kern.log.1
-rw-r----- 1 root adm  16802 May 11 11:43 kern.log.2.gz
-rw-r----- 1 root adm  16807 Apr 26 10:33 kern.log.3.gz
-rw-r----- 1 root adm  24720 Apr 20 10:49 kern.log.4.gz
In rsyslog.conf there are also some entries for "catch-all" files:
#
# Some "catch-all" log files.
#
*.=debug;\
auth,authpriv.none;\
news.none;mail.none -/var/log/debug
*.=info;*.=notice;*.=warn;\
auth,authpriv.none;\
cron,daemon.none;\
mail,news.none -/var/log/messages
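So kernel messages are also sent to the debug and messages files, depending on their priority: debug messages go to /var/log/debug, and info, notice and warn messages go to /var/log/messages.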
dmesg.
The kernel logs messages into a ring buffer. dmesg
dumps the contents of the ring buffer to the standard output. Often,
the dmesg command is issued at the end of the boot sequence,
for example in one of the start up scripts, to dump the bootup messages
into a file (e.g. boot.messages).
dmesg helps users to print out their bootup messages. Instead of copying the messages by hand, the user need only:
dmesg > boot.messages
dmesg has some options. See manpage for specific usage. With dmesg -c the ring buffer contents can be cleared after printing.
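Once the boot messages have been saved to a file, a simple filter can point at the lines worth reading first. The sketch below is an illustration only: the helper name boot_problems and the keyword list (error, fail, warn) are assumptions, and real problems may well be reported with other wording.

```shell
#!/bin/sh
# boot_problems (hypothetical helper): show lines in a saved dmesg dump that
# mention errors, failures or warnings, prefixed with their line numbers.
boot_problems() {
    grep -n -i -E 'error|fail|warn' "$1"
}

# Typical use (not run here, since reading the ring buffer may need root):
#   dmesg > boot.messages
#   boot_problems boot.messages
```

The line numbers make it easy to open the saved file at the right place and read the surrounding messages for context.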
lspci.
This command displays information about all PCI buses in the
system and all devices connected to them. Please refer to the manpage on lspci as well.
By default, it shows a brief list of devices. In the manpage, many options are described to request either a more verbose output or output intended for parsing by other programs. To make the output of lspci more verbose, one or more -v parameters (up to 3) can be added.
Access to some parts of the PCI configuration space is restricted to root on many operating systems, so lspci features available to normal users are limited. However, lspci tries its best to display as much as available and mark all other information with <access denied> text.
Below is the lspci output of a Debian system running in VirtualBox.
debian-601a:~$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:02.0 VGA compatible controller: InnoTek Systemberatung GmbH VirtualBox Graphics Adapter
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:04.0 System peripheral: InnoTek Systemberatung GmbH VirtualBox Guest Service
00:05.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:06.0 USB Controller: Apple Computer Inc. KeyLargo/Intrepid USB
00:07.0 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:0d.0 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 02)
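On a machine with many devices it helps to narrow the list down to one device class. The following sketch filters lspci output on the class description, which is the text between the slot number and the first colon. The helper name pci_by_class is hypothetical, and the parsing assumes the default brief output format shown in this section.

```shell
#!/bin/sh
# pci_by_class (hypothetical helper): from lspci output on standard input,
# print only the devices whose class description contains the given text.
pci_by_class() {
    # Each line has the form "<slot> <class>: <vendor and device>"
    awk -v want="$1" '
        {
            line = $0
            sub(/^[^ ]+ /, "", line)      # drop the slot, e.g. "00:03.0 "
            sub(/: .*/, "", line)         # keep only the class description
            if (index(tolower(line), tolower(want)) > 0) print $0
        }'
}

# Typical use (requires lspci from the pciutils package):
#   lspci | pci_by_class ethernet
```

Matching on the class rather than on the whole line avoids false hits on vendor or device names that happen to contain the search text.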
lsdev.
lsdev is a front end to the /proc filesystem and
prints out information about interrupts, I/O ports and DMA settings.
It gives an overview of which hardware uses which I/O addresses,
IRQs and DMA channels and can aid in detecting conflicts.
FILES
/proc/interrupts
IRQ channels.
/proc/ioports
I/O memory addresses.
/proc/dma
DMA channels.
Below is the output of an lsdev command on a Debian system:
debian-601a:/var/log$ lsdev
Device            DMA   IRQ  I/O Ports
------------------------------------------------
0000:00:01.1                 0170-0177 01f0-01f7 0376-0376 03f6-03f6 d000-d00f
0000:00:03.0                 d010-d017
0000:00:04.0                 d020-d03f
0000:00:05.0                 d100-d1ff d200-d23f
0000:00:0d.0                 d240-d247 d250-d257 d260-d26f
82801AA-ICH               5
ACPI                         4000-4003 4004-4005 4008-400b 4020-4021
ahci                         d240-d247 d250-d257 d260-d26f
ata_piix              14 15  0170-0177 01f0-01f7 0376-0376 03f6-03f6 d000-d00f
cascade             4     2
dma                          0080-008f
dma1                         0000-001f
dma2                         00c0-00df
e1000                        d010-d017
eth1                     10
fpu                          00f0-00ff
i8042                  1 12
Intel                        d100-d1ff d200-d23f
keyboard                     0060-0060 0064-0064
ohci_hcd:usb1            11
PCI                          0cf8-0cff
pic1                         0020-0021
pic2                         00a0-00a1
rtc0                      8  0070-0071
rtc_cmos                     0070-0071
timer                     0
timer0                       0040-0043
timer1                       0050-0053
vboxguest                 9
vga+                         03c0-03df
The lshw command is not mentioned in the LPI objectives.
lshw - list hardware
SYNOPSIS
lshw [ -version ]
lshw [ -help ]
lshw [ -X ]
lshw [ -html | -short | -xml | -businfo ] [ -class class ... ] [ -disable test
... ] [ -enable test ... ] [ -sanitize ] [ -numeric ] [ -quiet ]
lshw is a small tool to extract detailed information on the hardware configuration of the machine. It can report exact memory configuration, firmware version, mainboard configuration, CPU version and speed, cache configuration, bus speed, etc. on DMI-capable x86 or IA-64 systems and on some PowerPC machines (PowerMac G4 is known to work).
lsusb - list USB devices
SYNOPSIS lsusb [ options ]
lsusb is a utility for displaying information about USB buses in the system and the devices connected to them.
lsusb -v gives lots of verbose information.
Below is the output of the lsusb command on a laptop running Ubuntu:
ubuntu:/var/log$ lsusb
Bus 007 Device 003: ID 03f0:0324 Hewlett-Packard SK-2885 keyboard
Bus 007 Device 002: ID 045e:0040 Microsoft Corp. Wheel Mouse Optical
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 002: ID 147e:2016 Upek Biometric Touchchip/Touchstrip Fingerprint Sensor
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 04f2:b018 Chicony Electronics Co., Ltd 2M UVC Webcam
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
/sbin/lsmod
lsmod - program to show the status of modules in the Linux Kernel
lsmod is a trivial program which nicely formats the contents of /proc/modules, showing which kernel modules are currently loaded. For each module it displays the name, size, use count and a list of referring modules.
lsmod is frequently used to check whether the proper modules have been loaded.
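Because lsmod only reformats /proc/modules, a script can check for a module by reading that file directly. The sketch below assumes that the module name is the first whitespace-separated field on each line, as it is in /proc/modules. The helper name module_loaded and its optional second argument are hypothetical.

```shell
#!/bin/sh
# module_loaded (hypothetical helper): succeed if the named module appears in
# a /proc/modules style listing. The second argument is the file to read and
# defaults to /proc/modules.
module_loaded() {
    # The kernel lists modules with underscores, so treat "-" as "_"
    name=$(printf '%s' "$1" | tr '-' '_')
    # The module name is the first whitespace-separated field on each line
    awk -v name="$name" '$1 == name { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Typical use:
#   if module_loaded e1000; then echo "e1000 is loaded"; fi
```

Comparing the whole first field, rather than using grep on the line, avoids matching modules whose names merely contain the search text.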
/sbin/modprobe
modprobe - program to add and remove modules from the Linux Kernel
SYNOPSIS
modprobe [ -v ] [ -V ] [ -C config-file ] [ -n ] [ -i ] [ -q ] [ -b ] [
modulename ] [ module parameters... ]
modprobe [ -r ] [ -v ] [ -n ] [ -i ] [ modulename... ]
modprobe [ -l ] [ -t dirname ] [ wildcard ]
modprobe [ -c ]
modprobe [ --dump-modversions ] [ filename ]
modprobe intelligently adds or removes a module from the linux kernel: note that for convenience, there is no difference between _ and - in module names (automatic underscore conversion is performed). modprobe looks in the module directory /lib/modules/`uname -r` for all the modules and other files, except for the optional /etc/modprobe.conf configuration file and /etc/modprobe.d directory (see modprobe.conf(5)). modprobe will also use module options specified on the kernel command line in the form of <module>.<option>.
modprobe is a high level interface to insmod. Often, modules depend on each other and/or need to be loaded in a certain order. modprobe makes this easier for system administrators. It uses a dependency file, which is created by depmod, to load modules in the right order from certain specified locations.
The normal use of depmod is to include it somewhere in the rc-files in /etc/rc.d, so that the correct module dependencies will be available immediately after booting the system. On older (2.4 kernel) systems the configuration file /etc/modules.conf steers the behavior of depmod and modprobe. modprobe will unload all modules in a dependency chain if one of them fails to load. See also the section called “Kernel Components (201.1)”.
modprobe -l lists all available modules.
lsmod displays all loaded modules.
Load a module:
debian:/lib/modules/2.6.35-22-generic/kernel/lib$ sudo modprobe cpu-notifier-error-inject
and display it with lsmod:
debian:/lib/modules/2.6.35-22-generic/kernel/lib$ lsmod|grep cpu
cpu_notifier_error_inject 1861 0
Remove the module and check:
debian:/lib/modules/2.6.35-22-generic/kernel/lib$ sudo modprobe -r cpu-notifier-error-inject
debian:/lib/modules/2.6.35-22-generic/kernel/lib$ lsmod|grep cpu
debian:/lib/modules/2.6.35-22-generic/kernel/lib$
/sbin/insmod
insmod - Simple program to insert a module into the Linux Kernel
insmod is a trivial program to insert a module into the kernel: if the filename is a hyphen, the module is taken from standard input. Most users will want to use modprobe(8) instead, which is more clever and can handle module dependencies.
insmod installs a loadable module in the running kernel. It tries to do this by resolving all symbols from the kernel's exported symbol table. You can specify the (object) file name. If the file name is given without extension, insmod will search for the module in common default directories. These default locations can be overridden by the contents of an environment variable (MODPATH) or in the configuration file
/etc/modules.conf.
Only the most general of error messages are reported: as the work of trying to link the module is now done inside the kernel, the dmesg command usually gives more information about errors.
/bin/uname
uname - print system information
uname displays machine type, network hostname, OS release, OS name, OS version and processor type of the host.
-a, --all
print all information, in the following order, except omit -p and -i if
unknown:
-s, --kernel-name
print the kernel name
-n, --nodename
print the network node hostname
-r, --kernel-release
print the kernel release
-v, --kernel-version
print the kernel version
-m, --machine
print the machine hardware name
-p, --processor
print the processor type or "unknown"
-i, --hardware-platform
print the hardware platform or "unknown"
-o, --operating-system
print the operating system
debian-601a:~$ uname -a
Linux debian-601a 2.6.32-5-amd64 #1 SMP Mon Mar 7 21:35:22 UTC 2011 x86_64 GNU/Linux
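When troubleshooting, the kernel release printed by uname -r is useful for checking that modules for the running kernel are actually installed, since modprobe looks for them under /lib/modules followed by that release string. The helper name module_dir below is hypothetical, and the messages are only examples.

```shell
#!/bin/sh
# module_dir (hypothetical helper): print the directory in which modprobe
# looks for modules of the running kernel.
module_dir() {
    echo "/lib/modules/$(uname -r)"
}

dir=$(module_dir)
if [ -d "$dir" ]; then
    echo "modules for the running kernel are in $dir"
else
    echo "no module directory $dir: installed modules do not match the running kernel"
fi
```

A missing directory typically shows up after booting a kernel for which no modules were installed, in which case module loading fails for every module.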
strace - trace system calls and signals
SYNOPSIS
strace [ -CdffhiqrtttTvxx ] [ -acolumn ] [ -eexpr ] ... [ -ofile ] [ -ppid ]
... [ -sstrsize ] [ -uusername ] [ -Evar=val ] ... [ -Evar ] ... [ command [
arg ... ] ]
strace -c [ -eexpr ] ... [ -Ooverhead ] [ -Ssortby ] [ command [ arg ... ] ]
In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option.
strace is a useful diagnostic, instructional, and debugging tool. System administrators, diagnosticians and trouble-shooters will find it invaluable for solving problems with programs for which the source is not readily available since they do not need to be recompiled in order to trace them. Students, hackers and the overly-curious will find that a great deal can be learned about a system and its system calls by tracing even ordinary programs. And programmers will find that since system calls and signals are events that happen at the user/kernel interface, a close examination of this boundary is very useful for bug isolation, sanity checking and attempting to capture race conditions.
By default strace reports the name of the system call, its arguments and the return value on standard error. It is very useful in cases where you do not have access to the source code and also serves as a tool to be used to gain better understanding of the inner workings of certain programs. The program to be traced need not be recompiled for this.
Below is the output of strace cat /dev/null:
debian-601a:~$ strace cat /dev/null
execve("/bin/cat", ["cat", "/dev/null"], [/* 34 vars */]) = 0
brk(0) = 0xc25000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fb4000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=59695, ...}) = 0
mmap(NULL, 59695, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fcac8fa5000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\355\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1432968, ...}) = 0
mmap(NULL, 3541032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fcac8a38000
mprotect(0x7fcac8b90000, 2093056, PROT_NONE) = 0
mmap(0x7fcac8d8f000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x157000) = 0x7fcac8d8f000
mmap(0x7fcac8d94000, 18472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fcac8d94000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcac8fa2000
arch_prctl(ARCH_SET_FS, 0x7fcac8fa3700) = 0
mprotect(0x7fcac8d8f000, 16384, PROT_READ) = 0
mprotect(0x7fcac8fb6000, 4096, PROT_READ) = 0
munmap(0x7fcac8fa5000, 59695) = 0
brk(0) = 0xc25000
brk(0xc46000) = 0xc46000
open("/usr/lib/locale/locale-archive", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1527584, ...}) = 0
mmap(NULL, 1527584, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fcac8e2d000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
open("/dev/null", O_RDONLY) = 3
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
read(3, "", 32768) = 0
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
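In a long trace the interesting lines are often the system calls that failed. The sketch below filters a trace saved with the -o option for calls that returned -1. The helper name failed_syscalls and the file name trace.log are hypothetical. Note that many failures are harmless, such as the ENOENT results for /etc/ld.so.nohwcap in the trace above.

```shell
#!/bin/sh
# failed_syscalls (hypothetical helper): from a saved strace log, print the
# system calls that returned -1 together with their error code.
failed_syscalls() {
    grep -E '= -1 E[A-Z0-9]+ ' "$1"
}

# Typical use (requires strace):
#   strace -o trace.log cat /dev/null
#   failed_syscalls trace.log
```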
strings - print the strings of printable characters in files.
SYNOPSIS
strings [-afovV] [-min-len]
[-n min-len] [--bytes=min-len]
[-t radix] [--radix=radix]
[-e encoding] [--encoding=encoding]
[-] [--all] [--print-file-name]
[-T bfdname] [--target=bfdname]
[--help] [--version] file...
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.
strings is mainly useful for determining the contents of non-text files, such as executables. It is often used to check for the names of environment variables and configuration files used by an executable.
ltrace - A library call tracer
SYNOPSIS
ltrace [-CfhiLrStttV] [-a column] [-A maxelts] [-D level] [-e expr] [-l file-
name] [-n nr] [-o filename] [-p pid] ... [-s strsize] [-u username] [-X extern]
[-x extern] ... [--align=column] [--debug=level] [--demangle] [--help]
[--indent=nr] [--library=filename] [--output=filename] [--version] [command [arg
...]]
ltrace is a program that simply runs the specified command until it exits. It is similar to strace, but instead of system calls it intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process. It can also intercept and print the system calls executed by the program. The program to be traced need not be recompiled for this, so you can use it on binaries for which you don't have the source handy.
Example with ltrace:
debian-601a:~$ ltrace cat /dev/null
__libc_start_main(0x401ad0, 2, 0x7fff5f357f38, 0x409110, 0x409100 <unfinished ...>
getpagesize() = 4096
strrchr("cat", '/') = NULL
setlocale(6, "") = "en_US.utf8"
bindtextdomain("coreutils", "/usr/share/locale") = "/usr/share/locale"
textdomain("coreutils") = "coreutils"
__cxa_atexit(0x4043d0, 0, 0, 0x736c6974756572, 0x7f4069c04ea8) = 0
getenv("POSIXLY_CORRECT") = NULL
__fxstat(1, 1, 0x7fff5f357d80) = 0
open("/dev/null", 0, 02) = 3
__fxstat(1, 3, 0x7fff5f357d80) = 0
malloc(36863) = 0x014c4030
read(3, "", 32768) = 0
free(0x014c4030) = <void>
close(3) = 0
exit(0 <unfinished ...>
__fpending(0x7f4069c03780, 0, 0x7f4069c04330, 0x7f4069c04330, 0x7f4069c04e40) = 0
fclose(0x7f4069c03780) = 0
__fpending(0x7f4069c03860, 0, 0x7f4069c04df0, 0, 0x7f4069e13700) = 0
fclose(0x7f4069c03860) = 0
+++ exited (status 0) +++
lsof - list open files
lsof revision 4.81 lists on its standard output file information about files opened by processes for the following UNIX dialects:
AIX 5.3
FreeBSD 4.9 for x86-based systems
FreeBSD 7.0 and 8.0 for AMD64-based systems
Linux 2.1.72 and above for x86-based systems
Solaris 9 and 10
lsof has many options to show open files. See man lsof for the various options and have a look at the examples at the bottom of the man page.
By default, lsof lists all open files belonging to all active processes. Since Unix also uses the file metaphor for devices, an open file can be a regular file, a directory, a block special file, a character special file, an executing text reference, a library, a stream or a network file (Internet socket, NFS file or UNIX domain socket). The utility can be used to see which processes use which resources.
The fuser command is not part of this LPI objective.
fuser.
fuser accepts a filename and displays the PIDs of processes using the
specified files or filesystems. It comes in handy if you want to know
which process uses a certain file. For example, if you are not able
to unmount a filesystem, this is often caused by a process that still
uses a file on that filesystem. fuser can be used
to find the PID(s) of the process(es). A subsequent
ps -p $PID will name the process.
Debugging boot problems (or any other problems for that matter) can be complex at times. Your best teacher is experience. However, by carefully studying the boot messages and having a proper understanding of the mechanisms at hand, you should be able to solve most, if not all, common Linux problems.
You need a good working knowledge of the boot process to be able to solve boot problems. In previous sections the boot process was described in detail as was system initialization. By now you should be able to determine in what stage the boot process is from the messages displayed on the screen. You should be able to utilize kernel boot messages to diagnose kernel errors. If not, please re-read the previous sections carefully. We will provide a number of suggestions on how to solve common problems in the next sections. However, it is beyond the scope of this book to cover all possible permutations. If your problem is not covered here, search the web for peers with the same problem, consult your colleagues, read more documentation, investigate and experiment.
In an ideal world you always have plenty of time to find out the exact nature of the problem and subsequently solve it. In the real world you will have to be aware of cost effectiveness. Look for the most (cost-) effective way to solve the problem. For example: let's say that initial investigation indicates that the boot disk has hardware problems. You do not know yet the exact nature of the problems, but have eliminated the most common causes. The disk still does not work reliably. Hence, you need more time to investigate further. If your customer has a recent backup and, after consultation, agrees to go back to that state, you might consider suggesting installation of a brand new disk and restoring the most recent backup instead. The total costs of your time so far and the new hardware are probably less than the costs you would incur by investigating any further - even more so if you consider the probability that you need to replace the disk after all.
This book is not an omniscient encyclopedia that describes solutions for all problems you may encounter. If you get stuck, there are many ways to get help. First, you could call or mail the distributor of your Linux version. Often you are granted installation support or 30 day generic support. It may be worth your money to subscribe to a support network - most distributions offer such a service at nominal fees.
Some distributions grant on-line (Internet) access to support databases. In these databases you can find a wealth of information on hardware and software problems. You can often search for keywords or error messages and most of them are cross-referenced to enable you to find related topics.
You can also grep or zgrep for (parts of)
an error message or keyword in the documentation on your system, typically in and
under /usr/doc/,
/usr/share/doc/ or
/usr/src/linux/Documentation.
Additionally, you could try to enter the error message in an Internet search
engine, for example
http://www.google.com. This often returns URLs to
FAQs, HOWTOs and other documents that may contain clues on how to solve your problem.
Usenet news archives are another resource to use.
Often, Linux refuses to boot due to hardware problems. If a system boots and seems to be working fine, it still may have a hardware problem. If it is not working with Linux, but is working fine with other operating systems, for example DOS or Windows, this too often signifies hardware problems. Linux assumes the hardware to be up to specifications and will try to use it to its (specified) limits.
In general: regularly check the system log files for write and read errors. They indicate that the hardware is slowly becoming less reliable. Other indications of lurking hardware problems are: problems when accessing the CDROM (halt, long delays, bus errors, segmentation faults), kernel generation or compilation of other programs aborts with signal 11 or signal 7, scrambled or incorrect file contents, memory access errors, graphics that are not displayed correctly, CRC errors when accessing the floppy disk drive, crashes or halts during boot and errors when creating a filesystem.
To discover lurking hardware errors, you can use a simple, yet effective test: create a small script that compiles the kernel in an endless loop, for example:
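#!/bin/bash
#
# adapted from http://www.bitwizard.nl/sig11/
#
# stop right away if the kernel source tree is not there
cd /usr/src/linux || exit 1
#
c=0
while true
do
    # start every pass from a clean tree
    make clean > /dev/null 2>&1
    # keep the build output of this pass, discard error output
    make -k bzImage > log.${c} 2> /dev/null
    c=$((c + 1))
done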
Every iteration of this loop should create a log file, and all log files should have exactly the same content. You could use sum or md5sum to check this. If you detect differences between the log files, this is often an indication that a hardware problem exists.
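The comparison of the log files can be automated. The sketch below counts how many different checksums occur among the log.* files in a directory, so a result of 1 means that all logs are identical. The helper name distinct_logs is hypothetical, and the log.* naming follows the loop shown above.

```shell
#!/bin/sh
# distinct_logs (hypothetical helper): count how many different contents
# occur among the log.* files in the given directory.
distinct_logs() {
    # md5sum prints "<checksum>  <file>"; keep the checksum, count unique values
    md5sum "$1"/log.* | awk '{ print $1 }' | sort -u | wc -l
}

# Typical use, after stopping the compile loop:
#   if [ "$(distinct_logs /usr/src/linux)" -ne 1 ]; then
#       echo "log files differ: possible hardware problem"
#   fi
```

Stop the compile loop before checking, because the log file of the pass that is still running is incomplete and will differ from the others.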
If your system does not boot (anymore) you want to retrieve basic system functionality: your system should boot and the kernel should load. After that, you often are able to resolve the other problems using the rich set of debugging tools Linux offers.
There are a number of common boot problems. One group relates to the MBR and bootstrap files. Data corruption or accidental deletion of files or the boot partition will prevent your drive from booting. Another group clearly relates to hardware failures on the boot-drive.
Hardware problems. If a hardware component failure causes the system to refuse to boot you typically get one of the following clues: the BIOS reports errors like “No Fixed Disk Found” or “Disk Controller Error”. Sometimes, numerical messages are displayed, often in the 1700 range, e.g. “1701”, “1791” etc. If the controller is in error, a message that indicates this is often shown. In the most obvious case, the disk does not even start spinning.
Start by ensuring that your BIOS was set up correctly. The disk geometry needs to be set correctly to enable your BIOS to see the drive. Often, a Plug And Play option can be set in the BIOS. Set the option so that PNP is deactivated. If an additional option “Reset Configuration Data” or “Update ESCD” is offered, please set it to “yes” or alternatively to “enabled”.
If you have problems with a drive you just added to the system, the problems are often caused by incorrect cabling or jumper selections. Make sure all connectors are properly seated. Ensure that IDE master and slave drives are jumpered and cabled properly. UDMA drives (Ultra ATA/66 and faster) require special 80-conductor cables. Your BIOS needs to be able to access the disk during boot, so make sure your drive geometry and/or type has been correctly specified in your BIOS.
You should verify the seating of the connectors, check the cables between disk and motherboard. Check for corrosion. Also check the power cables. You can remove the cables and measure them, using a simple ohm meter or buzzer, and/or replace them with new ones. A “hard disk controller” error message can often be caused by bad cabling too. Always check your cabling first, even when the symptoms seem to indicate a controller error. If the error persists try to replace the controller card.
Software problems. If a software failure causes the system to refuse to boot, you typically get one of the following clues: LILO does not display all of its four letters, LILO hangs, you get a screen with scrolling error codes, e.g. “010101..”, the system reports something like “drive not bootable, insert system disk”, or the system boots a different operating system than intended.
If you are able to boot from a CD/DVD, insert the distribution installation CD and select the RESCUE part. Most Linux distributions have a rescue part on the installation CD with which the system can be repaired or investigated.
The part below about booting from a floppy has been kept, although some of the information is a bit outdated. The information is still useful.
If you are able to boot a floppy you can use a boot floppy or rescue disk to boot
a kernel. You could use a rescue floppy that
boots a kernel that has the root filesystem set to your hard disk (using the
rdev command). Or you could use a special tiny Linux distribution,
like tomsrtbt
(http://www.toms.net/rb/
) and mount the root filesystem by hand. Make sure the rescue disk contains
the necessary functionality/modules to fit your hardware, e.g. if you have SCSI disks,
be sure the drivers are either compiled into your boot kernel or available as modules.
If the boot floppy has booted you should start by checking the validity of your
MBR. When the LILO
bootup messages indicate a geometry error or a related error, you should verify that
/sbin/lilo ran correctly. If you are in doubt, you can rerun
it, provided you have access to (a backup copy of) the lilo.conf
file. Also check if the error is caused by the 1024 cylinder boundary problem: your
kernel(s) and related files should be accessible for your BIOS, and some BIOSes
(as a rule of thumb: made before 1998) are not able to access disk cylinders above
1024.
Next, verify that the partition table in the MBR is correct by running fdisk. Verify that at least one partition is marked as “boot” or “active” - some BIOSes require the Linux boot partition to be marked “bootable”, and some distributions may require this too. If the partition table is incorrect, you will need to repair it. If you have access to a backup of your boot system's MBR (including the partition table, hence: the first 512 bytes of your hard disk) and have put it on a floppy, now is the time to recover it. Alternatively, you may have printed the partition list; if so, you can use fdisk to restore the partition table.
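A backup of the first 512 bytes of a disk can be made with dd while the system is still healthy. The sketch below is a minimal illustration: the helper name backup_mbr, the device /dev/sda and the output path are assumptions, so substitute the names that apply to your system. Take great care with the restore command, because writing to the wrong device destroys its partition table.

```shell
#!/bin/sh
# backup_mbr (hypothetical helper): copy the first 512 bytes of a disk,
# which hold the boot code and the partition table, to a file.
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2> /dev/null
}

# Typical use as root; /dev/sda and the output path are example names:
#   backup_mbr /dev/sda /mnt/floppy/mbr.backup
# To restore, swap source and target:
#   dd if=/mnt/floppy/mbr.backup of=/dev/sda bs=512 count=1
```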
If the partition table looks correct when you print it in fdisk, the next step is to take a look at the root filesystem of your drive. You can use the fsck command to repair filesystem problems on your root filesystem. If all else fails, you have to resort to reformatting and/or repartitioning your disk and restoring the latest backup, or you may even need to replace the disk.
If the initial boot sequence could be completed, the kernel is loaded and tries to mount its root filesystem. This can either be a RAMDISK (initrd) or a partition on a hard disk. Of course, the kernel needs to be able to access (all of) its memory and the filesystem on the disk, which may require certain device drivers in the kernel. Once the root filesystem has been mounted, additional programs can be executed and additional kernel modules may be loaded to add functionality. If the root filesystem is a RAM disk, it may contain programs to load the modules needed to address other hardware devices such as disks. In these phases you could experience module loading problems, which can result in (partial) hardware inaccessibility.
In many cases it is not clear whether the cause of a problem lies with the hardware or the software. However, most of these errors result from an invalid configuration.
Hardware problems. If a hardware problem causes the system to refuse to boot, you typically get one or more of the following clues: the message “PANIC VFS unable to mount root fs on ##:##” appears, modprobe reports errors while loading a module, IRQ/DMA conflicts are reported, parts of your hardware do not work, or intermittent errors occur. As a rule of thumb, if your kernel boots but your system subsequently hangs or issues error messages, you should check for software and configuration problems first. If this does not resolve the problem, check your hardware (the section called “Generic issues with hardware problems”).
Software problems. If a software problem causes the system to refuse to boot, you typically get some of the same clues listed under “Hardware problems”: the message “PANIC VFS unable to mount root fs on ##:##”, errors reported by modprobe while loading a module, and reported IRQ/DMA conflicts.
The “PANIC VFS unable to mount...” message occurs when the kernel was
loaded, but either the RAM disk or the physical partition could not be mounted.
This can be the result of forgetting to run /sbin/lilo
after a kernel change or update, or forgetting to run rdev
if you put the kernel image directly into the boot sectors of a partition.
The PANIC message is sometimes caused by inaccessibility of (parts of) your
system's memory, for example when the kernel tries to mount the RAM disk as its root
filesystem. Older kernels may require you to set the amount of RAM by rebooting
and setting the boot parameter:
mem=<size-of-memory-in-Kbytes>
For the kernel to recognize memory above 64 MB, it may be necessary to append the
“mem=” option to the kernel command line permanently. If you are using LILO
as your boot loader, you would do this in the lilo.conf
file. For example, on a machine with 128 MB you would add:
append="mem=131072K"
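The value in this example is simply the memory size in megabytes multiplied by 1024. A small Python sketch of that arithmetic follows; the helper name lilo_mem_append is made up for illustration and is not part of LILO.

```python
def lilo_mem_append(megabytes):
    """Build a lilo.conf append line for the given RAM size in megabytes."""
    if megabytes <= 0:
        raise ValueError("memory size must be positive")
    kbytes = megabytes * 1024          # 1 MB = 1024 KB
    return 'append="mem={}K"'.format(kbytes)


print(lilo_mem_append(128))            # prints: append="mem=131072K"
```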
To help you determine which device the kernel tried to mount,
the “PANIC” message contains the major and minor number of the device, e.g.
9:0 for /dev/md0 (the first
virtual RAID disk, the section called “Software RAID”) or
8:3 for /dev/sda3, the third partition on the first SCSI disk
(/dev/sda itself is 8:0). This can pinpoint the area where you should
start your search for configuration errors. For example: if this message points
to the /dev/md0 device, you should check whether the kernel was
compiled with support for software RAID, and likewise for SCSI disks.
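The mapping from device numbers to names can be sketched in a few lines of Python. This sketch assumes the standard Linux device number assignments: major 8 is used for SCSI disks, with 16 minor numbers per disk (minor 0 for the whole disk, 1 to 15 for its partitions), and major 9 for software RAID (md) devices. It also assumes the numbers are given in decimal; some kernels print them in hexadecimal. The function describe_device is a made-up helper that covers only these two majors.

```python
def describe_device(major, minor):
    """Translate a major:minor pair into a device name (majors 8 and 9 only)."""
    if major == 8:
        # 16 minors per SCSI disk: 0 is the whole disk, 1..15 its partitions.
        disk, partition = divmod(minor, 16)
        name = "/dev/sd" + chr(ord("a") + disk)
        return name + (str(partition) if partition else "")
    if major == 9:
        # Software RAID: the minor number is the array number.
        return "/dev/md{}".format(minor)
    return "unknown device {}:{}".format(major, minor)


print(describe_device(9, 0))   # /dev/md0
print(describe_device(8, 0))   # /dev/sda
print(describe_device(8, 3))   # /dev/sda3
```

On a running system, ls -l on the entries in /dev shows the same major and minor numbers, which is a quick way to check a device that this sketch does not cover.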
In most cases, your system should have booted by now. It may be that you still see numerous errors, e.g. daemons won't start, your network card or sound card does not work. In the next sections we will describe tools that aid you with troubleshooting and give hints and tips on how to resolve more problems.
IRQ conflicts are a common source of problems. You can use dmesg
to see which interrupts were requested by the drivers and compare this with the
contents of /proc/interrupts or the output of lsdev
to determine conflicts. The IRQ a PCI device uses is also reported in
/proc/pci (on older kernels) or by the lspci program (for example, lspci -v).
Sometimes a card could not obtain an IRQ since the BIOS assigned all of them to non-PCI
(ISA) cards. Check your BIOS settings if you suspect this to be the case.
Under some conditions IRQs can be shared between two devices. Devices on the PCI bus may share an IRQ with other devices on the PCI bus, provided the driver software supports this. In other cases where there is potential for conflict, there should be no problem as long as no two devices with the same IRQ are ever in use at the same time. Even if devices with conflicting IRQs are used simultaneously, one of them will likely have its interrupts caught by its device driver and may work. The other device(s) will likely behave as if they were configured with the wrong interrupt.
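To see which IRQs are currently shared, you can scan /proc/interrupts for lines that list more than one device. The Python sketch below assumes the usual layout of that file: a header line with one column per CPU, then one line per IRQ with the counters, the interrupt controller type, and a comma-separated list of device names. The function name shared_irqs and the sample data are made up for illustration, and device names containing spaces are not handled.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [device, ...]} for IRQs that list more than one device."""
    lines = interrupts_text.strip().splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())        # header: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                     # skip NMI, LOC, ERR and similar rows
        # Split off one counter per CPU; what remains is type plus devices.
        fields = rest.split(None, ncpus)
        if len(fields) <= ncpus:
            continue
        tail = fields[ncpus]
        # The first comma-separated part still carries the controller type,
        # so keep only the last word of each part.
        names = [part.split()[-1] for part in tail.split(",") if part.strip()]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


sample = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:        100          7   IO-APIC-fasteoi   uhci_hcd:usb3, eth0
NMI:          0          0   Non-maskable interrupts
"""
print(shared_irqs(sample))
```

On a live system you could pass open("/proc/interrupts").read() instead of the sample text. A shared IRQ in this list is not necessarily a conflict: it only shows which drivers registered a handler on the same line, which is where you would start looking if one of those devices misbehaves.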