Improving the Reliability of Commodity Operating Systems
nook (nk) n.
Despite decades of
research in extensible operating
system technology, extensions such as device drivers remain a
significant cause of system failures. In Windows XP, for example,
drivers account for 85% of recently reported failures.
Nooks is a reliability subsystem that seeks to greatly enhance
OS reliability by isolating the OS from driver failures. The Nooks
approach is practical: rather than guaranteeing complete fault
tolerance through a new (and incompatible) OS or driver architecture,
our goal is to prevent the vast majority of driver-caused
crashes with little or no change to existing driver and system
code. To achieve this, Nooks isolates drivers within lightweight
protection domains inside the kernel address space, where hardware and
software prevent them from corrupting the kernel. Nooks also tracks a
driver's use of kernel resources to hasten automatic clean-up during
recovery.
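The resource-tracking idea can be illustrated with a small sketch. This is not the Nooks implementation (which lives inside the Linux kernel); it is a toy model, and the names `ResourceTracker`, `acquire`, `release`, and `cleanup` are hypothetical. It shows only the bookkeeping: every resource a driver obtains is recorded together with the function that frees it, so that after a failure everything the driver still holds can be released in one pass.

```python
class ResourceTracker:
    """Toy model of per-driver resource tracking (hypothetical, not Nooks code)."""

    def __init__(self):
        # driver name -> list of (resource, function that releases it)
        self._held = {}

    def acquire(self, driver, resource, release_fn):
        """Record that `driver` now holds `resource`."""
        self._held.setdefault(driver, []).append((resource, release_fn))
        return resource

    def release(self, driver, resource):
        """Normal path: the driver frees one resource itself."""
        entries = self._held.get(driver, [])
        for i, (held, release_fn) in enumerate(entries):
            if held == resource:
                release_fn(held)
                del entries[i]
                return True
        return False

    def cleanup(self, driver):
        """Recovery path: free everything the failed driver still holds.

        Resources are released in reverse order of acquisition.
        Returns the number of resources reclaimed.
        """
        entries = self._held.pop(driver, [])
        for held, release_fn in reversed(entries):
            release_fn(held)
        return len(entries)


if __name__ == "__main__":
    freed = []
    tracker = ResourceTracker()
    tracker.acquire("net", "buffer0", freed.append)
    tracker.acquire("net", "irq5", freed.append)
    tracker.cleanup("net")  # simulate recovery after the "net" driver fails
    print(freed)
```

Because the tracker already knows what the failed driver owns, recovery does not need to search kernel data structures for leaked objects, which is the sense in which tracking hastens clean-up.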
More recently, we
have extended Nooks with shadow drivers. A shadow driver is
a kernel agent that (1) conceals a driver failure from its clients,
including the operating system and applications, and (2) transparently
restores the driver to a functioning state. In this way,
applications and the operating system are unaware that the driver
failed and continue to execute correctly.
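The two roles of a shadow driver can be sketched as a proxy that sits between clients and the real driver. This is a minimal illustration under stated assumptions, not the actual kernel mechanism: `ToyDriver`, `ShadowDriver`, `DriverFailure`, and `make_factory` are hypothetical names, and the sketch assumes that recovery consists of starting a fresh driver instance, replaying the configuration calls recorded earlier, and retrying the failed request.

```python
class DriverFailure(Exception):
    """Raised by the toy driver to simulate a crash."""


class ToyDriver:
    """Hypothetical driver that can be told to crash after N requests."""

    def __init__(self, fail_after=None):
        self.config = {}
        self.handled = 0
        self.fail_after = fail_after

    def configure(self, key, value):
        self.config[key] = value

    def request(self, data):
        if self.fail_after is not None and self.handled >= self.fail_after:
            raise DriverFailure("simulated driver crash")
        self.handled += 1
        return f"{self.config.get('mode', 'default')}:{data}"


class ShadowDriver:
    """Proxy that hides driver failures from the caller (illustrative only)."""

    def __init__(self, make_driver):
        self._make_driver = make_driver
        self._driver = make_driver()
        self._config_log = []  # configuration calls seen so far
        self.recoveries = 0

    def configure(self, key, value):
        # Record configuration so it can be replayed after a restart.
        self._config_log.append((key, value))
        self._driver.configure(key, value)

    def request(self, data):
        try:
            return self._driver.request(data)
        except DriverFailure:
            # (1) Conceal the failure: the caller never sees the exception.
            self._recover()
            return self._driver.request(data)

    def _recover(self):
        # (2) Restore the driver: start a new instance and replay its configuration.
        self._driver = self._make_driver()
        for key, value in self._config_log:
            self._driver.configure(key, value)
        self.recoveries += 1


def make_factory():
    """Factory whose first driver crashes after two requests; later ones are healthy."""
    created = []

    def factory():
        driver = ToyDriver(fail_after=2 if not created else None)
        created.append(driver)
        return driver

    return factory


if __name__ == "__main__":
    shadow = ShadowDriver(make_factory())
    shadow.configure("mode", "fast")
    print([shadow.request(i) for i in range(4)])
    print("recoveries:", shadow.recoveries)
```

From the caller's point of view all four requests succeed with the same configuration, even though the underlying driver was replaced partway through.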
Faculty: Hank Levy, Brian Bershad
Former Students: Mike Swift, Muthu Annamalai, Brian Milnes, Leo Shum
Undergraduate Students: Micah Brodsky, Eric Kochhar, Jordan Hom, Doug Buxton, Steve Martin
Exchange Students: Christophe Augier, Damien Martin-Guillerez
How to profile the Linux kernel: use kernprof, which provides call graphs but may cause double faults when used with Nooks, or oprofile, which does interrupt-based sampling.
How to measure microarchitectural events on the Pentium 4 with Linux: use Brink-Abyss, but be sure to set the duration knob long enough for your experiment to complete.
What architectural state must be writable for a task gate to execute in Ring 0 on the Pentium: the GDT and the TSS of the current task must both be mapped writable in the current page table, because they are updated before the switch to the new task.
This tool was created by porting the fault injector used in the Rio file cache project and is described in the paper The Systematic Improvement of Fault Tolerance in the Rio File Cache.
Nooks source
Last modified 9/30/2004