Unit OS B: Windows

Unit OS B: Comparing the Linux
and Windows Kernels
Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating
System Internals Curriculum Development Kit,
developed by David A. Solomon and Mark E.
Russinovich with Andreas Polze
Microsoft has licensed these materials from David
Solomon Expert Seminars, Inc. for distribution to
academic organizations solely for use in academic
environments (and not for commercial use)
2
Roadmap for Section B
A Brief History of Windows and Linux
Comparing the Windows and Linux kernel
architectures
Linux: becoming more like Windows
Benchmarks and other lies
What does the future hold?
3
Scope
We’re going to look at the technology of the
kernels
We’re not going to look at:
Cost
Support
Applications
Management
Use as a desktop system
4
The History of Linux
The real history of Linux starts in 1969, when Ken Thompson developed the first version of UNIX at Bell Labs
After Dennis Ritchie, designer of the C programming language, joined the project, it debuted to the research community in an academic paper in 1974
Bell Labs released the first commercial version in 1976 as UNIX Version 6 (V6)
UNIX spread throughout universities and in 1978 Bell Labs released UNIX Time-Sharing System, a version with portability in mind
5
Linux History Continued
Because Bell Labs distributed UNIX with source code, the early 1980s saw three major branches grow on the UNIX tree:
UNIX System III from Bell Labs' UNIX Support Group (USG)
UNIX Berkeley Software Distribution (BSD) from the University of California at Berkeley
Microsoft’s XENIX
The UNIX market fragmented further in the 1980s, despite the IEEE’s POSIX standard and the X/Open Group’s Portability Guide
6
Linus and Linux
In 1991 Linus Torvalds took a college computer science course that used the Minix operating system
Minix is a “toy” UNIX-like OS written by Andrew Tanenbaum as a learning workbench
Linus wanted to make Minix more usable, but Tanenbaum wanted to keep it ultra-simple
Linus went in his own direction and began working on Linux
In October 1991 he announced Linux v0.02
In March 1994 he released Linux v1.0
7
The History of Windows (NT)
The history of Windows really begins in the mid-1970s,
when Dick Hustvedt, Peter Lipman and David Cutler
designed the VMS operating system for Digital’s 32-bit
VAX processor
Digital shipped VMS v1.0 in 1978
Cutler moved to Seattle to open DECWest and worked on
the Digital Mica OS for a new CPU codenamed Prism
12 engineers went with him and the facility grew to 200
In 1988 Digital cancelled the project
8
The History of Windows Continued
Bill Gates wanted a UNIX rival
He hired Cutler and 20 Digital engineers in 1989
The new project was called NT OS/2 because it focused on OS/2 backward compatibility
With the success of Windows 3.0’s 1990 release, Gates refocused the project on Windows compatibility
The project was renamed Windows NT
Microsoft released Windows NT 3.1 in August 1993
9
Windows and Linux
Both Linux and Windows are based on
foundations developed in the mid-1970s
[Figure: two timelines, one for Linux and one for Windows, each running from 1970 to 2000]
10
Comparing the Architectures
Both Linux and Windows are monolithic
All core operating system services run in a shared address space
in kernel-mode
All core operating system services are part of a single module
Linux: vmlinuz
Windows: ntoskrnl.exe
Windowing is handled differently:
Windows has a kernel-mode windowing subsystem
Linux has a user-mode windowing system (the X Window System)
11
Kernel Architectures
[Figure: side-by-side block diagrams of the two kernels.
Windows: applications in user mode; Win32 windowing, system services (process management, memory management, I/O management, etc.), device drivers and hardware-dependent code in kernel mode.
Linux: applications and X-Windows in user mode; system services (process management, memory management, I/O management, etc.), device drivers and hardware-dependent code in kernel mode.]
12
Linux Kernel
Linux is a monolithic but modular system
All kernel subsystems form a single piece of code with no
protection between them
Modularity is supported in two ways:
Compile-time options
Most kernel components can be built as a dynamically
loadable kernel module (DLKM)
DLKMs
Built separately from the main kernel
Loaded into the kernel at runtime and on demand (infrequently
used components take up kernel memory only when needed)
Kernel modules can be upgraded incrementally
Support for minimal kernels that automatically adapt to the
machine and load only those kernel components that are used
13
Windows Kernel
Windows is a monolithic but modular system
No protection among pieces of kernel code and drivers
Support for modularity is somewhat weak:
Windows drivers allow for dynamic extension of kernel functionality
Windows XP Embedded has special tools / packaging rules that allow coarse-grained configuration of the OS
Windows drivers are dynamically loadable kernel modules
Significant amount of code runs as drivers (including network stacks such as TCP/IP and many services)
Built independently from the kernel
Can be loaded on demand
Dependencies among drivers can be specified
14
Comparing Portability
Both Linux and Windows kernels are portable
Mainly written in C
Have been ported to a range of processor architectures
Windows
i486, MIPS, PowerPC, Alpha, IA-64, x86-64
Only x86-64 and IA-64 currently supported
> 64MB memory required
Linux
Alpha, ARM, ARM26, CRIS, H8300, i386, IA-64, M68000,
MIPS, PA-RISC, PowerPC, S/390, SuperH, SPARC, VAX,
v850, x86-64
DLKMs allow for minimal kernels for microcontrollers
> 4MB memory required
15
Comparing Layering, APIs, Complexity
Windows
Kernel exports about 250 system calls (accessed via ntdll.dll)
Layered Windows/POSIX subsystems
Rich Windows API (17 500 functions on top of native APIs)
Linux
Kernel supports about 200 different system calls
Layered BSD, Unix Sys V, POSIX shared system libraries
Compact APIs (1742 functions in Single Unix Specification
Version 3; not including X Window APIs)
16
Comparing Architectures
Processes and scheduling
SMP support
Memory management
I/O
File Caching
Security
17
Process Management
Windows
Process
Address space, handle table, statistics and at least one thread
No inherent parent/child relationship
Threads
Basic scheduling unit
Fibers - cooperative user-mode threads
Linux
Process is called a Task
Basic address space, handle table, statistics
Parent/child relationship
Basic scheduling unit
Threads
No threads per se
Tasks can act like Windows threads by sharing handle table, PID and address space
PThreads – cooperative user-mode threads
18
Scheduling Priorities
Windows
Two scheduling classes
“Real time” (fixed) - priority 16-31
Dynamic - priority 1-15
Higher priorities are favored
Priorities of dynamic threads get boosted on wakeups
Thread priorities are never lowered below their base priority
Linux
Has 3 scheduling classes:
Normal – priority 100-139
Fixed Round Robin – priority 0-99
Fixed FIFO – priority 0-99
Lower priorities are favored
Priorities of normal threads go up (decay) as they use CPU
Priorities of interactive threads go down (boost)
20
Scheduling Priorities (cont)
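[Figure: priority scales for the two systems.
Windows: 0 to 31, with Dynamic (boosted for I/O) at 0-15 and Fixed at 16-31.
Linux: 0 to 140, with Fixed FIFO and Fixed Round-Robin at 0-99 and Normal at 100-139, I/O-bound threads toward 100 and CPU-bound threads toward 140.]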
21
Linux Scheduling Details
Most threads use a dynamic priority policy
Normal class - similar to the classic UNIX scheduler
A newly created thread starts with a base priority
Threads that block frequently (I/O bound) will have their priority gradually increased
Threads that always exhaust their time slice (CPU bound) will have their priority gradually decreased
“Nice value” sets a thread’s base priority
Larger values = lower priority, smaller values = higher priority
Valid nice values are in the range of -20 to +19
Nonprivileged users can only raise a thread’s nice value (that is, lower its priority)
Dynamic priority policy threads have static priority zero
Execute only when there are no runnable real-time threads
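The link between nice values and the Normal class priority range (100-139) can be sketched as below. This is a minimal illustration, assuming the Linux 2.6 convention of 40 nice levels (-20 to +19) laid directly over priorities 100-139, so that nice 0 lands on 120; the function and constant names are hypothetical.

```python
NICE_MIN, NICE_MAX = -20, 19   # 40 nice levels
NORMAL_PRIO_BASE = 100         # priorities 0-99 belong to the fixed (real-time) classes


def nice_to_priority(nice: int) -> int:
    """Map a nice value onto the Normal class priority range (lower number = favored)."""
    if not NICE_MIN <= nice <= NICE_MAX:
        raise ValueError(f"nice value {nice} is outside {NICE_MIN}..{NICE_MAX}")
    return NORMAL_PRIO_BASE + (nice - NICE_MIN)
```

A larger nice value therefore yields a numerically larger, less favored priority.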
22
Real-Time Scheduling on Linux
Linux supports two static priority scheduling policies:
Round-robin and FIFO (first in, first out)
Selected with the sched_setscheduler() system call
Use static priority values in the range of 1 to 99
Executed strictly in order of decreasing static priority
FIFO policy lets a thread run to completion
Thread needs to indicate completion by calling sched_yield()
Round-robin lets threads run for up to one time slice
Then switches to the next thread with the same static priority
RT threads can easily starve lower-priority threads
Root privileges or the CAP_SYS_NICE capability are required to select a real-time scheduling policy
Long-running system calls can cause priority inversion
Same as in Windows; but compare RTLinux
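The difference between the two policies can be illustrated with a small user-space simulation. This is only a sketch under simplifying assumptions: all threads are runnable from the start, no thread blocks, and work is measured in whole time slices. The `simulate` function and the thread tuples are hypothetical and do not call the kernel.

```python
from collections import deque


def simulate(threads):
    """threads: list of (name, policy, static_priority, work_in_slices).

    Returns the name of the thread that runs in each time slice.
    """
    queues = {}  # static priority -> queue of runnable threads
    for name, policy, prio, work in threads:
        queues.setdefault(prio, deque()).append([name, policy, work])

    trace = []
    while queues:
        top = max(queues)            # strictly decreasing static priority
        queue = queues[top]
        thread = queue.popleft()
        if thread[1] == "FIFO":
            # FIFO: keeps the CPU until it has finished its work
            trace.extend([thread[0]] * thread[2])
            thread[2] = 0
        else:
            # Round-robin: runs one slice, then goes to the back of its queue
            trace.append(thread[0])
            thread[2] -= 1
        if thread[2] > 0:
            queue.append(thread)
        if not queue:
            del queues[top]
    return trace
```

For example, `simulate([("a", "RR", 50, 2), ("b", "RR", 50, 2), ("low", "RR", 10, 1)])` alternates between `a` and `b`, and `low` only runs once both have finished, which is the starvation risk noted above.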
23
Windows Scheduling Details
Most threads run at variable priority levels
Priorities 1-15
A newly created thread starts with a base priority
Threads that complete I/O operations experience priority boosts (but never higher than 15)
A thread’s priority will never be below base priority
The Windows API function SetThreadPriority() sets the priority value for a specified thread
This value, together with the priority class of the thread's process, determines the thread's base priority level
Windows will dynamically adjust priorities for non-realtime threads
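The two bounds on a dynamic thread's priority (never above 15, never below base) can be sketched as follows. The boost amount and the decay of one level per expired quantum are assumptions made for illustration; the function name is hypothetical and this is not the Windows scheduler's actual code.

```python
DYNAMIC_MAX = 15  # top of the dynamic priority range


def run_quanta(base: int, boost_amount: int, quanta: int) -> list:
    """Return the thread's priority right after a boost and after each expired quantum."""
    current = min(base + boost_amount, DYNAMIC_MAX)  # boost, capped at 15
    trace = [current]
    for _ in range(quanta):
        # Assumed decay: one level per quantum, but never below the base priority
        current = max(current - 1, base)
        trace.append(current)
    return trace
```

With a base priority of 8 and a boost of 2, the priority steps from 10 back down to 8 and then stays there.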
24
Real-Time Scheduling on Windows
Windows supports a static round-robin scheduling policy for threads with priorities in the real-time range (16-31)
Threads run for up to one quantum
Quantum is reset to a full turn on preemption
Priorities never get boosted
RT threads can starve important system services
Such as CSRSS.EXE
SeIncreaseBasePriorityPrivilege is required to elevate a thread’s priority into the real-time range (this privilege is assigned to members of the Administrators group)
System calls and DPC/APC handling can cause priority inversion
25
Scheduling Timeslices
Windows
The thread timeslice (quantum) is 10ms-120ms
When quanta can vary, has one of 2 values
Reentrant and preemptible
[Figure: Windows quantum values. Fixed: 120ms; Background: 20ms; Foreground: 60ms]
Linux
The thread quantum is 10ms-200ms
Default is 100ms
Varies across entire range based on priority, which is based on interactivity level
Reentrant and preemptible
[Figure: Linux quantum range from 10ms to 200ms, with the default at 100ms]
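One way to picture how the Linux quantum "varies across the entire range based on priority" is a linear interpolation between the 10ms and 200ms endpoints over the Normal priority range 100-139. The linear shape is an assumption for illustration, not the kernel's exact formula, and the names are hypothetical.

```python
MIN_TIMESLICE_MS = 10    # least favored Normal priority (139)
MAX_TIMESLICE_MS = 200   # most favored Normal priority (100)
PRIO_FIRST, PRIO_LAST = 100, 139


def timeslice_ms(priority: int) -> int:
    """Assumed linear mapping from a Normal class priority to a timeslice in ms."""
    if not PRIO_FIRST <= priority <= PRIO_LAST:
        raise ValueError("priority is outside the Normal class range 100-139")
    span = MAX_TIMESLICE_MS - MIN_TIMESLICE_MS
    return MIN_TIMESLICE_MS + span * (PRIO_LAST - priority) // (PRIO_LAST - PRIO_FIRST)
```

Under this assumption a thread in the middle of the range (priority 120) gets roughly the 100ms default.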
26
Multiprocessor Support
Windows
Supports symmetric multiprocessing (SMP)
Up to 32 processors on 32-bit Windows
Up to 64 processors on 64-bit Windows
All CPUs can take interrupts
Supports Non-Uniform Memory Access systems
Scheduler favors the node a thread prefers to run on
Memory manager tries to allocate memory on the node a thread prefers to run on
Supports Hyperthreading
Scheduler favors idle physical processors when it has a choice
Doesn’t count logical CPUs against licensing limits
Linux
Supports SMP
No upper CPU limit: set as kernel build constant
All CPUs can take interrupts
Supports Non-Uniform Memory Access systems
Scheduler favors the node a thread last ran on
Memory manager tries to allocate memory on the node a thread is running on
Supports Hyperthreading
Scheduler favors idle physical processors when it has a choice
28
Virtual Memory Management
Windows
32-bit versions split user-mode/kernel-mode from 2GB/2GB to 3GB/1GB
Demand-paged virtual memory
32 or 64-bits
Copy-on-write
Shared memory
Memory mapped files
Linux
Splits user-mode/kernel-mode from 1GB/3GB to 3GB/1GB
2.6 has “4/4 split” option where kernel has its own address space
Demand-paged virtual memory
32-bits and/or 64-bits
Copy-on-write
Shared memory
Memory mapped files
[Figure: 32-bit address space layouts from 0 to 4GB. One shows User below 2GB with System above it; the other shows User below 3GB with System above it.]
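The user/kernel split of a 32-bit address space amounts to a single boundary comparison. A minimal sketch, assuming a flat 4GB address space and a boundary expressed in whole gigabytes (the helper name is hypothetical):

```python
GB = 1 << 30


def is_user_address(address: int, user_gb: int = 2) -> bool:
    """True if a 32-bit virtual address lies in the user-mode part of the split."""
    if not 0 <= address < 4 * GB:
        raise ValueError("not a 32-bit address")
    return address < user_gb * GB
```

With the default 2GB/2GB split, `0x80000000` is a kernel address; with a 3GB/1GB split (`user_gb=3`) the same address belongs to user mode.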
29
Physical Memory Management
Windows
Per-process working sets
Working set tuner adjusts sets according to memory needs using the “clock” algorithm
No “swapper”
Linux
Global working set management uses “clock” algorithm
No “swapper” (the working set trimmer code is called the swap daemon, however)
[Figure: LRU page lists for a process and for another process, each marking the reused page]
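The “clock” algorithm named on this slide is a classic approximation of LRU. The textbook version can be sketched as below; this is a generic illustration over a fixed set of frames, not the working set code of either kernel, and the class name is hypothetical.

```python
class ClockReplacer:
    """Textbook clock (second-chance) page replacement over a fixed number of frames."""

    def __init__(self, frames: int):
        self.pages = [None] * frames        # page held by each frame
        self.referenced = [False] * frames  # one reference bit per frame
        self.hand = 0                       # the clock hand

    def access(self, page):
        """Touch a page; return the page evicted to make room, or None."""
        if page in self.pages:
            # Hit: just set the reference bit
            self.referenced[self.pages.index(page)] = True
            return None
        # Fault: sweep the hand, giving referenced pages a second chance
        while self.pages[self.hand] is not None and self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        self.pages[self.hand] = page
        self.referenced[self.hand] = True   # this variant marks new pages as referenced
        self.hand = (self.hand + 1) % len(self.pages)
        return victim
```

A per-process policy would keep one such structure per process, while a global policy would keep a single one for all pages in the system.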
30
I/O Management
Windows
Centered around the file object
Layered driver architecture throughout driver types
Most I/O supports asynchronous operation
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an Interrupt Service Routine (ISR) and a Deferred Procedure Call (DPC)
Supports plug-and-play
Linux
Centered around the vnode
No layered I/O model
Most I/O is synchronous
Only sockets and direct disk I/O support asynchronous I/O
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an ISR and soft IRQ or tasklet
Supports plug-and-play
[Figure: IRQL scale showing masked interrupt levels]
31
File Caching
Linux
Single global common cache
Virtual file cache
Caching is at file vs. disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
Windows
Single global common cache
Virtual file cache
Caching is at file vs. disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
[Figure: on both systems the File Cache sits above the File System Driver, which sits above the Disk Driver]
32
Security
Windows
Very flexible security model based on Access Control Lists
Users are defined with
Privileges
Member groups
Security can be applied to any Object Manager object
Files, processes, synchronization objects, …
Supports auditing
Linux
Two models:
Standard UNIX model
Access Control Lists (SELinux)
Users are defined with:
Capabilities (privileges)
Member groups
Security is implemented on an object-by-object basis
Has no built-in auditing support
Version 2.6 includes Linux Security Module framework for add-on security models
33
Monitoring - Linux procfs
Linux supports a number of special filesystems
Like special files, they are of a more dynamic nature and tend to have side effects when accessed
Prime example is procfs (mounted at /proc)
Provides access to and control over various aspects of Linux (e.g., scheduling and memory management)
/proc/meminfo contains detailed statistics on the current memory usage of Linux
Content changes as memory usage changes over time
Services for Unix implements procfs on Windows
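Because /proc/meminfo is plain text in a "Name: value unit" layout, monitoring tools can read it like any other file. A minimal parsing sketch follows; the `SAMPLE` text is made-up illustrative data rather than real output, and the function name is hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse 'Name:   value [unit]' lines into a {name: integer value} dictionary."""
    stats = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        fields = rest.split()
        if fields and fields[0].isdigit():
            stats[key.strip()] = int(fields[0])  # the unit, if present, is dropped
    return stats


# Made-up sample in the /proc/meminfo layout (illustrative values only)
SAMPLE = """MemTotal:       16384000 kB
MemFree:         2048000 kB
HugePages_Total:       0
"""
```

On a Linux machine the same function can be fed the live file, for example `parse_meminfo(open("/proc/meminfo").read())`, and the values will differ from one read to the next as memory usage changes.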
34
Windows’ Evolution Towards Linux
Services for Unix 3.5 - really targeted at POSIX, not Linux
POSIX threads, full POSIX subsystem (Interix)
X Window clients+server (X-Win32 LX)
NFS, NIS, PAM
proc-file system for Windows
POSIX compatibility in Windows actually predates Linux and was one of the original design goals
Configurability / Module Management
Windows XP Embedded
Target Designer/Component Designer/Component Management Database
Editions targeting new Application Domains
Windows Compute Cluster Server 2003
35
Linux’s Evolution Towards Windows
I/O processing
Kernel reentrancy
Kernel preemptibility
Per-processor memory allocation
O(1) scheduler and per-CPU ready queues
Zero-Copy SendFile
Wake-One socket semantics
Asynchronous I/O
Light-weight synchronization
36
I/O Processing
Linux 2.2 had the notion of bottom halves (BH) for low-priority interrupt processing
Fixed number of BHs
Only one BH of a given type could be active on an SMP system
Linux 2.4 introduced tasklets, which are non-preemptible procedures called with interrupts enabled
Tasklets are the equivalent of Windows Deferred Procedure Calls (DPCs)
37
Kernel Reentrancy
Mark Russinovich’s April 1999 Windows NT Magazine article, “Linux and the Enterprise”, pointed out that much of the Linux 2.2 kernel was not reentrant
[Figure: timelines for cpu 1 and cpu 2 running non-reentrant versus reentrant code, showing the time saved by reentrancy]
Ingo Molnar stated in rebuttal:
“his example is a clear red herring.”
A month later he made all major paths reentrant
38
Kernel Preemptibility
A preemptible kernel is more responsive to high-priority
tasks
Through the base release of v2.4 Linux was only
cooperatively preemptible
There are well-defined safe places where a thread running in the
kernel can be preempted
The kernel is preemptible in v2.4 patches and v2.6
Windows NT has always been preemptible
39
Per-CPU Memory Allocation
Keeping accesses to memory localized to a CPU minimizes CPU cache thrashing
Cache thrashing hurts performance on enterprise SMP workloads
Linux 2.4 introduced per-CPU kernel memory buffers
Windows introduced per-CPU buffers in an NT 4 Service Pack in 1997
[Figure: CPUs 0 and 1, each with its own buffer cache (Buffer Cache 0, Buffer Cache 1)]
40
Scheduling
The Linux 2.4 scheduler is O(n)
If there are 10 active tasks, it scans 10 of them in a list in order to
decide which should execute next
This means long scans and long durations under the scheduler lock
[Figure: ready list holding tasks with priorities 103, 112, 112 and 101; the scheduler scans the list to find the highest-priority task]
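The cost of an O(n) pick can be seen in a few lines. This is a deliberately simplified sketch: it compares a single priority number per task (lower value favored, as in the Linux priority ranges earlier in this unit), whereas the real 2.4 scheduler evaluated more than that. Task names are hypothetical.

```python
def pick_next_linear(ready_list):
    """Scan every (name, priority) task; return the favored one and the scan count."""
    best = None
    scanned = 0
    for task in ready_list:
        scanned += 1
        if best is None or task[1] < best[1]:  # lower number = higher priority
            best = task
    return best, scanned
```

For the ready list `[("t1", 103), ("t2", 112), ("t3", 112), ("t4", 101)]`, all four entries are visited before `t4` can be chosen, and the scan grows with the number of runnable tasks.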
41
Scheduling
Linux 2.6 has a revamped O(1) scheduler from Ingo Molnar that:
Calculates a task’s priority at the time it makes a scheduling decision
Has per-CPU ready queues where the tasks are pre-sorted by priority
[Figure: priority-sorted queues holding tasks 101, 103, 112 and 112; the scheduler takes the first task from the highest-priority non-empty queue]
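A constant-time pick relies on one queue per priority level plus a bitmap of non-empty queues, so finding the next task never depends on how many tasks are runnable. The sketch below shows that idea for a single CPU, assuming 140 priority levels with lower numbers favored; the class and method names are hypothetical and this is not kernel code.

```python
from collections import deque


class PriorityArray:
    """One FIFO queue per priority level plus a bitmap marking non-empty levels."""

    LEVELS = 140

    def __init__(self):
        self.queues = [deque() for _ in range(self.LEVELS)]
        self.bitmap = 0  # bit p is set when queues[p] is non-empty

    def enqueue(self, name: str, priority: int) -> None:
        self.queues[priority].append(name)
        self.bitmap |= 1 << priority

    def pick_next(self):
        """Return (name, priority) of the favored task, or None if nothing is runnable."""
        if not self.bitmap:
            return None
        # Isolate the lowest set bit: the most favored non-empty priority level
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[priority]
        name = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << priority)
        return name, priority
```

The work per pick is one bit operation and one dequeue, regardless of how many tasks are queued.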
42
Scheduling
Windows NT has always had an O(1) scheduler based
on pre-sorted thread priority queues
Server 2003 introduced per-CPU ready queues
Linux load balances queues
Windows does not
Not seen as an issue in performance testing by Microsoft
Applications where it might be an issue are expected to use affinity
43
Zero-Copy Sendfile
Linux 2.2 introduced Sendfile to efficiently send file data over a socket
Mark Russinovich pointed out that the initial implementation incurred a copy operation, even if the file data was cached
Linux 2.4 introduced zero-copy Sendfile
Windows NT pioneered zero-copy file sending with TransmitFile, the Sendfile equivalent, in Windows NT 4
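[Figure: 1-copy path, where file data is copied from the file data buffer into a network buffer before the network adapter driver sends it, versus 0-copy path, where the network driver sends directly from the file data buffer]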
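From an application's point of view, Sendfile replaces a read-then-write loop with a single call that hands a file to a socket. The sketch below uses Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform provides it and otherwise falls back to ordinary `send()` calls, so whether a given run is truly zero-copy depends on the platform. The helper names and the local socket pair are assumptions made to keep the demo self-contained.

```python
import os
import socket
import tempfile


def send_file(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket; return the bytes sent."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(size: int = 4096):
    """Send a temporary file across a local socket pair and check what arrives."""
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp()
    sender, receiver = socket.socketpair()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        sent = send_file(sender, path)
        sender.shutdown(socket.SHUT_WR)  # signal end of data to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks) == payload
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
```

A file server would call `send_file` on an accepted client connection instead of a socket pair.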
44
Wake-one Socket Semantics
Linux 2.2 kernel had the thundering herd or
overscheduling problem
In a network server application there are typically several
threads waiting for a new connection
In v2.2 when a new connection came in all the waiters would
race to get it
Ingo Molnar’s response:
5/2/99: “here he again forgets to _prove_ that overscheduling
happens in Linux.”
5/7/99: “as of 2.3.1 my wake-one implementation and
waitqueues rewrite went in”
In Linux 2.4 only one thread wakes up to claim the new
connection
Windows NT has always had wake-1 semantics
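The difference between waking every waiter and waking one can be reproduced in user space with a condition variable. This is an analogy using Python threads, not kernel wait queues; the function name, the one-second timeout used to release the threads that are never woken, and the counters are all assumptions of the sketch.

```python
import threading
import time


def count_wakeups(n_waiters: int, wake_one: bool, timeout: float = 1.0):
    """Deliver one 'connection' to n waiting threads; return (threads woken, claims)."""
    cond = threading.Condition()
    state = {"waiting": 0, "woken": 0, "claimed": 0, "pending": 0}

    def worker():
        with cond:
            state["waiting"] += 1
            if not cond.wait(timeout):   # False means it timed out without a wakeup
                return
            state["woken"] += 1
            if state["pending"] > 0:     # only the first woken thread gets the connection
                state["pending"] -= 1
                state["claimed"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_waiters)]
    for t in threads:
        t.start()
    while True:
        with cond:
            if state["waiting"] == n_waiters:   # every worker is blocked in wait()
                state["pending"] = 1
                if wake_one:
                    cond.notify(1)              # wake-one semantics
                else:
                    cond.notify_all()           # thundering herd
                break
        time.sleep(0.001)
    for t in threads:
        t.join()
    return state["woken"], state["claimed"]
```

With four waiters, waking everyone produces four wakeups for a single claim, while waking one produces exactly one wakeup for the same claim.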
45
Asynchronous I/O
Linux 2.2 only supported asynchronous I/O on socket connect operations and ttys
Linux 2.6 adds asynchronous I/O for direct-disk access
AIO model includes efficient management of asynchronous I/O
Also added alternate epoll model
Useful for database servers managing their database on a dedicated raw partition
Database servers that manage a file-based database suffer from synchronous I/O
Windows I/O is inherently asynchronous
Windows has had completion ports since NT 3.5
More advanced form of AIO
46
Light-Weight Synchronization
Linux 2.6 introduces Futexes
There’s only a transition to kernel-mode when there’s
contention
Windows has always had CriticalSections
Same behavior
Futexes go further:
Allow for prioritization of waits
Work interprocess as well
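The property shared by futexes and CriticalSections, entering the kernel only under contention, can be shown with a bookkeeping model. The class below is not a usable lock: it is a single-threaded model, with hypothetical names, that merely counts when a real implementation would have to make a system call.

```python
class FastPathLock:
    """Single-threaded model of a lock that enters the kernel only on contention."""

    def __init__(self):
        self.owner = None
        self.waiters = []
        self.kernel_transitions = 0

    def acquire(self, thread: str) -> bool:
        if self.owner is None:
            self.owner = thread          # uncontended: handled entirely in user mode
            return True
        self.kernel_transitions += 1     # contended: would call into the kernel to sleep
        self.waiters.append(thread)
        return False

    def release(self) -> None:
        if self.waiters:
            self.kernel_transitions += 1  # would call into the kernel to wake a waiter
            self.owner = self.waiters.pop(0)
        else:
            self.owner = None            # nobody waiting: still user mode only
```

An acquire/release pair with no competing thread leaves `kernel_transitions` at zero; the counter only rises once a second thread asks for a lock that is already held.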
47
A Look at the Future
The kernel architectures are fundamentally similar
There are differences in the details
Linux implementation is adopting more of the good ideas used in
Windows
For the next 2-4 years Windows has and will maintain an
edge
Linux is still behind on the cutting edge of performance tricks
Large performance team and lab at Microsoft has direct ties into
the kernel developers
As time goes on the technological gap will narrow
Open Source Development Labs (OSDL) will feed performance
test results to the kernel team
IBM and other vendors have Linux technology centers
Squeezing performance out of the OS gets much harder as the OS
gets more tuned
48
Linux Technology Unknowns
Linux kernel forking
Red Hat has already done it: Red Hat Enterprise Server v3.0 is Linux 2.4 with some Linux 2.6 features
Backward compatibility philosophy
Linus Torvalds makes decisions on kernel APIs and architecture based on technical reasons, not business reasons
49
Further Reading
Transaction Processing Performance Council: www.tpc.org
SPEC: www.spec.org
NT vs Linux benchmarks: www.kegel.com/nt-linuxbenchmarks.html
The C10K problem: http://www.kegel.com/c10k.html
Linus Torvalds’s home: http://www.osdl.org/
Linux Kernel Archives: http://www.kernel.org/
Linux history: http://www.firstmonday.dk/issues/issue5_11/moon/
Veritest Netbench result: http://www.veritest.com/clients/reports/microsoft/ms_netbench.pdf
Mark Russinovich’s 1999 article, “Linux and the Enterprise”: http://www.winntmag.com/Articles/Index.cfm?ArticleID=5048
The Open Group's Single UNIX Specification: http://www.unix.org/version3/
50