Understanding Linux Kernel - Booting, Syscalls, Interrupts & Context Switching

By –
Jayant Upadhyay 2003CS50214
Pankaj K. Sharma 2003CS50219
Sohit Bansal 2003CS50224
Akshay Gaur 2003CS50209
Overview of Booting
The process can be divided into the following six logical stages:
1. BIOS selects the boot device
2. BIOS loads the boot sector from the boot device
3. Boot-sector loads setup, decompression routines and
compressed kernel image
4. Kernel is uncompressed in protected mode
5. Low level initialization is performed by the asm code
6. High-level C initialization
BIOS POST
• POST – Power On Self Test
• Power supply starts the clock generator and asserts
#POWERGOOD signal on the bus
• CPU #RESET line is asserted
• POST checks are performed with interrupts disabled
• IVT initialized at address zero
• BIOS bootstrap function is invoked via INT 0x19. This loads
track 0, sector 1 at physical address 0x7C00(0x07C0:0000)
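The two notations for the load address above, 0x7C00 and 0x07C0:0000, name the same byte because real mode forms a physical address as segment * 16 + offset. A minimal sketch of that arithmetic (the function name real_mode_phys is made up for illustration):

```c
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, real_mode_phys(0x07C0, 0x0000) and real_mode_phys(0x0000, 0x7C00) both give 0x7C00, and real_mode_phys(0x9000, 0x0200) gives 0x90200.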
Boot-sector & Setup
The boot sector used to boot the Linux kernel can be either:
• Linux boot-sector(arch/i386/boot/bootsect.S)
• LILO (or other bootloader’s) boot-sector
Linux Boot-sector
• bootsect.S
– First moves the boot sector's code from 0x7C00 to
0x90000
– Then jumps to the newly made copy of the boot sector, i.e.
in segment 0x9000
– Prepares the stack at $INITSEG:0x4000-0xC
– This is where the limit on setup size comes from
– Setup sectors are loaded immediately after the
bootsector, i.e. at physical address 0x90200, using BIOS
service INT 0x13
– If loading fails for some reason, the error code is
dumped and the load is retried in an endless loop
– If loading setup_sects sectors of setup code succeeded
we jump to label ok_load_setup
– Kernel image is then loaded at 0x10000. This is done to
preserve the firmware data in low memory ( 0-64K )
– After the kernel is loaded we jump to
$SETUPSEG:0(arch/i386/boot/setup.S)
• setup.S
– Once the data is no longer needed (e.g. no more calls to
BIOS) it is overwritten by moving the entire
(compressed) kernel image from 0x10000 to 0x1000.
– Sets things up for protected mode and jumps to 0x1000,
which is the head of the compressed kernel, i.e.
arch/i386/boot/compressed/{head.S,misc.c}
– This sets up stack and calls decompress_kernel() which
uncompresses the kernel to address 0x100000 and
jumps to it.
How to load a big kernel?
• The setup sectors are loaded as usual at 0x90200, but the kernel is
loaded 64K chunk at a time using a special helper routine that calls
BIOS to move data from low to high memory.
• This helper routine is referred to by bootsect_kludge in bootsect.S
and is defined as bootsect_helper in setup.S. The bootsect_kludge
label in setup.S contains the value of setup segment and the offset
of bootsect_helper code in it so that bootsector can use the lcall
instruction to jump to it (inter-segment jump).
• This routine uses BIOS service int 0x15 (ax=0x8700) to move to high
memory and resets %es to always point to 0x10000. This ensures
that the code in bootsect.S doesn't run out of low memory when
copying data from disk.
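The chunked load described above can be modelled in user space. This is only a sketch under stated assumptions: read_chunk stands in for the BIOS disk read into the low-memory buffer, memcpy stands in for the int 0x15 block move, and all names (load_big_image, mem_source, read_from_memory) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE (64u * 1024u)   /* 64K, the chunk size named on the slide */

/* Stand-in for "read the next chunk from disk into the low buffer". */
typedef size_t (*read_chunk_fn)(void *low_buf, size_t max_len, void *ctx);

/* Hypothetical in-memory "disk" used to drive the sketch. */
struct mem_source {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

size_t read_from_memory(void *low_buf, size_t max_len, void *ctx)
{
    struct mem_source *s = ctx;
    size_t left = s->len - s->pos;
    size_t n = left < max_len ? left : max_len;

    memcpy(low_buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

/*
 * Fill the same low buffer again and again, moving each chunk to the
 * next free position in "high memory". Returns the bytes placed in high;
 * stops early if the next chunk would not fit.
 */
size_t load_big_image(read_chunk_fn read_chunk, void *ctx,
                      unsigned char *low_buf,
                      unsigned char *high, size_t high_cap)
{
    size_t total = 0;

    for (;;) {
        size_t got = read_chunk(low_buf, CHUNK_SIZE, ctx);
        if (got == 0 || total + got > high_cap)
            break;
        memcpy(high + total, low_buf, got);   /* models the int 0x15 move */
        total += got;
    }
    return total;
}
```

Because the low buffer is reused for every chunk, the image size is bounded by the high-memory capacity rather than by the low-memory window.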
Using LILO as bootloader
• There are several advantages in using a specialised
bootloader (LILO) over a bare bones Linux bootsector:
– Ability to choose between multiple Linux kernels or even multiple
OSes.
– Ability to pass kernel command line parameters
– Ability to load much larger bzImage kernels - up to 2.5M vs 1M.
• Old versions of LILO (v17 and earlier) could not load bzImage
kernels. The newer versions (as of a couple of years ago or
earlier) use the same technique as bootsect+setup of
moving data from low into high memory by means of BIOS
services.
High Level Initialization
• By "high-level initialisation" we consider anything which is not
directly related to bootstrap, even though parts of the code to
perform this are written in asm, namely arch/i386/kernel/head.S
which is the head of the uncompressed kernel. The following steps
are performed:
– Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS =
0x18).
– Initialise page tables.
– Enable paging by setting PG bit in %cr0.
– Zero-clean BSS (on SMP, only first CPU does this).
– Copy the first 2k of bootup parameters (kernel commandline).
– Check CPU type using EFLAGS and, if possible, cpuid; this can detect 386
and higher.
– The first CPU calls start_kernel(), all others call
arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just
reloads esp/eip and doesn't return.
• The init/main.c:start_kernel() is written in C and does the
following:
– Perform arch-specific setup (memory layout analysis, copying
boot command line again, etc.).
– Print Linux kernel "banner" containing the version, compiler
used to build it etc. to the kernel ring buffer for messages.
This is taken from the variable linux_banner defined in
init/version.c and is the same string as displayed by
cat /proc/version.
– Initialise traps, irqs, data required for scheduler.
– Parse boot commandline options and initialise console.
– If module support was compiled into the kernel, initialise
dynamic module loading facility.
– If "profile=" command line was supplied, initialise profiling
buffers.
– kmem_cache_init(), initialise most of slab allocator.
– Enable interrupts.
– Calculate BogoMips value for this CPU.
– Call mem_init() which calculates max_mapnr, totalram_pages
and high_memory and prints out the "Memory: ..." line.
– kmem_cache_sizes_init(), finish slab allocator initialisation.
– Initialise data structures used by procfs.
– fork_init(), create uid_cache, initialise max_threads based on the
amount of memory available and configure RLIMIT_NPROC for init_task
to be max_threads/2.
– Create various slab caches needed for VFS, VM, buffer cache, etc.
– If System V IPC support is compiled in, initialise the IPC subsystem. Note
that for System V shm, this includes mounting an internal (in-kernel)
instance of shmfs filesystem.
– If quota support is compiled into the kernel, create and initialise a special
slab cache for it.
– Perform arch-specific "check for bugs" and, whenever possible, activate
workaround for processor/bus/etc bugs. Comparing various architectures
reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good
example is the "f00f bug", which is only checked if the kernel is compiled
for less than 686 and worked around accordingly.
– Set a flag to indicate that a schedule should be invoked at "next
opportunity" and create a kernel thread init() which execs
execute_command if supplied via "init=" boot parameter, or tries to exec
/sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail,
panic with "suggestion" to use "init=" parameter.
– Go into the idle loop, this is an idle thread with pid=0.
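The fallback order used by the init() kernel thread can be sketched as a plain loop. This is a user-space model, not kernel code: try_exec is a hypothetical callback standing in for an exec attempt (returning 0 on success), and exec_only_bin_sh and exec_nothing are made-up stubs for exercising it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an exec attempt: returns 0 on success. */
typedef int (*try_exec_fn)(const char *path);

/*
 * Returns the path that would be run as init, or NULL when every
 * candidate fails (the point at which the kernel panics).
 */
const char *pick_init(const char *execute_command, try_exec_fn try_exec)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
    };
    size_t i;

    /* "init=" on the command line is tried first, if supplied. */
    if (execute_command && try_exec(execute_command) == 0)
        return execute_command;

    for (i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (try_exec(fallbacks[i]) == 0)
            return fallbacks[i];

    return NULL;
}

/* Stub: pretend only /bin/sh exists. */
int exec_only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

/* Stub: pretend nothing can be executed. */
int exec_nothing(const char *path)
{
    (void)path;
    return -1;
}
```

With exec_only_bin_sh, pick_init(NULL, ...) falls through the first three candidates and settles on /bin/sh; with exec_nothing it returns NULL.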
Interrupts and Exceptions
• Hardware support for getting the CPU's attention
– Often transfers from user to kernel mode
• Nested interrupts are possible; interrupt can occur
while an interrupt handler is already executing (in
kernel mode)
– Asynchronous: device or timer generated
• Unrelated to currently executing process
– Synchronous: immediate result of last instruction
• Often represents a hardware error condition
• Intel terminology and hardware
– Irqs, vectors, IDT, gates, PIC, APIC
• Interrupt handling: data structures, flow of control
• Handlers: softirqs, tasklets, bottom halves
Basic Ideas
• Similar to context switch (but lighter weight)
– Hardware saves a small amount of context on stack
– Includes interrupted instruction if restart needed
– Execution resumes with special “iret” instruction
• Structure: top and bottom halves
– Top-half: do minimum work and return
– Bottom-half: deferred processing
• Handler code executed in response
– Possible to temporarily mask interrupts
– Handlers need not be reentrant
– But other interrupts can occur, causing nesting
Interrupts vs Exceptions
• Varying terminology but for Intel:
– Interrupt (asynchronous, device generated)
• Maskable: device-generated, associated with IRQs (interrupt
request lines); may be temporarily disabled (still pending)
• Nonmaskable: some critical hardware failures
– Exceptions (synchronous)
• Processor-detected
– Faults – correctable (restartable); e.g. page fault
– Traps – no reexecution needed; e.g. breakpoint
– Aborts – severe error; process usually terminated (by
signal)
• Programmed exceptions (software interrupts)
– int (system call), int3 (breakpoint)
– into (overflow), bound (address check)
Vectors, IDT
• Vector: index (0-255) into descriptor table (IDT)
– Special register: idtr points to table (use lidt to load)
• IDT: table of “gate descriptors”
– Segment selector + offset for handler
– Descriptor Privilege Level (DPL)
– Gates (slightly different ways of entering kernel)
• Task gate: includes TSS to transfer to (not used by
Linux)
• Interrupt gate: disables further interrupts
• Trap gate: further interrupts still allowed
• Vector assignments
– Exceptions, NMI are fixed
– Maskable interrupts can be assigned as needed
PIC
• Programmable Interrupt Controller (PIC)
– chip between devices and cpu
– Fixed number of wires in from devices
• IRQs: Interrupt ReQuest lines
– Single wire to CPU + some registers
• PIC translates IRQ to vector
– Raises interrupt to CPU
– Vector available in register
– Waits for ack from CPU
• Other interrupts may be pending
• Possible to “mask” interrupts at PIC or CPU
• Early systems cascaded two 8 input chips (8259A)
Interrupt Handling Components
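[Figure: devices raise IRQs 0-15 on the PIC; the PIC signals the CPU on the INTR line and supplies a vector; the CPU uses idtr to locate the IDT (entries 0-255) and jumps to the handler. Mask points exist at the PIC and at the CPU.]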
IO-APIC, LAPIC
• Advanced PIC for SMP systems
– Used in all modern systems
– Interrupts “routed” to CPU over system bus
– IPI: inter-processor interrupt
• Local APIC versus “frontend” IO-APIC
– Devices connect to front-end IO-APIC
– IO-APIC communicates (over bus) with Local APIC
• Interrupt routing
– Allows broadcast or selective routing of interrupts
– Need to distribute interrupt handling load
– Routes to lowest priority processor
• Special register: Task Priority Register (TPR)
– Arbitrates (round-robin) if equal priority
Intel Exceptions
• Architecture (processor) dependent
– Intel has about 20 (out of 32 possible)
• Most exceptions send signal to current process
– Default action often just kills process
– Page fault is the one exception; very complex handler
• Some examples:
– 0 SIGFPE Divide by zero error
– 3 SIGTRAP Breakpoint
– 6 SIGILL Invalid op-code
– 11 SIGBUS Segment not present
– 12 SIGBUS Stack segment fault
– 13 SIGSEGV General protection fault (DPL violation)
– 14 SIGSEGV Page fault
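The vector-to-signal examples above amount to a small lookup. A minimal sketch, with a made-up function name (exception_signal) and covering only the vectors listed on the slide:

```c
#include <signal.h>

/*
 * Default signal for the exception vectors listed on the slide.
 * Returns 0 for vectors the slide does not cover. A real page fault
 * handler only raises SIGSEGV when the fault cannot be resolved.
 */
int exception_signal(int vector)
{
    switch (vector) {
    case 0:  return SIGFPE;   /* divide error */
    case 3:  return SIGTRAP;  /* breakpoint */
    case 6:  return SIGILL;   /* invalid opcode */
    case 11: return SIGBUS;   /* segment not present */
    case 12: return SIGBUS;   /* stack exception */
    case 13: return SIGSEGV;  /* general protection fault */
    case 14: return SIGSEGV;  /* page fault (if not resolved) */
    default: return 0;
    }
}
```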
Hardware Handling
• On entry:
– Which vector?
– Get corresponding descriptor in IDT
– Find specified descriptor in GDT (for handler)
– Check privilege levels (CPL, DPL)
• If entering kernel mode, set kernel stack
– Save eflags, cs, (original) eip on stack
• -> Jump to appropriate handler
– Assembly code prepares C stack, calls handler
• On return (i.e. iret):
– Restore registers from stack
– If returning to user mode, restore user stack
– Clear segment registers (if privileged selectors)
Nested Execution
• Interrupts can be interrupted
– By different interrupts; handlers need not be reentrant
– No notion of priority in Linux
– Small portions execute with interrupts disabled
– Interrupts remain pending until acked by CPU
• Exceptions can be interrupted
– By interrupts (devices needing service)
• Exceptions can nest two levels deep
– Exceptions indicate coding error
– Exception code (kernel code) shouldn’t have bugs
– Page fault is possible (trying to touch user data)
IDT Initialization
• Initialized once by BIOS in real mode
– Linux re-initializes during kernel init
• Must not expose kernel to user mode
access
– start by zeroing all descriptors
• Linux lingo:
– Interrupt gate (same as Intel; no user access)
• Not accessible from user mode
– System gate (Intel trap gate; user access)
• Used for int, int3, into, bound
– Trap gate (same as Intel; no user access)
• Used for exceptions
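The three Linux gate flavours differ only in two fields of the 8-byte IA-32 gate descriptor: the type (14 for an interrupt gate, 15 for a trap gate) and the DPL (0 or 3). The sketch below packs such a descriptor in user space. The names make_gate, intr_gate, trap_gate and system_gate are illustrative, and the kernel code selector value 0x10 is an assumption taken from 2.4-era i386 kernels.

```c
#include <stdint.h>

#define KERNEL_CS 0x10   /* assumed kernel code segment selector */

enum { GATE_INTERRUPT = 14, GATE_TRAP = 15 };

struct gate_desc {
    uint32_t low;    /* selector (bits 31-16) | handler offset 15-0 */
    uint32_t high;   /* handler offset 31-16 | P | DPL | type */
};

struct gate_desc make_gate(uint32_t handler, uint16_t selector,
                           unsigned type, unsigned dpl)
{
    struct gate_desc g;

    g.low  = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
    g.high = (handler & 0xFFFF0000u)
           | 0x8000u                    /* present bit */
           | ((dpl & 3u) << 13)         /* descriptor privilege level */
           | ((type & 0xFu) << 8);      /* gate type */
    return g;
}

/* Interrupt gate: further interrupts disabled, no user access. */
struct gate_desc intr_gate(uint32_t handler)
{
    return make_gate(handler, KERNEL_CS, GATE_INTERRUPT, 0);
}

/* Trap gate: interrupts stay enabled, no user access. */
struct gate_desc trap_gate(uint32_t handler)
{
    return make_gate(handler, KERNEL_CS, GATE_TRAP, 0);
}

/* System gate: an Intel trap gate that user mode (DPL 3) may invoke. */
struct gate_desc system_gate(uint32_t handler)
{
    return make_gate(handler, KERNEL_CS, GATE_TRAP, 3);
}
```

For a handler at the made-up address 0xC0101234, system_gate yields low = 0x00101234 and high = 0xC010EF00; the only difference from trap_gate is the DPL field.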
Exception Handling
• Some exceptions push error code on stack
– IDT points to small individual handlers (assembly)
– handler_name:
pushl $0 // placeholder if no error code
pushl $do_handler_name
jmp error_code
• Common code sets up for C call
– Pops handler address from stack, calls
• All handlers check if kernel mode
– Exceptions caused by touching bad syscall params
• Return to userland with error code
– Other exceptions-> die() // kernel Oops
• Most handlers just generate signal for current
– current->tss.error_code = error_code;
– current->tss.trap_no = vector;
– force_sig(sig_number, current);
Interrupt Handling
• More complex than exceptions
– Requires registry, deferred processing, etc.
• Some issues:
– IRQs are often shared; all handlers (ISRs) are
executed so they must query device
– IRQs are dynamically allocated to reduce contention
• Example: floppy allocates when accessed
• Three types of actions:
– Critical: Top-half (interrupts disabled – briefly!)
• Example: acknowledge interrupt
– Non-critical: Top-half (interrupts enabled)
• Example: read key scan code, add to buffer
– Non-critical deferrable: Do it “later” (interrupts enabled)
IRQ, Vector Assignment
• PCI bus usually assigns IRQs at boot
• Vectors usually IRQ# + 32
– Below 32 reserved for non-maskable interrupts and
exceptions
– Vector 128 used for syscall
– Vectors 251-255 used for IPI
• Some IRQs are fixed by architecture
– IRQ0: interval timer
– IRQ2: cascade pin for 8259A
• See /proc/interrupts for assignments
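The vector layout above can be written down as a couple of helpers. This is a sketch of the numbering only; the function and enum names are made up, and the constant names are borrowed for readability rather than taken from a specific kernel header.

```c
#define FIRST_EXTERNAL_VECTOR 0x20   /* 32: first vector used for IRQs */
#define SYSCALL_VECTOR        0x80   /* 128: int $0x80 */

enum vector_class { VEC_EXCEPTION, VEC_DEVICE, VEC_SYSCALL, VEC_IPI };

/* Vectors are usually IRQ number + 32. */
int irq_to_vector(int irq)
{
    return FIRST_EXTERNAL_VECTOR + irq;
}

/* Classify a vector (0-255) according to the ranges on the slide. */
enum vector_class classify_vector(int vector)
{
    if (vector < FIRST_EXTERNAL_VECTOR)
        return VEC_EXCEPTION;          /* exceptions and NMI */
    if (vector == SYSCALL_VECTOR)
        return VEC_SYSCALL;
    if (vector >= 251 && vector <= 255)
        return VEC_IPI;
    return VEC_DEVICE;                 /* assignable maskable interrupts */
}
```

So the interval timer on IRQ0 arrives at vector 32, and vector 14 (page fault) classifies as an exception.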
IRQ Data Structures
• irq_desc: array of IRQ descriptors
– status (flags), lock, depth (for nested disables)
– handler: PIC device driver!
– action: linked list of irqaction structs (containing ISRs)
• irqaction: ISR info
– handler: actual ISR!
– flags:
• SA_INTERRUPT: interrupts disabled if set
• SA_SHIRQ: sharing allowed
• SA_SAMPLE_RANDOM: input for /dev/random
entropy pool
– name: for /proc/interrupts
– dev_id, next
• irq_stat: per-cpu counters (for /proc/interrupts)
Interrupt Processing
• BUILD_IRQ macro generates:
– IRQn_interrupt:
• pushl $n-256 // negative to distinguish syscalls
• jmp common_interrupt
• Common code:
– common_interrupt:
• SAVE_ALL // save a few more registers than
hardware
• call do_IRQ
• jmp ret_from_intr
– do_IRQ() is C code that handles all interrupts
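Why push n-256? The value lands in the stack slot that holds the syscall number on a system call entry, so a negative value marks the frame as an interrupt while the low byte still carries the IRQ number. A small sketch of that encoding (the helper names are hypothetical, and the low-byte decode is an assumption about how do_IRQ recovers n):

```c
/* Value pushed by the IRQn_interrupt stub: always negative for n in 0..255. */
long encode_irq(int n)
{
    return (long)n - 256;
}

/* Recover the IRQ number from the low byte of the saved value. */
int decode_irq(long orig_eax)
{
    return (int)(orig_eax & 0xff);
}

/* Syscall numbers are non-negative, so the sign tells the two apart. */
int is_interrupt_frame(long orig_eax)
{
    return orig_eax < 0;
}
```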
Low-level IRQ Processing
• do_IRQ():
– get vector, index into irq_desc for appropriate struct
– grab per-vector spinlock, ack (to PIC) and mask line
– set flags (IRQ_PENDING)
– really process IRQ? (may be disabled, etc.)
– call handle_IRQ_event()
– some logic for handling lost IRQs on SMP systems
• handle_IRQ_event():
– enable interrupts if needed (SA_INTERRUPT clear)
– execute all ISRs for this vector:
• action->handler(irq, action->dev_id, regs);
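The descriptor-plus-action-list structure and the "execute all ISRs" loop can be sketched in a few lines. This is a simplified user-space model: the structs keep only the fields needed here, the regs argument is dropped from the handler signature, and add_irqaction and count_isr are made-up helpers.

```c
#include <stddef.h>

typedef void (*isr_fn)(int irq, void *dev_id);

/* One registered ISR on an IRQ line (simplified irqaction). */
struct irqaction {
    isr_fn handler;
    const char *name;          /* as shown in /proc/interrupts */
    void *dev_id;              /* lets a shared handler find its device */
    struct irqaction *next;
};

/* Per-IRQ descriptor, reduced to the action list. */
struct irq_desc {
    struct irqaction *action;
};

/* Append an action to the end of the line's list (shared IRQ). */
void add_irqaction(struct irq_desc *desc, struct irqaction *act)
{
    struct irqaction **p = &desc->action;

    while (*p)
        p = &(*p)->next;
    act->next = NULL;
    *p = act;
}

/* Run every ISR registered on the line; returns how many were called. */
int handle_irq_event(int irq, const struct irq_desc *desc)
{
    const struct irqaction *a;
    int called = 0;

    for (a = desc->action; a; a = a->next) {
        a->handler(irq, a->dev_id);
        called++;
    }
    return called;
}

/* Demo ISR: counts invocations through dev_id. */
void count_isr(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
}
```

Since every ISR on a shared line is called, each real handler has to query its own device to see whether the interrupt was meant for it.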
Deferrable Functions
• Bottom-halves (deprecated):
– Old static array of function pointers that are marked
for execution (can be masked temporarily)
– Executed on kernel to user transition
– Executed serially (globally) on SMP system
• Mostly for networking code:
– Tasklets: Different tasklets can execute concurrently
– Softirqs: The same softirq can execute concurrently
• Layered implementation:
– Bottom-halves implemented using tasklets
– Tasklets implemented using softirqs
• When executed? (pretty frequently)
– When last (nested) interrupt handler terminates
– When a network packet is received
– When idle: per-cpu ksoftirqd kernel thread
• Lots of detail in book; a bit complex …
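The core of the deferral mechanism is a pending bitmask: the top half marks work as pending and returns, and the marked handlers run later. The sketch below is a single-CPU user-space model only; the function names echo kernel names but the signatures are simplified, and demo_softirq and demo_runs are made up.

```c
#include <stdint.h>

#define NR_SOFTIRQS 32

typedef void (*softirq_fn)(void);

static softirq_fn softirq_vec[NR_SOFTIRQS];
static uint32_t softirq_pending;

/* Register the handler for one softirq number. */
void open_softirq(int nr, softirq_fn fn)
{
    softirq_vec[nr] = fn;
}

/* Called from the top half: mark the work as pending and return. */
void raise_softirq(int nr)
{
    softirq_pending |= (uint32_t)1 << nr;
}

/* Called "later": run every pending handler, lowest number first. */
int do_softirq(void)
{
    uint32_t pending = softirq_pending;
    int nr, ran = 0;

    softirq_pending = 0;
    for (nr = 0; pending; nr++, pending >>= 1) {
        if ((pending & 1u) && softirq_vec[nr]) {
            softirq_vec[nr]();
            ran++;
        }
    }
    return ran;
}

/* Demo handler. */
int demo_runs;
void demo_softirq(void)
{
    demo_runs++;
}
```

Raising the same softirq twice before do_softirq() runs it still results in a single handler invocation, since only one pending bit exists per softirq.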
Return Code Path
• Interleaved assembly entry points:
– ret_from_exception()
– ret_from_intr()
– ret_from_sys_call()
– ret_from_fork()
• See flowchart in text (Fig 4-5 page 158)
• Things that happen:
– Run scheduler if necessary
– Return to user mode if no nested handlers
• Restore context, user-stack, switch mode
• Re-enable interrupts if necessary
– Deliver pending signals
– (Some DOS emulation stuff – VM86 Mode)
System Calls
• Interface between user-level processes and hardware
devices.
– CPU, memory, disks etc.
• Make programming easier:
– Let kernel take care of hardware-specific issues.
• Increase system security:
– Let kernel check requested service via syscall.
• Provide portability:
– Maintain interface but change functional implementation.
Mode, Space, Context
• Mode: hardware restricted execution state
– restricted access, privileged instructions
– user mode vs. kernel mode
• “dual-mode architecture”, “protected mode”
– Intel supports 4 protection “rings”: 0 kernel, 1 unused, 2 unused, 3 user
• Space: kernel (system) vs. user (process) address space
– requires MMU support (virtual memory)
– “userland”: any process address space; there are many user address
spaces
– reality: kernel is often mapped into user process space
• Context: kernel activity on “behalf” of ???
– process: on behalf of current process
– system: unrelated to current process (maybe no process!)
• example “interrupt context”
• blocking not allowed!
POSIX APIs
• API = Application Programmer Interface.
– Function defn specifying how to obtain service.
– By contrast, a system call is an explicit request to kernel made
via a software interrupt.
• Standard C library (libc) contains wrapper routines that make
system calls.
– e.g., malloc, free are libc routines that use the brk system call.
• POSIX-compliant = having a standard set of APIs.
• Non-UNIX systems can be POSIX-compliant if they offer the
required set of APIs.
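The difference between an API and a system call shows up when the libc wrapper is bypassed. The sketch below assumes Linux with glibc, whose syscall() function issues a system call by number; raw_getpid and raw_write are made-up names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Same service as the getpid() API, requested by syscall number. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/*
 * Same service as the write() API. On failure the kernel's negative
 * return value is turned into -1 plus errno by the C library.
 */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

raw_getpid() returns the same value as getpid(), and raw_write(-1, "x", 1) returns -1 with errno set to EBADF, just as write() would.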
Interrupts and Exceptions
• Interrupts - async device to cpu communication
– example: service request, completion notification
– aside: IPI – interprocessor interrupt (another cpu!)
– system may be interrupted in either kernel or user mode
– interrupts are logically unrelated to current processing
• Exceptions - sync hardware error notification
– example: divide-by-zero (ALU), illegal address (MMU)
– exceptions are caused by current processing
• Software interrupts (traps)
– synchronous “simulated” interrupt
– allows controlled “entry” into the kernel from userland
Linux System Calls
• Invoked by executing int $0x80.
– Programmed exception vector number 128.
– CPU switches to kernel mode & executes a kernel
function.
• Calling process passes syscall number identifying
system call in eax register (on Intel processors).
• Syscall handler responsible for:
– Saving registers on kernel mode stack.
– Invoking syscall service routine.
– Exiting by calling ret_from_sys_call().
Linux System Calls
• System call dispatch table:
– Associates syscall number with corresponding service
routine.
– Stored in sys_call_table array having up to
NR_syscalls entries (usually 256 maximum).
– nth entry contains service routine address of syscall n.
Kernel Entry and Exit
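[Figure: paths into and out of the kernel. Library code enters through the system call interface (trap 80h), which dispatches via the system call table. Exceptions (error traps), page faults, device interrupts and IPIs (inter-processor interrupts) enter through the trap/interrupt table, which is set up at boot. The scheduler sits inside the kernel, and the kernel carries on the device dialog with the devices.]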
Initializing System Calls
• trap_init() called during kernel initialization sets up the IDT
(interrupt descriptor table) entry corresponding to vector 128:
– set_system_gate(0x80, &system_call);
• A system gate descriptor is placed in the IDT, identifying address
of system_call routine.
– Does not disable maskable interrupts.
– Sets the descriptor privilege level (DPL) to 3:
• Allows User Mode processes to invoke exception
handlers (i.e. syscall routines).
The system_call() Function
• Saves syscall number & CPU registers used by exception handler
on the stack, except those automatically saved by control unit.
• Checks for valid system call.
• Invokes specific service routine associated with syscall number
(contained in eax):
– call *sys_call_table(0, %eax, 4)
• Return code of system call is stored in eax.
Parameter Passing
• As with the syscall number, user-space must relay the parameters to the
kernel during the exception trap
• The parameters are stored in registers: on x86, the registers ebx,
ecx, edx, esi, and edi contain, in order, the first five
arguments.
• In the unlikely case of six or more arguments, a single register is
used to hold a pointer to user-space where all the parameters reside
• The return value is sent to user-space via register, eax on x86
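Dispatch and parameter passing fit together as follows: eax selects the table entry, ebx through edi supply the arguments, and the result goes back in eax. The sketch below models this in user space. Everything except the sys_ni_syscall name is hypothetical, including the three-entry table and the sys_add service routine.

```c
#include <errno.h>

/* Register values as left by user space at the trap. */
struct regs {
    long eax;                       /* in: syscall number, out: result */
    long ebx, ecx, edx, esi, edi;   /* first five arguments, in order */
};

typedef long (*syscall_fn)(long, long, long, long, long);

/* Placeholder for unimplemented entries. */
static long sys_ni_syscall(long a, long b, long c, long d, long e)
{
    (void)a; (void)b; (void)c; (void)d; (void)e;
    return -ENOSYS;
}

/* Hypothetical service routine: adds its first two arguments. */
static long sys_add(long a, long b, long c, long d, long e)
{
    (void)c; (void)d; (void)e;
    return a + b;
}

#define NR_DEMO_SYSCALLS 3

/* nth entry holds the service routine of syscall n. */
static const syscall_fn demo_call_table[NR_DEMO_SYSCALLS] = {
    sys_ni_syscall,   /* 0 */
    sys_add,          /* 1 */
    sys_ni_syscall,   /* 2 */
};

/* Check the number, call through the table, store the result in eax. */
void demo_system_call(struct regs *r)
{
    if (r->eax < 0 || r->eax >= NR_DEMO_SYSCALLS) {
        r->eax = -ENOSYS;
        return;
    }
    r->eax = demo_call_table[r->eax](r->ebx, r->ecx, r->edx,
                                     r->esi, r->edi);
}
```

With eax = 1, ebx = 2 and ecx = 3 on entry, eax holds 5 on return; an out-of-range number yields -ENOSYS.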
Writing a system call for Linux
• Define its purpose, i.e., exactly one purpose
• Decide arguments, return value, and error codes
• Design the interface with forward compatibility in mind
• Return appropriate error codes
• Verifying the parameters:
– The pointer points to a region of memory in user-space
– The pointer points to a region of memory in the process’s address
space
– If reading, the memory is marked readable. If writing, the memory is
marked writable
• copy_to_user(usr_dst, krnl_src, len);
• copy_from_user(krnl_dst, usr_src, len);
asmlinkage long sys_scopy(unsigned long *src, unsigned long *dst,
                          unsigned long len)
{
    unsigned long buf;

    /* fail if the kernel wordsize and user wordsize do not match */
    if (len != sizeof(buf))
        return -EINVAL;
    /* copy one word from user-space into the kernel buffer */
    if (copy_from_user(&buf, src, len))
        return -EFAULT;
    /* copy the word back out to the user-space destination */
    if (copy_to_user(dst, &buf, len))
        return -EFAULT;
    return len; /* return amount of data copied */
}
System Call Context
• In process context, the kernel is capable of sleeping (e.g.,
blocked on a call or calling schedule()): make use of the
majority of the kernel’s functionality; simplifying kernel
programming
• In process context, the kernel is preemptible: system calls
must be reentrant (the current task may be preempted by
another task that may then execute the same system call).
Blocking System Calls
• system calls may block “in the kernel”
• “slow” system calls may block indefinitely
– reads, writes of pipes, terminals, net devices
– some ipc calls, pause, some opens and ioctls
– disk io is NOT slow (it will eventually complete)
• blocking slow calls may be “interrupted” by a signal
– returns EINTR
• problem: slow calls must be wrapped in a loop
• BSD introduced “automatic restart” of slow interrupted calls
• POSIX didn’t specify semantics
• Linux
– no automatic restart by default
– specify restart when setting signal handler (SA_RESTART)
Linux Files Relating to Syscalls
• Main files:
– arch/i386/kernel/entry.S
• System call and low-level fault handling routines.
– include/asm-i386/unistd.h
• System call numbers and macros.
– kernel/sys.c
• System call service routines.
arch/i386/kernel/entry.S
.data
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_ni_syscall) /* 0 - old "setup()" system call */
.long SYMBOL_NAME(sys_exit)
.long SYMBOL_NAME(sys_fork)
• Add system calls by appending entry to sys_call_table:
.long SYMBOL_NAME(sys_my_system_call)
include/asm-i386/unistd.h
• Each system call needs a number in the system call table:
– e.g., #define __NR_write 4
– #define __NR_my_system_call nnn, where nnn is next
free entry in system call table.
kernel/sys.c
• Service routine bodies are defined here:
• e.g., asmlinkage retval
sys_my_system_call (parameters) {
body of service routine;
return retval;
}
Example System Calls
• sys_foo, do_foo idiom
– all system calls proper begin with sys_
– often delegate to do_ function for the real work
• asmlinkage
– gcc magic to keep parameters on the stack
– avoids register optimizations
• sys_ni_syscall
– just returns -ENOSYS!
– guards position 0 in table (catch uninitialized bugs)
– fills “holes” for obsolete syscalls or library implemented calls
Example System Calls: sys_time
• kernel/time.c: sys_time
• just return the number of seconds since Jan 1, 1970
• available as volatile CURRENT_TIME (xtime.tv_sec)
• snapshot current time
• check user-supplied pointer for validity
• copy time to user space (asm/uaccess.h:put_user)
• return time snapshot or error
Example System Calls: sys_reboot
• kernel/sys.c : sys_reboot
• require CAP_SYS_BOOT capability
• check “magic numbers” (0xfee1dead, Torvalds family birthdays)
• acquire the “big kernel lock”
• switch options
• shutdown in various ways: restart, halt, poweroff
• “user-specified” shutdown command for some architectures
• toggle control-alt-delete processing
• go through reboot_notifier callbacks as appropriate
• unlock and return error if failure
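The magic-number check can be sketched as below. The constants are written here by value rather than pulled from a kernel header, and reboot_magic_ok is a made-up name; the real system call returns an error code rather than a boolean.

```c
#include <stdint.h>

#define REBOOT_MAGIC1   0xfee1deadu
#define REBOOT_MAGIC2   672274793u   /* 0x28121969 */
#define REBOOT_MAGIC2A  85072278u    /* 0x05121996 */
#define REBOOT_MAGIC2B  369367448u   /* 0x16041998 */

/* Returns 1 when both magic arguments are acceptable, 0 otherwise. */
int reboot_magic_ok(uint32_t magic1, uint32_t magic2)
{
    if (magic1 != REBOOT_MAGIC1)
        return 0;
    return magic2 == REBOOT_MAGIC2 ||
           magic2 == REBOOT_MAGIC2A ||
           magic2 == REBOOT_MAGIC2B;
}
```

Read in hexadecimal, the second magic values look like dates, which is where the "birthdays" remark comes from. Requiring both numbers makes it unlikely that a stray call with garbage arguments reboots the machine.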
Example System Calls: sys_sysinfo
• kernel/info.c : sys_sysinfo
• allocate a local struct to return info to user space
• disable (clear) interrupts to keep info consistent
• calculate uptime
• calculate 1, 5, 15 minute “load averages”
• average length of run queue over interval
• use confusing int math to avoid floating-point inefficiency
• enable (set) interrupts
• return number of processes and some mem stats
• copy local struct values to user space (copy_to_user)
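The "confusing int math" is fixed-point arithmetic: each load average is an exponentially decayed average kept as an integer scaled by 2^11. The sketch below follows the constants used by 2.4-era kernels as best recalled (FSHIFT, FIXED_1, EXP_1, EXP_5, EXP_15); treat the exact values as an assumption. The function names are illustrative.

```c
#define FSHIFT   11                  /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)     /* 1.0 in fixed point */
#define EXP_1    1884UL              /* decay factor, shortest interval */
#define EXP_5    2014UL              /* decay factor, middle interval */
#define EXP_15   2037UL              /* decay factor, longest interval */

/*
 * One update step: load = load * exp + active * (1 - exp), all in
 * fixed point. 'active' is the run queue length times FIXED_1.
 */
unsigned long calc_load(unsigned long load, unsigned long exp,
                        unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}

/* Integer part of a fixed-point load value. */
unsigned long load_int(unsigned long x)
{
    return x >> FSHIFT;
}

/* Two-digit fractional part of a fixed-point load value. */
unsigned long load_frac(unsigned long x)
{
    return load_int((x & (FIXED_1 - 1)) * 100);
}
```

Starting from a load of 0 with one runnable task, a single step with EXP_1 gives 164 (about 0.08); a load already at FIXED_1 stays at FIXED_1, so the average settles at the run queue length.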
Context switch in Linux
Memory layout – general picture
[Figure: processes X, Y and Z each have their own user memory containing a stack. In kernel memory, each process has a kernel stack together with its task_struct. tss->esp0 in the TSS of CPU i points into the kernel stack of one of the processes.]
#1 – kernel stack after any system call, before context switch
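[Figure: kernel stack of prev. From the base of the stack (where tss->esp0 and thread.esp0 point): ss, esp, eflags, cs, eip, then orig_eax, es, ds, eax, ebp, edi, esi, edx, ecx, ebx, then the schedule() function frame, with esp at its end. The saved ss/esp refer to the user stack and the saved cs/eip to the user code. These values are saved on the kernel stack during a transition to kernel mode, by the jump to the interrupt handler and by the SAVE_ALL macro.]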
#2 – stack of prev before switch_to macro in schedule() func
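[Figure: kernel stack of prev, holding in order: EAX, ECX, EDX saved by schedule(), the arguments to context_switch(), the return address into schedule(), and schedule()'s old EBP; esp points to the end of this frame. prev's task_struct holds thread.eip, thread.esp and thread.esp0; the TSS holds tss->esp0.]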
#3 – switch_to: save esi, edi, ebp on the stack of prev
[Figure: kernel stack of prev with schedule()'s frame (saved EAX, ECX, EDX; arguments to context_switch(); return address into schedule(); old EBP), now followed by the pushed ESI, EDI and EBP; esp points at the last pushed value.]
#4 – switch_to: save esp in prev->thread.esp
[Figure: kernel stack of prev with schedule()'s frame followed by the pushed ESI, EDI and EBP; the current esp is now recorded in thread.esp of prev's task_struct.]
#5 – switch_to: load next->thread.esp into esp
[Figure: kernel stacks of prev and next side by side, each holding schedule()'s frame (saved EAX, ECX, EDX; arguments to context_switch(); return address into schedule(); old EBP) followed by saved ESI, EDI and EBP. esp now points into next's stack. next's thread.eip holds $1f.]
#6 – switch_to: save return address in the prev->thread.eip
[Figure: kernel stacks of prev and next side by side, each holding schedule()'s frame followed by saved ESI, EDI and EBP; esp points into next's stack. thread.eip of prev now holds $1f, the address of label 1 in switch_to, as does thread.eip of next.]
#7 – switch_to: save return address on the stack of next
[Figure: kernel stacks of prev and next side by side, each holding schedule()'s frame followed by saved ESI, EDI and EBP. The value of next's thread.eip ($1f) has now been pushed on next's stack, and esp points at it.]
#8 – __switch_to func: save the base of next’s stack in TSS
[Figure: kernel stacks of prev and next side by side, each holding schedule()'s frame followed by saved ESI, EDI and EBP, with $1f on top of next's stack. tss->esp0 in the TSS now refers to the base of next's kernel stack (next's thread.esp0).]
#9 – back in switch_to: eip points to $1f instruction label
[Figure: kernel stacks of prev and next side by side. The $1f pushed on next's stack has been consumed as a return address, so eip is at label 1: in switch_to and esp points at next's saved ESI, EDI and EBP.]
#10 – switch_to: restore esi, edi, ebp from the stack of next
[Figure: kernel stacks of prev and next side by side. ESI, EDI and EBP have been popped from next's stack; esp points at the end of next's schedule() frame (saved EAX, ECX, EDX; arguments to context_switch(); return address into schedule(); old EBP). prev's stack still holds its saved ESI, EDI and EBP, and prev's thread.eip holds $1f.]
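The save-stack-pointer, load-stack-pointer, resume-at-saved-address pattern in steps #3 to #10 can be tried out in user space with the ucontext API. This is a different mechanism from the kernel's switch_to macro, used here only as an analogy, and it assumes a C library that provides getcontext, makecontext and swapcontext (glibc does). All names in the sketch are made up.

```c
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;   /* saved state of "prev" and "next" */
static char task_stack[64 * 1024];      /* private stack for the task */

int switch_trace[4];                    /* records the order of execution */
int switch_count;

static void task_body(void)
{
    switch_trace[switch_count++] = 2;
    /* Save this context and resume main where it left off. */
    swapcontext(&task_ctx, &main_ctx);
    switch_trace[switch_count++] = 4;
    /* Returning resumes uc_link, i.e. main_ctx. */
}

int run_switch_demo(void)
{
    switch_count = 0;

    /* Build a context that starts task_body on its own stack. */
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task_body, 0);

    switch_trace[switch_count++] = 1;
    swapcontext(&main_ctx, &task_ctx);  /* switch: main -> task */
    switch_trace[switch_count++] = 3;
    swapcontext(&main_ctx, &task_ctx);  /* switch again: task resumes */

    return switch_count;
}
```

Each swapcontext call plays the role of switch_to: the caller's registers and stack pointer are stored, the other context's are loaded, and execution continues where that context last stopped, so the trace comes out as 1, 2, 3, 4.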
Thank you