Linux Kernel Analysis
for ARM
Version 0.4-Alpha
Seo-Young Noh
[email protected]
Revision history (Date / Version / Description)
2009-12-17 / 0.1 / Initially created
0.2
• Documented setup_arch / bootmem_init / bootmem_free_node / free_area_init_node / free_area_init_core
• Documented the _find_next_zero_bit_le logic
• Memory Bank, Zone, Node, NUMA, UMA
• Memory Hot-Plug
• Summarized ./Documentation/arm/Booting
• Summarized ./Documentation/arm/Porting
0.3
• Documented the Thread Model
• Documented the Ext2 File System
• Documented Preemption
• Cache
0.4
• Documented Kernel Lock
• Documented Red-Black Tree
• Expanded Memory Management
• How malloc works
• Program Execution Step
Etc.
Later revision dates: 2010-03-03, 2010-11-13
compressed/head.S
Values passed from U-Boot
R0 = 0 / R1 = Architecture ID (machine type) / R2 = Board Information (ATAG pointer)

1. Execute NOP 8 times
   mov r0, r0 @ to introduce a delay
2. Interrupt Disable
   mrs r2, cpsr
   orr r2, r2, #0xc0
3. Initialize the BSS region to 0
4. Cache On: branch through call_cache_fn
   r3 = 8: the entry matching the Processor ID is looked up in the Processor Type Table, and execution branches to that architecture's mmu_cache_on.
5. xxx_mmu_cache_on
   • Call __setup_mmu: a 16KB L1 Page Directory is created just below the location where zImage will be decompressed, and the RAM region is marked Cacheable/Bufferable.
   • Drain Write Buffer
   • Flush I/D TLBs
   • I Cache Enable / RR Replacement Enable
6. __common_mmu_cache_on
   • MMU Enable / L1 Data Cache Enable / WB Enable, 0xd = 1101 (b)
   • Store the Page Directory location in CP15 register c2

Processor Type Table (one entry)
Offset | Field
0 | Architecture Number
4 | Magic Number
8 | xxx_mmu_cache_on
12 | xxx_mmu_cache_off
16 | xxx_mmu_cache_flush

RAM layout
0x50C00000 | zImage
0x50008000 | Image (unzip target)
0x50004000 | Page Directory, 16KB (4096 Entries)
Boot Loader

Functionalities the Boot Loader must provide
1. Setup and initialize the RAM.
2. Initialize one serial port.
3. Detect the machine type.
4. Setup the kernel tagged list.
5. Call the kernel image.

<Figure 1> Minimal tagged list (increasing address, starting at base)
base: ATAG_CORE, then ATAG_MEM, then ATAG_NONE

Setup the kernel tagged list
• A valid tagged list must start with ATAG_CORE and end with ATAG_NONE.
• The ATAG_CORE tag may be empty; in that case its size field must be set to '2' (0x00000002).
• The size field of ATAG_NONE must be set to 0.
• Any number of tags may appear in the list, but whether a repeated tag appends to or overwrites the earlier one is undefined.
• The Boot Loader must pass at least the location and size of system memory and the location of the root file system, so the minimal tagged list looks like <Figure 1>.
• The tagged list must be stored in memory (RAM).
• The tagged list must be placed where neither the kernel decompressor nor the initrd 'bootp' program will overwrite it. It is therefore usually placed in the first 16KB of RAM.
Reference: ./Documentation/arm/Booting
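The tagged-list rules above can be sketched in C. This is only an illustration, not boot loader code: build_min_atags and atags_valid are hypothetical helper names, and the memory start/size values a caller passes in are assumptions. The tag identifiers are the ones used by the ARM tagged list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag identifiers of the ARM tagged list. */
#define ATAG_NONE 0x00000000u
#define ATAG_CORE 0x54410001u
#define ATAG_MEM  0x54410002u

/*
 * Hypothetical helper: writes ATAG_CORE, ATAG_MEM, ATAG_NONE into buf.
 * Every tag starts with two words: size (in 32-bit words) and tag id.
 * Returns the number of words written; buf must hold at least 8 words.
 */
static size_t build_min_atags(uint32_t *buf, uint32_t mem_start, uint32_t mem_size)
{
    size_t n = 0;

    assert(buf != NULL);
    buf[n++] = 2;            /* empty ATAG_CORE: header only, size = 2 */
    buf[n++] = ATAG_CORE;
    buf[n++] = 4;            /* header (2 words) + size + start */
    buf[n++] = ATAG_MEM;
    buf[n++] = mem_size;
    buf[n++] = mem_start;
    buf[n++] = 0;            /* ATAG_NONE carries size 0 */
    buf[n++] = ATAG_NONE;
    return n;
}

/* Hypothetical check: the list starts with ATAG_CORE and reaches ATAG_NONE. */
static int atags_valid(const uint32_t *buf, size_t max_words)
{
    size_t i = 0;

    if (max_words < 2 || buf[1] != ATAG_CORE)
        return 0;
    while (i + 1 < max_words) {
        if (buf[i + 1] == ATAG_NONE)
            return buf[i] == 0;
        if (buf[i] < 2)
            return 0;        /* a size below the header size is malformed */
        i += buf[i];         /* step to the next tag */
    }
    return 0;
}
```

A caller would pass a buffer located in the first 16KB of RAM; in a host-side test any 8-word array works.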
misc.c: values passed when decompress is called
R0 | Location where zImage will be decompressed
R1 | Start of the 64KB malloc memory
R2 | End of the 64KB malloc memory
R3 | Architecture ID
The 64KB malloc area is used during decompression.

1. Call decompress: the kernel is decompressed by calling gunzip.
2. Call cache_clean_flush: branch through call_cache_fn
   r3 = 16: the entry matching the Processor ID is looked up in the Processor Type Table, and execution branches to that architecture's mmu_cache_flush.
   xxx_mmu_cache_flush: D Cache Clean/Invalidate, I Cache/BTB Invalidate, WB Drain
3. Call cache_off: branch through call_cache_fn
   r3 = 12: the entry matching the Processor ID is looked up in the Processor Type Table, and execution branches to that architecture's mmu_cache_off.
   xxx_mmu_cache_off: MMU / L1 Cache / WB disabled, Cache/TLB Invalidate
4. Jump to call_kernel: the PC moves to the location where zImage was decompressed.

Processor Type Table (one entry)
Offset | Field
0 | Architecture Number
4 | Magic Number
8 | xxx_mmu_cache_on
12 | xxx_mmu_cache_off
16 | xxx_mmu_cache_flush
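The call_cache_fn dispatch can be pictured as a table walk followed by an indirect call. The sketch below is a host-side C model, not the assembly in head.S. It assumes the first two words of an entry act as a match value and a mask; the table contents and the demo_* stubs are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*cache_fn)(void);

/* One table entry; the offsets refer to the 32-bit layout in the table above. */
struct proc_type {
    uint32_t id;      /* offset 0  */
    uint32_t mask;    /* offset 4  */
    cache_fn on;      /* offset 8  */
    cache_fn off;     /* offset 12 */
    cache_fn flush;   /* offset 16 */
};

enum { CACHE_ON = 8, CACHE_OFF = 12, CACHE_FLUSH = 16 };

static int last_called;   /* records which stub ran, for demonstration only */
static void demo_on(void)    { last_called = CACHE_ON; }
static void demo_off(void)   { last_called = CACHE_OFF; }
static void demo_flush(void) { last_called = CACHE_FLUSH; }

/* Made-up table: one real-looking entry and a catch-all with no methods. */
static const struct proc_type demo_table[] = {
    { 0x0007b000u, 0x000ff000u, demo_on, demo_off, demo_flush },
    { 0x00000000u, 0x00000000u, NULL, NULL, NULL },
};

/* Returns 1 if a method was found and called, 0 otherwise. */
static int call_cache_fn(uint32_t cpu_id, int offset)
{
    size_t i;

    for (i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        const struct proc_type *p = &demo_table[i];
        cache_fn fn = NULL;

        if (((cpu_id ^ p->id) & p->mask) != 0)
            continue;               /* not this processor, try next entry */
        switch (offset) {
        case CACHE_ON:    fn = p->on;    break;
        case CACHE_OFF:   fn = p->off;   break;
        case CACHE_FLUSH: fn = p->flush; break;
        default:          break;
        }
        if (fn == NULL)
            return 0;
        fn();
        return 1;
    }
    return 0;
}
```

The r3 values 8, 12 and 16 from the steps above correspond to CACHE_ON, CACHE_OFF and CACHE_FLUSH here.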
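Page Directory built by the decompressor
Entry | P-Address (address of the entry) | Value
4095 | 0x50007FFC | 0xFFF00C12
... | ... | ...
1536 | 0x50005800 | 0x60000C12
1535 | 0x500057FC | 0x5FF00C1E
... | ... | ...
1280 | 0x50005400 | 0x50000C1E
1279 | 0x500053FC | 0x4FF00C12
... | ... | ...
2 | 0x50004008 | 0x00200C12
1 | 0x50004004 | 0x00100C12
0 | 0x50004000 | 0x00000C12

• Entries 1280 to 1535 (256 Entries) cover RAM and are Cacheable/Bufferable (0xC1E); every other entry is 0xC12.
• Each Entry is 4 bytes and describes a 1MB Section of Virtual Memory.
• The Page Directory (4096 Entries, 16KB) occupies 0x50004000 to 0x50007FFC.

RAM layout
0x50C00000 | zImage
0x50008000 | Image (unzip target)
0x50004000 | Page Directory, 16KB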
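The values in the Page Directory follow a simple rule, sketched here in C. The constants come from the table above; treating RAM as a 256MB window starting at 0x50000000 is an assumption made for this sketch, and section_entry / entry_address are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_BASE   0x50004000u  /* address of entry 0, from the table */
#define RAM_BASE     0x50000000u  /* assumption: start of the cached RAM window */
#define RAM_SECTIONS 256u         /* 256 x 1MB entries are Cacheable/Bufferable */

#define SECT_FLAGS   0x00000C12u  /* flags seen on the non-RAM entries */
#define SECT_CB      0x0000000Cu  /* C and B bits: 0xC12 | 0x00C = 0xC1E */

/* Value of the identity-mapped 1MB section entry with the given index. */
static uint32_t section_entry(uint32_t index)
{
    uint32_t first = RAM_BASE >> 20;
    uint32_t value;

    assert(index < 4096u);
    value = (index << 20) | SECT_FLAGS;
    if (index >= first && index < first + RAM_SECTIONS)
        value |= SECT_CB;
    return value;
}

/* Address at which the entry itself is stored (4 bytes per entry). */
static uint32_t entry_address(uint32_t index)
{
    return PGDIR_BASE + index * 4u;
}
```

Index 1280 is 0x500, i.e. the section number of 0x50000000, which is why the Cacheable/Bufferable run starts there.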
Decompressor Symbols and Kernel Symbols
Reference: ./Documentation/arm/Porting

Decompressor Symbols
Symbols / Description
ZTEXTADDR
• Start address of the decompressor.
• The MMU is off when the decompressor code is called, so the distinction between virtual and physical addresses does not apply here.
• The kernel is started by calling this address.
ZBSSADDR
• Start address of the zero-initialized work area used by the decompressor.
• The decompressor zero-initializes this area itself.
ZRELADDR
• Address where the decompressed kernel is written.
• __virt_to_phys(TEXTADDR) == ZRELADDR
INITRD_PHYS
• Physical address at which the RAM disk is placed; only relevant when bootpImage is used.
INITRD_VIRT
• Virtual address of the RAM disk.
• __virt_to_phys(INITRD_VIRT) == INITRD_PHYS
PARAMS_PHYS
• Physical address of the param_struct structure or of the tag list, which passes parameters about the execution environment to the kernel.
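Kernel Symbols
Symbols / Description
PHYS_OFFSET
• Physical start address of the first bank of RAM.
PAGE_OFFSET
• Virtual start address of the first bank of RAM. During the kernel boot phase, virtual address PAGE_OFFSET is mapped to physical address PHYS_OFFSET. It should have the same value as TASK_SIZE.
TASK_SIZE
• The maximum size of a user process, in bytes. It equals the highest address a user process can access, plus 1.
• The user process stack grows down from this address.
• Every virtual address below TASK_SIZE belongs to the user process area and is managed dynamically by the kernel on a per-process basis; this region is called the User Segment. Every virtual address above TASK_SIZE is common to all processes; this region is called the Kernel Segment.
TEXTADDR
• Virtual start address of the kernel, normally PAGE_OFFSET + 0x8000.
• This is the address at which the kernel image ends up, i.e. where it is finally placed.
DATAADDR
• Virtual address for the kernel data segment. Must not be defined when the decompressor is used.
VMALLOC_START / VMALLOC_END
• Virtual addresses bounding the vmalloc() area. No static mapping may be placed in this area, because vmalloc will overwrite it.
• Both addresses lie inside the Kernel Segment. Normally the vmalloc() area starts VMALLOC_OFFSET bytes above the last virtual RAM address, which is found through the variable high_memory.
VMALLOC_OFFSET
• Offset applied to VMALLOC_START to leave a hole between virtual RAM and the vmalloc area. The hole lets out-of-bounds memory accesses be caught. Normally set to 8MB.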
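The relations between these symbols can be checked with a little arithmetic. The sketch below assumes PHYS_OFFSET = 0x50000000 and PAGE_OFFSET = 0xC0000000, matching the addresses used in the earlier diagrams; rounding the vmalloc start down to an 8MB boundary is an extra assumption of this sketch, not something stated in the table.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_OFFSET    0x50000000u            /* assumption: first RAM bank */
#define PAGE_OFFSET    0xC0000000u            /* assumption: its virtual address */
#define TEXTADDR       (PAGE_OFFSET + 0x8000u)
#define VMALLOC_OFFSET (8u * 1024u * 1024u)   /* 8MB hole */

/* Linear mapping of the first RAM bank, virtual to physical. */
static uint32_t virt_to_phys(uint32_t v)
{
    assert(v >= PAGE_OFFSET);
    return v - PAGE_OFFSET + PHYS_OFFSET;
}

/* Linear mapping of the first RAM bank, physical to virtual. */
static uint32_t phys_to_virt(uint32_t p)
{
    assert(p >= PHYS_OFFSET);
    return p - PHYS_OFFSET + PAGE_OFFSET;
}

/* Start of the vmalloc area: VMALLOC_OFFSET above high_memory (8MB aligned). */
static uint32_t vmalloc_start(uint32_t high_memory)
{
    return (high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET - 1u);
}
```

With these values ZRELADDR, defined as __virt_to_phys(TEXTADDR), comes out as 0x50008000, the unzip target shown in the RAM layout.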
Architecture Specific Macros
Symbols / Description
BOOT_MEM(pram,pio,vio)
• 'pram' is the physical start address of RAM. It must always be present and should equal PHYS_OFFSET.
• 'pio' is the physical address of an 8MB region containing IO, used with the debugging macros in arch/arm/kernel/debug-armv.S.
• 'vio' is the virtual address of the 8MB debugging region.
• The debugging region is initialized later by architecture-specific code.
BOOT_PARAMS
• Same as PARAMS_PHYS.
FIXUP(func)
• Machine-specific fixups, run before the memory subsystem is initialized.
MAPIO(func)
• Machine-specific function that maps the IO areas.
INITIRQ(func)
• Machine-specific function that initializes interrupts.
kernel/head.S
Values used by head.S / head-common.S
R1: machine type
R2: ATAG pointer
R9: Processor ID

1. msr cpsr_c, #PSR_F_BIT | PSR_I_BIT | SVC_MODE
   Disables interrupts (I, F) and switches to SVC Mode.
2. mrc p15, 0, r9, c0, c0
   R9 = Processor ID read from the coprocessor.
3. __lookup_processor_type (head-common.S)
   • Finds the physical start/end addresses of the .proc.info.init section (the physical addresses of __proc_info_begin / __proc_info_end).
   • Accesses the memory of the .proc.info.init section in steps of sizeof(proc_info_list), looping until an entry matches the Processor ID held in R9.
   • Is the value returned by __lookup_processor_type 0? Y: stop. N: continue.
4. __lookup_machine_type (head-common.S)
   • Finds the physical start/end addresses of the .arch.info.init section (the physical addresses of __arch_info_begin / __arch_info_end).
   • Accesses the memory of the .arch.info.init section in steps of sizeof(machine_desc), looping until an entry matches the machine type held in R1.
   • Is the value returned by __lookup_machine_type 0? Y: stop. N: continue.

See the notes on converting Virtual Address -> Physical Address and on how ldmia assigns memory to registers.
kernel/head.S (continued)
1. Jump to __vet_atags (head-common.S)
   Checks the validity of the ATAG pointer; returns 0 if it is invalid.
   Is the return value of __vet_atags 0? N: continue.
2. __create_page_tables
   Sets the processor MMU flags in the Page Table for the range from the start of the kernel to its end.
   KERNEL_START = 0xC0008000
   KERNEL_END = 0xC00XXXXX
   The flag value is XXXXXC1E = Cacheable/Bufferable
3. PC = __cpu_flush of the proc_info_list structure: Call __v6_setup (proc-v6.S)
   In mm/proc-v6.S, __v6_proc_info makes __v6_setup the pc value, so __v6_setup in mm/proc-v6.S is executed.
   It sets a value in R0, which __enable_mmu then uses.
4. __enable_mmu
   Control register setting: C1 = MMU, C2 = Page Table, C3 = Domain
   [0] MMU on
   [2] L1 D-cache on
   [3] SBOne
   [4~6] SBOne
   [11] branch prediction on
   [12] L1 I-cache on
   and so on
5. Before start_kernel
   • Copy the data segment if necessary
   • Clear the BSS segment
   • Save the processor ID / machine type to memory
   • Save the ATAG pointer value to memory
6. Jump to start_kernel

TODO: redraw the Page Table
Converting Virtual Address -> Physical Address: how ldmia assigns memory to registers
For the ldmda, ldmia, ldmdb and ldmib instructions, the higher-numbered register is always mapped to the higher memory address. Therefore LDMDA R0, { R1-R4 } loads R4 from [R0], R3 from [R0 - 4], and so on.
In head-common.S, __lookup_processor_type (which finds the processor type) and __lookup_machine_type (which finds the machine type) both convert virtual addresses to physical addresses in the same way, as follows.
Values stored in memory (virtual):
• __proc_info_end (virtual)
• __proc_info_begin (virtual)
• . (virtual)
Values wanted (physical):
• __proc_info_end (physical)
• __proc_info_begin (physical)
• label 3 (physical)
label 3 (physical) - . (virtual) = offset
__proc_info_begin (virtual) + offset = __proc_info_begin (physical)
Note: offset may be negative.
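The offset trick can be modelled with unsigned 32-bit arithmetic, which wraps around just as the ARM sub/add instructions do. This is a host-side sketch; the function names are hypothetical and the addresses used in a call are assumptions.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdint.h>

/*
 * offset = run-time (physical) address of the label
 *          - link-time (virtual) address stored at the label.
 * The subtraction wraps modulo 2^32, so a "negative" offset is
 * represented as a large unsigned value.
 */
static uint32_t phys_virt_offset(uint32_t label_phys, uint32_t label_virt)
{
    return label_phys - label_virt;
}

/* Applies the offset to any other link-time (virtual) address. */
static uint32_t to_phys(uint32_t virt, uint32_t offset)
{
    return virt + offset;
}
```

For a kernel linked at 0xC0008000 and running from 0x50008000, the offset is 0x90000000 when read as unsigned, which corresponds to the negative offset mentioned in the note.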
Page Table after __create_page_tables in kernel/head.S
Entry | P-Address (address of the entry) | Value
4095 | 0x50007FFC | 0xFFF00C12
... | ... | ...
(KERNEL_END: where the kernel ends) | 0x50007XXX | 0xXXXXXC1E
... | ... | ...
3072 (KERNEL_START: where the kernel starts) | 0x50007000 | 0x50000C1E
... | ... | ...
1536 | 0x50005800 | 0x60000C12
1535 | 0x500057FC | 0x5FF00C1E
... | ... | ...
1280 | 0x50005400 | 0x50000C1E
1279 | 0x500053FC | 0x4FF00C12
... | ... | ...
2 | 0x50004008 | 0x00200C12
1 | 0x50004004 | 0x00100C12
0 | 0x50004000 | 0x00000C12

• Entry 3072 covers virtual address 0xC0000000; its value holds the location in Physical Memory (0x500xxxxx) plus the flags C1E.
• Entries 1280 to 1535 (256 Entries) are Cacheable/Bufferable.
• Each Entry is 4 bytes and describes a 1MB Section of Virtual Memory.
• The Page Directory (4096 Entries, 16KB) occupies 0x50004000 to 0x50007FFC.

RAM layout
0x50C00000 | zImage
0x50008000 | Image (unzip target)
0x50004000 | Page Directory, 16KB
init/main.c
start_kernel
asmlinkage makes the function linkable with assembly code; start_kernel is called from head-common.S.
This is the early stage of the kernel and mostly performs debugging-related setup. If the related defines are not set, most of these calls do nothing.
• smp_setup_processor_id()
  Declared with the weak attribute: it runs when the arch is sparc and does nothing on other archs.
• lockdep_init
  If CONFIG_LOCKDEP is defined, initializes the data structures used by the lock dependency tool.
  classhash_table (2^12 = 4096 entries)
  chainhash_table (2^14 = 4 * 4096 entries)
• debug_objects_early_init
  If CONFIG_DEBUG_OBJECTS is defined, this appears to enable tracking of the objects managed by the kernel (To Be Cleared).
• boot_init_stack_canary
  Used on x86. A specific pattern is placed on the stack so that an overwrite is detected when the function returns.
  The name is a metaphor: in a mine, the canary is the first to detect that something is wrong.
• cgroup_init_early
  control groups are for resource tracking and consist of cgroup / subsystem / hierarchy (see ./Documentation/cgroups/*.txt).
• local_irq_disable
  Masks the I bit of the cpsr register: cpsid i
  Disables hardirqs for the current task.
• early_boot_irqs_off
  If CONFIG_TRACE_IRQ_FLAGS is defined, sets early_boot_irqs_enabled = 0. This is a debugging aid: it records that execution is still in early bootup code so that a warning message can be issued for an invalid irqs-on event.
• early_init_irq_lock_class
  If CONFIG_GENERIC_HARDIRQS is defined, registers NR_IRQ irqs, i.e. every irq_lock, in the classhash_table that lockdep_init() initialized.
lib/kernel_lock.c
How the Kernel Lock is acquired
Call flow: lock_kernel, then __lock_kernel, then _raw_spin_trylock, then __raw_spin_trylock(&(lock)->raw_lock) when CONFIG_PREEMPT is set; each level returns once the lock is acquired.

    #define __lockfunc __attribute__((section(".spinlock.text")))
    void __lockfunc lock_kernel(void)
    {
        int depth = current->lock_depth+1;
        if (likely(!depth))
            __lock_kernel();
        current->lock_depth = depth;
    }

• current->lock_depth is initially -1, so !depth is true on the first call. When it is true, __lock_kernel is called.
• If the lock was acquired, i.e. the return value of _raw_spin_trylock is 1, __lock_kernel returns; otherwise it spins.

Inline assembly code:
1. Load the value of lock->lock exclusively into tmp.
2. Test whether tmp = 0.
   ① If true, exclusively store lock->lock = 1.
   ② Test whether tmp = 0 again (strexeq stores its result in tmp).
3. If tmp != 0
   ① Branch back to step 1 and loop.

arch/arm/include/asm/spinlock.h
    static inline void __raw_spin_lock(raw_spinlock_t *lock)
    {
        unsigned long tmp;
        __asm__ __volatile__(
    "1: ldrex %0, [%1]\n"
    " teq %0, #0\n"
    #ifdef CONFIG_CPU_32v6K
    " wfene\n"
    #endif
    " strexeq %0, %2, [%1]\n"
    " teqeq %0, #0\n"
    " bne 1b"
        : "=&r" (tmp)
        : "r" (&lock->lock), "r" (1)
        : "cc");
        smp_mb();
    }
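The depth counting and the try-and-spin loop can be sketched in portable C11. This is not the kernel implementation: it replaces ldrex/strexeq with atomic_compare_exchange_strong, keeps a single global lock_depth instead of one per task, and all names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int big_lock;     /* 0 = free, 1 = held */
static int lock_depth = -1;     /* per task in the kernel; one global here */

/* One attempt: succeeds only if the lock word changes from 0 to 1. */
static int spin_trylock_sketch(atomic_int *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(l, &expected, 1);
}

/* Only the outermost call (depth 0) actually takes the lock. */
static void lock_kernel_sketch(void)
{
    int depth = lock_depth + 1;

    if (depth == 0) {
        while (!spin_trylock_sketch(&big_lock))
            ;                   /* spin until the lock word was 0 */
    }
    lock_depth = depth;
}

/* The lock is released when the depth drops back to -1. */
static void unlock_kernel_sketch(void)
{
    assert(lock_depth >= 0);
    if (--lock_depth < 0)
        atomic_store(&big_lock, 0);
}
```

Nested calls only bump the depth, which is why the first call sees !depth as true and later nested calls skip the spin.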
init/main.c
tick_init
Call flow: tick_init, then clockevents_register_notifier(&tick_notifier), then spin_lock(&clockevents_lock), raw_notifier_chain_register(&clockevents_chain, tick_notifier), spin_unlock(&clockevents_lock)

struct notifier_block
    int (*notifier_call)
    notifier_block *next
    int priority

kernel/time/tick-common.c
    static struct notifier_block tick_notifier =
    { .notifier_call = tick_notify, }

tick_notify is a static function that handles notifications for clock event devices. The events include CLOCK_EVT_NOTIFY_ADD, CLOCK_EVT_NOTIFY_BROADCAST_ON, CLOCK_EVT_NOTIFY_CPU_DYING, CLOCK_EVT_NOTIFY_CPU_DEAD, CLOCK_EVT_NOTIFY_SUSPEND and others.

• spin_lock(&clockevents_lock): takes the spin lock on clockevents_lock.
• raw_notifier_chain_register(&clockevents_chain, tick_notifier): registers tick_notifier on clockevents_chain. The priority of the tick_notifier being inserted is compared with the priority of each notifier_block already on clockevents_chain, and it is inserted so that the chain is ordered from highest priority to lowest.
• spin_unlock(&clockevents_lock): releases the spin lock on clockevents_lock.
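The priority-ordered insertion can be sketched as a singly linked list walk. This is a userspace illustration without locking or RCU; chain_register is a hypothetical name and the struct is reduced to the three members listed above.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb, unsigned long event, void *data);
    struct notifier_block *next;
    int priority;
};

/* Inserts n so that the chain stays sorted from highest to lowest priority. */
static int chain_register(struct notifier_block **head, struct notifier_block *n)
{
    assert(head != NULL && n != NULL);
    while (*head != NULL) {
        if (n->priority > (*head)->priority)
            break;              /* n goes in front of the first lower priority */
        head = &(*head)->next;
    }
    n->next = *head;
    *head = n;
    return 0;
}
```

Blocks with equal priority end up in registration order, because the walk only stops at a strictly lower priority.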
The 4 Types of Notifier Chain
1. Atomic Notifier Chains
The chain callback functions run in interrupt or atomic context, and a callback is not allowed to block.
2. Blocking Notifier Chains
The callback functions run in process context, and a callback is allowed to block.
3. Raw Notifier Chains
There are no restrictions at all on callbacks, registration or unregistration; all locking and protection must be provided by the caller.
4. SRCU (Sleepable Read-Copy Update) Notifier Chains
A variant of the Blocking Notifier Chain with the same restrictions. srcu_notifier_call_chain() uses SRCU (Sleepable Read-Copy Update), which protects the chain links with less overhead than an rw-semaphore (no cache bounce & no memory barrier). In return, srcu_notifier_chain_unregister() carries some extra overhead. SRCU Notifier Chains should be used when the chain is called frequently but notifier blocks are seldom removed; they also require runtime initialization before use.

Chains are registered with atomic_notifier_chain_register(), blocking_notifier_chain_register() and srcu_notifier_chain_register(). atomic_notifier_chain_register() may be called from atomic context, but the other two must be called from process context. When unregistering, the unregister function must not be called from within the call chain.
init/main.c
boot_cpu_init
Marks the current cpu as online / present / possible.
• online: cpus currently available to the scheduler
• present: cpus currently plugged in
• possible: cpus that can ever be plugged in (the maximum)

cpu = smp_processor_id()
• If CONFIG_SMP is not defined, it is replaced by raw_smp_processor_id(), whose value is 0:
    #define raw_smp_processor_id() 0
• If CONFIG_SMP is defined, raw_smp_processor_id() from arch/arm/include/asm/smp.h is called; it accesses the cpu member of the thread_info structure:
    #define raw_smp_processor_id() (current_thread_info()->cpu)

set_cpu_online(cpu, true): sets the bit of this cpu in cpu_online_bits
set_cpu_present(cpu, true): sets the bit of this cpu in cpu_present_bits
set_cpu_possible(cpu, true): sets the bit of this cpu in cpu_possible_bits
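The three masks can be modelled as plain bitmaps. The sketch below is a simplification: NR_CPUS = 4 is an assumption, a single unsigned long stands in for the kernel's cpumask type, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* assumption for this sketch */

static unsigned long cpu_online_bits;
static unsigned long cpu_present_bits;
static unsigned long cpu_possible_bits;

/* Sets or clears the bit of one cpu in a mask. */
static void set_cpu_bit(unsigned long *mask, unsigned int cpu, bool value)
{
    assert(cpu < NR_CPUS);
    if (value)
        *mask |= 1UL << cpu;
    else
        *mask &= ~(1UL << cpu);
}

/* Mirrors the three calls made for the boot cpu. */
static void boot_cpu_init_sketch(unsigned int cpu)
{
    set_cpu_bit(&cpu_online_bits, cpu, true);
    set_cpu_bit(&cpu_present_bits, cpu, true);
    set_cpu_bit(&cpu_possible_bits, cpu, true);
}

static bool cpu_is_online(unsigned int cpu)
{
    return ((cpu_online_bits >> cpu) & 1UL) != 0;
}
```

On a uniprocessor build the cpu number is always 0, so only bit 0 of each mask is set.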
init/main.c
page_address_init
• INIT_LIST_HEAD(&page_address_pool): initializes page_address_pool, which is a list_head structure.
• Adds the page_address_maps elements to page_address_pool.
  In mm/highmem.c: struct page_address_map page_address_maps[LAST_PKMAP], with LAST_PKMAP = PTRS_PER_PTE = 512. Each element of page_address_maps is one page->virtual association (?).
• Initializes page_address_htable, the hash table for page_address_maps. It has 128 entries in total and each entry has its own lock; every entry and its lock are initialized here.
The page_address_pool, page_address_maps and page_address_htable data structures are described separately.
init/version.c
printk(KERN_NOTICE
"%s", linux_banner);
Prints the Linux banner
const char linux_banner[] =
"Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
Example
Linux version 2.6.29-omap1 (dobi77@dobinet) (gcc version 4.3.2 (Sourcery G++ Lite 2008q3-72) )
page_address_pool, page_address_maps, page_address_htable data structures
page_address_pool
    struct list_head list
        struct list_head *prev
        struct list_head *next
page_address_maps[0] ... page_address_maps[511] (LAST_PKMAP = 512), each element:
    struct page *page
    void *virtual
    struct list_head list
        struct list_head *prev
        struct list_head *next
page_address_htable[0] ... page_address_htable[127] (1<<PA_HASH_ORDER = 0x80 = 128), each element:
    struct list_head lh
    spinlock_t lock = SPIN_LOCK_UNLOCKED
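The initialization of these three structures can be sketched in userspace C. This is a simplified model: an int stands in for spinlock_t, void pointers stand in for struct page, and the helper names are written out here rather than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

#define LAST_PKMAP    512
#define PA_HASH_ORDER 7      /* 1 << 7 = 128 hash buckets */

struct page_address_map {
    void *page;              /* stand-in for struct page * */
    void *virtual_addr;      /* the 'virtual' member in the diagram */
    struct list_head list;
};

struct page_address_slot {
    struct list_head lh;     /* list of used page_address_maps */
    int lock;                /* stand-in for spinlock_t; 0 = unlocked */
};

static struct list_head page_address_pool;
static struct page_address_map page_address_maps[LAST_PKMAP];
static struct page_address_slot page_address_htable[1 << PA_HASH_ORDER];

static void init_list_head(struct list_head *h) { h->prev = h; h->next = h; }

/* Inserts n right after head, as the kernel's list_add does. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void page_address_init_sketch(void)
{
    size_t i;

    init_list_head(&page_address_pool);
    for (i = 0; i < LAST_PKMAP; i++)
        list_add(&page_address_maps[i].list, &page_address_pool);
    for (i = 0; i < (size_t)1 << PA_HASH_ORDER; i++) {
        init_list_head(&page_address_htable[i].lh);
        page_address_htable[i].lock = 0;
    }
}

/* Counts the elements linked on a list head. */
static size_t list_length(const struct list_head *head)
{
    const struct list_head *p;
    size_t n = 0;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

After initialization every map sits in the free pool and every hash bucket is empty.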
init/main.c
setup_arch
• tags = init_tags, from = default_command_line
  Assigns the initial tags value and the command line used at boot. The default command line is the value defined as CONFIG_CMDLINE in arch/arm/configs/xxxx_defconfig, and it is placed in the __initdata section.
• unwind_init()
  Runs when CONFIG_ARM_UNWIND is defined; when it is not defined, it simply returns 0.
  A for loop walks, in units of struct unwind_idx, from __start_unwind_idx (the start of the .ARM.unwind_idx section) to __stop_unwind_idx (its end) and converts each symbol address stored in the section to an absolute address. The symbol address is stored in unwind_idx->addr.
  .ARM.unwind_idx is the section that holds the entries of the unwinding index and can be seen in vmlinux.lds.S. The addresses in it are stored relative to the entry itself, so they are converted to absolute addresses once here.
  When the -g option is given on the command line, the object file carries DWARF2 debug frame information; a debugger uses it to unwind the stack when needed, which makes a stack backtrace available.
  Check the ELF format in the following document:
  http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf
• setup_processor()

arch/arm/include/asm/unwind.h
    struct unwind_idx {
        unsigned long addr; /* virtual address */
        unsigned long insn; /* points to the current instruction */
    };
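The address conversion done per index entry can be sketched as follows, assuming the entries use the 31-bit place-relative (prel31) encoding described in the ARM ELF document. The function below is a host-side model with 32-bit addresses passed as integers; it is not the kernel macro.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdint.h>

/*
 * place: address at which the 32-bit word is stored.
 * word:  the stored value; bits [30:0] are a signed offset from place,
 *        bit 31 is not part of the offset.
 * Returns the absolute address the entry refers to.
 */
static uint32_t prel31_to_addr(uint32_t place, uint32_t word)
{
    int64_t offset = (int64_t)(word & 0x7fffffffu);

    if (word & 0x40000000u)              /* bit 30 is the sign bit */
        offset -= (int64_t)0x80000000;   /* sign-extend from 31 bits */
    return (uint32_t)((int64_t)place + offset);
}
```

An entry stored at 0xC0300000 holding 0x7FFFF000 therefore refers to the address 0x1000 bytes below the entry.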
init/main.c
setup_arch, then setup_processor()
• lookup_processor_type(read_cpuid_id())
  A function defined in arch/arm/kernel/head-common.S; it returns the proc_info_list entry for the processor type. For how lookup_processor_type works, see the earlier slide xxx.
• printk()
    printk("CPU: %s [%08x] revision %d (ARMv%s), cr=%08lx\n",
    cpu_name, read_cpuid_id(), read_cpuid_id() & 15,
    proc_arch[cpu_architecture()], cr_alignment);
  ex: CPU: ARMv7 Processor [411fc083] revision 3 (ARMv7), cr=10c5387f
  cpu_architecture() reads the cpuid from cp15 c0 and uses that value as the index into proc_arch to obtain the architecture.
  See the proc_info_list structure, the proc_arch array & sample value (__v7_proc_info).
• cacheid_init()
  Obtains the cache type through read_cpuid_cachetype() and the arch information through cpu_architecture(), then sets the cache id.
  There are 4 cache types:
  PIPT (Physically indexed, physically tagged)
  VIVT (Virtually indexed, virtually tagged)
  VIPT (Virtually indexed, physically tagged)
  PIVT (Physically indexed, virtually tagged)
• printk()
  Prints to the console which type the data cache and the instruction cache are.
  ex: CPU: VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
• cpu_proc_init()
proc_info_list structure & sample value (__v7_proc_info)
arch/arm/include/asm/procinfo.h
    struct proc_info_list {
        unsigned int cpu_val;
        unsigned int cpu_mask;
        unsigned long __cpu_mm_mmu_flags; /* used by head.S */
        unsigned long __cpu_io_mmu_flags; /* used by head.S */
        unsigned long __cpu_flush; /* used by head.S */
        const char *arch_name;
        const char *elf_name;
        unsigned int elf_hwcap;
        const char *cpu_name;
        struct processor *proc;
        struct cpu_tlb_fns *tlb;
        struct cpu_user_fns *user;
        struct cpu_cache_fns *cache;
    };
arch/arm/mm/proc-v7.S
    .type __v7_proc_info, #object
    __v7_proc_info:
    .long 0x000f0000 @ Required ID value
    .long 0x000f0000 @ Mask for ID
    .long PMD_TYPE_SECT | \
          PMD_SECT_BUFFERABLE | \
          PMD_SECT_CACHEABLE | \
          PMD_SECT_AP_WRITE | \
          PMD_SECT_AP_READ
    .long PMD_TYPE_SECT | \
          PMD_SECT_XN | \
          PMD_SECT_AP_WRITE | \
          PMD_SECT_AP_READ
    b __v7_setup
    .long cpu_arch_name
    .long cpu_elf_name
    .long HWCAP_SWP|HWCAP_HALF|HWCAP_THUMB|HWCAP_FAST_MULT|HWCAP_EDSP
    .long cpu_v7_name
    .long v7_processor_functions
    .long v7wbi_tlb_fns
    .long v6_user_fns
    .long v7_cache_fns
    .size __v7_proc_info, . - __v7_proc_info
arch/arm/kernel/setup.c
static const char *proc_arch[] = {
"undefined/unknown",
"3", "4", "4T", "5",
"5T", "5TE", "5TEJ",
"6TEJ", "7", "?(11)",
"?(12)", "?(13)", "?(14)",
"?(15)", "?(16)", "?(17)",
};
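The matching rule behind lookup_processor_type and the revision shown in the printk example can be sketched in C. The table below holds only the value/mask pair from __v7_proc_info; struct cpu_match and the function names are hypothetical, and the real lookup is done in assembly over the .proc.info.init section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_match {
    uint32_t cpu_val;        /* required ID value */
    uint32_t cpu_mask;       /* mask applied to the cpuid before comparing */
    const char *cpu_name;
};

/* Value and mask taken from the __v7_proc_info sample above. */
static const struct cpu_match demo_procs[] = {
    { 0x000f0000u, 0x000f0000u, "ARMv7 Processor" },
};

/* Returns the first entry whose masked ID matches, or NULL if none does. */
static const struct cpu_match *lookup_processor(uint32_t cpuid)
{
    size_t i;

    for (i = 0; i < sizeof demo_procs / sizeof demo_procs[0]; i++) {
        if ((cpuid & demo_procs[i].cpu_mask) == demo_procs[i].cpu_val)
            return &demo_procs[i];
    }
    return NULL;
}

/* The revision printed by setup_processor: read_cpuid_id() & 15. */
static unsigned int cpu_revision(uint32_t cpuid)
{
    return cpuid & 15u;
}
```

For the cpuid 0x411fc083 from the example line, the masked value is 0x000f0000, so the v7 entry matches and the revision is 3.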
init/main.c
setup_arch, then setup_processor(), then cpu_proc_init(), then setup_machine()
cpu_proc_init() is resolved differently for the single cpu and the multi cpu case.
• Single cpu: the function called has the name formed by joining CPU_NAME and the suffix _proc_init. With CPU_NAME = cpu_v7, the function called is cpu_v7_proc_init.
• Multi cpu: the _proc_init function pointer of the processor structure is called, i.e. processor._proc_init(). processor is a member of the proc_info_list returned by lookup_processor_type(), so its values can be accessed as proc_info_list.proc. See the processor structure.
ex: function list defined in arch/arm/mm/proc-v7.S
    .type v7_processor_functions, #object
    ENTRY(v7_processor_functions)
    .word v7_early_abort
    .word pabort_ifar
    .word cpu_v7_proc_init
    .word cpu_v7_proc_fin
    .word cpu_v7_reset
    .word cpu_v7_do_idle
    .word cpu_v7_dcache_clean_area
    .word cpu_v7_switch_mm
    .word cpu_v7_set_pte_ext
    .size v7_processor_functions, . - v7_processor_functions
In this case processor._proc_init = cpu_v7_proc_init, which is defined as
    ENTRY(cpu_v7_proc_init)
    mov pc, lr
    ENDPROC(cpu_v7_proc_init)
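The two ways of reaching cpu_v7_proc_init can be sketched with the C preprocessor and a function-pointer struct. This is an illustration only: GLUE, cpu_proc_init_single, cpu_proc_init_multi and struct processor_sketch are hypothetical names, and the stub merely counts calls.

```c
#include <assert.h>   /* only needed by callers that check the results */

static int init_calls;
static void cpu_v7_proc_init(void) { init_calls++; }   /* stub for the demo */

/*
 * Single cpu style: the name is assembled at preprocessing time.
 * The extra macro level makes CPU_NAME expand before tokens are pasted.
 */
#define GLUE_(name, fn) name##fn
#define GLUE(name, fn)  GLUE_(name, fn)
#define CPU_NAME        cpu_v7
#define cpu_proc_init_single GLUE(CPU_NAME, _proc_init)

/* Multi cpu style: the function is reached through a pointer at run time. */
struct processor_sketch {
    void (*_proc_init)(void);
};

static struct processor_sketch processor = {
    ._proc_init = cpu_v7_proc_init,
};

#define cpu_proc_init_multi() processor._proc_init()
```

Calling cpu_proc_init_single() compiles to a direct call of cpu_v7_proc_init, while cpu_proc_init_multi() goes through the pointer filled in from the processor's function list.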
processor structure
arch/arm/include/asm/cpu-multi32.h
    extern struct processor {
        /* MISC: get data abort address/flags */
        void (*_data_abort)(unsigned long pc);
        /* Retrieve prefetch fault address */
        unsigned long (*_prefetch_abort)(unsigned long lr);
        /* Set up any processor specifics */
        void (*_proc_init)(void);
        /* Disable any processor specifics */
        void (*_proc_fin)(void);
        /* Special stuff for a reset */
        void (*reset)(unsigned long addr) __attribute__((noreturn));
        /* Idle the processor */
        int (*_do_idle)(void);
        /* Processor architecture specific */
        /* clean a virtual address range from the D-cache without flushing the cache. */
        void (*dcache_clean_area)(void *addr, int size);
        /* Set the page table */
        void (*switch_mm)(unsigned long pgd_phys, struct mm_struct *mm);
        /* Set a possibly extended PTE. Non-extended PTEs should ignore 'ext'. */
        void (*set_pte_ext)(pte_t *ptep, pte_t pte, unsigned int ext);
    } processor;
init/main.c
setup_arch, then setup_machine(machine_arch_type), then machine_desc.soft_reboot?
• The awk script arch/arm/tools/gen-mach-types generates the file include/asm-arm/mach-types.h, and mach-types.h defines machine_arch_type:
    #define machine_arch_type __machine_arch_type
• Through __lookup_machine_type in head-common.S, machine_arch_type leads to the location in memory of the matching machine descriptor; the value compared is the descriptor's first member, the architecture number, i.e. machine_desc.nr.
ex: arch/arm/mach-omap2/board-3430sdp.c
    MACHINE_START(OMAP_3430SDP, "OMAP3430 3430SDP board")
        /* Maintainer: Syed Khasim - Texas Instruments Inc */
        .phys_io = 0x48000000,
        .io_pg_offst = ((0xd8000000) >> 18) & 0xfffc,
        .boot_params = 0x80000100,
        .map_io = omap_3430sdp_map_io,
        .init_irq = omap_3430sdp_init_irq,
        .init_machine = omap_3430sdp_init,
        .timer = &omap_timer,
    MACHINE_END
Here the MACHINE_START macro sets .nr = MACH_TYPE_OMAP_3430SDP, which is defined in arch/arm/tools/mach-types as follows.
    omap_3430sdp    MACH_OMAP_3430SDP    OMAP_3430SDP    1138
• machine_desc.name is printed to the console with printk("Machine: %s\n", list->name)
  ex: Machine: My Company Board
See the machine_desc structure / MACHINE_START macro.
machine_desc structure / MACHINE_START macro
arch/arm/include/asm/mach/arch.h
    struct machine_desc {
        /*
         * Note! The first four elements are used
         * by assembler code in head.S, head-common.S
         */
        unsigned int nr; /* architecture number */
        unsigned int phys_io; /* start of physical io */
        unsigned int io_pg_offst; /* byte offset for io
                                   * page table entry */
        const char *name; /* architecture name */
        unsigned long boot_params; /* tagged list */
        unsigned int video_start; /* start of video RAM */
        unsigned int video_end; /* end of video RAM */
        unsigned int reserve_lp0 :1; /* never has lp0 */
        unsigned int reserve_lp1 :1; /* never has lp1 */
        unsigned int reserve_lp2 :1; /* never has lp2 */
        unsigned int soft_reboot :1; /* soft reboot */
        void (*fixup)(struct machine_desc *,
                      struct tag *, char **,
                      struct meminfo *);
        void (*map_io)(void); /* IO mapping function */
        void (*init_irq)(void);
        struct sys_timer *timer; /* system tick timer */
        void (*init_machine)(void);
    };
arch/arm/include/asm/mach/arch.h
    #define MACHINE_START(_type,_name) \
    static const struct machine_desc __mach_desc_##_type \
    __used \
    __attribute__((__section__(".arch.info.init"))) = { \
        .nr = MACH_TYPE_##_type, \
        .name = _name,
    #define MACHINE_END \
    };
init/main.c
setup_arch
• machine_desc.soft_reboot?
  If machine_desc.soft_reboot is set, reboot_setup("s") is called, which assigns reboot_mode = "s". (Soft reboot????)
• tags = phys_to_virt(__atags_pointer)
  __atags_pointer holds the atags address passed from the boot loader (stored by arch/arm/kernel/head-common.S). It is converted to a virtual address and assigned to tags. It is defined in arch/arm/kernel/setup.c as follows:
    unsigned int __atags_pointer __initdata;
  If __atags_pointer is null, machine_desc.boot_params is converted to a virtual address instead. If machine_desc.boot_params is null as well, the value of init_tags is used.
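The fallback order for the tag list can be sketched as a small selection function. Addresses are modelled as 32-bit integers; PHYS_OFFSET = 0x80000000 is an assumption chosen to fit the boot_params value 0x80000100 of the OMAP example, and select_tags is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_OFFSET 0x80000000u   /* assumption: RAM base of the example board */
#define PAGE_OFFSET 0xC0000000u   /* assumption: kernel virtual base */

static uint32_t phys_to_virt(uint32_t p)
{
    assert(p >= PHYS_OFFSET);
    return p - PHYS_OFFSET + PAGE_OFFSET;
}

/*
 * Picks the tag list in the order described above:
 * 1. the pointer passed by the boot loader,
 * 2. the machine descriptor's boot_params,
 * 3. the built-in init_tags (already a virtual address).
 */
static uint32_t select_tags(uint32_t atags_pointer, uint32_t boot_params,
                            uint32_t init_tags_virt)
{
    if (atags_pointer != 0)
        return phys_to_virt(atags_pointer);
    if (boot_params != 0)
        return phys_to_virt(boot_params);
    return init_tags_virt;
}
```

The third argument models the address of the statically built init_tags structure, which needs no conversion.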
• mdesc->fixup is null?
  If mdesc->fixup is not null, mdesc->fixup is executed; this function sets up memory-related information. Memory-related information is set up in one of the following 3 places:
  1. The machine's fixup function receives meminfo and fills it in.
  2. The boot loader passes an ATAG_MEM tag.
  3. The kernel command line passes a mem=xxx parameter.
• init_mm.start_code = _text
  init_mm.end_code = _etext
  init_mm.end_data = _edata
  init_mm.brk = _end;
  Assigns the positions of _text, _etext, _edata and _end to the mm_struct of the init task. init_mm is created statically in arch/arm/kernel/init_task.c:
    struct mm_struct init_mm = INIT_MM(init_mm);
  See the mm_struct structure & INIT_MM macro.
• parse_cmdline()
mm_struct structure & INIT_MM macro
include/linux/mm_types.h
    struct mm_struct {
        struct vm_area_struct * mmap; /* list of VMAs */
        struct rb_root mm_rb;
        struct vm_area_struct * mmap_cache; /* last find_vma result */
        unsigned long (*get_unmapped_area) (struct file *filp,
                    unsigned long addr, unsigned long len,
                    unsigned long pgoff, unsigned long flags);
        void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
        unsigned long mmap_base; /* base of mmap area */
        unsigned long task_size; /* size of task vm space */
        unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
        unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
        pgd_t * pgd;
        atomic_t mm_users; /* How many users with user space? */
        atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
        int map_count; /* number of VMAs */
        struct rw_semaphore mmap_sem;
        spinlock_t page_table_lock; /* Protects page tables and some counters */
        struct list_head mmlist;
        mm_counter_t _file_rss;
        mm_counter_t _anon_rss;
        unsigned long hiwater_rss; /* High-watermark of RSS usage */
        unsigned long hiwater_vm; /* High-water virtual memory usage */
        unsigned long total_vm, locked_vm, shared_vm, exec_vm;
        unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
        cpumask_t cpu_vm_mask;
        /* Architecture-specific MM context */
        mm_context_t context;
        unsigned int faultstamp;
        unsigned int token_priority;
        unsigned int last_interval;
        unsigned long flags; /* Must use atomic bitops to access the bits */
        struct core_state *core_state; /* coredumping support */
        /* aio bits */
        spinlock_t ioctx_lock;
        struct hlist_head ioctx_list;
    };
include/linux/init_task.h
    #define INIT_MM(name) \
    { \
        .mm_rb = RB_ROOT, \
        .pgd = swapper_pg_dir, \
        .mm_users = ATOMIC_INIT(2), \
        .mm_count = ATOMIC_INIT(1), \
        .mmap_sem = __RWSEM_INITIALIZER(name.mmap_sem), \
        .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \
        .mmlist = LIST_HEAD_INIT(name.mmlist), \
        .cpu_vm_mask = CPU_MASK_ALL, \
    }
init/main.c
setup_arch, then parse_cmdline(), then paging_init(mdesc)
parse_cmdline()
Parses the command line passed from the boot loader. The table of handlers is stored in the .init section and is accessed in units of struct early_params, from __early_begin to __early_end; whitespace is used as the delimiter.
    struct early_params {
        const char *arg;
        void (*fn)(char **p);
    };
For a matching arg (ex: console=ttySAC0,115200), the early_params.fn defined for it is called.
Todo: find out where the function defined for each arg lives and what it actually does.
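The table-driven parsing can be sketched in userspace C. The sketch assumes a single handler for mem= that accepts K and M suffixes; early_mem, demo_params and parse_cmdline_sketch are hypothetical names, and the real table is collected by the linker between __early_begin and __early_end rather than written as an array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct early_params {
    const char *arg;
    void (*fn)(char **p);
};

static unsigned long mem_bytes;   /* result of the demo mem= handler */

/* Demo handler: parses a number with an optional K or M suffix. */
static void early_mem(char **p)
{
    char *end;
    unsigned long v = strtoul(*p, &end, 0);

    if (*end == 'K' || *end == 'k') {
        v <<= 10;
        end++;
    } else if (*end == 'M' || *end == 'm') {
        v <<= 20;
        end++;
    }
    mem_bytes = v;
    *p = end;                     /* tell the caller how far we consumed */
}

static const struct early_params demo_params[] = {
    { "mem=", early_mem },
};

/* Walks the command line word by word and calls the matching handler. */
static void parse_cmdline_sketch(const char *cmdline)
{
    char buf[256];
    char *from = buf;
    size_t i;

    assert(cmdline != NULL);
    strncpy(buf, cmdline, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    while (*from != '\0') {
        for (i = 0; i < sizeof demo_params / sizeof demo_params[0]; i++) {
            size_t len = strlen(demo_params[i].arg);

            if (strncmp(from, demo_params[i].arg, len) == 0) {
                from += len;
                demo_params[i].fn(&from);
                break;
            }
        }
        while (*from != '\0' && *from != ' ')
            from++;               /* skip the rest of this word */
        while (*from == ' ')
            from++;               /* skip the delimiter */
    }
}
```

Words without a registered handler, such as console=... in this sketch, are simply skipped.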
paging_init(mdesc)
Sets up the page tables, initializes the zone memory maps, and creates the zero page, the bad page and the bad page table.
• build_mem_type_table()
  The static mem_types table consists of one entry per memory type; for each entry the architecture is checked, unneeded bits are cleared and required bits are set.
  It also sets the cache policy. If CONFIG_SMP is defined, the cache policy is set to CPOLICY_WRITEALLOC.
• printk()
  Prints the memory policy.
  ex: Memory policy: ECC disabled, Data cache writeback
• sanity_check_meminfo()
See the notes on the cache policy & the static mem_types structure array.
mem_type structure and the static mem_types structure array
arch/arm/mm/mm.h
    struct mem_type {
        unsigned int prot_pte;
        unsigned int prot_l1;
        unsigned int prot_sect;
        unsigned int domain;
    };
arch/arm/mm/mmu.c
    static struct mem_type mem_types[] = {
        [MT_DEVICE] = { /* Strongly ordered / ARMv6 shared device */
            .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_SHARED |
                        L_PTE_SHARED,
            .prot_l1 = PMD_TYPE_TABLE,
            .prot_sect = PROT_SECT_DEVICE | PMD_SECT_S,
            .domain = DOMAIN_IO,
        },
        [MT_DEVICE_NONSHARED] = { /* ARMv6 non-shared device */
            .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_NONSHARED,
            .prot_l1 = PMD_TYPE_TABLE,
            .prot_sect = PROT_SECT_DEVICE,
            .domain = DOMAIN_IO,
        },
        [MT_DEVICE_CACHED] = { /* ioremap_cached */
            .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_CACHED,
            .prot_l1 = PMD_TYPE_TABLE,
            .prot_sect = PROT_SECT_DEVICE | PMD_SECT_WB,
            .domain = DOMAIN_IO,
        },
        [MT_DEVICE_WC] = { /* ioremap_wc */
            .prot_pte = PROT_PTE_DEVICE | L_PTE_MT_DEV_WC,
            .prot_l1 = PMD_TYPE_TABLE,
            .prot_sect = PROT_SECT_DEVICE,
            .domain = DOMAIN_IO,
        },
        [MT_UNCACHED] = {
            .prot_pte = PROT_PTE_DEVICE,
            .prot_l1 = PMD_TYPE_TABLE,
            .prot_sect = PMD_TYPE_SECT | PMD_SECT_XN,
            .domain = DOMAIN_IO,
        },
        [MT_CACHECLEAN] = {
            .prot_sect = PMD_TYPE_SECT | PMD_SECT_XN,
            .domain = DOMAIN_KERNEL,
        },
        [MT_LOW_VECTORS] = {
            .prot_pte = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY |
                        L_PTE_EXEC,
            .prot_l1 = PMD_TYPE_TABLE,
            .domain = DOMAIN_USER,
        },
        [MT_HIGH_VECTORS] = {
            .prot_pte = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY |
                        L_PTE_USER | L_PTE_EXEC,
            .prot_l1 = PMD_TYPE_TABLE,
            .domain = DOMAIN_USER,
        },
        [MT_MEMORY] = {
            .prot_sect = PMD_TYPE_SECT | PMD_SECT_AP_WRITE,
            .domain = DOMAIN_KERNEL,
        },
        [MT_ROM] = {
            .prot_sect = PMD_TYPE_SECT,
            .domain = DOMAIN_KERNEL,
        },
        [MT_MEMORY_NONCACHED] = {
            .prot_sect = PMD_TYPE_SECT | PMD_SECT_AP_WRITE,
            .domain = DOMAIN_KERNEL,
        },
    };
cachepolicy structure
arch/arm/mm/mmu.c
    struct cachepolicy {
        const char policy[16];
        unsigned int cr_mask;
        unsigned int pmd;
        unsigned int pte;
    };
arch/arm/mm/mmu.c
    static struct cachepolicy cache_policies[] __initdata = {
        {
            .policy = "uncached",
            .cr_mask = CR_W|CR_C,
            .pmd = PMD_SECT_UNCACHED,
            .pte = L_PTE_MT_UNCACHED,
        }, {
            .policy = "buffered",
            .cr_mask = CR_C,
            .pmd = PMD_SECT_BUFFERED,
            .pte = L_PTE_MT_BUFFERABLE,
        }, {
            .policy = "writethrough",
            .cr_mask = 0,
            .pmd = PMD_SECT_WT,
            .pte = L_PTE_MT_WRITETHROUGH,
        }, {
            .policy = "writeback",
            .cr_mask = 0,
            .pmd = PMD_SECT_WB,
            .pte = L_PTE_MT_WRITEBACK,
        }, {
            .policy = "writealloc",
            .cr_mask = 0,
            .pmd = PMD_SECT_WBWA,
            .pte = L_PTE_MT_WRITEALLOC,
        }
    };
init/main.c
setup_arch, then sanity_check_meminfo(), then prepare_page_table(), then bootmem_init()
sanity_check_meminfo()
For example, meminfo is filled in by the fixup function defined in the mach-pxa machine descriptor, and the sanity check is done here. For each bank, the part of bank->size that extends beyond VMALLOC_MIN is cut off and the bank is resized.
    VMALLOC_MIN = VMALLOC_END - vmalloc_reserve (default 128MB)
    bank->size = VMALLOC_MIN - __va(bank->start)
Figure: a bank begins at bank->start, at or above PAGE_OFFSET. If its original bank->size runs past VMALLOC_MIN, the size is reduced so that the bank ends exactly at VMALLOC_MIN.
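The resize rule can be sketched as a clamp on the bank size. The constants are assumptions for illustration (VMALLOC_END in particular is made up for this sketch), and struct membank is reduced to the two fields used here.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_OFFSET     0x50000000u          /* assumption */
#define PAGE_OFFSET     0xC0000000u          /* assumption */
#define VMALLOC_END     0xE0000000u          /* assumption for this sketch */
#define VMALLOC_RESERVE (128u << 20)         /* default 128MB */
#define VMALLOC_MIN     (VMALLOC_END - VMALLOC_RESERVE)

struct membank {
    uint32_t start;   /* physical start address */
    uint32_t size;    /* size in bytes */
};

/* Linear physical-to-virtual conversion, standing in for __va(). */
static uint32_t va(uint32_t phys)
{
    return phys - PHYS_OFFSET + PAGE_OFFSET;
}

/* Cuts the bank so that its mapping ends at VMALLOC_MIN at the latest. */
static void clamp_bank(struct membank *bank)
{
    uint32_t vstart = va(bank->start);

    assert(vstart >= PAGE_OFFSET && vstart < VMALLOC_MIN);
    if (bank->size > VMALLOC_MIN - vstart)
        bank->size = VMALLOC_MIN - vstart;
}
```

With these constants VMALLOC_MIN is 0xD8000000, so at most 384MB starting at PAGE_OFFSET can stay mapped.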
A fixup table (a relocation table) can also be provided in the header of the object code file. Each "fixup" is a pointer to an address in the object code that must be changed when the loader relocates the program.
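prepare_page_table()
Initializes (clears) the pgd page table, broadly in the following steps:
1. From virtual address 0 up to PAGE_OFFSET - 16MB.
2. If CONFIG_XIP_KERNEL is defined, the address to continue from is set past the kernel text: _etext + 1MB, aligned to 1MB.
3. If CONFIG_XIP_KERNEL is not defined, the pgd table entries are initialized up to PAGE_OFFSET.
4. After skipping the size of bank[0], the pgd table entries are initialized up to VMALLOC_END.
todo: write up memory bank, node and vmalloc area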
init/main.c
setup_arch, then bootmem_init()
Call flow inside bootmem_init(): check_initrd(), then for_each_node { bootmem_init_node(), reserve_node_zero(), bootmem_reserve_initrd() }, then sparse_init(), then bootmem_free_node() (next page)

check_initrd()
Finds the node that holds the ramdisk image. It loops over the banks in meminfo and checks the following:
    bank_phys_start(bank) <= phys_initrd_start
    && end <= bank_phys_end(bank)
It returns the bank->node of the bank that contains this range. Here end = phys_initrd_start + phys_initrd_size, and the values of phys_initrd_start and phys_initrd_size have already been set by parse_tag_initrd().

for_each_node
Loops MAX_NUMNODES times, performing the following for each node.

reserve_node_zero()
Takes the pgdata of node = 0 and calls reserve_bootmem_node(). The arguments passed to reserve_bootmem_node are:
    pgdata: the pgdata of the node (node id = 0)
    __pa(_stext): physical memory address of _stext
    _end - _stext: size of the text + data + bss region
reserve_bootmem_node in turn calls mark_bootmem_node() to reserve the corresponding page frames.
    |------------------| <----- _end
    |       bss        |
    |------------------|
    |       data       |
    |------------------|
    |       text       |
    |------------------| <----- _stext

bootmem_reserve_initrd()
Passes the pgdata of the node and the physical start address and size of the initrd to reserve_bootmem_node, which reserves the page frames of the initrd.

Note in the diagram between bootmem_reserve_initrd() and sparse_init(): performed only when CONFIG_NUMA is defined.

If CONFIG_MMU is set, a struct map_desc md structure is filled in for each bank.
init/main.c
map_desc구조체를 다음과 같이 채운 후,
setup_arch
map.pfn = bank_pfn_start(bank); /* bank의 시작 pfn */
map.virtual = __phys_to_virt(bank_phys_start(bank)); /* 가상주소 */
map.length = bank_phys_size(bank); /* bank 크기 */
map.type = MT_MEMORY; /* 메모리 타입 */
bootmem_init()
create_mapping 함수를 콜
for_each_node
bootmem_init_node()
arch/arm/include/asm/mach/map.h
map_memory_bank()
Set each pgd entry (covering 1MB each) to a section descriptor value.
ex: phys | type->prot_sect (see ARM Developer, English edition, p. 503)
Loop over the
banks of the node
return
create_mapping()
Flush the pmd entry that was just set out to memory, i.e. clean its cache line,
by calling flush_pmd_entry().
flush_pmd_entry() consists of three instructions:
if TLB_DCLEAN, clean the data cache line
asm("mcr p15, 0, %0, c7, c10, 1 @ flush_pmd"
: : "r" (pmd) : "cc");
if TLB_L2CLEAN_FR
asm("mcr p15, 1, %0, c15, c9, 1 @ L2 flush_pmd"
: : "r" (pmd) : "cc");
struct map_desc {
unsigned long virtual;
unsigned long pfn;
unsigned long length;
unsigned int type;
};
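for_each_nodebank
return
1. Check whether the virtual address in map_desc falls in the user
region and whether the memory type is appropriate,
2. check whether the memory bank described by map_desc is sized
in MB units,
3. find the pgd entry that corresponds to the start of the
memory bank,
4. then loop in 2MB steps, calling
alloc_init_section() for each step.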
alloc_init_section()
if TLB_WB
asm("dsb" : : : "memory");
If the physical start or end address is not aligned to a
SECTION boundary, alloc_init_pte() is called to update the
page table entries instead.
bootmem_bootmap_pages()
Next page
Memory Types
arch/arm/include/asm/mach/map.h /* types 0-3 are defined in asm/io.h */
#define MT_UNCACHED 4
#define MT_CACHECLEAN 5
#define MT_MINICLEAN 6
#define MT_LOW_VECTORS 7
#define MT_HIGH_VECTORS 8
#define MT_MEMORY 9
#define MT_ROM 10
#define MT_MEMORY_NONCACHED 11
init/main.c
setup_arch
Compute the number of page frames from the node's start and end
frames, then compute the number of bitmap pages needed to
represent those frames.
bootmem_init()
for_each_node
bootmem_init_node()
bootmem_bootmap_pages()
Find the page frame number where the node's
bitmap page frames will be stored.
The bitmap page frames are placed relative to
the BSS region, at the earliest in the page
frame right after the end of BSS.
find_bootmap_pfn()
Mark the node online by setting its bit in the nodemasks
structure. nodemasks is a bitmap over the nodes in the
system, with one bit per node.
The states a node can be in are defined as
follows.
node_set_online()
NODE_DATA()
include/linux/nodemask.h
/* * Bitmasks that are kept for all the nodes. */
enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
N_NORMAL_MEMORY, /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
N_HIGH_MEMORY, /* The node has regular or high memory */
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};
init/main.c
Returns the pg_data_t structure of the node.
It only hands back a pointer to statically allocated memory; nothing is initialized yet.
setup_arch
Information about a node is stored in a pg_data_t structure,
and all nodes in the system are managed through the
pg_data_t array discontig_node_data. pg_data_t
describes the zones belonging to a node at a higher
level than the zone structure does.
bootmem_init()
for_each_node
Memory statistics and page replacement data structures
are managed per zone (see the zone
structure).
bootmem_init_node()
NODE_DATA()
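This function is called with the following arguments:
pgdat : pg_data_t of the current node, not yet initialized
boot_pfn : pfn where the current node's bitmap will be placed
start_pfn: first pfn of the current node
end_pfn : last pfn of the current node + 1,
i.e. the first pfn of the next node
It simply forwards the arguments it received to
init_bootmem_core().
Using the bdata passed in, link bdata->list into the
bdata_list linked list. The list is kept sorted in
ascending order of node_min_pfn, the first pfn
of each node.
See the linked-list construction figure.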
Compute how many bytes are needed to represent the node's page
frames with one bit each, then set every bit to 1,
i.e. mark every page as in use.
unsigned long bytes = (pages + 7) / 8;
return ALIGN(bytes, sizeof(long));
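As a worked example of the bitmap sizing above, here is a small user-space sketch. The 4KB page size and the names `ALIGN_UP`, `bootmap_bytes()` and `bootmap_pages()` are assumptions made for illustration; the kernel's own ALIGN macro is only mimicked here.

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SHIFT 12                     /* assumption: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* Bytes needed for a bitmap with one bit per page frame. */
static unsigned long bootmap_bytes(unsigned long pages)
{
    unsigned long bytes;

    assert(pages <= ULONG_MAX - 7);       /* guard the rounding below */
    bytes = (pages + 7) / 8;
    return ALIGN_UP(bytes, sizeof(long));
}

/* Whole pages needed to hold that bitmap. */
static unsigned long bootmap_pages(unsigned long pages)
{
    return (bootmap_bytes(pages) + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

For a 128MB node (32768 page frames of 4KB) the bitmap takes 4096 bytes, which fits in exactly one page; a single extra page frame pushes it to two bitmap pages.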
Run a sanity check that the range lies within the maximum
physical memory address, then set the bdata fields of
pgdat, the pg_data_t that represents the node.
init_bootmem_node()
bdata->node_bootmem_map = virtual address of the
page frame where the bitmap starts
bdata->node_min_pfn = first pfn of the node
bdata->node_low_pfn = end pfn of the node
free_bootmem_node()
return
init_bootmem_core()
link_bootmem()
return
link_bootmem()
Todo: bdata->list linked list
[Figure: building the bdata->list linked list. Panels: initial state (bdata_list only); after calling list_add_tail(&bdata->list, iter); after finding a middle item and calling list_add_tail(&bdata->list, iter). Each element is a struct list_head with prev and next pointers.]
typedef struct bootmem_data {
unsigned long node_min_pfn; /* first pfn of the node */
unsigned long node_low_pfn; /* last pfn of the node */
void *node_bootmem_map; /* management bitmap */
unsigned long last_end_off; /* position of the last allocation */
unsigned long hint_idx; /* page to use next (hint) */
struct list_head list; /* links the bootmem_data of other nodes */
} bootmem_data_t;
init/main.c
setup_arch
Round the bank's start address up to the next page boundary to get
the start pfn, and round the bank's end address down to a page
boundary to get the end pfn.
bootmem_init()
for_each_node
start = PFN_UP(physaddr)
end = PFN_DOWN(physaddr + size)
bootmem_init_node()
for_each_node
Calls mark_bootmem_node() to mark every bit from start to end
in bdata->node_bootmem_map, the bitmap that represents the
node's pfns, as available
(i.e. clears the bits to 0).
free_bootmem_node()
include/linux/pfn.h
#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
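To see how the two macros pick out only the whole pages of a bank, here is a small sketch. A 4KB page size is assumed, and `struct pfn_range` and `bank_to_pfns()` are hypothetical helpers introduced for the example.

```c
#include <assert.h>

#define PAGE_SHIFT 12                     /* assumption: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Hypothetical helper type: a half-open pfn range [start, end). */
struct pfn_range {
    unsigned long start;
    unsigned long end;
};

/* Page frames that lie completely inside [physaddr, physaddr + size). */
static struct pfn_range bank_to_pfns(unsigned long physaddr, unsigned long size)
{
    struct pfn_range r;

    assert(physaddr + size >= physaddr);  /* no address wrap-around */
    r.start = PFN_UP(physaddr);           /* round the start up */
    r.end = PFN_DOWN(physaddr + size);    /* round the end down */
    return r;
}
```

A bank starting at 0x50000800 with size 0x2000 contains only one complete page frame, pfn 0x50001, so the partial pages at both ends are never handed out.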
reserve_bootmem_node()
Reserve the page frames that hold the node's bitmap by marking
them as in use.
init/main.c
setup_arch
bootmem_init()
for_each_node
bootmem_free_node()
Steps:
1. Get the pgdat of the node,
2. get the node's start and end pfn,
3. initialize the zone_size and zhole_size arrays, which hold the size of
each ZONE and the size of the holes in each ZONE,
4. walk the banks in the node and sum up the page frames of each
bank,
5. which gives the number of page frames actually present in the node,
6. and use the total span of the node, derived from its start and end pfn,
to work out how large the node's holes are.
zone_size[0] = end_pfn - start_pfn
zhole_size[0] = zone_size[0] - sum(bank_pfn_size(banks of the node))
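The two formulas above can be tried out with a short sketch. `struct bank_pfns` and `node_zone_sizes()` are hypothetical names for illustration; only a single zone (index 0) is modelled, as on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bank description: half-open pfn range [start_pfn, end_pfn). */
struct bank_pfns {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Compute the spanned size and the hole size of a node from its banks. */
static void node_zone_sizes(const struct bank_pfns *banks, size_t n,
                            unsigned long *zone_size,
                            unsigned long *zhole_size)
{
    unsigned long start_pfn, end_pfn, present = 0;

    assert(n > 0);
    start_pfn = banks[0].start_pfn;
    end_pfn = banks[0].end_pfn;
    for (size_t i = 0; i < n; i++) {
        if (banks[i].start_pfn < start_pfn)
            start_pfn = banks[i].start_pfn;
        if (banks[i].end_pfn > end_pfn)
            end_pfn = banks[i].end_pfn;
        /* Sum of the page frames that really exist. */
        present += banks[i].end_pfn - banks[i].start_pfn;
    }
    *zone_size = end_pfn - start_pfn;     /* span, holes included */
    *zhole_size = *zone_size - present;   /* what the banks do not cover */
}
```

Two 64MB banks at pfn 0x50000 and 0x58000 span 0xC000 page frames, of which 0x4000 are a hole between the banks.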
return
free_area_init_node()
Initialize the fields of the zone data structures. During initialization,
mark all pages reserved, empty all memory queues,
and clear the memory bitmaps.
calculate_node_totalpages()
setup_usemap(pgdat, zone, size) computes the size of the
usemap. The usemap is the memory needed to manage a zone in
units of page blocks, and zone->pageblock_flags
points to it.
Under bootmem, memory is managed with a bitmap. Later, the
node is managed page by page through pgdat->node_mem_map,
and each zone is managed in units of page blocks.
The mobility attribute of each page block is kept in the
pageblock_flags field.
Page Mobility Types:
-Unmovable
-Reclaimable
-Movable
alloc_node_mem_map()
return
free_area_init_core()
Update the node_spanned_pages and
node_present_pages fields of pgdat.
pgdat->node_spanned_pages
is set to the total number of pages,
including holes.
pgdat->node_present_pages is set to
the number of pages actually present,
excluding holes.
When CONFIG_FLAT_NODE_MEM_MAP
is defined, allocate the page array
and assign it to the node_mem_map
field of pgdat.
Question: what is the node_mem_map
field used for?
init/main.c
Allocate PAGE_SIZE bytes of memory and return the start address.
If the architecture defines ARCH_LOW_ADDRESS_LIMIT, the memory
is allocated below that limit; if it is not defined,
0xffffffffUL is used.
./mm/bootmem.c:983: #ifndef ARCH_LOW_ADDRESS_LIMIT
./mm/bootmem.c:984: #define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
setup_arch
paging_init()
devicemaps_init
alloc_bootmem_low_pages(
PAGE_SIZE)
for loop from VMALLOC_END upward
pmd_clear(pmd_off_k(addr))
PGDIR_SIZE = 1UL << 21
in 2MB steps
Clear every pgd entry that points to an address above VMALLOC_END.
Early in boot the pgd was created and only the kernel code
region was mapped. devicemaps_init appears to clear the region above
VMALLOC_END and then map the device maps.
Fill in a mapping structure for the high vectors and update the pgd
and pte entries, mapping address 0xffff0000 to the physical address
of the vectors page.
create_mapping(&map)
Note: because map.length is smaller than 1MB,
create_mapping()->alloc_init_section() branches into the path that
creates a pte table.
!vectors_high() ?
create_mapping(&map)
If vectors_high() is false, fill in a mapping structure with virtual
address 0 and memory type MT_LOW_VECTORS, then call
create_mapping() to update the pgd and pte
entries.
mdesc->map_io()
It would seem better to call vectors_high() and branch before running
the code above; it is unclear why it is done this way.
In effect, the pte entries for 0xffff0000 and 0x00000000 point to the
same page.
Map I/O devices into the pgd/pte entries.
If the I/O mapping function mdesc->map_io is not NULL,
call mdesc->map_io().
init/main.c
Defined per machine; see the example below.
example: ./arch/arm/mach-s3c6410/mach-smdk6410.c
setup_arch
paging_init()
devicemaps_init
mdesc->map_io()
MACHINE_START(SMDK6410, "SMDK6410")
/* Maintainer: Ben Dooks <[email protected]> */
.phys_io = S3C_PA_UART & 0xfff00000,
.io_pg_offst = (((u32)S3C_VA_UART) >> 18) & 0xfffc,
.boot_params = S3C64XX_PA_SDRAM + 0x100,
.init_irq = s3c6410_init_irq,
.map_io = smdk6410_map_io,
.init_machine = smdk6410_machine_init,
.timer = &s3c24xx_timer,
MACHINE_END
static void __init smdk6410_map_io(void)
{
s3c64xx_init_io(smdk6410_iodesc, ARRAY_SIZE(smdk6410_iodesc));
s3c24xx_init_clocks(12000000);
s3c24xx_init_uarts(smdk6410_uartcfgs, ARRAY_SIZE(smdk6410_uartcfgs));
}
local_flush_tlb_all()
flush_cache_all()
kmap_init()
top_pmd =
pmd_off_k(0xffff0000)
Flush the TLB and caches so that the updated entries take effect.
The function actually executed when flush_cache_all() is called
lives in the arch/arm/mm/cache-*.S and arch/arm/mm/proc-*.S
files.
CONFIG_HIGHMEM is set through Kconfig.
If CONFIG_HIGHMEM is not defined, kmap_init() does
nothing.
High memory support on ARM is experimental. The HIGHMEM support
defined here is meant for systems that have more than 4GB of
memory.
0xffff0000 is the virtual address of the high vectors. top_pmd is
the pmd for this address.
init/main.c
Set up the page tables and initialize the maps that manage zone memory.
Finally, create the zero page.
setup_arch
zero_page =
alloc_bootmem_low_pages(PAGE_SIZE)
Create the zero page.
The zero page is a globally shared page that always contains zeros and
is used for zero-mapped memory regions.
zero_page is a virtual address.
empty_zero_page is the address of the page structure that tracks the state of zero_page.
empty_zero_page is a special page that is used for
zero-initialized data (BSS?) and COW
empty_zero_page =
virt_to_page(zero_page)
Store the address of the zero page's page structure in empty_zero_page.
paging_init()
flush_dcache_page(empty_zero_page)
request_standard_resources(&meminfo, mdesc)
smp_init_cpus()
cpu_init()
init_arch_irq = mdesc->init_irq
system_timer = mdesc->timer
init_machine = mdesc->init_machine
early_trap_init()
Flush the zero page.
Build separate resource trees for iomem and ioport to record
the memory that has been assigned.
Count the CPUs and set the corresponding bits in cpu_possible_map.
Set up the exception handler stacks used by each CPU mode.
Assign the function pointers defined per machine architecture.
ex) arch/arm/mach/mach-smdk6410.c
Copy the vectors, stubs, and kuser helpers into the vectors page in that order.
Copy the 7 sigreturn codes to vectors + 0x500.
vectors -> The base address of exception vectors.
init/main.c
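Main operations of setup_arch()
1. Parse the ATAGs passed from the boot loader by calling the parsing function for each tag. The ATAGs
carry ramdisk information, cmdline, memory information, page size, and so on.
2. Look up the matching machine descriptor structure using the machine architecture number.
3. Record the virtual memory layout in init_mm (an mm_struct) of the process that drives booting.
4. Set up the page tables and initialize the maps that manage system memory.
5. Build resource trees for iomem and ioport to organize the assigned memory hierarchically.
6. Set up the exception handler stacks used by each CPU mode.
7. Copy handler code and related data into the vector table.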
init/main.c
command_line points to the part of the cmdline that
setup_arch() did not process.
Initialize the owner member of init_mm (an mm_struct)
to init_task (a task_struct).
task_struct -> describes information about a
process
mm_struct -> describes the memory space of a
process
ex) console=ttySAC0,115200 root=/dev/ram
init=/bin/bash initrd=0x51000000,4M
So by default each process has one of each structure,
and the two point at each other.
mm_init_owner()
Save the full command line before parsing (untouched)
and the remaining, not yet parsed command line (touched)
in separate variables.
setup_command_line()
setup_per_cpu_areas()
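Find the number of the last set bit in the possible map, plus 1.
Since the bitmap is filled from bit 0 upward, the number of the
last set bit + 1 equals the number of possible
CPUs.
When N processes share an address space, N
task_structs point to a single
mm_struct. In that case, which of the N
task_structs does the mm_struct point to?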
setup_nr_cpu_ids()
Allocate memory for each CPU and copy the percpu variables
stored in the .data.percpu section into it.
smp_prepare_boot_cpu()
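1. Initialize the runqueue, cfs_rq, and rt_rq of each CPU.
2. Register run_rebalance_domains as the function to call
when SCHED_SOFTIRQ is raised.
3. Turn the current thread into the idle thread, but
assign fair_sched_class as its sched_class so that
it is treated as a normal task during boot.
4. Set scheduler_running to 1 to signal that the scheduler
can be used.
Question
sched_init()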
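Per-CPU variables are used to minimize the performance
loss that cache coherency traffic can cause.
Get the number of the CPU the current thread is running on.
Assign the task_struct of the current thread to the idle member
of cpu_data, the per_cpu data of that CPU.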
Scheduling Policy
init/main.c
Disable preemption.
preempt_disable()
build_all_zonelists()
set_zonelist_order()
1. Build two zonelists for every online
node.
pgdat->zonelist[0] - built according to the ordering
chosen in set_zonelist_order()
pgdat->zonelist[1] - built from the zones of that node
only
2. Create the fullzones bitmap and the z_to_n array
for caching purposes.
fullzones bitmap -> represents the state of every zone in the
fallback list (full=1, not full=0) as a bitmap. Finding the
leftmost clear bit in this bitmap and using it to index the
z_to_n array quickly locates the best available
node.
Set the bits for every node in the system in the
mems_allowed bitmap of the current task. This means the
current task may request memory from any node.
__build_all_zonelists()
mminit_verify_zonelist()
cpuset_init_current_mems_allowed()
Page group by mobility check
Check whether the system has enough memory for the page group by
mobility mechanism to work, and configure it accordingly.
Decide the ordering of the zonelists that
build_all_zonelists() will create.
ZONE ORDER: NODE0.ZONE_NORMAL,
NODE1.ZONE_NORMAL,
NODE2.ZONE_NORMAL,
NODE0.ZONE_DMA,
NODE1.ZONE_DMA...
Advantage = OOM is less likely to occur
Disadvantage = does not give the best locality
NODE ORDER:
NODE0.ZONE_NORMAL,
NODE0.ZONE_DMA,
NODE1.ZONE_NORMAL,
NODE1.ZONE_DMA,
NODE2.ZONE_NORMAL...
Advantage = best locality
Disadvantage = OOM may occur
Therefore, if the pages in the DMA and DMA32 zones make up
more than half of all pages in the system, the system is
considered relatively safe from OOM and NODE
ORDER is chosen; otherwise
ZONE ORDER is chosen.
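The selection rule above can be written as a very small decision function. This is a simplified sketch, not the kernel's routine: the enum and `pick_zonelist_order()` are hypothetical names, and the extra "no low zone at all" case is an assumption added from memory of the kernel source rather than something stated on the slide.

```c
#include <assert.h>

/* Hypothetical names for the two orderings discussed above. */
enum zonelist_order { ORDER_NODE, ORDER_ZONE };

/*
 * low_pages:   present pages in the DMA/DMA32 zones of all online nodes
 * total_pages: present pages in all zones of all online nodes
 */
static enum zonelist_order pick_zonelist_order(unsigned long low_pages,
                                               unsigned long total_pages)
{
    assert(low_pages <= total_pages);

    /* Assumption (not on the slide): with no DMA area, node order is used. */
    if (low_pages == 0)
        return ORDER_NODE;
    /* DMA/DMA32 is more than half of memory: OOM risk is low. */
    if (low_pages > total_pages / 2)
        return ORDER_NODE;
    /* Small DMA/DMA32 zones: protect them by using zone order. */
    return ORDER_ZONE;
}
```

For example, 600 low pages out of 1000 selects node order, while 100 out of 1000 selects zone order.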
Print information about every zone in the system.
Only zones that belong to an online node and have
present pages are included.
The output is produced only when the mminit_loglevel
passed on the command line is MMINIT_VERIFY or
higher.
Page group by mobility
init/main.c
Print the contents of boot_command_line,
which holds the untouched command line.
page_alloc_init()
printk(..,
boot_command_line);
parse_early_param()
Parameters in static_command_line that could not be
handled are stored in envp_init[]. The stored
parameters are processed by init.
parse_args("Booting kernel"..)
Create a notifier block named page_alloc_cpu_notify_nb and
register it on cpu_chain, a raw_notifier_chain.
page_alloc_cpu_notify() is set as the callback function
and the priority is set to 0.
Call do_early_param() to parse only the console-related
parameters from boot_command_line.
init/main.c
if (!irqs_disabled())
Check whether IRQs are enabled. If they are,
print a warning message and disable them.
sort_main_extable()
When CONFIG_MMU is defined, sort the entries of the __ex_table
section by the address stored in them (the address of an
instruction). Meaningless when CONFIG_MMU is not
defined. e.g. the section definition can be found in
copy_from_user.S.
trap_init
Does nothing on ARM.
rcu_init
__rcu_init
Call rcu_online_cpu to initialize the RCU data structures
of the CPU and register the callback function for
RCU_SOFTIRQ. Finally, add the notifier block nb at the
appropriate position in cpu_chain.
hotcpu_notifier
Create a notifier block for rcu_barrier_cpu_hotplug,
register the rcu_barrier_cpu_hotplug function
in that block, and
set its priority to 0. Finally, add the notifier
block at the appropriate position in cpu_chain.
Memory Types
init/main.c
early_irq_init
Initializes the global array struct irq_desc irq_desc[NR_IRQS].
Allocates memory for the global irq_default_affinity and
sets all nr_cpu_ids-1 bits. Walks irq_desc for ARRAY_SIZE
entries, allocating the memory needed for the affinity member and
pointing kstat_irqs at the statically declared kstat_irqs_all.
Walks irq_desc for NR_IRQS entries, setting status to IRQ_NOREQUEST |
IRQ_NOPROBE, and if CONFIG_SMP is set,
sets nr_cpu_ids-1 bits in bad_irq_desc.affinity.
Then calls init_arch_irq, which setup_arch set from mdesc->init_irq,
to run the interrupt initialization specific to the platform.
init_IRQ
init_arch_irq
Initialize the interrupt controller and set the
handle_irq member (the handler function) of
irq_desc[irq] for each irq to the function
appropriate for the platform.
For s3c6410:
s3c6410_init_irq
To Do
Hardware dependent, but a more detailed
analysis is needed.
The irq_desc structure and the static irq_desc[NR_IRQS] array
include/linux/irq.h
kernel/irq/handle.c
struct irq_desc irq_desc[NR_IRQS]
__cacheline_aligned_in_smp = {
[0 ... NR_IRQS-1] = {
.status = IRQ_DISABLED,
.chip = &no_irq_chip,
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock),
}
};
This structure is referenced by both early_irq_init and
init_IRQ. It is a key data structure for interrupt
handling.
struct irq_desc {
unsigned int irq;
struct timer_rand_state *timer_rand_state;
unsigned int *kstat_irqs;
#ifdef CONFIG_INTR_REMAP
struct irq_2_iommu *irq_2_iommu;
#endif
irq_flow_handler_t handle_irq;
struct irq_chip *chip;
struct msi_desc *msi_desc;
void *handler_data;
void *chip_data;
struct irqaction *action; /* IRQ action list */
unsigned int status; /* IRQ status */
unsigned int depth; /* nested irq disables */
unsigned int wake_depth; /* nested wake enables */
unsigned int irq_count; /* For detecting broken IRQs */
unsigned long last_unhandled; /* Aging timer for unhandled count */
unsigned int irqs_unhandled;
spinlock_t lock;
#ifdef CONFIG_SMP
cpumask_var_t affinity;
unsigned int cpu;
#ifdef CONFIG_GENERIC_PENDING_IRQ
cpumask_var_t pending_mask;
#endif
#endif
atomic_t threads_active;
wait_queue_head_t wait_for_threads;
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
#endif
const char *name;
} ____cacheline_internodealigned_in_smp;
init/main.c
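1. Compute the size of the kernel pages in megabytes.
2. Based on the amount of system memory, set the number of
hash table slots to between 16 and 4096. With 1GB
or more, the hash table gets 4096
slots.
3. Allocate memory for the hash table and initialize the
first member of each entry to NULL.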
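A sketch of how the slot count could be derived from the memory size follows. The clamping to 16..4096 slots comes from the text above; the `megabytes * 4` scaling and the use of a find-last-set helper are assumptions recalled from kernel/pid.c of that era, and `fls_ul()` and `pidhash_slots()` are hypothetical names.

```c
#include <assert.h>

/* Position of the highest set bit, counting from 1; 0 if x == 0. */
static int fls_ul(unsigned long x)
{
    int r = 0;

    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

/* Number of pid hash slots for a system with the given memory size. */
static unsigned int pidhash_slots(unsigned long megabytes)
{
    int shift = fls_ul(megabytes * 4);    /* assumption: scale by 4 */

    if (shift < 4)
        shift = 4;                        /* at least 16 slots */
    if (shift > 12)
        shift = 12;                       /* at most 4096 slots */
    assert(shift >= 4 && shift <= 12);
    return 1U << shift;
}
```

Under these assumptions a 128MB system gets 1024 slots and a 1GB system hits the 4096-slot ceiling.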
pid_hash_init
init_timers
timer_cpu_notify
register_cpu_notifier
open_softirq
Add timers_nb to cpu_chain, the raw notifier
chain.
Set the action member of the TIMER_SOFTIRQ
entry in the global softirq_vec array
to run_timer_softirq.
This function will then be called whenever
TIMER_SOFTIRQ is raised.
include/linux/interrupt.h
struct softirq_action
{
void (*action)(struct softirq_action *);
};
kernel/softirq.c
struct softirq_action softirq_vec[NR_SOFTIRQS] ;
init/main.c
The work done depends on which of the four action types is received.
init_timers
CPU_UP_PREPARE:
CPU_UP_PREPARE_FROZEN:
CPU_DEAD:
CPU_DEAD_FROZEN:
timer_cpu_notify
init_timer_cpu
struct tvec_base {
spinlock_t lock;
struct timer_list *running_timer;
unsigned long timer_jiffies;
// Handlers registered with the timer are sorted by
// timer interval and spread across five timer
// vectors: tv1 holds handlers whose interval is within
// about 1 second, tv2 those within about 1 minute,
// and so on.
struct tvec_root tv1; // about 1 second
struct tvec tv2; // about 1 minute
struct tvec tv3; // about 1 hour
struct tvec tv4; // about 3 days
struct tvec tv5; // about 199 days
} ____cacheline_aligned;
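The time spans in the comments above can be reproduced with a little arithmetic. The sketch assumes TVR_BITS = 8 for tv1 and TVN_BITS = 6 for tv2..tv5, and HZ = 250; none of these values are stated on the slide, but together they match the "1 second ... 199 days" figures. The function names are hypothetical.

```c
#include <assert.h>

#define TVR_BITS 8   /* assumption: tv1 has 2^8 slots */
#define TVN_BITS 6   /* assumption: tv2..tv5 have 2^6 slots each */

/* Number of jiffies covered by timer vectors tv1..tv<level>. */
static unsigned long long tv_span_jiffies(int level)
{
    assert(level >= 1 && level <= 5);
    return 1ULL << (TVR_BITS + (level - 1) * TVN_BITS);
}

/* The same span in whole seconds for a given tick rate. */
static unsigned long long tv_span_seconds(int level, unsigned int hz)
{
    assert(hz > 0);
    return tv_span_jiffies(level) / hz;
}
```

With HZ = 250, tv1 covers 256 jiffies (about 1 second), tv2 about 65 seconds, tv3 about 70 minutes, tv4 about 3.1 days, and tv5 2^32 jiffies, which is just under 199 days.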
Initialize boot_done to 1,
set timer_jiffies of the
globally declared struct tvec_base
boot_tvec_bases to the
system jiffies value, and
initialize the other members
of the structure. When
called outside boot time,
a base is allocated and
goes through the same
steps.
migrate_timers
Migrate the
tvec_base of the
given cpu to the
tvec_base of the
current cpu
init/main.c
init_hrtimers
hrtimer_cpu_notify
register_cpu_notifier
Add hrtimers_nb to cpu_chain, the raw notifier
chain.
open_softirq
Register run_hrtimer_softirq to be called
when HRTIMER_SOFTIRQ is raised.
softirq_init
1. Initialize tasklet_vec, tasklet_hi_vec, and
softirq_work_list of each CPU.
2. Add remote_softirq_cpu_notifier at the
appropriate position in cpu_chain.
3. Register the callback function to call when
each softirq is raised.
init/main.c
The work done depends on which of the six action types is received.
init_hrtimers
CPU_UP_PREPARE:
CPU_UP_PREPARE_FROZEN:
hrtimer_cpu_notify
CPU_DYING:
CPU_DYING_FROZEN:
CPU_DEAD:
CPU_DEAD_FROZEN:
clockevents_notify
init_hrtimer_cpu
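Initialize the 2 clock
bases (CLOCK_REALTIME and
CLOCK_MONOTONIC).
Disable high resolution mode
and set the expiry time.
CLOCK_REALTIME: computes time
relative to the current wall-clock
time (affected when the system
time is changed)
CLOCK_MONOTONIC: computes time
relative to how long the kernel
has been running (not affected by
external changes)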
Call every callback registered on clockevents_chain.
Among them, tick_notify() calls
tick_handover_do_timer() when the reason is
CLOCK_EVT_NOTIFY_CPU_DYING.
clockevents_notify
migrate_hrtimers
This function sets tick_do_timer_cpu so that the first online
cpu updates jiffies
(if the cpu in charge of updating jiffies dies, another cpu
takes over).
Wait until sched_timer has finished running, then remove it from the rbtree
and change its state to inactive. Then reset its expiry time.
Change the timer's state to MIGRATE and switch its base to new_base.
Changing the base changes the cpu that will process the timer. Add the timer
to the rb-tree and add HRTIMER_STATE_ENQUEUED to its state.
Remove MIGRATE from the state. Delete the nodes of soft-expired timers
from the rb-tree and call the timers' event handlers to process them.
init/main.c
Initialize the NTP state variables and initialize
leap_timer to use absolute time based on
CLOCK_REALTIME, which measures time relative to
the current wall-clock time.
timekeeping_init
ntp_init
clocksource_get_next
clocksource_enable
clocksource_calculate_interval
Until clocksource_register() is called and
next_clocksource is set, this keeps returning
clocksource_jiffies.
When the clocksource is jiffies, the jiffies counter is
already running, so there is nothing to enable.
Use NTP_INTERVAL_LENGTH to set the clock's
xtime_interval, low_interval, and
cycle_interval members.
NTP_INTERVAL_LENGTH -> the number of nanoseconds it takes for
jiffies to increase by 1
set_normalized_timespec
wall_to_monotonic becomes 0 when system boot completes.
Booting is still in progress, so it holds a negative value. Even if
a negative third argument is passed, the tv_nsec member of
wall_to_monotonic ends up with a non-negative value.
init/main.c
time_init
For s3c6410:
s3c2410_timer_init
Call the function that setup_arch stored in system_timer.
It is defined differently for each platform.
Initialize the hardware timer clock and register its
interrupt.
To Do
Hardware dependent, but a more detailed
analysis is needed.
init/main.c
Get the monotonic time, convert it to nanoseconds, and store it in the tick_gtod and clock
members of each cpu's sched_clock_data. Finally, set sched_clock_running to 1.
sched_clock_init
ktime_to_ns()
Get the monotonic time, convert it to
nanoseconds, and return it.
cpu_sdc()
Get the cpu's sched_clock_data pointer and
set tick_gtod and clock to ktime_now.
Allocate memory for the profiling counters used to profile the code region (_etext - _stext)
and copy cpu_possible_mask into prof_cpu_mask.
profile_init()
If the slab allocator is not
available, use the bootmem
allocator.
alloc_bootmem(buffer_bytes)
alloc_bootmem_cpumask_var(&prof_cpu_mask)
cpumask_copy(prof_cpu_mask, cpu_possible_mask)
Allocate buffer_bytes of memory with the
bootmem allocator.
Allocate cpumask_size() bytes for
prof_cpu_mask.
Copy cpu_possible_mask into
prof_cpu_mask.
Allocate memory with the slab allocator.
If the slab allocator is enabled, allocate prof_cpu_mask with
GFP_KERNEL.
alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL)
Copy cpu_possible_mask into prof_cpu_mask.
Allocate buffer_bytes of (physically contiguous) memory with the
slab allocator.
If the slab allocation fails, try several other methods so that the
allocation succeeds one way or another. First request buffer_bytes of
(physically contiguous) memory from the page allocator. alloc_pages()
allocates a power-of-two number of pages (e.g. 8KB for a 5KB request),
whereas alloc_pages_exact() keeps only as many whole pages as the request needs.
Unlike the above, this requests virtually contiguous pages.
If every allocation attempt fails, free prof_cpu_mask.
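The difference between power-of-two allocation and page-exact allocation mentioned above can be illustrated with a user-space sketch. A 4KB page size is assumed, and `pages_needed()`, `order_for()`, `alloc_pages_bytes()` and `alloc_pages_exact_bytes()` are hypothetical helpers that only compute sizes; they do not allocate anything and are not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT 12                     /* assumption: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Whole pages needed to hold size bytes. */
static unsigned long pages_needed(unsigned long size)
{
    return (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Smallest order such that 2^order pages hold size bytes. */
static int order_for(unsigned long size)
{
    int order = 0;

    assert(size > 0);
    while ((1UL << order) < pages_needed(size))
        order++;
    return order;
}

/* Bytes consumed by a power-of-two (order based) allocation. */
static unsigned long alloc_pages_bytes(unsigned long size)
{
    return PAGE_SIZE << order_for(size);
}

/* Bytes consumed when the unused tail pages are given back. */
static unsigned long alloc_pages_exact_bytes(unsigned long size)
{
    return pages_needed(size) << PAGE_SHIFT;
}
```

For a 20KB request the order-based path consumes 32KB (order 3) while the exact path keeps 20KB (5 pages). For a 5KB request both end up at 8KB, since two whole pages are needed either way.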
if (!irqs_disabled())
Check whether interrupts are enabled.
early_boot_irqs_on()
If CONFIG_TRACE_IRQFLAGS is defined,
set early_boot_irqs_enabled to 1.
cpumask_copy(prof_cpu_mask, cpu_possible_mask)
kzalloc(buffer_bytes, GFP_KERNEL)
alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO)
vmalloc(buffer_bytes)
free_cpumask_var(prof_cpu_mask)
local_irq_enable()
trace_hardirqs_on()
raw_local_irq_enable();
console_init()
tty_ldisc_begin()
tty_register_ldisc()
while( __con_initcall_start ~
__con_initcall_end )
(*call)()
1. If CONFIG_TRACE_IRQFLAGS is set, turn on tracing of
hardirq on/off events.
The assembly differs depending on __LINUX_ARM_ARCH__, but
in every case it ends up enabling local IRQs. For example,
when __LINUX_ARM_ARCH__ >= 6:
__asm__("cpsie i @ __sti" : : : "memory", "cc")
#define N_TTY 0
struct tty_ldisc_ops tty_ldisc_N_TTY = {
.magic = TTY_LDISC_MAGIC,
.name = "n_tty",
.open = n_tty_open,
.close = n_tty_close,
.flush_buffer = n_tty_flush_buffer,
.chars_in_buffer = n_tty_chars_in_buffer,
.read = n_tty_read,
.write = n_tty_write,
.ioctl = n_tty_ioctl,
.set_termios = n_tty_set_termios,
.poll = n_tty_poll,
.receive_buf = n_tty_receive_buf,
.write_wakeup = n_tty_write_wakeup
};
Register the line discipline function structure in the corresponding slot of the tty_ldisc[] array.
__con_initcall_start = .;
*(.con_initcall.init)
__con_initcall_end = .;
Call every function in the section.
panic(panic_later,
panic_param)
If panic_later is not NULL, halt the system.
Print lock dependency information.
lockdep_info()
locking_selftest()
Qualcomm boot message example
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES: 8
... MAX_LOCK_DEPTH: 48
... MAX_LOCKDEP_KEYS: 2048
... CLASSHASH_SIZE: 1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS: 16384
... CHAINHASH_SIZE: 8192
memory used by lock dependency info: 992 kB
per task-struct memory footprint: 1920 bytes
Run the self-test of the locking mechanisms.
if (initrd_start && ..)
Since 2.6, initramfs is used instead of initrd.
When an initrd exists
and initrd_start < memory_start,
that is, the pfn of initrd_start < min_low_pfn,
the initrd is considered to have been overwritten; initrd_start is set to 0 to
disable the initrd.
If vmlist already has elements before vmalloc_init() is called, create a
vmap_area for each vmlist element and fill it in from that element. Then
insert the vmap_area into the rb-tree rooted at vmap_area_root.
vmalloc_init()
for loop ( all vmlist )
alloc_bootmem(sizeof(struct
vmap_area))
…
__insert_vmap_area(va)
struct vm_struct
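Structure that represents a non-contiguous region allocated by vmalloc(). The vmlist variable
implements a linked list of them. vm_area_struct, by contrast, manages the virtual memory of
user-space processes (metadata for each section); vm_struct and vm_area_struct are unrelated.
In kernel space, all vm_structs are managed as a linked list through vmlist, which is used when
the full set of vm_structs must be walked. For fast access to an individual vm_struct, the
vm_structs are also kept in an rb-tree whose nodes are vmap_areas. In user space,
mm_struct manages vm_area_structs through mmap and mm_rb, which is a similar
arrangement.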
Create the hash tables for the dentry cache and the inode cache and
initialize their entries.
vfs_caches_init_early()
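Create the dentry cache hash table and initialize its entries.
If the allocation is hard to satisfy, keep halving the number of
hash entries and retry. If hashes are distributed across NUMA nodes,
defer the attempt until vmalloc is available.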
dcache_init_early()
dentry_hashtable
1. Round numentries to a power of two.
2. Try to allocate memory for the dentry cache hash table using one of
bootmem, vmalloc, or get_free_pages().
If the allocation fails, halve dhash_entries and try
again.
inode_init_early()
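Create the inode cache hash table and initialize its entries.
If the allocation is hard to satisfy, keep halving the number of
hash entries and retry. If hashes are distributed across NUMA nodes,
defer the attempt until vmalloc is available.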
hashdist
#if defined(CONFIG_NUMA) && defined(CONFIG_64BIT)
#define HASHDIST_DEFAULT 1
#else
#define HASHDIST_DEFAULT 0
#endif
hashdist = HASHDIST_DEFAULT
On 64-bit NUMA systems the hashes are distributed across nodes.
Hash allocation is done after vmalloc becomes available.
init/main.c
1. Allocate memory the size of a cpuset mask structure for
top_cpuset.cpus_allowed.
2. Increment the global cpuset_mems_generation by 1 and
assign it to top_cpuset.mems_generation.
cpuset_init_early
cpuset_mems_generation
A global variable for the scenario where a cpuset is shared by different tasks: cpuset.mems_generation is updated without a lock,
and a lock is taken only when updating the task's own cpuset_mems_generation.
-- TODO: add more detail, or a more understandable explanation, later.
1. Check mem_cgroup_subsys.disabled (which records whether the memory
cgroup subsystem is in use) and
return if it is disabled.
2. Walk all online nodes and allocate, for each node, as many
page_cgroup structures as the node has
pages.
page_cgroup_init
alloc_node_page_cgroup(nid)
Allocate and initialize as many page_cgroup structures as there are pages, holes
included. NODE_DATA(node_id)->node_page_cgroup points to the
allocated structures.
init/main.c
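mem_init
1. Mark the free areas in mem_map and report how much free memory there is.
2. Walk the banks of each node and free the areas that correspond to holes between banks.
3. Release the pages whose bitmap bit is 0 to the buddy allocator.
4. Print information such as available system memory and code and data sizes.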
free_unused_memmap_node(node,
&meminfo)
Holes can exist between two banks, and page structures are
allocated for those areas as well. Free those page structures.
Release the node's free pages to the buddy allocator.
totalram_pages +=
free_all_bootmem_node(pgdat)
totalhigh_pages += free_area(start, end,
NULL)
1. Free the pages of the current node that can be freed.
1) For ordinary pages, free in bulk (by order) starting from the
position to free when possible; when bulk freeing is not
possible, free one page at a time.
2) For pages that hold metadata, such as the bitmap or
frame information, locate the pages holding the
metadata and free them.
2. When pages are freed, the buddy system is involved and
records the relevant information.
3. Return the total number of pages freed.
Free the highmem pages.
See: arch/arm/mm/mmu.c::sanity_check_meminfo()
TODO: detailed mechanism. Not yet understood.
sysctl_overcommit_memory
Set the memory overcommit policy to OVERCOMMIT_ALWAYS.
-- TODO: details on overcommit.
init/main.c
enable_debug_pagealloc()
cpu_hotplug_init()
kmem_cache_init()
kmem_list3_init(&initkmem_list3[i]);
set_up_list3s(&cache_cache,
CACHE_CACHE);
list_add(&cache_cache.next, ….);
If CONFIG_DEBUG_PAGEALLOC is defined, set
debug_pagealloc_enabled to 1.
If CONFIG_SMP is defined, initialize the cpu hotplug
variables.
1. Initialize cache_cache.
2. Create the default per-size caches used by kmalloc.
3. Re-create the statically built per-cpu caches using the
slab allocator.
4. Resize the per-cpu caches and create them again.
-- TODO: the slab allocator and kmem_cache_init need extensive
additions, corrections, and summarizing.
1. Initialize initkmem_list3 and cache_cache.
2. Assign kmem_list3[index+node] to cachep->nodelist[node]
for each online node.
3. Set the interval at which the cache should shrink or grow.
Add cache_cache to cache_chain.
init/main.c
Set colour_off to the cache line size.
cache_cache.colour_off =
cache_line_size();
cache_cache.array[smp_processor_id()]
= &initarray_cache.cache;
cache_cache.buffer_size =
ALIGN(cache_cache.buffer_size,
cache_line_size());
cache_cache.reciprocal_buffer_size =
reciprocal_value(cache_cache.buffer_size)
for (…) {
cache_estimate(…);
if (cache_cache.num) break;
}
BUG_ON(!cache_cache.num);
cache_cache.gfporder = order;
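Make cache_cache.array[0] point to initarray_cache.cache.
array_cache is a per-cpu cache kept as per-cpu
data. When slab objects are allocated or freed, the request
does not go to the slab allocator directly; this structure is
tried first.
Compute the size of the objects managed by cache_cache,
align it to the cache line size, and store it in the buffer_size
member.
To compute the index of a slab object within a cache quickly,
store a value obtained with the Newton-Raphson method in
reciprocal_buffer_size.
For each order, take a block of 2^order pages and check
whether objects of this cache can be allocated from it.
Also compute the number of objects that fit at that order
(stored in cache_cache.num) and the space left
over (leftover). Repeat from 0 to MAX_ORDER - 1
to find the smallest order that works.
If no order allows an allocation after trying them all,
BUG.
order -> the order at which object allocation succeeded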
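The order search described above can be sketched as follows. This is a deliberately simplified model: it assumes 4KB pages and ignores the slab management overhead that the real cache_estimate() also accounts for. `struct slab_fit` and `find_slab_order()` are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT 12                     /* assumption: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical result type for the search. */
struct slab_fit {
    int order;                /* smallest order that holds >= 1 object */
    unsigned long num;        /* objects per slab at that order */
    unsigned long left_over;  /* unused bytes per slab */
};

/* Return 0 and fill *out on success, -1 if no order below max_order fits. */
static int find_slab_order(unsigned long obj_size, int max_order,
                           struct slab_fit *out)
{
    assert(obj_size > 0);
    for (int order = 0; order < max_order; order++) {
        unsigned long slab_bytes = PAGE_SIZE << order;
        unsigned long num = slab_bytes / obj_size;

        if (num) {            /* first order with at least one object */
            out->order = order;
            out->num = num;
            out->left_over = slab_bytes - num * obj_size;
            return 0;
        }
    }
    return -1;                /* the kernel would BUG here */
}
```

A 256-byte object fits 16 times into an order-0 slab with nothing left over, while a 5000-byte object needs order 1 and leaves 3192 bytes unused.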
init/main.c
while (..) {
if (!sizes->cs_cachep) kmem_cache_create(..)
#ifdef CONFIG_ZONE_DMA
kmem_cache_create(..)
#endif
sizes++; names++;}
struct array_cache *ptr = kmalloc(..);
memcpy(ptr, arraycache,..);
cache_cache.array[smp_processor_id()] = ptr;
for_each_online_node(nid) {
init_list(&cache_cache, &initkmem_list3..);
init_list(malloc_sizes[..].cs_cachep,
&initkmem_list3..);
if (INDEX_AC != INDEX_L3) {
init_list(malloc_sizes[INDEX_L3].cs_cachep,
&initkmem_list3..);}}
list_for_each_entry(cachep, &cache_chain, next) {
if (enable_cpucache(cachep)) BUG(); }
register_cpu_notifier(&cpucache_notifier);
Walk the sizes array and create a cache for every size
that does not have one yet.
Re-allocate initarray_cache, the array_cache created
at compile time, using the slab allocator.
initarray_generic is re-allocated the same
way.
Likewise, replace the nodelists of cache_cache,
malloc_sizes[INDEX_AC], and
malloc_sizes[INDEX_L3], which use the kmem_list3 created
at compile time, with newly allocated kmem_list3 structures.
Walk every cache linked on cache_chain and set members
of the per-cpu cache such as limit and share to new
values.
Finally, call register_cpu_notifier() so that
cpuup_callback() is invoked when the state of a cpu
changes.
init/main.c
kmemtrace_init()
debug_objects_mem_init()
Does nothing.
Does nothing.
idr_init_cache()
Creates the idr_layer_cache cache.
Allocate memory for the pageset[cpu] variable of every
zone in the system and set its members to appropriate
values.
setup_per_cpu_pageset()
process_zones(smp_processor_id())
register_cpu_notifier(&pageset_notifier)
Register the callback.
init/main.c
Performed when CONFIG_NUMA is defined.
1. Create the slabs for memory policies and SPREAD NODE.
2. Create a mempolicy for the interleave nodes and attach it to the
current task_struct->mempolicy.
numa_policy_init
kmem_cache_create()
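Create slabs for numa_policy and
shared_policy_node.
node_present_pages()
Walk the nodes that have a HIGHMEM region, remembering the
number of the largest node (preferred nid) and its total page count,
and for each node of 16MB or more set that node's bit in the
interleave_nodes bitmap.
If every node is smaller than 16MB, set the bit of the node with
the most pages among those visited (preferred nid) in the
interleave_nodes bitmap.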
node_set
do_set_mempolicy
Create an MPOL_INTERLEAVE policy (struct mempolicy), make it
the memory policy of the current task (current->mempolicy),
and set the PF_MEMPOLICY flag
(PF_MEMPOLICY: Non-default NUMA memory
policy)
init/main.c
late_time_init
calibrate_delay
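Overridden per architecture.
Measure how many CPU cycles (delay-loop iterations) elapse in 1 jiffy
to obtain loops_per_jiffy, the basis of the BogoMIPS value.
pidmap_init
In the init_pid_ns namespace, allocate a page for index 0 of the pidmap
array, mark pid 0 as in use, and then create the slab cache for the
pid structure (pid_cachep).
The function the kernel uses to determine the BogoMIPS value.
BogoMIPS
MIPS stands for Millions of Instructions Per Second and
is used as an indicator of computing power. BogoMIPS was
invented by Linus Torvalds: at boot, the kernel records how
fast a busy loop runs on the current system.
"Bogo" comes from bogus (fake).
BogoMIPS measured this way is used as a rough indication of
processor speed, but it is not a scientific value at all.
According to Linus, BogoMIPS is useful for the
following:
1. Debugging
2. Checking that the computer's cache and turbo button
work properly.
In practice, BogoMIPS reflects the number of empty-loop
iterations the cpu executes in 1 jiffy.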
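To make the link between loops_per_jiffy and the printed BogoMIPS figure concrete, here is a small sketch. The divisors 500000/HZ and 5000/HZ are recalled from the boot message code in init/calibrate.c and should be treated as an assumption; `struct bogomips` and `lpj_to_bogomips()` are hypothetical names, and the loops_per_jiffy values in the usage note are invented.

```c
#include <assert.h>

/* Hypothetical result type: BogoMIPS as "whole.frac" with two decimals. */
struct bogomips {
    unsigned long whole;
    unsigned long frac;
};

/* Convert loops_per_jiffy into the value shown in the boot log. */
static struct bogomips lpj_to_bogomips(unsigned long lpj, unsigned long hz)
{
    struct bogomips b;

    assert(hz > 0 && hz <= 5000);         /* keep both divisors non-zero */
    b.whole = lpj / (500000 / hz);
    b.frac = (lpj / (5000 / hz)) % 100;
    return b;
}
```

With HZ = 100 and loops_per_jiffy = 2662400 this yields 532.48 BogoMIPS.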
PID namespace
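A PID namespace makes it possible to form sets of tasks
such that tasks in different namespaces can have the same
ID. This is a prerequisite for migration between
hosts: with PID namespaces, a process can move to
another host while keeping its PID value. Without this
feature, task migration between hosts would fail frequently,
because a process with the same ID could already exist.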
init/main.c
struct pid_namespace {
..
struct pidmap pidmap[PIDMAP_ENTRIES];
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
..
}
struct pid_link
{
struct hlist_node node;
struct pid *pid;
}
struct task_struct
{
..
pid_t pid;
pid_t tgid;
struct pid_link pids[PIDTYPE_MAX];
struct list_head thread_group;
..
}
struct pidmap {
atomic_t nr_free;
void *page;
}
struct upid {
int nr;
struct pid_namespace *ns;
struct hlist_node pid_chain;
}
struct pid
{
...
struct hlist_head tasks[PIDTYPE_MAX];
struct upid numbers[1];
…
}
init/main.c
prio_tree_init
anon_vma_init
Fill the index_bits_to_maxindex array with the largest
heap_index that a PST can store for each index_bits value
(1, 3, 7, 15, ..., 2^31-1, 2^32-1), so that it can be used
to look up the maximum heap index.
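The table described above is easy to reproduce in user space. The following is a sketch written from memory of lib/prio_tree.c, so treat the exact loop as an assumption; `prio_tree_init_sketch()` and `max_index_for_bits()` are hypothetical names, and the table size follows the width of unsigned long on the build machine.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in an unsigned long on this machine. */
#define NBITS (sizeof(unsigned long) * CHAR_BIT)

static unsigned long index_bits_to_maxindex[NBITS];

/* Entry i holds the largest index representable in i + 1 bits. */
static void prio_tree_init_sketch(void)
{
    unsigned int i;

    for (i = 0; i < NBITS - 1; i++)
        index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
    /* The last entry cannot be computed by shifting, so use all ones. */
    index_bits_to_maxindex[NBITS - 1] = ~0UL;
}

/* Largest heap index that fits in the given number of index bits. */
static unsigned long max_index_for_bits(unsigned int bits)
{
    assert(bits >= 1 && bits <= NBITS);
    return index_bits_to_maxindex[bits - 1];
}
```

After initialization, 1 bit maps to 1, 2 bits to 3, 4 bits to 15, and the full word width maps to the all-ones value.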
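Create the slab cache for the anon_vma structure.
cred_init
Create a slab cache of struct cred for the cache used by
LSM (Linux Security Modules).
A PST (radix priority search tree) is useful for storing intervals
and is a blend of a heap tree and a radix
tree. As an example of its use, think of a vma as the closed
interval [offset_begin, offset_end] of file pages: by storing
every vma mapped to a file in a PST, the vmas overlapping
a given interval (a contiguous range of file pages) can be found
in O(log n + m) time.
log n: height of the PST, m: number of stored intervals that overlap the query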
init/main.c
LSM (Linux Security Modules) is a lightweight, general-purpose access control framework that lets a variety
of access control models be implemented as loadable kernel modules for the Linux kernel. Examples include
Security-Enhanced Linux (SELinux), Domain and Type Enforcement (DTE), and the Linux Intrusion Detection
System (LIDS). With LSM, enhanced security policies can be loaded as kernel modules.
page
page->mapping
( if page mapped as
anonymous memory, it
points to anon_vma_object)
anon_vma
vm_area_struct
vm_area_struct
vm_area_struct
Reverse-mapping virtual memory ("rmap")
rmap was introduced to make it easier for the kernel to free
memory when swapping is required. To this end, each pointer
in the reverse pointer chain points to a page table that
references the page. Through the rmap chain the kernel can
quickly find every mapping of a given page, unmap them,
and then swap the page out.
The many VMAs that reference an anonymous page are
linked through an anon_vma:
page->mapping points to the anon_vma, so the
kernel can walk the list to find the related VMA
structures.
init/main.c
1) Create the slab for struct task_struct.
2) Determine the value of the global max_threads (the maximum number
of threads that can be created).
3) Set the number of threads that init_task, which is currently driving
boot, may create (sets init_task.signal.rlim[RLIMIT_NPROC]).
4) Set the maximum number of pending signals for init_task, which is
currently driving boot (sets init_task.signal.rlim[RLIMIT_SIGPENDING]).
fork_init
kernel/fork.c
int max_threads
arch/arm/kernel/init_task.c
struct task_struct init_task
struct task_struct
...
struct signal_struct *signal
struct task_struct
...
...
struct rlimit rlim[RLIM_NLIMITS]
...
include/asm-generic/resource.h
...
#define RLIMIT_NPROC 6 /* max number of processes */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIM_NLIMITS 16
...
init/main.c
1. Creates several kinds of slab caches (struct sighand_struct, struct signal_struct, struct files_struct, struct fs_struct, struct mm_struct, struct vm_area_struct, etc.)
2. For the percpu_counter structure: sets the counter, allocates memory for counters, and registers the lock class.
proc_caches_init
kmem_cache_create("sighand_cache", . . . )
Creates the slab caches
kmem_cache_create("signal_cache", . . . )
kmem_cache_create("files_cache", . . . )
kmem_cache_create("fs_cache", . . . )
TODO: summarize slob, slab, slub
TODO: summarize the struct kmem_cache data structure
kmem_cache_create("mm_struct", . . . )
kmem_cache_create("vm_area_struct", . . . )
mmap_init()
- Sets the percpu_counter value to 0
- Registers the percpu_counter lock with a lock class
init/main.c
buffer_init
- Creates the slab cache for struct buffer_head.
- Determines the maximum number of buffer_heads the system may hold and stores it in the global variable max_buffer_heads.
- When more than max_buffer_heads buffer_heads exist, they are stripped during writeback.
(Once the number of bh's in the machine exceeds this level, we start stripping them in writeback.)
kmem_cache_create("buffer_head", . . . )
Creates the slab cache
fs/buffer.c
struct kmem_cache *bh_cachep
int max_buffer_heads
Registers the notifier function
hotcpu_notifier(buffer_cpu_notify, 0)
struct notifier_block
int (*notifier_call)( )
struct notifier_block *next
int priority
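How a chain built from the three notifier_block fields above might behave can be sketched as follows. This is a user-space sketch: chain_register, chain_call, call_log, and log_priority are hypothetical names, and ordering the chain by descending priority is an assumption of the sketch rather than something the slide states.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long action, void *data);
    struct notifier_block *next;
    int priority;
};

/* Insert nb so that blocks with a higher priority stay in front. */
void chain_register(struct notifier_block **head, struct notifier_block *nb)
{
    while (*head != NULL && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Invoke every callback in chain order; returns how many were called. */
int chain_call(struct notifier_block *head, unsigned long action, void *data)
{
    int calls = 0;

    for (; head != NULL; head = head->next) {
        head->notifier_call(head, action, data);
        calls++;
    }
    return calls;
}

/* Demo callback: records the priority of each block as it runs. */
struct call_log {
    int order[8];
    int n;
};

int log_priority(struct notifier_block *nb, unsigned long action, void *data)
{
    struct call_log *log = data;

    (void)action;
    if (log->n < 8)
        log->order[log->n++] = nb->priority;
    return 0;
}
```

Registering a priority-0 block and then a priority-10 block makes chain_call run the priority-10 callback first.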
init/main.c
- Creates a slab cache named key_jar for the key structure
- Adds the key types keyring, dead, and user to key_types_list, the list that holds the key_types.
- Adds root_key_user.node to the red-black tree key_user_tree. Because the parent argument is NULL, root_key_user.node becomes the root.
- Rebalances the tree because a node was added
key_init
kmem_cache_create("key_jar", . . . )
Creates the slab cache named key_jar
security/keys/key.c
struct kmem_cache *key_jar
list_add_tail(&key_type_keyring, . . . )
list_add_tail(&key_type_dead, . . . )
list_add_tail(&key_type_user, . . . )
rb_link_node
rb_insert_color
Appends the three key types, in order, to the tail of key_types_list
Adds root_key_user.node to the red-black tree key_user_tree
rebalance tree
init/main.c
security_init
- Initializes the function pointers of the default_security_ops structure, only when CONFIG_SECURITY is defined
- Calls the functions from __security_initcall_start to __security_initcall_end
security_fixup_ops(&default_security_ops)
do_security_initcalls
Each member of the default_security_ops passed as the argument (ptrace_may_access, ...) is assigned as shown below
set_to_cap_if_null(ops, ptrace_may_access);
...
set_to_cap_if_null(ops, ptrace_traceme);
Calls the functions from __security_initcall_start to __security_initcall_end
For the SELinux model, its functions are defined in the section above
Therefore every function set in default_security_ops can be called
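The two steps of security_init can be sketched in user space as follows. The macro mirrors the set_to_cap_if_null pattern shown on the slide; everything else (security_ops_sketch, the cap_ and strict_ functions, run_initcalls, demo_initcall) is a hypothetical stand-in, and a plain array replaces the linker section bounded by __security_initcall_start and __security_initcall_end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced operations table. */
struct security_ops_sketch {
    int (*ptrace_may_access)(int child);
    int (*ptrace_traceme)(int parent);
};

/* Default ("capability") implementations used to fill empty slots. */
int cap_ptrace_may_access(int child) { (void)child; return 0; }
int cap_ptrace_traceme(int parent) { (void)parent; return 0; }

/* A hook supplied by some security module; it must be left untouched. */
int strict_ptrace_traceme(int parent) { (void)parent; return -1; }

/* Fill a slot with its cap_ default only when the slot is still NULL. */
#define set_to_cap_if_null(ops, function)          \
    do {                                           \
        if (!(ops)->function)                      \
            (ops)->function = cap_##function;      \
    } while (0)

void fixup_ops(struct security_ops_sketch *ops)
{
    set_to_cap_if_null(ops, ptrace_may_access);
    set_to_cap_if_null(ops, ptrace_traceme);
}

/* Walk a [start, end) table of init functions and call each one. */
typedef int (*initcall_t)(void);

int run_initcalls(const initcall_t *start, const initcall_t *end)
{
    const initcall_t *call;
    int n = 0;

    for (call = start; call < end; call++) {
        (*call)();
        n++;
    }
    return n;
}

static int demo_calls;
int demo_initcall(void) { demo_calls++; return 0; }
```

A table that already provides ptrace_traceme keeps its own hook, while the missing ptrace_may_access slot receives the default.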
init/main.c
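- num_physpages: the total number of page frames in the system
- Creates slab caches for the names_cache, dentry, inode, vfsmount, sysfs_dirent, bdev_inode, and file structures
- Sets the maximum number of files to 10% of the number of files that the available memory could hold
- Registers and mounts the sysfs, rootfs, and bdev filesystems
- Creates and adds the top-level kobject named fs
- Allocates and initializes one cdev_map for character devices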
vfs_caches_init
(num_physpages)
kmem_cache_create("names_cache", . . . )
dcache_init()
inode_init()
files_init()
mnt_init()
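kmem_cache_create("names_cache", ...): creates the slab cache
dcache_init(): creates the slab cache for the dentry structure and adds the dcache_shrinker structure to shrinker_list (initializes nr of the shrinker structure, then appends the shrinker to the end of shrinker_list)
inode_init(): creates the slab cache for the inode structure and adds the icache_shrinker structure to shrinker_list (initializes nr of the shrinker structure, then appends icache_shrinker to the end of shrinker_list)
files_init(): sets the maximum number of files to 10% of the number of files that the available memory could hold; creates the file structure slab cache named filp; assigns free_fdtable_work() to wq of fdtable_defer_list; for the percpu_counter structure nr_files, sets the counter, allocates memory for counters, and registers the lock class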
init/main.c
vfs_caches_init
(num_physpages)
mnt_init()
bdev_cache_init
chrdev_init
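mnt_init():
Creates slab caches for the vfsmount and sysfs_dirent structures
Initializes the members of sysfs_backing_dev_info
Registers and mounts the sysfs filesystem
Creates and adds the top-level kobject named fs
Initializes ramfs_backing_dev_info, then registers and mounts the rootfs filesystem.
Allocates a namespace ns for rootfs and initializes its members
Sets ns as the namespace to which rootfs and init_task belong.
Sets up the path structure and initializes the pwd and mnt related members of current->fs
bdev_cache_init:
Creates the slab cache for the bdev_inode structure
Registers and mounts the bdev fs
Stores the bdev superblock in a global variable
chrdev_init:
Allocates and initializes one cdev_map for character devices
Initializes directly_mappable_cdev_bdi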
init/main.c
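Creates the slab cache for the radix_tree_node structure
Example use of the radix tree:
- Before page I/O the kernel has to check whether the page is already in the page cache. Searching every page in memory for this would be expensive, so a radix tree is used to reduce the overhead.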
radix_tree_init
signal_init
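Creates the slab cache for the sigqueue structure
1) Arms the timer so that it expires at a given time; the function registered with the timer, wb_timer_fn(), will then be called.
2) Registers ratelimit_nb on the cpu_notifier chain
3) Computes the period for dirty_ratio
4) Sets the members of the prop_descriptor variables vm_completions and vm_dirties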
page_writeback_init
mod_timer(&wb_timer, . . .)
register_cpu_notifier(&ratelimit_nb)
calc_period_shift()
Arms the timer so that it expires at a given time
The function registered with the timer, wb_timer_fn(), is called
Registers ratelimit_nb on the cpu_notifier chain
init/main.c
page_writeback_init
calc_period_shift()
prop_descriptor_init()
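calc_period_shift(): computes the period for dirty_ratio. If vm_dirty_bytes is greater than 0, it converts vm_dirty_bytes into a page count and returns the log of that value.
If vm_dirty_bytes is 0, it computes dirtyable memory * vm_dirty_ratio and returns the log of that value.
prop_descriptor_init(): sets the members of the prop_descriptor variable vm_completions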
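The two branches of calc_period_shift() can be sketched in user space as follows. This is an illustration under assumptions, not the kernel function: the sysctl values are passed as parameters, SKETCH_PAGE_SIZE is fixed at 4096, the ratio is treated as a percentage, and the exact form `2 + ilog2(dirty_total - 1)` is recalled from mm/page-writeback.c and should be checked against the tree being studied. The guard for totals below 2 is an addition of the sketch.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size */

/* Integer log2 (floor); returns -1 for 0. */
static int ilog2_ul(unsigned long v)
{
    int r = -1;

    while (v) {
        v >>= 1;
        r++;
    }
    return r;
}

/*
 * dirty_bytes     : stand-in for vm_dirty_bytes
 * dirty_ratio     : stand-in for vm_dirty_ratio, in percent
 * dirtyable_pages : stand-in for the amount of dirtyable memory, in pages
 */
int calc_period_shift_sketch(unsigned long dirty_bytes,
                             unsigned long dirty_ratio,
                             unsigned long dirtyable_pages)
{
    unsigned long dirty_total;

    if (dirty_bytes)
        dirty_total = dirty_bytes / SKETCH_PAGE_SIZE;      /* bytes -> pages */
    else
        dirty_total = dirty_ratio * dirtyable_pages / 100;  /* share of memory */

    if (dirty_total < 2)          /* sketch-only guard against log of 0 */
        return 2;
    return 2 + ilog2_ul(dirty_total - 1);
}
```

Either branch ends in the same logarithm, so the period shift grows by one each time the dirty limit doubles.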
proc_root_init
1) Creates proc_inode_cache, the slab cache for the proc_inode structure
2) Registers proc as a filesystem and mounts it
3) Creates mounts in /proc, a symbolic link to /proc/self/mounts
4) Creates /proc/net, a symbolic link to /proc/self/net
5) Registers the network namespace subsystem
6) Creates the directories sysvipc, fs, driver, fs/nfsd, openprom, and bus in the proc filesystem
7) Creates the device-tree directory in the proc filesystem and adds the Open Firmware child device nodes under device-tree.
8) Creates and initializes the proc/tty subtree.
9) Creates /proc/sys and initializes its iops, fops, and nlink.
Main operations of cgroup_init()
init/main.c
cgroup_init
1. Initializes the members of cgroup_backing_dev_info
2. Initializes every cgroup subsystem
3. Adds init_css_set to the hash table of the cgroup subsystem
4. Registers the cgroup filesystem
5. Creates the cgroups entry in the proc filesystem
bdi_init(&cgroup_backing_dev_info)
1. Initializes the members of cgroup_backing_dev_info
2. For the percpu_counter structures bdi_stat[] and completions: sets the counter to 0, allocates memory for counters, and registers the lock class
cgroup subsystem init
Loops CGROUP_SUBSYS_COUNT times
Initializes all cgroup subsystems: (cpuset) (debug) (ns) (cpu_cgroup) (cpuacct) (mem_cgroup) (devices) (freezer) (net_cls)
register_filesystem(&cgroup_fs_type)
Registers the cgroup filesystem.
proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations)
Creates the cgroups entry in the proc filesystem
Main operations of cpuset_init()
init/main.c
cpuset_init
1. Sets cpus_allowed and mems_allowed of the global variable top_cpuset for all CPUs and for MAX_NUMNODES nodes respectively
2. Initializes the fmeter structure of top_cpuset
3. Registers the cpuset fs as a filesystem
4. Allocates memory for the cpumask cpus_attach
5. Sets the number of cpusets currently in the system
cpumask_setall(top_cpuset.cpus_allowed)
nodes_setall(top_cpuset.mems_allowed)
Sets cpus_allowed and mems_allowed of top_cpuset for all CPUs and for MAX_NUMNODES nodes respectively
fmeter_init(&top_cpuset.fmeter)
Initializes the fmeter structure of top_cpuset. fmeter is a frequency meter; it holds information about how fast some event is occurring
top_cpuset.mems_generation = cpuset_mems_generation++
Whenever any cpuset changes its mems_allowed, the value of mems_generation is incremented
set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags)
register_filesystem(&cpuset_fs_type)
Registers the cpuset fs as a filesystem.
number_of_cpusets = 1
Sets the number of cpusets currently in the system
Main operations of taskstats_init_early()
init/main.c
taskstats_init_early()
1. Creates taskstats, the slab cache for the taskstats structure
2. Initializes the list of listener_array[], the per_cpu data of the i-th CPU
3. Initializes the semaphore of listener_array[], the per_cpu data of the i-th CPU
taskstats_cache = KMEM_CACHE(taskstats, SLAB_PANIC)
for_each_possible_cpu
INIT_LIST_HEAD(listener_array.list)
init_rwsem(listener_array.sem)
taskstats
Format for per-task data returned to userland when a task exits or listener requests stats for a task
Main operations of delayacct_init()
init/main.c
delayacct_init()
1. Creates task_delay_info, the slab cache for the task_delay_info structure.
2. Initializes tsk->delays to NULL.
3. If delayacct_on is true, allocates one task_delay_info structure from the slab cache
delayacct_cache = KMEM_CACHE(task_delay_info, SLAB_PANIC);
delayacct_tsk_init(&init_task);
__delayacct_tsk_init(struct task_struct *tsk)
Allocates one task_delay_info structure from the slab cache
CONFIG_TASK_DELAY_ACCT
Enable per-task delay accounting Collect information
on time spent by a task waiting for system resources
like cpu, synchronous block I/O completion and
swapping in pages. Such statistics can help in setting a
task's priorities relative to other tasks for cpu, io, rss
limits etc.
Main operations of check_bugs()
init/main.c
check_bugs()
check_writebuffer_bugs
1. Allocates one physical page
2. Sets the page protection to L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY | L_PTE_WRITE | L_PTE_MT_BUFFERABLE
3. Maps the page at two virtual addresses, p1 and p2
4. Runs a write test through p1 and p2, two different virtual addresses for the same physical page. The test passes if the value written through p1 is overwritten by the value written through p2
pgprot_t prot = __pgprot(L_PTE_PRESENT|L_PTE_YOUNG|L_PTE_DIRTY|L_PTE_WRITE|L_PTE_MT_BUFFERABLE);
p1 = vmap(&page, 1, VM_IOREMAP, prot)
p2 = vmap(&page, 1, VM_IOREMAP, prot)
if (p1 && p2) {
v = check_writebuffer(p1, p2);
reason = "enabling work-around";
} else {
reason = "unable to map memory\n";
}
init/main.c
1. ACPI (Advanced Configuration and Power Interface) Support
2. For x86; reference page: http://www.spinics.net/lists/armkernel/msg01205.html
acpi_early_init()
ftrace_init()
count = __stop_mcount_loc - __start_mcount_loc
The number of addresses of functions that start with a no-op
ftrace_dyn_table_alloc(count)
The num_to_init argument is the total number of addresses of functions that start with a no-op. The function computes how many pages num_to_init entries need and allocates those pages
ftrace_convert_nops(NULL, __start_mcount_loc, __stop_mcount_loc)
1. Loops over the ftrace entries; for each one it obtains from the dyn_ftrace page an address to hold the record and sets its members
2. The ip of a dyn_ftrace entry points at MCOUNT_ADDR; the ip is changed to a no-op
3. Measures the time taken to convert to no-ops
4. Stores the number of conversions
ftrace (function trace)
At compile time a 5-byte no-op instruction is inserted at the start of every kernel function. When ftrace is enabled, ftrace replaces the no-op instruction with code that traces the function.
init/main.c
Main operations of copy_process()
rest_init()
kernel_thread(kernel_init)
do_fork()
1. kernel_thread() goes through do_fork() and copy_process() to duplicate the task and prepare it to run the kernel_init function
2. Copies the contents of the parent's task_struct to create a new process.
3. Adjusts the registers and the process environment as required by the CLONE flags.
Main operations of do_fork()
1. Calls copy_process(), which copies the parent's task_struct to create a new process and adjusts the registers and the process environment as required by the CLONE flags.
2. If the process was created without problems, gets the virtual id from the pid namespace of the current task
3. If CLONE_PARENT_SETTID is set, a non-NULL parent_tidptr is passed as an argument and the nr value is written through it
4. If CLONE_VFORK is set, sets vfork_done and initializes the vfork structure
5. Initializes the audit_context of child p. If child p is being ptraced, adds SIGSTOP to pending.signal and sets TIF_SIGPENDING in thread_info->flag. Clears the PF_STARTING flag
6. If CLONE_STOPPED is set, adds SIGSTOP to pending.signal, sets TIF_SIGPENDING in thread_info->flag, and sets the task state to TASK_STOPPED
7. If CLONE_STOPPED is not set, wakes the task up. Checks the mask to decide whether to stop for event notification; when stopping, reports to the ptrace parent together with a message.
8. If the CLONE_VFORK flag is set, sets the PF_FREEZER_SKIP flag and waits until the child created by vfork finishes. Then marks the task as freezable again and signals that the vfork has completed
copy_process()
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
dup_task_struct(current)
1. Creates the task_struct and thread_info structures
2. Copies the contents of the parent's task_struct
3. Sets tsk->stack to the thread_info
4. Initializes tsk->dirties
5. Copies the parent's thread_info into tsk's thread_info
6. Sets thread_info->task to tsk
7. Sets stackend to the end of the stack (thread_info pointer + 1, i.e. the address immediately after the thread_info structure)
8. Stores a magic number (canary value) there for overflow detection
rt_mutex_init_task(p)
Initializes the spin_lock, the pi list, and pi_blocked_on.
rlim check
rlim: resource limit
If more processes have been created than the current process is allowed to own, and none of CAP_SYS_ADMIN, CAP_SYS_RESOURCE, or INIT_USER applies, the fork is not allowed
copy_creds(p, clone_flags)
1. If cred->thread_keyring is NULL and the CLONE_THREAD flag is set, the parent's credential is shared
2. Otherwise a new credential is created
3. If the parent has a cred->thread_keyring, the new thread also gets one, set up afresh
4. If the CLONE_THREAD flag is not set, the parent's tgcred is dropped and a new tgcred is set up
max_threads check
Checks whether more threads than max_threads have been created
exec_domain->module check
execution domain: how system calls are invoked or how the various signals are numbered. This information is stored in execution domain descriptors of type exec_domain.
Checks whether the context of exec_domain->module does not exist
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
p->binfmt && p->binfmt->module check
Linux supports various organization formats for executable files. The standard format is ELF (Executable and Linkable Format).
If binfmt exists but binfmt->module does not, jumps to bad_fork_cleanup_count
p->did_exec ... p->vfork_done
Initializes did_exec, delayacct, flags, children, sibling, the RCU-related members, vfork_done, and the spin lock
init_sigpending(&p->pending)
pending records the pending signals of the parent process. The child process does not need them, so every signal in sigpending is cleared
p->utime ... p->default_timer_slack_ns
Initializes the time-related members
task_io_accounting_init(&p->ioac)
Sets ioac to 0
acct_clear_integrals(p)
Clears the accounting fields of p
posix_cpu_timers_init(p)
Initializes POSIX timer handling
do_posix_clock_monotonic_gettime(&p->start_time)
Obtains the monotonic time (xtime + wall_to_monotonic) in timespec format and stores it in start_time
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
p->real_start_time = p->start_time
monotonic_to_bootbased(&p->real_start_time)
Sets the real_start_time of p relative to the time of boot
cgroup_fork(p)
Adds the newly forked task p to the parent's cgroup
p->mempolicy = mpol_dup(p->mempolicy)
When CONFIG_NUMA is enabled: copies the parent's memory policy
mpol_fix_fork_child_flag(p)
Sets the PF_MEMPOLICY bit if a mempolicy exists and clears it otherwise. PF_MEMPOLICY is set only for a non-default memory policy
p->irq_events ... p->softirq_context
Initializes the IRQ-related members
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
sched_fork(p, clone_flags)
Perform scheduler related setup. Assign this task to a CPU.
1. Disables preemption for current
2. Initializes the members of the new process's sched_entity. Changes the process state to TASK_RUNNING, but the process is not yet inserted into a runqueue
3. On SMP, finds the least loaded CPU and sets values to match the newly assigned CPU.
4. Sets prio to the parent's normal_prio. If it is not an rt priority, sets the scheduler class of the process to fair_sched_class.
5. If CONFIG_SCHEDSTATS is defined, initializes sched_info
6. Initializes the priority of pushable_tasks to prio and initializes the plist to null
7. Enables preemption for current
audit_alloc(p)
Allocates an audit context block. The audit context block is used to audit system calls
copy_semundo(clone_flags, p)
If CLONE_SYSVSEM is set, lets the parent and child tasks share SEM_UNDO
copy_files(clone_flags, p)
Copies the file descriptors
copy_fs(clone_flags, p)
Copies the filesystem-related data structures
copy_sighand(clone_flags, p)
Copies the signal handlers
copy_signal(clone_flags, p)
Allocates and initializes the p->sig data structure; rlim is copied from current
copy_mm(clone_flags, p)
If the CLONE_VM flag is set, shares the mm with the parent; otherwise allocates and initializes a separate one
copy_namespaces(clone_flags, p)
Copies the namespaces
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
copy_io(clone_flags, p)
If the CLONE_IO flag is set, shares the ioc with the parent; otherwise allocates and initializes a separate one
copy_thread(clone_flags, stack_start, stack_size, p, regs)
Sets the register pointer of the child process to the regs argument passed down from do_fork. Obtains the thread_info of the child process and sets sp and pc. pc is set to the address to return to after the fork
alloc_pid(p->nsproxy->pid_ns)
Allocates a pid
pid_ns_prepare_proc(p->nsproxy->pid_ns)
Mounts p->nsproxy->pid_ns in the proc filesystem
ftrace_graph_init_task(p)
If CONFIG_FUNCTION_GRAPH_TRACER is defined, allocates a stack to be used for tracing and assigns it to p->ret_stack
p->pid = pid_nr(pid)
Gets the global id as seen from the init namespace.
p->tgid = p->pid
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
ns_cgroup_clone(p, pid)
If the nsproxy of the parent process differs from that of the child process, creates a namespace cgroup and moves the child process into the newly created namespace cgroup.
clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE)
Clears TIF_SYSCALL_TRACE and TIF_SYSCALL_EMU in thread_info->flag
cgroup_fork_callbacks(p)
Runs the fork callback functions of the cgroup subsystems
set_task_cpu(p, smp_processor_id())
If the CPU assigned to p is not set in cpus_allowed, or the assigned CPU is not online, assigns p to the current processor
p->real_parent = current->real_parent;
p->parent_exec_id = current->parent_exec_id;
If the CLONE_PARENT or CLONE_THREAD flag is set, real_parent and parent_exec_id are copied from the parent. Otherwise they are set to current and current->self_exec_id respectively
recalc_sigpending()
Called to recompute the signal pending flag
init/main.c
rest_init()
kernel_thread(kernel_init)
do_fork()
copy_process()
signal_pending(current)
If TIF_SIGPENDING is set, handles it as an error
p->pid exists?
If the pid exists:
1. Adds p->sibling to the parent's children list
2. This registers p among the parent's children
3. Initializes the ptrace-related members of p
thread_group_leader(p)?
If p is the leader of its thread group:
1. If CLONE_NEWPID is set, sets p as the child_reaper
2. Sets leader_pid to the pid
3. Decrements the reference count on signal->tty by 1
4. Takes a new reference to the parent's signal->tty
5. Adds the newly created process to the task group and session of current.
6. Adds p to the task list of init_task
7. Increments process_counts
attach_pid(p, PIDTYPE_PID, pid);
Adds the pid to the id network
total_forks++
Increments the number of forks by 1
proc_fork_connector(p)
Provide a connector that reports process events to userspace. Send events such as fork, exec, id change (uid, gid, suid, etc), and exit. Used when CONFIG_PROC_EVENTS is set
cgroup_post_fork(p)
If use_task_css_set_links is true and child->cg_list is empty, adds child to child->cgroups->tasks
init/main.c
rest_init()
kernel_thread(kernel_init)
numa_default_policy()
Initializes the memory policy of the current process (current->mempolicy=NULL)
kernel_thread(kthreadd)
Creates kthreadd, the kernel thread daemon
kthreadd
Creates the threads registered on kthread_create_list by calling create_kthread()
find_task_by_pid_ns(pid, &init_pid_ns)
Finds the task_struct of kthreadd
init_idle_bootup_task(current)
Initializes the sched_class of current to idle_sched_class
rcu_scheduler_starting()
rcu_scheduler_active = 1
schedule()
1. Clears the need_resched flag of current
2. If the task was not scheduled out by preemption and has a pending signal, sets the task state to TASK_RUNNING. If there is no pending signal, deactivates the task
3. Inserts the prev task into the run queue only if it is still active. It is inactive if deactivate_task() was called
4. Performs the context switch: switches from prev->mm and the registers of prev to the mm and registers of the new task
cpu_idle()
The calling process idles in an endless loop until its TIF_NEED_RESCHED flag is set
Done!
init/main.c
kernel_init
set_cpus_allowed_ptr(current, cpu_all_mask)
Finds the CPUs on which the current task may run
init_pid_ns.child_reaper = current
Registers current as the child reaper
cad_pid = task_pid(current)
Assigns the pid of the current task to cad_pid (cad -> ctrl+alt+del reboot)
smp_prepare_cpus(setup_max_cpus)
1. Sanity-checks ncores and adjusts the value if it is larger than NR_CPUS
2. Fetches the cpu_data structure of the CPU matching cpuid and sets loops_per_jiffy
3. Sets up the local clock event of the CPU
4. Resets cpu_present_map to match the adjusted max_cpus
5. On multicore systems, enables the SCU (snoop control unit) and boots the secondary CPUs
do_pre_smp_initcalls()
Calls every function in the INITCALLS (.initcall*.init) section
start_boot_trace()
Notifies the boot tracer that all pre_smp_initcalls have been called.
smp_init()
Sets the id of the current CPU in cpu_active_bits. Brings online any CPU that is not yet online
sched_init_smp()
Initializes the sched domains and sets them for each CPU. Sets up a notifier so that the domains can be updated on CPU hotplug events
do_basic_setup()
1. Creates the kernel thread that will run the rcu_sched_grace_period function
2. Creates the events workqueue per processor
3. Creates the kernel threads cpuset and khelper
4. Initializes the entry of each subsystem in the sysfs tree and registers the platform_bus device
5. Initializes the proc entries for IRQs
6. Calls the functions in the initcall sections
ramdisk_execute_command = "/init"
If ramdisk_execute_command is not set, initializes it to /init
init_post()
Opens /dev/console to create fd 0, 1, 2.
Executes /sbin/init (or /etc/init or /bin/init or /bin/sh)
init/main.c
Main operations of kernel_init
1. Finds the CPUs on which the current task may run and registers current as the child reaper
2. Assigns the pid of the current task to cad_pid. cad -> ctrl+alt+del reboot
3. Sanity-checks ncores and adjusts the value if it is larger than NR_CPUS. Fetches the cpu_data structure of the CPU matching cpuid and sets loops_per_jiffy. Sets up the local clock event of the CPU. Resets cpu_present_map to match the adjusted max_cpus. On multicore systems, enables the SCU (snoop control unit) and boots the secondary CPUs
4. Calls every function in the INITCALLS (.initcall*.init) section
5. Notifies the boot tracer that all pre_smp_initcalls have been called.
6. Sets the id of the current CPU in cpu_active_bits. Brings online any CPU that is not yet online
7. Initializes the sched domains and sets them for each CPU. Sets up a notifier so that the domains can be updated on CPU hotplug events
8. Creates the kernel thread that will run the rcu_sched_grace_period function. Creates the events workqueue per processor. Creates the kernel threads cpuset and khelper. Initializes the entry of each subsystem in the sysfs tree. Registers the platform_bus device. Initializes the proc entries for IRQs. Calls the functions in the initcall sections.
9. Checks ramdisk_execute_command
10. Opens /dev/console to create fd 0, 1, 2
11. Executes init
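zone_size and zhole_size arrays
Three Zones of a Node that consists of four Banks
[Figure: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM laid over bank#1 to bank#4; the gaps between banks are absent memory (holes), each tagged with a marker in the original figure]
zone_size[0] = size of ZONE_DMA
zone_size[1] = size of ZONE_NORMAL
zone_size[2] = size of ZONE_HIGHMEM
zhole_size[0] = size of the holes inside ZONE_DMA
zhole_size[1] = sum of two of the hole regions marked in the figure
zhole_size[2] = one of the hole regions marked in the figure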
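Memory Bank, Node, NUMA, UMA, ZONE, Page Frame (1)
1. Memory Bank
• Memory is organized in Memory Slots.
• A group of Slots populated with memory of the same type is called a Memory Bank.
2. Node
• A set of memory that has the same access speed from one CPU is called a node
• Several Banks can make up one Node
• The structure through which Linux represents Memory Banks
3. NUMA vs. UMA*
• NUMA is an architecture in which memory access speed differs with the distance from the processor to the memory
• On NUMA the kernel needs to minimize communication between nodes; techniques range from replicating kernel text and read-only data across nodes to keeping the data structures the kernel needs on a single node.
• UMA is an architecture in which the access speed is uniform
4. ZONE
• A region of a Node that has uniform properties
• There are DMA, NORMAL, HIGHMEM, ZONE_MOVABLE and so on, but on ARM only one ZONE exists
5. Page Frame
• A ZONE manages physical memory, and the smallest unit is called a Page Frame.
• A Page Frame is managed in Linux by the page structure and is accessed through the global mem_map array
*For code consistency between NUMA and UMA platforms, UMA platforms have a statically allocated pg_data_t.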
Memory Bank, Node, NUMA, UMA, ZONE, Page Frame (2)
[Figure: a Node is a set of Banks (example: 1 Bank of 4 Slots) and is represented by struct pg_data_t. A Node contains Zone #1, Zone #2, ..., Zone #n, each represented by struct zone. Each zone manages Page Frames, each described by a struct page. The mem_map array holds Page #1, Page #2, ..., Page #n, and each entry corresponds to a Page Frame in Physical Memory.]
Memory Bank, Node, NUMA, UMA, ZONE, Page Frame (3)
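Why a Bank is not a Node in Linux
find_bootmap_pfn is the function that finds the page frame number of the bootmap. for_each_nodebank walks the banks of the memory info in order and checks whether the node number of each bank matches the node number passed in. If a bank were a node, every bank would have a unique node number, so at most one bank could match; the continue statement in the if that tests whether the last page frame number of the Bank is below the last page frame number of the bss area would then serve no purpose, and the for_each_nodebank loop would be finished as soon as that if evaluated to true.
It can therefore be inferred that the Linux kernel treats bank and node as different things and that one node can have several banks.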
find_bootmap_pfn, arch/arm/mm/init.c
start_pfn = PAGE_ALIGN(__pa(_end)) >> PAGE_SHIFT;
for_each_nodebank(i, mi, node) {
    struct membank *bank = &mi->bank[i];
    unsigned int start, end;
    start = bank_pfn_start(bank);
    end = bank_pfn_end(bank);
    if (end < start_pfn)
        continue;
    ...
arch/arm/include/asm/setup.h
#define for_each_nodebank(iter,mi,no) \
    for (iter = 0; iter < (mi)->nr_banks; iter++) \
        if ((mi)->bank[iter].node == no)
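The way for_each_nodebank lets several banks belong to one node can be tried out in user space with the sketch below. The macro body is the one shown above; struct membank and struct meminfo are reduced stand-ins, and NR_BANKS, banks_on_node, and node_bytes are hypothetical names introduced for the illustration.

```c
#include <assert.h>

#define NR_BANKS 8   /* assumed upper bound for this sketch */

struct membank {
    unsigned long start;   /* physical start address */
    unsigned long size;    /* size in bytes */
    int node;              /* node this bank belongs to */
};

struct meminfo {
    int nr_banks;
    struct membank bank[NR_BANKS];
};

/* Visit only the banks whose node number equals no. */
#define for_each_nodebank(iter, mi, no)                 \
    for (iter = 0; iter < (mi)->nr_banks; iter++)       \
        if ((mi)->bank[iter].node == no)

/* Number of banks that make up the given node. */
int banks_on_node(const struct meminfo *mi, int node)
{
    int i, count = 0;

    for_each_nodebank(i, mi, node)
        count++;
    return count;
}

/* Total bytes of memory contributed to the node by its banks. */
unsigned long node_bytes(const struct meminfo *mi, int node)
{
    unsigned long total = 0;
    int i;

    for_each_nodebank(i, mi, node)
        total += mi->bank[i].size;
    return total;
}
```

With two 64 MB banks tagged node 0 and one bank tagged node 1, banks_on_node reports 2 for node 0, which is the situation the argument above relies on.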
Slots
This is the total number of memory upgrade slots (sockets) followed by
their configuration. Banks are the way a system addresses memory. A
bank must be completely filled with memory modules of the same size
and type in order for the system to recognize and address the memory.
i.e. :
3 (3 banks of 1) This indicates that there are 3 memory slots. These are
divided into 3 banks, and each bank consists of one memory slot. So
you can add memory one piece at a time for the system to use.
4 (2 banks of 2) This indicates that there are 4 memory slots. These are
divided into 2 banks, and each bank consists of two memory slots. So
you must add memory two pieces at a time (they must be the same size
and type of memory) in order for the system to benefit from the
upgrade.
12 (3 banks of 4) This indicates that there are 12 memory slots. These
are divided into 3 banks, and each bank consists of four memory slots.
So you must add memory four pieces at a time (and they must be the
same size and type of memory) in order for the system to benefit from
the upgrade.
http://www.computermemoryupgrade.net/maximum-standard-memory-slots.html
Node layout
include/linux/mmzone.h
pg_data_t: the structure that represents a memory Node
typedef struct pglist_data {
/* Array of the Zones inside the Node */
struct zone node_zones[MAX_NR_ZONES];
/* Fallback list for Zones: when a Zone has no free memory, it is used to find a Zone of the same type on another Node and allocate from it */
struct zonelist node_zonelists[MAX_ZONELISTS];
/* Number of Zones in the Node; each Node may have a different number of Zones */
int nr_zones;
/* Pointer to the Page Array; the start address of the Pages that belong to this Node */
struct page *node_mem_map;
/* The kernel needs memory during boot, before the memory allocator is active. This structure manages memory temporarily in that period and is used by the Boot Memory Allocator */
struct bootmem_data *bdata;
/* Logical page frame number; the pages of all Nodes in the system are numbered consecutively, so the number is globally unique */
unsigned long node_start_pfn;
/* Total number of physical pages */
unsigned long node_present_pages;
/* Number of all pages spanned by this Node; when the Node contains holes, some of them may have no physical page */
unsigned long node_spanned_pages;
/* Global Node Identifier */
int node_id;
/* Wait queue for the swap daemon; needed to swap frames in the zones */
wait_queue_head_t kswapd_wait;
/* Pointer to the task struct of the swap daemon */
struct task_struct *kswapd;
/* Used by the swapping subsystem to decide the size of the area to be freed */
int kswapd_max_order;
} pg_data_t;
The field struct pglist_data *pgdat_next does not exist in 2.6.31.
In earlier versions it was used to link the Nodes of the system into a singly linked list.
Node layout
include/linux/mmzone.h
struct zone {
unsigned long pages_min, pages_low, pages_high; /* thresholds for swapping, see the note after the structure */
unsigned long lowmem_reserve[MAX_NR_ZONES]; /* reserved for critical allocations, used when an allocation must succeed under any condition */
struct free_area free_area[MAX_ORDER]; /* the basis of the Buddy System */
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct {
struct list_head list;
unsigned long nr_scan;
} lru[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; /* Zone Statistics */
int prev_priority; /* priority at which the zone was scanned when page frames were freed by page reclaim */
unsigned int inactive_ratio;
wait_queue_head_t * wait_table; /* for processes waiting until a page becomes available */
unsigned long wait_table_hash_nr_entries; /* variables used to implement the wait queue */
unsigned long wait_table_bits;
struct pglist_data *zone_pgdat; /* pointer to the node the zone belongs to */
unsigned long zone_start_pfn; /* first page frame number of the zone */
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
const char *name;
} ____cacheline_internodealigned_in_smp;
Note on pages_min, pages_low, pages_high:
if (free pages > pages_high)
OK.
else if (free pages < pages_min)
page reclaim
else if (free pages < pages_low)
swapping
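The watermark comparison noted for pages_min, pages_low, and pages_high can be sketched as a small classifier. The sketch tests pages_min before pages_low, because a free-page count below pages_min is also below pages_low and would otherwise never reach the reclaim branch. The type and function names are hypothetical, and the slide does not say what happens between pages_low and pages_high, so treating that range as "no new action" is an assumption of the sketch.

```c
#include <assert.h>

/* Reduced stand-in for the three watermark fields of struct zone. */
struct zone_marks_sketch {
    unsigned long pages_min, pages_low, pages_high;
};

enum zone_action_sketch {
    ZONE_ACTION_NONE,      /* enough free pages, nothing to do */
    ZONE_ACTION_SWAP,      /* below pages_low: start swapping */
    ZONE_ACTION_RECLAIM    /* below pages_min: page reclaim */
};

enum zone_action_sketch zone_action(const struct zone_marks_sketch *z,
                                    unsigned long free_pages)
{
    /* Most severe condition first. */
    if (free_pages < z->pages_min)
        return ZONE_ACTION_RECLAIM;
    if (free_pages < z->pages_low)
        return ZONE_ACTION_SWAP;
    /* Between pages_low and pages_high, and above pages_high:
     * no new action is started (assumption of this sketch). */
    return ZONE_ACTION_NONE;
}
```

For watermarks 100/200/300, a zone with 150 free pages falls in the swapping range and one with 50 free pages in the reclaim range.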
Zone flags
typedef enum {
ZONE_ALL_UNRECLAIMABLE, /* all pages pinned */
ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
} zone_flags_t;
Meaning
if (no flag is set)
NORMAL
else if (flags == ZONE_ALL_UNRECLAIMABLE)
all pages are pinned, so reclaim is not possible
else if (flags == ZONE_RECLAIM_LOCKED)
reclaim is in progress; other CPUs are prevented from reclaiming
else if (flags == ZONE_OOM_LOCKED)
a process is using too much memory and an operation cannot complete; used to guard against Out of Memory
Node layout
struct per_cpu_pageset *pageset[NR_CPUS]
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
struct list_head list; /* the list of pages */
};
[Figure: pg_data_t is the structure that represents a memory Node (Node 0, Node 1, ...; MAX_NUMNODES x MAX_NR_ZONES zones in total). Its fields are node_zones, node_zonelists (zlcache plus _zone_ref entries, each holding an idx and a zone* that points at the corresponding zone), node_mem_map (the Page Array: page #1 ... page #z, each describing a frame of Physical Memory), bdata (bitmap), node_start_pfn, and kswapd_wait. Each zone, ZONE(DMA), ZONE(NORMAL), ZONE(HIGHMEM), has a pageset per CPU (cpu 0, cpu 1, cpu 2: pcp with hot pages and cold pages, a cache of single pages taken from the Buddy System) and a free_area array, the basis of the Buddy System, with one free_list per order 2^0, 2^1, ..., 2^(n-1). The page blocks of each order (order 0, order 1, ...) are grouped as UNMOVABLE, RECLAIMABLE, and MOVABLE. The private field of the page structure tells which order a page belongs to in the buddy system.]
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
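Because free_area keeps one list per order, the allocator has to locate the "buddy" of a block when it splits or merges blocks. A common way to do that is index arithmetic on the page index, sketched below. The function names are hypothetical and the XOR formulation is stated as an assumption about how the buddy relationship is computed, not as a quotation of the kernel source.

```c
#include <assert.h>

/* Index of the buddy of the 2^order block that starts at page_idx.
 * Flipping bit 'order' moves to the neighbouring block of equal size. */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Start index of the merged 2^(order+1) block formed by a block
 * and its buddy: the lower of the two start indices. */
unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
    return page_idx & ~(1UL << order);
}
```

For example, the order-2 block starting at page 8 has its buddy at page 12, and the two merge into an order-3 block starting at page 8.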
Node layout
Buddy System Information & Page Type Information
Function that finds the first 0 bit
_find_first_zero_bit_le is very similar to _find_next_zero_bit_le
arch/arm/lib/findbit.S
ENTRY(_find_first_zero_bit_le)
    teq r1, #0
    beq 3f
    mov r2, #0
1:  ldrb r3, [r0, r2, lsr #3]
    eors r3, r3, #0xff  @ invert bits
    bne .L_found        @ any now set - found zero bit
    add r2, r2, #8      @ next bit pointer
2:  cmp r2, r1          @ any more?
    blo 1b
3:  mov r0, r1          @ no free bits
    mov pc, lr
ENDPROC(_find_first_zero_bit_le)
.L_found:
#if __LINUX_ARM_ARCH__ >= 5
    rsb r1, r3, #0      @ r1 = 0 - r3
    and r3, r3, r1      @ r3 = r1 & r3. After this operation only the bit that
                        @ corresponds to the first 0 bit is set to 1 in r3 and
                        @ all other bits are 0, for example
                        @ r3 = 00000000 00000000 00000000 00000010
    clz r3, r3          @ count the zeros before the first 1
    rsb r3, r3, #31     @ 31 - r3. Why 31? r3 is known to contain at least
                        @ one 1 bit, so the count of leading zeros is at
                        @ most 31. The result is the offset of the first
                        @ 0 bit
    add r0, r2, r3      @ add the offset to the base value (r2) and return
#else
....
#endif
    mov pc, lr
ENTRY(_find_next_zero_bit_le)
    teq r1, #0          @ if maxbit (size) is 0
    beq 3b              @ jump backward to label 3
    ands ip, r2, #7     @ check whether r2 (sidx) is a multiple of 8.
                        @ If it is, continue in _find_first_zero_bit_le.
                        @ The bitmap is organized in units of 8 bits, so for
                        @ a multiple of 8 the scan can start at the first
                        @ bit of a byte; otherwise the remainder of the
                        @ division by 8 has to be handled first.
    beq 1b              @ multiple of 8: go to _find_first_zero_bit_le
    ldrb r3, [r0, r2, lsr #3]  @ compute the memory address to access:
                        @ r2, lsr #3 gives the byte offset from the
                        @ base address r0
    eor r3, r3, #0xff   @ EOR r3 with 0xff and store the result in r3.
                        @ If r3 is now non-zero, the original value of r3
                        @ contained a 0 bit
    movs r3, r3, lsr ip @ shift right by the remainder of r2 divided by 8,
                        @ because the bits below that offset must be
                        @ skipped. For example, if ip is 6, a 0 must be
                        @ found among bits 6 and above.
    bne .L_found        @ if r3 is still non-zero here, it contains a bit
                        @ that stands for a 0 in the bitmap, which means a
                        @ page can be allocated, so jump to .L_found
    orr r2, r2, #7      @ if zero, then no bits here
    add r2, r2, #1      @ align bit pointer
    b 2b                @ loop for next bit
ENDPROC(_find_next_zero_bit_le)
Logic for finding the first 0 bit at or after sidx in the bootmem bitmap
The bitmap is a byte array with index [0], [1], ..., [n-1], [n]; inside each byte the bit offsets run from 7 down to 0.
sidx >> 3 is the quotient of sidx divided by 8, so it is the index into the byte array that forms the bitmap: bitmap + (sidx >> 3), where sidx = 34, selects index [4].
sidx & 7 is the remainder of sidx divided by 8, so it is the bit offset inside that byte: where sidx = 34, the remainder is 2.
Step 1  Index [4] = r3 =          1 0 0 0 1 1 1 1
Step 2  0xff =                     1 1 1 1 1 1 1 1
Step 3  r3 = r3 XOR 0xff =         0 1 1 1 0 0 0 0
Step 4  r3 = r3 >> (sidx & 7) =    0 0 0 1 1 1 0 0
Step 5  r1 = 0 - r3 =              1 1 1 0 0 1 0 0
Step 6  r3 = r1 AND r3 =           0 0 0 0 0 1 0 0
Step 7  clz r3, r3 = 29 (00000000 00000000 00000000 00000100)
Step 8  r3 = 31 - r3 = 2
Position of the first available bit = sidx + r3 (here 34 + 2 = 36)
[Step 3] If r3 XOR 0xff is not 0, r3 contains a 0 bit
[Step 4] Shift by 2 bits; if r3 is still not 0, a 0 bit exists at or after the offset
[Step 5] Together with step 6, only the lowest 1 bit stays set. That bit is the first 1 of step 4, and it stands for the first available 0 bit of step 1.
[Step 7] Counts the leading zeros: r3 = 29
[Step 8] At least one bit is 1, so the number of leading zeros is at most 31; subtracting r3 from 31 gives the bit offset, and sidx + r3 points at the first available bit in index [4]
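The eight steps above can be reproduced in C. This is a sketch of the same bit manipulation, not a translation of findbit.S: the function name is hypothetical, `__builtin_clz` is a GCC/Clang builtin standing in for the ARM clz instruction, and clamping the result to maxbit is an addition of the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Return the index of the first 0 bit at or after sidx in a
 * little-endian byte bitmap, or maxbit when there is none.
 */
unsigned int find_next_zero_bit_le_sketch(const unsigned char *bitmap,
                                          unsigned int maxbit,
                                          unsigned int sidx)
{
    const unsigned int top = (unsigned int)(sizeof(unsigned int) * CHAR_BIT) - 1u;
    unsigned int bit = sidx;

    while (bit < maxbit) {
        /* Steps 1-4: load the byte, invert it, drop the bits below sidx. */
        unsigned int inverted = (bitmap[bit >> 3] ^ 0xffu) >> (bit & 7u);

        if (inverted != 0) {
            /* Steps 5-6: isolate the lowest set bit (x & -x). */
            unsigned int lowest = inverted & (0u - inverted);
            /* Steps 7-8: leading-zero count gives the bit position. */
            unsigned int offset = top - (unsigned int)__builtin_clz(lowest);
            unsigned int found = bit + offset;

            return found < maxbit ? found : maxbit;
        }
        /* No zero bit left in this byte: advance to the next byte. */
        bit = (bit | 7u) + 1u;
    }
    return maxbit;
}
```

With byte [4] holding 10001111 and sidx = 34, the function returns 36, matching the walk-through.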
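Memory Hotplug Summary
1. Purpose
A. To change the amount of memory on demand. This is mainly required in highly virtualized environments.
B. To physically install or remove DIMMs or NUMA nodes. This is required by hardware that supports memory power management.
2. Two Phases
• Physical memory hotplug phase
– Communicates with the hardware or firmware to create or remove the environment for the hotplugged memory.
– This phase serves purpose (B), but it is also a good phase for communication between highly virtualized environments.
– When memory is hotplugged, the kernel recognizes the new memory, builds new memory management tables, and creates sysfs files for operating on the new memory.
– If the firmware can notify the OS that new memory has been added, this phase is triggered automatically.
– ACPI can deliver this notification. If the firmware has no such capability, the system administrator uses the "probe" operation instead.
• Logical memory hotplug phase
– Changes the memory state to "available" or "unavailable".
– The amount of memory visible to the user changes in this phase.
– When a memory range becomes available, the kernel turns all memory in it into free pages.
– This phase is called online/offline.
– It is triggered when the system administrator writes to a sysfs file.
– For hot-add, it must run after the physical hotplug phase. (If the operation is written into udev's hotplug scripts, the two phases can run without a manual step in between.)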
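The sysfs write that triggers the logical phase can be sketched as follows. This is a minimal sketch under assumptions: the helper names are hypothetical, the path layout /sys/devices/system/memory/memoryN/state is the one described in the kernel's memory-hotplug documentation, and the write only succeeds as root on a kernel built with memory hotplug support.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the sysfs state path for one memory block; returns 0 on success. */
static int memory_block_state_path(char *buf, size_t len, unsigned block)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/memory/memory%u/state", block);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Writes "online" or "offline" to the block's state file (hypothetical helper). */
static int set_memory_block_state(unsigned block, const char *state)
{
    char path[128];
    FILE *f;
    int rc;

    assert(strcmp(state, "online") == 0 || strcmp(state, "offline") == 0);
    if (memory_block_state_path(path, sizeof(path), block) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;                 /* no such block, or not permitted */
    rc = (fputs(state, f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;                   /* the kernel reports failures such as -EBUSY here */
    return rc;
}
```

Calling set_memory_block_state(32, "offline") would request that block 32 go offline; the return value only says whether the write was accepted.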
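Memory Hotplug Summary (continued)
3. Logical Memory Remove
• Memory offlining has to make a whole memory section unused, so it fails if the section contains memory that cannot be freed.
• In general, two techniques are used for memory offlining:
– (1) Reclaim and free all memory in the section.
– (2) Migrate all pages in the section.
• The current Linux implementation uses method (2): it frees all pages in the section by migrating them.
• The pages that can currently be migrated in Linux are anonymous pages and page cache pages.
• To offline a section by page migration, the kernel has to guarantee that the section contains only migratable pages.
• A boot option that makes sections consist only of migratable pages is currently supported. The "kernelcore=" or "movablecore=" boot option creates ZONE_MOVABLE, a zone for movable pages.
• Sections in ZONE_MOVABLE are considered easy to offline, but -EBUSY is sometimes returned while they are busy. Even if a memory section cannot be offlined because of -EBUSY, retrying a few times may succeed. (For example, a page referenced by an in-kernel call may be released soon.)
4. Memory Hotplug Event Notifier
• MEMORY_GOING_ONLINE
– Generated before new memory becomes available so that subsystems can prepare to handle it. The page allocator cannot yet allocate from the new memory.
• MEMORY_CANCEL_ONLINE
– Generated when MEMORY_GOING_ONLINE fails.
• MEMORY_ONLINE
– Generated when the memory has been brought online successfully. The callback may allocate pages from the new memory.
• MEMORY_GOING_OFFLINE
– Generated to begin offlining memory. Allocation from the memory is no longer possible, but part of the memory being offlined may still be in use. The callback can be used to free memory known to a subsystem from the indicated memory section.
• MEMORY_CANCEL_OFFLINE
– Generated when MEMORY_GOING_OFFLINE fails. Memory in the section that was being offlined becomes available again.
• MEMORY_OFFLINE
– Generated after memory offlining has completed.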
Thread Model
1:1 (Kernel-level threading)
• Each thread created by the user maps 1:1 to an entity scheduled by the kernel.
• Keeps the thread implementation simple.
• NPTL and LinuxThreads are 1:1 models.
• Solaris, NetBSD and FreeBSD use the 1:1 model.
N:1 (User-level threading)
• All application-level threads map to a single kernel-level scheduling entity.
• The kernel has no information about the application threads.
• Context switching is very fast, but hardware benefits such as multi-threaded processors or multiprocessor computers cannot be used at all. That is, no more than one thread can be scheduled at the same time.
• Used by GNU Portable Threads.
N:M (Hybrid threading)
• N application threads map to M schedulable kernel threads.
• The thread library is responsible for scheduling user threads, and context switching is very fast (because no system call is made).
• Very hard to implement, priority inversion becomes possible, and scheduling in user space has to be coordinated with scheduling in kernel space.
LinuxThreads vs. NPTL vs. NGPT
LinuxThreads
• Uses the clone system call to create a new process. The call creates a copy of the calling process that shares the caller's address space. The LinuxThreads project used this system call to emulate thread support entirely in user space. It is a partial implementation of POSIX Threads.
• The new process gets its own process identifier, which causes signal handling problems.
• Threads communicate through SIGUSR1 and SIGUSR2, so programs could not use these signals themselves.
• Before 2.6 the kernel had no real thread support and threads were provided through the clone system call. The pthread implementation of that period was therefore done entirely in user space, and it had problems with signal handling, scheduling, and inter-process synchronization.
• Two projects were started to solve these problems:
– NGPT: Next Generation POSIX Threads (led by IBM)
– NPTL: Native POSIX Thread Library (led by Red Hat)
• Replaced by NPTL.
NPTL (Native POSIX Thread Library)
• Like LinuxThreads, NPTL implements the 1:1 model.
• NPTL follows the POSIX specification, so it handles signals per process. getpid() returns the same process ID in every thread. For example, sending SIGSTOP stops the whole process, whereas with LinuxThreads only the thread that receives the signal stops. This improves debugging support, such as GDB, for NPTL-based applications.
• people.redhat.com/drepper/nptl-design.pdf
NGPT (Next-Generation POSIX Threads)
• Started under IBM's lead to replace LinuxThreads.
• N:M model
Layout of the Zone Wait Table
zone->wait_table: each entry is a wait_queue_head
Entry [0]    Wait queue head
Entry [1]    Wait queue head
Entry [2]    Wait queue head
…
Entry [n-1]  Wait queue head
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
zone->wait_table_hash_nr_entries = n
zone->wait_table_bits = log2(n)
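The relation between the number of entries and wait_table_bits, and the way a key could be hashed into the table, can be sketched as follows. This is only an illustration: the function names are hypothetical and the multiplicative hash constant is an assumption, not the hash the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two entry count, i.e. the value kept in wait_table_bits. */
static unsigned wait_table_bits_sketch(unsigned long nr_entries)
{
    unsigned bits = 0;
    assert(nr_entries != 0 && (nr_entries & (nr_entries - 1)) == 0);
    while ((1UL << bits) < nr_entries)
        bits++;
    return bits;
}

/*
 * Maps a key (for example a page address) to an entry index in
 * [0, 2^bits). The constant is an assumed golden-ratio multiplier.
 */
static unsigned long wait_table_index_sketch(uint64_t key, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return (unsigned long)((key * 0x9E3779B97F4A7C15ULL) >> (64 - bits));
}
```

Keeping the table size a power of two is what lets the index be taken from the top bits of the hash instead of using a modulo.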
Buddy System
• MAX_ORDER is the total number of orders. A block of order n holds 2^n contiguous pages.
• Orders range from 0 to MAX_ORDER - 1.
• The indices of the individual elements of the free_area[] array are also interpreted as order parameters and specify how many pages are present in the contiguous areas on a shared list. The zeroth array element lists sections with one page (2^0 = 1), the first lists page pairs (2^1 = 2), the third manages sets of 4 pages, and so on.
• Memory management based on the buddy system is concentrated on a single memory zone of a node, for instance, the DMA or high-memory zone. However, the buddy systems of all zones and nodes are linked via the allocation fallback list. Figure 3-23 illustrates this relationship.
<mmzone.h>
struct zone {
    ...
    /*
     * free areas of different sizes
     */
    struct free_area free_area[MAX_ORDER];
    ...
};
<mmzone.h>
struct free_area {
    struct list_head free_list[MIGRATE_TYPES];
    unsigned long nr_free;
};
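The order arithmetic behind free_area[] can be sketched in a few lines. This is a minimal sketch under assumptions: MAX_ORDER_SKETCH = 11 is taken as a typical default, and the helper names are hypothetical; the XOR and mask tricks are the usual way buddy indices are computed.

```c
#include <assert.h>

#define MAX_ORDER_SKETCH 11 /* assumed default; valid orders are 0..10 */

/* Number of contiguous pages in a block of the given order (2^order). */
static unsigned long pages_in_order(unsigned order)
{
    assert(order < MAX_ORDER_SKETCH);
    return 1UL << order;
}

/* Index of the buddy block: flip the bit that corresponds to the order. */
static unsigned long buddy_index(unsigned long page_idx, unsigned order)
{
    assert(order < MAX_ORDER_SKETCH);
    return page_idx ^ (1UL << order);
}

/* Start index of the merged block of order + 1 formed by the two buddies. */
static unsigned long merged_index(unsigned long page_idx, unsigned order)
{
    assert(order < MAX_ORDER_SKETCH);
    return page_idx & ~(1UL << order);
}
```

For example, the order-2 block starting at page 8 has its buddy at page 12, and both merge into the order-3 block that starts at page 8.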
Buddy System (continued)
• The Linux kernel's policy for memory fragmentation is not the defragmentation used by file systems but anti-fragmentation.
• Three types of pages
• Non-movable Pages – fixed in memory.
• Reclaimable Pages – cannot be moved but can be discarded. Data mapped from files; periodically reclaimed by the kswapd daemon.
• Movable Pages – can be moved. They belong to userspace applications and are mapped through page tables.
• The core of anti-fragmentation is grouping pages by their mobility.
Pages with the same mobility attribute are managed together in blocks. When two contiguous unmovable pages are needed, an order = 1 block is taken from a page block whose unmovable bit is on, which **prevents** page fragmentation. The word **prevents** is used because the Linux kernel handles fragmentation with anti-fragmentation, which differs from the defragmentation approach of file systems.
File System (Ext2)
Disk and partition layout (figure)
• Disk hda is divided into partitions hda1, hda2, hda3 (Sec. 0, Sec. 1, …).
• Partition: Boot Block | Block Group 0 | Block Group 1 | … | Block Group n
Block group layout (figure, 4 KB blocks)
• Super Block: 1 block (1 KB used, 3 KB unused). Stores the information that describes the file system.
• Group Descriptor Table: variable, n blocks
• Block Bitmap: 1 block, 4 KB
• Inode Bitmap: 1 block, 4 KB
• Inode Table: variable, n blocks
• Data Blocks: Data Block 0 … Data Block n
Super Block fields
• Inode count (whole file system)
• Block count (whole file system)
• Free block count
• Free inode count
• First data block (where block group 0 starts)
• Blocks per group
• Inodes per group
• Inode size
• Last mounted
• UUID (file system identifier)
• Volume name
Group Descriptor
• Descriptor = 32 bytes
• Free block count
• Free inode count
Block Group
• A collection of blocks.
• Block Bitmap = 1 block
• Max Blocks = 4096 x 8
• Sizeof (Block Group) = Max Blocks x 4 KB = 128 MB
• 10 GB (partition) = 10 GB / 128 MB = 80 Block Groups
Inode
• An inode represents a file or a directory.
• Sizeof (inode) = 128 bytes
• Inode numbers start at #1.
• The block group number can be derived from the inode number.
• Block Group = (inode - 1) / INODES_PER_GROUP
• INODES_PER_BLOCK = 4096 B / 128 B = 32
• # of blocks for inode table = INODES_PER_GROUP / INODES_PER_BLOCK
• flags & mode: type (S_IFLNK, S_IFDIR, S_IFREG) and permission bits xrw for usr, grp, othr
• Block pointers: direct 0 … 11, then indirect blocks (1024 entries per 4 KB block)
• Max file size = 12 x 4 KB + 1024 x 4 KB + (1024)^2 x 4 KB + (1024)^3 x 4 KB = ~4 TB
Reserved inodes in the Inode Table
• 1: inode for unusable (bad) data blocks
• 2: Root Directory inode
• 3, 4: reserved
• 5: Boot Loader inode
• 6: undelete directory inode
• 7 … 10: reserved
• 11: /lost+found
Example
/
|---home/
|---xorg.conf
• Inode 2 (root): type: dir, Data: 33
• Inode 45 (xorg.conf): type: file, Data: 35, 36
• Inode 48 (home): type: dir, Data: 52
• Data block 33 (dentries of /): . = 2, .. = 2, xorg.conf = 45, home = 48
• Data block 52 (dentries of /home): . = 48, .. = 2
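The arithmetic on this slide can be checked with a short C sketch. It assumes the 4 KB block size, 128-byte inodes and 4-byte block pointers used in the figure; the function and macro names are hypothetical. The max-file-size value is only the block-pointer addressing limit shown on the slide; other on-disk fields may impose a lower practical limit.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES      4096u                        /* block size from the slide */
#define INODE_BYTES      128u                         /* sizeof(inode) from the slide */
#define PTRS_PER_BLOCK   (BLOCK_BYTES / 4u)           /* 1024 entries, assuming 4-byte pointers */
#define INODES_PER_BLOCK (BLOCK_BYTES / INODE_BYTES)  /* 32 */

/* Block group that holds inode number ino (inode numbers start at 1). */
static uint32_t inode_block_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) / inodes_per_group;
}

/* Number of blocks needed for one group's inode table. */
static uint32_t inode_table_blocks(uint32_t inodes_per_group)
{
    return (inodes_per_group + INODES_PER_BLOCK - 1) / INODES_PER_BLOCK;
}

/* One bitmap block has 4096 x 8 bits, each covering one 4 KB block. */
static uint64_t block_group_bytes(void)
{
    return (uint64_t)BLOCK_BYTES * 8u * BLOCK_BYTES;
}

/* 12 direct blocks plus single, double and triple indirect blocks. */
static uint64_t max_file_bytes(void)
{
    uint64_t n = PTRS_PER_BLOCK;
    return (12u + n + n * n + n * n * n) * BLOCK_BYTES;
}
```

With 8192 inodes per group (an example value, not taken from the slide), inode 8193 is the first inode of block group 1, and a 10 GB partition divides into 80 groups of 128 MB.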
Cache
Cache Line Accessing
The address is split into three parts to access a cache line: Tag | Index | Offset
1. Index
• The middle bits of the address are used as the index of the cache line.
• With 512 cache lines, 9 bits are used (2^9 = 512).
2. Tag
• Used to check whether the cache line holds the wanted value,
• because different addresses can point to the same cache line.
3. Offset
• Used to select the wanted data inside the cache line.
• With a 64-byte cache line, the offset uses 6 bits.
The index is placed in the middle of the address so that fewer collisions occur when addresses are hashed onto cache lines.
Cache Allocation
To resolve cache misses caused by conflicts, several cache lines are allocated to one index.
1. Direct-Mapped Cache
• One index maps to one cache line.
• If two data items with the same cache index are accessed alternately, cache misses occur.
2. Set-Associative Cache
• One index can be placed in any of N cache lines (figure: 2-way set-associative with Set 0, Set 1, Set 2).
3. Fully-Associative Cache
• Data can be mapped to any cache line.
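The Tag / Index / Offset split can be sketched for the numbers used above (64-byte lines, 512 lines, direct-mapped). The 32-bit address width and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 6u /* 64-byte cache line */
#define INDEX_BITS  9u /* 512 cache lines    */

struct cache_addr {
    uint32_t tag;    /* remaining upper bits, compared on lookup */
    uint32_t index;  /* selects the cache line */
    uint32_t offset; /* selects the byte inside the line */
};

/* Splits a 32-bit address into tag, index and offset. */
static struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.offset = addr & ((1u << OFFSET_BITS) - 1u);
    a.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}

/* In a direct-mapped cache, two addresses conflict when only the tag differs. */
static int conflicts(uint32_t x, uint32_t y)
{
    struct cache_addr a = split_address(x), b = split_address(y);
    assert(OFFSET_BITS + INDEX_BITS < 32u);
    return a.index == b.index && a.tag != b.tag;
}
```

Two addresses that are 32 KB apart (64 bytes x 512 lines) land on the same index with different tags, which is exactly the alternating-access case that causes conflict misses.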
Cache (continued)
Cache Replacement Policy
1. LRU
• Least Recently Used
2. LFU
• Least Frequently Used
3. FIFO
• First In First Out
Cache Writing Policy
1. CPOLICY_UNCACHED
• The cache is not used.
2. CPOLICY_BUFFERED
• The cache is not used; the write buffer between the cache and memory is used.
3. CPOLICY_WRITETHROUGH
• A write to the cache is immediately reflected in memory and the lower levels.
• Used for storage devices that are attached and detached frequently, such as USB memory.
4. CPOLICY_WRITEBACK
• The modified data is first kept in the cache and written to memory when the line becomes a cache replacement victim.
5. CPOLICY_WRITEALLOC
• Determines how a cache line is allocated on a cache miss.
• There are two methods:
1) read-allocate: a cache line is allocated when data is read from memory
2) write-allocate: a cache line is allocated when data is written to memory
Multi Cache Level
1. L1 Cache
• Focuses on reducing hit latency.
• Around 32 KB, 64 KB.
2. L2 & L3 Cache
• Focus on reducing miss rate and miss penalty.
• 8 MB, 12 MB
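Cache (continued)
Cache Miss Types
1. Cold Miss
• Also called Compulsory Miss.
• The cache miss that occurs when data is read for the first time.
2. Conflict Miss
• The cache miss that occurs because the cache has insufficient associativity.
3. Capacity Miss
• The cache miss that occurs because the cache has insufficient capacity.
4. Coherence Miss
• The cache miss that occurs because another processor invalidated the cache line.
AMD (Phenom) / Intel (Nehalem) cache organization (figure)
• Core 0, Core 1, … Core n each have their own I$, D$ and L2$.
• All cores share the L3 Shared Cache.
• Below the L3 cache: Memory Controller (DDR2/DDR3) and System Bus.
Cache in Multi Core: the cache coherence problem arises
Cache coherence ensures that a read of any address always returns the latest content. The MSI snooping protocol is the most classic solution to cache coherence: each cache line carries a coherence state of M (Modified), S (Shared), or I (Invalid), and on every cache access a signal is sent over the bus to all caches (snooping).
When a cache line has been invalidated by Processor A and Processor B then accesses the same cache line, a cache miss occurs. This case in particular is called a coherence miss.
MSI Snooping Protocol
1. Invalid
• The cache line is not valid. The value must be requested before it can be read or written.
2. Shared
• The cache line is shared in one or more places. A processor that wants to write must send a signal that changes its own cache line to the M state and the copies in other caches to the I state.
3. Modified
• The cache line has been modified by one processor (dirty).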
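The three states and the transitions described above can be modelled for a single cache line shared by two cores. This is a simplified sketch with hypothetical names; bus traffic and write-back to memory are reduced to state changes.

```c
#include <assert.h>

enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };

#define NCORES 2

/* Coherence state of one cache line, as seen by each core's private cache. */
struct msi_line {
    enum msi_state state[NCORES];
};

/* Returns 1 if a read by this core would miss (line is Invalid). */
static int msi_read_misses(const struct msi_line *l, int core)
{
    assert(core >= 0 && core < NCORES);
    return l->state[core] == MSI_INVALID;
}

/* Read: a miss fetches the line; a Modified copy elsewhere is downgraded. */
static void msi_read(struct msi_line *l, int core)
{
    if (!msi_read_misses(l, core))
        return; /* hit in S or M, no state change */
    for (int i = 0; i < NCORES; i++)
        if (i != core && l->state[i] == MSI_MODIFIED)
            l->state[i] = MSI_SHARED; /* owner writes back and shares */
    l->state[core] = MSI_SHARED;
}

/* Write: the writer becomes Modified, every other copy becomes Invalid. */
static void msi_write(struct msi_line *l, int core)
{
    assert(core >= 0 && core < NCORES);
    for (int i = 0; i < NCORES; i++)
        if (i != core)
            l->state[i] = MSI_INVALID;
    l->state[core] = MSI_MODIFIED;
}
```

After both cores have read the line and core 0 writes it, the next read by core 1 misses; that miss is the coherence miss from the list above.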
Cache (continued)
False Sharing
False Sharing Example (figure: Core 0 and Core 1, each with its own cache, holding Data 1 and Data 2)
• Processor 0 updates Data 1 and processor 1 updates Data 2.
• Each core has its own L1 cache (private cache).
• Caches are read and written in units of a cache line, for example 64 bytes.
• Core 0's private cache holds Data 1, which Core 0 updates.
• Data 2 is in Core 1's private cache.
• Because of locality, Data 1 and Data 2 are likely to be in the same cache line.
• Data 1 and Data 2 are declared one after the other (see figure).
• When Core 0 updates Data 1, it has to invalidate this cache line in any other core that might hold it.
• Core 1 also has to invalidate this cache line when it updates Data 2.
• When Core 0 invalidates the cache line that holds Data 1, Core 1 loses the cache line that holds Data 2.
• When Core 1 then reads Data 2, a cache miss occurs (coherence miss).
• Although no data is actually shared, the cores keep invalidating each other's cache line and cause cache misses.
Solved by cache line alignment (padding with empty space) through ____cacheline_aligned:
spinlock_t async_lock ____cacheline_aligned;
int req ____cacheline_aligned; /* transmit slots submitted */
int done ____cacheline_aligned; /* transmit slots completed */
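The effect of the alignment can be checked in user space with C11 alignas, which plays the role of ____cacheline_aligned here. This is a sketch under assumptions: the 64-byte line size and the struct names are made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumed cache line size */

/* Both counters fall into the same cache line: false sharing is possible. */
struct tx_counters_packed {
    int req;
    int done;
};

/* Each counter starts on its own cache line; padding fills the gap. */
struct tx_counters_padded {
    alignas(CACHE_LINE_BYTES) int req;
    alignas(CACHE_LINE_BYTES) int done;
};

/* Cache line number (relative to the struct start) that an offset falls into. */
static size_t line_of(size_t offset)
{
    return offset / CACHE_LINE_BYTES;
}

/* Returns 1 when the two counters of the padded layout use different lines. */
static int padded_layout_is_separated(void)
{
    assert(sizeof(struct tx_counters_padded) % CACHE_LINE_BYTES == 0);
    return line_of(offsetof(struct tx_counters_padded, req)) !=
           line_of(offsetof(struct tx_counters_padded, done));
}
```

The cost is space: the padded struct grows from 8 bytes to 128 bytes, so this is worth doing only for fields that different cores update frequently.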
Relationship between PGD, PMD, and PTE (in the kernel)
Preemption
Background
• Kernel preemption lets urgent processes run sooner than normal processes.
• Kernel preemption was added in 2.5.
• Kernel preemption is a different concept from preemption in user land.
Borrowing SMP Code
• The kernel needed only modest changes to support kernel preemption, but it was not as easy as user-land preemption. If the kernel cannot complete certain actions (data manipulation) in a single operation, a race condition occurs. This is the same problem that occurs on multiprocessor systems. These problems had all been identified for SMP, so the SMP implementation could be reused.
• If the kernel can be preempted, a uniprocessor system behaves like an SMP system.
Preemption Helper Functions
• preempt_disable --> inc_preempt_count, and prevents memory optimizations by the compiler (they could cause problems for the kernel preemption mechanism)
• preempt_check_resched --> checks whether scheduling is needed and schedules if it is
• preempt_enable --> enables kernel preemption --> preempt_check_resched
• preempt_enable_no_resched --> enables kernel preemption without rescheduling
• For SMP systems, dec_preempt_count / inc_preempt_count are integrated into the synchronization operations.
• get_cpu and put_cpu disable and re-enable kernel preemption, respectively.
Preemption Checking
• preempt_count in struct thread_info tracks whether the task can be preempted.
• preempt_count = 0 --> preemptible, > 0 --> not preemptible
• preempt_count is changed through dec_preempt_count and inc_preempt_count.
• preempt_count is not a Boolean value because a critical section can be entered through several routes.
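Why preempt_count is a counter and not a Boolean can be shown with a small user-space model. All names carry a _sketch suffix because this is an illustration, not kernel code; rescheduling is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the fields of struct thread_info used here. */
struct thread_info_sketch {
    int preempt_count;  /* 0: preemptible, > 0: not preemptible */
    bool need_resched;  /* models TIF_NEED_RESCHED */
    int reschedules;    /* how often the scheduler would have run */
};

static bool preemptible_sketch(const struct thread_info_sketch *t)
{
    return t->preempt_count == 0;
}

static void preempt_disable_sketch(struct thread_info_sketch *t)
{
    t->preempt_count++; /* nested calls stack up */
}

static void preempt_enable_sketch(struct thread_info_sketch *t)
{
    assert(t->preempt_count > 0);
    t->preempt_count--;
    /* Models preempt_check_resched: only the outermost enable may reschedule. */
    if (t->preempt_count == 0 && t->need_resched) {
        t->need_resched = false;
        t->reschedules++;
    }
}
```

With two nested disable calls, the first enable leaves the task non-preemptible; a Boolean flag would have re-enabled preemption too early, inside the outer critical section.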
Preemption (continued)
Difference between SMP and Kernel Preemption
• On SMP, per-CPU variables did not need protection, but with kernel preemption they do, because the same value can be referenced through another route on the same CPU.
preempt_count vs. PREEMPT_ACTIVE
• preempt_count is the variable used to check whether another task may preempt the kernel.
• PREEMPT_ACTIVE, in contrast, is the flag the scheduler uses to check whether the scheduling call was caused by preemption.
How does the kernel know if preemption is required?
• preempt_enable --> preempt_check_resched --> test_thread_flag (TIF_NEED_RESCHED)
• TIF_NEED_RESCHED flag --> means that a process has requested CPU time --> preempt_schedule
• preempt_schedule: returns immediately if preempt_count > 0 or IRQs are disabled
• add_preempt_count(PREEMPT_ACTIVE): marks that the scheduling request came from preemption
• schedule: for a scheduling request caused by preemption, deactivate_task is skipped and the high-priority task is scheduled
• Two ways in which kernel preemption is triggered
– preempt_schedule
– preempt_schedule_irq
* Means that the preemption request came from IRQ context.
* Called with IRQs disabled (to prevent recursive calls caused by simultaneous IRQs).
Low Latency
• cond_resched --> called at intervals during long operations so that other processes can run. Unrelated to preemption.
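Kernel Locks
Lock Types / Locks / Description
Locks that can sleep
• Semaphore, Mutex
• Reader Writer Semaphore (rwsem): multiple readers may enter at the same time, but only one process may enter for writing.
• Completion: with a semaphore, when up/down are called in quick succession, the semaphore data structure can be destroyed without waiting for the up operation to finish. up() is replaced by complete() and down() by wait_for_completion().
Locks that cannot sleep
• Spinlocks: if the lock is held by another process, spin until it is released. This is the method for implementing locking on SMP.
• Seqlocks: the shared resource is accessed without locking. Reads go ahead immediately, and if they collide with a write, the reader retries until it has read a consistent value.
• RCU (Read-Copy-Update): introduced to improve on the inefficiency of spinlocks. A writer copies the data, modifies the copy, and switches the pointer only after the readers of the old data have finished.
Others
• Atomic Variables: when the shared resource is a simple integer, the atomic_t type is used instead of a lock, and the operation compiles to a single machine instruction where the architecture allows it, which makes it fast.
• BIT Operation: for setting individual bits atomically, for example set_bit().
• Bottom Halves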
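The retry behaviour of a seqlock reader can be sketched with C11 atomics. This is a user-space illustration with a single writer assumed and hypothetical names; it is not the kernel's seqlock_t API. The protected fields are relaxed atomics so that the concurrent reads stay well defined in C11.

```c
#include <assert.h>
#include <stdatomic.h>

struct seq_pair {
    atomic_uint seq; /* even: stable, odd: a write is in progress */
    atomic_int a;    /* protected data */
    atomic_int b;
};

/* Writer side (assumes one writer, or writers serialized by another lock). */
static void seq_write(struct seq_pair *p, int a, int b)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    assert((s & 1u) == 0);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader side: no lock is taken; the read is retried if a write interfered. */
static void seq_read(struct seq_pair *p, int *a, int *b)
{
    unsigned s1, s2;
    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        *a = atomic_load_explicit(&p->a, memory_order_relaxed);
        *b = atomic_load_explicit(&p->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2); /* odd or changed: retry */
}
```

Readers never block the writer, which is why this scheme suits data that is read often and written rarely.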
Linker Descriptor
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
.comment 0
.stab.indexstr 0
.stab.index 0
.stab.exclstr 0
.tracedata
.rio_route
.builtin_fw
.pci_fixup
BUG_TABLE
.rodata1
.rodata
.text
__ksymtab_strings
.stab.excl 0
__kcrctab_gpl_future
.stabstr 0
__kcrctab_unused_gpl
.stab 0
__kcrctab_unused
.bss
__kcrctab_gpl
.data
__kcrctab
.ARM.unwind_tab
__ksymtab_gpl_future
.ARM.unwind_idx
__ksymtab_unused_gpl
__param
__ksymtab_unused
__init_rodata
__ksymtab_gpl
__ksymtab
/DISCARD/
.init
.text.head
0xC0008000
= PAGE_OFFSET + TEXT_OFFSET = 0xC0000000 + 0x00008000
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
*(.dev.*.*) = if !CONFIG_HOTPLUG
*(.cpu.*.*) = if !CONFIG_HOTPLUG_CPU
*(.mem.*.*) = if !CONFIG_MEMORY_HOTPLUG
*(.mem.init.rodata)
*(.mem.init.data)
*(.cpu.init.rodata)
*(.cpu.init.data)
*(.dev.init.rodata)
*(.dev.init.data)
*(.init.data)
usr/built-in.o(.init.ramfs) = if CONFIG_BLK_DEV_INITRD
.tracedata
.rio_route
INIT_DATA
.builtin_fw
*(.data.percpu.shared_aligned)
*(.data.percpu)
*(.data.percpu.page_aligned)
usr/built-in.o(.init.ramfs)
*(.security_initcall.init)
*(.con_initcall.init)
.pci_fixup
BUG_TABLE
.rodata1
.rodata
.text
/DISCARD/
.init
.text.head
INITCALLS
*(.early_param.init)
*(.init.setup)
*(.taglist.init)
*(.arch.info.init)
*(.proc.info.init)
INIT_TEXT
*(.text.head)
*(.initcall7s.init)
*(.initcall7.init)
*(.initcall6s.init)
*(.initcall6.init)
*(.initcallrootfs.init)
*(.initcall5s.init)
*(.initcall5.init)
*(.initcall4s.init)
*(.initcall4.init)
*(.initcall3s.init)
*(.initcall3.init)
*(.initcall2s.init)
*(.initcall2.init)
*(.initcall1s.init)
*(.initcall1.init)
*(.initcall0s.init)
*(.initcall0.init)
*(.initcallearly.init)
*(.mem.init.text)
*(.cpu.init.text)
*(.dev.init.text)
*(.init.text)
0xC0008000
= PAGE_OFFSET + TEXT_OFFSET = 0xC0000000 + 0x00008000
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
*(__ex_table) = if CONFIG_MMU
*(.fixup) = if CONFIG_MMU
*(.dev.*.*) = if !CONFIG_HOTPLUG
*(.cpu.*.*) = if !CONFIG_HOTPLUG_CPU
*(.mem.*.*) = if !CONFIG_MEMORY_HOTPLUG
.tracedata
.rio_route
*(.kprobes.text)
*(.got)
*(.glue_7t)
*(.glue_7)
*(.rodata.*)
*(.rodata)
*(.gnu.warning)
*(.fixup)
.builtin_fw
KPROBES_TEXT
.pci_fixup
LOCK_TEXT
BUG_TABLE
SCHED_TEXT
.rodata1
TEXT_TEXT
.rodata
*(.exception.text)
.text
/DISCARD/
.init
.text.head
*(__ex_table)
*(.fixup)
*(.ARM.extab.exit.text)
*(.ARM.exidx.exit.text)
*(.exitcall.exit)
EXIT_DATA
EXIT_TEXT
*(.spinlock.text)
*(.sched.text)
*(.text.unlikely)
*(.mem.exit.text)
*(.mem.init.text)
*(.cpu.exit.text)
*(.cpu.init.text)
*(.dev.exit.text)
*(.dev.init.text)
*(.ref.text)
*(.text)
*(.text.hot)
*(.mem.exit.rodata)
*(.mem.exit.data)
*(.cpu.exit.rodata)
*(.cpu.exit.data)
*(.dev.exit.rodata)
*(.dev.exit.data)
*(.exit.data)
*(.mem.exit.text)
*(.cpu.exit.text)
*(.dev.exit.text)
*(.exit.text)
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
*(.tracedata)
*(.rio_route_ops)
/* RapidIO route ops */
*(.builtin_fw)
.tracedata
.rio_route
.builtin_fw
.pci_fixup
BUG_TABLE
*(.pci_fixup_suspend)
*(.pci_fixup_resume_early)
*(.pci_fixup_resume)
*(.pci_fixup_enable)
*(.pci_fixup_final)
*(.pci_fixup_header)
*(.pci_fixup_early)
*(__bug_table)
/* if CONFIG_GENERIC_BUG */
.rodata1
.rodata
.text
/DISCARD/
.init
*(.rodata1)
*(__tracepoints_strings)
*(__markers_strings)
*(__vermagic)
*(.rodata.*)
*(.rodata)
/* Tracepoints: strings */
/* Markers: strings */
/* Kernel version magic */
.text.head
AT(ADDR(.rodata1) - LOAD_OFFSET)
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
Align = 4096, AT(ADDR(.rodata) - LOAD_OFFSET)
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
__ksymtab_strings
/* Kernel symbol table: strings */
*(__kcrctab_gpl_future)
/* Kernel symbol table:
GPL-future-only symbols */
*(__kcrctab_unused_gpl)
/* Kernel symbol table:
GPL-only unused symbols */
*(__kcrctab_unused)
/* Kernel symbol table:
Normal unused symbols */
*(__kcrctab_gpl)
/* Kernel symbol table:
GPL-only symbols */
*(__kcrctab)
/* Kernel symbol table:
Normal symbols */
*(__ksymtab_gpl_future)
/* Kernel symbol table:
GPL-future-only symbols */
*(__ksymtab_unused_gpl)
/* Kernel symbol table:
GPL-only unused symbols */
*(__ksymtab_unused)
/* Kernel symbol table:
Normal unused symbols */
*(__ksymtab_gpl)
/* Kernel symbol table:
GPL-only symbols */
*(__ksymtab)
/* Kernel symbol table:
Normal symbols */
__ksymtab_strings
__kcrctab_gpl_future
__kcrctab_unused_gpl
__kcrctab_unused
__kcrctab_gpl
__kcrctab
__ksymtab_gpl_future
__ksymtab_unused_gpl
__ksymtab_unused
__ksymtab_gpl
__ksymtab
.tracedata
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
*(__mcount_loc) = if CONFIG_FTRACE_MCOUNT_RECORD
*(.dev.*.*) = if !CONFIG_HOTPLUG
*(.cpu.*.*) = if !CONFIG_HOTPLUG_CPU
*(.mem.*.*) = if !CONFIG_MEMORY_HOTPLUG
.stab 0
.bss
*(.ARM.extab*)
/* Stack unwinding tables,
if CONFIG_ARM_UNWIND */
*(.ARM.exidx*)
/* Stack unwinding tables,
if CONFIG_ARM_UNWIND */
*(__param)
/* Built-in module parameters */
.data
.ARM.unwind_tab
.ARM.unwind_idx
__param
__init_rodata
__ksymtab_strings
*(.mem.exit.rodata)
*(.mem.init.rodata)
*(.cpu.exit.rodata)
*(.cpu.init.rodata)
*(.dev.exit.rodata)
*(.dev.init.rodata)
MCOUNT_REC
*(.ref.rodata)
*(__mcount_loc)
vmlinux linker descriptor analysis
Linux Kernel 2.6.30.4
*(__ex_table) = if CONFIG_MMU
*(.dev.*.*) = if !CONFIG_HOTPLUG
*(.cpu.*.*) = if !CONFIG_HOTPLUG_CPU
*(.mem.*.*) = if !CONFIG_MEMORY_HOTPLUG
*(__syscalls_metadata) = if CONFIG_FTRACE_SYSCALLS
*(_ftrace_events) = if CONFIG_EVENT_TRACER
*(__trace_printk_fmt) = if CONFIG_TRACING
*(_ftrace_branch) = if CONFIG_PROFILE_ALL_BRANCHES
*(_ftrace_annotated_branch) =
if CONFIG_TRACE_BRANCH_PROFILING
.comment 0
.stab.indexstr 0
.stab.index 0
.stab.exclstr 0
.stab.excl 0
.stabstr 0
.stab 0
.bss
.data
*(.comment)
*(.stab.indexstr)
/* Stabs debugging sections */
*(.stab.index)
/* Stabs debugging sections */
*(.stab.exclstr)
/* Stabs debugging sections */
*(.stab.excl)
/* Stabs debugging sections */
*(.stabstr)
/* Stabs debugging sections */
*(.stab)
/* Stabs debugging sections */
*(COMMON)
*(.bss)
CONSTRUCTORS
DATA_DATA
*(__ex_table)
*(.data.cacheline_aligned)
*(.data.nosave)
*(.data.init_task)
.ARM.unwind_tab
AT (__data_loc)
*(__syscalls_metadata)
*(_ftrace_events)
*(__trace_printk_fmt)
*(_ftrace_branch)
*(_ftrace_annotated_branch)
*(__verbose)
*(__tracepoints)
*(__markers)
*(.mem.exit.data)
*(.mem.init.data)
*(.cpu.exit.data)
*(.cpu.init.data)
*(.dev.exit.data)
*(.dev.init.data)
*(.ref.data)
*(.data)
Cache Type
Cache Type: PIPT, VIVT, VIPT, PIVT
Cache types fall into four kinds depending on whether the index and the tag are expressed as physical or virtual addresses. For the concepts of index and tag, see page 460 of the ARM System Developer's Guide.
A cache line has a tag, which holds a value that identifies a location in main memory. Depending on whether this value is virtual or physical, the cache is virtually tagged or physically tagged. The index is the address the processor uses for the lookup, and it is likewise either virtual or physical. For it to be physical, the MMU has to be involved first (assuming a Linux that uses the MMU).
source: http://upload.wikimedia.org/wikipedia/commons/3/3d/Cache%2Cbasic.svg
PIPT (Physically indexed, physically tagged)
Both the index and the tag use the physical address. This is the simplest scheme, and because physical addresses are unique there is no address aliasing problem, but the address has to be translated to a physical address first. That means the TLB is involved, and a TLB miss slows the access down. Even without a TLB miss, the cache lookup has to follow the TLB lookup sequentially, so the scheme is slow overall.
VIVT (Virtually indexed, virtually tagged)
Both the index and the tag use the virtual address. Because virtual addresses are used, the MMU does not have to be involved to find the physical address. However, an aliasing problem arises when different virtual addresses point to the same physical address. It occurs when one physical address is cached under different virtual addresses, and it causes a coherency problem. For example, when different user applications mmap the same file, the virtual addresses in the applications differ but the physical address is the same. Another problem is that the V->P (virtual to physical) mapping can change. In that case TLB entry changes need care, and the affected cache lines must be flushed before the entry changes, because the virtual address is no longer valid.
VIPT (Virtually indexed, physically tagged)
The index uses the virtual address and the tag uses the physical address. Compared with PIPT it has lower latency (the time needed to find the physical address), because the cache line can be looked up in parallel with the TLB lookup. The tag cannot be compared until the physical address is known, however, because the tag uses the physical address. Compared with VIVT, the cache can detect aliasing because the tag is a physical address. VIPT needs a few more tag bits, because the index bits no longer represent the same address.
PIVT (Physically indexed, virtually tagged)
Exists only in theory; a useless type.
Tips
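Adding Profiling Code for Debugging
• The -finstrument-functions option adds calls to the profile functions at the start and end of every function.
• __no_instrument_function__ keeps the marked function from being instrumented. Without it the hook itself is instrumented, calls itself recursively, and profiling ends in a stack overflow.
• The -ldl option makes the dladdr function available.
#define _GNU_SOURCE /* exposes Dl_info and dladdr() in <dlfcn.h> */
#include <dlfcn.h>
/* Called on entry to every instrumented function. */
void __attribute__ ((__no_instrument_function__)) __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    Dl_info info_this;
    Dl_info info_caller;
    dladdr(this_fn, &info_this);     /* look up the function being entered */
    dladdr(call_site, &info_caller); /* look up the call site */
}
/* Called on exit from every instrumented function. */
void __attribute__ ((__no_instrument_function__)) __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    Dl_info info_this;
    Dl_info info_caller;
    dladdr(this_fn, &info_this);     /* look up the function being left */
    dladdr(call_site, &info_caller); /* look up the call site */
}
$ gcc -finstrument-functions -g xxxx.c -o xxxx -ldl
To Do
How can the function name be obtained from a function pointer? Showing only the address does not help much with debugging.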
GCC Concept
GCC stands for GNU Compiler Collection. The collection includes compilers such as g++ (GNU C++), gcc (GNU C), Fortran, and Java (over 2.8 million LOC).
gcc (figure): Source Code --> Preprocessor (CPP) --> Compiler (CC1) --> Assembler (AS) --> Linker (LD)
cc1 (figure): Parser --> Genericizer --> Gimplifier --> Tree SSA Optimizer --> RTL Generator --> Optimizer --> Code Generator --> Target Code
• Parser: builds the parse tree and performs the grammatical check
• Genericizer: builds the language-independent tree (GENERIC)
• Gimplifier: converts GENERIC --> GIMPLE
• SSA: Static Single Assignment, in which every variable is assigned exactly once
• RTL Generator: converts GIMPLE --> RTL (Register Transfer Language)
• Optimizer: optimizes the RTL code
• Code Generator: generates assembly language
Machine Independent (GIMPLE code; GIMPLE is a variant of the SIMPLE IL)
a = b + c + d
-->
T1 = b + c;
a = T1 + d;
a = b ? c : d;
-->
if (b) T1 = c;
else T1 = d;
a = T1;
Machine Dependent (RTL code; RTL is an intermediate language)
(set (reg:SI 103) (const_int 0 [0x0]))
• const_int 0: the integer constant 0
• reg:SI 103: register 103 in SI mode (single integer, 4 bytes)
• This is the RTX for i = 0
• RTL was used in the Portable Optimizer (PO) and came into gcc when gcc adopted PO.
• GCC 4.x puts more weight on GIMPLE than on RTL.
• The trend is to remove RTL gradually.
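The two GIMPLE lowering examples can be written as ordinary C to confirm that the three-address form computes the same result as the source expression. The function names are made up for illustration, and the temporaries mirror the T1 in the slide; this is not actual compiler output.

```c
#include <assert.h>

/* Source form: a = b + c + d */
static int sum_source(int b, int c, int d)
{
    return b + c + d;
}

/* Lowered form: at most one operation per statement, using a temporary. */
static int sum_lowered(int b, int c, int d)
{
    int T1 = b + c;
    int a = T1 + d;
    return a;
}

/* Source form: a = b ? c : d */
static int select_source(int b, int c, int d)
{
    return b ? c : d;
}

/* Lowered form: the conditional expression becomes explicit control flow. */
static int select_lowered(int b, int c, int d)
{
    int T1;
    if (b)
        T1 = c;
    else
        T1 = d;
    return T1;
}

/* Checks that both forms agree for one set of inputs. */
static void check_lowering(int b, int c, int d)
{
    assert(sum_source(b, c, d) == sum_lowered(b, c, d));
    assert(select_source(b, c, d) == select_lowered(b, c, d));
}
```

Breaking expressions into single-operation statements is what makes later passes such as SSA construction straightforward, since every temporary has one defining statement.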
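Question: when a user calls malloc, does the kernel's memory allocator allocate the memory?
■ Memory allocation in the kernel:
– Handled by the buddy allocator and the slab allocator. Used inside the kernel when physically contiguous memory is needed.
■ Memory allocation by malloc:
– Using the heap: the heap is obtained in advance through brk and then carved into pieces that are handed out to user space. In the figure this is the part that brk points to (about 128 KB).
– Using mmap: large allocations, such as 1 MB or more, are obtained through mmap. In this case memory is allocated in 4 KB units.
■ What is the difference between mmap and brk?
– brk allocates a region of a given size starting right above the bss in the process's virtual address space. The base address therefore always stays the same, and only the break position is updated through brk. The figure shows allocated chunks and free-but-blocked chunks: the region is split into small pieces and only as much as needed is passed to the application.
– mmap, as the name says, maps as much physical memory as needed from unmapped space into the virtual address space. With mmap and munmap, fragmentation occurs and the address has to be updated on every call.
* malloc is implemented on top of dlmalloc (Doug Lea Malloc). It was written by Doug Lea, and many malloc() implementations are said to follow his algorithm. malloc is a memory allocator that hands out memory in user space.
<malloc_test_with_num.c>
Figure source: http://www.linuxjournal.com/article/6390
$ strace ./malloc_test_with_num 127
$ strace ./malloc_test_with_num 400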
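The source of malloc_test_with_num.c is not included in the slides. A hypothetical stand-in could look like the sketch below, assuming that the command-line number is a size in KB. Wrapped in a main() and run under strace, a small request would be expected to be served from the brk heap and a large one through mmap, although the exact threshold depends on the allocator and its configuration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical core of a test like malloc_test_with_num.c:
 * allocates kb kilobytes, touches every byte, and frees the block.
 * Returns 0 on success and -1 if the allocation failed.
 */
static int alloc_and_touch_kb(size_t kb)
{
    size_t bytes = kb * 1024;
    unsigned char *p;
    int ok;

    assert(kb > 0);
    p = malloc(bytes);
    if (p == NULL)
        return -1;
    memset(p, 0xA5, bytes);        /* force the pages to be backed */
    ok = (p[0] == 0xA5 && p[bytes - 1] == 0xA5);
    free(p);
    return ok ? 0 : -1;
}
```

Comparing the system calls traced for alloc_and_touch_kb(127) and alloc_and_touch_kb(400) is one way to observe the brk versus mmap behaviour described above.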
Program Execution Steps
[source: m.c]
[source: func.c]
[Compile & Linking]
$ gcc -c m.c func.c
$ gcc -o out m.o func.o
[Disassembling]
$ objdump -d out
<main function>
<out_func relocation>
Relocation performed by the linker
$ objdump -R out
Shows the (dynamic) relocation table
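Reference
To be updated
1. Wolfgang Mauerer, Professional Linux Kernel Architecture, Wrox
2. 백승재 외, 리눅스 커널 내부구조 (Linux Kernel Internals), 교학사
3. arina02.tistory.com
4. 김민장, 프로그래머가 몰랐던 멀티코어 CPU 이야기 (The Multicore CPU Story Programmers Did Not Know)
5. 정준석, IT EXPERT 임베디드 개발자를 위한 파일시스템의 원리와 실습 (File System Principles and Practice for Embedded Developers)
6. Kurt Wall, GCC 완전정복 (Korean edition)
7. 리눅스 커널 2.6 구조와 원리 (Linux Kernel 2.6 Structure and Principles), 한빛미디어
8. Embedded Linux Device Driver, MSD 아카데미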