3.9.1. kernel 启动概述

3.9.1.1. kernel启动之前的动作

3.9.1.1.1. kernel镜像加载到ddr的相应位置

kernel一般会存在于存储设备上,比如FLASH/EMMC/SDCARD。因此需要先将kernel镜像加载到RAM的位置上,CPU才可以去访问到kernel,具体实现方法由 bootloader决定,可以是自动复制,也可以是根据bootloader cmdline模式下的输入命令来复制

3.9.1.1.2. 硬件要求

根据 arch/arm64/kernel/head.S 的stext(kernel的入口函数)的注释头

/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address to the FDT blob.
 *
 * This code is mostly position independent so you call this at
 * __pa(PAGE_OFFSET + TEXT_OFFSET).
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
    __HEAD
_head:
    /*
     * DO NOT MODIFY. Image header expected by Linux boot-loaders.
     */
#ifdef CONFIG_EFI
    /*
     * This add instruction has no meaningful effect except that
     * its opcode forms the magic "MZ" signature required by UEFI.
     */
    add     x13, x18, #0x16
    b       stext
#else
    b       stext                           // branch to kernel start, magic
    .long   0                               // reserved
    ...

所以有要求如下

  • MMU = off

MMU是用来处理物理地址到虚拟内存地址的映射,因此软件上需要先配置其映射表(也就是常说的页表),MMU关闭的情况下,CPU的寻址都是物理地址,也就是不需要经过转化直接访问相应的硬件, 一旦打开之后,CPU的寻址都是虚拟地址,都会经过MMU映射到真正的物理地址上,即使代码中写的是一个物理地址但也会被当做虚拟地址使用

地址映射表是由kernel自己创建的,在创建映射表之前的地址都是物理地址,所以必须保证MMU是关闭状态

  • D-cache = off

CACHE是CPU核内存之间的告诉缓冲器,又分成数据缓冲器D-cache和指令缓冲器I-cache,D-cache一定要关闭,否则可能kernel刚启动的过程中,去取数据的时候,从Cache中取,而这个时候RAM中 的数据还没有Cache过来,导致数据存取异常。

3.9.1.1.3. 跳转到kernel镜像入口的对应位置

bootloader需要通过设置PC指针到kernel的入口代码处(也就是kernel的加载位置)来实现kernel的跳转

3.9.1.2. kernel 启动第一阶段

linux内核启动第一阶段,也就是常说的汇编阶段,也就是stext函数的实现内容,这部分主要完成的工作:CPU ID检查,machine ID检查,创建初始化页表,设置C代码运行环境,跳转到内核第一个真正 的C函数 start_kernel 执行

3.9.1.2.1. kernel 入口地址的指定

arch/arm64/kernel/vmlinux.lds.S

OUTPUT_ARCH(aarch64)    //说明最终编译的格式为aarch64
ENTRY(_text)            //表示入口地址为_text

...

. = KIMAGE_VADDR + TEXT_OFFSET; //起始链接地址

    .head.text : {
            _text = .;
            HEAD_TEXT
    }
.text : {           /* Real text segment        */

...

所以kernel入口地址是.head.text段的代码首地址

而.head_text段,通过include/linux/init.h中的宏定义__HEAD来表示

/* For assembly routines */
#define __HEAD              .section        ".head.text","ax"
#define __INIT              .section        ".init.text","ax"
#define __FINIT             .previous

#define __INITDATA  .section        ".init.data","aw",%progbits
#define __INITRODATA        .section        ".init.rodata","a",%progbits
#define __FINITDATA .previous

#define __MEMINIT        .section   ".meminit.text", "ax"
#define __MEMINITDATA    .section   ".meminit.data", "aw"
#define __MEMINITRODATA  .section   ".meminit.rodata", "a"

内核启动的入口点,在arch/arm64/kernel/head.S文件中

    __HEAD
_head:
    /*
     * DO NOT MODIFY. Image header expected by Linux boot-loaders.
     */
#ifdef CONFIG_EFI
    /*
     * This add instruction has no meaningful effect except that
     * its opcode forms the magic "MZ" signature required by UEFI.
     */
    add     x13, x18, #0x16
    b       stext
#else
    b       stext                           // branch to kernel start, magic
    .long   0                               // reserved
#endif
    le64sym _kernel_offset_le               // Image load offset from start of RAM, little-endian
    le64sym _kernel_size_le                 // Effective size of kernel image, little-endian
    le64sym _kernel_flags_le                // Informative flags, little-endian
    .quad   0                               // reserved
    .quad   0                               // reserved
    .quad   0                               // reserved
    .ascii  ARM64_IMAGE_MAGIC               // Magic number
#ifdef CONFIG_EFI
    .long   pe_header - _head               // Offset to the PE header.

pe_header:
    __EFI_PE_HEADER
#else
    .long   0                               // reserved
#endif

这段汇编代码中最重要的就是b stext,加载kernel镜像之后第一个运行的函数就是stext

3.9.1.2.1.1. stext函数

启动过程中的汇编阶段,是从arch/arm64/kernel/head.S文件开始,执行的起点是stext函数,入口函数是通过vmlinux.lds链接而成,在head.S中ENTRY(stext)指定

在汇编代码中,宏定义ENTRY和ENDPROC是成对出现的,表示定义一个函数,同时也要指定当前代码所在的段,如 __INIT

#define __INIT  .section    ".init.text","ax"

    __INIT

ENTRY(stext)
    ....
ENPROC(stext)

内核启动的必要条件:MMU关闭,D-cache关闭,x0是传递给FDT blob的物理地址

注解

数据高速缓存一定要关闭,因为在内核启动过程中取数据时会先访问高速缓存,而可能高速缓存中缓存了以前u-boot的一些数据,这些数据对于内核来说是错误的。 而指令高速缓存可以打开,是因为U-boot和内核代码是不重叠的,不会存在指令高速缓存有冲突。

stext函数开始执行

__INIT

/*
 * The following callee saved general purpose registers are used on the
 * primary lowlevel boot path:
 *
 *  Register   Scope                      Purpose
 *  x21        stext() .. start_kernel()  FDT pointer passed at boot in x0
 *  x23        stext() .. start_kernel()  physical misalignment/KASLR offset
 *  x28        __create_page_tables()     callee preserved temp register
 *  x19/x20    __primary_switch()         callee preserved temp registers
 *  x24        __primary_switch() .. relocate_kernel()
 *                                        current RELR displacement
 */
ENTRY(stext)
bl  preserve_boot_args
bl  el2_setup                       // Drop to EL1, w0=cpu_boot_mode
adrp        x23, __PHYS_OFFSET
and x23, x23, MIN_KIMG_ALIGN - 1    // KASLR offset, defaults to 0
bl  set_cpu_boot_mode_flag
bl  __create_page_tables
/*
 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
 * details.
 * On return, the CPU will be ready for the MMU to be turned on and
 * the TCR will have been set.
 */
bl  __cpu_setup                     // initialise processor
b   __primary_switch
ENDPROC(stext)
  • preserve_boot_args

保存从bootloader传递过来的x0~x3参数到boot_args数组

/*
 * Preserve the arguments passed by the bootloader in x0 .. x3
 */
preserve_boot_args:
    mov     x21, x0                         // x21=FDT  //将dtb的地址暂存在x21寄存器,释放出x0使用

    adr_l   x0, boot_args                   // record the contents of   //x0保存boot_args变量的地址
    stp     x21, x1, [x0]                   // x0 .. x3 at kernel entry     //将x0 x1的值保存在Boot_args[0] boot_args[1]
    stp     x2, x3, [x0, #16]       //将x2 x3的值保存在boot_args[2] boot_args[3]

    dmb     sy                              // needed before dc ivac with
                        // MMU off

    mov     x1, #0x20                       // 4 x 8 bytes
    b       __inval_dcache_area             // tail call
ENDPROC(preserve_boot_args)
  • set_cpu_boot_mode_flag

此函数用来设置__boot_cpu_mode flag,需要一个前提条件,w20寄存器中保存了CPU启动时的异常等级(exception level)

/*
 * Sets the __boot_cpu_mode flag depending on the CPU boot mode passed
 * in w0. See arch/arm64/include/asm/virt.h for more info.
 */
set_cpu_boot_mode_flag:
    adr_l   x1, __boot_cpu_mode
    cmp     w0, #BOOT_CPU_MODE_EL2
    b.ne    1f
    add     x1, x1, #4
1:  str     w0, [x1]                        // This CPU has booted in EL1
    dmb     sy
    dc      ivac, x1                        // Invalidate potentially stale cache line
    ret
ENDPROC(set_cpu_boot_mode_flag)

由于系统启动之后,需要了解CPU启动时候的exception level,因此需要一个全局变量__boot_cpu_mode来保存启动时的CPUmode

全局变量__boot_cpu_mode定义

/*
 * We need to find out the CPU boot mode long after boot, so we need to
 * store it in a writable variable.
 *
 * This is not in .bss, because we set it sufficiently early that the boot-time
 * zeroing of .bss would clobber it.
 */
ENTRY(__boot_cpu_mode)
    .long   BOOT_CPU_MODE_EL2
    .long   BOOT_CPU_MODE_EL1
  • __create_page_tables

建立页初始化的过程

__create_page_tables:
    mov     x28, lr

    /*
     * Invalidate the init page tables to avoid potential dirty cache lines
     * being evicted. Other page tables are allocated in rodata as part of
     * the kernel image, and thus are clean to the PoC per the boot
     * protocol.
     */
    adrp    x0, init_pg_dir
    adrp    x1, init_pg_end
    sub     x1, x1, x0
    bl      __inval_dcache_area

    /*
     * Clear the init page tables.
     */
    adrp    x0, init_pg_dir
    adrp    x1, init_pg_end
    sub     x1, x1, x0
1:  stp     xzr, xzr, [x0], #16
    stp     xzr, xzr, [x0], #16
    stp     xzr, xzr, [x0], #16
    stp     xzr, xzr, [x0], #16
    subs    x1, x1, #64
    b.ne    1b

    mov     x7, SWAPPER_MM_MMUFLAGS

    /*
     * Create the identity mapping.
     */
    adrp    x0, idmap_pg_dir
    adrp    x3, __idmap_text_start          // __pa(__idmap_text_start)

#ifdef CONFIG_ARM64_VA_BITS_52
    mrs_s   x6, SYS_ID_AA64MMFR2_EL1
    and     x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
    mov     x5, #52
    cbnz    x6, 1f
#endif
    mov     x5, #VA_BITS_MIN
1:
    adr_l   x6, vabits_actual
    str     x5, [x6]
    dmb     sy
    dc      ivac, x6                // Invalidate potentially stale cache line

    /*
     * VA_BITS may be too small to allow for an ID mapping to be created
     * that covers system RAM if that is located sufficiently high in the
     * physical address space. So for the ID map, use an extended virtual
     * range in that case, and configure an additional translation level
     * if needed.
     *
     * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
     * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
     * this number conveniently equals the number of leading zeroes in
     * the physical address of __idmap_text_end.
     */
    adrp    x5, __idmap_text_end
    clz     x5, x5
    cmp     x5, TCR_T0SZ(VA_BITS)   // default T0SZ small enough?
    b.ge    1f                      // .. then skip VA range extension

    adr_l   x6, idmap_t0sz
    str     x5, [x6]
    dmb     sy
    dc      ivac, x6                // Invalidate potentially stale cache line

#if (VA_BITS < 48)
#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
#define EXTRA_PTRS  (1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))

    /*
     * If VA_BITS < 48, we have to configure an additional table level.
     * First, we have to verify our assumption that the current value of
     * VA_BITS was chosen such that all translation levels are fully
     * utilised, and that lowering T0SZ will always result in an additional
     * translation level to be configured.
     */
#if VA_BITS != EXTRA_SHIFT
#error "Mismatch between VA_BITS and page size/number of translation levels"
#endif

    mov     x4, EXTRA_PTRS
    create_table_entry x0, x3, EXTRA_SHIFT, x4, x5, x6
#else
    /*
     * If VA_BITS == 48, we don't have to configure an additional
     * translation level, but the top-level table has more entries.
     */
    mov     x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
    str_l   x4, idmap_ptrs_per_pgd, x5
#endif
1:
    ldr_l   x4, idmap_ptrs_per_pgd
    mov     x5, x3                          // __pa(__idmap_text_start)
    adr_l   x6, __idmap_text_end            // __pa(__idmap_text_end)

    map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14

    /*
     * Map the kernel image (starting with PHYS_OFFSET).
     */
    adrp    x0, init_pg_dir
    mov_q   x5, KIMAGE_VADDR + TEXT_OFFSET  // compile time __va(_text)
    add     x5, x5, x23                     // add KASLR displacement
    mov     x4, PTRS_PER_PGD
    adrp    x6, _end                        // runtime __pa(_end)
    adrp    x3, _text                       // runtime __pa(_text)
    sub     x6, x6, x3                      // _end - _text
    add     x6, x6, x5                      // runtime __va(_end)

    map_memory x0, x1, x5, x6, x7, x3, x4, x10, x11, x12, x13, x14

    /*
     * Since the page tables have been populated with non-cacheable
     * accesses (MMU disabled), invalidate the idmap and swapper page
     * tables again to remove any speculatively loaded cache lines.
     */
    adrp    x0, idmap_pg_dir
    adrp    x1, init_pg_end
    sub     x1, x1, x0
    dmb     sy
    bl      __inval_dcache_area

    ret     x28
ENDPROC(__create_page_tables)
  • __cpu_setup

cpu的初始化设置

/* *        __cpu_setup
 *
 *  Initialise the processor for turning the MMU on.  Return in x0 the
 *  value of the SCTLR_EL1 register.
 */
    .pushsection ".idmap.text", "awx"
ENTRY(__cpu_setup)
    tlbi    vmalle1                         // Invalidate local TLB
    dsb     nsh

    mov     x0, #3 << 20
    msr     cpacr_el1, x0                   // Enable FP/ASIMD
    mov     x0, #1 << 12                    // Reset mdscr_el1 and disable
    msr     mdscr_el1, x0                   // access to the DCC from EL0
    isb                                     // Unmask debug exceptions now,
    enable_dbg                              // since this is per-cpu
    reset_pmuserenr_el0 x0                  // Disable PMU access from EL0
    /*
     * Memory region attributes for LPAE:
     *
     *   n = AttrIndx[2:0]
     *                      n       MAIR
     *   DEVICE_nGnRnE      000     00000000
     *   DEVICE_nGnRE       001     00000100
     *   DEVICE_GRE         010     00001100
     *   NORMAL_NC          011     01000100
     *   NORMAL             100     11111111
     *   NORMAL_WT          101     10111011
     */
    ldr     x5, =MAIR(0x00, MT_DEVICE_nGnRnE) | \
             MAIR(0x04, MT_DEVICE_nGnRE) | \
             MAIR(0x0c, MT_DEVICE_GRE) | \
             MAIR(0x44, MT_NORMAL_NC) | \
             MAIR(0xff, MT_NORMAL) | \
             MAIR(0xbb, MT_NORMAL_WT)
    msr     mair_el1, x5
    /*
     * Prepare SCTLR
     */
    mov_q   x0, SCTLR_EL1_SET
    /*
     * Set/prepare TCR and TTBR. We use 512GB (39-bit) address range for
     * both user and kernel.
     */
    ldr     x10, =TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
            TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
            TCR_TBI0 | TCR_A1 | TCR_KASAN_FLAGS
    tcr_clear_errata_bits x10, x9, x5

#ifdef CONFIG_ARM64_VA_BITS_52
    ldr_l           x9, vabits_actual
    sub             x9, xzr, x9
    add             x9, x9, #64
    tcr_set_t1sz    x10, x9
#else
    ldr_l           x9, idmap_t0sz
#endif
    tcr_set_t0sz    x10, x9

    /*
     * Set the IPS bits in TCR_EL1.
     */
    tcr_compute_pa_size x10, #TCR_IPS_SHIFT, x5, x6
#ifdef CONFIG_ARM64_HW_AFDBM
    /*
     * Enable hardware update of the Access Flags bit.
     * Hardware dirty bit management is enabled later,
     * via capabilities.
     */
    mrs     x9, ID_AA64MMFR1_EL1
    and     x9, x9, #0xf
    cbz     x9, 1f
    orr     x10, x10, #TCR_HA               // hardware Access flag update
1:
#endif      /* CONFIG_ARM64_HW_AFDBM */
    msr     tcr_el1, x10
    ret                                     // return to head.S
ENDPROC(__cpu_setup)

主要包括:

  1. cache和TLB的处理

  2. memory attribute lookup table的创建

  3. SCTLR_EL1 TCR_EL1的设定

  • __primary_switch

主要工作是为打开MMU做准备

__primary_switch:
#ifdef CONFIG_RANDOMIZE_BASE
    mov     x19, x0                         // preserve new SCTLR_EL1 value
    mrs     x20, sctlr_el1                  // preserve old SCTLR_EL1 value
#endif

    adrp    x1, init_pg_dir
    bl      __enable_mmu        //打开MMU
#ifdef CONFIG_RELOCATABLE
#ifdef CONFIG_RELR
    mov     x24, #0                         // no RELR displacement yet
#endif
    bl      __relocate_kernel
#ifdef CONFIG_RANDOMIZE_BASE
    ldr     x8, =__primary_switched
    adrp    x0, __PHYS_OFFSET
    blr     x8

    /*
     * If we return here, we have a KASLR displacement in x23 which we need
     * to take into account by discarding the current kernel mapping and
     * creating a new one.
     */
    pre_disable_mmu_workaround
    msr     sctlr_el1, x20                  // disable the MMU
    isb
    bl      __create_page_tables            // recreate kernel mapping

    tlbi    vmalle1                         // Remove any stale TLB entries
    dsb     nsh

    msr     sctlr_el1, x19                  // re-enable the MMU
    isb
    ic      iallu                           // flush instructions fetched
    dsb     nsh                             // via old mapping
    isb

    bl      __relocate_kernel
#endif
#endif
    ldr     x8, =__primary_switched
    adrp    x0, __PHYS_OFFSET
    br      x8
ENDPROC(__primary_switch)

函数中通过__enable_mmu函数来开启MMU, 并调用__primary_switched函数

/*
 * The following fragment of code is executed with the MMU enabled.
 *
 *   x0 = __PHYS_OFFSET
 */
__primary_switched:
    adrp    x4, init_thread_union
    add     sp, x4, #THREAD_SIZE
    adr_l   x5, init_task
    msr     sp_el0, x5                      // Save thread_info

    adr_l   x8, vectors                     // load VBAR_EL1 with virtual
    msr     vbar_el1, x8                    // vector table address
    isb

    stp     xzr, x30, [sp, #-16]!
    mov     x29, sp

    str_l   x21, __fdt_pointer, x5          // Save FDT pointer

    ldr_l   x4, kimage_vaddr                // Save the offset between
    sub     x4, x4, x0                      // the kernel virtual and
    str_l   x4, kimage_voffset, x5          // physical mappings

    // Clear BSS
    adr_l   x0, __bss_start
    mov     x1, xzr
    adr_l   x2, __bss_stop
    sub     x2, x2, x0
    bl      __pi_memset
    dsb     ishst                           // Make zero page visible to PTW

#ifdef CONFIG_KASAN
    bl      kasan_early_init
#endif
#ifdef CONFIG_RANDOMIZE_BASE
    tst     x23, ~(MIN_KIMG_ALIGN - 1)      // already running randomized?
    b.ne    0f
    mov     x0, x21                         // pass FDT address in x0
    bl      kaslr_early_init                // parse FDT for KASLR options
    cbz     x0, 0f                          // KASLR disabled? just proceed
    orr     x23, x23, x0                    // record KASLR offset
    ldp     x29, x30, [sp], #16             // we must enable KASLR, return
    ret                                     // to __primary_switch()
0:
#endif
    add     sp, sp, #16
    mov     x29, #0
    mov     x30, #0
    b       start_kernel
ENDPROC(__primary_switched)

此函数中进行一些C环境的准备,并在最后执行start_kernel函数,内核的启动进入到C语言环境阶段

3.9.1.3. kernel 启动第二阶段

linux内核启动的第二阶段也就是常说的C语言阶段,从 start_kernel 函数开始。 start_kernel函数是所有linux平台进入系统内核初始化后的入口函数,主要完成剩余的与硬件平台相关的初始化 工作,这些初始化操作有的是公共的,有的需要配置才会执行,内核工作需要的模块的初始化一次被调用:如内存管理、调度系统、异常处理等

3.9.1.3.1. start_kenel

start_kernel函数在init/main.c文件中,主要完成linux子系统的初始化工作,此部分初始化内容繁多,暂时略过…

asmlinkage __visible void __init start_kernel(void)
{
    char *command_line;
    char *after_dashes;

    set_task_stack_end_magic(&init_task);
    smp_setup_processor_id();
    debug_objects_early_init();

    cgroup_init_early();

    local_irq_disable();
    early_boot_irqs_disabled = true;

    /*
     * Interrupts are still disabled. Do necessary setups, then
     * enable them.
     */
    boot_cpu_init();
    page_address_init();
    pr_notice("%s", linux_banner);
    early_security_init();
    setup_arch(&command_line);
    setup_command_line(command_line);
    setup_nr_cpu_ids();
    setup_per_cpu_areas();
    smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
    boot_cpu_hotplug_init();

    build_all_zonelists(NULL);
    page_alloc_init();

    pr_notice("Kernel command line: %s\n", boot_command_line);
    /* parameters may set static keys */
    jump_label_init();
    parse_early_param();
    after_dashes = parse_args("Booting kernel",
                  static_command_line, __start___param,
                  __stop___param - __start___param,
                  -1, -1, NULL, &unknown_bootoption);
    if (!IS_ERR_OR_NULL(after_dashes))
        parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
               NULL, set_init_arg);

    /*
     * These use large bootmem allocations and must precede
     * kmem_cache_init()
     */
    setup_log_buf(0);
    vfs_caches_init_early();
    sort_main_extable();
    trap_init();
    mm_init();

    ftrace_init();

    /* trace_printk can be enabled here */
    early_trace_init();

    /*
     * Set up the scheduler prior starting any interrupts (such as the
     * timer interrupt). Full topology setup happens at smp_init()
     * time - but meanwhile we still have a functioning scheduler.
     */
    sched_init();
    /*
     * Disable preemption - early bootup scheduling is extremely
     * fragile until we cpu_idle() for the first time.
     */
    preempt_disable();
    if (WARN(!irqs_disabled(),
         "Interrupts were enabled *very* early, fixing it\n"))
        local_irq_disable();
    radix_tree_init();

    /*
     * Set up housekeeping before setting up workqueues to allow the unbound
     * workqueue to take non-housekeeping into account.
     */
    housekeeping_init();

    /*
     * Allow workqueue creation and work item queueing/cancelling
     * early.  Work item execution depends on kthreads and starts after
     * workqueue_init().
     */
    workqueue_init_early();

    rcu_init();

    /* Trace events are available after this */
    trace_init();

    if (initcall_debug)
        initcall_debug_enable();

    context_tracking_init();
    /* init some links before init_ISA_irqs() */
    early_irq_init();
    init_IRQ();
    tick_init();
    rcu_init_nohz();
    init_timers();
    hrtimers_init();
    softirq_init();
    timekeeping_init();

    /*
     * For best initial stack canary entropy, prepare it after:
     * - setup_arch() for any UEFI RNG entropy and boot cmdline access
     * - timekeeping_init() for ktime entropy used in rand_initialize()
     * - rand_initialize() to get any arch-specific entropy like RDRAND
     * - add_latent_entropy() to get any latent entropy
     * - adding command line entropy
     */
    rand_initialize();
    add_latent_entropy();
    add_device_randomness(command_line, strlen(command_line));
    boot_init_stack_canary();

    time_init();
    perf_event_init();
    profile_init();
    call_function_init();
    WARN(!irqs_disabled(), "Interrupts were enabled early\n");

    early_boot_irqs_disabled = false;
    local_irq_enable();

    kmem_cache_init_late();

    /*
     * HACK ALERT! This is early. We're enabling the console before
     * we've done PCI setups etc, and console_init() must be aware of
     * this. But we do want output early, in case something goes wrong.
     */
    console_init();
    if (panic_later)
        panic("Too many boot %s vars at `%s'", panic_later,
              panic_param);

    lockdep_init();

    /*
     * Need to run this when irqs are enabled, because it wants
     * to self-test [hard/soft]-irqs on/off lock inversion bugs
     * too:
     */
    locking_selftest();

    /*
     * This needs to be called before any devices perform DMA
     * operations that might use the SWIOTLB bounce buffers. It will
     * mark the bounce buffers as decrypted so that their usage will
     * not cause "plain-text" data to be decrypted when accessed.
     */
    mem_encrypt_init();

#ifdef CONFIG_BLK_DEV_INITRD
    if (initrd_start && !initrd_below_start_ok &&
        page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
        pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
            page_to_pfn(virt_to_page((void *)initrd_start)),
            min_low_pfn);
        initrd_start = 0;
    }
#endif
    setup_per_cpu_pageset();
    numa_policy_init();
    acpi_early_init();
    if (late_time_init)
        late_time_init();
    sched_clock_init();
    calibrate_delay();
    pid_idr_init();
    anon_vma_init();
#ifdef CONFIG_X86
    if (efi_enabled(EFI_RUNTIME_SERVICES))
        efi_enter_virtual_mode();
#endif
    thread_stack_cache_init();
    cred_init();
    fork_init();
    proc_caches_init();
    uts_ns_init();
    buffer_init();
    key_init();
    security_init();
    dbg_late_init();
    vfs_caches_init();
    pagecache_init();
    signals_init();
    seq_file_init();
    proc_root_init();
    nsfs_init();
    cpuset_init();
    cgroup_init();
    taskstats_init_early();
    delayacct_init();

    poking_init();
    check_bugs();

    acpi_subsystem_init();
    arch_post_acpi_subsys_init();
    sfi_init_late();

    /* Do the rest non-__init'ed, we're now alive */
    arch_call_rest_init();
}
pr_notice("%s", linux_barner);
/* FIXED STRINGS! Don't touch! */
const char linux_banner[] =
        "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
        LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
        ")"

执行的效果是,在内核启动的初期,打印内核版本号和构建信息

[    0.000000 ] Linux version 4.14.74 (jenkins@MonoCI) (gcc version 6.5.0 (Linaro GCC 6.5-2018.12)) #2 SMP PREEMPT Mon Aug 23 12:17:44 CST 2021

3.9.1.3.2. setup_arch

setup_arch是体系结构相关的,该函数根据处理器、硬件平台具体型号设置系统,及解析系统命令行,系统内存管理初始化,统计并注册系统各种资源等,每个体系都有自己的setup_arch函数, 是由顶层Makefile中的arch变量定义的,参数是违背初始化的内部变量command_line

void __init setup_arch(char **cmdline_p)
{
    init_mm.start_code = (unsigned long) _text;
    init_mm.end_code   = (unsigned long) _etext;
    init_mm.end_data   = (unsigned long) _edata;
    init_mm.brk        = (unsigned long) _end;

    *cmdline_p = boot_command_line;

    early_fixmap_init();
    early_ioremap_init();

    setup_machine_fdt(__fdt_pointer);

    /*
     * Initialise the static keys early as they may be enabled by the
     * cpufeature code and early parameters.
     */
    jump_label_init();
    parse_early_param();

    /*
     * Unmask asynchronous aborts and fiq after bringing up possible
     * earlycon. (Report possible System Errors once we can report this
     * occurred).
     */
    local_daif_restore(DAIF_PROCCTX_NOIRQ);

    /*
     * TTBR0 is only used for the identity mapping at this stage. Make it
     * point to zero page to avoid speculatively fetching new entries.
     */
    cpu_uninstall_idmap();

    xen_early_init();
    efi_init();
    arm64_memblock_init();

    paging_init();

    acpi_table_upgrade();

    /* Parse the ACPI tables for possible boot-time configuration */
    acpi_boot_table_init();

    if (acpi_disabled)
        unflatten_device_tree();

    bootmem_init();

    kasan_init();

    request_standard_resources();

    early_ioremap_reset();

    if (acpi_disabled)
        psci_dt_init();
    else
        psci_acpi_init();

    cpu_read_bootcpu_ops();
    smp_init_cpus();
    smp_build_mpidr_hash();

    /* Init percpu seeds for random tags after cpus are set up. */
    kasan_init_tags();

#ifdef CONFIG_ARM64_SW_TTBR0_PAN
    /*
     * Make sure init_thread_info.ttbr0 always generates translation
     * faults in case uaccess_enable() is inadvertently called by the init
     * thread.
     */
    init_task.thread_info.ttbr0 = __pa_symbol(empty_zero_page);
#endif

#ifdef CONFIG_VT
    conswitchp = &dummy_con;
#endif
    if (boot_args[1] || boot_args[2] || boot_args[3]) {
        pr_err("WARNING: x1-x3 nonzero in violation of boot protocol:\n"
            "\tx1: %016llx\n\tx2: %016llx\n\tx3: %016llx\n"
            "This indicates a broken bootloader or old kernel\n",
            boot_args[1], boot_args[2], boot_args[3]);
    }
}
  • setup_machine_fdt

setup_machine_fdt函数的输入参数是设备树(dtb)首地址,u-boot启动程序把设备树读取到内存中,之后在启动内核的同时,将设备树首地址传给内核,setup_machine_fdt函数的参数__fdt_pointer 就是u-boot传给内核的设备树地址,函数中的fdt表示设备树在内存中是一块连续地址存储的

static void __init setup_machine_fdt(phys_addr_t dt_phys)
{
    int size;
    void *dt_virt = fixmap_remap_fdt(dt_phys, &size, PAGE_KERNEL);  //此时已开启MMU,需要将dtb物理地址转换为虚拟地址
    const char *name;

    if (dt_virt)
        memblock_reserve(dt_phys, size);

    if (!dt_virt || !early_init_dt_scan(dt_virt)) {     //fdt扫描函数,经过此函数之后内核便可以通过调用fdt接口函数获取相关信息
        pr_crit("\n"
            "Error: invalid device tree blob at physical address %pa (virtual address 0x%p)\n"
            "The dtb must be 8-byte aligned and must not exceed 2 MB in size\n"
            "\nPlease check your bootloader.",
            &dt_phys, dt_virt);

        while (true)
            cpu_relax();
    }

    /* Early fixups are done, map the FDT as read-only now */
    fixmap_remap_fdt(dt_phys, &size, PAGE_KERNEL_RO);

    name = of_flat_dt_get_machine_name();
    if (!name)
        return;

    pr_info("Machine model: %s\n", name);
    dump_stack_set_arch_desc("%s (DT)", name);
}
  • console_init

console_init函数执行控制台的初始化工作,在console_init函数执行之前的printk打印信息,需要在console_init函数执行之后才能打印出来,在此之前printk的打印信息都被保存在一个缓存中

kernel/printk/printk.c

/*
 * Initialize the console device. This is called *early*, so
 * we can't necessarily depend on lots of kernel help here.
 * Just do some early initializations, and do the complex setup
 * later.
 */
void __init console_init(void)
{
    int ret;
    initcall_t call;
    initcall_entry_t *ce;

    /* Setup the default TTY line discipline. */
    n_tty_init();

    /*
     * set up the console device so that later boot sequences can
     * inform about problems etc..
     */
    ce = __con_initcall_start;
    trace_initcall_level("console");
    while (ce < __con_initcall_end) {
        call = initcall_from_entry(ce);
        trace_initcall_start(call);
        ret = call();
        trace_initcall_finish(call, ret);
        ce++;
    }
}

此函数中会执行,__con_initcall_start和__con_initcall_end这两个地址之间的内容,这两个地址可以在vmlinux.lds中找到

__con_initcall_start = .;
KEEP(*(.con_initcall.init))
__con_initcall_end = .;

这两个地址之间,存放的是.con_initcall.init段的内容

include/linux/init.h

#define console_initcall(fn)    \
    static initcall_t __initcall_##fn##id __used  \
    __attribute__((__section_.con_initcall.init)) = fn

通过宏定义console_initcall(fn)将函数指针fn存放到.con_initcall.init段,之后在调用console_init()函数时,就会遍历__con_initcall_start和__con_initcall_end的 地址区域,依次运行存放在启动的函数fn

3.9.1.3.3. rest_init

在一系列的初始化之后,在rest_init函数中启动了三个进程 idlekernel_initkthreadd 来开始操作系统的正式运行

noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);   //创建kernel_init内核线程,即init, 1号进程
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);    //创建kthreadd内核线程,2号进程,用于管理和调度其他内核线程
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();    //调用进程调度,并禁止内核抢占
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);    //0号进程完成kernel初始化工作,进入idle循环
}
  1. idle进程是操作系统的空闲进程,CPU空闲的时候会去运行它

  2. kernel_init进程最开始只是一个函数,作为进程被启动,init进程是永远存在的,PID是1

  3. kthreadd是内核守护进程,始终运行在内核空间,负责所有内核线程的调度和管理,PID是2

也就是说,系统启动后的第一个进程是IDLE,idle进程是唯一没有通过kernel_thread或fork产生的进程,idle创建了kernel_init进程作为1号进程,创建了kthreadd进程作为2号进程

3.9.1.3.4. kernel_init

kernel_init函数在创建kernel_init进程时,作为进程被启动,虽然kernel_init最开始只是一个函数,但是在最后,通过系统调用将读取根文件系统下的init进程,完成从内核态到用户态的转变, 转变为用户态的1号进程,这个init进程是所有用户态进程的父进程,产生了大量的子进程,init进程是1号进程,是永远存在的

3.9.1.3.4.1. kernel_init_freeable

此函数主要工作如下

  1. 等待内核线程kthreadd创建完成

  2. 注册内核驱动模块 do_basic_setup

  3. 启动默认控制台/dev/console

static noinline void __init kernel_init_freeable(void)
{
    /*
     * Wait until kthreadd is all set-up.
     */
    wait_for_completion(&kthreadd_done);
    //虽然kernel_init进程先创建,但是要在kthreadd线程创建完成才能执行

    /* Now the scheduler is fully set up and can do blocking allocations */
    gfp_allowed_mask = __GFP_BITS_MASK;

    /*
     * init can allocate pages on any node
     */
    set_mems_allowed(node_states[N_MEMORY]);

    cad_pid = task_pid(current);

    smp_prepare_cpus(setup_max_cpus);

    workqueue_init();

    init_mm_internals();

    do_pre_smp_initcalls();
    lockup_detector_init();

    smp_init();
    sched_init_smp();

    page_alloc_init_late();
    /* Initialize page ext after all struct pages are initialized. */
    page_ext_init();

    do_basic_setup();

    /* Open the /dev/console on the rootfs, this should never fail */
    if (ksys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        pr_err("Warning: unable to open an initial console.\n");

    (void) ksys_dup(0);
    (void) ksys_dup(0);
    /*
     * check if there is an early userspace init.  If yes, let it do all
     * the work
     */

    if (!ramdisk_execute_command)
        ramdisk_execute_command = "/init";

    if (ksys_access((const char __user *)
            ramdisk_execute_command, 0) != 0) {
        ramdisk_execute_command = NULL;
        prepare_namespace();
    }

    /*
     * Ok, we have completed the initial bootup, and
     * we're essentially up and running. Get rid of the
     * initmem segments and start the user-mode stuff..
     *
     * rootfs is available now, try loading the public keys
     * and default modules
     */

    integrity_load_keys();
}
  • bo_basic_setup

/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
    cpuset_init_smp();
    driver_init();
    init_irq_proc();
    do_ctors();
    usermodehelper_enable();
    do_initcalls();
}

driver_init函数完成了与驱动程序相关的所有子系统的创建,实现了linux设备驱动的一个整体框架,但是它只是建立了目录结构,是设备驱动程序初始化的一部分 具体驱动模块的装载在do_initcalls函数中实现

/**
 * driver_init - initialize driver model.
 *
 * Call the driver model init functions to initialize their
 * subsystems. Called early from init/main.c.
 */
void __init driver_init(void)
{
    /* These are the core pieces */
    devtmpfs_init();    //注册devtmpfs文件系统,启动devtmpfsd进程
    devices_init();     //初始化驱动模型中部分子系统,/dev/devices,/dev/cha,/dev/block
    buses_init();       //初始化驱动模型中的bus子系统
    classes_init();     //初始化驱动模型中的class子系统
    firmware_init();    //初始化驱动模型中的firmware子系统
    hypervisor_init();  //初始化驱动模型中的hypervisor子系统

    /* These are also core pieces, but must come after the
     * core core pieces.
     */
    of_core_init();     //初始化设备树访问过程
    platform_bus_init();    //初始化设备驱动模型中的bus/platform子系统,此节点是所有platform设备和驱动的总线模型
                            //所有的platform设备和驱动都会挂载到这个总线上
    cpu_dev_init();     //初始化驱动模型中的device/system/cpu子系统,该节点包含CPU相关属性
    memory_dev_init();  //初始化驱动模型中的device/system/memory子系统,该节点包含了内存相关属性
    container_dev_init();   //初始化系统总线类型为容器
}
  • do_initcalls

编译器在编译内核时,将一系列模块初始化函数的其实地址按照一定的顺序放在指定的section中,在内核启动的初始化阶段,do_initcalls函数中以函数指针的形式取出这些函数的起始地址 依次运行,以完成相应模块的初始化工作,这是设备驱动程序初始化的第二部分,由于内核模块可能存在依赖关系,因此这些模块的初始化顺序非常重要

// init/main.c

static void __init do_initcalls(void)
{
    int level;

    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
        do_initcall_level(level);
}

对同一个level等级下的函数,依次遍历执行

static void __init do_initcall_level(int level)
{
    initcall_entry_t *fn;

    strcpy(initcall_command_line, saved_command_line);
    parse_args(initcall_level_names[level],
           initcall_command_line, __start___param,
           __stop___param - __start___param,
           level, level,
           NULL, &repair_env_string);

    trace_initcall_level(initcall_level_names[level]);
    for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
        do_one_initcall(initcall_from_entry(fn));
}

执行某一个确定的函数

int __init_or_module do_one_initcall(initcall_t fn)
{
    int count = preempt_count();
    char msgbuf[64];
    int ret;

    if (initcall_blacklisted(fn))
        return -EPERM;

    do_trace_initcall_start(fn);
    ret = fn();
    do_trace_initcall_finish(fn, ret);

    msgbuf[0] = 0;

    if (preempt_count() != count) {
        sprintf(msgbuf, "preemption imbalance ");
        preempt_count_set(count);
    }
    if (irqs_disabled()) {
        strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
        local_irq_enable();
    }
    WARN(msgbuf[0], "initcall %pS returned with %s\n", fn, msgbuf);

    add_latent_entropy();
    return ret;
}
// include/linux/init.h

#define __define_initcall(fn, id)   \
    static initcall_t __initcall##fn_id __used \
    __attribute__((__section__(".initcall" #id ".init"))) = fn;

__attribute__((__section__())) 表示把对象放在这个由括号中的名称所指代的section中

__define_initcall()宏的含义是

  1. 声明一个名为__initcall_##fn的函数指针(其中##表示将两边的变量链接为有一个变量)

  2. 将这个函数指针初始化为fn

  3. 编译时,要将这个函数指针放到名为”.initcall”#id”.init”的section中

__define_initcall宏并不会被直接使用,而是被定义为其他的宏定义形式使用

// include/linux/init.h


#define pure_initcall(fn)           __define_initcall(fn, 0)

#define core_initcall(fn)           __define_initcall(fn, 1)
#define core_initcall_sync(fn)              __define_initcall(fn, 1s)
#define postcore_initcall(fn)               __define_initcall(fn, 2)
#define postcore_initcall_sync(fn)  __define_initcall(fn, 2s)
#define arch_initcall(fn)           __define_initcall(fn, 3)
#define arch_initcall_sync(fn)              __define_initcall(fn, 3s)
#define subsys_initcall(fn)         __define_initcall(fn, 4)
#define subsys_initcall_sync(fn)    __define_initcall(fn, 4s)
#define fs_initcall(fn)                     __define_initcall(fn, 5)
#define fs_initcall_sync(fn)                __define_initcall(fn, 5s)
#define rootfs_initcall(fn)         __define_initcall(fn, rootfs)
#define device_initcall(fn)         __define_initcall(fn, 6)
#define device_initcall_sync(fn)    __define_initcall(fn, 6s)
#define late_initcall(fn)           __define_initcall(fn, 7)
#define late_initcall_sync(fn)              __define_initcall(fn, 7s)

在编译生成的vmlinux.lds文件中可以找到initcall相关的定义

__initcall_start = .;
KEEP(*(.initcallearly.init))
__initcall0_start = .;
KEEP(*(.initcall0.init))
KEEP(*(.initcall0s.init))
__initcall1_start = .;
KEEP(*(.initcall1.init))
KEEP(*(.initcall1s.init))
__initcall2_start = .;
KEEP(*(.initcall2.init))
KEEP(*(.initcall2s.init))
__initcall3_start = .;
KEEP(*(.initcall3.init))
KEEP(*(.initcall3s.init))
__initcall4_start = .;
KEEP(*(.initcall4.init))
KEEP(*(.initcall4s.init))
__initcall5_start = .;
KEEP(*(.initcall5.init))
KEEP(*(.initcall5s.init))
__initcallrootfs_start = .;
KEEP(*(.initcallrootfs.init))
KEEP(*(.initcallrootfss.init))
__initcall6_start = .;
KEEP(*(.initcall6.init))
KEEP(*(.initcall6s.init))
__initcall7_start = .;
KEEP(*(.initcall7.init))
KEEP(*(.initcall7s.init))
__initcall_end = .;

这些section中总的开始位置被标识为__initcall_start,而在结尾被标识为__initcall_end

3.9.1.3.4.2. free_initmem

free_initmem函数用来释放所有init段中的内存

// arch/arm64/mm/init.c

void free_initmem(void)
{
    free_reserved_area(lm_alias(__init_begin),
               lm_alias(__init_end),
               0, "unused kernel");
    /*
     * Unmap the __init region but leave the VM area in place. This
     * prevents the region from being reused for kernel modules, which
     * is not supported by kallsyms.
     */
    unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin));
}

3.9.1.3.4.3. 启动用户态init进程

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
        return 0;
static int try_to_run_init_process(const char *init_filename)
{
    int ret;

    ret = run_init_process(init_filename);

    if (ret && ret != -ENOENT) {
        pr_err("Starting init: %s exists but couldn't execute it (error %d)\n",
               init_filename, ret);
    }

    return ret;
}
static int run_init_process(const char *init_filename)
{
    argv_init[0] = init_filename;
    pr_info("Run %s as init process\n", init_filename);
    return do_execve(getname_kernel(init_filename),
        (const char __user *const __user *)argv_init,
        (const char __user *const __user *)envp_init);
}

在大多数系统中,bootloader会传递参数给内核的main函数,这些参数中会包含init=/linuxrc参数,于是在kernel_init进程中,如果有execute_command = “linuxrc”,在经过 run_init_process函数的解析之后,得到需要运行的linuxrc,通过do_execve函数进入用户态,开始文件系统的初始化init进程

如果没有传递,则系统开始顺序执行/sbin/init /etc/init /bin/init /bin/sh 程序

init进程进行的工作

  1. 为init设置信号处理过程

  2. 初始化控制台

  3. 解析/etc/inittab文件

  4. 执行系统初始化命令,一般情况下会使用/etc/init.d/rcS

  5. 执行所有导致init暂停的inittab命令(动作类型: wait)

  6. 执行所有仅执行一次的inittab命令 (动作类型: once)

执行完以上工作后,init进程会循环执行以下进程

  1. 执行所有终止时必须重新启动的inittab命令(动作类型: respawn)

  2. 执行所有终止时必须重新启动但启动前必须询问用户的inittab命令(动作类型: askfirst)

  • inittab

init程序会解析/etc/inittab初始化配置文件

# /etc/inittab: init(8) configuration.
# $Id: inittab,v 1.91 2002/01/25 13:35:21 miquels Exp $

# The default runlevel.
id:5:initdefault:

# Boot-time system configuration/initialization script.
# This is run first except when booting in emergency (-b) mode.
si::sysinit:/etc/init.d/rcS

# What to do in single-user mode.
~:S:wait:/sbin/sulogin

# /etc/init.d executes the S and K scripts upon change
# of runlevel.
#
# Runlevel 0 is halt.
# Runlevel 1 is single-user.
# Runlevels 2-5 are multi-user.
# Runlevel 6 is reboot.

l0:0:wait:/etc/init.d/rc 0
l1:1:wait:/etc/init.d/rc 1
l2:2:wait:/etc/init.d/rc 2
l3:3:wait:/etc/init.d/rc 3
l4:4:wait:/etc/init.d/rc 4
l5:5:wait:/etc/init.d/rc 5
l6:6:wait:/etc/init.d/rc 6
# Normally not reached, but fallthrough in case of emergency.
z6:6:respawn:/sbin/sulogin
# AMA0:12345:respawn:/bin/start_getty 38400 ttyAMA0 vt102
S0:12345:respawn:/bin/start_getty 0 ttyS0 vt102

inittab的内容以行为单位,行与行之间没有关联,每行都是一个独立的配置项,每一行的配置项都是由3个冒号分隔开的4个配置值组成,冒号是分隔符

inittab文件中的代码格式

<id>:<runlevels>:<action>:<process>

注解

id: /dev/id ,用作终端的terminal:stdin、stdout、stderr、printf、scanf、err runlevels: action:执行时机,包括:sysinit、respawd、askfirst、waite、once、restart、ctriatdel、shutdown process: 应用程序和脚本