3.7.2.2. init_task

There are three special processes in Linux: the idle process (PID 0), the init process (PID 1), and kthreadd (PID 2).

  • The idle process is created automatically by the system and runs in kernel mode

The idle process has PID 0. It is the first process the system creates, and the only one produced neither by fork nor by kernel_thread. Once system loading is complete, it evolves into the scheduling/swapping task (historically named swapper).

  • The init process is created by idle via kernel_thread; after finishing initialization in kernel space it loads the init program and ultimately runs in user space

Created by process 0, it completes the system's initialization and is the ancestor of every other user process in the system. All user processes in Linux are created and run under init: first the Linux kernel boots, then init is started in user space, and init in turn starts the other system processes. After boot completes, init becomes a daemon that supervises the system's other processes.

  • The kthreadd process is created by idle via kernel_thread; it always runs in kernel space and is responsible for managing all kernel threads

Its job is to manage and schedule the other kernel threads. It loops inside the kthreadd() function, whose role is to spawn the kthreads queued on the global kthread_create_list; a kernel thread created through kthread_create() is appended to that list. Consequently, every kernel thread has kthreadd as its parent, directly or indirectly, as the sketch below illustrates.
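
As an illustration, here is a minimal module sketch (the demo_* names are ours, not kernel code) that creates a kernel thread with kthread_run(); the creation request is queued on kthread_create_list and the thread is actually forked by kthreadd, so it appears as a child of PID 2:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *demo_task;

/* Thread body: run until someone calls kthread_stop() on us. */
static int demo_thread_fn(void *data)
{
    while (!kthread_should_stop())
        msleep(1000);               /* placeholder for periodic work */
    return 0;
}

static int __init demo_init(void)
{
    /* kthread_run() = kthread_create() + wake_up_process();
     * kthread_create() queues the request on kthread_create_list and
     * wakes kthreadd, which performs the actual fork. */
    demo_task = kthread_run(demo_thread_fn, NULL, "demo_kthread");
    if (IS_ERR(demo_task))
        return PTR_ERR(demo_task);
    return 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(demo_task);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");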

3.7.2.2.1. Creating idle

On an SMP system each processor has its own run queue, and each run queue has its own idle process: there are as many idle processes as there are processors.

The system's idle time is simply the time the idle processes spend running.
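
This accounting is visible from user space: the idle field on the first cpu line of /proc/stat is the accumulated runtime of the per-CPU idle tasks, in USER_HZ ticks. A minimal reader (ours, not from any kernel source):

#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, system, idle;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return 1;
    /* Aggregate cpu line: user nice system idle iowait irq softirq ... */
    if (fscanf(f, "cpu %llu %llu %llu %llu",
               &user, &nice, &system, &idle) == 4)
        printf("idle ticks since boot: %llu\n", idle);
    fclose(f);
    return 0;
}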

3.7.2.2.1.1. Process 0's context: the init_task descriptor

init_task is the embryo of the task_struct of every process and thread in the kernel. During kernel initialization, a task_struct is constructed by static definition and named init_task; later in initialization, the rest_init() function then creates the kernel init thread and the kthreadd kernel thread:

  1. The kernel init thread eventually executes /sbin/init and becomes the root of all user-mode programs (as the pstree command shows), i.e. the user-space init process. init starts out as a kernel thread created by kernel_thread; after finishing its initialization work it crosses into user space and becomes the ancestor of all user processes

  2. The kernel kthreadd thread becomes the parent of all other kernel-mode daemon threads

[figure: processes 0 and 1]

init_task therefore determines the genes of every process and thread in the system. Once it has finished its initialization duties, it ultimately evolves into process 0, idle, running in kernel mode.

The idle process has priority MAX_PRIO-20. In early kernel versions idle took part in scheduling like any other task, but in current versions it is never placed on a run queue; instead the run queue structure holds an idle pointer to its idle task, which the scheduler runs when it finds the run queue empty.
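
In today's kernels this fallback is realized through the scheduling-class hierarchy: idle_sched_class sits at the bottom and its pick_next_task() simply returns rq->idle. The following self-contained model shows the effect only; every name in it is ours, not kernel code:

#include <stddef.h>

struct task_struct;

struct sched_class {
    /* Return a runnable task from this class, or NULL if it has none. */
    struct task_struct *(*pick_next_task)(void *rq);
};

struct rq {
    const struct sched_class **classes;  /* ordered highest to lowest */
    size_t nr_classes;
    struct task_struct *idle;            /* this CPU's idle task; never enqueued */
};

static struct task_struct *pick_next_task(struct rq *rq)
{
    size_t i;

    for (i = 0; i < rq->nr_classes; i++) {
        struct task_struct *p = rq->classes[i]->pick_next_task(rq);
        if (p)
            return p;
    }
    /* Every class was empty: fall back to the idle task. */
    return rq->idle;
}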

The init_task variable in the kernel is the process descriptor used by process 0, and it is the first process descriptor in a Linux system.

init_task is defined in init/init_task.c:

/*
 * Set up the first task table, touch at your own risk!. Base=0,
 * limit=0x1fffff (=2MB)
 */
struct task_struct init_task = {
    .thread_info    = INIT_THREAD_INFO(init_task),
    .stack_refcount = REFCOUNT_INIT(1),
    .state          = 0,
    .stack          = init_stack,
    .usage          = REFCOUNT_INIT(2),
    .flags          = PF_KTHREAD,
    .prio           = MAX_PRIO - 20,
    .static_prio    = MAX_PRIO - 20,
    .normal_prio    = MAX_PRIO - 20,
    .policy         = SCHED_NORMAL,
    .cpus_ptr       = &init_task.cpus_mask,
    .cpus_mask      = CPU_MASK_ALL,
    .nr_cpus_allowed= NR_CPUS,
    .mm             = NULL,
    .active_mm      = &init_mm,
    .restart_block  = {
            .fn = do_no_restart_syscall,
    },
    .se             = {
            .group_node     = LIST_HEAD_INIT(init_task.se.group_node),
    },
    .rt             = {
            .run_list       = LIST_HEAD_INIT(init_task.rt.run_list),
            .time_slice     = RR_TIMESLICE,
    },
    .tasks          = LIST_HEAD_INIT(init_task.tasks),
    ...
}
EXPORT_SYMBOL(init_task);
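
Note the .tasks field in the excerpt above: init_task anchors the kernel's circular list of all tasks. That is why the for_each_process() iterator (include/linux/sched/signal.h in recent kernels) both starts and stops at init_task:

/* Walk every process in the system; the loop begins and ends at init_task. */
#define for_each_process(p) \
    for (p = &init_task ; (p = next_task(p)) != &init_task ; )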

3.7.2.2.1.2. The process stack: init_thread_union

The init_task process uses init_stack as its kernel stack:

.stack = init_stack,
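
Where init_stack comes from depends on the kernel version. In older kernels, before the CONFIG_THREAD_INFO_IN_TASK rework, it was the stack half of a statically allocated thread_union, roughly:

/* The kernel stack and thread_info share one THREAD_SIZE-sized block. */
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

/* Statically allocated stack block for the init task. */
union thread_union init_thread_union __init_task_data = {
    INIT_THREAD_INFO(init_task)
};

#define init_stack (init_thread_union.stack)

In recent kernels, init_thread_union and init_stack are instead equivalent linker symbols laid out by INIT_TASK_DATA in include/asm-generic/vmlinux.lds.h.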

3.7.2.2.1.3. Process memory space

The virtual address space of init_task is defined with the same static approach.

Because init_task is a kernel thread running in kernel space, its mm field is NULL; it still needs to use virtual addresses when necessary, however, so active_mm is set to init_mm:

.mm             = NULL,
.active_mm      = &init_mm,

init_mm is defined in mm/init-mm.c:

struct mm_struct init_mm = {
    .mm_rb          = RB_ROOT,
    .pgd            = swapper_pg_dir,
    .mm_users       = ATOMIC_INIT(2),
    .mm_count       = ATOMIC_INIT(1),
    .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .arg_lock       =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
    .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
    .user_ns        = &init_user_ns,
    .cpu_bitmap     = CPU_BITS_NONE,
    INIT_MM_CONTEXT(init_mm)
};
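
A kernel thread never owns an mm of its own; at each context switch it borrows whatever address space is already loaded. The logic below is a simplified excerpt of context_switch() from kernel/sched/core.c (details vary across versions):

/* "next" is the task being switched to, "prev" the one switched out. */
if (!next->mm) {                        /* next is a kernel thread */
    enter_lazy_tlb(prev->active_mm, next);
    next->active_mm = prev->active_mm;  /* borrow the loaded mm */
    if (prev->mm)                       /* coming from a user task */
        mmgrab(prev->active_mm);        /* pin the borrowed mm */
} else {                                /* next is a user task */
    switch_mm_irqs_off(prev->active_mm, next->mm, next);
}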

3.7.2.2.2. The evolution of process 0

3.7.2.2.2.1. rest_init creates the init and kthreadd processes

Up to this point Linux has no notion of a process: execution simply runs from the early initialization code into start_kernel, and finally into its last function call, rest_init.

Roughly speaking, the vmlinux entry point startup_32 (head.S) sets up an execution environment for the primordial process with PID 0, and start_kernel then carries out the kernel's initialization work, including setting up the page tables, the interrupt vector table, the system time, and so on.

From rest_init onward, Linux starts producing processes. Since init_task is manufactured statically with pid=0, everything executed from the earliest assembly code up through start_kernel is retroactively treated as init_task's process context. Inside this function the init and kthreadd processes are created:

noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}
  1. pid = kernel_thread(kernel_init, NULL, CLONE_FS); creates kernel thread 1, which later crosses into user space and evolves into the init process (see the kernel_thread sketch after the next listing)

  2. pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); creates the kthreadd kernel thread

  3. schedule_preempt_disabled() calls schedule() to switch away from the current process; once it runs, kernel_init gets to execute:

void __sched schedule_preempt_disabled(void)
{
    sched_preempt_enable_no_resched();
    schedule();
    preempt_disable();
}
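
For reference, kernel_thread() itself is only a thin wrapper around the fork machinery. Roughly, from kernel/fork.c as of ~v4.x (newer kernels pass a kernel_clone_args structure instead):

pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    /* CLONE_VM: the child shares (borrows) the caller's address space;
     * CLONE_UNTRACED: ptrace cannot attach to kernel threads. */
    return _do_fork(flags | CLONE_VM | CLONE_UNTRACED, (unsigned long)fn,
                    (unsigned long)arg, NULL, NULL, 0);
}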

kernel_init goes on to finish the remaining initialization work, then execve()s /sbin/init and becomes the ancestor of all other processes in the system.
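
The tail of kernel_init() makes this hand-off explicit. Simplified from init/main.c (~v4/v5 shape; error reporting omitted):

static int __ref kernel_init(void *unused)
{
    kernel_init_freeable();     /* remaining setup; waits on kthreadd_done */
    /* ... free initmem, mark the kernel read-only ... */

    if (ramdisk_execute_command &&
        !run_init_process(ramdisk_execute_command))
        return 0;

    /* Try the conventional locations of the first user-space program. */
    if (!try_to_run_init_process("/sbin/init") ||
        !try_to_run_init_process("/etc/init") ||
        !try_to_run_init_process("/bin/init") ||
        !try_to_run_init_process("/bin/sh"))
        return 0;

    panic("No working init found.  Try passing init= option to kernel.");
}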

cpu_startup_entry() then loops forever calling do_idle(): thread 0 has entered the idle loop, where it repeatedly checks whether anything else needs the CPU:

void cpu_startup_entry(enum cpuhp_state state)
{
    arch_cpu_idle_prepare();
    cpuhp_online_idle(state);
    while (1)
        do_idle();
}

3.7.2.2.2.2. Running and scheduling idle

idle is scheduled only when the system has no other runnable process to execute; what it runs then is the do_idle function:

/*
 * Generic idle loop implementation
 *
 * Called with polling cleared.
 */
static void do_idle(void)
{
    int cpu = smp_processor_id();
    /*
     * If the arch has a polling bit, we maintain an invariant:
     *
     * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
     * rq->idle). This means that, if rq->idle has the polling bit set,
     * then setting need_resched is guaranteed to cause the CPU to
     * reschedule.
     */

    __current_set_polling();
    tick_nohz_idle_enter();

    while (!need_resched()) {
        rmb();

        local_irq_disable();

        if (cpu_is_offline(cpu)) {
            tick_nohz_idle_stop_tick();
            cpuhp_report_idle_dead();
            arch_cpu_idle_dead();
        }

        arch_cpu_idle_enter();

        /*
         * In poll mode we reenable interrupts and spin. Also if we
         * detected in the wakeup from idle path that the tick
         * broadcast device expired for us, we don't want to go deep
         * idle as we know that the IPI is going to arrive right away.
         */
        if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
            tick_nohz_idle_restart_tick();
            cpu_idle_poll();
        } else {
            cpuidle_idle_call();
        }
        arch_cpu_idle_exit();
    }

    /*
     * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
     * be set, propagate it into PREEMPT_NEED_RESCHED.
     *
     * This is required because for polling idle loops we will not have had
     * an IPI to fold the state for us.
     */
    preempt_set_need_resched();
    tick_nohz_idle_exit();
    __current_clr_polling();

    /*
     * We promise to call sched_ttwu_pending() and reschedule if
     * need_resched() is set while polling is set. That means that clearing
     * polling needs to be visible before doing these things.
     */
    smp_mb__after_atomic();

    sched_ttwu_pending();
    schedule_idle();

    if (unlikely(klp_patch_pending(current)))
        klp_update_patch_state(current);
}

The default idle implementation on x86 is the hlt instruction, which puts the CPU into a halted state until a hardware interrupt wakes it up, thereby saving power.
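
Concretely, that path looks roughly like the following (simplified from arch/x86/kernel/process.c and arch/x86/include/asm/irqflags.h; tracing and vendor quirks omitted):

/* Re-enable interrupts and halt in one step: "sti; hlt" guarantees the
 * wakeup interrupt cannot slip in between the two instructions. */
static inline void native_safe_halt(void)
{
    asm volatile("sti; hlt" : : : "memory");
}

/* The default x86 idle routine: sleep until the next interrupt. */
void default_idle(void)
{
    native_safe_halt();
}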

rest_init flow             Description

rcu_scheduler_starting     informs RCU that the scheduler is starting up (sets rcu_scheduler_active)