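init_task
==========

Linux has three special processes: the idle process (PID 0), the init process (PID 1), and kthreadd (PID 2).

- The idle process is created by the system itself and runs in kernel mode.

  Its PID is 0. It starts out as the first process the system sets up, and it is the only process that is not produced by fork or kernel_thread. Once the system has booted, it turns into the idle (swapper) task that a CPU runs when nothing else is runnable.

- The init process is created by idle through kernel_thread. It finishes its initialization in kernel space, then loads the init program and ends up running in user space.

  Created by process 0, it completes system initialization and is the ancestor of every other user process. The kernel boots first, then init starts in user space, and init in turn starts the other system processes. After boot completes, init stays around as a daemon that supervises the other processes.

- kthreadd is created by idle through kernel_thread and always runs in kernel space. It is responsible for creating and managing kernel threads.

  It loops in the ``kthreadd`` function, which creates the threads queued on the global list ``kthread_create_list``. Kernel threads requested through ``kthread_create`` are placed on that list, so kernel threads created this way have kthreadd as their direct or indirect parent.

Creation of idle
----------------

On an SMP system every processor has its own run queue, and every run queue has its own idle process, so there are as many idle processes as processors.

The system's idle time is simply the time spent running the idle processes.

Context of process 0: the init_task descriptor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``init_task`` is the prototype for the task_struct of every process and thread in the kernel. During kernel initialization a task_struct is built by static definition and named init_task. Late in initialization, ``rest_init()`` creates the kernel init thread and the kthreadd kernel thread:

1) The kernel init thread eventually executes /sbin/init and becomes the root of all user-space programs (as shown by pstree), i.e. the user-space init process. It starts as a kernel thread created by kernel_thread; after finishing its initialization work it moves to user space and becomes the ancestor of all user processes.
2) The kthreadd kernel thread becomes the parent of all other kernel daemon threads.

.. image:: res/0号-1号进程.png

init_task therefore determines the "genes" of every process and thread in the system. Once initialization is done it becomes process 0, idle, and runs in kernel mode.

The idle process has priority MAX_PRIO-20. In early versions idle took part in scheduling, but in current versions it does not sit on the run queue. Instead, the run queue structure holds an idle pointer to the idle process, which runs when the scheduler finds the run queue empty.

The init_task variable is the process descriptor used by process 0, and it is the first process descriptor in a Linux system.

init_task is defined in init/init_task.c ::

    /*
     * Set up the first task table, touch at your own risk!. Base=0,
     * limit=0x1fffff (=2MB)
     */
    struct task_struct init_task = {
        .thread_info    = INIT_THREAD_INFO(init_task),
        .stack_refcount = REFCOUNT_INIT(1),
        .state          = 0,
        .stack          = init_stack,
        .usage          = REFCOUNT_INIT(2),
        .flags          = PF_KTHREAD,
        .prio           = MAX_PRIO - 20,
        .static_prio    = MAX_PRIO - 20,
        .normal_prio    = MAX_PRIO - 20,
        .policy         = SCHED_NORMAL,
        .cpus_ptr       = &init_task.cpus_mask,
        .cpus_mask      = CPU_MASK_ALL,
        .nr_cpus_allowed= NR_CPUS,
        .mm             = NULL,
        .active_mm      = &init_mm,
        .restart_block  = {
            .fn = do_no_restart_syscall,
        },
        .se             = {
            .group_node = LIST_HEAD_INIT(init_task.se.group_node),
        },
        .rt             = {
            .run_list   = LIST_HEAD_INIT(init_task.rt.run_list),
            .time_slice = RR_TIMESLICE,
        },
        .tasks          = LIST_HEAD_INIT(init_task.tasks),
        ...
    };
    EXPORT_SYMBOL(init_task);

Process stack: init_thread_union
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

init_task uses init_stack as its process stack ::

    .stack          = init_stack,

Process memory space
^^^^^^^^^^^^^^^^^^^^^^^^

The virtual address space of init_task is defined in the same static way.

Because init_task is a kernel thread running in kernel space, its mm is NULL. It still has to use virtual addresses when necessary, so active_mm is set to ``init_mm`` ::

    .mm             = NULL,
    .active_mm      = &init_mm,

init_mm is defined in mm/init-mm.c ::

    struct mm_struct init_mm = {
        .mm_rb           = RB_ROOT,
        .pgd             = swapper_pg_dir,
        .mm_users        = ATOMIC_INIT(2),
        .mm_count        = ATOMIC_INIT(1),
        .mmap_sem        = __RWSEM_INITIALIZER(init_mm.mmap_sem),
        .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
        .arg_lock        = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
        .mmlist          = LIST_HEAD_INIT(init_mm.mmlist),
        .user_ns         = &init_user_ns,
        .cpu_bitmap      = CPU_BITS_NONE,
        INIT_MM_CONTEXT(init_mm)
    };

Evolution of process 0
----------------------------

rest_init creates the init and kthreadd processes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before any notion of a process exists, Linux runs from its early initialization code up to ``start_kernel`` and then on to rest_init, the last function that start_kernel calls.

Roughly, the vmlinux entry point startup_32 (head.S) sets up the execution environment for the original process with PID 0, and start_kernel then performs kernel initialization, including the page tables, the interrupt vector table, and the system time.

From ``rest_init`` on, Linux starts creating processes. Because init_task is built statically with pid=0, everything executed from the earliest assembly code through start_kernel is folded into the process context of init_task.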
``rest_init`` creates the init process and the kthreadd process ::

    noinline void __ref rest_init(void)
    {
        struct task_struct *tsk;
        int pid;

        rcu_scheduler_starting();
        /*
         * We need to spawn init first so that it obtains pid 1, however
         * the init task will end up wanting to create kthreads, which, if
         * we schedule it before we create kthreadd, will OOPS.
         */
        pid = kernel_thread(kernel_init, NULL, CLONE_FS);
        /*
         * Pin init on the boot CPU. Task migration is not properly working
         * until sched_init_smp() has been run. It will set the allowed
         * CPUs for init to the non isolated CPUs.
         */
        rcu_read_lock();
        tsk = find_task_by_pid_ns(pid, &init_pid_ns);
        set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
        rcu_read_unlock();

        numa_default_policy();
        pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
        rcu_read_lock();
        kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
        rcu_read_unlock();

        /*
         * Enable might_sleep() and smp_processor_id() checks.
         * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
         * kernel_thread() would trigger might_sleep() splats. With
         * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
         * already, but it's stuck on the kthreadd_done completion.
         */
        system_state = SYSTEM_SCHEDULING;

        complete(&kthreadd_done);

        /*
         * The boot idle thread must execute schedule()
         * at least once to get things moving:
         */
        schedule_preempt_disabled();
        /* Call into cpu_idle with preempt disabled */
        cpu_startup_entry(CPUHP_ONLINE);
    }

1) ``pid = kernel_thread(kernel_init, NULL, CLONE_FS);`` creates kernel thread 1, which later moves to user space and becomes the init process.
2) ``pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);`` creates the kthreadd kernel thread.
3) schedule() is called to switch away from the current process; after this call kernel_init gets to run ::

       void __sched schedule_preempt_disabled(void)
       {
           sched_preempt_enable_no_resched();
           schedule();
           preempt_disable();
       }

kernel_init goes on to finish the remaining initialization work and then calls execve on /sbin/init, becoming the ancestor of the other processes in the system.

cpu_startup_entry calls do_idle() in an endless loop (older kernels called cpu_idle_loop() here). Thread 0 thereby enters the idle loop, in which it repeatedly checks whether a reschedule is needed ::

    void cpu_startup_entry(enum cpuhp_state state)
    {
        arch_cpu_idle_prepare();
        cpuhp_online_idle(state);
        while (1)
            do_idle();
    }

Running and scheduling of idle
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

idle is scheduled only when the system has no other ready process to run, at which point it executes the ``do_idle`` function ::

    /*
     * Generic idle loop implementation
     *
     * Called with polling cleared.
     */
    static void do_idle(void)
    {
        int cpu = smp_processor_id();
        /*
         * If the arch has a polling bit, we maintain an invariant:
         *
         * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
         * rq->idle). This means that, if rq->idle has the polling bit set,
         * then setting need_resched is guaranteed to cause the CPU to
         * reschedule.
         */

        __current_set_polling();
        tick_nohz_idle_enter();

        while (!need_resched()) {
            rmb();

            local_irq_disable();

            if (cpu_is_offline(cpu)) {
                tick_nohz_idle_stop_tick();
                cpuhp_report_idle_dead();
                arch_cpu_idle_dead();
            }

            arch_cpu_idle_enter();

            /*
             * In poll mode we reenable interrupts and spin. Also if we
             * detected in the wakeup from idle path that the tick
             * broadcast device expired for us, we don't want to go deep
             * idle as we know that the IPI is going to arrive right away.
             */
            if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
                tick_nohz_idle_restart_tick();
                cpu_idle_poll();
            } else {
                cpuidle_idle_call();
            }
            arch_cpu_idle_exit();
        }

        /*
         * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
         * be set, propagate it into PREEMPT_NEED_RESCHED.
         *
         * This is required because for polling idle loops we will not have had
         * an IPI to fold the state for us.
         */
        preempt_set_need_resched();
        tick_nohz_idle_exit();
        __current_clr_polling();

        /*
         * We promise to call sched_ttwu_pending() and reschedule if
         * need_resched() is set while polling is set. That means that clearing
         * polling needs to be visible before doing these things.
         */
        smp_mb__after_atomic();

        sched_ttwu_pending();
        schedule_idle();

        if (unlikely(klp_patch_pending(current)))
            klp_update_patch_state(current);
    }

On x86 the default idle implementation is the hlt instruction, which halts the CPU until a hardware interrupt arrives and thereby saves power.

+------------------------+-------------+
| rest_init step         | Description |
+========================+=============+
| rcu_scheduler_starting |             |
+------------------------+-------------+