The path from the timer interrupt to process scheduling (schedule) in Linux

On a modern multitasking OS the system may run on a single core or on multiple cores, and a process is either running or sitting in memory in a runnable, waiting state. With respect to scheduling, two models have to be distinguished: preemptive multitasking and cooperative multitasking. Under preemptive multitasking the scheduler decides when a running process is stopped and switched out so that a new process can start executing; this is called preemption, and the amount of time a process runs before being preempted is normally a pre-set timeslice. The defining property of non-preemptive, cooperative multitasking is that other processes are scheduled only when the current process voluntarily gives up the CPU, which is called yielding. The scheduler has no global control over process execution and running time, and a hung process can bring the whole system to a halt because nothing else can be scheduled.
The scheduler distinguishes I/O-bound and processor-bound processes and treats them differently, using a fairly elaborate algorithm to provide both quick responsiveness and high throughput.
The difference between a process's nice value and its real-time priority:
The nice value ranges from -20 to +19. A process's timeslice is generally scaled according to its nice value: the higher the nice value, the smaller the timeslice it tends to receive. It can be inspected with ps -el.
The real-time priority is a separate dimension from nice. It ranges from 0 to 99, and a larger value means a higher priority; real-time processes generally take precedence over normal processes. ps -eo state,uid,pid,ppid,rtprio,time,comm shows the details: a '-' in the rtprio column means the process is not real-time, otherwise the number is its real-time priority.
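As a quick illustration, here is a minimal user-space sketch (not kernel code; error handling kept to a minimum, and the SCHED_FIFO part needs root or CAP_SYS_NICE). It reads the current nice value with getpriority(), lowers the process's weight with setpriority(), and then moves the process into the SCHED_FIFO real-time class with sched_setscheduler():

/* nice vs. real-time priority: a minimal user-space sketch.
 * Build: gcc -o prio_demo prio_demo.c
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    /* 1. The nice value: -20 (highest weight) .. +19 (lowest weight). */
    int nice_val = getpriority(PRIO_PROCESS, 0);
    printf("current nice value: %d\n", nice_val);

    if (setpriority(PRIO_PROCESS, 0, 10) == 0)      /* be "nicer": smaller share of CPU time */
        printf("nice value set to 10\n");
    else
        printf("setpriority failed: %s\n", strerror(errno));

    /* 2. The real-time priority (1..99 from user space, SCHED_FIFO/SCHED_RR only). */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0)
        printf("now SCHED_FIFO with rtprio 50 (see the rtprio column of ps)\n");
    else
        printf("sched_setscheduler failed (needs root): %s\n", strerror(errno));

    return 0;
}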

https://www.cnblogs.com/tolimit/p/4303052.html
The flow of schedule(): it first disables preemption to prevent the scheduler from being re-entered, then calls __schedule(). There the current task is handled: if it has a pending signal it is left marked TASK_RUNNING, while if it needs to sleep it is removed from the runqueue with deactivate_task() and placed on the corresponding wait queue. pick_next_task() then selects the next process to run, and context_switch() switches to it.
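For reference, the schedule() wrapper in kernels of roughly the era discussed here (around 3.x/4.x) looks approximately like the listing below; treat it as an illustrative excerpt, since details differ between versions:

asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;

    sched_submit_work(tsk);
    do {
        preempt_disable();                  /* block scheduler re-entry */
        __schedule(false);                  /* deactivate / pick_next_task / context_switch */
        sched_preempt_enable_no_resched();  /* re-enable preemption without rescheduling here */
    } while (need_resched());
}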
The flow of pick_next_task(): it first checks whether the current process's scheduling class (sched_class) is fair_sched_class, i.e. CFS. If it is, and the number of runnable entities on this CPU's runqueue equals the number of entities in its CFS runqueue including all CFS group runqueues below it (rq->nr_running == rq->cfs.h_nr_running), meaning no entity is queued in the rt class or any other class, then the result of fair_sched_class's own pick_next_task is returned directly. Otherwise every scheduling class is walked with for_each_class(class) and the first non-NULL result of class->pick_next_task is returned. The interesting part is the for_each_class walk: it starts from sched_class_highest, which is stop_sched_class, so it is worth looking at how the individual scheduling classes are linked and registered.

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
   for (class = sched_class_highest; class; class = class->next)

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

First, how the scheduling classes are chained together through their ->next pointers: stop_sched_class -> dl_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class -> NULL

Next, how the scheduling classes are registered:
During boot, early_initcall(cpu_stop_init) sets up the stop machinery. cpu_stop_init registers cpu_stop_threads, whose create callback, when invoked, ends up in cpu_stop_create -> sched_set_stop_task, which is where the stop task is attached to stop_sched_class. The path that invokes create is:
    cpu_stop_init->
        smpboot_register_percpu_thread->
            smpboot_register_percpu_thread_cpumask->
                __smpboot_create_thread->
                    cpu_stop_threads.create(即cpu_stop_create)
Now we can return to pick_next_task(): it walks all of the scheduling classes listed above, starting from the highest-priority one, and returns the first runnable process it finds.
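To make that walk concrete, here is a self-contained user-space toy model (the struct layout, helper tasks, and pick functions are simplified stand-ins; only the names mirror the kernel): classes are chained via ->next from highest to lowest priority, and pick_next_task() returns the first class's non-NULL pick.

/* A toy model of for_each_class() / pick_next_task(); not kernel code. */
#include <stdio.h>
#include <stddef.h>

struct task { const char *comm; };

struct sched_class {
    const char *name;
    const struct sched_class *next;
    struct task *(*pick_next_task)(void);
};

/* Pretend only the fair class currently has a runnable task. */
static struct task fair_task = { "some_cfs_task" };

static struct task *pick_none(void) { return NULL; }
static struct task *pick_fair(void) { return &fair_task; }

/* Chain: stop -> dl -> rt -> fair -> idle -> NULL */
static const struct sched_class idle_sched_class = { "idle", NULL,              pick_none };
static const struct sched_class fair_sched_class = { "fair", &idle_sched_class, pick_fair };
static const struct sched_class rt_sched_class   = { "rt",   &fair_sched_class, pick_none };
static const struct sched_class dl_sched_class   = { "dl",   &rt_sched_class,   pick_none };
static const struct sched_class stop_sched_class = { "stop", &dl_sched_class,   pick_none };

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)

static struct task *pick_next_task(void)
{
    const struct sched_class *class;
    struct task *p;

    for_each_class(class) {
        p = class->pick_next_task();
        if (p) {
            printf("picked '%s' from the %s class\n", p->comm, class->name);
            return p;
        }
    }
    /* In the real kernel the idle class always returns the idle task,
     * so the walk never comes up empty. */
    return NULL;
}

int main(void)
{
    pick_next_task();
    return 0;
}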


The timer interrupt is responsible for decrementing the running process's timeslice count. When the count reaches zero, need_resched is set and the kernel runs the scheduler as soon as possible. In other words, the timer interrupt updates the execution-time accounting of the current process, and if its timeslice is used up it sets need_resched so that the running process is switched out in the scheduling pass that follows (for example on return from the timer interrupt, as examined below).

RTC (Real-Time Clock): a non-volatile device that stores the system time. At boot the kernel reads the time from this CMOS-backed device and uses it to initialize the system time.
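From user space the RTC can be read directly through the kernel's rtc driver. The sketch below (device path and error handling kept minimal; may need root) issues the RTC_RD_TIME ioctl on /dev/rtc0:

/* Read the hardware RTC via the kernel rtc driver. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
    int fd = open("/dev/rtc0", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/rtc0");
        return 1;
    }

    struct rtc_time tm;
    if (ioctl(fd, RTC_RD_TIME, &tm) < 0) {      /* ask the driver for the current RTC time */
        perror("RTC_RD_TIME");
        close(fd);
        return 1;
    }

    printf("RTC time: %04d-%02d-%02d %02d:%02d:%02d\n",
           tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
           tm.tm_hour, tm.tm_min, tm.tm_sec);
    close(fd);
    return 0;
}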

System Timer: implemented by an electronic clock running at a programmable frequency, it drives the periodic timer interrupt. Some architectures implement it with a decrementer instead: a counter is loaded with an initial value and counts down at a fixed rate, and when it reaches zero a timer interrupt is triggered.
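As a loose user-space analogy for "program a count, let it expire, re-arm" (this is not how the kernel programs the hardware, just something the reader can experiment with), a periodic timerfd behaves like a tick source:

/* A user-space analogy of a periodic tick source, using timerfd. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/timerfd.h>

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) {
        perror("timerfd_create");
        return 1;
    }

    /* Fire every 10 ms, roughly comparable to a HZ=100 tick. */
    struct itimerspec its = {
        .it_interval = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
        .it_value    = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
    };
    if (timerfd_settime(tfd, 0, &its, NULL) < 0) {
        perror("timerfd_settime");
        return 1;
    }

    for (int i = 0; i < 5; i++) {
        uint64_t expirations;
        if (read(tfd, &expirations, sizeof(expirations)) == sizeof(expirations))
            printf("tick %d: %llu expiration(s)\n", i, (unsigned long long)expirations);
    }
    close(tfd);
    return 0;
}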

The timer interrupt is broken into two pieces: an architecture-dependent and an architecture-independent routine.

The architecture-dependent routine is registered as the interrupt handler for the system timer and, thus, runs when the timer interrupt hits. Its exact job depends on the given architecture, of course, but most handlers perform at least the following work:
1. Obtain the xtime_lock lock, which protects access to jiffies_64 and the wall time value, xtime.
2. Acknowledge or reset the system timer as required.
3. Periodically save the updated wall time to the real time clock.
4. Call the architecture-independent timer routine, tick_periodic().

The architecture-independent routine, tick_periodic(), performs much more work:
1. Increment the jiffies_64 count by one. (This is safe, even on 32-bit architectures, because the xtime_lock lock was previously obtained.)
2. Update resource usages, such as consumed system and user time, for the currently running process.
3. Run any dynamic timers that have expired (discussed in the following section).
4. Execute scheduler_tick(), as discussed in Chapter 4.
5. Update the wall time, which is stored in xtime.
6. Calculate the infamous load average.
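For kernels of roughly the vintage walked through here, tick_periodic() itself is quite short; the listing below is approximate and may differ in detail between versions, but it shows the steps above: the jiffies/wall-time update happens only on the CPU that owns the tick, then per-CPU process accounting and scheduler_tick() run via update_process_times().

static void tick_periodic(int cpu)
{
    if (tick_do_timer_cpu == cpu) {
        write_seqlock(&jiffies_lock);

        /* Keep track of the next tick event */
        tick_next_period = ktime_add(tick_next_period, tick_period);

        do_timer(1);                 /* increments jiffies_64, updates the load average */
        write_sequnlock(&jiffies_lock);
        update_wall_time();
    }

    update_process_times(user_mode(get_irq_regs()));  /* accounting, timers, scheduler_tick() */
    profile_tick(CPU_PROFILING);
}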

Next, let us look at how the interrupt is registered and where it enters the kernel, taking the PowerPC FSL BookE ppce500 as a concrete example.
Interrupt vectors
The interrupt handlers are registered during system startup; once boot is complete, every timer interrupt triggers the registered handler.
The kernel entry file for the ppce500 is head_fsl_booke.S under arch/powerpc/kernel/, which defines the interrupt vector table:

interrupt_base:
    /* Critical Input Interrupt */
    CRITICAL_EXCEPTION(0x0100, CRITICAL, CriticalInput, unknown_exception)
    ......
    
    /* Decrementer Interrupt */
    DECREMENTER_EXCEPTION
    ......
    
The timer interrupt entry is DECREMENTER_EXCEPTION, which is expanded in the header arch/powerpc/kernel/head_booke.h:
#define EXC_XFER_TEMPLATE(hdlr, trap, msr, copyee, tfer, ret)    \
    li    r10,trap;                    \
    stw    r10,_TRAP(r11);                    \
    lis    r10,msr@h;                    \
    ori    r10,r10,msr@l;                    \
    copyee(r10, r9);                    \
    bl    tfer;                         \
    .long    hdlr;                        \
    .long    ret

#define EXC_XFER_LITE(n, hdlr)        \
    EXC_XFER_TEMPLATE(hdlr, n+1, MSR_KERNEL, NOCOPY, transfer_to_handler, \
              ret_from_except)

#define DECREMENTER_EXCEPTION                              \
    START_EXCEPTION(Decrementer)                          \
    NORMAL_EXCEPTION_PROLOG(DECREMENTER);              \
    lis     r0,TSR_DIS@h;           /* Setup the DEC interrupt mask */    \
    mtspr   SPRN_TSR,r0;        /* Clear the DEC interrupt */          \
    addi    r3,r1,STACK_FRAME_OVERHEAD;                      \
    EXC_XFER_LITE(0x0900, timer_interrupt)

Now look at the timer_interrupt function itself:

/*
 * timer_interrupt - gets called when the decrementer overflows,
 * with interrupts disabled.
 */
void timer_interrupt(struct pt_regs * regs)
{
    struct pt_regs *old_regs;
    u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);

    /* Ensure a positive value is written to the decrementer, or else
     * some CPUs will continue to take decrementer exceptions.
     */
    set_dec(DECREMENTER_MAX);

    /* Some implementations of hotplug will get timer interrupts while
     * offline, just ignore these and we also need to set
     * decrementers_next_tb as MAX to make sure __check_irq_replay
     * don't replay timer interrupt when return, otherwise we'll trap
     * here infinitely :(
     */
    if (!cpu_online(smp_processor_id())) {
        *next_tb = ~(u64)0;
        return;
    }

    /* Conditionally hard-enable interrupts now that the DEC has been
     * bumped to its maximum value
     */
    may_hard_irq_enable();


#if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
    if (atomic_read(&ppc_n_lost_interrupts) != 0)
        do_IRQ(regs);
#endif

    old_regs = set_irq_regs(regs);
    irq_enter();

    __timer_interrupt();
    irq_exit();
    set_irq_regs(old_regs);
}

static void __timer_interrupt(void)
{
    struct pt_regs *regs = get_irq_regs();
    u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
    struct clock_event_device *evt = this_cpu_ptr(&decrementers);
    u64 now;

    trace_timer_interrupt_entry(regs);

    if (test_irq_work_pending()) {
        clear_irq_work_pending();
        irq_work_run();
    }

    now = get_tb_or_rtc();
    if (now >= *next_tb) {
        *next_tb = ~(u64)0;
        if (evt->event_handler)
            evt->event_handler(evt);
        __this_cpu_inc(irq_stat.timer_irqs_event);
    } else {
        now = *next_tb - now;
        if (now <= DECREMENTER_MAX)
            set_dec((int)now);
        /* We may have raced with new irq work */
        if (test_irq_work_pending())
            set_dec(1);
        __this_cpu_inc(irq_stat.timer_irqs_others);
    }

#ifdef CONFIG_PPC64
    /* collect purr register values often, for accurate calculations */
    if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
        struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
        cu->current_tb = mfspr(SPRN_PURR);
    }
#endif

    trace_timer_interrupt_exit(regs);
}

In __timer_interrupt above we can see evt->event_handler being invoked. What exactly is this event_handler, and how is it registered?

The answer is that the event_handler is tick_handle_periodic, defined as follows:

/*
 * Event handler for periodic ticks
 */
void tick_handle_periodic(struct clock_event_device *dev)
{
    int cpu = smp_processor_id();
    ktime_t next = dev->next_event;

    tick_periodic(cpu);

#if defined(CONFIG_HIGH_RES_TIMERS) || defined(CONFIG_NO_HZ_COMMON)
    /*
     * The cpu might have transitioned to HIGHRES or NOHZ mode via
     * update_process_times() -> run_local_timers() ->
     * hrtimer_run_queues().
     */
    if (dev->event_handler != tick_handle_periodic)
        return;
#endif

    if (!clockevent_state_oneshot(dev))
        return;
    for (;;) {
        /*
         * Setup the next period for devices, which do not have
         * periodic mode:
         */
        next = ktime_add(next, tick_period);

        if (!clockevents_program_event(dev, next, false))
            return;
        /*
         * Have to be careful here. If we're in oneshot mode,
         * before we call tick_periodic() in a loop, we need
         * to be sure we're using a real hardware clocksource.
         * Otherwise we could get trapped in an infinite
         * loop, as the tick_periodic() increments jiffies,
         * which then will increment time, possibly causing
         * the loop to trigger again and again.
         */
        if (timekeeping_valid_for_hres())
            tick_periodic(cpu);
    }
}

tick_handle_periodic is wired up through the following call chains:
Registration: start_kernel->time_init->init_decrementer_clockevent->register_decrementer_clockevent->clockevents_register_device->tick_check_new_device->tick_setup_periodic->tick_set_periodic_handler (which installs tick_handle_periodic as the event handler)
Per-tick execution: tick_handle_periodic->tick_periodic->update_process_times->scheduler_tick
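A heavily trimmed sketch of scheduler_tick() follows (version-specific details such as load accounting and SMP balancing are elided); the point is the indirect call through the current task's scheduling class:

/* Trimmed sketch of scheduler_tick(); not a complete listing. */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;

    raw_spin_lock(&rq->lock);
    update_rq_clock(rq);
    curr->sched_class->task_tick(rq, curr, 0);   /* e.g. task_tick_rt or task_tick_fair */
    raw_spin_unlock(&rq->lock);

    /* ... load accounting and SMP load-balancing triggers follow ... */
}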
The second chain is what runs on every tick. As the sketch above shows, scheduler_tick() calls the task_tick hook of the current task's scheduling class; in rt_sched_class this is task_tick_rt, implemented as follows:
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
    struct sched_rt_entity *rt_se = &p->rt;

    update_curr_rt(rq);

    watchdog(rq, p);

    /*
     * RR tasks need a special form of timeslice management.
     * FIFO tasks have no timeslices.
     */
    if (p->policy != SCHED_RR)
        return;

    if (--p->rt.time_slice)
        return;

    p->rt.time_slice = sched_rr_timeslice;

    /*
     * Requeue to the end of queue if we (and all of our ancestors) are not
     * the only element on the queue
     */
    for_each_sched_rt_entity(rt_se) {
        if (rt_se->run_list.prev != rt_se->run_list.next) {
            requeue_task_rt(rq, p, 0);
            resched_curr(rq);
            return;
        }
    }
}
As you can see, if the current timeslice is not yet exhausted the function simply returns. Otherwise the real-time timeslice is reset to sched_rr_timeslice, and if the scheduling entity is not the only element on its queue, the task is requeued at the tail of the runqueue and resched_curr() marks the current task for rescheduling before returning.
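The RR timeslice that gets reloaded here is visible from user space. The following sketch queries it for the calling process with sched_rr_get_interval(); it is only meaningful once the process runs under SCHED_RR, hence the sched_setscheduler() call, which needs root:

/* Query the SCHED_RR timeslice from user space. Run as root. */
#include <stdio.h>
#include <sched.h>
#include <time.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_RR, &sp) < 0) {
        perror("sched_setscheduler(SCHED_RR)");
        return 1;
    }

    struct timespec ts;
    if (sched_rr_get_interval(0, &ts) < 0) {     /* 0 = the calling process */
        perror("sched_rr_get_interval");
        return 1;
    }

    printf("SCHED_RR timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}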

This raises the next question: once TIF_NEED_RESCHED has been set on the process, when does the actual reschedule take place?
There are four entry points into the scheduler:
1. Return from an interrupt;
2. Return from a system call to user space;
3. A process voluntarily giving up the CPU;
4. Return to kernel space after signal handling;
Rescheduling triggered by the timer interrupt returning is case 1. Again taking the ppce500 as the example, here is how the reschedule happens:
The various exception-return paths share the entry RET_FROM_EXC_LEVEL, which goes through user_exc_return into do_work, and do_work acts as the common dispatch point:
    do_work:            /* r10 contains MSR_KERNEL here */
        andi.    r0,r9,_TIF_NEED_RESCHED
        beq    do_user_signal
As the code shows, if the reschedule flag is not set, execution branches to do_user_signal, and ultimately returns through restore_user to the interrupted context:
    do_user_signal:            /* r10 contains MSR_KERNEL here */
        ori    r10,r10,MSR_EE
        SYNC
        MTMSRD(r10)        /* hard-enable interrupts */
        /* save r13-r31 in the exception frame, if not already done */
        lwz    r3,_TRAP(r1)
        andi.    r0,r3,1
        beq    2f
        SAVE_NVGPRS(r1)
        rlwinm    r3,r3,0,0,30
        stw    r3,_TRAP(r1)
    2:    addi    r3,r1,STACK_FRAME_OVERHEAD
        mr    r4,r9
        bl    do_notify_resume
        REST_NVGPRS(r1)
        b    recheck
do_resched is branched to from recheck, which is likewise defined in entry_32.S:
    recheck:
        /* Note: And we don't tell it we are disabling them again
         * neither. Those disable/enable cycles used to peek at
         * TI_FLAGS aren't advertised.
         */
        LOAD_MSR_KERNEL(r10,MSR_KERNEL)
        SYNC
        MTMSRD(r10)        /* disable interrupts */
        CURRENT_THREAD_INFO(r9, r1)
        lwz    r9,TI_FLAGS(r9)
        andi.    r0,r9,_TIF_NEED_RESCHED
        bne-    do_resched
        andi.    r0,r9,_TIF_USER_WORK_MASK
        beq    restore_user
And in entry_32.S, do_resched calls schedule() to perform the actual reschedule:
    do_resched:            /* r10 contains MSR_KERNEL here */
        /* Note: We don't need to inform lockdep that we are enabling
         * interrupts here. As far as it knows, they are already enabled
         */
        ori    r10,r10,MSR_EE
        SYNC
        MTMSRD(r10)        /* hard-enable interrupts */
        bl    schedule

Finally, back to how the timer interrupt itself returns:
In the interrupt vector definitions above there is a step bl tfer, where tfer is either transfer_to_handler or transfer_to_handler_full; for the timer interrupt it is transfer_to_handler, which performs the preparation needed before the handler runs and then jumps to the handler hdlr. After the handler, control reaches ret, which is either ret_from_except or ret_from_except_full; for the timer interrupt it is ret_from_except, which goes on to resume_kernel and then preempt_schedule_irq, where the reschedule is carried out:

/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
    enum ctx_state prev_state;

    /* Catch callers which need to be fixed */
    BUG_ON(preempt_count() || !irqs_disabled());

    prev_state = exception_enter();

    do {
        preempt_disable();
        local_irq_enable();
        __schedule(true);
        local_irq_disable();
        sched_preempt_enable_no_resched();
    } while (need_resched());

    exception_exit(prev_state);
}
 
