国际 网站制作公司校园网站建设意见
关于LINUX资源调度
计算机系统中,管理资源的方式一般有两种方法,分别是时间分割和空间分割,可以通过分割硬件的相似性,让软件以一致的逻辑执行,CPU运行特点是在时刻点A和时刻B运行机制是一样的,不同的只是执行现场,可以看作是在时间上有对称性的一种资源,比较适合的管理方法就是分割时间。而对于内存这种资源,不同时刻点的数据是不可预期的,在时间尺度上没有相似性,但是在空间尺度上,A区和B区的晶体管作用基本上是一样的,只是空间范围的起始不同,就比较适合使用空间分割管理。
所以,在操作系统实现中,对于CPU分割时间的管理方式就是任务调度,而对于内存这种进行空间管理的手段就是BUDDY系统。
CPU是否能在空间维度上管理呢?或者内存是否可以在时间维度上进行管理呢?应该是可以的,比如CPU的多核和SMT/SMP设计,就是物理上配置了多组处理器或者流水线资源,实现空间维度的扩展。所以,运行LINUX系统的当代处理器,既支持时分,也支持空分管理,统统属于调度范畴。
那么内存在时间维度上管理的例子呢?或许LINUX系统中的内存交换机制可以看作内存在时间管理上的一个实现,为了在有限的内存提供尽量多的运行进程带宽,LINUX支持在内存紧张的时刻,将匿名页面交换到外部磁盘上,在需要的时候在再交换回来,实现对物理内存页面的分时复用。所以同CPU管理一样,LINUX系统对物理内存的管理也同时支持时分和空分。
CFS调度和优先级队列调度的区别
LINUX内核同时支持CFS调度和优先级队列调度,CFS调度算法用在选择SCHED_NORMAL策略的进程上,而内核支持的SCHED_RR实时调度策略则采用优先级队列的方式调度下一个要运行的进程。
至于这两种调度策略在内核中的区别,个人理解主要体现在对“优先级”的理解上。对于优先级队列这种方式来说,优先级的高低是选择下一个运行进程的唯一标准,在占有CPU这件事情上,高优先级的进程具有绝对的优先权,所以只要有高优先级的进程存在,低优先级进程永远没有执行的机会,所以会有线程饥饿情况的发生。
CFS调度策略对“优先级”的理解则不同,对于CFS调度策略来说,优先级只是代表进程在时间这个资源池中占据的“比重”或者“份数”,而并不是一种绝对的优先权。所以即便就绪队列中存在优先级很低的线程,但是仍然能够有一定的“比重”获取CPU。
所以,对于优先级队列来说,它的调度策略为:
对于CFS调度算法来说,它的调度策略是选择距离执行到“预期比重”进度最慢的任务。需要有一种指标来衡量这种进度,同样的执行时间,进度增量和权重成反比,权重越大,进度增量应该越慢。
假设有两个进程的权重为W1,W2, W1 < W2, 选择W1作为其他任务的对照基准:
以两个权重分别为2和5的线程举例,按照CFS调度,其调度过程中进度变化如下表所示,进度最终得到相同。
所以,CFS调度算法保证的是,在持续调度的过程中,所有任务的执行进度和参照任务保持一致,其中参照任务是任意指定的,一般选择被大多数线程使用的任务权重作为参考基准,这样在计算过程中,比例引子为1,时间增量和进度增量相同,便于计算。
在LINUX内核调度器中,有一个专有名字表示进度指标,叫做虚拟时间。观察上表可以看到对于高优先级进程,由于其权重较大,因此它的虚拟时间是按照比例缩小的,也就是说,基准权重的时间和正常时间流逝相同,权重大于基准始终的任务,其时间流逝会变慢,CFS调度器会有限选择虚拟时间最慢的线程进行调度,所以高权重的任务才有更多的运行机会。对于权重低于基准权重的低优先级任务来说,前时间流逝会比正常的时间更快,这样,它只要执行较少的时间就可以满足对进度的要求,这样,在一个完整的调度周期内(队列中所有进程都得到一次调度的时间,比如上表中的W1+W2),每个任务都有执行的机会,进入尽量在调度周期结束后对齐虚拟时间。
之所以说是尽量,是因为内核中的调度点并不一定恰好在被权重整除的整数单位上,在一个DELTA TIME中,总会以有进程比预期多执行一会儿,获得了较多的虚拟时间增量,而其它进程少执行了一些。这就需要在下一个调度周期内进行补偿和奖励那些少运行的线程。因为调度周期是有限的,而CFS调度能够保证每个调度周期内每个任务都有机会执行,误差只会在不对齐的调度时刻出现,所以任务的虚拟时间总体上不会相差太大,CFS调度是一个不断追求公平的动态的过程,它实现了程序上的公平,并通过动态纠偏保证了结果的公平。
下面是参考内核CFS调度算法实现的算法模型仿真程序,分别使用红黑数和链表两种方式管理任务队列,支持动态添加任务:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <float.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include "rbtree.h"#define CFS_USE_RB_TREE
// A CFS Scheduler CModel.
#define DBG(fmt, ...) do { printf("%s line %d, "fmt, __func__, __LINE__, ##__VA_ARGS__); } while (0)
#define assert(expr) \if (!(expr)) { \printf("Assertion failed! %s,%s,%s,line=%d\n",\#expr,__FILE__,__func__,__LINE__); \while(1); \}static pthread_mutex_t cfs_mutex;
double min_vruntime = 0.0f;
void update_min_vruntime(double vruntime)
{min_vruntime = vruntime;
}
/** Nice levels are multiplicative, with a gentle 10% change for every* nice level changed. I.e. when a CPU-bound task goes from nice 0 to* nice 1, it will get ~10% less CPU time than another CPU-bound task* that remained on nice 0.** The "10% effect" is relative and cumulative: from _any_ nice level,* if you go up 1 level, it's -10% CPU usage, if you go down 1 level* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.* If a task goes up by ~10% and another task goes down by ~10% then* the relative distance between them is ~25%.)*/
const double sched_prio_to_weight[40] = {
#if 1/* -20 */ 88761, 71755, 56483, 46273, 36291,/* -15 */ 29154, 23254, 18705, 14949, 11916,/* -10 */ 9548, 7620, 6100, 4904, 3906,/* -5 */ 3121, 2501, 1991, 1586, 1277,/* 0 */ 1024, 820, 655, 526, 423,/* 5 */ 335, 272, 215, 172, 137,/* 10 */ 110, 87, 70, 56, 45,/* 15 */ 36, 29, 23, 18, 15,
#else/* -20 */ 10, 10, 10, 10, 10,/* -15 */ 10, 10, 10, 10, 10,/* -10 */ 10, 10, 10, 10, 10,/* -5 */ 10, 10, 10, 10, 10,/* 0 */ 10, 10, 10, 10, 10,/* 5 */ 10, 10, 10, 10, 10,/* 10 */ 10, 10, 10, 10, 10,/* 15 */ 10, 10, 10, 10, 10,
#endif
};typedef struct sched_entity {struct rb_node node;int prio;int pid;double weight;double vruntime;double realtime;int ctx_switch;int on_rq;
} sched_entity_t;struct rb_root cfs_root_tree = RB_ROOT;
int sched_period(void)
{return rand() % 10 + 1;
}#define MAX_ENTRY 80
//#define SCHED_PERIOD 100
#define SCHED_PERIOD sched_period()
static sched_entity_t *tasks;// https://zhuanlan.zhihu.com/p/673572911
// in linux kernel, cfs_rq->load used for statistic the total
// weight in rq, refer update_load_add / update_load_sub.
double caculate_total_weight(void)
{double total = 0.0;
#if 1struct rb_node *node;sched_entity_t *task;for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {task = rb_entry(node, sched_entity_t, node);total += task->weight;}
#elseint i;for (i = 0; i < MAX_ENTRY; i ++) {total += tasks[i].weight;}
#endifreturn total;
}double caculate_realtime(int task)
{double real = 0.0;real = SCHED_PERIOD * tasks[task].weight / caculate_total_weight();//printf("%s line %d, real %f.\n", __func__, __LINE__, real);return real;
}double caculate_vruntime(sched_entity_t *task, int deltatime)
{double vruntime = 0.0;vruntime = deltatime * sched_prio_to_weight[20] / task->weight ;//printf("%s line %d, vruntime %f.\n", __func__, __LINE__, vruntime);return vruntime;
}#define MAX_TICKS_TEST 100000000UL
double vruntime_total(unsigned long worldtime)
{return worldtime * sched_prio_to_weight[20] / caculate_total_weight();
}double realtime_total(sched_entity_t *task, unsigned long worldtime)
{
#if 1return worldtime * task->weight / caculate_total_weight();
#elsereturn vruntime_total(worldtime) * task->weight / sched_prio_to_weight[20];
#endif
}static int compare_prio(sched_entity_t *task1, sched_entity_t *task2)
{
#if 0if (task1->vruntime < task2->vruntime) {// task1 prior than task2.return -1;} else if (task1->vruntime > task2->vruntime) {// task2 prior than task1.return 1;} else {if (task1->weight > task2->weight) {return -1;} else {// task2 prior than task1.return 1;}}
#elsedouble res = (task1->vruntime == task2->vruntime) ? task1->weight - task2->weight : task2->vruntime - task1->vruntime;return res > 0.0f ? -1 : 1;
#endif
}static int cfs_rq_delete(struct rb_root *root, sched_entity_t *task)
{rb_erase(&task->node, root);return !RB_EMPTY_ROOT(root);
}static int cfs_rq_insert(struct rb_root *root, sched_entity_t *task)
{int ret;struct rb_node **tmp = &(root->rb_node), *parent = NULL;/* Figure out where to put new node */while (*tmp) {sched_entity_t *this = rb_entry(*tmp, sched_entity_t, node);parent = *tmp;ret = compare_prio(task, this);if (ret < 0)tmp = &((*tmp)->rb_left);else if (ret > 0)tmp = &((*tmp)->rb_right);elsereturn -1;}/* Add new node and rebalance tree. */rb_link_node(&task->node, parent, tmp);rb_insert_color(&task->node, root);return 0;
}static void cfs_rq_destroy(struct rb_root *root)
{struct rb_node *node, *next;sched_entity_t *task;node = rb_first(root);while (node) {next = rb_next(node);task = rb_entry(node, sched_entity_t, node);rb_erase(node, root);node = next;}if (!RB_EMPTY_ROOT(root)) {printf("%s line %d, rb is not empty.\n", __func__, __LINE__);}return;
}void print_rbtree(struct rb_root *tree)
{struct rb_node *node;sched_entity_t *task;for (node = rb_first(tree); node; node = rb_next(node)) {task = rb_entry(node, sched_entity_t, node);printf("%s line %d, task(%d) prio %d,weight %f vruntime %f, on rq %d.\n",__func__, __LINE__, task->pid, task->prio, task->weight, task->vruntime, task->on_rq);}return;
}void init_cfs_rbtree(void)
{int i;for (i = 0; i < MAX_ENTRY; i ++) {tasks[i].on_rq = 1;cfs_rq_insert(&cfs_root_tree, &tasks[i]);}print_rbtree(&cfs_root_tree);return;
}#ifdef CFS_USE_RB_TREE
// O(logn) scheduler base on rbtree.
sched_entity_t *schedule(void)
{struct rb_node *node;sched_entity_t *task;node = rb_first(&cfs_root_tree);task = rb_entry(node, sched_entity_t, node);return task;
}#else
// A O(n) linear scheuler impl.
sched_entity_t *schedule(void)
{int i;int taskid = -1;double minruntime = DBL_MAX;// schedule policy:// 1.first find the task with the minum vruntime.// 2.if multiple task the the same minum vruntime, then// select the weighter one.for (i = 0; i < MAX_ENTRY; i ++) {if (minruntime > tasks[i].vruntime) {minruntime = tasks[i].vruntime;taskid = i;} else if (minruntime == tasks[i].vruntime) {if (tasks[i].weight > tasks[taskid].weight) {taskid = i;}}}return &tasks[taskid];
}
#endifdouble list_task_info(unsigned long worldtime)
{double total = 0.0;printf("==================================================================================================================================================================\n");
#if 1struct rb_node *node;sched_entity_t *task;for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {task = rb_entry(node, sched_entity_t, node);total += task->realtime;printf("task(pid%d) vuntime %f, realtime %f, prio %d, weight %f, switches %d, ideal real time %f.\n",task->pid, task->vruntime, task->realtime, task->prio, task->weight, task->ctx_switch,realtime_total(task, worldtime));}
#elseint i;for (i = 0; i < MAX_ENTRY; i ++) {double ratio = 0.0;if (i > 0) {ratio = (tasks[i - 1].realtime - tasks[i].realtime) / tasks[i].realtime;}total += tasks[i].realtime;printf("task %d(pid%d) vuntime %f, realtime %f, prio %d, weight %f, incretio %f, switches %d, ideal real time %f.\n",i, tasks[i].pid, tasks[i].vruntime, tasks[i].realtime, tasks[i].prio, tasks[i].weight, ratio * 100, tasks[i].ctx_switch,realtime_total(&tasks[i], worldtime));}
#endifdouble staticis(void);printf("fangcha %f.\n", staticis());printf("==================================================================================================================================================================\n");return total;
}static void *fork_thread(void *arg)
{sched_entity_t *task;unsigned long *pworldtime = (unsigned long *)arg;static unsigned int pid = MAX_ENTRY;int i;while (1) {task = malloc(sizeof(sched_entity_t));memset(task, 0x00, sizeof(sched_entity_t));i = rand() % 40;pthread_mutex_lock(&cfs_mutex);task->prio = -20 + i;task->weight = sched_prio_to_weight[i];//task->vruntime = vruntime_total(*pworldtime) +1.5f;task->vruntime = min_vruntime + 1.5f;task->realtime = 0;task->ctx_switch = 0;task->pid = pid ++;task->on_rq = 0;cfs_rq_insert(&cfs_root_tree, task);task->on_rq = 1;pthread_mutex_unlock(&cfs_mutex);sleep(1);}return NULL;
}static void *exit_thread(void *arg)
{unsigned long *pwordtime = (unsigned long *)arg;while (1) {sleep(1);}return NULL;
}double staticis(void)
{double statics = 0.0f;double average = 0.0f;double total = 0.0f;int count = 0;struct rb_node *node;sched_entity_t *task;for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {task = rb_entry(node, sched_entity_t, node);total += task->vruntime;count ++;}average = total / count;for (node = rb_first(&cfs_root_tree); node; node = rb_next(node)) {task = rb_entry(node, sched_entity_t, node);statics += (average - task->vruntime) * (average - task->vruntime);}return sqrt(statics / count);
}
int main(void)
{int i;unsigned long ticks;double total = 0.0;unsigned long worldtime = 0;pthread_t t1, t2;pthread_mutex_init(&cfs_mutex, NULL);tasks = malloc(sizeof(sched_entity_t) * MAX_ENTRY);memset(tasks, 0x00, sizeof(sched_entity_t) * MAX_ENTRY);if (!tasks) {printf("%s line %d, fatal errro, alloc failure.\n",__func__, __LINE__);return -1;}srand((unsigned int)time(0));for (i = 0; i < MAX_ENTRY; i ++) {tasks[i].prio = -20 + i % 40;tasks[i].weight = sched_prio_to_weight[i % 40];tasks[i].vruntime = 0;tasks[i].realtime = 0;tasks[i].ctx_switch = 0;tasks[i].pid = i;tasks[i].on_rq = 0;}#ifdef CFS_USE_RB_TREEinit_cfs_rbtree();
#endif// should be first.printf("%s line %d, first schedule select %ld.\n", __func__, __LINE__, schedule() - tasks);pthread_create(&t1, NULL, fork_thread, &worldtime);pthread_create(&t2, NULL, exit_thread, &worldtime);for (ticks = 0; /* ticks < MAX_TICKS_TEST */ 1; ticks ++) {double deltatime, vruntime;sched_entity_t *task;pthread_mutex_lock(&cfs_mutex);task = schedule();#ifdef CFS_USE_RB_TREEcfs_rq_delete(&cfs_root_tree, task);task->on_rq = 0;#endifdeltatime = SCHED_PERIOD;vruntime = caculate_vruntime(task, deltatime);task->vruntime += vruntime;task->realtime += deltatime;task->ctx_switch ++;worldtime += deltatime;update_min_vruntime(task->vruntime);#ifdef CFS_USE_RB_TREE// USE dequeue and enqueue to trigger the reorder of cfs rbtree ready queue.// this also the same in linux kernel, refer put_prev_entity(enqueue) and set_next_task(dequeue)// the only differenct is in linux kenrel the on_rq still keep no matter put_prev_entity/set_next_entity// involation.cfs_rq_insert(&cfs_root_tree, task);task->on_rq = 1;
#endifif (ticks % 1000 == 0) {list_task_info(worldtime);}pthread_mutex_unlock(&cfs_mutex);}pthread_mutex_lock(&cfs_mutex);total = list_task_info(worldtime);assert(total == worldtime);printf("vruntime %f.\n", vruntime_total(worldtime));pthread_mutex_unlock(&cfs_mutex);pthread_join(t1, NULL);pthread_join(t2, NULL);#ifdef CFS_USE_RB_TREEprint_rbtree(&cfs_root_tree);cfs_rq_destroy(&cfs_root_tree);
#endiffree(tasks);return 0;
}
仿真结果,按照CFS算法调度大约1000个进程,VRUNTIME的方差始终控制在70左右,下面截图中显示所有进程的VRUNTIME集中在[308177.014345,308736.000000],对照方差,VRUNTIME表示的所有任务的执行进度是非常集中的,说明CFS的调度策略确实照顾到了所有优先级的进程,使大家的执行进度基本保持在一个动态的一致范围之内。
参考文章
(五)Linux进程调度-CFS调度器
吐血整理 | 肝翻 Linux 进程调度所有知识点