Understanding Android in Depth, Part 3: Process Memory Management and the LMKD Mechanism

最编程 2024-07-31 19:45:42

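Android Process Management 3: Memory Reclamation with LMKD

Android regulates memory on three levels. When device memory runs short, LMKD starts killing processes; processes that cannot be killed are notified through the TrimMemory callback so they can release memory themselves; in the most extreme case a process simply dies with an OOM. This article focuses on LMKD.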

1. LMKD and Lowmemorykiller

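Starting with Android 9, the system abandoned the traditional in-kernel Lowmemorykiller and switched to LMKD (Low Memory Killer Daemon) for low-memory kills.

Starting with Android 10, lmkd's memory monitoring mode changed from vmpressure to PSI.

1.1 How do LMKD and Lowmemorykiller differ?

  • Lowmemorykiller runs inside the Linux kernel, whereas LMKD runs as a standalone userspace daemon, which makes it more extensible and flexible.
  • Lowmemorykiller relies on oom_score_adj; LMKD monitors and considers more dimensions.
  • Lowmemorykiller kills processes more slowly than LMKD, which is also visible in the code logic.

Before Android 9, the decision relied mainly on the following files: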

# /sys/module/lowmemorykiller/parameters/minfree
18432,23040,27648,32256,55296,80640
# /sys/module/lowmemorykiller/parameters/adj
0,100,200,300,900,906
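
The two lists pair up position by position: each minfree entry (in pages) is the threshold below which processes at or above the matching adj become eligible. A minimal sketch of that pairing, assuming a 4 KiB page size; `LmkLevel`, `split_csv` and `pair_levels` are hypothetical names used only for illustration, not lmkd functions:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One (threshold, adj) pair as read from the two parameter files.
struct LmkLevel {
    long minfree_kb;  // threshold converted from pages to KiB
    int adj;          // lowest oom_score_adj eligible at this threshold
};

// Hypothetical helper: split "a,b,c" into integers.
static std::vector<long> split_csv(const std::string& s) {
    std::vector<long> out;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, ',')) {
        out.push_back(std::stol(item));
    }
    return out;
}

// Pair the i-th minfree entry with the i-th adj entry.
// page_kb = 4 is an assumption (4 KiB pages).
std::vector<LmkLevel> pair_levels(const std::string& minfree_csv,
                                  const std::string& adj_csv, long page_kb = 4) {
    std::vector<long> pages = split_csv(minfree_csv);
    std::vector<long> adjs = split_csv(adj_csv);
    std::vector<LmkLevel> levels;
    for (std::size_t i = 0; i < pages.size() && i < adjs.size(); i++) {
        levels.push_back({pages[i] * page_kb, static_cast<int>(adjs[i])});
    }
    return levels;
}
```

With the values above, the first level works out to 18432 pages × 4 KiB = 73728 KiB (72 MiB) for adj 0.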

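1.2 Why does PSI replace vmpressure?

vmpressure signals (generated by the kernel for memory pressure detection and used by lmkd) often contain many false positives, so lmkd has to filter them to determine whether memory is really under pressure. This causes unnecessary lmkd wakeups and extra computation. PSI monitors give more accurate memory pressure detection and minimize the filtering overhead.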
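
For reference, the kernel exposes PSI as text lines of the form `some avg10=0.00 avg60=0.00 avg300=0.00 total=0` in /proc/pressure/memory. The sketch below parses one such line; `PsiLine` and `parse_psi_line` are hypothetical names, and this is an illustration rather than how lmkd itself consumes PSI (lmkd registers triggers instead of reading the averages):

```cpp
#include <cstdio>
#include <string>

struct PsiLine {
    std::string type;            // "some" or "full"
    float avg10, avg60, avg300;  // stall percentages over 10s / 60s / 300s
    unsigned long long total_us; // cumulative stall time in microseconds
};

// Parses one line of /proc/pressure/<resource>. Returns false on malformed input.
bool parse_psi_line(const std::string& line, PsiLine* out) {
    char type[16];
    float a10, a60, a300;
    unsigned long long total;
    int n = std::sscanf(line.c_str(), "%15s avg10=%f avg60=%f avg300=%f total=%llu",
                        type, &a10, &a60, &a300, &total);
    if (n != 5) {
        return false;
    }
    *out = {type, a10, a60, a300, total};
    return true;
}
```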

Kernel build configuration required for LMKD:

CONFIG_ANDROID_LOW_MEMORY_KILLER=n
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y

Kernel build configuration required for PSI:

CONFIG_PSI=y

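So on Android 10 and later, once system initialization has started lmkd, it first checks use_inkernel_interface (false on recent versions), then checks whether PSI is supported and falls back to vmpressure if it is not. It then picks a strategy depending on whether the device is a low-memory device and whether use_minfree_levels is set.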
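
The decision order described above can be sketched as a small function. `Monitor` and `choose_monitor` are hypothetical names for illustration; the sketch ignores the ro.lmk.use_new_strategy override that the real init_psi_monitors() also consults:

```cpp
// The four outcomes of the start-up decision.
enum class Monitor {
    InKernelLmk,     // legacy in-kernel driver
    PsiNewStrategy,  // PSI events handled by mp_event_psi
    PsiMinfree,      // PSI events handled by mp_event_common
    Vmpressure       // fallback when PSI is unavailable
};

Monitor choose_monitor(bool use_inkernel_interface, bool psi_available,
                       bool low_ram_device, bool use_minfree_levels) {
    if (use_inkernel_interface) {
        return Monitor::InKernelLmk;
    }
    if (!psi_available) {
        return Monitor::Vmpressure;
    }
    // Low-memory devices, or devices not using minfree levels, take the new strategy.
    return (low_ram_device || !use_minfree_levels) ? Monitor::PsiNewStrategy
                                                   : Monitor::PsiMinfree;
}
```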

2. LMKD code flow

2.1 lmkd startup

lmkd.rc

service lmkd /system/bin/lmkd
    class core   // core service, started by class_start core in the "on boot" section of init.rc
    user lmkd
    group lmkd system readproc
    capabilities DAC_OVERRIDE KILL IPC_LOCK SYS_NICE SYS_RESOURCE
    critical // if it crashes 4 times within 4 minutes, the device reboots into the bootloader
    socket lmkd seqpacket 0660 system system     // create the control socket
    writepid /dev/cpuset/system-background/tasks  // place the process in this cpuset

For the exact handling of critical, see system/core/init/service.cpp.

2.2 The LMKD main function

int main(int argc, char **argv) {
   
    update_props(); // refresh the lmk-related property values

    ctx = create_android_logger(KILLINFO_LOG_TAG); // event log

    if (!init()) {
        if (!use_inkernel_interface) { // normally false
            /*
             * MCL_ONFAULT pins pages as they fault instead of loading
             * everything immediately all at once. (Which would be bad,
             * because as of this writing, we have a lot of mapped pages we
             * never use.) Old kernels will see MCL_ONFAULT and fail with
             * EINVAL; we ignore this failure.
             *
             * N.B. read the man page for mlockall. MCL_CURRENT | MCL_ONFAULT
             * pins ⊆ MCL_CURRENT, converging to just MCL_CURRENT as we fault
             * in pages.
             */
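            // Lock this real-time process's entire address space into physical memory. This stops Linux
            // from paging it out to swap space, even if the process has not touched it for a while.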
            /* CAP_IPC_LOCK required */
            if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) && (errno != EINVAL)) {
                ALOGW("mlockall failed %s", strerror(errno));
            }

            /* CAP_NICE required */
            struct sched_param param = {
                    .sched_priority = 1,
            };
            if (sched_setscheduler(0, SCHED_FIFO, &param)) { // real-time scheduling
                ALOGW("set SCHED_FIFO failed %s", strerror(errno));
            }
        }
        // loop handling events, including incoming socket traffic
        mainloop();
    }

    android_log_destroy(&ctx);
    ALOGI("exiting");
    return 0;
}

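The properties it reads are listed below:

| Property | Use | Default |
| --- | --- | --- |
| ro.config.low_ram | Specifies whether the device is a low-RAM or a high-performance device. | false |
| ro.lmk.use_psi | Use PSI monitors (instead of vmpressure events). | true |
| ro.lmk.use_minfree_levels | Use free memory and file cache thresholds for kill decisions (that is, match the behavior of the in-kernel LMK driver). | false |
| ro.lmk.low | Minimum oom_adj score of processes eligible to be killed at the low vmpressure level. | 1001 (disabled) |
| ro.lmk.medium | Minimum oom_adj score of processes eligible to be killed at the medium vmpressure level. | 800 (cached or nonessential services) |
| ro.lmk.critical | Minimum oom_adj score of processes eligible to be killed at the critical vmpressure level. | 0 (any process) |
| ro.lmk.critical_upgrade | Enable upgrade to the critical level. | false |
| ro.lmk.upgrade_pressure | Maximum mem_pressure at which the level is upgraded because the system is swapping too much. | 100 (disabled) |
| ro.lmk.downgrade_pressure | Minimum mem_pressure at which a vmpressure event is ignored because enough free memory is still available. | 100 (disabled) |
| ro.lmk.kill_heaviest_task | Kill the heaviest eligible task (best decision) versus any eligible task (fast decision). | true |
| ro.lmk.kill_timeout_ms | Duration in milliseconds after a kill during which no additional kill is done. | 0 (disabled) |
| ro.lmk.debug | Enable lmkd debug logs. | false |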

2.3 LMKD init()

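lmkd probes the in-kernel lmk module path (/sys/module/lowmemorykiller/parameters/minfree) to determine whether the system contains the kernel lmk module. If the module exists and enable_userspace_lmk is configured as false, the in-kernel lmk is used directly; otherwise the userspace lmkd is used.

init_monitors() decides whether memory is monitored through PSI or vmpressure. On the devices I checked, everything running Android 10 or later uses PSI.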

static int init(void) {
    static struct event_handler_info kernel_poll_hinfo = { 0, kernel_event_handler };
    struct reread_data file_data = { // /proc/zoneinfo
        .filename = ZONEINFO_PATH,
        .fd = -1,
    };

    epollfd = epoll_create(MAX_EPOLL_EVENTS); // create the global epoll file descriptor
/*
   MAX_EPOLL_EVENTS
 * 1 ctrl listen socket, 3 ctrl data socket, 3 memory pressure levels,
 * 1 lmk events + 1 fd to wait for process death
 */

    ctrl_sock.sock = android_get_control_socket("lmkd"); // open the lmkd socket file descriptor
    ···
    ret = listen(ctrl_sock.sock, MAX_DATA_CONN);

    ···
    epev.events = EPOLLIN;
    // called back when a client connects to the lmkd socket; logs "lmkd data connection established"
    ctrl_sock.handler_info.handler = ctrl_connect_handler;
    epev.data.ptr = (void *)&(ctrl_sock.handler_info);
  
    has_inkernel_module = !access(INKERNEL_MINFREE_PATH, W_OK);//"/sys/module/lowmemorykiller/parameters/minfree"
    use_inkernel_interface = has_inkernel_module;

    if (use_inkernel_interface) {   // false: recent versions no longer take this path
        ALOGI("Using in-kernel low memory killer interface");
        if (init_poll_kernel()) {
           ···   

        }
    } else {
        if (!init_monitors()) {  // initmonitor
            return -1;
        }
        /* let the others know it does support reporting kills */
        property_set("sys.lmk.reportkills", "1");
    }
return 0;

}


init_monitors

The ro.lmk.use_psi property decides whether PSI is requested; init_psi_monitors() then reports whether the monitors in psi.c were initialized successfully.

static bool init_monitors() {
    /* Try to use psi monitor first if kernel has it */
    use_psi_monitors = property_get_bool("ro.lmk.use_psi", true) &&
        init_psi_monitors();
    /* Fall back to vmpressure */
    if (!use_psi_monitors &&
        (!init_mp_common(VMPRESS_LEVEL_LOW) ||
        !init_mp_common(VMPRESS_LEVEL_MEDIUM) ||
        !init_mp_common(VMPRESS_LEVEL_CRITICAL))) {
        ALOGE("Kernel does not support memory pressure events or in-kernel low memory killer");
        return false;
    }
    if (use_psi_monitors) {
        ALOGI("Using psi monitors for memory pressure detection");
    } else {
        ALOGI("Using vmpressure for memory pressure detection");
    }
    return true;
}

init_psi_monitors

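The following two settings select between different kill strategies:

  • ro.config.low_ram marks the device as a low-memory device.
  • ro.lmk.use_minfree_levels uses the same kill strategy as the in-kernel LMK driver, that is, decisions based on free memory and file cache thresholds.
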
static bool init_psi_monitors() {
    /*
     * When PSI is used on low-ram devices or on high-end devices without memfree levels
     * use new kill strategy based on zone watermarks, free swap and thrashing stats
     */
    // low_ram_device is ro.config.low_ram; use_minfree_levels is ro.lmk.use_minfree_levels
    // use_new_strategy defaults to true on low-memory devices or when the legacy minfree mode is not used
    bool use_new_strategy =
        property_get_bool("ro.lmk.use_new_strategy", low_ram_device || !use_minfree_levels);

    /* In default PSI mode override stall amounts using system properties */
    if (use_new_strategy) {
        /* Do not use low pressure level */
        psi_thresholds[VMPRESS_LEVEL_LOW].threshold_ms = 0;
        // ro.lmk.psi_partial_stall_ms: 200ms on low-memory devices, otherwise 70ms
        psi_thresholds[VMPRESS_LEVEL_MEDIUM].threshold_ms = psi_partial_stall_ms;
        // ro.lmk.psi_complete_stall_ms: 700ms
        psi_thresholds[VMPRESS_LEVEL_CRITICAL].threshold_ms = psi_complete_stall_ms;
    }
    // init_mp_psi is the key part, analyzed next
    if (!init_mp_psi(VMPRESS_LEVEL_LOW, use_new_strategy)) {
        return false;
    }
    if (!init_mp_psi(VMPRESS_LEVEL_MEDIUM, use_new_strategy)) {
        destroy_mp_psi(VMPRESS_LEVEL_LOW);
        return false;
    }
    if (!init_mp_psi(VMPRESS_LEVEL_CRITICAL, use_new_strategy)) {
        destroy_mp_psi(VMPRESS_LEVEL_MEDIUM);
        destroy_mp_psi(VMPRESS_LEVEL_LOW);
        return false;
    }
    return true;
}

/* memory pressure levels */
enum vmpressure_level {
    VMPRESS_LEVEL_LOW = 0,
    VMPRESS_LEVEL_MEDIUM,
    VMPRESS_LEVEL_CRITICAL,
    VMPRESS_LEVEL_COUNT
};

static struct psi_threshold psi_thresholds[VMPRESS_LEVEL_COUNT] = {
    { PSI_SOME, 70 },    /* 70ms out of 1sec for partial stall */
    { PSI_SOME, 100 },   /* 100ms out of 1sec for partial stall */
    { PSI_FULL, 70 },    /* 70ms out of 1sec for complete stall */
};

init_mp_psi

  • The new strategy is skipped only when the device is not a low-memory device and minfree levels are in use.

static bool init_mp_psi(enum vmpressure_level level, bool use_new_strategy) {
    int fd;

    /* Do not register a handler if threshold_ms is not set */
    if (!psi_thresholds[level].threshold_ms) {
        return true;
    }
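    // Writes stall_type, threshold_ms and PSI_WINDOW_SIZE_MS to the /proc/pressure/memory node
    // (implemented in psi.cpp). The window size is 1000ms; a PSI monitor generates at most one
    // event per window, so lmkd polls the memory state for the duration of the PSI window.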
    fd = init_psi_monitor(psi_thresholds[level].stall_type,                                                       
        psi_thresholds[level].threshold_ms * US_PER_MS,
        PSI_WINDOW_SIZE_MS * US_PER_MS);
    ···
    vmpressure_hinfo[level].handler = use_new_strategy ? mp_event_psi : mp_event_common; // pick the handler based on use_new_strategy
    vmpressure_hinfo[level].data = level;
    if (register_psi_monitor(epollfd, fd, &vmpressure_hinfo[level]) < 0) { // implemented in psi.cpp
        destroy_psi_monitor(fd);
        return false;
    }
		···
    return true;
}
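
What init_psi_monitor ultimately writes is a PSI trigger: a text string of the form `<some|full> <stall threshold in µs> <window in µs>`, which is the trigger format defined by the kernel's PSI interface. A minimal sketch of building that string from the millisecond values used above; `psi_trigger` is a hypothetical helper, not the function in psi.cpp:

```cpp
#include <string>

// Builds the trigger string written to /proc/pressure/memory.
// threshold_ms and window_ms are converted to microseconds, as the kernel expects.
std::string psi_trigger(bool full_stall, int threshold_ms, int window_ms) {
    const int us_per_ms = 1000;
    return std::string(full_stall ? "full" : "some") + " " +
           std::to_string(threshold_ms * us_per_ms) + " " +
           std::to_string(window_ms * us_per_ms);
}
```

For the default medium level this yields `some 70000 1000000`, meaning "notify when tasks were partially stalled on memory for 70ms within any 1s window", or 7% of the window.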

Calling into psi

PSI tracks three resources under /proc/pressure: io, memory and cpu.

blog.****.net/zhzhangnews…

init_psi_monitor

register_psi_monitor

2.4 LMKD receives socket messages from SystemServer

lmkd's client is ActivityManager, which talks to lmkd over a socket (/dev/socket/lmkd). When a client connects, the callbacks run in the order ctrl_connect_handler > ctrl_data_handler > ctrl_command_handler.

Here we go straight to ctrl_command_handler.

ctrl_command_handler

static void ctrl_command_handler(int dsock_idx) {
    
 ......
    switch(cmd) {
    
        case LMK_TARGET:
            // Parse the data carried in the socket packet and store it in the lowmem_minfree
            // and lowmem_adj arrays, which control the low-memory behavior.
            // Also sets sys.lmk.minfree_levels, for example:
            // [sys.lmk.minfree_levels]: [18432:0,23040:100,27648:200,85000:250,191250:900,241920:950]
            cmd_target(targets, packet);
            break;
        case LMK_PROCPRIO:
            // Set the process's oomadj by writing it to /proc/pid/oom_score_adj,
            // and keep the oomadj in a hash table.
            // pidhash is keyed by pid; proc_slot inserts the struct proc into
            // procadjslot_list, which is keyed by oomadj.
            cmd_procprio(packet);
            break;
        case LMK_PROCREMOVE:
            // Parse the pid sent over the socket, then use pid_remove to drop the
            // matching struct proc from pidhash and procadjslot_list.
            cmd_procremove(packet);
            break;
        case LMK_PROCPURGE:
            cmd_procpurge();
            break;
        case LMK_GETKILLCNT:
            kill_cnt = cmd_getkillcnt(packet);
            break;
........
}
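
| Command | Purpose | Caller |
| --- | --- | --- |
| LMK_TARGET | Set the minfree and oom_adj thresholds | ProcessList::updateOomLevels() |
| LMK_PROCPRIO | Set or update a process's oom_adj | ProcessList::setOomAdj() |
| LMK_PROCREMOVE | Remove a process (currently unused) | ProcessList::remove() |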
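When the kernel detects memory pressure, it reports it through /proc/pressure/memory. Because the configured thresholds are some 70, some 100 and full 70, an event is reported when memory stalls reach 70ms or 100ms within a one-second window. After the event arrives, use_new_strategy determines which handler runs.

3. The different kill strategies of mp_event_psi and mp_event_common

3.1 mp_event_psi flow

mp_event_psi monitors zone watermarks. It is used whenever the device is a low-memory device or the legacy minfree mode is not in use.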

static void mp_event_psi(int data, uint32_t events, struct polling_params *poll_params) {
    
    bool kill_pending = is_kill_pending(); // true while the process behind last_kill_pid_or_fd still exists
    if (kill_pending && (kill_timeout_ms == 0 ||
        get_time_diff_ms(&last_kill_tm, &curr_tm) < static_cast<long>(kill_timeout_ms))) {
        /* Skip while still killing a process */
        wi.skipped_wakeups++;
        goto no_kill;
    }
    /*
     * Process is dead or kill timeout is over, stop waiting. This has no effect if pidfds are
     * supported and death notification already caused waiting to stop.
     */
    stop_wait_for_proc_kill(!kill_pending);

    if (vmstat_parse(&vs) < 0) { // parse /proc/vmstat
        ALOGE("Failed to parse vmstat!");
        return;
    }
    /* Starting 5.9 kernel workingset_refault vmstat field was renamed workingset_refault_file */
    workingset_refault_file = vs.field.workingset_refault ? : vs.field.workingset_refault_file;

    if (meminfo_parse(&mi) < 0) { // parse /proc/meminfo and match its fields to get the available page counts
        ALOGE("Failed to parse meminfo!");
        return;
    }

    /* Reset states after process got killed */
    if (killing) {
        killing = false;
        cycle_after_kill = true;
        /* Reset file-backed pagecache size and refault amounts after a kill */
        base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;
        init_ws_refault = workingset_refault_file;
        thrashing_reset_tm = curr_tm;
        prev_thrash_growth = 0;
    }

    /* Check free swap levels */
    if (swap_free_low_percentage) { // ro.lmk.swap_free_low_percentage, default 10
         if (!swap_low_threshold) {
            swap_low_threshold = mi.field.total_swap * swap_free_low_percentage / 100;
         }
        // swap_is_low becomes true once free swap drops below the percentage set by ro.lmk.swap_free_low_percentage
        swap_is_low = mi.field.free_swap < swap_low_threshold; // from meminfo
    }

    /* Identify reclaim state */
    // Determined by how pgscan_direct / pgscan_kswapd changed since the previous check

    if (vs.field.pgscan_direct > init_pgscan_direct) { // direct reclaim (DIRECT_RECLAIM)
        init_pgscan_direct = vs.field.pgscan_direct;
        init_pgscan_kswapd = vs.field.pgscan_kswapd;
        reclaim = DIRECT_RECLAIM;
    } else if (vs.field.pgscan_kswapd > init_pgscan_kswapd) { // background reclaim by kswapd (KSWAPD_RECLAIM)
        init_pgscan_kswapd = vs.field.pgscan_kswapd;
        reclaim = KSWAPD_RECLAIM;
    } else if (workingset_refault_file == prev_workingset_refault) {
        // neither (NO_RECLAIM): memory pressure is low, so do not kill
        /*
         * Device is not thrashing and not reclaiming, bail out early until we see these stats
         * changing
         */
        goto no_kill;
    }

    prev_workingset_refault = workingset_refault_file;

     /*
     * It's possible we fail to find an eligible process to kill (ex. no process is
     * above oom_adj_min). When this happens, we should retry to find a new process
     * for a kill whenever a new eligible process is available. This is especially
     * important for a slow growing refault case. While retrying, we should keep
     * monitoring new thrashing counter as someone could release the memory to mitigate
     * the thrashing. Thus, when thrashing reset window comes, we decay the prev thrashing
     * counter by window counts. If the counter is still greater than thrashing limit,
     * we preserve the current prev_thrash counter so we will retry kill again. Otherwise,
     * we reset the prev_thrash counter so we will stop retrying.
     */
    // Update thrashing: a high value means memory is under pressure, a low value means memory is idle
    since_thrashing_reset_ms = get_time_diff_ms(&thrashing_reset_tm, &curr_tm);
    if (since_thrashing_reset_ms > THRASHING_RESET_INTERVAL_MS) {
        long windows_passed;
        /* Calculate prev_thrash_growth if we crossed THRASHING_RESET_INTERVAL_MS */
        prev_thrash_growth = (workingset_refault_file - init_ws_refault) * 100
                            / (base_file_lru + 1);
        windows_passed = (since_thrashing_reset_ms / THRASHING_RESET_INTERVAL_MS);
        /*
         * Decay prev_thrashing unless over-the-limit thrashing was registered in the window we
         * just crossed, which means there were no eligible processes to kill. We preserve the
         * counter in that case to ensure a kill if a new eligible process appears.
         */
        if (windows_passed > 1 || prev_thrash_growth < thrashing_limit) {
            prev_thrash_growth >>= windows_passed;
        }

        /* Record file-backed pagecache size when crossing THRASHING_RESET_INTERVAL_MS */
        base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;
        init_ws_refault = workingset_refault_file;
        thrashing_reset_tm = curr_tm;
        thrashing_limit = thrashing_limit_pct;
    } else {
        /* Calculate what % of the file-backed pagecache refaulted so far */
        thrashing = (workingset_refault_file - init_ws_refault) * 100 / (base_file_lru + 1);
    }
    /* Add previous cycle's decayed thrashing amount */
    thrashing += prev_thrash_growth;
    if (max_thrashing < thrashing) {
        max_thrashing = thrashing;
    }

    // refresh the watermarks
    /*
     * Refresh watermarks once per min in case user updated one of the margins.
     * TODO: b/140521024 replace this periodic update with an API for AMS to notify LMKD
     * that zone watermarks were changed by the system software.
     */
    if (watermarks.high_wmark == 0 || get_time_diff_ms(&wmark_update_tm, &curr_tm) > 60000) {
        struct zoneinfo zi;
        // Parse /proc/zoneinfo and match the relevant fields to get the size of the
        // reserved pages: zi->field.totalreserve_pages += zi->field.high;
        // then compute the min/low/high watermarks.
        if (zoneinfo_parse(&zi) < 0) {
            ALOGE("Failed to parse zoneinfo!");
            return;
        }

        calc_zone_watermarks(&zi, &watermarks);
        wmark_update_tm = curr_tm;
    }

    /* Find out which watermark is breached if any */
    wmark = get_lowest_watermark(&mi, &watermarks); // compares zmi->nr_free_pages - zmi->cma_free against the watermarks

    /*
     * TODO: move this logic into a separate function
     * Decide if killing a process is necessary and record the reason
     */
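    // Evaluate several scenarios from the watermark, thrashing, pressure level, swap_is_low and
    // reclaim mode, and record a different kill reason for each
    if (cycle_after_kill && wmark < WMARK_LOW) {
        /*
         * Prevent a kill that does not free enough memory from leading to an OOM kill.
         * This can happen when a process consumes memory faster than reclaim can free it,
         * even after a kill. It mostly happens when running memory stress tests.
         */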
        kill_reason = PRESSURE_AFTER_KILL;
        strncpy(kill_desc, "min watermark is breached even after kill", sizeof(kill_desc));
    } else if (level == VMPRESS_LEVEL_CRITICAL && events != 0) {
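        /*
         * The device is busy reclaiming memory, which can lead to an ANR.
         * The critical level is triggered when PSI complete stall (all tasks blocked
         * on memory congestion) exceeds the configured threshold.
         */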
        kill_reason = NOT_RESPONDING;
        strncpy(kill_desc, "device is not responding", sizeof(kill_desc));
    } else if (swap_is_low && thrashing > thrashing_limit_pct) { // ro.lmk.thrashing_limit: 30 or 100
        /* Page cache is thrashing while swap is low */
        kill_reason = LOW_SWAP_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "device is low on swap (%" PRId64
            "kB < %" PRId64 "kB) and thrashing (%" PRId64 "%%)",
            mi.field.free_swap * page_k, swap_low_threshold * page_k, thrashing);
        /* Do not kill perceptible apps unless below min watermark or heavily thrashing */
        if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {  // WMARK_MIN = 0; thrashing_critical_pct is thrashing_limit_pct * 2
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1; // PERCEPTIBLE_APP_ADJ is 200
        }
        check_filecache = true;
    } else if (swap_is_low && wmark < WMARK_HIGH) {  // swap_is_low comes from the percentage check above
        /* Both free memory and swap are low */
        kill_reason = LOW_MEM_AND_SWAP;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap is low (%"
            PRId64 "kB < %" PRId64 "kB)", wmark < WMARK_LOW ? "min" : "low",
            mi.field.free_swap * page_k, swap_low_threshold * page_k);
        /* Do not kill perceptible apps unless below min watermark or heavily thrashing */
        if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;  // PERCEPTIBLE_APP_ADJ is 200
        }
    } else if (wmark < WMARK_HIGH && swap_util_max < 100 &&
               (swap_util = calc_swap_utilization(&mi)) > swap_util_max) {
        /*
         * Too much anon memory is swapped out but swap is not low.
         * Non-swappable allocations created memory pressure.
         */
        kill_reason = LOW_MEM_AND_SWAP_UTIL;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap utilization"
            " is high (%d%% > %d%%)", wmark < WMARK_LOW ? "min" : "low",
            swap_util, swap_util_max);
    } else if (wmark < WMARK_HIGH && thrashing > thrashing_limit) {
        /* Page cache is thrashing while memory is low */
        kill_reason = LOW_MEM_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and thrashing (%"
            PRId64 "%%)", wmark < WMARK_LOW ? "min" : "low", thrashing);
        cut_thrashing_limit = true;
        /* Do not kill perceptible apps unless thrashing at critical levels */
        if (thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
        check_filecache = true;
    } else if (reclaim == DIRECT_RECLAIM && thrashing > thrashing_limit) {
        /* Page cache is thrashing while in direct reclaim (mostly happens on lowram devices) */
        kill_reason = DIRECT_RECL_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "device is in direct reclaim and thrashing (%"
            PRId64 "%%)", thrashing);
        cut_thrashing_limit = true;
        /* Do not kill perceptible apps unless thrashing at critical levels */
        if (thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
        check_filecache = true;
    } else if (check_filecache) {
        int64_t file_lru_kb = (vs.field.nr_inactive_file + vs.field.nr_active_file) * page_k;

        if (file_lru_kb < filecache_min_kb) {
            /* File cache is too low after thrashing, keep killing background processes */
            kill_reason = LOW_FILECACHE_AFTER_THRASHING;
            snprintf(kill_desc, sizeof(kill_desc),
                "filecache is low (%" PRId64 "kB < %" PRId64 "kB) after thrashing",
                file_lru_kb, filecache_min_kb);
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        } else {
            /* File cache is big enough, stop checking */
            check_filecache = false;
        }
    }

    /* Kill a process if necessary */
    if (kill_reason != NONE) {
        struct kill_info ki = {
            .kill_reason = kill_reason,
            .kill_desc = kill_desc,
            .thrashing = (int)thrashing,
            .max_thrashing = max_thrashing,
        };                 // the kill itself happens in the call below
        int pages_freed = find_and_kill_process(min_score_adj, &ki, &mi, &wi, &curr_tm);
        if (pages_freed > 0) {
            killing = true;
            max_thrashing = 0;
            if (cut_thrashing_limit) {
                /*
                 * Cut thrasing limit by thrashing_limit_decay_pct percentage of the current
                 * thrashing limit until the system stops thrashing.
                 */
                thrashing_limit = (thrashing_limit * (100 - thrashing_limit_decay_pct)) / 100;
            }
        }
    }

no_kill:
    /* Do not poll if kernel supports pidfd waiting */
    if (is_waiting_for_kill()) {
        /* Pause polling if we are waiting for process death notification */
        poll_params->update = POLLING_PAUSE;
        return;
    }

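    /*
     * Start polling after the initial PSI event;
     * extend polling while the device is in direct reclaim or a process is being killed;
     * do not extend it while kswapd reclaims, since that can go on for a long time
     * without causing memory pressure.
     */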
    if (events || killing || reclaim == DIRECT_RECLAIM) {
        poll_params->update = POLLING_START;
    }

    /* Decide the polling interval */
    if (swap_is_low || killing) {
        /* Fast polling during and after a kill or when swap is low */
        poll_params->polling_interval_ms = PSI_POLL_PERIOD_SHORT_MS;  //10ms
    } else {
        /* By default use long intervals */
        poll_params->polling_interval_ms = PSI_POLL_PERIOD_LONG_MS;  //100ms
    }
}
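
The arithmetic in this function is easier to follow in isolation. The sketch below restates three of its calculations as free functions; the function names are hypothetical, while the formulas follow the code above:

```cpp
#include <cstdint>

// Percentage of the file-backed page cache that refaulted since the baseline.
// The +1 avoids a division by zero when the file LRU is empty.
int64_t thrashing_pct(int64_t refault_now, int64_t refault_base, int64_t base_file_lru) {
    return (refault_now - refault_base) * 100 / (base_file_lru + 1);
}

// Halve the previous window's thrashing growth once per elapsed reset window.
int64_t decay_growth(int64_t prev_growth, long windows_passed) {
    return prev_growth >> windows_passed;
}

// Swap counts as low once free swap drops below the given percentage of total swap.
bool swap_is_low(int64_t free_swap, int64_t total_swap, int low_percentage) {
    return free_swap < total_swap * low_percentage / 100;
}
```

For example, 300 refaults against a 1000-page file LRU baseline is 30% thrashing, which would exceed a limit of 30 only after the previous window's decayed growth is added.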

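The main logic of this code is:

  1. Check whether a kill is still in progress; if so, skip this cycle.

  2. Parse /proc/vmstat and /proc/meminfo to get the memory state.

  3. Decide whether a kill is needed based on the watermarks, thrashing, swap usage and so on:

    • If free memory is still below the low watermark right after a kill, kill again.
    • If the device is not responding (PSI complete stall above the threshold), kill to restore responsiveness.
    • If free swap is low and thrashing is high, kill.
    • If free memory is low and free swap is low, kill.
    • If free memory is low and thrashing is high, kill.
    • And so on.
  4. If a kill is needed, call find_and_kill_process to find and kill a process.

  5. Choose the PSI polling interval from the memory state; poll more often when swap is low or a kill is in progress.

  6. If waiting for a killed process to exit, pause polling.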
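
The order of the checks in step 3 matters, because the first matching branch wins. The following simplified sketch keeps that order but leaves out the swap-utilization and file-cache branches; `MemState` and `classify` are hypothetical, and the watermark enum values are assumed from the `WMARK_MIN = 0` comment in the code above:

```cpp
enum Watermark { WMARK_MIN = 0, WMARK_LOW, WMARK_HIGH, WMARK_NONE };

enum class KillReason {
    None,
    PressureAfterKill,
    NotResponding,
    LowSwapAndThrashing,
    LowMemAndSwap,
    LowMemAndThrashing,
    DirectReclAndThrashing
};

struct MemState {
    bool cycle_after_kill;  // first cycle after a successful kill
    bool critical_event;    // the PSI "full" threshold fired
    bool swap_is_low;
    bool direct_reclaim;
    Watermark wmark;        // lowest breached watermark, WMARK_NONE if none
    int thrashing;          // percent
    int thrashing_limit;    // percent
};

// Returns the first matching reason, in the same order as mp_event_psi.
KillReason classify(const MemState& s) {
    if (s.cycle_after_kill && s.wmark < WMARK_LOW) return KillReason::PressureAfterKill;
    if (s.critical_event) return KillReason::NotResponding;
    if (s.swap_is_low && s.thrashing > s.thrashing_limit) return KillReason::LowSwapAndThrashing;
    if (s.swap_is_low && s.wmark < WMARK_HIGH) return KillReason::LowMemAndSwap;
    if (s.wmark < WMARK_HIGH && s.thrashing > s.thrashing_limit) return KillReason::LowMemAndThrashing;
    if (s.direct_reclaim && s.thrashing > s.thrashing_limit) return KillReason::DirectReclAndThrashing;
    return KillReason::None;
}
```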

3.2 mp_event_common flow

Used on devices that are not low-memory devices and have use_minfree_levels enabled.

static void mp_event_common(int data, uint32_t events, struct polling_params *poll_params) {
   

    if (meminfo_parse(&mi) < 0 || zoneinfo_parse(&zi) < 0) { // read meminfo and zoneinfo
        ALOGE("Failed to get free memory!");
        return;
    }

    if (use_minfree_levels) { // always true on the path described here
        int i;
        // other_free is the number of pages available to the system: MemFree - high
        // nr_free_pages is MemFree from /proc/meminfo: free memory that is not used at all
        // totalreserve_pages is max_protection + high from /proc/zoneinfo; max_protection is 0 on Android
        other_free = mi.field.nr_free_pages - zi.totalreserve_pages;

        // nr_file_pages = cached + swap_cached + buffers; other_file is what remains after
        // subtracting the shmem, unevictable and swap_cached pages
        if (mi.field.nr_file_pages > (mi.field.shmem + mi.field.unevictable + mi.field.swap_cached)) {
            // other_file is roughly the page count of cached files, excluding tmpfs and unevictable pages
            other_file = (mi.field.nr_file_pages - mi.field.shmem -
                          mi.field.unevictable - mi.field.swap_cached);
        } else {
            other_file = 0;
        }

        // other_free and other_file are now known
        min_score_adj = OOM_SCORE_ADJ_MAX + 1;   // 1001

        // Walk the oomadj and minfree arrays and pick min_score_adj from lowmem_minfree; processes whose oomadj is below min_score_adj are not killed in this round
        for (i = 0; i < lowmem_targets_size; i++) {
            minfree = lowmem_minfree[i];
            if (other_free < minfree && other_file < minfree) {
                min_score_adj = lowmem_adj[i];
                break;
            }
        }

        if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {  // no level was breached, nothing to do
            if (debug_process_killing) {
                ALOGI("Ignore %s memory pressure event "
                      "(free memory=%ldkB, cache=%ldkB, limit=%ldkB)",
                      level_name[level], other_free * page_k, other_file * page_k,
                      (long)lowmem_minfree[lowmem_targets_size - 1] * page_k);
            }
            return;
        }

        goto do_kill;
    }
    // Without use_minfree_levels, a low-pressure event calls record_low_pressure_levels
    // to record the free memory seen at the low level

    if (level == VMPRESS_LEVEL_LOW) {
        record_low_pressure_levels(&mi); // mainly sets low_pressure_mem.min_nr_free_pages and low_pressure_mem.max_nr_free_pages
    }

    if (level_oomadj[level] > OOM_SCORE_ADJ_MAX) { // levels above 1000 are not monitored
        /* Do not monitor this pressure level */
        return;
    }
    // current memory usage, excluding swap
    if ((mem_usage = get_memory_usage(&mem_usage_file_data)) < 0) { // "/dev/memcg/memory.usage_in_bytes"
        goto do_kill;
    }
    // current memory usage, including swap
    if ((memsw_usage = get_memory_usage(&memsw_usage_file_data)) < 0) { // "/dev/memcg/memory.memsw.usage_in_bytes"
        goto do_kill;
    }

    // Calculate percent for swappinness.
    // Similar to swappiness: the larger the value, the less swap is in use and the more swap space remains
    mem_pressure = (mem_usage * 100) / memsw_usage;

    if (enable_pressure_upgrade && level != VMPRESS_LEVEL_CRITICAL) { // ro.lmk.critical_upgrade
        // We are swapping too much.
        // A small value means swap is heavily used yet memory pressure is still high,
        // so raise the level and kill more aggressively
        if (mem_pressure < upgrade_pressure) {  // ro.lmk.upgrade_pressure: default 100 in code, 35 on my device
            level = upgrade_level(level); // upgrade the vmpressure level
            if (debug_process_killing) {
                ALOGI("Event upgraded to %s", level_name[level]);
            }
        }
    }

    // If we still have enough swap space available, check if we want to
    // ignore/downgrade pressure events.
    // swap_free_low_percentage is the low-swap threshold; swap has not reached it yet, so there is still headroom
    if (mi.field.free_swap >=
        mi.field.total_swap * swap_free_low_percentage / 100) {  // ro.lmk.swap_free_low_percentage: 10 or 15
        // If the pressure is larger than downgrade_pressure lmk will not
        // kill any process, since enough memory is available.
        if (mem_pressure > downgrade_pressure) { // pressure was reported but swap is still sufficient, so do not kill
            if (debug_process_killing) {
                ALOGI("Ignore %s memory pressure", level_name[level]);
            }
            return;
        } else if (level == VMPRESS_LEVEL_CRITICAL && mem_pressure > upgrade_pressure) {
            if (debug_process_killing) {
                ALOGI("Downgrade critical memory pressure");
            } // with enough swap left, the critical level is kept only while mem_pressure is at or below upgrade_pressure
            // Downgrade event, since enough memory available.
            level = downgrade_level(level);
        }
    }

do_kill:
    if (low_ram_device) { // low-memory device
        /* For Go devices kill only one task */
        if (find_and_kill_process(level_oomadj[level], NULL, &mi, &wi, &curr_tm) == 0) {
            if (debug_process_killing) {
                ALOGI("Nothing to kill");
            }
        }
    } else {
        int pages_freed;
        static struct timespec last_report_tm;
        static unsigned long report_skip_count = 0;

        if (!use_minfree_levels) { // recent devices rarely get here: only with the vmpressure strategy and without use_minfree_levels
            /* Free up enough memory to downgrate the memory pressure to low level */
            if (mi.field.nr_free_pages >= low_pressure_mem.max_nr_free_pages) {
                if (debug_process_killing) {
                    ALOGI("Ignoring pressure since more memory is "
                        "available (%" PRId64 ") than watermark (%" PRId64 ")",
                        mi.field.nr_free_pages, low_pressure_mem.max_nr_free_pages);
                }
                return;
            }
            min_score_adj = level_oomadj[level];
        }
        // this is where the process finally gets killed
        pages_freed = find_and_kill_process(min_score_adj, NULL, &mi, &wi, &curr_tm);
					···
 
        /* Log whenever we kill or when report rate limit allows */
        if (use_minfree_levels) {
            ALOGI("Reclaimed %ldkB, cache(%ldkB) and free(%" PRId64 "kB)-reserved(%" PRId64 "kB) "
                "below min(%ldkB) for oom_score_adj %d",
                pages_freed * page_k,
                other_file * page_k, mi.field.nr_free_pages * page_k,
                zi.totalreserve_pages * page_k,
                minfree * page_k, min_score_adj);
        } else {
            ALOGI("Reclaimed %ldkB at oom_score_adj %d", pages_freed * page_k, min_score_adj);
        }

}
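
The mem_pressure value and the upgrade/downgrade rules in the non-minfree path can be restated compactly. The sketch below is a simplified illustration: `Level`, `mem_pressure_pct` and `adjust_level` are hypothetical names, and the upgrade is assumed to move exactly one level up:

```cpp
#include <cstdint>

enum Level { LEVEL_LOW = 0, LEVEL_MEDIUM, LEVEL_CRITICAL };

// Share of memory+swap usage that is resident memory, in percent.
// Assumes memsw_usage is non-zero.
int64_t mem_pressure_pct(int64_t mem_usage, int64_t memsw_usage) {
    return mem_usage * 100 / memsw_usage;
}

// Returns false when the event should be ignored; otherwise writes the
// possibly upgraded or downgraded level to *level and returns true.
bool adjust_level(Level* level, int64_t mem_pressure, int64_t upgrade_pressure,
                  int64_t downgrade_pressure, bool enable_upgrade, bool swap_ok) {
    if (enable_upgrade && *level != LEVEL_CRITICAL && mem_pressure < upgrade_pressure) {
        *level = static_cast<Level>(*level + 1);  // swapping too much: escalate
    }
    if (swap_ok) {
        if (mem_pressure > downgrade_pressure) {
            return false;                          // enough memory: ignore the event
        }
        if (*level == LEVEL_CRITICAL && mem_pressure > upgrade_pressure) {
            *level = LEVEL_MEDIUM;                 // enough memory: de-escalate
        }
    }
    return true;
}
```

A low mem_pressure means most of the usage sits in swap, which is why values below upgrade_pressure escalate the level rather than relax it.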

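Low-memory devices and high-performance devices use different kill strategies:

  • On low-memory devices, the system should generally tolerate higher memory pressure as part of normal operation.
  • On high-performance devices, memory pressure should be treated as an abnormal situation and fixed promptly before it affects overall performance.

The main logic of mp_event_common is:

  1. Parse /proc/meminfo and /proc/zoneinfo to get the memory state.

  2. If use_minfree_levels is configured, compute a suitable min_score_adj from the lowmem_minfree array:

    compare other_free and other_file against each minfree in turn; when both are below it, use the matching oomadj as min_score_adj.

  3. If use_minfree_levels is not configured, compute min_score_adj from the vmpressure level:

    at the low pressure level, record the current memory usage;

    look up the oomadj for the level in the level table;

    if swap space is sufficient, check whether the pressure level should be downgraded.

  4. Use the computed min_score_adj to find a process and kill it.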
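
Step 2 can be sketched as follows, using the minfree levels from the sys.lmk.minfree_levels example earlier. `MemInfo` and the three functions are hypothetical stand-ins for the fields and inline code in mp_event_common:

```cpp
#include <cstddef>
#include <vector>

struct MemInfo {
    long nr_free_pages;
    long nr_file_pages;  // cached + swap_cached + buffers
    long shmem;
    long unevictable;
    long swap_cached;
};

// Free pages beyond what the zones keep in reserve.
long other_free(const MemInfo& mi, long totalreserve_pages) {
    return mi.nr_free_pages - totalreserve_pages;
}

// File pages that can actually be dropped, never negative.
long other_file(const MemInfo& mi) {
    long pinned = mi.shmem + mi.unevictable + mi.swap_cached;
    return mi.nr_file_pages > pinned ? mi.nr_file_pages - pinned : 0;
}

// First level where BOTH values are below minfree decides the adj;
// 1001 (OOM_SCORE_ADJ_MAX + 1) means no level was breached.
int select_min_score_adj(const std::vector<long>& minfree, const std::vector<int>& adj,
                         long free_pages, long file_pages) {
    for (std::size_t i = 0; i < minfree.size() && i < adj.size(); i++) {
        if (free_pages < minfree[i] && file_pages < minfree[i]) {
            return adj[i];
        }
    }
    return 1001;
}
```

Because both conditions must hold, a device with plenty of droppable file cache is not treated as under pressure even when free memory is low.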

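3.3 Differences between mp_event_psi and mp_event_common

  1. mp_event_psi judges the memory state mainly from the zoneinfo watermarks; mp_event_common mainly checks the amount of free memory in meminfo.
  2. mp_event_psi computes thrashing and swap usage; mp_event_common mainly looks at the vmpressure level.
  3. mp_event_psi has periodic polling logic; mp_event_common runs only when an event arrives.
  4. mp_event_psi distinguishes memory pressure scenarios in finer detail; mp_event_common is simpler and more direct.
  5. Both end by calling find_and_kill_process, but mp_event_psi also passes a kill_info carrying the kill reason, while mp_event_common passes NULL.
  6. mp_event_psi adjusts the polling interval dynamically; mp_event_common has no such logic.
  7. mp_event_psi records more debugging statistics.

3.4 find_and_kill_process: killing a process

For adj values at or below 200 (PERCEPTIBLE_APP_ADJ), the heaviest process is always the one chosen.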

static int find_and_kill_process(int min_score_adj, struct kill_info *ki, union meminfo *mi,
                                 struct wakeup_info *wi, struct timespec *tm) {
  
    for (i = OOM_SCORE_ADJ_MAX; i >= min_score_adj; i--) { // walk the adj values from high to low
        struct proc *procp;

        if (!choose_heaviest_task && i <= PERCEPTIBLE_APP_ADJ) { // ro.lmk.kill_heaviest_task defaults to false in the code
            choose_heaviest_task = true; // in other words, for adj <= 200 kill the heaviest process
        }

        while (true) {
            procp = choose_heaviest_task ?    // heaviest process, or the LRU entry while adj is above 200
                proc_get_heaviest(i) : proc_adj_lru(i);

     
            killed_size = kill_one_process(procp, min_score_adj, ki, mi, wi, tm);
            if (killed_size >= 0) {
                if (!lmk_state_change_start) {
                    lmk_state_change_start = true;
                    stats_write_lmk_state_changed(STATE_START);
                }
                break;
            }
        }
    }

    if (lmk_state_change_start) {
        stats_write_lmk_state_changed(STATE_STOP);
    }

    return killed_size;
}
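
The selection order can be illustrated with a flat list of processes. This is a simplified sketch: `Proc` and `pick_victim` are hypothetical, the vector is assumed to be ordered from most to least recently used, and only the choice of victim is modeled, not the kill itself:

```cpp
#include <vector>

struct Proc {
    int pid;
    int adj;
    long rss_pages;
};

// Scans adj from 1000 down to min_score_adj and returns the pid of the first
// victim found, or -1. Above adj 200 the least recently used process (the last
// match in the vector) is taken unless kill_heaviest is set; at or below 200
// the process with the largest RSS is always taken.
int pick_victim(const std::vector<Proc>& procs, int min_score_adj, bool kill_heaviest) {
    for (int adj = 1000; adj >= min_score_adj; adj--) {
        bool heaviest = kill_heaviest || adj <= 200;
        const Proc* chosen = nullptr;
        for (const Proc& p : procs) {
            if (p.adj != adj) {
                continue;
            }
            // LRU mode keeps overwriting so the last match wins;
            // heaviest mode keeps the largest RSS seen so far.
            if (!heaviest || chosen == nullptr || p.rss_pages > chosen->rss_pages) {
                chosen = &p;
            }
        }
        if (chosen != nullptr) {
            return chosen->pid;
        }
    }
    return -1;
}
```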

kill_one_process

The code raises the priority of the process being killed so that it exits as quickly as possible.

/* Kill one process specified by procp.  Returns the size (in pages) of the process killed */
static int kill_one_process(struct proc* procp, int min_oom_score, struct kill_info *ki,
                            union meminfo *mi, struct wakeup_info *wi, struct timespec *tm) {
  

    /* CAP_KILL required */
    if (pidfd < 0) {   // no pidfd available, fall back to kill()
        start_wait_for_proc_kill(pid);
        r = kill(pid, SIGKILL);
    } else {
        start_wait_for_proc_kill(pidfd); // start waiting for the process to die so lmkd learns when it has exited
        r = pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
    }

    set_process_group_and_prio(pid, SP_FOREGROUND, ANDROID_PRIORITY_HIGHEST); // raise to the highest priority so the process dies as fast as possible

    last_kill_tm = *tm;

    inc_killcnt(procp->oomadj);

out:
    /*
     * WARNING: After pid_remove() procp is freed and can't be used!
     * Therefore placed at the end of the function.
     */
    pid_remove(pid);
    return result;
}

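4. Memory metrics

4.1 zoneinfo

| Field | Meaning |
| --- | --- |
| nr_free_pages | Number of free pages in the zone |
| nr_file_pages | Number of file pages in the zone |
| nr_shmem | Memory used by shmem/tmpfs in the zone |
| nr_unevictable | Number of unevictable pages in the zone |
| high | High watermark of the zone |
| protection | Reserved memory of the zone |

lmkd's zoneinfo_field_names holds the fields to parse from zoneinfo, and union zoneinfo stores the parsed values.

The parser uses a small trick: because zoneinfo is a union, it can walk zoneinfo_field_names and the union's array member in lockstep for fast parsing, while callers can still access each value by name through the field struct.

zoneinfo carries one extra computed value, totalreserve_pages, derived from the high watermark and the protection pages (which guard against lending out too many pages): the high watermark plus the largest protection value.

The zoneinfo that lmkd computes is a total; it does not distinguish individual zones.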
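
The table-driven idea can also be shown without a union. The sketch below swaps the union for an enum-indexed array, which avoids reading an inactive union member in C++; the names are hypothetical, the input text is a made-up excerpt rather than real /proc/zoneinfo output, and values are summed across zones as described above:

```cpp
#include <array>
#include <cstdint>
#include <sstream>
#include <string>

enum ZiField {
    ZI_NR_FREE_PAGES,
    ZI_NR_FILE_PAGES,
    ZI_NR_SHMEM,
    ZI_NR_UNEVICTABLE,
    ZI_HIGH,
    ZI_FIELD_COUNT
};

// Field names in the same order as the enum, so index i names totals[i].
static const char* const kZiNames[ZI_FIELD_COUNT] = {
    "nr_free_pages", "nr_file_pages", "nr_shmem", "nr_unevictable", "high"};

// Sums every "name value" line whose name is in the table, across all zones.
std::array<int64_t, ZI_FIELD_COUNT> parse_zoneinfo(const std::string& text) {
    std::array<int64_t, ZI_FIELD_COUNT> totals{};
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string name;
        int64_t value;
        if (!(ls >> name >> value)) {
            continue;  // not a "name value" line
        }
        for (int i = 0; i < ZI_FIELD_COUNT; i++) {
            if (name == kZiNames[i]) {
                totals[i] += value;
                break;
            }
        }
    }
    return totals;
}
```

Callers read a value by name with `totals[ZI_HIGH]`, which plays the role of the named field access in lmkd.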

4.2 meminfo

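| Field | Meaning |
| --- | --- |
| MemFree | Memory the system has not used yet |
| Cached | File page cache, including files in tmpfs (not swapped out) |
| SwapCached | Anonymous pages or shmem/tmpfs pages that were swapped out and have not changed since being swapped back in; once modified they are removed from SwapCached |
| Buffers | Cache pages used by I/O devices, also counted in the file LRU |
| Mapped | File pages currently mapped by user processes |

/proc/meminfo is printed by meminfo_proc_show in [kernel/msm-5.4/fs/proc/meminfo.c]; it mainly calls show_val_kb() to combine a label and a value into one string and prints those strings.

shmem is a special case: it is filesystem-backed so it does not count as anonymous pages, yet it cannot be paged out to a file. It is therefore counted under Cached (page cache) and Mapped (when shmem is attached), but sits on the anon LRU because it may be swapped out.

lmkd's meminfo also carries one extra computed field, nr_file_pages, which is cached + swap_cached + buffers. It can be understood as the file pages that can be dropped.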

4.3 memcg

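| Field | Meaning |
| --- | --- |
| memory.usage_in_bytes | Memory usage of the memcg, excluding swap |
| memory.memsw.usage_in_bytes | Memory usage of the memcg, including swap |

Process information

Process RSS comes from "/proc/pid/statm".

Its values are, in order: virtual address space size, rss, shared pages, text size, library size, data size and dirty pages (all in pages).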
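
Since rss is the second value on the line, extracting it is straightforward. A minimal sketch, with `rss_pages_from_statm` as a hypothetical helper that takes the already-read line as a string:

```cpp
#include <sstream>
#include <string>

// Returns the resident set size in pages from a /proc/<pid>/statm line,
// or -1 if the line does not start with two numbers.
long rss_pages_from_statm(const std::string& statm) {
    std::istringstream in(statm);
    long size_pages;
    long resident_pages;
    if (!(in >> size_pages >> resident_pages)) {
        return -1;
    }
    return resident_pages;
}
```

Multiplying the result by the page size gives the RSS in bytes.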

Process status: "/proc/pid/status"

Process statistics: "/proc/pid/stat"

lmkd mainly cares about field 10 (pgfault), field 12 (pgmajfault), field 22 (process start time) and the rss size (in pages).
