Operating Systems - Memory Management: ZONE and page (semi-original)

Semi-original. I am not sure which kernel version the original article was written against; the version I used for my own study is v5.19.17.

Preface

Much of the content comes from《一步一圖帶你深入理解 Linux 物理內存管理》. This post is semi-original, written as my personal study notes.

In the previous post, 操作系統-內存管理-NUMA1(半原創), we introduced NUMA and saw that a node coordinates multiple zones, and that the level below a zone consists of many pages. Let's continue with the zone itself.

Zone basics

Recap

(figure)

Overall diagram

(figure)

As the diagram shows, at the zone level there is a free_area array; free_area holds the zone's free pages and subdivides them further along two dimensions:

  1. size (the allocation order)
  2. migration type
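Both dimensions are visible directly in the definition of free_area (include/linux/mmzone.h): struct zone keeps one free_area per allocation order (the free_area[MAX_ORDER] array shown below), and each free_area keeps one free list per migration type:

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};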

The zone structure

A look at struct zone

struct zone {
	/* Read-mostly fields */

	/* zone watermarks, access with *_wmark_pages(zone) macros */
	unsigned long _watermark[NR_WMARK];
	unsigned long watermark_boost;

	unsigned long nr_reserved_highatomic;

	/*
	 * We don't know if the memory that we're going to allocate will be
	 * freeable or/and it will be released eventually, so to avoid totally
	 * wasting several GB of ram we must reserve some of the lower zone
	 * memory (otherwise we risk to run OOM on the lower zones despite
	 * there being tons of freeable ram on the higher zones).  This array is
	 * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
	 * changes.
	 */
	long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
	int node;
#endif
	struct pglist_data	*zone_pgdat;
	struct per_cpu_pages	__percpu *per_cpu_pageset;
	struct per_cpu_zonestat	__percpu *per_cpu_zonestats;
	/*
	 * the high and batch values are copied to individual pagesets for
	 * faster access
	 */
	int pageset_high;
	int pageset_batch;

#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long		*pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

	/*
	 * spanned_pages is the total pages spanned by the zone, including
	 * holes, which is calculated as:
	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
	 *
	 * present_pages is physical pages existing within the zone, which
	 * is calculated as:
	 *	present_pages = spanned_pages - absent_pages(pages in holes);
	 *
	 * present_early_pages is present pages existing within the zone
	 * located on memory available since early boot, excluding hotplugged
	 * memory.
	 *
	 * managed_pages is present pages managed by the buddy system, which
	 * is calculated as (reserved_pages includes pages allocated by the
	 * bootmem allocator):
	 *	managed_pages = present_pages - reserved_pages;
	 *
	 * cma pages is present pages that are assigned for CMA use
	 * (MIGRATE_CMA).
	 *
	 * So present_pages may be used by memory hotplug or memory power
	 * management logic to figure out unmanaged pages by checking
	 * (present_pages - managed_pages). And managed_pages should be used
	 * by page allocator and vm scanner to calculate all kinds of watermarks
	 * and thresholds.
	 *
	 * Locking rules:
	 *
	 * zone_start_pfn and spanned_pages are protected by span_seqlock.
	 * It is a seqlock because it has to be read outside of zone->lock,
	 * and it is done in the main allocator path.  But, it is written
	 * quite infrequently.
	 *
	 * The span_seq lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 *
	 * Write access to present_pages at runtime should be protected by
	 * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
	 * present_pages should get_online_mems() to get a stable value.
	 */
	atomic_long_t		managed_pages;
	unsigned long		spanned_pages;
	unsigned long		present_pages;
#if defined(CONFIG_MEMORY_HOTPLUG)
	unsigned long		present_early_pages;
#endif
#ifdef CONFIG_CMA
	unsigned long		cma_pages;
#endif

	const char		*name;

#ifdef CONFIG_MEMORY_ISOLATION
	/*
	 * Number of isolated pageblock. It is used to solve incorrect
	 * freepage counting problem due to racy retrieving migratetype
	 * of pageblock. Protected by zone->lock.
	 */
	unsigned long		nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif

	int initialized;

	/* Write-intensive fields used from the page allocator */
	ZONE_PADDING(_pad1_)

	/* free areas of different sizes */
	struct free_area	free_area[MAX_ORDER];

	/* zone flags, see below */
	unsigned long		flags;

	/* Primarily protects free_area */
	spinlock_t		lock;

	/* Write-intensive fields used by compaction and vmstats. */
	ZONE_PADDING(_pad2_)

	/*
	 * When free pages are below this point, additional steps are taken
	 * when reading the number of free pages to avoid per-cpu counter
	 * drift allowing watermarks to be breached
	 */
	unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	unsigned long		compact_cached_free_pfn;
	/* pfn where compaction migration scanner should start */
	unsigned long		compact_cached_migrate_pfn[ASYNC_AND_SYNC];
	unsigned long		compact_init_migrate_pfn;
	unsigned long		compact_init_free_pfn;
#endif

#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 * compact_order_failed is the minimum compaction failed order.
	 */
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	int			compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* Set to true when the PG_migrate_skip bits should be cleared */
	bool			compact_blockskip_flush;
#endif

	bool			contiguous;

	ZONE_PADDING(_pad3_)
	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
	atomic_long_t		vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;

enum pgdat_flags {
	PGDAT_DIRTY,			/* reclaim scanning has recently found
					 * many dirty file pages at the tail
					 * of the LRU.
					 */
	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
					 * many pages under writeback
					 */
	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
};

enum zone_flags {
	ZONE_BOOSTED_WATERMARK,		/* zone recently boosted watermarks.
					 * Cleared when kswapd is woken.
					 */
	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
};

The zone structure is not that complicated. It contains the watermarks (covered later; they are used during memory allocation), free_area (the free pages the zone manages), the per-CPU hot/cold page caches, and so on. One thing to note: in the previous article, 操作系統-內存管理-NUMA1(半原創), we saw that zones are typed through enum zone_type (ZONE_DMA, ZONE_NORMAL, and so on), yet oddly there is no field in struct zone that refers to this enum. In Java, for example, we would normally model the type with an explicit field, as in the following example:

class Shoes {
    ShoesType type;
    float price;
}

enum ShoesType {
    A, B, C, D
}


The answer becomes clearer when we look at how zones are initialized. The initialization logic is fairly long; let's walk through it:


// File: mm/page_alloc.c

void __init free_area_init(unsigned long *max_zone_pfn)
{
    ...
    for_each_node(nid) {
		pg_data_t *pgdat;

		if (!node_online(nid)) {
			pr_info("Initializing node %d as memoryless\n", nid);

			/* Allocator not initialized yet */
			pgdat = arch_alloc_nodedata(nid);
			if (!pgdat) {
				pr_err("Cannot allocate %zuB for node %d.\n",
						sizeof(*pgdat), nid);
				continue;
			}
			arch_refresh_nodedata(nid, pgdat);
			free_area_init_memoryless_node(nid);

			/*
			 * We do not want to confuse userspace by sysfs
			 * files/directories for node without any memory
			 * attached to it, so this node is not marked as
			 * N_MEMORY and not marked online so that no sysfs
			 * hierarchy will be created via register_one_node for
			 * it. The pgdat will get fully initialized by
			 * hotadd_init_pgdat() when memory is hotplugged into
			 * this node.
			 */
			continue;
		}

        // when the node is online, take the path below

		pgdat = NODE_DATA(nid);
        // (important) the core per-node init routine
		free_area_init_node(nid);

		/* Any memory on that node */
		if (pgdat->node_present_pages)
			node_set_state(nid, N_MEMORY);
		check_for_memory(pgdat, nid);
	}

	memmap_init();
}


// File: mm/page_alloc.c

static void __init free_area_init_node(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	unsigned long start_pfn = 0;
	unsigned long end_pfn = 0;

	/* pg_data_t should be reset to zero when it's allocated */
	WARN_ON(pgdat->nr_zones || pgdat->kswapd_highest_zoneidx);

	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

	pgdat->node_id = nid;
	pgdat->node_start_pfn = start_pfn;
	pgdat->per_cpu_nodestats = NULL;

	if (start_pfn != end_pfn) {
		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
			(u64)start_pfn << PAGE_SHIFT,
			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
	} else {
		pr_info("Initmem setup node %d as memoryless\n", nid);
	}

	calculate_node_totalpages(pgdat, start_pfn, end_pfn);

	alloc_node_mem_map(pgdat);
	pgdat_set_deferred_range(pgdat);

	free_area_init_core(pgdat);
}

// File: mm/page_alloc.c


static void __init free_area_init_core(struct pglist_data *pgdat)
{
	enum zone_type j;
	int nid = pgdat->node_id;

	pgdat_init_internals(pgdat);
	pgdat->per_cpu_nodestats = &boot_nodestats;


	for (j = 0; j < MAX_NR_ZONES; j++) {
		struct zone *zone = pgdat->node_zones + j;
		unsigned long size, freesize, memmap_pages;

		size = zone->spanned_pages;
		freesize = zone->present_pages;

		/*
		 * Adjust freesize so that it accounts for how much memory
		 * is used by this zone for memmap. This affects the watermark
		 * and per-cpu initialisations
		 */
		memmap_pages = calc_memmap_size(size, freesize);
		if (!is_highmem_idx(j)) {
			if (freesize >= memmap_pages) {
				freesize -= memmap_pages;
				if (memmap_pages)
					pr_debug("  %s zone: %lu pages used for memmap\n",
						 zone_names[j], memmap_pages);
			} else
				pr_warn("  %s zone: %lu memmap pages exceeds freesize %lu\n",
					zone_names[j], memmap_pages, freesize);
		}

		/* Account for reserved pages */
		if (j == 0 && freesize > dma_reserve) {
			freesize -= dma_reserve;
			pr_debug("  %s zone: %lu pages reserved\n", zone_names[0], dma_reserve);
		}

		if (!is_highmem_idx(j))
			nr_kernel_pages += freesize;
		/* Charge for highmem memmap if there are enough kernel pages */
		else if (nr_kernel_pages > memmap_pages * 2)
			nr_kernel_pages -= memmap_pages;
		nr_all_pages += freesize;

		/*
		 * Set an approximate value for lowmem here, it will be adjusted
		 * when the bootmem allocator frees pages into the buddy system.
		 * And all highmem pages will be managed by the buddy system.
		 */
		zone_init_internals(zone, j, nid, freesize);

		if (!size)
			continue;

		set_pageblock_order();
		setup_usemap(zone);
		init_currently_empty_zone(zone, zone->zone_start_pfn, size);
	}
}



static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
							unsigned long remaining_pages)
{
	atomic_long_set(&zone->managed_pages, remaining_pages);
	zone_set_nid(zone, nid);
	zone->name = zone_names[idx];
	zone->zone_pgdat = NODE_DATA(nid);
	spin_lock_init(&zone->lock);
	zone_seqlock_init(zone);
	zone_pcp_init(zone);
}

Notice that MAX_NR_ZONES matches the number of entries in enum zone_type, and that all of a node's zones are stored in an array inside the node descriptor:

typedef struct pglist_data {
	// (important) all zones belonging to this node
	struct zone node_zones[MAX_NR_ZONES];
	...
} pg_data_t;

Because the zones live in this array, they can be fetched by index, for example:

for (z = 0; z < MAX_NR_ZONES; z++)
		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);

This way struct zone never needs a direct reference to zone_type: a zone's type is simply its index in node_zones. It is a flexible design, and as we will see later, page and pfn are designed in the same way, without a hard link between them.
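For example, the zone_idx() helper in include/linux/mmzone.h recovers a zone's type from its position using nothing more than pointer arithmetic against the node_zones array:

/* include/linux/mmzone.h */
#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)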

page

(figure)

The union structure

struct page, as the abstraction of a physical memory block, is bound to be used by many different subsystems, so it inevitably carries a lot of fields, while in any given scenario most of them go unused. To save space, C's union is used: a union lets the same piece of memory hold different kinds of data depending on the situation. The kernel uses unions inside struct page precisely because a physical page is used in many different ways, and the unions keep the memory footprint of struct page as small as possible across all those scenarios.

The layout of struct page is shown below; note the two unions plus a handful of standalone fields.
(figure)

Without further ado, the code:

struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	/*
	 * Five words (20/40 bytes) are available in this union.
	 * WARNING: bit 0 of the first word is used for PageTail(). That
	 * means the other users of this union MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
	union {
		struct {	/* Page cache and anonymous pages */
			/**
			 * @lru: Pageout list, eg. active_list protected by
			 * lruvec->lru_lock.  Sometimes used as a generic list
			 * by the page owner.
			 */
			union {
				struct list_head lru;
				/* Or, for the Unevictable "LRU list" slot */
				struct {
					/* Always even, to negate PageTail */
					void *__filler;
					/* Count page's or folio's mlocks */
					unsigned int mlock_count;
				};
			};
			/* See page-flags.h for PAGE_MAPPING_FLAGS (important) */
			struct address_space *mapping;
			pgoff_t index;		/* Our offset within mapping. */
			/**
			 * @private: Mapping-private opaque data.
			 * Usually used for buffer_heads if PagePrivate.
			 * Used for swp_entry_t if PageSwapCache.
			 * Indicates order in the buddy system if PageBuddy.
			 */
			unsigned long private;
		};
		struct {	/* page_pool used by netstack */
			/**
			 * @pp_magic: magic value to avoid recycling non
			 * page_pool allocated pages.
			 */
			unsigned long pp_magic;
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr;
			union {
				/**
				 * dma_addr_upper: might require a 64-bit
				 * value on 32-bit architectures.
				 */
				unsigned long dma_addr_upper;
				/**
				 * For frag page support, not supported in
				 * 32-bit architectures with 64-bit DMA.
				 */
				atomic_long_t pp_frag_count;
			};
		};
		struct {	/* Tail pages of compound page */
			unsigned long compound_head;	/* Bit zero is set */

			/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			atomic_t compound_pincount;
#ifdef CONFIG_64BIT
			unsigned int compound_nr; /* 1 << compound_order */
#endif
		};
		struct {	/* Second tail page of compound page */
			unsigned long _compound_pad_1;	/* compound_head */
			unsigned long _compound_pad_2;
			/* For both global and memcg */
			struct list_head deferred_list;
		};
		struct {	/* Page table pages */
			unsigned long _pt_pad_1;	/* compound_head */
			pgtable_t pmd_huge_pte; /* protected by page->ptl */
			unsigned long _pt_pad_2;	/* mapping */
			union {
				struct mm_struct *pt_mm; /* x86 pgds only */
				atomic_t pt_frag_refcount; /* powerpc */
			};
#if ALLOC_SPLIT_PTLOCKS
			spinlock_t *ptl;
#else
			spinlock_t ptl;
#endif
		};
		struct {	/* ZONE_DEVICE pages */
			/** @pgmap: Points to the hosting device page map. */
			struct dev_pagemap *pgmap;
			void *zone_device_data;
			/*
			 * ZONE_DEVICE private pages are counted as being
			 * mapped so the next 3 words hold the mapping, index,
			 * and private fields from the source anonymous or
			 * page cache page while the page is migrated to device
			 * private memory.
			 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
			 * use the mapping, index, and private fields when
			 * pmem backed DAX files are mapped.
			 */
		};

		/** @rcu_head: You can use this to free a page by RCU. */
		struct rcu_head rcu_head;
	};

	union {		/* This union is 4 bytes in size. */
		/*
		 * If the page can be mapped to userspace, encodes the number
		 * of times this page is referenced by a page table.
		 */
		atomic_t _mapcount;

		/*
		 * If the page is neither PageSlab nor mappable to userspace,
		 * the value stored here may help determine what this page
		 * is used for.  See page-flags.h for a list of page types
		 * which are currently stored here.
		 */
		unsigned int page_type;
	};

	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
	atomic_t _refcount;

#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif

	/*
	 * On machines where all RAM is mapped into kernel address space,
	 * we can simply calculate the virtual address. On machines with
	 * highmem some memory is mapped into kernel virtual memory
	 * dynamically, so we need a place to store that address.
	 * Note that this field could be 16 bits on x86 ... ;)
	 *
	 * Architectures with slow multiplication can define
	 * WANT_PAGE_VIRTUAL in asm/page.h
	 */
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
} _struct_page_alignment;

Not complicated, and quite clear: it covers page cache and anonymous pages, compound pages, the LRU fields used by reclaim, page table pages, and so on. In particular:

struct page {
    // for a file page, the low bit is 0 and mapping points to the page cache (address_space) the page belongs to
    // for an anonymous page, the low bit is 1 and mapping points to the anon_vma of its virtual memory area
    struct address_space *mapping;
    // for a file page, index is the page's index within the page cache
    // for an anonymous page, index is the page's offset within the process VMA it maps
    pgoff_t index; 
}

The first usage pattern is the kernel allocating and using a whole physical page directly. As mentioned in section《5.2 物理內存區域中的水位線》of the reference article, physical pages in the kernel come in two kinds, used in different scenarios:

  • Anonymous pages: an anonymous page has no file on disk behind it; its data is produced while the process runs, and the page is mapped directly into the process's virtual address space for the process to use.

  • File pages: a file page's data comes from a file on disk. The page is first associated with the file and then mapped into the process's virtual address space, so the process can operate on the file by operating on virtual memory. This is what we usually call memory-mapped file I/O.

    The page cache is represented in the kernel by this struct address_space, and it is held by the file's inode.

The "page cache" we usually talk about is exactly this struct address_space.
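Being "held by the inode" looks like this in the source: struct inode (include/linux/fs.h) embeds an address_space for its page cache and points i_mapping at it (abridged):

struct inode {
	...
	struct address_space	*i_mapping;
	...
	struct address_space	i_data;
	...
};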

pfn and page

The reference material has a passage about the PFN; below we look at its relationship with page.

As mentioned in earlier posts, the kernel manages physical memory in units of pages, dividing it into blocks of 4K each. Each page-sized block is managed by a struct page, which records the page's state: how it is organized, usage information, statistics, and its mappings to other structures.

To index a specific physical page quickly, the kernel assigns every struct page an index number: the PFN (Page Frame Number). PFN and struct page are in one-to-one correspondence.

The kernel provides two macros for converting between a PFN and its struct page: page_to_pfn and pfn_to_page.

The way the kernel organizes and manages these struct page objects is called the physical memory model. Different memory models target different scenarios, and the calculations behind page_to_pfn and pfn_to_page differ between them.

One thing to note: in linux-v5.19.17 there is no PFN field inside struct page; there are only the page_to_pfn and pfn_to_page conversion macros (shown here in their SPARSEMEM form):

#define __page_to_pfn(pg)					\
({	const struct page *__pg = (pg);				\
	int __sec = page_to_section(__pg);			\
	(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec)));	\
})

#define __pfn_to_page(pfn)				\
({	unsigned long __pfn = (pfn);			\
	struct mem_section *__sec = __pfn_to_section(__pfn);	\
	__section_mem_map_addr(__sec) + __pfn;		\
})

A PFN is essentially the unique identifier of a page: through a PFN you can find the corresponding page and, from it, the corresponding physical address, for example:

#define page_to_pa(page)	(page_to_pfn(page) << PAGE_SHIFT)

In other words, a PFN is enough to get at the page, and Linux builds many convenient helpers on top of PFNs.
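For comparison, under the simpler FLATMEM memory model the same two conversions are plain array arithmetic over the global mem_map array (include/asm-generic/memory_model.h), which makes the "PFN is just the page's index" idea very literal:

#define __pfn_to_page(pfn)	(mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
				 ARCH_PFN_OFFSET)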

Next, let's go over the attributes related to page.

page, memory reclaim and memory allocation

There is too much to cover here; it will be introduced in the next chapter.

page and compound pages

What is a compound page? A huge page is assembled from two or more physically contiguous pages into a page larger than an ordinary one, which brings several benefits:

  1. Page faults become relatively less frequent, and fewer page faults mean better performance.
  2. A huge page is larger than an ordinary page, so it needs fewer page table entries. Page table entries hold the virtual-to-physical address mappings; the CPU has to consult them through the MMU on every memory access, so they are cached in the TLB. Because a huge page needs fewer entries, it saves TLB space and lowers the TLB miss rate, which speeds up memory access.
  3. When a process with a large memory footprint (Redis, for example) creates a child via the fork system call, the parent's resources, including its page tables, are copied. Because huge pages use far fewer page table entries, the copy is noticeably faster.
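To put the second point into numbers: mapping 1 GiB with 4 KiB pages takes 1 GiB / 4 KiB = 262,144 page table entries, while mapping it with 2 MiB huge pages takes only 512, so the TLB has far fewer entries to keep track of.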

Compound page structure

(figure)

Related code:

struct page {

      ...

      // in the head page, flags has PG_head set, marking the first page of the compound page
      unsigned long flags;
      // every tail page points back to the head page through this field
      unsigned long compound_head;
      // destructor used to free the compound page (stored in the first tail page)
      unsigned char compound_dtor;
      // allocation order of the compound page, i.e. how many pages make it up
      // (stored in the first tail page); order = 2 in this example means 4 ordinary pages
      unsigned char compound_order;
      // reverse-mapping count: how many page tables map this compound page (first tail page)
      atomic_t compound_mapcount;
      // pin count of the compound page (first tail page)
      atomic_t compound_pincount;

      ...
}

The head page's struct page has PG_head set in its flags, indicating that it is the first page of the compound page.

The compound page's extra metadata is kept as well: the destructor used to free it in compound_dtor, its allocation order in compound_order, its pin count in compound_pincount, and its reverse-mapping count (how many process page tables map it) in compound_mapcount. Note that, according to the struct page definition shown earlier, these four fields actually live in the first tail page rather than in the head page itself.

All the tail pages of a compound page point to the head page through the compound_head field of their struct page; together, the head page and its tail pages make up a complete compound page.
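This is also why the big union's comment warns that bit 0 of the first word is reserved for PageTail(): a tail page stores the head page's address with bit 0 set, so both the tail test and the head lookup are cheap. The helpers in include/linux/page-flags.h look roughly like this (slightly abridged):

static __always_inline int PageTail(struct page *page)
{
	return READ_ONCE(page->compound_head) & 1;
}

static inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return head - 1;	/* clear the tail bit to get the head page */
	return (unsigned long)page;	/* not a tail page: the page is its own head */
}

#define compound_head(page)	((typeof(page))_compound_head(page))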

(figure)

Miscellaneous

Memory hotplug

Memory hotplug comes in two forms:

  • physical memory hotplug
  • logical memory hotplug

Physical memory hotplug

Not every physical page can be migrated, because migration means the physical address changes, while memory hotplug is supposed to be transparent to processes: the virtual addresses that map the migrated pages must stay the same.

In a process's user space this is not a problem. User-space accesses go from virtual address to physical address through the page tables, so even though the physical address of a migrated page changes, the kernel simply updates the virtual-to-physical mapping in the page tables and the virtual address stays the same.

In the kernel's virtual address space, however, there is a direct-mapping region in which virtual and physical addresses are related by a fixed offset: subtracting a constant (0xC0000000 on 32-bit x86) from the virtual address directly gives the physical address.

For pages in the direct-mapping region, the kernel virtual address would change along with the physical address, so these pages cannot easily be migrated. Unmovable pages in turn prevent the memory from being unplugged, because there is nowhere to safely put the pages that were already allocated from the memory being removed. So how does the kernel solve this headache?

Since it is the unmovable pages that prevent unplugging, memory can be classified: physical pages are divided by mobility into unmovable, reclaimable and movable pages.
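In the kernel these mobility classes appear as migrate types, the same types that split every free list in free_area (include/linux/mmzone.h, comments abridged):

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
	MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,	/* can't allocate from here */
#endif
	MIGRATE_TYPES
};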

Keep one thing in mind: the kernel classifies physical memory by whether its pages can be migrated. We will come back to this when discussing how the kernel avoids memory fragmentation.

Then, in memory that may later be unplugged, only movable pages are handed out. This is set up during memory initialization, so unmovable pages never end up in hot-removable memory; when we want to unplug that memory, every page in it is movable, and the memory can be removed.

In the previous article we saw that a zone has an associated state, which includes being online or offline.

Hot and cold pages in a zone

(figure)

Looked at from this level, there is another layer between memory and the cores: the CPU cache. The kernel keeps track of physical pages that have been loaded into the CPU cache, and this per-CPU page cache hangs off struct zone:

struct zone {
    // v5.19: per-CPU page lists (older kernels used: struct per_cpu_pageset pageset[NR_CPUS];)
    struct per_cpu_pages	__percpu *per_cpu_pageset;
    struct per_cpu_zonestat	__percpu *per_cpu_zonestats;
}
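The structure behind that pointer, struct per_cpu_pages, is essentially a small per-CPU cache of pages, so that order-0 allocations and frees on the hot path do not have to take zone->lock every time (include/linux/mmzone.h, abridged):

struct per_cpu_pages {
	int count;		/* number of pages in the list */
	int high;		/* high watermark, emptying needed */
	int batch;		/* chunk size for buddy add/remove */
	...
	/* Lists of pages, one per migrate type stored on the pcp-lists */
	struct list_head lists[NR_PCP_LISTS];
};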

Reserved memory in a zone

Why reserve memory? There are two reasons:

  • emergency memory allocations
  • protecting lower zones from being squeezed by higher zones

The first reason, as explained in the reference article:
Memory allocation in the kernel happens in essentially two ways:
1. When a process asks the kernel for memory and memory is plentiful, the request is satisfied immediately. If memory is already tight, the kernel has to reclaim some infrequently used memory to make room for the request, and the process blocks and waits while this reclaim runs.
2. In the other scenario the requester is not allowed to block and the allocation must be satisfied right away, for example inside an interrupt handler or inside a critical section holding a spinlock, where sleeping is forbidden because the context cannot be rescheduled. For these core operations the kernel reserves some memory in advance; when memory is tight, this reserve is used to serve them.

(figure)

The second reason:

Some physical memory with a specific purpose must be allocated from a specific zone; for example a device's DMA controller must allocate from ZONE_DMA or ZONE_DMA32.

Physical memory for ordinary use, on the other hand, can come from several zones: when ZONE_HIGHMEM runs short, the kernel can allocate from ZONE_NORMAL instead, and when ZONE_NORMAL runs short it can fall back further to ZONE_DMA.

Memory in the lower zones is always precious, so the kernel naturally prefers to satisfy ordinary allocations from the ordinary zones, saving ZONE_DMA's memory for DMA. But when memory is tight and the higher zones run out, the kernel will encroach on other zones to satisfy the allocation.

The kernel does not, however, let higher zones squeeze lower zones without limit, since the lower zones have their dedicated purposes. Each zone therefore reserves a certain amount of memory for itself against pressure from higher zones, and this per-zone reserve is stored in the lowmem_reserve array.
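The reserve is enforced in the allocator's watermark check: the index of the zone the request originally targeted selects an entry of lowmem_reserve, and a lower zone fails the check unless its free pages exceed the watermark plus that reserve. A rough sketch of the relevant test in __zone_watermark_ok() (mm/page_alloc.c):

/* sketch: the lowmem_reserve part of the check in __zone_watermark_ok() */
if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
	return false;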

The reserve ratio can be inspected with:

cat /proc/sys/vm/lowmem_reserve_ratio

References

  • 一步一圖帶你深入理解 Linux 物理內存管理
