Linux Memory Management

/*
* =====================================================================================
*
*       Filename: storage_managment存儲管理原理和實現.c
*
*    Description: Memory management and its implementation; a large module, to be finished in 2 weeks
*
*        Version: 1.0
*        Created: 2010-12-29 16:52:54
*       Revision: none
*       Compiler: gcc
*
*         Author: Yang Shao Kun (), [email protected]
*        Company: College of Information Engineering of CDUT
*
* =====================================================================================
*/

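1》Basic principles

1: Segmented storage:
    Layout of an executable file in memory:
    highest address-------------------------------------------------------------------------lowest address
    stack segment (local function data)---hole---BSS segment (uninitialized data)---data segment (initialized)---text segment (instructions)

    Suppose more than one process runs the same code. If the code segment is separated from the data and stack segments and given its own address translation, two processes can share one copy of the code and save memory. This is the motivation for segmented memory management. The simplest implementation uses a single base register together with upper- and lower-bound registers.
    Characteristics of segmentation: there are few segments, but each segment is relatively large.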
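2: Paged storage:
    In a system with a paging memory management unit, the virtual address space is divided into pages, each 2^k bytes in size. If virtual addresses are n bits wide, the virtual address space contains 2^(n-k) pages; the upper n-k bits form the page number and the lower k bits are the offset within the page.
    Page frame: the 2^k-byte region of physical memory that a page is mapped to is called a page frame.
    Page table entry:
        1: Page frame number: this field determines which page frame the page number maps to.
        2: Protection bits: access to some pages usually needs to be restricted.
        3: Present bit: set (valid) when the page number has a translation to a page frame. An attempt to access a page that has no valid translation raises an interrupt called a page fault.
        4: Modified bit (D), often called the dirty bit: when set, the page has been written since the bit was last cleared. It tells us whether the page must be saved to disk when its page frame is reallocated.
        5: Accessed bit (A): where implemented, it is set by hardware to indicate that the page has been accessed since the bit was last cleared.

Characteristics of paging: there are relatively many pages, but each page is a small fixed-size unit.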
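
A minimal sketch of the page-number/offset split described above, assuming 32-bit virtual addresses. The helper names demo_page_number and demo_page_offset are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Upper n-k bits of the address: the page number (hypothetical helper). */
static uint32_t demo_page_number(uint32_t vaddr, unsigned k)
{
    assert(k < 32);                 /* the shift must stay inside the word */
    return vaddr >> k;
}

/* Lower k bits of the address: the offset inside the 2^k-byte page. */
static uint32_t demo_page_offset(uint32_t vaddr, unsigned k)
{
    assert(k < 32);
    return vaddr & ((UINT32_C(1) << k) - 1);
}
```

With k = 12 (4 KB pages), demo_page_number(0x080483b4, 12) yields 0x08048 and demo_page_offset(0x080483b4, 12) yields 0x3b4.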

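3: Two ways to represent free memory blocks:
    1: Free bitmap: each page is marked as free or allocated. The bit is set to 1 if
    the page frame is free and to 0 if it has been allocated.
    2: Free list: a linked list of free blocks. The size and the pointer to the next
    block are stored inside the free block itself.
4: Fragmentation
    Internal fragmentation: the memory actually allocated exceeds the memory requested,
                 so part of the allocated block goes unused.
    External fragmentation: the available free blocks are each too small to satisfy any request.
5: Selection strategies
    1: First fit: search the list from the start and take the first block whose size
    is greater than or equal to the request. This generally leaves many fragments in
    low memory, while high memory often still holds larger free blocks.
    2: Next fit: spreads allocations more evenly across memory. The search for a free
    block starts at the block following the last allocation in the list.
    3: Best fit: search the list for the smallest block that is still greater than or
    equal to the requested size.
    4: Worst fit: allocate from the largest block for every request.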
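
The strategies above can be sketched over a plain array of free block sizes. This is an illustration only: a real allocator walks a free list and splits blocks, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First fit: index of the first block with size >= request, or -1. */
static int demo_first_fit(const size_t *sizes, int n, size_t request)
{
    assert(sizes != NULL || n == 0);
    for (int i = 0; i < n; i++)
        if (sizes[i] >= request)
            return i;
    return -1;
}

/* Best fit: index of the smallest block that still satisfies the request. */
static int demo_best_fit(const size_t *sizes, int n, size_t request)
{
    int best = -1;
    assert(sizes != NULL || n == 0);
    for (int i = 0; i < n; i++)
        if (sizes[i] >= request && (best < 0 || sizes[i] < sizes[best]))
            best = i;
    return best;
}

/* Worst fit: index of the largest block, provided it satisfies the request. */
static int demo_worst_fit(const size_t *sizes, int n, size_t request)
{
    int worst = -1;
    assert(sizes != NULL || n == 0);
    for (int i = 0; i < n; i++)
        if (sizes[i] >= request && (worst < 0 || sizes[i] > sizes[worst]))
            worst = i;
    return worst;
}
```

For free blocks of 100, 500, 200 and 300 bytes and a request of 150 bytes, first fit and worst fit both pick the 500-byte block, while best fit picks the 200-byte block.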
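6: Buddy system management
    Basic idea: every allocated block has a size that is a power of 2.
    Steps, each performed once the previous one is satisfied:
    1: If n is smaller than the minimum allocation unit, set n to the minimum size.
    2: Round n up to the nearest power of 2, 2^k.
    3: If there is no free block of size 2^k, recursively allocate a block of size
    2^(k+1) and split it into two free blocks of size 2^k.
    4: To satisfy the request, return the first free block of size 2^k.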
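
Steps 1 and 2 can be sketched as below. DEMO_MIN_BLOCK and both function names are assumptions made for this example. The buddy-address computation (flipping the bit equal to the block size) is the usual buddy-system property and is added here for illustration; the notes above do not spell it out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MIN_BLOCK 16u   /* hypothetical minimum allocation unit, in bytes */

/* Steps 1 and 2: clamp to the minimum, then round up to a power of two. */
static size_t demo_buddy_block_size(size_t n)
{
    size_t size = DEMO_MIN_BLOCK;
    assert(n <= ((size_t)1 << (sizeof(size_t) * 8 - 1)));   /* avoid overflow */
    while (size < n)
        size <<= 1;
    return size;
}

/* Offset of the buddy of a block: the two halves differ only in bit `size`. */
static size_t demo_buddy_of(size_t offset, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* size must be 2^k */
    assert((offset & (size - 1)) == 0);              /* blocks are size-aligned */
    return offset ^ size;
}
```

A request for 100 bytes is rounded up to a 128-byte block, and the buddy of the 64-byte block at offset 192 is the block at offset 128.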
7: Overallocation techniques
    Swapping: a form of copying between memory and disk.

8: Swapping processes out
    The kernel swaps a process out if it needs space in memory, which may result
    from any of the following:
    1: The fork system call must allocate space for a child process.
    2: The brk system call increases the size of a process.
    3: A process becomes larger through the natural growth of its stack.
    4: The kernel wants to free space in memory for processes it had previously
    swapped out and should now swap in.

    The kernel does not need to write the entire virtual address space of a process
    to the swap device. Instead, it copies the physical memory assigned to the process
    to the allocated space on the swap device, ignoring unassigned virtual addresses.

    When the kernel decides to swap a process out, it decrements the reference count
    of each region in the process and swaps out every region whose count drops to 0.

    Fork swap:
        One: enough memory is available, so the parent process creates the child context.
        Two: not enough memory is available. The kernel swaps the process out without
        freeing the memory occupied by the in-core (parent) copy. When the swap is
        complete, the child process exists on the swap device; the parent sets the
        child to the "ready-to-run" state and returns to user mode. Since the child is
        "ready to run", the swapper will eventually swap it into memory.

    Expansion swap:
        If a process needs more memory than is currently allocated to it, whether
        because of stack growth or an invocation of the brk system call, the kernel
        performs an expansion swap of the process. The kernel reserves enough space on
        the swap device to hold the memory space of the process, including the newly
        requested space. It then adjusts the address translation mapping of the process
        to account for the new virtual memory but does not assign physical memory
        (since none was available). Finally, the kernel swaps the process out in a
        normal swapping operation, zeroing the newly allocated space on the swap
        device. When the process is later swapped in, memory is allocated according to
        the new address translation map, so the process resumes with enough space.
9: Swapping processes in
    Process 0, the swapper, is the only process that swaps processes into memory
    from swap devices.
    The clock handler measures the time that each process has been in core or
    swapped out. When the swapper wakes up to swap processes in, it examines all
    processes in the "ready to run but swapped out" state and selects the one that
    has been swapped out the longest. If enough free memory is available, the
    swapper swaps that process in.
    The swap-in operation eventually ends in one of the following situations:
    1: No "ready-to-run" process exists on the swap device. The swapper goes to sleep
    until a process on the swap device is woken up or the kernel swaps out a
    "ready-to-run" process.
    2: The swapper finds a process to swap in, but the system does not have enough
    memory. The swapper then tries to swap another process out and, if successful,
    restarts the swapping algorithm to look for a process to swap in.
    The process swapped out must not be:
    a zombie process, because zombies do not take up any physical memory;
    a process locked in memory.
    The algorithm for choosing a process to swap out to make room has serious flaws:
    First: it may swap out a process that does not provide enough memory for the
    incoming process. An alternative strategy would be to swap out groups of
    processes only if they provide enough memory for the incoming process.
    Second: if the swapper sleeps because it could not find enough memory to swap
    in a process, it searches again for a process to swap in although it had
    previously chosen one.
    Third: if the swapper chooses a "ready-to-run" process to swap out, it is
    possible that the process has not executed since it was previously swapped in.
    Fourth: if the swapper tries to swap out a process but cannot find space on the
    swap device, a deadlock may result.

    The swapper swaps out sleeping processes rather than "ready-to-run" ones where
    it can. Which sleeping process is chosen depends on its priority and the time
    it has been resident in memory. If there are no sleeping processes, the choice
    of which "ready-to-run" process to swap out depends on its nice value and the
    time it has been resident in memory.

    A "ready-to-run" process must be core resident for at least 2 seconds before
    being swapped out, and a process to be swapped in must have been swapped out
    for at least 2 seconds. If neither condition can be met, the swapper sleeps.

    Algorithm swapper:
    input: none
    output: none
    {
        loop:
            for (all swapped-out processes that are ready to run)
                pick process swapped out longest;
            if (no such process)
            {
                sleep(event: must swap in);
                goto loop;
            }
            if (enough room in main memory for process)
            {
                swap process in;
                goto loop;
            }
        /* loop2: here in revised algorithm */
            for (all processes loaded in main memory, not zombie and not locked in
                    memory)
            {
                if (there is a sleeping process)
                    choose process such that priority + residence time is
                        numerically highest;
                else /* no sleeping processes */
                    choose process such that residence time + nice is
                        numerically highest;
            }
            if (chosen process not sleeping or residency requirements not
                    satisfied)
                sleep(event: must swap process in);
            else
                swap out process;
            goto loop; /* goto loop2 in revised algorithm */
    }

10: Demand paging
    The implementation of a paging subsystem has two parts:
        1: swapping rarely used pages to a swap device, and 2: handling page
        faults.
    Data structures for demand paging
        Four main kernel data structures support low-level memory management and
        demand paging: page table entries, disk block descriptors, the page frame
        data table (pfdata), and the swap-use table.
        The kernel allocates space for pfdata once for the lifetime of the system
        and allocates memory pages for the other structures dynamically.
11: The page-stealer process
    The page stealer quietly swaps out of memory the pages that are no longer part
    of a process's working set. The kernel creates the page stealer during system
    initialization and invokes it throughout the lifetime of the system when it is
    low on free pages.
    The kernel wakes up the page stealer when the available free memory in the
    system falls below a low-water mark, and the page stealer swaps out pages until
    the available free memory exceeds a high-water mark.

    To summarize, there are two phases to swapping a page from memory.
        First: the page stealer finds a page eligible for swapping and places the
               page number on a list of pages to be swapped.
        Second: the kernel copies the page to a swap device when convenient, turns
               off the valid bit in the page table entry, decrements the reference
               count of the pfdata table entry, and places the pfdata table entry
               at the end of the free list if its reference count is 0.

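2》Concrete implementation in Linux

Page global directory: PGD
Page middle directory: PMD
Page table: PT
The CPU issues a linear address, which Linux processes in these steps:
    1: The highest bit field of the linear address is used as an index into the PGD to find the entry that points to the corresponding middle directory PMD.
    2: The second bit field is used as an index into that PMD to find the entry that points to the corresponding page table.
    3: The third bit field is used as an index into that page table to find the entry PTE, which holds the pointer to the physical page.
    4: The last bit field is the offset within the physical page; adding it to the start address of the target physical page gives the physical address.

The rough model is:
PGD----------PMD----------PT-------------offset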

/*
* Traditional 2-level paging structure
*/
#define PGDIR_SHIFT 22 // start position of the PGD index field within the address: bit 22, i.e. the 23rd bit. The field runs from bit 22 to bit 31, 10 bits in total.
#define PTRS_PER_PGD 1024

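#define PTRS_PER_PTE 1024 // each table holds 1024 pointers. Every pointer is 4 bytes, so on a 32-bit system a table such as the PGD occupies 4 KB.

The file pgtable_2level.h defines another constant:
/* PGDIR_SHIFT determines what a third-level page table entry can map */

#define PGDIR_SIZE (1UL << PGDIR_SHIFT) // the amount of address space represented by one PGD entry, 2^22 bytes (4 MB), not the space occupied by the PGD itself.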
#define PGDIR_MASK (~(PGDIR_SIZE-1))

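A 32-bit address means a 4 GB virtual address space. The Linux kernel splits it in two: the top 1 GB, 0xc0000000~0xffffffff, is used by the kernel itself and is called system space.
The lower 3 GB, 0x0~0xbfffffff, serve as the user space of each process, so in theory every process has 3 GB of user space available.
Although system space occupies the top 1 GB of every virtual space, in physical memory it always starts at the lowest address. For the kernel the address mapping is therefore a simple linear mapping, with 0xc0000000 as the displacement between the two. The kernel implements it as:

#define __PAGE_OFFSET        (0xC0000000)// the displacement; also the upper limit of user space
#define PAGE_OFFSET        ((unsigned long)__PAGE_OFFSET)
#define __pa(x)            ((unsigned long)(x)-PAGE_OFFSET)// x is a virtual address; this macro converts it to a physical address, a convenience for kernel code that needs the physical address behind a virtual one.
#define __va(x)            ((void *)((unsigned long)(x)+PAGE_OFFSET))// this macro converts the physical address x to a virtual address.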
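
The linear mapping can be tried out in user space with plain 32-bit integers. The sketch below mirrors the arithmetic of __pa and __va only; the demo_ names are made up, and the range checks are assumptions added for the example, not part of the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_OFFSET UINT32_C(0xC0000000)   /* same value as __PAGE_OFFSET */

/* Kernel virtual address -> physical address (arithmetic of __pa). */
static uint32_t demo_pa(uint32_t vaddr)
{
    assert(vaddr >= DEMO_PAGE_OFFSET);      /* only system space is mapped linearly */
    return vaddr - DEMO_PAGE_OFFSET;
}

/* Physical address -> kernel virtual address (arithmetic of __va). */
static uint32_t demo_va(uint32_t paddr)
{
    assert(paddr < UINT32_C(0x40000000));   /* the window is 1 GB wide */
    return paddr + DEMO_PAGE_OFFSET;
}
```

For example, the kernel virtual address 0xC0100000 corresponds to physical address 0x00100000, and converting back returns the original value.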

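2》The complete address mapping process
The Linux kernel uses paged memory management: the virtual address space is divided into fixed-size "pages", and at run time the MMU "maps" a virtual address to an address within some physical page. The i386 CPU always applies segment mapping to the addresses used by a program first, then page mapping.

Format of a segment register:
    15                                 3    2    1    0
    -----------------index-------------|----ti---|-rpl--》requested privilege level
    bit 2 (TI): 0 selects the GDT, 1 selects the LDT
On the i386 the CPU uses the current value of the code segment register CS as the "selector" for segment mapping, that is, as an index into a segment descriptor table. There are global and local segment descriptor tables.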
#define start_thread(regs, new_eip, new_esp) do {        \
    __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0));    \
    set_fs(USER_DS);                    \
    regs->xds = __USER_DS;                    \
    regs->xes = __USER_DS;                    \
    regs->xss = __USER_DS;                    \
    regs->xcs = __USER_CS;                    \
    regs->eip = new_eip;                    \
    regs->esp = new_esp;                    \
} while (0)
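Segment registers inside the CPU:

CS: Code Segment Register, holds the selector of the code segment;
DS: Data Segment Register, holds the selector of the data segment;
ES: Extra Segment Register, holds the selector of an additional data segment;
SS: Stack Segment Register, holds the selector of the stack segment;
FS: Extra Segment Register, holds the selector of an additional data segment;
GS: Extra Segment Register, holds the selector of an additional data segment.

In other words, Intel's intent was to divide a process image into code, data and stack segments, but the Linux kernel lumps these into just code segments and data segments: Linux has code and data segments but no separate stack segment.
Now let's see what USER_DS and USER_CS are.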
#ifndef _ASM_SEGMENT_H
#define _ASM_SEGMENT_H
/*                         ------index-----|TI|RPL */
#define __KERNEL_CS    0x10 //0000 0000 0001 0 0 00
#define __KERNEL_DS    0x18 //0000 0000 0001 1 0 00
#define __USER_CS    0x23 //0000 0000 0010 0 0 11
#define __USER_DS    0x2B //0000 0000 0010 1 0 11

#endif
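This is the definition in the i386 segment.h. Linux uses only 4 segment register values: two for the kernel itself and two for all processes. As can be seen, Linux uses the GDT almost exclusively (TI is 0 in all four values); the kernel runs at level 0, the most privileged, and user code at level 3.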
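
A small sketch that splits a selector into its index, TI and RPL fields makes the four values easy to check. The struct and function names are hypothetical; only the bit positions come from the register format shown earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three selector fields. */
struct demo_selector {
    unsigned index;   /* bits 15..3: index into the descriptor table */
    unsigned ti;      /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;     /* bits 1..0: requested privilege level */
};

static struct demo_selector demo_decode_selector(uint16_t sel)
{
    struct demo_selector f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1;
    f.rpl   = sel & 3;
    return f;
}

/* Inverse operation: build a selector from its fields. */
static uint16_t demo_make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti <= 1 && rpl <= 3);   /* 13 + 1 + 2 bits */
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

Decoding 0x10, 0x18, 0x23 and 0x2B gives GDT indexes 2, 3, 4 and 5, with RPL 0 for the two kernel selectors and RPL 3 for the two user selectors.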

/* Initial contents of the GDT, defined in arch/i386/kernel/head.S */
.data

ALIGN
/*
 * This contains typically 140 quadwords, depending on NR_CPUS.
 *
 * NOTE! Make sure the gdt descriptor in head.S matches this if you
 * change anything.
 */
ENTRY(gdt_table)/* assembler syntax; acts like an array with gdt_table as its base address */
    .quad 0x0000000000000000    /* NULL descriptor */
    .quad 0x0000000000000000    /* not used */
    .quad 0x00cf9a000000ffff    /* 0x10 kernel 4GB code at 0x00000000 */
    .quad 0x00cf92000000ffff    /* 0x18 kernel 4GB data at 0x00000000 */
    .quad 0x00cffa000000ffff    /* 0x23 user   4GB code at 0x00000000 */
    .quad 0x00cff2000000ffff    /* 0x2b user   4GB data at 0x00000000 */
    .quad 0x0000000000000000    /* not used */
    .quad 0x0000000000000000    /* not used */
    /*
     * The APM segments have byte granularity and their bases
     * and limits are set at run time.
     */
    .quad 0x0040920000000000    /* 0x40 APM set up for bad BIOS's */
    .quad 0x00409a0000000000    /* 0x48 APM CS    code */
    .quad 0x00009a0000000000    /* 0x50 APM CS 16 code (16 bit) */
    .quad 0x0040920000000000    /* 0x58 APM DS    data */
    .fill NR_CPUS*4,8,0     /* space for TSS's and LDT's */

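Layout of a segment descriptor:
    63            56 55        52 51      48 47                 40 39           32 31          16 15         8 7        0
    ---B31~B24------ |G|D/B|0|AV| |-L19~L16-| P DPL S E ED/C R/W A| ---B23~B16--- |---B15~B0---- |-----limit L15~L0-----|
        base                        limit                             base           base             limit
Where:
    1: The base address is 32 bits. In Linux every segment is the entire 4 GB space starting at address 0.
       Expanding the four segment register values above into their descriptors shows that every base address is 0, so segment mapping leaves the address unchanged when translating a virtual address to a linear address.
    2:  G: 1 means the segment limit is in units of 4 KB; 0 means units of bytes.
        D/B: 1 means accesses to the segment use 32-bit instructions; 0 means 16-bit instructions.
        0: always 0.
        AV: available to software; the CPU ignores this bit.
    3:  P: 1 means the segment is present in memory.
        DPL: descriptor privilege level.
        S: 1 for an ordinary code or data segment; 0 for a system segment used for system management.
        E: 1 for a code segment.
        ED: 0 means the segment grows upward (data segment); 1 means it grows downward (stack segment).
        C: 0 means the privilege level is ignored; 1 means the privilege level is enforced.
        R/W: read/write permission; effective when 1.
        A: 1 means the segment has been accessed.
In Linux, segment mapping only checks the DPL and the segment type. Since page mapping performs its own checks, the segment stage is largely redundant, but Intel requires segment mapping first and page mapping second.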
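
To check the claim that all four Linux segments start at 0 and span 4 GB, the descriptor quadwords from gdt_table can be decoded by hand. The sketch below assumes the standard i386 bit positions for each field; the demo_ function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Base: bits 16..39 (B0..B23) plus bits 56..63 (B24..B31). */
static uint32_t demo_desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
}

/* Limit: bits 0..15 (L0..L15) plus bits 48..51 (L16..L19). */
static uint32_t demo_desc_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFF) | (((d >> 48) & 0xF) << 16));
}

static unsigned demo_desc_g(uint64_t d)   { return (unsigned)((d >> 55) & 1); }
static unsigned demo_desc_dpl(uint64_t d) { return (unsigned)((d >> 45) & 3); }

/* Segment size in bytes: (limit + 1) units of 4 KB if G = 1, else bytes. */
static uint64_t demo_desc_size(uint64_t d)
{
    uint64_t units = (uint64_t)demo_desc_limit(d) + 1;
    assert(units <= (UINT64_C(1) << 20));   /* the limit field is 20 bits wide */
    return demo_desc_g(d) ? units << 12 : units;
}
```

For the kernel code descriptor 0x00cf9a000000ffff this gives base 0, limit 0xFFFFF, G = 1, DPL 0 and a size of 0x100000000 bytes (4 GB); the user code descriptor 0x00cffa000000ffff differs only in DPL, which is 3.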

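Page mapping:
In page mapping every process has its own page directory PGD, and the pointer to it is kept in the process's mm_struct. Whenever a process is scheduled to run, the kernel loads control register CR3 for it, and the MMU hardware always takes the pointer to the current page directory from CR3. Note that the CPU uses virtual addresses when executing code, whereas the MMU hardware uses physical addresses for the mapping itself, so the PGD pointer must be converted. This is done in include/asm-i386/mmu_context.h.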

static inline void switch_mm(struct mm_struct *prev,struct mm_struct *next ,struct task_struct *tsk,unsigned cpu)
{
    ....
    asm volatile("movl %0,%%cr3"::"r"(__pa(next->pgd)));// load the physical address of the next process's page directory PGD into %cr3, i.e. switch to a different page directory.
    ....
}
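Mapping of the linear address 0x080483b4 under paging:
0000 1000 00|00 0100 1000| 0011 1011 0100
----32------|-----72------
First: the top 10 bits are the index into the PGD (page directory) and select an entry. The upper 20 bits of that entry point to a page table PT; appending 12 zero bits gives the page table pointer.
Then: with the page table found, the CPU takes the middle 10 bits of the linear address as the index into it and finds the corresponding entry.
Finally: the upper 20 bits of that entry point to the target page; adding the lowest 12 bits of the linear address to the page's start address gives the final physical address.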
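
The same split can be reproduced with a few shifts and masks. This is a sketch of the two-level i386 arithmetic only, not kernel code: the function names are hypothetical, and the PTE value used in the usage note is an invented example.

```c
#include <assert.h>
#include <stdint.h>

/* Top 10 bits: index into the page directory. */
static unsigned demo_pgd_index(uint32_t linear) { return linear >> 22; }

/* Middle 10 bits: index into the page table. */
static unsigned demo_pt_index(uint32_t linear)  { return (linear >> 12) & 0x3FF; }

/* Low 12 bits: byte offset inside the 4 KB page. */
static unsigned demo_byte_offset(uint32_t linear) { return linear & 0xFFF; }

/* Last step of the walk: page start (upper 20 bits of the PTE) plus offset. */
static uint32_t demo_physical(uint32_t pte, uint32_t linear)
{
    assert(pte & 1);   /* the present bit must be set for the walk to finish */
    return (pte & ~UINT32_C(0xFFF)) | demo_byte_offset(linear);
}
```

For 0x080483b4 the indexes are 32 and 72 and the offset is 0x3b4. If the page table entry happened to be 0x00123067, the resulting physical address would be 0x001233b4.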

3》Important data structures and functions
/*
* These are used to make use of C type-checking..
*/
#if CONFIG_X86_PAE
typedef struct { unsigned long pte_low, pte_high; } pte_t;
typedef struct { unsigned long long pmd; } pmd_t;
typedef struct { unsigned long long pgd; } pgd_t;
#define pte_val(x)    ((x).pte_low | ((unsigned long long)(x).pte_high << 32))
#else
typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pgd; } pgd_t;
#define pte_val(x)    ((x).pte_low)
#endif
#define PTE_MASK    PAGE_MASK

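The page directory PGD, middle directory PMD and page table PT are arrays of pgd_t, pmd_t and pte_t entries respectively, defined in include/asm-i386/page.h.
As a pointer, a pte only needs its upper 20 bits: all physical pages are aligned on 4 KB boundaries, so the upper 20 bits of a page's start address can be read as the page frame number. The low 12 bits of pte_t therefore hold page status and access permission bits.
typedef struct { unsigned long pgprot; } pgprot_t;// describes page protection; the value of pgprot corresponds to the low 12 bits of an i386 MMU page table entry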
#define _PAGE_PRESENT    0x001
#define _PAGE_RW    0x002
#define _PAGE_USER    0x004
#define _PAGE_PWT    0x008
#define _PAGE_PCD    0x010
#define _PAGE_ACCESSED    0x020
#define _PAGE_DIRTY    0x040
#define _PAGE_PSE    0x080    /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL    0x100    /* Global TLB entry PPro+ */

#define _PAGE_PROTNONE 0x080 /* If not present; corresponds to bit 7 of the page table entry, reserved */
These bits are set up in include/asm-i386/pgtable.h.
In practice, combining the pointer part of a pte with pgprot yields the entry actually stored in a page table. This is done by the mk_pte macro defined in pgtable.h.
#define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
Related macros:
#define pgprot_val(x) ((x).pgprot)
#define __pte(x) ((pte_t) { (x) } )
The following macro stores an entry value into a page table entry; it is defined in include/asm-i386/pgtable_2level.h

#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
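
As an illustration of how __mk_pte packs a page frame number and protection bits into one 32-bit word, here is a user-space sketch that uses plain integers instead of pte_t and pgprot_t. All DEMO_ and demo_ names are made up; the bit values are copied from the _PAGE_* definitions above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12
#define DEMO_PAGE_PRESENT 0x001u   /* same bit as _PAGE_PRESENT */
#define DEMO_PAGE_RW      0x002u   /* same bit as _PAGE_RW */
#define DEMO_PAGE_USER    0x004u   /* same bit as _PAGE_USER */

/* Pack a 20-bit page frame number and 12 protection bits into an entry. */
static uint32_t demo_mk_pte(uint32_t page_nr, uint32_t prot)
{
    assert(page_nr < (UINT32_C(1) << 20));
    assert(prot < (UINT32_C(1) << DEMO_PAGE_SHIFT));
    return (page_nr << DEMO_PAGE_SHIFT) | prot;
}

/* Upper 20 bits: the page frame number. */
static uint32_t demo_pte_frame(uint32_t pte) { return pte >> DEMO_PAGE_SHIFT; }

/* Bit 0: the P flag the MMU tests first. */
static int demo_pte_present(uint32_t pte) { return (pte & DEMO_PAGE_PRESENT) != 0; }
```

Packing frame 0x123 with PRESENT, RW and USER set gives 0x00123007; the frame number and the present flag can then be read back from the entry.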
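During mapping the MMU first checks the P flag (the _PAGE_PRESENT macro, line 402 of the header), which indicates whether the mapped page is in memory. Only when P is 1 does the MMU complete the mapping; otherwise the mapping cannot be completed and a page fault exception is raised, and the rest of the entry means nothing to the MMU.
    If all of physical memory is viewed as an array of physical pages, the upper 20 bits of a page table entry are the index into that array, i.e. the page frame number (the low 12 bits of a page's start address are always 0 because pages are 4 KB aligned). The same index locates the page structure that represents the target physical page in the array of page structures. include/asm-i386/pgtable_2level.h defines a macro for this.
#define pte_page(x)        (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT)))) // PAGE_SHIFT is 12 on i386.
mem_map is a pointer to page structures; the page structure is defined in include/linux/mm.h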
/*
* Try to keep the most commonly accessed fields in single cache lines
* here (16 bytes or greater), so that closely related fields are loaded into the same cache line.
* This ordering should be particularly beneficial on 32-bit processors.
*
* The first line is data used in page cache lookup, the second line
* is used for linear searches (eg. clock algorithm scans).
*/
typedef struct page {
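    struct list_head list;// possibly related to the doubly linked lists in free_area?
    struct address_space *mapping;// the structure that manages the contents of this page.
    unsigned long index;// when the page contents come from a file, the page's index within the file; when the contents have been written to the swap device but are still kept as a cache, it records where the page went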
    struct page *next_hash;
    atomic_t count;
    unsigned long flags;    /* atomic flags, some possibly updated asynchronously */
    struct list_head lru;
    unsigned long age;
    wait_queue_head_t wait;
    struct page **pprev_hash;
    struct buffer_head * buffers;
    void *virtual; /* non-NULL if kmapped */
    struct zone_struct *zone;// the zone_struct that manages the class of physical pages this page belongs to: ZONE_DMA, ZONE_NORMAL.
} mem_map_t;

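Every physical page in the system has a page structure. At initialization the kernel builds an array of page structures, mem_map, sized to the physical memory, which acts as the warehouse of physical pages. Each page structure represents one physical page, and its index in the array is the page frame number. The pages in the warehouse are divided into two zones, ZONE_DMA and ZONE_NORMAL, possibly with a third, ZONE_HIGHMEM.
The first, ZONE_DMA, is reserved for DMA, which supplies physical addresses directly without going through the MMU.

Each zone has a zone_struct data structure.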
typedef struct zone_struct {
    /*
     * Commonly accessed fields:
     */
    spinlock_t        lock;
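    unsigned long        offset;// the zone's starting page within mem_map, the array of page structures (the warehouse) built at initialization. Once the zones are set up, every physical page belongs permanently to one zone, determined by the page's start address.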
    unsigned long        free_pages;
    unsigned long        inactive_clean_pages;
    unsigned long        inactive_dirty_pages;
    unsigned long        pages_min, pages_low, pages_high;

    /*
     * free areas of different sizes
     */
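    struct list_head    inactive_clean_list;// list of inactive clean pages.
    free_area_t        free_area[MAX_ORDER];//    typedef struct free_area_struct {struct list_head    free_list;unsigned int        *map;} free_area_t; a set of queues used to allocate blocks of physically contiguous pages. The queues hold blocks of 2 pages and of larger powers of 2, up to 2^MAX_ORDER, so according to these notes the largest block can reach 1024 pages, i.e. 4 MB.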

    /*
     * rarely used fields:
     */
    char            *name;
    unsigned long        size;
    /*
     * Discontig memory support fields.
     */
    struct pglist_data    *zone_pgdat;
    unsigned long        zone_start_paddr;
    unsigned long        zone_start_mapnr;
    struct page        *zone_mem_map;
} zone_t;

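With the introduction of non-uniform memory architecture (NUMA), zones are no longer the top-level structure. Each memory node (a region of uniform access characteristics) has at least two zones, and the page array described earlier now belongs to a specific node rather than being global.
#define NR_GFPINDEX 0x100// total number of allocation policies.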

typedef struct pglist_data {
    zone_t node_zones[MAX_NR_ZONES];// the zones in this node, at most 3, kept in an array.
    zonelist_t node_zonelists[NR_GFPINDEX];// the allocation policies of this memory node.
    struct page *node_mem_map;
    unsigned long *valid_addr_bitmap;
    struct bootmem_data *bdata;
    unsigned long node_start_paddr;
    unsigned long node_start_mapnr;
    unsigned long node_size;// size of the node
    int node_id;// the node's id (sequence number).
    struct pglist_data *node_next;// linked list of pglist_data nodes
} pg_data_t;

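Data structures for virtual space:
    Unlike physical memory, virtual memory is not managed through a warehouse of pages. It is managed per process: every process has its own virtual memory space.
A range of virtual memory is abstracted by an important data structure, vm_area_struct, in include/linux/mm.h.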
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
struct vm_area_struct {
    struct mm_struct * vm_mm;    /* VM area parameters */
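    unsigned long vm_start;// start of the virtual memory area
    unsigned long vm_end;// end of the virtual memory area

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next;// links all areas of one process in order of virtual address.
    /* Areas are delimited not only by address contiguity but also by other attributes, chiefly the access permissions of the virtual pages. If the first and second halves of an address range have different permissions or attributes, the range is split into two areas, so all pages within one area share the same attributes and permissions. */

    pgprot_t vm_page_prot;// the low 12 bits, which set attributes, access permissions and so on.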
unsigned long vm_flags;

    /* AVL tree of VM areas per task, sorted by address */
    short vm_avl_height;
    struct vm_area_struct * vm_avl_left;
    struct vm_area_struct * vm_avl_right;

    /* For areas with an address space and backing store,
     * one of the address_space->i_mmap{,shared} lists,
     * for shm areas, the list of attaches, otherwise unused.
     */
    struct vm_area_struct *vm_next_share;
    struct vm_area_struct **vm_pprev_share;

    struct vm_operations_struct * vm_ops;// points to a structure of function pointers that open and close the area, set up mappings and so on. It acts as a jump table, and its members are mostly related to file operations.
    unsigned long vm_pgoff;        /* offset in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;
    unsigned long vm_raend;
    void * vm_private_data;        /* was vm_pte (shared mem) */
};
The data structure that vm_ops points to:
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
* to the functions called when a no-page or a wp-page exception occurs.
*/
struct vm_operations_struct {
    void (*open)(struct vm_area_struct * area);
    void (*close)(struct vm_area_struct * area);
    struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int write_access);
};
The data structure that vm_mm points to:
struct mm_struct {
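    struct vm_area_struct * mmap;        /* list of VMAs: singly linked linear list of the virtual memory areas */
    struct vm_area_struct * mmap_avl;    /* tree of VMAs: root of the AVL tree of virtual memory areas */
    struct vm_area_struct * mmap_cache;    /* last find_vma result: the most recently used area, kept because the addresses a program uses tend to be local */
    pgd_t * pgd;// points to the process's page directory; when the process is scheduled to run, this pointer is converted to a physical address.
    atomic_t mm_users;            /* How many users with user space? */
    atomic_t mm_count;            /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;                /* number of VMAs: how many virtual memory areas the process has */
    struct semaphore mmap_sem;// semaphore for P/V operations. A semaphore is an integer: when it is >= 0 it counts the resource instances available to concurrent processes; when it is < 0 its magnitude is the number of processes waiting to enter the critical section. The P primitive decrements it and the V primitive increments it.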
    spinlock_t page_table_lock;

struct list_head mmlist; /* List of all active mm's */

    unsigned long start_code, end_code, start_data, end_data;//代碼段，數據段的起始和終點
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long rss, total_vm, locked_vm;
    unsigned long def_flags;
    unsigned long cpu_vm_mask;
    unsigned long swap_cnt;    /* number of pages to swap on next pass */
    unsigned long swap_address;

    /* Architecture-specific MM context */
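    mm_context_t context;// the local descriptor table; rarely used in Linux
};
This structure sits a level above vm_area_struct. Each process has exactly one mm_struct, and the task_struct (the "process control block") holds a pointer to it. It is the abstraction of the whole user space and its overall control structure.
Given a virtual address belonging to a process, finding the area it belongs to and the corresponding vm_area_struct is the job of find_vma, whose code is in mm/mmap.c.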
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
{
    struct vm_area_struct *vma = NULL;

    if (mm) {
        /* Check the cache first. */
        /* (Cache hit rate is typically around 35%.) */
        vma = mm->mmap_cache;
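        /* see whether the cache hits */
        if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {// true on a cache miss: there is no cached vma, or addr lies outside the cached vma's range
            if (!mm->mmap_avl) {/* cache miss and mmap_avl is NULL (no AVL tree has been built): search the linear list */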
                /* Go through the linear list. */
                vma = mm->mmap;
                while (vma && vma->vm_end <= addr)
                    vma = vma->vm_next;
            } else {/* an AVL tree exists, so search it */
                /* Then go through the AVL tree quickly. */
                struct vm_area_struct * tree = mm->mmap_avl;// start at the root of the AVL tree
                vma = NULL;
                for (;;) {
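                    if (tree == vm_avl_empty)// #define vm_avl_empty    (struct vm_area_struct *) NULL; reached an empty subtree
                        break;
                    if (tree->vm_end > addr) {// this node ends above addr, so it is a candidate; check it first, then its left subtree
                        vma = tree;
                        if (tree->vm_start <= addr)
                            break;
                        tree = tree->vm_avl_left;// a lower area may still satisfy addr < vm_end
                    } else// otherwise continue in the right subtree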
                        tree = tree->vm_avl_right;
                }
            }
            if (vma)
                mm->mmap_cache = vma;// found: point mmap_cache at the vm_area_struct just located.
        }
    }
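    return vma;// a NULL return means no area covering this address has been set up yet. The usual response is to create a new vm_area_struct and call insert_vm_struct() to insert it into the mm_struct's linear list or AVL tree.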
}
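
The linear-list branch of find_vma can be exercised in user space with a cut-down structure. The sketch below keeps only the fields the search needs; struct demo_vma and demo_find_vma are illustrative stand-ins, not the kernel definitions, and the cache and AVL paths are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for vm_area_struct: covers [vm_start, vm_end). */
struct demo_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    struct demo_vma *vm_next;   /* next area, sorted by address */
};

/* Return the first area with addr < vm_end, or NULL if there is none. */
static struct demo_vma *demo_find_vma(struct demo_vma *head, unsigned long addr)
{
    struct demo_vma *vma = head;
    while (vma && vma->vm_end <= addr) {
        assert(vma->vm_start < vma->vm_end);   /* areas are never empty */
        vma = vma->vm_next;
    }
    return vma;
}
```

Note that the result does not have to contain addr: for an address inside a hole, the function returns the next area above it, with vm_start greater than addr. This is the case do_page_fault tests for before deciding whether the stack may grow.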

4》Out-of-bounds access
Main body of the exception handler, do_page_fault():
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
* routines.
* regs points to a copy of the CPU registers as they were just before the exception.
* error_code gives the specific reason the mapping failed.
*
* error_code:
*    bit 0 == 0 means no page found, 1 means protection fault
*    bit 1 == 0 means read, 1 means write
*    bit 2 == 0 means kernel, 1 means user-mode
*/
asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    struct task_struct *tsk;
    struct mm_struct *mm;
    struct vm_area_struct * vma;
    unsigned long address;
    unsigned long page;
    unsigned long fixup;
    int write;
    siginfo_t info;

    /* get the address */
    __asm__("movl %%cr2,%0":"=r" (address));// When the i386 CPU raises a page fault, it stores the linear address whose mapping failed in control register CR2. The handler needs this value, but C has no construct for reading CR2, so this inline assembly copies CR2 into a register, which becomes address.

    tsk = current;// get the task_struct of the current process
                    //#ifndef _I386_CURRENT_H
                    //#define _I386_CURRENT_H

                    //struct task_struct;

                    //static inline struct task_struct * get_current(void)
                    //{
                        //struct task_struct *current;
                        //__asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));// each process's kernel stack area is 8192 bytes (8 KB) and is aligned to that size, so clearing the low 13 bits of %esp gives the start of the area, where the task_struct lives.
                        //    return current;
                        // }
                    //#define current get_current()

                    //#endif /* !(_I386_CURRENT_H) */

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     */
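    if (address >= TASK_SIZE)// TASK_SIZE is 3 GB, the size of each process's user space.
        goto vmalloc_fault;// the fault occurred in kernel space

    mm = tsk->mm;// the process's user memory descriptor
    info.si_code = SEGV_MAPERR;// preset the SIGSEGV fault code to "address not mapped"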

    /*
     * If we're in an interrupt or have no user
     * context, we must not take the fault..
     */
    if (in_interrupt() || !mm)// the fault occurred inside an interrupt handler, or the task has no user memory descriptor, meaning no user mapping has been set up.
        goto no_context;

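    down(&mm->mmap_sem);// P operation on the semaphore

    vma = find_vma(mm, address);// find the virtual memory area for this address and its vm_area_struct
    if (!vma)// out of bounds: no area ends above the address
        goto bad_area;
    if (vma->vm_start <= address)
        goto good_area;
    if (!(vma->vm_flags & VM_GROWSDOWN))// the address fell into a hole and the area above it is not a stack area that can grow downward
        goto bad_area;
    if (error_code & 4) {// here the area above the hole is a stack area (VM_GROWSDOWN is set); bit 2 set means the access came from user mode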
        /*
         * accessing the stack below %esp is always a bug.
         * The "+ 32" is there due to some instructions (like
         * pusha) doing post-decrement on the stack and that
         * doesn't show up until later..
         */
        if (address + 32 < regs->esp)// %esp-32 is the reference point; anything below it is certainly an error. A pusha instruction pushes 32 bytes at once.
            goto bad_area;
    }
    if (expand_stack(vma, address))// a legitimate stack growth request: pages should be mapped from the top of the hole downward and added to the stack area, hence the call to expand_stack().
        goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
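good_area:// handles accesses that fall inside a valid area
    info.si_code = SEGV_ACCERR;
    write = 0;
    switch (error_code & 3) {// use the error_code passed in by the exception mechanism to pin down why the mapping failed and choose a response.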
        default:    /* 3: write, present */
#ifdef TEST_VERIFY_AREA
            if (regs->cs == KERNEL_CS)
                printk("WP fault at %08lx\n", regs->eip);
#endif
            /* fall through */
        case 2:        /* write, not present */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;
            write++;
            break;
        case 1:        /* read, present */
            goto bad_area;
        case 0:        /* read, not present */
            if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                goto bad_area;
    }

    /*
     * If for any reason at all we couldn't handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.
     */
    switch (handle_mm_fault(mm, vma, address, write)) {// call the virtual memory handler handle_mm_fault(), in mm/memory.c.
    case 1:
        tsk->min_flt++;
        break;
    case 2:
        tsk->maj_flt++;
        break;
    case 0:
        goto do_sigbus;
    default:
        goto out_of_memory;
    }

    /*
     * Did it hit the DOS screen memory VA from vm86 mode?
     */
    if (regs->eflags & VM_MASK) {
        unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
        if (bit < 32)
            tsk->thread.screen_bitmap |= 1 << bit;
    }
    up(&mm->mmap_sem);
    return;

/*
* Something tried to access memory that isn't in our memory map..
* Fix it, but check if it's kernel or user first..
*/
bad_area:
up(&mm->mmap_sem);

bad_area_nosemaphore:
    /* User mode accesses just cause a SIGSEGV */
    if (error_code & 4) {//user mode
        tsk->thread.cr2 = address;
        tsk->thread.error_code = error_code;
        tsk->thread.trap_no = 14;
        info.si_signo = SIGSEGV;
        info.si_errno = 0;
        /* info.si_code has been set above */
        info.si_addr = (void *)address;
        force_sig_info(SIGSEGV, &info, tsk);
        return;
    }

    /*
     * Pentium F0 0F C7 C8 bug workaround.
     */
    if (boot_cpu_data.f00f_bug) {
        unsigned long nr;
        nr = (address - idt) >> 3;

        if (nr == 6) {
            do_invalid_op(regs, 0);
            return;
        }
    }

no_context:
    /* Are we prepared to handle this kernel fault? */
    if ((fixup = search_exception_table(regs->eip)) != 0) {
        regs->eip = fixup;
        return;
    }

/*
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
*/

    bust_spinlocks();

    if (address < PAGE_SIZE)
        printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference");
    else
        printk(KERN_ALERT "Unable to handle kernel paging request");
    printk(" at virtual address %08lx\n",address);
    printk(" printing eip:\n");
    printk("%08lx\n", regs->eip);
    asm("movl %%cr3,%0":"=r" (page));
    page = ((unsigned long *) __va(page))[address >> 22];
    printk(KERN_ALERT "*pde = %08lx\n", page);
    if (page & 1) {
        page &= PAGE_MASK;
        address &= 0x003ff000;
        page = ((unsigned long *) __va(page))[address >> PAGE_SHIFT];
        printk(KERN_ALERT "*pte = %08lx\n", page);
    }
    }
    die("Oops", regs, error_code);
    do_exit(SIGKILL);

/*
* We ran out of memory, or some other thing happened to us that made
* us unable to handle the page fault gracefully.
*/
out_of_memory:
    up(&mm->mmap_sem);
    printk("VM: killing process %s\n", tsk->comm);
    if (error_code & 4)
        do_exit(SIGKILL);
    goto no_context;

do_sigbus:
up(&mm->mmap_sem);

    /*
     * Send a sigbus, regardless of whether we were in kernel
     * or user mode.
     */
    tsk->thread.cr2 = address;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_no = 14;
    info.si_signo = SIGBUS;
    info.si_errno = 0;
    info.si_code = BUS_ADRERR;
    info.si_addr = (void *)address;
    force_sig_info(SIGBUS, &info, tsk);

    /* Kernel mode? Handle exceptions or die */
    if (!(error_code & 4))
        goto no_context;
    return;

vmalloc_fault:
    {
        /*
         * Synchronize this task's top level page-table
         * with the 'reference' page table.
         */
        int offset = __pgd_offset(address);
        pgd_t *pgd, *pgd_k;
        pmd_t *pmd, *pmd_k;

        pgd = tsk->active_mm->pgd + offset;
        pgd_k = init_mm.pgd + offset;

        if (!pgd_present(*pgd)) {
            if (!pgd_present(*pgd_k))
                goto bad_area_nosemaphore;
            set_pgd(pgd, *pgd_k);
            return;
        }

        pmd = pmd_offset(pgd, address);
        pmd_k = pmd_offset(pgd_k, address);

        if (pmd_present(*pmd) || !pmd_present(*pmd_k))
            goto bad_area_nosemaphore;
        set_pmd(pmd, *pmd_k);
        return;
    }
}
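
The handler above tests `error_code & 4` in several places to tell user-mode faults from kernel-mode ones. As a small user-space sketch, the low three bits of the i386 page-fault error code can be decoded as follows. The `TOY_PF_*` constants and the `pf_*` helpers are illustrative names that do not appear in the kernel source quoted here.

```c
#include <assert.h>
#include <stdio.h>

/* Low three bits of the i386 page-fault error code (illustrative names). */
#define TOY_PF_PROT  0x1UL  /* clear: page not present; set: protection violation */
#define TOY_PF_WRITE 0x2UL  /* clear: read access;      set: write access */
#define TOY_PF_USER  0x4UL  /* clear: kernel mode;      set: user mode */

int pf_is_protection(unsigned long ec) { return (ec & TOY_PF_PROT) != 0; }
int pf_is_write(unsigned long ec)      { return (ec & TOY_PF_WRITE) != 0; }
int pf_is_user(unsigned long ec)       { return (ec & TOY_PF_USER) != 0; }

/* Print a one-line description of a fault, for experimenting in user space. */
void pf_print(unsigned long ec)
{
    assert(ec < 8);  /* this sketch only models the low three bits */
    printf("%s, %s access, %s mode\n",
           pf_is_protection(ec) ? "protection violation" : "page not present",
           pf_is_write(ec) ? "write" : "read",
           pf_is_user(ec) ? "user" : "kernel");
}
```

With this reading, `if (!(error_code & 4)) goto no_context;` in the do_sigbus path means: the fault came from kernel mode, so fall back to the exception table or an Oops.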

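Expanding the stack area: the expand_stack() function, in include/linux/mm.h. It only adjusts the vm_area_struct; it does not set up any physical page mapping for the newly added range. That is why control still has to continue to the good_area label in do_page_fault().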
/* vma is the first one with address < vma->vm_end,
* and even address < vma->vm_start. Have to extend vma. */
static inline int expand_stack(struct vm_area_struct * vma, unsigned long address)
{
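    unsigned long grow;

    address &= PAGE_MASK; // round the address down to a page boundary (4 KB on i386)
    grow = (vma->vm_start - address) >> PAGE_SHIFT; // number of pages to grow by
    if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
        ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur) // refuse if the enlarged area would exceed the stack resource limit, or if the total number of mapped pages would exceed the address-space limit of the process
        return -ENOMEM; // no memory can be allocated
    /* update the range mapped by the vma */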
    vma->vm_start = address;
    vma->vm_pgoff -= grow;
    vma->vm_mm->total_vm += grow;
    if (vma->vm_flags & VM_LOCKED)
        vma->vm_mm->locked_vm += grow;
    return 0;
}
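
To make the arithmetic in expand_stack() concrete, here is a user-space sketch under simplifying assumptions: `struct toy_vma` and `toy_expand_stack()` are hypothetical stand-ins, only the stack-size limit is modelled (the RLIMIT_AS check and the total_vm/locked_vm accounting are left out), and the limit is passed in as a plain byte count.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical stand-in for the few vma fields that expand_stack() touches. */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff;
};

/*
 * Grow the area downwards so that it covers `address`.
 * Returns the number of pages added, or -1 if the new size would
 * exceed stack_limit (in bytes). The caller guarantees address < vm_start.
 */
long toy_expand_stack(struct toy_vma *vma, unsigned long address,
                      unsigned long stack_limit)
{
    unsigned long grow;

    assert(address < vma->vm_start);
    address &= PAGE_MASK;                            /* round down to a page boundary */
    grow = (vma->vm_start - address) >> PAGE_SHIFT;  /* pages to add */
    if (vma->vm_end - address > stack_limit)
        return -1;                                   /* would exceed the limit */
    vma->vm_start = address;                         /* area now starts lower */
    vma->vm_pgoff -= grow;                           /* keep the page offset consistent */
    return (long)grow;
}
```

For example, a stack area starting at 0xbfffe000 that faults at 0xbfffcabc grows by two pages and then starts at 0xbfffc000. As in the kernel function, no physical page is mapped at this point.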
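The virtual memory fault handler handle_mm_fault(); where it is called from was described above. Its main job is to update the page directory and the page tables, connecting the demand from above (the faulting address) with the supply from below (physical pages).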
/*
* By the time we get here, we already hold the mm semaphore
*/
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
    unsigned long address, int write_access)
{
    int ret = -1;
    pgd_t *pgd;
    pmd_t *pmd;

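    pgd = pgd_offset(mm, address); // pgd_offset is a macro that returns a pointer (a virtual address) to the pgd entry for this linear address. mm describes the memory of the faulting process; mm->pgd is the start of its top-level page table, which covers the 4 GB virtual address space.
    pmd = pmd_alloc(pgd, address); // on i386 with two-level paging the pmd is folded into the pgd, so this simply returns pgd and the if below always succeeds.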

    if (pmd) {
        pte_t * pte = pte_alloc(pmd, address);
        if (pte)
            ret = handle_pte_fault(mm, vma, address, write_access, pte);
    }
    return ret;
}

The pgd_offset macro: in include/asm-i386/pgtable.h.
/* to find an entry in a page-table-directory. */
#define pgd_index(address) ((address >> PGDIR_SHIFT) & (PTRS_PER_PGD-1)) // PGDIR_SHIFT is 22; after the right shift the value is the pgd index, i.e. the subscript into the page directory.

#define __pgd_offset(address) pgd_index(address)

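#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address)) // mm->pgd is the start of the page directory. The macro yields a pointer to the pgd entry for the linear address; that entry holds the start address of a page table.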
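
As an illustration of these macros, the sketch below splits a 32-bit linear address into the page directory index, the page table index and the byte offset, assuming classic two-level i386 paging (10 + 10 + 12 bits). The function names are hypothetical; the constants match the values discussed in this section.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024
#define PTRS_PER_PTE 1024

/* Top 10 bits: index into the page directory. */
unsigned pgd_index_of(uint32_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

/* Middle 10 bits: index into the page table. */
unsigned pte_index_of(uint32_t addr)
{
    return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

/* Low 12 bits: byte offset inside the page. */
unsigned page_offset_of(uint32_t addr)
{
    return addr & ((1u << PAGE_SHIFT) - 1);
}

/* Put the three fields back together into a linear address. */
uint32_t join_address(unsigned pgd, unsigned pte, unsigned off)
{
    assert(pgd < PTRS_PER_PGD && pte < PTRS_PER_PTE && off < (1u << PAGE_SHIFT));
    return ((uint32_t)pgd << PGDIR_SHIFT) | ((uint32_t)pte << PAGE_SHIFT) | off;
}
```

For the address 0xC0101234 this gives directory index 0x300, table index 0x101 and offset 0x234.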
The pte_alloc() function, in include/asm-i386/pgalloc.h. It handles the page-table stage, once the page directory entry has been found.
extern inline pte_t * pte_alloc(pmd_t * pmd, unsigned long address)
{
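    address = (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); // convert the linear address into an index within its page table, i.e. the second group of 10 bits from the top; PAGE_SHIFT is 12.

    if (pmd_none(*pmd)) // the directory entry is empty
        goto getnew;
    if (pmd_bad(*pmd)) // routine sanity check: the entry does not look like a valid page-table pointer. It looks redundant here.
        goto fix;
    return (pte_t *)pmd_page(*pmd) + address; // pointer to the entry in the page table that covers this address; the next step is the physical page itself.
getnew: // allocate a page table. A page table occupies exactly one physical page. When a page table is released, the kernel first parks it in a cache pool instead of freeing the physical page, and only frees pages once the pool is full. get_pte_fast() takes a page from that pool; if the pool is empty, get_pte_slow() has to allocate one.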
{
    unsigned long page = (unsigned long) get_pte_fast();
    if (!page)
        return get_pte_slow(pmd, address);
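    set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(page))); // write the attribute flags together with the physical start address into the page directory entry. The mapping "infrastructure" is now in place, but the pte itself is still empty; supplying the physical page is the job of handle_pte_fault().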
    return (pte_t *)page + address;
}
fix:
    __handle_bad_pmd(pmd);
    return NULL;
}

Taking a cached, not yet freed page table out of the pool:
extern __inline__ pte_t *get_pte_fast(void)
{
    unsigned long *ret;

    if((ret = (unsigned long *)pte_quicklist) != NULL) {
        pte_quicklist = (unsigned long *)(*ret);
        ret[0] = ret[1];
        pgtable_cache_size--;
    }
    return (pte_t *)ret;
}
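
The quicklist used by get_pte_fast() is an intrusive LIFO free list: the first word of each cached page stores the pointer to the next cached page. A user-space sketch of the same idea follows. `toy_quicklist`, `toy_cache_size`, `toy_free_fast()` and `toy_get_fast()` are hypothetical names, and unlike the kernel code (which copies `ret[1]` into `ret[0]`) the sketch simply zeroes the link word on the way out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache state, modelled on pte_quicklist / pgtable_cache_size. */
uintptr_t *toy_quicklist = NULL;
unsigned long toy_cache_size = 0;

/* Push a released page: its first word is reused as the "next" link. */
void toy_free_fast(uintptr_t *page)
{
    assert(page != NULL);
    page[0] = (uintptr_t)toy_quicklist;
    toy_quicklist = page;
    toy_cache_size++;
}

/* Pop a cached page, or return NULL when the cache is empty. */
uintptr_t *toy_get_fast(void)
{
    uintptr_t *ret = toy_quicklist;

    if (ret != NULL) {
        toy_quicklist = (uintptr_t *)ret[0];  /* unlink the head */
        ret[0] = 0;                           /* clear the link word */
        toy_cache_size--;
    }
    return ret;
}
```

Because the link lives inside the free page itself, the cache needs no extra memory, and both operations take constant time.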
Inside handle_mm_fault(), handle_pte_fault() is then called to process the page table entry. It lives in mm/memory.c:
static inline int handle_pte_fault(struct mm_struct *mm,struct vm_area_struct * vma, unsigned long address,int write_access, pte_t * pte)
{
    pte_t entry;

    /*
     * We need the page table lock to synchronize with kswapd
     * and the SMP-safe atomic PTE updates.
     */
    spin_lock(&mm->page_table_lock);
    entry = *pte; // read the page table entry
    if (!pte_present(entry)) { // is the page mapped by this entry present in memory?
        /*
         * If it truly wasn't present, we know that kswapd
         * and the PTE updates will not touch it later. So
         * drop the lock.
         */
        spin_unlock(&mm->page_table_lock);
        if (pte_none(entry)) // is the entry empty (never mapped)?
            return do_no_page(mm, vma, address, write_access, pte); // no page yet: set up a new mapping
        return do_swap_page(mm, vma, address, pte, pte_to_swp_entry(entry), write_access); // the page is on swap: bring it back in
    }
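    if (write_access) { // the fault was caused by a write
        if (!pte_write(entry)) // the page is not writable
            return do_wp_page(mm, vma, address, pte, entry); // handle the write-protect fault

        entry = pte_mkdirty(entry); // set the dirty flag
    }
    entry = pte_mkyoung(entry); // set the accessed flag
    establish_pte(vma, address, pte, entry); // write the updated entry back into the page table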
    spin_unlock(&mm->page_table_lock);
    return 1;
}
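
The decision tree in handle_pte_fault() can be summarised with a small classifier. This is a sketch only: `toy_classify()` and the `TOY_*` names are hypothetical, the entry is treated as a plain integer, and "present" is reduced to bit 0, which simplifies what pte_present() checks in the real kernel.

```c
#include <assert.h>

#define TOY_PAGE_PRESENT 0x001UL  /* bit 0: page is in memory */
#define TOY_PAGE_RW      0x002UL  /* bit 1: page is writable */

enum toy_fault_path {
    TOY_PATH_NO_PAGE,        /* empty entry: do_no_page() */
    TOY_PATH_SWAP_IN,        /* non-empty but not present: do_swap_page() */
    TOY_PATH_WRITE_PROTECT,  /* present, write to read-only page: do_wp_page() */
    TOY_PATH_UPDATE_FLAGS    /* present and allowed: only refresh dirty/accessed bits */
};

enum toy_fault_path toy_classify(unsigned long pte, int write_access)
{
    assert(write_access == 0 || write_access == 1);
    if (!(pte & TOY_PAGE_PRESENT)) {
        if (pte == 0)
            return TOY_PATH_NO_PAGE;
        return TOY_PATH_SWAP_IN;
    }
    if (write_access && !(pte & TOY_PAGE_RW))
        return TOY_PATH_WRITE_PROTECT;
    return TOY_PATH_UPDATE_FLAGS;
}
```

The order of the tests matters: an entry that is not present must be resolved first, because its remaining bits then describe a swap location rather than access permissions.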

/*
* do_no_page() tries to create a new page mapping. It aggressively
* tries to share with existing pages, but makes a separate copy if
* the "write_access" parameter is true in order to avoid the next
* page fault.
*
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
* This is called with the MM semaphore held.
*/
static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
    unsigned long address, int write_access, pte_t *page_table)
{
    struct page * new_page;
    pte_t entry;

    if (!vma->vm_ops || !vma->vm_ops->nopage) // vm_operations_struct is normally used for file-backed and shared mappings; the stack has none.
        return do_anonymous_page(mm, vma, page_table, write_access, address); // handle as an anonymous page

    /*
     * The third argument is "no_share", which tells the low-level code
     * to copy, not share the page even if sharing is possible. It's
     * essentially an early COW detection.
     */
    new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, (vma->vm_flags & VM_SHARED)?0:write_access);
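    /* Ask the mapping to supply the page: shared areas always share the cached page; private areas are handled as copy-on-write. */
    if (new_page == NULL)    /* no page was available -- SIGBUS */
        return 0; // the caller raises SIGBUS
    if (new_page == NOPAGE_OOM)
        return -1; // the caller kills the current process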
    ++mm->rss;
    /*
     * This silly early PAGE_DIRTY setting removes a race
     * due to the bad i386 page protection. But it's valid
     * for other architectures too.
     *
     * Note that if write_access is true, we either now have
     * an exclusive copy of the page, or this is a shared mapping,
     * so we can make it writable and dirty to avoid having to
     * handle that later.
     */
    flush_page_to_ram(new_page);
    flush_icache_page(vma, new_page);
    entry = mk_pte(new_page, vma->vm_page_prot); // build a page table entry for the new page
    if (write_access) { // the fault was caused by a write
        entry = pte_mkwrite(pte_mkdirty(entry)); // mark it dirty and writable
    } else if (page_count(new_page) > 1 && // the page has more than one user and the area is not shared
           !(vma->vm_flags & VM_SHARED))
        entry = pte_wrprotect(entry); // write-protect it
    set_pte(page_table, entry); // install the entry in the page table
    /* no need to invalidate: a not-present page shouldn't be cached */
    update_mmu_cache(vma, address, entry);
    return 2;    /* Major fault */
}
static int do_anonymous_page(struct mm_struct * mm, struct vm_area_struct * vma, pte_t *page_table, int write_access, unsigned long addr)
{
    struct page *page = NULL;
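    pte_t entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); // read fault: map the shared zero page read-only
    if (write_access) { // the fault was caused by a write
        page = alloc_page(GFP_HIGHUSER); // allocate a user page, from high memory if possible
        if (!page)
            return -1;
        clear_user_highpage(page, addr);
        entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); // map the new page writable and dirty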
        mm->rss++;
        flush_page_to_ram(page);
    }
    set_pte(page_table, entry);
    /* No need to invalidate - it was non-present before */
    update_mmu_cache(vma, addr, entry);
    return 1;    /* Minor fault */
}

Reading do_anonymous_page(), note that for a read fault the entry built by mk_pte() is adjusted with pte_wrprotect(), while for a write fault it is adjusted with pte_mkwrite(). Both are defined in include/asm-i386/pgtable.h:
static inline pte_t pte_wrprotect(pte_t pte) { (pte).pte_low &= ~_PAGE_RW; return pte; } // clears _PAGE_RW: the physical page may only be read
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } // the opposite of the previous function: sets _PAGE_RW, so the page may be both read and written
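
The effect of these helpers is easy to try out in user space. The sketch below uses the i386 flag bit positions (present = bit 0, read/write = bit 1, user = bit 2, dirty = bit 6) under `TOY_` names; `toy_pte_t` and `toy_anon_entry()` are hypothetical, and the sketch assumes the base protection of the area is present + user + writable, which stands in for `vma->vm_page_prot`.

```c
#include <assert.h>

#define TOY_PAGE_PRESENT 0x001UL
#define TOY_PAGE_RW      0x002UL
#define TOY_PAGE_USER    0x004UL
#define TOY_PAGE_DIRTY   0x040UL

typedef struct { unsigned long pte_low; } toy_pte_t;

toy_pte_t toy_wrprotect(toy_pte_t p) { p.pte_low &= ~TOY_PAGE_RW;   return p; }
toy_pte_t toy_mkwrite(toy_pte_t p)   { p.pte_low |= TOY_PAGE_RW;    return p; }
toy_pte_t toy_mkdirty(toy_pte_t p)   { p.pte_low |= TOY_PAGE_DIRTY; return p; }
int toy_pte_write(toy_pte_t p)       { return (p.pte_low & TOY_PAGE_RW) != 0; }

/*
 * Build the entry the way do_anonymous_page() does: read-only for a
 * read fault, writable and dirty for a write fault.
 */
toy_pte_t toy_anon_entry(unsigned long page_phys, int write_access)
{
    toy_pte_t e;

    assert((page_phys & 0xfffUL) == 0);  /* frame address must be page aligned */
    e.pte_low = page_phys | TOY_PAGE_PRESENT | TOY_PAGE_USER | TOY_PAGE_RW;
    return write_access ? toy_mkwrite(toy_mkdirty(e)) : toy_wrprotect(e);
}
```

A later write to the read-only entry faults again and is then handled on the write-protect path.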

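5》Use and turnover of physical pages
System initialization: every physical page gets a page structure. These form an array of page structures, and the global variable mem_map points to that array. Physically contiguous pages are then grouped into "blocks" as needed, and "zones" are built according to demand.

Every page slot on a swap device (usually a disk) also needs a matching in-memory data structure, mainly to record whether the slot has been allocated and how many users share it. The kernel defines the swap_info_struct structure for this, in include/linux/swap.h: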
struct swap_info_struct {
    unsigned int flags;
    kdev_t swap_device;
    spinlock_t sdev_lock;
    struct dentry * swap_file;
    struct vfsmount *swap_vfsmnt;
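    unsigned short * swap_map; // points to an array in which each unsigned short stands for one page slot on disk; the array index gives the position of the slot on the device or in the file. The array size depends on pages. The slot represented by swap_map[0] is never used for swapping: it holds information about the device or file plus a bitmap of the usable pages.
    unsigned int lowest_bit; // first slot in the file from which swapping may allocate
    unsigned int highest_bit; // last slot in the file up to which swapping may allocate
    unsigned int cluster_next; // pages are laid out on disk in clusters
    unsigned int cluster_nr;
    int prio;            /* swap priority */
    int pages;
    unsigned long max; // highest page number on this device or file, i.e. its physical size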
    int next;            /* next entry on swap list */
};
A swap_list structure is also defined; it links the swap_info_struct of every disk device or file that can supply page slots, in order of priority.
struct swap_list_t swap_list = {-1, -1}; // the list starts out empty
struct swap_list_t {
    int head;    /* head of priority-ordered swapfile list */
    int next;    /* swapfile to be used next */
};
typedef struct {
    unsigned long val;
} swp_entry_t;
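
A swp_entry_t packs the swap device number and the slot offset into one word. The sketch below assumes the i386 layout of that kernel generation: bit 0 stays clear (so the entry reads as "not present" when stored in a page table entry), bits 1 to 6 hold the type, and bits 8 to 31 hold the offset. The `toy_` names are hypothetical and the layout should be checked against the SWP_TYPE / SWP_OFFSET macros of the kernel version at hand.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val; } toy_swp_entry_t;

/* Assumed layout: bit 0 = 0, bits 1-6 = type, bit 7 unused, bits 8-31 = offset. */
toy_swp_entry_t toy_swp_entry(unsigned type, uint32_t offset)
{
    toy_swp_entry_t e;

    assert(type < 64 && offset < (1u << 24));
    e.val = ((uint32_t)type << 1) | (offset << 8);
    return e;
}

/* Index of the swap device or file, i.e. the subscript into swap_info[]. */
unsigned toy_swp_type(toy_swp_entry_t e)
{
    return (e.val >> 1) & 0x3f;
}

/* Page slot number inside that device or file. */
uint32_t toy_swp_offset(toy_swp_entry_t e)
{
    return e.val >> 8;
}
```

Under this layout, device 3 with slot 0x1234 encodes as 0x123406, and both fields can be recovered without any lookup.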

The function that releases a swap page, __swap_free(), in mm/swapfile.c:
void __swap_free(swp_entry_t entry, unsigned short count)
{
    struct swap_info_struct * p;
    unsigned long offset, type;

    if (!entry.val) // a zero entry refers to no swap page (slot 0 is never used for swapping)
        goto out;

    type = SWP_TYPE(entry); // the swap device number, i.e. the index into the swap_info[] array of swap_info_struct
    if (type >= nr_swapfiles) // nr_swapfiles is the highest swap_info index ever used; it is never lowered
        goto bad_nofile;
    p = & swap_info[type];
    if (!(p->flags & SWP_USED)) // is this swap device in use at all? #define SWP_USED    1
        goto bad_device;
    offset = SWP_OFFSET(entry); // the upper 24 bits of entry: the position of the page on the device or in the file, i.e. its logical page number there; type says which file it is (a sequence number)
    if (offset >= p->max) // does the page number lie beyond the end of this file or device?
        goto bad_offset;
    if (!p->swap_map[offset]) // swap_map holds the allocation and use count of the slot; 0 means it was never allocated
        goto bad_free;
    swap_list_lock();
    if (p->prio > swap_info[swap_list.next].prio)
        swap_list.next = type;
    swap_device_lock(p);
    if (p->swap_map[offset] < SWAP_MAP_MAX) { // only counts below SWAP_MAP_MAX are decremented
        if (p->swap_map[offset] < count)
            goto bad_count;
        if (!(p->swap_map[offset] -= count)) {
            if (offset < p->lowest_bit)
                p->lowest_bit = offset;
            if (offset > p->highest_bit)
                p->highest_bit = offset;
            nr_swap_pages++;
        }
    }
    swap_device_unlock(p);
    swap_list_unlock();
out:
    return;

bad_nofile:
    printk("swap_free: Trying to free nonexistent swap-page\n");
    goto out;
bad_device:
    printk("swap_free: Trying to free swap from unused swap-device\n");
    goto out;
bad_offset:
    printk("swap_free: offset exceeds max\n");
    goto out;
bad_free:
    printk("VM: Bad swap entry %08lx\n", entry.val);
    goto out;
bad_count:
    swap_device_unlock(p);
    swap_list_unlock();
    printk(KERN_ERR "VM: Bad count %hd current count %hd\n", count, p->swap_map[offset]);
    goto out;
}
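
Stripped of locking and diagnostics, the core of __swap_free() is reference counting on swap_map plus widening of the [lowest_bit, highest_bit] search window. A user-space sketch follows; `struct toy_swap_area` and `toy_swap_free()` are hypothetical, and `TOY_SWAP_MAP_MAX` is an assumed ceiling value that mirrors the role of SWAP_MAP_MAX.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SWAP_MAP_MAX 0x7fff  /* assumed ceiling, mirroring SWAP_MAP_MAX */

struct toy_swap_area {
    unsigned short *swap_map;   /* per-slot use counts */
    unsigned int lowest_bit;    /* lowest slot worth searching */
    unsigned int highest_bit;   /* highest slot worth searching */
    unsigned long max;          /* number of slots */
};

/* Returns 1 if the slot became free, 0 if it is still in use, -1 on a bad request. */
int toy_swap_free(struct toy_swap_area *p, unsigned long offset, unsigned short count)
{
    assert(p != NULL && p->swap_map != NULL);
    if (offset >= p->max)
        return -1;                              /* beyond the device */
    if (p->swap_map[offset] == 0)
        return -1;                              /* slot was not allocated */
    if (p->swap_map[offset] >= TOY_SWAP_MAP_MAX)
        return 0;                               /* saturated count: left untouched */
    if (p->swap_map[offset] < count)
        return -1;                              /* would underflow */
    p->swap_map[offset] = (unsigned short)(p->swap_map[offset] - count);
    if (p->swap_map[offset] != 0)
        return 0;                               /* other users remain */
    if (offset < p->lowest_bit)
        p->lowest_bit = (unsigned int)offset;   /* widen the search window */
    if (offset > p->highest_bit)
        p->highest_bit = (unsigned int)offset;
    return 1;
}
```

Widening the window only when a slot actually becomes free keeps later allocation scans short.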

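Page turnover has two meanings. The first is the allocation, use and reclaim of pages, which does not involve swapping to disk. The second is swapping to disk, whose ultimate purpose is also to reclaim pages.
Only pages mapped into user space are swapped out; pages belonging to the kernel (system space) never are.
By content and nature, user-space pages fall into the following kinds:
    Ordinary user-space pages: the code, data and stack segments of a process, plus its dynamically allocated heap.
    Contents of open files mapped into user space with the mmap() system call.
    Shared memory areas between processes.

The key points of swapping physical pages in and out are:
    Free: the page structure of the page is linked through its list head list into the free_area queue of some zone; the use count of the page is 0.
    Allocated: __alloc_pages() or __get_free_page() takes the page from a free queue and sets its use count to 1; the list head list in its page structure is then no longer on any queue.
    Active: the page structure is linked through its list head lru into the active queue active_list, and at least one process has a user-space page table entry pointing to the page. Each time a mapping to the page is established or restored, its use count is incremented by 1.
    Inactive (dirty): the page structure is linked through its list head lru into the inactive dirty queue inactive_dirty_pages; in principle no page table entry of any process points to the page any longer. Each time a mapping is removed, the use count is decremented by 1.
    The contents of an inactive dirty page are written to the swap device, and its page structure is moved from the inactive dirty queue to an inactive clean queue.
    Inactive (clean): the page structure is linked through its list head lru into an inactive clean queue; each zone has its own inactive clean queue, inactive_clean_list.
    If the page is accessed within some time after becoming inactive, it returns to the active state and its mapping is restored.
    When needed, pages are reclaimed from the clean queue and either returned to a free queue or directly allocated again.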
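
The life cycle listed above can be read as a small state machine. The sketch below encodes only the transitions named in that list; the enum, `toy_can_move()` and `toy_page_move()` are hypothetical and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stddef.h>

enum toy_page_state {
    TOY_FREE,            /* on a zone's free_area queue, count 0 */
    TOY_ALLOCATED,       /* handed out, count 1 */
    TOY_ACTIVE,          /* on active_list, mapped by some process */
    TOY_INACTIVE_DIRTY,  /* unmapped, contents not yet written out */
    TOY_INACTIVE_CLEAN   /* unmapped, contents safe on swap */
};

/* Is the move one of the transitions described in the list above? */
int toy_can_move(enum toy_page_state from, enum toy_page_state to)
{
    switch (from) {
    case TOY_FREE:
        return to == TOY_ALLOCATED;
    case TOY_ALLOCATED:
        return to == TOY_ACTIVE;
    case TOY_ACTIVE:
        return to == TOY_INACTIVE_DIRTY;
    case TOY_INACTIVE_DIRTY:
        /* written out to swap, or touched again and reactivated */
        return to == TOY_INACTIVE_CLEAN || to == TOY_ACTIVE;
    case TOY_INACTIVE_CLEAN:
        /* reactivated, returned to a free queue, or reallocated directly */
        return to == TOY_ACTIVE || to == TOY_FREE || to == TOY_ALLOCATED;
    }
    return 0;
}

/* Apply a move if it is legal; returns 1 on success, 0 if refused. */
int toy_page_move(enum toy_page_state *cur, enum toy_page_state to)
{
    assert(cur != NULL);
    if (!toy_can_move(*cur, to))
        return 0;
    *cur = to;
    return 1;
}
```

The point of the two inactive states is visible in the table: a page leaves the system only through the clean queue, so its contents are always saved before the frame is reused.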
struct address_space {
    struct list_head    clean_pages;    /* list of clean pages */
    struct list_head    dirty_pages;    /* list of dirty pages */
    struct list_head    locked_pages;    /* list of locked pages */
    unsigned long        nrpages;    /* number of total pages */
    struct address_space_operations *a_ops;    /* methods */
    struct inode        *host;        /* owner: inode, block_device */
    struct vm_area_struct    *i_mmap;    /* list of private mappings */
    struct vm_area_struct    *i_mmap_shared; /* list of shared mappings */
    spinlock_t        i_shared_lock; /* and spinlock protecting it */
};
This structure manages all swappable memory pages. The page structure of every swappable page is linked through its list head list into one of these queues.
