Learning memory management and the mmap mechanism, with diagrams and code for easy understanding

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Introduction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is the last article in the memory series and covers memory mapping. Memory mapping is not only the mapping between physical and virtual memory; it also includes mapping the contents of a file into the virtual address space, so that accessing that memory accesses the data in the file. A mapping that involves only physical and virtual memory is the special case. This article first analyzes brk, which user space uses to request small blocks on the heap, and mmap, which it uses for large blocks, and then turns to the kernel-side mapping mechanisms vmalloc, kmap_atomic and swapper_pg_dir, as well as kernel-mode page faults.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/0voice/computer_expert_paper","title":"","type":null},"content":[{"type":"text","text":"1000+ computer science papers","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"P.S. The editor's GitHub project has just launched. Your support is much appreciated; if you find it useful, please give it a star!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. 
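User-Space Memory Mapping","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A user-space call to malloc() allocates heap memory, but what it actually performs is a user-space memory mapping. Depending on the size requested, the underlying system call is mainly brk() or mmap() (mmap() can of course also be called directly to map a file). For small blocks (under 128 KB, the default threshold in glibc), the C standard library allocates with brk(), that is, by moving the top of the heap. This memory is not returned to the system immediately after it is freed; it is cached so that it can be reused. Large blocks (over 128 KB) are allocated directly with mmap(), that is, by finding a free region in the file-mapping segment. Each approach has its own strengths and weaknesses.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The caching done with brk() reduces page faults and improves memory access efficiency. However, because this memory is not returned to the system, frequent allocation and freeing under heavy memory load causes fragmentation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Memory allocated with mmap() is returned to the system as soon as it is freed, so every mmap() allocation incurs page faults on first access. Under heavy memory load, frequent allocation leads to a large number of page faults and increases the kernel's management overhead. This is why malloc() uses mmap() only for large blocks.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Small Block Allocation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The brk() system call is implemented by sys_brk(). Its parameter brk is the new top of the heap, while mm->brk is the old one. The main logic is as follows:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compare the page-aligned addresses of the old and new heap tops to determine whether they fall in the same page","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If they are in the same page, no new page is needed: set mm->brk to the new brk and jump straight to success","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If they are not in the same page","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"typ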
e":"text","text":"If the new heap top is below the old one, memory is being released rather than allocated, so __do_munmap() is called to release it","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If memory is being allocated, find_vma() is called to look up, in the vm_area_struct red-black tree, the vm_area_struct that follows the old heap top. If there is enough room between the two for one more page, do_brk_flags() is called to extend the heap; otherwise the allocation fails.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":2,"number":0,"align":null,"origin":null}}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"SYSCALL_DEFINE1(brk, unsigned long, brk)\n{\n unsigned long retval;\n unsigned long newbrk, oldbrk, origbrk;\n struct mm_struct *mm = current->mm;\n struct vm_area_struct *next;\n......\n newbrk = PAGE_ALIGN(brk);\n oldbrk = PAGE_ALIGN(mm->brk);\n if (oldbrk == newbrk) {\n mm->brk = brk;\n goto success;\n }\n /*\n * Always allow shrinking brk.\n * __do_munmap() may downgrade mmap_sem to read.\n */\n if (brk <= mm->brk) {\n int ret;\n /*\n * mm->brk must to be protected by write mmap_sem so update it\n * before downgrading mmap_sem. When __do_munmap() fails,\n * mm->brk will be restored from origbrk.\n */\n mm->brk = brk;\n ret = __do_munmap(mm, newbrk, oldbrk-newbrk, &uf, true);\n if (ret < 0) {\n mm->brk = origbrk;\n goto out;\n } else if (ret == 1) {\n downgraded = true;\n }\n goto success;\n }\n /* Check against existing mmap mappings. */\n next = find_vma(mm, oldbrk);\n if (next && newbrk + PAGE_SIZE > vm_start_gap(next))\n goto out;\n /* Ok, looks good - let it rip. 
*/\n if (do_brk_flags(oldbrk, newbrk-oldbrk, 0, &uf) < 0)\n goto out;\n mm->brk = brk;\nsuccess:\n populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;\n if (downgraded)\n up_read(&mm->mmap_sem);\n else\n up_write(&mm->mmap_sem);\n userfaultfd_unmap_complete(mm, &uf);\n if (populate)\n mm_populate(oldbrk, newbrk - oldbrk);\n return brk;\nout:\n retval = origbrk;\n up_write(&mm->mmap_sem);\n return retval;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"do_brk_flags() calls find_vma_links() to find where the future vm_area_struct node belongs in the red-black tree, along with its parent and its predecessor. It then calls vma_merge() to see whether the new region can be merged with a node already in the tree. If the addresses are contiguous and the merge succeeds, no new vm_area_struct is needed and the function jumps straight to out to update the statistics. If it cannot be merged, a new vm_area_struct is created and linked by vma_link() into both the mm's VMA list and the red-black tree.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/*\n * this is really a simplified \"do_mmap\". it only handles\n * anonymous maps. eventually we may be able to do some\n * brk-specific accounting here.\n */\nstatic int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)\n{\n struct mm_struct *mm = current->mm;\n struct vm_area_struct *vma, *prev;\n struct rb_node **rb_link, *rb_parent;\n......\n /*\n * Clear old maps. this also does some error checking for us\n */\n while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,\n &rb_parent)) {\n if (do_munmap(mm, addr, len, uf))\n return -ENOMEM;\n }\n......\n /* Can we just expand an old private anonymous mapping? 
*/\n vma = vma_merge(mm, prev, addr, addr + len, flags,\n NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);\n if (vma)\n goto out;\n /*\n * create a vma struct for an anonymous mapping\n */\n vma = vm_area_alloc(mm);\n if (!vma) {\n vm_unacct_memory(len >> PAGE_SHIFT);\n return -ENOMEM;\n }\n vma_set_anonymous(vma);\n vma->vm_start = addr;\n vma->vm_end = addr + len;\n vma->vm_pgoff = pgoff;\n vma->vm_flags = flags;\n vma->vm_page_prot = vm_get_page_prot(flags);\n vma_link(mm, vma, prev, rb_link, rb_parent);\nout:\n perf_event_mmap(vma);\n mm->total_vm += len >> PAGE_SHIFT;\n mm->data_vm += len >> PAGE_SHIFT;\n if (flags & VM_LOCKED)\n mm->locked_vm += (len >> PAGE_SHIFT);\n vma->vm_flags |= VM_SOFTDIRTY;\n return 0;\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Large Block Allocation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Large blocks are requested through the mmap system call. mmap can map virtual memory to physical memory, and it can also map a file into the caller's virtual address space. When a file is mapped, the chain is really virtual memory to physical memory to file.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,\n unsigned long, prot, unsigned long, flags,\n unsigned long, fd, unsigned long, off)\n{\n long error;\n error = -EINVAL;\n if (off & ~PAGE_MASK)\n goto out;\n error = ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);\nout:\n return 
error;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main work is done in ksys_mmap_pgoff(), whose logic is:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Check whether the mapping is anonymous. If not, it is a file mapping, and fget() is called to obtain the struct file for the file descriptor","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If it is anonymous, check whether huge pages are requested. If so, align the length and call hugetlb_file_setup() to obtain a struct file","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call vm_mmap_pgoff() to find a region that can be mapped and establish the mapping","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,\n unsigned long prot, unsigned long flags,\n unsigned long fd, unsigned long pgoff)\n{\n struct file *file = NULL;\n unsigned long retval;\n if (!(flags & MAP_ANONYMOUS)) {\n audit_mmap_fd(fd, flags);\n file = fget(fd);\n if (!file)\n return -EBADF;\n if (is_file_hugepages(file))\n len = ALIGN(len, huge_page_size(hstate_file(file)));\n retval = -EINVAL;\n if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))\n goto out_fput;\n } else if (flags & MAP_HUGETLB) {\n struct user_struct *user = NULL;\n struct hstate *hs;\n hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);\n if (!hs)\n return -EINVAL;\n len = ALIGN(len, huge_page_size(hs));\n /*\n * VM_NORESERVE is used because the reservations will be\n * taken when vm_ops->mmap() is called\n * A dummy user value is used because we are not locking\n * memory so no accounting is necessary\n */\n file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,\n 
VM_NORESERVE,\n &user, HUGETLB_ANONHUGE_INODE,\n (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);\n if (IS_ERR(file))\n return PTR_ERR(file);\n }\n flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);\n retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);\nout_fput:\n if (file)\n fput(file);\n return retval;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vm_mmap_pgoff() calls do_mmap_pgoff(), which in turn calls do_mmap(). Here get_unmapped_area() finds a region that can be mapped, and mmap_region() maps it.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/*\n * The caller must hold down_write(&current->mm->mmap_sem).\n */\nunsigned long do_mmap(struct file *file, unsigned long addr,\n unsigned long len, unsigned long prot,\n unsigned long flags, vm_flags_t vm_flags,\n unsigned long pgoff, unsigned long *populate,\n struct list_head *uf)\n{\n struct mm_struct *mm = current->mm;\n int pkey = 0;\n *populate = 0;\n......\n /* Obtain the address to map to. 
we verify (or select) it and ensure\n * that it represents a valid section of the address space.\n */\n addr = get_unmapped_area(file, addr, len, pgoff, flags);\n......\n addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);\n if (!IS_ERR_VALUE(addr) &&\n ((vm_flags & VM_LOCKED) ||\n (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))\n *populate = len;\n return addr;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, look at get_unmapped_area(), which finds the region to map.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For an anonymous mapping, the get_unmapped_area function pointer in mm_struct is called, which is in fact arch_get_unmapped_area(). It calls find_vma_prev() to find the right position in the red-black tree of vm_area_struct regions. It is called prev because the new region does not exist yet, so what is found is the preceding vm_area_struct.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a file mapping: in Linux every open file has a struct file, which holds a file_operations describing the operations on that file. The file's own f_op->get_unmapped_area is used if it is set; for the familiar ext4 file system, what ends up being called is again the get_unmapped_area function pointer.","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"unsigned long\nget_unmapped_area(struct file *file, unsigned long addr, unsigned long len,\n unsigned long pgoff, unsigned long flags)\n{\n unsigned long (*get_area)(struct file *, unsigned long,\n unsigned long, unsigned long, unsigned long);\n unsigned long error = arch_mmap_check(addr, len, flags);\n if (error)\n return error;\n /* Careful about overflows.. 
*/\n if (len > TASK_SIZE)\n return -ENOMEM;\n get_area = current->mm->get_unmapped_area;\n if (file) {\n if (file->f_op->get_unmapped_area)\n get_area = file->f_op->get_unmapped_area;\n } else if (flags & MAP_SHARED) {\n /*\n * mmap_region() will call shmem_zero_setup() to create a file,\n * so use shmem's get_unmapped_area in case it can be huge.\n * do_mmap_pgoff() will clear pgoff, so match alignment.\n */\n pgoff = 0;\n get_area = shmem_get_unmapped_area;\n }\n addr = get_area(file, addr, len, pgoff, flags);\n if (IS_ERR_VALUE(addr))\n return addr;\n if (addr > TASK_SIZE - len)\n return -ENOMEM;\n if (offset_in_page(addr))\n return -EINVAL;\n error = security_mmap_addr(addr);\n return error ? error : addr;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mmap_region() first rechecks that the address space meets the requirements, then clears any old mappings and verifies that memory is available. If everything checks out, it calls vma_link() to attach the newly created vm_area_struct to the red-black tree in mm_struct.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"unsigned long mmap_region(struct file *file, unsigned long addr,\n unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,\n struct list_head *uf)\n{\n struct mm_struct *mm = current->mm;\n struct vm_area_struct *vma, *prev;\n int error;\n struct rb_node **rb_link, *rb_parent;\n unsigned long charged = 0;\n......\n vma_link(mm, vma, prev, rb_link, rb_parent);\n /* Once vma denies write, undo our temporary denial count */\n if (file) {\n if (vm_flags & VM_SHARED)\n mapping_unmap_writable(file->f_mapping);\n if (vm_flags & VM_DENYWRITE)\n allow_write_access(file);\n }\n file = vma->vm_file;\n......\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vma_link() itself is a wrapper around __vma_link() and __vma_link_file():","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static void vma_link(struct mm_struct *mm, struct 
vm_area_struct *vma,\n struct vm_area_struct *prev, struct rb_node **rb_link,\n struct rb_node *rb_parent)\n{\n struct address_space *mapping = NULL;\n if (vma->vm_file) {\n mapping = vma->vm_file->f_mapping;\n i_mmap_lock_write(mapping);\n }\n __vma_link(mm, vma, prev, rb_link, rb_parent);\n __vma_link_file(vma);\n if (mapping)\n i_mmap_unlock_write(mapping);\n mm->map_count++;\n validate_mm(mm);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__vma_link() mainly inserts into the linked list and the red-black tree. These are basic data structure operations and are not covered in detail here.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"static void\n__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,\n struct vm_area_struct *prev, struct rb_node **rb_link,\n struct rb_node *rb_parent)\n{\n __vma_link_list(mm, vma, prev, rb_parent);\n __vma_link_rb(mm, vma, rb_link, rb_parent);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__vma_link_file() handles file mappings. The f_mapping member of struct file points to an address_space structure, which holds the i_mmap red-black tree (an interval tree) to which the vm_area_struct is attached.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"static void __vma_link_file(struct vm_area_struct *vma)\n{\n struct file *file;\n file = vma->vm_file;\n if (file) {\n struct address_space *mapping = file->f_mapping;\n if (vma->vm_flags & VM_DENYWRITE)\n atomic_dec(&file_inode(file)->i_writecount);\n if (vma->vm_flags & VM_SHARED)\n atomic_inc(&mapping->i_mmap_writable);\n flush_dcache_mmap_lock(mapping);\n vma_interval_tree_insert(vma, &mapping->i_mmap);\n flush_dcache_mmap_unlock(mapping);\n 
}\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This completes the user-space memory mapping. At this point, however, only a new region of virtual memory has been set up; no physical memory has been touched. Physical pages are allocated only when the process actually accesses the region, that is, when a page fault occurs.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. User-Space Page Faults","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once an address in virtual memory is accessed and no physical page backs it, a page fault is raised and do_page_fault() is called. The logic is:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use fault_in_kernel_space() to check whether the fault is in kernel space; if so, call vmalloc_fault()","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If it is a user-space fault, call find_vma() to find the vm_area_struct region containing the address","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call handle_mm_fault() to map the region that was found","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"/*\n * This routine handles page faults. It determines the address,\n * and the problem, and then passes it off to one of the appropriate\n * routines.\n */\nasmlinkage void __kprobes do_page_fault(struct pt_regs *regs,\n unsigned long error_code,\n unsigned long address)\n{\n......\n /*\n * We fault-in kernel-space virtual memory on-demand. The\n * 'reference' page table is init_mm.pgd.\n *\n * NOTE! We MUST NOT take any locks for this case. 
We may\n * be in an interrupt or a critical region, and should\n * only copy the information from the master page table,\n * nothing more.\n */\n if (unlikely(fault_in_kernel_space(address))) {\n if (vmalloc_fault(address) >= 0)\n return;\n if (notify_page_fault(regs, vec))\n return;\n bad_area_nosemaphore(regs, error_code, address);\n return;\n }\n......\n vma = find_vma(mm, address);\n......\n /*\n * If for any reason at all we couldn't handle the fault,\n * make sure we exit gracefully rather than endlessly redo\n * the fault.\n */\n fault = handle_mm_fault(vma, address, flags);\n......\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"find_vma() is a red-black tree lookup and is not described further here. The focus below is handle_mm_fault(), which after a series of checks calls either hugetlb_fault() or __handle_mm_fault(), depending on whether the region uses huge pages.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,\n unsigned int flags)\n{\n......\n if (unlikely(is_vm_hugetlb_page(vma)))\n ret = hugetlb_fault(vma->vm_mm, vma, address, flags);\n else\n ret = __handle_mm_fault(vma, address, flags);\n......\n return ret;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__handle_mm_fault() performs the actual mapping. It involves the five-level page table made up of pgd, p4d, pud, pmd and pte. Once the upper levels have been filled in, handle_pte_fault() is called to create the page table entry.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/91/91485517bacdb43e49f71c14d24e72ce.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"static vm_fault_t __handle_mm_fault(struct 
vm_area_struct *vma,\n unsigned long address, unsigned int flags)\n{\n struct vm_fault vmf = {\n .vma = vma,\n .address = address & PAGE_MASK,\n .flags = flags,\n .pgoff = linear_page_index(vma, address),\n .gfp_mask = __get_fault_gfp_mask(vma),\n };\n unsigned int dirty = flags & FAULT_FLAG_WRITE;\n struct mm_struct *mm = vma->vm_mm;\n pgd_t *pgd;\n p4d_t *p4d;\n vm_fault_t ret;\n pgd = pgd_offset(mm, address);\n p4d = p4d_alloc(mm, pgd, address);\n......\n vmf.pud = pud_alloc(mm, p4d, address);\n......\n vmf.pmd = pmd_alloc(mm, vmf.pud, address);\n......\n return handle_pte_fault(&vmf);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"handle_pte_fault() handles the following three cases:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The page table entry has never existed, so this is a new mapping","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For an anonymous mapping, map to a physical page by calling do_anonymous_page()","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a file mapping, call do_fault()","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The page table entry existed before but is no longer present, meaning the page was swapped out of physical memory; call do_swap_page() to bring it back","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"/*\n * These routines also need to handle stuff like marking pages dirty\n * and/or accessed for 
architectures that don't do it in hardware (most\n * RISC architectures). The early dirtying is also good on the i386.\n *\n * There is also a hook called \"update_mmu_cache()\" that architectures\n * with external mmu caches can use to update those (ie the Sparc or\n * PowerPC hashed page tables that act as extended TLBs).\n *\n * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow\n * concurrent faults).\n *\n * The mmap_sem may have been released depending on flags and our return value.\n * See filemap_fault() and __lock_page_or_retry().\n */\nstatic vm_fault_t handle_pte_fault(struct vm_fault *vmf)\n{\n pte_t entry;\n......\n /*\n * A regular pmd is established and it can't morph into a huge\n * pmd from under us anymore at this point because we hold the\n * mmap_sem read mode and khugepaged takes it in write mode.\n * So now it's safe to run pte_offset_map().\n */\n vmf->pte = pte_offset_map(vmf->pmd, vmf->address);\n vmf->orig_pte = *vmf->pte;\n......\n if (!vmf->pte) {\n if (vma_is_anonymous(vmf->vma))\n return do_anonymous_page(vmf);\n else\n return do_fault(vmf);\n }\n if (!pte_present(vmf->orig_pte))\n return do_swap_page(vmf);\n......\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Anonymous Page Mapping","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For an anonymous page mapping, the flow is:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call pte_alloc() to allocate the page table entry","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Allocate a page with alloc_zeroed_user_highpage_movable(), which calls alloc_pages_vma() and ultimately 
__alloc_pages_nodemask(). This is the core function of the buddy system, used to allocate physical pages, and was analyzed in detail in an earlier article.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call mk_pte() to make the newly allocated page table entry point to the allocated page","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call set_pte_at() to install the entry in the page table","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"/*\n * We enter with non-exclusive mmap_sem (to exclude vma changes,\n * but allow concurrent faults), and pte mapped but not yet locked.\n * We return with mmap_sem still held, but pte unmapped and unlocked.\n */\nstatic vm_fault_t do_anonymous_page(struct vm_fault *vmf)\n{\n struct vm_area_struct *vma = vmf->vma;\n struct mem_cgroup *memcg;\n struct page *page;\n vm_fault_t ret = 0;\n pte_t entry;\n......\n /*\n * Use pte_alloc() instead of pte_alloc_map(). 
We can't run\n * pte_offset_map() on pmds where a huge pmd might be created\n * from a different thread.\n *\n * pte_alloc_map() is safe to use under down_write(mmap_sem) or when\n * parallel threads are excluded by other means.\n *\n * Here we only have down_read(mmap_sem).\n */\n if (pte_alloc(vma->vm_mm, vmf->pmd))\n return VM_FAULT_OOM;\n......\n page = alloc_zeroed_user_highpage_movable(vma, vmf->address);\n......\n entry = mk_pte(page, vma->vm_page_prot);\n if (vma->vm_flags & VM_WRITE)\n entry = pte_mkwrite(pte_mkdirty(entry));\n vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,\n &vmf->ptl);\n......\n set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);\n......\n}\n\n#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \\\n alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 File Mapping","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a file mapping, do_fault() calls the fault function in vm_ops, which resolves to a different function for each file system. In ext4, for example, vm_ops points to ext4_file_vm_ops, so the call goes to ext4_filemap_fault(), which in turn calls filemap_fault() to do the actual file mapping.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static vm_fault_t do_fault(struct vm_fault *vmf)\n{\n struct vm_area_struct *vma = vmf->vma;\n struct mm_struct *vm_mm = vma->vm_mm;\n vm_fault_t ret;\n\n if (!vma->vm_ops->fault) {\n......\n}\n\nvm_fault_t ext4_filemap_fault(struct vm_fault *vmf)\n{\n......\n ret = 
filemap_fault(vmf);\n......\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main logic of filemap_fault() is:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call find_get_page() to look up the page-cache page in physical memory that corresponds to the mapped file vm_file","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If it is found, call do_async_mmap_readahead() to read some data ahead into memory","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Otherwise, call pagecache_get_page() to allocate a cache page, add it to the LRU list, and read the file contents into it through the address_space","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"vm_fault_t filemap_fault(struct vm_fault *vmf)\n{\n int error;\n struct file *file = vmf->vma->vm_file;\n struct file *fpin = NULL;\n struct address_space *mapping = file->f_mapping;\n struct file_ra_state *ra = &file->f_ra;\n struct inode *inode = mapping->host;\n......\n /*\n * Do we have something in the page cache already?\n */\n page = find_get_page(mapping, offset);\n if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {\n /*\n * We found the page, so try async readahead before\n * waiting for the lock.\n */\n fpin = do_async_mmap_readahead(vmf, page);\n } else if (!page) {\n /* No page in the page cache at all */\n...... 
\n}\n \nstruct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,\n int fgp_flags, gfp_t gfp_mask)\n{\n......\n page = __page_cache_alloc(gfp_mask);\n......\n err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);\n......\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 Page Swapping","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned in an earlier article, mapped pages in physical memory can be reclaimed to disk through either active or passive reclaim. When that data is accessed again, do_swap_page() reads it back from disk.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"do_swap_page() works as follows. It first checks whether the swap cache already holds the page. If not, it calls swapin_readahead() to read the data from the swap file into a memory page, and builds a page table entry with mk_pte(). set_pte_at() then inserts the entry into the page table, and swap_free() releases the swap entry, since the data is back in memory and the swap copy is no longer needed.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"c"},"content":[{"type":"text","text":"vm_fault_t do_swap_page(struct vm_fault *vmf)\n{\n......\n entry = pte_to_swp_entry(vmf->orig_pte);\n......\n page = lookup_swap_cache(entry, vma, vmf->address);\n swapcache = page;\n if (!page) {\n struct swap_info_struct *si = swp_swap_info(entry);\n if (si->flags & SWP_SYNCHRONOUS_IO &&\n __swap_count(si, entry) == 1) {\n /* skip swapcache */\n page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,\n vmf->address);\n......\n } else {\n page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,\n vmf);\n swapcache = page;\n }\n......\n pte = mk_pte(page, vma->vm_page_prot);\n......\n set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);\n arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);\n vmf->orig_pte = pte;\n......\n 
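swap_free(entry);\n......\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the steps above, the user-space page fault has been handled: the page is in physical memory and the page table maps it. From here on, the user program can use the virtual address to reach the data on the physical page through the page table. Page tables are usually large and can only be kept in memory, so every memory access takes two steps: first a page table lookup to get the physical address, then the access to that physical address to fetch the instruction or data. To speed up translation, hardware provides the ","attrs":{}},{"type":"link","attrs":{"href":"https://en.wikipedia.org/wiki/Translation_lookaside_buffer#:~:text=A%20translation%20lookaside%20buffer%20(TLB,called%20an%20address%2Dtranslation%20cache","title":null,"type":null},"content":[{"type":"text","text":"TLB","attrs":{}}]},{"type":"text","text":" (Translation Lookaside Buffer), often called the fast table, a hardware unit dedicated to address translation. It is not in main memory; it holds relatively little data but is faster than memory. The TLB can be thought of as a cache of the page table: it stores the page table entries most likely to be accessed, and its contents are a copy of a subset of them. With a TLB, the TLB is consulted first; if it holds the mapping, the address is translated to a physical address directly. Only when the mapping is not found in the TLB is the page table in memory consulted.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Kernel-Mode Memory Mapping and Page Faults","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Just as user space has malloc(), the kernel has its own allocation and mapping functions. vmalloc() allocates physically non-contiguous pages (from the buddy system). kmem_cache_create() and kmem_cache_alloc() use the slub allocator to hand out small blocks. kmalloc() resembles malloc(): it uses the buddy system for large allocations and the slub allocator for small ones. After allocation, the memory is given a virtual address that is mapped in the kernel page table and can be accessed directly when needed. Because vmalloc() maps virtually contiguous pages onto physically non-contiguous ones, it is generally slower and used less often; kmalloc() is used far more frequently. kmem_cache_create() and kmem_cache_alloc() provide precisely sized small blocks for specific purposes and are therefore also common.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unlike user space, the kernel also has a special kind of mapping: the temporary mapping. To save address space, high memory in the kernel is mapped temporarily, using kmap_atomic(). On a 32-bit system with high memory, set_pte is called to create the temporary mapping in the kernel page table. On a 64-bit system, which has no high memory, page_address is called, which in turn calls lowmem_page_address. For low memory, in other words, the address is simply converted with __va.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"void *kmap_atomic_prot(struct page *page, pgprot_t prot)\n{\n......\n if (!PageHighMem(page))\n return page_address(page);\n......\n vaddr = 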
__fix_to_virt(FIX_KMAP_BEGIN + idx);\n set_pte(kmap_pte-idx, mk_pte(page, prot));\n......\n return (void *)vaddr;\n}\n\nvoid *kmap_atomic(struct page *page)\n{\n return kmap_atomic_prot(page, kmap_prot);\n}\n\nstatic __always_inline void *lowmem_page_address(const struct page *page)\n{\n return page_to_virt(page);\n}\n\n#define page_to_virt(x) __va(PFN_PHYS(page_to_pfn(x)))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When kmap_atomic() finds no page table entry, it creates one on the spot. vmalloc(), by contrast, only allocates kernel virtual addresses, so accessing them raises a page fault. A kernel-mode page fault also goes through do_page_fault() and ends up in vmalloc_fault(), which links up the kernel page table entries and so completes the allocation. The overall flow is similar to the user-space case.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"static noinline int vmalloc_fault(unsigned long address)\n{\n unsigned long pgd_paddr;\n pmd_t *pmd_k;\n pte_t *pte_k;\n /* Make sure we are in vmalloc area: */\n if (!(address >= VMALLOC_START && address < VMALLOC_END))\n return -1;\n\n /*\n * Synchronize this task's top level page-table\n * with the 'reference' page table.\n *\n * Do _not_ use \"current\" here. We might be inside\n * an interrupt in the middle of a task switch..\n */\n pgd_paddr = read_cr3_pa();\n pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);\n if (!pmd_k)\n return -1;\n\n pte_k = pte_offset_kernel(pmd_k, address);\n if (!pte_present(*pte_k))\n return -1;\n\n return 0;\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. 
Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This concludes the analysis of how physical and virtual memory addresses are mapped. Together with the earlier articles on page allocation and management, the main functionality of the memory subsystem has now been covered in broad terms.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7c/7c470ea28508f8ee097094e8e578d17c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
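The point made at the end of section 2, that a mapping only sets up virtual address space and that physical pages arrive on first access through page faults, can be observed from user space. The following is a minimal sketch, assuming Linux with glibc; the helper names minor_faults() and faults_from_touching() are hypothetical and not taken from the article or the kernel. It maps anonymous memory with mmap(), touches each page once, and reports how much the process's minor-fault counter from getrusage() changed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: minor (no disk I/O) page faults taken by this
 * process so far, or -1 if getrusage() fails. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Hypothetical helper: map npages anonymous pages, write one byte to each,
 * and return how many minor faults the writes caused (-1 on failure).
 * mmap() itself only creates the virtual region; the writes are what
 * make the kernel allocate physical pages. */
static long faults_from_touching(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t len = npages * (size_t)page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long before = minor_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page] = 1;        /* first touch of each page */
    long after = minor_faults();

    munmap(p, len);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling faults_from_touching(64) typically returns a value on the order of the number of pages touched, but the exact count depends on the kernel configuration; transparent huge pages, for instance, can back many pages with a single fault.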