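I previously wrote a short introduction to mmap()/munmap(), "Linux Memory Management (9) mmap". It was rather thin, so this article goes through the topic in detail.
It first describes the two functions from the ordinary user's point of view, then concentrates on the kernel implementation, and finally runs some verification tests on mmap()/munmap().
The mmap system call was not designed purely for shared memory. It offers a way of accessing regular files that differs from the usual one: a process can operate on a regular file as if it were reading and writing memory.
Processes that map the same regular file with mmap share memory through it. Once a regular file is mapped into the process address space, the process accesses it like ordinary memory and no longer needs to call read()/write().
mmap does not allocate storage; it only maps the file into the calling process's address space (consuming virtual address space), after which operations such as memcpy() can be used. Changes made in memory are not written to the file immediately but after some delay; msync() can be used to synchronize explicitly.
A memory mapping is removed with munmap().
In a mapping, the start address is the addr returned by mmap(), and off and len correspond to the offset and length parameters.
1. The mmap API
mmap()/munmap() are simple to use; the variety comes from two parameters and their combinations, prot and flags.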
#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);
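A brief description of the parameters:
- addr: if not NULL, the kernel treats it as a hint for where to place the mapping (with MAP_FIXED it is used exactly); otherwise the kernel chooses a suitable virtual address. In most cases there is little point in specifying an address, and the kernel is left to choose one and return it to user space.
- length: the size of the region mapped into the process address space.
- prot: the read/write/execute attributes of the memory region.
- flags: the attributes of the mapping: shared, private, anonymous, file-backed, and so on.
- fd: the descriptor of the open file for a file mapping. A file mapping must pass fd; an anonymous mapping passes the special value -1.
- offset: for a file mapping, the offset from the start of the file; the returned address is the virtual address corresponding to that offset.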
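1.1 Advantages of mmap
1.1.1 Efficiency
Ordinary file I/O uses open, read and write. The file on disk is first read into the kernel's page cache and then copied into a user-space buffer, which involves two copies.
mmap maps the disk file into user space. When the process reads the file a page fault occurs, physical memory is assigned to the virtual address, and the disk data is paged into that physical memory. User space then reads the data directly, so the whole process needs only one copy.
1.1.2 Bulk data exchange between processes
When two processes map the same file, the same file region usually appears at different virtual addresses in each of them. When one process touches the file, it obtains physical memory through a page fault and the file data is paged in from disk.
When the other process accesses the file and finds that no physical page is mapped at its virtual address, the filesystem's fault handler looks in the page cache for file data that has already been read in; if it is there, only the mapping has to be established. The two processes can thus communicate through shared memory.
1.1.3 The memory remains usable after the file is closed
At mmap time the kernel has already located the file through fd and attached it to the vma, so closing fd afterwards does not invalidate the mapping.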
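1.2 Disadvantages of mmap
The length is fixed when the mapping is created; the mapping cannot be used to access anything beyond len.
1.3 Private/shared and file/anonymous combinations
There are four combinations, described one by one below.
1.3.1 Private file mapping
Multiple processes are initialized from the same physical pages, but each process's modifications to the mapped file are not shared with the others and are not written back to the file on disk.
Linux .so shared libraries, for example, are mapped into each process's virtual address space this way.
1.3.2 Private anonymous mapping
mmap creates a new mapping that is not shared between processes. It is mainly used to allocate memory (malloc calls mmap for large allocations).
1.3.3 Shared file mapping
Multiple processes share the same physical memory through virtual memory. Modifications are visible to every process that maps the file and are written back to the underlying file; this is one form of inter-process communication.
1.3.4 Shared anonymous mapping
With this kind of mapping, fork does not use copy-on-write: parent and child share exactly the same physical pages, which gives a parent-child communication channel.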
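2. Kernel implementation of mmap
The system call entry point is entry_SYSCALL_64_fastpath, which looks up the handler in sys_call_table by system call number.
mmap() and munmap() correspond to SyS_mmap() and SyS_munmap(); their implementation is analyzed below.
2.0 mmap/munmap call paths
Before analyzing the kernel implementation, a script is used to look at the mmap/munmap call paths.
Functions are added to set_ftrace_filter and current_tracer is switched to discover each function's callers, filling in the call path step by step.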
#!/bin/bash
DPATH="/sys/kernel/debug/tracing"
PID=$$
## Quick basic checks
[ `id -u` -ne 0 ] && { echo "needs to be root" ; exit 1; } # check for root permissions
[ -z $1 ] && { echo "needs process name as argument" ; exit 1; } # check for args to this function
mount | grep -i debugfs &> /dev/null
[ $? -ne 0 ] && { echo "debugfs not mounted, mount it first"; exit 1; } #checks for debugfs mount
# flush existing trace data
echo > $DPATH/trace
echo nop > $DPATH/current_tracer
echo > $DPATH/set_ftrace_filter
echo "SyS_mmap SyS_mmap_pgoff SyS_munmap SyS_open SyS_read SyS_write SyS_close SyS_brk SyS_msync" >> $DPATH/set_ftrace_filter
echo "do_brk elf_map load_elf_binary" >> $DPATH/set_ftrace_filter
echo "do_mmap do_munmap get_unmapped_area mmap_region vm_mmap vm_munmap vm_mmap_pgoff" >> $DPATH/set_ftrace_filter
echo "__split_vma* unmap_region" >> $DPATH/set_ftrace_filter
# set function tracer
echo function_graph > $DPATH/current_tracer
# write current process id to set_ftrace_pid file
echo $PID > $DPATH/set_ftrace_pid
#echo "common_pid==$PID" > /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/filter
#echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/enable
#echo "common_pid==$PID" > /sys/kernel/debug/tracing/events/syscalls/sys_enter_munmap/filter
#echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_munmap/enable
# start the tracing
echo 1 > $DPATH/tracing_on
# execute the process
exec $*
#sudo cat $DPATH/trace > /home/al/v4l2/trace.txt
Finally, the function_graph tracer shows the following call relationships:
1) | SyS_mmap() {
1) | SyS_mmap_pgoff() {
1) | vm_mmap_pgoff() {
1) | do_mmap() {
1) 0.548 us | get_unmapped_area();
1) 3.388 us | mmap_region();
1) 4.598 us | }
1) 5.286 us | }
1) 5.756 us | }
1) 6.058 us | }
1) | SyS_munmap() {
1) | vm_munmap() {
1) | do_munmap() {
1) + 99.985 us | unmap_region();
1) ! 101.439 us | }
1) ! 101.838 us | }
1) ! 102.410 us | }
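The analysis below follows this path.
2.1 mmap()
The core of the mmap() system call is do_mmap(), which can be divided into three parts.
The first part uses get_unmapped_area() to find a range of virtual addresses, [addr, addr+len).
User processes usually do not specify addr, so the kernel decides where the region starts.
Before calling get_unmapped_area(), do_mmap() (reached through do_mmap_pgoff()) adjusts the hinted addr with round_hint_to_min() and then passes that addr to get_unmapped_area().
The second part determines the flags of the vma; they differ for file and anonymous, private and shared mappings.
The third part actually creates the vma, in mmap_region().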
asmlinkage unsigned long
sys_mmap (unsigned long addr, unsigned long len, int prot, int flags, int fd, long off)
{
if (offset_in_page(off) != 0)
return -EINVAL;
addr = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
if (!IS_ERR((void *) addr))
force_successful_syscall_return();
return addr;
}
SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
{
struct file *file = NULL;
unsigned long retval;
if (!(flags & MAP_ANONYMOUS)) {------------------------------------------Check for non-anonymous (file) mappings: the fd must resolve to a struct file.
audit_mmap_fd(fd, flags);
file = fget(fd);
if (!file)
return -EBADF;
if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));-------------file->f_op identifies a hugepage file; len is then aligned to the huge page size.
retval = -EINVAL;
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
struct user_struct *user = NULL;
struct hstate *hs;
hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & SHM_HUGE_MASK);
if (!hs)
return -EINVAL;
len = ALIGN(len, huge_page_size(hs));
/*
* VM_NORESERVE is used because the reservations will be
* taken when vm_ops->mmap() is called
* A dummy user value is used because we are not locking
* memory so no accounting is necessary
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
&user, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
}
flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
if (file)
fput(file);
return retval;
}
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
{
unsigned long ret;
struct mm_struct *mm = current->mm;
unsigned long populate;
ret = security_mmap_file(file, prot, flag);
if (!ret) {
down_write(&mm->mmap_sem);
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate);
up_write(&mm->mmap_sem);
if (populate)
mm_populate(ret, populate);
}
return ret;
}
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags,
unsigned long pgoff, unsigned long *populate)
{
struct mm_struct *mm = current->mm;
*populate = 0;
if (!len)
return -EINVAL;
if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
if (!(file && path_noexec(&file->f_path)))
prot |= PROT_EXEC;
if (!(flags & MAP_FIXED))-------------------------------------------------Without MAP_FIXED, addr must not be below mmap_min_addr; if it is, the page-aligned mmap_min_addr is used instead.
addr = round_hint_to_min(addr);
/* Careful about overflows.. */
len = PAGE_ALIGN(len);
if (!len)-----------------------------------------------------------------This does not test whether len was 0 originally; it detects that PAGE_ALIGN() overflowed.
return -ENOMEM;
/* offset overflow? */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)--------------------------------Check whether the offset overflows.
return -EOVERFLOW;
/* Too many mappings? */
if (mm->map_count > sysctl_max_map_count)---------------------------------Limit on the number of mappings per process; -ENOMEM is returned when it is exceeded.
return -ENOMEM;
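addr = get_unmapped_area(file, addr, len, pgoff, flags);------------------Before a new vma is created, a free range of sufficient size has to be found. This function searches for an unmapped range; the returned addr is its start address.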
if (offset_in_page(addr))
return addr;
vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;------------vm_flags is derived from prot/flags and mm->def_flags.
if (flags & MAP_LOCKED)
if (!can_do_mlock())
return -EPERM;
if (mlock_future_check(mm, vm_flags, len))
return -EAGAIN;
if (file) {---------------------------------------------------------------File mappings: mainly updates vm_flags.
struct inode *inode = file_inode(file);
if (!file_mmap_ok(file, inode, pgoff, len))
return -EOVERFLOW;
switch (flags & MAP_TYPE) {
case MAP_SHARED:------------------------------------------------------Shared file mapping
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
return -EACCES;
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
return -EACCES;
if (locks_verify_locked(file))
return -EAGAIN;
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
case MAP_PRIVATE:-----------------------------------------------------Private file mapping (the MAP_SHARED case has no break and falls through to these checks as well)
if (!(file->f_mode & FMODE_READ))
return -EACCES;
if (path_noexec(&file->f_path)) {
if (vm_flags & VM_EXEC)
return -EPERM;
vm_flags &= ~VM_MAYEXEC;
}
if (!file->f_op->mmap)
return -ENODEV;
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
break;
default:
return -EINVAL;
}
} else {------------------------------------------------------------------Anonymous mappings
switch (flags & MAP_TYPE) {
case MAP_SHARED:------------------------------------------------------Shared anonymous mapping
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
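pgoff = 0;--------------------------------------------------------Why 0? The caller's pgoff is ignored here, presumably because the region is backed by a fresh shmem object set up by shmem_zero_setup() in mmap_region(), so the mapping starts at offset 0 of that object.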
vm_flags |= VM_SHARED | VM_MAYSHARE;
break;
case MAP_PRIVATE:-----------------------------------------------------Private anonymous mapping
pgoff = addr >> PAGE_SHIFT;
break;
default:
return -EINVAL;
}
}
if (flags & MAP_NORESERVE) {
/* We honor MAP_NORESERVE if allowed to overcommit */
if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
vm_flags |= VM_NORESERVE;
/* hugetlb applies strict overcommit unless MAP_NORESERVE */
if (file && is_file_hugepages(file))
vm_flags |= VM_NORESERVE;
}
addr = mmap_region(file, addr, len, vm_flags, pgoff);--------------------Actually create the vma.
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
*populate = len;
return addr;
}
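Given addr and the other parameters, get_unmapped_area() uses get_area() to find a virtual range that satisfies the request and returns its start address.
get_area is a function pointer that is set to either mm->get_unmapped_area() or file->f_op->get_unmapped_area().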
unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
unsigned long (*get_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
unsigned long error = arch_mmap_check(addr, len, flags);
if (error)
return error;
/* Careful about overflows.. */
if (len > TASK_SIZE)
return -ENOMEM;
get_area = current->mm->get_unmapped_area;------------By default use the mm_struct->get_unmapped_area() method, i.e. arch_get_unmapped_area().
if (file && file->f_op->get_unmapped_area)------------For a file mapping whose file_operations defines get_unmapped_area, use that method to locate the virtual range.
get_area = file->f_op->get_unmapped_area;
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
if (addr > TASK_SIZE - len)
return -ENOMEM;
if (offset_in_page(addr))
return -EINVAL;
addr = arch_rebalance_pgtables(addr, len);
error = security_mmap_addr(addr);
return error ? error : addr;
}
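As the name arch_get_unmapped_area() suggests, architectures may provide their own implementation. The version shown here is the ARM one, as cache_is_vipt_aliasing() and COLOUR_ALIGN() indicate; apart from the cache-aliasing handling, the generic version has the same structure.
arch_get_unmapped_area() creates new mappings from low addresses upward, while arch_get_unmapped_area_topdown() creates them from high addresses downward.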
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int do_align = 0;
int aliasing = cache_is_vipt_aliasing();
struct vm_unmapped_area_info info;
if (aliasing)
do_align = filp || (flags & MAP_SHARED);
if (flags & MAP_FIXED) {------------------MAP_FIXED takes no part in address selection; the mapping is created at the given address.
if (aliasing && flags & MAP_SHARED &&
(addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
return -EINVAL;
return addr;
}
if (len > TASK_SIZE)
return -ENOMEM;
if (addr) {--------------------------------A non-zero addr is a preferred address. The kernel checks whether the range overlaps an existing region; find_vma() performs the lookup.
if (do_align)
addr = COLOUR_ALIGN(addr, pgoff);
else
addr = PAGE_ALIGN(addr);
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
info.flags = 0;
info.length = len;
info.low_limit = mm->mmap_base;
info.high_limit = TASK_SIZE;
info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
info.align_offset = pgoff << PAGE_SHIFT;
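return vm_unmapped_area(&info);-----------When addr is 0 or the preferred address cannot be used, the kernel has to search the process's address space for a free range of suitable size; vm_unmapped_area() does the actual work.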
}
static inline unsigned long
vm_unmapped_area(struct vm_unmapped_area_info *info)
{
if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
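return unmapped_area_topdown(info);--Create the mapping searching from high addresses to low addresses.
else
return unmapped_area(info);----------Create the mapping searching from low addresses to high addresses.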
}
unsigned long unmapped_area(struct vm_unmapped_area_info *info)
{
/*
* We implement the search by looking for an rbtree node that
* immediately follows a suitable gap. That is,
* - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
* - gap_end = vma->vm_start >= info->low_limit + length;
* - gap_end - gap_start >= length
*/
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long length, low_limit, high_limit, gap_start, gap_end;
/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;
/* Adjust search limits by the desired length */
if (info->high_limit < length)
return -ENOMEM;
high_limit = info->high_limit - length;
if (info->low_limit > high_limit)
return -ENOMEM;
low_limit = info->low_limit + length;
/* Check if rbtree root looks promising */
if (RB_EMPTY_ROOT(&mm->mm_rb))
goto check_highest;
vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
if (vma->rb_subtree_gap < length)
goto check_highest;
while (true) {
/* Visit left subtree if it looks promising */
gap_end = vm_start_gap(vma);----------------------------------Start the search from the low addresses.
if (gap_end >= low_limit && vma->vm_rb.rb_left) {
struct vm_area_struct *left =
rb_entry(vma->vm_rb.rb_left,
struct vm_area_struct, vm_rb);
if (left->rb_subtree_gap >= length) {
vma = left;
continue;
}
}
gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;------No left subtree can satisfy the request, so the gap just below the current node is the lowest remaining candidate.
check_current:
/* Check if current node has a suitable gap */
if (gap_start > high_limit)
return -ENOMEM;
if (gap_end >= low_limit &&
gap_end > gap_start && gap_end - gap_start >= length)
goto found;
/* Visit right subtree if it looks promising */
if (vma->vm_rb.rb_right) {
struct vm_area_struct *right =
rb_entry(vma->vm_rb.rb_right,
struct vm_area_struct, vm_rb);
if (right->rb_subtree_gap >= length) {
vma = right;
continue;
}
}
/* Go back up the rbtree to find next candidate node */
while (true) {
struct rb_node *prev = &vma->vm_rb;
if (!rb_parent(prev))
goto check_highest;
vma = rb_entry(rb_parent(prev),
struct vm_area_struct, vm_rb);
if (prev == vma->vm_rb.rb_left) {
gap_start = vm_end_gap(vma->vm_prev);
gap_end = vm_start_gap(vma);
goto check_current;
}
}
}
check_highest:
/* Check highest gap, which does not precede any rbtree node */
gap_start = mm->highest_vm_end;
gap_end = ULONG_MAX; /* Only for VM_BUG_ON below */
if (gap_start > high_limit)
return -ENOMEM;
found:
/* We found a suitable gap. Clip it with the original low_limit. */
if (gap_start < info->low_limit)
gap_start = info->low_limit;
/* Adjust gap address to the desired alignment */
gap_start += (info->align_offset - gap_start) & info->align_mask;
VM_BUG_ON(gap_start + info->length > info->high_limit);
VM_BUG_ON(gap_start + info->length > gap_end);
return gap_start;
}
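mmap_region() first calls find_vma_links() to check whether an existing vma overlaps the requested range; if one does, do_munmap() removes the overlapping part.
Linux prefers not to leave holes between vmas: if the flags of the new region match those of the preceding or following vma, it tries to merge them into a single vma, which reduces slab consumption and wasted gaps.
If merging is not possible, a new vma is allocated and its relevant members are initialized; the page-locking flag (VM_LOCKED) decides whether physical pages are allocated immediately.
Finally the new vma is inserted into the process's vma red-black tree and addr is returned.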
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int error;
struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;
/* Check against address space limit. */
if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {--------------------Check whether total_vm + len would exceed RLIMIT_AS, to make sure the virtual mapping can proceed.
unsigned long nr_pages;
if (!(vm_flags & MAP_FIXED))
return -ENOMEM;
nr_pages = count_vma_pages_range(mm, addr, addr + len);
if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
return -ENOMEM;
}
while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
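&rb_parent)) {-----------------------------------Walk the process's vma red-black tree. find_vma_links() returns 0 when [addr, addr+len) overlaps no existing vma (and reports where the new vma should be linked); it returns -ENOMEM when the range overlaps an existing vma.
if (do_munmap(mm, addr, len))------------------------------The range overlaps an existing region, so try to munmap that range. do_munmap() returns 0 on success; if it fails, mmap_region() fails.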
return -ENOMEM;
}
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
vm_flags |= VM_ACCOUNT;
}
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
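NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);-----------------------At this point the range is free. Check whether the new mapping can be merged into a neighbouring vma, which reduces the slab consumption caused by vmas and the holes in virtual memory.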
if (vma)
goto out;
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);----------------------If no merge was possible, allocate a new vma.
if (!vma) {
error = -ENOMEM;
goto unacct_error;
}
vma->vm_mm = mm;---------------------------------------------------------Initialize the vma fields.
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
INIT_LIST_HEAD(&vma->anon_vma_chain);
if (file) {--------------------------------------------------------------File mapping
if (vm_flags & VM_DENYWRITE) {
error = deny_write_access(file);
if (error)
goto free_vma;
}
if (vm_flags & VM_SHARED) {
error = mapping_map_writable(file->f_mapping);
if (error)
goto allow_write_and_free_vma;
}
vma->vm_file = get_file(file);
error = file->f_op->mmap(file, vma);---------------------------------Call the mmap member of the file's file_operations.
if (error)
goto unmap_and_free_vma;
WARN_ON_ONCE(addr != vma->vm_start);
addr = vma->vm_start;
vm_flags = vma->vm_flags;
} else if (vm_flags & VM_SHARED) {--------------------------------------Shared anonymous region
error = shmem_zero_setup(vma);
if (error)
goto free_vma;
}
vma_link(mm, vma, prev, rb_link, rb_parent);----------------------------Insert the new vma into the vma red-black tree of the process address space and update the related counters.
/* Once vma denies write, undo our temporary denial count */
if (file) {
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
}
file = vma->vm_file;
out:
perf_event_mmap(vma);
vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
}
if (file)
uprobe_mmap(vma);
vma->vm_flags |= VM_SOFTDIRTY;
vma_set_page_prot(vma);
return addr;
unmap_and_free_vma:
vma->vm_file = NULL;
fput(file);
/* Undo any partial mapping done by a device driver. */
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
if (charged)
vm_unacct_memory(charged);
return error;
}
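References: "Linux process address space (3): memory mapping (1): mmap and do_mmap", "Process address space: get_unmmapped_area()"
2.2 munmap
As seen in mmap_region(), the kernel checks whether the target range is already in use in the current process's virtual address space. If it is, the old mapping has to be removed first, and if that fails mmap_region() returns -ENOMEM. The check is needed there because the address lookup does not perform it when MAP_FIXED is set.
munmap() removes a memory mapping; its core function is do_munmap().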
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
profile_munmap(addr);
return vm_munmap(addr, len);
}
int vm_munmap(unsigned long start, size_t len)
{
int ret;
struct mm_struct *mm = current->mm;
down_write(&mm->mmap_sem);
ret = do_munmap(mm, start, len);
up_write(&mm->mmap_sem);
return ret;
}
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
{
unsigned long end;
struct vm_area_struct *vma, *prev, *last;
if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
len = PAGE_ALIGN(len);
if (len == 0)
return -EINVAL;
/* Find the first overlapping VMA */
vma = find_vma(mm, start);-----------------Find the first vma that ends above start (it may contain start or lie entirely above it).
if (!vma)----------------------------------If there is none, return 0 right away.
return 0;
prev = vma->vm_prev;
end = start + len;
if (vma->vm_start >= end)------------------If the end of the range to release is not above the vma's start, the two do not overlap; return.
return 0;
if (start > vma->vm_start) {
int error;
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
error = __split_vma(mm, vma, start, 0);----start > vma->vm_start means the vma begins below the range to release, so the vma is split at start and the lower part stays mapped.
if (error)
return error;
prev = vma;
}
last = find_vma(mm, end);----------------------Find the vma at the end address of the range to release.
if (last && end > last->vm_start) {
int error = __split_vma(mm, last, end, 1);-If the condition holds, end falls inside that vma, so it is split at end and the upper part stays mapped.
if (error)
return error;
}
vma = prev ? prev->vm_next : mm->mmap;
if (mm->locked_vm) {
struct vm_area_struct *tmp = vma;
while (tmp && tmp->vm_start < end) {
if (tmp->vm_flags & VM_LOCKED) {
mm->locked_vm -= vma_pages(tmp);
munlock_vma_pages_all(tmp);-------A VM_LOCKED region has to be unlocked first.
}
tmp = tmp->vm_next;
}
}
detach_vmas_to_be_unmapped(mm, vma, prev, end);
unmap_region(mm, vma, prev, start, end);------Release the pages actually in use.
arch_unmap(mm, vma, start, end);
/* Fix up all other VM information */
remove_vma_list(mm, vma);---------------------Remove the vma information from mm_struct.
return 0;
}
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end)
{
struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
struct mmu_gather tlb;
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
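unmap_vmas(&tlb, vma, start, end);---------Walk all page table entries of the address range and clear them.
free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
next ? next->vm_start : USER_PGTABLES_CEILING);---Free the page tables emptied by the previous step.
tlb_finish_mmu(&tlb, start, end);----------Flush the TLB; on multiprocessor systems free_pages_and_swap_cache() is called to release the page frames.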
}
void unmap_vmas(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}
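References: "Memory management API: do_munmap", "Releasing a linear address range".
2.3 msync()
Changes a process makes to mapped memory are not written back to disk right away; the write-back is deferred and often happens only after munmap() has been called.
msync() synchronizes the contents of the mapped memory to the file on disk.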
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int unmapped_error = 0;
int error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (offset_in_page(start))
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
error = -ENOMEM;
len = (len + ~PAGE_MASK) & PAGE_MASK;
end = start + len;
if (end < start)
goto out;
error = 0;
if (end == start)
goto out;
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
for (;;) {
struct file *file;
loff_t fstart, fend;
/* Still start < end. */
error = -ENOMEM;
if (!vma)
goto out_unlock;
/* Here start < vma->vm_end. */
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
goto out_unlock;
unmapped_error = -ENOMEM;
}
/* Here vma->vm_start <= start < vma->vm_end. */
if ((flags & MS_INVALIDATE) &&
(vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
goto out_unlock;
}
file = vma->vm_file;
fstart = (start - vma->vm_start) +
((loff_t)vma->vm_pgoff << PAGE_SHIFT);
fend = fstart + (min(end, vma->vm_end) - start) - 1;
start = vma->vm_end;
if ((flags & MS_SYNC) && file &&
(vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&mm->mmap_sem);
error = vfs_fsync_range(file, fstart, fend, 1);
fput(file);
if (error || start >= end)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
} else {
if (start >= end) {
error = 0;
goto out_unlock;
}
vma = vma->vm_next;
}
}
out_unlock:
up_read(&mm->mmap_sem);
out:
return error ? : unmapped_error;
}
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
struct inode *inode = file->f_mapping->host;
if (!file->f_op->fsync)
return -EINVAL;
if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
spin_lock(&inode->i_lock);
inode->i_state &= ~I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
mark_inode_dirty_sync(inode);
}
return file->f_op->fsync(file, start, end, datasync);
}
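2.4 How malloc relates to brk()/mmap()
getconf PAGESIZE shows the page size of the current system, which is 4096 here.
malloc() does not always obtain memory through brk(); once the request reaches 128 KB (glibc's default mmap threshold), it goes through mmap.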
#include<unistd.h>
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<sys/types.h>
#include<sys/stat.h>
#include<sys/mman.h>
#define MAX (4096*31+4072)
int main()
{
int i=0;
char *array = (char *)malloc(MAX);
for( i=0; i<MAX; ++i )
++array[ i ];
free(array);
return 0;
}
The following shows how different values of MAX affect malloc.
With MAX set to (4096*31+4072), the traced system calls are:
...
brk(0x244c000) = 0x244c000
brk(0x242c000) = 0x242c000
exit_group(0) = ?
+++ exited with 0 +++
With MAX set to (4096*31+4073), the traced system calls are:
...
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12b88c9000
munmap(0x7f12b88c9000, 135168) = 0
exit_group(0) = ?
+++ exited with 0 +++
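When the request approaches 128 KB, malloc() switches to mmap(). The mapping is 135168 bytes, i.e. 132 KB or 33 pages: the request is rounded up to 128 KB and one extra page is added, presumably because malloc's bookkeeping header no longer fits into 32 pages.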
3. mmap tests
3.1 Advantage of mmap()/munmap() over read()/write()
As mentioned above, operating on memory after mmap() is faster than ordinary read()/write(); here is a simple test.
#include<unistd.h>
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<sys/types.h>
#include<sys/stat.h>
#include<sys/time.h>
#include<fcntl.h>
#include<sys/mman.h>
#define MAX (1024*128)
int main()
{
int i=0;
int count=0, fd=0;
struct timeval tv1, tv2;
char *array = (char *)malloc(MAX);
/*read*/
gettimeofday( &tv1, NULL );
fd = open( "./mmap_test", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
if(fd<0)
{
printf("Open file failed\n");
return -1;
}
if(MAX != read( fd, (char*)array, MAX ))
{
printf("Reading data failed...\n");
return -1;
}
memset(array, 'a', MAX);
lseek(fd,0,SEEK_SET);
if(MAX != write(fd, (void *)array, MAX))
{
printf( "Writing data failed...\n" );
return -1;
}
close( fd );
gettimeofday( &tv2, NULL );
free( array );
printf( "Time of read/write: %ldus\n", (long)(tv2.tv_sec - tv1.tv_sec) * 1000000L + (long)(tv2.tv_usec - tv1.tv_usec));
/*mmap*/
gettimeofday( &tv1, NULL );
fd = open( "./mmap_test2", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
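array = mmap( NULL, MAX, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if(array == MAP_FAILED)
{
printf("mmap failed\n");
return -1;
}
memset(array, 'b', MAX);
msync( array, MAX, MS_SYNC ); /* must precede munmap(): afterwards the range is no longer mapped */
munmap( array, MAX );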
close( fd );
gettimeofday( &tv2, NULL );
printf( "Time of mmap/munmap/msync: %ldus\n", (long)(tv2.tv_sec - tv1.tv_sec) * 1000000L + (long)(tv2.tv_usec - tv1.tv_usec));
return 0;
}
First create two 128 KB files filled with zeros.
dd bs=1024 count=128 if=/dev/zero of=./mmap_test
dd bs=1024 count=128 if=/dev/zero of=./mmap_test2
After the run the two files contain 'a' and 'b' respectively, and mmap is noticeably faster:
Time of read/write: 134us
Time of mmap/munmap/msync: 91us
3.2 mmap and reading /proc/xxx/maps
#include<stdio.h>
#include<unistd.h>
int main(void)
{
sleep(1000);
}
Running the program above under strace gives the following system calls.
execve("./sleep", ["./sleep"], [/* 77 vars */]) = 0
brk(NULL) = 0x1286000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=145720, ...}) = 0
mmap(NULL, 145720, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa2e0dec000--------------------------------------------------------1, read-only private file mapping, released at a.
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\t\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1868984, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0deb000--------------------------------2, anonymous mapping of one page, 0x7fa2e0deb000-0x7fa2e0dec000, read/write
mmap(NULL, 3971488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa2e0821000-------------------------------3, readable and executable private file mapping, 0x7fa2e0821000-0x7fa2e0beb000
mprotect(0x7fa2e09e1000, 2097152, PROT_NONE) = 0-------------------------------------------------------------------------4, changes 0x7fa2e09e1000-0x7fa2e0be1000 to no read/write/execute access
mmap(0x7fa2e0be1000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c0000) = 0x7fa2e0be1000-----5, private file mapping at a fixed address, read/write, 0x7fa2e0be1000-0x7fa2e0be7000
mmap(0x7fa2e0be7000, 14752, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0be7000-----------6, private anonymous mapping at a fixed address, read/write, 0x7fa2e0be7000-0x7fa2e0beb000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0dea000--------------------------------7, anonymous mapping of one page, 0x7fa2e0dea000-0x7fa2e0deb000, read/write
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0de9000--------------------------------8, anonymous mapping of one page, 0x7fa2e0de9000-0x7fa2e0dea000, read/write
arch_prctl(ARCH_SET_FS, 0x7fa2e0dea700) = 0
mprotect(0x7fa2e0be1000, 16384, PROT_READ) = 0---------------------------------------------------------------------------9, makes 0x7fa2e0be1000-0x7fa2e0be5000 of the mapping created by 5 read-only
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7fa2e0e10000, 4096, PROT_READ) = 0
munmap(0x7fa2e0dec000, 145720) = 0------------------------------------------------------------------------------a, releases the mapping created by 1
nanosleep({1000, 0}, 0x7ffef87e2c10) = 0------------------------------------------------------------------------------cat /proc/xxx/maps at this point: the mapping created by 1 has already been released.
exit_group(0) = ?
+++ exited with 0 +++
The effect of each mmap()/munmap() call on the process's mappings is analyzed below.
00400000-00401000 r-xp 00000000 08:08 3415949 /home/al/mmap/sleep
00600000-00601000 r--p 00000000 08:08 3415949 /home/al/mmap/sleep
00601000-00602000 rw-p 00001000 08:08 3415949 /home/al/mmap/sleep
7fa2e0821000-7fa2e09e1000 r-xp 00000000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so--------------Created by 3 as a private file mapping, readable and executable.
7fa2e09e1000-7fa2e0be1000 ---p 001c0000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so--------------Created by 3; 4 changes it from readable/executable to no access.
7fa2e0be1000-7fa2e0be5000 r--p 001c0000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so--------------Created by 3; replaced by the read/write fixed mapping in 5; made read-only by 9.
7fa2e0be5000-7fa2e0be7000 rw-p 001c4000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so--------------Created by 3; replaced by the read/write fixed mapping in 5.
7fa2e0be7000-7fa2e0beb000 rw-p 00000000 00:00 0 -------------------------------------------------------------------------Created by 3; replaced by the private anonymous fixed mapping in 6, read/write.
7fa2e0beb000-7fa2e0c11000 r-xp 00000000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0de9000-7fa2e0dec000 rw-p 00000000 00:00 0 -------------------------------------------------------------------------2, 7 and 8 are all private anonymous read/write mappings at adjacent addresses, so their vmas are merged.
7fa2e0e10000-7fa2e0e11000 r--p 00025000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0e11000-7fa2e0e12000 rw-p 00026000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0e12000-7fa2e0e13000 rw-p 00000000 00:00 0
7ffef87c3000-7ffef87e4000 rw-p 00000000 00:00 0 [stack]
7ffef87e4000-7ffef87e7000 r--p 00000000 00:00 0 [vvar]
7ffef87e7000-7ffef87e9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
For an explanation of the fields, see the description in the UNIX system programming manual.
4. References
"Analysis of how Linux mmap memory mapping works"