vsyscall page

-----------------------------------vsyscall page-------------------------------------
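The kernel has one permanently fixed-mapped page (at 0xffffe000-0xffffefff) called the vsyscall page. It holds the code of the system call entry point __kernel_vsyscall and the signal handler return code __kernel_sigreturn. During system initialization the kernel calls sysenter_setup(). This function allocates a zeroed page, maps it at the FIX_VSYSCALL slot of the fixed-mapping area and, depending on whether the CPU supports the sysenter instruction, copies into it either the code between vsyscall_int80_start and vsyscall_int80_end or the code between vsyscall_sysenter_start and vsyscall_sysenter_end. The page is mapped user-level, read-only and executable, so user processes can access the code in it directly.
1. Creating the vsyscall page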
--linux\arch\i386\kernel\
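static int __init sysenter_setup(void)
{
	/* Allocate one zeroed page to hold the vsyscall code. */
	void *page = (void *)get_zeroed_page(GFP_ATOMIC);

	/* Permissions: user-level, read-only, executable. */
	__set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);

	/* No sysenter support: install the int $0x80 variant. */
	if (!boot_cpu_has(X86_FEATURE_SEP)) {
		memcpy(page,
		       &vsyscall_int80_start,
		       &vsyscall_int80_end - &vsyscall_int80_start);
		return 0;
	}

	/* sysenter is available: install the sysenter variant. */
	memcpy(page,
	       &vsyscall_sysenter_start,
	       &vsyscall_sysenter_end - &vsyscall_sysenter_start);

	/* Enable sysenter on every CPU. */
	on_each_cpu(enable_sep_cpu, NULL, 1, 1);
	return 0;
}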
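
The selection logic of sysenter_setup() can be tried out in user space. The following is only a minimal sketch: fill_vsyscall_page(), the cpu_has_sep flag and the two short byte arrays are hypothetical stand-ins invented for illustration. They are not the real vsyscall-int80.so or vsyscall-sysenter.so images, and the sketch does no fixmap mapping at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u /* i386 page size */

/* Illustrative stand-ins for the two code blobs (NOT the real DSO images). */
static const unsigned char int80_stub[] = {
    0xcd, 0x80,       /* int $0x80 */
    0xc3              /* ret */
};
static const unsigned char sysenter_stub[] = {
    0x51, 0x52, 0x55, /* push %ecx; push %edx; push %ebp */
    0x89, 0xe5,       /* movl %esp,%ebp */
    0x0f, 0x34        /* sysenter */
};

/*
 * Hypothetical helper: zero the page, then copy in the stub that matches
 * the CPU capability. Returns the number of bytes copied.
 */
static size_t fill_vsyscall_page(unsigned char *page, int cpu_has_sep)
{
    const unsigned char *src = cpu_has_sep ? sysenter_stub : int80_stub;
    size_t len = cpu_has_sep ? sizeof sysenter_stub : sizeof int80_stub;

    assert(len <= PAGE_SIZE);   /* the code must fit in one page */
    memset(page, 0, PAGE_SIZE); /* mirrors get_zeroed_page() */
    memcpy(page, src, len);
    return len;
}
```

As in the kernel function, the length passed to memcpy is simply the distance between the start and end of the chosen blob.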


2. The vsyscall code
The code for vsyscall-int80, vsyscall-sysenter and vsyscall-sigreturn lives in vsyscall-int80.S, vsyscall-sysenter.S and vsyscall-sigreturn.S respectively. Both vsyscall-int80.S and vsyscall-sysenter.S end with #include "vsyscall-sigreturn.S", which pulls in the vsyscall-sigreturn code. vsyscall.S embeds the two built objects, vsyscall-int80.so and vsyscall-sysenter.so, and delimits them with the symbols vsyscall_int80_start/vsyscall_int80_end and vsyscall_sysenter_start/vsyscall_sysenter_end.

--linux\arch\i386\kernel\vsyscall.S
#include <linux/init.h>


__INITDATA


.globl vsyscall_int80_start, vsyscall_int80_end
vsyscall_int80_start:
.incbin "arch/i386/kernel/vsyscall-int80.so"
vsyscall_int80_end:


.globl vsyscall_sysenter_start, vsyscall_sysenter_end
vsyscall_sysenter_start:
.incbin "arch/i386/kernel/vsyscall-sysenter.so"
vsyscall_sysenter_end:


__FINIT



The vsyscall objects embedded by vsyscall.S are linked with the linker script vsyscall.lds.S.
--linux\arch\i386\kernel\vsyscall.lds.S
/*
 * Linker script for vsyscall DSO.  The vsyscall page is an ELF shared
 * object prelinked to its virtual address, and with only one read-only
 * segment (that fits in one page).  This script controls its layout.
 */
#include <asm/asm_offsets.h>


SECTIONS
{
  . = VSYSCALL_BASE + SIZEOF_HEADERS;


  .hash           : { *(.hash) } :text
  .dynsym         : { *(.dynsym) }
  .dynstr         : { *(.dynstr) }
  .gnu.version    : { *(.gnu.version) }
  .gnu.version_d  : { *(.gnu.version_d) }
  .gnu.version_r  : { *(.gnu.version_r) }


  /* This linker script is used both with -r and with -shared.
     For the layouts to match, we need to skip more than enough
     space for the dynamic symbol table et al.  If this amount
     is insufficient, ld -shared will barf.  Just increase it here.  */
  . = VSYSCALL_BASE + 0x400;


  .text           : { *(.text) } :text =0x90909090
……
}
……


Here VSYSCALL_BASE is defined as: #define VSYSCALL_BASE (__fix_to_virt(FIX_VSYSCALL))
So the final link address, VSYSCALL_BASE, is the start of the FIX_VSYSCALL page in the fixed-mapping area.
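
The address arithmetic behind VSYSCALL_BASE can be checked with a small user-space sketch. It assumes the i386 fixmap layout of that kernel generation: the fixed-mapping area ends at FIXADDR_TOP = 0xfffff000, slot 0 is an unmapped hole, and FIX_VSYSCALL is slot 1. These constants, and the FIX_SKETCH_END bound, are assumptions made for illustration and are not taken from the listings above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12           /* 4 KiB pages on i386 */
#define FIXADDR_TOP 0xfffff000u  /* ASSUMED top of the fixed-mapping area */

/* ASSUMED slot order: slot 0 is a hole, slot 1 is the vsyscall page. */
enum fixed_addresses { FIX_HOLE, FIX_VSYSCALL, FIX_SKETCH_END };

/* Slots grow downward from FIXADDR_TOP, one page per slot. */
static uint32_t fix_to_virt(unsigned int idx)
{
    assert(idx < FIX_SKETCH_END);
    return FIXADDR_TOP - ((uint32_t)idx << PAGE_SHIFT);
}
```

Under these assumptions fix_to_virt(FIX_VSYSCALL) is 0xffffe000, which matches the 0xffffe000-0xffffefff range given at the top of this article.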


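3. When a user process goes through do_execve(), the kernel makes the vsyscall page available for dynamic linking into the process's address space (the ELF loader passes its address to the new program in the auxiliary vector). From then on, a user program that needs a system call can call __kernel_vsyscall in the vsyscall page directly. Depending on which variant sysenter_setup() installed, that code issues int $0x80 or sysenter to make the user-to-kernel transition.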
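
From user space, the address of this object can be looked up through the auxiliary vector. The sketch below assumes a Linux system with glibc 2.16 or later, where getauxval() and the AT_SYSINFO_EHDR entry are available. The helper names vsyscall_dso_base() and looks_like_elf() are hypothetical. On current kernels the address returned is that of the vDSO, the successor of the fixed vsyscall page, not 0xffffe000.

```c
#include <assert.h>
#include <elf.h>      /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
#include <string.h>
#include <sys/auxv.h> /* getauxval() */

/* Base address of the kernel-provided DSO, or 0 if the entry is absent. */
static unsigned long vsyscall_dso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Nonzero if the memory at p starts with the ELF magic "\177ELF". */
static int looks_like_elf(const void *p)
{
    assert(p != NULL);
    return memcmp(p, ELFMAG, SELFMAG) == 0;
}
```

A caller would typically fetch the base, check that it is nonzero, and only then call looks_like_elf((const void *)base). A dynamic linker goes further and parses the ELF headers to find __kernel_vsyscall.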


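4. Kernels that use the vsyscall page (2.5.53 and later) also place __kernel_sigreturn, the return code used by user signal handlers, in the permanently fixed-mapped page (the code is included in vsyscall-sysenter.S or vsyscall-int80.S), so it no longer has to be placed on the stack.

--linux\arch\i386\kernel\vsyscall-sigreturn.S--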
__kernel_sigreturn:
.LSTART_sigreturn:
popl %eax /* XXX does this mean it needs unwind info? */
movl $__NR_sigreturn, %eax
int $0x80
.LEND_sigreturn:


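/* As the setup_frame code below shows, the code that runs after the user
   signal handler executes ret is whatever frame->pretcode points to. Kernels
   before 2.5.53 pointed it at frame->retcode[], with the return code stored
   in the frame->retcode array; it now points to __kernel_sigreturn in the
   vsyscall page.
*/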
--linux\arch\i386\kernel\signal.c 
static void setup_frame(int sig, struct k_sigaction *ka,
			sigset_t *set, struct pt_regs * regs)
{
…………
	restorer = &__kernel_sigreturn;
	if (ka->sa.sa_flags & SA_RESTORER)
		restorer = ka->sa.sa_restorer;

	/* Set up to return from userspace.  */
	err |= __put_user(restorer, &frame->pretcode);
	/*
	 * This is popl %eax ; movl $,%eax ; int $0x80
	 *
	 * WE DO NOT USE IT ANY MORE! It's only left here for historical
	 * reasons and because gdb uses it as a signature to notice
	 * signal handler stack frames.
	 */
	err |= __put_user(0xb858, (short __user *)(frame->retcode+0));
	err |= __put_user(__NR_sigreturn, (int __user *)(frame->retcode+2));
	err |= __put_user(0x80cd, (short __user *)(frame->retcode+6));
…………
}
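
The last three lines beginning with "err |=" (the ones that write frame->retcode) are no longer used for returning from the handler. As the kernel comment says, they are kept for historical reasons and because gdb uses them as a signature to recognize signal handler stack frames. The first "err |=" line, which stores restorer into frame->pretcode, is still required.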


-----------------------------------References
[1]http://www.win.tue.nl/~aeb/linux/lk/lk-4.html


Sysenter and the vsyscall page 
It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness is caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4.


Some models of the i386 family do have faster ways to enter the kernel. On Pentium II there is the sysenter instruction. Also AMD has a syscall instruction. It would be good if these could be used.


Something else is that in some applications gettimeofday() is done very often, for example for timestamping all transactions. It would be nice if it could be implemented with very low overhead.


One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction - no system call required.


There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.


Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.


This page was changed to have the structure of an ELF binary (called linux-vsyscall.so.1) in Linux 2.5.69. In Linux 2.5.74 the name was changed to linux-gate.so.1.


Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)


Some details


The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the next-to-last addressable page. The last is not mapped - maybe to avoid bugs related to wraparound.) We can read it: 


…………