vsyscall page

-----------------------------------vsyscall page-------------------------------------
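The kernel has a permanently fixed-mapped page (at 0xffffe000-0xffffefff) called the vsyscall page. This region holds
the code for the system call entry point __kernel_vsyscall and the signal handler return code __kernel_sigreturn.
During system initialization the kernel calls sysenter_setup(). This function allocates a zeroed page, maps it at the
FIX_VSYSCALL slot of the fixmap region, and, depending on whether the CPU supports the sysenter instruction, copies in
the code between vsyscall_int80_start~vsyscall_int80_end or vsyscall_sysenter_start~vsyscall_sysenter_end.
The page is mapped user-level, read-only, and executable, so user processes can run the code in it directly.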
1. Creating the vsyscall page
--linux\arch\i386\kernel\
static int __init sysenter_setup(void)
{
    void *page = (void *)get_zeroed_page(GFP_ATOMIC);

    /* permissions: user-level, read-only, executable */
    __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);

    if (!boot_cpu_has(X86_FEATURE_SEP)) {
        memcpy(page,
               &vsyscall_int80_start,
               &vsyscall_int80_end - &vsyscall_int80_start);
        return 0;
    }

    memcpy(page,
           &vsyscall_sysenter_start,
           &vsyscall_sysenter_end - &vsyscall_sysenter_start);

    on_each_cpu(enable_sep_cpu, NULL, 1, 1);
    return 0;
}
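
The selection logic above can be mimicked in user space. The following is only a sketch: the two byte arrays are short stand-ins (not the real vsyscall-int80.so and vsyscall-sysenter.so images), and sketch_fill_page, SKETCH_PAGE_SIZE, and the cpu_has_sep flag are hypothetical names introduced for illustration. It shows the same pattern of computing the image length as end minus start and copying exactly one variant into the page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Stand-in images (assumption): a few bytes each instead of the real DSOs. */
static const unsigned char sketch_int80_image[] = { 0xcd, 0x80, 0xc3 };
static const unsigned char sketch_sysenter_image[] = {
    0x51, 0x52, 0x55, 0x89, 0xe5, 0x0f, 0x34
};

/* Copy the image that matches the CPU capability into page and
 * return the number of bytes copied. */
static size_t sketch_fill_page(unsigned char *page, int cpu_has_sep)
{
    const unsigned char *start;
    const unsigned char *end;
    size_t len;

    if (cpu_has_sep) {
        start = sketch_sysenter_image;
        end = start + sizeof(sketch_sysenter_image);
    } else {
        start = sketch_int80_image;
        end = start + sizeof(sketch_int80_image);
    }

    len = (size_t)(end - start);      /* same idea as &..._end - &..._start */
    assert(len <= SKETCH_PAGE_SIZE);  /* the image must fit in one page */
    memcpy(page, start, len);
    return len;
}
```

Calling sketch_fill_page(page, 0) on a zeroed 4096-byte buffer copies the int80 stand-in, and passing 1 copies the sysenter stand-in over it.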


2. The vsyscall code
The code for vsyscall-int80, vsyscall-sysenter, and vsyscall-sigreturn lives in vsyscall-int80.S, vsyscall-sysenter.S, and
vsyscall-sigreturn.S respectively. Both vsyscall-int80 and vsyscall-sysenter end with #include "vsyscall-sigreturn.S", which
pulls in the vsyscall-sigreturn code. vsyscall.S bundles the contents of vsyscall-int80.so and vsyscall-sysenter.so together, delimited by vsyscall_int80_start, vsyscall_int80_end and vsyscall_sysenter_start, vsyscall_sysenter_end.

--linux\arch\i386\kernel\vsyscall.S
#include <linux/init.h>


__INITDATA


.globl vsyscall_int80_start, vsyscall_int80_end
vsyscall_int80_start:
.incbin "arch/i386/kernel/vsyscall-int80.so"
vsyscall_int80_end:


.globl vsyscall_sysenter_start, vsyscall_sysenter_end
vsyscall_sysenter_start:
.incbin "arch/i386/kernel/vsyscall-sysenter.so"
vsyscall_sysenter_end:


__FINIT



The vsyscall-int80.so and vsyscall-sysenter.so images that vsyscall.S embeds are linked using the linker script vsyscall.lds.S.
--linux\arch\i386\kernel\vsyscall.lds.S
/*
 * Linker script for vsyscall DSO.  The vsyscall page is an ELF shared
 * object prelinked to its virtual address, and with only one read-only
 * segment (that fits in one page).  This script controls its layout.
 */
#include <asm/asm_offsets.h>


SECTIONS
{
  . = VSYSCALL_BASE + SIZEOF_HEADERS;


  .hash           : { *(.hash) } :text
  .dynsym         : { *(.dynsym) }
  .dynstr         : { *(.dynstr) }
  .gnu.version    : { *(.gnu.version) }
  .gnu.version_d  : { *(.gnu.version_d) }
  .gnu.version_r  : { *(.gnu.version_r) }


  /* This linker script is used both with -r and with -shared.
     For the layouts to match, we need to skip more than enough
     space for the dynamic symbol table et al.  If this amount
     is insufficient, ld -shared will barf.  Just increase it here.  */
  . = VSYSCALL_BASE + 0x400;


  .text           : { *(.text) } :text =0x90909090
……
}
……


Here, #define VSYSCALL_BASE (__fix_to_virt(FIX_VSYSCALL)),
so the link address VSYSCALL_BASE is the start of the FIX_VSYSCALL page in the fixmap region.
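
To see how this definition lands on 0xffffe000, the arithmetic can be sketched as follows. The constants are assumptions based on the i386 fixmap layout of that kernel generation and are not quoted in the text above: the fixmap is taken to end at 0xfffff000, pages are 4 KB, and FIX_VSYSCALL is taken to be index 1 after a FIX_HOLE entry at index 0. All SKETCH_ and sketch_ names are hypothetical.

```c
#include <assert.h>

/* Assumed i386 values; check them against the kernel tree in question. */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_FIXADDR_TOP 0xfffff000UL

enum sketch_fixed_addresses {
    SKETCH_FIX_HOLE,     /* index 0: the last page, left unmapped */
    SKETCH_FIX_VSYSCALL  /* index 1: the vsyscall page */
};

/* Fixmap slots grow downward from the top address, one page per index. */
static unsigned long sketch_fix_to_virt(unsigned int idx)
{
    assert(idx < (SKETCH_FIXADDR_TOP >> SKETCH_PAGE_SHIFT));
    return SKETCH_FIXADDR_TOP - ((unsigned long)idx << SKETCH_PAGE_SHIFT);
}
```

Under these assumptions sketch_fix_to_virt(SKETCH_FIX_VSYSCALL) is 0xfffff000 minus one page, which is 0xffffe000, the next-to-last addressable page.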


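3. When a user process calls do_execve(), that function dynamically links the vsyscall page into the process's address space.
A user program that needs to make a system call can then call __kernel_vsyscall() in the vsyscall page directly. Depending on
which variant was copied into the page, the call uses int $0x80 or the sysenter instruction to cross from user mode into the kernel.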
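
A user program can check for the kernel-supplied image itself. This is a hedged sketch, not something taken from the kernel sources above: it assumes a Linux system with glibc 2.16 or later, where getauxval() from <sys/auxv.h> returns the AT_SYSINFO_EHDR entry of the ELF auxiliary vector, and it assumes that entry holds the address of the ELF image the kernel mapped into the process. On current kernels that address is usually not the fixed 0xffffe000 described here. The function name sketch_check_vsyscall_image is hypothetical.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns 1 if the kernel-provided image starts with the ELF magic,
 * 0 if it does not, and -1 if no AT_SYSINFO_EHDR entry was supplied. */
static int sketch_check_vsyscall_image(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return -1;

    /* The image is mapped readable, so its header can be inspected. */
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0 ? 1 : 0;
}
```

A result of 1 is consistent with the page having the structure of an ELF shared object; a result of -1 only means this environment does not expose the entry.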


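4. Kernels that use the vsyscall page (2.5.53 and later) also place __kernel_sigreturn, the return code used by user signal
handlers, in the permanently fixed-mapped page (the code is included in vsyscall-sysenter.S or vsyscall-int80.S), so it no longer has to be written onto the stack.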

--linux\arch\i386\kernel\vsyscall-sigreturn.S
__kernel_sigreturn:
.LSTART_sigreturn:
popl %eax /* XXX does this mean it needs unwind info? */
movl $__NR_sigreturn, %eax
int $0x80
.LEND_sigreturn:


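/* As the setup_frame code below shows, the code that runs after the user signal handler executes ret is
   the code frame->pretcode points to. Before kernel 2.5.53 it pointed to frame->retcode[], and the return
   code was stored in the frame->retcode array; now it points to __kernel_sigreturn in the vsyscall page.
*/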
--linux\arch\i386\kernel\signal.c 
static void setup_frame(int sig, struct k_sigaction *ka,
                        sigset_t *set, struct pt_regs * regs)
{
…………
    restorer = &__kernel_sigreturn;
    if (ka->sa.sa_flags & SA_RESTORER)
        restorer = ka->sa.sa_restorer;

    /* Set up to return from userspace.  */
    err |= __put_user(restorer, &frame->pretcode);
    /*
     * This is popl %eax ; movl $,%eax ; int $0x80
     *
     * WE DO NOT USE IT ANY MORE! It's only left here for historical
     * reasons and because gdb uses it as a signature to notice
     * signal handler stack frames.
     */
    err |= __put_user(0xb858, (short __user *)(frame->retcode+0));
    err |= __put_user(__NR_sigreturn, (int __user *)(frame->retcode+2));
    err |= __put_user(0x80cd, (short __user *)(frame->retcode+6));
…………
}

The three "err |=" lines that write to frame->retcode are no longer used for returning; they are kept for historical reasons.

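The three stores into frame->retcode spell out the same instruction sequence as __kernel_sigreturn once the little-endian byte order is taken into account. The sketch below rebuilds the 8 bytes in user space. It assumes the i386 value 119 for __NR_sigreturn, which the text above does not state, and the helper names (sketch_put_le16, sketch_put_le32, sketch_fill_retcode) are hypothetical.

```c
#include <stdint.h>

#define SKETCH_NR_SIGRETURN 119  /* assumed i386 __NR_sigreturn */

/* Store a 16-bit value in little-endian order, as i386 does. */
static void sketch_put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Store a 32-bit value in little-endian order. */
static void sketch_put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)(v >> 24);
}

/* Mirror the three __put_user() stores into an 8-byte buffer. */
static void sketch_fill_retcode(uint8_t retcode[8])
{
    sketch_put_le16(retcode + 0, 0xb858);              /* 58: popl %eax, b8: movl $imm32,%eax */
    sketch_put_le32(retcode + 2, SKETCH_NR_SIGRETURN); /* the imm32 operand */
    sketch_put_le16(retcode + 6, 0x80cd);              /* cd 80: int $0x80 */
}
```

Read in memory order the buffer is 58 b8 77 00 00 00 cd 80, that is popl %eax, movl $119,%eax, int $0x80, which is the byte signature the kernel comment says gdb looks for.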

-----------------------------------References
[1]http://www.win.tue.nl/~aeb/linux/lk/lk-4.html


Sysenter and the vsyscall page 
It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness is caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4.


Some models of the i386 family do have faster ways to enter the kernel. On Pentium II there is the sysenter instruction. Also AMD has a syscall instruction. It would be good if these could be used.


Something else is that in some applications gettimeofday() is done very often, for example for timestamping all transactions. It would be nice if it could be implemented with very low overhead.


One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction - no system call required.


There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.


Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.


This page was changed to have the structure of an ELF binary (called linux-vsyscall.so.1) in Linux 2.5.69. In Linux 2.5.74 the name was changed to linux-gate.so.1.


Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)


Some details


The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the next-to-last addressable page. The last is not mapped - maybe to avoid bugs related to wraparound.) We can read it: 


…………