File Management
- File operations provided by the operating system:
fd = open(fileName, mode)
// open a named file for reading/writing/appending
close(fd)
// close an open file, via its descriptor
nread = read(fd, buf, nbytes)
// attempt to read data from file into buffer
nwritten = write(fd, buf, nbytes)
// attempt to write data from buffer to file
lseek(fd, offset, seek_type)
// move file pointer to relative/absolute file offset
fsync(fd)
// flush contents of file buffers to disk
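- Different DBMSs arrange their data differently:
(1) Use a raw disk partition, bypassing the file system (older versions of Oracle)
(2) One large file containing all DB data (e.g. SQLite)
(3) Several large files, with tables spread across them (current Oracle)
(4) Multiple data files, one per table (PostgreSQL)
(5) Multiple files per table (including one main file)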
…
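The system calls listed above can be combined into a small round trip. The following is a minimal sketch, assuming a POSIX system; the function name write_then_read is hypothetical and not part of the original notes.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write msg to path, flush it to disk, seek back to the start and read it
   into out. Returns 0 on success, -1 on any failure.
   out must have room for strlen(msg) + 1 bytes. */
int write_then_read(const char *path, const char *msg, char *out)
{
    size_t len = strlen(msg);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    int ok = write(fd, msg, len) == (ssize_t)len   /* buffer -> file */
          && fsync(fd) == 0                        /* force buffers to disk */
          && lseek(fd, 0, SEEK_SET) == 0           /* rewind to offset 0 */
          && read(fd, out, len) == (ssize_t)len;   /* file -> buffer */
    if (ok) out[len] = '\0';

    if (close(fd) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Every call is checked because read() and write() may transfer fewer bytes than requested; a storage manager cannot ignore that case.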
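Single-file Storage Manager (single-file DBMS layer)
PS: As in the layout above, if the Employee data grows large, the Employee Data Pages region can be extended, or the remaining Employee data can be placed after the Project Data Pages (with a link to it).
SpaceMap: for each chunk, its offset, how much space it uses, and its status, e.g. [(0,10,U) …]
NameMap: given a table name, where its data sits in the database file and how much space it uses, e.g. [("employee",20,350) …]
Each file segment (chunk) contains a fixed number of blocks.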
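The two maps can be pictured as small arrays of records. This is only a sketch: the struct layouts, the 'F' status code for a free chunk, and the helper names lookup_name and first_free_chunk are assumptions made for illustration, not the actual definitions from the course code.

```c
#include <stddef.h>
#include <string.h>

/* One SpaceMap entry: (offset, length, status), e.g. (0, 10, 'U'). */
typedef struct {
    int  offset;   /* first page of the chunk */
    int  length;   /* number of pages in the chunk */
    char status;   /* 'U' = used, 'F' = free (assumed encoding) */
} SpaceEntry;

/* One NameMap entry: (name, start, size), e.g. ("employee", 20, 350). */
typedef struct {
    const char *name;  /* table name */
    int start;         /* page index where the table data starts */
    int size;          /* number of pages the table uses */
} NameEntry;

/* Linear search of the NameMap; returns NULL if the table is unknown. */
const NameEntry *lookup_name(const NameEntry *names, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i].name, name) == 0)
            return &names[i];
    return NULL;
}

/* Index of the first free chunk with at least npages pages, or -1. */
int first_free_chunk(const SpaceEntry *map, size_t n, int npages)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].status == 'F' && map[i].length >= npages)
            return (int)i;
    return -1;
}
```

With the NameMap entry ("employee",20,350), lookup_name yields start 20 and size 350, which is what openRelation needs to fill in a Rel.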
Data definitions
#define PAGESIZE 4096 //bytes per page
typedef long PageId; //PageId is block index
// pageOffset = PageId*PAGESIZE
typedef char *Page; // pointer to page/block buffer
Typical PAGESIZE values: 1024, 2048, 4096, 8192
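The relationship pageOffset = PageId*PAGESIZE can be written out directly. A minimal sketch, assuming PAGESIZE is 4096; the helper names page_offset and page_of_offset are hypothetical.

```c
#include <sys/types.h>

#define PAGESIZE 4096      /* bytes per page */
typedef long PageId;       /* PageId is a block index */

/* Byte offset in the file at which page pid starts. */
off_t page_offset(PageId pid)
{
    return (off_t)pid * PAGESIZE;
}

/* Inverse mapping: the page that contains byte offset off. */
PageId page_of_offset(off_t off)
{
    return (PageId)(off / PAGESIZE);
}
```

For example, page 3 starts at byte 3 * 4096 = 12288, and any byte from 12288 to 16383 belongs to page 3.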
Storage Manager data structures for the DB:
typedef struct DBrec{
char *dbname; //copy of database name
int fd;
SpaceMap map;
NameMap names;
} *DB;
//DB
typedef struct Relrec{
char *relname; //copy of table name
int start; //page index of start of table data
int npages; //number of pages of table data
...
} *Rel;
//Table
E.g. scanning a table:
DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
//current page
Page buffer = malloc(PAGESIZE*sizeof(char));
//Assuming the table's pages are stored contiguously
for (int i = 0; i < r->npages; i++){
//Compute the PageId from the table's start page
PageId pid = r->start+i;
//Fetch page from DB and store into buffer
get_page(db,pid,buffer);
//Get each tuple from pages in buffer
for each tuple in buffer{
get tuple data and extract name
add (name) to result tuples
}
}
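The inner "for each tuple in buffer" step depends on how tuples are laid out inside a page, which the notes leave open. The sketch below assumes one possible layout (an int tuple count followed by packed fixed-size records) and a hypothetical Tuple struct for R(x,y,z); real DBMS pages are more elaborate.

```c
#include <string.h>

#define PAGESIZE 4096

/* Hypothetical fixed-size tuple for a table R(x,y,z). */
typedef struct { int x, y, z; } Tuple;

/* Assumed page layout: [int ntuples][Tuple 0][Tuple 1]... */
int page_ntuples(const char *page)
{
    int n;
    memcpy(&n, page, sizeof n);   /* memcpy avoids alignment problems */
    return n;
}

/* Copy the i-th tuple of the page into *out. */
void page_get_tuple(const char *page, int i, Tuple *out)
{
    memcpy(out, page + sizeof(int) + (size_t)i * sizeof(Tuple), sizeof(Tuple));
}

/* Append a tuple; returns 0 on success, -1 if the page is full. */
int page_add_tuple(char *page, const Tuple *t)
{
    int n = page_ntuples(page);
    size_t off = sizeof(int) + (size_t)n * sizeof(Tuple);
    if (off + sizeof(Tuple) > PAGESIZE) return -1;
    memcpy(page + off, t, sizeof *t);
    n++;
    memcpy(page, &n, sizeof n);
    return 0;
}

/* Count the tuples in one page whose x attribute exceeds threshold. */
int count_x_greater(const char *page, int threshold)
{
    int count = 0, n = page_ntuples(page);
    for (int i = 0; i < n; i++) {
        Tuple t;
        page_get_tuple(page, i, &t);
        if (t.x > threshold) count++;
    }
    return count;
}
```

In the scan loop, count_x_greater(buffer, 10) would be called once per page fetched by get_page.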
//start using DB, buffer meta-data
DB openDatabase(char *name){
DB db = malloc(sizeof(struct DBrec));
db->dbname = strdup(name);
db->fd = open(name,O_RDWR);
db->map = readSpaceMap(db->fd);
db->names = readNameMap(db->fd);
return db;
}
// set up struct describing relation
Rel openRelation(DB db, char *rname){
Rel r = malloc(sizeof(struct Relrec));
r->relname = strdup(rname);
//get relation data from map tables
r->start = ...;
r->npages=...;
return r;
}
//assume that Page = byte[PageSize]
//assume that PageId= block number in file
...
//read page from file into memory buffer
void get_page(DB db, PageId p, Page buf){
lseek(db->fd, p*PAGESIZE, SEEK_SET);
read(db->fd, buf, PAGESIZE);
}
//write page from memory buffer to file
//PageId:Index of Page
void put_page(DB db, PageId p, Page buf){
lseek(db->fd, p*PAGESIZE, SEEK_SET);
write(db->fd, buf, PAGESIZE);
}
Exercise 1: Relation Scan Cost
Consider a table R(x,y,z) with:
1. r = 10^4 tuples
2. Average tuple size R = 200 bytes
3. Data page size B = 4096 bytes
4. Time to read one data page T = 10 msec
5. Time to check one tuple = 1 usec
6. Time to form one result tuple = 1 usec
7. Time to write one result page T = 10 msec
Assuming 50% of the tuples satisfy the condition, calculate the total time for the query:
insert into S select * from R where x>10;
Answer Steps:
- total # of tuples = r = 10000
#bytes/tuple = R = 200
#bytes/page = B = 4096
#tuples/page = c = floor(B/R) = floor(20.48) = 20
total # of pages = ceil(r/c) = 500
- #pages to read = b = 500
- time to read pages = 500 * 10 msec = 5000 msec
- cost to check all tuples = 10000 * 1 usec = 10 msec
- tuples in result = 50% * 10000 = 5000
- cost to make result tuples = 5000 * 1 usec = 5 msec
- #pages written = ceil(5000/20) = 250
- time to write pages = 250 * 10 msec = 2500 msec
- Total time = reading + checking + making + writing =
5000 + 10 + 5 + 2500 = 7515 msec
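The answer steps can be checked mechanically. A small sketch follows, with every time expressed in microseconds so that integer arithmetic stays exact; the function name and parameter names are hypothetical.

```c
/* Cost of "insert into S select * from R where ..." under the model above.
   r        number of tuples in R
   R        average tuple size in bytes
   B        page size in bytes
   t_read   time to read one page (usec)
   t_check  time to check one tuple (usec)
   t_make   time to form one result tuple (usec)
   t_write  time to write one result page (usec)
   pct      percentage of tuples that satisfy the condition
   Returns the total time in microseconds. */
long scan_cost_usec(long r, long R, long B, long t_read, long t_check,
                    long t_make, long t_write, long pct)
{
    long c    = B / R;                  /* tuples per page: floor(B/R) */
    long b    = (r + c - 1) / c;        /* pages to read: ceil(r/c) */
    long nres = r * pct / 100;          /* tuples in the result */
    long bout = (nres + c - 1) / c;     /* result pages: ceil(nres/c) */
    return b * t_read + r * t_check + nres * t_make + bout * t_write;
}
```

Calling scan_cost_usec(10000, 200, 4096, 10000, 1, 1, 10000, 50) gives 7515000 usec, i.e. the 7515 msec derived by hand. Page I/O accounts for 7500 of those 7515 msec, which is why cost analysis usually counts pages rather than tuples.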
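Multiple-file Disk Manager
Most DBMSs do not store all their data in a single large file. Instead they use:
1. Multiple files partitioned physically or logically.
2. A mapping from DB-level objects to files (via meta-data).
The advantage of multiple-file storage is that it is easier to add tuples to a table.
How PageId differs between storage systems:
- With one file per table, a PageId contains a relation identifier (which maps to a filename) and a page number (which identifies the page within the file).
- With several files per table, a PageId contains a relation identifier, a page number, and a file identifier (e.g. the first file of table1; combined with the relation identifier it maps to a filename).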
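For the several-files-per-table case, the PageId can be sketched as a three-field record. The struct and the file naming scheme used here (the relation id alone for file 0, "relid.fileno" for later files) are assumptions chosen for illustration, not something the notes prescribe.

```c
#include <stdio.h>

/* Hypothetical PageId for a system with several files per table. */
typedef struct {
    unsigned rel_id;   /* relation identifier */
    unsigned file_no;  /* which file of the relation */
    unsigned page_no;  /* page number within that file */
} MultiFilePageId;

/* Build the name of the file that holds the page.
   Assumed scheme: "<rel_id>" for file 0, "<rel_id>.<file_no>" otherwise.
   Returns 0 on success, -1 if buf is too small. */
int page_file_name(MultiFilePageId pid, char *buf, size_t buflen)
{
    int n;
    if (pid.file_no == 0)
        n = snprintf(buf, buflen, "%u", pid.rel_id);
    else
        n = snprintf(buf, buflen, "%u.%u", pid.rel_id, pid.file_no);
    return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}
```

The page number is not part of the file name; it is only used to compute the offset once the right file has been opened.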
PostgreSQL file organisation
Components of the storage manager:
1. Mapping from relations to files (data structure: RelFileNode)
2. Abstraction for a pool of open relations (storage/smgr) (the pool keeps entries)
3. Functions for managing files (storage/smgr/md.c)
4. File-descriptor pool (storage/file)
PostgreSQL has two basic kinds of files:
- Heap files, which hold data (tuples) (much like data blocks).
- Index files, which hold index entries; each index entry holds a key and a tuple id.
PS: the only smgr implementation provided is the disk handler.
Relations as Files:
PostgreSQL identifies relation files by OIDs. The core data structure is RelFileNode:
typedef struct RelFileNode{
Oid spcNode; //tablespace
Oid dbNode; //database
Oid relNode; //relation
}RelFileNode;
For global (shared) tables (e.g. pg_database):
spcNode == GLOBALTABLESPACE_OID
dbNode ==0
char *relpath(RelFileNode r) //simplified
{
char *path = malloc(ENOUGH_SPACE);
if(r.spcNode == GLOBALTABLESPACE_OID){
/* Shared system relations live in PGDATA/global */
Assert(r.dbNode ==0);
//%u-->oid of relation
sprintf(path, "%s/global/%u",
DataDir, r.relNode);
}
else if (r.spcNode == DEFAULTTABLESPACE_OID){
/* The default tablespace is PGDATA/base */
sprintf(path, "%s/base/%u/%u",
DataDir, r.dbNode, r.relNode);
}
else {
/* All other tablespaces accessed via symlinks */
sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir,
r.spcNode, r.dbNode, r.relNode);
}
return path;
}
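To experiment with the path mapping outside PostgreSQL, the same logic can be written as a standalone function. This is a sketch only: the two tablespace OID values are stand-ins defined locally (the real constants come from the PostgreSQL headers), the data directory is passed in as a parameter, and relpath_into is a hypothetical name.

```c
#include <stdio.h>

typedef unsigned int Oid;

/* Stand-in values for this sketch; not taken from the notes. */
#define GLOBALTABLESPACE_OID  1664
#define DEFAULTTABLESPACE_OID 1663

typedef struct {
    Oid spcNode;  /* tablespace */
    Oid dbNode;   /* database */
    Oid relNode;  /* relation */
} RelFileNode;

/* Write the file path for r into buf. Uses snprintf so an undersized
   buffer is reported instead of overflowing.
   Returns 0 on success, -1 if buf is too small. */
int relpath_into(char *buf, size_t len, const char *dataDir, RelFileNode r)
{
    int n;
    if (r.spcNode == GLOBALTABLESPACE_OID)
        n = snprintf(buf, len, "%s/global/%u", dataDir, r.relNode);
    else if (r.spcNode == DEFAULTTABLESPACE_OID)
        n = snprintf(buf, len, "%s/base/%u/%u", dataDir, r.dbNode, r.relNode);
    else
        n = snprintf(buf, len, "%s/pg_tblspc/%u/%u/%u",
                     dataDir, r.spcNode, r.dbNode, r.relNode);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

With data directory /srvr/jas/pgsql/data, database 13645 and relation 12348 in the default tablespace, the result is /srvr/jas/pgsql/data/base/13645/12348.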
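File Descriptor Pool
Unix limits the number of files a process can have open at once.
PostgreSQL maintains a pool of open file descriptors, which:
1. hides this limit from higher-level functions
2. minimises expensive open() operations
File names are simple strings: typedef char *FileName
Open files: typedef int File
A File is an index into a table of "virtual file descriptors"
Defs: include/storage/fd.h
Code: backend/storage/file/fd.c
Interface to the file descriptor pool: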
File PathNameOpenFile(char *fileName, int flags)
// open a file with default pg.conf mode
File PathNameOpenFilePerm(char *fName, int flags, int mode)
// open a file in the DB directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact)
// open temp file; flag: close at end of transaction?
void FileClose(File file)
void FileUnlink(File file)
int FileRead(File file, char *buffer, int amount)
int FileWrite(File file, char *buffer, int amount)
int FileSync(File file)
long FileSeek(File file, long offset, int whence)
int FileTruncate(File file, long offset)
Virtual file descriptor records:
typedef struct vfd
{
s_short fd; //current FD, or VFD_CLOSED if none
u_short fdstate; //bitflags for Vfd's state
File nextFree; //link to next free Vfd, if in freelist
File lruMoreRecently; //doubly linked recency-of-use list
File lruLessRecently;
long seekPos; //current logical file position
char *fileName; //name of file, or NULL for unused Vfd
// NB: filename is malloc'd, and must be free'd when closing the Vfd
int fileFlags; //open(2) flags for (re)opening the file
int fileMode; //mode to pass to open(2)
}Vfd;
Exercise 3: Opening a Vfd
f = PathNameOpenFilePerm(
"/srvr/jas/pgsql/data/base/13645/12348",
//with both O_CREAT and O_EXCL set, open() fails if the file already exists
// "|" is bitwise OR, combining the flags
O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
//mode
0600
);
Rough structure of PathNameOpenFilePerm:
File PathNameOpenFilePerm(char *fname, int flags, int mode)
{
i = search VfdCache for fname
if (i == NOTFOUND){
if (VfdCache is full) double the size of VfdCache
i = next available VfdCache entry
}
if (VfdCache[i] is not open){
if (max open files already){
victim = LRU VfdCache entry
close(VfdCache[victim].fd)
}
VfdCache[i].fd = open(fname, flags, mode)
if (VfdCache[i].fd < 0)
... we have a problem ...
}
return i;
}
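The LRU eviction step can be tried out in isolation. The sketch below is a simulation, not PostgreSQL's fd.c: no real files are opened, the table has a fixed size instead of doubling, recency is tracked with a logical clock rather than a doubly linked list, and all names (SimVfd, sim_open, sim_is_open, MAX_OPEN) are hypothetical.

```c
#include <string.h>

#define NVFD     8   /* table size (fixed here; the real cache grows) */
#define MAX_OPEN 2   /* pretend OS limit on simultaneously open files */

typedef struct {
    char name[64];            /* file name; empty string = unused slot */
    int  is_open;             /* stands in for holding a real OS descriptor */
    unsigned long last_used;  /* logical clock value of the last access */
} SimVfd;

static SimVfd cache[NVFD];
static unsigned long clock_tick = 0;
static int n_open = 0;

/* Return the slot for name, "opening" it if necessary and closing the
   least recently used open slot when MAX_OPEN is reached.
   Returns -1 if the table has no free slot. */
int sim_open(const char *name)
{
    int i, slot = -1;

    for (i = 0; i < NVFD; i++)            /* search the cache for name */
        if (cache[i].name[0] != '\0' && strcmp(cache[i].name, name) == 0) {
            slot = i;
            break;
        }
    if (slot < 0) {                       /* not found: take an unused slot */
        for (i = 0; i < NVFD; i++)
            if (cache[i].name[0] == '\0') { slot = i; break; }
        if (slot < 0) return -1;
        strncpy(cache[slot].name, name, sizeof cache[slot].name - 1);
        cache[slot].name[sizeof cache[slot].name - 1] = '\0';
        cache[slot].is_open = 0;
    }
    if (!cache[slot].is_open) {
        if (n_open >= MAX_OPEN) {         /* evict the LRU open entry */
            int victim = -1;
            for (i = 0; i < NVFD; i++)
                if (cache[i].is_open &&
                    (victim < 0 || cache[i].last_used < cache[victim].last_used))
                    victim = i;
            cache[victim].is_open = 0;
            n_open--;
        }
        cache[slot].is_open = 1;
        n_open++;
    }
    cache[slot].last_used = ++clock_tick; /* mark as most recently used */
    return slot;
}

int sim_is_open(int slot) { return cache[slot].is_open; }
```

An evicted entry keeps its slot and name, so the caller's File index stays valid and the file can be reopened transparently on the next access. That is the point of making the descriptors "virtual".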
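PostgreSQL stores each table under the directory PGDATA/pg_database.oid, usually in multiple files (aka forks):
A page holds not only tuples but also data that lets tuples be located within the page.
Free space map (Oid_fsm):
Records the free space in data pages.
Space is only freed by VACUUM (DELETE simply marks tuples as no longer in use by setting xmax).
Visibility map (Oid_vm):
Records pages in which all tuples are "visible" (visible = accessible to all currently active transactions). VACUUM can skip these pages.
"Magnetic disk storage manager" (storage/smgr/md.c)
Manages its own pool of open file descriptors (Vfd's).
If forks exist, it may use several Vfd's to access the data.
Manages the mapping from PageID to file+offset.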
typedef struct
{
RelFileNode rnode; // which relation/file
ForkNumber forkNum; // which fork (of reln)
BlockNumber blockNum; // which page/block
} BufferTag;
Accessing a data block:
// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
File fid; off_t offset; int fd;
(fid, offset) = findBlock(pageID)
fd = VfdCache[fid].fd;
lseek(fd, offset, SEEK_SET)
VfdCache[fid].seekPos = offset;
nread = read(fd, buf, BLOCKSIZE)
// BLOCKSIZE is a global configurable constant (default: 8192)
if (nread < BLOCKSIZE) ... we have a problem
}
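The notes do not show how findBlock turns a block number into a file and an offset. One plausible sketch, assuming each fork is split into segment files of 1 GB (131072 blocks of 8192 bytes), is given below; the constant BLOCKS_PER_SEGMENT, the BlockLocation struct and the function locate_block are hypothetical names for illustration.

```c
#define BLOCKSIZE          8192     /* bytes per block */
#define BLOCKS_PER_SEGMENT 131072   /* assumed: 1 GB segments of 8 KB blocks */

typedef unsigned int BlockNumber;

/* Where a block lives: which segment file, and the byte offset inside it. */
typedef struct {
    unsigned int segNo;
    long long    offset;
} BlockLocation;

BlockLocation locate_block(BlockNumber blockNum)
{
    BlockLocation loc;
    loc.segNo  = blockNum / BLOCKS_PER_SEGMENT;
    loc.offset = (long long)(blockNum % BLOCKS_PER_SEGMENT) * BLOCKSIZE;
    return loc;
}
```

Under these assumptions block 131073 is the second block of segment 1, at byte offset 8192 within that segment file. The segment number, together with the RelFileNode and fork number from the BufferTag, would then select which Vfd to read from.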
Columns are the schema; rows are the data.
Reference:
COMP9315 Week 03 lectures