COMP9315 DBMS Lecture Notes

File Management

File operations provided by the operating system:

	fd = open(fileName, mode)
	// open a named file for reading/writing/appending
	close(fd)
	// close an open file, via its descriptor
	nread = read(fd, buf, nbytes)
	// attempt to read data from file into buffer
	nwritten = write(fd, buf, nbytes)
	// attempt to write data from buffer to file
	lseek(fd, offset, seek_type)
	// move file pointer to relative/absolute file offset
	fsync(fd)
	// flush contents of file buffers to disk
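
The calls listed above can be combined into a small round trip. The following is a minimal sketch, assuming a POSIX system; the function name write_then_read and its parameters are made up for illustration and are not part of the lecture material.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write nbytes from src at byte offset off of the
 * file at path, flush them to disk, then read them back into dst.
 * Returns 0 on success, -1 on any failure. */
int write_then_read(const char *path, off_t off,
                    const char *src, char *dst, size_t nbytes)
{
    assert(path != NULL && src != NULL && dst != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Each step only runs if the previous one succeeded. */
    int ok = lseek(fd, off, SEEK_SET) == off
          && write(fd, src, nbytes) == (ssize_t) nbytes
          && fsync(fd) == 0
          && lseek(fd, off, SEEK_SET) == off
          && read(fd, dst, nbytes) == (ssize_t) nbytes;

    close(fd);
    return ok ? 0 : -1;
}
```

The second lseek is needed because write advances the file pointer past the data just written.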

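Different DBMSs arrange their data on disk in different ways:
(1) Bypass the file system and use a raw disk partition (early Oracle)
(2) One large file containing all DB data (SQLite)
(3) Several large files, with tables spread across them (current Oracle)
(4) Multiple data files, one per table (PostgreSQL)
(5) Multiple files per table (including one main file)
…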

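Single-file Storage Manager (single-file DBMS layer)

Note: in a single-file layout, if the Employee data grows large, either enlarge the Employee data pages region, or store the remaining Employee data after the Project data pages and add a link to it.

SpaceMap: for each chunk, its offset, how much space it uses, and its status, e.g. [(0,10,U), …]
NameMap: for a given table name, where its data sits in the database file and how much space it uses, e.g. [("employee",20,350), …]

Each file segment (chunk) contains a fixed number of blocks.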
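
To make the NameMap idea concrete, here is a minimal sketch of a lookup by table name. It assumes each entry is a (name, start page, number of pages) triple, which is one reading of the ("employee",20,350) example; the type NameMapEntry and the function name_map_lookup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical NameMap entry: (table name, start page, number of pages). */
typedef struct {
    const char *relname;
    int start;
    int npages;
} NameMapEntry;

/* Linear search for relname; returns the matching entry or NULL. */
const NameMapEntry *name_map_lookup(const NameMapEntry *map, size_t n,
                                    const char *relname)
{
    assert(map != NULL && relname != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(map[i].relname, relname) == 0)
            return &map[i];
    }
    return NULL;
}
```

A real catalogue would likely use a hash table or an index rather than a linear scan; the scan keeps the sketch short.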

Data definitions

#define PAGESIZE 4096 //bytes per page
typedef long PageId; //PageId is block index
					 // pageOffset = PageId*PAGESIZE
typedef char *Page;  // pointer to page/block buffer

Typical PAGESIZE values: 1024, 2048, 4096, 8192

Storage Manager data structures for the DB:

typedef struct DBrec{
	char *dbname; //copy of database name
	int fd;
	SpaceMap map;
	NameMap names;
} *DB;
//DB
typedef struct Relrec{
	char *relname; //copy of table name
	int start; //page index of start of table data
	int npages; //number of pages of table data
	...
} *Rel;
//Table

E.g. scanning a table

DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
// buffer for the current page
Page buffer = malloc(PAGESIZE*sizeof(char));
// assumes the table's pages are stored contiguously
for (int i = 0; i < r->npages; i++) {
	// compute the PageId from the table's start page
	PageId pid = r->start+i;
	// fetch the page from the DB into buffer
	get_page(db,pid,buffer);
	// process each tuple in the page held in buffer
	for each tuple in buffer {
		get tuple data and extract name
		add (name) to result tuples
	}
}

// start using DB, buffer meta-data
DB openDatabase(char *name){
	DB db = malloc(sizeof(struct DBrec));
	db->dbname = strdup(name);
	db->fd = open(name,O_RDWR);
	db->map = readSpaceMap(db->fd);
	db->names = readNameMap(db->fd);
	return db;
}
// set up struct describing relation
Rel openRelation(DB db, char *rname){
	Rel r = malloc(sizeof(struct Relrec));
	r->relname = strdup(rname);
	// get relation data from map tables
	r->start = ...;
	r->npages = ...;
	return r;
}

// assume that Page = byte[PAGESIZE]
// assume that PageId = block number in file
...
// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf){
	lseek(db->fd, p*PAGESIZE, SEEK_SET);
	read(db->fd, buf, PAGESIZE);
}
// write page from memory buffer to file
// PageId: index of the page
void put_page(DB db, PageId p, Page buf){
	lseek(db->fd, p*PAGESIZE, SEEK_SET);
	write(db->fd, buf, PAGESIZE);
}

Exercise 1: Relation Scan Cost
Consider a table R(x,y,z) where:
1. there are r = 10^4 tuples
2. average tuple size R = 200 bytes
3. data page size B = 4096 bytes
4. time to read one data page T = 10 msec
5. time to check one tuple = 1 usec
6. time to form one result tuple = 1 usec
7. time to write one result page T = 10 msec
Assuming 50% of the tuples satisfy the condition, compute the total time for the query:

insert into S select * from R where x>10;

Answer steps:

total # of tuples = r = 10000
#bytes/tuple = R = 200
#bytes/page = B = 4096
#tuples/page = c = floor(B/R) = floor(20.48) = 20
total # of pages = ceil(r/c) = ceil(10000/20) = 500
#pages to read = b = 500
time to read pages = 500 * 10 msec = 5000 msec
cost to check all tuples = 10000 * 1 usec = 10 msec
tuples in result = 50% * 10000 = 5000
cost to form result tuples = 5000 * 1 usec = 5 msec
#pages written = ceil(5000/20) = 250
time to write pages = 250 * 10 msec = 2500 msec
Total time = reading + checking + forming + writing
           = 5000 + 10 + 5 + 2500 = 7515 msec
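
The same arithmetic can be written as a small function, which makes it easy to vary the parameters. This is a sketch only: the struct ScanParams and the function names are invented here, all times are in microseconds, and result tuples are assumed to be the same size as the tuples of R (as in the worked answer).

```c
#include <assert.h>

/* Hypothetical parameter bundle; all times are in microseconds. */
typedef struct {
    long ntuples;        /* r: number of tuples in R           */
    long tuple_bytes;    /* R: average tuple size              */
    long page_bytes;     /* B: data page size                  */
    long page_io_usec;   /* T: time to read or write one page  */
    long check_usec;     /* time to check one tuple            */
    long form_usec;      /* time to form one result tuple      */
    long match_percent;  /* share of tuples that qualify, in % */
} ScanParams;

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Total cost of "insert into S select * from R where ..." in usec. */
long scan_insert_cost_usec(ScanParams p)
{
    assert(p.tuple_bytes > 0 && p.page_bytes >= p.tuple_bytes);

    long c       = p.page_bytes / p.tuple_bytes;        /* tuples per page */
    long b       = ceil_div(p.ntuples, c);              /* pages to read   */
    long nresult = p.ntuples * p.match_percent / 100;   /* result tuples   */
    long bout    = ceil_div(nresult, c);                /* pages to write  */

    return b * p.page_io_usec          /* reading  */
         + p.ntuples * p.check_usec    /* checking */
         + nresult * p.form_usec       /* forming  */
         + bout * p.page_io_usec;      /* writing  */
}
```

With the exercise's values (10000 tuples, 200 bytes each, 4096-byte pages, 10000 usec per page I/O, 1 usec per check and per result tuple, 50% selectivity) the function returns 7515000 usec, i.e. 7515 msec.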

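Multiple-file Disk Manager

Most DBMSs do not store all their data in one large file. Instead they:
1. use multiple files, partitioned physically or logically
2. map DB-level objects to files (via meta-data)

The advantage of multiple-file storage is that adding tuples to a table is easier.

What a PageId contains depends on the storage scheme:

With one file per table, a PageId contains a relation identifier (which maps to a filename) and a page number (which identifies the page within that file).
With several files per table, a PageId contains a relation identifier, a page number and a file identifier (e.g. "the first file of table1"; combined with the relation identifier, it maps to a filename).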
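
As an illustration of the several-files-per-table case, the sketch below bundles the three components into one struct and derives a file name from the relation and file identifiers. The struct MultiFilePageId and the "<rel_id>.<file_no>" naming scheme are assumptions made for this example, not a description of any particular DBMS.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical PageId for a several-files-per-table scheme. */
typedef struct {
    unsigned rel_id;    /* relation identifier                  */
    unsigned file_no;   /* which of the relation's files        */
    unsigned page_no;   /* page number within that file         */
} MultiFilePageId;

/* Build the (assumed) file name "<rel_id>.<file_no>" into out.
 * Returns the length snprintf reports for the full name. */
int page_file_name(MultiFilePageId pid, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    return snprintf(out, outlen, "%u.%u", pid.rel_id, pid.file_no);
}
```

The page number is not part of the file name; it is only used to compute the byte offset once the file has been opened.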

PostgreSQL file organisation

The storage manager consists of:
1. a mapping from relations to files (data structure: RelFileNode)
2. an abstraction for a pool of open relations (storage/smgr); the pool holds one entry per open relation
3. functions for managing files (storage/smgr/md.c)
4. a file-descriptor pool (storage/file)

PostgreSQL has two basic kinds of files:

Heap files, which store data (tuples) in data blocks.
Index files, which store index entries; each entry holds a key and a tuple id.
Note: the only smgr handler provided is the disk handler.

Relations as Files:
PostgreSQL identifies relation files via OIDs; the core data structure is RelFileNode:

typedef struct RelFileNode{
	Oid spcNode; //tablespace
	Oid dbNode;  //database
	Oid relNode; //relation
}RelFileNode;

For global (shared) tables such as pg_database:
spcNode == GLOBALTABLESPACE_OID
dbNode == 0

char *relpath(RelFileNode r) //simplified
{
	char *path = malloc(ENOUGH_SPACE);
	
	if(r.spcNode == GLOBALTABLESPACE_OID){
		/* Shared system relations live in PGDATA/global */
		Assert(r.dbNode ==0);
		// %u prints the relation's OID
		sprintf(path, "%s/global/%u",
				DataDir, r.relNode);
	}
	else if (r.spcNode == DEFAULTTABLESPACE_OID){
		/* The default tablespace is PGDATA/base */
		sprintf(path, "%s/base/%u/%u",
				DataDir, r.dbNode, r.relNode);
	}
	else {
		/* All other tablespaces accessed via symlinks */
		sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir,
				r.spcNode, r.dbNode, r.relNode);
	}
	return path;
}

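File Descriptor Pool
Unix limits the number of files a process can have open at once.
PostgreSQL maintains a pool of open file descriptors in order to:
1. hide this limit from higher-level functions
2. minimise expensive open() operations
File names are plain strings: typedef char *FileName
Open files are referenced via: typedef int File
A File is an index into a table of "virtual file descriptors".
Defs: include/storage/fd.h
Code: backend/storage/file/fd.c

File descriptor (pool) interface: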

File PathNameOpenFile(char *fileName, int flags)
	// open a file with default pg.conf mode
File PathNameOpenFilePerm(char *fName, int flags, int mode)
	// open a file in the DB directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact)
	// open temp file; flag: close at end of transaction?
void FileClose(File file)
void FileUnlink(File file)
int FileRead(File file, char *buffer, int amount)
int FileWrite(File file, char *buffer, int amount)
int FileSync(File file)
long FileSeek(File file, long offset, int whence)
int FileTruncate(File file, long offset)

Virtual file descriptor records：

typedef struct vfd
{
	s_short fd; //current FD, or VFD_CLOSED if none
	u_short fdstate; //bitflags for Vfd's state
	File nextFree; //link to next free Vfd, if in freelist
	File lruMoreRecently; //doubly linked recency-of-use list
	File lruLessRecently;
	long seekPos; //current logical file position
	char *fileName; //name of file, or NULL for unused Vfd
	// NB: filename is malloc'd, and must be free'd when closing the Vfd
	int fileFlags; //open(2) flags for (re)opening the file
	int fileMode; //mode to pass to open(2)
}Vfd;

Exercise 3: Opening a Vfd

f = PathNameOpenFilePerm(
	"/srvr/jas/pgsql/data/base/13645/12348",
	// with O_CREAT | O_EXCL, open() fails if the file already exists
	// "|" is bitwise OR, used to combine the flags
	O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
	// mode
	0600
);

Rough structure of PathNameOpenFilePerm:

File PathNameOpenFilePerm(char *fname, int flags, int mode)
{
	i = search VfdCache for fname
	if (i == NOTFOUND) {
		if (VfdCache is full) double the size of VfdCache
		i = next available VfdCache entry
	}
	if (VfdCache[i] is not open) {
		if (max open files already) {
			victim = LRU VfdCache entry
			close(VfdCache[victim].fd)
		}
		VfdCache[i].fd = open(fname, flags, mode)
		if (VfdCache[i].fd < 0)
			... we have a problem ...
	}
	return i;
}
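
The "victim = LRU VfdCache entry" step relies on the doubly linked recency list formed by the lruMoreRecently and lruLessRecently fields of the Vfd record. The sketch below shows one way such a list can work, using array indices as links and entry 0 as a list header. The fixed table size, the LruLink type and the function names are assumptions for this example; it is not a copy of the PostgreSQL code.

```c
#include <assert.h>

#define NVFD 8   /* hypothetical fixed table size; entry 0 is the list header */

typedef struct {
    int lruMoreRecently;   /* index of the next more recently used entry */
    int lruLessRecently;   /* index of the next less recently used entry */
} LruLink;

/* Zero-initialised, so the header links to itself: the ring starts empty. */
static LruLink ring[NVFD];

/* Unlink entry i from the ring (e.g. before closing or re-inserting it). */
void lru_delete(int i)
{
    assert(i > 0 && i < NVFD);
    ring[ring[i].lruLessRecently].lruMoreRecently = ring[i].lruMoreRecently;
    ring[ring[i].lruMoreRecently].lruLessRecently = ring[i].lruLessRecently;
}

/* Insert entry i as the most recently used entry. */
void lru_insert(int i)
{
    assert(i > 0 && i < NVFD);
    ring[i].lruMoreRecently = 0;
    ring[i].lruLessRecently = ring[0].lruLessRecently;
    ring[0].lruLessRecently = i;
    ring[ring[i].lruLessRecently].lruMoreRecently = i;
}

/* Least recently used entry, or 0 if the ring is empty. */
int lru_victim(void)
{
    return ring[0].lruMoreRecently;
}
```

Touching an entry is a lru_delete followed by a lru_insert, so both using a file and picking a victim take constant time.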

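PostgreSQL stores each table in the directory PGDATA/pg_database.oid, usually as multiple files (a.k.a. forks):

A page holds not only tuples but also the data needed to locate tuples within the page.

Free space map (Oid_fsm):
Records the free space in the data pages.
Space is only actually freed by VACUUM (DELETE merely marks tuples as no longer in use, via xmax).

Visibility map (Oid_vm):
Records the pages in which all tuples are "visible" (visible = accessible to all currently active transactions); VACUUM can skip these pages.

"Magnetic disk storage manager" (storage/smgr/md.c):
Manages its own pool of open file descriptors (Vfds).
May use several Vfds to access the data if forks exist.
Manages the mapping from PageID to file + offset.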

typedef struct
{
	RelFileNode rnode; // which relation/file
	ForkNumber forkNum; // which fork (of reln)
	BlockNumber blockNum; // which page/block
} BufferTag;

Accessing a data block:

// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
	File fid; off_t offset; int fd;
	(fid, offset) = findBlock(pageID)
	fd = VfdCache[fid].fd;
	lseek(fd, offset, SEEK_SET)
	VfdCache[fid].seekPos = offset;
	nread = read(fd, buf, BLOCKSIZE)
	// BLOCKSIZE is a global configurable constant (default: 8192)
	if (nread < BLOCKSIZE) ... we have a problem
}
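
The arithmetic behind a findBlock-style mapping can be sketched as follows. The sketch assumes a relation fork is split into fixed-size segment files of 1 GB, i.e. 131072 blocks of 8192 bytes, which is the default PostgreSQL build configuration; the type BlockLocation and the function locate_block are invented for illustration, and the Vfd lookup that would turn the segment number into a File is left out.

```c
#include <stdint.h>

#define BLOCKSIZE   8192u      /* bytes per block (assumed default)            */
#define RELSEG_SIZE 131072u    /* blocks per segment file: 1 GB / 8 KB (assumed) */

/* Hypothetical result type: which segment file, and where in it. */
typedef struct {
    uint32_t segno;    /* segment number: 0 for the first file, 1 for ".1", ... */
    int64_t  offset;   /* byte offset of the block within that segment          */
} BlockLocation;

/* Map a block number within a fork to (segment, byte offset). */
BlockLocation locate_block(uint32_t blockNum)
{
    BlockLocation loc;
    loc.segno  = blockNum / RELSEG_SIZE;
    loc.offset = (int64_t) (blockNum % RELSEG_SIZE) * BLOCKSIZE;
    return loc;
}
```

The offset is computed in 64 bits so that the multiplication cannot overflow for blocks near the end of a segment.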

Columns are the schema; rows are the data.
Reference:
COMP9315 Week 03 lectures
