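PostgreSQL 9.6 Source Code Analysis: XLOG Generation (Part 2), Internal Structure of XLOG Files

Structure of an xlog segment file

In PostgreSQL 9.6, WAL segment files (for example 000000010000000000000001) are created in the pg_xlog directory under the data directory (the directory was renamed pg_wal in PostgreSQL 10). Each segment is divided into pages, and every page is laid out as shown in the figure below.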

[Figure: layout of a page inside a WAL segment. Image not available.]
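The segment file name above encodes the timeline ID and the segment number as three 8-digit hexadecimal fields. The following is a minimal sketch of that mapping, assuming the default 16 MB segment size; the helper names `lsn_to_segno` and `wal_file_name` are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: default WAL segment size of 16 MB. */
#define WAL_SEG_SIZE     (16u * 1024u * 1024u)
/* Number of segments that share the same middle ("log id") field. */
#define SEGS_PER_XLOGID  (UINT64_C(0x100000000) / WAL_SEG_SIZE)

/* Segment number that contains the given WAL byte position (LSN). */
static uint64_t lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEG_SIZE;
}

/* Format the 24-character segment file name; buf must hold 25 bytes. */
static void wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buflen >= 25);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / SEGS_PER_XLOGID),
             (uint32_t) (segno % SEGS_PER_XLOGID));
}
```

With timeline 1 and LSN 0/1000000 (the first byte of the second 16 MB segment), the sketch yields the file name used as the example above.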

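Page header

A WAL page uses one of two header structures, XLogPageHeaderData or XLogLongPageHeaderData.
The first page of a segment file carries an XLogLongPageHeaderData header; every later page carries an XLogPageHeaderData header.
XLogLongPageHeaderData embeds the standard XLogPageHeaderData and adds three members:
xlp_sysid: the system identifier, matching the one stored in pg_control;
xlp_seg_size: the segment size;
xlp_xlog_blcksz: the WAL page size.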
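As a rough illustration of the two headers, the sketch below re-declares them with fixed-width types and picks the header size from the XLP_LONG_HEADER bit in xlp_info. The field lists follow the 9.6 definitions in xlog_internal.h as far as I know; the helper `page_header_size` and the `MAXALIGN8` macro are hypothetical names, and 8-byte maximum alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLP_LONG_HEADER 0x0002  /* set in xlp_info when the long header is used */
/* Assumption: MAXALIGN rounds up to 8 bytes, as on common 64-bit platforms. */
#define MAXALIGN8(n) (((size_t) (n) + 7u) & ~(size_t) 7u)

typedef struct XLogPageHeaderData
{
    uint16_t xlp_magic;     /* magic value for correctness checks */
    uint16_t xlp_info;      /* flag bits */
    uint32_t xlp_tli;       /* timeline of the first record on the page */
    uint64_t xlp_pageaddr;  /* WAL address of this page */
    uint32_t xlp_rem_len;   /* remaining length of a continued record */
} XLogPageHeaderData;

typedef struct XLogLongPageHeaderData
{
    XLogPageHeaderData std;     /* standard header fields */
    uint64_t xlp_sysid;         /* system identifier from pg_control */
    uint32_t xlp_seg_size;      /* segment size, kept as a cross-check */
    uint32_t xlp_xlog_blcksz;   /* WAL page size, kept as a cross-check */
} XLogLongPageHeaderData;

/* Number of bytes the header occupies at the start of the page. */
static size_t page_header_size(const XLogPageHeaderData *hdr)
{
    assert(hdr != NULL);
    if (hdr->xlp_info & XLP_LONG_HEADER)
        return MAXALIGN8(sizeof(XLogLongPageHeaderData));
    return MAXALIGN8(sizeof(XLogPageHeaderData));
}
```

Under these assumptions the short header takes 24 bytes and the long header 40 bytes, so the first page of a segment has 16 fewer bytes available for records.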

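remaindata (optional)

This area holds the part of the previous page's last record that did not fit on that page.
When a WAL record continues across a page boundary, the xlp_info field in the new page's header has the XLP_FIRST_IS_CONTRECORD flag set.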

/* When record crosses page boundary, set this flag in new page's header */
#define XLP_FIRST_IS_CONTRECORD		0x0001

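WAL records may span pages: when the space left on the current page cannot hold the whole record, the rest is written to the next page. The xlp_rem_len field of XLogPageHeaderData records how many bytes of that record are still to come, counted from the start of this page. When xlp_rem_len is 0, this area does not exist.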
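The relationship between the flag, xlp_rem_len and the position of the first record that starts on a page can be sketched as follows. This is only an illustration: it assumes the default 8 KB WAL page and 8-byte alignment of records, and `first_record_offset` is a hypothetical helper, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdint.h>

#define WAL_BLCKSZ 8192u                 /* assumption: default XLOG_BLCKSZ */
#define XLP_FIRST_IS_CONTRECORD 0x0001
/* Assumption: records start on 8-byte (MAXALIGN) boundaries. */
#define MAXALIGN8(len) (((uint32_t) (len) + 7u) & ~7u)

/*
 * Offset within the page of the first record that starts on it,
 * or 0 when continuation data occupies the whole page.
 * hdr_size is the page header size (short or long header).
 */
static uint32_t first_record_offset(uint16_t xlp_info, uint32_t xlp_rem_len,
                                    uint32_t hdr_size)
{
    uint32_t off;

    assert(hdr_size < WAL_BLCKSZ);

    /* No continued record: the first record follows the header directly. */
    if (!(xlp_info & XLP_FIRST_IS_CONTRECORD) || xlp_rem_len == 0)
        return hdr_size;

    /* The continued record also covers the rest of this page. */
    if (xlp_rem_len >= WAL_BLCKSZ - hdr_size)
        return 0;

    /* Skip the continuation data, rounded up to the next aligned position. */
    off = hdr_size + MAXALIGN8(xlp_rem_len);
    return off < WAL_BLCKSZ ? off : 0;
}
```

For example, with a 24-byte header and 13 bytes of continuation data, the first new record would start at offset 40 in this sketch.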

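Record

See the WAL record structure described later in this article.

Incomplete record

The last record on a page may be incomplete; its remaining part is stored on the next page.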

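Unused area

In releases before 9.3, the fixed-size XLogRecord header could not be split across pages, so when the space left at the end of a page was too small to hold an XLogRecord struct, that space was abandoned. Since 9.3, and therefore in 9.6, the header may be split across pages like the rest of the record. Unused bytes now come mainly from the alignment padding between records and from the tail of a segment that is skipped after an XLOG switch record.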

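Structure of a WAL record

Every WAL record has the structure shown in the figure below.
[Figure: layout of a WAL record. Image not available.]

XLogRecord

XLogRecord is the entry point of a WAL record: parsing a record starts from this struct. Its definition is as follows.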

typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	uint8		xl_info;		/* flag bits, see below */
	RmgrId		xl_rmid;		/* resource manager for this record */
	/* 2 bytes of padding here, initialize to zero */
	pg_crc32c	xl_crc;			/* CRC for this record */

	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

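Meaning of each member:
xl_tot_len: total length of the record, including every part shown in the figure.
xl_xid: ID of the transaction that produced the record.
xl_prev: location of the previous record.
xl_info: identifies the subtype of the WAL record and is interpreted together with xl_rmid. For example, when xl_rmid is RM_HEAP_ID, xl_info can be XLOG_HEAP_INSERT, XLOG_HEAP_DELETE or XLOG_HEAP_UPDATE. The upper four bits belong to the resource manager; the lower four bits are reserved for XLOG itself.
xl_rmid: identifies the resource manager, that is, the broad type of the record. For example, RM_XACT_ID marks transaction records, RM_DBASE_ID marks database creation and removal, and RM_HEAP_ID marks inserts, updates and deletes on table data. The possible values are listed in src/include/access/rmgrlist.h.
xl_crc: CRC checksum of the record.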
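To illustrate how xl_info is interpreted once xl_rmid is known to be RM_HEAP_ID, here is a small sketch. The mask and opcode values are the ones I believe heapam_xlog.h and xlogrecord.h use in 9.6, so treat them as assumptions to check against the headers; `heap_op_name` and `heap_inits_page` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, mirroring xlogrecord.h and heapam_xlog.h in 9.6. */
#define XLR_INFO_MASK        0x0F  /* bits reserved for XLOG itself */
#define XLR_RMGR_INFO_MASK   0xF0  /* bits owned by the resource manager */
#define XLOG_HEAP_OPMASK     0x70  /* heap opcode within the rmgr bits */
#define XLOG_HEAP_INSERT     0x00
#define XLOG_HEAP_DELETE     0x10
#define XLOG_HEAP_UPDATE     0x20
#define XLOG_HEAP_INIT_PAGE  0x80  /* redo must initialize the page first */

/* Name of the heap operation encoded in xl_info (xl_rmid == RM_HEAP_ID). */
static const char *heap_op_name(uint8_t xl_info)
{
    /* The two masks together cover the whole byte. */
    assert((XLR_INFO_MASK | XLR_RMGR_INFO_MASK) == 0xFF);

    switch (xl_info & XLOG_HEAP_OPMASK)
    {
        case XLOG_HEAP_INSERT: return "INSERT";
        case XLOG_HEAP_DELETE: return "DELETE";
        case XLOG_HEAP_UPDATE: return "UPDATE";
        default:               return "OTHER";
    }
}

/* Non-zero when the record also asks redo to initialize the page. */
static int heap_inits_page(uint8_t xl_info)
{
    return (xl_info & XLOG_HEAP_INIT_PAGE) != 0;
}
```

The same xl_info value means something different under another resource manager, which is why a reader has to dispatch on xl_rmid first.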

XLogRecordBlockHeader

typedef struct XLogRecordBlockHeader
{
	uint8		id;				/* block reference ID */
	uint8		fork_flags;		/* fork within the relation, and flags */
	uint16		data_length;	/* number of payload bytes (not including page
								 * image) */

	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
	/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
	/* BlockNumber follows */
} XLogRecordBlockHeader;

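Meaning of each member:
id: a record can reference several blocks; this is the block's reference number, from 0 up to XLR_MAX_BLOCK_ID (32).
fork_flags: the lower four bits hold the fork number within the relation, and the upper four bits are flags that say what information is stored for this block.
data_length: length of the data stored for this block in the tuple data area (not including the page image).

The flag values in fork_flags are as follows: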

/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK	0x0F
#define BKPBLOCK_FLAG_MASK	0xF0
#define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage,
									 * i.e. the record carries a full-page
									 * image of this block */
#define BKPBLOCK_HAS_DATA	0x20	/* the record carries data for this block,
									 * e.g. the modified tuple contents */
#define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
#define BKPBLOCK_SAME_REL	0x80	/* RelFileNode omitted, same as previous:
									 * the block belongs to the same relation
									 * as the preceding block reference */
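Splitting fork_flags into its fork number and its flags can be sketched as below. The masks repeat the definitions above; the `BlockFlags` struct and `decode_fork_flags` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BKPBLOCK_FORK_MASK  0x0F
#define BKPBLOCK_FLAG_MASK  0xF0
#define BKPBLOCK_HAS_IMAGE  0x10
#define BKPBLOCK_HAS_DATA   0x20
#define BKPBLOCK_WILL_INIT  0x40
#define BKPBLOCK_SAME_REL   0x80

/* Decoded view of the fork_flags byte (hypothetical helper type). */
typedef struct BlockFlags
{
    uint8_t fork;       /* fork number from the lower four bits */
    int     has_image;  /* a full-page image follows */
    int     has_data;   /* per-block data is present */
    int     will_init;  /* redo re-initializes the page */
    int     same_rel;   /* RelFileNode is omitted */
} BlockFlags;

static BlockFlags decode_fork_flags(uint8_t fork_flags)
{
    BlockFlags f;

    /* The two masks partition the byte between fork number and flags. */
    assert((BKPBLOCK_FORK_MASK | BKPBLOCK_FLAG_MASK) == 0xFF);

    f.fork      = (uint8_t) (fork_flags & BKPBLOCK_FORK_MASK);
    f.has_image = (fork_flags & BKPBLOCK_HAS_IMAGE) != 0;
    f.has_data  = (fork_flags & BKPBLOCK_HAS_DATA) != 0;
    f.will_init = (fork_flags & BKPBLOCK_WILL_INIT) != 0;
    f.same_rel  = (fork_flags & BKPBLOCK_SAME_REL) != 0;
    return f;
}
```

A value of 0x20, for instance, decodes to the main fork (0) with per-block data, no page image, and an explicit RelFileNode.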

XLogRecordBlockImageHeader

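This struct is present when the block reference carries a full-page image, that is, when the WAL record is a full page write record.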

/*
 * Additional header information when a full-page image is included
 * (i.e. when BKPBLOCK_HAS_IMAGE is set).
 *
 * As a trivial form of data compression, the XLOG code is aware that
 * PG data pages usually contain an unused "hole" in the middle, which
 * contains only zero bytes.  If the length of "hole" > 0 then we have removed
 * such a "hole" from the stored data (and it's not counted in the
 * XLOG record's CRC, either).  Hence, the amount of block data actually
 * present is BLCKSZ - the length of "hole" bytes.
 *
 * When wal_compression is enabled, a full page image which "hole" was
 * removed is additionally compressed using PGLZ compression algorithm.
 * This can reduce the WAL volume, but at some extra cost of CPU spent
 * on the compression during WAL logging. In this case, since the "hole"
 * length cannot be calculated by subtracting the number of page image bytes
 * from BLCKSZ, basically it needs to be stored as an extra information.
 * But when no "hole" exists, we can assume that the "hole" length is zero
 * and no such an extra information needs to be stored. Note that
 * the original version of page image is stored in WAL instead of the
 * compressed one if the number of bytes saved by compression is less than
 * the length of extra information. Hence, when a page image is successfully
 * compressed, the amount of block data actually present is less than
 * BLCKSZ - the length of "hole" bytes - the length of extra information.
 */
typedef struct XLogRecordBlockImageHeader
{
	uint16		length;			/* number of page image bytes */
	uint16		hole_offset;	/* number of bytes before "hole" */
	uint8		bimg_info;		/* flag bits, see below */

	/*
	 * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED, an
	 * XLogRecordBlockCompressHeader struct follows.
	 */
} XLogRecordBlockImageHeader;

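Meaning of each member:
length: number of page image bytes actually stored (after the hole has been removed and, if applicable, after compression).
hole_offset: number of bytes of page data that precede the hole.
bimg_info: flag bits recording whether the image has a hole and whether it is compressed.

Note: the hole is the part of a data page that holds no tuples and consists only of zero bytes. To reduce WAL volume, PostgreSQL drops the hole when writing the page image and may additionally compress the image.

The possible bimg_info values are as follows: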

/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
#define BKPIMAGE_IS_COMPRESSED		0x02		/* page image is compressed */

XLogRecordBlockCompressHeader

This struct records the size of the hole. It is only present when the image both has a hole and is compressed.

/*
 * Extra header information used when page image has "hole" and
 * is compressed.
 */
typedef struct XLogRecordBlockCompressHeader
{
	uint16		hole_length;	/* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;
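The effect of hole removal on replay can be sketched as follows: the stored image is copied back around a zero-filled gap. This is an illustration under assumptions, not the PostgreSQL code: it assumes the default 8 KB BLCKSZ, takes an image that has already been decompressed, and uses the hypothetical helper names `hole_length_uncompressed` and `restore_page`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192u                 /* assumption: default page size */
#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_IS_COMPRESSED  0x02

/*
 * For an uncompressed image the hole length is not stored; it follows
 * from the page size and the number of image bytes.
 */
static uint16_t hole_length_uncompressed(uint16_t length, uint8_t bimg_info)
{
    assert(!(bimg_info & BKPIMAGE_IS_COMPRESSED));
    assert(length <= BLCKSZ);
    return (bimg_info & BKPIMAGE_HAS_HOLE) ? (uint16_t) (BLCKSZ - length) : 0;
}

/*
 * Rebuild a full page from an image whose hole was removed.
 * image holds BLCKSZ - hole_length bytes; page must hold BLCKSZ bytes.
 */
static void restore_page(uint8_t *page, const uint8_t *image,
                         uint16_t hole_offset, uint16_t hole_length)
{
    assert((uint32_t) hole_offset + hole_length <= BLCKSZ);

    /* Data before the hole. */
    memcpy(page, image, hole_offset);
    /* The hole itself is all zero bytes. */
    memset(page + hole_offset, 0, hole_length);
    /* Data after the hole. */
    memcpy(page + hole_offset + hole_length,
           image + hole_offset,
           BLCKSZ - (hole_offset + hole_length));
}
```

When the image is compressed, the stored length no longer reveals the hole size, which is why XLogRecordBlockCompressHeader has to carry hole_length explicitly.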

RelFileNode

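This struct identifies the relation that the block belongs to. If the current block comes from the same relation as the previous block, the BKPBLOCK_SAME_REL flag is set in fork_flags and the RelFileNode is omitted.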

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	Oid			relNode;		/* relation */
} RelFileNode;

BlockNumber

The block number, within the relation fork, of the page that this block reference describes.
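Putting the optional pieces together, the number of header bytes used by one block reference depends on the flags. The sketch below adds them up, assuming the headers are stored packed with no padding and using the field sizes from the structs above (4, 5, 2, 12 and 4 bytes); `block_ref_header_size` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BKPBLOCK_HAS_IMAGE      0x10
#define BKPBLOCK_SAME_REL       0x80
#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_IS_COMPRESSED  0x02

/*
 * Header bytes for one block reference.
 * bimg_info is only meaningful when BKPBLOCK_HAS_IMAGE is set;
 * callers pass 0 otherwise.
 */
static size_t block_ref_header_size(uint8_t fork_flags, uint8_t bimg_info)
{
    size_t n = 4;  /* id (1) + fork_flags (1) + data_length (2) */

    assert((fork_flags & BKPBLOCK_HAS_IMAGE) || bimg_info == 0);

    if (fork_flags & BKPBLOCK_HAS_IMAGE)
    {
        n += 5;    /* length (2) + hole_offset (2) + bimg_info (1) */
        if ((bimg_info & BKPIMAGE_HAS_HOLE) &&
            (bimg_info & BKPIMAGE_IS_COMPRESSED))
            n += 2;  /* XLogRecordBlockCompressHeader.hole_length */
    }

    if (!(fork_flags & BKPBLOCK_SAME_REL))
        n += 12;   /* RelFileNode: three 4-byte Oids */

    n += 4;        /* BlockNumber */
    return n;
}
```

A data-only block that repeats the previous relation therefore needs only 8 header bytes in this sketch, while a compressed image with a hole and its own RelFileNode needs 27.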

XLogRecordDataHeaderLong/XLogRecordDataHeaderShort

These structs describe the main data part of a record (for example checkpoint data). XLogRecordDataHeaderShort is used when the main data is smaller than 256 bytes; otherwise XLogRecordDataHeaderLong is used.

typedef struct XLogRecordDataHeaderShort
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_SHORT */
	uint8		data_length;	/* number of payload bytes */
}	XLogRecordDataHeaderShort;


typedef struct XLogRecordDataHeaderLong
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_LONG */
	/* followed by uint32 data_length, unaligned */
}	XLogRecordDataHeaderLong;
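The choice between the two headers can be sketched as a small writer. The ID values 255 and 254 are what I believe xlogrecord.h defines for XLR_BLOCK_ID_DATA_SHORT and XLR_BLOCK_ID_DATA_LONG, so treat them as assumptions; `write_main_data_header` is a hypothetical helper, and skipping the header for empty main data is also an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values, mirroring xlogrecord.h. */
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254

/*
 * Write the main-data header for len payload bytes into buf
 * (at least 5 bytes) and return the number of header bytes written.
 */
static size_t write_main_data_header(uint8_t *buf, uint32_t len)
{
    assert(buf != NULL);

    if (len == 0)
        return 0;               /* assumption: no header without main data */

    if (len > 255)
    {
        buf[0] = XLR_BLOCK_ID_DATA_LONG;
        /* The 4-byte length follows unaligned, in native byte order. */
        memcpy(buf + 1, &len, sizeof(len));
        return 1 + sizeof(len);
    }

    buf[0] = XLR_BLOCK_ID_DATA_SHORT;
    buf[1] = (uint8_t) len;
    return 2;
}
```

The short form saves three bytes per record, which adds up because most records carry only a small amount of main data.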

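block data

Block data contains two kinds of data: full-page images (written for full page writes) and tuple data (the per-block data of ordinary update records).

main data

The main data part holds data that is not tied to a buffer, such as the contents of a checkpoint record.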
