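xlog segment file structure
WAL segment files (for example 000000010000000000000001) are created in the pg_xlog directory under the data directory (PostgreSQL 9.6 and earlier; from version 10 onward the directory is named pg_wal). Each page of a WAL segment is laid out as shown in the figure below.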
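To make the segment file name concrete, here is a minimal sketch that splits a 24-character name into its timeline ID and segment number. The helper name parse_wal_segment_name is hypothetical, and the constant of 256 segments per "log" id assumes the default 16 MB segment size; a cluster built with another segment size needs a different value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: default 16 MB segments, so 4 GB / 16 MB = 256 segments per log id. */
#define SEGMENTS_PER_XLOGID 256u

/*
 * Hypothetical helper: split a 24-character segment file name into
 * timeline ID and segment number. Returns 0 on success, -1 on bad input.
 */
int
parse_wal_segment_name(const char *name, uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;

    assert(name != NULL && tli != NULL && segno != NULL);

    /* The name is three 8-digit upper-case hex fields: timeline, log id, segment. */
    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return -1;
    if (sscanf(name, "%08X%08X%08X", &t, &log, &seg) != 3)
        return -1;

    *tli = t;
    *segno = (uint64_t) log * SEGMENTS_PER_XLOGID + seg;
    return 0;
}
```

For the example name above, the helper yields timeline 1 and segment number 1.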
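Page header
WAL pages use one of two header structures, XLogPageHeaderData and XLogLongPageHeaderData.
The first page of each segment file carries an XLogLongPageHeaderData; every subsequent page carries an XLogPageHeaderData.
XLogLongPageHeaderData extends XLogPageHeaderData with three additional members:
xlp_sysid: the system identifier, matching the one stored in pg_control;
xlp_seg_size: the segment size;
xlp_xlog_blcksz: the WAL page size.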
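The two header layouts can be sketched as follows. The field lists mirror src/include/access/xlog_internal.h as of PostgreSQL 9.6 and should be checked against the version you are reading. The XLP_LONG_HEADER flag, the MAXALIGN8 macro (which assumes 8-byte maximum alignment) and the helper page_header_size are not in the text above and are included here only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

#define XLP_FIRST_IS_CONTRECORD 0x0001
#define XLP_LONG_HEADER         0x0002  /* set on the first page of a segment */

typedef struct XLogPageHeaderData
{
    uint16_t    xlp_magic;      /* magic value for correctness checks */
    uint16_t    xlp_info;       /* flag bits */
    TimeLineID  xlp_tli;        /* timeline of the first record on the page */
    XLogRecPtr  xlp_pageaddr;   /* WAL address of this page */
    uint32_t    xlp_rem_len;    /* remaining length of a continued record */
} XLogPageHeaderData;

typedef struct XLogLongPageHeaderData
{
    XLogPageHeaderData std;     /* standard header fields */
    uint64_t    xlp_sysid;      /* system identifier from pg_control */
    uint32_t    xlp_seg_size;   /* segment size, as a cross-check */
    uint32_t    xlp_xlog_blcksz;/* WAL page size, as a cross-check */
} XLogLongPageHeaderData;

/* Assumption: maximum alignment is 8 bytes, as on common 64-bit builds. */
#define MAXALIGN8(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical helper: on-disk size of the header that starts a page. */
size_t
page_header_size(const XLogPageHeaderData *hdr)
{
    assert(hdr != NULL);
    return (hdr->xlp_info & XLP_LONG_HEADER)
        ? MAXALIGN8(sizeof(XLogLongPageHeaderData))
        : MAXALIGN8(sizeof(XLogPageHeaderData));
}
```

Under these assumptions the short header occupies 24 bytes and the long header 40 bytes.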
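remaindata (optional)
This area holds the tail of the last record of the previous page, that is, the part that did not fit on that page.
When a WAL record continues across a page boundary, the xlp_info field in the header of the new page has the XLP_FIRST_IS_CONTRECORD flag set: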
/* When record crosses page boundary, set this flag in new page's header */
#define XLP_FIRST_IS_CONTRECORD 0x0001
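xlog records may span pages: when the space left on the current page cannot hold the whole record, the rest is stored on the next page. The xlp_rem_len field of XLogPageHeaderData records how many bytes of the record that started on an earlier page still remain. When xlp_rem_len is 0, this area does not exist.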
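The rule above can be sketched as a small function that finds where the first record starting on a page begins. The helper first_record_offset is hypothetical; the 8-byte rounding assumes the usual maximum alignment, and the caller is assumed to pass the aligned header size (short or long) for the page.

```c
#include <assert.h>
#include <stdint.h>

#define XLP_FIRST_IS_CONTRECORD 0x0001

/*
 * Hypothetical helper: offset within a page of the first record that
 * *starts* on that page. Returns 0 when the continuation data fills
 * the rest of the page, so no record starts here.
 */
uint32_t
first_record_offset(uint16_t xlp_info, uint32_t xlp_rem_len,
                    uint32_t header_size, uint32_t page_size)
{
    uint64_t contlen;

    assert(header_size < page_size);

    /* No continuation: the first record follows the page header directly. */
    if (!(xlp_info & XLP_FIRST_IS_CONTRECORD) || xlp_rem_len == 0)
        return header_size;

    /* Records start on aligned boundaries (assumption: 8-byte alignment). */
    contlen = ((uint64_t) xlp_rem_len + 7) & ~(uint64_t) 7;
    if (contlen >= page_size - header_size)
        return 0;

    return header_size + (uint32_t) contlen;
}
```

For example, with a 24-byte header on an 8192-byte page and xlp_rem_len of 100, the continuation occupies 104 bytes after alignment and the next record starts at offset 128.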
Record
See the WAL record structure described below.
Incomplete record
The last record on a page may be incomplete; its remainder is stored on the next page.
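Unused area
In older releases the XLogRecord struct at the start of a record could not be split across pages, so when the space left on a page was too small to hold an XLogRecord struct, that space was left unused. Since PostgreSQL 9.3 the record header may itself continue on the next page; only its first field, xl_tot_len, is guaranteed to sit on the page where the record starts.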
WAL record structure
Each WAL record is laid out as shown in the figure below.
XLogRecord
XLogRecord is the entry point of a WAL record: parsing a record starts from this struct. Its definition is as follows.
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
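Members:
xl_tot_len: total length of the record, including every part shown in the figure.
xl_xid: ID of the transaction that produced the record.
xl_prev: location of the previous record.
xl_info: identifies the subtype of the WAL record and is interpreted together with xl_rmid. For example, when xl_rmid is RM_HEAP_ID, xl_info can be XLOG_HEAP_INSERT, XLOG_HEAP_DELETE or XLOG_HEAP_UPDATE.
xl_rmid: identifies the resource manager, that is, the kind of WAL record. For example, RM_XACT_ID marks transaction-related records, RM_DBASE_ID marks database creation and removal, and RM_HEAP_ID marks inserts, deletes and updates of table data. The possible values are listed in src/include/access/rmgrlist.h.
xl_crc: CRC checksum of the record.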
XLogRecordBlockHeader
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including page
* image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;
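Members:
id: a record can reference several blocks (the maximum block ID is 32); id is the index of this block reference.
fork_flags: holds the fork number of the block in its low 4 bits and, in its high 4 bits, flags describing what information is stored for this block.
data_length: length of the data stored in the tuple data part (not including the page image).
The bits of fork_flags are defined as follows: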
/*
* The fork number fits in the lower 4 bits in the fork_flags field. The upper
* bits are used for flags.
*/
#define BKPBLOCK_FORK_MASK 0x0F
#define BKPBLOCK_FLAG_MASK 0xF0
#define BKPBLOCK_HAS_IMAGE 0x10 /* block data is an XLogRecordBlockImage,
 * i.e. the record carries a full-page-write image of the block */
#define BKPBLOCK_HAS_DATA 0x20 /* the record carries tuple-level changes for the block */
#define BKPBLOCK_WILL_INIT 0x40 /* redo will re-init the page */
#define BKPBLOCK_SAME_REL 0x80 /* RelFileNode omitted, same as previous:
 * set when the block belongs to the same relation as the previous block */
XLogRecordBlockImageHeader
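This struct is present when the block reference carries a full-page image (full page write), that is, when BKPBLOCK_HAS_IMAGE is set.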
/*
* Additional header information when a full-page image is included
* (i.e. when BKPBLOCK_HAS_IMAGE is set).
*
* As a trivial form of data compression, the XLOG code is aware that
* PG data pages usually contain an unused "hole" in the middle, which
* contains only zero bytes. If the length of "hole" > 0 then we have removed
* such a "hole" from the stored data (and it's not counted in the
* XLOG record's CRC, either). Hence, the amount of block data actually
* present is BLCKSZ - the length of "hole" bytes.
*
* When wal_compression is enabled, a full page image which "hole" was
* removed is additionally compressed using PGLZ compression algorithm.
* This can reduce the WAL volume, but at some extra cost of CPU spent
* on the compression during WAL logging. In this case, since the "hole"
* length cannot be calculated by subtracting the number of page image bytes
* from BLCKSZ, basically it needs to be stored as an extra information.
* But when no "hole" exists, we can assume that the "hole" length is zero
* and no such an extra information needs to be stored. Note that
* the original version of page image is stored in WAL instead of the
* compressed one if the number of bytes saved by compression is less than
* the length of extra information. Hence, when a page image is successfully
* compressed, the amount of block data actually present is less than
* BLCKSZ - the length of "hole" bytes - the length of extra information.
*/
typedef struct XLogRecordBlockImageHeader
{
uint16 length; /* number of page image bytes */
uint16 hole_offset; /* number of bytes before "hole" */
uint8 bimg_info; /* flag bits, see below */
/*
* If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED, an
* XLogRecordBlockCompressHeader struct follows.
*/
} XLogRecordBlockImageHeader;
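Members:
length: number of page image bytes actually stored (after removing the hole and, if compression applies, after compression).
hole_offset: number of bytes that precede the hole.
bimg_info: flag bits recording whether the image has a hole and whether it is compressed.
Note: the hole is the unused part of a data page that contains only zero bytes. To reduce WAL volume, PostgreSQL strips the hole when writing the record and may also compress the image.
The possible values of bimg_info are: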
/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE 0x01 /* page image has "hole" */
#define BKPIMAGE_IS_COMPRESSED 0x02 /* page image is compressed */
XLogRecordBlockCompressHeader
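This struct records the length of the hole. It is present only when the page image both has a hole and is compressed.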
/*
* Extra header information used when page image has "hole" and
* is compressed.
*/
typedef struct XLogRecordBlockCompressHeader
{
uint16 hole_length; /* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;
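The relationship between length, hole_offset and the hole length can be sketched as below for the uncompressed case. Both helpers are hypothetical and only loosely follow what the WAL reader does; BLCKSZ is assumed to be the default 8192, and decompression is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192                     /* assumption: default page size */
#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_IS_COMPRESSED  0x02

/*
 * Hypothetical helper: derive the hole length of a page image.
 * compress_hdr_hole_length is the value from XLogRecordBlockCompressHeader
 * and is only consulted for a compressed image that has a hole.
 */
uint16_t
image_hole_length(uint8_t bimg_info, uint16_t length,
                  uint16_t compress_hdr_hole_length)
{
    if (bimg_info & BKPIMAGE_IS_COMPRESSED)
        return (bimg_info & BKPIMAGE_HAS_HOLE) ? compress_hdr_hole_length : 0;

    /* Uncompressed: whatever is missing from a full page is the hole. */
    return (uint16_t) (BLCKSZ - length);
}

/*
 * Hypothetical helper: rebuild a full page from an uncompressed image
 * by re-inserting the zero-filled hole at hole_offset.
 */
void
restore_uncompressed_image(uint8_t *page, const uint8_t *img,
                           uint16_t hole_offset, uint16_t hole_length)
{
    assert(page != NULL && img != NULL);
    assert((uint32_t) hole_offset + hole_length <= BLCKSZ);

    memcpy(page, img, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + hole_offset + hole_length, img + hole_offset,
           BLCKSZ - (hole_offset + hole_length));
}
```

The compressed case needs the extra header precisely because the stored length no longer reveals how many bytes were removed as the hole.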
RelFileNode
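This struct identifies the relation the block belongs to. If the block belongs to the same relation as the previous block in the record, BKPBLOCK_SAME_REL is set in fork_flags and the RelFileNode is omitted.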
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;
BlockNumber
The block number of the page that this block reference describes.
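Putting the pieces of a block reference together, the sketch below computes how many header bytes one block reference occupies. The byte counts (4 for XLogRecordBlockHeader, 5 for XLogRecordBlockImageHeader, 2 for XLogRecordBlockCompressHeader, 12 for RelFileNode, 4 for BlockNumber) follow from the struct definitions above being stored without padding; block_ref_header_size is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BKPBLOCK_HAS_IMAGE      0x10
#define BKPBLOCK_SAME_REL       0x80
#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_IS_COMPRESSED  0x02

/*
 * Hypothetical helper: number of header bytes used by one block reference.
 * bimg_info is only meaningful when BKPBLOCK_HAS_IMAGE is set.
 */
size_t
block_ref_header_size(uint8_t fork_flags, uint8_t bimg_info)
{
    size_t size = 4;            /* id + fork_flags + data_length */

    assert((fork_flags & BKPBLOCK_HAS_IMAGE) || bimg_info == 0);

    if (fork_flags & BKPBLOCK_HAS_IMAGE)
    {
        size += 5;              /* length + hole_offset + bimg_info */
        if ((bimg_info & BKPIMAGE_HAS_HOLE) &&
            (bimg_info & BKPIMAGE_IS_COMPRESSED))
            size += 2;          /* hole_length */
    }
    if (!(fork_flags & BKPBLOCK_SAME_REL))
        size += 12;             /* RelFileNode: three 4-byte Oids */
    size += 4;                  /* BlockNumber */

    return size;
}
```

A block reference that only carries tuple data for a new relation therefore needs 20 header bytes, and 8 when the RelFileNode is omitted.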
XLogRecordDataHeaderLong/XLogRecordDataHeaderShort
These headers precede the main data part of a record (for example checkpoint data). XLogRecordDataHeaderShort is used when the main data is smaller than 256 bytes;
otherwise XLogRecordDataHeaderLong is used.
typedef struct XLogRecordDataHeaderShort
{
uint8 id; /* XLR_BLOCK_ID_DATA_SHORT */
uint8 data_length; /* number of payload bytes */
} XLogRecordDataHeaderShort;
typedef struct XLogRecordDataHeaderLong
{
uint8 id; /* XLR_BLOCK_ID_DATA_LONG */
/* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;
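The choice between the two headers can be sketched as follows. The ID values 255 and 254 are those defined in xlogrecord.h as far as I know and should be checked against your version; write_main_data_header is a hypothetical helper, and the length is written in native byte order on the assumption that WAL is read on the same architecture that wrote it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254

/*
 * Hypothetical helper: write the main data header into out and return
 * the number of bytes written (2 for the short form, 5 for the long form).
 * out must have room for at least 5 bytes.
 */
size_t
write_main_data_header(uint8_t *out, uint32_t data_length)
{
    assert(out != NULL);

    if (data_length < 256)
    {
        out[0] = XLR_BLOCK_ID_DATA_SHORT;
        out[1] = (uint8_t) data_length;
        return 2;
    }

    out[0] = XLR_BLOCK_ID_DATA_LONG;
    /* The uint32 length is unaligned, so copy it byte-wise. */
    memcpy(out + 1, &data_length, sizeof(data_length));
    return 1 + sizeof(data_length);
}
```

The short form saves three bytes per record, which adds up because most records have small main data.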
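block data
Block data comes in two kinds: full-page images (written by full page writes) and tuple data (records of row changes).
main data
The main data part holds data that is not tied to a particular buffer, such as checkpoint records.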