PgSQL · Feature Analysis · A PostgreSQL Aurora-Style Architecture and Demo

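Introduction

Amazon's Aurora database engine supports a single copy of storage shared by one writer and multiple readers. The architecture resembles Oracle RAC in that storage is shared, but only one instance may write while the others are read-only. Compared with a traditional replication-based setup with one writer and many readers, it saves storage and network bandwidth.

We can use PostgreSQL's hot standby mode to simulate this shared-storage architecture, with a few caveats. A hot standby still writes to the database: during recovery, for example, it modifies the control file, data files, and so on, and on shared storage these writes are redundant. In addition, a lot of state lives in memory, so each instance's in-memory state must be kept up to date as well.

The following files and directories also need attention: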

pg_xlog
pg_log
pg_clog
pg_multixact
postgresql.conf
recovery.conf
postmaster.pid

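Ultimately, a one-primary, many-standbys architecture on shared storage requires changes to the PostgreSQL kernel:

Each instance should have its own copy of these files:
postgresql.conf, recovery.conf, postmaster.pid, pg_control
A hot standby should not perform the actual recovery writes, but it still needs to update its in-memory state, such as the current OID and XID, and to update its own pg_control.
Across instances, dirty pages in the OS page cache and in the database shared buffers must be synchronized from the primary to the standby nodes.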

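Simulation

Without changing any code, starting multiple instances on the same host runs into several problems. (They are described below, along with the code changes that fix them.)

Primary instance configuration files: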

 # vi postgresql.conf
listen_addresses='0.0.0.0'
port=1921
max_connections=100
unix_socket_directories='.'
ssl=on
ssl_ciphers='EXPORT40'
shared_buffers=512MB
huge_pages=try
max_prepared_transactions=0
max_stack_depth=100kB
dynamic_shared_memory_type=posix
max_files_per_process=500
wal_level=logical
fsync=off
synchronous_commit=off
wal_sync_method=open_datasync
full_page_writes=off
wal_log_hints=off
wal_buffers=16MB
wal_writer_delay=10ms
checkpoint_segments=8
archive_mode=off
archive_command='/bin/date'
max_wal_senders=10
max_replication_slots=10
hot_standby=on
wal_receiver_status_interval=1s
hot_standby_feedback=on
enable_bitmapscan=on
enable_hashagg=on
enable_hashjoin=on
enable_indexscan=on
enable_material=on
enable_mergejoin=on
enable_nestloop=on
enable_seqscan=on
enable_sort=on
enable_tidscan=on
log_destination='csvlog'
logging_collector=on
log_directory='pg_log'
log_truncate_on_rotation=on
log_rotation_size=10MB
log_checkpoints=on
log_connections=on
log_disconnections=on
log_duration=off
log_error_verbosity=verbose
log_line_prefix='%i
log_statement='none'
log_timezone='PRC'
autovacuum=on
log_autovacuum_min_duration=0
autovacuum_vacuum_scale_factor=0.0002
autovacuum_analyze_scale_factor=0.0001
datestyle='iso,
timezone='PRC'
lc_messages='C'
lc_monetary='C'
lc_numeric='C'
lc_time='C'
default_text_search_config='pg_catalog.english'

 # vi recovery.done
recovery_target_timeline='latest'
standby_mode=on
primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'

 # vi pg_hba.conf
local   replication     postgres                                trust
host    replication     postgres 127.0.0.1/32            trust

Start the primary instance.

postgres@digoal-> pg_ctl start

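To start the read-only instance, postmaster.pid must be deleted first. Newer PostgreSQL versions include a patch that shuts the database down automatically if this file is removed, so either avoid the latest PostgreSQL or revert that patch first.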

postgres@digoal-> cd $PGDATA
postgres@digoal-> mv recovery.done recovery.conf

postgres@digoal-> rm -f postmaster.pid
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Check the current control file state. The read-only instance has modified the control file, as described earlier.

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in archive recovery

Connect to the primary instance, create a table, and insert test data.

psql -p 1921
postgres=# create table test1(id int);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

Query the inserted data on the read-only instance.

postgres@digoal-> psql -h 127.0.0.1 -p 1922
postgres=# select * from test1;
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

After the primary instance runs a checkpoint, the control file state changes back to production.

psql -p 1921
postgres=# checkpoint;
CHECKPOINT

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in production

But once the read-only instance runs a checkpoint, the state changes back to recovery.

postgres@digoal-> psql -h 127.0.0.1 -p 1922
psql (9.4.4)
postgres=# checkpoint;
CHECKPOINT

postgres@digoal-> pg_controldata |grep state
Database cluster state:               in archive recovery

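Note that the example above has a problem. With streaming replication, the read-only instance copies XLOG records from the primary over the network and overwrites the same offsets of XLOG that has already been written. This can make the data seen by the primary inconsistent. (For example, a data block is modified several times and the read-only instance overwrites it with an older version during recovery, so the primary then sees the old version of the block. This is addressed later by preventing the read-only instance from writing recovered data.)

On the other hand, a PostgreSQL standby reads XLOG for recovery from three sources (streaming, pg_xlog, and restore_command), so in a shared-storage environment streaming replication is unnecessary: the standby can read directly from the pg_xlog directory. Edit recovery.conf and comment out the following line: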

 # primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'

Restart the read-only instance.

pg_ctl stop -m fast
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"
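
The effect of commenting out primary_conninfo can be pictured with a toy model of how a standby chooses where to read WAL from. This is only a sketch under simplifying assumptions: the WalSource enum and next_wal_source are hypothetical names, not PostgreSQL identifiers, and the real state machine in WaitForWALToBecomeAvailable() also checks the trigger file, rescans timelines, and sleeps between retries.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not PostgreSQL identifiers. */
typedef enum
{
	WAL_FROM_LOCAL,		/* restore_command and/or the pg_xlog directory */
	WAL_FROM_STREAM		/* walreceiver connected via primary_conninfo */
} WalSource;

/*
 * Toy model: pick the source to try after a read from `current` failed.
 * Without primary_conninfo there is nothing to stream from, so the
 * standby simply keeps polling the local sources.
 */
WalSource
next_wal_source(WalSource current, bool has_primary_conninfo)
{
	if (current == WAL_FROM_LOCAL && has_primary_conninfo)
		return WAL_FROM_STREAM;
	return WAL_FROM_LOCAL;
}
```

In this model, a read-only instance on shared storage with primary_conninfo commented out never leaves WAL_FROM_LOCAL, which matches the intent here: it picks up new records from the pg_xlog directory that the primary is writing.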

Test data consistency again.
Primary instance:

postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

Read-only instance:

postgres=# select count(*) from test1;
 count
-------
    60
(1 row)

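Problem Analysis and Solutions

So far, several problems remain unsolved:

The standby still performs recovery, and the writes generated by recovery grow with the number of read-only instances. Recovery does have one benefit: it solves the dirty-page problem, because dirty pages in the primary's shared buffers no longer need to be synchronized separately to the read-only instances. But recovery also introduces a serious bug: replay may operate on the same data page as the primary at the same time, or replay may write a block back in an older state after the primary has already updated it again, leaving the block inconsistent. If the read-only instance is shut down at that point and the primary is shut down immediately afterwards, the block is still inconsistent when the database comes back up.
The standby still modifies the control file.
Starting an instance in the same $PGDATA requires deleting postmaster.pid first.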
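
The stale-block problem in the first item can be illustrated with a toy model. This is a sketch under heavy simplification, not PostgreSQL's buffer manager: the Page struct and simulate_shared_page are hypothetical, a single integer stands in for the page's LSN, and the interleaving of actions is fixed by hand.

```c
#include <stdbool.h>

/*
 * Hypothetical toy model of one data page on shared storage; the version
 * number stands in for the LSN of the last change applied to the page.
 */
typedef struct
{
	int			version;
} Page;

/*
 * Replays a fixed interleaving of primary and read-only instance actions
 * and returns the page version left on shared storage.
 */
int
simulate_shared_page(bool replica_writes_pages)
{
	Page		disk = {0};
	Page		primary_buf;
	Page		replica_buf;

	/* Primary applies change 1 and flushes the page. */
	primary_buf.version = 1;
	disk = primary_buf;

	/* Read-only instance reads the page and replays change 2 in memory. */
	replica_buf = disk;
	replica_buf.version = 2;

	/* Primary applies changes 2 and 3, then flushes the page again. */
	primary_buf.version = 3;
	disk = primary_buf;

	/* A replica that still writes pages now flushes its stale buffer. */
	if (replica_writes_pages)
		disk = replica_buf;

	return disk.version;
}
```

With replica_writes_pages set to true the function returns 2, an older version than the 3 the primary flushed; with false it returns 3. This is why the read-only instance must be kept from writing data pages to the shared storage.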

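An instance whose postmaster.pid has been deleted can only be shut down by finding the PID of its postgres main process and sending it signal 15 (SIGTERM), 2 (SIGINT), or 3 (SIGQUIT) with kill -s. These are the signals that pg_ctl maps its shutdown modes to: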

 static void
 set_mode(char *modeopt)
 {
         if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
         {
                 shutdown_mode = SMART_MODE;
                 sig = SIGTERM;
         }
         else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
         {
                 shutdown_mode = FAST_MODE;
                 sig = SIGINT;
         }
         else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
         {
                 shutdown_mode = IMMEDIATE_MODE;
                 sig = SIGQUIT;
         }
         else
         {
                 write_stderr(_("%s: unrecognized shutdown mode \"%s\"\n"), progname, modeopt);
                 do_advice();
                 exit(1);
         }
 }
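
Because pg_ctl stop cannot locate such an instance, the same mode-to-signal mapping can be applied by hand. The following is a minimal sketch, assuming the postmaster PID has already been found by other means (for example with ps); shutdown_signal_for_mode and signal_postmaster are hypothetical helper names, not part of PostgreSQL.

```c
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/*
 * Map a shutdown mode name to the signal pg_ctl would send:
 * smart -> SIGTERM, fast -> SIGINT, immediate -> SIGQUIT.
 * Returns -1 for an unrecognized mode.
 */
int
shutdown_signal_for_mode(const char *mode)
{
	if (strcmp(mode, "smart") == 0)
		return SIGTERM;
	if (strcmp(mode, "fast") == 0)
		return SIGINT;
	if (strcmp(mode, "immediate") == 0)
		return SIGQUIT;
	return -1;
}

/*
 * Send the shutdown signal for `mode` to the postmaster with the given PID.
 * Returns 0 on success and -1 on failure (unknown mode, or kill() failed
 * and set errno).
 */
int
signal_postmaster(pid_t postmaster_pid, const char *mode)
{
	int			sig = shutdown_signal_for_mode(mode);

	if (sig < 0)
		return -1;
	return kill(postmaster_pid, sig);
}
```

A caller would pass the PID of the postgres main process of the instance listening on port 1922, for example signal_postmaster(pid, "fast"), which has the same effect as kill -s 2 on that PID.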

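When the primary deletes relation pages, the read-only instance fails during replay with an error saying that the relation page referenced by the XLOG record does not exist. This is another consequence of the read-only instance having to replay WAL. It is very easy to reproduce: just drop a table.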

 2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,20,,2015-10-09 13:29:15 CST,1/0,0,WARNING,01000,"page 8 of relation base/151898/185251 does not exist",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"report_invalid_page, xlogutils.c:67",""
 2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,21,,2015-10-09 13:29:15 CST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"log_invalid_page, xlogutils.c:91",""

To keep the demo going, this error can be bypassed for now by commenting out the following block.

 src/backend/access/transam/xlogutils.c
 /* Log a reference to an invalid page */
 static void
 log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
                                  bool present)
 {
   //////
         /*
          * Once recovery has reached a consistent state, the invalid-page table
          * should be empty and remain so. If a reference to an invalid page is
          * found after consistency is reached, PANIC immediately. This might seem
          * aggressive, but it's better than letting the invalid reference linger
          * in the hash table until the end of recovery and PANIC there, which
          * might come only much later if this is a standby server.
          */
         //if (reachedConsistency)
         //{
         //      report_invalid_page(WARNING, node, forkno, blkno, present);
         //      elog(PANIC, "WAL contains references to invalid pages");
         //}

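Because this example runs on a single operating system, it does not hit the OS dirty page cache problem. Across different hosts, the OS dirty page cache must either be synchronized or eliminated, for example by using direct I/O or a cluster file system such as GFS2.

To turn this into a product, at least the problems above must be solved.

First, address the Aurora instance writing data files, writing the control file, and running checkpoints.

Add a startup parameter that indicates whether the instance is an Aurora instance (that is, a read-only instance).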

  # vi src/backend/utils/misc/guc.c
 /******** option records follow ********/

 static struct config_bool ConfigureNamesBool[] =
 {
         {
                 {"aurora", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
                         gettext_noop("Runs this instance as a read-only Aurora instance on shared storage."),
                         NULL
                 },
                 &aurora,
                 false,
                 NULL, NULL, NULL
         },

Declare the new variable

 # vi src/include/postmaster/postmaster.h
 extern bool aurora;

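Prevent the Aurora instance from updating the control file

 # vi src/backend/access/transam/xlog.c
 #include "postmaster/postmaster.h"
 bool aurora = false;    /* the only definition; other files use the extern declaration */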

 void
 UpdateControlFile(void)
 {
         if (aurora) return;

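Disable the bgwriter on the Aurora instance (the process still starts, but skips buffer writes)

 # vi src/backend/postmaster/bgwriter.c
 #include "postmaster/postmaster.h"    /* declares extern bool aurora, defined in xlog.c */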

 /*
  * Main entry point for bgwriter process
  *
  * This is invoked from AuxiliaryProcessMain, which has already created the
  * basic execution environment, but not enabled signals yet.
  */
 void
 BackgroundWriterMain(void)
 {
   //////
         pg_usleep(1000000L);

         /*
          * If an exception is encountered, processing resumes here.
          *
          * See notes in postgres.c about the design of this coding.
          */
         if (!aurora && sigsetjmp(local_sigjmp_buf, 1) != 0)
         {

   //////
                 /*
                  * Do one cycle of dirty-buffer writing.
                  */
                 if (!aurora) {
                 can_hibernate = BgBufferSync();
   //////
                 }
                 pg_usleep(1000000L);
         }
 }

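Disable the checkpointer on the Aurora instance (the process still starts, but does no checkpoint work)

 # vi src/backend/postmaster/checkpointer.c
 #include "postmaster/postmaster.h"    /* declares extern bool aurora, defined in xlog.c */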
   //////
 /*
  * Main entry point for checkpointer process
  *
  * This is invoked from AuxiliaryProcessMain, which has already created the
  * basic execution environment, but not enabled signals yet.
  */
 void
 CheckpointerMain(void)
 {
   //////
         /*
          * Loop forever
          */
         for (;;)
         {
                 bool            do_checkpoint = false;
                 int                     flags = 0;
                 pg_time_t       now;
                 int                     elapsed_secs;
                 int                     cur_timeout;
                 int                     rc = 0;         /* stays 0 when aurora skips WaitLatch() */

                 pg_usleep(100000L);

                 /* Clear any already-pending wakeups */
                 if (!aurora)  ResetLatch(&MyProc->procLatch);

                 /*
                  * Process any requests or signals received recently.
                  */
                 if (!aurora) AbsorbFsyncRequests();

                 if (!aurora && got_SIGHUP)
                 {
                         got_SIGHUP = false;
                         ProcessConfigFile(PGC_SIGHUP);

                         /*
                          * Checkpointer is the last process to shut down, so we ask it to
                          * hold the keys for a range of other tasks required most of which
                          * have nothing to do with checkpointing at all.
                          *
                          * For various reasons, some config values can change dynamically
                          * so the primary copy of them is held in shared memory to make
                          * sure all backends see the same value.  We make Checkpointer
                          * responsible for updating the shared memory copy if the
                          * parameter setting changes because of SIGHUP.
                          */
                         UpdateSharedMemoryConfig();
                 }
                 if (!aurora && checkpoint_requested)
                 {
                         checkpoint_requested = false;
                         do_checkpoint = true;
                         BgWriterStats.m_requested_checkpoints++;
                 }
                 if (!aurora && shutdown_requested)
                 {
                         /*
                          * From here on, elog(ERROR) should end with exit(1), not send
                          * control back to the sigsetjmp block above
                          */
                         ExitOnAnyError = true;
                         /* Close down the database */
                         ShutdownXLOG(0, 0);
                         /* Normal exit from the checkpointer is here */
                         proc_exit(0);           /* done */
                 }

                 /*
                  * Force a checkpoint if too much time has elapsed since the last one.
                  * Note that we count a timed checkpoint in stats only when this
                  * occurs without an external request, but we set the CAUSE_TIME flag
                  * bit even if there is also an external request.
                  */
                 now = (pg_time_t) time(NULL);
                 elapsed_secs = now - last_checkpoint_time;
                 if (!aurora && elapsed_secs >= CheckPointTimeout)
                 {
                         if (!do_checkpoint)
                                 BgWriterStats.m_timed_checkpoints++;
                         do_checkpoint = true;
                         flags |= CHECKPOINT_CAUSE_TIME;
                 }

                 /*
                  * Do a checkpoint if requested.
                  */
                 if (!aurora && do_checkpoint)
                 {
                         bool            ckpt_performed = false;
                         bool            do_restartpoint;

                         /* use volatile pointer to prevent code rearrangement */
                         volatile CheckpointerShmemStruct *cps = CheckpointerShmem;

                         /*
                          * Check if we should perform a checkpoint or a restartpoint. As a
                          * side-effect, RecoveryInProgress() initializes TimeLineID if
                          * it's not set yet.
                          */
                         do_restartpoint = RecoveryInProgress();

                         /*
                          * Atomically fetch the request flags to figure out what kind of a
                          * checkpoint we should perform, and increase the started-counter
                          * to acknowledge that we've started a new checkpoint.
                          */
                         SpinLockAcquire(&cps->ckpt_lck);
                         flags |= cps->ckpt_flags;
                         cps->ckpt_flags = 0;
                         cps->ckpt_started++;
                         SpinLockRelease(&cps->ckpt_lck);

                         /*
                          * The end-of-recovery checkpoint is a real checkpoint that's
                          * performed while we're still in recovery.
                          */
                         if (flags & CHECKPOINT_END_OF_RECOVERY)
                                 do_restartpoint = false;
   //////
                         ckpt_active = false;
                 }

                 /* Check for archive_timeout and switch xlog files if necessary. */
                 if (!aurora) CheckArchiveTimeout();
                 /*
                  * Send off activity statistics to the stats collector.  (The reason
                  * why we re-use bgwriter-related code for this is that the bgwriter
                  * and checkpointer used to be just one process.  It's probably not
                  * worth the trouble to split the stats support into two independent
                  * stats message types.)
                  */
                 if (!aurora) pgstat_send_bgwriter();

                 /*
                  * Sleep until we are signaled or it's time for another checkpoint or
                  * xlog file switch.
                  */
                 now = (pg_time_t) time(NULL);
                 elapsed_secs = now - last_checkpoint_time;
                 if (elapsed_secs >= CheckPointTimeout)
                         continue;                       /* no sleep for us ... */
                 cur_timeout = CheckPointTimeout - elapsed_secs;
                 if (!aurora && XLogArchiveTimeout > 0 && !RecoveryInProgress())
                 {
                         elapsed_secs = now - last_xlog_switch_time;
                         if (elapsed_secs >= XLogArchiveTimeout)
                                 continue;               /* no sleep for us ... */
                         cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
                 }

                 if (!aurora) rc = WaitLatch(&MyProc->procLatch,
                                            WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                            cur_timeout * 1000L /* convert to ms */ );

                 /*
                  * Emergency bailout if postmaster has died.  This is to avoid the
                  * necessity for manual cleanup of all postmaster children.
                  */
                 if (rc & WL_POSTMASTER_DEATH)
                         exit(1);
         }
 }
   //////
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
 {
         int                     save_errno = errno;

         /* Aurora (read-only) instances ignore checkpoint requests. */
         if (aurora)
                 return;

         checkpoint_requested = true;
         if (MyProc)
                 SetLatch(&MyProc->procLatch);

         errno = save_errno;
 }
   //////
 /*
  * AbsorbFsyncRequests
  *              Retrieve queued fsync requests and pass them to local smgr.
  *
  * This is exported because it must be called during CreateCheckPoint;
  * we have to be sure we have accepted all pending requests just before
  * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
  * non-checkpointer processes, do nothing if not checkpointer.
  */
 void
 AbsorbFsyncRequests(void)
 {
         CheckpointerRequest *requests = NULL;
         CheckpointerRequest *request;
         int                     n;

         if (!AmCheckpointerProcess() || aurora)
                 return;
   //////

Prevent manual CHECKPOINT commands on the Aurora instance

 # vi src/backend/tcop/utility.c
 #include "postmaster/postmaster.h"    /* declares extern bool aurora, defined in xlog.c */

   //////
 void
 standard_ProcessUtility(Node *parsetree,
                                                 const char *queryString,
                                                 ProcessUtilityContext context,
                                                 ParamListInfo params,
                                                 DestReceiver *dest,
                                                 char *completionTag)
 {
   //////
                 case T_CheckPointStmt:
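                    /* Reject CHECKPOINT on an Aurora instance with its own message. */
                    if (aurora)
                                 ereport(ERROR,
                                                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                                                  errmsg("CHECKPOINT is not allowed on an Aurora (read-only) instance")));
                    if (!superuser())
                                 ereport(ERROR,
                                                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                                                  errmsg("must be superuser to do CHECKPOINT")));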

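After making the changes above and recompiling, we have something close to a demo. The Aurora instance now does not update the control file, write data files, or run checkpoints, which is what we want.
When starting the read-only instance, add the parameter aurora=true to start it as an Aurora instance.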

pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922 -c aurora=true"

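Turning this into a product still requires attention to many details; this is only a demo. Go for it, friends on the Alibaba Cloud RDS team!

There is also a safer approach: a shared-storage multi-reader architecture that keeps two copies of the data. One copy is the primary instance's own storage, which it uses by itself and which no other instance touches. The other copy belongs to a standby and serves as the shared storage for multiple read-only instances.

References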

https://aws.amazon.com/cn/rds/aurora/

src/backend/access/transam/xlog.c

 /*
  * Open the WAL segment containing WAL position 'RecPtr'.
  *
  * The segment can be fetched via restore_command, or via walreceiver having
  * streamed the record, or it can already be present in pg_xlog. Checking
  * pg_xlog is mainly for crash recovery, but it will be polled in standby mode
  * too, in case someone copies a new segment directly to pg_xlog. That is not
  * documented or recommended, though.
  *
  * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
  * prepare to read WAL starting from RedoStartLSN after this.
  *
  * 'RecPtr' might not point to the beginning of the record we're interested
  * in, it might also point to the page or segment header. In that case,
  * 'tliRecPtr' is the position of the WAL record we're interested in. It is
  * used to decide which timeline to stream the requested WAL from.
  *
  * If the record is not immediately available, the function returns false
  * if we're not in standby mode. In standby mode, waits for it to become
  * available.
  *
  * When the requested record becomes available, the function opens the file
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                                         bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
   //////
         static pg_time_t last_fail_time = 0;
         pg_time_t       now;

         /*-------
          * Standby mode is implemented by a state machine:
          *
          * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
          *        pg_xlog (XLOG_FROM_XLOG)
          * 2. Check trigger file
          * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
          * 4. Rescan timelines
          * 5. Sleep 5 seconds, and loop back to 1.
          *
          * Failure to read from the current source advances the state machine to
          * the next state.
          *
          * 'currentSource' indicates the current state. There are no currentSource
          * values for "check trigger", "rescan timelines", and "sleep" states,
          * those actions are taken when reading from the previous source fails, as
          * part of advancing to the next state.
          *-------
          */
