Friday, January 24, 2014

Testing Cascading Replication

Folks,

I wanted to pass along the testing emails below from DHAVAL JAISWAL.  He's
been testing 9.3's streaming-only cascading replication, and so far it
works as advertised.  What he found in his tests was:

a) he could not remaster to a former replica which was behind the replica
he was trying to remaster

b) when servers were correctly caught up, remastering worked correctly

So, all good so far.

Text follows

======================

TEST 1: remastering failure due to picking the wrong replica

 I have tested the following cascading replication scenario on the
PostgreSQL 9.3 beta version.

              A

   B.....................E
C...D

  1)   A is the master,

     B & E are pointing to A,

     C & D are pointing to B.


Tested scenarios are as follows:

a) When (A) failed, we were able to promote either B or E as the master. If B
is promoted, C & D continue to talk to B as usual. If E is promoted, I
changed the recovery.conf of C & D, replacing the port and IP so that they
point to E. After restarting C & D, they started to talk to E.


   b) When (B) failed, I changed the recovery.conf of C & D, replacing the
port and IP so that they point to E. After restarting C & D, they started
to talk to E. In the end, A is still the master, E is pointing to A, and
C & D are pointing to E.
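
The recovery.conf change described in a) and b) can be sketched roughly as
below. This is only an illustration: the host address is a made-up
placeholder, the use of recovery_target_timeline = 'latest' is my assumption
about the test setup, and the optional restore_command is guessed from the
`cp` lines that show up in the logs further down. The helper name
render_recovery_conf is hypothetical.

```python
def render_recovery_conf(host, port, archive_dir=None):
    """Return recovery.conf text for a 9.3 standby streaming from host:port."""
    lines = [
        "standby_mode = 'on'",
        f"primary_conninfo = 'host={host} port={port}'",
        # Assumption: follow the upstream onto a new timeline after a promotion.
        "recovery_target_timeline = 'latest'",
    ]
    if archive_dir is not None:
        # Assumption: a plain cp from a WAL archive directory.
        lines.append(f"restore_command = 'cp {archive_dir}/%f %p'")
    return "\n".join(lines) + "\n"


# Hypothetical address for node E; replace with the real IP and port.
conf_for_c = render_recovery_conf("192.0.2.15", 5432)
print(conf_for_c)
```

The same text would be written to the recovery.conf of both C and D, followed
by a restart of each, as the email describes.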



Now, in scenario a), when we promote B as the master after the failure of A,
C & D continue to talk to B. However, when I change the recovery.conf of E,
replacing the port and IP with those of B, it throws the following errors.


cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
cp: cannot stat `/usr/local/arch/00000003.history': No such file or directory
LOG: entering standby mode
cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
cp: cannot stat `/usr/local/arch/000000020000000000000027': No such file or directory
cp: cannot stat `/usr/local/arch/000000010000000000000027': No such file or directory
cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/272DE57C on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/272DC548
LOG: startup process (PID 6155) exited with exit code 1
LOG: aborting startup due to startup process failure
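
To see why the FATAL fires, it helps to compare the two WAL locations in the
DETAIL line. The sketch below is a simplified illustration, not the server's
actual check: it treats a location "X/Y" as two hexadecimal numbers (high and
low 32 bits) and shows that E's latest checkpoint on timeline 1 lies past the
point where B forked off to timeline 2. The variable names are my own.

```python
def parse_lsn(text):
    """Convert an 'X/Y' WAL location (two hex numbers) into one integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Both values are taken from the DETAIL line in the log above.
latest_checkpoint_on_e = parse_lsn("0/272DE57C")
fork_point_of_b = parse_lsn("0/272DC548")

# How far E had already moved along timeline 1 beyond B's fork point.
bytes_ahead = latest_checkpoint_on_e - fork_point_of_b
print(bytes_ahead)  # 8244

# Simplified rule: E can only follow B if it has not gone past the fork point.
can_follow = latest_checkpoint_on_e <= fork_point_of_b
print(can_follow)  # False
```

In other words, E was ahead of B when B was promoted, so E's history is not a
prefix of timeline 2 and it refuses to start as B's standby.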

======================

TEST 2: Remastering success

 Structure would be


                          A (Master)

   (Slave1) B.....................E (Slave2)

(Slave3) C.....D (Slave4)


 (1)  Stopped node (A).


 (2)  Following are the snapshots of slave1 and slave2 after stopping node (A):

slave 1

postgres=# select pg_last_xact_replay_timestamp();
  pg_last_xact_replay_timestamp
----------------------------------
 2013-06-26 12:13:54.056954+05:30            ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 0/3E000084                                  ---------------> received wal
(1 row)



slave 2
postgres=# select pg_last_xact_replay_timestamp();
  pg_last_xact_replay_timestamp
----------------------------------
 2013-06-26 12:13:54.056954+05:30            ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 0/3E000084                                  ---------------> received wal
(1 row)




(3)  Following are the logs on slave1 while node (A) was stopped:

FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5432?



(4) Following are the logs on slave2 while node (A) was stopped:

FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "127.0.0.1" and accepting
                TCP/IP connections on port 5432?




(5) Below are the logs of slave1 when it was promoted to master:

LOG:  received promote request
LOG:  redo done at 0/3E000024
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started



(6) Below are the logs of slave2 after its recovery.conf was changed to
point to slave1 and it was restarted:

LOG:  database system was shut down in recovery at 2013-06-26 12:28:49 IST
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/3E000084
LOG:  invalid record length at 0/3E000084
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/3E000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/3E000084
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/3E000000 on timeline 2
LOG:  redo starts at 0/3E000084



At this point, slave2 has successfully connected to the new master and
replication is working again.
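
The check in step (2) is what made this run succeed: both standbys reported
the same receive location before one was promoted. A rough sketch of that
comparison follows, using the values from the psql output above. The function
name most_advanced is hypothetical, and in practice the locations would come
from running pg_last_xlog_receive_location() on each standby.

```python
def parse_lsn(text):
    """Convert an 'X/Y' WAL location (two hex numbers) into one integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced(receive_locations):
    """Return the names of the standbys with the highest receive location."""
    positions = {name: parse_lsn(lsn) for name, lsn in receive_locations.items()}
    best = max(positions.values())
    return sorted(name for name, pos in positions.items() if pos == best)


# Receive locations reported by slave1 and slave2 in step (2).
candidates = most_advanced({"slave1": "0/3E000084", "slave2": "0/3E000084"})
print(candidates)
```

When every standby appears in the result, as here, any of them should be a
safe promotion target; when only some do, promoting one of the others risks
the timeline error seen in TEST 1.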
