Tuesday, January 14, 2014

PostgreSQL: "FATAL: requested WAL segment 0000000800002A0000000000 has already been removed"

Yesterday, the standby database host an important production downtime due to hardware failure to do hardware detection, because it is streaming replication environment,
You can stop by the library, stop detection probably spent about 2 hours after the start-time library again, reported the following errors:

- Csv log
2013-07-01 13:25:29.430 CST,,, 27738,, 51d112c8.6c5a, 1, ,2013-07-01 13:25:28 CST,, 0, LOG, 00000, "streaming replication successfully connected to primary ",,,,,,,," libpqrcv_connect, libpqwalreceiver.c: 171 "," "
2013-07-01 13:25:29.430 CST,,, 27738,, 51d112c8.6c5a, 2, ,2013-07-01 13:25:28 CST,, 0, FATAL, XX000,"could not receive data from WAL stream: FATAL: requested WAL segment 0000000800002A0000000000 has already been removed
",,,,,,,," Libpqrcv_receive, libpqwalreceiver.c: 389 "," "

Note: According to the error message, it is easy to know downtime due process in preparing the necessary libraries WAL recycling has been covering the main library, and the library in preparation
Down for maintenance during the main library did not open the archive, then the standby database needs to redo it. Some people might ask, why not open the archive has been the main library,
I want to say is that this library TB level and relatively busy, busy time of filing of the day around 600 GB, so large archive requires a lot of storage.
But in the standby database downtime during maintenance, it is recommended to open the main library archives, they do not support the full archive directory, then resumed after a standby database, with the main library
Archive, then the standby database is still able to keep the main library, in order to deepen understanding and following the error simulation and demonstrates a workaround:
An environmental information
Host: Notebook Virtual Machine
System: Red Hat Enterprise Linux Server release 6.2
Version: PostgreSQL 9.3beta1
Main Library IP: 192.168.1.36 Host Name: redhatB
Standby database IP: 192.168.1.35 Host Name: redhat6
Note: Copy the build process flow slightly, reference PostgreSQL: streaming replication environment built using pg_basebackup


Two simulations
- Set the main library postgresql.conf 2.1
For ease of demonstration, set the following parameters, the other parameters set according to demand:

checkpoint_segments = 3
archive_mode = on
archive_command = 'cp% p / archive/pg93 /% f'
max_wal_senders = 3
wal_keep_segments = 3
max_wal_senders = 3
Note: Archive directory / archive/pg93 / need to create better and to give the appropriate permissions.
- 2.2 reload the configuration file
[Pg93 @ redhatB pg_root] $ pg_ctl reload-D $ PGDATA
server signaled
  Note: After modifying the parameters needed reload.
- Before 2.3 Test: Main Library Data
[Pg93 @ redhatB ~] $ psql
psql (9.3beta1)
Type "help" for help.

postgres = # select * from test_1;
id | create_time
---- + ---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows)
 
- Before 2.4 Test: Preparation of library data
[Pg93 @ redhat6 ~] $ psql
psql (9.3beta1)
Type "help" for help.

postgres = # select * from test_1;
id | create_time
---- + ---------------------
1 | 2013-07-01 21:15:34
2 | 2013-07-01 21:55:37
3 | 2013-07-01 22:01:18
(3 rows) 
  Note: This marks the first under the data so that the rear do comparison.
- 2.5 stop by the library
[Pg93 @ redhat6 ~] $ pg_ctl stop-m fast-D $ PGDATA
waiting for server to shut down .... done
server stopped
 
- 2.6 to perform the following operations on the main library
postgres = # insert into test_1 values ​​(5, now ());
INSERT 0 1
postgres = # select pg_switch_xlog ();
pg_switch_xlog
----------------
1/310000AC
(1 row)

postgres = # insert into test_1 values ​​(5, now ());
INSERT 0 1
postgres = # select pg_switch_xlog ();
pg_switch_xlog
----------------
1/320004D0
(1 row)

postgres = # insert into test_1 values ​​(5, now ());
INSERT 0 1
postgres = # select pg_switch_xlog ();
pg_switch_xlog
----------------
1/330000AC
(1 row)

.....
Easy to show that duplicate content is not posted.
 

- 2.7 View archive directory
[Pg93 @ redhatB pg93] $ ll / archive/pg93 / | wc-l
36
Note: At this time the archive directory is created some WAL log archiving.

- 2.8 Start standby database
[Pg93 @ redhat6 ~] $ pg_ctl start-D $ PGDATA
server starting
Note: The database can be Kai, but view the log, reported the following error:

- 2.9 csv log
[Pg93 @ redhat6 pg_log] $ tail-f postgresql-2013-07-02_062642.csv
",,,,,,,,," "
2013-07-02 06:27:17.672 CST,,, 3704,, 51d20245.e78, 1, ,2013-07-02 06:27:17 CST,, 0, LOG, 00000, "started streaming WAL from primary at 1/22000000 on timeline 1 ",,,,,,,,," "
2013-07-02 06:27:17.674 CST,,, 3704,, 51d20245.e78, 2, ,2013-07-02 06:27:17 CST,, 0, FATAL, XX000, "could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000022 has already been removed
  Note: An error has been to reproduce the target in the main library's $ PGDATA / pg_xlog directory files have not found 000000010000000100000022,
Because the file has been recycled XLOG covered, but in the archive directory / archive/pg93 can be found.

- Copy main library 2.10 WAL archive directory to the backup node
[Pg93 @ redhatB pg93] $ scp / archive/pg93 / * pg93@192.168.1.35 :/ archive/pg93
  Note: At this time the copy archive directory on the master node WAL files to the archive directory backup node.
- 2.11 Modify the following parameters recovery.conf document prepared by the library
restore_command = 'cp / archive/pg93 /% f% p'
  NOTE: Other parameters slightly, reference PostgreSQL: streaming replication environment built using pg_basebackup    

- 2.12 restart standby database
[Pg93 @ redhat6 pg_root] $ pg_ctl stop-m fast-D $ PGDATA
waiting for server to shut down .... done
server stopped

[Pg93 @ redhat6 pg_root] $ pg_ctl start-D $ PGDATA
server starting
 
- 2.13 View standby database log
[Pg93 @ redhat6 pg_log] $ tail-f postgresql-2013-07-02_063606.csv
2013-07-02 06:36:08.123 CST,,, 4008,, 51d20456.fa8, 4, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "redo starts at 1/200002BC ",,,,,,,,," "
2013-07-02 06:36:08.592 CST,,, 4008,, 51d20456.fa8, 5, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "restored log file "" 000000010000000100000021 "" from archive ",,,,,,,,," "
2013-07-02 06:36:08.618 CST,,, 4008,, 51d20456.fa8, 6, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "consistent recovery state reached at 1/22000000 ",,,,,,,,," "
2013-07-02 06:36:08.630 CST,,, 4006,, 51d20456.fa6, 1, ,2013-07-02 06:36:06 CST,, 0, LOG, 00000, "database system is ready to accept read only connections ",,,,,,,,," "
2013-07-02 06:36:10.014 CST,,, 4008,, 51d20456.fa8, 7, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "restored log file "" 000000010000000100000022 "" from archive ",,,,,,,,," "
2013-07-02 06:36:10.800 CST,,, 4008,, 51d20456.fa8, 8, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "restored log file "" 000000010000000100000023 "" from archive ",,,,,,,,," "
2013-07-02 06:36:12.139 CST,,, 4008,, 51d20456.fa8, 9, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "restored log file "" 000000010000000100000024 "" from archive ",,,,,,,,," "
2013-07-02 06:36:13.937 CST,,, 4008,, 51d20456.fa8, 10, ,2013-07-02 06:36:06 CST, 1/0, 0, LOG, 00000, "restored log file "" 000000010000000100000025 "" from archive ",,,,,,,,," "
....
Part omitted
2013-07-02 06:36:47.400 CST,,, 4059,, 51d2047f.fdb, 1, ,2013-07-02 06:36:47 CST,, 0, LOG, 00000, "started streaming WAL from primary at 1/43000000 on timeline 1 ",,,,,,,,," "
Note: At this point prepared to take the library from the archive directory desperately chase began after the main library to the desired WAL, until you see the last line of the above information, said the log
Main library has been completely catch up.
Three verification
- 3.1 Main Library
[Pg93 @ redhatB ~] $ psql
psql (9.3beta1)
Type "help" for help.

postgres = # select * from test_1 order by create_time desc limit 3;
id | create_time
---- + ---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)

postgres = # select count (*) from test_1;
count
-------
39
(1 row)
 
- 3.2 library equipment
[Pg93 @ redhat6 pg_root] $ psql
psql (9.3beta1)
Type "help" for help.

postgres = # select * from test_1 order by create_time desc limit 3;
id | create_time
---- + ---------------------
5 | 2013-07-02 06:38:07
5 | 2013-07-02 06:26:11
5 | 2013-07-02 06:26:09
(3 rows)

postgres = # select count (*) from test_1;
count
-------
39
(1 row)
Note: At this point in the data preparation process stopped in the new library has also been synchronized to the standby database, the recovery success.


Four summary
For a very busy library, it is recommended to pg_xlog allocate larger memory, so that it can retain more WAL files in the backup node
After down for maintenance, the library can get more time to stop.
2 For more busy large databases, such as TB-level, long-term if there is not enough memory to open the archive, then at least in the standby node needs
This time to stop the archive library maintenance of open, otherwise when the main library to catch up to the standby database Kai, TB level library on behalf of standby redo
Price is imagined.

No comments:

Post a Comment