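Problem description:
During a routine inspection, usage of the /gg filesystem mounted for OGG had reached 85%. Investigation showed that OGG was not purging old trail files automatically, which is what triggered the usage alert.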
[oracle@host01 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ggvg-lv_gg 468G 375G 70G 85% /gg
/dev/sda1 194M 33M 152M 18% /boot
The /gg filesystem has 375G in use, and the trail file directory (dirdat) accounts for 372G of it:
[oracle@host01 gg]$ cd /gg/goldengate/
[oracle@host01 goldengate]$ du -sh *
252K dirchk
372G dirdat
4.0K dirdef
……
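To see which trail prefix is taking the space, the files in dirdat can be grouped by their two-character prefix. This is a minimal sketch, assuming GNU find and awk; the function name trail_usage_by_prefix is made up for illustration and is not part of OGG.

```shell
# Summarize the files in a trail directory by two-character prefix.
# Prints one line per prefix: <prefix> <file count> <total bytes>
# Hypothetical helper; requires GNU find for -printf.
trail_usage_by_prefix() {
    dir="$1"
    # %f = file name without directory, %s = size in bytes
    find "$dir" -maxdepth 1 -type f -printf '%f %s\n' |
        awk '{ p = substr($1, 1, 2); n[p]++; b[p] += $2 }
             END { for (p in n) printf "%s %d %.0f\n", p, n[p], b[p] }' |
        sort
}
```

For the layout above it would be called as: trail_usage_by_prefix /gg/goldengate/dirdat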
Looking at the trail files in detail, files from as far back as the 2nd are still present:
[oracle@host01 dirdat]$ ll
total 389674828
-rw-rw-rw- 1 oracle oinstall 177660920 Aug 22 2017 ec005334
-rw-rw-rw- 1 oracle oinstall 199999595 Dec 2 10:01 fe022410
-rw-rw-rw- 1 oracle oinstall 199999835 Dec 2 10:18 fe022411
-rw-rw-rw- 1 oracle oinstall 199999908 Dec 2 10:35 fe022412
-rw-rw-rw- 1 oracle oinstall 199999956 Dec 2 10:52 fe022413
-rw-rw-rw- 1 oracle oinstall 199999814 Dec 2 11:09 fe022414
-rw-rw-rw- 1 oracle oinstall 199999798 Dec 2 11:25 fe022415
-rw-rw-rw- 1 oracle oinstall 199999882 Dec 2 11:42 fe022416
-rw-rw-rw- 1 oracle oinstall 199999537 Dec 2 11:59 fe022417
-rw-rw-rw- 1 oracle oinstall 199999477 Dec 2 12:16 fe022418
[oracle@host01 dirdat]$ cd ..
[oracle@host01 goldengate]$ ./ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:32:14
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
GGSCI (host01) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:12:03
REPLICAT STOPPED REP_ESS 00:00:00 672:11:53
REPLICAT STOPPED REP_IT 00:00:00 20614:00:47
REPLICAT RUNNING REP_NEW 00:00:00 00:00:02
REPLICAT STOPPED REP_ORD 00:00:00 20614:00:23
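The "Time Since Chkpt" column is the interesting one here: several stopped groups have not checkpointed for hundreds or thousands of hours. A small filter can pull those out of the info all output. This is only a sketch; stale_groups is a hypothetical name, and it assumes the five-column layout shown above (program, status, group, lag, time since checkpoint).

```shell
# Read "info all" output on stdin and print stopped groups whose
# "Time Since Chkpt" is at least MIN_HOURS hours.
# Output: <group> <hours since checkpoint>
stale_groups() {
    min_hours="$1"
    awk -v min="$min_hours" '
        ($1 == "EXTRACT" || $1 == "REPLICAT") && $2 == "STOPPED" && NF >= 5 {
            split($5, t, ":")      # HH:MM:SS, where HH can exceed 24
            if (t[1] + 0 >= min + 0) print $3, t[1] + 0
        }'
}
```

One possible use, run from the OGG home directory: echo "info all" | ./ggsci | stale_groups 24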
Finding the problem
The mgr parameter file does have automatic purging configured. So why are the files not being purged? Is mgr hung?
GGSCI (host01) 2> view param mgr
Port 7839
DynamicPortList 7840-7850
DynamicPortReassignDelay 5
PurgeOldExtracts ./dirdat/ec*, UseCheckpoints, MinKeepDays 8
PurgeOldExtracts ./dirdat/fe*, UseCheckpoints, MinKeepDays 8
-- PurgeOldExtracts ./dirdat2/e2*, UseCheckpoints, MinKeepDays 5
-- PurgeOldExtracts ./dirdat3/e3*, UseCheckpoints, MinKeepDays 5
-- PurgeOldExtracts ./dirdat4/e4*, UseCheckpoints, MinKeepDays 5
-- AutoRestart ER *, Retries 5, WaitMinutes 10, ResetMinutes 60
LagReportHours 1
LagInfoMinutes 3
LagCriticalMinutes 5
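As far as the parameter documentation goes, the two options on each PurgeOldExtracts line work together: UseCheckpoints lets mgr purge a file only once every process reading that trail has checkpointed past it, and MinKeepDays 8 additionally keeps any file modified within the last 8 days. The age half of that rule is easy to approximate from the shell. The sketch below (older_than_days is a hypothetical helper, assuming GNU find) lists files that are old enough to be purged; any file it lists that is still on disk is presumably being held back by a checkpoint.

```shell
# List files in DIR whose names start with PREFIX and whose last
# modification is more than DAYS days ago (the MinKeepDays side only).
# It does not and cannot evaluate the UseCheckpoints condition.
older_than_days() {
    dir="$1"
    prefix="$2"
    days="$3"
    find "$dir" -maxdepth 1 -type f -name "${prefix}*" \
        -mmin "+$((days * 24 * 60))" -printf '%f\n' | sort
}
```

For the configuration above: older_than_days /gg/goldengate/dirdat fe 8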
Attempted fix
The plan is to restart mgr, stopping the replicat process first:
GGSCI (host01) 6> stop REP_NEW
Sending STOP request to REPLICAT REP_NEW ...
STOP request pending end-of-transaction (936 records so far)..
GGSCI (host01) 7> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:17:31
REPLICAT STOPPED REP_ESS 00:00:00 672:17:22
REPLICAT STOPPED REP_IT 00:00:00 20614:06:16
REPLICAT STOPPED REP_NEW 00:00:03 00:00:03
REPLICAT STOPPED REP_ORD 00:00:00 20614:05:52
GGSCI (host01) 8> stop mgr
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y
Sending STOP request to MANAGER ...
Request processed.
Manager stopped.
GGSCI (host01) 9> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER STOPPED
REPLICAT STOPPED ESS_ONE 00:00:03 672:17:52
REPLICAT STOPPED REP_ESS 00:00:00 672:17:42
REPLICAT STOPPED REP_IT 00:00:00 20614:06:37
REPLICAT STOPPED REP_NEW 00:00:03 00:00:23
REPLICAT STOPPED REP_ORD 00:00:00 20614:06:12
GGSCI (host01) 10> start mgr
Manager started.
GGSCI (host01) 11> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:18:01
REPLICAT STOPPED REP_ESS 00:00:00 672:17:51
REPLICAT STOPPED REP_IT 00:00:00 20614:06:45
REPLICAT STOPPED REP_NEW 00:00:03 00:00:32
REPLICAT STOPPED REP_ORD 00:00:00 20614:06:21
GGSCI (host01) 12> start REP_NEW
Sending START request to MANAGER ...
REPLICAT REP_NEW starting
GGSCI (host01) 13> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:18:13
REPLICAT STOPPED REP_ESS 00:00:00 672:18:04
REPLICAT STOPPED REP_IT 00:00:00 20614:06:58
REPLICAT RUNNING REP_NEW 00:00:03 00:00:45
REPLICAT STOPPED REP_ORD 00:00:00 20614:06:34
GGSCI (host01) 14> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:18:20
REPLICAT STOPPED REP_ESS 00:00:00 672:18:11
REPLICAT STOPPED REP_IT 00:00:00 20614:07:05
REPLICAT RUNNING REP_NEW 00:00:03 00:00:52
REPLICAT STOPPED REP_ORD 00:00:00 20614:06:41
After restarting mgr, the old files still have not been purged:
[oracle@host01 dirdat]$ ll
total 389776208
-rw-rw-rw- 1 oracle oinstall 177660920 Aug 22 2017 ec005334
-rw-rw-rw- 1 oracle oinstall 199999595 Dec 2 10:01 fe022410
-rw-rw-rw- 1 oracle oinstall 199999835 Dec 2 10:18 fe022411
-rw-rw-rw- 1 oracle oinstall 199999908 Dec 2 10:35 fe022412
-rw-rw-rw- 1 oracle oinstall 199999956 Dec 2 10:52 fe022413
-rw-rw-rw- 1 oracle oinstall 199999814 Dec 2 11:09 fe022414
-rw-rw-rw- 1 oracle oinstall 199999798 Dec 2 11:25 fe022415
-rw-rw-rw- 1 oracle oinstall 199999882 Dec 2 11:42 fe022416
-rw-rw-rw- 1 oracle oinstall 199999537 Dec 2 11:59 fe022417
-rw-rw-rw- 1 oracle oinstall 199999477 Dec 2 12:16 fe022418
-rw-rw-rw- 1 oracle oinstall 199999689 Dec 2 12:32 fe022419
-rw-rw-rw- 1 oracle oinstall 199999651 Dec 2 12:48 fe022420
-rw-rw-rw- 1 oracle oinstall 199999687 Dec 2 13:06 fe022421
-rw-rw-rw- 1 oracle oinstall 199999923 Dec 2 13:24 fe022422
-rw-rw-rw- 1 oracle oinstall 199999838 Dec 2 13:42 fe022423
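Solving the problem
On reflection: an unused replicat had been stopped earlier, and it shares these trail files with the running one. mgr most likely concluded that the stopped but not yet deleted process still needed these trails, so it did not purge them. Since the stopped replicat processes are no longer in use, the simplest fix is to delete them.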
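Before deleting anything, the suspicion can be checked by looking at each group's read checkpoint with INFO REPLICAT ... SHOWCH and comparing the trail file each one points at. The sketch below only generates the GGSCI commands; showch_commands is a hypothetical helper, and the group names passed to it are whatever info all reports.

```shell
# Print one "info replicat <group>, showch" command per argument,
# suitable for piping into ggsci.
showch_commands() {
    for g in "$@"; do
        printf 'info replicat %s, showch\n' "$g"
    done
}
```

One possible use, run from the OGG home directory: showch_commands ESS_ONE REP_ESS REP_NEW | ./ggsci
A stopped group whose read checkpoint still points at an old fe* file would be consistent with mgr holding everything from that file onward.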
[oracle@host01 goldengate]$ ./ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:32:14
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
GGSCI (host01) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED ESS_ONE 00:00:03 672:33:19
REPLICAT STOPPED REP_ESS 00:00:00 672:33:10
REPLICAT STOPPED REP_IT 00:00:00 20614:22:04
REPLICAT RUNNING REP_NEW 00:00:03 00:00:02
REPLICAT STOPPED REP_ORD 00:00:00 20614:21:40
Log in to the database from GGSCI:
GGSCI (host01) 2> dblogin userid goldengate,password goldengate
Successfully logged into database.
Delete the two processes that are no longer in use:
GGSCI (host01) 3> delete ESS_ONE
Deleted REPLICAT ESS_ONE.
GGSCI (host01) 4> delete REP_ESS
Deleted REPLICAT REP_ESS.
GGSCI (host01) 5> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REP_IT 00:00:00 20614:22:49
REPLICAT RUNNING REP_NEW 00:00:03 00:00:00
REPLICAT STOPPED REP_ORD 00:00:00 20614:22:25
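Then restart mgr. (The deletion should take effect immediately, but a quick look at df showed no change in space, so mgr was restarted again here.)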
[oracle@host01 goldengate]$ ./ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:32:14
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
GGSCI (host01) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REP_IT 00:00:00 20614:24:00
REPLICAT RUNNING REP_NEW 00:00:03 00:00:01
REPLICAT STOPPED REP_ORD 00:00:00 20614:23:36
GGSCI (host01) 2> stop REP_NEW
Sending STOP request to REPLICAT REP_NEW ...
STOP request pending end-of-transaction (694 records so far)..
GGSCI (host01) 3> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REP_IT 00:00:00 20614:24:12
REPLICAT STOPPED REP_NEW 00:00:04 00:00:03
REPLICAT STOPPED REP_ORD 00:00:00 20614:23:48
GGSCI (host01) 4> stop mgr
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y
Sending STOP request to MANAGER ...
Request processed.
Manager stopped.
GGSCI (host01) 5> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER STOPPED
REPLICAT STOPPED REP_IT 00:00:00 20614:24:33
REPLICAT STOPPED REP_NEW 00:00:04 00:00:24
REPLICAT STOPPED REP_ORD 00:00:00 20614:24:08
GGSCI (host01) 6> start mgr
Manager started.
GGSCI (host01) 7> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REP_IT 00:00:00 20614:24:42
REPLICAT STOPPED REP_NEW 00:00:04 00:00:34
REPLICAT STOPPED REP_ORD 00:00:00 20614:24:18
GGSCI (host01) 8> start REP_NEW
Sending START request to MANAGER ...
REPLICAT REP_NEW starting
GGSCI (host01) 9> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REP_IT 00:00:00 20614:25:01
REPLICAT RUNNING REP_NEW 00:00:51 00:00:01
REPLICAT STOPPED REP_ORD 00:00:00 20614:24:37
The trail files older than 8 days have now been purged:
[oracle@host01 dirdat]$ ls -lrt|more
total 121587316
-rw-rw-rw- 1 oracle oinstall 177660920 Aug 22 2017 ec005334
-rw-rw-rw- 1 oracle oinstall 199999865 Dec 22 10:33 fe023785
-rw-rw-rw- 1 oracle oinstall 199999987 Dec 22 10:51 fe023786
-rw-rw-rw- 1 oracle oinstall 199999700 Dec 22 11:09 fe023787
-rw-rw-rw- 1 oracle oinstall 199999857 Dec 22 11:26 fe023788
-rw-rw-rw- 1 oracle oinstall 199999805 Dec 22 11:44 fe023789
-rw-rw-rw- 1 oracle oinstall 199999972 Dec 22 12:02 fe023790
-rw-rw-rw- 1 oracle oinstall 199999703 Dec 22 12:20 fe023791
-rw-rw-rw- 1 oracle oinstall 199999351 Dec 22 12:38 fe023792
-rw-rw-rw- 1 oracle oinstall 199999885 Dec 22 12:56 fe023793
-rw-rw-rw- 1 oracle oinstall 199999664 Dec 22 13:15 fe023794
-rw-rw-rw- 1 oracle oinstall 199999993 Dec 22 13:32 fe023795
-rw-rw-rw- 1 oracle oinstall 199999713 Dec 22 13:50 fe023796
-rw-rw-rw- 1 oracle oinstall 199999766 Dec 22 14:08 fe023797
-rw-rw-rw- 1 oracle oinstall 199999760 Dec 22 14:26 fe023798
-rw-rw-rw- 1 oracle oinstall 199999629 Dec 22 14:45 fe023799
-rw-rw-rw- 1 oracle oinstall 199999904 Dec 22 15:03 fe023800
[oracle@host01 dirdat]$ cd ..
[oracle@host01 goldengate]$ du -sh *
188K dirchk
116G dirdat
4.0K dirdef
Verification
[oracle@host01 /]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg00-lv_root 9.9G 6.4G 3.0G 69% /
tmpfs 32G 294M 32G 1% /dev/shm
/dev/mapper/vg00-lv_oracle 30G 18G 11G 65% /oracle
/dev/mapper/ggvg-lv_gg 468G 119G 326G 27% /gg
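The same check that raised the original alert can be scripted so the next build-up is caught earlier. This is a minimal sketch that filters df output by usage percentage; mounts_over is a hypothetical name, the threshold is whatever the inspection policy uses, and mount points containing spaces are not handled.

```shell
# Read df output on stdin and print mount points whose Use% is at or
# above THRESHOLD. Output: <mount point> <use percent>
mounts_over() {
    threshold="$1"
    awk -v max="$threshold" '
        NR > 1 {                        # skip the header line
            use = $(NF - 1)             # second-to-last column is Use%
            sub(/%$/, "", use)
            if (use + 0 >= max + 0) print $NF, use
        }'
}
```

One possible use: df -P | mounts_over 80
Fed the df output from the start of this post it would report /gg at 85; fed the output above it reports nothing.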