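This week we had another production incident, this time in a different cloud environment that has been in operation for nearly four years. Quite a few customers were affected, but fortunately the impact was brief and production was restored fairly quickly. I won't walk through the troubleshooting in detail; for now I just want to record which logs to check.
Server messages log
This log file shows why Oracle went down. In this incident it showed that free swap had dropped to 0 KB, which led the system to kill the Oracle process.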
[root@57373ded4b19 log]# more /var/log/messages
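Rather than paging through the whole file, one option is to filter for the kill event directly. The following is a minimal sketch, assuming the kill came from the kernel's out-of-memory handling (which the exhausted swap suggests). The helper names `find_oom_events` and `show_swap` are made up for illustration, and the search patterns assume the usual kernel wording, which can differ slightly between kernel versions.

```shell
#!/bin/sh
# find_oom_events is a hypothetical helper name, not part of any tool.
# It prints log lines that look like kernel OOM-killer activity.
# Usage: find_oom_events [logfile]   (defaults to /var/log/messages)
find_oom_events() {
    logfile="${1:-/var/log/messages}"
    # -i: case-insensitive, -E: extended regex so | means "or"
    grep -iE 'out of memory|oom-killer|killed process' "$logfile"
}

# show_swap (also hypothetical) prints the current swap totals
# straight from /proc/meminfo, in kB.
show_swap() {
    grep -E '^Swap(Total|Free):' /proc/meminfo
}
```

For example, `find_oom_events /var/log/messages | grep -i oracle` narrows the output to lines that mention the Oracle process.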
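Application logs
app log
Check the state of the application during the period when the DB was down.
web (Apache) log
Count the application's transaction request volume.
Oracle logs
alert*.log is Oracle's alert log file. It shows what Oracle was doing when the problem occurred and what triggered it.
trace logs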
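One way to get the request volume is to bucket the access log by minute. This is only a sketch: `requests_per_minute` is a hypothetical helper, and it assumes the access log uses Apache's Common or Combined Log Format, where the fourth whitespace-separated field looks like `[10/Oct/2023:13:55:36`. A custom `LogFormat` would need a different field or offset.

```shell
#!/bin/sh
# requests_per_minute is a hypothetical helper name.
# Assumes Common/Combined Log Format: field 4 is "[dd/Mon/yyyy:HH:MM:SS".
# Prints one line per minute, "<count> <dd/Mon/yyyy:HH:MM>", in the
# order the minutes first appear in the file.
# Usage: requests_per_minute <access_log>
requests_per_minute() {
    awk '{
        minute = substr($4, 2, 17)        # drop "[" and the ":SS" part
        if (!(minute in count)) order[++n] = minute
        count[minute]++
    }
    END {
        for (i = 1; i <= n; i++) print count[order[i]], order[i]
    }' "$1"
}
```

Piping the result through `sort -rn | head` would show the busiest minutes first, which makes it easier to compare traffic against the time the DB went down.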
[oracle@57373ded4b19 trace]$ sqlplus / as sysdba
SQL> show parameter dump
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
core_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/cdump
max_dump_file_size string unlimited
shadow_core_dump string partial
user_dump_dest string /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
SQL> select * from v$diag_info;
INST_ID NAME VALUE
---------- ---------------------------------------------------------------- -------------------------------------------------------------------------------------
1 Diag Enabled TRUE
1 ADR Base /home/oracle/app/oracle
1 ADR Home /home/oracle/app/oracle/diag/rdbms/helowin/helowin
1 Diag Trace /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
1 Diag Alert /home/oracle/app/oracle/diag/rdbms/helowin/helowin/alert
1 Diag Incident /home/oracle/app/oracle/diag/rdbms/helowin/helowin/incident
1 Diag Cdump /home/oracle/app/oracle/diag/rdbms/helowin/helowin/cdump
1 Health Monitor /home/oracle/app/oracle/diag/rdbms/helowin/helowin/hm
1 Default Trace File /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace/helowin_ora_25878.trc
1 Active Problem Count 1
1 Active Incident Count 26
11 rows selected.
SQL>
[oracle@57373ded4b19 alert]$ cd /home/oracle/app/oracle/diag/rdbms/helowin/helowin/alert
[oracle@57373ded4b19 alert]$ ls
log.xml
[oracle@57373ded4b19 alert]$ cd /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace
[oracle@57373ded4b19 trace]$ ls alert_helowin.log
alert_helowin.log
[oracle@57373ded4b19 trace]$
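Once the alert log has been located, a quick tally of the ORA- error codes in it gives a first impression of what went wrong. A minimal sketch follows; `count_ora_errors` is a hypothetical helper, and it relies on `grep -o`, which GNU grep provides but strict POSIX grep does not.

```shell
#!/bin/sh
# count_ora_errors is a hypothetical helper name.
# It extracts every ORA-NNNNN code from the given file and prints
# the distinct codes with their counts, most frequent first.
# Usage: count_ora_errors <alert_log>
count_ora_errors() {
    grep -oE 'ORA-[0-9]+' "$1" | sort | uniq -c | sort -rn
}
```

With the paths shown above, this could be run as `count_ora_errors /home/oracle/app/oracle/diag/rdbms/helowin/helowin/trace/alert_helowin.log`.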
ASM logs
[oracle@57373ded4b19 trace]$ sqlplus / as sysasm
SQL> show parameter dump
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
background_core_dump string partial
background_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
core_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+ASM1/cdump
max_dump_file_size string unlimited
shadow_core_dump string partial
user_dump_dest string /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
SQL> select * from v$diag_info;
INST_ID NAME VALUE
---------- ---------------------------------------------------------------- -------------------------------------------------------------------------------------
1 Diag Enabled TRUE
1 ADR Base /home/oracle/app/grid
1 ADR Home /home/oracle/app/grid/diag/asm/+asm/+ASM1
1 Diag Trace /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
1 Diag Alert /home/oracle/app/grid/diag/asm/+asm/+ASM1/alert
1 Diag Incident /home/oracle/app/grid/diag/asm/+asm/+ASM1/incident
1 Diag Cdump /home/oracle/app/grid/diag/asm/+asm/+ASM1/cdump
1 Health Monitor /home/oracle/app/grid/diag/asm/+asm/+ASM1/hm
1 Default Trace File /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11192.trc
1 Active Problem Count 0
1 Active Incident Count 0
11 rows selected.
[oracle@57373ded4b19 trace]$ cd /home/oracle/app/grid/diag/asm/+asm/+ASM1/trace
[oracle@57373ded4b19 trace]$ more alert_+ASM1.log
Exporting an AWR report from Oracle
sqlplus / as sysdba
SQL> @?/rdbms/admin/awrrpt.sql
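Then, following the prompts, enter:
- the report type
'html' HTML format (default)
'text' Text format
'active-html' Includes Performance Hub active report
- the number of days of snapshots to choose from
- the begin and end Snap Id
- the name of the report file
and the AWR report is exported.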
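The prompts above can also be answered ahead of time, which helps when the report has to be pulled quickly during an incident. The sketch below rests on an assumption worth verifying on your Oracle version: that `awrrpt.sql` skips a prompt when the matching substitution variable (`report_type`, `num_days`, `begin_snap`, `end_snap`, `report_name`) is already defined. The helper name `awr_script` is made up, and it only prints the SQL*Plus script rather than running it.

```shell
#!/bin/sh
# awr_script is a hypothetical helper name.
# It prints a SQL*Plus script that pre-defines the values awrrpt.sql
# would otherwise prompt for (assumption: pre-defined variables are
# not prompted for; check this on your Oracle version).
# Usage: awr_script <begin_snap> <end_snap> <report_name>
awr_script() {
    cat <<EOF
define report_type = 'html'
define num_days    = 1
define begin_snap  = $1
define end_snap    = $2
define report_name = '$3'
@?/rdbms/admin/awrrpt.sql
exit
EOF
}
```

A possible invocation is `awr_script 100 101 awr_100_101.html | sqlplus -s / as sysdba`, where 100 and 101 are placeholder Snap Ids, not values from this incident.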
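Every production investigation comes down to sifting through who knows how many lines and characters in piles of log files to pick out a small bit of information, then comparing it against the rest to pin down the cause.
*Sigh, my head hurts every day.*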