Tomcat c3p0連接池故障一例

負責維護的系統出現c3p0連接池泄露的故障，導致Oracle DB響應極慢，本文對當時故障處理做簡單記錄，雖然最後直接換掉tomcat版本、棄用c3p0連接池解決了故障，還是希望本文能爲遇到類似問題的同學提供一些思路。

環境

數據庫和應用在同一臺服務器上。

OS： Windows Server 2008R2 Enterprise
應用： Tomcat 6.0.32，部署一個webservice接口服務
數據庫： Oracle 11g R2 11.2.0.1 64bit

故障現象

每天上午08:00開始業務，約至10:00左右開始出現異常，tomcat catalina.log報錯”java.sql.SQLException: Io 異常: The Network Adapter could not establish the connection”，c3p0連接池無法獲取新的connection。同時Oracle Database響應極慢。

當前c3p0連接池配置

查看tomcat c3p0連接池配置參數如下。

c3p0.minPoolSize=1
c3p0.maxPoolSize=300    <----
c3p0.idleConnectionTestPeriod=35    <----
c3p0.maxIdleTime=35    <----
c3p0.maxStatements=0
c3p0.maxStatementsPerConnection=0
c3p0.preferredTestQuery=select 1 from dual    <----
c3p0.acquireIncrement=2
c3p0.acquireRetryDelay=1000
c3p0.acquireRetryAttempts=10
c3p0.initialPoolSize=1

查詢session分佈

v$session，按machine統計Oracle database會話情況，發現90%都是本服務器tomcat發起。其發起的session數在1000以上，遠超c3p0 maxPoolSize=300的配置，猜想是c3p0連接池配置問題，或者某種原因導致連接池泄露。

SELECT b.MACHINE, b.PROGRAM, COUNT (*)
   FROM v$process a, v$session b
   WHERE a.ADDR = b.PADDR AND b.USERNAME IS NOT NULL
GROUP BY b.MACHINE, b.PROGRAM
ORDER BY COUNT (*) DESC;

Oracle Zabbix監控

在Zabbix（orabbix）監控上，看到session在08:00時約200個，此後開始迅速增長，到10:15左右達到1400個。中午12:00安排重啓tomcat，下午13:00業務開始後，session依舊快速增長，故障再次出現。圖上還可以看出session主體是”inactive session”，”Active Session”、”System Session”任然算是在正常範圍內。想象一下有1000個active session！？那數據庫早就崩了！

Oracle AWR分析

從11月16日10:00~11:00 1個小時內的性能awr報告可以看到，平均每秒鐘31.4次用戶登錄（logon），60min內有113040次登錄。

按照對CPU Time的消耗，對1小時內SQL語句統計排序↓，排在前5的語句，均是oracle在創建新session時，用戶登錄階段獲取環境參數、權限信息、連接信息的底層SQL語句。這些語句分別執行的次數基本和登錄次數一致，其中“cm5vu20fhtnq1”執行次數約爲登錄次數的4倍。

SQL_ID	SQL Text
459f3z9u4fb3u	select valuefromprops where name = GLOBAL_DB_NAME’
cm5vu20fhtnq1	select /+ connect_by_filtering / privilege#, level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0
0ws7ahf1d78qa	select SYS_CONTEXT(‘USERENV’, ‘SERVER_HOST’), SYS_CONTEXT(‘USERENV’, ‘DB_UNIQUE_NAME’), SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’), SYS_CONTEXT(‘USERENV’, ‘SERVICE_NAME’), INSTANCE_NUMBER, STARTUP_TIME, SYS_CONTEXT(‘USERENV’, ‘DB_DOMAIN’) from v$instance where INSTANCE_NAME=SYS_CONTEXT(‘USERENV’, ‘INSTANCE_NAME’)
0k8522rmdzg4k	select privilege# from sysauth$ where (grantee#=:1 or grantee#=1) and privilege#>0
5ur69atw3vfhj	select decode(failover_method, NULL, 0 , ‘BASIC’, 1, ‘PRECONNECT’, 2 , ‘PREPARSE’, 4 , 0), decode(failover_type, NULL, 1 , ‘NONE’, 1 , ‘SESSION’, 2, ‘SELECT’, 4, 1), failover_retries, failover_delay, flags from service$ where name = :1

而應用業務有關的sql其實執行次數以及CPU、IO消耗很小↓↓。

收集同一時段內，業務核心數據庫（2節點Oracle RAC）的AWR，每秒oracle用戶登錄次數爲0.31，是故障服務器上數據庫的1/100；並且耗時較多的sql語句均是應用業務相關的。↓經過和開發同事確認，該webservice的訪問量遠小於核心應用，相應數據庫的logon次數不應該比核心庫還大的多。

分析session的SQL記錄

那麼到底這些session在變爲inactive之前做了什麼操作呢？到底這些inactive session空閒了多久？
需要查v$session，統計session的空閒時間，及其最後執行的SQL語句。v$session中的last_call_et字段，反映了session最後一次執行sql到當前時刻共逝去多少時間（單位：秒）。以下語句統計了當前所有session從最後一次執行sql以來的空閒時間，以及最後一次執行的sql語句ID。

select username,
        s.SID,
        s.SERIAL#,
        s.logon_time,
        s.PREV_SQL_ID,
        s.last_call_et
   from v$session s
  where s.username is not null
  order by s.LAST_CALL_ET desc;

可以看出業務用戶的所有1200多個session中，空閒時間從0秒~34秒不等，不超過35秒（c3p0 maxIdleTime = 35）。成功執行的最後一個sql語句都是（520mkxqpf15q8）：select 1 from dual，也即是c3p0連接池的connect testquery語句/idleConnectionTest測試語句。↓↓↓

嘗試分析原因

應用獲取數據庫連接（connection）的開銷比較大，所以纔有了連接池的設計。創建連接池，就是希望申請到的connection存放在pool裏，達到“創建一次，循環使用”的狀態。
應用使用getConnection()方法向c3p0 pool申請connection，如果pool裏有空閒的，則應用得到一個connection；如果pool裏沒有空閒的則由c3p0到數據庫去獲取新的connection，然後分配給應用；應用程序的數據庫操作完成後，使用close()方法將connection放回到pool中，標記爲idle，以便其他應用線程調用。
現場的情況是c3p0的poolsize逐漸增大，當poolsize達到預設值300時，並沒有停止而是繼續申請connection，並且通過idleConnectiontest不斷檢查pool中所有連接的狀態。

以下是從網上找到幾個可能的原因，但是都跟本次場景都不符合：

從pool裏獲取connection時，使用了getConnection(username, password)格式，並且一個DataSource配置裏有多個用戶名。
舉例，比如一個DataSource配置中含有dbuser1、dbuser2、dbuser3、dbuser4、dbuser5等5個，則實際的pool最大連接數是maxpoolsize * 5=1500。**
程序歸還connection時close()方法有問題
tomcat的server.xml 配置裏autoDeploy=true，並且在線修改webapp應用程序的配置時，autoDeploy功能會導致自動生成一個新的pool。

幾個c3p0配置的官方建議

Managing Pool Size and Connection Age
maxConnectionAge/maxIdleTime不易過小
Some general advice about all of these timeout parameters: Slow down! The point of Connection pooling is to bear the cost of acquiring a Connection only once, and then to reuse the Connection many, many times. Most databases support Connections that remain open for hours at a time. There’s no need to churn through all your Connections every few seconds or minutes.Setting maxConnectionAge or maxIdleTime to 1800 (30 minutes) is quite aggressive. For most databases, several hours may be more appropriate. You can ensure the reliability of your Connections by testing them, rather than by tossing them.
maxIdleTime配置爲30min就已經很激進了，建議以小時爲單位。可以使用connection test定期去檢查連接的可靠性，檢測到不可用的connection時，再將其從pool裏剔除出去。
懷疑是maxpoolsize=35太小，現場將其修改爲3600，重啓tomcat觀察故障依舊:(。
Simple advice on Connection testing
If you’d like to improve performance by eliminating Connection testing from clients’ code path:
檢查修正了以下的建議配置，重啓tomcat問題依舊。最後開發同事從別的服務器上copy了更新版本的tomcat，棄用c3p0連接池，恢復業務，總算是解決了這個故障，故障根本原因就沒再花時間去測試了

a. Set testConnectionOnCheckout to false

b. Set testConnectionOnCheckin to true

c. Set idleConnectionTestPeriod to 30, fire up you application and observe. This is a pretty robust setting, all Connections will tested on check-in and every 30 seconds thereafter while in the pool. Your application should experience broken or stale Connections only very rarely, and the pool should recover from a database shutdown and restart quickly. But there is some overhead associated with all that Connection testing.

d. If database restarts will be rare so quick recovery is not an issue, consider reducing the frequency of tests by idleConnectionTestPeriod to, say, 300, and see whether clients are troubled by stale or broken Connections. If not, stick with 300, or try an even bigger number. Consider setting testConnectionOnCheckin back to false to avoid unnecessary tests on checkin. Alternatively, if your application does encounter bad Connections, consider reducing idleConnectionTestPeriod and set testConnectionOnCheckin back to true. There are no correct or incorrect values for these parameters: you are trading off overhead for reliability in deciding how frequently to test. The exact numbers are not so critical. It’s usually easy to find configurations that perform well. It’s rarely worth spending time in pursuit of “optimal” values here.

參考：
c3p0 connection pool queries
http://stackoverflow.com/questions/13942540/c3p0-connection-pool-queries
c3p0 connection pool is not shrinking
http://stackoverflow.com/questions/21343741/c3p0-connection-pool-is-not-shrinking?rq=1
C3P0連接池配置
http://blog.csdn.net/caihaijiang/article/details/6843496
Does c3p0 connection pooling ensures max pool size?
http://stackoverflow.com/questions/16932846/does-c3p0-connection-pooling-ensures-max-pool-size
c3p0 Connection Pool not closing connections
http://stackoverflow.com/questions/9824879/c3p0-connection-pool-not-closing-connections
c3p0 - JDBC3 Connection and Statement Pooling
http://www.mchange.com/projects/c3p0/#contents

Tomcat c3p0連接池故障一例

環境

故障現象

當前c3p0連接池配置

查詢session分佈

Oracle Zabbix監控

Oracle AWR分析

分析session的SQL記錄

嘗試分析原因

幾個c3p0配置的官方建議

Python 潮流週刊#52：Python 處理 Excel 的資源

Windows server 2008R2更新補丁後進入系統恢復

CentOS7密碼複雜度配置

Linux定時清理Oracle 12c RAC listener監聽日誌

Linux定時清理Oracle alert、trace、audit等日誌文件

filecmp.dircmp實現遞歸比對兩個目錄

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結