Tuning a Rails server for high concurrency

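Background: once the fifth exhibition started, concurrency was fairly high and 502 errors appeared repeatedly.

The logs contained a large number of errors like the following: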

1722317 connect() to unix:/tmp/passenger.19461/master/helper_server.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 192.168.10.107, server: www.fwxgx.com, request: "GET /tje/exhibition/get_vote_number HTTP/1.0", upstream: "passenger://unix:/tmp/passenger.19461/master/helper_server.sock:", host: "www.fwxgx.com", referrer: http://www.fwxgx.com/tje/

The master server had reached about 7000 concurrent TCP connections.

The slave server had reached 700+ concurrent TCP connections.

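The following emergency measures were taken:

1. Changed keepalive_timeout from 32 to 120.

2. At this point requests routed to the master server's upstream were very likely to return 502, so the load-balancing weights were changed to 3:7.

3. That still did not help, so the master server was finally taken out of the load balancer and only the slave server was used.

4. Things improved somewhat, but 502 errors still appeared when the flash sale started.

5. Later the chat server was shut down.

6. For product reasons the chat server was turned back on, with its send interval changed from 2s to 5s, but this still did not solve the underlying problem.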

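Problem:
    According to the logs, the number of visitors was about the same as in the previous exhibitions, yet this time the 502 errors were much more severe. The cause has not been identified yet.

    However, I tried to reproduce the error with an ab test.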

ab -n 10000 -c 100 http://192.168.10.106/tje/exhibition/chatroom

top:

Cpu(s): 87.0%us, 2.2%sy, 0.0%ni, 9.6%id, 0.0%wa, 0.2%hi, 1.0%si, 0.0%st
Mem: 4041312k total, 3808708k used, 232604k free, 177584k buffers
Swap: 6094840k total, 124k used, 6094716k free, 1297316k cached

ab:

Server Software:        nginx/0.7.65
Server Hostname:        192.168.10.106
Server Port:            8080

Document Path:          /tje/exhibition/chatroom
Document Length:        25288 bytes

Concurrency Level:      100
Time taken for tests:   55.754484 seconds
Complete requests:      10000
Failed requests:        2
   (Connect: 0, Length: 2, Exceptions: 0)
Write errors:           0
Total transferred:      259050873 bytes
HTML transferred:       252879994 bytes
Requests per second:    179.36 [#/sec] (mean)
Time per request:       557.545 [ms] (mean)
Time per request:       5.575 [ms] (mean, across all concurrent requests)
Transfer rate:          4537.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     2    3.1      1     27
Processing:   130   552   91.1    550    970
Waiting:      124   540   90.8    538    954
Total:        130   554   90.4    552    971

Percentage of the requests served within a certain time (ms)
  50%    552
  66%    585
  75%    609
  80%    623
  90%    666
  95%    707
  98%    756
  99%    800
 100%    971 (longest request)

In the run above only two requests failed, which is acceptable.

As I increased the concurrency, things changed at 250:

ab -n 10000 -c 250 http://192.168.10.106:8080/tje/exhibition/chatroom

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.10.106 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software:        nginx/0.7.65
Server Hostname:        192.168.10.106
Server Port:            8080

Document Path:          /tje/exhibition/chatroom
Document Length:        25288 bytes

Concurrency Level:      250
Time taken for tests:   25.648259 seconds
Complete requests:      10000
Failed requests:        5885
   (Connect: 0, Length: 5885, Exceptions: 0)
Write errors:           0
Non-2xx responses:      5885
Total transferred:      108389086 bytes
HTML transferred:       105078225 bytes
Requests per second:    389.89 [#/sec] (mean)
Time per request:       641.206 [ms] (mean)
Time per request:       2.565 [ms] (mean, across all concurrent requests)
Transfer rate:          4126.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     1   42.5      0   3001
Processing:     0   622  768.2      2   3432
Waiting:        0   619  764.8      2   3428
Total:          0   623  771.0      2   4526

Percentage of the requests served within a certain time (ms)
  50%      2
  66%   1361
  75%   1415
  80%   1442
  90%   1509
  95%   1631
  98%   2049
  99%   2780
 100%   4526 (longest request)

ab -n 10000 -c 300 http://192.168.10.106:8080/tje/exhibition/chatroom

At 300 concurrent connections, more than 90% of the responses were 502.

My fix was to change a single kernel parameter.

I added the following to /etc/sysctl.conf:

net.core.somaxconn = 1024

Then I ran sysctl -p to apply the change

and restarted the web server.
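To confirm that the new value is actually in effect, the setting can be read back from /proc. This is a minimal sketch assuming a Linux host; the helper name read_sysctl_int is hypothetical, and it returns None when the entry does not exist or is not a single integer.

```python
from pathlib import Path
from typing import Optional


def read_sysctl_int(name: str, proc_root: str = "/proc/sys") -> Optional[int]:
    """Read an integer sysctl such as net.core.somaxconn; None if unavailable."""
    # sysctl names map onto paths under /proc/sys with dots replaced by slashes
    path = Path(proc_root) / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print("net.core.somaxconn =", read_sysctl_int("net.core.somaxconn"))
```

Running this before and after sysctl -p is a quick way to check that the edit to /etc/sysctl.conf was picked up.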

Test:

ab -n 10000 -c 300 http://192.168.10.106:8080/tje/exhibition/chatroom

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.10.106 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software:        nginx/0.7.65
Server Hostname:        192.168.10.106
Server Port:            8080

Document Path:          /tje/exhibition/chatroom
Document Length:        25288 bytes

Concurrency Level:      300
Time taken for tests:   59.251878 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      259052525 bytes
HTML transferred:       252880000 bytes
Requests per second:    168.77 [#/sec] (mean)
Time per request:       1777.556 [ms] (mean)
Time per request:       5.925 [ms] (mean, across all concurrent requests)
Transfer rate:          4269.57 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     1    2.9      0     29
Processing:  1263  1749  270.8   1681   3989
Waiting:     1261  1742  269.8   1674   3986
Total:       1263  1750  272.1   1682   3998

Percentage of the requests served within a certain time (ms)
  50%   1682
  66%   1719
  75%   1751
  80%   1775
  90%   1896
  95%   2321
  98%   2715
  99%   3072
 100%   3998 (longest request)

No errors at 300 concurrent connections. Now testing 1000:

ab -n 10000 -c 1000 http://192.168.10.106:8080/tje/exhibition/chatroom

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.10.106 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software:        nginx/0.7.65
Server Hostname:        192.168.10.106
Server Port:            8080

Document Path:          /tje/exhibition/chatroom
Document Length:        25288 bytes

Concurrency Level:      1000
Time taken for tests:   55.407599 seconds
Complete requests:      10000
Failed requests:        1
   (Connect: 0, Length: 1, Exceptions: 0)
Write errors:           0
Total transferred:      259051877 bytes
HTML transferred:       252879998 bytes
Requests per second:    180.48 [#/sec] (mean)
Time per request:       5540.760 [ms] (mean)
Time per request:       5.541 [ms] (mean, across all concurrent requests)
Transfer rate:          4565.80 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     5   14.0      1     80
Processing:   217  5250  996.2   5478   6269
Waiting:      200  5242  996.7   5471   6252
Total:        218  5255  989.6   5480   6269

Percentage of the requests served within a certain time (ms)
  50%   5480
  66%   5550
  75%   5598
  80%   5637
  90%   5822
  95%   5940
  98%   5997
  99%   6030
 100%   6269 (longest request)

Still no problem, except that typical response times have risen to more than 5s.
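The rise in response time is what one would expect when throughput is capped at roughly 170 to 180 requests per second: with throughput fixed, the mean time per request is approximately the concurrency divided by the requests per second. Below is a small sketch of that arithmetic using the figures ab reported in the successful runs; the function name is only illustrative.

```python
def mean_latency_ms(concurrency: int, requests_per_second: float) -> float:
    """Mean time per request implied by a steady throughput (Little's law)."""
    return concurrency / requests_per_second * 1000.0


# (concurrency level, requests per second) as reported by ab in the runs above
runs = [(100, 179.36), (300, 168.77), (1000, 180.48)]
for concurrency, rps in runs:
    latency = mean_latency_ms(concurrency, rps)
    print(f"c={concurrency:4d}  about {latency:.0f} ms per request")
```

The results land within a millisecond of the "Time per request" values ab printed (557.545, 1777.556 and 5540.760 ms). This suggests the extra connections are simply waiting in a queue, not being served any faster.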

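Memory still had headroom, so I tried raising passenger_max_pool_size from 32 to 40.

I then reran the tests at concurrency 100, 300, and 1000.

Throughput did not improve: it stayed at around 170 requests per second. Lowering the value to 20, on the other hand, reduced throughput.

The right value depends on the situation. If the Rails app itself uses a lot of memory, setting it too high makes things worse. I conservatively budget 80 to 100 MB per process.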
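One rough way to bound passenger_max_pool_size is to divide the memory available to Rails processes by the per-process estimate. The sketch below uses the 80 to 100 MB per-process figure and the 4041312k total reported by top; the 512 MB held back for the OS, nginx and caches is my own assumption, not a measured number.

```python
def max_pool_size(total_kb: int, reserved_mb: int, per_process_mb: int) -> int:
    """Upper bound on application processes that fit in the memory budget."""
    budget_mb = total_kb // 1024 - reserved_mb
    return max(budget_mb // per_process_mb, 1)


TOTAL_KB = 4041312   # "Mem: 4041312k total" from the top output
RESERVED_MB = 512    # assumed headroom for the OS, nginx and caches

for per_process_mb in (80, 100):
    limit = max_pool_size(TOTAL_KB, RESERVED_MB, per_process_mb)
    print(f"{per_process_mb} MB per process -> at most {limit} processes")
```

Under these assumptions the bound comes out between 34 and 42 processes, which is in line with the 32 to 40 range tried here. Since throughput did not rise between 32 and 40, memory is probably not the limiting factor at that point; CPU was already at 87% user time during the test.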

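The 502 errors under concurrency are now resolved.

So what is the underlying principle?

Many posts online recommend changing the backlog, but Passenger already changed its own backlog in version 2.2.6, raising it to 1024.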

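Also, according to man 2 listen, the backlog here is the queue of TCP connections that have completed the three-way handshake. In other words, it is the number of connections that are already established and waiting for the server to accept them.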
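One detail from the same man page is worth spelling out: on Linux the backlog passed to listen() is silently capped at net.core.somaxconn. A server that asks for 1024 therefore gets a shorter accept queue if the sysctl is lower, and 128 was the usual default on kernels of this era. That would explain why raising somaxconn helped even though Passenger already requested 1024. The sketch below illustrates the rule; the function name is illustrative and the loopback socket is only there to show where the request is made.

```python
import socket


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Accept-queue limit the Linux kernel applies to a listen() call."""
    return min(requested, somaxconn)


print("before the sysctl change:", effective_backlog(1024, 128))
print("after the sysctl change: ", effective_backlog(1024, 1024))

# The application only states its request; the kernel applies the cap.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1024)            # capped at net.core.somaxconn by the kernel
```

Because the cap is applied when listen() is called, the web server has to be restarted after changing the sysctl so that its sockets are created again with the larger limit.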

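Under high concurrency, many clients send a SYN j request. The server answers with ACK j+1 together with its own SYN k, and the client must then answer with ACK k+1. If the client does not answer, or does not answer in time, the result is a half-open connection, which is common on busy systems. Linux keeps a queue of half-open connections, and when that queue overflows new connections are refused. This is the basic principle behind DoS tools (SYN flooding).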
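The two queues can be pictured with a toy model. This is a deliberately simplified sketch and not how the kernel is implemented: it ignores SYN cookies, retransmission and timeouts, and the class and method names are made up for illustration.

```python
from collections import deque


class ListenSocketModel:
    """Toy model of the two queues behind a listening TCP socket."""

    def __init__(self, syn_queue_size: int, accept_queue_size: int):
        self.syn_queue = deque()     # half-open: SYN seen, final ACK pending
        self.accept_queue = deque()  # established, waiting for accept()
        self.syn_queue_size = syn_queue_size
        self.accept_queue_size = accept_queue_size
        self.dropped = 0

    def on_syn(self, client: str) -> bool:
        """Client sent SYN j; the server would reply SYN k, ACK j+1."""
        if len(self.syn_queue) >= self.syn_queue_size:
            self.dropped += 1        # half-open queue overflow: SYN dropped
            return False
        self.syn_queue.append(client)
        return True

    def on_ack(self, client: str) -> bool:
        """Client sent ACK k+1; move it to the accept queue if there is room."""
        if client not in self.syn_queue:
            return False
        if len(self.accept_queue) >= self.accept_queue_size:
            return False
        self.syn_queue.remove(client)
        self.accept_queue.append(client)
        return True

    def accept(self):
        """Application takes one established connection, or None if empty."""
        return self.accept_queue.popleft() if self.accept_queue else None


model = ListenSocketModel(syn_queue_size=4, accept_queue_size=2)
admitted = [model.on_syn(f"client{i}") for i in range(6)]
print("SYNs admitted:", admitted.count(True), "dropped:", model.dropped)
```

In this model the backlog discussed above corresponds to accept_queue_size, while the half-open queue has its own, separate limit.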

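We therefore need to change the length of the half-open connection queue. There are two places to look, which can be inspected with these commands: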