Locating an Intermittent 502 Error: OOM

Original

alex_i

2020-06-13 11:29

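Problem background:

A request that batch-updates one field of a model failed. The update action is tied to update_signals.
The update covered 402 records, and each model record is fairly large.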

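Troubleshooting process

The client sent a request but no response came back from the backend, so a 502 was reported. This means some intermediate step broke the flow. The flow has roughly four stages: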

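The client's request is forwarded by nginx to uwsgi, which passes it on to django for processing
The business layer executes the batch update operation
The business layer runs the update_signals_callback callback associated with the batch update
The server returns the response to the client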

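First, the business layer did receive the REST message: the request headers and body were printed normally. So the connection was not broken, and the interruption happened inside the business layer. The business layer performs two actions: the batch update and the call_back callback.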

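The call_back function does field filtering

With call_back no longer doing any work and only printing one log line, the test result was: failed, intermittent 502.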

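Disable the signals_update signal and time the update operation

With the model's update no longer triggering the signal callback, the test result was: failed, intermittent 502. So the failure lies in the batch update itself.

The figure above shows that a single batch update took more than 4.5s, and that when several batch updates were run at the same time, only three of them were logged. This suggests that the environment itself may be abnormal.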
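The timing step can be sketched with a small context manager that logs both the start and the end of an action. This is only an illustration under assumptions: the logger name, `timed`, and `fake_batch_update` are hypothetical and stand in for the real update call. A start line without a matching cost line would hint that the worker was killed mid-update.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("update_timing")  # hypothetical logger name
logger.setLevel(logging.INFO)


@contextmanager
def timed(action):
    """Log when `action` starts and how long it took when it finishes."""
    start = time.perf_counter()
    logger.info("action [%s] start", action)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but not when the process
        # is killed with SIGKILL; then only the start line is logged.
        logger.info("action [%s] cost %.3fs", action, time.perf_counter() - start)


def fake_batch_update():
    """Placeholder for the real update call; it only sleeps briefly."""
    time.sleep(0.01)


with timed("update"):
    fake_batch_update()
```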

Run the update directly on the environment while monitoring the environment

The following two scripts test the batch update time

import datetime
from nfvpub.database.models import NfInstModel
def update_vnf_switchout_on():
    # Note: len() evaluates the whole queryset, loading every row into memory.
    print("the num of vnf is " + str(len(NfInstModel.objects.all())))
    start = datetime.datetime.now()
    print("action [update] start at : " + str(start))
    NfInstModel.objects.all().update(auto_scaleout_switch='off')
    end = datetime.datetime.now()
    print("action [update] end at : " + str(end))
    print("action [update] totally cost time  : " + str(end - start))

from nfvpub.database.models import NfInstModel
NfInstModel.objects.all().update(auto_scaleout_switch='off')
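Since the point of running the update by hand is to watch resource usage, one way to see memory growth from inside the process is the standard `resource` module. The sketch below is not the author's monitoring method; `peak_rss_mb`, `run_and_report`, and the list-building workload are hypothetical stand-ins for the real update call.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def run_and_report(action):
    """Run `action` and return (result, peak RSS before, peak RSS after)."""
    before = peak_rss_mb()
    result = action()
    after = peak_rss_mb()
    return result, before, after


# Placeholder workload standing in for the real update call.
_, before, after = run_and_report(lambda: [0] * 1000000)
print("peak RSS before: %.1f MB, after: %.1f MB" % (before, after))
```

If the process is killed during the action, the second reading never happens, so this only helps on runs that survive.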

Test results:

Both runs were killed in the background by the system. Check /var/log/ to see why the program was killed:

cd /var/log/
egrep -i  -r "killed process" /var/log/

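Compare the logs: the two screenshots show the kill records logged before and after running the batch update. The kill of the batch update should be the last entry. The log shows that the killed process had a total-vm of 686 MB.
My understanding of the fields in the log is as follows: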
As I understand, the size of the virtual memory that a process uses is listed as “total-vm”. Part of it is really mapped into the RAM itself (allocated and used). This is “RSS”.
Part of the RSS is allocated in real memory blocks (other than mapped into a file or device). This is anonymous memory (“anon-rss”) and there is also RSS memory blocks that are mapped into devices and files (“file-rss”).
So, if you open a huge file in vim, the file-rss would be high, on the other side, if you malloc() a lot of memory and really use it, your anon-rss would be high also.
On the other side, if you allocate a lot of space (with malloc()), but never use it, the total-vm would be higher, but no real memory would be used (due to the memory overcommit), so, the rss values would be low.
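In short, **total-vm is the virtual address space of the killed process, not the memory it actually used. It covers rss (the part actually resident in RAM) plus the rest, which is reserved but not resident. rss in turn has three kinds: anon-rss (resident anonymous memory, for example memory obtained with malloc() and really used), file-rss (resident memory mapped to devices and files)** and shmem-rss (resident shared memory).
The error log shows a figure of 686 MB for the program. Now look at the memory allocated to the container and how much of it is in use: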
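To make these fields concrete, here is a minimal sketch that pulls them out of a "Killed process" line and converts kB to MB. The sample line is made up for illustration and is not taken from the original logs; `parse_oom_line` is a hypothetical helper, and shmem-rss is treated as optional on the assumption that some kernels do not print it.

```python
import re

# Field layout of the kernel's "Killed process" message; values are in kB.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]*)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB"
    r".*?anon-rss:(?P<anon_rss>\d+)kB"
    r".*?file-rss:(?P<file_rss>\d+)kB"
    r"(?:.*?shmem-rss:(?P<shmem_rss>\d+)kB)?",
    re.IGNORECASE,
)


def parse_oom_line(line):
    """Return the fields of a 'Killed process' log line in MB, or None."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"pid": int(fields["pid"]), "name": fields["name"]}
    for key in ("total_vm", "anon_rss", "file_rss", "shmem_rss"):
        value = fields[key]
        result[key + "_mb"] = None if value is None else int(value) / 1024.0
    return result


# Made-up sample line for illustration, not taken from the original logs.
SAMPLE = ("Out of memory: Killed process 1234 (python) total-vm:702464kB, "
          "anon-rss:409600kB, file-rss:2048kB, shmem-rss:0kB")
info = parse_oom_line(SAMPLE)
print(info["name"], info["total_vm_mb"], info["anon_rss_mb"])
```

In the sample, total-vm is much larger than anon-rss, which is the usual picture: only the resident part counts against physical memory.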

docker stats docker_id

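The figure above shows that the memory the container can still allocate is under 500 MB, so if the python process above asks for 686 MB it is bound to be killed. Here is a quick primer on how a program gets killed: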
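The same comparison can be sketched from inside the container by reading the cgroup memory files instead of `docker stats`. This is a rough illustration, not the author's method: it assumes the usual cgroup v2 (`memory.max`, `memory.current`) or v1 (`memory/memory.limit_in_bytes`, `memory/memory.usage_in_bytes`) files under `/sys/fs/cgroup`, and `container_headroom_mb` and `would_fit` are hypothetical helpers. The 686 and 500 figures are the ones quoted in the text.

```python
import os


def read_value(path):
    """Return the integer in `path`, or None if unreadable or unlimited."""
    try:
        with open(path) as handle:
            text = handle.read().strip()
    except (IOError, OSError):
        return None
    return int(text) if text.isdigit() else None  # cgroup v2 uses "max"


def container_headroom_mb(root="/sys/fs/cgroup"):
    """Return (limit - usage) in MB from cgroup v2 or v1 files, else None."""
    candidates = [
        ("memory.max", "memory.current"),  # cgroup v2
        ("memory/memory.limit_in_bytes", "memory/memory.usage_in_bytes"),  # v1
    ]
    for limit_file, usage_file in candidates:
        limit = read_value(os.path.join(root, limit_file))
        usage = read_value(os.path.join(root, usage_file))
        if limit is not None and usage is not None:
            return (limit - usage) / (1024.0 * 1024.0)
    return None


def would_fit(request_mb, headroom_mb):
    """True when the headroom is known and the request fits inside it."""
    return headroom_mb is not None and request_mb <= headroom_mb


headroom = container_headroom_mb()
print("headroom MB:", headroom)
print("686 MB request fits in 500 MB headroom:", would_fit(686, 500))
```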
