Training became slow? Strange training slowdowns? Using tcmalloc with TensorFlow!

--------------------Preface------------------------

While training a video-classification model, I found that TensorFlow had inexplicably slowed down to 2~5 sec/batch, when it had previously been a steady 0.4 sec/batch. This reminded me of a similar situation I once hit with MXNet classification training (all on the same training server), so I set out to investigate:

(1) Killed some zombie processes and parallel jobs, e.g. im2rec; this did not help, and CPU utilization was not high anyway, ruling out CPU performance as the cause;

(2) Killed some system processes such as kworker, falcon-agent and gpu-monitor, ruling out system processes as a drag on training performance;

(3) iostat -x showed that IO was not high. I also benchmarked the image data loader; its performance did vary, but it was not the dominant bottleneck. A python cProfile hotspot analysis put the bottleneck inside session.run(), and htop/top showed modest memory usage but a very large cache;

(4) top showed CPU utilization well below what my configured thread count should yield; I guessed this was multithread contention, but the exact cause was unclear;

(5) Searched github for possible fixes; tcmalloc turned out to work (the simplest remedy used to be a reboot, which I did not resort to this time ^-^). Summarized below:

1. ----------------Overloading new and delete---------------------

First, let's be clear about a C++ concept: operator overloading. Intuitively, the overloadable operators that come to mind are ++, --, +, -, *, /, (), [] and so on, but two of them are special: operator new and operator delete. Both can be overloaded, and the main purpose of overloading them is better memory management (a memory pool);

Second, the concept and use of a memory pool. The operating system's native memory management often fails to meet requirements, especially in large client programs. Large programs call new (malloc) and delete (free) constantly, which incurs time overhead and causes memory fragmentation (google the concept if it is unfamiliar); moreover, in a complex program, memory leaks are hard to locate. Hence the memory pool, which typically requires overloading the new and delete operators. Anyone familiar with MFC development on Windows will recognize the following macro definition:

        #ifdef _DEBUG
        # define DEBUG_NEW new(THIS_FILE, __LINE__)
        #else
        # define DEBUG_NEW new
        #endif

MFC overloads new so that the location and file path of each allocation can be conveniently recorded, making memory leaks easy to spot. Another example follows:

#include <stdio.h>
#include <stdlib.h>

// overload the global operator new
void *operator new(size_t size)
{
	printf("operator new: allocate %zu bytes\n", size);
	void *ptr = malloc(size); // a custom memory pool could hand out memory here
	return ptr;
}

// overload the global operator delete
void operator delete(void *ptr) noexcept
{
	puts("operator delete");
	free(ptr);
}

int main()
{
	int *p = new int(0); // uses the overloaded operator new, not the system one
	                     // prints: operator new: allocate 4 bytes
	delete p;            // prints: operator delete
	return 0;
}

tcmalloc and jemalloc, the subject of this post, work on a similar principle: if tcmalloc is selected, our program adopts tcmalloc's memory management mechanism.
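To make the memory-pool idea concrete, here is a minimal sketch of a class-specific operator new backed by a trivial free list. This is my own illustration of the general principle, not how tcmalloc is actually implemented:

#include <cstddef>
#include <new>

// Toy fixed-size pool: freed nodes go onto a free list instead of back to
// the heap, so a later new can reuse them without another heap call.
class Node {
public:
    static void *operator new(std::size_t size) {
        if (size != sizeof(Node))          // safety net for derived classes
            return ::operator new(size);
        if (free_list_ != nullptr) {       // reuse a previously freed node
            Node *n = free_list_;
            free_list_ = n->next_free_;
            return n;
        }
        return ::operator new(sizeof(Node));
    }

    static void operator delete(void *p, std::size_t size) {
        if (p == nullptr) return;
        if (size != sizeof(Node)) { ::operator delete(p); return; }
        Node *n = static_cast<Node *>(p);  // push onto the free list
        n->next_free_ = free_list_;
        free_list_ = n;
    }

    int value = 0;

private:
    Node *next_free_ = nullptr;            // link used only while on the free list
    static Node *free_list_;
};

Node *Node::free_list_ = nullptr;

int main() {
    Node *a = new Node;  // comes from ::operator new
    delete a;            // parked on the free list
    Node *b = new Node;  // reuses a's storage: no heap call this time
    delete b;
    return 0;
}

Real allocators add thread-local caching, size classes and spans on top of this basic reuse idea, which is exactly what tcmalloc does.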

2. ---------------------What is tcmalloc?-----------------

tcmalloc is a memory allocator that manages heap memory; it mainly takes over malloc and free, reducing the performance cost of frequent allocation and deallocation and keeping memory fragmentation under control. The allocator in glibc is ptmalloc2, and tcmalloc is faster: a malloc/free pair takes ptmalloc2 about 300 ns, versus about 50 ns for tcmalloc. tcmalloc also stores small objects more compactly, needing less space. It is particularly optimized for multithreading: small-object allocation is essentially free of lock contention, while large objects use fine-grained, efficient spinlocks. Thread-local caches that have been idle for a long time are reclaimed for use by other threads, which raises memory utilization in multithreaded programs without wasting memory, something ptmalloc2 cannot do.

The following excerpt from the gperftools documentation captures the essence of tcmalloc:

TCMalloc : Thread-Caching Malloc

Motivation

TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list.

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.

Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.



A more detailed treatment of tcmalloc can be found in the google perf tools documentation:

http://goog-perftools.sourceforge.net/doc/tcmalloc.html
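To see the contention difference on your own machine, a small benchmark along these lines (my own sketch, not from the gperftools docs) can be run once with the default allocator and once under LD_PRELOAD with tcmalloc:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Every thread repeatedly allocates and frees a small block -- the access
// pattern where tcmalloc's per-thread caches avoid lock contention.
static void churn(int iters) {
    for (int i = 0; i < iters; ++i) {
        char *p = static_cast<char *>(std::malloc(64));
        *static_cast<volatile char *>(p) = 1;  // touch it so the pair isn't optimized away
        std::free(p);
    }
}

int main() {
    const int kThreads = 8;
    const int kIters = 1000000;

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back(churn, kIters);
    for (auto &w : workers)
        w.join();
    auto end = std::chrono::steady_clock::now();

    std::printf("%d threads x %d malloc/free pairs: %.1f ms\n", kThreads, kIters,
                std::chrono::duration<double, std::milli>(end - start).count());
    return 0;
}

Compile with g++ -O2 -pthread bench.cc -o bench, then compare plain ./bench against LD_PRELOAD=/usr/lib/libtcmalloc.so ./bench (adjust the library path to your system).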

3. ---------------------Training becomes slow----------------

At one time or another, whether you train with MXNet or with TensorFlow, you may be puzzled to observe the following:

training speed randomly slows down,

or training periodically stalls at certain batches,

or training that has always been very fast is suddenly slow today,

while no training code has been changed~~

or the GPU shows low utilization today (first rule out an IO bottleneck).

If you hit the situations above, the problem may be the cache, caused precisely by using the system's built-in memory allocation mechanism:

There may be some memory pressure caused by virtual address space fragmentation and high system buffer cache churn (reading large training datasets from the file system).

So, attempts at a fix:

(1) htop: an effective tool for inspecting the cache:

htop shows that the cache occupies far too much memory; the yellow bars of the Mem meter represent cache usage.

(figure: htop legend)
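If htop is not at hand, the same figures can be read straight from /proc/meminfo; a minimal sketch:

#include <fstream>
#include <iostream>
#include <string>

// Print the totals behind htop's Mem bar: the "Cached:" line is the page
// cache drawn as the yellow segment.
int main() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("MemTotal:", 0) == 0 ||
            line.rfind("MemFree:", 0) == 0 ||
            line.rfind("Cached:", 0) == 0)
            std::cout << line << '\n';
    }
    return 0;
}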

(2) Clear RAM Memory Cache

Linux provides a way to flush or clear the RAM cache.

How to Clear Cache in Linux?

Every Linux system has three options to clear cache without interrupting any processes or services:

    a. Clear PageCache only:

    # sync; echo 1 > /proc/sys/vm/drop_caches

    b. Clear dentries and inodes:

    # sync; echo 2 > /proc/sys/vm/drop_caches

    c. Clear PageCache, dentries and inodes:

    # sync; echo 3 > /proc/sys/vm/drop_caches

sync flushes the file system buffer; writing to drop_caches cleans the cache without killing any application or service.

If you have to clear the disk cache, the first command is the safest in enterprise and production use, as "...echo 1 > ..." clears only the PageCache. It is not recommended to use the third option ("...echo 3 >") in production unless you know what you are doing, as it clears PageCache, dentries and inodes.

A more detailed explanation can be found here:

https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/

(3) Use tcmalloc when training:

TCMalloc seems to improve training speed and avoids occasional slowdowns seen with the default allocator. You can enable it by installing it and setting LD_PRELOAD=/usr/lib/libtcmalloc.so (the exact path varies by distribution; on Ubuntu it may be /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 from the gperftools packages).

LD_PRELOAD=/usr/lib/libtcmalloc.so python train.py

With this in place, buffer allocations during training use tcmalloc's memory pool, improving efficiency, speeding up training, and restoring the normal training speed.

Note: when deploying C++ programs, if you encounter frequent multithreaded allocation and deallocation, or frequent operations on large chunks of memory, you can likewise watch the cache and try tcmalloc as an improvement.

In addition, tcmalloc cannot be used under JNI and similar setups: JNI may load the system runtime first to allocate memory and load tcmalloc afterwards to free it, causing a memory conflict (always remember: memory allocated in one library must be freed in that same library).
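A common defensive pattern for this allocate/free pairing rule is for a library to export matching create/destroy functions, so that whatever allocator the library was linked against also frees the memory. A minimal sketch (the names here are illustrative, not from any real API):

#include <cstddef>

// Hypothetical library-side code: both functions live in the same shared
// object, so allocation and deallocation go through the same allocator,
// regardless of what the caller links against.
struct Buffer {
    size_t size;
    char  *data;
};

extern "C" Buffer *buffer_create(size_t n) {
    Buffer *b = new Buffer;      // allocated inside this library
    b->size = n;
    b->data = new char[n];
    return b;
}

extern "C" void buffer_destroy(Buffer *b) {
    if (b == nullptr) return;
    delete[] b->data;            // freed by the same library that allocated it
    delete b;
}

Callers never free the Buffer themselves; they hand it back through buffer_destroy, which keeps the pairing rule intact even when different components load different allocators.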
