台部落roxxo

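Original: Fixing the 'nvidia-drm' error when installing the NVIDIA driver

Following this blog post, https://blog.csdn.net/fdqw_sph/article/details/78745375, I reinstalled the graphics driver on an Ubuntu 16.04 machine. Check the Linux kernel version with uname -a, then go to the corresponding…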

2020-07-06 19:53:47

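Original: Fixing errors after upgrading TensorFlow to 2.0

After upgrading TensorFlow today, I ran the related scripts and hit two errors. One was AttributeError: module 'tensorflow' has no attribute 'decode_raw'. A Baidu search turned up nothing,…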

2020-07-06 19:53:47
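
For context on the error quoted above: in TensorFlow 2.x this op is exposed as tf.io.decode_raw rather than tf.decode_raw. The sketch below is only an illustration of that one rename applied to script text. The helper name rename_decode_raw and the sample line are hypothetical, and the official upgrade tool works differently (it also handles argument renames), so treat this as a rough aid rather than a replacement for it.

```python
import re

# Matches the TF 1.x spelling `tf.decode_raw`, but not `tf.io.decode_raw`
# and not names that merely end in "tf" (e.g. `mytf.decode_raw`).
_OLD_CALL = re.compile(r"\btf\.decode_raw\b")


def rename_decode_raw(source: str) -> str:
    """Return `source` with tf.decode_raw rewritten to tf.io.decode_raw.

    Hypothetical helper for illustration only; it does plain text
    substitution and knows nothing about Python syntax.
    """
    return _OLD_CALL.sub("tf.io.decode_raw", source)


# Hypothetical line from a TF 1.x input pipeline.
old_line = "image = tf.decode_raw(features['image_raw'], tf.uint8)"
new_line = rename_decode_raw(old_line)
print(new_line)
```

Running the helper twice gives the same result as running it once, since the already-renamed call no longer matches the pattern.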

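Original: Fixing scripts after upgrading TensorFlow

After upgrading TensorFlow today, I ran the related scripts and one line of code raised AttributeError: module 'tensorflow' has no attribute 'decode_raw'. A Baidu search turned up nothing, so I tried TensorFlow's upgrade tool…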

2020-06-10 05:27:25

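Original: name 'file' is not defined and TypeError: a bytes-like object is required, not 'str'

While upgrading the environment from Python 2 to Python 3, I ran into two errors when building TFRecord files. name 'file' is not defined: replace the file function with the open function. TypeError: ' xxx.jpg' has t…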

2020-06-10 05:27:25
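
The two fixes named in this entry can be sketched in plain Python 3. The excerpt is cut off, so the second half is an assumption: it supposes the TypeError comes from passing a str (such as a file name) where bytes are expected. The helper names read_binary and to_bytes and the throwaway demo file are hypothetical.

```python
import os
import tempfile


def read_binary(path):
    """Read a file as bytes; open() replaces the Python 2 builtin file()."""
    with open(path, "rb") as handle:
        return handle.read()


def to_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding str inputs such as a file name."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Demo with a throwaway file standing in for a real image.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "xxx.jpg")
    with open(demo_path, "wb") as handle:
        handle.write(b"\xff\xd8not-a-real-jpeg")
    payload = read_binary(demo_path)              # bytes, safe for a bytes field
    name = to_bytes(os.path.basename(demo_path))  # str converted to bytes
```

Opening in "rb" mode matters here: text mode would return str and reintroduce the same TypeError further down the pipeline.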

Original: Loading TensorFlow models in ckpt and pb formats

Loading the ckpt format: checkpoint_file = tf.train.latest_checkpoint(ckpt_modelpath) # load the ckpt model…

2020-06-10 05:27:15

Original: Using early stopping in TensorFlow

Following this article, https://blog.csdn.net/zongza/article/details/85017351, I got the error Key signal_early_stopping/STOP not found in checkpoint

2020-06-10 05:27:15

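Original: Fixing the TF training warning Not using XLA:CPU for cluster

I had never paid much attention to this warning during training. It appears as shown below once training starts. After reading up on the XLA settings, which give some performance improvement, I tried to resolve it. W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (O…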

2020-05-20 18:39:48

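Original: Fixing Docker failing to start

The container went down as soon as it started, a problem I had not seen before. The log mainly showed ExecStart=/usr/bin/dockerd (code=exited, status=0/SUCCESS). I tried many fixes and the error persisted; I even reinstalled Docker and it still would not start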

2020-05-09 06:11:28

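Original: Fixing Docker failing to start

A record of a full day of troubleshooting: after a server crashed while running code and was rebooted, Docker was unusable. Starting Docker reported Job for docker.service failed because the control process exited wi…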

2020-04-28 23:36:45

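Original: Fixing no such file or directory when running copy from outside the container

Original: Fixing ImportError: Extension horovod.tensorflow has not been built

Original: The distributed framework Horovod failed to speed up training

Original: k8s system log reports Unable to allocate memory on node -1

Original: Recovering an Ubuntu server after accidentally deleting the kernel

Original: Fixing terminate called after throwing an instance of 'std::length_error' in distributed training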