Problems encountered when training DeepLabV3 on Cityscapes data (2019-11-25)
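Main websites used throughout the process:
Main problems encountered:
1. TensorFlow environment problems
2. CUDA environment and compatibility problems
3. Version mismatches in the code
4. Problems caused by differences between running on Windows and on Linux
Problem descriptions and solutions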
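1. TensorFlow environment problems (these usually show up when running the code that converts the data to TFRecord)
tensorflow has no attribute 'app', tensorflow has no attribute 'logging', tensorflow has no attribute 'contrib', and a series of similar missing-module errors. These modules exist at the top level in TensorFlow 1.x but no longer do in 2.x.
Training is too slow.
Could not load dynamic library 'cudart64_100.dll' (this appears even though CUDA is installed).
The fix for all of these is to install a suitable tensorflow-gpu version. On Windows, be sure to use Anaconda to install and switch versions; the pip command leads to all sorts of problems, and I spent quite a while sorting out the TensorFlow version. I use tensorflow-gpu 1.15: after installing the default latest version in Anaconda, click the check mark to the left of the package to choose the version you want.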
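Before running the conversion scripts, it can help to confirm that the active environment really has a 1.x build. The following is only a minimal sketch: the helper name missing_tf1_attrs is made up for illustration, and the attribute list is taken from the error messages above.

```python
import importlib

# Top-level attributes named in the error messages above.
REQUIRED_TF1_ATTRS = ("app", "logging", "contrib")


def missing_tf1_attrs(module, required=REQUIRED_TF1_ATTRS):
    """Return the names in `required` that `module` does not expose."""
    return [name for name in required if not hasattr(module, name)]


if __name__ == "__main__":
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        print("tensorflow could not be imported in this environment")
    else:
        print("tensorflow version:", tf.__version__)
        missing = missing_tf1_attrs(tf)
        if missing:
            print("missing TF 1.x attributes:", ", ".join(missing))
        else:
            print("all attributes used by the scripts are present")
```

If any attribute is reported missing, the environment is most likely running a 2.x build and the version should be switched before going further.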
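2. CUDA environment and compatibility problems
Could not load dynamic library 'cudart64_100.dll' and similar missing-DLL errors.
Be sure to pick matching versions (also taking into account which versions your graphics card supports): TensorFlow 1.15 corresponds to CUDA 10.0. This problem also cost me a lot of time. Each CUDA download and installation takes a long while, so look up the relevant information beforehand and get the CUDA environment right in one go.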
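To see whether the DLL named in the error is visible at all, one option is to search the directories on PATH. This is a rough sketch that assumes the library is located through PATH (as is typical for the Python 3.7 setup seen in the traceback later on); find_on_path is a hypothetical helper, not part of TensorFlow or CUDA.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every directory listed in `path_value` that contains `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    dll = "cudart64_100.dll"  # the CUDA 10.0 runtime named in the error message
    locations = find_on_path(dll)
    if locations:
        print(dll, "found in:", locations)
    else:
        print(dll, "is not on PATH; check the CUDA 10.0 installation")
```

An empty result usually means either a different CUDA version is installed or its bin directory was never added to PATH.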
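3. Version mismatches in the code (the main problem I faced)
Because the code on GitHub is updated continuously, the official documentation and the various tutorials sometimes disagree with the actual code, and for some of the problems I could not find an answer anywhere online (I asked on Stack Overflow and GitHub without getting a solution).
TFRecord files are all 0 KB: this always means the generation code did not run successfully. A likely cause is that the Python files called by the convert_cityscapes.sh script fail with errors about imported submodules not being found (the usual advice is to add environment variables and so on). I fixed it by adding the following to the import section of the failing .py files: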
import sys
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research")
sys.path.append("H:/dataSet/CityscapesV2/cityscapesScripts/models/research/slim")
Adjust the paths and names to your own setup.
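Since a failed conversion leaves 0 KB files behind, a quick size check after running the script can catch the problem early. The sketch below assumes the output files end in .tfrecord and uses a made-up output directory; empty_tfrecords is a hypothetical helper.

```python
import glob
import os


def empty_tfrecords(tfrecord_dir, pattern="*.tfrecord"):
    """Return the sorted paths of files matching `pattern` that are 0 bytes."""
    paths = glob.glob(os.path.join(tfrecord_dir, pattern))
    return sorted(p for p in paths if os.path.getsize(p) == 0)


if __name__ == "__main__":
    # Hypothetical output directory; point this at your own tfrecord folder.
    tfrecord_dir = "H:/dataSet/CityscapesV2/tfrecord"
    for path in empty_tfrecords(tfrecord_dir):
        print("empty TFRecord, conversion probably failed:", path)
```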
data split name train not recognized: the official training command is:
# From tensorflow/models/research/
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size="769,769" \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
--train_logdir=${PATH_TO_TRAIN_DIR} \
--dataset_dir=${PATH_TO_DATASET}
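But the code no longer has a "train" option; it uses train_fine and so on instead. The same problem comes up when running val and vis, and it goes away once each uses the suffixed name, train_fine or val_fine. The splits currently defined in the code are listed below. After changing the split to train_fine, however, the program stops unexpectedly with the access violation shown after the listing.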
_CITYSCAPES_INFORMATION = DatasetDescriptor(
splits_to_sizes={'train_fine': 2975,
'train_coarse': 22973,
'trainval_fine': 3475,
'trainval_coarse': 23473,
'val_fine': 500,
'test_fine': 1525},
num_classes=19,
ignore_label=255,
)
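The "not recognized" message comes down to a lookup against these split names. As a small illustration (not the DeepLab source itself), the check below copies the sizes from the descriptor above; split_size is a hypothetical helper.

```python
# Split sizes copied from the descriptor above.
CITYSCAPES_SPLITS = {
    "train_fine": 2975,
    "train_coarse": 22973,
    "trainval_fine": 3475,
    "trainval_coarse": 23473,
    "val_fine": 500,
    "test_fine": 1525,
}


def split_size(split_name, splits=CITYSCAPES_SPLITS):
    """Return the number of images in a split, or raise for an unknown name."""
    if split_name not in splits:
        known = ", ".join(sorted(splits))
        raise ValueError(
            "data split name %s not recognized (known: %s)" % (split_name, known)
        )
    return splits[split_name]


if __name__ == "__main__":
    for name in ("train", "train_fine", "val_fine"):
        try:
            print(name, "->", split_size(name))
        except ValueError as err:
            print(err)
```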
Windows fatal exception: access violation
Thread 0x00005cd8 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\saver.py", line 1176 in save
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1119 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00004ef4 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\training_util.py", line 68 in global_step
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1081 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x000062ec (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045 in run_loop
File "G:\anaconda\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00006018 (most recent call first):
File "G:\anaconda\lib\threading.py", line 296 in wait
File "G:\anaconda\lib\queue.py", line 170 in get
File "G:\anaconda\lib\site-packages\tensorflow_core\python\summary\writer\event_file_writer.py", line 159 in run
File "G:\anaconda\lib\threading.py", line 917 in _bootstrap_inner
File "G:\anaconda\lib\threading.py", line 885 in _bootstrap
Thread 0x00005df0 (most recent call first):
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1443 in _call_tf_sessionrun
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1350 in _run_fn
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1365 in _do_call
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1359 in _do_run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 1180 in _run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\client\session.py", line 956 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 490 in train_step
File "G:\anaconda\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 775 in train
File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 462 in main
File "G:\anaconda\lib\site-packages\absl\app.py", line 250 in _run_main
File "G:\anaconda\lib\site-packages\absl\app.py", line 299 in run
File "G:\anaconda\lib\site-packages\tensorflow_core\python\platform\app.py", line 40 in run
File "H:/dataSet/CityscapesV2/cityscapesScripts/models/research/deeplab/train.py", line 468 in <module>
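The "Windows fatal exception: access violation" errors above occur because the generated TFRecord files have names starting with train, while the code reads files starting with train_fine. The generated TFRecord files therefore need to be renamed, changing the train prefix to train_fine.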
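Renaming many shards by hand is tedious, so a short script may help. This is only a sketch: it assumes shard names of the form train-00000-of-00010.tfrecord (check your own output folder first), and both the helper name and the directory are made up for illustration.

```python
import os


def rename_split_prefix(tfrecord_dir, old_prefix="train-",
                        new_prefix="train_fine-", dry_run=True):
    """Rename files starting with `old_prefix`; return (old, new) name pairs.

    With dry_run=True nothing is changed on disk, only the plan is returned.
    """
    renamed = []
    for name in sorted(os.listdir(tfrecord_dir)):
        if not name.startswith(old_prefix):
            continue
        new_name = new_prefix + name[len(old_prefix):]
        if not dry_run:
            os.rename(os.path.join(tfrecord_dir, name),
                      os.path.join(tfrecord_dir, new_name))
        renamed.append((name, new_name))
    return renamed


if __name__ == "__main__":
    # Hypothetical output directory; point this at your own tfrecord folder.
    tfrecord_dir = "H:/dataSet/CityscapesV2/tfrecord"
    if os.path.isdir(tfrecord_dir):
        for old, new in rename_split_prefix(tfrecord_dir, dry_run=True):
            print(old, "->", new)
```

Run it with dry_run=True first to review the planned names, then again with dry_run=False. The same call with old_prefix="val-" and new_prefix="val_fine-" would cover the validation shards under the same naming assumption.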
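4. Problems caused by differences between running on Windows and on Linux
The official documentation and the tutorials all describe a Linux setup, and some problems appear on Windows. First, running .sh files: on Windows this can be done from the Git Bash window. For some of the .py errors, however, the tutorials all say to extend the Python environment the Linux way; on Windows only sys.path.append() solved them for me.
The train and val commands do not have to be run through .sh; running train.py and the other files directly in PyCharm works, but note: 1. Update the various configuration options. 2. In common.py, change the network architecture option to xception (depending, of course, on which architecture your downloaded pretrained model was built on); the default is mobilenet_v2. Otherwise the following errors are reported: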
Not found: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
Total size of new array must be unchanged for image_pooling/weights lh_shape
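As an alternative to editing the default in common.py, the architecture can probably be passed on the command line, assuming model_variant is still exposed as a flag as in the official command shown earlier. The sketch below builds that command as an argument list so it can be launched from Python on Windows without a .sh file; build_train_command and the example paths are hypothetical.

```python
import sys


def build_train_command(checkpoint, train_logdir, dataset_dir,
                        python_exe=sys.executable):
    """Build the argument list for deeplab/train.py with the xception_65 variant."""
    return [
        python_exe, "deeplab/train.py",
        "--logtostderr",
        "--training_number_of_steps=90000",
        "--train_split=train_fine",       # the split name the current code expects
        "--model_variant=xception_65",    # must match the pretrained checkpoint
        "--atrous_rates=6",
        "--atrous_rates=12",
        "--atrous_rates=18",
        "--output_stride=16",
        "--decoder_output_stride=4",
        "--train_crop_size=769,769",
        "--train_batch_size=1",
        "--dataset=cityscapes",
        "--tf_initial_checkpoint=" + checkpoint,
        "--train_logdir=" + train_logdir,
        "--dataset_dir=" + dataset_dir,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own locations.
    cmd = build_train_command("path/to/model.ckpt", "path/to/train_logs",
                              "path/to/tfrecord")
    print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run with cwd set to the models/research directory, which keeps the relative deeplab/train.py path valid.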
