Dopamine
Google's GitHub account recently released a new framework called Dopamine.
Dopamine helps you prototype reinforcement learning algorithms much faster.
It is worth trying out. The figure below compares algorithms on the Atari game Seaquest, where Rainbow clearly comes out on top.
Its design principles are as follows:
- Easy experimentation: Make it easy for new users to run benchmark experiments.
- Flexible development: Make it easy for new users to try out research ideas.
- Compact and reliable: Provide implementations for a few, battle-tested algorithms.
- Reproducible: Facilitate reproducibility in results.
Why build this framework? Mainly to provide faithful implementations of the sophisticated RL algorithms proposed by DeepMind, including Rainbow, which combines several of them. Three key components:
- n-step Bellman updates (see e.g. Mnih et al., 2016)
- Prioritized experience replay (Schaul et al., 2015)
- Distributional reinforcement learning (C51; Bellemare et al., 2017)
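To make the first of these concrete, here is a minimal Python sketch of how an n-step Bellman target is computed. The function name and signature are illustrative only, not Dopamine's actual API:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Compute the n-step return G = r_0 + gamma*r_1 + ... + gamma^n * V(s_n).

    rewards: the n intermediate rewards r_0 .. r_{n-1}
    bootstrap_value: the value estimate V(s_n) at the n-th state
    gamma: discount factor (Dopamine's DQN config uses 0.99, as in the log below)
    """
    # Fold backwards: each step discounts everything accumulated so far.
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a 3-step target with rewards [1, 0, 2] and bootstrap V(s_3) = 10
print(n_step_target([1.0, 0.0, 2.0], 10.0, gamma=0.9))
```

With n = 1 this reduces to the standard one-step DQN target; larger n (as in Mnih et al., 2016) trades bias for variance by trusting actual rewards over more steps before bootstrapping.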
The authors state that this is not an official Google product, but it is well worth studying.
Some of us have already tried it out, and it is very convenient:
(dopamine-env) neil@neil-workstation:~/Projects/dopamine$ python -um dopamine.atari.train \
> --agent_name=dqn \
> --base_dir=/tmp/dopamine \
> --gin_files='dopamine/agents/dqn/configs/dqn.gin'
2018-08-28 02:19:22.543030: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I0828 02:19:22.543931 139761019946752 tf_logging.py:115] Creating DQNAgent agent with the following parameters:
I0828 02:19:22.544101 139761019946752 tf_logging.py:115] gamma: 0.990000
I0828 02:19:22.544147 139761019946752 tf_logging.py:115] update_horizon: 1.000000
I0828 02:19:22.544184 139761019946752 tf_logging.py:115] min_replay_history: 20000
I0828 02:19:22.544219 139761019946752 tf_logging.py:115] update_period: 4
I0828 02:19:22.544251 139761019946752 tf_logging.py:115] target_update_period: 8000
I0828 02:19:22.544284 139761019946752 tf_logging.py:115] epsilon_train: 0.010000
I0828 02:19:22.544317 139761019946752 tf_logging.py:115] epsilon_eval: 0.001000
I0828 02:19:22.544348 139761019946752 tf_logging.py:115] epsilon_decay_period: 250000
I0828 02:19:22.544380 139761019946752 tf_logging.py:115] tf_device: /gpu:0
I0828 02:19:22.544410 139761019946752 tf_logging.py:115] use_staging: True
I0828 02:19:22.544441 139761019946752 tf_logging.py:115] optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x7f1c7c2adf90>
I0828 02:19:22.545419 139761019946752 tf_logging.py:115] Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
I0828 02:19:22.545480 139761019946752 tf_logging.py:115] observation_shape: 84
I0828 02:19:22.545521 139761019946752 tf_logging.py:115] stack_size: 4
I0828 02:19:22.545557 139761019946752 tf_logging.py:115] replay_capacity: 1000000
I0828 02:19:22.545592 139761019946752 tf_logging.py:115] batch_size: 32
I0828 02:19:22.545624 139761019946752 tf_logging.py:115] update_horizon: 1
I0828 02:19:22.545656 139761019946752 tf_logging.py:115] gamma: 0.990000
I0828 02:19:23.212261 139761019946752 tf_logging.py:115] Beginning training...
I0828 02:19:23.212377 139761019946752 tf_logging.py:115] Starting iteration 0
Steps executed: 53072 Episode length: 812 Return: -21.00
...
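The hyperparameters printed at startup (gamma, epsilon schedule, replay capacity, and so on) all come from the gin config file passed via --gin_files. Dopamine is built on gin-config, which also supports overriding individual bindings from the command line; the exact binding names below are an assumption based on the logged parameter names, so check the shipped dqn.gin before relying on them:

```
# Excerpt in the style of dopamine/agents/dqn/configs/dqn.gin
# (values mirror the startup log above; names are assumed, not verified)
DQNAgent.gamma = 0.99
DQNAgent.update_horizon = 1
DQNAgent.min_replay_history = 20000
DQNAgent.epsilon_train = 0.01
```

Editing a copy of the gin file and pointing --gin_files at it is the most straightforward way to run your own hyperparameter sweeps.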
Now we let it run for a while and see.
Follow us — more to come.