Thanks to J. Peters et al. for their great work, "A Survey on Policy Search for Robotics".
The exploration strategy is used to generate new trajectory samples. Many model-free policy search approaches update the exploration distribution and, hence, the covariance of the Gaussian policy. Typically, a large exploration rate is used at the beginning of learning and is then gradually decreased to fine-tune the policy parameters.
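As a minimal sketch of such a decay schedule (the values, names, and the multiplicative schedule are illustrative assumptions, not from the survey):

```python
import numpy as np

# Hypothetical decay schedule: start with a large exploration variance
# and shrink it multiplicatively after each policy update.
sigma_init, sigma_min, decay = 1.0, 0.01, 0.95

theta_mean = np.zeros(5)   # current mean of the policy parameters
sigma = sigma_init
for iteration in range(100):
    # Sample exploratory parameters around the current mean.
    theta_sample = theta_mean + sigma * np.random.randn(5)
    # ... evaluate theta_sample on the task and update theta_mean ...
    sigma = max(sigma * decay, sigma_min)  # gradually reduce exploration
```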
Action Space vs Parameter Space
In action space, we can simply add exploration noise to the actions produced by the control policy.
Exploration in parameter space perturbs the parameter vector of the policy.
Many approaches can be formalized with the concept of an upper-level policy that selects the parameter vector of the lower-level control policy; exploration then amounts to sampling the parameter vector from this upper-level distribution.
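To make the distinction concrete, here is a rough sketch (the linear control law and all names and values are illustrative assumptions): action-space exploration adds noise to each executed action, while parameter-space exploration samples the parameter vector from an upper-level Gaussian and then acts noise-free.

```python
import numpy as np

def lower_level_policy(theta, x):
    """Illustrative deterministic linear control law u = theta^T x."""
    return theta @ x

theta = np.zeros(4)        # parameter vector of the control policy
x = np.random.randn(4)     # some state

# Exploration in action space: perturb the action itself.
u = lower_level_policy(theta, x) + 0.1 * np.random.randn()

# Exploration in parameter space: perturb the parameter vector by
# sampling it from an upper-level Gaussian policy over theta.
mu, Sigma = theta, 0.1 * np.eye(4)
theta_sample = np.random.multivariate_normal(mu, Sigma)
u = lower_level_policy(theta_sample, x)  # then act noise-free
```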
Episode-based vs Step-based
Step-based exploration uses different exploration noise at each time step and can be applied either in action space or in parameter space. It can be problematic, as it might produce action sequences that are not reproducible by any noise-free control law.
Episode-based exploration uses exploration noise only at the beginning of an episode, which corresponds to exploration in parameter space. Episode-based exploration might produce more reliable policy updates.
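A minimal sketch of the two schemes (the episode structure, control law, and values are assumptions for illustration): step-based exploration draws fresh noise at every time step, while episode-based exploration draws the perturbation once and keeps it fixed for the whole episode.

```python
import numpy as np

def control_law(theta, x):
    """Illustrative noise-free linear control law."""
    return theta @ x

T, dim = 50, 4
theta = np.zeros(dim)
Sigma = 0.1 * np.eye(dim)
states = np.random.randn(T, dim)  # placeholder state trajectory

# Step-based: fresh exploration noise at every time step; the resulting
# action sequence may not be reproducible by any noise-free control law.
actions_step = [control_law(np.random.multivariate_normal(theta, Sigma), x)
                for x in states]

# Episode-based: sample the perturbation once at the start of the episode,
# then execute the whole episode with a fixed, noise-free control law.
theta_ep = np.random.multivariate_normal(theta, Sigma)
actions_episode = [control_law(theta_ep, x) for x in states]
```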
Uncorrelated vs Correlated
As most policies are represented as Gaussian distributions, uncorrelated exploration noise is obtained by using a diagonal covariance matrix, while correlated exploration is achieved by maintaining a full representation of the covariance matrix.
Exploration in action space typically uses a diagonal covariance matrix. In parameter space, many approaches update the full covariance matrix of the Gaussian policy. Using the full covariance matrix often results in considerably faster learning.
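As a rough sketch (dimensions and covariance values are illustrative assumptions): sampling with a diagonal covariance yields independent noise per dimension, while a full covariance matrix couples the parameter dimensions.

```python
import numpy as np

dim = 3
mu = np.zeros(dim)

# Uncorrelated exploration: diagonal covariance, each parameter
# dimension is perturbed independently.
diag_cov = np.diag([0.10, 0.20, 0.05])
theta_uncorr = np.random.multivariate_normal(mu, diag_cov)

# Correlated exploration: the off-diagonal entries of the full
# covariance matrix capture dependencies between parameter dimensions.
full_cov = np.array([[0.10, 0.04, 0.00],
                     [0.04, 0.20, 0.03],
                     [0.00, 0.03, 0.05]])
theta_corr = np.random.multivariate_normal(mu, full_cov)
```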