When deriving a policy-gradient algorithm (e.g., REINFORCE), we are actually maximizing the expected total reward:

$$\overline{R_\theta}=E_{\tau\sim\pi_\theta}[R(\tau)]$$

Can't this be seen as an actor-critic method, since we are using $V(s)$ as a critic to guide the update of the actor $\pi$? (Here we have already introduced a sample-based approximation.)

$$\nabla \overline{R_\theta} \approx \frac{1}{N} \sum_{n=1}^N R(\tau^{(n)}) \nabla \log p_\theta(\tau^{(n)})$$

If not, what is the precise definition of the actor and the critic in actor-critic algorithms?
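To make the question concrete, here is a minimal sketch of the plain REINFORCE estimator on a hypothetical two-armed bandit (the setup and all names are my own, not from any particular library): the weight on $\nabla \log \pi$ is the raw sampled return $R(\tau)$ itself, with no learned value function anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_gradient(theta, n_episodes=5000):
    """Monte-Carlo estimate of grad E[R] = (1/N) * sum_n R(tau_n) * grad log pi(a_n).

    Note: the 'critic' here is just the sampled return R, not a learned V(s).
    """
    grad = np.zeros_like(theta)
    rewards = np.array([0.0, 1.0])   # arm 1 pays 1, arm 0 pays 0 (assumed for illustration)
    for _ in range(n_episodes):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        # grad log pi(a) for a softmax policy is one_hot(a) - pi
        glogpi = -pi
        glogpi[a] += 1.0
        grad += rewards[a] * glogpi
    return grad / n_episodes

theta = np.zeros(2)          # uniform policy to start
g = reinforce_gradient(theta)
```

With a uniform starting policy, the estimate comes out near $[-0.25, +0.25]$, pushing probability toward the rewarding arm; the point is that nothing in this loop estimates $V(s)$, which is why it is usually not called actor-critic.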