>>12759849In Deep Q learning you have a network that predicts the reward for different actions, and then you choose the action with the highest expected reward, and then do update the network to more accurately predict the reward for that action once you get feedback.
In policy gradient networks like Actor Critic, you have a network that recommends actions, attached to another network that predicts the reward of those actions. You train the action recommender network to maximize the value of the reward predictor network, and you train the reward predictor network to predict more accurate rewards.
it's not difficult ur just dumb