>>12768644
Months. There are just too many things to do: coding the environment, choosing a good framework, setting up the training cluster... and on top of that, as you said, tuning and tweaking will drive you crazy.
The worst offender is the reward function IMO: the agent WILL find a way to exploit it if you're not extremely careful.
You also often have to wait a day before you can tell whether a training run is a dud, so expect a ton of wasted time.
Overall I find it finicky and very frustrating.
But keep in mind it's just my experience, maybe some other team somewhere had better results.