An additional problem with this is that they use A3C here for trading. A3C is known to not be suitable for adversarial environments (e.g. board games, like Chess). I wrote a paper that demonstrated that A3C is as exploitable as a uniform random strategy in board games (specifically, some poker variants): arxiv.org/abs/2004.09677
It’s mostly an issue that A2C isn’t designed for adversarial environments. It also doesn’t have any notion of hidden information, while other algorithms (eg CFR) explicitly handle this. There’s a well-known phenomena of cycling, where agent A will beat agent B which beats agent C which beats agent A; A2C can exhibit this. Think of rock/paper/scissors- AlwaysRock beats AlwaysScissors which beats AlwaysPaper. To avoid this, you typically need to do some sort of averaging.
link