Aren't actor-critic algorithms already very close to GANs? In both cases you have a generator (the actor, or policy) that produces data and a discriminator (the critic, or Q-function) that says whether that data is good or bad. The critic trains on the data generated by the actor plus some extra information supplied by the user (rewards in the RL case, example data in the GAN case), and the actor learns from the signal given by the critic.
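To make the parallel concrete, here is a rough toy sketch, not a real actor-critic or GAN implementation. Everything in it is an assumption made up for illustration: the names `train`, `rl_critic`, and `gan_critic`, the scalar "actor" `theta`, the three fixed probe samples standing in for sampling noise, and the two deliberately trivial critics (a local quadratic fit to rewards, and a linear mean-matching score instead of a learned discriminator). The point is only that one loop serves both cases, and the only thing that changes is what extra info the critic gets.

```python
from statistics import mean
from typing import Callable, Sequence

# A critic is refit on the actor's samples and returns d(score)/d(sample).
ScoreGrad = Callable[[float], float]
FitCritic = Callable[[Sequence[float]], ScoreGrad]


def train(theta: float, fit_critic: FitCritic, steps: int = 50, lr: float = 0.25) -> float:
    """Shared loop: actor produces data, critic is refit on it, actor follows the critic."""
    for _ in range(steps):
        # Actor output plus two fixed perturbations standing in for sampling noise.
        samples = [theta - 1.0, theta, theta + 1.0]
        score_grad = fit_critic(samples)   # critic sees actor data + user-supplied info
        theta += lr * score_grad(theta)    # actor moves in the direction the critic prefers
    return theta


def rl_critic(reward: Callable[[float], float]) -> FitCritic:
    """Actor-critic flavour: the user-supplied extra info is a reward function."""
    def fit(samples: Sequence[float]) -> ScoreGrad:
        lo, mid, hi = samples
        h = mid - lo
        # Toy critic: local quadratic q(a) fitted to the rewards of the tried actions.
        slope = (reward(hi) - reward(lo)) / (2.0 * h)
        curvature = (reward(hi) - 2.0 * reward(mid) + reward(lo)) / (h * h)
        return lambda a: slope + curvature * (a - mid)
    return fit


def gan_critic(real_examples: Sequence[float]) -> FitCritic:
    """GAN flavour: the user-supplied extra info is example data."""
    real_mean = mean(real_examples)

    def fit(samples: Sequence[float]) -> ScoreGrad:
        # Toy critic: linear score f(x) = w * x that rates real data above generated data.
        w = real_mean - mean(samples)
        return lambda x: w  # df/dx is constant for a linear score
    return fit


if __name__ == "__main__":
    # Reward peaks at a = 3, so the actor should drift toward 3.
    print(train(0.0, rl_critic(lambda a: -(a - 3.0) ** 2)))
    # Real examples are centred on 5, so the generator should drift toward 5.
    print(train(0.0, gan_critic([4.0, 5.0, 6.0])))
```

In both calls the actor never touches the reward function or the real examples directly; it only ever sees the gradient handed back by the critic, which is the structural similarity being pointed at.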