An interesting and now widespread method that uses reinforcement learning (RL) to improve LLMs so that they better match human preferences.