The idea here is that quality-diversity (QD) shares a core intuition with only punishing negative samples in RL. Rather than pushing toward our single best solution, we just push ourselves away from the things that are bad. I think this inherently encourages exploration: since it isn't about moving to the best thing, we are only marking off a set of bad things we would rather not revisit, and everything else stays open.
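As a rough sketch of what "only punish the bad samples" means (a toy softmax bandit with a made-up `reward` function and an arbitrary `BAD_THRESHOLD`; not any particular published algorithm), the update only fires when a sample looks bad, and otherwise leaves the policy alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over five discrete actions.
logits = np.zeros(5)

def sample_action(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

def reward(action):
    # Made-up reward: action 4 is a trap, everything else is fine.
    return -1.0 if action == 4 else rng.normal(0.5, 0.1)

BAD_THRESHOLD = 0.0  # assumption: anything below this counts as "bad"
LEARNING_RATE = 0.1

for step in range(500):
    action, probs = sample_action(logits)
    r = reward(action)
    if r < BAD_THRESHOLD:
        # Only push *away* from the bad sample; good samples leave the policy
        # alone, so probability mass stays spread over everything that has not
        # been marked bad yet.
        grad_log_prob = -probs
        grad_log_prob[action] += 1.0             # d log pi(a) / d logits
        logits -= LEARNING_RATE * grad_log_prob  # descend: make the bad action less likely
```

The point of the sketch is the asymmetry: there is no term that concentrates probability on the best-looking action, only a term that drains probability from actions that fell below the bar.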
Really, this builds on the idea of a minimal criterion from artificial life: an individual gets to survive and reproduce as long as it meets some simple criterion, rather than having to outcompete everyone else. That bar is a lot lower than being the most fit in its generation.
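To make that concrete, here is a minimal sketch of minimal-criterion-style selection on a toy problem (the `behavior` and `meets_minimal_criterion` functions are made up for illustration, not taken from any specific system): clear the bar and you reproduce, fail it and you don't, with no ranking anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)

def behavior(genome):
    # Hypothetical task: the "behavior" is just where the genome sits in 2D.
    return genome[:2]

def meets_minimal_criterion(genome):
    # The bar is simply "stay inside the arena", not "be the best in the arena".
    return bool(np.all(np.abs(behavior(genome)) < 5.0))

population = [rng.normal(size=4) for _ in range(50)]

for generation in range(100):
    # No ranking, no elitism: everyone who clears the (low) bar reproduces.
    survivors = [g for g in population if meets_minimal_criterion(g)]
    if not survivors:
        break
    parents = rng.choice(len(survivors), size=len(population))
    population = [survivors[i] + rng.normal(scale=0.3, size=4) for i in parents]
```

The contrast with standard fitness-based selection is that no individual ever has to beat its peers; the only pressure is "don't be bad", which is exactly what keeps the population diverse.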