
Wednesday, May 05, 2021

Simulate, Constrain, Repeat, Learn

Berkeley BAIR posts an interesting look at reinforcement learning.  Made me think.  Once you get beyond the opening paragraphs below, it gets quite complicated and technical.  Anyone who has written significant simulation packages can be amazed at what they can accomplish, and embedding reinforcement learning to provide direction opens up further possibilities.  The suggestion that almost anything can serve as a 'simulation' gives us pause: how accurate can it be in real contexts?  Worth thinking about. 

Learning What To Do by Simulating the Past    By David Lindner, Rohin Shah    May 3, 2021    Berkeley BAIR

Reinforcement learning (RL) has been used successfully for solving tasks which have a well-defined reward function – think AlphaZero for Go, OpenAI Five for Dota, or AlphaStar for StarCraft. However, in many practical situations you don’t have a well-defined reward function. Even a task as seemingly straightforward as cleaning a room has many subtle cases: should a business card with a piece of gum be thrown away as trash, or might it have sentimental value? Should the clothes on the floor be washed, or returned to the closet? Where are notebooks supposed to be stored? Even when these aspects of a task have been clarified, translating it into a reward is non-trivial: if you provide rewards every time you sweep the trash, then the agent might dump the trash back out so that it can sweep it up again.
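
A toy sketch of that last point (my own illustration, not from the BAIR post): a hand-coded reward that pays out for every sweep can be gamed by an agent that keeps dumping the trash back out. The state variables and dynamics below are invented purely to show the loophole.

def naive_sweep_reward(prev_state, state):
    """+1 whenever more trash ends up in the bin than before."""
    return 1.0 if state["trash_in_bin"] > prev_state["trash_in_bin"] else 0.0

def step(state, action):
    """Toy room dynamics: dumping empties the bin, sweeping refills it."""
    state = dict(state)
    if action == "dump_bin":
        state["trash_on_floor"] += state["trash_in_bin"]
        state["trash_in_bin"] = 0
    elif action == "sweep" and state["trash_on_floor"] > 0:
        state["trash_in_bin"] += state["trash_on_floor"]
        state["trash_on_floor"] = 0
    return state

def exploit_policy(state):
    """Reward-hacking policy: alternate between dumping and sweeping."""
    return "dump_bin" if state["trash_in_bin"] > 0 else "sweep"

state = {"trash_on_floor": 1, "trash_in_bin": 0}
total = 0.0
for _ in range(10):
    nxt = step(state, exploit_policy(state))
    total += naive_sweep_reward(state, nxt)
    state = nxt
print(total)  # reward keeps accumulating while the room never gets cleaner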

Alternatively, we can try to learn a reward function from human feedback about the behavior of the agent. For example, Deep RL from Human Preferences learns a reward function from pairwise comparisons of video clips of the agent’s behavior. Unfortunately, however, this approach can be very costly: training a MuJoCo Cheetah to run forward requires a human to provide 750 comparisons.
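
To give a feel for how that line of work operates, here is a rough sketch of pairwise-comparison reward learning in the style of Deep RL from Human Preferences: a reward network is trained so that the clip the human preferred gets the higher total predicted reward, via a Bradley-Terry style loss. The network size, observation dimension, and random stand-in data are placeholders, not the paper's actual setup.

import torch
import torch.nn as nn

obs_dim = 17  # placeholder observation size, e.g. a MuJoCo state vector
reward_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(clip_a, clip_b, human_prefers_a):
    """Bradley-Terry style loss on two clips of shape (timesteps, obs_dim)."""
    return_a = reward_net(clip_a).sum()          # predicted return of clip A
    return_b = reward_net(clip_b).sum()          # predicted return of clip B
    logits = torch.stack([return_a, return_b]).unsqueeze(0)
    target = torch.tensor([0 if human_prefers_a else 1])
    return nn.functional.cross_entropy(logits, target)

# One update from a single labeled comparison (random tensors stand in for clips).
clip_a, clip_b = torch.randn(50, obs_dim), torch.randn(50, obs_dim)
loss = preference_loss(clip_a, clip_b, human_prefers_a=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()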

Instead, we propose an algorithm that can learn a policy without any human supervision or reward function, by using information implicitly available in the state of the world. For example, we learn a policy that balances this Cheetah on its front leg from a single state in which it is balancing.  ...."
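
The full algorithm described in the post is considerably more involved, but the core idea can be caricatured as scoring candidate reward functions by how well they explain the observed state: if the Cheetah is found balancing, then past behavior that led to balancing is evidence about what was being optimized. In this deliberately simplified sketch, `candidate_rewards`, `simulate_backwards`, and `likelihood_of_state` are hypothetical placeholders for components the real method learns.

import numpy as np

def infer_reward_from_state(observed_state, candidate_rewards,
                            simulate_backwards, likelihood_of_state):
    """Pick the candidate reward that best explains the observed state."""
    best_reward, best_score = None, -np.inf
    for reward_fn in candidate_rewards:
        # Sample trajectories that could have produced the observed state
        # if an agent had been optimizing reward_fn in the past.
        past_trajectories = simulate_backwards(observed_state, reward_fn)
        score = likelihood_of_state(observed_state, past_trajectories)
        if score > best_score:
            best_reward, best_score = reward_fn, score
    return best_reward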
