
Friday, July 09, 2021

Challenge for Learning from Human Feedback using Minecraft

A Berkeley BAIR challenge competition using a familiar gaming environment. It has been a long time since I looked at Minecraft. A short extract of the idea is below; a more complete look is at the link. Seems a novel, broader take on contextual learning.

BASALT: A Benchmark for Learning from Human Feedback, by Rohin Shah, Jul 8, 2021

TL;DR: We are launching a NeurIPS competition and benchmark called BASALT: a set of Minecraft environments and a human evaluation protocol that we hope will stimulate research and investigation into solving tasks with no pre-specified reward function, where the goal of an agent must be communicated through demonstrations, preferences, or some other form of human feedback. Sign up to participate in the competition!

Motivation

Deep reinforcement learning takes a reward function as input and learns to maximize the expected total reward. An obvious question is: where did this reward come from? How do we know it captures what we want? Indeed, it often doesn’t capture what we want, with many recent examples showing that the provided specification often leads the agent to behave in an unintended way.

Our existing algorithms have a problem: they implicitly assume access to a perfect specification, as though one has been handed down by God. Of course, in reality, tasks don’t come pre-packaged with rewards; those rewards come from imperfect human reward designers.

For example, consider the task of summarizing articles. Should the agent focus more on the key claims, or on the supporting evidence? Should it always use a dry, analytic tone, or should it copy the tone of the source material? If the article contains toxic content, should the agent summarize it faithfully, mention that toxic content exists but not summarize it, or ignore it completely? How should the agent deal with claims that it knows or suspects to be false? A human designer likely won’t be able to capture all of these considerations in a reward function on their first try, and, even if they did manage to have a complete set of considerations in mind, it might be quite difficult to translate these conceptual preferences into a reward function the environment can directly calculate. ...

Conclusion

We hope that BASALT will be used by anyone who aims to learn from human feedback, whether they are working on imitation learning, learning from comparisons, or some other method. It mitigates many of the issues with the standard benchmarks used in the field. The current baseline has lots of obvious flaws, which we hope the research community will soon fix.

Note that, so far, we have worked on the competition version of BASALT. We aim to release the benchmark version shortly. You can get started now by installing MineRL from pip and loading up the BASALT environments, as sketched below. The code to run your own human evaluations will be added in the benchmark release.
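As a rough illustration of that getting-started step, a minimal Python sketch follows. It assumes the MineRL 0.4.x API and the environment ID MineRLBasaltFindCave-v0 (one of the BASALT tasks); check the MineRL documentation for the current environment names before running it.

# Install first:  pip install minerl
import gym
import minerl  # importing minerl registers the MineRL/BASALT environments with gym

# Load one of the BASALT tasks (environment ID assumed; see the MineRL docs)
env = gym.make("MineRLBasaltFindCave-v0")

obs = env.reset()
for _ in range(100):  # take a few steps just to exercise the loop
    # Start from a no-op action and set a field or two as an example
    action = env.action_space.noop()
    action["forward"] = 1
    obs, reward, done, info = env.step(action)
    # reward carries no task signal here: BASALT tasks have no pre-specified reward function
    if done:
        break

env.close()

The point of the sketch is only that the environments load through the standard MineRL/gym interface; how the agent is actually trained from demonstrations, comparisons, or other human feedback is left entirely to the participant.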

If you would like to use BASALT in the very near future and would like beta access to the evaluation code, please email the lead organizer, Rohin Shah, at rohinmshah@berkeley.edu.

This post is based on the paper “The MineRL BASALT Competition on Learning from Human Feedback”, accepted at the NeurIPS 2021 Competition Track. Sign up to participate in the competition!   
