AlphaGo’s premise is that instead of relying on human feedback for reinforcement learning, you have the model play games against itself, with a simple reward mechanism, so that it learns from its own mistakes. This makes the training data scalable, allowing the model to discover new Go moves and eventually exceed the quality of its initial training data.
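To make the premise concrete, here is a toy version of that loop: tabular self-play on the game of Nim, where the only training signal is who wins. The game, the update rule, and all the names are my own illustration, not AlphaGo's actual architecture (which uses deep networks and Monte Carlo tree search):

```python
import random
from collections import defaultdict

# Toy self-play on Nim: 15 stones, each turn take 1-3, whoever takes the
# last stone wins. One tabular policy plays both sides; the only signal
# is the end-of-game result (+1 win / -1 loss), the "simple reward".
values = defaultdict(float)  # estimated value of each (stones_left, move)

def choose(stones: int, epsilon: float = 0.1) -> int:
    moves = [m for m in (1, 2, 3) if m <= stones]
    if random.random() < epsilon:
        return random.choice(moves)  # explore
    return max(moves, key=lambda m: values[(stones, m)])  # exploit

def self_play_episode(start: int = 15, lr: float = 0.1) -> None:
    stones, history = start, []
    while stones > 0:
        move = choose(stones)
        history.append((stones, move))  # the two players alternate implicitly
        stones -= move
    # The side that made the last move won; credit every move accordingly.
    for i, (s, m) in enumerate(history):
        reward = 1.0 if (len(history) - 1 - i) % 2 == 0 else -1.0
        values[(s, m)] += lr * (reward - values[(s, m)])

for _ in range(50000):
    self_play_episode()

# With enough episodes the policy tends to rediscover the optimal strategy
# (always leave the opponent a multiple of 4), e.g. take 3 from 15.
print(choose(15, epsilon=0.0))
```

No human data goes in: the policy generates its own experience, and play quality is bounded only by compute, not by the best games in a training corpus.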
From an engineering point of view, how do you see this being applied to other areas like software development, where there is no opposing player? Do you connect the model to a compiler and have it learn by trial and error from the compiler output? Do you set desired software outcomes and have another AI evaluate how much closer or farther the output gets with each iteration? How would such a closed feedback loop work to make an AI a world expert in a specific programming language or framework?
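For the compiler variant, I imagine something like the sketch below: treat the toolchain as the environment and shape a reward from how far a candidate program gets through it. This is purely my assumption of how the loop might look; `model.sample` and `model.update` are hypothetical stand-ins for whatever generation and policy-update machinery the real system would use, and the reward shaping is likewise invented for illustration:

```python
import os
import subprocess
import tempfile

def compile_reward(source: str) -> float:
    """Score a candidate C program by how far it gets through the toolchain:
    0.0 for a compile error, 0.5 for a clean compile, 1.0 if the binary
    also passes one fixed test (here: reads "3 4", prints "7")."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source)
        build = subprocess.run(["gcc", src, "-o", binary],
                               capture_output=True, text=True)
        if build.returncode != 0:
            return 0.0  # build.stderr could shape a denser reward signal
        run = subprocess.run([binary], input="3 4\n",
                             capture_output=True, text=True, timeout=5)
        return 1.0 if run.stdout.strip() == "7" else 0.5

def improvement_loop(model, prompt: str, iterations: int = 1000) -> None:
    """Hypothetical outer loop, in the spirit of self-play without an opponent."""
    for _ in range(iterations):
        candidate = model.sample(prompt)         # propose a program
        reward = compile_reward(candidate)       # score it against the toolchain
        model.update(prompt, candidate, reward)  # reinforce better outputs
```

The compiler plays the role the game rules play in Go: a cheap, objective verdict that can be queried millions of times. What I don't see is the equivalent of "winning", which is why I'm asking how the outcome-evaluation side would be defined.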