S3E05.4 - Go (Bonus - Teaching Computers to Game)

#Go #AI #ArtificialIntelligence #ComputerGaming #BoardGames #Science

Summary

It's part 3 of our miniseries on teaching computers to play games. Today we're joined by special guest, Dr. Prithvi Akella, a roboticist and AI expert here to help us learn how to play Go, or at least how to teach a computer to do so.

Timestamps

00:00 Introductions
02:20 Background on Go
06:52 Neural networks
09:50 Training the network
11:52 When (and how) computers won Go
18:38 Networks replacing brute force
21:31 Wrap-up

Full Transcript

(Some platforms truncate the transcript due to length restrictions. If so, you can always find the full transcript on https://www.gamingwithscience.net/ )

Brian 0:06
Hello, and welcome to the Gaming with Science podcast, where we talk about the science behind some of your favorite games.

Jason Wallace 0:11
In today's minisode about teaching computers to game, we'll be talking about Go, neural networks, and reinforcement learning. All right, everyone. Welcome back to Gaming with Science. This is Jason.

Brian 0:23
This is Brian.

Jason Wallace 0:24
And today we are on number three of our four-part miniseries on teaching computers to game. We're gonna be talking about Go and neural networks and deep reinforcement learning, and we have now officially gone beyond what I am capable of talking about on this show. And so we are joined by a special guest, Dr. Prithvi Akella, who is here to help us understand not only how we're training computers to play games, but how this actually applies to real life. So, Prithvi, could you please introduce yourself to our audience?

Prithvi 0:50
Sure. Hello, everyone. My name is Prithvi. Pleasure to meet everyone, at least virtually. I finished my PhD at Caltech about three years ago. While I was there in grad school, I did a little bit of work in learning-enabled systems, with an emphasis on robotic systems. My specific focus was on trying to make these systems more robust, and now, as a research scientist at Siemens, my goal is to apply these same methodologies in the robots that we put out in factories, and also for use in agentic systems that we're building internally as well.

Jason Wallace 1:14
So, yeah, actually putting AI to use out in the real world, and so the colleague who introduced us mentioned, you've done some work recently on plants, right, which is the area that Brian and I work on.

Prithvi 1:23
Yeah, so the work that we did with plants was with one professor at Berkeley, Ken Goldberg, and his lab. The idea there was, could we make 3D models in real time of plants for use in phenotyping and other identification aspects, specifically as it regards making sure, and monitoring, that plants are growing correctly, have certain markers, etc., things of this nature.

Jason Wallace 1:42
and I could definitely use some of those. We have some traits that we measure in our lab that I've been going after a 3D scan of these plants for years, and we just don't have the skills to be able to put it together.

Brian 1:53
A lot of the work on plants, we use this little model weed called Arabidopsis, which has the convenient thing of being very flat, so like you can just get a top-down image, and it's pretty good, but most plants, like what Jason works on, maize, there's a lot of verticality there, so like top down isn't going to pull it off.

Jason Wallace 2:08
Yeah, and phenotyping is the process of actually measuring traits on plants, how tall it is, angles, colors, all sorts of stuff like that, any trait that we're interested in, really,

Brian 2:16
blue eyes, red hair, you know, the classic plant phenotypes.

Jason Wallace 2:20
All right, well, let's start talking about games. So, today's game is Go. Go is an ancient game, even older than chess. I think last time I said that chess was thousands of years old. That's not quite true. It's more like 13, 1400 years old. Go, however, is 2500 years old, originally from China, and it's thought to be the oldest continually played board game. It even gets a mention in the Analects of Confucius. So it's an old game that is played on a board that traditionally is a 19 by 19 board, a grid where you place either black or white stones on the intersections. One player plays white, the other player plays black. You take turns placing them down; once they're down, they can't move, and your goal is to surround the other player's pieces and thus capture them, and to capture as much territory as you can on the board. The name Go, I'm not going to go all the way through the etymology, because it's complicated, but the name in original Chinese means essentially 'board game of surrounding,' like you are surrounding your opponent and trying to capture them. Although professional Go is on 19 by 19, you can play on smaller boards, like 13 by 13, or even nine by nine, as a training board that makes it easier, as far as learning goes. And pretty much the game goes until both players pass. As far as I'm aware, games generally don't go until you run out of spaces. They go until both players say, 'You know what, I'm good, I'm not going to be able to actually do anything better,' or one concedes to the other. The reason we're talking about Go specifically is because Go is sort of the next evolution of hard games to get computers to play. So we talked about chess last time, and how this was the poster child of getting computers to play games up until like the mid 90s, when suddenly Deep Blue beat the world's best chess player, and that hurdle had been passed.
In fact, I even remember way back in the Devonian, when I was in high school, I did a field trip with one of my classes to the local university, where we listened to some visiting professor talk about how Go was a better model for human cognition than chess, and he was arguing that when we got computers that could actually play Go, we would be much closer to understanding human neurology and psychology, or whatever. I don't remember all the details. I was 17 at the time, but it was basically: Go is the better model to train on than chess, because Go is much more flexible. No piece is more valuable than another. The number of moves is much larger. At any given point in a game of chess, there's maybe 30 to 40 moves you have to worry about, sometimes more, sometimes less. In Go, it's closer to 150 to 250, and so there's more moves. Everything is very context dependent. How good a specific spot is on the board depends on the state of the board. There's probably a few spots that are slightly more powerful than others, but it's really very context dependent, and a move made at one point can have repercussions 100 moves down the line. And so this is a very strategically deep game from a very simple principle, and I must admit I have not played Go, so I am not fit to talk about the strategies of it. I just understand from research that it is extremely deep, and the people who are really into Go, these world-class champions, are extremely good at it. And so once chess was vanquished, and once we basically had computers that could beat any human being at chess, the next obvious one was Go. How do we do this? Because Go, from the numbers I was throwing out, you probably figured out, is not really computationally tractable. We talked about how chess is not something that you can truly solve by brute force, that there are many more possible games of chess than there are atoms in the universe, by 40 orders of magnitude. Well, for Go it's about 90 orders of magnitude.
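
To make the move counts above concrete, here is a minimal sketch of how quickly the number of possible move sequences grows when you try to look ahead by brute force. The per-turn figures are midpoints of the ranges quoted in the episode (an assumption for illustration), and the function names are hypothetical.

```python
# Midpoints of the ranges quoted in the episode (assumed for illustration).
CHESS_MOVES_PER_TURN = 35    # "30 to 40"
GO_MOVES_PER_TURN = 200      # "150 to 250"

def sequences(moves_per_turn: int, turns: int) -> int:
    """Number of distinct move sequences when looking `turns` moves ahead."""
    return moves_per_turn ** turns

def orders_of_magnitude(n: int) -> int:
    """Decimal digits minus one, i.e. floor(log10(n)) for a positive integer."""
    return len(str(n)) - 1

for turns in (2, 4, 6):
    chess = sequences(CHESS_MOVES_PER_TURN, turns)
    go = sequences(GO_MOVES_PER_TURN, turns)
    print(f"{turns} moves ahead: chess ~10^{orders_of_magnitude(chess)}, "
          f"Go ~10^{orders_of_magnitude(go)}")
```

Even at six moves of lookahead the Go count is several orders of magnitude larger than the chess count, and the gap keeps widening with every extra move.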

Jason Wallace 5:48
And I want to put this in context because we're not always good about explaining it. So when we say that the universe has 10 to the 80th atoms and that there are 2.1 times 10 to the 170th possible Go game states, that doesn't mean there's just over twice as many, that means there's 10 to the 90th universes' worth of atoms worth of Go games. I looked at this number, it is 2.1 novemvigintillion.

Brian 6:13
Jason, that's not a real word.

Jason Wallace 6:15
it is a real word.

Brian 6:16
All right,

Jason Wallace 6:17
I have never heard of it before.

Brian 6:19
Okay,

Jason Wallace 6:20
there is some math nerd out there that has just gone and named everything as far as they can go, so anyway, so that's why go was the next level, and it pretty much was thought that it could not be solved by the same brute force methods that chess was, because the number of moves was too high, there were too many board states, and the value of the move is too hard to compute as far in the future as you need it. Master Go players do this intuitively. They are so experienced they can look at a board and they can intuit how things will play out, but we couldn't brute force a computer to do this. And so this then brings us to the next level of computation, which is neural networks and reinforcement learning. And now, Prithvi, I need you to do this part. Can you explain to us what is a neural network?

Prithvi 7:03
Sure, I'll try my best. So, fundamentally, a neural network, like many machine learning models, is just one of multiple ways that we, as people who create machine learning models, try to fit or otherwise understand patterns that we see in general practice. So, specifically, with respect to neural nets, we define a neural net as one where, given an input, an input is just a vector of numbers. In this context, we apply a certain sequential set of operations to that vector of numbers, matrix operations to begin with, then nonlinear operations afterwards. And through a variety of these matrix operations with nonlinear operations, we understand that nets of a certain size, which means many more of these matrix operations stacked with nonlinear operations, can achieve pattern recognition in larger and larger spaces, or for more and more complex patterns. That's neural nets, at the very least.

Brian 7:52
Do you have a version of that that's good for dumber people?

Jason Wallace 7:58
My understanding is it's basically a bunch of mathematical transformations. It's a bunch of math that is applied, so you give it some input values, which can be an image or it can be your Netflix subscriptions, or it can be the current state of a Go board, and the computer has some way of reading that in. And then it does a whole series of mathematical transformations, and eventually it outputs something on the other side that you can use, like this is the next move in the go game, or this is a picture of a bird, or of a cat, or of a cheetah, or something

Prithvi 8:28
that's a much better description.
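
As a rough illustration of the "matrix operations, then nonlinear operations" description above, here is a minimal sketch of a two-layer network's forward pass in plain Python. The board encoding, the layer sizes, and the weights are all invented for illustration and untrained; a real Go network is vastly larger.

```python
import math

def matvec(weights, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    """Nonlinear step: negative values become zero."""
    return [max(0.0, v) for v in vector]

def softmax(vector):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]

def forward(board, layer1, layer2):
    """Input vector -> matrix op -> nonlinearity -> matrix op -> move probabilities."""
    hidden = relu(matvec(layer1, board))
    return softmax(matvec(layer2, hidden))

# A made-up 2x2 "board" flattened to 4 numbers: 1 = black, -1 = white, 0 = empty.
board = [1.0, 0.0, -1.0, 0.0]

# Arbitrary, untrained weights chosen only for illustration.
layer1 = [[0.5, -0.2, 0.1, 0.0],
          [0.3, 0.8, -0.5, 0.2],
          [-0.6, 0.1, 0.4, 0.9]]
layer2 = [[0.2, -0.1, 0.4],
          [0.7, 0.3, -0.2],
          [-0.3, 0.5, 0.1],
          [0.1, 0.1, 0.1]]

move_probabilities = forward(board, layer1, layer2)
print(move_probabilities)  # one probability per board point
```

Training, discussed next, is the process of adjusting the numbers in `layer1` and `layer2` until the outputs become useful.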

Jason Wallace 8:30
Well, the way I understand the reason why they're called neural networks is because at the base mathematical level, they're sort of modeled on human neurons, where there's these little chunks of code that will take inputs from various other places, and they'll do their own internal math, and they will spit out an output that then goes to another one of these, and oftentimes you have a bunch of these interconnecting, so you could have 10 or 20 or 50 feeding into one of these, which then feeds out to another 10 or 20 or 50, and they're all interconnected in a big complex network, and there's all this - I'll be blunt - kind of black magic math going on under the hood, and eventually you get something out the other side. And my understanding of this is that me calling it black magic is actually not that much of an exaggeration, that if you pick apart a neural network, there's a bunch of math, but we don't necessarily understand how that math results in what we get out. Is that true? Is that an oversimplification?

Prithvi 9:19
Actually, not really. There's an entire field related to machine learning called interpretability or explainability, which hints at exactly what you were just talking about, which is to say: after I take whatever these inputs are and I run through all the mathematical operations that result in the identification of the Go state, or whether or not this is a bird or a cat or something similar, there's a lot of different neurons, underlying mathematical formulas, that become very difficult for human beings to understand. And so there's an entire field of research dedicated to understanding how neural nets perform this pattern recognition operation. It's entitled interpretability or explainability. So, you're absolutely correct.

Brian 9:50
So, a neural network has to be trained. Is this correct?

Prithvi 9:54
This is correct.

Brian 9:55
Okay, so this is where, like, I need to feed 5000 pictures of a cat through the transformation, the computer then picks out, okay, these are the outputs that are associated with cats, right? And you also have to feed it a bunch of things that are not cats, and it says these are not cats, and then it again sort of learns how to intuit cat-like patterning based on the math transformation,

Prithvi 10:15
Correct
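
The cat versus not-cat training described above can be sketched with the smallest possible learner: a single "neuron" (logistic regression) trained by gradient descent. The two features and all the data points are invented for illustration; real image classifiers learn from raw pixels with many layers.

```python
import math

def predict(weights, bias, features):
    """Probability that the example is a cat (logistic function of a weighted sum)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, labels, epochs=1000, learning_rate=0.5):
    """Nudge the weights a little after each example to shrink the prediction error."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Invented features: (ear pointiness, whisker length), each scaled to 0..1.
examples = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7),   # labeled cat
            (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]   # labeled not cat
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(examples, labels)
cat_like = predict(weights, bias, (0.85, 0.75))      # unseen, cat-like example
not_cat_like = predict(weights, bias, (0.15, 0.20))  # unseen, not cat-like
print(cat_like, not_cat_like)
```

Both positive and negative examples matter here: without the "not cat" rows, the learner would have no reason to ever answer anything but "cat".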

Jason Wallace 10:16
Yeah, and we don't have time to get into this, but I'm personally fascinated by where these go wrong, where they will pick up on little things like certain patterns, so that you can generate images that the network is very confident is like a school bus or something, and to us it just looks like a bunch of vertical lines. I mean, those are people intentionally trying to fool the network. In most cases, it works very well, because you give it a big enough training data and it will get whatever mathematical transformation it needs,

Brian 10:41
and this is how, at least to some degree, how we think that we recognize things too? This is how human pattern recognition works?

Jason Wallace 10:48
You'd have to talk to a neurobiologist about that, but I think it's also very mysterious still. It's like we are starting to understand the neural correlates, I recently heard it called, of what it means to be conscious. We know that this part is active doing this, and this part is active doing this, like if you're doing speech, then this part of your brain is working, but how that actually creates our subjective awareness, I think, is still one of the big questions of neurology, philosophy even. So, we're going to zero back in on Go, which is a much simpler system, and as far as teaching computers to play this, the real reigning champion of this, the one that gets all the glory, just like Deep Blue did for chess, is AlphaGo. So, AlphaGo was a program made by DeepMind, who I believe is Google's research arm, and it started in the 2010s, the mid-2010s, and it started by learning from human players. I understand that the first version of AlphaGo learned from human players in history, but then also had a second part where it would sort of play against itself to try to get better,

Brian 11:45
because you need that training set, right?

Jason Wallace 11:47
Yes, you need something, because otherwise the neural network is just random, it won't, it's got to know

Brian 11:51
anything,

Jason Wallace 11:52
so you have to have a lot of data to feed it in order to get those weights, they're called, the numbers, the mathematical transformations, correct to come out the other side. So in 2015 it beat a highly ranked European player, apparently at the two-dan level, where there's one to nine dan, and nine is the highest for a professional player. But the one that gets all the press is the 2016 win against Lee Sedol, who was a nine-dan player and considered, I believe, the second best Go player in the world at that time, and Mr. Sedol went into this basically assuming he would trash AlphaGo, because AlphaGo had performed poorly against top players previously, and he lost four to one. So, AlphaGo beat him four times out of five, and this was a major landmark in terms of a computer being able to play this very, very complex game in a way that would rival and surpass top human players. And then the next year they came out with AlphaGo Zero, which is an even further refinement, where they didn't use any human training data at all. In fact, it's a simpler network architecture, as I understand it, and they created it so it could interpret the rules of Go, and then they just had it play itself, with no human data whatsoever,

Brian 13:00
like hundreds of millions of times?

Jason Wallace 13:02
Yeah, something like that. But the interesting thing was it took less time, less training data, and less power and resources to train it, and at the end of their training period of a few days or few weeks, AlphaGo Zero, they pitted it against AlphaGo, and it won 100 matches to zero. So it was completely blowing its predecessor out of the water, which was already at peak human performance, essentially, and I think there's one more iteration on this, where they made just AlphaZero, which was a general-purpose one that could not only play Go, but could also play chess, and then shogi, which is another board game, and this was some major milestone, because having one system that could do three different games was considered another major breakthrough, because it was a more general-purpose learning machine. And Prithvi has given me a bunch of thumbs ups as I'm saying this, so I'm on the right path.

Brian 13:50
He's doing internal fact checking for you.

Jason Wallace 13:53
Prithvi, can you expound on this? You probably know more about this than I do.

Prithvi 13:57
Well, I mean, I think you gave a great explanation, Jason. So, I'll say this: the pattern going from neural networks all the way to reinforcement learning, all the way to the development of alpha-esque policies like AlphaGo or AlphaZero, follows a very humanlike pattern of how do I make these systems function a bit better, right? As Brian mentioned earlier, when we started off with neural nets, right, the goal is basically just how do I figure out a way to train my computer to understand these patterns. I feed in a bunch of input data, I train these models, I spend some time, I get a system, a mathematical function, if you will, that then says, all right, this image is a cat, this image is a dog, blah blah blah. This is great. Now I want to go one step further with this pattern recognition. I don't just want to do this level of classification, I want to actually figure out how I should act in certain scenarios. This is a very broad jump, if you will, from neural networks to reinforcement learning, but this is what underlies the basis of reinforcement learning as kind of a field within computer science. The general idea is the following: How do humans, in this case, learn not just from, like, a neural network architecture, but more so, how do people in interactions with the world understand how they should interact with the world, understand what they should do? And this is where the comparisons to neuroscience become a little bit more clear, in the sense that as human beings, when we act, we get signals in our brain, dopamine signals, pain signals, things of this nature, and based on this information, we can adapt the way that we act, or the way that we interact with the world, to align with our own internal reward signals, so that we do certain actions better, so that we learn how to play chess ourselves, so that we learn how to play Go ourselves, for the sake of argument, or any of these games that we're talking about.
Reinforcement learning is the underlying computer-based rendition of this reward signaling for action refinement, if you will. In this case, with as minimal math as I can say it, the goal is simply to provide the computer the ability to take certain actions, learn how those actions affect its own internal reward signal, and then, based on that understanding of its reward signal, make better actions in the future. That's what started reinforcement learning, with, you know, some phenomenal works by Sutton and Barto back before the 2000s. And as reinforcement learning has progressed, we have built better ways to make these training policies more efficient, to use less data, and then now getting to AlphaGo and AlphaZero. Not only do we just want to figure out how a computer or how one of these models should act, but we as humans, when we play any of these games like Go or shogi or chess, we don't simply just think about how we should act in the moment. We also try to plan ahead a little bit, in that this move will have some impact five moves down the line, 10 moves down. And this notion of planning, that notion specifically, wasn't as encoded in earlier reinforcement learning architectures. The assumption was that it existed, and that if you rolled out enough policies, or if you rolled out enough actions, and the system had the ability to understand what it did, it would eventually intuit this kind of predictive behavior, but it's made more concrete now in these alpha-esque policies with some specific terms that I'll mention, like Monte Carlo tree search, and then Monte Carlo tree search with policy refinement. These terms might be a little bit broad, but for those in the audience who want to read more, feel free to look up these terms.
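
The loop described above (take an action, see how it affects the reward signal, act better next time) can be sketched with tabular Q-learning, a standard textbook method. This is not the method AlphaGo used; the corridor world, the reward of 1 at the goal, and every parameter value are assumptions chosen to keep the example tiny.

```python
import random

N_STATES = 5            # positions 0..4 on a short corridor
GOAL = N_STATES - 1     # reaching the right end pays a reward of 1
ACTIONS = (-1, +1)      # step left or step right

def step(state, action):
    """Move within the corridor; reward arrives only on reaching the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def choose_action(q_values, state, epsilon, rng):
    """Mostly pick the best-looking action, but sometimes explore at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_values[state].values())
    return rng.choice([a for a in ACTIONS if q_values[state][a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Estimate the long-run reward of each action in each state."""
    rng = random.Random(seed)
    q_values = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = choose_action(q_values, state, epsilon, rng)
            next_state, reward = step(state, action)
            # Future reward is discounted, so early moves inherit credit
            # from the later moves they make possible.
            target = reward + gamma * max(q_values[next_state].values())
            q_values[state][action] += alpha * (target - q_values[state][action])
            state = next_state
    return q_values

q_values = train()
policy = [max(ACTIONS, key=lambda a: q_values[s][a]) for s in range(GOAL)]
print(policy)  # learned action for each non-goal position
```

The only feedback is a single reward at the far end, yet the learned values at the start of the corridor still favor stepping right, which is the "a move now matters several moves later" idea in miniature.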

Prithvi 16:51
This allowed for these systems with neural network architectures to have a way of predicting how their actions in the future would affect their current state, and then when we get to AlphaGo and AlphaZero type policies, it allowed leveraging this Monte Carlo tree search type ideology with neural networks, basically allowing these systems to predict how their current actions affect their states in the future. This is the underlying architecture that allowed for these systems to learn how to play Go, learn how to play chess, and was the fundamental architecture underlying AlphaZero as well.
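
For a feel of how simulated futures can guide a move choice, here is a minimal sketch of flat Monte Carlo rollouts, the simplest building block behind Monte Carlo tree search. It deliberately leaves out the search tree and the neural network that AlphaGo-style systems add on top, and the game (a toy stone-taking game, not Go) and all names are assumptions for illustration.

```python
import random

MAX_TAKE = 3  # toy rule: remove 1 to 3 stones per turn; taking the last stone wins

def legal_moves(pile):
    """Moves available with `pile` stones left."""
    return list(range(1, min(MAX_TAKE, pile) + 1))

def random_playout(pile, my_turn, rng):
    """Play random moves to the end; return True if 'I' took the last stone."""
    while True:
        pile -= rng.choice(legal_moves(pile))
        if pile == 0:
            return my_turn
        my_turn = not my_turn

def estimate_move_values(pile, playouts=2000, seed=0):
    """For each legal move, estimate the win rate by simulating random futures."""
    rng = random.Random(seed)
    values = {}
    for move in legal_moves(pile):
        remaining = pile - move
        if remaining == 0:
            values[move] = 1.0  # taking the last stone wins immediately
            continue
        # After my move the opponent is next to play, hence my_turn=False.
        wins = sum(random_playout(remaining, False, rng) for _ in range(playouts))
        values[move] = wins / playouts
    return values

values = estimate_move_values(pile=5)
best_move = max(values, key=values.get)
print(values, best_move)
```

Random playouts are a crude predictor. The systems discussed in the episode replace them with learned estimates of which moves and positions are promising, which is what makes the lookahead affordable on a 19 by 19 board.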

Brian 17:18
So, let's see, so the three stages were: first, you need to be able to recognize inputs, say, 'Hey, I can understand what is in front of me accurately.' The next is, now, what do I do with that information? And it sounds like, I don't want to overemphasize, but we are programming these programs to feel pleasure and pain, or at least have punishments and reward.

Prithvi 17:37
I think a reward is a better way of phrasing it. No, no punishment just yet. We're just trying to make these systems do good things.

Brian 17:43
There's no negative values associated with things in the training. It's all positive reinforcement.

Prithvi 17:47
Well, there are negative values.

Brian 17:51
I just want to say even single-celled organisms can respond to attractants and repellants. Even that is a reward versus a punishment system. You don't have to have necessarily higher thinking to respond to a reward or a punishment in a way that is beneficial in the short term, but then the next step is thinking about, okay, planning for the future, where it's like, how can I sort of have a vision of what will happen, so you're optimizing not just for now but for later. Those are kind of the three steps of this process.

Prithvi 18:16
Yeah, that's actually a very good way of phrasing it.

Brian 18:18
So, like, are people still saying that these things are not smart, because these sound like they're actually, you know, we're getting close to what you would imagine that a human is doing. Right?

Prithvi 18:27
Now we get to the front of generative intelligence, which I am happy to go into, but that depends on where we're trying to go.

Jason Wallace 18:33
Yeah, that's actually our next episode.

Brian 18:35
Okay. Well, I will hold my question until then.

Jason Wallace 18:38
And I did want to ask, so as we've been doing this episode, a lot of things I looked up, especially for the last one on chess, were brute force approaches for things like weather prediction and other modeling that we do to try to understand the world. I noticed that a lot of them now are being replaced by some more neural network, reinforcement learning type technology, as opposed to brute force. Is that a general trend? Can this sort of architecture generally do what brute force can do, but better, or are there places where brute force still shines?

Prithvi 19:09
It's a great question. I'll state that the goal behind training these policies, especially as it concerns reinforcement learning, in particular, the way we train these reinforcement learning policies may come across as brute force in the beginning, but iteratively more refined as the policy learns a bit more. And the reason we train these policies in this way is that, when I say brute force, I mean this kind of general exploratory search where we allow for the computer, much like human beings in general would like, to explore, understand, and then internalize this information. Once we have that and the model can update, then it gets progressively less brute force as time wears on. It becomes more tuned, if you will, via this training procedure that Brian, you were mentioning earlier, to understand: okay, these actions lead to these outcomes, they're better for me. I probably should explore only in these regimes and do these types of actions, as opposed to these other ones, because those other ones, while I did explore them, probably didn't give me all that much benefit. So, to your point, Jason, the reason I think we move from general brute force architectures or general brute force implementations to these more refined, if you will, learning architectures like neural networks or reinforcement learning is just that it makes the information that we receive from these brute force approaches that much more refined as it regards our ability to use these systems to make decisions, do this classification, things of this nature.
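
The "brute force at first, progressively more tuned" idea above can be sketched as an exploration rate that shrinks as experience accumulates. This is a minimal sketch using an epsilon-greedy choice among three options; the payout chances, the decay schedule, and the names are all assumptions for illustration.

```python
import random

WIN_RATES = (0.2, 0.5, 0.8)   # hidden payout chance of three made-up options

def exploration_rate(step, decay=50.0):
    """Start fully exploratory, then explore less as experience accumulates."""
    return 1.0 / (1.0 + step / decay)

def run(steps=2000, seed=0):
    """Return the learned reward estimates and the steps spent exploring."""
    rng = random.Random(seed)
    pulls = [0] * len(WIN_RATES)
    average_reward = [0.0] * len(WIN_RATES)
    explored_at = []  # steps where the choice was random rather than greedy
    for step in range(steps):
        if rng.random() < exploration_rate(step):
            choice = rng.randrange(len(WIN_RATES))
            explored_at.append(step)
        else:
            choice = max(range(len(WIN_RATES)), key=lambda i: average_reward[i])
        reward = 1.0 if rng.random() < WIN_RATES[choice] else 0.0
        pulls[choice] += 1
        # Running average: fold the new reward into the current estimate.
        average_reward[choice] += (reward - average_reward[choice]) / pulls[choice]
    return average_reward, explored_at

average_reward, explored_at = run()
early = sum(1 for s in explored_at if s < 200)
late = sum(1 for s in explored_at if s >= 1800)
print(average_reward, early, late)
```

Comparing `early` with `late` shows the shift: most random choices happen in the first couple of hundred steps, and later steps mostly reuse what was learned.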

Brian 20:22
Is there a risk of if it's sampling or subsampling, it says, I think this is a useful place to focus my attention. I'm very anthropomorphizing. Please forgive me. Is there a risk of it developing sort of a local fitness, getting stuck at a fitness optimum, and like getting overly into a corner that is actually not the best solution?

Prithvi 20:39
This is actually a great question, and this is true, and this leads or hints at an entire field or entire practice within computer science research entitled reward shaping, because you're absolutely correct: dependent on the reward that I give these systems, or the way that I shape the reward for these systems, I can influence its behavior in one or multiple ways, and if I don't account for certain eventualities, like what you just mentioned, Brian, it getting stuck in a local minimum that is not what I particularly wanted, then that means I need to shape reward a certain way to prevent this exact outcome.

Brian 21:11
I think I've seen something like this, where I wish I could remember this specific circumstance, where, like, you get this unanticipated behavior, where you're trying to get things that, like, simulate natural systems, where the little agents just won't move at all, because there's too much punishment for them to move off their block, so it's maladaptive for them to develop the behavior you're hoping for, right?

Prithvi 21:29
Exactly.

Brian 21:30
Okay.
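
The "agents that won't move" example above comes down to simple arithmetic on the reward. Here is a minimal sketch, assuming a toy setup where staying put earns exactly zero and each step toward a goal costs a fixed penalty; the numbers and function names are hypothetical.

```python
def total_reward(steps_to_goal, goal_reward, step_penalty):
    """Reward collected by walking straight to the goal."""
    return goal_reward - step_penalty * steps_to_goal

def best_behavior(steps_to_goal, goal_reward, step_penalty):
    """Compare walking with staying put, which earns zero in this toy setup."""
    walk = total_reward(steps_to_goal, goal_reward, step_penalty)
    return "walk to goal" if walk > 0.0 else "stay put"

# Mild movement penalty: walking still pays off.
mild = best_behavior(steps_to_goal=10, goal_reward=5.0, step_penalty=0.1)
# Harsh movement penalty: the reward-maximizing behavior is to never move.
harsh = best_behavior(steps_to_goal=10, goal_reward=5.0, step_penalty=1.0)
print(mild, "/", harsh)
```

Nothing about the learner changes between the two cases. Only the shape of the reward does, which is why reward shaping gets so much attention.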

Jason Wallace 21:31
Yeah. And we'll talk more about that next week. So, to wrap up, computers have been beating top-ranked players at Go for 10 years now, with a few exceptions. I did find one instance where, in 2023, there was an amateur but well-ranked Go player who did manage to beat a computer program, KataGo, 14 to one, but he did it by exploiting a bug that took a different computer program playing millions of games against KataGo to identify, so I don't know that that actually counts as us winning. So, at this point, I think Go has been won by the machines, and interestingly, at this point we leave the realm of board games. Go seems to be the pinnacle of what we want to make a computer play, as far as board games go. So, join us next week, we're going to talk about moving into the realm of computer games, especially things like Minecraft, and applying large language models and generative AI, the current state of the art for solving these. But for now we'll say thank you again to Prithvi for coming on.

Prithvi 22:23
Thanks

Jason Wallace 22:23
And for our listeners, have a great week and great games,

Brian 22:26
and have fun playing dice with the universe. See you.

Jason Wallace 22:32
This has been the Gaming with Science podcast. Copyright 2026. Listeners are free to reuse this recording for any noncommercial purpose, as long as credit is given to Gaming with Science. This podcast is produced with support from the University of Georgia. All opinions are those of the hosts and do not imply endorsement by the sponsors. If you wish to purchase any of the games we talked about, we encourage you to do so through your friendly local game store. Thank you, and have fun playing dice with the universe.

Transcribed by https://otter.ai
