The speaker shares his experiences of transitioning from academic life in England to Carnegie Mellon, where he was inspired by the intensity and future-oriented mindset of his peers. He discusses his initial disappointment with neuroscience and philosophy studies, which led him to AI, a field he found more engaging. His journey was significantly influenced by books by Donald Hebb and John von Neumann, and collaborations with figures such as Terry Sejnowski and Peter Brown enriched his understanding of both neural networks and linguistic models.
The discussion highlights the impact of intuitive talent like Ilya Sutskever, and explores how inspirations from neuroscience contributed to AI advancements. Despite challenges, breakthroughs in neural networks and language models have emerged, emphasizing the importance of scale over clever applications. The speaker reflects on the creative potential of models like AlphaGo and GPT-4, drawing parallels between human and machine reasoning. He shares insights on using GPUs for neural network training and hints at possible future developments, including multimodal capabilities and fast weights.
Main takeaways from the video:
Key Vocabularies and Common Phrases:
1. intuition [ɪntuˈɪʃən] - (n.) - The ability to understand or know something immediately, based on your feelings rather than facts. - Synonyms: (insight, instinct, perception)
What do you think had enabled those intuitions for Ilya?
2. neurons [ˈnʊrɑnz] - (n.) - Cells within the nervous system that transmit information to other nerve cells, muscle, or gland cells. - Synonyms: (nerve cells, brain cells)
All they taught us was how neurons conduct action potentials, which is very interesting, but it doesn't tell you how the brain works.
3. conviction [kənˈvɪkʃən] - (n.) - A firmly held belief or opinion. - Synonyms: (belief, certainty, assurance)
And did you get that conviction that these ideas would work out at that point, or what was your intuition back at the Edinburgh days?
4. collaborations [kəˌlæbəˈreɪʃənz] - (n.) - The action of working with someone to produce or create something. - Synonyms: (partnerships, alliances, cooperations)
What collaborations do you remember from that time?
5. backpropagation [ˈbækˌprɑpəˌɡeɪʃən] - (n.) - A method used in artificial neural networks to calculate the gradient of the loss function with respect to the weights of the network. - Synonyms: (neural network training, gradient descent)
So we talked for a bit and I gave him a paper to read, which was the nature paper on backpropagation
6. algorithm [ˈælɡəˌrɪðəm] - (n.) - A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer. - Synonyms: (procedure, formula, method)
And what was your split between studying the ideas from neuroscience and just doing what seemed to be good algorithms for AI?
7. optimization [ˌɑptəməˈzeɪʃən] - (n.) - The action of making the best or most effective use of a situation or resource. - Synonyms: (enhancement, improvement, refinement)
I just don't understand why you don't give the gradient to a sensible function optimizer.
8. multimodality [ˌmʌltɪmoʊˈdælɪti] - (n.) - The ability to integrate and interpret information from various kinds of stimuli or modes. - Synonyms: (multi-sensory, multi-channel)
What do you think about multimodality?
9. cognition [kɑɡˈnɪʃən] - (n.) - The mental action or process of acquiring knowledge and understanding through thought, experience, and the senses. - Synonyms: (awareness, comprehension, perception)
I used to think we would do a lot of cognition without needing language at all.
10. philosophical [ˌfɪləˈsɑfɪkəl] - (adj.) - Relating or devoted to the study of the fundamental nature of knowledge, reality, and existence. - Synonyms: (theoretical, abstract, metaphysical)
There's really a philosophical point that you could learn a very good model from language alone.
Exploring the Mind's Language Through Neural Networks
Have you reflected a lot on how to select talent, or has that mostly been, like, intuitive to you? Ilya just shows up and you're like, this is a clever guy, let's work together. Or have you thought a lot about that? Should we roll this? Yeah, let's roll this. Okay, sound is working.
So I remember when I first got to Carnegie Mellon from England. In England, at a research unit, it would get to be six o'clock and you'd all go for a drink in the pub. At Carnegie Mellon, I remember after I'd been there a few weeks, it was Saturday night. I didn't have any friends yet, and I didn't know what to do. So I decided I'd go into the lab and do some programming, because I had a Lisp machine and you couldn't program it from home. So I went into the lab at about nine o'clock on a Saturday night, and it was swarming. All the students were there, and they were all there because what they were working on was the future. They all believed that what they did next was going to change the course of computer science, and it was just so different from England. And so that was very refreshing.
Take me back to the very beginning, Geoff at Cambridge, trying to understand the brain. What was that like? It was very disappointing. So I did physiology, and in the summer term, they were going to teach us how the brain worked, and all they taught us was how neurons conduct action potentials, which is very interesting, but it doesn't tell you how the brain works. So that was extremely disappointing. I switched to philosophy, then I thought maybe they'd tell us how the mind worked. That was very disappointing. I eventually ended up going to Edinburgh to do AI, and that was more interesting. At least you could simulate things so you could test out theories.
And do you remember what intrigued you about AI? Was it a paper? Was it any particular person that exposed you to those ideas? I guess it was a book I read by Donald Hebb that influenced me a lot. He was very interested in how you learn the connection strengths in neural nets. I also read a book by John von Neumann early on, who was very interested in how the brain computes and how it's different from normal computers. And did you get that conviction that these ideas would work out at that point, or what was your intuition back at the Edinburgh days? It seemed to me there has to be a way that the brain learns, and it's clearly not by having all sorts of things programmed into it and then using logical rules of inference. That just seemed to me crazy from the outset. So we had to figure out how the brain learned to modify connections in a neural net so that it could do complicated things. And von Neumann believed that. Turing believed that. So von Neumann and Turing were both pretty good at logic, but they didn't believe in this logical approach.
And what was your split between studying the ideas from neuroscience and just doing what seemed to be good algorithms for AI? How much inspiration did you take early on? So I never did that much study of neuroscience. I was always inspired by what I learned about how the brain works: that there's a bunch of neurons, they perform relatively simple operations, they're non-linear, but they collect inputs, they weight them, and then they give an output that depends on that weighted input. And the question is, how do you change those weights to make the whole thing do something good? It seems like a fairly simple question.
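A minimal sketch of that picture, assuming nothing beyond NumPy: a single unit collects its inputs, weights them, passes the weighted sum through a non-linearity, and the weights are changed by gradient steps so the whole thing does something good. The toy task and all the numbers here are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: output 1 when the inputs sum to something positive.
X = rng.normal(size=(200, 5))
y = (X.sum(axis=1) > 0).astype(float)

w = rng.normal(scale=0.1, size=5)   # connection strengths (the weights)
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    p = sigmoid(X @ w + b)           # weight the inputs, apply a non-linearity
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
    grad_b = (p - y).mean()
    w -= lr * grad_w                 # change the weights to do a bit better
    b -= lr * grad_b

print("training accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```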
What collaborations do you remember from that time? The main collaboration I had at Carnegie Mellon was with someone who wasn't at Carnegie Mellon. I was interacting a lot with Terry Sejnowski, who was in Baltimore, at Johns Hopkins. And about once a month, either he would drive to Pittsburgh or I would drive to Baltimore. It's 250 miles away, and we would spend a weekend together working on Boltzmann machines. That was a wonderful collaboration. We were both convinced it was how the brain worked. That was the most exciting research I've ever done, and a lot of technical results came out that were very interesting. But I think it's not how the brain works.
I also had a very good collaboration with Peter Brown, who was a very good statistician, and he worked on speech recognition at IBM. And then he came as a more mature student to Carnegie Mellon just to get a PhD. But he already knew a lot. He taught me a lot about speech, and he, in fact, taught me about hidden Markov models. I think I learned more from him than he learned from me. That's the kind of student you want. When he taught me about hidden Markov models, I was doing backprop with hidden layers, only they weren't called hidden layers then, and I decided that the name they use in hidden Markov models is a great name for variables that you don't know what they're up to. And so that's where the name hidden in neural nets came from. Me and Peter decided that was a great name for the hidden layers in neural nets. But I learned a lot from Peter about speech.
Take us back to when Ilya showed up at your office. I was in my office, probably on a Sunday, and I was programming, I think, and there was a knock on the door. Not just any knock, but a sort of urgent knock. So I went and answered the door, and there was this young student there, and he said he was cooking fries over the summer, but he'd rather be working in my lab. And so I said, well, why don't you make an appointment and we'll talk? And so Ilya said, how about now? And that sort of was Ilya's character. So we talked for a bit and I gave him a paper to read, which was the Nature paper on backpropagation. And we made another meeting for a week later, and he came back and he said, I didn't understand it, and I was very disappointed. I thought he seemed like a bright guy, but it's only the chain rule, it's not that hard to understand. And he said, oh, no, no, I understood that. I just don't understand why you don't give the gradient to a sensible function optimizer, which took us quite a few years to think about.
And it kept on like that with Ilya. His raw intuitions about things were always very good. What do you think had enabled those intuitions for Ilya? I don't know. I think he always thought for himself. He was always interested in AI from a young age. He's obviously good at math, but it's very hard to know. And what was that collaboration between the two of you like? What part would you play and what part would Ilya play? It was a lot of fun. I remember one occasion when we were trying to do a complicated thing with producing maps of data, where I had a kind of mixture model, so you could take the same bunch of similarities and make two maps, so that in one map bank could be close to greed and in another map bank could be close to river, because in one map you can't have it close to both, right? Because river and greed are a long way apart. But we'd have a mixture of maps, and we were doing it in Matlab. And this involved a lot of reorganization of the code to do the right matrix multiplies.
And then you got fed up with that. So he came one day and said, I'm going to write an interface for Matlab, so I program in this different language and then I have something that just converts it into Matlab. And I said, no, Ilya, that'll take you a month to do. We've got to get on with this project. Don't get diverted by that. And Ilya said, it's okay, I did it this morning. That's quite incredible. And throughout those years, the biggest shift wasn't necessarily just the algorithms, but also the scale. How did you sort of view that scale over the years? Ilya got that intuition very early. So Ilya was always preaching that you just make it bigger and it'll work better. And I always thought that was a bit of a cop out, that you're going to have to have new ideas too. It turns out Ilya was basically right. New ideas help. Things like transformers helped a lot, but it was really the scale of the data and the scale of the computation. And back then, we had no idea computers would get like a billion times faster. We thought maybe they'd get 100 times faster. We were trying to do things by coming up with clever ideas that would have just solved themselves if we'd had bigger scale of the data and computation.
In about 2011, Ilya and another graduate student called James Martens and I had a paper using character-level prediction. So we took Wikipedia and we tried to predict the next HTML character, and that worked remarkably well. And we were always amazed at how well it worked. And that was using a fancy optimizer on GPUs, and we could never quite believe that it understood anything, but it looked as though it understood, and that just seemed incredible.
Can you take us through, how are these models trained to predict the next word, and why is it the wrong way of thinking about them? Okay, I don't actually believe it is the wrong way. So, in fact, I think I made the first neural net language model that used embeddings and backpropagation. So it's very simple data, just triples, and it was turning each symbol into an embedding, then having the embeddings interact to predict the embedding of the next symbol, and then from that, predict the next symbol, and then it was backpropagating through that whole process to learn these triples, and I showed it could generalize.
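A toy in the same spirit as that description, not a reconstruction of the original model: each symbol gets an embedding, the two input embeddings interact through a hidden layer to predict the third symbol, and backpropagation trains everything, including the embeddings themselves. The rule generating the triples and all the sizes here are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "triples" (a, b, c): the third symbol is determined by the first two.
V, D, H = 8, 8, 64            # vocabulary size, embedding size, hidden size
A = rng.integers(0, V, 500)
B = rng.integers(0, V, 500)
C = (A + B) % V

E  = rng.normal(scale=0.1, size=(V, D))      # one embedding vector per symbol
W1 = rng.normal(scale=0.1, size=(2 * D, H))  # lets the two embeddings interact
W2 = rng.normal(scale=0.1, size=(H, V))      # predicts the next symbol
lr = 1.0

for step in range(5000):
    x = np.concatenate([E[A], E[B]], axis=1)       # embed both input symbols
    h = np.tanh(x @ W1)                            # interaction layer
    logits = h @ W2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Backpropagate the next-symbol cross-entropy through the whole stack,
    # all the way into the embeddings themselves.
    d_logits = p.copy()
    d_logits[np.arange(len(C)), C] -= 1
    d_logits /= len(C)
    d_h = (d_logits @ W2.T) * (1 - h ** 2)
    d_x = d_h @ W1.T
    W2 -= lr * (h.T @ d_logits)
    W1 -= lr * (x.T @ d_h)
    np.add.at(E, A, -lr * d_x[:, :D])
    np.add.at(E, B, -lr * d_x[:, D:])

print("next-symbol accuracy:", (p.argmax(axis=1) == C).mean())
```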
About ten years later, Yoshua Bengio used a very similar network and showed it worked with real text. And about ten years after that, linguists started believing in embeddings. It was a slow process. The reason I think it's not just predicting the next symbol is, if you ask, well, what does it take to predict the next symbol? Particularly if you ask me a question, and then the first word of the answer is the next symbol, you have to understand the question. So I think by predicting the next symbol, it's very unlike old-fashioned autocomplete. With old-fashioned autocomplete, you'd store sort of triples of words, and then if you saw a pair of words, you'd see how often different words came third, and that way you could predict the next symbol. And that's what most people think autocomplete is like.
It's no longer at all like that. To predict the next symbol, you have to understand what's been said. So I think you're forcing it to understand by making it predict the next symbol. And I think it's understanding in much the same way we are. So a lot of people will tell you these things aren't like us, they're just predicting the next symbol. They're not reasoning like us, but actually, in order to predict the next symbol, it's going to have to do some reasoning. And we've seen now that if you make big ones without putting in any special stuff to do reasoning, they can already do some reasoning. And I think as you make them bigger, they're going to be able to do more and more reasoning.
Do you think I'm doing anything else than predicting the next symbol right now? I think that's how you're learning. I think you're predicting the next video frame, you're predicting the next sound. But I think that's a pretty plausible theory of how the brain is learning. What enables these models to learn such a wide variety of fields? What these big language models are doing is they're looking for common structure, and by finding common structure, they can encode things using the common structure, and that's more efficient.
So let me give you an example. If you ask GPT-4, why is a compost heap like an atom bomb? Most people can't answer that. Most people haven't thought of it. They think atom bombs and compost heaps are very different things. But GPT-4 will tell you, well, the energy scales are very different and the time scales are very different, but the thing that's the same is that when the compost heap gets hotter, it generates heat faster, and when the atom bomb produces more neutrons, it produces more neutrons faster. And so it gets the idea of a chain reaction. And I believe it's understood they're both forms of chain reaction. It's using that understanding to compress all that information into its weights.
And if it's doing that, then it's going to be doing that for hundreds of things where we haven't seen the analogies yet, but it has, and that's where you get creativity from, from seeing these analogies between apparently very different things. And so I think GPT-4 is going to end up, when it gets bigger, being very creative. I think this idea that it's just regurgitating what it's learned, just pastiching together text it's learned already, is completely wrong. It's going to be even more creative than people.
I think you'd argue that it won't just repeat the human knowledge we've developed so far, but could also progress beyond that. I think that's something we haven't quite seen yet. We've started seeing some examples of it, but to a large extent, we're sort of still at the current level of science. What do you think will enable it to go beyond that? Well, we've seen that in more limited contexts. If you take AlphaGo, in that famous competition with Lee Sedol, there was move 37, where AlphaGo made a move that all the experts said must have been a mistake, but actually later they realized it was a brilliant move. So that was creative within that limited domain. I think we'll see a lot more of that as these things get bigger. The difference with AlphaGo as well was that it was using reinforcement learning that subsequently sort of enabled it to go beyond the current state.
So it started with imitation learning, watching how humans play the game, and then it would, through self play, develop way beyond that. Do you think that's the missing component of the current systems? I think that may well be a missing component, yes. The self play in AlphaGo and AlphaZero is a large part of why it could make these creative moves, but I don't think it's entirely necessary.
So there's a little experiment I did a long time ago where you're training a neural net to recognize handwritten digits. I love that example, the MNIST example. And you give it training data where half the answers are wrong. And the question is, how well will it learn? And you make half the answers wrong once and keep them like that, so it can't average away the wrongness by just seeing the same example with the right answer sometimes and the wrong answer sometimes. For half of the examples, whenever it sees that example, the answer is always wrong. And so the training data has 50% error. But if you train it up with backpropagation, it gets down to 5% error or less.
In other words, from badly labeled data, it can get much better results. It can see that the training data is wrong. And that's how smart students can be smarter than their advisor: their advisor tells them all this stuff, and for half of what their advisor tells them, they think it's rubbish, and they listen to the other half, and then they end up smarter than the advisor. So these big neural nets can actually do much better than their training data, and most people don't realize that.
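A small stand-in for the experiment described above, not Hinton's original MNIST setup: it uses scikit-learn's built-in 8x8 digits so it runs without downloads, corrupts half of the training labels once and keeps them corrupted, and checks that the test error ends up far below the roughly 50% label-noise rate. The exact numbers will not match the MNIST figures quoted in the conversation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Corrupt half of the training labels once, and keep them corrupted,
# so the network cannot average the wrongness away.
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.5
y_noisy[flip] = rng.integers(0, 10, flip.sum())   # random (mostly wrong) labels

net = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
net.fit(X_train, y_noisy)

print("fraction of training labels corrupted:", flip.mean())
print("test error:", 1 - net.score(X_test, y_test))   # far below the noise rate
```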
So how do you expect these models to add reasoning into them? I mean, one approach is that you add sort of heuristics on top of them, which a lot of the research is doing now, where you have sort of chain of thought and you just feed its reasoning back into it. And another way would be in the model itself, as you scale it up. What's your intuition around that? So my intuition is that as we scale up these models, they get better at reasoning. And if you ask how people work, roughly speaking, we have these intuitions and we can do reasoning, and we use the reasoning to correct our intuitions.
Of course, we use the intuitions during the reasoning to do the reasoning. But if the conclusion of the reasoning conflicts with our intuitions, we realize the intuitions need to be changed. That's much like in AlphaGo or AlphaZero, where you have an evaluation function that just looks at a board and says, how good is that for me? But then you do the Monte Carlo rollout, and now you get a more accurate idea and you can revise your evaluation function. So you can train it by getting it to agree with the results of reasoning. And I think these large language models have to start doing that. They have to start training their raw intuitions about what should come next by doing reasoning and realizing that's not right. And so that way they can get more training data than just mimicking what people did. And that's exactly why AlphaGo could do this creative move 37. It had much more training data, because it was using reasoning to check out what the right next move should have been.
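A toy illustration of that loop, nothing like AlphaGo itself: a fast evaluation (a table of state values standing in for the intuition) is repeatedly nudged toward the outcomes of slow Monte Carlo rollouts (the reasoning), so the quick judgment comes to agree with the results of the longer computation. The random-walk game and the numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "board": a random walk on states 0..10; reaching 10 is a win, 0 a loss.
N = 11
value = np.full(N, 0.5)        # the "intuition": a quick evaluation of each state
value[0], value[-1] = 0.0, 1.0

def rollout(state):
    """Slow 'reasoning': play random moves to the end and report the outcome."""
    while 0 < state < N - 1:
        state += rng.choice([-1, 1])
    return float(state == N - 1)

# Train the fast evaluation to agree with the results of the rollouts.
lr = 0.05
for _ in range(20000):
    s = rng.integers(1, N - 1)
    value[s] += lr * (rollout(s) - value[s])

print(np.round(value, 2))      # approaches the true win probabilities, s / 10
```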
And what do you think about multimodality? So we spoke about these analogies, and often the analogies are way beyond what we could say. It's discovering analogies that are far beyond humans, and at maybe abstraction levels that we'll never be able to understand. Now, when we introduce images to that, and video and sound, how do you think that will change the models? And how do you think it will change the analogies that it will be able to make? I think it'll change it a lot. I think it'll make it much better at understanding spatial things, for example. From language alone, it's quite hard to understand some spatial things, although remarkably, GPT-4 can do that even before it was multimodal. But when you make it multimodal, if you have it both doing vision and reaching out and grabbing things, it'll understand objects much better if it can pick them up and turn them over and so on. So, although you can learn an awful lot from language, it's easier to learn if you are multimodal. And in fact, you then need less language. And there's an awful lot of YouTube video for predicting the next frame or something like that. So I think these multimodal models are clearly going to take over. You can get more data that way. They need less language. So there's really a philosophical point that you could learn a very good model from language alone, but it's much easier to learn it from a multimodal system.
And how do you think it will impact the model's reasoning? I think it'll make it much better at reasoning about space, for example, reasoning about what happens if you pick objects up. If you actually try picking objects up, you're going to get all sorts of training data that's going to help. Do you think that the human brain evolved to work well with language, or do you think language evolved to work well with the human brain? I think the question of whether language evolved to work with the brain or the brain evolved to work with language is a very good question. I think both happened. I used to think we would do a lot of cognition without needing language at all. Now I've changed my mind a bit.
So let me give you three different views of language and how it relates to cognition. There's the old-fashioned symbolic view, which is that cognition consists of having strings of symbols in some kind of cleaned-up logical language, where there's no ambiguity, and applying rules of inference. And that's what cognition is: it's just these symbolic manipulations on things that are like strings of language symbols. So that's one extreme view. An opposite extreme view is: no, no, once you get inside the head, it's all vectors. So symbols come in, you convert those symbols into big vectors, and all the stuff inside is done with big vectors, and then if you want to produce output, you produce symbols again. So there was a point in machine translation, in about 2014, when people were using recurrent neural nets, and words would keep coming in, and they'd have a hidden state, and they'd keep accumulating information in this hidden state. So when they got to the end of a sentence, they'd have a big hidden vector that captured the meaning of that sentence, which could then be used for producing the sentence in another language. That was called a thought vector, and that's a sort of second view of language.
You convert the language into a big vector that's nothing like language, and that's what cognition's all about. But then there's a third view, which is what I believe now, which is that you take these symbols and you convert the symbols into embeddings, and you use multiple layers of that, so you get these very rich embeddings. But the embeddings are still tied to the symbols, in the sense that you've got a big vector for this symbol and a big vector for that symbol, and these vectors interact to produce the vector for the symbol for the next word. And that's what understanding is. Understanding is knowing how to convert the symbols into these vectors and knowing how the elements of the vectors should interact to predict the vector for the next symbol. That's what understanding is, both in these big language models and in our brains. And that's an example which is sort of in between. You're staying with the symbols, but you're interpreting them as these big vectors. And that's where all the work is and all the knowledge is: in what vectors you use and how the elements of those vectors interact, not in symbolic rules. But it's not saying that you get away from the symbols altogether. It's saying you turn the symbols into big vectors, but you stay with that surface structure of the symbols. And that's how these models are working, and that now seems to me a more plausible model of human thought, too.
You were one of the first folks to get the idea of using GPUs, and I know Jensen loves you for that. Back in 2009, you mentioned that you told Jensen that this could be a quite good idea for training neural nets. Take us back to that early intuition of using GPUs for training neural nets. So, actually, I think in about 2006, I had a former graduate student called Rick Szeliski, who's a very good computer vision guy. And I talked to him at a meeting, and he said, you know, you ought to think about using graphics processing cards, because they're very good at matrix multiplies, and what you're doing is basically all matrix multiplies.
So I thought about that for a bit, and then we learned about these Tesla systems that had four GPUs in them. Initially, we just got gaming GPUs and discovered they made things go 30 times faster. And then we bought one of these Tesla systems with four GPUs, and we did speech on that, and it worked very well. And then in 2009, I gave a talk at NIPS, and I told 1,000 machine learning researchers, you should all go and buy Nvidia GPUs. They're the future. You need them for doing machine learning. And I actually then sent mail to Nvidia saying, I told 1,000 machine learning researchers to buy your boards, could you give me a free one? And they said no. Actually, they didn't say no, they just didn't reply. But when I told Jensen this story later on, he gave me a free one.
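One quick way to see the matrix-multiply point, assuming you have PyTorch installed and, for the GPU half, an NVIDIA card: time the same large matrix multiplication on the CPU and on the GPU. The 30x figure mentioned above was for late-2000s gaming cards; whatever ratio you measure will depend entirely on your hardware.

```python
import time
import torch

def time_matmul(device, n=2048, reps=5):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()        # make sure the GPU is idle before timing
    start = time.time()
    for _ in range(reps):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()        # wait for the GPU to finish the multiplies
    return (time.time() - start) / reps

cpu_t = time_matmul("cpu")
print(f"CPU: {cpu_t:.4f} s per multiply")
if torch.cuda.is_available():
    gpu_t = time_matmul("cuda")
    print(f"GPU: {gpu_t:.4f} s per multiply ({cpu_t / gpu_t:.0f}x faster)")
```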
That's very good. I think what's interesting as well is how GPUs have evolved alongside the field. So where do you think we should go next in compute? So my last couple of years at Google, I was thinking about ways of trying to make analog computation, so that instead of using like a megawatt, we could use like 30 watts, like the brain, and we could run these big language models in analog hardware. And I never made it work, but I started really appreciating digital computation.
So if you're going to use that low power analog computation, every piece of hardware is going to be a bit different. And the idea is the learning is going to make use of the specific properties of that hardware. And that's what happens with people. All our brains are different, so we can't then take the weights in your brain and put them in my brain. The hardware is different. The precise properties of the individual neurons are different. The learning has learned to make use of all that. And so we're mortal in the sense that the weights in my brain are no good for any other brain. When I die, those weights are useless.
We can get information from one to another rather inefficiently: I produce sentences and you figure out how to change your weights so you would have said the same thing. That's called distillation. But that's a very inefficient way of communicating knowledge. And with digital systems, they're immortal, because once you've got some weights, you can throw away the computer, just store the weights on a tape somewhere, and now build another computer, put those same weights in, and if it's digital, it can compute exactly the same thing as the other system did. So digital systems can share weights, and that's incredibly much more efficient. If you've got a whole bunch of digital systems and they each go and do a tiny bit of learning, and they start with the same weights, they do a tiny bit of learning and then they share their weights again. They all know what all the others learned. We can't do that. And so they're far superior to us in being able to share knowledge.
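A toy NumPy sketch of that sharing idea, not how production systems actually do it (large training runs typically share gradients rather than finished weights): several copies start from identical weights, each does a tiny bit of learning on its own fresh data, and then they average their weights so every copy knows what all the others learned. The linear-regression task is invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=10)            # the relationship all copies are learning

def local_step(w, n=100, lr=0.1):
    """One replica does a tiny bit of learning on its own data."""
    X = rng.normal(size=(n, 10))
    y = X @ true_w
    grad = X.T @ (X @ w - y) / n
    return w - lr * grad

n_replicas = 8
w = np.zeros(10)                        # every copy starts with identical weights
for _ in range(50):
    replicas = [local_step(w.copy()) for _ in range(n_replicas)]
    w = np.mean(replicas, axis=0)       # share what each copy learned by averaging

print("distance to the target weights:", np.linalg.norm(w - true_w))
```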
A lot of the ideas that have been deployed in the field are very old school ideas. It's the ideas that have been around in neuroscience forever. What do you think is sort of left to apply to the systems that we develop? So, one big thing that we still have to catch up with neuroscience on is the timescales for changes. So, in nearly all the neural nets, there's a fast timescale for changing activities. So input comes in, and the activities, the embedding vectors, all change. And then there's a slow timescale, which is changing the weights, and that's long-term learning. And you just have those two timescales.
In the brain, there are many timescales at which weights change. So, for example, if I say an unexpected word like cucumber, and now, five minutes later, you put headphones on, there's a lot of noise, and there are very faint words, you'll be much better at recognizing the word cucumber because I said it five minutes ago. So where is that knowledge in the brain? That knowledge is obviously in temporary changes to synapses. It's not neurons that are going, cucumber, cucumber, cucumber. You don't have enough neurons for that. It's in temporary changes to the weights. And you can do a lot of things with temporary weight changes, what I call fast weights. We don't do that in these neural models. And the reason we don't do it is because if you have temporary changes to the weights that depend on the input data, then you can't process a whole bunch of different cases at the same time.
At present, we take a whole bunch of different strings, we stack them together, and we process them all in parallel, because then we can do matrix-matrix multiplies, which is much more efficient. And just that efficiency is stopping us using fast weights. But the brain clearly uses fast weights for temporary memory, and there's all sorts of things you can do that way that we don't do at present. I think that's one of the biggest things we have to learn. I was very hopeful that things like Graphcore, if they went sequential and did just online learning, then they could use fast weights. But that hasn't worked out yet. I think it'll work out eventually when people are using conductances for weights.
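A minimal sketch of one version of the fast-weights idea, loosely along the lines of the fast associative memory Hinton has written about, but heavily simplified and with made-up numbers: each input leaves a quickly decaying outer-product trace in a temporary weight matrix, and a later faint, noisy cue for a recently seen word is cleaned up by that temporary memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50
decay, write_rate = 0.95, 0.5

A = np.zeros((D, D))       # fast weights: temporary, input-dependent weight changes

def observe(h):
    """Each input leaves a quickly decaying trace in the fast weights."""
    global A
    A = decay * A + write_rate * np.outer(h, h)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cucumber = rng.normal(size=D)        # the pattern for an unexpected word
for _ in range(3):
    observe(rng.normal(size=D))      # some unrelated inputs first
observe(cucumber)                    # "cucumber" leaves a temporary trace
for _ in range(3):
    observe(rng.normal(size=D))      # more unrelated inputs afterwards

noisy_cue = cucumber + 2.0 * rng.normal(size=D)    # a faint, noisy "cucumber"
print("raw cue vs stored word:       ", round(cosine(noisy_cue, cucumber), 2))
print("cleaned-up cue vs stored word:", round(cosine(A @ noisy_cue, cucumber), 2))
```

With most random seeds the cleaned-up cue is noticeably closer to the stored pattern than the raw noisy cue, which is the flavor of the cucumber example: a temporary change to the weights, not sustained activity, is what makes the recently heard word easier to recognize.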
How has knowing how these models work and knowing how the brain works impacted the way you think? I think there's been one big impact, which is at a fairly abstract level, which is that for many years, people were very scornful about the idea of having a big random neural net and just giving it a lot of training data, and it would learn to do complicated things. If you talk to statisticians or linguists or most people in AI, they say that's just a pipe dream. There's no way you're going to learn really complicated things without some kind of innate knowledge, without a lot of architectural restrictions. It turns out that's completely wrong.
You can take a big, random neural network and you can learn a whole bunch of stuff just from data. So the idea that stochastic gradient descent, repeatedly adjusting the weights using a gradient, will learn things, and will learn big, complicated things, has been validated by these big models. And that's a very important thing to know about the brain. It doesn't have to have all this innate structure. Now, obviously, it's got a lot of innate structure, but it certainly doesn't need innate structure for things that are easily learned. And so the sort of idea coming from Chomsky, that you won't learn anything complicated, like language, unless it's all kind of wired in already and just matures, that idea is now clearly nonsense.
I'm sure Chomsky would appreciate you calling his ideas nonsense. Well, actually, I think a lot of Chomsky's political ideas are very sensible. I'm always struck by how come someone with such sensible ideas about the Middle East could be so wrong about linguistics. What do you think would make these models simulate consciousness of humans more effectively? Imagine you had the AI system that you've spoken to your entire life, instead of that being like ChatGPT today, which sort of deletes the memory of the conversation and you start fresh all of the time, and it had self-reflection. At some point, you pass away and you tell that to the assistant. Do you think... It's not me, somebody else tells that to the assistant? Yeah, it would be difficult for you to tell that to the assistant.
Do you think that assistant would feel something at that point? Yes, I think they can have feelings, too. So I think just as we have this inner theatre model for perception, we have an inner theatre model for feelings: there are things that I can experience but other people can't. I think that model is equally wrong. So suppose I say I feel like punching Gary on the nose, which I often do. Let's try and abstract that away from the idea of an inner theatre. What I'm really saying to you is, if it weren't for the inhibition coming from my frontal lobes, I would perform an action. So when we talk about feelings, we're really talking about actions we would perform if it weren't for constraints. And that's really what feelings are: they're actions we would do if it weren't for constraints. So I think you can give the same kind of explanation for feelings, and there's no reason why these things can't have feelings.
In fact, in 1973, I saw a robot have an emotion. So in Edinburgh they had a robot with two grippers like this that could assemble a toy car if you put the pieces separately on a piece of green felt. But if you put them in a pile, its vision wasn't good enough to figure out what was going on, so it put its gripper together, went whack, and knocked them so they were scattered, and then it could put them together. If you saw that in a person, you'd say it was cross with the situation because it didn't understand it, so it destroyed it. That's profound.
We spoke previously. You described humans and the LLMs as analogy machines. What do you think have been the most powerful analogies that you've found throughout your life? Oh, throughout my life. Whoo. I guess probably a sort of weak analogy that has influenced me a lot is the analogy between religious belief and belief in symbol processing. So when I was very young, I came from an atheist family, went to school, and was confronted with religious belief. And it just seemed nonsense to me. It still seems nonsense to me. And when I saw symbol processing as an explanation of how people worked, I thought it was just the same nonsense. I don't think it's quite so much nonsense now, because I think actually we do do symbol processing. It's just we do it by giving these big embedding vectors to the symbols. But we are actually doing symbol processing, just not at all in the way people thought.
Where you match symbols, and the only property a symbol has is that it's identical to another symbol or it's not identical. That's the only property a symbol has. We don't do that at all. We use the context to give embedding vectors to symbols, and then use the interactions between the components of these embedding vectors to do thinking. But there's a very good researcher at Google called Fernando Pereira, who said, yes, we do have symbolic reasoning, and the only symbols we have are natural language. Natural language is a symbolic language, and we reason with it. I believe that now. You've done some of the most meaningful research in the history of computer science. Can you walk us through how you select the right problems to work on?
Well, first, let me correct you. Me and my students have done a lot of the most meaningful things, and it's mainly been a very good collaboration with students and my ability to select very good students. And that came from the fact there were very few people doing neural nets in the seventies and eighties and nineties and two thousands. And so the few people doing neural nets got to pick the very best students. So that was a piece of luck. But my way of selecting problems is basically, well, you know, when scientists talk about how they work, they have theories about how they work, which probably don't have much to do with the truth. But my theory is that I look for something where everybody's agreed about something and it feels wrong. Just there's a slight intuition there's something wrong about it. And then I work on that and see if I can elaborate why it is I think it's wrong.
And maybe I can make a little demo with a small computer program that shows that it doesn't work the way you might expect. So let me take one example. Most people think that if you add noise to a neural net, it's going to work worse. If, for example, each time you put a training example through, you make half of the neurons be silent, it'll work worse. Actually, we know it'll generalize better if you do that, and you can demonstrate that in a simple example. That's what's nice about computer simulation. You can show that this idea you had, that adding noise is going to make it worse and dropping out half the neurons will make it work worse, is only true in the short term. If you train it like that, in the end it'll work better. You can demonstrate that with a small computer program, and then you can think hard about why that is, and how it stops big, elaborate co-adaptations.
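A sketch of that kind of demonstration using PyTorch's standard dropout layer rather than whatever code was used originally: two identical small networks are trained on the same small, noisy task, one with half the hidden units silenced at random on every training pass. The data and sizes are made up, and the exact numbers vary with the seed, but with this much room to overfit, the dropout network typically generalizes better even though its training behaviour is noisier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small, noisy binary task that a wide network can easily overfit.
def make_data(n):
    X = torch.randn(n, 20)
    y = (X[:, :2].sum(1) + 0.5 * torch.randn(n) > 0).long()
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(2000)

def train(dropout_p):
    net = nn.Sequential(
        nn.Linear(20, 256), nn.ReLU(),
        nn.Dropout(dropout_p),              # silence hidden units at random
        nn.Linear(256, 2),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(2000):
        opt.zero_grad()
        loss_fn(net(X_train), y_train).backward()
        opt.step()
    net.eval()                              # dropout is switched off at test time
    with torch.no_grad():
        return (net(X_test).argmax(1) == y_test).float().mean().item()

print("test accuracy without dropout:", train(0.0))
print("test accuracy with dropout:   ", train(0.5))
```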
But I think that that's my method of working: find something that sounds suspicious and work on it, and see if you can give a simple demonstration of why it's wrong. What sounds suspicious to you now? That we don't use fast weights. It sounds suspicious that we only have these two timescales. That's just wrong. That's not at all like the brain. And in the long run, I think we're going to have to have many more timescales. So that's an example there. And if you had your group of students today and they came to you and asked the Hamming question that we talked about previously, what's the most important problem in your field, what would you suggest they take on and work on next? We spoke about reasoning, timescales. What would be the highest priority problem that you'd give them?
For me right now, it's the same question I've had for the last, like 30 years or so, which is, does the brain do backpropagation? I believe the brain is getting gradients. If you don't get gradients, your learning is just much worse than if you do get gradients. But how is the brain getting gradients? And is it somehow implementing some approximate version of backpropagation, or is it some completely different technique? That's a big open question. And if I kept on doing research, that's what I would be doing research on. And when you look back at your career now, you've been right about so many things, but what were you wrong about that you wish you sort of spent less time pursuing a certain direction?
Okay, those are two separate questions. One is, what were you wrong about? And two, do you wish you'd spent less time on it? I think I was wrong about Boltzmann machines, and I'm glad I spent a long time on it. It's a much more beautiful theory of how you get gradients than backpropagation. Backpropagation is just ordinary and sensible, and it's just the chain rule. Boltzmann machines are clever, and it's a very interesting way to get gradients. I would love for that to be how the brain works, but I think it isn't. Did you spend much time imagining what would happen once these systems developed as well? Did you ever have an idea that, okay, if we could make these systems work really well, we could, you know, democratize education, we could make knowledge way more accessible, we could solve some tough problems in medicine? Or was it more to you about understanding the brain?
Yes, I sort of feel scientists ought to be doing things that are going to help society, but actually, that's not how you do your best research. You do your best research when it's driven by curiosity. You just have to understand something. Much more recently, I've realized these things could do a lot of harm as well as a lot of good, and I've become much more concerned about the effects they're going to have on society. But that's not what was motivating me. I just wanted to understand how on earth can the brain learn to do things? That's what I want to know. And I sort of failed; as a side effect of that failure, we got some nice engineering.
Yeah, it was a good failure for the world. If you take the lens of the things that could go really right, what do you think are the most promising applications? I think healthcare is clearly a big one. With healthcare, there's almost no end to how much healthcare society can absorb. If you take someone old, they could use five doctors full time. When AI gets better than people at doing things, you'd like it to get better in areas where you could do with a lot more of that stuff, and we could do with a lot more doctors. If everybody had three doctors of their own, that would be great, and we're going to get to that point. So that's one reason why healthcare is good. There's also new engineering: developing new materials, for example, for better solar panels, or for superconductivity, or for just understanding how the body works. There's going to be huge impacts there. Those are all going to be good things.
What I worry about is bad actors using them for bad things. We've facilitated people like Putin or Xi or Trump using AI for killer robots or for manipulating public opinion or for mass surveillance. And those are all very worrying things. Are you ever concerned that slowing down the field could also slow down the positives? Oh, absolutely. And I think there's not much chance that the field will slow down, partly because it's international, and if one country slows down, the other countries aren't going to slow down. So there's a race, clearly, between China and the US, and neither is going to slow down. So, yeah. I mean, there was this petition saying we should slow down for six months.
I didn't sign it, just because I thought it was never going to happen. Maybe I should have signed it, because even though it was never going to happen, it made a political point. It's often good to ask for things you know you can't get, just to make a point. But I didn't think we were going to slow down. And how do you think it will impact the AI research process, having these assistants? I think it'll make it a lot more efficient. AI research will get a lot more efficient when you've got these assistants to help you program, but also help you think through things, and probably help you a lot with equations, too.
Have you reflected much on the process of selecting talent? Has that been mostly intuitive to you? Like when Ilya shows up at the door, you feel, this is a smart guy, let's work together. So, for selecting talent, sometimes you just know. After talking to Ilya for not very long, he seemed very smart, and then talking a bit more, he clearly was very smart and had very good intuitions as well as being good at math. So that was a no-brainer. There's another case where I was at a NIPS conference. We had a poster, and someone came up and started asking questions about the poster, and every question he asked was a sort of deep insight into what we'd done wrong. After five minutes, I offered him a postdoc position. That guy was David MacKay, who was just brilliant. It's very sad he died, but it was very obvious you'd want him.
Other times it's not so obvious. And one thing I did learn was that people are different. There's not just one type of good student. So there's some students who aren't that creative but are technically extremely strong and will make anything work. There's other students who aren't technically strong but are very creative. Of course, you want the ones who are both, but you don't always get that. But I think actually in the lab you need a variety of different kinds of graduate student. But I still go with my gut intuition that sometimes you talk to somebody and they just get it, and those are the ones you want. What do you think is the reason for some folks having better intuition? Do they just have better training data than others? Or how can you develop your intuition?
I think it's partly that they don't stand for nonsense. So here's a way to get bad intuitions: believe everything you're told. That's fatal. You have to be able to reject it. I think here's what some people do. They have a whole framework for understanding reality, and when someone tells them something, they try and sort of figure out how that fits into their framework, and if it doesn't, they just reject it. And that's a very good strategy. People who try and incorporate whatever they're told end up with a framework that's sort of very fuzzy and can believe everything, and that's useless. So I think actually having a strong view of the world and trying to manipulate incoming facts to fit in with your view, obviously it can lead you into deep religious belief and fatal flaws and so on, like my belief in Boltzmann machines. But I think that's the way to go. If you've got good intuitions, you should trust them. If you've got bad intuitions, it doesn't matter what you do, so you might as well trust them.
Very good point. When you look at the types of research that's being done today, do you think we're putting all of our eggs in one basket, and we should diversify our ideas a bit more in the field, or do you think this is the most promising direction, so let's go all in on it? I think having big models and training them on multimodal data, even if it's only to predict the next word, is such a promising approach that we should go pretty much all in on it. Obviously, there's lots and lots of people doing it, and there's lots of people doing apparently crazy things, and that's good. But I think it's fine for, like, most of the people to be following this path, because it's working very well.
Do you think that the learning algorithms matter that much, or is it just the scale? Are there basically millions of ways that we could get to human-level intelligence, or are there a select few that we need to discover? Yes, so this issue of whether particular learning algorithms are very important, or whether there's a great variety of learning algorithms that'll do the job, I don't know the answer. It seems to me, though, that with backpropagation there's a sense in which it's the correct thing to do: getting the gradient so that you change a parameter to make it work better, that seems like the right thing to do, and it's been amazingly successful. There may well be other learning algorithms that are alternative ways of getting that same gradient, or that are getting the gradient of something else, and that also work. I think that's all open. And it's a very interesting issue now whether there are other things you can try and maximize that will give you good systems, and maybe the brain's doing that because it's easier, but backprop is, in a sense, the right thing to do, and we know that doing it works really well.
And one last question. When you look back at your decades of research, what are you most proud of? Is it the students? Is it the research? What makes you most proud when you look back at your life's work? The learning algorithm for Boltzmann machines. So, the learning algorithm for Boltzmann machines is beautifully elegant. It may be hopeless in practice, but it's the thing I enjoyed most, developing that with Terry, and it's what I'm proudest of, even if it's wrong. What questions do you spend most of your time thinking about now? Is it, what should I watch on Netflix?
Artificial Intelligence, Education, Technology, Geoff Hinton, Neural Networks, Collaboration