ENSPIRING.ai: Jim Fan on Nvidia's Embodied AI Lab and Jensen Huang's Prediction that All Robots Will Be Autonomous

The video features an in-depth conversation with Jim Fan, a senior research scientist at Nvidia, highlighting his journey and the current state of embodied AI and humanoid robotics. Fan, who once interned at OpenAI, shares insights from his extensive work on projects like Project Groot, aimed at advancing robotics using Nvidia's computing platforms. He discusses the pivotal role of data, hardware progress, and simulation in moving AI from specialist tasks to broader generalist capabilities.

Fan elaborates on Nvidia's strategies for developing embodied AI, emphasizing the importance of integrating diverse datasets from the internet, simulation, and real-world robot interaction. He sheds light on his experiences with reinforcement learning and projects like "World of Bits," while also discussing the prospects of generalist AI models capable of both virtual- and physical-world tasks. Fan addresses past challenges and current developments at Nvidia that are paving the way for what he calls a "GPT-3 moment" for robotics.

Main takeaways from the video:

💡
The significance of a comprehensive data strategy combining internet, simulation, and real-world data in pursuit of robust humanoid robotics.
💡
The potential for Nvidia's Project Groot to revolutionize robotics by uniting AI models and chip technology.
💡
Achieving a "GPT-3 moment" for robotics requires breakthroughs in system integration, particularly combining reasoning with real-time motor control.

Key Vocabularies and Common Phrases:

1. autonomous [ɔːˈtɒnəməs] - (adjective) - Acting independently or having the freedom to do so. - Synonyms: (independent, self-directed, self-operating)

One of my favorite quotes from him is that everything that moves will eventually be autonomous.

2. humanoid [ˈhjuməˌnɔɪd] - (adjective) - Resembling a human in appearance or behavior. - Synonyms: (human-like, anthropoid, manlike)

We're excited to ask Jim about all things robotics, why now, why humanoids, and what's required to unlock a GPT-3 moment for robotics.

3. embodied AI [ɪmˈbɒdid ˌeɪˈaɪ] - (noun phrase) - Artificial intelligence integrated with a physical form to interact with the real world. - Synonyms: (physical AI, interactive AI, tangible AI)

Jim leads Nvidia's embodied AI agent research with a dual mandate, spanning robotics in the physical world and gameplay agents in the virtual world.

4. simulation [ˌsɪmjʊˈleɪʃn] - (noun) - Imitation of a situation or process, often for training or testing purposes. - Synonyms: (imitation, emulation, reproduction)

So from the chip level, which is the Jetson Thor family, to the foundation model Project Groot, and also to the simulation and the utilities that we built along the way, it will become a platform, a computing platform for humanoid robots, and then also for intelligent robots in general.

5. reinforcement learning [ˌriːɪnˈfɔːrsmənt ˈlɜːrnɪŋ] - (noun phrase) - A type of machine learning algorithm that learns by trial and error using feedback from its actions. - Synonyms: (adaptive learning, reward-based learning, trial-and-error method)

Yes. So back then, the main method that we used was reinforcement learning.

6. generalize [ˈdʒɛnərəˌlaɪz] - (verb) - Make broad statements by inferring from specific cases. - Synonyms: (extend, broaden, extrapolate)

It kind of worked on the tasks that we designed, but it doesn't really generalize.

7. foundation models [faʊnˈdeɪʃən ˈmɒdəlz] - (noun phrase) - Large-scale models that serve as a basis for developing more specific AI applications. - Synonyms: (core models, base models, primary frameworks)

Earlier this year, at the March GTC, at Jensen's keynote, he unveiled something called Project Groot, which is Nvidia's moonshot effort at building foundation models for humanoid robotics.

8. manifold [ˈmænɪˌfəʊld] - (adjective) - Many and various. - Synonyms: (numerous, multiple, diverse)

These are the LLMs and the frontier models that we have already seen these days.

9. speculation [ˌspɛkjəˈleɪʃn] - (noun) - The act of forming theories or opinions without firm evidence. - Synonyms: (guesswork, conjecture, hypothesis)

This is pure speculation, but I hope that we can see a research breakthrough in robot foundation models maybe in the next two to three years.

10. scale [skeɪl] - (verb / noun) - To change size or to climb; also, a device or factor used in rating. - Synonyms: (size up, expand, adjust)

All of these foundation models require a lot of compute to scale up.

Jim Fan on Nvidia's Embodied AI Lab and Jensen Huang's Prediction that All Robots Will Be Autonomous

So from the chip level, which is the Jetson Thor family, to the foundation model Project Groot, and also to the simulation and the utilities that we built along the way, it will become a platform, a computing platform for humanoid robots, and then also for intelligent robots in general. So I want to quote Jensen here. One of my favorite quotes from him is that everything that moves will eventually be autonomous, and I believe in that as well. It's not right now, but let's say ten years or more from now. If we believe that there will be as many intelligent robots as iPhones, then we'd better start building that today.

Hi, and welcome to Training Data. We have with us today Jim Fan, senior research scientist at Nvidia. Jim leads Nvidia's embodied AI agent research with a dual mandate, spanning robotics in the physical world and gameplay agents in the virtual world. Jim's group is responsible for Project Groot, Nvidia's humanoid robots that you may have seen on stage with Jensen at this year's GTC. We're excited to ask Jim about all things robotics, why now, why humanoids, and what's required to unlock a GPT-3 moment for robotics.

Welcome to Training Data. Thank you for having me. We're so excited to dig in today and learn about everything you have to share with us around robotics and embodied AI. Before we get there, you have a fascinating personal story. I think you were the first intern at OpenAI. Maybe walk us through some of your personal story and how you got to where you are.

Absolutely. I would love to share the story with the audience. So back in the summer of 2016, some of my friends said, there's a new startup in town and you should check it out. And I'm like, huh, I don't have anything else to do, because I had been accepted to a PhD program and that summer I was idle. So I decided to join this startup, and that turned out to be OpenAI. During my time at OpenAI, we were already talking about AGI back in 2016, and back then my intern mentors were Andrej Karpathy and Ilya Sutskever, and we discussed a project together called World of Bits.

So the idea is very simple. We want to build an AI agent that can read computer screens, read the pixels from the screens, and then control the keyboard and mouse. If you think about it, this interface is as general as it can get. All the things that we do on a computer, like replying to emails or playing games or browsing the web, can all be done in this interface, mapping pixels to keyboard and mouse control. So that was actually my first attempt at AGI, at OpenAI, and also the first chapter of my journey in AI agents.

I remember World of Bits, actually. I didn't know that you were a part of that. That's really interesting. Yeah, yeah. It was a very fun project and was part of a bigger initiative called OpenAI Universe, which was a bigger platform for integrating all kinds of applications and games into this framework.

What do you think were some of the unlocks then? And then also, what do you think were some of the challenges that you had with agents back then? Yes. So back then, the main method that we used was reinforcement learning. There was no LLM, no transformer back in 2016. And the thing is, reinforcement learning works on specific tasks, but it doesn't generalize. We couldn't give the agent arbitrary language instructions to do the arbitrary things that we can do with a keyboard and mouse.

So back then, it kind of worked on the tasks that we designed, but it doesn't really generalize. So that started my next chapter, which is I went to Stanford and started my PhD with Professor Fei-Fei Li, and we started working on computer vision and also embodied AI. And during my time at Stanford, which was from 2016 to 2021, I witnessed the transition of the Stanford Vision Lab, led by Fei-Fei, from static computer vision, like recognizing images and videos, to more embodied computer vision, where an agent learns perception and takes actions in an interactive environment.

And this environment can be virtual, as in simulation, or it can be the physical world. So that was my PhD, transitioning to embodied AI. And then after I graduated from my PhD, I joined Nvidia and have stayed there ever since. So I carried over the work from my PhD thesis to Nvidia and still work on embodied AI to this day.

So you oversee the embodied AI initiative at Nvidia, maybe say a word on what that means and what you all are hoping to accomplish. Yes. So the team that I am co-leading right now is called GEAR, which stands for Generalist Embodied Agent Research. And to summarize what our team works on in three words: we generate actions, because we build embodied AI agents, and those agents take actions in different worlds.

And if the actions are taken in the virtual world, that would be gaming AI and simulation. And if the actions are taken in the physical world, that will be robotics. Actually, earlier this year, at the March GTC, at Jensen's keynote, he unveiled something called Project Groot, which is Nvidia's moonshot effort at building foundation models for humanoid robotics. And that's basically what the GEAR team is focusing on right now. We want to build the AI brain for humanoid robots and even beyond.

What do you think is Nvidia's competitive advantage in building that? Yeah, that's a great question. So, well, one is for sure compute resources. All of these foundation models require a lot of compute to scale up. And we do believe in scaling laws. There were scaling laws for LLMs, but the scaling laws for embodied AI and robotics are yet to be studied, so we're working on that. And the second strength of Nvidia is actually simulation.

So, before Nvidia was an AI company, it was a graphics company. Nvidia has many years of expertise in building simulation, like physics simulation and rendering, and also real-time acceleration on GPUs. So we are using simulation heavily in our approach to building robotics. The simulation strategy is super interesting. Why do you think most of the industry is still very focused on real-world data, the opposite strategy? Yeah, I think we need all kinds of data; simulation and real-world data by themselves are not enough.

So at GEAR, we divide this data strategy into roughly three buckets. One is the internet-scale data, like all the text and videos online. The second is simulation data, where we use Nvidia simulation tools to generate lots of synthetic data. And the third is the real robot data, where we collect data by teleoperating the robot and then recording those data on the robot platforms. And I believe a successful robotics strategy will involve the effective use of all three kinds of data, mixing them, and delivering a unified solution.
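To make the three-bucket idea concrete, here is a minimal, hypothetical Python sketch of a weighted data mixer. The bucket contents and sampling weights are illustrative assumptions, not Nvidia's actual pipeline.

```python
import random

# Hypothetical stand-ins for the three data buckets Fan describes.
buckets = {
    "internet": ["web_video_clip_1", "web_video_clip_2"],   # diverse, but no action labels
    "simulation": ["sim_trajectory_1", "sim_trajectory_2"], # unlimited actions, sim-to-real gap
    "real_robot": ["teleop_episode_1"],                     # no sim-to-real gap, expensive to collect
}
weights = {"internet": 0.5, "simulation": 0.4, "real_robot": 0.1}  # illustrative mixture

def sample_batch(batch_size):
    """Draw a training batch by sampling buckets according to the weights."""
    names = list(buckets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = random.choices(names, weights=probs, k=1)[0]
        batch.append((name, random.choice(buckets[name])))
    return batch

print(sample_batch(4))
```

The point of the sketch is only that all three sources feed one training stream; how the weights are chosen is exactly the kind of open question Fan is describing.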

Can you say more about that? We were talking earlier about how data is fundamentally the key bottleneck in making a robotics foundation model actually work. Can you say more about your conviction in that idea? And then what exactly does it take to make great data to break through this problem? Yes. So I think the three different kinds of data that I just mentioned have different strengths and weaknesses. For internet data, it's the most diverse, and it encodes a lot of common-sense priors.

Right. Like, for example, most of the videos online are human-centered, because we humans love to take selfies. We love to record each other doing all kinds of activities. And there are also a lot of instructional videos online. So we can use that to learn how humans interact with objects and how objects behave under different situations. That kind of provides a common-sense prior for the robot foundation model. But the internet-scale data doesn't come with actions. We cannot download the motor control signals of the robots from the internet.

And that goes to the second part of the data strategy, which is using simulation. So in simulation, you can have all the actions, and you can also observe the consequences of the actions in that particular environment. And the strength of simulation is that it's basically infinite data, and the data scales with compute: the more GPUs you put into the simulation pipeline, the more data you will get. The data can also be generated faster than real time. If you collect data only on the real robot, then you are limited to 24 hours per day.

But in simulation, with GPU-accelerated simulators, we can actually accelerate beyond real time by 10,000x, so we can collect data at much higher throughput given the same wall-clock time. So that's the strength. But the weakness is that for simulation, no matter how good the graphics pipeline is, there will always be this simulation-to-reality gap. The physics will be different from the real world. The visuals will still be different; they will not look exactly as realistic as the real world. And there is also a diversity issue.

The content in the simulation will not be as diverse as all the scenarios that we encounter in the real world. Those are the weaknesses. And then going to the real robot data: that data doesn't have the sim-to-real gap because it's collected on the real robot. But it's much more expensive to collect, because you need to hire people to operate the robots. And again, it's limited by the speed of the world of atoms. You only have 24 hours per day, and you need humans to collect those data, which is also very expensive.

So we see these three types of data as having complementary strengths. And I think a successful strategy is to combine their strengths and remove their weaknesses. So the cute Groot robots that were on stage with Jensen, that was such a cool moment. If you had to help us dream in five to ten years, what do you think your group will have accomplished? Yeah, so this is pure speculation, but I hope that we can see a research breakthrough in robot foundation models maybe in the next two to three years.

So that's what we call a GPT-3 moment for robotics. And then after that, it's a bit uncertain, because to have robots enter people's daily lives, there is a lot more than just the technical side. The robots need to be affordable and mass produced. We also need safety for the hardware, and also privacy and regulations. And those will take longer before the robots are able to hit a mass market. So that's a bit harder to predict, but I do hope that the research breakthroughs will come in the next two to three years. What do you think will define what a GPT-3 moment in AI robotics looks like?

Yeah, that's a great question. So I would like to think about robotics as consisting of two systems, system one and system two. That comes from the book Thinking, Fast and Slow, where system one means this low-level mode of control that's unconscious and fast. For example, when I'm grasping this cup of water, I don't really think about how I move the fingertips at every millisecond. So that would be system one, and then system two is slow and deliberate, and it's more like reasoning and planning that actually uses the conscious brain power that we have.

So I think the GPT-3 moment will be on the system one side. And my favorite example is the verb open. Just think about the complexity of the word open. Opening a door is different from opening a window. It's also different from opening a bottle or opening a phone. But for humans, we have no trouble understanding that open means different motions when you're interacting with different objects. So far, though, we have not seen a robotics model that can generalize at the level of low-level motor control on these verbs.

So I hope to see a model that can understand these verbs in their abstract sense and generalize to all kinds of scenarios that make sense to humans. We haven't seen that yet, but I'm hopeful that this moment could come in the next two to three years. What about system two thinking? How do you think we get there? Do you think that some of the reasoning efforts in the LLM world will be relevant in the robotics world as well? Yeah, absolutely. I think for system two, we have already seen very strong models that can do reasoning and planning, and also coding as well.

These are the LLMs and the frontier models that we have already seen these days. But to integrate the system two models with system one is another research challenge in itself. So the question is, for a robot foundation model, do we have a single monolithic model, or do we have some kind of cascaded approach where the system two and the system one models are separate and can communicate with each other in some way? I think that's an open question. And again, they have pros and cons. For the first idea, the monolithic model is cleaner.

There's just one model, one API to maintain, but it's also a bit harder to control because you have different control frequencies. The system two models will operate at a slower control frequency, let's say 1 Hz, like one decision per second, while system one, like the motor control of me grasping this cup of water, will likely be 1,000 Hz, where I need to make these tiny muscle decisions 1,000 times per second. It's really hard to encode them both in a single model. So maybe a cascaded approach will be better.
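To make the frequency mismatch concrete, here is a minimal, hypothetical Python sketch of a cascaded controller: a slow system-2 planner refreshes a high-level goal about once per second, while a fast system-1 policy issues motor commands far more often. All function names, rates, and stubbed interfaces below are illustrative assumptions, not Project Groot code.

```python
import time

def read_sensors():
    # Stub sensor interface; a real robot would return camera frames, joint states, etc.
    return {"cup_pose": (0.4, 0.1, 0.2)}

def send_motor_command(command):
    # Stub actuator interface; a real robot would receive joint targets here.
    pass

def system2_plan(obs):
    # Slow, deliberate step (in a real system this might be a VLM/LLM call).
    return {"goal": "grasp_cup", "target": obs["cup_pose"]}

def system1_control(obs, plan):
    # Fast, reactive policy: latest observation + current plan -> motor command.
    return [0.0] * 7  # placeholder command for a hypothetical 7-DoF arm

PLAN_PERIOD = 1.0       # system 2 at roughly 1 Hz
CONTROL_PERIOD = 0.001  # system 1 at roughly 1,000 Hz

plan, last_plan_time = None, 0.0
start = time.monotonic()
while time.monotonic() - start < 2.0:  # run the demo loop for two seconds
    obs = read_sensors()
    now = time.monotonic()
    if plan is None or now - last_plan_time >= PLAN_PERIOD:
        plan = system2_plan(obs)       # infrequent high-level update
        last_plan_time = now
    send_motor_command(system1_control(obs, plan))  # frequent low-level update
    time.sleep(CONTROL_PERIOD)
```

In this sketch the two loops communicate through the `plan` object; whether that channel should be text, latent vectors, or something else is exactly the open question Fan raises next.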

But again, how do we communicate between system one and two? Do they communicate through text or through some latent variables? It's unclear, and I think it's a very exciting new research direction. Is your instinct that we'll get to that breakthrough on system one through scale and transformers, that this is going to work? Or is it cross your fingers, hope, and see? I certainly hope that the data strategy I described will get us there, because I feel that we have not pushed the limit of transformers yet.

At the essential level, transformers take tokens in and output tokens. Ultimately, the quality of the tokens determines the quality of the model, the quality of those large transformers. And for robotics, as I mentioned, the data strategy is very complex. We have all the internet data, and we also need simulation data and the real robot data. And once we're able to scale up the data pipeline with all those high-quality actions, then we can tokenize them and send them to a transformer to compress. I feel we have not pushed transformers to the limit yet.

And once we figure out the data strategy, we may be able to see some emergent properties as we scale up the data and scale up the model size. That's why I'm calling it the scaling law for embodied AI, and it's just getting started. I'm very optimistic that we will get there. I'm curious to hear, what are you most excited about personally when we do get there? What's the industry or application or use case that you're really excited to see this completely transform in the world of robotics today?

Yes. So there are actually a few reasons that we chose humanoid robots as the main research thesis to tackle. One reason is that the world is built around the human embodiment, the human form factor. All our restaurants, factories, hospitals, and all our equipment and tools are designed for the human form and the human hands. So in principle, a sufficiently good humanoid hardware should be able to support any task that a reasonable human can do. In principle. And the humanoid hardware is not there yet today. But I feel that in the next two to three years, the humanoid hardware ecosystem will mature, and we will have affordable humanoid hardware to work on.

And then it will be a problem about the AI brain, about how we drive that humanoid hardware. And once we have that, once we have the Groot foundation model that can take any instruction in language and then perform any task that a reasonable human can do, then we unlock a lot of economic value. We can have robots in our households helping us with daily chores like laundry, dishwashing, and cooking, or elderly care. And we can also have them in restaurants, in hospitals, in factories, helping with all the tasks that humans do. And I hope that will come in the next decade. But again, as I mentioned in the beginning, this is not just a technical problem; there are many things beyond the technology. So I'm looking forward to that.

Any other reasons you've chosen to go after humanoid robots specifically? Yeah, so there are also some more practical reasons in terms of the training pipeline. There is a lot of data online about humans; it's all human-centered. All the videos are humans doing daily tasks or having fun. And the humanoid robot form factor is closest to the human form factor, which means that a model trained on all of that data will have an easier time transferring to the humanoid form factor rather than to other form factors.

So let's say for robot arms, how many videos do we see online about robot arms and grippers? Very few. But there are many videos of people using their five-finger hands to work with objects. So it might be easier to train for humanoid robots. And then once we have that, we'll be able to specialize them to robot arms and more specific robot forms. So that's why we're aiming for full generality first.

I didn't realize that. So are you exclusively training on humanoids today, versus robot arms and robot dogs as well? Yes. So for Project Groot, we are aiming more towards humanoids right now, but the pipeline that we're building, including the simulation tools and the real robot tools, is general purpose enough that we can also adapt it to other platforms in the future. So, yeah, we're building these tools to be generally applicable.

You've used the term general quite a few times now. I think there are some folks, especially from the robotics world, who think that a general approach won't work and you have to be domain- and environment-specific. Why have you chosen to go after a generalist approach? And, you know, Richard Sutton's bitter lesson has been a recurring theme on our podcast. I'm curious if you think it holds in robotics as well.

Absolutely. So I would like to first talk about the success story in NLP that we have all seen. Before ChatGPT and GPT-3, in the world of NLP, there were a lot of different models and pipelines for different applications, like translation, coding, doing math, and creative writing. They all used very different models and completely different training pipelines. But then ChatGPT came and unified everything into a single model. So before ChatGPT, we call those specialists, and then the GPT-3s and ChatGPTs we call the generalists.

And once we have the generalists, we can prompt them, distill them, and fine-tune them back to the specialized tasks. And we call those the specialized generalists. And according to the historical trend, it's almost always the case that the specialized generalists are just far stronger than the original specialists, and they're also much easier to maintain because you have a single API that takes text in and spits text out. So I think we can follow the same success story from the world of NLP, and it will be the same for robotics.

So right now, in 2024, most of the robotics applications we have seen are still in the specialist stage: they have specific robot hardware for specific tasks, collecting specific data using specific pipelines. But Project Groot aims to build a general-purpose foundation model that works on humanoids first, but later will generalize to all kinds of different robot forms or embodiments. And that will be the generalist moment that we are pursuing. And then once we have that generalist, we'll be able to prompt it, fine-tune it, and distill it down to specific robotics tasks.

And those are the specialized generalists. But that will only happen after we have the generalist. So it will be easier in the short run to pursue the specialist; it's just easier to show results, because you can focus on a very narrow set of tasks. But we at Nvidia believe that the future belongs to the generalists, even though it will take longer to develop and there are more difficult research problems to solve. That's what we're aiming for first.

The interesting thing about Nvidia building Groot, to me, is also what you mentioned earlier, which is that Nvidia owns both the chip and the model itself. What do you think are some of the interesting things that Nvidia could do to optimize Groot on its own chip? Yes. So at the March GTC, Jensen also unveiled the next generation of edge computing chips. It's called the Jetson Thor chip, and it was actually co-announced with Project Groot. So the idea is that we will have the full stack as a unified solution for customers.

So from the chip level, which is the Jetson Thor family, to the foundation model Project Groot, and also to the simulation and the utilities that we build along the way, it will become a platform, a computing platform for humanoid robots, and then also for intelligent robots in general. So I want to quote Jensen here. One of my favorite quotes from him is that everything that moves will eventually be autonomous. And I believe in that as well. It's not right now, but let's say ten years or more from now.

If we believe that there will be as many intelligent robots as iPhones, then we'd better start building that today. That's awesome. Are there any particular results from your research so far that you want to highlight, anything that gives you optimism or conviction in the approach that you're taking? Yes. We can talk about some prior work that we have done. One work that I was really happy about was called Eureka. For this work, we did a demo where we trained a five-finger robot hand to do pen spinning. Very useful, and it's superhuman with respect to myself, because I gave up pen spinning back in childhood. I'm not able to do it.

I would fail miserably at a live demo. So, yeah, I'm not able to do this, but the robot hand is. The idea that we used to train this is that we prompt an LLM to write code against the simulator API that Nvidia has built, the Isaac Sim API, and the LLM outputs the code for a reward function. A reward function is basically a specification of the desirable behavior that we want the robot to perform.

So the robot will be rewarded if it's on the right track, or penalized if it's doing something wrong. That's a reward function. And typically the reward function is engineered by a human expert, typically a roboticist who really knows the API. It takes a lot of specialized knowledge, and reward function engineering is by itself a very tedious and manual task.

So what Eureka did was design an algorithm that uses an LLM to automate this reward function design, so that the reward function can instruct a robot to do very complex things like pen spinning. It is a general-purpose technique that we developed, and we plan to scale it up beyond just pen spinning. It should be able to design reward functions for all kinds of tasks, or even generate new tasks using the Nvidia simulation API. So that gives us a lot of space to grow.
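A minimal, hypothetical sketch of the loop Fan describes: an LLM proposes reward-function code, each candidate is scored by training a policy in simulation, and a summary of the results is fed back into the next prompt. The functions `query_llm` and `train_policy_in_sim` are stand-ins that return dummy values, not the actual Eureka or Isaac Sim APIs.

```python
import random

def query_llm(prompt):
    # Placeholder for a code-writing LLM call; returns candidate reward-function code.
    return "def reward(state, action):\n    return -abs(state['pen_angle_error'])"

def train_policy_in_sim(reward_code):
    # Placeholder for running RL in a simulator with the generated reward
    # and returning training metrics. Here we just fake a success rate.
    return {"success": random.random()}

def eureka_style_search(task, iterations=3, samples=4):
    best = {"success": -1.0, "code": None}
    feedback = "none yet"
    for _ in range(iterations):
        scored = []
        for _ in range(samples):
            prompt = (f"Write a Python reward function for: {task}\n"
                      f"Feedback from previous round: {feedback}")
            code = query_llm(prompt)
            scored.append((train_policy_in_sim(code), code))
        metrics, code = max(scored, key=lambda x: x[0]["success"])
        if metrics["success"] > best["success"]:
            best = {"success": metrics["success"], "code": code}
        # Summarize how training went so the LLM can refine its next attempt.
        feedback = f"best success rate so far: {best['success']:.2f}"
    return best

print(eureka_style_search("spin a pen with a five-finger robot hand"))
```

The key design point is that the simulator's training results, not a human roboticist, provide the feedback signal for the next round of reward-code generation.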

Why do you think, I mean, I remember five years ago there were research labs working on solving Rubik's Cubes with a robot hand and things like that. And it felt like robotics kind of went through a trough of disillusionment. And in the last year or so, it feels like the space has really heated up again. Do you think there is a why now around robotics this time around, and what's different? We're reading that OpenAI is getting back into robotics.

Everybody is now spinning up their efforts. What do you think is different now? Yeah, I think there are quite a few key factors that are different now. One is the robot hardware. Since the end of last year, we have seen a surge of new robot hardware in the ecosystem. There are companies like Tesla working on Optimus, Boston Dynamics, and so on, and a lot of startups as well. So we are seeing better and better hardware.

So that's number one. And that hardware is becoming more and more capable, with better dexterous hands and better whole-body reliability. The second factor is pricing. We are also seeing a significant drop in the price and the manufacturing cost of humanoid robots. Back in 2001, NASA had a humanoid developed called Robonaut. If I recall correctly, it cost north of $1.5 million per robot. And most recently, there are companies that are able to put a price tag of about $30,000 on a full-fledged humanoid, and that's roughly comparable to the price of a car.

And also, there's always this trend in manufacturing where the price of a mature product tends towards its raw material cost. And a humanoid typically takes only 4% of the raw material of a car. So it's possible that we can see the cost trending downwards even more, and there could be an exponential decrease in the price in the next couple of years. And that makes this state-of-the-art hardware more and more affordable. That's the second factor in why I think humanoids are gathering momentum.

And the third one is on the foundation model side. We are seeing the system two problem, the reasoning and planning part, being addressed very well by the frontier models like the GPTs and the Claudes and the Llamas of the world. And these LLMs are able to generalize to new scenarios. They're able to write code, and actually the Eureka project I just mentioned leverages the coding abilities of the LLMs to help develop new robot solutions. And there is also a surge in multimodal models, improving the computer vision, the perception side of it.

So I think these successes also encourage us to pursue robot foundation models, because we think we can ride on the generalizability of these frontier models and then add actions on top of them, so we can generate action tokens that will ultimately drive these humanoid robots. I completely agree with all that. I also think so much of what we've been trying to tackle to date in the field has been how to unlock the scale of data that you need to build this model, and all the research advancements that we've made, many of which you've contributed to yourself, around sim-to-real and other things.

And the tools that Nvidia has built with Isaac Sim and others have really accelerated the field, alongside teleoperation and cheaper teleoperation devices and things like that. So I think it's a really, really exciting time to be building here. I agree. Yeah, I'd love to transition to talking about virtual worlds, if that's okay with you. Yeah, absolutely. So I think you started your research more in the virtual world arena. Maybe say a word on what got you interested in Minecraft versus robotics. Is it all kind of related in your world? What got you interested in virtual worlds?

Yeah, that's a great question. So, for me, my personal mission is to solve embodied AI. And for AI agents embodied in the virtual world, that would be things like gaming and simulation. That's why I also have a very soft spot for gaming. I also enjoy gaming myself. What did you play? Yeah, so I play Minecraft. At least I try to. I'm not a very good gamer, and that's why I also want my AI to avenge my poor skills.

Yeah, yeah. So I worked on a few gaming projects before. The first one was called MineDojo, where we developed a platform for building general-purpose agents in the game of Minecraft. And for those in the audience who are not familiar, Minecraft is this 3D voxel world where you can do whatever you want: you can craft all kinds of recipes and different tools, and you can also go on adventures. It's an open-ended game with no particular score to maximize and no fixed storylines to follow. So we collected a lot of data from the internet.

There are videos of people playing Minecraft. There are also wiki pages that explain every concept and every mechanism in the game; those are multimodal documents. And there are forums like Reddit: the Minecraft subreddit has a lot of people talking about the game in natural language. So we collected these multimodal datasets, and we were able to train models to play Minecraft. So that was the first work, MineDojo, and later the second work was called Voyager.

So we had the idea of Voyager after GPT-4 came along, because at that time it was the best coding model out there. So we thought, hey, what if we use coding as action? And building on that insight, we were able to develop the Voyager agent, where it writes code to interact with the Minecraft world. We use an API to first convert the 3D Minecraft world into a text representation, and then have the agent write code using the action APIs. But just like human developers, the agent is not always able to write the code correctly on the first try.

So we give it a self-reflection loop where it tries something out, and if it runs into an error or makes some mistakes in the Minecraft world, it gets the feedback and can correct its program. And once it has written the correct program, that's what we call a skill. We have it saved to a skill library, so that in the future, if the agent faces a similar situation, it doesn't have to go through that trial-and-error loop again; it can retrieve the skill from the skill library.

So you can think of that skill library as a codebase that the LLM interactively authored all by itself. There's no human intervention; the whole codebase is developed by Voyager. So that's the second mechanism, the skill library. And the third one is what we call an automated curriculum. Basically, the agent knows what it knows and what it doesn't know, so it's able to propose the next task that's neither too difficult nor too easy for it to solve. And then it's able to follow that path and discover all kinds of different skills and different tools, and travel far in the vast world of Minecraft.
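A minimal, hypothetical Python sketch of the three mechanisms just described: coding as action with a self-reflection loop, a skill library of verified programs, and an automated curriculum that proposes the next task. The LLM and Minecraft calls are placeholder stubs, not the real Voyager or MineDojo APIs.

```python
def propose_next_task(skills):
    # Automated-curriculum stub: a real system would ask an LLM to pick a task
    # that is neither too easy nor too hard given the known skills.
    return "craft a wooden pickaxe" if "chop tree" in skills else "chop tree"

def write_program(task, error):
    # Coding-as-action stub: an LLM would write code against the game's action
    # API, using any previous error message as feedback.
    return f"# program for: {task} (previous error: {error})"

def execute_in_minecraft(program):
    # Environment stub: run the program in the game and report success/error.
    return True, None  # (success, error_message)

def voyager_loop(episodes=3, max_retries=4):
    skill_library = {}  # task -> verified program, authored entirely by the agent
    for _ in range(episodes):
        task = propose_next_task(skill_library)
        if task in skill_library:        # reuse a stored skill instead of re-learning
            continue
        error = None
        for _ in range(max_retries):     # self-reflection loop
            program = write_program(task, error)
            success, error = execute_in_minecraft(program)
            if success:
                skill_library[task] = program
                break
    return skill_library

print(voyager_loop())
```

The agent keeps wandering because each solved task feeds the curriculum for the next one, which is what lets it keep discovering new skills and traveling across the world.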

And because it travels so much, that's why we call it Voyager. So yeah, that was one of our team's earliest attempts at building AI agents in the embodied world using foundation models. Talk about the curriculum thing more. I think that's really interesting, because it feels like it's one of the more unsolved problems in the reasoning and LLM world generally. How do you make these models self-aware so that they know how to take that next step to improve? Maybe say a little bit more about what you built on the curriculum and the reasoning side.

Absolutely. A very interesting emergent property of those frontier models is that they can reflect on their own actions; they kind of know what they know and what they don't know, and they're able to propose tasks accordingly. So, for the automated curriculum in Voyager, we gave the agent a high-level directive: find as many novel items as possible. And that's just the one-sentence goal that we gave. We didn't give any instruction on which objects to discover first or which tools to unlock first; we didn't specify.

And the agent was able to discover all of that by itself using this coding, prompting, and skill library. It's kind of amazing that the whole system just works. I would say it's an emergent property once you have a very strong reasoning engine that can generalize. Why do you think so much of this kind of research has been done in the virtual world? And I'm sure it's not entirely because a lot of deep learning researchers like playing video games, although I'm sure it doesn't hurt either.

But I guess, what are the connections between solving stuff in the virtual world and in the physical world, and how do the two interplay? Yeah, so, as different as gaming and robotics seem to be, I see a lot of similar principles shared across these two domains. Embodied agents take as input perception, which can be a video stream along with some sensory input, and then they output actions. In the case of gaming, those will be keyboard and mouse actions, and for robotics, they will be low-level motor controls. So ultimately the API looks like this.

And these agents need to explore the world; they have to collect their own data in some way. That's what we call reinforcement learning and also self-exploration. And that principle is again shared between the physical agents and the virtual agents. But the difference is that robotics is harder, because you also have a simulation-to-reality gap to bridge: in simulation, the physics and the rendering will never be perfect.

So it's really hard to transfer what you learn in simulation to the real world, and that is by itself an open-ended research problem. So robotics has the sim-to-real issue, but gaming doesn't: you are training and testing in the same environment. I would say that is the difference between them.

And last year I proposed a concept called the foundation agent, where I believe ultimately we'll have one model that can work for both virtual agents and physical agents. For the foundation agent, there are three axes over which it will generalize. Number one is the skills it can do. Number two is the embodiment, the body form or form factor it can control. And number three is the world, the realities it can master.

So in the future, I think a single model will be able to do a lot of different skills on a lot of different robot forms or agent forms, and then generalize across many different worlds, virtual or real. And that's the ultimate vision that the GEAR team wants to pursue: the foundation agent. Pulling on the thread of virtual worlds and gaming in particular, and what you've unlocked already with some reasoning and some emergent behavior, especially working in an open-ended environment.

What are some of your own personal dreams for what is now possible in the world of games? Where would you like to see AI agents innovate in the world of games today? Yes. So I'm very excited by two aspects. One is intelligent agents inside the games. The NPCs that we have these days have fixed scripts to follow, and they're all manually authored.

What if we have NPCs, non-player characters, that are actually alive, and you can interact with them, they can remember what you told them before, and they can also take actions in the gaming world that will change the narrative and change the story for you? This is something that we haven't seen yet, but I feel there's a huge potential there, so that when everyone plays the game, everybody will have a different experience. And even for one person, if you play the game twice, you don't have the same story. So each game will have infinite replay value. That's one aspect.

And the second aspect is that the game itself can be generated. And we already see many different tools doing subsets of this grand vision I just mentioned. There are text-to-3D models generating assets. There are also text-to-video models. And of course there are language agents that can generate storylines.

What if we put all of them together, so that the game world is generated on the fly as you are playing and interacting with it? That would be just truly amazing and a truly open-ended experience. Super interesting. For the agent vision in particular, do you think you need GPT-4-level capabilities, or do you think you can get there with Llama 8B, for example, alone? Yeah, I think the agent needs the following capabilities.

One is, of course, it needs to hold an interesting conversation. It needs to have a consistent personality, and it needs to have long-term memory and also take actions in the world. For these aspects, I think the current Llama models are pretty good, but not good enough to produce very diverse and really engaging behaviors. So I do think there's still a gap to close.

And the other thing is inference cost. If we want to deploy these agents to gamers, then either it's very low cost, hosted on the cloud, or it runs locally on the device. Otherwise it's kind of unscalable in terms of cost. So that's another factor to be optimized.

Do you think all this work in the virtual world space is in service of learning things that you can then use to accomplish things in the physical world? Does the virtual world stuff exist in service of the physical-world ambitions? Or, I guess, said differently, is it enough of a prize in its own right? And how do you think about prioritizing your work between the physical and virtual worlds?

Yes. So I just think the virtual world and the physical world will ultimately just be different realities on a single axis. Let me give one example. There is a technique called domain randomization, and how it works is that you train a robot in simulation, but you train it in 10,000 different simulations in parallel, and each simulation has slightly different physical parameters. The gravity is different, the friction, the weight, everything's a bit different. So it's actually 10,000 different worlds. And if we have an agent that can master all 10,000 different configurations of reality at once, then our real physical world is just the 10,001st virtual simulation.

And in this way, we're able to generalize from sim to real directly. That's actually exactly what we did in a follow-up work to Eureka, where we were able to train agents with all kinds of different randomizations in simulation and then transfer zero-shot to the real world without further fine-tuning. That work is DrEureka.
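A minimal, hypothetical Python sketch of the domain-randomization idea described here: sample many simulation configurations with different physics parameters and train one policy across all of them, so that the real world becomes just one more configuration. The parameter ranges and helper functions are illustrative assumptions, not the actual Isaac Sim or DrEureka setup.

```python
import random

def sample_sim_params():
    # Each parallel simulation gets slightly different physics.
    # The ranges here are hard-coded purely for illustration.
    return {
        "gravity": random.uniform(-10.2, -9.4),      # m/s^2
        "friction": random.uniform(0.3, 1.2),
        "payload_mass": random.uniform(0.0, 2.0),    # kg
        "motor_strength": random.uniform(0.8, 1.2),  # scale factor
    }

def make_env(params):
    # Placeholder for constructing a simulated environment with these physics.
    return {"params": params}

def train_policy(envs):
    # Placeholder for RL across all randomized environments at once; a real
    # pipeline would run thousands of GPU-accelerated sims in parallel.
    return {"trained_on": len(envs)}

NUM_ENVS = 10_000
envs = [make_env(sample_sim_params()) for _ in range(NUM_ENVS)]
policy = train_policy(envs)
print(policy)  # the real robot then becomes "just the 10,001st configuration"
```

The interesting part, as the conversation turns to next, is who chooses those randomization ranges and parameters in the first place.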

And I do believe that if we have all kinds of different virtual worlds, including from games, and if we have a single agent that can master all kinds of skills in all the worlds, then the real world just becomes part of this bigger distribution. Do you want to share a little bit about DrEureka to ground the audience in that example? Oh, yeah, absolutely. For the DrEureka work, we built upon Eureka and still used LLMs as a kind of robot developer. So the LLM is writing code, and the code specifies the simulation parameters, like the domain randomization parameters. And after a few iterations, the policy that we train in simulation is able to generalize to the real world. One specific demo that we showed is a robot dog walking on a yoga ball: it's able to stay balanced and even walk forward.

So one very funny comment that I saw was someone actually asking his real dog to do this task, and his dog isn't able to do it. So in some sense, our neural network achieves super-dog performance. I'm pretty sure my dog would not be able to do it either. Call it ADI, artificial dog intelligence. Yeah, that's the next benchmark.

In the virtual world sphere, I think there have been a lot of just incredible models that have come out recently, on the video side especially, all of them transformer based. Do you think we're there in terms of, okay, this is the architecture that's going to take us to the promised land and let's scale it up? Or do you think there are fundamental breakthroughs still required on the model side? Yes, I think for robot foundation models, we haven't pushed the limit of the architecture yet. The data is the harder problem right now, and it's the bottleneck, because as I mentioned earlier, we can't download those action data from the internet.

They don't come with the motor control data. We have to collect it either in simulation or on the real robots. And once we have that, once we have a very mature data pipeline, then we just push the tokens to the transformer and have it compress those tokens, just like transformers predicting the next word on Wikipedia. We're still testing these hypotheses, but I don't think we have pushed transformers to their limit yet.

There is also a lot of research going on right now on alternative architectures to transformers. I'm super interested in those, personally. There's Mamba recently. There's test-time training. There are a few alternatives, and some of them have very promising ideas. They haven't really scaled to frontier-model performance yet, but I'm looking forward to seeing alternatives to transformers. Has any of them caught your eye in particular, and why? Yeah, I think I mentioned the Mamba work and also test-time training.

These models are more efficient at inference time. Instead of transformers attending to all the past tokens, these models have inherently more efficient mechanisms. So I see them holding a lot of promise, but we need to scale them up to the size of the frontier models and really see how they compare head to head with the transformer. Awesome. Should we close out with some rapid-fire questions? Yeah. Oh, yeah. Okay, let's see. Number one, what outside the embodied AI world are you most interested in, within AI? Yeah.

So I'm super excited about video generation, because I see video generation as a kind of world simulator, where we learn the physics and the rendering from data alone. We have seen OpenAI's Sora, and later a lot of new models catching up to Sora. So this is an ongoing research topic. What does the world simulator get you? I think it's going to get us a data-driven simulation in which we can train embodied AI. That would be amazing. Nice.

What are you most excited about in AI on a longer-term horizon, ten years or more? Yeah. So on a few fronts. One is the reasoning side. I'm super excited about models that code. I think coding is such a fundamental reasoning task, and it also has huge economic value. I think maybe ten years from now, we'll have coding agents that are as good as human-level software engineers, and then we'll be able to accelerate a lot of development using the LLMs themselves.

And the second aspect is, of course, robotics. I think ten years from now, we'll have humanoid robots that are at the reliability and agility of humans, or even beyond. And I hope at that time, Project Groot will be a success, and we'll be able to have humanoids helping us in our daily lives. I just want robots to do my laundry. That's always been my dream. What year are robots going to do our laundry? As soon as possible.

I can't wait. Who do you admire most in the field of AI? You've had the opportunity to work with some greats dating back to your internship days, but who do you admire most these days? I have too many heroes in AI to count. I admire my PhD advisor, Fei-Fei. I think she taught me how to develop good research taste. Sometimes it's not about how to solve a problem, but identifying what problems are worth solving.

And actually, the what problem is much harder than the how problem. During my PhD years with Fei-Fei, I transitioned to embodied AI, and in retrospect, this was the right direction to work on. I believe the future of AI agents will be embodied, for robotics or for the virtual world. I also admire Andrej Karpathy. He's a great educator. I think he writes code like poetry, so I look up to him. And I admire Jensen a lot. Jensen cares a lot about AI research, and he also knows a lot about even the technical details of the models, and I'm super impressed.

So I look up to him a lot. Pulling on the thread of having great research taste, what advice do you have for founders building in AI in terms of finding the right problems to solve? Yeah, I think read some research papers. I feel that research papers these days are becoming more and more accessible, they have some really good ideas, and they are more and more practical instead of just theoretical machine learning. So I would recommend keeping up with the latest literature and also just trying out all the open-source tools that people have built.

So, for example, at Nvidia, we built simulator tools that everyone can access; just download them and try them out, and you can train your own robots in the simulations. Just get your hands dirty. And maybe pulling on the thread of Jensen as an icon, what do you think is some practical, tactical advice you'd give to founders building in AI, what they could learn from him? Yeah, I think identifying the right problem to work on. Nvidia bets on humanoid robotics and embodied AI because we believe this is the future, and if we believe that, let's say, ten years from now, there will be as many intelligent robots in the world as iPhones, then we'd better start working on that today.

So, yeah, just long-term future visions. I think that's a great note to end on. Jim, thank you so much for joining us. We love learning about everything your group is doing, and we can't wait for the future of laundry-folding robots. Awesome. Yeah. Thank you so much for having me. Yeah, thank you. Thank you. Thanks.

Artificial Intelligence, Technology, Innovation, Jim Fan, Nvidia, Robotics Research, Sequoia Capital