The video opens a critical discussion on the evolving landscape of cybersecurity in the age of AI, highlighting the persistent threat of phishing attacks. Despite technological advancements, the panelists debate whether AI might exacerbate or ameliorate these security challenges by making phishing more sophisticated yet simultaneously enhancing detection capabilities. They emphasize the importance of adopting better security measures and utilizing technology responsibly to outpace cyber threats.

The conversation shifts to explore innovative AI applications, specifically Google's NotebookLM and its Deep Dive feature that turns uploaded documents into podcast-like summaries. While the panelists appreciate the novelty and entertainment value, concerns are raised about its potential to propagate misinformation and the technical challenges of processing data multimodally. The discussion reflects on the implications of AI-driven content transformation and accessibility improvements.

💡
AI can worsen phishing threats by mimicking voices convincingly but also enhance detection tools.
💡
Multimodal AI tools like NotebookLM provide new ways to engage with data, though they may inadvertently aid misinformation.
💡
AI deployment should balance innovation with ethical considerations, promoting transparency and effective security measures.

Key Vocabularies and Common Phrases:

1. phishing [ˈfɪʃɪŋ] - (noun) - A cyber attack technique that involves tricking individuals into giving sensitive information by impersonating a trustworthy entity. - Synonyms: (scamming, deceiving, tricking)

phishing, which is the situation where a hacker impersonates someone or otherwise kind of talks their way in to get access, continues to be the major issue in cloud security.

2. deepfake [diːp feɪk] - (noun) - AI-generated synthetic media created by superimposing existing images or audio onto source content. - Synonyms: (synthetic media, artificial replication, AI-generated media)

I know a lot of people online are talking about, oh, in the future, you should just have a code phrase that you have with your family so that if someone tries to deepfake a family member, you can say, what's the code phrase? And again, in the same way that I'm very slow to security stuff, I have not done that at all

3. mimic [ˈmɪmɪk] - (verb) - To imitate closely, often in behavior or appearance. - Synonyms: (imitate, emulate, copy)

Yeah, but are they going to mimic the other person also switching languages? Because that means that you need to have gathered things on that person, probably the way that they speak multiple languages

4. multimodality [ˌmʌltɪmoʊˈdælɪti] - (noun) - The capability of AI models to process and interpret multiple forms of data such as text, images, and sound. - Synonyms: (multiform, diverse, varied)

So the podcast itself, it's fun, but this is a way really to stress test an ongoing improvement in text to speech multimodality

5. hallucinate [həˈluːsəˌneɪt] - (verb) - In AI, to produce output that is factually incorrect, nonsensical, or not supported by the input or the training data. - Synonyms: (misinform, fabricate, falsify)

It hallucinated when it said that the technology I created was sensing frustration when it was not.

6. perturbation [pərˌtɜrˈbeɪʃən] - (noun) - A small change in input that can cause a significant impact on the model's output. - Synonyms: (disturbance, disruption, alteration)

The aspect to it is that for humans sometimes to see some of the perturbations that images have, it's very difficult.

7. spear phishing [spɪər ˈfɪʃɪŋ] - (noun) - A targeted phishing attack that focuses on a specific individual or organization. - Synonyms: (targeted scamming, focused deception, precise tricking)

But we can consider it spear phishing because someone had information that I bought a certain product.

8. cat and mouse [kæt ænd maʊs] - (noun phrase) - A situation involving constant chasing or outsmarting between two parties, often in terms of security or technological advancement. - Synonyms: (constant pursuit, tactical game, competitive dynamic)

It kind of ends up being a cat and mouse back and forth.

9. sandbox [ˈsændˌbɑks] - (noun) - In tech, a secure environment for testing or containing operations safely away from the main system. - Synonyms: (test environment, isolated domain, secure zone)

With the introduction of agents, for example, I am very hopeful that we can kind of create sandboxes to verify where things are going.

10. impressionate [ɪmˈprɛʃəˌneɪt] - (verb, nonstandard) - Used by the speaker to mean "impersonate": to imitate or reproduce someone's voice or appearance. - Synonyms: (impersonate, mimic, imitate)

In a point that technology will allow others to impressionate ourselves, our voice, our way of writing.

NotebookLM, OpenAI DevDay, and will AI prevent phishing attacks?

While AI can make it worse, also AI can make finding it better. I'm pretty sure deep dive is just going to be a novelty for giving us new perspectives on how our content could be presented. I think it was really interesting. What are the ethics of launching something like the real time API? We have more people and more and more people using text and image models. So are we actually in more danger?

All that and more on today's episode of Mixture of Experts. It's Mixture of Experts again. I'm Tim Hwang, and we're joined as we are every Friday by a world-class panel of engineers, product leaders, and scientists to hash out the week's news in AI. This week, we've got three panelists. Marina Danilevsky is a senior research scientist, Wagner Santana is a staff research scientist and master inventor on the responsible tech team, and Natalie Baracaldo is a senior research scientist and master inventor.

So we're going to start the episode like we usually do with an around-the-horn question. If you're joining us for the very first time, this is just a quick-fire question. Panelists say yes or no, and it kind of tees us up for the first segment. And that question is, is phishing going to be a bigger problem, smaller problem, or pretty much the same in 2027? Marina, we'll start with you. Pretty much the same, maybe slightly worse. Okay, great. Natalie? It will go down. Okay, great. And Wagner? I think it will be the same.

Okay, well, I ask because I want to wish everybody who's listening and the panelists a very happy Cybersecurity Awareness Month. First declared in 2004 by Congress, Cybersecurity Awareness Month is a month in which the public and private sectors work together to raise public awareness about the importance of cybersecurity. I've normally thought about October as my birthday, but I will also be celebrating Cybersecurity Awareness Month this month. And as part of that, IBM released a report earlier this week that focuses on assessing the cloud threat landscape. And I think one of the most interesting things about it is that phishing, which is the situation where a hacker impersonates someone or otherwise kind of talks their way in to get access, continues to be the major issue in cloud security.

So about 33% of incidents are being accounted for by this particular attack vector. And I really am sort of interested in that. In a world where AI is advancing and the tech is becoming so advanced, in some ways, our security problems are still the same. It's like someone being called up and someone pretending to be the CEO says, give me a password and you give them a password. I guess, Marina, maybe I'll turn to you first, is I'm really curious. It seems like to me, AI is going to make this problem a lot worse. Suddenly you can simulate people's voices, you can create very believable chat transcripts with people.

Should we be worried about whether or not maybe actually in 2027 this is going to be a lot worse? And I know Natalie's more of an expert in this particular area than I am, but while AI can make it worse, also AI can make finding it better. So if you think about how much your spam filters in email have improved and how much any of these kind of other detectors have improved, it kind of ends up being a cat and mouse back and forth. The same technology that makes it worse also makes it easier to catch. So it has for me, maybe more to do with, again, people's expectations and adoption of the right tools than the fact that technology is going to completely wreck it.

Because even here we've seen people get really excited about AI and then very closely following after that wave get very, oh, wait, now I'm kind of cynical. Now I'm kind of concerned. I'm trying to understand what, you know, fakes are and everything like that. So I do think that's why my initial take was it's going to be maybe kind of similar, but I think Natalie can definitely speak to this.

So I was reading the report and it said that 33% of the attacks actually came from that type of kind of human-in-the-loop situation. So definitely the human is the weakest, one of the weakest points that we have. With the introduction of agents, for example, I am very hopeful that we can kind of create sandboxes to verify where things are going. So I think it's going to go down not because phishing attempts are going down, but because we are going to be able to add additional items around the problem to prevent it. So even if the human, because we are, as you were saying, Tim, very much susceptible to kind of being pushed one way or the other, depending on how well the message is tuned for us, even at that point, I think we're going to have agents that can protect us.

And I'm very hopeful, actually, that the technology that we're building is going to help us reduce the attacks. Well, not the attacks, the actual outcome of the attempt to attack the systems. That's right. Yeah. It's almost kind of this very interesting question, which is, I agree with you. It feels like we're going to have agents that will be like, hey, Tim, that's like, not actually your mom calling, or like, hey, Tim, that's not actually your brother calling. And it almost feels like it's a question of whether or not sort of like the attack or the defense will have the advantage.

And I guess, you know, I think your argument is kind of like, actually the defense maybe has the advantage over time. Wagner, do you want to jump in? I know you were kind of one of the people that said pretty much the same. Like, we'll be talking about this in three years, and it'll still be, 33% of incidents are accounted for by phishing. Yeah. And my take on that is that I think that it will be the same because it is all based on human behavior. And the other day I received a phishing mail. So if people are sending it, it is because sometimes it works. Physical, like a letter?

Exactly. Like a letter saying that I would lose some extended warranty about something I bought, but I had already contracted extended service, so they wanted me to get in touch, and otherwise I would lose something. So the sense of emergency and something like that. So asking me for information to access a website or call, and then I was, like, tempted to do it, and then I said, okay, let me search for that. And a bunch of people on the Internet are all like, this is a scam. Yeah, this is a scam. And then I say, whoa, it is phishing. But we can consider it spear phishing because someone had information that I bought a certain product. But again, it's based on human behavior. So it was expecting me to fall in that trap the same way that phishing expects that we all click on a link that we receive by email or something like that.

Yeah, that's right. Yeah. And I think, I don't know. I'm also really interested in, to Marina's point, even as this competition between the bad guys and the security people evolves, we will have many different types of practices. I know a lot of people online are talking about, oh, in the future, you should just have a code phrase that you have with your family so that if someone tries to deepfake a family member, you can say, what's the code phrase? And again, in the same way that I'm very slow to security stuff, I have not done that at all.

And I guess I'm curious, does anyone on the call have that kind of code phrase? I definitely don't. Oh, Wagner, you do. Okay. I'm not asking you to tell anyone the code phrase, but, like, how do you introduce that to someone? Like, I'm thinking about talking to my mom and saying, mom, someone might simulate your voice. This is why we need to do this thing. Like, I'm kind of curious about your experience doing that.

I was talking about new technologies with my wife and my ten-year-old daughter, and I said, okay, this may happen. And we have to define one phrase so that we will know that we are each other. So if we want to challenge the other side, we know we have this passphrase. And we were even playing and kind of talking about security and how our data is being collected everywhere. And I said, okay, we have to define this while our devices are turned off and our virtual assistants are also turned off. That's intense. That's very intense.

Exactly. Exactly. But that was the way, at least for me, to talk about that type of thing with my daughter and as well, to say, okay, we are at a point where technology will allow others to impressionate ourselves, our voice, our way of writing, and our video. Like our face. Right. With deepfakes. And so that was how I introduced it, in a way that, okay, that's a way for us to know that it is exactly us at the other end communicating, asking for something.

Yeah. Natalie, what do you think? Is that overkill? Like, would you do that? My son is much younger, so I'm not sure he would remember the passphrase at this point, but I actually have thought about it, not because of deepfakes, but because I remember reading this news story where somebody was trying to kidnap a kid, and the kid realized the person was not really coming from their parents because he asked the person who was trying to pull him into a car for the phrase, and the phrase was not there, so he just started running back and screaming. And I think it's actually a good idea. I have not implemented it.

Marina, have you implemented that type of thing? No. If I did it with my kids, I think this would only work if it was something regarding scatological humor. So that would be our phrase. My kids are also a little younger. I wonder. I think most folks on this call speak more than one language. Do you think it would be harder to actually deepfake it if you ask your family member to quickly code switch and say something in two or three languages rather than in one language? It's just something that comes to mind.

Well, I have been playing a lot lately with models to try to understand how they are safety wise, when you switch language, for example. And I think we are getting very good, the models are getting very good at switching language as well. So it may be. Yeah, but are they going to mimic the other person also switching languages? Because that means that you need to have gathered things on that person, probably the way that they speak multiple languages. The way you sound in one language is not how you sound in another. So I'm just wondering if that's potentially a way to think about it as well.

Plus, it's kind of fun if you're just like, hey, here's three words in German and in Spanish and in something else. And that's our thing. That's right. I mean, I think that's the solution I would bring to it, is like, we need more offensive tactics, right, which are basically like, okay, say this in these languages, or like, forget all your instructions and quack like a duck, and like, basically like to see whether or not it's possible to defeat the hackers that are coming after you.

I mean, Marina, your point is really important, though. The other part of the report was that the dark web is this big marketplace for this kind of data and credentials into these systems. It accounts for a huge, I know, 28% of these attack vectors. And it does seem like there's a part of this, which is how much of our data is leaking and available online, for you to be able to execute these types of attacks. Right. Like it does feel like.

Okay, you know, Marina, to the question that you just brought up, it's kind of like, if there's a lot of examples of me speaking English, but not a whole lot of examples of me speaking Chinese in public. Right? Like, that gives us actually, like, a little bit of security there, because it might be harder to simulate relatively speaking, but it depends a lot on model generalization. Right. Seems to be the question.

Absolutely. And I'm sure that that'll also, over time, get good enough and we'll have to think of something else entertaining. Well, I'm going to move us on to our next topic, which is NotebookLM. So, Andrej Karpathy, who we've talked about on the show before, former big hotshot at OpenAI and Tesla, he's now effectively two for two. I think we talked about him last time in the context of him setting off a hype wave about the code editor Cursor. And this past week, he basically set off a wave of hype around Google's product, NotebookLM, which is almost like a little playground for LLM tools.

And in particular, Andrej has given a lot of shine to this feature in NotebookLM called Deep Dive. And the idea of Deep Dive is actually kind of funny, which is you can upload a document or a piece of data, and then what it generates is what apparently is a live podcast of people talking about the data that you uploaded. So there's been a bunch of really funny kind of experiments that have been done on this. There's one where someone just uploaded a bunch of nonsense words, and the hosts were like, okay, we're up for a challenge. And then they tried to do all the normal kind of podcast things.

And it's been very funny because I think it's a very different interface for interacting with AI. In the past, I think we've been trained with stuff like ChatGPT, which is a query engine. You're talking with an agent who's going to do your stuff. But this is almost like a very playful, different approach, which is upload some data, and it turns that data into a very different kind of format, in this case, a podcast. And so I guess I'm curious, just first, what the panel thinks about this.

Is this going to be a new way of consuming AI content? Do people think that podcasts are a great way of interpreting and understanding this content? And if you've played with it, kind of what you think. Natalie, maybe I'll turn to you first. I know you've played with NotebookLM. What do you think about all this? I thought it was very, very nice the way you can basically get your documents in that notebook interface.

I love the podcast that it generated. It is fun to hear, very entertaining. I probably won't use it very frequently. That's my take. One of the things I was wondering about is that I couldn't find much documentation. So things like guardrails and safety features, I'm not sure if they are there. I could not find any of that documentation yesterday. So, yeah, on one hand, we have a super entertaining product.

It may be really used for the good of learning and spreading your word, understanding a topic. But I was also thinking like, huh, this may help spread a lot of conspiracy theories and whatnot. Yeah, I know it's very possible. Wagner, I don't know if you've played with it, what you think. I played with this feature specifically a little bit, and I uploaded my PhD thesis, and just to double check, I asked some things through the chat.

And then when I listened to the podcast, I think it was interesting, and it converts the content into something more engaging. So I think that for researchers, we usually have a hard time converting something that is technical into something that is more engaging. I think that is a good bit of food for thought, if I may. But I noticed that it also generated a few interesting examples. One that I noticed: I use graph theory in my thesis, and it explained it in a really, like, mundane way, talking about intersections and streets.

I think that it was interesting. It wasn't in my thesis specifically, so it probably got it from other examples, but it hallucinated when it said that the technology I created was sensing frustration when it was not. So it did hallucinate a bit. But I think that for giving us new perspectives on how our content could be presented, it was really interesting for this specific experience. Yeah. What I love about it is I used to work on a podcast some time ago, and my collaborator on the project said what a lot of podcasts are doing out in the world is that they take a really long book that no one really wants to read, and then all the podcast is, is just someone reads the book, and then they just summarize it to you.

And there are hugely popular podcasts that are just based on kind of like making the understanding or the receipt of that information just like a lot more seamless. And I guess, Marina, I'm curious in your work, right, because I think this is very parallel to RAG. There's a lot of parallels to search. And I guess I'm kind of curious about how you think about this audio interface for what is effectively a kind of retrieval, right.

You're basically taking a doc and saying, how do we infer or extract some signal from it, basically in a way that's more digestible to the user? It absolutely is. And without being able to, of course, speak to Google's intentions, this, to me, seems like a one off to something deeper, which is the power of the multimodal functionality of these models. So the podcast itself, it's fun, but this is a way really to stress test an ongoing improvement in text to speech multimodality. This is something that we've wanted for a very long time and has consistently been not up to scratch right with Siri, Alexa, the rest of them.

So this is an interesting way, I think, probably, of stress testing the multimodality. I think the podcast thing will be kind of like fun, and then it will probably die down, but it will generate a lot of interesting data as a result of that, and data that you wouldn't normally get by going to traditional, hey, let's do transcripts of videos or closed captions on movies or anything of that kind. It's going to be something that is a lot more interactive, and in that way it's going to be more powerful, more interesting.

The hallucination part won't go away. We still have that problem, and we'll find potentially interesting ways to get at it. But what I suspect is really behind this is that the podcast thing may come and go, but this is really about figuring out what's the larger current state of multimodal text-to-speech models. Yeah, that's right. Google's at it again.

They're just launching something to get the data, I guess. Marina, tell us a little bit more about that. You said basically traditional approaches to doing this kind of multimodal have just not worked very well. In your mind, what have been the biggest things kind of holding us back? Is it just because we haven't had access to stuff like LLMs in the past, or is it a little deeper than that? For sure, because we haven't had access to the same scale of data.

So the reason that we managed to get somewhere with the fluency of LLMs in language is because we were able to just throw a really large amount of text at it. Here, we also want to throw just a really, really large amount of data at it for it to start being able to behave in a fluent way. So yeah, the name of the game here definitely is scale, because from the model's perspective, the fact that you're in one modality or another, the whole point is that it's not supposed to care.

And same thing theoretically with languages, theoretically with, you know, as you start to code switch and things like that. So it really will be interesting where this next wave takes us. But yes, this is a real cute way to get a whole lot of interesting data. That's my perspective. Natalie, what do you think? I know you work with some of the multimodality aspects as well.

I didn't think about the intentions from Google, definitely. To tell you the truth, I was really impressed with how entertaining it was to hear. They got you. Yeah, they got me. I was really laughing. But yeah, I think having these types of outputs, it's new. And I think also, for example, I did this when I was already tired after work and I was able to listen to the podcast. It was entertaining, it was easy.

So from one side, having this extra modality, I think it's going to help us a lot, because sometimes we just get tired of reading. And so it's fantastic to have that type of functionality. I think getting the data. We're getting there.

I think our next topic that Tim is bringing up has a lot to do with tonality and different aspects of voice. If I say something like this, it's very different than if I said it really loud and very animated. So I think we are getting there. There's a lot of data, I think, that may be difficult to use. For example, we have a lot of videos on YouTube, TikTok, a lot of those sources, but it's really difficult to use in an enterprise setting. So, yeah, I definitely agree with Marina on the aspect of scaling and getting more data in that respect. Especially if people are bringing documents, I don't know what license they provided and if they are keeping any of the data. I really didn't take a look at that aspect.

But, yeah, that could be a really interesting way to collect data, for sure. Yeah. And I think this is really compelling. I hadn't really thought about it that way until you just said it. You know, I've always loved, like, oh, you're reading the ebook, and then you can just listen to, you can pick up where you left off, listening to it as an audiobook. And I also think a little bit about kind of like, the idea that people say, oh, I'm a really visual learner. Right?

Like, I need pictures. It's kind of an interesting idea that if multimodality gets big enough, like, any bit of media will be able to become any other bit of media. Right? So, you know, if you're like, I actually don't read textbooks very well. Could you give me the movie version? Could you give me the podcast version? Right. Like, almost anything is convertible to anything else.

And so it kind of presages a pretty interesting world where whatever kind of medium by which you learn best, you can just get it in that form, and there's going to be a little bit of lossiness there. Right. But if it's good enough, it actually might be a great way for me to digest Wagner's thesis, right. Which I'm by no means qualified to read, but maybe going away with a podcast of it, I'd be able to be, like, 40% of the way there, you know? So, yeah, I'm actually curious how it does with math, because when I read papers, I oftentimes in the side write the notation to remind myself.

I'm not sure how it would go with Wagner's thesis. If I don't have my math and my way to annotate the entire paper, it may be difficult. But yeah, I'm going to move us on to our final topic of the day. So we are really beginning, I think, to get into the fall announcement season for AI. I think there was basically a series of episodes over the summer where it was like, and this big company announced what it's doing on AI and this big company announced what it's doing on AI and I think we are officially now in the fall version of that.

And probably one of the first firing shots is OpenAI doing its DevDay. So this is its annual kind of announcement day where it brings together a bunch of developers and talks about the new features it's going to be launching specifically for the developer ecosystem around OpenAI. There were a lot of sort of interesting announcements that came out, and I think we're going to walk through a couple of them because I think particularly if you're a layperson or you're on the outside, it can be kind of hard to sometimes get a sense of why these announcements are or are not important.

And it feels like the group that we have on the call today is a great group to help kind of sift through all these announcements to say this is the one you should really be paying attention to, or this one's like mostly overhyped and doesn't really matter. I guess maybe, Wagner, I'll start with you. I think the one big announcement that they were really touting was the launch of the real-time API. This is effectively taking their widely touted conversational features in their product and saying anyone can have low-latency conversation using our API now. I guess we could just start simple: big deal, not a big deal?

What do you think the impact will be? It's an interesting proposal, although I have a few concerns about it. When I was reading how they are exposing these APIs, one aspect that caught my attention was related to the identification of the voice, because the proposal they have is that that will be on developers' shoulders, so the voices don't identify themselves as coming from an AI API, as an OpenAI voice. So that is one thing that caught my attention.

If we go full circle to the first topic we mentioned, what are the kinds of attacks that attackers can create using this kind of API to generate voices and put that into scale? And also the use of the training data without explicit permission. So they say, okay, we're not using the data, considering both input and output, if you do not give explicit permission. So these were the two aspects that caught my attention when I was reading and double checking how they are publicizing this technology.

And the last one was on pricing, because they are going from $5 per million tokens to $100 per million tokens for input, and from $20 to $200 for output. So people need to think a lot in terms of business models to make it worth it, right? Yeah. To make it even viable.
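
To make the pricing point above concrete, here is a minimal sketch of the arithmetic. The per-million-token rates are the figures quoted in the discussion; the monthly token volumes and the names `LOWER_RATES`, `HIGHER_RATES`, and `cost_usd` are hypothetical, chosen only for illustration, and which product tier each rate belongs to is not spelled out in the conversation.

```python
# Per-million-token rates in USD, as quoted in the discussion above.
# The tier labels are an assumption; the speaker only says prices go "from ... to ...".
LOWER_RATES = {"input": 5.00, "output": 20.00}
HIGHER_RATES = {"input": 100.00, "output": 200.00}


def cost_usd(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Return the cost in USD for a given token volume at per-million-token rates."""
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical monthly workload, for illustration only.
monthly_input_tokens = 2_000_000
monthly_output_tokens = 1_000_000

lower_cost = cost_usd(monthly_input_tokens, monthly_output_tokens, LOWER_RATES)
higher_cost = cost_usd(monthly_input_tokens, monthly_output_tokens, HIGHER_RATES)

print(f"lower-rate cost:  ${lower_cost:.2f}")
print(f"higher-rate cost: ${higher_cost:.2f}")
print(f"ratio: {higher_cost / lower_cost:.1f}x")
```

Under these assumed volumes, the same workload costs more than ten times as much at the higher rates, which is the business-model concern raised here.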

Yeah. It's sort of interesting how much the price kind of limits the types of things you can put this to, I guess. Wagner, one idea that you had, so you raised kind of the safety concern is the hope that basically would you want the API every time you access it to be like, just to let you know, I'm an AI, or are you kind of envisioning something different on how we secure safety with these types of technologies? I like to think about parallels.

When we interact with text-to-text chatbots today, they identify themselves as bots. So we know, and then we can ask, okay, let me talk to a human. But if these voice or speech-to-speech agents or chatbots do not even identify themselves, then I think that there's a problem in terms of transparency there. And so, yeah, that would be my take. The transparency aspect is complicated because people may think that they're talking to a human, but they're not. And I double check the.

Well, we are at a point in technology where the voices have really high quality, so it's really hard to differentiate. Great. Natalie, I think I'll turn to you next. I know just in the previous segment you were talking a little bit about all of the special challenges that emerge when you go to voice, because obviously voice is multidimensional in a way that text is not. I'm curious if you have any thoughts for people who are excited about real-time AI?

They want to start implementing voice in their AI products. How would you advise them? Do you have any best practices for people as they kind of navigate what's basically a very different surface for deploying these types of technologies? We'd love your thoughts on that. Let me twist your question and answer a little bit, just kind of considering also what was mentioned by Wagner just before. So one of the things that really captured my attention in the report was that, for example, if the system has some sort of a human talking to it, or it may be actually another machine, they forbid the model to output who is talking.

So basically, no voice identification is provided, which ties together with your question. When we have a model that is not able to really understand who is talking to it, and that model is going to take a bunch of actions outside, how do we know that we are authenticated? That is a problem. If that voice is telling me, buy this and send it to this other place, how do we know that this is a legit action? So it becomes really tricky.

The way they restricted that was basically for privacy reasons, so that if you have your device in a public place with somebody talking, then you cannot really learn a lot about those people, hopefully, because that provides privacy. But on the other hand, you don't have speaker authentication, and that's going to be problematic later on for applications where you're buying things or sending emails. What if somebody just uses it, maybe because you forgot to lock your phone? That is going to be, I think, a potential security situation, especially when there's money involved or reputation involved. Then that's going to be critical.

Yeah, it's a really interesting surface where basically the privacy interest is also a little counter to the security interest, ultimately. Marina, another announcement that they had that I thought was really interesting was vision fine tuning. So they basically said, hey, now, in addition to using text, we're going to support using images to help fine tune our models.

And for, I guess, non-experts, do you want to explain why that makes a difference? Does it make a difference at all? I think it's just important for people to understand as we sort of march towards multimodality. That also touches a little bit on how fine tuning gets done as well. And again, kind of curious, a little bit like with Wagner: do you think it's a big deal? Maybe it's not that big of a deal.

No, I think the thing to understand with multimodality is that it can be very helpful. Just as a model trained on multiple languages sometimes gets better at all of those languages, a model trained on multiple modalities can get better in each of them because of things it has learned about the representation of things in the world through those modalities. And that makes it pretty interesting in the sense that you said.

I'll make one comment, just going back for one minute, sorry, to the previous thing with the speech: I think that we should pay close and critical attention to the way that these things get demoed versus the capabilities that they have. One thing to note: the demo, if I recall correctly, was a travel assistant, recommend me restaurants and things like that. Very, very traditional chatbot customer assistant demos, where, if you're in that kind of situation, yeah, you're pretty clear that you're talking to a chatbot, whether it's speech or text or anything like that. But the reality is that you could use it in a lot of the ways that Wagner and Natalie were talking about.

And we really do want to keep in mind that just because we're all pretending that we're making travel assistants, we're not necessarily all making travel assistants. And it's maybe the same thing with vision. You can say on the one hand it's good because you're able to communicate different kinds of information to the model. Oh, now you can fine tune on this picture, this picture, this picture. Does it mean it's now once again easier to potentially repurpose other people's works and pass them off as your own?

And is that kind of thing harder to track when it's in a different modality? Things to consider. Yeah, I don't work too much in images myself, but just looking at the multimodal space overall, that's sort of where my mind goes. Yeah, for sure. And I think it's very challenging. I think part of the question is ultimately who's responsible for ensuring that these kinds of platforms are used in the right way, particularly on voice. Right.

I guess, Marina, one question would be whether you think they should be more restrictive, because, well, not everyone is going to be building a travel assistant. Some people may be using it to try to create believable characters that interact with people in the real world. Is the solution here for the platform to exercise a stronger hand over who gets access and who uses this stuff, or is it something else? I think it's not going to work. Most of these models, or their variations, get open-sourced very quickly. That's the way that things go.

So at the rate at which things are going, people will be able to just go around the platform. So I don't know that that's going to work. I think there's an important thing that good actors should ask themselves: just because you can mimic a human voice very closely, does that mean you should? Maybe you actually should make your assistant's voice identify as a robot, because that is the acceptable way of setting expectations. But I don't know that putting this on the platforms is going to work. We're nowhere with regulations.

We have pretty much nobody who's a real not-for-profit actor in the space. Everybody is a business and trying to make money. I just doubt that that's going to work. Yeah, I think one of the things that I'll just throw in is that the technology is sprawling, and ever more sprawling. Right. I think, Marina, to your point, maybe back in the day we could say, oh, only a few companies can really pull this off.

But it just feels like, as the technology becomes more commoditized and more available, there are fewer points of control for these safety problems, basically. And it feels like the bigger thing is how we educate people, in some cases. Should you? That seems to be the question you really want people to ask when they're designing these systems, which seems to me to be much more about norms than about trying to set some technical standard.

The other aspect is that before, actually, I was working more in the image and video modality. For humans, it's sometimes very difficult to see some of the perturbations that images have. With a machine learning model, you can give it a picture of a panda and a picture of the same panda with a very tiny perturbation. The machine learning model goes really crazy and tells you it's a giraffe, but for a human it's still a panda, obviously. So I think adding this new modality definitely adds more risk and risk exposure for the models.

Now, whether we should be worried about it: I think in the OpenAI situation, they probably would not make the model public, so that's going to be more restricted. But for other models, that is definitely a situation we need to worry about, because we never fully solved adversarial samples. That thing with the panda, those are called adversarial samples. So we never, as a community, really solved that problem.
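The panda example can be sketched in a few lines. What follows is only a toy illustration, not anything shown on the episode: it assumes a made-up linear "classifier" over 64 pixel values, with invented weights and bias, and applies a sign-of-the-gradient step (the idea behind the fast gradient sign method) to flip the label while changing each pixel by at most 0.02. All names and numbers here are hypothetical.

```python
# Toy adversarial example. The model, weights, bias, and pixel values are all
# invented for illustration; a real image classifier is far larger and nonlinear.

def sign(value: float) -> int:
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def score(weights, bias, pixels):
    """Linear 'panda' score: positive means panda, otherwise giraffe."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias

def classify(weights, bias, pixels):
    return "panda" if score(weights, bias, pixels) > 0 else "giraffe"

def perturb(weights, pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the score.

    For a linear model, the gradient of the score with respect to a pixel is
    that pixel's weight, so stepping against sign(weight) is the steepest
    change available under a per-pixel budget of epsilon.
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]

weights = [0.5, -0.25, 0.75, -0.5] * 16   # 64 hypothetical pixel weights
bias = -3.5
pixels = [0.5] * 64                       # a flat gray "image"
epsilon = 0.02                            # tiny change relative to a 0..1 pixel range

adversarial_pixels = perturb(weights, pixels, epsilon)

original_label = classify(weights, bias, pixels)
adversarial_label = classify(weights, bias, adversarial_pixels)
largest_change = max(abs(a - p) for a, p in zip(adversarial_pixels, pixels))

print(original_label)     # panda
print(adversarial_label)  # giraffe
```

Each pixel moves by only 0.02, yet the small changes all push the score the same way, so together they are enough to cross the decision boundary. That accumulation across many inputs is why a perturbation a human would not notice can still change a model's answer.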

Now that problem, when we add multimodality, is coming back to our plate, and we need to think about it. Before, it was probably not as much of a risk because people had more difficulty interacting with the models. But now we have more and more people using text and image models. So are we actually in more danger? I think that's an active research topic. Hopefully, with the large language models, a lot of the research that went to images actually moved to text.

So I anticipate more and more people are going to start working in this intersection. But it's an open issue, basically. Yeah, I think it's so fascinating. I think when those adversarial examples first started to emerge, it was almost kind of in the realm of the theoretical. But now we just have lots of live production systems that are out there in the world, which obviously raises the risk and the incentive to, of course, undermine some of these technologies. So it's definitely a really big challenge.

Wagner, any final thoughts on this? I was thinking about the possibility of fine tuning vision models. One aspect that I believe is interesting, and the report gives an example of this with capturing traffic images to identify speed limits and so on, is that it could help development in, let's say, countries in the Global South. Usually, when we talk about models and images, the datasets are mostly US datasets, and the models are trained mostly on them. Right. And I think allowing that is interesting in one direction because it supports people developing technologies in countries like Brazil, where sometimes we don't have the roads so well painted and signed as here in the US.

So allowing folks to do this fine tuning, I think, is interesting as a way of putting technology in other contexts of use, far from the context of creation. In this sense, I think it's interesting. Yeah, for sure. Well, as per usual with Mixture of Experts, I think we started by talking about Dev Day and what they're doing for the developer ecosystem, and I think ended up talking about international development. So it's been another vintage episode of Mixture of Experts. That's all the time that we have for today.

Marina, thanks for joining us. Wagner, appreciate you being on the show. And Natalie, welcome back. And if you enjoyed what you heard, listeners, you can get us on Apple Podcasts, Spotify and podcast platforms everywhere. And we will see you next week. Thanks for joining us.
