Three Google engineers, James Rubin, Peter Danenberg and Peter Grabowski, discuss what they’ve learned so far working on Google’s Gemini AI and what’s to come next.


Transcript
00:00Well, hello everyone.
00:02I'm James Rubin.
00:03I'm going to be the moderator for this fireside chat with my esteemed Google colleagues, Peter
00:09Grabowski and Peter Danenberg.
00:11We're going to be touching on the key blockers and solutions to productionizing enterprise
00:17ready LLMs and talk about some of the state-of-the-art approaches to really solving those challenges.
00:24We want to really focus on the practical applications today and we hope this is really a jumping
00:28off point for folks.
00:30It's a conversation starter for folks and we'll be in the Media Lab later if you have
00:33any more deep dive questions.
00:36But I think before we dive in, maybe we'll start with some introductions.
00:40I can go first.
00:41I'm James Rubin.
00:42I'm a product manager at Google for a Gemini applied research team.
00:48I work with Peter Grabowski.
00:50I'm his product counterpart.
00:53Prior to Google, I was at Amazon as a PM for the better part of three years.
00:58Worked across the AI stack, beginning with Zoox, which is their self-driving subsidiary,
01:03but also on custom AI chips and machine learning services at AWS.
01:08With that, I guess I'll pass it off to Peter Danenberg.
01:11Hi.
01:12I'm Peter.
01:13I work on Gemini, which was formerly Bard and before that was Assistant.
01:17Peter Danenberg on Assistant, formerly Bard, formerly, sorry, Gemini, formerly Bard, formerly
01:22Assistant.
01:23And I've been a senior SWE for a while, currently working on Gemini extensions.
01:29My name is Peter Grabowski.
01:30I am lucky enough to work with James on the Gemini applied research team.
01:34I've been at Google for about 10 years.
01:36Came in through the Nest acquisition, then spent some time working on the Google Assistant.
01:40And in the evenings, I moonlight as a teacher.
01:42So I'm on the faculty of UC Berkeley's master's in data science program and teach deep learning
01:46with natural language processing.
01:49And one thing I'll add there, Peter is too humble to say it, but during really the height
01:54of the AI craze in early 2023, he created and taught an LLM bootcamp that has really
02:00become the foundational course for Google.
02:02It's been taken by tens of thousands of Googlers, including myself.
02:05So we definitely have two incredible experts on the stage today.
02:11Before we dive in, though, I want to discuss kind of what the motivation for this talk
02:15was.
02:16A recent survey from Andreessen showed that the overwhelming majority of enterprises adopting
02:21AI are choosing to build AI apps in-house on top of common foundation models, as opposed
02:29to leveraging B2B AI software off the shelf.
02:33There may be a founder in the audience today that disrupts that trend.
02:36But regardless, the thing that we're most interested in is this disconnect between that
02:41enthusiasm to build and this hesitance and slowness that we're seeing with enterprises
02:47to deploy LLMs externally in production.
02:51Internal apps we're seeing, testing we're seeing, use cases for employee productivity,
02:55but when it comes to external apps, comparatively, they lag behind.
02:59So we really want to keep that in focus today, and we want to focus on ways that perhaps
03:04we can make the process of productionizing these LLMs a bit more clear.
03:09So with that, I thought we'd just level set with everyone and maybe start with the basics.
03:13Peter G., how would you describe an LLM, and why should businesses care?
03:19Happy to.
03:20I'll give a really simple example, but hopefully it's got some strong pedagogic value.
03:25LLMs are really nothing more than fancy autocomplete.
03:27You might have heard that metaphor before.
03:29And so if I said, or if I gave the audience a prompt, I went to go see a baseball game
03:33last night, I got to see the Boston Red Sox.
03:37Hopefully everybody's thinking about Sox.
03:39That's at their core what LLMs are doing.
03:42Now with short context windows for just a little bit of data, that's not all that interesting.
03:46But when you start showing these examples or giving them billions of parameters to learn
03:50with and start showing them hundreds of thousands or millions or billions of examples with longer
03:54and longer context windows, you start to see some really interesting behavior emerge.
04:01And I think one thing that we may want to expand on a bit is the fact that LLMs aren't
04:07just good for next word prediction and chat, but actually can be used for a wide range
04:10of traditional ML approaches as well.
04:12So perhaps just unpack that for us and also talk about the ways in which tasks can be
04:17reframed to work with LLMs.
04:19And so that's one of the things that's super interesting.
04:21Once you get into the couple billion parameter size, you start to see these fascinating properties
04:26emerge.
04:27And so if you've heard the term zero shot or few shot learning, that's what folks are
04:30talking about.
04:31If any folks in the audience either have a two-year-old or are ML engineers, this is
04:35the answer to the question you might have had, which is why can I show my two-year-old
04:38a picture of a zebra and show her maybe two or three?
04:41And then on the fourth, she's able to correctly identify what a zebra is.
04:45But with more traditional machine learning techniques, you'd need to show it 10 or 20 or
04:4930,000 images of a zebra.
04:51So I think that's what James is alluding to.
04:53The other thing that's really interesting is you can start to recast many of these traditional
04:58ML problem framings as next word prediction problems.
05:01So if I gave the model a sentence and then gave it the prompt, "the sentiment of this sentence
05:05is blank," it would happily fill in positive, negative, neutral, happy, sad.
05:10And all of a sudden, you've reframed a classification problem into a next word prediction problem.
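A minimal sketch of the reframing described above, in Python. The `complete` argument stands in for any LLM completion call; `stub_complete` is a hypothetical keyword-based placeholder so the example runs on its own, and the prompt template and label set are illustrative assumptions rather than anything specified in the talk.

```python
from typing import Callable

LABELS = ("positive", "negative", "neutral")

def build_prompt(sentence: str) -> str:
    """Phrase classification as a fill-in-the-blank completion."""
    return (
        f"Sentence: {sentence}\n"
        f"The sentiment of this sentence ({', '.join(LABELS)}) is:"
    )

def classify(sentence: str, complete: Callable[[str], str]) -> str:
    """Ask the model for the next word and map it onto a known label."""
    words = complete(build_prompt(sentence)).strip().lower().split()
    first_word = words[0].strip(".,!") if words else ""
    # Fall back to "neutral" when the model answers outside the label set.
    return first_word if first_word in LABELS else "neutral"

def stub_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; looks only at the sentence line."""
    sentence_line = prompt.splitlines()[0].lower()
    if "love" in sentence_line or "great" in sentence_line:
        return " Positive."
    if "hate" in sentence_line or "awful" in sentence_line:
        return " Negative."
    return " Neutral."

print(classify("I love this ballpark.", stub_complete))
```

Swapping `stub_complete` for a real model call would leave `classify` unchanged, which is the point of the reframing: the classifier is just a prompt plus a mapping from the next word to a label.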
05:15Awesome.
05:16So basically, no excuse for businesses not to think creatively about how to apply LLMs
05:20to their use cases.
05:22Plenty of solution space exploration there.
05:25Peter D., I want to get your insights on this, especially how LLMs with these traditional
05:31ML approaches compare in terms of performance and how people are using them.
05:35So one of the things we do every couple of weeks is we bring startups to Google to just
05:38ask how they're using AI.
05:40What are the difficulties they're having with Gemini, this and that?
05:43And so a couple of weeks ago, I learned that a lot of startups are doing this thing where
05:47they basically train a bunch of baby models, that's sort of a Gemma 2B model, on things
05:51like classification tasks.
05:52So they can go to market in something like six to eight weeks, whereas previously, even
05:56just to train a trivial model as a classifier would take six to 12 months.
06:01So we've been seeing this quick time to market using baby LLMs as classification engines,
06:06trained on maybe tens of examples, which is incredible.
06:09And actually, this is a great way to get us back from that slight digression, which
06:13is what is the shift in LLMs that makes them especially attractive to businesses?
06:18Yeah.
06:19That's one of the things my team and I are really excited about, which is all of a sudden,
06:23to train a classifier, to train a model of a fixed quality, the amount of time that it
06:27takes, the amount of data that it takes, the amount of expertise that it takes, the amount
06:31of compute that it takes, has fallen dramatically.
06:34So that's one of the things we're exploring on our team.
06:36Absolutely.
06:37And I think there's actually one important one we missed, which is customizability, right?
06:41The ability to tune and align models to a specific task or domain.
06:46Businesses have vertical use cases, specific customer problems they're trying to solve.
06:49So this is incredibly important.
06:52And I actually want to drill down on that a bit further and get your insights.
06:56Data shows that customizability is one of the top two selection criteria for enterprises
07:01selecting a model provider.
07:04But the process for going about customization is very complex.
07:08There's many different tuning techniques.
07:11There's many quality and cost trade-offs.
07:13It's very difficult to get to the output that you want.
07:16So perhaps, Peter, starting with Peter G., give businesses a starting point to navigate
07:22this complexity.
07:23Yeah.
07:24100%.
07:25So I saw this in my own work.
07:26I worked on the Google Assistant for a number of years.
07:28One of the things that we were focused on is building a sentence simplification engine
07:31for kids.
07:32And so if you ask, why is the sky blue?
07:34And you're an adult, you might get an answer like refraction in the ionosphere.
07:38But if you're a kid, that's not a satisfying answer.
07:40You want something like it bounces off a water drop that's in the sky.
07:43And so I spent about, like you were saying, six to eight months trying to build a model
07:47with Google Research and launch it into production.
07:50We were able to build something that works, but it wasn't high enough quality to ship.
07:54Fast forward to a year ago, with the few-shot prompting techniques that we were talking
07:58about, I was able to build something that blew the model we had built five years prior
08:02out of the water.
08:03And so to segue, that's the advice that I would give to businesses.
08:06Think about the problem that you want to focus on and solve using a large model, and then
08:10just get started.
08:11You can start by asking a model a question, just like we were talking about a moment ago.
08:16So let's say they set up a sandbox.
08:18They run an internal pilot.
08:20They've got their metrics set, they're gathering data.
08:24It performs really well on general tasks, but it's not quite ready to specialize in
08:29their domain.
08:30So they're trying to replace some of their domain-specific workflows.
08:34What are some more advanced approaches they can now take?
08:37And you touched on one thing that I think is really important to emphasize, which is
08:40make sure that you've got metrics in place so that you can measure when you're improving.
08:44And so in this case, maybe we can talk about a hypothetical example of a legal startup.
08:50Maybe you want your chatbot or your agent to talk and sound like a lawyer.
08:54The first thing you might do is try what's known as role prompting, which is just telling
08:57the model to talk like a lawyer.
09:00From there, I would, again, evaluate, measure, see how you're doing.
09:04And if it's not where you want it to be, there's a couple other techniques you can try.
09:09Definitely dive into those, especially with regards to domain knowledge and domain specificity.
09:14Sure.
09:15So the next thing I would think about trying is a family of techniques
09:18known as domain adaptation.
09:20So the first thing you might try is continued pre-training.
09:23So what you're doing is you're taking that language modeling task that you started with,
09:26predict the next word.
09:27You're using backpropagation to update your weights.
09:29But you're focusing it on a corpus of data that's relevant to your model.
09:33And so for instance, in this legal example, you might do continued pre-training on a corpus
09:39of law textbooks.
09:40To give a human analogy, that's like telling a first year law student to go read 50 legal
09:45textbooks and come back and talk to me more like a lawyer.
09:49Awesome.
09:50And what about things like classification, given we mentioned it earlier, or chat, where
09:55we're talking about really adapting the task that the LLM is doing rather than the domain
09:59background?
10:00That's a great question.
10:01So if you're asking the model to make, I don't know, a decision about some sort of case law
10:05or something like that, continued pre-training can absolutely help.
10:09You might decide to focus it on specific examples of the task you want it to do.
10:13And so if it's a classification problem, you would train it using backpropagation, using
10:17the next word prediction task to do that specific focus on that task.
10:23And so in that context, it's usually known as supervised fine tuning.
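A rough sketch of how a supervised fine-tuning example can still be expressed as next word prediction. It assumes whitespace tokenization and assumes the loss is counted only on the completion tokens; that masking is a common choice but is not stated in the talk. The legal clause and the label are invented for illustration.

```python
def make_sft_example(prompt: str, completion: str):
    """Turn a (prompt, completion) pair into next-token inputs, targets and a loss mask."""
    prompt_tokens = prompt.split()
    completion_tokens = completion.split()
    tokens = prompt_tokens + completion_tokens

    inputs = tokens[:-1]   # what the model sees at each step
    targets = tokens[1:]   # the next word it should predict at each step

    # Count the loss only where the target is a completion token
    # (assumption: prompt tokens are masked out of the loss).
    loss_mask = [0] * (len(prompt_tokens) - 1) + [1] * len(completion_tokens)
    return inputs, targets, loss_mask

inputs, targets, loss_mask = make_sft_example(
    "Clause: tenant pays rent monthly. Category:", "payment"
)
for step in zip(inputs, targets, loss_mask):
    print(step)
```

The training objective is the same one used in pre-training; what changes is the data, which now consists of examples of the specific task.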
10:27Awesome.
10:28So to summarize, for domain knowledge, domain specificity, continued pre-training is a good
10:33place for people just to start.
10:35If you're looking to improve a very specific problem framing, very specific task, SFT is
10:41a good place to start.
10:42But there are, as we know and you know more than most, there's many other techniques to
10:47explore within that.
10:48But we can talk about that maybe outside later.
10:51Peter D., I feel like we've ignored you a bit here, but the examples that Peter G. gave
11:00were with use cases where basically quality and accuracy can be defined pretty concretely.
11:05A legal chatbot, you can evaluate against an LSAT, a classifier, you can measure accuracy
11:10and precision.
11:11What about use cases where quality is defined more ambiguously?
11:16Yeah, it's interesting.
11:17So I don't know if this is a West Coast thing, but we had a bunch of startups come to Google
11:21a few weeks ago who, they're trying to solve this personal companion problem.
11:24And there seems to be a lot of VC money in that.
11:27Is that a West Coast thing, by the way, you guys on that East Coast?
11:30So anyway, we had this little experiment, right?
11:32So let's see if we can fine tune a Gemini model to be like Sherlock Holmes or Elizabeth
11:36Bennet.
11:38And so we ran this experiment where we tuned a Sherlock Holmes with about 10,000 examples,
11:42right?
11:43And there was this really bizarre phenomenon where this fine tuned Sherlock Holmes didn't
11:48seem to know that he was in a book, right?
11:51And so he would answer in the first person, which was a little bit strange.
11:54If you do the same thing with vanilla Gemini, you'll notice that Gemini speaks as Gemini
11:58with a little bit of Sherlock Holmes lipstick, essentially.
12:02But one of the funny things was, though, is that how do you evaluate this fine tuned Sherlock
12:06Holmes versus a vanilla one?
12:09And is it enough just to say, hey, Holmes, by the way, do you live at 221B Baker Street?
12:14And it turns out it's not.
12:16And especially when you're talking about these AI as companion sort of domain, I think there
12:21are a lot of these subtle issues like, is this character personable?
12:25Does this character scratch my itch for some definition of itch?
12:30And for that sort of thing, maybe it turns out you need a human in the loop sort of evaluating
12:37fine tuned Holmes in this case versus vanilla Holmes.
12:39So are you telling me that fine tuning is a little bit like method acting for LLMs?
12:44I think so.
12:45I think so.
12:46But, you know, and the funny thing, by the way, you know, it turns out that you can fine
12:49tune a model with like 400 to 500 examples.
12:52And in this Holmes case, I think we took about 10,000 examples.
12:55And it could have been that some of those examples were actually lower quality.
12:58And so just to give you an example, I just said, hey, Holmes, you know, can you tell
13:01me a little bit about rugby?
13:03And he said, you know what?
13:04I can't tell you.
13:05I've never played rugby.
13:06Basically he said, no, you know, I can't tell you.
13:08And I was just thinking, you know, if we had trained this Holmes on fewer high quality
13:11examples, we might have got a better result.
13:14And so there's this funny thing where more data is not necessarily better.
13:19Right?
13:20Right.
13:21I think that's a really great point.
13:22And before we move on from customizability as a topic, because there's so much more to
13:27cover, I do want to step outside this bubble of fine tuning an LLM for a single task and
13:32maybe talk about how LLMs are being extended for more complex workflows where they're operating
13:38asynchronously, even autonomously.
13:40Peter, I know you have a lot of experience with this.
13:43If you could share your insights.
13:46So, yeah, we did this interesting experiment a few weeks ago where we trained an LLM to
13:54be an asynchronous day trading bot.
13:56And just to show that I had some skin in the game, I threw a thousand bucks at it.
14:00And the funny thing is I made about three bucks.
14:03And so I don't know if I can tell you, but I have a 0.3% return, which is nice.
14:08But the funny thing is that even out of the box, using this thing called function calling,
14:14the LLM will actually learn how to act as an autonomous agent.
14:19And in order to pull that off, we had to do some classification tasks, like, you know,
14:23given these tweets, given these news headlines, are these bullish, are these bearish?
14:27And the funny thing was every single tweet was bearish.
14:30I'm not sure why.
14:32And it always wanted to spend at least half of my money.
14:36But, you know, one of the things I was thinking is if you did something like
14:40a backtesting algorithm and maybe trained the model on the last year of market data, you'd
14:44get an even better result.
14:46But I was kind of amazed what you could do out of the box.
14:48Awesome.
14:49Well, I'm going to responsibly pivot this conversation to factuality, because I think
14:53it's very relevant.
14:56Factuality is very important to enterprises.
14:59Recently an airline's chatbot famously hallucinated a false refund policy.
15:05They're now being sued as a result of that.
15:08Could Peter G. perhaps describe to us what's happening under the hood when a chatbot hallucinates?
15:13And then we can discuss some approaches to dealing with that.
15:17That's a great question.
15:18So I think what's going on is exactly what we were talking about at the start of this
15:21talk: the model is just trying to do next word prediction.
15:25In some cases, the model is very certain about the next word.
15:28If you imagine a probability distribution over all the possible tokens, you get a really
15:32spiky distribution.
15:33In other cases, it might be much less certain.
15:35So I think that's one dynamic at play.
15:37The second dynamic that I think is really interesting is in a lot of cases, these models
15:41are trained to be helpful.
15:42And so after the pre-training stage, there's often a phase known as instruction tuning,
15:47where that's exactly what you're doing.
15:48You're coaching the model, you're instructing the model, you're teaching the model how to
15:51be helpful, how to follow results, or how to give results, how to follow direction.
15:55And so in a case where the model is unsure, especially if you've instruction tuned, instead
16:00of just simply saying, I don't know, or I'm not sure, the model might try to hallucinate
16:05something or make something up to try to be helpful and answer your question.
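To make the "spiky" versus uncertain distributions concrete, here is a small sketch using only the Python standard library. The logits are made-up numbers over a toy four-token vocabulary, not values from any real model; entropy in bits is used here as one simple way to quantify how uncertain the model is about the next word.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy: low for a spiky distribution, high for a flat one."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative logits: one token dominates, as in "... the Boston Red ___".
confident = softmax([9.0, 1.0, 0.5, 0.2])
# Illustrative logits: no clear winner, so any sampled token is close to a guess.
uncertain = softmax([1.1, 1.0, 0.9, 1.0])

print("confident:", round(max(confident), 3), "entropy:", round(entropy_bits(confident), 3))
print("uncertain:", round(max(uncertain), 3), "entropy:", round(entropy_bits(uncertain), 3))
```

In the flat case the model still has to emit some token, which is one way to picture how a fluent but unfounded answer can come out.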
16:10And what are some more advanced approaches that people can use to deal with hallucination?
16:15There's a couple things that we recommend.
16:17And so one is using a technique that you all might be familiar with, known as retrieval
16:20augmented generation.
16:21And the idea there is that you want to use language models and a database together to
16:26solve the problem.
16:27And so you let language models do what language models are good at, which is generate natural
16:31language.
16:32And you let databases do what databases are good at, which is store, update, delete, and ACL
16:37facts.
16:38Then you train the language model to be able to retrieve the relevant information from
16:41the database and give an answer based on that context.
16:43And so in the airline example, hopefully in that case, it would have retrieved the actual
16:48refund policy and then maybe massaged it or summarized it or used that to answer the question.
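A toy sketch of the retrieval augmented generation flow described above. The policy texts are invented for illustration, and retrieval here is plain word overlap; a production system would more likely use embeddings and a vector store. The step that sends the prompt to a model is left out, since it depends on the provider.

```python
import re

# Invented example documents standing in for a company's policy database.
POLICY_DOCS = {
    "refunds": "Refund requests must be filed within 30 days of purchase.",
    "baggage": "Each passenger may check one bag of up to 23 kg.",
    "pets": "Small pets may travel in the cabin in an approved carrier.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        docs.items(),
        key=lambda item: len(question_words & tokenize(item[1])),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def build_grounded_prompt(question, docs):
    """Inject the retrieved facts into the prompt so the model answers from them."""
    context = "\n".join(retrieve(question, docs))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_grounded_prompt("How many days do I have to file a refund request?", POLICY_DOCS))
```

The facts live in the database, where they can be updated or deleted, and the model's job is reduced to phrasing an answer from the retrieved text.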
16:54Another hot topic is guardrails for LLMs, if you could just briefly touch on that as
16:57well.
16:58Yeah, 100%.
16:59And so this is a topic that's super important.
17:01It's not important just for generative applications; it's important any time you're building
17:05a machine learning system.
17:06But the idea is that frequently you take a stochastic machine learning model that's always
17:10going to have a little bit of randomness in it, and then apply a policy layer or a set
17:13of guardrails on top.
17:15And so in the large language model day trading case, you might do something like, no matter
17:23how good the market looks, don't spend more than 10% of my money.
17:26Or no matter how good the market looks, don't put all of my money on GameStop.
17:30And hopefully that might limit the output space that you're thinking about and control
17:35the behavior a little bit.
17:36There's a really interesting case where you can also use LLMs to help evaluate that policy.
17:41And so you might have a layer on top that says, is this in the voice of the company?
17:46And give some examples of the voice of the company.
17:48Or is this a helpful statement?
17:50Is this a short and accurate response?
17:52And that could be another way to use LLMs as a policy layer as well.
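A minimal sketch of a deterministic policy layer for the day trading example. The `Order` type, the tickers, the 10% cap and the blocklist are illustrative assumptions. Blocking a ticker outright is stricter than "don't put all of my money on GameStop" and is used here only to keep the example short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    ticker: str
    dollars: float

def apply_guardrails(proposed, cash, max_fraction=0.10, blocked=frozenset({"GME"})):
    """Filter and cap model-proposed orders, regardless of how confident the model is."""
    budget = cash * max_fraction  # total spend allowed across all orders
    approved = []
    for order in proposed:
        if order.ticker in blocked or order.dollars <= 0:
            continue  # the policy rejects this order outright
        spend = min(order.dollars, budget)
        if spend <= 0:
            break  # budget exhausted; ignore the remaining orders
        approved.append(Order(order.ticker, spend))
        budget -= spend
    return approved

# Pretend the model proposed these orders for an account holding $1,000.
model_output = [Order("GME", 1000.0), Order("GOOG", 500.0), Order("MSFT", 80.0)]
print(apply_guardrails(model_output, cash=1000.0))
```

Because the policy runs after the model and contains no randomness, its limits hold even when the model's output does not behave as expected.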
17:57It's worth noting that factuality, though, and these approaches, especially guardrails,
18:02can sometimes come at the cost of the user experience.
18:06Peter D., given your experience working with startups, especially these highly creative
18:10AI personas, perhaps you can share some insight on how this balance between factuality and
18:16creativity is met.
18:19This is a really interesting phenomenon.
18:20So I noticed when startups come in now, one of the first things they do with the LLM is
18:24they turn off all of the safety features.
18:27And that's because there's this bizarre sort of optimization problem between safety and
18:33utility.
18:34And there are these cases where, just to give you an example, somebody wanted to do some
18:37multimodal analysis on monuments.
18:40And they couldn't about 75% of the time because there was like a human face in the picture.
18:46And so that's one of these really sort of subtle dances.
18:49Because I know it's possible, for instance, that you could inadvertently maybe let's say
18:54fine-tune a toxic model, you turn off the safety filters, and all of a sudden maybe
18:58there's sort of an embarrassing moment with your customers.
19:02And so anyway, there's this really subtle dance between safety and utility.
19:07And I think as a startup, maybe that's just some of the things that you have to be aware
19:09of when you go to market, right?
19:12Awesome.
19:13For the purposes of time, I do want to shift to data privacy, because this is absolutely
19:17key for enterprises.
19:18I mentioned those two top selection criteria for model providers.
19:22Data privacy is actually number one.
19:25Peter G., businesses are very concerned about training models on sensitive customer data
19:35and about proprietary data being divulged through LLM prompting.
19:40Perhaps touch on what's the basis for this concern, and what are some of the approaches
19:44that enterprises can take?
19:46Absolutely.
19:47So there's a long history of data privacy and machine learning going hand-in-hand.
19:50For as long as there have been machine learning models, people have had concerns about data privacy.
19:56These can be well-founded.
19:57Many of you might be familiar with the Netflix challenge from about 10 or 15 years or so
20:01ago at this point.
20:02And even with a relatively constrained output space, either a ranking problem or a classification
20:06problem (should people watch this?), researchers were able to reveal a whole bunch of sensitive
20:11information about the people in the data set.
20:13And so the first piece of advice I would give is don't ever train your model on sensitive
20:18data, whether it is a very simple classification model or whether it's a much more complicated
20:24generative model.
20:25Now, I think the reason that people are thinking about it so much in the generative case is
20:29the output space is much, much larger.
20:33Instead of true or false, yes or no, the model can generate free text.
20:39And so to motivate this concern a little bit, if I prompted a model, you know, Peter Grabowski's
20:44social security number is, hopefully it wouldn't be able to produce a valid response.
20:51Now let's say a company here has a product workflow that is really centered on the exchange
20:58of sensitive data.
21:00What are some approaches to enable that exchange of sensitive data without having to actually
21:03train on it?
21:04Good question.
21:06So one thing I would recommend is that retrieval augmented generation framework that we were
21:09talking about a moment ago.
21:11And that lets you store the sensitive data in a database where it can be appropriately
21:14ACL'd and then at inference time, at prompt time, you can inject that into the model and
21:18allow it to use it in its response.
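One way to picture access-controlled retrieval at prompt time, as a sketch. The records, user names and `allowed_users` field are invented for illustration; a real system would rely on the database's own access control rather than an in-memory list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str
    allowed_users: frozenset  # who may see this row (hypothetical ACL field)

# Invented rows standing in for a database of sensitive records.
RECORDS = [
    Record("Claim 1001: water damage, status approved.", frozenset({"alice"})),
    Record("Claim 2002: auto collision, status pending.", frozenset({"bob"})),
    Record("Office hours are 9am to 5pm on weekdays.", frozenset({"alice", "bob"})),
]

def visible_records(user, records):
    """Apply the access check before anything reaches the model."""
    return [r.text for r in records if user in r.allowed_users]

def build_prompt(user, question, records):
    """Inject only the rows this user may see; nothing is baked into model weights."""
    context = "\n".join(visible_records(user, records))
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"

print(build_prompt("alice", "What is the status of my claim?", RECORDS))
```

Since the sensitive rows only ever appear in the prompt, removing a row from the store removes it from every future response without retraining.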
21:20Awesome.
21:21Now, just a tidbit I would add here is, you know, for folks that are concerned about data
21:26privacy regulations like GDPR and HIPAA, RAG is very complementary in the sense that with
21:32a database you can easily permanently delete data, of course.
21:36It's just a question of deleting tables and rows in tables.
21:40Additionally, you can localize that database so you can ensure that data is not transferred
21:44outside of a specified geographic region.
21:46Both of those are very important to things like GDPR.
21:49There's another lens to this though, Peter G, that I want to get your insight into, which
21:53is mistrust of businesses, I think especially startups, of closed source model providers
22:00and using their model because they're concerned that the logs from that model, sensitive data
22:06will be used to actually train the closed source provider's model.
22:10No, you're absolutely right.
22:11And I think to that extent, you know, startups tend to go with something like a Llama 2 stack,
22:15maybe a Gemma stack, a Mistral stack, because they can run a couple of GPUs and they control
22:19the entire thing from beginning to end.
22:21But what I've noticed though is that some startups tend to be using something
22:25called a long context window as an ad hoc form of RAG.
22:28And what that means is there's this promiscuous intermingling of kind of inference and possibly
22:33training data.
22:34And that gets dangerous when you're talking about things like law and, you know, insurance
22:38type of matters.
22:40So I think just having RAG is a form of data discipline, right?
22:43And so even if you're running your own open source models, you can still run into privacy
22:47issues if you're not careful.
22:49But I think also that's something that we're trying to do.
22:51You know, I know Vertex AI is trying to basically be the, you know, one of the pitches that
22:55they're making is that your data is safe with Google, right?
22:58And, you know, I think that's at least how we're trying to differentiate ourselves, right?
23:03Awesome.
23:04So, look, both of you, my brilliant colleagues, there is absolutely no way that we can cover
23:12all the information needed to know about how to productionize an LLM in 25 minutes.
23:16But you've done a really good job.
23:19And I just want to thank everyone for listening.
23:21I want to thank Peter and Peter.
23:24And if you do have any follow-up questions, if you didn't understand something, we're
23:27going to be in the Media Lab.
23:29So please feel free to come up to us and ask questions.
23:32And I hope you enjoy the rest of the program.
23:34Thanks.
23:35Thanks.
23:36Thanks.
23:37Thanks.
