#235 GenAI + RAG + Apple Mac = Private GenAI
Matthew, welcome back to the show.
Glad to be back.
Hey, you've talked to us about gen AI, you've talked to us about cybersecurity, and today,
you just released a white paper, and I said, we've got to promote the white paper on the show.
But before we get into the topic today, I know you really well, but my guests don't know
you all that well, because I don't think you did a big, huge introduction.
So I always say to all my guests, you're a superhero.
What are your superpowers and what is your backstory?
Because people want to hear your backstory.
I know it really well because you're my oldest son.
So I know your backstory.
So it'd be interesting to hear your perspective of your backstory compared to what I would
have said.
Putting a man on the spot.
I know, see?
I'm good that way.
I would say my superpower is systemic thinking, just being able to know enough of the
details about a larger system to be able to understand it from the 20,000 foot view.
That's a really cool superpower.
That's one of mine.
So I like that one.
All right, your backstory. We need it; every superhero has a backstory.
How literally are you taking this?
I don't know...
Well, you started off as a software engineer, right?
You started off in computer science.
Yeah, literal backstory.
I mean, I've touched a little bit of every layer of the IT stack.
I've worked in help desk.
I've done some software development.
I ended up being a systems administrator, with a little bit of enterprise architecture work.
And then I kind of pivoted and did product management for a hardware startup for a while.
I wore all the hats, from UI design onward, and now I'm just in product management for cloud
infrastructure at MacStadium.
So that's a pretty, I mean, you're a generalist.
Definitely.
Definitely.
Well, and they're very rare, few and far between.
That's why you can talk to us today about generative AI, and specifically private gen AI.
I've had quite a few people come on the show and talk about private gen AI.
But you have an interesting play on this from a full, holistic systems point of view.
So how hard is it to set up a private gen AI?
What steps do I need to take?
Where do I start, Matthew?
So, I mean, a typical private gen AI system has different components.
There are a lot of off-the-shelf apps that make this really easy, that bundle these
components together.
But I think in order to have a quality solution, and especially if you want to host it as
a server, like share it with a team or something, you need to understand the components.
So you've got your model, which is like Llama 3.1 or some of the fine-tunes of that, like
Nemotron, or you've got Mistral.
These are typically open-source models that you can download, and they are the LLM that
you're interfacing with.
So that's the model.
That's what most people consider the AI; like GPT-4o is a model.
Okay, so I need a model.
But you also need a server to run it on, like a software server to host the model and
intercept requests.
So for that, you usually go with an LLM host.
The most popular out there is Ollama, and it's widely available on any platform, and we'll
kind of get to that in a second.
So you need a backend that runs the LLM, and then you need a frontend for your users to
talk to the LLM.
A good example of this is something like AnythingLLM, where it's connecting to Ollama and
providing the web interface.
And now with that, you have a very basic private AI solution where users can talk to an AI
model, and that model, it's not logged anywhere other than the server itself.
Like whatever happens in the box, or in the solution, stays in the solution.
It's not being intercepted or trained on by another party.
Now,
So there's no fear of data leakage if I'm running a private gen AI model.
We've talked about this in past episodes.
So I just need a model, a server, and a frontend application or website that talks to my
model that I'm hosting.
In essence, yeah, that's where it starts.
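To make that concrete, here is a minimal sketch of what that frontend-to-backend hop looks like from code, assuming an Ollama server is already running locally on its default port with a model such as llama3.1 pulled. The endpoint and fields follow Ollama's REST chat API, but treat the model name and details as illustrative rather than definitive.

```python
import requests

# Minimal sketch: one chat turn against a locally hosted Ollama server.
# Assumes Ollama is running on its default port and "llama3.1" has already been pulled.
OLLAMA_URL = "http://localhost:11434/api/chat"

def ask(prompt: str, model: str = "llama3.1") -> str:
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

print(ask("In two sentences, what is retrieval-augmented generation?"))
```

A frontend like AnythingLLM is essentially making this same call for you, just with a web UI, user accounts, and conversation history layered on top.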
And that would give you basic chat functionality, but where you get a lot of the value is
harnessing private context.
So if I have a general AI setup, like an LLM with the host and the frontend, I can chat
with it same as ChatGPT.
But if I chat with it, I'm going to get generic answers, because it's basing them off of the
knowledge of the whole world, essentially.
When I add
something into a prompt, that's the only context it's running off of and all it can do is
extend and respond to the context I give it.
So the less context you give it, the more generic the answer.
Got it.
So also the answer is dated too.
There's a temporal aspect to it, right?
Because the model was trained on data.
So how old is the data that these models have been trained on?
Are we talking two years old, five years old, two weeks old?
I'd say roughly six months.
Meta in particular releases very shortly after they complete training, but at the same
time, with 3.1 it's been several months now.
And while 3.2 is out, that's a different kind of model.
It doesn't solve the same problems that 3.1 does.
Gotcha, so the information becomes stale in these models over time.
That's why adding context to it can be valuable.
Right, so the context that you need, especially in a business situation... every
business is different.
They use internal lingo.
They operate differently.
You have different customers in different contexts, and none of that is reflected in the
model.
Anything that you've protected from the world, that's confidential, is by nature not
going to be in the model.
So its ability to help you, if you give it snippets of, hey, I have this one document,
like, grade it, it's going to grade it as someone off the street would grade it, because
it doesn't know anything about your business.
OK, so that makes sense.
So providing context around my business or my culture at my company, whatever the
case may be, I don't actually need to train the model on that.
I can just give it some more information in the prompt itself, basically.
Exactly.
But in order to do that at scale, you need to have it really intelligently retrieve that
context as you prompt.
Otherwise, you're going to be digging through documents, and you don't even know what you
don't know.
There may be a document out there that has the answer, but you forgot to include it.
So that's where context insertion and retrieval-augmented generation, commonly referred to
as RAG, come into play.
The way that RAG works, there's actually a third model that you need, and that's called
the embedding model.
So the way that RAG works is I have my frontend and I've got my LLM.
Well, that frontend, in this case with AnythingLLM, comes with a database called a
vector database.
And in the vector database, I have sliced-up chunks of uploaded documents that the system
retrieves when answering questions and inserts into the prompt as context.
It does this via an embedding model, which in this case also runs on the same LLM backend
that the chat model is running on.
So Ollama can run both at the same time.
So when you upload a document, this embedding model intelligently splits the document
into chunks and generates the vectors that are stored in the database.
Those vectors correspond to different chunks or segments of text.
When you enter a query, it will run your query through the embedding model and generate
vectors from that.
And the vector database matches the generated vectors to the stored vectors, retrieves the
relevant snippets or chunks of text, and inserts them as additional context so that the
model knows what to respond with.
And the result is you get much, much better results without having to manually upload full
documents into your prompt, which also saves you on context length.
So having tiny snippets helps immensely with being able to maintain a full conversation.
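As a rough illustration of that flow (not AnythingLLM's actual internals), here is a sketch of a toy RAG pipeline against a local Ollama server. It assumes llama3.1 for generation and mxbai-embed-large for embeddings, uses naive fixed-size chunking, and stands in an in-memory list for the vector database; a real system would use a proper vector store and smarter chunking.

```python
import requests

OLLAMA = "http://localhost:11434"  # local Ollama backend serving both models

def embed(text: str, model: str = "mxbai-embed-large") -> list[float]:
    # Turn a piece of text into a vector using the embedding model.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 400) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on sections or sentences.
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_index(documents: list[str], size: int = 400) -> list[tuple[str, list[float]]]:
    # Stand-in for the vector database: a list of (chunk, vector) pairs in memory.
    return [(c, embed(c)) for doc in documents for c in chunk(doc, size)]

def retrieve(query: str, index, top_k: int = 4, threshold: float = 0.5) -> list[str]:
    # Embed the query, score every stored chunk, keep only the strongest matches.
    qv = embed(query)
    scored = sorted(((cosine(qv, v), c) for c, v in index), reverse=True)
    return [c for score, c in scored[:top_k] if score >= threshold]

def rag_answer(query: str, index) -> str:
    # Insert the retrieved snippets into the prompt before asking the chat model.
    context = "\n\n".join(retrieve(query, index))
    prompt = f"Use the following internal context to answer.\n\n{context}\n\nQuestion: {query}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3.1", "prompt": prompt, "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]
```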
So how much influence do you have over the embedding model?
Because the embedding model is the one that's kind of making the decision on what chunks
of data to put into your prompt.
So that becomes very important, that embedding model.
How much influence do I have over that embedding model?
If, for example, it's not doing everything I want it to do, how can I change that?
Are there ways to change that?
Or do I try a different model?
You know, what does that mean?
You can definitely... there are a couple of things you can do.
You can change the threshold for matches so that you only retrieve high-matching snippets.
So: ignore anything with a similarity score under this threshold when I query the database.
You can limit or extend either the size of the snippets or the number of snippets inserted
into the prompt.
So those things are going to be very dataset dependent.
It really depends on what your documents look like.
If they're very well-sectioned documents, like documentation for an API or something
along those lines, more, smaller snippets might be the better way to go.
But if it's like, here's how our organization is structured and our general operating
rules, maybe you do need longer segments.
Now, the other side...
So that tuning of the embedding model is kind of trial and error then, right?
You'd say, I'm not quite getting what I want.
Increase my threshold or decrease my threshold, either way.
Increase my chunking size.
Those are all things that you can play around with to get better results.
Okay.
And it's all going to be dependent on your data.
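In other words, it's an empirical loop. A rough sketch of that trial-and-error process, reusing the hypothetical chunk/build_index/retrieve helpers from the RAG sketch above, with made-up test questions and expected phrases, might look like this:

```python
# Try a few chunk sizes and similarity thresholds, and check whether retrieval
# surfaces the passage you expect. "documents", the questions, and the expected
# phrases are placeholders; swap in your own data.
documents = ["...your uploaded documents as plain text..."]

test_cases = {
    "What is our PTO policy?": "paid time off",        # phrase the retrieved context should contain
    "How do we rotate API keys?": "rotation schedule",
}

for size in (200, 400, 800):                           # chunk size in characters
    index = build_index(documents, size=size)
    for threshold in (0.3, 0.5, 0.7):
        hits = 0
        for question, expected in test_cases.items():
            context = " ".join(retrieve(question, index, top_k=4, threshold=threshold))
            hits += int(expected.lower() in context.lower())
        print(f"chunk={size} threshold={threshold} hits={hits}/{len(test_cases)}")
```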
The other side of it, though, like you'd said earlier: can you switch the model out?
So with AnythingLLM in particular, if you just download the app version, which is a
pre-built, all-in-one package that runs as a desktop app, it comes with an embedding model,
but it comes with a very small one.
And in my experience, I've had better results replacing that with one of the better models
out there, like MixedBread's mxbai-embed-large.
Using a smaller LLM, I get better results with a larger embedding model than I do even
with Llama 3.1 70B and the default embedding model that comes with AnythingLLM, because it
can be kind of bad if you can't pull relevant results into your prompt, since LLMs
consider everything in the prompt as context.
So if you're not getting a strong match, you can actually get worse results than without
the context.
So that's really interesting.
Your embedding model could actually be larger than your prompting model.
Not as much in practice, but it could.
I mean, it depends on the context, right?
I have seen some approaches that take some of the smaller LLMs, a small large language
model like an 8B, and repurpose it as an embedding model.
And that might be really relevant depending on what you're doing.
Fascinating.
Fascinating.
Okay, so let's say I set all this up and I add documents to my RAG, right?
Retrieval-augmented generation.
How big can it be? I mean, can I put terabytes of documents in there?
And is it going to take a long time, the larger my RAG corpus is?
Or should I have multiple RAGs?
Like one for HR and one for my engineering team.
I mean, what's the better way to go?
A more centralized one that understands everything, or segmented based off additional context?
So segmenting, I would definitely say, is essential, because again, the relevance of the
snippets dramatically affects the result.
By default, with AnythingLLM, they organize things into workspaces.
So you're putting different sets of documents into different workspaces so different
users can be given access.
Here's the HR one.
If you have HR questions, message this.
If you have engineering documentation questions, message this, because they're totally
separate contexts that they should be operating with.
Interesting.
Here's a crazy thought, because I'm your dad and I have crazy thoughts, right?
What if I just ask a question, and it says, hey, the question that you asked is an HR
question, and it automatically then goes to that workspace for me and submits the question.
So there's this segmentation that happens dynamically with another LLM.
And I think, I'm almost positive, that's how OpenAI does it.
They have multiple algorithms.
It's how Meta does it in production with their Meta AI product.
That's how their safety models work.
I mean, their classifier will say, hey, this doesn't look like a very safe prompt.
Just reject it.
Don't even process it.
That saves them a lot of processing power.
Or maybe it's tool usage.
They're asking a question I don't have a clear answer to.
Or I can classify that this question likely has to do with current events or something that
is less likely to be in the context.
So go ahead and do a web search before we do anything else.
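Here is a sketch of that kind of routing step, using the same local Ollama server as a lightweight classifier in front of hypothetical "hr" and "engineering" workspaces. The workspace names and router prompt are illustrative, not how Meta or OpenAI actually implement it.

```python
import requests

OLLAMA = "http://localhost:11434/api/generate"

# Hypothetical workspaces, each of which would have its own documents and system prompt.
WORKSPACES = {"hr": "the HR policies workspace", "engineering": "the engineering docs workspace"}

ROUTER_PROMPT = (
    "Classify the user question into exactly one category: hr or engineering.\n"
    "Answer with the single word only.\n\nQuestion: {question}"
)

def generate(prompt: str, model: str = "llama3.1") -> str:
    r = requests.post(OLLAMA, json={"model": model, "prompt": prompt, "stream": False},
                      timeout=120)
    r.raise_for_status()
    return r.json()["response"].strip()

def route(question: str) -> str:
    # Ask the model for a one-word label, then fall back if it answers something unexpected.
    label = generate(ROUTER_PROMPT.format(question=question)).lower()
    return label if label in WORKSPACES else "engineering"

question = "How many vacation days do new hires get?"
print(f"Routing to {WORKSPACES[route(question)]}")
# From here, the question would be answered inside that workspace's own RAG context.
```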
See, this is really cool, because what we're seeing now is large language models being
injected into programming to handle fuzzy requirements, to handle fuzzy user input.
And that means I could set some of these up myself in my own private gen AI.
There's no reason why I couldn't.
It's not that difficult.
I mean, generally speaking, anything that's on Hugging Face, you can pretty much run
locally in one way or another.
Okay, so is Hugging Face where everything is?
Is that the de facto standard place to go download models?
It's GitHub for models.
Okay.
And is it the only one that's out there or is it just the biggest one?
It's the biggest one.
I know with Ollama, they do maintain their own model repository, but I'm not actually
sure if the repo is hosted by Hugging Face or not.
But they have their own command line tool.
You can say, I want to run this model with this quantization, and it'll pull it.
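For example, pulling and running a specific quantization through Ollama's command line tool from Python might look like the following. The exact tag is an assumption: available quantization tags vary by model, so check the model's page in the Ollama library before relying on it.

```python
import subprocess

# Hypothetical tag: many Ollama models publish tags that encode parameter count and
# quantization level (here, an 8B instruct model at 4-bit). Verify the tag exists
# for the model you actually want.
MODEL = "llama3.1:8b-instruct-q4_K_M"

subprocess.run(["ollama", "pull", MODEL], check=True)   # download the quantized weights
subprocess.run(["ollama", "run", MODEL, "Say hello in one word."], check=True)  # one-shot prompt
```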
Okay, now you mentioned a new word, quantization.
All right, what in the world does that mean?
I think I know, but I know that you know.
So what's quantization?
So quantization is where you take... and this is a complicated topic,
I'm going to try and explain it to the best of my knowledge.
So every model out there is made of a certain number of parameters.
And these parameters are weights for different words, or not even really full
words, but different segments and tokens, right?
So, like, Llama 3.1 70B is a 70-billion-parameter model.
There are 70 billion weights in this model.
Yep, and the weights are all numbers.
They are very long numbers.
By quantizing them, we're reducing those numbers down to a certain bit length.
So, say, eight bits; they come in 16-bit by default.
So what that means is they're probably floating point numbers first,
and I quantize them into integers so I can run them a lot faster.
But you're losing some of the accuracy.
And it's an interesting trade-off.
Like, it's really hard to say, because it's such a subjective thing: what is the
quality of a piece of writing?
I don't know; my high school English teacher sure knew how to do that.
People have tried; they've built all these evaluations around reasoning and ability to
solve problems.
But quantization is an interesting one.
So I think there's a general consensus, and it really depends on the model.
It all depends on how they trained it.
How efficient were they with those weights?
Earlier models quantized really well because they were inefficient.
With Llama 3 and up, quantization has a much heavier impact.
On the other hand, generally speaking, it's accepted that
a larger parameter model with more quantization outperforms a smaller parameter model with
less quantization.
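As rough back-of-the-envelope arithmetic (weights only, ignoring KV cache and runtime overhead), the memory a model needs scales with parameter count times bits per weight:

```python
# bytes for the weights alone: parameters x bits_per_weight / 8
def weight_memory_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

for params in (8, 70):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: about {weight_memory_gb(params, bits):.0f} GB")

# A 70B model at 16-bit is roughly 140 GB of weights, while the same model at 4-bit is
# roughly 35 GB: the difference between needing multiple data center GPUs and fitting in
# a single box with plenty of unified memory.
```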
So that's really interesting, because I can, let's say, quantize to four bits.
That's a lot.
That's what Ollama does by default, because most people are running on these dinky little
consumer GPUs.
They only have 16 gigs of RAM.
We'll get to that.
So yeah, so that's interesting.
What it means is I don't need as much memory, and I don't necessarily need a GPU anymore,
because I can do integer math on CPUs pretty fast, especially with multi-threading and
with AMX and some of the vector-processing co-processors that are out there.
So to do inference on these large language models, which means actually using the model,
not training it, I can get away with some quantized models, it sounds like, at higher
parameter counts than a full floating-point, smaller model that's going to take more memory.
This is really fascinating.
There's a lot of moving levers with this.
Definitely.
And yeah, I mean, quantization can be the bridge from not-going-to-work to going-to-work.
Like, GPUs and neural engines are always going to be superior to CPUs when it comes to
processing these models.
They're built for it.
But at the same time, what if I quantized a small model down even more?
Now I'm embedding models in apps, and I'm not tanking the entire system performance to do
it.
Ooh, see, that's really interesting too, because you talked about embedding the model
directly in an application.
I can also possibly embed it into silicon, right?
Which means instead of taking a thousand watts of power for a GPU, I can get down to a
sub-watt, or a two- or three-watt, large language model sitting on a chip that I can now
put anywhere.
That, to me, is really fascinating.
To be honest, it's in its infancy, but this is kind of what Apple did with Apple
Intelligence.
It's a very heavily quantized 2B model that has very specific guidelines on what it can be
used for.
It's not a chat model, but it can summarize text.
It can rewrite things for you.
And it's all running locally on a tiny NPU in the phone, because they tailored it down to
this really small model that runs very efficiently.
So that's really interesting, because this is now moving into phones, because
almost every smartphone that you have now has an NPU in it.
These are neural processing units.
But even before LLMs, phones had them.
And they are specifically tailored to running neural networks, which all the LLMs are
based on.
So that's really cool, because I can have, you know, ChatGPT-style capability on my phone
without it going out to the internet.
That's pretty cool, especially if it could do things like translation.
And so we're moving in that direction, it sounds like.
I mean, it's already out for a lot of people.
And I know Google's doing some stuff with their Tensor chips as well for their Pixel line.
They've got something similar going with Gemma, and I think they call it Gemini Nano or
something.
Well, and Qualcomm has that, and we know Qualcomm pretty much owns the cellular modem
market.
Apple gets their modems from Qualcomm still, even though they're building their own chips.
Yeah, yeah, it's great.
So this is a really fascinating thing that we're starting to see.
And just like with your glasses that you have on, which are smart glasses: wouldn't it
be cool if they could translate? You were in Japan recently.
Wouldn't it be cool if, as you're talking to someone, it automatically translated in your
ear, and vice versa?
Hey, Star Trek is here, the universal language translator.
We've seen something like that, but it was running in the cloud.
But I think, you know, give it five years, if not less, and we'll definitely have this in
mainstream, people-are-buying-this-stuff technology.
The glasses are an interesting case because like they fit a battery into these.
They only last a few hours and it's not really doing a lot of AI stuff.
But there are a few AI features in here.
If the camera is covered by a hat, it will tell me that after I take a picture.
It's using neural networks to do that locally on the glasses.
That's pretty cool.
So we're already starting to do this stuff.
It's just, you can feel it.
I can feel it sitting right there at the edge.
All right, so let's talk about, most people when they talk about private gen AI, they're
thinking I need a Linux server.
I need to run it in the cloud.
And I can tell you, because I work at Intel, I've got two AI PCs now.
I run private LLMs all day long on my laptop.
You're telling me that I can run them on Macs too?
Oh, yeah.
And this is where the Mac is a uniquely positioned product for this.
Honestly, by accident, because this was all engineered before the advent of LLMs.
But they have the memory on package.
So the RAM chips sit right next to the CPU, which also shares the package with the GPU.
So you get extremely fast memory bandwidth, totally shared between both CPU and GPU, with
the capacity of a full system's worth of RAM.
So that's really cool, because that's one of the biggest problems with these large models:
the transfer of data between the CPU and the GPU, because they don't have a shared
memory.
But with the Mac, they do.
The memory available in GPUs that mere mortals can afford is severely limited.
Exactly.
I like how you said mere mortals, because you can drop 50 grand easily on a GPU nowadays,
which is ridiculous.
And yeah, you can run maybe half a model.
Here's 80 gigs; like, that's a lot, don't get me wrong.
And when you start talking about cloud hosting, they are running unquantized models in a
lot of cases.
But what if you could get a machine that was capable of running those 80-billion-parameter
models, or maybe even slightly larger ones (the standards kind of shift around for how
large the models get), for under 10 grand, like a full machine in one box?
That's nuts.
Yeah, that's really good.
So in your view...
Macs are kind of unique, because it's not about raw performance as much as the ability to
run these models at all, and the ease of having them in a single workstation.
Right, and disconnected because then I can run disconnected.
I don't have to worry about data leakage.
I don't have to worry about someone snooping in or even spoofing me.
They're more secure.
I can give it my own context.
There's a lot of benefits to private.
But the thing is, that private context, you can use it because you have complete control
over the domain of what's in the box and who has access to it.
You can use very confidential information.
A legal firm can have a private server and not have to worry.
I mean, you can get enterprise packages from the big guys, but you have to trust them.
And these big companies are also training models.
Yeah, so you know they're looking at prompts.
You know they absolutely are.
It's just too valuable to them.
But if you go get a computer and you keep it offline, or you keep it on the LAN, or you
have a VPN set up with a hosted solution, if you're a lawyer's office, you can keep all of
these things completely isolated down to that one box and have very strict controls over
who can access it.
But what it means is, when you're prompting, you can ask very specific details about the
case, and you could have all of the files available to it.
All right, so that's really cool, right?
Because you don't want your lawyers going out there and asking OpenAI on a high-profile
case, especially if you're suing OpenAI.
You're not gonna go ask ChatGPT, what's the best way to, you know...
I imagine the New York Times is doing it all manually.
Yeah, no doubt.
All right.
So you mentioned it a little bit, and here's your opportunity to pitch, you know,
MacStadium.
You mentioned a hosted Mac.
I didn't know you could do that.
Well, I did know, but not until you started working where you're at. But
our audience may not know that if they want to have instances of Macs running, they can
actually go and provision one in the cloud.
Yeah.
So, you know, it's interesting if you look back.
So MacStadium, just to give a little bit of background: we're a company that primarily
and almost singularly focuses on hosted Apple hardware.
And the majority of our customers, what they're doing this for is iOS CI/CD, right?
Apple doesn't let you build on anything other than a Mac, but the Mac is also not really
designed to be a hosted solution.
They're all desktops at best, and the form factor is not really designed for something
that you can easily put up in your data center.
You have to deal with remote power management and all these other just odd things.
So by being able to offer it as a hosted service, we guarantee it's going to be up, it's
going to have fast internet, reliable power, despite the lack of redundant power supplies
in the Mac itself.
And from there, you can use these hosted Macs for whatever you want.
It's just that most of our customers are doing CI/CD.
But with AI, suddenly the Mac Studio, especially if you look at these 128 gig RAM models,
becomes a really interesting proposition.
Especially if you're looking at these companies or teams that might want to share a server
that have very specific needs for confidentiality in their work.
Suddenly, the cost is really not too prohibitive to go with a little more specialized
hardware for this.
That's really interesting, because I do have some customers in the government space
who are using AWS to host their private gen AI, but the number of GPUs that are
available is really small.
So they're paying exorbitant amounts of money.
In fact, one customer, their bill was $45,000.
Why?
They didn't use the box 24/7, but they held onto it 24/7, because the queue
wait time was over three weeks.
Meaning if they needed that instance again, it took three weeks to get it.
So instead they just sat on it, and their bill went through the roof.
So what you're saying with this is I can host a private gen AI at a majorly reduced cost,
because the Macs...
Even if it's reserved that whole time.
So a single H100 runs like $1,500 to $3,000 a month just for the GPU.
And you still need other parts of the instance to make it a full solution.
And you probably need more than one, depending on what model you run.
Yeah, yeah, exactly.
Whereas a Mac Studio is about $450 a month.
So wow, what an incredible... I mean, that's a new hosting option that I don't think a lot
of customers are recognizing, because like you said, they've got a GPU, they've got a CPU,
they have shared memory.
It's a new platform that most certainly should be looked at.
And in practice, you know, it's not going to touch an H100 in performance.
But like you said, how often are you really using this thing?
If it's just periodic... like, let's say you have a team of 10 people.
If it's just periodic queries against a reasonable-size database, and I mean, it's got two
terabytes of local storage, you can load it up with documents if you want.
And that's included too, because it's on the machine; you're getting the whole machine.
You're not paying... see, that's the other thing about cloud, everything's itemized.
You're paying $2,000 for the GPU, you're paying for every terabyte of storage and all
traffic going to it.
Getting one machine for a group of 10 people that is fast enough to answer the questions
they might have starts to make sense, even though on the balance of it, it doesn't match
the performance of the data center GPUs.
No, but you can scale it too.
I need two instances?
That's only $900 a month, because I've got 20 people.
I need more?
Just add.
You'd set one machine up with the vector database and possibly even an instance of the
embedding model.
But what you can also do is scale out Ollama; you can just run a load balancer
in front of it.
So scaling is really not that difficult.
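A minimal sketch of that kind of scale-out, using naive client-side round-robin across a pool of hypothetical Ollama backends; in practice you would more likely put a real load balancer such as nginx or HAProxy in front of the pool and point the frontend at a single address.

```python
import itertools
import requests

# Hypothetical pool of Ollama backends, e.g. two Mac Studios on the LAN.
BACKENDS = ["http://10.0.0.11:11434", "http://10.0.0.12:11434"]
_rotation = itertools.cycle(BACKENDS)

def generate(prompt: str, model: str = "llama3.1") -> str:
    # Naive round-robin: each request goes to the next backend in the pool.
    backend = next(_rotation)
    r = requests.post(f"{backend}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]
```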
Very, very cool stuff, and a great alternative to what I think is a barrier to entry that
has frankly been artificially imposed on us by Nvidia.
Because they'll be the first ones to tell you, no, you need an Nvidia processor to do
generative AI.
And that's not true.
And here's another instance.
The data center ones come in fully integrated stacks that cost like half a million dollars
for the complete solution.
For one server.
Yeah, it's ridiculous.
So Matthew, thank you for, you know, shedding some new light on this, and a new vector,
someone else to compete with Nvidia.
I love it.
In a sense, yeah.
So hey, Matthew, thanks for coming on.
This has been great.
Thanks, Darren.