How Azure AI Search powers RAG in ChatGPT and global scale apps

Mechanics Team
13 min read · Nov 7, 2024


Millions of people use Azure AI Search every day without knowing it. You can build your apps on the same search service that powers retrieval-augmented generation (RAG) when you create Custom GPTs or attach files to your ChatGPT prompts.

Pablo Castro, Microsoft CVP and Distinguished Engineer for Azure AI Search, joins Jeremy Chapman to show how Azure AI Search lets you create custom applications that retrieve the most relevant information quickly and accurately, even from billions of records.

Manage massive-scale datasets while maintaining high-quality search results with ultra-compact, binary quantized vector search indexes that use Matryoshka Representation Learning (MRL) and oversampling to equal the search accuracy of vector indexes up to 96 times larger. These approaches drive significant cost savings by optimizing your vector indexes without compromising quality.

Scalable AI search for your apps.

Massively reduce your vector indexes while retaining accuracy & speed with binary quantization, Matryoshka Representation Learning, oversampling, and re-ranking. Check out Azure AI Search.

Set up and scale your search service in the Azure Portal.

Choose a cost-effective pricing tier, manage vector index size, and adjust capacity to meet changing needs. Start here.

Reduce index costs and footprint without sacrificing performance.

Shrink vector sizes with Matryoshka Representation Learning (MRL) and binary quantization in Azure AI Search. See how it works.

Watch our video here:

https://www.youtube.com/watch?v=NVp9jiMDdXc

QUICK LINKS:

00:00 — RAG powered by Azure AI Search
00:50 — Azure AI Search role in ChatGPT
02:01 — Azure AI Search use case — AT&T
03:27 — Start in Azure Portal
04:35 — Massive scale and vector index
06:08 — Scalar & Binary Quantization
07:21 — Matryoshka technique
09:07 — Oversampling
11:31 — How to build an app using Azure AI Search
13:00 — See it in action
14:28 — Enable binary quantization with oversampling
14:54 — Wrap up

Link References

Get sample code on GitHub at https://aka.ms/SearchQuantizationSample

Check out search solutions at https://aka.ms/AzureAISearch

Unfamiliar with Microsoft Mechanics?

Microsoft Mechanics is Microsoft’s official video series for IT. Watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

To keep getting this insider knowledge, join us on social:

Video Transcript:

- Did you know that when you attach a file in ChatGPT or create a custom GPT, that process is actually using retrieval-augmented generation powered by Azure AI Search, which is also used at global scale by many commercial and enterprise implementations? Today we’ll go inside the mechanics of Azure AI Search and show you how you can achieve massive scale for your AI apps without breaking the bank, using breakthrough techniques like binary quantization to radically reduce the footprint of your vector indexes, along with oversampling and reranking, so that you’re not trading off on the quality of search results for your AI apps, keeping you in control. And joining me once again is the father of Azure AI Search, distinguished engineer and CVP, Pablo Castro. Welcome back.

- It’s great to be back.

- And thanks for joining us again today. You know, vector-based search over your prompts, together with retrieval-augmented generation or RAG, is key to ensuring the right grounding data can be retrieved and then presented to AI models to generate more personalized and relevant responses. This is how, for example, Microsoft 365 Copilot can bring in your data to get work done. So what’s Azure AI Search’s role, then, in enabling that scenario for something like ChatGPT?

- Well, in the context of ChatGPT, there are a few experiences that support bringing your own data, either attaching files to your chat session, or when you’ve uploaded multiple files to create a custom GPT. Which kind of sounds like you’re building your own GPT models, but you’re not. In both cases, retrieval augmented generation is used to generate responses from the files you add. Behind the scenes, those files are getting indexed and vectorized in real time by Azure AI Search. During conversations, the right parts are converted into vectors. And using a combination of hybrid vector and keyword search, we’re able to retrieve the right information in milliseconds in order to generate a response. All of this is happening concurrently at global scale. This way, millions of people are using Azure AI Search every day without even knowing it.

- Right, and it’s a really great example of the scale that Azure AI Search can operate at. And I know that on the enterprise side, we’re also working with organizations like the multinational connectivity and communications company AT&T, who use Azure AI Search for their internal RAG platform to help more than 80,000 users get the information they need to perform their jobs quickly.

- Right. AT&T got started early as a leader in this space. And they’ve accumulated millions of vectors and passages in their knowledge base. They use hybrid retrieval along with semantic ranking, which is a more sophisticated ranking model, to improve relevance. And when you apply all these methods, you can greatly improve the quality of your grounding information.

- And I really like this example because I know most organizations, probably a lot of people watching are thinking they’ve got this knowledge base already. They want it more accessible for their staff and employees.

- That’s right. And the thing is, not all retrieval systems for RAG are created equal. Natural language search relies on different underlying indexes, powering a hybrid approach with vectors and keywords. Your knowledge base can comprise thousands of files, and those files are broken up into shorter passages or chunks, and each of those will have its own vector. So pretty quickly, you can have tens or hundreds of millions of vectors, really expanding the size of your database. That’s why it’s important to get search right as part of the RAG process, with the right scale and quality mechanisms for both efficiency and accuracy.
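
To make the chunking idea concrete, here is a toy sketch of that splitting step in Python. The chunk size, overlap, and sample text are arbitrary illustrative values, not the settings Azure AI Search or ChatGPT use internally; in a real pipeline each chunk would then be embedded and indexed as its own record.

```python
# Toy illustration of chunking: split a document into overlapping passages,
# each of which would later get its own embedding vector in the index.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

document = "Azure AI Search combines keyword and vector retrieval for RAG. " * 400
chunks = chunk_text(document)
# One vector per chunk: thousands of files quickly become millions of vectors.
print(f"{len(chunks)} chunks -> {len(chunks)} vectors for this one document")
```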

- Okay, so why don’t we break this down, then. How would I achieve this type of massive scale without breaking the bank and also without compromising on quality?

- Sure. No matter the scale, it’s pretty easy to get started. Here I’m in the Azure Portal. Let’s create a new search service, and I’ll give it a name and select a region. And one thing to point out is the pricing tier. Here is the vector quota column, which is the maximum per-partition size of your vector index and has a big influence on cost. For S and higher offerings, you can have up to 12 partitions. And in a moment I’ll explain some ways to make these indexes radically smaller, which can help you move down several steps in size to be more efficient. To give you an idea of how much we’ve improved scale, the limits in that vector index size column are up to 12 times higher today than they were a year ago, without any change in cost. Once you’ve chosen your tier, you can either create the service from here or continue configuring options in the other tabs. Later, you can tune partitions and replicas dynamically at any time to adjust your capacity as your load changes over time.
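
If you prefer code over the portal, the same service can be provisioned programmatically. The sketch below uses the azure-mgmt-search management SDK; the subscription ID, resource group, service name, region, SKU, and partition/replica counts are all placeholders, and method and model names can vary between SDK versions, so treat it as a rough outline rather than the exact flow shown in the demo.

```python
# Rough sketch: create (or later rescale) a search service with the
# azure-mgmt-search management SDK instead of the Azure Portal.
from azure.identity import DefaultAzureCredential
from azure.mgmt.search import SearchManagementClient
from azure.mgmt.search.models import SearchService, Sku

client = SearchManagementClient(DefaultAzureCredential(), "<subscription-id>")

service = SearchService(
    location="eastus",
    sku=Sku(name="standard"),  # the pricing tier drives the per-partition vector quota
    partition_count=1,         # add partitions later for more storage and vector quota
    replica_count=1,           # add replicas later for more query throughput
)

poller = client.services.begin_create_or_update(
    resource_group_name="<resource-group>",
    search_service_name="<service-name>",
    service=service,
)
print(poller.result().provisioning_state)
```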

- Okay, so how would Azure AI Search then work at massive scale and maybe when you’ve got a massive vector index?

- Well, Azure AI Search is built for massive scale by design. It can handle anywhere from tiny indexes and small services to billions of records. In fact, I can show you. I have an index that I created earlier with 1.4 billion vectors using a public data set called scalev1b. The index alone is over 200 gigs, which is pretty big. And not only is it easy to load something like this into AI Search, it’s also instantaneous to query it. I’ll move over to VS Code and I’ll start by initializing the environment. Here I have a couple of test vectors, so I’ll just run this. Now I’ll print the document count, and it returns 1.4 billion vectors. Next, instead of a simple count, I’ll run a vector query using the sample vector I have above. I’m looking for the five nearest neighbors. And it was instantaneous; it returned the five closest matches from our 1.4 billion vector dataset. Now let’s do an informal stress test, here with a little bit more load. I’ll run the cell with a little test function that performs 100 searches. And now the next cell will call my test function, so I’ll run it. You’ll see that it performs all the searches in under a second. This one was 0.7 seconds. Again, that was against 1.4 billion documents with 100 searches. So this is the type of performance you can expect even at massive scale from Azure AI Search.
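
The full notebook is available in the sample repo linked below; as a minimal sketch, the queries in this part of the demo look roughly like the following with the azure-search-documents Python SDK. The endpoint, key, index name, vector field name, and vector dimensions here are placeholders, not the ones used in the demo.

```python
# Minimal sketch of a document count and a k-nearest-neighbor vector query
# against an Azure AI Search index, using the azure-search-documents SDK.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<query-key>"),
)

# Document count (1.4 billion in the demo index).
print(search_client.get_document_count())

# Five nearest neighbors for a test vector; the vector length must match the
# dimensions of the index's vector field.
test_vector = [0.0] * 1024  # placeholder; use a real embedding here
results = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizedQuery(vector=test_vector, k_nearest_neighbors=5, fields="embedding")
    ],
)
for result in results:
    print(result["@search.score"])
```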

- And you’ll recall, this was using 200 gigabytes of vector data and almost 700 gigs of total storage. And with the data getting broken down into chunks and documents, then each vector having thousands of dimensions, indexes can get pretty large. So how would you keep the size of those indexes then in check?

- These indexes can indeed grow quickly in some scenarios. And so we’ve introduced three capabilities that together allow you to control the trade-off between cost, latency, and recall. First is vector quantization. Second, embeddings trained with the Matryoshka Representation Learning technique. And third, oversampling with rescoring. Let me start by explaining quantization. If a vector is a long list of numbers that we refer to as dimensions, you can think of quantization as a way to make each number use less space. In Azure AI Search, we offer two types of quantization, scalar and binary. Scalar quantization, for example, takes each number in the vector and effectively compresses it into a narrower numeric type. The binary quantization option then takes this even further and makes each dimension a single bit, a one or a zero.
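
As a conceptual illustration only (this is not Azure AI Search’s internal code), here is what the two kinds of quantization do to a single vector in NumPy: scalar quantization maps each 32-bit float into a narrower type such as int8, while binary quantization keeps just one bit per dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
vector = rng.standard_normal(3072).astype(np.float32)    # full precision: 4 bytes/dim

# Scalar quantization: linearly map the observed value range onto 256 int8 buckets.
lo, hi = vector.min(), vector.max()
scalar_q = np.round((vector - lo) / (hi - lo) * 255 - 128).astype(np.int8)  # 1 byte/dim

# Binary quantization: one bit per dimension (positive -> 1, else 0), packed 8 per byte.
binary_q = np.packbits(vector > 0)                        # 1 bit/dim

print(vector.nbytes, scalar_q.nbytes, binary_q.nbytes)    # 12288, 3072, 384 bytes
```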

- So where does the Matryoshka Representation Learning technique then come in?

- Well, the quantization process still leaves us with thousands of total dimensions per vector. So Matryoshka Representation Learning, or MRL, is the next approach. When we use vectors trained with MRL, we can reduce the number of dimensions for each vector to just a subset at the front of the vector. We can do this because the MRL training objective is designed to concentrate the most information towards the front. Together, quantization and MRL can radically reduce the size of your vector index. Let’s do the math to quantify this, using binary quantization as the example. First, one full OpenAI vector using the large model has 3,072 numbers, and each one is a 32-bit float, which is four bytes. That means there are a total of 3,072 times four, or 12,288 bytes per vector, or about 12K. Matryoshka Representation Learning then allows us to keep, say, the front 1,024 dimensions out of those 3,072, reducing them by two thirds. With binary quantization, we can reduce each dimension from 32 bits down to one bit. When you use both of these approaches combined, this yields just 1,024 bits per vector. If you divide that by eight for the number of bits in a byte, that’s just 128 bytes per vector. So when you compare before and after, our full precision vector is 12 kilobytes versus just 128 bytes for our quantized vector, which is a 96x size reduction. And this also reduces the index size by almost the same factor when you account for index overhead, yielding significant cost savings.
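
The same arithmetic, checked in a few lines of Python (the 3,072 dimensions correspond to the large OpenAI embedding model mentioned above; the 1,024-dimension cut is the example MRL truncation used in this walkthrough):

```python
# Size math for combining MRL truncation with binary quantization.
dims_full, dims_mrl = 3072, 1024           # full vector vs. MRL-truncated front slice
bytes_full = dims_full * 4                 # 32-bit floats: 12,288 bytes (~12 KB)
bytes_compressed = dims_mrl // 8           # 1 bit per kept dimension: 128 bytes
print(bytes_full, bytes_compressed, bytes_full / bytes_compressed)  # 12288 128 96.0
```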

- Okay, but now we’ve reduced the dimension count radically to like 1/3 of its original size, and the dimensions themselves are also narrower. So wouldn’t that impact the accuracy or quality?

- Yes, of course, there is an impact. But we have a solution to manage the trade-off and bring back quality. When we compare full precision search results to our compressed vector search using both binary quantization and MRL, it’s not uncommon to see a loss in quality, sometimes up to 10% or more. One thing we can do is keep the original full precision vectors and place them in storage. Because they’re not in the index, they don’t use up vector index quota and they are not in memory. For your queries, you can indicate an oversampling rate where you might query, say, 10 times your goal, perhaps going from a sample of five to 50, and then rescore that small set of results with the full precision vectors that are stored on disk. In many data sets, this brings you back to over 99% of the quality you can achieve with full precision vectors. Also, while the rescoring step can add a little bit of latency, the highly compressed vector index is significantly faster, so in many cases there is no material performance impact. So it’s all upside.
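
In the index definition, these options are exposed as a compression setting on the vector search configuration. The sketch below shows roughly what that looks like with the azure-search-documents Python SDK (the same pattern Pablo walks through later in the demo); the index, field, and profile names are made up, and exact model and parameter names differ between SDK versions, so check the current reference docs before relying on it.

```python
# Sketch: an index that stores 1,024-dimension vectors with binary quantization,
# oversampling, and rescoring against the original full-precision vectors.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    BinaryQuantizationCompression,
    HnswAlgorithmConfiguration,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
    compressions=[
        BinaryQuantizationCompression(
            compression_name="binary-compression",
            rerank_with_original_vectors=True,  # rescore with full-precision vectors
            default_oversampling=10.0,          # fetch ~10x candidates before rescoring
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw",
            compression_name="binary-compression",
        )
    ],
)

index = SearchIndex(
    name="docs-quantized-sample",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchField(name="content", type=SearchFieldDataType.String, searchable=True),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            vector_search_dimensions=1024,
            vector_search_profile_name="vector-profile",
        ),
    ],
    vector_search=vector_search,
)

index_client = SearchIndexClient(
    "https://<your-service>.search.windows.net", AzureKeyCredential("<admin-key>")
)
index_client.create_or_update_index(index)
```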

- Right, so in this case, we’ve taken the oversampling rate, we’ve landed on a number of 10. So what happens if I were to adjust that number up or down and oversample more or less?

- So we’ve done a ton of evaluations in this space using public and private data sets at all scales. And the data shows that without oversampling, depending on the dataset, and we’ve used both public and internal ones, the quality metric can drop by as much as 8% to 10% compared to the full precision version. Just by adding oversampling with a factor of two, on average quality is around 99% of the original result using full precision. And when you increase the oversampling to 10, it averages 99.9% compared to the full precision index in the data sets that we’ve evaluated. So despite the vector index being around 1/96 of the size of the uncompressed index, the quality is extremely close to 100% of the original. This method lets you balance cost, performance, and quality.

- Okay, so now we’ve seen a massive data set and how quickly it can be queried, as well as the different approaches you can take to make that index a lot smaller and save cost without a big impact on quality. So how would you build an app using AI Search at scale with all three of the options you’ve just described in place?

- I’ve got a relatable example here, and you can try this yourself at home. If you’ve ever searched the Azure documentation, you know that there are thousands of pages with highly related concepts. Not only is it a big data set, but it’s also a challenge for semantic search. So here I’ve taken the entire GitHub repo with every page of Azure documentation as markdown files. Using the storage browser, I can click on, say, this folder for azure-signalr, and these are the files for that topic. Once these were split, vectorized, and indexed, this equals more than 140,000 documents. And by the way, that’s over 23 million tokens. You can see I have two indexes: azdocs, where the vector index size is 1.6 gigs, which is the full precision index, and azdocs-q, a quantized version of this index where the vector size is just 59 megs. That’s only 3.6%, or 1/28 of the size of the full precision vector index. Now I’ll move back to VS Code. Here’s the index creation step enabling binary quantization. You can see in the vector search parameters that I have compressions defined to use binary quantization compression. Under that, the default oversampling is set to 10, and reranking with the original vectors is enabled for rescoring. I wrote a small app to tie everything together using Azure AI Search and RAG with GPT-4o. The app uses Azure OpenAI and Azure AI Search. Below that, you’ll see that it’s also got a function for search. So here’s my definition, and this is the call to Azure AI Search for grounding. By default, it uses hybrid search with keyword and vector support. Finally, this is the question-answer loop that will take questions from the user and use GPT with RAG to answer them. Let’s try it, I’ll run the app. Then I’ll ask, “Can Azure AI Search handle pictures inside PDFs?” It answers that it doesn’t know. You can see it found three top candidates from the docs. And I’m sure the answer is there, but because the most related documentation passages weren’t at the top, the model doesn’t see them. Let me change the call to Azure AI Search to enable semantic ranking for better relevance. I’ll do this by setting the query type parameter to semantic. Let’s try it again. I’ll rerun the app, then ask the exact same question about searching PDFs. This time, you can see that the top three passages contain the right information to answer correctly, so it produced the response that I was looking for. So despite the semantic similarity of everything in the Azure documentation, with the quantized vector index, oversampling, and semantic ranking, it was able to respond with the right answer, looking across more than 140,000 passages. And as a developer building large scale RAG applications, the one thing I hope you can take away from this is that if you have a large index, you should enable binary quantization with oversampling to hit the right balance between quality and cost effectiveness. And this is done easily by adding the compression option when you define your index. Also, for another quality boost, consider enabling semantic ranking.
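
The authoritative code for this demo is in the sample repo linked below; as a hedged sketch of the grounding call described here, a hybrid query that can be switched to semantic ranking looks roughly like this with the azure-search-documents SDK. The index, field, and semantic configuration names are placeholders, and embedding the question with Azure OpenAI is omitted.

```python
# Sketch: hybrid (keyword + vector) grounding query, optionally with semantic ranking.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="azdocs-q",
    credential=AzureKeyCredential("<query-key>"),
)

def ground(question: str, question_vector: list[float], use_semantic: bool = True):
    """Return the top passages used to ground the model's answer."""
    return search_client.search(
        search_text=question,                 # keyword side of the hybrid query
        vector_queries=[
            VectorizedQuery(                  # vector side of the hybrid query
                vector=question_vector,
                k_nearest_neighbors=50,
                fields="embedding",
            )
        ],
        query_type="semantic" if use_semantic else "simple",
        semantic_configuration_name="default" if use_semantic else None,
        top=3,                                # the three candidate passages shown above
    )
```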

- Those are both great tips. So aside from those two tips then, what else would you recommend for the developers that are watching right now to try this out and test it for themselves?

- So we’ve also made the sample code that I’ve been demonstrating in VS Code available in GitHub at aka.ms/SearchQuantizationSample. And you can find out more about our search solutions at aka.ms/AzureAISearch.

- Thanks so much for joining us today, Pablo. It’s always great to have you on for this fast-moving space. And we’ll keep covering the mechanics of generative AI at Microsoft. We’ve already published more than 50 related topics on the channel, so check those out and be sure to subscribe for more. And as always, thanks so much for watching.
