Vector Search using 95% Less Compute | DiskANN with Azure Cosmos DB

Mechanics Team
13 min read · Jun 8, 2024


Ensure high-accuracy, efficient vector search at massive scale with Azure Cosmos DB. By leveraging Microsoft’s DiskANN, Cosmos DB moves more I/O traffic to disk, tapping storage capacity instead of memory to enable high-speed similarity searches across all your data. This technology, which powers global services like Microsoft 365, is now integrated into Azure Cosmos DB, enabling developers to build scalable, high-performance applications with built-in vector search, real-time fraud detection, and robust multi-tenancy support.

Join Kirill Gavrylyuk, VP for Azure Cosmos DB, as he shares how Azure Cosmos DB with DiskANN offers unparalleled speed, efficiency, and accuracy, making it the ideal solution for modern AI-driven applications.

DiskANN is now built into Azure Cosmos DB.

Boost query speed, reduce latency, and improve cost efficiency without sacrificing accuracy by moving in-memory I/O traffic to SSDs. See it here.

Vectorize transactions for anomaly detection.

Compare new transactions with historical data to identify fraud. Take a look at DiskANN with Cosmos DB.

Manage multi-tenancy with Cosmos DB’s built-in support.

It offers resource efficiency and data isolation for regulatory compliance. Check it out.

Watch our video here:

QUICK LINKS:

00:00 — Latest Cosmos DB optimizations with DiskANN
02:09 — Where the DiskANN approach is beneficial
04:07 — Efficient querying
06:02 — DiskANN compared to HNSW
07:41 — Integrate DiskANN into a new or existing app
08:39 — Real-time transactional AI scenario
09:29 — Building a fraud detection sample app
10:59 — Vectorize transactions for anomaly detection
12:49 — Scaling to address high levels of traffic
14:05 — Manage multi-tenancy
15:35 — Wrap up

Link References

Check out https://aka.ms/DiskANNCosmosDB

Try out apps at https://aka.ms/DiskANNCosmosDBSamples

Unfamiliar with Microsoft Mechanics?

Microsoft Mechanics is Microsoft’s official video series for IT. Watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

To keep getting this insider knowledge, join us on social:

Video Transcript:

- How do you ensure that vector search, key for intelligent information retrieval and generative AI, works at massive scale and with high accuracy and efficiency? Well, we’re changing the relationship between database memory and storage so that more I/O traffic moves to disk to leverage more storage capacity. This uses the Microsoft-developed Disk Accelerated Nearest Neighbors, or DiskANN for short, to quantize a vector-based graph index of data in memory and map it to a full-precision graph of vector data that it generates in storage, reducing memory dependency for high-speed similarity search over all your data. This is in fact how global services like Microsoft 365 do vector search at scale. And the good news is it’s now built into Azure Cosmos DB for you to use too. And joining me once again is Kirill Gavrylyuk from the Cosmos DB team to go through all of this. Welcome back.

- Thank you. It’s great to be back to share the latest updates.

- Thank you. So last time you were on, just a few months back, we looked at Cosmos DB as the backend tier for the ChatGPT service from OpenAI, as well as for massive services like Microsoft Teams. Cosmos DB also has built-in vector search, which is really great for generative AI and for improving the intelligence of search. So how does the latest set of Cosmos DB advancements around DiskANN change things?

- So in all the examples you mentioned, you need scale. In fact, I think of Cosmos DB and DiskANN as a perfect marriage of technology. For example, at runtime, DiskANN is able to move in-memory I/O traffic to the physical SSD to speed up operations efficiently. It uses a set of core algorithms: Vamana for vector indexing, a pruning algorithm to reduce the space the vector data takes on disk, and a search algorithm that does look-ups. Cosmos DB can then automatically and horizontally scale physical disk partitions as needed, limitlessly and in real time. The combination helps us get even more querying speedups with lower latency and better cost efficiency while preserving accuracy.

- Okay, so in what different scenarios does this approach with DiskANN really light things up?

- This makes a huge difference where you have to query large amounts of frequently changing vector datasets. Take, for example, Microsoft 365. Every time you exchange an email, make a change in a doc, or receive a message on Teams, all of those operations represent trillions of frequently updating data points. On the backend, these all use DiskANN for vectorization, indexing, and search to facilitate fast and efficient data retrieval and read and write operations at huge scale.

- And this really helps with how we run things like Microsoft 365 efficiently. And I know that Microsoft Bing also uses DiskANN.

- Right. And because DiskANN is now built into Cosmos DB, you can use it too and build fast and efficient systems that use natural language, anomaly detection, and much more.

- Yeah, and in fact, DiskANN extends built-in vector support beyond vCore-based Azure Cosmos DB for MongoDB.

- Correct. DiskANN gives you another way to build your vector indexes, which are the key to everything, allowing information retrieval based on similarity. If you recall, vectors are a coordinate-like way to refer to chunks of data in your database. These are used during look-ups to find the closest similarity of meaning. With DiskANN built in, vector search is available to other APIs, starting with the core schema-less NoSQL API, which, by the way, is what Microsoft Teams and the OpenAI service use.

- So why don’t we dig into this a little bit more? Because this is not ordinary vector indexing and search. DiskANN as an approach is actually changing things dramatically, right?

- It does. In fact, DiskANN was developed by Microsoft Research. Here, the goal has been to change the relationship between database memory and physical disk storage. We’ve made DiskANN available open source; anyone can use it on GitHub. That said, the version we use in Cosmos DB has even more optimizations based on our experience using it for our own services. Traditionally, without DiskANN, most vector indexes would first check whether a query and its result are already available in memory when you submit a query. This is a compute-intensive operation, and recall is often limited to 50%, partly because there is less space in memory compared to storage. Because storing data in memory is more expensive, DiskANN changes things by first using quantization to compress vectors into an optimized graph that can be run efficiently in memory. Then, instead of just relying on what’s in memory, DiskANN uses the Vamana index-building algorithm, along with the pruning algorithm, to create a full-precision, graph-based index of all the vector data on the SSD, which has more storage capacity. And there is no trade-off here because DiskANN reads from the SSD very efficiently. And since modern SSDs are very fast, we can still maintain very low latency and high queries per second. During a vector search, the search algorithm first refers to the in-memory compressed vectors, which act like a pointer to the larger full-precision graph on disk storage, to retrieve information on nearest neighbors. Then, once the search finds the top results, they are re-ranked using the full vectors on the SSD to ensure high accuracy. This way, compared to other approaches, DiskANN can achieve very high recall, around 95% or higher, at any scale.
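To make that two-phase flow concrete, here is a minimal, illustrative Python sketch of a DiskANN-style search, not the production implementation. The names graph, pq_vectors, full_vectors, and entry are hypothetical stand-ins for the Vamana neighbor lists, the quantized in-memory vectors, the full-precision vectors on the SSD, and the traversal entry point.

```python
import heapq
import numpy as np

def diskann_style_search(query, graph, pq_vectors, full_vectors, entry, k=10, beam_width=64):
    """Two-phase search sketch: quantized traversal, then full-precision re-rank."""
    # Phase 1: best-first graph traversal using approximate distances computed
    # on compressed vectors that fit in memory (real DiskANN uses product
    # quantization codes here).
    frontier = [(np.linalg.norm(pq_vectors[entry] - query), entry)]
    visited = {entry}
    shortlist = []
    while frontier and len(shortlist) < beam_width:
        _, node = heapq.heappop(frontier)
        shortlist.append(node)
        for neighbor in graph[node]:  # neighbor lists built by the Vamana algorithm
            if neighbor not in visited:
                visited.add(neighbor)
                approx = np.linalg.norm(pq_vectors[neighbor] - query)
                heapq.heappush(frontier, (approx, neighbor))
    # Phase 2: re-rank the shortlist with full-precision vectors; in DiskANN
    # these reads come from the SSD-resident graph index.
    shortlist.sort(key=lambda n: np.linalg.norm(full_vectors[n] - query))
    return shortlist[:k]
```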

- Right. These are read/write operations effectively that are happening on disk. So we no longer have to rely on available RAM. Does this mean that it’s also more efficient in terms of compute?

- Yes. Compared to other vector-based options, it needs less than 5% of the traditional compute requirements. And because Cosmos DB can automatically and dynamically scale physical partitions to distribute data as needed, we’re actively load balancing I/O and storage utilization. Each of these physical partitions can have its own DiskANN data graph as Cosmos DB scales in or out. Both concepts work seamlessly together. In fact, to make these efficiencies real, let me explain how this compares to the most common alternative, HNSW, or Hierarchical Navigable Small World. Let’s take two examples, with datasets of 1 million and 1 billion embeddings from the Azure OpenAI service. With an in-memory index like HNSW, it takes more than 12 gigabytes to store 1 million vector embeddings. And in the 1 billion vector embeddings scenario, that equates to more than 6,000 gigabytes, or six terabytes, consumed by the HNSW index. By contrast, using DiskANN, even with conservative estimates, we use around 1/60 of that memory: around 200 megabytes for 1 million embeddings and around 100 gigabytes for 1 billion.
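As a sanity check on those figures, here is a back-of-envelope sketch, assuming 1536-dimensional float32 embeddings (the output size of Azure OpenAI’s text-embedding-ada-002 model); real index overheads vary.

```python
# Raw storage for the uncompressed vectors alone, before any graph or index overhead.
DIMS, BYTES_PER_FLOAT32 = 1536, 4

def raw_vector_gb(n_vectors: int) -> float:
    return n_vectors * DIMS * BYTES_PER_FLOAT32 / 1024**3

print(f"{raw_vector_gb(1_000_000):,.1f} GB")      # ~5.7 GB; HNSW's graph overhead pushes this past 12 GB
print(f"{raw_vector_gb(1_000_000_000):,.1f} GB")  # ~5,722 GB, consistent with the ~6,000 GB HNSW figure
```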

- So what would that really mean in terms of provisioned compute if I use the HNSW approach, maybe with dedicated virtual machines?

- Well, to quantify the VM compute, if we take the 1 billion vector example, you would have to manage 45 high-performance Azure B-series VMs with 32 vCores and 128 gigs of RAM each just to maintain the indexes in memory, and you keep paying for them whether you are using them or not. Whereas with Azure Cosmos DB, you can provision a single instance with DiskANN enabled to build a full-precision index, and Cosmos DB handles the partitioning for you automatically, distributing DiskANN indexing information efficiently across those partitions. And the good news is that you only pay for what you use.

- Okay, so now we’ve kind of explained what DiskANN is and how that works, how it compares to HNSW, so can we maybe have a look at, for all the developers watching, how you might use this and integrate this into an existing or a new app?

- Sure. I have a good example with a constant stream of transactions. This is a proof-of-concept financial services app. It runs at massive scale to monitor financial transactions for fraud, which is getting more and more common these days. We’re tracking transaction volume, fraudulent transactions, fraudulent customers, and other key metrics. And this just happens to be one example of where Cosmos DB with DiskANN can make a difference. Even though our current system is running at scale, the fraud detection alerts are taking an average of 1.1 seconds, which is not as real time as we want it to be, and our accuracy is at around 85%, which is also not where we want it to be.

- Right. And although that might sound really fast, in these situations, it’s a lot better to decline a fraudulent transaction while it’s happening to stop current and also future transactions.

- That’s right. And Azure Cosmos DB is the best database for these sorts of real-time transactional AI scenarios. Let me show you. From the Features pane in a running Cosmos DB NoSQL account, the first thing you’ll need to do is enable the Vector Search capability, which uses DiskANN. From there, you’ll move on to Data Explorer settings and configure instant auto-scaling for changing traffic patterns. Once you have that configured, the rest we can do in Visual Studio Code or your favorite IDE. To vectorize data as it comes in, we’re using the Cosmos DB change feed to trigger an Azure function, which calls the OpenAI embedding API. Once the embeddings are generated, those are upserted into our Cosmos DB items directly. From here, I just need to deploy our function for it to be active. And at that point, our backend database is ready. Now we’ll move on to building our fraud detection sample app. Here we are going to use the Cosmos DB Python SDK. First, I set up the connection to Cosmos DB and added our container information. Then I defined our logical partition key, which is like the address for your data. Under the covers, Cosmos DB will use this to route requests to the physical disk partitions where the data resides as it automatically scales out. From there, I specified a vector embedding policy, and this is something new for our vector search feature. It informs Cosmos DB where to look for properties containing vectors. Then here, I’ve added an indexing policy to enable higher-performance vector search, leveraging the new DiskANN vector index. With the parameters set, we can create the container. And finally, I’ll load the dataset that we’ve already collected offline directly into Cosmos DB. I’ll just let it run for a moment, and now the dataset is ready.
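The following is a minimal sketch of that container setup with the azure-cosmos Python SDK (vector policy support requires a recent SDK version). The account endpoint, key, and the names FraudDetectionDB, transactions, /customerId, and /embedding are placeholders, not the exact values from the demo.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.create_database_if_not_exists("FraudDetectionDB")

# Vector embedding policy: tells Cosmos DB which property holds vectors
# and how to compare them.
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/embedding",
        "dataType": "float32",
        "distanceFunction": "cosine",
        "dimensions": 1536,  # must match your embedding model's output size
    }]
}

# Indexing policy that enables the DiskANN vector index on that property.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/embedding/*"}],  # keeps writes cheap for the vector property
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
}

container = database.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/customerId"),  # the logical partition key
    indexing_policy=indexing_policy,
    vector_embedding_policy=vector_embedding_policy,
)
```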

- Okay, so now with all of the code that you’ve put in place, those vector embeddings get calculated and inserted directly into those items?

- That’s exactly how we set it up. In fact, now we can check to see if our code is working against the data that was just loaded. I’ll execute a simple query from the notebook to verify that the data was ingested and that our Azure function has generated the embeddings. And you can see them right here simply as another property in the data documents.
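A sketch of that verification step, reusing the container object from the setup sketch above; IS_DEFINED is the built-in Cosmos DB SQL function, while the projected properties are illustrative.

```python
# Confirm the change-feed function populated the embedding property.
for item in container.query_items(
    query="SELECT TOP 5 c.id, c.amount, IS_DEFINED(c.embedding) AS hasEmbedding FROM c",
    enable_cross_partition_query=True,
):
    print(item)
```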

- Okay, so now we’ve generated all the vector embeddings on the backend, but for those transactions as they come in, how do we vectorize those to make sure that the anomaly detection will work?

- That’s the next step. We also need to calculate and write embeddings for the operational transactions as they flow in, and use queries to match those against previous fraudulent transactions. What we are trying to do is identify whether a new transaction is valid or potentially fraudulent. So I have created a helper function to generate vector embeddings. This function vectorizes new transactions. This way, we can pass those vectors to our vector search query and compare them with similar transactions in our database to determine if they are valid or suspicious. Now, I also have other transactions that I can use to perform test searches. I’ll start with an example that contains valid transactions. This is a query to select the top 10 most similar vectors, and I’m also projecting a few properties that might be interesting. We’re using the new VectorDistance function to compute vector similarity and projecting that as a Similarity Score. Finally, I’ll do an ORDER BY to sort them, using the VectorDistance function again. This uses the DiskANN index to find the most similar vectors in our dataset and return the relevant properties. And based on the ORDER BY sort, they are sorted from most relevant to least relevant. Also, you can see the query results with all the properties we projected. And if you look at the Similarity Score, it is sorted from the most to the least similar. In fact, now that it’s working, let’s test how it’s able to identify fraudulent transactions. I’ll run a different sample with fraudulent transactions. And you can see the query results tagged as fraud with all the properties we projected. So we’ve validated that it works, and we are ready to deploy to production. In fact, as you saw, it’s already running.
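Here is a hedged sketch of that helper and query, using the openai Python package against an Azure OpenAI deployment. The deployment name, API version, sample text, and projected properties (amount, merchant, isFraud) are illustrative; VectorDistance and the ORDER BY pattern are the Cosmos DB NoSQL constructs shown in the demo.

```python
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",
    api_key="<your-key>",
    api_version="2024-02-01",
)

def generate_embedding(text: str) -> list[float]:
    """Vectorize a transaction description with the embedding model."""
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",  # your deployment name may differ
        input=text,
    )
    return response.data[0].embedding

query_vector = generate_embedding("online purchase, $4,850, electronics, 02:13 AM")

# ORDER BY VectorDistance(...) is what engages the DiskANN vector index.
results = container.query_items(
    query="""
        SELECT TOP 10 c.id, c.amount, c.merchant, c.isFraud,
               VectorDistance(c.embedding, @queryVector) AS SimilarityScore
        FROM c
        ORDER BY VectorDistance(c.embedding, @queryVector)
    """,
    parameters=[{"name": "@queryVector", "value": query_vector}],
    enable_cross_partition_query=True,
)
for item in results:
    print(item["SimilarityScore"], item["isFraud"])
```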

- And you mentioned how well this works at scale. So can we take a look at how well Cosmos DB is scaling then to basically address that high amount of traffic?

- Sure. I’ll switch over to Azure Monitor to show you that. And on our Azure Metrics dashboard, there is our Cosmos DB resource. You can see it’s automatically scaling as needed without any intervention in changing database configurations or settings thanks to the auto-scale.

- So there aren’t any downsides here, you know, using this approach versus HNSW at the massive scale we just saw. In fact, HNSW would be cost-prohibitive. So what’s the impact then in terms of time if we go back to our app to detect fraud?

- Let’s take a look. If I go back to the app, you’ll see that whereas before we were at around 1.1 seconds to detect a fraudulent transaction, it now takes just 47 milliseconds to find those using our vector search-based detection logic, and our accuracy, the true positive rate, has also increased by around 10%. And that means we can stop a single fraudulent transaction in its tracks and put the payment instruments on hold. So this will add a lot of compute efficiency, scale, and accuracy, and reduce latency, as you implement and use vector search in your apps.

- So this is a really great example for using DiskANN with a single tenant app, but how would this work then with multi-tenancy scenarios where you might need to manage several different app backends?

- So managing multi-tenancy is also challenging and can be very resource-intensive. With HNSW, you are limited to using separate indexes for each of your tenants, which, again, is memory-heavy and expensive. Now, you could use separate VMs, but that’s even more expensive, and you would need to keep those VMs running at all times. With Cosmos DB, multi-tenancy support is built in, and we give you multiple options for isolation. For example, you can have a separate Azure Cosmos DB account for each tenant, or in a single Azure Cosmos DB account, you can set up a unique database for each tenant. Or in a single database, you can have separate data collections per tenant. Another option, if you want to store everything in one collection, is to specify a different vector path in the index policy per tenant, or you can set up logical partition isolation for each tenant by specifying separate partition keys. And if you need to change these at any time, we now allow you to easily change your partition keys as your scale and multi-tenancy requirements evolve. And of course, when you combine Cosmos DB capabilities with DiskANN, because the graph index is saved to high-speed SSDs, you only pay for compute when you are querying your data.
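As one illustration of the last option, logical partition isolation, here is a sketch using a tenant identifier as the partition key; the /tenantId property and container name are hypothetical, and database is the client object from the earlier setup sketch.

```python
# Each tenant's documents carry a tenantId, which becomes the partition key
# and routes them to that tenant's logical partition.
tenant_container = database.create_container_if_not_exists(
    id="transactionsByTenant",
    partition_key=PartitionKey(path="/tenantId"),
)
tenant_container.upsert_item({"id": "txn-001", "tenantId": "contoso", "amount": 42.50})

# Scoping a query to one partition key confines it to that tenant's data.
contoso_count = list(tenant_container.query_items(
    query="SELECT VALUE COUNT(1) FROM c",
    partition_key="contoso",
))[0]
print(contoso_count)
```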

- Right. And multi-tenant support is really valuable here, especially if you have apps requiring data isolation, you’ve got regulatory needs, or you need privacy over sensitive data. So for anyone who’s watching right now, where should they go to learn more and get this up and running?

- Sure. To find out more, you can check out aka.ms/DiskANNCosmosDB and you can find test apps to try out at aka.ms/DiskANNCosmosDBSamples.

- Thanks so much for joining us today, Kirill. I can’t wait to see how everyone uses this. And of course, keep watching “Microsoft Mechanics” for all the latest updates. Thanks for watching and we’ll see you next time.
