
Meet the Supercomputer that runs ChatGPT, Sora & DeepSeek on Azure

13 min read · May 29, 2025


(feat. Mark Russinovich)

Build and run your AI apps and agents at scale with Azure. Orchestrate multi-agent apps and high-scale inference solutions using open-source and proprietary models, with no infrastructure management needed. With Azure, connect frameworks like Semantic Kernel to models such as DeepSeek R1, Meta's Llama, OpenAI's GPT-4o, and Sora, without provisioning GPUs or writing complex scheduling logic. Just submit your prompt and assets, and the models do the rest.

Using Azure's Model as a Service, access cutting-edge models, including brand-new releases like DeepSeek R1 and Sora, as managed APIs with autoscaling and built-in security. Whether you're handling bursts of demand, fine-tuning models, or provisioning compute, Azure provides the capacity, efficiency, and flexibility you need. With industry-leading AI silicon, including H100s and GB200s, plus advanced cooling, your solutions can run with the same power and scale behind ChatGPT.

Mark Russinovich, Azure CTO, Deputy CISO, and Microsoft Technical Fellow, joins Jeremy Chapman to share how Azure’s latest AI advancements and orchestration capabilities unlock new possibilities for developers.

AI beyond text generation.

Use multiple LLMs, agents, voice narration and video. Fully automated on Azure. Start here.

Run massive-scale inference with consistent performance.

Use parallel deployments and benchmarked GPU infrastructure for your most demanding AI workloads. Watch here.

Save cost and complexity.

Get the performance of a supercomputer for larger AI apps, while having the flexibility to rent a fraction of a GPU for smaller apps. Check it out.

QUICK LINKS:

00:00 — Build and run AI apps and agents in Azure

00:26 — Narrated video generation example with multi-agentic, multi-model app

03:17 — Model as a Service in Azure

04:02 — Scale and performance

04:55 — Enterprise grade security

05:17 — Latest AI silicon available on Azure

06:29 — Inference at scale

07:27 — Everyday AI and agentic solutions

08:36 — Provisioned Throughput

10:55 — Fractional GPU Allocation

12:13 — What’s next for Azure?

12:44 — Wrap up

Link References

For more information, check out https://aka.ms/AzureAI

Unfamiliar with Microsoft Mechanics?

Microsoft Mechanics is Microsoft's official video series for IT. Watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

To keep getting this insider knowledge, join us on social.

Video Transcript:

- So is Azure the best place to build and run your AI apps and agents, even if you plan to use open-source models and orchestration? I’m joined today by Azure CTO, Mark Russinovich, who some of you probably know as the Co-founder of Sysinternals.

- It’s great to be here, Jeremy.

- And today we’re talking about inference and we’re going to start with a multi-agentic solution to show what’s possible, then we’re going to dig into what runs it and what you can build.

- Let's do it. We're going to use several agents working together to build a custom video ad with voiceover from scratch, using the best AI models and tools available for the job. This page lets you provide a basic prompt and upload pictures, and my agentic app will create a 30-second video with voice narration and multiple scenes for a product launch. I'll start with the prompt to generate an ad for our new SUV with its Overlander option package that lets it go anywhere and escape everything. I'll upload some pictures of the car with different colors and angles from my local device. Then I'll submit my prompt, and you can watch what the agent is doing on the right. While the videos generate, let me go into VS Code and explain what's behind this app. This is using the open-source Semantic Kernel from Microsoft for orchestration with Python code, and we can see what's happening play-by-play in the terminal. We're using Azure AI Foundry Models. First, the DeepSeek R1 model powers the main planning agent. The next agent is a copywriter that interprets my prompt. It's using the open-source Llama 4 model from Meta, hosted in Azure, to write narration text that's around 25 seconds in length. We then have another agent that uses text-to-speech in Azure AI Foundry to take the output from the copywriter agent and add a voiceover to the ad copy. It uses our brand-approved voice and outputs an MP3 file. We have another video agent deciding which video scenes make the most sense to generate. This is based on the script generated by the first ad copy agent. The app then calls the Sora model in Azure OpenAI. It will reference my uploaded images and use the text prompts describing the scenes, in the order they appear in the talk track, to generate videos. The prompts are used to generate a handful of five-second videos. This is an early look at the Sora API, which is rolling out soon and unique to Azure, with image-to-video support coming soon after the launch. Once the video and audio files are complete, the app uses the open-source FFmpeg command line tool to do the video assembly, combining the video with the audio track, and it will also insert prebuilt Contoso intro and outro bumpers as the first and last video segments to align with our brand. Once all of that is complete, it creates the finished downloadable MP4 file. And it's fast, because as I was talking the entire process finished. And here's the end result.
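For readers who want a feel for how a pipeline like this hangs together, here is a simplified sketch of the same flow. The actual app orchestrates these steps as Semantic Kernel agents; this version calls Azure OpenAI directly with the openai Python SDK and uses FFmpeg for assembly. The deployment names, voice, API version, and file paths below are placeholder assumptions, not the app's real configuration.

```python
# Simplified sketch of the narrated-video pipeline described above; the real app
# runs these steps as Semantic Kernel agents. All names below are placeholders.
import os
import subprocess
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# 1. Copywriter step: turn the product prompt into ~25 seconds of narration.
#    (Placeholder deployment; the actual app uses a Llama 4 copywriter agent in Azure AI Foundry.)
narration = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Write upbeat ad narration, about 25 seconds long."},
        {"role": "user", "content": "Ad for the Contoso EarthPilot SUV with the Overlander package."},
    ],
).choices[0].message.content

# 2. Text-to-speech step: render the narration as an MP3.
speech = client.audio.speech.create(
    model="tts",          # placeholder TTS deployment name
    voice="alloy",        # stand-in for the brand-approved voice
    input=narration,
)
speech.write_to_file("narration.mp3")

# 3. Assembly: concatenate the generated video clips (listed in scenes.txt),
#    then mux in the narration audio track with FFmpeg.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "scenes.txt",
     "-i", "narration.mp3", "-c:v", "copy", "-c:a", "aac",
     "-shortest", "ad_final.mp4"],
    check=True,
)
```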

- Adventure calls. Answer with the Contoso EarthPilot hybrid SUV: rugged, reliable, and ready for anything. From rocky trails to open highways, this powerhouse gets you wherever your heart dares. Upgrade to the Overlander option package and unlock the ultimate getaway: sleek rooftop tents, elevated storage, and cutting-edge trail tech. The EarthPilot takes you further, keeps you moving, and lets you truly escape. Contoso EarthPilot. Go anywhere. Escape everything.

- Right, and in terms of inference, you know, this is a lot more intense than most text generation scenarios. And you showed that the agents are actually consuming quite a few different models from OpenAI, Meta's Llama, and DeepSeek, all Azure AI Foundry models, alongside open-source orchestration. So what hardware would something like this run on?

- Everything you saw is running on the same battle-tested infrastructure that powers ChatGPT, with more than 500 million weekly active users, all running on Azure. To put this into context, if you want to run the agentic system I just showed you on your own, you need a pretty sizable cluster of H100 or newer GPU servers to run a video generation model and encode everything. On top of that, a large LLM like DeepSeek R1 671B, which is considered efficient, requires more than 1.3 terabytes of GPU memory and 16 NVIDIA A100 or newer clustered GPUs. With the way we manage services on Azure, we take care of everything for you. You don't need to worry about provisioning compute or connecting everything together. OpenAI models, including GPT-4o and Sora, along with DeepSeek and Llama models, are part of our Model as a Service in Azure, where we run those specific models serverless. You don't need to set up the runtime or worry about tokenization or scheduling logic; it's just an endpoint with built-in quota management and auto-scaling.
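As Mark says, consuming one of these serverless Foundry models really is just an endpoint call. Here is a minimal sketch assuming the azure-ai-inference Python package and key-based authentication; the environment variable names and the model name are placeholders.

```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Point the client at a serverless (Model as a Service) endpoint in Azure AI Foundry.
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],                  # placeholder env var
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]), # placeholder env var
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a concise assistant."),
        UserMessage(content="In one sentence, what does 'Model as a Service' mean?"),
    ],
    model="DeepSeek-R1",   # placeholder model/deployment name
    max_tokens=200,
)
print(response.choices[0].message.content)
```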

- In terms of scale, last May when we did the Azure Supercomputer show together, we were already supporting more than 30 billion inference requests per day on Azure.

- And we passed that a while ago. In fact, we processed over 100 trillion tokens in the first quarter of this year, which is a 5x increase from last year. And the growth we're seeing is exponential. Last month alone we processed 50 trillion tokens. Peak AI performance requires efficient AI models, cutting-edge hardware, and optimized infrastructure. We make sure that you always have access to the latest and greatest AI models. For example, we were able to offer the DeepSeek R1 model, fully integrated into Azure services and with enterprise-grade security and safety, just one day after it launched. And when I say enterprise-grade security and safety, I mean it's integrated with services like Key Vault, API gateway, Private Link, and our responsible AI filters. You can access our model catalog directly from GitHub and experiment without an Azure subscription. Or if you use Azure, you can access it from Azure AI Foundry, where we have over 10,000 Foundry Models, including thousands of open-source and industry models. And we've always been at the forefront of bringing you the latest AI silicon and making it available on Azure. We closely partnered with AMD on the design of their MI300X GPUs with 192 gigabytes of high-bandwidth memory, which is critical for inferencing. And working with NVIDIA, we were the first cloud to offer H100 chips, along with the NVIDIA GB200 platform, the most powerful on the market today. It means we can generate tokens at a third of the cost compared to previous generations of GPUs. And we lead in terms of capacity, with tens of thousands of GB200 GPUs in our massive, purpose-built data centers. To take advantage of the best cost performance, we have developed advanced liquid cooling to run our AI infrastructure. This includes our in-house chip, Maia, which is currently used to efficiently run our large-scale first-party AI workloads, including some of our Copilot services. And our systems are modular, allowing us to deploy NVIDIA and AMD GPUs on the same InfiniBand network infrastructure to meet the specific demand for each.

- And what all this means is whether you're building now or for a few years down the line, you always have access to the most cutting-edge tech.

- Right, and I can prove the inference performance to you. As part of our MLPerf benchmark test, we use the industry-standard Llama 2 70B model. It's an older model, but its size makes it the industry standard for hardware benchmarking and testing. And we ran inference on Azure's ND GB200 v6 Virtual Machines, accelerated by NVIDIA GB200 Blackwell GPUs, where we used a single, full NVIDIA GB200 Blackwell GPU rack. One rack contains 18 GPU servers with four GPUs per node, totaling 72 GPUs. We loaded the Llama 2 70B model on these 18 GPU servers with one model instance on each server. This is the Python script we ran on each server, and using Slurm on Azure CycleCloud to schedule the jobs, we ran them in parallel. You can see on this Grafana dashboard the tokens-per-second performance of model inference. As you can see in the benchmark results at the bottom, we hit an average of around 48,000 tokens per second on each node. And above that, you can see that we're totaling 865,000 tokens per second for the entire rack, which is a new record. The bar charts on the top right show how consistent the performance is across the system, with very low deviation.
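As a quick arithmetic check on those figures (this is not the benchmark harness itself, just the math), the per-node and per-rack numbers line up:

```python
# Quick arithmetic check of the rack-level figures quoted above.
nodes_per_rack = 18                 # GPU servers in one GB200 rack
gpus_per_node = 4                   # Blackwell GPUs per server
tokens_per_sec_per_node = 48_000    # approximate measured average per node

total_gpus = nodes_per_rack * gpus_per_node
rack_tokens_per_sec = nodes_per_rack * tokens_per_sec_per_node

print(f"GPUs per rack: {total_gpus}")                           # 72
print(f"Rack throughput: ~{rack_tokens_per_sec:,} tokens/sec")  # ~864,000
```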

- So how does this performance then translate to the everyday AI and agentic solutions that people right now are building on Azure?

- So I don’t have the exact numbers for tokens consumed per interaction, but we can use simple math and make a few assumptions to roughly translate this to everyday performance. For example, something easy like this prompt where I asked Llama about Sysinternals consumes around 20 tokens. Under the covers we need to add roughly 100 tokens for the system prompt and an extra 500 tokens is a proxy for what’s required to process the prompt. Then finally, the generated response is around 1,400 tokens. So the grand total is close to 2,000 equivalent tokens for this one interaction. Remember, our benchmark test showed 865,000 tokens per second. So let’s divide that by the 2,000 tokens in my example. And that translates to around 432 user interactions per second per rack in Azure. Or if you extrapolate that over a day and estimate 10 interactions per user, which is pretty high, that’s around 3.7 million daily active users. And by the way, everyone should already know how to use the Sysinternals tools and not need to ask that question.
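The same back-of-the-envelope estimate in a few lines of Python, using the token counts assumed above:

```python
# Rough everyday-performance estimate using the assumptions in the transcript.
prompt_tokens = 20          # the user's question
system_prompt_tokens = 100  # system prompt overhead
processing_tokens = 500     # proxy for prompt processing
response_tokens = 1_400     # generated answer

total = prompt_tokens + system_prompt_tokens + processing_tokens + response_tokens
tokens_per_interaction = round(total, -3)        # ~2,020 rounded to ~2,000

rack_tokens_per_sec = 865_000
interactions_per_sec = rack_tokens_per_sec / tokens_per_interaction   # ~432

interactions_per_day = interactions_per_sec * 86_400                  # one day of seconds
daily_active_users = interactions_per_day / 10                        # 10 interactions per user

print(f"~{interactions_per_sec:.0f} interactions/sec per rack")
print(f"~{daily_active_users / 1e6:.1f}M daily active users")
```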

- Exactly. That’s what I was thinking. Actually, I’ve committed all of this stuff to memory.

- I’m not sure I believe you.

- A little bit of command line help also helps there too. But why don't we switch gears? You know, if you're running an app at this scale, how would you make sure the response times continue to hold up?

- So it depends on the deployment option you pick. If you run your model serverless, you also have the option to maintain a set level of throughput performance when you're using shared models and infrastructure. With the way we isolate users, you don't need to worry about noisy neighbors who might have spikes that could impact your throughput. When you provision compute for serverless models directly from Azure, you can use Standard, which is shared, local Provisioned, and Global Provisioned. Here you can see that I have a few models deployed. Moving to my load testing dashboard, I can run a test to look at inference traffic against my service. I'll start the test. If I move over to my Grafana dashboard, you'll see that under load, we're serving all the incoming requests. This informs me how much capacity I should set for provisioned throughput. Now I'll move over to a configuration console for setting provisioned throughput, and I can choose the date range I want to see traffic for. This bar chart time series conveniently represents the waterline of provisioned throughput in blue, and I can use this slider to match the level of performance I want guaranteed. I can slide it to match peak demand or a level where there is constant, predictable demand for most of the requests. I'll do that and set it to around 70. Now, if traffic exceeds that level, by design, some users will get an error and their request won't get served. That said, for the requests that are served and within my set provisioned throughput limit, the performance level will be consistent, even if other Azure users are also using the same model deployment and underlying infrastructure. I can show this in a Grafana dashboard with the results of this setting under load, where it's still getting lots of requests, but here on this line chart you can see where provisioned throughput was enforced. That's where we can use spillover with another model deployment to serve that additional traffic beyond our provisioned throughput. I'll change the spillover deployment option from no spillover to use GPT-4o mini. The model needs to match the model I used for the PTU portion of served traffic. Then I'll update my deployment type to confirm. And now I'll go back to the Grafana dashboard one more time. Here, you'll see where the standard deployment kicks in to serve our spillover requests with this spike on the same line chart. Below that, we can see the proportion of requests served using both standard and provisioned throughput. That means all traffic will first see predictable performance using provisioned throughput. And if your app goes viral, it can still serve those additional requests at standard throughput performance. And related to this, for tasks like fine-tuning or high-volume inference, we also support fractional GPU allocations. You don't have to rent an entire H100 if you only need a slice of it.
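Spillover itself is configured on the service side, as shown in the demo. Purely as an illustration of the routing idea, here is a client-side sketch that prefers a provisioned (PTU) deployment and falls back to a standard deployment on rate-limit errors. It assumes the openai Python SDK, key-based auth, and hypothetical deployment names.

```python
# Illustrative only: Azure handles spillover service-side once configured.
# This sketch mimics the routing idea from the client. Deployment names are hypothetical.
import os
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

PTU_DEPLOYMENT = "gpt-4o-mini-ptu"    # hypothetical provisioned deployment
STANDARD_DEPLOYMENT = "gpt-4o-mini"   # hypothetical standard deployment (same model)

def chat_with_spillover(messages):
    """Try the provisioned deployment first; fall back to standard on 429s."""
    try:
        return client.chat.completions.create(model=PTU_DEPLOYMENT, messages=messages)
    except RateLimitError:
        # Provisioned capacity exhausted; spill the request over to the shared pool.
        return client.chat.completions.create(model=STANDARD_DEPLOYMENT, messages=messages)

reply = chat_with_spillover([{"role": "user", "content": "Draft a one-line product tagline."}])
print(reply.choices[0].message.content)
```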

- And this idea of renting GPUs makes me think of other options out there, like GPU-focused hosting providers that are getting a lot of attention these days. So where does Azure stack up?

- Well, Azure is more than just the hardware. We built an AI supercomputer system, and our AI infrastructure is optimized at every layer of the stack. It starts with the state-of-the-art hardware that we run globally for raw compute power at the silicon level, along with power management and advanced cooling so that we can squeeze out every ounce of performance, and of course our high-bandwidth, low-latency network of connected GPUs. There are also software platform optimizations, where we've enlightened the platform and hypervisor to access those networked GPUs so that performance is comparable to running on bare metal. We then have a full manageability and integration layer with identity, connectivity, storage, data, security, monitoring, automation services, and more across more than 70 data center regions worldwide. And moving up the stack, we also support a variety of popular open-source AI frameworks and tools, including PyTorch, DeepSpeed, vLLM, and more, so that you can build and scale your AI solutions faster using the tools that you're already familiar with.

- And as we saw, at the top of all that are your AI apps and your agents, all running on the whole stack. Now, last time that you were on the show, you predicted, accurately I'll say, that the agentic movement was going to start next. So, what do you think is going to happen in the next year?

- We’ll see a shift from lightweight agents to fully autonomous agentic systems. AI is making it easier to build powerful automation by just describing what you want. This is only getting more pervasive for everyone. And everything we’re doing in Azure is focused on enabling what’s next in AI with even faster inference and all the supporting services to run everything reliably and at scale.

- And with things changing so rapidly from month to month, I look forward to seeing how things pan out. So, thanks so much for joining us today for the deep dive, and thank you for watching. Be sure to check out aka.ms/AzureAI. Let us know in the comments what you're building. Hit subscribe, and we'll see you next time.
