What is Azure Synapse Analytics, and how will new capabilities benefit you?

12 min read · Dec 4, 2020

Generally Available Today.

Check out the updates with the release and a first look at additional capabilities coming soon. If you’re new to Azure Synapse, it’s Microsoft’s limitless analytics platform that brings together enterprise data warehousing and big data processing into a single managed environment with no systems integration required. John Macintyre, engineering lead for Azure Synapse, joins host Jeremy Chapman to provide details on the newest updates, and walk through how a grocery retailer might use new capabilities in Synapse to plan inventory levels.

Preview capabilities in the Synapse workspace now ready for production workloads:

Azure Synapse Link, the first cloud-native HTAP solution.
SQL Serverless, now generally available.
For analytics with Spark, we’ve built performance optimizations for our implementation of Apache Spark.

New capabilities, formerly in preview, are now available and fully supported:

New knowledge center gives you pipeline templates to bring data in, sample scripts for analytics, automation, and notebooks.
Bring data into Synapse for advanced analytics and to enrich that data in code-free ways and apply your Azure machine-learning models.

Added for data admins:

Connect to your data and storage securely through managed private endpoints.
You no longer have to manage subnets, worry about IP ranges, or configure private endpoints. You don’t need deep networking knowledge.

QUICK LINKS:

00:46 — Preview capabilities now ready for production workloads

02:07 — New capabilities that are available and fully supported

02:55 — Updates added for data admins

04:15 — Demo: How a Grocery retailer might use new capabilities in Synapse

08:59 — Demo: How to bring in historical data

10:28 — Demo: How to perform predictive analytics for yourself

12:48 — Demo: How to push it all into production

13:24 — Closing notes

Unfamiliar with Microsoft Mechanics?

We are Microsoft’s official video series for IT. You can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries?sub_confirmation=1
Follow us on Twitter: https://twitter.com/MSFTMechanics
Follow us on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/
Follow us on Facebook: https://facebook.com/microsoftmechanics/

Video Transcript:

- Coming up, we’re joined by John Macintyre, Engineering Lead for Azure Synapse, Microsoft’s limitless analytics platform for a tour of the latest updates now generally available and a first look at additional capabilities coming soon. So John, welcome to the show.

- Thanks Jeremy it’s great to be back.

- So we’ve been following the momentum of Azure Synapse here in Microsoft Mechanics closely over the past year. In fact, we recently chronicled several early adopter customers. And if you’re new to Azure Synapse, it’s Microsoft’s limitless analytics platform that really brings together enterprise data warehousing and also big data processing into a single managed environment with no systems integration required. So, John, I know you and the team have been hard at work, but what’s new in the service?

- So as you know Jeremy, Azure Synapse Analytics has been available for customers for the past year. And we’ve had some really great preview capabilities in the Synapse workspace that are now ready for production workloads. Like Azure Synapse Link, which is the first cloud-native HTAP solution. And that enables continuous analytics over operational data in Cosmos DB. That’s done without interfering with your operational or application workloads. Next, SQL Serverless is also now generally available. And that gives you the horsepower you need at the exact moment you run a query. It runs completely serverless so you only pay for each query and the data you process. And beyond that, for analytics with Spark, we’ve also built performance optimizations for our implementation of Apache Spark, including enhanced shuffle, which aligns data to improve query performance. We’ve also implemented dynamic partition pruning to eliminate that unnecessary data during job execution. All of these things are working together to really speed up performance. And that Spark environment that Synapse offers is fully managed. So when a job comes in, the service will provision resources, scale resources and manage those resources as you need them.
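To make the idea of dynamic partition pruning more concrete, here is a toy sketch in plain Python. It is not Spark's implementation and makes no claim about how Synapse does it internally; it only illustrates the principle of evaluating a filter on the small side of a join first, then skipping partitions of the large table that cannot match. All rows, names, and values are invented for illustration.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def join_with_pruning(fact_partitions, dim_rows, dim_filter, key):
    """Join fact and dimension rows, reading only the partitions the filter needs."""
    # Step 1: evaluate the filter on the small dimension side first.
    wanted = {d[key]: d for d in dim_rows if dim_filter(d)}
    # Step 2: skip every fact partition whose key cannot match.
    scanned = [k for k in fact_partitions if k in wanted]
    joined = [
        {**fact, **wanted[k]}
        for k in scanned
        for fact in fact_partitions[k]
    ]
    return joined, scanned


# Synthetic rows, invented purely for illustration.
sales = [
    {"store_id": 1, "units": 10},
    {"store_id": 2, "units": 7},
    {"store_id": 3, "units": 4},
    {"store_id": 1, "units": 2},
]
stores = [
    {"store_id": 1, "state": "TX"},
    {"store_id": 2, "state": "CA"},
    {"store_id": 3, "state": "WA"},
]

fact_partitions = partition_by(sales, "store_id")
joined, scanned = join_with_pruning(
    fact_partitions, stores, lambda s: s["state"] == "TX", "store_id"
)
print(scanned)  # prints [1]: only the partition for store 1 is read
```

The partitions for stores 2 and 3 are never touched, which is the saving the optimization is after.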

- Right, and this is really great news I think for Synapse users and really highly anticipated capabilities. But you’ve also added a host of new capabilities that were recently in preview that are now also available and fully supported as of today.

- We have, and Jeremy, these have been focused in a number of areas. So first, to help you really easily get started, the new knowledge center gives you Pipeline templates to bring data in, sample scripts for analytics, automation and Notebooks to start to analyze your data as well as access to data within the Azure open datasets. Second, we’re making it even easier to bring data into Synapse for advanced analytics. And to enrich that data in code-free ways and apply your Azure Machine Learning models.

- Right, and those capabilities will make data analysts and also data scientists really happy. But, what are some of the things that we’ve added for our data admins?

- One of the biggest things we’ve done is to make it easier for you to connect to your data and storage securely through managed private endpoints. As you provision your Azure Synapse workspace, you can simply enable the managed virtual network option. And with that, we automatically handle all that configuration of virtual network and private endpoints so that you can immediately start running SQL scripts or use Notebooks to analyze your data. You can also enable the exfiltration protection for your workspace. And what this does is it ensures that all that outbound traffic goes through private endpoints and only to selected resources that are approved in your Azure AD tenants.

- Alright, and this is nice ’cause you no longer have to manage subnets, worry about IP ranges, configure private endpoints, like you said. You don’t need deep networking knowledge or knowledge about data movement or orchestration. Also the performance and resiliency is managed then by Microsoft.

- That’s right. We’re removing that complexity for you. And also as part of our comprehensive approach to data protection, we’ve added new role types to Synapse for role-based access controls. They really give you more granular control over both your resources and your data.

- Right, and this is a lot of popular updates I think a lot of people have been waiting for. But this is Mechanics, so why don’t we make this real for everybody?

- Yeah, this is what I’ve been waiting for. Demos are my favorite part of coming on Mechanics. So I’ll start here in my Azure Synapse workspace. And I wanna walk you through how a grocery retailer might use new capabilities in Synapse to plan their inventory levels. As you know, beyond just monitoring operational data and sales data, we need to take into account external factors that may impact sales and inventory. As we’ve seen in 2020, it’s really the COVID-19 pandemic that is changing buying behavior. So for the most accurate forecast, we need to work with our real-time operational sales data, but at the same time, correlate that with public COVID-19 data. So let’s start in Data. This provides a great view of all of your connected data sources. You can see it’s easy to keep my data unified and centralized, including data managed in the workspace and data linked from sources that sit outside the workspace. And from Home, under useful links, I can get to our knowledge center. And this is so I can explore data sets that are available to me. In my case, I’ll add the Bing COVID-19 dataset, which provides daily confirmed cases as well as related data worldwide. All I need to do is click add data set, and you’ll see this shows up in my Linked Data tab. It’s integrated into my Synapse workspace automatically. And without worrying about schema details or the format of the data, I can start to explore the COVID-19 dataset. If I click on actions and select a new SQL script, select top 100 rows, Synapse will generate T-SQL commands to analyze the data. And I can start to explore that data using Serverless SQL. And using the same process, if I create a new Notebook to process and visualize the data with Python, Synapse gives me a head start with pre-populated PySpark code, all ready to execute. And you can just attach that notebook to a serverless Spark pool, run the Notebook and start analyzing that data.
And this experience is available for data at any scale, whether it’s just a few thousand rows of data or millions of rows of data, like we just demonstrated.
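For readers who want a feel for what a "select top 100 rows" script can look like, here is a minimal sketch in Python that builds a serverless SQL query of the `OPENROWSET` form. The exact script Synapse generates is not shown in the video, so treat this as an approximation. The storage URL is a hypothetical placeholder, and the helper name `top_rows_query` is invented for this example.

```python
def top_rows_query(storage_url: str, file_format: str = "PARQUET", limit: int = 100) -> str:
    """Build a T-SQL query that previews files in storage through OPENROWSET."""
    if file_format not in {"PARQUET", "CSV"}:
        raise ValueError(f"unsupported format in this sketch: {file_format}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Double any single quotes so the URL stays a valid T-SQL string literal.
    escaped = storage_url.replace("'", "''")
    return (
        f"SELECT TOP {limit} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{escaped}',\n"
        f"    FORMAT = '{file_format}'\n"
        ") AS [result];"
    )


# Hypothetical location; replace with the path of your own linked dataset.
query = top_rows_query("https://example.invalid/covid/*.parquet")
print(query)
```

The resulting text can be pasted into a SQL script in the workspace and run against the built-in serverless pool; nothing in this sketch connects to Azure.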

- Now what’s great is now you don’t need to figure out how to connect to the data and you can just start your analysis right away.

- That’s right, we’re removing that step for you to make things easier. So in my case, the retail store data is managed in Cosmos DB. And I wanna see the impact of the COVID-19 cases related to my operational retail sales data. We can easily bring in new Cosmos DB data and to do that, when I created the Cosmos DB container, I selected the analytical store option. And I can do this without worrying about how when I enable this, it’s gonna impact the performance of my operational data workload. And all that data is there in the Synapse workspace in near real time. Now I’ve done this in advance and already have a Cosmos DB container with Synapse Link enabled. And as you can see here in my Linked data with Synapse, that Cosmos DB is available to me. And now we can easily query that data between our sales system as well as the COVID-19 data that we pulled in. And in my case, I wanna see the correlation between COVID case counts and the sales data. To do that, I’ve added COVID data and filtered by Texas and California, where many of our stores are located and where we know case counts are high. And in this case, we’re specifically looking at sales of household paper products and cleaning supplies. I’ve run a Serverless SQL query and I’ll display the COVID case count data. And you will see that in March, the sales and demand spiked before case counts started to accelerate. But if you look at week 30 and beyond, the COVID case count is a good predictor of sales and demand for these products. When case counts go up, we can see higher demand quickly follows. To put this into further context, let’s compare this to our historical 2019 sales data that resides in my Azure Data Lake. I’ll use the same parameters and we can see that our run rate for these items is a lot lower and it isn’t really even in the ballpark of the actual demand that we’re seeing. So this isn’t really gonna help us much with future predictions and forecasts.
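The claim that case counts predict demand, with demand following shortly after, can be checked with a lagged correlation. Below is a minimal sketch in Python; the weekly numbers are synthetic and invented only to show the mechanics, they are not the figures from the demo. The function names are also made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(cases, sales, lag):
    """Correlate cases in week t with sales in week t + lag."""
    if len(cases) != len(sales):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(cases) - 2:
        raise ValueError("lag leaves too few overlapping weeks")
    return pearson(cases[: len(cases) - lag], sales[lag:])


# Synthetic weekly series: sales are built to follow cases by one week.
cases = [10, 20, 40, 30, 60, 80, 70, 90]
sales = [100] + [100 + 2 * c for c in cases[:-1]]

same_week = lagged_correlation(cases, sales, 0)
one_week_later = lagged_correlation(cases, sales, 1)
print(f"lag 0: {same_week:.3f}, lag 1: {one_week_later:.3f}")
```

Because the toy sales series is constructed from the previous week's cases, the one-week lag correlates almost perfectly while the same-week correlation is weaker. With real data the best lag would have to be found by trying several values.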

- Right, and what we just saw was how simple and fast it was to bring in the public data and also the operational data that you had in Cosmos DB and analyze it at scale against your historical data that you brought in because your querying was serverless, also there wasn’t a setup or servers to manage or any configuration required. But how were you able to bring in that historical data that we saw in the last step?

- Yeah, I’m glad you asked Jeremy. We’re making things a lot easier, not just for the administrators, but also for the data engineers. That historical data actually came from a legacy on-premises data warehouse. But let me show you how easy it is to bring in data like that with Synapse. I can either use a code-free pipeline, or I can simply load the data into a SQL pool. To make data loading easier, we’ve added a new experience for bulk loading. First you select the folder, you right click, then new SQL script, then bulk load. From there, you can select a storage account, which I’ll do, and I’ll click continue. I’ll keep the auto selected properties, hit continue again. Then I’ll pick a dedicated SQL pool where I wanna load the data. I can create a new target table or use an existing one. In my case, I’ll use one I created just for the show called MechLoad. The column mappings look good. Then I’ll open the script, and beyond just a simple one-time import, what’s really cool is that right here, I can operationalize my data pipeline. Here I have a basic stored procedure. You can see it from the section that is commented out. So uncomment that. When I do that and run it, you’ll see the bulk load procedure is added to my stored procedure folder. I’ll add this to a new pipeline and that’s it. It’s operationalized.
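The bulk load script itself is not shown in full in the video. As a rough sketch, assuming it is based on the T-SQL `COPY INTO` statement wrapped in a stored procedure, the Python below builds text of that shape. The table name `MechLoad` comes from the demo; the source URL, the `dbo` schema, and the procedure name are hypothetical.

```python
def copy_into_statement(table: str, source_url: str, file_type: str = "PARQUET") -> str:
    """Build a COPY INTO statement that loads files from storage into a table."""
    if file_type not in {"CSV", "PARQUET", "ORC"}:
        raise ValueError(f"unsupported file type: {file_type}")
    # Double any single quotes so the URL stays a valid T-SQL string literal.
    escaped = source_url.replace("'", "''")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{escaped}'\n"
        f"WITH (FILE_TYPE = '{file_type}')"
    )


def wrap_in_procedure(proc_name: str, statement: str) -> str:
    """Wrap a statement in CREATE PROC so a pipeline can call it repeatedly."""
    body = "\n".join("    " + line for line in statement.splitlines())
    return f"CREATE PROC {proc_name}\nAS\nBEGIN\n{body};\nEND"


# Hypothetical names and location, for illustration only.
statement = copy_into_statement("dbo.MechLoad", "https://example.invalid/history/2019/")
procedure = wrap_in_procedure("dbo.usp_bulk_load_MechLoad", statement)
print(procedure)
```

Keeping the load inside a procedure is what lets a pipeline activity rerun it on a schedule instead of someone replaying the script by hand.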

- Okay, so you’ve automated the pipeline to bring in the data that you need, but how do you take the next step then to perform predictive analytics for your sales forecast?

- So now that I have the data flowing in, from here I can go on to predict purchasing and stay ahead of demand. And I can use that COVID-19 case data for my predictive analysis. And we just built our pipeline to ingest the data into a dedicated SQL pool where I can actually run all of my predictions. We now have native integration with Azure Machine Learning. And to save time, I’ve already linked my Synapse workspace with my Azure Machine Learning service. If I jump back to data, the action for machine learning will appear against all my SQL tables. I just need to click into a table and select Machine Learning and enrich with existing model. I’ll see the list of all my models from the ML registry that I have linked with Azure Machine Learning. And this is the model registry that my data science team is using to develop their predictive models. And I can just choose one of these that corresponds to the selected table. Using this model, I can enrich the table. When I click continue, it’s gonna analyze the table and the model, and it will automatically map source column names with the model inputs to make sure everything works correctly. This next step will create a stored procedure for me so that I can continue running this model with my latest data. I just need to give it a name. I’ll load this model into an existing table. Now I’ll deploy it and that’ll take just a second. And from here, we can execute our stored procedure to enrich the data from our table and we’ll use our new predict function to predict our inventory forecast. I’ll run it. And note these ML predictions are being calculated in the engine, which means all my queries are still really fast. The ML engine is scaling with my cluster and there’s no additional cost for making API calls from outside my data warehouse environment to some separate scoring service.
In just a few seconds, it’s analyzed three million records and we can see the predicted quantities we need for inventory categories all without moving the data.
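As a rough illustration of scoring inside the engine, here is a Python sketch that builds a query using the T-SQL `PREDICT` function with an ONNX model stored in a table. The procedure generated in the demo is not shown in detail, so this is an assumption about its general shape. The table names, the `model` and `id` column names, and the output column are all hypothetical.

```python
def predict_query(model_table: str, model_id: str, data_table: str,
                  output_column: str, output_type: str = "float") -> str:
    """Build a T-SQL query that scores rows in place with PREDICT."""
    # Double any single quotes so the id stays a valid T-SQL string literal.
    escaped_id = model_id.replace("'", "''")
    return (
        f"SELECT d.*, p.[{output_column}]\n"
        "FROM PREDICT(\n"
        # Assumes the model bytes live in a column named [model], keyed by [id].
        f"    MODEL = (SELECT [model] FROM {model_table} WHERE [id] = '{escaped_id}'),\n"
        f"    DATA = {data_table} AS d,\n"
        "    RUNTIME = ONNX\n"
        f") WITH ([{output_column}] {output_type}) AS p;"
    )


# Hypothetical names, for illustration only.
scoring_sql = predict_query(
    model_table="dbo.models",
    model_id="inventory-forecast:1",
    data_table="dbo.MechLoad",
    output_column="predicted_quantity",
)
print(scoring_sql)
```

Because the model is read from a table in the same pool as the data, the rows never leave the warehouse to be scored, which matches the point made above about avoiding calls to a separate scoring service.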

- And now it’s also part of a stored procedure so it’s operationalized and it’s gonna stay up to date then with its prediction. So all of what we’ve just shown though, is probably part of a pre-production or a test environment, so how do we push something like this then into production?

- So we built CI/CD options into Azure Synapse under Manage and Source control. This means that your resource definitions, linked services, connection strings, pipelines and code artifacts can all be version controlled using Git. And you can deploy Synapse artifacts through your DevOps release pipeline, making it easier for you to maintain your development and production workspaces.

- Right, and as we’ve shown many times, it’s really easy to serve up that data to business users, for example, using Power BI directly from Synapse. Now, everything you’ve shown is out of preview today and can be used right now for production workloads.

- That’s right. And the great thing about continuous innovation in the cloud is that we’ve also just released more capabilities in preview. We’re making it easier for you to transform your data at scale, code-free, with Power Query built directly into the Azure Synapse Studio experience. You can also build Machine Learning models code-free, with AutoML, without ever leaving the Synapse environment. And we’re really excited about the native integration with Azure Purview, our new service for discovering and mapping data across your complete data estate. With this integration, all that data is available for analytics within Azure Synapse.

- So really a ton of progress in the last couple of months. Thanks John for joining us today, but for people that wanna get started and kick the tires of this, what do you recommend people do?

- If you’re already using Azure Synapse Analytics for data warehousing, you can attach a Synapse workspace to it today to discover all this new functionality. If not, sign up for a trial or create your first Synapse workspace at aka.ms/GetSynapse.

- Amazing stuff. And now it’s generally available for all your production workloads with even more to try that’s in preview. Of course, we’re gonna continue to track this on Microsoft Mechanics so be sure to keep checking back, subscribe to our channel if you haven’t already and thanks for watching.