Map, Discover, and Find Insights Across Your Data Sources with Azure Purview

Mechanics Team
12 min read · Dec 4, 2020

Automatically discover and map data that sits across your Azure data sources, on-premises databases, and SaaS data sources to help you catalog and understand your data, and classify it all in one unified environment. Take a deep dive on the new Azure Purview. Mike Flasko, Partner Director of Program Management, joins host Jeremy Chapman to show how Azure Purview gives you a holistic map of the data across your data landscape.

Purview classifies data in one unified environment

Azure Purview is a significant breakthrough service, especially from a data management perspective. Because data can take many different forms, it’s difficult to get a handle on the volumes of data that sit across your organization in multiple clouds and on premises. Purview solves for data discovery and provides the foundation for effective data governance. Ultimately, the better you understand the data you have, the more effectively you can use it across your organization.

Purview provides:

  • A unified platform that automatically discovers and classifies your data without the need to move it.
  • Rich user experiences that enable data producers, consumers, and stewards to easily collaborate.
  • A way to easily track and visualize the lineage of your data across the data estate, so you can easily see where data is moved and how it’s been transformed.

QUICK LINKS:

01:10 — What is Azure Purview?

03:02 — Purview in action: Search process

04:38 — Purview in action: Tables information & lineage view

06:23 — How it works

07:43 — Setup

09:33 — Options to scan data

10:57 — Insights: Bird’s eye view of data landscape

12:11 — Closing notes

Link References:

Get started and sign up for Azure Purview at https://aka.ms/TryAzurePurview

To learn more, check out our online resources at https://aka.ms/AzurePurview

Unfamiliar with Microsoft Mechanics?

We are Microsoft’s official video series for IT. You can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

Video Transcript:

- Coming up, we’re joined by Engineering Lead, Mike Flasko for a deep dive on the new Azure Purview that automatically discovers and maps the data sitting across your data sources, on-premises databases, and SaaS data sources to help you catalog and understand your data and classify it all in one unified environment. So Mike, welcome to Microsoft Mechanics and congrats on the preview launch of Azure Purview.

- Thank you, Jeremy. It’s great to be here.

- So Azure Purview is quite a significant breakthrough service, especially if you think about it from a data management perspective. It’s a struggle today, really to get a handle on the volumes of data that sit across your organization, which can also sit in multiple clouds and on-premises. So it can take a lot of different forms and it’s really hard to track all of it.

- You know, that’s right, that nobody really gives you a holistic map of the data across your data landscape today. Thinking about all the different data sources and data types that often exist in an organization, from columnar data through to text files, BI reports and kind of everything in between. There’s no easy way to understand or connect all these different data types without a lot of time consuming or manual work. And it’s even harder to understand what operations are running against all that data.

- Okay, so how do things change then with Azure Purview?

- We’re really solving for data discovery and understanding and providing the foundations for effective data governance, because ultimately, the better you understand the data you have, the more effectively you can use it across your organization. Purview gives you a unified platform that automatically discovers and classifies your data without having to move it. All the metadata discovered about your data is then indexed and brought together as a unified data map of your data estate. Purview also provides rich user experiences, enabling data producers, consumers, and stewards to easily collaborate. For example, business users and domain experts can interact with Purview’s business glossary to empower all users to easily understand the business context associated with the data in their organization. Finally, a key to understanding you’re using the right data is understanding where your data came from. With Purview, you can easily track and visualize the lineage of your data across the data estate, so you can easily see where data is moved and how it’s been transformed.

- Right, and to be clear, if you think about all the underlying tech that makes all this possible, there’s a lot of collective experience at Microsoft that makes Purview a reality.

- Absolutely, everything from our decades of work with Bing through to Azure Search for custom search and indexing was really helpful. As the example you can see here with the manufacturing company Howden, and the data classification technology that we’ve developed over the years for Microsoft Information Protection, all of this has really provided us a mature foundation that we built from. And then, of course, we’re also our own customer, and so the approach we’re taking today to scanning and mapping is highly inspired by what we do at Microsoft every day at exabyte scale to understand and govern our data estate. And lastly, we’ve also adopted some of the great innovations in the open source community, such as Apache Atlas.

- Great work. So since this is Mechanics though, why don’t we make this real by seeing everything in action?

- Yeah, that sounds great. Let’s take a look at Purview in action. So here we are looking at the home screen of Azure Purview. I can search for data using the Purview data catalog. Under the search bar I can see links to the knowledge center and other common functions. Up on the left-hand bar is where I can easily navigate to additional experiences to register more data, set up scanning, get data insights and manage settings.

- Okay, so the most common thing I’m guessing you’re going to do from here is actually search. What does the search process then look like?

- That’s right. This really is the starting point. Let me show you what anyone would do to search and find data. And then after that, we can go and look at what an administrator and a curator will do. I’ll start by searching for sales. You can see, I get suggestions and recommendations right here as I’m typing. When I execute the search, it’s finding matching business glossary terms and data assets, and it’s returning the results intelligently based on relevance, using all the signals derived from scanning and classifying your data, as well as business context from the business glossary. If I go ahead and select one of the search results, let’s pick the sales order header. Here, it shows me the operational metadata, and I can see at a glance, this data contains sensitive information, such as credit card numbers and social security numbers. But we’ll talk more on that in just a minute. On the right side, we can also see that there’s a hierarchy and where this dataset belongs inside of a table and schema and database and server. And from here if I wanted to, I could even go ahead and open up the data, right in Power BI Desktop to visualize.

- And the nice thing here is that Purview, like Azure Synapse, actually automatically creates that .pbids file for Power BI users, so you can visualize the data automatically and straight from there. But what happens when we go and drill into the tables information? What information is there?

- Yeah, that’s right. There’s a lot of rich information available here. So if I tour through the other tabs first with the Schema, I can see the column names from the table, their types, any classifications that have been applied, in this case, social security number, bank account, and we’ve even got a custom classification for customer ID. And one thing to note here is that even though in my case, the column name might make it obvious what data is contained in these fields, the system can still scan the content of the columns to verify the presence of sensitive information. Next, I’d like to show you one of my favorite areas: The Lineage view. You often want to know where a piece of data came from as well as what data is derived from it. This helps you assess at a glance if the data comes from an authoritative source. Here, we can see all the Power BI reports that are ultimately based on this data from this table, as well as all the transformations that the data went through along the way.
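
To make the lineage idea above concrete, here is a minimal sketch of how "where did this come from" and "what is derived from this" can be answered by walking a directed graph. It is only an illustration of the concept, not how Purview stores or queries lineage; the asset names, the `LINEAGE` dictionary, and the `downstream` and `upstream` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
# Asset names are made up for illustration and do not come from Purview.
LINEAGE = {
    "sql://sales/SalesOrderHeader": ["etl://copy_sales_to_lake"],
    "etl://copy_sales_to_lake": ["lake://curated/sales.parquet"],
    "lake://curated/sales.parquet": ["etl://aggregate_sales"],
    "etl://aggregate_sales": [
        "powerbi://reports/SalesOverview",
        "powerbi://reports/RegionalSales",
    ],
}


def downstream(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(graph, target):
    """Return every asset that `target` is ultimately derived from."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    # Walking the reversed edges from the target yields its ancestors.
    return downstream(reverse, target)


# Which reports are ultimately based on the sales order header table?
reports = [
    asset
    for asset in downstream(LINEAGE, "sql://sales/SalesOrderHeader")
    if asset.startswith("powerbi://")
]
print(reports)
```

The same traversal run in reverse is what lets a report consumer check whether a number traces back to an authoritative source.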

- So this is great, it’s a lot more than just simple key pairs that you’re actually mapping out here. This is actually a pipeline view of the table through to the ETL steps, all the way down to the reports. Now these views would normally probably be drawn out manually with tools like Visio, for example, but you’re doing that automatically within Purview.

- That’s right. Getting an end-to-end view like this can be really empowering when working with data because it’s spanning data sources, operations on the data, as well as in this case, how that data is flowing into BI. Now there’s a few more things I’d like to show you. First the Contacts tab. Here we can identify experts and owners of the data asset. And then finally, here in the Related tab, we can quickly browse all the other tables related to this one. By the way, so far in this example we’ve been looking at structured data, but these same experiences work well for your unstructured and semi-structured data as well.

- Okay, so there was probably a lot going on under the covers there to bring these capabilities to life. Can you explain how everything works?

- So it all starts with the data. Organizations have many data assets, such as files, tables, BI reports, ML models, and many other things. And these things are often residing across on-prem, cloud and SaaS environments. You can connect your data systems to Azure Purview using an ever-growing set of included connectors. You can then get Purview to scan those data sources, to extract a wide range of metadata, things like technical metadata, lineage, classification and operational metadata, and doing it all without moving the data itself. And the cool thing is that the scans operate serverlessly so you only pay for what you use. Next, all the metadata found during scanning is then published to the Azure Purview data map. The map is an intelligent graph describing all the data across your data estate. Additionally, because the data map is exposed as Apache Atlas open APIs, you can programmatically push any metadata and lineage from any data system, and this is a great way to expand your data map. So once your data map is in place, everyone in your organization can go to the Azure Purview data catalog experience and easily search and browse for data. In addition, your chief data officers can get end-to-end insights across the data estate using the insights experiences, which are also provided.
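
As a rough sketch of what pushing lineage through Atlas-style APIs can look like, the snippet below builds a request body in the shape the Apache Atlas v2 REST API uses for bulk entity creation: a `Process` entity whose `inputs` and `outputs` reference datasets by `qualifiedName`. It only constructs and prints the JSON. The endpoint URL and authentication for a given Purview account are deliberately left out, the generic `DataSet` type stands in for whatever specific types a real catalog would use, and the job and dataset names are hypothetical.

```python
import json


def dataset_ref(type_name, qualified_name):
    """Reference an existing entity by its unique qualifiedName attribute."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def lineage_process(name, qualified_name, inputs, outputs):
    """Build an Atlas 'Process' entity linking input datasets to outputs."""
    return {
        "typeName": "Process",
        "attributes": {
            "name": name,
            "qualifiedName": qualified_name,
            "inputs": inputs,
            "outputs": outputs,
        },
    }


# Hypothetical custom job that is not covered by a built-in connector.
payload = {
    "entities": [
        lineage_process(
            name="nightly_sales_export",
            qualified_name="custom://jobs/nightly_sales_export",
            inputs=[dataset_ref("DataSet", "custom://crm/orders")],
            outputs=[dataset_ref("DataSet", "custom://lake/orders.parquet")],
        )
    ]
}

# This JSON string is what a client would send as the request body.
body = json.dumps(payload, indent=2)
print(body)
```

Modeling the job as a process with inputs and outputs, rather than as a direct dataset-to-dataset link, is what allows a lineage view to show the transformation step between the two datasets.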

- Good stuff, but I want to switch gears here and talk about setup. So what does it take then to connect everything up and get things like the classifications that we saw for sensitive information and everything kind of mapped up and working?

- We wanted to make this part of the process as easy as possible. So how about I just show you how Purview enables this in just a few clicks? As an administrator, you click into the Sources area of Azure Purview. Here, you can see all the data sources Purview can automatically scan, a Blob account, Power BI, Hive Metastore and a few others. Let me show you how to connect a new data source to Purview though. I’ll click here on the Register button. Here, you can see all the sources that are supported for automatic scanning. There’s a range of Azure sources. There’s even both on-prem and multi-cloud sources such as Amazon S3, SQL Server, SAP, and Teradata. And as I explained earlier, under the covers we’ve done all the work to deeply integrate with these data sources, so that as we extract the metadata it can securely flow to your Azure Purview instance. And for Azure sources, we made it really easy to register multiple sources in one step. As you can see here with the Azure Multiple Sources option, it’ll find and register all the data sources automatically from a management group or a subscription. Now it’s likely you have many data sources to connect. The nice thing is that Purview allows you to organize them into collections and then visualize that as a tree view. This also allows you to configure data scanning and classification settings at the root level of that collection, and those settings will then be automatically applied to everything underneath.
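
The collection behavior described above, where settings configured at the top of a tree apply to everything underneath, can be pictured with a small data structure. This is a conceptual sketch only and not Purview's implementation or API: the `Collection` class, the setting names, and the rule that a child may override an inherited value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collection:
    """A node in a hypothetical collection tree of registered data sources."""

    name: str
    parent: Optional["Collection"] = None
    # Only the values set directly at this level, not the inherited ones.
    settings: dict = field(default_factory=dict)

    def effective_settings(self):
        """Merge settings from the root down; nearer levels win on conflict."""
        inherited = self.parent.effective_settings() if self.parent else {}
        return {**inherited, **self.settings}


# Settings are defined once at the root of the tree...
root = Collection("Contoso", settings={"scan_frequency": "weekly", "rule_set": "default"})
# ...an intermediate collection overrides one of them (an assumed feature)...
finance = Collection("Finance", parent=root, settings={"scan_frequency": "daily"})
# ...and a leaf data source sets nothing and simply inherits.
sales_db = Collection("SalesSqlServer", parent=finance)

print(sales_db.effective_settings())
```

Resolving settings at read time means a change made at the root is picked up by every source beneath it without touching each one individually.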

- So once you’ve registered then all of your data sources, the scans are going to automatically extract all the metadata that was required to power the search that we saw earlier, and also the lineage experiences that you showed before?

- Right, and of course, the data’s going to continually change. So the scans need to run periodically to ensure you have accurate and up-to-date understanding of your data.

- Okay, so what are the options then if you want to start scanning your data?

- So there’s a number of options available when you’re configuring scanning. Here in the Management Center, we can see all the classifications Purview knows how to automatically detect. There’s currently support for over 100 sensitive information types. And as we saw earlier, these range from things like credit cards, account numbers, through to a wide range of types, such as government IDs, location data and more.

- Right, and these really look pretty similar to the sensitive information types that we see in things like Microsoft 365 Information Protection or Data Loss Prevention.

- That’s right, it’s the same classification taxonomy, but now available across your other data sources as well. You can even define your own custom classification rules. I’ll click into Transaction ID. Here you can define your own data pattern, like an item inventory number or a transaction ID or customer ID along with thresholds that you can set to reduce false positives. And once you’ve defined what you want to look for, you can configure scan rules. You can choose file types you want to scan, you can even define your own file types if you’ve got them. Next, we’ll set up the classification rules that we want to run. These are the same hundred plus classifications we saw earlier, just grouped into categories. And finally, you can determine whether you want the scan to run on a recurring schedule or just one time.
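
One way to think about a custom classification rule with a threshold is as a pattern plus a minimum share of values that must match before a column gets the label. The sketch below illustrates that idea under stated assumptions: the `classify_column` helper, the `TX-` identifier format, and the 60% default threshold are hypothetical and do not reflect how Purview evaluates rules internally.

```python
import re


def classify_column(values, pattern, min_match_ratio=0.6):
    """Return True when enough non-empty values fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    regex = re.compile(pattern)
    matches = sum(1 for v in non_empty if regex.fullmatch(v))
    return matches / len(non_empty) >= min_match_ratio


# Hypothetical transaction ID format: "TX-" followed by eight digits.
TRANSACTION_ID = r"TX-\d{8}"

column_a = ["TX-00012345", "TX-00012346", "n/a", "TX-00012347"]
column_b = ["TX-00012345", "hello", "world", "42"]

print(classify_column(column_a, TRANSACTION_ID))  # 3 of 4 values match
print(classify_column(column_b, TRANSACTION_ID))  # 1 of 4 values match
```

The threshold is what reduces false positives: a single value that happens to look like a transaction ID is not enough to label the whole column.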

- Okay, so it’s going to run the scan and when it’s finished, then you’ll be able to actually search that data catalog like we saw with your sales table example?

- That’s right, but in addition to searching for a particular dataset in the catalog, Purview also provides a bird’s eye view of your data landscape, and this is intended to help chief data officers quickly understand their data estate at large and gain key insights such as where sensitive data resides. Now, here in the Insights area of Purview is where we really get that bird’s eye view of the data estate. Here, in Asset Insights, you can quickly see where all your data resides across a range of data sources. Next, in Scan Insights, we can see the number of successful, failed, and canceled scans over time. Then over in the Glossary Insights section, we can quickly understand changes made to the glossary over time, and we can assess how much coverage your glossary has over your data map. And this is a great view for compliance teams. Here, thanks to all the classification work done earlier, you can quickly determine where different kinds of sensitive data exist across the data in your data map.
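
The classification view described above is, at its core, an aggregation over scan findings. As a small illustration of that roll-up, and not of Purview's actual data model, the sketch below counts how often each classification was found and which sources contain it. The `findings` list and the source names are made up.

```python
from collections import Counter

# Hypothetical scan findings: (data source, classification found on a column).
findings = [
    ("AzureSqlSales", "Credit Card Number"),
    ("AzureSqlSales", "Social Security Number"),
    ("BlobArchive", "Credit Card Number"),
    ("OnPremSqlHr", "Social Security Number"),
    ("OnPremSqlHr", "Social Security Number"),
]

# How many times was each kind of sensitive data detected?
by_classification = Counter(label for _, label in findings)

# Which data sources hold each kind of sensitive data?
sources_with = {}
for source, label in findings:
    sources_with.setdefault(label, set()).add(source)

for label, count in by_classification.most_common():
    print(label, count, sorted(sources_with[label]))
```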

- Good stuff, and having this level of knowledge about the data is something that hasn’t been easily attainable. But beyond what’s available now in Purview, is it possible then to enable things like proactive alerting?

- Yes, we’re fully integrated with Azure Monitor. So you can set up alerts and additional views to monitor the health of your service and your scans.

- Wow, this really changes the game, I think, for data insights and management. So what’s next on the horizon then for you and your team?

- We just launched Purview and of course, we’re eager to get your feedback as you try it out. But as I mentioned earlier, understanding your data is one of the most important steps in efficient data governance. So watch this space as we’re working hard to support more data sources and develop more capabilities in this area.

- Thanks Mike, really great intro, a lot more coming as well. So what do you recommend people do then if they want to get started using Azure Purview?

- Please go ahead and sign up for the preview. And you can do that at aka.ms/TryAzurePurview. And to learn more about Purview, check out all the online resources at aka.ms/AzurePurview.

- Of course, also keep checking back to Microsoft Mechanics for the latest updates, and subscribe if you haven’t already yet. Thanks for watching. We’ll see you next time.
