Introducing Microsoft Purview Data Security Investigations
Investigate data security, risk and leak cases faster by leveraging AI-driven insights with Microsoft Purview Data Security Investigations. This goes beyond the superficial metadata and activity-only signals found in incident management and SIEM tools, by analyzing the content itself within compromised files, emails, messages, and Microsoft Copilot interactions. Data Security Investigations allows you to pinpoint sensitive data and assess risks at a deeper level — quickly understanding the value of what’s been exposed.
Then by mapping connections between compromised data and activities, you can easily find the source of the security risk or exposure. And using real-time risk insights, you can also apply the right protections to minimize future vulnerabilities. Data Security Investigations is also integrated with Microsoft Defender incident management as part of your broader SOC toolset.
Nick Robinson, Microsoft Purview Principal Product Manager, joins Jeremy Chapman to share how to enhance your ability to safeguard critical information.
Find the source of a data leak fast.
Lock down files, restrict access, and prevent further leaks in real time. Get started with Microsoft Purview Data Security Investigations.
Go beyond keyword search.
Microsoft Purview Data Security Investigations uses semantic-based vector search, and understands file content, images, & context to classify sensitive data. See it here.
See the full picture.
AI maps access patterns, showing who or what device interacted with compromised data. Check it out.
Watch our video here.
QUICK LINKS:
00:00 — Microsoft Purview Data Security Investigations
01:00 — Risks of data theft & data leaks
03:20 — Start an investigation
04:45 — Results of an investigation
06:15 — Vector-based search & semantic indexing
08:00 — Use AI for the investigation
09:21 — Map activities
10:44 — Connect SOC & Data Security teams
11:21 — Known leaked information
12:26 — Steps to get DSI up and running
13:15 — Wrap up
Link References
Get started at https://aka.ms/DataSecurityInvestigations
Stay up-to-date with our blog at https://aka.ms/DSIBlog
Unfamiliar with Microsoft Mechanics?
As Microsoft’s official video series for IT, Microsoft Mechanics lets you watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.
• Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries
• Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog
• Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast
Keep getting this insider knowledge, join us on social:
• Follow us on Twitter: https://twitter.com/MSFTMechanics
• Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/
• Enjoy us on Instagram: https://www.instagram.com/msftmechanics/
• Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics
Video Transcript:
- If you’ve ever experienced a large data breach by external actors or maybe data theft by departing employees or risky insiders, the race is on to assess the contents and value of the data taken to understand the real risk to your organization. And today, we’ll show you how the new AI-powered Microsoft Purview Data Security Investigations dramatically speeds up the time it takes to analyze impacted information across files, email, and messages, which includes Copilot interactions with deep content analysis to unearth exact risks hidden in your data, like saved plain text credentials that could unlock access to valuable corporate assets. Along with the ability to understand activities around compromised data so that you can take further mitigation measures. The experience is integrated with Insider Risk Management cases in Microsoft Purview and with Microsoft Defender XDR incidents. And to unpack how this works, I’m joined today, by Nick Robinson, from the Microsoft Purview team, welcome!
- Thanks for having me on!
- Thanks so much for joining us today. You know, this is really an important topic, because data theft and the potential for data leaks is increasing right now. And while the tools, they exist to capture the security event, that’s only really half the story, because you don’t understand the value that’s actually stolen within that data.
- That’s really the challenge. For example, most SOC teams have incident management tools and SIEMs to see the activities involved with an incident, often with the accounts, devices, and a high-level view of the information or files impacted. But typically it’s just metadata-level descriptions, where you might see that an email inbox was compromised with a successful phishing attack, but often the best signal you have to assess the risk is the person’s username, their role, and their seniority. Likewise, as someone investigating an insider risk case, you would need to open individual files to understand the nature of the data risk. In both cases, there’s no easy way of knowing the high-value information like credentials, credit cards, trade secrets, PII, or other sensitive information contained in the data. And today, typically you’d be stitching together multiple different tools to understand the data risk in your data security incident.
- Right, and in many cases your CEO or your C-suite, may not always have the most sensitive information. Think of things like source code, or maybe customer lists. Often, it’s those individuals a few levels down that pose an even bigger risk.
- Exactly, it’s only what’s in those compromised emails, instant messages and files that can really tell you if the information is valuable to an attacker. When you multiply the number of inboxes and locations to assess, it’s impossible to manually sift through all that information and find what really matters in the larger context. That’s where generative AI with Microsoft Purview Data Security Investigations comes in. Behind the scenes, large language models and intelligent orchestration work together, not only to analyze metadata and filenames like you are used to with other solutions, but they go beyond that to reason over the text or image content contained within the files themselves, to then categorize that content by risk and severity. Even providing a graph of connected activities around detected vulnerable data. All this happens even if you haven’t configured a single policy in Microsoft Purview or are starting from scratch.
- And this really lets you see the real risk around your compromised data, then basically use the knowledge that you’re gathering to apply what you’ve learned to put the right protections in place, so that you can minimize future risk. So, how would you get started then, with an investigation?
- So, let me walk you through how we can start an investigation. I’ll start in Insider Risk Management in the Microsoft Purview portal. I’m investigating an insider risk case, where a user triggered our Data Leak policy. Around 53K emails and files were shared externally and I now need to analyze them to help quantify security and sensitive data risks. From this case, I can investigate data security with AI and create a Data Security Investigation. So, I’ll create the investigation now. And that’s going to take a few moments, but once it’s done, I’m able to see the preliminary insights for the data. This dashboard view helps me understand whether I have the right data in scope with top related users, types of items, and some insights into what the content is about. And with this informed perspective, it gives me enough context before I decide to commit to adding this content to the Investigation. The scope looks correct based on my understanding of the Insider Risk case, so I’ll go ahead and hit, Add to Scope. Once the data is added to the investigation, I have one more opportunity for validation before I prepare the content for analysis with AI. I can filter, review items and choose content that is not relevant to be excluded. Once I’m sure I have the right data set, I can select Prepare Data to confirm and prepare the data for AI analysis. We’ve designed this to work so that you only pay for the compute resources that you use.
- Right, and even though this process can take a few hours, it’s still a lot less work and much faster than the equivalent manual effort it would be to analyze that large amount of data.
- Yes, it’s processing all of these files at machine speed, then summarizing what it finds and formatting that into an easy-to-understand view. The time depends on the amount of data it needs to analyze. And once everything is finished, we can dig deeper into the results of the investigation. We can see the data is now categorized and assessed for risk severity automatically using AI. These are all optional filters that I can use here. But before I do that, check out the categories visual on top. Dark red is the highest severity, then orange for medium and yellow for low severity. Even though the count of this first category for Credentials is relatively low, this is the highest severity data category. It’s pretty common for people to send usernames and passwords over email, and there’s a lot of them here, 846. Of that total, it’s found a few hundred user credentials and hundreds of API tokens. Even MFA isn’t necessarily safe in this case, because we see 189 backup codes, which might be used if you don’t have access to your phone at the time of authentication. Then, if I hover over Intellectual Property, you’ll see it accounts for more than 18,000 items, where there are thousands of patents, design docs, and even source code. And it’s not only the highest severity categories that are important to pay attention to, even the lower severity categories like financial and customer information, are important to know more about.
- So, what does the AI do in this case then, to kind of understand all of that vast amount of data?
- So, there’s a lot happening once we narrow down the dataset and start the investigation. This isn’t your typical keyword search. The process uses advanced orchestration and vector-based search. First, the intelligent orchestration engine works through a sequence of tasks. It takes all of the files and breaks them down into smaller parts. For example, for a larger file, it might break down each sentence as an individual document. For each of those smaller parts, the process calculates what’s called an embedding, which itself contains thousands of dimension values to describe it. The information is used to build a massive semantic search index, which can be queried. And by the way, what’s powerful here is that vector-based semantic search enables similarity-based information retrieval. For example there may be multiple different ways to refer to credentials with varied terms like username and password, login details, or API Keys, with even more variations across spoken languages. And so this makes it possible to retrieve the right information based on similarity of meaning. Next in the process, the intelligent orchestration uses a series of automated prompts to interrogate the data. Each prompt is also converted into one or more embeddings with dimensions, which are used to find near matches based on the similarity of information found in the semantic index. It then outputs the results and each prompt along with retrieved information is presented to the LLM for detailed analysis. In the final step, the orchestrator takes the generated outputs and insights and formats them into an interactive dashboard.
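To make the retrieval step described above more concrete, here is a minimal Python sketch of the chunk, embed, index, and query sequence. It is not how Data Security Investigations is implemented. The function names (`chunk`, `embed`, `build_index`, `search`), the sample documents, and the 1,024-dimension size are all assumptions made for illustration. In particular, `embed` is a toy stand-in that hashes word counts into a vector, so it only matches shared words. A real embedding model is what would let "login details" land near "username and password".

```python
import math
import re
import zlib

DIM = 1024  # assumed toy dimension; real embedding models differ


def chunk(text):
    """Split a document into sentence-sized parts."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Toy stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents):
    """Return a list of (chunk_text, embedding) pairs."""
    return [(part, embed(part)) for doc in documents for part in chunk(doc)]


def search(index, prompt, top_k=2):
    """Rank chunks by cosine similarity to the prompt's embedding."""
    query = embed(prompt)
    scored = [(sum(a * b for a, b in zip(query, vec)), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical sample content, not taken from the demo.
documents = [
    "The quarterly report is attached. Please review it by Friday.",
    "Here is the database password for the staging server. Do not share it.",
]
index = build_index(documents)
results = search(index, "database password")
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In a fuller pipeline, the top-ranked chunks would be passed to an LLM together with the prompt that retrieved them, which is the analysis step the transcript describes next.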
- So, vector search and semantic indexing are really the key here to establishing a deep understanding of that data with all of its dimensions, so that the AI can surface up those hidden insights. So where do we go from there?
- Yes, this is a significant data leak. And I can narrow things down even more. Here, I’ve filtered the category down to the one credential type of user credentials. Below, you can also see that there are files and emails in multiple languages. Now, as an investigator, you might not speak all of these languages. And that’s another benefit of using the AI for the investigation. To continue, first I’m going to select all of these items, then I’m going to run Examine. This will summarize and generate a report for the 241 files and emails discovered with credentials. And now I’m presented with suggestions to guide the analysis, where I can focus in on Security Risk or Sensitive Data. I’ll stick with the Security Risk recommendation, and run it. That will take a few minutes to run, and again once it’s finished, I can see a summary of high severity credential risks. It describes the type of information it searched and where it went, along with the different languages. It identified 99 active credentials with high confidence. And using the 241 documents, it was even able to highlight the goal here for the credential leak was, “To provide access to databases and trade secrets related to Project Obsidian.” I have the option from here to view all of these credentials and work with my identity team to take action, but I’ll show you the options I have directly from here in a moment.
- Okay, so now we know the extent of the data leak, along with the type of information that could be used against us, credentials in our case, in order to gain access to some high value data.
- Right, and this goes beyond just the content and the files. In many cases, the activities related to that content are just as important to assess and understand. From here, I’ll click into Insights map, and I get a graph visual showing the high severity files marked with the red icon, along with the users and IP addresses who interacted with those files. Based on their IP location, the AI was also able to determine that both these users, who were compromised, are showing impossible travel. That could mean, for example, that the same person appeared in two locations halfway around the world from each other on the same day, which is impossible. Now, as mentioned, I have additional options to mitigate this, so I’ll select Add to mitigation. Here, I can see the scope with 241 items containing the 99 credentials is recommended. And here are the file details for each of them. I’ll select all. Then by assigning these files, I can select an owner for the action to mitigate the risks identified with this investigation to resolve it. So, now our assigned admin can lock down these two individual accounts, along with the 99 shared credential accounts. And that was one type of mitigation, we’ll be adding more, such as file purge or secure export, for sharing evidence with external agencies. And these capabilities are coming soon.
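The impossible travel signal mentioned above can be illustrated with a short Python sketch. This is a common general approach, not a description of how Microsoft Purview computes it. The approach takes two sign-in events, computes the great-circle distance between their approximate locations, and flags the pair if the implied speed is faster than a person could plausibly travel. The 1,000 km/h threshold, the `(timestamp, latitude, longitude)` event shape, and the sample sign-ins are all assumptions for illustration.

```python
import math
from datetime import datetime

MAX_PLAUSIBLE_KMH = 1000.0  # assumed threshold, roughly airliner speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def is_impossible_travel(event_a, event_b, max_kmh=MAX_PLAUSIBLE_KMH):
    """Each event is a (timestamp, latitude, longitude) tuple."""
    (t1, lat1, lon1), (t2, lat2, lon2) = event_a, event_b
    hours = abs((t2 - t1).total_seconds()) / 3600.0
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        # Simultaneous sign-ins from different places cannot be one traveler.
        return distance > 0
    return distance / hours > max_kmh


# Hypothetical sign-ins two hours apart, on opposite sides of the Pacific.
seattle = (datetime(2025, 1, 1, 9, 0), 47.61, -122.33)
singapore = (datetime(2025, 1, 1, 11, 0), 1.35, 103.82)
flagged = is_impossible_travel(seattle, singapore)
print("Impossible travel:", flagged)
```

IP-based geolocation is approximate, and VPNs or cloud proxies can produce false positives, so a check like this is best treated as one signal to investigate rather than proof of compromise.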
- And if you hadn’t looked at the activities around those files, those users probably would have just continued sharing sensitive information. So, that activity layer is super important. In this case though, you started with an insider risk management case that basically flagged an alert.
- That’s right, we started the last investigation in Microsoft Purview. But, equally, if you’re using Microsoft Defender XDR for incident management, you’ll be able to initiate an investigation directly from a security incident, to investigate impacted data. This integration helps connect your SOC and Data Security teams together, as they work to resolve data security incidents.
- I can see both these being really common entry points. But I’m wondering, what about cases where maybe somebody has leaked sensitive information that’s out in the public, or maybe a competitor has it? Can you use this to find that too?
- That’s actually more common than you’d think. In those cases, if you know the file name or nature of the leaked information, you might want to find out which people accessed the leaked information to get ahead of the risk. For something like that, you’d start in the investigations experience and create a new one. Once you enter a few basic details, then you define your search. In the Search by field, I have a few options for what to search for. We know that anonymized details for our code name, Project Falcon, were leaked and only around half a dozen people were privy to this project. So, we can use keyword search, and I’ll just use the code name itself. Since I know this will be a small dataset, I can keep all tenant sources as the default and the date range open-ended. And when I hit Search, it will show me all of the files and messages matching my keywords. You’ll see that there aren’t as many files in this case to analyze. And from here, I can take the next steps to perform the AI analysis for the data security investigation like we saw before, to find the source of the leak.
- Got it, so what do I need to do to get DSI up and running in my organization?
- If you’re already using Microsoft Purview today, it just takes two simple steps and a couple of minutes to get this up and running. First, in Purview settings, you’ll need to assign the right people to the Data Security Investigations roles for Admins, Investigators, and Reviewers. Second, you’ll need to set up the meters for storage, as well as compute consumption, as a pay-as-you-go service. And that’s it.
- So, is everything you’ve shown today, available for use right now?
- It’s rolling out into Public Preview now. Data Security Investigations is pay-as-you-go based on the size and complexity of the data used in your investigation. Now, while we’ve shown Microsoft 365 information types today, spanning email, Teams messages, Copilot prompts and responses, and SharePoint, we’re also looking to bring in non-Microsoft 365 sources to provide a unified view.
- So, for anyone who’s watching right now, where do you recommend that they go to find out more and also get started?
- It’s pretty easy. You can get started and learn more at aka.ms/DataSecurityInvestigations. And our blog can keep you up to date with what’s coming at aka.ms/DSIBlog.
- Great tips, thanks so much for joining us today, Nick, and for sharing what Microsoft Purview Data Security Investigations can do. Be sure to keep checking back to, “Microsoft Mechanics,” for the latest tech updates. Subscribe to our channel if you haven’t already, and we’ll see you again soon.