Build an intelligent system for data ingest and transformation

Implement Azure Database for PostgreSQL — Hyperscale & Azure Kubernetes Service

Global investments company, BNY Mellon shows how they leverage Azure Database for PostgreSQL — Hyperscale (Citus), and Azure Kubernetes Service (AKS) to design an intelligent system for data ingest and transformation.

Build an intelligent system for data ingest and transformation

Keith Pijanowski, Senior Principal Consultant at BNY Mellon, joins host Jeremy Chapman to share the team’s implementation story and how they unified their data in Azure, established an integrated data and analytics platform, and built new SaaS applications that could leverage the unique IP in their data quickly and at scale.

If you’re new to BNY Mellon, it’s the world’s largest custodian bank and asset servicing company. It helps many of the world’s largest and most sophisticated asset managers and asset owners manage and service their financial investments and reach data-driven decisions.


01:39 — Take a look at the app

02:48 — How they built it

03:39 — Intelligent system for data ingest and transformation

05:50 — How Azure database for PostgreSQL — Hyperscale helps

06:19 — How to process queries quickly

09:23 — Serving the data from PostgreSQL into the app

11:02 — What’s next for BNY Mellon?

Link References:

Learn more about BNY Mellon’s solutions check out

For more on Azure Database for PostgreSQL — Hyperscale (Citus) and Azure Kubernetes Service, visit

Unfamiliar with Microsoft Mechanics?

We are Microsoft’s official video series for IT. You can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

Video Transcript:

- Coming up on today’s episode of How We Built It, we meet global investments company, BNY Mellon, who are leveraging Azure database for PostgreSQL Hyperscale Citus, and Azure Kubernetes Service as the foundation for a massive operational data store and app factory that delivers new investment services, harnessing their unique IP and predictive insights fast and at scale. So today I’m joined by Keith Pijanowski from BNY Mellon. Welcome to Microsoft Mechanics.

- Hi Jeremy, thanks for having me on the show.

- And thanks for joining us from home today. So before we get into this, just for those of you who are new to BNY Mellon, it’s the world’s largest custodian bank and asset servicing company. And it really helps many of the world’s largest and most sophisticated asset managers and asset owners manage and service their financial investments and reach data-driven decisions. And beyond the trillions of dollars in assets under its management and custody, it also holds petabytes of data that they can anonymize, aggregate and analyze to offer insights to clients, to help them make better decisions. So Keith, you and the team have already started to build out solutions and unlock and harness the value of your data with a lot of rich capability. But where did you and the team start?

- So my team’s work is part of a broader effort that started about a year ago to unify our data in Azure and to establish an integrated data and analytics platform. Around that time, we started a parallel effort to look at what we could do to build new SaaS applications quickly and at scale that could leverage the unique IP in our data in a way that met the needs of our customers. To move fast, we took a hybrid multi-cloud approach to achieve what would have taken potentially years, with lots of internal red tape. And we delivered our first production business app in just nine months.

- That’s fast. So can we take a look at the app?

- Sure. This is our distribution analytics solution that we launched back in June. It’s currently targeted at asset managers interested in understanding the dynamics of the U.S. investment product market. I’ll start in the market profile. I can see historical trends by territory, which allows me to see trends for the last 13 months for sales, redemptions, and net flows. Now beyond looking back at trends, I can also look forward to see forecasts based on our predictive models. I’ll click into sales momentum trends, which shows me investment styles. I can see predictive forecasts by geography, asset class, sector, and more. So as you saw, the buying and selling behavior of products is analyzed down to the zip code. And behind the scenes, we are literally talking terabytes of data from dozens of regularly updated internal and external data sources that feed the application.

- Right. And there are a number of impressive things going on here. So first as you say, the amount of data behind the scenes, and second is how instant it is to filter the data given the complex calculations that are taking place, really with any combination of views that are available to the asset managers. But since this is Mechanics, why don’t we dig in to see how you’re able to build the service out so quickly?

- So the key for us was to build on our strengths. We leveraged our existing skill sets for data services like Postgres, as well as DevOps in order to design the data and application platform that we needed. At the same time, we took our existing on-prem and multi-cloud investments and extended from there.

- Right. But still it’s pretty much, you started out this particular initiative from scratch though. Right?

- We did. And Azure was fundamental here. It supports a unified and hybrid approach for our data sources on-prem, our front-end app on another cloud, and our cloud-hosted back end. Building out this back end is where my team focused. We use Postgres Hyperscale as well as Azure Kubernetes Services for container-based compute.

- Got it. And as any data architect would tell you, an app is only as good as the data that feeds into it.

- And that’s really our core differentiator here. We are pulling in terabytes of data, including structured, semi-structured, and unstructured data from BNY Mellon sources sitting on prem, which spans several years, along with data from dozens of third-party SaaS services. Much of this data is highly confidential. And so we have to first anonymize it before running it through our data pipeline and Azure. This comprises all kinds of checks and balances. And we are maniacal about getting the data correct, and meeting our regulatory requirements. After the data is scrubbed, we pushed the data into Azure storage. And from there, it goes into our extract service where we queue it before running it through the rest of our data pipeline, which take advantage of scalable container-based microservices, running in AKS.

- Right, and these services then seem beyond your standard kind of ETL pipeline. So what’s the premise then here with your process?

- It’s not ETL in the normal sense. We apply specific logic as we reshape and load the data. Also, we wanted this architecture to be the foundation for an application factory for the rapid building of new services. Our use of containers is instrumental to our DevOps processes and our ability to bring new products to market quickly.

- Okay. So how does everything work?

- The slice and load services are managed by AKS for horizontal scaling. The slice service ensures the quality of our data. This spins up approximately 40 containers per job, around 1500 rows of financial data per second are sliced and organized into what we call fact and dimension tables. By the time the data hits our load service, which is running on around 160 containers, it has been parsed, cleaned up, and organized into these fact and dimension tables. The load service adds new data to our database. However, the load service ensures that there is an immutable record of the data and that it is properly versioned. It checks if the data already exists and only writes new data into Postgres Hyperscale. In other words, we never update or delete a row, that way the bank can replay market conditions and retrace the source of any given market prediction on any given date.

- Okay. So you’ve designed then a pretty intricate and intelligent system then for data ingest and transformation, but how does Azure database for PostgreSQL Hyperscale then help out here?

- It gives us a fully-managed, highly available and flexible database for our data. We can work with structured and unstructured data types thanks to Postgres’s support for the JSONB data-type. Also, you need to Hyperscale as it’s clustering capability, which means that we can automatically scale out multiple instances of the data, optimize read and write operations using parallel processing.

- Okay. So, but going back to the app, there were lots of calculations that were run behind the scenes. So what were you doing then to process those queries so quickly?

- We use the multi-tiered approach to speed up data ingestion and querying with Postgres Hyperscale. First, let me explain the architecture. Hyperscale uses a coordinator node to process queries and worker nodes that hold distributed data for parallel processing. These each had dedicated compute. To distribute the data we use a distribution key. In our case, we chose a unique security identifier, for financial securities to allocate data across our worker nodes. Then we defined the number of shards. We decided to assign one shard per core. Once all this is in place, the system, under the covers, determines how best to distribute our data. Finally, in our implementation, we partitioned the data within the shards and worker nodes based on how the data is most often queried. In our case, by using time ranges. This essentially creates an index into our distributed data. So now when a query comes in, it lands with the coordinator node and Hyperscale knows exactly where to look for the data to return the results quickly.

- Okay. So how much of a difference then does it make for query performance?

- Well, let me show you. I’ll show you the same query running in two different configurations of Postgres. Postgres single server and Postgres Hyperscale. Both configurations are running in Azure. Here is single server Postgres. You can see it’s got 64 virtual CPUs and 432 gigabytes of RAM. Next here’s our Hyperscale configuration. You can see the coordinator node with 32 virtual cores and six worker nodes with 16 virtual cores each.

- Right. So if you’re keeping score at home, that’s 96 virtual cores on the Hyperscale side in the worker nodes versus 64 cores on the single server. And based on that compute, you might expect Hyperscale to be 50% or more faster, right?

- That’s correct. But let’s see what happens. Our query is pulling subscription and redemption information from our two terabyte database, and it’s doing a lot of aggregating and field calculations as well. And we’re querying by month. So I’ll kick this off on single server and let it run. Now we’ll switch over to the Hyperscale and run the same query. You’ll see that this query processes very quickly. And if you take a look at the lower right corner, it completes in just a little over nine seconds, let’s see how that compares with our other run. I’ll go back to the single server query. You’ll see it’s still processing. Now we’ve recorded this in advance. So we fast forward in time. You’ll see the single server query finishes in three minutes, 37 seconds. We’ve literally twisted every lever we have in Hyperscale. And as you saw with these optimizations, the query ran 24 times faster. And one more thing to note here is that we ran a complex query over a database with large amounts of data. And even though Hyperscale outperformed the single server instance of Postgres, response time from single server was still decent considering what this query does. And so future apps built by my team will use single server Postgres where it makes sense, because of its performance relative to cost.

- Makes sense. So it’s really impressive though, to see those speedups, but we also saw the predictive capabilities that you had in the distribution analytics app. So how are you serving the data from PostgreSQL into the app?

- Well, we want the app to be fast and snappy. So application services never query the database directly. We have a common query service, which pre fetches data needed by our app and models. And there are two application services that consume the data from the query service. First, the analytics service preps the data for presentation and the user interface in our distribution analytics app. Second, our prediction service runs multiple machine learning models and proprietary algorithms to deliver predictions to the app. This, as you saw, is what enables us to identify market dynamics and forecast investment styles across financial products. Let me show you how our training jobs and prediction jobs are run in parallel dynamically. Here is the UI for Desk, a popular open source app for parallel processing that we are running on AKS. At the bottom, we have five containers running multiple predictive models in parallel. On the top left, you can see the amount of data we are processing and on the top right you see the completion status of these jobs. Because we are using containers we can easily scale out compute if we have a large job. And we can analyze larger amounts of data and refine our models using more recent data in minutes versus days. This opens up the potential for us to now move from monthly to daily predictions. In other words, intra date analysis for even faster time to insights.

- And not only that, but you also mentioned then, with your operational data store and microservices approach, you’ve also built the foundation for an application factory. So what’s next then for BNY Mellon?

- So as part of our DevOps approach, we’ve created templates which give our dev teams a blueprint for building new services that run on the infrastructure we’ve configured in Azure. This is important as it allows us to take our vast data, something which is a unique advantage, and transform it into market differentiating applications. For example, we recently launched a new environmental, social and governance application, ESG for short, where we aggregate user preferences and generate crowdsourced data. This allows customers to see trends in peer group behavior toward ESG factors. You can then pick which ESG factors you care about the most with respect to your investing goals. And beyond the two applications I’ve talked about today, we are also working on many more apps that allow our customers to get the benefits of our data.

- Great stuff. Thanks so much for joining us today Keith, and really sharing your implementation story today on How We Built It. It’s really fascinating stuff. So by the way, if you want to learn more about BNY Mellon’s solutions, checkout And for more on Azure database for PostgreSQL Hyperscale and Azure Kubernete Service, be sure to check out And don’t forget to subscribe to Microsoft Mechanics if you haven’t already to get the latest tech news across Microsoft. Thanks for watching. We’ll see you next time.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store