We live in a world where businesses across industries manage enormous amounts of information. Every transaction, interaction, and data point is recorded. While that gives businesses an opportunity to make more informed, data-driven decisions in near real time, it also increases processing and data management challenges. Despite those challenges, organizations continue to grow the rate and breadth of data collection, even as distilling insights from that data becomes harder with traditional data tools.
Businesses need to leverage this huge volume of data to gain a better understanding of customer behaviors, attitudes, needs, and pain points, which is essential to their success. This influx of big data can quickly become overwhelming, especially if the data is not organized and managed in the right way.
Big Data Challenges
Companies today are sitting on massive repositories of data, and it has always been challenging to leverage these huge datasets for quick decision making. Executives, managers, and analysts want answers to their data questions as quickly as possible, and fortunately, modern BI tools can help with this problem.
BI tools today are extremely powerful: they can handle datasets with hundreds of millions of rows and deliver insights in a matter of seconds. Business users are ecstatic about the amount of information they have at their fingertips and the speed with which they can leverage it to make important decisions. As data professionals, it is very satisfying to see our end users happy and to play a key role in enabling them to perform at their highest capacity. But this success often leads to the business asking for more. After a successful first launch, business users begin asking to integrate other relevant datasets into the current data model, add KPIs, and build more functionality into the report. At this point, we can run into challenges.
As data professionals, we have a duty to make sure business users can access all the information they need to answer their questions. This can confound even the most skilled professionals, because it is hard to foresee the complexities and limitations of even the most modern BI tools. If you have any experience designing reports, you know the first warning sign: one of the visuals on the report takes unusually long to return query results. Many factors govern report performance, but the common denominator is a well-designed, scalable data model.
Approach and Discovery
We will discuss and address some of these challenges in a three-part article series. In the series, we dive deep into how the Verstand team tackled large-dataset problems for one of its clients, a multi-billion-dollar retail giant, and discuss how we analyzed billions of rows of data effectively using advanced incremental refresh in Power BI.
This first article explains the use case and the problem we needed to solve; the second article explains the steps to build out the incremental refresh; and the third article discusses loading data into monthly partitions and configuring incremental refresh policies on multiple tables within the same model using external tools such as ALM Toolkit and SQL Server Management Studio (SSMS).
Phase 1: Discovery
The first phase of our example use case was all about learning the business process and understanding the Key Performance Indicators (KPIs). It was not surprising to learn that some of the stakeholders' requirements were complex: they included pulling two and a half years of sales history, over one billion rows of data, and merging multiple fact and dimension tables.
As discussed above, a poorly designed model causes a lot of headaches in the long run, especially if it is not built to scale. So, in this case, the first step in designing the data model was to choose the right storage mode in Power BI, which offers Import, Direct Query, and Dual. It is important to choose the mode that fits the use case and supports the future scalability of the model.
The next step was to get a glimpse of the underlying dataset and understand its elements in their entirety before designing the model. We connected via the Direct Query storage mode and created a simple table visualization with no DAX calculations. The visual took 30 seconds to render, a clear indicator that this storage mode would not support the business's needs.
Phase 2: Analysis & Solution
The next step was to evaluate all the tables and fields within the model and understand each table's metadata: number of columns and rows, cardinality, table size, data dictionary, and data types. This practice has generally helped us design efficient data models. We used the external tool DAX Studio and its VertiPaq Analyzer feature to understand the metadata thoroughly. The VertiPaq Analyzer results showed that the fact tables not only held a huge amount of historical data, but also contained columns with very high cardinality.
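As a lightweight complement to the DAX Studio and VertiPaq Analyzer review, a quick column profile in Power Query can surface high-cardinality columns before anything is imported. The following M sketch is illustrative only; the server, database, table, and column names are placeholders, not the client's actual schema.

```
// Hypothetical sketch: profile a sample of a fact table to spot high-cardinality
// columns before import. All source names below are placeholders.
let
    Source = Sql.Database("placeholder-server", "placeholder-db"),
    Sales  = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // Profile a manageable slice so the profiling step itself stays fast
    Sample = Table.FirstN(Sales, 1000000),

    // DistinctCount per column approximates the cardinality VertiPaq will see
    Profile = Table.Profile(Sample),
    HighCardinalityFirst = Table.Sort(
        Table.SelectColumns(Profile, {"Column", "Count", "DistinctCount"}),
        {{"DistinctCount", Order.Descending}}
    )
in
    HighCardinalityFirst
```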
This discovery not only helped us understand the metadata of the dataset, but also enabled us to make important decisions, such as choosing the Import storage mode for this use case.
After connecting to the tables using the Import storage mode, we identified the culprit fields, the high-cardinality columns flagged by VertiPaq Analyzer, and removed them from each table to reduce complexity and improve the efficiency of the report and the underlying data model.
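In Power Query, dropping such columns before the load is a one-step transformation. A minimal M sketch is shown below; the column names are hypothetical examples of the kind of fields that bloat a model (GUIDs, free text, fine-grained timestamps), not the actual fields removed in this project.

```
// Hypothetical sketch: remove unused, high-cardinality columns before import.
// Placeholder source and column names.
let
    Source = Sql.Database("placeholder-server", "placeholder-db"),
    Sales  = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // Removing these columns shrinks the VertiPaq dictionaries and overall model size
    Trimmed = Table.RemoveColumns(
        Sales,
        {"TransactionGuid", "OrderNotes", "RowInsertedTimestamp"},
        MissingField.Ignore
    )
in
    Trimmed
```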
Since the Direct Query storage mode did not help with report performance, we decided to go ahead with Import mode, which offered a huge upside in report performance. The other reasons to choose Import mode were that the business data changed only once per day, and that we had more flexibility over transformations in the Power BI Power Query Editor and in using M query features. It also removed the one-million-row limit on query results, a known limitation of Direct Query.
However, Import storage mode by itself was not the solution: the local Power BI file would keep growing, and it is impractical to load a billion rows of data into a local file within the Power BI Desktop environment. So the solution was Import storage mode with an incremental refresh policy defined and enabled.
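Incremental refresh in Power BI relies on two DateTime parameters, RangeStart and RangeEnd, which Power BI substitutes for each partition it refreshes. The sketch below shows the standard filter pattern on a fact table; the source, table, and column names are placeholders, and the actual policy configuration is covered in the later articles.

```
// Hypothetical sketch of the standard incremental refresh filter.
// RangeStart and RangeEnd must exist as DateTime parameters in Power Query.
// Placeholder source and column names.
let
    Source = Sql.Database("placeholder-server", "placeholder-db"),
    Sales  = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // Filter on a DateTime column; use >= and < so rows are neither duplicated
    // nor skipped at partition boundaries.
    Filtered = Table.SelectRows(
        Sales,
        each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd
    )
in
    Filtered
```

With this filter in place, the incremental refresh policy defined on the table controls how much history is stored and which recent partitions are refreshed on each run.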
In subsequent articles, we will discuss in more detail how we designed the incremental refresh policy and the challenges we faced during the implementation.