
Incremental Refresh Design for Big Datasets – Part 2

In the previous article, we discussed the thought process behind choosing the right data model design based on the use case and requirements we received from the client. We discussed the idea behind using Import storage mode and the reasons to implement incremental refresh on the dataset to ensure that the reporting solution we wanted to build would be robust and scalable.

In this Part 2 article, we will discuss the steps we took to build out the incremental refresh solution and the challenges that we faced along the way. 

Configuration and Design 

The first step was to convert all the tables connected via DirectQuery to Import storage mode and then identify the appropriate fact and dimension tables on which we would define the incremental refresh policies.

Based on the dataset size and the frequency of new and updated data, we decided to configure incremental refresh on all the large tables in the model that contain date fields.

We then configured the ‘RangeStart’ and ‘RangeEnd’ parameters to pull only the most recent month of data from the data source into Power BI Desktop. While configuring the parameters, be mindful that the names are case sensitive and that you cannot use alternate names such as RangeStart1 and RangeEnd1; with any other name, the incremental refresh policy will not validate when you try to define it on the table.

Note: We imported only a month of data because it is not good practice to import a billion rows into Power BI Desktop every time the model is refreshed. Importing that much data drives up memory and CPU consumption, and in most cases your local machine will not be able to handle the dataset size.
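To make this concrete, here is a minimal sketch of what the filter step looks like in Power Query (M). The server, database, table, and column names below are placeholders, and RangeStart and RangeEnd are assumed to already be defined as Date/Time parameters via Manage Parameters:

let
    // Placeholder source: swap in your own server and database names
    Source = Sql.Database("sql-server-name", "SalesDB"),
    SalesTransactions = Source{[Schema = "dbo", Item = "SalesTransactions"]}[Data],
    // Keep only rows inside the refresh window; the service substitutes
    // its own RangeStart and RangeEnd values for each partition at refresh time
    FilteredRows = Table.SelectRows(
        SalesTransactions,
        each [TransactionDate] >= RangeStart and [TransactionDate] < RangeEnd
    )
in
    FilteredRows

Note that one boundary is inclusive (>=) and the other exclusive (<); using the same comparison on both ends can duplicate or drop rows that fall exactly on a partition boundary.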

The Challenge with the Time-Out Issue

As stated in the first article, the business requirement was to pull two and a half years of sales transaction data. After talking to the business, we determined that the model should incrementally refresh only the last three days of data each time the dataset is refreshed in the Power BI service. This window was chosen by looking at how the data changes: in this case, rows are subject to change for up to three days after they arrive.

Once the incremental refresh policies were defined, the data model was successfully published to the Power BI service. When a data model is first published, the service creates a single partition on each table. However, since we had an incremental refresh policy defined for our model, we needed to perform the first refresh operation for the Power BI service to create all 30 monthly partitions.

Note: If you have more than 10 million rows in your fact table, the first refresh operation tends to take a long time, as the Power BI service attempts to pull all the rows defined within the policy.

In our case, however, we had over one billion rows of transactional data in the fact table, so it was not surprising to see the refresh run for more than four hours without any sign of completing. After five hours, we unfortunately received a ‘Timeout’ error message because of the sheer dataset size.

This meant that the first refresh could not complete without restricting the number of rows imported, which was not an option since we needed two and a half years of historical data in the data model.

In the section below, we will discuss the workaround we found for the timeout issue and the steps we took to resolve it.

The Resolution 

The first step in resolving the problem was to publish the data model with no rows loaded, that is, with every row filtered out of the model.

The purpose of doing this was to make sure the first dataset refresh operation did not fail in the Power BI service and to ensure all the monthly partitions were created based on the incremental refresh policy defined in Power BI Desktop.
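The exact filter used is a matter of preference. As one minimal sketch (again with placeholder names), you can append a temporary step to the fact table query whose predicate is always false, so the query still validates but returns zero rows:

let
    Source = Sql.Database("sql-server-name", "SalesDB"),
    SalesTransactions = Source{[Schema = "dbo", Item = "SalesTransactions"]}[Data],
    FilteredRows = Table.SelectRows(
        SalesTransactions,
        each [TransactionDate] >= RangeStart and [TransactionDate] < RangeEnd
    ),
    // Temporary step: 'each false' filters out every row, so publishing and
    // the first service refresh only create empty partitions
    NoRows = Table.SelectRows(FilteredRows, each false)
in
    NoRows

Because no rows qualify, the first refresh only has to create the partition metadata, which is why it completes in minutes rather than hours.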


As expected, the refresh did not take long: it took only about five minutes to complete the first data refresh operation. Although the refresh was quick and all the monthly partitions were created, no data was loaded into those partitions.

Note: To confirm that the monthly partitions were created, you can download SQL Server Management Studio (SSMS) from the official Microsoft documentation site and follow the steps below.

https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16

Once you open SSMS, you will see an option at the top left of the window to connect to the model. You can follow these steps:

Connect → Analysis Services → enter the dataset URL (the workspace connection string, typically of the form powerbi://api.powerbi.com/v1.0/myorg/<workspace name>) in the Server name box and your Power BI login username in the User name box, then click Connect.

You can then expand the Databases section on the left side under Object Explorer, where you will see a list of all the datasets published to that workspace.

Click and expand the dataset that you published, and you will find three sub-sections: Connections, Tables, and Roles. Expand Tables and locate the table on which you defined the incremental refresh policy. Right-click that table and select Partitions, which opens a window confirming that the partitions have been created.

To summarize Part 2, we discussed the steps we took to build out incremental refresh on the data model and expanded on the configuration of the date parameters for the incremental refresh policy.

We also discussed how we overcame the time-out issue caused by the huge dataset size and got the Power BI service to create the historical partitions based on the incremental refresh policy defined in the Power BI Desktop file.

In the next article, we will discuss the methodology to enable incremental refresh on multiple fact tables and explain the end-to-end process of loading data into the table partitions.  
