First look under the hood — Formula Engine and Storage Engine
First, I want you to meet the VertiPaq engine, the "brain & muscles" of the system behind not only Power BI, but also Analysis Services Tabular and Excel Power Pivot. Truth be told, VertiPaq represents just one part of the storage engine within the Tabular model, alongside DirectQuery, which we'll discuss separately in one of the following articles.
Whenever you send a query to get data for your Power BI report, here's what happens:
- Formula Engine accepts the request, processes it, generates the query plan, and finally executes it
- Storage Engine pulls the data out of the Tabular model to satisfy the request issued within the query generated by the Formula Engine
The Storage Engine works in two different ways to retrieve the requested data: VertiPaq keeps a snapshot of the data in memory. This snapshot can be refreshed from time to time from the original data source.
DirectQuery, on the contrary, doesn't store any data. It just forwards the query straight to the data source for every single request.
Data in the Tabular model is usually stored either as an in-memory snapshot (VertiPaq) or in DirectQuery mode. However, there is also the possibility of implementing a hybrid Composite model, which relies on both architectures in parallel.
Formula Engine — “Brain” of Power BI
As I already stressed, the Formula Engine accepts the query, and since it is able to "understand" DAX (and MDX as well, but that is out of the scope of this series), it "translates" DAX into a specific query plan, consisting of the physical operations that need to be executed to get the results back.
Those physical operations can be joins between multiple tables, filtering, or aggregations. It's important to know that the Formula Engine works in a single-threaded way, which means that requests to the Storage Engine are always sent sequentially.
Storage Engine — “Muscles” of Power BI
Once the Formula Engine has generated the query plan, the Storage Engine comes onto the scene. It physically goes through the data stored within the Tabular model (VertiPaq) or goes directly to a different data source (SQL Server, for example, if the DirectQuery storage mode is in place).
When it comes to specifying the storage mode for a table, there are three possible options to choose from:
- Import mode — based on VertiPaq. Table data is stored in memory as a snapshot. Data can be refreshed periodically
- DirectQuery mode — data is retrieved from the data source at query time. Data resides in its original source before, during, and after the query execution
- Dual mode — a combination of the first two options. Data from the table is loaded into memory, but at query time it can also be retrieved directly from the source
As opposed to the Formula Engine, which doesn't support parallelism, the Storage Engine can work asynchronously.
Meet VertiPaq Storage Engine
Now that we've drawn the big picture, let me explain in more detail what VertiPaq does in the background to boost the performance of our Power BI reports.
When we choose Import mode for our Power BI tables, VertiPaq performs the following actions:
- Reads the data source, transforms the data into a columnar structure, and encodes and compresses the data within each of the columns
- Establishes a dictionary and index for each of the columns
- Prepares and establishes relationships
- Computes all calculated columns and calculated tables, and compresses them
The two main characteristics of VertiPaq are:
- It is an in-memory database
- It is a columnar database

As you can see in the illustration above, columnar databases store and compress data differently from traditional row-store databases. Columnar databases are optimized for vertical data scanning, which means that every column has its own structure and is physically separated from the other columns!
Without going into a deep analysis of the advantages and disadvantages of row-store vs. column-store databases, since that would require a separate series of articles, let me just pinpoint a few key differences in terms of performance.
With columnar databases, single-column access is fast and efficient. Once a computation starts to involve multiple columns, things become more complex, because the results of the intermediate steps have to be temporarily stored somewhere.
Simply put, columnar databases are more CPU-intensive, while row-store databases increase I/O because of the many scans of useless data.
So far, we've painted the big picture of the architecture that enables Power BI to fully shine as an ultimate BI tool. Now, we're ready to dive deeper into specific architectural solutions and consequently leverage this knowledge to get the most out of our Power BI reports, by tuning our data model to extract the maximum from the underlying engine.
Inside VertiPaq in Power BI — Compress for success!

As you may recall from the previous part of this article, we scratched the surface of VertiPaq, a powerful storage engine that is "responsible" for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or a Composite model).
3, 2, 1…Fasten your seatbelts!
One of the key characteristics of VertiPaq is that it is a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from the other columns.
That fact enables VertiPaq to apply different types of compression to each column independently, choosing the optimal compression algorithm based on the values in that specific column.
Compression is achieved by encoding the values within the column. But before we dive deeper into a detailed overview of the encoding techniques, just keep in mind that this architecture is not exclusive to Power BI — in the background is a Tabular model, which is also "under the hood" of Analysis Services Tabular and Excel Power Pivot.
Value Encoding
This is the most desirable encoding type, since it works exclusively with integers and therefore requires less memory than, for example, working with text values.
How does this look in reality? Let's say we have a column containing the number of phone calls per day, and the values in this column range from 4,000 to 5,000. What VertiPaq does is find the minimum value in this range (4,000) as a starting point, then calculate the difference between this value and all the other values in the column, storing that difference as a new value.

At first glance, 3 bits per value might not look like a significant saving, but multiply this by millions or even billions of rows and you can appreciate the amount of memory saved.
As I already stressed, Value encoding is applied exclusively to columns of integer data type (the Currency data type is also stored as an integer).
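To make the subtraction idea more tangible, here is a minimal T-SQL sketch of the same logic, assuming a hypothetical factPhoneCalls table with a callsPerDay column (neither exists in the data model used later in this article):
-- Illustrative only: store the offset from the column minimum instead of the
-- raw value, since the offsets fit into far fewer bits than the original values.
SELECT callsPerDay
      ,callsPerDay - MIN(callsPerDay) OVER () AS storedOffset
FROM factPhoneCalls;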
Hash Encoding (Dictionary Encoding)
This is probably the compression type most frequently used by VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within a column and then replaces the "real" values with index values from the dictionary.
Here is an example to make things clearer:

As you may notice, VertiPaq identified the distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored the index values as pointers to the "real" values. I assume you're aware that integer values require far less memory than text, so that's the logic behind this type of data compression.
Moreover, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!
This brings us to another key takeaway: no matter whether your column is of text, bigint, or float data type — from VertiPaq's perspective it's the same — it needs to create a dictionary for each of those columns, which means that all these columns will provide the same performance, both in terms of speed and memory space allocated — assuming, of course, that there are no significant differences in dictionary sizes between these columns.
So, it's a myth that the data type of a column affects its size within the data model. On the contrary, the number of distinct values within the column, which is known as cardinality, mostly influences column memory consumption.
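To make the dictionary idea more concrete, here is a rough T-SQL analogy (not VertiPaq's actual implementation), using the factChat table and its subject column that appear later in this article:
-- Illustration only: build a dictionary of distinct values, then keep just the
-- integer index instead of the text value on every row.
WITH dictionary AS (
    SELECT subject
          ,ROW_NUMBER() OVER (ORDER BY subject) - 1 AS dictionaryIndex
    FROM (SELECT DISTINCT subject FROM factChat) AS d
)
SELECT f.chatID
      ,dict.dictionaryIndex AS subjectIndex   -- stored in place of the text value
FROM factChat AS f
INNER JOIN dictionary AS dict
    ON dict.subject = f.subject;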
RLE (Run-Length-Encoding)
The third algorithm (RLE) creates a kind of mapping table containing ranges of repeating values, thus avoiding storing every (repeated) value individually.
Again, taking a look at an example will help us better understand this concept:

In real life, VertiPaq doesn't store the Start values, since it can quickly calculate where the next node begins by summing the previous Count values.
As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and rotate more frequently, RLE won't be an optimal solution.
One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn't store data the way it's shown in the illustration above. First, it performs Hash encoding and creates a dictionary of the subjects, and then applies the RLE algorithm, so the final logic, in its most simplified form, would be something like this:

So, RLE occurs after Value or Hash encoding, in those scenarios where VertiPaq "thinks" it makes sense to compress the data further (when the data is ordered in such a way that RLE would achieve better compression).
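Here is a simplified T-SQL sketch of what RLE conceptually produces — consecutive runs of the same subject collapsed into (subject, run length) pairs. It uses the classic gaps-and-islands pattern and assumes chatID defines the row order; it illustrates the concept, not VertiPaq's internal storage:
-- Collapse consecutive runs of the same subject into one row per run.
WITH ordered AS (
    SELECT subject
          ,ROW_NUMBER() OVER (ORDER BY chatID)                      AS rn
          ,ROW_NUMBER() OVER (PARTITION BY subject ORDER BY chatID) AS rnPerSubject
    FROM factChat
)
SELECT subject
      ,COUNT(*) AS runLength
FROM ordered
GROUP BY subject, rn - rnPerSubject
ORDER BY MIN(rn);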
Re-Encoding considerations
No matter how "smart" VertiPaq is, it can also make some bad decisions based on incorrect assumptions. Before I explain how re-encoding works, let me briefly walk through the process of data compression for a specific column:
- VertiPaq scans a sample of rows from the column
- If the column data type is not an integer, it will look no further and use Hash encoding
- If the column is of integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that it is probably a primary key and chooses Value encoding
- If the numbers in the column are reasonably close to each other (the number range is not very wide, like in our example above with 4,000–5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when values fluctuate significantly within the range (for example, between 1,000 and 1,000,000), then Value encoding doesn't make sense, and VertiPaq will apply the Hash algorithm
However, it can sometimes happen that VertiPaq makes a decision about which algorithm to use based on the sample data, but then some outlier pops up and it needs to re-encode the column from scratch.
Let's use our previous example with the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, it suddenly finds a value of 500,000 (it could be an error, or whatever). Now VertiPaq re-evaluates its choice, and it may decide to re-encode the column using the Hash algorithm instead. Of course, that would impact the whole process in terms of the time needed for reprocessing.
Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use (a quick way to inspect some of these properties at the source is sketched right after the list):
- Number of distinct values in the column (cardinality)
- Data distribution in the column — a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
- Number of rows in the table
- Column data type — impacts only the dictionary size
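While you can't control VertiPaq's choice directly, you can inspect cardinality and value ranges at the source before importing — a quick sketch against the factChat table used in the case study below:
-- Check row count, the cardinality of one column, and the value range of a
-- numeric column before deciding how (or whether) to import them.
SELECT COUNT(*)                       AS totalRows
      ,COUNT(DISTINCT datetmStartUTC) AS distinctStartTimes
      ,MIN(waitTimeSec)               AS minWaitTimeSec
      ,MAX(waitTimeSec)               AS maxWaitTimeSec
FROM factChat;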
Reducing the data model size by 90% — a real story!
Now that we've laid the theoretical groundwork for understanding the architecture behind the VertiPaq storage engine and the types of compression it uses to optimize your Power BI data model, it's the right moment to get our hands dirty and apply our knowledge to a real-life case!
Starting point = 776 MB
Our data model is quite simple, yet memory-intensive. We have one fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct), which relates to the fact table. The fact table has around 9 million rows, which shouldn't be a big deal for Power BI, but the table was imported as-is, without any additional optimization or transformation.

Now, this pbix file consumes a whopping 776 MB!!! You can't believe it? Just take a look:

Just remember this picture! Of course, I don't need to tell you how long this report takes to load or refresh, and how slow our calculations are because of the file size.
…and it’s even worse!
Moreover, it's not only those 776 MB that take up our memory, since memory consumption is calculated taking the following into account:
- PBIX file
- Dictionary (you've learned about dictionaries in the earlier sections of this article)
- Column hierarchies
- User-defined hierarchies
- Relationships
Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I'll see that it burns more than 1 GB of memory!
Oh man, that really hurts! And we haven't even interacted with the report yet! So, let's see what we can do to optimize our model…
Rule #1 — Import only those columns you really need
The first and most important rule is: keep only those columns your users really need in the report!
That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is a primary key from the source system? Both of these values are unique, so even if I need to count the total number of chats, I'd still be fine with only one of them.
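If you want to verify that before dropping a column, a quick check at the source will do — a sketch against the factChat table (sourceID is assumed to still be present there at this point):
-- If both distinct counts match the total row count, either key alone
-- is enough for counting chats.
SELECT COUNT(*)                 AS totalRows
      ,COUNT(DISTINCT chatID)   AS distinctChatIDs
      ,COUNT(DISTINCT sourceID) AS distinctSourceIDs
FROM factChat;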

So, I’ll remove the sourceID column and check how the file looks now:

By removing just one unnecessary column, we saved more than 100 MB!!! Let's examine further what can be removed without even taking a deeper look (and we'll come to that later, I promise).
Do we really need both the original start time of the chat and the UTC time, one stored as a Date/Time/Timezone type, the other as Date/Time, and both going down to the second level of precision??!!
Let me get rid of the original start time column and keep only the UTC values.

Another 100 MB of wasted space gone! By removing just two columns we don't need, we've reduced the size of our file by 30%!
And that was without even looking into the details of memory consumption. Let's now fire up DAX Studio, my favorite tool for troubleshooting Power BI reports. As I've already stressed a few times, this tool is a MUST if you plan to work seriously with Power BI — and it's completely free!
One of the features in DAX Studio is VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect DAX Studio to my pbix file, here are the numbers related to my data model size:

Here I can see which columns are the most expensive in my data model and decide whether I can discard some of them or need to keep them all.
At first glance, I have a few candidates for removal right away — two text columns with high cardinality that therefore can't be optimally compressed. Moreover, since these are text columns that need to be encoded using the Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you can notice that these two columns take up almost 40% of my table size!
After checking with my report users whether they need any of those columns, or maybe just one of them, I got confirmation that they don't perform any analysis on those columns. So, why on Earth should we bloat our data model with them??!!
Another strong candidate for removal is the LastEditDate column. This column just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users, and they didn't even know this column existed!
I removed these three columns, and here is the result:

Oh my, we halved the size of our data model just by removing a few unnecessary columns.
Truth be told, there are a few more columns that could be dropped from the data model, but let's now focus on other techniques for data model optimization.
Rule #2 — Reduce the column cardinality!
As you may recall from the previous part of the article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data — especially if we're not working with integer values.
Let's take a deeper look at the VertiPaq Analyzer results:

As you can see, even though the chatID column has a higher cardinality than the datetmStartUTC column, it takes almost 7 times less memory! Since it is an integer surrogate key, VertiPaq applies Value encoding, and the size of the dictionary is irrelevant. On the other hand, Hash encoding is applied to the date/time column with its high cardinality, so the dictionary size is enormously larger.
There are multiple techniques for reducing column cardinality, such as splitting columns. Here are a few examples of using this technique.
For integer columns, you can split them into two columns using division and modulo operations. In our case, it would be:
SELECT chatID / 1000 AS chatID_div
      ,chatID % 1000 AS chatID_mod
      ...
This optimization technique must be performed on the source side (in this case, by writing a T-SQL statement). If we use calculated columns instead, there is no benefit at all, since the original column has to be stored in the data model first.
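If the original value is still occasionally needed, it can be reconstructed from the two split parts, for example in a view on the source side — a sketch using the hypothetical split columns above and an assumed factChatSplit table:
-- Works for non-negative IDs, since chatID = (chatID / 1000) * 1000 + chatID % 1000.
SELECT chatID_div * 1000 + chatID_mod AS chatID_original
FROM factChatSplit;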
A similar technique can bring significant savings when you have decimal values in the column. You can simply split the values before and after the decimal point, as explained in this article.
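For completeness, here is a sketch of that decimal split, using a hypothetical factSales table and amount column that are not part of our data model:
-- Split a decimal value into a whole-number part and a fractional part,
-- both of which compress better than the original decimal column.
SELECT CAST(FLOOR(amount) AS int)                    AS amountWhole
      ,CAST(amount - FLOOR(amount) AS decimal(9, 4)) AS amountFraction
FROM factSales;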
Since we don't have any decimal values, let's focus on our actual problem — optimizing the datetmStartUTC column. There are multiple valid options for optimizing this column. The first is to check whether your users need granularity finer than the day level (in other words, whether you can remove hours, minutes, and seconds from your data).
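One possible way to do that at the source is to simply cast the column to a date — a sketch:
-- Keep only day-level grain by dropping the time portion at the source.
SELECT chatID
      ,CAST(datetmStartUTC AS date) AS dateStartUTC
FROM factChat;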
Let’s check what savings this solution would bring:

The first thing we notice is that our file is now 271 MB, about a third of what we started with. VertiPaq Analyzer's results show that this column is now almost perfectly optimized, going from taking up 62% of our data model to just barely over 2.5%! That's huuuuge!

However, it turned out that the day-level grain was not fine enough, and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, and that would also decrease the cardinality of the column.
So, I’ve imported values rounded per hour:
SELECT chatID
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,subject
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat
It turned out that my users also didn't need the chatVariables column for analysis, so I removed it from the data model as well.
Finally, after disabling the Auto Date/Time option under Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost a third of my table. And it is just a surrogate key, which isn't used in any of the relationships within my data model.

So, here I was considering two different solutions: the first was to simply remove this column and pre-aggregate the number of chats, counting them with a GROUP BY clause (a sketch of that option follows below). However, it turned out there was no benefit in keeping the chatID column at all, since it isn't used anywhere in our data model, so I simply removed it from the model.
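Here is a sketch of the pre-aggregation option that was considered — dropping chatID and storing one row per hour and per dimension keys, with the chat count precomputed (the grouping columns are just an example, not the final design):
-- Pre-aggregate chats per hour and dimension keys instead of keeping one row per chat.
SELECT dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
      ,productID
      ,countryID
      ,COUNT(*) AS totalChats
FROM factChat
GROUP BY dateadd(hour, datediff(hour, 0, datetmStartUTC), 0)
        ,productID
        ,countryID;
One last time, let's check the pbix file size: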

Please recall the number we started at: 776 MB! We ended up reducing our data model size by roughly 90%, applying some simple techniques that enabled the VertiPaq storage engine to perform more optimal compression of the data.
And this was a real use case that I faced during the last year!
General rules for reducing data model size
To conclude, here is a list of general rules you should keep in mind when trying to reduce the data model size:
- Keep only those columns your users need in the report! Just sticking to this one single rule will save you an unbelievable amount of space, I assure you…
- Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test… and if there is a significant benefit from, for example, splitting one column into two, or from substituting a decimal column with two whole-number columns, then do it! But also keep in mind that your measures will need to be rewritten to handle those structural changes in order to display the expected results. So, if your table isn't big, or if you would have to rewrite a lot of measures, maybe it's not worth splitting the column. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
- Same as with columns, keep only those rows you need: for example, maybe you don't need to import data from the last 10 years, but only 5! That will also reduce your data model size. Talk to your users and ask them what they really need before blindly putting everything into your data model
- Aggregate your data whenever possible! That means fewer rows and lower cardinality — exactly what you are aiming for! If you don't need hour, minute, or second level granularity, don't import it! Aggregations in Power BI (and the Tabular model in general) are an important and broad topic, which is out of the scope of this series, but I strongly recommend you check out Phil Seamark's blog and his series of posts on creative aggregation usage
- Avoid using DAX calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them in the Power Query Editor
- Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type — the Date data type will suffice)
- Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)
Conclusion
Now that you've learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this article by showing you a real-life example of how we can "help" VertiPaq (and, consequently, Power BI) to get the best report performance with optimal resource consumption.
Thanks for reading, I hope you enjoyed the article!