Does this title make no sense to you? Yay! That was my intention. I know that when, as a flatlander, you see the words “cardinality” and “granularity”, you may just go scrolling on by this post. But, trust me, you need to know what these concepts mean. So now that I hooked you in, let’s get started….
We are going to start with granularity because it is easier to understand. Granularity refers to the level of detail that you are storing in your data set. For example, if you are storing sales data, are you storing it at the minute, hour, day, week, month, quarter, or year level? The level at which you store the data is called the granularity of the data.
- Data should be stored at the lowest level of granularity for which you wish to report. If you are going to need to report data at the day level, you don’t want to store the data at the week or month level. You will not be able to accurately report on daily sales if you keep the data at the monthly level. Could you estimate daily data from monthly data? Of course: depending on the month, you could divide the monthly total by 28, 30, or 31. (This assumes you sell seven days a week.) But that would give you an average, by definition, not the actual daily figures. So the first thing to determine is whether the data’s granularity will support the reporting you need.
- You can always roll data up to a higher level. When you do this, it is called aggregating the data. You can aggregate daily data into weekly or monthly roll-ups. In a later post, I will describe creating aggregations in more detail, but it is a more advanced topic, so we need to be further along before tackling it.
- Do not be overly concerned if your data, when stored at the necessary level of granularity, has a lot of rows. Power BI can handle millions and millions of rows of data with ease, if you have properly modeled your data. (See the Model, Don’t Muddle and Are You My Type posts.)
- Granularity can also affect the relationship between two tables when they are not stored at the same level of granularity. For example, if one table reports at the grade level and another reports at the student level, you have mismatched granularity between the two tables. If you need a meaningful relationship between these two tables, you will have to figure out how to relate them to each other. One introductory solution is to create a bridge table. In the example I just used, the bridge table would have one column for student IDs and one column for the grade each student is in.
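To make the roll-up idea above concrete, here is a minimal sketch in plain Python (not Power BI itself, which handles this through its storage engine and DAX); the daily sales figures are invented for illustration:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales rows: (date, amount) -- the lowest granularity we store.
daily_sales = [
    (date(2024, 1, 1), 120.0),
    (date(2024, 1, 2), 80.0),
    (date(2024, 1, 3), 220.0),
    (date(2024, 2, 1), 150.0),
]

# Rolling up (aggregating) daily rows to monthly granularity.
monthly_sales = defaultdict(float)
for d, amount in daily_sales:
    monthly_sales[(d.year, d.month)] += amount

print(monthly_sales[(2024, 1)])  # 420.0 -- January total

# Going the other direction only yields an average, never the real daily values:
january_avg = monthly_sales[(2024, 1)] / 31
print(round(january_avg, 2))  # 13.55
```

Notice the asymmetry: daily rows aggregate cleanly into a monthly total, but dividing a monthly total back down gives you an average per day, not the actual daily sales.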
(Shout out to Marco Russo www.sqlbi.com. He posted an excellent video on the intricacies of many-to-many relationships on SQLBI+ which talked in great detail about mismatched granularity. Whether you are a member of SQLBI+ or not, I highly recommend all of SQLBI’s content!)
Now let’s talk about cardinality. The key to this term is in its root: cardinal. And no, that doesn’t refer to higher-ups in the Catholic Church or pretty red birds. This definition from the Merriam-Webster dictionary might be helpful:
> the property that a mathematical set has in common with all sets that can be put in one-to-one correspondence with it
On second thought, this definition is good if you already know what cardinality means. So let me try to explain in “flatlander” terms. The cardinality of a column of data refers to the number of distinct values in the column. If a column has no repeating values (or very few), so that nearly every row holds a different value, the column is said to have “high cardinality”. Conversely, if the column has a lot of repeating values, it is said to have “low cardinality”. Low cardinality is good; high cardinality is not optimal. Notice that I am not saying that high cardinality is bad. It isn’t. You will always have a couple of columns that have high cardinality. For example, any column that stores a unique identifier for each row in the table will, by definition, have high cardinality. The trick is to minimize the number of columns with high cardinality.
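As a rough illustration (plain Python, with made-up data), you can think of a column’s cardinality as the count of its distinct values:

```python
# A tiny table sketched as column lists (hypothetical data).
order_id = [1001, 1002, 1003, 1004, 1005]           # unique per row
region   = ["East", "West", "East", "East", "West"]  # few distinct values

def cardinality(column):
    """Number of distinct values in a column."""
    return len(set(column))

print(cardinality(order_id))  # 5 -- high cardinality: every value is unique
print(cardinality(region))    # 2 -- low cardinality: values repeat
```

Five rows with five distinct order IDs is the highest cardinality a column can have; the region column, with only two distinct values, is what a storage engine can compress very efficiently.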
You want to minimize columns with high cardinality because they take more space in your data model: they do not compress as well. You can mitigate the impact of a necessary high-cardinality column by making sure that its data type is a number format, ideally a whole number. This might mean splitting a column into two separate columns or assigning a unique ID that can be stored as a whole number.
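One way to picture the “assign a unique ID that can be stored as a whole number” trick, sketched in plain Python with hypothetical invoice keys:

```python
# Hypothetical high-cardinality text keys, e.g. long invoice strings.
invoice_keys = ["INV-2024-000185", "INV-2024-000186", "INV-2024-000187"]

# Assign each distinct key a whole-number surrogate ID.
key_to_id = {key: i for i, key in enumerate(sorted(set(invoice_keys)))}

# Store the compact integer in place of the long string.
invoice_ids = [key_to_id[k] for k in invoice_keys]

print(invoice_ids)  # [0, 1, 2] -- whole numbers compress far better than long strings
```

The column is still high cardinality either way, but small whole numbers cost the model far less than long text values.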
High cardinality is unavoidable for columns that uniquely identify a row, so you will most often see it in your dimension (dim) tables. Every dim table is going to have at least one high-cardinality column, but ideally not more than one. In your fact table, you should not have a column with high cardinality if you can model your way out of it. For example, if your fact table has a unique ID for each row, you may want to consider how you can move that ID into a dimension table and create a one-to-many relationship from that dim table to the fact table. This could be as simple as a ‘bridge table’, something that sits between your fact table and a dimension table.
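The fact-versus-dim idea can be sketched in plain Python (the customer names and amounts are invented): the distinct values move into a dimension table keyed by a whole number, and the fact table keeps only that key, which becomes the “many” side of a one-to-many relationship.

```python
# Hypothetical fact rows with a repeated text value baked in.
fact_rows = [
    ("Ada Lovelace", 120.0),
    ("Grace Hopper", 80.0),
    ("Ada Lovelace", 220.0),
]

# Build a dimension: one row per distinct customer, keyed by a whole-number ID.
dim_customer = {name: i for i, name in enumerate(sorted({name for name, _ in fact_rows}))}

# The fact table now stores only the integer key -- the "many" side of a
# one-to-many relationship from dim_customer to the fact table.
fact_table = [(dim_customer[name], amount) for name, amount in fact_rows]

print(dim_customer)  # {'Ada Lovelace': 0, 'Grace Hopper': 1}
print(fact_table)    # [(0, 120.0), (1, 80.0), (0, 220.0)]
```

The high-cardinality column (the unique ID) now lives in the dim table, where it belongs, while the fact table carries a compact, repeating, low-cost key.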
A good maxim for working with your data’s granularity and cardinality is: strive for ideal, work with what’s real. So always make the attempt to reduce cardinality and, where your reporting demands it, increase granularity.