Weeding Your Data

I love to garden. Who knew? For years, I didn’t have any dirt to ‘play in’. But now, thanks to a change in home locale, I have a lot of acres with which to experiment. It turns out that I love buying plants and seeds and putting them in the ground. What I don’t love as much? Weeding. But weeds will take over my beautiful flower beds if I let them. So I have to be just as enthusiastic a weeder as I am a starter and planter.

Weeding before creating your data model is just as important. But the good news is that if you make a mistake, and you weed too enthusiastically, you can easily correct your mistake. What do I mean by weeding the data model? Simply put, I mean removing the columns (or rows) that you don’t anticipate using. I have put together some questions that you can use as a guide when deciding whether to weed a particular column or row. Before we start, let me throw out some definitions:

  • A Power BI visual report is a report you create using the Power BI Desktop (or the Power BI Service) that is composed of visuals.
  • A Power BI paginated report is a report you create using the Power BI Paginated Report application. A paginated report is essentially a list of values, with no visuals in it.
  1. Will you ever use this data in a Power BI visual report? For example, if there is a field for a supplier’s fax number, you may not ever use that field in a visual. On the other hand if you are creating a paginated report (a list of values), you may want to keep the column listing the fax number. (Visualize an old-fashioned phone directory.)
  2. Are there missing values? If yes, can you easily and readily supply the missing values? If not, consider deleting this column.
  3. Can the column be simplified? For example, if the column values have unnecessary components, consider splitting the column using Split Column and then deleting the less-than-useful portion. Or replace the less-than-useful portion with nothing, using Replace Values.
  4. Can the column be combined with another column (ideally without increasing cardinality)? While a single column won’t necessarily increase the overall size of the model, the cumulative effect of a lot of columns can create a larger-than-necessary model size.
  5. Can the column be moved out of the fact table and into a bridge or dimension table? This is particularly urgent if the column in question has high cardinality (see the previous post Cardinality and Granularity). Remember, your fact table is almost always going to be the longest table in your model (the most number of rows), and therefore high cardinality columns (which are much harder to compress) should be avoided in a fact table. If you can move the column to a bridge or dimension table, it will do much less “damage” to your model.

Now, if you realize that you have been too enthusiastic in your weeding, you can always go back to the Power Query Editor and restore the column(s). Which reminds me: perform your weeding in the Power Query Editor, not in the Power BI Desktop. It will be easier for you to go back and find the specific step where you removed the column(s).  But there is a “gotcha” in the Power Query Editor of which you should be aware.

The “Gotcha”:

If you remove a single column using Remove Columns>Remove or Right-Click>Remove, the Power Query Editor will use the Table.RemoveColumns function. This will specify that the column you selected is removed. The same is true if you remove multiple columns. But what happens if you want to remove more columns than you want to keep? The easiest way to do this is to choose Choose Columns>Choose Columns and then select the columns you want to keep. When you perform removals this way, the Power Query Editor will use the Table.SelectColumns function. Pay attention to the function that Power Query Editor uses, because it can affect how the data model behaves if there are changes in the future to the source data.

If you add a new column to the source data, and you want that column to show up, make sure that the M script is not using the Table.SelectColumns function. Why? Because if the Table.SelectColumns function only brings in the columns that are named in the step. If the new column (the one that was added) is not in that list, it will not show up in the data model.

If the name of a column changes in the source data, if the Table.RemoveColumns function is used in the M script, then the refresh will trigger an error. The renaming is not the direct cause of the issue. Rather, the issue is that one (or more) of the columns listed in the M script is not present in the data.  Similarly, if a column is deleted from the source data, if the Table.RemoveColumns function is used in the M script, then the refresh will trigger an error. The issue is that the deleted column called for in the script is not present.

The bottom line? Pay attention to the functions that are used by the Power Query Editor. If you know that the source data is vulnerable to changes (such as in an Excel worksheet), be very careful to choose the right function for removing or keeping columns.

Shout out to Gil Raviv and his book “Collect, Combine, Transform”, which goes into great detail about this “gotcha” and a bunch of others you will run into. I highly recommend reading this book!

A second shout out to Ben Gribaudo who is the absolute king of the M language. His blog is always worth reading: https://bengribaudo.com/