Calling all flatlanders…(if you missed my definition of flatlander, check out last week’s newsletter).
Last week, I talked about creating a data model, and why you can’t just muddle through it. We are going to continue that discussion in this week’s newsletter. I want to explore data types and formats. First and foremost, a data type is not the same thing as a data format. The data type determines how the data is stored; the data format determines how the data is displayed. The correct data type is critical; the format is not.
The most efficient data type is integer (aka whole number). If you can type a column as a whole number, it is going to save a lot of space, particularly if the column is in your fact table (see last week’s post). For example, if you have a column with quantities, don’t leave that column as a decimal data type, if you are never going to have a partial quantity. Change the data type to whole number.
After the whole number data type, the next most efficient data types are also numerical: decimal, fixed decimal, dates, times. All columns in your fact table should ideally have a numeric data type. Why? Because numeric data types compress more efficiently, which will make your report more responsive.
String data types (aka text) do not compress as readily, so reserve them for your dimension tables (see last week’s post). Remember, dimension tables are almost always going to be shorter (fewer rows) than the fact table, which is why the compression isn’t as critical.
In a typical workflow, you should set your data types in the Power Query Editor. You can set (or change) data types in the Power BI Desktop, but because there are fewer options, try to make it your practice to set the data types in the PQE.
Fun mind-bending fact: when you assign a number data type in the Power Query Editor, you are actually assigning a data type of number, and a claim of a subtype (whole number, decimal, fixed decimal). Once the data is passed from the Power Query Editor to another system (e.g. Power BI, Excel), the destination interprets the claim in its own way. Why do you care about this weird factoid? You don’t, unless you want to dig deep into the M language. In which case, I highly recommend Ben Gribaudo’s content and class. I have included a link to Ben’s Power Query M Primer, starting at the data type section.
Always use Fixed Decimal Type for currency. This will ensure the preservation of four digits of accuracy after the decimal place. You can always display only two decimal places (or none), but the value of the four decimal places is stored, and therefore retrievable.
If you know that your report(s) will be used by colleagues in other countries, set your data types with the Using Locale option, in the Power Query Editor. If you do, the data will display appropriately depending on the viewer’s locale (as determined by their computer). For example, in the US, we format dates as MM/DD/YYYY, but in other countries (most other countries) format and display dates as DD/MM/YYYY. The Using Locale feature will ensure that dates display appropriately whether the viewer is in Paris, Illinois or Paris, France.
Speaking of formats, you cannot set the data’s format in the Power Query Editor. Data formats are set in the destination system (e.g. Power BI or Excel). So don’t be alarmed if you set your currency columns to Fixed Decimal type, and you are only seeing two digits after the decimal, while you are still in the Power Query Editor. Once the data is loaded, you can set the format to display as many (or few) decimal places as you want to, without affecting how the data is stored.
Dates and times are numeric data types. Dates are stored as serial numbers, just as they are in Excel. There is one slight difference–in Power BI, day 1 starts on December 31, 1899, rather than on January 1, 1900 as it does in Excel. If you are working with dates between December 31 1899 and February 29, 1900, you may have to do a little manual work to adjust for this difference. Starting on March 1, 1900 the underlying serial numbers are the same.
Always review the data types that the Power Query Editor automatically assigns. If your data is not ‘clean’, meaning a column has mixed content, the data type can be wrong because the Power Query Editor only looks, by default, at the first 1000 rows. There could be content after the first 1000 rows that doesn’t match the data type automatically assigned. It also may assign a data type that works, but that is non-optimal (e.g. assigning a Decimal type rather than a Whole Number).
Don’t change your data types until the end of your workflow in Power Query Editor, unless you have to. (Sometimes you have to change a data type in order to perform a specific transformation.) Why? Because changing a data type usually breaks query folding. Don’t worry if you don’t know what query folding is right now. I am planning to cover it here in my weekly newsletter. For now, just know that query folding is a good thing and you want to preserve it as long as possible if it is present/possible.
There aren’t many things you can do (or ignore) that will have as much effect on the success of your Power BI reports as assigning the right and/or optimal data type. Ignore data types at your peril. I am including an infographic as a cheat sheet for you.
Next up: Cardinality and Granularity