Oct

23

2019

iStock-1127788798

To Blend Or Not To Blend In Tableau

By: David Baldwin

Objective: 

Learn why and how to avoid Data Blending in Tableau.

Blending Example:

Consider two tables that do not share a unique, row level id but do share a common dimension.  Let’s say that the shared, common dimension is “State”.  If either of the tables has multiple instances of a state, joining could result in a Cartesian product.  Let’s say you decide to aggregate one table at the “State” level and join on the results.  That’s data blending!  But be aware!  Data blending has some serious limitations.  Read on to learn more.

Definitions:

Blending: Merging aggregated data from multiple sources.

Cartesian product:  The product of two sets: the product of set X and set Y is the set that contains all ordered pairs ( x, y ) for which x belongs to X and y belongs to Y.  Are scratching your head over this dictionary definition?  We’ll go through an example below that should clear things up for you.

Join:  The data that you analyze in Tableau is often made up of a collection of tables that are related by specific fields (that is, columns). Joining is a method for combining the related data on those common fields.

Union:  Appending multiple data sources.

Jump In And Start Playing:

  1. Download the following data sources:

Simplified Superstore.xlsx (https://teknionusa-my.sharepoint.com/:x:/p/dbaldwin/EYzxQTLAJbdBsLORR1YWpl4BrZgsCU9jt4SxAQFOVO6-9w?e=q4skwB)

County_Pop.xlsx (https://teknionusa-my.sharepoint.com/:x:/p/dbaldwin/Ebz5yEqB9PVGg9ErL_HMJ-wBbgJJWM0wStYaXDeiipQLTA?e=Vm2WL9)

You can also access a completed workbook that shows the finished results for all the steps in this blog post.  (https://public.tableau.com/profile/david.baldwin#!/vizhome/AvoidingBlending/Blending)

Note:  In order to successfully follow all the instructions below, it’s important to download these files into the same directory.  That directory should contain only the two spreadsheets listed above.  Nothing else!

  1. Connect Tableau Desktop to County_Pop.xlsx
  2. Depress CNTL + D to connect to Simplified Superstore.xlsx.
  3. Select the Sheet 1 tab.
  4. Select the Simplified Superstore.xlsx data source and place “State” on the Rows shelf and “Sales” on the Text shelf.
  5. Select the County_Pop.xlsx data source and place “State or district” on the Rows shelf after “State”.

Note:  You should get a warning dialog box.  To address the warning, navigate to Data à Edit Relationships… to provide Tableau with information to successfully link the two data sources.

  1. From the County_Pop.xlsx data source, double click on “Population” to create a two-column view like the one pictured below.
    01Screenshot

 

 

 

 

 

 

Screenshot 1:

  1. From County_Pop.xlsx, place “County or Equivalent” on the Rows shelf.

    02Screenshot

Screenshot 2:

So, What’s Wrong?

When looking at the screenshots of the “List of States” tables above:

  1. The problem of missing data.

Neither of the two screenshots displays Alaska or Hawaii!  This may seem surprising since County_Pop.xlsx does include both of these states.  The issue is that neither Alaska nor Hawaii exist in the primary data source, Simplified Superstore.xlsx.  The primary data source drives the dimension members that are listed in the worksheet.  To list these missing states would require building the view with County_Pop.xlsx as the primary data source.  But what if there were states missing from both data sources?  When data blending, the way to solve for this kind of problem is to include a third data source with the complete list of states.  In the case of states, this might not be too difficult, but if territories such as Puerto Rico and the Virgin Islands are added, then the third data source would need to be updated as well.  This would make for difficult maintenance.

  1. The problem of high cardinality dimensions.

Note another problem that displays in the above two screenshots.  “District of Columbia” is missing a population total.  To fix this issue, simply right click on “District of Columbia” and change the alias to “Washington DC”.  The dimension member is named differently in the two data sources.  This was an easy fix; however, this could be a serious problem if you were attempting to blend on “Customer Name” which is a high cardinality dimension.  (High cardinality because the “Customer Name” dimension has many members.)  Imagine right clicking on dozens or hundreds of customer names to change the aliases.  This would certainly not be a scalable solution.

Honestly, the problem of mismatching dimension members is a data issue that would be equally problematic for joining or unioning.  However, there’s another issue that blending on a high cardinality dimension brings to the forefront:  Performance.  In Tableau Desktop, data blending occurs locally.  Blending on high cardinality dimensions from large data sources could bring your laptop to its knees!  Joins and unions, however, occur at the data source level where, ostensibly, there’s horsepower to handle big data.

  1. The row level problem.

Note that Screenshot 2 above displays an asterisk (*) instead of individual counties.  Data blending does not support row level information from a secondary data source.  This means no row level calculations can be performed using measures from a secondary data source.  This also means dimensions are not supported from a secondary data source, hence, the asterisks (*) instead of individual counties. 

What Are Potential Solutions?

Use Joining instead of Blending

  1. Open a new workbook and create a cross database join between County_Pop.xlsx and Simplified Superstore.xlsx.

To accomplish this:

  1. In Tableau Desktop, connect to County_Pop.xlsx.
  2. Click on the Data Source Tab and then click the “Add” link in the upper left-hand corner.

03Screenshot

Screenshot 3:

  1. Create a left join between County_Pop.xlsx and Simplified Superstore.xlsx using “State or District” and “State”.
    04Screenshot

         Screenshot 4:

  1. Now create a view by dragging “State or District” to the Rows shelf and SUM(Sales) to the Text shelf.
  2. In the data pane, double click on “Population”. Notice that California has a population of over 76 billion – 10xs the population of the world!  (We’ll solve for this Cartesian product issue below.)
  3. We can solve for the supersized population of California via an LOD calculation like this:

SUM({ FIXED [State or district], [County or equivalent] : MAX([Population])})

  1. The above LOD calculation returns a single value for each county within each state. The aggregation of MAX is not important.  MIN or AVG would work just as well.  The outer SUM() function returns an aggregated value for each state as can be seen in the below screenshot.

    05Screenshot

   Screenshot 5:

  1. Filter to keep only “Delaware” .
  2. Add “County or equivalent” to the Rows shelf.

    06Screenshot

Screenshot 6:

  1. Note that sales values are repeated for each county, creating another Cartesian product issue.

Let’s consider how the above Join overcomes problems found in the Data Blend but also generates its own set of problems!

  1. The problem of missing data: Resolved
    1. As can be seen in Screenshot 5, both Alaska and Hawai’i display! This is because Tableau considers the joined data sources one, big, happy dataset.  Properly joined tables can alleviate the problem of missing dimension members like we saw in the data blending example above.
  2. The problem of high cardinality dimensions: Partially resolved
    1. We still have an underlying data problem with Washington DC/District of Columbia. As mentioned above, this is best addressed in the underlying data before visualizing with Tableau.
    2. The data blending performance problem has been addressed. Whereas blending occurs locally within Tableau Desktop, the join is performed at the data source level.  Unfortunately, another performance problem may arise due to the Cartesian product.  This is discussed in the next section.       
  3. The row level problem: Resolved
    1. Screenshot 6 displays “State” and “County or equivalent”. These two dimensions originated from different data sources.  This demonstrates that row level data can come from multiple data sources when those data sources are joined. 

Now let’s consider how the above Join generates new issues.

  1. The problem of the Cartesian product.
    1. Remember the 76 billion population associated with California? This happened because joining on dimensions where members exist multiple times in the underlying datasets will cause an ‘explosion’ of data; i.e., generate additional rows of data.  In this case, the problem is caused by joining “State” with “State or district”.  A join functions by making every possible match and then aggregating the results; hence, the ‘explosion’.  The problem can be more easily seen by creating a new worksheet and placing “Number of Records” on the Text shelf.  Note that the count of records is 855,400 even though Simplified Superstore.xlsx has only 9994 records and County_Pop.xlsx has 3143 records.  Another way to observe the problem is by viewing the underlying data that results from the join to see that many rows have repeated population values.
  2. The problem of poor performance caused by unmanageably, large data sets.
    1. A Cartesian product generated on high cardinality dimensions between two, large datasets could cause an enormous ‘explosion’ of data. We can make sure that the data displays accurately in the view via techniques like the one demonstrated in the above LOD calculation.  However, we may not be able to solve for performance problems that could arise from working with such large data sets.

Use Unions instead of Blending

  1. Open a new workbook and union Simplified Superstore.xlsx with County_Pop.xlsx. To accomplish this:
    1. In Tableau Desktop, connect to County_Pop.xlsx.
    2. In the Data Source tab, click on the dropdown associated with “List of Counties of US” and select “Convert to Union”. In the resulting dialog box select “Wildcard (automatic)”.  Under “Sheets” and “Workbook” choose “blank = include all”.  (See Screenshot 7 below.)
      07Screenshot

Screenshot 7:

  1. In a new worksheet, CNTL click “State” and “State or district” and select “Merge Mismatched Fields”. (See Screenshot 8 below.)

 

08Screenshot

Screenshot 8:

  1. Right click on the newly created “State or district & State” and Create à
  2. In the Create Group dialog box, group “District of Columbia” and “Washington DC”.
  3. Place “State or district & State (group)” on the Rows shelf.
  4. Double click on “Population” and then on “Sales”.
  5. Filter to only display “Delaware”.
  6. Place “County or equivalent” on the Rows shelf. As Screenshot 9 shows below, all numbers display correctly.  Of course, the “Null” is a bit disconcerting, but you could right click on it and alias to “State Total” or something similar.  Since the sales data does not exist at the county level, there’s no way to accurately display county level sales values.

 

09Screenshot

Screenshot 9:

For these particular datasets, a Union is the best solution. 

  1. The problem of missing data: Resolved
    1. “State or district & State (group)” contains both Alaska and Hawai’i.
  2. The problem of high cardinality dimensions: Partially resolved
    1. Although we used a group to solve for the problem of Washington DC/District of Columbia, this type of issue is generally best addressed in the underlying data before visualizing in Tableau Desktop.
  3. The row level problem: Resolved
    1. Screenshot 9 displays “State or district & State (group)” and “County or equivalent”. These two dimensions originated from different data sources which demonstrates that row level data can come from both data sources. 
  4. The problem of the Cartesian product: Resolved
    1. The row count is no longer ‘exploding’. Create a new worksheet and place “Number of Records” on the Text shelf and you will observe a count of 13,137 instead of the 855,400 generated by the Cartesian product discussed below.  This much lower number is the sum total of Simplified Superstore.xlsx (9994 records) and County_Pop.xlsx (3143 records).
  5. The problem of poor performance: Mitigated
    1. I’ve chosen the word “Mitigated” as opposed to “Resolved” so as not to imply that unions will always result in quick performance. Fine tuning performance is quite challenging and necessitates considering many variables.  However, in this case, a union does result in better performance since it does not generate a Cartesian product and it also does not require blending data sources.

Can Joins and Unions be used interchangeably?

 

Often you can combine data satisfactorily using either a Union or a Join.  However, there are times when you really need one or the other.

  1. If you need to concatenate two data sources, a Join won’t work. You’ll need to Union.  
Table 1 Table 2 Unioned Results
Region Region Region
North East North
South West South
    East
    West

 

2. You may actually want a Cartesian product! Consider the following simple example:

Table 1 Table 2 Results With Join On " "
Region Category Region Category
North Products North Products
South  Services North  Services
    South  Products
    South  Services

 

Joining the above two tables on “ ” (i.e., “blank”) generates a resulting table that includes all possible field combinations.  This is not possible with a Union.

3.  Another reason you might need to Join as opposed to Union is because of certain feature limitations in Tableau. For example, Tableau Desktop does enable a Cross Database Join between an Oracle table and a MS SQL Server table.  However, as of Tableau 2019.3, you cannot natively union Oracle and MS SQL Server in Tableau Desktop.  That said, custom SQL is possible in Tableau so you could write your own query.  But this could generate other issues including performance and maintenance challenges. 

 

When Data Blending Is Necessary

Data blending is largely deprecated in Tableau Desktop.  But there’s at least one use-case where a Data Blend may be your only option.   If you have two or more data sources that can neither be Unioned nor Joined, Data Blending is your only option.  For example, you can’t Join or Union an OLAP cube and an Access database.