My daughter said something that was totally logical to her and unintentionally funny to me. We saw a former NBA basketball player earlier that day who stood about 7 feet, 6 inches tall, after which she said that playing basketball must make you tall. At this point, her overly eager and analytic father decided this was a learning opportunity, and told her that “correlation is not causality.” Playing basketball does not make you tall, and being tall does not make you a basketball player, even if those two traits often go together, i.e., are highly correlated.
A correlation coefficient measures how closely a set of data points (a scatter plot) follows the straight line that best approximates it. Correlations are measured by the “r value,” on a scale from positive one (perfectly correlated) through zero (no linear relationship) to negative one (perfectly negatively correlated). As a common rule of thumb, a value greater than 0.5 (or less than -0.5) is needed to infer a strong relationship.
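To make the r value concrete, here is a minimal sketch of the standard Pearson calculation in Python. The function name `pearson_r` and the height and scoring figures are made up for illustration; they are not real player data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how the two series move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: how much each series varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, invented for illustration: heights in inches, points per game.
heights = [70, 72, 75, 78, 80, 84]
points = [8, 11, 9, 15, 14, 19]
r = pearson_r(heights, points)
print(f"r = {r:.3f}")
```

A result near positive one says the points hug an upward-sloping line. It says nothing about whether height causes scoring, or the reverse.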
Sometimes, things that appear to be highly correlated are in fact random events that have no real-world connection. I explained this to my daughter, and then for fun, showed her a website that lists funny, spurious correlations. For example, the number of people who drowned by falling into a pool correlates with the number of films that Nicolas Cage appeared in (r=.666). The divorce rate in Maine correlates with per capita consumption of margarine (r=.993). And per capita consumption of mozzarella cheese correlates with civil engineering doctorates awarded (r=.959).
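One way to see how easily unrelated series can look related is to simulate it. The sketch below is an illustration under stated assumptions, not the method used by that website: it generates pairs of independent random walks (series that drift over time, as many annual statistics do) and counts how often the absolute r value exceeds 0.5. The function names, the seed, the 100-step length, and the 500 trials are all arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def random_walk(rng, steps):
    """Running total of independent +1/-1 steps: a series that drifts over time."""
    position, path = 0, []
    for _ in range(steps):
        position += rng.choice((-1, 1))
        path.append(position)
    return path

def share_strongly_correlated(make_series, trials=500, cutoff=0.5):
    """Fraction of independently generated pairs whose |r| exceeds the cutoff."""
    strong = 0
    for _ in range(trials):
        a, b = make_series(), make_series()
        if abs(pearson_r(a, b)) > cutoff:
            strong += 1
    return strong / trials

rng = random.Random(42)  # fixed seed so the run is repeatable
share_walks = share_strongly_correlated(lambda: random_walk(rng, 100))
share_noise = share_strongly_correlated(lambda: [rng.gauss(0, 1) for _ in range(100)])
print(f"drifting series with |r| > 0.5: {share_walks:.0%}")
print(f"pure noise with |r| > 0.5:      {share_noise:.0%}")
```

In runs like this, a sizable share of the drifting pairs tends to clear the 0.5 bar even though each pair is unrelated by construction, while pairs of pure noise almost never do. Two series that merely trend over the same period are therefore weak evidence of a real connection.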
How Correlations Can Help Data Analysis
Correlations can be applied in several ways to help you sort through data and find useful, predictive signals. One common application is to compare company marketing activities (campaigns, promotions, or advertising) with sales at various levels, such as product line, channel, or promotion. Correlations also are helpful in comparing performance relative to benchmarks or peers. A standard regression in a spreadsheet can help to determine the correlation coefficient between data sets. In supply chain, a business might look for a relationship between a specific part and product defects.
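For readers who want to see what a spreadsheet regression is doing, the following is a rough sketch of an ordinary least-squares fit in Python. The helper `linear_fit` and the monthly ad spend and sales figures are hypothetical, invented only to illustrate the arithmetic.

```python
import math

def linear_fit(xs, ys):
    """Least-squares line y = intercept + slope * x, plus the Pearson r value."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical monthly figures in thousands of dollars, invented for illustration.
ad_spend = [10, 12, 15, 18, 20, 25]
sales = [110, 118, 131, 140, 152, 170]

slope, intercept, r = linear_fit(ad_spend, sales)
r_squared = r ** 2  # share of the variation in sales that the fitted line accounts for
print(f"sales = {intercept:.1f} + {slope:.2f} * ad_spend, r = {r:.3f}, r^2 = {r_squared:.3f}")
```

The slope estimates how much sales move per unit of ad spend, while r describes how tightly the points follow that line. A steep slope with a low r is a noisy relationship, not a strong one.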
More advanced business intelligence tools are providing access to more sophisticated correlation analysis. For example, correlation clustering looks at sets of data such as customers or product attributes, and herds them into groups based on similar characteristics. The members of each group, or cluster, are highly correlated with one another, which allows marketing to analyze them as a group and determine the best way to service their needs based on similar attributes.
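The grouping idea can be sketched with a deliberately simple stand-in: put two customers in the same group whenever their spending profiles correlate at or above a threshold. This is a threshold-based grouping written for illustration, not the full correlation clustering algorithm a business intelligence tool may implement. The customer names, spending figures, function names, and the 0.8 threshold are all hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def group_by_correlation(profiles, threshold=0.8):
    """Group items whose profiles correlate at or above the threshold.

    Groups are built by following links from item to item, so two members
    may be joined through a third item rather than directly.
    """
    names = list(profiles)
    unassigned = set(names)
    groups = []
    for name in names:
        if name not in unassigned:
            continue
        unassigned.discard(name)
        group, frontier = [name], [name]
        while frontier:
            current = frontier.pop()
            for other in names:
                if other in unassigned and pearson_r(profiles[current], profiles[other]) >= threshold:
                    unassigned.discard(other)
                    group.append(other)
                    frontier.append(other)
        groups.append(sorted(group))
    return groups

# Hypothetical spend per customer across five product categories.
customers = {
    "A": [5, 9, 2, 8, 1],
    "B": [6, 10, 3, 9, 2],
    "C": [9, 2, 8, 1, 7],
    "D": [8, 1, 9, 2, 6],
}
print(group_by_correlation(customers))  # [['A', 'B'], ['C', 'D']]
```

Customers A and B favor the same categories, as do C and D, so each pair lands in its own group. The threshold is a judgment call: set it too low and unrelated customers merge, too high and every customer stands alone.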
Establishing correlations is hard work, and often our understanding changes over time. In each example above, other factors could cloud the correlation. For example, sales may spike due to overlapping ad campaigns that make it hard to tease out the true driver that deserves additional ad dollars. Company performance may track a benchmark for a period, such as GDP or a commodity, then suddenly change due to hedging or for idiosyncratic reasons. A supply chain team may find that two parts do not fit together well, forcing a breakage in assembly that is not due to a part defect.
This type of data mining is an immensely powerful tool, and enterprises need to use it well. We need to look for meaningful correlations while simultaneously not falling prey to false correlations. This is hard in a world of imperfect and imprecise data. To be at our best, we must understand our business drivers and review analyses with a critical, or even skeptical, eye to determine whether relationships exist, how strong they are, and whether they are useful in predicting outcomes. After all, not all tall people play basketball.
Bryan Lapidus, FP&A, is a contributing consultant and author to the Association for Financial Professionals. Reach him at BLapidus@AllegianceAG.com.