How can you improve your data analysis skills? It’s a critical question because data is so crucial to the success of your organization. Here are three steps you can take to improve your data analysis skills.
While data is everywhere, and statistical tools are proliferating in software, there is a rising premium placed on critical thinking related to data analysis projects. Knowing what could go wrong, and the questions you should ask to help it go right, will make you a better producer and consumer of statistical and data analyses.
During an AFP webinar on this topic, I touched on several questions I consider critical to statistical and data analyses. I have arranged these questions according to three general phases of an analytical project: preparation, analysis and conclusion.
A data analysis project starts with a business question to be answered. It then moves through three steps. The first asks whether we are solving the right problem. The second deals with data and methodology. The third deals with results and actions.
Q: Are we solving the right problem?
If we are not on the same page with the stakeholders, everything could go wrong. Therefore, we need to understand the business question, define goals explicitly, and make sure these goals are agreed to by relevant parties. Write them down as a shared agreement!
In a simple example, let’s say we are trying to assess the earnings management of selected companies. Sounds simple, but in practice we need to discuss and define “earnings management.” Is that an earnings surprise relative to management guidance, an analyst forecast, or a benchmarked group? What counts as “earnings”: net income, EBITDA, EBIT, or operating earnings? Do we normalize for unusual events or accruals?
Q: How do we deal with low-quality and/or unstructured data?
Do you trust your data? If you have doubts about the data, what do you do? You have to prepare and clean the data taking into consideration the limitations of the source.
In the early years, when transaction data were recorded manually, there were lots of errors. Using the data required cleanup via various filters based on business knowledge, such as the rule that a transaction value cannot be negative or larger than US GDP (that would be some trade!). This work was intensive but necessary before getting to the analysis stage.
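Business-rule filters of this kind are easy to express in code. Here is a minimal sketch in Python; the record layout, the field name `value`, and the upper bound are assumptions for illustration, not the filters used on the original data.

```python
# Assumed upper bound: roughly the size of annual US GDP, in dollars.
# The figure is illustrative; pick a limit that fits your own data.
MAX_PLAUSIBLE_VALUE = 25e12


def is_plausible(transaction):
    """Return True if the record passes the basic business-rule checks."""
    value = transaction.get("value")
    if not isinstance(value, (int, float)):
        return False  # missing or non-numeric value
    return 0 <= value <= MAX_PLAUSIBLE_VALUE


def split_by_plausibility(transactions):
    """Separate records into (kept, rejected) so rejects can be reviewed."""
    kept = [t for t in transactions if is_plausible(t)]
    rejected = [t for t in transactions if not is_plausible(t)]
    return kept, rejected


# Hypothetical records, made up for this example.
sample = [
    {"id": 1, "value": 1_250_000.0},
    {"id": 2, "value": -400.0},  # negative: likely a recording error
    {"id": 3, "value": 9e13},    # larger than the assumed GDP bound
    {"id": 4, "value": None},    # missing value
]

kept, rejected = split_by_plausibility(sample)
print([t["id"] for t in kept])      # ids that passed the filters
print([t["id"] for t in rejected])  # ids set aside for review
```

Keeping the rejected records, rather than silently dropping them, makes it possible to review them later and report how much data the filters removed.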
In a different example involving unstructured data, I studied the impact of stock spam emails on stock prices and volumes. These seemingly nonsensical spam messages do increase trading volume and have significant price impacts. However, it’s no easy task to analyze the spam messages. There are lots of misspelled words, intentional or not. For example, the number four would be spelled as “f0ur.” To process this data, we first selected a random sample and read through it to get a rough idea of what might go wrong. Then we developed a plan to address these challenges, applied machine learning to analyze a pilot dataset at a scale beyond what people could do, and then extended the analysis to the full dataset.
Q: What’s the right sample size?
It depends on the margin of error and the confidence level you have agreed on with your stakeholders. PRO TIP: For customer surveys or polls, if we are looking for a margin of error of 5 percent at the usual 95 percent confidence level, simply take the inverse of 5 percent to get 20. Then square it to get a sample size of 400.
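The shortcut is a rounded version of the standard sample-size formula for a proportion, n = z² · p(1 − p) / E². The sketch below compares the two. It assumes a 95 percent confidence level (z = 1.96), the most conservative proportion (p = 0.5), and a population large enough that no finite-population correction is needed; the function names are illustrative.

```python
import math


def sample_size(margin_of_error, z=1.96, p=0.5):
    """Sample size for estimating a proportion from a large population.

    z=1.96 corresponds to 95 percent confidence; p=0.5 is the most
    conservative assumption about the true proportion.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)


def rule_of_thumb(margin_of_error):
    """The shortcut: invert the margin of error, then square it."""
    return round((1 / margin_of_error) ** 2)


print(sample_size(0.05))    # 385
print(rule_of_thumb(0.05))  # 400
```

The shortcut slightly overshoots because 1.96² × 0.25 is about 0.96 rather than 1, so it errs on the safe side.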
Q: Why and how do we transform data?
We need to be mindful about the underlying data characteristics and perform transformation as needed. I’ll discuss two types of transformations.
First, in forecasting, we can calculate the forecast error as the actual value less the forecast. However, if we simply sum up the forecast errors, the positive ones cancel out the negative ones, which can lead to a false conclusion. PRO TIP: Here we can transform the errors either by taking their absolute values or by squaring them, averaging, and then taking the square root.
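To see the cancellation problem concretely, here is a small sketch with made-up actuals and forecasts. The averaged versions of the two transformations are commonly known as the mean absolute error (MAE) and the root mean squared error (RMSE); the function names below are illustrative.

```python
import math


def forecast_errors(actuals, forecasts):
    """Forecast error defined as actual value less forecast."""
    return [a - f for a, f in zip(actuals, forecasts)]


def mean_error(errors):
    """Plain average: positive and negative errors cancel out."""
    return sum(errors) / len(errors)


def mean_absolute_error(errors):
    """Average of absolute errors (MAE)."""
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_squared_error(errors):
    """Square the errors, average them, then take the square root (RMSE)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical data, made up for this example.
actuals = [100, 120, 90, 110]
forecasts = [110, 110, 95, 105]
errors = forecast_errors(actuals, forecasts)  # [-10, 10, -5, 5]

print(mean_error(errors))               # 0.0, which wrongly suggests a perfect forecast
print(mean_absolute_error(errors))      # 7.5
print(root_mean_squared_error(errors))  # larger than MAE: big misses weigh more
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does. Which one to report depends on how costly large errors are to the business.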
Second, when the ratio of max/min is greater than 10, we may look into log transformations, provided all values are positive. The log transformation is particularly useful for right-skewed data, as small numbers get spread out more and large numbers are squeezed more closely together. The result of the transformation is often closer to a normal distribution: symmetrical and with more even spread. Approximate normality is an assumption behind the t-tests that are used to evaluate whether the results are significant or not.
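A quick way to see the effect is to compare skewness before and after taking logs. The sketch below uses a deliberately artificial series (powers of two) so that the logged values are perfectly symmetric; real data will rarely be this tidy. The helper names and the threshold parameter are illustrative.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


def worth_trying_log(values, threshold=10):
    """Rule of thumb: all values positive and max/min above the threshold."""
    return min(values) > 0 and max(values) / min(values) > threshold


# Artificial, strongly right-skewed data (max/min = 256).
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(v) for v in raw]

print(worth_trying_log(raw))       # True
print(round(skewness(raw), 2))     # clearly positive: long right tail
print(round(skewness(logged), 2))  # essentially zero: symmetric after the log
```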
In summary, for data transformations, it’s important to look at two factors: the shape of the distribution and the ease of interpretation.
Q: How do we deal with outliers?
We generally use visual plots to locate outliers. We then compute statistics with and without outliers. If conclusions are not affected by outliers, we report results with the full dataset. Otherwise, we need to examine outliers carefully to see what else can be learned. Are there recording errors? If so, correct them. Do outliers come from different populations? If yes, report results excluding the outliers and the reason for exclusion. If not, report results of both analyses and call for further investigations, such as performing a residual analysis. Sometimes, much can be learned from studying true outliers!
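The "with and without" comparison can be sketched as follows. Plots remain the first step; here the common 1.5 × IQR fence stands in as a numeric way to flag candidates, which is an assumption of this example rather than a rule from the discussion above. The data are made up.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the quartiles (a common convention)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def summarize(values):
    """Summary statistics to compare across the two versions of the data."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Hypothetical observations with one suspicious value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

outliers = iqr_outliers(data)
without = [v for v in data if v not in outliers]

print(outliers)            # [95]
print(summarize(data))     # statistics on the full dataset
print(summarize(without))  # statistics excluding flagged values
```

In this example the mean moves a lot while the median barely changes. That signals that conclusions based on the mean are sensitive to the flagged value, so it deserves a closer look before deciding how to report.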
Q: How do we decide which variables to include in or exclude from the analysis?
There are two good reasons for reducing a large number of explanatory variables to a smaller set. First: simplicity is preferable to complexity. Second: unnecessary terms in the model yield less precise inferences.
We can use statistical tools to see whether a variable adds value to the equation or not. Statistical models are almost never exact, so most software estimates the information lost by applying different models; the smaller the information loss, the better the model. A general three-step strategy for dealing with many explanatory variables would look like this:
- Identify the key objectives
- Screen the available variables using exploratory analysis to decide on a list that is sensitive to the objectives
- Use information criteria to find a suitable subset of explanatory variables. The two most common are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
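As a small illustration of the last step, the sketch below compares two candidate models for the same made-up data: one with only an intercept and one that adds a single explanatory variable. It uses the least-squares forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), which hold up to an additive constant under normally distributed errors. Statistical packages may report values on a different scale, so only differences between models fitted to the same data are meaningful.

```python
import math


def rss_intercept_only(y):
    """Residual sum of squares when predicting every y with the mean."""
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)


def rss_simple_linear(x, y):
    """Residual sum of squares for the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def aic(rss, n, k):
    """Least-squares AIC, up to a constant; k = number of coefficients."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """Least-squares BIC, up to a constant; penalizes extra terms more."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical data, made up for this example.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(y)

aic_mean = aic(rss_intercept_only(y), n, k=1)
aic_line = aic(rss_simple_linear(x, y), n, k=2)
bic_mean = bic(rss_intercept_only(y), n, k=1)
bic_line = bic(rss_simple_linear(x, y), n, k=2)

print(aic_mean, aic_line)  # the lower value indicates less information lost
print(bic_mean, bic_line)
```

Here both criteria favor the model that includes x. With many candidate variables, the same comparison is repeated across subsets, and BIC tends to select smaller models than AIC because its penalty grows with the sample size.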
Q: Is the analysis robust?
Here we partition the data in many ways, slicing it from different angles to test whether the results still hold. Is the analysis sensitive to time periods? How about business cycles? How about countries or locations? How about across departments? Obviously, the objective of the analysis will drive these partitions.
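One simple way to run such checks is to recompute the estimate within each slice and see whether the sign and rough size hold up. The sketch below assumes each record already carries an estimated effect plus the fields to slice on; the field names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_by_partition(records, key, value="effect"):
    """Average the estimate within each slice defined by `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}


def same_sign(estimates):
    """True if every slice agrees on the direction of the effect."""
    values = list(estimates.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)


# Hypothetical results, made up for this example.
records = [
    {"period": "2019", "region": "US", "effect": 0.8},
    {"period": "2019", "region": "EU", "effect": 0.6},
    {"period": "2020", "region": "US", "effect": -0.3},
    {"period": "2020", "region": "EU", "effect": -0.1},
    {"period": "2021", "region": "US", "effect": 0.7},
    {"period": "2021", "region": "EU", "effect": 0.5},
]

by_region = estimate_by_partition(records, "region")
by_period = estimate_by_partition(records, "period")

print(by_region, same_sign(by_region))  # holds across regions
print(by_period, same_sign(by_period))  # flips in one period
```

In this made-up case the result holds across regions but reverses in one period, which is exactly the kind of finding that should be reported alongside the headline result.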
Q: How do we interpret and communicate results?
In a recent AFP Guide on leveraging business statistics, we present a case that predicts customer churn by running a logit regression. One of the predictors is whether the account involves any customer disputes. The regression coefficient for the variable Disputed is 1.885. How do we interpret this coefficient? We undo the log transformation and calculate an odds ratio by raising e to the power of 1.885. The result is e^1.885 ≈ 6.6, i.e., the odds of churning are 6.6 times higher if the account is Disputed, holding all other factors constant.
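The calculation is a one-liner, but it is worth checking what the odds ratio implies for probabilities, since it multiplies the odds rather than the probability itself. In the sketch below, the coefficient 1.885 comes from the case above; the 10 percent baseline churn probability is a made-up assumption for illustration.

```python
import math

COEF_DISPUTED = 1.885  # logit coefficient quoted in the case

odds_ratio = math.exp(COEF_DISPUTED)


def prob_to_odds(p):
    """Convert a probability to odds, p / (1 - p)."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


def churn_probability_if_disputed(baseline_probability):
    """Apply the odds ratio to a baseline churn probability."""
    return odds_to_prob(prob_to_odds(baseline_probability) * odds_ratio)


baseline = 0.10  # assumed baseline churn probability, for illustration only

print(round(odds_ratio, 1))                               # 6.6
print(round(churn_probability_if_disputed(baseline), 3))  # 0.423
```

Under that assumed baseline, the churn probability rises from 10 percent to about 42 percent. The odds grow by a factor of 6.6, while the probability grows by a factor of roughly 4.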
We may anticipate a question such as, “Why do we take all the trouble doing the transformation from probability to log odds?” It is usually difficult to model a binary variable directly. Transforming a binary variable into real numbers (from negative infinity to positive infinity) is desirable in statistical analysis and involves three steps. The first step is to transform it into a probability, which is a continuous variable restricted to the range between 0 and 1. To extend the upper bound from 1 to positive infinity, the probability is further transformed to odds, p / (1 - p). Finally, a log transformation maps positive numbers onto the whole real line. Although the log transformation is not the only choice for this purpose, it is the easiest to understand and interpret. This transformation is called the logit transformation. The other common choice is the probit transformation.
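The chain of transformations can be traced in a few lines of code. This is a minimal sketch with illustrative function names; the probabilities in the loop are arbitrary examples.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line via log odds."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    odds = p / (1 - p)     # range: 0 to positive infinity
    return math.log(odds)  # range: negative to positive infinity


def inverse_logit(x):
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))


for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    odds = p / (1 - p)
    print(p, round(odds, 3), round(logit(p), 3))
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0. Probabilities below 0.5 map to negative log odds, and those above 0.5 map to positive log odds, symmetrically.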
The lesson here is that if you use a statistical tool, it’s essential to understand it well so that you can interpret the results correctly. Moreover, it is equally important to communicate the results to the audience in an easy-to-understand manner. In summary, as finance professionals, we wear many hats: knowledge of finance, statistical skills, and the art of communication. That’s why finance is a highly valued and respected profession. Let’s keep it up!
Dr. Bill Hu, FP&A, CTP, CFA is president of Techfin and a professor of finance at Arkansas State University. Email him at email@example.com.