Big data, digital transformation, agile development. There are many terms to describe how our personal and professional lives are awash in data and how we manage it. We need to know both what these terms mean and what they mean for us in our finance roles. This article is the second in an occasional AFP series to familiarize you with key terms of the information age and their implications for FP&A and treasury. This installment is about unstructured data, and is the complement to last week’s discussion of structured data. Elements of “big data” are covered in both.
THE JARGON: Unstructured data is easy for people to understand, but often difficult for machines because it does not lend itself to the codified rules of a data model. It may be textual—think of emails where individuals write their own subject line, and the body contains unique actions, summaries or content. It may be visual—think YouTube videos or the photos on your phone. For data analysis, the breakthroughs have been in developing tools that can read this data and add structure in a way that allows for analysis. Examples include geo-tagging, facial recognition, text-recognition, and audio-to-text.
All these tools rely on intensive processing power, and have led to a wave of IT innovation. For example, Hadoop is an open-source software framework that takes huge data sets, distributes them across hundreds or thousands of servers through the Hadoop Distributed File System (HDFS), and then responds to a data request by using MapReduce to locate the relevant data and process the query. The innovation is spreading the query across many servers and many processors, in contrast to a relational database that brings fewer processors to the task (but needs fewer because its data is already defined by dimensions and other tags). It is analogous to Tom Sawyer painting a fence by himself versus having hundreds of friends brushing simultaneously. Hadoop and its ubiquitous elephant logo are governed by the Apache Software Foundation.
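The map-and-reduce idea described above can be illustrated with a minimal sketch in Python. This is not Hadoop's API: the function names, the sample text, and the use of a local thread pool in place of a server cluster are all assumptions made for illustration. Each chunk of text is processed independently (the "map" step), and the partial results are then grouped and combined (the "reduce" step).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values collected for each key into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    # The thread pool stands in for a cluster: each chunk is mapped independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical sample data: three short "documents" split across workers.
chunks = [
    "invoice overdue please pay",
    "please confirm invoice receipt",
    "payment received thank you",
]
counts = word_count(chunks)
print(counts["invoice"], counts["please"])  # 2 2
```

Because no chunk depends on another during the map step, adding more workers (or, in a real cluster, more servers) lets more of the fence be painted at once.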
Another development supporting big data is the so-called “data lake,” an architecture that seeks to store all available data in raw form and defers the jobs of sorting, classifying and organizing until the time of analysis. As a result, data collection and preparation time are greatly reduced compared to a data warehouse, and the data sets are large and sometimes mistaken for being comprehensive. Analyzing and manipulating unstructured data requires specialized tools, including NoSQL databases (variously read as “No SQL” or “Not Only SQL”).
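To make "organizing at the time of analysis" concrete, here is a small sketch, assuming a toy data lake held in a Python list. The record formats, field names, and the `read_invoice` helper are hypothetical. The point is that records are stored exactly as they arrive, and structure is imposed only when a specific question is asked.

```python
import json

# Raw records are stored as they arrive, with no schema applied up front.
raw_lake = [
    '{"type": "invoice", "amount": 1200.50, "customer": "Acme"}',
    "customer=Globex;amount=310.00;type=invoice",
    "Call transcript: customer asked about a late shipment",
]


def read_invoice(record):
    """Apply an invoice schema at read time; return None if the record does not fit."""
    try:
        fields = json.loads(record)
    except json.JSONDecodeError:
        if "=" not in record:
            return None  # free text: not usable for this particular question
        # Fall back to a simple key=value;key=value layout.
        fields = dict(part.split("=", 1) for part in record.split(";"))
    if not isinstance(fields, dict) or fields.get("type") != "invoice":
        return None
    return {"customer": fields["customer"], "amount": float(fields["amount"])}


invoices = [inv for inv in map(read_invoice, raw_lake) if inv is not None]
total = sum(inv["amount"] for inv in invoices)
print(len(invoices), total)  # 2 1510.5
```

Note that the call transcript is silently skipped. That is one way a data lake can look comprehensive while a given analysis draws on only part of it.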
As you can tell, the underlying requirements for these tools are massive server capabilities and communication—that is, cloud computing. A query can be sent to the cloud, processed at the server farms, and a response sent back.
WHY IT MATTERS: The boom in both structured and unstructured data is the defining characteristic of our time. Finance needs to watch two areas where things can go wrong: how effectively we challenge the outcomes of the analysis presented by our partners, and how we invest to build our big data capability.
There is a common conception that if all the data is captured, there is no longer a need for sampling or even for developing theories and hypotheses about the data, because we can simply test everything. This is called the “N=All” view. However, huge data sets can lead analysts to find spurious correlations that hold no true meaning. The website http://tylervigen.com/old-version.html lists examples of correlations found in data that are coincidental rather than causal. For example, the consumption of cottage cheese is correlated with the number of PhDs in civil engineering. Also, the assumption of complete data is not the same as actual complete data. It may be possible to analyze the transcripts of every customer service call to take the pulse of your customers, but that should not be confused with the attitudes of all your customers, since many of them never call. The caution here is that sampling error and sampling bias can exist in big data. For some historical background, check out this story involving the start of the Gallup organization. We FP&A professionals need to challenge the assumption that big data has all the answers.
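The spurious-correlation risk can be demonstrated with a short simulation. This is a sketch under stated assumptions: the series are pure random noise, and the series count, length, and seed are arbitrary choices. It generates one target series and 1,000 unrelated candidates, then reports the strongest correlation found purely by chance.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_series=1000, n_points=10, seed=0):
    """Search many unrelated random series for the one that best 'explains' a target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


best_r = strongest_chance_correlation()
print(f"Strongest correlation among unrelated series: {best_r:+.2f}")
```

With only ten data points per series, searching a thousand candidates will almost always turn up at least one that looks strongly related to the target, even though every series is noise. The more variables a data set holds, the more such coincidences an unguided search will find.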
Investing in big data means more than just signing a contract with a vendor. It means building a corporate capability, which implies sub-investments in people, process, and company assets.
People: As with most new ventures, there is no substitute for expertise. Analyze your current team to see if they have the right skills and, if not, bring in experts. Organizationally, does it make sense to create new positions in the org chart such as a chief data officer? What hardware, software, or telecommunications bandwidth do you need?
Process: Start with small projects and develop a learning methodology as you go. Develop a list of questions and hypotheses to test the capabilities and utilization of your data. Measure the amount of time spent in pre-processing (extraction, transformation, and loading) versus analysis.
Assets: These projects may be technology-light if the servers and analytics are outsourced, or technology-heavy if you host them in-house. The answer depends on the interaction between the business leaders and technology leaders.
All three elements can be supplemented through third-party partners to act as a bridge or as a fixed solution. Data projects gather lots of investment dollars and interest because they can pay off handsomely for an organization. Our job in finance is to ensure that scarce capital is allocated well.
Bryan Lapidus, FP&A, is a contributing consultant and author to the Association for Financial Professionals. Reach him at BLapidus@AllegianceAG.com.