By Amit Sharma
Analytics, in its complete definition, consists of four main types: descriptive, diagnostic, predictive and prescriptive. The first two have always been delivered via data warehouses and BI systems. With maturing Big Data capabilities, feasible business cases for the latter two types are now well established and industrialized. They are popularly called Advanced Analytics.
It is also established that Advanced Analytics can be delivered without an underlying data warehouse, but that would waste the opportunity to use existing investments and harness the value of the data already held in data warehouses. The maturity of the Big Data family of technologies offers the best way for data warehouses to scale up for Advanced Analytics quickly and economically. Hybrid cloud models have made this even more attractive.
Big Data has also substantially changed the way statistical analytical models are built. Traditionally, these were created with only representative data samples in the first iteration, followed by multiple time-consuming recalibration cycles. With Big Data, it is now feasible to create even the first pass of a model using much larger, and at times complete, data sets. This not only allows additional predictor variables to be introduced quite early, but also increases forecast accuracy manyfold and reveals trends that could not be discovered with representative samples alone.
Making a data warehouse scale up to deliver Advanced Analytics may need transformations. Advanced Analytics draws on a combination of data sets such as voice, images, sensor feeds and geospatial data. The majority of data warehouses can process only structured data sets stored in standard relational databases. Data warehouse infrastructure must open up to acquire, combine and co-process data in varying formats, e.g., files, server logs, databases, images, sensor feeds and many more.
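As a rough illustration of acquiring and combining data in varying formats, the sketch below maps three inputs (a CSV extract, JSON sensor events and server log lines) onto one common record shape. The field names, the log layout and the function names are assumptions made for this example only, not a reference to any particular product.

```python
import csv
import io
import json
import re

# Hypothetical log layout: "<timestamp> <LEVEL> user=<id> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) user=(?P<user>\S+) (?P<msg>.*)$")

def from_csv(text):
    """Structured extract, e.g. rows exported from a relational database."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"source": "database_extract", "entity_id": row["customer_id"], "payload": dict(row)}

def from_json_lines(text):
    """Semi-structured events, one JSON object per line (e.g. a sensor feed)."""
    for line in text.splitlines():
        if line.strip():
            event = json.loads(line)
            yield {"source": "sensor", "entity_id": str(event["device_id"]), "payload": event}

def from_server_log(text):
    """Free-text log lines; lines that do not match the assumed layout are skipped."""
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            yield {"source": "server_log", "entity_id": match["user"], "payload": match.groupdict()}

def acquire(csv_text, json_text, log_text):
    """Combine all three inputs into one list of records with a common shape."""
    records = []
    records.extend(from_csv(csv_text))
    records.extend(from_json_lines(json_text))
    records.extend(from_server_log(log_text))
    return records
```

Once every source lands in the same shape, downstream processing can treat a sensor reading and a database row alike and join them on the shared identifier.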
This diverse data will arrive at a variety of speeds, from daily batches to real-time streams (e.g., sensor data from healthcare devices). Making provisions for data acquisition at various speeds is essential. Due consideration must also be given to restartability options in the event of a data transmission failure.
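One common way to provide restartability is to checkpoint the position reached in the feed, so that a rerun resumes from there instead of from the beginning. The following is a minimal sketch under simplifying assumptions: the feed is an in-memory list, the target is a Python list, and the `fail_at` parameter exists only to simulate a transmission failure. All names are hypothetical.

```python
import json
import os

def load_checkpoint(path):
    """Return the index of the next record to process (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]

def save_checkpoint(path, next_offset):
    """Write to a temporary file and rename it, so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, path)

def ingest(records, sink, checkpoint_path, fail_at=None):
    """Append records to sink, resuming from the last saved offset.

    fail_at (illustration only) raises an error at that record index
    to mimic a broken transmission.
    """
    start = load_checkpoint(checkpoint_path)
    for offset in range(start, len(records)):
        if fail_at is not None and offset == fail_at:
            raise ConnectionError(f"simulated failure at record {offset}")
        sink.append(records[offset])
        save_checkpoint(checkpoint_path, offset + 1)
    return len(records) - start  # number of records handled in this run
```

Because the write to the target and the checkpoint update are two separate steps, a crash between them would replay one record on restart. A production design would therefore need either idempotent loads or a transactional write of data and checkpoint together.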
Some banks have found that their ability to predict a possible loan default improves by 15% when they include summaries of branch interactions about other loans. It improves severalfold further when data from reminders sent over the past three years is added.
However, for one discrete entity of a business domain (e.g., a retail customer), data warehouses have always preferred to source data from only one IT system and, where that is not possible, reluctantly from the minimal, necessary and sufficient number of IT systems.
Generally, multiple sources arise because different systems are built for different lifecycle stages (customer acquisition, servicing, etc.), because systems are built to cater to specific functions (e.g., employee data in payroll and training systems), or because systems are built by departments in silos.
Be it due to technology, people or process reasons, this has been a major complaint among data warehouse sponsors. In many cases it is attributed to a lack of metadata, a large number of data layers, slow governance processes, the difficulty of getting so many stakeholders together, and so on.
For delivering Advanced Analytics, these become major showstoppers. Therefore, carefully examine the reasons in your own case and build in enough flexibility for faster acquisition and processing of more data sources.
In the name of performance and resource optimization, data warehouses eventually drift towards keeping only minimal data history online (cold data offloads). This directly contradicts readiness for Advanced Analytics. For example, in stock markets, over a business cycle of four years, the primary, intermediate and short-term trends last up to 2 years, 9 months and 6 weeks respectively, which cannot be discovered with just a few years' data. Thus, data warehouses must look at ways of economically making their entire detailed historical data available for analytical discovery.
Data models for many large-scale data warehouses are designed in third or higher normal form. Advanced Analytics, however, needs its data together. A conscious move toward denormalization is therefore essential. Another consideration is to upgrade data hubs and data stores to data lakes.
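To make the denormalization point concrete, here is a small sketch using Python's built-in sqlite3 module. Three normalized tables are flattened into a single wide table with one row per loan, which is the shape a predictive model typically consumes. The table names, columns and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables (hypothetical schema)
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
CREATE TABLE reminder (reminder_id INTEGER PRIMARY KEY, loan_id INTEGER, sent_on TEXT);

INSERT INTO customer VALUES (1, 'retail'), (2, 'retail');
INSERT INTO loan VALUES (10, 1, 5000.0), (11, 2, 12000.0);
INSERT INTO reminder VALUES (100, 10, '2015-01-05'), (101, 10, '2015-02-05');

-- Denormalized analytical table: one row per loan, with customer
-- attributes and the reminder count carried alongside
CREATE TABLE loan_features AS
SELECT l.loan_id,
       c.customer_id,
       c.segment,
       l.amount,
       COUNT(r.reminder_id) AS reminders_sent
FROM loan l
JOIN customer c ON c.customer_id = l.customer_id
LEFT JOIN reminder r ON r.loan_id = l.loan_id
GROUP BY l.loan_id, c.customer_id, c.segment, l.amount;
""")

rows = conn.execute(
    "SELECT loan_id, segment, amount, reminders_sent FROM loan_features ORDER BY loan_id"
).fetchall()
```

The LEFT JOIN keeps loans that never received a reminder, with a count of zero, so the flattened table does not silently drop records the model needs to see.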
Amit Sharma is Principal Technology Architect, Infosys.