Digit Insurance recently implemented a data lake, completing its portfolio of tools for data exploration, reporting, analytics, machine learning, and visualization. There are differing definitions, a lot of jargon, and varied tool stacks around data lakes. Understanding the business requirements and iteratively trying out technical architectures to find the one best suited to the company forms the crux of the journey. External data is also stored on the data lake.
Below are the use cases currently implemented:
- Improving customer journey in motor insurance – Here we can directly populate the vehicle details from the vehicle registration number. This aids us in our objective of simplifying and speeding up consumer journeys.
- Fraud algorithm – Our claims fraud algorithm runs a complex model considering both internal and external data sources. This balances our simplified claims processes, making sure our fraud algorithms are strong enough to weed out any non-admissible claims.
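The first use case, populating vehicle details from a registration number, can be pictured with a minimal sketch. The function names, the field names, and the in-memory `registry` dictionary are all hypothetical stand-ins for whatever external source the data lake actually holds; the article does not describe Digit's implementation.

```python
def normalise_registration(reg_no):
    """Upper-case the number and drop spaces, hyphens, and other punctuation."""
    return "".join(ch for ch in reg_no.upper() if ch.isalnum())


def prefill_vehicle_details(reg_no, registry):
    """Return form fields for a quote, pre-filled when the vehicle is known.

    `registry` is a hypothetical dict keyed by normalised registration number.
    """
    key = normalise_registration(reg_no)
    record = registry.get(key)
    if record is None:
        # Unknown vehicle: the customer fills in the details manually.
        return {"registration_number": key, "prefilled": False}
    return {"registration_number": key, "prefilled": True, **record}


# Made-up sample data, for illustration only.
sample_registry = {
    "KA01AB1234": {"make": "ExampleMake", "model": "ExampleModel", "fuel": "Petrol"},
}

form = prefill_vehicle_details("ka-01 ab 1234", sample_registry)
```

Normalising the input before the lookup means that differently typed versions of the same number resolve to the same record, which is what lets the form skip the manual entry step.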
One of the benefits of implementing the data lake is how easily new data can be brought in. “We just have to input the new data to the ingestion engines built by our data engineering team. It automatically reads the data and maintains a catalogue. We are already evaluating various IoT solutions which can benefit our business processes. For the customer journey use case – we have already seen an improved efficiency of around 20%,” says Vishal Shah, Head of Data Sciences, Digit Insurance.
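To make the idea of an ingestion engine that "reads the data and maintains a catalogue" concrete, here is a minimal sketch using only the Python standard library. It assumes CSV input and a simple in-memory catalogue; the class names, the type-inference rule, and the sample data are illustrative assumptions, not a description of the engine Digit built.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    name: str
    columns: dict  # column name -> inferred type label
    row_count: int


@dataclass
class Catalogue:
    entries: dict = field(default_factory=dict)

    def register(self, entry):
        self.entries[entry.name] = entry


def infer_type(values):
    """Return 'int', 'float', or 'str' for a list of raw string values."""
    if not values:
        return "str"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except (ValueError, TypeError):
            continue
    return "str"


def ingest_csv(name, text, catalogue):
    """Read CSV text, infer a schema, and record it in the catalogue."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    columns = {col: infer_type([row[col] for row in rows]) for col in fieldnames}
    entry = CatalogueEntry(name=name, columns=columns, row_count=len(rows))
    catalogue.register(entry)
    return entry


# Made-up sample data, for illustration only.
sample = "reg_no,engine_cc,premium\nKA01AB1234,1197,5400.50\nMH12CD5678,1498,7200.00\n"
catalogue = Catalogue()
entry = ingest_csv("motor_quotes", sample, catalogue)
```

The point of the sketch is that the person supplying the data only hands over the file; the schema and row count are recorded automatically, so the catalogue stays current without manual upkeep.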
Digit started the initial PoCs and experimentation with various technologies around the end of 2017 and implemented the production data lake in Feb 2018. AWS S3 is our data lake storage platform. S3 enables us to have a centralized data architecture and provides unparalleled scalability, the ability to store any kind of file, and seamless integration with various services. Having a centralized data store also helps to create a comprehensive data catalogue.
When asked why one would use a data lake over the traditional data repository technique of a data warehouse, where data is stored in files and folders, Shah says, “At Digit, we believe there are use cases for both data warehouse and data lake to co-exist together. Neither of the solutions can replace the other.”
The data lake helps the data science team leverage both external and internal data for various kinds of analytics and for building problem-solving models. The distributed SQL query engine Presto has also been implemented to run interactive analytic queries against various kinds of data sources.
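As an illustration of querying several data sources through Presto, the sketch below composes a join across two catalogs. Presto addresses tables as `catalog.schema.table`, which is what allows one query to span sources. The catalog, schema, table, and column names used here are hypothetical, and the helper only builds the SQL text; submitting it would be done with whichever Presto client is in use.

```python
import re

# A simple identifier, and a fully qualified catalog.schema.table name.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_QUALIFIED = re.compile(rf"{_IDENT}(\.{_IDENT}){{2}}")


def build_federated_query(lake_table, warehouse_table, join_key, limit=100):
    """Compose a Presto SQL join between a lake table and a warehouse table.

    Names are validated as plain identifiers because they are interpolated
    into the SQL text; they are assumed to come from trusted configuration.
    """
    for name in (lake_table, warehouse_table):
        if not _QUALIFIED.fullmatch(name):
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    if not re.fullmatch(_IDENT, join_key):
        raise ValueError(f"invalid join key: {join_key!r}")
    return "\n".join([
        "SELECT *",
        f"FROM {warehouse_table} AS w",
        f"JOIN {lake_table} AS l ON w.{join_key} = l.{join_key}",
        f"LIMIT {int(limit)}",
    ])


# Hypothetical names: an S3-backed table and a relational policy table.
query = build_federated_query(
    "hive.external.vehicle_registry",
    "postgresql.public.policies",
    "reg_no",
    limit=10,
)
```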
An agile approach was used for the development. “Our teams collaborate while working on any idea and implementation. Therefore, we did consult the business teams wherever required on regulatory and other aspects. It also helped that we had good business knowledge within the team itself,” says Shah.
The data lake was built entirely in-house by the data engineering team.
The data lake has been integrated with the enterprise data warehouse and other IT systems. That integration is the real power of a data lake: it can merge both sources to provide a unified analytics view. In many customer personalization journeys, external and internal data are consumed together to provide an offering suited to the customer.