Questions to ask before creating your data lake

If data were oil, we would first need to find the deposit! Storing and accessing big data is a core issue within the overall Business Intelligence process. That’s why the concept of the data lake is worth considering: a data lake can quickly store large volumes of data and make them available to all kinds of analysis and processing tools. If you want to get a better idea of how it can support your BI strategy and how it fits within corporate data architecture, here are some fundamental questions to ask.


Author: Charles Parat


#1 What is a data lake?

A data lake is a way of collecting and storing large volumes of data that has gained popularity with the growth of big data since 2010. It shares the defining features of big data: it quickly and permanently stores a large volume of data in a wide variety of formats and from a wide variety of sources. Data is stored in the lake in its original format or with minimal transformation. The data lake is usually fed by data capture mechanisms (rather than ETL/ELT mechanisms), to ensure raw replication from qualified sources. It’s a good idea to equip the data lake with very-high-capacity analysis, preparation and artificial intelligence functionalities that can work with a range of potentially heterogeneous formats. That way you can explore and identify its contents.
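The idea of storing data "in its original format" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `land_raw`, the `raw/<source>/` folder layout and the `.meta.json` sidecar file are hypothetical choices, not a standard. The point is that the payload is written byte for byte, with no parsing, while a small metadata record keeps track of where it came from.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under <lake_root>/raw/<source>/ with a metadata sidecar.

    The folder layout and sidecar format are illustrative assumptions.
    """
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    data_path = target_dir / name
    data_path.write_bytes(payload)  # no parsing, no transformation

    metadata = {
        "source": source,
        "name": name,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = target_dir / (name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path


# Two heterogeneous sources land side by side, each in its own format.
lake = Path(tempfile.mkdtemp())
csv_path = land_raw(lake, "crm", "customers.csv", b"id;name\n1;Alice\n")
json_path = land_raw(lake, "web", "clicks.json", b'{"page": "/home", "user": 1}')
```

Because nothing is interpreted at capture time, a CSV export and a JSON event can sit in the same lake; making sense of them is left to the exploration tools described above.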


#2 What is the difference between a data lake and a data warehouse?

Every part of the architecture serves a purpose, whether a user need or a technical/functional requirement. Data lakes serve the need to capture and store data generated by other internal or external systems in a single container (which may be physical or logical) according to given times and levels of detail. We know which upstream systems should feed into the data lake, but we have not necessarily examined the specific contents from each of these sources.

A data warehouse, on the other hand, is a repository focused on common business intelligence uses for one or more subjects. In theory, it is designed to contain only useful, classified, organized data, and we know how to map out its uses at a given time.

For obvious reasons related to processing security and simplification, a data warehouse is often built from a layer of data copied from specific sources that we call an ODS (Operational Data Store). This layer is often mistakenly called a data lake. It is only used on a temporary basis, until it is clear that the newest data warehouse update will not have to be redone if there is a production incident.

A data lake, on the other hand, is permanent. It serves as a reliable, asynchronous and sometimes real-time reflection of upstream IT systems. So, if the data lake includes the data that is relevant to the data warehouse subject, there may be no need for an ODS, and the data warehouse becomes a client of the data lake.

IT professionals in the BI world responded quickly to technical constraints related to processing time, security and the simplification of storage media serving different needs. They designed departmental views that sit very close to analysis and reporting tools, which we call data marts, and which serve very specific uses for data that usually originates from data warehouses. Data marts act as the final storage space for business data.

But today, there are far fewer technical constraints, and we can often use fewer technical components in this architecture by opting for very high-capacity data engines that can integrate all types of analysis, reporting and AI tools directly on the front end. Columnar and/or NoSQL database cloud solutions and data virtualization software may be able to manage a data lake as well as a data warehouse. From a logical point of view, the collection and storage levels (and possibly data marts as well) serve very different needs, but could potentially be managed by a single software engine. The key to maintaining and developing the architecture will then be proper governance of the model for each level.


#3 Which data should we collect, how should we classify the data and who should have access to the data lake?

The first question has to do with the scope of the data that should feed into the data lake. Given the time and money spent creating, producing and maintaining this often new component in data architecture, it is worth asking WHY and FOR WHOM. But we can also envision physical or logical contents that include all of the company’s data, both internal and external, regardless of the digital processes that generate it. This is just an option. Usually, the data lake starts with a need that is limited to one key use, then the scope of its data sources grows as it becomes more popular or as new needs appear.

By asking these scope questions, you can determine the “data management” of the data lake, its sources and its clients in the IT system architecture and, beyond that, its data management policy and governance.

A word of warning: current regulations do not allow you to collect data first and work out what it contains later. If you are storing or using personal data, for example, complying with the GDPR (General Data Protection Regulation) in practice calls for a taxonomy that classifies data by how critical and sensitive it is. You must be able to identify, anonymize and erase personal data. This data is also subject to regulatory retention periods.

Data classification or taxonomy will also help you determine data types, user groups and potential use scenarios to grant access rights and guide users as they explore the data. Use scenarios are actually use cases that will also help you measure the success of your data lake based on how it is used by those in various roles and departments at the company.
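A data taxonomy of this kind can be sketched as a small catalog. Everything in the example is a hypothetical illustration: the three sensitivity levels, the user groups, and the retention periods are placeholders, not values prescribed by the GDPR or by any tool. The sketch shows how one classification can drive both access rights and retention checks.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class Sensitivity(IntEnum):
    """Illustrative sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    PERSONAL = 2  # personal data that must be identifiable and erasable


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    sensitivity: Sensitivity
    retention_days: int  # placeholder values, not legal advice
    collected_on: date

    def is_expired(self, today: date) -> bool:
        """True once the retention period has elapsed."""
        return today > self.collected_on + timedelta(days=self.retention_days)


# Hypothetical user groups and the highest level each one may read.
CLEARANCE = {
    "marketing": Sensitivity.INTERNAL,
    "data_protection_office": Sensitivity.PERSONAL,
}


def can_read(group: str, entry: DatasetEntry) -> bool:
    """Unknown groups fall back to PUBLIC access only."""
    return CLEARANCE.get(group, Sensitivity.PUBLIC) >= entry.sensitivity


catalog = [
    DatasetEntry("product_catalog", Sensitivity.PUBLIC, 3650, date(2020, 1, 1)),
    DatasetEntry("customer_emails", Sensitivity.PERSONAL, 365, date(2020, 1, 1)),
]

today = date(2021, 6, 1)
to_erase = [entry.name for entry in catalog if entry.is_expired(today)]
```

With this sample catalog, `to_erase` contains only `customer_emails`, and the `marketing` group can read `product_catalog` but not `customer_emails`.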


#4 What are the technological building blocks for our data lake?

A data lake is built from data storage components that work together with metadata storage (descriptive information and data classification), access management, and value creation functionalities: data preparation, quality assurance, visualization, analysis and artificial intelligence (especially for predictive analysis). Comprehensive environments that include all of these building blocks are called data platforms.

Most use Hadoop, an open-source framework used to store and process big data. Hadoop (often paired with the Spark computation engine) offers one big advantage: cluster computing. Data storage is distributed across the nodes of a cluster, and processing is parallelized and therefore faster, which matters because performance is one of the major challenges in big data projects. Another advantage of Hadoop is the very reason it is so complex and versatile: its ecosystem. This ecosystem includes over a hundred technologies for building data lakes tailored to each company’s needs.
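The principle behind cluster computing (split the data, process each part separately, then combine the partial results) can be illustrated with the Python standard library alone. This is a toy sketch, not the Hadoop or Spark API: threads stand in for cluster nodes, and the function names are hypothetical. It shows the data flow rather than the speed-up, since Python threads do not run CPU-bound work in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Spread records round-robin across n_nodes partitions."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def count_words(lines):
    """'Map' step: each simulated node counts words in its own partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, n_nodes=3):
    """Run the map step on every partition, then merge the partial counts."""
    parts = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:  # 'reduce' step: combine per-node results
        total.update(partial)
    return total


logs = [
    "error disk full",
    "login ok",
    "error timeout",
    "login ok",
    "error disk full",
]
totals = distributed_word_count(logs)
```

The merged result is the same as counting the whole list in one pass, which is what makes this kind of processing safe to distribute.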

You can choose to install your data lake locally or in the cloud, or even in “data lake as a service” mode (with solutions such as Microsoft Azure, Google Cloud Platform, Amazon S3). Solution BI teams can also help you with a multicloud approach, providing performance and redundancy in case one of the cloud platforms fails.


#5 What are the potential risks of a data lake?

Earlier in this article, we mentioned governance challenges with issues of data classification, access rights management and data lake security. Your data lake may contain sensitive, confidential or personal data, which exposes you to risks related to compliance as well as to data theft or loss.

One of the biggest dangers may actually stem from excessive data collection, especially if it is unmanaged. “Dark data” (data that companies store but never use or analyze) is estimated to comprise up to 50% of a data lake. Sometimes this data is not even cataloged. According to a study by Veritas Technologies, up to 52% of corporate data may be “dark data”!

But this dark data may also hold an opportunity. Dark data from the company and its ecosystem could actually prove useful to learn more about customer behavior or to help with decision making. Consider, for example, information about how a customer navigates on the website.
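One way to keep an eye on dark data is to measure its share of the lake from an inventory of datasets. The sketch below assumes a hypothetical inventory with a size, a cataloged flag and a last-read date per dataset; the one-year idle threshold is an arbitrary assumption, not an industry definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LakeDataset:
    name: str
    size_gb: float
    cataloged: bool
    last_read: Optional[date]  # None means the dataset has never been read


def is_dark(ds: LakeDataset, today: date, max_idle_days: int = 365) -> bool:
    """A dataset counts as dark if uncataloged, never read, or idle too long."""
    if not ds.cataloged or ds.last_read is None:
        return True
    return (today - ds.last_read).days > max_idle_days


def dark_share(datasets, today: date) -> float:
    """Fraction of total stored volume that is dark (0.0 for an empty lake)."""
    total = sum(ds.size_gb for ds in datasets)
    if total == 0:
        return 0.0
    dark = sum(ds.size_gb for ds in datasets if is_dark(ds, today))
    return dark / total


inventory = [
    LakeDataset("sales_orders", 40.0, True, date(2024, 5, 1)),
    LakeDataset("web_clickstream", 50.0, False, None),
    LakeDataset("old_erp_export", 10.0, True, date(2020, 1, 15)),
]
share = dark_share(inventory, today=date(2024, 6, 1))
```

In this sample inventory, 60 of the 100 GB stored are dark. The uncataloged clickstream data is also exactly the kind of website navigation data that could turn into an opportunity once it is cataloged and explored.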

Generally speaking, the greatest risk associated with a data lake is that it could be discredited if the overall data system is not properly governed, and the data lake is viewed as a purely technical solution with no added value for improving departmental performance.

If the logical architecture is explained to all departments and they understand the data captured, whether they use it or not, the benefits and proper use of the data system will become part of their culture. If the data lake remains the sole domain of data engineers, it will take a considerable amount of convincing and demonstrations to bring other departments on board and get a solid ROI.

Mapping out actionable data is also a must to make it relevant to the various departments, and the often complex technical maps should be linked to business glossaries. Making departments smart about their data is a challenge of data-driven progress. These technical-functional connections also give IT workers a better understanding of how data can improve business processes.


In conclusion

It is important to determine your data management policy before diving into the data lake and making it a pillar of your BI services architecture. With the support of a specialist like Solution BI, you’ll be able to select the right architecture for a data lake that is both secure and high-performing, encouraging users to explore it through data analysis. A system in the hands of a knowledgeable user can even give the department perspective on creating value, if the data lake initiative also includes a data lab within the organization. But that’s a fascinating topic for another time…


