Do you ever find yourself getting lost in all the data lingo? You’re not alone! Data science and analytics are evolving at light speed, and sometimes marketing messages cause certain concepts to end up lost in translation. So we’ve put together these data terminology definitions to help shed light on all this, and we’ve also included some real-life examples.
From Data to Big Data
We’ll dive straight into the murky depths and start with big data. The name points to the sheer quantity of stored data, but that magnitude is only one piece of the larger definition, which rests on the 3 V’s of big data: Volume, Velocity, and Variety. Velocity refers to how quickly data is generated, captured, shared, and updated. Variety touches on the actual complexity of big data: the data may or may not be structured and can come from multiple sources, everything from company databases to social networks.
Whatever the volume or complexity of the data, exploring all this information is a challenge. This exploration is known as data mining. The goal is to extract knowledge from data. A traditional three-step approach is used to accomplish this. First, the data is explored. Next, an algorithm-based analytical model is built, which is then finally deployed to obtain insights or make predictions.
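To make the three steps above more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the numbers are invented, the function names (explore, build_model, predict) are hypothetical, and a simple least-squares line stands in for whatever analytical model a real project would use.

```python
from statistics import mean

# Invented observations for illustration: (price, units sold per month).
observations = [(80, 22), (100, 18), (120, 15), (140, 11), (160, 8)]


def explore(rows):
    """Step 1: explore the data with a few summary figures."""
    prices = [price for price, _ in rows]
    units = [sold for _, sold in rows]
    return {
        "n": len(rows),
        "mean_price": mean(prices),
        "mean_units": mean(units),
    }


def build_model(rows):
    """Step 2: build a model, here a least-squares line units = slope * price + intercept."""
    x_bar = mean(price for price, _ in rows)
    y_bar = mean(sold for _, sold in rows)
    sxy = sum((price - x_bar) * (sold - y_bar) for price, sold in rows)
    sxx = sum((price - x_bar) ** 2 for price, _ in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, price):
    """Step 3: deploy the model to make a prediction for a new price."""
    slope, intercept = model
    return slope * price + intercept


summary = explore(observations)
model = build_model(observations)
forecast = predict(model, 110)

print(summary)
print(f"Predicted units at a price of 110: {forecast:.2f}")
```

In a real project each step is far richer (data cleaning, model validation, monitoring after deployment), but the order of the steps stays the same.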
Airbnb is an interesting example of a company that is constantly improving its service by mining its big data. The community platform gives owners and rental agencies access to a price recommendation engine. This engine simulates the likelihood of getting bookings based on price and dates. This tool analyzes more than 5 billion data points on properties and their geographic areas. The result: owners and agencies who stay within 5% of Airbnb’s price recommendations are four times as likely to get reservations.
Lastly, we need to mention data science, because extracting this knowledge from data requires skills and tools that intersect several different fields, from mathematics and statistics to computer science and data visualization.
This leads us into the steps of a data strategy: storing data, sharing it, and formatting it.
A data lake is a system for storing the massive amounts of data that big data involves. The idea is to quickly store a huge volume of heterogeneous data (structured or unstructured) from internal or external sources. Data is stored in the lake in its original format or with minimal processing. Data analysis and artificial intelligence tools can then be connected directly to the data lake to explore and tap this data. (For a more in-depth explanation, check out our Data Lake article.)
Creating a data lake is relatively simple, and economically attractive with cloud solutions. But the lake can quickly deteriorate into a swamp if the data is left to accumulate without rules for control or regular clean-up. As such, a “data swamp” symbolizes security, data quality, and compliance risks. Granting access to your data means thinking about data lifecycle management, particularly for sensitive, confidential, and personal data.
The opposite of a swamp is a neat and tidy data warehouse for storing cleaned or processed data. The goal is to make it easier for laypeople, not just seasoned data handling experts, to explore and use data. Consequently, a data warehouse will provide data sets prepared according to user needs. The user experience is key here, and, in fact, we can even use the term “data mart,” like an online store where you can pick up your data.
The line between a data lake (“raw” data) and a data warehouse (“data prepared for consumption”) is becoming blurred, because software solutions like Snowflake can now manage both. (Check out the article we wrote on Snowflake, a Solution BI partner, for more information.)
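The difference between “raw” and “prepared” data is easier to see with a small example. The Python sketch below is only an illustration under stated assumptions: the records, the field names, and the to_warehouse_row helper are all hypothetical, and plain Python lists stand in for real storage systems.

```python
import json

# Hypothetical "data lake": records kept exactly as they arrived,
# in whatever format each source produced.
raw_lake = [
    '{"customer": "Alice", "amount": "120.50", "country": "fr"}',  # JSON from a web app
    "Bob;80;FR",                                                   # semicolon-separated legacy export
    '{"customer": "Chloe", "amount": null, "country": "DE"}',      # incomplete record
]


def to_warehouse_row(raw):
    """Parse one raw record into a cleaned row, or return None if it is unusable."""
    if raw.lstrip().startswith("{"):
        record = json.loads(raw)
        customer = record.get("customer")
        amount = record.get("amount")
        country = record.get("country")
    else:
        customer, amount, country = raw.split(";")
    if amount is None:
        return None  # a simple quality rule: rows without an amount are rejected
    return {"customer": customer, "amount": float(amount), "country": country.upper()}


# Hypothetical "data warehouse": only cleaned rows with a consistent structure.
warehouse = [row for row in map(to_warehouse_row, raw_lake) if row is not None]

# Hypothetical "data mart": a subset prepared for one group of users.
french_sales_mart = [row for row in warehouse if row["country"] == "FR"]
total_fr = sum(row["amount"] for row in french_sales_mart)

print(warehouse)
print(f"Total French sales: {total_fr}")
```

The lake keeps everything, including the incomplete record. The warehouse applies rules so that every row has the same shape, and the mart hands each team only the slice it needs.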
How Data Is Shared
Data sharing refers to sharing company data with partners or customers. This practice is nothing new, particularly in the research community. Many public agencies have also opened up access to their data, including the INSEE, the INPI, and various ministries. (You can get more details in our “data sharing” article.) Today, data sharing has become more widespread among businesses, which now have access to more data as well as affordable technologies for leveraging it (including BI in the cloud).
The data being shared may be termed “open data,” which is freely accessible to users. Such data often originates from public sources, but companies could also decide to make data sets open.
For example, Groupe BPCE, the second-largest banking group in France, launched an Open Data portal in 2017 that currently boasts 167 data sets (bpce.opendatasoft.com). This portal contains the list of ATMs across the regions, catalogs of banking APIs, and analyses produced by the BPCE observatory. It’s a strategy that is helping to cultivate the data culture within the group and is fostering dialogue with both public and private stakeholders in every region.
Data isn’t always exchanged for free, nor is it open to all. Many companies would rather exchange data between partners (who may be paying for it). This leads us to the concept of the data marketplace. The rationale behind it goes far beyond simply selling company data for prospecting.
NumAlim is a great example. This data and service exchange hub is designed to serve the 18,000 agri-food companies. When faced with concerns about nutrition, health, or the environment, consumers can find “enhanced” information about food. Manufacturers can then turn to NumAlim to highlight, acquire, and expand “open” data and for-fee nutrition data (such as sales panels, customer reviews, regulatory reference documents, and more).
Earlier we mentioned data marts and markets, and, in the same vein, there are also data products—created from your data. These may be compiled from your sales reports or even your customers’ consumer data.
One such data product initiative is Netflix’s recommendation engine, driven by analysis of subscribers’ choices. And, on the retail side, there are the real-time sales reports and analyses that retail giant Walmart sells to its partners. Every hour, Walmart’s data café processes 2.5 petabytes of data from over 1 million customers. (For more details, read our white paper on data products.)
As we continue to explore data jargon, our journey leads us to the methods used to make data more intelligible and impactful. Data visualization involves taking data and transforming it into graphical or visual forms to help identify trends or messages and, in so doing, make the data clearer, more instructive, and more persuasive. Need suggestions on new ways of presenting your data besides bar graphs and pie charts? Check out our tips for getting started with data visualization.
You can also take things up a notch with data storytelling: the art of telling stories through data, because data all by itself is still hard to understand. Data can now be presented in an interactive, personalized way adapted to the users. It’s even possible to open a data access portal to guide their exploration. (Check out “4 Tips for Data Storytelling, the Art of Telling a Story through Data.”)
Data storytelling is frequently used by the media to help readers understand complex phenomena or even to get them emotionally invested in environmental and societal issues. For example, the New York Times created its “Data and Insights” department (nytco.com/careers/data-and-insights-group/) to better understand its readers and improve data analysis. Its editorial team now enjoys worldwide recognition for the high quality of its dynamic and personalized infographics. But data storytelling isn’t just for the media. In fact, one of the industries that has the most to gain from it is finance. This is because accounting and financial data really come to life when they’re put into perspective using good visual aids and the right messages. A financial manager can use data storytelling to mobilize the executive committee and guide future investment choices.
New Data Professions
Last, but not least, these new practices surrounding data require new skills and, necessarily, new professions. To learn how to tell the difference between data engineers, data scientists, data analysts, and Big Data architects, take a look at our article on Data-related jobs.