Data Lakehouses: How Secure is the Best of Both Worlds?

As we enter a new cloud-first era, advances in technology have helped companies capture and capitalize on as much data as possible. Choosing a cloud data architecture has long been a debate between two options: data warehouses and data lakes. But a new concept, the data lakehouse, has emerged to address the growing pain points of running two separate infrastructures, which raises the questions: should we make the switch? Can we have the best of both worlds? And most importantly, how secure is our data with this new infrastructure?

Data Warehouse: The Analytics Workhorse

A data warehouse is a type of relational database that is specifically designed for data analytics -- i.e., reading large amounts of data and understanding trends and relationships across the data. That’s why data warehouses have become a core tool for many enterprises’ business intelligence and analytics teams.

Structured data gets loaded from production databases (which are optimized for fast writes) into data warehouses via ETL (extract, transform, load) processes. Structured data refers to any clearly defined and searchable data type, such as phone numbers, dates, or even text strings like names, and is easy for machines to parse and query. However, as demand for data ingestion has increased, traditional data warehouses have become an expensive storage option for those looking to scale. Data warehouse solutions include Azure Synapse Analytics and Google BigQuery.
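To make the ETL step concrete, here is a minimal sketch in Python that uses two in-memory SQLite databases as stand-ins for a production database and a warehouse. The table names (orders, fact_orders), the columns, and the sample rows are all hypothetical and chosen only for illustration; a real pipeline would target an actual warehouse service and handle scheduling, incremental loads, and failures.

```python
import sqlite3


def extract(conn):
    """Extract: read raw rows from the (hypothetical) production table."""
    return conn.execute(
        "SELECT id, customer, amount_cents, ordered_at FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: tidy names, convert cents to dollars, keep only the date."""
    return [
        (order_id, customer.strip().title(), cents / 100.0, timestamp[:10])
        for order_id, customer, cents, timestamp in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount_usd REAL, order_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Stand-in for a write-optimized production database (sample data is made up).
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER, ordered_at TEXT)"
)
production.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " alice ", 1250, "2021-03-01T09:15:00"),
        (2, "BOB", 400, "2021-03-01T17:40:00"),
        (3, "alice", 5000, "2021-03-02T08:05:00"),
    ],
)

# Stand-in for the analytics warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(production)))

# An analytics-style read: revenue per day across all orders.
daily_revenue = warehouse.execute(
    "SELECT order_date, SUM(amount_usd) FROM fact_orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

The point of the sketch is the shape of the work: data is copied out of one system, cleaned, and written into another before anyone can analyze it, which is where the time and storage costs discussed in this article come from.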

Data Lake: The Massive and Cheap Data Store

Data lakes emerged in 2010 as a cost-effective solution to address the boom in big data. Unlike data warehouses, data lakes simply store data, like the hard disk and file system of a laptop. Data lakes store all types of data, including structured, semi-structured and unstructured data. Semi-structured data is contained in non-tabular files, but the files still contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields. Examples of semi-structured data include XML, JSON, and CSV files. Unstructured data refers to any data in its native format that remains undefined until needed. This includes pictures, video files, and even social media posts.
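As a rough illustration of the difference between semi-structured and structured data, the Python sketch below takes a single JSON record and flattens it into one tabular row. The record, its field names, and the flatten helper are hypothetical; the dotted column names and comma-joined lists are just one possible convention.

```python
import json

# A made-up semi-structured record: nested fields and a list, but no fixed table schema.
raw = '{"user": {"id": 7, "name": "Ada"}, "tags": ["admin", "beta"], "last_login": "2021-04-02"}'


def flatten(record, prefix=""):
    """Turn nested keys into dotted column names so the record fits in one row."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, e.g. user -> user.id, user.name
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Assumption for this sketch: lists are stored as a comma-joined string
            flat[name] = ",".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw))
print(row)
```

The tags and hierarchy in the JSON are what make it semi-structured: the structure is recoverable, but someone still has to decide how it maps onto columns before it can sit in a relational table.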

However, due to their massive storage capacity, data lakes can potentially become data swamps – wastelands of raw, unusable data without any clear organization or active management. In addition, because the data isn’t formatted for analytical use, preparing the data and deriving value from it is both time-consuming and cumbersome for analytics teams. Data lake solutions include Amazon S3, Azure Data Lake, and Google Cloud Storage.

Data Lakehouse: The Best of Both Worlds?

Since the advent of data lakes, many enterprises have had both a data lake (for data storage) and a data warehouse (for data analysis). This two-pronged setup has worked, but it has two key drawbacks: (a) a time-consuming process of transforming and loading data from the data lake into the data warehouse, and (b) duplicate data across the data warehouse and data lake, which increases storage costs.

It’s no surprise that a new hybrid solution has emerged. The data lakehouse aims to combine the best of both worlds. Data lakehouse solutions store data in a data lake, taking advantage of the data lake’s low-cost storage. When that data is needed for analysis, it is accessed directly in the data lake rather than first being copied into a separate warehouse. Data lakehouses unify data movement across one infrastructure and remove data duplication (and its cost), giving companies greater ability to scale. Data lakehouses can also improve data quality and data governance while empowering BI and reporting. Data lakehouse solutions include Snowflake, Amazon Redshift Spectrum, and Databricks.
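The idea of analyzing data where it already sits can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions: a temporary local directory stands in for the data lake, the JSON-lines files and their fields are made up, and a plain loop stands in for the query engine that a real lakehouse product would provide.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_sample_lake(root):
    """Create two JSON-lines files standing in for raw objects in a data lake."""
    (root / "events-2021-03-01.jsonl").write_text(
        '{"country": "US", "amount": 10}\n{"country": "DE", "amount": 5}\n'
    )
    (root / "events-2021-03-02.jsonl").write_text(
        '{"country": "US", "amount": 7}\n'
    )


def total_by_country(root):
    """Aggregate directly over the stored files; nothing is copied elsewhere first."""
    totals = defaultdict(int)
    for path in sorted(root.glob("*.jsonl")):
        with path.open() as handle:
            for line in handle:
                if line.strip():
                    event = json.loads(line)
                    totals[event["country"]] += event["amount"]
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    write_sample_lake(lake)
    result = total_by_country(lake)

print(result)
```

Compared with a separate ETL job, there is only one copy of the data here, and the analysis reads it in its stored form.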

Benefit Summary:

Data Warehouses

  Pros:
    - Better data preparation with clean data
    - Easy data discovery and querying of data
    - Little maintenance

  Cons:
    - Expensive with high volumes of data
    - Time-consuming to load data via ETL processes

Data Lakes

  Pros:
    - Cost-effective storage
    - Stores any data type (structured, semi-structured, unstructured)
    - Easy scalability and agility
    - Speed of ingest - not processed until needed

  Cons:
    - Not optimized for queries
    - May turn into a data swamp

Data Lakehouses

  Pros:
    - Cost-effective storage
    - All types of data reside in one platform
    - Isn’t tethered to a single platform - can utilize other tech
    - Easier to build a pipeline for data movement
    - Pay-for-what-you-use model
    - Better control and governance capabilities
    - Removes data duplication

  Cons:
    - Relatively new solution, still some limitations in functionality
    - One-size-fits-all approach

Even Data Lakehouses Have Data Security and Compliance Risks

The data lakehouse is a promising new architecture that’s here to stay. But that doesn’t mean data lakehouses automatically eliminate all data security and data compliance risks. All data in a data lakehouse is still stored in an AWS S3 bucket (or the Google or Azure equivalent), and those buckets can still easily be misconfigured -- accessible to the wrong people, or accessible to the public -- leading to data breaches and privacy violations. In addition, data stored in S3 buckets is more likely to be raw data that hasn’t been curated. As a result, users may be able to access more sensitive data in a data lakehouse environment than they would via a data warehouse -- again, potentially creating more data security and data compliance risks.
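One common form of misconfiguration is a bucket policy that grants access to everyone. The Python sketch below checks a policy document, written in the standard AWS JSON policy layout, for "Allow" statements whose principal is a wildcard. It is a deliberately simplified heuristic and not a complete audit: it ignores ACLs, account-level public access settings, and the detail of any conditions. The bucket name, account ID, role name, and statement IDs are hypothetical.

```python
def is_public_principal(principal):
    """Return True if the Principal element matches everyone ("*")."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        if aws == "*":
            return True
        if isinstance(aws, list) and "*" in aws:
            return True
    return False


def public_statements(policy):
    """List Allow statements open to any principal with no Condition attached."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may contain a single statement object instead of a list.
        statements = [statements]
    return [
        statement
        for statement in statements
        if statement.get("Effect") == "Allow"
        and is_public_principal(statement.get("Principal"))
        and "Condition" not in statement
    ]


# Hypothetical policy for a made-up bucket: one public statement, one restricted.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-lakehouse-bucket/*",
        },
        {
            "Sid": "AnalystsOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/analyst"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-lakehouse-bucket/curated/*",
        },
    ],
}

flagged = public_statements(policy)
print([statement["Sid"] for statement in flagged])
```

In this made-up policy, the first statement would let anyone read every object in the bucket, including raw, uncurated data, which is exactly the kind of exposure described above.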

Dasera integrates with your existing data architecture to protect the data lifecycle, from creation to deletion. Visit our website to learn how Dasera can support the diversity and complexity of your cloud data stores. 

Author

Tu Phan