Data lakes remain an ideal repository for ingesting, organizing, storing, assessing, and analyzing all types of historical and transactional data. Never before have business analysts been able to access all of this data in one place; traditional data warehouses could not sustain it because of the volume, cost, latency, complexity, and performance requirements. So are data lakes a cure-all for our data woes? Not quite: in practice they bring challenges of their own.
Common challenges with data lakes – Whether or not you would classify your data lake as a swamp, you may notice end users struggling with data quality, query performance, and reliability as a result of the volume and raw nature of the data it holds. Specifically:
Query performance:
- Too many small files (or a few very large ones) mean more time is spent opening and closing files than reading their content (this is even worse with streaming data)
- Partitioning or indexing breaks down when data has many dimensions and/or high-cardinality columns
- Neither storage systems nor processing engines handle very large numbers of subdirectories and files well
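The small-file overhead is easy to demonstrate with a toy benchmark. The sketch below uses plain Python as a stand-in for a distributed reader; the file counts and sizes are arbitrary illustrations, not measurements from any real system:

```python
import os
import tempfile
import time

def write_files(root, n_files, bytes_per_file):
    # Split the same total payload across n_files separate files.
    os.makedirs(root, exist_ok=True)
    for i in range(n_files):
        with open(os.path.join(root, f"part-{i:05d}"), "wb") as f:
            f.write(b"x" * bytes_per_file)

def read_all(root):
    # Every file adds open/close overhead on top of the actual read.
    total = 0
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name), "rb") as f:
            total += len(f.read())
    return total

with tempfile.TemporaryDirectory() as tmp:
    many = os.path.join(tmp, "many")
    few = os.path.join(tmp, "few")
    write_files(many, n_files=2000, bytes_per_file=512)     # 2,000 small files
    write_files(few, n_files=1, bytes_per_file=2000 * 512)  # 1 compacted file

    t0 = time.perf_counter()
    small_total = read_all(many)
    t1 = time.perf_counter()
    big_total = read_all(few)
    t2 = time.perf_counter()

    assert small_total == big_total  # same bytes either way
    print(f"2000 small files: {t1 - t0:.3f}s, 1 large file: {t2 - t1:.3f}s")
```

The many-small-files layout reads the same bytes but pays the per-file open/close cost two thousand times, which is exactly the overhead compaction is meant to remove.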
Data quality and reliability:
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of consistency makes it hard to mix appends, deletes, and upserts and get consistent reads
- Lack of schema enforcement creates inconsistent and low-quality data
Generating analytics from data lakes – As organizations set up their data lake solutions, often migrating from traditional data warehousing environments to cloud solutions, they need an analytics environment that can quickly access accurate and consistent data for business applications and reports. For data lakes to serve the analytic needs of the organization, you must follow these key principles:
Data cataloging and metadata management: To present the data to business, create a catalog or inventory of all data, so business users can search data in simple business terminology and get what they need. But with high volumes of new data added every day, it’s easy to lose control of indexing and cataloging the contents of the data lake.
Governance and multi-tenancy: Authorizing and granting access to subsets of data requires security and data governance. Delineating who can see which data, and at what granularity, requires multi-tenancy features. Without these capabilities, data is controlled by only a few data scientists instead of the broader organization and business users.
Operations: For a data lake to become a key operational business platform, build in high availability, backup, and recovery capabilities.
Self-service: For a data lake to deliver value, build consistent data ingestion with all metadata and schema captured. In many cases, business users want to blend their own data with the data from the data lake.
Yet as data lakes continue to grow in size, including increasing volumes of unstructured data, these principles become increasingly complex to design and implement. Delta Lake was created to simplify this process.
The benefits of Delta Lake include:
- Reliability: Failed write jobs do not update the commit log, so partial or corrupt files are never visible to readers
- Consistency: Changes to tables are stored as ordered, atomic commits, and each commit is a set of actions recorded as a file in the log directory. Readers process the log in atomic units and therefore see consistent snapshots. In practice, most writes do not conflict, and isolation levels are tunable.
- Performance: Small files are compacted using the OPTIMIZE command, and queries can be further accelerated with multi-dimensional clustering (Z-ordering) on multiple columns
- Reduced system complexity: Delta is able to handle both batch and streaming data (via a direct integration with structured streaming for low latency updates) including the ability to concurrently write batch and streaming data to the same data table
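The ordered, atomic commit log behind the reliability and consistency points above can be sketched in a few lines of plain Python. This is an illustration of the idea, in the spirit of Delta Lake's `_delta_log` directory, not the real Delta protocol; the file names and action format are simplified assumptions:

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    # Write actions to a temp file, then atomically rename it into place.
    # A failed writer that never renames leaves no visible commit at all.
    final = os.path.join(log_dir, f"{version:020d}.json")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    os.rename(tmp, final)

def snapshot(log_dir, as_of=None):
    # Replay commits in version order: 'add' makes a data file visible,
    # 'remove' hides it. Passing as_of gives time travel -- the set of
    # files that made up the table at an earlier version.
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if action["op"] == "add":
                    files.add(action["path"])
                elif action["op"] == "remove":
                    files.discard(action["path"])
    return files

with tempfile.TemporaryDirectory() as log_dir:
    commit(log_dir, 0, [{"op": "add", "path": "part-0.parquet"}])
    commit(log_dir, 1, [{"op": "add", "path": "part-1.parquet"}])
    commit(log_dir, 2, [{"op": "remove", "path": "part-0.parquet"}])
    latest = snapshot(log_dir)          # only part-1 remains visible
    older = snapshot(log_dir, as_of=1)  # both files were visible at version 1
```

Because readers only ever see whole commit files, a crash mid-write leaves the table exactly as it was, and replaying the log up to an earlier version reproduces an earlier snapshot.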
ACID Transactions – Data lakes typically have multiple data pipelines reading and writing data concurrently, and without transactions data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
Scalable Metadata Handling – In big data, even the metadata itself can be "big data." Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle it. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
Time Travel (data versioning) – Delta Lake provides snapshots of data, enabling developers to access and revert to earlier versions for audits, rollbacks, or reproducing experiments.
Open Format – All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
Unified Batch and Streaming Source and Sink – A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
Schema Enforcement – Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that data types are correct and required columns are present, preventing bad data from causing corruption.
Schema Evolution – Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
100% Compatible with the Apache Spark API – Developers can use Delta Lake with their existing data pipelines with minimal change, as it is fully compatible with Spark, the commonly used big data processing engine.
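The schema-enforcement idea described above can be illustrated with a toy, pure-Python sketch. The `SCHEMA`, `validate`, and `append` names here are hypothetical helpers for illustration, not the Delta Lake API; the point is that a batch is validated before any of it lands, so a bad write leaves the table untouched:

```python
# Toy write-time schema enforcement: declared column types are checked
# before rows are appended. These helpers are illustrative only.
SCHEMA = {"id": int, "event": str, "amount": float}

def validate(row, schema=SCHEMA):
    # Reject rows with missing columns or mismatched types.
    missing = set(schema) - set(row)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"column {col!r} expects {typ.__name__}, "
                            f"got {type(row[col]).__name__}")
    return row

def append(table, rows):
    # Validate the whole batch first so a bad batch changes nothing.
    checked = [validate(r) for r in rows]
    table.extend(checked)
    return table

table = []
append(table, [{"id": 1, "event": "click", "amount": 0.5}])  # accepted
try:
    append(table, [{"id": "2", "event": "view", "amount": 1.0}])  # id is a str
except TypeError:
    pass  # rejected: the bad batch never reached the table
```

In Delta Lake itself this check happens at write time against the table's stored schema; the sketch only shows why rejecting a batch up front prevents the low-quality, inconsistent data described earlier in this section.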