
What is a Data Lakehouse?

A data lakehouse is a modern data architecture that combines the flexible, low-cost storage of a data lake with the structured data management and query performance of a data warehouse, providing a single platform for both analytics and AI workloads without duplicating data across systems.

What Is a Data Lakehouse?

A data lakehouse is a data architecture that merges the best features of two established approaches: the data lake and the data warehouse. Traditionally, organisations had to choose between a data lake, which stores vast amounts of raw data cheaply but lacks structure, and a data warehouse, which provides fast, structured queries but is expensive and rigid. The data lakehouse eliminates this trade-off by adding data warehouse-like management capabilities directly on top of data lake storage.

This means a single platform can handle structured data like financial records, semi-structured data like JSON logs, and unstructured data like images, text, and video, all while supporting both traditional business analytics and modern AI and machine learning workloads.

For businesses in Southeast Asia managing diverse data across multiple markets and formats, the data lakehouse offers a unified foundation that simplifies infrastructure and reduces costs.

How a Data Lakehouse Works

The data lakehouse architecture introduces a critical innovation: an open table format layer that sits between raw storage and query engines. This layer adds the reliability and performance features previously available only in data warehouses:

Open Table Formats

The key technologies enabling data lakehouses include:

  • Delta Lake: Developed by Databricks, the most widely adopted lakehouse format, providing ACID transactions, schema enforcement, and time travel on data lake storage
  • Apache Iceberg: An open table format created by Netflix, gaining rapid adoption for its performance with very large datasets and multi-engine support
  • Apache Hudi: Developed by Uber, optimised for incremental data processing and streaming workloads

These formats store data in open file formats like Parquet on standard cloud storage (Amazon S3, Google Cloud Storage, Azure Blob Storage), keeping costs low while adding warehouse-grade reliability.
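
To make this concrete, here is a minimal sketch of landing raw events in a Delta Lake table on cloud object storage with PySpark. It assumes a Spark environment with the Delta Lake package installed and cloud credentials already configured; the bucket paths are hypothetical and purely illustrative.

```python
# Minimal sketch: writing semi-structured JSON events to a Delta Lake table on S3.
# Assumes Spark with the Delta Lake package available; all paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The raw input is JSON; Delta stores it as Parquet data files plus a
# transaction log that adds ACID guarantees on top of object storage.
events = spark.read.json("s3://example-bucket/raw/events/")

(events.write
    .format("delta")
    .mode("append")
    .save("s3://example-bucket/lakehouse/events"))
```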

Key Capabilities

A data lakehouse provides:

  • ACID transactions: Reliable data operations that prevent corruption, even with concurrent reads and writes
  • Schema enforcement: Ensuring data quality by validating that incoming data matches the expected structure
  • Time travel: The ability to query data as it existed at any point in the past, essential for reproducible AI training and regulatory audits (illustrated in the sketch after this list)
  • Unified access: Both SQL-based analytics tools and ML frameworks can access the same data without copying it between systems
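
Building on the earlier sketch (and reusing its spark session), the snippet below illustrates two of these capabilities on a Delta Lake table: schema enforcement rejecting a mismatched write, and time travel reading an earlier version. The table path, column names, and version number are hypothetical.

```python
# Schema enforcement: appending rows whose columns do not match the table's
# schema is rejected instead of silently corrupting the data.
bad_rows = spark.createDataFrame(
    [("evt-001", "oops")], ["event_id", "unexpected_column"]
)
try:
    bad_rows.write.format("delta").mode("append").save(
        "s3://example-bucket/lakehouse/events"
    )
except Exception as err:  # surfaced as an AnalysisException in practice
    print(f"Write rejected by schema enforcement: {err}")

# Time travel: read the table exactly as it existed at an earlier version,
# which supports reproducible model training and regulatory audits.
snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 3)  # hypothetical version number
    .load("s3://example-bucket/lakehouse/events")
)
snapshot.show()
```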

Why Data Lakehouses Matter for Business

The business case for a data lakehouse centres on simplification and cost reduction:

Eliminating Data Duplication

In traditional architectures, organisations maintain both a data lake for AI workloads and a data warehouse for business analytics, with complex ETL pipelines copying data between them. This duplication inflates storage costs, introduces data consistency issues, and requires ongoing engineering effort to keep the two systems in sync. A data lakehouse eliminates this by serving both needs from a single copy of the data.

Lower Infrastructure Costs

Data lakehouses store data on inexpensive cloud object storage rather than proprietary warehouse storage formats. For organisations in Southeast Asia managing growing data volumes across multiple markets, this can reduce storage costs by 50-80% compared to maintaining a dedicated data warehouse alongside a data lake.

Faster AI Development

Data scientists working on AI models need access to diverse, raw data, including historical records, real-time events, text, and images. In a traditional architecture, getting this data often requires requests to the data engineering team and weeks of waiting for ETL pipeline development. A data lakehouse gives data scientists direct access to all organisational data through familiar tools and interfaces, dramatically accelerating the model development cycle.

Unified Governance

A single platform means a single set of access controls, audit logs, and data quality rules. This simplifies compliance with data protection regulations across ASEAN markets, where requirements vary by country. Rather than maintaining governance across two separate systems, organisations manage permissions and policies in one place.

Data Lakehouse Platforms

Several platforms offer data lakehouse capabilities:

  • Databricks Lakehouse Platform: The pioneer of the lakehouse concept, built on Delta Lake, with strong AI/ML integration
  • Snowflake: Originally a data warehouse, now supporting lakehouse patterns with Apache Iceberg integration
  • Google BigQuery: Supports lakehouse architecture with external table capabilities and integration with the broader Google Cloud AI ecosystem
  • AWS Lake Formation with Athena: AWS-native lakehouse approach using S3 storage with serverless querying
  • Apache Spark + Delta Lake / Iceberg: Open-source approach for organisations that prefer full control

For businesses in Southeast Asia, both Databricks and Google BigQuery have strong regional presence, with data centres in Singapore supporting low-latency access across ASEAN.

Implementing a Data Lakehouse

A practical implementation roadmap:

  1. Assess your current architecture: Map your existing data lake and data warehouse components, including the ETL pipelines that connect them
  2. Choose an open table format: Delta Lake is the most mature, while Apache Iceberg is gaining momentum for its multi-engine flexibility. Decide based on your existing tooling and cloud provider
  3. Start with new workloads: Rather than migrating everything at once, build new data pipelines on the lakehouse architecture and migrate existing workloads incrementally
  4. Consolidate AI data access: Move your ML feature engineering and training data pipelines to the lakehouse first, as this is where the architecture provides the greatest advantage (see the sketch after this list)
  5. Establish governance early: Implement access controls, data cataloguing, and quality monitoring from the start, not as an afterthought
  6. Measure cost savings: Track the reduction in storage costs, ETL maintenance effort, and data engineering time as you consolidate onto the lakehouse
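
As a rough illustration of step 4, the sketch below shows how a feature engineering job might read training data directly from the lakehouse table built in the earlier sketches, pinning a snapshot so the training set is reproducible. The column names, filter, and cutoff timestamp are hypothetical.

```python
# Sketch for step 4: building ML training features straight from the lakehouse
# table rather than waiting for a bespoke ETL extract.
# Assumes the spark session and table from the earlier sketches.
from pyspark.sql import functions as F

training_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-31")  # hypothetical snapshot date
    .load("s3://example-bucket/lakehouse/events")
    .where(F.col("event_type") == "purchase")
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("purchase_count"),
        F.sum("amount").alias("total_spend"),
    )
)

# Hand the features to the ML framework of choice, e.g. as a pandas DataFrame.
features = training_df.toPandas()
```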

The data lakehouse is rapidly becoming the standard architecture for organisations that need both analytics and AI capabilities, replacing the two-system approach that has dominated for the past decade.

Why It Matters for Business

The data lakehouse matters to CEOs and CTOs because it directly addresses one of the most common and costly problems in enterprise data: maintaining two separate systems for analytics and AI that are expensive, complex, and perpetually out of sync. For organisations spending significant budgets on both a data warehouse and a data lake with ETL pipelines connecting them, a lakehouse can reduce total data infrastructure costs by 30-50%.

For business leaders in Southeast Asia managing data across multiple markets, the unified governance benefit is particularly valuable. Instead of maintaining separate access controls and compliance policies across two systems, a lakehouse provides a single governance layer. This simplifies compliance with varying data protection regulations across Singapore, Indonesia, Thailand, and other ASEAN markets.

The strategic advantage is speed. When your data scientists can access all organisational data through a single platform without waiting for data engineering pipelines to be built, AI development cycles shorten dramatically. Organisations that adopt lakehouse architecture consistently report that new AI models reach production 40-60% faster because the data access bottleneck is eliminated. In a competitive landscape where AI capability is a differentiator, this acceleration translates directly to business advantage.

Key Considerations

  • Evaluate whether you currently maintain both a data lake and a data warehouse. If so, a lakehouse can eliminate the costly duplication and ETL complexity between them.
  • Choose an open table format (Delta Lake, Apache Iceberg, or Apache Hudi) based on your existing cloud provider, tooling, and query engine preferences.
  • Start with new workloads rather than migrating everything at once. Build new AI and analytics pipelines on the lakehouse and migrate existing ones incrementally.
  • Ensure your chosen lakehouse platform has data centres in Southeast Asia for low-latency access and compliance with regional data residency requirements.
  • Invest in data governance and cataloguing from day one. A lakehouse without proper access controls and data quality monitoring creates the same governance challenges as an unmanaged data lake.
  • Budget for training your data engineering and data science teams on lakehouse concepts and tools, as the architecture requires different skills from traditional data warehousing.
  • Monitor query performance carefully during migration. Some workloads optimised for a traditional data warehouse may need tuning to perform well in a lakehouse environment.

Frequently Asked Questions

What is the difference between a data lake, data warehouse, and data lakehouse?

A data lake stores raw data in any format at low cost but lacks structure and query performance. A data warehouse stores structured data optimised for fast SQL queries but is expensive and inflexible with unstructured data. A data lakehouse combines both: it stores all data types on inexpensive lake storage while adding warehouse-grade features like ACID transactions, schema enforcement, and fast SQL queries. This eliminates the need to maintain two separate systems and the complex pipelines that synchronise them.

How much can a data lakehouse save compared to running a data lake and data warehouse separately?

Organisations typically report 30-50% reduction in total data infrastructure costs after consolidating onto a lakehouse. The savings come from three areas: eliminating duplicate data storage across two systems, removing the ETL pipelines that copy data between them, and reducing data engineering effort spent maintaining synchronisation. For a mid-size company spending $20,000-50,000 per month on data infrastructure, this can represent $6,000-25,000 in monthly savings. However, migration itself requires investment, so the payback period is typically 6-12 months.

Is a data lakehouse better than a data warehouse for AI?

Yes, for most AI workloads. Data warehouses are optimised for structured SQL queries, but AI models often need access to raw, semi-structured, or unstructured data including text, images, and event logs. A data lakehouse provides this access natively while still supporting the structured queries that business intelligence teams need. Additionally, lakehouse formats like Delta Lake support time travel, which allows data scientists to train models on data exactly as it existed at a specific point in time, preventing data leakage and ensuring reproducibility.

Need help implementing a data lakehouse?

Pertama Partners helps businesses across Southeast Asia adopt AI strategically. Let's discuss how data lakehouse fits into your AI roadmap.