The data lakehouse – it’s not a summer retreat for over-worked database administrators (DBAs) or data scientists, it’s a concept that tries to bridge the gap between the data warehouse and the data lake.
In other words, the data lakehouse aims to marry the flexibility and relatively low cost of the data lake with the ease of access and support for enterprise analytics capabilities found in data warehouses.
In this article, we’ll look at the features of the data lakehouse and give some pointers to the suppliers making it available.
Lake limitations and warehouse worries
Let’s recap on the key features of the data lake and data warehouse to make it plain where the data lakehouse idea fits in.
Data lakes are conceived of as the most upstream location for enterprise data management. It’s where all the organisation’s data flows to and where it can live in more or less raw format, ranging from unstructured to structured: image files and PDFs, XML and JSON, through to databases. There may be search-type functionality, perhaps via metadata, and data scientists may carry out some ad hoc analysis.
Processing capabilities are not likely to be critical or optimised for particular workflows, and the same goes for storage.
Data warehouses, on the other hand, are at the opposite extreme. Here, datasets – possibly after exploratory phases of work in the data lake – are made available for more regular and routine analytics.
The data warehouse puts data into a more packaged and processed format. It will have been explored, assessed, wrangled and presented for rapid and regular access, and is almost invariably structured data.
Meanwhile, compute and storage in the data warehouse architecture will be optimised for the types of access and processing required.
Across the lake to the lakehouse
The data lakehouse attempts to bridge the gulf between data lake and data warehouse: between the large, amorphous mass of the lake, with its myriad formats and lack of usability in day-to-day terms, and the tight, highly structured and relatively costly data warehouse.
Fundamentally, the data lakehouse idea sees the introduction of support for ACID (atomicity, consistency, isolation and durability) transactions, so that multiple parties can concurrently read and write data. There should also be a way to enforce schemas and ensure governance, with ways of reasoning about data integrity.
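To make the idea of schema enforcement and all-or-nothing writes concrete, here is a minimal sketch in Python. It is a toy, in-memory illustration only: the `ToyTable` class, its `append` method and the `orders` example are hypothetical names invented for this article and are not the API of Delta Lake or any other lakehouse product.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a row does not match the table's declared schema."""


@dataclass
class ToyTable:
    # Hypothetical in-memory stand-in for a lakehouse table (illustration only).
    schema: dict                                # column name -> expected Python type
    rows: list = field(default_factory=list)
    version: int = 0                            # bumped once per committed batch

    def _validate(self, row: dict) -> None:
        # Schema enforcement: same columns, and each value of the declared type.
        if set(row) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(
                    f"{column!r} expected {expected.__name__}, "
                    f"got {type(row[column]).__name__}"
                )

    def append(self, batch: list) -> int:
        # Validate the whole batch first, so one bad row cannot leave a partial write.
        for row in batch:
            self._validate(row)
        # Commit: swap in a new list and bump the version in one step.
        self.rows = self.rows + [dict(row) for row in batch]
        self.version += 1
        return self.version


orders = ToyTable(schema={"order_id": int, "amount": float})
orders.append([{"order_id": 1, "amount": 9.99}])

rejected = None
try:
    orders.append([
        {"order_id": 2, "amount": 5.0},
        {"order_id": 3, "amount": "oops"},      # wrong type: whole batch is rejected
    ])
except SchemaError as err:
    rejected = str(err)

print(len(orders.rows), orders.version)
```

The second batch contains one valid and one invalid row, and neither is written: the table still holds one row at version 1. Real lakehouse engines achieve this with transaction logs and concurrency control rather than a Python list, but the contract offered to readers and writers is broadly similar.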
But the data lakehouse idea is also in part a response to the rise of unstructured (or semi-structured) data that could be in a variety of formats, including those that could potentially be analysed by artificial intelligence (AI) and machine learning (ML) tools, such as text, images, video and audio.
That also means support for a variety of workload types. Where the data warehouse invariably means use of databases, the data lakehouse can be the site of data science, AI/ML, SQL and other forms of analytics.
A key advantage is that a wide variety of data can be accessed more quickly and easily with a wider variety of tools – such as Python, R and machine learning libraries – and integrated with enterprise applications.
Where to explore the data lakehouse
A pioneer of the data lakehouse idea is Databricks, which gained $1bn of funding earlier this year. Databricks is a contributor to the open source Delta Lake project, which underpins its cloud data lakehouse. Analysts have seen such a big funding round as a sign of investor confidence in an approach that aims at easing enterprise access to large and varied data sets.
Databricks is available on Amazon Web Services (AWS), and the cloud giant also positions its Redshift data warehouse product as part of a lakehouse architecture, with the ability to query across structured (relational databases, Redshift) and unstructured (S3) data sources. The essence here is that applications can query any data source without the prep required of data warehousing.
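As a rough illustration of what querying across structured and raw sources without upfront prep can look like, the sketch below joins a relational table with raw JSON records at read time. It uses only the Python standard library; the `customers` table, the sample events and the `events_per_customer` function are made up for this example and do not represent Redshift, S3 or any supplier's API.

```python
import io
import json
import sqlite3

# Structured side: a small relational table (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Lake side: raw JSON lines, as they might sit untouched in object storage.
raw_events = io.StringIO(
    '{"customer_id": 1, "event": "click"}\n'
    '{"customer_id": 1, "event": "purchase", "amount": 20.0}\n'
    '{"customer_id": 2, "event": "click"}\n'
)


def events_per_customer(conn, raw):
    """Count raw events per customer name, parsing each record only when read."""
    names = dict(conn.execute("SELECT id, name FROM customers"))
    counts = {}
    for line in raw:
        event = json.loads(line)                # structure is applied at read time
        name = names.get(event["customer_id"], "unknown")
        counts[name] = counts.get(name, 0) + 1
    return counts


result = events_per_customer(conn, raw_events)
print(result)
```

The raw events are never loaded into a warehouse table or given a fixed schema in advance. The records carry differing fields, and only the one field the query needs is read from each.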
Microsoft Azure has Azure Databricks, which uses the Delta Lake engine and Spark with application programming interface (API) support for SQL, Python, R and Scala, plus optimised Azure compute and machine learning libraries.
Databricks and Google also announced Databricks’ availability on Google Cloud Platform earlier this year, along with integration with Google’s BigQuery and Google Cloud AI Platform.
Another supplier in the lakehouse game is Snowflake, which claims to be the originator of the term and touts its ability to provide a data and analytics platform across data warehousing and less structured scenarios.