Organizations have been moving to the cloud steadily for well over ten years. In that time, they have amassed a large amount of data on customers and internal operations, but that data is not always stored in a way that all parties can easily access. With advancements in data analysis and the ability to use datasets for machine learning (ML) and artificial intelligence (AI), some organizations have made great strides by moving to a modern architecture that allows them to store and analyze both structured and unstructured data: a Lake House. This article explains the benefits of building a Lake House on AWS.
Both data warehouses and data lakes have their pros and cons. Traditional data warehouses store only structured data, but they are built for fast queries. Data lakes store both structured and unstructured data, but because they have no built-in query engine, you need to add a query engine and a data catalog to make use of your data. Wouldn’t it be great if you could marry these two into one system? With a Lake House architecture you essentially can, because it allows you to store and query all types of data. A Lake House on AWS connects your data lake, your data warehouse, and all your other purpose-built services through one shared catalog. Once you build your Lake House in AWS, you can store, secure and analyze your data, and control access to it.
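To make the roles of the data catalog and the query engine concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any AWS service is implemented: the `LAKE` dictionary stands in for object storage such as Amazon S3, `CATALOG` stands in for a data catalog, and the table name, file key and sample rows are all hypothetical.

```python
import csv
import io

# Stand-in for object storage such as Amazon S3: key -> raw file contents.
# The key and the data are hypothetical.
LAKE = {
    "sales/2024.csv": "order_id,region,amount\n1,east,120.5\n2,west,80.0\n3,east,42.0\n",
}

# Stand-in for a data catalog: table name -> where the data lives and
# how each column should be typed when it is read.
CATALOG = {
    "sales": {
        "location": "sales/2024.csv",
        "columns": {"order_id": int, "region": str, "amount": float},
    },
}


def scan(table):
    """Yield typed rows for a catalogued table (the schema is applied on read)."""
    entry = CATALOG[table]
    reader = csv.DictReader(io.StringIO(LAKE[entry["location"]]))
    for raw in reader:
        yield {name: cast(raw[name]) for name, cast in entry["columns"].items()}


def total_by(table, group_col, value_col):
    """A tiny 'query engine': the sum of value_col grouped by group_col."""
    totals = {}
    for row in scan(table):
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[value_col]
    return totals


result = total_by("sales", "region", "amount")
print(result)
```

Without the catalog entry, the raw file is just text; the catalog supplies the location and schema, and the query function supplies the analytics. In a Lake House, one shared catalog plays this role for every storage service.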
A Lake House architecture allows you to store your data in an easy-to-access data lake and to manage and analyze it with the speed, reliability and integrity of a data warehouse, so it can support all your workloads.
The heart of the Lake House architecture is the data lake on Amazon S3. But data also can be stored in an Amazon Redshift data warehouse or in other purpose-built database services such as Amazon Aurora or Amazon DynamoDB. This means that you can store any kind of data and choose the best place for it to live, based on storage costs, size, expected throughput, and the type of tools you want to use to access that data. No matter where that data lives, it is covered by the same governance and access control.
A Lake House provides the following abilities:
Below is a list of the primary characteristics of a Lake House:
The power of a Lake House comes from its ability to integrate data stored in warehouses and lakes using a unified interface that brings your multiple systems together. It’s easy to add new data sources, support new use cases and develop new methods for data analytics. However, adopting any new technology can be cumbersome, and a Lake House is no exception. You need to determine how to best architect your Lake House and the platforms and tools that will best suit your needs.
AWS offers an excellent platform on which to build your Lake House. As the Lake House architecture develops, you’ll need to choose between various standards for the data framework.
Governed Tables is the leading proprietary framework for a Lake House on AWS. This feature of AWS Lake Formation provides the data lake with advanced features such as ACID (atomic, consistent, isolated and durable) transactions, data compaction and time-travel queries. Governed Tables comes with the disadvantage of vendor lock-in.
However, you don’t have to use Governed Tables. Although it can take a little more administrative effort, an open source framework can provide much of the same functionality while avoiding vendor lock-in. You can use one of the following three open source solutions:
These open source options offer some advantages, such as a lower risk of vendor lock-in and the chance to continue using your preferred tools. They can be used to help manage a data lake in AWS or in any other cloud. Still, their lack of integration with AWS can require extra administration.
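The ACID transactions and time-travel queries mentioned above can be hard to picture in the abstract. The following toy sketch shows the idea in plain Python: each commit either publishes a complete new snapshot or changes nothing, and earlier snapshots remain readable by version number. The `VersionedTable` class and its methods are hypothetical names invented for this illustration; neither Governed Tables nor the open source frameworks are implemented this way.

```python
import copy


class VersionedTable:
    """Toy table that keeps one immutable snapshot per committed transaction."""

    def __init__(self):
        self._snapshots = [[]]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def commit(self, new_rows):
        """Apply all rows or none: validate first, then publish one snapshot."""
        for row in new_rows:
            if "id" not in row:
                raise ValueError("every row needs an 'id'; nothing was committed")
        snapshot = copy.deepcopy(self._snapshots[-1]) + copy.deepcopy(new_rows)
        self._snapshots.append(snapshot)
        return self.version

    def read(self, as_of=None):
        """Return rows at the latest version, or at an earlier one (time travel)."""
        index = self.version if as_of is None else as_of
        if not 0 <= index <= self.version:
            raise IndexError("unknown table version")
        return copy.deepcopy(self._snapshots[index])


table = VersionedTable()
table.commit([{"id": 1, "name": "a"}])
table.commit([{"id": 2, "name": "b"}])

try:
    # The second row is invalid, so the whole batch is rejected.
    table.commit([{"id": 3, "name": "c"}, {"name": "missing id"}])
except ValueError:
    pass

print(table.version)
print(table.read())
print(table.read(as_of=1))
```

The rejected batch leaves the table at its previous version, which is the "atomic" part of ACID, and `read(as_of=1)` returns the table as it looked after the first commit, which is the essence of a time-travel query.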
AWS has two distinct services, Amazon Athena and Amazon Redshift Spectrum, which enable unified access to data wherever it’s stored in the Lake House. These services allow federated queries across different storage options and are built for specific use cases, depending on the amount of data involved and the desired results.
The primary difference between Athena and Redshift Spectrum is that Athena uses a general SQL engine that supports ANSI standard SQL, while Spectrum queries run within Redshift and use its query engine. Therefore, Spectrum queries can easily and quickly join tables directly with data already ingested into the Redshift data warehouse. Users with most of their data in an S3 data lake should consider Athena, while users who need queries closely tied to a Redshift data warehouse should opt for Redshift Spectrum.
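The rule of thumb above can be written down as a small helper, shown here as a sketch rather than official guidance. The function names, the tie-breaking order and the "no clear preference" result are assumptions made for this example, as are the database name, the SQL and the bucket path. The second function only assembles the parameters that the boto3 Athena client's `start_query_execution` call accepts; it does not contact AWS.

```python
def choose_query_service(mostly_in_s3, joins_redshift_tables):
    """Encode the article's rule of thumb for picking a query service.

    Assumption: a need to join with Redshift tables takes priority,
    because those joins are what Spectrum is described as being good at.
    """
    if joins_redshift_tables:
        return "Redshift Spectrum"
    if mostly_in_s3:
        return "Athena"
    return "no clear preference"


def athena_request(sql, database, output_location):
    """Build keyword arguments for an Athena query (no network access here).

    With boto3 installed and credentials configured, these could be passed
    as boto3.client("athena").start_query_execution(**params).
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


service = choose_query_service(mostly_in_s3=True, joins_redshift_tables=False)

# Hypothetical database, table and results bucket.
params = athena_request(
    sql="SELECT region, SUM(amount) FROM sales GROUP BY region",
    database="example_lake_db",
    output_location="s3://example-bucket/athena-results/",
)
print(service)
print(params["QueryExecutionContext"])
```

Keeping the request construction separate from the call that sends it makes the example easy to inspect and test without an AWS account.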
ExitCertified delivers courses on some of the most in-demand AWS cloud computing solutions. The main Data Analytics on AWS course is Building Modern Data Analytics Solutions on AWS, which is a four-day bundle of the following one-day classes: