Hello and welcome to today's webinar titled data warehousing on AWS, what's new with Amazon redshift.
Our speaker today is miles Brown, the senior cloud and devOps advisor at Tech data exit certified with over 20 years of experience in the IT industry across a variety of platforms.
Recognized as an AWS authorized instructor champion and a Google Cloud Platform professional architect and instructor Miles has delivered award winning authorized it training for the biggest cloud providers.
All right. Before we get started, let's cover the webinar functionalities.
So during the webinar. Everyone's microphones will be muted. So if you have any questions, I want you to check out the Q AMP a box or the chat window at the bottom of your screen now.
Feel free to enter questions in there at any time as they come up. There will be a dedicated Q AMP. A session at the end of the lecture. But like I said, don't let that stop you from asking questions along the way.
If you enjoyed the presentation today. And you're interested in learning more about training anywhere with our interactive virtual platform called I MVP, please visit our website or contact us or leave our contact information in the chat window.
Today's webinar is being recorded and a copy will be sent to each and every one of you. By the end of the week.
Also WILL SHARE A NEW PROMO. At the end of the webinar, so stick around to learn more. All right, let's get started. Take it away. Miles.
Alright, thanks, Michelle.
As as Michelle mentioned, we're going to talk about what's new with Amazon redshift. But just to set the stage will do a very brief history and overview of data warehousing
And we're redshift came in and really what a lot of the features that have come in and over the last year and a half had to do with was trying to decrease the administrative burden. Right.
And and how much administration a human has to do, just keep this data warehouse up and running and able to handle you know the volume of queries. The, the, you know, concurrency that we need
Then we'll talk about some other features that came in that weren't necessarily around administrative burden and we'll kind of and talking about the future of redshift and and the biggest you know new development that's coming, which is awkward.
We'll have some time at the end for questions, so
You can either put them in the chat or in the Q AMP a box, you could do that throughout. If I forget to look for a little while. We'll show Michelle will start to interrupt and let me know. There was a question somewhere.
So we'll jump right in with the history and overview data warehousing and I was there to witness phones to this sadly
I might not look that old, but I am
In the 90s is really when data warehousing took off. I mean, there were elements of it before that, but this is when organizations data volumes started to increase to the point where traditional databases and there were talking about
You know relational databases like Oracle
Where they would run into two very different competing workloads, you know, the regular kind of transaction processing where you say, Hey, you know, I want to place an order, you know, insert a new order into the Orders table.
Or I want to update a customer's last name or, you know, those kinds of things where it's usually, you know, insert, update or delete on one row or even if we're doing queries. We're not querying the whole table right
But then the other kind of workload really were those big, full table scans where we're doing large analytic queries. A lot of times those queries, what you're trying to do is find out
What customer. Place the most orders within a given time. Or, you know what, what was the total
Sales for last year, you know, so it's it's some X Men count average standard deviation, you know, those kind of group functions or aggregate functions.
And a lot of times you might have a table with 40 columns in a relational database, but you're really just trying to some one column or count the number of something and another column.
And so those are very different kinds of workloads and trying to do them both on the same machine cause some real problems.
So what we started to do in the 90s was to separate it out and just say, well, let's use two different databases, one for our regular transactional stuff.
And then another one that would periodically load data into so we would extract data from our transactional database.
Maybe transform it. A lot of times aggregate it and then load it into a second relational database that we called the data warehouse.
And so, you know, physically, it might be just another oracle database, but we would
tune it differently. We that the data modeling would looked a little different because
We were trying to optimize it for different things. Instead of single row queries and small single row insert, update, delete, and small queries, we're really trying to optimize it for large bulk loads every night.
Of thousands, maybe millions of rows and then really big, full table scans.
And so I was in a place where we just had an Oracle database as our transactional database and then another oracle database, you know, as our data warehouse.
But eventually, some people start to look at it and say, well, how can we optimize this, how can we make that bulk load of thousands, maybe millions of rows.
Every night that's taking a long time. And so what we started to see was this idea of a purpose built appliance. So we're talking about a hardware and software stack together built specifically for this kind of thing where we had, you know, hardware accelerated.
Indexing, so that we could load the data in and so there's all kinds of interesting things so so companies like Teradata
Oracle had a product called exit data IBM had the teaser. A lot of those products are still around, but change names. You know those started to pop up. And one of the big things that they would do would be that they would be columnar databases.
So instead of storing the one row at a time, you know, in a continuous space within some block on disk, we would say, well, let's take all the customer names all the customer addresses.
All the, you know, each column, and we would say, hey, if we're inserting 1000 rows at a time. Let's take 1000 values for this column store those all together in one day eight o'clock, and then the next
You know, and what ends up happening is it really optimizes for those queries where I have 40 columns in a table, but I'm only interested in one
I don't have to sift through all the data on disk for all these other columns. Right.
So these enterprise data warehouses would hold data collected from maybe not even just one transactional database, but a bunch of transactional databases and maybe either other data sources and we would bring it all together.
A lot of times that bringing the data together men, you know, we had to clean it up before we got it there. So that process where we load the database that ETFs would call it extract, transform and load. A lot of times, a lot of it was just cleaning up the data. Right.
And so they were just optimized for different kinds of things very large inserts of a bunch of rows at once and then really big queries on a few calls
And then what we saw once we had the data in this data warehouse, people would connect to it using BI tools like Cognos or tablo right
And so this is sort of like what a data analyst and business analysts probably data handling better term for it.
And they would use sequel statements or a tool that would generate sequel statements to build reports and dashboards and sometimes do ad hoc querying as well.
Now the big problem with these data warehouse appliances, was that they were super expensive. I was at a place where we're really hitting the limits of our my SQL databases.
And we went and we got IBM Nikita now we were not otherwise a big IBM shop but you know we had heard good things and we tested it out. And so we went and bought the smallest
increment of the teaser, which was something like a quarter rack or half rack. I can't remember now. But it was, you know, half a million dollars. Wow, that was Canadian
Based in Toronto. But you know, it's still, it's a lot of money. And that's the smallest increment, we could buy
And we rolled in that appliance plugged it in, and we started filling it every night, we would load our rows and they were lightning fast and our biggest queries.
Fastest could be. And we were very happy. But then cut to a year later, that that box is getting full. We got to go buy another increment. Right. And so it's very expensive and it was really kind of
You know only accessible to companies with that kind of money, you know, who had those really big analytics needs and they were outpacing their existing relational database.
Now the big move, you know, in the 2010s, even a little before that, you know, AWS came up in 2006 but it was really 2010 2011 into 2014 when a lot of companies started to look at, say, Hey, let's move some stuff into the cloud.
And it was really easy to move things like web servers app servers regular traditional databases. But then when you got to something like IBM. The TZ say hey, how do we run that in Amazon's cloud.
Amazon says, well, we're not going to go buy hardware from IBM. That's definitely not what we're doing. So they said, We need to build our own
Data Warehouse service. And that's what redshift. So they announced it at reinvent which is late in 2012 and it finally came out in 2013
And the big idea was that it would be this NPP architecture and you could start fairly small. You could start with just one node that's both the leader and the compute, but then you can add more compute nodes to it so instantly. It had a better
economic prospects than buying one of these big appliances, because you could start small and then as your data needs grow, then you could add more capacity to it.
When it came to the administration side. Well, you didn't have to worry about the physical machines. Right. That's part of what moving to the cloud means is, hey, I'm trusting Amazon to take care of the data.
Data Center and the physical boxes in it. Right. But I still had a lot of the same basic kind of deviate tasks. Right. And so the role of TBA you might be able to handle more
You know clusters in the cloud than you could on prem, because there was less of this hardware aspect to it.
But, and there was a little bit more that was being done for you. But there was still quite a bit of burden on you. Now, when we get into redshift. It's worth talking a little bit about what the architecture looks like or at least what it looked like until recently.
The big idea is that you have this concept of a leader note that's where you connect your sequel clients to whether it's
You know, the, the BI tools or whatever this running queries, or it's your email tool that's loading the data and you know there's
There's different things. You get either a JD be zero know BBC driver. You can act and you write sequel statements. Right.
And the leader node looks a lot like a regular relational database. It takes that sequel query. It says, Hey, do you have the permissions to run this. If so, let's build an execution plan and figure out how when I run the query.
But it stopped short of running the query, because the data doesn't live on the leader. Note the data is distributed across X number of compute nodes.
And so the data is really distributed around. Now, how we distributed depends on your data model we have something called distribution keys and sore keys that determine, you know, when you create a table. How will the data gets spread around
And so the leader node is you know where the query execution starts and it sort of coordinates it but the query actually runs on the compute nodes. So the leader node has to know where the data is so the metadata is stored on the leader.
But it's the compute nodes that actually hold local columnar storage, right. So that's where your data actually resides.
Now it is fault tolerant. So one of these compute nodes dies. That's okay, they've already got copies of that data somewhere else. Right.
And so they'll, they'll just make sure that the query it wherever it lives and then they'll relaunch a new compute node, because it is sort of a managed service if one of those dies, you know, this show will go on and in the background. The launch of replacement for
Michelle Coppens :: Webinar Producer
Hey miles. Can you tell us what the storage capacity of a compute notice
Oh, yeah, yeah. So, so the compute nodes, you can have between one and 128 128 the biggest number of compute nodes. Now, how much can actually store depends on the node type
So there's a couple different node types. There's dense storage and dense compute but little later. We'll see that there's a newer type. So until very recently about the, the largest size of data, you could store is about to petabytes of data, which is a lot right
But recently they've added a new RA three compute node type and with those hundred and 28 nodes we can get up to eight petabytes of data. So it's a really expanded it out. We'll talk a little bit more about that right near the end of the talk.
Now the compute nodes. Hold the data. Now you got to load it from somewhere. Now, you're, you're really not.
Very optimized. If you're doing a bunch of single row inserts, although it's legal. You know, we typically load the data with something called a copy command, usually you have a bunch of flat files sitting in S3.
Which is you know Amazon's object storage layer. So you could have, you know, any number of files sitting there in a bucket and you say, hey, load the contents of this bucket.
Now assuming it's in the right format load this data in and it will take cues from the leader note to say how do I distributed, but the data gets loaded basically directly into these compute nodes.
And also when we run a query you know parts that query are going to be sent to you know most of the compute nodes and the compute nodes themselves have something called a number of slices. So it's a logical
Unit of parallelism within that compute node. It might have two or 16 slices
You know, and so parts of that query will be done in parallel, even on one compute node and then you multiply that by the multiplicity of how many compute nodes. Yeah.
Now you get to decide how many compute nodes and what kind when you launch the cluster and you can resize it later. We'll talk about some of the resizing options. So that's the basic architecture.
So as a BBA. I've got a lot of duties, right. Not only do I have to decide how many and what type of compute nodes and then go launch the thing
But you know every night as we load data in all of a sudden you start to get things like
Well, if I'm deleting data the way redshift deletes it doesn't really delete the data right away. It just marks it as hey this is deleted.
And then at some point, you got to come along and do what we call a vacuum operation that says, hey, let's go and actually reclaim that deleted space and then sort of de frag things to make make room.
And you also have to collect statistics, you know, the optimizer. You know, when the execution plans built on the leader node.
It has to kind of know when I'm trying to join this table to that table, it needs to know how many rows are in each table.
And so it keeps track of this with statistics, but you have to remind it to go and collect those statistics.
And so there's a lot of like day to day kind of administration that somebody has to do so DBS can automate it but they do have, it's on their plate.
And so a lot of the new features that we've seen over the last year and a half, have had to do with let's decrease that administrative burden.
So the idea of doing that vacuum delete back in December 20 team is a long time ago now.
That they basically said we're going to do this for you. You know, we used to do it, maybe once a week.
Or you looked at some numbers. And if that was, it was not often enough, you might do it twice a week. And so you're always kind of like, Oh, I gotta check out by Matt once in a while that automated script that I built
Now it just automatically runs based on query load. So they're going to say, hey, there's not much going on right now. Maybe it's a good time to do a vacuum delete right
And it looks and it says, hey, if the tables got not many rows deleted, then it's fine. I can wait. I can skip that.
And so that's a little bit of the automated administration, then they added the same thing in a little bit later for analyze and so analyzes that idea that goes and gather statistics, how many rows are in the table. How unique are things you know
And every time I used to have to do this every time I loaded the data nightly I would think about do I analyze the whole table or maybe just the primary key column and then once a week, analyze the whole table of all the tables.
And so now, they've just taken that again off your plate. When things are light in the database. It says, hey, do I have any tables with really old statistics, then I'll go in, run the Analyze on
Continuing on. One of the other things I mentioned when you create a table, you have to tell him how should it be to get distributed across these nodes.
Across all these compute nodes. And so if it's a really small table, it might make sense to say, well, let's just copy this small table to every compute node.
Then if we ever join it to something else. We don't have to move data across the network it's you know it's small enough table. It's not going to take up a lot of space.
It's copied everywhere. But if the table gets really large. Well, then you're probably going to say now let's take, you know, the first row and put it here. The second row, put it here, third row, put it here. And then there's another one called
You know key distribution style and I'm not going to get too much into the physics of that you know you could take a class on on redshift. If you want to learn about, you know, data modeling for redshift.
But they used to just use even as the default. Now if you don't tell it when you create a table it defaults to auto distribution style.
And basically what it does is it says when the table is new, it'll be all so if it's a small table gets copied everywhere. But then as the table grows larger, they'll switch. Even so, it's just a better default
Michelle Coppens :: Webinar Producer
Hey miles. We have a question about on vacuum. What's the difference between a regular vacuum and a vacuum delete.
Okay, so that's a good question. So the, the vacuum. If you just run vacuum it does a bunch of things it does a delete and they sort and and so you could do two separate operations, a vacuum sort
And so when we set up a table we tell it. What's the key that it gets sorted on. And what happens is you know when you run a vacuum sort it says, oh,
Any new roles that were added since last time we did a sort will have to go and put those in the right place, because the data actually get stored by the key.
Right. And so if you just do a regular vacuum it does a sort and a delete. But you could separate those and do them as separate operations vacuum delete only or vacuum sort only
So that's, that's the idea. And so once they said, well, we're going to take care of the vacuum delete automatically. You were still stuck doing the vacuum sort
Whenever it made sense. And so, as of just last November. They automated that and in fact they now have, it's a slightly different alternative to vacuum sort. It's called automatic data.
And what it does, it doesn't compromise performance or access to the redshift tables so used to be, what if you did a vacuum sort in the middle of the day when we were busy.
You know your tables got really slow right like some things you couldn't make changes to the table while it was happening. And even if you're running queries that got really slow. And so now the way they're doing it. It's not a big deal.
One last thing.
Something you wouldn't be able to do until just this past November was changed what the tables sort key was
We used to have to go and basically create a new table with the new sort key and then copy all the data over and it took forever.
Right, and all the other stuff you had on your table, you had to go and you know rebuild that on the new table.
Now you can just change the sort key and they'll go and move all the data around in the background for you.
So that's a couple of things. As far as the automated administration, but probably one of the biggest changes, I'd say came in.
About a year and a year and three months ago was this idea of elastic resize so if you really wanted to resize the cluster. So say you said, Okay, let's make a cluster. I'm going to have 10 compute nodes.
And then all of a sudden you know data load start growing and growing the data and you realize, oh, my data is getting up to 70% full
Usually I would set an alarm for that they would send me an email that says, hey, you know, after the last insert of data. I'm now over 70% full, then I would go and resize and there were sort of two options.
If you did the classic resize you could basically say, Okay, I'm not actually adding nodes, what I'm really doing under the covers is building a new cluster. If I want to go from 10 to 14 nodes.
They would launch a new cluster in the background. And then it would take what 1015 minutes to come up, but I wouldn't notice I could still use my regular cluster.
But as soon as I started to resize it would change it to read only mode. So we couldn't change the data we could still run queries.
And then once the new cluster came up. That was the new size, it would start copying the data over and once it was done and the new cluster was up with all the right data.
Then it would do a little bit of DNS magic and that JD VC or PC driver that you were using to connect to the database would automatically point to the new one and then they get rid of the old
But you're only ever paying for one cluster at a time. But there was significant disruption, right, so any current inquiries running or connections they would get
Kicked off and then when you connect it back again you would be in read only mode until the whole new cluster got built and loaded with data and then you get kicked off again. And when you reconnect you're connecting to the new one.
And so some people didn't like all that disruption and so they would do as a separate kind of restore and resize where you would take snapshots that you can take. While the database is running. And you would say let's launch a new cluster which is the new size right
And then restore from the restore from that snapshot that we do.
But we would have to tell the clients at some point okay disconnect from this database. Here's the new JD DC or OTC driver connect to the new one.
So neither of them was perfect, right, and both of them depending on how big the cluster was and how complex your data. Was it could take several hours. Now, if you had the classic kind of we load the data every night. And then all we do is queries all day.
Then you're like, Well, you know, after I load the data that's when I would probably realize that it's time to resize
And then you've got the whole day to resize and we're in read only mode all day but some people are doing a little bit more often.
It's not every night at 10pm or whatever that we're loading data. It's every four hours or every two hours.
And so the resize options here, we're really not cutting it. And so in November 2018 they added this idea of elastic resize and so you could quickly add or remove nodes from a cluster.
And it would only take minutes. And they did a really good job with it, it would literally add notes to the running cluster, but
The nodes had to be the exact same type. So, so with the classic resize you could change the number or type of notes and the size of nodes. Yeah, I could go from an Excel to a 16 next hour, whatever I wanted.
So the last degree size was really helpful.
And the thing that they added about a year ago.
To really help when we were in trouble, you know, the idea of resizing the cluster. That's kind of like a permanent change. Right. You're saying okay 10 nodes isn't cutting it for most of my workloads. I gotta move up to 40
But when you really look at how busy clusters are, you know, Amazon started to look at it.
They don't have to guess they can look at all their customers utilization and realize that you know a lot of people. It's like
They got dashboards that everybody comes in at 9am and looks at the dashboard. And so there's a flurry of activities, all at once.
And so they said, you know, a lot of people have maybe one hour a day where things get really busy.
And so, you know, if you wanted to deal with that. Traditionally you would have to grow your cluster to sort of provision for the peaks.
And then most of the time you know you've got way too much cluster that you're paying for you're paying you know by the hour for however many nodes.
Or you could just say, Well, you know what, when it's busy. Some people have to wait. Okay. Now, people don't like waiting. And so obviously that's not a good solution.
So in in about a year ago they introduced this concept of concurrency scaling and the idea was, when was that when things started to get busy and they could figure out wait times and those those
workload management cues. If the queues were starting to get busy, then they would literally launch another cluster in the background.
But it didn't take like 1015 minutes to launch because they already had nodes up and running. They were just, sort of, you know, grabbing temporarily these nodes that we're already running, you know, in Amazon's background and they would load the data in from redshift.
Doesn't actually make sorry from the S3. It doesn't actually make a copy of it, it would load the data into that secondary cluster.
And they could do it in seconds in an as needed basis. And so you would configure it for certain users or certain queries certain cues really
And only read only queries and they couldn't add reference something like a temporary table or a table with something called interleaved sore keys so regular kind of queries, we could say, you know what, we're getting too busy for these queries. Let's offload them to the secondary place.
And sometimes you wouldn't even have to pay because they're thinking was will give you an hour of concurrency scaling credits every 24 hours that redshift is running.
So once a day, you know you you would you would have enough credits to run one hour of of concurrency scaling.
And when they looked at, you know, how people's workloads really were because they don't have to ask people, they've got all your metrics of how you're using your redshift.
They said, You know what, like 95% of people aren't gonna have to pay anything beyond that. Now that 5% of people that have like a few different peak times and a lot of peak times
You know, you would pay basically pay per second billing for the compute nodes you know for how long they were up for that extra scaling.
workload management, a few times.
And so it's probably worth talking a little bit more about that because that's kind of a big part of the BBA job, or at least it has been
Michelle Coppens :: Webinar Producer
Hey, Miles just before you get into that. Can you let us know if these new features are available on existing clusters or if you need to create a new cluster.
Yeah, so, so the ones that came out a long time ago. Mostly what they've done is you know your your redshift cluster is getting updated in the background.
So you don't really have to do anything to get most of these features right there are some future features that you have to turn on explicitly
But you know, it's really just change a parameter and turn that on. But most of the ones we've mentioned, those are going to be on automatically
Now you can go and look at the individual features and see which ones you you know are automatic or which ones you have to turn on, but you shouldn't have to like launch a new cluster to get any of these things.
Right so workload management when Richard first came out, you know, administrators, had to do a lot of work, assuming that they were going to have, you know, one central
Data Warehouse. And then all kinds of workloads running, running, right, like I said, You've got queries for dashboards that run every morning at 9am or whatever.
And then you've got reports that run at certain times throughout the day and then you got a bunch of ad hoc queries, right. And part of the problem is you don't want one, you know,
Rebel running really big ad hoc queries that aren't that important, and you don't want that messing up you know your CEOs dashboard you
Know what I mean. So you got to kind of set priorities. And so what what out of the box. What you had was a single
Queue that had a concurrency of five that men as people started sending queries to the to the database. We could run five at a time.
And he got roughly 20% of the available memory, right.
And then there was one special few just for super users. So as as an administrator, I could log in as a super user and run queries, even when things were slow my query would get through and
But as an administrator, I would start looking and saying, well, you know what, let's make a special queue that's high priority, they can take more of the percentage of memory.
Because these are really important queries that we got certain SLA them.
And so all of a sudden, how many, how many cues to make, how much memory to utilize. What's the concurrency allowed on them. It's a real mess and it's a lot of work, but you have to figure out how to do right
And so that's the way things have been now they've added all kinds of improvements to this.
One of the things they did back is while ago now in April 2017 we added the concept of query monitoring rules where I could go in bed set some metric based performance boundaries.
On a particular queue and say, hey, when my queue it when my query has been in the queue, not even in the queue, but actually running
You know, if it's taking longer than 60 seconds. Let's abort this, you know, so you can tell it what are the what are the metrics based performance boundary and what action to take. When the query goes beyond that.
And so you started to be able to, but actually that added to the complexity of the administration, because now you have more things to worry about.
And one of the things that Amazon really recognized was, you know, when you have all these competing workloads coming in. What you don't want is one really big query to hold up 1000 little queries, because that's 1000 people you're angry instead of that one.
And so they came up with something called SQL short query acceleration. They first introduced it in November 2017 then they made it the default in August 2018 so let's go a while ago.
And basically, it uses sort of machine learning to look at the query and figure out, I think this is a short query. And if it is, we jump into a dedicated queue. So they can get started, much faster.
And we can get those little ones out of the way, really quickly and then the, the longer ones can run when there were less busy with thousands of little right
And so that's something came out with now much more recently back in September. They came up with something called automatic
WL and the idea that let's take all of this off your plate right will create cues for you with various priorities under the covers.
And then we'll use machine learning to dynamically manage the memory utilization and the concurrency on each of those cues.
And so all you do is when you run a query you tell it. What's the query priority. Now if you don't tell it a priority, then it uses normal, which is the default
But if it's more high priority, then you say hi and it goes into a different queue that has more resources. Now there is a sort of a hidden one called critical that's only for super users can use that
And so that's really the the easiest way to do things. Now if you want to use this, you know, if you have a running all existing cluster.
It's by default it's set to manual workload management, you would have to go and change the parameter group. If you change a parameter in the parameter group that causes
Like a restart of your database, right. And so it takes a little bit of time. But nowadays, if you just launched one from scratch it defaults to automatic WL
So they've really made the database, a lot more self administrating and that's a big part of what they do. Now there's a bunch of other shorter smaller new features that maybe aren't so much about administration.
They've been really busy. You know, like, over the last, say, about a year from from like about a year ago.
A lot of little improvements various improvements to the speed of querying and loading the cluster. I'm not going to get into all of them, but they show us a three and a half times faster, you know, two times faster on this. And so a lot of little things.
Last May they added support for something called stored procedures, you know, a lot of relational databases have the ability to put some code in the database and run it there.
The, the language you use is called PL sequel PG, you know, the roots of redshift are very much in Postgres.
Right. So when when redshift first came out, you know, when they first deciding to build a data warehouse.
They didn't build it from scratch. Amazon went and basically licensed the guts of a company called Power XL. They had about the fourth or fifth place leading
You know, after nitties and Teradata and exit data, it was I think it was about fourth place in the market and they licensed the guts of that and that product was really built on top of Postgres. So, even now, you can use Postgres drivers to connect to ratchet.
And so it's got a lot of roots in Postgres. So you see PG all over the place.
Much more recently, last November. They introduced something called materialized views. So you have you usually it's sort of like
You know, you write a sequel statement that says, well, it's not a real table. It takes data from regular tables but we make it look like a table. So you can run queries on it.
And and it simplifies my queries, because the complex queries hidden in what the view looks like. But every time you run a query on the on the view, it's really running the query on the table again and again.
And materialized views you build that that view once and it actually loads up the data and then if you run the same kind of query on it again and again.
You know, then you get much faster query performance. So this is good for repeated and predictable analytic workloads. So dashboards.
Certain BI tool kind of query stuff. So that's something that's available now, but only in preview. So even when something's in preview, you have to basically sign up
And give them your account number, and they say, you know, oh, you know, there's no space left. We've already got enough people, you know, they basically beta testing this for us.
Another very new feature that's still in preview is something called federated query, they said, you know, I don't want to have to leave redshift to query a regular relational database and then load that data into redshift or something. Right. And so the idea that right there from
From redshift. I can query tables.
Well, we'll talk about something called spectrum that allows me to query you know well formatted files in S3, but also RDS for Postgres and Amazon Aurora Postgres.
So those are regular relational databases not data warehouses and I can just, you know, query the data from those tables, as well as redshift and join them all together. You know, it's magic.
Another thing that we run into a lot is this idea of data lakes right where people have the raw data sitting in a bunch of s3 files.
And a lot of times we want to grab data from redshift and export it and put it into my data lake.
But the regular export, you know, we have an unload command. It just comes out as a flat file, but a lot of times in these data lakes, we want to use a fancy format that's columnar
And it's binary files and a very popular format for that is called Apache Parquet not going to get too much into what that is, if you've heard of it. Good.
And so now, the idea is, when you do an export. You know, I run a sequel query in redshift. And I say, hey, I want to export the contents of that in parquet files in S3.
Another thing that really just came out. This is the newest thing, this came out two weeks ago redshift now supports per second billing
So it was always you're paying by the hour they advertise the price per hour, you know, various types of nodes. And if you ran it for only 10 minutes you paid for an hour.
If you ran it for an hour and a half, you paid for two hours. Right. So there's all we rounded up now supports per second billing. Now that's really important for the concurrency scaling, right, is it comes up, you only use it for a little while, it goes away. You only want to pay per second.
Oh, I see some more questions. Maybe I should jump into these
Michelle Coppens :: Webinar Producer
There's two questions about migration and customer experience. I think we can save those for the Q AMP. A. But what about um can workload management be set up to give a group of users less resources with running a query.
Less resources when running a query. So when you set up those workload management cues you can make a special queue that says, hey, any of these users when they
Submit a query it goes to this special queue and so that cue. You can set up the concurrency on it. So how many queries can run at the same time in this one, q
And so if somebody goes and runs a query or question sequel statement and it's a very long running query. Well, then you know that's taking up one of those slots.
And so obviously there's less resources available to that user if they submit more queries, right, then there's there's fewer so yes in that respect.
I mean, you could get down to the point where you set up like a Q, just for this user and say, hey, the concurrency is one. They can only run one query at a time.
Michelle Coppens :: Webinar Producer
And what about query priority. How does it get chosen is a baked into the query or the admin assigns a key change on the fly while while queries are running
Yeah so. So the idea of that is we can do it based on
If you're doing like the manual workload management, then I could say, hey, anything this user does has this goes into this queue, you know, and they didn't have the priority concept there. Right. It was just
You know, based on the user, it would go into a queue that has more or less resources for it now with the automatic basically the
The query priority is set as normal unless the user overrides it and now you can change. Who's allowed to override it. Right. But we can get. We could put it in the users hands.
But there's, I can't remember.
I can't remember if we can set
Using permissions. Who's allowed to
Who's allowed to run what priorities, but we can change at the as an administrator. You can change the priority on the fly while it's running. Yeah.
But yeah, you know what, I'm not really sure how we can set the priority for a given person what what query priorities, they're allowed to use. I don't remember how that all works.
Yeah, that's a good one.
Okay, so the migration stuff, like you said, Well, hold on to that to a little later.
All right, I'm probably the biggest change in redshift over the past few years was the introduction of this concept of redshift spectrum, right, this came out right after Amazon Athena and it's basically on the same idea, which is
I want to be able to query data from my data lake right and the data lake is just a bunch of s3 buckets full of files. Ideally, you know,
We've got some structure to where the files are and who's allowed to touch them. But, but I just throw these raw files in there.
And then normally what you would then have to do is take that data and loaded into data warehouse and then query it from there.
But the idea of Athena was well if there's some structure to those files if they're common to limited or tab do limited or whatever or even, you know, fixed length.
We could layer a table on top of it and run queries and what Athena really doing under the covers. It's, it's just using a big Hadoop cluster that I don't have to manage right that Amazon has in the background running something called Presto, which allows us to do sequel on S3.
And instead of me running a big EMR cluster and paying by the hour for X number of nodes that I've got to keep track of.
It just says, no, no. Instead, you'll pay per terabyte of data scan. So you're really paying per query. So it's really a, you know, on demand kind of idea of the way most of the clouds.
And so it allows you to query tables that are you know on data in S3, that wasn't loaded into redshift and the, the idea of spectrum says, Well, instead of having a separate tool Amazon Athena to run those queries.
If you're already connected to redshift and you're querying redshift tables. You can even join to these external tables that point S3.
So you can use the power of redshift.
You know, to query data that hasn't even been loaded. I can join it with redshift table data and you know it's very popular with organizations that maintain this concept of a data lake on S3.
But once we start to see this, people start to question. Do I even need redshift anymore.
Because a lot of people with their Hadoop cluster started to see this idea of, well, I've got a landing space where I dropped my raw data, and then I cleaned it up and put it into well formatted files.
And if I use something like Apache park a format where they're columnar data and we can get good compression or them. Well, that's the same kind of optimizations that we do in a data warehouse. Why do I need to then load the data into a data warehouse. Why don't I just do sequel on S3.
And a lot of organizations, looking at this and one of the ones that's very vocal about what they did was Netflix.
Right now, before Athena came out they were doing this they were running two types of Hadoop clusters using EMR.
They had transient ones that would come up say every four hours and run a bunch of ETFs jobs using Spark.
So they would take the raw data in one set of s3 buckets, clean it up, put it into parquet and all that, you know, good stuff.
And then like that out to another set of files in a different S3 bucket and that S3 bucket.
Would be their data warehouse and they had 20 petabytes of data there and they were basically running a long running query cluster.
That just ran Presto, which allowed them to do sequel queries on that S3 data and what Netflix. Tell us was
For what they were doing with analytics. It was fast enough. Is it ever going to be as fast as redshift with that MVP architecture and he says he drives and everything. No.
But is it fast enough for you. Right. And so some people said, yeah, right. And what that really means is I only have to pay for the long term storage so S3 and I can bring up the compute when I need it and get rid of it when I don't
That's very attractive. So it's that idea of separating the compute from the data, whereas redshift. You know those compute nodes were both that's where the queries were running and processing, but it's also where they held the data in local disks.
So some organizations they really like this idea of sequel on S3 and they would use that as their primary analytic platform.
But then once in a while when they needed redshift. The power of speed of that they could do it. So I didn't do this myself, but I had a colleague, tell me about a place he work.
They were doing the the Athena thing, you know, that was their analytics, but once a quarter. At the end of each quarter. They had some really heavy duty reporting needs.
And you know the the the Athena queries were not fast enough and they were having to pay a lot of money because you're paying per terabyte of data scan and they had really big reporting needs at the end of each quarter.
And so what they would do is a week before the court, and they would load up a redshift cluster and start loading the data in from S3.
And then beat it up for about a week with queries and then shut it off. And in the old days of physical data warehouse appliances, you would never do that.
You wouldn't pay it half a million dollars to buy this hardware, you're not going to just plug it in a week of a quarter right so it's a really interesting and this idea is a lot easier now that the thing can manage itself.
Now the the newest thing we're finally at the newest stuff the newest thing is these new kinds of nodes that just came out in December, called RA three
And it's built on the AWS metro system, which is basically that new kind of hypervisor that they use.
And the idea is that these compute nodes will hold the data in that local SSD. But at that local SSD is really a local cache and the real storage would be this Amazon redshift manage storage layer.
So the data is score stored and scaled separately in this manage storage and this thing can scale up to eight petabytes, and the compute nodes themselves could hold up to 64 terabytes per node, you know, multiply that by 128 nodes you get to your eight petabytes
And the real magic is the fact that these compute nodes and especially with that that high bandwidth networking
You know, they can move the data back and forth between the local cache and long term storage and you don't have to think about it.
So this data is automatically moved between the local SSD cash and the Manage storage.
Based on some machine learning algorithms. So they're looking at what data blocks are hot being used all the time.
And which ones are, you know, blocking other things and your workload patterns.
The nice thing though, you pay the same low price, regardless of where it lives and the low price if you look in most regions in North America is 2.4 cents per gig per month.
Right, so it's a little bit more than S3 S3 is 2.3 cents per gig. This is 2.4 cents. So you're paying that whether they're, you know,
The, you know, the storage is here or there, they move it back and forth. And it's magic. And you just know that it's, you know, running fast.
And the more storage. I need great, but that doesn't mean I have to load extra compute nodes. So if I have just three compute nodes then I get three times 64 terabytes of local storage.
But if I just keep adding more and more data to manage storage can grow right. I don't have to add more compute nodes to get more storage, the Manage storage can grow. We can only have so much local catch
So that's the idea of these new RA three. And it's really done this sort of separation of compute and data again that people are so interested in
And now when we look at the future of redshift, we see that Amazon's really doubling down on this idea
And so this this concept of data warehouses and data lakes. They seem to be getting closer and closer together, and they all embrace this idea of separating storage from compute right
Because what we find is that our storage needs. You know, everybody's data is growing all the time, and it really those needs outpaced the processing.
And one of the biggest things that I Amazon identified is this problem that if you look at the speed of SSD.
You know their bandwidth is increased by 12 times since 2012 but if you look at the speed of streaming CPU bandwidth
It's only doubled, because there's bottlenecks of the at the internal bus connecting memory and CPU. And so, so really the solution here is, let's do less at the compute cluster and more down in the storage layer.
And so the new thing. The, the new thing and the future of redshift is what they're calling aqua advanced query accelerated.
And it builds on that idea of our a three nodes and it's only for a three nodes now and but in that manage storage layer. It adds a distributed custom hardware accelerated task.
Now this concept of aqua is currently just in preview. So if you click
Here you can actually get to the preview page and you have to fill this out and then they'll tell you. Okay, we're going to contact you, when we're ready. Okay.
But they talk a little bit here about the 12 times two times and then they get into and I just told this diagram right from here. But again, you know, as long as you're using RA three instances, then you know you can just turn this on, there's nothing else yet to do
So, so that's the concept of aqua
And what they're finding is they get the 10 times performance due to offloading operations like
Encryption compression filtering aggregate functions they push all those down to the storage layer. And so the data they're moving back to the compute layer is already the data you need for your query.
Michelle Coppens :: Webinar Producer
There was a question back about Athena. Is there a limit to the number of Athena partitions.
There might be a limit, but it's it's pretty big. I've never run into anything like that. The idea of partitioning when you create a table, just like you have in regular relational databases you know you don't want to just one big
File with all your stuff and we break it up into multiple partitions and then when you say, hey, find me all the sales for February.
You know, I look at a sub folder that has just the data for February and I can skip all you know January, March, April, and so it's the same thing the S3 bucket. You can make sub folders, you know, based on some pseudo column and then when curries go on that pseudo call
I don't remember the limit for how many, but it's a lot if there is a limit, and I suspect you know most things there are limits.
But if there is a limit. It's really high. So I have not run into it.
Alright, so just before we get to the questions I want to make sure I get time to this. If you want to learn more about redshift.
We have a very specific class called Data Warehousing on AWS. It's a three day class instructor led
We teach these courses you know in our training centers across North America, but also through our, what we call I MVP largely uses zoom
As as the remote in technology, but it comes with, you know, some nice features. And we usually, you know, sort of teach classes in a hybrid where some people are in class. Some people are coming in remotely.
And so our instructors are very used to that. And so the data warehousing class. It covers redshift, both from an administrative point of view, you know, with the little bits that you still have to do at least understand how it works.
But also from like data modeling kind of point of view, more of the developer side.
We have a brand new class called planning and designing databases, it, it has a chapter on redshift, but it's really about talking about RDS and Aurora and all of the no sequel options dynamo DB Neptune documented dB.
Net Maria all these different options and really wind us which and so that's also a three day class two brand new one.
There's a lot of related Big Data technologies outside of redshift. So I mentioned a few long away I talked about EMR.
Right, that's that's Amazon's managed Hadoop cluster right and Athena and there's also other ones like Canisius and AWS glue. And so those are the things that are covered in the big data on AWS class again a three day class.
And then outside of the classes I find the best place to learn and keep up to date with what's going on and redshift.
AWS Big Data blog, not the full AWS blog where you get 40 entries a day just the one that are marked big data. So anything redshift will be in there, but also some of those other
Services. I mentioned will be in there. And so there's only about one or two entries, a week. So it's really easy to to be a fan of that blog.
If you do take one of our classes, we now have this refer a friend or colleague kind of program where you'll get an Amazon gift card.
And then the student they if they come and take a course, you know, then you'll give a gift card and that student receives a 15% off the course. And so that's, we got a link for that. I think probably Michelle, you can put that in the chat at some
Michelle Coppens :: Webinar Producer
Yeah, so all of these classes are linked in the chat. Same with the big data blog and our promo code, check out the chat window for all those links.
Alright, well thank you so for questions. Let's go back to the open questions. I'm going to run through the ones I see here how easy is it to migrate IBM to tease or Teradata on prem to redshift, what's the typical timeframe to do migration.
I mean, when it comes to migrating databases in general, right there.
Sometimes it's super easy and sometimes it's really hard, right, super easy is when you have a hard cut off where you say, Okay, everybody stopped working with this. We're going to export the data and then load it into the new one.
Right, but almost nobody has that luxury right so it's hard, right, because you have to say, well, let's let's take, we'll do a one time cut over
But then, what about the data that we've missed in between there's some deltas in there. How do we get those often. The other concern is, you know, there's different data types.
There might be things like code, you know, if you have stored procedures or something you have to look at. Hey, how easy are those going to be can we even do that kind of a
stored procedure and so there is a tool to really help with all this called database. Migration Service or DMs.
So if you if you look in the Amazon Web Services console. You'll see database. Migration Service.
And then in there. There's another tool called the schema conversion tool that's a good place to start where you say, hey, look at my existing schema.
Tell them what the target is point them to it and then say, what's my target, you know, what's my source. And what's my target database. So if I'm coming from the teaser and going to redshift. It'll say,
Well, here, would the tables in your old Natasha, and here's the CREATE TABLE commands for your new ones. So we can do the schema Park.
But then you're going to use database. Migration Service to tell it. Hey, start loading the data. Now the problem is, these things are huge. So it's going to take a long time.
So the typical timeframe to do migration, you know, if you can have like a weekend where you say, okay, or we're not going to do anything, we're just going to load the data over and it's going to take us you know 10 hours or whatever it is to do the actual loading of the data.
And if we have that time where we say, hey, cut off, great. If not, then you'll have to leave the existing one and database. Migration Service can help a little bit with hey everything that gets written there. We have to make sure those changes are also made in the new one.
So learn a little bit more about that database. Migration Service. There's, I would go to the YouTube channel for AWS and just look up DMs.
And and watch a little bit about it. Now, when it first came out, it didn't deal with redshift. It was really more of like I've got an oracle transaction database. And I want to move it into RDS
You know, but then they had it in the component for redshift from coming from the teaser or Tara data. So they does support those
I'm curious to know any feedback from customers who have migrated from Teradata experiences pain points savings. Yeah. So I think, I think.
I don't know about specifically from Teradata but you know one of the things that's hard to do is to benchmark stuff.
Because, well, we can't publish benchmarks for these because if you look at your service agreement with with Teradata or IBM or Oracle. They all say like you're not allowed to publish results.
Of your benchmarking, right. So, so it's hard when you look at the AWS benchmarks, they just mentioned against popular you know they don't say the names of who took up against.
What we find though is before aqua came along, you know, everything was pretty much on par with what you would get out of an enterprise data warehouse appliance. It's just a lot less administration. Right.
But, you know, if you've got a lot of really specific things hooked in and using, you know, some sort of
extensions to sequel language that are specific. Those are where you're going to really run into the pain points. If you've got code stored procedures, that's going to be a problem, right. So these are some of the big pain points I would say
Getting used to the sort of idea that redshift is
A managed service. And it's, you know, if you can start to think about it as hey let's only run it when we need to. You know that that kind of cloud idea where you don't pay for things that are sitting idle, you know, that's a good thing.
federated queries work with ECU based relational databases. No. Yes, but only Postgres.
So it might be a good way to load data volumes of data from RDS Yeah, unfortunately, the my sequel is not an option. So right now the first cut at it, it's in preview Amazon is sort of famous for the way they
They roll out new features they get a minimum feature set out there and then they asked for feedback from customers and customers say this is great, but I need it for RDS for my SQL. Yeah. And so then they'll have that
Eventually, you know, probably. But yeah, you can work with ECG based RDS for Postgres or even just EC do Postgres.
Is there a redshift certification or is it part of the Big Data certification. It's part of the Big Data certification and in April, the name of that is being turned into a data analytics.
So, so not even I guess the big data keyword is sort of outmoded you know all the big companies I work with, they just call it their analytics group. And so that's, yeah.
But yeah, unfortunately, that certification. You have to know a wide breadth of services you need to know all about Canisius, and Dynamo DB an EMR and Athena and glue and Russia.
Right at a pretty deep level. So to really get prepared for that certification. You almost have to do both those three day class.
Alright, we're past one o'clock I went a little long. Sorry about that.
Michelle Coppens :: Webinar Producer
Just one more question. Do you know if there's a server list.
Redshift on its way.
Yeah, that's, I mean they haven't said anything. You know, they don't do much in public roadmaps
The fact that awkward out you know it's telling me that
They're taking it pretty seriously about this idea of managing all the storage for us and not having to worry about it. So it. It's got to be that the compute nodes are next. But when who knows, you know, they haven't really stated anywhere that that's coming
Now one of the things you can do is, you know, in an individual basis if you've got like a enterprise support.
You can say, hey, put me in touch with the tams you know there. Get your tam to put you in touch with this means on redshift and they might have you sign a specific NDA just for that.
Where you can ask about it but you know they don't they don't give much in public roadmaps to tell us what's coming like that.
Okay. So I guess the recording will be sent to you at some point.
If you didn't get a chance to save all those links in the chat. There's a way to save the chat.
Michelle Coppens :: Webinar Producer
Yeah, if you open up the chat window.
Find the More button at the bottom right corner, click Save chat, you'll have all those links there.
These questions will also be included in the recording that we're going to send to you on hopefully by the end of the week, if not early next week actually just got confirmation, you're going to have that by tomorrow.
We are at the end of our webinar. Now, I know we went a few minutes over. So thank you so much for your time. If you have any questions that you just didn't get the chance to ask, don't hesitate to reach out to us and hopefully we'll see you in one of our class. Our costumes, very soon.