WEBVTT
1
00:00:03.000 --> 00:00:12.630
Michelle Coppens :: Webinar Producer: Well, welcome everyone to today's webinar, where we are diving into what's new in the AWS analytics stack for 2021
2
00:00:13.559 --> 00:00:21.210
Michelle Coppens :: Webinar Producer: ExitCertified is excited to bring you this presentation, and I have the honor of introducing the content expert, Myles Brown
3
00:00:22.110 --> 00:00:33.450
Michelle Coppens :: Webinar Producer: Myles is the senior cloud and DevOps advisor at Tech Data ExitCertified. He has over 20 years of experience in the IT industry across a variety of platforms.
4
00:00:34.020 --> 00:00:49.470
Michelle Coppens :: Webinar Producer: Recognized as an AWS Authorized Instructor Champion and a Google Cloud Platform professional architect and instructor, Myles has delivered award-winning authorized IT training for the biggest cloud providers.
5
00:00:50.610 --> 00:00:55.710
Michelle Coppens :: Webinar Producer: Before we get started with our presentation, let's cover the functionalities of the session
6
00:00:56.730 --> 00:01:06.360
Michelle Coppens :: Webinar Producer: During the webinar, the audience's microphones will be muted. So if you have a question, please enter it in the chat box at the bottom of your screen.
7
00:01:07.890 --> 00:01:20.790
Michelle Coppens :: Webinar Producer: If you enjoy the presentation today and you're interested in learning more about training anywhere with our interactive virtual platform, iMVP, I want to invite you to visit our website and contact us.
8
00:01:22.560 --> 00:01:30.660
Michelle Coppens :: Webinar Producer: I encourage you to stick around until the end of the session because I'm announcing a limited-time promotion on training with ExitCertified
9
00:01:31.650 --> 00:01:41.700
Michelle Coppens :: Webinar Producer: Last but not least, the session today is being recorded and we'll share it with everyone who registered to be here today. Alright, let's get started. Myles, you can take it away.
10
00:01:42.540 --> 00:01:43.650
Myles Brown: Thanks, Michelle. And
11
00:01:43.710 --> 00:01:46.560
Myles Brown: I guess we'll probably be sharing the slides as well,
12
00:01:46.800 --> 00:01:53.160
Myles Brown: as I've got a lot of hyperlinks on the slides where you can click through and read the announcement, and then from there dive into the
13
00:01:53.970 --> 00:02:01.920
Myles Brown: documentation or whatever blog entries they have on that announcement. So welcome, everybody. This is going to be about an hour.
14
00:02:02.640 --> 00:02:04.860
Myles Brown: Hopefully I'll be finished talking in about
15
00:02:05.400 --> 00:02:09.930
Myles Brown: 45 minutes and then we have some time for questions. But if you have questions along the way,
16
00:02:10.170 --> 00:02:20.070
Myles Brown: you can throw them in the chat. I think you can either chat with just panelists, which is basically Michelle and I, or with panelists and attendees, and that way everybody sees your question.
17
00:02:20.340 --> 00:02:26.070
Myles Brown: Then we don't get the same question 50 times. It might be better to use "panelists and attendees" when you ask a question.
18
00:02:27.480 --> 00:02:43.020
Myles Brown: First off, I work for ExitCertified. ExitCertified is the go-to-market brand of Tech Data. All we do is training, and we're partnered with AWS and over 20 other large vendors like IBM and Google and Oracle
19
00:02:44.490 --> 00:02:57.510
Myles Brown: But personally, I'm a big data guy. I started off doing a lot of Oracle stuff, taught some classes for Oracle, then moved more into the big data side of things, did a lot of Cloudera
20
00:02:58.650 --> 00:03:06.480
Myles Brown: work, and about seven years ago I started teaching AWS classes. I'd used AWS for a few years before that.
21
00:03:07.110 --> 00:03:22.680
Myles Brown: And, you know, that kind of changed everything for me, you know, going into the cloud. And so my current role is, you know, senior cloud and DevOps advisor. I'm sort of the technical lead for our cloud and DevOps business, and
22
00:03:23.700 --> 00:03:31.470
Myles Brown: today we're going to look at the AWS analytics stack, get a quick overview of it, and then kind of dive into some of the newer announcements.
23
00:03:31.680 --> 00:03:38.670
Myles Brown: You know, both on the streaming and Lambda side of things, then a little bit on storage. We're not going to cover all the things in storage,
24
00:03:38.910 --> 00:03:53.430
Myles Brown: but, you know, storage does affect us in the analytics stack. Then, you know, the data lake ETL stuff, so we're talking about EMR, Lake Formation, Athena, and Glue. And then finally the data warehousing side of things with Redshift. And so
25
00:03:54.810 --> 00:04:03.750
Myles Brown: Most of these announcements are from the very end of 2020 so every year at the end of November, beginning of December.
26
00:04:04.230 --> 00:04:10.500
Myles Brown: AWS has their annual conference called re:Invent, and this year it was virtual and it was three weeks long.
27
00:04:10.920 --> 00:04:24.840
Myles Brown: And there they save up a lot of new announcements, a lot of new features for that time of year, although some of the announcements I'll talk about kind of predated that a little bit. But I just want to kind of get you into the current state of the analytics stack.
28
00:04:26.280 --> 00:04:28.590
Myles Brown: But let's start with a quick overview of it.
29
00:04:29.070 --> 00:04:44.820
Myles Brown: So if we sort of rewind the clock, say, six years ago, this was sort of the state of the art on the analytics side of AWS. Essentially, you know, most people were dropping files into S3 and then doing batch processing using Hadoop.
30
00:04:45.780 --> 00:04:53.970
Myles Brown: And once EMR came out, that made it very easy for people to provision and run a Hadoop cluster in the cloud.
31
00:04:54.360 --> 00:05:03.930
Myles Brown: And you didn't have to fiddle around with going in, launching individual EC2 instances, and installing, you know, Hive and Spark and whatever other projects you needed
32
00:05:04.860 --> 00:05:16.590
Myles Brown: They made it very easy to say: hey, I want a cluster of 100 machines that look like this. And if one died, it got replaced. And so EMR was, you know, really our state-of-the-art kind of
33
00:05:17.700 --> 00:05:22.080
Myles Brown: analytics platform. And so we used it to do, you know, large ETL jobs.
34
00:05:22.650 --> 00:05:29.550
Myles Brown: We would take the raw data from some set of S3 buckets, go and process it, and then we would have a finished product, which
35
00:05:29.820 --> 00:05:40.920
Myles Brown: generally people would drop into S3. And then if there was a subset of that that we thought would make sense for a data warehouse, we would pull it out of there and copy it into Amazon Redshift.
36
00:05:41.460 --> 00:05:47.460
Myles Brown: And Redshift is our sort of flagship data warehouse product. It's very much a competitor of things like
37
00:05:47.850 --> 00:05:59.730
Myles Brown: IBM Netezza and Oracle Exadata and Teradata, any of those kinds of enterprise data warehouses. And so you would only pull in the data that you know you need, to get, you know, sort of
38
00:06:00.570 --> 00:06:12.150
Myles Brown: fast SQL access to this large data, and then you would attach whatever BI tool you wanted to it. Most people would use commercial products like Tableau or Cognos or, you know,
39
00:06:12.630 --> 00:06:17.610
Myles Brown: who knows, there's about 40 or 50 different ones — you know, Domo, Qlik, you know, the list goes on.
40
00:06:18.390 --> 00:06:23.940
Myles Brown: But AWS did throw their hat in the ring, and they built a cloud BI tool called QuickSight
41
00:06:24.480 --> 00:06:33.480
Myles Brown: Now, when it first came out — you know, they're really big on the minimum viable product at AWS, and underline that "minimum," right — so when
42
00:06:33.750 --> 00:06:38.310
Myles Brown: it first came out, it looked a lot like Tableau, but it was missing most of the features.
43
00:06:38.730 --> 00:06:42.060
Myles Brown: And so people, when they look at the things that come out from AWS, they say,
44
00:06:42.360 --> 00:06:52.470
Myles Brown: "Oh no, that's a piece of garbage, I don't want to use it." You know, but if you wait a little bit, you know, they'll add features to it. And AWS really is customer focused — customer obsessed, they like to say
45
00:06:52.860 --> 00:07:04.440
Myles Brown: You know, and anybody who signs up for the preview of a new service, they will, you know, pester you with questions about: hey, what did you like about it? What else do you want to see in it? And you'll see that product, you know,
46
00:07:05.070 --> 00:07:14.010
Myles Brown: mature right in front of your eyes. And so QuickSight is a lot better than it was when it first came out. But, you know, again, if you don't want to use that, you can use some other tool.
47
00:07:15.210 --> 00:07:19.590
Myles Brown: And so this is sort of how analytics looked, say, six years ago, right.
48
00:07:21.150 --> 00:07:27.660
Myles Brown: And the biggest change, probably in that six years is that instead of doing just batch processing. We're doing a lot more stream processing.
49
00:07:28.020 --> 00:07:38.880
Myles Brown: So instead of just dropping files periodically into S3, we might be grabbing data from various data sources, pulling it into streaming or messaging or something.
50
00:07:39.150 --> 00:07:45.690
Myles Brown: Maybe into S3, or maybe, you know, processing it in other ways. But now you see we've got a lot of options, right
51
00:07:47.160 --> 00:07:54.930
Myles Brown: When it comes to stream processing, you know, Kafka came out, you know, a long time ago and it's very, very popular.
52
00:07:55.380 --> 00:08:07.080
Myles Brown: for building this sort of place to hold data that comes in really fast from lots of different data sources — maybe little bits of data that's coming in really fast in a continuous fashion — and then processing it
53
00:08:07.440 --> 00:08:21.600
Myles Brown: You know, maybe with different consumers on different timelines. You know, this one needs to grab all that data every 10 milliseconds, this one needs to grab it every 10 minutes, and this one needs to grab it every 10 hours, right? So we hold that data for a period of time.
54
00:08:22.590 --> 00:08:31.320
Myles Brown: Well, AWS decided to build their own kind of flavor of this sort of streaming, and they called it Kinesis. And Kinesis actually is composed of like four different
55
00:08:31.890 --> 00:08:34.590
Myles Brown: sub-services, if you want to think of it that way.
56
00:08:35.460 --> 00:08:47.220
Myles Brown: But eventually some people complained. They said: you know, I'm moving to the cloud, and I've got all this code written against Kafka. I don't want to have to go and rip out all that code and now have everything talk to Kinesis instead.
57
00:08:47.550 --> 00:08:54.300
Myles Brown: Can you not support Kafka for me? And so it turns out they did. They said: okay, yeah, we've got Managed Streaming for Kafka now.
58
00:08:55.050 --> 00:09:06.420
Myles Brown: So, you know, if your code is all written against Kafka and you want to use that, then you can. But Kinesis itself integrates better with all the other AWS services because it was built in-house.
59
00:09:07.500 --> 00:09:20.730
Myles Brown: SQS has been around — I mean, this predates AWS, really. This came out in 2004, whereas, you know, EC2 and S3, what we normally think of as the start of
60
00:09:21.300 --> 00:09:33.060
Myles Brown: AWS, didn't come out until 2006. SQS is the Simple Queue Service: if you just have sort of an event — hey, I want to drop this message, and then somebody will pick it up later — that's what SQS is
61
00:09:34.230 --> 00:09:49.530
Myles Brown: But a lot of people again sort of complained and said: well, that's great, but you know, my existing app that I'm moving to the cloud, it's got all kinds of code written not against SQS but against, you know, a very popular open source messaging
62
00:09:52.530 --> 00:09:53.760
Myles Brown: tool called
63
00:09:54.990 --> 00:10:07.980
Myles Brown: Apache ActiveMQ, right? A very popular open source product. So, hey, can you help us with that? And that's what Amazon MQ is. And so we have all these different options if we want to move data, you know, little bits of data, quickly.
64
00:10:09.120 --> 00:10:17.070
Myles Brown: Now, one of the biggest changes that we've seen in the last six years is that AWS is really starting to push their serverless concepts.
65
00:10:17.370 --> 00:10:31.320
Myles Brown: Right. So instead of, you know, every time you need some compute, having to provision a virtual machine and run some code that lives on there, and then having to be responsible for that whole virtual machine and the operating system and the file system and everything else,
66
00:10:31.770 --> 00:10:34.170
Myles Brown: And dealing with Linux or whatever.
67
00:10:35.250 --> 00:10:47.730
Myles Brown: you know, they came up with all kinds of serverless options, things like Lambda. AWS Lambda is: hey, I'm a developer, I wrote some code, I configure what's the event that triggers that code to run. You know, so that's serverless computing
68
00:10:48.270 --> 00:10:52.590
Myles Brown: Well, if we look in here, where else is the administrative burden?
69
00:10:53.280 --> 00:11:04.710
Myles Brown: Well, if you look at Redshift — you know, if I just want SQL access to my big data, do I really need to move it out of S3 into Redshift? That's a copy of the data; it takes time to do that.
70
00:11:04.920 --> 00:11:18.150
Myles Brown: And Redshift is an enterprise data warehouse. It does take some administration. It's not as bad as running, you know, Teradata on a bunch of physical machines in your data center, but it does have some administrative burden. Right.
71
00:11:19.440 --> 00:11:30.300
Myles Brown: And so once they came up with a way to do SQL on S3 — namely Athena — you know, some people opted for that instead and said: well, hey, if
72
00:11:30.720 --> 00:11:40.920
Myles Brown: I took my raw data that was dropped into this S3 bucket and I did some ETL on it, and the result of that was well-structured files in S3
73
00:11:41.400 --> 00:11:56.190
Myles Brown: that maybe are set up in a columnar format and using compression and all the things that enterprise data warehouses do — well then, why couldn't I run SQL directly on that? And that's what Athena allows us to do. We can query those aggregated data sets.
74
00:11:57.240 --> 00:12:07.320
Myles Brown: And instead of moving the data into another place and then having to manage that place, now I can just run my SQL query directly on this well-structured data in S3.
75
00:12:07.590 --> 00:12:14.460
Myles Brown: Right. And for some people, that is their data warehouse. This is the model that Netflix kind of popularized
76
00:12:15.060 --> 00:12:25.590
Myles Brown: They've got a 20-petabyte data warehouse that's just a bunch of files in S3. Now, their setup sort of predated Athena, and they were running an EMR cluster with Presto.
77
00:12:26.160 --> 00:12:37.230
Myles Brown: Right, and so they would pay by the hour for however many machines would make up that cluster, but all that EMR cluster was set up to do was to run Presto, which allowed this SQL on S3.
78
00:12:37.860 --> 00:12:48.960
Myles Brown: Well, Athena just says: hey, you don't even have to run that EMR cluster; we'll take care of it for you. And instead of paying by the hour for however many machines make up that cluster, you're just paying per query.
79
00:12:49.980 --> 00:12:54.660
Myles Brown: And so we can query that well structured data. We can even do ad hoc queries on our raw data.
80
00:12:55.260 --> 00:13:04.410
Myles Brown: And so what this raw data in S3 ends up being is what we call a data lake. Right. It's the data from wherever it came from in its natural format.
81
00:13:04.710 --> 00:13:20.340
Myles Brown: dropped in there. We haven't processed it, and we also haven't lost any context of anything. Now, ad hoc querying that — you know, it might not be well formatted, it might not be well compressed or anything, so we're not going to get really
82
00:13:21.420 --> 00:13:31.050
Myles Brown: high-performing queries on it. But it does give me that option if I'm a data scientist, you know, just kind of looking at the data and figuring out what the data is telling me.
83
00:13:31.680 --> 00:13:39.480
Myles Brown: So that was one of our big serverless changes over the years. The other big change was: what about this EMR cluster that's doing the ETL?
84
00:13:39.840 --> 00:13:44.790
Myles Brown: Can we do that serverlessly? And of course the answer is yes. That's what AWS Glue is
85
00:13:45.570 --> 00:13:55.320
Myles Brown: So Glue can be used to perform serverless ETL. So again, it's kind of that idea where there is still a Hadoop cluster somewhere, but I'm not managing it; Amazon is.
86
00:13:55.890 --> 00:14:09.930
Myles Brown: And I'm just paying for, you know, my usage of it — you know, hey, how much time does it take to run these jobs, kind of thing. And so that's what Glue is. And the underpinnings of it — it's all Spark running on EMR somewhere, but the cluster is not my concern.
87
00:14:11.250 --> 00:14:20.280
Myles Brown: And it can actually generate some of the Spark code, and then as a developer, using either Python or Scala, I'll go in and tweak what comes out of that, and
88
00:14:20.670 --> 00:14:23.190
Myles Brown: that's where I really do my ETL process.
89
00:14:23.640 --> 00:14:32.460
Myles Brown: So that's what the current sort of stack looks like. Now, when we talk about the analytics stack, we don't get too much into the machine learning side of things, because
90
00:14:32.640 --> 00:14:48.900
Myles Brown: in AWS it's kind of considered its own whole, you know, category. Although they are starting to bleed together — we're going to see that there's some machine learning now in Redshift, some in QuickSight, and we can certainly run
91
00:14:49.980 --> 00:14:54.870
Myles Brown: you know, machine learning frameworks on EMR, right? So
92
00:14:55.410 --> 00:15:03.690
Myles Brown: so they've started to bleed together a little bit. But I'm going to try and keep this largely machine-learning-free, because, you know, this is enough of a stack to handle on its own.
93
00:15:04.170 --> 00:15:10.500
Myles Brown: So let's start with the sort of ingestion of data, you know, around Kinesis. What are some of the new things that came in there?
94
00:15:11.010 --> 00:15:15.600
Myles Brown: And we'll actually tack in a little bit of what's new in Lambda, because that comes up as well.
95
00:15:16.140 --> 00:15:22.650
Myles Brown: So the big thing here is Kinesis. What does it do? It holds streaming data collected from various data sources for a period of time.
96
00:15:23.250 --> 00:15:29.130
Myles Brown: And then it allows multiple consumers to come in, independently, maybe on their own timelines and grab that data.
97
00:15:29.880 --> 00:15:40.380
Myles Brown: Traditionally, a Kinesis stream — and by default, still — holds data for 24 hours. Now, if you want, you can add extended retention.
98
00:15:40.950 --> 00:15:48.900
Myles Brown: And so, traditionally, that was up to seven days. Now, it's pretty costly: every hour over 24 hours of retention costs you some money.
99
00:15:49.350 --> 00:15:56.070
Myles Brown: But, you know, that seven days was the maximum time. And so what most people would have to do is basically
100
00:15:57.000 --> 00:16:10.140
Myles Brown: you know, build one consumer whose only job is just: hey, grab all the data every, you know, every 12 hours or something. Grab it, do a bit of aggregation, and throw it in S3, because I don't want to lose any of the data. Right.
101
00:16:10.890 --> 00:16:15.900
Myles Brown: But typically, the stream only holds it for a period of time and different consumers will come and grab it and process it.
102
00:16:16.170 --> 00:16:24.570
Myles Brown: But one of them would probably grab all the data, put it in S3. Now we have something called long term retention that you can turn on and it can hold the data for up to a year.
103
00:16:26.220 --> 00:16:38.220
Myles Brown: Now, you pay to retrieve the data that's older than seven days — so there's an extra fee in there — and you also pay for that extra retention: you pay basically 2.3 cents per gig
104
00:16:39.900 --> 00:16:48.480
Myles Brown: per month. So it's basically like S3 pricing. So that is kind of what it's doing. So that's sort of a new thing.
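NOTE
A minimal boto3 sketch of turning on that long-term retention; "my-stream" is a placeholder name, and 8760 hours is the one-year maximum.
import boto3
kinesis = boto3.client("kinesis")
# Extend an existing stream's retention from the 24-hour default up to one year.
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=8760,
)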
105
00:16:49.020 --> 00:16:53.340
Myles Brown: Another aspect of Kinesis is something called Kinesis Analytics.
106
00:16:53.730 --> 00:17:02.070
Myles Brown: And what this was: when they first built Kinesis streams, they were very much like Kafka streams. Then they made something called Firehose, which made it easy
107
00:17:02.340 --> 00:17:09.180
Myles Brown: to build a stream whose only job was to aggregate data and write it to S3 or Redshift or a few other places.
108
00:17:09.390 --> 00:17:18.360
Myles Brown: You weren't going to have a bunch of different consumers; you just had one consumer that was sort of aggregating data based on size or time and then writing it up to S3.
109
00:17:19.680 --> 00:17:33.870
Myles Brown: So if you have either of those kinds of streams, what Kinesis Analytics allowed us to do was use regular SQL-style processing of the data going through one of those streams. So assuming the data in that stream was sort of
110
00:17:34.620 --> 00:17:44.010
Myles Brown: well structured — you know, like maybe it's tab-delimited or it's JSON data or something where I could layer a CREATE TABLE command on top —
111
00:17:44.280 --> 00:17:52.890
Myles Brown: I could say: hey, that looks like a table, I can run queries. You know, if I ran the same query five seconds later, there'd be new data in, and some of the data would have aged out.
112
00:17:53.190 --> 00:17:59.760
Myles Brown: Right, so I could look at a snapshot. Or I could say, hey, run this query continuously and take the result and put that out to a new stream.
113
00:18:00.330 --> 00:18:05.310
Myles Brown: So this was what they called Kinesis Analytics, but it was really all SQL-style processing.
114
00:18:05.760 --> 00:18:14.880
Myles Brown: Recently what they've added is it now uses Apache Flink under the covers, which is sort of a stream processing framework, a little like Spark Streaming
115
00:18:15.720 --> 00:18:23.760
Myles Brown: And it gives us stream processing options in Java and Scala, with Python coming soon. And then after that, who knows, maybe more languages.
116
00:18:24.930 --> 00:18:26.280
Myles Brown: And it allows Flink
117
00:18:27.300 --> 00:18:34.020
Myles Brown: to work on data not just in a Kinesis stream; it actually opens it up. So if I have
118
00:18:35.220 --> 00:18:39.720
Myles Brown: Kafka — that managed Kafka — or Elasticsearch or S3 or whatever,
119
00:18:40.080 --> 00:18:53.250
Myles Brown: I can work on that data. And what it really does is it says: hey, we can make running Flink really easy for you. So as data is going through a stream, I can build these processors on it, not just with SQL but with real programming languages.
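NOTE
A hedged boto3 sketch of creating one of these Flink applications; the role ARN, bucket, and jar key are placeholders, not values from the talk.
import boto3
kda = boto3.client("kinesisanalyticsv2")
# Create a Flink 1.11 application from a packaged job uploaded to S3.
kda.create_application(
    ApplicationName="my-flink-app",
    RuntimeEnvironment="FLINK-1_11",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/my-kda-role",
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::my-code-bucket",
                    "FileKey": "flink/my-job.jar",
                }
            },
            "CodeContentType": "ZIPFILE",
        }
    },
)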
120
00:18:53.940 --> 00:18:58.770
Myles Brown: And it supports Flink 1.11 and also something called the Flink
121
00:18:59.370 --> 00:19:08.190
Myles Brown: Dashboard. So, you know, like I said, all these underlines — these are all hyperlinks. So if I click on this "Apache Flink under the covers" one, it'll take you to the announcement. So this was
122
00:19:08.460 --> 00:19:14.130
Myles Brown: back in October: they introduced the Flink support, and then they added new versions and stuff on top of it.
123
00:19:14.610 --> 00:19:21.660
Myles Brown: And so this is sort of the what's-new, and then from there you can click through and read more about Kinesis Analytics on a
124
00:19:22.110 --> 00:19:27.330
Myles Brown: link, or go into the developer guide. So all those hyperlinks, like I said, will be available.
125
00:19:27.900 --> 00:19:33.030
Myles Brown: We're going to, you know, share a PDF with you if you want to click through any of them and delve into it.
126
00:19:33.450 --> 00:19:41.430
Myles Brown: By the way, if you're just trying to keep abreast of what's new in AWS — you know, that's what I do — I go to aws.amazon.com/new
127
00:19:41.790 --> 00:19:52.860
Myles Brown: I go there probably once a week, and I just kind of look through the list. There are usually 40 or so in a week. I kind of skim it and look: is there anything of interest to me? Click through, read the announcement, and if it
128
00:19:53.400 --> 00:19:58.740
Myles Brown: piques my interest, then I click through into maybe the blog entry that explains how it works. You know, so
129
00:20:00.360 --> 00:20:04.680
Myles Brown: So that's a little bit about Kinesis. There's not a lot new in Kinesis, but that's kind of the big major thing.
130
00:20:05.850 --> 00:20:11.430
Myles Brown: In Lambda, there's a bunch of new things. The idea of Lambda, we said, is that it's basically
131
00:20:11.760 --> 00:20:18.150
Myles Brown: a unit of serverless compute. So you write a Lambda function in one of, you know, a bunch of different languages now,
132
00:20:18.510 --> 00:20:26.460
Myles Brown: and then, as a developer, you say: here's the event that triggers this Lambda function. So it might be when somebody uploads data to this S3 bucket — that's an event —
133
00:20:26.850 --> 00:20:37.860
Myles Brown: or when a new message comes into Kinesis, or into a DynamoDB stream. And over the years they've added more sorts of events, and now
134
00:20:39.360 --> 00:20:47.910
Myles Brown: they allow for things like sum, average, count, and other simple analytics functions over a contiguous time window up to 15 minutes
135
00:20:48.180 --> 00:21:05.640
Myles Brown: So this is sort of an interesting idea. They also have checkpointing for Kinesis and DynamoDB streams, so that if there is an error, we can figure out which data we've processed already and which we haven't. And so we can minimize the duplicate processing after failures.
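NOTE
A sketch of what one of those windowed Lambda functions can look like in Python; the "value" field in the payload is an assumed record shape, not something from the talk.
import base64, json
# Lambda passes the running `state` dict between invocations in the same window.
def handler(event, context):
    state = event.get("state") or {"count": 0, "total": 0.0}
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        state["count"] += 1
        state["total"] += payload.get("value", 0.0)
    if event.get("isFinalInvokeForWindow"):
        # Emit the aggregate once per window, then reset the state.
        avg = state["total"] / state["count"] if state["count"] else 0.0
        print(f"window average: {avg}")
        return {"state": {}}
    return {"state": state}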
136
00:21:07.500 --> 00:21:16.110
Myles Brown: There are new event sources. So either Kafka — self-managed, my own Kafka running on a bunch of EC2 instances — or the MSK,
137
00:21:16.440 --> 00:21:27.840
Myles Brown: you know, those can be events that trigger Lambda functions. And Amazon MQ: just like SQS, when somebody sends a message to the queue — hey, that's an event, call a Lambda — we can do the same with Amazon MQ
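NOTE
A hedged boto3 sketch of wiring an MSK topic up as a Lambda trigger; the ARNs, function name, and topic are placeholders.
import boto3
lambda_client = boto3.client("lambda")
# Register the MSK cluster as an event source for an existing function.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/abc123",
    FunctionName="my-kafka-consumer",
    Topics=["orders"],
    StartingPosition="LATEST",
)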
138
00:21:28.890 --> 00:21:33.990
Myles Brown: Now, the real big changes to Lambda are sort of, you know,
139
00:21:34.680 --> 00:21:44.640
Myles Brown: increasing where it can be used. It used to be that when you set up the Lambda function, you write the code as a developer, you configure what's the event that causes it to fire,
140
00:21:44.910 --> 00:21:53.610
Myles Brown: and then the only thing you have to think about infrastructure-wise is: how much RAM do I want to dedicate for this thing to use, right
141
00:21:54.300 --> 00:22:05.250
Myles Brown: And then they would give you a commensurate amount of CPU. So the more RAM, the more CPU they give you. And so that was really the only metric we had to think about, the only knob we turned as far as operations go on Lambda
142
00:22:05.760 --> 00:22:15.360
Myles Brown: But it maxed out at three gigs of RAM. Now you can go up to 10 gigs, and basically under the covers it scales up to like six full vCPUs.
143
00:22:16.680 --> 00:22:26.910
Myles Brown: And it used to be that, you know, in the old, old days, it was like: hey, it had to finish in a few minutes. And then they upped it to 15 minutes. That hasn't changed — still 15 minutes
144
00:22:27.240 --> 00:22:40.170
Myles Brown: But it was billed in hundred-millisecond increments. So the event occurs, and within, you know, a millisecond or two it's going to launch the code, right. And if
145
00:22:42.000 --> 00:22:53.730
Myles Brown: that code ran for 340 milliseconds, you'd be charged for 400 milliseconds — you know, there was hundred-millisecond rounding. Now they've changed that granularity down to just one millisecond.
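NOTE
A quick worked example of that billing change; the per-GB-second rate below is illustrative, not a quote from the talk.
price_per_gb_second = 0.0000166667  # illustrative rate
memory_gb = 1.0
run_ms = 340
old_billed_ms = ((run_ms + 99) // 100) * 100  # rounded up to 100 ms -> 400
new_billed_ms = run_ms                        # 1 ms granularity -> 340
for label, ms in (("old", old_billed_ms), ("new", new_billed_ms)):
    print(label, round(memory_gb * (ms / 1000) * price_per_gb_second, 12))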
146
00:22:54.600 --> 00:23:04.020
Myles Brown: So, you know, they've made Lambda probably a little bit cheaper for most people. But probably the biggest change to Lambda is this one.
147
00:23:05.640 --> 00:23:09.210
Myles Brown: Instead of that idea where we say, hey,
148
00:23:09.690 --> 00:23:24.570
Myles Brown: here are the languages available to you: Java, Python — oh, you like Python? Well, you can use 2.7 or 3.6 — you know, there were only a few different options. Now, instead of choosing one of the languages, you can say: hey, I've got a Docker container,
149
00:23:25.170 --> 00:23:35.700
Myles Brown: and I just want you to use that. You know, so they provide various containers; you can start with this Dockerfile and then add whatever you want and install whatever languages.
150
00:23:36.120 --> 00:23:48.720
Myles Brown: And so, you know, you basically always knew that this Lambda function was running in some sort of container. Now they're just making it explicit, and you can provide the actual Dockerfile. And so this really opens up languages and things like that.
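NOTE
A hedged boto3 sketch of deploying one of these container-image functions, with the new 10 GB ceiling; the image URI and role ARN are placeholders.
import boto3
lambda_client = boto3.client("lambda")
lambda_client.create_function(
    FunctionName="my-container-fn",
    PackageType="Image",  # container image instead of a zip of source
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest"},
    Role="arn:aws:iam::123456789012:role/my-lambda-role",
    MemorySize=10240,  # the new 10 GB maximum
    Timeout=900,       # still the 15-minute cap
)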
151
00:23:49.710 --> 00:24:04.140
Myles Brown: And now they've added something called CloudWatch Lambda Insights. So you get a nice automated dashboard summarizing the performance and health of those functions. So in addition to the regular kind of, you know, very rudimentary
152
00:24:05.100 --> 00:24:09.960
Myles Brown: diagrams from CloudWatch, now there's Lambda Insights. You have to turn it on; it costs some money.
153
00:24:11.280 --> 00:24:16.410
Myles Brown: But it provides you with some interesting stuff. All right, moving right along.
154
00:24:17.130 --> 00:24:27.270
Myles Brown: Storage is a huge area. There are lots of new changes in storage all the time — dropping prices, all kinds of things. So we're going to concentrate on just a few of the big changes here.
155
00:24:27.720 --> 00:24:37.260
Myles Brown: Starting with EBS. That's our Elastic Block Store. We use that if you launch an EC2 instance and you want a root volume or any other kind of volumes —
156
00:24:37.770 --> 00:24:48.420
Myles Brown: disk drives, if you want to think of it that way. EBS is that, and also we use it in EMR: when you set up a cluster, you say, hey, what do the disks look like?
157
00:24:49.050 --> 00:24:52.590
Myles Brown: Well, traditionally we have two types of EBS volumes: either the
158
00:24:52.950 --> 00:25:07.560
Myles Brown: sort of old-school, you know, magnetic spinning disks — you know, the platters; we call those hard disk drives — and the newer sort of solid state drives, which are much better for random read/write access but are more expensive and don't get quite as large
159
00:25:08.820 --> 00:25:16.860
Myles Brown: So that hasn't changed; we still have those two types. When we look at the hard disks, they dropped the prices by 40% for the Cold
160
00:25:17.880 --> 00:25:20.760
Myles Brown: HDD volumes, the sc1s.
161
00:25:22.200 --> 00:25:31.590
Myles Brown: On the solid state drive side, they updated: we had something called gp2 volumes, the general-purpose ones. Now we have gp3
162
00:25:32.160 --> 00:25:45.030
Myles Brown: And they are basically 20% lower price, and otherwise they have the same sort of capabilities. It's almost a no-brainer that you should go and switch all your gp2s to gp3s, and they have a nice easy way to do it.
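NOTE
The "nice easy way" is an in-place volume modification; a boto3 sketch with a placeholder volume ID.
import boto3
ec2 = boto3.client("ec2")
# Live-migrate an existing gp2 volume to gp3 with no detach or downtime.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", VolumeType="gp3")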
163
00:25:46.230 --> 00:25:54.600
Myles Brown: And you'll just save money right away. So that's a big one. So that's general purpose. When it comes to
164
00:25:55.830 --> 00:26:03.960
Myles Brown: you know, where you need a lot of reading and writing back and forth to the disk, you need a high number of IOPS, what we call I/O operations per second,
165
00:26:04.650 --> 00:26:14.280
Myles Brown: we have a new type of Provisioned IOPS volume called io2. We had the io1s; yeah, the io2s just came out, and they have 100 times better durability.
166
00:26:14.730 --> 00:26:22.230
Myles Brown: So they've got like five nines of durability, rather than three nines of durability, and a 10 times higher IOPS-to-storage ratio.
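NOTE
A boto3 sketch of provisioning one of the new io2 volumes; the AZ, size, and IOPS figures are example values.
import boto3
ec2 = boto3.client("ec2")
ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,           # GiB
    VolumeType="io2",   # the five-nines durability tier
    Iops=10000,         # provisioned independently of size
)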
167
00:26:22.980 --> 00:26:29.310
Myles Brown: So they change these things, right. So, you know, that's one of the things you learn about AWS pretty quickly.
168
00:26:29.820 --> 00:26:38.040
Myles Brown: When you build your architecture, you don't just say: well, that's it forever, great. Because every year AWS comes out with new instance types, if nothing else.
169
00:26:38.220 --> 00:26:46.110
Myles Brown: You need to keep an eye on that, because every year there are new instance types. Not every instance type gets an upgrade, but some of them do.
170
00:26:46.440 --> 00:26:51.960
Myles Brown: And the same thing in this space, where you say: oh, if I switch to the new one,
171
00:26:52.470 --> 00:27:05.580
Myles Brown: you know, I generally get either better price or better performance. Either way, it's a better price-to-performance ratio than the old versions. And so you need to keep an eye on that. So that's a little bit about EBS. When it comes to S3:
172
00:27:06.750 --> 00:27:18.300
Myles Brown: we've always — well, not always, but for a long time we've had this concept of replication, where you set up an S3 bucket, say in Northern Virginia, and you say: hey, whatever gets written to that bucket —
173
00:27:20.130 --> 00:27:25.710
Myles Brown: I don't want to put all my eggs in one basket. What if something happens in that region? Now, that's rare.
174
00:27:26.640 --> 00:27:31.980
Myles Brown: But if I needed a disaster recovery site in some other region — so say in Singapore somewhere —
175
00:27:32.370 --> 00:27:42.840
Myles Brown: you know, we could set up another bucket over there and set up what we call cross-region replication. So anything that gets written here will asynchronously be written across to the other region, into that bucket.
176
00:27:43.290 --> 00:27:56.760
Myles Brown: Now, there is a charge: you know, you're going to be storing the thing twice, so we're paying twice the storage. Also, there are data transfer charges for moving data from one region to another — usually around two cents per gig to move that data. Right.
177
00:27:57.900 --> 00:28:05.040
Myles Brown: But you could only do it to one place. Now we have the idea of multiple destination buckets for replication, and also they added,
178
00:28:05.640 --> 00:28:16.770
Myles Brown: you know, right there in December, this concept of two-way replication. So I could say: hey, I need these two buckets to be in sync. So whether I'm writing into this one or writing into that one, I need it to, you know, sync across.
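NOTE
A hedged boto3 sketch of one direction of that two-way setup; a mirror-image rule on the other bucket completes the pair. Bucket names and the role ARN are placeholders, and both buckets need versioning enabled.
import boto3
s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="bucket-a",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [{
            "Priority": 1,
            "Filter": {},  # empty filter = replicate everything
            "Status": "Enabled",
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::bucket-b"},
        }],
    },
)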
179
00:28:18.030 --> 00:28:21.090
Myles Brown: The biggest change in storage, I would say, is this last one.
180
00:28:22.260 --> 00:28:33.840
Myles Brown: S3 now has strong read-after-write consistency for all applications. Right. So S3, you know, it's very durable, right. When I upload a file to S3,
181
00:28:34.290 --> 00:28:44.730
Myles Brown: they say it's 11 nines of durability. That means I upload a file today, and the chances of Amazon losing it in the next year are 0.000000001%, right
182
00:28:45.150 --> 00:28:58.050
Myles Brown: You're like: how did they come up with that number? Well, the idea is that they're copying that data to three different locations — right, so three different data centers — and at least one of them is in a different availability zone, so, so many miles away.
183
00:28:58.710 --> 00:29:02.880
Myles Brown: So what we're looking at is: hey, for Amazon to lose your file in S3,
184
00:29:04.110 --> 00:29:11.430
Myles Brown: something really bad has to happen — these three different data centers all burned to the ground, and they're geographically separated. Right.
185
00:29:12.540 --> 00:29:25.080
Myles Brown: But that kind of comes at a cost, right. And so it's traditionally been what we call an eventual consistency model. So when I write my file and it comes back and says "thanks, it's written" — if somebody goes and tries to read that file right away,
186
00:29:25.590 --> 00:29:33.930
Myles Brown: They might not see the change. And so, especially when I'm uploading a new version of it, they might see the old version for a second, or maybe two.
187
00:29:34.380 --> 00:29:49.230
Myles Brown: You know, and that's sort of an eventual consistency thing. If you're used to NoSQL databases, you might be used to that concept. But it does cause problems in data lakes and lots of places where, you know, I've got different people trying to look at the same data.
188
00:29:50.310 --> 00:29:55.830
Myles Brown: And even with Hadoop, you know, where I use S3 and I have, say,
189
00:29:56.220 --> 00:30:03.690
Myles Brown: like the old-school MapReduce jobs writing to S3 and then a number of users coming to read, and they say: wait a minute, I don't see all the files here.
190
00:30:04.020 --> 00:30:07.500
Myles Brown: We would have to use something called EMRFS consistent view,
191
00:30:08.040 --> 00:30:19.410
Myles Brown: where we'd say: hey, when I write out these files, let's keep a list of what all the files are, and then the reader doesn't go until it sees them all. And once it sees them all, then it can go and do its thing, right. So
192
00:30:20.310 --> 00:30:33.870
Myles Brown: there was always like a little bit of handling we had to deal with because S3 was not fully consistent. And now it is, right. So this is a huge change; it changes lots of things in AWS.
193
00:30:34.500 --> 00:30:50.220
Myles Brown: But especially for us in the world of EMR, we don't have to deal with any of that anymore. And so that's good. So speaking of data lakes, let's jump into those services that are typically used around them. We'll start with EMR, which is sort of our managed Hadoop
194
00:30:51.450 --> 00:30:54.570
Myles Brown: service; then we'll talk about Lake Formation, Athena, and Glue
195
00:30:55.260 --> 00:31:05.160
Myles Brown: So EMR — I guess originally it stood for Elastic MapReduce. I'm sure they're kicking themselves, because that name doesn't ring, you know, too well with people these days.
196
00:31:05.490 --> 00:31:15.450
Myles Brown: Obviously MapReduce was sort of the early days of Hadoop, and then, you know, Spark really supplanted that as the main batch processing model,
197
00:31:16.140 --> 00:31:23.820
Myles Brown: and the streaming model as well. And so maybe they should have called it Elastic Hadoop or something like that; it would have worked out better. But
198
00:31:24.600 --> 00:31:29.250
Myles Brown: Traditionally, when you would go to launch one of these Hadoop clusters, you would tell them hey
199
00:31:30.060 --> 00:31:39.030
Myles Brown: you know, what kind of instances to use for the master and what kind of instances to use for the slave machines, and then they would go and provision it. You'd just check a few checkboxes: I want to use Spark,
200
00:31:39.210 --> 00:31:43.980
Myles Brown: I want to use Hive, I want to use Presto, whatever. And then it would go and install all that stuff for you.
201
00:31:44.250 --> 00:31:52.710
Myles Brown: And if one of the machines died, it would replace it. Right. But it was launching EC2 instances, and if you went to your EC2 console, you would see those instances.
202
00:31:53.070 --> 00:31:59.520
Myles Brown: And you could possibly mess with them, you know, although there were ways to sort of lock that down, maybe, if you needed to
203
00:32:01.230 --> 00:32:09.030
Myles Brown: One of the things that they've added — I don't know how useful this is to most people, but they now have a deployment option when you launch EMR: you can launch it
204
00:32:09.360 --> 00:32:15.030
Myles Brown: into EKS, which is the Elastic Kubernetes Service. Now this is really, you know —
205
00:32:15.660 --> 00:32:21.510
Myles Brown: you know, Kubernetes is making its way into lots of places. It's starting to make its way into people's analytics stacks.
206
00:32:21.900 --> 00:32:31.260
Myles Brown: Although, you know, with all the customers I talk to, it's sort of like Kubernetes is at work in the shop, but, like, not in the analytics part, right
207
00:32:31.980 --> 00:32:46.140
Myles Brown: It is starting to creep in. I think it's for this: if you're an existing customer and you've been using Kubernetes for all your applications, and you say, hey, I do want to do this one sort of Spark workload,
208
00:32:47.040 --> 00:32:57.840
Myles Brown: and everything else we do is based on Kubernetes — I don't want to have to go launch a bunch of EC2 instances for an EMR cluster when that's not normally how we do things.
209
00:32:58.170 --> 00:33:06.780
Myles Brown: And so just let me run EMR — you know, those Spark workloads — right alongside my other Kubernetes apps in the same infrastructure.
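NOTE
A hedged boto3 sketch of submitting a Spark job to EMR on EKS; the virtual cluster ID, role, and script path are placeholders.
import boto3
emr = boto3.client("emr-containers")
emr.start_job_run(
    name="my-spark-job",
    virtualClusterId="abcdef123456",  # virtual cluster registered on the EKS side
    executionRoleArn="arn:aws:iam::123456789012:role/emr-eks-role",
    releaseLabel="emr-6.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)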
210
00:33:08.280 --> 00:33:13.620
Myles Brown: So that's a minor change for certain kinds of customers. And
211
00:33:15.030 --> 00:33:31.140
Myles Brown: they added a new IDE this past year for data scientists and data engineers, called EMR Studio. It's just a nice way to provide Jupyter notebooks with tools like Spark UI and the YARN Timeline Service, just to make it easy to debug
212
00:33:32.070 --> 00:33:40.260
Myles Brown: those sorts of jobs. But also it's got single sign-on with corporate credentials. So instead of having to, like,
213
00:33:40.950 --> 00:33:53.250
Myles Brown: give people AWS credentials and have them go into the management console and work through that, you know, we can sort of just give them this EMR Studio, and it's a nice way to do that.
214
00:33:55.740 --> 00:33:57.540
Myles Brown: One of the things here —
215
00:33:59.640 --> 00:34:02.820
Myles Brown: one of the new things in EMR is that it now ties
216
00:34:03.930 --> 00:34:11.010
Myles Brown: up nicely with Lake Formation. If you're not familiar with Lake Formation, let me just give you the 30,000-foot view there.
217
00:34:11.460 --> 00:34:20.160
Myles Brown: You know, it's become very popular for people to build data lakes. Right, that's been a buzzword for the past, say, four years; it's been a big buzzword. And
218
00:34:21.090 --> 00:34:27.750
Myles Brown: in AWS, usually that means: hey, here are some S3 buckets, drop all your raw data in there, and then
219
00:34:28.590 --> 00:34:35.340
Myles Brown: And then we'll go and grab data from there and do some processing in various ways, or whatever. But this is where the data lives.
220
00:34:35.670 --> 00:34:44.340
Myles Brown: Now, of course, pretty quickly a data lake will become a data swamp, where you just have unregulated data: you don't know where it came from, you don't know who's allowed to look at it.
221
00:34:44.520 --> 00:34:53.040
Myles Brown: Right. So along with just an S3 bucket, you need metadata to keep track of: where did the data come from, who's allowed to look at it, what's the lineage of it,
222
00:34:53.340 --> 00:35:03.060
Myles Brown: right, all that kind of stuff — and some sort of way to search it nicely to find the data you need. And a lot of the problems from
223
00:35:03.540 --> 00:35:20.730
Myles Brown: data lakes on S3 came from the problem of permissions. Because, you know, when you go to process it with Athena, or Glue, or maybe EMR, or wherever — you know, you've got, like: well, okay, we've got to find the data,
224
00:35:21.420 --> 00:35:34.620
Myles Brown: but what about the permissions? In Spark, am I allowed to look at this? In Hive, am I allowed to look at it? In Athena, am I allowed to look at it? And then at a lower level, you've got, like, actual files in S3.
225
00:35:35.190 --> 00:35:40.350
Myles Brown: So you've got this sort of metadata level, where we're accessing the data through different processing engines,
226
00:35:40.680 --> 00:35:51.360
Myles Brown: and then you've got the actual files in S3. Say: well, hey, I might not be allowed to look at it through Athena, but I still have access to just go grab those files out of S3. That doesn't make sense. Right.
227
00:35:51.780 --> 00:35:56.850
Myles Brown: So Lake Formation was a way to help you build, secure, and manage data in an S3 data lake.
228
00:35:57.120 --> 00:36:06.330
Myles Brown: And at first it was really just a way to simplify setting it up — and, more importantly, setting up the permissions on the data in S3 versus Athena and Glue
229
00:36:06.720 --> 00:36:09.720
Myles Brown: And the piece that they were sort of missing was
230
00:36:10.500 --> 00:36:21.870
Myles Brown: the EMR integration, so that my Spark applications have that fine-grained access control that is determined by Lake Formation and not by file access in S3.
231
00:36:22.200 --> 00:36:30.330
Myles Brown: So they finally kind of closed the loop on that, and so Lake Formation now really does provide what it promised originally.
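NOTE
A boto3 sketch of the kind of grant Lake Formation centralizes; the role ARN, database, and table names are placeholders.
import boto3
lf = boto3.client("lakeformation")
# One SELECT grant here covers Athena, Glue, and (now) Spark on EMR alike.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    Permissions=["SELECT"],
)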
232
00:36:30.600 --> 00:36:41.910
Myles Brown: And then they went further: just about a month ago they introduced some new features, in preview still. So "preview" means, you know, it's not available for everybody; you've got to go sign up for the preview.
233
00:36:42.570 --> 00:36:55.470
Myles Brown: They have this concept of row-level security and ACID transactions on S3 tables. And again, they're enforcing it, you know, at the Lake Formation level. That means I don't care whether you're coming in through Athena, or through some Glue
234
00:36:56.040 --> 00:37:07.470
Myles Brown: ETL job, or through a Spark application on EMR — you know, we're going to enforce these transactions at this level of the data lake. And so that's, I think —
235
00:37:07.920 --> 00:37:19.290
Myles Brown: I think we're going to see Lake Formation turn into a real data lake product and not just a way to help us with permissions. That looks to be the direction they're going with it.
236
00:37:20.430 --> 00:37:22.830
Myles Brown: Not quite there yet; it's in preview.
237
00:37:24.360 --> 00:37:37.260
Myles Brown: Athena, I think I mentioned earlier, is that serverless option that allows you to query data directly in S3 using standard SQL. Now, under the covers it's really using Presto on an EMR cluster that's managed by somebody else.
238
00:37:38.760 --> 00:37:47.790
Myles Brown: And S3 traditionally was used mostly in an immutable way: people drop files in S3, and then your EMR job or whatever would go grab that data,
239
00:37:48.030 --> 00:37:56.130
Myles Brown: read it, process it, and write new files. And so we weren't changing files; we were always processing and writing new files.
240
00:37:56.730 --> 00:38:02.280
Myles Brown: Eventually, people started to run into problems where they said: well, I'm going to use S3 for my data lake,
241
00:38:02.850 --> 00:38:10.050
Myles Brown: but, you know, at some point you've got things like GDPR, where somebody says: hey, I want you to delete all your data on me.
242
00:38:10.710 --> 00:38:23.220
Myles Brown: So we have to go in and delete single rows out of files in S3, right? And we have to be able to, you know, maybe insert a single row, or update a row, or delete a single row.
243
00:38:24.660 --> 00:38:44.760
Myles Brown: And so there's a project called Apache Hudi, and it was a way to allow for easy upsert support in these EMR workloads on S3. Well, one of the latest changes is that now you can query those read-optimized views of an Apache Hudi data set in S3 from Athena.
244
00:38:47.460 --> 00:38:55.140
Myles Brown: And that idea of Apache Hudi is making its way through all of this data lake stuff, it turns out. And
245
00:38:56.310 --> 00:39:04.800
Myles Brown: the other thing that they did with Athena is they said: well, this is great, right — it's traditionally been about querying data in S3, well-structured data in S3.
246
00:39:05.280 --> 00:39:07.230
Myles Brown: Well, now they say, well, let's open it up.
247
00:39:07.830 --> 00:39:17.910
Myles Brown: Right, we're using Presto. Presto allows you to do queries on S3, but that's not where Presto ends. It's a general-purpose query engine; it can query anything with a JDBC driver.
248
00:39:18.240 --> 00:39:28.590
Myles Brown: Right, and so they added on all kinds of options now. So any relational database, non-relational, custom data sources — I can do it all from Athena now. And
249
00:39:29.580 --> 00:39:34.530
Myles Brown: now, I think this all requires the new version of the engine, engine version 2.
250
00:39:34.770 --> 00:39:41.250
Myles Brown: And so that's now generally available. It's got all kinds of performance enhancements. They added some new geospatial functions.
251
00:39:41.460 --> 00:39:54.930
Myles Brown: the concept of nested schemas. It's allowed federated queries, where I can run a query spanning across — I've got, like, well-structured files, like a table, in S3, but also this RDS Postgres table.
252
00:39:55.170 --> 00:40:08.550
Myles Brown: And I can run queries that span and join tables from various places. So it's very interesting, really, they're opening up the power of that Presto engine in Athena and giving it to us.
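NOTE
A hedged boto3 sketch of a federated query spanning an S3-backed table and a Postgres connector catalog; the catalog, schema, table names, and results bucket are placeholders.
import boto3
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT o.order_id, c.email
        FROM awsdatacatalog.sales.orders o
        JOIN postgres_catalog.public.customers c ON o.customer_id = c.id
    """,
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)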
253
00:40:09.930 --> 00:40:22.260
Myles Brown: So that's a big change in Athena, and that's fairly new. So you'll see, if you go to the management console and you try to add a new data source, it might force you to go and switch to engine version 2.
254
00:40:24.870 --> 00:40:32.400
Myles Brown: We talked about Glue a bit already — Glue is that idea of serverless ETL. You don't need to provision and run an EMR cluster. Right.
255
00:40:33.210 --> 00:40:41.280
Myles Brown: They're taking care of that for you, and you'll pay for, you know, how much compute your ETL job takes and how long, right
256
00:40:41.640 --> 00:40:48.510
Myles Brown: It also has the ability to go and run these crawlers that look at all your different data sources and build a Data Catalog,
257
00:40:49.440 --> 00:41:01.380
Myles Brown: and then that Data Catalog is basically a metadata repository for things like Athena and Presto and Hive — you know, anything where I need to be able to access that data
258
00:41:02.160 --> 00:41:07.290
Myles Brown: Well, they've added new Glue crawlers that support Amazon DocumentDB, which is sort of like
259
00:41:08.190 --> 00:41:13.770
Myles Brown: Amazon's way of saying: hey, if you had a bunch of code using MongoDB
260
00:41:14.190 --> 00:41:23.700
Myles Brown: on-prem and you want to move it to the cloud, and you want a managed MongoDB — well, we don't have that. You can run MongoDB on a bunch of EC2 instances, self-managed yourself,
261
00:41:24.420 --> 00:41:41.130
Myles Brown: or you can use Amazon DocumentDB. They're not telling us that it's exactly MongoDB, but the API is compatible with MongoDB, so it's a drop-in replacement for it. Well, now we can crawl either Amazon DocumentDB or even self-managed MongoDB running in the cloud.
262
00:41:42.450 --> 00:42:01.530
Myles Brown: It also now supports not just batch ETL but streaming ETL jobs. So you can set up a new job that basically, under the covers, is ingesting data continuously from Kinesis or from Kafka, and running this job sort of in a continuous way.
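NOTE
A hedged boto3 sketch of defining one of these streaming jobs; "gluestreaming" is the command name Glue uses for them, and the script path and role are placeholders.
import boto3
glue = boto3.client("glue")
glue.create_job(
    Name="my-streaming-etl",
    Role="arn:aws:iam::123456789012:role/glue-role",
    Command={
        "Name": "gluestreaming",  # vs. "glueetl" for batch jobs
        "ScriptLocation": "s3://my-bucket/scripts/stream_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
)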
263
00:42:03.120 --> 00:42:03.780
Myles Brown: And
264
00:42:04.950 --> 00:42:13.620
Myles Brown: There are a couple of changes — including a lot of changes in Glue. They introduced a new visual interface called AWS Glue Studio that makes it easy to kind of set up these jobs.
265
00:42:13.950 --> 00:42:23.670
Myles Brown: You're even able to do some really basic transformations visually: you say, hey, connect this to this, and then it'll figure out the transformations. So maybe it makes you a little bit of code.
266
00:42:25.140 --> 00:42:30.210
Myles Brown: But really it's about giving you a nice visual tool for building these jobs.
267
00:42:32.070 --> 00:42:36.150
Myles Brown: The concept of materialized views is one where,
268
00:42:37.530 --> 00:42:44.190
Myles Brown: you know, if you had queries that had to grab data from multiple data sources, you can now set up — with Glue
269
00:42:45.000 --> 00:42:56.130
Myles Brown: Elastic Views, it's now in preview — a materialized view that combines this data from different places and replicates it a bit, and then you can run the query again and again on that.
270
00:42:57.210 --> 00:43:07.110
Myles Brown: So that's sort of a new preview thing. And the last big change in Glue is the addition of this thing called AWS Glue DataBrew
271
00:43:08.190 --> 00:43:15.270
Myles Brown: And again, it's another visual data preparation tool — it's a little different from Glue Studio. But what it does is it says:
272
00:43:16.440 --> 00:43:23.010
Myles Brown: a lot of what people are doing with ETL is they're cleaning up data. They're grabbing data and they're saying: throw away anomalies,
273
00:43:23.400 --> 00:43:35.880
Myles Brown: change, you know, this date format to that date format — you know, little cosmetic changes to grab data from different data sources, get them all into one canonical model, and then we're going to go in and, you know, maybe build —
274
00:43:39.210 --> 00:43:43.110
Myles Brown: I don't know, with SageMaker or something like that — go and build machine learning
275
00:43:44.010 --> 00:43:56.340
Myles Brown: programs based on that. And so if all you're using Glue to do is clean and normalize your data and you don't want to write code to do it, DataBrew lets you do that. It's got over 250 pre-built transformations
276
00:43:57.030 --> 00:44:01.470
Myles Brown: for things like filtering anomalies, standardizing formats, all kinds of stuff like that.
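NOTE
A rough boto3 sketch of a DataBrew recipe built from those pre-built transformations; the operation names and columns are illustrative assumptions, not values from the talk.
import boto3
databrew = boto3.client("databrew")
databrew.create_recipe(
    Name="clean-orders",
    Steps=[
        {"Action": {"Operation": "UPPER_CASE",
                    "Parameters": {"sourceColumn": "country"}}},
        {"Action": {"Operation": "REMOVE_MISSING",
                    "Parameters": {"sourceColumn": "order_total"}}},
    ],
)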
277
00:44:01.830 --> 00:44:12.210
Myles Brown: And now you can kick it off from an AWS Step Function. So this can be part of your larger series of steps and things to do, and you can even integrate third-party data sets using AWS Data Exchange.
278
00:44:12.570 --> 00:44:28.050
Myles Brown: So if you have some company that's providing some data set that you pay for, we can easily grab that, pull it in with DataBrew, marry it with my own data, and then go in and build whatever applications we're building.
279
00:44:29.640 --> 00:44:40.770
Myles Brown: Alright, we're doing good for time. The last piece here is on the data warehousing side. So I'm going to talk a little bit about Redshift, you know, traditionally, through to what it's starting to look like now.
280
00:44:42.150 --> 00:44:55.410
Myles Brown: So it starts with traditionally: it's sort of an MPP architecture, massively parallel processing. So we have one leader node and then a number of compute nodes, between 2 and 128, right
281
00:44:55.860 --> 00:45:05.580
Myles Brown: And, you know, as an administrator you'd say what kind and how many compute nodes, and that would determine how much data we can store.
282
00:45:06.030 --> 00:45:17.070
Myles Brown: And the leader node is a lot like a regular relational database, right. It's where the SQL clients would connect, through JDBC, ODBC, etc., and they would submit their query.
283
00:45:17.460 --> 00:45:33.450
Myles Brown: And then it would parse that query and check permissions — are you allowed to query these tables? Yes, okay — it would build an execution plan, but then it wouldn't run it, because the data doesn't live on the leader. Instead, it would basically
284
00:45:34.920 --> 00:45:40.830
Myles Brown: generate some C++ code, compile that, and then pass that off to the compute nodes.
285
00:45:41.130 --> 00:45:55.020
Myles Brown: So the smarts are all in the leader node; the compute nodes are relatively stupid. All they do is say: I get compiled code, and I run that code, and it reads the data that I store in my local storage on these compute nodes.
286
00:45:56.370 --> 00:46:01.230
Myles Brown: So compute nodes have local columnar storage, and they execute queries in parallel.
287
00:46:02.340 --> 00:46:10.740
Myles Brown: And we can load the data directly into the compute nodes from, say, something like S3. But it is fault tolerant: if one of these compute nodes goes down,
288
00:46:11.430 --> 00:46:21.780
Myles Brown: it's not like we lost all the data on that local machine, because the data gets, you know, copied to other machines as well. So this is traditionally how Redshift ran.
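As a rough sketch of that load path, here's a COPY from S3 issued through the Redshift Data API, which the cluster then fans out across the compute nodes in parallel. The cluster, database, table, bucket, and IAM role names are all placeholders.

```python
import boto3

# A sketch of a parallel load: COPY pulls files from S3 straight into
# the compute nodes' local storage. All identifiers are placeholders.
client = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-example-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
"""

client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```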
289
00:46:22.860 --> 00:46:30.690
Myles Brown: And the biggest change came a little over a year ago: in December 2019, Redshift introduced a new type of node.
290
00:46:31.080 --> 00:46:41.940
Myles Brown: So traditionally, for compute nodes, there was one type that was, you know, traditional magnetic spinning hard disks, and one that was solid state drives, right?
291
00:46:42.630 --> 00:46:53.340
Myles Brown: They introduced a new one called the RA3s, and they're built on the AWS Nitro System with high-bandwidth networking. And they had the compute nodes, but they also have this managed storage layer.
292
00:46:54.000 --> 00:47:08.730
Myles Brown: So we have these large, high-performance solid state drives in the compute nodes, and they would be local caches of data, but the data is persistently stored and scaled in this managed storage layer that could scale up to eight petabytes.
293
00:47:09.480 --> 00:47:22.350
Myles Brown: And so the data would be sort of passed back and forth, automatically moved between the local cache and the managed storage, based on machine learning algorithms involving data block temperature, data block usage,
294
00:47:23.220 --> 00:47:29.370
Myles Brown: workload patterns. But you just pay for how much storage you use, not for moving data back and forth, kind of thing.
295
00:47:30.060 --> 00:47:44.130
Myles Brown: And so this idea of sort of separating these, using the compute nodes for compute and the managed storage really for storage, you know, although the compute nodes did have some local caching, is an interesting one.
296
00:47:44.790 --> 00:47:50.400
Myles Brown: Now, when RA3s came out, they had two sizes, you know: really big, and really, really big.
297
00:47:51.510 --> 00:48:02.520
Myles Brown: So it was a bit of a problem if you just wanted to learn how it works; it was expensive just to play with it. Or if you had sort of small to medium workloads, you know,
298
00:48:03.210 --> 00:48:08.100
Myles Brown: you had to pay for these really big, you know, nodes. And so that was a bit of a problem.
299
00:48:08.940 --> 00:48:16.410
Myles Brown: But for the future of Redshift, really, this idea of the data warehouse and the data lake, they seem to be getting closer and closer together.
300
00:48:16.680 --> 00:48:26.520
Myles Brown: Right. And they both have this concept of separating the storage from the compute, right? Because it turns out that storage needs rarely line up with our processing needs, right?
301
00:48:27.090 --> 00:48:39.000
Myles Brown: AWS identified that sort of balance-of-system problem, where solid state drive bandwidth has increased by 12 times since 2012, but the CPU bandwidth for streaming has only doubled,
302
00:48:39.420 --> 00:48:53.700
Myles Brown: because of bottlenecks like the internal bus. You know, I don't want to get down to that level. But the idea is, maybe we can do less in the compute cluster and push some of the work down a little bit. So a year ago they introduced the concept of something called
303
00:48:54.750 --> 00:49:03.990
Myles Brown: AQUA, the Advanced Query Accelerator for Redshift. It's built on that managed storage concept of RA3 nodes, adding a distributed, custom hardware-accelerated cache.
304
00:49:05.160 --> 00:49:15.720
Myles Brown: And so you would have the compute nodes with that local storage, and you know, when you had a query, they could push some of the query down into this AQUA layer.
305
00:49:17.280 --> 00:49:28.080
Myles Brown: And you know, it's currently in preview. So in December 2019 they introduced this concept, and in December 2020 they introduced it again.
306
00:49:28.470 --> 00:49:37.140
Myles Brown: And they said, hey, it's in preview still, but it's coming really soon. So now it's like, hey, this is something you should really think about,
307
00:49:37.350 --> 00:49:48.930
Myles Brown: because it'll get you up to 10 times the performance due to offloading certain operations: encryption, compression, filtering, certain aggregation functions. We can push those down into that managed storage layer, right?
308
00:49:49.530 --> 00:49:58.530
Myles Brown: And so really, it's still just in preview, but they made a big announcement again this year saying, hey, it's really coming soon.
309
00:50:00.060 --> 00:50:03.120
Myles Brown: It is in preview mode; you can go in and try to join the preview.
310
00:50:04.260 --> 00:50:10.260
Myles Brown: I don't know, you know, how many more people they're taking into the preview. I think I asked for it and then never got it, so who knows.
311
00:50:11.880 --> 00:50:21.300
Myles Brown: But like I said, RA3 nodes, when they were introduced, came in two pretty big sizes. Now there's an ra3.xlplus. So if we're going to look at that,
312
00:50:22.500 --> 00:50:26.250
Myles Brown: sure, if I just go to the Redshift pricing page, that's probably the easiest way to see it.
313
00:50:28.110 --> 00:50:35.040
Myles Brown: So you'll see that, you know, traditionally we had the dense compute and the dense storage nodes, and dense storage was like the
314
00:50:35.730 --> 00:50:48.930
Myles Brown: spinning disks, the hard disks, and dense compute was the solid state drives. You know, there's usually a small one and a large one; the small one's like less than $1 an hour, and the large one's pretty expensive.
315
00:50:49.440 --> 00:50:57.390
Myles Brown: And so the ra3.xlplus, well, it is just over $1 an hour to play with that one, whereas the other RA3s are quite expensive, right?
316
00:50:58.350 --> 00:51:04.920
Myles Brown: So this gives you an option for, you know, learning about RA3s and the managed storage nodes,
317
00:51:05.250 --> 00:51:16.290
Myles Brown: and also running smaller workloads using it. Now again, if you've got a really small workload, you're not going to use Redshift at all; you don't need a data warehouse if you don't have the data.
318
00:51:17.340 --> 00:51:22.830
Myles Brown: But it gives you an option for, you know, growing from a smaller starting point.
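As a sketch of that smaller entry point, spinning up an ra3.xlplus cluster with boto3 might look like this; every identifier and credential below is a placeholder, not a recommendation.

```python
import boto3

# A sketch only: a two-node ra3.xlplus cluster for experimenting with
# managed storage. Names and credentials are placeholders.
redshift = boto3.client("redshift")

redshift.create_cluster(
    ClusterIdentifier="ra3-sandbox",
    ClusterType="multi-node",
    NodeType="ra3.xlplus",      # the new, smaller RA3 size
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!",  # use Secrets Manager in practice
    DBName="dev",
)
```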
319
00:51:23.880 --> 00:51:29.190
Myles Brown: For all RA3 clusters, one of the new announcements: you can now move your cluster from one AZ to another
320
00:51:29.370 --> 00:51:43.710
Myles Brown: really easily, because that managed storage layer isn't storing the data in one AZ. It's storing it like S3, you know, across multiple AZs at the region level. So if you've got a cluster running and you say, hey, now I want to run the cluster over here,
321
00:51:43.950 --> 00:51:54.630
Myles Brown: you know, it's really easy to move it. And in fact, they tweaked it so that even for, you know, the JDBC and ODBC drivers, that endpoint that I use doesn't change
322
00:51:54.960 --> 00:51:59.370
Myles Brown: just because it moved from one place to another. So that certainly
323
00:51:59.880 --> 00:52:13.500
Myles Brown: helps us with high availability, where these things used to have to be in just one AZ, and every once in a while AWS loses an availability zone, you know, something physically happens and that set of data centers is not available.
324
00:52:14.790 --> 00:52:20.670
Myles Brown: And so this kind of gets us around that: we can easily move it, and we're not losing our data, because the data is at the region level.
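A rough sketch of that relocation flow, reusing the hypothetical cluster from the earlier sketch; the relocation parameters here are worth double-checking against the current boto3 docs.

```python
import boto3

# A sketch of cross-AZ relocation, possible because RA3 data sits in
# managed storage at the region level rather than in one AZ.
redshift = boto3.client("redshift")

# 1. Enable availability-zone relocation on the (hypothetical) cluster.
redshift.modify_cluster(
    ClusterIdentifier="ra3-sandbox",
    AvailabilityZoneRelocation=True,
)

# 2. Later, move it; the JDBC/ODBC endpoint stays the same, so
#    clients don't need to be reconfigured.
redshift.modify_cluster(
    ClusterIdentifier="ra3-sandbox",
    AvailabilityZone="us-east-1b",
)
```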
325
00:52:22.710 --> 00:52:31.560
Myles Brown: They now support scheduling of SQL queries through something called Amazon EventBridge. If you're using EventBridge, this may be of interest to you; if you're not, then I wouldn't worry too much about that one.
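If it is of interest, the wiring might look roughly like this: an EventBridge rule on a schedule, with the cluster as a Redshift Data API target. The rule name, ARNs, and SQL are all hypothetical, and the exact target parameters should be checked against the EventBridge docs.

```python
import boto3

# A sketch of scheduling a nightly SQL statement via EventBridge.
events = boto3.client("events")

events.put_rule(
    Name="nightly-vacuum",
    ScheduleExpression="cron(0 3 * * ? *)",  # 03:00 UTC every day
)

events.put_targets(
    Rule="nightly-vacuum",
    Targets=[{
        "Id": "redshift-sql",
        "Arn": "arn:aws:redshift:us-east-1:123456789012:cluster:ra3-sandbox",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeRedshiftRole",
        "RedshiftDataParameters": {
            "Database": "dev",
            "DbUser": "awsuser",
            "Sql": "VACUUM;",
        },
    }],
)
```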
326
00:52:33.000 --> 00:52:41.700
Myles Brown: Redshift now allows you to use a Lambda function as a user-defined function. It used to be that you could only build UDFs internally; now a UDF can be a Lambda function.
327
00:52:42.030 --> 00:52:47.670
Myles Brown: Right, and so that allows me, from the Lambda function, to go and call third-party web services.
328
00:52:48.000 --> 00:52:55.440
Myles Brown: So when they introduced that, they sort of introduced it with a blog entry where they talked about the idea of using it for tokenization.
329
00:52:55.770 --> 00:53:04.620
Myles Brown: So as people put data into Redshift, you can go in and call a Lambda function that says, hey, I've got some personal data here, I've got
330
00:53:05.070 --> 00:53:13.410
Myles Brown: Social Security numbers or something, you know, something that needs to be obfuscated, and I can go and use some third-party tool to do that, you know.
331
00:53:14.040 --> 00:53:27.900
Myles Brown: Or if I want to talk to something like DynamoDB, I can't do that from inside the regular UDFs inside Redshift; now I can call a Lambda function to do it. Now, there's going to be an extra expense: you're paying per millisecond that that Lambda function runs.
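A sketch of what registering a Lambda-backed UDF looks like; the Lambda function, role, and column names are hypothetical stand-ins for whatever tokenization service you'd actually call.

```python
import boto3

# A sketch: register a Lambda function as a Redshift UDF.
client = boto3.client("redshift-data")

ddl = """
    CREATE EXTERNAL FUNCTION tokenize(varchar)
    RETURNS varchar
    VOLATILE
    LAMBDA 'tokenize-pii'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLambdaRole';
"""

client.execute_statement(
    ClusterIdentifier="ra3-sandbox",
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
# Then in SQL: SELECT tokenize(ssn) FROM customers;
```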
332
00:53:30.870 --> 00:53:38.400
Myles Brown: When it comes to the drivers, they've open-sourced the JDBC and Python drivers for Redshift. So traditionally, Redshift
333
00:53:39.480 --> 00:53:45.720
Myles Brown: is built on top of Postgres, so you could always use the Postgres drivers to talk to Redshift.
334
00:53:46.920 --> 00:53:57.150
Myles Brown: And they said, you know, we intend to continue to support those, but they're not nearly as fast as the Redshift drivers that you should be using.
335
00:53:57.480 --> 00:54:01.770
Myles Brown: But the Redshift drivers weren't open source. So if I had to put those into some product,
336
00:54:02.010 --> 00:54:12.330
Myles Brown: that might change what kind of licenses I could use; or if I had some third-party tool, it wasn't easy to introduce those. Now they've open-sourced those ones. So you don't even have to use,
337
00:54:12.660 --> 00:54:18.150
Myles Brown: you know, the Postgres ones; I can use the highly performant ones, and I can use them in an open-source way.
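For instance, a minimal sketch with the open-source Python driver, the redshift_connector package on PyPI; the host and credentials are placeholders.

```python
# A sketch using the open-source redshift_connector driver
# (pip install redshift_connector); connection details are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="ra3-sandbox.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="ChangeMe123!",
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM sales;")
print(cursor.fetchone())
conn.close()
```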
338
00:54:19.590 --> 00:54:27.810
Myles Brown: Redshift introduced materialized views in early 2020: that idea where, if I have a complex query, it's going to be run again and again and again. So things
339
00:54:29.010 --> 00:54:38.610
Myles Brown: like dashboards, right? You've got some dashboard, and you've got all kinds of employees that bring up that dashboard every morning, right? You're going to run the same query again and again.
340
00:54:39.660 --> 00:54:46.020
Myles Brown: If it's a complex query that grabs data from different places, you know, we might be better off to build a view.
341
00:54:46.860 --> 00:54:52.500
Myles Brown: That simplifies the query. But if I make it a materialized view, then I've got the data in there and I can pull that data.
342
00:54:52.770 --> 00:55:03.270
Myles Brown: And so they came out with that, I think, around May 2020, and then in December they added this idea of automatic refresh and rewrite capabilities. So you can say, I want you to refresh this query
343
00:55:03.450 --> 00:55:15.780
Myles Brown: that builds this materialized view every five minutes, ten minutes, whatever, so that we don't have stale data in that view. It really makes these a lot easier to administer. So on the BI tool side, this really comes in handy.
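A sketch of such a dashboard-backing view with auto refresh turned on; the table and columns are invented for illustration.

```python
import boto3

# A sketch: a materialized view that Redshift refreshes automatically.
client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="ra3-sandbox",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE MATERIALIZED VIEW daily_sales
        AUTO REFRESH YES
        AS
        SELECT sale_date, SUM(amount) AS total_amount
        FROM sales
        GROUP BY sale_date;
    """,
)
```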
344
00:55:16.800 --> 00:55:24.510
Myles Brown: And then there's a couple of features in preview for Redshift. One is support for native JSON and semi-structured data processing,
345
00:55:26.040 --> 00:55:30.270
Myles Brown: so that, you know, if you have data in sort of a JSON format,
346
00:55:30.630 --> 00:55:39.810
Myles Brown: I can run queries in something called the PartiQL query language, which is sort of an extension of SQL that allows us to deal with that sort of semi-structured data.
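A small sketch of the idea, using the SUPER type from that preview; the table and the document shape are invented for illustration.

```python
import boto3

# A sketch: a SUPER column holds semi-structured JSON documents that
# PartiQL-style dot notation can navigate.
client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="ra3-sandbox",
    Database="dev",
    DbUser="awsuser",
    Sql="CREATE TABLE events (id INT, payload SUPER);",
)
# A later query might navigate into the document, e.g.:
#   SELECT payload.customer.name FROM events WHERE id = 1;
```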
347
00:55:40.170 --> 00:55:53.460
Myles Brown: So that's a new feature in preview. And then the other one is Redshift ML. So as a data warehouse user, running queries or whatever, I can now create, train, and deploy machine learning models using SQL commands.
348
00:55:54.690 --> 00:56:04.590
Myles Brown: So I don't have to take my data and then pull it into SageMaker or something. It's basically, you know, using some of those features of SageMaker right inside Redshift.
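A sketch of what that might look like in SQL; the training table, columns, S3 bucket, and IAM role are all hypothetical.

```python
import boto3

# A sketch: train a model with plain SQL; Redshift ML hands the
# heavy lifting to SageMaker behind the scenes.
client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="ra3-sandbox",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE MODEL churn_model
        FROM (SELECT age, plan, churned FROM customers)
        TARGET churned
        FUNCTION predict_churn
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
        SETTINGS (S3_BUCKET 'my-ml-scratch-bucket');
    """,
)
# Scoring then happens in SQL:
#   SELECT predict_churn(age, plan) FROM customers;
```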
349
00:56:05.100 --> 00:56:13.950
Myles Brown: So like I said, that's bringing together the machine learning side of things. Next here is QuickSight. I don't want to get too much into QuickSight here, but the big thing:
350
00:56:14.580 --> 00:56:23.070
Myles Brown: it's a cloud-powered BI tool. Like I said, when it first came out it was a minimum viable product; they've added a lot of fancy features over time.
351
00:56:23.430 --> 00:56:30.600
Myles Brown: One of the newest ones is this QuickSight Q, which gives you kind of a question bar. Like, if you're looking at the dashboards and you don't see the data you want,
352
00:56:30.930 --> 00:56:37.110
Myles Brown: you used to have to, as a business analyst or whatever, go back to me and say, hey, can you run an ad hoc query and get me this information?
353
00:56:37.410 --> 00:56:49.380
Myles Brown: You know, now, right inside that dashboard, I can ask an English-language question, and it'll go and try to parse it using machine learning and figure out the answer based on the data that's in there.
354
00:56:50.610 --> 00:56:54.120
Myles Brown: So if you want to know more about any of this,
355
00:56:54.870 --> 00:57:06.930
Myles Brown: we have a few courses. These are official AWS courses. I think Michelle just put the link to our overall AWS page; you can find all these courses there. There's a three-day class, Data Warehousing on AWS, that's really just
356
00:57:07.230 --> 00:57:13.170
Myles Brown: three days on Redshift, right, from the development and design point of view, but also a little bit of administration.
357
00:57:13.650 --> 00:57:24.450
Myles Brown: Then there's the database class. It's a three-day class just on databases; it talks about running relational databases on EC2 versus RDS, and then all of our NoSQL options. It covers Redshift a bit.
358
00:57:25.200 --> 00:57:37.110
Myles Brown: And then the big data class is the one where we really get into EMR and Athena and Glue. And there's always the AWS Big Data Blog, which is a good place to keep abreast of all that stuff.
359
00:57:38.340 --> 00:57:48.270
Myles Brown: Um, that's really the content I wanted to cover. I'm watching the chat; if you have any questions, throw them in there. In the meantime, maybe, Michelle, you can tell us about this promotion.
360
00:57:48.750 --> 00:57:56.430
Michelle Coppens :: Webinar Producer: Yes, thank you, Myles, and thank you everyone for tuning in. I'm going to share a link in the chat window now, because I know some of you may be
361
00:57:56.670 --> 00:58:09.840
Michelle Coppens :: Webinar Producer: On your way out, but we are running a limited time promotion right now, which can save you money on your training with us. Check out that link in the chat window to learn all of the details.
362
00:58:10.440 --> 00:58:20.820
Michelle Coppens :: Webinar Producer: But pretty much, you can save up to $250 on your course depending on the duration, and there are a few different modality choices for you to make there.
363
00:58:24.450 --> 00:58:29.430
Myles Brown: So they have to register by January 29, but they can take the course in February, March, whatever, right?
364
00:58:29.670 --> 00:58:30.180
Michelle Coppens :: Webinar Producer: That's right.
365
00:58:30.570 --> 00:58:36.510
Myles Brown: Okay, thank you. Um, well, like I said, if we just kind of pop back here,
366
00:58:36.630 --> 00:58:40.530
Myles Brown: These courses typically require you to know a little bit about AWS.
367
00:58:40.860 --> 00:58:49.800
Myles Brown: I would say for the big data or the planning and designing ones, you should have taken sort of an associate-level class, like the architecting one, or the developer, or the sysops.
368
00:58:50.070 --> 00:59:01.050
Myles Brown: I think for the data warehousing one, you could probably get by with just that one-day Technical Essentials class, as long as you've used AWS, you're familiar with the Management Console, and you know what an EC2 instance is.
369
00:59:01.800 --> 00:59:09.420
Myles Brown: Then you can probably jump right into the data warehousing class. But these other ones are sort of advanced classes; they expect you to know a little bit about AWS.
370
00:59:12.690 --> 00:59:18.660
Myles Brown: Well, I think that's about it. You know what, I'm going to throw my email address here in the chat.
371
00:59:21.720 --> 00:59:34.560
Myles Brown: If you have any questions, my email address is in the chat, so just grab it and copy it from there. I just realized I didn't put it anywhere in the presentation.
372
00:59:36.000 --> 00:59:50.370
Myles Brown: And like Michelle said, you should be sent a recording of the presentation, and maybe also a PDF of my slides, so you can use those links to click through and find all the announcements as they came along.
373
00:59:51.720 --> 01:00:07.710
Myles Brown: And I'm going to stick around for a little while. If there are any questions, you can throw them in the chat. So far, not a lot of questions; I went through a lot of new announcements pretty quickly. But like I said, I'm going to stick around for a little while in case anybody has any
374
01:00:08.790 --> 01:00:09.750
Myles Brown: other questions.
375
01:00:18.930 --> 01:00:23.640
Michelle Coppens :: Webinar Producer: Otherwise, thank you everyone for being here today. I'm going to stop our recording now and we'll be in touch soon.