00:00:24.360 --> 00:00:30.030
Alexandra Kenney: Everyone, we're just letting people jump into the webinar and get started, so we'll just take a minute or two before kicking off.
00:01:15.630 --> 00:01:23.970
Alexandra Kenney: Okay, I think we'll get started as people are joining. So hello, and welcome to today's webinar, titled What's New in the AWS Analytics Stack for 2022.
00:01:24.420 --> 00:01:33.090
Alexandra Kenney: My name is Alexandra and I'll be your MC for the next hour. Thank you so much for joining the conversation. Before we get started, let's cover off some of the webinar functionalities.
00:01:33.630 --> 00:01:41.790
Alexandra Kenney: During the webinar everyone's microphone will be muted, but we want this to be an open discussion, so if you have any questions at all, please enter them in the Q&A box
00:01:42.090 --> 00:01:54.240
Alexandra Kenney: or the chat window at the bottom of your screen. There'll be a dedicated question and answer session after the presentation. Today's webinar is being recorded and a copy will be sent out to all the registrants at the end of the week.
00:01:55.680 --> 00:02:02.160
Alexandra Kenney: So today's speaker is Myles Brown. He's a senior cloud and DevOps advisor for ExitCertified. Myles
00:02:02.790 --> 00:02:07.530
Alexandra Kenney: Has over 20 years of experience in the IT industry across a variety of platforms.
00:02:07.980 --> 00:02:14.790
Alexandra Kenney: He's recognized as an AWS Authorized Instructor Champion and a Google Cloud Platform Professional Architect and instructor.
00:02:15.210 --> 00:02:23.940
Alexandra Kenney: Myles has delivered award-winning authorized IT training for the biggest cloud providers. So you might ask yourself, why choose ExitCertified for your AWS training?
00:02:24.600 --> 00:02:32.910
Alexandra Kenney: Professionals like you have been training with us since 2021. You want to know your training provider has the credibility to earn your organization's trust, and ExitCertified delivers.
00:02:33.360 --> 00:02:43.980
Alexandra Kenney: Our vendor-approved IT training is our only business. When you train in our facilities, you'll find well-equipped classrooms and friendly staff who are dedicated to making your learning experience comfortable and productive.
00:02:44.370 --> 00:02:53.340
Alexandra Kenney: And when you train remotely with our MVP virtual platform, you'll see that our investment in technology makes online learning every bit as engaging as the training you've taken in person.
00:02:53.850 --> 00:03:07.080
Alexandra Kenney: So if you have any questions about why you should take AWS training with ExitCertified, or which course is right for you, you can contact us just after this presentation or ask us any questions. So let's get started. I'll hand it over to you, Myles.
00:03:07.980 --> 00:03:08.880
Myles Brown: And thanks, Alex.
00:03:10.290 --> 00:03:14.490
Myles Brown: All right, well, welcome everybody. As Alex mentioned, my name is Myles.
00:03:15.660 --> 00:03:21.270
Myles Brown: I've been teaching AWS classes, official AWS classes, since 2014.
00:03:22.320 --> 00:03:28.950
Myles Brown: That's when ExitCertified got the AWS partnership. Before that I had a varied IT background,
00:03:29.460 --> 00:03:38.280
Myles Brown: mostly coming from the data side. You know, I started my job as a programmer, really became more of a database programmer, did a lot of Oracle work, wrote some books,
00:03:39.000 --> 00:03:45.720
Myles Brown: ended up teaching some classes for Oracle, and then, when Hadoop came along, I became more of a Hadoop guy.
00:03:46.140 --> 00:03:55.770
Myles Brown: And did a lot with Cloudera's distribution, ended up teaching some classes for Cloudera, you know, and so you can see how my trajectory has gone.
00:03:56.310 --> 00:04:09.060
Myles Brown: And every year I do sort of the same kind of presentation near the beginning of the year, kind of wrapping up, hey, what's everything that happened last year in the AWS analytics stack.
00:04:09.570 --> 00:04:16.110
Myles Brown: And the analytics stack has grown over those eight years, and we're going to look at a lot of stuff.
00:04:16.830 --> 00:04:24.540
Myles Brown: I'll start with a quick, you know, overview of what the stack looks like and what's the big direction that kind of everybody is going,
00:04:25.050 --> 00:04:37.170
Myles Brown: as far as this idea of a lake house architecture, and then we'll jump right into what's new in each of these, you know, individual services.
00:04:37.710 --> 00:04:41.940
Myles Brown: And as Alex mentioned, we'll have plenty of time for questions at the very end.
00:04:42.210 --> 00:04:51.690
Myles Brown: Along the way, I may see some questions as things are happening. I'm kind of looking at the chat, and I may be able to answer some of the questions if they seem pertinent along the way.
00:04:52.530 --> 00:04:56.130
Myles Brown: So let's jump right into an overview of the AWS analytics stack.
00:04:56.880 --> 00:05:07.770
Myles Brown: If you look at sort of what traditionally people did with the AWS services insofar as analytics goes, you might start with saying, hey, people started collecting a lot of
00:05:08.430 --> 00:05:22.380
Myles Brown: files, log files, all kinds of files, dropping them in S3 and then periodically running some Hadoop jobs on those. And so the main service that AWS has for running Hadoop is Amazon EMR, right?
00:05:23.460 --> 00:05:36.090
Myles Brown: It makes it really easy to launch a huge cluster of servers and install whatever Hadoop, you know, whether it's Spark or Presto or Hive or, you know, whatever Hadoop ecosystem projects you like.
00:05:36.750 --> 00:05:45.480
Myles Brown: And mostly we're doing this sort of you know, maybe nightly maybe every few hours, you know running batch jobs was sort of initially what people were doing.
00:05:46.050 --> 00:06:02.520
Myles Brown: And then, you know, you would have put the well-structured, you know, processed files again into S3, and then you could load them into Redshift if you wanted the power of a data warehouse, so that we'd have SQL access to that data and,
00:06:03.900 --> 00:06:11.490
Myles Brown: you know, very fast kind of querying on this, you know, purpose-built infrastructure for the kinds of queries we do in a data warehouse.
00:06:12.210 --> 00:06:28.920
Myles Brown: And then you could attach any BI tool you like, whether it's Tableau or Cognos or, you know, you name it. But AWS eventually threw their hat in the ring and said, hey, we'll build a business intelligence tool, and that's what QuickSight is. QuickSight, you know, it's not super popular.
00:06:30.360 --> 00:06:35.550
Myles Brown: One of the things that happens with AWS is when they put out a new product, it's really a minimum viable product.
00:06:35.940 --> 00:06:45.510
Myles Brown: And so a lot of people looked at QuickSight when it first came out and said, it doesn't have all the bells and whistles that I'm used to, and it looks a little bit like Tableau, but it doesn't have all the stuff that Tableau has.
00:06:46.110 --> 00:06:52.680
Myles Brown: But if you look back, every year they add more and add more, and they talk to people who use it and say, hey, what are your,
00:06:52.920 --> 00:07:05.700
Myles Brown: you know, top priorities, what features do you want to see, and they add them in there. And so it's getting better and better, and there's something underlying in the SPICE engine which is very attractive. Now, how do we get the data into S3?
00:07:06.720 --> 00:07:16.620
Myles Brown: I mean, there's lots of ways to write data to S3, but typically we're going to use some sort of, you know, way to get data in, in the,
00:07:17.400 --> 00:07:29.580
Myles Brown: you know, in a streaming format, maybe with Kinesis, or, you know, if you've already got an infrastructure where you're using something like Apache Kafka, well, now we have MSK, the Managed Streaming for Kafka,
00:07:30.300 --> 00:07:38.580
Myles Brown: which just makes it easy to kind of port over those workloads into AWS. Then we have, you know, traditional kind of queuing, whether it's
00:07:39.300 --> 00:07:52.320
Myles Brown: SQS, which is, you know, actually the very first AWS service, back in 2004, so that's been around for a long time. A lot of people, similar to the Kafka thing, said, hey, I've been using Apache ActiveMQ,
00:07:52.980 --> 00:08:00.180
Myles Brown: you know, on prem, and when I move to the cloud I don't want to have to manage that myself. Can you manage it for me?
00:08:00.630 --> 00:08:10.860
Myles Brown: And so, in the early days Amazon said, well, you could go and switch to SQS, in which case you'd probably have to rewrite a lot of your applications that were using, you know,
00:08:11.460 --> 00:08:24.540
Myles Brown: ActiveMQ-specific APIs. Well, now we have Amazon MQ, so that lets me talk to things like ActiveMQ or RabbitMQ, and so we've got lots of ways to get messaging of that data into S3.
00:08:25.800 --> 00:08:30.540
Myles Brown: The other big change over the years was to say, well, what about these two.
00:08:31.710 --> 00:08:47.340
Myles Brown: uses of EMR. One use of EMR was here with the ETL, where we bring up a cluster and do a batch job periodically, right? And the other was, well, is there not a way where I could query the data directly on S3?
00:08:48.300 --> 00:08:57.210
Myles Brown: And so it was really Netflix that popularized this idea that said, hey, maybe as we start to see these
00:08:58.410 --> 00:09:07.980
Myles Brown: SQL-on-Hadoop options get better and better, and Facebook came up with something called Presto, which really made querying in a Hadoop cluster on, you know,
00:09:08.580 --> 00:09:19.860
Myles Brown: well-structured data in S3 really easy. Well, in the end, what Netflix said was, we're going to cut out Redshift and we're just going to use two different kinds of EMR clusters:
00:09:20.340 --> 00:09:34.290
Myles Brown: one that does periodic batch jobs, right, bring up, you know, a cluster, run a Spark job, and then it goes away, and that would be my ETL. And then the other one, instead of using Redshift, we would have a cluster
00:09:35.220 --> 00:09:47.220
Myles Brown: just for Presto, so that we can do SQL querying on this well-structured S3 data, and that was their data warehouse. Well, Amazon looked at that, and eventually said, hey, maybe we should help you with this.
00:09:48.570 --> 00:10:01.620
Myles Brown: Maybe not everybody has to be an expert in Hadoop and building and running clusters. Even though EMR does help us a lot with that, there's still a lot of decisions: how many nodes, what kinds of nodes.
00:10:02.220 --> 00:10:13.320
Myles Brown: And instead of doing that myself, maybe I can just borrow a cluster that Amazon has and just pay per use, and that's really where Amazon Athena came in.
00:10:13.890 --> 00:10:30.510
Myles Brown: Athena is SQL on S3, that was originally where it started, right? And so I could use Athena instead of running my own EMR cluster with Presto, because that's really what it was under the hood. It allowed me to do SQL queries on S3.
00:10:31.710 --> 00:10:42.870
Myles Brown: And so that might be a replacement for my data warehouse. I just have well-structured files in S3, and I could also use Athena to go in and query the raw data, right?
00:10:44.190 --> 00:10:51.360
Myles Brown: And so that's one way Athena changed things. Similarly, what about this EMR cluster doing the ETL? Well, that's where Glue came in.
00:10:52.410 --> 00:11:01.170
Myles Brown: And Glue was this sort of serverless ETL engine where, again, there is a Hadoop cluster running, it's just not one that I'm managing.
00:11:01.800 --> 00:11:09.480
Myles Brown: Right, and they have all kinds of tools to make certain kinds of jobs really easy, where maybe I don't have to code at all.
00:11:09.870 --> 00:11:14.520
Myles Brown: Right, I can use a lot of this sort of you know predefined kind of transformations.
00:11:14.880 --> 00:11:25.620
Myles Brown: And I can avoid writing code, or they'll show me the code, and if I know Python and Spark well, then I can go and tweak that, or Scala, I guess, either one.
00:11:26.040 --> 00:11:33.390
Myles Brown: And that was sort of a big question. Now, Muhammad asked the question: when do I use EMR versus when do I use Glue? That's a great question, right?
00:11:33.870 --> 00:11:40.380
Myles Brown: If you already have expertise in Hadoop and you know how to manage the cluster and you've built specific
00:11:40.680 --> 00:11:47.310
Myles Brown: Spark jobs and they're already built and you're just moving into the cloud, well, then EMR makes a lot of sense.
00:11:47.550 --> 00:11:57.150
Myles Brown: But for my money, if I'm starting from scratch in AWS, I'm probably going to use Glue, right? Because it's less code for me to write,
00:11:57.990 --> 00:12:14.790
Myles Brown: it's easier to manage, I don't think about the cluster ever, right? And the way you pay for it is I'm paying for, you know, how long this job runs, basically, rather than, you know, how many nodes times how many hours.
00:12:15.810 --> 00:12:17.610
Myles Brown: So the economics are a little different.
00:12:18.840 --> 00:12:32.520
Myles Brown: So this is sort of where things got to probably about four or five years ago. Things started to look like this, right, probably about four years ago, when Glue and Athena really started to become popular.
00:12:33.120 --> 00:12:47.010
Myles Brown: And what we realized is not everybody needs a traditional data warehouse, right? These well-structured files on S3, we can query the heck out of those, and we also still keep our raw data in S3.
00:12:47.730 --> 00:13:00.420
Myles Brown: And that's sort of this idea of a data lake, right, where we have a central place where we store all our data, and S3 is a perfect place for that, because it's not going to lose the data, it's very durable,
00:13:00.870 --> 00:13:07.560
Myles Brown: it's very cheap storage, and it can store files of almost any type, and if they are well structured we can use Athena to query them.
00:13:08.040 --> 00:13:20.820
Myles Brown: Later, they added sort of like Athena into Redshift, and so we have something called Redshift Spectrum, which allows us to run queries in Redshift that query the data in Redshift but also point to external tables in S3.
00:13:21.900 --> 00:13:34.260
Myles Brown: And so that sort of started an interesting idea where they said, hmm, I kind of like one interface, whether the data is in the data lake or maybe in Redshift or maybe in some other purpose-built database.
00:13:35.790 --> 00:13:43.050
Myles Brown: And AWS is not alone in this. This is a big shift in analytics across the industry.
00:13:43.260 --> 00:13:55.020
Myles Brown: You know, at ExitCertified we teach not just AWS classes, we teach all the cloud classes, right, Google Cloud, Microsoft Azure, but we are also partnered with a bunch of other vendors like Cloudera and Databricks.
00:13:55.590 --> 00:14:00.510
Myles Brown: And, you know, I don't teach the Databricks classes, but I talk a lot to my colleague who does.
00:14:00.870 --> 00:14:08.970
Myles Brown: And I have to keep abreast, because I do a lot of work with our salespeople, you know, trying to help people get into the right classes and everything, so I'm really up to date on
00:14:09.210 --> 00:14:15.330
Myles Brown: what's in those classes, and what I found is that they're really pushing this concept of a lake house architecture.
00:14:16.620 --> 00:14:22.470
Myles Brown: And so all these vendors are starting to look that way, and this is sort of AWS's take on it.
00:14:22.980 --> 00:14:32.910
Myles Brown: And I grabbed this visual from a blog article from last year, where they really started pushing this, where, you know, the lake house architecture, and here I quote this,
00:14:33.300 --> 00:14:46.620
Myles Brown: acknowledges the idea that taking a one-size-fits-all approach to analytics eventually leads to compromises, right? So it's about integrating that raw data, or data in its natural state, you know, in our,
00:14:47.100 --> 00:15:02.250
Myles Brown: in our data lake, you know, sitting in S3, with the data in the data warehouse and also in other stores like, you know, whatever RDS databases we have, or DynamoDB, or, you know, wherever it is, and
00:15:03.120 --> 00:15:11.910
Myles Brown: adding to all that the governance of, hey, where did the data come from, who's allowed to access it. You know, you need some governance around it. It's
00:15:12.210 --> 00:15:33.840
Myles Brown: really easy for a data lake to become a data swamp, right, if you get unregulated data and nobody knows who can do what with it. And so eventually AWS kind of built this reference architecture for their lake house, and so, you know, the storage layer is largely S3 and Redshift, right?
00:15:36.030 --> 00:15:42.570
Myles Brown: And there's now nice native integration between them, because I can run Redshift queries that access the data in S3
00:15:42.930 --> 00:15:47.970
Myles Brown: tables, and actually vice versa, you know, when I come in through Athena I can query both of these as well.
00:15:48.750 --> 00:16:01.560
Myles Brown: Now, how do we get the data in there? There's lots of different ways: Kinesis for streaming data, we can use the Database Migration Service to do sort of change data capture on traditional databases and get that data in there,
00:16:03.330 --> 00:16:08.010
Myles Brown: DataSync to grab files from file servers, maybe on premises or wherever.
00:16:08.550 --> 00:16:26.520
Myles Brown: Even Amazon AppFlow is kind of cool for grabbing data from certain SaaS applications, things like Salesforce and stuff like that. It has a nice visual editor to grab my data out of there and get it into, say, S3. Now, the S3 might be broken up into a bunch of different
00:16:28.500 --> 00:16:36.900
Myles Brown: layers, right? We could have the landing zone where we drop all the data. You know, the problem with just dropping raw data is it's not in a really
00:16:37.650 --> 00:16:57.570
Myles Brown: good format for querying immediately, right? Because a lot of times you have just raw log files and they're not well structured, and you have to kind of massage the data to make it nicely usable. But every time you alter that data you're losing some context, you might be losing,
00:16:58.740 --> 00:17:07.620
Myles Brown: you might be losing data. You know, in the old days, with that ETL process where we take all the data from our transactional databases, we would aggregate it,
00:17:07.920 --> 00:17:22.020
Myles Brown: clean it up, and then load it into a data warehouse. There's a lot of bloodshed, right? We left a lot of data on the floor. Well, now we say, no, we're going to keep everything. So we'll keep all the raw data, but then we'll go through a series of steps cleaning it up,
00:17:23.070 --> 00:17:31.710
Myles Brown: processing and getting it into better formats, so they have a trusted zone, and then a curated zone, you know there's different levels to that and.
00:17:32.610 --> 00:17:43.530
Myles Brown: in order to sort of deal with setting this thing up and, more importantly, the governance on, hey, I want this user to be able to access this table in S3.
00:17:44.310 --> 00:17:58.230
Myles Brown: Well, how? Through Athena, through Redshift Spectrum, through Hive or something on EMR, right, Spark? Well, how do I make this table available to all of them? I have to have a data catalog.
00:17:59.310 --> 00:18:05.760
Myles Brown: And then I've got permissions, like, hey, I might have a table I'm allowed to query, but do I have access to the data in S3?
00:18:06.240 --> 00:18:15.870
Myles Brown: Right, and so it becomes a real pain, and that's where Lake Formation came in. Lake Formation is the way to easily, you know, and quickly set up this data lake,
00:18:16.470 --> 00:18:25.110
Myles Brown: but also set up the governance around, hey, you're allowed access to these tables, and I make one sort of, you know, role-based access there.
00:18:26.340 --> 00:18:36.090
Myles Brown: Then we've got all our ways to access the data, either through EMR or Glue to put the data in or massage the data in there, Kinesis and Spark Streaming, all those options,
00:18:36.450 --> 00:18:47.220
Myles Brown: and Redshift Spectrum and then Athena for the queries. QuickSight can connect to all these different things to query, and, you know, SageMaker, if you're doing a lot of machine learning, can access the data.
00:18:47.670 --> 00:19:02.460
Myles Brown: And so that's sort of the idea that AWS has. Now, we're going to come back and see that, you know, a lot of the new features that came in in 2021, you know, in each of the services, were to really make that integration with Lake Formation,
00:19:03.600 --> 00:19:12.510
Myles Brown: to, you know, make things harmonious, to make this lake house really a possibility.
00:19:13.590 --> 00:19:26.160
Myles Brown: So we'll start with the streaming side of things. We're going to look at some new features in Kinesis, then we'll talk about the Managed Streaming for Kafka, and a little bit of Lambda, just because you need a little bit of Lambda to make it all work.
00:19:27.210 --> 00:19:33.690
Myles Brown: So, Kinesis, for example: streaming data collected from various data sources, and the big idea is it's like Kafka.
00:19:33.960 --> 00:19:43.350
Myles Brown: You know, you set up this stream, it holds data for a period of time, and then one or more consumers come and grab it on their own timeline. So I might have,
00:19:44.010 --> 00:19:52.200
Myles Brown: I mean, log data coming in, I might have thousands of IoT sensors firing every hundred milliseconds, sending me little bits of data.
00:19:52.860 --> 00:19:59.970
Myles Brown: And all this data comes in, and what are we going to do we're going to hold on to it for maybe 24 hours or you can change the period.
00:20:00.360 --> 00:20:12.060
Myles Brown: But then I have different consumers that are interested in these little events, right? Some of them might be interested in saying, hey, every hundred milliseconds I've got to grab that data and do some machine learning.
00:20:13.170 --> 00:20:18.840
Myles Brown: Or it might be, hey, every five minutes I'm collecting data and updating some dashboard.
00:20:19.470 --> 00:20:25.290
Myles Brown: And another one might be saying, hey, every 10 hours I'm just grabbing all the data, de-duplicating it, and storing it in S3.
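The retention-plus-independent-consumers idea described above can be sketched with a toy in-memory stream. This is only an illustration of the concept, not the Kinesis API: the `ToyStream` class, its method names, and the explicit `now` argument are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """Hypothetical in-memory stand-in for a stream with a retention period."""
    retention_seconds: float = 24 * 3600
    records: list = field(default_factory=list)      # (sequence, timestamp, data)
    checkpoints: dict = field(default_factory=dict)  # consumer name -> next sequence
    next_seq: int = 0

    def put(self, data, now=None):
        """Producer side: append a record with a sequence number and timestamp."""
        now = time.time() if now is None else now
        self.records.append((self.next_seq, now, data))
        self.next_seq += 1

    def expire(self, now=None):
        """Drop records older than the retention period."""
        now = time.time() if now is None else now
        self.records = [r for r in self.records
                        if now - r[1] <= self.retention_seconds]

    def poll(self, consumer, now=None):
        """Return what this consumer has not seen yet and advance its checkpoint."""
        self.expire(now)
        start = self.checkpoints.get(consumer, 0)
        batch = [data for seq, _, data in self.records if seq >= start]
        self.checkpoints[consumer] = self.next_seq
        return batch
```

Each consumer keeps its own checkpoint, so a fast consumer polling every few milliseconds and a slow one polling every few hours read the same records without interfering with each other, as long as the slow one comes back before the retention period runs out.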
00:20:25.890 --> 00:20:34.740
Myles Brown: And so that's the big idea of Kinesis, and there's actually three main services. There's actually Kinesis Video Streams as well, but we're not going to get too much into that.
00:20:35.280 --> 00:20:41.910
Myles Brown: But the two main services these days are Data Firehose and the original Kinesis Data Streams.
00:20:42.210 --> 00:20:51.960
Myles Brown: So Firehose makes it really simple to say, hey, I've got little bits of data coming in really, really fast, right? It could be hundreds of thousands, could be millions of writes per second.
00:20:52.590 --> 00:20:58.860
Myles Brown: Right, usually too much to write to most things, like a Redshift database or even an S3 bucket, right?
00:20:59.430 --> 00:21:14.730
Myles Brown: So what are we going to do? We're going to aggregate the data and then periodically write it out to S3. Or if you're going to go to Redshift, well, in the past it always used S3 as an intermediate: you would write it out to S3 and then you'd copy that data into Redshift.
00:21:15.930 --> 00:21:30.570
Myles Brown: But eventually they added a bunch of other targets, things like Splunk and other third-party places. So that's Firehose, and the big idea there is it's really simple, because I don't have to think about partitioning, I don't have to think about
00:21:32.070 --> 00:21:45.750
Myles Brown: writing the code that's the consumer to grab the data and aggregate it or anything. It's more like configuration, where I say, hey, grab this data, and, you know, whenever we get five megs of data or every 60 seconds, whichever happens first,
00:21:46.080 --> 00:21:51.060
Myles Brown: you know, batch it up and write it. Kinesis Data Streams is just a much more generic
00:21:51.660 --> 00:21:59.850
Myles Brown: kind of, hey, I write the consumer as well as the producer that puts the data in, and so I use APIs to go and build those
00:22:00.240 --> 00:22:13.590
Myles Brown: consumers, and I can have multiple consumers, but I'm also on the hook for managing them and making sure that, you know, if they die, then a replacement comes up. So that's more sort of, you get more control but there's more work to do.
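The Firehose buffering rule mentioned a moment ago (flush at five megs or 60 seconds, whichever happens first) can be sketched in plain Python. This is a simplified model of the idea, not how the service is implemented: the `BufferedWriter` class, the `flush` callback, and the explicit `now` timestamps are assumptions made for illustration.

```python
class BufferedWriter:
    """Hypothetical size-or-time buffer: flush when either threshold is hit."""

    def __init__(self, flush, max_bytes=5 * 1024 * 1024, max_seconds=60.0):
        self.flush = flush              # callback that receives one combined batch
        self.max_bytes = max_bytes      # size threshold (5 MB by default)
        self.max_seconds = max_seconds  # age threshold (60 s by default)
        self.buffer = []
        self.size = 0
        self.opened_at = None           # time the first buffered record arrived

    def add(self, record: bytes, now: float):
        """Buffer one record, then flush if a threshold has been reached."""
        if self.opened_at is None:
            self.opened_at = now
        self.buffer.append(record)
        self.size += len(record)
        self._maybe_flush(now)

    def tick(self, now: float):
        """Call periodically so an old, small buffer still gets written out."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float):
        if not self.buffer:
            return
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.flush(b"".join(self.buffer))
            self.buffer, self.size, self.opened_at = [], 0, None
```

In this sketch the `flush` callback would be whatever writes one object to the destination; many tiny incoming records turn into a much smaller number of larger writes.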
00:22:14.880 --> 00:22:25.410
Myles Brown: And then one thing they added a little after that, which is nice, is Kinesis Data Analytics. They say, whichever of the streams you have, instead of writing more code if I need another consumer,
00:22:25.860 --> 00:22:37.500
Myles Brown: I could, if the data is well structured, take that existing stream of data and kind of run SQL queries on it, if I can figure out how to, like, layer
00:22:38.010 --> 00:22:55.890
Myles Brown: a CREATE TABLE, you know, if it's tab-delimited, comma-delimited data or whatever, maybe JSON data, I could write a CREATE TABLE command, and then I could use SQL, and now Apache Flink, for processing of the data in that stream, instead of having to write another consumer and find somewhere to host it.
00:22:57.150 --> 00:23:08.220
Myles Brown: So that's sort of what Kinesis looks like. Now, the biggest new thing, I would say, is in Kinesis Data Streams: they've added this sort of on-demand mode, or auto scaling.
00:23:08.430 --> 00:23:16.590
Myles Brown: Because one of the pains of managing Kinesis Data Streams is, when you create a stream, you have to tell it how many shards make up that stream.
00:23:16.980 --> 00:23:28.080
Myles Brown: And each shard can take about 1,000 writes per second, you know, and about one meg of data per second, so you have to kind of look and say, okay, I need 300,000 writes per second, so I need 300 shards.
00:23:28.530 --> 00:23:37.710
Myles Brown: And what if that changes over time? I've got to merge or split shards. It's a real pain to kind of manage that if things are dynamic, right? And so
00:23:38.760 --> 00:23:45.660
Myles Brown: Kinesis Data Streams on-demand lets us just sort of not worry about that, right? So that's kind of a big new thing.
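The manual shard sizing described above is just a ceiling division against the per-shard limits. A minimal sketch, using the approximate figures quoted in the talk (about 1,000 writes per second and about 1 MB per second per shard); the function name and constants are illustrative, and current service quotas should be checked before relying on them.

```python
import math

# Approximate per-shard write limits as quoted in the talk (assumed values).
WRITES_PER_SHARD = 1_000
BYTES_PER_SHARD = 1_000_000  # roughly 1 MB per second


def shards_needed(writes_per_second: int, bytes_per_second: int) -> int:
    """Return the shard count that satisfies both the record and byte limits."""
    by_writes = math.ceil(writes_per_second / WRITES_PER_SHARD)
    by_bytes = math.ceil(bytes_per_second / BYTES_PER_SHARD)
    return max(1, by_writes, by_bytes)
```

With 300,000 writes per second of small records, the write limit dominates and the answer is 300 shards; with fewer but larger records, the byte limit can dominate instead. On-demand mode removes the need to redo this calculation as traffic changes.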
00:23:46.800 --> 00:23:58.200
Myles Brown: Another big one: when I use Firehose to aggregate data and throw it into S3, I used to just throw it into one S3 bucket. All the data that we're collecting goes into one big bucket.
00:23:59.520 --> 00:24:10.920
Myles Brown: Of course, if we're going to try and do any kind of, you know, querying on that data in S3, it's better if our big bucket is subdivided into, you know,
00:24:11.610 --> 00:24:19.680
Myles Brown: partitions, right? So you might have, okay, here's the sales for January, February, March, and so those are subfolders of the folder where we put the data.
00:24:20.400 --> 00:24:30.990
Myles Brown: And there was never any facility for that, so you'd always have to, like, okay, just throw it in one bucket, then we would build like a little simple EMR job or Glue job that would take the data and,
00:24:31.230 --> 00:24:46.350
Myles Brown: you know, basically partition it and put it into another S3 bucket. Well, now we don't have to do that extra job. We have dynamic partitioning for Firehose delivery to S3, and so you tell it, okay, use these prefixes, and this kind of data goes here and that kind of data goes there.
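The partitioning idea above, where each record lands under a prefix derived from its own contents, can be sketched as a small function. The record layout (a `timestamp` field in epoch seconds), the `sales` base prefix, and the `year=/month=` naming are assumptions for illustration, not the Firehose configuration syntax.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_prefix(record: dict, base: str = "sales") -> str:
    """Derive an S3-style key prefix from the record's own timestamp (UTC)."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/"


def group_by_partition(records: list) -> dict:
    """Group records by the prefix they would be written under."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_prefix(record)].append(record)
    return dict(groups)
```

A query engine that understands this layout can then skip whole prefixes, for example reading only the February data, instead of scanning one big undivided bucket.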
00:24:48.060 --> 00:24:48.600
Myles Brown: and
00:24:49.920 --> 00:25:01.500
Myles Brown: The last big new one here in Kinesis is the addition of Kinesis Data Analytics Studio, and it basically provides, you know, like a notebook, you know, based on,
00:25:02.070 --> 00:25:11.460
Myles Brown: I think this one will be based on Zeppelin or Jupyter, I can't remember which, but it's for doing ad hoc querying of the data through Flink,
00:25:12.270 --> 00:25:21.180
Myles Brown: right there in a nice sort of, you know, interactive way. And so, you know, they've got notebooks all over the place. We're going to talk a lot about notebooks here.
00:25:21.840 --> 00:25:28.950
Myles Brown: But one here in Kinesis Analytics especially, like, we used to just have a little box in the management console where you would,
00:25:29.160 --> 00:25:37.830
Myles Brown: you know, play with the SQL a little bit, but when you got into Flink, all of a sudden you need more of a notebook environment to write scripts, and so this provides that.
00:25:39.120 --> 00:25:46.560
Myles Brown: So that's a few things new in Kinesis. What about MSK? So that's the Managed Streaming for Kafka.
00:25:48.000 --> 00:25:58.500
Myles Brown: They make it really easy to port existing Kafka apps into the cloud by providing a managed service. You know, in the old days you had two choices if you wanted to run Kafka in AWS. You would either
00:25:58.920 --> 00:26:08.400
Myles Brown: just launch a bunch of EC2 instances and have to manage it all yourself, or you would say, no, instead of Kafka I'm going to switch to Kinesis, but I might have to rewrite all my apps.
00:26:09.450 --> 00:26:19.560
Myles Brown: And so this provided us that nice easy way. Now, MSK was introduced way back in 2019, but a lot of updates arrived in 2021,
00:26:19.920 --> 00:26:32.670
Myles Brown: important ones, right? It used to be that you'd have to have some sort of little access management system for Kafka just to keep track of users and who's allowed to do what. Well, they introduced AWS IAM access control this past year.
00:26:34.290 --> 00:26:44.460
Myles Brown: They also introduced MSK Connect. So if you know Kafka well, you know that there's sort of Kafka streaming and then there's Kafka Connect for building the,
00:26:45.240 --> 00:26:55.530
Myles Brown: the connectors to get your data in and out of your Kafka stream, and so MSK Connect is a managed service Kafka Connect cluster to ingest the data in.
00:26:57.240 --> 00:27:03.570
Myles Brown: Another thing they did was they allowed us to secure connections to MSK over the Internet through either AWS IAM,
00:27:04.110 --> 00:27:12.690
Myles Brown: SASL/SCRAM, or mutual Transport Layer Security. So these are the options, and now it makes it a little easier.
00:27:13.620 --> 00:27:22.230
Myles Brown: And then I guess the last one, which is still in preview, is MSK Serverless, so you don't even have to think about
00:27:22.980 --> 00:27:31.500
Myles Brown: the cluster capacity. You know, the big idea of a managed service is I don't have to manage the individual machines on which they're installing Kafka,
00:27:31.770 --> 00:27:41.130
Myles Brown: you know, dealing with the operating system and updating Linux and all that stuff, but you still have to think about how many nodes do I need, and what if my data needs change, you're going to change those.
00:27:41.460 --> 00:27:52.200
Myles Brown: Well, MSK Serverless just says, ah, don't even, you know, worry your pretty little head about it. We'll take care of how many nodes are there, when to grow and shrink, all that stuff.
00:27:53.280 --> 00:28:06.510
Myles Brown: So that's still in preview, but you should be able to join that preview. This just came out, I think. Probably this was announced in November, at re:Invent, maybe December.
00:28:08.160 --> 00:28:12.510
Myles Brown: So that's MSK. A few things in Lambda, not a lot here.
00:28:13.440 --> 00:28:22.530
Myles Brown: But I'm sure everybody taking this probably knows a little bit about AWS Lambda. It's basically this concept of serverless compute, and
00:28:22.860 --> 00:28:27.480
Myles Brown: so I write a function, and then I configure what's the trigger that causes that function to fire.
00:28:28.350 --> 00:28:33.420
Myles Brown: And it could be something like, oh, somebody uploaded a file to S3. That's an event, go run this code.
00:28:33.840 --> 00:28:41.970
Myles Brown: And then you pay for how long the code runs and you don't think about the servers on which it runs you're not paying by the hour for them you're not managing them fixing them.
00:28:42.240 --> 00:28:49.410
Myles Brown: You don't think about high availability or where this is going to run. I just trust that AWS will find somewhere to run it.
00:28:51.000 --> 00:29:01.260
Myles Brown: So, you know, things like uploading a file to S3, or stream events, like a new message comes into Kinesis. I could say, hey, every hundred events in Kinesis, go run this code.
00:29:02.520 --> 00:29:08.760
Myles Brown: Same thing with, you know, changes in a DynamoDB table: those go into a DynamoDB stream, and
00:29:09.900 --> 00:29:25.290
Myles Brown: So they've added some new event sources for Lambda functions. So MSK, that Kafka kind of thing, and also Kafka that you're managing yourself on EC2 instances, and they can also use that mutual Transport Layer Security.
00:29:26.640 --> 00:29:42.420
Myles Brown: Amazon MQ, so when a message comes into ActiveMQ or RabbitMQ, you know, you can use that. And they also added the ability to have your Lambda function run in a different account than the queue where a message comes in.
00:29:44.370 --> 00:29:47.640
Myles Brown: So that's some new event sources that came in in 2021.
00:29:49.230 --> 00:29:57.690
Myles Brown: Another thing they did was they said it's kind of expensive to fire the Lambda function on every message that comes in to, say, an SQS queue,
00:29:58.080 --> 00:30:04.590
Myles Brown: when I'm only interested, when this Lambda function only has to fire, if the messages look like something.
00:30:05.400 --> 00:30:24.240
Myles Brown: So that concept of event filtering, they've added it for SQS, Kinesis, and DynamoDB. And the big idea there is, you know, I can put a little check that happens at the queue level that says, hey, is this the kind of event that I'm interested in? Yes? Okay, then call Lambda.
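[Editor's sketch] The filtering idea described above can be imitated locally in a few lines of Python. This is only a simplified illustration, not how Lambda implements it: the pattern `FILTER_PATTERN`, the `status` field, and the helper names `matches` and `should_invoke` are all hypothetical, and only plain "equals one of these values" matching is handled.

```python
import json

# Hypothetical pattern: only invoke for messages whose JSON body has status == "error".
FILTER_PATTERN = {"body": {"status": ["error"]}}


def matches(pattern, data):
    """Simplified matcher: every key in the pattern must exist in the data.

    A list means "the value equals one of these"; a dict means "recurse".
    """
    for key, expected in pattern.items():
        if not isinstance(data, dict) or key not in data:
            return False
        if isinstance(expected, dict):
            if not matches(expected, data[key]):
                return False
        elif data[key] not in expected:
            return False
    return True


def should_invoke(sqs_record):
    """Decide whether a single SQS-style record would be passed to the function."""
    # The message body arrives as a JSON string, so parse it before matching.
    record = dict(sqs_record, body=json.loads(sqs_record["body"]))
    return matches(FILTER_PATTERN, record)
```

With a check like this sitting in front of the function, messages that do not match never trigger an invocation, which is where the cost saving comes from.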
00:30:26.100 --> 00:30:35.580
Myles Brown: Another thing that they added in which sort of helps out a little bit for us is partial batch response so if you have.
00:30:38.520 --> 00:30:44.580
Myles Brown: you're saying hey go grab 10 messages from the queue and run this Lambda function.
00:30:45.270 --> 00:30:58.500
Myles Brown: run 10 times on that data. Well, what happens if one of them fails and some of them work? You know, in the old days, we would just roll everything back and you'd have to grab all that data again. Now we can say, hey, these ones worked and these ones didn't.
00:31:00.180 --> 00:31:03.180
Myles Brown: So we can kind of deal with partial batch responses.
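[Editor's sketch] For an SQS-triggered function, a partial batch response might look roughly like the Python handler below. It assumes the event source mapping has been configured to report batch item failures; the `process` function and the `fail` field are hypothetical stand-ins for real business logic.

```python
import json


def process(body):
    """Hypothetical business logic: reject any message flagged with "fail"."""
    if body.get("fail"):
        raise ValueError("cannot process message")


def handler(event, context):
    """Process each record and report only the failed ones back to the queue."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; the successful ones are not redelivered.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list signals that the whole batch succeeded.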
00:31:05.100 --> 00:31:18.720
Myles Brown: So that's what's new in Lambda. When it comes to storage, we'll just talk a little bit about S3. You know, S3 obviously is the cornerstone for most people's big data plans in AWS.
00:31:20.100 --> 00:31:26.280
Myles Brown: In 2020, S3 added support for multiple destinations when you do replication. You know,
00:31:26.520 --> 00:31:37.410
Myles Brown: We used to have cross region replication where you said hey everything I write to this s3 bucket I want you to also copy over there, and it was mostly used for disaster recovery something really bad happens this region goes down.
00:31:37.680 --> 00:31:43.200
Myles Brown: All that data is over there in another region, and I can bring up my infrastructure there and get going.
00:31:43.800 --> 00:31:54.960
Myles Brown: Well, people started to find that you know the latency of reading data around the world is a bit of a pain, so it might actually be beneficial to have multiple copies of the data and people work on it, wherever they are.
00:31:55.980 --> 00:32:07.950
Myles Brown: So that's something that came in in 2020. In 2021 they added the idea of S3 Multi-Region Access Points, so these use, under the covers, AWS Global Accelerator,
00:32:08.460 --> 00:32:18.780
Myles Brown: which basically considers factors like network congestion and the location of the requesting app. And so basically it says, hey, when I want to read this data from S3, I go through an access point.
00:32:19.440 --> 00:32:25.800
Myles Brown: it's got an access point in each region where the data lives and it'll grab it from whichever one is fastest.
00:32:26.460 --> 00:32:40.860
Myles Brown: Whatever that means, right? And it's also not going across the Internet; it's going through the AWS global backbone wherever possible, so you can really accelerate your performance by up to 60% when accessing those data sets that are replicated around.
00:32:42.570 --> 00:32:44.340
Myles Brown: Some other new features in S3.
00:32:45.930 --> 00:32:57.270
Myles Brown: Support for EventBridge notifications. So let's say you want to do something when a new file comes in to S3. That's an event; I want to go and build an application that deals with that, right?
00:32:57.900 --> 00:33:03.210
Myles Brown: Well, what if I wanted to build multiple applications that react to that same change?
00:33:03.630 --> 00:33:13.320
Myles Brown: And what if I wanted to replay past events? We ended up having to keep track of those events and additional copies of the data. Now we can use EventBridge.
00:33:13.620 --> 00:33:23.850
Myles Brown: And so, when that event occurs, it goes into EventBridge and we can have multiple people dealing with it, and we can, you know, kind of look back at past events and replay them.
00:33:26.460 --> 00:33:41.790
Myles Brown: The biggest change, though, in S3 is that they changed some of the object storage classes. They changed a couple of the names, they added some new options, so the best way to look at this is to just, you know, go and look in the documentation. It's basically on
00:33:42.810 --> 00:33:54.000
Myles Brown: the main S3 web page; they talk about it. And so now we've got S3 Standard; that's your kind of, hey, I want to put data, I want to access it fairly frequently.
00:33:54.780 --> 00:34:03.780
Myles Brown: They've got infrequent access, Standard-IA, and Standard-IA says I'm not going to access it very often, so I want to pay less for storage.
00:34:04.380 --> 00:34:08.790
Myles Brown: And you take that to its next logical conclusion and you get Glacier.
00:34:09.240 --> 00:34:26.400
Myles Brown: Well, Glacier they renamed to Glacier Flexible Retrieval and Glacier Instant Retrieval, so there's a little bit of a difference between them. So Instant Retrieval is an archive storage class that delivers the lowest cost storage, but you can still access the data in milliseconds.
00:34:28.260 --> 00:34:35.160
Myles Brown: So you can save up to 68% on storage costs, but your data is still accessible but.
00:34:36.600 --> 00:34:43.680
Myles Brown: You know, you're going to pay every time you access it, so, you know, there's a trade-off there. If you're accessing it all the time, don't use Glacier.
00:34:44.250 --> 00:35:00.420
Myles Brown: For that matter, don't use Standard-IA, right? And then, you know, the Flexible Retrieval is for data that's accessed maybe one to two times per year and it doesn't have to be right-away access; you know, three to five hours or whatever.
00:35:01.560 --> 00:35:04.950
Myles Brown: Then you've got Deep Archive, where you really get into the, hey,
00:35:05.430 --> 00:35:16.620
Myles Brown: I don't need this data really, right? I've done all the analytics I need on it. I would throw it away, except, you know, this is cheap, and Deep Archive is super cheap.
00:35:17.130 --> 00:35:25.470
Myles Brown: So, why would I throw it away, especially when you know I might have some sort of compliance that says you've got to hold on to this data for five years or seven years or whatever.
00:35:25.890 --> 00:35:34.440
Myles Brown: Right, and so what do we do? We compress the heck out of it, we put it into Deep Archive, and we say that's the cheapest it's going to be for storage.
00:35:34.980 --> 00:35:45.870
Myles Brown: But if I ever need access to it, it might take 12 hours to get the data back. You know, you're like, oh, we're getting audited and they have to be able to access that data to show that we still have it.
00:35:46.350 --> 00:35:52.380
Myles Brown: Right, but the auditors can wait 12 hours. Never liked those guys anyway, right? So that's
00:35:52.860 --> 00:36:04.140
Myles Brown: that's the idea. And so then, just a few name changes there, and then Intelligent-Tiering they've tweaked a little bit as well. Intelligent-Tiering says, well, what if you don't know your access patterns very well?
00:36:04.500 --> 00:36:12.960
Myles Brown: You're going to grab some data and you might use it very often, you might not, and then over time you probably use it less, but you're not sure: is it
00:36:13.290 --> 00:36:21.570
Myles Brown: two weeks, is it six months? When does it all of a sudden not get used? So when you're unsure, that's when you use Intelligent-Tiering.
00:36:22.140 --> 00:36:32.040
Myles Brown: Because it'll store it and it'll look at it and it'll say, hey, you haven't used this data for a while, so let's move it into infrequent access, and it can even move it
00:36:33.300 --> 00:36:36.390
Myles Brown: into some of our archiving options.
00:36:37.530 --> 00:36:42.420
Myles Brown: So it goes between frequent, infrequent, and archive instant access.
00:36:43.260 --> 00:36:50.160
Myles Brown: And so it can save you a lot of money. But the best way to do it is, if you know exactly the usage patterns,
00:36:50.460 --> 00:37:04.110
Myles Brown: then set up a lifecycle policy that says it's in Standard for two weeks, then it goes to infrequent access for the next six months, then it goes into Glacier Deep Archive. Like, if you know the exact access patterns, that's your best way to save money.
00:37:04.740 --> 00:37:09.120
Myles Brown: But if you're not sure, then Intelligent-Tiering is a good way to start, right?
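[Editor's sketch] A lifecycle policy along the lines described above could be expressed as the configuration dictionary below, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID and bucket name are hypothetical. One assumption to flag: as far as I know, S3 only allows a transition to Standard-IA after an object has spent at least 30 days in Standard, so this sketch uses 30 days rather than the two weeks mentioned in the talk.

```python
def build_lifecycle_rules(days_to_ia=30, days_in_ia=180):
    """Return a lifecycle configuration: Standard -> Standard-IA -> Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "standard-to-ia-to-deep-archive",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "Transitions": [
                    # Days are counted from object creation, so the second
                    # transition is the sum of both periods.
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_ia + days_in_ia, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_rules()

# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket", LifecycleConfiguration=config)
```

When the access pattern is not known in advance, Intelligent-Tiering takes the place of a hand-written rule like this.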
00:37:10.170 --> 00:37:13.890
Myles Brown: And so you know just a few name changes, and you know.
00:37:15.270 --> 00:37:33.060
Myles Brown: That's the big idea there. All right, well, let's get into the bulk of what we're going to talk about here, which is, you know, EMR, Lake Formation, Athena, and Glue, and then we'll end with a little section on Redshift. So EMR, we talked about, is a managed service for running Hadoop.
00:37:34.290 --> 00:37:41.280
Myles Brown: It now supports S3 Access Points to simplify access control, right? So if you are setting up
00:37:41.700 --> 00:37:51.600
Myles Brown: a data lake in S3, you can create hundreds of S3 access points on that shared data lake, each corresponding to some department within your organization
00:37:52.290 --> 00:38:07.770
Myles Brown: that has specific access permissions. Then when you're building jobs for that, you can use a specific department's S3 access point, and they will access only their data to process it. So that's kind of an interesting idea.
00:38:09.360 --> 00:38:23.130
Myles Brown: The other big thing that they announced in preview is support for EMR Serverless. You're starting to see a trend here, right? Things that were managed services where you still had to think a lot about capacity and growing and shrinking them,
00:38:23.550 --> 00:38:33.930
Myles Brown: they want to take that over. And so EMR Serverless will automatically determine and provision compute and memory resources and scale them up or down. So that's still in preview.
00:38:36.660 --> 00:38:45.000
Myles Brown: The other big thing that they changed in EMR is EMR Studio, right? It finally became generally available and they added lots of features.
00:38:46.230 --> 00:39:00.150
Myles Brown: And so it's basically a fully managed Jupyter notebook with tools like the Spark UI and YARN Timeline Service and all kinds of, like, PySpark and everything in there. It now supports multiple languages in the same notebook.
00:39:01.380 --> 00:39:07.740
Myles Brown: That's a new feature. You can actually execute external Python files and external notebooks.
00:39:09.270 --> 00:39:13.170
Myles Brown: And so they introduced sort of real-time collaborative notebooks, and
00:39:14.250 --> 00:39:22.950
Myles Brown: they beefed up support for access management. So initially, when it first came out, EMR Studio could only use AWS Single Sign-On.
00:39:24.210 --> 00:39:32.460
Myles Brown: Now it can use IAM, it can use IAM federation. They've gone through and they've made sure it's HIPAA eligible and HITRUST certified, which is sort of
00:39:33.990 --> 00:39:45.240
Myles Brown: Well, it kind of started in the healthcare community. It's not restricted to just that, but it sort of spans domains and says, well, we're worried about HIPAA compliance, or worried about PCI compliance, worried about GDPR,
00:39:45.720 --> 00:39:53.100
Myles Brown: all that stuff. And so, you know, being HITRUST certified means I can run in that kind of environment if I need to.
00:39:53.940 --> 00:40:06.810
Myles Brown: And the other thing they added in EMR Studio was, beyond just using, you know, Python and Scala and stuff like that, they added a new SQL Explorer, so we can run sort of like Presto queries on your EMR cluster.
00:40:09.960 --> 00:40:23.100
Myles Brown: The other changes they made in EMR are really to help us with our lake house architecture. They really integrated nicely with Apache Ranger. If you're not familiar with Apache Ranger, it's all about fine-grained access control.
00:40:24.420 --> 00:40:34.290
Myles Brown: And there was a way to make that work in EMR before; like, there was a blog article and people kind of looked at that and tried to incorporate it. Now it's officially supported.
00:40:34.860 --> 00:40:49.290
Myles Brown: And so I can do database, table, and column level authorization policies, whether using Spark or Hive, you know, to access through that Hive metastore, and we can also do prefix and object level authorization policies when accessing data in S3.
00:40:51.030 --> 00:41:10.500
Myles Brown: And even use CloudWatch to capture audit logs on that. So that was sort of a big move. They incorporated that, and then kind of on the heels of that they added support for Spark SQL to update even the Hive metastore tables when using the EMR integration with Ranger.
00:41:13.800 --> 00:41:25.410
Myles Brown: And I guess building on top of that, then they added support for Spark SQL to update, you know, the AWS Glue Data Catalog when using Lake Formation.
00:41:26.790 --> 00:41:38.280
Myles Brown: So that's sort of all one on top of the other. But the newest thing that they added, just at this last re:Invent, was support for Apache Iceberg,
00:41:38.970 --> 00:41:56.970
Myles Brown: which is sort of an open table format for large datasets in S3, and it provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. So this idea that we want our S3 to not just be
00:41:58.320 --> 00:42:10.620
Myles Brown: append only, and that's how most people think of S3: I put files in S3, I don't edit them, right? I'll do some sort of an EMR job, take these files, process them, and write out new files.
00:42:11.730 --> 00:42:21.870
Myles Brown: But this is saying, no, the data in S3, we can actually make changes to it; we can do, you know, atomic commits and
00:42:22.590 --> 00:42:37.200
Myles Brown: So, building on all this, if we go look at Lake Formation, which is that sort of managed service that lets us build and secure our S3 data lake, they added something called governed tables in S3
00:42:38.370 --> 00:42:42.840
Myles Brown: to simplify building, you know, data pipelines with multi-table transaction support.
00:42:44.580 --> 00:42:57.420
Myles Brown: And it supports row and cell level permissions, and once you've set up a table in Lake Formation on S3, I can query it through Athena or Redshift Spectrum or Glue or QuickSight.
00:42:58.380 --> 00:43:10.860
Myles Brown: Then they also added managed VPC endpoints to access the data lake in a VPC and use, you know, your regular kind of VPC security controls to decide who can get where.
00:43:12.180 --> 00:43:16.800
Myles Brown: They added support for tag-based access control.
00:43:18.360 --> 00:43:24.900
Myles Brown: So I can use tags on databases tables and columns and then you know decide who can access them that way.
00:43:27.810 --> 00:43:30.390
Myles Brown: So those are the big changes there, and
00:43:31.440 --> 00:43:39.000
Myles Brown: getting very close to talking about Athena. I think we're here, okay. So Athena, we talked about it briefly at the top of the hour.
00:43:39.570 --> 00:43:52.800
Myles Brown: Athena is a serverless option to do querying in S3 using ANSI standard SQL, right? That came out a few years ago now. Then they added in federation, so I don't have to just query S3.
00:43:53.310 --> 00:44:04.920
Myles Brown: Right, it can query data in Redshift, in Amazon Aurora, in other RDS, anything with a JDBC driver; we can now set it up so that we can run these queries.
00:44:06.900 --> 00:44:11.670
Myles Brown: Some new things added this past year: parameterized queries, like most
00:44:11.940 --> 00:44:21.270
Myles Brown: traditional relational databases, right? I run a query, and then I run the exact same query a little bit later. I say, hey, go query blah blah blah where customer equals 101,
00:44:21.660 --> 00:44:33.990
Myles Brown: and then I do the same query where customer equals 102. We shouldn't have to do all the same work building the execution plan and everything again, right? So using parameterized queries improves reusability and security, it turns out.
00:44:35.070 --> 00:44:43.860
Myles Brown: Cross-account federated queries, so my Athena doesn't have to be necessarily in the same account; I can access data in somebody else's account, you know.
00:44:44.550 --> 00:44:53.880
Myles Brown: Presumably they've given me some permissions on my end, but yeah, we can do that. They introduced user-defined functions; you have to write them in Java, and they have a nice API for building them.
00:44:55.710 --> 00:45:03.420
Myles Brown: It now presents a query plan so you can look at an execution plan and see you know how can I tune this query better.
00:45:03.990 --> 00:45:11.160
Myles Brown: It was always just sort of a black box: Athena ran the query and then you paid for how long that query ran. So this helps, certainly.
00:45:11.910 --> 00:45:21.000
Myles Brown: And they now built in support for cross-account Glue Data Catalogs, and so I can make my Glue Data Catalog that says, okay, I've got this table in this AWS account
00:45:21.210 --> 00:45:34.140
Myles Brown: and that table in this account; this one's in S3, this one's in Redshift, this one's in, you know, an RDS database, and now Athena can talk to that catalog and grab data from all those different places.
00:45:35.670 --> 00:45:43.770
Myles Brown: Now, really, that's probably something I should have added into the next slide, for what's new in Athena for the lake house architecture.
00:45:44.190 --> 00:45:49.530
Myles Brown: Right, because that sort of helps with that, that ability to use one place to grab all the data.
00:45:50.370 --> 00:46:00.690
Myles Brown: But they've expanded their Apache Hudi support to simplify incremental data processing in S3 data lakes. They just added some more support to Athena.
00:46:01.590 --> 00:46:06.240
Myles Brown: Apache Hudi's been supported for a while in EMR; this just sort of makes it available through Athena.
00:46:07.200 --> 00:46:12.660
Myles Brown: Added support for the new Lake Formation fine-grained security and reliability features that we talked about earlier.
00:46:13.620 --> 00:46:26.580
Myles Brown: And in preview right now, there are ACID transactions with Iceberg. So this is sort of the way things go: they introduce something in EMR and then they eventually make it available in Athena, for, you know,
00:46:27.900 --> 00:46:31.650
Myles Brown: on those EMR clusters, they expose it. So still in preview, that part.
00:46:32.850 --> 00:46:41.790
Myles Brown: Now, if you want ACID transactions, it is already supported through governed tables, but this is a new way of doing it that might be a little easier for people.
00:46:44.520 --> 00:46:50.970
Myles Brown: So that's Athena. The other sort of managed Hadoop cluster under the covers for us is Glue, and
00:46:51.330 --> 00:46:57.240
Myles Brown: it's serverless ETL. I don't have to think about provisioning or running an EMR cluster. It can
00:46:57.570 --> 00:47:11.010
Myles Brown: crawl all kinds of data sources. That's a big thing with Glue, right? There's sort of three elements to it: there's the crawlers that will go and look at your data sources, figure out what the schemas look like, and update the second part, which is the
00:47:12.060 --> 00:47:13.110
Myles Brown: Glue Data Catalog.
00:47:14.400 --> 00:47:20.190
Myles Brown: And so that catalog is really, under the covers, a Hive metastore, you know.
00:47:21.390 --> 00:47:36.300
Myles Brown: But then I can use that when I run queries in Athena or EMR or Redshift Spectrum, right? So it ends up being the main thing: that Glue Data Catalog is what we use as our metadata storage for everything.
00:47:37.830 --> 00:47:41.280
Myles Brown: So that's been around for a while. And then the last part, which is the actual ETL.
00:47:42.450 --> 00:47:57.060
Myles Brown: So many new features in 2021, especially for the two main UI products. And so this is another one of those, you know, Muhammad asked about EMR versus Glue; here's another one: when do I use Glue Studio versus Glue DataBrew?
00:47:58.200 --> 00:48:10.950
Myles Brown: And maybe this will help. We say that Glue Studio is a visual tool for authoring and monitoring ETL jobs. So you've really got to think in jobs; it's all job focused. So I'm building ETL jobs,
00:48:11.280 --> 00:48:19.170
Myles Brown: I'm running them, and then in a, you know, single pane of glass, I want to look at all my jobs and monitor how they ran.
00:48:20.370 --> 00:48:26.070
Myles Brown: Glue DataBrew is for a slightly different audience, and instead of thinking about jobs, you're thinking about data.
00:48:26.820 --> 00:48:34.740
Myles Brown: So it's a visual tool for data preparation and data profiling, so it's really targeting more the data scientist who says,
00:48:35.190 --> 00:48:39.750
Myles Brown: I'm not building these pipelines and products; I'm an experimental person.
00:48:40.500 --> 00:48:46.290
Myles Brown: I've got data in all these different data sources. It's a real pain for me to grab it from all those places and clean it up.
00:48:46.800 --> 00:48:54.240
Myles Brown: You know, here in this table it's called F name and in this table it's called first name; you know, in this table, I don't know,
00:48:54.570 --> 00:49:00.180
Myles Brown: phone numbers are in this format and in this table they're in a different format, right? I've got to clean it all up and all that kind of thing.
00:49:00.660 --> 00:49:05.790
Myles Brown: And I'd like to do it with as little code as possible, because I don't want to spend my time writing all that code.
00:49:06.210 --> 00:49:19.200
Myles Brown: Well, that's where DataBrew really comes in. It's about taking the data and cleaning it all up; again, they've got all kinds of built-in transformations, and then they can profile that data, and then I'll use it in SageMaker or wherever my
00:49:20.250 --> 00:49:21.840
Myles Brown: tools are for machine learning.
00:49:22.950 --> 00:49:24.540
Myles Brown: So let's look first at Glue Studio.
00:49:25.770 --> 00:49:32.130
Myles Brown: You can now read uncataloged data from S3 and infer the schema, so you don't have to go just through data catalogs.
00:49:32.550 --> 00:49:42.090
Myles Brown: But that's mostly what we'll be doing: we've already crawled the data and we've got it in the catalog. Well, one of the new things added was that jobs can actually update the Glue catalog.
00:49:42.720 --> 00:49:53.100
Myles Brown: So previously, if you needed to write to a pre-existing table, right, when I wrote output I'd say, oh, I have to have that S3 location,
00:49:54.240 --> 00:50:10.440
Myles Brown: write data to it, and then I'd have to run a crawler on it at the end of my job to make that available in the catalog. Well, now I can actually have the job create that data and update the Data Catalog for me, all in one pass.
00:50:12.390 --> 00:50:22.320
Myles Brown: Studio now includes a code editor for customizing your Spark scripts. You used to have to download the script and edit it somewhere and then upload it back; they now have a nice little code editor right there.
00:50:24.150 --> 00:50:38.310
Myles Brown: Another thing that they added is this idea of data previews. While you're authoring your job, you can kind of say, hey, show me what the data might look like at this level, at this step, and so they take a subset of that data and walk it through.
00:50:39.990 --> 00:50:43.650
Myles Brown: And this idea of writing Spark scripts, I mean, that's the big thing with Glue,
00:50:44.040 --> 00:50:56.100
Myles Brown: right? It was that, under the covers, they're going to be using Spark, and they build the Spark jobs and they'll write a lot of the code for you and they'll say, here's the part that you can edit, right? But it really required you to be,
00:50:56.970 --> 00:51:04.980
Myles Brown: you know, a Python or Scala developer who knew Spark. Well, now they have new transforms that you can specify using SQL.
00:51:06.150 --> 00:51:10.980
Myles Brown: So you don't have to be a heavy-duty developer to use it, necessarily.
00:51:12.420 --> 00:51:21.720
Myles Brown: So, you know, there's built-in transforms, but for the ones that are not available as visual transforms, I might be able to use SQL to do some of that processing.
00:51:23.070 --> 00:51:28.260
Myles Brown: So that's some of the new things in Glue Studio. If we look on the other side of that, at DataBrew,
00:51:29.490 --> 00:51:40.110
Myles Brown: they added a lot of advanced data types and delimiters for transformations. So, you know, whether it's Social Security numbers, email addresses, phone numbers, credit card numbers, you know,
00:51:40.530 --> 00:51:47.340
Myles Brown: they know how to take that data and say, okay, well, it did look like a string; now I'm telling you it's this.
00:51:50.400 --> 00:51:58.980
Myles Brown: And let's see what else. So a lot of new transformations for, you know, basic number format, phone number format, and data masking for,
00:51:59.700 --> 00:52:07.230
Myles Brown: for personally identifiable information. There's some cool stuff there. They have transforms for custom sort and multi-column sort.
00:52:08.040 --> 00:52:21.720
Myles Brown: Logical operators, so I can add multiple conditions and use the ANDs and the ORs and NOTs; nesting of things, unnesting things, handling outlier data. You know, so they've added a lot of features to Glue DataBrew; we'll take a quick look in a second.
00:52:23.910 --> 00:52:32.820
Myles Brown: They've also added the ability to specify which data quality statistics you want generated as they're doing their job,
00:52:34.080 --> 00:52:52.140
Myles Brown: their transformations. And it now supports writing the prepared data to a bunch of different destinations. It used to be just, like, okay, put it to S3 or whatever. Now we can put it into any JDBC database, it can write Tableau Hyper format,
00:52:53.160 --> 00:53:00.960
Myles Brown: Glue Data Catalog tables; we can write to catalog tables and things like that. DataBrew is now HIPAA eligible; that'll just
00:53:02.010 --> 00:53:05.550
Myles Brown: be pretty handy in that healthcare space.
00:53:07.980 --> 00:53:19.320
Myles Brown: So that's some new features in those GUI tools, but what about just general Glue features? Well, the crawlers: we used to have to kick off a Glue crawler.
00:53:19.950 --> 00:53:26.610
Myles Brown: You know that the crawler that would go and look at my data sources, you could do it ad hoc whenever you wanted, or you could set up a schedule.
00:53:27.060 --> 00:53:43.620
Myles Brown: Well, now we can do it based on an S3 event. When some new file comes in, we can kick off a crawler and say, hey, somebody dropped a new file into an S3 bucket, I don't know what that is, let's go figure out what the schema is and build a table for it.
00:53:45.390 --> 00:53:47.490
Myles Brown: I mean assuming it's well structured we'll figure it out.
00:53:48.930 --> 00:53:58.290
Myles Brown: They have event-driven workflows with Glue and EventBridge, so I can kick off my Glue job based on whatever event lands in EventBridge.
00:54:00.330 --> 00:54:11.670
Myles Brown: To make life easy for people starting out, they built Glue custom blueprints, so you can parameterize and reuse some workflows to have something there, and then you can build your own blueprints.
00:54:13.740 --> 00:54:29.940
Myles Brown: They've added some machine learning at various points to do some things. One of the things they added was this idea of imputing missing values, so we can look at your data and say, if you're missing some values, based on some machine learning we'll figure out what we think those are, right?
00:54:33.300 --> 00:54:35.310
Myles Brown: So that's that's pretty cool.
00:54:37.470 --> 00:54:40.020
Myles Brown: I think that's the main new features.
00:54:41.790 --> 00:54:57.360
Myles Brown: Yeah, and then there's a few features that are still in preview. So for new features that are in preview, look, they just announced them at re:Invent and, you know, it usually takes till April, May, June before some of those things that were announced become real.
00:54:58.530 --> 00:55:05.520
Myles Brown: So they have this idea of interactive sessions, so you can process data interactively in a notebook or an IDE of your choice.
00:55:06.540 --> 00:55:08.610
Myles Brown: Fast startup and then.
00:55:09.960 --> 00:55:14.700
Myles Brown: If you don't have a notebook of your choice, you can actually use the Glue Studio job notebook.
00:55:15.570 --> 00:55:32.670
Myles Brown: And so that'll let you save and schedule your notebook code as a Glue job. So, you know, you used to sample and build it somewhere and then go and build it up as a Glue job; now the place where you're building it becomes, oh, here, that's the job.
00:55:34.080 --> 00:55:35.580
Myles Brown: That's the promise of notebooks.
00:55:37.080 --> 00:55:41.940
Myles Brown: This idea of PII detection and remediation, well, we saw that.
00:55:42.420 --> 00:55:55.500
Myles Brown: You know, it's again doing some pattern matching, some machine learning, at the column and cell level while it's doing the Glue job, and saying, hey, this looks like a Social Security number, we should probably obfuscate this, you know.
00:55:56.820 --> 00:56:00.090
Myles Brown: And we already saw that that gets picked up in DataBrew as well.
00:56:01.290 --> 00:56:14.100
Myles Brown: And I guess the last one in preview is that Glue 3.0 can now dynamically scale resources up and down based on workload, both for batch and streaming jobs. So that's the,
00:56:15.330 --> 00:56:33.780
Myles Brown: you know, again, with the dynamic scaling, you're not paying for X amount of servers under the covers for the duration of the job. There might be different steps in the job that need more or less processing, and so it'll scale them, and that really helps with costs.
00:56:35.460 --> 00:56:46.260
Myles Brown: Right, last little bit is on Redshift and QuickSight, and I'll go pretty quickly through it. The Redshift architecture that we had for many years, starting from 2013 on,
00:56:46.800 --> 00:56:55.260
Myles Brown: was you had a leader node and a number of compute nodes, and when you set up a cluster you told them how many compute nodes and what kind to use, right?
00:56:56.070 --> 00:57:07.890
Myles Brown: And they stored the data on local disk on those compute nodes, and so the leader node's job was the SQL endpoint. So you send your query there, it builds an execution plan, you know, what's
00:57:09.120 --> 00:57:13.590
Myles Brown: going on, are you allowed to do this query and then it would basically.
00:57:14.730 --> 00:57:21.060
Myles Brown: build an execution plan, compile some code and send it to the compute nodes, so they weren't very smart, they just did the work.
00:57:22.050 --> 00:57:30.450
Myles Brown: Well, eventually, they decided to separate the compute and the data, and in 2019 Redshift introduced a new type of node, RA3 nodes.
00:57:30.870 --> 00:57:47.250
Myles Brown: And then they basically said, what if the storage on the compute nodes wasn't the actual storage, but we used it for the caching of the data that's accessed all the time, and we actually store it on some underlying Redshift Managed Storage based around S3.
00:57:49.260 --> 00:58:06.180
Myles Brown: And so we can get these very large high-performance SSDs that are local caches; the data is persistently stored and scaled separately in managed storage, so by separating the compute and the data we can have a lot of storage really cheap and maybe a smaller.
00:58:07.380 --> 00:58:10.860
Myles Brown: Cluster because we don't need as much compute, or vice versa.
00:58:11.880 --> 00:58:16.740
Myles Brown: So data is automatically moved between the local SSD cache and the managed storage.
00:58:18.750 --> 00:58:25.470
Myles Brown: Based on machine learning algorithms that decide, hey, which data blocks are used a lot, right.
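The cache-in-front-of-durable-storage idea can be sketched as a small Python class. This is a toy model under stated assumptions: the `TieredStore` name is hypothetical, plain dictionaries stand in for the SSD cache and the managed storage, and a simple least-recently-used rule is swapped in where Redshift uses machine learning to decide which blocks stay local.

```python
from collections import OrderedDict

class TieredStore:
    """Small local cache in front of a large durable store."""

    def __init__(self, cache_blocks: int):
        self.managed = {}            # stand-in for durable managed storage
        self.cache = OrderedDict()   # stand-in for the local SSD cache
        self.cache_blocks = cache_blocks

    def write(self, block_id, data):
        # Data is always persisted in the durable tier.
        self.managed[block_id] = data

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # mark as recently used
            return self.cache[block_id]
        data = self.managed[block_id]          # cache miss: fetch from durable tier
        self.cache[block_id] = data
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)     # evict the least recently used block
        return data

store = TieredStore(cache_blocks=2)
for i in range(3):
    store.write(i, f"block-{i}")
for block_id in (0, 1, 0, 2):
    store.read(block_id)
print(list(store.cache))  # [0, 2]
```

Block 1 is evicted because it was touched least recently, while all three blocks remain in the durable tier regardless of what the cache holds.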
00:58:26.220 --> 00:58:37.830
Myles Brown: And you pay the same low price regardless of where the data lives, right, and so that sort of opened up everything that they're doing in the future with Redshift, and so that separating thing really helped and.
00:58:38.700 --> 00:58:47.970
Myles Brown: really what they want to do is do less at the compute cluster, and so AQUA, they introduced it probably a year and a half ago, but it took till sometime last summer.
00:58:48.540 --> 00:59:04.470
Myles Brown: I guess in April it started to become GA in some regions, where they're pushing more and more down to this AQUA layer, and you get, you know, depending on the query, up to 10 times the performance due to offloading a lot of operations down to where the data lives.
00:59:05.700 --> 00:59:11.250
Myles Brown: So over the last year Redshift added a lot of service and self-management improvements.
00:59:11.640 --> 00:59:19.080
Myles Brown: So the automatic table optimization that used to figure out your sort keys and distribution keys for you now can do column compression as well.
00:59:19.830 --> 00:59:28.740
Myles Brown: The automatic workload management, that used to keep track of how many queues and how much memory to delegate to them, can now do concurrency scaling, so adding more clusters.
00:59:30.300 --> 00:59:43.740
Myles Brown: Some other new Redshift stuff: support for native JSON, pretty cool, hierarchical data queries. Finally, Redshift ML is generally available; I'm not going to get too much into that, that's a whole can of worms.
00:59:44.280 --> 00:59:54.270
Myles Brown: Data sharing across clusters, they can do it in the same account, in different accounts, in different regions, so this idea that your Redshift data isn't stuck right there.
00:59:55.980 --> 01:00:07.740
Myles Brown: And then there's RSQL. So you used to use like a third-party tool if you wanted command-line access to your Redshift data, and so you'd just use the Postgres one; they now have one that supports all the features.
01:00:09.180 --> 01:00:20.400
Myles Brown: And then, it used to be, you know, federated querying, so inside a Redshift cluster I could run queries that talk to, say, Postgres; it now supports MySQL and Aurora MySQL.
01:00:21.480 --> 01:00:29.130
Myles Brown: Couple of preview things. Materialized views came out in early 2020; they now have this idea of automated materialized views.
01:00:29.430 --> 01:00:43.470
Myles Brown: Where they look at queries that happen again and again, like maybe repeatable workloads like dashboards, and it uses some ML and says, we'd be better off to make a materialized view and refresh it periodically, and we'll maybe do that under the covers for you.
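The "queries that happen again and again" part can be sketched as a frequency count over a query log. This is a crude stand-in, not Redshift's method: the `normalize` and `materialization_candidates` helpers, the `min_runs` threshold and the sample log are all assumptions for illustration.

```python
from collections import Counter

def normalize(sql: str) -> str:
    """Collapse whitespace and case so identical queries compare equal.

    Crude on purpose: lowercasing would also alter string literals.
    """
    return " ".join(sql.lower().split())

def materialization_candidates(query_log, min_runs=3):
    """Return queries that repeat at least min_runs times, most frequent first."""
    counts = Counter(normalize(q) for q in query_log)
    return [q for q, n in counts.most_common() if n >= min_runs]

# Hypothetical query log, e.g. from a dashboard refreshing repeatedly.
log = [
    "SELECT region, SUM(sales) FROM orders GROUP BY region",
    "select region, sum(sales)  from orders group by region",
    "SELECT * FROM customers WHERE id = 7",
    "SELECT region, SUM(sales) FROM orders GROUP BY region",
]
print(materialization_candidates(log))
```

Only the repeated aggregate query is flagged; the one-off customer lookup is not worth materializing.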
01:00:44.940 --> 01:00:50.460
Myles Brown: Lots of improvements to the Redshift query editor, including a query SQL notebook, kind of in preview.
01:00:52.920 --> 01:00:53.880
Myles Brown: More notebooks.
01:00:54.900 --> 01:01:03.030
Myles Brown: Streaming ingestion support for Kinesis Data Streams, so instead of having to have Kinesis write to S3 and then load it, they do it all at once.
01:01:03.450 --> 01:01:11.700
Myles Brown: So that's all in preview. QuickSight is our BI tool; the biggest thing that they've got that is GA is QuickSight Q.
01:01:12.120 --> 01:01:20.880
Myles Brown: Where you have a dashboard somebody builds for you, you can write natural language queries in there, "what is our year-over-year growth rate," and it can kind of figure it out for you.
01:01:21.960 --> 01:01:32.850
Myles Brown: They also have this dataset as a source, where, you know, a curator can create special central datasets that grab data from different places and make that available to authors.
01:01:34.530 --> 01:01:41.910
Myles Brown: And they have threshold alerts, so you could pick any KPI in the dashboard and make an alert say hey when this goes over a certain threshold, let me know.
01:01:43.620 --> 01:01:51.030
Myles Brown: And SPICE now has incremental refresh using a timestamp column, so instead of having to refresh all your data in that SPICE.
01:01:52.230 --> 01:01:57.270
Myles Brown: layer you could you know tell it to incrementally refresh certain data.
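Incremental refresh keyed on a timestamp column can be sketched like this. It is a minimal model, assuming a look-back window and in-memory rows; the `incremental_refresh` function, the `updated_at` and `id` column names and the sample data are hypothetical and not taken from QuickSight itself.

```python
from datetime import datetime, timedelta

def incremental_refresh(cached_rows, source_rows, now, window,
                        ts_col="updated_at", key="id"):
    """Replace only rows whose timestamp falls inside the look-back window."""
    cutoff = now - window
    # Keep cached rows older than the cutoff untouched.
    kept = {r[key]: r for r in cached_rows if r[ts_col] < cutoff}
    # Re-read only the recent rows from the source.
    for r in source_rows:
        if r[ts_col] >= cutoff:
            kept[r[key]] = r
    return sorted(kept.values(), key=lambda r: r[key])

now = datetime(2022, 3, 1, 12, 0)
cached = [
    {"id": 1, "updated_at": datetime(2022, 2, 1), "v": "a"},
    {"id": 2, "updated_at": datetime(2022, 3, 1, 9, 0), "v": "b"},
]
source = [
    {"id": 1, "updated_at": datetime(2022, 2, 1), "v": "a"},
    {"id": 2, "updated_at": datetime(2022, 3, 1, 11, 0), "v": "b2"},
    {"id": 3, "updated_at": datetime(2022, 3, 1, 11, 30), "v": "c"},
]
refreshed = incremental_refresh(cached, source, now, timedelta(days=1))
print([(r["id"], r["v"]) for r in refreshed])  # [(1, 'a'), (2, 'b2'), (3, 'c')]
```

Row 1 is never re-read because it is older than the window, which is where the savings over a full refresh come from.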
01:01:58.530 --> 01:02:10.500
Myles Brown: And the last thing, versioning in datasets. All right, well, we're at the top of the hour, basically. If you want to know more about analytics on AWS, the Big Data Blog is a good place to start.
01:02:12.330 --> 01:02:24.300
Myles Brown: But we have three classes that kind of fit, you know, AWS authorized classes. The Big Data on AWS class talks a lot about EMR, Kinesis, Athena and Glue, things like that.
01:02:24.900 --> 01:02:33.810
Myles Brown: The Planning and Designing Databases class really just covers your basic RDS and NoSQL options, and Data Warehousing is a three-day deep dive into Redshift.
01:02:34.950 --> 01:02:41.910
Myles Brown: And I think Alex just threw those in the chat if you want to grab those; hopefully, you can save the chat.
01:02:42.390 --> 01:02:56.100
Myles Brown: Not sure, it depends on how you're running Zoom, but you might be able to, in the chat, hit the dot dot dot and you might see a "save chat," and then you can get all those links; otherwise you might want to take a minute now to grab those links.
01:02:58.230 --> 01:03:04.170
Myles Brown: Otherwise I'm open for questions. I didn't see too many questions along the way; that was a real.
01:03:05.370 --> 01:03:09.480
Myles Brown: Fast marathon there at the end, trying to get in within that hour.
01:03:12.510 --> 01:03:17.220
Myles Brown: So I'll be here for a couple more minutes if you have questions. I'm also going to put my email address.
01:03:22.710 --> 01:03:26.100
Myles Brown: If you have questions after, this is a good place to send them.
01:03:28.020 --> 01:03:30.600
Myles Brown: you'll also see that in the slides, I think.
01:03:31.770 --> 01:03:37.800
Myles Brown: You're going to get the recording of this presentation sent to you. I don't know how soon that is sent, maybe later in the week or.
01:03:38.190 --> 01:03:40.170
Alexandra Kenney: By the end of the week we'll send it out okay.
01:03:40.560 --> 01:03:43.200
Myles Brown: Did you see any other questions that have popped up along the way.
01:03:43.980 --> 01:03:49.530
Myles Brown: No, and nobody used the Q&A; mostly people used the chat, I think.
01:03:52.650 --> 01:03:55.440
Myles Brown: Like I said it's a lot of data thrown at you at once.
01:03:56.760 --> 01:03:57.810
Myles Brown: I just wanted to kind of.
01:03:58.890 --> 01:04:10.620
Myles Brown: let you know that, you know, every year they're adding more and more, right, there's a lot to it, and you might not be interested in all of it, but maybe something hits where you're like, hey, I could use that.
01:04:11.520 --> 01:04:18.210
Myles Brown: And it saves you from watching 40 hours of re:Invent presentations, which is what I have to do.
01:04:19.260 --> 01:04:21.960
Myles Brown: Hopefully that that saves you some of that time.
01:04:23.310 --> 01:04:32.160
Myles Brown: Oh yeah, "is there an AWS path for certifications, you guys have at ExitCertified, what's the URL?" Yeah sure, let me, let me just show you.
01:04:33.420 --> 01:04:34.110
Myles Brown: My.
01:04:35.970 --> 01:04:48.750
Myles Brown: browser here and just go to ExitCertified. So if you go to AWS you see all of our training classes, but under certifications, this is where you'll see our certification paths.
01:04:49.980 --> 01:04:52.380
Myles Brown: Somewhere in here is a certification roadmap.
01:04:54.450 --> 01:04:58.980
Myles Brown: Now this isn't the one where I get an actual URL for it, where is the one that's got the URL.
01:05:00.150 --> 01:05:03.000
Alexandra Kenney: The URL to this page here so that they can.
01:05:05.130 --> 01:05:05.520
Myles Brown: yeah.
01:05:06.000 --> 01:05:07.140
Alexandra Kenney: I just put.
01:05:08.310 --> 01:05:17.430
Myles Brown: yeah so this one's got a little pop up if you click on the certification roadmap you'll see if we scroll down it's usually on the second page where you'll find the analytics stuff.
01:05:17.880 --> 01:05:26.880
Myles Brown: Typically, if you're coming in with no knowledge of AWS, we like everybody to take the one-day Tech Essentials, but if you've been using AWS for a while, you can certainly skip that.
01:05:27.390 --> 01:05:34.650
Myles Brown: And most people will take either the three-day Architecting or SysOps to get the basics of AWS before they go on to the big data class.
01:05:35.160 --> 01:05:44.430
Myles Brown: And the big data class will mostly prepare you for the Data Analytics Specialty. Now, I say mostly because you're going to have to do some study, right.
01:05:44.850 --> 01:05:51.870
Myles Brown: I always tell people, hey, we're teaching a class Monday to Wednesday, don't run over and write your exam on Thursday, you know, there's a lot to it.
01:05:53.190 --> 01:05:58.320
Myles Brown: But a nice follow-on after that is the one-day Building Data Lakes on AWS class.
01:05:59.040 --> 01:06:09.930
Myles Brown: If you're more interested in the data warehousing, we've got that three-day Data Warehousing class; it doesn't have as much of a prereq, you know, the one-day Tech Essentials is enough to get you ready for that.
01:06:10.380 --> 01:06:18.840
Myles Brown: Here's the machine learning path: the four-day Machine Learning Pipeline is the best way to prepare for the Machine Learning Specialty.
01:06:19.470 --> 01:06:31.770
Myles Brown: And then there is a Database Specialty; we've got that three-day Planning and Designing Databases, which, again, you know, it's an advanced class, they expect you to know AWS, so you could do either the Architecting or the SysOps first.
01:06:33.930 --> 01:06:40.080
Myles Brown: And then, you know, there's all kinds of other paths here, so you can find that document on the link that Alex put in.
01:06:41.430 --> 01:06:43.890
Myles Brown: Again, you just find it under the.
01:06:46.050 --> 01:06:50.340
Myles Brown: it's kind of funny, it's a white button on a white background, but that's where it says certification roadmap.
01:06:52.200 --> 01:06:53.910
Myles Brown: But that's in the.
01:06:55.290 --> 01:07:01.920
Myles Brown: In the certifications tab. If you look at the training courses, let's imagine that you are interested in something like the big data class.
01:07:03.240 --> 01:07:07.440
Myles Brown: Probably the most relevant one here, if you look at the big data class you clicked on it.
01:07:08.760 --> 01:07:20.640
Myles Brown: If you look at the schedule, here's where we, March 22, April 5, May 3, it runs at least once a month, probably every, May 3, May 17.
01:07:22.620 --> 01:07:25.080
Myles Brown: So that class runs fairly often.
01:07:26.610 --> 01:07:31.680
Myles Brown: The Building Data Lakes is a one-day class that doesn't run quite as often, I would say.
01:07:32.880 --> 01:07:35.040
Myles Brown: See, March 15, April 18.
01:07:37.230 --> 01:07:40.200
Myles Brown: May 17, looks like every four weeks.
01:07:44.370 --> 01:07:44.790
Myles Brown: Well, I.
01:07:46.290 --> 01:07:48.420
Myles Brown: don't see any new questions.
01:07:52.980 --> 01:07:56.850
Myles Brown: Well you've got my email, hopefully, you had time to copy all those links.
01:08:05.220 --> 01:08:09.990
Myles Brown: And thanks for attending I don't know any any closing remarks.
01:08:10.050 --> 01:08:18.090
Alexandra Kenney: Perfect Thank you everyone for attending, thank you for the questions you'll get an email with this recording as well, by the end of this week, so thank you so much.