AWS Architecture Design - Global Lifts Scenario
Transcript
In this video we're going to do something a little bit different to what we normally do on this channel. What I've decided to do for this video is to take you through something that I quite often find myself having to do in interviews, and being on the receiving end of as an interviewer, and that is going through some form of technical architecture walkthrough where you design different bits and pieces given a rather big scenario.
So what I'm going to do today is I'm going to do this live-ish with you on the stream. Everything that you're going to see from this point onwards is completely unscripted. There's no prior research happening. What I'm going to be doing is show you the script and then we're going to go through, highlight different bits and pieces that we're interested in, and talk through my approach and how we're going to do this. So let's go through and take a look at the script now.
So this is a scenario that I came up with, which is very similar to a lot of scenarios that I've had to deal with before, and it's fairly representative of a real-life scenario that you would encounter at senior levels of engineering. So I'll read it out to you just in case it's not big enough on screen.
You've been hired by Global Lifts Incorporation, a provider of various types of lift, from personal lifts through to traditional building lifts. Global Lifts Incorporation are looking to migrate their existing monolithic application to handle increased load, as they've won a contract with several major construction companies. The architecture needs to be able to support the following high-level requirements: each lift has a unique ID that reports data every 10 seconds; data is needed for up to five years for data analytics; a UI will be needed to view the data; and alerts will need to be triggered under various scenarios, e.g. total number of lift elevations over a million. Your task is to draw a high-level diagram of how you would design this system, taking into account the following characteristics: security, reliability, performance, and data storage.
So the scenario is quite vague in itself, and this is traditionally what you would find in architecture sessions. The first key point here is: if it's not explicitly listed, then they're expecting you to ask questions or make assumptions and state your assumptions. One of the assumptions that I would make is that they've won several major contracts, which means they're going to be looking at a rather large load. I would be assuming this is a global company as well, so we know that we're going to need something that is multi-region and supports high throughput.
One thing that we can gather from the scenario itself is that it's lifts, so everything should be nicely isolated. We've got a natural partition there that we can easily scale out from as well. It's also moving away from a monolithic application, so we need to remember we're not building on anything existing right here, although we might have some logic in various places. It's not a migration-mechanics task. This is a fresh, high-level "how do we do this in an ideal world" type task.
So looking at that, where would I start? Well, considering it's lifts and we need to report data every 10 seconds, we're going to need some formal authentication service. So that's probably a good start. We know that we're going to need a UI, so that's our second service. And we need somewhere where we can actually report the data in. So we've got at least three different services here that we can drop down on our initial high level diagram.
The next thing to note is around the data requirements. It's going to be data reported every 10 seconds, so we can imagine if there's quite a large global company, we're gonna have a lot of data. Assuming they have a couple of hundred thousand lifts worldwide, it's gonna be a lot of data that's gonna come in over time. It also does depend on what data is being stored and reported, obviously, that we don't have a view on. You can kind of make a rough guess as you go through, the sorts of data that will be there, like when somebody's going to go up in the lift, you will have an event. You could have temperature readings, sensor readings, and all that kind of stuff.
Next thing to note is we've got data analytics for five years, and this is where it starts to get a little bit interesting. Because we have lots of data, it needs to be stored for a long period of time, and we also need to trigger alerts off of this data as well. So let's keep that in mind as we go through this.
Let's go and take a look at starting to draw some of this out. So earlier I did say it's going to be a multi-region kind of scenario that we would potentially be looking at. I'm going to start off with a single region design and I'm going to show you how I can keep in mind a multi-region design so that we can go forward and expand the design later.
As an interviewer, one of the things I'll always do in a scenario like this is throw curve balls in at the end, like saying "how would you go to multi-region?" Now we don't know whether this company is multi-regional or not by the scenario, but it's worthwhile keeping in mind future flexibility and picking the appropriate technologies to do that.
So if I was going to start drawing this diagram, the first thing I would do is draw a nice big box in the middle and just call this like EU-1. I'm in London, so the European regions for whatever cloud provider is what I would generally pick.
Given that we said we're going to need some kind of authentication service, I'm just going to call it Auth there for a second. Now if this is on AWS, I'm going to be looking at something like Cognito, because we can have different user pools, worker pools, and security groups so we can delegate some of that access into the services. Alternatively, if we're looking at something that's going to go a bit more multi-cloud, we can look at something like IdentityServer or Duende, I think it's called now.
So I'm just going to draw a little user. Our little user is going to represent a lift in the context of this. One of the first things that we should be calling out here is IoT security. How do we trust the user? There's obviously different types of auth flow here, so I would be calling this out and saying we need to make sure that we pick the appropriate auth flow that is compatible with an authentication service. Off the top of my head, I can't remember the exact auth flows that Cognito supports, which is why I'm giving you the different options. It's perfectly fine for you to say this in interviews as well, to say "I would do more research on this, these are the two options I would probably pick from, we would do a quick spike on it, make sure that we've got performance, security, reliability" and so on.
So assuming that our little lift is authenticated, then we've got something that listens to our data. I'm just going to call this the API. Now what I would probably do here, considering there's a lot of potential to scale this application: if this is in AWS, then what I'd be looking at here is something like API Gateway instead of just a regular ASP.NET Core API, and then behind this I would have one or more Lambdas that support this API. So we've got a distinct link between the API and API Gateway, and then the API Gateway has the Lambdas behind it as well.
Now one of the reasons why I'm picking this is the task for the API is basically take the data, validate its input, sanitize it. Obviously you've got some authentication concerns here which will pass off to the Auth service, but once we've got that stuff sorted out, then it's just sanitization, validation, and storage. We may have a bit of retrieval and stuff like that for certain use cases. So that's why it's more than one Lambda, so we can retrieve and set data as well.
Essentially what we'll have here is some form of data store. The Lambdas are going to be writing into here and reading from there depending on the use case. For this data store, I would probably go with DynamoDB for a number of reasons. Number one, it's a managed service that's highly scalable. Second, the kind of data that we're dealing with, and this is probably the most important part, is the data that we're dealing with is easily partitionable, which fits into the partition schema that DynamoDB has. So we could have a partition for every lift, for example, or we could group it down by a building so we just have all of the lifts in a building inside of a single partition.
Because we're writing that much data, we need to be careful about how we store this data long term, but there's a nice easy way we can do that in DynamoDB: just apply a TTL. We can say we'll keep 48 hours of data. So I'll just write this on the side here, we'll have a TTL of 48 hours, and this means all the documents inside of DynamoDB will be deleted automatically once they're 48 hours old.
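To make that concrete, here's a rough sketch in Python of what a single reading item could look like, with a per-lift partition key and a 48-hour TTL attribute. DynamoDB's TTL feature expects an epoch-seconds timestamp in a designated attribute; the attribute names and payload fields here are my own assumptions, and in practice you'd hand a dict like this to something like boto3's `put_item`.

```python
import time

TTL_SECONDS = 48 * 60 * 60  # 48-hour retention in the hot store

def build_reading_item(lift_id, payload, now=None):
    """Build a DynamoDB-style item for one 10-second lift reading.

    Partition key is the lift ID, sort key is the reading timestamp,
    and `expires_at` is the epoch-seconds value DynamoDB's TTL feature
    would use to delete the item automatically.
    """
    now = time.time() if now is None else now
    return {
        "pk": f"LIFT#{lift_id}",      # one partition per lift
        "sk": f"READING#{int(now)}",  # readings sort by time within the partition
        "expires_at": int(now) + TTL_SECONDS,
        **payload,                    # e.g. temperature, elevation count
    }

item = build_reading_item("lift-0042", {"temperature_c": 21.5}, now=1_700_000_000)
```

The nice thing about this shape is that the TTL costs nothing extra: DynamoDB sweeps expired items in the background, and the deletes still flow through the stream if you want to archive on expiry.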
Now you may be thinking back to the scenario where we had a five-year data retention requirement. This is completely fine, because one of the things that DynamoDB can do for you is have streams off the back, and this is a change data capture stream. So we'll draw that off here. We will say this is going to be the inserted data, because we're not planning to edit this data, so we're only interested in the inserts.
To explain this a little bit more, DynamoDB Streams lets you capture all the data changes, or essentially just the inserts and updates, or just the deletes. So if you wanted to keep it in one type of storage, or hot storage, for say a week, and then you wanted to offload it to something like S3, then you could use the delete stream to take a copy of the record and dump it into S3 from there.
Now we don't want this. We want to put it into some kind of data warehouse. So what we can do is listen to the inserted/updated records. This comes out as a Kinesis-like stream, so it's essentially a sequential log of "this thing happened in this order". We can put Kinesis Firehose on here, and then we can offload it to something like S3, and then we can have Athena reading off there for the ad hoc queries.
Athena, if you're not quite sure what it is as a technology: basically you point it at pools of data from different sources and then you can run queries over them. So you can do it over an S3 bucket, you can do it over databases. Sources are kind of flexible, which is good.
So the flow in the infrastructure we've drawn is: every time something gets written to the API Gateway, it gets processed by one of the Lambdas. A new record is inserted once it's been validated, say "this is the status of this lift" or "this event happened on this lift". This gets written into the DynamoDB Stream, Kinesis picks up from the DynamoDB Stream (we'd probably have to have an actual Kinesis component in there as well), it gets picked up by the Firehose, and it's dumped into S3 as a nice set of aggregations.
Then for the querying portion, we've got Athena here. This does assume that we were going to give our users access to this.
One of the things we do have to consider, which I'm going to talk about now, is the hidden requirements that are often lurking in these scenarios. So far so good, our infrastructure is nice and scalable. But one thing that we haven't considered is: how much is this gonna cost?
So now what we're trying to look at doing is: we've got the rough kind of flow that we're going to need for at least this portion of the scenario. We're using fully managed, serverless services, so we're kind of covering off the performance side of things. We know that Lambdas can scale out pretty quickly. There is a slight delay with them, a couple of hundred milliseconds to spin up, but generally speaking they're pretty scalable, as is the API Gateway.
The Auth service, generally speaking you only need to hit that once or twice and then have some kind of refresh token for the gateway.
DynamoDB itself, which is billed on read and write units, has two modes of operation that we can use. It's got pay-per-request, which is essentially your really scalable option. You don't have to worry about pre-provisioning your read and write units at all. You just pay for what you use. So if you want to use 10,000 units, then you can go ahead and use 10,000 units. You don't have to pay for them in advance or anything like that.
Now there is a caveat with that. If you have a burst in scale and you haven't hit that capacity before, so say you usually do around 100 read units and then you suddenly go to 10,000 because you've made it to the top of Reddit, for example, then you're going to struggle to get to that level because of some internal mechanisms. Even though you pay per request, that still doesn't get around the underlying behaviour of DynamoDB, which is that it can only double its partitions internally every 30 minutes, I think it is, by default.
But the way to work around this, which is an AWS recommendation, is that you pre-provision your tables to something like 10,000 read and write units when you initially create the table and then instantly switch it into pay-per-request. Do something really large and then if you do hit that huge scale, then it's all good. This is something that we do all the time and it works really well.
Obviously as the table grows over time and you are using that table a lot more, so say you started off at 100 request units and then you slowly build up to like 5-600 units, DynamoDB will be smart enough to figure out when it needs to scale and sort all the stuff in the background. So if you've got a gradually increasing load, then you're going to be fine on the pay-per-request model. But if you're expecting big bursts, like you've hit the top of Reddit or something, pre-provision your table in advance.
So that was a massive segment about DynamoDB, apologies. Hopefully it was useful. So yeah, we have DynamoDB going into Kinesis and then Kinesis Firehose. An interesting point to note about this section: there was a recent release that basically means you can do a codeless integration between the two. A couple of months ago you used to have to write Lambdas in between, but if your data matches up pretty perfectly then you don't need to do any transformation work. You can just basically use the AWS integrations and go from the DynamoDB Stream straight into Kinesis. There's obviously a codeless integration between Kinesis and Kinesis Firehose, and again a codeless integration between the Firehose and S3, and similar with Athena.
So for our hands-off managed solution, we've actually got most of the complicated stuff already taken care of. We've got the long-term data archiving taken care of.
Now one of the things that we do need to do is take a look at some of the alerting. So if you remember the scenario, let me go back to it so you can refresh your minds. One of the things we need to take a look at is that alerts will need to be triggered under various scenarios, e.g. total number of lift elevations greater than a million. So this is where it gets a little bit tricky, because we're not really told what the alerts are, but we know there are going to be forms of alerts that need to happen.
So there are a couple of ways that we can do this depending on the alerts themselves. The first one is to actually have these Lambda functions write out the metrics to something like Datadog or New Relic, and then put all the alerting inside of that platform so it's all centrally managed. This is a pretty strong approach. You can do the same with CloudWatch Metrics and CloudWatch Alarms in here as well if you want to do that.
The other way of doing it is take Kinesis down here and you could pipe something else off, and you can use another bit of technology called Kinesis Analytics. Just for completeness, I'll just draw this one up here so you know we've got the CloudWatch option. So you've got kind of two options here depending on which route you want to go.
As everything is fairly undefined, I'm not going to use the analytics option right now. But this is where you do more of your real-time streaming kind of platforms, real-time aggregation over a stream of data. That's where Kinesis Analytics is really useful. But if we've got just a small set of alerts and we don't really know what they are, we can start off with something like CloudWatch Metrics and Alarms and trigger workflows based off of those. Quite a simple option for now, whereas Kinesis Analytics is a lot more code that we're going to need to write, because we'd need Java for what I think is Apache Flink underneath.
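As a sketch of the simple option: the "more than a million lift elevations" alert could be expressed as the arguments you'd pass to CloudWatch's `put_metric_alarm` (via boto3, say). The namespace, metric name, and SNS topic ARN below are all made up for illustration; the scenario doesn't tell us what the Lambdas would actually emit.

```python
# Hypothetical alarm definition; in practice you'd call
# boto3.client("cloudwatch").put_metric_alarm(**elevation_alarm)
# after the ingest Lambdas emit a "LiftElevations" custom metric.
elevation_alarm = {
    "AlarmName": "lift-elevations-over-a-million",
    "Namespace": "GlobalLifts",        # assumed custom namespace
    "MetricName": "LiftElevations",    # assumed metric name
    "Statistic": "Sum",
    "Period": 300,                     # evaluate in 5-minute buckets
    "EvaluationPeriods": 1,
    "Threshold": 1_000_000,
    "ComparisonOperator": "GreaterThanThreshold",
    # Assumed SNS topic that fans out to email / PagerDuty / etc.
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:lift-alerts"],
}
```

The appeal of this route is that each new alert is just another alarm definition, with no stream-processing code to maintain.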
So I'm going to get rid of that option for now just for completing the diagram, but I would explain to whoever is on the receiving end that we've got those couple of options should we have the necessary bits and pieces for them.
So the last bit to come and cover off is the UI. I'll go back and double-check the requirements after this as well. The UI is going to be used by some kind of support staff. They're going to be communicating with the UI, which is also going to have some integration with the authentication service. We're going to want some way of managing this automatically. The UI is just going to talk to our API Gateway and we'll expose the functionality on there. We'll be having role-based permissions and stuff like that.
Now for the UI itself, it really depends on what front-end technologies we're going to go through and pick. One option for a really scalable solution is actually to host the UI out of S3 with a CloudFront distribution on top of it, and this is assuming that your UI is a completely static site and doesn't really have anything behind it. It does assume that your authentication service has some form of login page, something you can redirect to and redirect back from, and then you have a really intelligent client-side application, which would make it super scalable and performant.
But yeah, at this point I'd be saying to the interviewer, "what front-end technologies do you usually use here?" because I would want to reuse a lot of the technologies they're already using, for familiarity with the team. Thinking about that aspect of it, what we've got here so far is potentially a lot of new technologies for the team, but it is probably one of the most scalable solutions that you can think of. So depending on what the team has internally, it would drive a bit of the decision on this. Obviously if they've got something like just basic HTML and CSS, then it doesn't really matter what they've got here, but if they heavily use Vue.js for example, there's no point me putting in React, or ASP.NET Core and Blazor, and all that kind of stuff. It also depends on the requirements for the UI. All we know is there's going to be a UI to view the data.
Let's just double check that. This is probably a good point to go back through and make sure we've got everything. We don't really need to do too much about the first requirement: each lift has got a unique ID and reports every 10 seconds. That's more of an informational one for us. Data is needed for five years for data analytics: on that one, we ingest all the data as it comes in, we securely store it in DynamoDB, and it gets pushed into S3 in a managed way, so we don't need to worry about it too much.
With S3, we can put lifecycle retention policies on, which is one of the reasons why I picked that technology. You can tier the data as well. I can put it into the standard tier, which means I get super fast latency, a ridiculous amount of nines of reliability. Then I can say, right, after seven days, this is going to be crunched down, we're going to move it down a tier. Or we can use Intelligent Tiering in S3, which moves it between different tiers depending on how much that item is being used. So that's one thing to know about that, and I've got some more stuff that I can cover off with S3 in a second as well about why I picked that as a technology.
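A lifecycle configuration along those lines might look like this. The day counts and key prefix are illustrative (not from the scenario), and note that S3 itself enforces a 30-day minimum before objects can transition to Standard-IA, so the "after seven days" idea would need adjusting in practice. You'd apply it with something like boto3's `put_bucket_lifecycle_configuration`.

```python
# Illustrative lifecycle rules: tier the data down over time,
# then expire it once the five-year retention requirement is met.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-then-expire-lift-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "lift-data/"},  # assumed key prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # S3's minimum for IA
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Roughly five years, per the scenario's retention requirement.
            "Expiration": {"Days": 5 * 365},
        }
    ]
}
```

Alternatively, swapping the explicit transitions for `INTELLIGENT_TIERING` at day zero hands the tiering decision to S3 itself, which is the lower-effort option if access patterns are unknown.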
A UI will be needed to view the data, obviously we just covered that bit. Alerts need to be triggered under various scenarios, total number of lift elevations. Again, we covered that with the basic metric for this kind of stuff.
So what else do we need to do? We need to consider security, reliability, performance, and data storage. Cool. Before I go onto those actually, what I'm going to go back and do is just give you the other reason why I picked S3 as the data store technology.
For the UI, stuff that's stored in DynamoDB may not be in a suitable format for querying. DynamoDB is typically very good at searching a single partition. It's like a key-value style interaction, but if you want to do more complex queries, it's probably not the best. This is why we've also got specialized Lambda functions as well.
If DynamoDB doesn't quite work for the use cases that are in the UI, we've got a couple of options here. Again, we can pipe something off the Kinesis Stream that transforms it on the back end, like "I want to transform it into this different shape", auto-aggregation. We can do that as everything happens. We could have something like a managed Elasticsearch cluster over here, and then we could have a Lambda that's responsible for writing it inside of Elasticsearch, and then we can have our application query Elasticsearch and have a lot richer querying capabilities. That's one approach that we can use. It's probably what I would suggest for this one.
The other way of looking at it is we've got all this data in S3. We could use something, I'm just trying to think of the technology, I think it's Glue. AWS Glue basically allows you to operate more like a data warehouse, where you can say "here's the structure of the data inside of here, here's how we're going to use it", and then you can basically do a MapReduce-type pattern and re-store the data either inside DynamoDB or in a different format, however it suits.
If it's only a small change that needs to happen inside of DynamoDB, then one thing you might be able to do is add a Global Secondary Index. Basically what this allows you to do is take the data that is in the partitions and re-key it slightly so it's a little bit easier to query. For example, if you had each of the lifts reporting into a different partition and then you wanted to aggregate by the building so you can see all the lifts in the building, you'd create a Global Secondary Index and make the partition key of that Global Secondary Index the building ID, for example. Then you should be able to see all the lift statuses inside of the building.
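Sketching that GSI: if each reading item carries a `building_id` attribute, the index definition you'd pass to something like boto3's `update_table` (inside `GlobalSecondaryIndexUpdates`, along with matching `AttributeDefinitions`) might look like this. The attribute names are my own assumptions.

```python
# Hypothetical GSI so support staff can query all lifts in a building.
# This is the "Create" element of a GlobalSecondaryIndexUpdates entry.
building_index = {
    "Create": {
        "IndexName": "building-index",
        "KeySchema": [
            {"AttributeName": "building_id", "KeyType": "HASH"},  # re-keyed partition
            {"AttributeName": "sk", "KeyType": "RANGE"},          # keep time ordering
        ],
        # Project everything so the UI query never has to touch the base table.
        "Projection": {"ProjectionType": "ALL"},
    }
}
```

The trade-off to call out in an interview: a GSI doubles your write cost for the projected data, but it keeps everything inside DynamoDB rather than bolting on Elasticsearch for one query pattern.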
So we've got a couple of different approaches that we can use there. But right now we've got something that is pretty scalable. Elasticsearch is a bit of a question mark, but we've got something that's pretty scalable in our current design.
So let's take a look at the last points. Let's tackle security first. If we're having a look at where our vulnerability points are, well, before we even look in our architecture, we've got vulnerability points in the lift and in the user. Not a lot we can do about either of those because they're external to the infrastructure.
So one thing we would need to take a look at is making sure that all of our API endpoints are protected with authentication, we're using the right authentication method, and we're not baking passwords into the lifts, for example. That would be a really bad idea. You'd use device authentication kind of flows, much like you do with Netflix. You start by registering the device, so it might be an install task that you do: register the device, get a code back, somebody types the code in somewhere, and then that lift is authenticated against the API. So nothing sensitive is stored on there.
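That registration flow (much like pairing a TV app) can be sketched in three steps: the lift asks for a pairing code, an engineer enters that code somewhere trusted, and the lift then polls until its credentials are ready. This toy in-memory version just shows the shape of the flow; a real system would use something like the OAuth 2.0 device authorization grant against the actual auth service, and everything here is illustrative.

```python
import secrets

class DevicePairing:
    """Toy device-registration flow; nothing here talks to a real auth service."""

    def __init__(self):
        self._pending = {}   # user_code -> device_code, awaiting approval
        self._approved = {}  # device_code -> issued token

    def start(self, lift_id):
        """Lift calls this on install and shows the short code on its panel."""
        device_code = secrets.token_hex(16)
        user_code = secrets.token_hex(3).upper()  # short code a human can type
        self._pending[user_code] = device_code
        return device_code, user_code

    def approve(self, user_code):
        """An engineer types the code into the admin UI to approve the lift."""
        device_code = self._pending.pop(user_code)
        self._approved[device_code] = "token-" + secrets.token_hex(8)

    def poll(self, device_code):
        """Lift polls until its credentials are ready; no secret baked in."""
        return self._approved.get(device_code)

pairing = DevicePairing()
device_code, user_code = pairing.start("lift-0042")
before_approval = pairing.poll(device_code)   # None: not approved yet
pairing.approve(user_code)
token = pairing.poll(device_code)
```

The point of the pattern is exactly what the transcript says: the lift ships with no long-lived secret, and a compromised lift can be de-registered without touching any other device.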
The staff members, we'll be looking at making sure that they have two-factor authentication enabled, so if they are compromised at least password-wise, you've got an extra layer of security on there as well.
In terms of the rest of the services, most of this will be running inside of a VPC anyway. So we've got security groups around the resources in the VPC, and all the Lambdas will be running inside of the VPC. IAM roles limit access so only the Gateway can invoke the Lambdas, and only the Lambdas can reach DynamoDB. Those are about the only IAM roles we would need to have there. AWS takes care of all of this, so long as they've got the right permissions, we're all good.
In terms of reliability, a couple of major points to take a look at: we're using a lot of managed services, like I said before, all with very high uptimes. I think the lowest SLA number that I can think of in this diagram would be 99.95%, I think it is, for DynamoDB. Might be wrong on that. Again, I'm doing this all from memory, so I definitely could be wrong on this one. So if that's 99.95%, then that's effectively the ceiling for our infrastructure. I'll go through how to make this a multi-region architecture in a bit, at least mostly multi-region.
Apart from that, on reliability, the other bit to call out: Elasticsearch. As much as I love Elasticsearch, it's a pain to manage. There's no two ways about it. Amazon, or AWS sorry, do have their own managed version of this, but I've not tried it personally. I just know that there is an offering, so that will at least take care of some of the maintenance and give you a better level of reliability, unless of course you're Elasticsearch experts, in which case you'll do just as good a job as AWS would. But most people using Elasticsearch are not experts in it, so I'd recommend managed offerings wherever you can. That's a pro tip for your interview processes: use a managed service wherever you can, because it takes away so much that you'd otherwise have to manage.
Other points: UI. If it is a CloudFront distribution on top of an S3 bucket, then you're pretty much there because S3 is really reliable. You've got a few things around DNS that you have to manage there, but again super high reliability if you're using the managed service. Then it really does depend on what you use on the authentication side. This is probably the weak link, but hopefully the lift isn't requesting a new token every time, because that auth service is then going to get hit like crazy. You want the tokens returned from the auth to be valid for a particular time, and then obviously there's the trust relationships between the UI, your API Gateway, and your authentication service.
So performance. Much the same story actually. We've got the API Gateway which is good. We do have cold starts on the Lambdas, so if we do have to spin up a new Lambda we will incur let's say a two second penalty. That's really extreme, but realistically you're more going to get like a couple hundred milliseconds for it to start up. It's pretty good at starting up quickly. Once it's running, it will keep it around for say five minutes and just reuse that same instance of the Lambda. So we know we can scale pretty well here.
Now there is one scalability point that is often missed when people use Lambdas in interviews, and this really only comes from experience in learning how Lambdas work and how to use them. It highly depends on how you deploy them as well.
The first thing is the account limit. If this is in one account, by default I think you get a thousand units of concurrency or something like that, which basically means you can only run a thousand Lambda invocations concurrently at any one time. Now I'd rather hope with this architecture we wouldn't have a thousand reports at once, and our processing with DynamoDB should be in the region of 20 to 30 milliseconds. So even if they did all come in within the same second, we're not going to be waiting too long for DynamoDB to respond before the next invocation can come in. So I'd be surprised if we hit that.
Now the one thing I did say is we're inside of a VPC. Inside of the VPC, you have to lay out your network correctly. So we're taking a look at subnets here. The subnet design will affect how much you can scale because there will be a fixed limit. If you design your subnets to have say 256 addresses in there, you instantly lose some of them from whatever subnet you design. Classically that's two: the network address, which is always the very first IP address in the range, and the broadcast address at the end. In an AWS VPC it's actually five per subnet, because AWS also reserves addresses for the VPC router, DNS, and future use. So those are non-addressable. Then the Lambdas are free to use the remaining roughly 250 addresses inside of there, and assuming you've got a multi-availability-zone design, so each one of these Lambdas is deployed across all of the availability zones, you would need three subnets minimum. Then you should be fine. That gives you 700-something hosts you can play around with, which is a good number. It's still less than the account limit, and the account limit is a soft limit, you can go past that with a support request to AWS. But then you've got the subnet limit as well. So as long as you're aware of those limits, you can actually be fine with this kind of design.
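The subnet arithmetic is easy to sanity-check with Python's `ipaddress` module. The sketch below shows both the classic two-reserved-addresses maths and the five addresses AWS actually reserves per VPC subnet, using an example /24 CIDR of my own choosing.

```python
import ipaddress

subnet = ipaddress.ip_network("10.0.1.0/24")  # hypothetical example range
total = subnet.num_addresses                  # 256 addresses in a /24

classic_usable = total - 2  # classic networking: network + broadcast reserved
aws_usable = total - 5      # AWS VPC: network, router, DNS, reserved, broadcast

azs = 3                               # one subnet per availability zone
max_vpc_enis = aws_usable * azs       # rough ceiling on concurrent ENIs
```

So three /24 subnets still land in the "700-something" range mentioned above, just a little lower than the classic maths suggests. (Newer Lambda VPC networking shares ENIs across invocations, so this limit bites less than it used to, but the subnet-sizing habit is still a good one.)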
One of the things you can do as well is on the Lambdas, say "I only want a maximum of this amount running at any one time." So if you had a bunch of other Lambdas doing other stuff, you can limit all of these processing ones to say, I don't know, "I only want 250 instances of this Lambda running at any time." That will be really good. You should always think about these kind of limits as well because it affects one of the hidden requirements which I'll go through in a second.
So let's go back. We're still looking at performance. Like I mentioned as well, DynamoDB is super scalable, so long as you've got your partition right. I mentioned earlier that each lift could probably be written into a partition and then we just expire the documents after a certain amount of time. That would work really well for this design. It will be super scalable. We're not gonna really have to worry about any hot partitions or anything like that.
If we changed the design to say each model of lift went into a single partition, so all of the 2020 models went into a single partition, then that's where we would probably start hitting hot partitions, because it's just too much data going into one partition. So if you've got a domain where you've got nicely segmented flows, like bank accounts, lifts, something like that, leverage that on your underlying data store like DynamoDB and you won't have too many scaling problems because it's all partitioned nicely.
Data storage we've kind of covered off as well, with DynamoDB and S3 on the back end. Saving stuff for years in S3 will cost you like $50 or something, depending on how much data you put in there. So not really worried about that cost-wise. Lambda, not worried about that cost-wise. DynamoDB depends on how you do it, depends on the exact scenario, but again it's not going to be massively expensive. Nor is the Kinesis stuff, really.
The bits that are going to cost you in this infrastructure will be your Athena queries, depending on how you do those for ad hoc querying, or your Elasticsearch if you use that as well. That could also be expensive to run depending on how many users and what model you pick. For your authentication, if it's Cognito, then again it's tiered-based pricing and it can get quite expensive at scale. But by the time you've reached that scale, you've probably got some decent monetization coming in and you can pay for the whole entire architecture anyway.
So that is typically how far people get in the 20 or 30 minutes you have for these architecture review sessions in, say, an hour-and-a-half interview with a company. This is where I would expect most engineering leads and most senior developers to stop in their design.
But as I mentioned towards the beginning of this video, there are always hidden requirements. A couple that I would be looking out for here: a common question would be "how do you deal with bursts of traffic?" Well, we've answered that in our design already. So as an interviewer, I probably wouldn't ask "how are you going to scale Lambdas and DynamoDB?" because I already know the answer from your design and the explanation that you've given. It's horizontally scalable, so long as you get your design right. We'll generally be fine handling a burst of traffic, and it's not really applicable to this scenario anyway, because everything's reporting in at a constant rate.
The next one I would be asking here is idempotency. In case of a power cut, what happens with the lift data? Obviously the scenario has given us no indication of lift design or anything like that, so I would assume that a portion of data would be kept on the lift in the case of communication failure or power cut. Then when the connectivity is reestablished, that data is just continually sent until it's caught up.
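That store-and-forward assumption can be sketched as a small buffer on the lift. This is entirely hypothetical, since the scenario says nothing about the lift's firmware: readings get a monotonically increasing sequence number, are queued locally while offline, and are replayed in order once connectivity returns.

```python
from collections import deque

class LiftBuffer:
    """Hypothetical on-lift store-and-forward buffer. Readings are queued
    with an increasing sequence number and replayed in order once the
    connection comes back; unacknowledged readings stay queued."""

    def __init__(self):
        self._seq = 0
        self._pending = deque()

    def record(self, payload: dict) -> None:
        self._seq += 1
        self._pending.append({"seq": self._seq, **payload})

    def flush(self, send) -> int:
        """Replay pending readings; stop at the first failed send."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break  # still offline; retry later with the same seq numbers
            self._pending.popleft()
            sent += 1
        return sent

# Record three readings while offline, then flush over a link that
# drops again after two sends.
buf = LiftBuffer()
for rpm in (3, 5, 7):
    buf.record({"rpm": rpm})

delivered = []
def flaky_send(reading):
    if len(delivered) == 2:
        return False
    delivered.append(reading["seq"])
    return True

print(buf.flush(flaky_send))  # sends 2, keeps the third queued
```

The sequence number is what makes idempotency possible downstream: resent readings arrive with the same sequence number, so the server can detect duplicates.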
So you might have a scenario where something fails in here, because things fail at times. Something fails and the lift resends the data. So how would we deal with idempotency?
Traditionally what I would see happen is people would come right in here and say "all right, I'm now going to introduce another service, that's our Redis cache." And then "I'm just going to talk to this guy as well, check for idempotency, or write to DynamoDB and then come back." Once they've written to DynamoDB, they're then going to write back to the Redis cache to say that it's all done and dusted. That design works okay, but if you write the data to DynamoDB and then can't write back to Redis, you've still got the same problem: how can I ensure that I haven't written the record more than once?
With DynamoDB involved in this flow, there's something a little easier we can do. How I would answer this question: "Okay, you want idempotency in there, that's fine. So what we'll do is, as we're writing the data to DynamoDB, we can form some kind of unique key based on the contents of the data that's coming in, whether it's a sequence number that's come from the lift. Then as it's being written to that partition, we can check that partition and basically put a conditional expression on the put of the data to say 'where this key doesn't exist.' If it does exist, DynamoDB will reject it, won't do anything with it, and then we can handle that conflict essentially and return back saying everything was all good."
So that's how we'd handle that one. Another common one is triggering of events. For example, say every time a lift started up, it generated some kind of event that went to the API Gateway, and we needed to store that event. Traditionally what I would see happening here is people would write to DynamoDB and then go "okay, SNS, let's write to SNS which will then go to SQS" and so on. I wouldn't do this personally, because we've got stream functionality in here. I could have a consumer coming off the back of the stream that goes to a Lambda function and then to SNS, and then you can hang all your processing off the back of that.
This kind of pattern of driving things off the data change feed is really powerful because you can do all sorts of stuff that you wouldn't necessarily be able to do in a performant manner inside the API request. So we look back at what I originally said: as the API request comes in, we need to do some basic validation and parsing, and then it's essentially creating a DynamoDB client, writing to DynamoDB, responding that we've written the data. All of the value and business logic is driven off of a change data capture feed, which you might say is a single point of failure, but it's all managed by AWS with guarantees and all that kind of stuff. So we can have a high degree of confidence that this is a lot safer than me trying to publish an event in here, trying to manage idempotency in here. It's safer because instead of one network operation, I would have had three in the other design, and the more network operations that you introduce, the better chance you have of failure.
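To make the change-data-capture pattern concrete, here is a sketch of the Lambda that would sit behind the DynamoDB stream. The record shape follows the DynamoDB Streams event format; the key names and the `publish` callable (standing in for `sns.publish(TopicArn=..., Message=...)`) are illustrative so the sketch runs without AWS.

```python
import json

def handler(event, publish=print):
    """Sketch of a DynamoDB Streams consumer. For each newly inserted
    reading it publishes a message; `publish` stands in for an SNS
    client call so the example is self-contained."""
    published = 0
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY/REMOVE entries in the change feed
        new_image = record["dynamodb"]["NewImage"]
        message = {
            "lift_id": new_image["pk"]["S"].removeprefix("LIFT#"),
            "seq": new_image["sk"]["S"],
        }
        publish(json.dumps(message))
        published += 1
    return {"published": published}

# Feed it a sample stream event: one insert, one modify.
sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"NewImage": {"pk": {"S": "LIFT#L-001"},
                                   "sk": {"S": "SEQ#000000000042"}}}},
        {"eventName": "MODIFY",
         "dynamodb": {"NewImage": {"pk": {"S": "LIFT#L-001"},
                                   "sk": {"S": "SEQ#000000000042"}}}},
    ]
}
messages = []
result = handler(sample_event, publish=messages.append)
print(result)  # {'published': 1}
```

Note how the API-facing write path stays minimal (validate, put, respond) and all the event fan-out lives here, off the stream, exactly as argued above.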
So we've covered off idempotency, burst of requests wouldn't really apply, publishing an event. For this scenario, that's probably most of the curve balls that you would get.
So with that in mind, what I'm going to do now is show you how I would transform this into a multi-region design and extend the design to go further than what I would traditionally see in the interview.
First thing to think about in multi-region design, apart from CI/CD concerns and all that kind of stuff, is what technologies work well over multiple regions. That's why I've picked some of these technologies on purpose.
So if this is going to be my EU-1 region because that's where we're all based (as in Global Lifts, we're based in the European region), then I would make this what I call my processing region. Then I would have satellite regions as well. So we're going to call this AS-1 for Asia, and we might have a separate one for the US as well, and we might have multiple ones of these.
If I look at the components that I would need in each region, and I'm just going to do it in the Asia region to save me writing it all out: I know that I will need an API Gateway, and I know that I will need potentially some kind of authentication mechanism that matches the one here. Depending on how we do the authentication, there are a lot of places you can put this. If the authentication was still in the European region, it's not the end of the world, bearing in mind what we said earlier. So I'll take this out for now just for completeness. If the lift only needs to re-authenticate every 30 to 50 minutes, it's not going to hit the authentication system that often. Introducing 300 milliseconds of latency on a call made once every 30 minutes is nothing; we're fine with that.
Going from there, we've got the API Gateway which means we're going to need the Lambdas sitting behind, and then we need somewhere to store and retrieve that data, and that's where we'd have DynamoDB again.
In terms of the data and components we need, that's pretty much it, at least from the requirements that we have. If I've got to this point in the design, I've gone way past our requirements, so I'm just going to make a load of assumptions: our staff members are only ever in the EU, and the latency doesn't matter too much if they're using a website served from the EU. So I would assume that for this design.
The key portion is though, how do we make this data pipeline continue working? This is where I would need to go through and test something, but my hypothesis would be: if I made this DynamoDB table global, so I could write to it from any region and just enable it in each region, the data would get replicated across the globe using AWS's backbone. So I could write into the Asia region and have it replicated across to the European region, and then in theory it would come out the back of the data pipeline and everything would work as normal. There'll be some latency in there, but in theory everything works. I would need to go away and test this. This is not something that I know will work for a fact, but it's where I would be saying we would spike this out and check to see what happens.
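In the spirit of that spike, enabling a replica on a global table boils down to one `update_table` call. This is a sketch of the request you'd pass to boto3's DynamoDB client under the 2019.11.21 global tables version; as with everything here, verify it against the current AWS documentation before relying on it.

```python
# Sketch: kwargs for boto3 dynamodb client.update_table to add a
# global-table replica region. Table and region names are illustrative.

def add_replica_request(table_name: str, region: str) -> dict:
    return {
        "TableName": table_name,
        "ReplicaUpdates": [
            # Create a replica; AWS then replicates writes from any
            # region to all others over its backbone.
            {"Create": {"RegionName": region}},
        ],
    }

replica_req = add_replica_request("lift-readings", "ap-southeast-1")
print(replica_req["ReplicaUpdates"][0]["Create"]["RegionName"])
```

The hypothesis to test would then be whether the European region's stream still emits the replicated writes, so the existing data pipeline keeps working unchanged.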
Which means these satellite regions are actually really small. They're just there for performance for the lifts, so the lifts can get their data to the nearest region and we're not relying on a central region to collect the data. We serve the operational side from the central region, but we don't depend on it for data collection.
Even running these three regions, we could still be a significant distance from where the lifts actually are. So one way to work around this is to use something called Global Accelerator. This is AWS Global Accelerator, and essentially what this service does is: you've got your user here, and they're in South Africa, right? South Africa is not anywhere near the US, it's not anywhere near Europe. Closest would be Asia, but that's still a significant distance.
AWS have these points of presence around the globe, and what you can do with Global Accelerator is have a single IP address represented globally but have your traffic routed to whichever region is closest. So when the South African user makes a request and the DNS resolves, they'll hit the Global Accelerator and then travel over the AWS backbone to whichever region is currently serving. They might go to the US region, they might go over to the European region, or they might go over to the Asian region. Inside Global Accelerator, you define which regions you're processing in and which endpoints sit behind them. I would need to double check that Global Accelerator and API Gateway connectivity works okay, the same caveat as with authentication, but I believe it does.
So once we've got all those health checks and stuff set up, if the US region just goes completely off the map, this link will go and all the US traffic, because they're coming in through the same entry point, will now be diverted over to either the EU or Asia. This gives you a really nice traffic shifting pattern. Not one that you necessarily control too much, but if you have failover, you just report failure out of the US region and AWS will take care of moving the traffic across to the European and Asian regions.
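As a rough sketch of how that per-region configuration might look, here are the kwargs you could pass to the Global Accelerator `create_endpoint_group` API for one region. The ARNs are placeholders and the parameter set should be checked against the API reference, but `TrafficDialPercentage` is the knob that lets you manually shift a region's share of traffic, while the health-check settings drive the automatic failover described above.

```python
# Sketch: kwargs for boto3 globalaccelerator client.create_endpoint_group.
# ARNs are placeholders; verify parameter names against the API reference.

def endpoint_group_request(listener_arn: str, region: str,
                           endpoint_arns, dial: float = 100.0) -> dict:
    return {
        "ListenerArn": listener_arn,
        "EndpointGroupRegion": region,
        # Each endpoint (e.g. an ALB in front of the regional stack)
        # gets a weight for traffic splitting within the region.
        "EndpointConfigurations": [
            {"EndpointId": arn, "Weight": 128} for arn in endpoint_arns
        ],
        # Dial this down to 0 to drain a region manually; health-check
        # failures shift traffic away automatically.
        "TrafficDialPercentage": dial,
        "HealthCheckIntervalSeconds": 30,
        "ThresholdCount": 3,
    }

group = endpoint_group_request(
    "arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/EXAMPLE",
    "ap-southeast-1",
    ["arn:aws:elasticloadbalancing:ap-southeast-1:123456789012:loadbalancer/EXAMPLE"],
)
print(group["TrafficDialPercentage"])
```

You would create one endpoint group per processing region, and if one region's health checks fail, traffic is rerouted to the remaining groups without any DNS change on the client side.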
A really cool product, and it works off of some routing magic called anycast IPs. Basically, each point of presence around the world advertises that it has a specific IP address using something called Border Gateway Protocol (BGP). So when the South African user goes to route to it, the request will pass through their routers, determine that the closest point of presence is in Asia, and route there, then onward over the backbone.
Like I said, really cool. Definitely worth a look at. Pretty much anywhere where you've got something that you need to connect customers around the world, you can use Global Accelerator with a single region or multiple regions. 100% worth putting in front of your services. If you don't have any customer traffic, so to speak, then Global Accelerator is not really going to help you because it's basically a latency reducer with some region management stuff built in.
The other thing that we would definitely need to take a look at is having a WAF layer, or Web Application Firewall. I'm just gonna put that on top here. For that we'd use AWS WAF, paired with AWS Shield, so we're using a couple of AWS products. Shield has a couple of different tiers, and together they essentially act like a firewall in front of all of your services and help deflect DDoS attacks and all that kind of stuff.
So we've got more of the performance covered, we've got more of the latency stuff covered, we have got the multi-region scaling covered. As I said earlier, there are so many different ways that you could do this. I could have done something very similar, like instead of having everything backed by DynamoDB, I could have had everything backed by Redis and then used pub/sub. But you have to think about the requirements that you've got, the use cases that you've got, and what solves the problem best. I think in an interview style setting, this is probably the best I can get.
I actually have no idea how long this video is because I'm doing this whole thing live. I always imagine I would have gone a little bit over time in the interview doing this as well, but that's fine. I think they would probably stop me at points and I wouldn't have all the explaining to do. I would probably just draw it all up and then explain afterwards.
Hopefully you enjoyed this kind of walkthrough video. I've got another one planned that I want to do. If you have scenarios that you would like covered, to see how I would go through them in an interview-ish type setting, then please just drop me a comment and we can have a chat about the scenario. I'll flesh it out and do a similar style video on it.
If you have any questions or comments, please put them in the comments below. I read every single comment. Hopefully you've learned a lot. Anything that I've said in this video, please go away and double check the AWS documentation, purely because I am doing this from memory. Memories are fallible, I could easily make a mistake about something I've said in here. So please go and double check the documentation on the bits that you are interested in.
Hopefully you've learned one or two new tricks about how to do stuff in architecture.
If you enjoyed this video, consider subscribing to the YouTube channel for more content like this.