~/codewithstu

Live AWS System Design: Payments Gateway

Transcript

Good evening, good morning, good afternoon, wherever you are in the world. If you're new here, my name is Stu. Today I'm going to be going through a live system design of a payments gateway. You'll typically get this kind of challenge when you interview at a fintech or somewhere similar: they'll ask a very simple-sounding big question — "design us a payments solution" — and that solution will usually have a couple of criteria attached to it. I'm going to show you how you can go through and answer it. Let me share my screen and get our trusty whiteboard up. (Bear with me while I reboot the graphics tablet — the pen has decided not to work today. Great start.)

So, the two requirements generally are: one, accept payments; and two, show the result to the user.

Typically, when somebody approaches this challenge, they think it's a lot simpler than it usually is. What they generally end up doing is: an API that holds most of the logic, maybe something sitting behind it to process payments, and that writes to some form of database — usually something like SQL Server. Then there's a UI over here that talks to the API and, if you're lucky, some kind of auth server. I've been on the receiving end of a lot of these architectures, and this is pretty much what comes up all the time. It's very, very simple, and there's not really much for you to talk about with the interviewer to show your knowledge. People tend to spend their time on the UI — "I'm going to build it with Angular, or React, or Vue" — and on what the user sees, but all the complexity lives in the API and what's behind it.

With a design like this, the first thing we're looking at is scalability. Which kind of fintech you're interviewing at determines the scalability requirements: a stock-trading fintech will have much higher requirements than a normal payments processing company. Take Mastercard, one of the biggest players in the world: it processes roughly 16,000 payments a second, the last time I looked at their data. If Mastercard — a global, worldwide entity — is only processing 16,000 a second (which, to be fair, is a lot of payments), then call 10% of that the maximum you'll ever need. So we're only talking about a thousand a second, to be safe, and that would be a very, very large payments company. In reality you're probably talking a tenth of that — most payments companies would be lucky to see 100 a second. It obviously depends on what the company does and how they do business, but we're not talking millions per second at any scale. So scalability is one aspect the interviewer will be looking at, and if they don't explicitly call it out, you can bet your bottom dollar they'll ask about it during the questions — if they don't put it in the requirements, they love to throw in curveballs like that. Keep that in mind as you go through the design.

The second part is isolation — and I'm going to call it isolation and security, because here they're kind of one and the same thing. Where do we have encryption? How do we make sure one person's data doesn't cross over into another person's data? All that kind of stuff.

The third major thing depends on whether you have payments knowledge going into the interview, but every payments gateway test I've seen or run — whether I was the candidate or the interviewer — has had something about fraud. The thing to note here is that a fraud process generally has a manual step in it: an automated system flags a transaction for review, and a human has to look at it and make a risk-based decision — "yes, I approve this payment to go through" or "no, I don't". This space is getting a lot better, and it's where AI can come into play as well; that's another curveball we'll get into.

With that in mind, let's start with our API. I'm going to leave the front end and the auth side of things alone unless people on the stream want to cover them, because all the interesting stuff is behind the API. (And yes — we can cover refunds as well.) One of the first things we need to do is talk to the interviewers and get out of them: is this an asynchronous system, or a synchronous system? Generally speaking, payments are asynchronous, because of the fraud aspect I mentioned earlier. So one of the first things we'll likely want to do when accepting payments at our API is buffer them somewhere — on a queue-based system, or just somewhere safe where we can track the payment.

For example, one approach is to publish it to SNS. This serves two purposes: as the API request comes in, we do our standard validation with whatever validation logic there is, and we publish it to SNS; from there we can fan out in multiple directions should we want to. One of those directions, assuming we have an SQS queue subscribed, might be to write it to a database that the API talks to; another, on a different SQS queue, might be the processing path — that's where it talks to Mastercard, for example — with all the updates going back to the database. It's not how I would answer it, but it's one approach you could use.

Now let's look at the bits and pieces I personally would do. For the rest of this I'm going to assume we're building an asynchronous system, because it's very rare that you actually need a synchronous one — and the two are completely different things, even though they sound similar. If I were designing this, the first thing I'd want to do is track and store the payment, and to do that I would use an actor system. Very simply put, an actor system is a container that runs across multiple servers, letting you run instances of objects — actors — across all of those servers, and it handles all the communication and all that lovely goodness in between. The most common frameworks in .NET are Akka.NET, Orleans, and Proto.Actor, I think it's called. They have a very simple concept: the servers underneath are clustered, and the framework provides a mesh-like layer over the top to manage each individual actor. Every time a payment comes through to the actor system, we create a brand-new actor, and that lets us do a couple of different things in the background — one of which is persistence straight off the bat.
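As a rough sketch of that fan-out shape — with made-up names, and plain in-memory lists standing in for the real SNS topic and SQS queues — the validate-then-publish flow looks something like this:

```python
class Topic:
    """Minimal stand-in for an SNS topic: fan out to every subscribed queue."""
    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        # SNS delivers a copy of the message to every subscription
        for q in self.queues:
            q.append(dict(message))

def handle_payment_request(topic, payment):
    # standard validation before anything is accepted into the system
    if payment.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")
    topic.publish(payment)
    return {"status": 202}  # accepted; processed asynchronously

# wiring: one queue feeds the database writer, one feeds the processor
db_writer_queue, processor_queue = [], []
payments_topic = Topic()
payments_topic.subscribe(db_writer_queue)
payments_topic.subscribe(processor_queue)

resp = handle_payment_request(payments_topic, {"id": "pay-1", "amount": 100})
```

The point of the shape is that the API only validates and publishes; everything downstream consumes its own copy of the event independently.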
For example, in the Microsoft Orleans series on my channel I show you how to back your actors with something like DynamoDB. So: actor system, backed by DynamoDB. Why is that a great system to use? Well, one thing we didn't cover earlier — but which is a good way to show your knowledge and understanding — is that we want some kind of notification to our end users. We can quite easily build that on something like Lambda or Kinesis: the change data capture feed comes out of the back of our database and into our notification publishing system, which pushes the result back to the users. If you follow this asynchronous path in your design, this is a major pattern to think about: how do I tell the user whether their payment was successful or not? If all you've given them back is a 201 or a 202, you need some way of telling them later. The one thing we want to avoid is a user constantly asking "hey, where's this payment? where's this payment?" — that loads up our system, and we don't want that. We want to say: once we've accepted your payment, we'll tell you within, call it a minute, whether it succeeded or not. In reality payments usually complete in a couple of seconds — even with all the bells and whistles like fraud checks, it's usually no more than a second or so end to end.

So we've got our actor system, we've written to our database, and we can also use that change capture feed to drive other parts of the system. One of those parts, which I mentioned earlier, is fraud. You don't have to explain too much here — you've basically got two routes, an AI route or a rule-based route, and which you pick depends on the system. The AI route requires a lot of data and a lot of training, so there are potentially some extra databases around here if you go down that route. The rule-based route still has some data, but it's based on rule sets from a third-party data source: more often than not you have to go and talk to the third party's API and say, "can you give me the latest rule set for the United Kingdom and the US?" Everywhere I've seen bases its rules on the UK and the US, even when they operate in other countries, for jurisdictional purposes — if you've captured the United States rule sets, for example, it makes your life easier. So if you go down the rule-based road, expect to be talking to a third party. These are exactly the design decisions you can discuss with the interviewer: "if I've got an AI-based system here, I'm going to need bigger data sets and all that kind of stuff — you're probably going to have a whole team dealing with that aspect of the system."

Assuming the fraud part of our system is sorted, the next thing we want to do with our payment is route it, because it's highly unlikely that any payments company has just a single route out of the system. They're unlikely to have just Mastercard, for example, or just Visa — usually they'll have both Visa and Mastercard, or JPMorgan and Barclays and HSBC, and so forth. They do this for a few reasons. The first is cost optimization: if you've got a couple of banks, you can play them off against each other with your workload — "we'll give you more of our throughput and load if you give us a cheaper rate" — so routing is a very, very big thing. The other is failover: you want your payment systems to stay up and running. So we'll just have a couple of banks down here — bank A and bank B, say — and we need to decide where to send each payment.

One question that comes up if you do start drawing this route out is: okay, how do you manage all of these little individual steps — we've got one here, one here, one here? Managing that is actually a little trickier than you'd think. Instinct says "I can just use an SQS queue here" — and you can, but it has to be FIFO for the same user. The reason is that if someone sends two payments one after the other, possibly in quick succession, one could be a big amount and one a small amount — so bear in mind that order matters when you're dealing with payments within the context of one account. A safer option for this kind of system is something like Kinesis or Kafka, where you know you get high throughput, nice partitioning between different accounts and different payments, good read and write performance — but most importantly, ordering is preserved as you go, because they're append-only logs. Once you've come out the back of the bank it matters less; you're into volume then, the payment has already happened, and you just need to make sure it's recorded properly. But do bear in mind that for whatever you put in between the systems, ordering generally matters.

After you've done your routing, we need to store the result again — essentially "process results" — and that should go back to our actor system to complete the loop. You could argue that at every single point there should be another path back that tells the system what happened: we've passed fraud, so go and say "I've passed the fraud check"; we've done a payment, okay, cool — go and tell the actor system the payment has been processed.
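The per-account ordering idea can be sketched like this — a hypothetical `shard_for` helper standing in for the partition-key hashing Kinesis or Kafka would do for you; shard count and names are invented:

```python
import hashlib

NUM_SHARDS = 8

def shard_for(account_id: str) -> int:
    """Deterministic partition choice: every payment for the same account
    lands on the same shard, so per-account ordering is preserved."""
    digest = hashlib.sha256(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

shards = [[] for _ in range(NUM_SHARDS)]

def enqueue(account_id, payment):
    shards[shard_for(account_id)].append(payment)

# two payments for the same account, sent back to back:
# a big one followed by a small one — they must stay in order
enqueue("acct-42", {"id": "pay-1", "amount": 5000})
enqueue("acct-42", {"id": "pay-2", "amount": 5})
```

Different accounts can land on different shards and proceed in parallel; within one account, everything funnels through one shard in arrival order.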
So that's all good. Now, I mentioned scalability earlier — how do we scale this system? The first part is obviously the API level: we just run multiple instances of it, and generally speaking, so long as our actor system is up to date, we're good to go. Next: how do we scale the actor system? It depends a little on which technology you use, but generally you do exactly the same as with the API — you just add more instances, and the actors self-balance across the cluster. Whichever instance an actor spins up on, it stays there until it shuts down, and it shuts down usually due to things like memory pressure. But these are only short-lived processes and objects: hopefully everything is done within a couple of seconds, and once you know the payment is processed, there's no reason to keep the actor in memory, because you've got that notification system for customers. So you can just say: okay, ten seconds after we send the webhook, say goodbye.

There's actually a really good question here: for the API and actor system, would this be something like Kubernetes or ECS? I think I've said this before on previous streams: whatever technology you pick, you have to have a great reason for using it. Generally, everything I've shown here can run on bare metal, on ECS, or on Kubernetes. There are a lot of times when people think they need Kubernetes and they don't — they're just running simple APIs, they're not using any of the advanced features, they're not really using namespaces or anything like that; they don't need it, but they put it in anyway. It's almost like CV padding. So for me, to keep it simple for the interview, this would all be ECS Fargate. You could equally run it on ECS with EC2 if you wanted to, and there'd be no problem with that — you'd just get asked the question, "why not Fargate?" To me the solution doesn't warrant Kubernetes: you're not going to be using any advanced features, and while you could argue deployments are a little easier with things like Flux and Argo, it's generally not needed. As I said, explain your reasons. So why Fargate? Primarily because I don't have to manage the servers, and that's a big win in terms of maintenance — especially when you work in financial industries, where keeping your servers and software up to date is expected by regulators and auditors, and doing that across hundreds and hundreds of instances is a bit of a pain in the ass, if I'm honest. So yeah, Fargate all the way for me. Also, just don't overthink the architecture: if you're thinking "do I need Kubernetes?" or "do I need Kafka?", the answer is probably not. Keep it simple and you'll save yourself a lot of pain and headaches.

Actor systems — I actually really love them. Any time you can split payloads, or any time you're partitioning work into short-lived objects and the like, they're absolutely awesome; I'll go through one big feature in a minute that really helps this kind of architecture out. For notifications, the technology here would probably be Lambda, or something akin to Lambda — something super scalable that can handle it. Behind it we can use something like SQS; we don't need anything complicated like Kinesis or Kafka here, because if we do happen to retry a notification or they arrive slightly out of order, it's not the end of the world. So a Lambda-plus-SQS combination there, and standard ECS stuff around the back.

This is all well and good while the system is up and running and we've got some nice queueing features in place — but how do we keep track of a payment in the system? We said earlier that we can have things like notifications coming out and reporting back to the actor system, but take an example: your routing layer goes down completely. It's gone. How do you know, and how do you track this? Hopefully we'd have some monitoring and alerting in place — a great default answer, if you want one, for "how do we observe systems?" is OpenTelemetry, or OTel: you put collectors everywhere, everything writes to the collectors, and they ship to whatever provider. That's a solid standard answer. But if your payment never comes out of routing, how do you know? It might have been processed; it might have failed silently — and a silent failure is a really, really bad thing to have happen in a payments system.

Inside actor systems, actors generally have a lifecycle. Imagine our little actor: there's a process called activation, where it reads its state from an external store like a database; then it goes into running; and then it deactivates, which is often where some persistence happens as well. Once the actor is running, there's a really cool trick a lot of actor frameworks have, which is the concept of a reminder. Upon activation the actor says, "give me a nudge — call this method in five minutes, ten minutes, or whatever" — and then that code executes. If the actor framework is worth its salt, the reminder is persisted, and it will even self-activate: you don't have to worry about whether the actor is alive or running, because the framework takes care of activating it (I keep calling it a grain because I've used Orleans so much) and keeping it running, and it can run any user code we want. So we can use this reminder system to build ourselves a watchdog that checks the actor's internal state.
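A minimal sketch of that watchdog idea — invented names and plain datetimes instead of a real actor framework's persisted reminder service:

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=5)  # illustrative threshold

class PaymentActor:
    def __init__(self, payment_id, now):
        self.payment_id = payment_id
        self.state = "created"
        self.last_update = now  # persisted alongside the state

    def on_update(self, new_state, now):
        self.state = new_state
        self.last_update = now

    def watchdog_check(self, now):
        """Reminder callback: flag the payment if it hasn't completed
        or been touched within the expected window."""
        if self.state != "processed" and now - self.last_update > STUCK_AFTER:
            return f"payment {self.payment_id} stuck in '{self.state}'"
        return None

t0 = datetime(2024, 1, 1, 12, 0)
actor = PaymentActor("pay-1", t0)
actor.on_update("sent", t0 + timedelta(seconds=30))
# the reminder fires later; the payment never reached "processed"
alert = actor.watchdog_check(t0 + timedelta(minutes=10))
```

In a real framework the reminder itself is persisted and will re-activate the actor to run this check, which is exactly what makes it survive a crashed routing layer.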
The watchdog says: "I've not seen this payment, or an update to it, in the last however many minutes — this payment should have completed five minutes ago, why hasn't it?" You can then rebuild your state and go, "ah, it got to routing — okay, let's take a look at everything — oh wait, routing's offline." You can also use it to re-trigger messages and things like that. Having that timer/reminder system in the actor frameworks is really cool, and it's one of the reasons we put the actor system right at the start: at the earliest possible point, the first thing we do in this actor is persist the payment, the next thing is set up a reminder so we can keep track of it, and after that we return. If we're using something like DynamoDB, all the secondary stuff — fraud and so on — comes off the back of it via the change feed. Obviously you have to validate your payments and all that kind of stuff throughout, but that's why we're using the actor system: we persist it, we set up the reminder to do our watchdog, and the job's a good one. If you have any questions about why I'm doing something, or why I'm going down a certain route, please do let me know.

So — I mentioned refunds earlier. Refunds come in a few different shapes and sizes: a refund can be triggered by a user, or it can be triggered by a bank. (I'm going to need a bit more space up here — actually, no, I'll just redraw sections down there.) A refund is just another state in the payment lifecycle. Imagine our payment lifecycle — this is very simplistic — is created, then sent, then processed; usually your flow stops there, until you reach the end state of refunded. So we'll have to double-check against the actor system itself: what is the status of the payment, and can we actually process this refund?
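That status check is easy to sketch as a tiny state machine. The states and the single transition into refunded follow the simplified lifecycle above; the class and method names are made up:

```python
# allowed transitions in the simplified payment lifecycle
TRANSITIONS = {
    "created":   {"sent"},
    "sent":      {"processed"},
    "processed": {"refunded"},   # refund is just another state at the end
    "refunded":  set(),
}

class Payment:
    def __init__(self, payment_id):
        self.payment_id = payment_id
        self.state = "created"

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state

    def can_refund(self):
        # only a fully processed payment is eligible for a refund
        return "refunded" in TRANSITIONS[self.state]

p = Payment("pay-1")
p.advance("sent")
p.advance("processed")
eligible = p.can_refund()
```

Centralising the transition table is what lets every refund source — user, bank, or customer services — share one piece of validation logic.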
We may get situations where something comes out of the bank and goes straight into our refund process, and we just have to process it — they may have rejected the payment for various reasons, like you got the account details wrong and the account doesn't exist any more. The user can also request refunds — there's a whole world of detail there that would take us too deep into payments — but basically, if it's a user-triggered action, it comes through our API, either directly or through a UI like a mobile banking app or web app. If it's a bank-initiated refund, it comes from the banks, and we process it in exactly the same way — we just change where the source came from. And the third way it can come through: alongside our customer, we usually have a customer services department, and they can also trigger a refund, which is lovely. So we've got three different ways a refund can occur, and we need one central place to manage it, because the logic is generally the same: check the payment, validate that we're in the right state to be able to refund, and figure out where we're going to refund to. In some cases you may be making a payment back to something like a card. Say you're a betting company: the user hits your API and deposits a load of money with you, then decides they need a refund for whatever reason; you process that refund, but you've still got to give them the money back, so you potentially need to go back through your API to trigger a new payment. The reason you trigger a new payment — rather than keep appending states onto the end of the original — is that you need to capture this whole process again. Is there anything else you wanted to know about refunds? Let me go and clean up some of this.

Let me answer this question quickly: would the refund be applied to the same actor, or is it better to create a new one? For me it would be a new actor, because you're tracking the state in a different way — that said, I would link the two payments together. Imagine you have a JSON document with an id of, I don't know, "payment123"; you might also have a refundId of "refund123". That way you know whether or not you've got a refund associated with that payment, but you still keep all the benefits of how a refund payment works. In some cases — it's very specific to which fintech you go into — you essentially just give the money back to the customer on their balance, and that's it.

Which is actually a very good point, and this is a bit of a curveball: how do you maintain balances in a system as asynchronous as this? I'm going to point you over to the lovely actor system again, because you can manage not only payments in it, but accounts as well. This is because actor systems generally have a hierarchical nature: you have what's known as a supervisor at the top, and the supervisor is basically a way of keeping track of all of its children. The root actor — one you don't generally control, by the way; it's usually system-controlled — looks after your actors, and your actors may have children of their own. So the way we can structure our actor system is: accounts at one level, and payments below them. When you want to request a payment, you request it through the account — that way you can maintain the balance inside the account actor and track the payment underneath it. Which brings us back to the scalability issue that, in my experience, will always get thrown in somewhere: okay, but how does this actually perform?
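A sketch of that hierarchy — a parent account object owning the balance and its payment children. In a real actor framework the supervisor and activation plumbing would be handled for you; these class names are invented:

```python
class PaymentActor:
    def __init__(self, payment_id, amount):
        self.payment_id = payment_id
        self.amount = amount
        self.state = "created"

class AccountActor:
    """Parent actor: owns the balance and supervises its payment children,
    so balance checks and payment creation happen in one place."""
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.payments = {}

    def request_payment(self, payment_id, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount          # reserve the funds up front
        child = PaymentActor(payment_id, amount)
        self.payments[payment_id] = child
        return child

acct = AccountActor("acct-1", balance=100)
p = acct.request_payment("pay-1", 40)
```

Because every payment for an account flows through that one account actor, the balance can never be checked and debited concurrently from two places.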
I've personally built a system on Microsoft Orleans where every single change I made to the state was persisted for durability, and I was getting pretty decent performance — around 90 operations a second per actor with persistence on every change, and without persistence more like four or five thousand a second for an individual actor. Given the linear scalability on top of that, it's excellent performance. And as I said right at the start of the live stream, it's very rare that a payments system goes over a couple of hundred payments a second — a thousand, maybe two thousand, is the worst case — and above that you'll be fine. And that's total system load, not load on one account: the most you're going to see on a single account is three or four transactions an hour, so when you split the work out like this, each actor's load is very light and performance is not a problem. If you didn't have an actor system and instead tried to use something like SQL Server — all very stringent and fixed, with no real partitioning — then yes, there would be a performance hit, because there's only so much SQL Server can do and you're generally talking to a single node rather than a cluster. So that's how you split it out: account one over here, account two over there, each with a series of payments underneath, and you just keep track of them all.

There are two other bits I want to talk about — I'm going to write them down so I don't forget. In a minute we'll talk about reporting; the other one, which I'll cover first, is deduplication. How do you ensure that you only ever process a payment once? That is a very, very tricky thing, to be honest, because as I mentioned, all of this is usually queue-based behind the scenes in some fashion. For actor systems it's relatively easy, I want to say, but we still have to go through the same process: you need a way of uniquely identifying that payment. (A couple of chat questions have come in — I'll come back to those in a second.) So we need a way of figuring out the unique ID, and there's no really nice way of doing that. The best approach is deduplication at the API level: use some kind of idempotency key that you back with Redis. Assuming the idempotency key hasn't been used, you talk to the actor system and create the payment; once it's been created, you say "okay, we've completely used that idempotency key — it can't be used again." That puts the onus on the caller to generate a new idempotency key for each new payment — they might take a hash of the request's properties, for example, and use that as the key. Generally that's how I've seen it done: the API, with an idempotency key in Redis, in front of the storage system. Once the payment is in the storage system, you have more control over how it gets deduplicated downstream: you might use the idempotency key to generate a unique transaction identifier and then do content-based deduplication on the transports in between — the bits between fraud and reporting and all that kind of stuff — or you can mimic the same Redis setup around the place and deduplicate on that as well. So you've got two options there. One more thing: when you're working in a large system doing large volumes of idempotency keys, you have to figure out how to expire them after a certain amount of time, otherwise they just grow at the same rate as your transactions — if you've been operating for five years at a couple of hundred payments a second, you don't want five years' worth of idempotency keys stored in Redis; the cluster is going to be huge.
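Sketching the idempotency-key idea — a dict with timestamps standing in for Redis SET NX with an expiry; everything here is illustrative:

```python
import hashlib
import time

class IdempotencyStore:
    """Stand-in for Redis SET NX EX: remember each key for `ttl` seconds."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}

    def claim(self, key):
        now = self.clock()
        # drop anything past its expiry so the store can't grow forever
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if key in self.seen:
            return False          # duplicate request — reject it
        self.seen[key] = now
        return True

def idempotency_key(payload: dict) -> str:
    # callers can derive the key from a hash of the request's properties
    canonical = "|".join(f"{k}={payload[k]}" for k in sorted(payload))
    return hashlib.sha256(canonical.encode()).hexdigest()

store = IdempotencyStore(ttl_seconds=3600)
req = {"account": "acct-1", "amount": 100, "ref": "invoice-9"}
first = store.claim(idempotency_key(req))
second = store.claim(idempotency_key(req))
```

The TTL is doing the "how do we expire them" job: old keys fall out automatically instead of accumulating for five years.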
You just need to think about how to expire them. That's a business decision, though — not something you really need to dig into for interviews, I wouldn't say.

Okay, before I get into reporting, let's take a look at the alternative to the actor system. To be honest, if I was going to get rid of it, I would go Lambda-based, and the API on the front would become API Gateway at that point. The only things that really change are that you have to go through and build a lot of the things you would otherwise get for free with an actor system. Think about what we have to do: one, depending on whether the payment is incoming or outgoing, check the balance; two, potentially update the balance — what we'd call a reservation of funds (those first two only apply to outgoing payments; incoming ones you don't have to handle inside the Lambda functions); and three, storage. When you're accepting a payment, you don't need to update the balance straight away — you can do that in batch later on. For example, once you've processed it out the back with your banks, you could put it onto a stream which is read by a Lambda that goes and updates your data store with the payments. And instead of processing each individual payment one by one, you'd use Lambda's ability to do tumbling windows: essentially take a time chunk, batch it up, and process batch by batch. This way you're massively reducing the number of writes down this path. (I said SQS a moment ago — what I'd actually use on this side, going that approach, is something like Kinesis, so that I can partition the payments out; the partitions make the tumbling-window stuff easy.)
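The tumbling-window batching can be sketched like this — pure in-memory grouping, with `WINDOW_SECONDS` and the event shape invented for illustration:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size

def tumbling_window(events):
    """Group (timestamp, account, amount) events into fixed windows,
    then emit one aggregated balance delta per account per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, account, amount in events:
        windows[ts // WINDOW_SECONDS][account] += amount
    return dict(windows)

events = [
    (5,  "acct-1", -100),
    (30, "acct-1", -50),
    (70, "acct-2", -25),
]
batched = tumbling_window(events)
# result: one balance write per account per window,
# instead of one write per individual payment
```

Two payments for acct-1 inside the first minute collapse into a single write of -150 — that collapse is where the write reduction comes from.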
So, I want to answer this one quickly: "lambda has some limitations in terms of memory, and we also have issues regarding cold starts." Yes, they do — there are a couple of workarounds, though. (And LastColonial, I haven't forgotten your question, don't worry.) If we're doing lambda responding to API Gateway in a payments-like system, there are generally predictable flows. Obviously there are bursts when flash sales and that kind of stuff come on, and that's great for lambda's scale-out abilities, but if you're talking about general day-to-day payments, it's usually a predictable workload: there's not much in the middle of the night, a whole bunch of stuff as people wake up, it dies down, goes up again at lunchtime, goes down in the evening, and then kind of hovers. If you know that pattern, you can say: right, I know I can pretty much live at that level constantly and the system is going to be okay and functioning, and then you can use things like provisioned concurrency to keep things warm in the background. If you know your constant minimum, you can provision instances ahead of time — you pay a little bit more for it, but they're guaranteed to be warm all the time; you'll always have a certain number of warm instances. The other thing people generally forget is that lambda's cold start usually doesn't matter. I haven't seen many use cases where people are using lambdas and genuinely care about that latency. If it takes 300 milliseconds to spin up on a payment, that's okay — it's not the fastest in the world, but if only a few of them happen every couple of hours, I'm not personally that worried, so long as I'm meeting my SLAs for the rest. And the reason is, you've got to remember what lambdas are designed for:
they're not designed to be doing loads and loads of business logic. They should be short and sharp, very quick — things like: validate the incoming request, store it, worry about it later, send the response back. So if it takes 300 milliseconds to start up, not great, but once it's processed something the function stays alive afterwards — I think it's up to 15 minutes behind the scenes, though unfortunately you don't have control over how long it stays alive for. With a combination of provisioned concurrency and lambda functions that are super fast and small, you're going to be fine. The memory limitation — yes, I would generally agree, but I'll go back to what I said earlier: these should be short and sharp. If you're allocating gigs of memory in a lambda function and not cleaning it up, then go and rewrite your lambda function, basically, because you shouldn't be using that much memory in one. Obviously you can do long-running processes in lambda, but for a use case like this, I should really only be paying the memory cost of whatever it takes to store the incoming request in my data store. So yeah, I've gone for batching, deduplication, and securing API Gateway. I do want to come back to LastColonial's question here, because I think it's a very interesting one, since we talked earlier about one of the requirements being security and isolation: what do you do to protect yourself in this kind of setup? I'm going to remove our little user for now, so we know what he looks like. Part of the answer does depend on what technology you're using for your load balancing. For reference, the load balancing options in AWS are: application load balancer, network load balancer, classic load balancer, and API Gateway. Classic you shouldn't be using, so let's get rid of that one.
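That "validate, store, respond" shape from a moment ago fits in a handful of lines. A hypothetical sketch — the field names are made up, and an in-memory list stands in for the real data-store write, though the `(event, context)` signature is the usual Python lambda convention:

```python
# Sketch of a "short and sharp" payment-ingest handler: validate the
# request, persist it, respond — deferring all heavy processing.

REQUIRED_FIELDS = ("payment_id", "amount", "currency")

data_store = []  # stand-in for DynamoDB / Kinesis / SQL write

def handler(event, context=None):
    # 1. validate the incoming request
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        return {"statusCode": 400, "body": f"missing: {missing}"}
    if event["amount"] <= 0:
        return {"statusCode": 400, "body": "amount must be positive"}
    # 2. store it — worry about balances, fraud checks, etc. later
    data_store.append(event)
    # 3. send the response back
    return {"statusCode": 202, "body": "accepted"}
```

Everything slow happens downstream of the queue or stream this writes to, which is what keeps the function's duration (and its memory bill) small.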
The network load balancer doesn't give you any kind of routing capabilities, so we can scratch that off the list — it's basically port-level. If you're familiar with the OSI model, it's layer 4 routing that the NLB does, whereas the ALB can do layer 7, and API Gateway inherently has layer-7 routing capabilities built into it. Fortunately for us, both of those choices can sit behind the web application firewall, AWS WAF, which is great because we can define a lot of rules and we get the built-in rules from AWS. Plus, AWS automatically gives you something called Shield. There are two levels of Shield: the free version, which absolutely everyone gets, protects you from things like DDoS attacks and all that kind of stuff; then there's Shield Advanced, which gets you a bit more — I'm not too familiar with it, so I'm not going to go into details I don't know, but there are basically two levels there that give you a certain amount of protection. That's pretty standard across the board for every architecture: there should be some sort of web application firewall, and some kind of DDoS protection — if you don't want to use AWS's tech, you've obviously got Cloudflare, Fastly, and a whole bunch of others. So, in terms of your API Gateway, how do we secure that? The easiest way is to have an auth server that you talk to: your users authenticate with it and get a token back. It does get a little more complicated, especially in the banking world in the UK where we have things like Open Banking, where there are all sorts of fancy mechanisms you can use for authentication — but that's generally one approach for general API-ness. Another approach I've seen, which is very different to this model, is that in the background you would have either something like a private certificate authority, or just generally a
certificate for each customer — actually, no, sorry, that's wrong, my bad, it's the other way around: your customer will have a private and public key pair. It doesn't have to be a certificate; it just needs to be public and private keys. Upon registering in the system — usually during the onboarding process — they'll give you the public key. So every time they make a request there'll be some sort of user identifier, so you know which person is calling your system, but they'll also sign the payload with their private key, and you can verify it with their public key. That helps ensure something called non-repudiation, which is basically a really fancy way of saying: we know this was you, and any action after that is on you — that's a very simple way of explaining it. Once you're inside the system — back to one of the earlier points around isolation — your website would probably be in one AWS account, your fraud system in a different AWS account, notifications again different, refunds different because the use cases with customer services are different (you might have a VPN here that people come in through to access refunds), and the banking systems down the back — definitely isolate those and make sure access is restricted. So yeah, hopefully that answers your question of how I'd secure the API Gateway: basically a WAF, Shield, a combination of OAuth, and, depending on the business, maybe things like HMAC and public/private keys to secure web requests. Infrastructure costs — it's a very good question. It does depend largely on what you're using as technology. For things like the lambdas I've got here, lambdas are very cheap in comparison. I ran a calculation for a project the other day where running lambdas was cheaper than running Fargate instances.
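Going back to the request-signing approach for a second — here's a minimal sketch of the HMAC variant mentioned above, stdlib only. The names are illustrative; and note that strict non-repudiation really wants the asymmetric version (verifying an RSA/ECDSA signature with the customer's public key), since with a shared secret both sides could have produced the signature:

```python
import hmac
import hashlib

# Signed requests: the caller signs the payload with a secret agreed
# at onboarding; we recompute the signature and compare. The user id
# -> secret mapping stands in for whatever store onboarding fills.

SECRETS = {"customer-42": b"secret-from-onboarding"}

def sign(user_id: str, payload: bytes) -> str:
    return hmac.new(SECRETS[user_id], payload, hashlib.sha256).hexdigest()

def verify(user_id: str, payload: bytes, signature: str) -> bool:
    if user_id not in SECRETS:
        return False
    expected = sign(user_id, payload)
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)
```

The caller would send the signature in a header alongside its user identifier; any tampering with the payload in transit makes verification fail.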
They could probably handle the same amount of load, but it was an order of magnitude of difference. For a decent-sized set of Fargate servers — I think it was four gigs of memory on four CPUs or something like that — it was going to cost me a couple of grand to run, whereas with lambda I could process, I think I put in five or six million requests, and it was like: yeah, that's going to cost you 30 bucks. Okay, well, that makes my decision easy for my personal project. So yeah, it does largely depend on what you're using for technologies, and that is a very valid thing to talk through in your architecture design. If you have an architecture process in your business, they will say to you: okay, how much is this all going to cost? You need to go and figure that out. AWS does have a pricing page, for example, where you put in your different bits and pieces and it gives you generic pricing. There is a trick there, though: if you are a large customer of AWS, they do enterprise-level agreements — I believe Azure does enterprise agreements as well — and basically you get discounts and free credits and stuff like that. For example — and don't quote me on this, do your own research — I believe there is a program within AWS where, if you get an AWS-verified third party to do a well-architected review of your system, you can get credits back, which is great because you can fund some of your architecture from them. It's really hard to cost a system like this in reality. For a fully loaded system like this: if I was using DynamoDB, that would be a big cost centre because of the read and write units; lambda wouldn't be that much; API Gateway would be reasonable depending on your load. For something like this under high load, maybe five to six thousand a month; if it's basically not doing anything, it could be as low as 50 bucks.
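As a rough illustration of why the lambda side is so cheap, here's a back-of-the-envelope estimate. The rates are assumptions (roughly AWS's published x86 on-demand pricing at the time of writing — check current numbers before trusting this):

```python
# Back-of-the-envelope lambda cost: per-request charge plus
# GB-seconds of compute. Rates are assumed, not authoritative.

PRICE_PER_MILLION_REQUESTS = 0.20    # USD, assumed
PRICE_PER_GB_SECOND = 0.0000166667   # USD, assumed

def lambda_monthly_cost(requests, avg_ms, memory_mb):
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = requests * (avg_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND

# e.g. six million short, small invocations in a month:
cost = lambda_monthly_cost(6_000_000, avg_ms=100, memory_mb=128)
```

At 100 ms and 128 MB this comes out to a few dollars; with more memory or longer durations it climbs toward the thirty-dollar ballpark mentioned above — either way, orders of magnitude below a couple of grand for always-on containers.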
So yeah, it really does depend on (a) how much volume you've got going through your system and (b) which technologies you're picking, because with some things, like lambda, you don't pay for them if you don't use them, which is great — you do pay for certain things like extra provisioning and so on, but generally speaking, if you don't use lambda, you ain't going to pay for it. Same with DynamoDB. That's one of the big things about DynamoDB: it is fantastic — the way you can do partitioning of data, change-data-capture streams, point-in-time recovery, global secondary indexes to reshape your data — absolutely amazing. But the cost of it, if you're not careful how you use it, can be a killer. You could have a well-optimized system that spends, I don't know, 200 bucks a month on it; but if you don't get the strategy right — on-demand versus provisioned capacity — you could equally be paying for 500 write units when you only need that for one second of the day. So it's one of those things where you really have to pay attention: read versus write units, how do we scale this effectively, and all that kind of stuff. "Can you talk about how you tackle data sovereignty issues versus replication?" Yep — this is where it gets very, very tricky. If you're a single-country fintech — you just operate in the UK — it becomes very easy, because you generally just write into your terms and conditions that you're going to replicate data to a different region for disaster recovery purposes. But if you're operating in multiple regions — say you wanted to run in the United States and you had people in the UK — you have to figure out how you do that. The easiest way is to create a multi-tenanted kind of system where
you physically isolate the traffic. So what do I mean by that? If I had, let's call it, api.bank.com as my domain, what I would do is just go us.api.bank.com and eu.api.bank.com, and then I would replicate within those regions for disaster recovery. You do have to be careful about what you do, though. For example, there is a rule, I believe, in France where you have to encrypt IBANs — I believe it's one of the only countries in the world where you have to encrypt that bit of information — so if you're going to put your data into a French region and you're working with the French regulator, that's something you have to be aware of. Now, having a tenanted system like this is generally not very user-friendly. This is way beyond what we would actually go through in an interview, I suspect, since you usually only have a couple of hours — but let's go for it anyway, because it's a very interesting question: how do I make a nice user experience for global users and keep my data sovereign? The answer is still going to be kind of the same. We have our EU data, we have our US data, and we have, I don't know, let's call it Asia data, and what we want to do is essentially create a mesh on top of it. We could use something like API Gateway here, with a couple of lambdas sitting underneath. One is going to be our authorizer, which could be dealing with all your authentication and making sure the user is who they say they are; and then we're going to have our proxy, essentially — again, you can use something like lambda for scalability here, or you can build it into API Gateway itself, I believe; I'm not sure if it works across regions, but I was playing around with this the other day — from API Gateway you can call lambda functions and other services really easily using proxies. So I'd be looking at using the proxy capability of API Gateway, and based on the authenticated user I can always route to the correct destination from a single URL. Things like reporting then become a little trickier from a business perspective: if you wanted a single region for reporting purposes, you're potentially going to have to look at data anonymization — removing all the user's identifying traits and just keeping generic details, like where did it go, how much was it, when did it go. There's a whole bunch of legality around there, so the standard answer is: go and read the rules and ask your legal department before you implement any of this. But yeah, this is what I'd be looking at doing: we have our API Gateway on api.bank.com.
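That authorizer-plus-proxy routing can be sketched very simply — hypothetical names throughout, with a plain dict standing in for whatever user directory the authorizer actually consults (for example, a home-region claim in the auth token):

```python
# Route an authenticated user to their home-region API from a single
# global URL, keeping each user's data in its sovereign region.

HOME_REGIONS = {"alice": "eu", "bob": "us", "chen": "asia"}  # from authorizer

REGIONAL_ENDPOINTS = {
    "eu": "https://eu.api.bank.com",
    "us": "https://us.api.bank.com",
    "asia": "https://asia.api.bank.com",
}

def route_request(user_id: str, path: str) -> str:
    region = HOME_REGIONS.get(user_id)
    if region is None:
        raise PermissionError("unknown user")  # authorizer rejects
    return f"{REGIONAL_ENDPOINTS[region]}{path}"
```

Every caller hits the one global domain; the proxy quietly forwards them to the regional stack that holds their data.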

Then underneath we might have, you know, eu.api.bank.com, and if for whatever reason the global network is unavailable, you can go and target the region individually. Above API Gateway, if we're talking a big, big system like this, I would always put in Global Accelerator — AWS Global Accelerator ("global luxury"? what a word salad that was, it's been a long day). Actually, I'm going to ask you guys, since there's a few of you on: how familiar are you with AWS Global Accelerator? It does some really cool things, which I can probably explain on this diagram as well — I just want to know how deep to go with it. Okay, not at all — we're going to have some fun then. People who know networking are probably going to hate me for some of the explanations I give here, but I'll do the best I can. At a very high level, AWS Global Accelerator creates you a global network. When you set it up, it gives you an IP address — let's call it 1.2.3.4 — in the same way as if it was a single server. But what happens is, Amazon's points of presence around the world — not only the regions where you can put all your different bits and pieces, but also the little points of presence for things like CloudFront — advertise that IP address for you. So we're going to have, I don't know, EU-1, EU-2, US-1, maybe a South Pacific one, and all of these data centres are going to advertise 1.2.3.4. The way they can do this is something called anycast IP addressing, and basically, where you are in the world determines which server you hit. So if I'm a user down in the South Pacific, I'm going to make a request for api.bank.com and get back the address 1.2.3.4, and I'll go: okay, great, let's connect to that — but which one of these do I connect to? There is something called BGP — Border Gateway Protocol — and essentially what happens is that the router sitting in a data centre is the thing that advertises the IP address, and all the routers around the world listen to updates from their neighbours saying "I have this set of IP address ranges." So when you connect through your router and start resolving the address 1.2.3.4, it goes: ah, somebody not too far from me knows how to get to 1.2.3.4 — and it routes you to the closest possible location. Now, you may be aware that Facebook had a big incident — I believe a year or two ago — where some BGP routing tables got messed up; it took Facebook offline for a long time and was quite a big news story. I believe it was down to somebody basically messing up the routing, after which they couldn't even get back into their own systems. Quite interesting to read about. So basically we're just creating a global network, the routers inside Amazon advertise that IP address to other routers nearby, and that creates a kind of boundary between your regions — where you sit determines where you go. This means we jump onto Amazon's network really, really quickly: if we're over in Asia and our website is in the EU, we don't have to go all the way across the public internet to connect to the servers; we jump onto our local point of presence, which AWS manages for us with Global Accelerator, and then use AWS's super-fast backbone to connect through. There is a charge for this — I don't know how much off the top of my head — but it's a really great technology; that's one of the benefits you get from it. Does that make sense? Whilst I clean up the diagram: basically, the same IP address is advertised across the world, and you connect to the closest version of it, essentially. The other really cool bit about AWS Global Accelerator is that you can do your health checks with it as well. The way this works is you start creating endpoint groups — I've got all these endpoints here, and it's all good — and I'm just going to take the EU-2 one for example,
and inside of here you might have an application load balancer, which might have a health-check endpoint on it. AWS Global Accelerator can check that endpoint — hey, are you healthy? are you healthy? — and when it detects that it's not healthy, it just takes that region offline and routes to one of the other available regions, all based on health checks. When that region comes back online again: awesome, let's start sending traffic there again. It doesn't have to be an outage for this to occur, either — you might want to do a really big structural change and have a bit of planned downtime. If you've got two European regions, you can mark the first one as unhealthy, send all the traffic across to the one that's healthy, do your maintenance, and bring it all back up again. So those are two of the major benefits of Global Accelerator: one, a single IP address pretty much everywhere in the world, and the performance benefits that come with it; two, region-based health checking that you can fail over with. I've not seen many companies use it, if I'm being completely honest, but it's really great. Don't get me wrong, things like anycast IP addressing and BGP are pretty complex, but it's worth understanding at least at a basic level what's going on there. We didn't really touch on reporting earlier. In the context of an interview: if, in the diagram we drew earlier, you've got loads of points with things like Kinesis streams in place, then you can just take copies off. (You can't with SQS directly — you have to go through SNS to do that — but with things like Kinesis you can just add another subscriber, take your data off in real time, and go and process it in a completely different way.) Yeah, that's a lot of information — I've covered more than I thought I was going to, honestly.
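Before taking questions — the health-check failover behaviour just described can be modelled in a few lines. This is a toy stand-in for what Global Accelerator's endpoint groups do, with made-up region names:

```python
# Toy model of health-check-based failover: route to the nearest
# healthy endpoint group, in preference (closest-first) order.

def pick_region(preference, health):
    """preference: region names closest-first; health: region -> bool."""
    for region in preference:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy regions")
```

Marking a region unhealthy — whether from a failed health check or for planned maintenance — shifts traffic to the next one in line, and flipping it back restores the original routing, which is exactly the maintenance dance described above.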
Do you guys have any other questions about anything we've gone through, or any other technologies you want to know a bit more about — why we'd use one over another, and so on? Also, I'm planning on doing these live streams around this time every single week, so if there's a scenario you want to go through, please just drop me a message or leave me a comment, and I'm happy to take a look at it. PSD2 is actually not that bad — the killer one to me is PCI compliance, and one of the reasons why is in the 4.0 specification that's just come out. Basically, one of the things they're going to be enforcing very soon is the idea that if a system is connected, the connection has to be encrypted. So let's take a look at what that looks like. If I've got an ALB, point-to-point encryption is that line there, so I might have HTTPS here. Generally speaking, people think: yes, I'm encrypted, I have HTTPS. But you might have a series of containers back here behind your ALB, running, I don't know, Kubernetes or ECS or whatever — you have a second hop here that also needs to be HTTPS. Fun fact about AWS ALBs: they do not care whether the certificate used to respond to an HTTPS connection from the ALB — i.e. going this way — is valid or not. The reason is things like Private Certificate Authority, where it's essentially a self-signed cert; the ALB has no way of saying "trust this certificate", so it just doesn't check. So that's another point. Now, where it becomes really interesting from a compliance perspective is if that container has a sidecar — take Envoy, for example. The idea of Envoy is basically that it creates a service mesh over your containers and you get routing, discovery, and all that lovely stuff. With that next to your container, technically speaking there is another point-to-point connection there — another system the traffic travels through. There were a lot of discussions between me and a few others at a previous place about whether that constitutes a proper point-to-point connection, because it's essentially in-process on the same machine, and the conclusion we came to was: play it safe — so that also should be HTTPS. And any databases underneath: TLS; you might have a couple of those. Doing everything up to the load balancer here is very, very straightforward; doing it down to the initial container is a bit more difficult, because with Certificate Manager inside AWS there's only one way
of exporting the certificate — you can't actually export the private key unless you've created the certificate through a private certificate authority. So you either upload a certificate, keep the private key somewhere yourself — say you keep it in S3 — and load it into your container that way, or you have to use ACM PCA, Certificate Manager Private Certificate Authority, and export the certificate from there. And every time you export, there's a little charge for it, because, you know, cloud: there's always a charge for everything. So when you start dealing with all these point-to-point connections, it gets a little funky. When it's managed services on the back end — Aurora, RDS, ElastiCache, or something like that — the TLS in the background is generally taken care of for you; you just have to enable it in the settings. That is all fun and games, which is fine, until you have to worry about how you rotate all of these certificates and how long you should keep them for. Let me pick up this one here from Sleepix. Yes, that is definitely part of it — I did cover this a little in the last stream, but since you're here I'll cover it again now. If you've got multiple VPCs in AWS, what you can do is set up a third VPC as an inspection VPC, so your east-west traffic — traffic going between networks in your estate — goes through this inspection VPC. Inside of there you can put a third-party appliance — or you can use AWS Network Firewall to do some of that — plus whatever tools you want, and you can aggregate all of your logs out there. You may still have your VPC flow logs coming out into a central archive, but you'll be able to do all your intrusion-protection stuff inside that inspection VPC, for the most part. But yeah, it's definitely an interesting one. I used to use Alert Logic — it's really powerful — though I've not heard
it mentioned much recently. When it comes round to intrusion-protection systems and stuff like that, at that point generally an infosec team would start taking care of it, along with dedicated architects. Myself, I'm not actually a dedicated architect — I'm primarily a .NET developer; I just happen to know a little bit about architecture and DevOps, and I enjoy flowing between the three. Ouch, that is some scale: two IDS servers — intrusion-detection servers — processing about two petabytes a month. That's quite a bit of data. So yeah, unless you guys have any other questions, I've gone through everything I wanted to cover today. I'm happy to stay on as long as you have questions or just want to chat, but if there's nothing else, I'll call it quits for today — happy to answer any questions if you have any, about anything. No worries, thank you very much for staying on. I realize it's been like an hour and a half. If you do have any other questions, feel free to come back and leave me a comment. Otherwise, hopefully I'll see all of you next week.

If you enjoyed this video, consider subscribing to the YouTube channel for more content like this.
