~/codewithstu

Live: AWS Multi-Account Structure Design

Transcript

Well, hello and welcome back to the next live stream. So today we're going to be talking through how to do a multi-account structure inside of AWS. I'm going to start off with how companies usually evolve their AWS account structure, then the components of the account structure, how networking works across it, how to design your architectures to work with multiple accounts, and some of the strategies you can use, such as ECR deployments. So let's start off with the evolution of AWS accounts within a business. It's quite a prevalent journey — a lot of companies go through this, especially scale-ups.

So we're going to dive straight into it and go through the evolution of AWS accounts. Generally, when companies start out, they start off in one giant AWS account and they have all of their resources in it. You'd be lucky if you get more than one VPC. You sometimes may get two VPCs if you're really lucky, but generally everything just gets bundled into this one account. And then people start stepping on each other's toes and you run into various problems.

As businesses grow and they scale, one of the things that they want to take a look at is how do we de-risk our deployments. This is where they start looking into splitting out into multiple AWS accounts. So imagine we start drawing a little line down the middle here. We've got Team A on the left and Team B on the right. Being software developers, we like to make things as safe as possible with all the constraints we've got, so generally speaking these two will be paired together using VPC peering. In order to de-risk the deployments, we really need to break that peering link and move things into different accounts.

So what do companies generally do? Well, first of all they'll split off into multiple accounts. You'll have a dev account, QA, sandbox, and your production account. And depending on which company you're in, there's a couple of different ways they handle deployments in this kind of scenario. One is they put ECR in each environment and you have to rebuild the Docker image for each environment. Or you'll often see the concept of a management account: the ECR repository lives inside that management account, and access to it is shared out to all the other accounts.

Now this does give you some flexibility, because then you've at least got workload isolation. So your dev workload is isolated from your QA workloads, from your sandbox workloads, and from your production workload. But if you've got multiple teams — and each of these boxes is a team — they're all still going into the same account, which means resource quotas for the account, things like Lambda quotas, get shared between the different teams. And things like production networking changes affect everybody at once rather than just individual teams.

So what generally happens is companies will usually stay in this phase for a few years before they decide they're big enough now and should probably de-risk some more. So what they end up doing is splitting by domain. In this kind of setup, they would have multiple accounts. Just imagine these are all accounts like production, sandbox, and dev, but this would now be replicated for a cell or domain depending on what terminology your company uses. And then there will be another account with basically the same setup but for a different team, and then there'll be directional communication between all the dev accounts, all the production accounts, and so forth. They would still keep their concept of a management environment on the left, and that would have shared services such as ECR in them, and they'd reach into all the accounts from there.

So now what this gives you is one team here who look after everything in this first cell or domain, and everything from team number two down in this bottom one. This gives you really good fault isolation, not only between the different environments but between different teams as well. This generally comes in when the company starts to get pretty big — towards the 10-15 scrum teams kind of size — or in any business with heavy regulation, such as finance. Because with things like PCI-DSS audits, what they basically say is that if your systems are connected, then that brings them into PCI scope. So by physically disconnecting the different accounts, you can keep your PCI account separated from everything that's non-PCI.

So we'll go through how all the networking works between all of these accounts, but this is essentially what the larger companies generally end up doing when it comes to account structure. There's a few things they do in AWS which I wanted to go through on today's stream that are really interesting, but unless you work in this kind of field you don't necessarily know that it's there. You just get these weird errors such as you can't create resources and stuff like this, and I'll explain why.

So in AWS you have the concept of not only accounts but organizations. An organization can have multiple accounts underneath it: account one, account two, account three can all belong to that organization. How many organizations you end up with really depends on the structure of your company.

There's also a layer in between the organization level and the account level. If we go with the cell-based design I just showed you, a few things happen. Imagine we're inside a single organization. You'll have a management account for that organization, and this hosts something called Control Tower. The job of Control Tower is to deploy other accounts into organizational units. An organizational unit (OU) is essentially just a container within an AWS organization that you can apply specific policies to, group accounts under, and so on. Control Tower can set up defaults in all of the accounts for you — it can remove the default VPC, apply guardrails across all of the accounts — so it's a really powerful tool. There's also something called Control Tower Account Factory for Terraform, which is a bit of a tongue twister, and which is a fancy way of saying you can drive Control Tower from Terraform. So the management account not only runs the Control Tower instance that controls all the other accounts, it's also where AWS Organizations lives — where you manage your organizational units — and it's generally where you'd have your single sign-on, which in the new AWS world is IAM Identity Center (what used to be called AWS SSO).

When you install Control Tower, one of the first things you get is a new organizational unit called Security — I'll just draw that relationship there so you can see how it creates it. This Security OU gives you a security account: a single account where you put your logs, your auditing, and so on.

Then, depending on the consultancies you work with and your company structure, you'll have a few other OUs — I'm going to draw these a little further down, and I'll explain why. First you'll have a Sandbox OU, which keeps developer-type accounts together so people can try out AWS technologies: I might have an account for dev1 and an account for dev2, and both belong to the Sandbox OU, governed by Control Tower. You'll then have a Non-prod OU, which holds the dev and QA-type accounts — so we've got QA there. You'll have a Prod OU, which is where your actual production accounts live — and as I mentioned earlier, if you go for a cell-based design you might have multiple of these, one per domain, all created through Control Tower's account factory. Then you'll have a Shared Services OU, and this generally has an account or two or three depending on your setup, for things like networking — so you might have a networking account, and a dev tooling account is another common one. Again, these sit underneath the Shared Services OU. Finally, you'll have what's called an External or Suspended OU, and the purpose of this one at the end is that it can't access anything else — it can't even reach shared services; it's locked out of your system for all intents and purposes. If an account's been compromised, for example, you move it into that OU: it can't access anything else, it breaks all the connections, but it's still captured under your organization for the purposes of billing and such.

In a traditional Control Tower setup, all of these organizational units are created directly, almost as top-level ones. But personally I like to create an additional layer — an organizational unit in here I'm just going to call the Stable OU — because later on we're going to apply something called service control policies, or SCPs, which are another way of restricting what can access what. With AWS Organizations you can create a hierarchy between organizational units, so anything we apply at the Stable OU level gets applied across Sandbox, Non-prod, Prod, Shared Services, and External. We want to apply our SCPs at this level so they apply to everything underneath. We might institute a tagging policy; we might say no regions unless they're explicitly allowed — you might only operate in three regions in the world, so you can take all the other regions offline, and that applies across all your accounts. Service control policies are really powerful, and they cannot be overridden from within an account — they can only be changed at the organizational-unit level by adjusting the policy.

Another common pair of SCPs is around IAM, and it usually comes in two flavours: no administrators, and no IAM users. If you give somebody full administrator rights over an account, they can go and do whatever they want within the bounds of the SCP, and we generally don't like that approach — we want the principle of least privilege in all of our AWS accounts — so we'll always try to apply the no-admin rule. No IAM users means we're forcing all of our users to go through SSO rather than having individual user accounts, which takes away whole classes of problems: managing multi-factor authentication on individual users, and the issues around access key rotation (people often only ever create one of the two keys instead of rotating between them). It takes away all of that nonsense and leaves you with a nice, secure structure.

That's why I like to create the Stable organizational unit on top of all the sub-units: I can apply one service control policy and know it's applied everywhere — tagging sorted, regions restricted, users all secured. There are a whole bunch of these, and AWS has some great pages — if you just search for service control policies you'll find tons of examples to copy and paste from. But there's one problem: if you put service control policies at this level, then you're going to want a way to test them.
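To make that concrete, here's a minimal sketch of what a pair of those SCPs might look like, built as policy JSON from Python. The region list, Sid names, and global-service exemptions here are purely illustrative — check AWS's own SCP example pages before using anything like this.

```python
import json

# Hypothetical region allow-list. A real one would match wherever
# your company actually operates.
ALLOWED_REGIONS = ["eu-west-1", "eu-west-2", "us-east-1"]

scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            # NotAction + Deny: deny everything outside the allowed
            # regions EXCEPT these global (non-regional) services.
            "NotAction": [
                "iam:*",
                "organizations:*",
                "cloudfront:*",
                "route53:*",
                "support:*",
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}
            },
        },
        {
            # The "no IAM users" flavour: block creating users/keys so
            # everyone has to come in through SSO.
            "Sid": "DenyIamUsers",
            "Effect": "Deny",
            "Action": ["iam:CreateUser", "iam:CreateAccessKey"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(scp, indent=2))
```

Attached at the Stable OU level, something of this shape would flow down to every OU and account underneath it.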

What I generally like to do, then, is have some form of Experimental OU alongside the Security and Stable OUs. This is essentially our test area for anything Control Tower related: any service control policies all get tested out in this area over to the right. I'll draw the isolation boundary — it's isolated over there, so there's no confusion about what's going on in it. We can test out our service control policies there, and when we're happy with them, we can promote them to our Stable OU. And just for completeness, I'll add the Non-prod OU and the Prod OU, and this is how it all starts to come together. When companies talk about multiple AWS accounts and why we have them, it is all about isolation. If I start drawing the rest of the isolation boundaries, I've got one over here for my non-prod — all those accounts are essentially isolated by the OU — another one over here for my prod, and if I've got multiple accounts inside of there, then I've got isolation in there as well. Then it's just a case of how we start connecting all of these accounts together.

So imagine we're in our Prod OU and we have two accounts that need to talk to each other. Let's call this one Payments — that's the payments account, or domain — and this one the Reporting domain. Depending on how the company is laid out, what these boxes represent will vary, but both of them sit under our Prod OU, and each one may have one or more VPCs depending on the structure of the team. So how do these VPCs start to connect to each other? In AWS there are four main ways you can connect VPCs together. One is VPC peering, which essentially creates a mesh network between the different VPCs.
To connect these two VPCs together, we create a VPC peering link between them. If we wanted to connect a third one, then to make sure they can all communicate with each other, we have to start drawing a line from one box to every other box. The more VPCs we add, the more complicated the network becomes.

Option two is something called a transit gateway, and this is what most companies end up with, even at the four-account stage we mentioned earlier. The way a transit gateway works is that it generally lives under the Shared Services OU, in a network services account — we'll call it the TGW for short. The transit gateway gets shared out to each account (AWS Resource Access Manager is the place you share it from), and then each VPC connects to that transit gateway. Depending on whether route propagation is enabled on the transit gateway, you may have to do some extra work, but essentially it creates a hub for all the VPCs to connect to, which prevents all that meshing and peering. This works really, really well — it's a way of centralising some of the network management — and you can have different route tables (or even separate transit gateways) for different network segments: you might have all your production traffic on one route table and all your non-production traffic on another, so this becomes pretty flexible and pretty powerful. But what happens if you want to go multi-region? Then you have to create a transit gateway in each of the regions and start peering those transit gateways together, which becomes a bit of a mess. So what do we do to get around that? That's where option three comes in, and this is something fairly new in AWS.
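It's worth putting numbers on why the peering mesh gets out of hand. A full mesh needs a link for every pair of VPCs, while a hub-and-spoke transit gateway needs one attachment per VPC — a quick sketch:

```python
def peering_links(n: int) -> int:
    """Full-mesh VPC peering: every VPC pairs with every other VPC."""
    return n * (n - 1) // 2

def tgw_attachments(n: int) -> int:
    """Hub-and-spoke via a transit gateway: one attachment per VPC."""
    return n

# Peering grows quadratically; the transit gateway grows linearly.
for n in (3, 5, 10, 50):
    print(f"{n} VPCs: {peering_links(n)} peering links vs "
          f"{tgw_attachments(n)} TGW attachments")
```

At three VPCs a mesh is manageable; at fifty it's over a thousand peering links, which is why most companies land on the hub model.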
It's called Cloud WAN. Cloud WAN is essentially a transit gateway on steroids. It acts in much the same way — VPCs connect to it, and Cloud WAN is shared out from the network services account to each of the accounts — but as each VPC joins, it basically says, "hey, I'm production," and then you can grant access centrally in something called the core network inside Cloud WAN, which segments your traffic according to whatever tags you wish. If you imagine this big box down here is the whole of your network traffic inside Cloud WAN, you can say this segment here is dev — all your dev traffic lives in that segment — and you might have a prod one. As these VPCs attach, we're basically telling Cloud WAN: connect everything that has the production network segment together, but isolate it from anything in dev. So if I've got a VPC in the dev account, there is no way that boundary can be crossed — it can't happen. That makes it really powerful for cross-account setups, and the reason it's slightly better than the transit gateway is that it works across multiple regions as well: you only have to create one core network, AWS takes care of basically everything else for you, and you just attach VPCs to it with a specific segment. You can also add rules to it — for example, automatically approve an attachment for dev if it carries the dev attachment tag, but if it's got a production attachment tag, wait for somebody to come in and review it — so nothing joins the production network without anyone looking at it.

That's basically how the networking is sorted out between the different accounts. The fourth option is not very commonly used, and that's PrivateLink — that's generally used when you're exposing applications privately out to customers.
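The segment behaviour described above can be modelled in a few lines. This is just a toy model of the routing rule Cloud WAN enforces — the attachment names and segments are invented, and none of this is the real API:

```python
# Each VPC attachment carries a segment, and the core network only
# routes traffic between attachments in the same segment.
attachments = {
    "vpc-payments": "production",
    "vpc-reporting": "production",
    "vpc-scratch": "dev",
}

def can_communicate(a: str, b: str) -> bool:
    """Traffic is allowed within a segment, never across segments."""
    return attachments[a] == attachments[b]

def needs_manual_approval(segment: str) -> bool:
    """Mirrors an attachment policy: auto-accept dev, review prod."""
    return segment == "production"

print(can_communicate("vpc-payments", "vpc-reporting"))  # same segment
print(can_communicate("vpc-payments", "vpc-scratch"))    # crosses the boundary
```

The real policy lives in the core network as JSON, but the decision it encodes is exactly this: same segment, connected; different segment, isolated.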
So that's basically how networking works between different accounts and different VPCs. There are a few extra bits around the networking that you'll need to be aware of — I'll grab some more space to draw this — and that's what happens when you go north-south and east-west. I'll explain what those mean right now.

Imagine this is your entire AWS estate, and you have one VPC here, another VPC here, and your internet gateway here. When you're talking VPC to VPC, this is called east-west, because the traffic is staying within my network. When a VPC wants to go out to the internet, this is called north-south. So if you ever hear a network administrator talking about east-west and north-south, that's what they mean.

If you're going east-west and you want to talk between two different VPCs, what your network administrators are likely going to do is put an inspection VPC in the middle. Instead of the VPCs talking directly to each other, traffic goes through the inspection VPC, so you can put firewall routing in, prevent VPCs talking to one another based on rule sets that you define, and so on. This also means that, generally speaking, there will be no internet gateways on the workload VPCs — all your egress traffic has to go out through an inspection VPC. So there's a default route in the route table, 0.0.0.0/0, and that points at the inspection VPC. Once the inspection VPC has taken a look at the traffic, it decides where it needs to go next. If it's going internal, it directs the traffic to the connected VPC; if it's going external, there's a chance there will be another inspection VPC for egress traffic, which might have a different rule set on it and just monitors things going out. It may be one inspection VPC, it may be two — it depends on what your network administrators want to set up — but for all intents and purposes there will be an inspection VPC that all the traffic routes through, whether it's north-south or east-west. Anything going out has to go out through there, and that's where your NAT gateways and such will be.

Inbound traffic is a little bit different, because with an internet gateway you want to protect your VPCs as much as possible. If this account here on the left has an ALB that needs to be exposed, then unless it goes through something like Global Accelerator, the chances are there's going to be a web application firewall it has to pass through first before it reaches your VPC — and again, that's likely to sit in some kind of inspection VPC where the network administrators have control and can block certain bots and suchlike coming in.

Ah, yes — there's one general exception that I've seen, and implemented, when it comes to this inspection process, and that's when you're doing something like webhooks. Webhooks can generate an awful lot of traffic going out to loads of different destinations, so putting all of that through the inspection VPC where the rest of your traffic is going may overload that area, because each VPC attachment has a network bandwidth limit of something like 45 gigabits per second.
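Coming back to the inspection routing for a second: the forced-egress setup boils down to a spoke route table that looks something like this (the CIDR and attachment ID here are made up):

```python
# Illustrative route table for a spoke VPC whose only way out is the
# inspection path: no internet gateway, just a default route to a
# transit gateway attachment that fronts the inspection VPC.
spoke_routes = {
    "10.20.0.0/16": "local",     # the VPC's own CIDR stays local
    "0.0.0.0/0": "tgw-0abc123",  # everything else goes to the hub
}

# With no IGW attached, the 0.0.0.0/0 route is the only path out, so
# the inspection VPC ends up seeing all egress from this VPC.
print(spoke_routes["0.0.0.0/0"])
```

Everything that isn't the VPC's own address space falls through to the default route, which is exactly how the inspection VPC gets to see all the traffic.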
Depending on how much traffic you have, that matters. So if you imagine this is your webhooks VPC, what you would generally attach to it is an egress-only gateway, which means we can talk out to the internet and get the responses back, but nothing can come inbound. That would be its route out, and it's a bit of an exception to the normal inspection flow.

So, in a multi-account structure — these are our AWS accounts again — how do we deploy? Say these are VPCs and we're going to split them into two different accounts. I mentioned earlier that we'd have some sort of management or shared services area — that's really unreadable, let's try that again — and this would contain your ECR. For each ECR repository, you can go through and say: this account can have access to it, and this account can have access to it. Or, what's more likely, you say: for this ECR repository, anything within this organizational unit has read access, but you can only push from inside the owning account. That makes it really, really flexible, because then you can have your golden-image deployments from ECR into your different AWS accounts. As for how images get into ECR: you might have a GitHub runner, for example, which may be public and has permission to push into ECR, plus the ability to pull for testing and whatnot. But down in the individual accounts, we only grant pull, so these accounts can't start randomly publishing into ECR.

Cell-based architectures are the last bit I wanted to cover. For those familiar with domain-driven design, you can mirror the same structures we put in place for domain design in our AWS account structure — which I went through a little earlier, but I'm going to go through in more detail now. Let me go back to our payments and reporting domains.
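The pull-only, OU-scoped ECR access described above might look roughly like this repository policy, again built as JSON from Python. The organization ID and OU path are placeholders — the `aws:PrincipalOrgPaths` condition key is what scopes access to an OU:

```python
import json

# Sketch of an ECR repository policy granting pull-only access to any
# principal under a given organizational unit. "o-example" and the OU
# path are made-up placeholders.
pull_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "OrgUnitPullOnly",
            "Effect": "Allow",
            "Principal": "*",
            # Pull actions only -- no ecr:PutImage, so consuming
            # accounts can never push into the golden-image repo.
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
            "Condition": {
                # PrincipalOrgPaths is multivalued, hence ForAnyValue.
                "ForAnyValue:StringLike": {
                    "aws:PrincipalOrgPaths": ["o-example/r-root1/ou-prod1/*"]
                }
            },
        }
    ],
}

print(json.dumps(pull_only_policy, indent=2))
```

Push permission would then be granted separately — for example only to the CI runner's role in the shared services account.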
When you're working with these cell-based architectures, they lend themselves really, really well to the AWS account structures that we've just gone through — splitting them out so we'll have a payments prod, a payments dev, a payments QA, a reporting prod, a reporting QA, and so on. Then generally what you would do is, right at the top of each domain, put a cell gateway, and this cell gateway is responsible for controlling the contracts to the external consumers. One of the external consumers may be the reporting domain, which would likewise have a cell gateway. That means that any time I'm talking to either the payments or the reporting domain, I am talking to a fixed contract, but inside of that I may have three or four different services that make up that contract, and I can go and change those services as I want, so long as I maintain that contract on the cell gateway — and the same is repeated over in the reporting domain. A service behind the gateway can move from v1 to v2 to v3 internally, absolutely fine, but the external contract stays at v1, so I can make massive architectural changes while keeping my external contract. And it's the gateway that's connected to the network — anyone talking to the domain always goes through the cell gateway. If you're using something like an SQS queue instead, then we need to make sure we've got some form of external schema registry that we can validate against when consuming.

If you haven't seen cell-based architectures before, I would definitely go and look at them, because there are a lot of additional benefits you can get. One of the big ones is around observability, and the reason I say that is that this gateway level is where you can track everything, from the latency coming into the whole system onwards.
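The cell gateway idea — a fixed external contract with swappable internals — can be sketched as a toy in a few lines. This isn't any real framework, just the shape of the pattern; the route and handler names are invented:

```python
# Toy cell gateway: external consumers only ever see stable, versioned
# routes; the internal service behind each route can change freely.
class CellGateway:
    def __init__(self):
        self._routes = {}  # external contract -> internal handler

    def expose(self, contract: str, handler):
        self._routes[contract] = handler

    def call(self, contract: str, payload):
        return self._routes[contract](payload)

gw = CellGateway()
gw.expose("/v1/payments", lambda p: {"status": "accepted", "amount": p["amount"]})

# Internals can be swapped (v1 -> v3 of the service) without the
# external contract "/v1/payments" ever changing.
print(gw.call("/v1/payments", {"amount": 42}))
```

Swapping the handler later is the code equivalent of "massive architectural changes inside the cell" — consumers keep calling the same contract the whole time.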
That's where you put your observability in a cell-based design. If you wanted to go into more detail, you can put observability down at the lower layers as well, but the quickest and easiest bang for buck is at the top, on the cell gateway. You monitor it there, you start your tracing there if a trace doesn't already exist, and then you pass that tracing context down to the other services. They might make other internal calls, but you're capturing all of the observability at this one level, which can be really powerful for dashboards and such: you can say, right, how many payments did I get through? OK, cool — I've just got to look at v1 of my payments endpoint in the gateway observability, and that will tell me exactly how much throughput I'm getting. So yeah, I would definitely go and take a look at cell-based architectures and how to do observability around them; it's really quite interesting.

For those watching: do you have any questions about what you've seen in your own AWS accounts — like, why do we do this? Hopefully I can give you an answer, because it can look a bit strange if you don't know why everything is in different accounts. Generally, it comes down to isolation and de-risking. I've been rambling on for quite a while now and I don't think I have anything else to cover on multi-account structure, so thank you very much.

Oh — GG asks about virtualization: do I prefer containers or VMs? For me personally, it's always containers; I'll always run them. If I'm developing now, I'll run them on one of two things — well, technically three. The first will be AWS Fargate if I'm running on AWS: I don't want to manage the underlying hosts, and I don't want to deal with things like security groups for them. If I'm running on Kubernetes, then I'll use
something called Bottlerocket to secure that bottom layer — it's essentially like running Fargate, in that you don't have much control over the underlying host. And for containers on Lambda, I just like having all my dependencies inside the container, versioned as a kind of golden image. You can run VMs for some things, but I find I haven't used a VM in seven or eight years — it's all been Docker containers. Hopefully that answers your question. Yeah, that's an interesting one: I don't know of many places that have virtual machines per se any more. Obviously things like AWS run virtual machines underneath your workloads, but inside a business I haven't seen it — it's all been Docker containers as the solution for virtualization.

Follow-up question — that's a very good one, because it's something I've not done in a long time. I'm actually going to go Hyper-V, for me personally; I've never needed the kind of things you can do with VMware. If you're talking about local development for virtualization, to me it's just Hyper-V for development purposes. For production it depends on what feature set you want, but Hyper-V is probably my go-to, because you can do things like virtual switches really easily, create small networks, and I can run it on my local machine here as well — I'm on a Windows-based machine, so running Hyper-V is absolutely fine. What do you prefer for your virtualization? I remember, I'm going to say about seven or eight years ago now, there was "local box", which was kind of a precursor to Docker — I'm not sure if anyone actually remembers it. It was pretty good: it basically allowed you to run any kind of virtual disk and container structure, which was kind of cool. It was around when the first containers came out. Proxmox — I have not heard of that one; I'm actually going to add that to my list, thanks. OK. So unless there's any more questions, I'm going to shoot off now and
let you guys get the rest of your night back. Thank you to those who've come on and joined, and thank you if you're on the replay squad as well. If you do have any questions, feel free to pop a comment down below.

If you enjoyed this video, consider subscribing to the YouTube channel for more content like this.
