Live: Systems Design - Stock Tick API
Transcript
Hello, hello, welcome to the first live stream I've done, so this could be interesting. Last time I did one of these architecture tasks it was recorded, even though I rambled for about an hour in front of the camera; this one is live rambling in front of a camera. What we're going to do today is take a look at a common system design task you might get: a live FX / stock ticker API. We'll look at it from a system design point of view: what technologies we're using, why we're using them, roughly how we'd go about designing the solution, and some other bits and pieces. We may go into different areas later on as we get used to doing these streams, but this first one is just going to be a very high-level API that we design as a systems architecture.

So let's take a look at today's scenario. This is a task I've pre-written, and it's accurate to the sort of brief you'd get in an interview. Let's go through it quickly: we've been hired by Contoso Stocks, a global stock trading company, to redesign their existing stock trading information endpoints for their customers. The existing system has multiple ingestion sites around the world and high traffic demands at certain times of the day. There are two key factors I'm looking at here: one is the multiple ingestion sites, the other is the high traffic demands at certain times of the day. Between those two, this does indicate a multi-region design, and there are a couple of ways we can do it: we can go active-active across the whole globe, or we can have a single active region with the rest as replica regions that just consume their data from elsewhere. There are pros and cons to each.
We can go through those as we design the system. So: the system must ingest data from regional data sources such as NASDAQ or the London Stock Exchange, but it should be available to everyone globally, again fitting that multi-region design. We need to provide the information via a live API and a historical API. Given that it's a live API, we must display the current stock value within a second of it being received; otherwise the value is stale. The historical API must be able to display values for at least the last day, last month and last year, so we're looking at some kind of aggregation as part of this task when we're talking per day, per month, per year, because there's obviously quite a lot of data. The interviewer has said that the intervals within each day/month/year aggregation are up to us, so long as we have a representative graph. Basically what this means: imagine a stock trading app on your phone, which usually displays a graph showing the price over time, and we need enough detail to draw a fairly representative graph. If you don't have a fairly representative graph, the system is kind of pointless, but the interviewer hasn't specified which intervals to use. For each day, for example, we might pick a one-minute interval, which gives us a lot of data and a very accurate graph, but it might not be the most readable, so we might look at a five or ten minute aggregation instead, and there are certain technologies we can use to do that.

The last bit of the task is that each of the exposed APIs has different performance requirements: the live API should support around 10 million requests per second, and the historical API about 1 million requests per second. Remember, based on the earlier parts of the brief, this is going to be a multi-region design, so we can assume a lot of this traffic is spread out across the globe. If this were a real interview, this is what I would be calling out to the interviewers: is it okay to assume this traffic is spread across the globe? And this is a really important point in interviews: if you make any assumptions, that's generally okay, just tell the interviewer what assumption you're making and why. The interviewer will generally come back to you if that assumption isn't valid for the scenario you're working on. If this wasn't clearly a multi-region scenario and we said, "you said 10 million requests a second, I'm going to assume that's spread across the globe", they would go, "no, we haven't said anything about multi-region, this is a single region", in which case your design is definitely going to change. So make sure you call out all of your assumptions, double-check them with the interviewer, make sure they're okay for the scenario you're presenting, and you should be absolutely golden.

Next: each ingestion source has variable performance characteristics, and some stock items may burst into thousands of ticks per second; think busy stock symbols like Apple. It's going to be high-throughput and very bursty in nature, which we need to take into account. Exchanges are mainly active during working hours, so we can assume a follow-the-sun model. For example, the London Stock Exchange is only open during UK office hours, roughly 9 to 5, Monday to Friday; same with currency pairs like GBP, which only trade Monday to Friday, nine to five. So we can start assuming that
there are going to be periods of time where we could move a lot of the data processing to other parts of the globe if we wanted to. We're probably not going to do that in this design, because it would go way over the 20 minutes or so we'd have in an interview. For context, this will probably take longer than 20 minutes, because I'll be explaining things as I go and adding a bit more detail than you'd give in an interview, but my aim here is to go through why I'm picking the different pieces, so you have the context and can take parts of this, present it, and improve on it yourself. Okay, let me switch to my whiteboard; we can come back and look at this brief later on as well.

So where would you even start with this kind of design? For me, I like to start with data ingestion. Data ingestion is where a lot of the discussion on this design is probably going to happen. You'll have a lot of discussion around the live API as well, but some of the technology choices, which I'll explain later, will make that a little easier for you, so there'll be less discussion in that area. Looking at data ingestion, the first thing on our diagram is the exchange. This might be the London Stock Exchange, or it could be NASDAQ; for the purposes of these interviews it doesn't really matter which exchange it is, you're just representing the concept of an exchange.

Next, given that the scenario says there's a bucketload of traffic coming through, the next thing I'd look at is some sort of load balancer. There are two options you may consider here, especially in AWS, where there are essentially two different types of load balancer. One is your traditional Application Load Balancer, which operates at layer 7 and routes nicely through URLs; in this scenario that's probably not quite what you want. You more likely want the Network Load Balancer, or NLB, and the difference is basically which layer of the OSI model they operate at. An NLB operates at the TCP level, which is what pretty much all the exchanges run off, because they send essentially binary packets; there will be some exceptions, but for the most part they'll be TCP-based. The reason we load balance is that any given stock could overrun a single server, so behind this load balancer we'd have a series of ingestors (you could probably think of a better name). The NLB is responsible for fanning traffic out to these ingestors, and their responsibility is to accept the packets, validate there's no corruption, and pass them off to something else. In this case I would pick something like Kafka.

Now, if you're in an interview and you're picking something like Kafka, you have to come up with really strong reasons why. Number one for me would be performance. I haven't specified whether I'm running this on bare metal or using something like MSK, because in reality that would depend on a number of factors. If you're talking super high levels of performance, chances are you're not running a managed service like MSK; you'll probably be running on bare metal. But for an interview scenario I'd always try to stick with managed services where you can, partly because of the maintenance cost, and this is something you should always consider in your architecture designs: how much maintenance does your design carry? In this
design, we don't need to worry about the exchange side at all, but we do need to worry about everything to the right of it. Starting with the network load balancer: I'd have this as a managed service anyway, using AWS's NLB. The ingestors would probably be Fargate services, so I'd call that out here as well. The reason they'd be Fargate services for me is that I don't want to manage underlying hardware, but I do want something persistent. Lambdas, at these traffic levels, honestly aren't a great choice, because you need permanent connections everywhere; you don't want connections recycling every 5 or 15 minutes depending on your Lambda timeout. You don't want to keep breaking the Kafka connection: the connection into Kafka is TCP, the connection from the exchange is TCP, and if you're on Lambda and constantly breaking those connections, you're constantly introducing jitter and chances to fail, failures which can flow back through the NLB to the stock exchange.

This is something you can stamp out into each region, because you can do Kafka replication between regions, so we're covered on the multi-region aspect should we need it. There are obviously nuances around how we do this. For this Kafka scenario, one thing we have to keep in mind is the structure of our topics. We couldn't have a single topic saying "here are all the stock trades" or "here are all the FX trades"; we'd be overloading those topics, and anyone consuming from them wouldn't have a great time. So for the same reason we split out the traffic, we want to split out our topics. For topic design, we'd call a topic something like stocks.<symbol>, and it would contain all of the data for that specific symbol. Any consumer sitting off the back can then either subscribe to specific topics, or subscribe to something like stocks.* and get all the symbols.

It also means we can do things like aggregations, which I'll go through now. Certain parts of the Kafka ecosystem let us run something called ksqlDB, and one of the things we can do with it is put aggregations on top. Imagine a time window of one, two, three, four seconds: we might have one event come in here, a couple here, one here and a couple there. Because we've split out our topics like this, we can tell ksqlDB: every five seconds (or six, whatever), give me a new aggregation. It can maintain an aggregation table, say one for Apple, covering each five-second period, which we can then use when storing into other data sources. This massively reduces the load going to the underlying data store, such as InfluxDB, and it's what I'd call out in the Kafka part of the design.

So we need to connect Kafka to our data sources; let's do that, because we'll need it later. What happens after Kafka? We're going to want one or more Kafka consumers, because this load is going to be substantial. They pull from Kafka, and their job is to take the aggregations we just mentioned and write them into a data store, something like InfluxDB. InfluxDB is a time series database: we can attach tags to the points and then get aggregations of the data back out of Influx.
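To make that windowed aggregation concrete, here's a rough Python sketch of what a ksqlDB tumbling-window average is doing conceptually; the tick shape, the five-second window and the symbols are my own illustrative choices, not from the stream:

```python
from collections import defaultdict

def tumbling_aggregate(ticks, window_seconds=5):
    """Bucket (epoch_second, symbol, price) ticks into fixed tumbling
    windows and average the price per symbol per window -- roughly the
    shape of output a ksqlDB windowed aggregation would emit for the
    consumer to write into InfluxDB."""
    buckets = defaultdict(list)  # (window_start, symbol) -> list of prices
    for ts, symbol, price in ticks:
        window_start = ts - (ts % window_seconds)
        buckets[(window_start, symbol)].append(price)
    return {key: sum(p) / len(p) for key, p in buckets.items()}

# Four ticks: two for AAPL in the first window, one each in the second.
ticks = [
    (0, "AAPL", 100.0),
    (1, "AAPL", 102.0),
    (6, "AAPL", 104.0),
    (6, "MSFT", 300.0),
]
agg = tumbling_aggregate(ticks)
```

The point of the sketch is the data-volume reduction: however many raw ticks arrive in a window, only one aggregated row per symbol per window goes downstream.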
We can ask it for a day's worth and roll everything up by day, or roll everything up by week, and we get all the lovely stats from it as well. So we're going to write to InfluxDB. But part of our problem space wasn't only historical data; we also need to serve this data live, within about a second. I'd be hoping this leg is done in around two to three milliseconds, something in that range. Our ingestor really shouldn't be doing too much here; the bulk of the time will be TCP connections and whatever your Kafka storage settings are. One thing we didn't cover with Kafka: when you write to a topic, you can say how many replicas it has to keep of that data. We could keep five replicas if we wanted to be really, really safe, or we could store it just once, because we can backfill this data from somewhere else, and that reduces the overall latency. So we shouldn't be too far off here. InfluxDB will be slower than Kafka in my experience; let's call it 20 milliseconds. Either way we're not talking about a massive amount of time, hopefully, because we're using the aggregations from Kafka to store into InfluxDB, and we'll serve that from the historical API later. I'm going to label this consumer KCH, for historical consumer.

I personally wouldn't use Influx for the live side of the system, which is what I'll show you now. For the live side, I'm going to create a new live Kafka consumer, and again we'd probably have multiples of these consuming from different Kafka topics. The live Kafka consumer's job is to update Redis. Our live API is something we're fully in control of; we know exactly what we're going to be doing here. So, to complete the circle, we have our live API, and it pulls from Redis. In terms of instance sizes we do have to be mindful, but there are specific things we can do with Redis, like key sharding, to make sure we can scale this globally. I verified this part of the design with a very, very large stock trading company, and it's pretty much exactly what they do: the live API talks to Redis, and they have other bits and pieces around it, but that's their business logic.

So now we've got our live API. It has no historical data on it; that wasn't mentioned in the specification, so we're all good. What I'd look at now is performance for this live API. We know we can get data into Redis very quickly and that the live API can hit Redis very quickly; if I call that two milliseconds I think I'm being generous, because in real-world usage I've typically seen it a lot faster. So the bulk of the API work is going to be authentication. Whether it's even authenticated is something to go through with the interviewer: is this live API publicly available, or is it behind authentication? Depending on the answer, you may need to draw some more boxes. If it's behind authentication, you'll need some kind of authentication system, which might be Azure AD or AWS Cognito. You'll also have to start thinking about rate limiting and how you'd do it; with something like Cognito there are certain things you can do around API keys that help with rate limiting.
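Going back to the live path for a second, the core of that live Kafka consumer is tiny: overwrite the latest value per symbol so a live read is a single key lookup. Here's a sketch with a plain dict standing in for Redis; the key format and tick fields are my own illustration, not from the stream:

```python
# In-memory dict standing in for Redis: the live consumer overwrites the
# latest value per symbol, so a live API read is one key lookup.
live_store = {}

def apply_tick(store, symbol, price, ts):
    """Keep only the newest tick per symbol; a late, out-of-order tick
    (older timestamp than what we already hold) is dropped."""
    key = f"live:{symbol}"
    current = store.get(key)
    if current is None or ts >= current["ts"]:
        store[key] = {"price": price, "ts": ts}

apply_tick(live_store, "AAPL", 101.5, ts=10)
apply_tick(live_store, "AAPL", 101.7, ts=12)
apply_tick(live_store, "AAPL", 101.6, ts=11)  # late tick, ignored
```

With real Redis you'd get the same effect with a SET per symbol, and key sharding spreads those keys across a cluster.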
Assuming all of that is glorious, there's another optimization we can make here that's really going to help. It doesn't necessarily improve the raw performance of your system, but it does help the consumers of your system feel like they get a faster response, and it's a technology we'll use in a couple of places. You may have heard of it before; I talked about it in the last stream if you saw that. It's AWS Global Accelerator, which I'll just write as GA because it's too long to keep writing out. Basically, Global Accelerator sits in front of your APIs. I forgot to add the load balancer in here, but we'd have one here as well, and we could be running multiple instances of this API. Global Accelerator talks to your load balancer, and your load balancer can then be internal: it isn't exposed to the internet at all. This is great for quite a few things. With Global Accelerator in front of an internal load balancer in front of an internal API, you get some extra protections: you can attach things like AWS WAF to Global Accelerator, giving you DDoS protection and all the lovely stuff that comes with it. Essentially the way it works is that AWS has points of presence around the globe, the same points of presence used for things like CloudFront, so your content gets out to your users really quickly. The specific reason we're using it here is that we don't want our users to cross the whole public internet to reach our service; we want them to jump onto a fast connection wherever they are in the world, connect to Amazon's network at the nearest point of presence, and have Amazon's backbone carry their traffic to our services.

The other reason to look at Global Accelerator is the multi-region scenario that came up earlier. One of its features is health checks against an endpoint, so we can expose an endpoint on this live API that says "yes, I'm good to go over here", or "no, I'm having a hard time, please take me out of rotation". If a region suddenly becomes unavailable, Global Accelerator can fail over to a different region automatically; there's nothing you need to do (though you can also set up alerts), and as soon as the region comes back online, traffic can move back. If we're smart about how we do our API design going forward, we might route certain traffic to, say, the European region: for the London Stock Exchange we could potentially do that at the DNS level, with something like lse.api.contoso.com/live, and route all of the London Stock Exchange traffic to a specific area. We can do different tricks like that. I can't remember off the top of my head whether Global Accelerator can do layer 7 routing; I'd need to double-check the documentation, so maybe I'll leave that as an exercise for you to find out. But the multi-region capabilities of Global Accelerator are really awesome, and I thoroughly suggest you check it out. It obviously costs a little more, but if you're building a service whose users are globally distributed, I'd add this into basically every design that has global users. The reasons: lower latency, AWS's network, less jitter for the customer, faster experience; and who doesn't like a good user experience?

That pretty much wraps up the live side of things, so now we need to tackle the historical side. Okay: for the exact same reasons I've just gone through, I'm going to put Global Accelerator in again, because we want our historical API to be as fast as it can be.
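A quick aside on those Global Accelerator health checks: the decision the health endpoint makes can be as simple as the sketch below. The lag threshold and the inputs are entirely my own assumptions; the only real contract is that a 200 keeps the region in rotation and a failing status lets GA shift traffic elsewhere:

```python
def health_status(replication_lag_s, max_lag_s=1.0, dependencies_ok=True):
    """HTTP status for the health endpoint Global Accelerator polls:
    200 keeps this region in rotation, 503 fails traffic over to
    another region. max_lag_s mirrors the 'live within a second'
    requirement from the brief."""
    if dependencies_ok and replication_lag_s <= max_lag_s:
        return 200
    return 503
```

In practice you'd wire this behind a real route on the live API and let it check Redis and Kafka consumer lag.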
This time I'm picking the ALB, not the NLB, because we want layer 7 routing, so if we want to do different bits and pieces we can. Then we come to our historical API itself; I'll draw a couple of boxes because we'd run multiple instances of it, and the historical API needs to talk to our InfluxDB. At least in the scenario I showed earlier, nothing says we need hard latency targets here, but we may want to think about caching for frequently accessed data. For example, today is the third of August; the data for the second of August is not going to change at all. At least, it shouldn't change at all; if it does, you've got a bit of a problem. Since the data for the second doesn't change, we can cache it a little closer and take load off the database.

Remember, we've got these historical consumers pumping data in left, right and centre from our ksqlDB aggregations (I'll jot that down as well). So we're taking those aggregations into InfluxDB, and that's going to be write-heavy, but we've also got a lot of read traffic. For certain aggregations we can almost prepare them ahead of time, so I'm just going to put a cache on here; whichever technology you decide to pick there is up to you. For example, we might use something like Redis: we check the cache, and if the data isn't there, we drop down to InfluxDB, fetch it, pop it back into the cache, and the next caller has it available. This does depend on what aggregation levels you pick for your historical API, but for something like one day's worth of data, we could grab it, pop it into the cache, and store it under the stock symbol: for the second of August, for Apple, here is the data; or for this 90-day period, here is the data.

The trick, if you're going to cache like this, is to make sure you have a fixed-size step. It might be easiest to explain with the graph: if the graph in your UI is going to look like this, what we want is to segment it into evenly sized points. You might decide there are 100 points shown on the graph, so for one day you show 100 points; for 30 days you might show only 30; or you might show 24, one per hour, or one every 30 minutes, whatever you pick. The key here is to be consistent, because that makes what we store in the cache predictable. Any time we have variability in the size of things, that's when we start getting problems, because we can't calculate keys efficiently. If you do put a cache in here, a couple of the questions I'd expect to be asked are: how big is this cache going to be? If you're using something like Redis, how many servers will you need? To work that out, you need to know how many data points you're storing, how big each data point is, and, more importantly, how many stock symbols you're saving. Again, Redis, or maybe memcached, which is arguably a bit better here, have different options for spreading that load across servers, plus clustering capabilities, so that's definitely something to look into. Whatever technology you pick here, just make sure you're saying: this is going to be a cluster, the data is going to be sharded, I'm using fixed-size points, and the cache keys will be something like the stock symbol followed by the date in ISO 8601 order, working down from year to month to day, and you can even add the time on as well. That way the key becomes a prefix key, and you can scan by prefix: if I want everything in 2022, I can look up by that prefix; if I want everything in August, I just add the next piece. That can become really powerful for what you store in your cache; as long as your data is segmented into fixed sizes, you can calculate the keys from there.

So what else would we need to add to this design for the historical API? Again, we might need some authentication system; I'm not going to go into detail there because it's fairly standard. Really, the point of this architecture, and what the interviewer is looking for, is: how are you going to ingest your data, and how are you going to split your data sources between the historical API and the live API? They're looking for you to justify which technologies you're picking. Just take the Kafka and ksqlDB choice, for example: I suspect a lot of people, at least of those I've interviewed, would go for something like SQS. In SQS you can have a FIFO queue, which guarantees your ordering but is massively limited on performance compared to a non-FIFO queue; and if you go for a non-FIFO SQS queue, you're taking the risk that things could fall out of order. Generally speaking SQS is okay at giving you everything in order, but with anything involving financial data, timestamps and stock ticks, I would not take the risk. What you want in this Kafka/ksqlDB section is an append-only store, and that append-only behaviour lets you say you're absolutely rock solid.
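Before moving on to the other technology choices, circling back to that prefix-key scheme for the cache: here's a small sketch of the idea. The exact key format and the five-minute bucket are my own illustrative choices, and the dict plus `startswith` is a stand-in for something like a Redis SCAN with a MATCH pattern:

```python
from datetime import datetime

def cache_key(symbol, ts, step_minutes=5):
    """Build a prefix-friendly cache key: symbol, then the date in
    ISO 8601 order (year first), then the fixed-size time bucket the
    point falls into."""
    bucket = ts.minute - ts.minute % step_minutes
    return f"{symbol}:{ts:%Y-%m-%d}:{ts.hour:02d}{bucket:02d}"

def scan_prefix(store, prefix):
    """Stand-in for a prefix scan: widen or narrow the prefix to fetch
    a day, a month, or a whole year for one symbol."""
    return {k: v for k, v in store.items() if k.startswith(prefix)}

key_0930 = cache_key("AAPL", datetime(2022, 8, 2, 9, 31))
store = {
    key_0930: 101.0,
    cache_key("AAPL", datetime(2022, 8, 2, 9, 37)): 102.0,
    cache_key("AAPL", datetime(2022, 9, 1, 9, 0)): 99.0,
}
```

Because the date components are ordered year-month-day, "everything for AAPL in 2022" and "everything for AAPL on 2 August" are just different prefix lengths, which only works because every bucket is the same fixed size.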
a rocket and this is absolutely fine the other technology that i would expect people to put in here um would be event store people love cqrs and event event sourcing patterns and they love to put event store in but i can tell you from personal experience that event store does not scale at all um and that's primarily because of the way that people use it um i do think that events this would never be a good fit for this kind of system um because of the way that stuff works it's just generally not high performance but generally the problem with something if you pick like event store in here is when you're doing the subscriptions to the event store instance there's basically two parts you can do the checkpointing on to say where am i in the subscription um you can keep the checkpoints yourself uh on in this so in our historical consumers for example that we've got over on the right here we could keep um the checkpoints here and it would probably be okay because it's then very similar just slower version of kafka at that point and it will be good but what i've seen a lot of companies do is in the central event store here what they would do is keep the checkpointing logic on the master node um which means you can't use any well like you can't scale out efficiently because all the checkpointing is going in to one area now i've only got like three instances here and two instances here of all of these consumers in reality there's probably gonna be like 10 to 15 uh in consumers because of the volume of data um which means that event star instance is gonna get overloaded um and every company that i've seen that's tried to use event store the technology and try to make it scale they've ultimately ripped it out like 18 months later um so this is why the interview process is this is what it's here to design to do is designed to go in and say right what what technology are you picking why are you picking that we're picking kafka because it's append only it's proven to be fast 
it's got the ability to do aggregations and in actually it's got the ability to do duplicate deduplications in that as well to give you accurate data um that makes it really really great um it can do replication because we can just consume and set up replication to get more region if we wanted to which is all win win win win win then we've got influx db why why are we picking this technology again it's a proven technology for the job now i haven't played with aws time stream myself i've heard good things about it and it looks pretty good but i've not tried it myself um so it's one thing i'd also be hesitant in doing in an interview is going right i'm gonna use this technology but i've never used it before because if you do that what you're basically telling the interviewer is hey i've got no idea really what i'm doing this just sounds like the right thing to go and do so that's what i'm going to go and do because what the interviewer is looking for is they're looking for concrete answers of why are you doing something so i am picking influx db because it's a proven technology it's a time series database that easily allows me to do aggregations an answer like that the interview is going to go cool what they'll likely ask you to do is right how how would you scale this to like double the load or something like that um and then that's when you start delving into the world of like influx sharding maybe running multiple clusters so on so forth similar thing with with kafka so taking a look at our other technology choices uh nlb versus arb kind of went over the sale but i think it's worth going over again and it'll be because we're dealing with tcp level um we're not wanting to do any kind of layer 7 routing we just need to distribute the traffic to split the traffic as early as possible anytime you're dealing with a requirement that is i'd say in the thousands of a second if you're in the hundreds you know anything pretty much flies these days um but if you're starting 
The easiest place for us to split it, in this ingestion part of the pipeline, is right at the start. So we create our load of instances, and we can scale this; it can be auto-scaling based on CPU load and all that kind of stuff. We can even do periodic scaling: we know that in certain regions things happen at certain times of day, so we can set this to scale out 10-15 minutes before opening time, and 15-20 minutes after closing time we can start scaling it back down again, saving us loads of money. Same with Kafka, depending on how we set it up, and all the consumers in that region can scale as well. So the NLB distributes our traffic, split nice and early. For Global Accelerator, we're using the latency-based routing we went through earlier. ALB because we need layer 7 routing, and we can use target groups to do canary deployments and things like that. So that's all fine and dandy. Again, the live API will be on Fargate as well. Redis, why are we using that? It's a very, very fast key-value database. We're looking at it for the live API because we want to respond very quickly; it's proven to accept high volumes of input and give you high volumes of output, and it's got capabilities like key sharding. That's why we're using Redis, so this live API is going to be very, very quick. We've gone through the cache; I've purposely not picked anything there, and I'd leave it up to you guys to give your thoughts on what you'd put in that slot. Historic API: again, for me this would be Fargate with auto-scaling, and Global Accelerator, exactly the same as before. So we've got a pretty decent tech stack here, I would say. What I'd be doing now, from an interviewer's perspective, is zooming out a little bit and starting to throw you some curveballs.
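The "scale out before open, scale in after close" idea can be sketched as a tiny helper that computes the scheduled-scaling window. The warm-up and cool-down minutes are the ones mentioned above; the market times are made up for the example:

```python
from datetime import date, datetime, time, timedelta


def scaling_window(open_at, close_at, warmup_min=15, cooldown_min=20):
    """Return (scale_out_at, scale_in_at) around one market session.

    Scale out shortly before the open so capacity is warm, scale back
    in after the close to save money. Naive local times, illustration only.
    """
    day = date(2024, 1, 2)  # arbitrary trading day for the demo
    scale_out = datetime.combine(day, open_at) - timedelta(minutes=warmup_min)
    scale_in = datetime.combine(day, close_at) + timedelta(minutes=cooldown_min)
    return scale_out.time(), scale_in.time()


# Hypothetical exchange hours: 08:00 open, 16:30 close.
out_at, in_at = scaling_window(time(8, 0), time(16, 30))
print(out_at, in_at)  # 07:45:00 16:50:00
```

On AWS the equivalent would be scheduled actions on the Auto Scaling group, one per exchange per region.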
If this is kind of the typical answer and structure I would expect a senior software engineer to go through (obviously the technology choices and some of the boxes will differ, but this level of detail is what I'd expect at senior level), then when you're looking at staff and principal level engineers I'd be looking for them to go a layer deeper. I'd want them to say, "right, say I'm in eu-west-1, that's my region, here is my VPC layout, and here is what I'd put where". That's the kind of answer I would expect from staff-plus engineers. So, inside here I'm going to have my public subnets, these are my private subnets, and these are my database subnets. We're going now into the networking side of things, because it's all very well designing this kind of solution, but a lot also comes down to how you're going to secure it all. How are you going to secure these ingestors? How are you going to make sure that your Kafka and your InfluxDB can't be accessed from the internet? Easiest solution: we put all of the data stores, Kafka, InfluxDB and Redis, in the database subnets, because the database subnets can then be limited to only accept traffic from the private subnets, and the privates can only accept traffic from the publics. Inside the private subnets is where all of our APIs are going to live, distributed across them, and the public subnets are where our ALB and our NLB are going to live. We also need an internet gateway. We don't have much outbound traffic, so I don't think we'll need a NAT gateway, but if you had other things, like sending your logs outbound, you're going to need a NAT gateway as well. So those are all the bits and pieces we'd have in place.
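The three-tier ingress rule described above (databases only from private, private only from public, public from the internet) is simple enough to express as a tiny policy table. The tier names are this design's, not an AWS construct; security groups and NACLs are what actually enforce it:

```python
# Allowed ingress per destination tier, mirroring the subnet design above.
ALLOWED_SOURCES = {
    "database": {"private"},   # Kafka, InfluxDB, Redis
    "private": {"public"},     # the API services
    "public": {"internet"},    # ALB / NLB
}


def ingress_allowed(dest_tier, source_tier):
    """Would the subnet design permit traffic from source_tier to dest_tier?"""
    return source_tier in ALLOWED_SOURCES.get(dest_tier, set())


print(ingress_allowed("database", "private"))   # True
print(ingress_allowed("database", "internet"))  # False
```

A check like this is handy in infrastructure-as-code tests: assert every declared security group rule fits the table before applying it.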
If you're looking at it from more of an architecture lens, then we're going to be looking at what security groups we have assigned in there. At this point we start to get a bit beyond what you'd probably be asked in an interview, but just as another thing for you guys to learn: what security groups are we going to have? There are rules we need to consider for different components, and decisions about which areas we place security groups in. We also need to think about what NACLs we're going to put in. Generally speaking, developers stop at the security group level and forget about NACLs. NACLs are basically a very dumb version of security groups that you can put on and say, "right, there's no SSH traffic anywhere in this VPC", and you can apply them at different levels, like subnets. The other thing I'd be looking for a staff-plus level engineer to say in the interview is "I'm going to be deploying these across different AZs", so this will be az1, this will be az2 and this will be az3, or a, b, c, whichever, depending on your cloud provider and the naming they use. We want the engineer to say, "right, I'm spreading these across different availability zones". Why? Because each availability zone is essentially a data centre within a certain geographical distance of the next data centre, and a group of those generally makes up a region, across the different cloud providers. So what we're saying here is we can lose this entire data centre but still function off the other two, which gives us an element of disaster recovery; we could even lose two. One thing, if you ever do go into this level of VPC design, is that these load balancers, the ALBs and network load balancers, need to be in each of the availability zones.
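Spreading instances evenly across AZs, as described above, is essentially round-robin placement. A minimal sketch (instance and AZ names are made up) shows why losing one zone only costs about a third of capacity:

```python
def spread_across_azs(instances, azs):
    """Round-robin placement: losing one AZ loses roughly 1/len(azs) capacity."""
    placement = {az: [] for az in azs}
    for i, inst in enumerate(instances):
        placement[azs[i % len(azs)]].append(inst)
    return placement


azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
placement = spread_across_azs([f"api-{n}" for n in range(6)], azs)
for az, hosts in placement.items():
    print(az, hosts)
```

In practice an Auto Scaling group with subnets in all three AZs does this balancing for you; the sketch just makes the failure-domain arithmetic visible.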
If they're not, and that single availability zone goes down, then you've completely lost the region, or lost access to your underlying services, shall I say. The other things I'd be asking as an interviewer are: how are you going to observe the system? How do you know that it's working? This is an interview question I basically always ask, so if you ever come to an interview with me, first of all say hi, and secondly expect me to ask how you know the system is running after you deploy it to production. I ask this of anyone from mid-level and above, and it's purposely vague, because I want to see how you think about these problems. So, how do I know that my ingestor is actually ingesting? The short answer is metrics. The more senior you get, senior, staff, principal, the more detail I'd expect: "right, what are you measuring?" "I'm measuring the total throughput." Okay, but that still doesn't tell me the system is actually working. What does? Generally, the answer should be alerts. So what am I alerting on? Well, I'm alerting on that throughput, or on something in the metrics we're capturing. Another one might be latency; latency is a pretty good thing to always be measuring, throughput and latency together. And I'll say, "okay, that's pretty decent, you're alerting on those. What happens when you receive that alert?" After the alert, you're going to want some kind of runbook, and the runbook should be crystal-clear instructions for whoever is responding to the incident: this is why it's happening, these are the things to check, here are a couple of different paths of remediation, and what to do after remediation as well, plus other things like whether we need to let customers know, publish status updates, and so on.
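The throughput-and-latency alerting described above boils down to threshold checks where each alert name maps to a runbook. A minimal sketch (the metric names, thresholds and runbook hints are invented for the example):

```python
def evaluate_alerts(metrics, min_throughput=1000, max_p99_ms=250):
    """Return the alert names to fire; each should map to a runbook entry."""
    alerts = []
    if metrics["throughput_per_s"] < min_throughput:
        # Runbook: check NLB target health, ingestor logs, Kafka consumer lag.
        alerts.append("ingest-throughput-low")
    if metrics["p99_latency_ms"] > max_p99_ms:
        # Runbook: check Redis latency, Fargate CPU, recent deployments.
        alerts.append("live-api-latency-high")
    return alerts


print(evaluate_alerts({"throughput_per_s": 400, "p99_latency_ms": 120}))
```

In a real setup these would be CloudWatch or Prometheus alert rules; the point is that every alert carries a pointer to the remediation steps, not just a red light.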
If you're staff-plus, then there's one more bit I'd potentially be looking at for different elements of the system. Let's take our load balancers down here on the live API. For a staff-plus engineer, I'd want more than metrics and alerts; I'd want synthetics. If you're not familiar with synthetic testing, it's basically a little actor that sits out there, maybe in different regions, constantly hitting your API in a benign way. They might be authenticated and so on, but they're essentially telling you the real-world user performance to expect from different regions around the world. These synthetic tests can feed into things like scaling and health decisions. For example, if a probe running from Brazil can't hit my European API for whatever reason, I might want to use capabilities in Route 53 to geographically shift that traffic to a different data centre. That's one of the powers of synthetics: you can make intelligent infrastructure decisions and keep a live view of what's happening in production. I'd put synthetics on all of the critical paths, so the live API and the historic API; I'm not massively worried about things like the ingestor, because that's covered by the metrics and alerts. Synthetics give you the end-user experience, and they feed back into your metrics and your alerts; they may create new metrics and new alerts. What other curveballs would I throw in here? I think the other one would be long-term data storage. Given that the data streams are publishing hundreds of thousands, millions, of messages a day into the system, I'd be asking: "okay, so what are your retention periods? Have you even thought about retention periods, or about archiving? How are you going to add a BI workload?"
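The synthetics-driven failover decision above can be sketched as picking the first region, in preference order, whose probe succeeded within a latency budget. The region names and probe shape are hypothetical; with Route 53 this would be health checks plus failover or latency routing policies:

```python
def pick_healthy_region(probe_results, max_latency_ms=500):
    """Pick the first region whose synthetic probe passed within budget.

    probe_results: dict of region -> (ok, latency_ms), in preference order.
    Returns None if every probe failed, signalling a full outage.
    """
    for region, (ok, latency_ms) in probe_results.items():
        if ok and latency_ms <= max_latency_ms:
            return region
    return None


# Synthetic from Brazil: preferred eu-west-1 is unreachable, fall back.
probes = {"eu-west-1": (False, 0), "us-east-1": (True, 180)}
print(pick_healthy_region(probes))  # us-east-1
```

The value of doing this off synthetics rather than raw server metrics is that it measures what a real client sees, including DNS, TLS and the network path.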
Unless you spend a proverbial ton of money, you're going to want quite a low retention on this Kafka cluster, maybe a couple of hours, because of the amount of data coming in; maybe you could get away with a day or two, but you're not going to hold months or years' worth of data in Kafka. InfluxDB you'll do a lot better on, but you still can't store endless amounts of data; you're going to need a limit on it. So how are you going to store it for years? How long do you need to store it for regulation purposes? For example, in the UK we can't delete financial records for seven years: that's emails, that's data, any data relating to a customer we can't delete for seven years. So what do we do about that? Well, there's nothing in the regulations saying we have to keep it in the same data source. If somebody gives you a requirement like "you now have ten years' worth of data, what are you going to do with it?", the simple answer is that you archive off to something like S3 and put it in the infrequent access storage tier, or similar. If you're not familiar, S3 has different storage tiers that give you different performance and reliability characteristics at different price points; infrequent access gives you pretty fast access without breaking the bank on storage. So you can start extracting your data out of InfluxDB and putting it into S3 in some sensible format, possibly pre-aggregated. The beauty of that is that this database, in theory, unless you're adding new data points, should remain pretty much a fixed size, give or take a few gigabytes, as you archive stuff off. Generally speaking, unless loads and loads of new stock symbols get added, it's going to be a pretty predictable database size.
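The retention-and-archive step above amounts to splitting records at a cutoff date and building object keys for the old ones. A small sketch, where the bucket name and key layout are made up for illustration (a real layout would be chosen to suit your BI queries):

```python
from datetime import date, timedelta


def split_for_archive(records, today, retention_days=30):
    """Split records into (keep, archive) and build S3-style keys for old ones.

    Each record is {"symbol": ..., "day": date}; anything older than the
    retention window gets a hypothetical s3://contoso-ticks-archive/... key.
    """
    cutoff = today - timedelta(days=retention_days)
    keep, archive = [], []
    for r in records:
        if r["day"] < cutoff:
            key = f"s3://contoso-ticks-archive/{r['symbol']}/{r['day']:%Y/%m/%d}.parquet"
            archive.append((key, r))
        else:
            keep.append(r)
    return keep, archive


keep, archive = split_for_archive(
    [{"symbol": "AAPL", "day": date(2023, 11, 1)},
     {"symbol": "AAPL", "day": date(2023, 12, 20)}],
    today=date(2023, 12, 31))
print(len(keep), len(archive))
```

Run as a daily batch job, this keeps the hot time-series database at a roughly constant size while the seven-year regulatory copy accumulates cheaply in S3.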
So, that's a bit over the 20 or 30 minutes you'd probably get; I think when I did this design I had 40 or 45 minutes after the brief to go through what I was doing and why. In the actual interview where I did this one, I used DynamoDB rather than InfluxDB. That was a choice I made in the moment and regretted as soon as I said it, but I had to roll with it. I think the reason I picked DynamoDB over InfluxDB was that the change notification system is a lot better in DynamoDB, I can apply automatic TTLs, and obviously it's a managed service, so as long as I set everything up correctly in terms of my partitioning strategy it's going to be a pretty good time. Again, I'd be using something like "AAPL-2022" as the partition key, with some form of date and time as the sort key for aggregation, and then reporting the aggregated results. When I'm looking up specific aggregations in the historic API, I can essentially just say "select the top five records from this query, ordered by this sort key", which is good. The only other thing I've got for this live stream today is one more curveball I can see being added to this: what happens when we want to run other processes off this data? So it's not just going to the historic API or the live API or InfluxDB; we might want to do some real-time trading or something like that. The easiest point to plug that kind of thing into the system is Kafka. You just add a new consumer, and you can either take the aggregated trades we went through earlier or take the live tick topics. If you didn't want to do that, the only other way you could really do it is traffic mirroring at the NLB level, essentially mirroring the traffic to two different NLBs and piping it off to a completely different location.
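The DynamoDB key design mentioned above ("AAPL-2022" as the partition key, a timestamp sort key, query the latest N in descending order) can be mimicked in-memory to show the access pattern. This is a sketch of the idea, not the DynamoDB API; the item shapes are invented:

```python
def partition_key(symbol, year):
    # e.g. "AAPL-2022": symbol plus year keeps any one partition bounded.
    return f"{symbol}-{year}"


def query_latest(items, pk, limit=5):
    """Mimic 'query by partition key, order by sort key descending, limit N'."""
    matching = [i for i in items if i["pk"] == pk]
    return sorted(matching, key=lambda i: i["sk"], reverse=True)[:limit]


# Hypothetical aggregated daily closes; ISO-ish sort keys sort lexically.
table = [{"pk": "AAPL-2022", "sk": f"2022-01-{d:02d}T16:30", "close": 170 + d}
         for d in range(1, 10)]
top = query_latest(table, partition_key("AAPL", 2022))
print([i["sk"] for i in top])
```

With the real boto3 `Table.query` this maps to `KeyConditionExpression` on the partition key, `ScanIndexForward=False` and `Limit=5`; the sketch just shows why an ISO-formatted sort key makes "latest five" cheap.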
But to be honest, I wouldn't recommend doing that, certainly not at this scale. Your easiest entry point will always be Kafka, and then it's just a choice of whether you take the aggregated feeds or the live stock topics. As long as you've got your topics in order, in terms of what your structure is, you should be fine adding loads of consumers; Kafka is very, very good at those kinds of read-and-write scenarios. There's a load of material on the LinkedIn engineering blog, if you've never seen it before, along the lines of "how we redesigned our log infrastructure on Kafka and now process millions upon millions of messages per second". That's how I know Kafka can scale, although obviously you'd need to get into the specifics. So yeah, that's how I would probably go and answer this as an interview question. Hopefully everybody has learned something. Feel free to drop me questions, either on Twitter or in the comments; I'm more than happy to answer them, and if why I've done something a certain way isn't clear, I'm more than happy to go through that too. I'm planning on doing more of these live streams roughly every week, and I'll put the stream placeholder up about a week or so in advance so people can see when things are coming. I think the next one I'll do is more of a networking one, on how to design a multi-account structure with AWS; there are definitely a few interesting things to consider when doing that, especially if you want to make it secure. So, unless there are any last-minute questions: thank you everyone for coming and joining today, and thank you to the replay squad, if you're watching this after the fact, I appreciate you. I'm definitely going to be doing more of these, and I hope you enjoy the rest of your
day, wherever you are in the world. So yeah, take care, everyone.
If you enjoyed this video, consider subscribing to the YouTube channel for more content like this.