System design questions interview cheatbook - FB and Google


1. Large-scale file sharing questions (Dropbox, Google Drive, etc.). What is the core part to focus on here so I impress the interviewer and don't scatter my focus needlessly? (I won't repeat this question for the other questions below; assume it's the same.)

There are multiple points to focus on here.

Mention the following:

- A load balancer for incoming traffic.
- Databases, with tables that keep track of all users.
- Tables that keep track of what every user shares (an index of whom they share each file with).
- Partitioned databases, since you need to chunk the data across different database nodes.
- Tables that keep track of the partitions for each file: to return a file, you need to know the databases where the parts live, and the partition indexes as well, so you can recombine the data (a chunking sketch follows this list).
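The core idea is chunking plus metadata. As a minimal sketch, assuming a fixed 4 MB chunk size and a hypothetical number of database nodes (`split_into_chunks` and `reassemble` are illustrative names, not any real Dropbox API), this is roughly what the partition tables must record:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB per chunk (illustrative choice)
NUM_DB_NODES = 8               # hypothetical number of partitioned DB nodes

def split_into_chunks(data: bytes) -> list[dict]:
    """Split a file into fixed-size chunks and record the metadata needed
    to reassemble it: chunk index, content hash, and the node holding it."""
    chunks = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunks.append({
            "chunk_index": offset // CHUNK_SIZE,
            "sha256": digest,
            # Simple hash-based placement; a real system would use
            # consistent hashing so nodes can be added without reshuffling.
            "db_node": int(digest, 16) % NUM_DB_NODES,
        })
    return chunks

def reassemble(rows: list[tuple[int, bytes]]) -> bytes:
    """Recombine chunks fetched from their nodes, ordered by chunk index."""
    return b"".join(chunk for _, chunk in sorted(rows))
```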

Resources can be - medium.com/@narengowda/system-design-dropbo..

2. Messenger type of questions (WhatsApp, Facebook Messenger, etc.)

I found this resource to be very useful and in line with a realistic timeline. I used it in my preparation for the chat system - kasunprageethdissanayake.medium.com/whatsap..

and also adding this - leetcode.com/discuss/interview-question/sys..

and leetcode.com/discuss/interview-question/124..

You will get a lot of constraint points that are valuable in the design, and also a good discussion of the pain points.

3. Large storage/caching systems design (design Redis, Hadoop DFS)

Caching best practices: when implementing a cache layer, it's important to understand the validity of the data being cached. A successful cache yields a high hit rate, meaning the data was present when fetched; a cache miss occurs when the requested data was not in the cache. Controls such as TTLs (time to live) can be applied to expire data accordingly. Another consideration is whether or not the cache environment needs to be highly available.

The standard way to implement a cache is to use a data structure that lets us access a value by a given key in constant time. One such data structure is a hash map. The hash map itself could be implemented in multiple ways; one common way is hashing with a linked list (colliding values chained together in a linked list). Say our hash map size is N and we wish to add a key K with value V: for the key K, generate i = hash(K) % N and store the value at H[i]. Now we can save key-value pairs in memory and retrieve them whenever we need them. But since we have limited RAM, we also need a way to evict the least used entries whenever we need more space.
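A minimal in-memory sketch of this idea, using Python's `OrderedDict` (which already handles the hashing and collision resolution internally) with least-recently-used eviction:

```python
from collections import OrderedDict

class LRUCache:
    """In-memory cache with least-recently-used eviction.
    OrderedDict gives O(1) get/put and remembers insertion order."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None  # cache miss
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```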

I would suggest you follow this entire resource - medium.com/@narengowda/designing-the-cachin..

It is very good at covering the system design of storage.

Following is a list of open-source in-memory cache products:

- Redis
- Memcached
- VoltDB
- Aerospike DBS
- Apache Ignite

4. Video streaming (YouTube, Netflix)

I found this resource very useful - medium.com/@narengowda/netflix-system-desig..

YouTube - eileen-code4fun.medium.com/system-design-in..

If I were to design it myself:

Gathering requirements

The problem description is intentionally ambiguous, as the interviewer wants to check whether we're able to ask reasonable questions, collect requirements, and define a core set of features for a minimum viable product (MVP) that we can start to work with.

In this case, I'd like to ask a few questions to make things clear, including

Q: What are the main features of the video sharing platform?
A: The system will be a public platform where users can watch and upload videos.

Q: Who will be the users of the system?
A: Users will be around the world.

Q: How are they going to use it?
A: Users will use various kinds of devices to visit our service.

Q: Are there any external dependencies? What are the upstream and downstream systems?
A: The system will leverage an existing authentication system.

Once I get answers to the above questions, I'd confirm with the interviewer that I'll design a video platform that enables users to watch and upload videos. Authentication is out of the scope of this problem.

Based on the information gathered so far, I come up with the functional and non-functional requirements below and confirm them with the interviewer.

Functional requirements

- Users can watch videos on the platform
- Users can upload their own videos

Non-functional requirements

- The platform should be highly available
- The response time for users in different regions should be at the same level
- The platform should scale as the userbase increases

Estimation

The goal of this step is to come up with estimated numbers for how scalable our system should be.

We have two types of requests, read and write. Write requests (uploading videos) take much longer and are usually much less frequent than read requests (recall how often you watch videos on YouTube versus how often you upload them).

So, let's focus on read requests first. As an easy start, assume our service receives and completes 1000 requests per second (RPS). I use requests to reflect common server requirements rather than functionality-specific requirements (e.g. tweets, views, ...) because it's business independent. Functionality-specific requirements will eventually be mapped to server requirements anyway.

For one day, we'll have 1000 × 24 × 60 × 60 ≈ 1000 × 30 × 3000 = 90 million = 90M requests. For one year, we'll have 90M × 12 × 30 = 90M × 360 ≈ 100M × 300 = 30 billion = 30B requests. For five years, we'll have 30B × 5 = 150 billion = 150B requests in total.

The next assumption I'm going to make is the response time. Assume our response time SLA is 200ms.

Then I need 1000 / (1s / 200ms) = 200 threads in total to handle 1000 RPS. So the next question is how many servers I need to get 200 threads. One simple formula for estimating the ideal Java thread pool size is:

Number of threads = Number of available cores × Target CPU utilization × (1 + Wait time / Service time)

Wait time is the time spent waiting for IO-bound tasks to complete, say waiting for an HTTP response from a remote service. Service time is the time spent being busy, say processing the HTTP response, marshaling/unmarshaling, etc. Because our video sharing web service is not a computing-intensive application (though it can be, considering video encoding and compression), I assume the wait time / service time ratio will be relatively large, say 50. Also assume our server has 2 CPU cores running at 50% utilization.

Then each of these servers can support 2 × 0.5 × (1 + 50) ≈ 50 threads, and since we need 200 threads to handle 1000 RPS, we need 200 / 50 = 4 servers.
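To make the arithmetic explicit, here is a small sketch of the read-path sizing, using only the assumptions stated above (1000 RPS, 200 ms SLA, 2 cores, 50% utilization, wait/service ratio of 50):

```python
import math

rps = 1000                # target requests per second
sla_seconds = 0.2         # 200 ms response time SLA
threads_needed = int(rps * sla_seconds)       # 200 threads in flight

cores = 2                 # CPU cores per server (assumed)
utilization = 0.5         # target CPU utilization (assumed)
wait_over_service = 50    # IO-heavy workload (assumed)
threads_per_server = cores * utilization * (1 + wait_over_service)  # ~51

servers = math.ceil(threads_needed / threads_per_server)
print(threads_needed, threads_per_server, servers)  # 200 51.0 4
```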

Let's switch to write requests (uploading videos). Assume the read/write (or watch/upload) ratio is 100, so we're expecting 1000 / 100 = 10 write requests per second. Note that uploading time varies, because video size and network speed play important roles here, so this estimate is even more subjective.

Assume the average video size is 500 MB and the network bandwidth on the customer side is 100 Mbps. Then:

- Each video would take 500 MB / (100 Mbps / 8) = 40 seconds to upload.
- Since we have 10 upload requests per second, we have 40 × 10 = 400 concurrent uploads.
- The bandwidth requirement on our side will be 400 × 100 Mbps = 40,000 Mbps.
- Total memory usage will be 40,000 Mbps / 8 = 5000 MB = 5 GB; if each server has 2 GB of RAM, we need 3 servers for uploading.
- However, because we have 400 concurrent uploads, we need 400 threads to serve them. This time the wait time / service time ratio is close to 1, and if we still use 2-core servers running at 50% utilization, each server can support 2 × 0.5 × (1 + 1) = 2 threads. So we need 400 / 2 = 200 servers for video uploading.

Design goals

Latency - Is our service latency sensitive (or, in other words, are requests with high latency and failing requests equally bad)? Yes; to provide a great customer experience, latency is very important for a video service.

Consistency - Does our service require tight consistency? Not really; it's okay if things are eventually consistent.

Availability - Does this problem require 100% availability? Yes

System interface definition

At this step, I'm designing the APIs that our service exposes to clients. Based on the features and requirements gathered in the first step, the video platform I'm going to design is apparently a web application. The best practice is to decouple the frontend and backend so that they can evolve independently as long as they obey the contract (the backend service provides an API and clients consume it).

There are different API design styles: SOAP, REST, and GraphQL. I'll create a set of RESTful APIs, since REST is lightweight compared with SOAP and is the most broadly supported and popular style among developers.

Designing a RESTful API requires us to first identify resources and then map HTTP methods to operations.

Apparently, we have at least two resources, video and user.

Our service supports operations including uploading videos, playing videos, creating users, and getting user info. So we can have the following APIs.

Upload video API
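As a hedged sketch of what the upload and playback endpoints could look like (using FastAPI purely for illustration; `save_to_object_store` and the CDN URL are hypothetical, not part of any real platform's API):

```python
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/v1/videos")
async def upload_video(file: UploadFile, title: str):
    """Accept a video upload. A real system would stream the bytes to an
    object store and enqueue an encoding job rather than respond inline."""
    video_id = await save_to_object_store(file)  # hypothetical helper
    return {"video_id": video_id, "title": title, "status": "processing"}

@app.get("/v1/videos/{video_id}")
async def get_video(video_id: str):
    """Return playback metadata, e.g. a CDN manifest URL, for a video."""
    return {
        "video_id": video_id,
        "manifest_url": f"https://cdn.example.com/{video_id}/manifest.m3u8",
    }
```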

You can follow a proper answer in this link - leetcode.com/discuss/interview-question/sys..

5. Large-scale searching (Google Search, Twitter search [real time])

The search backend ramps up the complexity in this system design. We can assume we only accept text queries, that is, a collection of words. The backend returns the results most relevant to the input words. There is no conjunction or negation logic among the input words.

Pain point - scalability. The basic idea for scaling search is sharding the index, something Elasticsearch famously employs. Essentially, instead of building one gigantic search index, we build several smaller ones, each responsible for a subset of webpages. At search time, we broadcast the query to all index shards and merge the results. This leads to more data shuffling, but the gains are that, first, we can now fit a search index on a single physical machine, and second, the search is executed in parallel against all the index shards.
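A minimal scatter-gather sketch of that broadcast-and-merge step, assuming each shard object exposes a `search(query, top_k)` call returning `(score, url)` pairs (a stand-in for the real shard RPC):

```python
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(query: str, shards: list, top_k: int = 10) -> list:
    """Scatter the query to every index shard in parallel, then merge
    the partial results by score (scatter-gather)."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard.search(query, top_k), shards)
    merged = [hit for partial in partials for hit in partial]
    merged.sort(key=lambda hit: hit[0], reverse=True)  # highest score first
    return merged[:top_k]
```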

You can use this resource to read about it - kousiknath.medium.com/system-design-design-..

eileen-code4fun.medium.com/system-design-in..

6. Geographic map-based questions (Uber, Google Maps)

Pain points - How to handle price surges? According to Uber, surge pricing helps match supply and demand: by increasing the price, more cabs will be on the road when demand is higher. How to handle total datacenter failure? It doesn't happen very often, but there could be an unexpected cascading failure, or an upstream network provider could fail. Uber maintains a backup datacenter, and switches are in place to route everything over to it. The problem is that the data for in-process trips may not be in the backup datacenter; rather than replicate the data, they use driver phones as a source of trip data.

resource can be - kousiknath.medium.com/system-design-design-..

Uber - medium.com/@narengowda/uber-system-design-8..

7. Aggregation type of questions (aggregate logs, counts, etc.)

Logs are meant for analysis and analytics. Once your logs are stored in a centralized location, you need a way to analyze them. There will be pain points related to alerting on and analyzing the logs, and you will receive questions based on indexed alerting and several related points; for these you can refer to the resources listed below. A few constraints to consider:

- Asynchronous: you do not want the user experience to get blocked on log processing.
- Lossless (as much as possible): for instance, provide 99.9% delivery guarantees.
- Latency: are these logs being used for real-time analysis and alerting for site-down events? If so, aim for < 1 min latency. Is this used for offline reporting? Then 20 min latency might be doable.
- Security: are you sending PII in the user's data logs? If so, the logs must be encrypted during transfer, and every service must have the right keys to decrypt them before processing.

Services involved:

- Log-receiving service: this service receives logs and performs basic validations - size, data types, required fields like user ID, etc. It then batches these logs in memory (say, 30k records or 3 minutes) and stores each batch in a store, say an S3 bucket file; a batching sketch follows this list. (Depending on how fault-tolerant you need to be, you can store each log individually too, since in-memory batches can be lost if the server crashes before a persist.) This initial store ensures that you persist a raw copy of the log data. (This is similar to the producer service referred to in Kafka answers.) This service can also be preceded by a queue like SQS, to decouple the log-sending service/client from the log-receiving service and handle scale.
- Log-processing service/worker: this reads the S3 file contents, transforms them into the data model needed for SQL, and writes the required rows into SQL. You can also drop this transformed data into another queue so you can decouple and scale writes-to-SQL independently of log transformation. The SQL query engine will take care of updating indexes and atomic/consistent writes. Partitioning is something to discuss: you can have the timestamp be the partition key (which likely means hot partitions for recent logs, but good range querying), or the user ID (less likelihood of hot partitions for an Amazon-scale user base), or a hash of either the timestamp or the user ID (good distribution, poor range querying).
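A minimal sketch of the batching behaviour of the log-receiving service, assuming a hypothetical `store_batch` persistence callback (e.g. an S3 put) and the "30k records or 3 minutes" policy from the text:

```python
import json
import time

class LogBatcher:
    """Buffer validated logs in memory and flush them as one object
    when either the batch size or the maximum batch age is reached."""

    def __init__(self, store_batch, max_records=30_000, max_age_sec=180):
        self.store_batch = store_batch  # hypothetical persistence call
        self.max_records = max_records
        self.max_age_sec = max_age_sec
        self.buffer = []
        self.opened_at = time.monotonic()

    def receive(self, log: dict):
        # Basic validation: required fields before accepting the record.
        if "user_id" not in log or "timestamp" not in log:
            raise ValueError("missing required fields")
        self.buffer.append(log)
        if (len(self.buffer) >= self.max_records
                or time.monotonic() - self.opened_at >= self.max_age_sec):
            self.flush()

    def flush(self):
        if self.buffer:
            self.store_batch(json.dumps(self.buffer))  # e.g. write to S3
        self.buffer = []
        self.opened_at = time.monotonic()
```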

resource - medium.com/@karthi.net/how-to-aggregate-doc..

medium.com/@shivama205/real-time-log-manage..

8. Web crawler type questions

One experience on a web crawler - initial proposal: start with the root page and, based on a hash function, decide whether that page is for you (the node) or not.

What if one node fails or does not work? When encountering a page, ping to see whether the node handling that page is online; if not, use a secondary hash function to determine an alternate handler. If you are the alternate handler, handle the page; if not, ping to see whether the alternate handler is online. Do this for all the hash functions you decide to have; more hash functions means less impact from the failure of one node.
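A minimal sketch of that ownership scheme, assuming three salted hash functions and a hypothetical `is_online` health-check callback:

```python
import hashlib

def _salted_hash(url: str, salt: int) -> int:
    """One of several salted hash functions over the URL."""
    return int(hashlib.sha256(f"{salt}:{url}".encode()).hexdigest(), 16)

def responsible_node(url: str, nodes: list, is_online):
    """Walk the ordered list of hash functions: the first one whose
    chosen node is online handles the page."""
    for salt in range(3):  # three hash functions = two fallback handlers
        node = nodes[_salted_hash(url, salt) % len(nodes)]
        if is_online(node):  # hypothetical health check
            return node
    return None  # all candidate handlers are down
```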

How do you know when the crawler is done? Pass around a timestamp of the last crawled page. If the timestamp gets back to you without changing, then you are done.

In a System design question, understand the scope of the problem and stay true to the original problem. The scope was to design a web crawler using available distributed system constructs and NOT to design a distributed database or a distributed cache.

A Web crawler system design has 2 main components:

1. The Crawler (write path)
2. The Indexer (read path)

Make sure you ask about the expected number of URLs to crawl (write QPS) and the expected number of query API calls (read QPS). Also ask about the SLA for the query API: if it's in the tens of milliseconds at, say, the 90th percentile, you'll probably need to cache the query results.

Begin with these high level components, then break them down into smaller sub-components, then connect these to form a coherent whole.

Crawler

For crawling, you need a "seed of URLs" to begin with. You'd want to put the URLs in a queue, and the queue workers would work on one URL at a time. Each queue worker, given a URL, has to:

- Extract text from the URL and send it to a Document Indexing Service.
- Insert any links found in the page back into the queue. Before inserting, the links are looked up (and stored) in a global NoSQL store, to ensure they weren't already crawled. We use a NoSQL store (and not a SQL database) because we're doing lookup operations only and don't require expensive joins.

Eventually the queue will become empty. At this point, the "seeder" will reseed the queue with seed URLs and the whole process restarts. A worker-loop sketch follows this list.
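Here is the worker loop in sketch form; `queue`, `seen_store`, `index_service`, `fetch`, and `extract_links` are all stand-ins for the components just described, not real library APIs:

```python
def crawl_worker(queue, seen_store, index_service, fetch, extract_links):
    """One crawler worker: dequeue a URL, index its text, and enqueue
    newly discovered links after deduplicating against a global store."""
    while True:
        url = queue.dequeue()   # hidden from other workers for a while
        if url is None:
            break               # queue drained; the seeder will reseed
        page = fetch(url)
        index_service.submit(url, page.text)  # send text for indexing
        for link in extract_links(page):
            # Conditional insert: only enqueue links we haven't seen.
            if seen_store.put_if_absent(link):
                queue.enqueue(link)
        queue.delete(url)       # ack before the visibility timeout expires
```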

Scaling up the crawler (only if the number of websites to crawl is in the billions): your queue could be a distributed message queue (such as SQS or Azure Service Bus), and your NoSQL store could be DynamoDB. The interviewer would most likely know that both message queues and NoSQL stores maintain replicas (typically master-slave replication) for fault tolerance, and repartition themselves via an algorithm like consistent hashing for scalability. All distributed queues have a visibility timeout: when an element is dequeued, it still remains in the queue but is made invisible to other dequeue requests until the visibility timeout has elapsed. A worker that is handling the dequeued element must explicitly delete it from the queue within that window.

Challenges for the Crawler:

- How would you handle throttling from your NoSQL store (say, because you have too many crawlers attempting to look up and write URLs to it)? If you try an exponential retry algorithm in your worker, your message queue may release the URL being crawled to another worker.
- How would you handle dead links? You'd probably want to maintain a blacklist of links that returned 404 so that your workers don't put these in the crawler queue the next time around.
- How do you handle crawling of temporarily offline websites? Assume your worker connected to the website but it took 40 seconds to respond with a 503; by that time, your message queue already released the same URL to another worker, who'd attempt to reconnect to the same website and suffer the same fate.
- How would you handle websites that failed to get crawled? You'd probably want to store them in a dead letter queue to be inspected later.
- Would you respect robots.txt files while crawling? Maybe you could store the domain name + /* of such sites in the blacklist NoSQL store.
- How would you throttle requests from your crawlers running in parallel against different pages of a single website (like Wikipedia)? Maybe a message queue is not the right fit in this design. You could probably use a streaming queue (like Kafka, Kinesis Streams, or Azure Event Hubs) where the domain name of the URL is the partition key; this means all sub-URLs within a domain are handled by one worker only, but it leads to obvious load-balancing issues. Alternatively, you could invest in a rate limiter that ensures one worker does not open more than n connections to a single website (a sketch follows this list). But what is a good value of n? Wikipedia can probably handle thousands of concurrent connections, but a small company's website could cave in. So the value of n depends on the domain being crawled and will need tweaking via trial and error, which means you'll need another NoSQL store that maps domain names to n, which the rate limiter will need to cache. Next question: what should the worker do if the rate limiter disallowed it from accessing the URL? Should it keep retrying? If yes, what if the message queue releases the same URL to another worker? It makes sense for the worker to drop it and go for the next URL in the queue; this will cause the message queue to release the current URL to another worker after the visibility timeout, and that worker might have a higher chance of succeeding. But what if the next URL is also from the same domain?
- All in all, you should also discuss what logs/metrics you'd emit and how these are analyzed to make the crawling better. Some metrics to emit would be how many times a worker was rate limited and the latency of every operation (such as reading the entire contents of a URL, time taken for ranking the text, indexing it, etc.). For websites that caved in or were unresponsive, apart from the metrics, you'd also write them to special log files, which are then machine-learned over to produce the blacklist NoSQL store or to recompute the number of connections allowed for a domain, to be used by the rate limiter.
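A minimal per-domain rate-limiter sketch, assuming a per-domain connection cap n as discussed above (the `limits` map and default value are illustrative):

```python
import threading
from urllib.parse import urlparse

class DomainRateLimiter:
    """Cap concurrent connections per domain with one semaphore each.
    `limits` maps domain -> max connections; unknown domains fall back
    to a conservative default."""

    def __init__(self, limits: dict, default_limit: int = 2):
        self.default_limit = default_limit
        self.semaphores = {
            domain: threading.Semaphore(n) for domain, n in limits.items()
        }
        self.lock = threading.Lock()

    def try_acquire(self, url: str) -> bool:
        domain = urlparse(url).netloc
        with self.lock:
            sem = self.semaphores.setdefault(
                domain, threading.Semaphore(self.default_limit))
        # Non-blocking: the worker drops the URL if the domain is saturated,
        # letting the queue redeliver it after the visibility timeout.
        return sem.acquire(blocking=False)

    def release(self, url: str):
        self.semaphores[urlparse(url).netloc].release()
```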

Indexer

You'll need a Document Indexing Service that does two things on its write path:

- Insert the URL and its text into another NoSQL store. This will be used for showing cached copies in case the original URL is unavailable or dead. But what if the text is huge? In that case, it makes sense to store the text instead in an object/blob store like S3 or Azure Blob Storage, under a key that is the hash of the URL.
- Maintain a reverse index that maps keywords/phrases in the text back to the original URL. You can use a reverse-index database like Elasticsearch.

The Document Indexing Service also exposes a query API that takes as input a phrase plus the number of results to return (i.e., the page size). It breaks the phrase into keywords, removes "stop words" (words like the, a, an, etc.), corrects spelling mistakes in the keywords, adds synonyms of certain keywords, and then calls into Elasticsearch to return the URLs that contain these keywords. Elasticsearch has plugins that already do spelling correction, stop-word removal, etc. The URLs could then be fed into a ranking service that ranks them before returning. If your read QPS is higher than your write QPS, a better approach would be to do the ranking via an hourly offline process and store the rank in Elasticsearch; that would speed up the query path, as you'd skip ranking and instead ask Elasticsearch to return URLs sorted by rank. This also makes it easy to paginate: the first page contains the top n results, the next page contains the next n results, and so on. The hourly offline ranker would need what are called "clickstream logs" to figure out which links are clicked more often, so as to rank them higher.

The query API must also return a pagination token to allow the caller to continue retrieving more pages.
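A toy in-memory stand-in for the reverse index (a real system would use Elasticsearch; the tiny stop-word list and page-number "token" are illustrative simplifications):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to"}  # tiny illustrative list

class ReverseIndex:
    """Maps each keyword to the set of URLs whose text contains it."""

    def __init__(self):
        self.index = defaultdict(set)

    def add_document(self, url: str, text: str):
        for word in text.lower().split():
            if word not in STOP_WORDS:
                self.index[word].add(url)

    def query(self, phrase: str, page_size: int = 10, page: int = 0):
        keywords = [w for w in phrase.lower().split() if w not in STOP_WORDS]
        # Union of URLs matching any keyword; a real engine would rank these.
        hits = (sorted(set().union(*(self.index.get(k, set()) for k in keywords)))
                if keywords else [])
        start = page * page_size  # page number acts as the pagination token
        return hits[start:start + page_size]
```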

Challenges for the Indexer: can we speed up the read path by maintaining a cache for the most commonly queried phrases? The 80/20 rule suggests that 80% of users query the same 20% of phrases. If so, what caching strategy should we use: LRU, TTL? Should we go with a write-through cache, so the crawler directly updates the cache, or a cache-aside strategy, where the cache is bypassed by the crawlers when they write to the Document Service? In that case, what TTL value would be appropriate for the cache? What kind of distributed cache would we use: Redis, Memcached? Only if asked, you can mention consistent hashing here.

9. Video call type of questions (Zoom, Meet, etc.)

These are very frequently asked questions these days and involve a good amount of complexity.

This is one resource which I used in my preparation journey - medium.com/@himanishaik48/zoom-system-desig..

and this - medium.com/swlh/a-design-analysis-of-cloud-..

10. Lastly, any major category that I seem to have missed that you would consider worth mentioning?

The expectation with system design questions is that the candidate will drive the interview. The interviewer will quickly be able to tell whether you are experienced in designing systems if you don't know what the right area of focus is, because that's part of the interview: knowing what to focus on. For example, Twitter is complex because of the fan-out problem and celebrity tweets with millions of followers. If you ask the interviewer what to focus on, it's clear that you don't know the bottlenecks of the system.

The interview starts off with a purposefully vague ask: design X. It's your job as the candidate to constrain the problem into something both you and the interviewer can agree on, and then to dive deep into it. There's really no fixed structure to the interview, because it can go many different ways, but typically it starts with defining the goals of the system, constraining the problem, coming up with the data schema, key data flows, and high-level architecture, identifying bottlenecks/problems in the initial design, deep-diving into specific components, performance analysis, solutions to scalability problems, and talking about further optimizations/features.