Technical Questions
Study notes for general tech questions.
High Availability
1. What is High Availability?
High availability means that a company's IT infrastructure can continue working even when some of its components fail. Downtime can have a major adverse effect on business health, so we need to take suitable measures to minimize downtime and keep systems available at all times.
2. How Is High Availability Measured?
High availability is measured in "nines" of uptime percentage. For example, 99.9% ("three nines") allows roughly 8.8 hours of downtime per year, while 99.99% ("four nines") allows roughly 53 minutes. Facebook's October 2021 outage lasted around six hours, which by itself capped its availability for that year at about three nines.
3. How do you ensure high availability of your servers?
- Server Cluster. We can use server clusters to avoid single points of failure. For example, Kubernetes makes it straightforward to deploy a highly available cluster of containerized services.
- Geographic Redundancy. Geo-redundancy deploys servers at multiple, geographically distinct locations, so that even if one site fails, the others keep running smoothly.
- Network Load Balancing. We can use Nginx to proxy HTTP traffic to a group of distributed servers. Nginx provides load-balancing algorithms such as round robin and IP hash to spread the load and reduce the pressure on any single server.
- Cache. We can store active users' data in Redis. Because Redis is an in-memory key-value store, it is very fast and relieves the load of accessing the database directly.
- Asynchronous Communication (Async Calls). We can achieve this with a message queue. When a user's request reaches our web server, we enqueue the task, immediately respond to the user with HTTP 200, and notify the user once the task has actually been processed. For example, Twitter is a huge and complicated system: when a person posts a tweet, it must be inserted into all of their followers' timelines. For accounts with many followers this is a very heavy write load that may take seconds or minutes, so a message queue is used to process it asynchronously.
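As a concrete illustration of the load-balancing point above, a minimal Nginx configuration (the backend addresses are made-up placeholders) that proxies HTTP traffic across three servers with round robin by default might look like this:

```nginx
http {
    upstream backend {
        # ip_hash;            # uncomment to switch from round robin to IP hash
        server 10.0.0.1:8080;
        server 10.0.0.2:8080;
        server 10.0.0.3:8080;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://backend;   # forward requests to the upstream group
        }
    }
}
```

This is only a sketch; a production configuration would also set timeouts, health checks, and keep-alive options.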
4. Practical experience making a web app highly available
- Single Server. Tune Tomcat through its external configuration file `application.properties` by setting `max-threads`. For example, on a 2-core, 4 GB server where the default was `max-threads=100`, we raised it to `max-threads=200` via the external configuration file.
- Reverse Proxy. Start multiple web servers plus one Nginx reverse proxy server, and serve the front end's static resources from the Nginx server. When the reverse proxy receives a client request, it forwards it to the web server cluster, which actually processes the request. Several forwarding algorithms are available, such as round robin. This gives us load balancing.
- Message Queue
- Cache
5. Difference between a reverse proxy and load balancing
A reverse proxy is one way to implement load balancing. Load balancing is generally implemented at one of three levels: DNS, hardware, or software. Using Nginx's reverse proxy for load balancing belongs to the software level.
Big Data
1. If you have a very large file that needs to be split into multiple files and processed, how would you design the system?
The problem statement is unclear, so for now we treat it as a massive-data question. Massive data means the volume is too large to fit into one machine's memory, or exceeds one machine's processing capacity. We then need divide and conquer: decompose the problem into multiple identical subproblems, recursing until each subproblem is simple enough to solve directly, and then combine the subproblem results back into the answer to the larger problem; this combining step is called a merge.
MapReduce Programming Model
Design the system with the MapReduce idea, i.e., divide and conquer. Only the high-level design is considered here, not a concrete implementation.
Note: this is only a starting point; correctness is not guaranteed.
1. InputFormat && FileReaders: the `InputFormat` reads a file from the file system and splits it into multiple splits; `FileReaders` handle the actual file reading and formatting.
2. Workers: `Workers` are assigned tasks to perform the map actions.
3. Partitioner: the `Partitioner`, acting as a hash function, assigns intermediate results from `Workers` to `Reducers`.
4. Reducers && OutputFormat: `Reducers` combine the obtained results and perform the computation; the `OutputFormat` writes the results back to the file system.
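To make the high-level design concrete, here is a minimal single-process sketch of the map → partition → reduce flow, using word count as the task (the number of reducers is an assumed parameter):

```python
from collections import defaultdict

def map_phase(chunk):
    # map: emit a (word, 1) pair for each word in an input split
    return [(word, 1) for word in chunk.split()]

def partition(pairs, num_reducers):
    # partitioner: a hash function assigns each key to one reducer's bucket
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    # reduce: combine all values collected for each key
    return {key: sum(values) for key, values in bucket.items()}

def word_count(chunks, num_reducers=2):
    pairs = [p for chunk in chunks for p in map_phase(chunk)]
    results = {}
    for bucket in partition(pairs, num_reducers):
        results.update(reduce_phase(bucket))
    return results
```

In a real MapReduce system the map, partition, and reduce steps run on different machines and the shuffle moves data over the network; here they are plain function calls.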
2. Finding the median of a data stream
Problem
How do we get the median of a data stream? If an odd number of values has been read from the stream, the median is the middle value after sorting them all. If an even number has been read, the median is the average of the two middle values after sorting. We use an Insert() method to read from the stream and a GetMedian() method to get the median of the data read so far.
Approach
To find the median, we first need a container to hold the streamed data. Which container is suitable?
- Array
  - The simplest container.
  - For an unsorted array, the median can be found with quicksort's Partition() (quickselect). Insertion is O(1), and finding the median is O(n) on average.
  - For a sorted array, since the data arrives as a stream, insertion sort fits: insertion is O(n), and finding the median is O(1).
- Linked list
  - Insertion sort on an array requires many shifts and occasional resizing, so a linked list can be used instead.
  - For a sorted linked list, insertion is still O(n), and finding the median is O(n).
- Binary search tree
  - A BST reduces insertion to O(log(n)), while finding the median is O(n).
  - However, a BST can degenerate into a linked list, in which case insertion is O(n).
- Max-heap + min-heap (recommended)
  - Tip: the problem only asks for the median; whether the elements on each side of it are sorted does not matter (similar in spirit to "find the k-th largest number in a sequence").
  - Core idea: keep the data to the left of the median (all values no greater than the median) in a max-heap, and the data to the right (all values no less than the median) in a min-heap. The key is to keep the two heaps' sizes equal or differing by one.
  - With heap insertion, inserting a value is O(log(n)). The median is always among the two heap tops, so finding it is O(1).
  - A max/min-heap can be implemented with `java.util.PriorityQueue` or Python's `heapq`.
How to implement it
Key points:
- Data to the left of the median lives in the max-heap; data to the right lives in the min-heap.
- Always keep the two heaps' sizes equal or differing by one.
Implementation:
- When the current count is even, insert the value into the max-heap, then move the max-heap's root (its maximum) into the min-heap.
- When the current count is odd, insert the value into the min-heap, then move the min-heap's root (its minimum) into the max-heap.
- To get the median: if the count is even, take the average of the two heap roots; if odd, take the min-heap's root.
```python
# Similar problem: LeetCode 295. Find Median from Data Stream
from heapq import heappush, heappushpop

class MedianFinder:
    '''
    Idea: minHeap (a min-heap, top is the minimum) holds the larger half;
          maxHeap (a max-heap via negated values, top is the maximum) holds the smaller half.
    heappush(heap, item): push item onto the heap, keeping the heap invariant.
    heappushpop(heap, item): push item onto the heap, then pop and return its smallest element.
    Time:
        Constructor: O(1)
        addNum: O(logN)
        findMedian: O(1)
    Space: O(N)
    '''
    def __init__(self):
        self.maxHeap = []  # negated values of the smaller half
        self.minHeap = []  # the larger half

    def addNum(self, num: int) -> None:
        minHeap = self.minHeap
        maxHeap = self.maxHeap
        if len(minHeap) == len(maxHeap):
            # push onto the max-heap first, then move its top to the min-heap
            heappush(minHeap, -heappushpop(maxHeap, -num))
        else:
            # push onto the min-heap first, then move its top to the max-heap
            heappush(maxHeap, -heappushpop(minHeap, num))

    def findMedian(self) -> float:
        minHeap = self.minHeap
        maxHeap = self.maxHeap
        if len(maxHeap) == len(minHeap):
            return (minHeap[0] - maxHeap[0]) / 2
        else:
            return minHeap[0] if len(minHeap) > len(maxHeap) else -maxHeap[0]

# Your MedianFinder object will be instantiated and called as such:
# obj = MedianFinder()
# obj.addNum(num)
# param_2 = obj.findMedian()
```
```java
import java.util.Collections;
import java.util.PriorityQueue;

class MedianFinder {
    // max-heap: holds the smaller half
    PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
    // min-heap: holds the larger half
    PriorityQueue<Integer> minHeap = new PriorityQueue<>();
    boolean even = true;

    public MedianFinder() {
    }

    public void addNum(int num) {
        if (even) {
            // offer() - inserts the element into the queue
            minHeap.offer(num);
            // poll() - retrieves and removes the head of the queue
            maxHeap.offer(minHeap.poll());
        } else {
            maxHeap.offer(num);
            minHeap.offer(maxHeap.poll());
        }
        even = !even;
    }

    public double findMedian() {
        if (even)
            // peek() - retrieves, but does not remove, the head of the queue
            return (maxHeap.peek() + minHeap.peek()) / 2.0;
        else
            return maxHeap.peek();
    }
}

/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder obj = new MedianFinder();
 * obj.addNum(num);
 * double param_2 = obj.findMedian();
 */
```
Database
1. SQL vs NoSQL

| SQL | NoSQL |
|---|---|
| Fixed schema | Free schema |
| Good query support | Good performance |
| Table joins | Relaxed consistency |
| ACID | Data de-normalization |
| Data normalization | Fits 1-1, 1-n relationships |
| Any relationship | ... |
Network
1. What happens when you type a URL into your browser?
1. Bob enters a URL into the browser.
URL stands for Universal Resource Locator.
- Scheme: the protocol that server used, like
HTTP
andHTTPS
. - Domain: the domain name of the site.
- Path & Resource: They together specify the resource on the server we want to load.
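The URL components above can be inspected with Python's standard library (the example URL is made up):

```python
from urllib.parse import urlsplit

# split a URL into the components described above
url = "https://example.com/path/to/resource?query=1"
parts = urlsplit(url)

print(parts.scheme)  # the scheme, e.g. "https"
print(parts.netloc)  # the domain, e.g. "example.com"
print(parts.path)    # the path to the resource, e.g. "/path/to/resource"
```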
2. Browser looks up the IP in the DNS cache.
DNS stands for Domain Name System.
- If it is not found, the browser performs a recursive DNS lookup via the `DNS Resolver`.
- The resolver then queries the `DNS Servers` recursively; the answer is cached at every step of the way.
- Finally, the browser gets the IP address of the server.
3. Browser establishes a TCP connection with the server.
- The browser completes the handshake involved in setting up a TCP connection.
- To keep the loading process fast, browsers use keep-alive connections.
- If the protocol is `HTTPS`, establishing the encrypted connection between the browser and the server additionally requires an SSL/TLS handshake.
4. Browser sends HTTP request to the server
5. Server sends back HTTP response
6. Browser renders the HTTP content.
Oftentimes there are additional resources to load, such as JavaScript code and images.
System Design
1. How do we keep the cache and the database consistent?
Consider a common scenario (cache-aside):
- write API: write to the DB
- read API:
  - read from the cache; on a cache miss, read from the database with `DB.get()`
  - then write the value into the cache with `cache.set()`, setting a `time-to-live` as the expiry
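A minimal sketch of this read path, with plain dicts standing in for Redis and the SQL database (the TTL handling is elided):

```python
db = {"user:1": "Alice"}   # stand-in for the SQL database
cache = {}                 # stand-in for Redis

def read(key):
    # cache-aside read: try the cache first
    if key in cache:
        return cache[key]
    # cache miss: fall back to the DB, then rebuild the cache entry
    value = db.get(key)
    cache[key] = value     # in Redis this set would also carry a time-to-live
    return value
```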
When data is updated, we must touch not only the database but also the cache: modifying a row means updating the database and dealing with the cached copy as well.
Option 1. Update Cache
The second-step-failure problem: if both the database and the cache are updated, there is an ordering question, with two choices:
- `cache.set()` first, then `DB.set()`, i.e., update the cache first, then the DB
  - If the cache update succeeds but the DB update fails, the cache holds the latest value while the DB holds the old one. Reads that hit the cache get the correct value, but once the cache entry expires, the old value is read from the database and the cache is rebuilt with that stale value.
- `DB.set()` first, then `cache.set()`, i.e., update the DB first, then the cache
  - If the DB update succeeds but the cache update fails, the DB holds the latest value while the cache holds the old one. Subsequent reads return stale data until the cache entry expires and the correct value is reloaded from the DB.
  - Users will then see that the data they just modified has apparently not changed, only updating some time later, which also hurts the business.
Inconsistency caused by concurrency: suppose we update the DB first and then the cache, and both steps always succeed. What happens under concurrency?
Two threads, A and B, need to update the same row; this interleaving can occur:
- At t0, thread A updates the DB with `x = 1`
- At t1, thread B updates the DB with `x = 2`
- At t1, thread B updates the cache with `x = 2`
- At t2, thread A updates the cache with `x = 1`
Thread A started before thread B, but B touched the database and cache so quickly that it finished before A did. The final value of x is 1 in the cache and 2 in the DB: a data inconsistency.
Solution
- The usual fix is a distributed lock. When two threads want to modify the same row, each must first acquire the distributed lock; only the thread holding the lock may update the database and the cache, while the others fail and retry later. Allowing only one thread at a time to touch the data and the cache avoids the concurrency problem.
- Cons:
  - Poor cache utilization and wasted machine resources: updating the cache on every write fills it with values that may never be read.
  - Often the cached value does not correspond one-to-one to a database value; it may be computed by querying the database and then running a series of calculations before being written to the cache.
- Conclusion: the update-cache option is not recommended; consider the other option.
Option 2. Delete Cache (Recommended)
The second-step-failure problem
Deleting the cache also allows two orderings:
- `cache.delete(key)` first, then `DB.set()`, i.e., delete the cache first, then update the database
  - Suppose `DB.set()` fails. The DB was not updated, so the next read misses the cache, reads from the DB, and rebuilds the cache; the DB and cache remain consistent.
- `DB.set()` first, then `cache.delete(key)`, i.e., update the database first, then delete the cache
  - Suppose `cache.delete(key)` fails. The DB holds the latest value while the cache holds the old one: a data inconsistency.
Inconsistency caused by concurrency
- Delete the cache first, then update the DB
  - Thread A wants to update `x = 2` (old value = 1)
  - At t0, thread A runs `cache.delete(key)`
  - At t1, thread B reads the cache, finds it empty, then reads `x = 1` from the DB
  - At t2, thread A sets `x = 2` in the DB
  - At t3, thread B sets `x = 1` in the cache
The final value of x is 1 in the cache and 2 in the DB: a data inconsistency.
- Update the DB first, then delete the cache
  - Initially `x = 1` in the DB, and the cache holds nothing
  - At t0, thread A reads `x = 1` from the DB
  - At t1, thread B updates the DB with `x = 2`
  - At t2, thread B runs `cache.delete(key)`
  - At t3, thread A sets `x = 1` in the cache
The final value of x is again 1 in the cache and 2 in the DB: a data inconsistency. In practice, though, this is very unlikely, because it requires all three of:
1. the cache entry has just expired;
2. a concurrent write and read on the same key;
3. the write's `DB.set() + cache.delete(key)` takes less time than the read's `DB.get() + cache.set()`.
Since a database write usually takes a lock first, writing generally takes longer than reading, so condition 3 rarely holds.
- Conclusion: use "update the DB first, then delete the cache", and make sure both steps succeed.
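A minimal sketch of the recommended write path, with dicts standing in for the DB and Redis (retry and failure handling are elided):

```python
db = {"x": 1}      # stand-in for the SQL database
cache = {"x": 1}   # stand-in for Redis

def write(key, value):
    # recommended order: update the DB first...
    db[key] = value
    # ...then invalidate the cache so the next read rebuilds it from the DB
    cache.pop(key, None)

def read(key):
    # cache-aside read, as in the scenario above
    if key in cache:
        return cache[key]
    value = db.get(key)
    cache[key] = value
    return value
```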
2. How do we store passwords in the database?
Use a hashing algorithm.
- Choose a good hash function
  - Use a slow hash function that is hard to attack by brute force. `MD5` and `SHA-1` are fast hash functions; they are less secure and should not be used.
- Add a salt
  - A salt is a unique, randomly generated string added to each password as part of the hashing process. Why do we need a salt? Because storing a password as a one-way hash alone is not enough: attackers can crack unsalted hashes in seconds.
  - The salt makes the resulting hash unique. It is not a secret and can be stored in plain text.
- When a user logs in, how do we validate the password?
  1. Fetch the user's salt from the database.
  2. Append the salt to the password provided by the user and hash the result.
  3. Compare the computed hash with the stored hash.
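A minimal sketch of this scheme using Python's standard library (`hashlib.pbkdf2_hmac` is a deliberately slow, iterated hash; the iteration count here is an assumed example value):

```python
import hashlib
import os

def hash_password(password: str, salt: bytes) -> bytes:
    # a slow, salted hash: 100,000 iterations of HMAC-SHA-256
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def register(password: str):
    salt = os.urandom(16)  # unique random salt per user
    # store both the salt and the hash in the database
    return salt, hash_password(password, salt)

def login(password: str, salt: bytes, stored_hash: bytes) -> bool:
    # 1. fetch the salt  2. hash the provided password with it  3. compare
    return hash_password(password, salt) == stored_hash
```

In production code, compare hashes with `hmac.compare_digest` to avoid timing attacks.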
3. What does an API gateway do?
Step 1: The client sends an HTTP request to the API gateway.
Step 2: The API gateway parses the HTTP request and validates its attributes.
Step 3: The API gateway performs allow-list/deny-list checks.
Step 4: The API gateway talks to an identity provider for authentication and authorization.
Step 5: Rate-limiting rules are applied to the request; if it is over the limit, the request is rejected. This prevents a single client from flooding the system with requests in a short period.
Steps 6 and 7: Now that the request has passed the basic checks, the API gateway finds the relevant service to route to by path matching.
Step 8: The API gateway transforms the request into the appropriate protocol and sends it to the backend microservices.
Steps 9-12: The API gateway handles errors properly, and deals with faults that take longer to recover from via circuit breaking. It can also leverage the ELK (Elasticsearch-Logstash-Kibana) stack for logging and monitoring, and sometimes caches data at the gateway.
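As an illustration of the rate limiting in step 5, here is a minimal token-bucket limiter (the capacity and refill rate are assumed example parameters, not part of any real gateway's API):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # spend one token on this request
            return True
        return False           # over the limit: reject the request
```

A gateway would typically keep one bucket per client (keyed by IP or user ID).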
4. What is GraphQL? Is it a replacement for the REST API?
- GraphQL is a query language for APIs developed by Meta. It provides a complete description of the data in the API and gives clients the power to ask for exactly what they need.
- GraphQL servers sit between the client and the backend services.
- GraphQL can aggregate multiple REST requests into one query, returning the result as `json`. The GraphQL server organizes the resources in a graph.
- GraphQL supports queries, mutations (applying data modifications to resources), and subscriptions (receiving notifications on schema modifications).
5. What does a typical microservice architecture look like?
- Load Balancer: distributes incoming traffic across multiple API gateway instances for high availability, forwarding requests with algorithms such as round robin.
- CDN (Content Delivery Network): a group of geographically distributed servers that hold static content for faster delivery. Clients look for content in the CDN first, then fall back to the backend services. A CDN typically serves static files from a region close to the user (a DNS resolver tells the client the location of the CDN server), and can cache hotspot files in advance, such as popular videos on YouTube or TikTok.
- API Gateway: handles incoming requests and routes them to the relevant services. It talks to the identity provider (for example, SSO, single sign-on) and to service discovery, then routes each request onward.
- Identity Provider: handles authentication and authorization for users.
- Service Registry & Discovery: microservice registration and discovery happen in this component, and the API gateway looks up the relevant service to talk to here.
- Management: responsible for monitoring the services, i.e., the logging and monitoring system.
- Microservices: designed and deployed in different domains, each domain with its own database. The API gateway talks to the microservices via REST APIs or other protocols, and microservices within the same domain talk to each other using RPC (Remote Procedure Call).
Benefits of microservices:
- They can be quickly designed, deployed, and horizontally scaled.
- Each domain can be independently maintained by a dedicated team.
- As a result, business requirements can be customized and better supported in each domain.
6. Handling Hotspot Accounts
Scenario
- A hotspot payment account is an account with a large number of concurrent operations on it.
- For example, when merchant A runs a promotion on Amazon Prime Day, it receives many concurrent purchase orders. The merchant's account row in the database becomes a hotspot account due to the heavy read and write traffic of frequent updates.
- In normal operation, we take a row lock on the merchant's balance while it is updated. This locking mechanism leads to low throughput and becomes a system bottleneck.
Optimizations
- Rate limit
  - We can limit the number of requests within a certain period; the remaining requests are rejected or retried later. This is a simple way to keep the system responsive for some users, but it can lead to a bad user experience.
  - Backend: limit repeated requests from the same `IP address` or `userId`.
  - Frontend: after several clicks, disable the order button.
- Split the balance account into sub-accounts
  - We can set up sub-accounts (and sub-databases) for the merchant's account. One update request then locks only one sub-account, leaving the rest available.
- Use a cache to update the balance first
  - We can add a caching layer that absorbs the balance updates; the detailed statements and balances are written to the database later, asynchronously. An in-memory cache can handle much higher throughput than the database.
  - Store the statements and product balances in Redis in advance.
  - This reduces the read and write load on the SQL database.
  - Problems:
    - Concurrency issues, such as selling more than the inventory: `query the inventory in Redis` and `update the inventory in Redis` is a two-step operation, so overselling is still possible.
    - If a large number of requests pass through Redis and all go on to update the SQL database, they will overload it.
  - Solutions:
    - Double-check the inventory, both in Redis and in the SQL database.
    - Perform the check-and-decrement as an atomic operation.
    - Use a `Message Queue (Kafka)` for async processing and to control the traffic reaching the SQL database; add a `sequence number` to every message to prevent duplicates.
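The atomic check-and-decrement above can be sketched in plain Python, with a lock standing in for Redis's single-threaded atomicity (in real Redis this would be a Lua script or a conditional `DECR`):

```python
import threading

class Inventory:
    def __init__(self, stock: int):
        self.stock = stock
        self.lock = threading.Lock()

    def try_purchase(self) -> bool:
        # the check and the decrement must happen atomically; otherwise two
        # concurrent buyers can both pass the check and the item oversells
        with self.lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False
```

With 3 items in stock, any number of purchase attempts yields exactly 3 successes, regardless of interleaving.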