6. Design TinyURL

info

设计一个像 TinyURL 这样的生成 URL 短链的服务 This service will provide short aliases redirecting to long URLs. Similar services: bit.ly, goo.gl, qlink.me, etc.

Interview Signals

System Design 考察的四方面

Work solution: 能否提出一个可以 work 的方案，是否熟悉常见的场景与设计模式.
Analysis and communication: 与面试官保持交流，对 Storage 与 Bandwidth 的分析.
Tradeoff Pros/Cons: 是否能够提出不同的解决方法并评估不同解决方案的优缺点，根据需求做取舍.
Knowledge Base: 知识面深度和广度.

Overview

Step1. Clarify the requirements
Step2. Capacity Estimation
Step3. System APIs
Step4. Basic System Design and Algorithm
Step5. Storage
Step6. Scalability

0. Why do we need URL shortening?

URL shortening is used to create shorter aliases for long URLs. 我们用于为长URL创建较短的别名
We call these shortened aliases short links. Users are redirected to the original URL when they hit these short links. 短链接将被 301=Redirect Permanently永久重定向到原 URL
Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs. 可以节省空间

E.g., 例如，如果我们通过 TinyURL 缩短此这个链接： https://www.educative.io/collection/page/5668639101419520/564905025344512/5668600916475904/ 会得到: http://tinyurl.com/jlg8zpc

缩短的 URL 大小几乎是实际 URL 大小的三分之一。 URL 短链服务用于优化跨设备的链接，追踪特定链接以分析受众和活动情况，以及隐藏附属的原始 URL。

Step 1: Clarify the requirements

You should always clarify requirements at the beginning of the interview. Be sure to ask questions to find the exact scope of the system that the interviewer has in mind. 你应该在面试开始的时候明确需求。一定要提问，找出面试官心目中的系统的边界

Type 1: Functional Requirement

Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. 短且保证唯一性
When users access a short link, our service should redirect them to the original link. 永久重定向
Users should optionally be able to pick a custom short link for their URL. 用户可以有自定义的选项
Links will expire after a standard default timespan. Users should be able to specify the expiration time. 设置过期时间或者永久不过期。推荐后者，因为如果用户在某个 pdf 或 doc 中使用到了该short link，在很多年以后也可以打开正确的网页

Type 2: Non-Functional Requirement

Availability 可用性
- The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
- Performance: Low latency. URL redirection should happen in real-time with minimal latency.
Security 安全性
- Shortened links should not be guessable (not predictable). 短链接是不可以被预测的

Extended Requirements

Analytics; e.g., how many times a redirection happened?
Our service should also be called through REST APIs by other services.

Step 2: Capacity Estimation

Assumption

100M DAU
Each user: post 0.1 tweet with Tiny URL per day
- Average Write QPS = 100M * 0.1 / 86400 = 100
- Peak Write QPS = 100 * 2 = 200
Each user: view 1 Tiny URL per day
- Average Read QPS = 100M * 1 / 86400 = 1k
- Peak Read QPS = 1k * 2 = 2k
Each Tiny URL = 100B
- 100M * 0.1 * 100KB = 10^9B = 1GB/day
- 在这个存储需求下，1TB 的硬盘可以用 3 年

Read heavy system 2k QPS: 1 台 SSD 支持的 MySQL 就可以搞定，不需要分布式架构

如果不想计算，可以直接假设 Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between read and write.

Step3. System APIs

Once we've finalized the requirements, it's always a good idea to define the system APIs. This should explicitly state what is expected from the system.

APIs

We can have SOAP or REST APIs to expose the functionality of our service. Following could be the definitions of the APIs for creating and deleting URLs:

createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None
- api_dev_key (string): 注册帐户的 api 开发人员密钥。除其他外，这将用于根据分配的配额限制用户
- original_url (string): 原始 URL 链接
- custom_alias (string): 自定义别名
- user_name (string): 用于编码的用户名
- expire_date (string):过期时间
- Returns: (string): A successful insertion returns the shortened URL; otherwise, it returns an error code.
deleteURL(api_dev_key, url_key)
- url_key表示要检索删除的 URL

访问端口设计

GET /<short_url>
- return a HTTP redirect response
POST /data/shorten
- Data = {url: http://xxxx}
- Return short url

How do we detect and prevent abuse?

A malicious user can put us out of business by consuming all URL keys in the current design. 恶意用户可以通过使用当前设计中的所有 URL 使我们瘫痪
To prevent abuse, we can limit users via their api_dev_key. 我们通过用户的api_dev_key来限制滥用
Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). 每个api_dev_key可以在某个时间段内限制为一定数量的 URL 创建和重定向

Step4. Basic System Design and Algorithm

Algorithm

随机生成 ShortURL + 数据库去重 随机生成一个 6 位的 ShortURL，如果没有被用过，就绑定到该 LongURL

Pros: 实现简单
Cons: 生成短网址的速度随着短网址越来越多变得越来越慢
常用场景: Book a flight/hotel Confirmation Code

public String longToShort(String url) {
  while (true) {
    String shortURL = randomSHortURL();
    if (! database.filter(shortURL=shortURL).exists()) {
      database.create(shortURL=shortURL, longURL=url);
      return shortURL;
    }
  }
}

进制转换 Base62

新浪微博采用的方案

Base62
- 将 6 位的 short url 看做一个 62 进制数 (0-9, a-z, A-Z)
- 每个 short url 对应到一个整数
- 该整数对应数据库表的 Primary Key - Sequential ID
6 位可以表示的不同 URL 数量
- 5 位 = 62^5 = 0.9B = 9 亿
- 6 位 = 62^6 = 57B = 570 亿
- 7 位 = 62^7 = 3.5T = 35000 亿
Pros: 效率高
Cons: 依赖于全局的自增 ID

int shortURLtoID(String shortURL) {
  int id = 0;
  for (int i = 0; i < shortURL.length(); i++) {
    id = id * 62 + toBase62(shortURL.charAt(i));
  }
  return id;
}

String idToShortURL(int id) {
  String chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
  String short_url = "";
  while (id > 0) {
    short_url - chars.charAt(id % 62) + short_url;
    id = id / 62;
  }
  while (short_url.length() < 6) {
    short_url = "0" + short_url;
  }
  return short_url;
}

Step5. Storage

SQL
NoSQL
Cache

是否需要 Sequential ID? 取决于你用什么算法

SQL 提供了 auto-increment 的 Sequential ID, like 1,2,3,4...
NoSQL 的 ID 并不是 Sequential 因为是分布式架构

基于随机生成的方法

需要根据 Long url 查询 Short url，也需要根据 Short url 查询 Long url。并且需要对 shortKey 和 lonhUrl 分别建 Index

shortKey	longUrl
a0B4Lb	'www.xxx.com'
Df523p	'www.yyy.com'
test12	'www.zzz.com'

也可以选用 NoSQL 数据库，但是需要建立两张表(大多数 NoSQL 不支持多重索引 Multi-index) 第一张表: 根据 Long 查询 Short row_key=longURL, column_key=ShortURL, value=null or timestamp

第二张表: 根据 Short 查询 Long row_key=shortURL, column_key=LongURL, value=null or timestamp

Work Solution

基于进制转化的方法

因为需要用到 Sequential ID，因此只能选择使用 SQL 型数据库。表单结构如下，shortURL 可以不存储在表单里，因为可以根据 id 来换算

id	longUrl (index=true)
1	"www.baidu.com"
2	"www.leetcode.com"
3	"www.google.com"

Work Solution

与上面类似，return id

Step6. Scalability

Response time

Cache

use Cache, Cache Aside
Cache 里需要存两类数据:
- long to short (生成新 short url 时需要)
- short to long (查询 short url 时需要)

地理信息

利用地理位置信息提速
优化服务器访问速度
- 不同的地区，使用不同的 Web 服务器
- 通过 DNS 解析不同地区的用户到不同的服务器
优化数据访问速度
- 使用 Centralized MySQL + Distributed Cache
- 一个 MySQL 配多个 Cache, Cache 跨地区分布

Bottleneck 是 Web Server 与 Shared DB 之间的通信速度
如果大量请求都需要查询 Shared DB，速度会很慢

Sharding

什么时候需要多台数据库服务器?
- Cache 资源不够
- Write heavy
- 越来越多的请求无法通过 Cache 满足
增加多台数据库服务器可以优化什么？
- 解决存不下的问题 - Storage
- 解决 overload 的问题 - QPS
- The bottleneck of Tiny URL will be QPS

Sharding 方案

Vertical Sharding
- 将多张数据表分别分配各多台机器
- Tiny URL 不适用，只有 two columns 无法再拆分
Horizontal Sharding
- 如果用 ID / ShortURL 做 Sharing key
  - 做 long to short 查询的时候，只能广播给 N 台数据库查询
  - 为什么需要查 Long to short? 创建的时候避免重复创建
  - 如果不需要避免重复创建，则这样可行
- 如果 Long Url 做 Sharding key
  - 绝大部分的 request 都是查询 short to long
  - 做 short to long 查询的时候，只能广播给 N 台数据库查询

选 Sharding Key

如果一个 Long 可以对应多个 Short:
- 使用 Cache 缓存所有的 Long to Short
- 在为一个 Long Url 创建 Short Url 的时候，如果 cache miss 则去直接创建新的 short url 即可
如果一个 Long 只能对应一个 Short:
- 如果使用随机生成 Short Url 的算法
  - 两张表单，一张存储 Long to Short，一张存储 Short to Long
  - 每个映射关系存两份，则可以同时支持 long to short 和 short to long 的查询
- 如果使用 base62 的进制转换法
  - 这里有一个很严重的问题是，多台机器之间如何维护一个全局自增的 ID？
  - 一般关系型数据库只支持在一台机器上实现这台机器上全局自增的 ID

全局 Auto-increment ID

如何获得在 N 台服务器中全局共享的一个自增 ID 是难点
一种解决办法是，专门用一台数据库来做自增 ID 服务
- 该数据库不存储真实数据，也不负责其他查询
- 为了避免单点失效（Single Point Failure) 可能需要多台数据库
另外一种解决办法是用 Zookeeper
- 但是使用全局自增 ID 的方法并不是解决 Tiny URL 的最佳方法
- 不推荐

基于 base62 的方法下的 Sharding key 策略

使用 Hash(long_url) % 62 作为 Sharding key. 并将 Hash(long_url) % 62 直接放到 short url 里
- 如果原来的 short key 是 AB1234 的话，现在的 short key 是
  - hash(long_url) % 62 + AB1234
  - 如果 hash(long_url) % 62 = 0 那就是 0AB1234
这样我们就可以同时通过 short_url 和 long_url 得到 Sharding Key
缺点：数据库的机器数目不能超过 62

Multi region communication

网站服务器 (Web Server) 与数据库服务器 (Database) 之间的通信
- 中心化的服务器集群（Centralized DB set）与跨地域的 Web Server 之间通信较慢
- 比如中国的服务器需要访问美国的数据库
可以按照网站的地域信息进行 Sharding
- 中国的用户访问美国的网站怎么办? 大部分中国用户的需求都是中国网站，极少部分会访问美国的数据，速度上不会慢很多

Interview Signals​

Overview​

0. Why do we need URL shortening?​

Step 1: Clarify the requirements​

Type 1: Functional Requirement​

Type 2: Non-Functional Requirement​

Extended Requirements​

Step 2: Capacity Estimation​

Assumption​

Step3. System APIs​

APIs​

访问端口设计​

Step4. Basic System Design and Algorithm​

Algorithm​

Step5. Storage​

基于随机生成的方法​

基于进制转化的方法​

Step6. Scalability​

Response time​

Sharding​

Multi region communication​