Skip to main content

15. Design Proximity Service / Yelp

info

In this chapter, we design a proximity service. A proximity service is used to discover nearby places such as restaurants, hotels, theaters, museums, etc., and is a core component that powers features like finding the best restaurants nearby on Yelp or finding k-nearest gas stations on Google Maps.

在本章中,我们设计了一个邻近服务。邻近服务用于发现附近的地点,例如餐馆、酒店、剧院、博物馆等,并且是支持在 Yelp 上查找附近最好的餐馆或在 Google 地图上查找 k 最近的加油站等功能的核心组件。

Step 1 - Understand the Problem and Establish Design Scope

Functional requirements

  • Return all businesses based on a user’s location (latitude and longitude pair) and radius. 根据用户位置(纬度和经度对)和半径返回所有商家。
    • Maximal search radius allowed 允许的最大搜索半径, 20 km (12.5 miles)
    • Search radius options in UI 搜索半径选项
      • 0.5 km (0.3 miles)
      • 1 km (0.6 miles)
      • 2 km (1.2 miles)
      • 5 km (3.1 miles)
      • 10 km (6.2 miles)
      • 20 km (12.5 miles)
  • Business owners can add, delete or update a business, but this information doesn’t need to be reflected in real-time. 企业主可以添加、删除或更新企业,但这些信息不需要实时反映。
  • Customers can view detailed information about a business. 客户可以查看有关企业的详细信息。

Non-functional requirements

  • Low latency 低延迟. Users should be able to see nearby businesses quickly.
  • Data privacy 数据隐私. Location info is sensitive data. When we design a location-based service (LBS), we should always take user privacy into consideration.
  • High availability and scalability 高可用性 requirements. We should ensure our system can handle the spike in traffic during peak hours in densely populated areas.

Back-of-the-envelope estimation 粗略计算

  • DAU(daily active users): 100 million
  • Businesses 商家: 200 million
  • Assume a user makes 5 search queries per day. 假设一个用户每天进行 5 次搜索查询。
  • Search QPS = 100 million * 5 / 10^5 = 5,000

Step 2 - Propose High-Level Design and Get Buy-In

API Design

GET /v1/search/nearby

  • This endpoint returns businesses based on certain search criteria. 该端点根据某些搜索条件返回企业。
  • The business object contains everything needed to render the search result page, but we may still need additional attributes such as pictures, reviews, star rating, etc., to render the business detail page. 企业对象包含渲染搜索结果页面所需的所有内容,但我们可能仍然需要其他属性,例如图片、评论、星级评分等,以渲染企业详细信息页面。
  • Therefore, when a user clicks on the business detail page, a new endpoint call to fetch the detailed information of a business is usually required. 因此,当用户单击企业详细信息页面时,通常需要调用新的端点来获取企业的详细信息。
// Request Parameters

{
"latitude": "decimal",
"longitude": "decimal",
"radius": "int"
}
// Response Body

{
"total": 10,
"businesses":[{business object}]
}

APIs for a business

APIDetail
GET /v1/businesses/{:id}Return detailed information about a business
POST /v1/businessesAdd a business
PUT /v1/businesses/{:id}Update details of a business
DELETE /v1/businesses/{:id}Delete a business

Data model

Read/write ratio 读写比率

Read volume is high because the following two features are very commonly used:

  • Search for nearby businesses. 搜索附近的商家。
  • View the detailed information of a business. 查看商家的详细信息。

On the other hand, the write volume is low because adding, removing, and editing business info are infrequent operations. 另一方面,由于 CRUD 操作不是频繁的操作,因此写入量较低。

Data schema

The key database tables are the business table and the geospatial (geo) index table.

Business table 商家表

The business table contains detailed information about a business. The primary key is business_id.

business {
"business_id": "string", PK
"name": "string",
"address": "string",
"city": "string",
"state": "string",
"zip_code": "string",
"latitude": "decimal",
"longitude": "decimal",
"phone_number": "string",
"categories": "string",
"num_reviews": "int",
"avg_rating": "decimal",
"photo_urls": "string",
"owner_id": "string",
"created_at": "timestamp",
"updated_at": "timestamp"
}

Geo index table 地理索引表

A geo index table is used for the efficient processing of spatial operations. 地理索引表用于高效处理空间操作。

High-level design

The system comprises two parts: location-based service (LBS) and business-related service.

Location-based service 基于位置的服务

The LBS service is the core part of the system which finds nearby businesses for a given radius and location. LBS 服务是系统的核心部分,它可以在给定的半径和位置找到附近的商家。

The LBS has the following characteristics:

  • It is a read-heavy service with no write requests. 读多,没有写请求。
  • QPS is high, especially during peak hours in dense areas. 高并发服务
  • This service is stateless so it’s easy to scale horizontally. 无状态服务,易于水平扩展。
info

**无状态服务(Stateless Service)**的概念主要指服务在处理请求时不保存任何客户端特定的数据。在无状态架构中,每个请求都被当作一个独立的单位,不依赖于之前或之后的请求。这意味着每个请求必须包含所有必要的信息来处理它,服务不会利用之前的交互或会话状态。

  • 每个请求独立: 服务对每个请求的处理就像是第一次与该客户端交互,不依赖于之前的请求历史。
  • 不保留会话信息: 服务不会存储任何关于客户端会话的信息,如用户身份、过去的活动记录等。
  • 简化的服务器设计: 由于服务器不需要跟踪和管理状态信息,服务器设计相对简单,易于实现和维护。
  • 易于水平扩展: 因为每个请求都是独立的,所以可以通过增加服务器数量来轻松扩展服务能力,而不需要担心会话数据的同步问题。
  • 负载均衡友好: 由于不存在状态信息,请求可以被任意服务器处理,使得负载均衡更加有效。

Business service 商家服务

Business service mainly deals with two types of requests:

  • Business owners create, update, or delete businesses. Those requests are mainly write operations, and the QPS is not high. 主要是写操作
  • Customers view detailed information about a business. QPS is high during peak hours. 用户访问和查看商家信息是重点服务。

Database cluster 数据库集群

  • The database cluster can use the primary-secondary setup. In this setup, the primary database handles all the write operations, and multiple replicas are used for read operations. 主从数据库。主数据库处理所有写操作,多个副本用于读操作。
  • Data is saved to the primary database first and then replicated to replicas. Due to the replication delay, there might be some discrepancy between data read by the LBS and the data written to the primary database. 数据优先被写入到主数据库,然后复制到副本。由于复制延迟,LBS 读取的数据和写入主数据库的数据之间可能存在一些差异。
  • This inconsistency is usually not an issue because business information doesn’t need to be updated in real-time.不一致性的影响并不大,因为商家信息不需要实时更新。

Step 3 - Algorithms to fetch nearby businesses

In real life, companies might use existing geospatial databases such as Geohash in Redis or Postgres with PostGIS extension.

Before we dive into the answers, let’s take a look at different types of indexing methods. In a broad sense, there are two types of geospatial indexing approaches, as shown in Figure 5. The highlighted ones are the algorithms we discuss in detail because they are commonly used in the industry.

  • Hash: even grid, geohash, cartesian tiers, etc.
  • Tree: quadtree, Google S2, RTree, etc.

Even though the underlying implementations of those approaches are different, the high-level idea is the same, that is, to divide the map into smaller areas and build indexes for fast search 将地图划分为更小的区域并建立索引以进行快速搜索。.

Option 1: Two-dimensional search 二维搜索

The most intuitive but naive way to get nearby businesses is to draw a circle with the predefined radius and find all the businesses within the circle. 以预先定义的半径画一个圆,并找到圆内的所有企业。

This process can be translated into the following pseudo SQL query:

SELECT business_id, latitude, longitude
FROM business
WHERE (latitude BETWEEN {:my_lat} - radius AND {:my_lat} + radius) AND
(longitude BETWEEN {:my_long} - radius AND {:my_long} + radius)
  • Cons 缺点
    • This query is not efficient because we need to scan the whole table. 如果没有适当的索引,数据库可能执行全表扫描来查找满足条件的行。
    • To fetch businesses within the radius, we need to perform an intersect operation on those two datasets. This is not efficient because each dataset contains lots of data. 为了获取半径内的企业,我们需要对这两个数据集执行相交操作。这效率不高,因为每个数据集都包含大量数据。
    • 数据库可以利用 B+ 索引来快速找到满足范围条件的数据的起始点,然后按顺序遍历索引直到结束点。索引本身是有序的,即使基础数据不是。即便如此,查询效率依旧非常低。

Option 2: Evenly divided grid 均匀划分网格

This approach works to some extent, but it has one major issue:

  • The distribution of businesses is not even. There could be lots of businesses in downtown New York, while other grids in deserts or oceans have no business at all. 企业分布不均匀。纽约市中心可能有很多业务,而沙漠或海洋中的其他网格根本没有业务。
  • By dividing the world into even grids, we produce a very uneven data distribution. Ideally, we want to use more granular grids for dense areas and large grids in sparse areas. 通过将世界划分为均匀的网格,我们产生了非常不均匀的数据分布。理想情况下,我们希望在密集区域使用更细粒度的网格,在稀疏区域使用大网格。
  • Another potential challenge is to find neighboring grids of a fixed grid. 边界地区网格分布不均匀。数据分布不均衡。

Option 3: Geohash 地理哈希

Geohash is better than the evenly divided grid option. It works by reducing the two-dimensional longitude and latitude data into a one-dimensional string of letters and digits. Geohash 比均匀划分的网格选项更好。它的工作原理是将二维经度和纬度数据简化为一维字母和数字字符串。

Geohash algorithms work by recursively dividing the world into smaller and smaller grids with each additional bit.

Recursively divide each grid into four smaller grids. Each grid can be represented by alternating between longitude bit and latitude bit. 递归地将世界划分为越来越小的网格,每个网格都可以通过在经度位和纬度位之间交替来表示。

Repeat this subdivision until the grid size is within the precision desired. Geohash usually uses base32 representation. 重复此细分,直到网格大小达到所需的精度。 Geohash 通常使用 base32 表示。

Let’s take a look at two examples.

  • Geohash of the Google headquarter (length = 6):
    • 1001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)
  • Geohash of the Facebook headquarter (length = 6):
    • 1001 10110 01001 10001 10000 10111 (base32 in binary) → 9q9jhr (base32)

Geohash has 12 precisions (also called levels) as shown in Table 4. The precision factor determines the size of the grid. We are only interested in geohashes with lengths between 4 and 6. This is because when it’s longer than 6, the grid size is too small, while if it is smaller than 4, the grid size is too large. Geohash 有 12 个精度(也称为级别),如表 4 所示。精度因子决定了网格的大小。我们只对长度在 4 到 6 之间的 geohash 感兴趣。这是因为当它长度超过 6 时,网格尺寸太小,而如果它小于 4,网格尺寸太大。

How do we choose the right precision? We want to find the minimal geohash length that covers the whole circle drawn by the user-defined radius. The corresponding relationship between the radius and the length of geohash is shown in the table below. 我们如何选择合适的精度?我们想要找到覆盖由用户定义的半径绘制的整个圆的最小 geohash 长度。 geohash 的半径和长度的对应关系如下表所示。

This approach works great most of the time, but there are some edge cases with how the geohash boundary is handled that we should discuss with the interviewer.

Boundary issues

Geohashing guarantees that the longer a shared prefix is between two geohashes, the closer they are. Geohashing 保证两个 geohashes 之间的共享前缀越长,它们就越接近。

As shown in Figure 9, all the grids have a shared prefix: 9q8zn.

Boundary issue 1 - 南北半球

However, the reverse is not true: two locations can be very close but have no shared prefix at all. This is because two close locations on either side of the equator or prime meridian belong to different 'halves' of the world. 然而,反之则不然:两个位置可能非常接近,但根本没有共享前缀。这是因为赤道或本初子午线两侧的两个邻近位置属于世界的不同“一半”。

For example, in France, La Roche-Chalais (geohash: u000) is just 30km from Pomerol (geohash: ezzz) but their geohashes have no shared prefix at all.

Because of this boundary issue, a simple prefix SQL query below would fail to fetch all nearby businesses. 上述的边界问题,通过简单的前缀搜索 SQL 查询将无法获取所有附近的企业。

SELECT * FROM geohash_index WHERE geohash LIKE 9q8zn%``

Boundary issue 2 - 位于边界位置

  • Another boundary issue is that two positions can have a long shared prefix, but they belong to different geohashes. 另一个边界问题是两个位置可以有很长的共享前缀,但它们属于不同的 geohash。
  • 这种情况通常出现在 Geohash 的边界附近,例如,即使两个地点在物理上仅仅相隔几米,它们的 Geohash 也可能不同,因为它们恰好位于两个网格的分界线上。
  • A common solution is to fetch all businesses not only within the current grid but also from its neighbors. The geohashes of neighbors can be calculated in constant time. 一个常见的解决方案是不仅获取当前网格内的所有企业,还获取其邻居的所有企业。邻居的 geohash 可以在常数时间内计算出来。

Not enough businesses

Now let’s tackle the bonus question. What should we do if there are not enough businesses returned from the current grid and all the neighbors combined?

  • Option 1:
    • only return businesses within the radius. This option is easy to implement, but the drawback is obvious. It doesn’t return enough results to satisfy a user’s needs. 只返回半径内的商家。这个选项很容易实现,但缺点也很明显。它没有返回足够的结果来满足用户的需求。
  • Option 2:
    • increase the search radius. We can remove the last digit of the geohash and use the new geohash to fetch nearby businesses. 增加搜索半径。我们可以删除 geohash 的最后一位数字并使用新的 geohash 来获取附近的企业。
    • If there are not enough businesses, we continue to expand the scope by removing another digit. This way, the grid size is gradually expanded until the result is greater than the desired number of results. 如果商家不够,我们会继续扩大范围,再去掉一个数字。这样,网格大小逐渐扩大,直到结果大于所需的结果数。
    • Process for example:
      • 原始 geohash: 9q8zn
      • 第一次减少精确度: 9q8z
      • 第二次减少精确度: 9q8。这个区域会更大,因为它涵盖了所有以"9q8"开头的 geohash 编码的区域。
    • Cons:
      • This approach works well, but it has one drawback: it might return businesses that are outside the radius. 这种方法效果很好,但有一个缺点:它可能会返回半径外的企业。这个区域并不是一个完美的圆形,而是一个矩形或者正方形的区域,其边界可能会超出你最初设定的圆形搜索半径。
      • 这种方法在稀疏区域提供了更多的结果,增加了搜索的灵活性和覆盖范围,但也牺牲了精确度。

Option 4: Quadtree 四叉树