
3. Design Web Crawler

info

Design Web Crawler

Source: 花花酱

Interview Signals

The four aspects evaluated in a System Design interview

  1. Working solution: can you propose a solution that works, and are you familiar with common scenarios and design patterns.
  2. Analysis and communication: keep communicating with the interviewer; analyze Storage and Bandwidth.
  3. Tradeoff Pros/Cons: can you propose different solutions, evaluate their pros and cons, and make tradeoffs based on the requirements.
  4. Knowledge base: depth and breadth of knowledge.

Overview

  • Step 1: Clarify the requirements
  • Step 2: Capacity Estimation
  • Step 3: High-level Design
  • Step 4: Detailed Component Design
  • Step 5: Fault Tolerance & Scalability
  • Step 6: Extended topics

Step 1: Clarify the requirements

Clarify goals of the system

  • Requirements
  • Traffic size (e.g., Daily Active User)

Discuss the functionalities and align with the interviewer on which components to focus on.

Align with the interviewer on 2-3 components to focus on during the interview.

Type 1: Functional Requirement

Crawl the entire web

  • Assumption 1: crawl HTML contents only
    • The design should be extensible for downloading image/video contents later
  • Assumption 2: support the HTTP protocol only
    • The design should be extensible to support FTP and other protocols later

Type 2: Non-Functional Requirement

  • Scalability
    • Can crawl the entire web and fetch billions of documents
  • Extensibility
    • Easy to support new requirements, e.g., new document types
    • Modular design, so that audio/video contents can be supported in the future

Step 2: Capacity Estimation

  • Assumption
    • 30 billion web pages (URLs)
      • Crawl within 30 days
    • Each page has size 100KB (HTML only), plus 0.5KB for storing meta info
  • Storage estimation
    • 30B * (100KB + 0.5KB) = 3.02PB
  • Bandwidth
    • Crawl speed
      • 30B / (30 days * 86400 s/day) = 11574 pages/second
    • Download bandwidth
      • 11574 pages/second * (100KB + 0.5KB) / page = 1.16GB/s (i.e., 9.3Gbps)
  • GB/s = gigabyte/second, Gbps = gigabit/second
  • To convert gigabytes/second to gigabits/second, multiply by 8 (see the sketch below)
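
The estimate above can be reproduced with a few lines of Python (decimal units: 1 PB = 10^12 KB, 1 GB = 10^6 KB); the constants come directly from the assumptions listed here.

```python
# Back-of-the-envelope capacity estimation (numbers from the assumptions above).
PAGES = 30e9                 # 30 billion URLs
DAYS = 30                    # crawl window
PAGE_KB = 100 + 0.5          # 100KB HTML + 0.5KB metadata per page

storage_pb = PAGES * PAGE_KB / 1e12            # KB -> PB (decimal units)
pages_per_sec = PAGES / (DAYS * 86400)
bandwidth_gbs = pages_per_sec * PAGE_KB / 1e6  # KB/s -> GB/s
bandwidth_gbps = bandwidth_gbs * 8             # gigabytes/s -> gigabits/s

print(f"storage: {storage_pb:.2f} PB")               # ~3.02 PB
print(f"crawl rate: {pages_per_sec:.0f} pages/s")    # ~11574 pages/s
print(f"bandwidth: {bandwidth_gbs:.2f} GB/s ({bandwidth_gbps:.1f} Gbps)")
```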

Step 3: High-level Design

Given a set of seed URLs as starting points:

  1. Pick a URL, fetch the web page
  2. Process the downloaded page, e.g., store and index the contents
  3. Parse the page to extract URLs that have not been visited yet, and add them to the unvisited URLs (URL Frontier)
  4. Go back to step 1

  • Seeds are added to the URL Frontier. The Fetcher picks a URL from the URL Frontier and downloads its contents from the Internet. The contents go to the Parser, which processes them and extracts unvisited URLs back into the URL Frontier; document contents are stored in the Content Store and URLs in the URL Store (a minimal loop is sketched below)
  • Document Deduplication filters out documents with duplicate contents
  • URL Filter filters and deduplicates URLs, dropping URLs that must not be crawled
  • DNS Resolver translates a URL's hostname into the corresponding web server's IP address before fetching
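
A minimal, single-threaded sketch of this loop using only the Python standard library; the regex-based link extraction and the in-memory frontier/content store are simplifications of the distributed components described above.

```python
# Minimal sketch of the crawl loop; real crawlers replace every piece (frontier,
# fetcher, parser, stores) with the distributed components in this design.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URL Frontier (unvisited URLs)
    visited = set()                      # visited-URL set (URL dedup)
    content_store = {}                   # stands in for the Content Store

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()                         # 1. pick a URL
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        visited.add(url)
        content_store[url] = html                        # 2. store/index contents
        for link in re.findall(r'href="([^"]+)"', html): # 3. extract new URLs
            frontier.append(urljoin(url, link))
    return content_store                                 # 4. loop back to step 1
```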

Document Deduplication

  • Duplication
    • The same web contents may be available under different URLs
  • How to dedup
    • Document fingerprint: compute a 64-bit checksum (e.g., based on MD5 or SHA) for every processed document, and store it in the database
    • To deduplicate, compute the fingerprint of each document; when a new page is processed, check its fingerprint against the stored ones to detect duplicate contents (see the sketch below)
    • Other approach: SimHash, used by Google, is more robust to minor changes: if a document changes only slightly, SimHash can still recognize the two versions as the same document
  • Storage
    • 30B * 8 Bytes = 240 GB (can use cache + disk storage)
    • 64-bit = 8 Bytes
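
A sketch of the fingerprint approach, assuming the fingerprint is taken as the first 64 bits of an MD5 digest; the choice of hash and the in-memory `seen` set are illustrative only.

```python
# Illustrative 64-bit document fingerprint: hash the page contents and keep the
# first 8 bytes. MD5 is used purely as an example checksum, not for security.
import hashlib

def fingerprint(document: bytes) -> int:
    """Return a 64-bit fingerprint of the document contents."""
    digest = hashlib.md5(document).digest()      # 128-bit digest
    return int.from_bytes(digest[:8], "big")     # truncate to 64 bits (8 bytes)

seen = set()   # in practice: a cache backed by disk/DB (~240 GB for 30B pages)

def is_duplicate(document: bytes) -> bool:
    fp = fingerprint(document)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```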

URL Filters

  • URL Filters
    • Disallow list, robots exclusion, deduplication: filter out disallowed URLs and duplicates
    • Crawling consumes the target server's resources, so site owners may not want their sites to be crawled
  • Robots exclusion
    • Fetch the robots.txt file from a website to test whether a given URL can be crawled; robots.txt states which parts of the site may or may not be crawled, and the crawler should obey those rules
    • A cache can be used to store a local copy of the exclusion rules, so the robots exclusion file does not have to be downloaded from the same web server again and again
  • URL Deduplication
    • URL normalization/standardization, e.g., letter case, relative vs. absolute paths
    • URL fingerprint: deduplicate by fingerprint (see the sketch below)
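
A sketch of the robots-exclusion check and URL normalization using Python's standard library (`urllib.robotparser`, `urllib.parse`); the per-host cache layout and the "MyCrawler" user agent are assumptions.

```python
# Sketch of the URL filter: obey robots.txt (cached per host) and normalize URLs
# before fingerprinting them for dedup. Uses only the Python standard library.
from urllib import robotparser
from urllib.parse import urlsplit, urlunsplit

_robots_cache = {}   # host -> RobotFileParser (local copy of exclusion rules)

def allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    host = urlsplit(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()                     # fetched once per host, then cached
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different forms dedup to one entry."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))
```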

DNS Resolver

  • DNS resolution
    • Translating a URL's hostname to a unique IP address
    • Before contacting a web server, we need to use DNS to get its IP address
  • Challenge
    • This can become a bottleneck given the large number of URLs we need to crawl
  • Solution
    • DNS caching: cache DNS results by building local DNS servers, so we do not need to query external DNS for every request (see the sketch below)
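
A minimal local DNS cache sketch; the one-hour TTL is an assumption, and a production crawler would run dedicated local DNS servers rather than a per-process dict.

```python
# Minimal local DNS cache: resolve each hostname once and remember the result
# with a TTL, instead of hitting external DNS for every request.
import socket
import time

_dns_cache = {}   # hostname -> (ip, expiry_timestamp)

def resolve(hostname: str, ttl: int = 3600) -> str:
    entry = _dns_cache.get(hostname)
    if entry and entry[1] > time.time():
        return entry[0]                         # cache hit
    ip = socket.gethostbyname(hostname)         # external DNS lookup
    _dns_cache[hostname] = (ip, time.time() + ttl)
    return ip
```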

Step 4: Detailed Component Design

  • Crawling Policies
    • Politeness policies
  • Distributed crawler design

Crawling Policies

Based on https://en.wikipedia.org/wiki/Web_crawler

  • Selection policy, which states the pages to download
  • Re-visit policy, which states when to check for changes to the pages (web contents get updated, so pages need to be re-crawled periodically)
  • Politeness policy, which states how to avoid overloading web sites
  • Parallelization policy, which states how to coordinate distributed web crawlers

Politeness Policy

  • Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site.
  • If a single crawler is performing multiple requests per second and/or downloading large files, a server can have a hard time keeping up with requests from multiple crawlers.
  • Sending too many requests can overload the target server, which may then ban your IP address and block further access.
  • Avoid sending parallel requests to the same server, and leave a time gap between consecutive requests, e.g., 1 ~ n seconds (see the sketch below).
  • A distributed crawler can overload a website if not implemented carefully.
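
A minimal sketch of the per-host time gap; the 1-second minimum delay and the in-memory bookkeeping are illustrative (the URL Frontier section below shows how this is generalized with per-domain queues).

```python
# Simple per-host politeness gate: enforce a minimum delay between two requests
# to the same web server (the delay value here is illustrative).
import time

_last_request = {}   # host -> timestamp of the last request to that host

def wait_politely(host: str, min_delay: float = 1.0) -> None:
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    _last_request[host] = time.time()
```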

URL Frontier Design

  • Implement selection-policy
    • By deciding the priorities of URLs to be downloaded
  • Implement politeness-policy
      1. Should have only one connection to a web server at any given time
      2. Should leave a time gap between subsequent visits to the same web server
  • Function
    • Insert new URLs into queue
    • Provide URLs for workers to download

  • The Prioritizer is the entry point: unvisited URLs enter the queues here. It assigns each URL a priority based on factors such as importance and how long the URL has gone unvisited; e.g., priority = 1 is inserted at the head of the queue and priority = n at the tail
  • The Politeness selector is the exit: it hands URLs to the downstream Worker threads, which download the pages. It controls reads from the queues through a Heap that stores, for each queue, the earliest time the queue may be accessed next. Each extraction pops the top of the MinHeap; whenever a queue is accessed, its next allowed access time is updated, which controls the access frequency
  • The Front Queues implement the priority ordering and decide which URLs are downloaded earlier and which later
  • The Back Queues implement the politeness policy and control how frequently each web site is accessed. Each Back Queue (B) holds pages from a single domain only, e.g., B1 = Tencent, B2 = Baidu, B3 = Sina, so the selector can control how often each site is visited
  • The Mapping table stores the back queue ID for each web site; the Politeness router uses the Mapping table to route each site's URLs to its corresponding Back Queue (see the sketch below)
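
A simplified single-process sketch of this frontier, assuming three front (priority) queues, one back queue per domain, and a 1-second politeness gap; the real design shards this structure across machines and refills back queues more selectively.

```python
# Simplified URL Frontier sketch: front queues implement the selection (priority)
# policy; per-domain back queues plus a min-heap of "earliest next access time"
# implement the politeness policy. All parameters are illustrative.
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class URLFrontier:
    def __init__(self, num_front=3, delay=1.0):
        self.front = [deque() for _ in range(num_front)]  # index 0 = highest priority
        self.back = {}               # back-queue id -> deque of URLs for one domain
        self.domain_to_back = {}     # the mapping table: domain -> back-queue id
        self.heap = []               # (next allowed fetch time, back-queue id)
        self.delay = delay           # politeness gap per domain

    def add(self, url, priority=0):
        self.front[min(priority, len(self.front) - 1)].append(url)

    def _refill_back_queues(self):
        for q in self.front:                  # drain front queues in priority order
            while q:
                url = q.popleft()
                domain = urlsplit(url).netloc
                if domain not in self.domain_to_back:
                    bq_id = len(self.back)
                    self.domain_to_back[domain] = bq_id
                    self.back[bq_id] = deque()
                    heapq.heappush(self.heap, (time.time(), bq_id))
                self.back[self.domain_to_back[domain]].append(url)

    def next_url(self):
        """Return a URL whose host may be contacted now, or None if empty."""
        self._refill_back_queues()
        if not self.heap:
            return None
        next_time, bq_id = heapq.heappop(self.heap)
        if next_time > time.time():
            time.sleep(next_time - time.time())           # enforce politeness gap
        url = self.back[bq_id].popleft() if self.back[bq_id] else None
        heapq.heappush(self.heap, (time.time() + self.delay, bq_id))
        return url
```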

Mapping Table

  • Maintains the mapping between domain name <-> back queue ID
  • When a back queue is empty, it will get refilled by the front queues, and the map will be updated accordingly
| Domain Name | Back Queue # |
| --- | --- |
| wikipedia.org | 12 |
| stanford.edu | 19 |
| ... | ... |

Step 5: Fault Tolerance & Scalability

  • Scalability
    • Database sharding
    • Consistent hashing (see the sketch below)
  • Fault tolerance
    • Server replacement
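
A minimal consistent-hashing ring, as one way URLs or documents could be sharded across crawler/storage nodes so that adding or removing a server only remaps a small fraction of keys; the node names and virtual-node count are illustrative.

```python
# Minimal consistent-hashing ring sketch for sharding URLs across nodes.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.get_node("https://wikipedia.org/some/page"))
```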

Design Solution from Scott

1. Requirements

  • Empirical results driven.
  • From the database perspective, data size is the limitation.
    • gurugamer game replays: 20k, daily replays: 1.7M; this could be used to estimate the total data size and forecast the foreseeable future.
  • From the whole-system perspective, server-side throttling is the main limitation.
    • Use different accounts
    • Use different IPs
    • Multi-thread
    • Be a good citizen, don’t bring down the server
    • The same node crawls several sites in parallel

2. Strategy

The basic algorithm executed by any web crawler is to take a list of whitelisted domains and seed URLs as its input and repeatedly execute the following steps (see the async sketch after this list).

  • Pick a URL from the unvisited URL list.
  • Asynchronously crawl the URL.
  • Parse the document contents, store the data, and index the contents.
  • Look for new URLs within the whitelisted domains and add them to the list of unvisited URLs.
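
An async sketch of this loop, assuming the third-party aiohttp library is available; the whitelist, seed URLs, and regex-based link extraction are illustrative placeholders.

```python
# Async variant of the basic loop above, restricted to whitelisted domains.
import asyncio
import re
from urllib.parse import urljoin, urlsplit

import aiohttp   # third-party; assumed available

WHITELIST = {"example.com"}           # whitelisted domains (illustrative)

async def crawl(seeds, max_pages=50):
    unvisited, visited = list(seeds), set()
    async with aiohttp.ClientSession() as session:
        while unvisited and len(visited) < max_pages:
            url = unvisited.pop()                         # pick an unvisited URL
            if url in visited or urlsplit(url).netloc not in WHITELIST:
                continue
            visited.add(url)
            async with session.get(url) as resp:          # async crawl
                html = await resp.text()
            # parsing, storing, and indexing of contents would happen here
            for link in re.findall(r'href="([^"]+)"', html):
                unvisited.append(urljoin(url, link))      # add new URLs
    return visited

# asyncio.run(crawl(["https://example.com/"]))
```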

3. High Level Design

4. Service

  • Scheduler:
    • Trigger new crawls periodically or on demand.
    • Insert seed URLs
  • Crawler:
    • Stateless crawler, async scalable worker.
    • Deep copy vs shallow copy
  • Parser:
    • Async stateless worker: parse and validate the raw data, write the validated data to a Clean Store.
  • Notification:
    • Notify the user of the current progress, stuck processes, unexpected schemas, etc.
  • Sweeper/Compensator:
    • Monitor the progress and re-trigger a crawl or parsing if no response.
    • Generate notification if unknown schema / exception met.

5. Database & Data Model

Raw Data Table (S3)

| Id | data | Created time |
| --- | --- | --- |
| uuid | HTML / JSON | 21435343 |

Clean Data

| Hash(URL, iteration) (shard key) | type | URL | iteration | headline | main_text |
| --- | --- | --- | --- | --- | --- |
| string | News | cnn.com/1 | 1 | Uber | What’s new? |

Metadata - Scheduling Table

| Id | Cron | user | Creation | Name | URLs | state |
| --- | --- | --- | --- | --- | --- | --- |
| uuid | 0 8 * * * | scott | 214145 | replays | Link to YAML | {ON, PAUSED, OFF} |

Metadata - Iteration Progress

Primary key: schedule_id + iteration

| schedule | iteration | start | URLs found | Stuck | crawled | processed |
| --- | --- | --- | --- | --- | --- | --- |
| uuid | 9 | 214145 | 10000 | 10 | 9000 | 8600 |

Metadata - Record Job

  • Shard key: Id = Hash(URL, iteration)
  • STATE: {Queued, Crawled, Processed, Stuck}
| id | URL | iteration | state | create_time | last_update | notified | Deleted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| str | games.net | 9 | QUEUED | 1234 | 311 | 0 | No |
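
One possible way to model a Record Job row in application code; the field names and the state values mirror the table above, while the types and the `RecordJob`/`JobState` names are assumptions.

```python
# Illustrative in-code model of a Record Job row and its state machine.
from dataclasses import dataclass
from enum import Enum

class JobState(Enum):
    QUEUED = "Queued"
    CRAWLED = "Crawled"
    PROCESSED = "Processed"
    STUCK = "Stuck"

@dataclass
class RecordJob:
    id: str              # shard key: Hash(URL, iteration)
    url: str
    iteration: int
    state: JobState
    create_time: int
    last_update: int
    notified: bool = False
    deleted: bool = False
```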

6. Fault Tolerance

Scheduler (Apache Airflow)

  • Read YAML: what to crawl
  • Insert seed in Record Job metadata, no-op if exists
    • Send and forget to crawler
  • Insert progress in Iteration Progress table, no-op if exists
  • If a fault occurs, the record goes back to the Queued state.
info

Heavy-load processing is usually stateless, send & forget: stateless, async non-blocking, and no response required.

Crawler: Stateless

  • Read the URL that needs to be crawled; if it is already crawled, no-op
  • Crawl the URL, store data into Raw Data table
  • Update record to Crawled state

Parser: Stateless

  • Read the record that needs to be parsed; if it is already Processed or Stuck, no-op
  • Parse and validate the record:
    • If invalid, update the record to STUCK and return
  • Store the validated record to Clean Data
  • Update record state to PROCESSED

Sweeper: Periodically wakes up

  • Read records in STUCK state and not yet notified
    • Notify developers to take actions.
    • Mark the record as notified
  • Read the records that have stayed in the Queued state for a long period of time
    • Re-queue them into the Crawler.
  • Read the records that have stayed in the Crawled state for a long period of time
    • Re-queue them into the Parser.
  • Update the iteration progress records:
    • Send a summary notification to users.
    • May need to consolidate counts across shards (see the sketch after this list).
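
A sketch of a single sweeper pass; `db`, `crawler_queue`, `parser_queue`, `notify`, and the staleness threshold are hypothetical interfaces and values standing in for the metadata store, the work queues, and the notification service described above.

```python
# Sketch of one Sweeper/Compensator pass over the Record Job metadata.
import time

STALE_AFTER = 3600   # seconds a record may stay in one state before re-queueing

def sweep(db, crawler_queue, parser_queue, notify):
    now = time.time()
    # STUCK records that have not been reported yet: alert developers once.
    for rec in db.find(state="STUCK", notified=False):
        notify(f"Record {rec.id} is stuck")
        db.update(rec.id, notified=True)
    # Records stuck in QUEUED for too long: re-queue into the Crawler.
    for rec in db.find(state="QUEUED", older_than=now - STALE_AFTER):
        crawler_queue.push(rec)
    # Records stuck in CRAWLED for too long: re-queue into the Parser.
    for rec in db.find(state="CRAWLED", older_than=now - STALE_AFTER):
        parser_queue.push(rec)
    # Finally, roll up per-shard iteration progress and send a summary to users.
```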