Skip to content

Latest commit

 

History

History
185 lines (156 loc) · 6.96 KB

short-url.md

File metadata and controls

185 lines (156 loc) · 6.96 KB

URL Shortening Service

Summary

overview summary

Requirements

  • Functional Requirements

    • Given a URL, generate a shorter and unique alias (short link).
    • When users access a short link, redirect to the original link.
    • Users should optionally be able to pick a custom short link for their URL.
    • Links will expire after a standard default timespan. Users should also be able to specify the expiration time.
  • Non-Functional Requirements

    • The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
    • URL redirection should happen in real-time with minimal latency.
    • Shortened links should not be guessable (not predictable).
  • Extended Requirements

    • Analytics; e.g., how many times a redirection happened?
    • Be accessible through REST APIs by other services.

Capacity Estimation and Constraints

  • Assumption

    • Read-heavy. More redirection requests compared to new URL shortenings.
    • Assume 100:1 ratio between read and write.
  • Traffic estimates

    • 500M new URL shortenings per month, 100 * 500M => 50B redirections per month.
    • New URL shortenings per second
      • 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s
    • URLs redirections per second
      • 50 billion / (30 days * 24 hours * 3600 sec) = ~19K/s
  • Storage estimates

    • Assume storing every URL shortening request for 5 years, each object takes 500 bytes
    • Total objects: 500 million * 5 years * 12 months = 30 billion
    • Total storage: 30 billion * 500 bytes = 15 TB
  • Bandwidth estimates

    • Write: 200 URL/s * 500 bytes/URL = 100 KB/s
    • Read: 19K URL/s * 500 bytes/URL = ~9 MB/s
  • Cache memory estimates

    • Follow the 80-20 rule, assuming 20% of URLs generate 80% of traffic, cache 20% hot URLs
    • Requests per day: 19K * 3600 seconds * 24 hours = ~1.7 billion/day
    • Cache 20%: 0.2 * 1.7 billion * 500 bytes = ~170GB
  • Summary

    • Assuming 500 million new URLs per month and 100:1 read:write ratio
    Category Calculation Estimate
    New URLs 500 million / (30 days * 24 hours * 3600 seconds) 200 /s
    URL redirections 500 million * 100 / (30 days * 24 hours * 3600 seconds) 19 K/s
    Incoming data 500 bytes/URL * 200 URL/s 100 KB/s
    Outgoing data 500 bytes/URL * 19K URL/s 9 MB/s
    Storage for 5 years 500 bytes/URL * 500 million * 60 months 15 TB
    Memory for cache 19K URL * 3600 seconds * 24 hours * 500 bytes * 20% 170 GB

System APIs

createUrl

  • Parameters
    Name Type Note
    api_dev_key string The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
    original_url string Original URL to be shortened.
    custom_alias string Optional custom key for the URL.
    user_name string Optional user name to be used in encoding.
    expire_date string Optional expiration date for the shortened URL.
  • Return
    • string
    • A successful insertion returns the shortened URL; otherwise, it returns an error code.

deleteUrl

  • Parameters
    Name Type Note
    api_dev_key string The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
    url_key string Short URL.
  • Return
    • string
    • A successful deletion returns ‘URL Removed’.

Database design

  • Observations

    • Need to store billions of records.
    • Each object is small (less than 1K).
    • No relationships between records—other than storing which user created a URL.
    • Read-heavy.
    • A NoSQL choice would also be easier to scale.
    • Comment: SQL with sharding should also work
  • Schema

    • URL
      Column Type
      hash varchar(16)
      original_url varchar(512)
      creation_date datetime
      expiration_date datetime
      user_id int
    • User
      Column Type
      name varchar(20)
      email varchar(32)
      creation_date datetime
      last_login datetime

Basic System Design and Algorithm

Encoding actual URL

  • Compute unique hash
    • base64: A-Z, a-z, 0-9, -, .
    • 6 letters: 64 ^ 6 = ~68.7 billion
    • 8 letters: 64 ^ 8 = ~281 trillion
    • Use 6 letters
    • MD5 generates 128 bit hash value
    • Each base64 character encodes 6 bits
    • base64 encoding generates 22 characters
    • Select 8 characters
  • Issues with this approach
    • Same URL from multiple users
    • URL-encoded
  • Workaround
    • Append an increasing sequence number to each input URL, and generate a hash for it
    • Append user id to input URL

Generating keys offline

  • Standalone Key Generation Service (KGS)

    • Generate random 6 letter strings and store them in a database (key DB)
    • When a short URL is needed, take one from the key DB
  • Key DB size

    • 6 characters/key * 68.7B unique keys = 412 GB
  • Concurrency issue

    • If there are multiple servers reading keys concurrently, two or more servers try to read the same key from the database.
  • Workaround

    • Servers can use KGS to read/mark keys in the database.
    • KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys.
    • KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
    • KGS needs to make sure not to give the same key to multiple servers.
    • Comment: keys are sharded. Each KGS server only serves one application server.
  • Key lookup

    • When a key is found, issue an "HTTP 302 Redirect" status and passing the stored URL.
    • When a key is not found, issue an "HTTP 404 Not Found", or redirect to homepage.

UUID

Replace KGS with UUID.

Data Partitioning and Replication

  • Range Based Partitioning

    • Store URLs in separate partitions based on the first letter of the URL or the hash key.
    • Combine certain less frequently occurring letters into one database partition.
  • Problem with this approach

    • Unbalanced servers.
  • Hash-Based Partitioning

    • Take a hash of the short URL we are storing, and calculate which partition to use based upon the hash.
    • Use consistent hashing

Cache

  • Eviction policy
    • LRU: discard the least recently used URL first
  • Cache update
    • Cache miss: hit backend database and pass new entry to all cache replicas

Load Balancer (LB)

  • LB locations
    • Between Clients and Application servers
    • Between Application Servers and database servers
    • Between Application Servers and Cache servers

DB Sweeping

A separate Cleanup service can run periodically to remove expired links from our storage and cache.

Telemetry

Statistics about the system: how many times a short URL has been used

Security and Permissions

  • Store permission level (public/private) with each URL in the database
  • Send an error (HTTP 401) for unauthorized access