Design a system that supports search over arbitrary text input, e.g. Twitter search by keywords, or YouTube/Netflix search by video name.
- An object should become searchable as soon as possible after a user creates it.
- The search needs to support arbitrary text-based user input.
- The search needs to perform in near real time.
- A user's own posts must be searchable immediately; a delay is acceptable when searching others' posts.
- High availability.
- The cluster needs to scale gracefully with minimal re-partitioning.
- Does this search service need to support other business logic as a shared service?
- Does this search service need to be plugged into other upstream services, e.g. a crawler service?
- A searchable term is about 8 words long
- Each word has 10 characters on average
- How much data do we have to index?
- 500 videos are uploaded per minute
- This is a read-heavy system that must handle a high volume of concurrent reads
- Might need rate limiting
8 words * 10 chars * 4 bytes * 500 videos/minute * 60 minutes * 24 hours ≈ 230 MB / day
230 MB/day * 365 days * 10 years ≈ 840 GB in total for 10 years
This means we have to build an index over roughly 840 GB of data in total, which definitely cannot be held in memory on a single machine. We want to design a distributed service.
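The back-of-envelope arithmetic above can be checked with a small script (the constants are the assumptions already stated: 8 words/title, 10 chars/word, 4 bytes/char, 500 uploads/minute):

```python
# Back-of-envelope index size estimate using the assumptions above.
BYTES_PER_DOC = 8 * 10 * 4    # 320 bytes of searchable text per video
DOCS_PER_DAY = 500 * 60 * 24  # 720,000 videos uploaded per day

bytes_per_day = BYTES_PER_DOC * DOCS_PER_DAY
bytes_10_years = bytes_per_day * 365 * 10

print(f"{bytes_per_day / 1e6:.0f} MB/day")     # ~230 MB/day
print(f"{bytes_10_years / 1e9:.0f} GB total")  # ~841 GB over 10 years
```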
- Once a new tweet or video metadata arrives, it is immediately sent to the tokenizer service for processing.
- There could be several MQs connecting the input and the tokenizer worker threads.
- Multiple tokenizer worker threads pop events and generate the forward index.
documentID : list of terms
--------------------------
0 | [(i, 0), (like, 2), (apple, 7)]
1 | [(liverpool, 0), (wins, 10)]
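A minimal sketch of what a tokenizer worker could do to produce the forward-index rows above, assuming whitespace tokenization and character offsets as positions (the `tokenize` name is hypothetical):

```python
from typing import List, Tuple

def tokenize(doc_id: int, text: str) -> Tuple[int, List[Tuple[str, int]]]:
    """Hypothetical tokenizer worker: split on whitespace and record the
    character offset of each term, producing one forward-index row."""
    lowered = text.lower()
    terms, pos = [], 0
    for word in lowered.split():
        pos = lowered.index(word, pos)  # offset of this occurrence
        terms.append((word, pos))
        pos += len(word)
    return doc_id, terms

# forward index: documentID -> list of (term, position)
forward_index = dict([
    tokenize(0, "i like apple"),
    tokenize(1, "liverpool wins"),
])
print(forward_index[0])  # [('i', 0), ('like', 2), ('apple', 7)]
```

A real tokenizer would also handle punctuation, Unicode normalization, and stemming; this only illustrates the (documentID, terms) shape flowing to the indexer.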
- The forward index results will be sent to the indexer service to generate the inverted index and store it on shards.
- Engagement aggregator: Counts the number of engagements for each Tweet in a given day. These engagement counts are used later as an input in scoring each Tweet.
- Aggregation: Joins multiple data sources together based on Tweet ID.
- Ingestion: Performs different types of preprocessing — language identification, tokenization, text feature extraction, URL resolution and more.
- Scorer: Computes a score based on features extracted during Ingestion. For the smaller historical indices, this score determined which Tweets were selected into the index.
- Partitioner: Divides the data into smaller chunks through our hashing algorithm. The final output is stored into HDFS.
- Update the mapping from term to termID

term : termID
--------------------------
i     | 0
like  | 1
apple | 2
...

- Build the inverted index

termID : list of postings (docID, position)
--------------------------
0 | [(0, 0)]
1 | [(0, 2)]
...
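The two steps above (term → termID assignment, then posting appends) can be sketched as follows; `index_document` and the in-memory dicts are illustrative, not the real Earlybird structures:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

term_ids: Dict[str, int] = {}  # term -> termID, grown on first sight
inverted: Dict[int, List[Tuple[int, int]]] = defaultdict(list)  # termID -> postings

def index_document(doc_id: int, terms: List[Tuple[str, int]]) -> None:
    """Indexer worker sketch: assign a fresh termID to unseen terms, then
    append a (docID, position) posting. Appending keeps each posting list
    in docID order as long as documents arrive in docID order."""
    for term, position in terms:
        tid = term_ids.setdefault(term, len(term_ids))
        inverted[tid].append((doc_id, position))

index_document(0, [("i", 0), ("like", 2), ("apple", 7)])
index_document(1, [("liverpool", 0), ("wins", 10)])
print(term_ids["like"], inverted[term_ids["like"]])  # 1 [(0, 2)]
```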
Why use a parallel array instead of a linked list? Array elements occupy consecutive memory addresses, while linked-list nodes are scattered across the heap, so arrays have much better cache locality and therefore better traversal performance.
We need to distribute the index across all Earlybird servers to balance the load of both the index-building step and the query-serving step. We could actually perform the index building (the step above) on the Earlybird servers themselves, which means we have to partition and distribute the load upfront. At the tokenizing stage, the key is the documentID and the value is a list of terms. We can simply partition by documentID: since each document usually contains a relatively random set of terms, each server ends up holding relatively random terms.
Merge the search results from the different Earlybird machines and enrich the result. This is the same idea as Blender.
There are two phases:
- Basic query
- Filter and Ranking
A posting list has limited room in memory to hold all relevance/ranking-related info, so we could have a plugin mechanism that takes a documentID as input and returns the score to sort by. However, querying that external service increases latency. It is really a tradeoff between storage and performance: we either store the score alongside the posting list, or decouple it into an external service for scalability. Alternatively, we could co-locate the ranking service with the query service to reduce the number of network calls.
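The two-phase flow (gather candidates, then score and rank) can be sketched as below; `blend` and the `score_of` lookup stand in for the Blender merge and the external/co-located scorer, and are assumptions rather than the real API:

```python
from typing import Dict, List

def blend(shard_results: List[List[int]],
          score_of: Dict[int, float],
          k: int = 10) -> List[int]:
    """Blender-style merge sketch: union candidate docIDs from every
    Earlybird shard, look up each document's score (phase 2: filter and
    ranking), and return the top-k. If score_of is an external service
    call, each lookup adds latency; co-locating it avoids that hop."""
    candidates = {doc for shard in shard_results for doc in shard}
    return sorted(candidates,
                  key=lambda d: score_of.get(d, 0.0),
                  reverse=True)[:k]

scores = {1: 0.9, 2: 0.4, 3: 0.7}
print(blend([[1, 2], [3, 2]], scores, k=2))  # [1, 3]
```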
- Split reads and writes by using segments
- Define a fixed number of segments.
- Each segment holds a fixed number of documents.
- Fill up one segment before moving on to the next segment.
- Only the current segment is actively being modified, other segments are read-only.
- Posting pools and a currReadyTailP pointer avoid concurrent reads/writes within the same segment
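A minimal sketch of the segment scheme in the list above (the class and capacity are illustrative; real segments hold millions of documents):

```python
from typing import List

SEGMENT_CAPACITY = 4  # docs per segment; tiny here for illustration

class SegmentedIndex:
    """Only the current (tail) segment accepts writes; once full it is
    sealed and becomes read-only, so readers never race with writers
    inside any sealed segment."""
    def __init__(self) -> None:
        self.sealed: List[List[int]] = []  # read-only segments
        self.current: List[int] = []       # the only writable segment

    def add(self, doc_id: int) -> None:
        self.current.append(doc_id)
        if len(self.current) == SEGMENT_CAPACITY:
            self.sealed.append(self.current)  # seal: no further writes
            self.current = []

idx = SegmentedIndex()
for d in range(10):
    idx.add(d)
print(len(idx.sealed), idx.current)  # 2 [8, 9]
```

Within the active segment, readers would additionally only scan up to a published tail pointer (the currReadyTailP idea), so partially written postings are never visible.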
Note: this is my personal thinking.
Suppose docID-1: [i, like] was the old record, so the following inverted index entries were generated:
i: docID-1
like: docID-1
Now the updated record docID-1: [i, hate] is sent to the indexing service, and the inverted index becomes:
i: docID-1
hate: docID-1
like: docID-1*
We keep a flag on each posting to indicate whether the entry should be deleted during segment merges and skipped during queries. This is similar to how DB index segments are merged.
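A sketch of the tombstone-flag idea, under the simplifying assumption that the index is a dict of `[docID, deleted_flag]` postings (names are illustrative):

```python
from typing import Dict, List

# posting: [docID, deleted_flag]; term -> postings (simplified)
index: Dict[str, List[List[int]]] = {
    "i":    [[1, 0]],
    "like": [[1, 1]],  # flagged: doc 1 no longer contains "like"
    "hate": [[1, 0]],
}

def search(term: str) -> List[int]:
    """Skip postings whose delete flag is set. A background segment merge
    would physically drop flagged postings later, much like an LSM or DB
    index compaction."""
    return [doc for doc, deleted in index.get(term, []) if not deleted]

print(search("like"), search("hate"))  # [] [1]
```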
For HA we keep replicas across different index servers; the replicas can be copied asynchronously. In the worst case the in-memory inverted index is lost, but the offline inverted-indexing pipeline can rebuild it.
- Using local documentID instead of global documentID (size reduced)
- Compress the posting list (size reduced)
- Delta Compression and Bit Packing (size reduced)
- Store count instead of positional info in posting list (size reduced)
- Positional info is stored in another parallel array for further compression
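The delta-compression-plus-bit-packing item above can be illustrated with a varint encoder over docID gaps (a standard technique; the function name is mine):

```python
from typing import List

def delta_varint_encode(doc_ids: List[int]) -> bytes:
    """Delta-compress a sorted docID list, then varint-encode the gaps:
    each byte carries 7 data bits plus a continuation bit, so the small
    gaps typical of dense posting lists fit in a single byte."""
    out, prev = bytearray(), 0
    for doc in doc_ids:
        gap = doc - prev
        prev = doc
        while gap >= 0x80:                   # more than 7 bits left:
            out.append((gap & 0x7F) | 0x80)  # emit 7 bits + continuation
            gap >>= 7
        out.append(gap)                      # final byte, high bit clear
    return bytes(out)

postings = [1000, 1003, 1009, 1010, 1200]
encoded = delta_varint_encode(postings)
print(len(encoded))  # 7 bytes, versus 20 bytes for five 4-byte docIDs
```

Using local (per-segment) docIDs keeps the gaps, and hence the varints, even smaller.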
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats.html
There are two major factors that could drag down performance:
- Indexing is done, but some slow data processing is still in progress. For example, the URL shortening service could be slower than the indexing service. When a customer searches at that point, what should we do?
- Sorting by scores/relevance could be slow
https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync
TBA
TBA
- https://buildingvts.com/elasticsearch-architectural-overview-a35d3910e515
- https://www.elastic.co/blog/found-indexing-for-beginners-part1/
- https://www.elastic.co/blog/found-indexing-for-beginners-part2/
- https://www.elastic.co/blog/found-indexing-for-beginners-part3/
- https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing
- https://www.elastic.co/blog/found-dive-into-elasticsearch-storage
- https://youtu.be/CeGtqouT8eA?t=1455
- http://blog.gaurav.im/2016/12/28/systems-design-twitter-search/
- https://blog.twitter.com/engineering/en_us/a/2011/the-engineering-behind-twitter-s-new-search-experience.html
- Paper: Earlybird (Twitter real-time search)
- https://blog.twitter.com/engineering/en_us/a/2011/the-engineering-behind-twitter-s-new-search-experience.html
- https://blog.twitter.com/engineering/en_us/a/2014/building-a-complete-tweet-index.html
- https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/omnisearch-index-formats.html
- https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second.html
- https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/search-relevance-infrastructure-at-twitter.html
- https://www.slideshare.net/ramezf/twitter-search-architecture