Why Use Sphinx - Aggregating Sharded Data

Building a scalable system often involves sharding (partitioning) the data across different physical MySQL servers.

When the data is sharded at a fine level of granularity, simply fetching a few rows with a selective WHERE (which should be fast) means contacting many servers, checking for errors, and merging the results together in the application. Sphinx alleviates this problem, because all the necessary functionality is already implemented inside the search daemon.

Consider an example where a 1 TB table with a billion blog posts is sharded by user ID over 10 physical MySQL servers, so a given user’s posts always go to the same server. As long as queries are restricted to a single user, everything is fine: we choose the server based on user ID and work with it as usual.

Now assume that we need to implement an archive page that shows the user’s friends’ posts. How are we going to display page 50, with entries 981 to 1000, sorted by post date? Most likely, the various friends’ data will be on different servers. With only 10 friends, there’s about a 90% chance that more than 8 servers will be used, and that probability increases to 99% if there are 20 friends. So, for most queries, we will need to contact all the servers. Worse, we’ll need to pull 1,000 posts from each server and sort them all in the application. We’d trim down the required data to the post ID and timestamp only, but that’s still 10,000 records to sort in the application. Most modern scripting languages consume a lot of CPU time for that sorting step alone. In addition, we’ll either have to fetch the records from each server sequentially (which will be slow) or write some code to juggle the parallel querying threads (which will be difficult to implement and maintain).

In such situations, it makes sense to use Sphinx instead of reinventing the wheel. All we’ll have to do in this case is set up several Sphinx instances, mirror the frequently accessed post attributes from each table—in this example, the post ID, user ID, and timestamp—and query the master Sphinx instance for entries 981 to 1000, sorted by post date, in approximately three lines of code. This is a much smarter way to scale.

Source of Information : OReIlly High Performance MySQL Second Edition


Subscribe to Developer Techno ?
Enter your email address:

Delivered by FeedBurner