Why Use Sphinx - Optimizing GROUP BY Queries

Support for everyday SQL-like clauses would be incomplete without GROUP BY functionality, so Sphinx has that too. But unlike MySQL’s general-purpose implementation, Sphinx specializes in solving a practical subset of GROUP BY tasks efficiently. This subset covers the generation of reports from big (1–100 million row) datasets when one of the following cases holds:

• The result is only a “small” number of grouped rows (where “small” is on the order of 100,000 to 1 million rows).

• Very fast execution is required and approximate COUNT(*) results are acceptable when many groups are retrieved from data distributed over a cluster of machines.

This is not as restrictive as it might sound. The first scenario covers practically all imaginable time-based reports. For example, a detailed per-hour report for a period of 10 years will return fewer than 90,000 records. The second scenario could be expressed in plain English as something like “as quickly and accurately as possible, find the 20 most important records in a 100-million-row sharded table.”
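A quick arithmetic check of the per-hour example, counting one grouped row per hour and using 366-day years as an upper bound:

```python
# One grouped row per hour over 10 years; 366-day years give an upper bound.
hours_in_10_years = 10 * 366 * 24
print(hours_in_10_years)  # 87840, comfortably under 90,000
assert hours_in_10_years < 90_000
```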

Sphinx can accelerate these two types of general-purpose queries, but its grouping is also useful in full-text search applications. Many applications need to display not only full-text matches, but some aggregate results as well. For example, many search result pages show how many matches were found in each product category, or display a graph of matching document counts over time. Another common requirement is to group the results and show the most relevant match from each category.

Sphinx’s group-by support lets you combine grouping and full-text searching, eliminating the overhead of doing the grouping in your application or in MySQL.
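The per-category counts mentioned above can be sketched conceptually as a single pass over the match list. The data and field names here are hypothetical; Sphinx computes such aggregates internally while producing the matches, which is exactly the work this sketch shows an application would otherwise do itself:

```python
from collections import Counter

# Hypothetical search results: (document_id, category) pairs that already
# matched a full-text query. Grouping them is one pass over the matches.
matches = [
    (1, "books"), (2, "books"), (3, "music"),
    (4, "books"), (5, "video"), (6, "music"),
]

per_category = Counter(cat for _, cat in matches)
print(per_category)  # Counter({'books': 3, 'music': 2, 'video': 1})
```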

As with sorting, grouping in Sphinx uses fixed memory. It is slightly (10% to 50%) more efficient than similar MySQL queries on datasets that fit in RAM. In this case, most of Sphinx’s power comes from its ability to distribute the load and greatly reduce the latency. For huge datasets that could never fit in RAM, you can build a special disk-based index for reporting, using inline attributes (defined later). Queries against such indexes execute about as fast as the disk can read the data—about 30–100 MB/sec on modern hardware. In this case, the performance can be many times better than MySQL’s, though the results will be approximate.

The most important difference from MySQL’s GROUP BY is that Sphinx may, under certain circumstances, yield approximate results. There are two reasons for this:

• Grouping uses a fixed amount of memory. If there are too many groups to hold in RAM and the matches are in a certain “unfortunate” order, per-group counts might be smaller than the actual values.

• A distributed search sends only the aggregate results, not the matches themselves, from node to node. If the same records exist on different nodes, per-group distinct counts might be greater than the actual values, because the information needed to remove the duplicates is not transmitted between nodes.
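The first cause above can be illustrated with a sketch (this is not Sphinx's actual algorithm, just a minimal model of fixed-memory grouping): keep at most a fixed number of group counters and evict the smallest when a new group arrives while the buffer is full. Counts accumulated before an eviction are lost if that group reappears later:

```python
MAX_GROUPS = 2  # deliberately tiny to force evictions

def grouped_counts(stream, max_groups=MAX_GROUPS):
    """Count occurrences per key using at most max_groups counters."""
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1
        elif len(counts) < max_groups:
            counts[key] = 1
        else:
            # Buffer full: evict the group with the smallest count.
            victim = min(counts, key=counts.get)
            del counts[victim]
            counts[key] = 1
    return counts

# An "unfortunate" order: group 'a' is evicted, then reappears.
approx = grouped_counts(["a", "b", "c", "a", "a"])
print(approx)  # {'c': 1, 'a': 2} -- 'a' really occurred 3 times
```

With enough memory for all groups the counts would be exact; the undercount appears only when groups outnumber the buffer and arrive in an interleaved order.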

In practice, it is often acceptable to have fast approximate group-by counts. If this isn’t acceptable, it’s often possible to get exact results by tuning the daemon and client application carefully.

You can generate the equivalent of COUNT(DISTINCT attribute), too. For example, you can use this to compute the number of distinct sellers per category on an auction site.
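The second source of approximation described earlier applies directly to this case. A sketch, with hypothetical shard data, of why merging per-node distinct counts can overcount: each node transmits only a number, so a seller present on two nodes is counted once by each:

```python
# Hypothetical per-node data: category -> set of seller IDs on that node.
shard1 = {"electronics": {"seller_a", "seller_b"}}
shard2 = {"electronics": {"seller_b", "seller_c"}}

# What each node transmits is a count, not the seller IDs themselves,
# so the merge can only add the counts together.
merged = {}
for shard in (shard1, shard2):
    for category, sellers in shard.items():
        merged[category] = merged.get(category, 0) + len(sellers)

exact = len(shard1["electronics"] | shard2["electronics"])
print(merged["electronics"], exact)  # 4 3 -- seller_b was counted twice
```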

Finally, Sphinx lets you choose criteria to select the single “best” document within each group. For example, you can select the most relevant document from each domain, while grouping by domain and sorting the result set by per-domain match counts. This is not possible in MySQL without a complex query.
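The "best match per group" behavior can be sketched as follows. Field names and data are hypothetical, and Sphinx performs this in one internal pass rather than in application code: for each domain, keep the highest-weight match and a running count, then order the groups by per-domain match count:

```python
# Hypothetical full-text matches with a per-match relevance weight.
matches = [
    {"id": 1, "domain": "a.com", "weight": 10},
    {"id": 2, "domain": "a.com", "weight": 30},
    {"id": 3, "domain": "b.com", "weight": 20},
    {"id": 4, "domain": "a.com", "weight": 5},
]

groups = {}  # domain -> {"best": most relevant match, "count": match count}
for m in matches:
    g = groups.setdefault(m["domain"], {"best": m, "count": 0})
    g["count"] += 1
    if m["weight"] > g["best"]["weight"]:
        g["best"] = m

# Order the result set by descending per-domain match count.
result = sorted(groups.values(), key=lambda g: -g["count"])
print([(g["best"]["id"], g["count"]) for g in result])  # [(2, 3), (3, 1)]
```

In MySQL, the same result would require a correlated subquery or a self-join to pick the top row per group, which is part of why this feature is convenient.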

Source of Information: O'Reilly High Performance MySQL, Second Edition, June 2008

