How to optimize aggregate queries in Clickhouse?
Table Of Content
Table of Contents
Overview
Clickhouse is quite fast storage, but when your storage is huge enough searching and aggregating in raw data become quite expensive. In this case you would think about optimization some queries. Today I would like to talk about a way where we will use AggregatingMergeTree with Materialized View. Materialized View gets all data by a given query and AggregatingMergeTree aggregates inserted records by sorting key. With this approach, We can group data by some fields and it helps us optimize heavy queries for a long period. Most important things, the optimization rate is a proportion of all records with unique records in the materialized view. It tells how fast your optimization will work.
How to
CREATE TABLE views
(
id UInt64,
manager UInt64,
splitTest UInt64,
splitTestOption UInt64,
client UInt64,
schemeType UInt64,
campaign UInt64,
creative UInt64,
source UInt64,
domain UInt64,
referrer UInt64,
country String,
device String,
os String,
viewAt DateTime,
viewDate Date DEFAULT toDate(viewAt),
platform UInt64,
orientation UInt64,
pageType UInt64,
categoryName UInt64,
widgetName UInt64,
widgetElement UInt64,
logged UInt8,
isVisited UInt8
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test/{shard}/views', '{replica}', viewDate,
(viewDate, client), 8192);
We have a huge table with a large amount of data and we want to optimize queries by viewDate, manager, client, campaign, creative, source, country, device, domain columns. We need to create Materialized view with AggregatingMergeTree engine.
CREATE MATERIALIZED VIEW views_mat
engine = AggregatingMergeTree partition by viewDate order by (
viewDate,
manager,
client,
campaign,
creative,
source,
country,
device,
domain,
landing)
AS
SELECT viewDate,
manager,
client,
campaign,
creative,
source,
country,
device,
domainId,
landing,
countState() as amount
FROM views
GROUP BY viewDate,
manager,
client,
campaign,
creative,
source,
country,
device,
domain,
landing
Aggregate function combinators
You can see countState() function. This function doesn’t return the resulting value, but just an intermediate state of the aggregation. You should use countMerge() function for getting the resulting value. More information about combinators you can check (here)[https://clickhouse.yandex/docs/en/query_language/agg_functions/combinators/].
Let’s try queries and check the difference between the raw table and the materialized view.
Raw:
SELECT
campaign,
count() AS views
FROM views
WHERE (client IN 7559922) AND (country IN 'in') AND ((viewDate >= '2018-06-01') AND (viewDate <= '2018-12-30'))
GROUP BY campaign;
Row 1:
──────
campaignId: 0
impressions: 7955062
Row 2:
──────
campaignId: 812661
impressions: 11003
Row 3:
──────
campaignId: 1334
impressions: 253350
Row 4:
──────
campaignId: 569467
impressions: 93182604
Row 5:
──────
campaignId: 237410
impressions: 7554
Row 6:
──────
campaignId: 23456
impressions: 3
Row 7:
──────
campaignId: 524556
impressions: 438
7 rows in set. Elapsed: 46.781 sec. Processed 9.18 billion rows, 116.92 GB (196.24 million rows/s., 2.50 GB/s.)
AgregateMergeTree and Materialized view:
SELECT
campaignId,
countMerge(amount) AS impressions
FROM impressions_mat
WHERE (userId IN 26) AND (country IN 'in') AND ((impressionDate >= '2018-06-01') AND (impressionDate <= '2018-12-30'))
GROUP BY campaignId
Row 1:
──────
campaignId: 0
impressions: 7955062
Row 2:
──────
campaignId: 812661
impressions: 11003
Row 3:
──────
campaignId: 1334
impressions: 253350
Row 4:
──────
campaignId: 569467
impressions: 93182604
Row 5:
──────
campaignId: 237410
impressions: 7554
Row 6:
──────
campaignId: 23456
impressions: 3
Row 7:
──────
campaignId: 524556
impressions: 438
7 rows in set. Elapsed: 0.116 sec. Processed 9.13 million rows, 448.10 MB (78.78 million rows/s., 3.86 GB/s.)
In other queries, optimization might be more or less.
Populate
CREATE MATERIALIZED VIEW has option POPULATE. It allows you to populate the view with data from request but it has one disadvantage. The view won’t get data which come while population. Therefore, we need to create a new view then populate it after view creation. In our case, we can stop inserting new data, populate view, then start inserting.
Summary
If you need to increase speed for some specific queries you can use Materialized View with AggregatingMergeTree engine. Keep in mind that improving is proportional to the amount of rows with unique values.