Before attempting workarounds consider whether Citus is appropriate for your situation. Citus’ current version works well for real-time analytics and multi-tenant use cases.
Citus supports all SQL statements in the multi-tenant use-case. For real-time analytics use-cases, with queries which span across nodes, Citus supports a subset of SQL statements. We are continuously working to increase SQL coverage to better support other use-cases such as data warehousing queries. Also many of the unsupported features have workarounds; below are a number of the most useful.
Subqueries in WHERE¶
A common type of query asks for values which appear in designated ways within a table, or aggregations of those values. For instance we might want to find which users caused events of types A and B in a table which records one user and event record per row:
select user_id from events where event_type = 'A' and user_id in ( select user_id from events where event_type = 'B' )
Another example. How many distinct sessions viewed the top twenty-five most visited web pages?
select page_id, count(distinct session_id) from visits where page_id in ( select page_id from visits group by page_id order by count(*) desc limit 25 ) group by page_id;
Citus does not allow subqueries in the WHERE clause so we must choose a workaround.
Workaround 1. Generate explicit WHERE-IN expression¶
Best used when you don’t want to complicate code in the application layer.
In this technique we use PL/pgSQL to construct and execute one statement based on the results of another.
-- create temporary table with results do language plpgsql $$ declare user_ids bigint; begin execute 'select user_id' ' from events' ' where event_type = ''B''' into user_ids; execute format( 'create temp table results_temp as ' 'select user_id' ' from events' ' where user_id = any(array[%s]::bigint)' ' and event_type = ''A''', array_to_string(user_ids, ',')); end; $$; -- read results, remove temp table select * from results_temp; drop table results_temp;
Workaround 2. Build query in SQL client¶
Best used for simple cases when the subquery returns limited rows.
Like the previous workaround this one creates an explicit list of values for an IN comparison. This workaround does too, except it does so in the application layer, not in the backend. It works best when there is a short list for the IN clause. For instance the page visits query is a good candidate because it limits its inner query to twenty-five rows.
-- first run this select page_id from visits group by page_id order by count(*) desc limit 25;
Interpolate the list of ids into a new query
-- Notice the explicit list of ids obtained from previous query -- and added by the application layer select page_id, count(distinct session_id) from visits where page_id in (2,3,5,7,13) group by page_id
JOIN a local and a distributed table¶
Attempting to execute a JOIN between a local and a distributed table causes an error:
ERROR: cannot plan queries that include both regular and partitioned relations
There is a workaround: you can replicate the local table to a single shard on every worker and push the join query down to the workers. We do this by defining the table as a ‘reference’ table using a different table creation API. Suppose we want to join tables here and there, where there is already distributed but here is on the coordinator database.
This will create a table with a single shard (non-distributed), but will replicate that shard to every node in the cluster. Now Citus will accept a join query between here and there, and each worker will have all the information it needs to work efficiently.
Currently Citus does not have out-of-the-box support for window functions on cross-shard queries, but there is a straightforward workaround. Window functions will work across shards on a distributed table if
- The window function is in a subquery and
- It includes a
PARTITION BYclause containing the table’s distribution column
Suppose you have table called
github_events, distributed by the column
user_id. This query will not work directly:
-- won't work, see workaround SELECT repo_id, org->'id' as org_id, count(*) OVER (PARTITION BY user_id) FROM github_events;
You can make it work by moving the window function into a subquery like this:
SELECT * FROM ( SELECT repo_id, org->'id' as org_id, count(*) OVER (PARTITION BY user_id) FROM github_events ) windowed;
Remember that it specifies
PARTITION BY user_id, the distribution column.
Data Warehousing Queries¶
When queries have restrictive filters (i.e. when very few results need to be transferred to the coordinator) there is a general technique to run unsupported queries in two steps. First store the results of the inner queries in regular PostgreSQL tables on the coordinator. Then the next step can be executed on the coordinator like a regular PostgreSQL query.
For example, consider the Window Functions case above. If we’re partitioning over a non-distribution column of a distributed table then the workaround mentioned in that section will not suffice.
-- this won't work, not even with the subquery workaround SELECT repo_id, org->'id' as org_id, count(*) OVER (PARTITION BY repo_id) -- repo_id is not distribution column FROM github_events WHERE repo_id IN (8514, 15435, 19438, 21692);
We can use a more general trick though. We can pull the relevant information to the coordinator as a temporary table:
-- grab the data, minus the aggregate, into a local table CREATE TEMP TABLE results AS ( SELECT repo_id, org->'id' as org_id FROM github_events WHERE repo_id IN (8514, 15435, 19438, 21692) ); -- now run the aggregate locally SELECT repo_id, org_id, count(*) OVER (PARTITION BY repo_id) FROM results;
Similar workarounds can be found for other data warehousing queries involving unsupported constructs.
The above query is a simple example intended at showing how meaningful workarounds exist around the lack of support for a few query types. Over time, we intend to support these commands out of the box within Citus.