Updatable User Data¶
In this tutorial we’ll continue looking at wikipedia edits. The previous tutorial ingested a stream of all live edits happening across wikimedia. We’ll continue looking at that stream but store it in a different way.
This tutorial assumes you’ve set up a single-machine demo cluster. Once your cluster is running, open a prompt to the master instance:
cd citus-tutorial bin/psql postgresql://:9700
This will open a new prompt. You can leave it at any time by hitting Ctrl + D.
This time we’re going to make two tables.
CREATE TABLE wikipedia_editors ( editor TEXT UNIQUE, -- The name of the editor bot BOOLEAN, -- Whether they are a bot (self-reported) edit_count INT, -- How many edits they've made added_chars INT, -- How many characters they've added removed_chars INT, -- How many characters they've removed first_seen TIMESTAMPTZ, -- The time we first saw them edit last_seen TIMESTAMPTZ -- The time we last saw them edit ); CREATE TABLE wikipedia_changes ( editor TEXT, -- The editor who made the change time TIMESTAMP WITH TIME ZONE, -- When they made it wiki TEXT, -- Which wiki they edited title TEXT, -- The name of the page they edited comment TEXT, -- The message they described the change with minor BOOLEAN, -- Whether this was a minor edit (self-reported) type TEXT, -- "new" if this created the page, "edit" otherwise old_length INT, -- how long the page used to be new_length INT -- how long the page is as of this edit );
These tables are regular Postgres tables. We need to tell Citus that they should be distributed tables, stored across the cluster.
SELECT master_create_distributed_table( 'wikipedia_changes', 'editor', 'hash' ); SELECT master_create_distributed_table( 'wikipedia_editors', 'editor', 'hash' );
These say to store each table as a collection of shards, each responsible for
holding a different subset of the data. The shard a particular row belongs in
will be computed by hashing the
editor column. The page on Hash Distribution
goes into more detail.
Finally, create the shards:
SELECT master_create_worker_shards('wikipedia_editors', 16, 1); SELECT master_create_worker_shards('wikipedia_changes', 16, 1);
This tells Citus to create 16 shards for each table, and save 1 replica of each. You can ask Citus to store multiple copies of each shard, which allows it to recover from worker failures without losing data or dropping queries. However, in this example cluster we only have 1 worker, so Citus would error out if we asked it to store any more than 1 replica.
Now we’re ready to accept some data! Open a separate terminal and run the data ingest script we’ve made for you in this new terminal:
# - in a new terminal - cd citus-tutorial scripts/collect-wikipedia-user-data postgresql://:9700
This should keep running and aggregating data on the users who are editting right now. Let’s run some queries! If you run any of these queries multiple times you’ll see the results update as more data is ingested. Returning to our existing psql terminal we can ask some simple questions, such as finding edits which were made by bots:
-- back in the original (psql) terminal SELECT comment FROM wikipedia_changes c, wikipedia_editors e WHERE c.editor = e.editor AND e.bot IS true LIMIT 10;
Above, when we created our two tables, we partitioned them along the same column and created an equal number of shards for each. Doing this means that all data for each editor is kept on the same machine, or, colocated.
How many pages have been created by bots? By users?
SELECT bot, count(*) as pages_created FROM wikipedia_changes c, wikipedia_editors e WHERE c.editor = e.editor AND type = 'new' GROUP BY bot;
Citus can also perform joins where the rows to be joined are not stored on the same machine. But, joins involving colocated rows usually run faster than their non-distributed versions, because they can run across all machines and shards in parallel.
A surprising amount of the content in wikipedia is written by users who stop by to make just one or two edits and don’t even bother to create an account. Their username is just their ip address, which will look something like ‘126.96.36.199’ or ‘2607:FB90:25C8:8785:0:42:36E9:7E01’.
We can (using a very rough regex), find their edits:
SELECT editor SIMILAR TO '[0-9.A-F:]+' AS ip_editor, COUNT(1) AS edit_count, SUM(added_chars) AS added_chars FROM wikipedia_editors WHERE bot is false GROUP BY ip_editor;
Usually, around a fifth of all non-bot edits are made from unregistered editors. The real percentage is a lot higher, since “bot” is a user-settable flag which many bots neglect to set.
That’s all for now. This script showed a data layout which many Citus users choose. One table stored a stream of events while another table stored some aggregations of those events and made queries over them quick.
We hope you enjoyed working through our tutorials. Once you’re ready to stop the cluster run these commands:
bin/pg_ctl -D data/master stop bin/pg_ctl -D data/worker stop