Monday, September 11, 2017

Greenplum Best Practises

Chitchatiq 9/11/2017 08:33:00 PM Best Practices, Greenplum No comments

Best Practices:

A distribution key should not have more than 2 columns, recommended is 1 column.
While modeling a database, we need to make sure all the tables that gets involved in longer query execution are distributed using the joining column.
Distribution keys should be domain key but not surrogate keys.
Explicitly define a column or random distribution for all tables. Do not use the default as it may cause skew on segments.
Do not distribute on columns that will be used in the WHERE clause of a query. You should partition the data on the column that is used in WHERE clause
The distribution key column data should contain unique values
Do not distribute on dates or timestamps. Distributing data on date or timestamp may cause poor system performance
There are 2 questions to ask while deciding the distribution key of a table:

Is the Distribution Key used in joins?

Makes sure, no redistribution is happening during query execution

Does the key 'Equi-distributes' the data across all the segments?

Makes sure all the segments get approximately the same amount of data, hence the same amount of work during query execution.
We need to make sure data skewing is minimum for every table.

SELECT * FROM gp_toolkit.gp_skew_coefficients

0 comments:

Post a Comment