Best Practices:
- A distribution key should not have more than 2 columns, recommended is 1 column.
- While modeling a database, we need to make sure all the tables that gets involved in longer query execution are distributed using the joining column.
- Distribution keys should be domain key but not surrogate keys.
- Explicitly define a column or random distribution for all tables. Do not use the default as it may cause skew on segments.
- Do not distribute on columns that will be used in the WHERE clause of a query. You should partition the data on the column that is used in WHERE clause
- The distribution key column data should contain unique values
- Do not distribute on dates or timestamps. Distributing data on date or timestamp may cause poor system performance
- There are 2 questions to ask while deciding the distribution key of a table:
- Is the Distribution Key used in joins?
- Makes sure, no redistribution is happening during query execution
- Does the key 'Equi-distributes' the data across all the segments?
- Makes sure all the segments get approximately the same amount of data, hence the same amount of work during query execution.
- We need to make sure data skewing is minimum for every table.
SELECT * FROM
gp_toolkit.gp_skew_coefficients
0 comments:
Post a Comment