ORDER BY:
→ Forces all the data to go through the same reducer node. By doing this, ORDER BY ensures that the entire dataset is totally ordered.
→ Uses a single reducer to guarantee total order in the output.
Drawbacks:
→ A single reducer will take a long time to sort very large outputs.
SORT BY:
→ Sorts the rows by the given columns within each reducer. If there is more than one reducer, each reducer's output is sorted independently of the others.
Drawbacks:
→ If we have more than one reducer, the combined output is not guaranteed to be totally ordered.
Let's take a simple example. Currently the Dept table has the following data:
First, we will run the ORDER BY query with the reducer count set to 2. Even with this setting, ORDER BY still sends all rows through a single reducer for the final sort.
As the query output shows, all the data is sorted by the deptno column in ascending order.
Now we will run the SORT BY query.
We can clearly see that the results are sorted within each reducer, but not across the complete dataset.
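The difference between the two clauses can be modeled outside Hive with a small Python sketch. This is only a simulation under stated assumptions, not Hive's implementation: the dept rows below are hypothetical sample data (the real table contents are not reproduced here), and round-robin assignment stands in for however Hive actually spreads rows across reducers. The function names `order_by` and `sort_by` are made up for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]  # (deptno, dname); hypothetical columns


def order_by(rows: List[Row], key: Callable[[Row], int]) -> List[Row]:
    """Model ORDER BY: every row goes to one reducer, which sorts them all."""
    return sorted(rows, key=key)


def sort_by(rows: List[Row], key: Callable[[Row], int],
            num_reducers: int) -> List[List[Row]]:
    """Model SORT BY: each reducer sorts only the rows it received."""
    buckets: List[List[Row]] = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin is a stand-in for Hive's row distribution.
        buckets[i % num_reducers].append(row)
    return [sorted(bucket, key=key) for bucket in buckets]


# Hypothetical sample data, not the contents of the real Dept table.
dept = [(40, "OPERATIONS"), (10, "ACCOUNTING"), (30, "SALES"),
        (20, "RESEARCH"), (50, "MARKETING"), (60, "HR")]

total = order_by(dept, key=lambda r: r[0])
per_reducer = sort_by(dept, key=lambda r: r[0], num_reducers=2)
# Reading the reducer outputs one after another gives the combined result.
concatenated = [row for bucket in per_reducer for row in bucket]

print([r[0] for r in total])         # globally sorted deptno values
print([r[0] for r in concatenated])  # sorted inside each reducer only
```

In this model each reducer's list is sorted, but reading the lists one after another does not give a globally sorted result, which matches the behaviour described above.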
However, sometimes we do not require total ordering. For example, suppose you have a table called user_action_table where each row has user_id, action, and time. Your goal is to order the rows by time within each user_id. In this situation, we can use the SORT BY clause.
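As a rough illustration of per-user ordering, the Python sketch below simulates what happens when all rows for a user land on the same reducer and each reducer sorts by (user_id, time). It is a sketch under assumptions, not Hive code. In Hive, keeping a user's rows together is normally done by pairing SORT BY with DISTRIBUTE BY user_id, which the text above does not mention. The sample rows, the integer user ids, and the modulo partitioning are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Action = Tuple[int, str, int]  # (user_id, action, time); hypothetical values


def distribute_and_sort(rows: List[Action],
                        num_reducers: int) -> Dict[int, List[Action]]:
    """Send each user's rows to one reducer, then sort within each reducer."""
    reducers: Dict[int, List[Action]] = defaultdict(list)
    for row in rows:
        user_id = row[0]
        # Assumption: modulo on user_id stands in for Hive's hash partitioning.
        reducers[user_id % num_reducers].append(row)
    # Each reducer orders its own rows by user_id, then by time.
    return {r: sorted(rs, key=lambda row: (row[0], row[2]))
            for r, rs in reducers.items()}


# Hypothetical rows for user_action_table.
user_actions = [(1, "login", 30), (2, "click", 10), (1, "click", 10),
                (3, "login", 20), (2, "logout", 40), (1, "logout", 50)]

result = distribute_and_sort(user_actions, num_reducers=2)
for reducer_id in sorted(result):
    print(reducer_id, result[reducer_id])
```

Each user's actions come out in time order, and no single reducer has to sort the whole table. There is no ordering guarantee across reducers, which is acceptable here because the goal is only ordering within a user.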