As Hive uses HDFS to store structured data in files, Hive supports multiple file formats with various use cases:
1. TEXTFILE
2. SEQUENCEFILE
3. RCFILE
4. ORCFILE
File format:
- File formats vary in how the data is laid out in the files, the compression that can be applied to the data, the SerDes used to serialize and deserialize the data, the usage of disk space, and the disk I/O involved.
- Hive doesn’t verify whether the data being loaded matches the schema of the table; it only verifies that the file format matches the table definition. So file formats play a key role in Hive.
- In Hive, we specify the file format for a table with a “STORED AS <fileformat>” clause at the end of the CREATE TABLE statement.
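For example, a minimal sketch (the table name and columns are made up for illustration):

CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
STORED AS ORC;

Replacing ORC with TEXTFILE, SEQUENCEFILE, or RCFILE selects the corresponding format; if the clause is omitted, Hive uses its default file format (TEXTFILE, unless hive.default.fileformat has been changed).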
Text File:
Defining a table as TEXTFILE in Hive stores the data as plain text, most commonly CSV (Comma Separated Values). Each field in a record is separated by a delimiter (a comma, space, or tab), or the records may be JSON documents.
CSV files are quite common and familiar to most of us. They are easy to handle and are usually used for importing and exporting data between Hadoop and external systems.
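As an illustration, a minimal sketch of a comma-delimited table and a load (the table, columns, and HDFS path are hypothetical):

CREATE TABLE stocks (
  symbol STRING,
  price DOUBLE,
  trade_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/stocks.csv' INTO TABLE stocks;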
Advantages:
- Easy to read and parse.
- Handy while data dumps.
Disadvantages:
- No block compression
- No metadata stored with the CSV file.
- Dependent on field order.
- Limited support for schema evolution.
Sequence Files:
Hadoop, our baby elephant, is more comfortable with a small number of big files than with a large number of small files (files smaller than the block size), because maintaining the metadata for many small files increases the load on the NameNode. Sequence files come to the rescue: a sequence file acts as a container to store many small files.
Sequence files store data in a binary format with a similar structure to CSV.
SequenceFile provides the SequenceFile.Writer, SequenceFile.Reader, and SequenceFile.Sorter classes for writing, reading, and sorting, respectively.
There are three types of sequence files, depending on the compression applied by the writer (a sketch of selecting the compression type from Hive follows this list):
1. Uncompressed key/value records (Writer).
2. Record-compressed key/value records, where only the ‘values’ are compressed (RecordCompressWriter).
3. Block-compressed key/value records, where keys and values are collected in ‘blocks’ separately and compressed (BlockCompressWriter).
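From Hive, the compression type for SEQUENCEFILE output can be selected with a few settings. A minimal sketch, assuming a pre-existing raw_logs table and the Snappy codec being available on the cluster (both are assumptions):

-- choose NONE, RECORD, or BLOCK compression for sequence file output
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE logs_seq (log_line STRING)
STORED AS SEQUENCEFILE;

-- writing through Hive now produces block-compressed sequence files
INSERT OVERWRITE TABLE logs_seq
SELECT log_line FROM raw_logs;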
Structure of sequence file:
A sequence file consists of a header followed by one or more records. All three formats above use the same header structure, which holds the version, the key and value class names, the compression flags, the compression codec, user metadata, and a sync marker.
Advantages:
· As binary files, these are more compact than text files.
· Compression available.
· Parallel processing.
· Resolves small files problem.
Disadvantages:
· Append only.
· Usable only within the Hadoop ecosystem.
RC File Format:
RCFILE (Record Columnar File) is another type of binary file format that offers a high compression rate on the rows. RCFILE follows columnar storage and applies the concept of “first horizontally-partition, then vertically-partition”, combining the advantages of both row-store and column-store.
First, as a row-store, RCFile guarantees that data in the same row are located on the same node, so it has a low cost of tuple reconstruction.
Second, as a column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.
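A minimal sketch of creating an RCFile-backed table in Hive (the tables and columns are made up for illustration):

CREATE TABLE sales_rc (
  product_id INT,
  amount DOUBLE,
  region STRING
)
STORED AS RCFILE;

-- RCFiles are binary, so they are normally populated from another table
-- rather than by loading raw text files directly
INSERT OVERWRITE TABLE sales_rc
SELECT product_id, amount, region FROM sales_text;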
Formats of RCFile
RC Header
· version - 3 bytes of magic header RCF, followed by 1 byte of actual version number (e.g. RCF1)
· compression - A boolean which specifies if compression is turned on for keys/values in this file.
· compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
· metadata - SequenceFile.Metadata for this file.
· sync - A sync marker to denote end of the header.
RCFile Format
· Header
· Record
· Key part
1. Record length in bytes
2. Key length in bytes
3. Number_of_rows_in_this_record(vint)
4. Column_1_ondisk_length(vint)
5. Column_1_row_1_value_plain_length
6. Column_1_row_2_value_plain_length...
7. Column_2_ondisk_length(vint)
8. Column_2_row_1_value_plain_length
9. Column_2_row_2_value_plain_length
· Value part
1. Compressed or plain data of [column_1_row_1_value, column_1_row_2_value,....]
2. Compressed or plain data of [column_2_row_1_value, column_2_row_2_value,....]
Storage:
In each HDFS block, RCFile organizes records with the basic unit of a row group. All the records stored in an HDFS block are partitioned into row groups, which have equal size for a given table.
A row group contains three sections:
1. A sync marker placed at the beginning of the row group. The sync marker is mainly used to separate two continuous row groups in an HDFS block.
2. A metadata header for the row group. The metadata header stores how many records are in the row group, how many bytes are in each column, and how many bytes are in each field in a column.
3. The table data section, which is actually a column-store. In this section, all the fields in the same column are stored contiguously.
Compression in RC files
The method of data appending in RCFile is summarized as follows.
1. RCFile creates and maintains an in-memory column holder for each column. When a record is appended, each field is appended to its corresponding column holder. In addition, RCFile records the corresponding metadata of each field in the metadata header.
2. RCFile provides two parameters to control how many records can be buffered in memory before they are flushed to disk: one parameter is the limit on the number of records, and the other is the limit on the size of the memory buffer (see the Hive settings sketch after this list).
3. RCFile first compresses the metadata header and stores it on disk. Then it compresses each column holder separately, and flushes the compressed column holders into one row group in the underlying file system.
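In Hive, these two limits correspond to configuration properties. A minimal sketch, assuming the property names from Hive's configuration documentation (the values shown are the documented defaults, used here only for illustration):

-- limit on the number of records buffered per row group
SET hive.io.rcfile.record.interval=2147483647;
-- limit on the size of the in-memory column buffer, in bytes (4 MB)
SET hive.io.rcfile.record.buffer.size=4194304;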
Hive provides an rcfilecat tool to display the contents of RCFiles; it can be invoked as hive --service rcfilecat <path-to-rcfile>.
ORC File Format (Optimized Row Columnar):
Using ORC files improves performance when Hive is reading, writing, and processing data.
Compared with the RCFile format, for example, the ORC file format has many advantages, such as:
· a single file as the output of each task, which reduces the NameNode load
· Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
· light-weight indexes stored within the file
· skip row groups that don't pass predicate filtering
· seek to a given row
· block-mode compression based on data type
· run-length encoding for integer columns
· dictionary encoding for string columns
· concurrent reads of the same file using separate Record Readers
· ability to split files without scanning for markers
· bound the amount of memory needed for reading or writing
· metadata stored using Protocol Buffers, which allows addition and removal of fields
However, the ORC file increases CPU overhead by increasing the time it takes to decompress the relational data.
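A minimal sketch of an ORC-backed table (the table and columns are made up; the "orc.compress" table property accepts NONE, ZLIB, or SNAPPY):

CREATE TABLE orders_orc (
  order_id BIGINT,
  customer_id INT,
  total DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");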
ORC file structure:
- Groups of row data called stripes; the default stripe size is 250 MB, and large stripes enable large, efficient reads from HDFS.
- Auxiliary information in a file footer.
- A postscript that holds compression parameters and the size of the compressed footer.
- Each stripe in an ORC file holds index data, row data, and a stripe footer.
- Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
Thus you can use the above four file formats depending on your data. For example:
- If your data is delimited by some character, you can use the TEXTFILE format.
- If your data is in small files whose size is less than the block size, you can use the SEQUENCEFILE format.
- If you want to perform analytics on your data and store it efficiently for that purpose, you can use the RCFILE format.
- If you want to store your data in an optimized way that lessens your storage and increases your performance, you can use the ORCFILE format.