
Monday, December 19, 2016

File Formats in Hive



Since Hive uses HDFS to store its structured data as files, it supports multiple file formats, each suited to different use cases.

1. TEXTFILE
2. SEQUENCEFILE
3. RCFILE
4. ORCFILE

File format:
  1. File formats differ in how the data is laid out in the files, the compression that can be applied to it, the SerDes used to read and write it, and the resulting use of disk space and I/O.
  2. Hive doesn't verify whether the data being loaded matches the table's schema; it only verifies that the file format matches the table definition. File formats therefore play a very key role in Hive.
  3. In Hive, we specify the file format for a table with "stored as <fileformat>" at the end of the CREATE TABLE statement (see the sketch below).
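A quick sketch of points 2 and 3 (the table and column names are made up for illustration): the format is declared once in the DDL and can then be inspected with DESCRIBE FORMATTED.

-- Declare the on-disk format with STORED AS at the end of CREATE TABLE;
-- TEXTFILE could be swapped for SEQUENCEFILE, RCFILE or ORC.
CREATE TABLE web_logs (
  log_time STRING,
  url      STRING,
  status   INT
)
STORED AS TEXTFILE;

-- DESCRIBE FORMATTED shows the InputFormat, OutputFormat and SerDe that Hive
-- recorded for the table, i.e. the format it will expect files to match.
DESCRIBE FORMATTED web_logs;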

Text File:
 Defining a table as TEXTFILE in Hive stores the data as plain text, typically CSV (Comma Separated Values). Fields in each record are separated by a delimiter (comma, space, or tab), or the records may be JSON.
 CSV files are quite common and familiar to most of us. They are easy to handle and are usually used for importing and exporting data between Hadoop and external systems (a short example follows the lists below).
 Advantages:
  • Easy to read and parse.
  • Handy for data dumps.

 Disadvantages:
  • No block compression.
  • No metadata stored with the CSV file.
  • Dependent on field order.
  • Limited support for schema evolution.
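A minimal sketch of a delimited text table, assuming a hypothetical comma-separated file at /tmp/employees.csv (the table name, columns, and path are made up; TEXTFILE is also Hive's default format, so the STORED AS clause is optional here):

CREATE TABLE employees_txt (
  emp_id   INT,
  emp_name STRING,
  salary   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load an existing CSV file from the local file system into the table.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_txt;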

 Sequence Files:
Hadoop, our baby elephant, is more comfortable with a small number of big files than with a large number of small files (files smaller than the block size), because every small file adds metadata that the NameNode has to keep in memory. Sequence files come to the rescue: a sequence file acts as a container for storing many small files.
 Sequence files store data in a binary key/value format with a structure similar to CSV.
 SequenceFile provides the SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

There are three types of sequence files, depending on the compression applied by the writer (a Hive-side compression example follows the list):
 1. Uncompressed key/value records (Writer).
 2. Record-compressed key/value records, where only the values are compressed (RecordCompressWriter).
 3. Block-compressed key/value records, where keys and values are collected in 'blocks' separately and compressed (BlockCompressWriter).
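A hedged sketch of how this looks from the Hive side. The property names below are the classic Hadoop/Hive ones and may differ between versions; the codec choice is an assumption, and the data is reused from the hypothetical employees_txt table in the text-file example.

-- Turn on output compression and choose block-level compression with a codec
-- (NONE, RECORD and BLOCK are the valid compression types).
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE employees_seq (
  emp_id   INT,
  emp_name STRING,
  salary   DOUBLE
)
STORED AS SEQUENCEFILE;

-- Repack the delimited text data into the sequence-file table.
INSERT OVERWRITE TABLE employees_seq
SELECT emp_id, emp_name, salary FROM employees_txt;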

Structure of sequence file:


A sequence file consists of a header followed by one or more records. All three formats above use the same header structure.


Advantages:
 • As binary files, these are more compact than text files.
 • Compression is available.
 • Suitable for parallel processing.
 • Solve the small-files problem.
 Disadvantages:
 • Append only.
 • Usable only within the Hadoop ecosystem.
   
RC File Format:
RCFILE (Record Columnar File) is another binary file format, with a high compression rate on the rows. RCFile follows columnar storage and applies the concept of "first horizontally partition, then vertically partition", combining the advantages of both row-store and column-store.
First, as a row-store, RCFile guarantees that data in the same row are located on the same node, so tuple reconstruction is cheap.
Second, as a column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.
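A minimal, hedged sketch of an RCFile table (the names are hypothetical, and the data again comes from the employees_txt example above):

CREATE TABLE employees_rc (
  emp_id   INT,
  emp_name STRING,
  salary   DOUBLE
)
STORED AS RCFILE;

INSERT OVERWRITE TABLE employees_rc
SELECT emp_id, emp_name, salary FROM employees_txt;

-- Thanks to the columnar layout, a query like this only needs to read
-- the salary column from disk, not the whole row.
SELECT AVG(salary) FROM employees_rc;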

Formats of RCFile
 RC Header
 • version - 3 bytes of magic header RCF, followed by 1 byte of actual version number (e.g. RCF1)
 • compression - a boolean which specifies if compression is turned on for the keys/values in this file
 • compression codec - the CompressionCodec class used to compress keys and/or values (if compression is enabled)
 • metadata - SequenceFile.Metadata for this file
 • sync - a sync marker to denote the end of the header
RCFile Format
 • Header
 • Record
   • Key part
     1. Record length in bytes
     2. Key length in bytes
     3. Number_of_rows_in_this_record (vint)
     4. Column_1_ondisk_length (vint)
     5. Column_1_row_1_value_plain_length
     6. Column_1_row_2_value_plain_length ...
     7. Column_2_ondisk_length (vint)
     8. Column_2_row_1_value_plain_length
     9. Column_2_row_2_value_plain_length
   • Value part
     1. Compressed or plain data of [column_1_row_1_value, column_1_row_2_value, ....]
     2. Compressed or plain data of [column_2_row_1_value, column_2_row_2_value, ....]
 Storage:
In each HDFS block, RCFile organizes records with a row group as the basic unit. All the records stored in an HDFS block are partitioned into row groups (of equal size for a given table).




A row group contains three sections.

1. A sync marker placed at the beginning of the row group. The sync marker is mainly used to separate two continuous row groups in an HDFS block.
2. A metadata header for the row group. The metadata header stores how many records are in this row group, how many bytes are in each column, and how many bytes are in each field in a column.
3. The table data section, which is actually a column-store. In this section, all the fields of the same column are stored contiguously.

Compression in RC files

The method of data appending in RCFile is summarized as follows.

1. RCFile creates and maintains an in-memory column holder for each column. When a record is appended, each field is appended to its corresponding column holder. In addition, RCFile records the corresponding metadata of each field in the metadata header.
2. RCFile provides two parameters to control how many records can be buffered in memory before they are flushed to disk: one limits the number of records, the other limits the size of the memory buffer (both are sketched below).
3. RCFile first compresses the metadata header and stores it on disk. It then compresses each column holder separately and flushes the compressed column holders into one row group in the underlying file system.
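A hedged sketch of those two knobs; the property names are the ones commonly documented for Hive's RCFile writer and may vary by version, and the values are arbitrary examples.

-- Flush a row group after at most this many records.
SET hive.io.rcfile.record.interval=10000;
-- Flush once the in-memory column holders reach this many bytes (4 MB here).
SET hive.io.rcfile.record.buffer.size=4194304;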

Hive also provides an rcfilecat tool to display the contents of RCFiles.

The references at the end of this post give a deeper explanation of the RC file format.

ORC File Format (Optimized Row Columnar):

Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with the RCFile format, for example, the ORC file format has many advantages, such as the following (a table-creation sketch follows the list):

 • a single file as the output of each task, which reduces the NameNode load
 • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
 • light-weight indexes stored within the file
 • the ability to skip row groups that don't pass predicate filtering
 • the ability to seek to a given row
 • block-mode compression based on data type
 • run-length encoding for integer columns
 • dictionary encoding for string columns
 • concurrent reads of the same file using separate RecordReaders
 • the ability to split files without scanning for markers
 • a bound on the amount of memory needed for reading or writing
 • metadata stored using Protocol Buffers, which allows addition and removal of fields
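A hedged sketch of an ORC table that sets a few of the table properties documented in the Hive ORC manual (the table and column names are made up, the values shown are illustrative defaults that may differ by version, and the data once more comes from employees_txt):

-- orc.compress: compression codec (ZLIB, SNAPPY or NONE)
-- orc.create.index: whether to write the light-weight indexes
-- orc.stripe.size: stripe size in bytes
-- orc.row.index.stride: number of rows between index entries
CREATE TABLE employees_orc (
  emp_id   INT,
  emp_name STRING,
  salary   DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'         = 'ZLIB',
  'orc.create.index'     = 'true',
  'orc.stripe.size'      = '268435456',
  'orc.row.index.stride' = '10000'
);

INSERT OVERWRITE TABLE employees_orc
SELECT emp_id, emp_name, salary FROM employees_txt;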




However, the ORC format increases CPU overhead, since the relational data has to be decompressed when it is read.
ORC file structure:

    • groups of row data called stripes
    • auxiliary information in a file footer
    • a postscript that holds compression parameters and the size of the compressed footer
    • a default stripe size of 250 MB
    • each stripe in an ORC file holds index data, row data, and a stripe footer
    • note that ORC indexes are used only for the selection of stripes and row groups, not for answering queries
      
    Thus you can choose among the above four file formats depending on your data.
    For example:
    1. If your data is delimited by some separator, you can use the TEXTFILE format.
    2. If your data consists of many small files whose size is less than the block size, you can use the SEQUENCEFILE format.
    3. If you want to perform analytics on your data and store it efficiently for that purpose, you can use the RCFILE format.
    4. If you want to store your data in an optimized way that reduces storage and improves performance, you can use the ORCFILE format.

    References:
    https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/io/SequenceFile.html
    http://datametica.com/rcorc-file-format/
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
