TECHTalksPro
Showing posts with label BigData&Hadoop.

Monday, October 30, 2017

How to use Password file with Sqoop

 Chitchatiq     10/30/2017 11:21:00 AM     BigData&Hadoop, Problems&Solutions, sqoop


How to use the local --password-file parameter with Sqoop

Problem: Sometimes we need the Sqoop command to read the database password from a file rather than passing it on the command line.
Solution: Passing a password file to the Sqoop command is not a big deal. Just follow the steps below:

1.    Create the password file with echo -n "<<password>>" > <<passwordfilename>>
E.g.
echo -n "Srinivas" > passwordfile    (the file name can include a path; -n prevents a trailing newline from being stored as part of the password)

Sqoop Command:

sqoop list-tables --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" --username hadoop --password-file file:///usr/Srinivas/passwordfile
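If you prefer to keep the password file on HDFS instead of the local file system (Sqoop also accepts an HDFS path for --password-file), a minimal sketch looks like the following; the /user/srinivas path is only an illustration, and tightening the permissions keeps other users from reading the password:

hadoop fs -put passwordfile /user/srinivas/passwordfile     # copy the local password file into HDFS
hadoop fs -chmod 400 /user/srinivas/passwordfile            # owner read-only

sqoop list-tables --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" --username hadoop --password-file /user/srinivas/passwordfile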



Wednesday, August 30, 2017

Hortonworks:Service 'userhome' check failed: File does not exist: /user/admin

 Chitchatiq     8/30/2017 06:08:00 PM     BigData&Hadoop, Hadoop, Problems&Solutions


Problem: Service 'userhome' check failed: File does not exist: /user/admin


Solution:
sudo -u hdfs hadoop fs -mkdir /user/admin

sudo -u hdfs hdfs dfs -chown -R admin:hdfs /user/admin
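To confirm the fix (an optional quick check; the exact listing will vary by cluster), list /user and verify that /user/admin now exists and is owned by admin:

sudo -u hdfs hdfs dfs -ls /user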





Similar kinds of issues that can be fixed the same way:

service 'ats' check failed: server error
could not write file /user/admin/hive/jobs/hive-job
service 'userhome' check failed: authentication required
hadoop mkdir permission denied
failed to get cluster information associated with this view instance
java io filenotfoundexception file does not exist hdfs



Monday, December 19, 2016

File Formats in Hive

 Chitchatiq     12/19/2016 04:09:00 PM     Big Data, BigData&Hadoop, File Formats in Hive, ORCFILE, RCFILE, SEQUENCEFILE, SQL Server, TEXTFILE


Since Hive uses HDFS to store structured data in files, Hive supports multiple file formats for various use cases.

1.       TEXTFILE
2.       SEQUENCEFILE
3.       RCFILE
4.       ORCFILE

File format:  
  1. File formats differ in how the data is laid out in the files, the compression that can be applied to the data, the SerDes used to read and write the data, and the resulting use of space and disk I/O, etc.
  2. Hive does not verify whether the data being loaded matches the table schema; it only verifies that the file format matches the table definition. So file formats play a very key role in Hive.
  3. In Hive, we specify the file format for a table using "STORED AS <fileformat>" at the end of the CREATE TABLE statement (a short example follows this list).
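As a minimal illustration (the table and column names are made up), the only thing that changes between formats is the STORED AS clause:

hive -e "CREATE TABLE demo_text (id INT, name STRING) STORED AS TEXTFILE;"
hive -e "CREATE TABLE demo_orc  (id INT, name STRING) STORED AS ORC;"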

Text File:
 Defining a table as TEXTFILE in Hive stores the data as plain text, typically CSV (Comma Separated Values). Each record is one line, with fields separated by a delimiter (comma, space, or tab), or the records may be JSON data.
 CSV files are quite common and familiar to most of us. They are easy to handle and are usually used for importing and exporting data between Hadoop and external systems (see the sketch after the lists below).
 Advantages:
  •  Easy to read and parse.
  •   Handy for data dumps.

 Disadvantages:
  • No block compression
  • No metadata stored with the CSV file.
  •  Dependent on field order.
  •  Limited support for schema evolution
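A short sketch of a text-format table (the sales_txt table, the comma delimiter and the /tmp/sales.csv path are assumptions for illustration): the ROW FORMAT clause tells Hive how to split each line into fields, and LOAD DATA simply moves the file into the table's directory without validating it against the schema.

hive -e "CREATE TABLE sales_txt (id INT, region STRING, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         STORED AS TEXTFILE;"

hive -e "LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales_txt;"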

 Sequence Files: 
Hadoop, our baby elephant, is more comfortable with a small number of big files than with a large number of small files (smaller than the block size), because every small file adds metadata that the NameNode has to maintain. Sequence files come to the rescue: a sequence file acts as a container for storing small files (a conversion sketch follows the lists below).
 Sequence files store data in a binary key/value format with a record-oriented structure similar in spirit to CSV.
 SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

There are three types of sequence files, depending on the compression used by the writer:
1.       Uncompressed key/value records (Writer)
2.       Record-compressed key/value records, where only the values are compressed (RecordCompressWriter)
3.       Block-compressed key/value records, where both keys and values are collected into 'blocks' separately and compressed (BlockCompressWriter)

Structure of sequence file:


A sequence file consists of a header followed by one or more records. All three formats above use the same header structure.


Advantages: 
·         As binary files, these are more compact than text files.
·         Compression available.
·         Parallel processing.
·         Resolves small files problem.
 Disadvantages:
 ·         Append only.
·         Only for Hadoop
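One common way to get a sequence-file table in Hive is to convert an existing table with CTAS. The sketch below reuses the hypothetical sales_txt table from the TEXTFILE example; the compression property names are the classic Hadoop ones and may differ slightly between versions.

hive -e "SET hive.exec.compress.output=true;
         SET io.seqfile.compression.type=BLOCK;
         CREATE TABLE sales_seq STORED AS SEQUENCEFILE AS SELECT * FROM sales_txt;"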
   
RC File Format:
RCFILE (Record Columnar File) is another binary file format with a high compression rate on the rows. RCFile uses columnar storage: it applies the concept of "first horizontally partition, then vertically partition", and it combines the advantages of both row-store and column-store.
First, as row-store, RCFile guarantees that data in the same row are located in the same node, thus it has low cost of tuple reconstruction.
Second, as column-store, RCFile can exploit a column-wise data compression and skip unnecessary column reads.

Formats of RCFile
 RC Header
·         version - 3 bytes of magic header RCF, followed by 1 byte of actual version number (e.g. RCF1)
·         compression - A boolean which specifies if compression is turned on for keys/values in this file.
·         compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
·         metadata - SequenceFile.Metadata for this file.
·         sync - A sync marker to denote end of the header. 
RCFile Format
·         Header
·         Record
·         Key part
1.       Record length in bytes
2.       Key length in bytes
3.       Number_of_rows_in_this_record(vint)
4.       Column_1_ondisk_length(vint)
5.       Column_1_row_1_value_plain_length
6.       Column_1_row_2_value_plain_length...
7.       Column_2_ondisk_length(vint)
8.       Column_2_row_1_value_plain_length
9.       Column_2_row_2_value_plain_length
·         Value part
1.       Compressed or plain data of [column_1_row_1_value, column_1_row_2_value,....]
2.       Compressed or plain data of [column_2_row_1_value, column_2_row_2_value,....]
 Storage: 
In each HDFS block, RCFile organizes records in units of row groups. All the records stored in an HDFS block are partitioned into row groups, and all row groups of a table have the same size.




A row group contains three sections.

1.       A sync marker that is placed in the beginning of the row group. The sync marker is mainly used to separate two continuous row groups in an HDFS block.
2.       A metadata header for the row group. The metadata header stores the information items on how many records are in this row group, how many bytes are in each column, and how many bytes are in each field in a column.
3.       The table data section that is actually a column-store. In this section, all the fields in the same column are stored continuously together.

Compression in RC files

The method of appending data in RCFile is summarized as follows.

1.       RCFile creates and maintains an in-memory column holder for each column. When a record is appended, each field is appended to its corresponding column holder. In addition, RCFile records the corresponding metadata of each field in the metadata header.
2.       RCFile provides two parameters to control how many records can be buffered in memory before they are flushed to disk. One parameter limits the number of records, and the other limits the size of the memory buffer.
3.       RCFile first compresses the metadata header and stores it on disk. Then it compresses each column holder separately, and flushes the compressed column holders into one row group in the underlying file system.

Hive provides an rcfilecat tool to display the contents of RCFiles.
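A usage sketch (assuming your Hive build ships rcfilecat as a service, and with an illustrative warehouse path):

hive --service rcfilecat /apps/hive/warehouse/sales_rc/000000_0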

This PDF has a very good and deep explanation about RC file formats.

ORC File Format(Optimised Row Columnar):

Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with RCFile format, for example, ORC file format has many advantages such as:

·         a single file as the output of each task, which reduces the NameNode load
·         Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
·         light-weight indexes stored within the file
·         skip row groups that don't pass predicate filtering
·         seek to a given row
·         block-mode compression based on data type
·         run-length encoding for integer columns
·         dictionary encoding for string columns
·         concurrent reads of the same file using separate Record Readers
·         ability to split files without scanning for markers
·         bound the amount of memory needed for reading or writing
·         metadata stored using Protocol Buffers, which allows addition and removal of fields




However, the ORC format increases CPU overhead, since it takes additional time to decompress the relational data.
ORC file structure:


    • groups of row data called stripes
    • auxiliary information in a file footer
    • a postscript that holds compression parameters and the size of the compressed footer
    • each stripe is about 250 MB by default
    • each stripe in an ORC file holds index data, row data, and a stripe footer
    • Note that ORC indexes are used only for the selection of stripes and row groups, not for answering queries.
      
    Thus you can use the above four file formats depending on your data.
    For example,
    1. If your data is delimited by some parameters then you can use TEXTFILE format.
    2. If your data is in small files whose size is less than the block size then you can use SEQUENCEFILE format.
    3. If you want to perform analytics on your data and you want to store your data efficiently for that then you can use RCFILE format.
    4. If you want to store your data in an optimized way which lessens your storage and increases your performance then you can use ORCFILE format.
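As a closing sketch (the table names and the SNAPPY codec choice are illustrative assumptions), an ORC table is created with STORED AS ORC, common tuning such as the compression codec goes into TBLPROPERTIES, and data from an existing text table can then be loaded with a plain INSERT ... SELECT:

hive -e "CREATE TABLE sales_orc (id INT, region STRING, amount DOUBLE)
         STORED AS ORC
         TBLPROPERTIES ('orc.compress'='SNAPPY');"

hive -e "INSERT INTO TABLE sales_orc SELECT * FROM sales_txt;"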

    References:
    https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/io/SequenceFile.html
    http://datametica.com/rcorc-file-format/
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC


HADOOP - HDFS OPERATIONS

 Chitchatiq     12/19/2016 04:08:00 PM     Big Data, BigData&Hadoop, Hadoop, Hadoop Commands, SQL Server

Starting HDFS
To format the configured HDFS file system, execute the following command on the NameNode server:
$ hadoop namenode -format
Start the distributed file system. After formatting HDFS, the following command starts the NameNode as well as the DataNodes as a cluster.
$ start-dfs.sh
Listing Files in HDFS
After loading information into the cluster, we can list the files in a directory or check the status of a file using 'ls'. Given below is the syntax of ls; you can pass a directory or a file name as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Inserting Data into HDFS

Imagine we have data in a file called file.txt in the local file system that we intend to save in the Hadoop file system (HDFS). Just follow the steps...

Step 1

-MaKe a DIRectory in HDFS if you want your file in a new directory
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2

"-put" your file there.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3

Check your file by taking a look at the "-LiSt"
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS

Now that you are able to load data, you also need to know how to view the data in those files.
To view the data in a file... just call "-cat"


-cat can be used to get our job done.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

No no no.. I want the file in my local file system...

Don't worry.. just "-get" it from hadoop fs

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
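If the output directory contains several part files, a handy alternative (a sketch; -getmerge concatenates every file under the source directory into a single local file) is:

$ $HADOOP_HOME/bin/hadoop fs -getmerge /user/output/ /home/hadoop_tp/outfile.txt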


Shutting Down the HDFS

Aaaaah!! Done ?? Wrap up by shutting down the HDFS.

$ stop-dfs.sh


TEZ

 Chitchatiq     12/19/2016 04:07:00 PM     Big Data, BigData&Hadoop, DAG, SQL Server, Tez


YARN - a giant leap for Hadoop. It provides the facility of an "App Master" to control the processing flow. So how can we leverage this?? Can we use another framework and our own code to play the AppMaster role?





“CUSTOMIZATION”
and at the same time SPEEEEEEEED............!!
One answer can surely be TEZ

Tez was developed based on a 2007 research paper from Microsoft.

What is Tez?
  1. It is a distributed execution framework for data-processing applications.
  2. It is based on expressing a computation as a dataflow graph.
  3. It uses a DAG model to execute applications.
  4. It is built on top of YARN.

What does a Tez DAG contain??
Tez models data processing as a dataflow graph, with the graph's
1.       vertices representing the application logic, and
2.       edges representing the movement of data.
3.       A rich dataflow definition API allows users to intuitively express complex query logic.
What is this DAG??
Tez uses a DAG (Directed Acyclic Graph): a flow of data in the form of a graph that contains no cycles. Essentially, it allows higher-level tools (like Hive and Pig) to define their overall processing steps (a directed acyclic graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (a Hive query, a Pig job, etc.).
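The quickest way to see this in practice (a sketch, assuming a cluster where Tez is installed and configured for Hive; the table name is hypothetical) is to switch Hive's execution engine for the session, so the query is compiled into a single Tez DAG instead of a chain of MapReduce jobs:

hive -e "SET hive.execution.engine=tez;
         SELECT region, COUNT(*) FROM some_table GROUP BY region;"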

“Tez has a DAG API through which we can specify the producers, consumers and the flow of data.”


DAG API:

Data movement. Defines routing of data between tasks
  1. One-To-One: Data from the ith producer task routes to the ith consumer task.
  2. Broadcast: Data from a producer task routes to all consumer tasks.
  3. Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
Scheduling. Defines when a consumer task is scheduled
  1. Sequential: Consumer task may be scheduled after a producer task completes.
  2. Concurrent: Consumer task must be co-scheduled with a producer task.
Data source. Defines the lifetime/reliability of a task output
  1. Persisted: Output will be available after the task exits. Output may be lost later on.
  2. Persisted-Reliable: Output is reliably stored and will always be available
  3. Ephemeral: Output is available only while the producer task is running. 
Tez has a Runtime API through which it receives user code and runs it as tasks.
(Note that there are no more mappers and reducers here.)

RunTime API:
 “Input → processor →  output”  is a task  (previously Mapper or Reducer)




The input, processor and output are all configurable.

Digging deeper.
 The API fits well with query plans produced by higher-level declarative applications like Apache Hive and Apache Pig.



1. An execution environment that can handle traditional MapReduce jobs. Tez treats each running task (a mapper or a reducer) as a vertex and the data flow between them as edges, so applications written for MapReduce can run on Tez and run faster.
2.     An execution environment that handles DAG-based jobs comprising various built-in and extendable primitives.
3.    Cluster-side determination of input pieces. The user can specify the inputs and outputs of the process by leveraging the Runtime API of Tez.
4.    Runtime planning such as task cardinality determination and dynamic modification to the DAG structure.


Tez Sessions.
 Oh man! What are Tez sessions??
They are similar to sessions in any other RDBMS: all the queries launched by a user within a session run under the same ApplicationMaster.

 Hmm!! Why do I need them??
Because multiple DAGs can be launched against the same ApplicationMaster, the overhead of launching a new AM for each DAG is DECREASED!!

 Wow!! But only one??
Container reuse and caching within the session can also be leveraged. Before you shoot one more question, let me explain what they are and how they work.

Reusing containers:

From our knowledge of YARN, we already know about these "containers": a lease on a node's capacity where a JVM for a task is launched and the task is run (see the post on containers for more info).

It’s better not to launch a new container for every small task. Why? As mentioned above, launching containers, i.e. repeatedly hitting the ResourceManager to allocate containers for short-running tasks, may decrease your performance.
Tez comes to our rescue. Containers can be reused again and again within the same DAG (the containers should be compatible, though). Not only that, containers can also be reused by other DAGs running in the same Tez session.
And how is this reuse scheduled? The task scheduler in the ResourceManager only launches new containers, so scheduling the reuse of containers is done by the Tez task scheduler. The Tez scheduler works with several parameters to make task-assignment decisions: task-locality requirements, compatibility of containers as described above, the total available resources on the cluster, and the priority of pending task requests.
When a container becomes available for reuse, the scheduler checks whether any pending task has compatible container needs and data local to that container, and launches it there; if no such task is available, it launches a rack-local task and, in the worst case, any task in the cluster.

Caching:
 Apart from JVM reuse, each Tez JVM (or container) contains an object cache, which can be used to share data between different tasks running within the same container. This is a simple key-object store with different levels of visibility/retention. Objects can be cached for use within tasks belonging to the same vertex, for all tasks within a DAG, or for tasks running across a Tez session (more on sessions in a subsequent post). The resources being cached may, in the future, be made available as a hint to the Tez scheduler for affinity-based scheduling.

Wonderful!! How do I use them??
Very simple.
  1. Firstly, instantiate a TezSession object with the required configuration using TezSessionConfiguration.
  2. Invoke TezSession::start()
  3. Wait for the TezSession to reach a ready state to accept DAGs by using the TezSession::getSessionStatus() API (this step is optional)
  4. Submit a DAG to the Session using TezSession::submitDAG(DAG dag)
  5. Monitor the DAG’s status using the DAGClient instance obtained in step (4).
  6. Once the DAG has completed, repeat step (4) and step (5) for subsequent DAGs.
  7. Shutdown the Session once all work is done via TezSession::stop().
 There are some things to keep in mind when using a Tez Session:

  • A Tez Session maps to a single Application Master and therefore, all resources required by any user-logic (in any subsequent DAG) running within the ApplicationMaster should be available when the AM is launched.
    • This mostly pertains to code related to the VertexOutputCommitter and any user-logic in the Vertex scheduling and management layers.
    • User-logic run in tasks is not governed by the above restriction.
  • The resources (memory, CPU) of the AM are fixed, so please keep this in mind when configuring the AM for use in a session. For example, memory requirements may be higher for a very large DAG.
 Everything is fine. But who is responsible for all the extra intelligence Tez showcases??




Vertex Manager:
The vertex manager is the component that resides in the Tez AppMaster and decides when a task in a vertex should start. It also controls parallelism, for example deciding the number of reducers dynamically based on the map output.

DAG Scheduler: The DAGScheduler assigns each task started by the vertex manager an execution priority depending on its depth in the graph and other factors, such as whether it is a retry or not. The task then goes to the Task Scheduler, which actually assigns it to a YARN container.

Task Scheduler: Alongside the pluggable scheduler in YARN, Tez contains its own scheduler to allocate containers during "container reuse".

References:
https://issues.apache.org/jira/secure/attachment/12588887/Tez%20Design%20v1.1.pdf
http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/ref-d677ca50-0a14-4d9e-9882-b764e689f6db.1.html
