Sunday, November 26, 2017
How to Prepare for and Clear the Hortonworks Data Platform Certified Developer Exam (HDPCD - English)
Hortonworks University offers the HDPCD certification, an exclusively hands-on, performance-based exam that requires you to complete a set of tasks on an actual Hadoop cluster instead of just guessing at multiple-choice questions.
The HDPCD exam objectives fall into three categories:
1. Data ingestion
2. Data transformation
3. Data analysis
As listed on the Hortonworks website, here are all the tasks, grouped by category; after each category I have added a short example sketch.
Data Ingestion
- Import data from a table in a relational database into HDFS
- Import the results of a query from a relational database into HDFS
- Import a table from a relational database into a new or existing Hive table
- Insert or update data from HDFS into a table in a relational database
- Given a Flume configuration file, start a Flume agent
- Given a configured sink and source, configure a Flume memory channel with a specified capacity
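To give a flavour of these tasks, here is a minimal Sqoop import sketch, reusing the sandbox MySQL connection and the weather table that appear elsewhere on this blog; the target directory and mapper count are only assumptions:
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" \
  --username hadoop --password hadoop \
  --table weather \
  --target-dir /user/hortonworks/weather_import \
  -m 1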
Data Transformation
- Write and execute a Pig script
- Load data into a Pig relation without a schema
- Load data into a Pig relation with a schema
- Load data from a Hive table into a Pig relation
- Use Pig to transform data into a specified format
- Transform data to match a given Hive schema
- Group the data of one or more Pig relations
- Use Pig to remove records with null values from a relation
- Store the data from a Pig relation into a folder in HDFS
- Store the data from a Pig relation into a Hive table
- Sort the output of a Pig relation
- Remove the duplicate tuples of a Pig relation
- Specify the number of reduce tasks for a Pig MapReduce job
- Join two datasets using Pig
- Perform a replicated join using Pig
- Run a Pig job using Tez
- Within a Pig script, register a JAR file of User Defined Functions
- Within a Pig script, define an alias for a User Defined Function
- Within a Pig script, invoke a User Defined Function
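A minimal Pig sketch covering a few of these tasks (load with a schema, remove null records, sort, and store back into HDFS); the input path and field names are assumptions for illustration only:
-- load with an explicit schema
weather = LOAD '/user/hortonworks/weather' USING PigStorage(',')
          AS (station:chararray, obs_date:chararray, temp:int);
-- remove records with a null temp value
no_nulls = FILTER weather BY temp IS NOT NULL;
-- sort the output of the relation
by_temp = ORDER no_nulls BY temp DESC;
-- store the relation into a folder in HDFS
STORE by_temp INTO '/user/hortonworks/weather_sorted' USING PigStorage(',');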
Data Analysis
- Write and execute a Hive query
- Define a Hive-managed table
- Define a Hive external table
- Define a partitioned Hive table
- Define a bucketed Hive table
- Define a Hive table from a select query
- Define a Hive table that uses the ORCFile format
- Create a new ORCFile table from the data in an existing non-ORCFile Hive table
- Specify the storage format of a Hive table
- Specify the delimiter of a Hive table
- Load data into a Hive table from a local directory
- Load data into a Hive table from an HDFS directory
- Load data into a Hive table as the result of a query
- Load a compressed data file into a Hive table
- Update a row in a Hive table
- Delete a row from a Hive table
- Insert a new row into a Hive table
- Join two Hive tables
- Run a Hive query using Tez
- Run a Hive query using vectorization
- Output the execution plan for a Hive query
- Use a subquery within a Hive query
- Output data from a Hive query that is totally ordered across multiple reducers
- Set a Hadoop or Hive configuration property from within a Hive query
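And a minimal Hive sketch for two of these tasks (define an ORCFile table and populate it from an existing non-ORCFile table); the table and column names are assumptions:
-- define a Hive table that uses the ORCFile format
CREATE TABLE weather_orc (
  station  STRING,
  obs_date STRING,
  temp     INT
)
STORED AS ORC;
-- load it from an existing non-ORCFile table
INSERT INTO TABLE weather_orc
SELECT station, obs_date, temp FROM weather_text;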
Reference:
https://hortonworks.com/services/training/certification/exam-objectives/#hdpcd
Monday, October 30, 2017
How to use Password file with Sqoop
How to use a local password file with Sqoop's --password-file parameter
Problem: Sometimes we need a Sqoop command to read the password from a file instead of taking it on the command line.
Solution: Passing a password file to a Sqoop command is not a big deal. Just follow the steps below:
1. Create the password file with echo -n (the -n matters: a trailing newline would be treated as part of the password):
echo -n "<<password>>" > <<passwordfilename>>
E.g.
echo -n "Srinivas" > passwordfile   (the file name can include a path)
2. Pass the file to the Sqoop command with --password-file:
sqoop list-tables \
  --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" \
  --username hadoop \
  --password-file file:///usr/Srinivas/passwordfile
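The same --password-file option works with any Sqoop tool, for example an import. A minimal sketch, reusing the weather table from the export example further down this page; the target directory is only an assumption, and restricting the file's permissions first is good practice:
chmod 400 /usr/Srinivas/passwordfile
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" \
  --username hadoop \
  --password-file file:///usr/Srinivas/passwordfile \
  --table weather \
  --target-dir /user/hortonworks/weather_pwfile \
  -m 1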
Monday, October 9, 2017
SQOOP-SqlManager-Error reading from database-java.sql.SQLException-set com.mysql.jdbc.RowDataDynamic-3c2d5cfb
Problem:
sqoop export \
  --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" \
  --username hadoop --password hadoop \
  --table weather \
  --export-dir /user/hortonworks/weather/ \
  --fields-terminated-by ',';
Sometimes, when we run a Sqoop command like the one above, we get the following error:
“ERROR manager.SqlManager: Error reading from
database: java.sql.SQLException: Streaming result set
com.mysql.jdbc.RowDataDynamic@3c2d5cfb is still active. No statements may be
issued when any streaming result sets are open and in use on a given connection.
Ensure that you have called .close() on any active streaming result sets before
attempting more queries.
java.sql.SQLException: Streaming result set
com.mysql.jdbc.RowDataDynamic@3c2d5cfb is still active. No statements may be
issued when any streaming result sets are open and in use on a given
connection.”
Solution:
A simple fix is to add the --driver parameter. Specifying the driver explicitly makes Sqoop fall back to its generic JDBC manager instead of the MySQL-specific manager whose streaming result set causes this error:
--driver com.mysql.jdbc.Driver
sqoop export \
  --connect "jdbc:mysql://sandbox.hortonworks.com:3306/hdpcdpractise" \
  --username hadoop --password hadoop \
  --table weather \
  --export-dir /user/hortonworks/weather/ \
  --fields-terminated-by ',' \
  --driver com.mysql.jdbc.Driver
Thursday, September 21, 2017
How to create a column in a table that throws an error when we run SELECT * or SELECT col
Problem:
Create a column in a table that gives an error whenever we run SELECT * or SELECT that column.
Solution:
Friends, don't think too hard about how to solve this problem. Here is the simple solution I came up with: use a computed column that has no syntax errors but does contain a logical error (e.g. 1/0 or 1 + 'srinivas').
Sample script:
create table #test (id1 int, id2 int, Col1 as id1 + 'srinu')
select * from #test
Here we get:
Conversion failed when converting the varchar value 'srinu' to data type int.
That's all, friends! Thanks for reading this article, and please post a comment if you have any questions or thoughts.
Monday, September 11, 2017
Greenplum System-Useful Queries
Useful Queries:
- Query to verify the list of segments in a Greenplum system
select * from gp_segment_configuration;
Result:
Column Value       | Description
Content = -1       | Master node
Content = 0 to n   | Slave segments
- Query to verify Skewing in a database
select * from gp_toolkit.gp_skew_coefficients;
- Query to verify the number of records in each segment of a table
select gp_segment_id, count(*)
from zc.zctimesheet1
group by gp_segment_id
order by gp_segment_id;
- Query to verify the size of a database/ relation (table)
select
pg_size_pretty(pg_database_size('DistributionTest'));
select
pg_size_pretty(pg_relation_size('zc.zctimesheet1'));
Greenplum Best Practices
Best Practices:
- A distribution key should not have more than 2 columns; 1 column is recommended.
- While modeling a database, make sure all the tables involved in long-running queries are distributed on the joining column.
- Distribution keys should be domain keys, not surrogate keys.
- Explicitly define a distribution column or random distribution for every table; do not rely on the default, as it may cause skew on the segments (see the example table definitions after this list).
- Do not distribute on columns that will be used in the WHERE clause of a query; instead, partition the data on the column used in the WHERE clause.
- The distribution key column data should contain unique values.
- Do not distribute on dates or timestamps; doing so may cause poor system performance.
- There are 2 questions to ask while deciding the distribution key of a table:
  - Is the distribution key used in joins? This makes sure no redistribution happens during query execution.
  - Does the key equi-distribute the data across all the segments? This makes sure all segments get approximately the same amount of data, and hence the same amount of work, during query execution.
- Make sure data skew is minimal for every table; it can be checked with:
SELECT * FROM gp_toolkit.gp_skew_coefficients;
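As an illustration of the explicit-distribution rule above, here is a minimal sketch; the schema, table, and column names are hypothetical:
-- hash-distribute on the domain key that is used in joins
CREATE TABLE sales.orders (
    order_id   bigint,
    order_date date,
    amount     numeric
)
DISTRIBUTED BY (order_id);
-- no suitable key: spread rows evenly instead of relying on the default
CREATE TABLE sales.audit_log (
    log_ts  timestamp,
    message text
)
DISTRIBUTED RANDOMLY;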
Greenplum Architecture
Parallel Processing:
Parallel processing means taking all the rows from a table and spreading them among many parallel processing units. In Greenplum, these parallel processing units are called segments.
SMP (symmetric multiprocessing) system:
An SMP system has multiple processors that share a single OS, a single memory pool, and access to the same disks. An SMP system is a single server, referred to as a node. A single SMP system is called a segment host; multiple segment hosts are connected together to create a massively parallel processing (MPP) system. The number of segments per segment host is determined by the number of CPUs the segment host contains.
Commodity Hardware:
Greenplum provides impressive speed on commodity hardware, which allows you to create one segment per CPU core. If you have two dual-core CPUs in a server, you would build four segments.
Roles of the Master Host and the Segments:
The master host is the boss and the segments are the workers. Users log in to the master host and never communicate directly with the segments. The master host builds a plan for the segments to follow, delivered in plan slices; each slice instructs the segments what to do. When the segments have done their work, they return the results to the master host.
The master host is the brains behind the entire operation.
Master Host Roles:
1. When a user logs into Greenplum, the master host logs them in and is responsible for their session.
2. The master host checks the SQL syntax, creates the EXPLAIN plan, checks security, and builds a plan for the segments to follow.
3. The master host uses system statistics and statistics from the ANALYZE command to build the best plan.
4. The master host doesn't hold user data; instead it holds the global system catalog.
5. The master host always delivers the final answer set to the user.
Segment Roles:
1. Segments are responsible for storing and retrieving rows from their assigned disks (virtual disks).
2. Segments lock the tables and rows.
3. Segments sort rows and do all aggregation.
4. Segments handle all the join processing.
5. Segments handle all space management and space accounting.
The master host broadcasts the plan to each segment simultaneously, but if you use the distribution key in the WHERE clause of your SQL with an equality condition, then only a single segment is contacted to return the row.
For example: if the system contains 3 parallel processing units and employee_table contains 9 rows, then each parallel processing unit holds 3 rows.
select * from employee_table where empno = 2000;
If the query above were written without the distribution key in the WHERE clause, all 3 segments would be visited to fetch the result; but because empno is used in the WHERE clause with an equality condition, only the segment that contains that employee number is visited.
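A quick way to see this behaviour is to look at the query plan. A small sketch, assuming employee_table is distributed by empno:
EXPLAIN select * from employee_table where empno = 2000;
-- The motion node in the plan reports how many segments take part: with the
-- distribution key and an equality filter it is typically a single segment,
-- whereas a filter on a non-distribution column involves every segment.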
Wednesday, September 6, 2017
How to find and clean up the top 100 largest files in Linux/Unix
Steps to find and clean up the top 100 largest files in Linux/Unix
Command to find the files using the most disk space:
find / -xdev -type f -size +100M -exec du -sh {} ';' | sort -rh | head -n100
Command to empty a file's contents without removing the file:
> <<filename>>
E.g.
> sample.txt
Here, > truncates the file to zero bytes.
Thanks for reading this article and please do subscribe to get more updates
Saturday, September 2, 2017
How to Install MySQL on CentOS 7
Steps to Install MySQL on Linux-CentOS 7
Step 1: Download the MySQL repository package
wget http://dev.mysql.com/get/mysql57-community-release-el7-9.noarch.rpm
Step 2: Prepare the repository so that MySQL can be installed from it
sudo rpm -Uvh mysql57-community-release-el7-9.noarch.rpm
Step 3: Install MySQL on your server
sudo yum install mysql-server
During the installation you will be asked to confirm with Y or N; just type Y. If you see the "Complete!" message, the installation finished successfully.
Step 4: Start the MySQL service
sudo systemctl start mysqld
Step 5: Check the service status
sudo systemctl status mysqld
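Step 6 (optional): MySQL 5.7 generates a temporary root password when the service starts for the first time. Assuming the default log location, you can retrieve it and log in like this:
sudo grep 'temporary password' /var/log/mysqld.log
mysql -u root -p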
Thanks for reading the article. If you feel it solved your problem, please like, comment, and share!