How to extract and load data to Hadoop/HDFS/Hive from Informatica PowerCenter

  
Basic Overview of Hadoop :
  Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers (that do not share any storage or disk).
  Hadoop is designed for commodity hardware.

  Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.

  Two major components of Hadoop systems are:
        1- Hadoop Distributed File System (HDFS)
        2- MapReduce

        HDFS is responsible for storing data on a cluster of machines.

  MapReduce is the data processing component of Hadoop.

    
      In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition to this, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes.

A cluster can have anywhere from a single node to several thousand nodes.



HDFS :

1- For Hadoop to be able to process files, the files have to be in HDFS, i.e. Hadoop's own file system.
        HDFS is responsible for storing data on a cluster of machines.

2- Data is normally split into blocks of 64 MB to 128 MB and spread across the cluster.
     (The block size is set through configuration.)
      By default, each block is replicated 3 times. The replication factor can be lowered or increased
      through configuration settings (see the sketch after this list).

 3- To ensure that data is not lost, replicas of a block are always placed on different nodes.

 4- There are 2 processes which take care of HDFS in Hadoop: the NameNode and the DataNode.
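
To make the block size and replication settings concrete, here is a minimal sketch (independent of the PowerCenter integration covered later in this post) that writes a small file to HDFS using the Hadoop Java client. The dfs.blocksize and dfs.replication properties are standard Hadoop configuration keys; the name node host, port, directory, and file name are placeholders you would replace with your own values.

    // Minimal sketch: write a file to HDFS with an explicit block size and replication
    // factor using the Hadoop Java client (hadoop-client dependency).
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide defaults normally come from hdfs-site.xml; a client can
            // override them for the files it writes, as shown here.
            conf.set("dfs.blocksize", "134217728");   // 128 MB blocks (older releases use dfs.block.size)
            conf.set("dfs.replication", "3");         // 3 replicas per block

            // "namenode-host:8020" is a placeholder for the name node host and port.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
            try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo/sample.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }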



 MAPREDUCE :

       
 1- MapReduce is the data processing component of Hadoop. It accomplishes the task
         of data processing by distributing tasks across the nodes.

 2- The task on each node processes data that is (mostly) present locally.

 3- MapReduce consists of 2 phases:
 (i) Map
 (ii) Reduce
  In between Map and Reduce, there is a small phase called Shuffle and Sort (see the word-count sketch below).
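
As a concrete (if simplified) illustration of these phases, here is the classic word-count job written against the Hadoop MapReduce Java API; it is independent of the PowerCenter integration described below, and the input and output paths are placeholders passed in as arguments. The mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts per word.

    // Minimal word-count sketch with the Hadoop MapReduce Java API.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input path, args[1] = HDFS output path (placeholders)
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper output is partitioned by key and sorted during the shuffle phase before it reaches the reducers; that is the "shuffle and sort" step mentioned above.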



What is Required for PC integration with Hadoop ?

      Starting from PowerCenter 9.1 HF1, it is possible to extract data from and load data into Hadoop using Informatica PowerCenter.

  PowerExchange for Hadoop is required to integrate Informatica PowerCenter with Hadoop to extract and load the data.

  PowerExchange is a suite of connectors for PowerCenter.

  PowerExchange for Hadoop is a plug-in provided by Informatica Corporation which provides the functionality required to connect PowerCenter to Hadoop.

  Using PowerExchange for Hadoop, you can do the following from PowerCenter:
        a) Extract from and load to HDFS.
        b) Load to Hive tables.



        NOTE :
        Starting with PowerCenter (PC) 9.5.1, PowerExchange for Hadoop installs with PowerCenter (i.e. it is included in the PowerCenter server installer itself). For earlier versions, PowerExchange for Hadoop needs to be purchased separately, as it is a licensed option, and in that case you need to manually register the pmhdfs.xml plug-in with the PowerCenter repository to enable the HDFS connections in the Workflow Manager.
Also, if you are upgrading from PowerCenter 9.1.0, the pmhdfs.xml plug-in needs to be registered again.




Power Center and Hadoop Integration :
  To extract data from HDFS, a PowerExchange for Hadoop mapping contains a flat file source. In the Workflow Manager, you specify the HDFS flat file reader to extract data from HDFS.

  To load data to HDFS or Hive, a PowerExchange for Hadoop mapping contains a flat file target. You specify the HDFS flat file writer to load data to HDFS or Hive.

  The Integration Service communicates with Hadoop through the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications and libraries written in other languages.




Hadoop Sources and Targets in PC :
 In Informatica PowerCenter, you include a flat file source definition in a mapping to extract Hadoop data.

  In Informatica PowerCenter, you include a delimited flat file target definition in a mapping to load data into HDFS or into a Hive table.

You can import a flat file definition or manually create one. To load data to a Hadoop target, the flat file definition must be delimited.



Power Exchange for Hadoop Sessions :

After you create a PowerExchange for Hadoop mapping in the Designer, you create a PowerExchange for Hadoop session in the Workflow Manager to read, transform, and write Hadoop data.

Before you create a session, configure a Hadoop HDFS application connection to connect to the HDFS host.

When the Integration Service extracts or loads Hadoop data, it connects to the Hadoop cluster through the HDFS host that runs the name node service for that cluster.

If the mapping contains a flat file source, you can configure the session to extract data 
from HDFS. If the mapping contains a flat file target, you can configure the session to load data 
to HDFS or a Hive table.

 When the Integration Service loads data to a Hive table, it first loads the data to HDFS. The Integration Service then generates SQL statements to create the Hive table and load the data from HDFS into the table, as sketched below.
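
The exact SQL that the Integration Service generates is internal to PowerCenter, but the two-step pattern can be sketched with hypothetical statements like the ones below, issued through the Hive JDBC driver that the connection object configures (see the Connections section that follows). The table name, columns, delimiter, and HDFS path are made up for the example and are not what PowerCenter would actually generate.

    // Hypothetical illustration of the two-step Hive load described above: the flat file
    // target is first written to HDFS, then a Hive table is created and the HDFS file is
    // loaded into it. All names and paths below are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveLoadSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer1 driver class and URL format, as listed in the connection properties below.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive://hive-host:10000/default", "hadoopuser", "");
                 Statement stmt = conn.createStatement()) {

                // Step 1 is done by the session itself: the HDFS flat file writer has already
                // placed the target file in HDFS, e.g. /user/hadoopuser/out/customers.csv.

                // Step 2: create the Hive table and load the HDFS file into it.
                stmt.execute("CREATE TABLE customers (id INT, name STRING, city STRING) "
                           + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
                stmt.execute("LOAD DATA INPATH '/user/hadoopuser/out/customers.csv' "
                           + "INTO TABLE customers");
            }
        }
    }

These are the HiveServer1 driver class and URL format that this post uses; newer Hive releases expose HiveServer2 instead, which uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2:// URLs.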

 NOTE:
 The prerequisite to read data from or write data to Hadoop is to have a flat file source or target in the mapping.


Power Exchange for Hadoop Connections :
Use a Hadoop HDFS application connection object for each Hadoop source or target that you  want to access.

Go to Workflow Manager -> Connections -> Application -> Hadoop HDFS Connection -> New.

Enter all the required information for the connection object and click OK.

 The following properties need to be passed to the HDFS application connection to make a successful connection to the Hadoop system (a small connectivity sketch using these values follows this list):
      (i) Name
      (ii) UserName - The name of the user in the Hadoop group that is used to access the HDFS host.
      (iii) Password
      (iv) Host Name - The name of the HDFS host that runs the name node service for the Hadoop cluster.
       (v) Host Port
       (vi) Hive Driver - The name of the Hive driver. By default, the driver name is:
                                     org.apache.hadoop.hive.jdbc.HiveDriver
       (vii) Hive URL - The URL to the Hive host. Specify the URL in the following format:
                                     jdbc:hive://hostname:portnumber/default
       (viii) Hive User Name
       (ix) Hadoop Distribution - The name of the Hadoop distribution. You can choose one
                                                        of the following options:
            - Apache
            - Cloudera
            - Pivotal

   Default is Apache.
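
If you want to sanity-check the Host Name, Host Port, and UserName values outside PowerCenter before creating the connection object, one simple way (an assumption on my part, not a PowerCenter requirement) is to list the user's home directory on HDFS with the Hadoop Java client. All values below are placeholders.

    // Quick, optional sanity check of the Host Name / Host Port / UserName values:
    // list the connecting user's home directory on HDFS.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionCheck {
        public static void main(String[] args) throws Exception {
            String hostName = "namenode-host";   // Host Name: host running the name node service
            int hostPort = 8020;                 // Host Port: name node RPC port (8020 is a common default)
            String userName = "hadoopuser";      // UserName: user in the Hadoop group

            // The third argument makes the request on behalf of the given user.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://" + hostName + ":" + hostPort),
                    new Configuration(),
                    userName);

            // List the user's home directory to confirm that host, port, and user line up.
            for (FileStatus status : fs.listStatus(new Path("/user/" + userName))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }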

                 Check the below screenshots:

[Screenshot: Application connection type - Hadoop HDFS connection]

[Screenshot: Hadoop HDFS connection properties, part 1]

[Screenshot: Hadoop HDFS connection properties, part 2]

[Screenshot: Hadoop Distribution type in the Hadoop connection]



Sessions with Hadoop Sources :

 You can configure a session to extract data from HDFS.

When you configure a session for a Hadoop source, you select the HDFS Flat File reader file type and a Hadoop HDFS application connection object.



Session Properties for Hadoop Source :
You can set up staging in the session properties for a Hadoop source. To enable staging for a Hadoop source, configure the following properties at the session level:
1- Is Staged
2- File Path
3- Staged File Name

Check the below screenshot showing these session properties for a Hadoop source:




Staging HDFS Source Data :
       You can optionally stage an HDFS source. The Integration Service stages source files on the local machine and then loads data from the staged file or files into the target.

        Stage an HDFS source when you want the Integration Service to read the source files and then close the connection before continuing to process the data.

       Configure staging for an HDFS source by setting HDFS flat file reader properties in a session.
       You can configure the following types of staging:

      1- Direct :
          Use direct staging when you want to read data from a single source file. The Integration Service reads data from the source file and stages the data on the local machine before passing it to downstream transformations.

      2- Indirect :
          Use indirect staging when you want to read data from multiple source files. The Integration Service reads data from multiple files in the source. It creates an indirect file that contains the names of the source files. It then stages the indirect file and the files read from the source in the local staging directory before passing the data to downstream transformations.

        For example, you stage the files named source1.csv and source2.csv from the Hadoop source location, with the staged file name set to:
                     c:\staged_files\source_stage_list.txt

        The Integration Service creates an indirect file named source_stage_list.txt that contains the following entries:
          source1.csv
          source2.csv

         The Integration Service stages the indirect file and the source files. In the c:\staged_files  directory, you would see the following files:
          source_stage_list.txt
          source1.csv
          source2.csv

        To configure staging, set the following session properties:
         1- Is Staged : Enabled
         2- File Path : For direct staging, enter the name and path of the source file. For indirect staging, enter the directory path to the source files.
         3- Staged File Name : Name and path on the local machine used to stage the source data.



Sessions with Hadoop Targets :

You can configure a session to load data to HDFS or a Hive table.

When you configure a session for a Hadoop target, you select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.

When you configure the session to load data to HDFS, you can configure partitioning, file header, and output options. When you configure the session to load data to a Hive table, you can configure partitioning and output options.

When the Integration Service loads data to a Hive table, it generates a relational table in the Hive database. You can overwrite the Hive table data when you run the session again.


Hadoop Target Partitioning :
When you configure a session to load data to a Hadoop target, you can write the target output to a separate file for each partition or to a merge file that contains the target output for all partitions.

You can select the following merge types for Hadoop target partitioning:

1- No Merge :
       The Integration Service generates one target file for each partition. If you stage the files, the Integration Service transfers the target files to the remote location at the end of the session. If you do not stage the files, the Integration Service generates the target files at the remote location.

2- Sequential Merge :
        The Integration Service creates one output file for each partition. At the end of the session, the Integration Service merges the individual output files into a merge file, deletes the individual output files, and transfers the merge file to the remote location.

       If you set the merge type to sequential, you need to define the merge file path and the output file path in the session properties. The merge file path determines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined for the output file path.



[Screenshot: Hadoop HDFS session target properties, part 1]

[Screenshot: Hadoop HDFS session target properties, part 2]



Session Properties for a Hadoop Target :
      The following session properties can be configured for a Hadoop target :
1- Merge Type - Type of merge that the Integration Service performs on the data for partitioned targets. (No Merge or Sequential Merge).
2- Append if Exists - Appends the output data to the target file if the file already exists.
3- Header options - Creates a header row in the flat file when loading data to HDFS.
4- Auto generate partition file names - Generates partition file names.
5- Merge File Path - The final Hadoop target location where the Integration Service creates the merge file (described in detail below).
6- Generate And Load Hive Table - Generates a relational table in the Hive database. The Integration Service loads data into the Hive table from the HDFS flat file target.
7- Overwrite Hive Table - Overwrites the data in the Hive table.
8- Hive Table Name - Hive table name. Default is the target instance name.
9- Output File Path
10- Reject File Path

Check the above two screenshots showing these session properties for a Hadoop target.

Understanding the different file path properties to be configured for Hadoop targets in the session properties :

       Merge File Path :
        If you choose a sequential merge type, defines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined in the output file path.

        Output File Path :
        Defines the absolute or relative directory path or file path on the HDFS host where the Integration Service writes the HDFS data. A relative path is relative to the home directory of the Hadoop user. If you choose a sequential merge type, defines where the Integration Service writes intermediate output before it writes to the final Hadoop target location as defined by the merge file path. If you choose to generate partition file names, this path can be a directory path.

       Reject File Path :
        The path to the reject file. By default, the Integration Service writes all reject files to the service process variable directory, $PMBadFileDir.





Conclusion :
       This is how you can extract and load data to Hadoop, HDFS, or Hive using Informatica PowerCenter.


 Thanks for reading this post!! Please provide your valuable feedback.
