How to extract and load data to Hadoop/HDFS/Hive from Informatica PowerCenter
Basic Overview of Hadoop :
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers (which do not share any storage or disk). Hadoop is designed to run on commodity hardware.
Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.
The two major components of a Hadoop system are :
1- Hadoop Distributed File System (HDFS)
2- MapReduce
HDFS is responsible for storing data on a cluster of machines. MapReduce is the data processing component of Hadoop.
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is loaded in. HDFS splits large data files into chunks that are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable.
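To make the chunking and replication arithmetic concrete, here is a tiny Python sketch; the 1 GB file is a made-up example, and the 128 MB block size and 3-way replication are simply the commonly cited defaults discussed below.

```python
import math

file_size_mb = 1024      # hypothetical 1 GB source file
block_size_mb = 128      # a commonly used HDFS block size
replication = 3          # typical default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # 8 blocks spread across nodes
raw_storage_mb = file_size_mb * replication        # every byte is stored on 3 different nodes

print(f"{blocks} blocks, {raw_storage_mb} MB of raw storage across the cluster")
```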
A set of machines running HDFS and MapReduce is known as a Hadoop cluster; the individual machines are known as nodes. A cluster can range from a single node to several thousand nodes.
HDFS :
1- For Hadoop to be able to process files, the files have to be in HDFS, i.e. Hadoop’s own file system. HDFS is responsible for storing data on a cluster of machines.
2- Data is normally split into blocks of 64 MB to 128 MB and spread across the cluster (blocks are of a fixed, configured size in Hadoop). By default, each block is replicated 3 times. The replication factor can be lowered or increased through configuration settings.
3- To ensure that data is not lost, the replicas of a block are always placed on different nodes.
4- Two processes take care of HDFS in Hadoop – the NameNode and the DataNode.
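If you want to poke at HDFS directly before wiring it into PowerCenter, one option is the WebHDFS REST interface. Below is a minimal Python sketch using the third-party `hdfs` (hdfscli) package; the NameNode URL, port, user, and paths are placeholders you would replace with your own cluster's values.

```python
from hdfs import InsecureClient  # pip install hdfs (the "hdfscli" package)

# Placeholder WebHDFS endpoint of the NameNode (port varies by Hadoop version) and a placeholder user
client = InsecureClient('http://namenode-host:9870', user='hdfsuser')

# List a home directory and write a small delimited file into HDFS
print(client.list('/user/hdfsuser'))
with client.write('/user/hdfsuser/example.csv', encoding='utf-8') as writer:
    writer.write('id,name\n1,example\n')
```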
MAPREDUCE :
1- MapReduce is the data processing component of Hadoop. It accomplishes data processing by distributing tasks across the nodes.
2- The task on each node (mostly) processes data that is present locally.
3- MapReduce consists of 2 phases :
(i) Map
(ii) Reduce
In between Map and Reduce there is a small phase called Shuffle and Sort.
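As a plain-Python illustration of these phases (this is not Hadoop code), here is a word count that runs map, then shuffle-and-sort, then reduce over a couple of in-memory lines:

```python
from itertools import groupby
from operator import itemgetter

lines = ["hello hadoop", "hello hdfs hello hive"]   # toy input split

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group the intermediate pairs by key
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce phase: sum the counts for each word
counts = {word: sum(count for _, count in group)
          for word, group in groupby(shuffled, key=itemgetter(0))}

print(counts)   # {'hadoop': 1, 'hdfs': 1, 'hello': 3, 'hive': 1}
```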
What is Required for PC Integration with Hadoop ?
Starting from PowerCenter 9.1 HotFix 1, it is possible to extract and load data into Hadoop from Informatica PowerCenter. PowerExchange for Hadoop is required to integrate Informatica PowerCenter with Hadoop in order to extract and load the data. PowerExchange is a suite of connectors for PowerCenter.
PowerExchange for Hadoop is a plug-in provided by Informatica Corporation that provides the functionality required to connect PowerCenter to Hadoop.
Using PowerExchange for Hadoop, you can do the following from PowerCenter:
a) Extract from and load to HDFS.
b) Load to Hive tables.
NOTE :
Starting from PowerCenter (PC) 9.5.1, PowerExchange for Hadoop installs with PowerCenter (i.e. it is included in the PowerCenter server installer). For previous versions, PowerExchange for Hadoop needs to be purchased separately, as it is a licensed option, and in that case you need to manually register the pmhdfs.xml plug-in to the PowerCenter repository to enable the HDFS connections in the Workflow Manager.
Also, if you are upgrading from PowerCenter 9.1.0, the pmhdfs.xml plug-in needs to be registered.
PowerCenter and Hadoop Integration :
To extract data from HDFS, a PowerExchange for Hadoop mapping contains a flat file source. In the Workflow Manager, you specify the HDFS flat file reader to extract data from HDFS.
To load data to HDFS or Hive, a PowerExchange for Hadoop mapping contains a flat file target. You specify the HDFS flat file writer to load data to HDFS or Hive.
The Integration Service communicates with Hadoop through the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications and libraries.
Hadoop Sources and Targets in PC :
In Informatica PowerCenter, you include a flat file source definition in a mapping to extract Hadoop data. You include a delimited flat file target definition in a mapping to load data into HDFS or to a Hive table.
You can import a flat file definition or manually create one. To load data to a Hadoop target, the flat file definition must be delimited.
PowerExchange for Hadoop Sessions :
After you create a PowerExchange for Hadoop mapping in the Designer, you create a PowerExchange for Hadoop session in the Workflow Manager to read, transform, and write Hadoop data.
Before you create a session, configure a Hadoop HDFS application connection to connect to the HDFS host. When the Integration Service extracts or loads Hadoop data, it connects to the Hadoop cluster through the HDFS host that runs the name node service for the cluster.
If the mapping contains a flat file source, you can configure the session to extract data from HDFS. If the mapping contains a flat file target, you can configure the session to load data to HDFS or a Hive table.
When the Integration Service loads data to a Hive table, it first loads the data to HDFS. The Integration Service then generates an SQL statement to create the Hive table and load the data from HDFS into the table.
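To give a sense of the kind of HiveQL involved, the sketch below issues a comparable CREATE TABLE and LOAD DATA pair by hand through the third-party PyHive library (which talks to HiveServer2 rather than the older JDBC driver mentioned later in this post). The host, port, user, table, and column names are placeholders, and this is not the exact SQL that the Integration Service generates.

```python
from pyhive import hive  # pip install pyhive (plus its Hive extras: thrift, sasl)

# Placeholder HiveServer2 host, port, and user
conn = hive.Connection(host='hive-host', port=10000, username='hiveuser')
cursor = conn.cursor()

# Create a delimited Hive table, then load an HDFS flat file into it
cursor.execute(
    "CREATE TABLE IF NOT EXISTS customer_tgt (id INT, name STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
)
cursor.execute(
    "LOAD DATA INPATH '/user/hdfsuser/customer_tgt.csv' OVERWRITE INTO TABLE customer_tgt"
)
```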
NOTE:
A prerequisite for reading data from or writing data to Hadoop is to have a flat file source/target in the mapping.
PowerExchange for Hadoop Connections :
Use a Hadoop HDFS application connection object for each Hadoop source or target that you want to access.
Go to Workflow Manager -> Connections -> Application -> Hadoop HDFS Connection -> New.
Enter all the required information for the connection object and click OK.
The following values need to be supplied in the Hadoop HDFS application connection to make a successful connection to the Hadoop system :
(i) Name
(ii) User Name - The name of the user in the Hadoop group that is used to access the HDFS host.
(iii) Password
(iv) Host Name - The name of the HDFS host that runs the name node service for the Hadoop cluster.
(v) Host Port
(vi) Hive Driver - The name of the Hive driver. By default, the driver name is: org.apache.hadoop.hive.jdbc.HiveDriver
(vii) Hive URL - The URL to the Hive host. Specify the URL in the following format: jdbc:hive://hostname:portnumber/default
(viii) Hive User Name
(ix) Hadoop Distribution - The name of the Hadoop distribution. You can choose one of the following options:
- Apache
- Cloudera
- Pivotal
Default is Apache.
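If you want to sanity-check the Hive Driver and Hive URL values outside of PowerCenter, one option is a small JDBC test from Python via the third-party JayDeBeApi package. It needs a local JVM and the Hive JDBC driver jar on disk, and the host, port, credentials, and jar path below are placeholders.

```python
import jaydebeapi  # pip install JayDeBeApi (requires a JVM and the Hive JDBC driver jar)

conn = jaydebeapi.connect(
    'org.apache.hadoop.hive.jdbc.HiveDriver',     # Hive Driver from the connection object
    'jdbc:hive://hive-host:10000/default',        # Hive URL in the documented format
    ['hiveuser', 'hivepassword'],                 # Hive user name and password
    '/path/to/hive-jdbc-driver.jar',              # placeholder path to the driver jar
)
cursor = conn.cursor()
cursor.execute('SHOW TABLES')                     # simple statement to confirm connectivity
print(cursor.fetchall())
conn.close()
```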
Check the below screenshots :
[Screenshot: Application Connection Type - Hadoop HDFS connection]
[Screenshot: Hadoop HDFS connection properties, part 1]
[Screenshot: Hadoop HDFS connection properties, part 2]
[Screenshot: Hadoop Distribution type in Hadoop connection]
Sessions with Hadoop Sources :
You can configure a session to extract data from HDFS. When you configure a session for a Hadoop source, you select the HDFS Flat File reader file type and a Hadoop HDFS application connection object.
Session Properties for a Hadoop Source :
You can set up staging in the session properties for a Hadoop source. To enable staging for a Hadoop source, configure the following properties at the session level :
1- Is Staged
2- File Path
3- Staged File Name
Check the screenshot below showing these session properties for a Hadoop source :
Staging HDFS Source Data :
You can optionally stage an HDFS source. The Integration Service stages the source files on the local machine and then loads data from the staged file or files into the target.
Stage an HDFS source when you want the Integration Service to read the source files and then close the connection before continuing to process the data.
Configure staging for an HDFS source by setting HDFS flat file reader properties in a session.
You can configure the following types of staging:
1- Direct :
Use direct staging when you want to read data from a source file. The Integration Service reads data from the source file and stages the data on the local machine before passing it to downstream transformations.
2- Indirect :
Use indirect staging when you want to read data from multiple source files. The Integration Service reads data from multiple files in the source. It creates an indirect file that contains the names of the source files. It then stages the indirect file and the files read from the source in the local staging directory before passing the data to downstream transformations.
For example, you stage the files named source1.csv and source2.csv from the Hadoop source location to the local directory c:\staged_files, with the staged file name set to c:\staged_files\source_stage_list.txt.
The Integration Service creates an indirect file named source_stage_list.txt that contains the following entries:
source1.csv
source2.csv
The Integration Service stages the indirect file and the source files. In the c:\staged_files directory, you would see the following files:
source_stage_list.txt
source1.csv
source2.csv
To configure staging, set the following session properties:
1- Is Staged : Enabled
2- File Path : For direct staging, enter the name and path of the source file. For indirect staging, enter the directory path to the source files.
3- Staged File Name : The name and path on the local machine used to stage the source data.
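Just to illustrate what the staged directory from the example above ends up containing, here is a small Python sketch that writes an indirect file like the one the Integration Service produces; the directory and file names are taken from that example, and the sketch only mimics the listing, not the actual staging.

```python
import os

staging_dir = r'c:\staged_files'                 # local staging directory from the example
source_files = ['source1.csv', 'source2.csv']    # files read from the Hadoop source

os.makedirs(staging_dir, exist_ok=True)

# The indirect file simply lists the staged source files, one per line
with open(os.path.join(staging_dir, 'source_stage_list.txt'), 'w') as indirect:
    indirect.write('\n'.join(source_files) + '\n')
```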
Sessions with Hadoop Targets :
You can configure a session to load data to HDFS or a Hive table. When you configure a session for a Hadoop target, you select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.
When you configure the session to load data to HDFS, you can configure partitioning, file header, and output options. When you configure the session to load data to a Hive table, you can configure partitioning and output options.
When the Integration Service loads data to a Hive table, it generates a relational table in the Hive database. You can overwrite the Hive table data when you run the session again.
Hadoop Target Partitioning :
When you configure a session to load data to a Hadoop target, you can write the target output to a separate file for each partition or to a merge file that contains the target output for all partitions.
You can select the following merge types for Hadoop target partitioning:
1- No Merge :
The Integration Service generates one target file for each partition. If you stage the files, the Integration Service transfers the target files to the remote location at the end of the session. If you do not stage the files, the Integration Service generates the target files at the remote location.
2- Sequential Merge :
The Integration Service creates one output file for each partition. At the end of the session, the Integration Service merges the individual output files into a merge file, deletes the individual output files, and transfers the merge file to the remote location.
If you set the merge type to Sequential Merge, you need to define the merge file path and the output file path in the session properties. The merge file path determines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge-file output in the location defined for the output file path.
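As a rough local analogue of what a sequential merge does at the end of the session, the Python sketch below concatenates per-partition output files into one merge file and then deletes the individual files; the directory layout and file names are made up for illustration and do not reflect how the Integration Service names its files.

```python
import glob
import os

output_dir = '/tmp/hdfs_output'                   # hypothetical output file path (per-partition files)
merge_path = '/tmp/hdfs_merge/target_merge.csv'   # hypothetical merge file path

os.makedirs(os.path.dirname(merge_path), exist_ok=True)

with open(merge_path, 'w') as merged:
    # Concatenate each partition's output into the merge file, then remove the individual file
    for part in sorted(glob.glob(os.path.join(output_dir, 'part_*.out'))):
        with open(part) as partition_file:
            merged.write(partition_file.read())
        os.remove(part)
```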
[Screenshot: Hadoop HDFS session target properties, part 1]
[Screenshot: Hadoop HDFS session target properties, part 2]
Session Properties for a Hadoop Target :
The following session properties can be configured for a Hadoop target :
1- Merge Type - The type of merge that the Integration Service performs on the data for partitioned targets (No Merge or Sequential Merge).
2- Append if Exists - Appends data to a file.
3- Header Options - Creates a header row in the flat file when loading data to HDFS.
4- Auto Generate Partition File Names - Generates partition file names.
5- Merge File Path
6- Generate And Load Hive Table - Generates a relational table in the Hive database. The Integration Service loads data into the Hive table from the HDFS flat file target.
7- Overwrite Hive Table - Overwrites the data in the Hive table.
8- Hive Table Name - The Hive table name. Default is the target instance name.
9- Output File Path
10- Reject File Path
Check the above two screenshots showing these session properties for a Hadoop target.
Understanding the different file path properties to be configured for Hadoop targets in the session properties :
Merge File Path :
If you choose a sequential merge type, this defines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge-file output in the location defined in the output file path.
Output File Path :
Defines the absolute or relative directory path or file path on the HDFS host where the Integration Service writes the HDFS data. A relative path is relative to the home directory of the Hadoop user. If you choose a sequential merge type, this defines where the Integration Service writes intermediate output before it writes to the final Hadoop target location as defined by the merge file path. If you choose to generate partition file names, this path can be a directory path.
Reject File Path :
The path to the reject file. By default, the Integration Service writes all reject files to the service process variable directory, $PMBadFileDir.
Conclusion :
This is how you can extract and load data to Hadoop, HDFS, or Hive using Informatica PowerCenter.
Thanks for reading this post! Please provide your valuable feedback.