How to extract and load data to Hadoop/HDFS/Hive from Informatica PowerCenter

  
Basic Overview of Hadoop :
  Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers (that do not share any storage or disk).
  Hadoop is designed for commodity hardware.

  Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.

  Two major components of Hadoop systems are:
        1- Hadoop Distributed File System (HDFS)
        2- MapReduce

        HDFS is responsible for storing data on a cluster of machines.

  MapReduce is the data processing component of Hadoop.

    
      In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition to this, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes.

A cluster can have anywhere from a single node to several thousand nodes.



HDFS :

1- For Hadoop to be able to process files, the files have to be in HDFS, i.e. Hadoop's own file system.
        HDFS is responsible for storing data on a cluster of machines.

2- Data is normally split into blocks of 64 MB to 128 MB and spread across the cluster.
     (The block size is set through configuration.)
      By default, each block is replicated 3 times. The replication factor can be lowered or increased
      through configuration settings (see the sketch after this list).

 3- To ensure that data is not lost, replicas of a block are always placed on different nodes.

 4- There are 2 processes which take care of HDFS in Hadoop: the NameNode and the DataNode.
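
To make the block size and replication settings concrete, here is a minimal sketch (independent of the PowerCenter integration covered later in this post) that writes a small file to HDFS using the Hadoop Java client. The dfs.blocksize and dfs.replication properties are standard Hadoop configuration keys; the name node host, port, directory, and file name are placeholders you would replace with your own values.

    // Minimal sketch: write a file to HDFS with an explicit block size and replication
    // factor using the Hadoop Java client (hadoop-client dependency).
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide defaults normally come from hdfs-site.xml; a client can
            // override them for the files it writes, as shown here.
            conf.set("dfs.blocksize", "134217728");   // 128 MB blocks (older releases use dfs.block.size)
            conf.set("dfs.replication", "3");         // 3 replicas per block

            // "namenode-host:8020" is a placeholder for the name node host and port.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
            try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo/sample.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }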



 MAPREDUCE :

       
 1- MapReduce is the data processing component of Hadoop. It accomplishes the task
         of data processing by distributing tasks across the nodes.

 2- The task on each node processes data that is (mostly) present locally.

 3- MapReduce consists of 2 phases:
 (i) Map
 (ii) Reduce
  In between Map and Reduce, there is a small phase called Shuffle and Sort (see the word-count sketch below).
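
As a concrete (if simplified) illustration of these phases, here is the classic word-count job written against the Hadoop MapReduce Java API; it is independent of the PowerCenter integration described below, and the input and output paths are placeholders passed in as arguments. The mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts per word.

    // Minimal word-count sketch with the Hadoop MapReduce Java API.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            // args[0] = HDFS input path, args[1] = HDFS output path (placeholders)
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper output is partitioned by key and sorted during the shuffle phase before it reaches the reducers; that is the "shuffle and sort" step mentioned above.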



What is Required for PC integration with Hadoop ?

      Starting from PowerCenter 9.1 HF1, it is possible to extract data from and load data into Hadoop using Informatica PowerCenter.

  PowerExchange for Hadoop is required to integrate Informatica PowerCenter with Hadoop to extract and load the data.

  PowerExchange is a suite of connectors for PowerCenter.

  PowerExchange for Hadoop is a plug-in provided by Informatica Corporation which provides the functionality required to connect PowerCenter to Hadoop.

  Using PowerExchange for Hadoop, you can do the following from PowerCenter:
        a) Extract from and load to HDFS.
        b) Load to Hive tables.



        NOTE :
        Starting with PowerCenter (PC) 9.5.1, PowerExchange for Hadoop installs with PowerCenter (i.e. it is included in the PowerCenter server installer itself). For earlier versions, PowerExchange for Hadoop needs to be purchased separately, as it is a licensed option, and in that case you need to manually register the pmhdfs.xml plug-in with the PowerCenter repository to enable the HDFS connections in the Workflow Manager.
Also, if you are upgrading from PowerCenter 9.1.0, the pmhdfs.xml plug-in needs to be registered again.




Power Center and Hadoop Integration :
  To extract data from HDFS, a PowerExchange for Hadoop mapping contains a flat file source. In the Workflow Manager, you specify the HDFS flat file reader to extract data from HDFS.

  To load data to HDFS or Hive, a PowerExchange for Hadoop mapping contains a flat file target. You specify the HDFS flat file writer to load data to HDFS or Hive.

  The Integration Service communicates with Hadoop through the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications and libraries written in other languages.




Hadoop Sources and Targets in PC :
 In Informatica PowerCenter, you include a flat file source definition in a mapping to extract Hadoop data.

  In Informatica PowerCenter, you include a delimited flat file target definition in a mapping to load data into HDFS or into a Hive table.

You can import a flat file definition or manually create one. To load data to a Hadoop target, the flat file definition must be delimited.



Power Exchange for Hadoop Sessions :

After you create a PowerExchange for Hadoop mapping in the Designer, you create a PowerExchange for Hadoop session in the Workflow Manager to read, transform, and write Hadoop data.

Before you create a session, configure a Hadoop HDFS application connection to connect to the HDFS host.

When the Integration Service extracts or loads Hadoop data, it connects to the Hadoop cluster through the HDFS host that runs the name node service for that cluster.

If the mapping contains a flat file source, you can configure the session to extract data 
from HDFS. If the mapping contains a flat file target, you can configure the session to load data 
to HDFS or a Hive table.

 When the Integration Service loads data to a Hive table, it first loads the data to HDFS. The Integration Service then generates SQL statements to create the Hive table and load the data from HDFS into the table, as sketched below.
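
The exact SQL that the Integration Service generates is internal to PowerCenter, but the two-step pattern can be sketched with hypothetical statements like the ones below, issued through the Hive JDBC driver that the connection object configures (see the Connections section that follows). The table name, columns, delimiter, and HDFS path are made up for the example and are not what PowerCenter would actually generate.

    // Hypothetical illustration of the two-step Hive load described above: the flat file
    // target is first written to HDFS, then a Hive table is created and the HDFS file is
    // loaded into it. All names and paths below are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveLoadSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer1 driver class and URL format, as listed in the connection properties below.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive://hive-host:10000/default", "hadoopuser", "");
                 Statement stmt = conn.createStatement()) {

                // Step 1 is done by the session itself: the HDFS flat file writer has already
                // placed the target file in HDFS, e.g. /user/hadoopuser/out/customers.csv.

                // Step 2: create the Hive table and load the HDFS file into it.
                stmt.execute("CREATE TABLE customers (id INT, name STRING, city STRING) "
                           + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
                stmt.execute("LOAD DATA INPATH '/user/hadoopuser/out/customers.csv' "
                           + "INTO TABLE customers");
            }
        }
    }

These are the HiveServer1 driver class and URL format that this post uses; newer Hive releases expose HiveServer2 instead, which uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2:// URLs.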

 NOTE:
 The prerequisite to read data from or write data to Hadoop is to have a flat file source or target in the mapping.


Power Exchange for Hadoop Connections :
Use a Hadoop HDFS application connection object for each Hadoop source or target that you  want to access.

Go to Workflow Manager -> Connections -> Application -> Hadoop HDFS Connection -> New.

Enter all the required information for the connection object and click OK.

 The following properties need to be passed to the HDFS application connection to make a successful connection to the Hadoop system (a small connectivity sketch using these values follows this list):
      (i) Name
      (ii) UserName - The name of the user in the Hadoop group that is used to access the HDFS host.
      (iii) Password
      (iv) Host Name - The name of the HDFS host that runs the name node service for the Hadoop cluster.
       (v) Host Port
       (vi) Hive Driver - The name of the Hive driver. By default, the driver name is:
                                     org.apache.hadoop.hive.jdbc.HiveDriver
       (vii) Hive URL - The URL to the Hive host. Specify the URL in the following format:
                                     jdbc:hive://hostname:portnumber/default
       (viii) Hive User Name
       (ix) Hadoop Distribution - The name of the Hadoop distribution. You can choose one
                                                        of the following options:
            - Apache
            - Cloudera
            - Pivotal

   Default is Apache.
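
If you want to sanity-check the Host Name, Host Port, and UserName values outside PowerCenter before creating the connection object, one simple way (an assumption on my part, not a PowerCenter requirement) is to list the user's home directory on HDFS with the Hadoop Java client. All values below are placeholders.

    // Quick, optional sanity check of the Host Name / Host Port / UserName values:
    // list the connecting user's home directory on HDFS.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectionCheck {
        public static void main(String[] args) throws Exception {
            String hostName = "namenode-host";   // Host Name: host running the name node service
            int hostPort = 8020;                 // Host Port: name node RPC port (8020 is a common default)
            String userName = "hadoopuser";      // UserName: user in the Hadoop group

            // The third argument makes the request on behalf of the given user.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://" + hostName + ":" + hostPort),
                    new Configuration(),
                    userName);

            // List the user's home directory to confirm that host, port, and user line up.
            for (FileStatus status : fs.listStatus(new Path("/user/" + userName))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }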

                 Check the below screenshots:

[Screenshot: Application connection type - Hadoop HDFS connection]

[Screenshot: Hadoop HDFS connection properties, part 1]

[Screenshot: Hadoop HDFS connection properties, part 2]

[Screenshot: Hadoop Distribution type in the Hadoop connection]



Sessions with Hadoop Sources :

 You can configure a session to extract data from HDFS.

When you configure a session for a Hadoop source, you select the HDFS Flat File reader file type and a Hadoop HDFS application connection object.



Session Properties for Hadoop Source :
You can set up staging in the session properties for a Hadoop source. To enable staging for a Hadoop source, configure the following properties at the session level:
1- Is Staged
2- File Path
3- Staged File Name

Check the below screenshot showing these session properties for a Hadoop source:




Staging HDFS Source Data :
       You can optionally stage an HDFS source. The Integration Service stages source files on the local machine and then loads data from the staged file or files into the target.

        Stage an HDFS source when you want the Integration Service to read the source files and then close the connection before continuing to process the data.

       Configure staging for an HDFS source by setting HDFS flat file reader properties in a session.
       You can configure the following types of staging:

      1- Direct :
          Use direct staging when you want to read data from a single source file. The Integration Service reads data from the source file and stages the data on the local machine before passing it to downstream transformations.

      2- Indirect :
          Use indirect staging when you want to read data from multiple source files. The Integration Service reads data from multiple files in the source. It creates an indirect file that contains the names of the source files. It then stages the indirect file and the files read from the source in the local staging directory before passing the data to downstream transformations.

        For example, you stage the files named source1.csv and source2.csv from the Hadoop source location, with the staged file name set to:
                     c:\staged_files\source_stage_list.txt

        The Integration Service creates an indirect file named source_stage_list.txt that contains the following entries:
          source1.csv
          source2.csv

         The Integration Service stages the indirect file and the source files. In the c:\staged_files  directory, you would see the following files:
          source_stage_list.txt
          source1.csv
          source2.csv

        To configure staging, set the following session properties:
         1- Is Staged : Enabled
         2- File Path : For direct staging, enter the name and path of the source file. For indirect staging, enter the directory path to the source files.
         3- Staged File Name : Name and path on the local machine used to stage the source data.



Sessions with Hadoop Targets :

You can configure a session to load data to HDFS or a Hive table.

When you configure a session for a Hadoop target, you select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.

When you configure the session to load data to HDFS, you can configure partitioning, file header, and output options. When you configure the session to load data to a Hive table, you can configure partitioning and output options.

When the Integration Service loads data to a Hive table, it generates a relational table in the Hive database. You can overwrite the Hive table data when you run the session again.


Hadoop Target Partitioning :
When you configure a session to load data to a Hadoop target, you can write the target output to a separate file for each partition or to a merge file that contains the target output for all partitions.

You can select the following merge types for Hadoop target partitioning:

1- No Merge :
       The Integration Service generates one target file for each partition. If you stage the files, the Integration Service transfers the target files to the remote location at the end of the session. If you do not stage the files, the Integration Service generates the target files at the remote location.

2- Sequential Merge :
        The Integration Service creates one output file for each partition. At the end of the session, the Integration Service merges the individual output files into a merge file, deletes the individual output files, and transfers the merge file to the remote location.

       If you set the merge type to sequential, you need to define the merge file path and the output file path in the session properties. The merge file path determines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined for the output file path.



[Screenshot: Hadoop HDFS session target properties, part 1]

[Screenshot: Hadoop HDFS session target properties, part 2]



Session Properties for a Hadoop Target :
      The following session properties can be configured for a Hadoop target :
1- Merge Type - Type of merge that the Integration Service performs on the data for partitioned targets. (No Merge or Sequential Merge).
2- Append if Exists - Appends the output data to the target file if the file already exists.
3- Header options - Creates a header row in the flat file when loading data to HDFS.
4- Auto generate partition file names - Generates partition file names.
5- Merge File Path - The final Hadoop target location where the Integration Service creates the merge file (described in detail below).
6- Generate And Load Hive Table - Generates a relational table in the Hive database. The Integration Service loads data into the Hive table from the HDFS flat file target.
7- Overwrite Hive Table - Overwrites the data in the Hive table.
8- Hive Table Name - Hive table name. Default is the target instance name.
9- Output File Path
10- Reject File Path

Check the above two screenshots showing these session properties for a Hadoop target.

Understanding the different file path properties to be configured for Hadoop targets in the session properties :

       Merge File Path :
        If you choose a sequential merge type, defines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined in the output file path.

        Output File Path :
        Defines the absolute or relative directory path or file path on the HDFS host where the Integration Service writes the HDFS data. A relative path is relative to the home directory of the Hadoop user. If you choose a sequential merge type, defines where the Integration Service writes intermediate output before it writes to the final Hadoop target location as defined by the merge file path. If you choose to generate partition file names, this path can be a directory path.

       Reject File Path :
        The path to the reject file. By default, the Integration Service writes all reject files to the service process variable directory, $PMBadFileDir.





Conclusion :
       This is how you can extract and load data to Hadoop, HDFS, or Hive using Informatica PowerCenter.


 Thanks for reading this post!! Please provide your valuable feedback.
