Introduction:
Apache Hadoop has become the poster child for big data, largely because it is a highly scalable analytics platform capable of processing large volumes of structured and unstructured data. SAP HANA, on the other hand, has gained ground as the leading in-memory data analytics platform, letting you accelerate business processes and deliver quantifiable business intelligence at lightning speed. The two platforms are independent of each other, and each has pros and cons, which is what makes the combination a good fit for a long-term, sustainable, high-performance data lake strategy for any large multinational corporation.
This blog is intended as a walkthrough of implementing Apache Hadoop as a Near Line Storage solution for SAP HANA, leveraging SAP Spark Controller. For the sake of this blog, we will work with the following versions and software products:
SAP BW 7.5 SPS 5
SAP HANA 1.0 SPS 12 and higher
Core Apache Hadoop version 2.7.1 or higher (HDFS, MapReduce2, YARN)
Tez 0.7.0 as the execution engine for Hive (instead of MapReduce2, if preferred)
Spark 1.5.2 or higher
SAP HANA Spark Controller 2.0 SP01 Patch 1 or higher
SAP recommends these as the baseline requirements, and in my experience these versions work very well with each other in terms of dependencies and interoperability. Both Hortonworks (HDP) and Cloudera (CDH) provide packaged Apache platforms that include the above versions. I have not personally worked with MapR, so I cannot say whether it does as well, but I expect it offers a comparable stack that works with SAP.
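As a quick sanity check before starting, the minimum versions listed above can be compared against what is actually installed on the cluster. The snippet below is only a minimal sketch: the function names and the "installed" values are hypothetical, Tez 0.7.0 is treated as a minimum (my assumption), and it only handles plain dotted version numbers such as 2.7.1, not the longer build strings that distributions append.

```python
# Minimum versions taken from the list above.
# Treating Tez 0.7.0 as a minimum is an assumption made for this sketch.
MINIMUMS = {
    "hadoop": "2.7.1",
    "spark": "1.5.2",
    "tez": "0.7.0",
}


def parse_version(text):
    """Turn '2.7.1' into (2, 7, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(installed, minimums=MINIMUMS):
    """Return {component: (installed, required)} for components that are missing or too old."""
    outdated = {}
    for name, required in minimums.items():
        found = installed.get(name)
        if found is None or parse_version(found) < parse_version(required):
            outdated[name] = (found, required)
    return outdated


if __name__ == "__main__":
    # Hypothetical values; replace them with the versions reported by your cluster.
    installed = {"hadoop": "2.7.3", "spark": "1.6.2", "tez": "0.7.0"}
    print(find_outdated(installed) or "all components meet the minimums")
```

Comparing tuples of integers rather than raw strings matters here, because a string comparison would rank 2.10.0 below 2.9.9.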
Hadoop Cluster Architecture and Sizing:
For a POC, I recommend at least a 3-node Hadoop cluster with 1 NameNode and 2 DataNodes. This gives the administration team a good feel for a production cluster in terms of setup and administrative duties. Apache Hadoop is a multi-component solution, so its tuning and configuration aspects are diverse yet interdependent. The overall architecture is depicted below at a high level.
Hadoop 3 Node Cluster:
SAP BW on HANA, Spark Controller and Hadoop:
Near Line Storage:
Please refer to vendor documentation for Hadoop sizing. For the sake of reference, below is the cluster sizing that I used for the proof of concept:
We went with a virtualized cluster for the POC.
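One thing worth keeping in mind when sizing a small POC cluster is that HDFS stores several copies of every block (the default replication factor is 3), so usable capacity is well below raw disk capacity. The sketch below is a rough back-of-the-envelope estimate only, not vendor sizing guidance: the function name, the reserve fraction for non-HDFS and temporary data, and the disk sizes in the example are all assumptions of mine.

```python
def usable_hdfs_capacity_tb(datanodes, disk_tb_per_node, replication=3, reserve_fraction=0.25):
    """Rough estimate of how much data HDFS can hold on a cluster.

    reserve_fraction is a hypothetical allowance for non-HDFS and temporary data.
    """
    # HDFS cannot place more replicas of a block than there are DataNodes,
    # so a 2-DataNode POC effectively stores 2 copies even with replication=3.
    effective_replication = min(replication, datanodes)
    raw_tb = datanodes * disk_tb_per_node
    return raw_tb * (1 - reserve_fraction) / effective_replication


if __name__ == "__main__":
    # Hypothetical POC: 2 DataNodes with 4 TB of disk each.
    print(usable_hdfs_capacity_tb(datanodes=2, disk_tb_per_node=4))
```

With only 2 DataNodes, blocks written at the default replication factor will be reported as under-replicated, so for a POC of this shape it is common to set the replication factor to 2 explicitly.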
Hadoop Installation:
Depending on the flavor of Hadoop chosen for the POC, you can install the Hadoop cluster through Apache Ambari or through Cloudera Manager. The links to the detailed step-by-step installation guides are below:
Apache Ambari:
https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1.html