Data Lake analytics have become real. But the challenge is to access the data quickly and provide meaningful insights. There are several techniques to access the data faster. In this blog we see how we can integrate Hana with Hadoop to get insights for larger data sets quickly.
If you have both Hana & Hadoop in your eco-system. SAP has provided an option to integrate HANA and Hadoop using Hana Spark controller. Where power of In-Memory processing can be used for real time insights and we can in parallel use Hadoop ability to process huge data sets.
For driving business decisions in real time with operational business data and with big data from various sources have become competitively necessary. Hana Spark controller can be used for integrating Hana and Hadoop where HANA in-memory capability can be used for real time analytics combining with Hadoop ability of process volumes of data in cost effective manner. Hana Hadoop integration with HANA spark controller gives us the ability to have federated data access between HANA and hive meta store. In this blog we will see this capability with a simple example.
The basic use case is the ability to use Hadoop as a cold data store for less frequently accessed data. The tool that enables it is the Data Lifecycle Manager (DLM), a component of SAP HANA’s data warehouse foundation. The data tiering enables us for bi-direction data movement between HANA and Hadoop.
We will now see the overview on how we should establish connection to Hadoop cluster from HANA and then we will see an example to have federated access between HANA and Hadoop.
Install HANA spark controller on Hadoop cluster.
◈ Almost all distributions are supported ( Hortonworks, Cloudera, Mapr ).
◈ HANA Version 1.0 SP12 and 2.0 are supported for this installation. If you would like to create virtual spark procedures then you must be on HANA 2.0.
I will document the installation steps for Spark controller on Hadoop cluster in my next blog.
Establish Smart data connection between Hana and Hadoop.
◈ Once the installation of spark controller on Hadoop cluster is installed and started. We need to establish Smart data connection between HANA and Hadoop .
- If you have LDAP based authentication to Hadoop Cluster
CREATE REMOTE SOURCE "spark_demo" ADAPTER "sparksql"
CONFIGURATION 'server=<x.x.x.x>;port=7860;ssl_mode=disabled'
WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<user name>;password=<password>';
- If you have Kerbros authentication to Hadoop cluster
CREATE REMOTE SOURCE "spark_demo" ADAPTER "sparksql"
CONFIGURATION 'server=<x.x.x.x>;port=7860;ssl_mode=disabled'
WITH CREDENTIAL TYPE 'KERBEROS'
◈ You can verify the created remote source connectivity.
If you have both Hana & Hadoop in your eco-system. SAP has provided an option to integrate HANA and Hadoop using Hana Spark controller. Where power of In-Memory processing can be used for real time insights and we can in parallel use Hadoop ability to process huge data sets.
For driving business decisions in real time with operational business data and with big data from various sources have become competitively necessary. Hana Spark controller can be used for integrating Hana and Hadoop where HANA in-memory capability can be used for real time analytics combining with Hadoop ability of process volumes of data in cost effective manner. Hana Hadoop integration with HANA spark controller gives us the ability to have federated data access between HANA and hive meta store. In this blog we will see this capability with a simple example.
The basic use case is the ability to use Hadoop as a cold data store for less frequently accessed data. The tool that enables it is the Data Lifecycle Manager (DLM), a component of SAP HANA’s data warehouse foundation. The data tiering enables us for bi-direction data movement between HANA and Hadoop.
We will now see the overview on how we should establish connection to Hadoop cluster from HANA and then we will see an example to have federated access between HANA and Hadoop.
Integration Steps
Install HANA spark controller on Hadoop cluster.
◈ Almost all distributions are supported ( Hortonworks, Cloudera, Mapr ).
◈ HANA Version 1.0 SP12 and 2.0 are supported for this installation. If you would like to create virtual spark procedures then you must be on HANA 2.0.
I will document the installation steps for Spark controller on Hadoop cluster in my next blog.
Establish Smart data connection between Hana and Hadoop.
◈ Once the installation of spark controller on Hadoop cluster is installed and started. We need to establish Smart data connection between HANA and Hadoop .
- If you have LDAP based authentication to Hadoop Cluster
CREATE REMOTE SOURCE "spark_demo" ADAPTER "sparksql"
CONFIGURATION 'server=<x.x.x.x>;port=7860;ssl_mode=disabled'
WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<user name>;password=<password>';
- If you have Kerbros authentication to Hadoop cluster
CREATE REMOTE SOURCE "spark_demo" ADAPTER "sparksql"
CONFIGURATION 'server=<x.x.x.x>;port=7860;ssl_mode=disabled'
WITH CREDENTIAL TYPE 'KERBEROS'
◈ You can verify the created remote source connectivity.
We can see connection successfully established. We will now look at an example to have federated data access between Hana and Hadoop.
Federated Data access between Hana and Hadoop
In this example, we will see how to fetch data from Hana and Hadoop in parallel with a small example. Below are the steps to achieve this.
1. Create a sample table in Hana.
2. Create a sample table in Hive.
3. Create virtual spark procedure in Hana to fetch data from hive.
4. Create another procedure to union Hana data results with Hive results to display combined results.
1) Create sample table in Hana
We have created below table in Hana and inserted few records.
2) Create Sample Table in Hive
hive> show create table hana_hive_team;
OK
CREATE TABLE `hana_hive_team`(
`emp_id` int,
`emp_name` string,
`join_date` timestamp,
`country` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://dev17ha/user/hive/warehouse/hana_hive_team'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='3',
'numRows'='9',
'rawDataSize'='321',
'totalSize'='330',
'transient_lastDdlTime'='1528353203')
Below is the sample data i inserted into table default.hana_hive_team
3) Create virtual spark procedure in Hana to fetch data from hive.
Below is the sample code for virtual procedure written in Hana to read data from hive table default.hana_hive_team.
Note: We are not creating any virtual tables we have written a spark sql using scala programming.
CREATE VIRTUAL PROCEDURE SYSTEM.READ_HIVE_DATA(
OUT TEAM_OUT "TEAM_OUT"
)
LANGUAGE SCALASPARK
AT SPARK_SQL
AS
BEGIN
import sqlContext.implicits._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql._
import scala.collection.mutable.WrappedArray
import org.apache.spark.ml.feature.NGram
val result = sqlContext.sql("select * from default.hana_hive_team")
result.show()
TEAM_OUT = result
END
Below is the output of the virtual spark procedure, triggered from HANA. Output from Step 2 & 3 would be same.
4) Create another procedure to union Hana data results with Hive results.
To access Hana and Hadoop in parallel and display combined results with a union. Below is the sample procedure.
PROCEDURE "_SYS_BIC"."lgureja::HANA_HIVE_PARALLEL_ACCESS" (OUT TEAM_OUT "TEAM_OUT")
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER
AS
BEGIN
-- Reading Hive data
CALL READ_HIVE_DATA(a) ;
-- Storing Hive Data to temporary table
TEAM_OUT1 = select emp_id, emp_name, join_date, country from :a;
-- Joining Hive data with HANA Data
TEAM_OUT = select emp_id, emp_name, join_date, country from SYSTEM.TEAM
union all
select emp_id, emp_name, join_date, country from :TEAM_OUT1;
END;
Output of the procedure combine hana and hive data results.
Done. This is how we can access Hana and Hadoop data sets in parallel using Hana spark controller.