
High performance in-memory MapReduce with Apache Ignite

Apache Ignite provides a set of useful components (the Hadoop Accelerator) that allow in-memory Hadoop job execution and file system operations. These in-memory plug-and-play accelerators can be grouped into the following three categories:
1) MapReduce: an alternative implementation of the Hadoop job tracker and task tracker that can accelerate job execution. It eliminates the overhead associated with the job tracker and task trackers in a standard Hadoop architecture while providing low-latency, HPC-style distributed processing.
2) IGFS: an alternative implementation of the Hadoop file system, named IgniteHadoopFileSystem, which can store data in memory. This in-memory file system minimizes disk I/O and improves performance.
3) Secondary file system: this implementation works as a caching layer above HDFS; every read and write operation goes through this layer and can improve performance.
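As a preview of how IGFS plugs in at the file-system level (the details are left for a later post), Hadoop's core-site.xml would point the igfs:// scheme at Ignite's file system implementation roughly as shown below. The property and class names are taken from the Ignite 1.6 Hadoop accelerator documentation; treat this as a sketch rather than a complete configuration:

```xml
<property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
</property>
```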
In this blog post, I'm going to configure Apache Ignite with Hadoop to execute an in-memory MapReduce job.

A high-level architecture of the Ignite Hadoop accelerator is as follows:


Sandbox configuration:
VM: VMware
OS: Red Hat Enterprise Linux 6
CPU: 2
RAM: 2 GB
JVM: 1.7_60
Hadoop Version: 2.7.2
Apache Ignite version: 1.6

We are going to install a Hadoop pseudo-distributed cluster on one machine, meaning the datanode, namenode, and task/job tracker will all run in a single virtual machine.
First we will install and configure Hadoop, then proceed to Apache Ignite. It is assumed that Java 1.7 has already been installed and that JAVA_HOME is set in the environment variables.
Step 1:
Unpack the Hadoop distribution and set the JAVA_HOME path in the etc/hadoop/hadoop-env.sh file as follows:
export JAVA_HOME=/path/to/your/jdk
Step 2:
Add the following configuration in etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Step 3:
Set up passphraseless SSH as follows:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Then try ssh localhost; it shouldn't ask you for a password.
Step 4:
Format the HDFS file system:
$ bin/hdfs namenode -format
Start the namenode and datanode daemons:
$ sbin/start-dfs.sh

It's also a good idea to add HADOOP_HOME to the operating system's environment variables.
Step 5:
Make a few directories in HDFS to run the MapReduce job:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /input
Put some files into the /input directory:
$ bin/hdfs dfs -put $HADOOP_HOME/etc/hadoop /input
Step 6:
Run the word count MapReduce example:
$ bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/hadoop output
View the output files on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
It should be a fairly large file; let's look at a fragment of the output:
want 1
warnings. 1
when 9
where 4
which 7
while 1
who 6
will 23
window 1
window, 1
with 62
within 4
without 1
work 12
writing, 27
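Conceptually, what the wordcount job does can be sketched as a plain local shell pipeline (an illustration only, with made-up sample text; no Hadoop involved):

```shell
# Emulate wordcount locally: split on whitespace, sort, count duplicates.
printf 'hello world\nhello ignite\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c
```

Each mapper emits (word, 1) pairs, sorting groups equal words together (the shuffle phase), and uniq -c plays the reducer's role of summing the counts per word.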
Now let's install and configure Apache Ignite.
Step 7:
Unpack the Apache Ignite distribution somewhere on your system and set the IGNITE_HOME variable to the root directory of the installation.
Step 8:
Add the following libraries to $IGNITE_HOME/libs:
  1. asm-all-4.2.jar
  2. ignite-hadoop-1.6.0.jar
  3. hadoop-mapreduce-client-core-2.7.2.jar
Step 9:
We are going to use the Ignite default configuration, config/default-config.xml. Now we can start Ignite as follows:
$ bin/ignite.sh
Step 10:
Now we will make a few changes to use the Ignite job tracker instead of Hadoop's. Add the following to the HADOOP_CLASSPATH environment variable:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$IGNITE_HOME/libs/ignite-core-1.6.0.jar:$IGNITE_HOME/libs/ignite-hadoop-1.6.0.jar:$IGNITE_HOME/libs/ignite-shmem-1.0.0.jar
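The export above can also be sketched as a small script that collects every Ignite jar from $IGNITE_HOME/libs into the classpath. The directory and jar files below are created as empty placeholders purely so the loop has something to find; on a real system IGNITE_HOME would already point at your installation:

```shell
# Build HADOOP_CLASSPATH from the Ignite jars (placeholder files for demo).
IGNITE_HOME=$(mktemp -d)
mkdir -p "$IGNITE_HOME/libs"
touch "$IGNITE_HOME/libs/ignite-core-1.6.0.jar" \
      "$IGNITE_HOME/libs/ignite-hadoop-1.6.0.jar" \
      "$IGNITE_HOME/libs/ignite-shmem-1.0.0.jar"

HADOOP_CLASSPATH=""
for jar in "$IGNITE_HOME"/libs/ignite-*.jar; do
  # Append each jar, separated by ':' (no separator before the first one).
  HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
done
export HADOOP_CLASSPATH
echo "$HADOOP_CLASSPATH"
```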
Step 11:
Stop the Hadoop namenode and datanodes:
$ sbin/stop-dfs.sh
Step 12:
We have to override the Hadoop mapred-site.xml (there are several ways to add your own mapred-site.xml for Ignite; see the documentation at http://apacheignite.gridgain.org/v1.0/docs/map-reduce). For a quick start, we will add the following XML fragment to mapred-site.xml:
<property>
    <name>mapreduce.framework.name</name>
    <value>ignite</value>
</property>
<property>
    <name>mapreduce.jobtracker.address</name>
    <value>127.0.0.1:11211</value>
</property>
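Put together, a complete mapred-site.xml for this quick start would look like the following (the two properties shown above, wrapped in the standard <configuration> element):

```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>ignite</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>127.0.0.1:11211</value>
    </property>
</configuration>
```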
Step 13:
Run the word count MapReduce example from before again:
$ bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/hadoop output2
Here is the output:

The execution time is now faster than last time, when we used the Hadoop task tracker. Let's examine the Ignite task execution statistics through the Ignite Visor:
Here we can see the total execution time and the duration of the in-memory task tracker; in our case it's HadoopProtocolJobStatusTask(@t1), total 24, execution rate 12 seconds.
In the next blog post, we will cover Ignite's IGFS and the secondary file system.
