In-Memory MapReduce and Your Hadoop Ecosystem (Part 2)

Portions of this article were taken from the book High-Performance In-Memory Computing With Apache Ignite. If it got you interested, check out the rest of the book for more helpful information.
Before reading, be sure to check out Part 1!
Apache Ignite provides a distributed in-memory file system called Ignite File System (IGFS) with functionality similar to Hadoop HDFS. This is one of the unique features of Apache Ignite that helps accelerate Big Data computing. IGFS implements the Hadoop file system API and is designed to support both Hadoop v1 and YARN-based Hadoop v2. IGFS can transparently plug into a Hadoop or Spark deployment.
One of the greatest benefits of IGFS is that it does away with the Hadoop NameNode in the Hadoop deployment; it seamlessly utilizes Ignite’s in-memory database under the hood to provide completely automatic scaling and failover without any additional shared storage. IGFS uses memory instead of disk to produce a distributed, fault-tolerant, and high-throughput file system. Removing the NameNode from the architecture leads to dramatically better I/O performance. Furthermore, IGFS provides a native file system API for working with directories and files in the in-memory file system.
IgniteFileSystem, the IGFS interface, provides methods for regular file system operations such as create, update, delete, mkdirs, etc., as well as for MapReduce task executions. Another interesting feature of IGFS is its smart use of file-level caching and eviction. IGFS utilizes file-level caching to ensure corruption-free storage.
Note that IGFS is not an alternative along the lines of a RAM disk — it is a fully compliant in-memory file system, like HDFS. A high-level architecture of IGFS is shown below in Figure 1.

In this article, we are going to cover basic operations of IGFS, deploy IGFS in standalone mode, store files in it, and perform a few MapReduce tasks on top of it.
Note: We are not going to replace HDFS completely; otherwise, we would not be able to start the Hadoop DataNode anymore. We are going to use both IGFS and HDFS simultaneously.
From a bird’s-eye view, running MapReduce in IGFS on top of HDFS looks as follows:
  1. Configure the IGFS for the Ignite nodes.
  2. Put files into IGFS.
  3. Configure the Hadoop.
  4. Run MapReduce.
There are several ways to configure IGFS on the Ignite cluster. Unfortunately, Apache Ignite doesn’t provide any comprehensive GUI-based management tool or command-line interface for maintaining the Hadoop accelerator. However, GridGain Visor (part of the commercial version of Ignite) is a management tool that provides IGFS monitoring and file management between the HDFS, local, and IGFS file systems. To demonstrate how to use IGFS, we will perform the following steps:
  1. Configure the IGFS file system in the Ignite cluster (default-config.xml).
  2. Run a standalone Java application to ingest a file into IGFS. In our case, the file will be the t8.shakespeare.txt.
  3. Configure Hadoop.
  4. Run a MapReduce wordcount job to compute the count of the words from the IGFS file.
  5. Run a standalone Java application to check the result of the MapReduce job.
Now that we have dipped our toes into IGFS, let’s configure the standalone IGFS and run some MapReduce jobs on it.

Step 1

Add the following Spring configuration beans to the default-config.xml file of the Ignite node:
<!-- Base IGFS configuration (abstract bean, inherited by the concrete IGFS bean below). -->
<bean id="igfsCfgBase" class="org.apache.ignite.configuration.FileSystemConfiguration" abstract="true">
    <property name="blockSize" value="#{128 * 1024}"/>
    <property name="perNodeBatchSize" value="512"/>
    <property name="perNodeParallelBatchCount" value="16"/>
    <property name="prefetchBlocks" value="32"/>
</bean>

<!-- Base configuration for the IGFS data cache. -->
<bean id="dataCacheCfgBase" class="org.apache.ignite.configuration.CacheConfiguration" abstract="true">
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="atomicityMode" value="TRANSACTIONAL"/>
    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
    <property name="backups" value="0"/>
    <property name="affinityMapper">
        <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
            <constructor-arg value="512"/>
        </bean>
    </property>
</bean>

<!-- Base configuration for the IGFS metadata cache. -->
<bean id="metaCacheCfgBase" class="org.apache.ignite.configuration.CacheConfiguration" abstract="true">
    <property name="cacheMode" value="REPLICATED"/>
    <property name="atomicityMode" value="TRANSACTIONAL"/>
    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
</bean>

<!-- The named caches "igfs-meta" and "igfs-data" (extending the base beans above)
     should also be declared in the node's cacheConfiguration. -->
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                    <property name="addresses">
                        <!-- Discovery addresses; adjust for your cluster. -->
                        <list>
                            <value>127.0.0.1:47500..47509</value>
                        </list>
                    </property>
                </bean>
            </property>
        </bean>
    </property>
    <property name="fileSystemConfiguration">
        <list>
            <bean class="org.apache.ignite.configuration.FileSystemConfiguration" parent="igfsCfgBase">
                <property name="name" value="igfs"/>
                <property name="metaCacheName" value="igfs-meta"/>
                <property name="dataCacheName" value="igfs-data"/>
                <property name="blockSize" value="1024"/>
                <property name="streamBufferSize" value="1024"/>
                <property name="ipcEndpointConfiguration">
                    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
                        <property name="type" value="SHMEM"/>
                        <property name="host" value=""/>
                        <property name="port" value="10500"/>
                    </bean>
                </property>
            </bean>
        </list>
    </property>
</bean>

In the configuration above, we first define a base IGFS configuration bean called igfsCfgBase. Next, we configure a base cache configuration called dataCacheCfgBase, which will be the parent of the IGFS data cache. We have already discussed most of the properties of this configuration. Note that, for demonstration purposes, we have set the backups value to 0.
Our subsequent configuration is the base configuration for the metadata cache, called metaCacheCfgBase. It is probably the most unfamiliar part of this configuration: IGFS maintains metadata for all files ingested into the in-memory file system. This configuration is very similar to the previous base cache configuration.
Next, we configure the IGFS file system itself; this is the main part of the Ignite configuration. We set the name of the IGFS file system to igfs. The block size and the stream buffer size of the IGFS file system will be 1024. To let IGFS accept requests from Hadoop, an endpoint should be configured. Ignite offers two endpoint types:
  1. shmem: Working over shared memory (not available on Windows).
  2. tcp: Working over standard socket API.
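For Windows hosts (where shared memory is unavailable), the endpoint in the configuration above could be switched to the TCP type; a minimal sketch, keeping the same port:

```xml
<property name="ipcEndpointConfiguration">
    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
        <property name="type" value="TCP"/>
        <property name="host" value="127.0.0.1"/>
        <property name="port" value="10500"/>
    </bean>
</property>
```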

Step 2

When each Ignite node is configured (default-config.xml), start every node with the following commands:
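The launch commands were not reproduced here; assuming the standard launcher script shipped in the Ignite binary distribution, starting a node would look like this:

```shell
# Start one Ignite node per host, each pointing at the shared default-config.xml.
cd $IGNITE_HOME
bin/ignite.sh config/default-config.xml
```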

Step 3

In this step, we are going to ingest our t8.shakespeare.txt file into the IGFS file system. As described before, we will use a Java application to ingest the file into IGFS. The application is very simple: it ingests the t8.shakespeare.txt file every time it is launched, taking the name of the directory and the file name as input parameters for putting the file into IGFS. Open the pom.xml file and add the following code to the dependency section.
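The dependency snippet itself was not reproduced; a minimal sketch would pull in the Ignite core artifact plus Guava (used below for ByteStreams). The version numbers are assumptions — match them to your cluster:

```xml
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>
```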
Now, add a new Java class with the name IngestFileInIGFS. The full listing of the Java class is shown below:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteFileSystem;
import org.apache.ignite.Ignition;
import org.apache.ignite.igfs.IgfsPath;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.io.ByteStreams;

public class IngestFileInIGFS {
 private final static Logger LOGGER = LoggerFactory.getLogger(IngestFileInIGFS.class);
 private final static String IGFS_FS_NAME = "igfs";

 public static void main(String... args) {
  if (args.length < 2) {
   LOGGER.error("Usage [java -jar chapter-bigdata-1.0-SNAPSHOT.jar DIRECTORY_NAME FILE_NAME," +
     " for example java -jar chapter-bigdata-1.0-SNAPSHOT.jar myDir myFile]");
   System.exit(1);
  }
  Ignite ignite = Ignition.start("default-config.xml");
  // Log the name of every IGFS instance configured on this node.
  Collection<IgniteFileSystem> fs = ignite.fileSystems();
  for (IgniteFileSystem igniteFileSystem : fs) {
   LOGGER.info("IGFS File System name:" + igniteFileSystem.name());
  }
  IgniteFileSystem igfs = ignite.fileSystem(IGFS_FS_NAME);
  // Create directory.
  IgfsPath dir = new IgfsPath("/" + args[0]);
  igfs.mkdirs(dir);
  // Create file and write some data to it.
  IgfsPath file = new IgfsPath(dir, args[1]);
  // Read the Shakespeare file from the application classpath.
  InputStream inputStream = IngestFileInIGFS.class.getClassLoader()
    .getResourceAsStream("t8.shakespeare.txt");
  try {
   byte[] filesToByte = ByteStreams.toByteArray(inputStream);
   OutputStream out = igfs.create(file, true);
   out.write(filesToByte);
   out.close();
  } catch (IOException e) {
   LOGGER.error("Failed to write the file into IGFS.", e);
  } finally {
   try {
    inputStream.close();
   } catch (IOException e) {
    LOGGER.error("Failed to close the input stream.", e);
   }
  }
  LOGGER.info("Created file path:" + file.toString());
 }
}
To compile and run the application, execute the following commands:

mvn clean install
java -jar ./target/IngestFileInIGFS.jar myDir myFile

After successfully compiling the Maven project, there will be executable JAR files in the target folder. The IngestFileInIGFS.jar file is for ingesting the file into IGFS.

Step 4

It’s time to configure Hadoop (the IGFS file system must be configured in Hadoop).
Let’s create a new directory under $HADOOP_HOME/etc and copy all the files from the hadoop directory into it. Execute the following commands from the $HADOOP_HOME/etc directory.
$ mkdir hadoop-ignite
$ cp ./hadoop/*.* ./hadoop-ignite
Remove all the properties from $HADOOP_HOME/etc/hadoop-ignite/core-site.xml and add the following properties:
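The properties themselves are not reproduced above; a minimal sketch, following the Ignite Hadoop accelerator documentation (the igfs://igfs@ authority and port 10500 match the IGFS name and IPC endpoint configured in Step 1):

```xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>igfs://igfs@localhost:10500/</value>
    </property>
    <property>
        <name>fs.igfs.impl</name>
        <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.igfs.impl</name>
        <value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
    </property>
</configuration>
```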
The fully qualified file system class name org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem is sufficient for configuring IGFS for Hadoop.
Note: v1 or v2 doesn’t stand for Hadoop 1.x and Hadoop 2.x. Instead, this is about either old FileSystem API or new AbstractFileSystem API.
At this moment, the Hadoop configuration is complete, and we are ready to execute MapReduce jobs.

Step 5

There are several ways to execute MapReduce jobs with Hadoop configuration. One of the easiest ways is to pass the Hadoop config directory as an input parameter to the job as follows:
hadoop --config [path_to_config] [arguments]

Let’s run our wordcount MapReduce job with the file from the IGFS with the following command:

time hadoop --config /home/user/hadoop/hadoop-2.7.2/etc/hadoop-ignite jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /myDir/myFile /myDir/out

After running the above statement, you should get output in your terminal similar to that shown below.

Note that you have to change the name of the output directory every time you run the MapReduce job.
In such a simple way, you can replace Hadoop HDFS with IGFS.


An impatient start with Apache Ignite machine learning grid

Apache Ignite 2.0 recently introduced a beta version of the in-memory machine learning grid, a distributed machine learning library built on top of the Apache Ignite IMDG. This beta release of the ML library can perform local and distributed vector and matrix algebra operations and decompositions. The data structures can be stored in the Java heap, off-heap, or in distributed Ignite caches. At this moment, the Apache Ignite ML grid doesn't support any prediction or recommendation analysis. In this short post, we are going to download the new Apache Ignite 2.0 release, build the examples, and run them.

1. Download and unpack the Apache Ignite 2.0 distribution.

Download the Apache Ignite 2.0 binary release from the following link. Unpack the distribution somewhere on your workstation (e.g., /home/ignite/2.0) and set the IGNITE_HOME path to the directory.

2. Start the Apache Ignite remote node

Run the following command in a terminal window:

bin/ignite.sh examples/config/example-ignite.xml

Note that remote nodes for the examples should always be started with the special configuration file that enables P2P class loading: `examples/config/example-ignite.xml`.

Also, note that Apache Ignite version 2.0 needs Java version 1.8 or higher.

3. Build the machine learning examples

Go to the /examples folder of the Apache Ignite distribution. If you have already installed and configured Maven, run the following command from the examples folder.

mvn clean install -Pml

The above command will activate the machine learning (ml) profile and build the project.

4. Run it

Let's run the simple local on-heap version of the Vector example. Execute the following command in your terminal window:

mvn exec:java

You should get the following logs in your console.

All the examples are autonomous and don't need any special configuration. Examples whose names contain 'Cache', such as CacheMatrixExample or CacheVectorExample, need a remote Ignite node with P2P class loading. Let's run the CacheMatrixExample with the following command:
mvn exec:java

You should get the following output as shown below.

Additionally, the Apache Ignite ML grid provides a simple utility class that allows pretty-printing of matrices/vectors. You can run the TracerExample as follows:
mvn exec:java

The above command will launch a web browser and provide some HTML output as follows:

That's enough for now. If you like this post, you should also like the book.


Unboxing of the first copy of the book High performance in-memory computing with Apache Ignite

Yesterday, I got the first paperback copy of the book High-Performance In-Memory Computing With Apache Ignite.

The book is available at the Lulu and Amazon bookstores.

Product details

  • Paperback: 360 pages
  • Language: English
  • ISBN-10: 1365732355
  • ISBN-13: 978-1365732355
  • Product Dimensions: 8.3 x 0.8 x 11 inches
  • Shipping Weight: 1.8 pounds

Happy reading!!

Support independent publishing: Buy this book on Lulu.

In-Memory MapReduce and Your Hadoop Ecosystem: Part I

Portions of this article were taken from the book High-Performance In-Memory Computing With Apache Ignite. If it got you interested, check out the rest of the book for more helpful information.
Hadoop has quickly become the standard for business intelligence on huge data sets. However, its batch scheduling overhead and disk-based data storage have made it unsuitable for analyzing live, real-time data in a production environment. One of the main factors that limits the performance scaling of Hadoop and MapReduce is the fact that Hadoop relies on a file system that generates a lot of input/output (I/O) files. I/O adds latency that delays the MapReduce computation. An alternative is to store the needed distributed data within memory. Placing MapReduce in memory with the data it needs eliminates file I/O latency.
Figure 1.
The generic phases of a Hadoop job are shown in the figure above. The sort, merge, and shuffle phases are highly I/O-intensive. These overheads are prohibitive when running real-time analytics that must return results in milliseconds or seconds.
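To make the phases concrete, here is a minimal, single-process sketch of the map → shuffle → reduce flow of a word count (plain Java, no Hadoop involved; in a real job, each phase runs distributed and the shuffle is where the heavy disk and network I/O happens):

```java
import java.util.*;
import java.util.stream.*;

public class WordCountPhases {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word in the input.
        List<Map.Entry<String, Integer>> mapped = lines.stream()
            .flatMap(l -> Arrays.stream(l.toLowerCase().split("\\s+")))
            .filter(w -> !w.isEmpty())
            .map(w -> Map.entry(w, 1))
            .collect(Collectors.toList());

        // Shuffle/sort phase: group all values by key (the I/O-heavy part in Hadoop).
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((k, v) -> result.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

The in-memory approach described in this article keeps the intermediate (word, 1) pairs and the grouped map in memory across the cluster instead of spilling them to disk between phases.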
The Ignite in-memory MapReduce engine executes MapReduce programs in seconds (or less) by incorporating several techniques. By avoiding Hadoop’s batch scheduling, it can start up jobs in milliseconds instead of tens of seconds. In-memory data storage dramatically reduces access times by eliminating data motion from disk or across the network. This is the Ignite approach to accelerating Hadoop application performance without changing the code. The main advantage is that all the operations are highly transparent; all of this is accomplished without changing a line of MapReduce code.
The Ignite Hadoop in-memory plug-and-play accelerator can be grouped into three different categories.

1. In-Memory MapReduce

It’s an alternative implementation of the Hadoop job tracker and task tracker, which can accelerate job execution performance. It eliminates the overhead associated with the job tracker and task trackers in a standard Hadoop architecture while providing low-latency, HPC-style distributed processing.

2. Ignite In-Memory File System (IGFS)

It’s also an alternative implementation of the Hadoop file system, named IgniteHadoopFileSystem, which can store data sets in memory. This in-memory file system minimizes disk I/O and improves performance.

3. Hadoop File System Cache

This implementation works as a caching layer above HDFS. Every read and write operation goes through this layer and can improve MapReduce performance.
Conceptual architecture of the Ignite Hadoop accelerator is shown in Figure 2:
Figure 2.
The Apache Ignite Hadoop accelerator tool is especially useful when you already have an existing Hadoop cluster up and running and want to get high performance with minimal effort. Note that the idea that Hadoop runs well on commodity hardware is a myth: most Hadoop processing is I/O-intensive and requires homogeneous, mid-range servers to perform well.
In this blog post, we are going to explore the details of the Apache Ignite in-memory MapReduce.
The Ignite in-memory MapReduce engine is 100% compatible with Hadoop HDFS and YARN. It reduces the startup and execution time of the Hadoop job tracker and task tracker. Ignite in-memory MapReduce provides dramatic performance boosts for CPU-intensive tasks while requiring only minimal changes to existing applications. This module also provides an implementation of a weight-based MapReduce planner, which assigns mappers and reducers based on their weights. A weight describes how many resources are required to execute a particular map or reduce task. The planning algorithm assigns mappers so that the total resulting weight on all nodes is as small as possible.
High-level architecture of the Ignite in-memory MapReduce is shown below:
Figure 3.
The Ignite in-memory grid has a pre-staged, Java-based execution environment on all grid nodes and reuses it for multiple data processing runs. This execution environment consists of a set of Java virtual machines, one on each server within the cluster. These JVMs form the Ignite MapReduce engine, as shown in the figure above. Also, the Ignite in-memory data grid can automatically deploy all the executable programs or libraries necessary for executing MapReduce across the grid, greatly reducing the startup time, down to milliseconds.
Now that we have got the basics, let’s try to configure the sandbox and execute a few MapReduce jobs in Ignite MapReduce engine.
For simplicity, we are going to install a Hadoop pseudo-distributed cluster in a single virtual machine and run the famous Hadoop word count example as a MapReduce job. A Hadoop pseudo-distributed cluster means that Hadoop data nodes, name nodes, task trackers, and job trackers — everything — will be on one virtual (host) machine.
Let’s have a look at our sandbox configuration as shown below.
  1. OS: Red Hat Enterprise Linux.
  2. CPU: 2.
  3. RAM: 2 GB.
  4. JVM: 1.7_60.
  5. Ignite version: 1.6 or above, single-node cluster.
First, we are going to install and configure Hadoop and will proceed to Apache Ignite. Assume that Java has been installed and that JAVA_HOME is in the environment variables.

1. Unpack the Hadoop Distribution Archive

Unpack the Hadoop distribution archive and set the JAVA_HOME path in etc/hadoop/hadoop-env.sh.

2. Add Configurations

Add the following configuration in the etc/hadoop/core-site.xml file:
Also, append the following data replication strategy into the etc/hadoop/hdfs-site.xml file:
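For reference, the standard pseudo-distributed settings from the Hadoop single-node setup guide look like this (host name and port are the usual defaults; adjust as needed):

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```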

3. Set Up Password-Less SSH

Set up password-less or passphrase-less SSH for your operating system:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Try the following command on your console:
$ ssh localhost
It shouldn’t ask you for a password.

4. Set Up the Hadoop HDFS File System

Format the Hadoop HDFS file system:
$ bin/hdfs namenode -format
Next, start the namenode/datanode daemon by the following command:
$ sbin/start-dfs.sh
Also, I would suggest adding the HADOOP_HOME environment variable to the operating system.

5. Set Up Directories

Make a few directories in the HDFS file system to run MapReduce jobs.
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /input
The above commands will create two folders, user and input, in the HDFS file system. Insert some text files into the input directory: bin/hdfs dfs -put $HADOOP_HOME/etc/hadoop /input .

6. Run a Native MapReduce Job

Run the Hadoop native MapReduce application to count the words of the file.
$ bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/hadoop output
You can view the result of the word count with the following command:
bin/hdfs dfs -cat output/*
In my case, the file is huge. Let’s see the fragment of the file:
want 1
warnings. 1
when 9
where 4
which 7
while 1
who 6
will 23
window 1
window, 1
with 62
within 4
without 1
work 12
writing, 27
In this stage, our Hadoop pseudo cluster is configured and ready to use. Now, let’s start configuring Apache Ignite.

7. Unpack Apache Ignite Distribution

Unpack the Apache Ignite distribution somewhere in your sandbox and set the IGNITE_HOME path to the root directory of the installation. To get statistics about tasks and executions, you have to add the following properties to your /config/default-config.xml file:
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="includeEventTypes">
        <!-- Task events assumed here; any EventType constants can be listed. -->
        <list>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_STARTED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FINISHED"/>
            <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED"/>
        </list>
    </property>
</bean>
The above configuration enables task events for statistics.
Note that, by default, all events are disabled. Whenever the above events are enabled, you can use the command “tasks” in Ignite Visor to get statistics about task executions.

8. Add Libraries

Add the following libraries to the $IGNITE_HOME/libs directory.
Note that the asm-all-4.2.jar library version depends on your Hadoop version.

9. Start the Ignite Node

We are going to use the Apache Ignite default configuration config/default-config.xml file to start the Ignite node. Start the Ignite node with the following command:

10. Finish Setting Up Ignite Job Tracker

We need to add a few more things to use the Ignite job tracker instead of Hadoop's. Add HADOOP_CLASSPATH to the environment variables as follows.
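The exact variable value is not shown; a hedged sketch of what it typically contains (jar names, paths, and versions are assumptions — adjust them to your Ignite release):

```shell
# Put the Ignite accelerator jars on Hadoop's classpath; versions are assumptions.
export IGNITE_HOME=/home/user/apache-ignite
export HADOOP_CLASSPATH=$IGNITE_HOME/libs/ignite-core-1.6.0.jar:\
$IGNITE_HOME/libs/ignite-hadoop/ignite-hadoop-1.6.0.jar:\
$IGNITE_HOME/libs/ignite-shmem-1.0.0.jar
```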

11. Override

At this stage, we are going to override the Hadoop mapred-site.xml file. For a quick start, add the following fragment of XML to mapred-site.xml:
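A minimal sketch, based on the Ignite Hadoop accelerator documentation (localhost:11211 assumes an Ignite node running locally on its default job tracker port):

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>ignite</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>localhost:11211</value>
    </property>
</configuration>
```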

12. Run It

Run the above word count MapReduce example again.
$ bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/hadoop output2
The output should be very similar, as shown in Figure 4.
Figure 4.
Now the execution time is faster than last time, when we used the Hadoop task tracker. Let’s examine the Ignite task execution statistics through Ignite Visor.
Figure 5.
From the above figure, note the total executions and durations of the in-memory task tracker. In our case, the total number of executions of the HadoopProtocolJobStatusTask(@t1) task is 24, and the execution time is 12 seconds.
In the next part, we will execute a few benchmarks for demonstrating the performance advantage of the Ignite MapReduce engine.