Friday

An impatient start with Mahout - Machine learning

One of my friends asked me to develop a course for students on the subject of "Machine Learning". The course should be very simple, to familiarize students with machine learning. Its main purpose is to open the machine learning world up to the students and let them play with these topics with their own hands. Machine learning is a very mature field and a lot of tools and frameworks are available to get your hands wet, however most of the articles or tutorials you can find on the internet start by installing a cluster or make you write a bunch of code (even the Mahout site uses Maven) before you can start learning. Moreover, not all students are familiar with Hadoop, or they may not have a notebook powerful enough to install and run all the components needed to get a taste of machine learning. For these reasons I settled on the following approach:
  1. Standalone Hadoop
  2. Standalone Mahout 
  3. And a few CSV data files to learn how to work with predictions

Assume you already have Java installed on your workstation; if not, please refer to the Oracle site to download and install it. First we will install standalone Hadoop and check the installation; after that we will install Mahout and run a few examples to understand what is going on under the hood.
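You can quickly verify the Java installation from the console (any reasonably recent JDK works for Hadoop 2.6.0; the exact version string will of course differ on your machine):
$ java -version
$ echo $JAVA_HOME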
  • Install Hadoop and run a simple MapReduce job as a test
Hadoop version: 2.6.0 
Download hadoop-2.6.0.tar.gz from the Apache Hadoop download site. At the moment of writing this blog, version 2.6.0 is the stable release.
Unarchive the tar.gz file somewhere on your local disk and add the following entries to your .bash_profile:
export HADOOP_HOME=/PATH_TO_HADOOP_HOME_DIRECTORY/hadoop-2.6.0
export PATH=$HADOOP_HOME/bin:$PATH
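Reload the profile and run a quick sanity check; the version command should report 2.6.0:
$ source ~/.bash_profile
$ hadoop version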
Now let's check the installation by running a job. Create a directory named inputwords in the current folder and copy all the XML files from Hadoop's etc folder into it as follows:
$ mkdir inputwords
$ cp $HADOOP_HOME/etc/hadoop/*.xml inputwords/
Now we can run the standalone Hadoop MapReduce example to count all the words found in the XML files:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount inputwords outputwords
You should see a bunch of logs in your console, and if everything went fine you will find a few files in the outputwords folder (this folder is created at runtime). Run the following command in your console:
$ cat ./outputwords/part-r-00000
It should show a lot of words followed by their counts, as follows:
use 9
used 22
used. 1
user 40
user. 2
user? 1
users 21
users,wheel". 18
uses 2
using 3
value 19
values 1
version 1
version="1.0" 5
via 1
when 4
where 1
which 5
while 1
who 2
will 7
If you are curious to count more words, download William Shakespeare's famous play "The Tragedy of Romeo and Juliet" from here and run the Hadoop wordcount on it.
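For example, assuming you saved the downloaded text as romeo.txt (the file name here is just an illustration), the run looks exactly like the previous one:
$ mkdir inputromeo
$ cp romeo.txt inputromeo/
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount inputromeo outputromeo
$ cat ./outputromeo/part-r-00000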
  • Download and install Apache Mahout
Let's download the Apache Mahout distribution from here and unarchive the file somewhere on your local machine. The Mahout distribution contains all the libraries and examples for running machine learning on top of Hadoop.
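As with Hadoop, it is convenient to export the Mahout location in your .bash_profile; the folder name below assumes version 0.10.0 (the one used later in this post), so adjust it to whatever you downloaded:
export MAHOUT_HOME=/PATH_TO_MAHOUT_HOME_DIRECTORY/apache-mahout-distribution-0.10.0
export PATH=$MAHOUT_HOME/bin:$PATH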
Now we need some data for learning purposes. We can use GroupLens data; certainly you can generate data yourself, but I highly recommend the data from GroupLens. The GroupLens organisation collects social data for research, and you can use this data for your own purposes. There are a few datasets available on the GroupLens site, such as MovieLens, BookCrossing etc. For my course we are going to use the MovieLens dataset, because it is well formatted and grouped. Let's download the MovieLens dataset and unarchive the file somewhere on your filesystem.
First I would like to examine the dataset to get a closer look at the data, which will give us a good understanding of how to use it well.
After unarchiving ml-data.tar.gz, you should find the following files in the folder:
u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
          user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.
For getting recommendations we will use u.data. Now it's time to study a little theory about recommendation: what is it, and what is it for?

  • Recommendation:
Mahout contains a recommender engine, several types of them in fact, beginning with conventional user-based and item-based recommenders. It includes implementations of several other algorithms as well, but for now we'll explore a simple user-based recommender. For detailed information about the recommender engine please see here.
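If you want to get a feeling for what a user-based recommender does before touching Hadoop at all, here is a minimal sketch using Mahout's in-memory Taste API (this is not the Hadoop job we run below; the similarity metric and neighborhood size are arbitrary choices for illustration, and you need the Mahout jars from the distribution on your classpath):
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleUserBasedRecommender {
    public static void main(String[] args) throws Exception {
        // u.data is a tab separated list of: user id | item id | rating | timestamp
        DataModel model = new FileDataModel(new File("PATH_TO/ml-100k/u.data"));
        // How similar two users are, judged by their rating behaviour
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Use the 10 most similar users as the neighborhood
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 5 recommendations for user 3
        List<RecommendedItem> recommendations = recommender.recommend(3, 5);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
The Hadoop-based RecommenderJob used below does conceptually the same thing, but as a series of MapReduce jobs, so it scales to data that does not fit in memory.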

  • Examine the DataSet:
Let's take a closer look at the file u.data:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944
234 1184 2 892079237
119 392 4 886176814
167 486 4 892738452
299 144 4 877881320
291 118 2 874833878
308 1 4 887736532
95 546 2 879196566
38 95 5 892430094
102 768 2 883748450
63 277 4 875747401
For example, the user with id 196 rated film 242 with a preference of 3. I have imported the u.data file into Excel and sorted it by user id, as follows:

User 1 rated films 1 and 2, and user 3 also rated film 2. If we want to find a recommendation for user 3 based on user 1, it should be film 1 (Toy Story).
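If you don't have Excel at hand, the same check can be done from the console (a quick one-liner; -F'\t' tells awk that the file is tab separated):
$ awk -F'\t' '$1 == 1 || $1 == 3' PATH_TO/ml-100k/u.data | sort -n -k1,1 -k2,2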
Let's run the recommendation engine of Mahout and examine the result:
$ hadoop jar $MAHOUT_HOME/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input PATH_TO/ml-100k/u.data --output outputm
You should find the result in the file outputm/part-r-00000. If you look carefully, you will find the recommendations for user 3 as follows:
3 [137:5.0,248:4.8714285,14:4.8153844,285:4.8153844,845:4.754717,124:4.7089553,319:4.7035174,508:4.7006173,150:4.68,311:4.6615386]
which differs from what we guessed earlier, because the recommendation engine also uses preferences (ratings) from other users.
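Instead of scanning the whole output file by eye, you can pull out the line for user 3 directly (outputm is the output folder from the command above):
$ awk -F'\t' '$1 == 3' outputm/part-r-00000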
Let's write a small piece of Java code to see exactly which films were recommended by Mahout.
package com.blu.mahout;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import static java.util.stream.Collectors.toList;

/**
 * Created by shamim on 17/04/15.
 * Source files: u.data, u.item and the Mahout generated recommendation file (part-r-00000)
 */
public class PrintRecomendation {
    private static final int INPUT_LENGTH = 6;

    private static List<String> userRecommendedFilms = new ArrayList<>();

    public static void main(String[] inputs) throws IOException{
        System.out.println("Print the recommendation and recommended films for given user:");

        if(inputs.length < INPUT_LENGTH){
            System.out.println("USAGES: PrintRecomendation USERID UDATA_FILE_NAME UITEM_FILE_NAME MAHOUT_REC_FILE UDATA_FOLDER UMDATA_FOLDER" +
                    "" + " Example: java -jar mahout-ml-1.0-SNAPSHOT.one-jar.jar 3 u.data u.item part-r-00000 /Users/shamim/Development/workshop/bigdata/hadoop/inputm/ml-100k/ /Users/shamim/Development/workshop/bigdata/hadoop/outputm/"
                    );
            System.exit(0);
        }
        String USER_ID = inputs[0];
        String UDATA_FILE_NAME = inputs[1];
        String UITEM_FILE_NAME = inputs[2];
        String MAHOUT_REC_FILE = inputs[3];
        String UDATA_FOLDER = inputs[4];
        String UMDATA_FOLDER = inputs[5];

        // Read UDATA File
        Path pathToUfile = Paths.get(UDATA_FOLDER,UDATA_FILE_NAME);
        List<String> filteredLines = Files.lines(pathToUfile).filter(s -> s.contains(USER_ID)).collect(toList());
        for(String line : filteredLines){
            String[] words =  line.split("\\t");
            if(words != null){
                String userId = words[0];
                if(userId.equalsIgnoreCase(USER_ID)){
                    userRecommendedFilms.add(line);
                }
            }
        }

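        // u.item is not clean UTF-8 everywhere (accented movie titles), so skip undecodable bytes instead of failing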
        CharsetDecoder dec= StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE);
        Path pathToUItem=Paths.get(UDATA_FOLDER, UITEM_FILE_NAME);

        List<String> nonFiltered;

        try(Reader r= Channels.newReader(FileChannel.open(pathToUItem), dec, -1);
            BufferedReader br=new BufferedReader(r)) {
            nonFiltered=br.lines().collect(Collectors.toList());
        }

        Path pathToMahoutFile = Paths.get(UMDATA_FOLDER, MAHOUT_REC_FILE);

        List<String> filteredMLines = Files.lines(pathToMahoutFile).filter(s -> s.contains(USER_ID)).collect(toList());
        String recommendedFilms= "";
        for(String line : filteredMLines){
            String[] splited = line.split("\\t");
            if(splited[0].equalsIgnoreCase(USER_ID)){
                recommendedFilms = splited[1];
                break;
            }
        }

        // Strip the surrounding [ ] brackets from the Mahout output before splitting into id:rating pairs
        String[] filmsId = recommendedFilms.replace("[", "").replace("]", "").split(",");

        for(String filmId : filmsId){
            String[] idWithRating = filmId.split(":");
            String id = idWithRating[0];
            String rating = idWithRating[1];
            for(String filmLine : nonFiltered ){
                String[] items = filmLine.split("\\|");
                if(id.equalsIgnoreCase(items[0])){
                    System.out.println("Film name:" + items[1]);
                }

            }
        }

    }
}
You should find output similar to the following in your console:
Film name:Grosse Pointe Blank (1997)
Film name:Postino, Il (1994)
Film name:Secrets & Lies (1996)
Film name:That Thing You Do! (1996)
Film name:Lone Star (1996)
Film name:Everyone Says I Love You (1996)
Film name:People vs. Larry Flynt, The (1996)
Film name:Swingers (1996)
Film name:Wings of the Dove, The (1997)

Sunday

Configuring stuck connections in the IBM WAS 8.5.5 connection pool

Recently we started getting a few complaints from our client related to database connections from IBM WAS. The first action we took was to look at the logs we got from the client, and we discovered the following errors in the application logs:
  • Error 404: Database connection problem: IO Error: Got minus one from a read call DSRA0010E: SQL State = 08006, Error Code = 17,002
  • java.sql.SQLException: The back-end resource is currently unavailable. Stuck connections have been detected.
With a quick search on Google I found PMR 34250 004 000 on the IBM support site, which also affects IBM WAS 8.* versions. Since we are using a third-party web portal engine (Backbase), it was tricky to figure out the problem, so we decompiled some code to make sure that all the data source connections were being closed properly. After some research I asked the production support team for database statistics and data source configurations, and I was surprised to see from the database statistics that all connections on the database were in use and the IBM application server could not get any new connections to complete requests.
On the Oracle database, the maximum number of connections was set to 6000, and we have more than 32 application servers, each with a maximum connection pool size of 200. It was a serious mistake; the formula for configuring the connection pools of an IBM cluster is as follows:
Maximum number of connections per node * number of nodes < maximum connections set on the database
In our case that would require 200 * 32 < 6000, but 200 * 32 = 6400, so the rule was violated and the database simply ran out of connections.
We sent a request to increase the maximum number of database connections in Oracle to 10,000. But what to do with the stuck connections? I checked the IBM WAS advanced connection pool properties and noticed that the stuck connection properties were not configured at all.
Let's check what a stuck connection is.
A stuck connection is an active connection that is not responding or not returning to the connection pool. Stuck connections are controlled by three properties: Stuck time, Stuck threshold and Stuck timer interval.
Stuck time
  • The time a single active connection can be in use to the backend resource before it is considered stuck.
  • For example, if the stuck time is 120 seconds and a connection has been waiting on the database for more than 120 seconds, that connection is marked as stuck.
Stuck threshold
  • The number of connections that need to be considered stuck for the pool to go into stuck mode.
  • For example, if the threshold is 10, then once 10 connections are considered stuck the whole pool for that datasource is considered stuck.
Stuck timer interval
  • The interval at which the connection pool checks for stuck connections.
With the above information I have configured the following stuck connection properties:
Stuck timer interval : 120 secs
Stuck time : 240 secs
Stuck threshold : 100 connections (maximum connections 200)
With this configuration, when will the connection pool be declared stuck? It happens once at least 100 connections have each been in use to the database for more than 240 seconds; the pool checks for such connections every 120 seconds.
What happens when the pool is declared stuck?
  • A resource exception is given to all new connection requests until the pool is unstuck.
  • An application can explicitly catch this exception and continue processing (see the sketch below).
  • If the number of stuck connections drops below the stuck threshold, the pool will detect this during its periodic checks and begin servicing requests again.
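In application code this can look roughly like the following (a generic sketch against the standard javax.sql API, not anything WAS specific; the DAO name and the fallback behaviour are just illustrations):
import java.sql.Connection;
import java.sql.SQLException;

import javax.sql.DataSource;

public class OrderDao {
    private final DataSource dataSource; // e.g. injected or looked up from JNDI

    public OrderDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void saveOrder(String order) {
        // When the pool is in stuck mode, getConnection() fails with an exception
        // wrapped in an SQLException instead of hanging forever.
        try (Connection connection = dataSource.getConnection()) {
            // ... the actual JDBC work goes here ...
        } catch (SQLException e) {
            // Degrade gracefully: log it, answer "try again later", queue the work, etc.
            System.err.println("Database temporarily unavailable: " + e.getMessage());
        }
    }
}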
It is also very useful to periodically check for inactive connections in the Oracle database; if some connection is hanging and inactive, you can drop it manually.
Here is a pseudo-query to find inactive connections in the DB:

SELECT
  s.username,
  s.status,
  s.sid || ',' || s.serial# AS p_sid_serial
FROM v$session s, v$sort_usage t, dba_tablespaces tbs
WHERE s.saddr = t.session_addr
  AND (s.last_call_et / 60) > 1440
  AND t.tablespace = tbs.tablespace_name
  AND t.tablespace = 'TEMP';
I hope the above information will help somebody make a quick fix in IBM WAS.

Monday

Open source alternatives for low budget projects

In your software development career, sooner or later you will get a few projects with a very low budget, where you can't use commercial software because the budget is low and in the long run the company also wants to make some profit from completing the project. Since the beginning of the year I have done a few pre-sales for such projects and decided to write down a list of open source alternatives to commercial products. One thing I have to make clear: I have no religious view on open source software or vice versa. There are plenty of reasons to use most of the commercial products, but most of the time we have to cut our coat according to our cloth.
1) BPM:
Major vendors like Oracle and IBM have their finest products in BPM, such as Oracle BPM Server and IBM Business Process Manager. You can also find a few very good open source products such as jBPM from JBoss and Bonitasoft. But there is another very good open source BPM engine you can try: Activiti. It is a Spring-based, lightweight (state machine) engine that you can use standalone or in a web application. We developed our mobile number portability project on top of this BPM engine and it has been working for more than a year. Here you can find a few screenshots showing how it could look.
2) ESB:
I am a real fan of Oracle OSB; I have completed more than two projects successfully on this platform. It has all the functionality you need for an enterprise service bus or integration with other systems. If you are looking for an open source ESB, the first options should be GlassFish ESB and Mule. WSO2 is also a very good candidate to choose.
3) Business Rules:
The best business rules software I have ever used is IBM ILOG JRules. It is consistent, reliable and has good synchronization capabilities with the user's code base. There are not so many open source alternatives in this sector; Drools from JBoss is one of the good candidates. Certainly Drools has its share of bugs, but you should give it a try.
4) Messaging server: IBM MQ Series is one of the best messaging servers ever, and most of the banking sector and telecommunications companies use this product with great success. It is unbreakable software with all the messaging functionality you need, such as transmission queues. Most companies already have unlimited licenses for this software; however, if you need an open source alternative, you have a very wide choice. ActiveMQ, Apollo or RabbitMQ can be your choice. Here you can find the complete list of MQ servers. When I am looking for an open source MQ server, I have the following requirements for the product:
- It should be fast
- It has to persist messages to disk
- It should work in a cluster with failover capabilities
5) LDAP: Of course Microsoft Active Directory is one of the best candidates for LDAP; Microsoft AD has its own LDAP implementation. If you have to use NTLM or Kerberos, the first choice should be Microsoft AD. As open source, OpenLDAP is the best option.
6) Application server:
Most of the time I work with Oracle WebLogic Server and IBM WAS. To be honest, they are the best application servers. In this sector you will find a lot of open source candidates. But whenever I have to choose an open source version, I always prefer the GlassFish application server, because it is reliable. Of course you have many more options like JBoss, Tomcat, Jetty etc.
7) Database: If you have unlimited licenses for Oracle, never think about any other vendor or product for the DB. Oracle DB has been reliable and consistent for a long time. Whenever you have to choose an open source RDBMS, you should try PostgreSQL and MySQL.
8) In-memory data grid: Oracle Coherence is the best commercial software for implementing an in-memory data grid. The company Hazelcast also has an open source version of an in-memory data grid. With Hazelcast you can easily use distributed queues, maps, lists and much more. From the beginning of Hazelcast, I have used it for the Hibernate L2 cache.
9) Distributed cache: If you are not a fan of Infinispan for some reason, you should give Ehcache or JBoss Cache a try. With a JGroups configuration, Ehcache can be used as a distributed cache. Ehcache's heap storage can be configured in many ways and it supports a number of eviction algorithms such as LRU.
10) Web server and load balancer: Here we also have a few options against the commercial Alteon load balancer. For security reasons most banks and telecommunications companies use Alteon. If this is not a concern for you, you should try nginx and Varnish. Nginx can be used not only as a web server but also as a load balancer.

All the information above is my personal opinion, based on my personal experience. It may differ from the views of many of you. If there are more options, please don't hesitate to add them in the comments.

Tuesday

Continuous Integration (CI): a review

A few years ago (2011), at the JavaOne conference in Moscow, I gave a presentation about CI. Since then a lot has changed in this field. Over the years many tools, plugins and frameworks have been released to help DevOps teams solve problems with CI. Now CI is one of the vital parts of the development life cycle. With the aggressive use of cloud infrastructure and horizontal scaling of every application, most applications are now deployed on many servers (virtual and dedicated). Moreover, most systems are heterogeneous and always need extra care (scripts) to deploy the entire system successfully. Most of the time the development environment is very different from the production environment. Here is the common workflow from the development stage to production:
DEV environment -> Test Environment -> UAT environment -> Production environment.
Every environment has its own characteristics and configuration. For example, most developers use Jetty or an embedded Tomcat application server for fast development, but in the production environment you often meet IBM WAS or WebLogic application servers. The deployment process for Jetty or IBM WAS is very different; production environments also frequently use DR (disaster recovery). The workflow of the deployment process in the production environment is as follows:
1) Stop part of the application servers
2) Replicate sessions from the stopped servers
3) Update the database with incremental scripts
4) Deploy the new artifacts to the application servers
5) Update the configuration files
6) Start the application servers

There are a lot of open source tools to achieve the above workflow, such as:
1) Puppet
2) Chef
3) Ansible etc.

Ansible is one of the easiest and simplest tools for installing, deploying and preparing environments. We have the following DevOps tools in our portfolio:
1) Jenkins
2) Flyway DB
3) Ansible

A few words about Flyway DB: it is a database migration tool for doing incremental updates of database objects. It supports native ANSI SQL scripts for any DB. For me it is very convenient for debugging or reviewing any SQL scripts.
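A typical Flyway setup is nothing more than a folder of versioned SQL scripts named V1__..., V2__... and a single command that applies whatever has not been applied yet (the connection details below are made-up placeholders):
$ ls sql/
V1__create_tables.sql  V2__add_orders_index.sql
$ flyway -url=jdbc:oracle:thin:@dbhost:1521:ORCL -user=app -password=secret -locations=filesystem:sql migrate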
Ansible is a simple IT automation platform that deploys through SSH. It is very easy to install and configure, working over SSH with no agent installed on the remote system. Ansible has a very big community and a lot of plugins have already been developed for automation. With these three tools we have the following approach:

Jenkins for building the project
Flyway for database migration
Ansible for deploying the application to several environments and building the installation package for production environments.
Often at a meetup or conference I get the question of how we manage and render different configuration files for different systems such as DEV and UAT. We use a very simple approach to solve the problem through templating. For every configuration we have some kind of template, as follows:
# MQ Configuration
mq.port=@mq.port@
mq.host=@mq.host@
mq.channel=@mq.channel@
mq.queue.manager=@mq.queue.manager@
mq.ccsid=@mq.ccsid@
mq.user=@mq.user@
mq.password=@mq.password@
mq.pool.size=@mq.pool.size@
and for every environment we have the values defined in an XML file. For example, for the DEV environment we have dev.xml, for the UAT environment we have uat.xml. Every XML file contains all the values, such as:
<property name="mq.gf.to.queue" value="MNP2GF"/>
<property name="mq.gf.from.queue" value="GF2MNP"/>
<property name="mq.port" value="1234"/>
<property name="mq.host" value="192.168.157.227"/>
<property name="mq.channel" value="SYSTEM.DEF.SVRCONN"/>
<property name="mq.queue.manager" value="venus.queue.manager"/>
<property name="mq.ccsid" value="866"/>
<property name="mq.user" value="mqm"/>
<property name="mq.password" value="mqm01"/>
<property name="mq.pool.size" value="10"/>
<property name="mq.pool.size" value="10"/>

Every time after a successful build, Jenkins runs a simple Python script which generates all the configuration files from the templates. This way we can deploy the application to different environments and build the distribution package.
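For completeness, here is a minimal sketch of such a script (the folder names, and the assumption that the property entries are wrapped in a single root element, are mine, for illustration; the real script is of course tailored to the project):
#!/usr/bin/env python
# Render *.properties templates by replacing @key@ placeholders with values
# taken from an environment specific xml file (dev.xml, uat.xml, ...).
import os
import re
import sys
import xml.etree.ElementTree as ET

def load_values(xml_file):
    # Build a dict {property name: value} from <property name=".." value=".."/> entries
    root = ET.parse(xml_file).getroot()
    return {p.get('name'): p.get('value') for p in root.iter('property')}

def render(template_file, values, out_dir):
    with open(template_file) as f:
        content = f.read()
    # Replace every @some.key@ placeholder with its value; leave unknown keys untouched
    content = re.sub(r'@([^@\s]+)@',
                     lambda m: values.get(m.group(1), m.group(0)),
                     content)
    with open(os.path.join(out_dir, os.path.basename(template_file)), 'w') as f:
        f.write(content)

if __name__ == '__main__':
    # Usage: render_config.py dev.xml templates/ generated/
    env_xml, template_dir, out_dir = sys.argv[1:4]
    values = load_values(env_xml)
    for name in os.listdir(template_dir):
        if name.endswith('.properties'):
            render(os.path.join(template_dir, name), values, out_dir)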