Home-built Hadoop analytics cluster

This page is a summary of my putzing around with a home-built set of cheap, low-power nodes that I wanted to use as a 4-node Hadoop cluster. My primary objective was to learn more about managing a Hadoop cluster and to use it for analysis of some weather data I have been collecting for my frog locales. I also wanted more experience with Python and R, so this seemed like a good gearhead project to take on.

 

Project summary and links

Project objectives

  • learn hadoop administration
  • learn some Python
  • get R running on hadoop
  • start poking around larger weather data sets

I’ll add more detail to this section as I go. My intent is to write posts on this project that cover my progress in detail through each phase:

  • The build out of the cluster
  • Software stack installation
  • Client installation
  • R and Python
  • Basic data import
  • Plotting

 

Analytics projects

Once I have the cluster running I’ll add links to individual analytics projects here. I have a few interesting data sets I’d like to take a look at, namely weather station data from Peru, Suriname, and Brazil, where some of the frogs I keep are from. I’ll also take a look at rapid parsing and analysis of log files, which is more in line with my day job, but the weather data seems pretty interesting as a smaller starter project.

  • Suriname (small scale project in Excel looking at weather data)
  • Peru (frogs from Peru)

 

Hardware :: $310 per node

I wanted to go with 4 nodes to get a realistic cluster layout. My target was a $300 system with 4 cores, 16 GB of memory, and SSD storage. I came pretty close with this setup, exceeding my target by around 20 bucks a unit. I could have stepped down to a dual-core mainboard, which would have brought me under my target, but doubling the core capacity of the cluster from 8 cores to 16 for about $100 in total spend seemed like a good add-on. Here are the details of the hardware I went with.

Mainboard :: $111 with embedded CPU

I agonized for a while over what hardware to buy for this project. I wanted something fanless and low power based on my experience with other appliances I had built in the house (pfSense firewall, NAS4Free SAN, DVR), so I started looking at newer embedded Intel solutions. I settled on an ASRock N3150DC-ITX motherboard with an embedded Intel Celeron N3150, a 4-core 2.08 GHz (burst) Braswell chip with a TDP of 6 W…just 6 watts!

Screen Shot 2015-08-26 at 11.32.22 PM

Screen Shot 2015-08-26 at 11.32.06 PM

Memory :: $75 for 16 GB (2 x 8 GB)

I’ve had good luck over the years with Crucial brand memory, so I stuck with them for this cluster build. I picked up 4 kits of Crucial 16GB (2 x 8GB) 204-pin DDR3L SO-DIMM 1600 (PC3L-12800) laptop memory, model CT2KIT102464BF160B – what a mouthful. It’s basically a 16 GB laptop memory kit that maxes out the memory capacity of the mainboard I selected.

Storage :: $84 for 250 GB SSD

I was initially thinking I would go with a 120 GB SSD, but after seeing the latest prices I decided to bump up the storage to give myself some wiggle room for larger data sets, while also accounting for HDFS’s default replication factor of 3, which will chew through my storage. I went with one Crucial BX100 CT250BX100SSD1 2.5″ 250 GB SATA III MLC internal solid state drive (SSD) for each node. I can expand this later if necessary, but I think it will be sufficient for my purposes.
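As a rough sanity check on capacity (ignoring the OS, logs, and HDFS overhead):

4 nodes x 250 GB = ~1 TB raw
1 TB / 3x replication = ~333 GB of usable HDFS capacity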

Case :: $40

I went with a Mini-Box M350 for the case. They are durable and cheap. Can’t argue with that.

Hardware assembly

Hardware assembly is pretty simple with these Atom-based Intel boards. This ASRock board was especially straightforward because it has a built-in DC power supply, so the wiring was minimal. Here’s a photo summary of the process. First up is a shot of the Mini-Box case as it arrived, then with the cover off, followed by the HDD/SSD mounting bracket removed:

clusterbuild01_7668
clusterbuild02_7669 clusterbuild03_7670

A view of the main board, a close up of the DC input, and a shot of the case and the rear plate prior to assembly:

clusterbuild05_7673
clusterbuild05a clusterbuild04_7671

Shots of the CPU and heat sink, and the IO connectors:

clusterbuild06_7674 clusterbuild07_7675

The memory banks and DIMMs:

clusterbuild12_7680 clusterbuild10_7678 clusterbuild08_7676

The mainboard and memory, installed:

clusterbuild09_7677

Installing the power and SATA cables, a close up of the power lead off the main board, and an image of the SSD:

clusterbuild14_7682 clusterbuild13_7681 clusterbuild11_7679

The mainboard, memory, and SSD installed:

clusterbuild15_7683

All four nodes assembled and stacked and ready to go:

clusterbuild16_7684

 

Networking

I kept the networking simple. This setup uses an uplink from my main switch to a dedicated, unmanaged, cheap 5-port gigabit switch ($15), with 12″ Ethernet cables to wire everything up:

IMG_7698.JPG

All four hosts are on my main home network.

 

Software

Screen Shot 2015-08-26 at 11.59.30 PM  hadoop-logo  spark-logo-hd

I kept things simple and went with CentOS 7 for my base operating system, and I downloaded Apache Hadoop 2.7. I’ve worked directly with Hortonworks at work, but I figured I would try Hadoop at home on my own. My plan is to also test out other packages like SparkR.

 

CentOS installation

I created a bootable USB key and installed a base image on node 1. I then used Clonezilla (http://clonezilla.org) to capture an image of the disk layout and installation and propagated it to nodes 2 through 4. Relying on my terrible memory and limited Linux IT skills (I write everything down), I configured the hostnames and network settings on all of the nodes.

The key steps in getting my hosts configured correctly were (a rough command sketch follows the list):

  1. set host names (hostnamectl).
  2. set the IP address, netmask, gateway, and DNS servers.
  3. configure the hosts files.
  4. generate SSH keys (ssh-keygen) and enable passwordless SSH between nodes.
  5. set up Java (install java-1.7.0-openjdk-devel).
  6. disable chronyd (systemctl disable chronyd.service).
  7. install and enable ntp (systemctl enable ntpd.service – satisfies the Ambari installation check).
  8. disable firewalld.
  9. install openssl on all nodes (not sure this step was really necessary, but I installed it while troubleshooting connectivity between nodes during the Ambari registration process).
  10. install httpd on the Ambari server (it didn’t seem to get installed with Ambari, so I installed it manually).
  11. disable SELinux on the name node.
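Here’s a rough sketch of those commands for node 1 (ah01), assuming root as the working user. The nmcli connection name (enp2s0), gateway, and DNS server are assumptions for my 192.168.0.x network; editing the ifcfg-* files directly works just as well:

# set the hostname (CentOS 7)
hostnamectl set-hostname ah01.orion.local

# static IP – connection name enp2s0 is an assumption, check with `nmcli con show`
nmcli con mod enp2s0 ipv4.method manual ipv4.addresses 192.168.0.171/24 ipv4.gateway 192.168.0.1 ipv4.dns 192.168.0.1
nmcli con up enp2s0

# passwordless SSH to the other nodes (repeat ssh-copy-id for ah03 and ah04)
ssh-keygen -t rsa
ssh-copy-id root@ah02.orion.local

# Java, time sync, and the services Ambari expects
yum install -y java-1.7.0-openjdk-devel ntp
systemctl disable chronyd.service
systemctl enable ntpd.service && systemctl start ntpd.service
systemctl stop firewalld && systemctl disable firewalld

# SELinux off on the name node (setenforce for now; edit /etc/selinux/config to persist)
setenforce 0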

I also set up key-based authentication between all the hosts, as well as from my laptop, for easier access. Now I’ve got 4 hosts accessible on the network:

ah01 ah02 ah03 ah04
And simply because it is neat, I can access the cluster consoles from my iPhone using the Serverauditor app:

IMG_7704.PNG IMG_7703.PNG

 

Hadoop installation via Ambari

I decided to use Ambari to set up Hadoop. Getting started was relatively straightforward. There were five steps to getting the base ambari-server service running:

  1. cd /etc/yum.repos.d/
  2. wget http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.1.0/ambari.repo
  3. yum install ambari-server
  4. ambari-server setup
  5. ambari-server start
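Once ambari-server start completes, the web UI should come up on port 8080 with the default admin/admin login (I’m assuming ah01 as the Ambari host here):

http://ah01.orion.local:8080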

Once I got through the command-line setup I switched over to the Ambari configuration UI, which made setup much easier. Here’s a series of screenshots of the process I went through with HDP 2.3 and Ambari 2.1. First up is the initial splash screen that presents the cluster installation wizard:

ambari-01-setup

After specifying my RSA private key for passwordless login, I proceeded with registration of my four nodes:

ambari-02-install

I had problems with this step, which turned out to be due to the firewall running on the Ambari server and an incorrectly configured hosts file. I had not included an entry mapping the local host’s IP address to its FQDN:

127.0.0.1   localhost 
::1         localhost 
192.168.0.172 ah02.orion.local ah02
192.168.0.173 ah03.orion.local ah03
192.168.0.174 ah04.orion.local ah04

Then I added the missing local host entries:

127.0.0.1   localhost localhost.orion.local
::1         localhost localhost.orion.local
192.168.0.171 ah01.orion.local ah01
192.168.0.172 ah02.orion.local ah02
192.168.0.173 ah03.orion.local ah03
192.168.0.174 ah04.orion.local ah04
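I also made sure the firewall really was off on the Ambari server and that each node resolved its own FQDN before retrying registration – roughly:

# confirm the firewall is actually off on the Ambari server
systemctl status firewalld

# each node should report its fully qualified name
hostname -f

# and should resolve its peers from the hosts file
ping -c 1 ah01.orion.local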

 

With that, registration completed and the installation went forward. Here are some screenshots from the progression of the install:

ambari-03-install

ambari-04-install

ambari-05-install

ambari-07-almostThere

Finally the installation completed. The only warning I encountered was Oozie failing to start on two hosts:

ambari-08-done

I finished up the install:

ambari-09-summary

Then I restarted Oozie and Ambari was up:

ambari-11-analytics2

 

 

Monitoring changes

First up was a switch from embedded mode to distributed mode for Ambari Metrics. This moves the Metrics Collector’s storage off the local disk of the collector host and into HBase running on HDFS, so historical metrics data is kept in the cluster rather than on a single node. Here is a good link on making the change:

https://cwiki.apache.org/confluence/display/AMBARI/AMS+-+distributed+mode

  1. In ams-site, set timeline.metrics.service.operation.mode = distributed
  2. In ams-hbase-site, set hbase.rootdir = hdfs://<namenode-host>:8020/user/ams/hbase
  3. In ams-hbase-site, set hbase.cluster.distributed = true
  4. Add dfs.client.read.shortcircuit = true

Screen Shot 2015-11-06 at 8.41.26 AM

Then restart Metrics Collector and you should start seeing data logged to HDFS.
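A quick way to confirm metrics are landing in HDFS is to list the AMS root directory configured above (the path matches the hbase.rootdir value from step 2):

sudo -u hdfs hdfs dfs -ls /user/ams/hbase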

Setting up views

Instructions on how to set up User Views in Ambari can be found here (Google “AmbariUserViewsTechPreview_v1.pdf” if this link is no longer working). The abridged version of getting this set up is:

  1. HDFS configuration
    • In HDFS -> Configs -> Custom core-site add:
      • hadoop.proxyuser.root.groups=*
      • hadoop.proxyuser.root.hosts=*
      • hadoop.proxyuser.hcat.groups=*
      • hadoop.proxyuser.hcat.hosts=*
  2. If Hive is not installed, install it; we will need WebHCat for the Pig view.
    • In Hive -> Configs -> Custom webhcat-site add:
      • webhcat.proxyuser.root.groups=*
      • webhcat.proxyuser.root.hosts=*
  3. Set up the HDFS view:
    • “Instance Name”: provide a name for your instance.
    • “Display Name”: provide a display name.
    • “Description”
    • “WebHDFS FileSystem URI”:
    • “WebHDFS Username”:
    • Set user permissions.
  4. Set up the Pig view:
    • “Instance Name”: provide a name for your instance.
    • “Display Name”: provide a display name.
    • “Description”
    • “WebHDFS FileSystem URI”:
    • “WebHCat URL”:
    • “Jobs HDFS Directory”
    • “Scripts HDFS Directory”
    • Set user permissions.

I’ve set this up twice – the first time many of the core-site properties were not present, but on a second cluster they were already there. Perhaps it was due to differences in installation order between the two clusters.
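For reference, here’s roughly what those view fields look like filled in on this cluster. The host name assumes ah01 is running the NameNode and WebHCat, the ports are the stock defaults (50070 for WebHDFS on the NameNode, 50111 for WebHCat), and the username and HDFS directories are just example values consistent with the proxyuser.root settings above:

WebHDFS FileSystem URI:  webhdfs://ah01.orion.local:50070
WebHDFS Username:        root
WebHCat URL:             http://ah01.orion.local:50111/templeton
Jobs HDFS Directory:     /user/root/pig/jobs
Scripts HDFS Directory:  /user/root/pig/scripts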

SparkR setup

Now that the base packages are installed, I turned to getting the foundation of my analytics project in place – SparkR. My objective with this entire project was to learn R and push my scripts out to a tiny cluster…just to show I can 🙂

HDP 2.3 ships with Spark 1.3.1.2.3. I can either add the package from http://amplab-extras.github.io/SparkR-pkg/ to my existing 1.3.1.2.3 Spark installation, or I can upgrade to Spark 1.4 or higher, which has SparkR built in. I am going to upgrade to Spark 1.5 so I can better understand the process of managing packages.

New versions of Spark can be downloaded here:

https://spark.apache.org/downloads.html
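As a rough sketch of where I’m headed (the exact tarball name and the yarn-client master are assumptions until I actually do the upgrade), the pre-built Spark 1.5 package for Hadoop 2.6 can be pulled down and the SparkR shell pointed at YARN like this:

wget https://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
tar -xzf spark-1.5.0-bin-hadoop2.6.tgz
cd spark-1.5.0-bin-hadoop2.6

# point Spark at the cluster’s Hadoop/YARN configs, then start the SparkR shell
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/sparkR --master yarn-client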

 

 

 

Mac OS X client environment setup