This page is a summary of my putzing around with a home-built set of cheap, low-power nodes that I wanted to use as a 4-node Hadoop cluster. My primary objective was to learn more about managing a Hadoop cluster and to use the cluster for data analysis of some weather data I have been collecting for my frog locales. I wanted more experience with Python and R, so it seemed like this would be a good gearhead project to take on.
Project summary and links
Project objectives
- learn Hadoop administration
- learn some Python
- get R running on Hadoop
- start poking around larger weather data sets
I’ll add more detail to this section as I go. My intent is to have posts on this project that cover my progress in detail at each phase.
- The build out of the cluster
- Software stack installation
- Client installation
- R and Python
- Basic data import
- Plotting
Analytics projects
Once I have the cluster running I’ll add links to distinct analytics projects here. I have a few interesting data sets I’d like to take a look at, namely weather station data from Peru, Suriname and Brazil where some of the frogs I keep are from. Additionally I’ll take a look at rapid parsing and analysis of log files that’s more in line with my day job, but the weather data seems pretty interesting for a smaller starter project.
Hardware :: $310 per node
I wanted to go with 4 nodes to get a realistic cluster layout. My target was a $300 system with 4 cores, 16 GB of memory and SSD storage. I came pretty close with this setup, exceeding my target by around 20 bucks a unit. I could have stepped down to a dual core mainboard, and that would have brought me under my target, but doubling the core capacity of the cluster from 8 cores to 16 cores for about $100 in total spend seemed like a good add on. Here are the details of the hardware I went with.
Mainboard :: $111 with embedded CPU
I agonized a while over what hardware to buy for this project. I wanted something fanless and low power based on my experience with other appliances I had built in the house (pfSense firewall, NAS4Free SAN, DVR), so I started looking at newer embedded Intel solutions. I settled on the ASRock N3150DC-ITX motherboard, which has an embedded Intel Celeron N3150, a 4-core 2.08 GHz Braswell chip with a TDP of 6 W…just 6 watts!
Memory :: $75 for 16 GB (2 x 8 GB)
I’ve had good luck over the years with Crucial brand memory, so I stuck with them for this cluster build. I picked up 4 sets of Crucial 16GB (2 x 8G) 204-Pin DDR3 SO-DIMM DDR3L 1600 (PC3L 12800) Laptop Memory Model CT2KIT102464BF160B – what a mouthful. It’s basically a 16 GB laptop memory kit that maxed out the mainboard I selected.
Storage :: $84 for 250 GB SSD
I was initially thinking I would go with a 120 GB SSD, but after seeing the latest prices, I decided to increase the storage to give myself some wiggle room for larger data sets while accounting for HDFS’s default 3x replication, which will chew through my storage. I went with one Crucial BX100 CT250BX100SSD1 2.5″ 250GB SATA III MLC Internal Solid State Drive (SSD) for each node. I can expand this later if necessary, but I think it will be sufficient for my purposes.
Case :: $40
I went with a Mini-Box M350 for the case. They are durable and cheap. Can’t argue with that.
Hardware assembly
Hardware assembly is pretty simple with these Intel Atom-based boards. This ASRock board was especially straightforward because it has a built-in DC power supply, so the wiring was minimal. Here’s a photo summary of the process. First up is a shot of the Mini-Box case as it arrived, then with the cover off, followed by the HDD/SSD mounting bracket removed:
A view of the main board, a close up of the DC input, and a shot of the case and the rear plate prior to assembly:
Shots of the CPU and heat sink, and the IO connectors:
The memory banks and DIMMs:
The mainboard and memory, installed:
Installing the power and SATA cables, a close up of the power lead off the main board, and an image of the SSD:
The mainboard, memory, and SSD installed:
All four nodes assembled and stacked and ready to go:
Networking
I kept the networking simple. This setup used an uplink from my main switch to a dedicated, unmanaged cheapo 5 port gigabit switch ($15). I used 12″ Ethernet cables to wire this up:
All four hosts are on my main home network.
Software
I kept things simple and went with CentOS for my base operating system. I downloaded Apache Hadoop 2.7. I’ve worked directly with Hortonworks at work, but I figured I would try Hadoop directly at home. My plan is to test out other packages like SparkR.
- https://cran.r-project.org/bin/macosx/
- https://www.rstudio.com/products/rstudio/download/
- https://www.python.org/downloads/mac-osx/
- https://spark.apache.org/docs/latest/sparkr.html
CentOS installation
I created a bootable USB key and installed a base image on node 1. I used clonezilla (http://clonezilla.org) to capture an image of the disk layout and installation, then I propagated that to nodes 2 through 4. Using my terrible memory and limited Linux IT skills (I write everything down), I configured the hostnames and network configurations on all the clients.
The key steps in getting my hosts correct were to (a rough command sketch follows the list):
- set up host names (hostnamectl).
- set IP address, netmask, gateway, DNS hosts.
- configure host files.
- ssh-keygen and ssh without passwords.
- set up Java (install java-1.7.0-openjdk-devel)
- disable firewalld.
- disable chrony (systemctl disable chronyd.service).
- install and enable ntp (systemctl enable ntpd.service – satisfies Ambari installation check).
- install openssl on all clients (not sure if this step was really necessary, but I installed this when troubleshooting my connectivity between clients during the Ambari node registration process).
- install httpd on Ambari server (I didn’t seem to get httpd installed with Ambari, so I manually installed).
- disable SELinux on the name node.
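For reference, here is a rough sketch of the per-node commands behind the list above. The connection name ("eno1") and the gateway/DNS address are placeholders I’ve guessed at, so adjust them for your own network; node 1 (ah01) is shown.
# host name (repeat with ah02..ah04 on the other nodes)
hostnamectl set-hostname ah01.orion.local
# static IP, gateway, and DNS -- connection name and gateway/DNS are placeholders
nmcli con mod eno1 ipv4.method manual ipv4.addresses 192.168.0.171/24 ipv4.gateway 192.168.0.1 ipv4.dns 192.168.0.1
nmcli con up eno1
# Java and NTP
yum install -y java-1.7.0-openjdk-devel ntp
systemctl enable ntpd && systemctl start ntpd
# turn off firewalld and chronyd (ntpd satisfies the Ambari time check)
systemctl disable firewalld && systemctl stop firewalld
systemctl disable chronyd && systemctl stop chronyd
# passwordless ssh from the Ambari host out to the other nodes
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id root@ah02.orion.local   # repeat for ah03 and ah04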
I also set up key-based authentication between all the hosts, as well as from my laptop, for easier access. Now I’ve got 4 hosts accessible on the network:
And simply because it is neat, I can access the cluster consoles from my iPhone using the Serverauditor app:
Hadoop installation via Ambari
I decided to use Ambari for the setup of Hadoop. Getting started was relatively straightforward. There were five steps to getting the base ambari-server service running (a quick status check follows the list):
- cd /etc/yum.repos.d/
- wget http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.1.0/ambari.repo
- yum install ambari-server
- ambari-server setup
- ambari-server start
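Once those finished, a quick sanity check confirms the server is running and the web UI is answering on the default port 8080 (assuming ah01 is the Ambari host; substitute yours):
ambari-server status
curl -s -o /dev/null -w "%{http_code}\n" http://ah01.orion.local:8080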
Once I got through the command-line setup I switched over to the Ambari configuration UI, which made setup much easier. Here’s a series of screenshots of the process I went through with the HDP 2.3 stack and Ambari 2.1. First we have the initial splash screen that presents the cluster installation wizard:
After specifying my RSA keys for silent login I proceeded with registration of my four nodes:
I had problems with this step; the culprits were the firewall running on the Ambari server and an incorrectly configured hosts file. I had not included an entry for the local host’s own IP address and FQDN:
127.0.0.1 localhost
::1 localhost
192.168.0.172 ah02.orion.local ah02
192.168.0.173 ah03.orion.local ah03
192.168.0.174 ah04.orion.local ah04
Then I added the local host lines:
127.0.0.1 localhost localhost.orion.local
::1 localhost localhost.orion.local
192.168.0.171 ah01.orion.local ah01
192.168.0.172 ah02.orion.local ah02
192.168.0.173 ah03.orion.local ah03
192.168.0.174 ah04.orion.local ah04
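With the corrected hosts file in place, a quick loop from the Ambari server confirms that every node resolves before retrying registration (just a sanity check, not part of the wizard):
for h in ah01 ah02 ah03 ah04; do getent hosts $h.orion.local; done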
With that, registration completed and the installation went forward. Here are some screenshots from the progression of the install:
Finally, the installation completed. The only warning I encountered was Oozie failing to start on two hosts:
I finished up the install:
Then restarted Oozie and Ambari was up:
Monitoring changes
First up was a switch from embedded mode to distributed mode for the Ambari Metrics Service. This moves the Metrics Collector’s HBase storage off the local filesystem of the collector host and into HDFS, so historical metric data is kept in the cluster rather than on a single node. Here is a good link on making the change:
https://cwiki.apache.org/confluence/display/AMBARI/AMS+-+distributed+mode
- In ams-site, set timeline.metrics.service.operation.mode = distributed
- In ams-hbase-site, set hbase.rootdir = hdfs://<namenode-host>:8020/user/ams/hbase
- In ams-hbase-site, set hbase.cluster.distributed = true
- Add dfs.client.read.shortcircuit = true
Then restart the Metrics Collector and you should start seeing metric data written to HDFS.
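After the restart, a quick listing should show the collector’s HBase files landing under the hbase.rootdir path configured above (run as a user with HDFS access):
hdfs dfs -ls /user/ams/hbase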
Setting up views
Instructions on how to set up User Views in Ambari can be found here (Google “AmbariUserViewsTechPreview_v1.pdf” if this link is no longer working). The abridged version of the setup is as follows (a quick verification check follows the list):
- HDFS configuration
- In HDFS -> Configs -> Custom core-site add:
- hadoop.proxyuser.root.groups=*
- hadoop.proxyuser.root.hosts=*
- hadoop.proxyuser.hcat.groups=*
- hadoop.proxyuser.hcat.hosts=*
- If Hive is not installed, install it. We will need WebHCat for the Pig view.
- In Hive -> Configs -> Custom webhcat-site add:
- webhcat.proxyuser.root.groups=*
- webhcat.proxyuser.root.hosts=*
- Setup HDFS View:
- “Instance Name”: provide a name for your instance.
- “Display Name”: provide a display name.
- “Description”
- “WebHDFS FileSystem URI”:
- “WebHDFS Username”:
- Set user permissions.
- Setup Pig View:
- “Instance Name”: provide a name for your instance.
- “Display Name”: provide a display name.
- “Description”
- “WebHDFS FileSystem URI”:
- “WebHCat URL”:
- “Jobs HDFS Directory”
- “Scripts HDFS Directory”
- Set user permissions.
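Before creating the views, it’s worth confirming that WebHDFS and WebHCat are actually answering. Something like the following works, assuming the default ports (50070 for WebHDFS on the NameNode, 50111 for WebHCat) and guessing ah01 as the NameNode and ah02 as the WebHCat host in my layout:
# WebHDFS -- should return a JSON directory listing
curl -s "http://ah01.orion.local:50070/webhdfs/v1/tmp?op=LISTSTATUS&user.name=root"
# WebHCat -- should return a small JSON status document
curl -s "http://ah02.orion.local:50111/templeton/v1/status"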
I’ve set this up twice – the first time many of the core-site properties were not present, but on a second cluster they were. Perhaps that was due to differences in installation order between the two clusters.
SparkR setup
Now that the base packages are installed I turned to getting the foundation of my analytics project in place – SparkR. My objective with this entire project was to learn R, and push my scripts to a tiny cluster…just to show I can 🙂
HDP 2.3 ships with Spark 1.3.1.2.3. I can either add the SparkR package from http://amplab-extras.github.io/SparkR-pkg/ to my existing 1.3.1.2.3 Spark installation, or I can upgrade to Spark 1.4 or higher, which has SparkR built in. I am going to upgrade to Spark 1.5 so I can better understand the process of managing packages.
New versions of Spark can be downloaded here:
https://spark.apache.org/downloads.html
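As a rough sketch of what the upgrade looks like on a node (the package name assumes the pre-built Hadoop 2.6 bundle, and /etc/hadoop/conf assumes the cluster’s client configs live in the usual place):
wget https://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
tar xzf spark-1.5.0-bin-hadoop2.6.tgz && cd spark-1.5.0-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf    # point Spark at the existing YARN/HDFS configs
./bin/sparkR --master yarn-client          # opens an R shell with SparkR already loaded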
Mac OSX client environment setup