gmond
A Ganglia Monitoring Daemon (gmond) runs on each node in the cluster and collects statistics from the node it runs on as well as other nodes in the cluster. Normally it is a multicast system where each gmond node receives data from its peers. However, since Amazon EC2 does not support multicast at this time, you must setup Ganglia Monitoring Daemons in unicast mode where each node in a cluster is configured to send its data to one pre-designated node.
gmetad
A Ganglia Meta Daemon (gmetad) runs for each grid and collects data from the Ganglia Monitoring Daemons, one from each cluster. It stores the data it collects on the file system. We'll only be configuring one grid and therefore one gmetad. We'll be running gmetad from the same node that the PHP Web Front End is installed on.
PHP Web Front End
A PHP application reads the data and provides a UI to visualize the data over time with pretty graphs. Example from UC Berkeley.
Approach
Our HBase cluster has one Master node (with Hadoop's Name Node and Job Tracker) and the rest Region Server nodes (each with a Hadoop Data Node and Task Tracker). Since we must use Ganglia in unicast mode on EC2, we need to choose one of these nodes to receive data from each of the other nodes. We'll configure each Region Server gmond to send data to the gmond running on the Master. A separate node will be used to run gmetad and the PHP web front end. The gmetad will be configured to poll the Master's gmond for aggregated cluster data.
Getting the right version of Ganglia is important because of compatibility issues with Hadoop and HBase's Ganglia clients. We're using HBase 0.20.6 and Hadoop 0.20.2. Therefore, we're going to use Ganglia 3.0.7, the latest of 3.0.x at the time of writing. You could use the latest Ganglia 3.1.x if you're willing to apply the patch found in HADOOP-4675 to your Hadoop installation.
Setting up gmond
Add a ganglia user.
adduser --disabled-login --no-create-home gangliaDownload and unpack Ganglia 3.0.7.
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz tar -xzvf ganglia-3.0.7.tar.gz -C /opt rm ganglia-3.0.7.tar.gzInstall dependencies
apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev
Compile and install.
cd /opt/ganglia-3.0.7 ./configure make && make installGenerate a default config file.
gmond --default_config > /etc/gmond.conf
Configure /etc/gmond.conf
Modify the globals.
globals { user = ganglia }Define the cluster.
cluster { name = HBase owner = "Your Company" latlong = "N34.02 W118.45" url = "http://yourcompany.com/" }Disable multicast and define the host, the HBase master, where nodes in the cluster send data.
udp_send_channel { # mcast_join = 239.2.11.71 host = master.yourcompany.com port = 8649 ttl = 1 } udp_recv_channel { # mcast_join = 239.2.11.71 port = 8649 # bind = 239.2.11.71 }
Run gmond:
gmond
Note: As far as I know, there is no way to stop gmond other than killing the process.
You can test each gmond node to make sure it is working. From each node:
telnet localhost 8649You should see XML output.
Once gmond is installed and running on each node, move on to setup gmetad.
Setting up gmetad
We're going to setup a new EC2 instance to run gmetad and the PHP web front end. A t1.micro should be sufficient for modest clusters. gmetad will be configured to poll the gmond running on the HBase master for aggregated cluster data.
Add a ganglia user.
adduser --disabled-login --no-create-home gangliaDownload and unpack Ganglia 3.0.7.
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz tar -xzvf ganglia-3.0.7.tar.gz -C /opt rm ganglia-3.0.7.tar.gzInstall dependencies
apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-devCompile and install.
cd /opt/ganglia-3.0.7 ./configure make && make installCopy the default config file.
cp /opt/ganglia-3.0.7/gmetad/gmetad.conf /etc/gmetad.conf
Configure /etc/gmetad.conf
setuid_username "ganglia" data_source "HBase" master.yourcompany.com gridname "YourCompany"Create directories where Ganglia will save its data.
mkdir /var/lib/ganglia mkdir /var/lib/ganglia/rrds/ chown -R ganglia:ganglia /var/lib/ganglia/Run gmetad:
gmetadFor troubleshooting, it might be helpful not to daemonize gmetad:
gmetad -d 1Note: As far as I know, there is no way to stop gmetad other than killing the process.
Setting up the PHP Web Front End
Install Apache and PHP libraries:
apt-get -y install rrdtool apache2 php5-mysql libapache2-mod-php5 php5-gdCopy PHP web app into Apache's HTML directory:
cp -r /opt/ganglia-3.0.7/web /var/www/gangliaRestart Apache:
/etc/init.d/apache2 restart
You should now be able to visit the Ganglia console on http://ganglia.yourcompany.com/ganglia and see pretty graphs and charts for machine metrics like load, memory, disk use, etc.
Configuring Hadoop and HBase
The next step is to configure Hadoop and HBase to start sending metrics to the master gmond. Detailed instructions are given here and here for Hadoop and HBase respectively.
Here's what we end up with inside HBase's conf/hadoop-metrics.properties:
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext hbase.period=10 hbase.servers=master.yourcompany.com:8649 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext jvm.period=10 jvm.servers=master.yourcompany.com:8649 rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext rpc.period=10 rpc.servers=master.yourcompany.com:8649
And inside Hadoop's conf/hadoop-metrics.properties:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.period=10 dfs.servers=master.yourcompany.com:8649 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext mapred.period=10 mapred.servers=master.yourcompany.com:8649 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext jvm.period=10 jvm.servers=master.yourcompany.com:8649
Once the hadoop-metrics.properties files are modified on all nodes in your cluster, you'll need to restart HBase and Hadoop.
Metrics
After restarting, you should see the following metrics show up in Ganglia for your HBase and Hadoop cluster:
hbase.regionserver.blockCacheCount hbase.regionserver.blockCacheFree hbase.regionserver.blockCacheHitRatio hbase.regionserver.blockCacheSize hbase.regionserver.compactionQueueSize hbase.regionserver.fsReadLatency_avg_time hbase.regionserver.fsReadLatency_num_ops hbase.regionserver.fsSyncLatency_avg_time hbase.regionserver.fsSyncLatency_num_ops hbase.regionserver.fsWriteLatency_avg_time hbase.regionserver.fsWriteLatency_num_ops hbase.regionserver.memstoreSizeMB hbase.regionserver.regions hbase.regionserver.requests hbase.regionserver.storefileIndexSizeMB hbase.regionserver.storefiles hbase.regionserver.stores rpc.metrics.RpcProcessingTime_avg_time rpc.metrics.RpcProcessingTime_num_ops rpc.metrics.RpcQueueTime_avg_time rpc.metrics.RpcQueueTime_num_ops rpc.metrics.addColumn_avg_time rpc.metrics.addColumn_num_ops rpc.metrics.checkAndPut_avg_time rpc.metrics.checkAndPut_num_ops rpc.metrics.close_avg_time rpc.metrics.close_num_ops rpc.metrics.createTable_avg_time rpc.metrics.createTable_num_ops rpc.metrics.deleteColumn_avg_time rpc.metrics.deleteColumn_num_ops rpc.metrics.deleteTable_avg_time rpc.metrics.deleteTable_num_ops rpc.metrics.delete_avg_time rpc.metrics.delete_num_ops rpc.metrics.disableTable_avg_time rpc.metrics.disableTable_num_ops rpc.metrics.enableTable_avg_time rpc.metrics.enableTable_num_ops rpc.metrics.exists_avg_time rpc.metrics.exists_num_ops rpc.metrics.getClosestRowBefore_avg_time rpc.metrics.getClosestRowBefore_num_ops rpc.metrics.getClusterStatus_avg_time rpc.metrics.getClusterStatus_num_ops rpc.metrics.getHServerInfo_avg_time rpc.metrics.getHServerInfo_num_ops rpc.metrics.getOnlineRegionsAsArray_avg_time rpc.metrics.getOnlineRegionsAsArray_num_ops rpc.metrics.getProtocolVersion_avg_time rpc.metrics.getProtocolVersion_num_ops rpc.metrics.getRegionInfo_avg_time rpc.metrics.getRegionInfo_num_ops rpc.metrics.getRegionsAssignment_avg_time rpc.metrics.getRegionsAssignment_num_ops rpc.metrics.get_avg_time rpc.metrics.get_num_ops rpc.metrics.incrementColumnValue_avg_time rpc.metrics.incrementColumnValue_num_ops rpc.metrics.isMasterRunning_avg_time rpc.metrics.isMasterRunning_num_ops rpc.metrics.lockRow_avg_time rpc.metrics.lockRow_num_ops rpc.metrics.modifyColumn_avg_time rpc.metrics.modifyColumn_num_ops rpc.metrics.modifyTable_avg_time rpc.metrics.modifyTable_num_ops rpc.metrics.next_avg_time rpc.metrics.next_num_ops rpc.metrics.openScanner_avg_time rpc.metrics.openScanner_num_ops rpc.metrics.put_avg_time rpc.metrics.put_num_ops rpc.metrics.regionServerReport_avg_time rpc.metrics.regionServerReport_num_ops rpc.metrics.regionServerStartup_avg_time rpc.metrics.regionServerStartup_num_ops rpc.metrics.shutdown_avg_time rpc.metrics.shutdown_num_ops rpc.metrics.unlockRow_avg_time rpc.metrics.unlockRow_num_ops dfs.datanode.blockChecksumOp_avg_time dfs.datanode.blockChecksumOp_num_ops dfs.datanode.blockReports_avg_time dfs.datanode.blockReports_num_ops dfs.datanode.block_verification_failures dfs.datanode.blocks_read dfs.datanode.blocks_removed dfs.datanode.blocks_replicated dfs.datanode.blocks_verified dfs.datanode.blocks_written dfs.datanode.bytes_read dfs.datanode.bytes_written dfs.datanode.copyBlockOp_avg_time dfs.datanode.copyBlockOp_num_ops dfs.datanode.heartBeats_avg_time dfs.datanode.heartBeats_num_ops dfs.datanode.readBlockOp_avg_time dfs.datanode.readBlockOp_num_ops dfs.datanode.readMetadataOp_avg_time dfs.datanode.readMetadataOp_num_ops dfs.datanode.reads_from_local_client dfs.datanode.reads_from_remote_client dfs.datanode.replaceBlockOp_avg_time dfs.datanode.replaceBlockOp_num_ops dfs.datanode.writeBlockOp_avg_time dfs.datanode.writeBlockOp_num_ops dfs.datanode.writes_from_local_client dfs.datanode.writes_from_remote_client jvm.metrics.gcCount jvm.metrics.gcTimeMillis jvm.metrics.logError jvm.metrics.logFatal jvm.metrics.logInfo jvm.metrics.logWarn jvm.metrics.memHeapCommittedM jvm.metrics.memHeapUsedM jvm.metrics.memNonHeapCommittedM jvm.metrics.memNonHeapUsedM jvm.metrics.threadsBlocked jvm.metrics.threadsNew jvm.metrics.threadsRunnable jvm.metrics.threadsTerminated jvm.metrics.threadsTimedWaiting jvm.metrics.threadsWaiting
As you can see, the metrics are incredibly useful for understanding the health of your HBase and Hadoop cluster over time.