Tuesday 23 August 2011

Summer and HBase!

I spent most of my summer working with and learning HBase, Hadoop’s column database. Why this database? To put it in Ian Boston’s words: “The reason for HBase is its the DB of choice for many systems that want to do large scale data analysis on Hadoop.” Ian declared on June 27 that the database of choice was HBase, and since then I must have clicked more Google links for ‘HBase’ than there are HBase deployments! Then began my quest: to reconsider the database driver in light of this new database, and to write the tests that would judge its performance. After plenty of idea exchange, the code was written and the tests were implemented. This post explains how I deployed HBase during the summer.

World, HBase. HBase, world.

If you come from an RDBMS background, shake your head vigorously and forget every schema you have ever implemented! To absorb the concepts of HBase, vacuum-clean your head and make room for new ideas to flow in.

Why HBase?

To answer that question I’ll have to walk you through the problems of traditional databases. They face basically two kinds of problem: the first is scaling, and the second, well, can be called ‘sparse-ity’. A traditional RDBMS may be reliable, widely used, developer friendly, blah blah blah, but when you ask it “How do I scale you?”, it replies “Put more money in me and buy more hardware”. The second problem goes like this: imagine you are trying to stuff an intricate object graph, with many interdependent objects and relations, into an RDBMS schema. You are bound to end up with a schema in which an object has several attributes that are seldom used. Your RDBMS is surely going to charge you for all those extra ‘NULL’ references out there! So how does HBase deal with this? Let’s see!

HBase and scaling :

HBase is a specialized column DB (more on that soon) that excels at scaling. It partitions data horizontally and distributes it over a large number of commodity servers. HBase is built on Hadoop, which implements functionality similar to Google's GFS and Map/Reduce systems. It provides a means to efficiently organize and serve huge amounts of data. If you want to dig deeper, read Google’s BigTable paper and read up on the Map/Reduce concepts.

HBase and ‘sparse-ity’:

HBase is a column-oriented database. This means that it organizes content by columns rather than as fixed rows, so an object does not have to carry attributes it never uses. Where a row DB would store nineteen NULLs and one attribute, HBase (or rather any column DB) saves only that one attribute. So: less storage used, and less data to wade through on reads!
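
To make the sparse idea concrete, here is a small shell sketch. It is not HBase itself and says nothing about how HBase lays out bytes on disk: it simply takes a made-up row-style table padded with NULLs and re-expresses it as one line per value that actually exists. The table contents and the column names name, phone, email and city are invented for illustration.

```shell
# A made-up row-oriented table: every row carries every column,
# and missing values are padded with NULL. (Illustrative data only.)
dense_rows() {
  cat <<'EOF'
user1|Alice|NULL|NULL|NULL
user2|NULL|NULL|bob@example.org|NULL
EOF
}

# Re-express the same data as sparse cells: one "row column value" line
# per value that exists; NULL slots simply produce no line.
sparse_cells() {
  dense_rows | awk -F'|' '
    BEGIN { split("name phone email city", col, " ") }
    { for (i = 2; i <= NF; i++) if ($i != "NULL") print $1, col[i - 1], $i }
  '
}

sparse_cells
```

The two rows have eight attribute slots between them, but only two cells come out the other end; that is the saving described in the paragraph above, in miniature.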

HBase datamodel:

An excellent read for this would be : http://wiki.apache.org/hadoop/Hbase/DataModel

How I went about setting up HBase for SparseMapContent:

Configuring Apache HBase database on Windows:
This section describes how to configure Apache HBase in standalone mode on Windows using Cygwin.

On Windows, three technologies are required: Java, SSH and Cygwin.

Installing JAVA:
Download the Java Standard Edition platform from here and follow the simple GUI wizard to install it.

Installing Cygwin
Cygwin provides a *nix-like environment on Windows. The installation steps are as follows:
1. Make sure you have Administrator privileges on the target system.

2. Create Root and Local Package directories. A good suggestion is to use C:\cygwin\root and C:\cygwin\setup folders.

3. Download the setup.exe utility from here and save it to the Local Package directory.


4. Run the setup.exe utility:

   1. Choose the Install from Internet option.
   2. Choose your Root and Local Package folders.
   3. Select an appropriate mirror.
   4. Don't select any additional packages yet, as we only want to install Cygwin for now.
   5. Wait for the download and install to complete.
   6. Finish the installation.
5. Add CYGWIN_HOME system-wide environment variable that points to your Root directory.
6. Add %CYGWIN_HOME%\bin to the end of your PATH environment variable.
7. Reboot the system after making changes to the environment variables, otherwise the OS will not be able to find the Cygwin utilities.
8. Test your installation by running your freshly created shortcuts or the Cygwin.bat command in the Root folder. You should end up in a terminal window that is running a Bash shell. Test the shell by issuing the following commands:

   1. cd / should take you to the Root directory in Cygwin.
   2. ls should list all files and folders in the current directory.
   3. Use the exit command to end the terminal.

9. When needed, to uninstall Cygwin you can simply delete the Root and Local Package directories, and the shortcuts that were created during installation.

Installing SSH:
HBase (and Hadoop) rely on SSH for interprocess/-node communication and launching remote commands.

1. Rerun the setup.exe utility.
2. Leave all parameters as is, skipping through the wizard using the Next button until the Select Packages panel is shown.
3. Maximize the window and click the View button to toggle to the list view, which is ordered alphabetically on Package, making it easier to find the packages we'll need.
4. Select the following packages by clicking the status word (normally Skip) so it's marked for installation. Use the Next button to download and install the packages.


   1. OpenSSH
   2. tcp_wrappers
   3. diffutils
   4. zlib

5. Wait for the install to complete and finish the installation.

Installing HBase
Download HBase from here, unzip it and place it under the directory C:\cygwin\usr\local\ so that it gets installed in Cygwin (C:\cygwin\usr\local\hbase-).

Configuring JAVA
1. Create a symbolic link in /usr/local to the Java home directory by using the following command and substituting the name of your chosen Java environment:
ln -s /cygdrive/c/Program\ Files/Java/ /usr/local/


2. Test your Java installation by changing directories to your Java folder with cd /usr/local/ and issuing the command ./bin/java -version. This should output the version of your chosen JRE.

Configuring SSH
1. On Windows Vista and above, make sure you run the Cygwin shell with elevated privileges, by right-clicking on the shortcut and using Run as Administrator.


2. First of all, make sure that the rights on some crucial files are correct. Use the commands below; you can verify the rights with ls -l on each file. Also note that the shell's auto-completion (the Tab key) is extremely handy in these situations.


   1. chmod +r /etc/passwd to make the passwords file readable for all
   2. chmod u+w /etc/passwd to make the passwords file writable for the owner
   3. chmod +r /etc/group to make the groups file readable for all
   4. chmod u+w /etc/group to make the groups file writable for the owner
   5. chmod 755 /var to make the var folder writable for the owner and readable and executable for all


3. Edit the /etc/hosts.allow file using your favorite editor (why not vi in the shell!) and make sure the following two lines are in there before the PARANOID line:


   1. ALL : localhost 127.0.0.1/32 : allow
   2. ALL : [::1]/128 : allow


4. Next we have to configure SSH by using the script ssh-host-config. The following may be asked in random order but don’t worry about that.


   1. If this script asks to overwrite an existing /etc/ssh_config, answer yes.
   2. If this script asks to overwrite an existing /etc/sshd_config, answer yes.
   3. If this script asks to use privilege separation, answer yes.
   4. If this script asks to install sshd as a service, answer yes. Make sure you started your shell as Administrator!
   5. If this script asks for the CYGWIN value, just press Enter, as the default is ntsec.
   6. If this script asks to create the sshd account, answer yes.
   7. If this script asks to use a different user name as service account, answer no as the default will suffice.
   8. If this script asks to create the cyg_server account, answer yes. Enter a password for the account.


5. Start the SSH service using net start sshd or cygrunsrv --start sshd. Note that cygrunsrv is the utility that makes the process run as a Windows service. Confirm that you see a message stating that the CYGWIN sshd service was started successfully.


6. Harmonize the Windows and Cygwin user accounts by using the commands:


   1. mkpasswd -cl > /etc/passwd
   2. mkgroup --local > /etc/group


7. Test the installation of SSH:


   1. Open a new Cygwin terminal
   2. Use the command whoami to verify your userID
   3. Issue an ssh localhost to connect to the system itself
      1. Answer yes when presented with the server's fingerprint
      2. Issue your password when prompted
      3. Test a few commands in the remote session
   4. The exit command should take you back to your first shell in Cygwin
   5. Exit should terminate the Cygwin shell.


8. If you get stuck with some password problem, you can change it using the command passwd.


Configuring HBase

(2nd and 3rd steps are optional.)
1. HBase uses the ./conf/hbase-env.sh file to configure its dependencies on the runtime environment. Copy and uncomment the following lines just underneath their originals, and change them to fit your environment. They should read something like:


   1. export JAVA_HOME=/usr/local/
   2. export HBASE_IDENT_STRING=$HOSTNAME as this most likely does not include spaces.


2. HBase uses the ./conf/hbase-default.xml file for configuration. Some properties do not resolve to existing directories because the JVM runs on Windows. This is the major issue to keep in mind when working with Cygwin: within the shell all paths are *nix-alike, hence relative to the root /. However, every parameter that is consumed by the Windows processes themselves needs to be a Windows setting, hence C:\-alike. Change the following properties in the configuration file, adjusting paths where necessary to match your own installation:


   1. hbase.rootdir must read e.g. file:///C:/cygwin/root/tmp/hbase/data
   2. hbase.tmp.dir must read C:/cygwin/root/tmp/hbase/tmp
   3. hbase.zookeeper.quorum must read 127.0.0.1 because for some reason localhost doesn't seem to resolve properly on Cygwin.


3. Make sure the configured hbase.rootdir and hbase.tmp.dir directories exist and have the proper rights set up, e.g. by issuing a chmod 777 on them.
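
The path rule from step 2 can be sketched in a few lines of shell. This is only an illustration: to_windows_path and hbase_property are hypothetical helper names, the conversion only handles /cygdrive/<letter>/... paths, and on a real Cygwin install the cygpath utility (for example cygpath -m) is the more reliable way to do it. The property element printed at the end follows the usual name/value layout of the HBase XML configuration files.

```shell
# Hypothetical helper: turn a Cygwin-style /cygdrive path into the
# C:/-style form that the Windows JVM understands.
to_windows_path() {
  case "$1" in
    /cygdrive/?/*)
      rest=${1#/cygdrive/}     # e.g. c/cygwin/root/tmp
      drive=${rest%%/*}        # drive letter, e.g. c
      remainder=${rest#*/}     # everything after the drive letter
      drive=$(printf '%s' "$drive" | tr 'a-z' 'A-Z')
      printf '%s:/%s\n' "$drive" "$remainder"
      ;;
    *)
      # Anything else is passed through unchanged.
      printf '%s\n' "$1"
      ;;
  esac
}

# Hypothetical helper: print one <property> element for the config file.
hbase_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

hbase_property hbase.tmp.dir "$(to_windows_path /cygdrive/c/cygwin/root/tmp/hbase/tmp)"
```

The point of the sketch is the direction of the conversion: whatever the shell shows you as a *nix path has to reach the configuration file in its C:/ form.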


Testing the installation and configuration of HBase on Windows using Cygwin.
1. Start a Cygwin terminal.


2. Change directory to the HBase installation using cd /usr/local/hbase-, preferably using auto-completion.


3. Start HBase using the command ./bin/start-hbase.sh


   1. When prompted to accept the SSH fingerprint, answer yes.
   2. When prompted, provide your password. Maybe multiple times.
   3. When the command completes, the HBase server should have started.
   4. However, to be absolutely certain, check the logs in the ./logs directory for any exceptions.


4. Next we start the HBase shell using the command ./bin/hbase shell


5. You can run some simple test commands


6. Leave the shell with exit.


7. To stop the HBase server, issue the ./bin/stop-hbase.sh command and wait for it to complete. Killing the process might corrupt your data on disk.


8. In case of problems,


   1. Verify the HBase logs in the ./logs directory.
   2. Try to fix the problem.
   3. Get help on the forums or IRC (#hbase@freenode.net). People are very active and keen to help out!
   4. Stop, restart and retest the server.
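
For step 5 above, a short smoke test could look like the sketch below. The table name test, the column family cf and the row contents are placeholders made up for this example; the commands themselves (status, create, list, put, get, scan, disable, drop) are standard HBase shell commands. Feeding them to the shell through a pipe is an assumption about your setup; typing them at the interactive prompt works just as well.

```shell
# Print a handful of HBase shell commands that exercise a throwaway table.
# 'test', 'cf', 'row1' and 'value1' are placeholder names for this sketch.
hbase_smoke_commands() {
  cat <<'EOF'
status
create 'test', 'cf'
list
put 'test', 'row1', 'cf:a', 'value1'
get 'test', 'row1'
scan 'test'
disable 'test'
drop 'test'
exit
EOF
}

# From the HBase directory, with the server running, one could try:
#   hbase_smoke_commands | ./bin/hbase shell
hbase_smoke_commands
```

The table is disabled and dropped at the end so that the test leaves nothing behind. If any command fails, the ./logs directory is the first place to look.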

Getting the code using git
Open Git Bash or a command prompt and run the following commands:

$ cd
$ mkdir sparsemapcontent
$ cd sparsemapcontent
$ git clone https://github.com/ieb/sparsemapcontent.git
$ cd sparsemapcontent/
$ mvn clean install
$ exit


For developing the code in Eclipse

1. Import the sparsemapcontent folder as an existing Maven project.
2. Add the following jar files from the /usr/local/hbase- folder to the project if they are not already there:
· Hbase-.jar
· Hbase--test.jar
3. Start the HBase server as stated before.
4. Create the tables au, an, cn and smcindex.

So that was how I got my hands dirty with HBase. It’s a great DB for understanding column DB concepts. I hope this was helpful.

Please feel free to mail me at kotwal.aadish@gmail.com. Hopefully your mail would put me in a tizzy!

Interested readers may also want to explore these sites:

1. Configuration of HBase :

http://hbase.apache.org/book.html#configuration

2. HBase data model :

http://wiki.apache.org/hadoop/Hbase/DataModel

3. HBase book :

http://hbase.apache.org/book.html

4. HBase on Windows OS :

http://hbase.apache.org/docs/r0.20.6/cygwin.html

5. Place to start learning about Hadoop :

http://developer.yahoo.com/hadoop/tutorial/

6. HBase debugging and troubleshooting :

i. http://hbase.apache.org/book/trouble.html

ii. http://old.nabble.com/HBase-User-f34655.html
