Sakai GSoC projects blog: Sakai OAE Column Storage Driver Project

Hello, I am Aadish Kotwal, a computer engineering student at Mumbai University, now entering my final year. This is my first time in the GSoC program, and I was (and am) very excited to be associated with Sakai. The Sakai Foundation is a great open-source organization. I went through past years’ projects and blogs and thought working here would be a great learning experience. I have worked with many of the technologies Sakai uses for development, and working here gives me a big platform to take my interests and knowledge to the next level. I am working on Sakai OAE, which in their own words is “A completely new system that incorporates all of the values of the Sakai CLE, and reimagines a new vision for academic collaboration.”

My project is titled “Sakai OAE Column Storage Driver”.

Put in Sakai’s words, Sakai OAE user content uses a storage mechanism based on a sparse map concept (for more on the Sparse Map Content concept, see link [1]) that represents column-database-style storage with a memory window onto the storage. This abstraction has made it possible to create traditional RDBMS representations of the storage system, with a MySQL driver capable of sharded storage over 1-write/many-read DB clusters. The original column driver was based on Apache Cassandra and needs some updating to keep it in sync with the latest developments. Sakai would also like to create a driver for another column DB (e.g. HBase, Riak, Mongo, CouchDB, etc., or some network protocol approach, e.g. Protocol Buffers, Thrift).

This idea involves modifying the existing Cassandra driver Sakai uses (why Cassandra? See link [2]) so that it includes all the features present in the JDBC driver, and then creating a new driver for a database that will be decided during the term of the project.

I have been in regular discussions with Ian Boston (my mentor). Based on his guidance, I came up with an analysis and a flow for the project, described below.

What we currently have:
A Cassandra driver for sparsemapcontent with an incomplete implementation.

What I aim to achieve by the end of the GSoC term:
A complete Cassandra driver and a new driver for a NoSQL database, implemented from scratch.

The project flow, framed after discussion with my mentor, is as follows:

Find all the methods and unit tests in the Cassandra driver that still require implementation, and complete them.
Work on the analysis for the new driver.
Start coding the API for the new driver, with a structure similar to that of the existing NoSQL driver. After completing this phase, take feedback on whether any additional features specific to this database are expected.
Once the API is working, start the soak tests and check the results by running them on local machines.
Implement unit tests for the new driver to test functions such as content addition, deletion, etc. on a local instance of the database.
Write integration tests for the new driver to check that it works well with the existing implementation of sparsemapcontent.
Document the new driver and its implementation details on Confluence.

I have started working with the technologies and the codebase and understand the basic structure. I plan to finish the learning phase before the coding stage, and I will try my best to complete the work on the existing database driver as early as possible so that I have more time for analysis and for thinking through the implementation of the new driver.

Progress to date:

1. My first task was to sort out the Cassandra serialization issues. As this was my first task, Ian guided me wonderfully through it and made sure I stuck to Sakai standards while focusing on efficiency. The task was essentially to write methods that convert an Object to a byte stream and vice versa. For those interested, the code is here: https://github.com/ieb/sparsemapcontent/blob/master/src/main/java/org/sakaiproject/nakamura/lite/types/Types.java
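To give a feel for what such a conversion involves, here is a minimal sketch in Java. It is not the code in Types.java: the class name MapCodec, the type tags, and the set of supported value types are all assumptions made for illustration. It writes a map of properties to a byte array with a one-byte type tag per value and reads it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the sparsemapcontent Types class.
final class MapCodec {
    // Assumed type tags; the real driver may use a different scheme.
    private static final int TYPE_STRING = 1;
    private static final int TYPE_LONG = 2;
    private static final int TYPE_BOOLEAN = 3;

    // Serialize a property map: entry count, then key, type tag, value for each entry.
    static byte[] toBytes(Map<String, Object> map) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(map.size());
            for (Map.Entry<String, Object> e : map.entrySet()) {
                out.writeUTF(e.getKey());
                Object v = e.getValue();
                if (v instanceof String) {
                    out.writeByte(TYPE_STRING);
                    out.writeUTF((String) v);
                } else if (v instanceof Long) {
                    out.writeByte(TYPE_LONG);
                    out.writeLong((Long) v);
                } else if (v instanceof Boolean) {
                    out.writeByte(TYPE_BOOLEAN);
                    out.writeBoolean((Boolean) v);
                } else {
                    // Nulls and other types are rejected in this sketch.
                    throw new IllegalArgumentException("Unsupported value for key " + e.getKey());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Rebuild the map by reading entries in the order they were written.
    static Map<String, Object> fromBytes(byte[] data) {
        Map<String, Object> map = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                int type = in.readByte();
                switch (type) {
                    case TYPE_STRING:
                        map.put(key, in.readUTF());
                        break;
                    case TYPE_LONG:
                        map.put(key, in.readLong());
                        break;
                    case TYPE_BOOLEAN:
                        map.put(key, in.readBoolean());
                        break;
                    default:
                        throw new IOException("Unknown type tag " + type);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return map;
    }
}
```

The type tag is what lets the reader recover the original Java type, so a Long written to storage does not come back as a String.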

2. My second task was to implement indexing for efficient retrieval. This, too, was implemented under Ian’s guidance. The task included finding the columns that were supposed to be indexed. The implementation links are:

a) https://github.com/ieb/sparsemapcontent/blob/master/src/main/java/org/sakaiproject/nakamura/lite/storage/cassandra/CassandraClient.java

b) https://github.com/ieb/sparsemapcontent/blob/master/src/main/java/org/sakaiproject/nakamura/lite/storage/cassandra/CassandraClientPool.java

c) https://github.com/ieb/sparsemapcontent/blob/master/src/main/java/org/sakaiproject/nakamura/lite/ConfigurationImpl.java
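The general idea of indexing selected columns can be sketched in plain Java as below. This is an in-memory model only, not the Cassandra driver code linked above: the class IndexedStore, its methods, and the "column:value" index key format are hypothetical, and plain maps stand in for column families. Every write to an indexed column also records the row key under an index entry, so a lookup by value does not have to scan every row.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of a column store with a secondary index.
final class IndexedStore {
    private final Map<String, Map<String, Object>> rows = new HashMap<>();
    // Index entry "column:value" -> row keys that currently hold that value.
    private final Map<String, Set<String>> index = new HashMap<>();
    private final Set<String> indexedColumns;

    IndexedStore(Set<String> indexedColumns) {
        this.indexedColumns = new HashSet<>(indexedColumns);
    }

    // Simplification: assumes column names do not contain ':'.
    private static String indexKey(String column, Object value) {
        return column + ":" + String.valueOf(value);
    }

    // Store a row, replacing any previous version and keeping the index in step.
    void put(String rowKey, Map<String, Object> columns) {
        Map<String, Object> old = rows.get(rowKey);
        if (old != null) {
            removeFromIndex(rowKey, old);
        }
        rows.put(rowKey, new HashMap<>(columns));
        for (Map.Entry<String, Object> e : columns.entrySet()) {
            if (indexedColumns.contains(e.getKey())) {
                index.computeIfAbsent(indexKey(e.getKey(), e.getValue()), k -> new TreeSet<>())
                     .add(rowKey);
            }
        }
    }

    // Drop stale index entries for the old version of a row.
    private void removeFromIndex(String rowKey, Map<String, Object> columns) {
        for (Map.Entry<String, Object> e : columns.entrySet()) {
            if (!indexedColumns.contains(e.getKey())) {
                continue;
            }
            String key = indexKey(e.getKey(), e.getValue());
            Set<String> keys = index.get(key);
            if (keys != null) {
                keys.remove(rowKey);
                if (keys.isEmpty()) {
                    index.remove(key);
                }
            }
        }
    }

    // Look up row keys by an indexed column value without scanning all rows.
    Set<String> find(String column, Object value) {
        if (!indexedColumns.contains(column)) {
            throw new IllegalArgumentException("Column is not indexed: " + column);
        }
        Set<String> keys = index.get(indexKey(column, value));
        return keys == null ? Collections.<String>emptySet() : new TreeSet<>(keys);
    }
}
```

Restricting the index to a configured set of columns keeps the extra writes proportional to what queries actually need, which is why deciding which columns to index is part of the task.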

3. My third task is to implement find on the Cassandra driver. This task is still under way.

My primary interest is to create a product of great significance which would be useful and accepted by the entire community.

A note on my mentor (Ian Boston):

GSoC seemed very intimidating at the outset, but Ian’s support really carried me through. A thorough, detail-oriented person, Ian has always been supportive and has encouraged me to write efficient code. The fact that such a busy man explains each concept in so much detail really impresses me. When I make a mistake, Ian not only rushes to correct me but also shows me alternative and better approaches to the implementation. Working under his guidance is going to be really fruitful this summer, not to mention fun.

I will post about further progress soon, and of course I would love comments. My e-mail address is kotwal.aadish@gmail.com.

Looking forward to an awesome summer... :)

Links:

[1]: https://confluence.sakaiproject.org/display/KERNDOC/Sparse+Map+Content+-+Developer+Information

[2]: http://oreilly.com/catalog/0636920018537