Read on for a summary of the final day of JavaOne 2010.
The Cassandra Distributed Database
Relational databases don't scale well. B-trees require read-before-write semantics, and they are fast only until the indexes no longer fit in RAM, at which point they become very, very slow. Traditional scaling of relational databases is almost entirely vertical, and expensive.
A common way to deal with this problem is to add a caching layer. However, caching brings its own problems, such as lack of transparency, stale data for frequently written values, and the "cold cache" problem. Replication also helps scalability, but once write capacity is exceeded, this approach begins to break down. Finally, sharding can be used to spread the load horizontally, but it greatly increases complexity, and rebalancing can be very painful.
eBay popularized the acronym BASE (Basically Available, Soft state, Eventually consistent) as a play on ACID: the assumption is that data won't always be immediately consistent. If your application can work within those constraints, it can scale much better.
Apache Cassandra is a "NoSQL" database which supports automatic replication, is application transparent, and is optimized for fast writes and fault tolerance.
NoSQL Myths
NoSQL is for people who don't understand SQL... Reality: ACID only scales so far, and once you start caching, replicating, etc., you're basically giving it up anyway.
NoSQL is not new; we've had key-value stores for years... Reality: Modern NoSQL stores have virtually nothing in common with things like Berkeley DB.
Only huge sites need to care about scalability... Reality: Lots of small sites grow quickly, and many, many companies already need this.
NoSQL is only appropriate for non-important data... Reality: Somewhat true for stores that do not provide durability; Cassandra does.
Cassandra architecture
Apache Cassandra supports automatic rebalancing when new nodes are added; writes go to both the old and new nodes during the move, which avoids complex recovery logic. The replication strategy is pluggable, allowing strategies optimized for single-datacenter or multi-datacenter availability. Consistency is tunable: reads and writes can involve a single replica, a quorum, or all nodes that hold a key, and the number of replicas written synchronously vs. asynchronously can also be set. For example, synchronously replicating 3 copies and using quorum reads and writes virtually guarantees that readers see consistent data, but is less performant than weaker settings.
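To make that quorum claim concrete, here is a minimal, self-contained sketch of the arithmetic behind it (this is not Cassandra code, just the overlap rule): with N replicas, a read is guaranteed to see the latest acknowledged write whenever R + W > N.

    // Illustration only: the overlap rule behind quorum reads/writes (R + W > N).
    // With N = 3, W = QUORUM (2) and R = QUORUM (2), every read set intersects
    // every write set, so a reader always sees the latest acknowledged write.
    public class QuorumOverlap {

        /** Returns true if any read of r replicas must include a replica
         *  that participated in a write acknowledged by w replicas. */
        static boolean readSeesLatestWrite(int n, int w, int r) {
            return r + w > n;
        }

        public static void main(String[] args) {
            int n = 3;              // replication factor
            int quorum = n / 2 + 1; // 2 of 3
            System.out.println("QUORUM/QUORUM consistent? "
                    + readSeesLatestWrite(n, quorum, quorum)); // true
            System.out.println("ONE/ONE consistent? "
                    + readSeesLatestWrite(n, 1, 1));           // false
        }
    }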
Monitoring is supported via JMX, and many parameters are tunable at runtime. There are a wide variety of statistics available.
The Cassandra data model can be described as loosely schema-less. Each column family (roughly, a table) is implemented as rows of sparse arrays, where each entry stores both a column name and its value. This allows new columns to be added on the fly without schema changes, at the cost of some disk space; in modern systems, disk is cheap, but I/O is not. All access to rows is by primary key, so the system relies on writing data to multiple column families at once to provide multiple "materialized views" of the underlying data.
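A rough way to picture the model in plain Java (purely conceptual; these are ordinary maps, not Cassandra types, and the column names are invented):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // Conceptual model only: a column family as row key -> (column name -> value).
    // Real Cassandra also stores a timestamp per column and sorts columns on disk.
    public class SparseRows {
        public static void main(String[] args) {
            Map<String, Map<String, String>> usersByEmail = new HashMap<String, Map<String, String>>();

            Map<String, String> alice = new HashMap<String, String>();
            alice.put("name", "Alice");
            alice.put("city", "Denver");          // a column only Alice's row has

            Map<String, String> bob = new HashMap<String, String>();
            bob.put("name", "Bob");
            bob.put("last_login", "2010-09-23");  // a column only Bob's row has

            usersByEmail.put("alice@example.com", alice);
            usersByEmail.put("bob@example.com", bob);

            // A second column family acts as a "materialized view" keyed differently;
            // the application writes to both column families at insert time.
            Map<String, Map<String, String>> usersByCity = new HashMap<String, Map<String, String>>();
            usersByCity.put("Denver", Collections.singletonMap("alice@example.com", ""));
        }
    }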
Several APIs are available for communicating with Cassandra. At the lowest level is Thrift, followed by Hector (similar to JDBC), and finally a new library called Kundera which is similar in concept to JPA. Hadoop integration via Pig is also possible.
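To give a feel for the mid-level option, here is a small Hector write/read sketch. This is written from memory against a 0.7-era Hector client, so class names and method signatures may differ between versions, and the cluster, keyspace, and column family names are made up.

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.ColumnQuery;
    import me.prettyprint.hector.api.query.QueryResult;

    // Hector sketch (0.7-era API, from memory): one write and one read by key.
    public class HectorSketch {
        public static void main(String[] args) {
            StringSerializer ss = StringSerializer.get();

            // Cluster, keyspace, and column family names are examples, not real ones.
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("Blog", cluster);

            // Write one column into the "Users" column family.
            Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
            mutator.insert("alice@example.com", "Users",
                    HFactory.createStringColumn("name", "Alice"));

            // Read it back by row key and column name.
            ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
            query.setColumnFamily("Users");
            query.setKey("alice@example.com");
            query.setName("name");
            QueryResult<HColumn<String, String>> result = query.execute();
            System.out.println(result.get().getValue());
        }
    }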
As for when to use Cassandra, the ideal use case is probably a system that is already using a SQL database plus something like memcached. In such a case, Cassandra is probably simpler to manage.
Practical Big Data Processing with MapReduce and Hadoop
This session covered some basic concepts of MapReduce, and showed a demo of Karmasphere Studio, a Netbeans-based tool for doing high-level Hadoop development.
The important thing to understand about data processing is that the critical time is the time between defining the problem and getting the answer. The actual computation might account for only 25% of that, so finding faster ways to get the computation going is a big win.
Why do we need parallel processing? Because even as Moore's law delivers faster hardware, data sizes are growing at about the same rate, and most algorithms don't scale linearly; they are at least O(n log n).
Parallel processing is inherently less efficient (per CPU) than serial processing, but you can't buy time, and CPUs are cheap. So what makes things slow? Synchronization. The answer, then, is to eliminate synchronization, which is what MapReduce does.
In the simplest case, independent data sets allow for naive parallelism with linear scalability, though with high latency. The classic example of a naively parallel algorithm is raytracing: each pixel can be computed independently, so the algorithm scales linearly, up to one pixel per CPU.
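As a toy illustration of that kind of naive parallelism, here is a sketch that computes "pixels" independently on a thread pool; the per-pixel function is a stand-in, not a real ray tracer, and there is no synchronization between tasks.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Toy example of naively parallel work: each pixel is computed independently,
    // so worker threads never need to coordinate with one another.
    public class NaiveParallelism {
        public static void main(String[] args) throws InterruptedException {
            final int width = 640, height = 480;
            final int[] image = new int[width * height];

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            for (int y = 0; y < height; y++) {
                final int row = y;
                pool.submit(new Runnable() {
                    public void run() {
                        for (int x = 0; x < width; x++) {
                            // Stand-in for "trace a ray through this pixel".
                            image[row * width + x] = (row * 31 + x) % 256;
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("Rendered " + image.length + " pixels.");
        }
    }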
Enter MapReduce, which splits processing into separate map, shuffle, sort, and reduce phases, allowing complex algorithms to be broken into much smaller, independent units of work. Not everything translates easily, however, and it's not unusual to see 50-60 chained MapReduce jobs implementing a single algorithm. This is where tools like Karmasphere Studio are useful: they generate the boilerplate needed to set up Hadoop MapReduce jobs, simulate a Big Data processing run, and give real-time feedback on potential inefficiencies and bottlenecks.
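For a concrete feel of the map and reduce phases, here is the canonical word count job written against the Hadoop 0.20 MapReduce API: the mapper emits (word, 1) pairs, the shuffle/sort groups them by word, and the reducer sums the counts.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: the framework has already grouped and sorted values by word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }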
Building Enterprise Web Applications with Spring 3.0 and Spring 3.0 MVC
This session was a very fast-paced walkthrough of building a complex enterprise web application using Spring. I won't go through the entire demo, but it did illustrate some of the key concepts of working with Spring DI, Spring MVC, and Spring Security. Most of this wasn't new, and is fairly readily available in the Spring documentation, but was interesting anyway.
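For readers who weren't there, this is roughly the flavor of a Spring 3.0 annotation-driven MVC controller with dependency injection; the OrderService collaborator and the view name are invented for illustration, not taken from the session.

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;

    // Sketch of a Spring 3.0 MVC controller; OrderService and the "orders/detail"
    // view are hypothetical, but the annotations are standard Spring 3.0 MVC.
    @Controller
    @RequestMapping("/orders")
    public class OrderController {

        private final OrderService orderService;

        @Autowired
        public OrderController(OrderService orderService) {
            this.orderService = orderService;
        }

        @RequestMapping(value = "/{id}", method = RequestMethod.GET)
        public String show(@PathVariable("id") long id, Model model) {
            model.addAttribute("order", orderService.findOrder(id));
            return "orders/detail";   // resolved to a view by the configured ViewResolver
        }
    }

    // Hypothetical collaborator wired in by the Spring container.
    interface OrderService {
        Object findOrder(long id);
    }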
Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud
This session was a retrospective by OpenLogic, Inc. on deploying complex Big Data processing to a private Hadoop cloud. OpenLogic provides a service that lets companies check their source code for copy-and-paste violations by matching it against a huge library (hundreds of thousands) of open source packages. The service is valuable because there are potentially unwanted licensing implications for companies that inadvertently ship open source code as part of their products.
The OpenLogic stack consists of a web client, the Nginx web server, Ruby on Rails, MySQL, Redis, Solr, Stargate, and HBase. The Hadoop cluster (which runs HBase) contains over 100 CPU cores and 100 TB of disk. Each Hadoop node is brought up anonymously from the same template, allowing easy scale-out without administration headaches.
Why not Amazon?
Amazon is great for burst traffic, but extremely expensive for long-term storage.
Configuration is key
Hadoop has many moving parts, and details matter. Tuning operating system parameters such as the number of allowed open files and process limits can be critical. Use a known-compatible combination of Java VM, Hadoop, and HBase, change only one parameter at a time when debugging, and read the mailing lists (and the code).
Commodity hardware != old PC or desktop
Hadoop hardware should be rack-mount servers (but don't bother with RAID). Use enterprise drives; otherwise the vibration in a rack-mount environment will cause a lot of premature drive failures. Do expect ugly hardware issues, simply due to the sheer volume of hardware.
OpenLogic uses servers with dual quad-core processors, 32-64 GB of RAM, and 6x2 TB enterprise hard drives, with RAID-1 across two drives for the boot partition and the remaining space allocated to Hadoop.
Bandwidth is key
Use dual gigabit NICs (or 10 GigE if possible). HBase needs at least 5 nodes to really perform well, and it depends on ZooKeeper, which requires low-latency connections. Be careful of low-RAM scenarios.
BigData takes a long time...
To do anything. It can also be hard to test, but it's important not to skimp on testing. Backups are also difficult, and might require a second Hadoop cluster or public cloud.
Loading data
Don't use a single machine; the I/O bottleneck will ensure the load takes a very long time to complete. Use MapReduce jobs to partition the data load if possible, and turn off the write-ahead log in HBase during initial data imports. Also, avoid storing large (greater than 5 MB) values in HBase; rows and columns are essentially free, so use them. Finally, avoid committing excessively when using Solr.
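As a sketch of what that looks like with the HBase client API of that era (0.20/0.90 style; the table and column family names here are hypothetical): note that skipping the write-ahead log trades durability for speed, so it only makes sense for imports that can be re-run.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Bulk-load sketch for the 0.20/0.90-era HBase client API.
    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "source_files");   // hypothetical table

            table.setAutoFlush(false);                 // batch puts client-side
            table.setWriteBufferSize(8 * 1024 * 1024);

            for (int i = 0; i < 100000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.setWriteToWAL(false);              // skip the WAL during the initial import only
                put.add(Bytes.toBytes("content"),      // column family
                        Bytes.toBytes("line"),         // qualifier
                        Bytes.toBytes("some small value, well under 5 MB"));
                table.put(put);
            }
            table.flushCommits();
            table.close();
        }
    }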
Getting data out
HBase is NoSQL, so think hash table: access is by key. Use Solr for fast index lookups of the data.
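In practice that means point lookups by row key, roughly like the following sketch (same hypothetical table as above); anything resembling an ad-hoc query goes through the Solr index instead.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // "Hash table" style access: fetch a single row by its key.
    public class PointLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "source_files");   // hypothetical table

            Get get = new Get(Bytes.toBytes("row-42"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("line"));
            System.out.println(value == null ? "(no such column)" : Bytes.toString(value));

            table.close();
        }
    }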
Solr
Solr is a search engine based on Lucene; in OpenLogic's setup it uses Hadoop to store its indexes. With automatic sharding and asynchronous replication, it is both fault tolerant and performant. OpenLogic indexes billions of lines of code, with 20+ fields indexed per source file. HAProxy fronts Solr, balancing writes as well as reads from the slaves.
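An index lookup through the SolrJ client looks roughly like this (a 1.4-era SolrJ sketch; the URL and field names are invented):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // SolrJ query sketch; "code" and "package_name" are hypothetical index fields.
    public class SolrLookup {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("code:\"strcpy(dest, src)\"");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("package_name"));
            }
        }
    }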
BigData is hard
Expect to learn a lot. You will get it wrong on the first, second, and probably third try. Try to find alternate ways to model your data.
Scripting languages can help
MapReduce jobs can be simpler to write in a scripting language, and the HBase shell is itself JRuby-based.
Using Public Clouds (Amazon EC2)
This isn't really practical financially. 100 TB of storage on EBS runs about $120,000 per year, and 20 "super huge" CPU instances add another $175,000 per year, for a total of nearly $300,000, or about six times what it costs OpenLogic to host their own cloud.
Open Source is key
None of this would be possible without a large amount of open source software: Hadoop, HBase, Solr, Tomcat, Zookeeper, etc.
Expect things to fail... a lot
Hardware, software, and your own code and data can fail in unpredictable and unforeseen ways. Monitoring matters!
Conclusion and Wrap Up
JavaOne 2010 was an intense four days, but overall I'd say it was a positive experience. The landscape has definitely changed since I last attended in 2002, and it's good to know that Oracle is committed to keeping the platform alive and moving forward.