A Cassandra Hardware Stack – Dell C1100’s – OCZ Vertex 2 SSD’s with Sandforce – Arista 7048’s
Over the past nine months I have delved into the world of providing hardware to support our applications teams use of the Cassandra datastore. This has turned out to be a somewhat unique challenge as the platform is rapidly evolving along with our use case of the platform. Cassandra is a very different beast compared to your traditional RDBMS (as you would expect).
I absolutely love the fact that Cassandra has a clear scaling path to allow massive datasets and it runs on very cheap commodity hardware with local storage. It is built with the expectation of underlying hardware failure. This is wonderful from an operations perspective as it means I can buy extremely cheap “consumer grade” hardware without having to buy “enterprise grade” (whatever that really means) servers and storage for $$$.
Before I dive into my findings, I should point out that this is not one size fits all solution as it greatly depends on what your dataset looks like and what your read/write patterns are. Our dataset happens to be billions of exceedingly small records. This means we do an incredible amount of random read i/o. Your milage may vary depending on what you do with it.
Finding the optimal node size
As usual, spec’ing out hardware for a given application is a matter of balancing five variables:
- CPU capacity (taking into account the single/multi threaded aspects of the application)
- RAM capacity (how much working space does the application need and how much cache is optimal)
- Disk capacity (actual disk storage space)
- Disk i/o performance (the number of read and write requests per second that can be handled)
- Network capacity (how much bandwidth is needed)
If you run into a bottleneck on any of these five items, any additional capacity that is available within the other four categories is wasted. The procedure to determine optimal is as follows:
- Determine which of the five variables is going to be your limiting factor through performance testing
- Research the most cost-effective price/performance point for the limiting variable
- Spec out hardware to meet the other four variables needs relative to the bottleneck
Note that this is a somewhat iterative process as (for example) it may make sense to buy a CPU significantly beyond the price/performance sweet spot (when looking at CPU pricing in a vacuum) as paying for that higher end CPU may allow you to make much better use of the other pieces of the system that would otherwise sit idle. I am not suggesting that most Cassandra shops will be CPU bound, but this is just an example.
There is also fuzziness in this process as there can be some interdependencies between the variables (i.e. increasing system RAM can reduce disk i/o needs due to increased caching).
If you are at all familiar with the current server-platform market then you know that Nehalem microarchitecture (you need to read the Wikipedia article) based servers are the platform of choice today with the Westmere processors being the current revision within that series. In-general, the most cost effective solution when scaling large systems out on Nehalem platforms is to go with dual processor machines as this gives you twice the amount of processing power and system memory without doubling your costs (i.e. you still only need one motherboard, power supplies, etc…)
All of the major OEMs have structured their mainline platforms around this dual processor model. Note that there ARE situations where dual processors don’t make sense including:
- Single threaded applications that can not make use of all those cores and that do NOT need the additional memory capacity.
- Applications that are purely Disk i/o or network bound where the additional CPU and memory would be wasted (perhaps a file server).
- Applications that need less than a “full” machine (i.e. your DNS/DHCP servers).
In general, I don’t think Cassandra falls into these special use case scenarios, unless your just completely i/o bound or network bound and can’t solve them in another way other than adding more nodes. You may need that second processor however just for the memory controllers it contains (i.e. it gives you twice as many ram slots). If you are i/o bound you can consider SSD’s, and if you are network bound you can leverage 10 gigabit network interfaces.
In looking at platforms to run Cassandra on, we wanted a vanilla Nehalem platform to run on, without too many bells and whistles. If you drink the Cassandra kool-aid you will let Cassandra handle all the reliability needs and purchase hardware without node level fault tolerance (i.e. disk RAID). This means putting disks in a RAID 0 (for optimal speed and capacity) but then letting the fact that Cassandra can store multiple copies of the data across other nodes handle fault recovery. We are currently using linux kernel RAID, but may also test hardware RAID 0 that is available on the platform we ended up choosing.
It is shocking to me to see how many OEM’s have come up with platforms that do not have equal numbers of RAM slots per memory channel. News flash folks- In Nehalem it is critical to install memory in equal sets of 3 (or 6 for dual processor) in order to take advantage of memory interleaving. Every server manufactured should have a number of memory slots divisible by three as the current crop of processors has three memory controllers per processor (this may change in the next generation of processors).
A note about chipsets – The Intel 5500 vs. 5520 – The main difference here is just in the number of PCIe paths the chipset provides. They should both provide equivalent performance. The decision point here is made by your OEM and is just based on the number of PCI devices your platform supports.
Our platform choice
In looking at platform options, the following options were lead contenders (there are of course many other possible options, but most are too focused on the enterprise market with features we do not need that just drive costs up):
- Supermicro – SuperServer 1026T-6RF+ (they also have similar versions with 4x gig ports and 10 gig options)
- Dell C1100’s / 2100’s
- SGI Rackable – C1103
At first we were looking at 1U machines with 4x 3.5 inch bays (and in fact bought some C1100’s in this configuration) though it turned out that Cassandra was extremely i/o bound which made a small number of large SATA disks impractical. Once we realized we were going to need a larger number of drives we decided to go with 1U platforms that supported 2.5 inch bays as we can put eight to ten 2.5 inch drives in a 1U to give us more spindles (if we go with disks), or more SSD’s (for the disk capacity rather than iops) if we go with SSD’s. It’s also worth noting that the 2.5 inch SATA drives draw a lot less power than the 3.5 inch SATA disks of the same capacity.
We ended up going with the Dell C1100 platforms (over the Supermicro offering) as we already had purchasing relationships with Dell and they have a proven track record of being able to support systems throughout a lifecycle (provide “like” replacement parts, etc…), though on this particular order they fell down in numerous ways (mostly related to their recent outsourcing of production to Mexico) which has caused us to re-evaluate future purchasing plans. In the end, the C1100’s have worked out extremely well thus far, but the speed-bumps along the way were painful. We have not physically tested any Supermicro offerings so perhaps they have as bad (or worse) issues as well.
What we like:
- Inexpensive platform
- Well-targeted to our needs
- Have 18 RAM slots (only populating 12 of them right now with 4 gig sticks)
- Dual Intel nic’s not Broadcom
- They include out of band controllers
- Dual power supplies available (this is the only “redundancy” piece we do purchase)
- Low power consumption
What we don’t like:
- Lead time issues
- Rails with clips that easily break
- Servers arriving DOA
- Using a SAS expander to give 10 bays vs only 8 (we would have rathered the option to only use 8 bays)
- They don’t give us the empty drive sleds to add disks later -> force you to purchase from them at astronomical rates
- The 2 foot IEC to IEC power cords they sent us were only rated to 125 volts (we use 208 volt exclusively)
- Lack of MLC SSD option from factory
OCZ Technology Vertex 2 MLC SSD’s
After purchasing our first round of Dell C1100’s with four SATA disks (one for boot/commit and three in a RAID 0 for data) we rapidly discovered they were EXTREMELY i/o bound. Cassandra does an extremely poor job bringing pertinent data into memory and keeping it there (across a four node cluster we had nearly 200 gigs of RAM as each node has 48 gigs). Things like the fact that Cassandra invalidates cache for any data it writes to disk (rather than writing the data into the cache) make it extremely painful. Cassandra also (in .6) will do a read on all three nodes (assuming your data is replicated three places) in order to do a read-repair, even if the read factor is only set to one. This puts extremely high load on the disks across the cluster in aggregate. I believe in .7 you will be able to tune this down to a more reasonable level.
Our solution was to swap the 1TB SATA disks with 240 gig OCZ Vertex 2 MLC SSD’s which are based on the Sandforce controller. Now normally I would not consider using “consumer grade” MLC SSD’s for an OLTP type application, however, Cassandra is VERY unique in that it NEVER does random write i/o operations and instead does everything with large sequential i/o. This is a huge deal because with MLC SSD’s, random writes can rapidly kill the device as writing into the MLC cells can only be done sequentially and editing any data requires wiping the entire cell and re-writing it.
The Sandforce controller does an excellent job of managing where data is actually placed on the SSD media (it has more space available than what is made available to the O/S so that it can shift where things actually get written). By playing games with how data is written the Sandforce controller is supposed to dramatically improve the lifespan of MLC SSD’s. We will see how it works out over time. 😉
It is unfortunate that Dell does not have an MLC SSD offering, so we ended up buying small SATA disks in order to get the drive sleds, and then going direct to OCZ Technology to buy a ton of their SSD’s. I must say, I have been very happy with OCZ and I am happy to provide contact info if you shoot me an email. I do understand the hesitation Dell has with selling MLC SSD’s, as Cassandra is a very unique use-case (only large sequential writes) and a lot of workloads would probably kill the drives rapidly.
It is also worth noting that our first batch of C1100’s with the 3.5 inch drives were using the onboard Intel ICH10 controller (which has 6 ports), but the second batch of C1100’s with the 10 2.5 inch bays are using an LSI 2008 controller (available on the Dell C1100) with a SAS expander board (since the LSI 2008 only has 8 channels). We are seeing *much* better performance with the LSI 2008 controllers, though that may be simply due to us not having the disks tuned properly on the ICH10 (using native command queueing, DMA mode, etc…) in CentOS 5.5. The OCZ Sandforce based drives are massively fast. 😉
If you are going to have any decent number of machines in your Cassandra cluster I highly recommend keeping spare parts on hand and then just purchasing the slow-boat maintenance contracts (next business day). You *will* loose machines from the cluster due to disk failures, etc (especially since we are using inexpensive parts)… It is much easier to troubleshoot when you can go swap out parts as needed and then follow up after the fact to get the replacement parts.
Since Cassandra is a distributed data store it puts a lot more load on the network than say monolithic applications like Oracle that generally have all their data backended on FibreChannel SAN’s. Particular care must be taken in network design to ensure you don’t have horrible bottlenecks. In our case, our existing network switches did not have enough available ports and their architecture is 8:1 over-subscribed on each gigabit port, which simply would not do. After much investigation, we decided to go with Arista 7048 series switches.
The Arista 7048 switches are 1U, 48 port copper 1 gig, and 4 ports of 10 gig SFP+. This is the same form factor of the Cisco 4948E switches. This form factor is excellent for top-of-rack switching as it provides fully meshed 1 gig connectivity to the servers with 40 gigabit uplink capacity to the core. While the Arista product offering is not as well baked as the Cisco offering (they are rapidly implementing features still), they do have one revolutionary feature that Cisco does not have called MLAG.
MLAG stands for “Multi-Chassis Link Aggregation“. It allows you to physically plug your sever into two separate Arista switches and run LACP between the server and the switches as if both ports were connected to the same switch. This allows you to use *both* ports in a non-blocking mode giving you full access to the 2 gigabits of bandwidth while still having fault-tolerance in the event a switch fails (of course you would drop down to only 1 gig of capacity). We are using this for *all* of our hosts now (using the linux kernel bonding driver) and indeed it works very well.
MLAG also allows you to uplink your switches back to the core in such a way as to keep all interfaces in a forwarding state (i.e. no spanning-tree blocked ports). This is another great feature, though I do need to point out a couple of downsides to MLAG:
- You still have to do all your capacity planning as if you are in a “failed” state. It’s nice to have that extra capacity in case of unexpected conditions, but you can’t count on it if you want to always be fully functional even in the event of a failure.
- When running MLAG one of the switches is the “master” that handles LACP negotiation and spanning-tree for the pair of switches. If there is a software fault in that switch it is very possible that it would take down both paths to your severs (in theory the switches can fall back to independent operation, but we are dealing with *software* here).
It is worth noting that we did not go with 10 gig NIC’s and switches as it does not seem necessary yet with our workload and 10 gig is not quite ready for prime time yet (switches are very expensive, the phy’s draw a lot of power, and cabling is still “weird” – either Coax or Fiber or short distance twisted pair over CAT6, or CAT7 / 7a over 100 meters). I would probably consider going with a server platform that had four 1 gig NIC’s still before going to 10 gig. As of yet I have not seen any Cassandra operations take over 100 megabit of network bandwidth (though my graphs are all heavily averaged down so take that with a grain of salt).
So to recap, we came up with the following:
- Dell C1100’s – 10x 2.5 inch chassis with dual power supplies
- Dual 2.4 ghz E5620 processors
- 12 sticks of 4 gig 1066mhz memory for a total of 48 gigs per node (this processor only supports 1066mhz memory)
- 1x 2.5 inch 500 gig SATA disk for boot / commit
- 3x 2.5 inch OCZ Vertex 2 MLC SSD’s
- The LSI 2008 optional RAID controller (running in JBOD mode, using Linux Kernel RAID)
- Dual on-board Intel NIC’s (no 10 gig NIC’s, though it is an option)
- Pairs of Arista 7048 switches using MLAG and LACP to the hosts
- We did not evaluate the low power processors, they may have made sense for Cassandra, but we did not have the time to look into the
- We just had our Cassandra cluster loose it’s first disk and the data filesystem went read-only on one node, but the Cassandra process continued on running and processing requests. I am surprised by this as I am not sure what state the node was in (what was it doing with writes when it came time to write out the memtables?). We manually killed the Cassandra process on the node.
- The Dell C1100’s did not come set by default in NUMA mode in the BIOS. CentOS 5.5 supports this and so we turned it on. I am not sure how much (if any) performance impact this has on Cassandra.
This is still a rapidly evolving space so I am sure my opinions will change here in a few months, but I wanted to get some of my findings out there for others to make use of. This solution is most certainly not the optimal solution for everyone (and in fact, it remains to be seen if is the optimal solution for us), but hopefully it is a useful datapoint for others that are headed down the same path.
Please feel free as always to post questions below that you feel may be useful to others and I will attempt to answer them, or email me if you want contact information for any of the vendors mentioned above.