
A Cassandra Hardware Stack – Dell C1100’s – OCZ Vertex 2 SSD’s with Sandforce – Arista 7048’s

October 24th, 2010

Over the past nine months I have delved into the world of providing hardware to support our application teams' use of the Cassandra datastore.  This has turned out to be a somewhat unique challenge, as Cassandra is rapidly evolving and our use case is evolving right along with it.  Cassandra is a very different beast compared to your traditional RDBMS (as you would expect).

I absolutely love the fact that Cassandra has a clear scaling path to massive datasets and that it runs on very cheap commodity hardware with local storage.  It is built with the expectation that the underlying hardware will fail. This is wonderful from an operations perspective as it means I can buy extremely cheap "consumer grade" hardware without having to spend $$$ on "enterprise grade" (whatever that really means) servers and storage.

Before I dive into my findings, I should point out that this is not a one-size-fits-all solution, as it greatly depends on what your dataset looks like and what your read/write patterns are.  Our dataset happens to be billions of exceedingly small records.  This means we do an incredible amount of random read i/o.  Your mileage may vary depending on what you do with it.

Finding the optimal node size

As usual, spec’ing out hardware for a given application is a matter of balancing five variables:

  • CPU capacity (taking into account the single/multi threaded aspects of the application)
  • RAM capacity (how much working space does the application need and how much cache is optimal)
  • Disk capacity (actual disk storage space)
  • Disk i/o performance (the number of read and write requests per second that can be handled)
  • Network capacity (how much bandwidth is needed)

If you run into a bottleneck on any of these five items, any additional capacity that is available within the other four categories is wasted.  The procedure to determine the optimal configuration is as follows (a small sizing sketch follows the list):

  1. Determine which of the five variables is going to be your limiting factor through performance testing
  2. Research the most cost-effective price/performance point for the limiting variable
  3. Spec out hardware to meet the other four variables' needs relative to the bottleneck
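
To make this concrete, here is a small Python sketch of the bottleneck-hunting step (all the capacity and demand figures below are made-up placeholders, not measurements from our cluster):

```python
# Illustrative node-sizing sketch: which of the five resources runs out first?
# Every figure below is a placeholder for whatever your performance testing shows.

node = {
    "cpu_cores": 8,       # usable cores per node
    "ram_gb": 48,         # installed RAM
    "disk_gb": 700,       # usable data capacity (after RAID 0 / overhead)
    "disk_iops": 400,     # sustainable random read IOPS
    "net_mbps": 1000,     # usable network bandwidth
}

demand = {                # per-node demand implied by the expected workload
    "cpu_cores": 3,
    "ram_gb": 40,
    "disk_gb": 450,
    "disk_iops": 900,     # e.g. lots of small random reads
    "net_mbps": 80,
}

# Utilization of each resource; the highest ratio is the limiting factor.
utilization = {k: demand[k] / node[k] for k in node}
for name, ratio in sorted(utilization.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {ratio:5.0%}")
print("Limiting factor:", max(utilization, key=utilization.get))
```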

Note that this is a somewhat iterative process as (for example) it may make sense to buy a CPU significantly beyond the price/performance sweet spot (when looking at CPU pricing in a vacuum) as paying for that higher end CPU may allow you to make much better use of the other pieces of the system that would otherwise sit idle.  I am not suggesting that most Cassandra shops will be CPU bound, but this is just an example.

There is also fuzziness in this process as there can be some interdependencies between the variables (e.g. increasing system RAM can reduce disk i/o needs due to increased caching).

Nehalem platforms

If you are at all familiar with the current server-platform market then you know that servers based on the Nehalem microarchitecture (the Wikipedia article is worth reading) are the platform of choice today, with the Westmere processors being the current revision within that series.  In general, the most cost-effective solution when scaling out large systems on Nehalem platforms is to go with dual processor machines, as this gives you twice the amount of processing power and system memory without doubling your costs (i.e. you still only need one motherboard, power supplies, etc…).

All of the major OEMs have structured their mainline platforms around this dual processor model.  Note that there ARE situations where dual processors don’t make sense, including:

  1. Single-threaded applications that can not make use of all those cores and that do NOT need the additional memory capacity.
  2. Applications that are purely disk i/o or network bound, where the additional CPU and memory would be wasted (perhaps a file server).
  3. Applications that need less than a “full” machine (e.g. your DNS/DHCP servers).

In general, I don’t think Cassandra falls into these special use case scenarios, unless you’re just completely i/o bound or network bound and can’t solve that any way other than by adding more nodes.  You may need that second processor just for the memory controllers it contains, however (i.e. it gives you twice as many RAM slots).  If you are i/o bound you can consider SSDs, and if you are network bound you can leverage 10 gigabit network interfaces.

In looking at platforms to run Cassandra on, we wanted a vanilla Nehalem platform without too many bells and whistles.  If you drink the Cassandra kool-aid you will let Cassandra handle all the reliability needs and purchase hardware without node level fault tolerance (i.e. disk RAID).  This means putting disks in a RAID 0 (for optimal speed and capacity) and relying on the fact that Cassandra stores multiple copies of the data on other nodes to handle fault recovery.  We are currently using Linux kernel RAID, but may also test the hardware RAID 0 that is available on the platform we ended up choosing.
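
For a sense of how often you will be rebuilding those RAID 0 data volumes, here is a rough back-of-envelope sketch; the failure rate is an assumed placeholder, not a measured number:

```python
# Rough back-of-envelope: how often a RAID 0 data volume dies, assuming
# independent disk failures and a placeholder annualized failure rate (AFR).
disk_afr = 0.03           # assumed 3% annual failure rate per cheap disk
disks_in_raid0 = 3        # data disks striped together per node

# RAID 0 loses the whole volume if ANY member disk fails.
node_volume_afr = 1 - (1 - disk_afr) ** disks_in_raid0
print(f"Chance a node loses its data volume in a year: {node_volume_afr:.1%}")

# With the data replicated to multiple nodes, losing one node's volume costs
# no data; you rebuild the node and let Cassandra re-replicate, so this is an
# estimate of operational churn, not of data loss.
```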

It is shocking to me to see how many OEMs have come up with platforms that do not have equal numbers of RAM slots per memory channel.  News flash, folks: with Nehalem it is critical to install memory in equal sets of 3 (or 6 for dual processor) in order to take advantage of memory interleaving.  Every server manufactured should have a number of memory slots divisible by three, as the current crop of processors has three memory channels per processor (this may change in the next generation of processors).
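
A trivial sanity check you can run against any proposed DIMM configuration (the two-socket, three-channel figures match the processors discussed here; adjust for your platform):

```python
# Check that a DIMM count can be spread evenly across all memory channels
# (3 channels per socket on Nehalem/Westmere) so interleaving stays intact.
def dimms_balanced(dimms: int, sockets: int = 2, channels_per_socket: int = 3) -> bool:
    return dimms % (sockets * channels_per_socket) == 0

print(dimms_balanced(12))   # True  - 12 DIMMs over 6 channels = 2 per channel
print(dimms_balanced(16))   # False - 16 DIMMs cannot be spread evenly over 6 channels
```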

A note about chipsets – the Intel 5500 vs. 5520 – the main difference here is just the number of PCIe lanes the chipset provides.  They should both provide equivalent performance.  The decision is made for you by your OEM and is simply based on the number of PCIe devices your platform supports.

Our platform choice

In looking at platform options, the lead contenders were the Dell C1100 and a comparable Supermicro offering (there are of course many other possible options, but most are too focused on the enterprise market, with features we do not need that just drive costs up).

At first we were looking at 1U machines with 4x 3.5 inch bays (and in fact bought some C1100s in this configuration), though it turned out that Cassandra was extremely i/o bound, which made a small number of large SATA disks impractical.  Once we realized we were going to need a larger number of drives, we decided to go with 1U platforms that supported 2.5 inch bays, as we can put eight to ten 2.5 inch drives in a 1U.  That gives us more spindles if we go with disks, or more capacity (rather than iops) if we go with SSDs.  It’s also worth noting that 2.5 inch SATA drives draw a lot less power than 3.5 inch SATA disks of the same capacity.

We ended up going with the Dell C1100 platform (over the Supermicro offering) as we already had a purchasing relationship with Dell and they have a proven track record of supporting systems throughout a lifecycle (providing “like” replacement parts, etc…), though on this particular order they fell down in numerous ways (mostly related to their recent outsourcing of production to Mexico), which has caused us to re-evaluate future purchasing plans.  In the end, the C1100s have worked out extremely well thus far, but the speed-bumps along the way were painful.  We have not physically tested any Supermicro offerings, so perhaps they have issues that are just as bad (or worse).

What we like:

  • Inexpensive platform
  • Well-targeted to our needs
    • Have 18 RAM slots (we are only populating 12 of them right now with 4 gig sticks)
    • Dual Intel NICs, not Broadcom
    • They include out-of-band management controllers
    • Dual power supplies available (this is the only “redundancy” piece we do purchase)
  • Low power consumption
  • Quiet

What we don’t like:

  • Lead time issues
  • Rails with clips that easily break
  • Servers arriving DOA
  • Using a SAS expander to provide 10 bays vs. only 8 (we would rather have had the option to use only 8 bays)
  • They don’t include the empty drive sleds needed to add disks later, forcing you to purchase drives from them at astronomical rates
  • The 2 foot IEC to IEC power cords they sent us were only rated to 125 volts (we use 208 volt exclusively)
  • Lack of MLC SSD option from factory

OCZ Technology Vertex 2 MLC SSDs

After purchasing our first round of Dell C1100s with four SATA disks (one for boot/commit and three in a RAID 0 for data) we rapidly discovered they were EXTREMELY i/o bound.  Cassandra does an extremely poor job of bringing pertinent data into memory and keeping it there (across a four node cluster we had nearly 200 gigs of RAM, as each node has 48 gigs).  Things like the fact that Cassandra invalidates the cache for any data it writes to disk (rather than writing the data into the cache) make it extremely painful.  Cassandra (in 0.6) will also do a read on all three nodes (assuming your data is replicated to three places) in order to do a read-repair, even if the read consistency level is only set to one.  This puts extremely high aggregate load on the disks across the cluster.  I believe in 0.7 you will be able to tune this down to a more reasonable level.
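
To see why this read amplification hurts so much, here is a rough sketch of the arithmetic; the request rate, cache hit rate, and per-disk IOPS figures are assumptions for illustration, not numbers pulled from our monitoring:

```python
# Rough estimate of aggregate disk read load when every client read touches
# all replicas (0.6-style read-repair). All inputs are illustrative guesses.
client_reads_per_sec = 2000     # cluster-wide client read rate
replication_factor = 3          # copies of each row
cache_hit_rate = 0.20           # fraction of replica reads served from memory
iops_per_sata_disk = 120        # random read IOPS of a 7200 rpm SATA disk (rough)
data_disks_per_node = 3
nodes = 4

disk_reads_needed = client_reads_per_sec * replication_factor * (1 - cache_hit_rate)
disk_reads_available = nodes * data_disks_per_node * iops_per_sata_disk

print(f"Random reads needed   : {disk_reads_needed:,.0f}/s")
print(f"Random reads available: {disk_reads_available:,.0f}/s")
# 4,800 needed vs. 1,440 available -- the spindles simply cannot keep up.
```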

Our solution was to swap the 1TB SATA disks for 240 gig OCZ Vertex 2 MLC SSDs, which are based on the Sandforce controller.  Now, normally I would not consider using “consumer grade” MLC SSDs for an OLTP type application; however, Cassandra is VERY unique in that it NEVER does random write i/o operations and instead does everything with large sequential i/o.  This is a huge deal because random writes can rapidly kill an MLC SSD: MLC flash can only be written sequentially within an erase block, and editing any data requires wiping an entire block and re-writing it.

The Sandforce controller does an excellent job of managing where data is actually placed on the SSD media (it has more raw flash than what is made available to the O/S, so it can shift where things actually get written).  By playing games with how data is written, the Sandforce controller is supposed to dramatically improve the lifespan of MLC SSDs.  We will see how it works out over time.  😉
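
For a rough feel of what that means for drive lifespan under a purely sequential write load, here is a back-of-envelope sketch; the erase-cycle rating, write-amplification, and daily write volume figures are assumptions, not Sandforce or OCZ specifications:

```python
# Back-of-envelope MLC SSD endurance estimate. The erase-cycle rating, write
# amplification, and daily write volume are assumptions for illustration only.
drive_capacity_gb = 240
pe_cycles = 5000              # assumed MLC program/erase cycle rating
write_amplification = 1.2     # assumed to stay low for large sequential writes
host_writes_gb_per_day = 200  # assumed daily flush + compaction write volume

total_host_writes_gb = drive_capacity_gb * pe_cycles / write_amplification
lifespan_years = total_host_writes_gb / host_writes_gb_per_day / 365
print(f"Rough drive lifespan: {lifespan_years:.1f} years")
```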

It is unfortunate that Dell does not have an MLC SSD offering, so we ended up buying small SATA disks just to get the drive sleds, and then went directly to OCZ Technology to buy a ton of their SSDs.  I must say, I have been very happy with OCZ, and I am happy to provide contact info if you shoot me an email.  I do understand Dell’s hesitation with selling MLC SSDs, as Cassandra is a very unique use-case (only large sequential writes) and a lot of workloads would probably kill the drives rapidly.

It is also worth noting that our first batch of C1100s with the 3.5 inch drives used the onboard Intel ICH10 controller (which has 6 ports), but the second batch of C1100s with the ten 2.5 inch bays use an LSI 2008 controller (available as an option on the Dell C1100) with a SAS expander board (since the LSI 2008 only has 8 ports).  We are seeing *much* better performance with the LSI 2008 controllers, though that may simply be due to us not having the disks tuned properly on the ICH10 (native command queueing, DMA mode, etc…) in CentOS 5.5.  The OCZ Sandforce based drives are massively fast.  😉
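
If you want to compare controller and drive combinations yourself, even something as simple as the following random-read probe will show the difference; treat it as a minimal sketch (the file path is a placeholder, and for serious numbers you should reach for a dedicated benchmarking tool such as fio):

```python
# Minimal random-read latency probe for comparing a controller/drive combo.
# Point PATH at a large existing file on the volume under test (placeholder).
import os, random, time

PATH = "/data/cassandra/test.bin"   # placeholder path
READ_SIZE = 4096
SAMPLES = 2000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
# Ask the kernel to drop cached pages for this file so reads hit the device.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)

start = time.time()
for _ in range(SAMPLES):
    offset = random.randrange(0, size - READ_SIZE)
    os.pread(fd, READ_SIZE, offset)
elapsed = time.time() - start
os.close(fd)

print(f"{SAMPLES / elapsed:,.0f} random {READ_SIZE}-byte reads/sec "
      f"({elapsed / SAMPLES * 1000:.2f} ms average)")
```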

If you are going to have any decent number of machines in your Cassandra cluster, I highly recommend keeping spare parts on hand and then just purchasing the slow-boat maintenance contracts (next business day).  You *will* lose machines from the cluster due to disk failures, etc. (especially since we are using inexpensive parts).  It is much easier to troubleshoot when you can go swap out parts as needed and then follow up after the fact to get the replacement parts.

Networking

Since Cassandra is a distributed data store, it puts a lot more load on the network than, say, monolithic applications like Oracle that generally have all their data back-ended on FibreChannel SANs.  Particular care must be taken in network design to ensure you don’t have horrible bottlenecks.  In our case, our existing network switches did not have enough available ports and their architecture is 8:1 over-subscribed on each gigabit port, which simply would not do.  After much investigation, we decided to go with Arista 7048 series switches.

The Arista 7048 switches are 1U, with 48 ports of copper 1 gig and 4 ports of 10 gig SFP+.  This is the same form factor as the Cisco 4948E switches.  This form factor is excellent for top-of-rack switching as it provides fully meshed 1 gig connectivity to the servers with 40 gigabits of uplink capacity to the core.  While the Arista product offering is not as well baked as the Cisco offering (they are still rapidly implementing features), they do have one revolutionary feature that Cisco does not: MLAG.

MLAG stands for “Multi-Chassis Link Aggregation”.  It allows you to physically plug your server into two separate Arista switches and run LACP between the server and the switches as if both ports were connected to the same switch.  This allows you to use *both* ports in a non-blocking mode, giving you full access to the 2 gigabits of bandwidth while still having fault-tolerance in the event a switch fails (of course you would drop down to only 1 gig of capacity).  We are using this for *all* of our hosts now (using the Linux kernel bonding driver) and indeed it works very well.

MLAG also allows you to uplink your switches back to the core in such a way as to keep all interfaces in a forwarding state (i.e. no spanning-tree blocked ports).  This is another great feature, though I do need to point out a couple of downsides to MLAG:

  1. You still have to do all your capacity planning as if you are in a “failed” state.  It’s nice to have that extra capacity in case of unexpected conditions, but you can’t count on it if you want to always be fully functional even in the event of a failure.
  2. When running MLAG, one of the switches is the “master” that handles LACP negotiation and spanning-tree for the pair of switches.  If there is a software fault in that switch it is very possible that it would take down both paths to your servers (in theory the switches can fall back to independent operation, but we are dealing with *software* here).

It is worth noting that we did not go with 10 gig NICs and switches, as it does not seem necessary for our workload and 10 gig is not quite ready for prime time yet (switches are very expensive, the PHYs draw a lot of power, and cabling is still “weird” – either coax, fiber, short distance twisted pair over CAT6, or CAT7/7a up to 100 meters).  I would probably still consider going with a server platform that had four 1 gig NICs before going to 10 gig.  As of yet I have not seen any Cassandra operations take over 100 megabits of network bandwidth (though my graphs are all heavily averaged down, so take that with a grain of salt).
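
As a sanity check on sticking with 1 gig, here is the kind of back-of-envelope bandwidth estimate involved; every workload figure in it is an assumption for illustration:

```python
# Rough per-node network bandwidth estimate for a small-record Cassandra
# workload. All workload figures are illustrative assumptions.
writes_per_sec = 5000        # cluster-wide write rate
avg_record_bytes = 500       # exceedingly small records
replication_factor = 3
nodes = 4
overhead_factor = 1.5        # rough allowance for framing, gossip, read traffic

cluster_bytes_per_sec = writes_per_sec * avg_record_bytes * replication_factor * overhead_factor
mbit_per_node = cluster_bytes_per_sec * 8 / nodes / 1_000_000
print(f"~{mbit_per_node:.0f} Mbit/s per node")  # comfortably inside a single 1 GbE link
```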

Summary

So to recap, we came up with the following:

  • Dell C1100s – 10x 2.5 inch chassis with dual power supplies
  • Dual 2.4 GHz E5620 processors
  • 12 sticks of 4 gig 1066 MHz memory for a total of 48 gigs per node (this processor only supports 1066 MHz memory)
  • 1x 2.5 inch 500 gig SATA disk for boot / commit
  • 3x 2.5 inch OCZ Vertex 2 MLC SSDs
  • The LSI 2008 optional RAID controller (running in JBOD mode, using Linux kernel RAID)
  • Dual on-board Intel NICs (no 10 gig NICs, though they are an option)
  • Pairs of Arista 7048 switches using MLAG and LACP to the hosts

Notes:

  • We did not evaluate the low power processors; they may have made sense for Cassandra, but we did not have the time to look into them.
  • We just had our Cassandra cluster lose its first disk: the data filesystem went read-only on one node, but the Cassandra process continued running and processing requests.  I am surprised by this, as I am not sure what state the node was in (what was it doing with writes when it came time to write out the memtables?).  We manually killed the Cassandra process on the node.
  • The Dell C1100s did not come with NUMA mode enabled by default in the BIOS.  CentOS 5.5 supports it, so we turned it on.  I am not sure how much (if any) performance impact this has on Cassandra.

Conclusion

This is still a rapidly evolving space, so I am sure my opinions will change here in a few months, but I wanted to get some of my findings out there for others to make use of.  This solution is most certainly not the optimal solution for everyone (and in fact, it remains to be seen if it is the optimal solution for us), but hopefully it is a useful datapoint for others that are headed down the same path.

Please feel free as always to post questions below that you feel may be useful to others and I will attempt to answer them, or email me if you want contact information for any of the vendors mentioned above.

-Eric

Categories: Cassandra, Dell, Network, Systems

Are blade servers right for my environment?

July 15th, 2009

IT, like most industries, has its fads, whether it be virtualization, SANs, or blade servers.  Granted, these three technologies play really nicely together, but once in a while you need to get off the bandwagon for a moment and think about what these technologies really do for us. While they are very cool overall and can make an extremely powerful team, as with anything, there is a right place, time, and situation/environment for their use.  Blades are clearly the “wave of the future” in many respects, but you must be cautious about the implications of implementing them today.

Please do not read this article and come away thinking I am “anti-blade” as that is certainly not the case.  I just feel they are all too often pushed into service in situations they are not the correct solution for and would like to point out some potential pitfalls.

Lifecycle co-termination

When you buy a blade center, one of the main selling points is that the network, SAN, and KVM infrastructure is built in.  This is great in terms of ease of deployment and management; however, on the financial side of things you must realize that the life spans of these items are not normally the same.  When buying servers I typically expect them to be in service for 4 years; KVMs (while becoming less utilized, actually) can last much longer under most circumstances (barring changes in technology from PS/2 to USB, etc…); network switches I expect to use in some capacity or another for seven years; and SAN switches will probably have a life-cycle similar to the storage arrays they are attached to, which I generally target at 5 year life spans.

So what does this mean?  Well, if your servers are showing their age in 4 years, you are likely to end up replacing the entire blade enclosure at that point, which includes the SAN and network switches.  It is possible the vendor will still sell blades that fit in that enclosure; however, you are likely to want a SAN or network upgrade before the end of that second set of servers’ life-cycles, which will likely result in whole new platforms being purchased anyway.

Vendor lock

You have just created vendor lock such that, with all the investment in enclosures, you can’t go buy someone else’s servers (this really sucks when your vendor fails to innovate on a particular technology).  All the manufacturers realize this situation exists and will surely use it to their advantage down the road.  It is hard to threaten not to buy Dell blades to put in your existing enclosures when that would mean throwing away your investment in SAN and network switches.

SAN design

Think about your SAN design – most shops hook servers to a SAN switch which is directly attached to the storage array their data lives on.  Blade enclosures encourage the use of many more, smaller SAN switches, which often requires hooking the blade enclosure switches to other aggregation SAN switches which are then hooked to the storage processor.  This increases complexity, increases failure points, decreases MTBF, and increases vendor lock.  Trunking SAN switches together from different vendors can be problematic and may require putting them in a compatibility mode which turns off useful features.

Vendor compatibility

Vendor compatibility becomes a huge issue.  Say you buy a blade enclosure today with 4 gig Brocade SAN switches in it for use with your existing 2 gig Brocade switches attached to an EMC CLARiiON CX500, but then next year you want to replace that with a Hitachi array attached to new Cisco SAN switches.  There are still many interop issues between SAN switch vendors that make trunking switches problematic.  If you had bought physical servers you might have just chosen to re-cable the servers over to the new Cisco switches directly.

Loss of flexibility

Another pitfall that I have seen folks fall into with blade servers is the loss of flexibility that comes with having a stand alone physical server.  You can’t hook up that external hard drive array full of cheap disks directly to the server, or hook up that network heartbeat crossover cable for your cluster, or add an extra NIC or two to a given machine that needs to be directly attached to some other network (that is not available as a VLAN within your switch)….

Inter-tying dependencies

You are creating dependencies on the common enclosure infrastructure, so for full redundancy you need servers in multiple blade enclosures.  The argument that blade enclosures are extremely redundant does not completely hold water for me.  I have had to power cycle entire blade enclosures before to recover from certain blade management module failures.

Provisioning for highest common denominator

You must provision the blade enclosure for the maximum amount of SAN connectivity, network connectivity, and redundancy that is required by any one server within the enclosure.  Say for instance you have an authentication server that is super critical, but not resource intensive.  This requires your blade center to have fully redundant power supplies, network switches, etc…  Then say you have a different server that needs four 1 gig network interfaces, and yet another DB server that needs only two network interfaces but four HBA connections to the SAN.  You now need an enclosure that has four network switches and four SAN switches in it just to satisfy the needs of three “special case” servers.  In the case of the Dell M1000e blade enclosures, this configuration would be impossible, since they can only hold six SAN/network modules total.

Buying unused infrastructure

If you purchase a blade center that is not completely full of blades, then you are wasting infrastructure resources in the form of unused network ports, SAN ports, power supply capacity, and cooling capacity.  Making the ROI argument for blade centers is much easier if you need to purchase full enclosures.
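
A quick illustration of why fill level matters so much (all prices below are made-up placeholders, not quotes from any vendor):

```python
# Illustrative only: effective per-server cost as a blade enclosure fills up.
enclosure_cost = 9000     # chassis + switches + power, assumed
blade_cost = 2500         # per blade, assumed
rack_server_cost = 3000   # comparable standalone 1U server, assumed

for blades in (4, 8, 16):
    per_server = blade_cost + enclosure_cost / blades
    print(f"{blades:2d} blades: ${per_server:,.0f} per server "
          f"(vs ${rack_server_cost:,} for a 1U rack server)")
```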

Failing to use existing infrastructure

Most environments have some amount of extra capacity on their existing network and SAN switches, since those were purchased with future growth in mind (though probably not with blade enclosures in mind).  Spending money to re-purchase SAN and network hardware within a blade enclosure just to allow the use of blades can kill the cost advantages of going with a blade solution.

Moving from “cheap” disks to expensive SAN disks

You typically can not put many local disks into blades.  This is in many cases a huge loss, as not everything needs to be on the SAN (and in fact, certain things would be very stupid to put on the SAN, such as swap files).  I find that these days many people overlook the wonders of locally attached disk.  It is the *cheapest* form of disk you can buy and it can also be extremely fast!  If your application does not require any of the advanced features a SAN can provide then DON’T PUT IT ON THE SAN!

Over-buying power

In facilities where you are charged for power by the circuit, the key is to manage your utilization such that your unused (but paid for) power capacity is kept to a minimum.  With a blade enclosure, on day 1 you must provide (in this example) two 30 amp circuits for your blade enclosure, even though you are only putting in 4 out of a possible 16 servers.  You are going to be paying for those circuits even though you are nowhere near fully utilizing them.  The Dell blade enclosures, as an example, require two three-phase 30 amp circuits for full power (though depending on the server configurations you put in them, you can get away with dual 30 amp 208 volt circuits).
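
The same arithmetic applies to the power bill; a tiny sketch with placeholder circuit pricing:

```python
# Illustrative only: per-server share of fixed circuit charges as the
# enclosure fills up. Circuit pricing is a placeholder.
circuit_cost_per_month = 500   # assumed facility charge per 30 amp circuit
circuits_provisioned = 2       # required on day one regardless of fill level

for servers in (4, 8, 16):
    share = circuit_cost_per_month * circuits_provisioned / servers
    print(f"{servers:2d} servers: ${share:,.2f}/month of circuit charge each")
```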

Think about the end of the life-cycle

You can’t turn off the power to a blade enclosure until the last server in that enclosure is decommissioned.  You also need to maintain support and maintenance contracts on the SAN switches, network switches, and enclosure until the last server is no longer mission critical.

When are blades the right tools for the job?

  • When your operational costs for management and maintenance personnel far outweigh the cost inefficiencies of blades.
  • When you are buying enough servers that you can purchase *full* blade enclosures that have similar connectivity and redundancy requirements (i.e. each needs two 1 gig network ports and two 4 gig SAN connections).
  • When you absolutely need the highest density of servers offered (note that most datacenters in operation today can’t handle the power density required or the heat that blades can put out).

An example of a good use of blades would be a huge Citrix farm, or VMware farms, or in some cases webserver farms (though I would argue very large web farms that can scale out easily should be on some of the cheapest hardware you can buy, which typically does not include blades).

Another good example would be compute farms (say, even Lucene cache engines) – as long as you have enough nodes to be able to fill enclosures with machines that have the same connectivity and redundancy requirements.

Conclusion

While blades can be great solutions, they need to be implemented in the right environments for the right reasons.  It may indeed be the case that the savings in operational costs of employees to set up, manage, and maintain your servers far outweigh all of the points raised above, but it is important to factor all of these into your purchase decision.

As always, if you have any feedback or comments, please post below or feel free to shoot me an email.

-Eric

Categories: Cisco, Dell, HP, IBM, Network, Sun, Systems