Archive

Archive for the ‘Network’ Category

Republic Wireless Service and LG Optimus Review

November 26th, 2011 11 comments

When I first heard of Republic Wireless a few weeks ago I just could not resist ordering a phone and service to try out.  I have long felt that the phone industry (both wired and wireless) has been devoid of innovation and ready for disruption.  At $99 for the phone and $19 a month for unlimited service (with no contract), it was worth it to me just as a technology experiment.

On their launch day the site was quite busy and unable to accept orders for some time; however, they handled it with well-thought-out error messages.  I eventually got through and placed an order.  They were very upfront about the fact that it might take some time to deliver the product (which was no problem for me).  One slightly annoying thing was that they did charge me for the product well before shipping it.

Once they did ship it they sent a note with tracking information which was great.  It arrived today in good order.

Initial thoughts

It came in a small box inside a padded FedEx mailer, which was just perfect; nothing more was needed.  It is clear from first appearances, though, that they are still quite a young company.  The card telling me my phone number was hand-written.  😉  The phone and the box it came in are 100% Sprint branded, except for a sticker with the Republic logo on the box.  They clearly have not had enough time to get a hardware manufacturer to spin them devices with 100% Republic branding.

The phone

Never having seen an LG Optimus before I was very happy with what I got for $99.  Build quality seems excellent.  I like the case coating, the buttons, and just the general form factor.  It even has a camera hard button which I miss greatly on my Droid Bionic!  It does seem tiny though compared to the Bionic.  The down sides are that the screen is not huge (which makes typing noticeably more difficult), it does not have a blinking light to tell you when a message is waiting (I am constantly checking for that light on my Droid), and it does not have a camera flash.

For a $99 phone, the screen is excellent, and the processor seems fast enough.  I was very happy to see that they appear to be using the stock Android UI.  My first Android was the Droid V1 (which was pretty stock), and now I have the Droid Bionic with MotoBlur (which I am starting to hate).  Republic Wireless even seems to have avoided installing any kind of crapware on the phone whatsoever!  (Ironically, they do have a “Dev Tools” app installed, which I suspect is a mistake as it does not seem like something intended for your average end user.)  They don’t even have any icons set up on your home screen by default, which is a bit weird, yet somehow a bit cool.  It is a clean slate to work with.

For what it is worth, the camera quality seems pretty good from the two pictures I have taken so far.

The phone came pre-activated and ready to rock (if I remember correctly, the battery was even in it already).  The only thing it wanted was for me to attach it to a WiFi network first thing.

The service

Alright, so now for the real test: Can this thing make phone calls on WiFi?  I fire up a call to my employer’s auto-attendant and sure enough, it works!  Some quick tests later and I have a few initial thoughts.  The voice quality via WiFi is a bit quiet and tinny compared to calls placed on the Sprint network.  This is a bit disappointing as I have a Cisco enterprise wireless access point within 10 feet of where I was testing, and a hugely overbuilt Vyatta box as my gateway, on a 35/35 Frontier FiOS connection.  They need to take the opportunity to leverage some of the major benefits of *not* being on a cellular network (namely the fact that potentially much more bandwidth is available).  Perhaps some of this perceived quality issue is just because the codec is “different” from what I am normally used to (not necessarily worse), and maybe they can make it better by turning up the default volume or by modifying equalization settings.

So next up was a test of the two extra buttons that show up on the call screen during a WiFi call.  One of them places the call on hold.  This is novel for a cell phone, and it is an exciting indicator of features to come in the future.  When you don’t have to play within the limits imposed by the cell base station manufacturers you can actually implement compelling features!  The caller I tried this with did tell me the hold music was very faint.  Perhaps customizable hold music could be a future feature.  😉

The other button was to transfer the call to the Cell network (presumably to easily work around the WiFi connection flaking out).  This is a great feature, but its implementation was the first indication that things are not yet as well integrated as I would like.  Pressing the button hangs up the call and places a new call to the same person over the Sprint network.  I would have thought the smarter solution would have been either for the phone to dial out to the Republic Wireless servers via Sprint and have them cross-connect the existing call to that new Sprint leg (avoiding making the called party answer another call), or for the Republic Wireless servers to call the phone on its Sprint number and reconnect the in-progress call.  I am hoping they can make this more seamless in the future.

Now for the real test- What happens when I start a call on WiFi and then walk out of range of the access point?  Well, the answer is that as you might expect, the phone call does drop out for a bit, though after a few seconds, the phone automatically placed a new call outbound over Sprint to the party I had been speaking with.  I should note that anecdotally, I made it quite a ways from the house before it dropped out (I only tested one call).

The technical stuff

So being a network geek, I needed to know what this thing was going out and connecting to on the Internet.  I had noticed on calls that the latency was not that great.  I could tell that conversations were not as real time as I have come to expect on normal wireless calls.  So I tracked down the phone’s IP in the DHCP leases table, and fired up tcpdump on my Vyatta box.  It very quickly became apparent where Republic Wireless is hosted.  My phone is connecting to nodes in the AWS US-East compute zone.
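
For the curious, the process was nothing fancy.  Roughly speaking it looked like the commands below (the phone's IP address, the interface name, and the Amazon address are made-up placeholders, and the lease file location will vary depending on your DHCP server):

# find the phone's DHCP lease to get its IP address (file location varies by DHCP server)
less /var/lib/dhcp/dhcpd.leases

# watch everything the phone sends off the local network, ignoring DNS noise
tcpdump -n -i eth1 host 192.168.1.50 and not port 53

# reverse-resolve an endpoint seen in the capture; EC2 hosts come back as ec2-*.compute-1.amazonaws.com
host 184.73.10.20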

Running the Republic Wireless control software in the “Cloud” makes sense for a company that is expecting potentially massive growth; however, I was shocked to discover that it is not just control connections going through Amazon’s cloud.  They are apparently running the VoIP calls through there as well.  This immediately raised eyebrows with me, as I don’t feel that Amazon’s shared infrastructure environment is appropriate yet for VoIP traffic.  Perhaps I am wrong, or they have a deal with Amazon that puts them on dedicated network infrastructure, but the thing with VoIP is that it is massively sensitive to packet loss, latency, and jitter.  These things are hard enough to get right when you have dedicated hardware that is not shared.

The Amazon cloud node my phone was communicating with tonight was 85ms away (round trip) from my home here in Oregon, under good network conditions.  This probably explains a portion of the large delay I was experiencing on WiFi calls.  I think the VoIP encoding they are using is also introducing more delay than I would like to see, in order to reduce call dropouts due to flaky network connections.

Being a network architect, I think they need to have the phones connecting back to edge devices that are deployed on dedicated hardware in major peering cities in order to reduce latency as much as possible.  This product will live or die based on the audio quality and the seamlessness of the solution.  They should have nodes in Seattle and also in California to cover the West Coast.

One other thing that I should note is that I don’t think SMS messages are going over WiFi yet.  I suspect they are going out the normal Sprint radio, as I don’t see packets on the network clearly associated with sending text messages (though I may be wrong about this – I have not yet done a full protocol decode).  Perhaps that will be a future blog post.

Conclusion

All in all, I love what these guys are doing.  I am a cheerleader.  I think they are going about it the right way, but the product is still very very young…  Could I use this as my full time phone?  Probably not, as I require rock-solid communication at all times for my job.  Would I buy this for my kids?  Absolutely!

Am I seriously considering getting this for my parents that don’t currently have data phones?  Yes indeed!

I am looking forward to seeing how this works out…  $19 a month is almost too good to be true.  I wonder how long it is before these guys get bought out by one of the big boys wanting the technology (or to kill them off)?

-Eric

P.S. If anyone from Republic reads this, feel free to reach out.  I am always willing to provide constructive feedback!

Categories: Network, Telecom, Wireless Tags:

A Cassandra Hardware Stack – Dell C1100’s – OCZ Vertex 2 SSD’s with Sandforce – Arista 7048’s

October 24th, 2010 15 comments

Over the past nine months I have delved into the world of providing hardware to support our application teams’ use of the Cassandra datastore.  This has turned out to be a somewhat unique challenge as the platform is rapidly evolving along with our use case for it.  Cassandra is a very different beast compared to your traditional RDBMS (as you would expect).

I absolutely love the fact that Cassandra has a clear scaling path to allow massive datasets and it runs on very cheap commodity hardware with local storage.  It is built with the expectation of underlying hardware failure. This is wonderful from an operations perspective as it means I can buy extremely cheap “consumer grade” hardware without having to buy “enterprise grade” (whatever that really means) servers and storage for $$$.

Before I dive into my findings, I should point out that this is not a one-size-fits-all solution, as it greatly depends on what your dataset looks like and what your read/write patterns are.  Our dataset happens to be billions of exceedingly small records, which means we do an incredible amount of random read i/o.  Your mileage may vary depending on what you do with it.

Finding the optimal node size

As usual, spec’ing out hardware for a given application is a matter of balancing five variables:

  • CPU capacity (taking into account the single/multi threaded aspects of the application)
  • RAM capacity (how much working space does the application need and how much cache is optimal)
  • Disk capacity (actual disk storage space)
  • Disk i/o performance (the number of read and write requests per second that can be handled)
  • Network capacity (how much bandwidth is needed)

If you run into a bottleneck on any of these five items, any additional capacity available within the other four categories is wasted.  The procedure for finding the optimal configuration is as follows:

  1. Determine which of the five variables is going to be your limiting factor through performance testing
  2. Research the most cost-effective price/performance point for the limiting variable
  3. Spec out hardware to meet the other four variables needs relative to the bottleneck

Note that this is a somewhat iterative process as (for example) it may make sense to buy a CPU significantly beyond the price/performance sweet spot (when looking at CPU pricing in a vacuum) as paying for that higher end CPU may allow you to make much better use of the other pieces of the system that would otherwise sit idle.  I am not suggesting that most Cassandra shops will be CPU bound, but this is just an example.

There is also fuzziness in this process as there can be some interdependencies between the variables (i.e. increasing system RAM can reduce disk i/o needs due to increased caching).

Nehalem platforms

If you are at all familiar with the current server-platform market then you know that Nehalem microarchitecture based servers (you need to read the Wikipedia article) are the platform of choice today, with the Westmere processors being the current revision within that series.  In general, the most cost-effective solution when scaling out large systems on Nehalem platforms is to go with dual-processor machines, as this gives you twice the amount of processing power and system memory without doubling your costs (i.e. you still only need one motherboard, power supplies, etc…)

All of the major OEMs have structured their mainline platforms around this dual processor model.  Note that there ARE situations where dual processors don’t make sense including:

  1. Single threaded applications that can not make use of all those cores and that do NOT need the additional memory capacity.
  2. Applications that are purely Disk i/o or network bound where the additional CPU and memory would be wasted (perhaps a file server).
  3. Applications that need less than a “full” machine (i.e. your DNS/DHCP servers).

In general, I don’t think Cassandra falls into these special use case scenarios, unless you’re just completely i/o bound or network bound and can’t solve that any way other than adding more nodes.  You may need that second processor, however, just for the memory controllers it contains (i.e. it gives you twice as many RAM slots).  If you are i/o bound you can consider SSD’s, and if you are network bound you can leverage 10 gigabit network interfaces.

In looking at platforms to run Cassandra on, we wanted a vanilla Nehalem platform without too many bells and whistles.  If you drink the Cassandra kool-aid you will let Cassandra handle all the reliability needs and purchase hardware without node level fault tolerance (i.e. disk RAID).  This means putting disks in a RAID 0 (for optimal speed and capacity) and then letting the fact that Cassandra can store multiple copies of the data across other nodes handle fault recovery.  We are currently using Linux kernel RAID, but may also test the hardware RAID 0 that is available on the platform we ended up choosing.
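
For reference, building that RAID 0 data volume with Linux kernel RAID is only a few commands.  Here is a minimal sketch (the device names, filesystem, and mount point are assumptions for illustration, not our exact build scripts):

# create a striped md device across the three data disks
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

# put a filesystem on it and mount it where Cassandra expects its data files
mkfs.ext3 /dev/md0
mkdir -p /var/lib/cassandra/data
mount /dev/md0 /var/lib/cassandra/data

# persist the array definition so it assembles on boot
mdadm --detail --scan >> /etc/mdadm.conf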

It is shocking to me to see how many OEM’s have come up with platforms that do not have equal numbers of RAM slots per memory channel.  News flash folks- In Nehalem it is critical to install memory in equal sets of 3 (or 6 for dual processor) in order to take advantage of memory interleaving.  Every server manufactured should have a number of memory slots divisible by three as the current crop of processors has three memory controllers per processor (this may change in the next generation of processors).

A note about chipsets – The Intel 5500 vs. 5520 – The main difference here is just in the number of PCIe paths the chipset provides.  They should both provide equivalent performance.  The decision point here is made by your OEM and is just based on the number of PCI devices your platform supports.

Our platform choice

In looking at platform options, the following were the leading contenders (there are of course many other possible options, but most are too focused on the enterprise market, with features we do not need that just drive costs up):

At first we were looking at 1U machines with 4x 3.5 inch bays (and in fact bought some C1100’s in this configuration) though it turned out that Cassandra was extremely i/o bound which made a small number of large SATA disks impractical.  Once we realized we were going to need a larger number of drives we decided to go with 1U platforms that supported 2.5 inch bays as we can put eight to ten 2.5 inch drives in a 1U to give us more spindles (if we go with disks), or more SSD’s (for the disk capacity rather than iops) if we go with SSD’s.  It’s also worth noting that the 2.5 inch SATA drives draw a lot less power than the 3.5 inch SATA disks of the same capacity.

We ended up going with the Dell C1100 platform (over the Supermicro offering) as we already had purchasing relationships with Dell and they have a proven track record of being able to support systems throughout a lifecycle (provide “like” replacement parts, etc…), though on this particular order they fell down in numerous ways (mostly related to their recent outsourcing of production to Mexico), which has caused us to re-evaluate future purchasing plans.  In the end, the C1100’s have worked out extremely well thus far, but the speed-bumps along the way were painful.  We have not physically tested any Supermicro offerings, so perhaps they have issues that are just as bad (or worse).

What we like:

  • Inexpensive platform
  • Well-targeted to our needs
    • Have 18 RAM slots (only populating 12 of them right now with 4 gig sticks)
    • Dual Intel nic’s not Broadcom
    • They include out of band controllers
    • Dual power supplies available (this is the only “redundancy” piece we do purchase)
  • Low power consumption
  • Quiet

What we don’t like:

  • Lead time issues
  • Rails with clips that easily break
  • Servers arriving DOA
  • Using a SAS expander to give 10 bays vs only 8 (we would have preferred the option to use only 8 bays)
  • They don’t give us the empty drive sleds to add disks later -> force you to purchase from them at astronomical rates
  • The 2 foot IEC to IEC power cords they sent us were only rated to 125 volts (we use 208 volt exclusively)
  • Lack of MLC SSD option from factory

OCZ Technology Vertex 2 MLC SSD’s

After purchasing our first round of Dell C1100’s with four SATA disks (one for boot/commit and three in a RAID 0 for data) we rapidly discovered they were EXTREMELY i/o bound.  Cassandra does an extremely poor job bringing pertinent data into memory and keeping it there (across a four node cluster we had nearly 200 gigs of RAM as each node has 48 gigs).  Things like the fact that Cassandra invalidates cache for any data it writes to disk (rather than writing the data into the cache) make it extremely painful.  Cassandra also (in .6) will do a read on all three nodes (assuming your data is replicated three places) in order to do a read-repair, even if the read factor is only set to one.  This puts extremely high load on the disks across the cluster in aggregate.  I believe in .7 you will be able to tune this down to a more reasonable level.
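
If I understand the 0.7 changes correctly, read repair becomes tunable per column family.  From memory it is something along these lines in cassandra-cli (the keyspace and column family names are just examples, and the syntax should be treated as a sketch rather than gospel); the last statement tells Cassandra to read-repair roughly 10% of reads instead of every read:

cassandra-cli
connect localhost/9160;
use Keyspace1;
update column family Users with read_repair_chance = 0.1;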

Our solution was to swap the 1TB SATA disks with 240 gig OCZ Vertex 2 MLC SSD’s which are based on the Sandforce controller.  Now normally I would not consider using “consumer grade” MLC SSD’s for an OLTP type application, however, Cassandra is VERY unique in that it NEVER does random write i/o operations and instead does everything with large sequential i/o.  This is a huge deal because with MLC SSD’s, random writes can rapidly kill the device as writing into the MLC cells can only be done sequentially and editing any data requires wiping the entire cell and re-writing it.

The Sandforce controller does an excellent job of managing where data is actually placed on the SSD media (it has more space available than what is made available to the O/S so that it can shift where things actually get written).  By playing games with how data is written the Sandforce controller is supposed to dramatically improve the lifespan of MLC SSD’s.  We will see how it works out over time.  😉

It is unfortunate that Dell does not have an MLC SSD offering, so we ended up buying small SATA disks in order to get the drive sleds, and then going direct to OCZ Technology to buy a ton of their SSD’s.  I must say, I have been very happy with OCZ and I am happy to provide contact info if you shoot me an email.  I do understand the hesitation Dell has with selling MLC SSD’s, as Cassandra is a very unique use-case (only large sequential writes) and a lot of workloads would probably kill the drives rapidly.

It is also worth noting that our first batch of C1100’s with the 3.5 inch drives were using the onboard Intel ICH10 controller (which has 6 ports), but the second batch of C1100’s with the 10 2.5 inch bays are using an LSI 2008 controller (available on the Dell C1100) with a SAS expander board (since the LSI 2008 only has 8 channels).  We are seeing *much* better performance with the LSI 2008 controllers, though that may be simply due to us not having the disks tuned properly on the ICH10 (using native command queueing, DMA mode, etc…) in CentOS 5.5.  The OCZ Sandforce based drives are massively fast.  😉
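
If you want to rule out the ICH10 ports running in some legacy mode, a few quick checks are worth doing (the device names below are examples):

# confirm which driver claimed the SATA controller and whether AHCI/NCQ came up
lspci | grep -i sata
dmesg | grep -i -e ahci -e ncq

# a queue depth greater than 1 generally means NCQ is active for the device
cat /sys/block/sda/device/queue_depth

# the drive's reported queue depth and DMA modes
hdparm -I /dev/sda | grep -i -e queue -e udma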

If you are going to have any decent number of machines in your Cassandra cluster I highly recommend keeping spare parts on hand and then just purchasing the slow-boat maintenance contracts (next business day).  You *will* lose machines from the cluster due to disk failures, etc. (especially since we are using inexpensive parts)…  It is much easier to troubleshoot when you can swap out parts as needed and then follow up after the fact to get the replacements.

Networking

Since Cassandra is a distributed data store it puts a lot more load on the network than, say, monolithic applications like Oracle, which generally have all their data back-ended on Fibre Channel SAN’s.  Particular care must be taken in network design to ensure you don’t have horrible bottlenecks.  In our case, our existing network switches did not have enough available ports and their architecture is 8:1 over-subscribed on each gigabit port, which simply would not do.  After much investigation, we decided to go with the Arista 7048 series switches.

The Arista 7048 switches are 1U, 48 port copper 1 gig, and 4 ports of 10 gig SFP+.  This is the same form factor of the Cisco 4948E switches.  This form factor is excellent for top-of-rack switching as it provides fully meshed 1 gig connectivity to the servers with 40 gigabit uplink capacity to the core.  While the Arista product offering is not as well baked as the Cisco offering (they are rapidly implementing features still), they do have one revolutionary feature that Cisco does not have called MLAG.

MLAG stands for “Multi-Chassis Link Aggregation”.  It allows you to physically plug your server into two separate Arista switches and run LACP between the server and the switches as if both ports were connected to the same switch.  This allows you to use *both* ports in a non-blocking mode, giving you full access to the 2 gigabits of bandwidth, while still having fault tolerance in the event a switch fails (of course you would drop down to only 1 gig of capacity).  We are using this for *all* of our hosts now (using the Linux kernel bonding driver) and indeed it works very well.
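
The host side of this is just standard 802.3ad (LACP) bonding.  A minimal CentOS 5 style sketch follows (the interface names and addressing are placeholders, not our production configs):

# /etc/modprobe.conf - load the bonding driver in 802.3ad (LACP) mode
alias bond0 bonding
options bond0 mode=802.3ad miimon=100 lacp_rate=fast

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=10.0.0.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes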

MLAG also allows you to uplink your switches back to the core in such a way as to keep all interfaces in a forwarding state (i.e. no spanning-tree blocked ports).  This is another great feature, though I do need to point out a couple of downsides to MLAG:

  1. You still have to do all your capacity planning as if you are in a “failed” state.  It’s nice to have that extra capacity in case of unexpected conditions, but you can’t count on it if you want to always be fully functional even in the event of a failure.
  2. When running MLAG one of the switches is the “master” that handles LACP negotiation and spanning-tree for the pair of switches.  If there is a software fault in that switch it is very possible that it would take down both paths to your servers (in theory the switches can fall back to independent operation, but we are dealing with *software* here).

It is worth noting that we did not go with 10 gig NIC’s and switches as it does not seem necessary yet with our workload and 10 gig is not quite ready for prime time yet (switches are very expensive, the phy’s draw a lot of power, and cabling is still “weird” – either Coax or Fiber or short distance twisted pair over CAT6, or CAT7 / 7a over 100 meters).  I would probably consider going with a server platform that had four 1 gig NIC’s still before going to 10 gig.  As of yet I have not seen any Cassandra operations take over 100 megabit of network bandwidth (though my graphs are all heavily averaged down so take that with a grain of salt).

Summary

So to recap, we came up with the following:

  • Dell C1100’s – 10x 2.5 inch chassis with dual power supplies
  • Dual 2.4 ghz E5620 processors
  • 12 sticks of 4 gig 1066mhz memory for a total of 48 gigs per node (this processor only supports 1066mhz memory)
  • 1x 2.5 inch 500 gig SATA disk for boot / commit
  • 3x 2.5 inch OCZ Vertex 2 MLC SSD’s
  • The LSI 2008 optional RAID controller (running in JBOD mode, using Linux Kernel RAID)
  • Dual on-board Intel NIC’s (no 10 gig NIC’s, though it is an option)
  • Pairs of Arista 7048 switches using MLAG and LACP to the hosts

Notes:

  • We did not evaluate the low power processors; they may have made sense for Cassandra, but we did not have the time to look into them.
  • We just had our Cassandra cluster lose its first disk and the data filesystem went read-only on one node, but the Cassandra process continued running and processing requests.  I am surprised by this as I am not sure what state the node was in (what was it doing with writes when it came time to write out the memtables?).  We manually killed the Cassandra process on the node.
  • The Dell C1100’s did not come set by default in NUMA mode in the BIOS.  CentOS 5.5 supports this and so we turned it on.  I am not sure how much (if any) performance impact this has on Cassandra.  A quick way to check what the kernel sees is shown below.
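
To confirm the kernel actually sees both NUMA nodes after flipping that BIOS setting (assuming the numactl package is installed):

# with NUMA enabled you should see two nodes, each owning one socket's cores and half the RAM
numactl --hardware

# per-node allocation and miss statistics
numastat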

Conclusion

This is still a rapidly evolving space so I am sure my opinions will change here in a few months, but I wanted to get some of my findings out there for others to make use of.  This solution is most certainly not the optimal solution for everyone (and in fact, it remains to be seen if it is the optimal solution for us), but hopefully it is a useful datapoint for others that are headed down the same path.

Please feel free as always to post questions below that you feel may be useful to others and I will attempt to answer them, or email me if you want contact information for any of the vendors mentioned above.

-Eric

Categories: Cassandra, Dell, Network, Systems Tags:

Dynect migration from UltraDNS follow-up

October 2nd, 2010 No comments

Several months back I migrated a very high volume site from Neustar UltraDNS over to Dyn’s Dynect service.  I am following up with another post because I believe it is important for folks in the IT community to share information (good or bad) about vendors and technology that they use.  All too often the truth is obscured by NDA’s, marketing agreements, etc…

So here it is:  I have not had  a single issue with Dynect since I made the transition.  There is not much to say other than that…

I have not had any site reachability issues that I can point the finger at DNS for, and I have never had to call Dynect support.  I have not even had any billing snafu’s.

The Dynect admin console still rocks with cool real time metrics.

It just works.  The pricing is reasonable.  And the guys/gals that work there are just cool folks.

P.S.  They even added a help comment to the admin interface to address my concern of not being able to modify the SOA minimum TTL value.  It now says you can contact their Concierge service if you need the value changed.

-Eric

Categories: Network Tags:

Review of moving from NeuStar UltraDNS to Dynect Managed DNS Service

May 21st, 2010 7 comments

For many years I have used the UltraDNS service from NeuStar on behalf of several companies I have worked for, as it has been incredibly reliable and easy to use.  I cannot, however, say it has been exactly inexpensive, and in recent years innovation has seemingly slowed to a crawl.  Each time in the past that I have evaluated the field of other options, there have not been any “worthy” contenders in the space, until now that is.

After recently completing an evaluation and trial run of dyn.com’s Dynect service, we went ahead and switched over to their service for some very high volume domains that generate millions of queries a day.

A few notes on the transition process

The NeuStar zone export tool had issues and was truncating zone file output on some of our zones (and losing records in the process!).  This is a serious bug (though one they may not be too heavily incentivized to fix).  I have reported this bug to NeuStar and they informed me they were already aware of the issue.

So next up, I tried allowing the Dynect servers’ IP addresses to do a zone transfer from UltraDNS, but it turned out Dynect had a bug where they could not do zone transfers directly from UltraDNS using AXFR (they tell me they are actively working to fix this).

I ended up doing an AXFR out of UltraDNS from my desktop PC using DIG (after allowing my IP to do the transfer in the NeuStar control panel) and then pasting it into Dynect’s import tool.  This process was slightly annoying, but in the grand scheme of things not a big deal (it took more time to validate all the data got moved over properly than anything else).
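
For anyone doing the same dance, the transfer itself is a one-liner with dig (the zone and server names below are placeholders; use the UltraDNS name server assigned to your zone, after whitelisting your IP in their control panel):

# pull the full zone via AXFR and save it for import/validation
dig @udns1.ultradns.net example.com AXFR > example.com.zone

# quick sanity check: count the records you are about to import
grep -c -v '^;' example.com.zone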

Notes on the Dynect platform

The real time reporting of queries per second is awesome functionality that I now consider to be critical.  This is available from Dynect on a per zone, per record type, or per individual record basis.  I did not know what I was missing before.  It has allowed me to find a couple “issues” with my zone records that I would have otherwise been unaware of.  With UltraDNS I had no idea how many queries I had used until the end of the month came around and I got a bill that included almost no detail.

One of these issues was the lack of AAAA (IPv6) records on one particular host entry that gets millions of queries per day.  Newer Windows Vista and Windows 7 machines will attempt an IPv6 lookup in addition to (or before?) the IPv4 lookup, as IPv6 is enabled by default.  Since this site is not yet IPv6 enabled, we do not serve out an AAAA record, so the remote DNS server uses the SOA (Start of Authority) “minimum” value as the TTL (Time To Live) on the negative cache entry it adds to its system.  The net result is that IPv4 queries get cached for the 6 hour TTL we have set, but IPv6 queries, which result in a “nonexistent” answer, only get cached for 60 seconds (the SOA minimum value Dynect uses).  This results in huge query volumes for IPv6 records in addition to the IPv4 records, and the issue will only get worse as more end clients become IPv6 enabled while the site in question remains IPv4 only.
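
You can see this behavior directly with dig: a host with no AAAA record comes back NOERROR with an empty answer section, and the SOA in the authority section carries the minimum value that resolvers use to cache that negative answer (the hostname here is a placeholder):

# show only the authority section; the last SOA field is the negative-cache TTL (per RFC 2308)
dig AAAA www.example.com +noall +authority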

Dynect does not allow end users to muck with the SOA values (other than default TTL) which is highly unfortunate in my mind.  NeuStar UltraDNS did allow these changes to be made by the end user on any zone.  The good news is that Dynect was able to manually change my SOA minimum values to a longer interval for me (somewhat begrudgingly).  They claim the lack of user control is by design (to keep people from messing something up that then gets cached for a long interval), though in my mind there needs to be an advanced user mode for those ready and willing to run that risk.

The other issue Dynect’s real time reporting shed light on for me was a reverse DNS entry that I was missing on a very high volume site, which was again causing high query volume to that IP as the negative cache interval was 60 seconds.  I rectified this by adding an appropriate PTR record.
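
Checking for that kind of gap is equally quick (the address below is a placeholder):

# a missing PTR comes back empty/NXDOMAIN; once the record exists you get the hostname back
dig -x 203.0.113.10 +short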

I do have to point out that I am not so thrilled with either the simple editor or the expert editor that Dynect provides.  The tree control with leaves for every record seems clunky to me, and the advanced editor is not the end-all be-all either (certain functionality does not exist there, and it leaves you to edit certain records, like SRV records with multiple data values, in a single text box).  But these don’t really get in the way of my being very happy with the service.

Perhaps of more concern to me is Dynect’s lack of a 24×7 NOC.  Granted, they have an on-call engineer 24×7, though for something as critical as DNS I would encourage them to staff a NOC as soon as their business can support it.  This is a service offering UltraDNS has that I have utilized and been happy with in the past.

Another thing Dynect seems to do well is auditing: the ability to see what changes have been made to your zones.  I have not dug into it too much on either Dynect or UltraDNS, but it seems to exist as a core feature in a more useful fashion than I have seen on UltraDNS.  One thing that I never could figure out on UltraDNS was how to go back and look at audit history for deleted records (not to mention confirmation of record modification or deletion).

I should note at this point one major difference between the pricing mechanisms for UltraDNS and Dynect.  My experience with Ultra has been that they bill per bucket of 1,000 queries.  Dynect, on the other hand, bills on a 95th percentile basis of Queries Per Second (QPS) measured over 5 minute intervals, similar to how ISP’s bill for bandwidth.  Depending on your usage patterns, either one of these billing models could be more advantageous to you.

Also, I am not going to dive into too much detail here, but UltraDNS and Dynect both offer global server load balancing solutions that differ in one very key way: UltraDNS has a new solution that uses a geolocation database to direct queries to a desired server based on source IP address, whereas Dynect’s offering only provides the ability to do this based on their Anycast node locations.  There are pros and cons to each; perhaps that will become a future blog post.

Wrapping it up

UltraDNS is a great service that has proven itself reliable in the long run.  I would recommend their service to others in the future.  They do need to keep up with the changing technology however (new releases to the admin console indicate they are starting to head in this direction).

Dynect has assembled a fully competitive (and better in some ways) offering that I would now classify as a viable option for most UltraDNS customers.  My migration to their solution was very smooth and so far there have been no issues.  I welcome Dynect to the Managed External DNS Service space and the healthy competition they provide.

I should also note that their sales and support team has treated us/me well.  They genuinely seem to care about this stuff and I don’t come away with the slimy feeling after talking to them.

-Eric

Categories: Network Tags:

Cisco ASA 8.0.5 TFTP Unspecified Error using PumpKIN

December 10th, 2009 2 comments

I have run into a problem on two separate ASA’s now when downloading code to them using the PumpKIN TFTP server.  It gets partway through the download and dies (at a different place each time, so it is an intermittent error).

I was running a 7.0.8 release on these devices, and then upgraded to 8.0.5 (copying the file off a PumpKIN TFTP server with no issue), but then I was reloading 8.0.5 onto the devices (while booting off 8.0.5 already) and it could not access the exact same file properly.

ciscoasa# copy tftp flash
Address or name of remote host []? 10.100.99.13
Source filename []? asa805-k8.bin
Destination filename [asa805-k8.bin]?
Accessing tftp://10.100.99.13/asa805-k8.bin...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: TFTP download incomplete!
!
%Error reading tftp://10.100.99.13/asa805-k8.bin (Unspecified Error)
ciscoasa#

I ended up using the SolarWinds TFTP server instead and it worked like a champ.  I am not sure what the issue is here, but it looks like some kind of bug in PumpKIN or in the ASA code (or some combination thereof).

-Eric

Categories: Cisco, Network Tags:

Google Recursive DNS Speed Test

December 8th, 2009 1 comment

So a few days ago Google announced their new public DNS service that will answer recursive queries for any host.  There has been a lot of coverage of this elsewhere so I did not feel compelled to post anything about it until I saw this post discussing how fast Google’s DNS servers were compared to other ISP’s servers.  I felt that I am in a somewhat unique position to provide some test data as I have direct access to an Internet connection from an ISP peered with NWAX which has Google as a member.  The end result of this is that my round trip times to many Google services are 3-4ms.

So I downloaded the same test tool as Jon Radoff and ran the test from my connection.  In the results below you can clearly see that Google is the fastest (or right there with the fastest).

A test of DNS server performance from an Internet connection close to Google

I would conclude that Google’s DNS servers are just as fast as any others out there, but the issue is one of latency.  Your ISP’s servers have an advantage over Google (in most cases) since they sit on the service provider’s network.  That is not to say Comcast or Verizon may not have their DNS servers on the other side of the country from you (but that would be just dumb).

All in all, I am very happy that Google now provides this service as it may be really useful from time to time.  Most corporate environments don’t care though since they have internal DNS servers to handle their recursive requests.

-Eric

Categories: Google, Network, Telecom Tags:

Cisco AnyConnect Split-DNS resolution not working in Snow Leopard 10.6

November 20th, 2009 2 comments

I just upgraded my Cisco AnyConnect client on my ASA 5510 to 2.4.0202 hoping that the VPN would work for my users with Mac OS 10.6 Snow Leopard, but it would appear they are having DNS resolution issues.  I use the Split-DNS functionality of the ASA/Anyconnect client to only send DNS queries to the across-the-vpn DNS servers for a couple of domain names.

My brief testing has shown that all DNS queries are being sent to the remote host’s local DNS servers rather than to the corporate DNS servers for the Split-DNS domains.

I found Cisco bug ID CSCtc54466 that describes this issue.  It describes the issue as being specific to Mac OS X 10.6, and they claim the problem is with Apple’s mDNS code.  They say it is “likely to be fixed in Mac OS X 10.6.3”.

In the meantime they claim you can “Restart the mDNSResponder service”.  I am assuming you would need to restart this service each time you VPN in?  I have not yet looked into how to restart that service.  I will edit this post once I figure it out.
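
That said, if mDNSResponder behaves like other launchd-managed daemons, simply killing the process should cause launchd to respawn it cleanly.  I have not verified that this actually clears up the split-DNS behavior, so treat it as a guess:

# launchd supervises mDNSResponder, so killing it forces a clean restart of the resolver
sudo killall mDNSResponder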

-Eric

Categories: Apple, Cisco, Network Tags:

Advanced PING usage on Cisco, Juniper, Windows, Linux, and Solaris

September 15th, 2009 1 comment

As a network engineer, one of the most common utilities I use is the ping command.  While in its simplest form it is a very valuable tool, there is much more knowledge that can be gleaned from it by specifying the right parameters.

Ping on Cisco routers

On modern Cisco IOS versions the ping tool has quite a few options; however, this was not always the case.

Compare the options to ping in IOS 12.1(14)

EXRTR1#ping ip 4.2.2.1 ?
  <cr>

EXRTR1#ping ip 4.2.2.1

To that in IOS 12.4(24)T

plunger#ping ip 4.2.2.1 ?
  data      specify data pattern
  df-bit    enable do not fragment bit in IP header
  repeat    specify repeat count
  size      specify datagram size
  source    specify source address or name
  timeout   specify timeout interval
  validate  validate reply data
  <cr>

plunger#ping ip 4.2.2.1

When I am running a basic connectivity test between two points on a network I will generally not specify any options to ping (i.e. “ping 4.2.2.1”).  Once I have verified connectivity, however, I will most often want to verify what MTU size the path will support without fragmentation, and then run an extended ping of a thousand or more packets to test the reliability/bandwidth/latency characteristics of the link.

Here is an example of the most basic form.  Note that by default it is sending 100 byte frames:

plunger#ping 4.2.2.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 24/26/28 ms
plunger#

If I am working on an Ethernet network (or PPP link), it is most common that my goal is for 1500 byte frames to make it through.  I will use the “size” parameter to force ping to generate 1500 byte frames (note that in Cisco land this means 1472 bytes of ICMP payload plus 8 bytes of ICMP header and 20 bytes of IP header).  I also use the df-bit flag to set the DO NOT FRAGMENT bit on the generated packets.  This will allow me to ensure that the originating router (or some other router in the path) is not fragmenting the packets for me.

plunger#ping 4.2.2.1 size 1500 df-bit 

Type escape sequence to abort.
Sending 5, 1500-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 24/26/28 ms
plunger#

If the first ping command worked, but the command above did not, then try backing down the size until you find a value that works.  Note that one common example of a smaller MTU is 1492 which is caused by the 8 bytes of overhead in PPPoE connections.

The next command to try is to send a large number of pings of the maximum MTU your link can support.  This will help you identify packet loss issues and is just a good way to generate traffic on the link to see how much bandwidth you can push (if you’re monitoring the link with another tool).  I have frequently identified bad WAN circuits using this method.  Note that looking at the Layer 1/2 error statistics before doing this (and perhaps clearing them), and then looking at them again afterwards (on each link in the path!), is often a good idea.

plunger#ping 192.168.0.10 size 1500 df-bit repeat 1000

Type escape sequence to abort.
Sending 1000, 1500-byte ICMP Echos to 192.168.0.10, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (1000/1000), round-trip min/avg/max = 1/2/4 ms
plunger#

Now the final ping parameter I often end up using is the “source” option.  A good example is when you have a router with a WAN connection on one side that has a /30 routing subnet on it, plus an Ethernet connection with a larger subnet for your users’ devices.  Say that users on the Ethernet are reporting that they cannot ping certain locations on the WAN, though you can ping them just fine from the router.  This is often because the return path from the device you are pinging back to your users’ subnet on the Ethernet is not being routed properly, but the IP your router has on the /30 WAN subnet is being routed correctly.  The key here is that by default a Cisco router will originate packets from the IP of the interface it is going to be sending the traffic out of (based on its routing tables).

To test from the interface your router has in the users subnet, use the source command like this:

plunger#ping 4.2.2.1 source fastEthernet 0/0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
Packet sent with a source address of 173.50.158.74
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 24/25/28 ms
plunger#

Note that you can also use these commands together depending on what you are trying to do:

plunger#ping 192.168.0.10 size 1500 df-bit source FastEthernet 0/0 repeat 100

Type escape sequence to abort.
Sending 100, 1500-byte ICMP Echos to 192.168.0.10, timeout is 2 seconds:
Packet sent with a source address of 173.50.158.74
Packet sent with the DF bit set
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (100/100), round-trip min/avg/max = 1/2/4 ms
plunger#

Ping on Juniper routers

The options to ping on Juniper (version 9.3R4.4 in this case) are quite extensive:

root@INCSW1> ping ?
Possible completions:
  <host>               Hostname or IP address of remote host
  bypass-routing       Bypass routing table, use specified interface
  count                Number of ping requests to send (1..2000000000 packets)
  detail               Display incoming interface of received packet
  do-not-fragment      Don't fragment echo request packets (IPv4)
  inet                 Force ping to IPv4 destination
  inet6                Force ping to IPv6 destination
  interface            Source interface (multicast, all-ones, unrouted packets)
  interval             Delay between ping requests (seconds)
  logical-system       Name of logical system
+ loose-source         Intermediate loose source route entry (IPv4)
  no-resolve           Don't attempt to print addresses symbolically
  pattern              Hexadecimal fill pattern
  rapid                Send requests rapidly (default count of 5)
  record-route         Record and report packet's path (IPv4)
  routing-instance     Routing instance for ping attempt
  size                 Size of request packets (0..65468 bytes)
  source               Source address of echo request
  strict               Use strict source route option (IPv4)
+ strict-source        Intermediate strict source route entry (IPv4)
  tos                  IP type-of-service value (0..255)
  ttl                  IP time-to-live value (IPv6 hop-limit value) (hops)
  verbose              Display detailed output
  vpls                 Ping VPLS MAC address
  wait                 Delay after sending last packet (seconds)
root@INCSW1> ping

While there are a lot more options here, I am generally trying to test the same types of things.  A very important note however is that in the Juniper world, the size parameter is the payload size and does not include the 8 byte ICMP header and 20 byte IP header.   The command below is the same as specifying 1500 bytes in Cisco land.

root@INCSW1> ping 4.2.2.1 size 1472 do-not-fragment
PING 4.2.2.1 (4.2.2.1): 1472 data bytes
1480 bytes from 4.2.2.1: icmp_seq=0 ttl=53 time=25.025 ms
1480 bytes from 4.2.2.1: icmp_seq=1 ttl=53 time=24.773 ms
1480 bytes from 4.2.2.1: icmp_seq=2 ttl=53 time=24.757 ms
1480 bytes from 4.2.2.1: icmp_seq=3 ttl=53 time=25.045 ms
1480 bytes from 4.2.2.1: icmp_seq=4 ttl=53 time=24.911 ms
1480 bytes from 4.2.2.1: icmp_seq=5 ttl=53 time=25.152 ms
^C
--- 4.2.2.1 ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max/stddev = 24.757/24.944/25.152/0.145 ms

root@INCSW1>

Here is an example of sending lots of pings quickly on the Juniper to test link reliability:

root@INCSW1> ping 10.0.0.1 size 1472 do-not-fragment rapid count 100
PING 10.0.0.1 (10.0.0.1): 1472 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 10.0.0.1 ping statistics ---
100 packets transmitted, 100 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.230/2.429/5.479/1.228 ms

root@INCSW1>

Ping on Windows XP/Vista/2003/2008

While not as full featured, the Windows ping tool can at least set the packet size and the do-not-fragment bit:

Microsoft Windows [Version 6.0.6001]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

C:\Users\eric.rosenberry>ping ?
^C
C:\Users\eric.rosenberry>ping

Usage: ping [-t] [-a] [-n count] [-l size] [-f] [-i TTL] [-v TOS]
            [-r count] [-s count] [[-j host-list] | [-k host-list]]
            [-w timeout] [-R] [-S srcaddr] [-4] [-6] target_name

Options:
    -t             Ping the specified host until stopped.
                   To see statistics and continue - type Control-Break;
                   To stop - type Control-C.
    -a             Resolve addresses to hostnames.
    -n count       Number of echo requests to send.
    -l size        Send buffer size.
    -f             Set Don't Fragment flag in packet (IPv4-only).
    -i TTL         Time To Live.
    -v TOS         Type Of Service (IPv4-only).
    -r count       Record route for count hops (IPv4-only).
    -s count       Timestamp for count hops (IPv4-only).
    -j host-list   Loose source route along host-list (IPv4-only).
    -k host-list   Strict source route along host-list (IPv4-only).
    -w timeout     Timeout in milliseconds to wait for each reply.
    -R             Use routing header to test reverse route also (IPv6-only).
    -S srcaddr     Source address to use.
    -4             Force using IPv4.
    -6             Force using IPv6.

C:\Users\eric.rosenberry>

So your basic ping in Windows claims to send 32 bytes of data (I have not verified this), but I suspect that is 32 bytes of payload, plus 8 bytes ICMP, and 20 bytes of IP for a total of 60 bytes.

C:\Users\eric.rosenberry>ping 4.2.2.1

Pinging 4.2.2.1 with 32 bytes of data:
Reply from 4.2.2.1: bytes=32 time=27ms TTL=53
Reply from 4.2.2.1: bytes=32 time=25ms TTL=53
Reply from 4.2.2.1: bytes=32 time=28ms TTL=53
Reply from 4.2.2.1: bytes=32 time=27ms TTL=53

Ping statistics for 4.2.2.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 25ms, Maximum = 28ms, Average = 26ms

C:\Users\eric.rosenberry>

So a common set of flags I use will be to create full 1500 byte frames (note that it takes 1472 as the parameter for this) and then tell it not to fragment (-f) and to repeat until stopped (-t).

C:\Users\eric.rosenberry>ping 4.2.2.1 -l 1472 -f -t

Pinging 4.2.2.1 with 1472 bytes of data:
Reply from 4.2.2.1: bytes=1472 time=29ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=30ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=31ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=29ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=30ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=28ms TTL=53
Reply from 4.2.2.1: bytes=1472 time=29ms TTL=53

Ping statistics for 4.2.2.1:
    Packets: Sent = 7, Received = 7, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 28ms, Maximum = 31ms, Average = 29ms
Control-C
^C
C:\Users\eric.rosenberry>

Ping on Linux

Hey look, it is a ping command that is not ambiguous about what size frames it is generating!!!  It clearly shows that the payload is 56 bytes and that the full packet is 84 bytes.  Note that this is from an Ubuntu 2.6.27-7-generic kernel box.

ericr@eric-linux:~$ ping 4.2.2.1
PING 4.2.2.1 (4.2.2.1) 56(84) bytes of data.
64 bytes from 4.2.2.1: icmp_seq=1 ttl=52 time=21.8 ms
64 bytes from 4.2.2.1: icmp_seq=2 ttl=52 time=21.4 ms
64 bytes from 4.2.2.1: icmp_seq=3 ttl=52 time=21.4 ms
64 bytes from 4.2.2.1: icmp_seq=4 ttl=52 time=21.6 ms
64 bytes from 4.2.2.1: icmp_seq=5 ttl=52 time=21.8 ms
64 bytes from 4.2.2.1: icmp_seq=6 ttl=52 time=21.8 ms
^C
--- 4.2.2.1 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5019ms
rtt min/avg/max/mdev = 21.435/21.700/21.897/0.190 ms
ericr@eric-linux:~$

To set the packet size use the -s flag (it is asking for the payload size, so 1472 will create a 1500 byte frame).  Now if you want to turn off fragmentation by setting the do-not-fragment bit (DF), the parameter is a bit more obscure: “-M do”.  Here is an example using both:

ericr@eric-linux:~$ ping 4.2.2.1 -s 1472 -M do
PING 4.2.2.1 (4.2.2.1) 1472(1500) bytes of data.
1480 bytes from 4.2.2.1: icmp_seq=1 ttl=52 time=23.8 ms
1480 bytes from 4.2.2.1: icmp_seq=2 ttl=52 time=24.1 ms
1480 bytes from 4.2.2.1: icmp_seq=3 ttl=52 time=31.4 ms
1480 bytes from 4.2.2.1: icmp_seq=4 ttl=52 time=23.7 ms
1480 bytes from 4.2.2.1: icmp_seq=5 ttl=52 time=23.5 ms
^C
--- 4.2.2.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4017ms
rtt min/avg/max/mdev = 23.589/25.369/31.469/3.057 ms
ericr@eric-linux:~$

And there is another highly useful parameter that we have not seen yet in any of our previous ping utilities.  The Linux ping has a “flood” option that will send pings as fast as the machine can generate them.  This is great for testing a network link’s capacity, but can make for unhappy network engineers if you use it inappropriately.  Note that you must be root to use the -f flag.  Output is only shown when packets are dropped:

ericr@eric-linux:~$ sudo ping 10.0.0.1 -s 1472 -M do -f
PING 10.0.0.1 (10.0.0.1) 1472(1500) bytes of data.
.^C
--- 10.0.0.1 ping statistics ---
2763 packets transmitted, 2762 received, 0% packet loss, time 6695ms
rtt min/avg/max/mdev = 2.342/2.374/3.204/0.065 ms, ipg/ewma 2.424/2.384 ms
ericr@eric-linux:~$

Ping on Solaris

Here are the ping options from a Solaris 10 box (I forget what update this super-secret kernel number decodes to):

SunOS dbrd02 5.10 Generic_125100-07 sun4v sparc SUNW,Sun-Fire-T200

I find the basic ping command in Solaris to be annoying:

[erosenbe: dbrd02]/export/home/erosenbe> ping 4.2.2.1
4.2.2.1 is alive
[erosenbe: dbrd02]/export/home/erosenbe>

I want ping to tell me something more useful than that a host is alive.  Come on Sun, like round trip time at least?  Maybe send a few additional pings rather than just one?  The -s flag makes this operate more like the ping command in other OS’s:

[erosenbe: dbrd02]/export/home/erosenbe> ping -s 4.2.2.1
PING 4.2.2.1: 56 data bytes
64 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=0. time=21.6 ms
64 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=1. time=21.5 ms
64 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=2. time=21.6 ms
64 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=3. time=21.4 ms
64 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=4. time=21.1 ms
^C
----4.2.2.1 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 21.1/21.5/21.6/0.22
[erosenbe: dbrd02]/export/home/erosenbe>

With the Solaris built-in ping tool you can specify the packet size, but it is very annoying that you can’t set the do-not-fragment bit.  Come on SUN, didn’t you like invent networking???  So in this example I had it send multiple pings, and I told it to use a size of 1500, but since I could not set the DF bit the OS must have fragmented the packets before sending.  I got responses back that claim to be 1508 bytes, which I am assuming means that the 1500 bytes specified was the payload amount and the returned number of bytes includes the 8 byte ICMP header but not the 20 byte IP header…  Go SUN.

[erosenbe: dbrd02]/export/home/erosenbe> ping -s 4.2.2.1 1500
PING 4.2.2.1: 1500 data bytes
1508 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=0. time=36.7 ms
1508 bytes from vnsc-pri.sys.gtei.net (4.2.2.1): icmp_seq=1. time=23.2 ms
^C
----4.2.2.1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 23.2/30.0/36.7/9.6
[erosenbe: dbrd02]/export/home/erosenbe>

Conclusion

Well, I hope this rundown on a number of different ping tools is useful to folks out there and as always, if you have any comments/questions/corrections please leave a comment!

-Eric

Categories: Cisco, Linux, Network, Sun Tags:

Are blade servers right for my environment?

July 15th, 2009 No comments

IT, like most industries, has its fads, whether it be virtualization, SANs, or blade servers.  Granted, these three technologies play really nicely together, but once in a while you need to get off the bandwagon for a moment and think about what they really do for us.  While they are very cool overall and can make an extremely powerful team, as with anything, there is a right place, time, and situation/environment for their use.  Blades are clearly the “wave of the future” in many respects, but you must be cautious about the implications of implementing them today.

Please do not read this article and come away thinking I am “anti-blade” as that is certainly not the case.  I just feel they are all too often pushed into service in situations they are not the correct solution for and would like to point out some potential pitfalls.

Lifecycle co-termination

When you buy a blade center, one of the main selling points is that the network, SAN, and KVM infrastructure is built in.  This is great in terms of ease of deployment and management; however, on the financial side of things you must realize that the life spans of these items are not normally the same.  When buying servers I typically expect them to be in service for 4 years.  KVMs (while becoming less utilized these days) can last much longer under most circumstances (barring changes in technology from PS/2 to USB, etc.).  Network switches I expect to use in some capacity or another for seven years, and SAN switches will probably have a life-cycle similar to the storage arrays they are attached to, which I generally target at 5 year life spans.

So what does this mean?  Well, if your servers are showing their age at 4 years, you are likely to end up replacing the entire blade enclosure at that point, which includes the SAN and network switches.  It is possible the vendor will still sell blades that fit that enclosure; however, you are likely to want a SAN or network upgrade before that second set of servers reaches the end of its life-cycle, which will likely result in whole new platforms being purchased anyway.
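
To make the mismatch concrete, here is a tiny Python sketch using the target lifespans mentioned above; the assumption that everything in the enclosure gets retired when the servers do is mine, purely for illustration:

# Target lifespans (years) from the discussion above.
lifespans = {"servers": 4, "SAN switches": 5, "network switches": 7}
enclosure_replaced_at = 4   # assume the whole enclosure is retired when the servers are

for component, lifespan in lifespans.items():
    forfeited = max(0, lifespan - enclosure_replaced_at)
    print("%s: %d year(s) of useful life forfeited" % (component, forfeited))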

Vendor lock

You have just created vendor lock-in: with all that investment in enclosures, you can’t go buy someone else’s servers (this really sucks when your vendor fails to innovate on a particular technology).  All the manufacturers realize this situation exists and will surely use it to their advantage down the road.  It is hard to threaten not to buy Dell blades for your existing enclosures when doing so would mean throwing away your investment in SAN and network switches.

SAN design

Think about your SAN design.  Most shops hook servers to a SAN switch that is directly attached to the storage array their data lives on.  Blade enclosures encourage the use of many more, smaller SAN switches, which often requires hooking the blade enclosure switches to aggregation SAN switches that are then hooked to the storage processor.  This increases complexity, adds failure points, decreases MTBF, and increases vendor lock-in.  Trunking SAN switches from different vendors can be problematic and may require putting them in a compatibility mode that turns off useful features.

Vendor compatibility

Vendor compatibility becomes a huge issue.  Say you buy a blade enclosure today with 4 gig Brocade SAN switches in it for use with your existing 2 gig Brocade switches attached to an EMC CLARiiON CX500, but then next year you want to replace that with a Hitachi array attached to new Cisco SAN switches.  There are still many interoperability issues between SAN switch vendors that make trunking their switches together problematic.  If you had bought physical servers, you might have simply re-cabled them over to the new Cisco switches directly.

Loss of flexibility

Another pitfall I have seen folks fall into with blade servers is losing the flexibility that comes with a standalone physical server.  You can’t hook an external drive array full of cheap disks directly to the server, or hook up that network heartbeat crossover cable for your cluster, or add an extra NIC or two to a machine that needs to be directly attached to some other network (one that is not available as a VLAN within your switch).

Inter-tying dependencies

You are creating dependencies on the common enclosure infrastructure, so for full redundancy you need servers in multiple blade enclosures.  The argument that blade enclosures are themselves extremely redundant does not completely hold water with me; I have had to power cycle entire blade enclosures before to recover from certain blade management module failures.

Provisioning for highest common denominator

You must provision the blade enclosure for the maximum amount of SAN connectivity, network connectivity, and redundancy required by any one server within the enclosure.  Say, for instance, you have an authentication server that is super critical but not resource intensive; that alone requires your blade center to have fully redundant power supplies, network switches, etc.  Then say you have a different server that needs four 1 gig network interfaces, and yet another DB server that needs only two network interfaces but four HBA connections to the SAN.  You now need an enclosure that has four network switches and four SAN switches in it just to satisfy the needs of three “special case” servers.  In the case of the Dell M1000e blade enclosures, this configuration would be impossible since they can only hold six SAN/network modules total.

Buying un-used infrastructure

If you purchase a blade center that is not completely full of blades, then you are wasting infrastructure resources in the form of unused network ports, SAN ports, power supplies, and cooling capacity.  Making the ROI argument for blade centers is much easier if you need to purchase full enclosures.

Failing to use existing infrastructure

Most environments have some extra capacity on their existing network and SAN switches, because when those were purchased they were sized with future growth in mind (probably not with blade enclosures in mind, though).  Spending money to re-purchase SAN and network hardware inside a blade enclosure just to allow the use of blades can kill the cost advantages of going with a blade solution.

Moving from “cheap” disks to expensive SAN disks

You typically cannot put many local disks into blades.  This is in many cases a huge loss, as not everything needs to be on the SAN (and in fact, certain things such as swap files would be very stupid to put on the SAN).  I find that these days many people overlook the wonders of locally attached disk.  It is the *cheapest* form of disk you can buy and can also be extremely fast!  If your application does not require any of the advanced features a SAN can provide, then DON’T PUT IT ON THE SAN!

Over-buying power

In facilities where you are charged for power by the circuit, the key is to manage your utilization so that your unused (but paid-for) power capacity is kept to a minimum.  With a blade enclosure, on day 1 you must provide (in this example) two 30 amp circuits for the enclosure, even though you are only putting in 4 out of a possible 16 servers.  You are going to be paying for those circuits even though you are nowhere near fully utilizing them.  The Dell blade enclosures, as an example, require two three-phase 30 amp circuits for full power (though depending on the server configurations you put in them, you can get away with dual 30 amp 208v circuits).
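
To put some hypothetical numbers on that, here is a minimal Python sketch.  The per-circuit price, per-blade draw, and derating factor are all made-up assumptions for illustration, not figures from any particular facility:

# All numbers below are assumptions for illustration only.
circuit_cost_per_month = 250.0             # hypothetical colo price per provisioned circuit
circuits_required = 2                      # the enclosure needs both circuits from day one
watts_per_blade = 350.0                    # rough assumed draw per blade
usable_watts_per_circuit = 30 * 208 * 0.8  # 30 amp 208v circuit with an 80% continuous-load derating

blades_installed = 4
capacity_w = circuits_required * usable_watts_per_circuit
used_w = blades_installed * watts_per_blade

print("Provisioned %.0f W, drawing %.0f W (%.0f%% utilized)"
      % (capacity_w, used_w, 100.0 * used_w / capacity_w))
print("Monthly spend on circuits: $%.2f" % (circuits_required * circuit_cost_per_month))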

Think about the end of the life-cycle

You can’t turn off the power to a blade enclosure until the last server in that enclosure is decommissioned.  You also need to maintain support and maintenance contracts on the SAN switches, network switches, and enclosure until the last server is no longer mission critical.

When are blades the right tools for the job?

  • When the operational costs of your operations and maintenance personnel far outweigh the cost inefficiencies of blades.
  • When you are buying enough servers that you can purchase *full* blade enclosures with similar connectivity and redundancy requirements (e.g. each needs two 1 gig network ports and two 4 gig SAN connections).
  • When you absolutely need the highest server density available (note that most datacenters in operation today can’t handle the power density and heat that blades put out).

An example of a good use of blades would be a huge Citrix farm, a VMware farm, or in some cases a web server farm (though I would argue that very large web farms which can scale out easily should be on some of the cheapest hardware you can buy, which typically does not include blades).

Another good example would be compute farms (say, Lucene cache engines), as long as you have enough nodes to fill enclosures with machines that have the same connectivity and redundancy requirements.

Conclusion

While blades can be great solutions, they need to be implemented in the right environments for the right reasons.  It may indeed be the case that the savings in the operational costs of the employees who set up, manage, and maintain your servers far outweigh all of the points raised above, but it is important to factor all of these into your purchase decision.

As always, if you have any feedback or comments, please post below or feel free to shoot me an email.

-Eric

Categories: Cisco, Dell, HP, IBM, Network, Sun, Systems Tags:

Cisco Netflow to tell who is using Internet bandwidth

July 4th, 2009 1 comment

When working with telecom circuits that are slow and “expensive” (relative to LAN circuits), the question frequently comes up: “What is using up all of our bandwidth?”  Many times this is asked because an over-subscribed WAN or Internet circuit is inducing latency/packet drops in mission-critical applications such as Citrix or VoIP.  In other cases a company may be paying for a “burstable” Internet connection, whereby they pay for a floor of 10 megabits but can utilize up to 30 megabits and just be billed for the overage (generally at the 95th percentile).
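
For anyone who has not run into 95th percentile billing before, here is a minimal Python sketch of how it generally works; the sample counts and Mbps values are made up purely for illustration:

import random

# Pretend these are Mbps readings sampled every 5 minutes over a billing month.
samples = [random.uniform(5, 12) for _ in range(8500)]   # mostly hovering near the 10 megabit floor
samples += [random.uniform(25, 30) for _ in range(200)]  # occasional bursts toward 30 megabits

# Sort the samples, discard the top 5%, and bill on the highest remaining sample.
samples.sort()
billable = samples[int(len(samples) * 0.95) - 1]

print("95th percentile: %.1f Mbps" % billable)
# The 200 burst samples all land in the discarded top 5%, so short bursts
# do not drive up the bill; sustained usage above the 10 megabit floor does.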

So how do you tell which user/server/application is chewing up your Internet or WAN circuits?  Well Cisco has implemented a technology called “netflow” that allows your router to keep statistics on each TCP or UDP “flow” and then periodically shove that data into a logging packet and ship it off to some external server.  On this server you can run one of a variety of different software packages to analyze the data and understand what is using up your network bandwidth.

The question is, what software package should you utilize?  I have not gone and evaluated all of the available options, but I do have experience with a couple of them.  I have used Scrutinizer from Plixer in the past and not been very impressed.  Part of it may have been that the machine it was running on was not very fast, but I just did not like the interface or capabilities much.

More recently I downloaded and ran NetFlow Analyzer from ManageEngine and I have been very impressed!  It is free for up to two interfaces, and they have an easy-to-download-and-install demo that will run unlimited interfaces for 30 days.  It runs on Linux or Windows (I tried the Linux version) and is dirt simple to install and configure.  There really is nothing of note to configure on the server itself; you just need to point your router at the server’s IP and it will automatically start generating graphs for you.

I should also mention that Paessler has some kind of netflow capabilities (in PRTG), but I have not checked it out.  I note it here since I use their snmp monitoring software extensively and I have been happy with it.

To get your router to send NetFlow data to a collector, you need to set a couple of basic settings (including which version of NetFlow to use and where to send the packets), and then enable flow collection on your interfaces.  Note that it used to be that you could only collect NetFlow data on ingress to an interface, so in order to capture bi-directional traffic you needed to enable it on every single router interface to see the traffic in the opposite direction.  This was done with the “ip route-cache flow” command on each interface.

Now “ip route-cache flow” has been replaced by “ip flow ingress”, and you can also issue the “ip flow egress” command if you do not want to monitor all router interfaces.  I have just stuck with issuing “ip flow ingress” on all my interfaces since I wanted to see all traffic anyway (and I am not quite sure what would happen if you issued both commands on two interfaces and then had traffic flow between them; it might double count those flows).

Here are the exact commands I used on plunger to ship data to Netflow Analyzer 7:

plunger#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
plunger(config)#ip flow-cache timeout active 1
plunger(config)#ip flow-export version 5
plunger(config)#ip flow-export destination x.x.x.x 9996
plunger(config)#int fastEthernet 0/0
plunger(config-if)#ip flow ingress
plunger(config-if)#int fastEthernet 0/1
plunger(config-if)#ip flow ingress
plunger(config-if)#end
plunger#write mem
Building configuration…
[OK]
plunger#exit

Happy NetFlowing!

-Eric

Categories: Cisco, Network Tags: