Archive for the ‘Cisco’ Category

Sawtooth looking graphs from Cisco SNMP queries

March 28th, 2009 4 comments

I have been annoyed for quite some time by very regular patterns of “spikiness” that I run into on certain network graphs, and wanted to share my findings to see if anyone else can explain this better than I can.

Below is a graph of what should be extremely consistent traffic.  This graph is created by polling an SNMP OID once every thirty seconds, and it covers a one hour time period.

Spiky Traffic

My best theory is that internally to the Cisco device I am polling (in this case a Cisco ASR) the SNMP OID counters are only updated once every “x” amount of time, where “x” is maybe 10 seconds.  Since I poll every 30 seconds I most often get three “updates” in between the time I poll the OID, but once every five minutes for some reason I only get two updates worth of data (20 seconds).  Then in the next interval I get forty seconds worth of data.
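To make the theory concrete, here is a minimal Python sketch of it. Everything in it is an assumption for illustration: the 10 second refresh interval, the made-up traffic rate, the poll timestamps, and the helper names `counter_value` and `observed_rates`. It models a counter that only refreshes on fixed boundaries and shows what a poller computes when one poll lands slightly early.

```python
import math

def counter_value(t, update_interval, true_rate):
    """Counter as SNMP would report it at time t, if the device only
    refreshes the value every update_interval seconds (assumed behavior)."""
    completed_updates = math.floor(t / update_interval)
    return completed_updates * update_interval * true_rate

def observed_rates(poll_times, update_interval=10.0, true_rate=100.0):
    """Rate a graphing tool would plot for each pair of consecutive polls."""
    rates = []
    for earlier, later in zip(poll_times, poll_times[1:]):
        delta = (counter_value(later, update_interval, true_rate)
                 - counter_value(earlier, update_interval, true_rate))
        rates.append(delta / (later - earlier))
    return rates

# Hypothetical poll timestamps: nominally every 30 s, but the fourth poll
# lands one second early, just before the device refreshes its counter.
polls = [0.5, 30.5, 60.5, 89.5, 120.5, 150.5]
for rate in observed_rates(polls):
    print(round(rate, 1))
```

With perfectly steady traffic, the interval that catches only two refreshes plots low and the following interval, which catches four, plots high by about the same amount. That dip-then-spike pair is the sawtooth shape, and the long-run average is unaffected.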

If you have a better explanation than I do, please post a comment or email me and I will update this post with the answer!

-Eric

Categories: Cisco, Network Tags:

Cisco Router/Switch Standard Configuration Checklist

March 25th, 2009 2 comments

Every time I go to configure a new Cisco router/switch, I refer to my trusty text file of the “standard” items I need to setup in order to easily manage/troubleshoot the device in the future, as well as to properly secure it.  I have posted the annotated version of that checklist here in the hopes that others find it useful as well!

As always, if you have any suggestions for items to add to my list feel free to email me!  I am sure there are cool tips/tricks I am missing.

  1. I first have to figure out how to “get in” to the device.  Often if I am re-configuring an old device I must first follow the password-recovery process as the password may be lost.
  2. Once I am logged into the device I must determine how much RAM and flash memory the device has so that I can decide on the best IOS version to run.
  3. I go to cisco.com and figure out what IOS rev makes the most sense based on what features I need, how much RAM/flash I have to work with, what I am licensed for, and what level of reliability is required (i.e. can I only run “General Deployment” code, or am I willing to run a “technology” train because it has some new feature I need).
  4. If the device is a stackable switch, make sure it is all stacked together the way you want it stacked, with the correct device as the command switch (check your priority settings).  Note that even a single switch by itself could think it was the 3rd switch in a stack (if previously a stack member) and mess up all your interface numbers, which is painful to fix later!
  5. If configuring certain models of switch, you need to set the system MTU (which persists outside the configuration file) and reboot the switch for it to take effect.  The default is 1500, so unless you are planning on using jumbo frames (i.e. 9000 bytes hopefully), you don’t need to worry about this.
  6. I make sure I have IP connectivity to the device from my laptop so that I can download firmware to it.  Often times this means I will tftp the config from the device to my laptop to prove it works (and saving the old config may be valuable depending on the situation).
  7. Since I generally like to have as absolutely clean of a device as possible, I will often reformat the flash device so that I can be guaranteed to start from scratch.  Note that this can be dangerous if the device crashed/lost power before you got a new image on it.  You would then have to connect to the console and zmodem download the OS at 9600 baud (not fun!)
  8. Once the format is complete I tftp the new IOS version onto the device.
  9. I do a “write erase” to make sure there is no configuration left in the device (I want it to start from the new defaults as set by whatever version of IOS I am loading).
  10. I will often times check to see what the bootvar is set to, but really at this point it does not matter since there is only one image on the flash disk so it will choose that even if something else is specified.
  11. I issue the reload command and choose *not* to save the running config (since I just zeroed out the startup-config).
  12. When the device boots I say no to the setup prompts and I tell it to kill autoinstall if that is trying to run.
  13. I check all the startup messages to make sure all the hardware is recognized and initialized properly.
  14. At this point I jump into enable mode and do a “show run” to verify all the interfaces I expect are available and to see what new defaults or settings Cisco threw in on this build (they often have default settings show up in the config in order to call your attention to a new feature, or something that has changed).
  15. My first order of business is normally to get the device remotely accessible, as using the console is slow, and I am normally not in a very comfortable environment (i.e. cold, noisy datacenter with poor ergonomics).  To accomplish this I will often just enable telnet (if I am connecting across a secure network), and then later enable SSH.  To make telnet work you need an “enable secret” (well you could use an “enable password”, but that’s old and less secure so just don’t bother).  You will also need to set a password on the vty’s for telnet.  “line vty 0 15” “password <blah>”.
  16. Before you can actually login to the device you must configure an IP on at least one interface, and un-shut it.  You may also need a default route (or some kind of static route) depending on if you are connecting from the local subnet or not.
  17. Once into the device, it is time for a ton of “housekeeping”.  First off, set the “hostname” and “ip domain name <domainname>”.  Note that both of these are required for generating RSA keys later so you can enable SSH.
  18. Setup your name servers with “ip name-server <server>” commands.
  19. Make sure your “boot system <x>” parameter (also known as bootvars) is set to nothing (if you only have one image on the flash device it will boot off that image).  I generally only use this command when upgrading to a new IOS version and I want to keep the old one around for fallback purposes.  I believe in keeping my configs as simple as possible.  I only configure something if it provides value.
  20. Make sure “service password-encryption” is turned on.  This at least obfuscates some passwords that are normally stored in clear text.
  21. Depending on how you want log entries to look, I will often adjust the “service timestamps” command to have it log in localtime.  If you are just dealing with a small local network in one timezone it is often easier to troubleshoot when not having to think about GMT.  Users generally report issues in local time.  😉
  22. Set the timezone using “clock timezone PST -8” (or whatever is appropriate for you).  You then must further tell it to do daylight savings time (if your locale observes it).  I use “clock summer-time PDT recurring” myself.  The thing that sucks about this command is that I suspect the dates it changes the time forward and back on change depending on how new your IOS is since the US keeps mucking with them.  You can also manually tell it when to make these time changes with this command.
  23. Setup at least one ntp server with the “ntp server <ip>” command so that your device always has the correct time.  Having the correct time is really important for correlating log file entries when troubleshooting.
  24. Turn off CDP on any interfaces that are hooked to “untrusted” networks (i.e. if they are connected to an ISP turn it off).  Issue the “no cdp enable” command on these interfaces.
  25. Setup logging as appropriate for your environment.  If you don’t have an external syslog server (or even if you do), you may want to use something like “logging buffered notifications” (or some other logging level) so that you can at least do a “show log” on the device and see the last “x” number of messages.
  26. I will generally setup my devices with a local username and password (which I can use for SSH, console, telnet, etc…) and then get rid of the enable password, enable secret, and line passwords.  To accomplish this I use the following commands: “username <username> privilege 15 password <password>”, “aaa new-model”, “aaa authentication login default local”, “aaa authorization exec default local”.  Make sure to test your access to the device before you drop out of enable mode or disconnect as you could lock yourself out!
  27. Now that you have configured a local user account, and earlier on you set a hostname and domain name, you can now generate RSA keys so that you can SSH to the device.  Issue the “crypto key generate rsa modulus 1024” command to accomplish this.  Note that you can use various modulus values.  The default is 512 but I generally step it up a bit.
  28. Turn off the HTTP server (“no ip http server”) unless you use it.  Otherwise it is just a security issue (I think the previous step may actually not password protect the http server by default so be careful of this).
  29. On all your terminal lines (console, aux, and vty’s) use the “transport preferred none” command to keep the stupid router from trying to telnet every time you fumble finger a command!  I prefer this method of disabling this annoying functionality over the “no ip domain-lookup” route since doing dns queries from your router is sometimes convenient (and necessary!).
  30. You can also use the “logging synchronous” on the terminal lines in order to have the device re-print the command you are in the process of typing after it rudely spews a log message to your console.
  31. To make your router/switch only accessible via ssh (since telnet is insecure right?) issue the “transport input ssh” command on the vtys.
  32. For security purposes, set timeouts on the terminal lines so that users get kicked out after inactivity.
  33. Setup a loopback IP address for device management.  This allows you to connect (and stay connected) to the device when interfaces go up and down.  You will likely want to make this the source IP for syslog messages and snmp traps as to not confuse your NMS (if you have one).
  34. If you are configuring a switch, setup your VLANs (name them etc…), also, make sure to set the root bridge priorities (you don’t want random switches becoming the root bridge).
  35. On switches, you may want to set global settings for portfast, bpduguard, and rootguard to apply to all ports.
  36. If setting up a switch, you should configure (or disable using transparent mode) VTP explicitly, otherwise someone could add another VTP switch to the network as a server and magically your switch will become part of that VTP domain.  This can impact your VLAN settings for all your switchports!
  37. Ok, now that we have all that drudgery out of the way, go ahead and configure all your interfaces with IP addresses as appropriate.  Please make sure while you are at it to use the “description” command on each of them to describe what is attached.  If it is an ISP or WAN circuit make sure to provide the circuit ID for ease of troubleshooting.
  38. If your device is a layer three switch and you want it to do routing, make sure to set “ip routing”.  This command is frequently overlooked and is very frustrating to troubleshoot why the device won’t route packets!  For some reason (probably with good intentions to keep routing loops from happening or for security reasons) Cisco keeps this one turned off by default.
  39. Make sure all your speed and duplex settings are set properly for each port.
  40. If on a switch, make sure all ports are set explicitly to access mode if that is what you intend them to be used for.  By default a negotiation will take place and a port you did not configure can become a “trunk” port, providing access to VLAN’s  you were not intending to make available.
  41. Add in whatever default routes or routing protocols you need to run in order to have connectivity to the rest of the network.  Remove any un-needed temporary static routes you had put in place earlier.  Make sure to disable routing protocols on interfaces where they could pose a security risk.
  42. Setup an access list to control remote management access to your router based on IP address.
  43. Setup an snmp community on the device if you plan to use it.  I generally recommend everyone setup snmp read access so they can poll their devices with network graphing software.  Some environments use snmp traps for alerting.  You can also control access to snmp with an access list.  Make sure to set the snmp “location” and “contact” settings.
  44. Once the device is up and operational, make sure to check the interface statistics to ensure you are not getting errors.  The most common mis-configurations are with speed/duplex settings on Ethernet, and when talking about T-1’s the most common issue is getting the clock sources correct.
  45. Shutdown any un-needed interfaces.

See, wasn’t that an easy 45 step simple process?  😉
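A checklist this long is easy to only half finish, so a quick automated sanity check can help. The Python sketch below is only an illustration: `REQUIRED_PREFIXES` and `missing_items` are hypothetical names, the sample config is made up, and it only looks for a handful of the commands quoted literally in the list above. It does not understand IOS syntax or context.

```python
# Commands taken from the checklist text; this is a small subset, not the
# whole list, and the prefix matching is deliberately naive.
REQUIRED_PREFIXES = [
    "hostname ",
    "service password-encryption",
    "ip name-server ",
    "ntp server ",
    "transport preferred none",
    "transport input ssh",
    "logging synchronous",
]

def missing_items(config_text):
    """Return the checklist commands that do not start any line of the config."""
    lines = [line.strip() for line in config_text.splitlines()]
    return [
        prefix.strip()
        for prefix in REQUIRED_PREFIXES
        if not any(line.startswith(prefix) for line in lines)
    ]

# Made-up sample config, for illustration only.
sample_config = """\
hostname rtr1
service password-encryption
ip name-server 10.0.0.53
line vty 0 15
 transport input ssh
 transport preferred none
"""

print(missing_items(sample_config))
```

You would feed it the text of a saved “show run” (for example, the copy you tftp'd to your laptop) and review whatever it reports by hand.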

-Eric

Categories: Cisco, Network Tags:

Switching Providers With Cisco CDMA 3G HWIC

March 24th, 2009 1 comment

A while back I ran across info on Cisco’s site about their new HWIC-3G-CDMA and HWIC-3G-GSM cards that allow you to connect a Cisco router to the Internet through 3G cell phone networks!  I am glad to see this since I have on occasion wanted a *reliable* way to share my broadband card connection with a small group of computers (i.e. at a tradeshow, etc…).  I have shared broadband cards out through my laptop before but it is a pain, and my general feeling is that the consumer grade “routers” that you can buy now are generally poor quality.

This last week I noticed they have separate HWIC-3G-CDMA-V and HWIC-3G-CDMA-S versions of the CDMA card that are specific to Verizon and Sprint.  I find this annoying, as if I am going to spend $850 (MSRP) on one of these cards I don’t want to be vendor locked to Sprint or Verizon.  From a chipset standpoint, these cards are using exactly the same technology.

This is particularly problematic as wireless service is so location dependent.  Say you are deploying a bunch of remote site routers across the country and using a 3G card for backup connectivity.  You can’t have any idea at the time you order/ship the devices as to which wireless provider will have the strongest signal available in a given area (i.e. Verizon might have EVDO Rev. A in an area that Sprint only has 1xRTT).

Furthermore, you might decide for cost reasons that you want to move from one provider to another, but the cost of buying new WIC’s makes that untenable.  Can you imagine if you had to buy a different T-1 WIC for Verizon vs. Qwest?

To find out if this was a permanent thing, or if they could be re-flashed over to the other provider, I asked one of my Cisco engineers.  Below is their response which I thought was worth sharing:

“Short answer ‘No’. There is a lot of carrier specific information that gets bundled into the firmware for these modems and the only option is to swap the modems out. The carriers and probably the modem vendors have tools that are able to change the modem parameters to work from one carrier to the other, but we do not have that expertise.”

I also asked if the cards required special data plans, or if the “standard” $59.99 plans would work.

“In terms of pricing – the SPs have a special pricing plan for these modems that get plugged into Enterprise class networks. Information can be obtained on each of their websites.”

I have heard from reliable sources though that with Verizon you can put them on normal $59.99 data plans and they work just fine.  You just do an ESN swap to get that device activated.  Naturally, they won’t provide any kind of support for your device, but it does work.  Also note that on Verizon if you go over 5 gigs of transfer in a month they will throttle you to 200Kbps.

I am *sure* that with the right equipment you could “make it happen”.  Maybe some day we can hope Cisco releases a tool that lets you do it.  So I wonder when the WiMax version of the card will be available?  That would be cool since I do live in Portland, OR (one of the WiMax test markets).

-Eric

Categories: Cisco, Network, Telecom, Wireless Tags:

Configuring a Cisco Router For Secure dyndns Dynamic DNS

March 22nd, 2009 2 comments

While I normally dislike using dynamic IP addresses on Cisco devices, they are sometimes a necessary evil.  I finally got around to setting up a Cisco router at home this weekend, and decided that I needed to setup Dynamic DNS so I could get into it remotely, even though it is on a dynamic IP address on my Verizon FiOS connection.  In doing the research I came across this posting which not only explains how to configure IOS to use the dyndns service, but also shows how to do it securely using https.

To be honest, I went ahead and used the http (insecure) method as I did not want to mess with certificates, but in many situations you may need the security provided by https.  I am not really worried about anybody hijacking the dyndns for my home IP address.  😉

There is also a Cisco page on how to setup ddns, but it is really confusing and for my purposes, the only useful parts were all the way at the end of the page.  I believe most of this page deals with getting the Cisco router to update DNS on behalf of clients.  In our case all we want is for the router to update the DNS for the single IP address of its outside interface.

One very important learning I got out of this process was on how to escape a question mark (?) when you are trying to insert it into a text string on a Cisco device.  Normally IOS treats a question mark as a request for help.  To escape it you must type ctrl-v right before pressing ?.  I never knew that before.  I had tried to insert ? into description fields before but always gave up and chose to use other punctuation to get my point across.  😉

UPDATE: After running this config for a day I got a nastygram from dyndns that I was updating too frequently.  I had to add the minimum and maximum lines to my config so that it only updates once every 28 days if your IP does not change:

interface FastEthernet0/0
 ip ddns update hostname bitplumber.homeip.net
 ip ddns update dyndns

ip ddns update method dyndns
 HTTP
  add http://usernamehere:passwordhere@members.dyndns.org/nic/update?system=dyndns&hostname=<h>&myip=<a>
 interval maximum 28 0 0 0
 interval minimum 28 0 0 0

-Eric

Categories: Cisco, Network Tags:

Uploading software and config files with tftp

March 21st, 2009 No comments

When working with Cisco (and other) network gear, one of the common tasks I perform is copying configuration files or system images to/from a router/switch/firewall using tftp.  In many cases I am transferring these to my laptop using a tftp server running on my laptop and the tftp client built into the network device.

Many years ago someone told me about this tftp server called PumpKIN and I have been using it ever since.  The entire program comes in a 144k download and it installs in about half a second.  Works like a champ and it is just so dirt simple, it has never failed me.

PumpKIN

The one thing to note is that if you accidentally un-check the box in the bottom right corner, it will stop responding to requests (or if you launch multiple copies of the server, only one can be bound to the socket at a time).

Here is a screenshot of the options window.  I always set the settings as shown below so that I don’t have to approve each time a device needs to “get” a file from the server (many devices send multiple requests when downloading an image to verify its existence and such before download).  I also set the directory to c:\tftp so that it is easy to find and get access to.

PumpKIN Options

I should note that after this latest install in Windows Vista I had to open udp port 69 in the Windows Firewall to get PumpKIN to work.

-Eric

Categories: Cisco, Network Tags:

How To Configure Jumbo Frames

March 18th, 2009 17 comments

I have found there is a lot of confusion out there, as to how jumbo frames work, and in what circumstances they should be used.  Hopefully this post will be useful to others to avoid some of the pitfalls I have run into.

On your standard Ethernet network, the default MTU is 1500 bytes.  This gets notated in a number of different ways, however, because the actual amount of data sent is normally 1514 bytes, which adds the dest mac address (6 bytes), source mac address (6 bytes), and protocol id (2 bytes) identifying the data in the frame (in most cases these days IP).  If you further include the 7 byte preamble, the 1 byte Start-of-Frame-Delimiter, and the 4 bytes of CRC32 at the end, you get 1526 (not to mention the 12 bytes of interframe-gap after transmission).  I have also seen it notated as 1518 bytes (1500 + 6 bytes src mac, 6 bytes dest mac, 2 bytes protocol id, and 4 bytes CRC 32).  When you run ifconfig on most operating systems it will show you 1500 bytes.  Within this “1500 byte frame” the usable space is further cut down to 1480 bytes by 20 bytes of IP header, and then if you are using TCP the data payload is cut down to 1460 bytes by another 20 bytes of TCP header.
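Since all of these numbers are just sums of the same few fields, it can help to see them computed in one place. This is a small Python sketch of that arithmetic; the function name `frame_sizes` and the dictionary keys are made up for illustration, and it assumes no VLAN tag and no IP or TCP options.

```python
# Field sizes in bytes, as listed in the paragraph above.
DST_MAC, SRC_MAC, PROTOCOL_ID = 6, 6, 2
CRC32 = 4
PREAMBLE, START_OF_FRAME = 7, 1
INTERFRAME_GAP = 12
IP_HEADER, TCP_HEADER = 20, 20

def frame_sizes(mtu=1500):
    """Return the different ways the same frame size gets quoted."""
    header = DST_MAC + SRC_MAC + PROTOCOL_ID
    return {
        "mtu": mtu,                                    # what ifconfig shows
        "with_header": mtu + header,                   # 1514 for mtu=1500
        "with_header_and_crc": mtu + header + CRC32,   # 1518
        "with_preamble": mtu + header + CRC32 + PREAMBLE + START_OF_FRAME,  # 1526
        "time_on_wire": mtu + header + CRC32 + PREAMBLE + START_OF_FRAME
                        + INTERFRAME_GAP,              # 1538
        "ip_payload": mtu - IP_HEADER,                 # 1480
        "tcp_payload": mtu - IP_HEADER - TCP_HEADER,   # 1460
    }

for name, size in frame_sizes().items():
    print(f"{name:>20}: {size}")
```

Calling `frame_sizes(9000)` gives the same breakdown for a 9000 byte jumbo MTU.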

Phew!  Was that enough numbers?  Now that we have that confusion out of the way, the question is why would you run something larger than 1500 bytes?  Well, back when the standard was set at 1500 byte frames we were not dealing with as large amounts of data as we are today.  Data transmission speeds were slower (i.e. 10 megabit) and the chances of encountering an error during the time it took to send a 1500 byte frame were much higher than sending a 1500 byte frame at gigabit speeds.  Keeping frames smaller meant you had to re-transmit less data when an error occurred.

As 100 megabit and then gigabit came out, the standards groups kept the 1500 byte limit for backwards compatibility.  This is great, except that when transferring large blocks of data (say during backups) this is less efficient than it could be since for every 1460 bytes of TCP payload data, you have 78 bytes worth of “time on the wire” that is used up.  You also force the computers at both ends of the connection to calculate a CRC-32 value for every 1460 bytes of data which is processor intensive (I am assuming you are not running a TOE), and also send ACK’s back etc…
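The 78 byte figure is the 38 bytes of Ethernet framing (header, CRC, preamble, start-of-frame delimiter, and interframe gap) plus 40 bytes of IP and TCP headers. Here is a rough Python sketch comparing how much of the wire time carries TCP payload at different MTUs. The function name `tcp_wire_efficiency` is made up for illustration, and it assumes no VLAN tag and no IP or TCP options.

```python
ETHERNET_OVERHEAD = 6 + 6 + 2 + 4 + 7 + 1 + 12   # 38 bytes around each frame
IP_TCP_HEADERS = 20 + 20                          # 40 bytes inside the MTU
PER_FRAME_OVERHEAD = ETHERNET_OVERHEAD + IP_TCP_HEADERS

def tcp_wire_efficiency(mtu):
    """Fraction of time on the wire spent carrying TCP payload."""
    payload = mtu - IP_TCP_HEADERS
    return payload / (mtu + ETHERNET_OVERHEAD)

print("overhead per frame:", PER_FRAME_OVERHEAD, "bytes")
for mtu in (1500, 9000):
    print(f"mtu {mtu}: {tcp_wire_efficiency(mtu):.2%} payload")
```

The gain in raw wire efficiency is only a few percent. The bigger win is usually the roughly six-fold reduction in the number of frames (and the per-frame work) the hosts at each end have to handle.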

So for certain applications you might want to consider running larger than 1500 byte mtu’s.  I recommend them only in the following circumstances:

  • You have an all gigabit network for the VLAN/network in question
  • All switches in your switching path can be configured to be “jumbo frame clean” up to the size frames you choose to use
  • You are sure all your host operating systems and NIC drivers will support Jumbo Frames at the size of frame you decide to use (this can vary a lot!) – note that I was once told that DataDomain boxes support jumbo frames, but they are not recommended because it causes a performance impact (this does not make sense to me, but it is what they said)
  • You have a real reason to need to run them (i.e. a backup VLAN or NFS VLAN)

Note that getting jumbo frames to work can be a real pain and in many cases is probably not worth it.  When you make the switch on a given subnet over to jumbo frames, you must simultaneously change *every* host on that network/VLAN or you will have weird problems where some hosts can talk to each other and others have difficulty for some types of connections.  TCP connections might work, but UDP connections fail intermittently (i.e. you can mount an nfs mount but when transferring files it fails).

As a side note, the reason that TCP will work on a MTU size mismatched network is that in the TCP 3 way handshake, a MSS (Maximum Segment Size) is specified by both ends of the connection and the smaller of the two is agreed upon.  Running different hosts on the same subnet with differing MTU sizes is *not* recommended.
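Here is a tiny Python sketch of that MSS behavior. The helper names are hypothetical, and it assumes plain 20 byte IP and TCP headers with no options.

```python
IP_HEADER = 20
TCP_HEADER = 20

def advertised_mss(mtu):
    """MSS a host would typically advertise for a given interface MTU."""
    return mtu - IP_HEADER - TCP_HEADER

def effective_mss(mtu_a, mtu_b):
    """Segment size a TCP connection ends up using: the smaller of the two."""
    return min(advertised_mss(mtu_a), advertised_mss(mtu_b))

# A jumbo host talking to a host that was left at the default MTU.
print(effective_mss(9000, 1500))
```

UDP has no equivalent exchange, so nothing tells the jumbo host to send smaller datagrams to the 1500 byte host, which fits the NFS symptoms described above.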

So what size MTU should you choose?  Since there is no standard, different vendors have chosen to implement different maximum supported sizes.  I have seen drivers support all sorts of different mtu’s…  Some allow any mtu up to a max, while others support only several “canned” values (i.e. some Windows drivers).  From some research I have read, CRC-32 becomes ineffective above about 11,000 bytes.  I have found that every manufacturer I have used (that supports jumbo frames at all) will support 9000 byte MTU’s, therefore I highly recommend using 9000 byte MTU’s (this is also what is standardized on for Internet 2 traffic).

It is extremely important that your entire VLAN/network you are implementing this on be “Jumbo Frame Clean” for the size MTU you are using.  What I mean by this is that every switch in the path will allow at least 9000 byte mtu packets through.  Different switches have differing commands to make them pass jumbo frames.  In my Cisco 6500 series platforms I set the mtu on a per-port basis (depending on your Supervisor and OS version you might only have 9216 as an option which is fine as long as it is larger than your chosen mtu).  On my 3750’s (and similar stacking blade enclosure switches) I have to set it globally on the switch and reboot the switch to take effect.

Operating System Settings

To actually set jumbo frames in Windows you go to the device manager, select your NIC driver, go to the Advanced tab, and find the Jumbo frame setting.  You must either choose the correct MTU size from a drop-down list provided (in some drivers), or enter the correct frame size into the box (if applicable).  You might need to verify with the manufacturer what number they are expecting here (does it include frame headers and footers?).

Here is an example of a NVIDIA NIC:

NVIDIA Jumbo Frame Settings in Windows

And here is an example of an Intel NIC.  Note that NVIDIA lists it as 9000 bytes and Intel lists 9014 (the extra 14 bytes match the size of the Ethernet header), but my testing has shown these settings to be equivalent!  Also note that none of the other options align…  (except for 1500 bytes)

Intel Jumbo Frame Settings in Windows

To set jumbo frames on Solaris, you must first figure out how to configure  your specific driver for jumbo frames, and then use ifconfig to set the interface to the specific frame size you decide on.

Testing

To test to make sure jumbo frames are working, you can use the following ping command from your Cisco switch (this is assuming that you have a layer 3 interface in the jumbo frame vlan and that L3 virtual interface is setup for your jumbo mtu size).

switch#show run int vlan 99               
Building configuration…

Current configuration : 190 bytes
!
interface Vlan99
 description backup
 mtu 9000
 ip address 10.10.9.1 255.255.255.0
 ip broadcast-address 10.10.9.255
end

switch#

switch#ping ip 10.10.9.19 size 9000 df-bit

Type escape sequence to abort.
Sending 5, 9000-byte ICMP Echos to 10.10.9.19, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/3/4 ms
switch#

You can also generate jumbo frames from Solaris boxes, but Solaris does not seem to support setting the “DF” (do not fragment) bit which is a shame since you might never know that the machine you are on fragmented the packet before it sent it, hence not actually testing jumbo frames…

bash-3.00# ping 10.10.9.1 9000
10.10.9.1 is alive

Windows boxes can send jumbo frame pings as well, but you must subtract out the 20 bytes of IP header and 8 bytes of ICMP header that it inserts in the size you specify.  They *do* support setting the DF bit.

C:\>ping 10.10.9.1 -f -l 8972

Pinging 10.10.9.1 with 8972 bytes of data:

Reply from 10.10.9.1: bytes=8972 time=2ms TTL=255
Reply from 10.10.9.1: bytes=8972 time=2ms TTL=255
Reply from 10.10.9.1: bytes=8972 time=2ms TTL=255
Reply from 10.10.9.1: bytes=8972 time=2ms TTL=255

Ping statistics for 10.10.9.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 2ms, Maximum = 2ms, Average = 2ms

C:\>
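The 8972 in the Windows example is just the MTU minus the headers. If you test several MTU sizes, a one-line helper saves the mental arithmetic. This is a minimal Python sketch, and the function name is made up for illustration.

```python
IP_HEADER = 20
ICMP_HEADER = 8

def windows_ping_payload(mtu):
    """Value to pass to 'ping -f -l' so the resulting IP packet is exactly mtu bytes."""
    return mtu - IP_HEADER - ICMP_HEADER

for mtu in (1500, 9000):
    print(f"mtu {mtu}: ping <host> -f -l {windows_ping_payload(mtu)}")
```

If the payload is even one byte larger than this and the DF bit is set, the ping should fail rather than quietly fragment, which is what makes it a useful test.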

There are a lot more details to everything we went through to implement jumbo frames on our backup networks, but this has hopefully given you the high level overview and a few key tips.

-Eric

Categories: Cisco, Microsoft, Network Tags:

Troubleshooting connectivity issues with multiple BGP connected ISP’s

March 11th, 2009 No comments

I was rolled out of bed early this morning by a page from my traffic graphing software PRTG alerting me that one of the F5 OIDs I had it monitoring was reporting higher than normal average connection times.  I have it monitoring this F5 pool because it serves out a very latency sensitive SOAP based application that must be available 24×7 on the Internet.

As it turned out, the average connection times were being driven abnormally high because one of my ISP’s upstream ISPs had a router failure in California, and was blackholing some traffic.  This must have been causing either half-opened TCP connections, or delayed/dropped connections, that were taking a long time to timeout, hence driving up the average connect time.

This gets me to the question I am attempting to address today: I am having connectivity issues to some sites on the Internet; how do I determine where the issue lies?  Specifically, I am going to describe some steps to troubleshoot this in a hypothetical redundant router and ISP configuration below, however, most of these steps are applicable to single router and even single ISP architectures.

Figure 1

In figure 1, routers INET-A and INET-B are responsible for running BGP and are each peered with two ISP’s.  They receive full Internet routes from each ISP, and share that info through iBGP, on the 5.5.5.0 subnet between them.  They advertise 6.6.6.0/24 out to the Internet through all four ISP’s.

In my scenario described above, I knew I had a problem, but I did not even know what other places on the Internet I had lost connectivity to, or was having difficulty communicating with.  After looking at my handy dandy network graphs, which include a graph of ping time to my favorite DNS server 4.2.2.1 (no kidding, that’s a real DNS server), I discovered that 4.2.2.1 was unreachable.  I also found that one of my customers (which I ICMP ping every 30 seconds and log to a graph) was also failing (this becomes important later).

In my specific situation, I took a shortcut to issue resolution, by looking at my bandwidth utilization graphs on each of my ISP’s, and noticing that one was showing a dropoff in inbound traffic.  I guessed (correctly) that the issue was with that ISP, and I “AS-prepended” my outgoing advertisements, as well as incoming routes, to effectively disable use of that ISP.  This resolved the issue so I contacted the ISP and went back to bed.

If I were not in a situation where I could shut down an ISP without negative impact, or I needed to run down the trouble specifics, my next step would be to determine the network path to and from one of the unreachable (or partially unreachable) IP addresses.  It is very important to realize that when looking at routes on the Internet it is normal for traffic flows in one direction to follow one path, and traffic flows in the other direction to follow a completely separate path.  A disruption in either the forward or return path can cause similar issues!

While you could run a traceroute from within the firewall on some host (assuming your PIX/ASA would allow the return packets), I recommend running it from your Internet router (INET-A in this case), since it often has more interesting information (like AS #’s).  As you can see below I have specified the source interface address for the IP packets to originate from, rather than letting the router choose which source IP to use.  This is very important because, if left to its own devices, INET-A will use either 1.1.1.2, 2.2.2.2, or 5.5.5.1 depending on which outbound route it chooses.  If the router chooses 1.1.1.2 as the source IP, that will force the packets to return through ISP-A, rather than possibly returning through a different ISP (which may be where the problem you are attempting to identify lies).  Specifying the source interface forces the source to 6.6.6.2, which is in the same /24 as our firewalls (which is what we care about), and as such will have the same return-path routing behavior.

INET-A#traceroute ip 4.2.2.1 source gigabitEthernet 0/0/0

Type escape sequence to abort.
Tracing the route to vnsc-pri.sys.gtei.net (4.2.2.1)

  1 gi0-0-5.inet-b.example.com (x.x.x.x) 0 msec 1 msec 0 msec
  2 cust01.pdx03.atlas.cogentco.com (38.104.103.43) [AS 174] 0 msec 1 msec 1 msec
  3 gi3-1.102.core01.pdx01.atlas.cogentco.com (38.112.37.217) [AS 174] 0 msec 1 msec 1 msec
  4 po10-0.core01.sfo01.atlas.cogentco.com (154.54.3.133) [AS 174] 14 msec 15 msec 15 msec
  5 te3-1.mpd01.sfo01.atlas.cogentco.com (154.54.3.102) [AS 174] 15 msec 15 msec 15 msec
  6 te4-4.mpd01.sjc01.atlas.cogentco.com (154.54.2.54) [AS 174] 16 msec 16 msec 16 msec
  7 te4-4.mpd01.sjc03.atlas.cogentco.com (154.54.6.238) [AS 174] 38 msec 17 msec 17 msec
  8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec 188 msec 221 msec
  9 vlan79.csw2.SanJose1.Level3.net (4.68.18.126) [AS 3356] 18 msec 26 msec 19 msec
 10 ge-11-0.core1.SanJose1.Level3.net (4.68.123.38) [AS 3356] 18 msec 18 msec 19 msec
 11 vnsc-pri.sys.gtei.net (4.2.2.1) [AS 3356] 18 msec 19 msec 19 msec
INET-A#

As you can see, when this traceroute is run from INET-A, the Cisco implementation helpfully adds the AS number of each hop’s IP based on the data it has in its BGP table.  The above traceroute example does not show any issues at the time I ran it, but it does show the “forward” path to 4.2.2.1.
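
If you save traceroute output like the above to a file, the AS annotations make it easy to spot where the path crosses from one provider to another.  Here is a rough Python sketch of that idea.  The function name and the regular expression are my own illustration, written against the output format shown above, and hops without an AS annotation are simply skipped.

```python
import re

# Matches one hop line of IOS traceroute output, for example:
#   8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec ...
# The [AS n] part is optional because some hops carry no annotation.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\S+)\)(?:\s+\[AS (\d+)\])?")


def as_boundaries(traceroute_text):
    """Return (hop, previous_as, new_as) for each hop where the AS number changes."""
    boundaries = []
    previous_as = None
    for line in traceroute_text.splitlines():
        match = HOP_RE.match(line)
        if not match or match.group(4) is None:
            continue  # not a hop line, or no AS annotation on this hop
        hop, asn = int(match.group(1)), int(match.group(4))
        if previous_as is not None and asn != previous_as:
            boundaries.append((hop, previous_as, asn))
        previous_as = asn
    return boundaries


sample = """\
  7 te4-4.mpd01.sjc03.atlas.cogentco.com (154.54.6.238) [AS 174] 38 msec 17 msec 17 msec
  8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec 188 msec 221 msec
  9 vlan79.csw2.SanJose1.Level3.net (4.68.18.126) [AS 3356] 18 msec 26 msec 19 msec
"""
print(as_boundaries(sample))  # [(8, 174, 3356)]
```

In the sample, the handoff from AS 174 to AS 3356 happens at hop 8, which is a natural place to look first when two providers disagree about whose problem it is.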

To really troubleshoot this, we also need to know the return path from 4.2.2.1 to us.  In my example, I had lost connectivity to a customer IP, so I could have called the customer and asked them to send me a traceroute from their network to mine to show the reverse path.

Even with both a forward and reverse path traceroute in hand, the issue might not be immediately obvious.  Traceroutes generally send only three packets to each hop along the way, and they are small packets (usually 64 bytes or so).  If you are fighting intermittent packet loss, it might not be obvious at which hop the packets are lost.  Also, say you have an errored circuit (like a T-1) somewhere in the path.  Very frequently 64-byte packets will make it through 99% of the time, but 1500-byte packets will be dropped 50% of the time.
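
To see why packet size matters so much on an errored circuit, here is a small back-of-the-envelope sketch in Python.  It assumes bit errors are independent and that a single flipped bit kills the whole packet, which is a simplification (real circuits tend to produce errors in bursts), and the bit error rate used is a made-up value rather than a measurement.

```python
def packet_loss_probability(bit_error_rate, packet_bytes):
    """Chance that a packet contains at least one corrupted bit.

    Assumes independent bit errors, so the packet survives only if
    every one of its bits survives.
    """
    bits = packet_bytes * 8
    return 1.0 - (1.0 - bit_error_rate) ** bits


# Hypothetical bit error rate for a badly errored circuit, not a measured value.
ber = 1e-4
for size in (64, 1500):
    print(f"{size:>5} bytes: {packet_loss_probability(ber, size):.1%} loss")
```

With that assumed error rate, small packets lose roughly 5% while full-size packets lose roughly 70%, which is why a traceroute can look nearly clean on a circuit that is unusable for real traffic.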

In the event of intermittent packet loss you might want to quantify just how much is being lost.  To do this run the following command:

INET-A#ping ip 4.2.2.1 size 1500 df-bit repeat 100

Type escape sequence to abort.
Sending 100, 1500-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (100/100), round-trip min/avg/max = 19/20/24 ms
INET-A#

In this command, I make the assumption that the entire forward and reverse path supports 1500-byte packets.  This is the default size for Ethernet networks, and all carriers I am aware of support 1500-byte frames.  This might not work if you are on some form of DSL line or something running PPPoE that reduces your MTU slightly.  I set the do-not-fragment bit with the df-bit keyword to ensure the pings don’t get fragmented along the path, which could give misleading results.  If 1500 does not work for you, try something slightly smaller.
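
If you need to work out what “slightly smaller” should be, the arithmetic is simple.  The sketch below is only an illustration: it assumes plain IPv4 headers without options and the usual 8 bytes of PPPoE overhead.  The size in the IOS ping above counts the whole IP datagram (a 1500-byte ping with DF set would not otherwise fit a 1500-byte MTU), while some other tools, such as Linux ping with -s, count only the ICMP payload, so both numbers are shown.

```python
IP_HEADER = 20       # bytes, IPv4 header without options
ICMP_HEADER = 8      # bytes, ICMP echo header
PPPOE_OVERHEAD = 8   # bytes, PPPoE header (6) plus PPP protocol field (2)


def largest_unfragmented_ping(link_mtu):
    """Return (ip_datagram_size, icmp_payload_size) that just fits link_mtu."""
    return link_mtu, link_mtu - IP_HEADER - ICMP_HEADER


for name, mtu in (("Ethernet", 1500), ("PPPoE", 1500 - PPPOE_OVERHEAD)):
    datagram, payload = largest_unfragmented_ping(mtu)
    print(f"{name}: datagram {datagram} bytes, ICMP payload {payload} bytes")
```

So on a PPPoE link the first size to try would be 1492 as a datagram size, or 1464 if your tool asks for the payload size.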

You can run this command, with thousands of pings if necessary, to get some statistics on the amount of packet loss you are seeing.  Also, if you do get packet loss, try pinging the IP of the router one hop away from your destination (as revealed by the first traceroute you ran), and keep repeating this test with each IP closer and closer to your network, until you find where packets are no longer being lost.  At this point you have some idea between which routers the issue lies (note that the issue could also be on the return path from the router hop that is timing out).
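
The walk-back procedure can be sketched in a few lines of Python.  Everything here is hypothetical scaffolding: measure_loss stands in for whatever you use to run the ping test and return a loss percentage, and the loss figures in the example are invented for illustration.

```python
def first_clean_hop(hops, measure_loss, threshold=0.0):
    """Walk from the destination back toward us and find where loss stops.

    hops is ordered nearest-first, exactly as traceroute prints them.
    measure_loss is a caller-supplied function taking an IP and returning
    the loss percentage observed when pinging it.
    Returns (index, ip) of the first hop at or below threshold, or None
    if every hop shows loss.
    """
    for index in range(len(hops) - 1, -1, -1):
        if measure_loss(hops[index]) <= threshold:
            return index, hops[index]
    return None


# Hop IPs taken from the traceroute above; the loss figures are made up.
hops = ["38.104.103.43", "154.54.3.133", "4.68.110.137", "4.2.2.1"]
made_up_loss = {
    "38.104.103.43": 0.0,
    "154.54.3.133": 0.0,
    "4.68.110.137": 12.0,
    "4.2.2.1": 11.0,
}
print(first_clean_hop(hops, made_up_loss.get))  # (1, '154.54.3.133')
```

In this invented case the trouble would sit somewhere between 154.54.3.133 and 4.68.110.137, or on the return path from the latter.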

While the process I just described works, there actually is a much easier way.  If you have a Linux box handy on your network, there is a package called “mtr” or “My Traceroute”.  If you run “mtr 4.2.2.1” in this example, it will run a traceroute to that destination, and then immediately start pinging each hop along the way repeatedly until told to stop.  In the example below I only let it run for 6 or 7 iterations, but you get the idea.

                             My traceroute  [v0.71]
riddler (0.0.0.0)                                      Wed Mar 11 01:05:26 2009
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. ip-67-205-28-1.dreamhost.com      0.0%     7    0.6  35.7   0.4 127.8  60.3
 2. border21.ge4-3.newdream-8.lax.pn  0.0%     7    0.6   0.5   0.4   0.6   0.0
 3. core1.po1-20g-bbnet1.lax.pnap.ne  0.0%     7    0.6   0.5   0.4   0.6   0.1
 4. te-2-2.car1.LosAngeles1.Level3.n  0.0%     6   60.5  35.9   0.5 147.7  59.6
 5. vnsc-pri.sys.gtei.net             0.0%     6    0.9   0.9   0.4   1.6   0.5
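
If you capture mtr output to a file and want to sort or filter it later, rows shaped like the ones above are easy to pull apart.  This is a minimal Python sketch written against the column layout shown here; the MtrHop and parse_mtr names are my own, and other mtr versions or display modes may lay the columns out differently.

```python
import re
from typing import NamedTuple


class MtrHop(NamedTuple):
    hop: int
    host: str
    loss_pct: float
    sent: int
    last_ms: float
    avg_ms: float
    best_ms: float
    worst_ms: float
    stdev_ms: float


# hop number, host, Loss%, Snt, then the five timing columns
ROW_RE = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)((?:\s+[\d.]+){5})\s*$"
)


def parse_mtr(text):
    """Return one MtrHop per statistics row; headers and other lines are skipped."""
    hops = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue
        timings = [float(value) for value in match.group(5).split()]
        hops.append(MtrHop(int(match.group(1)), match.group(2),
                           float(match.group(3)), int(match.group(4)), *timings))
    return hops


sample = """\
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 4. te-2-2.car1.LosAngeles1.Level3.n  0.0%     6   60.5  35.9   0.5 147.7  59.6
 5. vnsc-pri.sys.gtei.net             0.0%     6    0.9   0.9   0.4   1.6   0.5
"""
for hop in parse_mtr(sample):
    print(hop.hop, hop.host, hop.loss_pct, hop.worst_ms)
```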

At this point I feel I must point out that this is not an exact science.  The amount of time a router takes to respond to traceroute/ICMP requests (and whether it even responds at all) is not necessarily any indicator of its health or its ability to deliver packets.  You will sometimes find traceroutes where some hops in the middle have horrible response times, or time out altogether, and yet the end-to-end traffic is fine.  This is likely because the major router manufacturers’ implementations treat responding to ping packets as one of the lowest priorities for CPU resources.  If a router gets busy doing its primary job, which is to forward packets, it is not going to mess around responding to pings.  Also, if a router is in the middle of scavenging its BGP table for entries with unreachable next-hops (which it does frequently), it may respond slowly to pings, even though it is still forwarding packets at wire rate.
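
One rule of thumb that follows from this: loss reported at a middle hop is only worth chasing if it carries through to the final hop.  The Python sketch below encodes that heuristic.  It is my own simplification rather than anything mtr does for you, and the loss figures in the examples are invented.

```python
def loss_that_matters(loss_by_hop):
    """Return the index of the first hop from which loss persists to the end.

    loss_by_hop holds per-hop loss percentages ordered nearest-first.
    Returns None when the final hop is clean, since loss seen only at
    intermediate hops is more likely a router deprioritizing its replies.
    """
    if not loss_by_hop or loss_by_hop[-1] <= 0:
        return None
    start = len(loss_by_hop) - 1
    while start > 0 and loss_by_hop[start - 1] > 0:
        start -= 1
    return start


# Invented numbers: hop 2 looks terrible, but the destination is clean.
print(loss_that_matters([0.0, 40.0, 0.0, 0.0, 0.0]))    # None
# Invented numbers: loss begins at the third hop and never goes away.
print(loss_that_matters([0.0, 0.0, 10.0, 12.0, 11.0]))  # 2
```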

In an optimal world, you would be able to run mtr (or a similar tool) simultaneously from both ends of the connection and compare the results.  This would give you the best data possible to diagnose your issue.  Hopefully the issue will lie with one of your ISPs, or one of your customers’ ISPs, and you can contact them directly to resolve it.  If the issue lies elsewhere on some intermediary backbone provider, it might be very difficult to open a support case with them (you will most likely need to escalate through your ISP).

Hopefully this gives you a few tools in your arsenal the next time you come across a connectivity issue such as I have described!

-Eric

Categories: Cisco, Network Tags: