I was rolled out of bed early this morning by a page from my traffic graphing software, PRTG, alerting me that one of the F5 OIDs I had it monitoring was reporting higher than normal average connection times. I monitor this F5 pool because it serves a very latency-sensitive, SOAP-based application that must be available 24×7 on the Internet.
As it turned out, the average connection times were being driven abnormally high because one of my ISP’s upstream providers had a router failure in California and was blackholing some traffic. This must have been causing either half-open TCP connections or delayed/dropped connections that took a long time to time out, driving up the average connect time.
This brings me to the question I want to address today: I am having connectivity issues to some sites on the Internet; how do I determine where the issue lies? Specifically, I am going to describe some steps to troubleshoot this in the hypothetical redundant router and ISP configuration below. Most of these steps also apply to single-router and even single-ISP architectures.

Figure 1
In Figure 1, routers INET-A and INET-B are responsible for running BGP and are each peered with two ISPs. They receive full Internet routes from each ISP and share that information through iBGP on the 5.5.5.0 subnet between them. They advertise 6.6.6.0/24 out to the Internet through all four ISPs.
In the scenario described above, I knew I had a problem, but I did not even know which other places on the Internet I had lost connectivity to or was having difficulty communicating with. After looking at my handy dandy network graphs, which include a graph of ping time to my favorite DNS server, 4.2.2.1 (no kidding, that’s a real DNS server), I discovered that 4.2.2.1 was unreachable. I also found that pings to one of my customers (which I ICMP ping every 30 seconds and log to a graph) were failing as well (this becomes important later).
In my specific situation, I took a shortcut to issue resolution by looking at my bandwidth utilization graphs for each of my ISPs and noticing that one was showing a drop-off in inbound traffic. I guessed (correctly) that the issue was with that ISP, and I “AS-prepended” my outgoing advertisements, as well as the incoming routes, to effectively disable use of that ISP. This resolved the issue, so I contacted the ISP and went back to bed.
If I were not in a situation where I could shut down an ISP without negative impact, or if I needed to run down the specifics of the trouble, my next step would be to determine the network path to and from one of the unreachable (or partially unreachable) IP addresses. It is very important to realize that on the Internet it is normal for traffic in one direction to follow one path and traffic in the other direction to follow a completely separate path. A disruption in either the forward or the return path can cause similar issues!
While you could run a traceroute from a host inside the firewall (assuming your PIX/ASA allows the return packets), I recommend running it from your Internet router (INET-A in this case), since it often has more interesting information (like AS numbers). As you can see below, I have specified the source interface for the IP packets to originate from, rather than letting the router choose which source IP to use. This is very important: left to its own devices, INET-A will use 1.1.1.2, 2.2.2.2, or 5.5.5.1, depending on which outbound route it chooses. If the router chooses 1.1.1.2 as the source IP, the packets will be forced to return through ISP-A, rather than possibly returning through a different ISP (which may be where the problem you are trying to identify lies). Specifying the source interface forces the source to 6.6.6.2, which is in the same /24 as our firewalls (which is what we care about), and as such will have the same return-path routing behavior.
INET-A#traceroute ip 4.2.2.1 source gigabitEthernet 0/0/0
Type escape sequence to abort.
Tracing the route to vnsc-pri.sys.gtei.net (4.2.2.1)
1 gi0-0-5.inet-b.example.com (x.x.x.x) 0 msec 1 msec 0 msec
2 cust01.pdx03.atlas.cogentco.com (38.104.103.43) [AS 174] 0 msec 1 msec 1 msec
3 gi3-1.102.core01.pdx01.atlas.cogentco.com (38.112.37.217) [AS 174] 0 msec 1 msec 1 msec
4 po10-0.core01.sfo01.atlas.cogentco.com (154.54.3.133) [AS 174] 14 msec 15 msec 15 msec
5 te3-1.mpd01.sfo01.atlas.cogentco.com (154.54.3.102) [AS 174] 15 msec 15 msec 15 msec
6 te4-4.mpd01.sjc01.atlas.cogentco.com (154.54.2.54) [AS 174] 16 msec 16 msec 16 msec
7 te4-4.mpd01.sjc03.atlas.cogentco.com (154.54.6.238) [AS 174] 38 msec 17 msec 17 msec
8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec 188 msec 221 msec
9 vlan79.csw2.SanJose1.Level3.net (4.68.18.126) [AS 3356] 18 msec 26 msec 19 msec
10 ge-11-0.core1.SanJose1.Level3.net (4.68.123.38) [AS 3356] 18 msec 18 msec 19 msec
11 vnsc-pri.sys.gtei.net (4.2.2.1) [AS 3356] 18 msec 19 msec 19 msec
INET-A#
As you can see, when you run this traceroute from INET-A, the Cisco implementation helpfully adds the AS number of each hop’s IP, based on the data it has in its BGP table. The traceroute above does not show any issues at the time I ran it, but it does show the “forward” path to 4.2.2.1.
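If you save traceroute output like the one above to a file, a few lines of Python can collapse it into the sequence of autonomous systems the forward path crosses, which makes it easier to see at which hop traffic is handed from one provider to the next. This is only a sketch: the function name `as_path` and the regular expression are my own, and it assumes the hop lines look exactly like the IOS output shown here.

```python
import re

# Matches IOS hop lines such as:
#   8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec 188 msec 221 msec
# The "[AS n]" tag is optional because some hops (like hop 1 above) have none.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\S+)\)(?:\s+\[AS (\d+)\])?")

def as_path(traceroute_text):
    """Return [(asn, first_hop, last_hop), ...] in the order the trace crosses them."""
    segments = []
    for line in traceroute_text.splitlines():
        m = HOP_RE.match(line)
        if not m or m.group(4) is None:
            continue  # header lines, prompts, or hops without an AS tag
        hop, asn = int(m.group(1)), int(m.group(4))
        if segments and segments[-1][0] == asn:
            # Same AS as the previous tagged hop: extend the current segment.
            segments[-1] = (asn, segments[-1][1], hop)
        else:
            segments.append((asn, hop, hop))
    return segments

# A few lines taken from the traceroute above.
SAMPLE = """\
Tracing the route to vnsc-pri.sys.gtei.net (4.2.2.1)
1 gi0-0-5.inet-b.example.com (x.x.x.x) 0 msec 1 msec 0 msec
2 cust01.pdx03.atlas.cogentco.com (38.104.103.43) [AS 174] 0 msec 1 msec 1 msec
7 te4-4.mpd01.sjc03.atlas.cogentco.com (154.54.6.238) [AS 174] 38 msec 17 msec 17 msec
8 te-3-3.car3.SanJose1.Level3.net (4.68.110.137) [AS 3356] 61 msec 188 msec 221 msec
11 vnsc-pri.sys.gtei.net (4.2.2.1) [AS 3356] 18 msec 19 msec 19 msec
"""

for asn, first, last in as_path(SAMPLE):
    print(f"AS {asn}: hops {first}-{last}")
```

In this trace the handoff from AS 174 to AS 3356 happens between hops 7 and 8, which is also where the response times get briefly noisy.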
To really troubleshoot this, we also need to know the return path from 4.2.2.1 to us. This is where the customer IP I had lost connectivity to comes in: I could have called them and asked them to send me a traceroute from their network to mine to show the reverse path.
Even with both a forward and a reverse path traceroute in hand, the issue might not be immediately obvious. Traceroutes generally send only three packets to each hop along the way, and they are small packets (usually 64 bytes or so). If you are fighting intermittent packet loss, it might not be obvious at which hop the packets are lost. Also, say you have an errored circuit (like a T-1) somewhere in the path. Very frequently, 64-byte packets will make it through 99% of the time, but 1500-byte packets will drop 50% of the time.
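To see why packet size matters so much on an errored circuit, here is a back-of-the-envelope model in Python. It assumes every bit on the circuit is corrupted independently with the same probability, which is a simplification of my own; real circuits tend to take errors in bursts, so actual loss figures (like the ones I quoted above) will not match this model exactly. The function names are made up for this sketch.

```python
def loss_probability(packet_bytes, bit_error_rate):
    """Chance that at least one bit of a packet is corrupted (independent bit errors)."""
    return 1.0 - (1.0 - bit_error_rate) ** (packet_bytes * 8)

def bit_error_rate_for(packet_bytes, loss):
    """Invert the model: the bit error rate that produces `loss` at this packet size."""
    return 1.0 - (1.0 - loss) ** (1.0 / (packet_bytes * 8))

# Start from "64-byte packets make it through 99% of the time".
ber = bit_error_rate_for(64, 0.01)

# Same circuit, full-size packets.
large = loss_probability(1500, ber)

print(f"implied bit error rate: {ber:.2e}")
print(f"1500-byte packet loss:  {large:.0%}")
```

Even under this simple model, a circuit that looks 99% healthy to a small-packet traceroute loses roughly a fifth of full-size packets, which is why testing with large packets is worth the trouble.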
In the event of intermittent packet loss you might want to quantify just how much is being lost. To do this run the following command:
INET-A#ping ip 4.2.2.1 size 1500 df-bit repeat 100
Type escape sequence to abort.
Sending 100, 1500-byte ICMP Echos to 4.2.2.1, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (100/100), round-trip min/avg/max = 19/20/24 ms
INET-A#
In this command, I make the assumption that the entire forward and reverse path supports 1500-byte packets. This is the default size for Ethernet networks, and all carriers I am aware of support 1500-byte frames. It might not work if you are on some form of DSL line or something running PPPoE that reduces your MTU slightly. I set the do-not-fragment bit with the df-bit keyword to ensure the pings don’t get fragmented along the path, which could give misleading results. If 1500 does not work for you, try something slightly smaller.
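As a rough guide to what "slightly smaller" means, the arithmetic below works out the largest ping that fits a given link MTU. It assumes a plain 20-byte IPv4 header with no options, and the 8 bytes of PPPoE overhead (6-byte PPPoE header plus 2-byte PPP protocol field). As far as I know, the size value on the IOS ping command counts the whole IP datagram, while the Linux ping -s option counts only the ICMP payload, so the sketch returns both numbers; treat that as an assumption and verify it on your own gear.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IP_HEADER = 20       # IPv4 header without options
ICMP_HEADER = 8      # ICMP echo header

def largest_unfragmented_ping(link_mtu):
    """Return (ip_datagram_size, icmp_payload_size) that fits link_mtu with DF set."""
    return link_mtu, link_mtu - IP_HEADER - ICMP_HEADER

for name, mtu in [("Ethernet", ETHERNET_MTU),
                  ("PPPoE", ETHERNET_MTU - PPPOE_OVERHEAD)]:
    datagram, payload = largest_unfragmented_ping(mtu)
    print(f"{name}: datagram size {datagram}, ICMP payload {payload}")
```

So on a PPPoE link, 1492 would be the first size to try in place of 1500.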
You can run this command, with thousands of pings if necessary, to get some statistics on the amount of packet loss you are seeing. Also, if you do get packet loss, try pinging the IP of the router one hop away from your destination (as revealed by the first traceroute you ran), and keep repeating this test with each IP closer and closer to your network, until you find where packets are no longer being lost. At this point you have some idea between which routers the issue lies (note that the issue could also be on the return path from the router hop that is timing out).
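The walk-back procedure just described can be sketched in Python. The first function pulls the loss figure out of the "Success rate" summary line that IOS prints. The second walks from the destination back toward your network until it finds a hop with no meaningful loss. Both function names, the `threshold` parameter, and the loss numbers in the example are hypothetical; in practice `measure_loss` would be whatever you use to run the ping against each hop.

```python
import re

SUMMARY_RE = re.compile(r"Success rate is (\d+) percent \((\d+)/(\d+)\)")

def loss_percent(ping_output):
    """Return the percentage of echoes lost, from the IOS 'Success rate' line."""
    m = SUMMARY_RE.search(ping_output)
    if m is None:
        raise ValueError("no 'Success rate' line found")
    received, sent = int(m.group(2)), int(m.group(3))
    return 100.0 * (sent - received) / sent

def locate_loss(hops, measure_loss, threshold=1.0):
    """Walk from the destination back toward our network.

    hops: hop IPs in forward-path order, destination last.
    measure_loss: callable taking an IP and returning loss in percent.
    Returns (last_clean_hop, first_lossy_hop); either may be None.
    """
    lossy = None
    for hop in reversed(hops):
        if measure_loss(hop) <= threshold:
            return hop, lossy
        lossy = hop
    return None, lossy

# Hypothetical loss figures for four hops from the traceroute above.
fake_loss = {"38.104.103.43": 0.0, "154.54.6.238": 0.0,
             "4.68.110.137": 12.0, "4.2.2.1": 15.0}
path = ["38.104.103.43", "154.54.6.238", "4.68.110.137", "4.2.2.1"]

clean, lossy = locate_loss(path, fake_loss.get)
print(f"loss appears between {clean} and {lossy}")
```

With these made-up numbers, the trouble would sit between 154.54.6.238 and 4.68.110.137, keeping in mind that the return path from the lossy hop may be the real culprit.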
While the process I just described works, there is actually a much easier way. If you have a Linux box handy on your network, there is a package called “mtr” or “My Traceroute”. If you run “mtr 4.2.2.1” in this example, it will run a traceroute to that destination and then immediately start pinging each hop along the way repeatedly until told to stop. In the example below, I only let it run for six or seven iterations, but you get the idea.
My traceroute [v0.71]
riddler (0.0.0.0) Wed Mar 11 01:05:26 2009
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. ip-67-205-28-1.dreamhost.com 0.0% 7 0.6 35.7 0.4 127.8 60.3
2. border21.ge4-3.newdream-8.lax.pn 0.0% 7 0.6 0.5 0.4 0.6 0.0
3. core1.po1-20g-bbnet1.lax.pnap.ne 0.0% 7 0.6 0.5 0.4 0.6 0.1
4. te-2-2.car1.LosAngeles1.Level3.n 0.0% 6 60.5 35.9 0.5 147.7 59.6
5. vnsc-pri.sys.gtei.net 0.0% 6 0.9 0.9 0.4 1.6 0.5
At this point I feel I must point out that this is not an exact science. The amount of time a router takes to respond to traceroute/ICMP requests (and whether it even responds at all) is not necessarily any indicator of its health or its ability to deliver packets. You will sometimes find traceroutes where some hops in the middle have horrible response times, or time out altogether, and yet the end-to-end traffic is fine. This is likely because the major router manufacturers’ implementations treat responding to ping packets as one of the lowest priorities for CPU resources. If a router gets busy doing its primary job, which is to forward packets, it is not going to mess around responding to pings. Also, if a router is in the middle of scavenging its BGP table for entries with unreachable next-hops (which it does frequently), it may respond slowly to pings even though it is still forwarding packets at wire rate.
In an optimal world, you would be able to run mtr (or a similar tool) simultaneously from both ends of the connection and compare the results. This would give you the best data possible to diagnose your issue. Hopefully the issue will lie with one of your ISPs, or one of your customers’ ISPs, and you can contact them directly to resolve it. If the issue lies elsewhere, on some intermediary backbone provider, it might be very difficult to open a support case with them (you will most likely need to escalate through your ISP).
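One rule of thumb that follows from this: loss reported at a middle hop only deserves attention if it carries through to every later hop, including the destination. Loss that shows up at one hop and then disappears further down the path is more likely a router that is too busy to answer pings. The Python sketch below applies that heuristic to per-hop loss figures like the ones mtr reports. The function name, the labels, and the sample numbers are all hypothetical, and the heuristic itself is a starting point rather than a verdict.

```python
def classify_hops(rows):
    """Label each hop given rows = [(host, loss_percent), ...], destination last."""
    suspect_from = None
    if rows and rows[-1][1] > 0:
        # Loss reaches the destination: walk back to the earliest hop
        # from which loss is continuous all the way to the end.
        idx = len(rows) - 1
        while idx > 0 and rows[idx - 1][1] > 0:
            idx -= 1
        suspect_from = idx

    labels = []
    for i, (host, loss) in enumerate(rows):
        if loss == 0:
            verdict = "clean"
        elif suspect_from is not None and i >= suspect_from:
            verdict = "suspect"
        else:
            verdict = "probably deprioritized ICMP"
        labels.append((host, verdict))
    return labels

# Hypothetical figures: hop-b drops pings, but everything after it is clean.
busy_router = [("hop-a", 0.0), ("hop-b", 40.0), ("hop-c", 0.0), ("dest", 0.0)]
# Hypothetical figures: loss starts at hop-c and persists to the destination.
real_loss = [("hop-a", 0.0), ("hop-b", 0.0), ("hop-c", 10.0), ("dest", 12.0)]

for name, rows in [("busy router", busy_router), ("real loss", real_loss)]:
    print(name, classify_hops(rows))
```

In the second case the place to start digging is the link into hop-c, or the return path from it.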
Hopefully this gives you a few tools in your arsenal the next time you come across a connectivity issue such as I have described!
-Eric