Find the network loop with Python

CORE-ROUTER-2 is unreachable through SSH; no one changed anything, we swear. You manage to connect to CORE-ROUTER-2 through its serial interface. A health-check shows its CPU is peaking at 100% and show log commands are unresponsive. One command that does work is show interfaces. You suspect one of its interfaces is connected to a network experiencing a bridging loop.

Can you identify the port that needs to be shut down?

You can view the full output of the show interfaces command here.

CORE-ROUTER-2-> show interfaces
Chassis/Slot/Port 1/1/1   :
 Operational Status     : up,
 Last Time Link Changed : Wed Apr 20 05:01:14 2016,
 Number of Status Change: 1,
 Type                   : Ethernet,
 SFP/XFP                : SFP_PLUS_COPPER,
 EPP                    : Disabled,
 Link-Quality           : GOOD,
 MAC address            : 2c:fa:a2:07:63:d4,
 BandWidth (Megabits)   :    10000,  		Duplex           : Full,
 Autonegotiation        :   0  [                              ],
 Long Frame Size(Bytes) : 16360,
 Rx              :
 Bytes Received  :       12740900504634, Unicast Frames :          21825966736,
 Broadcast Frames:            225372918, M-cast Frames  :             42570314,
 UnderSize Frames:                    0, OverSize Frames:                    0,
 Lost Frames     :                    0, Error Frames   :                    0,
 CRC Error Frames:                    0, Alignments Err :                    0,
 Tx              :
 Bytes Xmitted   :      159501568757885, Unicast Frames :           8375538532,
 Broadcast Frames:            337233350, M-cast Frames  :         115905328083,
 UnderSize Frames:                    0, OverSize Frames:                    0,
 Lost Frames     :                    0, Collided Frames:                    0,
 Error Frames    :                    0
Chassis/Slot/Port 1/1/2   :
 Operational Status     : up,

...
...

Chassis/Slot/Port 2/1/20  :
 Operational Status     : up,
 Last Time Link Changed : Fri Jun 29 08:47:22 2018,
 Number of Status Change: 3,
 Type                   : Ethernet,
 SFP/XFP                : SFP_PLUS_SR,
 EPP                    : Disabled,
 Link-Quality           : GOOD,
 MAC address            : e8:e7:32:42:f2:ff,
 BandWidth (Megabits)   :    10000,  		Duplex           : Full,
 Autonegotiation        :   0  [                              ],
 Long Frame Size(Bytes) : 9216,
 Rx              :
 Bytes Received  :       12806785685233, Unicast Frames :          19385442042,
 Broadcast Frames:                    0, M-cast Frames  :            120868777,
 UnderSize Frames:                    0, OverSize Frames:                    0,
 Lost Frames     :                    0, Error Frames   :                    0,
 CRC Error Frames:                    0, Alignments Err :                    0,
 Tx              :
 Bytes Xmitted   :        1829921935280, Unicast Frames :           2518300035,
 Broadcast Frames:                    0, M-cast Frames  :            120868756,
 UnderSize Frames:                    0, OverSize Frames:                    0,
 Lost Frames     :                   20, Collided Frames:                    0,
 Error Frames    :                    0

I was a few months into my Python journey when this happened. I had just about finished all the Beginner exercises over at codechalleng.es and I had gotten comfortable with Python dictionaries, lists, Counters & string manipulation.

Since CORE-ROUTER-2’s OSPF neighborships were either down or flapping, I figured that the interface still receiving the most traffic, measured by Bytes Received over a fixed time period, would be the likely culprit.

I ran the show interfaces command twice, about 30 seconds apart. I copied the output to two separate files, t1.txt and t2.txt. For every port, I had to take the Bytes Received found in t2.txt and subtract Bytes Received found in t1.txt.

For each file, I created a dictionary where the key is the port, e.g. 1/1/1, and the value is the bytes received. The function returns the dictionary as a Counter object, because Counters can be subtracted from one another, which is exactly what we need.
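A minimal sketch of such a line-by-line parser, assuming the show interfaces format shown above. The name get_rx_bytes comes from this post; parse_rx_bytes is a hypothetical helper, split out here so the parsing can be tried on a list of lines without touching a file.

```python
from collections import Counter
from typing import Iterable


def parse_rx_bytes(lines: Iterable[str]) -> Counter:
    """Map each port (e.g. '1/1/1') to its 'Bytes Received' value."""
    rx_bytes: Counter = Counter()
    port = None
    for line in lines:
        line = line.strip()
        if line.startswith("Chassis/Slot/Port"):
            # 'Chassis/Slot/Port 1/1/1   :' -> '1/1/1'
            port = line.split()[1]
        elif line.startswith("Bytes Received") and port is not None:
            # 'Bytes Received  :   12740900504634, Unicast Frames : ...'
            value = line.split(":")[1].split(",")[0]
            rx_bytes[port] = int(value)
    return rx_bytes


def get_rx_bytes(filename: str) -> Counter:
    """Read one saved 'show interfaces' capture and parse it."""
    with open(filename) as f:
        return parse_rx_bytes(f)
```

Ports whose block has no Bytes Received line are simply left out of the Counter.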

You could also parse the file with regex instead of iterating over every line:
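One way the regex approach could look, as a sketch under the same format assumptions. The function name get_rx_bytes_regex and the pattern are my own illustration; it takes the whole capture as a single string.

```python
import re
from collections import Counter

# Capture the port id, then the first 'Bytes Received' value that follows it.
# The (?:(?!Chassis/Slot/Port).)*? part refuses to cross into the next port's
# block, so a port without a 'Bytes Received' line is skipped rather than
# paired with its neighbour's counter.
PORT_RX = re.compile(
    r"Chassis/Slot/Port\s+(\d+/\d+/\d+)\s*:"
    r"(?:(?!Chassis/Slot/Port).)*?"
    r"Bytes Received\s*:\s*(\d+)",
    re.DOTALL,
)


def get_rx_bytes_regex(text: str) -> Counter:
    """Map each port to its 'Bytes Received' value using one regex pass."""
    return Counter({port: int(rx) for port, rx in PORT_RX.findall(text)})
```

The pattern is harder to read than the loop, so which version is preferable is mostly a matter of taste.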

With the help of most_common(n), the delta function returns the five interfaces that received the most traffic between the two captures.
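A sketch of what delta could look like, assuming it receives the two Counters built from t1.txt and t2.txt. The parameter n and its default of 5 are my assumption.

```python
from collections import Counter
from typing import List, Tuple


def delta(t1: Counter, t2: Counter, n: int = 5) -> List[Tuple[str, int]]:
    """Return the n ports whose byte count grew the most between captures."""
    # Counter subtraction keeps only positive results, so ports with no
    # growth (or a counter that went backwards) drop out of the ranking.
    diff = t2 - t1
    return diff.most_common(n)


# Made-up numbers, only to show the shape of the result.
before = Counter({"1/1/35": 1000, "1/1/12": 500, "2/1/1": 300})
after = Counter({"1/1/35": 9000, "1/1/12": 700, "2/1/1": 300})
print(delta(before, after))  # [('1/1/35', 8000), ('1/1/12', 200)]
```

Note that 2/1/1 is absent from the result because its counter did not change.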

We execute the get_rx_bytes and delta functions within a main function and print the result to stdout.
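Put together, the script might look roughly like this. It is a compact sketch rather than the original listing: the default file names come from this post, the use of pprint is inferred from the shape of the output, and everything else is an assumption.

```python
from collections import Counter
from pprint import pprint


def get_rx_bytes(filename: str) -> Counter:
    """Map each port to its 'Bytes Received' value from a saved capture."""
    rx_bytes: Counter = Counter()
    port = None
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Chassis/Slot/Port"):
                port = line.split()[1]
            elif line.startswith("Bytes Received") and port is not None:
                rx_bytes[port] = int(line.split(":")[1].split(",")[0])
    return rx_bytes


def delta(t1: Counter, t2: Counter, n: int = 5) -> list:
    """Return the n ports with the largest growth in received bytes."""
    return (t2 - t1).most_common(n)


def main(t1_file: str = "t1.txt", t2_file: str = "t2.txt") -> None:
    """Compare two captures and print the busiest ports to stdout."""
    pprint(delta(get_rx_bytes(t1_file), get_rx_bytes(t2_file)))
```

Ending the file with the usual `if __name__ == "__main__": main()` guard lets python loop.py run it directly against t1.txt and t2.txt.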

If we execute the script, we find that the likely culprit is interface 1/1/35. That port received about five times as much traffic in those 30 seconds as the second in line.

$ python loop.py 
[('1/1/35', 545871439),
 ('1/1/12', 111303838),
 ('1/1/20', 94876228),
 ('2/1/1', 83870492),
 ('2/1/2', 45918574)]

We shut down port 1/1/35 on CORE-ROUTER-2 and the network stabilized. Crisis averted.

I’d be lying if I told you the code I’ve shown you so far is what I actually wrote during the incident, although the idea of a network engineer peppering fresh Python code with neat type annotations during a catastrophic network failure does make me chuckle.

This is what I hacked together in a five-minute coding frenzy:

I wouldn’t be showing it off to Uncle Bob, but it worked!

$ python find_culprit.py 
['1/1/35', '1/1/12', '1/1/20', '2/1/1', '2/1/2']