Research and development networks

I was in a lab responsible for the testing and validation of Ethernet hardware during the gigabit Ethernet revolution.  The products we supported ranged in development from alpha devices through full production-quality components.  We had everything from 10Mbps PHYs (physical layer devices) and repeaters to 10 gig Ethernet MACs (media access controllers) and link aggregators (usually rolled into one) through 10Gbps backbone links over Infiniband or single mode fiber.  Device packages ran the gamut from QFP-44 (quad flat pack) through 552CBGA (ceramic ball grid array): plastic, ceramic, heat spreaders, ultraBGA, you name it.  Our hungriest device pulled nearly 50 watts and literally weighed in at over an ounce with the massive copper heat spreader built into the package.  We had NICs, hubs, and switches from every major manufacturer to test interoperability against.  How exactly do you provide network infrastructure to a lab that can generate any kind of data you can imagine?

Every single port on this beast can pump out full line rate gig Ethernet at up to 20Kbytes per packet.  Yipes!

Initially we were on the corporate backbone, in spite of my repeated requests for a single dedicated fiber port to the datacenter, with a dedicated router and a strict rule set.  Then one day the unimaginable happened.  A boneheaded engineer plugged his DUT (device under test) into the production network instead of the test cable rack, and started sending packets.  Now, these were not normal packets, nor were they at normal rates.  These were packets with random source IP addresses, random destination IP addresses, random MAC addresses, random packet lengths (anywhere between 64 and 1518 bytes), and random data payloads.  Add to this that only the IEEE 802.3 minimum interpacket gap of 96 ns separated each packet from the next, and you have the stuff NOC nightmares are made of.
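To put numbers on "not normal rates", here is a rough back-of-the-envelope sketch in Python.  It assumes a 1 Gbps link, where the 96 ns gap equals 96 bit times (12 bytes), plus the standard 8 bytes of preamble and start-of-frame delimiter in front of every frame.  The function name and constants are mine, purely for illustration.

```python
# Rough estimate of frames per second at full gigabit line rate.
# Assumptions (not measurements from the lab): 1 Gbps link, 8 bytes of
# preamble + start-of-frame delimiter, and the minimum 96-bit-time gap.

LINK_BPS = 1_000_000_000   # 1 Gbps
PREAMBLE_SFD_BYTES = 8     # preamble (7) + start-of-frame delimiter (1)
GAP_BYTES = 12             # 96 bit times = 96 ns at 1 Gbps


def frames_per_second(frame_bytes: int) -> float:
    """Return the maximum frame rate for a given frame size in bytes."""
    bits_on_wire = (frame_bytes + PREAMBLE_SFD_BYTES + GAP_BYTES) * 8
    return LINK_BPS / bits_on_wire


if __name__ == "__main__":
    for size in (64, 512, 1518):
        print(f"{size:5d}-byte frames: {frames_per_second(size):12,.0f} per second")
```

With minimum-size frames that works out to roughly 1.49 million frames per second from a single port, each one carrying a source address the network has never seen before.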

What happened, you ask?  Well, the switch that the port was connected to dutifully memorized each source MAC address as having come from that port, rolling off the oldest addresses (everyone else's) in its memory to make space for the newer ones.  Because it did not have an entry for any of these phantom destination addresses, it dutifully forwarded all of the packets to the network backbone, whereupon every connected switch and router performed a similar operation (with the routers also memorizing IP addresses).  In less than 10 seconds every single switch and router on the entire network had dumped any non-static routes and reloaded its tables with random MAC and IP addresses.  The storm continued.  As these packets were forwarded all over the place looking for the non-existent destination nodes matching the ever-changing destination IP addresses, network throughput ground to a standstill.  For over 15 minutes nothing could be done on the network.  Once the offending test had been stopped (and the cable removed) it still took several minutes for the network traffic to stabilize, and hours for the route tables to rebuild themselves.
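The table rollover described above can be sketched as a toy model.  This is not how any particular switch implements its address table; it assumes a fixed-size table that evicts the oldest entry first, which is the behavior I saw.  The class name, table size, host count, and port numbers are all made up for illustration.

```python
import random
from collections import OrderedDict


class MacTable:
    """Toy fixed-size MAC learning table that evicts the oldest entry first."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # mac -> port, oldest entry first

    def learn(self, src_mac: str, port: int) -> None:
        """Record that src_mac was seen on port, evicting old entries if full."""
        if src_mac in self.entries:
            self.entries.move_to_end(src_mac)  # refresh an existing entry
        self.entries[src_mac] = port
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # roll off the oldest address

    def lookup(self, dst_mac: str):
        """Return the learned port, or None (meaning: flood the frame)."""
        return self.entries.get(dst_mac)


def random_mac(rng: random.Random) -> str:
    return ":".join(f"{rng.randrange(256):02x}" for _ in range(6))


def simulate(capacity=1024, legit_hosts=100, storm_frames=5000, seed=1):
    """Learn some real hosts, then hit the table with random source MACs."""
    rng = random.Random(seed)
    table = MacTable(capacity)
    legit = [f"02:00:00:00:00:{i:02x}" for i in range(legit_hosts)]
    for i, mac in enumerate(legit):
        table.learn(mac, port=i % 24)
    for _ in range(storm_frames):
        table.learn(random_mac(rng), port=7)   # everything arrives on one port
    surviving = sum(1 for mac in legit if table.lookup(mac) is not None)
    return surviving, len(table.entries)


if __name__ == "__main__":
    surviving, size = simulate()
    print(f"legitimate hosts still in table: {surviving}, table size: {size}")
```

Once the storm has pushed more unique addresses through than the table can hold, every real host has been rolled off, so traffic to real hosts gets flooded just like traffic to the phantom ones.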

The aftermath?  I got my router, a CAT 5K with two linecards, one multimode gig and one singlemode gig.  The router was given static routes to the IP addresses of the servers for which access was needed, and everything else was disabled.  We added a pair of HP ProCurve 24 port switches set up on a gig multimode backbone to the router, and serviced each station in the lab from one of the switches' ports.  We replicated a large portion of the corporate network inside the lab, including a proxy server (sites blocked unless whitelisted), a DNS and DHCP server, filer, print server, etc.  While this did nothing to stop boneheads from causing havoc, it limited the damage to only our lab.

A small sample of random equipment I had available to play with.

The network was something to behold.  We had Windows (3.11, 95, 98, NT4.0sp3, NT4.0sp6a, Whistler, XP) clients, Linux (RedHat, Slackware, and a custom Debian build) clients, and embedded clients (Embedded Planet boards running VxWorks), running on hardware ranging from Pentium 75s through P4s, AMDs, and those pesky embedded systems.  We had OS kernels that were modified from OEM (to support test equipment hooks), and a special NT4 build that had network drivers from hell (kernel mode driver SoftICE'd in).  The bright side to all this is that it is an admin's playground.  I had toys to play with from simple D-Link 16 port hubs all the way up to a Cisco CAT12012 router with one OC192 blade and a plethora of other blades.  I had 300 client systems that spanned a decade of OS design, and some builds that never saw the open market.

The average test workstation consisted of a handful of HP/Agilent E3614 power supplies, a scope, and a traffic generator (IXIA or Smartbits, with load modules as required).  For thermal loading we would wheel up a refrigerator-sized instrument called a T2500 by Thermonics.

Sometimes people are careless (clueless?) and leave their equipment running at an absurdly low temperature, for a very long time, without taking any precautions.  What do you suppose happens when you operate a piece of equipment that chills your board to -30C in a humidity controlled environment (typical 35% humidity)?  If you guessed "snow cone" then you are spot-on.
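For the curious, here is a rough sketch of why the snow cone is inevitable.  It uses the Magnus approximation for dew point, and it assumes a room temperature of 22C, which is my guess and not a logged value; only the 35% humidity and the -30C board temperature come from the story above.  Any surface colder than the dew point collects moisture, and below freezing that moisture shows up as frost.

```python
import math


def dew_point_c(temp_c: float, rel_humidity_pct: float,
                a: float = 17.62, b: float = 243.12) -> float:
    """Approximate dew point in Celsius using the Magnus formula."""
    gamma = math.log(rel_humidity_pct / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)


ROOM_TEMP_C = 22.0        # assumption: typical lab air temperature
LAB_HUMIDITY_PCT = 35.0   # from the text
BOARD_TEMP_C = -30.0      # from the text

if __name__ == "__main__":
    td = dew_point_c(ROOM_TEMP_C, LAB_HUMIDITY_PCT)
    print(f"dew point of the lab air: about {td:.1f}C")
    if BOARD_TEMP_C < td:
        print("board is far below the dew point: moisture will collect on it")
```

Under those assumptions the dew point lands a few degrees above freezing, so a board held at -30C sits some 35 degrees below it and pulls water out of the air for as long as it is left running.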

Now, ice isn't bad for electronics (at least in the R&D environment).  So long as the water is ice, your board keeps running and your test keeps humming along.  It's what happens when the ice melts that is scary.  Water, while not very conductive, is an awesome solvent.  Chips are held onto boards with solder, and the soldered legs of the chips are interconnected with copper.  Solder and copper are metal, something that water is good at dissolving.  Add to the blend the ubiquitous dust, made up of everything from dirt and carpet fibers to human skin flakes and other wonderful organic compounds, and that water turns mighty conductive in a hurry.  So, we've determined that solid water == !bad and liquid water == bad.  What next?

Well, if you turn off your board, dust off the bulk of the snow cone into the trash, and turn that -30C air stream into a 110C air stream, not much.  If you are a bonehead and leave your board powered on and ramp the temperature up (or worse yet, simply turn the air off), you are in for a world of hurt.  The picture here is of a special socket and a special device.  The socket is about $5000 and the device is one of only about 40 *in the world*.  Now neither of those is all that bad.  I mean, $5K in an R&D lab, while expensive, is manageable, and we still have 39 chips left to play with, right?  (Besides, another full fab lot of wafers, expected to yield several thousand devices, would be here in 6 weeks.)  Well, what if I added to the mix that we only had three sockets, that they take 18 weeks to have built (don'tcha love custom electronics), and that we were due to get sample devices to our biggest three customers in two weeks?  Now we have a problem.  We sent the damaged socket back to the manufacturer to see if it was repairable, and received the professional version of "you're fucked" as a reply.  Great.

So it's double shifts for anyone who wants them, and mandatory for those who would rather not because they are salaried.  Me?  Well I'm a tech, so I get overtime (doubletime if over 12 hours in a shift or 60 hours in a week).  I'm good with that.  It was before I had kids, so I put my personal life on hold and in two weeks managed to pull in an extra month's worth of pay.  Happy ending?  Yes.  Tired?  Yes.
