troubleshooting – Chris Gammell's Analog Life

If you are an engineer who regularly works with your hands, you likely troubleshoot on a daily basis. It’s just part of the job. Sure, you can say, “I never mess up!”, but hardly anyone will believe you. Because even when your best-laid plans go perfectly, Murphy’s Law will soon kick in to balance things out. We learn to deal with these things and have developed tools and measurement equipment to help us diagnose and deal with these problems: Multimeters, Electrometers, SourceMeters, Oscilloscopes, Network Analyzers, Logic Analyzers, Spectrum Analyzers, Semiconductor Test equipment (ha, guess I know a little about that stuff)…the list goes on and on. But what has struck me lately is that as parts on printed circuit boards get smaller and smaller, troubleshooting is getting…well…more troubling.

Package Types — I don’t want to get into another discussion of analog vs digital, but I will say that digital parts on average have many more pins, which complicates things. And as the parts get more and more complex, they require more and more pins. The industry solution was to move to a Ball Grid Array (BGA) package, using tiny solder balls on the bottom of the chip that line up with a grid of similarly sized pads on the board. When you heat up the part, the solder balls melt, hold the chip in place and connect all of the signals. The problem is the size of the solder balls and the connecting vias: they’re tiny. Like super tiny. Like don’t try probing the signals without a microscope and some very small probes. But wait, it’s not just the digital parts! The analog parts are getting increasingly small to fit in around those now-smaller-but-still-considerably-bigger-than-analog digital parts. You thought probing a digital signal was tough before? Now try measuring something that has more than 2 possible values!
Board Layers — As the parts continue on their shrink cycle, the designers using these parts also want to place them closer together (why else would they want them so small?). The circuit board designers route signals down through the different layers of insulating material so that multiple planes can be used to route isolated signals to different points on the board. So to actually route any signals to the multitude of pins available, more and more board layers are required as the parts get smaller and closer together. Granted, parts are still mounted on either the top or bottom of the board. But if a single signal is routed from underneath a BGA package, down through the fourth layer of an 8 layer board and then up to another BGA package, the signal will be impossible to see and measure without ripping the board apart.
High Clocks — As systems are required to go faster and faster, so are their clocks. Consumers are used to seeing CPU speeds in the GHz range, and those using RF devices are used to seeing even higher, into the tens of GHz. The problem arises when you try to troubleshoot these high speed components. If you have a 10 GHz digital signal and you expect the waveforms to be in any way square (as opposed to sinusoidal), you need to have spectral data up to the 5th harmonic. In this case, it means you need to see 50 GHz. However, as explained with analog to digital converters in the previous post, you need to sample at a minimum of twice the highest frequency you are interested in to be able to properly see all of the data. That means sampling at 100 GHz! I’m not saying it’s impossible, just that the equipment required to make such a measurement is very pricey (imagine how much more complicated that piece of equipment must be). High speed introduces myriad issues when attempting to troubleshoot non-working products.
Massive amounts of data — When working with high speed analog and digital systems there is a good amount of data available. The intelligent system designer will be storing data at some point in the system either for debugging and troubleshooting or for the actual product (as in an embedded system). When dealing with MBs and even GBs of data streaming out of sensors and into memories or out of memories and into PCs, there are a lot of places that can glitch and cause a system failure. With newer systems processing more and more data, it will become increasingly difficult to find out what is causing the error, when it happened and how to fix it.
Fewer Pins Available out of Packages — Even though digital packages are including more and more pins as they get increasingly complex, often the packages cannot provide enough spare pins to do troubleshooting on a design. As other system components that connect to the original chip also get more intricate (memories, peripherals, etc.), they will require more and more connections. The end result is a more powerful device with a higher pin count, but not necessarily more pins available for you, the user/developer, to use when debugging a design.
Rework — Over a long enough time period, the production of printed circuit boards cannot be perfect. The question is what to do with the product once you realize the board you just constructed doesn’t work. When parts were large DIP packages or, better, socketed (drop in replacements), changing out individual components was not difficult. However, as the parts continue to shrink and boards become increasingly complex to accommodate the higher pin counts, replacing the entire board sometimes becomes the most viable troubleshooting action. Environmentally this is a very poor policy. As a business, this often seems to be a decent method (if the part cost is less than the cost of the labor needed to try and replace tiny components), but if and when the failures stack up, the board replacement idea quickly turns sour.
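
The arithmetic in the High Clocks item above is simple enough to sketch in a few lines of Python. This is only a rough illustration, assuming the rule of thumb from the text (keep spectral content up to the 5th harmonic, then sample at twice that frequency). The function name and its parameters are made up for this example, and picking a real instrument involves far more than this.

```python
def required_sample_rate_ghz(clock_ghz, highest_harmonic=5, nyquist_factor=2):
    """Return (bandwidth, sample rate) in GHz for a roughly square digital signal.

    Assumptions for illustration only: the waveform is "square enough" if
    harmonics up to `highest_harmonic` are captured, and the sampler runs at
    `nyquist_factor` times the highest frequency of interest.
    """
    bandwidth_ghz = clock_ghz * highest_harmonic
    sample_rate_ghz = bandwidth_ghz * nyquist_factor
    return bandwidth_ghz, sample_rate_ghz


# The 10 GHz example from the text
bandwidth, sample_rate = required_sample_rate_ghz(10)
print(f"Bandwidth needed: {bandwidth} GHz, sample rate needed: {sample_rate} GHz")
```

For the 10 GHz clock this gives the same 50 GHz and 100 GHz figures as the text. Treat the factor of two as a floor; it is not a comfortable target.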

While the future of troubleshooting looks more and more difficult, there have always been solutions and providers that have popped up with new tools to assist in diagnosing and fixing a problem. In fact, much of the test and measurement industry is built around the idea that boards, parts, chips, etc are going to have problems and that there should be tools and methods to quickly find the culprit. Let’s look at some of the methods and tools available to designers today:

DfX — DfX (Design for X, where X is manufacturability, testability, reliability and so on) is the idea of planning for failure modes at the design stage and trying to lessen the risk of those failures happening. If you are designing a soccer ball, you would consider manufacturability of that ball when designing it (making sure the materials used aren’t super difficult to mold into a soccer ball), you would consider testability (making sure you can inflate and try out the ball as soon as it comes off the production line) and you would consider reliability (making sure your customers don’t return deflated balls 6 months down the line that cannot be repaired and must immediately be replaced). All of these considerations are pertinent to electronics design and the upfront planning can help to solve many of the above listed problems:
1. Manufacturability — Parts that are easy to put onto the board cut down on problem boards and possibly allow for easier removal and rework in the event of a failure. It becomes a balancing act between utilizing available space on the board and using chips that are easier to troubleshoot.
2. Testability — Routing important signals to a test pad on the top of a board before a design goes to the board house allows for more visibility into what is actually happening within a system (as opposed to seeing the internal system’s effect on the top level pins and outputs).
3. Reliability — In the event you are using parts that cannot easily be removed and replaced and you are forced to replace entire boards, you want to make sure your board is less likely to fail. It will save your business money and will ensure customer satisfaction.
Simulation — One of the best ways to avoid problems in a design is to simulate beforehand. Simulation can help you see how a design will react to different inputs and perform under stressful conditions (e.g. high temperature), and in general will help to avoid many of the issues that would require troubleshooting in the first place. A warning that cannot be overstated though: simulation is no replacement for the real thing. No matter how many inputs your simulation has and how well your components are modeled, no simulation can perfectly match what will happen in the real world. If you are an analog designer, simulate in SPICE to get the large problems out of the way and to figure out how different inputs will affect your product. Afterward, construct a real test version of your board or circuit and make sure your model fits your real world version. By assuming something will go wrong with the product, you will be better prepared for when it does and will be able to fix it faster.
Very very steady hands — Sometimes you have to accept the fact that you messed up the signal traces on your board and you have to rewire them somehow. My analog chip designing friends needn’t worry about trying this…chips do not have the option for re-wiring without completely reworking the silicon pathways that build the chip. In the event you do mess up and have to try and wire a BGA part to a different part of the board or jumper 0201 resistors, make sure you have a skilled technician on hand or you have very steady hands yourself. And in the event you find yourself complaining about how small the job you have to do is, think of the work that Willard Wigan does…and stop complaining.
On the Chip/Board tools — Digital devices have the benefit of being stopped and started at almost any point in a program (debug). Without being able to ascertain what the real world output values are though, that doesn’t help too much. If you did not Design for Test and pull the signals you need to probe up to the top layer before the board was made, there are a few other options. One option is to try and read your memory locations or your processor internals directly by communicating through a debugger interface. But if you are looking at a multitude of signals and want to see exactly how the output pins look when given a certain input, there is another valuable tool known as “boundary scan”. The chip or processor will accept an interface command through a specified port and then serially shift the values of the pins back out to you. Any time you ask the chip for the exact state of all the pins, a string of ones and zeros comes back, which you can then decode to see which signals and pins are high or low.
Expensive equipment — As mentioned above when describing an RF system’s measurement needs, there will always be someone who is willing to sell you the equipment you need or work to create a new solution for you. They will just charge you a ton for it. In cases I have seen where a measurement is really difficult to make or you need to debug a very complicated system, the specially made measurement solutions often perform great where you need them, but are severely limited outside of their scope. To use the example from before, if you needed a 100 GHz oscilloscope, it is likely whoever is making it for you will deliver a product that can measure 100 GHz. But if you wanted that same scope to measure 1 GHz, it would not perform as well because it has been optimized for your specific task. However, there are exceptions to this and certain pieces of equipment sometimes seem like they can do just about anything.
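
The decoding step in the boundary scan item above (turning the shifted-out ones and zeros back into pin states) can be sketched roughly as follows. This is a simplified illustration and not how any particular tool does it. It assumes exactly one bit per pin and a known bit order, whereas on a real part the boundary register layout comes from the device's BSDL file. The pin names and the `decode_pin_states` helper are hypothetical.

```python
def decode_pin_states(bits, pin_names):
    """Map a shifted-out bit string onto pin names.

    Simplifying assumption: one bit per pin, with the first bit in `bits`
    belonging to the first name in `pin_names`.
    """
    if len(bits) != len(pin_names):
        raise ValueError("expected one bit per pin")
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string may only contain 0 and 1")
    # True means the pin read high, False means it read low
    return {name: bit == "1" for name, bit in zip(pin_names, bits)}


# Hypothetical pin order for a made-up four pin device
PIN_NAMES = ["CLK", "RESET_N", "DATA0", "DATA1"]

states = decode_pin_states("1011", PIN_NAMES)
high_pins = [name for name, is_high in states.items() if is_high]
print("High pins:", high_pins)
```

Keeping the pin order in one list makes it easy to swap in the real order for your device once you have looked it up.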

Debugging is part of the job for engineers. Until you become a perfect designer it is useful to have methods and equipment for quickly figuring out what went wrong in your design. Over time you become better at knowing which signals will be critical in a design and planning on looking at those first, thereby cutting down on the time it takes to debug a product. And as you get more experience you recognize common mistakes and are sure not to design those into the product in the first place.

Do you know of any troubleshooting tools or methods that I’ve missed? What kinds of troubleshooting do you do on a daily basis? Let me know in the comments!