The control board PCB design was finished but I had a nagging feeling there was something wrong with it. So one last time, I checked that every single trace was connected to all the right places. I found a nasty collision where some weeks back I had moved a chip sideways and not noticed that two traces were now overlapping. I fixed it and eventually got to the end of the checking and had the board manufactured. When it arrived, I started at the oscillator section, and populated it bit by bit, testing each bit as I went. If something was wrong, I would have a good chance of knowing where the problem was. But everything was working, so far. In the end I just went ahead and added all the remaining components. I tested it with just the memory board connected and confirmed that it could fetch instructions from ROM. I could see the PC and instruction register on the lights, confirm that the other signals were doing the right things, and I could see it doing jumps and branches. It was time to connect the whole thing together and see what happened.
I was using the test routine that I had written for the simulator. It seemed to be working, but sometimes it would halt, meaning a test had failed. This was random: sometimes it would run from reset, and sometimes it would halt or crash. I couldn’t understand what was going on. It was so close to working, but randomly didn’t. Around this time I noticed that two of the lights, data bus bits 0 and 1, were flickering in a way that was not normal. While single-stepping, I discovered that these two bits were doing something very weird. If I pressed reset, they would go off, but after a few seconds one of them would come on. Or not. It was random. I couldn’t understand how this was possible when the machine was not free-running, as nothing should happen until I pressed the step button.
I removed some chips and did some multimeter testing, and that was when I noticed that there was a current path between the reset line and those two bits. There is no deliberate connection between the reset signal and the data bus, but I was seeing a 300kΩ resistance between them. And that was when I finally noticed that the reset trace was running right alongside the data bus ribbon cable connector, so close that it actually passed under the plastic connector block. I knew there had to be some foreign body under this component that I had missed while soldering. I had to take the connector off. Luckily, these connectors are split into two sections and one is only 8 pins wide. I desoldered it and found a tiny pool of brown goo which was touching both the data bus pins (0 and 1) and the reset line. This goo was conductive! It turned out to be dirt and flux that had oozed under the connector, out of sight, while I was soldering it. What was happening was that the reset signal (which is normally high, and low during reset) was applying a certain amount of voltage to the data bus pins; just enough to upset the input to whatever chip was reading them.
I cleaned the goo off, resoldered the connector, put everything back together — and the glitching was gone. Unbelievable! The lesson here is never to run a trace so close to a connector that it goes under the plastic, because then it is obscured and you can’t see if there’s something bad touching it.
Things started to get better after that, but the machine still sometimes crashed after reset. It was like starting a car: if it got started it ran for ages, but if not, it didn’t work at all. I started writing some more test routines, including a memory test. But I was still seeing some weird random problems. This time I homed in on register R7, which I was using as a stack pointer for subroutine calls. It seemed as if it was randomly being reset to zero, and I was able to confirm this with the scope: sometimes R7 got cleared when it was not supposed to. I soon discovered that this happened when R0 was counting down in a loop; in particular, it happened when R0 went from 0000 to FFFF. I put a scope on R7’s reset pin and this is what I saw:
Bear in mind that this pin is supposed to be high all the time! If you know how to look at this, you can actually see the fact that something is counting down, and at the point where that goes from 0000 to FFFF there is a sudden jump in the interference. That was just enough to trigger the reset pin. At this time, I couldn’t understand why there was so much interference on this line. There was much less on R1 and R2, so why was R7 so bad? It was almost as if the physical distance of the register from the top of the board had something to do with it. In fact, I discovered that R6 was suffering from the same problem. At this point, the simplest solution seemed to be to get rid of the reset function on the registers. There’s no need for it anyway and I don’t know why I bothered with it. Much safer is to just tie them high so they can never be activated. So I cut the traces and added little bridge wires to tie all the reset pins high. The problem went away. But I had fixed a bug without understanding it, and this was going to come back and bite me in the arse one more time.
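To see why the step from 0000 to FFFF stands out from the rest of the countdown, it helps to count how many bits change on each decrement. The following is only an illustrative Python sketch, not part of the LEO-1 test code; the function names are made up, and the 16-bit width is taken from the four-digit hex values above.

```python
def bits_toggled(before: int, after: int, width: int = 16) -> int:
    """Count the bit positions that differ between two register values."""
    mask = (1 << width) - 1
    return bin((before ^ after) & mask).count("1")


def countdown_toggles(start: int, steps: int, width: int = 16):
    """Yield (before, after, toggled_bits) for each decrement, wrapping below zero."""
    mask = (1 << width) - 1
    value = start & mask
    for _ in range(steps):
        nxt = (value - 1) & mask  # 0x0000 - 1 wraps around to 0xFFFF
        yield value, nxt, bits_toggled(value, nxt, width)
        value = nxt


if __name__ == "__main__":
    # Hypothetical starting value, chosen only to show the wrap-around.
    for before, after, toggled in countdown_toggles(0x0003, 5):
        print(f"{before:04X} -> {after:04X}: {toggled} bits toggle")
```

Most decrements flip only one or two bits, but the wrap from 0000 to FFFF flips all 16 at once. That is the largest simultaneous change the register can make, which fits with the interference jumping at exactly that point.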
I had almost got the memory test working but something odd was still going on. I narrowed this one down to R7 being cleared again, except this time only the high byte of it. And this time, it wasn’t being reset, it was being written to when a different register was the destination. But only the high byte. I put the scope on its write pin and saw the same kind of noise as earlier. Where was this noise coming from? I had been extremely careful with the write strobes because they are clocks and you have to be careful to keep clocks away from other traces so they don’t pick up noise. But here was a ton of noise coming out of nowhere. Could it be power supply noise? Ground bounce? Something else? I had organized the chips in groups of eight, with careful separation of power and grounds, each group having its own supply capacitor and each chip having its own bypass capacitor. I thought I had followed all the best practices. But I’d missed something. It was something to do with the distance of R7 from the other end of the board, I was sure. Perhaps the clock trace was too long? Well, no. It was only about 5 inches away from the chip that creates the clock signal. I looked at that chip and put the scope on the output pin that generated the noisy signal. It was clean. But when I measured it at R7, it wasn’t. But it was the same signal. How could it be clean at its source yet dirty 5 inches away? Then it hit me. It all depended on where I connected the scope’s ground; which ground I was referencing it to. The generator creates the signal referenced to its own ground, but when R7 receives it, it’s referenced to R7’s ground. These grounds are not the same, not at all the same. I’d seen this before on breadboards but thought it was just because breadboards are crappy and have bad connections. I’d assumed that proper PCBs would have a solid clean consistent ground. What a naive fool!
I had a flash of inspiration. I was able to see the problem happening in real time and I would know if it was fixed. What would happen if I just connected a wire from R7’s ground pin to the source chip’s ground pin? I tried it. Instantly, the problem was gone. Even though these two chip grounds were already electrically connected together, another dedicated connection solved the problem. And then, finally, I remembered something I’d read once but didn’t understand. When you have a clock signal that has to go a long way, it’s a good idea to run the ground return alongside it. I had been concentrating on power ground rather than signal ground. The ground return path from R7 back to the clock source had to go all the way around the edge of the board and then back down into the middle. Providing a direct path back fixes the problem. Now I think that this was exactly the reason for the reset problem as well. If I had fixed the reset glitch by doing this, I wouldn’t have had this problem at all.
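A rough way to picture why the long return path hurt is the inductor relation V = L × dI/dt: a fast current step flowing through an inductive return path shifts the receiver's ground relative to the source's. The Python sketch below is a back-of-envelope illustration only. Every number in it is an assumption, not a measurement from LEO-1: the inductance per inch is a crude rule of thumb, and the current step, edge time and detour length are invented. Only the 5-inch direct distance comes from the text above.

```python
# All constants below are illustrative assumptions, not LEO-1 measurements.
NH_PER_INCH = 20e-9      # assumed rule-of-thumb inductance per inch of trace or wire (henries)
CURRENT_STEP_A = 0.020   # assumed current step during a clock edge (amps)
EDGE_TIME_S = 5e-9       # assumed clock edge time (seconds)


def ground_shift_volts(length_inches: float,
                       delta_i_a: float = CURRENT_STEP_A,
                       edge_time_s: float = EDGE_TIME_S) -> float:
    """Estimate the voltage across a return path using V = L * dI/dt."""
    inductance_h = NH_PER_INCH * length_inches
    return inductance_h * delta_i_a / edge_time_s


if __name__ == "__main__":
    detour = ground_shift_volts(15.0)  # hypothetical path around the board edge
    direct = ground_shift_volts(5.0)   # direct wire alongside the 5-inch clock trace
    print(f"detour return: {detour:.2f} V, direct return: {direct:.2f} V")
```

In this simple model the shift scales with the length of the return path, so a detour three times longer gives three times the ground shift. A real board is more complicated, since the inductance depends on the area of the loop formed by the signal and its return, but the direction of the effect is the same: a shorter, closer return path means less noise at the receiving chip.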
Later I started getting a similar problem on R2 which drove me almost nuts, until I realized that just fixing R7 was a bit dumb. So I added dedicated ground returns to all the register chips. Since that moment, LEO-1 has been working flawlessly, even after reset. It just doesn’t crash anymore. And it’s working at 4MHz when my design goal was only 1MHz.
I almost gave up on this project so many times and all for the want of a bit more knowledge. I started doing it because I thought I knew enough about electronics to pull it off. But now I feel like I’ve learned everything I know about electronics by doing it!