This is a post I wrote up on a ramshackle WordPress server made of spare computer parts in a case I made with plywood on the campus laser cutter. That said, I’m proud of the debugging process here and the post I wrote up about it. That’s why, when I accidentally wiped the partition table on the old server drive, I rescanned the whole raw drive for the entry in the database to rebuild the post as a Word document. Voila!
The core of my LED cube project is an ARM-powered Teensy 3.2. For those interested, the processor is an MK20DX256VLH7 Cortex-M4. It’s clocked at 72MHz and has 64KB of RAM, a respectable and often copious amount for most projects (especially for developers coming from Arduinos and AVR chips). But that 64KB has begun to feel claustrophobic, and not for the most straightforward of reasons.
Now that the firmware is at a stage where I can fully test animations, I wanted to try out the more memory/CPU-intensive animations to make sure they can still run. The animation I tried first is called “fireworks.” It involves a couple dozen particles which each hold a floating-point position and velocity. I was ecstatic to find that it only used about 33KB (obtained using mallinfo()).
When it worked.
Every couple of builds, the resulting program would simply stop partway through. Even a single, seemingly unrelated change could invoke the wrath of the phantom bug. Obviously, I was brushing up against some edge condition: an invalid pointer, array overrun, stack corruption, or any of the usual banes of my existence. The processor should throw a fault interrupt when something really bad happens, so I figured I’d hook something up to it. Just a quick utility to print the PC address and some other registers when the interrupt fires.
uint32_t* sp=0;
// this is from "Definitive Guide to the Cortex M3" pg 423
asm volatile ( "TST LR, #0x4\n\t" // Test EXC_RETURN number in LR bit 2
"ITE EQ\n\t" // if zero (equal) then
"MRSEQ %0, MSP\n\t" // Main Stack was used, put MSP in sp
"MRSNE %0, PSP\n\t" // else Process stack was used, put PSP in sp
: "=r" (sp) : : "cc");
// sp now points at the stacked exception frame: r0-r3, r12, lr, pc, xPSR
serial_print("!!!! Crashed at pc=0x");
serial_phex32(sp[6]); // stacked PC, printed as 32 bits so the address isn't truncated
serial_print(", lr=0x");
serial_phex32(sp[5]); // stacked LR
serial_print(".\n");
serial_flush();
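For reference, the indices 5 and 6 in the handler come from the order in which a Cortex-M pushes registers on exception entry. Here is a minimal sketch of that layout as a struct. The names `exception_frame_t` and `faulting_pc` are my own for illustration and are not part of the Teensy core; it assumes the basic eight-word frame with no floating-point state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers pushed by the hardware on exception entry, lowest address first.
 * Assumes the basic frame (no FPU context). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame_t;

/* The handler's sp[5] and sp[6] are the lr and pc fields of this frame. */
static_assert(offsetof(exception_frame_t, lr) == 5 * sizeof(uint32_t), "lr is word 5");
static_assert(offsetof(exception_frame_t, pc) == 6 * sizeof(uint32_t), "pc is word 6");

/* Hypothetical helper: read the stacked PC from the recovered stack pointer. */
static uint32_t faulting_pc(const uint32_t *sp)
{
    return ((const exception_frame_t *)sp)->pc;
}
```

With this in place, `faulting_pc(sp)` reads the same word as `sp[6]` does in the handler.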
Nothing.
The processor still stopped dead, and the interrupt never fired even once. Okay. Not to be deterred, my next guess was that I had somehow corrupted the vector table. So I figured out how to use the watchdog timer to reset the CPU after it fails, and wrote another utility to dump the vector table before the startup code rewrites it. The processor’s RAM persists through a soft reset, so whatever the vector table held when the processor crashed should be preserved…
Nothing.
Right. So I figured some low-level debugging will tell me what is going on. A quick peek at the datasheet says that the JTAG port is on pins PTA3, PTA0, and PTA2. So then I just needed to figure out how to access those pins on the Teensy 3.2…

…
None of them are exposed on the board. Unless I wanted to solder jumpers directly to the pins and cut the traces to the auxiliary chip, the JTAG interface just wasn’t an option. I had run out of ideas. Nothing I tried seemed to tell me anything about what the processor was doing. This was particularly frustrating because I’m currently taking a class which has taught me how to use GDB to debug programs at this level. With nothing left in my arsenal, I ended up pushing the issue aside for a while due to college finals.
Oddly enough, it hadn’t occurred to me to try googling the issue. Though I should give myself credit. I barely knew anything about the issue, so it was a bit of a stretch to expect any meaningful results. I have never been so happy to be proven so completely wrong! I googled “teensy hard fault”, and the very first result was a link to a forum thread called Teensy 3, hard fault due to SRAM_L and SRAM_U boundary.

The Kinetis MK20 series of ARM chips separates its SRAM into two equally sized segments, SRAM_L and SRAM_U, centered at 0x20000000. The entire address range acts like a single block of SRAM under nearly every circumstance (there are some penalties for running code from SRAM_U, but that’s irrelevant in my case). I discovered this unique architecture while scouring the datasheet to find something I could use to debug the phantom bug, but nothing stuck out to me then. It turns out that one of the rare exceptions is a multi-byte, unaligned access that crosses the boundary between SRAM_L and SRAM_U. The user frank26080115 provided the following chunk of code which simulates this bizarre occurrence.
volatile uint32_t* foo;
foo = (volatile uint32_t*)0x1FFFFFFF; // right at the boundary between SRAM_L and SRAM_U
*foo = 0x12345678; // perform a 32 bit write
I added the test code to my firmware and… BAM. The exact same behavior as my phantom bug. Now THAT is what I call an edge case.
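The condition that snippet triggers can be written down as a small check. This is only a sketch for reasoning about addresses, not something from the forum thread or the Teensy core; `SRAM_U_BASE` and `crosses_sram_boundary` are names I made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First address of SRAM_U on the MK20DX256; SRAM_L ends one byte below it. */
#define SRAM_U_BASE 0x20000000u

/* Hypothetical helper: true when a single access covering the bytes
 * [addr, addr + size) starts in SRAM_L and ends in SRAM_U. */
static bool crosses_sram_boundary(uint32_t addr, size_t size)
{
    assert(size > 0);
    /* One past the last byte, computed in 64 bits to avoid wraparound. */
    uint64_t end = (uint64_t)addr + size;
    return addr < SRAM_U_BASE && end > SRAM_U_BASE;
}
```

A 4-byte access at 0x1FFFFFFF (the test code above) satisfies the check, while an aligned 4-byte access at 0x1FFFFFFC does not. Since the boundary is word-aligned, a naturally aligned access can never straddle it, which is why only unaligned accesses are at risk.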
So time to celebrate, right? Well… kind of. Knowing the cause of the failure is one thing, but that doesn’t tell me where my firmware is making that illegal access. However, the only way my program would ever use an address near that boundary is if malloc allocates a block there. The only code in the firmware which uses the heap right now isn’t even my own code. To fix the bug, I would risk modifying the fundamental behavior of the libraries I depend on. I plan to investigate this further, but I also came up with three ways to solve the issue without pinpointing where the bug occurs…
- Remap the RAM so that the heap exists only in SRAM_U, and have the stack moved to the top of SRAM_L. This prevents any possible access across the boundary, but it also dramatically cuts how much heap is available. SRAM_U is only half of the 64KB, so as is, this wouldn’t be able to run my fireworks program which uses 33KB. I would have to modify the Squirrel library to use less memory for this method to work.
- Use malloc to pre-allocate a small block near/over the boundary before anything else runs. This way no block can be allocated that would cross the boundary. This solution sends shivers down my spine, but it does seem to work. I tried implementing this on my own, but then I discovered another thread which also mentioned this solution. The code provided by tnl is what I’m currently using, but I still hate this solution.
- Buy the Teensy 3.5, which has more SRAM, and use option 1. I don’t like this option only because it would feel like giving up.
For now I’m sticking with option 2, but I’ll make my final decision down the line.
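To make option 2 a bit more concrete, here is a rough sketch of the arithmetic behind it. This is not tnl’s code. The helper name `padding_before_guard` and the idea of centering the guard block on the boundary are my own assumptions, and it ignores the allocator’s per-block bookkeeping overhead.

```c
#include <assert.h>
#include <stdint.h>

/* First address of SRAM_U on the MK20DX256. */
#define SRAM_U_BASE 0x20000000u

/* Hypothetical helper: given the address where the next heap block would
 * start, return how many bytes of padding to allocate first so that a
 * following guard block of guard_size bytes begins guard_size/2 bytes
 * below the SRAM_L/SRAM_U boundary. Returns 0 if the heap has already
 * reached that point. */
static uint32_t padding_before_guard(uint32_t heap_top, uint32_t guard_size)
{
    assert(guard_size >= 2);
    uint32_t guard_start = SRAM_U_BASE - guard_size / 2;
    if (heap_top >= guard_start) {
        return 0;
    }
    return guard_start - heap_top;
}
```

On the device, the idea would be to malloc the padding, malloc the guard, and never free either of them. Because real allocators add headers and round sizes up, the guard’s actual address should be checked against the boundary rather than trusted.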