[C++] I fixed a mistake I made 4 years ago

Discussion in 'Programming' started by Bobcat00, Feb 11, 2020.

  1. A really long story, but maybe you might want to read about the trials and tribulations of an embedded software engineer. But to set the stage, I have to describe the system environment and what I was doing.

    This is an embedded system with a real-time operating system (RTOS). The tasks in this system cannot block, suspend, wait on a semaphore, or use a mutex. All tasks must run to completion. There is a priority scheme, and higher priority tasks will interrupt lower priority tasks. For example, a priority 4 task might kick-off a priority 5 task, which will then start running. Perhaps an interrupt will come in at priority 11, it will do its thing, then control will go back to the priority 5 task. After that task finishes, the priority 4 task will continue until it finishes.

    This part is important: If we have a critical section which cannot be interrupted, we disable interrupts, which not only prevents interrupts from running, but prevents task priority switching. For example, the priority 4 task can disable interrupts and be uninterrupted until it re-enables interrupts.

    OK, this is C++, and the mistake I made involves a Queue from the Standard Template Library (STL). This is a FIFO queue, and I used the following member functions:
    • push - Inserts an element at the back of the queue
    • pop - Removes the element at the head of the queue
    • front - Returns a reference to the element at the head of the queue
    • empty - Returns true if the queue is empty, otherwise returns false
    Another important part: Only push and pop modify the queue. Also, the reference returned by front remains valid as long as you don't pop the element off the queue.

    This queue is used to hold items that are to be written to hardware. The STL queue is actually wrapped by a queue class I made to handle what needs to be written to the hardware. So if a task has something to write to the hardware, it creates an object and then calls my queue class which puts it on the STL queue. My class then kicks off a task which reads the queue, does some additional processing, and writes the data to the hardware. Most tasks in the system run at priority 4, so I made the task which reads the queue run at priority 5 in an attempt to keep the hardware as busy as possible. (But there are also interrupt tasks at priority 11 which write to the queue.)

    So remember, priority 4 and 11 tasks write to the queue, and a priority 5 task reads it. There is only that one task that reads the queue, and it's built into my queue class.

    There are four basic types of operations going out to the hardware, so the objects of my queue class include an enum describing what type of operation it is. What they actually do isn't important, but the enum value of 0-3 goes on the queue and the priority 5 task which reads it uses that enum to decide how to handle it.

    OK, so let's look at some code. This is from memory and may not be syntactically correct and will leave out a bunch of stuff, but you'll see how this is used.

    The code which puts elements on the queue is simple enough:
    Code (C):
    void push(elementType element)
    {
        // Disable interrupts
        Uint key = disable();
       
        // Put on queue
        m_queue.push(element);
       
        // Kick-off priority 5 task to read the queue
        post(readQueueTask);
       
        // Enable interrupts
        enable(key);
    }
    Interrupts are disabled to make sure no other task pushes or pops elements from the queue at the same time.

    The priority 5 code which reads the task is a little more complicated, but I'll skip over those parts. Remember, this is the only place the queue is read:
    Code (C):
    void readQueue(void)
    {
        // Loop while there are elements on the queue
        while(!m_queue.empty())
        {
            // Disable interrupts
            Uint key = disable();
           
            // Get reference to element at front of queue
            elementType& element = m_queue.front();
           
            // Enable interrupts
            enable(key);
           
            // Process element
            switch(element.operation)
            {
            case A:
                // Do stuff and write to hardware
                break;
           
            case B:
                // Do stuff and write to hardware
                break;
           
            case C:
                // Do stuff and write to hardware
                break;
           
            case D:
                // Do stuff and write to hardware
                break;
           
            default:
                // Output error message
                ASSERT_FAIL("Invalid queue operation");
                break;
            }
            // Disable interrupts
            Uint key = disable();
           
            // Remove element from front of queue
            m_queue.pop();
           
            // Enable interrupts
            enable(key);
       
        } // end while
    }
    Do you see the error? Yeah, it took me four years to spot it.

    This code actually ran just fine for a long time, years in fact. But then someone decided to do some stress testing. (Troublemaker!) Really beat on the system by injecting a lot of signals. And various problems started happening, most notably the "Invalid queue operation" error message if the 0-3 enum was wrong. Further testing showed a complete garbage value was present. We all agreed this was an impossible error. The wrong value can't be put in a queue element, and the queue is read until it's empty. Interrupts were properly disabled both when pushing and popping elements from the queue. The reference returned from front() is guaranteed to be valid as long as pop() isn't called, and that's only done in this method, not anywhere else in the code, so it couldn't be that, either.

    So I started thinking about the (lack of) thread safety of STL routines. In short, STL routines are not thread-safe. That's why I disable interrupts when accessing the queue. But wait! What about empty(), which doesn't modify the queue, but simply returns true/false? I'm not disabling interrupts there. But it must simply read the queue status and return it, so what could go wrong?

    Apparently, what could go wrong is that an interrupt could come at the wrong time and empty() could return false when the queue was actually empty, and the readQueue code above would then go and get a reference to some memory somewhere. Who would have guessed that was even possible?

    So the solution was to fix up the part that reads the queue so that interrupts are disabled before calling empty() and leaving them disabled until the element at the front is accessed. I also changed it from getting a reference to getting a copy, for good measure.
    Code (C):
    void readQueue(void)
    {
        while (1)
        {
            // Disable interrupts
            Uint key = disable();
           
            if (m_queue.empty())
            {
                // Enable interrupts
                enable(key);
               
                // exit loop
                break;
            }
           
            // Get element at front of queue
            elementType element = m_queue.front();
           
            // Enable interrupts
            enable(key);
           
            // Process element
            switch(element.operation)
            {
            ...
    So that's the type of thing you can look forward to if you decided to become an embedded software engineer. Despite my careful analysis and design of how I used the STL queue, I overlooked that the empty() function might not behave the way I would have designed it. But tracking down things like this is real software engineering. Being able to design, build, and test the software and see it work in the lab or out in the field is quite satisfying. I'd be bored to tears if I was writing desktop applications or web pages or stuff like that.

    Well, if you got this far, I hope you enjoyed reading this.
     
    • Like Like x 4
    • Informative Informative x 2
    • Winner Winner x 1
  2. Thank you for sharing! Very informative!
    All of this makes me wonder thou why the interrupts are not disabled by default when using queue operations?

    Solid burn :p Agreed ;)
     
  3. Great post! This is exactly the reason why languages like Go or Rust don't have a way to check whether queues (called message channels there) are empty or not. Instead, popping is an atomic instruction which returns an optional/error/boolean indicating whether there were any elements in the queue or whether it was empty. The "is empty" check and "pop element" almost always go together, and if they don't happen in one atomic operation then only God knows what will happen, so those languages force it upon you.
     
    #3 StillNoNumber, Feb 12, 2020
    Last edited: Feb 12, 2020
  4. Just imagine how amazing it was to have Go or Rust as a "Hardware" language...
     
  5. The C++ Standard Template Library is written for efficiency. You might very well have a single-threaded program, and having some sort of interlock would be unnecessary and slow things down. On the other hand, doing locks around something like isEmpty is probably inadequate (hence the Go/Rust implementations) and you'll need a custom solution anyway.

    I'm surprised that no one made this adjustment to my post:
    "OK, this is C++, and the mistake I made involves a Queue from the Standard Template Library (STL) was using C++ in the first place."

    I'd much rather use C.