Thursday, October 14, 2021

Basics Of Remote Cellular Access: Watchdogs

When talking about remote machines, sometimes we mean really remote, beyond the realms of wired networks that can deliver the Internet. In these cases, remote cellular access is often the way to go. Thus far, we’ve explored the hardware and software sides required to control a machine remotely over a cellular connection.

However, things can and do go wrong. When that remote machine goes offline, getting someone on location to reboot it can be prohibitively difficult and expensive. For these situations, what you want is some way to kick things back into gear, ideally automatically. What you’re looking for is a watchdog timer!

Watchdogs

The concept of a watchdog timer is simple. When attached to a system and enabled, the watchdog timer starts counting down from a preset time. The embedded system or computer is then responsible for sending a “kick” signal to the watchdog at regular intervals. This resets the watchdog back to its maximum time value, and it begins counting down again. If the “kick” is not received before the watchdog timer reaches zero, the watchdog reboots the system.

A simple watchdog timer.

Getting the watchdog interval right is important. Set it too short, and a heavily-loaded system may fail to respond with a kick in time and an unnecessary reboot will be caused. Set it too long, and the system could be down for a significant period before the watchdog gets things up and running again. Careful analysis of the system and its proper behavior is needed to tune this appropriately.

It’s a handy way for dealing with crashes, kernel panics, and system hangs in a remote machine. Rather than having to send a technician out to hit the reset button, the machine can reboot itself when it gets stuck. Watchdog timers are crucial in applications where sending out a human could cost thousands of dollars, or even be impossible, such as in satellites and other space applications.

A multi-stage watchdog timer, which takes corrective measures in turn before going for a full power cycle of the target machine. If the system manages to fire off a kick signal after stage 1 or stage 2 has fired, the system resets to normal operation.

More complicated designs are possible, too. Multi-stage watchdog timers involve several timers cascaded in series. In such a design, when the first watchdog times out after not receiving a “kick”, it takes a corrective action and starts a second timer. If that does not rectify the situation, eventually the second timer will time out, firing a further corrective action, and so on, until all stages have fired. This can be useful for more complex applications. The first stage timer could institute a simple process kill command to a server, the second a software shutdown command to the OS, while the third could execute a full hard reset with a power cycle.

How Do I Implement One, Though?

Watchdog timers are critical in space systems like the Curiosity Rover.

Implementing a watchdog on a given system is highly dependent on the application in question. An exhaustive breakdown of specific watchdog designs is beyond the scope of this article. Instead, we’ll look at a couple of pitfalls, and outline a couple of different cases that highlight the varying scopes of watchdog designs.

Pitfalls

Note that these cases all refer to proper hardware watchdogs. Ideally, for maximum robustness, the watchdog should be an entirely separate piece of hardware capable of rebooting the main system of interest. Some microcontrollers and SoCs do include internal watchdogs that run with varying levels of independence, and these can be usable, too. However, they must usually be triggered appropriately by the main code loop. Using an interrupt to trigger a watchdog can be fraught with danger. The main loop can crash but as long as the interrupt still fires, the watchdog will never reset the system.

Also important to note is that “software watchdogs” are often anything but. For example, creating a process to watch other processes on a computer system can be helpful. It can catch a broad range of minor faults and issues and restart those other processes where needed. However, in the event another process crashes the whole machine, or creates a lower-level issue such as a kernel panic, the software watchdog will be helpless to act. Generally, a proper watchdog should be largely independent of the system it is monitoring.

Case Study 1: Home-Built Tank Monitor

Let’s say you’re deploying a homebrew Raspberry Pi project far from home to monitor levels in a few water tanks. It’s nothing mission critical, nor will it risk life or limb if the system goes down. However, the system is battery powered with solar charging, and you want to avoid having to drive out to reboot the system if there are issues when power levels get low or if something else causes a crash.

In this case, a simple solution can remove a lot of headaches without a lot of added complexity. Something as simple as an Arduino Uno or similar could be installed to implement a watchdog quite easily. The Raspberry Pi could be configured to send GPIO pulses or serial messages to the Arduino to indicate that it’s still running properly. If no signal is received in a set period of time, the Arduino could reboot the Raspberry Pi by simply cutting the power with a relay. This period of time can be minutes, hours, or even longer if the system isn’t critical. The trick is not making it too short, otherwise if the system is temporarily heavily loaded, the watchdog might time out despite the system not actually having crashed.

Having an Arduino in the system could bring further benefits, too. It could send commands to the Raspberry Pi to safely shut down in the event that the battery voltage starts getting low. Additionally, it could command regular soft reboots of the Raspberry Pi on daily or weekly intervals to head off any potential glitches in processes that could crash from running for extended periods.

I’ve implemented similar systems on mobile robots out in the field, and they can work surprisingly well. It’s important, however, to make sure that the watchdog operates correctly. For example, a primary process on the Raspberry Pi could get stuck without bringing down the whole system. If the watchdog service process responsible for signalling the Arduino is able to keep going independently, the system will remain powered up despite the fact that the main process is no longer working. The way around this is to have the watchdog service check that other processes are running properly prior to sending the kick signal to the external watchdog. If you’re writing all your own code, this is easy to do! Checking whether other programs are running properly can be harder, however. This is where regular pre-emptive soft reboots can be a sneaky workaround. For homebrew stuff, it’s often good enough.

Case Study 2: Remote Pump Controller

When machines are allowed to take actions on their own, rather than merely reporting data, things can get more complicated. For example, imagine a system responsible for controlling water pumps to fill tanks from a dam or other source. The system can be monitored and manually controlled over cellular data link, but otherwise operates independently, round the clock.

In this case, far more rigor may be needed to avoid disaster. If the system were to crash while pumps are enabled, the dam could be emptied, leading to the pumps running dry and causing expensive damage. Alternatively, tanks could be overflowed, or flooding could occur. Depending on scale, this could cause a mess in a shed or destroy crops, homes, and livestock.

Thus, a more rigorous watchdog must be implemented in these cases. For example, it may not be enough to simply reboot the pump controller in the event it stops sending kick signals to the watchdog. In this case, the watchdog may instead be configured to first switch the pumps to a failsafe off condition to minimise the chance of damage. The system may then be rebooted, and when back online, remote operators notified that a restart must be triggered manually due to the failure. This avoids the system simply rebooting and instantly failing again in the event that there is an ongoing problem.

In these cases, particularly where expensive equipment or even human lives may be at risk, a simple Arduino probably isn’t going to serve as a suitably reliable watchdog. Multiple redundant watchdogs may be required in some cases to provide a greater chance of stopping the system in the event of a failure. At the highest levels, code reviews and risk assessments will be required along with specially certified hardware across the board. But if you’re working on a watchdog system for a municipal-level dam or other safety-critical installation, you’re probably not looking up how to do it. If you are, please reach out to a supervisor or other official and tell them you need some help.

Summary

The aim of this article is to explain the basic concept of watchdogs, and why they’re useful for remote systems. Hopefully, the ideas presented here are enough to help you implement watchdog timers to improve the uptime and serviceability of your own projects. After all, there’s nothing cooler than being able to show off your rugged and reliable remote project to everyone at the Hackerspace. There’s also nothing worse than having your live demo fail because you can’t reboot a failed remote machine. Thus, get your watchdogs up and running and show off what a great hacker you really are!


No comments:

Post a Comment