The Woes of Magic Sysrq
Or, how to accidentally take a GPS receiver offline
Recently, I upgraded the firmware on my GPS-synced time server and wall clock combination. It had eaten an SD card (ending in a weird failure mode I'd like to analyse at some point) so I built a custom spin of Arch Linux ARM, designed to boot and live in RAM. As part of this upgrade, I re-wrote the wall clock display software against SDL in order to allow me to remove a bloaty web browser from the system.
After this upgrade, I kept noticing stale satellite data showing on the display. More investigation showed that chronyd
hadn't received anything from the NMEA or PPS sources in over a week. This was baffling since cgps
showed NMEA data coming out of gpsd
from the GPS chipset, and ppstest
showed that the GPS chipset was still firing interrupts visible from userspace.
What had happened?
Probing, prodding, and guesswork lead me down the path of suspecting some of my monitoring scripts somehow interfering with, or somehow blocking, gpsd. Or at least blocking chronyd
from receiving data from gpsd. Alas, it was a red herring. After removing these scripts, the failure was eventually reproduced.
Eventually, I noticed something in dmesg
. Lots of lines that looked like the output of someone hitting sysrq from a console, multiple times a second. With no keyboard attached to the device, the only place this could be coming from would be another console device. I checked my kernel command line, and sure enough:
… console=ttyAMA0 …
Oof. That's the serial device connected to the GPS chipset. In my rush to get the new rebuilt system booting, and in my naivety, I hadn't removed the distro-default value telling the kernel to treat my GPS chipset a console input. A lesson learned and certainly something I'll be keeping an eye out for in future. Really be careful about what you configure as being a console to the kernel. Note this is a separate problem from deciding where you disable serial logins, something a lot more subtle, and probably harder to debug.
After removing this erroneous console from the kernel command line, everything resolves back to the normal dozens-of-nanoseconds RMS offset:
Reference ID : 50505300 (PPS)
Stratum : 1
Ref time (UTC) : Sun Apr 07 11:39:37 2019
System time : 0.000000022 seconds fast of NTP time
Last offset : +0.000000003 seconds
RMS offset : 0.000000037 seconds
…
Post mortem?
Not really, I haven't dug right into the nitty gritty as far as I'd like to yet.
Thinking back to how I rebuilt the system, I copied the boot arguments from the old broken system and imported them into the new, with the exception of loading U-Boot instead of relying on just the in-flash bootloader. It seems strange that having a non-console device set up as a console never caused any problems before, until I realised that it's probable that Magic Sysrq was never enabled on the (now old) kernel that was still running on the system before being rebuilt.
As for why cgps
and ppstest
indicated the failure was chrony's own fault, I only have half an answer. PPS can probably be explained by the fact that it's tied to the NMEA data. That is to say, I presume that chrony cannot say it's received the time from a PPS source without having also received the time stamp data from its paired NMEA source.
As to why cgps
was still able to read data from the GPS chipset but chrony wasn't, I don't know yet. Perhaps there is a bug in gpsd
causing different treatment of socket-based access (used by cgps
) when compared to SHM-based access (as used by chrony). Or perhaps me running cgps
was somehow enough of a kick to gpsd
that it refreshed something and re-established connection to the GPS chipset, and for some reason I never saw this propagate to chronyd because I was too impatient to wait for it to trust the source again. More investigation here is needed.