Originally published at: https://boingboing.net/2019/07/26/common-remote-data-concentra.html
…
Someone didn’t check for memory leaks…
At least there is a way to keep it in the air, unlike the 737 Max, which assumes the pilot wants to dive at the ground and then does so.
Some flight duration function that was scope-creeped into a power-on runtime one? (Long after the person that baked in the “150 hours is far longer than any possible flight” assumption had moved on.)
“Welcome to Airbus Technical Support, how may I help you. Uh-huh. Uh-huh. You’ve lost power and are nose diving from 40,000 feet?”
It’s scary that it randomly fails. They must have had a number of brown-trouser incidents before someone noticed the key 150-hour runtime.
Oh well, ship the debug version with bounds-checking and exception handling turned on. Click okay through all the warning alert boxes and it’s all good.
The linked Guardian article says the command protocol is ARINC 429. Wikipedia has some info on that protocol, including that the data field is 19 bits.
So if you have a clock that starts counting seconds from the time it’s turned on, after 149 hours, it’ll be 536,400. In binary, that’s 1000 0010 1111 0101 0000. In other words, you recently started needing a 20th bit.
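To check that arithmetic, here is a quick sketch. It assumes a plain seconds-since-power-on counter (which is a guess, not something Airbus has confirmed) and takes the 19-bit data field width from the Wikipedia article; the variable names are just for illustration.

```python
SECONDS_PER_HOUR = 3600
FIELD_BITS = 19  # ARINC 429 data field width, per the Wikipedia article

# Assumption: the counter is simply seconds since power-on.
uptime_s = 149 * SECONDS_PER_HOUR
bits_needed = uptime_s.bit_length()

# First moment a value no longer fits in 19 bits, expressed in hours.
overflow_hours = (1 << FIELD_BITS) / SECONDS_PER_HOUR

print(uptime_s)                  # 536400
print(format(uptime_s, "b"))     # 10000010111101010000
print(bits_needed)               # 20
print(round(overflow_hours, 1))  # 145.6
```

So under that assumption the 20th bit first shows up at about 145.6 hours, a little before the 149-hour mark.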
Some systems might truncate the top bit, leaving you with a sudden backwards leap through time. That can easily confuse all sorts of otherwise simple calculations. Other systems might end up with that top bit being interpreted as part of an adjacent field, which could have surprising consequences as well. Referring to the Wikipedia article, it might clobber part of the value of the sign/status matrix, which is how systems know whether the data is correct, unavailable, or simulated (for testing). I could see some software that’s not prepared for that as well.
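Both failure modes can be sketched in a few lines. This is a deliberately simplified, hypothetical word layout (a 19-bit data field with a 2-bit status field directly above it), not the full ARINC 429 word, and `truncate`, `pack_unchecked` and `unpack` are made-up helper names.

```python
DATA_BITS = 19
DATA_MASK = (1 << DATA_BITS) - 1

def truncate(seconds: int) -> int:
    """Failure mode 1: the top bit is silently dropped."""
    return seconds & DATA_MASK

def pack_unchecked(seconds: int, status: int) -> int:
    """Failure mode 2: no mask, so an oversized value spills into the status bits."""
    return (status << DATA_BITS) | seconds

def unpack(word: int) -> tuple[int, int]:
    """Split a packed word back into (data, status)."""
    return word & DATA_MASK, (word >> DATA_BITS) & 0b11

last_ok = DATA_MASK        # 524287 seconds, the largest value that fits
first_bad = last_ok + 1    # one second later

print(truncate(last_ok), truncate(first_bad))  # 524287 0
print(unpack(pack_unchecked(last_ok, 0)))      # (524287, 0)
print(unpack(pack_unchecked(first_bad, 0)))    # (0, 1)
```

In the first case the clock jumps back to zero; in the second the receiver sees both a zeroed clock and a status code that nobody sent.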
I was just doing the math on that and coming to the same conclusion. A signed 20-bit seconds counter would just about fit the symptoms. 20-bit counters are rare in full-blown computers, but microprocessors often cut down on silicon to save cost. Probably someone in the chain thought the counter would be reset after every flight by something running above their system and didn’t successfully communicate that requirement to the system integrators.
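For what it's worth, here is roughly what that looks like, assuming a two's-complement 20-bit counter ticking once per second. That is pure speculation about the hardware, and `as_signed` is just an illustrative helper.

```python
BITS = 20
SIGN_BIT = 1 << (BITS - 1)

def as_signed(raw: int) -> int:
    """Interpret the low 20 bits of raw as a two's-complement value."""
    raw &= (1 << BITS) - 1
    return raw - (1 << BITS) if raw & SIGN_BIT else raw

max_seconds = SIGN_BIT - 1             # largest positive value
max_hours = max_seconds / 3600         # how long until the counter tops out
wrapped = as_signed(max_seconds + 1)   # one tick later

print(max_seconds)          # 524287
print(round(max_hours, 1))  # 145.6
print(wrapped)              # -524288
```

One tick past roughly 145.6 hours and the uptime suddenly reads as a large negative number, which is the kind of value downstream code may never have been written to expect.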
hard reboot every 149 hours
I’m an every-40-hours-of-work hard-reboot kind’a guy…
So, you’re a pilot?
This sort of begs the question, can it be rebooted in midair?
hmm …
the maximum flight time of an A350 is on the order of 16 hours (or a bit more for those ultra-ultra-long flights)
at the end of such a flight the engines and most systems are shut down … taking the shutdown a little bit deeper is not a real problem and adds only a few minutes to the shutdown / start-up time
so the airline simply adds a line to its A350 manual to do a complete shutdown of the aircraft at least every 24h or at the end of such a long-distance flight - even if it is forgotten, the critical period of 149 hours is so much longer than 24 hours that there is still 5-fold redundancy to ensure a reset (compare that to the 0-fold redundancy of the 737 Max MCAS system)
it's not pretty, but should work sufficiently well … and the aircraft software will be updated during the C-check of the aircraft at the latest (probably much earlier)
It is common for airliners to be left powered on while parked at airport gates so maintainers can carry out routine systems checks between flights, especially if the aircraft is plugged into ground power.
I would have thought “powering it off and turning it back on again” would be a perfectly reasonable step in “routine systems checks”. I suppose power cycling might put a strain on some of the components, but if they’re going to fail, that would be a convenient time for them to do so.
Three cheers for the Agile design philosophy! Iterate that sucker!
When I was a small child I was afraid of drowning.
As a young adult my irrational thoughts about death focused on traffic accidents.
Now I tend to wonder what the odds are that a software bug in an airplane or a medical device or something will take me out before the health consequences of sitting all day mashing a keyboard get me. Technology is a hell of a thing.
Consider yourself lucky living at a time when you can sit around and mash some keyboard without something with sabre-teeth sneaking up to you from behind.
No, a software developer. The fact that it fails literally like clockwork after 150 hours, and that they have such a stupid workaround, suggests this isn’t a simple coding bug but mismatched assumptions across different parts of the project, where 150 hours isn’t handled or trapped because it’s an impossible value to some of the code.
Alternatively, it is a simple coding error, but management is too cheap to budget for the fix and the long QA/certification cycle.
Yeah, I was going to add, which you did, that they must have run the cost/benefit analysis and found that just opening a window works and costs a lot less than fixing the code once and for all.