In space, no one can hear you kernel panic (2020)

85 points by p0u4a 4 days ago | 19 comments

dfox
> running identical software on multiple computer systems is the name of the software-architecture game
In the railway signalling industry (which for historically obvious reasons is obsessed with reliability) there is even a pattern of running different software implementing the same specification, written by a different team and running on a different RTOS and a different CPU architecture.
- superxpro12
  This is also true of the space shuttle. The failover '5th' processor was running an implementation done by a completely different sandboxed team to hedge against institutional or systemic errors not caught by the first team. So much thought put into these systems.
  This, in the context of 'modern vehicle safety standards', still makes me cringe when considering the "safety" put into modern autonomous vehicle systems.
somat
"From the dawn of the Space Age through the present, NASA has relied on resilient software running on redundant hardware to make up for physical defects, wear and tear, sudden failures, or even the effects of cosmic rays on equipment."
An interesting case study in this domain is to compare the Saturn V Launch Vehicle Digital Computer with the Apollo Guidance Computer.
Now the LVDC, that was a real flight computer, triply redundant, every stage in the processing pipeline had to be vote confirmed, the works.
https://en.wikipedia.org/wiki/Launch_Vehicle_Digital_Compute...
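To make the "vote confirmed at every stage" idea concrete, here is a minimal software sketch of triple redundancy with majority voting. It is an illustration only, not the LVDC's actual logic (which voted in hardware); the names `majority_vote`, `run_voted_pipeline`, and the toy stage functions are all hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value won more than half the votes, so nothing can be trusted.
        raise RuntimeError("no majority among replica outputs")
    return value

def run_voted_pipeline(stages, value):
    """Run each stage on every replica and pass on only the voted result.

    `stages` is a list of stages; each stage is a list of replica
    functions (three per stage in the triply redundant case).
    """
    for replicas in stages:
        value = majority_vote([replica(value) for replica in replicas])
    return value

# Hypothetical stage implementations for the example.
def increment(x):
    return x + 1

def faulty_increment(x):
    return x + 100  # simulates one replica producing a wrong result

def double(x):
    return x * 2

stages = [
    [increment, increment, faulty_increment],  # one bad replica is outvoted
    [double, double, double],
]
result = run_voted_pipeline(stages, 1)
```

Because the vote happens after every stage, a single bad replica is masked before its output can reach the next stage. If two replicas in the same stage disagree with each other and with the third, the sketch raises an error instead of guessing.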
Compare the AGC, with no redundancy: a toy by comparison. But the AGC was much faster and lighter, so they just shipped two of them (three if you count the one in the lunar module) and made sure it was really good at restarting fast.
There is a lesson to be learned here but I am not sure what it is. Worse is better? Cannot fail vs. fail gracefully?
- KurSix
  I think the lesson is that redundancy can exist at different layers.
- baud147258
  > Worse is better?
  Maybe if you know what the tradeoffs are and are ready to deal with the deficiencies (by rebooting fast). And didn't they have issues with the lunar module Guidance Computer on the first moon landing?
- throwup238
  > There is a lesson to be learned here but I am not sure what it is.
  Restart your Claude Code sessions as often as possible
KurSix
The contrast with modern software development is striking. Today we often rely on fast iteration and patching problems in production. Spacecraft software is the opposite.
- wongarsu
  On the other hand a lot of SpaceX's success can be attributed to applying modern software development methodology on spacecraft. They are very much doing agile development, betting on velocity enabling fast iteration.
  That has led to some of the best rockets ever developed, and the largest satellite constellation by far. But part of the secret sauce is creating situations where you can take risks. Traditionally anything space-related deals in one-offs or tiny production volumes, so any risk is expensive. A lot of SpaceX's strategy is about changing this, whether that's by testing in flight phases the customer doesn't care about, being their own best customer to have lower-risk flights, or building constellations so big that certain failure scenarios aren't a big issue (while other scenarios still have to be treated as high-risk, high-impact).
  superxpro12
  I recall an early deep-dive into their safety architecture on the Falcon 9, which was basically "throw 3 COTS processors at it and reboot anything that doesn't work, and fail fast during development". I remember they explicitly avoided rad-hard processors as well.
  I would love to update myself if anyone has a good source.
  For better or worse, it's hard to argue with results.
  whattheheckheck
  Imagine trying to explain to 1960s taxpayers that we're going to build and blow up multiple rockets for research velocity and dev feedback loops.
thomascountz
OT: I really enjoyed The Increment when it was first being released. It felt like the first software engineering practitioner's publication and introduced me to a lot of new people to follow.
throwaradfy5745
How would these considerations affect Musk's space cloud?
- rogerrogerr
  Starlink very likely leans toward “many cheaper satellites that may fail” instead of “fewer expensive satellites that are less likely to fail”.
  Their advantage in the satellite-internet industry is that they can launch stuff fast and cheap; very likely this drives different tradeoff decisions than the regime this article talks about.
  Panzerschrek
  Having thousands of satellites also allows finding more software bugs, so in reality they can be more reliable than NASA-style probes (where each one has its own unique software).
  phanarch
  The Starlink tangent misses something important about why software reliability in satellite systems is categorically different from hardware reliability.
- gostsamo
  The same way it will affect the incoming mission to the center of the galaxy. The space cloud is much more related to the incoming SpaceX IPO than to any phenomena of the physical or computing universes. Thermodynamics says "no".
gnabgib
(2020)
adampunk
Do not attempt to adjust your television. We control the horizontal. We control the vertical.
We know Glenn is loquacious.
shadowbyte17
interesting point about patching in production – it's a totally different mindset. we had a similar issue with a legacy system at my old job, felt like a constant firefighting situation.