So this question came from Joey: Why does restarting a computer fix so many problems?
If you’ve got a question about anything computer related, ask in the comments!
Hello, IT, have you tried turning it off and back on again?
– Roy, The IT Crowd
When a computer isn’t working right, most people know that 9 times out of 10 a quick reboot will fix all the problems. (Disclaimer: this is not a real statistic. If it were, it would be a much, much higher proportion.) Why is that?
Put simply, the longer you let your computer run, the more likely an error will show up in the code loaded into your RAM. When you restart your computer, you’re starting over and loading a fresh copy of the code from the hard drive, which is far less likely to develop an error. If RAM, hard drive, and CPU are terms you’re not familiar with, you might want to refresh your memory with my computer summary.
Remember that programs are just recipes that the computer follows verbatim. So, to understand why computers run poorly after a while, let’s follow a sandwich recipe.
Joe’s Dinner Sandwich (patent pending)
1. Place turkey and ham on one slice of bread, and Provolone cheese on another. Place both pieces of bread in toaster oven until bread is drowned, about 3-5 minutes.
2. While bread is toasting, heat a small amount of oil in a small pan and fry one egg to desired consistency. Season egg with more ground pepper than you’d expect.
3. Allow sandwich and egg to cool slightly so as to prevent spinach from wilting. Assemble the sandwich with meat, egg, fresh spinach and cheese.
Pretty simple, right? Well take a closer look at step one:
Place both pieces of bread in toaster oven until bread is drowned
That … doesn’t make any sense in this context. But the first time you read it, you probably skipped right over it, right? Your brain saw an error, corrected it to browned, and moved on with the rest of the code. But computers can’t evaluate the context and guess what’s wrong. (Generally. They can, however, check to see if the code has been corrupted, as I’ll explain a little further down.) In this case, a computer might be unable to access a “drowning” attribute from bread in a toaster oven, because, well, it doesn’t exist, causing the program to halt (i.e. the computer has unexpected behavior). Maybe it’ll wait for the bread to drown in the toaster oven for 5 minutes, realize there’s a problem, abort the code, and throw an error message (i.e. the computer freezes and shows a Kernel Panic or a Blue Screen of Death). But what if there was no time expectation defined? The code might just keep waiting for the sandwich to drown … forever (i.e. the computer runs slowly while it keeps checking on something that doesn’t exist). Or at least until it catches fire.
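To make those outcomes a bit more concrete, here is a small Python sketch of the toaster step. Everything in it is made up for illustration: the `Bread` class, the `toast_until` and `run_step` helpers, and the very short timeout are my own assumptions, not code from any real program.

```python
import time


class Bread:
    """Hypothetical toast model: 'browned' is the only state it knows about."""

    def __init__(self):
        self.browned = False


def toast_until(bread, attribute, timeout_s=0.05, poll_s=0.01):
    """Keep checking `attribute` on `bread` until it is True or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # getattr with a default means a missing attribute is treated as
        # "not yet", so the loop keeps checking for something that never comes.
        if getattr(bread, attribute, False):
            return "done"
        time.sleep(poll_s)
    # With a time expectation defined, the code gives up and reports an error.
    raise TimeoutError(f"bread never became {attribute!r}")


def run_step(attribute):
    """Run the toasting step and report how it ended."""
    bread = Bread()
    bread.browned = True  # pretend the toaster did its job
    try:
        return toast_until(bread, attribute)
    except TimeoutError as err:
        return f"error: {err}"


print(run_step("browned"))
print(run_step("drowned"))
```

With the correct word the step finishes right away. With the corrupted word the bread is perfectly toasted, but the code never notices, and it only stops because a timeout was defined. Take the timeout away and you get the "wait forever" case.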
All that trouble, just over a single letter change.
Over time, the code stored in your RAM degrades and gets corrupted. Not by a lot, mind you, but just by a single letter. But as we saw, that single letter can cause major problems.
And that brings us to the common saying: “shut it off, turn it back on.” Remember that shutting down the computer wipes the RAM clean. So by restarting, you’re refreshing the volatile RAM with code from the much less volatile storage, i.e. the hard drive or the solid state drive. If it’s broken, just start from scratch.
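Here is a rough Python sketch of that idea. It is a toy model, not how an operating system actually loads programs: `DISK_COPY`, `load_into_ram`, and `is_corrupted` are names I invented, and the "program" is just one line of the sandwich recipe.

```python
import zlib

# Stand-in for the program file stored on the drive (hypothetical contents).
DISK_COPY = b"toast until bread is browned"


def load_into_ram():
    """'Booting': copy the pristine program from the drive into mutable RAM."""
    return bytearray(DISK_COPY)


def is_corrupted(ram):
    """Compare a checksum of the RAM copy against the copy on the drive."""
    return zlib.crc32(ram) != zlib.crc32(DISK_COPY)


ram = load_into_ram()
ram[DISK_COPY.index(b"browned")] = ord("d")  # one letter goes bad in RAM
print(ram.decode(), is_corrupted(ram))

ram = load_into_ram()  # "restart": throw the RAM copy away and reload it
print(ram.decode(), is_corrupted(ram))
```

The copy on the drive never changed, so reloading it is all it takes to get back to a working recipe.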
But why do bits just spontaneously change?
This was actually something I just learned this year! Remember how I went to Hawaii for a conference this summer? Well that conference was all about state-of-the-art research into how radiation affects electronics. There were a few hundred people there presenting their work about calibrating particle accelerators, simulating nanometer-scale damage to transistors from high-energy particles, and validating electronic components for space missions. One common theme was dealing with errors in memory.
Basically, when a high-energy photon, proton, or neutron collides with an IC (integrated circuit, i.e. a computer chip) in just the right place, there’s a sudden small spike in charge in the transistors that make the IC work. The IC’s structures are so small and delicate, the tiny charge disruption can be enough to change the digital state from a 0 to a 1, or vice versa. The result? A one-letter change in the computer’s instructions. Occasionally, a really energetic particle will cause a voltage spike big enough to cause damage to the circuitry. But here on Earth, most particle interactions leave no permanent damage.
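If you want to see what one flipped bit does to a letter, here is a tiny Python sketch. The `flip_bit` helper is my own invention for illustration, and it assumes the text is stored as plain ASCII bytes.

```python
def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted.

    Bit 0 is the least significant bit of the first byte.
    """
    byte_index, offset = divmod(bit_index, 8)
    out = bytearray(data)
    out[byte_index] ^= 1 << offset  # XOR with a single 1 flips just that bit
    return bytes(out)


word = b"browned"
hit = flip_bit(word, 0)

print(format(word[0], "08b"), "->", format(hit[0], "08b"))  # 01100010 -> 01100011
print(word.decode(), "->", hit.decode())                    # browned -> crowned
```

One bit out of 56 changed, and the toast is now being crowned instead of browned. Flipping the same bit a second time puts it back, which is why a flip like this leaves no lasting damage to the chip itself.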
But I don’t live next to a nuclear reactor! How can radiation affect my computer?
Nuclear reactor, no. But we are just a short 8.3 light-minutes away from a sun-sized fusion reactor that throws high-energy protons and neutrons. Yes, the sun. Plus, while less frequent, we get hit with cosmic rays from other suns and supernovas from galaxies far, far away.
I found an interesting paper from 1979 (J. F. Ziegler and W. A. Lanford, “Effect of cosmic rays on computer memories,” Science, vol. 206, no. 4420, pp. 776–788, 1979) where IBM did some calculations and figured computers of the time would have a RAM error about once per month. Over the last 30-ish years, the memory circuitry has gotten far smaller and more energy efficient. Smaller circuitry leads to higher densities, which leads to a higher probability of getting hit with a cosmic ray. Higher efficiencies come from using lower voltages, which leads to a higher probability of a cosmic ray generating enough voltage to flip the memory bit. While I’m having a tough time finding a paper about Soft Error Rates in modern DRAM modules, I’m going to guess it’s quite a bit higher.
Ok fine. But what about servers? Aren’t they on all the time? How is that possible?
Great question! Servers, High Performance Computing centers (i.e. Supercomputers), and other mission-critical computers use a special type of RAM that supports Error Correcting Code (ECC). Remember that radiation research I mentioned above? Well, when they turned on ECC, the ICs were able to correct for any single-bit errors that showed up. This let the system read memory without issue until the radiation was high enough to start causing errors in two bits.
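To give a feel for how single-bit correction can work, here is a sketch of the classic Hamming(7,4) code in Python. To be clear, this is a textbook toy and an assumption on my part: I am not claiming it is the exact code used in the ECC RAM from that research or in any particular server. The function names are my own.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits with 3 parity bits."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Fix at most one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three checks spell out the 1-based position of the bad bit (0 = clean).
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1  # a stray particle flips one stored bit
print(hamming74_decode(stored) == data)  # True
```

Any single flipped bit gets repaired on read. Flip two bits in the same 7-bit word, though, and this toy code "corrects" the wrong one, which matches the behavior described above: things work fine until errors start showing up in two bits at once. As far as I know, real ECC memory uses wider codes with extra check bits so that double-bit errors can at least be detected.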
It should come as no surprise that the extra redundancy comes at a substantial premium. First, you need a CPU that supports ECC, which limits you to server-grade stuff like Intel Xeon that is roughly twice as expensive as a similarly-performing consumer Intel i7 or i9. Then, ECC RAM is nearly twice as expensive as otherwise identical RAM too!
The extra cost of ECC RAM is justifiable for companies that can put a price tag on seconds’ worth of downtime (imagine how many transactions would be lost if a credit card company’s server went down, even for just a few seconds!) or researchers whose simulations have no tolerance for errors. For most people, myself included, having to restart our computers every once in a while is hardly a big deal.
The sun tries to foil your web browsing by hurling angry protons and neutrons at your computer. Don’t give in to that flaming ball of gas; refresh damaged code by restarting your computer every few days.