antdude writes "This TechRadar article explains why computers suck at math, and how simple calculations can be a matter of life and death, like in the case of a Patriot defense system failing to take down a Scud missile attack: 'The calculation of where to look for confirmation of an incoming missile requires knowledge of the system time, which is stored as the number of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register — as used in the Patriot system — it's out by a tiny amount. But all these tiny amounts add up. At the time of the missile attack, the system had been running for about 100 hours, or 3,600,000 ticks to be more specific. Multiplying this count by the tiny error led to a total error of 0.3433 seconds, during which time the Scud missile would cover 687m. The radar looked in the wrong place to receive a confirmation and saw no target. Accordingly no missile was launched to intercept the incoming Scud — and 28 people paid with their lives.'"
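The 0.3433-second figure quoted in the summary can be reproduced with a few lines of exact arithmetic. This is only a sketch, and it assumes something the summary does not spell out: that the constant 0.1 was chopped (not rounded) to 23 binary fraction bits inside the 24-bit register. The names `FRACTION_BITS` and `chopped_tenth` are made up for illustration.

```python
from fractions import Fraction

# Assumption: 0.1 is truncated to 23 fraction bits in a 24-bit register.
FRACTION_BITS = 23
TICKS = 3_600_000  # 100 hours of 0.1-second ticks, as quoted in the summary

scale = 2 ** FRACTION_BITS
chopped_tenth = Fraction(scale // 10, scale)      # floor(0.1 * 2^23) / 2^23
error_per_tick = Fraction(1, 10) - chopped_tenth  # what each tick is short by
total_error = error_per_tick * TICKS              # seconds lost after 100 hours

print(float(error_per_tick))
print(float(total_error))
# Closing speed implied by the summary's own numbers (687 m in 0.3433 s).
print(687 / float(total_error))
```

Under that assumption the total comes out at roughly a third of a second, in line with the summary. The tick counter itself is exact; the error only appears when the count is converted to seconds.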
It's pretty pathetic and negligent that software that controls explosive missiles was not tested for over 100 hours of operation. That's a standard Quality Assurance procedure for even the simplest low-budget hardware...
It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I wonder how much time and money was spent in research and development for this thing
It doesn't seem like we're getting a quality product for the likely huge sum that was paid for it...
Mod parent up! This idiotic article blames computers for programmers using numerical approximation algorithms ill-advisedly.
which is stored as the number of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register — as used in the Patriot system — it's out by a tiny amount. But all these tiny amounts add up. At the time of the missile attack, the system had been running for about 100 hours, or 3,600,000 ticks to be more specific. Multiplying this count by the tiny error led to a total error of 0.3433 seconds, during which time the Scud missile would cover 687m. The radar looked in the wrong place to receive a confirmation and saw no target. Accordingly no missile was launched to intercept the incoming Scud — and 28 people paid with their lives.'"
So in a system that should have clocks synchronized to less than a microsecond nobody bothered to run "ntpdate" even once in hundred days ? And surely the military has better clock synch than a stupid home pc ? This is stupidity, also known as "human error", causing those deaths. It's a case of "the correct answer to the wrong question".
What is always brought up as a "computer problem" is the crash in Paris of a jet due to infighting between the human pilot and the autopilot. Of course, there the ultimate mistake was the pilot's : he had forgotten to turn off the autopilot to land. It was set for cruising altitude (3km), and the pilot was trying to land. This resulted in ever more desperate attempts by the autopilot to get the plane to gain height, which eventually resulted in a total loss of lift for the plane, which naturally resulted in the plane hitting the ground nose-down and a big fireball. The computer did exactly as instructed, it's just that the pilot's (unintentionally given) instructions were stupid, and the fact that it took the pilot over 3 minutes to realize just how stupid he had been.
The computer did exactly as instructed, it's just that the pilot's (unintentionally given) instructions were stupid, and the fact that it took the pilot over 3 minutes to realize just how stupid he had been.
Sounds like a user interface problem to me. Given the potential consequences of that particular user error, the fact that the autopilot was still engaged should have been made more obvious to the pilot. (e.g. when the plane computer sees that a struggle is going on between the autopilot and the manual controls, it should prompt a loud, un-maskable synthesized voice shouting "THE AUTOPILOT IS ENGAGED, YOU IDIOT!")
Or if the pilot is pushing hard on the stick the autopilot should disengage (with loud alarms). If I tap on the brakes in my car the cruise control disengages, it does not fight me. - Dan
by Anonymous Coward writes:
on Saturday October 31 2009, @11:44AM (#29934927)
So in a system that should have clocks synchronized to less than a microsecond nobody bothered to run "ntpdate" even once in hundred days ?
Do you want to be the one to explain to the generals why their stand-alone, truck-based mobile air protection system needs a hard-line network connection to work?
The real idiocy is here:
Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register
Taken charitably, the article writer has oversimplified to the point of obscuring the point. It's perfectly possible to represent a 0.1-second tick in a 24-bit register. There's an overflow about once every 19 days. The problem is doing calculations *with* that number, and that takes knowing what the hell you're doing. Given the problem the system designers were trying to solve with Patriot, this should not have been a problem.
And surely the military has better clock synch than a stupid home pc ?
You'd be surprised how hard clock accuracy is to get right, *especially* under military conditions. A drift of 0.3433 seconds over 100 hours works out as an accuracy of 1 part in a million, give or take. Besides, the problem here wasn't clock drift, so it's irrelevant.
FTFA: "So computers might suck at maths, but there's always a solution available to circumvent their inherent weaknesses. And in that case, it's probably more accurate to say that computer programmers suck at maths - or at least some of them do."
Thank you, come again.
So in a system that should have clocks synchronized to less than a microsecond nobody bothered to run "ntpdate" even once in hundred days ?
Yes, obviously they just needed to ssh into their Patriot missile air defense system, edit a few lines in /etc/inet/ntp.conf and svcadm restart ntp.
The obvious problem in the article, if you read it, is computer's finite precision, and how it is dealt with. By 'computer', the author could have easily included the system libraries that are actually doing all the rounding and overflows instead of implementing arbitrary precision in software.
Everyone defending the way 'computers' is used in this article, and conflating it with 'processor' is a complete idiot.
The obvious problem in the article, if you read it, is computer's finite precision, and how it is dealt with. By 'computer', the author could have easily included the system libraries that are actually doing all the rounding and overflows instead of implementing arbitrary precision in software.
Not at all, since correcting an inappropriate hardware design with software is like fixing an automobile that was designed with square wheels by manually sawing off the corners to make them octagonal instead. You could create a recursive software routine to continue sawing until the wheels were a good approximation of round, but that's an awful lot of sawing to fix something that should have been right in the first place.
The clock in modern systems is nothing but a hardware register that gets incremented periodically (as correctly described in the article). The ONLY rounding error introduced by software is in converting that number to decimal. But rounding had nothing to do with the problem described. The appropriate solution is a better hardware design, not attempting to patch or correct it in software.
The problem was error accumulated in the clock register itself due to the imprecision of the clock, and overflows due to the inappropriately small size of the register. Both are hardware issues and represent bad design decisions. The way to fix them is to design the hardware properly in the first place so that it is appropriate for the job at hand.
I'm obviously not a hardware designer? That's funny. I am not the clueless one here. How about some simple math? Maybe you would learn something.
A 24-bit register, with clock ticks every 0.1 second, would overflow in less than 20 days. And if the clock ticks were faster, then it would overflow even sooner. No wonder they recommended rebooting the system every few days.
Of course I do not recommend an infinitely large register. Simply one that is large enough for the job at hand. This one obviously isn't. Further, a 0.1-second resolution clock is obviously not adequate to a job requiring this kind of precision.
If the hardware clock is off (not overflowed but INACCURATE, which was the real situation here), no amount of software tweaking will properly fix the problem. The article did not state but implied -- incorrectly -- that the clock register was accumulating rounding errors; that is not the case. Nobody makes system clocks that way, nor did they in the 90s or even the 80s. The system clock is nothing but a counter that is incremented every clock tick. The actual problem was that the clock ticks were not sufficiently precise, so over time the count was off. Math libraries and rounding errors played no part whatsoever in that error.
Finally, I would like to point out that today's standard PC-type system clocks are large enough that they won't overflow for 100 years or so; that is the obvious and proper solution to the overflow problem. The problem of clock ticks that are sufficiently precise for timing of missile navigation, as far as I know, has not been addressed on standard PCs, however, and they do not try to correct for that in software because the adequate precision in the clock simply does not exist. It would amount to tilting at windmills. Keeping a count in software of the number of times the register overflows is also NOT an appropriate solution for a system clock, nor is any software tweak, because software by definition is volatile while the hardware clock is not. In other words, nobody does it that way, dude, because it's just plain the wrong answer.
As for your final comment, most Unix programmers know what epoch time is, when it started (00:00:00 UTC on 1 January 1970), and when that date will roll over in the counter (approximately 68 YEARS later, so it isn't much of an issue). Nobody is arguing that we should make a missile system that needs to last, unmodified, for over 68 years. But proper hardware design in the first place, which was certainly possible at that time using ASICs if not straight-up custom chips, would have eliminated the problem.
Yes. The issue here sounds like they had a system clock counter that was an integer, that counted the number of 0.1 second clock ticks. Then they wanted to convert this to a floating point number in 24 bit IEEE format. They simply multiplied 0.1 by the integer in the register. Of course, that still sounds like too large an error to have occurred from just that, but let's pretend it did.
There are several issues here. For missiles travelling at such speeds, using a system clock counter based on 0.1 second ticks sounds terribly coarse to me. Second, since 0.1 seconds are the baseline resolution of the system, the system should have been using floating point numbers where '1' corresponds to a decisecond rather than a second. Then the time counter would be exactly expressible in the floating point format.
Lastly, if the floating point format really needed to be in units of seconds, rather than deciseconds, the time counter should have been loaded in, having an exact representation, then it should be divided by 10, which has an exact representation. This is all pretty basic to anybody who has even a limited understanding of floating point. If you understand the inherent precision of every operation even better than I do, even more improvements would be possible.
But to be honest, I'm not sure why floating point was used at all here. It sounds to me like fixed point may have worked just fine for most of these problems. (Of course, fixed point has its own set of rules ensuring maximal accuracy.)
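The multiply-by-0.1 versus divide-by-10 point can be checked directly. The sketch below uses ordinary Python doubles, not the 24-bit format discussed above, so the magnitudes differ, but the mechanism is the same: the constant 0.1 is already inexact before the multiplication happens, while the integer count and the integer 10 are both exact. The function names are hypothetical.

```python
from fractions import Fraction

def seconds_by_multiply(ticks: int) -> float:
    # Two roundings: 0.1 is already inexact, then the product is rounded again.
    return ticks * 0.1

def seconds_by_divide(ticks: int) -> float:
    # One rounding: ticks and 10 are both exact, so the quotient is the
    # closest double to the true value.
    return ticks / 10

# Tick counts for which the two conversions disagree.
mismatches = [n for n in range(1, 1001)
              if seconds_by_multiply(n) != seconds_by_divide(n)]

print(len(mismatches), mismatches[:5])
print(seconds_by_multiply(3), seconds_by_divide(3))
# Dividing always lands on the double nearest to the exact rational value.
print(all(seconds_by_divide(n) == float(Fraction(n, 10)) for n in range(1, 1001)))
```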
by Anonymous Coward writes:
on Saturday October 31 2009, @08:35AM (#29933625)
This particular story took place in 1991, and most of the code for Patriot was written in the 70s - needless to say, software QA was a little more lax back then. The fix for this problem was out a couple days after the incident.
Oh really? The problem with these systems is that they have never worked in anything other than rigged tests and are just silicon snake oil. I remember having this same discussion where there was a story here about some sort of Israeli space lasers that could apparently even shoot down artillery shells. Only a few months after that a very large number of thirty year old rockets dumped at discount price by Iran for being obsolete came flying over the border from Lebanon. Since then a lot of even slower rockets came out of Gaza. The success rate of this amazing new space toy matches that of the Patriot - zero.
The Iron dome [wikipedia.org] system works perfectly. It's just not capable of protecting any kind of large area. It can, however, make a military base invulnerable to rocket fire, and they're working on making the system mobile, to protect tanks. The only real problem left for doing this is the power requirements.
For ships, another such system exists, and protected the ships perfectly well from those same rockets fired by Hezbollah. Its "protection range"? In the largest deployment about 200 square meters.
There is also the problem that a downed missile presents. What is a "downed missile" ? Well it's a large collection of very-high speed pieces of metal that have been heated up by a large explosion that's about to crash into the ground. So far so good.
So what is "the ground" in the case of a Hezbollah or Hamas missile launch ? Well it's the center of the city that's controlled by the terrorists. It's their human shields. Markets, schools, you name it. So a successful missile intercept is reported in the press as "Israel fires a rocket into a Palestinian kindergarten". That is, by the way, the literal truth, even if the rather important detail of a rocket's presence above said kindergarten is left out. In the deployed missile intercept installations "the ground" is chosen to be something else, like the ocean surface.
Missile intercept systems are no solution for terrorism. Most unfortunately, the only solution for those rocket attacks is preventing them from being fired in the first place. Which obviously requires either that Palestinians police their own terrorists, or that someone does it for them (that's called "occupation").
These systems work, they are deployed successfully in the field. They're no silver bullets, and any bullet that's fired, whether a missile or a missile-intercept-missile, will eventually hit the ground at rather high speeds. Which makes their use above urban environments result in civilian casualties.
The "kingdom of Egypt" (the state of the Pharaohs) ? (exterminated to the last man by muslims) The Hittite Empire ? (exterminated by the Greeks, Romans, Persians) The kingdom of Israel ? The Assyrian Empire ?
Which of these do we restore ? (note that the palestinians, or to be more exact, the arabs only come into play about 4500 years after the Assyrian Empire)
Which do we restore ? And why do they have more rights than all the others who conquered that piece of land ?
Note the obvious truth : the Jews controlled Israel about 4300 years before the arabs even left their tiny province...
What if some Greek starts firing rockets at the Arabs ? Will you tell them to leave ? He has at least as much right to Israel as they do ? What if the Jews start firing rockets into Jordan (territory that was part of the kingdom of Israel) ?
And of course, you shouldn't count out yourself. You're an Indo-European living in America. It seems hypocritical in the extreme to tell others to leave conquered lands. Your province of origin is northwestern Iran, every other place on this earth indoeuropeans live (including Europe), is obviously conquered from someone else.
Uhh Hezbollah was created to defend Lebanon in the 80s after Israel killed thousands of Lebanese and occupied a good chunk of it.
The last fight between them happened in 2006. Hezbollah kidnapped a few SOLDIERs to trade for PoWs (a common thing since Israel has a shit ton of prisoners).
Israel responded by sending in an army many 100s of times larger than Lebanon's. They bombed many buildings including hospitals, schools, UN bunkers and apartment buildings. Hezbollah fired rockets back to show resistance.
In the end Israel killed 1200 civilians and 300 soldiers, and wrecked a significant percentage of the country's economy. Hezbollah killed 120 soldiers and 40 civilians. Notice the fucking difference in ratios. Oh and the whole time Hezbollah conducted rescue missions, gave out food and helped transport people to safety. So fuck off.
Also: "Hezbollah is now also a major provider of social services, which operate schools, hospitals, and agricultural services for thousands of Lebanese Shiites, and plays a significant force in Lebanese politics.".
Also Hezbollah states that they distinguish between Zionists and Jews. Their stated reason for firing rockets is continued resistance against Israeli attacks and to put an end to any colonial entity within Lebanon. NOT to kill Jews.
How the fuck parent got modded up is beyond me. Every single point is a verifiable falsehood.
1. The Patriot version used in the Gulf War (round 1) was not designed to be used against Tactical Ballistic Missiles (like SCUDs), but against opposition aircraft. A fighter isn't going to be flying as fast, and thus the error is going to be much smaller, which means the missile would probably still find the plane.
2. The Patriot has a quite good record against SCUDs (after the software upgrades). Much better than the Soviet SA-2s did against B-52 raids in Vietnam.
3. Systems don't always work right the first time, and if you do a full on test to start with, and something goes wrong, it's a lot harder to find where the error is than if you test one part at a time.
I know that I'm arguing with a trolling AC, but for the other readers of slashdot, you should know that the grandparent's post refers to the controversy regarding the analysis of the Patriot system during the first Gulf war. There was a huge propaganda machine behind the Patriot's "successes" which turned out to be very near zero indeed. This was covered in a series of hearings in the early 90's...
One of the other results [cdi.org] (the first one that comes up for me actually) claims that in testimony presented to Congress Postol's methodology was called out as flawed, based on the fact that three or eight Patriots were launched at every incoming missile and his video analysis is done per interceptor fired, completely ignoring the massive odds against more than one interceptor making a hit. The Israelis' independent analysis puts the success rate at 50%.
The way the system is sure it's tracking the target it was given is by predicting where it should be seen next based on speed and direction, and then only looking for it in a window ("range gate") around that predicted position. The window is a point in space-time and therefore has time coordinates as well as space coordinates, and the problem was that the Patriot system apparently used absolute time since power on to specify the time coordinate, hence the error accumulation. The problem could have been avoided simply by using a time coordinate relative to the last tracked position rather than an absolute one.
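To put rough numbers on the absolute-versus-relative point, here is a small sketch. It is not the Patriot's tracking code. It assumes the same chopped 0.1 constant (23 fraction bits) that reproduces the summary's figure, takes the closing speed from the summary's own numbers, and uses a made-up interval of 50 ticks since the last tracked position.

```python
from fractions import Fraction

# Assumptions for illustration only: 0.1 chopped to 23 fraction bits, and a
# closing speed derived from the summary's figures (687 m in 0.3433 s).
TICK_ERROR = Fraction(1, 10) - Fraction((2 ** 23) // 10, 2 ** 23)
SPEED_M_PER_S = 687 / 0.3433

def gate_offset_m(ticks_converted: int) -> float:
    """Range-gate displacement caused by converting this many ticks to seconds."""
    return float(TICK_ERROR * ticks_converted) * SPEED_M_PER_S

uptime_ticks = 3_600_000      # absolute time since power-on (100 hours)
ticks_since_last_fix = 50     # hypothetical: 5 seconds since the last tracked position

print(gate_offset_m(uptime_ticks))          # error when converting absolute uptime
print(gate_offset_m(ticks_since_last_fix))  # error when converting only the delta
```

Subtracting the two integer tick counts first and converting only the small difference keeps the conversion error proportional to seconds of tracking rather than hours of uptime.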
The GAO report also blames the 24 bit registers of the 1970's era hardware as limiting accuracy which is just garbage. A good excuse to a politician perhaps, but there was nothing stopping them from using a 64 bit, or whatever, math library if that would have helped.
Of course the Patriot was being used outside of its original requirements spec when being used to target SCUDs, so it seems someone really screwed up in not reviewing the design beforehand and determining its limitations (and fixing them) rather than finding out after the fact when 28 people are dead as a result.
I actually read about this specific incident once; I seem to remember (though honestly not sure) that the design flaw was known and the user manual indicated that the computer needed to be reset every 36 hours. However, in wartime, under attack (there were frequent Scud intercepts), the crew controlling the missile battery opted against shutting it down even for a short time. Maybe even though the manual said it SHOULD be rebooted it did not explain WHY or what the consequences would be.
So they designed a system that accumulated rounding errors over time, and their solution was to ask the system's users to reboot the system every so often? Somehow, that does not add to my sympathy for these programmers...
a) If they knew enough about it to put "reboot every 36 hours" in the manual they knew enough to fix it.
b) According to the summary, 36 hours would still be a complete miss (a third of 687 meters is still 229)
c) A fixed point integer (32 bits) can mark tenths of seconds with complete accuracy for over 13 years.
d) Leaving aside a,b and c, the story still doesn't make any sense. The system would start the calculation the moment it saw the missile, not 100 hours before it appeared on the radar.
Now... at the speed of a scud missile (mach 5 if google serves me), it may be that an accuracy of 1/10th second isn't enough to compute the trajectory accurately enough to intercept it. At that speed you might need 10,000th second resolution or whatever. *That* would be believable (but unlikely - the designers would have to be complete idiots).
The rest of the article? Yawn. It's the same old recycled story we've been seeing since the 1970s (those of us who are old enough).
Integer arithmetic does not accumulate error, only floating point does that. Now they may have been using floating point, but his point is they should have been using integer arithmetic.
Had they been doing so, it could have run for 13 years with absolutely no accumulated error.
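The lifetimes quoted in this thread (roughly 19 days or 466 hours for 24 bits, over 13 years for 32 bits) are easy to check. A quick sketch, assuming an unsigned counter that simply wraps to zero and one tick per 0.1 s; `seconds_until_wrap` is a made-up helper name.

```python
TICKS_PER_SECOND = 10  # one tick = 0.1 s

def seconds_until_wrap(bits: int) -> float:
    """Seconds before an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) / TICKS_PER_SECOND

for bits in (24, 32):
    s = seconds_until_wrap(bits)
    print(bits, "bits:",
          round(s / 3600, 1), "hours,",
          round(s / 86400, 1), "days,",
          round(s / (86400 * 365.25), 2), "years")
```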
>>>It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I sorry.
j/k.
We had a similar problem with an Aegis design, and it was a major headache for us Hardware engineers to try to convince the Systems Engineers that counting in Binary time was more logical than counting in 0.1 second increments. The SEs kept insisting that their computers at home accurately count in seconds and we hardware engineers should be able to as well. The HE manager and the SE manager were butting heads for about a month over this issue, until finally an upper-level manager handed down a decision in favor of the HE manager and binary-based counting/requirements documentation.
I guess in the Patriot situation, the decision went in the opposite direction. Hence errors were introduced.
by Anonymous Coward writes:
on Saturday October 31 2009, @08:54AM (#29933753)
Hindsight is almost 20/20. Except that the original purpose of the Patriot was to shoot down much slower aircraft, flying parallel to the earth, not ballistic missiles. This new use for Patriot was essentially experimental and had been rushed to war - and in war you run into a lot of unexpected circumstances. For example, conventional doctrine in the 1980's required Patriots to move constantly on the battlefield to avoid air attack. The clock would then reset when repositioned. No one expected a Patriot in air defense mode to stay stationary for 10 hours let alone 100. But in a missile defense role they did. There is a good GAO report on this.
Wow. People complain about the US government. Still look at the transparency. The GAO wrote a very readable report for the House Of Representatives and now we can all read it on the web. It's not unreasonable to think that the US's vast military superiority over everyone else on the planet is at least in part due to this sort of thing. I don't think any other government would do this - mistakes in the military would just get covered up as state secrets and anyone who tried to talk about them would get locked up or worse.
I don't think any other government would do this - mistakes in the military would just get covered up as state secrets and anyone who tried to talk about them would get locked up or worse.
Eh. Forgive me, but do you have any basis whatsoever for this claim, or are you just being arrogant?
Seriously, what programmer has not heard of floating point errors?
I had a similar issue with some code of mine for physics analysis. While I had heard of floating point errors, they're a lot more subtle than it first appears, and I ended up falling victim to one. Fortunately I discovered it before it actually led to any serious problems, it just resulted in wasted time.
Not everyone with a need for programming has a CS background and enough experience to be aware of all the potential problems. You'd hope that someone working on a missile system would have though.
Everybody knows that they exist, fewer people know how to avoid them. Lots of early multimedia frameworks, for example, were written using floating point timestamps and developed this exact problem (add some fraction repeatedly for each audio and each video frame, and after an hour the two tracks are noticeably out of sync). Now, they use a numerator-and-denominator form which is simple to add without rounding errors and so you only get them when you convert to floating point for comparison.
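The drift described above can be simulated. This is a sketch, not code from any real multimedia framework: it emulates single-precision timestamps by rounding through `struct`, uses a hypothetical 30 fps frame rate, and compares the result against an exact numerator-and-denominator clock built on `fractions.Fraction`.

```python
import struct
from fractions import Fraction

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

FPS = 30                 # hypothetical frame rate
FRAMES = FPS * 3600      # one hour of video

# Floating-point timestamp: add the frame duration over and over,
# rounding to single precision after every addition.
step = to_float32(1 / FPS)
float_clock = 0.0
for _ in range(FRAMES):
    float_clock = to_float32(float_clock + step)

# Rational timestamp: numerator/denominator, added exactly.
exact_clock = Fraction(0)
frame_duration = Fraction(1, FPS)
for _ in range(FRAMES):
    exact_clock += frame_duration

drift = float_clock - float(exact_clock)
print(exact_clock, float_clock, drift)
```

The rational clock lands on exactly 3600 seconds; the single-precision one does not, and the gap grows as the running total gets larger and each addition has fewer bits left for the fraction.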
Even fewer people realise how compiler and hardware dependent they can be. For example, if you do a sequence of floating point operations on x86 then the values will stay in 80-bit registers until they are stored out to a variable. If you compile the same code for a newer machine with SSE or for another architecture then you will get 32-bit operations on your 32-bit floats and so you'll have less precision. A lot of compilers will even generate different precision between debug and release builds.
I ran into this when someone was using my library with DirectX. I was initializing a filter kernel and using double-precision calculations, but apparently DirectX put the processor in single-precision mode, so all my double-precision calculations weren't done as such. Same compiled code, just a run-time difference. I took the opportunity to improve the algorithm to work even with single-precision floats, which was probably good to do anyway.
To be honest, from working in two specialist fields (HPC system level programming and embedded applications, particularly sensor stuff), I've experienced that CompSci grads are more likely than CompEng or EE grads to make errors like this. A large part of it is simply that CompSci nowadays is too high-level and abstract, many of them don't know very much about how computers ACTUALLY work other than as a theoretical model.
A common remark is "Why should I need to know that, the compiler will take care of it better than I will anyway", completely forgetting that the compiler is only as smart as the programmer who coded it is. So you can get what I ran into with an odd appliance based around the SH-4 processor I was hired to fix some performance problems with. It ran fixed point integer and decimal math, and was ported over from ARM. But it only reached about 25% of maximum theoretical performance, while the ARM reached around 80%. Turns out GCC was at fault, using a generic method that wasn't suitable for the Super-H architecture. And the CompSci had no clue about such things.
Use decimal floating point or simply switch to fixed point. Fixed point is not used as often as it should be, and many developers don't know how difficult ordinary floating point really is.
Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register -- as used in the Patriot system -- it's out by a tiny amount.
Sorry, 0.1 seconds can be represented EXACTLY in such a system. It doesn't even need floating-point. Here is how such a system could represent the durations of 0.1 seconds, 25.7 seconds, and 123.4 seconds: 1, 257, and 1234. So like you say, fixed-point works here. No need for anything beyond integers in this case.
Well, in this specific instance a decimal system would have been ok, but it isn't a general answer. The general answer is "make sure your increments are divisible into your number base", if they had used 1/8th or 1/16ths of a second, or even 3/32 of a second, as their timer increment then they would not have had this problem. There's no reason why 1/10th of a second has any magic properties.
In general terms, all number bases have other number bases with which they are incompatible. The inability of binary to represent 1/10 accurately is just the same as the inability of decimal to represent 1/3 accurately. It's only because we use decimal all the time that we overlook decimal's shortcomings (or instinctively compensate for or avoid them) and then blame computers for binary's incompatibility with decimal.
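The claim about which tick sizes binary can hold exactly can be checked with Python's `fractions` module, which recovers the exact value a double actually stores. A small sketch; `is_exact_in_binary` is a made-up helper, and it ignores the separate question of whether the fraction fits in a given number of bits.

```python
from fractions import Fraction

# Fraction(x) recovers the exact value a binary double actually stores.
for label, value in [("1/8", 1 / 8), ("1/16", 1 / 16), ("3/32", 3 / 32), ("1/10", 1 / 10)]:
    stored = Fraction(value)
    print(label, "is stored as", stored)

def is_exact_in_binary(numerator: int, denominator: int) -> bool:
    """A fraction terminates in binary iff its reduced denominator is a power of two."""
    d = Fraction(numerator, denominator).denominator
    return d & (d - 1) == 0

print(is_exact_in_binary(3, 32), is_exact_in_binary(1, 10), is_exact_in_binary(1, 3))
```

The first three tick sizes print back as themselves; 1/10 prints as a nearby fraction with a large power-of-two denominator, which is the binary counterpart of writing 1/3 as 0.333...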
Fixed point never rounds when operating in the range and precision for which it is designed. In this case they needed a precision of 0.1, using INT/10 would be 100% accurate and never give them any rounding errors for this use case.
So, in other words: You are wrong, and should probably consider using fixed point more.
With fixed point you can choose the base of the fractional part. A binary fixed point would not help them, but a decimal fixed point of /10 or /100 would. The algebra of fixed point is the same no matter what base you choose. This means it is the fastest way to get decimal-based fractions instead of binary fractions (decimal floating point is best with hardware support).
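A minimal sketch of the INT/10 idea, using the durations mentioned earlier in the thread (1, 257 and 1234 ticks). The helper name `format_tenths` is made up; the point is that the stored value is always an integer count of tenths and a decimal point is only inserted for display.

```python
def format_tenths(ticks: int) -> str:
    """Render a count of 0.1-second ticks as decimal seconds, using only integers."""
    whole, tenth = divmod(ticks, 10)
    return f"{whole}.{tenth}"

# Ten ticks of 0.1 s each, accumulated two ways.
tick_total = 0       # integer tenths
float_total = 0.0    # binary floating point
for _ in range(10):
    tick_total += 1
    float_total += 0.1

print(format_tenths(tick_total), float_total)
print(format_tenths(1), format_tenths(257), format_tenths(1234))
```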
Use fixed point numbers? You know, in financial apps, you never store things as floating points, use cents or 1/1000th dollars instead!
Computers don't suck at math, those programmers do. You can get any precision mathematics on even 8 bit processors, most of the time compilers will figure out everything for you just fine. If you really have to use 24-bit counters with 0.1s precision, you *know* that your timer will wrap around every 466 hours; just issue a warning to reboot every 10 days or auto reboot when it overflows.
of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register
All they had to do is use integers, where a value of 1 represents 0.1 s.
It is absurd to blame the computer (or worse, all computers) for what is bad programming. Computers can store a 1/10 of a second perfectly accurately, as long as it is stored in a variable that counts tenths of seconds rather than seconds. It can easily be stored as an integer that way, avoiding any floating point rounding errors.
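A rough illustration of the difference, in Python with ordinary double-precision floats (not the Patriot's actual arithmetic, which is only described second-hand here): adding 0.1 to a floating-point seconds value on every tick drifts, while counting ticks in an integer and converting once does not.

```python
TICKS = 3_600_000  # 100 hours of 0.1 s ticks

seconds_as_float = 0.0  # naive: accumulate seconds in floating point
tick_count = 0          # integer: count tenths of a second

for _ in range(TICKS):
    seconds_as_float += 0.1
    tick_count += 1

seconds_from_ticks = tick_count / 10  # single conversion at the end

print(seconds_from_ticks)                      # 360000.0
print(seconds_as_float - seconds_from_ticks)   # small but nonzero drift
```

With doubles the drift is tiny, but it is there, and it grows as the format gets narrower.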
There certainly are cases of bad math in computers, particularly Intel computers. But this isn't such an example. This is just a lazy and stupid programmer who didn't understand what he was really doing who should take the blame for the failure that killed people, not the computer.
I remember this from a numerical methods class in the 1980s. To deal with situations like this, you can do one of three things :
a) Have a function that you sample as a function of t, so you don't get accumulated error.
b) Have enough bits so that error won't be an issue. This is actually hard to do because floating point errors do stack up pretty quickly if you are not careful.
c) Or, you can have an error term which you can use to make adjustments along the way to account for a lack of precision. Bresenham's line algorithm does more or less exactly that. That's why you had "stair stepping" as the algorithm corrected itself along the way.
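One well-known technique in the spirit of option (c) is compensated (Kahan) summation, which carries an explicit error term alongside the running total. This is a sketch of that general idea, not a claim about what any missile system did; the function names are made up for the example.

```python
def naive_sum(step: float, n: int) -> float:
    """Add step to a running total n times, with no correction."""
    total = 0.0
    for _ in range(n):
        total += step
    return total

def compensated_sum(step: float, n: int) -> float:
    """Kahan summation: track the low-order bits lost on each addition."""
    total = 0.0
    carry = 0.0  # running estimate of the rounding error so far
    for _ in range(n):
        y = step - carry         # apply the correction to the next increment
        t = total + y            # low-order bits of y may be lost here
        carry = (t - total) - y  # recover what was lost
        total = t
    return total

n = 1_000_000
naive_error = naive_sum(0.1, n) - 100000.0
compensated_error = compensated_sum(0.1, n) - 100000.0
print(naive_error)
print(compensated_error)
```

The compensated version stays within rounding distance of the true total, while the naive loop visibly drifts.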
If the OP was correct, then PATRIOT failed because it did none of them. My bet is that in reality they simply underestimated the actual error term, but did everything else correctly. This could be because of discrepancies in flight control instrumentation or some sensor, or they were simply trying to save money on bits and didn't really do the calculation as to how far the missile could be off after an error term's worth of seconds of flight at a particular phase in its flight profile.
Bottom line is, the engineering discipline exists to solve this problem and is really no different than error handling in any guidance system. Putting a man on the moon, launching an ICBM at a target, shooting down a missile, are all essentially the same computer science problem from an error management perspective. The PhDs already nailed this decades ago. There's no fundamental limitation of computing in this case, merely a failure or inability of engineers on this project to apply the correct known answer to this problem.
Look, you guys can talk trash all you want, but when you say this:
>>Patriot defense system failing to take down a Scud missile attack
You're just lying to yourself. The Patriots defense is awesome this year. I mean, was there really ANY point for the Titans offense to show up a couple of weeks back?
And the Scuds? C'mon man. They let go their best man two seasons ago. The QB can't hit the broadside of a barn and their entire wide-receiver corp has Jello hands anyway. The missile attack is a gadget play, pure and simple. Belichick sees right through that and you know it.
Haters need to stop all the hatin' and get on the Pats bus!!!!! GO PATRIOTS!
There's no way a real-time missile tracking system is going to be dealing with time at an accuracy of 0.1 sec.
A Patriot missile travels at about Mach 3 (~1000 m/sec) so a rounding error of 0.05, even without any error accumulation, means you'd be off by 50m in position.
Who knows what the real story is vs the garbage that was reported, but even if there was a cumulative error that's the fault of the programmer rather than a lack of a computers ability to do math. You do your error analysis and use whatever accuracy needed to keep the errors in a tolerable range.
The part about the system running for 100 hours was pure gibberish. Yes, we can all divide that by 0.1 sec, but what on earth does that have to do with a real-time tracking system tracking a target it acquired a few minutes ago?!
A better title for the story rather than "computers can't do math" would be "we can't do tech reporting".
I'd just like to point out here that the 28 people were not killed by the failure of the intercept system. They were killed by the nice folks who launched the missile in the first place.
because military computers are 20 years out of date to start with. Heck even the awesome modern land warrior hardware, is 10 years out of tech date. Heck they could probably shave 5 pounds off of the hardware by using modern chips, and displays.
Military Spec is only good at rugged. up to date with the best is far behind.
Regardless, what isn't possible is to design a system that can accurately track and shoot down missiles in flight. As the Patriot defence system so patently demonstrated.
You're right. Just as the failure of Samuel Langley's aircraft demonstrated that man would never fly, the failure of an anti-aircraft missile to destroy only half of the ballistic missiles (targets moving at what, twice the speed of the targets it was designed to destroy?) demonstrates that ABM's will never work.
It's the reporting that's garbage. It makes no sense at all. A system tracking missiles travelling at Mach 3 is keeping track of time to 0.1 sec accuracy?! Do you really believe that? Wanna buy a bridge?
0.1 sec at Mach 3 is 100m, so you wouldn't have a hope in hell of ever hitting a 3m long target.
The problem isn't the people working for the defence company, who are hard-core PhDs with some very serious domain knowledge. The problem is people like yourself who are so math illiterate as not to be able to fact check a piece-of-shit story!
I could see designing the system to synchronize both launch times and observations with a timer tick (it wouldn't be surprising if the whole system was driven by the timer interrupt), and then you're not going to have an error due to the spacing between ticks.
I am a bit more dubious about the 24 bit thing, though. Was it fixed-point or floating-point?
I don't think it was a float. What would that be? Maybe 16 bit mantissa, 1 bit sign and 7 bit exponent would seem to be the likeliest bet for a 24 bit float. If so, then after about two hours doing t += 0.1 would stop changing t, and the error would be much bigger.
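The "stops changing after about two hours" estimate can be checked with a small simulation. This sketch assumes a hypothetical float with 16 significant bits (no hidden bit) and round-to-nearest; `round_to_bits` and `stall_time` are names invented for the example, and the exponent range is ignored.

```python
import math

def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant binary digits (round-to-nearest)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2 ** bits), e - bits)

def stall_time(bits: int, step: float = 0.1, max_ticks: int = 1_000_000):
    """Return the value of t at which t += step stops changing t, or None."""
    t = 0.0
    for _ in range(max_ticks):
        nxt = round_to_bits(t + step, bits)
        if nxt == t:
            return t
        t = nxt
    return None

stalled_at = stall_time(16)
print(stalled_at, stalled_at / 3600)  # seconds, hours
```

Under these assumptions the counter freezes once the spacing between representable values exceeds twice the increment, which happens a little over two hours in.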
So presumably it was fixed point. But if you're doing it in fixed point, instead of storing x, you store nx in an int, for some appropriate scaling factor n. But if you're going to do that, surely you'll choose n in a smart way, and in this case the obvious choice, as pointed out by many posters, is n=10. This is not only the obvious choice because it gets you more precision, but it's the obvious choice because the easiest, most obvious and most standard way of coding timers is to just increment a register with each tick. It would be silly, for instance, to let n=2^8, and then increment a register with 0.1*2^8 = 25.6, rounded to 0x1A. It would be a very unlikely assembly language programmer who would have put an add reg,1Ah opcode in interrupt handler code when inc reg would have worked.
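A quick sketch of why the scaling factor matters, assuming a register that gains a fixed integer increment per tick (the function `elapsed_seconds` and its parameters are hypothetical). With n=10 the increment is exactly 1; with n=2^8 the ideal increment 25.6 has to be rounded to an integer, and that rounding is repeated on every tick.

```python
def elapsed_seconds(ticks: int, increment: int, scale: int) -> float:
    """Register gains `increment` per tick; one second equals `scale` units."""
    register = ticks * increment
    return register / scale

ticks = 3_600_000  # 100 hours of 0.1 s ticks
exact = ticks / 10

decimal_scale = elapsed_seconds(ticks, increment=1, scale=10)
binary_scale = elapsed_seconds(ticks, increment=round(0.1 * 2**8), scale=2**8)

print(decimal_scale - exact)  # 0.0
print(binary_scale - exact)   # large, because 26/256 is not 0.1
```

The n=10 register is exact for as long as it does not overflow; the n=2^8 register is off by well over an hour after 100 hours.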
Now maybe at some point the timer value would get converted to a float for computations. But that surely wouldn't be a 24-bit float.
So maybe the article has mangled things and it was not a 24-bit register, but a 32-bit float, with 24-bit mantissa, 7 bit exponent and 1 bit sign, and the "24" in the article came from the mantissa. That's a much more realistic choice. Still, the standard way to handle timers is to just increment a timer variable. So what I could see happening is this. There is a timer system variable t at full 0.1 second precision incremented on interrupt. (That's how PCs used to work--maybe still do--except the timer resolution was roughly 1/18 sec.) Then for their launch calculations, they do: (float32)t / 10. And now they're going to get nasty roundoff errors as the mantissa gets filled up. At the 36 hour point, t is already about 21 bits long. So when you do a float divide by 10, you'll certainly have roundoff problems. But you're still not going to be more than one tick (0.1 sec) off, because each tick still adjusts the mantissa, while the article says they were 0.34 seconds off.
So I think something got mangled in the article. Or we had a really unlikely assembly language programmer who had floating point code executed with every tick of a timer interrupt. But even if the interrupt is only at 10 Hz, that's just completely contrary to the instincts of an assembly language programmer. And this would have been done back in the heyday of assembly language programming, when one would try to optimize every clock cycle one could. (And, yes, I've worked with timer interrupt handlers, both on the Z80 and the 8086.)
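The "no more than one tick off" claim can be sanity-checked by emulating single precision in Python. This is only a sketch of the hypothetical (float32)t / 10 conversion described above, using `struct` to round values to 32-bit floats; `to_float32` and `seconds_float32` are names made up here.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def seconds_float32(ticks: int) -> float:
    """Emulate (float32)ticks / 10 carried out in single precision."""
    return to_float32(to_float32(float(ticks)) / to_float32(10.0))

# Worst conversion error over the last 1000 ticks before the 100 hour mark.
worst = max(abs(seconds_float32(t) - t / 10)
            for t in range(3_599_000, 3_600_001))
print(worst)
```

Near the 100 hour mark the conversion error comes out well under one 0.1-second tick, consistent with the argument that this path alone cannot explain a third of a second.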
Poor QA (Score:5, Insightful)
It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I wonder how much time and money was spent in research and development for this thing
It doesn't seem like we're getting a quality product for the likely huge sum that was paid for it...
Re:Poor QA (Score:5, Informative)
What Every Computer Scientist Should Know About Floating-Point Arithmetic [sun.com]
Parent
Re:Poor QA (Score:5, Insightful)
Mod parent up ! This idiotic article blames computers for programmers using numerical approximation algorithms illadvisedly.
which is stored as the number of 0.1-second ticks since the system was started up. Unfortunately, 0.1 seconds cannot be expressed accurately as a binary number, so when it's shoehorned into a 24-bit register — as used in the Patriot system — it's out by a tiny amount. But all these tiny amounts add up. At the time of the missile attack, the system had been running for about 100 hours, or 3,600,000 ticks to be more specific. Multiplying this count by the tiny error led to a total error of 0.3433 seconds, during which time the Scud missile would cover 687m. The radar looked in the wrong place to receive a confirmation and saw no target. Accordingly no missile was launched to intercept the incoming Scud — and 28 people paid with their lives.'"
So in a system that should have clocks synchronized to less than a microsecond nobody bothered to run "ntpdate" even once in a hundred hours ? And surely the military has better clock synch than a stupid home pc ? This is stupidity, also known as "human error", causing those deaths. It's a case of "the correct answer to the wrong question".
What is always brought up as a "computer problem" is the crash in Paris of a jet due to infighting between the human pilot and the autopilot. Of course, there the ultimate mistake was the pilot's : he had forgotten to turn off the autopilot to land. It was set for cruising altitude (3km), and the pilot was trying to land. This resulted in ever more desperate attempts by the autopilot to get the plane to gain height, which eventually resulted in a total loss of lift for the plane, which naturally resulted in the plane hitting the ground nose-down and a big fireball. The computer did exactly as instructed, it's just that the pilot's (unintentionally given) instructions were stupid, and the fact that it took the pilot over 3 minutes to realize just how stupid he had been.
Parent
Re:Poor QA (Score:5, Insightful)
The computer did exactly as instructed, it's just that the pilot's (unintentionally given) instructions were stupid, and the fact that it took the pilot over 3 minutes to realize just how stupid he had been.
Sounds like a user interface problem to me. Given the potential consequences of that particular user error, the fact that the autopilot was still engaged should have been made more obvious to the pilot. (e.g. when the plane computer sees that a struggle is going on between the autopilot and the manual controls, it should prompt a loud, un-maskable synthesized voice shouting "THE AUTOPILOT IS ENGAGED, YOU IDIOT!")
Parent
Re:God yes (Score:4, Funny)
We even have a modern analog for this - the shift-lock key.
Parent
Re:Poor QA (Score:4, Insightful)
The computer did exactly as instructed, it's just that the pilot's (unintentionally given) instructions were stupid, and the fact that it took the pilot over 3 minutes to realize just how stupid he had been.
Sounds like a user interface problem to me. Given the potential consequences of that particular user error, the fact that the autopilot was still engaged should have been made more obvious to the pilot. (e.g. when the plane computer sees that a struggle is going on between the autopilot and the manual controls, it should prompt a loud, un-maskable synthesized voice shouting "THE AUTOPILOT IS ENGAGED, YOU IDIOT!")
Or if the pilot is pushing hard on the stick the autopilot should disengage (with loud alarms).
If I tap on the brakes in my car the cruise control disengages, it does not fight me.
- Dan
Parent
Re:Poor QA (Score:4, Insightful)
Do you want to be the one to explain to the generals why their stand-alone, truck-based mobile air protection system needs a hard-line network connection to work?
The real idiocy is here:
Taken charitably, the article writer has oversimplified to the point of obscuring the point. It's perfectly possible to represent a 0.1-second tick in a 24-bit register. There's an overflow about once every 19 days. The problem is doing calculations *with* that number, and that takes knowing what the hell you're doing. Given the problem the system designers were trying to solve with Patriot, this should not have been a problem.
You'd be surprised how hard clock accuracy is to get right, *especially* under military conditions. A drift of 0.3433 seconds over 100 hours works out as an accuracy of 1 part in a million, give or take. Besides, the problem here wasn't clock drift, so it's irrelevant.
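For anyone who wants to see where 0.3433 comes from, here is a small sketch. It assumes that 0.1 was stored chopped (truncated, not rounded) to 23 bits after the binary point; that assumption is not spelled out in the summary, but it reproduces the quoted figure.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumption: bits kept after the binary point

tenth = Fraction(1, 10)
# Chop (truncate) 0.1 to the assumed register precision.
stored = Fraction(int(tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
per_tick_error = tenth - stored

ticks = 100 * 3600 * 10  # 100 hours of 0.1 s ticks
total_error = float(per_tick_error * ticks)
relative_error = total_error / (100 * 3600)

print(float(per_tick_error))  # about 9.5e-08 seconds per tick
print(total_error)            # about 0.3433 seconds
print(relative_error)         # about 1 part in a million
```

The per-tick error is under a tenth of a microsecond; it only matters because it is multiplied by 3.6 million ticks.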
Parent
READ THE GD ARTICLE (Score:5, Insightful)
FTFA:
"So computers might suck at maths, but there's always a solution available to circumvent their inherent weaknesses. And in that case, it's probably more accurate to say that computer programmers suck at maths - or at least some of them do."
Thank you, come again.
So in a system that should have clocks synchronized to less than a microsecond nobody bothered to run "ntpdate" even once in a hundred hours ?
Yes, obviously they just needed to ssh into their patriot missile air defense system, edit a few lines in /etc/inet/ntp.conf and svcadm restart ntp.
The obvious problem in the article, if you read it, is computer's finite precision, and how it is dealt with. By 'computer', the author could have easily included the system libraries that are actually doing all the rounding and overflows instead of implementing arbitrary precision in software.
Everyone defending the way 'computers' is used in this article, and conflating it with 'processor' is a complete idiot.
Parent
Re:READ THE GD ARTICLE (Score:4, Insightful)
Not at all, since correcting an inappropriate hardware design with software is like fixing an automobile that was designed with square wheels by manually sawing off the corners to make them octagonal instead. You could create a recursive software routine to continue sawing until the wheels were a good approximation of round, but that's an awful lot of sawing to fix something that should have been right in the first place.
The clock in modern systems is nothing but a hardware register that gets incremented periodically (as correctly described in the article). The ONLY rounding error introduced by software is in converting that number to decimal. But rounding had nothing to do with the problem described. The appropriate solution is a better hardware design, not attempting to patch or correct it in software.
The problem was error accumulated in the clock register itself due to the imprecision of the clock, and overflows due to the inappropriately small size of the register. Both are hardware issues and represent bad design decisions. The way to fix them is to design the hardware properly in the first place so that it is appropriate for the job at hand.
Parent
Re:READ THE GD ARTICLE (Score:5, Insightful)
A 24-bit register, with clock ticks every 0.1 second, would overflow in less than 20 days. And if the clock ticks were faster, then it would overflow even sooner. No wonder they recommended rebooting the system every few days.
Of course I do not recommend an infinitely large register. Simply one that is large enough for the job at hand. This one obviously isn't. Further, a 0.1-second resolution clock is obviously not adequate to a job requiring this kind of precision.
If the hardware clock is off (not overflowed but INACCURATE, which was the real situation here), no amount of software tweaking will properly fix the problem. The article did not state but implied -- incorrectly -- that the clock register was accumulating rounding errors; that is not the case. Nobody makes system clocks that way, nor did they in the 90s or even the 80s. The system clock is nothing but a counter that is incremented every clock tick. The actual problem was that the clock ticks were not sufficiently precise, so over time the count was off. Math libraries and rounding errors played no part whatsoever in that error.
Finally, I would like to point out that today's standard PC-type system clocks are large enough that they won't overflow for 100 years or so; that is the obvious and proper solution to the overflow problem. The problem of clock ticks that are sufficiently precise for timing of missile navigation, as far as I know, has not been addressed on standard PCs, however, and they do not try to correct for that in software because the adequate precision in the clock simply does not exist. It would amount to tilting at windmills. Keeping a count in software of the number of times the register overflows is also NOT an appropriate solution for a system clock, nor is any software tweak, because software by definition is volatile while the hardware clock is not. In other words, nobody does it that way, dude, because it's just plain the wrong answer.
As for your final comment, most Unix programmers know what epoch time is, when it started (00:00:00 UTC on 1 January 1970 according to ISO 8601), and when that date will roll over in a signed 32-bit counter (approximately 68 YEARS later, so it isn't much of an issue). Nobody is arguing that we should make a missile system that needs to last, unmodified, for over 68 years. But proper hardware design in the first place, which was certainly possible at that time using ASICs if not straight-up custom chips, would have eliminated the problem.
Parent
Re:Poor QA (Score:5, Informative)
Yes. The issue here sounds like they had a system clock counter that was an integer, which counted the number of 0.1 second clock ticks. Then they wanted to convert this to a floating point number in 24 bit IEEE format, so they simply multiplied 0.1 by the integer in the register. Of course, that still sounds like too large an error to have occurred from just that, but let's pretend it did.
There are several issues here. For missiles travelling at such speeds, using a system clock counter based on 0.1 second ticks sounds terribly coarse to me. Second, since 0.1 seconds are the baseline resolution of the system, the system should have been using floating point numbers where '1' corresponds to a decisecond rather than a second. Then the time counter would be exactly expressible in the floating point format.
Lastly, if the floating point format really needed to be in units of seconds, rather than deciseconds, the time counter should have been loaded in, having an exact representation, then it should be divided by 10, which has an exact representation. This is all pretty basic to anybody who has even a limited understanding of floating point. If you understand the inherent precision of every operation even better than I do, even more improvements would be possible.
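The multiply-by-0.1 versus divide-by-10 distinction shows up even with ordinary double-precision floats. A minimal sketch (function names invented for the example):

```python
def by_multiply(ticks: int) -> float:
    # 0.1 is already a rounded constant, and the product is rounded again.
    return ticks * 0.1

def by_divide(ticks: int) -> float:
    # Both operands are exact, so only the final quotient is rounded.
    return ticks / 10

print(by_multiply(3), by_divide(3))  # 0.30000000000000004 0.3

# Count how often the two conversions disagree for small tick counts.
mismatches = sum(by_multiply(n) != by_divide(n) for n in range(1, 1001))
print(mismatches)
```

Dividing gives the correctly rounded result every time; multiplying by a pre-rounded constant does not.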
But to be honest, I'm not sure why floating point was used at all here. It sounds to me like fixed point may have worked just fine for most of these problems. (Of course, fixed point has its own set of rules ensuring maximal accuracy. )
Parent
Re:Poor QA (Score:5, Informative)
This particular story took place in 1991, and most of the code for Patriot was written in the 70s - needless to say, software QA was a little more lax back then. The fix for this problem was out a couple days after the incident.
Parent
Re:Poor QA (Score:5, Insightful)
I remember having this same discussion where there was a story here about some sort of Israeli space lasers that could apparently even shoot down artillery shells. Only a few months after that a very large number of thirty year old rockets dumped at discount price by Iran for being obsolete came flying over the border from Lebanon. Since then a lot of even slower rockets came out of Gaza. The success rate of this amazing new space toy matches that of the Patriot - zero.
Parent
Re:Poor QA (Score:5, Interesting)
The Iron dome [wikipedia.org] system works perfectly. It's just not capable of protecting any kind of large area. It can, however, make a military base invulnerable to rocket fire, and they're working on making the system mobile, to protect tanks. The only real problem left for doing this is the power requirements.
For ships, another such system exists, and protected the ships perfectly well from those same rockets fired by hizbullah. Its "protection range" ? In the largest deployment about 200 square meters.
There is also the problem that a downed missile presents. What is a "downed missile" ? Well it's a large collection of very-high speed pieces of metal that have been heated up by a large explosion that's about to crash into the ground. So far so good.
So what is "the ground" in the case of a hizbullah or hamas missile launch ? Well it's the center of the city that's controlled by the terrorists. It's their human shields. Markets, schools, you name it. So a successful missile intercept is reported in the press as "Israel fires a rocket into a palestinian kindergarten". That is, by the way, the literal truth, even if the rather important detail of a rocket's presence above said kindergarten is left out. In the deployed missile intercept installations "the ground" is chosen to be something else, like the ocean surface.
Missile intercept systems are no solution for terrorism. Most unfortunately, the only solution for those rocket attacks is preventing them from being fired in the first place. Which obviously requires that either palestinians police their own terrorists, or someone does it for them (that's called "occupation").
These systems work, they are deployed successfully in the field. They're no silver bullets, and any bullet that's fired, whether a missile or a missile-intercept-missile, will eventually hit the ground at rather high speeds. Which makes their use above urban environments result in civilian casualties.
Parent
Re:Poor QA (Score:4, Insightful)
You missed the third option, which is for the motivation behind the firing of rockets to be removed.
http://www.youtube.com/watch?v=iNrCMdFoZqQ [youtube.com]
So who do we allow to settle there ?
The "kingdom of Egypt" (the state of the Pharaohs) ? (exterminated to the last man by muslims)
The Hittite Empire ? (exterminated by the Greeks, Romans, Persians)
The kingdom of Israel ?
The Assyrian Empire ?
Which of these do we restore ? (note that the palestinians, or to be more exact, the arabs only come into play about 4500 years after the Assyrian Empire)
Which do we restore ? And why do they have more rights than all the others who conquered that piece of land ?
Note the obvious truth : the Jews controlled Israel about 4300 years before the arabs even left their tiny province ...
What if some Greek starts firing rockets at the Arabs ? Will you tell them to leave ? He has at least as much right to Israel as they do ? What if the Jews start firing rockets into Jordan (territory that was part of the kingdom of Israel) ?
And of course, you shouldn't count out yourself. You're an Indo-European living in America. It seems hypocritical in the extreme to tell others to leave conquered lands. Your province of origin is northwestern Iran, every other place on this earth indoeuropeans live (including Europe), is obviously conquered from someone else.
So when will you give the good example ?
Parent
Re:Poor QA (Score:5, Interesting)
The last fight between them happened in 2006. Hezbolah kidnapped a few SOLDIERs to trade for PoWs (a common thing since israel has a shit ton of prisoners).
Israel responded by sending in an army many 100s of times larger than lebanon's they bombed many buildings including hospitals, school, UN bunkers and apartment buildings. Hezbolah fired rockets back to show resistance.
In the end Israel killed 1200 civilians, 300 soldiers, and a significant percentage of the country's economy. Hezbolah killed 120 soldiers, 40 civilians. Notice the fucking difference in ratios. Oh and the whole time hezbolah conducted rescue missions, gave out food and helped transport people to safety. So fuck off.
Also: "Hezbollah is now also a major provider of social services, which operate schools, hospitals, and agricultural services for thousands of Lebanese Shiites, and plays a significant force in Lebanese politics.".
Also hezbolah states that they distinguish between zionists and jewish. Their stated reason for firing rockets is continued resistance against israeli attacks and to put an end to any colonial entity within lebanon. NOT kill jews.
How the fuck parent got modded up is beyond me. Every single point is a verifiable falsehood.
Parent
Re:Poor QA (Score:4, Interesting)
1. The Patriot version used in the Gulf War (round 1) was not designed to be used against Tactical Ballistic Missiles (like SCUDs), but against opposition aircraft. A fighter isn't going to be flying as fast, and thus the error is going to be much smaller, which means the missile would probably still find the plane.
2. The Patriot has a quite good record against SCUDs (after the software upgrades). Much better than the Soviet SA-2s did against B-52 raids in Vietnam.
3. Systems don't always work right the first time, and if you do a full on test to start with, and something goes wrong, it's a lot harder to find where the error is than if you test one part at a time.
Parent
Patriot success rate was likely extremely inflated (Score:5, Informative)
I know that I'm arguing with a trolling AC, but for the other readers of slashdot, you should know that the grandparent's post refers to the controversy regarding the analysis of the Patriot system during the first Gulf war. There was a huge propaganda machine behind the Patriot's "successes" which turned out to be very near zero indeed. This was covered in a series of hearings in the early 90's...
http://www.fas.org/spp/starwars/docops/pl920908.htm [fas.org]
You can also read up on this from transcripts from the hearings after the war.
In the interests of fairness, here is a rebuttal / review.
http://www.fas.org/spp/starwars/docops/zimmerman.htm [fas.org]
I remain unconvinced -- from reading this (almost 20 years ago) I concluded that at best, the military did not know for sure that these worked well.
Parent
Re:Poor QA (Score:4, Informative)
Someone posted the actual GAO report on this, which makes a bit more sense than the gibberish TechRadar article.
http://www.fas.org/spp/starwars/gao/im92026.htm [fas.org]
The way the system is sure it's tracking the target it was given is by predicting where it should be seen next based on speed and direction, and then only looking for it in a window ("range gate") around that predicted position. The window is a point in space-time and therefore has time coordinates as well as space coordinates, and the problem was that the Patriot system apparently used absolute time since power on to specify the time coordinate, hence the error accumulation. The problem could have been avoided simply by using a time coordinate relative to the last tracked position rather than an absolute one.
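A back-of-the-envelope sketch of why a relative time coordinate would help. It assumes, as an illustration only, that ticks are converted to seconds with a slightly-too-small stored constant (0.1 chopped to 23 fractional bits); the conversion error then scales with the number of ticks being converted, so converting a short interval is far more accurate than converting total uptime. `STORED_TENTH`, `conversion_error` and the 50-tick interval are all made-up names and values.

```python
STORED_TENTH = 838860 / 2**23  # assumed chopped approximation of 0.1

def conversion_error(ticks: int) -> float:
    """Seconds lost when `ticks` is converted with the chopped constant."""
    return ticks * 0.1 - ticks * STORED_TENTH

uptime_ticks = 3_600_000   # absolute time: 100 hours since power-on
ticks_since_last_fix = 50  # relative time: 5 seconds since the last track update

absolute_error = conversion_error(uptime_ticks)
relative_error = conversion_error(ticks_since_last_fix)
print(absolute_error)  # about a third of a second
print(relative_error)  # a few microseconds
```

The same sloppy constant that costs a third of a second on absolute time costs only microseconds on a five-second interval.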
The GAO report also blames the 24 bit registers of the 1970's era hardware as limiting accuracy which is just garbage. A good excuse to a politician perhaps, but there was nothing stopping them from using a 64 bit, or whatever, math library if that would have helped.
Of course the Patriot was being used outside of its original requirements spec when being used to target SCUDs, so it seems someone really screwed up in not reviewing the design beforehand and determining its limitations (and fixing them) rather than finding out after the fact when 28 people are dead as a result.
Parent
"User error"? (Score:5, Informative)
I actually read about this specific incident once; I seem to remember (though I'm honestly not sure) that the design flaw was known and the user manual indicated that the computer needed to be reset every 36 hours. However, in wartime, under attack (there were frequent Scud intercepts), the crew controlling the missile battery opted against shutting it down, even for a short time. Maybe even though the manual said it SHOULD be rebooted, it did not explain WHY or what the consequences would be.
Parent
Re:"User error"? (Score:5, Insightful)
I'm calling "Horsepoo" on the whole story.
a) If they knew enough about it to put "reboot every 36 hours" in the manual they knew enough to fix it.
b) According to the summary, 36 hours would still be a complete miss (a third of 687 meters is still 229)
c) A fixed point integer (32 bits) can mark tenths of seconds with complete accuracy for over 13 years.
d) Leaving aside a,b and c, the story still doesn't make any sense. The system would start the calculation the moment it saw the missile, not 100 hours before it appeared on the radar.
Now ... at the speed of a scud missile (mach 5 if google serves me), it may be that an accuracy of 1/10th second isn't enough to compute the trajectory accurately enough to intercept it. At that speed you might need 10,000th second resolution or whatever. *That* would be believable (but unlikely - the designers would have to be complete idiots).
The rest of the article? Yawn. It's the same old recycled story we've been seeing since the 1970s (those of us who are old enough).
Parent
Re:"User error"? (Score:4, Insightful)
Integer arithmetic does not accumulate error, only floating point does that. Now they may have been using floating point, but his point is they should have been using integer arithmetic.
Had they been doing so, it could have run for 13 years with absolutely no accumulated error.
Parent
Re:Poor QA (Score:5, Informative)
>>>It's also pretty pathetic that the system designers implemented a broken design and did not foresee this problem. High-resolution timekeeping has been accomplished pretty successfully already...
I sorry.
j/k.
We had a similar problem with an Aegis design, and it was a major headache for us Hardware engineers to try to convince the Systems Engineers that counting in Binary time was more logical than counting in 0.1 second increments. The SEs kept insisting that their computers at home accurately count in seconds and we hardware engineers should be able too. The HE manager and the SE manager were butting heads for about a month over this issue, until finally an upper-level manager handed-down a decision in favor of the HE manager and binary-based counting/requirements documentation.
I guess in the Patriot situation, the decision went in the opposite direction. Hence errors we introduced.
Parent
Re:Poor QA (Score:5, Funny)
I guess in the Patriot situation, the decision went in the opposite direction. Hence errors we introduced.
Ah, so you're the one responsible!
Parent
Re:Poor QA (Score:5, Interesting)
Hindsight is almost 20/20. Except that the original purpose of the Patriot was to shoot down much slower aircraft, flying parallel to the earth, not ballistic missiles. This new use for Patriot was essentially experimental and had been rushed to war - and in war you run into a lot of unexpected circumstances. For example, conventional doctrine in the 1980's required Patriots to move constantly on the battlefield to avoid air attack. The clock would then reset when repositioned. No one expected a Patriot in air defense mode to stay stationary for 10 hours let alone 100. But in a missile defense role they did. There is a good GAO report on this.
Parent
Re:Poor QA (Score:5, Insightful)
There is a good GAO report on this.
This one?
http://www.fas.org/spp/starwars/gao/im92026.htm [fas.org]
Wow. People complain about the US government. Still look at the transparency. The GAO wrote a very readable report for the House Of Representatives and now we can all read it on the web. It's not unreasonable to think that the US's vast military superiority over everyone else on the planet is at least in part due to this sort of thing. I don't think any other government would do this - mistakes in the military would just get covered up as state secrets and anyone who tried to talk about them would get locked up or worse.
Parent
Re:Poor QA (Score:5, Insightful)
Eh. Forgive me, but do you have any basis whatsoever for this claim, or are you just being arrogant?
Parent
Re:Poor QA (Score:4, Insightful)
Seriously, what programmer has not heard of floating point errors?
I had a similar issue with some code of mine for physics analysis. While I had heard of floating point errors, they're a lot more subtle than they first appear, and I ended up falling victim to one. Fortunately I discovered it before it actually led to any serious problems; it just resulted in wasted time.
Not everyone with a need for programming has a CS background and enough experience to be aware of all the potential problems. You'd hope that someone working on a missile system would have, though.
Parent
Re:Poor QA (Score:5, Informative)
Everybody knows that they exist, fewer people know how to avoid them. Lots of early multimedia frameworks, for example, were written using floating point timestamps and developed this exact problem (add some fraction repeatedly for each audio and each video frame, and after an hour the two tracks are noticeably out of sync). Now, they use a numerator-and-denominator form which is simple to add without rounding errors and so you only get them when you convert to floating point for comparison.
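Here is a minimal sketch of the difference, using Python's standard `fractions.Fraction` as a stand-in for the numerator-and-denominator timestamps (real frameworks have their own rational types; the function names and the 1001/30000 frame duration are just illustrative).

```python
from fractions import Fraction

def accumulate_float(step: float, count: int) -> float:
    """Add a float step repeatedly; each addition may round."""
    total = 0.0
    for _ in range(count):
        total += step
    return total

def accumulate_rational(step: Fraction, count: int) -> Fraction:
    """Add a rational step repeatedly; every addition is exact."""
    total = Fraction(0)
    for _ in range(count):
        total += step
    return total

# Ten steps of one tenth: the float sum misses 1.0, the rational sum does not.
print(accumulate_float(0.1, 10) == 1.0)              # False with IEEE 754 doubles
print(accumulate_rational(Fraction(1, 10), 10) == 1)  # True

# A video-style frame duration of 1001/30000 s stays exact over 30000 frames.
frames = accumulate_rational(Fraction(1001, 30000), 30000)
print(frames)  # 1001
```

Rounding then only happens once, at the point where you convert the rational to a float for comparison or display.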
Even fewer people realise how compiler and hardware dependent they can be. For example, if you do a sequence of floating point operations on x86 then the values will stay in 80-bit registers until they are stored out to a variable. If you compile the same code for a newer machine with SSE or for another architecture then you will get 32-bit operations on your 32-bit floats and so you'll have less precision. A lot of compilers will even generate different precision between debug and release builds.
Parent
Re:Poor QA (Score:4, Interesting)
I ran into this when someone was using my library with DirectX. I was initializing a filter kernel and using double-precision calculations, but apparently DirectX put the processor in single-precision mode, so all my double-precision calculations weren't done as such. Same compiled code, just a run-time difference. I took the opportunity to improve the algorithm to work even with single-precision floats, which was probably good to do anyway.
Parent
Re:Poor QA (Score:5, Insightful)
To be honest, from working in two specialist fields (HPC system-level programming and embedded applications, particularly sensor stuff), I've found that CompSci grads are more likely than CompEng or EE grads to make errors like this. A large part of it is simply that CompSci nowadays is too high-level and abstract; many of them don't know very much about how computers ACTUALLY work other than as a theoretical model.
A common remark is "Why should I need to know that, the compiler will take care of it better than I will anyway", completely forgetting that the compiler is only as smart as the programmer who coded it is. So you can get what I ran into with an odd appliance based around the SH-4 processor I was hired to fix some performance problems with. It ran fixed point integer and decimal math, and was ported over from ARM. But it only reached about 25% of maximum theoretical performance, while the ARM reached around 80%. Turns out GCC was at fault, using a generic method that wasn't suitable for the Super-H architecture. And the CompSci had no clue about such things.
Parent
Curse of binary floating point (Score:5, Insightful)
Use decimal floating point or simply switch to fixed point. Fixed point is not used as often as it should be, and many developers don't know how difficult ordinary floating point really is.
Re:Curse of binary floating point (Score:5, Insightful)
Sorry, 0.1 seconds can be represented EXACTLY in such a system. It doesn't even need floating-point. Here is how such a system could represent the durations of 0.1 seconds, 25.7 seconds, and 123.4 seconds: 1, 257, and 1234. So like you say, fixed-point works here. No need for anything beyond integers in this case.
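To make that concrete, a small sketch that keeps durations as integer tenths of a second and only turns them into decimal text for display. `format_tenths` is a hypothetical helper and assumes non-negative values.

```python
def format_tenths(tenths: int) -> str:
    """Render a non-negative count of 0.1 s units as decimal seconds."""
    whole, frac = divmod(tenths, 10)
    return f"{whole}.{frac}"

durations = [1, 257, 1234]          # 0.1 s, 25.7 s, 123.4 s
for d in durations:
    print(format_tenths(d))

# Sums stay exact because they are plain integer additions.
print(format_tenths(sum(durations)))  # 149.2
```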
Parent
Re:Curse of binary floating point (Score:5, Informative)
Well, in this specific instance a decimal system would have been ok, but it isn't a general answer. The general answer is "make sure your increments are exactly representable in your number base". If they had used 1/8 or 1/16 of a second, or even 3/32 of a second, as their timer increment then they would not have had this problem. There's no reason why 1/10th of a second has any magic properties.
In general terms, all number bases have other number bases with which they are incompatible. The inability of binary to represent 1/10 accurately is just the same as the inability of decimal to represent 1/3 accurately. It's only because we use decimal all the time that we overlook decimal's shortcomings (or instinctively compensate for or avoid them) and then blame computers for binary's incompatibility with decimal.
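The rule can be checked mechanically: a fraction has a finite expansion in a given base exactly when every prime factor of its reduced denominator also divides the base. A rough sketch (the function name `terminates` is mine):

```python
from fractions import Fraction
from math import gcd

def terminates(value: Fraction, base: int) -> bool:
    """True if `value` has a finite expansion in `base`."""
    d = value.denominator           # Fraction is always in lowest terms
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:           # strip factors shared with the base
            d //= g
        g = gcd(d, base)
    return d == 1                   # nothing left that the base lacks

for frac in (Fraction(1, 10), Fraction(1, 8), Fraction(3, 32), Fraction(1, 3)):
    print(frac, "binary:", terminates(frac, 2), "decimal:", terminates(frac, 10))
```

So 1/10 fails in binary for the same reason 1/3 fails in decimal, while 1/8 and 3/32 are fine in both.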
Parent
Re:Curse of binary floating point (Score:5, Informative)
Fixed point never rounds when operating in the range and precision for which it is designed. In this case they needed a precision of .1, using INT/10 would be 100% accurate and never give them any rounding errors for this use case.
So, in other words: You are wrong, and should probably consider using fixed point more.
Parent
Re:Curse of binary floating point (Score:4, Informative)
With fixed point you can choose the base of the fraction part. A binary fixed point would not help them, but a decimal fixed point of /10 or /100 would. The algebra of fixed point is the same no matter what base you choose. This means it is the fastest way to get decimal-based fractions instead of binary fractions (decimal floating point is best with hardware support).
Parent
Fixed point numbers? (Score:5, Insightful)
Computers don't suck at math, those programmers do. You can get any precision mathematics on even 8-bit processors, and most of the time compilers will figure out everything for you just fine. If you really have to use 24-bit counters with 0.1s precision, you *know* that your timer will wrap around every 466 hours, so just issue a warning to reboot every 10 days or auto reboot when it overflows.
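The 466-hour figure is easy to check, and the reboot warning is a one-line comparison. A sketch, assuming an unsigned 24-bit tick counter; `needs_reboot_warning` and the constants are hypothetical names, not anything from the real system.

```python
COUNTER_BITS = 24
TICKS_PER_SECOND = 10
WARN_AFTER_DAYS = 10                      # threshold suggested in the comment

WRAP_TICKS = 2 ** COUNTER_BITS            # counter rolls over at this count
WARN_TICKS = WARN_AFTER_DAYS * 24 * 3600 * TICKS_PER_SECOND

wrap_hours = WRAP_TICKS / TICKS_PER_SECOND / 3600
print(f"wraps after about {wrap_hours:.0f} hours")

def needs_reboot_warning(ticks: int) -> bool:
    """True once uptime reaches the warning threshold, well before the wrap."""
    return ticks >= WARN_TICKS

print(needs_reboot_warning(3_600_000))    # 100 hours of uptime
```

Ten days is roughly half the wrap period, so the warning fires with plenty of margin.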
Stupid article, too (Score:5, Insightful)
Translation: computers are only as smart as the people programming them... and there's plenty of stupid people out there.
We knew this. This is no great revelation. So why is this news?
What?! (Score:5, Insightful)
All they had to do is use integers, where a value of 1 represents 0.1 s.
don't blame the computer for bad programming (Score:5, Insightful)
There certainly are cases of bad math in computers, particularly Intel computers. But this isn't such an example. This is just a lazy and stupid programmer who didn't understand what he was really doing who should take the blame for the failure that killed people, not the computer.
This problem has been solved since the 1960s (Score:4, Informative)
I remember this from a numerical methods class in the 1980s. To deal with situations like this, you can do one of three things :
a) Have a function that you sample as a function of t, so you don't get accumulated error.
b) Have enough bits so that error won't be an issue. This is actually hard to do because floating point errors do stack up pretty quick if you are not careful.
c) Or, you can have an error term which you can use to make adjustments along the way to account for a lack of precision. Bresenham's line does that more or less exactly when he does his lines. That's why you had "stair stepping" as the algorithm corrected itself along the way.
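Option (c) can be sketched in a few lines. This assumes time is kept in units of 2^-23 seconds (my assumption, chosen so that 0.1 s is not a whole number of units) and carries the leftover in an integer error term, Bresenham style. It is an illustration of the technique, not a claim about what the Patriot code did.

```python
SCALE = 2 ** 23                    # assumed: time kept in units of 2**-23 s
STEP, REMAINDER = divmod(SCALE, 10)  # 0.1 s = STEP units plus REMAINDER/10 of a unit

def naive_units(ticks: int) -> int:
    """Add the truncated step every tick and ignore the leftover."""
    return ticks * STEP

def compensated_units(ticks: int) -> int:
    """Carry the leftover in an error term and add a unit when it overflows."""
    acc = 0
    err = 0
    for _ in range(ticks):
        acc += STEP
        err += REMAINDER
        if err >= 10:              # a whole unit of error has built up
            acc += 1
            err -= 10
    return acc

ticks = 360_000                    # 10 hours of 0.1 s ticks
true_seconds = ticks / 10
print(true_seconds - naive_units(ticks) / SCALE)        # drift grows with uptime
print(true_seconds - compensated_units(ticks) / SCALE)  # stays under one unit
```

The occasional extra unit is the timer equivalent of the "stair step" in a Bresenham line.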
If the OP was correct, then PATRIOT failed because it did none of them. My bet is in reality, they simply underestimated the actual error term, but did everything else correctly. This could be because of discrepancies in flight control instrumentation or some sensor, or they were simply trying to save money on bits and didn't really do the calculation as to how far the missile could be off in an error term length seconds of flight at a particular phase in its flight profile.
Bottom line is, the engineering discipline exists to solve this problem and is really no different than error handling in any guidance system. Putting a man on the moon, launching an ICBM at a target, shooting down a missile, are all essentially the same computer science problem from an error management perspective. The PhDs already nailed this decades ago. There's not a fundamental limitation to computing in this case, merely a failure or inability of engineers on this project to apply the correct known answer to this problem.
Ridiculous. Patriots always win. (Score:4, Funny)
Look, you guys can talk trash all you want, but when you say this:
>>Patriot defense system failing to take down a Scud missile attack
You're just lying to yourself. The Patriots defense is awesome this year. I mean, was there really ANY point for the Titans offense to show up a couple of weeks back?
And the Scuds? C'mon man. They let go their best man two seasons ago. The QB can't hit the broadside of a barn and their entire wide-receiver corp has Jello hands anyway. The missile attack is a gadget play, pure and simple. Belichick sees right through that and you know it.
Haters need to stop all the hatin' and get on the Pats bus!!!!! GO PATRIOTS!
Seriously flawed reporting (Score:4, Interesting)
There's no way a real-time missile tracking system is going to be dealing with time at an accuracy of 0.1 sec.
A Patriot missile travels at about Mach 3 (~1000 m/sec) so a rounding error of 0.05, even without any error accumulation, means you'd be off by 50m in position.
Who knows what the real story is vs the garbage that was reported, but even if there was a cumulative error that's the fault of the programmer rather than a lack of a computer's ability to do math. You do your error analysis and use whatever accuracy is needed to keep the errors in a tolerable range.
The part about the system running for 100 hours was pure gibberish. Yes, we can all divide that by 0.1 sec, but what on earth does that have to do with a real-time tracking system tracking a target it acquired a few minutes ago?!
A better title for the story rather than "computers can't do math" would be "we can't do tech reporting".
Paid with their lives (Score:5, Insightful)
Re:Didn't read TFA but... (Score:4, Interesting)
because military computers are 20 years out of date to start with. Heck, even the awesome modern Land Warrior hardware is 10 years out of date tech-wise. Heck, they could probably shave 5 pounds off of the hardware by using modern chips and displays.
Military spec is only good at being rugged. When it comes to being up to date, it is far behind the best.
Parent
Re:retrospective technological excuses (Score:5, Insightful)
You're right. Just as the failure of Samuel Langley's aircraft demonstrated that man would never fly, the failure of an anti-aircraft missile to destroy more than half of the ballistic missiles (targets moving at what, twice the speed of the targets it was designed to destroy?) demonstrates that ABMs will never work.
Parent
Re:And this is why... (Score:5, Insightful)
It's the reporting that's garbage. It makes no sense at all. A system tracking missiles travelling at Mach 3 is keeping track of time to 0.1 sec accuracy?! Do you really believe that? Wanna buy a bridge?
0.1 sec at Mach 3 is 100m, so you wouldn't have a hope in hell of ever hitting a 3m long target.
The problem isn't the people working for the defence company, who are hard-core PhDs with some very serious domain knowledge. The problem is people like yourself who are so math illiterate as not to be able to fact check a piece-of-shit story!
Parent
Re:And this is why... (Score:4, Interesting)
I could see designing the system to synchronize both launch times and observations with a timer tick (it wouldn't be surprising if the whole system was driven by the timer interrupt), and then you're not going to have an error due to the spacing between ticks.
I am a bit more dubious about the 24 bit thing, though. Was it fixed-point or floating-point?
I don't think it was a float. What would that be? Maybe 16 bit mantissa, 1 bit sign and 7 bit exponent would seem to be the likeliest bet for a 24 bit float. If so, then after about two hours doing t += 0.1 would stop changing t, and the error would be much bigger.
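The "about two hours" claim can be simulated. The sketch below assumes a toy format that keeps 16 significant bits and rounds to nearest; exponent range and sign are ignored, and `round_to_sig_bits` is a made-up helper, so treat it as a rough model of the hypothetical 24-bit float, not a real one.

```python
import math

SIG_BITS = 16   # assumed significand width of the hypothetical 24-bit float

def round_to_sig_bits(x: float, bits: int = SIG_BITS) -> float:
    """Round x to `bits` significant binary digits (round to nearest)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scaled = round(m * (1 << bits))   # keep only `bits` bits of the significand
    return math.ldexp(scaled, e - bits)

def stall_point(step: float = 0.1) -> float:
    """Repeat t += step in the toy format until t stops changing."""
    t = 0.0
    while True:
        nxt = round_to_sig_bits(t + step)
        if nxt == t:
            return t
        t = nxt

stalled = stall_point()
print(stalled, "seconds, i.e. about", round(stalled / 3600, 2), "hours")
```

Once the spacing between representable values exceeds twice the step, adding 0.1 rounds straight back to the old value, which is the stall described above.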
So presumably it was fixed point. But if you're doing it fixed point, instead of storing x, you store nx in an int, for some appropriate scaling factor n. But if you're going to do that, surely you'll choose n in a smart way, and in this case the obvious choice, as pointed out by many posters, is n=10. This is not only the obvious choice because it gets you more precision, but it's the obvious choice because the easiest, most obvious and most standard way of coding timers is to just increment a register with each tick. It would be silly, for instance, to let n=2^8, and then increment a register with 0.1*2^8 = 25.6, rounded to 0x1A. It would be a very unlikely assembly language programmer who would have put an add reg,1Ah opcode in interrupt handler code when inc reg would have worked.
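For what it's worth, the summary's 0.3433 second figure falls out if you assume an exact integer tick count multiplied by a constant 0.1 that was chopped (not rounded) to 23 bits after the binary point. That format is my assumption for the sketch; nothing in this thread confirms it.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumed bits after the binary point in the register
TICKS = 3_600_000         # 100 hours of 0.1 s ticks, from the summary

scale = 2 ** FRACTION_BITS
chopped_tenth = Fraction(scale // 10, scale)   # 0.1 truncated to 23 fractional bits
error_per_tick = Fraction(1, 10) - chopped_tenth

total_error = TICKS * error_per_tick           # exact rational arithmetic
print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"after {TICKS} ticks: {float(total_error):.4f} s")
```

Under that reading the tick counter itself is fine; the error sits entirely in the conversion constant and scales with uptime.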
Now maybe at some point the timer value would get converted to a float for computations. But that surely wouldn't be a 24-bit float.
So maybe the article has mangled things and it was not a 24-bit register, but a 32-bit float, with 24-bit mantissa, 7 bit exponent and 1 bit sign, and the "24" in the article came from the mantissa. That's a much more realistic choice. Still, the standard way to handle timers is to just increment a timer variable. So what I could see happening is this. There is a timer system variable t at full 0.1 second precision incremented on interrupt. (That's how PCs used to work--maybe still do--except the timer resolution was 1/30 sec.) Then for their launch calculations, they do: (float32)t / 10. And now they're going to get nasty roundoff errors as the mantissa gets filled up. At the 36 hour point, t is already about 21 bits long. So when you do a float divide by 10, you'll certainly have roundoff problems. But you're still not going to be more than one tick (0.1 sec) off, because each tick still adjusts the mantissa, while the article says they were about 0.34 seconds off.
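A rough check of the "not more than one tick off" argument, using `struct` to round a Python double to the nearest IEEE 754 single. This models the (float32)t / 10 idea only loosely (the division is done in double and then rounded), and `to_float32` and `worst_error` are names I made up.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def worst_error(first_tick: int, last_tick: int) -> float:
    """Largest gap between t/10 and its float32 rounding over a tick range."""
    return max(abs(to_float32(t / 10) - t / 10)
               for t in range(first_tick, last_tick))

# Around 100 hours of uptime the tick count still fits in 24 bits exactly.
print(to_float32(3_600_000.0) == 3_600_000.0)

# Seconds near 360000 are spaced 2**-5 apart in float32, so the rounding
# error stays at or below 2**-6 s, far less than one 0.1 s tick.
print(worst_error(3_599_000, 3_600_000))
```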
So I think something got mangled in the article. Or we had a really unlikely assembly language programmer who had floating point code executed with every tick of a timer interrupt. But even if the interrupt is only at 10 Hz, that's just completely contrary to the instincts of an assembly language programmer. And this would have been done back in the heyday of assembly language programming, when one would try to optimize every clock cycle one could. (And, yes, I've worked with timer interrupt handlers, both on the Z80 and the 8086.)
Parent