I think the fact that 0. FP math is a niche requirement; you should have to explicitly request it from the language. JorgWMittag, you have a good point there. Perhaps the "non-intuitive" behavior of FP is because we write the numbers as decimals in our source code -- if we had to write them out as binary numbers with a binary point, there would be nothing "non-intuitive" about them.
I'm not seriously suggesting that would ever be a good idea, of course. Because sometimes even a large fixed-width integer won't give you enough range. Actually, a lot of audio signal processing is done with integer DSPs. For speech, fixed-point sampling and integer processing are quite sufficient.
Indeed, there are a lot of embedded devices which use integer DSPs, and a lot of algorithms can be quite effectively recast in integer arithmetic. The dedicated devices frequently have to do a LOT more processing, in a limited amount of time, than the "general computing platforms". The dedicated devices generally have far lower power and cooling budgets.
Last time I looked, you couldn't put an Intel flagship processor in a pocket cellphone, because of the power and cooling requirements. I don't use a single floating point number in any of my audio software because it's lossy, and frankly I don't know of anyone that does. @MarcusJ: Floating point in itself is not lossy; it's the computations or the sampling that are lossy, and they are lossy both for floating point and for int.
Avoiding floats does not save you from losing precision. If you want to have coefficients, floats are just better. For example, in science and finance, floats are not really liked because there can be data loss.
Floats just have their applications; they have pros and cons. Why should 0. Why do you think that integers do not have a dedicated place on the CPU? Floats are tricky because there is a mantra and an exponent. And maybe other reasons. Historically, floating point was not a native programming option; it was one of the things Microsoft implemented in its OS so it was better for developers. Just a matter of abstraction. And of course they are not a programming technique, but a type of data.
You do know that nowadays there are no separate FPUs on commodity x86 systems anymore? Please review the Wikipedia entry for floating point. Side note: also, it's "mantissa", not "mantra". Single-precision floating-point numbers are not enough even for applications that don't require precise calculations, like animation and games. Single-precision floating-point numbers are nearly unusable for anything requiring higher precision, like physical simulation, even in games.
It frees the developer from having to actually THINK so he can just sit in his cubicle and serve the whine! You want smooth and realistic, you do what it takes to get there! You want fast and accurate, you do what it takes to get there! Share and enjoy. @BobJarvis: not sure if you're kidding around, but the above answer is a good one. A double is just a larger floating-point number.
Yes, you can delineate a greater range with a floating-point number given the same memory size, but that range comes at a loss of precision. If the biggest number I need fits in a fixed-width int, why would I use a float? Ark-kun: Discrete division is a different operation from real-number division. Dividing 100 by 7 yields a quotient of 14 and a remainder of 2, absolutely precisely. The fact that some programming languages use the same operator for both operations doesn't make discrete division "inaccurate".
Why do double and float exist? Dupe: stackoverflow. Some great answers at that link — JoshBerke.
Chart 3 illustrates the flows involved in check payments. Continuing the earlier example, firm A knows that if it gives firm B a check when it takes delivery of goods, it will take some time before its account at bank X is debited.
In the meantime, firm A will continue to have the use of the balances that will eventually be used to settle the payment obligation. The effect of payment patterns, including both disbursements and receipts, is an important part of cash management.
Accordingly, as discussed below in the section on cash management, banks have begun to offer rather sophisticated payment services to their corporate customers that are designed to minimize the idle balances held to fund payments. As in the previous example, a number of levels of float are generated when debit instruments are used.
Essentially, bank Y is making an interest-free short-term loan to firm B during the time it takes to clear the check. In this case bank Y, the bank at which the check has been deposited, will be quick to advise the central bank to credit its nostro account. If the central bank provides credit before it debits the account of the bank on which the check is drawn, in this example bank X, the central bank will be creating debit float.
The central bank debit float created by this practice increases the reserves of the banking system. The central bank is effectively granting the commercial banking system as a whole a subsidy in the form of an interbank loan. This subsidy is ultimately paid by the taxpayer because it reduces the earnings of the central bank. In relatively small countries where distances between processing centers are not great, transportation delays should not lead to major problems with debit float. But in larger countries with many processing centers and where checks are transported over long distances, several days can elapse between the posting of credit and debit entries.
For instance, a central bank branch in one part of a country could credit the account of a bank in its region and then send details of the transaction to another of its branches or processing centers or directly to a commercial bank thousands of miles away. In the United States, in particular, a great deal of careful design and execution has been devoted to the transportation aspects of check processing to meet the challenges posed by vast distances.
Moreover, as discussed in Chapter 8, new methods whereby paper checks are converted into electronic instructions, called check truncation, are being more widely used. The debit float generated through the processing of debit payments has different effects from the float generated by credit payments. Also, commercial banks may gain float at the expense of the central bank. Table 1 summarizes the types of float generated as a result of using different types of instruments and by account relationship.
Why is float important? Key issues include the distortions that float can cause to the incomes of economic actors and the problems that it can cause for the implementation of monetary policy by making it more difficult to assess the demand for and supply of bank reserves.
The existence of float means that one of the parties to a payment transaction—an enterprise or individual bank customer, a commercial bank, or the central bank—is either granting or receiving free or subsidized credit. Who gains and who loses payment float depends on how payment instructions are cleared and whether credit or debit instruments predominate. Clearly, float effects are potentially greater in a paper-based system, in which processing and transportation delays are potentially lengthy, than in an electronic system, in which such delays should be much shorter.
The value of float can be substantial. Float value is calculated by determining the return on investment of funds during the period that the float exists. The higher the market rate of interest and the longer the float time, the greater is the value of the float.
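To illustrate with assumed figures (they are not taken from the text): a payment of 1,000,000 delayed for three days when the market rate of interest is 6 percent a year is worth roughly 1,000,000 × 0.06 × 3/365, or about 490, to whoever holds the funds during the delay. Scaled across the millions of payments a system handles each day, this is why float attracts so much attention.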
Because float has value, it influences choices made by payment system participants regarding the type of instruments they use and the processing options they follow. By definition, float is a zero-sum game, that is, total float gains exactly offset total losses. In this sense, the social costs of float might be thought of as being zero. But the income redistribution effects resulting from float are arbitrary and unlikely to be in any sense optimal. The damage that float can do to the reputation of the banking system as a whole is well recognized.
To minimize these costs and help ensure an efficient payment system and maintain public confidence, participants in most mature payment systems have agreed to rules governing the minimum times within which payments must be delivered and processed and funds made available to payees.
Because of their responsibility for the safe and efficient operation of the payment system, central banks usually play an important role in setting these rules. Even if such rules exist, attempts to exploit float can lead to increased credit, liquidity, and fraud risks for participants. Exploitation of float can also increase the difficulty of implementing monetary policy. Accordingly, from a public policy standpoint, float is undesirable and should be minimized.
Good banking practice requires that lenders have the ability to assess, and that they actually do assess, the creditworthiness of borrowers. As suggested above, however, payment system inefficiencies can result in commercial banks supplying credit to their customers under operational circumstances that make it difficult for the credit assessment to be made.
Moreover, banks themselves can use central bank credit that results from the operation of the payment system in a manner that prevents careful assessment by the central bank of its counterparty credit risks see Chapter 7 for a discussion of these risks.
Further, delays in settlement, especially unanticipated delays, can cause liquidity problems for payment system participants who expect payment as a result of legitimate transactions in the marketplace but whose receipt of value is delayed by payment system inefficiencies.
Unfortunately, circumstances may make it possible for payment system participants to manipulate the payment system in order to generate float, reducing its efficiency and thereby causing liquidity problems for other participants. Clearing and settlement procedures that generate large volumes of float, particularly debit float, can increase the risk of two important types of fraud against the banking system.
When bank customers use debit instruments, such as checks, to move funds between accounts, they can gain debit float at the expense of the banking system; the classic abuse of this kind is check kiting. Kiting is accomplished by holding a series of bank accounts, usually in different banks in a variety of distant locations, and artificially multiplying deposits in these accounts by writing and redepositing checks between the accounts.
This is done with the knowledge that the checks will take some time to clear, thus increasing temporarily the balances in the accounts if the banks at which the deposits are made provide funds based on the deposit but before the checks clear. Check kiting is an overt manipulation of the payment system that can result in two types of fraud. First, by definition, kiting results in banks unintentionally providing credit to the entity operating the kiting scheme, which results in loss of income to the banks, as the customer uses the funds for investment purposes.
Improved processing and the application of availability schedules discussed below can address this type of fraud. Also, a second form of fraud can involve the theft of principal, if the customer does not intend to pay the checks. To help protect against this form of kiting-related fraud, usual banking practice is to grant provisional credit for check deposits, that is, funds may not be withdrawn until the bank is confident that the check can be collected.
Variability in the delivery and processing of payments can mean that float fluctuates widely. One consequence is that bank reserves can also fluctuate widely, making it difficult for the central bank to estimate the day-to-day demand for reserves.
This, in turn, adds uncertainty to the execution of open market operations. If the central bank can predict accurately the inflows of funds to and outflows of funds from the commercial banking system in connection with the operation of the payment system, it can do a better job of hitting monetary policy targets, particularly short-term interest rate targets. Given good information flows, and an efficient market, central banks can generally target short-term interest rates quite accurately.
An efficient payment system adds a degree of stability to the setting of monetary policy. The earlier discussion outlined how payment system float is generated and pointed out some of the effects that float can have on payment system efficiency. Rules and procedures are necessary to establish and enforce performance standards for payment system participants to minimize float. Such rules and procedures should address the particular types of delays that can lead to float. All of the operational causes of float discussed below are relevant whether debit or credit payments are being processed.
Inefficiencies are more likely to occur in connection with paper-based processing than with electronic processing. Nonetheless, even electronic systems can generate considerable float, especially when they are not fully integrated with the bank accounting systems used to post customer accounts. Four major causes of float are discussed below: posting procedures, transportation, holdovers and backlogs of payments, and processing errors. Posting float occurs, for example, when a bank gives credit at the time its customer deposits a check but before it receives credit for the check from the paying bank.
That is, the bank receiving the check does not take into consideration the time that is normally needed to present the check and receive payment. This type of float can be reduced by adjusting posting procedures using deferred settlement and availability schedules, as discussed in detail below. Transportation float occurs because paper payments must be transported between the various participants in the payment system.
Transportation delays can be considerable in large countries where payment instruments are transported over slow transportation networks. This float can be reduced by using dedicated air and ground-based transportation networks designed to expedite the movement of value. Transportation float can also be reduced by utilizing electronic delivery, especially for large-value payments. Electronic delivery helps ensure same-day delivery, processing, and posting of the largest-value payments.
In most countries with well-developed financial markets, a relatively small number of large-value payments account for a very high proportion of the total value of payments. If these payments can be completed within one day, float can be significantly reduced. Especially if the volume of payments is not high, the technology required for an electronic large-value transfer system need not be complex or unduly expensive.
Float arising in connection with these largest-value payments can be eliminated using properly controlled telephone, telegraphic, or computer-to-computer techniques to transfer funds. As discussed in Chapter 6 , developed economies rely on specialized large-value transfer systems, an important feature of which is that they do not generate float.
Holdover float is generated when a payment is only partially processed during a business day. Holdover occurs when a commercial bank debits its customer for a payment order or gives credit for a check deposited but does not complete processing and forward the payment by the end of that business day. Backlog float is similar to holdover float. In backlog float, however, the payments are not even partially processed.
Rather, processing is delayed, as is accounting, owing to backup in workloads. Holdover and backlog float can be avoided if sufficient processing capacity is available to process the volume of payments received on a same-day basis. These resources need to be flexibly managed to efficiently handle low-, average-, and high-volume days. Processing error float is created when errors occur during the handling of payments, including accounting for payments.
For example, errors can result from payments being sent to the wrong bank, lost in transit between banks, or recorded in the wrong amount. During the time it takes to detect and resolve such errors, float is generated. Careful monitoring of work quality can reduce handling errors and processing error float. When errors do occur, an effective and timely error correction process will help contain the float that is generated by shortening the time needed to correct the error.
To be effective, procedures for handling errors must be published, accepted, and used. Bank compensation rules commonly provide incentives for the speedy resolution of processing errors that arise in the payment system. Although it would be ideal if the payment system functioned perfectly and all processing occurred in a timely, error-free manner, this is not a practical goal.
At some point, the costs of reducing float by incurring added processing and administrative costs will outweigh the benefits. Nonetheless, float can be significantly reduced by synchronizing relevant accounting entries.
Float can be reduced by the use of funds availability schedules. The purpose of availability schedules is to synchronize the accounting performed for both sides of a payment. These methods essentially tie the timing of accounting for the payment to the timing of the physical handling of the payment.
The timing of the accounting entries is known as an availability schedule. Although use of an appropriately designed availability schedule will reduce float, this method does not improve the speed or reliability of the payment system. Indeed, use of availability schedules, at least in debit payments, may diminish incentives for banks to improve the timeliness of the payment process.
A detailed description of how an availability schedule is calculated and how it would be applied in daily processing is given in the appendix. The example is based on a payment system that relies on central bank processing centers and paper-based credit payments. The example in the appendix raises a number of key issues that need to be addressed when availability schedules are being designed, including the proportion of local and interregional payments in the mix of total payments and the average transportation times for payments sent to and received from various destinations.
The appendix makes clear that designing an availability schedule involves a trade-off between the goal of eliminating float and additional procedural and operational complications. Availability schedules should not be overcomplicated so that their use requires an undue amount of time and resources.
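As an illustration only (the proportions and clearing times below are assumed, not taken from the appendix): if 70 percent of the checks presented at a processing center are local items that clear in one day and 30 percent are interregional items that take three days, a simple weighted schedule would defer availability by 0.7 × 1 + 0.3 × 3 ≈ 1.6 days, rounded in practice to one or two business days depending on the destination. The more finely the schedule distinguishes destinations, the less float remains, but the more complicated daily processing becomes.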
The analysis of float presented here suggests that enterprises, commercial banks, the central bank, and even individuals can increase income by carefully managing their payment flows. Similarly, the opportunity costs of failing to manage temporary cash balances effectively can be high. This has not escaped the notice of the treasurers of large corporations in particular.
As pressures on firms to minimize costs and maximize revenue intensify and as financial markets spawn new, convenient short-term investments with low transaction costs, corporate treasurers have found more sophisticated ways to avoid the cost of, or even to enjoy the benefits from, payment system float.
Competitive pressures have forced commercial banks to offer their larger customers—both corporations and correspondent banks—a range of services to help them manage their cash balances more efficiently and profitably. Although not treated here, some retail banking products, such as overdraft protection for transaction accounts, also offer similar services to consumers. The types of cash management services offered by banks to their customers fall into three main classes: 1 cash concentration; 2 disbursement; and 3 investment.
Many firms need to hold accounts that serve a variety of functions. These accounts are often held in different locations and at different banks. Banks offer services to permit their customers to manage funds held in several accounts easily and efficiently. These concentration services help customers avoid overdrafts and minimize transaction costs of transferring funds between accounts.
In this way, corporate treasurers can focus on managing the balances in a single account without having to worry about intra-firm funds transfers. Banks can also help firms to improve the timing of payment flows, which is crucial to the management of cash balances. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages. How important is it to preserve the property that x = y if and only if x - y = 0 (relation (10) below)? Without it, a guard such as "if x ≠ y then compute 1/(x - y)" can still divide by zero when x - y flushes to zero. Tracking down bugs like this is frustrating and time consuming.
On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend.
Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything. The IEEE standard uses denormalized numbers, which guarantee (10), as well as other useful relations.
They are the most controversial part of the standard and probably accounted for the long delay in getting approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.
The exponent emin is used to represent denormals. More formally, if the bits in the significand field are b1, b2, ..., b(p-1), then a normalized number has the value 1.b1 b2 ... b(p-1) × 2^e, whereas a denormal, which uses the minimum exponent, has the value 0.b1 b2 ... b(p-1) × 2^emin. With denormals, x - y does not flush to zero but is instead represented by a denormalized number. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number. If the result of a floating-point calculation falls into this gulf, it is flushed to zero.
The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The "gulf" is filled in, and when the result of a calculation is less than the smallest normalized number, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of the base β.
Without denormals, the spacing abruptly changes from β^(emin - p + 1) to β^emin, which is a factor of β^(p - 1), rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.
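A small C sketch of the guarantee (10) near the underflow threshold; FLT_MIN and nextafterf are simply a convenient way to construct two adjacent numbers there, and the demonstration assumes the default IEEE gradual underflow rather than a flush-to-zero mode.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* With gradual underflow, x != y guarantees x - y != 0 even when both
   numbers sit just above the underflow threshold, because the difference
   lands on a denormal instead of being flushed to zero.  (A flush-to-zero
   mode, e.g. one enabled by aggressive compiler flags, breaks this.) */
int main(void) {
    float x = FLT_MIN;                    /* smallest normalized float   */
    float y = nextafterf(FLT_MIN, 0.0f);  /* largest denormal below it   */
    float d = x - y;

    printf("x != y    : %s\n", x != y ? "true" : "false");
    printf("x - y     : %g\n", d);        /* smallest denormal, not 0    */
    printf("denormal? : %s\n",
           (d != 0.0f && fabsf(d) < FLT_MIN) ? "yes" : "no");
    return 0;
}
```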
Large relative errors can happen even without cancellation, as the following example shows [Demmel]. Consider computing the complex quotient (a + ib)/(c + id). The obvious formula, (ac + bd)/(c^2 + d^2) + i(bc - ad)/(c^2 + d^2), can lose all accuracy near the underflow threshold, because c^2 + d^2 can underflow even when the quotient itself is unexceptional. A better method of computing the quotients is to use Smith's formula, which rescales by the ratio d/c or c/d (whichever has magnitude at most 1) before dividing; a sketch appears below. With gradual underflow it yields an accurate quotient in cases where the obvious formula with flush to zero does not. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to the smallest normalized number, 1.0 × 2^emin.
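The sketch below gives Smith's formula in its commonly quoted form; the specific numbers in the demonstration are chosen here for illustration and are not taken from Demmel's example.

```c
#include <math.h>
#include <stdio.h>

/* Smith's formula for the complex quotient (a+ib)/(c+id): scale by d/c or
   c/d (whichever ratio has magnitude <= 1) so that nothing like c*c + d*d
   is ever formed, avoiding the premature underflow/overflow of the
   textbook formula. */
void smith_div(double a, double b, double c, double d,
               double *re, double *im) {
    if (fabs(d) <= fabs(c)) {
        double r = d / c;              /* |r| <= 1 */
        double den = c + d * r;
        *re = (a + b * r) / den;
        *im = (b - a * r) / den;
    } else {
        double r = c / d;              /* |r| < 1 */
        double den = c * r + d;
        *re = (a * r + b) / den;
        *im = (b * r - a) / den;
    }
}

int main(void) {
    double re, im;
    /* Near the underflow threshold the textbook formula would compute
       c*c + d*d as zero; Smith's formula still returns the quotient. */
    smith_div(2e-300, 1e-300, 4e-300, 2e-300, &re, &im);
    printf("%g + %gi\n", re, im);      /* prints roughly 0.5 + 0i */
    return 0;
}
```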
When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do.
When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are "sticky" in that once set, they remain set until explicitly cleared.
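For concreteness, here is a sketch of how the sticky flags look through the C99 fenv.h interface; it assumes the implementation supports floating-point exception flags and honors the FENV_ACCESS pragma.

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* The flags stay set ("sticky") after the operations that raised them,
   until they are explicitly cleared. */
int main(void) {
    feclearexcept(FE_ALL_EXCEPT);          /* start with all flags clear */

    volatile double x = 1.0, y = 0.0;
    volatile double z = x / y;             /* raises divide-by-zero, z = +inf */
    (void)z;

    if (fetestexcept(FE_DIVBYZERO))
        printf("divide-by-zero flag is set\n");

    volatile double w = 2.0 * 3.0;         /* an unexceptional operation... */
    (void)w;
    if (fetestexcept(FE_DIVBYZERO))
        printf("...and the flag is still set until cleared explicitly\n");

    feclearexcept(FE_DIVBYZERO);           /* now it is clear again */
    return 0;
}
```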
Sometimes continuing execution in the face of exception conditions is not appropriate. For this reason, the IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation.
It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.
The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true.
The inexact exception is raised when the result of a floating-point operation is not exact. Binary to Decimal Conversion discusses an algorithm that uses the inexact exception. There is an implementation issue connected with the fact that the inexact exception is raised so often.
If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive.
This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system.
When a user resets that status flag, the hardware mask is re-enabled. One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. There is a more interesting use for trap handlers that comes up when computing products such as x1 x2 ... xn that could potentially overflow.
One solution is to use logarithms, and compute exp(log x1 + log x2 + ... + log xn) instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression x1 x2 ... xn, even if there is no overflow. A better solution uses a trap handler. The idea is as follows. There is a global counter initialized to zero. Whenever the partial product pk = x1 x2 ... xk overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. Similarly, if pk underflows, the counter is decremented, and the negative exponent gets wrapped around into a positive one.
When all the multiplications are done, if the counter is zero then the final product is pn. If the counter is positive, the product overflowed; if the counter is negative, it underflowed.
If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost.
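Trap handlers themselves are hardware- and operating-system specific, but the same bookkeeping can be sketched in portable C by tracking the excess exponent explicitly with frexp and ldexp. This is an analogue of the counter technique, not the trap-based mechanism itself.

```c
#include <math.h>
#include <stdio.h>

/* Compute x[0]*...*x[n-1] without overflow or underflow by keeping the
   partial product in [0.5, 1) and accumulating the excess binary exponent
   in a separate integer, in the spirit of the exponent-wrapping counter
   described above. */
double scaled_product(const double x[], int n, long *exp_out) {
    double mant = 1.0;
    long e = 0;
    for (int i = 0; i < n; i++) {
        int k;
        mant = frexp(mant * x[i], &k);   /* mant in [0.5,1), value = mant*2^k */
        e += k;
    }
    *exp_out = e;
    return mant;                         /* true product = mant * 2^e */
}

int main(void) {
    double x[] = { 1e300, 1e300, 1e-250, 1e-250, 4.0 };
    long e;
    double m = scaled_product(x, 5, &e);
    /* The in-range final value can be reconstituted with ldexp. */
    printf("product = %g * 2^%ld = %g\n", m, e, ldexp(m, (int)e));  /* 4e100 */
    return 0;
}
```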
IEEE specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by 2^α, and then rounded to the relevant precision. For underflow, the result is multiplied by 2^α. The exponent α is 192 for single precision and 1536 for double precision. This wrapping is what brings the overflowed partial products in the example above back into a representable range.
In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary-decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest; the standard also requires three other rounding modes, namely round toward 0, round toward +∞, and round toward -∞. One application of rounding modes occurs in interval arithmetic (another is mentioned in Binary to Decimal Conversion).
When two intervals are added, the lower endpoint is computed with the rounding mode set toward -∞ and the upper endpoint with the rounding mode set toward +∞; the exact result of the addition is then guaranteed to be contained within the resulting interval. Without rounding modes, interval arithmetic is usually implemented by computing the endpoints as (x + y)(1 - ε) and (x + y)(1 + ε), where ε is machine epsilon.
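A sketch of interval addition using the directed rounding modes through C99 fenv.h; it assumes FE_DOWNWARD and FE_UPWARD are available and that the compiler honors FENV_ACCESS, so aggressive optimization may need to be restrained.

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* Add the intervals [alo, ahi] and [blo, bhi]; the result interval is
   guaranteed to contain the exact real sum of the arguments. */
void interval_add(double alo, double ahi, double blo, double bhi,
                  double *clo, double *chi) {
    int saved = fegetround();      /* remember the caller's rounding mode */
    fesetround(FE_DOWNWARD);
    *clo = alo + blo;              /* lower endpoint rounded toward -inf  */
    fesetround(FE_UPWARD);
    *chi = ahi + bhi;              /* upper endpoint rounded toward +inf  */
    fesetround(saved);             /* restore the original mode           */
}

int main(void) {
    double lo, hi;
    interval_add(0.1, 0.1, 0.2, 0.2, &lo, &hi);
    printf("[%.17g, %.17g]\n", lo, hi);   /* brackets the exact sum */
    return 0;
}
```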
Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation.
This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p.
If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size. The IEEE standard has a number of flags and modes.
As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use.
A more sophisticated example is discussed in the section Binary to Decimal Conversion. Consider writing a subroutine to compute x^n, where n is an integer; suppose a simple routine PositivePower(x, n) handles n > 0 by repeated multiplication. When n < 0, x^n can be computed either as PositivePower(1/x, -n) or as 1/PositivePower(x, -n); the first form multiplies quantities each of which carries a rounding error from the division 1/x. In the second expression these are exact (i.e., x itself), and only the final division contributes an additional rounding error. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, -n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if x^(-n) underflows, then x^n will either overflow or be in range.
The solution is for the routine to turn off the overflow and underflow trap enable bits and save the overflow and underflow status bits before the intermediate computation. If neither the overflow nor underflow status bit is set afterwards, it restores them together with the trap enable bits.
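A sketch of this bookkeeping using C99 fenv.h in place of raw trap-enable and status bits; positive_power is a hypothetical helper standing in for PositivePower, and the sketch shows only the save/clear/restore pattern, not the full exponent-wrapping recovery.

```c
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

/* Hypothetical helper corresponding to PositivePower: x^n for n >= 0
   by repeated squaring. */
static double positive_power(double x, unsigned n) {
    double r = 1.0;
    while (n) {
        if (n & 1) r *= x;
        x *= x;
        n >>= 1;
    }
    return r;
}

/* feholdexcept() saves the caller's floating-point state, clears the flags
   and (where supported) masks traps; feupdateenv() restores that state and
   re-raises whatever flags the computation legitimately produced. */
double power_int(double x, int n) {
    if (n >= 0) return positive_power(x, (unsigned)n);

    fenv_t saved;
    feholdexcept(&saved);                  /* save, clear flags, mask traps */
    double denom = positive_power(x, (unsigned)(-(long long)n));
    int intermediate_underflow = fetestexcept(FE_UNDERFLOW);
    double result = 1.0 / denom;           /* this is x^n */
    if (intermediate_underflow)
        feclearexcept(FE_UNDERFLOW);       /* spurious: x^n did not underflow */
    feupdateenv(&saved);                   /* restore state, re-raise the rest */
    return result;
}
```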
Another example of the use of flags occurs when computing arccos via the formula arccos x = 2 arctan(sqrt((1 - x)/(1 + x))). If arctan(∞) evaluates to π/2, then arccos(-1) correctly evaluates to 2 arctan(∞) = π because of infinity arithmetic; however, the intermediate division (1 - x)/(1 + x) sets the divide-by-zero flag even though arccos(-1) is not exceptional. The solution to this problem is straightforward. Simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation. The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions.
Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider a simple BASIC program that stores the quotient 3.0/7.0 in a variable and then tests whether the variable compares equal to 3.0/7.0 recomputed in an expression. This example will be analyzed in the next section. Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E.
This is hardly a cure-all because it raises as many questions as it answers. What should the value of E be? It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. As discussed in the section Proof of Theorem 4, when b^2 ≈ 4ac, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula.
By performing the subcalculation of b^2 - 4ac in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved. The computation of b^2 - 4ac in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result. In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds.
Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands. If an instruction that combines two single precision operands to produce a double precision product were only useful for the quadratic formula, it wouldn't be worth adding to an instruction set.
However, this instruction has many other uses. Consider the problem of solving a system of linear equations, Ax = b. Suppose that a solution x1 is computed by some method, perhaps Gaussian elimination.
There is a simple way to improve the accuracy of the result, called iterative improvement. First compute the residual r = A x1 - b (12). Then solve the system A y = r (13). Note that if x1 is an exact solution, then r is the zero vector, as is y. In general, r ≈ A(x1 - x), where x is the true solution, so y ≈ x1 - x, and an improved estimate for the solution is x2 = x1 - y (14). The three steps (12), (13), and (14) can be repeated, replacing x1 with x2, and x2 with x3. For more information, see [Golub and Van Loan]. When performing iterative improvement, r is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation.
Once again, this is a case of computing the product of two single precision numbers A and x 1 , where the full double precision result is needed. To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set.
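As a sketch, here is the residual of step (12) computed with double precision products and accumulation while the matrix, right-hand side, and approximate solution stay in single precision; the dimension and the names are illustrative.

```c
#define N 3   /* illustrative size */

/* r = A*x1 - b, with the products and the running sum carried in double
   precision.  The product of two floats converted to double is exact
   (24 + 24 significand bits fit easily in double's 53), which is exactly
   the "single x single -> double" capability discussed above. */
void residual(const float A[N][N], const float b[N], const float x1[N],
              double r[N]) {
    for (int i = 0; i < N; i++) {
        double acc = -(double)b[i];
        for (int j = 0; j < N; j++)
            acc += (double)A[i][j] * (double)x1[j];   /* exact product */
        r[i] = acc;
    }
}
```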
Some of the implications of this for compilers are discussed in the next section. The interaction of compilers and floating-point is discussed in Farnum [], and much of the discussion in this section is taken from that paper. Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point.
Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error. If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.
Remarkably enough, some languages don't clearly specify that if x is a floating-point variable (with, say, a value of 3.0/10.0), then every occurrence of (say) 10.0*x must have the same value. For example Ada, which is based on Brown's model, seems to imply that floating-point arithmetic only has to satisfy Brown's axioms, and thus expressions can have one of many possible values.
Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined. In the IEEE model, we can prove that (3.0/10.0)*10.0 evaluates to 3. In Brown's model, we cannot. Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point.
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. The importance of preserving parentheses cannot be overemphasized. The algorithms presented in theorems 3, 4 and 6 all depend on it. A language definition that does not require parentheses to be honored is useless for floating-point calculations.
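A two-line C illustration of why reassociation is not allowed; the values are chosen for illustration.

```c
#include <stdio.h>

/* With x = 1e30, y = -1e30, z = 1.0, the two parenthesizations give
   different answers in double precision, so an optimizer must not
   "reassociate" the expression. */
int main(void) {
    double x = 1e30, y = -1e30, z = 1.0;
    printf("(x + y) + z = %g\n", (x + y) + z);   /* 1 */
    printf("x + (y + z) = %g\n", x + (y + z));   /* 0: y + z rounds to -1e30 */
    return 0;
}
```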
Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y, is the product performed in single or double precision? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks.
First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants.
In the expression 0.1*x, is the constant 0.1 single or double precision? Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. The programmer will then have to hunt down and change every floating-point constant as well. The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided.
There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie]. This leads to anomalies like the example at the beginning of this section: the expression 3.0/7.0 is computed in double precision, but if q is a single-precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision differs from its stored value in single precision, so the comparison q = 3.0/7.0 fails. This suggests that computing every expression in the highest precision available is not a good rule.
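For concreteness, the same anomaly can be reproduced in present-day C, where the literal 3.0/7.0 is a double even though the variable is a float; this is a sketch for illustration, not the original BASIC listing.

```c
#include <stdio.h>

/* The quotient is computed in double precision, but storing it in a float
   rounds it a second time, so the stored value no longer compares equal to
   the expression (q is promoted back to double in the comparison). */
int main(void) {
    float q = 3.0 / 7.0;
    if (q == 3.0 / 7.0)
        printf("Equal\n");
    else
        printf("Not Equal\n");   /* printed: 3/7 is a repeating binary fraction */
    return 0;
}
```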
Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in the section Optimizers). If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable.
A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression.
However, this rule is too simplistic to cover all cases cleanly. A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree.
Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. Farnum presents evidence that this algorithm is not difficult to implement. The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded.
This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1, which are not exactly representable in binary.
Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen]. One definition might be to use the method shown in the section Infinity. For example, to determine the value of a^b, consider non-constant analytic functions f and g with the property that f(x) → a and g(x) → b as x → 0.
If f(x)^g(x) always approaches the same limit, then this should be the value of a^b. In a case like 1^∞, however, different choices of f and g approach different limits, so no single value suggests itself. However, the IEEE standard says nothing about how features such as its rounding modes, exceptions, and flags are to be accessed from a programming language.
Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines.
For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions although the recommended configurations are single plus single-extended or single, double, and double-extended.
Infinity provides another example. The IEEE values ∞ and NaN could be supplied through library functions or predeclared names, but that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.
A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful.
As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson ]. There are a number of minor points that need to be considered when implementing the IEEE standard in a language.
Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example, when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.
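C99's fmax follows exactly this convention, treating a NaN operand as missing data, as the following sketch shows.

```c
#include <math.h>
#include <stdio.h>

/* fmax(NaN, x) returns x, so a single missing data point does not poison
   the scale-factor computation. */
int main(void) {
    double data[] = { 1.5, NAN, 4.2, 2.0 };
    double m = -INFINITY;
    for (int i = 0; i < 4; i++)
        m = fmax(m, data[i]);     /* NaN entries are ignored */
    printf("max = %g\n", m);      /* 4.2 */
    return 0;
}
```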
Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can't use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.
Compiler texts tend to ignore the subject of floating-point. For example, Aho et al. mentions replacing x/2.0 with x*0.5, leading the reader to assume that x/10.0 can be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. Although the text does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important.
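The listing under discussion is not reproduced in this excerpt; the following is a typical version of the idiom, offered as a sketch (the variable names and the use of volatile are choices made here).

```c
#include <stdio.h>

int main(void) {
    volatile float eps = 1.0f, one_plus;   /* volatile: discourage the very
                                              "optimization" discussed next */
    do {
        eps = eps / 2.0f;
        one_plus = 1.0f + eps;
    } while (one_plus > 1.0f);
    printf("estimated machine epsilon: %g\n", 2.0f * eps);
    return 0;
}
```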
This is designed to give an estimate for machine epsilon. An optimizer that replaces the test 1.0 + eps > 1.0 with eps > 0 (a valid identity for real numbers) ruins the computation: the loop then keeps halving far past the intended stopping point, and the result no longer estimates machine epsilon at all. Avoiding this kind of "optimization" is so important that it is worth presenting one more very useful algorithm that is totally ruined by it. Many problems, such as numerical integration and the numerical solution of differential equations, involve computing sums with many terms.
Because each addition can potentially introduce an error as large as half an ulp, a sum with many terms can accumulate substantial rounding error. A simple way to correct for this is to store the partial sum in a double precision variable and to perform each addition using double precision.
If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest.
However, there is a much more efficient method which dramatically improves the accuracy of sums, namely the Kahan summation formula (a sketch appears below). Comparing the simple approach with the Kahan summation formula shows a dramatic improvement: each summand is perturbed by only 2ε, instead of perturbations as large as nε in the simple formula. Details are in Errors In Summation.
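A sketch of the Kahan summation formula in C; the correction variable c recovers the low-order bits lost in each addition, and it is exactly the quantity that an optimizer applying real-number algebra would simplify away to zero.

```c
#include <stdio.h>

double kahan_sum(const double x[], int n) {
    double s = x[0];
    double c = 0.0;                 /* running compensation */
    for (int j = 1; j < n; j++) {
        double y = x[j] - c;        /* corrected next term */
        double t = s + y;           /* new (rounded) partial sum */
        c = (t - s) - y;            /* what was lost in computing t */
        s = t;
    }
    return s;
}

int main(void) {
    /* Naive left-to-right addition of these terms returns exactly 1.0;
       compensated summation recovers the correctly rounded result. */
    double x[] = { 1.0, 1e-16, 1e-16, 1e-16, 1e-16 };
    printf("%.17g\n", kahan_sum(x, 5));   /* 1.0000000000000004 */
    return 0;
}
```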
These examples can be summarized by saying that optimizers should be extremely cautious when applying algebraic identities that hold for the mathematical real numbers to expressions involving floating-point variables. Another way that optimizers can change the semantics of floating-point code involves constants. In an expression like 1.0E-40*x, there is an implicit decimal-to-binary conversion of the constant. Because this constant cannot be represented exactly in binary, the inexact exception should be raised. In addition, the underflow flag should also be set if the expression is evaluated in single precision.
Since the constant is inexact, its exact conversion to binary depends on the current value of the IEEE rounding modes. Thus an optimizer that converts such a constant to binary at compile time would be changing the semantics of the program. However, constants that are exactly representable in the target precision can safely be converted at compile time, since the result does not depend on the rounding mode and no exception is raised.