Specialized IT Services focused on Data Management | Speak with Us
Character Versus Numeric Data Types
Most DBAs have faced the situation where one of their applications requires a code that is used to identify products, accounts, or some other business object. All of the codes are numeric and will stay that way. But, for reporting purposes, users or developers wish the codes to print out with leading zeroes. So, the users request that the column be defined as text, let’s say CHAR(4), to ensure that leading zeroes are always shown. But what are the drawbacks, if any, to doing this?
First of all, without proper edit checks, INSERTs and UPDATEs could place invalid alphabetic characters into the alphanumeric column. This can be a valid concern if ad hoc data modifications are permitted. Although it is uncommon to allow ad hoc modifications to production databases, data problems can still occur if all of the proper edit checks are not coded into every program that can modify the data. But let’s assume (big assumption) that proper edit checks are coded and will never be bypassed. This removes the data integrity question.
There is another problem, though, that is related to performance and filter factors. Consider the possible number of values that a CHAR(4) column and a SMALLINT column can assume. Even if programmatic edit checks are coded for each, DB2 is not aware of these and assumes that all combinations of characters are permitted. DB2 uses base 37 math when it determines access paths for character columns, under the assumption that 26 alphabetic letters, 10 numeric digits, and a space will be used. This adds up to 37 possible characters. For a four-byte character column there are 3**74 or 1,874,161 possible values.
A SMALLINT column can range from -32,768 to 32,767 producing 65,536 possible small integer values. The drawback here is that negative codes and/or 5 digit codes could be entered. Both do not conform to the 4 digit maximum. However, if we adhere to our proper edit check assumption, the data integrity problems will be avoided here, as well (not to mention that it is a trivial matter to code a check constraint limiting the values to positive integers less than 10000).
DB2 will use the HIGH2KEY and LOW2KEY values to calculate filter factors. For character columns, the range between HIGH2KEY and LOW2KEY is larger than numeric columns because there are more total values. The filter factor will be larger for the numeric data type than for the character data type which may influence DB2 to choose a different access path. For this reason, it makes sense to favor SMALLINT over the CHAR(4) definition.
The leading zeroes problem should be able to be solved using other methods. It is not necessary to store the data in the same format that users wish to display it. For example, when using QMF, you can ensure that leading zeroes are shown in reports by using the “J” edit code. Other report writes offer similar functionality. And report programs can be coded to display leading zeroes easily enough by moving the host variables to appropriate display fields.
In general, it is wise to choose a data type which is closest to the domain for the column. If the column is to store numeric data, favor choosing a numeric data type: SMALLINT, INTEGER, BIGINT, DECIMAL, and floating point. Same goes for temporal data (that is, choose DATE, TIME, or TIMESTAMP instead of a character or numeric data type). In addition, always be sure to code appropriate edit checks to ensure data integrity – but remember, fewer need to be coded if you choose the correct data type because DB2 automatically prohibits data that does not conform to the data type for each column.
This blog was originally published on Craig Mullins’ blog at: http://db2portal.blogspot.com/2006/11/character-versus-numeric-data-types.html