On Column Sizes

If you just started your adventure with databases you may not pay too much attention do data types. Often reading in the data from a flat file will be so troublesome that you just define column types to be VARCHAR(100) just to have everything read in. Maybe after having the data read you convert some of them to proper numerical types, but do you ever ask yourself why?

Mostly because of performance. Let’s look at three somehow related reasons.

1) Disk space. That is probably the first thing you thought about. If you have a column indicating customers gender with values 1, 0, NULL, then setting it to BIGINT might be a bit of an overkill. In most databases (e.g. Postgres, MS SQL, MySQL) BIGINT takes 8 bytes, but the column could be stored as BIT, which literally takes 1 bit of space. Assuming you have 1 MLN rows in the table, it saves you 63MLN bits, which is 7.5MB. Maybe not that much, but it may go fast especially when you have a table with 40 columns, and millions or billions of rows.

2) I/O & memory footprint. Low number of I/O operations is probably the most common bottleneck in querying large tables. Data in relational databases are aggregated into rows, and those into pages (table > page > row > column). A page is a transfer unit, which usually if of 4-64 KB in size. When you query a table for a specific column the database reads in a page containing the row with column of your interest and selects the value. If the column widths are not optimal, the database will not be able to read the data as fast as it could if they were, e.g. it has to read 8 bytes instead of 1 bit for gender flag.

I made a small assumption here that the whole row fits into a page, but it does not have to be the case. The best example are fields of type TEXT or BLOB, which are stored off the table, with the table just having a pointer to the location of the actual storage. A very good description of what pages are and what they are used for can be found in SQLite documentation devoted to file format. Another explanation of pages can be found on TechNet.Microsoft.com.

3) Narrow Index. If you plan to index on a column that is too wide, it will of course make index wider. But index is there to make things work faster, isn’t it? That is why, especially clustered index, should be as narrow as possible – it implies that not only it should have as few columns as possible, but also they should be as small as possible. Unless, there can be a bit wider but really meaningful clustered index. In case you have many non-clustered indexes, the clustered index should be narrow because the query that use non-clustered index will then use clustered index to fetch the data. Having narrower columns will make the I/O better, look point 2.

Leave a Reply