Recently I was working with a database where certain tables were reporting page overflow errors. I subsequently learned a bit about how Microsoft SQL Server works, and this is my attempt to write it down.
SQL Server has a page size of 8KB. In simple terms, that’s about eight thousand characters of text. For a simple database table, an average row will be, what, 100 characters? Rows are made up of columns, and each column has a size. Say you’ve got ten columns of ten characters, which adds up to 100 characters per column. So you’ll squeeze about 80 rows of data onto a page. Great.
Sometimes you need to store a whole bunch of data in a row. Say, the complete text of an essay, or an image, or something that’s going to take more space than 8000 characters would. If you drop that data into your row, suddenly BOOM… your row is bigger than your page size. That’s no good. The classic solution is to use a BLOB data type: TEXT, NTEXT or IMAGE. When you use a BLOB, instead of taking up row space with this one particularly large column, you just drop a little signpost in the row that says “Look over there!” and store the data somewhere else. There are drawbacks – there’s a little delay in reading data because you have to follow the signpost, and BLOBs don’t permit sorting and string functions and all our favorite tricks. But it works pretty well.
In SQL Server 2005, Microsoft introduced three new data types: VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX). They are meant to replace the old BLOB data types, which are now deprecated. These new types are stored in-row as much as possible so we don’t have to worry about following signposts, but as soon as the row gets too long one of these will automatically pack its bags and move out, leaving a signpost behind. It’s just an automatic switch from one old column type to another. There are other benefits – sorting and string functions always work on a VARCHAR(MAX) – but we still have to follow a signpost if the row happens to have grown too large, and now we have an additional check every time the row size changes to see if anybody needs to move out.
The suggestions I have seen regarding VARCHAR(MAX) assume that it will be used in much the same situation the old TEXT data type was – single fields expected to exceed 8KB. However, this database had a different issue. It was already using TEXT fields where it seemed appropriate; the tables just had so many relatively short fields that they were overflowing. Two hundred columns of fifty characters each adds up to ten thousand characters, well beyond the page size. Now, these tables worked most of the time because they were using VARCHAR fields, which only store the amount of data they have to… If each column only had twenty-five characters, everything was good. But as soon as someone came along and used all the space they were allowed, things would break.
This creates a different situation where VARCHAR(MAX) is useful. If I made several columns VARCHAR(MAX) (in spite of only expecting them to store fifty characters) then the table would automatically adjust its size as needed, where before it would break. This might seem an extreme waste, but really the biggest cost we pay is when modifying the row and this system would pull data much more often than we put it in. If Microsoft feels its algorithm is efficient enough to eliminate the old BLOB types altogether, it’s probably okay. I won’t deny that some better database design would have alleviated this particular problem altogether, but refactoring was not an option.
I’m glad of the change; I think Microsoft is taking a step in the right direction here. However, I’d say they’re still way behind PostgreSQL. For some time now, all of PSQL’s character types have had this auto-collapsing feature, so that it’s actually most efficient to just use TEXT for all of your character fields. Freaky, but cool.
For more information on the new BLOB types from people with more technical know-how than me, check out Understanding VARCHAR(MAX) in SQL Server 2005, NTEXT vs NVARCHAR(MAX) in SQL 2005 and the official Char and Varchar (Transact-SQL).