# Monday, March 19, 2007

Recently a student who was going through the Microsoft eLearning course 2778 asked for an explanation of "accent sensitive". Before going into why it matters to SQL Server, lets look into the role of accents. A lot of languages make use of accents to change the pronunciation of some letters. Like with the accents in the German language, which are called Umlaut, when translating the word Umlaut very directly, you'd get alternate tone.

Back to SQL Server and why it should be sensitive to accents, or for that matter case, kana or width (things we'll get to once we touch COLLATION). If we ask a question to our server, frequently we want the answer to be in order. And this order very much depends on data type.

Let's take a group of downhill skiing contestants. In preparation of a match, some overview is needed. The contestants have to be grouped in ages, so there is a need for sorting on birthdate (a datetime data type). During the match, there are results, so there also is a need to order by the time taken to get from start to finish (the number of seconds as decimal in descending order), so the winner is at first place. And probably before the match, there is the need for a list with all contestants ordered by lastname.

Where a programmer (VB, C#, etc) would see this lastname as a string, SQL Server actually has a lot of possible data types to represent and handle textual information.

Single Byte Double Byte
Fixed length char(x) nchar(x)
Variable length varchar(x) nvarchar(x)
CLOB (Character Large OBject) varchar(max) nvarchar(max)
Back-Compat CLOB text ntext

Though these data types all can store and handle textual data types, the collation has to be specified as well.

Collation

Collation is best thought of as the order of characters in a given character-set. In this sense, an alphabet is a character-set, but collation applies to other writing systems too. In SQL Server, the collation includes the character-set and the options;

  • binary
  • binary-code point
  • Case sensitive
  • Accent sensitive
  • Kana sensitive
  • Width sensitive

which are all explained at MSDN. These options are also referred to as sort order.

Collation exists at the levels of instance, database and column. If a collation is not defined when creating a column, it is inherited from the database. Likewise, if no collation is specified when creating the database, it inherits the collation from the instance. The collation of the instance is defined when it is installed. When defining the collation, the designer (see picture) can be used.

Note there is a choice for SQL Collations and Windows Collation. In SQL Server 2005, go for the Windows Collation (SQL Collation is mainly there for compatibility reasons).

Sorting and storing

The collation is used by SQL Server both when sorting data (SELECT ... FROM ... ORDER BY ...) and when storing textual data as part of an index (both clustered and non-clustered). It can happen that you need data that is stored in one collation, to be evaluated in another collation, like when comparing data from two different collations. However, two different collations can not be compared, so collations have to be made the same. This is where the COLLATE clause comes into play, so you can specify a collation in the query.

Example

To show some differences with sort order and collation, I've added a script for a demo database with two tables, both containing some character combinations with variations in case and accents. One table is stored in the Latin1_General_CS_AS collation, the other one in German_PhoneBook_BIN2. Both contain the same data;

RowNum
int
TextData
nchar(10)
1 ABC
2 ÄBC
3 abc
4 äbc
5 ABc
6 ÁBC
7 abC
8 ábc

When selecting from the Latin1_General_CS_AS collated table, ordered by TextData, you receive the order 3, 7, 5, 1, 8, 6, 4, 2. Whereas with the German_PhoneBook_BIN2 collation, the order is 1, 5, 7, 3, 6, 2, 8, 4. Now with the COLLATE clause, you can alter the collation in a query. So

SELECT RowNum, TextData FROM GerPB_BIN2 ORDER BY TextData COLLATE Latin1_General_CS_AS

though selected from the table in German_PhoneBook_BIN2 collation, is returned in the order of the Latin1_General_CS_AS collation. When comparing two from two different collations, like you do in a join, SQL Server will raise an error.

SELECT * FROM GerPB_BIN2 G INNER JOIN Latin1_CS_AS L ON G.TextData = L.TextData

Msg 468, Level 16, State 9, Line 1
Cannot resolve the collation conflict between "Latin1_General_CS_AS" and "German_PhoneBook_BIN2" in the equal to operation.

However, by using the COLLATE clause with one of the compared columns, they can be joined.

SELECT * FROM GerPB_BIN2 G INNER JOIN Latin1_CS_AS L ON G.TextData COLLATE Latin1_General_CS_AS = L.TextData

Guess this is the long answer, but not the last time I have to give it...