Martel Richard

Posted on Mar 27

Collation in Database - What you need to know

#database #sqlserver #sql

What is Collation?

Collation refers to a set of rules that determine how data is sorted and compared. Collation settings specify the correct character sequence, case sensitivity, accent marks, character width and the characters used in comparison operations, such as equality and inequality.

How does this affect the data?

1. Character data sort order

This specifies how string data is sequenced. For example, whether 'a' comes before 'b'.

2. Case sensitivity

This determines whether 'A' and 'a' are considered the same.

3. Accent sensitivity

This determines whether 'a' and 'á' are considered the same.

4. Kana Sensitivity

This determines whether two characters that look the same but are represented differently in Kana (a Japanese script) are treated as equal.

5. Width sensitivity

This determines whether a single-byte character (half-width) and the same character represented as a double-byte character (full-width) are treated as equal.

Where do you specify the collation?

Collation can be specified at different levels depending on the specific database system. For example, in SQL Server, it can be specified at:

Server level
Database level
Column level
Expression (SQL Query) level

The above levels are mentioned in the descending order of hierarchy i.e. when you create a database, you specify the default collation for the database. Every column in the database that uses character data types has a collation. If not specified, the column uses the default collation of the database.

How collation affects the query results?

For example, the default collation in SQL Server is

SQL_Latin1_General_CP1_CI_AS

The elements of this collation name represent the following.

SQL - This prefix means that the collation is a SQL Server collation.

Latin1_General - This is the base collation, which defines the rules that are used for sorting and comparing characters within the Latin1 character set.

CP1 - This stands for Code Page 1252, which is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages.

CI - This stands for Case Insensitive. This means that the SQL Server does not consider case during comparison. For example, 'A' and 'a' are considered to be the same.

AS - This stands for Accent Sensitive. This means that accented characters and unaccented characters are considered to be different. For example, 'a' and 'á' are considered to be different.

Overall, SQL_Latin1_General_CP1_CI_AS is a SQL Server collation that is case insensitive and accent sensitive, using the Latin1 General rules and the 1252 Code Page. This is a commonly used collation in the English language for SQL Server.

How to check your collation settings in SQL Server

You can determine the server's character set in SQL Server by using the following SQL query.

SELECT SERVERPROPERTY('Collation')

If you need more information about the collation, you can use fn_helpcollations function which returns all the collations supported by your server.

SELECT * FROM fn_helpcollations()

When characters cannot be represented in a given collation

In the context of the SQL_Latin1_General_CP1_CI_AS collation in SQL Server, which is based on the Windows-1252 code page, certain characters from other scripts or languages that are not covered by Windows-1252 may not be saved correctly.

For example, characters from scripts such as Arabic, Hebrew, Cyrillic, Chinese, Japanese, Korean, and many others, cannot be properly represented in this collation. So, if you try to save a character like the Arabic letter "ج" or the Chinese character "汉" using this collation, such characters will be replaced with "?".

Refer Collation in SQL Server to know more on various collation rules and levels.

Hot sauce if you're wrong · web dev trivia for staff engineers (Chris vs Jeremy, Leet Heat S1.E4)

Shipping Fast: Test your knowledge of deployment strategies and techniques
Authentication: Prove you know your OAuth from your JWT
CSS: Demonstrate your styling expertise under pressure
Acronyms: Decode the alphabet soup of web development
Accessibility: Show your commitment to building for everyone

Contestants must answer rapid-fire questions across the full stack of modern web development. Get it right, earn points. Get it wrong? The spice level goes up!

Watch Video 🌶️🔥

DEV Community