What is Collation?
Collation refers to a set of rules that determine how data is sorted and compared. Collation settings specify the correct character sequence, case sensitivity, accent marks, character width and the characters used in comparison operations, such as equality and inequality.
How does this affect the data?
1. Character data sort order
This specifies how string data is sequenced. For example, whether 'a' comes before 'b'.
2. Case sensitivity
This determines whether 'A' and 'a' are considered the same.
3. Accent sensitivity
This determines whether 'a' and 'á' are considered the same.
4. Kana Sensitivity
This determines whether two characters that look the same but are represented differently in Kana (a Japanese script) are treated as equal.
5. Width sensitivity
This determines whether a single-byte character (half-width) and the same character represented as a double-byte character (full-width) are treated as equal.
Where do you specify the collation?
Collation can be specified at different levels depending on the specific database system. For example, in SQL Server, it can be specified at:
- Server level
- Database level
- Column level
- Expression (SQL Query) level
The above levels are mentioned in the descending order of hierarchy i.e. when you create a database, you specify the default collation for the database. Every column in the database that uses character data types has a collation. If not specified, the column uses the default collation of the database.
How collation affects the query results?
For example, the default collation in SQL Server is
SQL_Latin1_General_CP1_CI_AS
The elements of this collation name represent the following.
SQL - This prefix means that the collation is a SQL Server collation.
Latin1_General - This is the base collation, which defines the rules that are used for sorting and comparing characters within the Latin1 character set.
CP1 - This stands for Code Page 1252, which is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages.
CI - This stands for Case Insensitive. This means that the SQL Server does not consider case during comparison. For example, 'A' and 'a' are considered to be the same.
AS - This stands for Accent Sensitive. This means that accented characters and unaccented characters are considered to be different. For example, 'a' and 'á' are considered to be different.
Overall, SQL_Latin1_General_CP1_CI_AS is a SQL Server collation that is case insensitive and accent sensitive, using the Latin1 General rules and the 1252 Code Page. This is a commonly used collation in the English language for SQL Server.
How to check your collation settings in SQL Server
You can determine the server's character set in SQL Server by using the following SQL query.
SELECT SERVERPROPERTY('Collation')
If you need more information about the collation, you can use fn_helpcollations
function which returns all the collations supported by your server.
SELECT * FROM fn_helpcollations()
When characters cannot be represented in a given collation
In the context of the SQL_Latin1_General_CP1_CI_AS collation in SQL Server, which is based on the Windows-1252 code page, certain characters from other scripts or languages that are not covered by Windows-1252 may not be saved correctly.
For example, characters from scripts such as Arabic, Hebrew, Cyrillic, Chinese, Japanese, Korean, and many others, cannot be properly represented in this collation. So, if you try to save a character like the Arabic letter "ج" or the Chinese character "汉" using this collation, such characters will be replaced with "?".
Refer Collation in SQL Server to know more on various collation rules and levels.
Top comments (0)