Storing Non ASCII Characters in Database
MySQL can store non ASCII characters in database in a number of encodings, MySQL call them character sets:
+----------+-----------------------------+
Charset Description
+----------+-----------------------------+
big5 Big5 Traditional Chinese
dec8 DEC West European
cp850 DOS West European
hp8 HP West European
koi8r KOI8-R Relcom Russian
latin1 ISO 8859-1 West European
latin2 ISO 8859-2 Central European
swe7 7bit Swedish
ascii US ASCII
ujis EUC-JP Japanese
sjis Shift-JIS Japanese
cp1251 Windows Cyrillic
hebrew ISO 8859-8 Hebrew
tis620 TIS620 Thai
euckr EUC-KR Korean
koi8u KOI8-U Ukrainian
gb2312 GB2312 Simplified Chinese
greek ISO 8859-7 Greek
cp1250 Windows Central European
gbk GBK Simplified Chinese
latin5 ISO 8859-9 Turkish
armscii8 ARMSCII-8 Armenian
utf8 UTF-8 Unicode
ucs2 UCS-2 Unicode
cp866 DOS Russian
keybcs2 DOS Kamenicky Czech-Slovak
macce Mac Central European
macroman Mac West European
cp852 DOS Central European
latin7 ISO 8859-13 Baltic
cp1256 Windows Arabic
cp1257 Windows Baltic
binary Binary pseudo charset
geostd8 GEOSTD8 Georgian
+----------+-----------------------------+
To store non ASCII characters in a database column, you need to define that column with
a specific character set. You can specify a character set at 3 levels: database, table, and column.
For example:
CREATE DATABASE db_name CHARACTER SET utf8
CREATE TABLE tbl_name (...) CHARACTER SET utf8
CREATE TABLE tbl_name (col_name CHAR(80) CHARACTER SET utf8, ...)
- If a character set is specified at the database level, it is applied to all CHAR, VARCHAR,
and TEXT columns of all tables in this database. - If a character set is specified at the table level, it is applied to all CHAR, VARCHAR,
and TEXT columns in this table. - If a character set is specified at the column level, it is applied to this column only.
- The column length specified in the table creation statement is counted at the character level,
not at the encoding byte level. - If "utf8" is used on a CHAR(n) column, this column will require 3*n bytes of storage.
Transmitting Non ASCII Characters to the Server
Handling non ASCII characters with MySQL not only requires us setting up the table columns
with the correct encoding (character set), but also requires us setting up the correct encoding
for transferring characters to and from the database.
MySQL offers the following variables to control the encodings used to transfer characters between
the client (PHP script) and the server (MySQL database):
- character_set_server - Character set used for all databases on the server.
- character_set_database - Character set used for the default databases on the server.
- character_set_client - Character set used by the client for all SQL statements.
- character_set_connection - Character set that all SQL statements should be converted to
by the server before executing them. - character_set_results - Character set that all returning result should be converted to
by the server before sending it back to the client.
In most cases, you should make the client (PHP script) to use the same encoding (character set)
as the server (MySQL server), so that there is no need to convert when transferring characters between them.
Some times, you may need to use different encodings. For example, you want to use UTF-8 for your PHP scripts,
while using GB2312 for your MySQL tables.
You can use two special SET commands to change those variables:
1. "SET NAMES 'x'" is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;
2. "SET CHARACTER SET x" is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;
You can always use the SHOW VARIABLES LIKE 'variable_name' to view the current value of the given variable.
MySqlUnicode.php - UTF-8 Sample Script
My is my sample script to show you how to send UTF-8 strings to a MySQL server and store them
in UTF-8 encoding, MySqlUnicode.php:
<?php # MySqlUnicode.php
# Copyright (c) 2006 by Dr. Herong Yang, http://www.herongyang.com/
#
$con = mysql_connect('localhost');
print "\nDefault settings...\n";
$rs = mysql_query("SHOW VARIABLES LIKE 'character_set_%'");
while ($row = mysql_fetch_array($rs)) {
print " ".$row[0].' '.$row[1]."\n";
}
print "\nUpdated settings...\n";
$rs = mysql_query("SET NAMES 'utf8'");
$rs = mysql_query("SHOW VARIABLES LIKE 'character_set_%'");
while ($row = mysql_fetch_array($rs)) {
print " ".$row[0].' '.$row[1]."\n";
}
print "\nCreating a table in UTF-8...\n";
$rs = mysql_query('DROP DATABASE MyBase');
$rs = mysql_query('CREATE DATABASE MyBase CHARACTER SET utf8');
$rs = mysql_query('USE MyBase');
$rs = mysql_query('CREATE TABLE MyTable (ID INTEGER,'
.' Message VARCHAR(80))');
print "\nInserting some rows to the table...\n";
$str = "Hello!";
$rs = mysql_query("INSERT INTO MyTable VALUES ( 1, '$str')");
$str = "\xC2\xA1Hola!";
$rs = mysql_query("INSERT INTO MyTable VALUES ( 2, '$str')");
$str = "\xE4\xBD\xA0\xE5\xA5\xBD!";
$rs = mysql_query("INSERT INTO MyTable VALUES ( 3, '$str')");
print "\nQuery some rows from the table...\n";
$rs = mysql_query('SELECT * FROM MyTable WHERE ID < 10');
print " ".mysql_field_name($rs,0)." "
.mysql_field_name($rs,1)."\n";
while ($row = mysql_fetch_array($rs)) {
print " ".$row[0].' '.$row[1]."=(\x".bin2hex($row[1]).")\n";
}
mysql_free_result($rs);
mysql_close($con);
?>
If you run it, you will get:
Default settings...
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir C:\local\MySQL\share\charsets/
Updated settings...
character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir C:\local\MySQL\share\charsets/
Creating a table in UTF-8...
Inserting some rows to the table...
Query some rows from the table...
ID Message
1 Hello!=(\x48656c6c6f21)
2 -íHola!=(\xc2a1486f6c6121)
3 S+ásÑ+!=(\xe4bda0e5a5bd21)
MySQL works as we expected, no conversion on the SQL statements, storing strings in UTF-8, and no conversion
on query result.
Conclusion
- MySQL provides a good support on non ASCII characters. If offers a number of encodings (character sets).
- MySQL allows to specify encodings at database, table or column level.
- MySQL allows to control encoding conversions on receiving SQL statements and returning query results.
- My recommendation is to use UTF-8 to all text information, turn off MySQL encoding conversion on receiving
SQL statements and returning query results, and let your PHP script to handle UTF-8 strings.
No comments:
Post a Comment