Lorenzo Alberton

Contact me ·

« Articles

DBMS and charsets, a summary and a call for help

PHP, PEAR, Database, MySQL, PostgreSQL, Firebird SQL, Oracle, SQLite, Charset · 28 September 2006 10:02 · 1 comment

Abstract: Some clarification about charset settings for the various DBMS (MySQL, PostgreSQL, Firebird/Interbase, Oracle, Sqlite) and a call for help for charset setting support in PEAR::MDB2.

Foreword: we're working on PEAR::MDB2 charset and collation support. That means that we're in the process of figuring out what's the best way to set the charset in each DBMS, and how to make it portable and consistent through all of them.
I'll try to explain how the various DBMS implement charset settings first, then describe the MDB2 status-quo regarding charset support. Finally, we'd like to get some feedback to implement a better charset support in MDB2.

Types of charset settings

In this wiki page there's a nice summary of the possible charset settings:

client: charset of a query, i.e. the charset as it leaves client and is received by the server.
connection (server): this is the input charset used by the server. If the client charset is different, it is translated into the server charset (lossy conversion?)
database/table/field: this is the default charset for the db/table/field, i.e. the charset of the stored data.
results (message): this is the output charset used by the server. Everything coming from the server (messages and resultsets) is sent in this charset

While the above schema is more or less clear, when it comes to implementing a generic solution for every DBMS, then some doubts start creeping in.

Summary of the charset settings supported by each DBMS

In an ideal word, all the DBMS would implement the SQL standard in a standard way ;-), or at least provide some conceptually equivalent set of parameters. In the real word, things are a bit different...

DBMS charset comparison chart #1 (I/O)
DBMS	client	connection	result
mysql	SET character_set_client = charset	SET character_set_connection = charset	SET character_set_results = charset
pgsql	SET NAMES 'charset'	-	SET NAMES 'charset'
ibase	SET NAMES charset (isql)	-	SET NAMES charset (isql)
oracle	NLS_LANG environmental variable	-	-
sqlite	-	-	-

A note about the SQL standard "SET NAMES charset" clause: in MySQL, it is equivalent to these three commands:

SET character_set_client = charset;
SET character_set_results = charset;
SET character_set_connection = charset;

In MySQL, there's also the "SET CHARACTER SET charset" clause, which is equivalent to these three commands:

SET character_set_client = charset;
SET character_set_results = charset;
SET collation_connection = @@collation_database;

In other words, it is similar to SET NAMES, but sets the connection character set and collation to be those of the default database.

With PostgreSQL, we have three ways to set the client encoding:

1. pg_setclientencoding($cid, "BIG5")
2. pg_exec("SET NAMES 'BIG5'")
3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")

2 and 3 are identical, 1 is doing the same thing as 2 or 3 AND set the internal encoding of libpq, which is linked to PHP, to BIG5. This is necessary if the second byte of the BIG5 character is "\" or one of other control characters.

Setting the charset for the data storage

DBMS charset comparison chart #2 (data storage)
DBMS	database	table	field
mysql	CREATE DATABASE database DEFAULT CHARACTER SET charset COLLATE collation	CREATE TABLE table1 (...) CHARACTER SET charset COLLATE collation	CREATE TABLE table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)
pgsql	CREATE DATABASE database WITH ENCODING 'charset'	-	-
ibase	CREATE DATABASE database DEFAULT CHARACTER SET 'charset'	-	CREATE TABLE Table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)
oracle	-	-	-
sqlite	-	PRAGMA encoding="charset"	CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)
sql server	CREATE DATABASE database COLLATE collation	-	CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)

Where not specified, the charset is the default charset of the system, of the default database or of the current table.

When in doubt, ask the experts!

I admit my experience with charsets is limited to MySQL, PostgreSQL and Firebird/Interbase with ISO-8859-1 and UTF-8, and always in a uniform setup (i.e. using the same charset for the whole db, and for the data I/O). For instance, I don't have a (successful) first-hand experience of mixed setups, i.e. where the server charset is -say- latin1 and you have a table in latin2 and another in cp1251_koi8.
AFAICS, others have tried with mixed results.

Since our goal is to provide a good charset/collation setting implementation for PEAR::MDB2, we humbly ask for counselling to those of you who know better. Please contact me or Lukas with your corrections, suggestions and guidelines. We're especially interested in hearing experiences with dealing with multiple charsets at the same time. I will also update this page so it can serve as a future reference to others too.

For instance, ATM we only have a generic setCharset() method, which in the ibase, mysql and pgsql drivers triggers a "SET NAMES" query. Initially, in the MySQL driver it translated to a "SET character_set_client" query, but I changed it to "SET NAMES" for it was not effective for most users. Is it the right choice? Should we implement a method for each charset setting type? If so, how? Remember that we must provide a portable solution.

Thank you for your attention.

Back