DBMS and charsets, a summary and a call for help
Abstract: Some clarification about charset settings for the various DBMS (MySQL, PostgreSQL, Firebird/Interbase, Oracle, Sqlite) and a call for help for charset setting support in PEAR::MDB2.
Foreword: we're working on PEAR::MDB2 charset and collation support. That means that we're in the process of figuring out what's the best way to set the charset in each DBMS, and how to make it portable and consistent through all of them.
I'll try to explain how the various DBMS implement charset settings first, then describe the MDB2 status-quo regarding charset support. Finally, we'd like to get some feedback to implement a better charset support in MDB2.
Types of charset settings
In this wiki page there's a nice summary of the possible charset settings:
- client: charset of a query, i.e. the charset as it leaves client and is received by the server.
- connection (server): this is the input charset used by the server. If the client charset is different, it is translated into the server charset (lossy conversion?)
- database/table/field: this is the default charset for the db/table/field, i.e. the charset of the stored data.
- results (message): this is the output charset used by the server. Everything coming from the server (messages and resultsets) is sent in this charset
While the above schema is more or less clear, when it comes to implementing a generic solution for every DBMS, then some doubts start creeping in.
Summary of the charset settings supported by each DBMS
In an ideal word, all the DBMS would implement the SQL standard in a standard way ;-), or at least provide some conceptually equivalent set of parameters. In the real word, things are a bit different...
|mysql||SET character_set_client = charset||SET character_set_connection = charset||SET character_set_results = charset|
|pgsql||SET NAMES 'charset'||-||SET NAMES 'charset'|
|ibase||SET NAMES charset (isql)||-||SET NAMES charset (isql)|
|oracle||NLS_LANG environmental variable||-||-|
A note about the SQL standard "
SET NAMES charset" clause: in MySQL, it is equivalent to these three commands:
SET character_set_client = charset; SET character_set_results = charset; SET character_set_connection = charset;
In MySQL, there's also the "
SET CHARACTER SET charset" clause, which is equivalent to these three commands:
SET character_set_client = charset; SET character_set_results = charset; SET collation_connection = @@collation_database;
In other words, it is similar to
SET NAMES, but sets the connection character set and collation to be those of the default database.
With PostgreSQL, we have three ways to set the client encoding:
1. pg_setclientencoding($cid, "BIG5") 2. pg_exec("SET NAMES 'BIG5'") 3. pg_exec("SET CLIENT_ENCODING TO 'BIG5'")
2 and 3 are identical, 1 is doing the same thing as 2 or 3 AND set the internal encoding of libpq, which is linked to PHP, to BIG5. This is necessary if the second byte of the BIG5 character is "\" or one of other control characters.
Setting the charset for the data storage
|mysql||CREATE DATABASE database DEFAULT CHARACTER SET charset COLLATE collation||CREATE TABLE table1 (...) CHARACTER SET charset COLLATE collation||CREATE TABLE table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)|
|pgsql||CREATE DATABASE database WITH ENCODING 'charset'||-||-|
|ibase||CREATE DATABASE database DEFAULT CHARACTER SET 'charset'||-||CREATE TABLE Table1 (column1 VARCHAR(5) CHARACTER SET charset COLLATE collation)|
|sqlite||-||PRAGMA encoding="charset"||CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)|
|sql server||CREATE DATABASE database COLLATE collation||-||CREATE TABLE Table1 (column1 VARCHAR(5) COLLATE collation)|
Where not specified, the charset is the default charset of the system, of the default database or of the current table.
When in doubt, ask the experts!
I admit my experience with charsets is limited to MySQL, PostgreSQL and Firebird/Interbase with ISO-8859-1 and UTF-8,
and always in a uniform setup (i.e. using the same charset for the whole db, and for the data I/O). For instance,
I don't have a (successful) first-hand experience of mixed setups, i.e. where the server charset is -say-
latin1 and you have a table in
latin2 and another in
AFAICS, others have tried with mixed results.
Since our goal is to provide a good charset/collation setting implementation for PEAR::MDB2, we humbly ask for counselling to those of you who know better. Please contact me or Lukas with your corrections, suggestions and guidelines. We're especially interested in hearing experiences with dealing with multiple charsets at the same time. I will also update this page so it can serve as a future reference to others too.
For instance, ATM we only have a generic
setCharset() method, which in the ibase, mysql and pgsql drivers triggers a
"SET NAMES" query. Initially, in the MySQL driver it translated to a
"SET character_set_client" query, but I changed it to
"SET NAMES" for it was not effective for most users. Is it the right choice? Should we implement a method for each charset setting type? If so, how? Remember that we must provide a portable solution.
Thank you for your attention.
- On batching vs. latency, and jobqueue models
- Updated Kafka PHP client library
- Musings on some technical papers I read this weekend: Google Dremel, NoSQL comparison, Gossip Protocols
- Historical Twitter access - A journey into optimising Hadoop jobs
- Kafka proposed as Apache incubator project
- NoSQL Databases: What, When and Why (PHPUK2011)
- PHPNW10 slides and new job!