PHP/MySQL Character Set Woes

So far this works:

mb_convert_encoding($db_rowArray['itemText'],'UTF-8','Windows-1252')

The suspicion is that this is not platform-independent, since Windows-1252 may not be the source character set on the live Linux box (that third argument, however, also accepts an array or a comma-separated list of candidate encodings). The call above can replace:

//Convert HTML entities back to their applicable characters:
$string = html_entity_decode($string,ENT_QUOTES,"UTF-8");

In the end, this comes to mind:

//Handle multi-byte strings:
if(isset($mbCharSet))
{
    if($mbCharSet == 'cross-platform')
    {
        $mbCharSet =
            //Compare against lowercase, since strtolower() can never return 'MyWinXPbox':
            (strtolower($_SERVER['SERVER_NAME']) == 'mywinxpbox') ?
                'Windows-1252' : 'auto';
    }
    $string = mb_convert_encoding($string,'UTF-8',$mbCharSet);
}
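
An alternative to keying the conversion on the server name is to look at the bytes themselves. The following is only a minimal sketch, assuming the mbstring extension is loaded and that anything that is not valid UTF-8 is legacy Windows-1252 data; the helper name toUtf8() and its fallback parameter are hypothetical, not part of the site's code. It is a heuristic: a few Windows-1252 byte sequences also happen to be valid UTF-8 and would pass through unconverted.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not
// already valid UTF-8. The Windows-1252 fallback is an assumption
// about where the legacy data came from.
function toUtf8($string, $fallbackCharSet = 'Windows-1252')
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid UTF-8 (pure ASCII also passes)
    }
    return mb_convert_encoding($string, 'UTF-8', $fallbackCharSet);
}

$cp1252 = "caf\xE9";     // "café" with é as the single byte 0xE9
$utf8   = "caf\xC3\xA9"; // "café" with é as the two bytes 0xC3 0xA9

var_dump(toUtf8($cp1252) === $utf8); // bool(true): converted
var_dump(toUtf8($utf8) === $utf8);   // bool(true): left untouched
```

Because the decision depends on the data rather than on $_SERVER['SERVER_NAME'], the same code should behave identically on the Windows box and the Linux box.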

For the last two days, the frustration rock has been my pillow, thanks to the deceptively “simple” task of converting SonghaySystem.com to UTF-8. Again, I say: this is not “simple”; it is primal.

Songhay System uses a .NET ‘data transformation assembly’ to pump data out of the offline SQL Server 2000 store into the online MySQL 4.1 store, and that assembly depends on MySQL Connector .NET 1.0.6. Rather than be plagued with suspicion, I assume here that, when charset=utf8; is specified in the connection string, MySQL is filling its tables with UTF-8 data. The suspicion comes from the need to use the mb_convert_encoding() function on the PHP end of the deal. Is MySQL filled with latin1 or something?
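
One way to reason about that question is to compare what the conversion does to each kind of input. The sketch below assumes the mbstring extension is available and uses a lone "é" as a stand-in for any non-ASCII character in itemText. If PHP were really receiving UTF-8, the extra conversion would double-encode the text and show "Ã©"; since the conversion produces clean output instead, the bytes reaching PHP are probably single-byte, whether they are stored that way or converted somewhere between the table and PHP.

```php
<?php
// Assumption: "é" stands in for any non-ASCII character in the data.
$utf8   = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)
$cp1252 = "\xE9";     // "é" encoded as Windows-1252 (one byte)

// If PHP received UTF-8, converting it again would double-encode it:
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo bin2hex($doubleEncoded), "\n"; // c383c2a9, which displays as "Ã©"

// If PHP received Windows-1252 bytes, the conversion yields clean UTF-8:
$converted = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');
echo bin2hex($converted), "\n";     // c3a9, which displays as "é"
```

Running bin2hex() on a value straight out of the query, before any conversion, shows which of the two cases applies.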

After reading “Character Sets / Character Encoding Issues” and “Unicode Support,” I will kind-of-sort-of assume that it’s PHP—especially PHP on Windows—that’s causing most of the problems requiring multi-byte encoding. Take a listen to this (from textpattern.net):

PHP internally uses ISO-8859-1 as encoding for all strings. Even the upcoming 5.1.0 release does it this way. Again, as for MySQL, as long as we are only reading in and outputting strings this is not much of a problem, but as soon as we try to use any string-related functions we may run into problems with everything that is not in the ASCII-range. So all the powerful string-manipulation functions in PHP will likely “mangle” multibyte-strings, because each byte is treated as a character, even when multiple bytes may be describing only a single character. There is a multibyte-extension (mb) available for php, which has multibyte-safe versions of most string-functions which—despite being rather unstable in early versions—is today very usable. Unfortunately it is optional and thus can’t be relied upon to be always available. Only the Regular Expressions support in PHP interestingly knows a "/u" modifier that treats string as UTF-8.
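
The byte-versus-character problem described in that quotation can be seen in a few lines. This is a small illustration, assuming the mbstring extension is loaded; the sample word is made up for the demonstration.

```php
<?php
$word = "caf\xC3\xA9"; // "café" in UTF-8: 4 characters, 5 bytes

echo strlen($word), "\n";             // 5: counts bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // 4: counts characters

// Without /u the dot matches one byte; with /u it matches one character.
echo preg_match('/^.{4}$/', $word), "\n";  // 0
echo preg_match('/^.{4}$/u', $word), "\n"; // 1

// substr() cuts the two-byte "é" in half; mb_substr() keeps it whole.
echo bin2hex(substr($word, 0, 4)), "\n";             // 636166c3
echo bin2hex(mb_substr($word, 0, 4, 'UTF-8')), "\n"; // 636166c3a9
```

The truncated result from substr() is no longer valid UTF-8, which is the kind of “mangling” the quoted passage warns about.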