Convert Unicode To Html Entities Hex

How do I convert a Unicode string to HTML entities (hex, not decimal)? For example, convert Français to Fran&#xE7;ais.

Solution 1:

For the missing hex-encoding in the related question:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

This is similar to @Baba's answer using UTF-32BE and then unpack and vsprintf for the formatting needs.
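To make the three steps concrete, here is a minimal sketch that traces a single character through the same pipeline. It assumes the mbstring extension is loaded, and the variable names are only for illustration.

```php
<?php
// "ç" (U+00E7) written as raw UTF-8 bytes, so the file encoding does not matter.
$utf8 = "\xC3\xA7";

// Step 1: re-encode as UTF-32BE, i.e. exactly four big-endian bytes per code point.
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
// bin2hex($binary) is "000000e7" at this point.

// Step 2: 'N' unpacks an unsigned 32-bit big-endian integer.
$parts = unpack('N', $binary);   // array(1 => 231)

// Step 3: format the code point as an upper-case hexadecimal entity.
$entity = vsprintf('&#x%X;', $parts);

echo $entity, "\n";   // &#xE7;
```

Because UTF-32BE always uses four bytes per code point, the same unpack format works for characters outside the Basic Multilingual Plane as well.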

If you prefer iconv over mb_convert_encoding, it's similar:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = iconv('UTF-8', 'UTF-32BE', $utf8);
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

I find this string manipulation a bit clearer than the one in Get hexcode of html entities.
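On newer PHP versions the conversion to UTF-32BE can probably be skipped altogether. The following is a sketch under the assumption of PHP 7.2 or later with mbstring, where mb_ord() is built in; the function name hex_entities is made up for this example.

```php
<?php
// Hypothetical helper; assumes PHP >= 7.2 with the mbstring extension.
function hex_entities(string $input): string
{
    $result = preg_replace_callback(
        '/[\x{80}-\x{10FFFF}]/u',
        function (array $match): string {
            // mb_ord() returns the Unicode code point of the matched character.
            return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
        },
        $input
    );

    // null signals a PCRE error (for example invalid UTF-8 input);
    // this sketch simply hands the input back unchanged in that case.
    return $result === null ? $input : $result;
}

echo hex_entities("Fran\xC3\xA7ais"), "\n";   // Fran&#xE7;ais
```

The regular expression is the same as above; only the body of the callback changes.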

Solution 2:

Your string looks like UCS-4 encoding; you can try:

$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    $char = current($m);
    $utf = iconv('UTF-8', 'UCS-4', $char);
    return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);

Output

string 'Fran&#xE7;ais' (length=13)
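The formatting half of that callback is plain string work, which the sketch below walks through on four hand-written bytes. It assumes the bytes arrive in big-endian order; whether a bare 'UCS-4' gives you that may depend on the iconv implementation, so naming 'UCS-4BE' explicitly is probably the safer choice.

```php
<?php
// The four big-endian UCS-4 bytes for "ç" (U+00E7), written out by hand.
$ucs4 = "\x00\x00\x00\xE7";

$hex     = bin2hex($ucs4);       // "000000e7"
$upper   = strtoupper($hex);     // "000000E7"
$trimmed = ltrim($upper, '0');   // "E7" (leading zeros stripped)

$entity = sprintf('&#x%s;', $trimmed);
echo $entity, "\n";              // &#xE7;
```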

Solution 3:

Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 characters are left alone rather than converted to entities.
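A small comparison of the two functions, as a sketch; the sample string is made up and written with byte escapes so that it is UTF-8 regardless of how the file is saved:

```php
<?php
$text = "Fran\xC3\xA7ais <b>";   // "Français <b>" as UTF-8 bytes

// Escapes only the characters that are significant in HTML: & < > " '
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
echo $special, "\n";    // Français &lt;b&gt;

// Additionally replaces every character that has a named entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');
echo $entities, "\n";   // Fran&ccedil;ais &lt;b&gt;
```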

I would like to document an alternative solution because it solved a similar problem for me. I was using PHP's utf8_encode() to escape 'special' characters.

I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!):

$foo = 'This is my test string \u03b50';
echo unicode2html($foo);

function unicode2html($string) {
    // Match a literal \u followed by exactly four hex digits (either case).
    return preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $string);
}

Hope this helps somebody in need :-)
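One caveat with a one-to-one replacement: JSON-style escapes write characters above U+FFFF as two surrogate escapes (for example \ud83d\ude00), and turning each half into its own entity does not give a valid character reference. Below is a sketch of a variant that combines such pairs; the function name is hypothetical and the approach is only one way to do it.

```php
<?php
// Hypothetical variant that also understands surrogate pairs.
function json_escapes_to_entities(string $string): string
{
    // First alternative: high surrogate followed by low surrogate.
    // Second alternative: any other single \uXXXX escape.
    $pattern = '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})/i';

    return preg_replace_callback($pattern, function (array $m): string {
        if (($m[3] ?? '') !== '') {
            // Plain escape: reuse the four hex digits as they are.
            return '&#x' . strtoupper($m[3]) . ';';
        }
        // Surrogate pair: combine both halves into one code point.
        $high = hexdec($m[1]) - 0xD800;
        $low  = hexdec($m[2]) - 0xDC00;
        return sprintf('&#x%X;', 0x10000 + ($high << 10) + $low);
    }, $string);
}

echo json_escapes_to_entities('This is my test string \u03b50'), "\n";
// This is my test string &#x03B5;0
echo json_escapes_to_entities('\ud83d\ude00'), "\n";
// &#x1F600;
```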

Solution 4:

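See How to get the character from unicode code point in PHP? for some code that allows you to do the following. The functions called below (mb_chr, mb_ord, mb_htmlentities, mb_html_entity_decode, codepoint_encode, codepoint_decode) are the helpers defined in that answer.

Example use: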

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output:

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
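Part of this can be done with built-in functions alone. The sketch below assumes PHP 7.2 or later with mbstring, where mb_chr() and mb_ord() ship with the extension; note that these built-ins take the string encoding as their second argument, so they are not drop-in replacements for the helpers used above.

```php
<?php
// Assumes PHP >= 7.2 with mbstring (built-in mb_ord / mb_chr).
$char = "\xC4\x8F";   // "ď" (U+010F) as UTF-8 bytes

var_dump(mb_ord($char, 'UTF-8'));              // int(271)
var_dump(dechex(mb_ord($char, 'UTF-8')));      // string(3) "10f"
var_dump(mb_chr(0x010F, 'UTF-8') === $char);   // bool(true)

// Decimal and hexadecimal numeric entities are both decoded by html_entity_decode().
$decoded = html_entity_decode('tch&#xFC;&#xDF;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === "tch\xC3\xBC\xC3\x9F");  // bool(true)

// json_encode() escapes non-ASCII characters as \uXXXX by default.
echo json_encode($decoded), "\n";              // "tch\u00fc\u00df"
```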

Solution 5:

You can also use mb_encode_numericentity which is supported by PHP 4.0.6+ (link to PHP doc).

function unicode2html($value) {
    return mb_encode_numericentity($value, [
    //  start codepoint
    //  |       end codepoint
    //  |       |       offset
    //  |       |       |       mask
        0x0000, 0x001F, 0x0000, 0xFFFF,
        0x0021, 0x002C, 0x0000, 0xFFFF,
        0x002E, 0x002F, 0x0000, 0xFFFF,
        0x003C, 0x003C, 0x0000, 0xFFFF,
        0x003E, 0x003E, 0x0000, 0xFFFF,
        0x0060, 0x0060, 0x0000, 0xFFFF,
        0x0080, 0xFFFF, 0x0000, 0xFFFF
    ], 'UTF-8', true);
}

This approach also lets you choose which ranges of characters are converted into hexadecimal entities and which are left as plain characters.

Usage example:

$input = array(
    '"Meno più, PIÙ o meno"',
    '\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
    '<script>alert("XSS");</script>',
    '"`'
);

$output = array();
foreach ($input as $str) {
    $output[] = unicode2html($str);
}

Result:

$output = array(
    '&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
    '&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
    '&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
    '&#x22;&#x60;'
);
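Each row of the map is a group of four integers: start code point, end code point, offset, and mask. The last row of the map above ends at 0xFFFF, so characters beyond the Basic Multilingual Plane (emoji, for instance) fall outside every range and are left as they are. If the goal is only "convert everything non-ASCII", a single wider row may be enough. This is a sketch with a made-up function name, assuming the mbstring extension:

```php
<?php
// Hypothetical helper: one map row covering every non-ASCII code point.
// Row layout: start, end, offset, mask.
function non_ascii_to_hex_entities(string $value): string
{
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    // The fourth argument (true) requests hexadecimal instead of decimal entities.
    return mb_encode_numericentity($value, $map, 'UTF-8', true);
}

echo non_ascii_to_hex_entities("Fran\xC3\xA7ais"), "\n";   // Fran&#xE7;ais
```

Unlike unicode2html() above, this version leaves ASCII characters such as < and " untouched, so it is not a substitute for HTML escaping.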