{"id":538,"date":"2014-04-29T13:27:03","date_gmt":"2014-04-29T13:27:03","guid":{"rendered":"http:\/\/tomvanmaanen.nl\/?p=538"},"modified":"2014-04-29T13:27:03","modified_gmt":"2014-04-29T13:27:03","slug":"strange-characters","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=538","title":{"rendered":"Strange characters"},"content":{"rendered":"<p>In some cases, your database returns weird results such as Test\u00f9\ufffdSummary, even though Test\u00f9\u0119Summary was inserted. Apparently characters like \u0119 were not recognised and were subsequently replaced by \ufffd.<br \/>\nA likely reason is that the so-called codepage is wrong. Characters like \u0119 are not included in the common character set, so an extended character set (like Unicode) must be used.<br \/>\nFortunately, most DBMSs support Unicode. As an illustration, we take an example from Teradata. Look at this code:<\/p>\n<pre>\nCREATE SET  TABLE SAN_D_FAAPOC_01.TestUnicode ,NO FALLBACK ,\nNO BEFORE JOURNAL,\nNO AFTER JOURNAL,\nCHECKSUM = DEFAULT,\nDEFAULT MERGEBLOCKRATIO\n(\nIdent VARCHAR(255) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,\nSerial INTEGER,\nNode VARCHAR(64) CHARACTER SET UNICODE NOT CASESPECIFIC)\nPRIMARY INDEX (Serial);\n\ninsert into SAN_D_FAAPOC_01.TestUnicode(ident,node,serial)\nvalues('Test\u00f9\u0119Summary','Test\u00f9\u0119Summary',1235);\n\nSELECT\tIdent, Serial, Node\nFROM\tSAN_D_FAAPOC_01.TestUnicode;\n<\/pre>\n<p>The results are:<\/p>\n<pre>\n\tIdent\tSerial\tNode\n1\tTest\u00f9\ufffdSummary\t1,235\tTest\u00f9\u0119Summary\n\n<\/pre>\n<p>The LATIN column Ident has lost the \u0119, whereas the UNICODE column Node has kept it. The nice thing about Teradata is that columns can be defined as Unicode columns. Hence nothing extra needs to be done to store such Unicode characters.<br \/>\nA similar situation exists with MySQL. In that DBMS too, we may store data in columns that are defined as Unicode. 
As an example, one may use this code snippet:<\/p>\n<pre>\nCREATE TABLE t1\n(\n    col0 CHAR(10),\n    col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci\n);\n\ninsert into t1 values('Test\u00f9\u0119Summ','Test\u00f9\u0119Summ');\n<\/pre>\n<p>Here too, we have an illustration of the purpose of Unicode. It extends the standard ASCII character set to include the characters of all living languages. I understand that even Gothic and musical symbols are included in Unicode. ASCII is a subset of Unicode; on top of that, Unicode includes characters that are not in the ASCII set.<br \/>\nI understand that Unicode can be encoded in different ways. One such encoding is UTF-8. It uses one byte to store the ASCII characters such as &#8216;A&#8217;, &#8216;B&#8217;, &#8216;1&#8217; etc. For other characters, more bytes are used. An example is the recently introduced &#8220;\u20ac&#8221;, which takes 3 bytes. Other characters use 2 or 4 bytes.<br \/>\nOn average a western text is stored quite efficiently in UTF-8. As most characters use only 1 byte, we end up with a file size (in terms of bytes) that is only slightly larger than the number of characters.<br \/>\nAnother encoding is UTF-16, which uses 2 bytes for most characters and 4 bytes for the remaining ones. In that case, the file size, in terms of bytes, is roughly double the number of characters. A western text written in UTF-16 is then about twice as big as it would have been in UTF-8.<br \/>\nAs an example, I include two texts, one written in ASCII and one in UTF-8:<br \/>\n<a href=\"http:\/\/tomvanmaanen.nl\/wp-content\/uploads\/2014\/04\/utf.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/tomvanmaanen.nl\/wp-content\/uploads\/2014\/04\/utf-278x300.jpg\" alt=\"utf\" width=\"278\" height=\"300\" class=\"alignnone size-medium wp-image-548\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In some cases, you get unexpected weird results being returned from your database like: Test\u00f9\ufffdSummary. 
This is unexpected, as Test\u00f9\u0119Summary was inserted. Apparently characters like \u0119 were not recognised and were subsequently replaced by \ufffd. A likely reason is that the so-called codepage is wrong. Characters like \u0119 are not included in the common [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-538","post","type-post","status-publish","format-standard","hentry","category-nice-to-know"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/538","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=538"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/538\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=538"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=538"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=538"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}